Commit graph

77 commits

Author SHA1 Message Date
Valerio
78e198a347 fix: use ENTRYPOINT instead of CMD in Dockerfiles for proper arg passthrough
Some checks failed
CI / Test (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Docs (push) Has been cancelled
Docker CMD gets overridden by any args, while ENTRYPOINT receives them.
This fixes `docker run webclaw <url>` silently ignoring the URL argument.

Bump to 0.3.13.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:24:26 +02:00
Valerio
050b2ef463 feat: add allow_subdomains and allow_external_links to CrawlConfig
Crawls are same-origin by default. Enable allow_subdomains to follow
sibling/child subdomains (blog.example.com from example.com), or
allow_external_links for full cross-origin crawling.

Root domain extraction uses a heuristic that handles two-part TLDs
(co.uk, com.au). Includes 5 unit tests for root_domain().

Bump to 0.3.12.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:33:06 +02:00
Valerio
a4c351d5ae feat: add fallback sitemap paths for broader discovery
Try /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml
after the standard /sitemap.xml. WordPress 5.5+ and many CMS platforms
use non-standard paths that were previously missed. Paths found via
robots.txt are deduplicated to avoid double-fetching.

Bump to 0.3.11.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 18:22:57 +02:00
Valerio
25b6282d5f style: fix rustfmt for 2-element delay array
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 17:21:53 +02:00
Valerio
954aabe3e8 perf: reduce fetch timeout to 12s and retries to 2
Stress testing showed 33% of proxies are dead, causing 30s+ timeouts
per request with 3 retries (worst case 94s). Reducing timeout from 30s
to 12s and retries from 3 to 2 brings worst case to 25s. Combined with
disabling 509 dead proxies from the pool, this should significantly
improve response times under load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 17:18:57 +02:00
Valerio
5ea646a332 fix: resolve clippy warnings from #14 (collapsible_if, manual_inspect)
CI runs Rust 1.94 which flags these. Collapsed nested if-let in
cell_has_block_content() and replaced .map()+return with .inspect()
in table_to_md().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 15:28:59 +02:00
Valerio
3cf9dbaf2a chore: bump to 0.3.9, fix formatting from #14
Version bump for layout table, stack overflow, and noise filter fixes
contributed by @devnen. Also fixes cargo fmt issues that caused CI lint
failure on the merge commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 15:24:17 +02:00
Valerio
87ecf4241f
fix: layout tables, stack overflow, and noise filter (#14)
fix: layout tables rendered as sections instead of markdown tables
2026-04-04 15:20:08 +02:00
devnen
70c67f2ed6 fix: prevent noise filter from swallowing content in malformed HTML
Two related fixes for content being stripped by the noise filter:

1. Remove <form> from unconditional noise tags. ASP.NET and similar
   frameworks wrap entire pages in a <form> tag — these are not input
   forms. Forms with >500 chars of text are now treated as content
   wrappers, not noise.

2. Add safety valve for class/ID noise matching. When malformed HTML
   leaves a noise container unclosed (e.g., <div class="header"> missing
   its </div>), the HTML5 parser makes all subsequent siblings into
   children of that container. A header/nav/footer with >5000 chars of
   text is almost certainly a broken wrapper absorbing real content —
   exempt it from noise filtering.
2026-04-04 01:38:42 +02:00
devnen
74bac87435 fix: prevent stack overflow on deeply nested HTML pages
Pages like Express.co.uk live blogs nest 200+ DOM levels deep, overflowing
the default 1 MB main-thread stack on Windows during recursive markdown
conversion.

Two-layer fix:

1. markdown.rs: add depth parameter to node_to_md/children_to_md/inline_text
   with MAX_DOM_DEPTH=200 guard — falls back to plain text collection at limit

2. lib.rs: wrap extract_with_options in a worker thread with 8 MB stack so
   html5ever parsing and extraction both have room on deeply nested pages

Tested with Express.co.uk live blog (previously crashed, now extracts 2000+
lines of clean markdown) and drudgereport.com (still works correctly).
2026-04-03 23:45:19 +02:00
devnen
95a6681b02 fix: detect layout tables and render as sections instead of markdown tables
Sites like Drudge Report use <table> for page layout, not data. Each cell
contains extensive block-level content (divs, hrs, paragraphs, links).

Previously, table_to_md() called inline_text() on every cell, collapsing
all whitespace and flattening block elements into a single unreadable line.

Changes:
- Add cell_has_block_content() heuristic: scans for block-level descendants
  (p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables
- Layout tables render each cell as a standalone section separated by blank
  lines, using children_to_md() to preserve block structure
- Data tables (no block elements in cells) keep existing markdown table format
- Bold/italic tags containing block elements are treated as containers
  instead of wrapping in **/**/* (fixes Drudge's <b><font>...</font></b>
  column wrappers that contain the entire column content)
- Add tests for layout tables with paragraphs and with links
2026-04-03 22:24:35 +02:00
Valerio
1d2018c98e fix: MCP research saves to file, returns compact response
Research results saved to ~/.webclaw/research/ (report.md + full.json).
MCP returns file paths + findings instead of the full report, preventing
"exceeds maximum allowed tokens" errors in Claude/Cursor.

Same query returns cached result instantly without spending credits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 16:05:45 +02:00
Valerio
f7cc0cc5cf feat: CLI --research flag + MCP cloud fallback + structured research output
- --research "query": deep research via cloud API, saves JSON file with
  report + sources + findings, prints report to stdout
- --deep: longer, more thorough research mode
- MCP extract/summarize: cloud fallback when no local LLM available
- MCP research: returns structured JSON instead of raw text
- Bump to v0.3.7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 14:04:04 +02:00
Valerio
344eea74d9 feat: structured data in markdown/LLM output + v0.3.6
__NEXT_DATA__, SvelteKit, and JSON-LD now appear as a
## Structured Data section in -f markdown and -f llm output.
Works with --only-main-content and all extraction flags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:16:56 +02:00
Valerio
b219fc3648 fix(ci): update all 4 Homebrew checksums after Docker build completes
Previous approach used mislav/bump-homebrew-formula-action which only
updated macOS arm64 SHA. Now downloads all 4 tarballs after Docker
finishes, computes SHAs, and writes the complete formula.

Fixes #12 (brew install checksum mismatch on Linux)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:02:27 +02:00
Valerio
8d29382b25 feat: extract __NEXT_DATA__ into structured_data
Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">.
Now extracted as structured JSON (pageProps) in the structured_data field.

Tested on 45 sites — 13 return rich structured data including prices,
product info, and page state not visible in the DOM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:04:51 +02:00
Valerio
4e81c3430d docs: update npm package license to AGPL-3.0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 11:33:43 +02:00
Valerio
c43da982c3 docs: update README license references from MIT to AGPL-3.0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 11:28:40 +02:00
Valerio
84b2e6092e feat: SvelteKit data extraction + license change to AGPL-3.0
- Extract structured JSON from SvelteKit kit.start() data arrays
- Convert JS object literals (unquoted keys) to valid JSON
- Data appears in structured_data field (machine-readable)
- License changed from MIT to AGPL-3.0
- Bump to v0.3.4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 20:37:56 +02:00
Valerio
b4800e681c ci: fix aarch64 cross-compilation for BoringSSL (boring-sys2)
boring-sys2 builds BoringSSL from C source via cmake. For aarch64 cross-
compilation, we need g++, cmake, and CC/CXX env vars pointing to the
cross-compiler. Also removed stale reqwest_unstable RUSTFLAG.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:39:43 +02:00
Valerio
a1b9a55048 chore: add SKILL.md to repo root for skills.sh discoverability
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:27:17 +02:00
Valerio
124352e0b4 style: cargo fmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:25:40 +02:00
Valerio
1a5d3d8aaf chore: remove reqwest_unstable rustflag (no longer needed)
The --cfg reqwest_unstable flag was required by the old patched reqwest.
wreq handles everything internally — no special build flags needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:15:05 +02:00
Valerio
11b8f68f51 fix: update Dockerfile for BoringSSL build deps (cmake, clang)
wreq uses BoringSSL (via boring-sys2) which needs cmake and clang
at build time. Removed stale reference to Impit's patched rustls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:13:18 +02:00
Valerio
aaf51eddef feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3
Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.

This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.

84% pass rate across 1000 real sites. 384 unit tests green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:04:55 +02:00
Valerio
0d0da265ab chore: bump to v0.3.2, update changelog
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 10:56:51 +02:00
Valerio
da1d76c97a feat: add --cookie-file support for JSON cookie files
- --cookie-file reads Chrome extension format ([{name, value, domain, ...}])
- Works with EditThisCookie, Cookie-Editor, and similar browser extensions
- Merges with --cookie when both provided
- MCP scrape tool now accepts cookies parameter
- Closes #7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 10:54:53 +02:00
Valerio
44f23332cc style: collapse nested if per clippy
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:13:55 +02:00
Valerio
20c810b8d2 chore: bump v0.3.1, update CHANGELOG, fix fmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:11:54 +02:00
Valerio
7041a1d992 feat: cookie warmup fallback for Akamai-protected pages
When a fetch returns a challenge page (small HTML with Akamai markers),
automatically visit the homepage first to collect _abck/bm_sz cookies,
then retry the original URL. This bypasses Akamai's cookie-based gate
on subpages without needing JS execution.

Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr
sensor marker on responses under 15KB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:09:31 +02:00
github-actions[bot]
75e0a9cdef chore: update webclaw-tls dependencies 2026-03-30 12:03:06 +00:00
github-actions[bot]
b784a3fa1b chore: update webclaw-tls dependencies 2026-03-30 11:48:44 +00:00
Valerio
4cba36337b style: fix fmt in client.rs test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 12:18:57 +02:00
Valerio
199dab6dfa fix: adapt to webclaw-tls v0.1.1 HeaderMap API change
Response.headers() now returns &http::HeaderMap instead of
&HashMap<String, String>. Updated FetchResult, is_pdf_content_type,
is_document_content_type, is_bot_protected, and all related tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 12:09:50 +02:00
github-actions[bot]
68b9406ff5 chore: update webclaw-tls dependencies 2026-03-30 09:53:03 +00:00
Valerio
31f35fd895 ci: fix ambiguous reqwest version in dependency sync
Core has reqwest 0.12 (direct) and 0.13 (via webclaw-tls patch).
Disambiguate with version specs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 11:52:35 +02:00
Valerio
4f0c59ac7f ci: replace stale primp check with webclaw-tls dependency sync
Replaces the weekly primp compatibility check (which fails since primp
was removed in v0.3.0) with an automated dependency sync workflow.
Triggered by webclaw-tls pushes via repository_dispatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 11:39:55 +02:00
Valerio
ee3c714aa9 docs: update CONTRIBUTING.md for v0.3.0 architecture
- Replace Impit/primp references with webclaw-tls
- Add architecture diagram showing crate layout + TLS repo
- Update crate boundaries table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 10:17:26 +02:00
Valerio
e3b0d0bd74 fix: make reddit and linkedin modules public for server access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:54:35 +02:00
Valerio
7051d2193b docs: add v0.3.0 changelog entry
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:47:04 +02:00
Valerio
f275a93bec fix: clippy empty-line-after-doc-comment in browser.rs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:45:05 +02:00
Valerio
140234c139 style: cargo fmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:43:11 +02:00
Valerio
f13cb83c73 feat: replace primp with webclaw-tls, bump to v0.3.0
Replace primp dependency with our own TLS fingerprinting stack
(webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match.

- Remove primp entirely (zero references remaining)
- webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls
- Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains)
- Skip unknown certificate extensions (SCT tolerance)
- 99% bypass rate on 102 sites (was ~85% with primp)
- Fixes #5 (HTTPS broken — example.com and similar sites now work)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:40:10 +02:00
Valerio
77e93441c0 fix(ci): add QEMU for arm64 apt-get in Docker build
Plain docker build --platform linux/arm64 on amd64 runner needs QEMU
to execute RUN commands. QEMU is only needed for apt-get (seconds),
not for Rust compilation (the binaries are pre-built).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:51:51 +01:00
Valerio
8cf021a00b fix(ci): single Docker job with plain docker build + manifest
buildx creates manifest lists per-platform which can't be nested.
Use plain docker build for each arch then docker manifest create
to combine them. Single job, no matrix, no QEMU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:45:05 +01:00
Valerio
78810793cf chore: align Cargo.toml version with v0.2.3 tag 2026-03-27 20:41:02 +01:00
Valerio
ef120f6ec7 fix(ci): fix Docker binary path extraction from release tarball
Tarball extracts to webclaw-vX.Y.Z-target/ directory, not flat.
Use direct cp instead of find.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:39:14 +01:00
Valerio
48a3c45b36 ci: use pre-built binaries for Docker instead of QEMU cross-compilation
QEMU arm64 Rust builds took 60+ min and timed out in CI. Now the Docker
job downloads the pre-built release binaries and packages them directly.

- Dockerfile.ci: slim image for CI (downloads pre-built binaries)
- Dockerfile: full source build for local dev (unchanged build stage)
- Both use ubuntu:24.04 (GLIBC 2.39 matches CI build environment)
- Multi-arch manifest combines amd64 + arm64 images

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:32:50 +01:00
Valerio
dfcddd1973 ci: build multi-platform Docker images (amd64 + arm64)
Image only had linux/amd64, failing on Apple Silicon Macs with
"no matching manifest for linux/arm64/v8". Added QEMU + buildx
multi-platform support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 19:18:49 +01:00
Valerio
341f4737e1 test: v0.2.2 pre-release check 2026-03-27 18:48:15 +01:00