Version bump for layout table, stack overflow, and noise filter fixes
contributed by @devnen. Also fixes cargo fmt issues that caused CI lint
failure on the merge commit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two related fixes for content being stripped by the noise filter:
1. Remove <form> from unconditional noise tags. ASP.NET and similar
frameworks wrap entire pages in a <form> tag — these are not input
forms. Forms with >500 chars of text are now treated as content
wrappers, not noise.
2. Add safety valve for class/ID noise matching. When malformed HTML
leaves a noise container unclosed (e.g., <div class="header"> missing
its </div>), the HTML5 parser makes all subsequent siblings into
children of that container. A header/nav/footer with >5000 chars of
text is almost certainly a broken wrapper absorbing real content —
exempt it from noise filtering.
Pages like Express.co.uk live blogs nest 200+ DOM levels deep, overflowing
the default 1 MB main-thread stack on Windows during recursive markdown
conversion.
Two-layer fix:
1. markdown.rs: add depth parameter to node_to_md/children_to_md/inline_text
with MAX_DOM_DEPTH=200 guard — falls back to plain text collection at limit
2. lib.rs: wrap extract_with_options in a worker thread with 8 MB stack so
html5ever parsing and extraction both have room on deeply nested pages
Tested with Express.co.uk live blog (previously crashed, now extracts 2000+
lines of clean markdown) and drudgereport.com (still works correctly).
Sites like Drudge Report use <table> for page layout, not data. Each cell
contains extensive block-level content (divs, hrs, paragraphs, links).
Previously, table_to_md() called inline_text() on every cell, collapsing
all whitespace and flattening block elements into a single unreadable line.
Changes:
- Add cell_has_block_content() heuristic: scans for block-level descendants
(p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables
- Layout tables render each cell as a standalone section separated by blank
lines, using children_to_md() to preserve block structure
- Data tables (no block elements in cells) keep existing markdown table format
- Bold/italic tags containing block elements are treated as containers
instead of wrapping in **/**/* (fixes Drudge's <b><font>...</font></b>
column wrappers that contain the entire column content)
- Add tests for layout tables with paragraphs and with links
Research results saved to ~/.webclaw/research/ (report.md + full.json).
MCP returns file paths + findings instead of the full report, preventing
"exceeds maximum allowed tokens" errors in Claude/Cursor.
Same query returns cached result instantly without spending credits.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- --research "query": deep research via cloud API, saves JSON file with
report + sources + findings, prints report to stdout
- --deep: longer, more thorough research mode
- MCP extract/summarize: cloud fallback when no local LLM available
- MCP research: returns structured JSON instead of raw text
- Bump to v0.3.7
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
__NEXT_DATA__, SvelteKit, and JSON-LD now appear as a
## Structured Data section in -f markdown and -f llm output.
Works with --only-main-content and all extraction flags.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous approach used mislav/bump-homebrew-formula-action which only
updated macOS arm64 SHA. Now downloads all 4 tarballs after Docker
finishes, computes SHAs, and writes the complete formula.
Fixes#12 (brew install checksum mismatch on Linux)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">.
Now extracted as structured JSON (pageProps) in the structured_data field.
Tested on 45 sites — 13 return rich structured data including prices,
product info, and page state not visible in the DOM.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract structured JSON from SvelteKit kit.start() data arrays
- Convert JS object literals (unquoted keys) to valid JSON
- Data appears in structured_data field (machine-readable)
- License changed from MIT to AGPL-3.0
- Bump to v0.3.4
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
boring-sys2 builds BoringSSL from C source via cmake. For aarch64 cross-
compilation, we need g++, cmake, and CC/CXX env vars pointing to the
cross-compiler. Also removed stale reqwest_unstable RUSTFLAG.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The --cfg reqwest_unstable flag was required by the old patched reqwest.
wreq handles everything internally — no special build flags needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wreq uses BoringSSL (via boring-sys2) which needs cmake and clang
at build time. Removed stale reference to Impit's patched rustls.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.
This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.
84% pass rate across 1000 real sites. 384 unit tests green.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- --cookie-file reads Chrome extension format ([{name, value, domain, ...}])
- Works with EditThisCookie, Cookie-Editor, and similar browser extensions
- Merges with --cookie when both provided
- MCP scrape tool now accepts cookies parameter
- Closes#7
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a fetch returns a challenge page (small HTML with Akamai markers),
automatically visit the homepage first to collect _abck/bm_sz cookies,
then retry the original URL. This bypasses Akamai's cookie-based gate
on subpages without needing JS execution.
Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr
sensor marker on responses under 15KB.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Response.headers() now returns &http::HeaderMap instead of
&HashMap<String, String>. Updated FetchResult, is_pdf_content_type,
is_document_content_type, is_bot_protected, and all related tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Core has reqwest 0.12 (direct) and 0.13 (via webclaw-tls patch).
Disambiguate with version specs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the weekly primp compatibility check (which fails since primp
was removed in v0.3.0) with an automated dependency sync workflow.
Triggered by webclaw-tls pushes via repository_dispatch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Plain docker build --platform linux/arm64 on amd64 runner needs QEMU
to execute RUN commands. QEMU is only needed for apt-get (seconds),
not for Rust compilation (the binaries are pre-built).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
buildx creates manifest lists per-platform which can't be nested.
Use plain docker build for each arch then docker manifest create
to combine them. Single job, no matrix, no QEMU.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tarball extracts to webclaw-vX.Y.Z-target/ directory, not flat.
Use direct cp instead of find.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
QEMU arm64 Rust builds took 60+ min and timed out in CI. Now the Docker
job downloads the pre-built release binaries and packages them directly.
- Dockerfile.ci: slim image for CI (downloads pre-built binaries)
- Dockerfile: full source build for local dev (unchanged build stage)
- Both use ubuntu:24.04 (GLIBC 2.39 matches CI build environment)
- Multi-arch manifest combines amd64 + arm64 images
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Image only had linux/amd64, failing on Apple Silicon Macs with
"no matching manifest for linux/arm64/v8". Added QEMU + buildx
multi-platform support.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Runs every Monday — updates primp to latest, tries to build.
If patches are out of sync the build fails with a clear error
pointing to primp's Cargo.toml for the new patch list.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
primp 1.2.0 moved to reqwest 0.13 and now patches reqwest itself
(primp-reqwest). Without this patch, cargo install gets vanilla
reqwest 0.13 which is missing the HTTP/2 impersonation methods.
Users should use: cargo install --locked --git ...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the webclaw skill definition for Claude Code / Smithery.
Located at skill/SKILL.md with proper frontmatter.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Docker image auto-built on every release via CI
- QuickJS sandbox executes inline <script> tags to extract JS-embedded
content (window.__PRELOADED_STATE__, self.__next_f, etc.)
- Bumped version to 0.2.1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pushes ghcr.io/0xmassi/webclaw:latest and :vX.Y.Z on every tagged
release. Uses BuildKit cache for fast rebuilds.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
webclaw-core has QuickJS behind a feature flag for extracting data from
inline <script> tags (window.__PRELOADED_STATE__, self.__next_f, etc).
The server was using an old lockfile without the feature enabled.
Updated deps to v0.2.0 and explicitly enabled quickjs. This improves
extraction on SPAs like NYTimes, Nike, and Bloomberg where content is
embedded in JS variable assignments rather than visible DOM.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>