Sites like Drudge Report use <table> for page layout, not data. Each cell
contains extensive block-level content (divs, hrs, paragraphs, links).
Previously, table_to_md() called inline_text() on every cell, collapsing
all whitespace and flattening block elements into a single unreadable line.
Changes:
- Add cell_has_block_content() heuristic: scans for block-level descendants
(p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables
- Layout tables render each cell as a standalone section separated by blank
lines, using children_to_md() to preserve block structure
- Data tables (no block elements in cells) keep existing markdown table format
- Bold/italic tags containing block elements are treated as containers
instead of being wrapped in **...**/*...* markers (fixes Drudge's
<b><font>...</font></b> column wrappers that contain the entire column content)
- Add tests for layout tables with paragraphs and with links
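The heuristic can be sketched as a scan over descendant tag names. This is a minimal sketch, not the crate's actual code: the iterator parameter stands in for a real DOM walk over the cell, and the tag list is illustrative.

```rust
// Block-level tags that mark a cell as "layout", not tabular data.
const BLOCK_TAGS: &[&str] = &[
    "p", "div", "hr", "ul", "ol", "li", "blockquote", "pre",
    "h1", "h2", "h3", "h4", "h5", "h6", "table",
];

/// Hypothetical stand-in for the real DOM walk: the caller supplies the
/// tag names of every descendant element of the cell.
fn cell_has_block_content<'a>(descendant_tags: impl IntoIterator<Item = &'a str>) -> bool {
    descendant_tags
        .into_iter()
        .any(|t| BLOCK_TAGS.contains(&t.to_ascii_lowercase().as_str()))
}
```

A cell containing only `<b>`, `<a>`, and `<span>` descendants stays on the data-table path; any `<div>` or `<p>` flips the whole table to layout rendering.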
Research results saved to ~/.webclaw/research/ (report.md + full.json).
MCP returns file paths + findings instead of the full report, preventing
"exceeds maximum allowed tokens" errors in Claude/Cursor.
A repeated query returns the cached result instantly without spending credits.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- --research "query": deep research via cloud API, saves JSON file with
report + sources + findings, prints report to stdout
- --deep: longer, more thorough research mode
- MCP extract/summarize: cloud fallback when no local LLM available
- MCP research: returns structured JSON instead of raw text
- Bump to v0.3.7
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
__NEXT_DATA__, SvelteKit, and JSON-LD now appear as a
## Structured Data section in -f markdown and -f llm output.
Works with --only-main-content and all extraction flags.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous approach used mislav/bump-homebrew-formula-action which only
updated macOS arm64 SHA. Now downloads all 4 tarballs after Docker
finishes, computes SHAs, and writes the complete formula.
Fixes #12 (brew install checksum mismatch on Linux)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">.
Now extracted as structured JSON (pageProps) in the structured_data field.
Tested on 45 sites — 13 return rich structured data including prices,
product info, and page state not visible in the DOM.
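The extraction step can be sketched as a string scan for the script payload. A sketch only, with a hypothetical function name: the real code presumably walks the parsed DOM, and this version assumes the `id` attribute appears before the tag's closing `>`.

```rust
/// Pull the JSON payload out of <script id="__NEXT_DATA__" ...>...</script>.
/// Naive string scan; returns None when the marker or closing tag is missing.
fn next_data_json(html: &str) -> Option<&str> {
    let id_pos = html.find(r#"id="__NEXT_DATA__""#)?;
    let open_end = html[id_pos..].find('>')? + id_pos + 1;
    let close = html[open_end..].find("</script>")? + open_end;
    Some(html[open_end..close].trim())
}
```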
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract structured JSON from SvelteKit kit.start() data arrays
- Convert JS object literals (unquoted keys) to valid JSON
- Data appears in structured_data field (machine-readable)
- License changed from MIT to AGPL-3.0
- Bump to v0.3.4
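The unquoted-key conversion can be sketched as a single pass that quotes bare identifiers appearing after `{` or `,`. A hypothetical helper, not the crate's actual implementation: it assumes ASCII input and ignores single-quoted strings, comments, and other JS-isms.

```rust
/// Quote bare identifier keys so a JS object literal such as
/// {foo: 1, bar: "x"} becomes valid JSON.
fn quote_bare_keys(src: &str) -> String {
    let bytes = src.as_bytes();
    let mut out = String::with_capacity(src.len() + 8);
    let mut in_str = false;
    let mut i = 0;
    while i < bytes.len() {
        let c = bytes[i] as char;
        if in_str {
            out.push(c);
            if c == '\\' && i + 1 < bytes.len() {
                out.push(bytes[i + 1] as char); // keep escaped char verbatim
                i += 1;
            } else if c == '"' {
                in_str = false;
            }
        } else if c == '"' {
            in_str = true;
            out.push(c);
        } else if c == '{' || c == ',' {
            out.push(c);
            // Look ahead: whitespace, then a bare identifier followed by ':'.
            let mut j = i + 1;
            while j < bytes.len() && (bytes[j] as char).is_whitespace() {
                j += 1;
            }
            let start = j;
            while j < bytes.len()
                && (bytes[j].is_ascii_alphanumeric() || bytes[j] == b'_' || bytes[j] == b'$')
            {
                j += 1;
            }
            if j > start && j < bytes.len() && bytes[j] == b':' {
                out.push_str(&src[i + 1..start]); // preserve whitespace
                out.push('"');
                out.push_str(&src[start..j]);
                out.push('"');
                i = j - 1; // next iteration emits the ':'
            }
        } else {
            out.push(c);
        }
        i += 1;
    }
    out
}
```

Already-quoted keys pass through untouched, so the function is safe to run on mixed literal/JSON input.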
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
boring-sys2 builds BoringSSL from C source via cmake. For aarch64 cross-
compilation, we need g++, cmake, and CC/CXX env vars pointing to the
cross-compiler. Also removed stale reqwest_unstable RUSTFLAG.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The --cfg reqwest_unstable flag was required by the old patched reqwest.
wreq handles everything internally — no special build flags needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wreq uses BoringSSL (via boring-sys2) which needs cmake and clang
at build time. Removed stale reference to Impit's patched rustls.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.
This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.
84% pass rate across 1000 real sites. 384 unit tests green.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- --cookie-file reads Chrome extension format ([{name, value, domain, ...}])
- Works with EditThisCookie, Cookie-Editor, and similar browser extensions
- Merges with --cookie when both provided
- MCP scrape tool now accepts cookies parameter
- Closes #7
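The merge can be sketched as last-writer-wins over (name, value) pairs, with --cookie-file entries first and --cookie entries last so the flag overrides the file. The helper name is hypothetical, and a real version would also honor the domain/path fields in the extension JSON.

```rust
use std::collections::BTreeMap;

/// Hypothetical merge: later pairs override earlier ones by name.
/// BTreeMap sorts names alphabetically, which is usually fine for a
/// Cookie header value.
fn cookie_header(pairs: &[(&str, &str)]) -> String {
    let mut merged = BTreeMap::new();
    for (name, value) in pairs {
        merged.insert(*name, *value);
    }
    merged
        .iter()
        .map(|(n, v)| format!("{n}={v}"))
        .collect::<Vec<_>>()
        .join("; ")
}
```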
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a fetch returns a challenge page (small HTML with Akamai markers),
automatically visit the homepage first to collect _abck/bm_sz cookies,
then retry the original URL. This bypasses Akamai's cookie-based gate
on subpages without needing JS execution.
Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr
sensor marker on responses under 15KB.
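The detection rule above reads as a simple predicate. A sketch with a hypothetical name, checking body size against the 15KB threshold and either marker:

```rust
/// True when a response looks like Akamai's cookie gate: a small body
/// carrying either the challenge <title> or the sensor marker.
fn looks_like_akamai_challenge(body: &str) -> bool {
    body.len() < 15 * 1024
        && (body.contains("<title>Challenge Page</title>")
            || body.contains("bazadebezolkohpepadr"))
}
```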
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Response.headers() now returns &http::HeaderMap instead of
&HashMap<String, String>. Updated FetchResult, is_pdf_content_type,
is_document_content_type, is_bot_protected, and all related tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Core has reqwest 0.12 (direct) and 0.13 (via webclaw-tls patch).
Disambiguate with version specs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the weekly primp compatibility check (which fails since primp
was removed in v0.3.0) with an automated dependency sync workflow.
Triggered by webclaw-tls pushes via repository_dispatch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Plain docker build --platform linux/arm64 on amd64 runner needs QEMU
to execute RUN commands. QEMU is only needed for apt-get (seconds),
not for Rust compilation (the binaries are pre-built).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
buildx creates manifest lists per-platform which can't be nested.
Use plain docker build for each arch then docker manifest create
to combine them. Single job, no matrix, no QEMU.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tarball extracts to webclaw-vX.Y.Z-target/ directory, not flat.
Use direct cp instead of find.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
QEMU arm64 Rust builds took 60+ min and timed out in CI. Now the Docker
job downloads the pre-built release binaries and packages them directly.
- Dockerfile.ci: slim image for CI (downloads pre-built binaries)
- Dockerfile: full source build for local dev (unchanged build stage)
- Both use ubuntu:24.04 (GLIBC 2.39 matches CI build environment)
- Multi-arch manifest combines amd64 + arm64 images
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Image only had linux/amd64, failing on Apple Silicon Macs with
"no matching manifest for linux/arm64/v8". Added QEMU + buildx
multi-platform support.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Runs every Monday — updates primp to latest, tries to build.
If patches are out of sync the build fails with a clear error
pointing to primp's Cargo.toml for the new patch list.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
primp 1.2.0 moved to reqwest 0.13 and now patches reqwest itself
(primp-reqwest). Without this patch, cargo install gets vanilla
reqwest 0.13 which is missing the HTTP/2 impersonation methods.
Users should use: cargo install --locked --git ...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the webclaw skill definition for Claude Code / Smithery.
Located at skill/SKILL.md with proper frontmatter.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Docker image auto-built on every release via CI
- QuickJS sandbox executes inline <script> tags to extract JS-embedded
content (window.__PRELOADED_STATE__, self.__next_f, etc.)
- Bumped version to 0.2.1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pushes ghcr.io/0xmassi/webclaw:latest and :vX.Y.Z on every tagged
release. Uses BuildKit cache for fast rebuilds.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
webclaw-core has QuickJS behind a feature flag for extracting data from
inline <script> tags (window.__PRELOADED_STATE__, self.__next_f, etc).
The server was using an old lockfile without the feature enabled.
Updated deps to v0.2.0 and explicitly enabled quickjs. This improves
extraction on SPAs like NYTimes, Nike, and Bloomberg where content is
embedded in JS variable assignments rather than visible DOM.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document extraction:
- DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml)
- XLSX/XLS: markdown tables with multi-sheet support (via calamine)
- CSV: quoted field handling, markdown table output
- All auto-detected by Content-Type header or URL extension
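The auto-detection order can be sketched as Content-Type first, then URL extension. Enum and function names here are hypothetical, and the MIME substrings are the usual Office/CSV types, not necessarily the exhaustive list the crate matches:

```rust
#[derive(Debug, PartialEq)]
enum DocKind {
    Docx,
    Xlsx,
    Csv,
    Html,
}

/// Header wins over extension; anything unrecognized falls back to HTML.
fn detect_doc_kind(content_type: Option<&str>, url: &str) -> DocKind {
    if let Some(ct) = content_type {
        let ct = ct.to_ascii_lowercase();
        if ct.contains("wordprocessingml") {
            return DocKind::Docx;
        }
        if ct.contains("spreadsheetml") || ct.contains("vnd.ms-excel") {
            return DocKind::Xlsx;
        }
        if ct.contains("text/csv") {
            return DocKind::Csv;
        }
    }
    // Strip query/fragment before looking at the extension.
    let path = url.split(|c| c == '?' || c == '#').next().unwrap_or(url);
    match path.rsplit('.').next() {
        Some("docx") => DocKind::Docx,
        Some("xlsx") | Some("xls") => DocKind::Xlsx,
        Some("csv") => DocKind::Csv,
        _ => DocKind::Html,
    }
}
```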
New features:
- -f html output format (sanitized HTML)
- Multi-URL watch: --urls-file + --watch monitors all URLs in parallel
- Batch + LLM: --extract-prompt/--extract-json works with multiple URLs
- Mixed batch: HTML pages + DOCX + XLSX + CSV in one command
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
--only-main-content, --include, and --exclude were ignored in batch
mode because run_batch used default ExtractionOptions. Added
fetch_and_extract_batch_with_options to pass CLI options through.
Closes #3
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Watch mode:
- --watch polls a URL at --watch-interval (default 5min)
- Reports diffs to stdout when content changes
- --on-change runs a command with diff JSON on stdin
- Ctrl+C stops cleanly
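The change check inside the poll loop can be sketched as fingerprint comparison. A sketch under assumptions: the real loop fetches and extracts the page each interval and produces a proper diff, while this only decides whether anything changed. DefaultHasher is fine because the fingerprint never leaves the process.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fingerprint the extracted content; an unchanged page hashes equal.
fn fingerprint(content: &str) -> u64 {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

/// Returns (changed, new_fingerprint). The first poll never reports a
/// change because there is nothing to diff against.
fn poll_step(prev: Option<u64>, content: &str) -> (bool, u64) {
    let fp = fingerprint(content);
    (prev.map_or(false, |p| p != fp), fp)
}
```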
Webhooks:
- --webhook POSTs JSON on crawl/batch complete and watch changes
- Auto-detects Discord and Slack URLs, formats as embeds/blocks
- Also available via WEBCLAW_WEBHOOK_URL env var
- Non-blocking, errors logged to stderr
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>