Mirror of https://github.com/0xMassi/webclaw.git, synced 2026-04-25 00:06:21 +02:00
Changelog
All notable changes to webclaw are documented here. Format follows Keep a Changelog.
[0.3.12] — 2026-04-10
Added
- Crawl scope control: new `allow_subdomains` and `allow_external_links` fields on `CrawlConfig`. By default crawls stay same-origin. Enable `allow_subdomains` to follow sibling/child subdomains (e.g. blog.example.com from example.com), or `allow_external_links` for full cross-origin crawling. Root domain extraction uses a heuristic that handles two-part TLDs (co.uk, com.au).
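The two-part-TLD heuristic can be sketched in a few lines. This is an illustrative version only, not webclaw's actual `root_domain()`: the suffix list and edge-case handling in the crate may differ.

```rust
/// Heuristic root-domain extraction that handles two-part TLDs like co.uk.
/// The suffix list here is a small illustrative sample, not exhaustive.
fn root_domain(host: &str) -> String {
    const TWO_PART_TLDS: &[&str] = &["co.uk", "com.au", "co.jp", "org.uk", "com.br"];
    let parts: Vec<&str> = host.split('.').collect();
    if parts.len() >= 3 {
        // If the last two labels form a known two-part TLD, keep three labels.
        let last_two = parts[parts.len() - 2..].join(".");
        if TWO_PART_TLDS.contains(&last_two.as_str()) {
            return parts[parts.len() - 3..].join(".");
        }
    }
    if parts.len() >= 2 {
        // Default case: registrable domain is the last two labels.
        parts[parts.len() - 2..].join(".")
    } else {
        host.to_string()
    }
}
```

With this, `blog.example.co.uk` and `example.co.uk` compare equal at the root-domain level, which is what `allow_subdomains` needs.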
[0.3.11] — 2026-04-10
Added
- Sitemap fallback paths: discovery now tries `/sitemap_index.xml`, `/wp-sitemap.xml`, and `/sitemap/sitemap-index.xml` in addition to the standard `/sitemap.xml`. Sites using WordPress or non-standard sitemap locations are now discovered without needing external search.
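The fallback logic amounts to probing a fixed candidate list in order. A minimal sketch, where `fetch_ok` stands in for the real HTTP check (webclaw's discovery code is not necessarily structured this way):

```rust
/// Candidate sitemap paths, tried in order; these are the four paths
/// named in the changelog entry above.
const SITEMAP_PATHS: &[&str] = &[
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/wp-sitemap.xml",
    "/sitemap/sitemap-index.xml",
];

/// Returns the first candidate URL for which `fetch_ok` succeeds.
fn first_sitemap(base: &str, fetch_ok: impl Fn(&str) -> bool) -> Option<String> {
    SITEMAP_PATHS
        .iter()
        .map(|p| format!("{base}{p}"))
        .find(|url| fetch_ok(url))
}
```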
[0.3.10] — 2026-04-10
Changed
- Fetch timeout reduced from 30s to 12s: prevents cascading slowdowns when proxies are unresponsive. Worst-case per-URL drops from ~94s to ~25s.
- Retry attempts reduced from 3 to 2: combined with shorter timeout, total worst-case is 12s + 1s delay + 12s = 25s instead of 30s + 1s + 30s + 3s + 30s = 94s.
[0.3.9] — 2026-04-04
Fixed
- Layout tables rendered as sections: tables used for page layout (containing block elements like `<p>`, `<div>`, `<hr>`) are now rendered as standalone sections instead of pipe-delimited markdown tables. Fixes Drudge Report and similar sites where all content was flattened into a single unreadable line. (by @devnen in #14)
- Stack overflow on deeply nested HTML: pages with 200+ DOM nesting levels (e.g., Express.co.uk live blogs) no longer overflow the stack. Two-layer fix: a depth guard in markdown.rs falls back to iterator-based text collection at depth 200, and `extract_with_options()` spawns an 8 MB worker thread for safety on Windows. (by @devnen in #14)
- Noise filter swallowing content in malformed HTML: `<form>` tags are no longer unconditionally treated as noise — ASP.NET page-wrapping forms (>500 chars) are preserved. A safety valve prevents unclosed noise containers (header/footer with >5000 chars) from absorbing entire page content. (by @devnen in #14)
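The depth-guard pattern can be sketched with plain types. The `Node` struct and the traversal below are illustrative stand-ins; webclaw's real DOM types and markdown renderer differ.

```rust
/// Above this depth, stop recursing and switch to an explicit-stack walk,
/// bounding native stack usage regardless of DOM nesting.
const MAX_DEPTH: usize = 200;

struct Node {
    text: String,
    children: Vec<Node>,
}

fn collect_text(node: &Node, depth: usize, out: &mut String) {
    if depth >= MAX_DEPTH {
        // Iterative fallback: an explicit Vec-backed stack instead of recursion.
        let mut stack = vec![node];
        while let Some(n) = stack.pop() {
            out.push_str(&n.text);
            // Push children reversed so the first child is popped first (preorder).
            for c in n.children.iter().rev() {
                stack.push(c);
            }
        }
        return;
    }
    out.push_str(&node.text);
    for c in &node.children {
        collect_text(c, depth + 1, out);
    }
}
```

Both paths produce the same preorder text; the fallback simply trades recursion for heap-allocated bookkeeping once the guard trips.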
Changed
- Bold/italic block passthrough: `<b>`/`<strong>`/`<em>`/`<i>` tags containing block-level children (e.g., Drudge wrapping columns in `<b>`) now act as transparent containers instead of collapsing everything into inline bold/italic. (by @devnen in #14)
[0.3.8] — 2026-04-03
Fixed
- MCP research token overflow: research results are now saved to `~/.webclaw/research/` and the MCP tool returns file paths + findings instead of the full report. Prevents "exceeds maximum allowed tokens" errors in Claude/Cursor.
- Research caching: the same query returns the cached result instantly without spending credits.
- Anthropic rate limit throttling: 60s delay between LLM calls in research to stay under Tier 1 limits (50K input tokens/min).
Added
- `dirs` dependency for `~/.webclaw/research/` path resolution.
[0.3.7] — 2026-04-03
Added
- `--research` CLI flag: run deep research via the cloud API. Prints the report to stdout and saves the full result (report + sources + findings) to a JSON file. Supports `--deep` for longer reports.
- MCP extract/summarize cloud fallback: when no local LLM is available, these tools now fall back to the cloud API instead of erroring. Set `WEBCLAW_API_KEY` for automatic fallback.
- MCP research structured output: the research tool now returns structured JSON (report + sources + findings + metadata) instead of raw text, so agents can reference individual findings and source URLs.
[0.3.6] — 2026-04-02
Added
- Structured data in markdown/LLM output: `__NEXT_DATA__`, SvelteKit, and JSON-LD data now appear as a `## Structured Data` section with a JSON code block at the end of `-f markdown` and `-f llm` output. Works with `--only-main-content` and all other flags.
Fixed
- Homebrew CI: the formula now updates all 4 platform checksums after the Docker build completes, preventing SHA mismatches on Linux installs (#12).
[0.3.5] — 2026-04-02
Added
- `__NEXT_DATA__` extraction: Next.js pages now have their `pageProps` JSON extracted into `structured_data`. Contains prices, product info, page state, and other data that isn't in the visible HTML. Tested on 45 sites — 13 now return rich structured data (BBC, Forbes, Nike, Stripe, TripAdvisor, Glassdoor, NASA, etc.).
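At its simplest, pulling the blob out of raw HTML is a substring search for the Next.js script tag. A std-only sketch of the idea; webclaw's real extractor is more robust than this (attribute order, whitespace, and HTML parsing are not handled here):

```rust
/// Returns the raw JSON inside the __NEXT_DATA__ script tag, if present.
/// Naive: assumes the exact tag form Next.js emits by default.
fn next_data(html: &str) -> Option<&str> {
    let open = r#"<script id="__NEXT_DATA__" type="application/json">"#;
    let start = html.find(open)? + open.len();
    let end = start + html[start..].find("</script>")?;
    Some(&html[start..end])
}
```

The returned slice is plain JSON, so `pageProps` can then be read out with an ordinary JSON parser.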
[0.3.4] — 2026-04-01
Added
- SvelteKit data island extraction: extracts structured JSON from `kit.start()` data arrays. Handles unquoted JS object keys by converting to valid JSON before parsing. Data appears in the `structured_data` field.
Changed
- License changed from MIT to AGPL-3.0.
[0.3.3] — 2026-04-01
Changed
- Replaced custom TLS stack with wreq: migrated from webclaw-tls (patched rustls/h2/hyper/reqwest) to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate for HTTP/2 fingerprinting — both battle-tested with 60+ browser profiles.
- Removed all `[patch.crates-io]` entries: consumers no longer need to patch rustls, h2, hyper, hyper-util, or reqwest. Just depend on webclaw normally.
- Browser profiles rebuilt on wreq's Emulation API: Chrome 145, Firefox 135, Safari 18, Edge 145 with correct TLS options (cipher suites, curves, GREASE, ECH, PSK session resumption), HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.
- Better TLS compatibility: BoringSSL handles more server configurations than patched rustls (e.g. servers that previously returned IllegalParameter alerts).
Removed
- webclaw-tls dependency and all 5 forked crates (webclaw-rustls, webclaw-h2, webclaw-hyper, webclaw-hyper-util, webclaw-reqwest).
Acknowledgments
- TLS and HTTP/2 fingerprinting powered by wreq and http2 by @0x676e67, who pioneered browser-grade HTTP/2 fingerprinting in Rust.
[0.3.2] — 2026-03-31
Added
- `--cookie-file` flag: load cookies from JSON files exported by browser extensions (EditThisCookie, Cookie-Editor). Format: `[{name, value, domain, ...}]`.
- MCP `cookies` parameter: the `scrape` tool now accepts a `cookies` array for authenticated scraping.
- Combined cookies: `--cookie` and `--cookie-file` can be used together and merge automatically.
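A minimal cookie file in this shape might look like the following. The values are made up, and browser extensions typically export additional fields (path, expiry, flags) beyond the three shown:

```json
[
  { "name": "session", "value": "abc123", "domain": ".example.com" },
  { "name": "csrftoken", "value": "xyz789", "domain": ".example.com" }
]
```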
[0.3.1] — 2026-03-30
Added
- Cookie warmup fallback: when a fetch returns an Akamai challenge page, automatically visits the homepage first to collect `_abck`/`bm_sz` cookies, then retries the original URL. Enables extraction of Akamai-protected subpages (e.g. fansale ticket pages) without JS rendering.
Changed
- Fixed HTTP header wire order (accept/user-agent were in wrong positions) and added H2 PRIORITY flag in HEADERS frames.
- `FetchResult.headers` now uses `http::HeaderMap` instead of `HashMap<String, String>` — avoids per-response allocation, preserves multi-value headers.
[0.3.0] — 2026-03-29
Changed
- Replaced primp with webclaw-tls: switched to custom TLS fingerprinting stack.
- Browser profiles: Chrome 146 (Win/Mac), Firefox 135+, Safari 18, Edge 146 — captured from real browsers.
- HTTP/2 fingerprinting: SETTINGS frame ordering and pseudo-header ordering based on concepts pioneered by @0x676e67.
Fixed
- HTTPS completely broken (#5): primp's forked rustls rejected valid certificates (UnknownIssuer on cross-signed chains like example.com). Fixed by using native OS root CAs alongside Mozilla bundle.
- Unknown certificate extensions: servers returning SCT in certificate entries no longer cause TLS errors.
Added
- Native root CA support: uses OS trust store (macOS Keychain, Windows cert store) in addition to webpki-roots.
- HTTP/2 fingerprinting: SETTINGS frame ordering and pseudo-header ordering match real browsers.
- Per-browser header ordering: HTTP headers sent in browser-specific wire order.
- Bandwidth tracking: atomic byte counters shared across cloned clients.
[0.2.2] — 2026-03-27
Fixed
- `cargo install` broken with primp 1.2.0: added the missing `reqwest` patch to `[patch.crates-io]`. primp moved to reqwest 0.13, which requires a patched fork.
- Weekly dependency check: CI now runs every Monday to catch primp patch drift before users hit it.
[0.2.1] — 2026-03-27
Added
- Docker image on GHCR: `docker run ghcr.io/0xmassi/webclaw` — auto-built on every release
- QuickJS data island extraction: inline `<script>` execution catches `window.__PRELOADED_STATE__`, Next.js hydration data, and other JS-embedded content
Fixed
- Docker CI now runs as part of the release workflow (was missing, image was never published)
[0.2.0] — 2026-03-26
Added
- DOCX extraction: auto-detected by Content-Type or URL extension, outputs markdown with headings
- XLSX/XLS extraction: spreadsheets converted to markdown tables, multi-sheet support via calamine
- CSV extraction: parsed with quoted field handling, output as markdown table
- HTML output format: `-f html` returns sanitized HTML from the extracted content
- Multi-URL watch: `--watch` now works with `--urls-file` to monitor multiple URLs in parallel
- Batch + LLM extraction: `--extract-prompt` and `--extract-json` now work with multiple URLs
- Scheduled batch watch: watch multiple URLs with aggregate change reports and per-URL diffs
[0.1.7] — 2026-03-26
Fixed
- `--only-main-content`, `--include`, and `--exclude` now work in batch mode (#3)
[0.1.6] — 2026-03-26
Added
- `--watch`: monitor a URL for changes at a configurable interval with diff output
- `--watch-interval`: seconds between checks (default: 300)
- `--on-change`: run a command when changes are detected (diff JSON piped to stdin)
- `--webhook`: POST JSON notifications on crawl/batch complete and watch changes. Auto-formats for Discord and Slack webhooks
[0.1.5] — 2026-03-26
Added
- `--output-dir`: save each page to a separate file instead of stdout. Works with single URL, crawl, and batch modes
- CSV input with custom filenames: `url,filename` format in `--urls-file`
- Root URLs use `hostname/index.ext` to avoid collisions in batch mode
- Subdirectories created automatically from URL path structure
[0.1.4] — 2026-03-26
Added
- QuickJS integration for extracting data from inline JavaScript (NYTimes +168%, Wired +580% more content)
- Executes inline `<script>` tags in a sandboxed runtime to capture `window.__*` data blobs
- Parses Next.js RSC flight data (`self.__next_f`) for App Router sites
- Smart text filtering rejects CSS, base64, file paths, and code — only keeps readable prose
- Feature-gated with the `quickjs` feature flag (enabled by default, disable for WASM builds)
[0.1.3] — 2026-03-25
Added
- Crawl streaming: real-time progress on stderr as pages complete (`[2/50] OK https://... (234ms, 1523 words)`)
- Crawl resume/cancel: `--crawl-state <path>` saves progress on Ctrl+C and resumes from where it left off
- MCP server proxy support via `WEBCLAW_PROXY` and `WEBCLAW_PROXY_FILE` env vars
Changed
- Crawl results now expose visited set and remaining frontier for accurate state persistence
[0.1.2] — 2026-03-25
Changed
- Default TLS profile switched from Chrome145/Win to Safari26/Mac (highest pass rate across CF-protected sites)
- Plain client fallback: when impersonated TLS gets connection error or 403, automatically retries without impersonation (fixes ycombinator.com, producthunt.com, and similar sites)
Fixed
- Reddit scraping: use the plain HTTP client for the `.json` endpoint (TLS fingerprinting was getting blocked)
Added
- YouTube transcript extraction infrastructure in webclaw-core (caption track parsing, timed text XML parser) — wired up when cloud API launches
[0.1.1] — 2026-03-24
Fixed
- MCP server now identifies as `webclaw-mcp` instead of `rmcp` in the MCP handshake
- Research tool polling caps at 200 iterations (~10 min) instead of looping forever
- CLI returns non-zero exit codes on errors (invalid format, fetch failures, missing LLM)
- Text format output strips markdown table syntax (`| --- |` pipes)
- All MCP tools validate URLs before network calls with clear error messages
- Cloud API HTTP client has 60s timeout instead of no timeout
- Local fetch calls timeout after 30s to prevent hanging on slow servers
- Diff cloud fallback computes actual diff instead of returning raw scrape JSON
- FetchClient startup failure logs and exits gracefully instead of panicking
Added
- Upper bounds: batch capped at 100 URLs, crawl capped at 500 pages
[0.1.0] — 2026-03-18
First public release. Full-featured web content extraction toolkit for LLMs.
Core Extraction
- Readability-style content scoring with text density, semantic tags, and link density penalties
- Exact CSS class token noise filtering with body-force fallback for SPAs
- HTML → markdown conversion with URL resolution, image alt text, srcset optimization
- 9-step LLM text optimization pipeline (67% token reduction vs raw HTML)
- JSON data island extraction (React, Next.js, Contentful CMS)
- YouTube transcript extraction (title, channel, views, duration, description)
- Lazy-loaded image detection (data-src, data-lazy-src, data-original)
- Brand identity extraction (name, colors, fonts, logos, OG image)
- Content change tracking / diff engine
- CSS selector filtering (include/exclude)
Fetching & Crawling
- TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
- BFS same-origin crawler with configurable depth, concurrency, and delay
- Sitemap.xml and robots.txt discovery
- Batch multi-URL concurrent extraction
- Per-request proxy rotation from pool file
- Reddit JSON API and LinkedIn post extractors
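The BFS crawl loop above can be illustrated with an in-memory sketch. Here `links` stands in for fetching a page and extracting its same-origin links, and the configurable concurrency and delay are omitted; this is not webclaw's actual implementation:

```rust
use std::collections::{HashSet, VecDeque};

/// Breadth-first crawl up to `max_depth`, visiting each URL once.
fn bfs_crawl(start: &str, links: impl Fn(&str) -> Vec<String>, max_depth: usize) -> Vec<String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut order = Vec::new();
    visited.insert(start.to_string());
    queue.push_back((start.to_string(), 0));
    while let Some((url, depth)) = queue.pop_front() {
        order.push(url.clone());
        if depth == max_depth {
            continue; // at the depth limit: record the page but don't expand it
        }
        for next in links(&url) {
            // HashSet::insert returns false for already-seen URLs,
            // so each page is enqueued at most once.
            if visited.insert(next.clone()) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    order
}
```

The visited set doubles as the persistence boundary: saving it plus the remaining queue is exactly the state the crawl resume feature (0.1.3) needs.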
LLM Integration
- Provider chain: Ollama (local-first) → OpenAI → Anthropic
- JSON schema extraction (structured data from pages)
- Natural language prompt extraction
- Page summarization with configurable sentence count
- PDF text extraction via pdf-extract
- Auto-detection by Content-Type header
MCP Server
- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
- stdio transport for Claude Desktop, Claude Code, and any MCP client
- Smart Fetch: local extraction first, cloud API fallback
CLI
- 4 output formats: markdown, JSON, plain text, LLM-optimized
- CSS selector filtering, crawling, sitemap discovery
- Brand extraction, content diffing, LLM features
- Browser profile selection, proxy support, stdin/file input
Infrastructure
- Docker multi-stage build with Ollama sidecar
- Deploy script for Hetzner VPS