apunkt/webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-04-25 00:06:21 +02:00

Valerio 095ae5d4b1

polish(fetch,mcp): robots parser + firefox client cache + Acquire ordering (P3) (#23)

Three P3 items from the 2026-04-16 audit. Bump to 0.3.17.

webclaw-fetch/sitemap.rs: parse_robots_txt used trimmed[..8] slice
plus eq_ignore_ascii_case for the directive test. That was fragile:
"Sitemap :" (space before colon) fell through silently, inline
"# ..." comments leaked into the URL, and a line with no URL at all
returned an empty string. Rewritten to split on the first colon,
match any-case "sitemap" as the directive name, strip comments, and
require `://` in the value. +7 unit tests cover case variants,
space-before-colon, comments, empty values, non-URL values, and
non-sitemap directives.

webclaw-fetch/crawler.rs: is_cancelled uses Ordering::Acquire
instead of Relaxed. Behaviourally equivalent on current hardware for
single-word atomic loads, but the explicit ordering documents intent
for readers + compilers.

webclaw-mcp/server.rs: add lazy OnceLock cache for the Firefox
FetchClient. Tool calls that repeatedly request the firefox profile
without cookies used to build a fresh reqwest pool + TLS stack per
call. Chrome (default) already used the long-lived field; Random is
per-call by design; cookie-bearing requests still build ad-hoc since
the cookie header is part of the client shape.

Tests: 85 webclaw-fetch (was 78, +7 new sitemap), 272 webclaw-core,
43 webclaw-llm, 11 CLI — all green. Clippy clean across workspace.

Refs: docs/AUDIT-2026-04-16.md P3 section

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-16 20:21:32 +02:00

Changelog

All notable changes to webclaw are documented here. Format follows Keep a Changelog.

[0.3.17] — 2026-04-16

Changed

webclaw-fetch::sitemap::parse_robots_txt now does proper directive parsing. The previous trimmed[..8].eq_ignore_ascii_case("sitemap:") slice couldn't handle "Sitemap :" (space before colon) from bad generators, didn't strip inline # ... comments, and would have returned empty/garbage values if a directive line had no URL. Now splits on the first colon, matches any-case sitemap as the directive name, strips comments, and requires the value to contain :// before accepting it. Seven new unit tests cover case variants, space-before-colon, inline comments, empty values, non-URL values, and non-sitemap directives.
webclaw-fetch::crawler::is_cancelled uses Ordering::Acquire (was Relaxed). Technically equivalent on x86/arm64 for single-word loads, but the explicit ordering documents the synchronization intent for readers and the compiler.
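The directive-parsing rules in the first item above can be sketched roughly as follows. This is an illustration written from the changelog text, not the code in webclaw-fetch; the function name `parse_sitemaps` and its return type are assumptions, and cutting the value at the first `#` is a simplification that would also truncate a URL containing a fragment.

```rust
/// Extract sitemap URLs from the text of a robots.txt file.
/// Hypothetical re-creation of the rules described in the changelog entry.
pub fn parse_sitemaps(robots_txt: &str) -> Vec<String> {
    let mut urls = Vec::new();
    for line in robots_txt.lines() {
        // Split on the first colon only, so the "://" of the URL stays in the value.
        let Some((name, value)) = line.split_once(':') else {
            continue;
        };
        // Match "sitemap" in any case and tolerate whitespace around the name,
        // which covers the "Sitemap :" form.
        if !name.trim().eq_ignore_ascii_case("sitemap") {
            continue;
        }
        // Drop an inline "# ..." comment, then surrounding whitespace.
        let value = value.split('#').next().unwrap_or("").trim();
        // Require something URL-shaped; this also rejects empty values.
        if value.contains("://") {
            urls.push(value.to_string());
        }
    }
    urls
}
```

Lines such as `Disallow: /private`, `Sitemap:` with no value, or `sitemap: not-a-url` produce no entry under these rules.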

Added

webclaw-mcp caches the Firefox FetchClient lazily. Tool calls that repeatedly request the Firefox profile without cookies used to build a fresh reqwest pool + TLS stack per call; a single OnceLock keeps the client alive for the life of the server. Chrome (default) and Random (by design per-call) are unaffected.
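A lazily initialised client of this kind might look like the sketch below. The `FetchClient` and `Server` types here are simplified stand-ins invented for the example (the real types carry a connection pool and TLS configuration); only `std::sync::OnceLock` is taken from the entry above.

```rust
use std::sync::OnceLock;

/// Stand-in for the real client type; the `profile` field is illustrative only.
#[derive(Debug)]
pub struct FetchClient {
    pub profile: &'static str,
}

impl FetchClient {
    fn build(profile: &'static str) -> Self {
        // In a real client this is the expensive step (connection pool, TLS setup).
        FetchClient { profile }
    }
}

/// Hypothetical server state: Chrome is built eagerly, Firefox on first use.
pub struct Server {
    chrome: FetchClient,
    firefox: OnceLock<FetchClient>,
}

impl Server {
    pub fn new() -> Self {
        Server {
            chrome: FetchClient::build("chrome"),
            firefox: OnceLock::new(),
        }
    }

    pub fn chrome_client(&self) -> &FetchClient {
        &self.chrome
    }

    /// Builds the Firefox client on the first call and returns the same
    /// instance on every later call.
    pub fn firefox_client(&self) -> &FetchClient {
        self.firefox.get_or_init(|| FetchClient::build("firefox"))
    }
}
```

Because `get_or_init` takes `&self`, the cache needs no extra lock around the server state, and servers that never request the Firefox profile never pay for building it.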

[0.3.16] — 2026-04-16

Hardened

Response body caps across fetch + LLM providers (P2). Every HTTP response buffered from the network is now rejected if it exceeds a hard size cap. webclaw-fetch::Response::from_wreq caps HTML/doc responses at 50 MB (checked once before the body is buffered and again, as a belt-and-braces check, after bytes().await); webclaw-llm providers (anthropic / openai / ollama) cap JSON responses at 5 MB via a shared response_json_capped helper. Previously an adversarial or runaway upstream could force unbounded memory use in the process. Closes the DoS-via-giant-body class of bugs noted in the audit.
Crawler frontier cap (P2). After each depth level the frontier is truncated to max(max_pages × 10, 100) entries, keeping the most recently discovered links. Dense pages (tag clouds, search results) used to push the frontier into the tens of thousands even after max_pages halted new fetches, keeping string allocations alive long after the crawl was effectively done.
Glob pattern validation (P2). User-supplied include_patterns / exclude_patterns passed to the crawler are now rejected if they contain more than 4 ** wildcards or exceed 1024 chars. The backtracking matcher degrades exponentially on deeply-nested ** against long paths; this keeps adversarial config files from weaponising it.
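The two crawler limits above (frontier cap and glob validation) reduce to a few lines each. The sketch below is written from the numbers in these entries; the function names, the `Vec<String>` frontier, and the assumption that the frontier is stored oldest-first are illustrative and may not match the crate.

```rust
/// Truncate the frontier to max(max_pages * 10, 100) entries,
/// keeping the most recently discovered links.
pub fn cap_frontier(frontier: &mut Vec<String>, max_pages: usize) {
    let cap = max_pages.saturating_mul(10).max(100);
    if frontier.len() > cap {
        let excess = frontier.len() - cap;
        // Assumes discovery order, oldest first, so the front is dropped.
        frontier.drain(..excess);
    }
}

/// Reject patterns longer than 1024 characters or with more than 4 `**` wildcards.
pub fn validate_glob(pattern: &str) -> Result<(), String> {
    if pattern.chars().count() > 1024 {
        return Err("pattern exceeds 1024 characters".to_string());
    }
    // `matches` counts non-overlapping occurrences of "**".
    if pattern.matches("**").count() > 4 {
        return Err("pattern contains more than 4 `**` wildcards".to_string());
    }
    Ok(())
}
```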

Cleanup

Removed blanket #![allow(dead_code)] in webclaw-cli/src/main.rs. No dead code surfaced; the suppression was obsolete.
.gitignore: replaced overbroad *.json with specific local-artifact patterns. The previous rule would have swallowed package.json / components.json / .smithery/*.json if they were ever modified.

[0.3.15] — 2026-04-16

Fixed

Batch/crawl no longer panics on semaphore close (P1). Three permit.acquire().await.expect("semaphore closed") call sites in webclaw-fetch (client::fetch_batch, client::fetch_and_extract_batch_with_options, crawler inner loop) now surface a typed FetchError::Build("semaphore closed before acquire") or a failed PageResult instead of panicking the spawned task. Under normal operation nothing changes; under shutdown-race or adversarial runtime state, the caller sees one failed entry in the batch instead of losing the task silently to the runtime's panic handler. Surfaced by the 2026-04-16 workspace audit.

[0.3.14] — 2026-04-16

Security

--on-change command injection closed (P0). The --on-change flag on webclaw watch and its multi-URL variant used to pipe the whole user-supplied string through sh -c. Anyone (or any LLM driving the MCP surface, or any config file parsed on the user's behalf) that could influence the flag value could execute arbitrary shell commands. The command is now tokenized with shlex and executed directly via Command::new(prog).args(args), so metacharacters like ;, &&, |, $(), <(...), and env expansion no longer fire. A WEBCLAW_ALLOW_SHELL=1 escape hatch is available for users who genuinely need pipelines; it logs a warning on every invocation so it can't slip in silently. Surfaced by the 2026-04-16 workspace audit.
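The difference from `sh -c` can be seen in a small sketch of the direct-execution step. Tokenization is assumed to have happened already (the entry above names shlex for that; it is left out here to keep the example std-only), and `build_command` is a hypothetical helper, not the function in webclaw.

```rust
use std::process::Command;

/// Build a command from pre-tokenized input: the first token is the program,
/// the rest are passed as literal arguments with no shell in between.
/// Returns `None` for empty input.
pub fn build_command(tokens: &[&str]) -> Option<Command> {
    let (prog, args) = tokens.split_first()?;
    let mut cmd = Command::new(prog);
    cmd.args(args);
    Some(cmd)
}
```

With this shape, tokens such as `;` or `$(pwd)` reach the program as ordinary argument strings, since nothing interprets them as shell syntax.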

[0.3.13] — 2026-04-10

Fixed

Docker CMD replaced with ENTRYPOINT: both Dockerfile and Dockerfile.ci now use ENTRYPOINT ["webclaw"] instead of CMD ["webclaw"]. CLI arguments (e.g. docker run webclaw https://example.com) now pass through correctly instead of being ignored.

[0.3.12] — 2026-04-10

Added

Crawl scope control: new allow_subdomains and allow_external_links fields on CrawlConfig. By default crawls stay same-origin. Enable allow_subdomains to follow sibling/child subdomains (e.g. blog.example.com from example.com), or allow_external_links for full cross-origin crawling. Root domain extraction uses a heuristic that handles two-part TLDs (co.uk, com.au).
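A root-domain heuristic of the kind mentioned above could be sketched like this. The suffix table contains only the two examples named in the entry, and `root_domain` / `same_site` are hypothetical names; the real heuristic may cover more suffixes or differ in detail.

```rust
/// Two-part public suffixes treated as a single TLD (illustrative, not exhaustive).
const TWO_PART_TLDS: &[&str] = &["co.uk", "com.au"];

/// Return the registrable root of a host name, e.g. "example.com"
/// for "blog.example.com" and "example.co.uk" for "shop.example.co.uk".
pub fn root_domain(host: &str) -> String {
    let labels: Vec<&str> = host.split('.').collect();
    let n = labels.len();
    let suffix = if n >= 2 {
        labels[n - 2..].join(".")
    } else {
        String::new()
    };
    // Keep three labels when the last two form a known two-part TLD.
    let keep = if n >= 3 && TWO_PART_TLDS.contains(&suffix.as_str()) {
        3
    } else {
        2
    };
    labels[n.saturating_sub(keep)..].join(".")
}

/// True when both hosts share a root domain, which is the check an
/// `allow_subdomains`-style option would need.
pub fn same_site(a: &str, b: &str) -> bool {
    root_domain(a).eq_ignore_ascii_case(&root_domain(b))
}
```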

[0.3.11] — 2026-04-10

Added

Sitemap fallback paths: discovery now tries /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml in addition to the standard /sitemap.xml. Sites using WordPress or non-standard sitemap locations are now discovered without needing external search.
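As a rough illustration, the candidate list can be built from the site origin as below. The probing order (standard path first) and the helper name `sitemap_candidates` are assumptions; the entry only states which paths are tried.

```rust
/// Paths probed during sitemap discovery, standard location first (order assumed).
const SITEMAP_PATHS: &[&str] = &[
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/wp-sitemap.xml",
    "/sitemap/sitemap-index.xml",
];

/// Build the candidate sitemap URLs for an origin such as "https://example.com".
pub fn sitemap_candidates(origin: &str) -> Vec<String> {
    // Avoid a double slash when the origin already ends with one.
    let base = origin.trim_end_matches('/');
    SITEMAP_PATHS
        .iter()
        .map(|path| format!("{base}{path}"))
        .collect()
}
```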

[0.3.10] — 2026-04-10

Changed

Fetch timeout reduced from 30s to 12s: prevents cascading slowdowns when proxies are unresponsive. Worst-case per-URL drops from ~94s to ~25s.
Retry attempts reduced from 3 to 2: combined with shorter timeout, total worst-case is 12s + 1s delay + 12s = 25s instead of 30s + 1s + 30s + 3s + 30s = 94s.
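The worst-case figures above follow from summing one timeout per attempt plus the delay before each retry. A small helper (hypothetical, for checking the arithmetic only) reproduces both totals:

```rust
/// Worst-case seconds spent on one URL: every attempt runs to its timeout,
/// and a delay precedes each retry. `delays_secs[i]` is the wait before retry i + 1.
pub fn worst_case_secs(timeout_secs: u64, attempts: usize, delays_secs: &[u64]) -> u64 {
    let waits: u64 = delays_secs
        .iter()
        .take(attempts.saturating_sub(1))
        .sum();
    timeout_secs * attempts as u64 + waits
}
```

With the 1s and 3s delays quoted above, `worst_case_secs(12, 2, &[1, 3])` gives 25 and `worst_case_secs(30, 3, &[1, 3])` gives 94.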

[0.3.9] — 2026-04-04

Fixed

Layout tables rendered as sections: tables used for page layout (containing block elements like <p>, <div>, <hr>) are now rendered as standalone sections instead of pipe-delimited markdown tables. Fixes Drudge Report and similar sites where all content was flattened into a single unreadable line. (by @devnen in #14)
Stack overflow on deeply nested HTML: pages with 200+ DOM nesting levels (e.g., Express.co.uk live blogs) no longer overflow the stack. Two-layer fix: depth guard in markdown.rs falls back to iterator-based text collection at depth 200, and extract_with_options() spawns an 8 MB worker thread for safety on Windows. (by @devnen in #14)
Noise filter swallowing content in malformed HTML: <form> tags no longer unconditionally treated as noise — ASP.NET page-wrapping forms (>500 chars) are preserved. Safety valve prevents unclosed noise containers (header/footer with >5000 chars) from absorbing entire page content. (by @devnen in #14)
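The two-layer stack-overflow fix described above can be sketched with std alone. The `Node` type, the function names, and the text-collection logic are simplified stand-ins rather than the code in markdown.rs; the depth limit of 200 and the 8 MB stack size are the values quoted in the entry.

```rust
use std::thread;

/// Minimal stand-in for a DOM node (hypothetical type).
pub struct Node {
    pub text: String,
    pub children: Vec<Node>,
}

const MAX_RECURSION_DEPTH: usize = 200;

/// Layer 1: recurse normally, but switch to an explicit stack once the
/// nesting depth reaches the guard value.
pub fn collect_text(node: &Node, depth: usize, out: &mut String) {
    if depth >= MAX_RECURSION_DEPTH {
        collect_text_iterative(node, out);
        return;
    }
    out.push_str(&node.text);
    for child in &node.children {
        collect_text(child, depth + 1, out);
    }
}

/// Heap-backed traversal that uses no call-stack recursion.
fn collect_text_iterative(root: &Node, out: &mut String) {
    let mut stack: Vec<&Node> = vec![root];
    while let Some(node) = stack.pop() {
        out.push_str(&node.text);
        // Push children in reverse so they are visited in document order.
        stack.extend(node.children.iter().rev());
    }
}

/// Layer 2: run the work on a thread with an explicitly sized stack.
pub fn run_with_stack<F, T>(stack_bytes: usize, work: F) -> thread::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(work)
        .expect("failed to spawn worker thread")
        .join()
}
```

A caller would wrap extraction as `run_with_stack(8 * 1024 * 1024, move || { ... })`, so the available stack no longer depends on the platform's default for the calling thread.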

Changed

Bold/italic block passthrough: <b>/<strong>/<em>/<i> tags containing block-level children (e.g., Drudge wrapping columns in <b>) now act as transparent containers instead of collapsing everything into inline bold/italic. (by @devnen in #14)

[0.3.8] — 2026-04-03

Fixed

MCP research token overflow: research results are now saved to ~/.webclaw/research/ and the MCP tool returns file paths + findings instead of the full report. Prevents "exceeds maximum allowed tokens" errors in Claude/Cursor.
Research caching: same query returns cached result instantly without spending credits.
Anthropic rate limit throttling: 60s delay between LLM calls in research to stay under Tier 1 limits (50K input tokens/min).

Added

dirs dependency for ~/.webclaw/research/ path resolution.

[0.3.7] — 2026-04-03

Added

--research CLI flag: run deep research via the cloud API. Prints report to stdout and saves full result (report + sources + findings) to a JSON file. Supports --deep for longer reports.
MCP extract/summarize cloud fallback: when no local LLM is available, these tools now fall back to the cloud API instead of erroring. Set WEBCLAW_API_KEY for automatic fallback.
MCP research structured output: the research tool now returns structured JSON (report + sources + findings + metadata) instead of raw text, so agents can reference individual findings and source URLs.

[0.3.6] — 2026-04-02

Added

Structured data in markdown/LLM output: __NEXT_DATA__, SvelteKit, and JSON-LD data now appears as a ## Structured Data section with a JSON code block at the end of -f markdown and -f llm output. Works with --only-main-content and all other flags.

Fixed

Homebrew CI: formula now updates all 4 platform checksums after Docker build completes, preventing SHA mismatch on Linux installs (#12).

[0.3.5] — 2026-04-02

Added

__NEXT_DATA__ extraction: Next.js pages now have their pageProps JSON extracted into structured_data. Contains prices, product info, page state, and other data that isn't in the visible HTML. Tested on 45 sites — 13 now return rich structured data (BBC, Forbes, Nike, Stripe, TripAdvisor, Glassdoor, NASA, etc.).

[0.3.4] — 2026-04-01

Added

SvelteKit data island extraction: extracts structured JSON from kit.start() data arrays. Handles unquoted JS object keys by converting to valid JSON before parsing. Data appears in the structured_data field.

Changed

License changed from MIT to AGPL-3.0.

[0.3.3] — 2026-04-01

Changed

Replaced custom TLS stack with wreq: migrated from webclaw-tls (patched rustls/h2/hyper/reqwest) to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate for HTTP/2 fingerprinting — both battle-tested with 60+ browser profiles.
Removed all [patch.crates-io] entries: consumers no longer need to patch rustls, h2, hyper, hyper-util, or reqwest. Just depend on webclaw normally.
Browser profiles rebuilt on wreq's Emulation API: Chrome 145, Firefox 135, Safari 18, Edge 145 with correct TLS options (cipher suites, curves, GREASE, ECH, PSK session resumption), HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.
Better TLS compatibility: BoringSSL handles more server configurations than patched rustls (e.g. servers that previously returned IllegalParameter alerts).

Removed

webclaw-tls dependency and all 5 forked crates (webclaw-rustls, webclaw-h2, webclaw-hyper, webclaw-hyper-util, webclaw-reqwest).

Acknowledgments

TLS and HTTP/2 fingerprinting powered by wreq and http2 by @0x676e67, who pioneered browser-grade HTTP/2 fingerprinting in Rust.

[0.3.2] — 2026-03-31

Added

--cookie-file flag: load cookies from JSON files exported by browser extensions (EditThisCookie, Cookie-Editor). Format: [{name, value, domain, ...}].
MCP cookies parameter: the scrape tool now accepts a cookies array for authenticated scraping.
Combined cookies: --cookie and --cookie-file can be used together and merge automatically.

[0.3.1] — 2026-03-30

Added

Cookie warmup fallback: when a fetch returns an Akamai challenge page, automatically visits the homepage first to collect _abck/bm_sz cookies, then retries the original URL. Enables extraction of Akamai-protected subpages (e.g. fansale ticket pages) without JS rendering.

Changed

Fixed HTTP header wire order (accept/user-agent were in wrong positions) and added H2 PRIORITY flag in HEADERS frames.
FetchResult.headers now uses http::HeaderMap instead of HashMap<String, String> — avoids per-response allocation, preserves multi-value headers.

[0.3.0] — 2026-03-29

Changed

Replaced primp with webclaw-tls: switched to custom TLS fingerprinting stack.
Browser profiles: Chrome 146 (Win/Mac), Firefox 135+, Safari 18, Edge 146 — captured from real browsers.
HTTP/2 fingerprinting: SETTINGS frame ordering and pseudo-header ordering based on concepts pioneered by @0x676e67.

Fixed

HTTPS completely broken (#5): primp's forked rustls rejected valid certificates (UnknownIssuer on cross-signed chains like example.com). Fixed by using native OS root CAs alongside Mozilla bundle.
Unknown certificate extensions: servers returning SCT in certificate entries no longer cause TLS errors.

Added

Native root CA support: uses OS trust store (macOS Keychain, Windows cert store) in addition to webpki-roots.
HTTP/2 fingerprinting: SETTINGS frame ordering and pseudo-header ordering match real browsers.
Per-browser header ordering: HTTP headers sent in browser-specific wire order.
Bandwidth tracking: atomic byte counters shared across cloned clients.
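Sharing a byte counter across cloned clients, as in the last item above, usually comes down to an `Arc` around an atomic. The sketch below assumes a single received-bytes counter and a hypothetical `Client` type; the real client may track more than one counter.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical client: cloning it shares the same counter, because only
/// the `Arc` pointer is copied.
#[derive(Clone, Default)]
pub struct Client {
    bytes_received: Arc<AtomicU64>,
}

impl Client {
    /// Add the size of a received body to the shared total.
    pub fn record(&self, bytes: u64) {
        // Relaxed is enough for a statistics counter that guards no other data.
        self.bytes_received.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Total bytes recorded by this client and all of its clones.
    pub fn total_bytes(&self) -> u64 {
        self.bytes_received.load(Ordering::Relaxed)
    }
}
```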

[0.2.2] — 2026-03-27

Fixed

cargo install broken with primp 1.2.0: added missing reqwest patch to [patch.crates-io]. primp moved to reqwest 0.13 which requires a patched fork.
Weekly dependency check: CI now runs every Monday to catch primp patch drift before users hit it.

[0.2.1] — 2026-03-27

Added

Docker image on GHCR: docker run ghcr.io/0xmassi/webclaw — auto-built on every release
QuickJS data island extraction: inline <script> execution catches window.__PRELOADED_STATE__, Next.js hydration data, and other JS-embedded content

Fixed

Docker CI now runs as part of the release workflow (was missing, image was never published)

[0.2.0] — 2026-03-26

Added

DOCX extraction: auto-detected by Content-Type or URL extension, outputs markdown with headings
XLSX/XLS extraction: spreadsheets converted to markdown tables, multi-sheet support via calamine
CSV extraction: parsed with quoted field handling, output as markdown table
HTML output format: -f html returns sanitized HTML from the extracted content
Multi-URL watch: --watch now works with --urls-file to monitor multiple URLs in parallel
Batch + LLM extraction: --extract-prompt and --extract-json now work with multiple URLs
Scheduled batch watch: watch multiple URLs with aggregate change reports and per-URL diffs

[0.1.7] — 2026-03-26

Fixed

--only-main-content, --include, and --exclude now work in batch mode (#3)

[0.1.6] — 2026-03-26

Added

--watch: monitor a URL for changes at a configurable interval with diff output
--watch-interval: seconds between checks (default: 300)
--on-change: run a command when changes are detected (diff JSON piped to stdin)
--webhook: POST JSON notifications on crawl/batch complete and watch changes. Auto-formats for Discord and Slack webhooks

[0.1.5] — 2026-03-26

Added

--output-dir: save each page to a separate file instead of stdout. Works with single URL, crawl, and batch modes
CSV input with custom filenames: url,filename format in --urls-file
Root URLs use hostname/index.ext to avoid collisions in batch mode
Subdirectories created automatically from URL path structure

[0.1.4] — 2026-03-26

Added

QuickJS integration for extracting data from inline JavaScript (NYTimes +168%, Wired +580% more content)
Executes inline <script> tags in a sandboxed runtime to capture window.__* data blobs
Parses Next.js RSC flight data (self.__next_f) for App Router sites
Smart text filtering rejects CSS, base64, file paths, and code — only keeps readable prose
Feature-gated with quickjs feature flag (enabled by default, disable for WASM builds)

[0.1.3] — 2026-03-25

Added

Crawl streaming: real-time progress on stderr as pages complete ([2/50] OK https://... (234ms, 1523 words))
Crawl resume/cancel: --crawl-state <path> saves progress on Ctrl+C and resumes from where it left off
MCP server proxy support via WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars

Changed

Crawl results now expose visited set and remaining frontier for accurate state persistence

[0.1.2] — 2026-03-25

Changed

Default TLS profile switched from Chrome145/Win to Safari26/Mac (highest pass rate across CF-protected sites)
Plain client fallback: when impersonated TLS gets connection error or 403, automatically retries without impersonation (fixes ycombinator.com, producthunt.com, and similar sites)

Fixed

Reddit scraping: use plain HTTP client for .json endpoint (TLS fingerprinting was getting blocked)

Added

YouTube transcript extraction infrastructure in webclaw-core (caption track parsing, timed text XML parser) — wired up when cloud API launches

[0.1.1] — 2026-03-24

Fixed

MCP server now identifies as webclaw-mcp instead of rmcp in the MCP handshake
Research tool polling caps at 200 iterations (~10 min) instead of looping forever
CLI returns non-zero exit codes on errors (invalid format, fetch failures, missing LLM)
Text format output strips markdown table syntax (| --- | pipes)
All MCP tools validate URLs before network calls with clear error messages
Cloud API HTTP client has 60s timeout instead of no timeout
Local fetch calls timeout after 30s to prevent hanging on slow servers
Diff cloud fallback computes actual diff instead of returning raw scrape JSON
FetchClient startup failure logs and exits gracefully instead of panicking

Added

Upper bounds: batch capped at 100 URLs, crawl capped at 500 pages

[0.1.0] — 2026-03-18

First public release. Full-featured web content extraction toolkit for LLMs.

Core Extraction

Readability-style content scoring with text density, semantic tags, and link density penalties
Exact CSS class token noise filtering with body-force fallback for SPAs
HTML → markdown conversion with URL resolution, image alt text, srcset optimization
9-step LLM text optimization pipeline (67% token reduction vs raw HTML)
JSON data island extraction (React, Next.js, Contentful CMS)
YouTube transcript extraction (title, channel, views, duration, description)
Lazy-loaded image detection (data-src, data-lazy-src, data-original)
Brand identity extraction (name, colors, fonts, logos, OG image)
Content change tracking / diff engine
CSS selector filtering (include/exclude)

Fetching & Crawling

TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
BFS same-origin crawler with configurable depth, concurrency, and delay
Sitemap.xml and robots.txt discovery
Batch multi-URL concurrent extraction
Per-request proxy rotation from pool file
Reddit JSON API and LinkedIn post extractors

LLM Integration

Provider chain: Ollama (local-first) → OpenAI → Anthropic
JSON schema extraction (structured data from pages)
Natural language prompt extraction
Page summarization with configurable sentence count

PDF

PDF text extraction via pdf-extract
Auto-detection by Content-Type header

MCP Server

8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
stdio transport for Claude Desktop, Claude Code, and any MCP client
Smart Fetch: local extraction first, cloud API fallback

CLI

4 output formats: markdown, JSON, plain text, LLM-optimized
CSS selector filtering, crawling, sitemap discovery
Brand extraction, content diffing, LLM features
Browser profile selection, proxy support, stdin/file input

Infrastructure

Docker multi-stage build with Ollama sidecar
Deploy script for Hetzner VPS
