webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-07 22:15:12 +02:00

Author	SHA1	Message	Date
Valerio	095ae5d4b1	polish(fetch,mcp): robots parser + firefox client cache + Acquire ordering (P3) (#23 ) Some checks are pending CI / Test (push) Waiting to run CI / Lint (push) Waiting to run CI / Docs (push) Waiting to run Three P3 items from the 2026-04-16 audit. Bump to 0.3.17. webclaw-fetch/sitemap.rs: parse_robots_txt used trimmed[..8] slice plus eq_ignore_ascii_case for the directive test. That was fragile: "Sitemap :" (space before colon) fell through silently, inline "# ..." comments leaked into the URL, and a line with no URL at all returned an empty string. Rewritten to split on the first colon, match any-case "sitemap" as the directive name, strip comments, and require `://` in the value. +7 unit tests cover case variants, space-before-colon, comments, empty values, non-URL values, and non-sitemap directives. webclaw-fetch/crawler.rs: is_cancelled uses Ordering::Acquire instead of Relaxed. Behaviourally equivalent on current hardware for single-word atomic loads, but the explicit ordering documents intent for readers + compilers. webclaw-mcp/server.rs: add lazy OnceLock cache for the Firefox FetchClient. Tool calls that repeatedly request the firefox profile without cookies used to build a fresh reqwest pool + TLS stack per call. Chrome (default) already used the long-lived field; Random is per-call by design; cookie-bearing requests still build ad-hoc since the cookie header is part of the client shape. Tests: 85 webclaw-fetch (was 78, +7 new sitemap), 272 webclaw-core, 43 webclaw-llm, 11 CLI — all green. Clippy clean across workspace. Refs: docs/AUDIT-2026-04-16.md P3 section Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 20:21:32 +02:00
Valerio	d69c50a31d	feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22 ) Some checks are pending CI / Test (push) Waiting to run CI / Lint (push) Waiting to run CI / Docs (push) Waiting to run * feat(fetch,llm): DoS hardening via response caps + glob validation (P2) Response body caps: - webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks Content-Length up front (before the allocation) and the actual .bytes() length after (belt-and-braces against lying upstreams). Previously the HTML -> markdown conversion downstream could allocate multiple String copies per page; a 100 MB page would OOM the process. - webclaw-llm providers (anthropic/openai/ollama) share a new response_json_capped helper with a 5 MB cap. Protects against a malicious or runaway provider response exhausting memory. Crawler frontier cap: after each BFS depth level the frontier is truncated to max(max_pages * 10, 100) entries, keeping the most recently discovered links. Dense pages (tag clouds, search results) used to push the frontier into the tens of thousands even after max_pages halted new fetches. Glob pattern validation: user-supplied include_patterns / exclude_patterns are rejected at Crawler::new if they contain more than 4 `` wildcards or exceed 1024 chars. The backtracking matcher degrades exponentially on deeply-nested `` against long paths. Cleanup: - Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs; no warnings surfaced, the suppression was obsolete. - core/.gitignore: replaced overbroad .json with specific local- artifact patterns (previous rule would have swallowed package.json, components.json, .smithery/.json). Tests: +4 validate_glob tests. Full workspace test: 283 passed (webclaw-core + webclaw-fetch + webclaw-llm). Version: 0.3.15 -> 0.3.16 CHANGELOG updated. Refs: docs/AUDIT-2026-04-16.md (P2 section) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: gitignore CLI research dumps, drop accidentally-tracked file research-.json output from `webclaw ... --research ...` got silently swept into git by the relaxed .json gitignore in the preceding commit. The old blanket .json rule was hiding both this legitimate scratch file AND packages/create-webclaw/server.json (MCP registry config that we DO want tracked). Removes the research dump from git and adds a narrower research-.json ignore pattern so future CLI output doesn't get re-tracked by accident. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 19:44:08 +02:00
Valerio	7773c8af2a	fix(fetch): surface semaphore-closed as typed error instead of panic (P1) (#21 ) Three call sites in webclaw-fetch used .expect("semaphore closed") on `Semaphore::acquire()`. Under normal operation they never fire, but under a shutdown race or adversarial runtime state the spawned task would panic and be silently dropped from the batch / crawl run — the caller would see fewer results than URLs with no indication why. Rewritten to match on the acquire result: - client::fetch_batch and client::fetch_and_extract_batch_with_options now emit BatchResult/BatchExtractResult carrying FetchError::Build("semaphore closed before acquire"). - crawler's inner loop emits a failed PageResult with the same error string instead of panicking. Behaviorally a no-op for the happy path. Fixes the silent-dropped-task class of bug noted in the 2026-04-16 audit. Version: 0.3.14 -> 0.3.15 CHANGELOG updated. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 19:20:26 +02:00
Valerio	6316b1a6e7	fix: handle raw newlines in JSON-LD strings Some checks are pending CI / Test (push) Waiting to run CI / Lint (push) Waiting to run CI / Docs (push) Waiting to run Sites like Bluesky emit JSON-LD with literal newline characters inside string values (technically invalid JSON). Add sanitize_json_newlines() fallback that escapes control characters inside quoted strings before retrying the parse. This recovers ProfilePage, Product, and other structured data that was previously silently dropped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 11:40:25 +02:00
Valerio	050b2ef463	feat: add allow_subdomains and allow_external_links to CrawlConfig Crawls are same-origin by default. Enable allow_subdomains to follow sibling/child subdomains (blog.example.com from example.com), or allow_external_links for full cross-origin crawling. Root domain extraction uses a heuristic that handles two-part TLDs (co.uk, com.au). Includes 5 unit tests for root_domain(). Bump to 0.3.12. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:33:06 +02:00
Valerio	a4c351d5ae	feat: add fallback sitemap paths for broader discovery Try /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml after the standard /sitemap.xml. WordPress 5.5+ and many CMS platforms use non-standard paths that were previously missed. Paths found via robots.txt are deduplicated to avoid double-fetching. Bump to 0.3.11. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 18:22:57 +02:00
Valerio	25b6282d5f	style: fix rustfmt for 2-element delay array Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 17:21:53 +02:00
Valerio	954aabe3e8	perf: reduce fetch timeout to 12s and retries to 2 Stress testing showed 33% of proxies are dead, causing 30s+ timeouts per request with 3 retries (worst case 94s). Reducing timeout from 30s to 12s and retries from 3 to 2 brings worst case to 25s. Combined with disabling 509 dead proxies from the pool, this should significantly improve response times under load. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 17:18:57 +02:00
Valerio	124352e0b4	style: cargo fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:25:40 +02:00
Valerio	aaf51eddef	feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3 Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest) to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles. This removes all 5 [patch.crates-io] entries that consumers previously needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145) are now built directly on wreq's Emulation API with correct TLS options, HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order. 84% pass rate across 1000 real sites. 384 unit tests green. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:04:55 +02:00
Valerio	44f23332cc	style: collapse nested if per clippy Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:13:55 +02:00
Valerio	20c810b8d2	chore: bump v0.3.1, update CHANGELOG, fix fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:11:54 +02:00
Valerio	7041a1d992	feat: cookie warmup fallback for Akamai-protected pages When a fetch returns a challenge page (small HTML with Akamai markers), automatically visit the homepage first to collect _abck/bm_sz cookies, then retry the original URL. This bypasses Akamai's cookie-based gate on subpages without needing JS execution. Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr sensor marker on responses under 15KB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:09:31 +02:00
Valerio	4cba36337b	style: fix fmt in client.rs test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:18:57 +02:00
Valerio	199dab6dfa	fix: adapt to webclaw-tls v0.1.1 HeaderMap API change Response.headers() now returns &http::HeaderMap instead of &HashMap<String, String>. Updated FetchResult, is_pdf_content_type, is_document_content_type, is_bot_protected, and all related tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:09:50 +02:00
Valerio	e3b0d0bd74	fix: make reddit and linkedin modules public for server access Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:54:35 +02:00
Valerio	f275a93bec	fix: clippy empty-line-after-doc-comment in browser.rs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:45:05 +02:00
Valerio	140234c139	style: cargo fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:43:11 +02:00
Valerio	f13cb83c73	feat: replace primp with webclaw-tls, bump to v0.3.0 Replace primp dependency with our own TLS fingerprinting stack (webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match. - Remove primp entirely (zero references remaining) - webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls - Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains) - Skip unknown certificate extensions (SCT tolerance) - 99% bypass rate on 102 sites (was ~85% with primp) - Fixes #5 (HTTPS broken — example.com and similar sites now work) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:40:10 +02:00
Valerio	ea14848772	feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM Document extraction: - DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml) - XLSX/XLS: markdown tables with multi-sheet support (via calamine) - CSV: quoted field handling, markdown table output - All auto-detected by Content-Type header or URL extension New features: - -f html output format (sanitized HTML) - Multi-URL watch: --urls-file + --watch monitors all URLs in parallel - Batch + LLM: --extract-prompt/--extract-json works with multiple URLs - Mixed batch: HTML pages + DOCX + XLSX + CSV in one command Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 15:28:23 +01:00
Valerio	0e4128782a	fix: v0.1.7 — extraction options now work in batch mode (#3 ) --only-main-content, --include, and --exclude were ignored in batch mode because run_batch used default ExtractionOptions. Added fetch_and_extract_batch_with_options to pass CLI options through. Closes #3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 13:30:20 +01:00
Valerio	0c91c6d5a9	feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support Crawl: - Real-time progress on stderr as pages complete - --crawl-state saves progress on Ctrl+C, resumes from saved state - Visited set + remaining frontier persisted for accurate resume MCP server: - Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars - Falls back to proxies.txt in CWD (existing behavior) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 21:38:28 +01:00
Valerio	afe4d3077d	feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra - Switch default profile to Safari26/Mac (best CF pass rate) - Auto-fallback to plain client on connection error or 403 - Fixes: ycombinator.com, producthunt.com, and similar CF-strict sites - Reddit .json endpoint uses plain client (TLS fingerprint was blocked) - YouTube caption track extraction + timed text parser (core, not yet wired) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 18:50:07 +01:00
Valerio	907966a983	fix: use plain client for Reddit JSON endpoint Reddit blocks TLS-fingerprinted clients on their .json API but accepts standard requests with a browser User-Agent. Switch to a non-impersonated primp client for the Reddit fallback path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 18:43:47 +01:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io	2026-03-23 18:31:11 +01:00

25 commits