webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-16 23:45:13 +02:00

Author	SHA1	Message	Date
devnen	66974366d7	feat(core): schema-aware JSON-LD parser + --prefer-structured + --articles-from-jsonld JSON-LD is consistently the cleanest source on major outlets (Reuters, BBC, Le Monde, N1, Pitchfork). Webclaw already emitted a raw Structured Data block at the bottom of -f llm output; this iter teaches it to parse the JSON-LD by schema and surface it usefully. New schema-aware parser at crates/webclaw-core/src/jsonld.rs classifies items by @type into: ItemList, LiveBlogPosting, NewsArticle, Review, WebPageOrChrome, Unknown. CollectionPage with mainEntity ItemList is auto-lifted (Reuters CollectionPage shape). Two new CLI flags: --prefer-structured: surfaces the schema-aware block at the TOP of the output, before prose. For -f llm emits a Markdown summary block; for -f json emits a {structured, extracted} envelope. Bypasses the default DROP list for WebPage/chrome types when explicitly requested. --articles-from-jsonld: when the page contains ItemList or LiveBlogPosting, output ONLY a JSON array of articles ({position, title, url, published}). When no such schema is present, emit a stderr hint and fall through to default extraction (no error). Default behavior (neither flag set) byte-identical to iter-3 on all default-flag probes (regression sentinel passed): Cyrillic p14 still 7735 B, M1 caps p18/p19/p20 deterministic, M2 hub p40/p41 byte-identical, M3 registry p44/p45/p46 still fast-fail with exit 67. 14 new tests in webclaw-core covering schema-variant parsing, parse error handling, fall-through behavior, flag combinations, and the default-byte-identical sentinel. Workspace tests 657 -> 671.	2026-05-23 20:38:59 +02:00
devnen	e28b22adf7	feat(fetch): known-bad-sites registry for fast-fail on Cloudflare / adblock walls Sites known to require CAPTCHA-solving (Cloudflare interstitials) or browser-side ad-blocker bypass (JS+adblock walls like Liberation) cannot be reached by webclaw's chrome impersonation; they return interstitial stubs ('Just a moment...', 'Please enable JS and disable any ad blocker') with 0 useful content. Currently each call wastes 5-10s on the timeout before the caller sees the failure. New registry under crates/webclaw-fetch/src/known_bad_sites.rs lists known bad hosts with a category (CloudflareInterstitial / AdblockWall) and suggested substitute domains. Host matching: lowercase + strip leading 'www.' + exact-match against registered host. On registry hit, webclaw writes 'error: <host> is <category>-walled; suggested substitute: <alt1>, <alt2>' to stderr and exits with code 67 (EX_NOHOST), BEFORE making any network call. wall_ms drops from ~5000 to <50 for listed hosts. Initial entries: ambito.com (Cloudflare; substitutes cronista.com, iprofesional.com), liberation.fr (adblock; substitutes lemonde.fr, lepoint.fr). WSJ/FT/Bloomberg/NYT are NOT included -- those are subscription paywalls with different bypass semantics; deferred to M11. 10 new tests in webclaw-fetch covering host normalization, www stripping, path-under-host matching, case insensitivity, unknown-domain pass-through, and the formatted error message (9 unit + 1 fetch-layer integration). Workspace test total 647 -> 657.	2026-05-23 19:42:15 +02:00
devnen	562c6a15f0	gitignore: cover improve-loop and local build artifacts improve-loop's loop.py writes baselines/, .loop-scratch/ and a *-loop-progress.log per run. _build-release.bat / _build-release.log are a local wrapper for invoking cargo build with the right MSVC + LLVM + NASM env (replaces the missing update.py from CLAUDE.md). None should land in git.	2026-05-23 17:42:05 +02:00
devnen	e620173d3a	docs+gitignore: portable-install sync note and local scratch ignores CLAUDE.md gains a mandatory step at the top describing the rebuild->copy->verify dance for the portable Claude Code install at C:\_projects\claude-portable, plus a local-build env snippet for the BoringSSL bindgen vars (LIBCLANG_PATH, NASM on PATH) that update.py sets automatically but a plain shell does not. .gitignore adds runtime/scratch entries that shouldn't have been tracked: __pycache__/, .last_update_check, .playwright-cli/, demo_sample.html, demo_saved.json. Nothing currently tracked is affected (none of these were under version control).	2026-05-23 17:29:05 +02:00
Valerio	d69c50a31d	feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22) * feat(fetch,llm): DoS hardening via response caps + glob validation (P2) Response body caps: - webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks Content-Length up front (before the allocation) and the actual .bytes() length after (belt-and-braces against lying upstreams). Previously the HTML -> markdown conversion downstream could allocate multiple String copies per page; a 100 MB page would OOM the process. - webclaw-llm providers (anthropic/openai/ollama) share a new response_json_capped helper with a 5 MB cap. Protects against a malicious or runaway provider response exhausting memory. Crawler frontier cap: after each BFS depth level the frontier is truncated to max(max_pages * 10, 100) entries, keeping the most recently discovered links. Dense pages (tag clouds, search results) used to push the frontier into the tens of thousands even after max_pages halted new fetches. Glob pattern validation: user-supplied include_patterns / exclude_patterns are rejected at Crawler::new if they contain more than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher degrades exponentially on deeply-nested `**` against long paths. Cleanup: - Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs; no warnings surfaced, the suppression was obsolete. - core/.gitignore: replaced overbroad *.json with specific local-artifact patterns (previous rule would have swallowed package.json, components.json, .smithery/*.json). Tests: +4 validate_glob tests. Full workspace test: 283 passed (webclaw-core + webclaw-fetch + webclaw-llm). Version: 0.3.15 -> 0.3.16 CHANGELOG updated. Refs: docs/AUDIT-2026-04-16.md (P2 section) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: gitignore CLI research dumps, drop accidentally-tracked file research-*.json output from `webclaw ... --research ...` got silently swept into git by the relaxed *.json gitignore in the preceding commit. The old blanket *.json rule was hiding both this legitimate scratch file AND packages/create-webclaw/server.json (MCP registry config that we DO want tracked). Removes the research dump from git and adds a narrower research-*.json ignore pattern so future CLI output doesn't get re-tracked by accident. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 19:44:08 +02:00
Valerio	050b2ef463	feat: add allow_subdomains and allow_external_links to CrawlConfig Crawls are same-origin by default. Enable allow_subdomains to follow sibling/child subdomains (blog.example.com from example.com), or allow_external_links for full cross-origin crawling. Root domain extraction uses a heuristic that handles two-part TLDs (co.uk, com.au). Includes 5 unit tests for root_domain(). Bump to 0.3.12. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:33:06 +02:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io	2026-03-23 18:31:11 +01:00

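The host-matching rule described in commit e28b22adf7 (lowercase, strip a leading 'www.', then exact-match against the registered host) can be sketched roughly as below. This is an illustration only, not the code in crates/webclaw-fetch/src/known_bad_sites.rs: the names `WallCategory`, `KnownBadSite`, `REGISTRY`, `normalize_host` and `lookup` are hypothetical, and only the two entries and the matching rule come from the commit message. The stderr message and exit code 67 are not reproduced here.

```rust
// Hypothetical types and names; only the matching rule and the two
// registry entries are taken from the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WallCategory {
    CloudflareInterstitial,
    AdblockWall,
}

#[derive(Debug)]
struct KnownBadSite {
    host: &'static str,
    category: WallCategory,
    substitutes: &'static [&'static str],
}

const REGISTRY: &[KnownBadSite] = &[
    KnownBadSite {
        host: "ambito.com",
        category: WallCategory::CloudflareInterstitial,
        substitutes: &["cronista.com", "iprofesional.com"],
    },
    KnownBadSite {
        host: "liberation.fr",
        category: WallCategory::AdblockWall,
        substitutes: &["lemonde.fr", "lepoint.fr"],
    },
];

/// Lowercase the host and strip one leading "www." label.
fn normalize_host(host: &str) -> String {
    let lower = host.to_ascii_lowercase();
    lower.strip_prefix("www.").unwrap_or(&lower).to_string()
}

/// Exact-match lookup after normalization; other subdomains do not match.
fn lookup(host: &str) -> Option<&'static KnownBadSite> {
    let normalized = normalize_host(host);
    REGISTRY.iter().find(|site| site.host == normalized)
}
```

Under these assumptions, `lookup("WWW.Ambito.com")` finds the ambito.com entry, while `lookup("blog.ambito.com")` returns `None` because matching is exact after the 'www.' strip. Because the lookup touches no network, a caller can run it before issuing any request.
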
7 commits
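
The two crawler limits described in commit d69c50a31d (glob validation and the frontier cap) might look roughly like the sketch below. It is not the project's implementation. The function name `validate_glob` appears in the commit message, but its signature here is a guess, and `cap_frontier` and both constants are hypothetical. The sketch also assumes that the wildcard being counted is `**` and that the frontier is a `Vec` with the most recently discovered links at the end.

```rust
// Limits as stated in the commit message; constant names are hypothetical.
const MAX_DOUBLE_STAR: usize = 4;
const MAX_PATTERN_CHARS: usize = 1024;

/// Reject patterns that are too long or contain too many `**` wildcards
/// (assumption: `**` is the wildcard the commit refers to).
fn validate_glob(pattern: &str) -> Result<(), String> {
    let chars = pattern.chars().count();
    if chars > MAX_PATTERN_CHARS {
        return Err(format!(
            "pattern has {chars} chars; limit is {MAX_PATTERN_CHARS}"
        ));
    }
    // Non-overlapping occurrences of "**".
    let wildcards = pattern.matches("**").count();
    if wildcards > MAX_DOUBLE_STAR {
        return Err(format!(
            "pattern has {wildcards} `**` wildcards; limit is {MAX_DOUBLE_STAR}"
        ));
    }
    Ok(())
}

/// Truncate the frontier to max(max_pages * 10, 100) entries, dropping the
/// oldest links (assumption: newest links sit at the end of the Vec).
fn cap_frontier(frontier: &mut Vec<String>, max_pages: usize) {
    let cap = max_pages.saturating_mul(10).max(100);
    if frontier.len() > cap {
        let excess = frontier.len() - cap;
        frontier.drain(..excess);
    }
}
```

With `max_pages = 5` the cap works out to 100, so a frontier of 250 links keeps only its last 100. The floor of 100 keeps small crawls from starving the frontier, while the `max_pages * 10` term lets larger crawls scale.
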