webclaw/crates
devnen e28b22adf7 feat(fetch): known-bad-sites registry for fast-fail on Cloudflare / adblock walls
Sites known to require CAPTCHA-solving (Cloudflare interstitials) or
browser-side ad-blocker bypass (JS+adblock walls like Liberation) cannot
be reached by webclaw's chrome impersonation; they return interstitial
stubs ('Just a moment...', 'Please enable JS and disable any ad blocker')
with 0 useful content. Currently each call wastes 5-10s on the timeout
before the caller sees the failure.

New registry under crates/webclaw-fetch/src/known_bad_sites.rs lists
known bad hosts with a category (CloudflareInterstitial / AdblockWall)
and suggested substitute domains. Host matching: lowercase + strip
leading 'www.' + exact-match against registered host.
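
The matching rule above can be sketched roughly as follows. This is a minimal illustration, not the contents of known_bad_sites.rs: the names WallCategory, KnownBadSite, KNOWN_BAD_SITES, normalize_host and lookup are hypothetical, and the sketch assumes the caller has already extracted the host from the URL. Only the two categories, the two initial hosts and their substitutes come from this commit message.

```rust
// Hypothetical sketch of the registry; type and function names are
// assumptions for illustration, not the actual webclaw-fetch API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WallCategory {
    CloudflareInterstitial,
    AdblockWall,
}

#[derive(Debug, PartialEq, Eq)]
pub struct KnownBadSite {
    pub host: &'static str,
    pub category: WallCategory,
    pub substitutes: &'static [&'static str],
}

// Entries mirror the initial hosts named in the commit message.
pub static KNOWN_BAD_SITES: &[KnownBadSite] = &[
    KnownBadSite {
        host: "ambito.com",
        category: WallCategory::CloudflareInterstitial,
        substitutes: &["cronista.com", "iprofesional.com"],
    },
    KnownBadSite {
        host: "liberation.fr",
        category: WallCategory::AdblockWall,
        substitutes: &["lemonde.fr", "lepoint.fr"],
    },
];

/// Lowercase the host and strip a single leading "www.".
pub fn normalize_host(host: &str) -> String {
    let lower = host.trim().to_ascii_lowercase();
    lower
        .strip_prefix("www.")
        .unwrap_or(lower.as_str())
        .to_string()
}

/// Exact match of the normalized host against the registry.
/// Subdomains other than "www." are deliberately not matched.
pub fn lookup(host: &str) -> Option<&'static KnownBadSite> {
    let normalized = normalize_host(host);
    KNOWN_BAD_SITES.iter().find(|site| site.host == normalized)
}
```

Under these assumptions, lookup("WWW.Ambito.com") would return the ambito.com entry, while an unlisted host such as example.com returns None and falls through to the normal fetch path.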

On registry hit, webclaw writes 'error: <host> is <category>-walled;
suggested substitute: <alt1>, <alt2>' to stderr and exits with code 67
(intended as EX_NOHOST, although sysexits.h defines EX_NOHOST as 68 and
67 as EX_NOUSER), BEFORE making any network call. wall_ms drops from
~5000 to <50 for listed hosts.
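
The fast-fail path described above might look something like the sketch below. It is an assumption-laden illustration: format_wall_error, fail_fast and EXIT_KNOWN_BAD_SITE are hypothetical names, and the category label passed in (for example "cloudflare") is a guess at how <category> is rendered. The message template and the exit code 67 are taken from this commit message.

```rust
use std::process::ExitCode;

// Exit code quoted in the commit message; the constant name is hypothetical.
pub const EXIT_KNOWN_BAD_SITE: u8 = 67;

/// Build the stderr line following the template from the commit message:
/// "error: <host> is <category>-walled; suggested substitute: <alt1>, <alt2>".
pub fn format_wall_error(host: &str, category: &str, substitutes: &[&str]) -> String {
    format!(
        "error: {host} is {category}-walled; suggested substitute: {}",
        substitutes.join(", ")
    )
}

/// Report the wall on stderr and return the exit code, without touching
/// the network. A caller would return this value from `main`.
pub fn fail_fast(host: &str, category: &str, substitutes: &[&str]) -> ExitCode {
    eprintln!("{}", format_wall_error(host, category, substitutes));
    ExitCode::from(EXIT_KNOWN_BAD_SITE)
}
```

Keeping the formatting in a pure function separate from the stderr write makes the message easy to cover with a unit test, which is presumably how the "formatted error message" test listed below is structured.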

Initial entries: ambito.com (Cloudflare; substitutes cronista.com,
iprofesional.com), liberation.fr (adblock; substitutes lemonde.fr,
lepoint.fr). WSJ/FT/Bloomberg/NYT are NOT included -- those are
subscription paywalls with different bypass semantics; deferred to M11.

10 new tests in webclaw-fetch covering host normalization, www
stripping, path-under-host matching, case insensitivity, unknown-domain
pass-through, and the formatted error message (9 unit + 1 fetch-layer
integration). Workspace test total 647 -> 657.
2026-05-23 19:42:15 +02:00
webclaw-cli feat(fetch): known-bad-sites registry for fast-fail on Cloudflare / adblock walls 2026-05-23 19:42:15 +02:00
webclaw-core feat(core): JS-hub page detector + --prefer-articles flag 2026-05-23 18:55:17 +02:00
webclaw-fetch feat(fetch): known-bad-sites registry for fast-fail on Cloudflare / adblock walls 2026-05-23 19:42:15 +02:00
webclaw-llm fix: support LLM provider compatibility options 2026-05-06 11:36:53 +02:00
webclaw-mcp fix: harden resource limits, path safety, and WASM build (#46) 2026-05-19 17:03:52 +02:00
webclaw-pdf Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
webclaw-server fix: validate self-host route URLs consistently 2026-05-04 14:30:06 +02:00