feat(fetch): known-bad-sites registry for fast-fail on Cloudflare / adblock walls

Sites known to require CAPTCHA-solving (Cloudflare interstitials) or
browser-side ad-blocker bypass (JS+adblock walls like Liberation) cannot
be reached by webclaw's Chrome impersonation; they return interstitial
stubs ('Just a moment...', 'Please enable JS and disable any ad blocker')
with no useful content. Currently each call wastes 5-10s waiting for the
timeout before the caller sees the failure.

New registry under crates/webclaw-fetch/src/known_bad_sites.rs lists
known bad hosts with a category (CloudflareInterstitial / AdblockWall)
and suggested substitute domains. Host matching: lowercase + strip
leading 'www.' + exact-match against registered host.
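
A minimal sketch of the matching rule described above, not the actual
contents of known_bad_sites.rs. The names WallCategory, KnownBadSite,
REGISTRY, normalize_host and lookup are hypothetical; only the two
categories, the two hosts and their substitutes come from this commit
message.

```rust
/// Hypothetical category enum mirroring the two categories named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WallCategory {
    CloudflareInterstitial,
    AdblockWall,
}

/// Hypothetical registry entry: host, category, suggested substitutes.
#[derive(Debug)]
pub struct KnownBadSite {
    pub host: &'static str,
    pub category: WallCategory,
    pub substitutes: &'static [&'static str],
}

/// Assumed layout: a static slice seeded with the initial entries.
pub const REGISTRY: &[KnownBadSite] = &[
    KnownBadSite {
        host: "ambito.com",
        category: WallCategory::CloudflareInterstitial,
        substitutes: &["cronista.com", "iprofesional.com"],
    },
    KnownBadSite {
        host: "liberation.fr",
        category: WallCategory::AdblockWall,
        substitutes: &["lemonde.fr", "lepoint.fr"],
    },
];

/// Lowercase the host and strip one leading "www." label.
pub fn normalize_host(host: &str) -> String {
    let lower = host.trim().to_ascii_lowercase();
    lower.strip_prefix("www.").unwrap_or(&lower).to_string()
}

/// Exact match of the normalized host against the registered hosts.
/// Other subdomains are deliberately not matched in this sketch.
pub fn lookup(host: &str) -> Option<&'static KnownBadSite> {
    let normalized = normalize_host(host);
    REGISTRY.iter().find(|site| site.host == normalized.as_str())
}
```

Under these assumptions, lookup("WWW.Ambito.com") hits the ambito.com
entry, while lookup("news.ambito.com") and lookup("example.com") pass
through as unknown.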

On registry hit, webclaw writes 'error: <host> is <category>-walled;
suggested substitute: <alt1>, <alt2>' to stderr and exits with code 67,
BEFORE making any network call. 67 lies in the BSD sysexits.h range
(64-78); note that sysexits.h names 67 EX_NOUSER, while EX_NOHOST is 68.
wall_ms drops from ~5000 to <50 for listed hosts.
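
The message shape above can be sketched as follows. This is an
illustration only: format_known_bad and EXIT_KNOWN_BAD_SITE are
hypothetical names, and the category label passed in ("cloudflare"
in the usage note) is an assumption, since the commit message does not
say how <category> is rendered.

```rust
/// Exit code used on a registry hit, per the commit message.
pub const EXIT_KNOWN_BAD_SITE: i32 = 67;

/// Build the stderr line for a registry hit.
/// `category` is whatever label the registry renders (assumed here).
pub fn format_known_bad(host: &str, category: &str, substitutes: &[&str]) -> String {
    format!(
        "error: {host} is {category}-walled; suggested substitute: {}",
        substitutes.join(", ")
    )
}
```

A caller would print the result with eprintln! and then call
std::process::exit(EXIT_KNOWN_BAD_SITE). For example,
format_known_bad("ambito.com", "cloudflare", &["cronista.com",
"iprofesional.com"]) yields 'error: ambito.com is cloudflare-walled;
suggested substitute: cronista.com, iprofesional.com'.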

Initial entries: ambito.com (Cloudflare; substitutes cronista.com,
iprofesional.com), liberation.fr (adblock; substitutes lemonde.fr,
lepoint.fr). WSJ/FT/Bloomberg/NYT are NOT included -- those are
subscription paywalls with different bypass semantics; deferred to M11.

10 new tests in webclaw-fetch covering host normalization, www
stripping, path-under-host matching, case insensitivity, unknown-domain
pass-through, and the formatted error message (9 unit + 1 fetch-layer
integration). Workspace test total 647 -> 657.
devnen 2026-05-23 19:42:15 +02:00
parent 31a8f6150f
commit e28b22adf7
6 changed files with 319 additions and 4 deletions


@@ -942,10 +942,20 @@ async fn fetch_and_extract(cli: &Cli) -> Result<FetchOutput, String> {
     let client =
         FetchClient::new(build_fetch_config(cli)).map_err(|e| format!("client error: {e}"))?;
     let options = build_extraction_options(cli);
-    let result = client
-        .fetch_and_extract_with_options(url, &options)
-        .await
-        .map_err(|e| format!("fetch error: {e}"))?;
+    let result = match client.fetch_and_extract_with_options(url, &options).await {
+        Ok(r) => r,
+        // M3: known-bad-sites registry hit. The error message is already
+        // formatted per phase-A contract. Emit it to stderr verbatim and
+        // exit 67 (chosen because webclaw's existing error paths all use
+        // exit 1; 67 is distinct so callers can grep for "host is in the
+        // known-bad registry" specifically without colliding with generic
+        // fetch failures, and falls inside the BSD sysexits.h band).
+        Err(webclaw_fetch::FetchError::KnownBadSite { message, .. }) => {
+            eprintln!("{message}");
+            process::exit(67);
+        }
+        Err(e) => return Err(format!("fetch error: {e}")),
+    };
     // Check if we should fall back to cloud
     let reason = detect_empty(&result);