mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-17 23:55:13 +02:00
feat(fetch): known-bad-sites registry for fast-fail on Cloudflare / adblock walls

Sites known to require CAPTCHA-solving (Cloudflare interstitials) or
browser-side ad-blocker bypass (JS+adblock walls like Liberation) cannot
be reached by webclaw's chrome impersonation; they return interstitial
stubs ('Just a moment...', 'Please enable JS and disable any ad blocker')
with no useful content. Currently each call wastes 5-10s on the timeout
before the caller sees the failure.

New registry under crates/webclaw-fetch/src/known_bad_sites.rs lists
known bad hosts with a category (CloudflareInterstitial / AdblockWall)
and suggested substitute domains. Host matching: lowercase + strip
leading 'www.' + exact-match against registered host.

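A minimal sketch of the host matching described above, assuming a static table. The registry source file is not shown in this commit view, so the names `WallCategory`, `KnownBadSite`, `REGISTRY`, `normalize_host`, and `lookup` are hypothetical. Only the two category names, the two hosts, and their substitutes come from this commit message.

```rust
// Hypothetical sketch; the real known_bad_sites.rs may be structured differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WallCategory {
    CloudflareInterstitial,
    AdblockWall,
}

#[derive(Debug)]
struct KnownBadSite {
    host: &'static str,
    category: WallCategory,
    substitutes: &'static [&'static str],
}

// Entries taken from the "Initial entries" list in this commit message.
const REGISTRY: &[KnownBadSite] = &[
    KnownBadSite {
        host: "ambito.com",
        category: WallCategory::CloudflareInterstitial,
        substitutes: &["cronista.com", "iprofesional.com"],
    },
    KnownBadSite {
        host: "liberation.fr",
        category: WallCategory::AdblockWall,
        substitutes: &["lemonde.fr", "lepoint.fr"],
    },
];

/// Lowercase the host, then strip a single leading "www." if present.
fn normalize_host(host: &str) -> String {
    let lower = host.to_ascii_lowercase();
    lower.strip_prefix("www.").unwrap_or(lower.as_str()).to_string()
}

/// Exact match against registered hosts; other subdomains do not match.
fn lookup(host: &str) -> Option<&'static KnownBadSite> {
    let normalized = normalize_host(host);
    REGISTRY.iter().find(|site| site.host == normalized)
}

fn main() {
    for host in ["WWW.Ambito.com", "liberation.fr", "news.ambito.com", "example.com"] {
        match lookup(host) {
            Some(site) => println!("{host}: {:?}, try {:?}", site.category, site.substitutes),
            None => println!("{host}: not in registry, fetch proceeds"),
        }
    }
}
```

Because the match is exact after normalization, a subdomain such as `news.ambito.com` falls through to a normal fetch in this sketch. Whether the real registry treats subdomains the same way is not stated here.
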
On a registry hit, webclaw writes 'error: <host> is <category>-walled;
suggested substitute: <alt1>, <alt2>' to stderr and exits with code 67,
BEFORE making any network call. (67 falls inside the BSD sysexits.h
range, where it is named EX_NOUSER; EX_NOHOST is 68.) wall_ms drops
from ~5000 to <50 for listed hosts.

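The message template above can be sketched as a small formatter. This is an illustration only: `format_known_bad` and `EXIT_KNOWN_BAD_SITE` are hypothetical names, and the exact string used for `<category>` (here "cloudflare") is an assumption, since the commit message gives only the template.

```rust
// Hypothetical sketch of the stderr message; not the crate's actual code.

/// Exit code named in the commit message for a registry hit.
const EXIT_KNOWN_BAD_SITE: i32 = 67;

/// Fill the template:
/// error: <host> is <category>-walled; suggested substitute: <alt1>, <alt2>
fn format_known_bad(host: &str, category: &str, substitutes: &[&str]) -> String {
    format!(
        "error: {host} is {category}-walled; suggested substitute: {}",
        substitutes.join(", ")
    )
}

fn main() {
    // The category label "cloudflare" is an assumed rendering.
    let message = format_known_bad("ambito.com", "cloudflare", &["cronista.com", "iprofesional.com"]);
    eprintln!("{message}");
    // A real CLI would now call std::process::exit(EXIT_KNOWN_BAD_SITE);
    // the sketch only reports the code so it can run to completion.
    println!("would exit with code {EXIT_KNOWN_BAD_SITE}");
}
```

Keeping the formatter separate from the exit call makes the message testable without terminating the process.
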
Initial entries: ambito.com (Cloudflare; substitutes cronista.com,
iprofesional.com), liberation.fr (adblock; substitutes lemonde.fr,
lepoint.fr). WSJ/FT/Bloomberg/NYT are NOT included -- those are
subscription paywalls with different bypass semantics; deferred to M11.

10 new tests in webclaw-fetch covering host normalization, www
stripping, path-under-host matching, case insensitivity, unknown-domain
pass-through, and the formatted error message (9 unit + 1 fetch-layer
integration). Workspace test total 647 -> 657.

parent 31a8f6150f
commit e28b22adf7
6 changed files with 319 additions and 4 deletions
@@ -942,10 +942,20 @@ async fn fetch_and_extract(cli: &Cli) -> Result<FetchOutput, String> {
     let client =
         FetchClient::new(build_fetch_config(cli)).map_err(|e| format!("client error: {e}"))?;
     let options = build_extraction_options(cli);
-    let result = client
-        .fetch_and_extract_with_options(url, &options)
-        .await
-        .map_err(|e| format!("fetch error: {e}"))?;
+    let result = match client.fetch_and_extract_with_options(url, &options).await {
+        Ok(r) => r,
+        // M3: known-bad-sites registry hit. The error message is already
+        // formatted per phase-A contract. Emit it to stderr verbatim and
+        // exit 67 (chosen because webclaw's existing error paths all use
+        // exit 1; 67 is distinct so callers can grep for "host is in the
+        // known-bad registry" specifically without colliding with generic
+        // fetch failures, and falls inside the BSD sysexits.h band).
+        Err(webclaw_fetch::FetchError::KnownBadSite { message, .. }) => {
+            eprintln!("{message}");
+            process::exit(67);
+        }
+        Err(e) => return Err(format!("fetch error: {e}")),
+    };
 
     // Check if we should fall back to cloud
     let reason = detect_empty(&result);