mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-16 23:45:13 +02:00
Sites known to require CAPTCHA solving (Cloudflare interstitials) or
browser-side ad-blocker bypass (JS+adblock walls like Liberation) cannot
be reached by webclaw's Chrome impersonation; they return interstitial
stubs ('Just a moment...', 'Please enable JS and disable any ad blocker')
with no useful content. Currently each call wastes 5-10s waiting for the
timeout before the caller sees the failure.
New registry in crates/webclaw-fetch/src/known_bad_sites.rs lists
known bad hosts with a category (CloudflareInterstitial / AdblockWall)
and suggested substitute domains. Host matching: lowercase + strip
leading 'www.' + exact-match against registered host.
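
The matching rule above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of known_bad_sites.rs: the names `KnownBadSite`, `KNOWN_BAD_SITES`, `normalize_host` and `lookup` are assumptions, and the sketch takes an already-extracted host rather than a full URL.

```rust
// Hypothetical sketch: type and function names are illustrative assumptions,
// not webclaw's real API. Only the two category names and the listed hosts
// come from the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WallCategory {
    CloudflareInterstitial,
    AdblockWall,
}

#[derive(Debug)]
struct KnownBadSite {
    host: &'static str,
    category: WallCategory,
    substitutes: &'static [&'static str],
}

// Initial entries as described in the commit message.
const KNOWN_BAD_SITES: &[KnownBadSite] = &[
    KnownBadSite {
        host: "ambito.com",
        category: WallCategory::CloudflareInterstitial,
        substitutes: &["cronista.com", "iprofesional.com"],
    },
    KnownBadSite {
        host: "liberation.fr",
        category: WallCategory::AdblockWall,
        substitutes: &["lemonde.fr", "lepoint.fr"],
    },
];

/// Lowercase the host and strip a single leading "www." if present.
fn normalize_host(host: &str) -> String {
    let lower = host.trim().to_ascii_lowercase();
    lower
        .strip_prefix("www.")
        .unwrap_or(lower.as_str())
        .to_string()
}

/// Exact match of the normalized host against the registry.
/// Other subdomains (e.g. "news.ambito.com") deliberately do not match.
fn lookup(host: &str) -> Option<&'static KnownBadSite> {
    let normalized = normalize_host(host);
    KNOWN_BAD_SITES.iter().find(|site| site.host == normalized)
}
```

Under these assumptions, `lookup("WWW.Ambito.com")` finds the ambito.com entry, while `lookup("example.com")` returns `None` and the request proceeds as before.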
On registry hit, webclaw writes 'error: <host> is <category>-walled;
suggested substitute: <alt1>, <alt2>' to stderr and exits with code 68
(EX_NOHOST), BEFORE making any network call. wall_ms drops from ~5000
to <50 for listed hosts.
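
A rough sketch of the short-circuit described above: the registry check runs first and the fetch only happens on a miss. The function names, the tuple-based registry shape, and the short category labels ("cloudflare", "adblock") are assumptions made for illustration; the real message text for each category may differ.

```rust
// Hypothetical sketch: names, registry shape and category labels are
// assumptions, not webclaw's actual code.
#[derive(Debug, Clone, Copy)]
enum WallCategory {
    CloudflareInterstitial,
    AdblockWall,
}

impl WallCategory {
    // Assumed short label used in the "<category>-walled" part of the message.
    fn label(self) -> &'static str {
        match self {
            WallCategory::CloudflareInterstitial => "cloudflare",
            WallCategory::AdblockWall => "adblock",
        }
    }
}

/// Build the error line in the format quoted in the commit message.
fn format_wall_error(host: &str, category: WallCategory, substitutes: &[&str]) -> String {
    format!(
        "error: {} is {}-walled; suggested substitute: {}",
        host,
        category.label(),
        substitutes.join(", ")
    )
}

/// Check the registry before doing any network work.
/// `fetch` stands in for the real network call and runs only on a miss.
fn fetch_with_guard<F>(
    host: &str,
    registry: &[(&str, WallCategory, &[&str])],
    fetch: F,
) -> Result<String, String>
where
    F: FnOnce(&str) -> String,
{
    if let Some((bad_host, category, substitutes)) =
        registry.iter().find(|(h, _, _)| *h == host)
    {
        return Err(format_wall_error(bad_host, *category, substitutes));
    }
    Ok(fetch(host))
}
```

In a CLI, the caller would print the `Err` string to stderr and exit with the non-zero code described above; returning a `Result` rather than exiting inside the guard keeps the check easy to unit-test.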
Initial entries: ambito.com (Cloudflare; substitutes cronista.com,
iprofesional.com), liberation.fr (adblock; substitutes lemonde.fr,
lepoint.fr). WSJ/FT/Bloomberg/NYT are NOT included -- those are
subscription paywalls with different bypass semantics; deferred to M11.
10 new tests in webclaw-fetch covering host normalization, www
stripping, path-under-host matching, case insensitivity, unknown-domain
pass-through, and the formatted error message (9 unit + 1 fetch-layer
integration). Workspace test total 647 -> 657.
29 lines
723 B
Text
target/
.DS_Store
.env
.env.*
proxies.txt
.claude/skills/
# Scratch / local artifacts (previously covered by overbroad `*.json`,
# which would have also swallowed package.json, components.json,
# .smithery/*.json if they were ever modified).
*.local.json
local-test-results.json
# CLI research command dumps JSON output keyed on the query; they're
# not code and shouldn't live in git. Track deliberately-saved research
# output under a different name.
research-*.json

# Local runtime/scratch — never repo content.
__pycache__/
.last_update_check
.playwright-cli/
demo_sample.html
demo_saved.json
baselines/
.loop-scratch/
*-loop-progress.log
_build-release.bat
_build-release.log
improve-loop-CONTINUE.md
iter-*-smoke/