Mirror of https://github.com/0xMassi/webclaw.git
(synced 2026-06-12 23:05:12 +02:00)
Detects NYT/WSJ/FT/Bloomberg/Substack paywall overlay markers in extracted
HTML and emits a stderr warning:
# webclaw: warning: paywall detected on <name> (<host>); full article
may not be accessible. Try --paywall-bypass or https://archive.is/<url>
Detection uses a declarative signature registry (parallel to M3
known-bad-sites): per-host suffix gate + any-of substring scan of
publisher-specific CSS classes / data-attributes / JSON-LD markers.
NYT markers (vi-gateway-container, "isAccessibleForFree":false,
meteredContent) were verified against a live NYT article; markers for
the other publishers follow their documented overlay conventions.
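
A minimal sketch of how such a registry could look, assuming a static
table of entries and plain substring matching. The type and function
names (`PaywallSignature`, `SIGNATURES`, `detect_paywall`) and the
`nytimes.com` suffix are illustrative assumptions, not the actual
contents of paywall.rs; only the three NYT markers come from the text
above.

```rust
/// One registry entry (hypothetical shape): a publisher name, the host
/// suffixes it owns, and the markers that indicate a paywall overlay.
pub struct PaywallSignature {
    pub name: &'static str,
    pub host_suffixes: &'static [&'static str],
    pub markers: &'static [&'static str],
}

/// Illustrative registry with a single entry. The host suffix is an
/// assumption; the markers are the NYT ones named in the commit message.
pub const SIGNATURES: &[PaywallSignature] = &[PaywallSignature {
    name: "NYT",
    host_suffixes: &["nytimes.com"],
    markers: &[
        "vi-gateway-container",
        "\"isAccessibleForFree\":false",
        "meteredContent",
    ],
}];

/// Host gate: exact match, or a subdomain of the suffix. Requiring the
/// dot boundary keeps "notnytimes.com" from matching "nytimes.com".
fn host_matches(host: &str, suffix: &str) -> bool {
    host == suffix
        || host
            .strip_suffix(suffix)
            .map_or(false, |rest| rest.ends_with('.'))
}

/// Returns the first signature whose host gate passes AND for which at
/// least one marker occurs in the HTML (any-of substring scan).
pub fn detect_paywall(host: &str, html: &str) -> Option<&'static PaywallSignature> {
    let host = host.to_ascii_lowercase();
    SIGNATURES.iter().find(|sig| {
        sig.host_suffixes.iter().any(|s| host_matches(&host, *s))
            && sig.markers.iter().any(|m| html.contains(*m))
    })
}
```

Gating on the host first means a page on an unrelated site that merely
mentions a marker string is never flagged, which is the property the
false-positive sentinels below rely on.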
New --paywall-bypass flag attempts a soft bypass: injects a Googlebot
User-Agent into the FetchConfig headers (some publishers serve full
content to crawlers for SEO indexing). If the paywall is STILL
detected post-Googlebot, the stderr warning switches to the
bypass-aware variant naming the attempted strategy and pointing at
https://archive.is/<url> as an external fallback.
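
The bypass and the two warning variants might be wired roughly as
follows. This is a sketch under assumptions: headers are modelled as a
plain `HashMap`, the User-Agent value is the commonly published
Googlebot string rather than the constant webclaw necessarily uses, and
the wording of the bypass-aware message is invented for illustration
(only the default message is taken from the text above).

```rust
use std::collections::HashMap;

/// Assumed Googlebot User-Agent; the constant in webclaw may differ.
pub const GOOGLEBOT_UA: &str =
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

/// Overrides the User-Agent header only when the bypass flag is set.
pub fn apply_paywall_bypass(headers: &mut HashMap<String, String>, bypass: bool) {
    if bypass {
        headers.insert("User-Agent".to_string(), GOOGLEBOT_UA.to_string());
    }
}

/// Builds the stderr warning. The bypass-aware wording is hypothetical;
/// it names the attempted strategy and points at archive.is.
pub fn paywall_warning(name: &str, host: &str, url: &str, bypass_attempted: bool) -> String {
    if bypass_attempted {
        format!(
            "# webclaw: warning: paywall still detected on {name} ({host}) \
             after Googlebot User-Agent bypass. Try https://archive.is/{url}"
        )
    } else {
        format!(
            "# webclaw: warning: paywall detected on {name} ({host}); full article \
             may not be accessible. Try --paywall-bypass or https://archive.is/{url}"
        )
    }
}
```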
This is BEST-EFFORT. webclaw has no headless browser and cannot
bypass paywalls that require real session auth, and the stderr wording
says so plainly. Plumbing is minimal: webclaw-fetch gets a new
`paywall` module + post-fetch detection hook in
fetch_and_extract_with_options, and FetchClient gets a
`with_paywall_bypass(bool)` builder method the CLI calls when the
flag is set.
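
For orientation, a builder method of this shape is typically a
consuming setter. The struct below is a stand-in with a single field;
the real FetchClient has other state that is not shown here, and the
`paywall_bypass()` accessor is a hypothetical helper added only so the
sketch can be exercised.

```rust
/// Stand-in for the real client; only the bypass flag is modelled.
#[derive(Debug, Default)]
pub struct FetchClient {
    paywall_bypass: bool,
}

impl FetchClient {
    /// Bypass is off by default.
    pub fn new() -> Self {
        Self::default()
    }

    /// Consuming builder method: takes and returns the client so calls
    /// can be chained, e.g. by the CLI when --paywall-bypass is passed.
    pub fn with_paywall_bypass(mut self, enabled: bool) -> Self {
        self.paywall_bypass = enabled;
        self
    }

    /// Hypothetical accessor used to inspect the flag.
    pub fn paywall_bypass(&self) -> bool {
        self.paywall_bypass
    }
}
```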
17 new tests: 13 in paywall.rs covering host-gate / marker-gate /
false-positive resistance / message formatting / Googlebot UA
constant, and 4 in webclaw-cli mod tests covering flag presence,
default value, header injection wiring. Workspace test count 724 -> 741.
Critical false-positive sentinels verified: p43 example.com 313 B
byte-identical (stderr empty), p09 bbc.com 13K+ (stderr empty), p47
reuters.com 10K+ (stderr empty). Cyrillic p14 srbijagas 7777 B
byte-identical (M15 sentinel preserved across 11 iters). M3 fast-fail
on ambito.com exit 67 byte-identical. M14 truncation warning intact.
No probe.py changes. No baseline modifications. No Cargo deps added.
| Name |
|---|
| webclaw-cli |
| webclaw-core |
| webclaw-fetch |
| webclaw-llm |
| webclaw-mcp |
| webclaw-pdf |
| webclaw-server |