webclaw/crates/webclaw-cli
devnen c37867309c feat(fetch): paywall HTML-signature detection + best-effort --paywall-bypass
Detects NYT/WSJ/FT/Bloomberg/Substack paywall overlay markers in extracted
HTML and emits stderr warning:
  # webclaw: warning: paywall detected on <name> (<host>); full article
    may not be accessible. Try --paywall-bypass or https://archive.is/<url>

Detection uses a declarative signature registry (parallel to M3
known-bad-sites): per-host suffix gate + any-of substring scan of
publisher-specific CSS classes / data-attributes / JSON-LD markers.
NYT markers (vi-gateway-container, "isAccessibleForFree":false,
meteredContent) were verified against a real live NYT article;
other publishers use documented per-publisher overlay conventions.

New --paywall-bypass flag attempts a soft bypass: injects a Googlebot
User-Agent into the FetchConfig headers (some publishers serve full
content to crawlers for SEO indexing). If the paywall is STILL
detected post-Googlebot, the stderr warning switches to the
bypass-aware variant naming the attempted strategy and pointing at
https://archive.is/<url> as an external fallback.

This is BEST-EFFORT. webclaw has no headless browser and cannot
bypass paywalls requiring real session auth. Honest stderr language
reflects that. Plumbing is minimal: webclaw-fetch gets a new
`paywall` module + post-fetch detection hook in
fetch_and_extract_with_options, and FetchClient gets a
`with_paywall_bypass(bool)` builder method the CLI calls when the
flag is set.

17 new tests (13 in paywall.rs covering host-gate / marker-gate /
false-positive resistance / message formatting / Googlebot UA
constant; 4 in webclaw-cli mod tests covering flag presence,
default value, header injection wiring). Workspace 724 -> 741.

Critical false-positive sentinels verified: p43 example.com 313 B
byte-identical (stderr empty), p09 bbc.com 13K+ (stderr empty), p47
reuters.com 10K+ (stderr empty). Cyrillic p14 srbijagas 7777 B
byte-identical (M15 sentinel preserved across 11 iters). M3 fast-fail
on ambito.com exit 67 byte-identical. M14 truncation warning intact.

No probe.py changes. No baseline modifications. No Cargo deps added.
2026-05-24 08:55:54 +02:00
..
src feat(fetch): paywall HTML-signature detection + best-effort --paywall-bypass 2026-05-24 08:55:54 +02:00
Cargo.toml fix(cli): close --on-change command injection via sh -c (P0) (#20) 2026-04-16 18:37:02 +02:00