webclaw/crates
devnen c37867309c feat(fetch): paywall HTML-signature detection + best-effort --paywall-bypass
Detects NYT/WSJ/FT/Bloomberg/Substack paywall overlay markers in extracted
HTML and emits a warning on stderr:
  # webclaw: warning: paywall detected on <name> (<host>); full article
    may not be accessible. Try --paywall-bypass or https://archive.is/<url>

Detection uses a declarative signature registry (parallel to M3
known-bad-sites): per-host suffix gate + any-of substring scan of
publisher-specific CSS classes / data-attributes / JSON-LD markers.
NYT markers (vi-gateway-container, "isAccessibleForFree":false,
meteredContent) were verified against a real live NYT article;
other publishers use documented per-publisher overlay conventions.
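
A minimal sketch of how such a registry could look, assuming a
hypothetical `PaywallSignature` struct and `detect_paywall` function
(neither name is taken from the actual `paywall` module). The
`nytimes.com` host suffix is also an assumption; the three NYT markers
are the ones listed above.

```rust
// Hypothetical sketch: the type, function, and host suffix are illustrative
// assumptions, not webclaw's real API.
struct PaywallSignature {
    name: &'static str,
    host_suffix: &'static str,
    markers: &'static [&'static str],
}

const SIGNATURES: &[PaywallSignature] = &[PaywallSignature {
    name: "NYT",
    host_suffix: "nytimes.com", // assumed suffix
    markers: &[
        "vi-gateway-container",
        "\"isAccessibleForFree\":false",
        "meteredContent",
    ],
}];

// Host gate: exact match or a dot-delimited suffix match, so that
// "notnytimes.com" does not pass the gate for "nytimes.com".
fn host_matches(host: &str, suffix: &str) -> bool {
    let host = host.to_ascii_lowercase();
    host == suffix || host.ends_with(&format!(".{suffix}"))
}

// Marker gate: any-of substring scan, run only for signatures whose host
// gate passed. Returns the first matching signature, if any.
fn detect_paywall(host: &str, html: &str) -> Option<&'static PaywallSignature> {
    SIGNATURES.iter().find(|sig| {
        host_matches(host, sig.host_suffix)
            && sig.markers.iter().any(|m| html.contains(*m))
    })
}
```

Gating on the host first means a page on an unrelated site that merely
mentions one of the marker strings cannot trigger the warning.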

New --paywall-bypass flag attempts a soft bypass: injects a Googlebot
User-Agent into the FetchConfig headers (some publishers serve full
content to crawlers for SEO indexing). If the paywall is STILL
detected post-Googlebot, the stderr warning switches to the
bypass-aware variant naming the attempted strategy and pointing at
https://archive.is/<url> as an external fallback.
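
The switch between the two warning variants could be sketched as
below. `paywall_warning` is a hypothetical helper name. The plain
variant follows the format quoted at the top of this message (assumed
here to be a single line); the wording of the bypass-aware variant is
an assumption, since only its intent is described above.

```rust
// Hypothetical sketch: function name and bypass-aware wording are assumptions.
fn paywall_warning(name: &str, host: &str, url: &str, bypass_attempted: bool) -> String {
    if bypass_attempted {
        // Bypass already tried: name the strategy and point at the
        // external fallback only.
        format!(
            "# webclaw: warning: paywall still detected on {name} ({host}) \
             after Googlebot User-Agent attempt; try https://archive.is/{url}"
        )
    } else {
        // Plain variant, following the format quoted in this message.
        format!(
            "# webclaw: warning: paywall detected on {name} ({host}); full article \
             may not be accessible. Try --paywall-bypass or https://archive.is/{url}"
        )
    }
}
```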

This is BEST-EFFORT. webclaw has no headless browser and cannot
bypass paywalls that require real session auth, and the stderr
wording says so plainly. Plumbing is minimal: webclaw-fetch gets a new
`paywall` module + post-fetch detection hook in
fetch_and_extract_with_options, and FetchClient gets a
`with_paywall_bypass(bool)` builder method the CLI calls when the
flag is set.
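
A rough sketch of the builder wiring, using stand-in types
(`SketchFetchConfig`, `SketchClient`) because the real `FetchConfig`
and `FetchClient` fields are not shown here. Only the method name
`with_paywall_bypass(bool)` comes from this commit; the User-Agent
value is Googlebot's commonly published string and may differ from the
constant in `paywall.rs`.

```rust
use std::collections::HashMap;

// Assumed value: Googlebot's widely published desktop UA string.
const GOOGLEBOT_UA: &str =
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

// Stand-in for FetchConfig; the real struct's fields are not known here.
#[derive(Default)]
struct SketchFetchConfig {
    headers: HashMap<String, String>,
}

// Stand-in for FetchClient.
struct SketchClient {
    config: SketchFetchConfig,
    paywall_bypass: bool,
}

impl SketchClient {
    fn new(config: SketchFetchConfig) -> Self {
        Self { config, paywall_bypass: false }
    }

    // Builder method: records the flag and, when enabled, overrides the
    // User-Agent header for subsequent fetches.
    fn with_paywall_bypass(mut self, enabled: bool) -> Self {
        self.paywall_bypass = enabled;
        if enabled {
            self.config
                .headers
                .insert("User-Agent".to_string(), GOOGLEBOT_UA.to_string());
        }
        self
    }
}
```

Taking a `bool` rather than exposing a no-argument method lets the CLI
pass the parsed flag value straight through without branching.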

17 new tests (13 in paywall.rs covering host-gate / marker-gate /
false-positive resistance / message formatting / Googlebot UA
constant; 4 in webclaw-cli mod tests covering flag presence,
default value, header injection wiring). Workspace test count: 724 -> 741.

Critical false-positive sentinels verified: p43 example.com 313 B
byte-identical (stderr empty), p09 bbc.com 13K+ (stderr empty), p47
reuters.com 10K+ (stderr empty). Cyrillic p14 srbijagas 7777 B
byte-identical (M15 sentinel preserved across 11 iterations). M3
fast-fail on ambito.com (exit 67) byte-identical. M14 truncation
warning intact.

No probe.py changes. No baseline modifications. No Cargo deps added.
2026-05-24 08:55:54 +02:00
webclaw-cli feat(fetch): paywall HTML-signature detection + best-effort --paywall-bypass 2026-05-24 08:55:54 +02:00
webclaw-core feat(core): word-count breakdown in header — article vs chrome split 2026-05-23 23:56:14 +02:00
webclaw-fetch feat(fetch): paywall HTML-signature detection + best-effort --paywall-bypass 2026-05-24 08:55:54 +02:00
webclaw-llm fix: support LLM provider compatibility options 2026-05-06 11:36:53 +02:00
webclaw-mcp feat(core): word-count breakdown in header — article vs chrome split 2026-05-23 23:56:14 +02:00
webclaw-pdf Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
webclaw-server feat(core): word-count breakdown in header — article vs chrome split 2026-05-23 23:56:14 +02:00