webclaw/crates/webclaw-cli
devnen 920d71f561 Strip more llm output noise: consent walls, bare integers, JSON-LD body duplication
- detect_empty: add ConsentWall variant for GDPR/cookie redirects (Yahoo,
  Google, EU news sites). Detect via final-URL host match (consent.*,
  /consent/, collectConsent) and warn with proxy/cookie remediation hint.
- init_logging: silence html5ever / markup5ever / selectors WARNs by
  default (foster-parenting messages from malformed real-world HTML
  pollute stderr with dozens of lines per fetch; override via WEBCLAW_LOG).
- cleanup: add strip_bare_number_lines for paragraphs that are just a
  short integer (news-index comment counts, page numbers); make
  is_ui_control_token case-insensitive and extend UI_CONTROLS with
  pagination chrome (next, prev, previous, older, newer) plus bare
  <=4-digit integers so '0 Next'-style glued lines are caught.
- links: drop bare-integer link labels and #comment-stream / #comments /
  #disqus hrefs from the deduplicated Links section.
- mod: scrub articleBody / body / text / description fields from JSON-LD
  structured-data emission when they would duplicate the rendered markdown
  body (always for articleBody; conditional >=500 chars for the others).
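
The consent-wall check in the detect_empty bullet could look roughly like the sketch below. It is not the code from this commit: the function name `looks_like_consent_wall` is hypothetical, and reading "consent.*" as a host prefix and "/consent/" and "collectConsent" as path substrings is an assumption about how the three patterns are applied.

```rust
/// Hypothetical helper (not the commit's actual code): returns true when the
/// final URL, after redirects, looks like a GDPR/cookie consent wall.
/// Assumption: `consent.*` is a host prefix; `/consent/` and `collectConsent`
/// are matched anywhere after the host (path or query string).
fn looks_like_consent_wall(final_url: &str) -> bool {
    // Drop the scheme if present ("https://host/path" -> "host/path").
    let rest = final_url
        .split_once("://")
        .map(|(_, r)| r)
        .unwrap_or(final_url);

    // Everything before the first '/' is the authority; the remainder
    // (path plus query) is kept together for substring matching.
    let (authority, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    };

    // Strip userinfo and port. Simplification: IPv6 literals are not handled.
    let host = authority.rsplit('@').next().unwrap_or(authority);
    let host = host.split(':').next().unwrap_or(host).to_ascii_lowercase();

    host.starts_with("consent.")
        || path.contains("/consent/")
        || path.contains("collectConsent")
}
```

A caller would pass the URL the HTTP client ended up on, not the URL that was requested, since the redirect target is what reveals the consent wall.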

All 292 core tests pass.
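
For the cleanup bullet above, a minimal sketch of the token rule follows. The names `is_ui_control_token` and `UI_CONTROLS` come from the commit message, but their bodies here are assumptions; the list holds only the pagination words named in this commit, and `is_ui_control_line` and `strip_noise_lines` are hypothetical helpers that fold the bare-number and glued-line cases into one filter.

```rust
/// Assumed subset of UI_CONTROLS: only the pagination words this commit adds.
const UI_CONTROLS: &[&str] = &["next", "prev", "previous", "older", "newer"];

/// True for a non-empty run of at most four ASCII digits.
fn is_bare_short_integer(s: &str) -> bool {
    !s.is_empty() && s.len() <= 4 && s.bytes().all(|b| b.is_ascii_digit())
}

/// Case-insensitive match against the control list, plus short integers.
fn is_ui_control_token(token: &str) -> bool {
    UI_CONTROLS.iter().any(|c| c.eq_ignore_ascii_case(token))
        || is_bare_short_integer(token)
}

/// Hypothetical helper: a line counts as noise when it has at least one
/// token and every whitespace-separated token is a UI control token.
fn is_ui_control_line(line: &str) -> bool {
    let mut tokens = line.split_whitespace().peekable();
    tokens.peek().is_some() && tokens.all(is_ui_control_token)
}

/// Hypothetical helper: drop noise lines, keep everything else (including
/// blank lines) in the original order.
fn strip_noise_lines(markdown: &str) -> String {
    markdown
        .lines()
        .filter(|line| !is_ui_control_line(line))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Under this rule a line such as `0 Next` is removed because both tokens qualify, while `12345` survives the four-digit cap. The trade-off is that a legitimate paragraph consisting only of a year such as `2024` would also be dropped.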
2026-05-16 18:55:28 +02:00
src Strip more llm output noise: consent walls, bare integers, JSON-LD body duplication 2026-05-16 18:55:28 +02:00
Cargo.toml fix(cli): close --on-change command injection via sh -c (P0) (#20) 2026-04-16 18:37:02 +02:00