webclaw/crates
devnen 920d71f561 Strip more llm output noise: consent walls, bare integers, JSON-LD body duplication
- detect_empty: add ConsentWall variant for GDPR/cookie redirects (Yahoo,
  Google, EU news sites). Detect via final-URL host match (consent.*,
  /consent/, collectConsent) and warn with proxy/cookie remediation hint.
- init_logging: silence html5ever / markup5ever / selectors WARNs by
  default (foster-parenting messages from malformed real-world HTML
  pollute stderr with dozens of lines per fetch; override via WEBCLAW_LOG).
- cleanup: add strip_bare_number_lines for paragraphs that are just a
  short integer (news-index comment counts, page numbers); make
  is_ui_control_token case-insensitive and extend UI_CONTROLS with
  pagination chrome (next, prev, previous, older, newer) plus bare
  <=4-digit integers so '0 Next'-style glued lines are caught.
- links: drop bare-integer link labels and #comment-stream / #comments /
  #disqus hrefs from the deduplicated Links section.
- mod: scrub articleBody / body / text / description fields from JSON-LD
  structured-data emission when they would duplicate the rendered markdown
  body (always for articleBody; conditional >=500 chars for the others).
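
The consent-wall check in the first item could be sketched roughly as follows. This is a minimal, std-only illustration and not the code in webclaw-core: the function name `looks_like_consent_wall` is hypothetical, the URL is split by hand rather than with a URL-parsing crate, and the three markers are the ones the commit message lists (a `consent.` host prefix, a `/consent/` path segment, and `collectConsent`).

```rust
/// Hypothetical helper (not the actual webclaw-core API): returns true when
/// the final, post-redirect URL looks like a GDPR/cookie consent wall.
pub fn looks_like_consent_wall(final_url: &str) -> bool {
    // Drop the scheme, if present.
    let rest = final_url.split_once("://").map_or(final_url, |(_, r)| r);

    // Split the authority from the path, query, and fragment.
    let (authority, tail) = match rest.find(|c: char| matches!(c, '/' | '?' | '#')) {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    };

    // Strip userinfo and port; bracketed IPv6 hosts are not handled in this sketch.
    let host = authority.rsplit_once('@').map_or(authority, |(_, h)| h);
    let host = host.split(':').next().unwrap_or(host).to_ascii_lowercase();
    let tail = tail.to_ascii_lowercase();

    // Markers taken from the commit message: consent.*, /consent/, collectConsent.
    host.starts_with("consent.")
        || tail.contains("/consent/")
        || tail.contains("collectconsent")
}
```

A caller would pass the URL reported after redirects, not the requested one, and emit the proxy/cookie remediation warning when the function returns true. Matching on `/consent/` with both slashes keeps ordinary article slugs such as `informed-consent-law` from triggering the check.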

All 292 core tests pass.
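
The cleanup item above can be illustrated with a small sketch. It assumes that "short integer" means at most four ASCII digits in both places, that `UI_CONTROLS` holds only the five pagination words named in the message, and that lines are processed one at a time; `is_short_integer` and `is_ui_control_line` are hypothetical helper names, and the real functions in webclaw-core may differ in signature and behavior.

```rust
/// Assumed subset of UI_CONTROLS: only the pagination words named in the commit message.
const UI_CONTROLS: &[&str] = &["next", "prev", "previous", "older", "newer"];

/// Hypothetical helper: true for a run of 1 to 4 ASCII digits.
fn is_short_integer(token: &str) -> bool {
    (1..=4).contains(&token.len()) && token.bytes().all(|b| b.is_ascii_digit())
}

/// Case-insensitive match against the control words, plus bare short integers.
pub fn is_ui_control_token(token: &str) -> bool {
    is_short_integer(token) || UI_CONTROLS.iter().any(|c| c.eq_ignore_ascii_case(token))
}

/// Hypothetical helper: a line counts as UI chrome when it is non-empty and
/// every whitespace-separated token is a control token ("0 Next").
pub fn is_ui_control_line(line: &str) -> bool {
    let mut tokens = line.split_whitespace().peekable();
    tokens.peek().is_some() && tokens.all(is_ui_control_token)
}

/// Drops lines that consist solely of a short integer; blank lines are kept.
/// Note that a trailing newline in the input is not preserved.
pub fn strip_bare_number_lines(text: &str) -> String {
    text.lines()
        .filter(|line| !is_short_integer(line.trim()))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Treating bare integers as control tokens is what lets a glued line such as `0 Next` be recognized as chrome, while a sentence that merely starts with "Next" is left alone because its remaining tokens do not match.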
2026-05-16 18:55:28 +02:00
webclaw-cli Strip more llm output noise: consent walls, bare integers, JSON-LD body duplication 2026-05-16 18:55:28 +02:00
webclaw-core Strip more llm output noise: consent walls, bare integers, JSON-LD body duplication 2026-05-16 18:55:28 +02:00
webclaw-fetch fix: harden fetch URL validation 2026-05-04 11:50:57 +02:00
webclaw-llm fix: support LLM provider compatibility options 2026-05-06 11:36:53 +02:00
webclaw-mcp fix: harden fetch URL validation 2026-05-04 11:50:57 +02:00
webclaw-pdf Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
webclaw-server fix: validate self-host route URLs consistently 2026-05-04 14:30:06 +02:00