mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-10 22:45:13 +02:00
- detect_empty: add ConsentWall variant for GDPR/cookie redirects (Yahoo, Google, EU news sites). Detect via final-URL host match (consent.*, /consent/, collectConsent) and warn with proxy/cookie remediation hint.
- init_logging: silence html5ever / markup5ever / selectors WARNs by default (foster-parenting messages from malformed real-world HTML pollute stderr with dozens of lines per fetch; override via WEBCLAW_LOG).
- cleanup: add strip_bare_number_lines for paragraphs that are just a short integer (news-index comment counts, page numbers); make is_ui_control_token case-insensitive and extend UI_CONTROLS with pagination chrome (next, prev, previous, older, newer) plus bare <=4-digit integers so '0 Next'-style glued lines are caught.
- links: drop bare-integer link labels and #comment-stream / #comments / #disqus hrefs from the deduplicated Links section.
- mod: scrub articleBody / body / text / description fields from JSON-LD structured-data emission when they would duplicate the rendered markdown body (always for articleBody; conditional >=500 chars for the others).

All 292 core tests pass.

| | | |
|---|---|---|
| webclaw-cli | ||
| webclaw-core | ||
| webclaw-fetch | ||
| webclaw-llm | ||
| webclaw-mcp | ||
| webclaw-pdf | ||
| webclaw-server | ||
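
The cleanup and consent-wall heuristics named in the commit message above could look roughly like the following. This is a minimal sketch, not the code in webclaw-core: the names `strip_bare_number_lines`, `is_ui_control_token`, and `UI_CONTROLS` come from the commit message, while `is_short_integer`, `is_ui_control_line`, `looks_like_consent_wall`, every signature, the line-based (rather than paragraph-based) filtering, and the 4-digit limit applied to bare-number lines are assumptions made for illustration.

```rust
// Illustrative sketch only. Function bodies, signatures, and helper names
// are assumptions; they are not taken from the webclaw source.

/// Pagination chrome listed in the commit message (the real list is likely longer).
const UI_CONTROLS: &[&str] = &["next", "prev", "previous", "older", "newer"];

/// Hypothetical helper: true for a non-empty run of at most `max_digits` ASCII digits.
fn is_short_integer(s: &str, max_digits: usize) -> bool {
    !s.is_empty() && s.len() <= max_digits && s.bytes().all(|b| b.is_ascii_digit())
}

/// Case-insensitive match against UI_CONTROLS, or a bare <=4-digit integer.
fn is_ui_control_token(token: &str) -> bool {
    UI_CONTROLS.iter().any(|c| token.eq_ignore_ascii_case(c)) || is_short_integer(token, 4)
}

/// Hypothetical helper: a line counts as chrome when it has at least one
/// token and every whitespace-separated token is a control token ("0 Next").
fn is_ui_control_line(line: &str) -> bool {
    let mut tokens = line.split_whitespace().peekable();
    tokens.peek().is_some() && tokens.all(is_ui_control_token)
}

/// Drops lines that consist solely of a short integer (comment counts,
/// page numbers). The 4-digit limit here is an assumption.
fn strip_bare_number_lines(markdown: &str) -> String {
    markdown
        .lines()
        .filter(|line| !is_short_integer(line.trim(), 4))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Hypothetical check on the final (post-redirect) URL using the three
/// patterns from the commit message: consent.* host, /consent/, collectConsent.
fn looks_like_consent_wall(final_url: &str) -> bool {
    // Strip the scheme, then split host from path without a URL crate.
    let rest = final_url.split_once("://").map_or(final_url, |(_, r)| r);
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    };
    host.starts_with("consent.") || path.contains("/consent/") || path.contains("collectConsent")
}
```

Requiring every token on a line to be a control token keeps ordinary sentences that merely contain "next" or a number from being dropped, which is one plausible reading of the "'0 Next'-style glued lines" case.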