webclaw/crates
devnen 70c67f2ed6 fix: prevent noise filter from swallowing content in malformed HTML
Two related fixes for content being stripped by the noise filter:

1. Remove <form> from unconditional noise tags. ASP.NET and similar
   frameworks wrap entire pages in a <form> tag — these are not input
   forms. Forms with >500 chars of text are now treated as content
   wrappers, not noise.

2. Add safety valve for class/ID noise matching. When malformed HTML
   leaves a noise container unclosed (e.g., <div class="header"> missing
   its </div>), the HTML5 parser makes all subsequent siblings into
   children of that container. A header/nav/footer with >5000 chars of
   text is almost certainly a broken wrapper absorbing real content —
   exempt it from noise filtering.
2026-04-04 01:38:42 +02:00
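
The two heuristics in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code in webclaw-core: the `Element` struct, the `is_noise` function, and the tag and token lists are hypothetical names invented for the example. Only the 500 and 5000 character thresholds come from the commit message. Treating short forms as noise is also an assumption, since the message only says what happens to long ones.

```rust
/// Hypothetical, simplified view of a parsed element (not webclaw's real type).
#[derive(Debug, Clone)]
pub struct Element {
    /// Tag name, e.g. "div" or "form".
    pub tag: String,
    /// Space-separated class names plus the id, e.g. "header site-top".
    pub class_and_id: String,
    /// Number of characters of text contained in the element's subtree.
    pub text_len: usize,
}

// Thresholds taken from the commit message.
const FORM_CONTENT_MIN_CHARS: usize = 500;
const NOISE_CONTAINER_MAX_CHARS: usize = 5000;

// Illustrative lists only; the real filter's lists are not shown in the source.
const NOISE_TAGS: &[&str] = &["script", "style", "noscript"];
const NOISE_TOKENS: &[&str] = &["header", "nav", "footer"];

/// Returns true if the element should be dropped as noise.
pub fn is_noise(el: &Element) -> bool {
    // Fix 1: <form> is no longer unconditional noise. A form holding more
    // than 500 chars of text is treated as a page wrapper (ASP.NET style).
    // Assumption: shorter forms are still filtered out.
    if el.tag.eq_ignore_ascii_case("form") {
        return el.text_len <= FORM_CONTENT_MIN_CHARS;
    }

    // Tags that are always noise, regardless of size.
    if NOISE_TAGS.iter().any(|t| t.eq_ignore_ascii_case(&el.tag)) {
        return true;
    }

    // Class/ID matching against noise tokens.
    let matches_noise_token = el
        .class_and_id
        .split_whitespace()
        .any(|tok| NOISE_TOKENS.iter().any(|n| n.eq_ignore_ascii_case(tok)));

    // Fix 2: safety valve. A "header"/"nav"/"footer" container with more
    // than 5000 chars of text is probably an unclosed element that absorbed
    // the real content, so it is exempt from filtering.
    matches_noise_token && el.text_len <= NOISE_CONTAINER_MAX_CHARS
}
```

Under these assumptions, a `<div class="header">` with 200 characters of text is dropped, while the same element with 9000 characters survives because it is presumed to be an unclosed wrapper holding the page body.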
webclaw-cli feat: CLI --research flag + MCP cloud fallback + structured research output 2026-04-03 14:04:04 +02:00
webclaw-core fix: prevent noise filter from swallowing content in malformed HTML 2026-04-04 01:38:42 +02:00
webclaw-fetch style: cargo fmt 2026-04-01 18:25:40 +02:00
webclaw-llm Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
webclaw-mcp fix: MCP research saves to file, returns compact response 2026-04-03 16:05:45 +02:00
webclaw-pdf Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00