webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-10 22:45:13 +02:00

Nenad Oric df8bdc96db	Improve --format llm output quality on news index pages	2026-05-10 14:30:07 +02:00

This PR fixes three independent issues that surface when running `webclaw --format llm` against modern news index pages. They were all reproducible against bbc.com/news/world and reuters.com/world/middle-east during a real briefing-generation run.

### 1. Framework hydration blobs no longer dump into the output

`to_llm_text` was unconditionally appending every parsed structured-data item as a `## Structured Data` JSON fence. On Next.js sites, that means the entire `__NEXT_DATA__` `pageProps` object (ad-targeting flags, build IDs, schedule paths, feature toggles) gets serialized straight into the LLM context. On bbc.com/news/world it was about 140 KB of pure framework noise drowning the actual page content.

The fix layers three filters:

- Items with a Schema.org `@type` of `WebSite`, `WebPage`, or `SiteNavigationElement` are dropped as chrome.
- Items without an `@type` (typical of `pageProps` or SvelteKit data) are kept only if their serialized size stays under 4 KB: small parsed records with real content survive, hydration blobs do not.
- The whole section is suppressed if the total serialized size exceeds 16 KB, regardless of type. Past that threshold it is almost never useful to a downstream LLM.

JSON-LD records with content-bearing `@type` values (`Article`, `NewsArticle`, `Product`, `Recipe`, `FAQPage`, `Event`, etc.) are preserved.

### 2. Element → Text node smashing

`children_to_md` and `inline_text` only ran the `needs_separator` check on `Element → Element` transitions. When an element rendered text with no trailing whitespace and was followed by a sibling text node that started with a non-whitespace character, the two got concatenated with no separator. The same check now applies to the `Text` branch in both functions.

### 3. Accessibility link chrome no longer leaks into prose

Sites like Reuters wrap external/new-window links with screen-reader-only spans (e.g. `, opens new tab`, `external link`). These have no consistent class hook, so the structural noise filter cannot reliably catch them and they bleed into the rendered text, sometimes dozens of times per page. A targeted regex scrub now runs in two places: in the body cleanup pipeline (`strip_a11y_link_chrome`, called early after `strip_leaked_js`) and in the link-label cleaner (`clean_link_label`) so the deduplicated `## Links` section is also clean.

### Tests

All 286 existing unit tests pass. 8 new tests cover:

- structured-data filter: chrome-type drop, oversized untyped drop, small untyped keep, `NewsArticle` keep
- markdown separator: `Element → Text → Element` no longer smashes
- a11y stripper: common phrasings, variant phrasings ("opens in a new window", "external link"), and code-fence preservation
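
The three structured-data filters described in the commit message can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in `webclaw-core`: the `Item` struct, the `filter_structured_data` function, and the constant names are hypothetical, "4 KB" and "16 KB" are assumed to mean 4096 and 16384 bytes, and the total-size check is assumed to run over the items that survive the first two filters.

```rust
/// Hypothetical stand-in for one parsed structured-data item.
#[derive(Debug, Clone)]
struct Item {
    /// Schema.org `@type`, if the item carries one.
    schema_type: Option<String>,
    /// The item already serialized to JSON text.
    json: String,
}

/// `@type` values treated as page chrome (taken from the commit message).
const CHROME_TYPES: [&str; 3] = ["WebSite", "WebPage", "SiteNavigationElement"];
/// Assumed byte value of the "4 KB" cap on untyped items.
const MAX_UNTYPED_BYTES: usize = 4 * 1024;
/// Assumed byte value of the "16 KB" cap on the whole section.
const MAX_SECTION_BYTES: usize = 16 * 1024;

/// Returns the items that would be rendered under `## Structured Data`.
/// An empty result means the section is suppressed.
fn filter_structured_data(items: &[Item]) -> Vec<Item> {
    let kept: Vec<Item> = items
        .iter()
        .filter(|item| match item.schema_type.as_deref() {
            // Filter 1: typed items are dropped only when the type is chrome.
            Some(t) => !CHROME_TYPES.iter().any(|chrome| *chrome == t),
            // Filter 2: untyped items survive only while they stay small.
            None => item.json.len() < MAX_UNTYPED_BYTES,
        })
        .cloned()
        .collect();

    // Filter 3: suppress everything once the total grows past the cap.
    // Assumption: the total is measured over the surviving items.
    let total: usize = kept.iter().map(|item| item.json.len()).sum();
    if total > MAX_SECTION_BYTES {
        Vec::new()
    } else {
        kept
    }
}
```

Under these assumptions, a `WebSite` record and a 5000-byte untyped blob are dropped while a `NewsArticle` record and a small untyped record are kept, and a single 17 KB `NewsArticle` record suppresses the whole section.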
webclaw-cli	fix: support LLM provider compatibility options	2026-05-06 11:36:53 +02:00
webclaw-core	Improve --format llm output quality on news index pages	2026-05-10 14:30:07 +02:00
webclaw-fetch	fix: harden fetch URL validation	2026-05-04 11:50:57 +02:00
webclaw-llm	fix: support LLM provider compatibility options	2026-05-06 11:36:53 +02:00
webclaw-mcp	fix: harden fetch URL validation	2026-05-04 11:50:57 +02:00
webclaw-pdf	Initial release: webclaw v0.1.0 — web content extraction for LLMs	2026-03-23 18:31:11 +01:00
webclaw-server	fix: validate self-host route URLs consistently	2026-05-04 14:30:06 +02:00