mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-25 03:08:06 +02:00
Improve --format llm output quality on news index pages
This PR fixes three independent issues that surface when running
`webclaw --format llm` against modern news index pages. They were
all reproducible against bbc.com/news/world and reuters.com/world/middle-east
during a real briefing-generation run.
### 1. Framework hydration blobs no longer dump into the output
`to_llm_text` was unconditionally appending every parsed structured-data
item as a `## Structured Data` JSON fence. On Next.js sites, that meant
the entire `__NEXT_DATA__` `pageProps` object (ad-targeting flags,
build IDs, schedule paths, feature toggles) was serialized straight
into the LLM context. On bbc.com/news/world it was about 140 KB of
pure framework noise drowning out the actual page content.
The fix layers three filters:
- Items with a Schema.org `@type` of `WebSite`, `WebPage`, or
`SiteNavigationElement` are dropped as chrome.
- Items without an `@type` (typical of `pageProps` or SvelteKit
data) are kept only if their serialized size stays under 4 KB —
small parsed records with real content survive, hydration blobs
do not.
- The whole section is suppressed if the total serialized size
exceeds 16 KB, regardless of type. Past that threshold it is
almost never useful to a downstream LLM.
JSON-LD records with content-bearing `@type` values (`Article`,
`NewsArticle`, `Product`, `Recipe`, `FAQPage`, `Event`, etc.) are
preserved.
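
The three layers above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `StructuredItem`, `filter_structured_data`, and the constant names are hypothetical, items are assumed to be pre-serialized, 1 KB is taken to mean 1024 bytes, and the 16 KB total is assumed to be measured after the per-item filters have run.

```rust
/// Hypothetical stand-in for one parsed structured-data item.
#[derive(Debug, Clone, PartialEq)]
struct StructuredItem {
    /// Schema.org `@type`, if the item declares one.
    schema_type: Option<String>,
    /// The item, already serialized to JSON.
    json: String,
}

// Types treated as page chrome rather than content.
const CHROME_TYPES: [&str; 3] = ["WebSite", "WebPage", "SiteNavigationElement"];
// Assumed byte thresholds (1 KB = 1024 bytes).
const MAX_UNTYPED_BYTES: usize = 4 * 1024;
const MAX_SECTION_BYTES: usize = 16 * 1024;

fn filter_structured_data(items: Vec<StructuredItem>) -> Vec<StructuredItem> {
    let kept: Vec<StructuredItem> = items
        .into_iter()
        .filter(|item| match item.schema_type.as_deref() {
            // Layer 1: typed items survive unless the type is chrome.
            Some(t) => !CHROME_TYPES.contains(&t),
            // Layer 2: untyped items survive only while they stay small.
            None => item.json.len() < MAX_UNTYPED_BYTES,
        })
        .collect();

    // Layer 3: suppress the whole section if it is still too large.
    let total: usize = kept.iter().map(|item| item.json.len()).sum();
    if total > MAX_SECTION_BYTES {
        Vec::new()
    } else {
        kept
    }
}
```

An empty result would correspond to omitting the `## Structured Data` section entirely.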
### 2. Element → Text transitions no longer smash words together
`children_to_md` and `inline_text` only ran the `needs_separator`
check on `Element → Element` transitions. When an element rendered
text with no trailing whitespace and was followed by a sibling
text node that started with a non-whitespace character, the two
got concatenated with no separator. The same check now applies to
the `Text` branch in both functions.
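
To make the failure mode concrete, here is a self-contained sketch of the idea. The `Node` enum and `render_children` are simplified, hypothetical stand-ins for the real DOM types and for `children_to_md`, and the body of `needs_separator` is an assumption (the actual heuristic is not shown in this PR and may differ); only its call shape matches the diff.

```rust
/// Hypothetical miniature of a DOM child: either an element that has
/// already been rendered to text, or a raw text node.
enum Node {
    Element(String),
    Text(String),
}

/// Assumed behaviour: a separator is needed when the output ends in a
/// non-whitespace character and the next chunk starts with a letter or digit.
fn needs_separator(out: &str, next: &str) -> bool {
    let prev_is_solid = out.chars().last().map_or(false, |c| !c.is_whitespace());
    let next_is_word = next.chars().next().map_or(false, |c| c.is_alphanumeric());
    prev_is_solid && next_is_word
}

fn render_children(children: &[Node]) -> String {
    let mut out = String::new();
    for child in children {
        // Both kinds of child go through the same separator check,
        // which is the point of the fix described above.
        let text = match child {
            Node::Element(rendered) => rendered,
            Node::Text(text) => text,
        };
        if !text.is_empty() && !out.is_empty() && needs_separator(&out, text) {
            out.push(' ');
        }
        out.push_str(text);
    }
    out
}
```

With this sketch, an element `3h`, a text node `ago`, and an element `Tanker crosses Strait` render as `3h ago Tanker crosses Strait`, while a text node beginning with punctuation such as `, world` is appended without an extra space.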
### 3. Accessibility link chrome no longer leaks into prose
Sites like Reuters wrap external/new-window links with
screen-reader-only spans (e.g. `, opens new tab`, `external link`).
These have no consistent class hook, so the structural noise filter
cannot reliably catch them and they bleed into the rendered text —
sometimes dozens of times per page.
A targeted regex scrub now runs in two places: in the body cleanup
pipeline (`strip_a11y_link_chrome`, called early after `strip_leaked_js`)
and in the link-label cleaner (`clean_link_label`) so the deduplicated
`## Links` section is also clean.
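
The scrub can be illustrated with a std-only sketch. Note the substitutions: the PR uses a regex, whereas this version uses plain substring replacement so that it compiles without external crates; `strip_a11y_phrases` and the phrase list are hypothetical and deliberately narrower than whatever pattern `strip_a11y_link_chrome` actually uses. It also skips fenced code blocks, in line with the code-fence preservation test listed below.

```rust
// Illustrative phrase list only; the real pattern is not shown in this PR.
const A11Y_PHRASES: [&str; 4] = [
    ", opens in a new window",
    ", opens in a new tab",
    ", opens new window",
    ", opens new tab",
];

fn strip_a11y_phrases(input: &str) -> String {
    // Build the fence marker at runtime to keep this snippet fence-safe.
    let fence = "`".repeat(3);
    let mut in_fence = false;
    let mut lines: Vec<String> = Vec::new();

    for line in input.lines() {
        // Fence delimiters toggle the state and are passed through untouched.
        if line.trim_start().starts_with(fence.as_str()) {
            in_fence = !in_fence;
            lines.push(line.to_string());
            continue;
        }
        // Lines inside a code fence are preserved verbatim.
        if in_fence {
            lines.push(line.to_string());
            continue;
        }
        let mut cleaned = line.to_string();
        for phrase in A11Y_PHRASES.iter() {
            cleaned = cleaned.replace(*phrase, "");
        }
        lines.push(cleaned);
    }
    // Note: a trailing newline in the input is not restored.
    lines.join("\n")
}
```

Running the same helper over link labels as well as body text mirrors the two call sites described above, so both the prose and the `## Links` section would be cleaned by one rule set.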
### Tests
All 286 existing unit tests pass. 8 new tests cover:
- structured-data filter: chrome-type drop, oversized untyped drop,
small untyped keep, `NewsArticle` keep
- markdown separator: `Element → Text → Element` no longer smashes
- a11y stripper: common phrasings, variant phrasings ("opens in a
new window", "external link"), and code-fence preservation
parent 7f75143954, commit df8bdc96db
5 changed files with 234 additions and 4 deletions
@@ -320,6 +320,9 @@ fn children_to_md(
                 }
             }
             Node::Text(text) => {
+                if !text.is_empty() && !out.is_empty() && needs_separator(&out, text) {
+                    out.push(' ');
+                }
                 out.push_str(text);
             }
             _ => {}
@@ -350,6 +353,9 @@ fn inline_text(
                 }
             }
             Node::Text(text) => {
+                if !text.is_empty() && !out.is_empty() && needs_separator(&out, text) {
+                    out.push(' ');
+                }
                 out.push_str(text);
             }
             _ => {}
@@ -1606,4 +1612,18 @@ mod tests {
             "collapse_whitespace stripped 6-space indent: {output}"
         );
     }
+
+    #[test]
+    fn text_after_inline_element_keeps_separator() {
+        // Reuters-style markup: <a><time>3h</time>ago</a><a>Tanker crosses...</a>
+        // The "ago" text node sits between two element children. Without a
+        // separator check on the Text branch, "ago" + "Tanker" would smash
+        // together as "agoTanker".
+        let html = r#"<div><span>3h</span>ago<span>Tanker crosses Strait</span></div>"#;
+        let (md, _, _) = convert_html(html, None);
+        assert!(
+            !md.contains("agoTanker"),
+            "Element->Text->Element smashed together: {md}"
+        );
+    }
 }