This PR fixes three independent issues that surface when running
`webclaw --format llm` against modern news index pages. They were
all reproducible against bbc.com/news/world and reuters.com/world/middle-east
during a real briefing-generation run.
### 1. Framework hydration blobs no longer dump into the output
`to_llm_text` was unconditionally appending every parsed structured-data
item as a `## Structured Data` JSON fence. On Next.js sites, that means
the entire `__NEXT_DATA__` `pageProps` object — ad-targeting flags,
build IDs, schedule paths, feature toggles — gets serialized straight
into the LLM context. On bbc.com/news/world it was about 140 KB of
pure framework noise drowning the actual page content.
The fix layers three filters:
- Items with a Schema.org `@type` of `WebSite`, `WebPage`, or
`SiteNavigationElement` are dropped as chrome.
- Items without an `@type` (typical of `pageProps` or SvelteKit
data) are kept only if their serialized size stays under 4 KB —
small parsed records with real content survive, hydration blobs
do not.
- The whole section is suppressed if the total serialized size
exceeds 16 KB, regardless of type. Past that threshold it is
almost never useful to a downstream LLM.
JSON-LD records with content-bearing `@type` values (`Article`,
`NewsArticle`, `Product`, `Recipe`, `FAQPage`, `Event`, etc.) are
preserved.
### 2. Element → Text node smashing
`children_to_md` and `inline_text` only ran the `needs_separator`
check on `Element → Element` transitions. When an element rendered
text with no trailing whitespace and was followed by a sibling
text node that started with a non-whitespace character, the two
got concatenated with no separator. The same check now applies to
the `Text` branch in both functions.
### 3. Accessibility link chrome no longer leaks into prose
Sites like Reuters wrap external/new-window links with
screen-reader-only spans (e.g. `, opens new tab`, `external link`).
These have no consistent class hook, so the structural noise filter
cannot reliably catch them and they bleed into the rendered text —
sometimes dozens of times per page.
A targeted regex scrub now runs in two places: in the body cleanup
pipeline (`strip_a11y_link_chrome`, called early after `strip_leaked_js`)
and in the link-label cleaner (`clean_link_label`) so the deduplicated
`## Links` section is also clean.
### Tests
All 286 existing unit tests pass. 8 new tests cover:
- structured-data filter: chrome-type drop, oversized untyped drop,
small untyped keep, `NewsArticle` keep
- markdown separator: `Element → Text → Element` no longer smashes
- a11y stripper: common phrasings, variant phrasings ("opens in a
new window", "external link"), and code-fence preservation
CI runs Rust 1.94 which flags these. Collapsed nested if-let in
cell_has_block_content() and replaced .map()+return with .inspect()
in table_to_md().
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Version bump for layout table, stack overflow, and noise filter fixes
contributed by @devnen. Also fixes cargo fmt issues that caused CI lint
failure on the merge commit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pages like Express.co.uk live blogs nest 200+ DOM levels deep, overflowing
the default 1 MB main-thread stack on Windows during recursive markdown
conversion.
Two-layer fix:
1. markdown.rs: add depth parameter to node_to_md/children_to_md/inline_text
with MAX_DOM_DEPTH=200 guard — falls back to plain text collection at limit
2. lib.rs: wrap extract_with_options in a worker thread with 8 MB stack so
html5ever parsing and extraction both have room on deeply nested pages
Tested with Express.co.uk live blog (previously crashed, now extracts 2000+
lines of clean markdown) and drudgereport.com (still works correctly).
Sites like Drudge Report use <table> for page layout, not data. Each cell
contains extensive block-level content (divs, hrs, paragraphs, links).
Previously, table_to_md() called inline_text() on every cell, collapsing
all whitespace and flattening block elements into a single unreadable line.
Changes:
- Add cell_has_block_content() heuristic: scans for block-level descendants
(p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables
- Layout tables render each cell as a standalone section separated by blank
lines, using children_to_md() to preserve block structure
- Data tables (no block elements in cells) keep existing markdown table format
- Bold/italic tags containing block elements are treated as containers
instead of wrapping in **/**/* (fixes Drudge's <b><font>...</font></b>
column wrappers that contain the entire column content)
- Add tests for layout tables with paragraphs and with links
Critical:
- MCP server identifies as "webclaw-mcp" instead of "rmcp"
- Research tool poll loop capped at 200 iterations (~10 min)
CLI:
- Non-zero exit codes on errors
- Text format strips markdown table syntax
MCP server:
- URL validation on all tools
- 60s cloud API timeout, 30s local fetch timeout
- Diff cloud fallback computes actual diff
- Batch capped at 100 URLs, crawl at 500 pages
- Graceful startup failure instead of panic
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>