webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-15 23:35:14 +02:00

Author	SHA1	Message	Date
Nenad Oric	df8bdc96db	Improve --format llm output quality on news index pages This PR fixes three independent issues that surface when running `webclaw --format llm` against modern news index pages. They were all reproducible against bbc.com/news/world and reuters.com/world/middle-east during a real briefing-generation run. ### 1. Framework hydration blobs no longer dump into the output `to_llm_text` was unconditionally appending every parsed structured-data item as a `## Structured Data` JSON fence. On Next.js sites, that means the entire `__NEXT_DATA__` `pageProps` object — ad-targeting flags, build IDs, schedule paths, feature toggles — gets serialized straight into the LLM context. On bbc.com/news/world it was about 140 KB of pure framework noise drowning the actual page content. The fix layers three filters: - Items with a Schema.org `@type` of `WebSite`, `WebPage`, or `SiteNavigationElement` are dropped as chrome. - Items without an `@type` (typical of `pageProps` or SvelteKit data) are kept only if their serialized size stays under 4 KB — small parsed records with real content survive, hydration blobs do not. - The whole section is suppressed if the total serialized size exceeds 16 KB, regardless of type. Past that threshold it is almost never useful to a downstream LLM. JSON-LD records with content-bearing `@type` values (`Article`, `NewsArticle`, `Product`, `Recipe`, `FAQPage`, `Event`, etc.) are preserved. ### 2. Element → Text node smashing `children_to_md` and `inline_text` only ran the `needs_separator` check on `Element → Element` transitions. When an element rendered text with no trailing whitespace and was followed by a sibling text node that started with a non-whitespace character, the two got concatenated with no separator. The same check now applies to the `Text` branch in both functions. ### 3. Accessibility link chrome no longer leaks into prose Sites like Reuters wrap external/new-window links with screen-reader-only spans (e.g. `, opens new tab`, `external link`). These have no consistent class hook, so the structural noise filter cannot reliably catch them and they bleed into the rendered text — sometimes dozens of times per page. A targeted regex scrub now runs in two places: in the body cleanup pipeline (`strip_a11y_link_chrome`, called early after `strip_leaked_js`) and in the link-label cleaner (`clean_link_label`) so the deduplicated `## Links` section is also clean. ### Tests All 286 existing unit tests pass. 8 new tests cover: - structured-data filter: chrome-type drop, oversized untyped drop, small untyped keep, `NewsArticle` keep - markdown separator: `Element → Text → Element` no longer smashes - a11y stripper: common phrasings, variant phrasings ("opens in a new window", "external link"), and code-fence preservation	2026-05-10 14:30:07 +02:00
Valerio	a5c3433372	fix(core+server): guard markdown pipe slice + detect trustpilot/reddit verify walls	2026-04-23 15:26:31 +02:00
Valerio	5ea646a332	fix: resolve clippy warnings from #14 (collapsible_if, manual_inspect) CI runs Rust 1.94 which flags these. Collapsed nested if-let in cell_has_block_content() and replaced .map()+return with .inspect() in table_to_md(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 15:28:59 +02:00
Valerio	3cf9dbaf2a	chore: bump to 0.3.9, fix formatting from #14 Version bump for layout table, stack overflow, and noise filter fixes contributed by @devnen. Also fixes cargo fmt issues that caused CI lint failure on the merge commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 15:24:17 +02:00
devnen	74bac87435	fix: prevent stack overflow on deeply nested HTML pages Pages like Express.co.uk live blogs nest 200+ DOM levels deep, overflowing the default 1 MB main-thread stack on Windows during recursive markdown conversion. Two-layer fix: 1. markdown.rs: add depth parameter to node_to_md/children_to_md/inline_text with MAX_DOM_DEPTH=200 guard — falls back to plain text collection at limit 2. lib.rs: wrap extract_with_options in a worker thread with 8 MB stack so html5ever parsing and extraction both have room on deeply nested pages Tested with Express.co.uk live blog (previously crashed, now extracts 2000+ lines of clean markdown) and drudgereport.com (still works correctly).	2026-04-03 23:45:19 +02:00
devnen	95a6681b02	fix: detect layout tables and render as sections instead of markdown tables Sites like Drudge Report use <table> for page layout, not data. Each cell contains extensive block-level content (divs, hrs, paragraphs, links). Previously, table_to_md() called inline_text() on every cell, collapsing all whitespace and flattening block elements into a single unreadable line. Changes: - Add cell_has_block_content() heuristic: scans for block-level descendants (p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables - Layout tables render each cell as a standalone section separated by blank lines, using children_to_md() to preserve block structure - Data tables (no block elements in cells) keep existing markdown table format - Bold/italic tags containing block elements are treated as containers instead of wrapping in `**`/`*` (fixes Drudge's <b><font>...</font></b> column wrappers that contain the entire column content) - Add tests for layout tables with paragraphs and with links	2026-04-03 22:24:35 +02:00
Valerio	ea9c783bc5	fix: v0.1.1 — MCP identity, timeouts, exit codes, URL validation Critical: - MCP server identifies as "webclaw-mcp" instead of "rmcp" - Research tool poll loop capped at 200 iterations (~10 min) CLI: - Non-zero exit codes on errors - Text format strips markdown table syntax MCP server: - URL validation on all tools - 60s cloud API timeout, 30s local fetch timeout - Diff cloud fallback computes actual diff - Batch capped at 100 URLs, crawl at 500 pages - Graceful startup failure instead of panic Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 17:25:05 +01:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io	2026-03-23 18:31:11 +01:00
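
The three structured-data filters described in df8bdc96db could look roughly like the sketch below. This is not the code from the commit: `StructuredItem`, `filter_structured_data`, and the constant names are hypothetical, and the sketch assumes the 16 KB total is measured over the items that survive the first two filters, which the commit message does not specify.

```rust
/// Hypothetical stand-in for one parsed structured-data item.
#[derive(Debug, Clone, PartialEq)]
pub struct StructuredItem {
    /// Schema.org `@type`, if the item declares one.
    pub schema_type: Option<String>,
    /// The item serialized back to JSON text.
    pub json: String,
}

/// Types the commit message lists as page chrome.
const CHROME_TYPES: [&str; 3] = ["WebSite", "WebPage", "SiteNavigationElement"];
/// Untyped items are kept only while under 4 KB.
const MAX_UNTYPED_BYTES: usize = 4 * 1024;
/// The whole section is dropped once it exceeds 16 KB.
const MAX_SECTION_BYTES: usize = 16 * 1024;

/// Returns the items that would still be rendered under `## Structured Data`.
pub fn filter_structured_data(items: Vec<StructuredItem>) -> Vec<StructuredItem> {
    let kept: Vec<StructuredItem> = items
        .into_iter()
        .filter(|item| match item.schema_type.as_deref() {
            // Filter 1: typed items survive unless they are chrome.
            Some(t) => !CHROME_TYPES.contains(&t),
            // Filter 2: untyped items survive only when small.
            None => item.json.len() < MAX_UNTYPED_BYTES,
        })
        .collect();

    // Filter 3 (assumption: measured over the surviving items).
    let total: usize = kept.iter().map(|item| item.json.len()).sum();
    if total > MAX_SECTION_BYTES {
        Vec::new()
    } else {
        kept
    }
}
```

With this shape, a small `NewsArticle` record passes through untouched, while a 140 KB untyped hydration blob is rejected by the second filter before the total is ever computed.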

8 commits
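
The two-layer stack overflow fix described in 74bac87435 (a depth guard plus a worker thread with an 8 MB stack) can be illustrated with the minimal sketch below. Only `MAX_DOM_DEPTH`, the 8 MB figure, and the name `node_to_md` come from the commit message. The `Node` type, `collect_text`, `extract_on_big_stack`, and the bold-only rendering are hypothetical simplifications, not the project's actual types or signatures.

```rust
use std::thread;

const MAX_DOM_DEPTH: usize = 200;
const WORKER_STACK_BYTES: usize = 8 * 1024 * 1024;

/// Hypothetical minimal DOM node, standing in for the real parser's tree.
pub enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Plain-text fallback. Iterative, so it cannot itself overflow the stack.
fn collect_text(root: &Node, out: &mut String) {
    let mut stack = vec![root];
    while let Some(node) = stack.pop() {
        match node {
            Node::Text(t) => out.push_str(t),
            // Push children in reverse so they pop in document order.
            Node::Element { children, .. } => stack.extend(children.iter().rev()),
        }
    }
}

/// Recursive conversion with a depth guard (layer 1).
fn node_to_md(node: &Node, depth: usize, out: &mut String) {
    if depth >= MAX_DOM_DEPTH {
        collect_text(node, out);
        return;
    }
    match node {
        Node::Text(t) => out.push_str(t),
        Node::Element { tag, children } => {
            // Only <b> is rendered in this sketch; other tags are transparent.
            let marker = if tag == "b" { "**" } else { "" };
            out.push_str(marker);
            for child in children {
                node_to_md(child, depth + 1, out);
            }
            out.push_str(marker);
        }
    }
}

/// Runs the conversion on a worker thread with an 8 MB stack (layer 2).
pub fn extract_on_big_stack(root: Node) -> String {
    thread::Builder::new()
        .stack_size(WORKER_STACK_BYTES)
        .spawn(move || {
            let mut out = String::new();
            node_to_md(&root, 0, &mut out);
            out
        })
        .expect("failed to spawn worker thread")
        .join()
        .expect("worker thread panicked")
}
```

The depth guard bounds recursion regardless of input, and the larger stack gives the remaining 200 levels of recursion room on platforms whose default main-thread stack is small.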