The current `Word count: N` header is a single number that conflates
the article body with surrounding chrome (nav, ads, footer). Callers
couldn't tell from the header alone whether to drill or move on.
New: Word count: N (article: M, chrome: K) in -f llm/text output.
For -f json: adds word_count_article and word_count_chrome
fields alongside the existing word_count.
M (article body) is sourced from JSON-LD articleBody when M4's
parser found one (NewsArticle or Review.reviewBody); otherwise
computed by llm::body_word_count (the M2-style heuristic — words
outside markdown link patterns, the same body::process_body output
hub_detect uses).
--mode summary / toc / sections fall back to the simple Word count: N
form (the modes don't extract body content; the breakdown would be
meaningless). Suppression piggybacks on the existing include_status
toggle in build_metadata_header_with_opts.
9 new tests in webclaw-core (4 in lib.rs::tests for the population
logic; 5 in llm/metadata.rs::m12_tests for the header formatter).
Workspace 701 -> 710.
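A minimal sketch of the two header forms described above. The struct, its field names, and the `include_breakdown` parameter are illustrative assumptions, not the actual `build_metadata_header_with_opts` signature:

```rust
/// Word counts carried into the header. Field names are hypothetical.
#[derive(Debug, Clone, Copy)]
pub struct WordCounts {
    pub total: usize,
    pub article: usize,
    pub chrome: usize,
}

/// Formats the header line. `include_breakdown` stands in for the
/// mode-dependent suppression (summary / toc / sections use the short form).
pub fn format_word_count(counts: WordCounts, include_breakdown: bool) -> String {
    if include_breakdown {
        format!(
            "Word count: {} (article: {}, chrome: {})",
            counts.total, counts.article, counts.chrome
        )
    } else {
        format!("Word count: {}", counts.total)
    }
}
```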
Section-URL ambiguity is recurring friction — callers have to guess
whether to hit infobae.com root (LATAM frontpage) or /economia/ (AR-
specific live FX dashboard), or decrypt.co root (ticker ribbon) vs
/news/ (article list), or bbc.com/news/world vs /news/world/europe/.
Each guess costs a round-trip.
New `--mode sections` returns the discoverable section URLs parsed
from the page's nav, in one round-trip. Subsumes issue #16 (non-
English nav harder to LLM-parse — sections come back as data, not
prose).
Multi-signal heuristic on the existing link extraction:
URL-pattern match (/<category>/ style short paths), repetition
(section links appear in header + footer), DOM-position when
available. Fallback when zero sections detected: emit top-N links
with a "(none detected; first N shown)" note.
Format: -f llm/text emits `Sections:` followed by `- [Label](url)`
list. -f json emits `{"sections": [{"label": "...", "url": "..."}]}`.
13 new tests in webclaw-core (688 -> 701).
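One way the URL-pattern and repetition signals could be combined, with the top-N fallback. This is a sketch under assumptions: the real heuristic's weights, the DOM-position signal, and all names here (`detect_sections`, `looks_like_section_path`, `Sections`) are hypothetical:

```rust
use std::collections::HashMap;

/// Result of section detection; `fallback` is true when nothing matched
/// and the first N links are returned instead.
pub struct Sections {
    pub links: Vec<(String, String)>,
    pub fallback: bool,
}

/// Assumed URL-pattern signal: at most three short, letters-only segments,
/// e.g. "/economia/" or "/news/world/europe/".
fn looks_like_section_path(path: &str) -> bool {
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    !segments.is_empty()
        && segments.len() <= 3
        && segments
            .iter()
            .all(|s| s.chars().all(|c| c.is_ascii_alphabetic() || c == '-'))
}

/// `links` is (label, path) in document order. A link counts as a section
/// here only if it matches the pattern AND appears at least twice
/// (for example in both header and footer).
pub fn detect_sections(links: &[(&str, &str)], top_n: usize) -> Sections {
    let mut occurrences: HashMap<&str, usize> = HashMap::new();
    for &(_, path) in links {
        *occurrences.entry(path).or_insert(0) += 1;
    }

    let mut found: Vec<(String, String)> = Vec::new();
    for &(label, path) in links {
        let repeated = occurrences.get(path).copied().unwrap_or(0) >= 2;
        let already_listed = found.iter().any(|(_, p)| p == path);
        if looks_like_section_path(path) && repeated && !already_listed {
            found.push((label.to_string(), path.to_string()));
        }
    }

    if found.is_empty() {
        // Fallback: nothing detected, so return the first N links.
        let links = links
            .iter()
            .take(top_n)
            .map(|&(l, p)| (l.to_string(), p.to_string()))
            .collect();
        return Sections { links, fallback: true };
    }
    Sections { links: found, fallback: false }
}
```

A caller would print the "(none detected; first N shown)" note whenever `fallback` is true.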
On sites like Hollywood Reporter where the extracted body is < 500 words
because the page is JS-walled (chrome rendering is needed), webclaw now
emits a one-line stderr hint:
# hint: extracted body is N words (thin); the page may be JS-walled.
Try --browser chrome for JS-rendered content.
Thin-body classification (crates/webclaw-core/src/llm/thin_body.rs)
mirrors the M2 hub-detector structure. Threshold: 500 words. Exemption
list for utility domains (example.com, httpbin.org, etc) where thinness
is by design.
The originally proposed --retry-thin flag was dropped after phase A
determined webclaw has no headless-JS backend to retry to (--browser
only affects User-Agent impersonation, not actual rendering). The
hint-only design lets the caller decide: re-run with --browser chrome
manually, or switch to a different fetcher entirely.
Hint suppressed in --mode summary / --mode toc (link/outline focused);
M3 fast-fails skip the formatter entirely so no hint.
Stdout invariance: tested byte-identical on all p01-p15 default probes.
M10 only modifies stderr.
10 new tests (workspace 678 -> 688).
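The classification above can be sketched roughly as follows. The function name, the subdomain matching, and the two-entry exemption list are assumptions; only the 500-word threshold and the hint wording come from the description:

```rust
/// Bodies under this many words are treated as thin.
const THIN_BODY_THRESHOLD: usize = 500;

/// Illustrative subset of the exemption list.
const EXEMPT_DOMAINS: &[&str] = &["example.com", "httpbin.org"];

/// Returns the stderr hint for a thin body, or None when the body is long
/// enough or the host is exempt. Matching subdomains of exempt hosts is an
/// assumption of this sketch.
pub fn thin_body_hint(host: &str, body_words: usize) -> Option<String> {
    let exempt = EXEMPT_DOMAINS
        .iter()
        .any(|d| host == *d || host.ends_with(&format!(".{d}")));
    if exempt || body_words >= THIN_BODY_THRESHOLD {
        return None;
    }
    Some(format!(
        "# hint: extracted body is {body_words} words (thin); the page may be JS-walled. \
         Try --browser chrome for JS-rendered content."
    ))
}
```

Because the function only returns a string for stderr, stdout stays untouched, which matches the invariance claim.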
Webclaw previously emitted URL, Title, Description, and Word count in
the -f llm header but no HTTP status. On a 404 response, the caller
had no signal apart from inspecting the body (e.g. dailysabah.com/
business/economy returns a 404 page; webclaw was extracting '13 words'
of the error page without flagging the 404 status).
New behavior: every -f llm/text/json output includes a 'Status: <code>'
header line (after URL: per phase A's placement). Emitted on all
responses including 200 for consistency — callers can't otherwise
distinguish 'webclaw saw 200' from 'webclaw missed status info'.
For -f json: top-level "status": <code> field added.
Modes --mode summary and --mode toc are exempt: the status line would
clutter the link-list and outline outputs.
M3 fast-fails (known-bad-sites) also skip the status line because
they exit before the formatter is reached.
7 new tests in webclaw-core (workspace total 671 -> 678).
JSON-LD is consistently the cleanest source on major outlets (Reuters,
BBC, Le Monde, N1, Pitchfork). Webclaw already emitted a raw Structured
Data block at the bottom of -f llm output; this iter teaches it to
parse the JSON-LD by schema and surface it usefully.
New schema-aware parser at crates/webclaw-core/src/jsonld.rs classifies
items by @type into: ItemList, LiveBlogPosting, NewsArticle, Review,
WebPageOrChrome, Unknown. CollectionPage with mainEntity ItemList is
auto-lifted (Reuters CollectionPage shape).
Two new CLI flags:
--prefer-structured: surfaces the schema-aware block at the TOP of the
output, before prose. For -f llm emits a Markdown summary block; for
-f json emits a {structured, extracted} envelope. Bypasses the default
DROP list for WebPage/chrome types when explicitly requested.
--articles-from-jsonld: when the page contains ItemList or
LiveBlogPosting, output ONLY a JSON array of articles
({position, title, url, published}). When no such schema is present,
emit a stderr hint and fall through to default extraction (no error).
Default behavior (neither flag set) byte-identical to iter-3 on all
default-flag probes (regression sentinel passed): Cyrillic p14 still
7735 B, M1 caps p18/p19/p20 deterministic, M2 hub p40/p41 byte-identical,
M3 registry p44/p45/p46 still fast-fail with exit 67.
14 new tests in webclaw-core covering schema-variant parsing, parse
error handling, fall-through behavior, flag combinations, and the
default-byte-identical sentinel. Workspace tests 657 -> 671.
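A rough sketch of classification by @type, including the CollectionPage lift. It works on already-extracted type strings rather than parsed JSON, and the set of types mapped to `WebPageOrChrome` is an assumption; the actual `jsonld.rs` may differ:

```rust
/// The six buckets named in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SchemaKind {
    ItemList,
    LiveBlogPosting,
    NewsArticle,
    Review,
    WebPageOrChrome,
    Unknown,
}

/// Classifies one JSON-LD item. `main_entity_type` is the @type of the
/// item's mainEntity, if it has one.
pub fn classify(type_name: &str, main_entity_type: Option<&str>) -> SchemaKind {
    match (type_name, main_entity_type) {
        // Reuters shape: CollectionPage wrapping an ItemList is lifted.
        ("CollectionPage", Some("ItemList")) => SchemaKind::ItemList,
        ("ItemList", _) => SchemaKind::ItemList,
        ("LiveBlogPosting", _) => SchemaKind::LiveBlogPosting,
        ("NewsArticle", _) => SchemaKind::NewsArticle,
        ("Review", _) => SchemaKind::Review,
        // Assumed chrome types; the real DROP list is not shown in the source.
        ("WebPage" | "WebSite" | "BreadcrumbList" | "CollectionPage", _) => {
            SchemaKind::WebPageOrChrome
        }
        _ => SchemaKind::Unknown,
    }
}
```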
Detects ESPN-style hub pages (espn.com/nba/, /nfl/, /mlb/, /nhl/, /soccer/)
where the rendered markup has nav-only content with no article bodies —
chrome retry doesn't help because the data genuinely isn't in the markup.
Heuristic: word_count < 500 AND link_count >= 5 against the extracted output.
--prefer-articles: when set, a hub-classified page returns the extracted
link list (reusing the M1 --mode summary machinery) instead of the sparse
body. On non-hub pages, behavior is unchanged.
stderr hint: always emitted on hub detection so the caller knows to drill
/story/_/id/<id>/ URLs from a citation list.
False-positive resistance verified: BBC News /world (link-heavy aggregator,
1500+ words body) and n1info.rs (widget-heavy but content-rich) both
classify as non-hub and emit full extraction.
9 new tests in webclaw-core (317 -> 326).
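The hub heuristic and the flag interaction, as a minimal sketch. The thresholds come from the description; the function names are hypothetical:

```rust
const HUB_MAX_WORDS: usize = 500;
const HUB_MIN_LINKS: usize = 5;

/// A page is hub-classified when the extracted output is short but link-rich.
pub fn is_hub(word_count: usize, link_count: usize) -> bool {
    word_count < HUB_MAX_WORDS && link_count >= HUB_MIN_LINKS
}

/// With --prefer-articles set, a hub page returns the link list instead of
/// the sparse body; non-hub pages are unaffected.
pub fn should_return_links(prefer_articles: bool, word_count: usize, link_count: usize) -> bool {
    prefer_articles && is_hub(word_count, link_count)
}
```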
Three additive CLI flags addressing the 50KB persisted-output cap that
trips Claude Code's per-tool-result harness on aggregator front pages
(apnews.com, cnbc.com/markets/, b92.net all >50KB by default):
--max-output-bytes N: truncates final output at N bytes with a clear
'[truncated: M more bytes ...]' footer. N=0 means unlimited (default).
UTF-8 codepoint-boundary safe; also wraps JSON output so truncated
output stays parseable.
--mode summary: returns only the extracted link list (titles + URLs),
no body text. For aggregator front pages where the LLM is going to
drill the individual articles next anyway.
--mode toc: returns H1/H2 outline + first paragraph after each H2.
For long single-article pages.
New flags are orthogonal to -f (json/llm/text). 9 new unit tests in
webclaw-core, total goes 308 -> 317 passing. Smoke-tested on
apnews.com (51713 -> 27404 summary -> 6269 toc -> 8193 capped),
pitchfork.com (42049 -> 379 summary), cnbc.com (56682 -> 16385 capped).
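A sketch of codepoint-boundary-safe truncation for --max-output-bytes. It assumes the footer is appended after the cut and not counted against N, and it does not cover the JSON wrapping; the footer wording follows the abbreviated form quoted above:

```rust
/// Truncates `output` to at most `max_bytes` bytes of content, backing up
/// to the nearest UTF-8 char boundary, then appends a footer.
/// `max_bytes == 0` means unlimited.
pub fn truncate_output(output: &str, max_bytes: usize) -> String {
    if max_bytes == 0 || output.len() <= max_bytes {
        return output.to_string();
    }
    let mut cut = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !output.is_char_boundary(cut) {
        cut -= 1;
    }
    let remaining = output.len() - cut;
    format!("{}\n[truncated: {} more bytes ...]", &output[..cut], remaining)
}
```

Slicing at an arbitrary byte offset would panic inside a multi-byte character, which is why the cut walks backwards first.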
Security audit follow-up across the workspace:
- webclaw-core: keep the crate WASM-safe. quickjs/rquickjs is now a
cfg(not(wasm32)) target dependency and the extraction entry point uses
a direct call on wasm instead of spawning a thread, so it builds and
runs on wasm32 with or without default features.
- webclaw-core: bound the structured-data scrubber recursion (depth cap)
so deeply nested attacker JSON-LD / __NEXT_DATA__ cannot exhaust the
stack.
- webclaw-fetch: stream the response body with a running ceiling so a
small highly compressed payload cannot inflate to gigabytes in memory;
redact user:pass@ from proxy URLs before they reach error strings.
- webclaw-cli: contain output filenames inside the chosen directory
(reject .. / absolute, drop traversal path segments), run --webhook
URLs through the public-URL SSRF guard, clamp --watch-interval to >=1s,
and make research slug truncation char-safe.
- webclaw-mcp: char-safe slug truncation (no multibyte slice panic).
- setup.sh / deploy/hetzner.sh: replace eval on read input with
printf -v, and mask auth key / API token in console output.
- CI: enforce the wasm32 build invariant for webclaw-core.
Tests added for every behavioral change. Bump to 0.6.3 + CHANGELOG.
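As an illustration of the output-filename containment item, a sketch using only `std::path`. The function name is hypothetical, and this version rejects any name containing `..` or an absolute component rather than reproducing the exact webclaw-cli policy:

```rust
use std::path::{Component, Path, PathBuf};

/// Joins `name` under `dir`, refusing anything that could escape it.
/// Returns None for `..`, absolute paths, Windows prefixes, or empty names.
pub fn contained_path(dir: &Path, name: &str) -> Option<PathBuf> {
    let mut out = dir.to_path_buf();
    let mut pushed = false;
    for comp in Path::new(name).components() {
        match comp {
            Component::Normal(seg) => {
                out.push(seg);
                pushed = true;
            }
            // "." segments carry no information and are dropped.
            Component::CurDir => {}
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    pushed.then_some(out)
}
```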
Port the valid PR #43 LLM cleanup fixes onto current main without stale branch regressions.
Includes comment-count link cleanup, bare numeric paragraph cleanup, pagination leftover cleanup, JSON-LD article body scrubbing, clearer CLI consent-wall warnings, and quieter parser logs by default.
Thanks to @devnen for the report and patch work.
Improve LLM-format output for modern news and documentation pages.
- Filter noisy hydration and low-value page chrome structured data while preserving content-bearing Schema.org records
- Fix element/text spacing without detaching punctuation on docs, forums, and reference pages
- Remove common accessibility link chrome from LLM text and link labels
- Bump workspace version to 0.6.0 and update the changelog
Thanks to Nenad Oric (@devnen) for the original PR and contribution.
`search_from = abs_pos + 1` landed mid-char when a rejected match
started on a multi-byte UTF-8 character, panicking on the next
`markdown[search_from..]` slice. Advance by `needle.len()` instead —
always a valid char boundary, and skips the whole rejected match
instead of re-scanning inside it.
Repro: webclaw https://bruler.ru/about_brand -f json
Before: panic "byte index 782 is not a char boundary; it is inside 'Ч'"
After: extracts 2.3KB of clean Cyrillic markdown with 7 sections
Two regression tests cover multi-byte rejected matches and
all-rejected cycles in Cyrillic text.
Closes #16
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
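The fix can be illustrated with a small search loop. The function and its `accept` callback are hypothetical stand-ins for the real matcher; the point is the `needle.len()` advance:

```rust
/// Finds the first occurrence of `needle` in `haystack` whose byte offset
/// is accepted by `accept`, skipping rejected matches.
pub fn find_accepted(
    haystack: &str,
    needle: &str,
    accept: impl Fn(usize) -> bool,
) -> Option<usize> {
    if needle.is_empty() {
        return None;
    }
    let mut search_from = 0;
    while let Some(rel) = haystack[search_from..].find(needle) {
        let abs_pos = search_from + rel;
        if accept(abs_pos) {
            return Some(abs_pos);
        }
        // `abs_pos + 1` can land inside a multi-byte char and make the next
        // slice panic. The end of a match is always a char boundary.
        search_from = abs_pos + needle.len();
    }
    None
}
```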
* fix(cli): close --on-change command injection via sh -c (P0)
The --on-change flag on `webclaw watch` (single-URL, line 1588) and
`webclaw watch` multi-URL mode (line 1738) previously handed the entire
user-supplied string to `tokio::process::Command::new("sh").arg("-c").arg(cmd)`.
Any path that can influence that string — a malicious config file, an MCP
client driven by an LLM with prompt-injection exposure, an untrusted
environment variable substitution — gets arbitrary shell execution.
The command is now tokenized with `shlex::split` (POSIX-ish quoting rules)
and executed directly via `Command::new(prog).args(args)`. Metacharacters
like `;`, `&&`, `|`, `$()`, `<(...)`, env expansion, and globbing no longer
fire.
An explicit opt-in escape hatch is available for users who genuinely need
a shell pipeline: `WEBCLAW_ALLOW_SHELL=1` preserves the old `sh -c` path
and logs a warning on every invocation so it can't slip in silently.
Both call sites now route through a shared `spawn_on_change()` helper.
Adds `shlex = "1"` to webclaw-cli dependencies.
Version: 0.3.13 -> 0.3.14
CHANGELOG updated.
Surfaced by the 2026-04-16 workspace audit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
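A sketch of how the two execution paths might be separated. Tokenization is assumed to have already happened (in webclaw via `shlex::split`, which is not reproduced here), and the function name and `allow_shell` parameter are hypothetical stand-ins for `spawn_on_change()` and the `WEBCLAW_ALLOW_SHELL=1` check:

```rust
use std::process::Command;

/// Builds the command for --on-change without running it.
/// `tokens` is the pre-tokenized command; `raw` is the original string,
/// used only on the opt-in shell path.
pub fn build_on_change(tokens: &[String], raw: &str, allow_shell: bool) -> Option<Command> {
    if allow_shell {
        // Opt-in escape hatch: warn every time so it cannot slip in silently.
        eprintln!("warning: running --on-change through sh -c (shell opt-in set)");
        let mut cmd = Command::new("sh");
        cmd.arg("-c").arg(raw);
        return Some(cmd);
    }
    // Direct exec: the first token is the program, the rest are literal
    // arguments, so `;`, `&&`, `|` and `$()` are never interpreted.
    let (prog, args) = tokens.split_first()?;
    let mut cmd = Command::new(prog);
    cmd.args(args);
    Some(cmd)
}
```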
* chore(brand): fix clippy 1.95 unnecessary_sort_by errors
Pre-existing sort_by calls in brand.rs became hard errors under clippy
1.95. Switch to sort_by_key with std::cmp::Reverse. Pure refactor — same
ordering, no behavior change. Bundled here so CI goes green on the P0
command-injection fix.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sites like Bluesky emit JSON-LD with literal newline characters inside
string values (technically invalid JSON). Add sanitize_json_newlines()
fallback that escapes control characters inside quoted strings before
retrying the parse. This recovers ProfilePage, Product, and other
structured data that was previously silently dropped.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
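A sketch of what such a fallback could look like. The name matches the one above, but the body is an assumption and not the shipped implementation; it does not handle a backslash followed directly by a raw control character:

```rust
/// Escapes raw control characters that appear inside JSON string literals,
/// leaving whitespace between tokens untouched.
pub fn sanitize_json_newlines(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_string = false;
    let mut escaped = false;
    for ch in input.chars() {
        if !in_string {
            if ch == '"' {
                in_string = true;
            }
            out.push(ch);
            continue;
        }
        if escaped {
            // Character after a backslash is copied verbatim.
            out.push(ch);
            escaped = false;
            continue;
        }
        match ch {
            '\\' => {
                out.push(ch);
                escaped = true;
            }
            '"' => {
                out.push(ch);
                in_string = false;
            }
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}
```

The sanitized text is only tried after the strict parse has failed, so valid JSON-LD never passes through this path.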
CI runs Rust 1.94 which flags these. Collapsed nested if-let in
cell_has_block_content() and replaced .map()+return with .inspect()
in table_to_md().
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Version bump for layout table, stack overflow, and noise filter fixes
contributed by @devnen. Also fixes cargo fmt issues that caused CI lint
failure on the merge commit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two related fixes for content being stripped by the noise filter:
1. Remove <form> from unconditional noise tags. ASP.NET and similar
frameworks wrap entire pages in a <form> tag — these are not input
forms. Forms with >500 chars of text are now treated as content
wrappers, not noise.
2. Add safety valve for class/ID noise matching. When malformed HTML
leaves a noise container unclosed (e.g., <div class="header"> missing
its </div>), the HTML5 parser makes all subsequent siblings into
children of that container. A header/nav/footer with >5000 chars of
text is almost certainly a broken wrapper absorbing real content —
exempt it from noise filtering.
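The two thresholds can be summarized in one predicate. This is a simplified sketch: the function name and parameters are hypothetical, and it assumes short forms are still treated as noise:

```rust
/// Forms with more text than this are content wrappers, not noise.
const FORM_CONTENT_MIN_CHARS: usize = 500;
/// Class/ID-matched containers with more text than this are assumed to be
/// broken wrappers that absorbed real content.
const NOISE_WRAPPER_MAX_CHARS: usize = 5000;

/// `text_chars` is the character count of the element's text content.
pub fn is_noise(tag: &str, class_or_id_matches_noise: bool, text_chars: usize) -> bool {
    if tag == "form" {
        return text_chars <= FORM_CONTENT_MIN_CHARS;
    }
    class_or_id_matches_noise && text_chars <= NOISE_WRAPPER_MAX_CHARS
}
```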
Pages like Express.co.uk live blogs nest 200+ DOM levels deep, overflowing
the default 1 MB main-thread stack on Windows during recursive markdown
conversion.
Two-layer fix:
1. markdown.rs: add depth parameter to node_to_md/children_to_md/inline_text
with MAX_DOM_DEPTH=200 guard — falls back to plain text collection at limit
2. lib.rs: wrap extract_with_options in a worker thread with 8 MB stack so
html5ever parsing and extraction both have room on deeply nested pages
Tested with Express.co.uk live blog (previously crashed, now extracts 2000+
lines of clean markdown) and drudgereport.com (still works correctly).
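Both layers can be sketched on a toy tree. The `Node` type and function bodies are illustrative assumptions; `MAX_DOM_DEPTH` and the 8 MB stack come from the description:

```rust
use std::thread;

const MAX_DOM_DEPTH: usize = 200;
const WORKER_STACK_BYTES: usize = 8 * 1024 * 1024;

/// Toy DOM: text leaves and elements with children.
pub enum Node {
    Text(String),
    Element(Vec<Node>),
}

/// Non-recursive plain-text fallback used once the depth limit is hit.
fn collect_text(node: &Node, out: &mut String) {
    let mut stack = vec![node];
    while let Some(n) = stack.pop() {
        match n {
            Node::Text(t) => out.push_str(t),
            // Push in reverse so children are visited in document order.
            Node::Element(children) => stack.extend(children.iter().rev()),
        }
    }
}

/// Layer 1: recursive conversion with a depth guard.
fn node_to_md(node: &Node, depth: usize, out: &mut String) {
    if depth >= MAX_DOM_DEPTH {
        collect_text(node, out);
        return;
    }
    match node {
        Node::Text(t) => out.push_str(t),
        Node::Element(children) => {
            for child in children {
                node_to_md(child, depth + 1, out);
            }
        }
    }
}

/// Layer 2: run the extraction on a worker thread with a larger stack.
pub fn extract_on_worker(root: Node) -> String {
    thread::Builder::new()
        .stack_size(WORKER_STACK_BYTES)
        .spawn(move || {
            let mut out = String::new();
            node_to_md(&root, 0, &mut out);
            out
        })
        .expect("failed to spawn worker thread")
        .join()
        .expect("worker thread panicked")
}
```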
Sites like Drudge Report use <table> for page layout, not data. Each cell
contains extensive block-level content (divs, hrs, paragraphs, links).
Previously, table_to_md() called inline_text() on every cell, collapsing
all whitespace and flattening block elements into a single unreadable line.
Changes:
- Add cell_has_block_content() heuristic: scans for block-level descendants
(p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables
- Layout tables render each cell as a standalone section separated by blank
lines, using children_to_md() to preserve block structure
- Data tables (no block elements in cells) keep existing markdown table format
- Bold/italic tags containing block elements are treated as containers
instead of being wrapped in ** or * markers (fixes Drudge's
<b><font>...</font></b> column wrappers that contain the entire column content)
- Add tests for layout tables with paragraphs and with links
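The layout-vs-data distinction might be sketched like this on a toy tree. The `Html` type and the exact tag list are assumptions; the real function works on the parsed DOM:

```rust
/// Toy HTML tree for illustration.
pub enum Html {
    Text(String),
    Element { tag: String, children: Vec<Html> },
}

/// Illustrative subset of block-level tags.
const BLOCK_TAGS: &[&str] = &[
    "p", "div", "hr", "ul", "ol", "h1", "h2", "h3", "h4", "h5", "h6",
];

/// True when any descendant of `cell` is a block-level element, which
/// marks the surrounding table as a layout table.
pub fn cell_has_block_content(cell: &Html) -> bool {
    match cell {
        Html::Text(_) => false,
        Html::Element { children, .. } => children.iter().any(|child| match child {
            Html::Element { tag, .. } if BLOCK_TAGS.contains(&tag.as_str()) => true,
            other => cell_has_block_content(other),
        }),
    }
}
```

Cells for which this returns false keep the regular markdown table rendering.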
__NEXT_DATA__, SvelteKit, and JSON-LD now appear as a
## Structured Data section in -f markdown and -f llm output.
Works with --only-main-content and all extraction flags.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">.
Now extracted as structured JSON (pageProps) in the structured_data field.
Tested on 45 sites — 13 return rich structured data including prices,
product info, and page state not visible in the DOM.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract structured JSON from SvelteKit kit.start() data arrays
- Convert JS object literals (unquoted keys) to valid JSON
- Data appears in structured_data field (machine-readable)
- License changed from MIT to AGPL-3.0
- Bump to v0.3.4
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Embeds QuickJS (rquickjs) to execute inline <script> tags and extract
data hidden in JavaScript variable assignments. Captures window.__*
objects like __preloadedData (NYTimes), __PRELOADED_STATE__ (Wired),
and self.__next_f (Next.js RSC flight data).
Results:
- NYTimes: 1,552 → 4,162 words (+168%)
- Wired: 1,459 → 9,937 words (+580%)
- Negligible performance overhead (<15ms per page)
- Feature-gated: disable with --no-default-features for WASM
Smart text filtering rejects CSS, base64, file paths, code strings.
Only readable prose is appended under "## Additional Content".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Critical:
- MCP server identifies as "webclaw-mcp" instead of "rmcp"
- Research tool poll loop capped at 200 iterations (~10 min)
CLI:
- Non-zero exit codes on errors
- Text format strips markdown table syntax
MCP server:
- URL validation on all tools
- 60s cloud API timeout, 30s local fetch timeout
- Diff cloud fallback computes actual diff
- Batch capped at 100 URLs, crawl at 500 pages
- Graceful startup failure instead of panic
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>