Detects ESPN-style hub pages (espn.com/nba/, /nfl/, /mlb/, /nhl/, /soccer/)
where the rendered markup has nav-only content with no article bodies —
chrome retry doesn't help because the data genuinely isn't in the markup.
Heuristic: word_count < 500 AND link_count >= 5 against the extracted output.
--prefer-articles: when set, a hub-classified page returns the extracted
link list (reusing the M1 --mode summary machinery) instead of the sparse
body. On non-hub pages, behavior is unchanged.
stderr hint: always emitted on hub detection so the caller knows to drill
/story/_/id/<id>/ URLs from a citation list.
False-positive resistance verified: BBC News /world (link-heavy aggregator,
1500+ words body) and n1info.rs (widget-heavy but content-rich) both
classify as non-hub and emit full extraction.
9 new tests in webclaw-core (317 -> 326).
Three additive CLI flags addressing the 50KB persisted-output cap that
trips Claude Code's per-tool-result harness on aggregator front pages
(apnews.com, cnbc.com/markets/, b92.net all >50KB by default):
--max-output-bytes N: truncates final output at N bytes with a clear
'[truncated: M more bytes ...]' footer. N=0 means unlimited (default).
UTF-8 codepoint-boundary safe; also wraps JSON output so truncated
output stays parseable.
--mode summary: returns only the extracted link list (titles + URLs),
no body text. For aggregator front pages where the LLM is going to
drill the individual articles next anyway.
--mode toc: returns H1/H2 outline + first paragraph after each H2.
For long single-article pages.
New flags are orthogonal to -f (json/llm/text). 9 new unit tests in
webclaw-core, total goes 308 -> 317 passing. Smoke-tested on
apnews.com (51713 -> 27404 summary -> 6269 toc -> 8193 capped),
pitchfork.com (42049 -> 379 summary), cnbc.com (56682 -> 16385 capped).
improve-loop's loop.py writes baselines/, .loop-scratch/ and a
*-loop-progress.log per run. _build-release.bat / _build-release.log are
a local wrapper for invoking cargo build with the right MSVC + LLVM +
NASM env (replaces the missing update.py from CLAUDE.md). None should
land in git.
CLAUDE.md gains a mandatory step at the top describing the rebuild->copy->
verify dance for the portable Claude Code install at C:\_projects\claude-
portable, plus a local-build env snippet for the BoringSSL bindgen vars
(LIBCLANG_PATH, NASM on PATH) that update.py sets automatically but a plain
shell does not.
.gitignore adds runtime/scratch entries that shouldn't have been tracked:
__pycache__/, .last_update_check, .playwright-cli/, demo_sample.html,
demo_saved.json. Nothing currently tracked is affected (none of these were
under version control).
GitHub flagged checkout@v4 / upload-artifact@v4 / download-artifact@v4
as Node.js 20 actions, force-migrated to Node 24 on 2026-06-02. Bump
all nine references to v5 ahead of the deadline. The artifact steps are
v5-compatible: upload uses a unique matrix-target name and the download
step flattens subdirectories with find afterward.
The v0.6.4 tag shipped the API surface discovery module but the
release commit left the workspace version at 0.6.3 with no matching
changelog entry. Bump [workspace.package] to 0.6.4 and add the
[0.6.4] CHANGELOG section so the code matches the tag.
Security audit follow-up across the workspace:
- webclaw-core: keep the crate WASM-safe. quickjs/rquickjs is now a
cfg(not(wasm32)) target dependency and the extraction entry point uses
a direct call on wasm instead of spawning a thread, so it builds and
runs on wasm32 with or without default features.
- webclaw-core: bound the structured-data scrubber recursion (depth cap)
so deeply nested attacker JSON-LD / __NEXT_DATA__ cannot exhaust the
stack.
- webclaw-fetch: stream the response body with a running ceiling so a
small highly compressed payload cannot inflate to gigabytes in memory;
redact user:pass@ from proxy URLs before they reach error strings.
- webclaw-cli: contain output filenames inside the chosen directory
(reject .. / absolute, drop traversal path segments), run --webhook
URLs through the public-URL SSRF guard, clamp --watch-interval to >=1s,
and make research slug truncation char-safe.
- webclaw-mcp: char-safe slug truncation (no multibyte slice panic).
- setup.sh / deploy/hetzner.sh: replace eval on read input with
printf -v, and mask auth key / API token in console output.
- CI: enforce the wasm32 build invariant for webclaw-core.
Tests added for every behavioral change. Bump to 0.6.3 + CHANGELOG.
Port the valid PR #43 LLM cleanup fixes onto current main without stale branch regressions.\n\nIncludes comment-count link cleanup, bare numeric paragraph cleanup, pagination leftover cleanup, JSON-LD article body scrubbing, clearer CLI consent-wall warnings, and quieter parser logs by default.\n\nThanks to @devnen for the report and patch work.
Updated the README to reflect changes in the project description, banner image size, and various content sections. Enhanced clarity on features and usage.
Improve LLM-format output for modern news and documentation pages.
- Filter noisy hydration and low-value page chrome structured data while preserving content-bearing Schema.org records
- Fix element/text spacing without detaching punctuation on docs, forums, and reference pages
- Remove common accessibility link chrome from LLM text and link labels
- Bump workspace version to 0.6.0 and update the changelog
Thanks to Nenad Oric (@devnen) for the original PR and contribution.
The webclaw-core youtube module produces structured markdown but no
transcript; document that and point at the production server's
youtube_transcript.rs short-circuit for the full YoutubeData + caption
text shape.
The repo had no heading-level brand anchor, only a banner image and
an h3 slogan. Search engines indexing the README were missing the
canonical brand signal. The new h1 is what GitHub renders as the
title of the page and what Google co-ranks with webclaw.io.
Bumps workspace version to 0.5.7.
Surface webclaw.io as a clear alternative path for visitors who want
the antibot, JS rendering, async jobs, search, and watches the OSS
server doesn't ship. Sits between the value-prop and the install
instructions so self-host stays the primary on-ramp.