webclaw/crates/webclaw-core
devnen 76cd515a3e feat(core): thin-body classifier + stderr hint for JS-walled content-heavy sites
On sites such as Hollywood Reporter, where the extracted body is under
500 words because the page is JS-walled (it needs Chrome rendering),
webclaw now emits a one-line stderr hint:

  # hint: extracted body is N words (thin); the page may be JS-walled.
    Try --browser chrome for JS-rendered content.

Thin-body classification (crates/webclaw-core/src/llm/thin_body.rs)
mirrors the M2 hub-detector structure. Threshold: 500 words. An
exemption list covers utility domains (example.com, httpbin.org, etc.)
where thinness is by design.
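
A minimal sketch of the classification described above, not the actual
contents of thin_body.rs. The names BodyClass, classify_body, is_exempt,
and the two constants are hypothetical. The exemption list holds only the
two domains named in this message, and matching subdomains as well as
counting words by whitespace splitting are assumptions.

```rust
/// Word-count threshold below which a body counts as thin (500, per the commit message).
const THIN_BODY_THRESHOLD: usize = 500;

/// Hypothetical exemption list: only these two domains are named in the
/// commit message; the real list is longer ("etc").
const EXEMPT_DOMAINS: &[&str] = &["example.com", "httpbin.org"];

/// Hypothetical result type for the classifier.
#[derive(Debug, PartialEq, Eq)]
pub enum BodyClass {
    /// Body has at least THIN_BODY_THRESHOLD words.
    Normal,
    /// Body is below the threshold; carries the measured word count.
    Thin { words: usize },
    /// Host is on the exemption list, so thinness is expected.
    Exempt,
}

/// Returns true if `host` is an exempt domain or a subdomain of one.
/// Subdomain matching is an assumption, not confirmed by the source.
fn is_exempt(host: &str) -> bool {
    let host = host.to_ascii_lowercase();
    EXEMPT_DOMAINS
        .iter()
        .any(|d| host == *d || host.ends_with(&format!(".{d}")))
}

/// Classifies an extracted body by whitespace-separated word count.
pub fn classify_body(host: &str, body: &str) -> BodyClass {
    if is_exempt(host) {
        return BodyClass::Exempt;
    }
    let words = body.split_whitespace().count();
    if words < THIN_BODY_THRESHOLD {
        BodyClass::Thin { words }
    } else {
        BodyClass::Normal
    }
}
```

Checking the exemption list before counting words keeps utility domains
out of the thin path entirely, so they can never trigger a hint.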

The originally proposed --retry-thin flag was dropped after phase A
determined that webclaw has no headless-JS backend to retry with
(--browser only affects User-Agent impersonation, not actual rendering).
The hint-only design leaves the decision to the caller: re-run manually
with --browser chrome, or switch to a different fetcher entirely.

The hint is suppressed in --mode summary and --mode toc (link/outline
focused). M3 fast-fails skip the formatter entirely, so they emit no hint.
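
The suppression rule and the stderr-only output could be sketched as
follows. This illustrates the behaviour described here and is not
webclaw's code: Mode, thin_body_hint, and emit_hint are hypothetical
names, and the hint is rendered as a single line on the assumption that
the two-line form above is only commit-message wrapping.

```rust
use std::io::Write;

/// Hypothetical output modes; only `summary` and `toc` are named in the source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Default,
    Summary,
    Toc,
}

/// Builds the hint text, or returns None when no hint should be shown:
/// either the mode is link/outline focused or the body is not thin.
pub fn thin_body_hint(mode: Mode, words: usize, threshold: usize) -> Option<String> {
    if matches!(mode, Mode::Summary | Mode::Toc) {
        return None;
    }
    if words >= threshold {
        return None;
    }
    Some(format!(
        "# hint: extracted body is {words} words (thin); the page may be JS-walled. \
         Try --browser chrome for JS-rendered content."
    ))
}

/// Writes the hint to the given error sink only; the caller passes
/// `std::io::stderr()` so stdout is never touched.
pub fn emit_hint<W: Write>(mut err: W, hint: Option<&str>) -> std::io::Result<()> {
    if let Some(h) = hint {
        writeln!(err, "{h}")?;
    }
    Ok(())
}
```

A caller would run something like
emit_hint(std::io::stderr(), thin_body_hint(mode, words, 500).as_deref()).
Taking the sink as a generic writer lets a test capture the hint in a
buffer and confirm that nothing else is written.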

Stdout invariance: tested byte-identical on all p01-p15 default probes.
M10 only modifies stderr.

10 new tests (workspace 678 -> 688).
2026-05-23 22:18:12 +02:00
src feat(core): thin-body classifier + stderr hint for JS-walled content-heavy sites 2026-05-23 22:18:12 +02:00
testdata fix: prevent stack overflow on deeply nested HTML pages 2026-04-03 23:45:19 +02:00
Cargo.toml fix: harden resource limits, path safety, and WASM build (#46) 2026-05-19 17:03:52 +02:00