webclaw/crates/webclaw-cli
devnen 76cd515a3e feat(core): thin-body classifier + stderr hint for JS-walled content-heavy sites
On sites like Hollywood Reporter, where the extracted body is under 500
words because the page is JS-walled (the content needs Chrome rendering),
webclaw now emits a one-line stderr hint:

  # hint: extracted body is N words (thin); the page may be JS-walled.
    Try --browser chrome for JS-rendered content.

Thin-body classification (crates/webclaw-core/src/llm/thin_body.rs)
mirrors the M2 hub-detector structure. Threshold: 500 words. An exemption
list covers utility domains (example.com, httpbin.org, etc.) where
thinness is by design.
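
A minimal sketch of the classification described above, not the actual
contents of thin_body.rs. The names `BodyVerdict`, `classify_body`, and
`EXEMPT_DOMAINS` are hypothetical, the exemption list shows only the two
domains named in this message, and both whitespace-based word counting and
exact host matching are assumptions:

```rust
/// Threshold from the commit message: bodies under this many words are thin.
const THIN_BODY_WORDS: usize = 500;

/// Hypothetical exemption list; only the two domains named in the commit
/// message are shown, the real list is longer.
const EXEMPT_DOMAINS: &[&str] = &["example.com", "httpbin.org"];

/// Hypothetical verdict type for the classifier.
#[derive(Debug, PartialEq, Eq)]
pub enum BodyVerdict {
    /// Body has at least `THIN_BODY_WORDS` words.
    Ok,
    /// Body is below the threshold; carries the measured word count.
    Thin { words: usize },
    /// Host is a utility domain where thinness is by design.
    Exempt,
}

/// Classify an extracted body for a given host.
/// Assumption: words are whitespace-separated tokens and the host must
/// match an exempt domain exactly after lowercasing and dropping "www.".
pub fn classify_body(host: &str, body: &str) -> BodyVerdict {
    let host = host.to_ascii_lowercase();
    let host = host.strip_prefix("www.").unwrap_or(&host);
    if EXEMPT_DOMAINS.iter().any(|d| *d == host) {
        return BodyVerdict::Exempt;
    }
    let words = body.split_whitespace().count();
    if words < THIN_BODY_WORDS {
        BodyVerdict::Thin { words }
    } else {
        BodyVerdict::Ok
    }
}
```

Under these assumptions, a 120-word body on a non-exempt host yields
`BodyVerdict::Thin { words: 120 }`, while any body on example.com yields
`BodyVerdict::Exempt` regardless of length.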

The originally proposed --retry-thin flag was dropped after phase A
determined that webclaw has no headless-JS backend to retry with (--browser
only affects User-Agent impersonation, not actual rendering). The
hint-only design lets the caller decide: re-run with --browser chrome
manually, or switch to a different fetcher entirely.

The hint is suppressed in --mode summary and --mode toc (link/outline
focused). M3 fast-fails skip the formatter entirely, so they emit no hint.
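
The suppression rule can be sketched as follows. This is an illustration
under stated assumptions, not webclaw's formatter code: the `Mode` enum,
its `Default` variant, and the functions `thin_body_hint` and `emit_hint`
are hypothetical, and the hint is assumed to be a single line (the
two-line form shown earlier in this message is taken to be wrapping):

```rust
/// Hypothetical output modes; only `summary` and `toc` are named in the
/// commit message, `Default` stands in for every other mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Default,
    Summary,
    Toc,
}

/// Threshold from the commit message.
const THIN_BODY_WORDS: usize = 500;

/// Build the hint text, or `None` when no hint should be shown.
pub fn thin_body_hint(mode: Mode, words: usize) -> Option<String> {
    // Link/outline-focused modes never show the hint.
    if matches!(mode, Mode::Summary | Mode::Toc) {
        return None;
    }
    // Bodies at or above the threshold are not thin.
    if words >= THIN_BODY_WORDS {
        return None;
    }
    Some(format!(
        "# hint: extracted body is {words} words (thin); the page may be \
         JS-walled. Try --browser chrome for JS-rendered content."
    ))
}

/// Write the hint to stderr only, so stdout stays byte-identical.
pub fn emit_hint(mode: Mode, words: usize) {
    if let Some(hint) = thin_body_hint(mode, words) {
        eprintln!("{hint}");
    }
}
```

Separating the text from the `eprintln!` call keeps the decision testable
without capturing stderr.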

Stdout invariance: tested byte-identical on all p01-p15 default probes.
M10 only modifies stderr.

10 new tests (workspace 678 -> 688).
2026-05-23 22:18:12 +02:00
..
src feat(core): thin-body classifier + stderr hint for JS-walled content-heavy sites 2026-05-23 22:18:12 +02:00
Cargo.toml fix(cli): close --on-change command injection via sh -c (P0) (#20) 2026-04-16 18:37:02 +02:00