webclaw/crates
devnen 76cd515a3e feat(core): thin-body classifier + stderr hint for JS-walled content-heavy sites
On content-heavy sites such as Hollywood Reporter, where the extracted
body is under 500 words because the page is JS-walled (Chrome rendering
is needed), webclaw now emits a one-line stderr hint:

  # hint: extracted body is N words (thin); the page may be JS-walled.
    Try --browser chrome for JS-rendered content.

Thin-body classification (crates/webclaw-core/src/llm/thin_body.rs)
mirrors the M2 hub-detector structure. Threshold: 500 words. An
exemption list covers utility domains (example.com, httpbin.org, etc.)
where thinness is by design.
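
A minimal sketch of the classification described above, not the actual
contents of thin_body.rs. The names THIN_BODY_WORDS, EXEMPT_DOMAINS,
is_exempt and thin_body_hint are hypothetical, and the exemption list
holds only the two domains named in this message. Subdomain matching
and whitespace-based word counting are also assumptions.

```rust
/// Word-count threshold below which a body counts as thin (500, per the commit message).
const THIN_BODY_WORDS: usize = 500;

/// Hypothetical exemption list: only the two domains the commit message names.
const EXEMPT_DOMAINS: &[&str] = &["example.com", "httpbin.org"];

/// True if `host` is an exempt domain or a subdomain of one (assumed matching rule).
fn is_exempt(host: &str) -> bool {
    let host = host.to_ascii_lowercase();
    EXEMPT_DOMAINS.iter().any(|d| {
        let suffix = format!(".{d}");
        host == *d || host.ends_with(suffix.as_str())
    })
}

/// Returns the stderr hint text for a thin body, or None when the body is
/// long enough or the host is exempt. Words are counted by whitespace splitting.
fn thin_body_hint(host: &str, body: &str) -> Option<String> {
    if is_exempt(host) {
        return None;
    }
    let words = body.split_whitespace().count();
    if words < THIN_BODY_WORDS {
        Some(format!(
            "# hint: extracted body is {words} words (thin); the page may be JS-walled. \
             Try --browser chrome for JS-rendered content."
        ))
    } else {
        None
    }
}
```

A caller would print the returned string with eprintln! so that stdout
stays untouched.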

The originally proposed --retry-thin flag was dropped after phase A
determined webclaw has no headless-JS backend to retry to (--browser
only affects User-Agent impersonation, not actual rendering). The
hint-only design lets the caller decide: re-run with --browser chrome
manually, or switch to a different fetcher entirely.

The hint is suppressed in --mode summary and --mode toc (both are
link/outline focused). M3 fast-fails skip the formatter entirely, so
they emit no hint either.
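
The suppression rule can be sketched as a single predicate. The Mode
enum and hint_allowed below are hypothetical; only the summary and toc
mode names come from this message, and Default stands in for every
other mode.

```rust
/// Hypothetical output modes; only `summary` and `toc` are named in the commit message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Mode {
    Default,
    Summary,
    Toc,
}

/// The thin-body hint is skipped for the link/outline-focused modes.
fn hint_allowed(mode: Mode) -> bool {
    !matches!(mode, Mode::Summary | Mode::Toc)
}
```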

Stdout invariance: tested byte-identical on all p01-p15 default probes.
M10 only modifies stderr.
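
One way to check this kind of invariance is to make both streams
injectable and compare the captured stdout bytes with and without a
hint. The emit function below is a hypothetical illustration, not
webclaw's formatter.

```rust
use std::io::{self, Write};

/// Writes the formatted body to `out` and, if present, the hint to `err`.
/// The bytes written to `out` never depend on `hint`.
fn emit<O: Write, E: Write>(
    out: &mut O,
    err: &mut E,
    body: &str,
    hint: Option<&str>,
) -> io::Result<()> {
    out.write_all(body.as_bytes())?;
    if let Some(h) = hint {
        writeln!(err, "{h}")?;
    }
    Ok(())
}
```

In a test, passing Vec<u8> buffers for both writers allows a
byte-for-byte comparison of the two stdout captures.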

10 new tests (workspace 678 -> 688).
2026-05-23 22:18:12 +02:00
webclaw-cli feat(core): thin-body classifier + stderr hint for JS-walled content-heavy sites 2026-05-23 22:18:12 +02:00
webclaw-core feat(core): thin-body classifier + stderr hint for JS-walled content-heavy sites 2026-05-23 22:18:12 +02:00
webclaw-fetch feat(core): HTTP status header line in -f llm/text/json output 2026-05-23 21:29:26 +02:00
webclaw-llm fix: support LLM provider compatibility options 2026-05-06 11:36:53 +02:00
webclaw-mcp feat(core): HTTP status header line in -f llm/text/json output 2026-05-23 21:29:26 +02:00
webclaw-pdf Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
webclaw-server feat(core): HTTP status header line in -f llm/text/json output 2026-05-23 21:29:26 +02:00