mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-28 03:29:38 +02:00
feat(core): thin-body classifier + stderr hint for JS-walled content-heavy sites
On sites like Hollywood Reporter where the extracted body is < 500 words
because the page is JS-walled (chrome rendering is needed), webclaw now
emits a one-line stderr hint:
    # hint: extracted body is N words (thin); the page may be JS-walled.
    Try --browser chrome for JS-rendered content.
Thin-body classification (crates/webclaw-core/src/llm/thin_body.rs)
mirrors the M2 hub-detector structure. Threshold: 500 words. Exemption
list for utility domains (example.com, httpbin.org, etc.) where thinness
is by design.
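
A minimal sketch of the classification described above, assuming a plain whitespace word count and suffix-based host matching. This is not the code in thin_body.rs: `ThinBodySketch`, `classify_sketch`, `is_exempt`, and both constants are hypothetical names, and only the 500-word threshold and the two listed domains come from this commit message.

```rust
/// Threshold from the commit message: bodies under 500 words count as thin.
const THIN_BODY_WORD_THRESHOLD: usize = 500;

/// Hypothetical exemption list. Only these two domains are named in the
/// commit message; the real list is longer ("etc.").
const EXEMPT_DOMAINS: &[&str] = &["example.com", "httpbin.org"];

/// Hypothetical result type; the real `ThinBodyClassification` may differ.
#[derive(Debug, PartialEq, Eq)]
enum ThinBodySketch {
    /// Body is below the threshold and the host is not exempt.
    Thin { words: usize },
    /// Host is a utility domain where a short body is expected.
    Exempt,
    /// Body meets or exceeds the threshold.
    Normal,
}

/// Assumption: a host is exempt if it equals a listed domain or is a
/// subdomain of one.
fn is_exempt(host: &str) -> bool {
    let host = host.to_ascii_lowercase();
    EXEMPT_DOMAINS
        .iter()
        .any(|d| host == *d || host.ends_with(&format!(".{d}")))
}

/// Classify an extracted body by whitespace-separated word count.
fn classify_sketch(host: &str, body: &str) -> ThinBodySketch {
    if is_exempt(host) {
        return ThinBodySketch::Exempt;
    }
    let words = body.split_whitespace().count();
    if words < THIN_BODY_WORD_THRESHOLD {
        ThinBodySketch::Thin { words }
    } else {
        ThinBodySketch::Normal
    }
}
```

Checking the exemption before counting words keeps utility domains from ever being reported as thin, which matches the "thinness is by design" rationale.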
The originally proposed --retry-thin flag was dropped after phase A
determined that webclaw has no headless-JS backend to retry against
(--browser only affects User-Agent impersonation, not actual
rendering). The hint-only design lets the caller decide: re-run with
--browser chrome manually, or switch to a different fetcher entirely.
The hint is suppressed in --mode summary and --mode toc (both are
link/outline focused); M3 fast-fails skip the formatter entirely, so no
hint is emitted there either.
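
The suppression rule can be sketched as a small gate that builds the hint text and writes it only to stderr. The names `ModeSketch`, `thin_body_hint`, and `emit_hint` are hypothetical, and joining the two hint lines with a single space is an assumption; only the hint wording and the two suppressed modes come from this commit message.

```rust
/// Hypothetical stand-in for the CLI's --mode values relevant here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModeSketch {
    Default,
    Summary,
    Toc,
}

/// Build the hint text, or return None when no hint should be shown.
/// Summary and toc modes are link/outline focused, so they stay silent.
fn thin_body_hint(mode: ModeSketch, words: usize, is_thin: bool) -> Option<String> {
    if !is_thin {
        return None;
    }
    match mode {
        ModeSketch::Summary | ModeSketch::Toc => None,
        ModeSketch::Default => Some(format!(
            "# hint: extracted body is {words} words (thin); the page may be JS-walled. \
             Try --browser chrome for JS-rendered content."
        )),
    }
}

/// Write the hint to stderr only, leaving stdout untouched.
fn emit_hint(mode: ModeSketch, words: usize, is_thin: bool) {
    if let Some(hint) = thin_body_hint(mode, words, is_thin) {
        eprintln!("{hint}");
    }
}
```

Separating the pure `thin_body_hint` from the side-effecting `emit_hint` makes the suppression logic testable without capturing stderr.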
Stdout invariance: tested byte-identical on all p01-p15 default probes.
M10 only modifies stderr.
10 new tests (workspace 678 -> 688).
parent dfcd51d9e0
commit 76cd515a3e
4 changed files with 280 additions and 6 deletions
@@ -11,8 +11,10 @@ mod images;
 mod links;
 mod metadata;
 mod output_size;
+mod thin_body;
 
 pub use hub_detect::{classify as classify_hub, HubClassification};
+pub use thin_body::{classify as classify_thin_body, ThinBodyClassification};
 pub use output_size::{
     to_json_summary, to_json_toc, to_llm_summary, to_llm_toc, truncate_json_with_wrapper,
     truncate_with_footer,