webclaw/crates/webclaw-cli/src
devnen 31a8f6150f feat(core): JS-hub page detector + --prefer-articles flag
Detects ESPN-style hub pages (espn.com/nba/, /nfl/, /mlb/, /nhl/, /soccer/)
where the rendered markup has nav-only content with no article bodies;
a chrome retry doesn't help because the data genuinely isn't in the markup.
Heuristic: word_count < 500 AND link_count >= 5, evaluated against the
extracted output.
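
The heuristic above can be sketched as follows. This is a minimal illustration, not the code in webclaw-core: the `ExtractionStats` struct, the constant names, and the `count_words` helper are hypothetical, and only the two thresholds come from this commit message.

```rust
/// Thresholds taken from the commit message; the constant names are hypothetical.
const HUB_MAX_WORDS: usize = 500;
const HUB_MIN_LINKS: usize = 5;

/// Hypothetical summary of an extraction result (not the real webclaw-core type).
pub struct ExtractionStats {
    pub word_count: usize,
    pub link_count: usize,
}

/// Returns true when the extracted output looks like a nav-only hub page:
/// fewer than 500 words AND at least 5 links.
pub fn is_js_hub(stats: &ExtractionStats) -> bool {
    stats.word_count < HUB_MAX_WORDS && stats.link_count >= HUB_MIN_LINKS
}

/// One plausible way to count words: split on any Unicode whitespace.
/// How the real extractor counts words is an assumption here.
pub fn count_words(text: &str) -> usize {
    text.split_whitespace().count()
}
```

Both conditions must hold, so a link-heavy page with a long body (such as an aggregator front page with 1500+ words) stays classified as non-hub.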

--prefer-articles: when set, a hub-classified page returns the extracted
link list (reusing the M1 --mode summary machinery) instead of the sparse
body. On non-hub pages, behavior is unchanged.
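
The flag's effect on output selection might look roughly like this. It is a sketch under stated assumptions: the `Extraction` struct and `render_output` function are hypothetical names, and the real implementation reuses the `--mode summary` machinery rather than a plain newline join.

```rust
/// Hypothetical extraction result: the body text plus the links found on the page.
pub struct Extraction {
    pub body: String,
    pub links: Vec<String>,
}

/// Chooses what to emit. Only a hub-classified page with the flag set
/// returns the link list; every other combination returns the body unchanged.
pub fn render_output(extraction: &Extraction, is_hub: bool, prefer_articles: bool) -> String {
    if is_hub && prefer_articles {
        // Assumed formatting: one link per line.
        extraction.links.join("\n")
    } else {
        extraction.body.clone()
    }
}
```

Keeping the non-hub branch identical to the no-flag branch is what makes the flag safe to pass unconditionally.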

stderr hint: always emitted on hub detection so the caller knows to drill
into /story/_/id/<id>/ URLs from a citation list.
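
A caller-side view of that hint could be sketched as below. The function names, the hint wording, and the substring match are all assumptions made for illustration; only the `/story/_/id/<id>/` path pattern comes from this commit message.

```rust
/// Path fragment from the `/story/_/id/<id>/` pattern in the commit message.
const STORY_MARKER: &str = "/story/_/id/";

/// Keeps only the links that look like individual story URLs.
/// A plain substring match is an assumption; the real filter may be stricter.
pub fn story_links(links: &[String]) -> Vec<&str> {
    links
        .iter()
        .map(String::as_str)
        .filter(|link| link.contains(STORY_MARKER))
        .collect()
}

/// Builds a hint line for stderr. The wording is hypothetical.
pub fn hub_hint(url: &str, story_count: usize) -> String {
    format!(
        "hint: {} looks like a JS hub page; {} story link(s) found, fetch those for article bodies",
        url, story_count
    )
}
```

Printing the result with `eprintln!` keeps the hint out of stdout, so piping the extracted output into another tool is unaffected.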

False-positive resistance verified: BBC News /world (link-heavy aggregator,
1500+ word body) and n1info.rs (widget-heavy but content-rich) both
classify as non-hub and emit full extraction.

9 new tests in webclaw-core (317 -> 326).
2026-05-23 18:55:17 +02:00
..
bench.rs feat(cli): add webclaw bench <url> subcommand (closes #26) 2026-04-22 12:25:29 +02:00
main.rs feat(core): JS-hub page detector + --prefer-articles flag 2026-05-23 18:55:17 +02:00