webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-10 22:45:13 +02:00

devnen 31a8f6150f feat(core): JS-hub page detector + --prefer-articles flag		2026-05-23 18:55:17 +02:00

Detects ESPN-style hub pages (espn.com/nba/, /nfl/, /mlb/, /nhl/, /soccer/) where the rendered markup has nav-only content with no article bodies; chrome retry doesn't help because the data genuinely isn't in the markup. Heuristic: word_count < 500 AND link_count >= 5 against the extracted output.

--prefer-articles: when set, a hub-classified page returns the extracted link list (reusing the M1 --mode summary machinery) instead of the sparse body. On non-hub pages, behavior is unchanged.

stderr hint: always emitted on hub detection so the caller knows to drill /story/_/id/<id>/ URLs from a citation list.

False-positive resistance verified: BBC News /world (link-heavy aggregator, 1500+ words body) and n1info.rs (widget-heavy but content-rich) both classify as non-hub and emit full extraction.

9 new tests in webclaw-core (317 -> 326).
webclaw-cli	feat(core): JS-hub page detector + --prefer-articles flag	2026-05-23 18:55:17 +02:00
webclaw-core	feat(core): JS-hub page detector + --prefer-articles flag	2026-05-23 18:55:17 +02:00
webclaw-fetch	feat(core): endpoints module for API surface extraction from HTML and JS (#47)	2026-05-19 19:05:16 +02:00
webclaw-llm	fix: support LLM provider compatibility options	2026-05-06 11:36:53 +02:00
webclaw-mcp	fix: harden resource limits, path safety, and WASM build (#46)	2026-05-19 17:03:52 +02:00
webclaw-pdf	Initial release: webclaw v0.1.0 — web content extraction for LLMs	2026-03-23 18:31:11 +01:00
webclaw-server	fix: validate self-host route URLs consistently	2026-05-04 14:30:06 +02:00
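
The hub heuristic described in the latest commit (word_count < 500 AND link_count >= 5, applied to the extracted output) can be sketched roughly as below. This is a minimal illustration, not the code in webclaw-core: the names `PageStats`, `Output`, `is_hub`, and `select_output`, the whitespace-based word count, and the hint text are all assumptions made for the example. Only the two thresholds, the always-on stderr hint, and the described effect of --prefer-articles come from the commit message.

```rust
/// Counts taken from a page's extracted output.
/// Hypothetical type; webclaw-core's real structures may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PageStats {
    pub word_count: usize,
    pub link_count: usize,
}

/// What the caller receives: the extracted body or the extracted link list.
/// Hypothetical type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Output<'a> {
    Body(&'a str),
    Links(&'a [String]),
}

// Thresholds as stated in the commit message.
pub const HUB_MAX_WORDS: usize = 500;
pub const HUB_MIN_LINKS: usize = 5;

/// A page is classified as a hub when it has few words and several links.
pub fn is_hub(stats: PageStats) -> bool {
    stats.word_count < HUB_MAX_WORDS && stats.link_count >= HUB_MIN_LINKS
}

/// Chooses the output for a page. `prefer_articles` mirrors the
/// --prefer-articles flag: it only changes the result on hub pages.
pub fn select_output<'a>(
    body: &'a str,
    links: &'a [String],
    prefer_articles: bool,
) -> Output<'a> {
    let stats = PageStats {
        // Assumption: words are counted by splitting on whitespace.
        word_count: body.split_whitespace().count(),
        link_count: links.len(),
    };
    let hub = is_hub(stats);

    if hub {
        // The commit says a hint is always emitted on hub detection,
        // regardless of the flag. The wording here is made up.
        eprintln!("hint: hub page detected; follow the listed article links");
    }

    if hub && prefer_articles {
        Output::Links(links)
    } else {
        Output::Body(body)
    }
}
```

With these assumptions, a link-heavy page whose body exceeds the word threshold (the BBC News /world case in the commit message) stays non-hub and returns its full body whether or not the flag is set.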