webclaw/crates/webclaw-core
devnen d5a3aa4bf9 feat(core): word-count breakdown in header — article vs chrome split
The current "Word count: N" header is a single number that conflates the
article body with the surrounding chrome (nav, ads, footer). From the header
alone, callers cannot tell whether to drill into the page or move on.

New: Word count: N (article: M, chrome: K) in -f llm/text output.
For -f json: adds word_count_article and word_count_chrome
fields alongside the existing word_count.
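
As an illustration of the header shape described above, here is a minimal sketch. The function name word_count_header is hypothetical, and deriving chrome as total minus article is an assumption; the commit message does not say how K is computed.

```rust
/// Hypothetical helper (not the crate's API): renders the word-count header line.
/// `article` is `Some(M)` when an article-body count is available.
fn word_count_header(total: usize, article: Option<usize>) -> String {
    match article {
        Some(article) => {
            // Assumption: chrome is whatever remains after the article body.
            // saturating_sub guards against an article count larger than the total.
            let chrome = total.saturating_sub(article);
            format!("Word count: {total} (article: {article}, chrome: {chrome})")
        }
        // No breakdown available: keep the original single-number form.
        None => format!("Word count: {total}"),
    }
}
```

With a total of 1200 and an article count of 900, this sketch renders "Word count: 1200 (article: 900, chrome: 300)".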

M (article body) is sourced from JSON-LD articleBody when M4's
parser found one (NewsArticle or Review.reviewBody); otherwise it is
computed by llm::body_word_count (the M2-style heuristic: words
outside markdown link patterns, counted over the same
body::process_body output that hub_detect uses).
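
The selection order above (structured body first, heuristic second) can be sketched as follows. This is a stand-in under stated assumptions, not the code of llm::body_word_count: the function names are hypothetical, and "words outside markdown link patterns" is approximated by dropping inline [text](url) links before counting.

```rust
/// Stand-in heuristic: counts whitespace-separated words after dropping
/// inline markdown links of the form `[text](url)`.
fn words_outside_links(markdown: &str) -> usize {
    let mut kept = String::with_capacity(markdown.len());
    let mut rest = markdown;
    while let Some(open) = rest.find('[') {
        kept.push_str(&rest[..open]);
        let tail = &rest[open..];
        // An inline link needs `]` immediately followed by `(`, then a closing `)`.
        let link_len = tail.find(']').and_then(|close| {
            let after = &tail[close + 1..];
            if after.starts_with('(') {
                after.find(')').map(|paren| close + 1 + paren + 1)
            } else {
                None
            }
        });
        match link_len {
            // Skip the whole link, leaving a space so neighbours stay separate words.
            Some(len) => {
                kept.push(' ');
                rest = &tail[len..];
            }
            // Not a link: keep the bracket as ordinary text and continue.
            None => {
                kept.push('[');
                rest = &tail[1..];
            }
        }
    }
    kept.push_str(rest);
    kept.split_whitespace().count()
}

/// Prefers a non-empty JSON-LD body; otherwise falls back to the heuristic.
fn article_word_count(json_ld_body: Option<&str>, body_markdown: &str) -> usize {
    match json_ld_body {
        Some(body) if !body.trim().is_empty() => body.split_whitespace().count(),
        _ => words_outside_links(body_markdown),
    }
}
```

Treating an empty JSON-LD body as absent is a design choice of this sketch; the commit message does not say how that case is handled.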

--mode summary / toc / sections fall back to the simple Word count: N
form (these modes don't extract body content, so the breakdown would be
meaningless). Suppression piggybacks on the existing include_status
toggle in build_metadata_header_with_opts.
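
One way the mode-based suppression could look, as a rough sketch. Only include_status is named in this commit message; the Mode enum, the HeaderOpts struct, and the two functions are hypothetical, and the assumption that the summary, toc, and sections modes turn include_status off is inferred rather than confirmed.

```rust
/// Hypothetical output modes; `Full` stands for the default, body-extracting mode.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Mode {
    Full,
    Summary,
    Toc,
    Sections,
}

/// Hypothetical options struct; only the `include_status` name comes from the commit.
struct HeaderOpts {
    include_status: bool,
}

/// Assumption: modes that do not extract body content switch the toggle off.
fn opts_for(mode: Mode) -> HeaderOpts {
    HeaderOpts {
        include_status: matches!(mode, Mode::Full),
    }
}

/// The breakdown is shown only when the toggle is on and an article count exists.
fn show_breakdown(opts: &HeaderOpts, article: Option<usize>) -> bool {
    opts.include_status && article.is_some()
}
```

Reusing one existing toggle keeps the formatter free of mode-specific branches: it only needs to know whether the breakdown is wanted.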

9 new tests in webclaw-core (4 in lib.rs::tests for the population
logic; 5 in llm/metadata.rs::m12_tests for the header formatter).
Workspace test count: 701 -> 710.
2026-05-23 23:56:14 +02:00
src feat(core): word-count breakdown in header — article vs chrome split 2026-05-23 23:56:14 +02:00
testdata fix: prevent stack overflow on deeply nested HTML pages 2026-04-03 23:45:19 +02:00
Cargo.toml fix: harden resource limits, path safety, and WASM build (#46) 2026-05-19 17:03:52 +02:00