mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-28 03:29:38 +02:00
feat(cli): add --max-output-bytes and --mode summary,toc for output-size control
Three additive CLI flags addressing the 50KB persisted-output cap that trips Claude Code's per-tool-result harness on aggregator front pages (apnews.com, cnbc.com/markets/, b92.net all >50KB by default):

- --max-output-bytes N: truncates final output at N bytes with a clear '[truncated: M more bytes ...]' footer. N=0 means unlimited (default). UTF-8 codepoint-boundary safe; also wraps JSON output so truncated output stays parseable.
- --mode summary: returns only the extracted link list (titles + URLs), no body text. For aggregator front pages where the LLM is going to drill the individual articles next anyway.
- --mode toc: returns H1/H2 outline + first paragraph after each H2. For long single-article pages.

New flags are orthogonal to -f (json/llm/text).

9 new unit tests in webclaw-core, total goes 308 -> 317 passing.

Smoke-tested on apnews.com (51713 -> 27404 summary -> 6269 toc -> 8193 capped), pitchfork.com (42049 -> 379 summary), cnbc.com (56682 -> 16385 capped).
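The codepoint-boundary-safe truncation described for --max-output-bytes can be illustrated with a minimal sketch. This is not the code from this commit: the signature, the exact footer wording, and the choice to append the footer after the cut (rather than counting it against N) are all assumptions made for illustration. Only the name truncate_with_footer, the N=0 convention, and the general footer shape come from the commit message.

```rust
/// Hypothetical sketch of byte-capped truncation. The real
/// `truncate_with_footer` in webclaw-core may differ in signature,
/// footer text, and whether the footer counts against the cap.
fn truncate_with_footer(output: &str, max_bytes: usize) -> String {
    // 0 means "unlimited", matching the documented default.
    if max_bytes == 0 || output.len() <= max_bytes {
        return output.to_string();
    }
    // Walk back from the byte limit to the nearest char boundary so a
    // multi-byte UTF-8 codepoint is never split. Index 0 is always a
    // boundary, so the loop terminates.
    let mut cut = max_bytes;
    while !output.is_char_boundary(cut) {
        cut -= 1;
    }
    let dropped = output.len() - cut;
    // Assumed footer wording; appended after the kept prefix.
    format!("{}\n[truncated: {} more bytes]", &output[..cut], dropped)
}
```

Slicing a Rust `&str` at a non-boundary index panics, which is why the cut point is moved backwards first. A cap that lands inside a two-byte character therefore keeps slightly fewer than N bytes of content.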
parent
562c6a15f0
commit
339f41bb7c
4 changed files with 756 additions and 54 deletions
@@ -9,6 +9,12 @@ mod cleanup;
 mod images;
 mod links;
 mod metadata;
+mod output_size;
+
+pub use output_size::{
+    to_json_summary, to_json_toc, to_llm_summary, to_llm_toc, truncate_json_with_wrapper,
+    truncate_with_footer,
+};
 
 use crate::types::ExtractionResult;
 