webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-10 22:45:13 +02:00

devnen 66974366d7 feat(core): schema-aware JSON-LD parser + --prefer-structured + --articles-from-jsonld JSON-LD is consistently the cleanest source on major outlets (Reuters, BBC, Le Monde, N1, Pitchfork). Webclaw already emitted a raw Structured Data block at the bottom of -f llm output; this iter teaches it to parse the JSON-LD by schema and surface it usefully. New schema-aware parser at crates/webclaw-core/src/jsonld.rs classifies items by @type into: ItemList, LiveBlogPosting, NewsArticle, Review, WebPageOrChrome, Unknown. CollectionPage with mainEntity ItemList is auto-lifted (Reuters CollectionPage shape). Two new CLI flags: --prefer-structured: surfaces the schema-aware block at the TOP of the output, before prose. For -f llm emits a Markdown summary block; for -f json emits a {structured, extracted} envelope. Bypasses the default DROP list for WebPage/chrome types when explicitly requested. --articles-from-jsonld: when the page contains ItemList or LiveBlogPosting, output ONLY a JSON array of articles ({position, title, url, published}). When no such schema is present, emit a stderr hint and fall through to default extraction (no error). Default behavior (neither flag set) byte-identical to iter-3 on all default-flag probes (regression sentinel passed): Cyrillic p14 still 7735 B, M1 caps p18/p19/p20 deterministic, M2 hub p40/p41 byte-identical, M3 registry p44/p45/p46 still fast-fail with exit 67. 14 new tests in webclaw-core covering schema-variant parsing, parse error handling, fall-through behavior, flag combinations, and the default-byte-identical sentinel. Workspace tests 657 -> 671.		2026-05-23 20:38:59 +02:00
..
src	feat(core): schema-aware JSON-LD parser + --prefer-structured + --articles-from-jsonld	2026-05-23 20:38:59 +02:00
testdata	fix: prevent stack overflow on deeply nested HTML pages	2026-04-03 23:45:19 +02:00
Cargo.toml	fix: harden resource limits, path safety, and WASM build (#46 )	2026-05-19 17:03:52 +02:00