webclaw/crates
devnen 66974366d7 feat(core): schema-aware JSON-LD parser + --prefer-structured + --articles-from-jsonld
JSON-LD is consistently the cleanest source on major outlets (Reuters,
BBC, Le Monde, N1, Pitchfork). Webclaw already emitted a raw Structured
Data block at the bottom of -f llm output; this iteration teaches it to
parse the JSON-LD by schema and surface the result in a usable form.

New schema-aware parser at crates/webclaw-core/src/jsonld.rs classifies
items by @type into: ItemList, LiveBlogPosting, NewsArticle, Review,
WebPageOrChrome, Unknown. CollectionPage with mainEntity ItemList is
auto-lifted (Reuters CollectionPage shape).
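
A minimal sketch of the classification step described above, not the
actual contents of jsonld.rs. The enum variants mirror the buckets listed
in this message; the function names `classify` and `classify_with_lift`,
the set of types treated as chrome, and the handling of a CollectionPage
without an ItemList are all assumptions made for illustration.

```rust
/// Buckets named in the commit message.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SchemaKind {
    ItemList,
    LiveBlogPosting,
    NewsArticle,
    Review,
    WebPageOrChrome,
    Unknown,
}

/// Hypothetical classifier: maps a JSON-LD `@type` string to a bucket.
fn classify(at_type: &str) -> SchemaKind {
    match at_type {
        "ItemList" => SchemaKind::ItemList,
        "LiveBlogPosting" => SchemaKind::LiveBlogPosting,
        "NewsArticle" => SchemaKind::NewsArticle,
        "Review" => SchemaKind::Review,
        // Assumed examples of page chrome; the real list is not given here.
        "WebPage" | "WebSite" | "Organization" | "BreadcrumbList" => {
            SchemaKind::WebPageOrChrome
        }
        _ => SchemaKind::Unknown,
    }
}

/// Hypothetical lift rule: a CollectionPage whose mainEntity is an
/// ItemList is classified as that ItemList. Any other CollectionPage
/// falls back to the plain classifier (Unknown in this sketch).
fn classify_with_lift(at_type: &str, main_entity_type: Option<&str>) -> SchemaKind {
    if at_type == "CollectionPage" && main_entity_type == Some("ItemList") {
        return SchemaKind::ItemList;
    }
    classify(at_type)
}

fn main() {
    println!("{:?}", classify("NewsArticle"));
    println!("{:?}", classify_with_lift("CollectionPage", Some("ItemList")));
    println!("{:?}", classify("Recipe"));
}
```

Keeping the lift as a separate wrapper leaves the plain @type mapping
easy to test on its own.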

Two new CLI flags:

--prefer-structured: surfaces the schema-aware block at the TOP of the
output, before prose. With -f llm it emits a Markdown summary block; with
-f json it emits a {structured, extracted} envelope. When explicitly
requested, it bypasses the default DROP list for WebPage/chrome types.

--articles-from-jsonld: when the page contains ItemList or
LiveBlogPosting, output ONLY a JSON array of articles
({position, title, url, published}). When no such schema is present,
emit a stderr hint and fall through to default extraction (no error).
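
The fall-through rule for --articles-from-jsonld can be sketched as
below. This is an illustration under stated assumptions, not the shipped
code: `Article`, `ParsedItem`, `Mode`, and `select_mode` are hypothetical
names, the wording of the hint is invented, and treating `published` as
optional is a guess. Only the four article fields and the "hint on
stderr, no error" behavior come from this message.

```rust
/// One entry of the article array: {position, title, url, published}.
#[derive(Debug, Clone, PartialEq)]
struct Article {
    position: usize,
    title: String,
    url: String,
    published: Option<String>, // assumed optional
}

/// Hypothetical view of one parsed JSON-LD item.
struct ParsedItem {
    at_type: String,
    articles: Vec<Article>,
}

/// What the CLI should do once the page's JSON-LD has been inspected.
#[derive(Debug, PartialEq)]
enum Mode {
    /// Emit only the article array.
    ArticlesOnly(Vec<Article>),
    /// No usable schema: continue with default extraction.
    DefaultExtraction,
}

fn select_mode(items: &[ParsedItem]) -> Mode {
    // Take the first item whose @type is one of the two list-like schemas.
    let hit = items
        .iter()
        .find(|i| matches!(i.at_type.as_str(), "ItemList" | "LiveBlogPosting"));
    match hit {
        Some(item) => Mode::ArticlesOnly(item.articles.clone()),
        None => {
            // A hint on stderr, not an error: stdout stays clean.
            eprintln!("hint: no ItemList or LiveBlogPosting found; using default extraction");
            Mode::DefaultExtraction
        }
    }
}

fn main() {
    let items = vec![ParsedItem {
        at_type: "WebPage".to_string(),
        articles: Vec::new(),
    }];
    println!("{:?}", select_mode(&items));
}
```

Returning a value instead of exiting keeps the decision testable and
matches the "no error" wording above.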

Default behavior (neither flag set) is byte-identical to iter-3 on all
default-flag probes (regression sentinel passed): Cyrillic p14 is still
7735 B, M1 caps p18/p19/p20 are deterministic, M2 hub p40/p41 are
byte-identical, and M3 registry p44/p45/p46 still fast-fail with exit 67.

14 new tests in webclaw-core cover schema-variant parsing, parse-error
handling, fall-through behavior, flag combinations, and the
default-byte-identical sentinel. Workspace tests 657 -> 671.
2026-05-23 20:38:59 +02:00
webclaw-cli feat(core): schema-aware JSON-LD parser + --prefer-structured + --articles-from-jsonld 2026-05-23 20:38:59 +02:00
webclaw-core feat(core): schema-aware JSON-LD parser + --prefer-structured + --articles-from-jsonld 2026-05-23 20:38:59 +02:00
webclaw-fetch feat(fetch): known-bad-sites registry for fast-fail on Cloudflare / adblock walls 2026-05-23 19:42:15 +02:00
webclaw-llm fix: support LLM provider compatibility options 2026-05-06 11:36:53 +02:00
webclaw-mcp fix: harden resource limits, path safety, and WASM build (#46) 2026-05-19 17:03:52 +02:00
webclaw-pdf Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
webclaw-server fix: validate self-host route URLs consistently 2026-05-04 14:30:06 +02:00