mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-09 22:35:12 +02:00
JSON-LD is consistently the cleanest source on major outlets (Reuters,
BBC, Le Monde, N1, Pitchfork). Webclaw already emitted a raw Structured
Data block at the bottom of -f llm output; this iteration teaches it to
parse the JSON-LD by schema type and surface it in a usable form.

New schema-aware parser at crates/webclaw-core/src/jsonld.rs classifies
items by @type into: ItemList, LiveBlogPosting, NewsArticle, Review,
WebPageOrChrome, Unknown. A CollectionPage whose mainEntity is an
ItemList is auto-lifted (the Reuters CollectionPage shape).
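
A minimal sketch of the classification described above, not the actual contents of jsonld.rs. The function name `classify`, its signature, and the set of types treated as chrome are assumptions made for illustration; only the six bucket names and the CollectionPage auto-lift come from this commit message.

```rust
/// The six buckets named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SchemaKind {
    ItemList,
    LiveBlogPosting,
    NewsArticle,
    Review,
    WebPageOrChrome,
    Unknown,
}

/// Hypothetical helper: classify one JSON-LD item from its `@type` string
/// and, when present, the `@type` of its `mainEntity`.
pub fn classify(type_name: &str, main_entity_type: Option<&str>) -> SchemaKind {
    match (type_name, main_entity_type) {
        // Auto-lift: a CollectionPage wrapping an ItemList counts as the list.
        ("CollectionPage", Some("ItemList")) => SchemaKind::ItemList,
        ("ItemList", _) => SchemaKind::ItemList,
        ("LiveBlogPosting", _) => SchemaKind::LiveBlogPosting,
        ("NewsArticle", _) => SchemaKind::NewsArticle,
        ("Review", _) => SchemaKind::Review,
        // Assumed chrome set; the real DROP list is not given in the source.
        ("WebPage" | "WebSite" | "CollectionPage" | "BreadcrumbList" | "Organization", _) => {
            SchemaKind::WebPageOrChrome
        }
        _ => SchemaKind::Unknown,
    }
}
```

Under these assumptions, `classify("CollectionPage", Some("ItemList"))` lands in the ItemList bucket, while a bare CollectionPage is treated as chrome.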

Two new CLI flags:

  --prefer-structured: surfaces the schema-aware block at the top of the
  output, before the prose. With -f llm it emits a Markdown summary
  block; with -f json it emits a {structured, extracted} envelope. When
  explicitly requested, it bypasses the default DROP list for
  WebPage/chrome types.

  --articles-from-jsonld: when the page contains an ItemList or
  LiveBlogPosting, outputs only a JSON array of articles
  ({position, title, url, published}). When no such schema is present,
  it emits a hint on stderr and falls through to default extraction
  (no error).
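
The --articles-from-jsonld behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not webclaw's code: `Article`, `Output`, `json_string`, and `articles_from_jsonld` are hypothetical names, the hint wording is invented, and serializing a missing `published` value as `null` is a guess. Only the four field names and the fall-through-without-error rule come from the description above.

```rust
/// One entry of the output array; field names follow the commit message.
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub position: usize,
    pub title: String,
    pub url: String,
    pub published: Option<String>,
}

/// What the flag decides to do for a page (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Output {
    ArticlesJson(String),
    DefaultExtraction,
}

/// Minimal JSON string escaping so the sketch needs no external crates.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// `articles` is Some when an ItemList or LiveBlogPosting was found on the page.
pub fn articles_from_jsonld(articles: Option<&[Article]>) -> Output {
    match articles {
        Some(items) => {
            let body: Vec<String> = items
                .iter()
                .map(|a| {
                    // Assumption: a missing date is written as JSON null.
                    let published = match &a.published {
                        Some(p) => json_string(p),
                        None => "null".to_string(),
                    };
                    format!(
                        "{{\"position\":{},\"title\":{},\"url\":{},\"published\":{}}}",
                        a.position,
                        json_string(&a.title),
                        json_string(&a.url),
                        published
                    )
                })
                .collect();
            Output::ArticlesJson(format!("[{}]", body.join(",")))
        }
        None => {
            // No matching schema: hint on stderr, then fall through (no error).
            eprintln!("hint: no ItemList or LiveBlogPosting found; using default extraction");
            Output::DefaultExtraction
        }
    }
}
```

Returning a plain enum instead of an error keeps the fall-through case on the normal success path, which matches the "no error" wording above.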

Default behavior (neither flag set) is byte-identical to iter-3 on all
default-flag probes (regression sentinel passed): Cyrillic p14 is still
7735 B, M1 caps p18/p19/p20 are deterministic, M2 hub p40/p41 are
byte-identical, and M3 registry p44/p45/p46 still fast-fail with exit 67.

14 new tests in webclaw-core covering schema-variant parsing, parse
error handling, fall-through behavior, flag combinations, and the
default-byte-identical sentinel. Workspace tests 657 -> 671.
30 lines
731 B
Text
target/
.DS_Store
.env
.env.*
proxies.txt
.claude/skills/
# Scratch / local artifacts (previously covered by overbroad `*.json`,
# which would have also swallowed package.json, components.json,
# .smithery/*.json if they were ever modified).
*.local.json
local-test-results.json
# CLI research command dumps JSON output keyed on the query; they're
# not code and shouldn't live in git. Track deliberately-saved research
# output under a different name.
research-*.json

# Local runtime/scratch — never repo content.
__pycache__/
.last_update_check
.playwright-cli/
demo_sample.html
demo_saved.json
baselines/
.loop-scratch/
*-loop-progress.log
_build-release.bat
_build-release.log
improve-loop-CONTINUE.md
iter-*-smoke/
_local/