webclaw/.gitignore
devnen 66974366d7 feat(core): schema-aware JSON-LD parser + --prefer-structured + --articles-from-jsonld
JSON-LD is consistently the cleanest source on major outlets (Reuters,
BBC, Le Monde, N1, Pitchfork). Webclaw already emitted a raw Structured
Data block at the bottom of -f llm output; this iteration teaches it to
parse the JSON-LD by schema and surface it usefully.

New schema-aware parser at crates/webclaw-core/src/jsonld.rs classifies
items by @type into: ItemList, LiveBlogPosting, NewsArticle, Review,
WebPageOrChrome, Unknown. CollectionPage with mainEntity ItemList is
auto-lifted (Reuters CollectionPage shape).
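
As a rough illustration of the classification described above, here is a minimal sketch. It is not the contents of jsonld.rs: the enum name `SchemaKind`, the function `classify`, and the set of types treated as chrome are assumptions made for this example. Only the six category names and the CollectionPage lift come from the commit message.

```rust
/// Hypothetical mirror of the six categories named in the commit message.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SchemaKind {
    ItemList,
    LiveBlogPosting,
    NewsArticle,
    Review,
    WebPageOrChrome,
    Unknown,
}

/// Classify an item by its `@type` string.
/// `main_entity_type` is the `@type` of the item's `mainEntity`, if any;
/// it is only consulted for CollectionPage.
fn classify(at_type: &str, main_entity_type: Option<&str>) -> SchemaKind {
    match at_type {
        "ItemList" => SchemaKind::ItemList,
        "LiveBlogPosting" => SchemaKind::LiveBlogPosting,
        "NewsArticle" => SchemaKind::NewsArticle,
        "Review" => SchemaKind::Review,
        // Auto-lift: a CollectionPage whose mainEntity is an ItemList
        // is treated as the list itself.
        "CollectionPage" if main_entity_type == Some("ItemList") => SchemaKind::ItemList,
        // Assumption: which types count as page chrome is a guess here.
        "CollectionPage" | "WebPage" | "WebSite" | "BreadcrumbList" | "Organization" => {
            SchemaKind::WebPageOrChrome
        }
        _ => SchemaKind::Unknown,
    }
}
```

The guarded CollectionPage arm is listed before the chrome arm so that the lift wins whenever a list is present.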

Two new CLI flags:

--prefer-structured: surfaces the schema-aware block at the TOP of the
output, before prose. With -f llm it emits a Markdown summary block; with
-f json it emits a {structured, extracted} envelope. It bypasses the
default DROP list for WebPage/chrome types when explicitly requested.

--articles-from-jsonld: when the page contains ItemList or
LiveBlogPosting, output ONLY a JSON array of articles
({position, title, url, published}). When no such schema is present,
emit a stderr hint and fall through to default extraction (no error).
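
The fall-through behavior of --articles-from-jsonld could be sketched roughly as follows. The `Article` struct, the `Output` enum, the `select_output` function, and the wording of the hint are all hypothetical; only the four field names and the "hint on stderr, then default extraction, no error" contract come from the description above.

```rust
/// Hypothetical article record carrying the four fields named above.
#[derive(Debug, PartialEq, Clone)]
struct Article {
    position: u32,
    title: String,
    url: String,
    published: Option<String>,
}

/// What the command would print: either the article list alone,
/// or whatever default extraction produces.
enum Output {
    Articles(Vec<Article>),
    Default(String),
}

/// `articles` is Some when an ItemList or LiveBlogPosting was found.
/// `default_extraction` is only invoked on the fall-through path.
fn select_output(
    articles: Option<Vec<Article>>,
    default_extraction: impl FnOnce() -> String,
) -> Output {
    match articles {
        Some(list) => Output::Articles(list),
        None => {
            // Hint goes to stderr so stdout stays clean; this is not an error.
            eprintln!("hint: no ItemList or LiveBlogPosting found; using default extraction");
            Output::Default(default_extraction())
        }
    }
}
```

Taking the default extraction as a closure keeps the expensive path lazy, so it runs only when no list schema is present.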

Default behavior (neither flag set) is byte-identical to iter-3 on all
default-flag probes (regression sentinel passed): Cyrillic p14 is still
7735 B, M1 caps p18/p19/p20 are deterministic, M2 hub p40/p41 are
byte-identical, and M3 registry p44/p45/p46 still fast-fail with exit 67.

14 new tests in webclaw-core covering schema-variant parsing, parse
error handling, fall-through behavior, flag combinations, and the
default-byte-identical sentinel. Workspace tests 657 -> 671.
2026-05-23 20:38:59 +02:00


target/
.DS_Store
.env
.env.*
proxies.txt
.claude/skills/
# Scratch / local artifacts (previously covered by an overbroad `*.json`,
# which would also have swallowed package.json, components.json, and
# .smithery/*.json if they were ever untracked or newly added).
*.local.json
local-test-results.json
# The CLI research command dumps JSON output keyed on the query; these
# dumps are not code and shouldn't live in git. Track deliberately saved
# research output under a different name.
research-*.json
# Local runtime/scratch — never repo content.
__pycache__/
.last_update_check
.playwright-cli/
demo_sample.html
demo_saved.json
baselines/
.loop-scratch/
*-loop-progress.log
_build-release.bat
_build-release.log
improve-loop-CONTINUE.md
iter-*-smoke/
_local/