webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-28 03:29:38 +02:00

Author	SHA1	Message	Date
devnen	ade2a5143c	feat(core): --mode sections for nav-URL discovery Section-URL ambiguity is recurring friction: callers have to guess whether to hit infobae.com root (LATAM frontpage) or /economia/ (AR-specific live FX dashboard), or decrypt.co root (ticker ribbon) vs /news/ (article list), or bbc.com/news/world vs /news/world/europe/. Each guess costs a round-trip. New `--mode sections` returns the discoverable section URLs parsed from the page's nav, in one round-trip. Subsumes issue #16 (non-English nav harder to LLM-parse; sections come back as data, not prose). Multi-signal heuristic on the existing link extraction: URL-pattern match (/<category>/ style short paths), repetition (section links appear in header + footer), DOM-position when available. Fallback when zero sections detected: emit top-N links with a "(none detected; first N shown)" note. Format: -f llm/text emits `Sections:` followed by `- [Label](url)` list. -f json emits `{"sections": [{"label": "...", "url": "..."}]}`. 13 new tests in webclaw-core (688 -> 701).	2026-05-23 23:14:40 +02:00
devnen	76cd515a3e	feat(core): thin-body classifier + stderr hint for JS-walled content-heavy sites On sites like Hollywood Reporter where the extracted body is < 500 words because the page is JS-walled (chrome rendering is needed), webclaw now emits a one-line stderr hint: # hint: extracted body is N words (thin); the page may be JS-walled. Try --browser chrome for JS-rendered content. Thin-body classification (crates/webclaw-core/src/llm/thin_body.rs) mirrors the M2 hub-detector structure. Threshold: 500 words. Exemption list for utility domains (example.com, httpbin.org, etc) where thinness is by design. The originally proposed --retry-thin flag was dropped after phase A determined webclaw has no headless-JS backend to retry to (--browser only affects User-Agent impersonation, not actual rendering). The hint-only design lets the caller decide: re-run with --browser chrome manually, or switch to a different fetcher entirely. Hint suppressed in --mode summary / --mode toc (link/outline focused); M3 fast-fails skip the formatter entirely so no hint. Stdout invariance: tested byte-identical on all p01-p15 default probes. M10 only modifies stderr. 10 new tests (workspace 678 -> 688).	2026-05-23 22:18:12 +02:00
devnen	dfcd51d9e0	feat(core): HTTP status header line in -f llm/text/json output Webclaw previously emitted URL, Title, Description, and Word count in the -f llm header but no HTTP status. On a 404 response, the caller had no signal apart from inspecting the body (e.g. dailysabah.com/business/economy returns a 404 page; webclaw was extracting '13 words' of the error page without flagging the 404 status). New behavior: every -f llm/text/json output includes a 'Status: <code>' header line (after URL: per phase A's placement). Emitted on all responses including 200 for consistency; callers can't otherwise distinguish 'webclaw saw 200' from 'webclaw missed status info'. For -f json: top-level "status": <code> field added. Modes --mode summary and --mode toc are exempt: the status line would clutter the link-list and outline outputs. M3 fast-fails (known-bad-sites) also skip the status line because they exit before the formatter is reached. 7 new tests in webclaw-core (workspace total 671 -> 678).	2026-05-23 21:29:26 +02:00
devnen	66974366d7	feat(core): schema-aware JSON-LD parser + --prefer-structured + --articles-from-jsonld JSON-LD is consistently the cleanest source on major outlets (Reuters, BBC, Le Monde, N1, Pitchfork). Webclaw already emitted a raw Structured Data block at the bottom of -f llm output; this iter teaches it to parse the JSON-LD by schema and surface it usefully. New schema-aware parser at crates/webclaw-core/src/jsonld.rs classifies items by @type into: ItemList, LiveBlogPosting, NewsArticle, Review, WebPageOrChrome, Unknown. CollectionPage with mainEntity ItemList is auto-lifted (Reuters CollectionPage shape). Two new CLI flags: --prefer-structured: surfaces the schema-aware block at the TOP of the output, before prose. For -f llm emits a Markdown summary block; for -f json emits a {structured, extracted} envelope. Bypasses the default DROP list for WebPage/chrome types when explicitly requested. --articles-from-jsonld: when the page contains ItemList or LiveBlogPosting, output ONLY a JSON array of articles ({position, title, url, published}). When no such schema is present, emit a stderr hint and fall through to default extraction (no error). Default behavior (neither flag set) byte-identical to iter-3 on all default-flag probes (regression sentinel passed): Cyrillic p14 still 7735 B, M1 caps p18/p19/p20 deterministic, M2 hub p40/p41 byte-identical, M3 registry p44/p45/p46 still fast-fail with exit 67. 14 new tests in webclaw-core covering schema-variant parsing, parse error handling, fall-through behavior, flag combinations, and the default-byte-identical sentinel. Workspace tests 657 -> 671.	2026-05-23 20:38:59 +02:00
devnen	31a8f6150f	feat(core): JS-hub page detector + --prefer-articles flag Detects ESPN-style hub pages (espn.com/nba/, /nfl/, /mlb/, /nhl/, /soccer/) where the rendered markup has nav-only content with no article bodies — chrome retry doesn't help because the data genuinely isn't in the markup. Heuristic: word_count < 500 AND link_count >= 5 against the extracted output. --prefer-articles: when set, a hub-classified page returns the extracted link list (reusing the M1 --mode summary machinery) instead of the sparse body. On non-hub pages, behavior is unchanged. stderr hint: always emitted on hub detection so the caller knows to drill /story/_/id/<id>/ URLs from a citation list. False-positive resistance verified: BBC News /world (link-heavy aggregator, 1500+ words body) and n1info.rs (widget-heavy but content-rich) both classify as non-hub and emit full extraction. 9 new tests in webclaw-core (317 -> 326).	2026-05-23 18:55:17 +02:00
devnen	339f41bb7c	feat(cli): add --max-output-bytes and --mode summary,toc for output-size control Three additive CLI flags addressing the 50KB persisted-output cap that trips Claude Code's per-tool-result harness on aggregator front pages (apnews.com, cnbc.com/markets/, b92.net all >50KB by default): --max-output-bytes N: truncates final output at N bytes with a clear '[truncated: M more bytes ...]' footer. N=0 means unlimited (default). UTF-8 codepoint-boundary safe; also wraps JSON output so truncated output stays parseable. --mode summary: returns only the extracted link list (titles + URLs), no body text. For aggregator front pages where the LLM is going to drill the individual articles next anyway. --mode toc: returns H1/H2 outline + first paragraph after each H2. For long single-article pages. New flags are orthogonal to -f (json/llm/text). 9 new unit tests in webclaw-core, total goes 308 -> 317 passing. Smoke-tested on apnews.com (51713 -> 27404 summary -> 6269 toc -> 8193 capped), pitchfork.com (42049 -> 379 summary), cnbc.com (56682 -> 16385 capped).	2026-05-23 18:17:42 +02:00
Valerio	be8bcfebd9	fix: harden resource limits, path safety, and WASM build (#46) Security audit follow-up across the workspace: - webclaw-core: keep the crate WASM-safe. quickjs/rquickjs is now a cfg(not(wasm32)) target dependency and the extraction entry point uses a direct call on wasm instead of spawning a thread, so it builds and runs on wasm32 with or without default features. - webclaw-core: bound the structured-data scrubber recursion (depth cap) so deeply nested attacker JSON-LD / __NEXT_DATA__ cannot exhaust the stack. - webclaw-fetch: stream the response body with a running ceiling so a small highly compressed payload cannot inflate to gigabytes in memory; redact user:pass@ from proxy URLs before they reach error strings. - webclaw-cli: contain output filenames inside the chosen directory (reject .. / absolute, drop traversal path segments), run --webhook URLs through the public-URL SSRF guard, clamp --watch-interval to >=1s, and make research slug truncation char-safe. - webclaw-mcp: char-safe slug truncation (no multibyte slice panic). - setup.sh / deploy/hetzner.sh: replace eval on read input with printf -v, and mask auth key / API token in console output. - CI: enforce the wasm32 build invariant for webclaw-core. Tests added for every behavioral change. Bump to 0.6.3 + CHANGELOG.	2026-05-19 17:03:52 +02:00
Valerio	3fabdc1d02	fix: clean llm output noise Port the valid PR #43 LLM cleanup fixes onto current main without stale branch regressions. Includes comment-count link cleanup, bare numeric paragraph cleanup, pagination leftover cleanup, JSON-LD article body scrubbing, clearer CLI consent-wall warnings, and quieter parser logs by default. Thanks to @devnen for the report and patch work.	2026-05-18 18:39:33 +02:00
devnen	e8ca1417d6	Improve --format llm output quality (#37) Improve LLM-format output for modern news and documentation pages. - Filter noisy hydration and low-value page chrome structured data while preserving content-bearing Schema.org records - Fix element/text spacing without detaching punctuation on docs, forums, and reference pages - Remove common accessibility link chrome from LLM text and link labels - Bump workspace version to 0.6.0 and update the changelog Thanks to Nenad Oric (@devnen) for the original PR and contribution.	2026-05-10 15:11:12 +02:00
Valerio	344eea74d9	feat: structured data in markdown/LLM output + v0.3.6 __NEXT_DATA__, SvelteKit, and JSON-LD now appear as a ## Structured Data section in -f markdown and -f llm output. Works with --only-main-content and all extraction flags. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 19:16:56 +02:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 - web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io	2026-03-23 18:31:11 +01:00
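
Commit 339f41bb7c describes `--max-output-bytes N` as truncating at N bytes on a UTF-8 codepoint boundary, with N=0 meaning unlimited. Commit be8bcfebd9 mentions the same concern for char-safe slug truncation. The following is a minimal sketch of that idea, not webclaw's actual code: the function name `truncate_output` is hypothetical, the footer wording is shortened because the log elides the full text, and appending the footer after the cut (rather than counting it toward N) is an assumption.

```rust
/// Hypothetical helper (not taken from the webclaw source): cap `text` at
/// `max_bytes` bytes without splitting a multi-byte UTF-8 character.
fn truncate_output(text: &str, max_bytes: usize) -> String {
    // 0 means "unlimited", as the commit message states for the flag default.
    if max_bytes == 0 || text.len() <= max_bytes {
        return text.to_string();
    }
    // Step back from the byte limit until we land on a char boundary.
    // Index 0 is always a boundary, so the loop terminates.
    let mut cut = max_bytes;
    while !text.is_char_boundary(cut) {
        cut -= 1;
    }
    let remaining = text.len() - cut;
    // Assumed footer shape; the real footer text is elided in the log.
    format!("{}\n[truncated: {} more bytes]", &text[..cut], remaining)
}
```

Slicing with `&text[..max_bytes]` directly would panic when the limit falls inside a multi-byte character, which is the failure mode the "no multibyte slice panic" note refers to.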

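Commit be8bcfebd9 says webclaw-cli now contains output filenames inside the chosen directory. A rough sketch of one way to do that with the standard library follows. The helper name `contained_path` is hypothetical, and rejecting any name that contains `..` or an absolute root (instead of silently dropping those segments) is an assumption; the log mentions both behaviors without saying where each applies.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical helper (not taken from the webclaw source): join `name`
/// under `dir`, accepting only plain path segments.
fn contained_path(dir: &Path, name: &str) -> Option<PathBuf> {
    let mut out = dir.to_path_buf();
    let mut pushed = false;
    for comp in Path::new(name).components() {
        match comp {
            // Ordinary file or directory name: keep it.
            Component::Normal(seg) => {
                out.push(seg);
                pushed = true;
            }
            // "." adds nothing, so skip it.
            Component::CurDir => {}
            // "..", a root, or a drive prefix could escape `dir`: reject.
            _ => return None,
        }
    }
    // An empty or "."-only name does not identify a file.
    if pushed { Some(out) } else { None }
}
```
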
11 commits
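
The hub detector (31a8f6150f) and the thin-body classifier (76cd515a3e) are both described as simple threshold checks on the extracted output: hub means word_count < 500 AND link_count >= 5, and thin means fewer than 500 words on a host that is not in the utility-domain exemption list. The sketch below restates those two rules as written. The function names, constant names, and the two-entry exemption list are illustrative assumptions (the log names only example.com and httpbin.org and ends the list with "etc"), and how the two checks interact in webclaw is not stated in the log, so they are kept separate here.

```rust
// Thresholds quoted in the commit messages.
const THIN_WORD_LIMIT: usize = 500;
const HUB_MIN_LINKS: usize = 5;

// Assumed subset of the exemption list; the full list is not in the log.
const EXEMPT_HOSTS: [&str; 2] = ["example.com", "httpbin.org"];

/// Hypothetical: hub rule as stated, word_count < 500 AND link_count >= 5.
fn is_hub(word_count: usize, link_count: usize) -> bool {
    word_count < THIN_WORD_LIMIT && link_count >= HUB_MIN_LINKS
}

/// Hypothetical: thin-body rule, under 500 words unless the host is exempt.
fn is_thin(host: &str, word_count: usize) -> bool {
    let exempt = EXEMPT_HOSTS.iter().any(|h| *h == host);
    word_count < THIN_WORD_LIMIT && !exempt
}
```

Under these rules a link-heavy aggregator with a 1500-word body is neither a hub nor thin, which matches the false-positive check the hub commit reports for BBC News /world.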