Commit graph

7 commits

Author SHA1 Message Date
devnen
66974366d7 feat(core): schema-aware JSON-LD parser + --prefer-structured + --articles-from-jsonld
JSON-LD is consistently the cleanest source on major outlets (Reuters,
BBC, Le Monde, N1, Pitchfork). Webclaw already emitted a raw Structured
Data block at the bottom of -f llm output; this iter teaches it to
parse the JSON-LD by schema and surface it usefully.

New schema-aware parser at crates/webclaw-core/src/jsonld.rs classifies
items by @type into: ItemList, LiveBlogPosting, NewsArticle, Review,
WebPageOrChrome, Unknown. CollectionPage with mainEntity ItemList is
auto-lifted (Reuters CollectionPage shape).

Two new CLI flags:

--prefer-structured: surfaces the schema-aware block at the TOP of the
output, before prose. For -f llm emits a Markdown summary block; for
-f json emits a {structured, extracted} envelope. Bypasses the default
DROP list for WebPage/chrome types when explicitly requested.

--articles-from-jsonld: when the page contains ItemList or
LiveBlogPosting, output ONLY a JSON array of articles
({position, title, url, published}). When no such schema is present,
emit a stderr hint and fall through to default extraction (no error).

Default behavior (neither flag set) byte-identical to iter-3 on all
default-flag probes (regression sentinel passed): Cyrillic p14 still
7735 B, M1 caps p18/p19/p20 deterministic, M2 hub p40/p41 byte-identical,
M3 registry p44/p45/p46 still fast-fail with exit 67.

14 new tests in webclaw-core covering schema-variant parsing, parse
error handling, fall-through behavior, flag combinations, and the
default-byte-identical sentinel. Workspace tests 657 -> 671.
2026-05-23 20:38:59 +02:00
devnen
e28b22adf7 feat(fetch): known-bad-sites registry for fast-fail on Cloudflare / adblock walls
Sites known to require CAPTCHA-solving (Cloudflare interstitials) or
browser-side ad-blocker bypass (JS+adblock walls like Liberation) cannot
be reached by webclaw's chrome impersonation; they return interstitial
stubs ('Just a moment...', 'Please enable JS and disable any ad blocker')
with 0 useful content. Currently each call wastes 5-10s on the timeout
before the caller sees the failure.

New registry under crates/webclaw-fetch/src/known_bad_sites.rs lists
known bad hosts with a category (CloudflareInterstitial / AdblockWall)
and suggested substitute domains. Host matching: lowercase + strip
leading 'www.' + exact-match against registered host.

On registry hit, webclaw writes 'error: <host> is <category>-walled;
suggested substitute: <alt1>, <alt2>' to stderr and exits with code 67
(EX_NOHOST), BEFORE making any network call. wall_ms drops from ~5000
to <50 for listed hosts.

Initial entries: ambito.com (Cloudflare; substitutes cronista.com,
iprofesional.com), liberation.fr (adblock; substitutes lemonde.fr,
lepoint.fr). WSJ/FT/Bloomberg/NYT are NOT included -- those are
subscription paywalls with different bypass semantics; deferred to M11.

10 new tests in webclaw-fetch covering host normalization, www
stripping, path-under-host matching, case insensitivity, unknown-domain
pass-through, and the formatted error message (9 unit + 1 fetch-layer
integration). Workspace test total 647 -> 657.
2026-05-23 19:42:15 +02:00
devnen
562c6a15f0 gitignore: cover improve-loop and local build artifacts
improve-loop's loop.py writes baselines/, .loop-scratch/ and a
*-loop-progress.log per run. _build-release.bat / _build-release.log are
a local wrapper for invoking cargo build with the right MSVC + LLVM +
NASM env (replaces the missing update.py from CLAUDE.md). None should
land in git.
2026-05-23 17:42:05 +02:00
devnen
e620173d3a docs+gitignore: portable-install sync note and local scratch ignores
CLAUDE.md gains a mandatory step at the top describing the rebuild->copy->
verify dance for the portable Claude Code install at C:\_projects\claude-
portable, plus a local-build env snippet for the BoringSSL bindgen vars
(LIBCLANG_PATH, NASM on PATH) that update.py sets automatically but a plain
shell does not.

.gitignore adds runtime/scratch entries that shouldn't have been tracked:
__pycache__/, .last_update_check, .playwright-cli/, demo_sample.html,
demo_saved.json. Nothing currently tracked is affected (none of these were
under version control).
2026-05-23 17:29:05 +02:00
Valerio
d69c50a31d
feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22)
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run
* feat(fetch,llm): DoS hardening via response caps + glob validation (P2)

Response body caps:
- webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks
  Content-Length up front (before the allocation) and the actual
  .bytes() length after (belt-and-braces against lying upstreams).
  Previously the HTML -> markdown conversion downstream could allocate
  multiple String copies per page; a 100 MB page would OOM the process.
- webclaw-llm providers (anthropic/openai/ollama) share a new
  response_json_capped helper with a 5 MB cap. Protects against a
  malicious or runaway provider response exhausting memory.

Crawler frontier cap: after each BFS depth level the frontier is
truncated to max(max_pages * 10, 100) entries, keeping the most
recently discovered links. Dense pages (tag clouds, search results)
used to push the frontier into the tens of thousands even after
max_pages halted new fetches.

Glob pattern validation: user-supplied include_patterns /
exclude_patterns are rejected at Crawler::new if they contain more
than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher
degrades exponentially on deeply-nested `**` against long paths.

Cleanup:
- Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs;
  no warnings surfaced, the suppression was obsolete.
- core/.gitignore: replaced overbroad *.json with specific local-
  artifact patterns (previous rule would have swallowed package.json,
  components.json, .smithery/*.json).

Tests: +4 validate_glob tests. Full workspace test: 283 passed
(webclaw-core + webclaw-fetch + webclaw-llm).

Version: 0.3.15 -> 0.3.16
CHANGELOG updated.

Refs: docs/AUDIT-2026-04-16.md (P2 section)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: gitignore CLI research dumps, drop accidentally-tracked file

research-*.json output from `webclaw ... --research ...` got silently
swept into git by the relaxed *.json gitignore in the preceding commit.
The old blanket *.json rule was hiding both this legitimate scratch
file AND packages/create-webclaw/server.json (MCP registry config that
we DO want tracked).

Removes the research dump from git and adds a narrower research-*.json
ignore pattern so future CLI output doesn't get re-tracked by accident.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 19:44:08 +02:00
Valerio
050b2ef463 feat: add allow_subdomains and allow_external_links to CrawlConfig
Crawls are same-origin by default. Enable allow_subdomains to follow
sibling/child subdomains (blog.example.com from example.com), or
allow_external_links for full cross-origin crawling.

Root domain extraction uses a heuristic that handles two-part TLDs
(co.uk, com.au). Includes 5 unit tests for root_domain().

Bump to 0.3.12.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:33:06 +02:00
Valerio
c99ec684fa Initial release: webclaw v0.1.0 — web content extraction for LLMs
CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io
2026-03-23 18:31:11 +01:00