- webclaw-llm: add explicit request + connect timeouts to the reqwest
client in every provider (anthropic, openai, ollama) with a shorter
timeout on the ollama health check, so a stalled provider fails fast.
- webclaw-llm: fix a panic when truncating a provider error body that
contains multibyte characters near the 500-char cut (char-safe take).
- webclaw-core: snap the endpoint-scan budget cut to a UTF-8 char
boundary so oversized scripts with non-ASCII content no longer panic.
- webclaw-core: rewrite js_literal_to_json to copy raw bytes instead of
`byte as char`, preserving multibyte UTF-8 in SvelteKit string values
rather than producing Latin-1 mojibake.
- webclaw-cli: have fire_webhook return its JoinHandle and await it at
the crawl/batch/batch-llm call sites, removing the fixed 500ms sleeps.
- webclaw-mcp: drop the up-front DNS pre-validation loop in batch that
aborted the whole request on one bad URL; the fetch layer already
applies the same SSRF guard per URL and reports per-URL errors.
- webclaw-fetch: include the port in the warmup homepage URL so hosts
on a non-default port are warmed correctly.
Adds regression tests for the UTF-8 endpoint-scan and SvelteKit cases.
* feat(fetch,llm): DoS hardening via response caps + glob validation (P2)
Response body caps:
- webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks
Content-Length up front (before the allocation) and the actual
.bytes() length after (belt-and-braces against lying upstreams).
Previously the HTML -> markdown conversion downstream could allocate
multiple String copies per page; a 100 MB page would OOM the process.
- webclaw-llm providers (anthropic/openai/ollama) share a new
response_json_capped helper with a 5 MB cap. Protects against a
malicious or runaway provider response exhausting memory.
Crawler frontier cap: after each BFS depth level the frontier is
truncated to max(max_pages * 10, 100) entries, keeping the most
recently discovered links. Dense pages (tag clouds, search results)
used to push the frontier into the tens of thousands even after
max_pages halted new fetches.
Glob pattern validation: user-supplied include_patterns /
exclude_patterns are rejected at Crawler::new if they contain more
than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher
degrades exponentially on deeply-nested `**` against long paths.
Cleanup:
- Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs;
no warnings surfaced, the suppression was obsolete.
- core/.gitignore: replaced overbroad *.json with specific local-
artifact patterns (previous rule would have swallowed package.json,
components.json, .smithery/*.json).
Tests: +4 validate_glob tests. Full workspace test: 283 passed
(webclaw-core + webclaw-fetch + webclaw-llm).
Version: 0.3.15 -> 0.3.16
CHANGELOG updated.
Refs: docs/AUDIT-2026-04-16.md (P2 section)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* chore: gitignore CLI research dumps, drop accidentally-tracked file
research-*.json output from `webclaw ... --research ...` got silently
swept into git by the relaxed *.json gitignore in the preceding commit.
The old blanket *.json rule was hiding both this legitimate scratch
file AND packages/create-webclaw/server.json (MCP registry config that
we DO want tracked).
Removes the research dump from git and adds a narrower research-*.json
ignore pattern so future CLI output doesn't get re-tracked by accident.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>