mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-04-25 00:06:21 +02:00
feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22)
* feat(fetch,llm): DoS hardening via response caps + glob validation (P2) Response body caps: - webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks Content-Length up front (before the allocation) and the actual .bytes() length after (belt-and-braces against lying upstreams). Previously the HTML -> markdown conversion downstream could allocate multiple String copies per page; a 100 MB page would OOM the process. - webclaw-llm providers (anthropic/openai/ollama) share a new response_json_capped helper with a 5 MB cap. Protects against a malicious or runaway provider response exhausting memory. Crawler frontier cap: after each BFS depth level the frontier is truncated to max(max_pages * 10, 100) entries, keeping the most recently discovered links. Dense pages (tag clouds, search results) used to push the frontier into the tens of thousands even after max_pages halted new fetches. Glob pattern validation: user-supplied include_patterns / exclude_patterns are rejected at Crawler::new if they contain more than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher degrades exponentially on deeply-nested `**` against long paths. Cleanup: - Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs; no warnings surfaced, the suppression was obsolete. - core/.gitignore: replaced overbroad *.json with specific local- artifact patterns (previous rule would have swallowed package.json, components.json, .smithery/*.json). Tests: +4 validate_glob tests. Full workspace test: 283 passed (webclaw-core + webclaw-fetch + webclaw-llm). Version: 0.3.15 -> 0.3.16 CHANGELOG updated. Refs: docs/AUDIT-2026-04-16.md (P2 section) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: gitignore CLI research dumps, drop accidentally-tracked file research-*.json output from `webclaw ... --research ...` got silently swept into git by the relaxed *.json gitignore in the preceding commit. The old blanket *.json rule was hiding both this legitimate scratch file AND packages/create-webclaw/server.json (MCP registry config that we DO want tracked). Removes the research dump from git and adds a narrower research-*.json ignore pattern so future CLI output doesn't get re-tracked by accident. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
parent
7773c8af2a
commit
d69c50a31d
12 changed files with 219 additions and 13 deletions
17
packages/create-webclaw/server.json
Normal file
17
packages/create-webclaw/server.json
Normal file
|
|
@ -0,0 +1,17 @@
|
|||
{
|
||||
"$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json",
|
||||
"name": "io.github.0xMassi/webclaw",
|
||||
"title": "webclaw",
|
||||
"description": "Web extraction MCP server. Scrape, crawl, extract, summarize any URL to clean markdown.",
|
||||
"version": "0.1.4",
|
||||
"packages": [
|
||||
{
|
||||
"registryType": "npm",
|
||||
"identifier": "create-webclaw",
|
||||
"version": "0.1.4",
|
||||
"transport": {
|
||||
"type": "stdio"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
Loading…
Add table
Add a link
Reference in a new issue