feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22)

* feat(fetch,llm): DoS hardening via response caps + glob validation (P2)

Response body caps:
- webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks
  Content-Length up front (before the allocation) and the actual
  .bytes() length after (belt-and-braces against lying upstreams).
  Previously the HTML -> markdown conversion downstream could allocate
  multiple String copies per page; a 100 MB page would OOM the process.
- webclaw-llm providers (anthropic/openai/ollama) share a new
  response_json_capped helper with a 5 MB cap. Protects against a
  malicious or runaway provider response exhausting memory.
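
The two-stage check described above (declared size first, actual size after the read) can be sketched with plain functions. This is an illustrative sketch, not the code from webclaw-fetch: `MAX_BODY_BYTES`, `check_declared_len`, and `check_actual_len` are hypothetical names, and reading "50 MB" as 50 * 1024 * 1024 bytes is an assumption.

```rust
/// Hypothetical cap for illustration: "50 MB" read as 50 * 1024 * 1024 bytes.
const MAX_BODY_BYTES: u64 = 50 * 1024 * 1024;

/// Stage 1: reject on the declared Content-Length before reading the body.
/// A missing header (`None`) passes here, so stage 2 is still required.
fn check_declared_len(content_length: Option<u64>, cap: u64) -> Result<(), String> {
    match content_length {
        Some(len) if len > cap => Err(format!("declared body {len} bytes exceeds cap {cap}")),
        _ => Ok(()),
    }
}

/// Stage 2: reject on the number of bytes actually received, which catches
/// servers that omit or understate Content-Length.
fn check_actual_len(body: &[u8], cap: u64) -> Result<(), String> {
    let len = body.len() as u64;
    if len > cap {
        Err(format!("received body {len} bytes exceeds cap {cap}"))
    } else {
        Ok(())
    }
}
```

Stage 1 is the only check that avoids the allocation; stage 2 runs after the body is already in memory and only stops it from reaching downstream conversion.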

Crawler frontier cap: after each BFS depth level the frontier is
truncated to max(max_pages * 10, 100) entries, keeping the most
recently discovered links. Dense pages (tag clouds, search results)
used to push the frontier into the tens of thousands even after
max_pages halted new fetches.
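
A minimal sketch of the truncation rule above, assuming the frontier is a `Vec<String>` whose newest links are appended at the end. The function names and the container type are hypothetical; the crawler's real data structure may differ.

```rust
/// Cap from the commit message: max(max_pages * 10, 100).
/// saturating_mul guards against overflow for very large max_pages.
fn frontier_cap(max_pages: usize) -> usize {
    max_pages.saturating_mul(10).max(100)
}

/// Drop the oldest entries so that at most `frontier_cap(max_pages)` remain.
/// Assumes links are pushed in discovery order, so the tail is the newest.
fn truncate_frontier(frontier: &mut Vec<String>, max_pages: usize) {
    let cap = frontier_cap(max_pages);
    if frontier.len() > cap {
        let excess = frontier.len() - cap;
        frontier.drain(..excess);
    }
}
```

With `max_pages = 5` the floor of 100 applies, so a 250-link frontier keeps only its last 100 entries.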

Glob pattern validation: user-supplied include_patterns /
exclude_patterns are rejected at Crawler::new if they contain more
than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher
degrades exponentially on deeply-nested `**` against long paths.
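
The validation rule above can be sketched as follows. The name `validate_glob` appears in the test summary of this commit, but the signature, the constants, and the counting method (non-overlapping `**` matches, length in chars) are assumptions made for illustration.

```rust
/// Hypothetical limits taken from the commit message.
const MAX_DOUBLE_STAR: usize = 4;
const MAX_PATTERN_CHARS: usize = 1024;

/// Reject patterns that are too long or contain too many `**` wildcards.
/// `str::matches` counts non-overlapping occurrences, so `***` counts once.
fn validate_glob(pattern: &str) -> Result<(), String> {
    let chars = pattern.chars().count();
    if chars > MAX_PATTERN_CHARS {
        return Err(format!("pattern is {chars} chars, limit is {MAX_PATTERN_CHARS}"));
    }
    let stars = pattern.matches("**").count();
    if stars > MAX_DOUBLE_STAR {
        return Err(format!("pattern has {stars} `**` wildcards, limit is {MAX_DOUBLE_STAR}"));
    }
    Ok(())
}
```

Running the check once at construction time keeps the matcher itself unchanged and moves the failure to where the user-supplied pattern enters the system.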

Cleanup:
- Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs;
  no warnings surfaced, so the suppression was obsolete.
- core/.gitignore: replaced overbroad *.json with specific local-
  artifact patterns (previous rule would have swallowed package.json,
  components.json, .smithery/*.json).

Tests: +4 validate_glob tests. Full workspace test: 283 passed
(webclaw-core + webclaw-fetch + webclaw-llm).

Version: 0.3.15 -> 0.3.16
CHANGELOG updated.

Refs: docs/AUDIT-2026-04-16.md (P2 section)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: gitignore CLI research dumps, drop accidentally-tracked file

research-*.json output from `webclaw ... --research ...` got silently
swept into git by the relaxed *.json gitignore in the preceding commit.
The old blanket *.json rule was hiding both this legitimate scratch
file AND packages/create-webclaw/server.json (MCP registry config that
we DO want tracked).

Removes the research dump from git and adds a narrower research-*.json
ignore pattern so future CLI output doesn't get re-tracked by accident.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Valerio 2026-04-16 19:44:08 +02:00 committed by GitHub
parent 7773c8af2a
commit d69c50a31d
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
12 changed files with 219 additions and 13 deletions


@@ -95,7 +95,9 @@ impl LlmProvider for AnthropicProvider {
             )));
         }
-        let json: serde_json::Value = resp.json().await?;
+        // Read body with a size cap so a malicious or misbehaving
+        // endpoint can't allocate unbounded memory via resp.json().
+        let json = super::response_json_capped(resp).await?;
         // Anthropic response: {"content": [{"type": "text", "text": "..."}]}
         let raw = json["content"][0]["text"]


@@ -2,6 +2,8 @@ pub mod anthropic;
 pub mod ollama;
 pub mod openai;
+use crate::error::LlmError;
 /// Load an API key from an explicit override or an environment variable.
 /// Returns `None` if neither is set or the value is empty.
 pub(crate) fn load_api_key(override_key: Option<String>, env_var: &str) -> Option<String> {
@@ -9,6 +11,36 @@ pub(crate) fn load_api_key(override_key: Option<String>, env_var: &str) -> Optio
     if key.is_empty() { None } else { Some(key) }
 }
+/// Maximum bytes we'll pull from an LLM provider response. 5 MB is already
+/// ~5× the largest real payload any of these providers emits for normal
+/// completions; anything bigger is either a streaming bug on their end or
+/// an adversarial response aimed at exhausting our memory.
+pub(crate) const MAX_RESPONSE_BYTES: u64 = 5 * 1024 * 1024;
+/// Read a provider response as JSON, capping total bytes at
+/// [`MAX_RESPONSE_BYTES`]. Rejects via Content-Length if the server is
+/// honest about size; otherwise reads to completion and checks the actual
+/// byte length so an unbounded body still can't swallow unbounded memory.
+pub(crate) async fn response_json_capped(
+    resp: reqwest::Response,
+) -> Result<serde_json::Value, LlmError> {
+    if let Some(len) = resp.content_length()
+        && len > MAX_RESPONSE_BYTES
+    {
+        return Err(LlmError::ProviderError(format!(
+            "response body {len} bytes exceeds cap {MAX_RESPONSE_BYTES}"
+        )));
+    }
+    let bytes = resp.bytes().await?;
+    if bytes.len() as u64 > MAX_RESPONSE_BYTES {
+        return Err(LlmError::ProviderError(format!(
+            "response body {} bytes exceeds cap {MAX_RESPONSE_BYTES}",
+            bytes.len()
+        )));
+    }
+    serde_json::from_slice(&bytes).map_err(|e| LlmError::InvalidJson(format!("response body: {e}")))
+}
 #[cfg(test)]
 mod tests {
     use super::*;


@@ -80,7 +80,9 @@ impl LlmProvider for OllamaProvider {
             )));
         }
-        let json: serde_json::Value = resp.json().await?;
+        // Cap response body size to defend against adversarial payloads
+        // or a runaway local model streaming gigabytes.
+        let json = super::response_json_capped(resp).await?;
         let raw = json["message"]["content"]
             .as_str()


@@ -91,7 +91,8 @@ impl LlmProvider for OpenAiProvider {
             )));
         }
-        let json: serde_json::Value = resp.json().await?;
+        // Cap response body size to defend against adversarial payloads.
+        let json = super::response_json_capped(resp).await?;
         let raw = json["choices"][0]["message"]["content"]
             .as_str()
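
Note on the shared helper: `resp.bytes()` buffers the entire body before the second length check runs, so when Content-Length is absent the cap is only enforced after the allocation. A stricter variant would accumulate chunks and abort as soon as the running total passes the cap. The sketch below shows that idea over a plain iterator of chunks so it stays dependency-free; `read_capped` is a hypothetical name and is not part of this commit. With reqwest, the chunks could presumably come from repeated calls to `Response::chunk()`.

```rust
/// Accumulate chunks into one buffer, failing as soon as the running total
/// would exceed `cap`. Memory use is bounded by `cap` plus one chunk.
fn read_capped<I>(chunks: I, cap: usize) -> Result<Vec<u8>, String>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    let mut buf = Vec::new();
    for chunk in chunks {
        // Check before copying so the buffer never grows past the cap.
        if buf.len() + chunk.len() > cap {
            return Err(format!("response body exceeds cap {cap} bytes"));
        }
        buf.extend_from_slice(&chunk);
    }
    Ok(buf)
}
```

The trade-off is a slightly more involved read loop in exchange for a bound that holds even when the server never declares a size.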