webclaw/packages/create-webclaw
Valerio d69c50a31d
feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22)
* feat(fetch,llm): DoS hardening via response caps + glob validation (P2)

Response body caps:
- webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks
  Content-Length up front (before the allocation) and the actual
  .bytes() length after (belt-and-braces against lying upstreams).
  Previously the HTML -> markdown conversion downstream could allocate
  multiple String copies per page; a 100 MB page would OOM the process.
- webclaw-llm providers (anthropic/openai/ollama) share a new
  response_json_capped helper with a 5 MB cap. Protects against a
  malicious or runaway provider response exhausting memory.
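The dual-check shape can be sketched as follows; the constant matches the 50 MB cap above, but the function names and signatures are illustrative, not the actual webclaw-fetch API:

```rust
// Illustrative sketch of the two-gate body cap; not the real webclaw-fetch code.
const MAX_BODY_BYTES: u64 = 50 * 1024 * 1024; // 50 MB

/// First gate: reject on the declared Content-Length, before any allocation.
fn check_declared_length(content_length: Option<u64>) -> Result<(), String> {
    match content_length {
        Some(len) if len > MAX_BODY_BYTES => {
            Err(format!("response body too large: {len} bytes declared"))
        }
        _ => Ok(()),
    }
}

/// Second gate: re-check the bytes actually received, since upstreams can
/// lie about Content-Length or omit it entirely.
fn check_received_length(body: &[u8]) -> Result<(), String> {
    if body.len() as u64 > MAX_BODY_BYTES {
        return Err(format!(
            "response body too large: {} bytes received",
            body.len()
        ));
    }
    Ok(())
}
```

The first gate is what prevents the allocation; the second is the belt-and-braces check for missing or dishonest headers.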

Crawler frontier cap: after each BFS depth level the frontier is
truncated to max(max_pages * 10, 100) entries, keeping the most
recently discovered links. Dense pages (tag clouds, search results)
used to push the frontier into the tens of thousands even after
max_pages halted new fetches.
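The truncation step can be sketched like this, with `Vec<String>` standing in for the real frontier type and `cap_frontier` as an illustrative name:

```rust
// Sketch of the per-depth frontier cap; the real crawler's frontier type
// and function names may differ.
fn cap_frontier(frontier: &mut Vec<String>, max_pages: usize) {
    let cap = (max_pages * 10).max(100);
    if frontier.len() > cap {
        // Newly discovered links sit at the tail of the queue, so drop
        // from the front to keep the most recent `cap` entries.
        let excess = frontier.len() - cap;
        frontier.drain(..excess);
    }
}
```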

Glob pattern validation: user-supplied include_patterns /
exclude_patterns are rejected at Crawler::new if they contain more
than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher
degrades exponentially on deeply-nested `**` against long paths.
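A hypothetical mirror of those two checks; the real `validate_glob` in webclaw-core may differ in detail:

```rust
// Illustrative limits matching the commit description above.
const MAX_PATTERN_LEN: usize = 1024;
const MAX_DOUBLE_STARS: usize = 4;

fn validate_glob(pattern: &str) -> Result<(), String> {
    if pattern.len() > MAX_PATTERN_LEN {
        return Err(format!("glob pattern exceeds {MAX_PATTERN_LEN} chars"));
    }
    // Count non-overlapping `**` occurrences; each one widens the
    // backtracking matcher's search space.
    let double_stars = pattern.matches("**").count();
    if double_stars > MAX_DOUBLE_STARS {
        return Err(format!(
            "glob pattern has {double_stars} `**` wildcards (max {MAX_DOUBLE_STARS})"
        ));
    }
    Ok(())
}
```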

Cleanup:
- Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs;
  no warnings surfaced; the suppression was obsolete.
- core/.gitignore: replaced overbroad *.json with specific local-
  artifact patterns (previous rule would have swallowed package.json,
  components.json, .smithery/*.json).

Tests: +4 validate_glob tests. Full workspace test: 283 passed
(webclaw-core + webclaw-fetch + webclaw-llm).

Version: 0.3.15 -> 0.3.16
CHANGELOG updated.

Refs: docs/AUDIT-2026-04-16.md (P2 section)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: gitignore CLI research dumps, drop accidentally-tracked file

research-*.json output from `webclaw ... --research ...` got silently
swept into git by the relaxed *.json gitignore in the preceding commit.
The old blanket *.json rule was hiding both this legitimate scratch
file AND packages/create-webclaw/server.json (MCP registry config that
we DO want tracked).

Removes the research dump from git and adds a narrower research-*.json
ignore pattern so future CLI output doesn't get re-tracked by accident.
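The resulting ignore rule might look like the following (illustrative; the actual core/.gitignore entry may differ in comment and placement):

```
# CLI research output: local scratch files, never tracked
research-*.json
```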

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 19:44:08 +02:00
Files:
  index.mjs     Initial release: webclaw v0.1.0 — web content extraction for LLMs (2026-03-23)
  package.json  feat: add allow_subdomains and allow_external_links to CrawlConfig (2026-04-14)
  README.md     docs: update npm package license to AGPL-3.0 (2026-04-02)
  server.json   feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22) (2026-04-16)

webclaw

One command to give your AI agent reliable web access.
No headless browser. No Puppeteer. No 403s.



Quick Start

npx create-webclaw

That's it. Auto-detects your AI tools, downloads the MCP server, configures everything.

Works with Claude Desktop, Claude Code, Cursor, Windsurf, VS Code, OpenCode, Codex CLI, and Antigravity.


The Problem

Your AI agent calls fetch() and gets a 403. Cloudflare, Akamai, and every major CDN fingerprint the TLS handshake and block non-browser clients before the request hits the server.

When it does work, you get 100KB+ of raw HTML — navigation, ads, cookie banners, scripts. Your agent burns 4,000+ tokens parsing noise.

The Fix

webclaw impersonates Chrome 146 at the TLS protocol level. Perfect JA4 fingerprint. Perfect HTTP/2 Akamai hash. 99% bypass rate on 102 tested sites.

Then it extracts just the content — clean markdown, 67% fewer tokens.

                     Raw HTML                          webclaw
┌──────────────────────────────────┐    ┌──────────────────────────────────┐
│ <div class="ad-wrapper">         │    │ # Breaking: AI Breakthrough      │
│ <nav class="global-nav">         │    │                                  │
│ <script>window.__NEXT_DATA__     │    │ Researchers achieved 94%         │
│ ={...8KB of JSON...}</script>    │    │ accuracy on cross-domain         │
│ <div class="social-share">       │    │ reasoning benchmarks.            │
│ <!-- 142,847 characters -->      │    │                                  │
│                                  │    │ ## Key Findings                  │
│         4,820 tokens             │    │         1,590 tokens             │
└──────────────────────────────────┘    └──────────────────────────────────┘

What It Does

npx create-webclaw
  1. Detects installed AI tools (Claude, Cursor, Windsurf, VS Code, OpenCode, Codex, Antigravity)
  2. Downloads the webclaw-mcp binary for your platform (macOS arm64/x86, Linux x86/arm64)
  3. Asks for your API key (optional — works locally without one)
  4. Writes the MCP config for each detected tool
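The written config follows the common MCP server-entry shape; the snippet below is illustrative (the binary path, entry key, and whether an `env` block is present depend on your platform, tool, and whether you supplied an API key):

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "/path/to/webclaw-mcp",
      "env": {
        "SERPER_API_KEY": "sk-..."
      }
    }
  }
}
```

Without an API key, the installer simply omits the `env` block and the offline tools still work.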

10 MCP Tools

After setup, your AI agent has access to:

| Tool | What it does | API key needed? |
| --- | --- | --- |
| scrape | Extract content from any URL | No |
| crawl | Recursively crawl a website | No |
| search | Web search + parallel scrape | Yes (Serper) |
| map | Discover URLs from sitemaps | No |
| batch | Extract multiple URLs in parallel | No |
| extract | LLM-powered structured extraction | Yes |
| summarize | Content summarization | Yes |
| diff | Track content changes | No |
| brand | Extract brand identity | No |
| research | Deep multi-page research | Yes |

8 of 10 tools work fully offline. No API key, no cloud, no tracking.

Supported Tools

| Tool | Config location |
| --- | --- |
| Claude Desktop | `~/Library/Application Support/Claude/claude_desktop_config.json` |
| Claude Code | `~/.claude.json` |
| Cursor | `.cursor/mcp.json` |
| Windsurf | `~/.codeium/windsurf/mcp_config.json` |
| VS Code (Continue) | `~/.continue/config.json` |
| OpenCode | `~/.opencode/config.json` |
| Codex CLI | `~/.codex/config.json` |
| Antigravity | `~/.antigravity/mcp.json` |

Sites That Work

webclaw gets through where default fetch() gets blocked:

Nike, Cloudflare, Bloomberg, Zillow, Indeed, Viagogo, Fansale, Wikipedia, Stripe, and 93 more. Tested on 102 sites with 99% success rate.

Alternative Install Methods

Homebrew

brew tap 0xMassi/webclaw && brew install webclaw

Docker

docker run --rm ghcr.io/0xmassi/webclaw https://example.com

Cargo

cargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli

Prebuilt Binaries

Download from GitHub Releases for macOS (arm64, x86_64) and Linux (x86_64, aarch64).


License

AGPL-3.0