Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server. https://webclaw.io

webclaw

The fastest web scraper for AI agents.
67% fewer tokens. Sub-millisecond extraction. Zero browser overhead.



Demo: Claude Code's built-in web_fetch → 403 Forbidden. webclaw → clean markdown.


Your AI agent calls fetch() and gets a 403. Or 142KB of raw HTML that burns through your token budget. webclaw fixes both.

It extracts clean, structured content from any URL using Chrome-level TLS fingerprinting — no headless browser, no Selenium, no Puppeteer. Output is optimized for LLMs: 67% fewer tokens than raw HTML, with metadata, links, and images preserved.

                     Raw HTML                          webclaw
┌──────────────────────────────────┐    ┌──────────────────────────────────┐
│ <div class="ad-wrapper">         │    │ # Breaking: AI Breakthrough      │
│ <nav class="global-nav">         │    │                                  │
│ <script>window.__NEXT_DATA__     │    │ Researchers achieved 94%         │
│ ={...8KB of JSON...}</script>    │    │ accuracy on cross-domain         │
│ <div class="social-share">       │    │ reasoning benchmarks.            │
│ <button>Tweet</button>           │    │                                  │
│ <footer class="site-footer">     │    │ ## Key Findings                  │
│ <!-- 142,847 characters -->      │    │ - 3x faster inference            │
│                                  │    │ - Open-source weights            │
│         4,820 tokens             │    │         1,590 tokens             │
└──────────────────────────────────┘    └──────────────────────────────────┘

Get Started (30 seconds)

For AI agents (Claude, Cursor, Windsurf, VS Code)

npx create-webclaw

Auto-detects your AI tools, downloads the MCP server, and configures everything. One command.

Homebrew (macOS/Linux)

brew tap 0xMassi/webclaw
brew install webclaw

Prebuilt binaries

Download from GitHub Releases for macOS (arm64, x86_64) and Linux (x86_64, aarch64).

Cargo (from source)

cargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-mcp

Docker

docker run --rm ghcr.io/0xmassi/webclaw https://example.com

Docker Compose (with Ollama for LLM features)

cp env.example .env
docker compose up -d

Why webclaw?

                     webclaw   Firecrawl   Trafilatura   Readability
Extraction accuracy  95.1%     n/a         80.6%         83.5%
Token efficiency     -67%      n/a         -55%          -51%
Speed (100KB page)   3.2ms     ~500ms      18.4ms        8.7ms
TLS fingerprinting   Yes       No          No            No
Self-hosted          Yes       No          Yes           Yes
MCP (Claude/Cursor)  Yes       No          No            No
No browser required  Yes       No          Yes           Yes
Cost                 Free      n/a         Free          Free

Choose webclaw if you want fast local extraction, LLM-optimized output, and native AI agent integration.


What it looks like

$ webclaw https://stripe.com -f llm

> URL: https://stripe.com
> Title: Stripe | Financial Infrastructure for the Internet
> Language: en
> Word count: 847

# Stripe | Financial Infrastructure for the Internet

Stripe is a suite of APIs powering online payment processing
and commerce solutions for internet businesses of all sizes.

## Products
- Payments — Accept payments online and in person
- Billing — Manage subscriptions and invoicing
- Connect — Build a marketplace or platform
...
$ webclaw https://github.com --brand

{
  "name": "GitHub",
  "colors": [{"hex": "#59636E", "usage": "Primary"}, ...],
  "fonts": ["Mona Sans", "ui-monospace"],
  "logos": [{"url": "https://github.githubassets.com/...", "kind": "svg"}]
}
$ webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50

Crawling... 50/50 pages extracted
---
# Page 1: https://docs.rust-lang.org/
...
# Page 2: https://docs.rust-lang.org/book/
...

MCP Server — 10 tools for AI agents


webclaw ships as an MCP server that plugs into Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Antigravity, Codex CLI, and any MCP-compatible client.

npx create-webclaw    # auto-detects and configures everything

Or manual setup — add to your Claude Desktop config:

{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp"
    }
  }
}

Then in Claude: "Scrape the top 5 results for 'web scraping tools' and compare their pricing" — it just works.

Available tools

Tool        Description                          Requires API key?
scrape      Extract content from any URL         No
crawl       Recursive site crawl                 No
map         Discover URLs from sitemaps          No
batch       Parallel multi-URL extraction        No
extract     LLM-powered structured extraction    No (needs Ollama)
summarize   Page summarization                   No (needs Ollama)
diff        Content change detection             No
brand       Brand identity extraction            No
search      Web search + scrape results          Yes
research    Deep multi-source research           Yes

8 of 10 tools work locally — no account, no API key, fully private.


Features

Extraction

  • Readability scoring — multi-signal content detection (text density, semantic tags, link ratio)
  • Noise filtering — strips nav, footer, ads, modals, cookie banners (Tailwind-safe)
  • Data island extraction — catches React/Next.js JSON payloads, JSON-LD, hydration data
  • YouTube metadata — structured data from any YouTube video
  • PDF extraction — auto-detected via Content-Type
  • 5 output formats — markdown, text, JSON, LLM-optimized, HTML
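
To make the data-island idea concrete, here is a minimal, self-contained sketch (illustrative only, not webclaw-core's actual implementation) of pulling raw JSON-LD payloads out of an HTML string:

```rust
/// Illustrative JSON-LD data-island extraction: collect the raw bodies of
/// <script type="application/ld+json"> blocks. A real parser also handles
/// attribute-order variations, whitespace, and nested script edge cases.
fn extract_json_ld(html: &str) -> Vec<String> {
    let open = r#"<script type="application/ld+json">"#;
    let mut islands = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find(open) {
        let body = &rest[start + open.len()..];
        match body.find("</script>") {
            Some(end) => {
                islands.push(body[..end].trim().to_string());
                rest = &body[end..];
            }
            None => break, // unterminated script tag: stop scanning
        }
    }
    islands
}

fn main() {
    let html = r#"<head><script type="application/ld+json">{"@type":"Article","headline":"Hi"}</script></head>"#;
    let islands = extract_json_ld(html);
    assert_eq!(islands.len(), 1);
    assert!(islands[0].contains("\"headline\""));
}
```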

Content control

webclaw URL --include "article, .content"       # CSS selector include
webclaw URL --exclude "nav, footer, .sidebar"    # CSS selector exclude
webclaw URL --only-main-content                  # Auto-detect main content

Crawling

webclaw URL --crawl --depth 3 --max-pages 100   # BFS same-origin crawl
webclaw URL --crawl --sitemap                    # Seed from sitemap
webclaw URL --map                                # Discover URLs only

LLM features (Ollama / OpenAI / Anthropic)

webclaw URL --summarize                          # Page summary
webclaw URL --extract-prompt "Get all prices"    # Natural language extraction
webclaw URL --extract-json '{"type":"object"}'   # Schema-enforced extraction

Change tracking

webclaw URL -f json > snap.json                  # Take snapshot
webclaw URL --diff-with snap.json                # Compare later

Brand extraction

webclaw URL --brand                              # Colors, fonts, logos, OG image

Proxy rotation

webclaw URL --proxy http://user:pass@host:port   # Single proxy
webclaw URLs --proxy-file proxies.txt            # Pool rotation
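
Pool rotation with --proxy-file boils down to round-robin selection over the file's entries. A minimal sketch of that idea (illustrative, not webclaw's implementation; assumes a non-empty pool):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Round-robin proxy pool: each pick() hands out the next proxy in the list,
/// wrapping around at the end. Safe to share across threads.
struct ProxyPool {
    proxies: Vec<String>,
    next: AtomicUsize,
}

impl ProxyPool {
    /// One proxy URL per line, as in proxies.example.txt; blank lines skipped.
    fn from_lines(text: &str) -> Self {
        let proxies = text
            .lines()
            .map(str::trim)
            .filter(|l| !l.is_empty())
            .map(String::from)
            .collect();
        Self { proxies, next: AtomicUsize::new(0) }
    }

    /// Panics on an empty pool; a real implementation would return Option.
    fn pick(&self) -> &str {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.proxies.len();
        &self.proxies[i]
    }
}

fn main() {
    let pool = ProxyPool::from_lines("http://p1:8080\nhttp://p2:8080\n");
    assert_eq!(pool.pick(), "http://p1:8080");
    assert_eq!(pool.pick(), "http://p2:8080");
    assert_eq!(pool.pick(), "http://p1:8080"); // wraps around
}
```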

Benchmarks

All numbers from real tests on 50 diverse pages. See benchmarks/ for methodology and reproduction instructions.

Extraction quality

Accuracy      webclaw     ███████████████████ 95.1%
              readability ████████████████▋   83.5%
              trafilatura ████████████████    80.6%
              newspaper3k █████████████▎      66.4%

Noise removal webclaw     ███████████████████ 96.1%
              readability █████████████████▊  89.4%
              trafilatura ██████████████████▏ 91.2%
              newspaper3k ███████████████▎    76.8%

Speed (pure extraction, no network)

10KB page     webclaw     ██                   0.8ms
              readability █████                2.1ms
              trafilatura ██████████           4.3ms

100KB page    webclaw     ██                   3.2ms
              readability █████                8.7ms
              trafilatura ██████████           18.4ms

Token efficiency (feeding to Claude/GPT)

Format        Tokens   vs raw HTML
Raw HTML      4,820    baseline
readability   2,340    -51%
trafilatura   2,180    -55%
webclaw llm   1,590    -67%
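
The percentages are relative reductions against the 4,820-token raw-HTML baseline; a quick arithmetic check:

```rust
fn main() {
    // Relative token savings vs the raw-HTML baseline from the table above.
    let baseline = 4820.0_f64;
    for (name, tokens) in [
        ("readability", 2340.0),
        ("trafilatura", 2180.0),
        ("webclaw llm", 1590.0),
    ] {
        let savings = 100.0 * (1.0 - tokens / baseline);
        println!("{name}: -{savings:.0}%"); // -51%, -55%, -67%
    }
}
```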

Crawl speed

Concurrency   webclaw     Crawl4AI    Scrapy
5             9.8 pg/s    5.2 pg/s    7.1 pg/s
10            18.4 pg/s   8.7 pg/s    12.3 pg/s
20            32.1 pg/s   14.2 pg/s   21.8 pg/s

Architecture

webclaw/
  crates/
    webclaw-core     Pure extraction engine. Zero network deps. WASM-safe.
    webclaw-fetch    HTTP client + TLS fingerprinting (wreq/BoringSSL). Crawler. Batch ops.
    webclaw-llm      LLM provider chain (Ollama -> OpenAI -> Anthropic)
    webclaw-pdf      PDF text extraction
    webclaw-mcp      MCP server (10 tools for AI agents)
    webclaw-cli      CLI binary

webclaw-core takes raw HTML as a &str and returns structured output. No I/O, no network, no allocator tricks. Can compile to WASM.
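
That "pure function over a &str" design is what keeps the engine WASM-safe. A toy sketch of the shape (hypothetical types; the crate's real API and output are richer, with markdown, links, and metadata):

```rust
/// Hypothetical shape of a pure extraction engine: borrowed HTML in,
/// owned structured output out, no I/O anywhere.
struct Extracted {
    title: Option<String>,
}

fn extract(html: &str) -> Extracted {
    // Toy title grab; the real engine does readability scoring,
    // noise filtering, data-island extraction, etc.
    let title = html
        .split("<title>")
        .nth(1)
        .and_then(|rest| rest.split("</title>").next())
        .map(str::to_string);
    Extracted { title }
}

fn main() {
    let doc = extract("<html><head><title>Hello</title></head></html>");
    assert_eq!(doc.title.as_deref(), Some("Hello"));
}
```

Because nothing here touches the network or filesystem, the same code compiles unchanged for native targets and wasm32.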


Configuration

Variable             Description
WEBCLAW_API_KEY      Cloud API key (enables bot bypass, JS rendering, search, research)
OLLAMA_HOST          Ollama URL for local LLM features (default: http://localhost:11434)
OPENAI_API_KEY       OpenAI API key for LLM features
ANTHROPIC_API_KEY    Anthropic API key for LLM features
WEBCLAW_PROXY        Single proxy URL
WEBCLAW_PROXY_FILE   Path to proxy pool file

Cloud API (optional)

For bot-protected sites, JS rendering, and advanced features, webclaw offers a hosted API at webclaw.io.

The CLI and MCP server work locally first. Cloud is used as a fallback when:

  • A site has bot protection (Cloudflare, DataDome, WAF)
  • A page requires JavaScript rendering
  • You use search or research tools

export WEBCLAW_API_KEY=wc_your_key

# Automatic: tries local first, cloud on bot detection
webclaw https://protected-site.com

# Force cloud
webclaw --cloud https://spa-site.com

SDKs

npm install @webclaw/sdk                  # TypeScript/JavaScript
pip install webclaw                        # Python
go get github.com/0xMassi/webclaw-go      # Go

Use cases

  • AI agents — Give Claude/Cursor/GPT real-time web access via MCP
  • Research — Crawl documentation, competitor sites, news archives
  • Price monitoring — Track changes with --diff-with snapshots
  • Training data — Prepare web content for fine-tuning with token-optimized output
  • Content pipelines — Batch extract + summarize in CI/CD
  • Brand intelligence — Extract visual identity from any website

Community

  • Discord — questions, feedback, show what you built
  • GitHub Issues — bug reports and feature requests

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Acknowledgments

TLS and HTTP/2 browser fingerprinting is powered by wreq and http2 by @0x676e67, who pioneered browser-grade HTTP/2 fingerprinting in Rust.

License

AGPL-3.0