2026-03-23 18:31:11 +01:00
|
|
|
# Webclaw
|
|
|
|
|
|
|
|
|
|
Rust workspace: CLI + MCP server for web content extraction into LLM-optimized formats.
|
|
|
|
|
|
|
|
|
|
## Architecture
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
webclaw/
|
|
|
|
|
crates/
|
|
|
|
|
webclaw-core/ # Pure extraction engine. WASM-safe. Zero network deps.
|
|
|
|
|
# + ExtractionOptions (include/exclude CSS selectors)
|
|
|
|
|
# + diff engine (change tracking)
|
|
|
|
|
# + brand extraction (DOM/CSS analysis)
|
|
|
|
|
webclaw-fetch/ # HTTP client via primp. Crawler. Sitemap discovery. Batch ops.
|
|
|
|
|
# + proxy pool rotation (per-request)
|
|
|
|
|
# + PDF content-type detection
|
|
|
|
|
# + document parsing (DOCX, XLSX, CSV)
|
|
|
|
|
webclaw-llm/ # LLM provider chain (Ollama -> OpenAI -> Anthropic)
|
|
|
|
|
# + JSON schema extraction, prompt extraction, summarization
|
|
|
|
|
webclaw-pdf/ # PDF text extraction via pdf-extract
|
|
|
|
|
webclaw-mcp/ # MCP server (Model Context Protocol) for AI agents
|
|
|
|
|
webclaw-cli/ # CLI binary
|
feat(server): add OSS webclaw-server REST API binary (closes #29)
Self-hosters hitting docs/self-hosting were promised three binaries
but the OSS Docker image only shipped two. webclaw-server lived in
the closed-source hosted-platform repo, which couldn't be opened. This
adds a minimal axum REST API in the OSS repo so self-hosting actually
works without pretending to ship the cloud platform.
Crate at crates/webclaw-server/. Stateless, no database, no job queue,
single binary. Endpoints: GET /health, POST /v1/{scrape, crawl, map,
batch, extract, summarize, diff, brand}. JSON shapes mirror
api.webclaw.io for the endpoints OSS can support, so swapping between
self-hosted and hosted is a base-URL change.
Auth: optional bearer token via WEBCLAW_API_KEY / --api-key. Comparison
is constant-time (subtle::ConstantTimeEq). Open mode (no key) is
allowed and binds 127.0.0.1 by default; the Docker image flips
WEBCLAW_HOST=0.0.0.0 so the container is reachable out of the box.
Hard caps to keep naive callers from OOMing the process: crawl capped
at 500 pages synchronously, batch capped at 100 URLs / 20 concurrent.
For unbounded crawls or anti-bot bypass the docs point users at the
hosted API.
Dockerfile + Dockerfile.ci updated to copy webclaw-server into
/usr/local/bin and EXPOSE 3000. Workspace version bumped to 0.4.0
(new public binary).
2026-04-22 12:25:11 +02:00
|
|
|
webclaw-server/ # Minimal axum REST API (self-hosting; OSS counterpart
|
|
|
|
|
# of api.webclaw.io, without anti-bot / JS / jobs / auth)
|
2026-03-23 18:31:11 +01:00
|
|
|
```
|
|
|
|
|
|
feat(server): add OSS webclaw-server REST API binary (closes #29)
Self-hosters hitting docs/self-hosting were promised three binaries
but the OSS Docker image only shipped two. webclaw-server lived in
the closed-source hosted-platform repo, which couldn't be opened. This
adds a minimal axum REST API in the OSS repo so self-hosting actually
works without pretending to ship the cloud platform.
Crate at crates/webclaw-server/. Stateless, no database, no job queue,
single binary. Endpoints: GET /health, POST /v1/{scrape, crawl, map,
batch, extract, summarize, diff, brand}. JSON shapes mirror
api.webclaw.io for the endpoints OSS can support, so swapping between
self-hosted and hosted is a base-URL change.
Auth: optional bearer token via WEBCLAW_API_KEY / --api-key. Comparison
is constant-time (subtle::ConstantTimeEq). Open mode (no key) is
allowed and binds 127.0.0.1 by default; the Docker image flips
WEBCLAW_HOST=0.0.0.0 so the container is reachable out of the box.
Hard caps to keep naive callers from OOMing the process: crawl capped
at 500 pages synchronously, batch capped at 100 URLs / 20 concurrent.
For unbounded crawls or anti-bot bypass the docs point users at the
hosted API.
Dockerfile + Dockerfile.ci updated to copy webclaw-server into
/usr/local/bin and EXPOSE 3000. Workspace version bumped to 0.4.0
(new public binary).
2026-04-22 12:25:11 +02:00
|
|
|
Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (REST API for self-hosting).
|
2026-03-23 18:31:11 +01:00
|
|
|
|
|
|
|
|
### Core Modules (`webclaw-core`)
|
|
|
|
|
- `extractor.rs` — Readability-style scoring: text density, semantic tags, link density penalty
|
|
|
|
|
- `noise.rs` — Shared noise filter: tags, ARIA roles, class/ID patterns. Tailwind-safe.
|
|
|
|
|
- `data_island.rs` — JSON data island extraction for React SPAs, Next.js, Contentful CMS
|
|
|
|
|
- `markdown.rs` — HTML to markdown with URL resolution, asset collection
|
|
|
|
|
- `llm.rs` — 9-step LLM optimization pipeline (image strip, emphasis strip, link dedup, stat merge, whitespace collapse)
|
|
|
|
|
- `domain.rs` — Domain detection from URL patterns + DOM heuristics
|
|
|
|
|
- `metadata.rs` — OG, Twitter Card, standard meta tag extraction
|
|
|
|
|
- `types.rs` — Core data structures (ExtractionResult, Metadata, Content)
|
|
|
|
|
- `filter.rs` — CSS selector include/exclude filtering (ExtractionOptions)
|
|
|
|
|
- `diff.rs` — Content change tracking engine (snapshot diffing)
|
|
|
|
|
- `brand.rs` — Brand identity extraction from DOM structure and CSS
|
|
|
|
|
|
|
|
|
|
### Fetch Modules (`webclaw-fetch`)
|
|
|
|
|
- `client.rs` — FetchClient with primp TLS impersonation
|
|
|
|
|
- `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128)
|
|
|
|
|
- `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
|
|
|
|
|
- `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt)
|
|
|
|
|
- `batch.rs` — Multi-URL concurrent extraction
|
|
|
|
|
- `proxy.rs` — Proxy pool with per-request rotation
|
|
|
|
|
- `document.rs` — Document parsing: DOCX, XLSX, CSV auto-detection and extraction
|
|
|
|
|
- `search.rs` — Web search via Serper.dev with parallel result scraping
|
|
|
|
|
|
|
|
|
|
### LLM Modules (`webclaw-llm`)
|
|
|
|
|
- Provider chain: Ollama (local-first) -> OpenAI -> Anthropic
|
|
|
|
|
- JSON schema extraction, prompt-based extraction, summarization
|
|
|
|
|
|
|
|
|
|
### PDF Modules (`webclaw-pdf`)
|
|
|
|
|
- PDF text extraction via pdf-extract crate
|
|
|
|
|
|
|
|
|
|
### MCP Server (`webclaw-mcp`)
|
|
|
|
|
- Model Context Protocol server over stdio transport
|
|
|
|
|
- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
|
|
|
|
|
- Works with Claude Desktop, Claude Code, and any MCP client
|
|
|
|
|
- Uses `rmcp` crate (official Rust MCP SDK)
|
|
|
|
|
|
feat(server): add OSS webclaw-server REST API binary (closes #29)
Self-hosters hitting docs/self-hosting were promised three binaries
but the OSS Docker image only shipped two. webclaw-server lived in
the closed-source hosted-platform repo, which couldn't be opened. This
adds a minimal axum REST API in the OSS repo so self-hosting actually
works without pretending to ship the cloud platform.
Crate at crates/webclaw-server/. Stateless, no database, no job queue,
single binary. Endpoints: GET /health, POST /v1/{scrape, crawl, map,
batch, extract, summarize, diff, brand}. JSON shapes mirror
api.webclaw.io for the endpoints OSS can support, so swapping between
self-hosted and hosted is a base-URL change.
Auth: optional bearer token via WEBCLAW_API_KEY / --api-key. Comparison
is constant-time (subtle::ConstantTimeEq). Open mode (no key) is
allowed and binds 127.0.0.1 by default; the Docker image flips
WEBCLAW_HOST=0.0.0.0 so the container is reachable out of the box.
Hard caps to keep naive callers from OOMing the process: crawl capped
at 500 pages synchronously, batch capped at 100 URLs / 20 concurrent.
For unbounded crawls or anti-bot bypass the docs point users at the
hosted API.
Dockerfile + Dockerfile.ci updated to copy webclaw-server into
/usr/local/bin and EXPOSE 3000. Workspace version bumped to 0.4.0
(new public binary).
2026-04-22 12:25:11 +02:00
|
|
|
### REST API Server (`webclaw-server`)
|
|
|
|
|
- Axum 0.8, stateless, no database, no job queue
|
|
|
|
|
- 8 POST routes + /health, JSON shapes mirror api.webclaw.io where the
|
|
|
|
|
capability exists in OSS
|
|
|
|
|
- Constant-time bearer-token auth via `subtle::ConstantTimeEq` when
|
|
|
|
|
`--api-key` / `WEBCLAW_API_KEY` is set; otherwise open mode
|
|
|
|
|
- Hard caps: crawl ≤ 500 pages, batch ≤ 100 URLs, 20 concurrent
|
|
|
|
|
- Does NOT include: anti-bot bypass, JS rendering, async jobs,
|
|
|
|
|
multi-tenant auth, billing, proxy rotation, search/research/watch/
|
|
|
|
|
agent-scrape. Those live behind api.webclaw.io and are closed-source.
|
|
|
|
|
|
2026-03-23 18:31:11 +01:00
|
|
|
## Hard Rules
|
|
|
|
|
|
|
|
|
|
- **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible.
|
|
|
|
|
- **primp requires `[patch.crates-io]`** for patched rustls/h2 forks at workspace level.
|
|
|
|
|
- **RUSTFLAGS are set in `.cargo/config.toml`** — no need to pass manually.
|
|
|
|
|
- **webclaw-llm uses plain reqwest** (NOT primp-patched). LLM APIs don't need TLS fingerprinting.
|
|
|
|
|
- **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.
|
|
|
|
|
|
|
|
|
|
## Build & Test
|
|
|
|
|
|
|
|
|
|
```bash
|
|
|
|
|
cargo build --release # Both binaries
|
|
|
|
|
cargo test --workspace # All tests
|
|
|
|
|
cargo test -p webclaw-core # Core only
|
|
|
|
|
cargo test -p webclaw-llm # LLM only
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
## CLI
|
|
|
|
|
|
|
|
|
|
```bash
|
|
|
|
|
# Basic extraction
|
|
|
|
|
webclaw https://example.com
|
|
|
|
|
webclaw https://example.com --format llm
|
|
|
|
|
|
|
|
|
|
# Content filtering
|
|
|
|
|
webclaw https://example.com --include "article" --exclude "nav,footer"
|
|
|
|
|
webclaw https://example.com --only-main-content
|
|
|
|
|
|
|
|
|
|
# Batch + proxy rotation
|
|
|
|
|
webclaw url1 url2 url3 --proxy-file proxies.txt
|
|
|
|
|
webclaw --urls-file urls.txt --concurrency 10
|
|
|
|
|
|
|
|
|
|
# Sitemap discovery
|
|
|
|
|
webclaw https://docs.example.com --map
|
|
|
|
|
|
|
|
|
|
# Crawling (with sitemap seeding)
|
|
|
|
|
webclaw https://docs.example.com --crawl --depth 2 --max-pages 50 --sitemap
|
|
|
|
|
|
|
|
|
|
# Change tracking
|
|
|
|
|
webclaw https://example.com -f json > snap.json
|
|
|
|
|
webclaw https://example.com --diff-with snap.json
|
|
|
|
|
|
|
|
|
|
# Brand extraction
|
|
|
|
|
webclaw https://example.com --brand
|
|
|
|
|
|
|
|
|
|
# LLM features (Ollama local-first)
|
|
|
|
|
webclaw https://example.com --summarize
|
|
|
|
|
webclaw https://example.com --extract-prompt "Get all pricing tiers"
|
|
|
|
|
webclaw https://example.com --extract-json '{"type":"object","properties":{"title":{"type":"string"}}}'
|
|
|
|
|
|
|
|
|
|
# PDF (auto-detected via Content-Type)
|
|
|
|
|
webclaw https://example.com/report.pdf
|
|
|
|
|
|
|
|
|
|
# Browser impersonation: chrome (default), firefox, random
|
|
|
|
|
webclaw https://example.com --browser firefox
|
|
|
|
|
|
|
|
|
|
# Local file / stdin
|
|
|
|
|
webclaw --file page.html
|
|
|
|
|
cat page.html | webclaw --stdin
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
## Key Thresholds
|
|
|
|
|
|
|
|
|
|
- Scoring minimum: 50 chars text length
|
|
|
|
|
- Semantic bonus: +50 for `<article>`/`<main>`, +25 for content class/ID
|
|
|
|
|
- Link density: >50% = 0.1x score, >30% = 0.5x
|
|
|
|
|
- Data island fallback triggers when DOM word count < 30
|
|
|
|
|
- Eyebrow text max: 80 chars
|
|
|
|
|
|
|
|
|
|
## MCP Setup
|
|
|
|
|
|
|
|
|
|
Add to Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json`):
|
|
|
|
|
```json
|
|
|
|
|
{
|
|
|
|
|
"mcpServers": {
|
|
|
|
|
"webclaw": {
|
|
|
|
|
"command": "/path/to/webclaw-mcp"
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
## Skills
|
|
|
|
|
|
|
|
|
|
- `/scrape <url>` — extract content from a URL
|
|
|
|
|
- `/benchmark [url]` — run extraction performance benchmarks
|
|
|
|
|
- `/research <url>` — deep web research via crawl + extraction
|
|
|
|
|
- `/crawl <url>` — crawl a website
|
|
|
|
|
- `/commit` — conventional commit with change analysis
|
|
|
|
|
|
|
|
|
|
## Git
|
|
|
|
|
|
|
|
|
|
- Remote: `git@github.com:0xMassi/webclaw.git`
|
|
|
|
|
- Use `/commit` skill for commits
|