webclaw/CLAUDE.md

# Webclaw

Rust workspace: CLI + MCP server for web content extraction into LLM-optimized formats.

## Architecture

```
webclaw/
  crates/
    webclaw-core/     # Pure extraction engine. WASM-safe. Zero network deps.
                      # + ExtractionOptions (include/exclude CSS selectors)
                      # + diff engine (change tracking)
                      # + brand extraction (DOM/CSS analysis)
    webclaw-fetch/    # HTTP client via primp. Crawler. Sitemap discovery. Batch ops.
                      # + proxy pool rotation (per-request)
                      # + PDF content-type detection
                      # + document parsing (DOCX, XLSX, CSV)
    webclaw-llm/      # LLM provider chain (Ollama -> OpenAI -> Anthropic)
                      # + JSON schema extraction, prompt extraction, summarization
    webclaw-pdf/      # PDF text extraction via pdf-extract
    webclaw-mcp/      # MCP server (Model Context Protocol) for AI agents
    webclaw-cli/      # CLI binary
    webclaw-server/   # Minimal axum REST API (self-hosting; OSS counterpart
                      # of api.webclaw.io, without anti-bot / JS / jobs / auth)
```

Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (REST API for self-hosting).

### Core Modules (`webclaw-core`)
- `extractor.rs` — Readability-style scoring: text density, semantic tags, link density penalty
- `noise.rs` — Shared noise filter: tags, ARIA roles, class/ID patterns. Tailwind-safe.
- `data_island.rs` — JSON data island extraction for React SPAs, Next.js, Contentful CMS
- `markdown.rs` — HTML to markdown with URL resolution, asset collection
- `llm.rs` — 9-step LLM optimization pipeline (image strip, emphasis strip, link dedup, stat merge, whitespace collapse)
- `domain.rs` — Domain detection from URL patterns + DOM heuristics
- `metadata.rs` — OG, Twitter Card, standard meta tag extraction
- `types.rs` — Core data structures (ExtractionResult, Metadata, Content)
- `filter.rs` — CSS selector include/exclude filtering (ExtractionOptions)
- `diff.rs` — Content change tracking engine (snapshot diffing)
- `brand.rs` — Brand identity extraction from DOM structure and CSS

### Fetch Modules (`webclaw-fetch`)
- `client.rs` — FetchClient with primp TLS impersonation
- `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128)
- `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
- `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt)
- `batch.rs` — Multi-URL concurrent extraction
- `proxy.rs` — Proxy pool with per-request rotation
- `document.rs` — Document parsing: DOCX, XLSX, CSV auto-detection and extraction
- `search.rs` — Web search via Serper.dev with parallel result scraping

### LLM Modules (`webclaw-llm`)
- Provider chain: Ollama (local-first) -> OpenAI -> Anthropic
- JSON schema extraction, prompt-based extraction, summarization

### PDF Modules (`webclaw-pdf`)
- PDF text extraction via pdf-extract crate

### MCP Server (`webclaw-mcp`)
- Model Context Protocol server over stdio transport
- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
- Works with Claude Desktop, Claude Code, and any MCP client
- Uses `rmcp` crate (official Rust MCP SDK)

### REST API Server (`webclaw-server`)
- Axum 0.8, stateless, no database, no job queue
- 8 POST routes + /health, JSON shapes mirror api.webclaw.io where the
  capability exists in OSS
- Constant-time bearer-token auth via `subtle::ConstantTimeEq` when
  `--api-key` / `WEBCLAW_API_KEY` is set; otherwise open mode
- Hard caps: crawl ≤ 500 pages, batch ≤ 100 URLs, 20 concurrent
- Does NOT include: anti-bot bypass, JS rendering, async jobs,
  multi-tenant auth, billing, proxy rotation, search/research/watch/
  agent-scrape. Those live behind api.webclaw.io and are closed-source.

## Hard Rules

- **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible.
- **primp requires `[patch.crates-io]`** for patched rustls/h2 forks at workspace level.
- **RUSTFLAGS are set in `.cargo/config.toml`** — no need to pass manually.
- **webclaw-llm uses plain reqwest** (NOT primp-patched). LLM APIs don't need TLS fingerprinting.
- **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.

## Build & Test

```bash
cargo build --release           # Both binaries
cargo test --workspace          # All tests
cargo test -p webclaw-core      # Core only
cargo test -p webclaw-llm       # LLM only
```

## CLI

```bash
# Basic extraction
webclaw https://example.com
webclaw https://example.com --format llm

# Content filtering
webclaw https://example.com --include "article" --exclude "nav,footer"
webclaw https://example.com --only-main-content

# Batch + proxy rotation
webclaw url1 url2 url3 --proxy-file proxies.txt
webclaw --urls-file urls.txt --concurrency 10

# Sitemap discovery
webclaw https://docs.example.com --map

# Crawling (with sitemap seeding)
webclaw https://docs.example.com --crawl --depth 2 --max-pages 50 --sitemap

# Change tracking
webclaw https://example.com -f json > snap.json
webclaw https://example.com --diff-with snap.json

# Brand extraction
webclaw https://example.com --brand

# LLM features (Ollama local-first)
webclaw https://example.com --summarize
webclaw https://example.com --extract-prompt "Get all pricing tiers"
webclaw https://example.com --extract-json '{"type":"object","properties":{"title":{"type":"string"}}}'

# PDF (auto-detected via Content-Type)
webclaw https://example.com/report.pdf

# Browser impersonation: chrome (default), firefox, random
webclaw https://example.com --browser firefox

# Local file / stdin
webclaw --file page.html
cat page.html | webclaw --stdin
```

## Key Thresholds

- Scoring minimum: 50 chars text length
- Semantic bonus: +50 for `<article>`/`<main>`, +25 for content class/ID
- Link density: >50% = 0.1x score, >30% = 0.5x
- Data island fallback triggers when DOM word count < 30
- Eyebrow text max: 80 chars

## MCP Setup

Add to Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "webclaw": {
      "command": "/path/to/webclaw-mcp"
    }
  }
}
```

## Skills

- `/scrape <url>` — extract content from a URL
- `/benchmark [url]` — run extraction performance benchmarks
- `/research <url>` — deep web research via crawl + extraction
- `/crawl <url>` — crawl a website
- `/commit` — conventional commit with change analysis

## Git

- Remote: `git@github.com:0xMassi/webclaw.git`
- Use `/commit` skill for commits
Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io 2026-03-23 18:31:11 +01:00			`# Webclaw`

			`Rust workspace: CLI + MCP server for web content extraction into LLM-optimized formats.`

			`## Architecture`

			```
			`webclaw/`
			`crates/`
			`webclaw-core/ # Pure extraction engine. WASM-safe. Zero network deps.`
			`# + ExtractionOptions (include/exclude CSS selectors)`
			`# + diff engine (change tracking)`
			`# + brand extraction (DOM/CSS analysis)`
			`webclaw-fetch/ # HTTP client via primp. Crawler. Sitemap discovery. Batch ops.`
			`# + proxy pool rotation (per-request)`
			`# + PDF content-type detection`
			`# + document parsing (DOCX, XLSX, CSV)`
			`webclaw-llm/ # LLM provider chain (Ollama -> OpenAI -> Anthropic)`
			`# + JSON schema extraction, prompt extraction, summarization`
			`webclaw-pdf/ # PDF text extraction via pdf-extract`
			`webclaw-mcp/ # MCP server (Model Context Protocol) for AI agents`
			`webclaw-cli/ # CLI binary`
feat(server): add OSS webclaw-server REST API binary (closes #29) Self-hosters hitting docs/self-hosting were promised three binaries but the OSS Docker image only shipped two. webclaw-server lived in the closed-source hosted-platform repo, which couldn't be opened. This adds a minimal axum REST API in the OSS repo so self-hosting actually works without pretending to ship the cloud platform. Crate at crates/webclaw-server/. Stateless, no database, no job queue, single binary. Endpoints: GET /health, POST /v1/{scrape, crawl, map, batch, extract, summarize, diff, brand}. JSON shapes mirror api.webclaw.io for the endpoints OSS can support, so swapping between self-hosted and hosted is a base-URL change. Auth: optional bearer token via WEBCLAW_API_KEY / --api-key. Comparison is constant-time (subtle::ConstantTimeEq). Open mode (no key) is allowed and binds 127.0.0.1 by default; the Docker image flips WEBCLAW_HOST=0.0.0.0 so the container is reachable out of the box. Hard caps to keep naive callers from OOMing the process: crawl capped at 500 pages synchronously, batch capped at 100 URLs / 20 concurrent. For unbounded crawls or anti-bot bypass the docs point users at the hosted API. Dockerfile + Dockerfile.ci updated to copy webclaw-server into /usr/local/bin and EXPOSE 3000. Workspace version bumped to 0.4.0 (new public binary). 2026-04-22 12:25:11 +02:00			`webclaw-server/ # Minimal axum REST API (self-hosting; OSS counterpart`
			`# of api.webclaw.io, without anti-bot / JS / jobs / auth)`
Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io 2026-03-23 18:31:11 +01:00			```

feat(server): add OSS webclaw-server REST API binary (closes #29) Self-hosters hitting docs/self-hosting were promised three binaries but the OSS Docker image only shipped two. webclaw-server lived in the closed-source hosted-platform repo, which couldn't be opened. This adds a minimal axum REST API in the OSS repo so self-hosting actually works without pretending to ship the cloud platform. Crate at crates/webclaw-server/. Stateless, no database, no job queue, single binary. Endpoints: GET /health, POST /v1/{scrape, crawl, map, batch, extract, summarize, diff, brand}. JSON shapes mirror api.webclaw.io for the endpoints OSS can support, so swapping between self-hosted and hosted is a base-URL change. Auth: optional bearer token via WEBCLAW_API_KEY / --api-key. Comparison is constant-time (subtle::ConstantTimeEq). Open mode (no key) is allowed and binds 127.0.0.1 by default; the Docker image flips WEBCLAW_HOST=0.0.0.0 so the container is reachable out of the box. Hard caps to keep naive callers from OOMing the process: crawl capped at 500 pages synchronously, batch capped at 100 URLs / 20 concurrent. For unbounded crawls or anti-bot bypass the docs point users at the hosted API. Dockerfile + Dockerfile.ci updated to copy webclaw-server into /usr/local/bin and EXPOSE 3000. Workspace version bumped to 0.4.0 (new public binary). 2026-04-22 12:25:11 +02:00			Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (REST API for self-hosting).
Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io 2026-03-23 18:31:11 +01:00
			### Core Modules (`webclaw-core`)
			- `extractor.rs` — Readability-style scoring: text density, semantic tags, link density penalty
			- `noise.rs` — Shared noise filter: tags, ARIA roles, class/ID patterns. Tailwind-safe.
			- `data_island.rs` — JSON data island extraction for React SPAs, Next.js, Contentful CMS
			- `markdown.rs` — HTML to markdown with URL resolution, asset collection
			- `llm.rs` — 9-step LLM optimization pipeline (image strip, emphasis strip, link dedup, stat merge, whitespace collapse)
			- `domain.rs` — Domain detection from URL patterns + DOM heuristics
			- `metadata.rs` — OG, Twitter Card, standard meta tag extraction
			- `types.rs` — Core data structures (ExtractionResult, Metadata, Content)
			- `filter.rs` — CSS selector include/exclude filtering (ExtractionOptions)
			- `diff.rs` — Content change tracking engine (snapshot diffing)
			- `brand.rs` — Brand identity extraction from DOM structure and CSS

			### Fetch Modules (`webclaw-fetch`)
			- `client.rs` — FetchClient with primp TLS impersonation
			- `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128)
			- `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
			- `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt)
			- `batch.rs` — Multi-URL concurrent extraction
			- `proxy.rs` — Proxy pool with per-request rotation
			- `document.rs` — Document parsing: DOCX, XLSX, CSV auto-detection and extraction
			- `search.rs` — Web search via Serper.dev with parallel result scraping

			### LLM Modules (`webclaw-llm`)
			`- Provider chain: Ollama (local-first) -> OpenAI -> Anthropic`
			`- JSON schema extraction, prompt-based extraction, summarization`

			### PDF Modules (`webclaw-pdf`)
			`- PDF text extraction via pdf-extract crate`

			### MCP Server (`webclaw-mcp`)
			`- Model Context Protocol server over stdio transport`
			`- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand`
			`- Works with Claude Desktop, Claude Code, and any MCP client`
			- Uses `rmcp` crate (official Rust MCP SDK)

feat(server): add OSS webclaw-server REST API binary (closes #29) Self-hosters hitting docs/self-hosting were promised three binaries but the OSS Docker image only shipped two. webclaw-server lived in the closed-source hosted-platform repo, which couldn't be opened. This adds a minimal axum REST API in the OSS repo so self-hosting actually works without pretending to ship the cloud platform. Crate at crates/webclaw-server/. Stateless, no database, no job queue, single binary. Endpoints: GET /health, POST /v1/{scrape, crawl, map, batch, extract, summarize, diff, brand}. JSON shapes mirror api.webclaw.io for the endpoints OSS can support, so swapping between self-hosted and hosted is a base-URL change. Auth: optional bearer token via WEBCLAW_API_KEY / --api-key. Comparison is constant-time (subtle::ConstantTimeEq). Open mode (no key) is allowed and binds 127.0.0.1 by default; the Docker image flips WEBCLAW_HOST=0.0.0.0 so the container is reachable out of the box. Hard caps to keep naive callers from OOMing the process: crawl capped at 500 pages synchronously, batch capped at 100 URLs / 20 concurrent. For unbounded crawls or anti-bot bypass the docs point users at the hosted API. Dockerfile + Dockerfile.ci updated to copy webclaw-server into /usr/local/bin and EXPOSE 3000. Workspace version bumped to 0.4.0 (new public binary). 2026-04-22 12:25:11 +02:00			### REST API Server (`webclaw-server`)
			`- Axum 0.8, stateless, no database, no job queue`
			`- 8 POST routes + /health, JSON shapes mirror api.webclaw.io where the`
			`capability exists in OSS`
			- Constant-time bearer-token auth via `subtle::ConstantTimeEq` when
			`--api-key` / `WEBCLAW_API_KEY` is set; otherwise open mode
			`- Hard caps: crawl ≤ 500 pages, batch ≤ 100 URLs, 20 concurrent`
			`- Does NOT include: anti-bot bypass, JS rendering, async jobs,`
			`multi-tenant auth, billing, proxy rotation, search/research/watch/`
			`agent-scrape. Those live behind api.webclaw.io and are closed-source.`

Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io 2026-03-23 18:31:11 +01:00			`## Hard Rules`

			- Core has ZERO network dependencies — takes `&str` HTML, returns structured output. Keep it WASM-compatible.
			- primp requires `[patch.crates-io]` for patched rustls/h2 forks at workspace level.
			- RUSTFLAGS are set in `.cargo/config.toml` — no need to pass manually.
			`- webclaw-llm uses plain reqwest (NOT primp-patched). LLM APIs don't need TLS fingerprinting.`
			- qwen3 thinking tags (`<think>`) are stripped at both provider and consumer levels.

			`## Build & Test`

			```bash
			`cargo build --release # Both binaries`
			`cargo test --workspace # All tests`
			`cargo test -p webclaw-core # Core only`
			`cargo test -p webclaw-llm # LLM only`
			```

			`## CLI`

			```bash
			`# Basic extraction`
			`webclaw https://example.com`
			`webclaw https://example.com --format llm`

			`# Content filtering`
			`webclaw https://example.com --include "article" --exclude "nav,footer"`
			`webclaw https://example.com --only-main-content`

			`# Batch + proxy rotation`
			`webclaw url1 url2 url3 --proxy-file proxies.txt`
			`webclaw --urls-file urls.txt --concurrency 10`

			`# Sitemap discovery`
			`webclaw https://docs.example.com --map`

			`# Crawling (with sitemap seeding)`
			`webclaw https://docs.example.com --crawl --depth 2 --max-pages 50 --sitemap`

			`# Change tracking`
			`webclaw https://example.com -f json > snap.json`
			`webclaw https://example.com --diff-with snap.json`

			`# Brand extraction`
			`webclaw https://example.com --brand`

			`# LLM features (Ollama local-first)`
			`webclaw https://example.com --summarize`
			`webclaw https://example.com --extract-prompt "Get all pricing tiers"`
			`webclaw https://example.com --extract-json '{"type":"object","properties":{"title":{"type":"string"}}}'`

			`# PDF (auto-detected via Content-Type)`
			`webclaw https://example.com/report.pdf`

			`# Browser impersonation: chrome (default), firefox, random`
			`webclaw https://example.com --browser firefox`

			`# Local file / stdin`
			`webclaw --file page.html`
			`cat page.html \| webclaw --stdin`
			```

			`## Key Thresholds`

			`- Scoring minimum: 50 chars text length`
			- Semantic bonus: +50 for `<article>`/`<main>`, +25 for content class/ID
			`- Link density: >50% = 0.1x score, >30% = 0.5x`
			`- Data island fallback triggers when DOM word count < 30`
			`- Eyebrow text max: 80 chars`

			`## MCP Setup`

			Add to Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json`):
			```json
			`{`
			`"mcpServers": {`
			`"webclaw": {`
			`"command": "/path/to/webclaw-mcp"`
			`}`
			`}`
			`}`
			```

			`## Skills`

			- `/scrape <url>` — extract content from a URL
			- `/benchmark [url]` — run extraction performance benchmarks
			- `/research <url>` — deep web research via crawl + extraction
			- `/crawl <url>` — crawl a website
			- `/commit` — conventional commit with change analysis

			`## Git`

			- Remote: `git@github.com:0xMassi/webclaw.git`
			- Use `/commit` skill for commits