mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-17 23:55:13 +02:00
Bring core/CLAUDE.md current with the slices rescued this cycle, and fold in earlier /init corrections that were never committed. New capabilities documented: - search: webclaw-fetch `search.rs` (Serper BYO-key) + the CLI `search` subcommand + the OSS `POST /v1/search` route (gated on SERPER_API_KEY) + the now-local-first MCP `search` tool. - map: webclaw-fetch `map.rs` (`discover_urls`/`MapOptions`, sitemap + bounded crawl fallback), gzip sitemap support, and the new `--map-pages`/`--no-map-crawl`/`--map-limit` CLI flags. - perf: shared `extractors/og.rs` parser and the QuickJS runtime gate / parsed-document reuse noted on `js_eval.rs`. Corrections folded in: real browser fingerprint versions live in tls.rs (not browser.rs), accurate module/route lists, Repo Layout section, and removal of the now-false "search lives only in production" notes. Bumped the stated workspace version to 0.6.13. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
208 lines
13 KiB
Markdown
208 lines
13 KiB
Markdown
# Webclaw
|
|
|
|
Rust workspace: CLI + MCP server for web content extraction into LLM-optimized formats.
|
|
|
|
## Architecture
|
|
|
|
```
|
|
webclaw/
|
|
crates/
|
|
webclaw-core/ # Pure extraction engine. WASM-safe. Zero network deps.
|
|
# + ExtractionOptions (include/exclude CSS selectors)
|
|
# + diff engine (change tracking)
|
|
# + brand extraction (DOM/CSS analysis)
|
|
webclaw-fetch/ # HTTP client via wreq (BoringSSL). Crawler. Sitemap discovery. Batch ops.
|
|
# + proxy pool rotation (per-request)
|
|
# + PDF content-type detection
|
|
# + document parsing (DOCX, XLSX, CSV)
|
|
# + layered URL discovery (map) + Serper web search (BYO key)
|
|
webclaw-llm/ # LLM provider chain (Ollama -> OpenAI -> Anthropic)
|
|
# + JSON schema extraction, prompt extraction, summarization
|
|
webclaw-pdf/ # PDF text extraction via pdf-extract
|
|
webclaw-mcp/ # MCP server (Model Context Protocol) for AI agents
|
|
webclaw-cli/ # CLI binary
|
|
webclaw-server/ # Minimal axum REST API (self-hosting; OSS counterpart
|
|
# of api.webclaw.io, without anti-bot / JS / jobs / auth)
|
|
```
|
|
|
|
Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (REST API for self-hosting).
|
|
|
|
### Core Modules (`webclaw-core`)
|
|
- `extractor.rs` — Readability-style scoring: text density, semantic tags, link density penalty
|
|
- `noise.rs` — Shared noise filter: tags, ARIA roles, class/ID patterns. Tailwind-safe.
|
|
- `data_island.rs` — JSON data island extraction for React SPAs, Next.js, Contentful CMS
|
|
- `structured_data.rs` — JSON-LD, Next.js `__NEXT_DATA__`, and SvelteKit data-island extraction
|
|
- `js_eval.rs` — QuickJS sandbox (rquickjs) that runs inline `<script>` tags to recover JS-assigned blobs (`window.__PRELOADED_STATE__`, Next.js `self.__next_f`) the static path can't see. Behind the default `quickjs` feature, gated `cfg(not(target_arch = "wasm32"))` — rquickjs links a C lib and won't build for wasm. Never ungate it (see Hard Rules). Runtime-gated for speed: the VM is skipped entirely when the page has no JS-candidate markers (`has_js_candidate_data`), and it reuses the already-parsed document instead of re-parsing.
|
|
- `endpoints.rs` — API surface discovery: REST paths, GraphQL, and WebSocket endpoints mined from inline scripts + JS bundle text (regex over string literals, DoS-bounded). Pure: caller passes raw text.
|
|
- `markdown.rs` — HTML to markdown with URL resolution, asset collection
|
|
- `llm/` — directory module (`mod` + `body`/`cleanup`/`images`/`links`/`metadata`): 9-step LLM optimization pipeline (image strip, emphasis strip, link dedup, stat merge, whitespace collapse)
|
|
- `domain.rs` — Domain detection from URL patterns + DOM heuristics
|
|
- `metadata.rs` — OG, Twitter Card, standard meta tag extraction
|
|
- `types.rs` — Core data structures (ExtractionResult, Metadata, Content, plus ExtractionOptions for include/exclude CSS selectors — applied in `extractor.rs`; there is no `filter.rs`)
|
|
- `diff.rs` — Content change tracking engine (snapshot diffing)
|
|
- `brand.rs` — Brand identity extraction from DOM structure and CSS
|
|
- `reddit.rs` — old.reddit.com thread vertical extractor (parses server-rendered HTML directly; no JS/API key). Test fixtures under `testdata/reddit/*.html` are `exclude`d from the published crate (Cargo.toml).
|
|
- `youtube.rs` — `ytInitialPlayerResponse` parser, structured markdown for `youtube.com/watch` URLs (title, channel, views, published, duration, description). Produces the legacy markdown shape — for transcripts and a structured `YoutubeData` block see the production server's `youtube_transcript.rs` short-circuit (yt-dlp via proxy pool).
|
|
|
|
### Fetch Modules (`webclaw-fetch`)
|
|
- `client.rs` — `FetchClient` with wreq BoringSSL TLS impersonation; also implements batch (`BatchResult`/`BatchExtractResult` — there is no `batch.rs`). Implements the public `Fetcher` trait so callers (incl. server adapters) can swap implementations.
|
|
- `fetcher.rs` — the public `Fetcher` trait (`Send + Sync`). Vertical extractors take `&dyn Fetcher`, not `&FetchClient`.
|
|
- `browser.rs` — `BrowserProfile`/`BrowserVariant` enums only (Chrome, ChromeMacos, Firefox, Safari, SafariIos26, Edge). No version numbers live here.
|
|
- `tls.rs` — the real fingerprint builder: per-variant wreq `Emulation` (cipher/sigalg/curve lists, TLS extension order, HTTP/2 SETTINGS, header wire-order). Browser versions are set HERE: Chrome 145, Firefox 135, Edge 145, Safari 18.3.1, Safari iOS 26. SafariIos26 composes on top of `wreq_util::Profile::SafariIos26`. SSRF-safe redirect policy lives here too.
|
|
- `extractors/` — ~28 vertical site extractors (Amazon, eBay, GitHub, Instagram, LinkedIn, Reddit, YouTube, npm, PyPI, HuggingFace, ...); `extractors/mod.rs` is the dispatch table. All reach the network through `&dyn Fetcher`. `extractors/og.rs` is the shared single-pass Open Graph (`og:*`) meta parser the verticals use (`raw()` vs `unescaped()`).
|
|
- `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
|
|
- `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt; gzip `.xml.gz` supported via `decode_sitemap_body`, sitemap-index recursion)
|
|
- `map.rs` — layered URL discovery (`discover_urls` / `MapOptions`): sitemaps first, then a bounded same-origin crawl fallback when the sitemap is thin, harvesting links from fetched pages + the unfetched frontier (deduped against the sitemap set)
|
|
- `search.rs` — web search via Serper.dev with the caller's own key (`search` / `SearchOptions` / `SearchResult`; pure `parse_serper_organic`). Plain wreq client (JSON API, no fingerprinting); optional bounded concurrent fetch+extract of result pages. Powers the CLI `search` subcommand, the MCP `search` tool, and the OSS server `POST /v1/search`.
|
|
- `proxy.rs` — Proxy pool with per-request rotation
|
|
- `document.rs` — Document parsing: DOCX, XLSX, CSV auto-detection and extraction
|
|
- `cloud.rs` — `CloudClient` for hosted antibot escalation, exposed via `Fetcher::cloud()`
|
|
- `locale.rs` — Accept-Language by TLD (`accept_language_for_tld` / `_for_url`)
|
|
- `url_security.rs` — SSRF guards + SSRF-safe redirect policy
|
|
|
|
### LLM Modules (`webclaw-llm`)
|
|
- Provider chain: Ollama (local-first) -> OpenAI -> Anthropic
|
|
- JSON schema extraction, prompt-based extraction, summarization
|
|
|
|
### PDF Modules (`webclaw-pdf`)
|
|
- PDF text extraction via pdf-extract crate
|
|
|
|
### MCP Server (`webclaw-mcp`)
|
|
- Model Context Protocol server over stdio transport
|
|
- 12 tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, list_extractors, vertical_scrape. `search` is local-first via the caller's `SERPER_API_KEY` (falls back to the hosted API when unset); `research` uses the hosted deep-research API. The rest run locally.
|
|
- Works with Claude Desktop, Claude Code, and any MCP client
|
|
- Uses `rmcp` crate (official Rust MCP SDK)
|
|
|
|
### REST API Server (`webclaw-server`)
|
|
- Axum 0.8, stateless, no database, no job queue
|
|
- 10 POST routes (incl. `POST /v1/scrape/{vertical}` and `POST /v1/search`) +
|
|
`GET /v1/extractors` + `GET /health`. JSON shapes mirror api.webclaw.io
|
|
where the capability exists in OSS. The vertical surface
|
|
(`routes/structured.rs`) mirrors the MCP `list_extractors` /
|
|
`vertical_scrape` tools. `POST /v1/search` is gated on `SERPER_API_KEY`
|
|
(returns 501 when unset).
|
|
- Constant-time bearer-token auth via `subtle::ConstantTimeEq` when
|
|
`--api-key` / `WEBCLAW_API_KEY` is set; otherwise open mode
|
|
- Hard caps: crawl ≤ 500 pages, batch ≤ 100 URLs, 20 concurrent
|
|
- Does NOT include: anti-bot bypass, JS rendering, async jobs,
|
|
multi-tenant auth, billing, proxy rotation, research/watch/
|
|
agent-scrape. Those live behind api.webclaw.io and are closed-source.
|
|
(Web search IS available here as a bring-your-own-Serper-key path.)
|
|
|
|
## Hard Rules
|
|
|
|
- **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible. The `quickjs` feature (default ON) pulls in rquickjs, which links a C lib and can't target wasm32; it's gated `cfg(not(target_arch = "wasm32"))` in `lib.rs`. CI compiles webclaw-core for wasm32 both with AND without default features — never ungate that.
|
|
- **webclaw-fetch pins wreq exactly**: `wreq = "=6.0.0-rc.29"` + `wreq-util = "=3.0.0-rc.12"` (BoringSSL). The `=` pin is deliberate — these are release candidates with no semver stability between rc.N builds. No `[patch.crates-io]` forks needed; wreq handles TLS internally.
|
|
- **No build flags in `.cargo/config.toml`** (it is comments-only) — don't add any locally. BUT CI (`.github/workflows/ci.yml`, `deps.yml`) DOES export `RUSTFLAGS: "--cfg reqwest_unstable"` for the wreq path; don't remove it from CI.
|
|
- **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep.
|
|
- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client.
|
|
- **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.
|
|
|
|
## Build & Test
|
|
|
|
```bash
|
|
cargo build --release # All three binaries (webclaw, webclaw-mcp, webclaw-server)
|
|
cargo test --workspace # All tests
|
|
cargo test -p webclaw-core # Core only
|
|
cargo test -p webclaw-llm # LLM only
|
|
```
|
|
|
|
CI (`.github/workflows/ci.yml`, with `RUSTFLAGS=--cfg reqwest_unstable`) runs four jobs — match them locally before pushing:
|
|
- `cargo test --workspace`
|
|
- `cargo fmt --check --all` + `cargo clippy --all -- -D warnings` (warnings fail CI)
|
|
- `cargo check --target wasm32-unknown-unknown -p webclaw-core` **with and without** `--no-default-features` (guards the WASM-safe rule)
|
|
- `cargo doc --no-deps --workspace`
|
|
|
|
## Repo Layout & Packaging
|
|
|
|
Workspace is version **0.6.13**, edition **2024**, license **AGPL-3.0** (matters for the public-OSS scrubbing rules). No crate declares `rust-version`, so MSRV is implicit — edition 2024 floors it at Rust 1.85+; CI pins `dtolnay/rust-toolchain@stable`.
|
|
|
|
Artifacts outside `crates/` that need separate attention:
|
|
- `packages/create-webclaw/` — `npx create-webclaw` Node scaffolder that installs/configures the MCP server for AI agents (Claude, Cursor, Windsurf, ...). Versioned independently (own `package.json`) — bump it separately when MCP setup changes.
|
|
- `smithery.yaml` + `glama.json` — MCP-registry manifests (Smithery stdio config spawning `webclaw-mcp` with optional `WEBCLAW_API_KEY`; Glama). Update when the MCP launch command or env changes.
|
|
- `examples/` — runnable demos (cloudflare-diagnostics, firecrawl-compatible-api, html-to-markdown-rag, mcp-web-scraping, proxy-backed-crawling).
|
|
- `Dockerfile` / `Dockerfile.ci` / `docker-compose.yml`, `benchmarks/` (`/benchmark` skill), `SKILL.md` + `skill/` (Claude Code skill).
|
|
|
|
## CLI
|
|
|
|
```bash
|
|
# Basic extraction
|
|
webclaw https://example.com
|
|
webclaw https://example.com --format llm
|
|
|
|
# Content filtering
|
|
webclaw https://example.com --include "article" --exclude "nav,footer"
|
|
webclaw https://example.com --only-main-content
|
|
|
|
# Batch + proxy rotation
|
|
webclaw url1 url2 url3 --proxy-file proxies.txt
|
|
webclaw --urls-file urls.txt --concurrency 10
|
|
|
|
# URL discovery (--map): sitemaps first, bounded crawl fallback when the sitemap is thin
|
|
webclaw https://docs.example.com --map
|
|
webclaw https://news.ycombinator.com --map --map-pages 150 --map-limit 500
|
|
webclaw https://docs.example.com --map --no-map-crawl # sitemap-only (no crawl fallback)
|
|
|
|
# Crawling (with sitemap seeding)
|
|
webclaw https://docs.example.com --crawl --depth 2 --max-pages 50 --sitemap
|
|
|
|
# Web search via Serper.dev (bring your own key: --serper-key or SERPER_API_KEY)
|
|
webclaw search "rust async runtime" --num 5
|
|
webclaw search "best web scraper" --scrape -f json # also fetch + extract result pages
|
|
|
|
# Change tracking
|
|
webclaw https://example.com -f json > snap.json
|
|
webclaw https://example.com --diff-with snap.json
|
|
|
|
# Brand extraction
|
|
webclaw https://example.com --brand
|
|
|
|
# LLM features (Ollama local-first)
|
|
webclaw https://example.com --summarize
|
|
webclaw https://example.com --extract-prompt "Get all pricing tiers"
|
|
webclaw https://example.com --extract-json '{"type":"object","properties":{"title":{"type":"string"}}}'
|
|
|
|
# PDF (auto-detected via Content-Type)
|
|
webclaw https://example.com/report.pdf
|
|
|
|
# Browser impersonation: chrome (default), firefox, random
|
|
webclaw https://example.com --browser firefox
|
|
|
|
# Local file / stdin
|
|
webclaw --file page.html
|
|
cat page.html | webclaw --stdin
|
|
```
|
|
|
|
## Key Thresholds
|
|
|
|
- Scoring minimum: 50 chars text length
|
|
- Semantic bonus: +50 for `<article>`/`<main>`, +25 for content class/ID
|
|
- Link density (generic divs): >50% = 0.1x score, >30% = 0.5x. Semantic nodes (article/main/role=main) get a milder curve: >70% = 0.3x, >50% = 0.5x (`extractor.rs`)
|
|
- Data island fallback triggers when DOM word count < 500 (`SPARSE_THRESHOLD` in `data_island.rs`)
|
|
- Eyebrow text max: 80 chars
|
|
|
|
## MCP Setup
|
|
|
|
Add to Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json`):
|
|
```json
|
|
{
|
|
"mcpServers": {
|
|
"webclaw": {
|
|
"command": "/path/to/webclaw-mcp"
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
## Skills
|
|
|
|
- `/scrape <url>` — extract content from a URL
|
|
- `/benchmark [url]` — run extraction performance benchmarks
|
|
- `/research <url>` — deep web research via crawl + extraction
|
|
- `/crawl <url>` — crawl a website
|
|
- `/commit` — conventional commit with change analysis
|
|
|
|
## Git
|
|
|
|
- Remote: `git@github.com:0xMassi/webclaw.git`
|
|
- Use `/commit` skill for commits
|