mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-17 23:55:13 +02:00
Bring core/CLAUDE.md current with the slices rescued this cycle, and fold in earlier /init corrections that were never committed.

New capabilities documented:
- search: webclaw-fetch `search.rs` (Serper BYO-key) + the CLI `search` subcommand + the OSS `POST /v1/search` route (gated on SERPER_API_KEY) + the now-local-first MCP `search` tool.
- map: webclaw-fetch `map.rs` (`discover_urls`/`MapOptions`, sitemap + bounded crawl fallback), gzip sitemap support, and the new `--map-pages`/`--no-map-crawl`/`--map-limit` CLI flags.
- perf: shared `extractors/og.rs` parser and the QuickJS runtime gate / parsed-document reuse noted on `js_eval.rs`.

Corrections folded in: real browser fingerprint versions live in tls.rs (not browser.rs), accurate module/route lists, Repo Layout section, and removal of the now-false "search lives only in production" notes. Bumped the stated workspace version to 0.6.13.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
13 KiB
Webclaw
Rust workspace: CLI + MCP server for web content extraction into LLM-optimized formats.
Architecture
webclaw/
crates/
webclaw-core/ # Pure extraction engine. WASM-safe. Zero network deps.
# + ExtractionOptions (include/exclude CSS selectors)
# + diff engine (change tracking)
# + brand extraction (DOM/CSS analysis)
webclaw-fetch/ # HTTP client via wreq (BoringSSL). Crawler. Sitemap discovery. Batch ops.
# + proxy pool rotation (per-request)
# + PDF content-type detection
# + document parsing (DOCX, XLSX, CSV)
# + layered URL discovery (map) + Serper web search (BYO key)
webclaw-llm/ # LLM provider chain (Ollama -> OpenAI -> Anthropic)
# + JSON schema extraction, prompt extraction, summarization
webclaw-pdf/ # PDF text extraction via pdf-extract
webclaw-mcp/ # MCP server (Model Context Protocol) for AI agents
webclaw-cli/ # CLI binary
webclaw-server/ # Minimal axum REST API (self-hosting; OSS counterpart
# of api.webclaw.io, without anti-bot / JS / jobs / auth)
Three binaries: webclaw (CLI), webclaw-mcp (MCP server), webclaw-server (REST API for self-hosting).
Core Modules (webclaw-core)
- `extractor.rs`: Readability-style scoring (text density, semantic tags, link density penalty)
- `noise.rs`: Shared noise filter (tags, ARIA roles, class/ID patterns). Tailwind-safe.
- `data_island.rs`: JSON data island extraction for React SPAs, Next.js, Contentful CMS
- `structured_data.rs`: JSON-LD, Next.js `__NEXT_DATA__`, and SvelteKit data-island extraction
- `js_eval.rs`: QuickJS sandbox (rquickjs) that runs inline `<script>` tags to recover JS-assigned blobs (`window.__PRELOADED_STATE__`, Next.js `self.__next_f`) the static path can't see. Behind the default `quickjs` feature, gated `cfg(not(target_arch = "wasm32"))`, because rquickjs links a C lib and won't build for wasm. Never ungate it (see Hard Rules). Runtime-gated for speed: the VM is skipped entirely when the page has no JS-candidate markers (`has_js_candidate_data`), and it reuses the already-parsed document instead of re-parsing.
- `endpoints.rs`: API surface discovery. REST paths, GraphQL, and WebSocket endpoints mined from inline scripts + JS bundle text (regex over string literals, DoS-bounded). Pure: caller passes raw text.
- `markdown.rs`: HTML to markdown with URL resolution, asset collection
- `llm/`: directory module (`mod` + `body`/`cleanup`/`images`/`links`/`metadata`). 9-step LLM optimization pipeline (image strip, emphasis strip, link dedup, stat merge, whitespace collapse)
- `domain.rs`: Domain detection from URL patterns + DOM heuristics
- `metadata.rs`: OG, Twitter Card, standard meta tag extraction
- `types.rs`: Core data structures (`ExtractionResult`, `Metadata`, `Content`, plus `ExtractionOptions` for include/exclude CSS selectors, which are applied in `extractor.rs`; there is no `filter.rs`)
- `diff.rs`: Content change tracking engine (snapshot diffing)
- `brand.rs`: Brand identity extraction from DOM structure and CSS
- `reddit.rs`: old.reddit.com thread vertical extractor (parses server-rendered HTML directly; no JS/API key). Test fixtures under `testdata/reddit/*.html` are excluded from the published crate (Cargo.toml).
- `youtube.rs`: `ytInitialPlayerResponse` parser, structured markdown for `youtube.com/watch` URLs (title, channel, views, published, duration, description). Produces the legacy markdown shape; for transcripts and a structured `YoutubeData` block see the production server's `youtube_transcript.rs` short-circuit (yt-dlp via proxy pool).
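The runtime gate described for `js_eval.rs` can be pictured as a cheap substring scan that runs before any VM is created. The sketch below is an illustration only: `looks_like_js_candidate` and `recover_js_blobs` are hypothetical names, the marker list contains just the two blobs mentioned above, and the real `has_js_candidate_data` may check other or additional markers.

```rust
/// Markers taken from the module notes; the real gate may check more (assumption).
const JS_CANDIDATE_MARKERS: [&str; 2] = ["__PRELOADED_STATE__", "self.__next_f"];

/// Hypothetical stand-in for `has_js_candidate_data`: a cheap substring scan
/// that decides whether starting the QuickJS VM is worth the cost.
pub fn looks_like_js_candidate(html: &str) -> bool {
    JS_CANDIDATE_MARKERS.iter().any(|marker| html.contains(marker))
}

/// Hypothetical entry point: the (expensive) VM closure runs only when the
/// gate passes; otherwise no blobs are returned and no VM is started.
pub fn recover_js_blobs(html: &str, run_vm: impl FnOnce(&str) -> Vec<String>) -> Vec<String> {
    if looks_like_js_candidate(html) {
        run_vm(html)
    } else {
        Vec::new()
    }
}
```

Passing the VM as a closure keeps the gate itself testable without linking a JS engine, which is why the sketch takes that shape.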
Fetch Modules (webclaw-fetch)
- `client.rs`: `FetchClient` with wreq BoringSSL TLS impersonation; also implements batch (`BatchResult`/`BatchExtractResult`; there is no `batch.rs`). Implements the public `Fetcher` trait so callers (incl. server adapters) can swap implementations.
- `fetcher.rs`: the public `Fetcher` trait (`Send + Sync`). Vertical extractors take `&dyn Fetcher`, not `&FetchClient`.
- `browser.rs`: `BrowserProfile`/`BrowserVariant` enums only (Chrome, ChromeMacos, Firefox, Safari, SafariIos26, Edge). No version numbers live here.
- `tls.rs`: the real fingerprint builder. Per-variant wreq `Emulation` (cipher/sigalg/curve lists, TLS extension order, HTTP/2 SETTINGS, header wire-order). Browser versions are set HERE: Chrome 145, Firefox 135, Edge 145, Safari 18.3.1, Safari iOS 26. SafariIos26 composes on top of `wreq_util::Profile::SafariIos26`. SSRF-safe redirect policy lives here too.
- `extractors/`: ~28 vertical site extractors (Amazon, eBay, GitHub, Instagram, LinkedIn, Reddit, YouTube, npm, PyPI, HuggingFace, ...); `extractors/mod.rs` is the dispatch table. All reach the network through `&dyn Fetcher`. `extractors/og.rs` is the shared single-pass Open Graph (`og:*`) meta parser the verticals use (`raw()` vs `unescaped()`).
- `crawler.rs`: BFS same-origin crawler with configurable depth/concurrency/delay
- `sitemap.rs`: Sitemap discovery and parsing (sitemap.xml, robots.txt; gzip `.xml.gz` supported via `decode_sitemap_body`, sitemap-index recursion)
- `map.rs`: layered URL discovery (`discover_urls`/`MapOptions`). Sitemaps first, then a bounded same-origin crawl fallback when the sitemap is thin, harvesting links from fetched pages + the unfetched frontier (deduped against the sitemap set)
- `search.rs`: web search via Serper.dev with the caller's own key (`search`/`SearchOptions`/`SearchResult`; pure `parse_serper_organic`). Plain wreq client (JSON API, no fingerprinting); optional bounded concurrent fetch+extract of result pages. Powers the CLI `search` subcommand, the MCP `search` tool, and the OSS server `POST /v1/search`.
- `proxy.rs`: Proxy pool with per-request rotation
- `document.rs`: Document parsing (DOCX, XLSX, CSV auto-detection and extraction)
- `cloud.rs`: `CloudClient` for hosted antibot escalation, exposed via `Fetcher::cloud()`
- `locale.rs`: Accept-Language by TLD (`accept_language_for_tld`/`_for_url`)
- `url_security.rs`: SSRF guards + SSRF-safe redirect policy
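To show why vertical extractors depend on `&dyn Fetcher` rather than a concrete client, here is a minimal, synchronous sketch. Only the trait name and its `Send + Sync` bound come from the notes above; the `fetch_html` method, its `String` error type, `StaticFetcher`, and `extract_title` are hypothetical, and the real trait very likely differs (for example, it may be async).

```rust
/// Hypothetical, simplified shape of the public trait. The real `Fetcher` is
/// `Send + Sync`, but its methods and error type are not shown in these notes.
pub trait Fetcher: Send + Sync {
    fn fetch_html(&self, url: &str) -> Result<String, String>;
}

/// Test double that returns canned HTML instead of touching the network.
pub struct StaticFetcher {
    pub body: String,
}

impl Fetcher for StaticFetcher {
    fn fetch_html(&self, _url: &str) -> Result<String, String> {
        Ok(self.body.clone())
    }
}

/// A toy "vertical extractor": it depends on `&dyn Fetcher`, never on a
/// concrete client, so any implementation can be swapped in.
pub fn extract_title(fetcher: &dyn Fetcher, url: &str) -> Result<Option<String>, String> {
    let html = fetcher.fetch_html(url)?;
    let start = match html.find("<title>") {
        Some(i) => i + "<title>".len(),
        None => return Ok(None),
    };
    let end = match html[start..].find("</title>") {
        Some(i) => start + i,
        None => return Ok(None),
    };
    Ok(Some(html[start..end].trim().to_string()))
}
```

With this shape, a test can pass `&StaticFetcher { .. }` while a server adapter passes its own implementation, and the extractor code stays unchanged.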
LLM Modules (webclaw-llm)
- Provider chain: Ollama (local-first) -> OpenAI -> Anthropic
- JSON schema extraction, prompt-based extraction, summarization
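The provider chain above amounts to "try each provider in order, return the first success". The following sketch illustrates that fallback pattern with a hypothetical `Provider` trait and a `Stub` type; neither is taken from webclaw-llm, whose actual trait and error types are not shown here.

```rust
/// Hypothetical provider interface; the real webclaw-llm trait is not shown here.
pub trait Provider {
    fn name(&self) -> &str;
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Try providers in order and return `(provider name, completion)` for the
/// first one that succeeds. If all fail, report every error in order.
pub fn complete_with_fallback(
    chain: &[Box<dyn Provider>],
    prompt: &str,
) -> Result<(String, String), String> {
    let mut errors = Vec::new();
    for provider in chain {
        match provider.complete(prompt) {
            Ok(text) => return Ok((provider.name().to_string(), text)),
            Err(e) => errors.push(format!("{}: {}", provider.name(), e)),
        }
    }
    Err(format!("all providers failed: {}", errors.join("; ")))
}

/// Stub provider used only to exercise the chain: `reply = None` simulates
/// an unreachable backend.
pub struct Stub {
    pub name: &'static str,
    pub reply: Option<&'static str>,
}

impl Provider for Stub {
    fn name(&self) -> &str {
        self.name
    }

    fn complete(&self, _prompt: &str) -> Result<String, String> {
        self.reply
            .map(str::to_string)
            .ok_or_else(|| "unavailable".to_string())
    }
}
```

Ordering the slice as Ollama, OpenAI, Anthropic would give the local-first behavior described above: a remote provider is only contacted after the local one has failed.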
PDF Modules (webclaw-pdf)
- PDF text extraction via pdf-extract crate
MCP Server (webclaw-mcp)
- Model Context Protocol server over stdio transport
- 12 tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, list_extractors, vertical_scrape. `search` is local-first via the caller's `SERPER_API_KEY` (falls back to the hosted API when unset); `research` uses the hosted deep-research API. The rest run locally.
- Works with Claude Desktop, Claude Code, and any MCP client
- Uses `rmcp` crate (official Rust MCP SDK)
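The local-first rule for the `search` tool reduces to a single decision on whether a Serper key is present. A minimal sketch of that decision follows; `SearchBackend` and `choose_search_backend` are hypothetical names, and treating a blank value as unset is an assumption rather than documented behavior.

```rust
/// Where a `search` call is served from (hypothetical enum for illustration).
#[derive(Debug, PartialEq)]
pub enum SearchBackend {
    /// Call Serper directly with the caller's own key.
    LocalSerper { key: String },
    /// No usable key: defer to the hosted API.
    Hosted,
}

/// Pick a backend from an optional key. Blank values are treated as unset
/// (that trimming rule is an assumption, not something the notes state).
pub fn choose_search_backend(serper_key: Option<&str>) -> SearchBackend {
    match serper_key.map(str::trim) {
        Some(key) if !key.is_empty() => SearchBackend::LocalSerper {
            key: key.to_string(),
        },
        _ => SearchBackend::Hosted,
    }
}
```

A caller would typically feed it the environment, for example `choose_search_backend(std::env::var("SERPER_API_KEY").ok().as_deref())`.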
REST API Server (webclaw-server)
- Axum 0.8, stateless, no database, no job queue
- 10 POST routes (incl. `POST /v1/scrape/{vertical}` and `POST /v1/search`) + `GET /v1/extractors` + `GET /health`. JSON shapes mirror api.webclaw.io where the capability exists in OSS. The vertical surface (`routes/structured.rs`) mirrors the MCP `list_extractors`/`vertical_scrape` tools. `POST /v1/search` is gated on `SERPER_API_KEY` (returns 501 when unset).
- Constant-time bearer-token auth via `subtle::ConstantTimeEq` when `--api-key` / `WEBCLAW_API_KEY` is set; otherwise open mode
- Hard caps: crawl ≤ 500 pages, batch ≤ 100 URLs, 20 concurrent
- Does NOT include: anti-bot bypass, JS rendering, async jobs, multi-tenant auth, billing, proxy rotation, research/watch/agent-scrape. Those live behind api.webclaw.io and are closed-source. (Web search IS available here as a bring-your-own-Serper-key path.)
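The bearer-token rule above (constant-time comparison when a key is configured, open mode otherwise) can be illustrated with the standard library alone. This is a teaching sketch, not the server's code: the server uses `subtle::ConstantTimeEq`, which should be preferred in practice because a compiler may optimize a hand-rolled loop. `constant_time_eq` and `is_authorized` are hypothetical names.

```rust
/// Std-only illustration of constant-time equality. Prefer
/// `subtle::ConstantTimeEq` in real code; this loop only shows the idea.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // token length is not treated as secret in this sketch
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences without an early exit
    }
    diff == 0
}

/// Hypothetical check: `expected = None` models open mode (no key configured),
/// so every request is allowed. Otherwise a `Bearer <token>` header must match.
pub fn is_authorized(expected: Option<&str>, authorization_header: Option<&str>) -> bool {
    let Some(expected) = expected else {
        return true;
    };
    match authorization_header.and_then(|h| h.strip_prefix("Bearer ")) {
        Some(token) => constant_time_eq(token.as_bytes(), expected.as_bytes()),
        None => false,
    }
}
```

The point of avoiding an early exit is that response time should not reveal how many leading bytes of a guessed token were correct.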
Hard Rules
- Core has ZERO network dependencies: it takes `&str` HTML and returns structured output. Keep it WASM-compatible. The `quickjs` feature (default ON) pulls in rquickjs, which links a C lib and can't target wasm32; it's gated `cfg(not(target_arch = "wasm32"))` in `lib.rs`. CI compiles webclaw-core for wasm32 both with AND without default features, so never ungate that.
- webclaw-fetch pins wreq exactly: `wreq = "=6.0.0-rc.29"` + `wreq-util = "=3.0.0-rc.12"` (BoringSSL). The `=` pin is deliberate: these are release candidates with no semver stability between rc.N builds. No `[patch.crates-io]` forks needed; wreq handles TLS internally.
- No build flags in `.cargo/config.toml` (it is comments-only), so don't add any locally. BUT CI (`.github/workflows/ci.yml`, `deps.yml`) DOES export `RUSTFLAGS: "--cfg reqwest_unstable"` for the wreq path; don't remove it from CI.
- webclaw-llm uses plain reqwest. LLM APIs don't need TLS fingerprinting, so no wreq dep.
- Vertical extractors take `&dyn Fetcher`, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client.
- qwen3 thinking tags (`<think>`) are stripped at both provider and consumer levels.
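The last rule, stripping `<think>` blocks from model output, is simple enough to sketch with plain string scanning. The function below is a hypothetical illustration, not the code in webclaw-llm; in particular, dropping everything after an unterminated `<think>` and trimming the result are design choices made for this example.

```rust
/// Remove every `<think>...</think>` span from `input`. An unterminated
/// `<think>` drops the rest of the text (a choice made for this sketch,
/// not confirmed behaviour of the real implementation).
pub fn strip_think_tags(input: &str) -> String {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";

    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find(OPEN) {
        // Keep the text before the opening tag.
        out.push_str(&rest[..start]);
        match rest[start..].find(CLOSE) {
            // Skip past the closing tag and continue scanning.
            Some(end) => rest = &rest[start + end + CLOSE.len()..],
            None => {
                rest = "";
                break;
            }
        }
    }
    out.push_str(rest);
    out.trim().to_string()
}
```

Because the function returns its input unchanged when no tags are present, applying it at both the provider and the consumer level is harmless.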
Build & Test
cargo build --release # All three binaries (webclaw, webclaw-mcp, webclaw-server)
cargo test --workspace # All tests
cargo test -p webclaw-core # Core only
cargo test -p webclaw-llm # LLM only
CI (.github/workflows/ci.yml, with RUSTFLAGS=--cfg reqwest_unstable) runs four jobs — match them locally before pushing:
- `cargo test --workspace`
- `cargo fmt --check --all` + `cargo clippy --all -- -D warnings` (warnings fail CI)
- `cargo check --target wasm32-unknown-unknown -p webclaw-core` with and without `--no-default-features` (guards the WASM-safe rule)
- `cargo doc --no-deps --workspace`
Repo Layout & Packaging
Workspace is version 0.6.13, edition 2024, license AGPL-3.0 (matters for the public-OSS scrubbing rules). No crate declares rust-version, so MSRV is implicit — edition 2024 floors it at Rust 1.85+; CI pins dtolnay/rust-toolchain@stable.
Artifacts outside crates/ that need separate attention:
- `packages/create-webclaw/`: `npx create-webclaw` Node scaffolder that installs/configures the MCP server for AI agents (Claude, Cursor, Windsurf, ...). Versioned independently (own `package.json`), so bump it separately when MCP setup changes.
- `smithery.yaml` + `glama.json`: MCP-registry manifests (Smithery stdio config spawning `webclaw-mcp` with optional `WEBCLAW_API_KEY`; Glama). Update when the MCP launch command or env changes.
- `examples/`: runnable demos (cloudflare-diagnostics, firecrawl-compatible-api, html-to-markdown-rag, mcp-web-scraping, proxy-backed-crawling).
- `Dockerfile` / `Dockerfile.ci` / `docker-compose.yml`, `benchmarks/` (`/benchmark` skill), `SKILL.md` + `skill/` (Claude Code skill).
CLI
# Basic extraction
webclaw https://example.com
webclaw https://example.com --format llm
# Content filtering
webclaw https://example.com --include "article" --exclude "nav,footer"
webclaw https://example.com --only-main-content
# Batch + proxy rotation
webclaw url1 url2 url3 --proxy-file proxies.txt
webclaw --urls-file urls.txt --concurrency 10
# URL discovery (--map): sitemaps first, bounded crawl fallback when the sitemap is thin
webclaw https://docs.example.com --map
webclaw https://news.ycombinator.com --map --map-pages 150 --map-limit 500
webclaw https://docs.example.com --map --no-map-crawl # sitemap-only (no crawl fallback)
# Crawling (with sitemap seeding)
webclaw https://docs.example.com --crawl --depth 2 --max-pages 50 --sitemap
# Web search via Serper.dev (bring your own key: --serper-key or SERPER_API_KEY)
webclaw search "rust async runtime" --num 5
webclaw search "best web scraper" --scrape -f json # also fetch + extract result pages
# Change tracking
webclaw https://example.com -f json > snap.json
webclaw https://example.com --diff-with snap.json
# Brand extraction
webclaw https://example.com --brand
# LLM features (Ollama local-first)
webclaw https://example.com --summarize
webclaw https://example.com --extract-prompt "Get all pricing tiers"
webclaw https://example.com --extract-json '{"type":"object","properties":{"title":{"type":"string"}}}'
# PDF (auto-detected via Content-Type)
webclaw https://example.com/report.pdf
# Browser impersonation: chrome (default), firefox, random
webclaw https://example.com --browser firefox
# Local file / stdin
webclaw --file page.html
cat page.html | webclaw --stdin
Key Thresholds
- Scoring minimum: 50 chars text length
- Semantic bonus: +50 for `<article>`/`<main>`, +25 for content class/ID
- Link density (generic divs): >50% = 0.1x score, >30% = 0.5x. Semantic nodes (article/main/role=main) get a milder curve: >70% = 0.3x, >50% = 0.5x (`extractor.rs`)
- Data island fallback triggers when DOM word count < 500 (`SPARSE_THRESHOLD` in `data_island.rs`)
- Eyebrow text max: 80 chars
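The link-density curve and the sparse-page check are easiest to read as code. The sketch below encodes only the numbers listed above; the function names and signatures are hypothetical, the constant's type is a guess, and the real logic in `extractor.rs` and `data_island.rs` may be structured differently.

```rust
/// Score multiplier for a node given its link density in the range 0.0..=1.0.
/// Thresholds are the ones listed above; the function shape is an assumption.
pub fn link_density_multiplier(link_density: f64, is_semantic: bool) -> f64 {
    if is_semantic {
        // article / main / role=main: milder penalty curve
        if link_density > 0.70 {
            0.3
        } else if link_density > 0.50 {
            0.5
        } else {
            1.0
        }
    } else if link_density > 0.50 {
        0.1
    } else if link_density > 0.30 {
        0.5
    } else {
        1.0
    }
}

/// Same value as the documented `SPARSE_THRESHOLD`; the `usize` type is a guess.
pub const SPARSE_THRESHOLD: usize = 500;

/// The data-island fallback triggers only for pages with fewer DOM words
/// than the threshold.
pub fn needs_data_island_fallback(dom_word_count: usize) -> bool {
    dom_word_count < SPARSE_THRESHOLD
}
```

For example, a generic `div` whose text is 60% links keeps a tenth of its score, while an `<article>` with the same density keeps half.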
MCP Setup
Add to Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json):
{
  "mcpServers": {
    "webclaw": {
      "command": "/path/to/webclaw-mcp"
    }
  }
}
Skills
- `/scrape <url>`: extract content from a URL
- `/benchmark [url]`: run extraction performance benchmarks
- `/research <url>`: deep web research via crawl + extraction
- `/crawl <url>`: crawl a website
- `/commit`: conventional commit with change analysis
Git
- Remote: `git@github.com:0xMassi/webclaw.git`
- Use `/commit` skill for commits