# Webclaw

Rust workspace: CLI + MCP server for web content extraction into LLM-optimized formats.

## Architecture

```
webclaw/
  crates/
    webclaw-core/     # Pure extraction engine. WASM-safe. Zero network deps.
                      # + ExtractionOptions (include/exclude CSS selectors)
                      # + diff engine (change tracking)
                      # + brand extraction (DOM/CSS analysis)
    webclaw-fetch/    # HTTP client via wreq (BoringSSL). Crawler. Sitemap discovery. Batch ops.
                      # + proxy pool rotation (per-request)
                      # + PDF content-type detection
                      # + document parsing (DOCX, XLSX, CSV)
                      # + layered URL discovery (map) + Serper web search (BYO key)
    webclaw-llm/      # LLM provider chain (Ollama -> OpenAI -> Anthropic)
                      # + JSON schema extraction, prompt extraction, summarization
    webclaw-pdf/      # PDF text extraction via pdf-extract
    webclaw-mcp/      # MCP server (Model Context Protocol) for AI agents
    webclaw-cli/      # CLI binary
    webclaw-server/   # Minimal axum REST API (self-hosting; OSS counterpart
                      # of api.webclaw.io, without anti-bot / JS / jobs / auth)
```

Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (REST API for self-hosting).

### Core Modules (`webclaw-core`)

- `extractor.rs` - Readability-style scoring: text density, semantic tags, link density penalty
- `noise.rs` - Shared noise filter: tags, ARIA roles, class/ID patterns. Tailwind-safe.
- `data_island.rs` - JSON data island extraction for React SPAs, Next.js, Contentful CMS
- `structured_data.rs` - JSON-LD, Next.js `__NEXT_DATA__`, and SvelteKit data-island extraction
- `js_eval.rs` - QuickJS sandbox (rquickjs) that runs inline `