diff --git a/CHANGELOG.md b/CHANGELOG.md index dc337d0..4069d54 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,28 @@ All notable changes to webclaw are documented here. Format follows [Keep a Changelog](https://keepachangelog.com/). +## [0.4.0] — 2026-04-22 + +### Added +- **`webclaw bench ` — per-URL extraction micro-benchmark (#26).** New subcommand. Fetches a URL once, runs the same extraction pipeline as `--format llm`, and prints a small ASCII table comparing raw-HTML tokens vs. llm-output tokens, bytes, and extraction time. Pass `--json` for a single-line JSON object (stable shape, easy to append to ndjson in CI). Pass `--facts ` with a file in the same schema as `benchmarks/facts.json` to get a fidelity column ("4/5 facts preserved"); URLs absent from the facts file produce no fidelity row, so uncurated sites aren't shown as 0/0. v1 uses an approximate tokenizer (`chars/4` for Latin text, `chars/2` when CJK dominates) — off by ±10% vs. a real BPE tokenizer, but the signal ("the LLM pipeline dropped 93% of the raw bytes") is the point. Output clearly labels counts as `≈ tokens` so nobody confuses them with a real tiktoken run. Swapping in `tiktoken-rs` later is a one-function change in `bench.rs`. Adding this as a `clap` subcommand rather than a flag also lays the groundwork for future subcommands without breaking the existing flag-based flow — `webclaw --format llm` still works exactly as before. + +- **`webclaw-server` — new OSS binary for self-hosting a REST API (#29).** Until now, `docs/self-hosting` promised a `webclaw-server` binary that only existed in the hosted-platform repo (closed source). The Docker image shipped two binaries while the docs advertised three, which sent self-hosters into a bug loop. This release closes the gap: a new crate at `crates/webclaw-server/` builds a minimal, stateless axum server that exposes the OSS extraction pipeline over HTTP with the same JSON shapes as api.webclaw.io. 
Endpoints: `GET /health`, `POST /v1/{scrape,crawl,map,batch,extract,summarize,diff,brand}`. Run with `webclaw-server --port 3000 [--host 0.0.0.0] [--api-key ]` or the matching `WEBCLAW_PORT` / `WEBCLAW_HOST` / `WEBCLAW_API_KEY` env vars. Bearer auth is constant-time (via `subtle::ConstantTimeEq`); open mode (no key) is allowed on `127.0.0.1` for local development. + + What self-hosting gives you: the full extraction pipeline, Crawler, sitemap discovery, brand/diff, LLM extract/summarize (via Ollama or your own OpenAI/Anthropic key). What it does *not* give you: anti-bot bypass (Cloudflare, DataDome, WAFs), headless JS rendering, async job queues, multi-tenant auth/billing, domain-hints and proxy routing — those require the hosted backend at api.webclaw.io and are intentionally not open-source. The self-hosting docs have been updated to reflect this split honestly. + +- **`crawl` endpoint runs synchronously and hard-caps at 500 pages / 20 concurrency.** No job queue, no background workers — a naive caller can't OOM the process. `batch` caps at 100 URLs / 20 concurrency for the same reason. For unbounded crawls use the hosted API. + +### Changed +- **Docker image now ships three binaries**, not two. `Dockerfile` and `Dockerfile.ci` both add `webclaw-server` to `/usr/local/bin/` and `EXPOSE 3000` for documentation. The entrypoint shim is unchanged: `docker run IMAGE webclaw-server --port 3000` Just Works, and the CLI/URL pass-through from v0.3.19 is preserved. + +### Docs +- Rewrote `docs/self-hosting` on the landing site to differentiate OSS (self-hosted REST) from the hosted platform. Added a capability matrix so new users don't have to read the repo to figure out why Cloudflare-protected sites still 403 when pointing at their own box. 
+ +### Fixed +- **Dead-code warning on `cargo install webclaw-mcp` (#30).** `rmcp` 1.3.x changed how the `#[tool_handler]` macro reads the `tool_router` struct field — it now goes through a derived trait impl instead of referencing the field by name, so rustc's dead-code lint no longer sees it. The field is still essential (dropping it unregisters every MCP tool), just invisible to the lint. Annotated with `#[allow(dead_code)]` and a comment explaining why. No behaviour change. Warning disappears on the next `cargo install`. + +--- + ## [0.3.19] — 2026-04-17 ### Fixed diff --git a/CLAUDE.md b/CLAUDE.md index ad15cf1..eac2f9f 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -20,9 +20,11 @@ webclaw/ webclaw-pdf/ # PDF text extraction via pdf-extract webclaw-mcp/ # MCP server (Model Context Protocol) for AI agents webclaw-cli/ # CLI binary + webclaw-server/ # Minimal axum REST API (self-hosting; OSS counterpart + # of api.webclaw.io, without anti-bot / JS / jobs / auth) ``` -Two binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server). +Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (REST API for self-hosting). ### Core Modules (`webclaw-core`) - `extractor.rs` — Readability-style scoring: text density, semantic tags, link density penalty @@ -60,6 +62,17 @@ Two binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server). - Works with Claude Desktop, Claude Code, and any MCP client - Uses `rmcp` crate (official Rust MCP SDK) +### REST API Server (`webclaw-server`) +- Axum 0.8, stateless, no database, no job queue +- 8 POST routes + /health, JSON shapes mirror api.webclaw.io where the + capability exists in OSS +- Constant-time bearer-token auth via `subtle::ConstantTimeEq` when + `--api-key` / `WEBCLAW_API_KEY` is set; otherwise open mode +- Hard caps: crawl ≤ 500 pages, batch ≤ 100 URLs, 20 concurrent +- Does NOT include: anti-bot bypass, JS rendering, async jobs, + multi-tenant auth, billing, proxy rotation, search/research/watch/ + agent-scrape. 
Those live behind api.webclaw.io and are closed-source. + ## Hard Rules - **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible. diff --git a/Cargo.lock b/Cargo.lock index e5c30e7..0f5fc5c 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -182,6 +182,70 @@ version = "1.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8" +[[package]] +name = "axum" +version = "0.8.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "31b698c5f9a010f6573133b09e0de5408834d0c82f8d7475a89fc1867a71cd90" +dependencies = [ + "axum-core", + "axum-macros", + "bytes", + "form_urlencoded", + "futures-util", + "http", + "http-body", + "http-body-util", + "hyper", + "hyper-util", + "itoa", + "matchit", + "memchr", + "mime", + "percent-encoding", + "pin-project-lite", + "serde_core", + "serde_json", + "serde_path_to_error", + "serde_urlencoded", + "sync_wrapper", + "tokio", + "tower", + "tower-layer", + "tower-service", + "tracing", +] + +[[package]] +name = "axum-core" +version = "0.5.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "08c78f31d7b1291f7ee735c1c6780ccde7785daae9a9206026862dab7d8792d1" +dependencies = [ + "bytes", + "futures-core", + "http", + "http-body", + "http-body-util", + "mime", + "pin-project-lite", + "sync_wrapper", + "tower-layer", + "tower-service", + "tracing", +] + +[[package]] +name = "axum-macros" +version = "0.5.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7aa268c23bfbbd2c4363b9cd302a4f504fb2a9dfe7e3451d66f35dd392e20aca" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + [[package]] name = "base64" version = "0.22.1" @@ -1132,6 +1196,12 @@ version = "1.10.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6dbf3de79e51f3d586ab4cb9d5c3e2c14aa28ed23d180cf89b4df0454a69cc87" +[[package]] +name = 
"httpdate" +version = "1.0.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "df3b46402a9d5adb4c86a0cf463f42e19994e3ee891101b1841f30a545cb49a9" + [[package]] name = "hyper" version = "1.9.0" @@ -1145,6 +1215,7 @@ dependencies = [ "http", "http-body", "httparse", + "httpdate", "itoa", "pin-project-lite", "smallvec", @@ -1559,6 +1630,12 @@ dependencies = [ "regex-automata", ] +[[package]] +name = "matchit" +version = "0.8.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "47e1ffaa40ddd1f3ed91f717a33c8c0ee23fff369e3aa8772b9605cc1d22f4c3" + [[package]] name = "md-5" version = "0.10.6" @@ -1575,6 +1652,12 @@ version = "2.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f8ca58f447f06ed17d5fc4043ce1b10dd205e060fb3ce5b979b8ed8e59ff3f79" +[[package]] +name = "mime" +version = "0.3.17" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6877bb514081ee2a7ff5ef9de3281f14a4dd4bceac4c09388074a6b5df8a139a" + [[package]] name = "minimal-lexical" version = "0.2.1" @@ -2403,6 +2486,17 @@ dependencies = [ "zmij", ] +[[package]] +name = "serde_path_to_error" +version = "0.1.20" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "10a9ff822e371bb5403e391ecd83e182e0e77ba7f6fe0160b795797109d1b457" +dependencies = [ + "itoa", + "serde", + "serde_core", +] + [[package]] name = "serde_urlencoded" version = "0.7.1" @@ -2757,6 +2851,7 @@ dependencies = [ "tokio", "tower-layer", "tower-service", + "tracing", ] [[package]] @@ -2780,6 +2875,7 @@ dependencies = [ "tower", "tower-layer", "tower-service", + "tracing", ] [[package]] @@ -2800,6 +2896,7 @@ version = "0.1.44" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "63e71662fa4b2a2c3a26f570f037eb95bb1f85397f3cd8076caed2f026a6d100" dependencies = [ + "log", "pin-project-lite", "tracing-attributes", "tracing-core", @@ -3102,7 +3199,7 @@ dependencies = [ [[package]] name = 
"webclaw-cli" -version = "0.3.19" +version = "0.4.0" dependencies = [ "clap", "dotenvy", @@ -3123,7 +3220,7 @@ dependencies = [ [[package]] name = "webclaw-core" -version = "0.3.19" +version = "0.4.0" dependencies = [ "ego-tree", "once_cell", @@ -3141,7 +3238,7 @@ dependencies = [ [[package]] name = "webclaw-fetch" -version = "0.3.19" +version = "0.4.0" dependencies = [ "bytes", "calamine", @@ -3163,7 +3260,7 @@ dependencies = [ [[package]] name = "webclaw-llm" -version = "0.3.19" +version = "0.4.0" dependencies = [ "async-trait", "reqwest", @@ -3176,7 +3273,7 @@ dependencies = [ [[package]] name = "webclaw-mcp" -version = "0.3.19" +version = "0.4.0" dependencies = [ "dirs", "dotenvy", @@ -3197,13 +3294,34 @@ dependencies = [ [[package]] name = "webclaw-pdf" -version = "0.3.19" +version = "0.4.0" dependencies = [ "pdf-extract", "thiserror", "tracing", ] +[[package]] +name = "webclaw-server" +version = "0.4.0" +dependencies = [ + "anyhow", + "axum", + "clap", + "serde", + "serde_json", + "subtle", + "thiserror", + "tokio", + "tower-http", + "tracing", + "tracing-subscriber", + "webclaw-core", + "webclaw-fetch", + "webclaw-llm", + "webclaw-pdf", +] + [[package]] name = "webpki-root-certs" version = "1.0.6" diff --git a/Cargo.toml b/Cargo.toml index 41e78ac..e17d843 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -3,7 +3,7 @@ resolver = "2" members = ["crates/*"] [workspace.package] -version = "0.3.19" +version = "0.4.0" edition = "2024" license = "AGPL-3.0" repository = "https://github.com/0xMassi/webclaw" diff --git a/Dockerfile b/Dockerfile index 36fa67f..6f84e06 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,5 +1,12 @@ # webclaw — Multi-stage Docker build -# Produces 2 binaries: webclaw (CLI) and webclaw-mcp (MCP server) +# Produces 3 binaries: +# webclaw — CLI (single-shot extraction, crawl, MCP-less use) +# webclaw-mcp — MCP server (stdio, for AI agents) +# webclaw-server — minimal REST API for self-hosting (OSS, stateless) +# +# NOTE: this is NOT the hosted API at 
api.webclaw.io — the cloud service +# adds anti-bot bypass, JS rendering, multi-tenant auth and async jobs +# that are intentionally not open-source. See docs/self-hosting. # --------------------------------------------------------------------------- # Stage 1: Build all binaries in release mode @@ -25,6 +32,7 @@ COPY crates/webclaw-llm/Cargo.toml crates/webclaw-llm/Cargo.toml COPY crates/webclaw-pdf/Cargo.toml crates/webclaw-pdf/Cargo.toml COPY crates/webclaw-mcp/Cargo.toml crates/webclaw-mcp/Cargo.toml COPY crates/webclaw-cli/Cargo.toml crates/webclaw-cli/Cargo.toml +COPY crates/webclaw-server/Cargo.toml crates/webclaw-server/Cargo.toml # Copy .cargo config if present (optional build flags) COPY .cargo .cargo @@ -35,7 +43,8 @@ RUN mkdir -p crates/webclaw-core/src && echo "" > crates/webclaw-core/src/lib.rs && mkdir -p crates/webclaw-llm/src && echo "" > crates/webclaw-llm/src/lib.rs \ && mkdir -p crates/webclaw-pdf/src && echo "" > crates/webclaw-pdf/src/lib.rs \ && mkdir -p crates/webclaw-mcp/src && echo "fn main() {}" > crates/webclaw-mcp/src/main.rs \ - && mkdir -p crates/webclaw-cli/src && echo "fn main() {}" > crates/webclaw-cli/src/main.rs + && mkdir -p crates/webclaw-cli/src && echo "fn main() {}" > crates/webclaw-cli/src/main.rs \ + && mkdir -p crates/webclaw-server/src && echo "fn main() {}" > crates/webclaw-server/src/main.rs # Pre-build dependencies (this layer is cached until Cargo.toml/lock changes) RUN cargo build --release 2>/dev/null || true @@ -54,9 +63,22 @@ RUN apt-get update && apt-get install -y --no-install-recommends \ ca-certificates \ && rm -rf /var/lib/apt/lists/* -# Copy both binaries +# Copy all three binaries COPY --from=builder /build/target/release/webclaw /usr/local/bin/webclaw COPY --from=builder /build/target/release/webclaw-mcp /usr/local/bin/webclaw-mcp +COPY --from=builder /build/target/release/webclaw-server /usr/local/bin/webclaw-server + +# Default port the REST API listens on when you run `webclaw-server` inside +# the 
container. Override with -e WEBCLAW_PORT=... or --port. Published only +# as documentation; callers still need `-p 3000:3000` on `docker run`. +EXPOSE 3000 + +# Container default: bind all interfaces so `-p 3000:3000` works. The binary +# itself defaults to 127.0.0.1 (safe for `cargo run` on a laptop); inside +# Docker that would make the server unreachable, so we flip it here. +# Override with -e WEBCLAW_HOST=127.0.0.1 if you front this with another +# process in the same container. +ENV WEBCLAW_HOST=0.0.0.0 # Entrypoint shim: forwards webclaw args/URL to the binary, but exec's other # commands directly so this image can be used as a FROM base with custom CMD. diff --git a/Dockerfile.ci b/Dockerfile.ci index dd1efcb..ccd8a33 100644 --- a/Dockerfile.ci +++ b/Dockerfile.ci @@ -12,6 +12,15 @@ RUN apt-get update && apt-get install -y --no-install-recommends \ ARG BINARY_DIR COPY ${BINARY_DIR}/webclaw /usr/local/bin/webclaw COPY ${BINARY_DIR}/webclaw-mcp /usr/local/bin/webclaw-mcp +COPY ${BINARY_DIR}/webclaw-server /usr/local/bin/webclaw-server + +# Default REST API port when running `webclaw-server` inside the container. +EXPOSE 3000 + +# Container default: bind all interfaces so `-p 3000:3000` works. The +# binary itself defaults to 127.0.0.1; flipping here keeps the CLI safe on +# a laptop but makes the container reachable out of the box. +ENV WEBCLAW_HOST=0.0.0.0 # Entrypoint shim: forwards webclaw args/URL to the binary, but exec's other # commands directly so this image can be used as a FROM base with custom CMD. diff --git a/crates/webclaw-cli/src/bench.rs b/crates/webclaw-cli/src/bench.rs new file mode 100644 index 0000000..3e45da4 --- /dev/null +++ b/crates/webclaw-cli/src/bench.rs @@ -0,0 +1,422 @@ +//! `webclaw bench ` — per-URL extraction micro-benchmark. +//! +//! Fetches a page, extracts it via the same pipeline that powers +//! `--format llm`, and reports how many tokens the LLM pipeline +//! removed vs. the raw HTML. Optional `--facts` reuses the +//! 
benchmark harness's curated fact lists to score fidelity. +//! +//! v1 uses an *approximate* tokenizer (chars/4 for Latin text, +//! chars/2 for CJK-heavy text). Output is clearly labeled +//! "≈ tokens" so nobody mistakes it for a real tiktoken run. +//! Swapping to tiktoken-rs later is a one-function change. + +use std::path::{Path, PathBuf}; +use std::time::Instant; + +use webclaw_core::{extract, to_llm_text}; +use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig}; + +/// Inputs collected from the clap subcommand. +pub struct BenchArgs { + pub url: String, + pub json: bool, + pub facts: Option<PathBuf>, +} + +/// What a single bench run measures. +struct BenchResult { + url: String, + raw_tokens: usize, + raw_bytes: usize, + llm_tokens: usize, + llm_bytes: usize, + reduction_pct: f64, + elapsed_secs: f64, + /// `Some((found, total))` when `--facts` is supplied and the URL has + /// an entry in the facts file; `None` otherwise. + facts: Option<(usize, usize)>, +} + +pub async fn run(args: &BenchArgs) -> Result<(), String> { + // Dedicated client so bench doesn't care about global CLI flags + // (proxies, custom headers, etc.). A reproducible microbench is + // more useful than an over-configurable one; if someone wants to + // bench behind a proxy they can set WEBCLAW_PROXY — respected + // by FetchConfig via the regular channels if we extend later. 
+ let config = FetchConfig { + browser: BrowserProfile::Chrome, + ..FetchConfig::default() + }; + let client = FetchClient::new(config).map_err(|e| format!("build client: {e}"))?; + + let start = Instant::now(); + let fetched = client + .fetch(&args.url) + .await + .map_err(|e| format!("fetch: {e}"))?; + + let extraction = + extract(&fetched.html, Some(&fetched.url)).map_err(|e| format!("extract: {e}"))?; + let llm_text = to_llm_text(&extraction, Some(&fetched.url)); + let elapsed = start.elapsed(); + + let raw_tokens = approx_tokens(&fetched.html); + let llm_tokens = approx_tokens(&llm_text); + let raw_bytes = fetched.html.len(); + let llm_bytes = llm_text.len(); + let reduction_pct = if raw_tokens == 0 { + 0.0 + } else { + 100.0 * (1.0 - llm_tokens as f64 / raw_tokens as f64) + }; + + let facts = match args.facts.as_deref() { + Some(path) => check_facts(path, &args.url, &llm_text)?, + None => None, + }; + + let result = BenchResult { + url: args.url.clone(), + raw_tokens, + raw_bytes, + llm_tokens, + llm_bytes, + reduction_pct, + elapsed_secs: elapsed.as_secs_f64(), + facts, + }; + + if args.json { + print_json(&result); + } else { + print_box(&result); + } + Ok(()) +} + +// --------------------------------------------------------------------------- +// Approximate tokenizer +// --------------------------------------------------------------------------- + +/// Rough token count. `chars / 4` is the classic English rule of thumb +/// (close to cl100k_base for typical prose). CJK scripts pack ~2 chars +/// per token, so we switch to `chars / 2` when CJK dominates. +/// +/// Off by ±10% vs. a real BPE tokenizer, which is fine for "is webclaw's +/// output 66% smaller or 66% bigger than raw HTML" — the signal is +/// order-of-magnitude, not precise accounting. 
+fn approx_tokens(s: &str) -> usize { + let total: usize = s.chars().count(); + if total == 0 { + return 0; + } + let cjk = s.chars().filter(|c| is_cjk(*c)).count(); + let cjk_ratio = cjk as f64 / total as f64; + if cjk_ratio > 0.30 { + total.div_ceil(2) + } else { + total.div_ceil(4) + } +} + +fn is_cjk(c: char) -> bool { + let n = c as u32; + (0x4E00..=0x9FFF).contains(&n) // CJK Unified Ideographs + || (0x3040..=0x309F).contains(&n) // Hiragana + || (0x30A0..=0x30FF).contains(&n) // Katakana + || (0xAC00..=0xD7AF).contains(&n) // Hangul Syllables + || (0x3400..=0x4DBF).contains(&n) // CJK Extension A +} + +// --------------------------------------------------------------------------- +// Output: ASCII / Unicode box +// --------------------------------------------------------------------------- + +const BOX_WIDTH: usize = 62; // inner width between the two side borders + +fn print_box(r: &BenchResult) { + let host = display_host(&r.url); + let version = env!("CARGO_PKG_VERSION"); + + let top = "─".repeat(BOX_WIDTH); + let sep = "─".repeat(BOX_WIDTH); + + // Header: host on the left, "webclaw X.Y.Z" on the right. 
+ let left = host; + let right = format!("webclaw {version}"); + let pad = BOX_WIDTH.saturating_sub(left.chars().count() + right.chars().count() + 2); + let header = format!(" {}{}{} ", left, " ".repeat(pad), right); + + println!("┌{top}┐"); + println!("│{header}│"); + println!("├{sep}┤"); + print_row( + "raw HTML", + &format!("{} ≈ tokens", fmt_int(r.raw_tokens)), + &fmt_bytes(r.raw_bytes), + ); + print_row( + "--format llm", + &format!("{} ≈ tokens", fmt_int(r.llm_tokens)), + &fmt_bytes(r.llm_bytes), + ); + print_row("token reduction", &format!("{:.1}%", r.reduction_pct), ""); + print_row("extraction time", &format!("{:.2} s", r.elapsed_secs), ""); + if let Some((found, total)) = r.facts { + let pct = if total == 0 { + 0.0 + } else { + 100.0 * found as f64 / total as f64 + }; + print_row( + "facts preserved", + &format!("{found}/{total} ({pct:.1}%)"), + "", + ); + } + println!("└{top}┘"); + println!(); + println!("note: token counts are approximate (chars/4 Latin, chars/2 CJK)."); +} + +fn print_row(label: &str, middle: &str, right: &str) { + // Layout inside the box: + // "