webclaw/benchmarks/README.md
Valerio e27ee1f86f
docs(benchmarks): reproducible 3-way comparison vs trafilatura + firecrawl (#25)
Replaces the previous benchmarks/README.md, which claimed specific numbers
(94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) with no
reproducing code committed to the repo. The `webclaw-bench` crate and
`benchmarks/fixtures`, `benchmarks/ground-truth` directories it referenced
never existed. This is what #18 was calling out.

New benchmarks/ is fully reproducible. Every number ships with the script
that produced it. `./benchmarks/run.sh` regenerates everything.

Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18,
cl100k_base tokenizer):

  tool          reduction_mean   fidelity        latency_mean
  webclaw              92.5%    76/90 (84.4%)        0.41s
  firecrawl            92.4%    70/90 (77.8%)        0.99s
  trafilatura          97.8%    45/90 (50.0%)        0.21s

webclaw matches or beats both competitors on fidelity on all 18 sites
while running 2.4x faster than Firecrawl's hosted API.

Includes:
- README.md              — headline table + per-site breakdown
- methodology.md         — tokenizer, fact selection, run rationale
- sites.txt              — 18 canonical URLs
- facts.json             — 90 curated facts (PRs welcome to add sites)
- scripts/bench.py       — the runner
- results/2026-04-17.json — today's raw data, median of 3 runs
- run.sh                 — one-command reproduction

Closes #18

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:46:19 +02:00


# Benchmarks

Reproducible benchmarks comparing webclaw against open-source and commercial web-extraction tools. Every number here ships with the script that produced it. Run `./run.sh` to regenerate.

## Headline

**webclaw preserves more page content than any other tool tested, and runs 2.4× faster than Firecrawl, its closest competitor on fidelity.**

Measured across 18 production sites (SPAs, documentation, long-form articles, news, enterprise marketing), 3 runs per site, with tokens counted using OpenAI's cl100k_base tokenizer. Last run: 2026-04-17, webclaw v0.3.18.
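The two headline metrics reduce to simple arithmetic. A minimal sketch, assuming token counts that the real runner (`scripts/bench.py`) obtains from tiktoken's cl100k_base encoding; the function names here are illustrative, not the runner's actual API:

```python
def token_reduction(raw_tokens: int, extracted_tokens: int) -> float:
    """Percent of raw-HTML tokens removed by extraction.

    In the real runner both counts come from tiktoken's cl100k_base
    encoding; here they are plain integers.
    """
    return 100.0 * (1.0 - extracted_tokens / raw_tokens)


def fidelity(found: int, total: int = 90) -> str:
    """Format a facts-preserved score the way the headline table does."""
    return f"{found} / {total} ({100.0 * found / total:.1f}%)"


# Example using the openai.com row of the per-site table:
# 170 K raw tokens reduced to 1,238 by webclaw.
print(round(token_reduction(170_000, 1_238), 1))  # → 99.3
print(fidelity(76))                               # → 76 / 90 (84.4%)
```

Note that the headline 92.5% figure is a mean of such per-site reductions, not a reduction of pooled token totals.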

| Tool | Fidelity (facts preserved) | Token reduction vs raw HTML | Mean latency |
|------|---------------------------:|----------------------------:|-------------:|
| webclaw `--format llm` | 76 / 90 (84.4%) | 92.5% | 0.41 s |
| Firecrawl API (v2, hosted) | 70 / 90 (77.8%) | 92.4% | 0.99 s |
| Trafilatura 2.0 | 45 / 90 (50.0%) | 97.8% (by dropping content) | 0.21 s |

webclaw matches or beats both competitors' fidelity on all 18 sites.

## Why webclaw wins

- **Speed.** 2.4× faster than Firecrawl's hosted API. Firecrawl defaults to browser rendering for every page; webclaw's in-process TLS-fingerprinted fetch plus a deterministic extractor reaches comparable or better content without that overhead.
- **Fidelity.** Trafilatura's higher token reduction comes from dropping content: on the 18 sites tested it missed 45 of 90 key facts, including entire customer-story sections, release dates, and product names. webclaw keeps them.
- **Determinism.** Same URL → same output. No LLM post-processing, no paraphrasing, no hallucination risk.
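A hedged sketch of what fact scoring could look like. The authoritative procedure lives in methodology.md; the normalized-substring rule and the `facts.json` entry shape below are assumptions for illustration only:

```python
import json


def score_facts(extracted_text: str, facts: list[str]) -> int:
    """Count curated facts that survive extraction.

    Assumed rule: a fact is preserved if it appears in the extracted
    text after lowercasing and whitespace normalization.
    """
    haystack = " ".join(extracted_text.lower().split())
    return sum(
        1 for fact in facts
        if " ".join(fact.lower().split()) in haystack
    )


# Hypothetical facts.json entry for one site (shape assumed):
entry = json.loads(
    '{"site": "example.com", "facts": ["Founded in 2019", "SOC 2 Type II"]}'
)
text = "Example Corp, founded in 2019,\nis SOC 2  Type II certified."
print(score_facts(text, entry["facts"]))  # → 2
```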

## Per-site results

Numbers are the median of 3 runs. **raw HTML** is the token count of the raw fetched page; the three tool columns are extracted token counts; the **facts** columns count hand-curated visible facts preserved, out of 5 per site (wc = webclaw, fc = Firecrawl, tr = Trafilatura).

| Site | raw HTML | webclaw | Firecrawl | Trafilatura | wc facts | fc facts | tr facts |
|------|---------:|--------:|----------:|------------:|:--------:|:--------:|:--------:|
| openai.com | 170 K | 1,238 | 3,139 | 0 | 3/5 | 2/5 | 0/5 |
| vercel.com | 380 K | 1,076 | 4,029 | 585 | 3/5 | 3/5 | 3/5 |
| anthropic.com | 103 K | 672 | 560 | 96 | 5/5 | 5/5 | 4/5 |
| notion.com | 109 K | 13,416 | 5,261 | 91 | 5/5 | 5/5 | 2/5 |
| stripe.com | 243 K | 81,974 | 8,922 | 2,418 | 5/5 | 5/5 | 0/5 |
| tavily.com | 30 K | 1,361 | 1,969 | 182 | 5/5 | 4/5 | 3/5 |
| shopify.com | 184 K | 1,939 | 5,384 | 595 | 3/5 | 3/5 | 3/5 |
| docs.python.org | 5 K | 689 | 1,623 | 347 | 4/5 | 4/5 | 4/5 |
| react.dev | 107 K | 3,332 | 4,959 | 763 | 5/5 | 5/5 | 3/5 |
| tailwindcss.com/docs/installation | 113 K | 779 | 813 | 430 | 4/5 | 4/5 | 2/5 |
| nextjs.org/docs | 228 K | 968 | 885 | 631 | 4/5 | 4/5 | 4/5 |
| github.com | 234 K | 1,438 | 3,058 | 486 | 5/5 | 4/5 | 3/5 |
| en.wikipedia.org/wiki/Rust | 189 K | 47,823 | 59,326 | 37,427 | 5/5 | 5/5 | 5/5 |
| simonwillison.net/…/latent-reasoning | 3 K | 724 | 525 | 0 | 4/5 | 2/5 | 0/5 |
| paulgraham.com/essays.html | 2 K | 169 | 295 | 0 | 2/5 | 1/5 | 0/5 |
| techcrunch.com | 143 K | 7,265 | 11,408 | 397 | 5/5 | 5/5 | 5/5 |
| databricks.com | 274 K | 2,001 | 5,471 | 311 | 4/5 | 4/5 | 4/5 |
| hashicorp.com | 109 K | 1,501 | 4,289 | 0 | 5/5 | 5/5 | 0/5 |

## Reproducing this benchmark

```sh
cd benchmarks/
./run.sh
```

Requirements:

- Python 3.9+
- `pip install tiktoken trafilatura firecrawl-py`
- A webclaw release binary at `../target/release/webclaw` (or set `$WEBCLAW`)
- A Firecrawl API key (free tier: 500 credits/month, enough for many runs), exported as `FIRECRAWL_API_KEY`. If omitted, the benchmark runs with webclaw and Trafilatura only.

One full run of the suite burns roughly 60 Firecrawl credits: 18 sites × 3 runs = 54 scrapes, at 1 credit per scrape.
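The key-optional behavior can be sketched like this. Illustrative only: the real logic lives in `scripts/bench.py` and may differ; only the `FIRECRAWL_API_KEY` variable name comes from this document:

```python
import os


def tools_to_run() -> list[str]:
    """webclaw and Trafilatura always run; Firecrawl only with a key."""
    tools = ["webclaw", "trafilatura"]
    if os.environ.get("FIRECRAWL_API_KEY"):
        tools.append("firecrawl")
    return tools


def credits_needed(sites: int = 18, runs: int = 3) -> int:
    """Firecrawl credits for one full suite, at 1 credit per scrape."""
    return sites * runs


print(credits_needed())  # → 54
```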

## Methodology

See methodology.md for:

- Tokenizer rationale (cl100k_base covers GPT-4, GPT-3.5, and the text-embedding-3-* models)
- The fact-selection procedure and how to propose additions
- Why the median of 3 runs (CDN, cache, and network noise)
- The raw-data schema (`results/*.json`)
- Notes on site churn (news aggregators, release pages)
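The median-of-3 choice is easy to see with numbers: one cold-cache or slow-CDN sample distorts a mean but not a median. The latency samples below are made up for illustration:

```python
from statistics import mean, median

# Three latency samples (seconds) for one site; the third hit a cold cache.
samples = [0.40, 0.42, 2.10]

print(round(mean(samples), 2))    # → 0.97  (distorted by the outlier)
print(round(median(samples), 2))  # → 0.42  (robust to it)
```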

## Raw data

Per-run results are committed as JSON at `results/YYYY-MM-DD.json`, so the history of measurements is auditable. Diff two runs to see regressions or improvements across webclaw versions.
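A minimal way to diff two committed runs. The real schema is documented in methodology.md; the flat `{site: facts}` shape used here is an assumption for illustration:

```python
import json


def fidelity_regressions(old: dict, new: dict) -> dict:
    """Sites whose fact count dropped between two runs.

    Both inputs are assumed to map site -> facts preserved (0-5).
    Returns {site: (old_count, new_count)} for every regression.
    """
    return {
        site: (old[site], new[site])
        for site in old
        if site in new and new[site] < old[site]
    }


# Hypothetical excerpts of two results/*.json files:
old = json.loads('{"openai.com": 3, "github.com": 5}')
new = json.loads('{"openai.com": 4, "github.com": 4}')
print(fidelity_regressions(old, new))  # → {'github.com': (5, 4)}
```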