webclaw/benchmarks/README.md
Valerio e27ee1f86f
docs(benchmarks): reproducible 3-way comparison vs trafilatura + firecrawl (#25)
Replaces the previous benchmarks/README.md, which claimed specific numbers
(94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) with no
reproducing code committed to the repo. The `webclaw-bench` crate and
`benchmarks/fixtures`, `benchmarks/ground-truth` directories it referenced
never existed. This is what #18 was calling out.

New benchmarks/ is fully reproducible. Every number ships with the script
that produced it. `./benchmarks/run.sh` regenerates everything.

Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18,
cl100k_base tokenizer):

  tool          reduction_mean   fidelity        latency_mean
  webclaw              92.5%    76/90 (84.4%)        0.41s
  firecrawl            92.4%    70/90 (77.8%)        0.99s
  trafilatura          97.8%    45/90 (50.0%)        0.21s

webclaw matches or beats both competitors on fidelity on all 18 sites
while running 2.4x faster than Firecrawl's hosted API.

Includes:
- README.md              — headline table + per-site breakdown
- methodology.md         — tokenizer, fact selection, run rationale
- sites.txt              — 18 canonical URLs
- facts.json             — 90 curated facts (PRs welcome to add sites)
- scripts/bench.py       — the runner
- results/2026-04-17.json — today's raw data, median of 3 runs
- run.sh                 — one-command reproduction

Closes #18

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:46:19 +02:00

# Benchmarks

Reproducible benchmarks comparing `webclaw` against open-source and commercial
web extraction tools. Every number here ships with the script that produced it.
Run `./run.sh` to regenerate.

## Headline

**webclaw preserves more page content than any other tool tested, at 2.4× the
speed of Firecrawl, its closest competitor on fidelity.**

Across 18 production sites (SPAs, documentation, long-form articles, news,
enterprise marketing), measured over 3 runs per site with OpenAI's
`cl100k_base` tokenizer. Last run: 2026-04-17, webclaw v0.3.18.

| Tool | Fidelity (facts preserved) | Token reduction vs raw HTML | Mean latency |
|---|---:|---:|---:|
| **webclaw `--format llm`** | **76 / 90 (84.4 %)** | 92.5 % | **0.41 s** |
| Firecrawl API (v2, hosted) | 70 / 90 (77.8 %) | 92.4 % | 0.99 s |
| Trafilatura 2.0 | 45 / 90 (50.0 %) | 97.8 % (by dropping content) | 0.21 s |

**webclaw matches or beats both competitors on fidelity on all 18 sites.**
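
Token reduction here is a plain tokenizer ratio, nothing more. A minimal
sketch of the computation, assuming only `tiktoken` (the helper name is
illustrative, not the actual `scripts/bench.py` API):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # same tokenizer the benchmark uses

def token_reduction(raw_html: str, extracted: str) -> float:
    """Fraction of raw-HTML tokens eliminated, e.g. 0.925 for webclaw's 92.5 %."""
    # disallowed_special=() keeps encode() from raising if a page happens
    # to contain literal special-token strings like "<|endoftext|>".
    raw = len(enc.encode(raw_html, disallowed_special=()))
    kept = len(enc.encode(extracted, disallowed_special=()))
    return 1.0 - kept / raw
```

Fidelity is measured separately, against the 90 curated facts; the per-site
breakdown is below.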

## Why webclaw wins

- **Speed.** 2.4× faster than Firecrawl's hosted API. Firecrawl defaults to
  browser rendering for everything; webclaw's in-process TLS-fingerprinted
  fetch plus deterministic extractor reaches comparable-or-better content
  without that overhead.
- **Fidelity.** Trafilatura's higher token reduction comes from dropping
  content. On the 18 sites tested it missed 45 of 90 key facts — entire
  customer-story sections, release dates, product names. webclaw keeps them.
- **Deterministic.** Same URL → same output. No LLM post-processing, no
  paraphrasing, no hallucination risk. (A quick self-check is sketched below.)
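
The determinism claim is easy to verify yourself: extract the same page twice
and compare digests. A sketch, assuming a `webclaw` binary on `$PATH`; note
this checks extractor determinism, so a page that changes between the two
fetches will of course differ:

```python
import hashlib
import subprocess

def extract(url: str) -> bytes:
    # --format llm is the mode benchmarked above.
    result = subprocess.run(["webclaw", "--format", "llm", url],
                            capture_output=True, check=True)
    return result.stdout

url = "https://docs.python.org"  # any entry from sites.txt
first, second = extract(url), extract(url)
assert hashlib.sha256(first).digest() == hashlib.sha256(second).digest()
```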

## Per-site results

Numbers are median of 3 runs. `raw` = raw fetched HTML token count.
`facts` = hand-curated visible facts preserved, out of 5 per site (the scoring
step is sketched below the table).

| Site | raw HTML | webclaw | Firecrawl | Trafilatura | wc facts | fc facts | tr facts |
|---|---:|---:|---:|---:|:---:|:---:|:---:|
| openai.com | 170 K | 1,238 | 3,139 | 0 | **3/5** | 2/5 | 0/5 |
| vercel.com | 380 K | 1,076 | 4,029 | 585 | **3/5** | 3/5 | 3/5 |
| anthropic.com | 103 K | 672 | 560 | 96 | **5/5** | 5/5 | 4/5 |
| notion.com | 109 K | 13,416 | 5,261 | 91 | **5/5** | 5/5 | 2/5 |
| stripe.com | 243 K | 81,974 | 8,922 | 2,418 | **5/5** | 5/5 | 0/5 |
| tavily.com | 30 K | 1,361 | 1,969 | 182 | **5/5** | 4/5 | 3/5 |
| shopify.com | 184 K | 1,939 | 5,384 | 595 | **3/5** | 3/5 | 3/5 |
| docs.python.org | 5 K | 689 | 1,623 | 347 | **4/5** | 4/5 | 4/5 |
| react.dev | 107 K | 3,332 | 4,959 | 763 | **5/5** | 5/5 | 3/5 |
| tailwindcss.com/docs/installation | 113 K | 779 | 813 | 430 | **4/5** | 4/5 | 2/5 |
| nextjs.org/docs | 228 K | 968 | 885 | 631 | **4/5** | 4/5 | 4/5 |
| github.com | 234 K | 1,438 | 3,058 | 486 | **5/5** | 4/5 | 3/5 |
| en.wikipedia.org/wiki/Rust | 189 K | 47,823 | 59,326 | 37,427 | **5/5** | 5/5 | 5/5 |
| simonwillison.net/…/latent-reasoning | 3 K | 724 | 525 | 0 | **4/5** | 2/5 | 0/5 |
| paulgraham.com/essays.html | 2 K | 169 | 295 | 0 | **2/5** | 1/5 | 0/5 |
| techcrunch.com | 143 K | 7,265 | 11,408 | 397 | **5/5** | 5/5 | 5/5 |
| databricks.com | 274 K | 2,001 | 5,471 | 311 | **4/5** | 4/5 | 4/5 |
| hashicorp.com | 109 K | 1,501 | 4,289 | 0 | **5/5** | 5/5 | 0/5 |
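
The scoring step amounts to checking whether each curated fact survives in a
tool's output. A sketch, assuming verbatim substring matching after
whitespace and case normalization; the authoritative procedure and the real
`facts.json` schema live in `scripts/bench.py` and methodology.md:

```python
import json
import re

def normalize(text: str) -> str:
    # Collapse whitespace and case so line wrapping or markdown
    # conversion doesn't mask an otherwise-verbatim match.
    return re.sub(r"\s+", " ", text).lower()

def facts_preserved(site_facts: list[str], output: str) -> int:
    haystack = normalize(output)
    return sum(normalize(fact) in haystack for fact in site_facts)

# Hypothetical facts.json shape: {"docs.python.org": ["fact one", ...], ...}
facts = json.load(open("facts.json"))
# "out.md" stands in for one tool's extracted output for that site.
score = facts_preserved(facts["docs.python.org"], open("out.md").read())
print(f"{score}/5 facts preserved")
```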

## Reproducing this benchmark

```bash
cd benchmarks/
./run.sh
```

Requirements:

- Python 3.9+
- `pip install tiktoken trafilatura firecrawl-py`
- `webclaw` release binary at `../target/release/webclaw` (or set `$WEBCLAW`)
- Firecrawl API key (free tier: 500 credits/month, enough for many runs) —
  export as `FIRECRAWL_API_KEY`. If omitted, the benchmark runs with webclaw
  and Trafilatura only.

One run of the full suite burns ~60 Firecrawl credits (18 sites × 3 runs, at
1 credit per Firecrawl scrape).
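
To spot-check a single site without the full harness, the timing protocol is
just median-of-3 wall clock per tool. A sketch covering the two local tools
(the Firecrawl leg needs `FIRECRAWL_API_KEY` and is left to
`scripts/bench.py`):

```python
import os
import statistics
import subprocess
import time

import trafilatura

# Honor the same override the harness uses.
WEBCLAW = os.environ.get("WEBCLAW", "../target/release/webclaw")

def median_latency(run_once, runs: int = 3) -> float:
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

url = "https://react.dev"  # any entry from sites.txt

wc = median_latency(lambda: subprocess.run(
    [WEBCLAW, "--format", "llm", url], capture_output=True, check=True))
tr = median_latency(lambda: trafilatura.extract(trafilatura.fetch_url(url)))

print(f"webclaw {wc:.2f}s  trafilatura {tr:.2f}s")
```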

## Methodology

See [methodology.md](methodology.md) for:

- Tokenizer rationale (`cl100k_base` → covers GPT-4 / GPT-3.5 /
  `text-embedding-3-*`)
- Fact selection procedure and how to propose additions
- Why median of 3 runs (CDN / cache / network noise)
- Raw data schema (`results/*.json`)
- Notes on site churn (news aggregators, release pages)

## Raw data

Per-run results are committed as JSON at `results/YYYY-MM-DD.json` so the
history of measurements is auditable. Diff two runs to see regressions or
improvements across webclaw versions.
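
A minimal way to diff two runs, assuming each results file maps site to
per-tool metrics (the exact schema is documented in methodology.md; the field
names and the second filename here are illustrative):

```python
import json

old = json.load(open("results/2026-04-17.json"))
new = json.load(open("results/2026-05-01.json"))  # hypothetical later run

# Report any site where webclaw's fact count moved between the two runs.
for site in sorted(old.keys() & new.keys()):
    before = old[site]["webclaw"]["facts"]
    after = new[site]["webclaw"]["facts"]
    if after != before:
        print(f"{site}: {before}/5 -> {after}/5")
```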