docs(benchmarks): reproducible 3-way comparison vs trafilatura + firecrawl (#25)

Replaces the previous benchmarks/README.md, which claimed specific numbers
(94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) with no
reproducing code committed to the repo. The `webclaw-bench` crate and
`benchmarks/fixtures`, `benchmarks/ground-truth` directories it referenced
never existed. That gap is what #18 called out.

The new benchmarks/ directory is fully reproducible: every number ships
with the script that produced it, and `./benchmarks/run.sh` regenerates
everything.

Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18,
cl100k_base tokenizer):

  tool          reduction_mean   fidelity        latency_mean
  webclaw              92.5%    76/90 (84.4%)        0.41s
  firecrawl            92.4%    70/90 (77.8%)        0.99s
  trafilatura          97.8%    45/90 (50.0%)        0.21s

webclaw matches or beats both competitors on fidelity on all 18 sites
while running 2.4x faster than Firecrawl's hosted API (0.41s vs 0.99s
mean latency).
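
The headline percentages and the 2.4x figure can be rechecked by hand. The sketch below is only an arithmetic sanity check: the counts and latencies are copied from the table above, and every name in it (`facts_found`, `fidelity_pct`, `speedup`) is illustrative rather than part of the repo or of scripts/bench.py.

```python
# Figures copied from the results table in this commit message.
# All names below are hypothetical; they do not come from scripts/bench.py.
TOTAL_FACTS = 90
facts_found = {"webclaw": 76, "firecrawl": 70, "trafilatura": 45}
latency_mean_s = {"webclaw": 0.41, "firecrawl": 0.99, "trafilatura": 0.21}


def fidelity_pct(tool: str) -> float:
    """Share of the curated facts recovered by a tool, as a percentage."""
    return round(100 * facts_found[tool] / TOTAL_FACTS, 1)


def speedup(fast: str, slow: str) -> float:
    """How many times lower the mean latency of `fast` is than that of `slow`."""
    return round(latency_mean_s[slow] / latency_mean_s[fast], 1)


if __name__ == "__main__":
    for tool in facts_found:
        print(f"{tool:<12} {fidelity_pct(tool):5.1f}%")
    print(f"webclaw vs firecrawl: {speedup('webclaw', 'firecrawl')}x")
```

The same ratio also shows that trafilatura has the lowest mean latency of the three, so the speed claim applies only to the Firecrawl comparison.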

Includes:
- README.md              — headline table + per-site breakdown
- methodology.md         — tokenizer, fact selection, run rationale
- sites.txt              — 18 canonical URLs
- facts.json             — 90 curated facts (PRs welcome to add sites)
- scripts/bench.py       — the runner
- results/2026-04-17.json — today's raw data, median of 3 runs
- run.sh                 — one-command reproduction

Closes #18

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Valerio 2026-04-17 14:46:19 +02:00 committed by GitHub
parent 0463b5e263
commit e27ee1f86f
7 changed files with 934 additions and 118 deletions

benchmarks/sites.txt (new file, 31 lines)

@@ -0,0 +1,31 @@
# One URL per line. Comments (#) and blank lines ignored.
# Sites chosen to span: SPA marketing, enterprise SaaS, documentation,
# long-form content, news, and aggregator pages.
# --- SPA marketing ---
https://openai.com
https://vercel.com
https://anthropic.com
https://www.notion.com
https://stripe.com
https://tavily.com
https://www.shopify.com
# --- Documentation ---
https://docs.python.org/3/
https://react.dev
https://tailwindcss.com/docs/installation
https://nextjs.org/docs
https://github.com
# --- Long-form content ---
https://en.wikipedia.org/wiki/Rust_(programming_language)
https://simonwillison.net/2026/Mar/15/latent-reasoning/
https://paulgraham.com/essays.html
# --- News / commerce ---
https://techcrunch.com
# --- Enterprise SaaS ---
https://www.databricks.com
https://www.hashicorp.com
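
The format rule in the file header (one URL per line, `#` comments and blank lines ignored) could be read as in the minimal sketch below. How scripts/bench.py actually loads the file is not shown in this diff, so `parse_sites` is a hypothetical helper and its handling of edge cases is an assumption.

```python
from typing import Iterable, List


def parse_sites(lines: Iterable[str]) -> List[str]:
    """Return the URLs from a sites.txt-style listing.

    Assumption: only whole-line comments are skipped. A '#' inside a line is
    left alone, because it may be a legitimate URL fragment.
    """
    urls: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        urls.append(line)
    return urls


if __name__ == "__main__":
    sample = [
        "# --- Documentation ---",
        "https://react.dev",
        "",
        "https://en.wikipedia.org/wiki/Rust_(programming_language)",
    ]
    for url in parse_sites(sample):
        print(url)
```

To read the real file, pass an open file handle, for example `parse_sites(open("benchmarks/sites.txt", encoding="utf-8"))`.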