Replaces the previous benchmarks/README.md, which claimed specific numbers
(94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) without any
reproducing code committed to the repo. The `webclaw-bench` crate and the
`benchmarks/fixtures` and `benchmarks/ground-truth` directories it referenced
never existed. This is the problem #18 called out.
New benchmarks/ is fully reproducible. Every number ships with the script
that produced it. `./benchmarks/run.sh` regenerates everything.
Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18,
cl100k_base tokenizer):
tool         reduction_mean  fidelity        latency_mean
webclaw      92.5%           76/90 (84.4%)   0.41s
firecrawl    92.4%           70/90 (77.8%)   0.99s
trafilatura  97.8%           45/90 (50.0%)   0.21s
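The two quality metrics reduce to simple formulas. A minimal sketch below; the function names and the naive substring fidelity check are assumptions for illustration, not necessarily how scripts/bench.py implements them:

```python
def token_reduction(raw_tokens: int, extracted_tokens: int) -> float:
    """Fraction of raw-page tokens removed by extraction.

    Token counts are assumed to come from the cl100k_base tokenizer
    named in the results header (e.g. via tiktoken).
    """
    return 1.0 - extracted_tokens / raw_tokens


def fidelity(extracted_text: str, facts: list[str]) -> float:
    """Share of hand-curated facts that survive extraction.

    Case-insensitive substring check; the real runner may match
    facts more carefully (normalization, fuzzy matching, etc.).
    """
    hits = sum(1 for fact in facts if fact.lower() in extracted_text.lower())
    return hits / len(facts)


# e.g. a raw page of 12,000 tokens reduced to 900 extracted tokens:
print(f"{token_reduction(12_000, 900):.1%}")  # 92.5%
```

The headline numbers are the per-site means of these two values, taken over the median of 3 runs.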
webclaw matches or beats both competitors on fidelity across all 18 sites
while running 2.4x faster than Firecrawl's hosted API.
Includes:
- README.md — headline table + per-site breakdown
- methodology.md — tokenizer, fact selection, run rationale
- sites.txt — 18 canonical URLs
- facts.json — 90 curated facts (PRs welcome to add sites)
- scripts/bench.py — the runner
- results/2026-04-17.json — today's raw data, median of 3 runs
- run.sh — one-command reproduction
Closes #18
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>