mirror of https://github.com/0xMassi/webclaw.git (synced 2026-04-25 00:06:21 +02:00)
docs(benchmarks): reproducible 3-way comparison vs trafilatura + firecrawl (#25)
Replaces the previous benchmarks/README.md, which claimed specific numbers (94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) with no reproducing code committed to the repo. The `webclaw-bench` crate and the `benchmarks/fixtures` and `benchmarks/ground-truth` directories it referenced never existed. This is what #18 was calling out.

New benchmarks/ is fully reproducible. Every number ships with the script that produced it. `./benchmarks/run.sh` regenerates everything.

Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18, cl100k_base tokenizer):

tool         reduction_mean  fidelity       latency_mean
webclaw      92.5%           76/90 (84.4%)  0.41s
firecrawl    92.4%           70/90 (77.8%)  0.99s
trafilatura  97.8%           45/90 (50.0%)  0.21s

webclaw matches or beats both competitors on fidelity on all 18 sites while running 2.4x faster than Firecrawl's hosted API.

Includes:
- README.md: headline table + per-site breakdown
- methodology.md: tokenizer, fact selection, run rationale
- sites.txt: 18 canonical URLs
- facts.json: 90 curated facts (PRs welcome to add sites)
- scripts/bench.py: the runner
- results/2026-04-17.json: today's raw data, median of 3 runs
- run.sh: one-command reproduction

Closes #18

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
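The derived figures in the commit message (the fidelity percentages and the 2.4x speedup) can be re-checked from the raw counts it quotes. The snippet below is only an arithmetic sanity check under that assumption, not part of `scripts/bench.py`; the names `fidelity_pct` and `speedup` are hypothetical helpers introduced for illustration.

```python
FACTS_TOTAL = 90  # number of hand-curated facts quoted in the commit message

# Facts found and mean latency (seconds), copied from the commit message.
results = {
    "webclaw": {"facts_found": 76, "latency_s": 0.41},
    "firecrawl": {"facts_found": 70, "latency_s": 0.99},
    "trafilatura": {"facts_found": 45, "latency_s": 0.21},
}


def fidelity_pct(facts_found, total=FACTS_TOTAL):
    """Share of curated facts recovered, as a percentage with one decimal."""
    return round(100 * facts_found / total, 1)


def speedup(slow_s, fast_s):
    """How many times faster the second latency is than the first."""
    return round(slow_s / fast_s, 1)


for tool, row in results.items():
    print(tool, fidelity_pct(row["facts_found"]))

# Firecrawl's mean latency relative to webclaw's.
print(speedup(results["firecrawl"]["latency_s"], results["webclaw"]["latency_s"]))
```

Rounded to one decimal, the percentages and the ratio agree with the values stated in the message.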
parent 0463b5e263
commit e27ee1f86f
7 changed files with 934 additions and 118 deletions
31
benchmarks/sites.txt
Normal file
@@ -0,0 +1,31 @@
# One URL per line. Comments (#) and blank lines ignored.
# Sites chosen to span: SPA marketing, enterprise SaaS, documentation,
# long-form content, news, and aggregator pages.

# --- SPA marketing ---
https://openai.com
https://vercel.com
https://anthropic.com
https://www.notion.com
https://stripe.com
https://tavily.com
https://www.shopify.com

# --- Documentation ---
https://docs.python.org/3/
https://react.dev
https://tailwindcss.com/docs/installation
https://nextjs.org/docs
https://github.com

# --- Long-form content ---
https://en.wikipedia.org/wiki/Rust_(programming_language)
https://simonwillison.net/2026/Mar/15/latent-reasoning/
https://paulgraham.com/essays.html

# --- News / commerce ---
https://techcrunch.com

# --- Enterprise SaaS ---
https://www.databricks.com
https://www.hashicorp.com
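The format described in the file's header comment (one URL per line, `#` comments and blank lines ignored) can be read with a few lines of Python. This is a minimal sketch, not the code in `scripts/bench.py`; the function name `parse_sites` is hypothetical, and it assumes that only full-line comments need to be skipped.

```python
from typing import Iterable, List


def parse_sites(lines: Iterable[str]) -> List[str]:
    """Return the URLs from sites.txt-style lines, skipping comments and blanks."""
    urls = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and full-line comments, as the file header describes.
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls


# A short excerpt in the same layout as benchmarks/sites.txt.
sample = """\
# --- Documentation ---
https://docs.python.org/3/
https://react.dev

# --- Long-form content ---
https://en.wikipedia.org/wiki/Rust_(programming_language)
"""

sites = parse_sites(sample.splitlines())
print(len(sites))
```

Only lines that start with `#` are treated as comments. Stripping everything after an inline `#` would truncate URLs that carry a fragment, so the sketch deliberately avoids it.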