webclaw/benchmarks/facts.json
Valerio e27ee1f86f
docs(benchmarks): reproducible 3-way comparison vs trafilatura + firecrawl (#25)
Replaces the previous benchmarks/README.md, which claimed specific numbers
(94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) without any
reproducing code committed to the repo. The `webclaw-bench` crate and the
`benchmarks/fixtures` and `benchmarks/ground-truth` directories it referenced
never existed. This is the problem reported in #18.

The new benchmarks/ directory is fully reproducible: every number ships with
the script that produced it, and `./benchmarks/run.sh` regenerates everything.

Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18,
cl100k_base tokenizer):

  tool          reduction_mean   fidelity        latency_mean
  webclaw              92.5%    76/90 (84.4%)        0.41s
  firecrawl            92.4%    70/90 (77.8%)        0.99s
  trafilatura          97.8%    45/90 (50.0%)        0.21s

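webclaw matches or beats both competitors on fidelity on all 18 sites
while running 2.4x faster than Firecrawl's hosted API (0.41s vs 0.99s mean
latency). trafilatura is faster still at 0.21s, but recovers only 45 of the
90 facts.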
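
The headline percentages and the 2.4x figure can be re-derived from the table. The snippet below is a minimal arithmetic check, not part of `scripts/bench.py`; the helper names `fidelity_pct` and `speedup` are made up for illustration, and the inputs are copied from the table above.

```python
# Figures copied from the results table in this commit message.
TOTAL_FACTS = 90
results = {
    "webclaw": {"facts_found": 76, "latency_s": 0.41},
    "firecrawl": {"facts_found": 70, "latency_s": 0.99},
    "trafilatura": {"facts_found": 45, "latency_s": 0.21},
}


def fidelity_pct(facts_found: int, total: int = TOTAL_FACTS) -> float:
    """Share of curated facts found in the extracted output, in percent."""
    return round(100 * facts_found / total, 1)


def speedup(slower_s: float, faster_s: float) -> float:
    """How many times faster the second latency is than the first."""
    return round(slower_s / faster_s, 1)


fidelity = {tool: fidelity_pct(r["facts_found"]) for tool, r in results.items()}
webclaw_vs_firecrawl = speedup(
    results["firecrawl"]["latency_s"], results["webclaw"]["latency_s"]
)

print(fidelity)
print(webclaw_vs_firecrawl)
```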

Includes:
- README.md              — headline table + per-site breakdown
- methodology.md         — tokenizer, fact selection, run rationale
- sites.txt              — 18 canonical URLs
- facts.json             — 90 curated facts (PRs welcome to add sites)
- scripts/bench.py       — the runner
- results/2026-04-17.json — today's raw data, median of 3 runs
- run.sh                 — one-command reproduction
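
As a rough illustration of what "median of 3 runs" can mean, the sketch below collapses repeated measurements into one value per metric. It is an assumption about the aggregation, not the code in `scripts/bench.py`; the function name `median_of_runs`, the metric key, and the sample latencies are all hypothetical.

```python
from statistics import median


def median_of_runs(runs: list[dict[str, float]]) -> dict[str, float]:
    """Collapse repeated runs into one value per metric using the median."""
    metrics = runs[0].keys()
    return {m: median(run[m] for run in runs) for m in metrics}


# Hypothetical latencies for one tool on one site (not real benchmark data).
runs = [
    {"latency_s": 0.30},
    {"latency_s": 0.90},  # e.g. one slow run caused by a network hiccup
    {"latency_s": 0.50},
]
summary = median_of_runs(runs)
print(summary)
```

Taking the median rather than the mean keeps a single slow run from skewing the reported value.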

Closes #18

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:46:19 +02:00

23 lines
2.4 KiB
JSON

{
  "_comment": "Hand-curated 'visible facts' per site. Inspected from live pages on 2026-04-17. PRs welcome to add sites or adjust facts — keep facts specific (customer names, headline stats, product names), not generic words.",
  "facts": {
    "https://openai.com": ["ChatGPT", "Sora", "API", "Enterprise", "research"],
    "https://vercel.com": ["Next.js", "Hobby", "Pro", "Enterprise", "deploy"],
    "https://anthropic.com": ["Opus", "Claude", "Glasswing", "Perseverance", "NASA"],
    "https://www.notion.com": ["agents", "Forbes", "Figma", "Ramp", "Cursor"],
    "https://stripe.com": ["Hertz", "URBN", "Instacart", "99.999", "1.9"],
    "https://tavily.com": ["search", "extract", "crawl", "research", "developers"],
    "https://www.shopify.com": ["Plus", "merchants", "retail", "brands", "checkout"],
    "https://docs.python.org/3/": ["tutorial", "library", "reference", "setup", "distribution"],
    "https://react.dev": ["Components", "JSX", "Hooks", "Learn", "Reference"],
    "https://tailwindcss.com/docs/installation": ["Vite", "PostCSS", "CLI", "install", "Next.js"],
    "https://nextjs.org/docs": ["App Router", "Pages Router", "getting-started", "deploying", "Server"],
    "https://github.com": ["Copilot", "Actions", "millions", "developers", "Enterprise"],
    "https://en.wikipedia.org/wiki/Rust_(programming_language)": ["Graydon", "Mozilla", "borrow", "Cargo", "2015"],
    "https://simonwillison.net/2026/Mar/15/latent-reasoning/": ["latent", "reasoning", "Willison", "model", "Simon"],
    "https://paulgraham.com/essays.html": ["Graham", "essay", "startup", "Lisp", "founders"],
    "https://techcrunch.com": ["TechCrunch", "startup", "news", "events", "latest"],
    "https://www.databricks.com": ["Lakehouse", "platform", "data", "MLflow", "AI"],
    "https://www.hashicorp.com": ["Terraform", "Vault", "Consul", "infrastructure", "enterprise"]
  }
}
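
One plausible way to turn this file into a fidelity score is to check whether each fact string appears in a tool's extracted output. The sketch below assumes case-insensitive substring matching, which may differ from what `scripts/bench.py` actually does; `score_facts`, `fidelity`, and the sample extracted text are hypothetical, and only the two fact lists are copied from the file above.

```python
import json


def score_facts(extracted_text: str, facts: list[str]) -> list[str]:
    """Return the facts that occur in the text (case-insensitive substring match)."""
    haystack = extracted_text.casefold()
    return [fact for fact in facts if fact.casefold() in haystack]


def fidelity(outputs: dict[str, str], facts_by_site: dict[str, list[str]]) -> tuple[int, int]:
    """Count (facts found, facts expected) across all sites."""
    found = total = 0
    for url, facts in facts_by_site.items():
        found += len(score_facts(outputs.get(url, ""), facts))
        total += len(facts)
    return found, total


# Two entries copied from facts.json; a real run would load the whole file.
facts_by_site = json.loads("""
{
  "facts": {
    "https://react.dev": ["Components", "JSX", "Hooks", "Learn", "Reference"],
    "https://stripe.com": ["Hertz", "URBN", "Instacart", "99.999", "1.9"]
  }
}
""")["facts"]

# Hypothetical extractor output, invented for this example.
outputs = {
    "https://react.dev": "Learn React: Components, JSX and Hooks. See the API Reference.",
    "https://stripe.com": "Trusted by Hertz and Instacart. 99.999% uptime.",
}

found, total = fidelity(outputs, facts_by_site)
print(f"{found}/{total}")
```

Substring matching is cheap but loose: a short fact such as "1.9" would also match inside an unrelated number like "11.95", which is one reason the file's comment asks contributors to keep facts specific.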