docs(benchmarks): reproducible 3-way comparison vs trafilatura + firecrawl (#25)

Replaces the previous benchmarks/README.md, which claimed specific numbers (94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) with no reproducing code committed to the repo. The `webclaw-bench` crate and `benchmarks/fixtures`, `benchmarks/ground-truth` directories it referenced never existed. This is what #18 was calling out.

New benchmarks/ is fully reproducible. Every number ships with the script that produced it. `./benchmarks/run.sh` regenerates everything.

Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18, cl100k_base tokenizer):

    tool         reduction_mean  fidelity        latency_mean
    webclaw      92.5%           76/90 (84.4%)   0.41s
    firecrawl    92.4%           70/90 (77.8%)   0.99s
    trafilatura  97.8%           45/90 (50.0%)   0.21s

webclaw matches or beats both competitors on fidelity on all 18 sites while running 2.4x faster than Firecrawl's hosted API.

Includes:
- README.md: headline table + per-site breakdown
- methodology.md: tokenizer, fact selection, run rationale
- sites.txt: 18 canonical URLs
- facts.json: 90 curated facts (PRs welcome to add sites)
- scripts/bench.py: the runner
- results/2026-04-17.json: today's raw data, median of 3 runs
- run.sh: one-command reproduction

Closes #18

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-24 02:58:05 +02:00 · 2026-04-17 14:46:19 +02:00 · 2026-04-17 14:46:19 +02:00 · e27ee1f86f
commit e27ee1f86f
parent 0463b5e263
7 changed files with 934 additions and 118 deletions
--- a/benchmarks/methodology.md
+++ b/benchmarks/methodology.md
@@ -0,0 +1,142 @@
+# Methodology
+
+## What is measured
+
+Three metrics per site:
+
+1. **Token efficiency**: tokens in the extractor's output vs tokens in the
+   raw fetched HTML. Fewer tokens means cheaper LLM input, but a lower
+   count *only matters if the content is preserved*, so token counts are
+   always reported alongside fidelity.
+2. **Fidelity**: how many hand-curated "visible facts" the extractor
+   preserved. Per site we list 5 strings that any reader would agree are
+   meaningful page content (customer names, headline stats, product names,
+   release information). Matching is case-insensitive; a fact that is a
+   single alphanumeric token must also match on word boundaries (`API`
+   does not match `apiece`).
+3. **Latency**: wall-clock time from URL submission to markdown output,
+   including fetch and extraction. Network-dependent, so reported as the
+   median of 3 runs.
+
+## Tokenizer
+
+`cl100k_base` via OpenAI's `tiktoken` package. This is the encoding used by
+GPT-4, GPT-3.5-turbo, and `text-embedding-3-*`, which are the models most
+users feed extracted web content into. It is pinned in `scripts/bench.py`.
+
+## Tool versions
+
+Listed at the top of each run's `results/YYYY-MM-DD.json` file. The run
+published at launch used:
+
+- `webclaw 0.3.18` (release build, default options, `--format llm`)
+- `trafilatura 2.0.0` (`extract(html, output_format="markdown",
+  include_links=True, include_tables=True, favor_recall=True)`)
+- `firecrawl-py 4.x` against Firecrawl's hosted `v2` API
+  (`scrape(url, formats=["markdown"])`)
+
+## Fact selection
+
+Facts for each site were chosen by manual inspection of the live page in a
+browser on 2026-04-17. Selection criteria:
+
+- must be **visibly present** (not in `<head>`, `<script>`, or hidden
+  sections)
+- must be **specific** — customer names, headline stats, product names,
+  release dates. Not generic words like "the", "platform", "we".
+- must be **stable across multiple loads** (no A/B-tested copy, no random
+  customer rotations)
+- 5 facts per site, documented in `facts.json`
+
+Facts are committed as data, not code, so **new facts can be proposed via
+pull request**. Any addition runs against all three tools automatically.
+
+Known limitation: sites change. News aggregators, release pages, and
+blog indexes drift. If a fact disappears because the page changed (not
+because the extractor dropped it), we expect all three tools to miss it
+together, which makes it visible as "all tools tied on this site" in the
+per-site breakdown. Facts on churning pages are refreshed on each published
+run.
+
+## Why median of 3 runs
+
+Single-run numbers are noisy:
+
+- **Latency** varies ±30% from run to run due to network jitter, CDN cache
+  state, and the remote server's own load.
+- **Raw-HTML token count** can vary if the server renders different content
+  per request (A/B tests, geo-IP, session state).
+- **Tool-specific flakiness** exists at the long tail. The occasional
+  Firecrawl 502 or trafilatura fetch failure would otherwise distort a
+  single-run benchmark.
+
+We run each site 3 times and take the median per metric. The published
+number is the 50th percentile; the full run data (min / median / max)
+is preserved in `results/YYYY-MM-DD.json`.
+
+## Fair comparison notes
+
+- **Each tool fetches via its own preferred path.** webclaw uses its
+  in-process primp HTTP client. Trafilatura uses `requests`. Firecrawl
+  fetches via its hosted infrastructure (Chrome CDP when needed). This is
+  the apples-to-apples developer-experience comparison: what you get when
+  you call each tool with a URL. The "vs raw HTML" column uses webclaw's
+  `--raw-html` as the baseline denominator.
+- **Firecrawl's default engine picker** runs in "auto" mode with browser
+  rendering for sites it detects need it. No flags tuned, no URLs
+  cherry-picked.
+- **No retries**, no fallbacks, no post-processing on top of any tool's
+  output. If a tool returns `""` or errors, that is the measured result
+  for that run. The median of 3 runs absorbs transient errors; persistent
+  extraction failures (e.g. trafilatura on `simonwillison.net`, which
+  returned `""` on all 3 runs) show up as 0 tokens and 0 facts.
+
+## Raw data schema
+
+`results/YYYY-MM-DD.json`:
+
+```json
+{
+  "timestamp": "2026-04-17 ...",
+  "webclaw_version": "0.3.18",
+  "trafilatura_version": "2.0.0",
+  "tokenizer": "cl100k_base",
+  "runs_per_site": 3,
+  "site_count": 18,
+  "total_facts": 90,
+  "aggregates": {
+    "webclaw":     { "reduction_mean": 92.5, "fidelity_pct": 84.4, ... },
+    "trafilatura": { "reduction_mean": 97.8, "fidelity_pct": 50.0, ... },
+    "firecrawl":   { "reduction_mean": 92.4, "fidelity_pct": 77.8, ... }
+  },
+  "per_site": [
+    {
+      "url": "https://openai.com",
+      "facts_count": 5,
+      "raw_tokens": 170508,
+      "webclaw":     { "tokens_med": 1238, "facts_med": 3, "seconds_med": 0.49 },
+      "trafilatura": { "tokens_med": 0,    "facts_med": 0, "seconds_med": 0.17 },
+      "firecrawl":   { "tokens_med": 3139, "facts_med": 2, "seconds_med": 1.08 }
+    },
+    ...
+  ]
+}
+```
+
+## What's not here (roadmap)
+
+These measurements are intentionally out of scope for this initial
+benchmark. Each deserves its own harness and its own run.
+
+- **n-gram content overlap**: v2 metric to replace curated-fact matching.
+  Measure: fraction of trigrams from the visually rendered page text that
+  appear in the extractor's output. Harder to curate, easier to scale.
+- **Competitors besides trafilatura / firecrawl**: Mozilla Readability,
+  Newspaper3k, Crawl4AI, Diffbot, Jina Reader. These need either JS ports
+  or subprocess wrapper runners. PRs welcome.
+- **Anti-bot / protected sites**: Cloudflare Turnstile, DataDome, AWS
+  WAF, hCaptcha. These require the Webclaw Cloud API with the antibot
+  sidecar, not the open-source CLI, and will be published separately on
+  the Webclaw landing page once the testing harness there is public.
+- **Crawl throughput**: pages-per-second under concurrent load. This is a
+  different axis from single-page extraction and lives in its own benchmark.
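
Note on the methodology in the diff above: the fidelity rule, the token-reduction figure, and the median-of-3 aggregation can be sketched in a few lines of Python. This is an illustrative sketch only, not the contents of `scripts/bench.py`, which is not shown on this page. The function names (`fact_present`, `fidelity`, `reduction_pct`) are hypothetical, and the reduction formula (1 minus output tokens over raw tokens) is an assumption inferred from the "vs raw HTML" description, not confirmed by the source.

```python
import re
from statistics import median


def fact_present(fact: str, text: str) -> bool:
    """Case-insensitive match of one curated fact against extractor output.

    Assumed rule, following methodology.md: a fact that is a single
    alphanumeric token must match on word boundaries; anything else
    (multi-word or punctuated facts) is a plain substring match.
    """
    if fact.isalnum():
        pattern = rf"\b{re.escape(fact)}\b"
        return re.search(pattern, text, re.IGNORECASE) is not None
    return fact.lower() in text.lower()


def fidelity(facts: list[str], text: str) -> int:
    """Number of curated facts preserved in the extractor output."""
    return sum(fact_present(fact, text) for fact in facts)


def reduction_pct(raw_tokens: int, output_tokens: int) -> float:
    """Token reduction relative to the raw HTML (assumed formula)."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    return 100.0 * (1.0 - output_tokens / raw_tokens)


# Word-boundary rule from the methodology: `API` does not match `apiece`.
assert fact_present("API", "Our api is fast")
assert not fact_present("API", "five dollars apiece")

# Median of 3 runs for one metric; the latencies here are made-up inputs.
latency_runs = [0.52, 0.41, 0.39]
latency_med = median(latency_runs)
```

Under this assumed formula, an extractor that returns an empty string scores a 100% reduction for that site, which is why reduction is only meaningful when read next to fidelity.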