docs(benchmarks): reproducible 3-way comparison vs trafilatura + firecrawl (#25)

Replaces the previous benchmarks/README.md, which claimed specific numbers
(94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) with no
reproducing code committed to the repo. The `webclaw-bench` crate and
`benchmarks/fixtures`, `benchmarks/ground-truth` directories it referenced
never existed. This is what #18 was calling out.

New benchmarks/ is fully reproducible. Every number ships with the script
that produced it. `./benchmarks/run.sh` regenerates everything.

Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18,
cl100k_base tokenizer):

  tool          reduction_mean   fidelity        latency_mean
  webclaw              92.5%    76/90 (84.4%)        0.41s
  firecrawl            92.4%    70/90 (77.8%)        0.99s
  trafilatura          97.8%    45/90 (50.0%)        0.21s

webclaw matches or beats both competitors on fidelity on all 18 sites
while running 2.4x faster than Firecrawl's hosted API.

Includes:
- README.md              — headline table + per-site breakdown
- methodology.md         — tokenizer, fact selection, run rationale
- sites.txt              — 18 canonical URLs
- facts.json             — 90 curated facts (PRs welcome to add sites)
- scripts/bench.py       — the runner
- results/2026-04-17.json — today's raw data, median of 3 runs
- run.sh                 — one-command reproduction

Closes #18

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Valerio 2026-04-17 14:46:19 +02:00 committed by GitHub
parent 0463b5e263
commit e27ee1f86f
7 changed files with 934 additions and 118 deletions

benchmarks/methodology.md

@@ -0,0 +1,142 @@
# Methodology
## What is measured
Three metrics per site:
1. **Token efficiency**: token count of the extractor's output relative to
the token count of the raw fetched HTML. Fewer tokens are cheaper to feed
into an LLM, but a lower count *only matters if the content is preserved*,
so token counts are always reported alongside fidelity.
2. **Fidelity**: how many hand-curated "visible facts" the extractor
preserved. Per site we list 5 strings that any reader would say are
meaningfully on the page (customer names, headline stats, product names,
release information). Matching is case-insensitive. When a fact is a
single alphanumeric token, it must also match on word boundaries (`API`
does not match `apiece`).
3. **Latency**: wall-clock time from URL submission to markdown output,
covering both fetch and extraction. It is network-dependent, so it is
reported as the median of 3 runs.
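
The fidelity matching rule can be sketched as follows. This is a minimal illustration, not the code in `scripts/bench.py`: the function names `fact_present` and `count_facts` are hypothetical, and the substring fallback for multi-word facts is an assumption, since the text above only specifies the word-boundary behaviour for single tokens.

```python
import re


def fact_present(fact: str, text: str) -> bool:
    """Return True if `fact` appears in `text`, ignoring case.

    Single alphanumeric tokens are matched on word boundaries, so "API"
    does not match "apiece". Anything else (phrases, facts containing
    punctuation) falls back to a plain substring test (an assumption).
    """
    if fact.isalnum():
        pattern = r"\b" + re.escape(fact) + r"\b"
        return re.search(pattern, text, flags=re.IGNORECASE) is not None
    return fact.lower() in text.lower()


def count_facts(facts: list[str], text: str) -> int:
    """Number of curated facts preserved in the extractor's output."""
    return sum(fact_present(fact, text) for fact in facts)
```

With 5 facts per site, `count_facts` would return a value between 0 and 5 for each tool's output.
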
## Tokenizer
`cl100k_base` via OpenAI's `tiktoken` Python package. This is the encoding
used by GPT-4, GPT-3.5-turbo, and `text-embedding-3-*`, the models most
users plug extracted web content into. Pinned in `scripts/bench.py`.
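
As a rough sketch, a per-site reduction figure can be derived from two token counts. The formula below (one minus output tokens over raw tokens) is an assumption about how the reduction percentage is defined, and `reduction_pct` is a hypothetical helper name. The counts themselves would come from the pinned encoding, as noted in the comment.

```python
def reduction_pct(raw_tokens: int, output_tokens: int) -> float:
    """Percentage of raw-HTML tokens removed by the extractor.

    Assumed definition: 100 * (1 - output_tokens / raw_tokens).
    Token counts are expected to come from the pinned encoding, e.g.
        enc = tiktoken.get_encoding("cl100k_base")
        n_tokens = len(enc.encode(text))
    """
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    return 100.0 * (1.0 - output_tokens / raw_tokens)
```

Under this definition an empty output scores a 100% reduction, which is why the reduction figure is only meaningful next to fidelity.
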
## Tool versions
Listed at the top of each run's `results/YYYY-MM-DD.json` file. The run
published at launch used:
- `webclaw 0.3.18` (release build, default options, `--format llm`)
- `trafilatura 2.0.0` (`extract(html, output_format="markdown",
include_links=True, include_tables=True, favor_recall=True)`)
- `firecrawl-py 4.x` against Firecrawl's hosted `v2` API
(`scrape(url, formats=["markdown"])`)
## Fact selection
Facts for each site were chosen by manual inspection of the live page in a
browser on 2026-04-17. Selection criteria:
- must be **visibly present** (not in `<head>`, `<script>`, or hidden
sections)
- must be **specific** — customer names, headline stats, product names,
release dates. Not generic words like "the", "platform", "we".
- must be **stable across multiple loads** (no A/B-tested copy, no random
customer rotations)
- 5 facts per site, documented in `facts.json`
Facts are committed as data, not code, so **new facts can be proposed via
pull request**. Any addition runs against all three tools automatically.
Known limitation: sites change. News aggregators, release pages, and
blog indexes drift. If a fact disappears because the page changed (not
because the extractor dropped it), we expect all three tools to miss it
together, which makes it visible as "all tools tied on this site" in the
per-site breakdown. Facts on churning pages are refreshed on each published
run.
## Why median of 3 runs
Single-run numbers are noisy:
- **Latency** varies ±30% from run to run due to network jitter, CDN cache
state, and the remote server's own load.
- **Raw-HTML token count** can vary if the server renders different content
per request (A/B tests, geo-IP, session state).
- **Tool-specific flakiness** exists at the long tail. The occasional
Firecrawl 502 or trafilatura fetch failure would otherwise distort a
single-run benchmark.
We run each site 3 times and take the median per metric. The published
number is the 50th percentile; the full run data (min / median / max)
is preserved in `results/YYYY-MM-DD.json`.
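
A minimal sketch of the per-metric summary, assuming each metric is collected as a plain list with one value per run. The helper name `summarise_runs` and the output keys are hypothetical, not the actual field names used by the runner.

```python
from statistics import median


def summarise_runs(values: list[float]) -> dict[str, float]:
    """Collapse repeated measurements of one metric into min / median / max."""
    if not values:
        raise ValueError("need at least one run")
    return {"min": min(values), "median": median(values), "max": max(values)}
```

With 3 runs, the median is simply the middle value, so a single slow or failed run does not move the published number.
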
## Fair comparison notes
- **Each tool fetches via its own preferred path.** webclaw uses its
in-process primp HTTP client. Trafilatura uses `requests`. Firecrawl
fetches via its hosted infrastructure (Chrome CDP when needed). This is
the apples-to-apples developer-experience comparison: what you get when
you call each tool with a URL. The "vs raw HTML" column uses webclaw's
`--raw-html` as the baseline denominator.
- **Firecrawl's default engine picker** runs in "auto" mode with browser
rendering for sites it detects need it. No flags tuned, no URLs
cherry-picked.
- **No retries**, no fallbacks, no post-processing on top of any tool's
output. If a tool returns `""` or errors, that is the measured result
for that run. The median of 3 runs absorbs transient errors; persistent
extraction failures (e.g. trafilatura on `simonwillison.net`, which
returned `""` on all 3 runs) show up as 0 tokens and 0 facts.
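
The no-retry rule could look like the sketch below. It assumes each tool is wrapped as a callable that takes a URL and returns markdown; `measure_once` and its return keys are hypothetical names used only for illustration.

```python
import time
from typing import Callable


def measure_once(extract: Callable[[str], str], url: str) -> dict:
    """Run one extractor call with no retry; failures count as empty output."""
    start = time.perf_counter()
    try:
        markdown = extract(url) or ""
    except Exception:
        # An error is the measured result for this run, not a reason to retry.
        markdown = ""
    seconds = time.perf_counter() - start
    return {"markdown": markdown, "seconds": seconds, "failed": markdown == ""}
```

An empty `markdown` then flows through token counting and fact matching unchanged, yielding 0 tokens and 0 facts for that run.
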
## Raw data schema
`results/YYYY-MM-DD.json`:
```json
{
  "timestamp": "2026-04-17 ...",
  "webclaw_version": "0.3.18",
  "trafilatura_version": "2.0.0",
  "tokenizer": "cl100k_base",
  "runs_per_site": 3,
  "site_count": 18,
  "total_facts": 90,
  "aggregates": {
    "webclaw": { "reduction_mean": 92.5, "fidelity_pct": 84.4, ... },
    "trafilatura": { "reduction_mean": 97.8, "fidelity_pct": 50.0, ... },
    "firecrawl": { "reduction_mean": 92.4, "fidelity_pct": 77.8, ... }
  },
  "per_site": [
    {
      "url": "https://openai.com",
      "facts_count": 5,
      "raw_tokens": 170508,
      "webclaw": { "tokens_med": 1238, "facts_med": 3, "seconds_med": 0.49 },
      "trafilatura": { "tokens_med": 0, "facts_med": 0, "seconds_med": 0.17 },
      "firecrawl": { "tokens_med": 3139, "facts_med": 2, "seconds_med": 1.08 }
    },
    ...
  ]
}
```
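
Given a parsed results file with the fields shown above, the headline figures for one tool could be recomputed from the `per_site` entries roughly as follows. This is a sketch under two assumptions that the schema does not confirm: that `reduction_mean` is the unweighted mean of per-site reductions, and that `fidelity_pct` pools facts across all sites. The function name `aggregate` is hypothetical.

```python
def aggregate(results: dict, tool: str) -> dict:
    """Recompute headline figures for one tool from the per-site entries."""
    sites = results["per_site"]
    # Assumed: per-site reduction is 100 * (1 - tool tokens / raw tokens).
    reductions = [
        100.0 * (1.0 - site[tool]["tokens_med"] / site["raw_tokens"])
        for site in sites
    ]
    facts_kept = sum(site[tool]["facts_med"] for site in sites)
    facts_total = sum(site["facts_count"] for site in sites)
    return {
        "reduction_mean": sum(reductions) / len(reductions),
        "fidelity_pct": 100.0 * facts_kept / facts_total,
    }
```

In practice `results` would come from `json.load` on a `results/YYYY-MM-DD.json` file, and comparing the output against the stored `aggregates` block is a cheap consistency check.
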
## What's not here (roadmap)
These measurements are intentionally out of scope for this initial
benchmark. Each deserves its own harness and its own run.
- **n-gram content overlap** — v2 metric to replace curated-fact matching.
Measure: fraction of trigrams from the visually-rendered page text that
appear in the extractor's output. Harder to curate, easier to scale.
- **Competitors besides trafilatura / firecrawl** — Mozilla Readability,
Newspaper3k, Crawl4AI, Diffbot, Jina Reader. Require either JS ports or
wrapper subprocess runners. PRs welcome.
- **Anti-bot / protected sites** — Cloudflare Turnstile, DataDome, AWS
WAF, hCaptcha. These require the Webclaw Cloud API with the antibot
sidecar, not the open-source CLI, and will be published separately on
the Webclaw landing page once the testing harness there is public.
- **Crawl throughput** — pages-per-second under concurrent load. Different
axis from single-page extraction; lives in its own benchmark.
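
For the n-gram content overlap item listed above, one possible shape of the metric is sketched here. It assumes word-level trigrams over lowercased text and a simple `\w+` tokenisation. The roadmap entry does not fix those details, so both choices and the function names are illustrative only.

```python
import re


def trigrams(text: str) -> set[tuple[str, str, str]]:
    """Set of consecutive word triples in `text`, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return set(zip(words, words[1:], words[2:]))


def trigram_recall(page_text: str, extracted: str) -> float:
    """Fraction of the page's word trigrams that also occur in the output."""
    reference = trigrams(page_text)
    if not reference:
        return 0.0
    return len(reference & trigrams(extracted)) / len(reference)
```

Here `page_text` stands for the visually rendered page text, which would need its own capture step (for example a headless browser), and that step is out of scope for this sketch.
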