Replaces the previous benchmarks/README.md, which claimed specific numbers (94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) with no reproducing code committed to the repo. The `webclaw-bench` crate and the `benchmarks/fixtures` and `benchmarks/ground-truth` directories it referenced never existed. This is what #18 was calling out.

The new benchmarks/ is fully reproducible. Every number ships with the script that produced it; `./benchmarks/run.sh` regenerates everything.

Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18, cl100k_base tokenizer):

| tool        | reduction_mean | fidelity      | latency_mean |
|-------------|----------------|---------------|--------------|
| webclaw     | 92.5%          | 76/90 (84.4%) | 0.41s        |
| firecrawl   | 92.4%          | 70/90 (77.8%) | 0.99s        |
| trafilatura | 97.8%          | 45/90 (50.0%) | 0.21s        |

webclaw matches or beats both competitors on fidelity on all 18 sites while running 2.4x faster than Firecrawl's hosted API.

Includes:

- README.md — headline table + per-site breakdown
- methodology.md — tokenizer, fact selection, run rationale
- sites.txt — 18 canonical URLs
- facts.json — 90 curated facts (PRs welcome to add sites)
- scripts/bench.py — the runner
- results/2026-04-17.json — today's raw data, median of 3 runs
- run.sh — one-command reproduction

Closes #18

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Methodology

## What is measured
Three metrics per site:
- Token efficiency — tokens of the extractor's output vs tokens of the raw fetched HTML. Lower tokens = cheaper to feed into an LLM. But lower tokens only matters if the content is preserved, so tokens are always reported alongside fidelity.
- Fidelity — how many hand-curated "visible facts" the extractor
preserved. Per site we list 5 strings that any reader would say are
meaningfully on the page (customer names, headline stats, product names,
release information). Matched case-insensitively with word boundaries
where the fact is a single alphanumeric token (`API` does not match
`apiece`).
- Latency — wall-clock time from URL submission to markdown output. Includes fetch + extraction. Network-dependent, so reported as the median of 3 runs.
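The matching rule for fidelity can be sketched as a small helper (the function name `fact_present` is illustrative, not necessarily what scripts/bench.py calls it):

```python
import re

def fact_present(fact: str, output: str) -> bool:
    """Check whether a curated fact survives in an extractor's output.

    A single alphanumeric token is matched case-insensitively with word
    boundaries, so "API" does not match inside "apiece". Multi-word or
    punctuated facts fall back to a plain case-insensitive substring check.
    """
    if re.fullmatch(r"[A-Za-z0-9]+", fact):
        return re.search(rf"\b{re.escape(fact)}\b", output, re.IGNORECASE) is not None
    return fact.lower() in output.lower()
```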
## Tokenizer
cl100k_base via OpenAI's tiktoken library. This is the encoding used by
GPT-4, GPT-3.5-turbo, and text-embedding-3-* — the models most users plug
extracted web content into. Pinned in `scripts/bench.py`.
## Tool versions
Listed at the top of each run's results/YYYY-MM-DD.json file. The run
published at launch used:
- webclaw 0.3.18 (release build, default options, `--format llm`)
- trafilatura 2.0.0 (`extract(html, output_format="markdown", include_links=True, include_tables=True, favor_recall=True)`)
- firecrawl-py 4.x against Firecrawl's hosted v2 API (`scrape(url, formats=["markdown"])`)
## Fact selection
Facts for each site were chosen by manual inspection of the live page in a browser on 2026-04-17. Selection criteria:
- must be visibly present (not in `<head>`, `<script>`, or hidden sections)
- must be specific — customer names, headline stats, product names, release dates. Not generic words like "the", "platform", "we".
- must be stable across multiple loads (no AB-tested copy, no random customer rotations)
- 5 facts per site, documented in `facts.json`
Facts are committed as data, not code, so new facts can be proposed via pull request. Any addition runs against all three tools automatically.
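The exact schema of `facts.json` isn't reproduced in this document; a plausible shape for one entry, with illustrative placeholder strings rather than real curated facts, would be:

```json
{
  "url": "https://openai.com",
  "facts": [
    "a customer name visible on the page",
    "a headline stat visible on the page",
    "a product name",
    "a release date",
    "another specific visible string"
  ]
}
```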
Known limitation: sites change. News aggregators, release pages, and blog indexes drift. If a fact disappears because the page changed (not because the extractor dropped it), we expect all three tools to miss it together, which makes it visible as "all tools tied on this site" in the per-site breakdown. Facts on churning pages are refreshed on each published run.
## Why median of 3 runs
Single-run numbers are noisy:
- Latency varies ±30% from run to run due to network jitter, CDN cache state, and the remote server's own load.
- Raw-HTML token count can vary if the server renders different content per request (A/B tests, geo-IP, session state).
- Tool-specific flakiness exists at the long tail. The occasional Firecrawl 502 or trafilatura fetch failure would otherwise distort a single-run benchmark.
We run each site 3 times and take the median per metric. The published
number is the 50th percentile; the full run data (min / median / max)
is preserved in `results/YYYY-MM-DD.json`.
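Collapsing 3 runs into the published min / median / max triple needs nothing beyond the standard library, e.g.:

```python
from statistics import median

def summarize(values: list[float]) -> dict:
    """Collapse the per-metric values of 3 runs into min / median / max.

    The "median" field is what the headline tables report; min and max
    are preserved in the raw results file.
    """
    return {"min": min(values), "median": median(values), "max": max(values)}
```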
## Fair comparison notes
- Each tool fetches via its own preferred path. webclaw uses its in-process `primp` HTTP client. Trafilatura uses `requests`. Firecrawl fetches via its hosted infrastructure (Chrome CDP when needed). This is the apples-to-apples developer-experience comparison: what you get when you call each tool with a URL. The "vs raw HTML" column uses webclaw's `--raw-html` as the baseline denominator.
- Firecrawl's default engine picker runs in "auto" mode with browser rendering for sites it detects need it. No flags tuned, no URLs cherry-picked.
- No retries, no fallbacks, no post-processing on top of any tool's output. If a tool returns `""` or errors, that is the measured result for that run. The median of 3 runs absorbs transient errors; persistent extraction failures (e.g. trafilatura on `simonwillison.net`, which returned `""` on all 3 runs) show up as 0 tokens and 0 facts.
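One way to implement the no-retries rule is to score an erroring or empty run as zero and let the median absorb one-off failures. A sketch, where `extract` stands in for any of the three tools and character length stands in for the real token count:

```python
from statistics import median
from typing import Callable

def measure_runs(extract: Callable[[str], str], url: str, runs: int = 3) -> int:
    """Score each run; an exception or empty output counts as 0 for that run.

    No retries: the median of `runs` values absorbs a single transient
    failure, while a persistent failure yields a median of 0.
    """
    scores = []
    for _ in range(runs):
        try:
            out = extract(url)
        except Exception:
            out = ""
        scores.append(len(out))  # the real benchmark tokenizes here
    return median(scores)
```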
## Raw data schema

`results/YYYY-MM-DD.json`:
```json
{
  "timestamp": "2026-04-17 ...",
  "webclaw_version": "0.3.18",
  "trafilatura_version": "2.0.0",
  "tokenizer": "cl100k_base",
  "runs_per_site": 3,
  "site_count": 18,
  "total_facts": 90,
  "aggregates": {
    "webclaw": { "reduction_mean": 92.5, "fidelity_pct": 84.4, ... },
    "trafilatura": { "reduction_mean": 97.8, "fidelity_pct": 50.0, ... },
    "firecrawl": { "reduction_mean": 92.4, "fidelity_pct": 77.8, ... }
  },
  "per_site": [
    {
      "url": "https://openai.com",
      "facts_count": 5,
      "raw_tokens": 170508,
      "webclaw": { "tokens_med": 1238, "facts_med": 3, "seconds_med": 0.49 },
      "trafilatura": { "tokens_med": 0, "facts_med": 0, "seconds_med": 0.17 },
      "firecrawl": { "tokens_med": 3139, "facts_med": 2, "seconds_med": 1.08 }
    },
    ...
  ]
}
```
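The aggregates can be recomputed from `per_site`, which is one way to sanity-check a published run. A sketch using the field names from the schema above (the rounding convention is our assumption):

```python
def fidelity_pct(per_site: list[dict], tool: str, total_facts: int) -> float:
    """Percentage of curated facts the tool preserved across all sites."""
    preserved = sum(site[tool]["facts_med"] for site in per_site)
    return round(100 * preserved / total_facts, 1)
```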
## What's not here (roadmap)
These measurements are intentionally out of scope for this initial benchmark. Each deserves its own harness and its own run.
- n-gram content overlap — v2 metric to replace curated-fact matching. Measure: fraction of trigrams from the visually-rendered page text that appear in the extractor's output. Harder to curate, easier to scale.
- Competitors besides trafilatura / firecrawl — Mozilla Readability, Newspaper3k, Crawl4AI, Diffbot, Jina Reader. Require either JS ports or wrapper subprocess runners. PRs welcome.
- Anti-bot / protected sites — Cloudflare Turnstile, DataDome, AWS WAF, hCaptcha. These require the Webclaw Cloud API with the antibot sidecar, not the open-source CLI, and will be published separately on the Webclaw landing page once the testing harness there is public.
- Crawl throughput — pages-per-second under concurrent load. Different axis from single-page extraction; lives in its own benchmark.
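The trigram-overlap metric from the first roadmap item could be prototyped along these lines (a sketch on whitespace-split words; a real version would normalize and tokenize the visually rendered page text properly):

```python
def trigrams(text: str) -> set:
    """Word trigrams of a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def trigram_overlap(rendered: str, extracted: str) -> float:
    """Fraction of rendered-page trigrams that survive in extractor output."""
    ref = trigrams(rendered)
    if not ref:
        return 1.0  # nothing to preserve
    return len(ref & trigrams(extracted)) / len(ref)
```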