webclaw/benchmarks/methodology.md
Valerio e27ee1f86f
docs(benchmarks): reproducible 3-way comparison vs trafilatura + firecrawl (#25)
Replaces the previous benchmarks/README.md, which claimed specific numbers
(94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) with no
reproducing code committed to the repo. The `webclaw-bench` crate and
`benchmarks/fixtures`, `benchmarks/ground-truth` directories it referenced
never existed. This is what #18 was calling out.

New benchmarks/ is fully reproducible. Every number ships with the script
that produced it. `./benchmarks/run.sh` regenerates everything.

Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18,
cl100k_base tokenizer):

  tool          reduction_mean   fidelity        latency_mean
  webclaw              92.5%    76/90 (84.4%)        0.41s
  firecrawl            92.4%    70/90 (77.8%)        0.99s
  trafilatura          97.8%    45/90 (50.0%)        0.21s

webclaw matches or beats both competitors on fidelity on all 18 sites
while running 2.4x faster than Firecrawl's hosted API.

Includes:
- README.md              — headline table + per-site breakdown
- methodology.md         — tokenizer, fact selection, run rationale
- sites.txt              — 18 canonical URLs
- facts.json             — 90 curated facts (PRs welcome to add sites)
- scripts/bench.py       — the runner
- results/2026-04-17.json — today's raw data, median of 3 runs
- run.sh                 — one-command reproduction

Closes #18

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:46:19 +02:00


Methodology

What is measured

Three metrics per site:

  1. Token efficiency — tokens of the extractor's output versus tokens of the raw fetched HTML, reported as a percentage reduction. Fewer tokens are cheaper to feed into an LLM, but a low token count only matters if the content is preserved, so token counts are always reported alongside fidelity.
  2. Fidelity — how many hand-curated "visible facts" the extractor preserved. For each site we list 5 strings that any reader would agree are meaningfully on the page (customer names, headline stats, product names, release information). Facts are matched case-insensitively, with word boundaries when the fact is a single alphanumeric token ("API" does not match "apiece").
  3. Latency — wall-clock time from URL submission to markdown output, including both fetch and extraction. It is network-dependent, so it is reported as the median of 3 runs.
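The matching rule in (2) can be sketched with Python's `re` module. This is a sketch only — the actual logic lives in scripts/bench.py, and the helper name here is ours:

```python
import re

def fact_found(fact: str, output: str) -> bool:
    """Case-insensitive match. Word boundaries are added only when the
    fact is a single alphanumeric token, so "API" does not match
    "apiece"; multi-word facts match as plain substrings."""
    if fact.isalnum():
        pattern = rf"\b{re.escape(fact)}\b"
        return re.search(pattern, output, re.IGNORECASE) is not None
    return fact.lower() in output.lower()

print(fact_found("API", "the apiece widget"))   # False: boundary blocks it
print(fact_found("API", "our public API docs")) # True
```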

Tokenizer

cl100k_base via OpenAI's tiktoken library. This is the encoding used by GPT-4, GPT-3.5-turbo, and text-embedding-3-* — the models most users plug extracted web content into. Pinned in scripts/bench.py.
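Once both strings are encoded with cl100k_base, the reduction figure is ordinary arithmetic. A minimal sketch, assuming the token counts are already computed (the function name is ours, not from scripts/bench.py):

```python
def reduction_pct(raw_tokens: int, output_tokens: int) -> float:
    """Percentage of raw-HTML tokens removed by the extractor."""
    return round(100.0 * (1 - output_tokens / raw_tokens), 1)

# e.g. an extractor that turns 10,000 raw-HTML tokens into 750
# tokens of markdown achieves a 92.5% reduction
print(reduction_pct(10_000, 750))  # 92.5
```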

Tool versions

Listed at the top of each run's results/YYYY-MM-DD.json file. The run published at launch used:

  • webclaw 0.3.18 (release build, default options, --format llm)
  • trafilatura 2.0.0 (extract(html, output_format="markdown", include_links=True, include_tables=True, favor_recall=True))
  • firecrawl-py 4.x against Firecrawl's hosted v2 API (scrape(url, formats=["markdown"]))

Fact selection

Facts for each site were chosen by manual inspection of the live page in a browser on 2026-04-17. Selection criteria:

  • must be visibly present (not in <head>, <script>, or hidden sections)
  • must be specific — customer names, headline stats, product names, release dates. Not generic words like "the", "platform", "we".
  • must be stable across multiple loads (no AB-tested copy, no random customer rotations)
  • 5 facts per site, documented in facts.json

Facts are committed as data, not code, so new facts can be proposed via pull request. Any addition runs against all three tools automatically.
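As an illustration of the data-not-code point, a facts.json entry could take a shape like the following. This is hypothetical — the URL and fact strings are invented, and the real schema is whatever the committed facts.json defines:

```json
{
  "https://example.com": [
    "Acme Corp",
    "99.99% uptime",
    "v2.3 released March 2026"
  ]
}
```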

Known limitation: sites change. News aggregators, release pages, and blog indexes drift. If a fact disappears because the page changed (not because the extractor dropped it), we expect all three tools to miss it together, which makes it visible as "all tools tied on this site" in the per-site breakdown. Facts on churning pages are refreshed on each published run.

Why median of 3 runs

Single-run numbers are noisy:

  • Latency varies ±30% from run to run due to network jitter, CDN cache state, and the remote server's own load.
  • Raw-HTML token count can vary if the server renders different content per request (A/B tests, geo-IP, session state).
  • Tool-specific flakiness exists at the long tail. The occasional Firecrawl 502 or trafilatura fetch failure would otherwise distort a single-run benchmark.

We run each site 3 times and take the median of each metric independently. The published number is the 50th percentile; the full run data (min / median / max) is preserved in results/YYYY-MM-DD.json.
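Taking the median per metric, rather than picking one "best run", can be sketched with the standard library (the numbers below are illustrative):

```python
from statistics import median

runs = [
    {"tokens": 1210, "facts": 3, "seconds": 0.38},
    {"tokens": 1238, "facts": 3, "seconds": 0.49},
    {"tokens": 1238, "facts": 3, "seconds": 4.10},  # transient slow fetch
]

# The median is taken independently for each metric, so one slow run
# (or one transient error) does not skew the published number.
summary = {k: median(r[k] for r in runs) for k in ("tokens", "facts", "seconds")}
print(summary)  # {'tokens': 1238, 'facts': 3, 'seconds': 0.49}
```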

Fair comparison notes

  • Each tool fetches via its own preferred path. webclaw uses its in-process primp HTTP client. Trafilatura uses requests. Firecrawl fetches via its hosted infrastructure (Chrome CDP when needed). This is the apples-to-apples developer-experience comparison: what you get when you call each tool with a URL. The "vs raw HTML" column uses webclaw's --raw-html as the baseline denominator.
  • Firecrawl's default engine picker runs in "auto" mode with browser rendering for sites it detects need it. No flags tuned, no URLs cherry-picked.
  • No retries, no fallbacks, no post-processing on top of any tool's output. If a tool returns "" or errors, that is the measured result for that run. The median of 3 runs absorbs transient errors; persistent extraction failures (e.g. trafilatura on simonwillison.net, which returned "" on all 3 runs) show up as 0 tokens and 0 facts.

Raw data schema

results/YYYY-MM-DD.json:

{
  "timestamp": "2026-04-17 ...",
  "webclaw_version": "0.3.18",
  "trafilatura_version": "2.0.0",
  "tokenizer": "cl100k_base",
  "runs_per_site": 3,
  "site_count": 18,
  "total_facts": 90,
  "aggregates": {
    "webclaw":     { "reduction_mean": 92.5, "fidelity_pct": 84.4, ... },
    "trafilatura": { "reduction_mean": 97.8, "fidelity_pct": 50.0, ... },
    "firecrawl":   { "reduction_mean": 92.4, "fidelity_pct": 77.8, ... }
  },
  "per_site": [
    {
      "url": "https://openai.com",
      "facts_count": 5,
      "raw_tokens": 170508,
      "webclaw":     { "tokens_med": 1238, "facts_med": 3, "seconds_med": 0.49 },
      "trafilatura": { "tokens_med": 0,    "facts_med": 0, "seconds_med": 0.17 },
      "firecrawl":   { "tokens_med": 3139, "facts_med": 2, "seconds_med": 1.08 }
    },
    ...
  ]
}
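Aggregates can be recomputed from the per-site rows. A sketch against the schema above — field names follow the example, and the inline data is a reduced stand-in for the real file:

```python
import json

data = json.loads("""{
  "total_facts": 10,
  "per_site": [
    {"webclaw": {"facts_med": 3}},
    {"webclaw": {"facts_med": 5}}
  ]
}""")

# fidelity_pct = facts preserved across all sites / total curated facts
found = sum(site["webclaw"]["facts_med"] for site in data["per_site"])
fidelity_pct = round(100.0 * found / data["total_facts"], 1)
print(fidelity_pct)  # 80.0
```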

What's not here (roadmap)

These measurements are intentionally out of scope for this initial benchmark. Each deserves its own harness and its own run.

  • n-gram content overlap — v2 metric to replace curated-fact matching. Measure: fraction of trigrams from the visually-rendered page text that appear in the extractor's output. Harder to curate, easier to scale.
  • Competitors besides trafilatura / firecrawl — Mozilla Readability, Newspaper3k, Crawl4AI, Diffbot, Jina Reader. Require either JS ports or wrapper subprocess runners. PRs welcome.
  • Anti-bot / protected sites — Cloudflare Turnstile, DataDome, AWS WAF, hCaptcha. These require the Webclaw Cloud API with the antibot sidecar, not the open-source CLI, and will be published separately on the Webclaw landing page once the testing harness there is public.
  • Crawl throughput — pages-per-second under concurrent load. Different axis from single-page extraction; lives in its own benchmark.
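The trigram-overlap metric in the first bullet could be sketched as follows (word trigrams over whitespace-split text; a hypothetical helper, not the v2 implementation):

```python
def trigrams(text: str) -> set[tuple[str, str, str]]:
    """All consecutive word triples in the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def trigram_recall(rendered: str, extracted: str) -> float:
    """Fraction of the rendered page's word trigrams that survive
    into the extractor's output."""
    ref = trigrams(rendered)
    return len(ref & trigrams(extracted)) / len(ref) if ref else 0.0

print(trigram_recall("a b c d", "a b c"))  # 0.5
```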