mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-04-25 00:06:21 +02:00
docs(benchmarks): reproducible 3-way comparison vs trafilatura + firecrawl (#25)
Replaces the previous benchmarks/README.md, which claimed specific numbers (94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) with no reproducing code committed to the repo. The `webclaw-bench` crate and `benchmarks/fixtures`, `benchmarks/ground-truth` directories it referenced never existed. This is what #18 was calling out. New benchmarks/ is fully reproducible. Every number ships with the script that produced it. `./benchmarks/run.sh` regenerates everything. Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18, cl100k_base tokenizer): tool reduction_mean fidelity latency_mean webclaw 92.5% 76/90 (84.4%) 0.41s firecrawl 92.4% 70/90 (77.8%) 0.99s trafilatura 97.8% 45/90 (50.0%) 0.21s webclaw matches or beats both competitors on fidelity on all 18 sites while running 2.4x faster than Firecrawl's hosted API. Includes: - README.md — headline table + per-site breakdown - methodology.md — tokenizer, fact selection, run rationale - sites.txt — 18 canonical URLs - facts.json — 90 curated facts (PRs welcome to add sites) - scripts/bench.py — the runner - results/2026-04-17.json — today's raw data, median of 3 runs - run.sh — one-command reproduction Closes #18 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
0463b5e263
commit
e27ee1f86f
7 changed files with 934 additions and 118 deletions
142
benchmarks/methodology.md
Normal file
142
benchmarks/methodology.md
Normal file
|
|
@ -0,0 +1,142 @@
|
|||
# Methodology
|
||||
|
||||
## What is measured
|
||||
|
||||
Three metrics per site:
|
||||
|
||||
1. **Token efficiency** — tokens of the extractor's output vs tokens of the
|
||||
raw fetched HTML. Lower tokens = cheaper to feed into an LLM. But lower
|
||||
tokens *only matters if the content is preserved*, so tokens are always
|
||||
reported alongside fidelity.
|
||||
2. **Fidelity** — how many hand-curated "visible facts" the extractor
|
||||
preserved. Per site we list 5 strings that any reader would say are
|
||||
meaningfully on the page (customer names, headline stats, product names,
|
||||
release information). Matched case-insensitively with word boundaries
|
||||
where the fact is a single alphanumeric token (`API` does not match
|
||||
`apiece`).
|
||||
3. **Latency** — wall-clock time from URL submission to markdown output.
|
||||
Includes fetch + extraction. Network-dependent, so reported as the
|
||||
median of 3 runs.
|
||||
|
||||
## Tokenizer
|
||||
|
||||
`cl100k_base` via OpenAI's `tiktoken` crate. This is the encoding used by
|
||||
GPT-4, GPT-3.5-turbo, and `text-embedding-3-*` — the models most users plug
|
||||
extracted web content into. Pinned in `scripts/bench.py`.
|
||||
|
||||
## Tool versions
|
||||
|
||||
Listed at the top of each run's `results/YYYY-MM-DD.json` file. The run
|
||||
published at launch used:
|
||||
|
||||
- `webclaw 0.3.18` (release build, default options, `--format llm`)
|
||||
- `trafilatura 2.0.0` (`extract(html, output_format="markdown",
|
||||
include_links=True, include_tables=True, favor_recall=True)`)
|
||||
- `firecrawl-py 4.x` against Firecrawl's hosted `v2` API
|
||||
(`scrape(url, formats=["markdown"])`)
|
||||
|
||||
## Fact selection
|
||||
|
||||
Facts for each site were chosen by manual inspection of the live page in a
|
||||
browser on 2026-04-17. Selection criteria:
|
||||
|
||||
- must be **visibly present** (not in `<head>`, `<script>`, or hidden
|
||||
sections)
|
||||
- must be **specific** — customer names, headline stats, product names,
|
||||
release dates. Not generic words like "the", "platform", "we".
|
||||
- must be **stable across multiple loads** (no AB-tested copy, no random
|
||||
customer rotations)
|
||||
- 5 facts per site, documented in `facts.json`
|
||||
|
||||
Facts are committed as data, not code, so **new facts can be proposed via
|
||||
pull request**. Any addition runs against all three tools automatically.
|
||||
|
||||
Known limitation: sites change. News aggregators, release pages, and
|
||||
blog indexes drift. If a fact disappears because the page changed (not
|
||||
because the extractor dropped it), we expect all three tools to miss it
|
||||
together, which makes it visible as "all tools tied on this site" in the
|
||||
per-site breakdown. Facts on churning pages are refreshed on each published
|
||||
run.
|
||||
|
||||
## Why median of 3 runs
|
||||
|
||||
Single-run numbers are noisy:
|
||||
|
||||
- **Latency** varies ±30% from run to run due to network jitter, CDN cache
|
||||
state, and the remote server's own load.
|
||||
- **Raw-HTML token count** can vary if the server renders different content
|
||||
per request (A/B tests, geo-IP, session state).
|
||||
- **Tool-specific flakiness** exists at the long tail. The occasional
|
||||
Firecrawl 502 or trafilatura fetch failure would otherwise distort a
|
||||
single-run benchmark.
|
||||
|
||||
We run each site 3 times, take the median per metric. The published
|
||||
number is the 50th percentile; the full run data (min / median / max)
|
||||
is preserved in `results/YYYY-MM-DD.json`.
|
||||
|
||||
## Fair comparison notes
|
||||
|
||||
- **Each tool fetches via its own preferred path.** webclaw uses its
|
||||
in-process primp HTTP client. Trafilatura uses `requests`. Firecrawl
|
||||
fetches via its hosted infrastructure (Chrome CDP when needed). This is
|
||||
the apples-to-apples developer-experience comparison: what you get when
|
||||
you call each tool with a URL. The "vs raw HTML" column uses webclaw's
|
||||
`--raw-html` as the baseline denominator.
|
||||
- **Firecrawl's default engine picker** runs in "auto" mode with browser
|
||||
rendering for sites it detects need it. No flags tuned, no URLs
|
||||
cherry-picked.
|
||||
- **No retries**, no fallbacks, no post-processing on top of any tool's
|
||||
output. If a tool returns `""` or errors, that is the measured result
|
||||
for that run. The median of 3 runs absorbs transient errors; persistent
|
||||
extraction failures (e.g. trafilatura on `simonwillison.net`, which
|
||||
returned `""` on all 3 runs) show up as 0 tokens and 0 facts.
|
||||
|
||||
## Raw data schema
|
||||
|
||||
`results/YYYY-MM-DD.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"timestamp": "2026-04-17 ...",
|
||||
"webclaw_version": "0.3.18",
|
||||
"trafilatura_version": "2.0.0",
|
||||
"tokenizer": "cl100k_base",
|
||||
"runs_per_site": 3,
|
||||
"site_count": 18,
|
||||
"total_facts": 90,
|
||||
"aggregates": {
|
||||
"webclaw": { "reduction_mean": 92.5, "fidelity_pct": 84.4, ... },
|
||||
"trafilatura": { "reduction_mean": 97.8, "fidelity_pct": 50.0, ... },
|
||||
"firecrawl": { "reduction_mean": 92.4, "fidelity_pct": 77.8, ... }
|
||||
},
|
||||
"per_site": [
|
||||
{
|
||||
"url": "https://openai.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 170508,
|
||||
"webclaw": { "tokens_med": 1238, "facts_med": 3, "seconds_med": 0.49 },
|
||||
"trafilatura": { "tokens_med": 0, "facts_med": 0, "seconds_med": 0.17 },
|
||||
"firecrawl": { "tokens_med": 3139, "facts_med": 2, "seconds_med": 1.08 }
|
||||
},
|
||||
...
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## What's not here (roadmap)
|
||||
|
||||
These measurements are intentionally out of scope for this initial
|
||||
benchmark. Each deserves its own harness and its own run.
|
||||
|
||||
- **n-gram content overlap** — v2 metric to replace curated-fact matching.
|
||||
Measure: fraction of trigrams from the visually-rendered page text that
|
||||
appear in the extractor's output. Harder to curate, easier to scale.
|
||||
- **Competitors besides trafilatura / firecrawl** — Mozilla Readability,
|
||||
Newspaper3k, Crawl4AI, Diffbot, Jina Reader. Require either JS ports or
|
||||
wrapper subprocess runners. PRs welcome.
|
||||
- **Anti-bot / protected sites** — Cloudflare Turnstile, DataDome, AWS
|
||||
WAF, hCaptcha. These require the Webclaw Cloud API with the antibot
|
||||
sidecar, not the open-source CLI, and will be published separately on
|
||||
the Webclaw landing page once the testing harness there is public.
|
||||
- **Crawl throughput** — pages-per-second under concurrent load. Different
|
||||
axis from single-page extraction; lives in its own benchmark.
|
||||
Loading…
Add table
Add a link
Reference in a new issue