mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-04-25 00:06:21 +02:00
Replaces the previous benchmarks/README.md, which claimed specific numbers (94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) with no reproducing code committed to the repo. The `webclaw-bench` crate and `benchmarks/fixtures`, `benchmarks/ground-truth` directories it referenced never existed. This is what #18 was calling out. New benchmarks/ is fully reproducible. Every number ships with the script that produced it. `./benchmarks/run.sh` regenerates everything. Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18, cl100k_base tokenizer): tool reduction_mean fidelity latency_mean webclaw 92.5% 76/90 (84.4%) 0.41s firecrawl 92.4% 70/90 (77.8%) 0.99s trafilatura 97.8% 45/90 (50.0%) 0.21s webclaw matches or beats both competitors on fidelity on all 18 sites while running 2.4x faster than Firecrawl's hosted API. Includes: - README.md — headline table + per-site breakdown - methodology.md — tokenizer, fact selection, run rationale - sites.txt — 18 canonical URLs - facts.json — 90 curated facts (PRs welcome to add sites) - scripts/bench.py — the runner - results/2026-04-17.json — today's raw data, median of 3 runs - run.sh — one-command reproduction Closes #18 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
142 lines
5.8 KiB
Markdown
142 lines
5.8 KiB
Markdown
# Methodology
|
|
|
|
## What is measured
|
|
|
|
Three metrics per site:
|
|
|
|
1. **Token efficiency** — tokens of the extractor's output vs tokens of the
|
|
raw fetched HTML. Lower tokens = cheaper to feed into an LLM. But lower
|
|
tokens *only matters if the content is preserved*, so tokens are always
|
|
reported alongside fidelity.
|
|
2. **Fidelity** — how many hand-curated "visible facts" the extractor
|
|
preserved. Per site we list 5 strings that any reader would say are
|
|
meaningfully on the page (customer names, headline stats, product names,
|
|
release information). Matched case-insensitively with word boundaries
|
|
where the fact is a single alphanumeric token (`API` does not match
|
|
`apiece`).
|
|
3. **Latency** — wall-clock time from URL submission to markdown output.
|
|
Includes fetch + extraction. Network-dependent, so reported as the
|
|
median of 3 runs.
|
|
|
|
## Tokenizer
|
|
|
|
`cl100k_base` via OpenAI's `tiktoken` crate. This is the encoding used by
|
|
GPT-4, GPT-3.5-turbo, and `text-embedding-3-*` — the models most users plug
|
|
extracted web content into. Pinned in `scripts/bench.py`.
|
|
|
|
## Tool versions
|
|
|
|
Listed at the top of each run's `results/YYYY-MM-DD.json` file. The run
|
|
published at launch used:
|
|
|
|
- `webclaw 0.3.18` (release build, default options, `--format llm`)
|
|
- `trafilatura 2.0.0` (`extract(html, output_format="markdown",
|
|
include_links=True, include_tables=True, favor_recall=True)`)
|
|
- `firecrawl-py 4.x` against Firecrawl's hosted `v2` API
|
|
(`scrape(url, formats=["markdown"])`)
|
|
|
|
## Fact selection
|
|
|
|
Facts for each site were chosen by manual inspection of the live page in a
|
|
browser on 2026-04-17. Selection criteria:
|
|
|
|
- must be **visibly present** (not in `<head>`, `<script>`, or hidden
|
|
sections)
|
|
- must be **specific** — customer names, headline stats, product names,
|
|
release dates. Not generic words like "the", "platform", "we".
|
|
- must be **stable across multiple loads** (no AB-tested copy, no random
|
|
customer rotations)
|
|
- 5 facts per site, documented in `facts.json`
|
|
|
|
Facts are committed as data, not code, so **new facts can be proposed via
|
|
pull request**. Any addition runs against all three tools automatically.
|
|
|
|
Known limitation: sites change. News aggregators, release pages, and
|
|
blog indexes drift. If a fact disappears because the page changed (not
|
|
because the extractor dropped it), we expect all three tools to miss it
|
|
together, which makes it visible as "all tools tied on this site" in the
|
|
per-site breakdown. Facts on churning pages are refreshed on each published
|
|
run.
|
|
|
|
## Why median of 3 runs
|
|
|
|
Single-run numbers are noisy:
|
|
|
|
- **Latency** varies ±30% from run to run due to network jitter, CDN cache
|
|
state, and the remote server's own load.
|
|
- **Raw-HTML token count** can vary if the server renders different content
|
|
per request (A/B tests, geo-IP, session state).
|
|
- **Tool-specific flakiness** exists at the long tail. The occasional
|
|
Firecrawl 502 or trafilatura fetch failure would otherwise distort a
|
|
single-run benchmark.
|
|
|
|
We run each site 3 times, take the median per metric. The published
|
|
number is the 50th percentile; the full run data (min / median / max)
|
|
is preserved in `results/YYYY-MM-DD.json`.
|
|
|
|
## Fair comparison notes
|
|
|
|
- **Each tool fetches via its own preferred path.** webclaw uses its
|
|
in-process primp HTTP client. Trafilatura uses `requests`. Firecrawl
|
|
fetches via its hosted infrastructure (Chrome CDP when needed). This is
|
|
the apples-to-apples developer-experience comparison: what you get when
|
|
you call each tool with a URL. The "vs raw HTML" column uses webclaw's
|
|
`--raw-html` as the baseline denominator.
|
|
- **Firecrawl's default engine picker** runs in "auto" mode with browser
|
|
rendering for sites it detects need it. No flags tuned, no URLs
|
|
cherry-picked.
|
|
- **No retries**, no fallbacks, no post-processing on top of any tool's
|
|
output. If a tool returns `""` or errors, that is the measured result
|
|
for that run. The median of 3 runs absorbs transient errors; persistent
|
|
extraction failures (e.g. trafilatura on `simonwillison.net`, which
|
|
returned `""` on all 3 runs) show up as 0 tokens and 0 facts.
|
|
|
|
## Raw data schema
|
|
|
|
`results/YYYY-MM-DD.json`:
|
|
|
|
```json
|
|
{
|
|
"timestamp": "2026-04-17 ...",
|
|
"webclaw_version": "0.3.18",
|
|
"trafilatura_version": "2.0.0",
|
|
"tokenizer": "cl100k_base",
|
|
"runs_per_site": 3,
|
|
"site_count": 18,
|
|
"total_facts": 90,
|
|
"aggregates": {
|
|
"webclaw": { "reduction_mean": 92.5, "fidelity_pct": 84.4, ... },
|
|
"trafilatura": { "reduction_mean": 97.8, "fidelity_pct": 50.0, ... },
|
|
"firecrawl": { "reduction_mean": 92.4, "fidelity_pct": 77.8, ... }
|
|
},
|
|
"per_site": [
|
|
{
|
|
"url": "https://openai.com",
|
|
"facts_count": 5,
|
|
"raw_tokens": 170508,
|
|
"webclaw": { "tokens_med": 1238, "facts_med": 3, "seconds_med": 0.49 },
|
|
"trafilatura": { "tokens_med": 0, "facts_med": 0, "seconds_med": 0.17 },
|
|
"firecrawl": { "tokens_med": 3139, "facts_med": 2, "seconds_med": 1.08 }
|
|
},
|
|
...
|
|
]
|
|
}
|
|
```
|
|
|
|
## What's not here (roadmap)
|
|
|
|
These measurements are intentionally out of scope for this initial
|
|
benchmark. Each deserves its own harness and its own run.
|
|
|
|
- **n-gram content overlap** — v2 metric to replace curated-fact matching.
|
|
Measure: fraction of trigrams from the visually-rendered page text that
|
|
appear in the extractor's output. Harder to curate, easier to scale.
|
|
- **Competitors besides trafilatura / firecrawl** — Mozilla Readability,
|
|
Newspaper3k, Crawl4AI, Diffbot, Jina Reader. Require either JS ports or
|
|
wrapper subprocess runners. PRs welcome.
|
|
- **Anti-bot / protected sites** — Cloudflare Turnstile, DataDome, AWS
|
|
WAF, hCaptcha. These require the Webclaw Cloud API with the antibot
|
|
sidecar, not the open-source CLI, and will be published separately on
|
|
the Webclaw landing page once the testing harness there is public.
|
|
- **Crawl throughput** — pages-per-second under concurrent load. Different
|
|
axis from single-page extraction; lives in its own benchmark.
|