# Methodology

## What is measured

Three metrics per site (each sketched in code after the tool list below):

1. **Token efficiency** — tokens of the extractor's output vs tokens of the raw fetched HTML. Lower tokens = cheaper to feed into an LLM. But a lower count *only matters if the content is preserved*, so tokens are always reported alongside fidelity.
2. **Fidelity** — how many hand-curated "visible facts" the extractor preserved. Per site we list 5 strings that any reader would say are meaningfully on the page (customer names, headline stats, product names, release information). Matched case-insensitively, with word boundaries where the fact is a single alphanumeric token (`API` does not match `apiece`).
3. **Latency** — wall-clock time from URL submission to markdown output. Includes fetch + extraction. Network-dependent, so reported as the median of 3 runs.

## Tokenizer

`cl100k_base` via OpenAI's `tiktoken` library. This is the encoding used by GPT-4, GPT-3.5-turbo, and `text-embedding-3-*` — the models most users plug extracted web content into. Pinned in `scripts/bench.py`.

## Tool versions

Listed at the top of each run's `results/YYYY-MM-DD.json` file. The run published at launch used:

- `webclaw 0.3.18` (release build, default options, `--format llm`)
- `trafilatura 2.0.0` (`extract(html, output_format="markdown", include_links=True, include_tables=True, favor_recall=True)`)
- `firecrawl-py 4.x` against Firecrawl's hosted `v2` API (`scrape(url, formats=["markdown"])`)
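To make the metric definitions concrete, here is a minimal sketch of the token and latency computations. It is illustrative, not the actual `scripts/bench.py` code; the function names and the `run` callable are assumptions.

```python
import statistics
import time

import tiktoken

# The pinned encoding (see "Tokenizer" above).
ENC = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    # disallowed_special=() keeps tiktoken from raising if scraped
    # web text happens to contain special-token strings.
    return len(ENC.encode(text, disallowed_special=()))

def token_efficiency(markdown: str, raw_html: str) -> float:
    # Extractor output tokens as a fraction of raw fetched HTML tokens.
    return token_count(markdown) / token_count(raw_html)

def median_latency(run, url: str, runs: int = 3) -> float:
    # Wall-clock time from URL submission to markdown output,
    # fetch + extraction included, reported as the median of `runs` runs.
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run(url)  # any callable: URL in, markdown out
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```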
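The fidelity check follows the matching rule stated above: case-insensitive, with word boundaries enforced only when the fact is a single alphanumeric token. Treating every other fact as a plain substring match is an assumption; the rule above only pins down the single-token case.

```python
import re

def fact_preserved(fact: str, markdown: str) -> bool:
    if re.fullmatch(r"[0-9A-Za-z]+", fact):
        # Single alphanumeric token: require word boundaries,
        # so `API` does not match inside `apiece`.
        return re.search(rf"\b{re.escape(fact)}\b", markdown, re.IGNORECASE) is not None
    # Anything longer: plain case-insensitive substring check (assumed).
    return fact.lower() in markdown.lower()

def fidelity(facts: list[str], markdown: str) -> int:
    # How many of the site's 5 curated facts survived extraction.
    return sum(fact_preserved(f, markdown) for f in facts)
```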
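For reference, the `trafilatura` entry in the list above corresponds to a runner along these lines. The fetch step via `trafilatura.fetch_url` is an assumption; the extract options are the ones listed.

```python
import trafilatura

def run_trafilatura(url: str) -> str:
    # Fetch the page, then extract with the options listed above.
    html = trafilatura.fetch_url(url)
    return trafilatura.extract(
        html,
        output_format="markdown",
        include_links=True,
        include_tables=True,
        favor_recall=True,
    )
```

A callable of this shape is what `median_latency` in the sketch above would time end to end.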
## Fact selection

Facts for each site were chosen by manual inspection of the live page in a browser on 2026-04-17. Selection criteria:

- must be **visibly present** (not in `<meta>` tags or other markup the reader never sees)