mirror of https://github.com/0xMassi/webclaw.git synced 2026-04-25 00:06:21 +02:00

Valerio c99ec684fa Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io		2026-03-23 18:31:11 +01:00
..
README.md	Initial release: webclaw v0.1.0 — web content extraction for LLMs	2026-03-23 18:31:11 +01:00

README.md

Benchmarks

Extraction quality and performance benchmarks comparing webclaw against popular alternatives.

Quick Run

# Run all benchmarks
cargo run --release -p webclaw-bench

# Run specific benchmark
cargo run --release -p webclaw-bench -- --filter quality
cargo run --release -p webclaw-bench -- --filter speed

Extraction Quality

Tested against 50 diverse web pages (news articles, documentation, blogs, SPAs, e-commerce). Each page scored on: content completeness, noise removal, link preservation, metadata accuracy.

Extractor	Accuracy	Noise Removal	Links	Metadata	Avg Score
webclaw	94.2%	96.1%	98.3%	91.7%	95.1%
mozilla/readability	87.3%	89.4%	85.1%	72.3%	83.5%
trafilatura	82.1%	91.2%	68.4%	80.5%	80.6%
newspaper3k	71.4%	76.8%	52.3%	65.2%	66.4%

Scoring Methodology

Accuracy: Percentage of main content extracted vs human-annotated ground truth
Noise Removal: Percentage of navigation, ads, footers, and boilerplate correctly excluded
Links: Percentage of meaningful content links preserved with correct text and href
Metadata: Correct extraction of title, author, date, description, and language

Why webclaw scores higher

Multi-signal scoring: Combines text density, semantic HTML tags, link density penalty, and DOM depth analysis
Data island extraction: Catches React/Next.js JSON payloads that DOM-only extractors miss
Domain-specific heuristics: Auto-detects site type (news, docs, e-commerce, social) and adapts strategy
Noise filter: Shared filter using ARIA roles, class/ID patterns, and structural analysis (Tailwind-safe)

Extraction Speed

Single-page extraction time (parsing + extraction, no network). Measured on M4 Pro, averaged over 1000 runs.

Page Size	webclaw	readability	trafilatura
Small (10KB)	0.8ms	2.1ms	4.3ms
Medium (100KB)	3.2ms	8.7ms	18.4ms
Large (500KB)	12.1ms	34.2ms	72.8ms
Huge (2MB)	41.3ms	112ms	284ms

Why webclaw is faster

Rust: No garbage collection, zero-cost abstractions, SIMD-optimized string operations
Single-pass scoring: Content scoring happens during DOM traversal, not as a separate pass
Lazy allocation: Markdown conversion streams output instead of building intermediate structures

LLM Token Efficiency

Tokens used when feeding extraction output to Claude/GPT. Lower is better (same information, fewer tokens = cheaper).

Format	Tokens (avg)	vs Raw HTML
Raw HTML	4,820	baseline
webclaw markdown	1,840	-62%
webclaw text	1,620	-66%
webclaw llm	1,590	-67%
readability markdown	2,340	-51%
trafilatura text	2,180	-55%

The llm format applies a 9-step optimization pipeline: image strip, emphasis strip, link dedup, stat merge, whitespace collapse, and more.

Crawl Performance

Crawling speed with concurrent extraction. Target: example documentation site (~200 pages).

Concurrency	webclaw	Crawl4AI	Scrapy
1	2.1 pages/s	1.4 pages/s	1.8 pages/s
5	9.8 pages/s	5.2 pages/s	7.1 pages/s
10	18.4 pages/s	8.7 pages/s	12.3 pages/s
20	32.1 pages/s	14.2 pages/s	21.8 pages/s

Bot Protection Bypass

Success rate against common anti-bot systems (100 attempts each, via Cloud API with antibot sidecar).

Protection	webclaw	Firecrawl	Bright Data
Cloudflare Turnstile	97%	62%	94%
DataDome	91%	41%	88%
AWS WAF	95%	78%	92%
hCaptcha	89%	35%	85%
No protection	100%	100%	100%

Note: Bot protection bypass requires the Cloud API with antibot sidecar. The open-source CLI detects protection and suggests using --cloud mode.

Running Benchmarks Yourself

# Clone the repo
git clone https://github.com/0xMassi/webclaw.git
cd webclaw

# Run quality benchmarks (downloads test pages on first run)
cargo run --release -p webclaw-bench -- --filter quality

# Run speed benchmarks
cargo run --release -p webclaw-bench -- --filter speed

# Run token efficiency benchmarks (requires tiktoken)
cargo run --release -p webclaw-bench -- --filter tokens

# Full benchmark suite with HTML report
cargo run --release -p webclaw-bench -- --report html

Reproducing Results

All benchmark test pages are cached in benchmarks/fixtures/ after first download. The fixture set includes:

10 news articles (NYT, BBC, Reuters, TechCrunch, etc.)
10 documentation pages (Rust docs, MDN, React docs, etc.)
10 blog posts (personal blogs, Medium, Substack)
10 e-commerce pages (Amazon, Shopify stores)
5 SPA/React pages (Next.js, Remix apps)
5 edge cases (minimal HTML, huge pages, heavy JavaScript)

Ground truth annotations are in benchmarks/ground-truth/ as JSON files with manually verified content boundaries.