webclaw/crates/webclaw-fetch/src/lib.rs

//! webclaw-fetch: HTTP client layer with browser TLS fingerprint impersonation.
//! Uses wreq (BoringSSL) for browser-grade TLS + HTTP/2 fingerprinting.
//! Automatically detects PDF responses and delegates to webclaw-pdf.
pub mod browser;
pub mod client;
pub mod crawler;
pub mod document;
pub mod error;
pub mod extractors;
pub mod linkedin;
pub mod proxy;
pub mod reddit;
pub mod sitemap;
pub mod tls;

pub use browser::BrowserProfile;
pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult};
pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};
pub use error::FetchError;
pub use http::HeaderMap;
pub use proxy::{parse_proxy_file, parse_proxy_line};
pub use sitemap::SitemapEntry;
pub use webclaw_pdf::PdfMode;
feat: replace primp with webclaw-tls, bump to v0.3.0 Replace primp dependency with our own TLS fingerprinting stack (webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match. - Remove primp entirely (zero references remaining) - webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls - Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains) - Skip unknown certificate extensions (SCT tolerance) - 99% bypass rate on 102 sites (was ~85% with primp) - Fixes #5 (HTTPS broken — example.com and similar sites now work) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> 2026-03-29 16:40:10 +02:00			`//! webclaw-fetch: HTTP client layer with browser TLS fingerprint impersonation.`
feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3 Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest) to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles. This removes all 5 [patch.crates-io] entries that consumers previously needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145) are now built directly on wreq's Emulation API with correct TLS options, HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order. 84% pass rate across 1000 real sites. 384 unit tests green. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> 2026-04-01 18:04:55 +02:00			`//! Uses wreq (BoringSSL) for browser-grade TLS + HTTP/2 fingerprinting.`
feat: replace primp with webclaw-tls, bump to v0.3.0 Replace primp dependency with our own TLS fingerprinting stack (webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match. - Remove primp entirely (zero references remaining) - webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls - Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains) - Skip unknown certificate extensions (SCT tolerance) - 99% bypass rate on 102 sites (was ~85% with primp) - Fixes #5 (HTTPS broken — example.com and similar sites now work) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> 2026-03-29 16:40:10 +02:00			`//! Automatically detects PDF responses and delegates to webclaw-pdf.`
Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io 2026-03-23 18:31:11 +01:00			`pub mod browser;`
			`pub mod client;`
			`pub mod crawler;`
feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM Document extraction: - DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml) - XLSX/XLS: markdown tables with multi-sheet support (via calamine) - CSV: quoted field handling, markdown table output - All auto-detected by Content-Type header or URL extension New features: - -f html output format (sanitized HTML) - Multi-URL watch: --urls-file + --watch monitors all URLs in parallel - Batch + LLM: --extract-prompt/--extract-json works with multiple URLs - Mixed batch: HTML pages + DOCX + XLSX + CSV in one command Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> 2026-03-26 15:28:23 +01:00			`pub mod document;`
Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io 2026-03-23 18:31:11 +01:00			`pub mod error;`
feat(extractors): add vertical extractors module + first 6 verticals New extractors module returns site-specific typed JSON instead of generic markdown. Each extractor: - declares a URL pattern via matches() - fetches from the site's official JSON API where one exists - returns a typed serde_json::Value with documented field names - exposes an INFO struct that powers the /v1/extractors catalog First 6 verticals shipped, all hitting public JSON APIs (no HTML scraping, zero antibot risk): - reddit → www.reddit.com/*/.json - hackernews → hn.algolia.com/api/v1/items/{id} (full thread in one call) - github_repo → api.github.com/repos/{owner}/{repo} - pypi → pypi.org/pypi/{name}/json - npm → registry.npmjs.org/{name} + downloads/point/last-week - huggingface_model → huggingface.co/api/models/{owner}/{name} Server-side routes added: - POST /v1/scrape/{vertical} explicit per-vertical extraction - GET /v1/extractors catalog (name, label, description, url_patterns) The dispatcher validates that URL matches the requested vertical before running, so users get "URL doesn't match the X extractor" instead of opaque parse failures inside the extractor. 17 unit tests cover URL matching + path parsing for each vertical. Live tests against canonical URLs (rust-lang/rust, requests pypi, react npm, whisper-large-v3 hf, item 8863 hn, an r/micro_saas post) all return correct typed JSON in 100-300ms. Sample sizes: github 863B, npm 700B, pypi 1.7KB, hf 3.2KB, hn 38KB (full comment tree). Marketing positioning: Firecrawl charges 5 credits per /extract call and you write the schema. Webclaw returns the same JSON in 1 credit per /scrape/{vertical} call with hand-written deterministic extractors per site. 2026-04-22 14:11:43 +02:00			`pub mod extractors;`
fix: make reddit and linkedin modules public for server access Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> 2026-03-29 16:54:35 +02:00			`pub mod linkedin;`
Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io 2026-03-23 18:31:11 +01:00			`pub mod proxy;`
fix: make reddit and linkedin modules public for server access Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> 2026-03-29 16:54:35 +02:00			`pub mod reddit;`
Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io 2026-03-23 18:31:11 +01:00			`pub mod sitemap;`
feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3 Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest) to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles. This removes all 5 [patch.crates-io] entries that consumers previously needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145) are now built directly on wreq's Emulation API with correct TLS options, HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order. 84% pass rate across 1000 real sites. 384 unit tests green. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> 2026-04-01 18:04:55 +02:00			`pub mod tls;`
Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io 2026-03-23 18:31:11 +01:00
			`pub use browser::BrowserProfile;`
			`pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult};`
feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support Crawl: - Real-time progress on stderr as pages complete - --crawl-state saves progress on Ctrl+C, resumes from saved state - Visited set + remaining frontier persisted for accurate resume MCP server: - Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars - Falls back to proxies.txt in CWD (existing behavior) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> 2026-03-25 21:38:28 +01:00			`pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};`
Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io 2026-03-23 18:31:11 +01:00			`pub use error::FetchError;`
style: cargo fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> 2026-04-01 18:25:40 +02:00			`pub use http::HeaderMap;`
Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io 2026-03-23 18:31:11 +01:00			`pub use proxy::{parse_proxy_file, parse_proxy_line};`
			`pub use sitemap::SitemapEntry;`
			`pub use webclaw_pdf::PdfMode;`