New extractors module returns site-specific typed JSON instead of
generic markdown. Each extractor:
- declares a URL pattern via matches()
- fetches from the site's official JSON API where one exists
- returns a typed serde_json::Value with documented field names
- exposes an INFO struct that powers the /v1/extractors catalog
First 6 verticals shipped, all hitting public JSON APIs (no HTML
scraping, zero antibot risk):
- reddit → www.reddit.com/*/.json
- hackernews → hn.algolia.com/api/v1/items/{id} (full thread in one call)
- github_repo → api.github.com/repos/{owner}/{repo}
- pypi → pypi.org/pypi/{name}/json
- npm → registry.npmjs.org/{name} + downloads/point/last-week
- huggingface_model → huggingface.co/api/models/{owner}/{name}
Server-side routes added:
- POST /v1/scrape/{vertical} explicit per-vertical extraction
- GET /v1/extractors catalog (name, label, description, url_patterns)
The dispatcher validates that URL matches the requested vertical
before running, so users get "URL doesn't match the X extractor"
instead of opaque parse failures inside the extractor.
17 unit tests cover URL matching + path parsing for each vertical.
Live tests against canonical URLs (rust-lang/rust, requests pypi,
react npm, whisper-large-v3 hf, item 8863 hn, an r/micro_saas post)
all return correct typed JSON in 100-300ms. Sample sizes: github
863B, npm 700B, pypi 1.7KB, hf 3.2KB, hn 38KB (full comment tree).
Marketing positioning: Firecrawl charges 5 credits per /extract call
and you write the schema. Webclaw returns the same JSON in 1 credit
per /scrape/{vertical} call with hand-written deterministic
extractors per site.
Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.
This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.
84% pass rate across 1000 real sites. 384 unit tests green.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Response.headers() now returns &http::HeaderMap instead of
&HashMap<String, String>. Updated FetchResult, is_pdf_content_type,
is_document_content_type, is_bot_protected, and all related tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document extraction:
- DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml)
- XLSX/XLS: markdown tables with multi-sheet support (via calamine)
- CSV: quoted field handling, markdown table output
- All auto-detected by Content-Type header or URL extension
New features:
- -f html output format (sanitized HTML)
- Multi-URL watch: --urls-file + --watch monitors all URLs in parallel
- Batch + LLM: --extract-prompt/--extract-json works with multiple URLs
- Mixed batch: HTML pages + DOCX + XLSX + CSV in one command
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Crawl:
- Real-time progress on stderr as pages complete
- --crawl-state saves progress on Ctrl+C, resumes from saved state
- Visited set + remaining frontier persisted for accurate resume
MCP server:
- Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars
- Falls back to proxies.txt in CWD (existing behavior)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>