Sites like Bluesky emit JSON-LD with literal newline characters inside
string values (technically invalid JSON). Add sanitize_json_newlines()
fallback that escapes control characters inside quoted strings before
retrying the parse. This recovers ProfilePage, Product, and other
structured data that was previously silently dropped.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Crawls are same-origin by default. Enable allow_subdomains to follow
sibling/child subdomains (blog.example.com from example.com), or
allow_external_links for full cross-origin crawling.
Root domain extraction uses a heuristic that handles two-part TLDs
(co.uk, com.au). Includes 5 unit tests for root_domain().
Bump to 0.3.12.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Try /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml
after the standard /sitemap.xml. WordPress 5.5+ and many CMS platforms
use non-standard paths that were previously missed. Paths found via
robots.txt are deduplicated to avoid double-fetching.
Bump to 0.3.11.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stress testing showed 33% of proxies are dead, causing 30s+ timeouts
per request with 3 retries (worst case 94s). Reducing timeout from 30s
to 12s and retries from 3 to 2 brings worst case to 25s. Combined with
disabling 509 dead proxies from the pool, this should significantly
improve response times under load.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.
This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.
84% pass rate across 1000 real sites. 384 unit tests green.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a fetch returns a challenge page (small HTML with Akamai markers),
automatically visit the homepage first to collect _abck/bm_sz cookies,
then retry the original URL. This bypasses Akamai's cookie-based gate
on subpages without needing JS execution.
Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr
sensor marker on responses under 15KB.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Response.headers() now returns &http::HeaderMap instead of
&HashMap<String, String>. Updated FetchResult, is_pdf_content_type,
is_document_content_type, is_bot_protected, and all related tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document extraction:
- DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml)
- XLSX/XLS: markdown tables with multi-sheet support (via calamine)
- CSV: quoted field handling, markdown table output
- All auto-detected by Content-Type header or URL extension
New features:
- -f html output format (sanitized HTML)
- Multi-URL watch: --urls-file + --watch monitors all URLs in parallel
- Batch + LLM: --extract-prompt/--extract-json works with multiple URLs
- Mixed batch: HTML pages + DOCX + XLSX + CSV in one command
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
--only-main-content, --include, and --exclude were ignored in batch
mode because run_batch used default ExtractionOptions. Added
fetch_and_extract_batch_with_options to pass CLI options through.
Closes#3
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Crawl:
- Real-time progress on stderr as pages complete
- --crawl-state saves progress on Ctrl+C, resumes from saved state
- Visited set + remaining frontier persisted for accurate resume
MCP server:
- Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars
- Falls back to proxies.txt in CWD (existing behavior)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reddit blocks TLS-fingerprinted clients on their .json API but
accepts standard requests with a browser User-Agent. Switch to
a non-impersonated primp client for the Reddit fallback path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>