- New BrowserProfile::SafariIos mapped to BrowserVariant::SafariIos26.
Built on wreq_util::Emulation::SafariIos26 with 4 overrides (TLS
extension order, HTTP/2 HEADERS priority, real Safari iOS 26 headers,
gzip/deflate/br). Matches bogdanfinn safari_ios_26_0 JA3
8d909525bd5bbb79f133d11cc05159fe exactly. Empirically 9/10 requests
succeed on immobiliare.it through a country-it residential proxy.
- BrowserProfile::Chrome aligned to bogdanfinn chrome_133: dropped
MAX_CONCURRENT_STREAMS from H2 SETTINGS, priority weight 256,
explicit extension_permutation, advertise h3 in ALPN and ALPS.
JA3 43067709b025da334de1279a120f8e14, akamai_fp
52d84b11737d980aef856699f885ca86. Fixes indeed.com and other
Cloudflare-fronted sites.
- New locale module: accept_language_for_url / accept_language_for_tld.
TLD to Accept-Language mapping, unknown TLDs default to en-US.
DataDome geo-vs-locale cross-checks are now trivially satisfiable.
- wreq-util bumped 2.2.6 to 3.0.0-rc.10 for Emulation::SafariIos26.
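
The TLD lookup in the locale module can be pictured roughly as below. This is a minimal sketch, not the module's actual code: the two function names come from the entry above, but their signatures, the specific Accept-Language strings, and the hand-rolled host parsing are assumptions made for illustration. Only the en-US default for unknown TLDs is taken from the changelog.

```rust
/// Map a TLD to an Accept-Language value.
/// The table entries are illustrative assumptions; the en-US fallback for
/// unknown TLDs is the behaviour described in the changelog.
fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld.to_ascii_lowercase().as_str() {
        "it" => "it-IT,it;q=0.9,en;q=0.8",
        "de" => "de-DE,de;q=0.9,en;q=0.8",
        "fr" => "fr-FR,fr;q=0.9,en;q=0.8",
        _ => "en-US,en;q=0.9",
    }
}

/// Derive the TLD from a URL with plain string handling and look it up.
/// (A real implementation would more likely use a proper URL parser.)
fn accept_language_for_url(url: &str) -> &'static str {
    // Drop the scheme, then keep only the authority part.
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    // Strip optional userinfo and port.
    let host = authority.rsplit('@').next().unwrap_or(authority);
    let host = host.split(':').next().unwrap_or(host);
    // The TLD is whatever follows the last dot.
    let tld = host.rsplit('.').next().unwrap_or("");
    accept_language_for_tld(tld)
}
```

With a table like this, a request to an `.it` host carries an Italian Accept-Language, which is what a geo-vs-locale cross-check would expect to see from an Italian exit IP.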
Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now
take `client: &dyn Fetcher` instead of `client: &FetchClient` directly.
Backwards-compatible: FetchClient implements Fetcher, blanket impls
cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server
callers keep working unchanged.
Motivation: the production API server (api.webclaw.io) must not do
in-process TLS fingerprinting; it delegates all HTTP to the Go
tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on
production would have required importing wreq into the server's
dep graph, violating the CLAUDE.md rule. Now production can provide
its own TlsSidecarFetcher implementation and pass it to the same
dispatcher the OSS server uses.
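
The shape of the abstraction can be sketched as follows. This is a simplified, synchronous stand-in: the real trait is async (via async-trait) and its method set is not shown in this message, so the single `fetch` method, `LocalFetcher`, `SidecarFetcher`, and `extract` are all hypothetical names used only to illustrate the blanket impls and the `&dyn Fetcher` call pattern.

```rust
use std::sync::Arc;

/// Simplified stand-in for the trait; the real one is async and may
/// expose more methods than this single hypothetical `fetch`.
trait Fetcher {
    fn fetch(&self, url: &str) -> Result<String, String>;
}

// Blanket impls: `&T` and `Arc<T>` forward to the underlying implementation,
// so callers holding a reference or a shared handle need no adapter.
impl<T: Fetcher + ?Sized> Fetcher for &T {
    fn fetch(&self, url: &str) -> Result<String, String> {
        (**self).fetch(url)
    }
}

impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> Result<String, String> {
        (**self).fetch(url)
    }
}

/// Hypothetical in-process implementation (plays the role of FetchClient).
struct LocalFetcher;

impl Fetcher for LocalFetcher {
    fn fetch(&self, url: &str) -> Result<String, String> {
        Ok(format!("local:{url}"))
    }
}

/// Hypothetical sidecar-backed implementation (plays the role of a
/// production TlsSidecarFetcher that delegates HTTP elsewhere).
struct SidecarFetcher;

impl Fetcher for SidecarFetcher {
    fn fetch(&self, url: &str) -> Result<String, String> {
        Ok(format!("sidecar:{url}"))
    }
}

/// An extractor only sees `&dyn Fetcher`, so it compiles without knowing
/// which HTTP stack sits behind the trait object.
fn extract(client: &dyn Fetcher, url: &str) -> Result<String, String> {
    client.fetch(url)
}
```

Because of the blanket impls, `extract(&local, ..)`, `extract(&&local, ..)` and `extract(&Arc::new(SidecarFetcher), ..)` all type-check against the same signature.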
Changes:
- New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus
blanket impls for `&T` and `Arc<T>`.
- `FetchClient` gains a tiny impl block in client.rs that forwards to
its existing public methods.
- All 28 extractor signatures migrated from `&FetchClient` to
`&dyn Fetcher` (sed-driven bulk rewrite, no semantic change).
- `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`.
- `extractors::dispatch_by_url` and `extractors::dispatch_by_name`
take `&dyn Fetcher`.
- `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has
native async-fn-in-trait but dyn dispatch still needs async_trait).
- Version bumped to 0.5.1, CHANGELOG updated.
Tests: 215 passing in webclaw-fetch (no new tests needed — the existing
extractor tests exercise the trait methods transparently).
Clippy: clean workspace-wide.
The local-first / cloud-fallback flow was duplicated in two places:
- webclaw-mcp/src/cloud.rs (302 lines, canonical)
- webclaw-cli/src/cloud.rs (80 lines, minimal subset kept to avoid
pulling rmcp as a dep)
Move to the shared crate where all vertical extractors and the new
webclaw-server can also reach it.
## New module: webclaw-fetch/src/cloud.rs
Single canonical home. Consolidates both previous versions and
promotes the error type from stringy to typed:
- `CloudError` enum with dedicated variants for the four HTTP
outcomes callers act on differently: 401 (key rejected),
402 (insufficient plan), 429 (rate limited), and 5xx (ServerError),
plus Network / ParseFailed for non-HTTP failures. Each variant's
Display message ends with an actionable URL (signup / pricing /
dashboard) so API consumers can surface it verbatim.
- `From<CloudError> for String` bridge so the dozen existing
`.await?` call sites in MCP / CLI that expected `Result<_, String>`
keep compiling. We can migrate them to the typed error per-site
later without a churn commit.
- `CloudClient::new(Option<&str>)` matches the CLI's `--api-key`
flag pattern (explicit key wins, env fallback, None when empty).
`::from_env()` kept for MCP-style call sites.
- `with_key_and_base` for staging / integration tests.
- `scrape / post / get / fetch_html` — `fetch_html` is new, a
convenience that calls /v1/scrape with formats=["html"] and
returns the raw HTML string so vertical extractors can plug
antibot-bypassed HTML straight into their parsers.
- `is_bot_protected` + `needs_js_rendering` detectors moved
over verbatim. Detection patterns are public (CF / DataDome /
AWS WAF challenge-page signatures) — no moat leak.
- `smart_fetch` kept on the original `Result<_, String>`
signature so MCP's six call sites compile unchanged.
- `smart_fetch_html` is new: the local-first-then-cloud flow
for the vertical-extractor pattern, returning the typed
`CloudError` so extractors can emit precise upgrade-path
messages.
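
To make the typed-error-plus-String-bridge idea concrete, here is a minimal sketch under stated assumptions. The variant names for the 401/402/429 cases, the `from_status` and `check_status` helpers, the message wording, and the example.com URLs are all placeholders, not the crate's real definitions; the NotConfigured, Network and ParseFailed variants are omitted for brevity.

```rust
use std::fmt;

/// Sketch of a typed cloud error. Variant names for the 401/402/429 cases
/// are assumptions; only `ServerError` is named in the commit message.
#[derive(Debug, PartialEq)]
enum CloudError {
    Unauthorized,
    InsufficientPlan,
    RateLimited,
    ServerError(u16),
}

impl CloudError {
    /// Map an HTTP status to a typed error; `None` means the status is
    /// not one of the failure cases handled here.
    fn from_status(status: u16) -> Option<Self> {
        match status {
            401 => Some(Self::Unauthorized),
            402 => Some(Self::InsufficientPlan),
            429 => Some(Self::RateLimited),
            500..=599 => Some(Self::ServerError(status)),
            _ => None,
        }
    }
}

impl fmt::Display for CloudError {
    // Each message ends with a URL; these are placeholders, not real links.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Unauthorized => write!(f, "API key rejected, see https://example.com/signup"),
            Self::InsufficientPlan => write!(f, "plan does not cover this call, see https://example.com/pricing"),
            Self::RateLimited => write!(f, "rate limited, see https://example.com/pricing"),
            Self::ServerError(code) => write!(f, "cloud returned HTTP {code}, see https://example.com/dashboard"),
        }
    }
}

/// Bridge so `?` still works in functions that return `Result<_, String>`.
impl From<CloudError> for String {
    fn from(err: CloudError) -> String {
        err.to_string()
    }
}

/// New-style helper returning the typed error.
fn check_status(status: u16) -> Result<(), CloudError> {
    match CloudError::from_status(status) {
        Some(err) => Err(err),
        None => Ok(()),
    }
}

/// Legacy-style call site: stringly-typed signature, unchanged body.
fn legacy_call(status: u16) -> Result<&'static str, String> {
    check_status(status)?; // CloudError -> String via the From impl
    Ok("ok")
}
```

The `From` impl is what lets the `?` operator convert a `CloudError` at a call site that still returns `Result<_, String>`, so such sites can be migrated one at a time.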
## Cleanup
- Deleted webclaw-mcp/src/cloud.rs — all imports now resolve to
`webclaw_fetch::cloud::*`. Dropped reqwest as a direct dep of
webclaw-mcp (it only used it for the old cloud client).
- Deleted webclaw-cli/src/cloud.rs. CLI keeps reqwest for its
webhook / on-change / research HTTP calls.
- webclaw-fetch now has reqwest as a direct dep. It was already
transitively pulled in by webclaw-llm; this just makes the
dependency relationship explicit at the call site.
## Tests
16 new unit tests cover:
- CloudError status mapping (401/402/429/5xx)
- NotConfigured error includes signup URL
- CloudClient::new explicit-key-wins-over-env + empty-string = None
- base_url strips trailing slash
- Detector matrix (CF challenge / Turnstile / real content with
embedded Turnstile / SPA skeleton / real article with script tags)
- truncate respects char boundaries (never slices inside a multi-byte
UTF-8 character)
Full workspace test suite still passes (~500 tests). fmt + clippy
clean. No behavior change for existing MCP / CLI call sites.
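
Three of the behaviours covered by these tests are small enough to sketch directly. The helper names and signatures below are assumptions (the real logic lives inside `CloudClient`), the environment value is passed in as a parameter to keep the sketch pure, and treating a blank explicit key as absent so that it falls through to the env value is one possible reading of "empty-string = None".

```rust
/// Pick the API key: an explicit key wins over the env value, and a
/// blank result becomes `None`. (Assumption: a blank explicit key is
/// treated as absent and falls through to the env value.)
fn resolve_key(explicit: Option<&str>, env_value: Option<&str>) -> Option<String> {
    explicit
        .into_iter()
        .chain(env_value)
        .map(str::trim)
        .find(|key| !key.is_empty())
        .map(String::from)
}

/// Strip trailing slashes so joining with "/v1/..." never doubles them.
fn normalize_base_url(base: &str) -> &str {
    base.trim_end_matches('/')
}

/// Truncate to at most `max_bytes` without cutting a UTF-8 character in half.
fn truncate(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Walk back to the nearest char boundary; index 0 is always a boundary,
    // so the loop terminates.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

For example, `truncate("héllo", 2)` cannot keep half of the two-byte `é`, so it backs off by one byte rather than panicking on a mid-character slice.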
Adds 3 social-network extractors that work entirely without auth, using
public embed/preview endpoints + Instagram's own SEO-facing API:
- linkedin_post: /embed/feed/update/{urn} returns full body,
author, image, OG tags. Accepts both the urn:li:share
and urn:li:activity URN forms plus the pretty
/posts/{slug}-{id}-{suffix} URLs.
- instagram_post: /p/{shortcode}/embed/captioned/ returns the full
caption, username, thumbnail. Same endpoint serves
reels and IGTV, kind correctly classified.
- instagram_profile: /api/v1/users/web_profile_info/?username=X with the
x-ig-app-id header (Instagram's public web-app id,
sent by their own JS bundle). Returns the full
profile + the 12 most recent posts with shortcodes,
kinds, like/comment counts, thumbnails, and caption
previews. Falls back to OG-tag scraping of the
public HTML if the API ever 401/403s.
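
The URL-to-endpoint step for the two post extractors might look something like this. It is a sketch only: the function names are hypothetical, it handles just the literal URN forms (not the pretty `/posts/...` URLs or percent-encoded colons), and it returns paths rather than full URLs since the hosts are not spelled out above.

```rust
/// Find a `urn:li:share:<digits>` or `urn:li:activity:<digits>` URN inside
/// a URL and build the embed path from it. Returns `None` when no URN with
/// a numeric id is present (e.g. for the pretty /posts/... form, which this
/// sketch does not resolve).
fn linkedin_embed_path(url: &str) -> Option<String> {
    for prefix in ["urn:li:share:", "urn:li:activity:"] {
        if let Some(start) = url.find(prefix) {
            // Collect the run of digits that follows the URN prefix.
            let id: String = url[start + prefix.len()..]
                .chars()
                .take_while(|c| c.is_ascii_digit())
                .collect();
            if !id.is_empty() {
                return Some(format!("/embed/feed/update/{prefix}{id}"));
            }
        }
    }
    None
}

/// Build the captioned-embed path for an Instagram shortcode.
fn instagram_embed_path(shortcode: &str) -> String {
    format!("/p/{shortcode}/embed/captioned/")
}
```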
The IG profile output is shaped so callers can fan out cleanly:
    for p in profile.recent_posts:
        scrape('instagram_post', p.url)
giving you 'whole profile + every recent post' in one loop. End-to-end
tested against ticketswave: 1 profile call + 12 post calls in ~3.5s.
Pagination beyond 12 posts requires authenticated cookies and is left
for the cloud where we can stash a session.
Infrastructure change: added FetchClient::fetch_with_headers so
extractors can satisfy site-specific request headers (here x-ig-app-id;
later github_pr will use this for Authorization, etc.) without polluting
the global FetchConfig.headers map. Same retry semantics as fetch().
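
One way to picture per-request headers that leave the global map untouched is a merge into a fresh list for each call. The sketch below is an assumption about the approach, not the actual `fetch_with_headers` code: `merge_headers` is a hypothetical helper, and the rule that a per-request header overrides a default with the same case-insensitive name is a design choice made here for illustration.

```rust
/// Build the header list for a single request: start from a copy of the
/// global defaults, then apply per-request headers on top. A per-request
/// header replaces a default with the same (case-insensitive) name;
/// otherwise it is appended. The defaults themselves are never mutated.
fn merge_headers(
    defaults: &[(String, String)],
    extra: &[(&str, &str)],
) -> Vec<(String, String)> {
    let mut merged: Vec<(String, String)> = defaults.to_vec();
    for (name, value) in extra {
        match merged.iter().position(|(n, _)| n.eq_ignore_ascii_case(name)) {
            Some(i) => merged[i].1 = value.to_string(),
            None => merged.push((name.to_string(), value.to_string())),
        }
    }
    merged
}
```

An extractor would then pass something like `&[("x-ig-app-id", app_id)]` for its own request while every other extractor keeps seeing the unmodified defaults.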
Catalog now exposes 17 extractors via /v1/extractors. Total unit tests
across the module: 47 passing. Clippy clean. Fmt clean.
Live test on the maintainer's example URLs:
- LinkedIn post (urn:li:share:7452618582213144577): 'Orc Dev' / full body
/ shipper.club link / CDN image extracted in 250ms.
- Instagram post (DT-RICMjeK5): 835-char Slovak caption, ticketswave
username, thumbnail. 200ms.
- Instagram profile (ticketswave): 18,473 followers (exact, not
rounded), is_verified=True, is_business=True, biography with emojis,
12 recent posts with shortcodes + kinds + likes. 400ms.
Out of scope for this wave (require infra we don't have):
- linkedin_profile: returns 999 to all bot UAs, needs OAuth
- facebook_post / facebook_page: content is JS-loaded, needs cloud Chrome
- facebook_profile (personal): not publicly accessible by design
Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.
This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.
84% pass rate across 1000 real sites. 384 unit tests green.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Response.headers() now returns &http::HeaderMap instead of
&HashMap<String, String>. Updated FetchResult, is_pdf_content_type,
is_document_content_type, is_bot_protected, and all related tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document extraction:
- DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml)
- XLSX/XLS: markdown tables with multi-sheet support (via calamine)
- CSV: quoted field handling, markdown table output
- All auto-detected by Content-Type header or URL extension
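
The auto-detection step could be sketched like this. `DocKind` and `detect_document` are hypothetical names, and checking Content-Type before falling back to the URL extension is an assumed precedence; the media types themselves are the standard registered ones for these formats.

```rust
#[derive(Debug, PartialEq)]
enum DocKind {
    Docx,
    Xlsx,
    Xls,
    Csv,
}

/// Classify a response as a document by Content-Type, falling back to the
/// URL's file extension. Returns `None` for anything else (e.g. HTML).
fn detect_document(content_type: Option<&str>, url: &str) -> Option<DocKind> {
    // Compare only the media type, ignoring parameters such as charset.
    if let Some(ct) = content_type {
        let media = ct.split(';').next().unwrap_or("").trim().to_ascii_lowercase();
        match media.as_str() {
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document" => {
                return Some(DocKind::Docx)
            }
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" => {
                return Some(DocKind::Xlsx)
            }
            "application/vnd.ms-excel" => return Some(DocKind::Xls),
            "text/csv" => return Some(DocKind::Csv),
            _ => {} // generic or unknown type: try the extension next
        }
    }
    // Extension of the path, with query string and fragment removed.
    let path = url.split(|c: char| c == '?' || c == '#').next().unwrap_or(url);
    let ext = path.rsplit('.').next()?.to_ascii_lowercase();
    match ext.as_str() {
        "docx" => Some(DocKind::Docx),
        "xlsx" => Some(DocKind::Xlsx),
        "xls" => Some(DocKind::Xls),
        "csv" => Some(DocKind::Csv),
        _ => None,
    }
}
```

The extension fallback matters for servers that return a generic `application/octet-stream` for downloads.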
New features:
- -f html output format (sanitized HTML)
- Multi-URL watch: --urls-file + --watch monitors all URLs in parallel
- Batch + LLM: --extract-prompt/--extract-json works with multiple URLs
- Mixed batch: HTML pages + DOCX + XLSX + CSV in one command
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>