Mirror of https://github.com/0xMassi/webclaw.git, synced 2026-04-25 00:06:21 +02:00

Compare commits (12 commits)
| Author | SHA1 | Date |
|---|---|---|
| | a5c3433372 | |
| | 966981bc42 | |
| | 866fa88aa0 | |
| | b413d702b2 | |
| | 98a177dec4 | |
| | e1af2da509 | |
| | 2285c585b1 | |
| | b77767814a | |
| | 4bf11d902f | |
| | 0daa2fec1a | |
| | 058493bc8f | |
| | aaa5103504 | |
45 changed files with 812 additions and 114 deletions
58
CHANGELOG.md
@@ -3,6 +3,64 @@
 All notable changes to webclaw are documented here.
 Format follows [Keep a Changelog](https://keepachangelog.com/).

+## [0.5.6] — 2026-04-23
+
+### Added
+
+- `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API with an identifiable bot `User-Agent`, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. Makes `/v1/scrape` on Reddit populate markdown again.
+
+### Fixed
+
+- Regression introduced in 0.5.4 where the production server's `/v1/scrape` bypassed the Reddit `.json` shortcut and Akamai cookie warmup that `fetch_and_extract` had been providing. Both helpers now live in `fetch_smart`, and every caller path picks them up.
+- Panic in the markdown converter (`markdown.rs:925`) on single-pipe `|` lines. A `[1..len-1]` slice on a 1-char input triggered a `begin <= end` slice panic. Now guarded.
+
+---
+
+## [0.5.5] — 2026-04-23
+
+### Added
+
+- `webclaw --browser safari-ios` on the CLI. Pairs with `--proxy` for DataDome-fronted sites that reject desktop profiles.
+
+---
+
+## [0.5.4] — 2026-04-23
+
+### Added
+
+- New `BrowserProfile::SafariIos` for Safari iOS 26 fingerprinting. Pairs with a country-matched residential proxy for sites that reject non-mobile profiles.
+- `accept_language_for_url(url)` and `accept_language_for_tld(tld)` helpers. Return a locale-appropriate `Accept-Language` based on the URL's TLD, with `en-US` as the fallback.
+
+### Changed
+
+- Chrome browser fingerprint refreshed for current Cloudflare bot management. Fixes 403 challenges on several e-commerce and jobs sites.
+- Bumped `wreq-util` to `3.0.0-rc.10`.
+
+---
+
+## [0.5.2] — 2026-04-22
+
+### Added
+
+- **`webclaw vertical <name> <url>` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`. URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1.
+- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick.
+- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set.
+
+### Changed
+
+- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded to 10). `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`.
+
+---
+
+## [0.5.1] — 2026-04-22
+
+### Added
+
+- **`webclaw_fetch::Fetcher` trait.** Vertical extractors now consume `&dyn Fetcher` instead of `&FetchClient` directly. The trait exposes three methods (`fetch`, `fetch_with_headers`, `cloud`) covering everything extractors need. Callers that already held a `FetchClient` keep working unchanged: `FetchClient` implements `Fetcher`, blanket impls cover `&T` and `Arc<T>`, so `&client` coerces to `&dyn Fetcher` automatically.
+
+  The motivation is the split between OSS (wreq-backed, in-process TLS fingerprinting) and the production API server at api.webclaw.io (which cannot use in-process fingerprinting per the architecture rule, and must delegate HTTP through the Go tls-sidecar). Before this trait, adding vertical routes to the production server would have required importing wreq into its dependency graph, violating the separation. Now the production server can provide its own `TlsSidecarFetcher` implementation and pass it to the same extractor dispatcher the OSS server uses.
+
+  Backwards compatible. No behavior change for CLI, MCP, or OSS self-host.
+
+### Changed
+
+- All 28 extractor `extract()` signatures migrated from `client: &FetchClient` to `client: &dyn Fetcher`. The dispatcher functions (`extractors::dispatch_by_url`, `extractors::dispatch_by_name`) and the cloud escalation helpers (`cloud::smart_fetch`, `cloud::smart_fetch_html`) follow the same change. Tests and call sites are unchanged because `&FetchClient` auto-coerces.
+
+---
+
 ## [0.5.0] — 2026-04-22

 ### Added
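The 0.5.4 entries above mention `accept_language_for_tld` and `accept_language_for_url`. A minimal sketch of how such helpers could work, assuming an illustrative TLD-to-locale table and a naive host parse — the crate's real mapping and parsing are not shown in this diff:

```rust
// Hypothetical sketch of the 0.5.4 Accept-Language helpers.
// The TLD → locale table here is illustrative, not the crate's actual data.
fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9,en;q=0.8",
        "de" => "de-DE,de;q=0.9,en;q=0.8",
        "fr" => "fr-FR,fr;q=0.9,en;q=0.8",
        "es" => "es-ES,es;q=0.9,en;q=0.8",
        // `en-US` is the documented fallback.
        _ => "en-US,en;q=0.9",
    }
}

fn accept_language_for_url(url: &str) -> &'static str {
    // Naive host extraction: strip the scheme, cut at the first slash,
    // then take the last dot-separated label as the TLD.
    let host = url
        .trim_start_matches("https://")
        .trim_start_matches("http://")
        .split('/')
        .next()
        .unwrap_or("");
    let tld = host.rsplit('.').next().unwrap_or("");
    accept_language_for_tld(tld)
}
```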
11
CLAUDE.md
@@ -11,7 +11,7 @@ webclaw/
 # + ExtractionOptions (include/exclude CSS selectors)
 # + diff engine (change tracking)
 # + brand extraction (DOM/CSS analysis)
-webclaw-fetch/ # HTTP client via primp. Crawler. Sitemap discovery. Batch ops.
+webclaw-fetch/ # HTTP client via wreq (BoringSSL). Crawler. Sitemap discovery. Batch ops.
 # + proxy pool rotation (per-request)
 # + PDF content-type detection
 # + document parsing (DOCX, XLSX, CSV)

@@ -40,7 +40,7 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
 - `brand.rs` — Brand identity extraction from DOM structure and CSS

 ### Fetch Modules (`webclaw-fetch`)
-- `client.rs` — FetchClient with primp TLS impersonation
+- `client.rs` — FetchClient with wreq BoringSSL TLS impersonation; implements the public `Fetcher` trait so callers (including server adapters) can swap in alternative implementations
 - `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128)
 - `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
 - `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt)

@@ -76,9 +76,10 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
 ## Hard Rules

 - **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible.
-- **primp requires `[patch.crates-io]`** for patched rustls/h2 forks at workspace level.
-- **RUSTFLAGS are set in `.cargo/config.toml`** — no need to pass manually.
-- **webclaw-llm uses plain reqwest** (NOT primp-patched). LLM APIs don't need TLS fingerprinting.
+- **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally.
+- **No special RUSTFLAGS** — `.cargo/config.toml` is currently empty of build flags. Don't add any.
+- **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep.
+- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client.
 - **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.

 ## Build & Test
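The `&dyn Fetcher` hard rule above leans on trait-object coercion plus blanket impls, as described in the 0.5.1 changelog. A self-contained sketch of that pattern, with hypothetical names (`Fetcher`, `FetchClient`, `dispatch`) standing in for webclaw's real types and a synchronous `fetch` for brevity:

```rust
use std::sync::Arc;

// Minimal stand-in for webclaw's Fetcher trait (sync, one method).
trait Fetcher {
    fn fetch(&self, url: &str) -> String;
}

struct FetchClient;

impl Fetcher for FetchClient {
    fn fetch(&self, url: &str) -> String {
        format!("fetched {url}")
    }
}

// Blanket impls: a reference or Arc to any Fetcher is itself a Fetcher,
// so dispatchers can take `&dyn Fetcher` while callers keep passing
// whatever handle they already hold.
impl<T: Fetcher + ?Sized> Fetcher for &T {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

// The dispatcher only sees the trait object; a production server could
// hand it a different implementation without touching this signature.
fn dispatch(client: &dyn Fetcher, url: &str) -> String {
    client.fetch(url)
}
```

Both a plain `&FetchClient` and an `&Arc<FetchClient>` coerce at the `dispatch` call site, which is why the migration left call sites untouched.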
46
Cargo.lock
generated
@@ -2967,6 +2967,26 @@ dependencies = [
 "pom",
 ]

+[[package]]
+name = "typed-builder"
+version = "0.23.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "31aa81521b70f94402501d848ccc0ecaa8f93c8eb6999eb9747e72287757ffda"
+dependencies = [
+ "typed-builder-macro",
+]
+
+[[package]]
+name = "typed-builder-macro"
+version = "0.23.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "076a02dc54dd46795c2e9c8282ed40bcfb1e22747e955de9389a1de28190fb26"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
 [[package]]
 name = "typed-path"
 version = "0.12.3"

@@ -3199,7 +3219,7 @@ dependencies = [
 [[package]]
 name = "webclaw-cli"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
 "clap",
 "dotenvy",

@@ -3220,7 +3240,7 @@ dependencies = [
 [[package]]
 name = "webclaw-core"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
 "ego-tree",
 "once_cell",

@@ -3238,8 +3258,9 @@ dependencies = [
 [[package]]
 name = "webclaw-fetch"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
+ "async-trait",
 "bytes",
 "calamine",
 "http",

@@ -3257,12 +3278,13 @@ dependencies = [
 "webclaw-core",
 "webclaw-pdf",
 "wreq",
+ "wreq-util",
 "zip 2.4.2",
 ]

 [[package]]
 name = "webclaw-llm"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
 "async-trait",
 "reqwest",

@@ -3275,7 +3297,7 @@ dependencies = [
 [[package]]
 name = "webclaw-mcp"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
 "dirs",
 "dotenvy",

@@ -3295,7 +3317,7 @@ dependencies = [
 [[package]]
 name = "webclaw-pdf"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
 "pdf-extract",
 "thiserror",

@@ -3304,7 +3326,7 @@ dependencies = [
 [[package]]
 name = "webclaw-server"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
 "anyhow",
 "axum",

@@ -3708,6 +3730,16 @@ dependencies = [
 "zstd",
 ]

+[[package]]
+name = "wreq-util"
+version = "3.0.0-rc.10"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6c6bbe24d28beb9ceb58b514bd6a613c759d3b706f768b9d2950d5d35b543c04"
+dependencies = [
+ "typed-builder",
+ "wreq",
+]
+
 [[package]]
 name = "writeable"
 version = "0.6.2"
@@ -3,7 +3,7 @@ resolver = "2"
 members = ["crates/*"]

 [workspace.package]
-version = "0.5.0"
+version = "0.5.6"
 edition = "2024"
 license = "AGPL-3.0"
 repository = "https://github.com/0xMassi/webclaw"
@@ -308,6 +308,34 @@ enum Commands {
         #[arg(long)]
         facts: Option<PathBuf>,
     },
+
+    /// List all vertical extractors in the catalog.
+    ///
+    /// Each entry has a stable `name` (usable with `webclaw vertical <name>`),
+    /// a human-friendly label, a one-line description, and the URL
+    /// patterns it claims. The same data is served by `/v1/extractors`
+    /// when running the REST API.
+    Extractors {
+        /// Emit JSON instead of a human-friendly table.
+        #[arg(long)]
+        json: bool,
+    },
+
+    /// Run a vertical extractor by name. Returns typed JSON with fields
+    /// specific to the target site (title, price, author, rating, etc.)
+    /// rather than generic markdown.
+    ///
+    /// Use `webclaw extractors` to see the full list. Example:
+    /// `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/`.
+    Vertical {
+        /// Vertical name (e.g. `reddit`, `github_repo`, `trustpilot_reviews`).
+        name: String,
+        /// URL to extract.
+        url: String,
+        /// Emit compact JSON (single line). Default is pretty-printed.
+        #[arg(long)]
+        raw: bool,
+    },
 }

 #[derive(Clone, ValueEnum)]

@@ -323,6 +351,9 @@ enum OutputFormat {
 enum Browser {
     Chrome,
     Firefox,
+    /// Safari iOS 26. Pair with a country-matched residential proxy for sites
+    /// that reject non-mobile profiles.
+    SafariIos,
     Random,
 }

@@ -349,6 +380,7 @@ impl From<Browser> for BrowserProfile {
         match b {
             Browser::Chrome => BrowserProfile::Chrome,
             Browser::Firefox => BrowserProfile::Firefox,
+            Browser::SafariIos => BrowserProfile::SafariIos,
             Browser::Random => BrowserProfile::Random,
         }
     }

@@ -2288,6 +2320,83 @@ async fn main() {
             }
             return;
         }
+        Commands::Extractors { json } => {
+            let entries = webclaw_fetch::extractors::list();
+            if *json {
+                // Serialize with serde_json. ExtractorInfo derives
+                // Serialize so this is a one-liner.
+                match serde_json::to_string_pretty(&entries) {
+                    Ok(s) => println!("{s}"),
+                    Err(e) => {
+                        eprintln!("error: failed to serialise catalog: {e}");
+                        process::exit(1);
+                    }
+                }
+            } else {
+                // Human-friendly table: NAME + LABEL + one URL
+                // pattern sample. Keeps the output scannable on a
+                // narrow terminal.
+                println!("{} vertical extractors available:\n", entries.len());
+                let name_w = entries.iter().map(|e| e.name.len()).max().unwrap_or(0);
+                let label_w = entries.iter().map(|e| e.label.len()).max().unwrap_or(0);
+                for e in &entries {
+                    let pattern_sample = e.url_patterns.first().copied().unwrap_or("");
+                    println!(
+                        " {:<nw$} {:<lw$} {}",
+                        e.name,
+                        e.label,
+                        pattern_sample,
+                        nw = name_w,
+                        lw = label_w,
+                    );
+                }
+                println!("\nRun one: webclaw vertical <name> <url>");
+            }
+            return;
+        }
+        Commands::Vertical { name, url, raw } => {
+            // Build a FetchClient with cloud fallback attached when
+            // WEBCLAW_API_KEY is set. Antibot-gated verticals
+            // (amazon, ebay, etsy, trustpilot) need this to escalate
+            // on bot protection.
+            let fetch_cfg = webclaw_fetch::FetchConfig {
+                browser: webclaw_fetch::BrowserProfile::Firefox,
+                ..webclaw_fetch::FetchConfig::default()
+            };
+            let mut client = match webclaw_fetch::FetchClient::new(fetch_cfg) {
+                Ok(c) => c,
+                Err(e) => {
+                    eprintln!("error: failed to build fetch client: {e}");
+                    process::exit(1);
+                }
+            };
+            if let Some(cloud) = webclaw_fetch::cloud::CloudClient::from_env() {
+                client = client.with_cloud(cloud);
+            }
+            match webclaw_fetch::extractors::dispatch_by_name(&client, name, url).await {
+                Ok(data) => {
+                    let rendered = if *raw {
+                        serde_json::to_string(&data)
+                    } else {
+                        serde_json::to_string_pretty(&data)
+                    };
+                    match rendered {
+                        Ok(s) => println!("{s}"),
+                        Err(e) => {
+                            eprintln!("error: JSON encode failed: {e}");
+                            process::exit(1);
+                        }
+                    }
+                }
+                Err(e) => {
+                    // UrlMismatch / UnknownVertical / Fetch all get
+                    // Display impls with actionable messages.
+                    eprintln!("error: {e}");
+                    process::exit(1);
+                }
+            }
+            return;
+        }
     }
 }
@@ -920,8 +920,10 @@ fn strip_markdown(md: &str) -> String {
             continue;
         }

-        // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs
-        if trimmed.starts_with('|') && trimmed.ends_with('|') {
+        // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs.
+        // Require at least 2 chars so the slice `[1..len-1]` stays non-empty on single-pipe rows
+        // (which aren't real tables anyway); a lone `|` previously panicked at `begin <= end`.
+        if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
            let inner = &trimmed[1..trimmed.len() - 1];
            let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
            lines.push(cells.join("\t"));
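The guard above can be exercised in isolation. A standalone sketch of the same logic (the function name is hypothetical; the real code lives inside `strip_markdown`): slicing `&s[1..s.len() - 1]` on a one-character `|` computes the range `1..0`, which panics, and the length check sidesteps it.

```rust
// Hypothetical standalone version of the 0.5.6 single-pipe guard.
// Returns the tab-joined cells for a table row, or None for non-rows
// (including the lone `|` that used to panic).
fn table_row_to_tabs(trimmed: &str) -> Option<String> {
    if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
        let inner = &trimmed[1..trimmed.len() - 1];
        let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
        return Some(cells.join("\t"));
    }
    None
}
```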
@@ -12,7 +12,9 @@ serde = { workspace = true }
 thiserror = { workspace = true }
 tracing = { workspace = true }
 tokio = { workspace = true }
+async-trait = "0.1"
+wreq = { version = "6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] }
+wreq-util = "3.0.0-rc.10"
 http = "1"
 bytes = "1"
 url = "2"
@@ -7,6 +7,10 @@ pub enum BrowserProfile {
     #[default]
     Chrome,
     Firefox,
+    /// Safari iOS 26 (iPhone). The one profile proven to defeat
+    /// DataDome's immobiliare.it / idealista.it / target.com-class
+    /// rules when paired with a country-scoped residential proxy.
+    SafariIos,
     /// Randomly pick from all available profiles on each request.
     Random,
 }

@@ -18,6 +22,7 @@ pub enum BrowserVariant {
     ChromeMacos,
     Firefox,
     Safari,
+    SafariIos26,
     Edge,
 }
@@ -261,10 +261,65 @@ impl FetchClient {
         self.cloud.as_deref()
     }

+    /// Fetch a URL with per-site rescue paths: Reddit URLs redirect to the
+    /// `.json` API, and Akamai-style challenge responses trigger a homepage
+    /// cookie warmup and a retry. Returns the same `FetchResult` shape as
+    /// [`Self::fetch`] so every caller (CLI, MCP, OSS server, production
+    /// server) benefits without shape churn.
+    ///
+    /// This is the method most callers want. Use plain [`Self::fetch`] only
+    /// when you need literal no-rescue behavior (e.g. inside the rescue
+    /// logic itself to avoid recursion).
+    pub async fn fetch_smart(&self, url: &str) -> Result<FetchResult, FetchError> {
+        // Reddit: the HTML page shows a verification interstitial for most
+        // client IPs, but appending `.json` returns the post + comment tree
+        // publicly. `parse_reddit_json` in downstream code knows how to read
+        // the result; here we just do the URL swap at the fetch layer.
+        if crate::reddit::is_reddit_url(url) && !url.ends_with(".json") {
+            let json_url = crate::reddit::json_url(url);
+            // Reddit's public .json API serves JSON to identifiable bot
+            // User-Agents and blocks browser UAs with a verification wall.
+            // Override our Chrome-profile UA for this specific call.
+            let ua = concat!(
+                "Webclaw/",
+                env!("CARGO_PKG_VERSION"),
+                " (+https://webclaw.io)"
+            );
+            if let Ok(resp) = self
+                .fetch_with_headers(&json_url, &[("user-agent", ua)])
+                .await
+                && resp.status == 200
+            {
+                let first = resp.html.trim_start().as_bytes().first().copied();
+                if matches!(first, Some(b'{') | Some(b'[')) {
+                    return Ok(resp);
+                }
+            }
+            // If the .json fetch failed or returned HTML, fall through.
+        }
+
+        let resp = self.fetch(url).await?;
+
+        // Akamai / bazadebezolkohpepadr challenge: visit the homepage to
+        // collect warmup cookies (_abck, bm_sz, etc.), then retry.
+        if is_challenge_html(&resp.html)
+            && let Some(homepage) = extract_homepage(url)
+        {
+            debug!("challenge detected, warming cookies via {homepage}");
+            let _ = self.fetch(&homepage).await;
+            if let Ok(retry) = self.fetch(url).await {
+                return Ok(retry);
+            }
+        }
+
+        Ok(resp)
+    }
+
     /// Fetch a URL and return the raw HTML + response metadata.
     ///
     /// Automatically retries on transient failures (network errors, 5xx, 429)
-    /// with exponential backoff: 0s, 1s (2 attempts total).
+    /// with exponential backoff: 0s, 1s (2 attempts total). No per-site
+    /// rescue logic; use [`Self::fetch_smart`] for that.
     #[instrument(skip(self), fields(url = %url))]
     pub async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
         let delays = [Duration::ZERO, Duration::from_secs(1)];

@@ -599,12 +654,43 @@ impl FetchClient {
     }
 }

+// ---------------------------------------------------------------------------
+// Fetcher trait implementation
+//
+// Vertical extractors consume the [`crate::fetcher::Fetcher`] trait
+// rather than `FetchClient` directly, which is what lets the production
+// API server swap in a tls-sidecar-backed implementation without
+// pulling wreq into its dependency graph. For everyone else (CLI, MCP,
+// self-hosted OSS server) this impl means "pass the FetchClient you
+// already have; nothing changes".
+// ---------------------------------------------------------------------------
+
+#[async_trait::async_trait]
+impl crate::fetcher::Fetcher for FetchClient {
+    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
+        FetchClient::fetch(self, url).await
+    }
+
+    async fn fetch_with_headers(
+        &self,
+        url: &str,
+        headers: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
+        FetchClient::fetch_with_headers(self, url, headers).await
+    }
+
+    fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
+        FetchClient::cloud(self)
+    }
+}
+
 /// Collect the browser variants to use based on the browser profile.
 fn collect_variants(profile: &BrowserProfile) -> Vec<BrowserVariant> {
     match profile {
         BrowserProfile::Random => browser::all_variants(),
         BrowserProfile::Chrome => vec![browser::latest_chrome()],
         BrowserProfile::Firefox => vec![browser::latest_firefox()],
+        BrowserProfile::SafariIos => vec![BrowserVariant::SafariIos26],
     }
 }

@@ -682,22 +768,23 @@ fn is_pdf_content_type(headers: &http::HeaderMap) -> bool {

 /// Detect if a response looks like a bot protection challenge page.
 fn is_challenge_response(response: &Response) -> bool {
-    let len = response.body().len();
+    is_challenge_html(response.text().as_ref())
+}
+
+/// Same as `is_challenge_response`, operating on a body string directly
+/// so callers holding a `FetchResult` can reuse the heuristic.
+fn is_challenge_html(html: &str) -> bool {
+    let len = html.len();
     if len > 15_000 || len == 0 {
         return false;
     }

-    let text = response.text();
-    let lower = text.to_lowercase();
-
+    let lower = html.to_lowercase();
     if lower.contains("<title>challenge page</title>") {
         return true;
     }

     if lower.contains("bazadebezolkohpepadr") && len < 5_000 {
         return true;
     }

     false
 }
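The Reddit branch of `fetch_smart` above delegates the URL swap to `crate::reddit::json_url`, which this diff doesn't show. A hedged sketch of what such a helper might do — dropping the query string before appending `.json` is an assumption; the crate's actual helper may preserve it:

```rust
// Illustrative stand-in for `crate::reddit::json_url` (not the real code).
// Maps a reddit post URL to its public `.json` API equivalent.
fn reddit_json_url(url: &str) -> String {
    // Assumption: drop any query string, then strip the trailing slash
    // so `.json` attaches to the path segment itself.
    let base = url.split('?').next().unwrap_or(url);
    let base = base.trim_end_matches('/');
    format!("{base}.json")
}
```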
@@ -66,7 +66,9 @@ use serde_json::{Value, json};
 use thiserror::Error;
 use tracing::{debug, info, warn};

-use crate::client::FetchClient;
+// Client type isn't needed here anymore now that smart_fetch* takes
+// `&dyn Fetcher`. Kept as a comment for historical context: this
+// module used to import FetchClient directly before v0.5.1.

 // ---------------------------------------------------------------------------
 // URLs + defaults — keep in one place so "change the signup link" is a

@@ -506,7 +508,7 @@ pub enum SmartFetchResult {
 /// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed
 /// [`CloudError`] so you can render precise UX.
 pub async fn smart_fetch(
-    client: &FetchClient,
+    client: &dyn crate::fetcher::Fetcher,
     cloud: Option<&CloudClient>,
     url: &str,
     include_selectors: &[String],

@@ -613,7 +615,7 @@ pub struct FetchedHtml {
 /// Designed for the vertical-extractor pattern where the caller has
 /// its own parser and just needs bytes.
 pub async fn smart_fetch_html(
-    client: &FetchClient,
+    client: &dyn crate::fetcher::Fetcher,
     cloud: Option<&CloudClient>,
     url: &str,
 ) -> Result<FetchedHtml, CloudError> {
|||
|
|
@ -32,9 +32,9 @@ use regex::Regex;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::cloud::{self, CloudError};
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "amazon_product",
|
||||
|
|
@ -59,7 +59,7 @@ pub fn matches(url: &str) -> bool {
|
|||
parse_asin(url).is_some()
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let asin = parse_asin(url)
|
||||
.ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;
|
||||
|
||||
|
|
|
|||
|
|
@ -10,8 +10,8 @@ use quick_xml::events::Event;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "arxiv",
|
||||
|
|
@ -32,7 +32,7 @@ pub fn matches(url: &str) -> bool {
|
|||
url.contains("/abs/") || url.contains("/pdf/")
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let id = parse_id(url)
|
||||
.ok_or_else(|| FetchError::Build(format!("arxiv: cannot parse id from '{url}'")))?;
|
||||
|
||||
|
|
|
|||
|
|
@ -9,8 +9,8 @@ use serde::Deserialize;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "crates_io",
|
||||
|
|
@ -30,7 +30,7 @@ pub fn matches(url: &str) -> bool {
|
|||
url.contains("/crates/")
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let name = parse_name(url)
|
||||
.ok_or_else(|| FetchError::Build(format!("crates.io: cannot parse name from '{url}'")))?;
|
||||
|
||||
|
|
|
|||
|
|
@ -8,8 +8,8 @@ use serde::Deserialize;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "dev_to",
|
||||
|
|
@ -61,7 +61,7 @@ const RESERVED_FIRST_SEGS: &[&str] = &[
|
|||
"t",
|
||||
];
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let (username, slug) = parse_username_slug(url).ok_or_else(|| {
|
||||
FetchError::Build(format!("dev_to: cannot parse username/slug from '{url}'"))
|
||||
})?;
|
||||
|
|
|
|||
|
|
@ -8,8 +8,8 @@ use serde::Deserialize;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "docker_hub",
|
||||
|
|
@ -29,7 +29,7 @@ pub fn matches(url: &str) -> bool {
|
|||
url.contains("/_/") || url.contains("/r/")
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let (namespace, name) = parse_repo(url)
|
||||
.ok_or_else(|| FetchError::Build(format!("docker_hub: cannot parse repo from '{url}'")))?;
|
||||
|
||||
|
|
|
|||
|
|
@ -14,9 +14,9 @@ use regex::Regex;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::cloud::{self, CloudError};
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "ebay_listing",
|
||||
|
|
@ -39,7 +39,7 @@ pub fn matches(url: &str) -> bool {
|
|||
parse_item_id(url).is_some()
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let item_id = parse_item_id(url)
|
||||
.ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?;
|
||||
|
||||
|
|
|
|||
|
|
@ -42,8 +42,8 @@ use regex::Regex;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "ecommerce_product",
|
||||
|
|
@ -69,7 +69,7 @@ pub fn matches(url: &str) -> bool {
|
|||
!host_of(url).is_empty()
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let resp = client.fetch(url).await?;
|
||||
if !(200..300).contains(&resp.status) {
|
||||
return Err(FetchError::Build(format!(
|
||||
|
|
|
|||
|
|
@ -26,9 +26,9 @@ use regex::Regex;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::cloud::{self, CloudError};
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "etsy_listing",
|
||||
|
|
@ -49,7 +49,7 @@ pub fn matches(url: &str) -> bool {
|
|||
parse_listing_id(url).is_some()
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let listing_id = parse_listing_id(url)
|
||||
.ok_or_else(|| FetchError::Build(format!("etsy_listing: no listing id in '{url}'")))?;
|
||||
|
||||
|
|
|
|||
|
|
@ -10,8 +10,8 @@ use serde::Deserialize;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "github_issue",
|
||||
|
|
@ -34,7 +34,7 @@ pub fn matches(url: &str) -> bool {
|
|||
parse_issue(url).is_some()
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let (owner, repo, number) = parse_issue(url).ok_or_else(|| {
|
||||
FetchError::Build(format!("github_issue: cannot parse issue URL '{url}'"))
|
||||
})?;
|
||||
|
|
|
|||
|
|
@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "github_pr",
@@ -33,7 +33,7 @@ pub fn matches(url: &str) -> bool {
     parse_pr(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, repo, number) = parse_pr(url).ok_or_else(|| {
         FetchError::Build(format!("github_pr: cannot parse pull-request URL '{url}'"))
     })?;
@@ -8,8 +8,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "github_release",
@@ -32,7 +32,7 @@ pub fn matches(url: &str) -> bool {
     parse_release(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, repo, tag) = parse_release(url).ok_or_else(|| {
         FetchError::Build(format!("github_release: cannot parse release URL '{url}'"))
     })?;
@@ -10,8 +10,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "github_repo",
@@ -70,7 +70,7 @@ const RESERVED_OWNERS: &[&str] = &[
     "about",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, repo) = parse_owner_repo(url).ok_or_else(|| {
         FetchError::Build(format!("github_repo: cannot parse owner/repo from '{url}'"))
     })?;
@@ -10,8 +10,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "hackernews",
@@ -40,7 +40,7 @@ pub fn matches(url: &str) -> bool {
     false
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let id = parse_item_id(url).ok_or_else(|| {
         FetchError::Build(format!("hackernews: cannot parse item id from '{url}'"))
     })?;
@@ -7,8 +7,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "huggingface_dataset",
@@ -38,7 +38,7 @@ pub fn matches(url: &str) -> bool {
     segs.first().copied() == Some("datasets") && (segs.len() == 2 || segs.len() == 3)
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let dataset_path = parse_dataset_path(url).ok_or_else(|| {
         FetchError::Build(format!(
             "hf_dataset: cannot parse dataset path from '{url}'"
@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "huggingface_model",
@@ -61,7 +61,7 @@ const RESERVED_NAMESPACES: &[&str] = &[
     "search",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, name) = parse_owner_name(url).ok_or_else(|| {
         FetchError::Build(format!("hf model: cannot parse owner/name from '{url}'"))
     })?;
@@ -11,8 +11,8 @@ use serde_json::{Value, json};
 use std::sync::OnceLock;
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "instagram_post",
@@ -33,7 +33,7 @@ pub fn matches(url: &str) -> bool {
     parse_shortcode(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (kind, shortcode) = parse_shortcode(url).ok_or_else(|| {
         FetchError::Build(format!(
             "instagram_post: cannot parse shortcode from '{url}'"
@@ -23,8 +23,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "instagram_profile",
@@ -80,7 +80,7 @@ const RESERVED: &[&str] = &[
     "signup",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let username = parse_username(url).ok_or_else(|| {
         FetchError::Build(format!(
             "instagram_profile: cannot parse username from '{url}'"
@@ -198,7 +198,7 @@ fn classify(n: &MediaNode) -> &'static str {
 /// pull whatever OG tags we can. Returns less data and explicitly
 /// flags `data_completeness: "og_only"` so callers know.
 async fn og_fallback(
-    client: &FetchClient,
+    client: &dyn Fetcher,
     username: &str,
     original_url: &str,
     api_status: u16,
@@ -14,8 +14,8 @@ use serde_json::{Value, json};
 use std::sync::OnceLock;
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "linkedin_post",
@@ -36,7 +36,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/feed/update/urn:li:") || url.contains("/posts/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let urn = extract_urn(url).ok_or_else(|| {
         FetchError::Build(format!(
             "linkedin_post: cannot extract URN from '{url}' (expected /feed/update/urn:li:... or /posts/{{slug}}-{{id}})"
@@ -46,8 +46,8 @@ pub mod youtube_video;
 use serde::Serialize;
 use serde_json::Value;
 
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 /// Public catalog entry for `/v1/extractors`. Stable shape — clients
 /// rely on `name` to pick the right `/v1/scrape/{name}` route.
@@ -102,7 +102,7 @@ pub fn list() -> Vec<ExtractorInfo> {
 /// one that claims the URL. Used by `/v1/scrape` when the caller doesn't
 /// pick a vertical explicitly.
 pub async fn dispatch_by_url(
-    client: &FetchClient,
+    client: &dyn Fetcher,
     url: &str,
 ) -> Option<Result<(&'static str, Value), FetchError>> {
     if reddit::matches(url) {
@@ -281,7 +281,7 @@ pub async fn dispatch_by_url(
 /// users get a clear "wrong route" error instead of a confusing parse
 /// failure deep in the extractor.
 pub async fn dispatch_by_name(
-    client: &FetchClient,
+    client: &dyn Fetcher,
     name: &str,
     url: &str,
 ) -> Result<Value, ExtractorDispatchError> {
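The `dispatch_by_url` change above keeps the first-match dispatch pattern: try each vertical's `matches` predicate, run the first extractor that claims the URL. A minimal synchronous sketch of that pattern (names and types here are illustrative stubs, not the crate's real async API):

```rust
// First-match URL dispatch: each vertical exposes a `matches` predicate
// and an `extract` function; the dispatcher runs the first claimant.
// The real dispatcher is async and returns serde_json::Value; this stub
// uses plain fn pointers and String to stay self-contained.
type Matcher = fn(&str) -> bool;
type Extract = fn(&str) -> Result<String, String>;

struct Vertical {
    name: &'static str,
    matches: Matcher,
    extract: Extract,
}

// Returns None when no vertical claims the URL, mirroring the
// Option<Result<...>> shape in the diff above.
fn dispatch_by_url(
    verticals: &[Vertical],
    url: &str,
) -> Option<(&'static str, Result<String, String>)> {
    verticals
        .iter()
        .find(|v| (v.matches)(url))
        .map(|v| (v.name, (v.extract)(url)))
}

fn main() {
    let verticals = [
        Vertical {
            name: "reddit",
            matches: |u: &str| u.contains("reddit.com") && u.contains("/comments/"),
            extract: |u: &str| Ok(format!("reddit:{u}")),
        },
        Vertical {
            name: "npm",
            matches: |u: &str| u.contains("npmjs.com/package/"),
            extract: |u: &str| Ok(format!("npm:{u}")),
        },
    ];
    let hit = dispatch_by_url(&verticals, "https://www.reddit.com/r/rust/comments/abc/x/");
    println!("{:?}", hit.map(|(n, _)| n));
}
```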
@@ -13,8 +13,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "npm",
@@ -31,7 +31,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/package/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let name = parse_name(url)
         .ok_or_else(|| FetchError::Build(format!("npm: cannot parse name from '{url}'")))?;
 
@@ -94,7 +94,7 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
     }))
 }
 
-async fn fetch_weekly_downloads(client: &FetchClient, name: &str) -> Result<i64, FetchError> {
+async fn fetch_weekly_downloads(client: &dyn Fetcher, name: &str) -> Result<i64, FetchError> {
     let url = format!(
         "https://api.npmjs.org/downloads/point/last-week/{}",
         urlencode_segment(name)
@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "pypi",
@@ -30,7 +30,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/project/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (name, version) = parse_project(url).ok_or_else(|| {
         FetchError::Build(format!("pypi: cannot parse package name from '{url}'"))
     })?;
@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "reddit",
@@ -32,7 +32,7 @@ pub fn matches(url: &str) -> bool {
     is_reddit_host && url.contains("/comments/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let json_url = build_json_url(url);
     let resp = client.fetch(&json_url).await?;
    if resp.status != 200 {
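The reddit extractor rewrites the comments URL into its `.json` API form via `build_json_url` before fetching. A plausible sketch of that helper, under the assumption that it just strips query/fragment and appends `.json` (the real implementation may handle more edge cases):

```rust
// Hypothetical sketch of turning a Reddit comments URL into its .json
// API form: drop the query string and fragment, trim a trailing slash,
// then append ".json". Reddit serves the post + comment tree as JSON
// at that path.
fn build_json_url(url: &str) -> String {
    let base = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url)
        .trim_end_matches('/');
    format!("{base}.json")
}

fn main() {
    println!(
        "{}",
        build_json_url("https://www.reddit.com/r/rust/comments/abc123/title/?share=1")
    );
}
```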
@@ -15,8 +15,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "shopify_collection",
@@ -49,7 +49,7 @@ const NON_SHOPIFY_HOSTS: &[&str] = &[
     "github.com",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (coll_meta_url, coll_products_url) = build_json_urls(url);
 
     // Step 1: collection metadata. Shopify returns 200 on missing
@@ -21,8 +21,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "shopify_product",
@@ -65,7 +65,7 @@ const NON_SHOPIFY_HOSTS: &[&str] = &[
     "github.com", // /products is a marketing page
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let json_url = build_json_url(url);
     let resp = client.fetch(&json_url).await?;
     if resp.status == 404 {
@@ -13,8 +13,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "stackoverflow",
@@ -31,7 +31,7 @@ pub fn matches(url: &str) -> bool {
     parse_question_id(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let id = parse_question_id(url).ok_or_else(|| {
         FetchError::Build(format!(
             "stackoverflow: cannot parse question id from '{url}'"
@@ -28,9 +28,9 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::cloud::{self, CloudError};
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "substack_post",
@@ -49,7 +49,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/p/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let slug = parse_slug(url).ok_or_else(|| {
         FetchError::Build(format!("substack_post: cannot parse slug from '{url}'"))
     })?;
@@ -149,7 +149,7 @@ fn build_api_payload(url: &str, api_url: &str, slug: &str, p: Post) -> Value {
 // ---------------------------------------------------------------------------
 
 async fn html_fallback(
-    client: &FetchClient,
+    client: &dyn Fetcher,
     url: &str,
     api_url: &str,
     slug: &str,
@@ -32,9 +32,9 @@ use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::cloud::{self, CloudError};
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "trustpilot_reviews",
@@ -51,7 +51,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/review/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
         .await
         .map_err(cloud_to_fetch_err)?;
@@ -15,8 +15,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "woocommerce_product",
@@ -42,7 +42,7 @@ pub fn matches(url: &str) -> bool {
     || url.contains("/produit/") // common fr locale
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let slug = parse_slug(url).ok_or_else(|| {
         FetchError::Build(format!(
             "woocommerce_product: cannot parse slug from '{url}'"
@@ -25,8 +25,8 @@ use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "youtube_video",
@@ -45,7 +45,7 @@ pub fn matches(url: &str) -> bool {
     || url.contains("youtube-nocookie.com/embed/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let video_id = parse_video_id(url).ok_or_else(|| {
         FetchError::Build(format!("youtube_video: cannot parse video id from '{url}'"))
     })?;
crates/webclaw-fetch/src/fetcher.rs (new file, 118 lines)
@@ -0,0 +1,118 @@
+//! Pluggable fetcher abstraction for vertical extractors.
+//!
+//! Extractors call the network through this trait instead of hard-
+//! coding [`FetchClient`]. The OSS CLI / MCP / self-hosted server all
+//! pass `&FetchClient` (wreq-backed BoringSSL). The production API
+//! server, which must not use in-process TLS fingerprinting, provides
+//! its own implementation that routes through the Go tls-sidecar.
+//!
+//! Both paths expose the same [`FetchResult`] shape and the same
+//! optional cloud-escalation client, so extractor logic stays
+//! identical across environments.
+//!
+//! ## Choosing an implementation
+//!
+//! - CLI, MCP, self-hosted `webclaw-server`: build a [`FetchClient`]
+//!   with [`FetchClient::with_cloud`] to attach cloud fallback, pass
+//!   it to extractors as `&client`.
+//! - `api.webclaw.io` production server: build a `TlsSidecarFetcher`
+//!   (in `server/src/engine/`) that delegates to `engine::tls_client`
+//!   and wraps it in `Arc<dyn Fetcher>` for handler injection.
+//!
+//! ## Why a trait and not a free function
+//!
+//! Extractors need state beyond a single fetch: the cloud client for
+//! antibot escalation, and in the future per-user proxy pools, tenant
+//! headers, circuit breakers. A trait keeps that state encapsulated
+//! behind the fetch interface instead of threading it through every
+//! extractor signature.
+
+use async_trait::async_trait;
+
+use crate::client::FetchResult;
+use crate::cloud::CloudClient;
+use crate::error::FetchError;
+
+/// HTTP fetch surface used by vertical extractors.
+///
+/// Implementations must be `Send + Sync` because extractor dispatchers
+/// run them inside tokio tasks, potentially across many requests.
+#[async_trait]
+pub trait Fetcher: Send + Sync {
+    /// Fetch a URL and return the raw response body + metadata. The
+    /// body is in `FetchResult::html` regardless of the actual content
+    /// type — JSON API endpoints put JSON there, HTML pages put HTML.
+    /// Extractors branch on response status and body shape.
+    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError>;
+
+    /// Fetch with additional request headers. Needed for endpoints
+    /// that authenticate via a specific header (Instagram's
+    /// `x-ig-app-id`, for example). Default implementation routes to
+    /// [`Self::fetch`] so implementers without header support stay
+    /// functional, though the `Option<String>` field they'd set won't
+    /// be populated on the request.
+    async fn fetch_with_headers(
+        &self,
+        url: &str,
+        _headers: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
+        self.fetch(url).await
+    }
+
+    /// Optional cloud-escalation client for antibot bypass. Returning
+    /// `Some` tells extractors they can call into the hosted API when
+    /// local fetch hits a challenge page. Returning `None` makes
+    /// cloud-gated extractors emit [`CloudError::NotConfigured`] with
+    /// an actionable signup link.
+    ///
+    /// The default implementation returns `None` because not every
+    /// deployment wants cloud fallback (self-hosts that don't have a
+    /// webclaw.io subscription, for instance).
+    ///
+    /// [`CloudError::NotConfigured`]: crate::cloud::CloudError::NotConfigured
+    fn cloud(&self) -> Option<&CloudClient> {
+        None
+    }
+}
+
+// ---------------------------------------------------------------------------
+// Blanket impls: make `&T` and `Arc<T>` behave like the wrapped `T`.
+// ---------------------------------------------------------------------------
+
+#[async_trait]
+impl<T: Fetcher + ?Sized> Fetcher for &T {
+    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
+        (**self).fetch(url).await
+    }
+
+    async fn fetch_with_headers(
+        &self,
+        url: &str,
+        headers: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
+        (**self).fetch_with_headers(url, headers).await
+    }
+
+    fn cloud(&self) -> Option<&CloudClient> {
+        (**self).cloud()
+    }
+}
+
+#[async_trait]
+impl<T: Fetcher + ?Sized> Fetcher for std::sync::Arc<T> {
+    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
+        (**self).fetch(url).await
+    }
+
+    async fn fetch_with_headers(
+        &self,
+        url: &str,
+        headers: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
+        (**self).fetch_with_headers(url, headers).await
+    }
+
+    fn cloud(&self) -> Option<&CloudClient> {
+        (**self).cloud()
+    }
+}
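The shape of this trait — a required `fetch`, a default `fetch_with_headers` that forwards, an optional capability accessor, and blanket impls so smart pointers work as trait objects — can be sketched synchronously without the async-trait crate. Everything below is a simplified stub (sync, `String` bodies), not the crate's real API:

```rust
// Sync sketch of the Fetcher pattern above: default methods keep
// minimal implementers working, and a blanket impl lets Arc<T> be
// used wherever the trait is required (handler injection).
use std::sync::Arc;

trait Fetcher: Send + Sync {
    fn fetch(&self, url: &str) -> Result<String, String>;

    // Default forwards to fetch(), so implementers without header
    // support stay functional (headers are simply dropped).
    fn fetch_with_headers(
        &self,
        url: &str,
        _headers: &[(&str, &str)],
    ) -> Result<String, String> {
        self.fetch(url)
    }

    // Default None: cloud escalation is opt-in per deployment.
    fn cloud(&self) -> Option<&str> {
        None
    }
}

// Blanket impl: Arc<T> delegates to the wrapped T, so callers can hold
// Arc<dyn Fetcher> and still satisfy `impl Fetcher` bounds.
impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> Result<String, String> {
        (**self).fetch(url)
    }
}

struct Stub;
impl Fetcher for Stub {
    fn fetch(&self, url: &str) -> Result<String, String> {
        Ok(format!("body-for:{url}"))
    }
}

fn main() {
    let f: Arc<dyn Fetcher> = Arc::new(Stub);
    println!("{:?}", f.fetch_with_headers("https://example.com", &[("x-test", "1")]));
}
```

The real trait needs `#[async_trait]` because `async fn` in traits is not object-safe without boxing; the blanket impls in the diff exist for exactly the reason the sketch shows.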
@@ -8,7 +8,9 @@ pub mod crawler;
 pub mod document;
 pub mod error;
 pub mod extractors;
+pub mod fetcher;
 pub mod linkedin;
+pub mod locale;
 pub mod proxy;
 pub mod reddit;
 pub mod sitemap;
@@ -18,7 +20,9 @@ pub use browser::BrowserProfile;
 pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult};
 pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};
 pub use error::FetchError;
+pub use fetcher::Fetcher;
 pub use http::HeaderMap;
+pub use locale::{accept_language_for_tld, accept_language_for_url};
 pub use proxy::{parse_proxy_file, parse_proxy_line};
 pub use sitemap::SitemapEntry;
 pub use webclaw_pdf::PdfMode;
crates/webclaw-fetch/src/locale.rs (new file, 77 lines)
@@ -0,0 +1,77 @@
+//! Derive an `Accept-Language` header from a URL.
+//!
+//! DataDome-class bot detection on country-specific sites (e.g. immobiliare.it,
+//! leboncoin.fr) does a geo-vs-locale sanity check: residential IP in the
+//! target country + a browser UA but the wrong `Accept-Language` is a bot
+//! signal. Matching the site's expected locale gets us through.
+//!
+//! Default for unmapped TLDs is `en-US,en;q=0.9` — the global fallback.
+
+/// Best-effort `Accept-Language` header value for the given URL's TLD.
+/// Returns `None` if the URL cannot be parsed.
+pub fn accept_language_for_url(url: &str) -> Option<&'static str> {
+    let host = url::Url::parse(url).ok()?.host_str()?.to_ascii_lowercase();
+    let tld = host.rsplit('.').next()?;
+    Some(accept_language_for_tld(tld))
+}
+
+/// Map a bare TLD like `it`, `fr`, `de` to a plausible `Accept-Language`.
+/// Unknown TLDs fall back to US English.
+pub fn accept_language_for_tld(tld: &str) -> &'static str {
+    match tld {
+        "it" => "it-IT,it;q=0.9",
+        "fr" => "fr-FR,fr;q=0.9",
+        "de" | "at" => "de-DE,de;q=0.9",
+        "es" => "es-ES,es;q=0.9",
+        "pt" => "pt-PT,pt;q=0.9",
+        "nl" => "nl-NL,nl;q=0.9",
+        "pl" => "pl-PL,pl;q=0.9",
+        "se" => "sv-SE,sv;q=0.9",
+        "no" => "nb-NO,nb;q=0.9",
+        "dk" => "da-DK,da;q=0.9",
+        "fi" => "fi-FI,fi;q=0.9",
+        "cz" => "cs-CZ,cs;q=0.9",
+        "ro" => "ro-RO,ro;q=0.9",
+        "gr" => "el-GR,el;q=0.9",
+        "tr" => "tr-TR,tr;q=0.9",
+        "ru" => "ru-RU,ru;q=0.9",
+        "jp" => "ja-JP,ja;q=0.9",
+        "kr" => "ko-KR,ko;q=0.9",
+        "cn" => "zh-CN,zh;q=0.9",
+        "tw" | "hk" => "zh-TW,zh;q=0.9",
+        "br" => "pt-BR,pt;q=0.9",
+        "mx" | "ar" | "co" | "cl" | "pe" => "es-ES,es;q=0.9",
+        "uk" | "ie" => "en-GB,en;q=0.9",
+        _ => "en-US,en;q=0.9",
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn tld_dispatch() {
+        assert_eq!(
+            accept_language_for_url("https://www.immobiliare.it/annunci/1"),
+            Some("it-IT,it;q=0.9")
+        );
+        assert_eq!(
+            accept_language_for_url("https://www.leboncoin.fr/"),
+            Some("fr-FR,fr;q=0.9")
+        );
+        assert_eq!(
+            accept_language_for_url("https://www.amazon.co.uk/"),
+            Some("en-GB,en;q=0.9")
+        );
+        assert_eq!(
+            accept_language_for_url("https://example.com/"),
+            Some("en-US,en;q=0.9")
+        );
+    }
+
+    #[test]
+    fn bad_url_returns_none() {
+        assert_eq!(accept_language_for_url("not-a-url"), None);
+    }
+}
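The locale module depends on the `url` crate for host parsing; the core TLD-to-locale idea can be demonstrated standalone with a naive host parse. A self-contained sketch covering a few entries from the table above (the naive parse is a simplification, not the crate's approach):

```rust
// TLD -> Accept-Language sketch. Covers a subset of the real table;
// unknown TLDs fall back to US English, matching the module above.
fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9",
        "fr" => "fr-FR,fr;q=0.9",
        "de" | "at" => "de-DE,de;q=0.9",
        "uk" | "ie" => "en-GB,en;q=0.9",
        _ => "en-US,en;q=0.9",
    }
}

// Naive host extraction instead of url::Url::parse: take what sits
// between "://" and the first '/' or ':', then the last dot-segment.
fn accept_language_for_url(url: &str) -> Option<&'static str> {
    let rest = url.split_once("://")?.1;
    let host = rest.split('/').next()?.split(':').next()?;
    let tld = host.rsplit('.').next()?;
    if tld.is_empty() || host == tld {
        return None; // not a dotted hostname
    }
    Some(accept_language_for_tld(&tld.to_ascii_lowercase()))
}

fn main() {
    println!("{:?}", accept_language_for_url("https://www.immobiliare.it/annunci/1"));
}
```

Note that `co.uk` style registries work by accident here (the last label `uk` is what the table keys on); a production version would need the real parser this sketch replaces.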
@ -7,10 +7,15 @@
|
|||
|
||||
use std::time::Duration;
|
||||
|
||||
use std::borrow::Cow;
|
||||
|
||||
use wreq::http2::{
|
||||
Http2Options, PseudoId, PseudoOrder, SettingId, SettingsOrder, StreamDependency, StreamId,
|
||||
};
|
||||
use wreq::tls::{AlpsProtocol, CertificateCompressionAlgorithm, TlsOptions, TlsVersion};
|
||||
use wreq::tls::{
|
||||
AlpnProtocol, AlpsProtocol, CertificateCompressionAlgorithm, ExtensionType, TlsOptions,
|
||||
TlsVersion,
|
||||
};
|
||||
use wreq::{Client, Emulation};
|
||||
|
||||
use crate::browser::BrowserVariant;
|
||||
|
|
@ -43,6 +48,55 @@ const SAFARI_SIGALGS: &str = "ecdsa_secp256r1_sha256:rsa_pss_rsae_sha256:rsa_pkc
|
|||
/// Safari curves.
|
||||
const SAFARI_CURVES: &str = "X25519:P-256:P-384:P-521";
|
||||
|
||||
/// Safari iOS 26 TLS extension order, matching bogdanfinn's
|
||||
/// `safari_ios_26_0` wire format. GREASE slots are omitted. wreq
|
||||
/// inserts them itself. Diverges from wreq-util's default SafariIos26
|
||||
/// extension order, which DataDome's immobiliare.it ruleset flags.
|
||||
fn safari_ios_extensions() -> Vec<ExtensionType> {
|
||||
vec![
|
||||
ExtensionType::CERTIFICATE_TIMESTAMP,
|
||||
ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION,
|
||||
ExtensionType::SERVER_NAME,
|
||||
ExtensionType::CERT_COMPRESSION,
|
||||
ExtensionType::KEY_SHARE,
|
||||
ExtensionType::SUPPORTED_VERSIONS,
|
||||
ExtensionType::PSK_KEY_EXCHANGE_MODES,
|
||||
ExtensionType::SUPPORTED_GROUPS,
|
||||
ExtensionType::RENEGOTIATE,
|
||||
ExtensionType::SIGNATURE_ALGORITHMS,
|
||||
ExtensionType::STATUS_REQUEST,
|
||||
ExtensionType::EC_POINT_FORMATS,
|
||||
ExtensionType::EXTENDED_MASTER_SECRET,
|
||||
]
|
||||
}
|
||||
|
||||
/// Chrome 133 TLS extension order, matching bogdanfinn's stable JA3
|
||||
/// (`43067709b025da334de1279a120f8e14`). Real Chrome permutes extensions
|
||||
/// per handshake, but indeed.com's WAF allowlists this specific wire order
|
||||
/// and rejects permuted ones. GREASE slots are inserted by wreq.
|
||||
///
|
||||
/// JA3 extension field from peet.ws: 18-5-35-51-10-45-11-27-17613-43-13-0-16-65037-65281-23
|
||||
fn chrome_extensions() -> Vec<ExtensionType> {
|
||||
vec![
|
||||
ExtensionType::CERTIFICATE_TIMESTAMP, // 18
|
||||
ExtensionType::STATUS_REQUEST, // 5
|
||||
ExtensionType::SESSION_TICKET, // 35
|
||||
ExtensionType::KEY_SHARE, // 51
|
||||
ExtensionType::SUPPORTED_GROUPS, // 10
|
||||
ExtensionType::PSK_KEY_EXCHANGE_MODES, // 45
|
||||
ExtensionType::EC_POINT_FORMATS, // 11
|
||||
ExtensionType::CERT_COMPRESSION, // 27
|
||||
ExtensionType::APPLICATION_SETTINGS_NEW, // 17613 (new codepoint, matches alps_use_new_codepoint)
|
||||
ExtensionType::SUPPORTED_VERSIONS, // 43
|
||||
ExtensionType::SIGNATURE_ALGORITHMS, // 13
|
||||
ExtensionType::SERVER_NAME, // 0
|
||||
ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION, // 16
|
||||
ExtensionType::ENCRYPTED_CLIENT_HELLO, // 65037
|
||||
ExtensionType::RENEGOTIATE, // 65281
|
||||
ExtensionType::EXTENDED_MASTER_SECRET, // 23
|
||||
]
|
||||
}
|
||||
|
||||
// --- Chrome HTTP headers in correct wire order ---
|
||||
|
||||
const CHROME_HEADERS: &[(&str, &str)] = &[
|
||||
|
|
@ -130,6 +184,26 @@ const SAFARI_HEADERS: &[(&str, &str)] = &[
|
|||
("sec-fetch-dest", "document"),
|
||||
];
|
||||
|
||||
/// Safari iOS 26 headers, in the wire order real Safari emits. Critically:
|
||||
/// NO `sec-fetch-*`, NO `priority: u=0, i` (both Chromium-only leaks), but
|
||||
/// `upgrade-insecure-requests: 1` is present. `accept-encoding` does not
|
||||
/// include zstd (Safari can't decode it). Verified against bogdanfinn on
|
||||
/// 2026-04-22: this header set is what DataDome's immobiliare ruleset
|
||||
/// expects for a real iPhone.
|
||||
const SAFARI_IOS_HEADERS: &[(&str, &str)] = &[
|
||||
(
|
||||
"accept",
|
||||
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
|
||||
),
|
||||
("accept-language", "en-US,en;q=0.9"),
|
||||
("accept-encoding", "gzip, deflate, br"),
|
||||
(
|
||||
"user-agent",
|
||||
"Mozilla/5.0 (iPhone; CPU iPhone OS 26_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Mobile/15E148 Safari/604.1",
|
||||
),
|
||||
("upgrade-insecure-requests", "1"),
|
||||
];
|
||||
|
||||
const EDGE_HEADERS: &[(&str, &str)] = &[
|
||||
(
|
||||
"sec-ch-ua",
|
||||
|
|
@ -156,6 +230,9 @@ const EDGE_HEADERS: &[(&str, &str)] = &[
|
|||
];
|
||||
|
||||
fn chrome_tls() -> TlsOptions {
|
||||
// permute_extensions is off so the explicit extension_permutation sticks.
|
||||
// Real Chrome permutes, but indeed.com's WAF allowlists bogdanfinn's
|
||||
// fixed order, so matching that gets us through.
|
||||
TlsOptions::builder()
|
||||
.cipher_list(CHROME_CIPHERS)
|
||||
.sigalgs_list(CHROME_SIGALGS)
|
||||
|
|
@ -163,12 +240,18 @@ fn chrome_tls() -> TlsOptions {
|
|||
.min_tls_version(TlsVersion::TLS_1_2)
|
||||
.max_tls_version(TlsVersion::TLS_1_3)
|
||||
.grease_enabled(true)
|
||||
.permute_extensions(true)
|
||||
.permute_extensions(false)
|
||||
.extension_permutation(chrome_extensions())
|
||||
.enable_ech_grease(true)
|
||||
.pre_shared_key(true)
|
||||
.enable_ocsp_stapling(true)
|
||||
.enable_signed_cert_timestamps(true)
|
||||
.alps_protocols([AlpsProtocol::HTTP2])
|
||||
.alpn_protocols([
|
||||
AlpnProtocol::HTTP3,
|
||||
AlpnProtocol::HTTP2,
|
||||
AlpnProtocol::HTTP1,
|
||||
])
|
||||
.alps_protocols([AlpsProtocol::HTTP3, AlpsProtocol::HTTP2])
|
||||
.alps_use_new_codepoint(true)
|
||||
.aes_hw_override(true)
|
||||
.certificate_compression_algorithms(&[CertificateCompressionAlgorithm::BROTLI])
|
||||

@@ -212,25 +295,70 @@ fn safari_tls() -> TlsOptions {
         .build()
 }

+/// Safari iOS 26 emulation — composed on top of `wreq_util::Emulation::SafariIos26`
+/// with four targeted overrides. We don't hand-roll this one like Chrome/Firefox
+/// because the wire-level defaults from wreq-util are already correct for ciphers,
+/// sigalgs, curves, and GREASE — the four things wreq-util gets *wrong* for
+/// DataDome compatibility are overridden here:
+///
+/// 1. TLS extension order: match bogdanfinn `safari_ios_26_0` exactly (JA3
+///    ends up `8d909525bd5bbb79f133d11cc05159fe`).
+/// 2. HTTP/2 HEADERS priority flag: weight=256, exclusive=1, depends_on=0.
+///    wreq-util omits this frame; real Safari and bogdanfinn include it.
+///    This flip is the thing DataDome actually reads — the akamai_fingerprint
+///    hash changes from `c52879e43202aeb92740be6e8c86ea96` to
+///    `d1294410a06522e37a5c5e3f0a45a705`, which is the winning signature.
+/// 3. Headers: strip wreq-util's Chromium defaults (`sec-fetch-*`,
+///    `priority: u=0, i`, zstd), replace with the real iOS 26 set.
+/// 4. `accept-language` preserved from config.extra_headers for locale.
+fn safari_ios_emulation() -> wreq::Emulation {
+    use wreq::EmulationFactory;
+    let mut em = wreq_util::Emulation::SafariIos26.emulation();
+
+    if let Some(tls) = em.tls_options_mut().as_mut() {
+        tls.extension_permutation = Some(Cow::Owned(safari_ios_extensions()));
+    }
+
+    // Only override the priority flag — keep wreq-util's SETTINGS, WINDOW_UPDATE,
+    // and pseudo-order intact. Replacing the whole Http2Options resets SETTINGS
+    // to defaults, which sends only INITIAL_WINDOW_SIZE and fails DataDome.
+    if let Some(h2) = em.http2_options_mut().as_mut() {
+        h2.headers_stream_dependency = Some(StreamDependency::new(StreamId::zero(), 255, true));
+    }
+
+    let hm = em.headers_mut();
+    hm.clear();
+    for (k, v) in SAFARI_IOS_HEADERS {
+        if let (Ok(n), Ok(val)) = (
+            http::header::HeaderName::from_bytes(k.as_bytes()),
+            http::header::HeaderValue::from_str(v),
+        ) {
+            hm.append(n, val);
+        }
+    }
+
+    em
+}
+
 fn chrome_h2() -> Http2Options {
+    // SETTINGS frame matches bogdanfinn `chrome_133`: HEADER_TABLE_SIZE,
+    // ENABLE_PUSH=0, INITIAL_WINDOW_SIZE, MAX_HEADER_LIST_SIZE. No
+    // MAX_CONCURRENT_STREAMS — real Chrome 133 and bogdanfinn both omit it,
+    // and indeed.com's WAF reads this as a bot signal when present. Priority
+    // weight 256 (encoded as 255 + 1) matches bogdanfinn's HEADERS frame.
     Http2Options::builder()
         .initial_window_size(6_291_456)
         .initial_connection_window_size(15_728_640)
         .max_header_list_size(262_144)
         .header_table_size(65_536)
-        .max_concurrent_streams(1000u32)
         .enable_push(false)
         .settings_order(
             SettingsOrder::builder()
                 .extend([
                     SettingId::HeaderTableSize,
                     SettingId::EnablePush,
                     SettingId::MaxConcurrentStreams,
                     SettingId::InitialWindowSize,
                     SettingId::MaxFrameSize,
                     SettingId::MaxHeaderListSize,
                     SettingId::EnableConnectProtocol,
                     SettingId::NoRfc7540Priorities,
                 ])
                 .build(),
         )

@@ -244,7 +372,7 @@ fn chrome_h2() -> Http2Options {
                 ])
                 .build(),
         )
-        .headers_stream_dependency(StreamDependency::new(StreamId::zero(), 219, true))
+        .headers_stream_dependency(StreamDependency::new(StreamId::zero(), 255, true))
         .build()
 }

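The `255` passed to `StreamDependency::new` above is the on-wire form of weight 256: RFC 7540 §5.3.2 transmits the weight minus one so the 1..=256 range fits in a single byte, and the exclusive flag occupies the top bit of the 31-bit stream-dependency field. A std-only sketch of that encoding (illustrative, not webclaw code):

```rust
// Sketch of the RFC 7540 §5.3.2 priority encoding referenced above.
// The weight is sent as (weight - 1), so 1..=256 fits in one byte; the
// exclusive flag is the high bit of the 31-bit stream-dependency field.
fn encode_priority(depends_on: u32, weight: u16, exclusive: bool) -> [u8; 5] {
    assert!((1..=256).contains(&weight), "weight must be 1..=256");
    let dep = (depends_on & 0x7FFF_FFFF) | if exclusive { 0x8000_0000 } else { 0 };
    let mut out = [0u8; 5];
    out[..4].copy_from_slice(&dep.to_be_bytes());
    out[4] = (weight - 1) as u8; // 256 -> 0xFF, matching the 255 in the builder call
    out
}

fn main() {
    // weight=256, exclusive=1, depends_on=0, the combination the profiles above send.
    assert_eq!(encode_priority(0, 256, true), [0x80, 0x00, 0x00, 0x00, 0xFF]);
    println!("{:02x?}", encode_priority(0, 256, true));
}
```

This also makes clear why the old `219` and new `255` weights differ by exactly one byte value on the wire.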

@@ -328,32 +456,38 @@ pub fn build_client(
     extra_headers: &std::collections::HashMap<String, String>,
     proxy: Option<&str>,
 ) -> Result<Client, FetchError> {
-    let (tls, h2, headers) = match variant {
-        BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
-        BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
-        BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
-        BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
-        BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+    // SafariIos26 builds its Emulation on top of wreq-util's base instead
+    // of from scratch. See `safari_ios_emulation` for why.
+    let mut emulation = match variant {
+        BrowserVariant::SafariIos26 => safari_ios_emulation(),
+        other => {
+            let (tls, h2, headers) = match other {
+                BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
+                BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
+                BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
+                BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
+                BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+                BrowserVariant::SafariIos26 => unreachable!("handled above"),
+            };
+            Emulation::builder()
+                .tls_options(tls)
+                .http2_options(h2)
+                .headers(build_headers(headers))
+                .build()
+        }
     };

-    let mut header_map = build_headers(headers);
-
-    // Append extra headers after profile defaults
+    // Append extra headers after profile defaults.
+    let hm = emulation.headers_mut();
     for (k, v) in extra_headers {
         if let (Ok(n), Ok(val)) = (
             http::header::HeaderName::from_bytes(k.as_bytes()),
             http::header::HeaderValue::from_str(v),
         ) {
-            header_map.insert(n, val);
+            hm.insert(n, val);
         }
     }

-    let emulation = Emulation::builder()
-        .tls_options(tls)
-        .http2_options(h2)
-        .headers(header_map)
-        .build();
-
     let mut builder = Client::builder()
         .emulation(emulation)
         .redirect(wreq::redirect::Policy::limited(10))

@@ -718,6 +718,55 @@ impl WebclawMcp {
             Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
         }
     }
+
+    /// List every vertical extractor the server knows about. Returns a
+    /// JSON array of `{name, label, description, url_patterns}` entries.
+    /// Call this to discover what verticals are available before using
+    /// `vertical_scrape`.
+    #[tool]
+    async fn list_extractors(
+        &self,
+        Parameters(_params): Parameters<ListExtractorsParams>,
+    ) -> Result<String, String> {
+        let catalog = webclaw_fetch::extractors::list();
+        serde_json::to_string_pretty(&catalog)
+            .map_err(|e| format!("failed to serialise extractor catalog: {e}"))
+    }
+
+    /// Run a vertical extractor by name and return typed JSON specific
+    /// to the target site (title, price, rating, author, etc.), not
+    /// generic markdown. Use `list_extractors` to discover available
+    /// names. Example names: `reddit`, `github_repo`, `trustpilot_reviews`,
+    /// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`.
+    ///
+    /// Antibot-gated verticals (amazon_product, ebay_listing,
+    /// etsy_listing, trustpilot_reviews) will automatically escalate to
+    /// the webclaw cloud API when local fetch hits bot protection,
+    /// provided `WEBCLAW_API_KEY` is set.
+    #[tool]
+    async fn vertical_scrape(
+        &self,
+        Parameters(params): Parameters<VerticalParams>,
+    ) -> Result<String, String> {
+        validate_url(&params.url)?;
+        // Use the cached Firefox client, not the default Chrome one.
+        // Reddit's `.json` endpoint rejects the wreq-Chrome TLS
+        // fingerprint with a 403 even from residential IPs (they
+        // ship a fingerprint blocklist that includes common
+        // browser-emulation libraries). The wreq-Firefox fingerprint
+        // still passes, and Firefox is equally fine for every other
+        // vertical in the catalog, so it's a strictly-safer default
+        // for `vertical_scrape` than the generic `scrape` tool's
+        // Chrome default. Matches the CLI `webclaw vertical`
+        // subcommand which already uses Firefox.
+        let client = self.firefox_or_build()?;
+        let data =
+            webclaw_fetch::extractors::dispatch_by_name(client.as_ref(), &params.name, &params.url)
+                .await
+                .map_err(|e| e.to_string())?;
+        serde_json::to_string_pretty(&data)
+            .map_err(|e| format!("failed to serialise extractor output: {e}"))
+    }
 }

 #[tool_handler]

@@ -727,7 +776,8 @@ impl ServerHandler for WebclawMcp {
             .with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION")))
             .with_instructions(String::from(
                 "Webclaw MCP server -- web content extraction for AI agents. \
-                 Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.",
+                 Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \
+                 list_extractors, vertical_scrape.",
             ))
     }
 }

@@ -103,3 +103,20 @@ pub struct SearchParams {
     /// Number of results to return (default: 10)
     pub num_results: Option<u32>,
 }
+
+/// Parameters for `vertical_scrape`: run a site-specific extractor by name.
+#[derive(Debug, Deserialize, JsonSchema)]
+pub struct VerticalParams {
+    /// Name of the vertical extractor. Call `list_extractors` to see all
+    /// available names. Examples: "reddit", "github_repo", "pypi",
+    /// "trustpilot_reviews", "youtube_video", "shopify_product".
+    pub name: String,
+    /// URL to extract. Must match the URL patterns the extractor claims;
+    /// otherwise the tool returns a clear "URL mismatch" error.
+    pub url: String,
+}
+
+/// `list_extractors` takes no arguments but we still need an empty struct
+/// so rmcp can generate a schema and parse the (empty) JSON-RPC params.
+#[derive(Debug, Deserialize, JsonSchema)]
+pub struct ListExtractorsParams {}
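Assuming a standard MCP `tools/call` envelope, a client request that deserializes into `VerticalParams` might look like this (hypothetical example values; only the `name` and `url` argument fields come from the struct above):

```json
{
  "method": "tools/call",
  "params": {
    "name": "vertical_scrape",
    "arguments": {
      "name": "github_repo",
      "url": "https://github.com/0xMassi/webclaw"
    }
  }
}
```

A `name`/`url` mismatch (say, a Reddit URL passed to `github_repo`) returns the "URL mismatch" error described in the field docs rather than silently running the wrong extractor.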