diff --git a/CHANGELOG.md b/CHANGELOG.md
index ef2d2f2..4069d54 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -3,57 +3,6 @@
 All notable changes to webclaw are documented here.
 Format follows [Keep a Changelog](https://keepachangelog.com/).
 
-## [0.5.2] — 2026-04-22
-
-### Added
-- **`webclaw vertical <name> <url>` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`. URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1.
-
-- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick.
-
-- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set.
-
-### Changed
-- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded 10). Also `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`.
-
----
-
-## [0.5.1] — 2026-04-22
-
-### Added
-- **`webclaw_fetch::Fetcher` trait.** Vertical extractors now consume `&dyn Fetcher` instead of `&FetchClient` directly. The trait exposes three methods (`fetch`, `fetch_with_headers`, `cloud`) covering everything extractors need. Callers that already held a `FetchClient` keep working unchanged: `FetchClient` implements `Fetcher`, blanket impls cover `&T` and `Arc<T>`, so `&client` coerces to `&dyn Fetcher` automatically.
-
-  The motivation is the split between OSS (wreq-backed, in-process TLS fingerprinting) and the production API server at api.webclaw.io (which cannot use in-process fingerprinting per the architecture rule, and must delegate HTTP through the Go tls-sidecar). Before this trait, adding vertical routes to the production server would have required importing wreq into its dependency graph, violating the separation. Now the production server can provide its own `TlsSidecarFetcher` implementation and pass it to the same extractor dispatcher the OSS server uses.
-
-  Backwards compatible. No behavior change for CLI, MCP, or OSS self-host.
-
-### Changed
-- All 28 extractor `extract()` signatures migrated from `client: &FetchClient` to `client: &dyn Fetcher`. The dispatcher functions (`extractors::dispatch_by_url`, `extractors::dispatch_by_name`) and the cloud escalation helpers (`cloud::smart_fetch`, `cloud::smart_fetch_html`) follow the same change. Tests and call sites are unchanged because `&FetchClient` auto-coerces.
-
----
-
-## [0.5.0] — 2026-04-22
-
-### Added
-- **28 vertical extractors that return typed JSON instead of generic markdown.** New `webclaw_fetch::extractors` module with one extractor per site. Dev: reddit, hackernews, github_repo / github_pr / github_issue / github_release, crates_io, pypi, npm. AI/ML: huggingface_model, huggingface_dataset, arxiv, docker_hub. Writing: dev_to, stackoverflow, youtube_video. Social: linkedin_post, instagram_post, instagram_profile. Ecommerce: shopify_product, shopify_collection, ecommerce_product (generic Schema.org), woocommerce_product, amazon_product, ebay_listing, etsy_listing. Reviews: trustpilot_reviews, substack_post. Each extractor claims a URL pattern via a public `matches()` fn and returns a typed JSON payload with the fields callers actually want (title, price, author, rating, review count, etc.) rather than a markdown blob.
-- **`POST /v1/scrape/{vertical}` on `webclaw-server` for explicit vertical routing.** Picks the parser by name, validates the URL plausibly belongs to that vertical, returns the same shape as `POST /v1/scrape` but typed. 23 of 28 verticals also auto-dispatch from a plain `POST /v1/scrape` because their URL shapes are unique enough to claim safely; the remaining 5 (`shopify_product`, `shopify_collection`, `ecommerce_product`, `woocommerce_product`, `substack_post`) use patterns that non-target sites share, so callers opt in via the `{vertical}` route.
-- **`GET /v1/extractors` on `webclaw-server`.** Returns the full catalog as `{"extractors": [{"name": "...", "label": "...", "description": "...", "url_patterns": [...]}, ...]}` so clients can build tooling / autocomplete / user-facing docs off a live source.
-- **Antibot cloud-escalation for 5 ecommerce + reviews verticals.** Amazon, eBay, Etsy, Trustpilot, and Substack (as HTML fallback) go through `cloud::smart_fetch_html`: try local fetch first; on bot-protection detection (Cloudflare challenge, DataDome, AWS WAF "Verifying your connection", etc.) escalate to `api.webclaw.io/v1/scrape`. Without `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` the extractor returns a typed `CloudError::NotConfigured` with an actionable signup link. With a key set, escalation is automatic. Every extractor stamps a `data_source: "local" | "cloud"` field on the response so callers can tell which path ran.
-- **`cloud::synthesize_html` for cloud-bypassed extraction.** `api.webclaw.io/v1/scrape` deliberately does not return raw HTML; it returns a parsed bundle (`structured_data` JSON-LD blocks + `metadata` OG/meta tags + `markdown`). The new helper reassembles that bundle back into a minimal synthetic HTML doc (JSON-LD as `<script>` tags, metadata as OG `<meta>` tags, markdown in a `<pre>`) so existing local parsers run unchanged across both paths. No per-extractor code path branches are needed for "came from cloud" vs "came from local".
-- **Trustpilot 2025 schema parser.** Trustpilot replaced their single-Organization + aggregateRating shape with three separate JSON-LD blocks: a site-level Organization (Trustpilot itself), a Dataset with a csvw:Table `mainEntity` carrying the per-star distribution for the target business, and an aiSummary + aiSummaryReviews block with the AI-generated summary and recent reviews. The parser walks all three, skips the site-level Org, picks the Dataset by `about.@id` matching the target domain, parses each csvw:column for rating buckets, computes weighted-average rating + total from the distribution, extracts the aiSummary text, and returns recent reviews with author / country / date / rating / title / text / likes.
-- **OG-tag fallback in `ecommerce_product` for sites with no JSON-LD and sites with JSON-LD but empty offers.** Three paths now: `jsonld` (Schema.org Product with offers), `jsonld+og` (Product JSON-LD plus OG product tags filling in missing price), and `og_fallback` (no JSON-LD at all, build minimal payload from `og:title`, `og:image`, `og:description`, `product:price:amount`, `product:price:currency`, `product:availability`, `product:brand`). `has_og_product_signal()` gates the fallback on `og:type=product` or a price tag so blog posts don't get mis-classified as products.
-- **URL-slug title fallback in `etsy_listing` for delisted / blocked pages.** When Etsy serves a placeholder page (`"etsy.com"`, `"Etsy - Your place to buy..."`, `"This item is unavailable"`), humanise the URL slug (`/listing/123/personalized-stainless-steel-tumbler` becomes `"Personalized Stainless Steel Tumbler"`) so callers always get a meaningful title. Plus shop falls through `offers[].seller.name` then top-level `brand` because Etsy uses both schemas depending on listing age.
-- **Force-cloud-escalation in `amazon_product` when local HTML lacks Product JSON-LD.** Amazon A/B-tests JSON-LD presence. When local fetch succeeds but has no `Product` block and a cloud client is configured, the extractor force-escalates to the cloud which reliably surfaces title + description via its render engine. Added OG meta-tag fallback so the cloud's synthesized HTML output (OG tags only, no Amazon DOM IDs) still yields title / image / description.
-- **AWS WAF "Verifying your connection" detector in `cloud::is_bot_protected`.** Trustpilot serves a `~565` byte interstitial with an `interstitial-spinner` CSS class. The detector now fires on that pattern with a `< 10_000` byte size gate to avoid false positives on real articles that happen to mention the phrase.
-
-### Changed
-- **`webclaw-fetch::FetchClient` gained an optional `cloud` field** via `with_cloud(CloudClient)`. Extractors reach it through `client.cloud()` to decide whether to escalate. `webclaw-server::AppState` reads `WEBCLAW_CLOUD_API_KEY` (preferred) or falls back to `WEBCLAW_API_KEY` only when inbound auth is not configured (open mode).
-- **Consolidated `CloudClient` into `webclaw-fetch`.** Previously duplicated between `webclaw-mcp/src/cloud.rs` (302 LOC) and `webclaw-cli/src/cloud.rs` (80 LOC). Single canonical home with typed `CloudError` (`NotConfigured`, `Unauthorized`, `InsufficientPlan`, `RateLimited`, `ServerError`, `Network`, `ParseFailed`) that Display with actionable URLs; `From<CloudError> for String` bridge keeps pre-existing CLI / MCP call sites compiling unchanged during migration.
-
-### Tests
-- 215 unit tests passing in `webclaw-fetch` (100+ new, covering every extractor's matcher, URL parser, JSON-LD / OG fallback paths, and the cloud synthesis helper). `cargo clippy --workspace --release --no-deps` clean.
-
----
-
 ## [0.4.0] — 2026-04-22
 
 ### Added
diff --git a/CLAUDE.md b/CLAUDE.md
index fcd27da..eac2f9f 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -11,7 +11,7 @@ webclaw/
                       # + ExtractionOptions (include/exclude CSS selectors)
                       # + diff engine (change tracking)
                       # + brand extraction (DOM/CSS analysis)
-    webclaw-fetch/    # HTTP client via wreq (BoringSSL). Crawler. Sitemap discovery. Batch ops.
+    webclaw-fetch/    # HTTP client via primp. Crawler. Sitemap discovery. Batch ops.
                       # + proxy pool rotation (per-request)
                       # + PDF content-type detection
                       # + document parsing (DOCX, XLSX, CSV)
@@ -40,7 +40,7 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
 - `brand.rs` — Brand identity extraction from DOM structure and CSS
 
 ### Fetch Modules (`webclaw-fetch`)
-- `client.rs` — FetchClient with wreq BoringSSL TLS impersonation; implements the public `Fetcher` trait so callers (including server adapters) can swap in alternative implementations
+- `client.rs` — FetchClient with primp TLS impersonation
 - `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128)
 - `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
 - `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt)
@@ -76,10 +76,9 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
 ## Hard Rules
 
 - **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible.
-- **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally.
-- **No special RUSTFLAGS** — `.cargo/config.toml` is currently empty of build flags. Don't add any.
-- **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep.
-- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `TlsSidecarFetcher` that routes through the Go tls-sidecar instead of in-process wreq.
+- **primp requires `[patch.crates-io]`** for patched rustls/h2 forks at workspace level.
+- **RUSTFLAGS are set in `.cargo/config.toml`** — no need to pass manually.
+- **webclaw-llm uses plain reqwest** (NOT primp-patched). LLM APIs don't need TLS fingerprinting.
 - **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.
 
 ## Build & Test
diff --git a/Cargo.lock b/Cargo.lock
index ed0f4fa..0f5fc5c 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -3199,7 +3199,7 @@ dependencies = [
 
 [[package]]
 name = "webclaw-cli"
-version = "0.5.2"
+version = "0.4.0"
 dependencies = [
  "clap",
  "dotenvy",
@@ -3220,7 +3220,7 @@ dependencies = [
 
 [[package]]
 name = "webclaw-core"
-version = "0.5.2"
+version = "0.4.0"
 dependencies = [
  "ego-tree",
  "once_cell",
@@ -3238,16 +3238,13 @@ dependencies = [
 
 [[package]]
 name = "webclaw-fetch"
-version = "0.5.2"
+version = "0.4.0"
 dependencies = [
- "async-trait",
  "bytes",
  "calamine",
  "http",
  "quick-xml 0.37.5",
  "rand 0.8.5",
- "regex",
- "reqwest",
  "serde",
  "serde_json",
  "tempfile",
@@ -3263,7 +3260,7 @@ dependencies = [
 
 [[package]]
 name = "webclaw-llm"
-version = "0.5.2"
+version = "0.4.0"
 dependencies = [
  "async-trait",
  "reqwest",
@@ -3276,10 +3273,11 @@ dependencies = [
 
 [[package]]
 name = "webclaw-mcp"
-version = "0.5.2"
+version = "0.4.0"
 dependencies = [
  "dirs",
  "dotenvy",
+ "reqwest",
  "rmcp",
  "schemars",
  "serde",
@@ -3296,7 +3294,7 @@ dependencies = [
 
 [[package]]
 name = "webclaw-pdf"
-version = "0.5.2"
+version = "0.4.0"
 dependencies = [
  "pdf-extract",
  "thiserror",
@@ -3305,7 +3303,7 @@ dependencies = [
 
 [[package]]
 name = "webclaw-server"
-version = "0.5.2"
+version = "0.4.0"
 dependencies = [
  "anyhow",
  "axum",
diff --git a/Cargo.toml b/Cargo.toml
index a286972..e17d843 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -3,7 +3,7 @@ resolver = "2"
 members = ["crates/*"]
 
 [workspace.package]
-version = "0.5.2"
+version = "0.4.0"
 edition = "2024"
 license = "AGPL-3.0"
 repository = "https://github.com/0xMassi/webclaw"
diff --git a/crates/webclaw-cli/src/cloud.rs b/crates/webclaw-cli/src/cloud.rs
new file mode 100644
index 0000000..464eb4c
--- /dev/null
+++ b/crates/webclaw-cli/src/cloud.rs
@@ -0,0 +1,80 @@
+/// Cloud API client for automatic fallback when local extraction fails.
+///
+/// When WEBCLAW_API_KEY is set (or --api-key is passed), the CLI can fall back
+/// to api.webclaw.io for bot-protected or JS-rendered sites. With the --cloud flag,
+/// all requests go through the cloud API directly.
+///
+/// NOTE: The canonical, full-featured cloud module lives in webclaw-mcp/src/cloud.rs
+/// (smart_fetch, bot detection, JS rendering checks). This is the minimal subset
+/// needed by the CLI. It is kept separate because adding webclaw-mcp as a
+/// dependency would pull in rmcp.
+use serde_json::{Value, json};
+
+const API_BASE: &str = "https://api.webclaw.io/v1";
+
+pub struct CloudClient {
+    api_key: String,
+    http: reqwest::Client,
+}
+
+impl CloudClient {
+    /// Create from explicit key or WEBCLAW_API_KEY env var.
+    pub fn new(explicit_key: Option<&str>) -> Option<Self> {
+        let key = explicit_key
+            .map(String::from)
+            .or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
+            .filter(|k| !k.is_empty())?;
+
+        Some(Self {
+            api_key: key,
+            http: reqwest::Client::new(),
+        })
+    }
+
+    /// Scrape via the cloud API.
+    pub async fn scrape(
+        &self,
+        url: &str,
+        formats: &[&str],
+        include_selectors: &[String],
+        exclude_selectors: &[String],
+        only_main_content: bool,
+    ) -> Result<Value, String> {
+        let mut body = json!({
+            "url": url,
+            "formats": formats,
+        });
+        if only_main_content {
+            body["only_main_content"] = json!(true);
+        }
+        if !include_selectors.is_empty() {
+            body["include_selectors"] = json!(include_selectors);
+        }
+        if !exclude_selectors.is_empty() {
+            body["exclude_selectors"] = json!(exclude_selectors);
+        }
+        self.post("scrape", body).await
+    }
+
+    async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> {
+        let resp = self
+            .http
+            .post(format!("{API_BASE}/{endpoint}"))
+            .header("Authorization", format!("Bearer {}", self.api_key))
+            .json(&body)
+            .timeout(std::time::Duration::from_secs(120))
+            .send()
+            .await
+            .map_err(|e| format!("cloud API request failed: {e}"))?;
+
+        let status = resp.status();
+        if !status.is_success() {
+            let text = resp.text().await.unwrap_or_default();
+            return Err(format!("cloud API error {status}: {text}"));
+        }
+
+        resp.json::<Value>()
+            .await
+            .map_err(|e| format!("cloud API response parse failed: {e}"))
+    }
+}
diff --git a/crates/webclaw-cli/src/main.rs b/crates/webclaw-cli/src/main.rs
index a12cae1..7cf765e 100644
--- a/crates/webclaw-cli/src/main.rs
+++ b/crates/webclaw-cli/src/main.rs
@@ -1,6 +1,7 @@
 /// CLI entry point -- wires webclaw-core and webclaw-fetch into a single command.
 /// All extraction and fetching logic lives in sibling crates; this is pure plumbing.
 mod bench;
+mod cloud;
 
 use std::io::{self, Read as _};
 use std::path::{Path, PathBuf};
@@ -308,34 +309,6 @@ enum Commands {
         #[arg(long)]
         facts: Option<PathBuf>,
     },
-
-    /// List all vertical extractors in the catalog.
-    ///
-    /// Each entry has a stable `name` (usable with `webclaw vertical <name>`),
-    /// a human-friendly label, a one-line description, and the URL
-    /// patterns it claims. The same data is served by `/v1/extractors`
-    /// when running the REST API.
-    Extractors {
-        /// Emit JSON instead of a human-friendly table.
-        #[arg(long)]
-        json: bool,
-    },
-
-    /// Run a vertical extractor by name. Returns typed JSON with fields
-    /// specific to the target site (title, price, author, rating, etc.)
-    /// rather than generic markdown.
-    ///
-    /// Use `webclaw extractors` to see the full list. Example:
-    /// `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/`.
-    Vertical {
-        /// Vertical name (e.g. `reddit`, `github_repo`, `trustpilot_reviews`).
-        name: String,
-        /// URL to extract.
-        url: String,
-        /// Emit compact JSON (single line). Default is pretty-printed.
-        #[arg(long)]
-        raw: bool,
-    },
 }
 
 #[derive(Clone, ValueEnum)]
@@ -701,7 +674,7 @@ async fn fetch_and_extract(cli: &Cli) -> Result<FetchOutput, String> {
     let url = normalize_url(raw_url);
     let url = url.as_str();
 
-    let cloud_client = webclaw_fetch::cloud::CloudClient::new(cli.api_key.as_deref());
+    let cloud_client = cloud::CloudClient::new(cli.api_key.as_deref());
 
     // --cloud: skip local, go straight to cloud API
     if cli.cloud {
@@ -2316,83 +2289,6 @@ async fn main() {
                 }
                 return;
             }
-            Commands::Extractors { json } => {
-                let entries = webclaw_fetch::extractors::list();
-                if *json {
-                    // Serialize with serde_json. ExtractorInfo derives
-                    // Serialize so this is a one-liner.
-                    match serde_json::to_string_pretty(&entries) {
-                        Ok(s) => println!("{s}"),
-                        Err(e) => {
-                            eprintln!("error: failed to serialise catalog: {e}");
-                            process::exit(1);
-                        }
-                    }
-                } else {
-                    // Human-friendly table: NAME + LABEL + one URL
-                    // pattern sample. Keeps the output scannable on a
-                    // narrow terminal.
-                    println!("{} vertical extractors available:\n", entries.len());
-                    let name_w = entries.iter().map(|e| e.name.len()).max().unwrap_or(0);
-                    let label_w = entries.iter().map(|e| e.label.len()).max().unwrap_or(0);
-                    for e in &entries {
-                        let pattern_sample = e.url_patterns.first().copied().unwrap_or("");
-                        println!(
-                            "  {:<nw$}  {:<lw$}  {}",
-                            e.name,
-                            e.label,
-                            pattern_sample,
-                            nw = name_w,
-                            lw = label_w,
-                        );
-                    }
-                    println!("\nRun one: webclaw vertical <name> <url>");
-                }
-                return;
-            }
-            Commands::Vertical { name, url, raw } => {
-                // Build a FetchClient with cloud fallback attached when
-                // WEBCLAW_API_KEY is set. Antibot-gated verticals
-                // (amazon, ebay, etsy, trustpilot) need this to escalate
-                // on bot protection.
-                let fetch_cfg = webclaw_fetch::FetchConfig {
-                    browser: webclaw_fetch::BrowserProfile::Firefox,
-                    ..webclaw_fetch::FetchConfig::default()
-                };
-                let mut client = match webclaw_fetch::FetchClient::new(fetch_cfg) {
-                    Ok(c) => c,
-                    Err(e) => {
-                        eprintln!("error: failed to build fetch client: {e}");
-                        process::exit(1);
-                    }
-                };
-                if let Some(cloud) = webclaw_fetch::cloud::CloudClient::from_env() {
-                    client = client.with_cloud(cloud);
-                }
-                match webclaw_fetch::extractors::dispatch_by_name(&client, name, url).await {
-                    Ok(data) => {
-                        let rendered = if *raw {
-                            serde_json::to_string(&data)
-                        } else {
-                            serde_json::to_string_pretty(&data)
-                        };
-                        match rendered {
-                            Ok(s) => println!("{s}"),
-                            Err(e) => {
-                                eprintln!("error: JSON encode failed: {e}");
-                                process::exit(1);
-                            }
-                        }
-                    }
-                    Err(e) => {
-                        // UrlMismatch / UnknownVertical / Fetch all get
-                        // Display impls with actionable messages.
-                        eprintln!("error: {e}");
-                        process::exit(1);
-                    }
-                }
-                return;
-            }
         }
     }
 
diff --git a/crates/webclaw-fetch/Cargo.toml b/crates/webclaw-fetch/Cargo.toml
index a47ba7e..0b22d12 100644
--- a/crates/webclaw-fetch/Cargo.toml
+++ b/crates/webclaw-fetch/Cargo.toml
@@ -12,15 +12,12 @@ serde = { workspace = true }
 thiserror = { workspace = true }
 tracing = { workspace = true }
 tokio = { workspace = true }
-async-trait = "0.1"
 wreq = { version = "6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] }
 http = "1"
 bytes = "1"
 url = "2"
 rand = "0.8"
 quick-xml = { version = "0.37", features = ["serde"] }
-regex = "1"
-reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
 serde_json.workspace = true
 calamine = "0.34"
 zip = "2"
diff --git a/crates/webclaw-fetch/src/client.rs b/crates/webclaw-fetch/src/client.rs
index 8fd5ff5..cc6378a 100644
--- a/crates/webclaw-fetch/src/client.rs
+++ b/crates/webclaw-fetch/src/client.rs
@@ -177,11 +177,6 @@ enum ClientPool {
 pub struct FetchClient {
     pool: ClientPool,
     pdf_mode: PdfMode,
-    /// Optional cloud-fallback client. Extractors that need to
-    /// escalate past bot protection call `client.cloud()` to get this
-    /// out. Stored as `Arc` so cloning a `FetchClient` (common in
-    /// axum state) doesn't clone the underlying reqwest pool.
-    cloud: Option<std::sync::Arc<crate::cloud::CloudClient>>,
 }
 
 impl FetchClient {
@@ -230,35 +225,7 @@ impl FetchClient {
             ClientPool::Rotating { clients }
         };
 
-        Ok(Self {
-            pool,
-            pdf_mode,
-            cloud: None,
-        })
-    }
-
-    /// Attach a cloud-fallback client. Returns `self` so it composes in
-    /// a builder-ish way:
-    ///
-    /// ```ignore
-    /// let client = FetchClient::new(config)?
-    ///     .with_cloud(CloudClient::from_env()?);
-    /// ```
-    ///
-    /// Extractors that can escalate past bot protection will call
-    /// `client.cloud()` internally. Sets the field regardless of
-    /// whether `cloud` is configured to bypass anything specific —
-    /// attachment is cheap (just wraps in `Arc`).
-    pub fn with_cloud(mut self, cloud: crate::cloud::CloudClient) -> Self {
-        self.cloud = Some(std::sync::Arc::new(cloud));
-        self
-    }
-
-    /// Optional cloud-fallback client, if one was attached via
-    /// [`Self::with_cloud`]. Extractors that handle antibot sites
-    /// pass this into `cloud::smart_fetch_html`.
-    pub fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
-        self.cloud.as_deref()
+        Ok(Self { pool, pdf_mode })
     }
 
     /// Fetch a URL and return the raw HTML + response metadata.
@@ -312,85 +279,14 @@ impl FetchClient {
 
     /// Single fetch attempt.
     async fn fetch_once(&self, url: &str) -> Result<FetchResult, FetchError> {
-        self.fetch_once_with_headers(url, &[]).await
-    }
-
-    /// Single fetch attempt with optional per-request headers appended
-    /// after the profile defaults. Used by extractors that need to
-    /// satisfy site-specific headers (e.g. `x-ig-app-id` for Instagram's
-    /// internal API).
-    async fn fetch_once_with_headers(
-        &self,
-        url: &str,
-        extra: &[(&str, &str)],
-    ) -> Result<FetchResult, FetchError> {
         let start = Instant::now();
         let client = self.pick_client(url);
 
-        let mut req = client.get(url);
-        for (k, v) in extra {
-            req = req.header(*k, *v);
-        }
-        let resp = req.send().await?;
+        let resp = client.get(url).send().await?;
         let response = Response::from_wreq(resp).await?;
         response_to_result(response, start)
     }
 
-    /// Fetch a URL with extra per-request headers appended after the
-    /// browser-profile defaults. Same retry semantics as `fetch`.
-    ///
-    /// Use this when an upstream API requires a header the global
-    /// `FetchConfig.headers` shouldn't carry to other hosts (Instagram's
-    /// `x-ig-app-id`, GitHub's `Authorization` once we wire `GITHUB_TOKEN`,
-    /// Reddit's compliant UA when we add OAuth, etc.).
-    #[instrument(skip(self, extra), fields(url = %url, extra_count = extra.len()))]
-    pub async fn fetch_with_headers(
-        &self,
-        url: &str,
-        extra: &[(&str, &str)],
-    ) -> Result<FetchResult, FetchError> {
-        let delays = [Duration::ZERO, Duration::from_secs(1)];
-        let mut last_err = None;
-
-        for (attempt, delay) in delays.iter().enumerate() {
-            if attempt > 0 {
-                tokio::time::sleep(*delay).await;
-            }
-            match self.fetch_once_with_headers(url, extra).await {
-                Ok(result) => {
-                    if is_retryable_status(result.status) && attempt < delays.len() - 1 {
-                        warn!(
-                            url,
-                            status = result.status,
-                            attempt = attempt + 1,
-                            "retryable status, will retry"
-                        );
-                        last_err = Some(FetchError::Build(format!("HTTP {}", result.status)));
-                        continue;
-                    }
-                    if attempt > 0 {
-                        debug!(url, attempt = attempt + 1, "retry succeeded");
-                    }
-                    return Ok(result);
-                }
-                Err(e) => {
-                    if !is_retryable_error(&e) || attempt == delays.len() - 1 {
-                        return Err(e);
-                    }
-                    warn!(
-                        url,
-                        error = %e,
-                        attempt = attempt + 1,
-                        "transient error, will retry"
-                    );
-                    last_err = Some(e);
-                }
-            }
-        }
-
-        Err(last_err.unwrap_or_else(|| FetchError::Build("all retries exhausted".into())))
-    }
-
     /// Fetch a URL then extract structured content.
     #[instrument(skip(self), fields(url = %url))]
     pub async fn fetch_and_extract(
@@ -599,36 +495,6 @@ impl FetchClient {
     }
 }
 
-// ---------------------------------------------------------------------------
-// Fetcher trait implementation
-//
-// Vertical extractors consume the [`crate::fetcher::Fetcher`] trait
-// rather than `FetchClient` directly, which is what lets the production
-// API server swap in a tls-sidecar-backed implementation without
-// pulling wreq into its dependency graph. For everyone else (CLI, MCP,
-// self-hosted OSS server) this impl means "pass the FetchClient you
-// already have; nothing changes".
-// ---------------------------------------------------------------------------
-
-#[async_trait::async_trait]
-impl crate::fetcher::Fetcher for FetchClient {
-    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
-        FetchClient::fetch(self, url).await
-    }
-
-    async fn fetch_with_headers(
-        &self,
-        url: &str,
-        headers: &[(&str, &str)],
-    ) -> Result<FetchResult, FetchError> {
-        FetchClient::fetch_with_headers(self, url, headers).await
-    }
-
-    fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
-        FetchClient::cloud(self)
-    }
-}
-
 /// Collect the browser variants to use based on the browser profile.
 fn collect_variants(profile: &BrowserProfile) -> Vec<BrowserVariant> {
     match profile {
diff --git a/crates/webclaw-fetch/src/cloud.rs b/crates/webclaw-fetch/src/cloud.rs
deleted file mode 100644
index 3bad383..0000000
--- a/crates/webclaw-fetch/src/cloud.rs
+++ /dev/null
@@ -1,853 +0,0 @@
-//! Cloud API fallback client for api.webclaw.io.
-//!
-//! When local fetch hits bot protection or a JS-only SPA, callers can
-//! fall back to the hosted API which runs the full antibot / CDP
-//! pipeline. This module is the shared home for that flow: previously
-//! duplicated between `webclaw-mcp/src/cloud.rs` and
-//! `webclaw-cli/src/cloud.rs`.
-//!
-//! ## Architecture
-//!
-//! - [`CloudClient`] — thin reqwest wrapper around the api.webclaw.io
-//!   REST surface. Typed errors for the four HTTP failures callers act
-//!   on differently (401 / 402 / 429 / other) plus network + parse.
-//! - [`is_bot_protected`] / [`needs_js_rendering`] — pure detectors on
-//!   response bodies. The detection patterns are public (CF / DataDome
-//!   challenge-page signatures) so these live in OSS without leaking
-//!   any moat.
-//! - [`smart_fetch`] — try-local-then-escalate flow returning an
-//!   [`ExtractionResult`] or raw cloud JSON. Kept on the original
-//!   `Result<_, String>` signature so the existing MCP / CLI call
-//!   sites work unchanged.
-//! - [`smart_fetch_html`] — new convenience for the vertical-extractor
-//!   pattern: just give me antibot-bypassed HTML so I can run my own
-//!   parser on it. Returns the typed [`CloudError`] so extractors can
-//!   emit precise "upgrade your plan" / "invalid key" messages.
-//!
-//! ## Cloud response shape and [`synthesize_html`]
-//!
-//! `api.webclaw.io/v1/scrape` deliberately does **not** return a
-//! `html` field even when `formats=["html"]` is requested. By design
-//! the cloud API returns a parsed bundle:
-//!
-//! ```text
-//! {
-//!   "url":             "https://...",
-//!   "metadata":        { title, description, image, site_name, ... },  // OG / meta tags
-//!   "structured_data": [ { "@type": "...", ... }, ... ],               // JSON-LD blocks
-//!   "markdown":        "# Page Title\n\n...",                          // cleaned markdown
-//!   "antibot":         { engine, path, user_agent },                   // bypass telemetry
-//!   "cache":           { status, age_seconds }
-//! }
-//! ```
-//!
-//! [`CloudClient::fetch_html`] reassembles that bundle back into a
-//! minimal synthetic HTML document so the existing local extractor
-//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
-//! cloud output. Each `structured_data` entry becomes a
-//! `<script type="application/ld+json">` tag; each `metadata` field
-//! becomes a `<meta property="og:...">` tag; `markdown` lands in a
-//! `<pre>` inside the body. Callers that walk Schema.org blocks see
-//! exactly what they'd see on a real live page.
-//!
-//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
-//! won't hit on the synthesised HTML — those IDs only exist on live
-//! Amazon pages. Extractors that need DOM regex keep OG meta tag
-//! fallbacks for that reason.
-//!
-//! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
-//! signup when a site is blocked; nothing fails silently. Cloud users
-//! get the escalation for free.
-
-use std::time::Duration;
-
-use http::HeaderMap;
-use serde_json::{Value, json};
-use thiserror::Error;
-use tracing::{debug, info, warn};
-
-// Client type isn't needed here anymore now that smart_fetch* takes
-// `&dyn Fetcher`. Kept as a comment for historical context: this
-// module used to import FetchClient directly before v0.5.1.
-
-// ---------------------------------------------------------------------------
-// URLs + defaults — keep in one place so "change the signup link" is a
-// single-commit edit.
-// ---------------------------------------------------------------------------
-
-const API_BASE_DEFAULT: &str = "https://api.webclaw.io/v1";
-const DEFAULT_TIMEOUT_SECS: u64 = 120;
-
-const SIGNUP_URL: &str = "https://webclaw.io/signup";
-const PRICING_URL: &str = "https://webclaw.io/pricing";
-const KEYS_URL: &str = "https://webclaw.io/dashboard/api-keys";
-
-// ---------------------------------------------------------------------------
-// Errors
-// ---------------------------------------------------------------------------
-
-/// Structured cloud-fallback error. Variants correspond to the HTTP
-/// outcomes callers act on differently — a 401 needs a different UX
-/// than a 402 which needs a different UX than a network blip.
-///
-/// Display messages end with an actionable URL so API consumers can
-/// surface them to users verbatim.
-#[derive(Debug, Error)]
-pub enum CloudError {
-    /// No `WEBCLAW_API_KEY` configured. Returned by [`smart_fetch_html`]
-    /// and friends when they hit bot protection but have no client to
-    /// escalate to.
-    #[error(
-        "this site is behind antibot protection. \
-         Set WEBCLAW_API_KEY to unlock automatic cloud bypass. \
-         Free tier: {SIGNUP_URL}"
-    )]
-    NotConfigured,
-
-    /// HTTP 401 — the key is present but rejected.
-    #[error(
-        "WEBCLAW_API_KEY rejected (HTTP 401). \
-         Check or regenerate your key at {KEYS_URL}"
-    )]
-    Unauthorized,
-
-    /// HTTP 402 — the key is valid but the plan doesn't cover the call.
-    #[error(
-        "your plan doesn't include this endpoint / site (HTTP 402). \
-         Upgrade at {PRICING_URL}"
-    )]
-    InsufficientPlan,
-
-    /// HTTP 429 — rate limit.
-    #[error(
-        "cloud API rate limit reached (HTTP 429). \
-         Wait a moment or upgrade at {PRICING_URL}"
-    )]
-    RateLimited,
-
-    /// HTTP 4xx / 5xx the caller probably can't do anything specific
-    /// about. Body is truncated to a sensible length for logs.
-    #[error("cloud API returned HTTP {status}: {body}")]
-    ServerError { status: u16, body: String },
-
-    #[error("cloud request failed: {0}")]
-    Network(String),
-
-    #[error("cloud response parse failed: {0}")]
-    ParseFailed(String),
-}
-
-impl CloudError {
-    /// Build from a non-success HTTP response, routing well-known
-    /// statuses to dedicated variants.
-    fn from_status_and_body(status: u16, body: String) -> Self {
-        match status {
-            401 => Self::Unauthorized,
-            402 => Self::InsufficientPlan,
-            429 => Self::RateLimited,
-            _ => Self::ServerError {
-                status,
-                body: truncate(&body, 500).to_string(),
-            },
-        }
-    }
-}
-
-impl From<reqwest::Error> for CloudError {
-    fn from(e: reqwest::Error) -> Self {
-        Self::Network(e.to_string())
-    }
-}
-
-/// Backwards-compatibility bridge: a lot of pre-existing MCP / CLI call
-/// sites use `.await?` inside functions returning `Result<_, String>`.
-/// Having this `From` impl means those sites keep compiling while we
-/// migrate them to the typed error over time.
-impl From<CloudError> for String {
-    fn from(e: CloudError) -> Self {
-        e.to_string()
-    }
-}
-
-fn truncate(text: &str, max: usize) -> &str {
-    match text.char_indices().nth(max) {
-        Some((byte_pos, _)) => &text[..byte_pos],
-        None => text,
-    }
-}
-
-// ---------------------------------------------------------------------------
-// CloudClient
-// ---------------------------------------------------------------------------
-
-/// Thin reqwest client around api.webclaw.io. Cloneable cheaply — the
-/// inner `reqwest::Client` already refcounts its connection pool.
-#[derive(Clone)]
-pub struct CloudClient {
-    api_key: String,
-    base_url: String,
-    http: reqwest::Client,
-}
-
-impl CloudClient {
-    /// Build from an explicit key (e.g. a `--api-key` CLI flag) or fall
-    /// back to the `WEBCLAW_API_KEY` env var. Returns `None` when
-    /// neither is set / both are empty.
-    ///
-    /// This is the function call sites should use by default — it's
-    /// what both the CLI and MCP want.
-    pub fn new(explicit_key: Option<&str>) -> Option<Self> {
-        explicit_key
-            .map(String::from)
-            .or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
-            .filter(|k| !k.trim().is_empty())
-            .map(Self::with_key)
-    }
-
-    /// Build from `WEBCLAW_API_KEY` env only. Thin wrapper kept for
-    /// readability at call sites that never accept a flag.
-    pub fn from_env() -> Option<Self> {
-        Self::new(None)
-    }
-
-    /// Build with an explicit key. Useful when the caller already has
-    /// a key from somewhere other than env or a flag (e.g. loaded from
-    /// config).
-    pub fn with_key(api_key: impl Into<String>) -> Self {
-        Self::with_key_and_base(api_key, API_BASE_DEFAULT)
-    }
-
-    /// Build with an explicit key and base URL. Used by integration
-    /// tests and staging deployments.
-    pub fn with_key_and_base(api_key: impl Into<String>, base_url: impl Into<String>) -> Self {
-        let http = reqwest::Client::builder()
-            .timeout(Duration::from_secs(DEFAULT_TIMEOUT_SECS))
-            .build()
-            .expect("reqwest client builder failed with default settings");
-        Self {
-            api_key: api_key.into(),
-            base_url: base_url.into().trim_end_matches('/').to_string(),
-            http,
-        }
-    }
-
-    pub fn base_url(&self) -> &str {
-        &self.base_url
-    }
-
-    /// Generic POST. Endpoint may be `"scrape"` or `"/scrape"` — we
-    /// normalise the slash.
-    pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, CloudError> {
-        let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
-        let resp = self
-            .http
-            .post(&url)
-            .header("Authorization", format!("Bearer {}", self.api_key))
-            .json(&body)
-            .send()
-            .await?;
-        parse_cloud_response(resp).await
-    }
-
-    /// Generic GET.
-    pub async fn get(&self, endpoint: &str) -> Result<Value, CloudError> {
-        let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
-        let resp = self
-            .http
-            .get(&url)
-            .header("Authorization", format!("Bearer {}", self.api_key))
-            .send()
-            .await?;
-        parse_cloud_response(resp).await
-    }
-
-    /// `POST /v1/scrape` with the caller's extraction options. This is
-    /// the public "do everything" surface: the cloud side handles
-    /// fetch + antibot + JS render + extraction + formatting.
-    pub async fn scrape(
-        &self,
-        url: &str,
-        formats: &[&str],
-        include_selectors: &[String],
-        exclude_selectors: &[String],
-        only_main_content: bool,
-    ) -> Result<Value, CloudError> {
-        let mut body = json!({ "url": url, "formats": formats });
-        if only_main_content {
-            body["only_main_content"] = json!(true);
-        }
-        if !include_selectors.is_empty() {
-            body["include_selectors"] = json!(include_selectors);
-        }
-        if !exclude_selectors.is_empty() {
-            body["exclude_selectors"] = json!(exclude_selectors);
-        }
-        self.post("scrape", body).await
-    }
-
-    /// Get antibot-bypassed page data back as a synthetic HTML string.
-    ///
-    /// `api.webclaw.io/v1/scrape` intentionally does not return raw
-    /// HTML: it returns pre-parsed `structured_data` (JSON-LD blocks)
-    /// plus `metadata` (title, description, OG tags, image) plus a
-    /// `markdown` body. We reassemble those into a minimal HTML doc
-    /// that looks enough like the real page for our local extractor
-    /// parsers to run unchanged: each JSON-LD block gets emitted as a
-    /// `<script type="application/ld+json">` tag, metadata gets
-    /// emitted as OG `<meta>` tags, and the markdown lands in the
-    /// body. Extractors that walk JSON-LD (ecommerce_product,
-    /// trustpilot_reviews, ebay_listing, etsy_listing, amazon_product)
-    /// see exactly the same shapes they'd see from a live HTML fetch.
-    pub async fn fetch_html(&self, url: &str) -> Result<String, CloudError> {
-        let resp = self.scrape(url, &["markdown"], &[], &[], false).await?;
-        Ok(synthesize_html(&resp))
-    }
-}
-
-/// Reassemble a minimal HTML document from a cloud `/v1/scrape`
-/// response so existing HTML-based extractor parsers can run against
-/// cloud output without a separate code path.
-fn synthesize_html(resp: &Value) -> String {
-    let mut out = String::with_capacity(8_192);
-    out.push_str("<html><head>\n");
-
-    // Metadata → OG meta tags. Keep keys stable with what local
-    // extractors read: og:title, og:description, og:image, og:site_name.
-    if let Some(meta) = resp.get("metadata").and_then(|m| m.as_object()) {
-        for (src_key, og_key) in [
-            ("title", "title"),
-            ("description", "description"),
-            ("image", "image"),
-            ("site_name", "site_name"),
-        ] {
-            if let Some(val) = meta.get(src_key).and_then(|v| v.as_str())
-                && !val.is_empty()
-            {
-                out.push_str(&format!(
-                    "<meta property=\"og:{og_key}\" content=\"{}\">\n",
-                    html_escape_attr(val)
-                ));
-            }
-        }
-    }
-
-    // Structured data blocks → <script type="application/ld+json">.
-    // Serialise losslessly so extract_json_ld's parser gets the same
-    // shape it would get from a real page.
-    if let Some(blocks) = resp.get("structured_data").and_then(|v| v.as_array()) {
-        for block in blocks {
-            if let Ok(s) = serde_json::to_string(block) {
-                out.push_str("<script type=\"application/ld+json\">");
-                out.push_str(&s);
-                out.push_str("</script>\n");
-            }
-        }
-    }
-
-    out.push_str("</head><body>\n");
-
-    // Markdown body → escaped plaintext in a <pre>. Extractors that regex
-    // over element IDs won't match here, since the cloud response carries
-    // no DOM; they rely on their OG fallbacks. OK to keep minimal.
-    if let Some(md) = resp.get("markdown").and_then(|v| v.as_str()) {
-        out.push_str("<pre>");
-        out.push_str(&html_escape_text(md));
-        out.push_str("</pre>\n");
-    }
-
-    out.push_str("</body></html>");
-    out
-}
-
-fn html_escape_attr(s: &str) -> String {
-    s.replace('&', "&amp;")
-        .replace('"', "&quot;")
-        .replace('<', "&lt;")
-        .replace('>', "&gt;")
-}
-
-fn html_escape_text(s: &str) -> String {
-    s.replace('&', "&amp;")
-        .replace('<', "&lt;")
-        .replace('>', "&gt;")
-}
-
-async fn parse_cloud_response(resp: reqwest::Response) -> Result<Value, CloudError> {
-    let status = resp.status();
-    if status.is_success() {
-        return resp
-            .json()
-            .await
-            .map_err(|e| CloudError::ParseFailed(e.to_string()));
-    }
-    let body = resp.text().await.unwrap_or_default();
-    Err(CloudError::from_status_and_body(status.as_u16(), body))
-}
-
-// ---------------------------------------------------------------------------
-// Detection
-// ---------------------------------------------------------------------------
-
-/// True when a fetched response body is actually a bot-protection
-/// challenge page rather than the content the caller asked for.
-///
-/// Conservative — only fires on patterns that indicate the *entire*
-/// page is a challenge, not embedded CAPTCHAs on a real content page.
-pub fn is_bot_protected(html: &str, headers: &HeaderMap) -> bool {
-    let html_lower = html.to_lowercase();
-
-    // Cloudflare challenge page.
-    if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
-        return true;
-    }
-
-    // Cloudflare "Just a moment" / "Checking your browser" interstitial.
-    if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
-        && html_lower.contains("cf-spinner")
-    {
-        return true;
-    }
-
-    // Cloudflare Turnstile. Only counts when the page is small —
-    // legitimate pages embed Turnstile for signup forms etc.
-    if (html_lower.contains("cf-turnstile")
-        || html_lower.contains("challenges.cloudflare.com/turnstile"))
-        && html.len() < 100_000
-    {
-        return true;
-    }
-
-    // DataDome.
-    if html_lower.contains("geo.captcha-delivery.com")
-        || html_lower.contains("captcha-delivery.com/captcha")
-    {
-        return true;
-    }
-
-    // AWS WAF.
-    if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
-        return true;
-    }
-
-    // AWS WAF "Verifying your connection" interstitial (used by Trustpilot).
-    // Distinct from the captcha-branded path above: the challenge page is
-    // a tiny HTML shell with an `interstitial-spinner` div and no content.
-    // Gating on html.len() keeps false-positives off long pages that
-    // happen to mention the phrase in an unrelated context.
-    if html_lower.contains("interstitial-spinner")
-        && html_lower.contains("verifying your connection")
-        && html.len() < 10_000
-    {
-        return true;
-    }
-
-    // hCaptcha *blocking* page (not just an embedded widget).
-    if html_lower.contains("hcaptcha.com")
-        && html_lower.contains("h-captcha")
-        && html.len() < 50_000
-    {
-        return true;
-    }
-
-    // Cloudflare via response headers + challenge body.
-    let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some();
-    if has_cf_headers
-        && (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
-    {
-        return true;
-    }
-
-    false
-}
-
-/// True when a page likely needs JS rendering: a script-bearing page with
-/// almost no text, or a large low-text page with an SPA framework marker.
-pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
-    let has_scripts = html.contains("<script");
-
-    // Tier 1: almost no extractable text from a large-ish page.
-    if word_count < 50 && html.len() > 5_000 && has_scripts {
-        return true;
-    }
-
-    // Tier 2: SPA framework markers + low content-to-HTML ratio.
-    if word_count < 800 && html.len() > 50_000 && has_scripts {
-        let html_lower = html.to_lowercase();
-        let has_spa_marker = html_lower.contains("react-app")
-            || html_lower.contains("id=\"__next\"")
-            || html_lower.contains("id=\"root\"")
-            || html_lower.contains("id=\"app\"")
-            || html_lower.contains("__next_data__")
-            || html_lower.contains("nuxt")
-            || html_lower.contains("ng-app");
-        if has_spa_marker {
-            return true;
-        }
-    }
-
-    false
-}
-
-// ---------------------------------------------------------------------------
-// Smart-fetch: classic flow for MCP / CLI (returns either an extraction
-// or raw cloud JSON)
-// ---------------------------------------------------------------------------
-
-/// Result of [`smart_fetch`]: either a local extraction or the raw
-/// cloud API response when we escalated.
-pub enum SmartFetchResult {
-    Local(Box<webclaw_core::ExtractionResult>),
-    Cloud(Value),
-}
-
-/// Try local fetch + extract first. On bot protection or detected
-/// JS-render, fall back to `cloud.scrape(...)` with the caller's
-/// formats. Returns `Err(String)` so existing call sites that expect
-/// stringified errors keep compiling.
-///
-/// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed
-/// [`CloudError`] so you can render precise UX.
-pub async fn smart_fetch(
-    client: &dyn crate::fetcher::Fetcher,
-    cloud: Option<&CloudClient>,
-    url: &str,
-    include_selectors: &[String],
-    exclude_selectors: &[String],
-    only_main_content: bool,
-    formats: &[&str],
-) -> Result<SmartFetchResult, String> {
-    let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url))
-        .await
-        .map_err(|_| format!("Fetch timed out after 30s for {url}"))?
-        .map_err(|e| format!("Fetch failed: {e}"))?;
-
-    if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
-        info!(url, "bot protection detected, falling back to cloud API");
-        return cloud_scrape_fallback(
-            cloud,
-            url,
-            include_selectors,
-            exclude_selectors,
-            only_main_content,
-            formats,
-        )
-        .await;
-    }
-
-    let options = webclaw_core::ExtractionOptions {
-        include_selectors: include_selectors.to_vec(),
-        exclude_selectors: exclude_selectors.to_vec(),
-        only_main_content,
-        include_raw_html: false,
-    };
-    let extraction =
-        webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
-            .map_err(|e| format!("Extraction failed: {e}"))?;
-
-    if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
-        info!(
-            url,
-            word_count = extraction.metadata.word_count,
-            html_len = fetch_result.html.len(),
-            "JS-rendered page detected, falling back to cloud API"
-        );
-        return cloud_scrape_fallback(
-            cloud,
-            url,
-            include_selectors,
-            exclude_selectors,
-            only_main_content,
-            formats,
-        )
-        .await;
-    }
-
-    Ok(SmartFetchResult::Local(Box::new(extraction)))
-}
-
-async fn cloud_scrape_fallback(
-    cloud: Option<&CloudClient>,
-    url: &str,
-    include_selectors: &[String],
-    exclude_selectors: &[String],
-    only_main_content: bool,
-    formats: &[&str],
-) -> Result<SmartFetchResult, String> {
-    let Some(c) = cloud else {
-        return Err(CloudError::NotConfigured.to_string());
-    };
-    let resp = c
-        .scrape(
-            url,
-            formats,
-            include_selectors,
-            exclude_selectors,
-            only_main_content,
-        )
-        .await
-        .map_err(|e| e.to_string())?;
-    info!(url, "cloud API fallback successful");
-    Ok(SmartFetchResult::Cloud(resp))
-}
-
-// ---------------------------------------------------------------------------
-// Smart-fetch-HTML: for vertical extractors
-// ---------------------------------------------------------------------------
-
-/// Where the HTML ultimately came from — useful for callers that want
-/// to track "did we fall back?" for logging or pricing.
-#[derive(Debug, Clone, Copy, PartialEq, Eq)]
-pub enum FetchSource {
-    Local,
-    Cloud,
-}
-
-/// Antibot-aware HTML fetch result. The `html` field is always populated.
-pub struct FetchedHtml {
-    pub html: String,
-    pub final_url: String,
-    pub source: FetchSource,
-}
-
-/// Try local fetch; on bot protection, escalate to the cloud via
-/// [`CloudClient::fetch_html`] and return its synthesized HTML.
-///
-/// Designed for the vertical-extractor pattern where the caller has
-/// its own parser and just needs bytes.
-pub async fn smart_fetch_html(
-    client: &dyn crate::fetcher::Fetcher,
-    cloud: Option<&CloudClient>,
-    url: &str,
-) -> Result<FetchedHtml, CloudError> {
-    let resp = client
-        .fetch(url)
-        .await
-        .map_err(|e| CloudError::Network(e.to_string()))?;
-
-    if !is_bot_protected(&resp.html, &resp.headers) {
-        return Ok(FetchedHtml {
-            html: resp.html,
-            final_url: resp.url,
-            source: FetchSource::Local,
-        });
-    }
-
-    let Some(c) = cloud else {
-        warn!(url, "bot protection detected + no cloud client configured");
-        return Err(CloudError::NotConfigured);
-    };
-    debug!(url, "bot protection detected, escalating to cloud");
-    let html = c.fetch_html(url).await?;
-    Ok(FetchedHtml {
-        html,
-        final_url: url.to_string(),
-        source: FetchSource::Cloud,
-    })
-}
-
-// ---------------------------------------------------------------------------
-// Tests
-// ---------------------------------------------------------------------------
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    fn empty_headers() -> HeaderMap {
-        HeaderMap::new()
-    }
-
-    // --- detectors ----------------------------------------------------------
-
-    #[test]
-    fn is_bot_protected_detects_cloudflare_challenge() {
-        let html = "<html><body>_cf_chl_opt loaded</body></html>";
-        assert!(is_bot_protected(html, &empty_headers()));
-    }
-
-    #[test]
-    fn is_bot_protected_detects_turnstile_on_short_page() {
-        let html = "<div class=\"cf-turnstile\"></div>";
-        assert!(is_bot_protected(html, &empty_headers()));
-    }
-
-    #[test]
-    fn is_bot_protected_ignores_turnstile_on_real_content() {
-        let html = format!(
-            "<html><body>{}<div class=\"cf-turnstile\"></div></body></html>",
-            "lots of real content ".repeat(8_000)
-        );
-        assert!(!is_bot_protected(&html, &empty_headers()));
-    }
-
-    #[test]
-    fn is_bot_protected_detects_aws_waf_verifying_connection() {
-        // The exact shape Trustpilot serves under AWS WAF.
-        let html = r#"<div class="container"><div id="loading-state">
-            <div class="interstitial-spinner" id="spinner"></div>
-            <h1>Verifying your connection...</h1></div></div>"#;
-        assert!(is_bot_protected(html, &empty_headers()));
-    }
-
-    #[test]
-    fn synthesize_html_embeds_jsonld_and_og_tags() {
-        let resp = json!({
-            "url": "https://example.com/p/1",
-            "metadata": {
-                "title": "My Product",
-                "description": "A nice thing.",
-                "image": "https://cdn.example.com/1.jpg",
-                "site_name": "Example Shop"
-            },
-            "structured_data": [
-                {"@context":"https://schema.org","@type":"Product",
-                 "name":"Widget","offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
-            ],
-            "markdown": "# Widget\n\nA nice widget."
-        });
-        let html = synthesize_html(&resp);
-        // OG tags from metadata.
-        assert!(html.contains(r#"<meta property="og:title" content="My Product">"#));
-        assert!(
-            html.contains(r#"<meta property="og:image" content="https://cdn.example.com/1.jpg">"#)
-        );
-        // JSON-LD block preserved losslessly.
-        assert!(html.contains(r#"<script type="application/ld+json">"#));
-        assert!(html.contains(r#""@type":"Product""#));
-        assert!(html.contains(r#""price":"9.99""#));
-        // Body carries markdown.
-        assert!(html.contains("A nice widget."));
-    }
-
-    #[test]
-    fn synthesize_html_handles_missing_fields_gracefully() {
-        let resp = json!({"url": "https://example.com", "metadata": {}});
-        let html = synthesize_html(&resp);
-        // No panic, no stray unclosed tags.
-        assert!(html.starts_with("<html><head>"));
-        assert!(html.ends_with("</body></html>"));
-    }
-
-    #[test]
-    fn synthesize_html_escapes_attribute_quotes() {
-        let resp = json!({
-            "metadata": {"title": r#"She said "hi""#}
-        });
-        let html = synthesize_html(&resp);
-        assert!(html.contains(r#"og:title" content="She said &quot;hi&quot;""#));
-    }
-
-    #[test]
-    fn is_bot_protected_ignores_phrase_on_real_content() {
-        // A real article that happens to mention the phrase in prose
-        // should not trigger the short-page detector.
-        let html = format!(
-            "<html><body>{}<p>Verifying your connection is tricky.</p></body></html>",
-            "article text ".repeat(2_000)
-        );
-        assert!(!is_bot_protected(&html, &empty_headers()));
-    }
-
-    #[test]
-    fn needs_js_rendering_flags_spa_skeleton() {
-        let html = format!(
-            "<html><body><div id=\"__next\"></div>{}</body></html>",
-            "<script>x</script>".repeat(500)
-        );
-        assert!(needs_js_rendering(10, &html));
-    }
-
-    #[test]
-    fn needs_js_rendering_passes_real_article() {
-        let html = format!(
-            "<html><body>{}<script>x</script></body></html>",
-            "Real article text ".repeat(5_000)
-        );
-        assert!(!needs_js_rendering(5_000, &html));
-    }
-
-    // --- CloudError mapping -------------------------------------------------
-
-    #[test]
-    fn cloud_error_maps_401() {
-        let e = CloudError::from_status_and_body(401, "invalid key".into());
-        assert!(matches!(e, CloudError::Unauthorized));
-        assert!(e.to_string().contains(KEYS_URL));
-    }
-
-    #[test]
-    fn cloud_error_maps_402() {
-        let e = CloudError::from_status_and_body(402, "{}".into());
-        assert!(matches!(e, CloudError::InsufficientPlan));
-        assert!(e.to_string().contains(PRICING_URL));
-    }
-
-    #[test]
-    fn cloud_error_maps_429() {
-        let e = CloudError::from_status_and_body(429, "slow down".into());
-        assert!(matches!(e, CloudError::RateLimited));
-        assert!(e.to_string().contains(PRICING_URL));
-    }
-
-    #[test]
-    fn cloud_error_maps_generic_5xx() {
-        let e = CloudError::from_status_and_body(503, "x".repeat(2000));
-        match e {
-            CloudError::ServerError { status, body } => {
-                assert_eq!(status, 503);
-                assert!(body.len() <= 500);
-            }
-            _ => panic!("expected ServerError"),
-        }
-    }
-
-    #[test]
-    fn not_configured_error_points_at_signup() {
-        let msg = CloudError::NotConfigured.to_string();
-        assert!(msg.contains(SIGNUP_URL));
-        assert!(msg.contains("WEBCLAW_API_KEY"));
-    }
-
-    // --- CloudClient construction ------------------------------------------
-
-    #[test]
-    fn cloud_client_explicit_key_wins_over_env() {
-        // SAFETY: this test mutates process env. Serial tests only.
-        // Set env to something, pass an explicit key, explicit should win.
-        // (We don't actually *call* the API, just check the struct stored
-        // the right key.)
-        // `std::env::set_var` / `remove_var` are unsafe in the Rust 2024 edition.
-        unsafe {
-            std::env::set_var("WEBCLAW_API_KEY", "from-env");
-        }
-        let client = CloudClient::new(Some("from-flag")).expect("client built");
-        assert_eq!(client.api_key, "from-flag");
-        unsafe {
-            std::env::remove_var("WEBCLAW_API_KEY");
-        }
-    }
-
-    #[test]
-    fn cloud_client_none_when_empty() {
-        unsafe {
-            std::env::remove_var("WEBCLAW_API_KEY");
-        }
-        assert!(CloudClient::new(None).is_none());
-        assert!(CloudClient::new(Some("")).is_none());
-        assert!(CloudClient::new(Some("   ")).is_none());
-    }
-
-    #[test]
-    fn cloud_client_base_url_strips_trailing_slash() {
-        let c = CloudClient::with_key_and_base("k", "https://api.example.com/v1/");
-        assert_eq!(c.base_url(), "https://api.example.com/v1");
-    }
-
-    #[test]
-    fn truncate_respects_char_boundaries() {
-        // Ensure we don't slice inside a multi-byte char.
-        let s = "a".repeat(10) + "é"; // é is 2 bytes
-        let out = truncate(&s, 11);
-        assert_eq!(out.chars().count(), 11);
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/amazon_product.rs b/crates/webclaw-fetch/src/extractors/amazon_product.rs
deleted file mode 100644
index fed6b9f..0000000
--- a/crates/webclaw-fetch/src/extractors/amazon_product.rs
+++ /dev/null
@@ -1,452 +0,0 @@
-//! Amazon product detail page extractor.
-//!
-//! Amazon product pages (`/dp/{ASIN}/` on every locale) are
-//! inconsistently protected. Sometimes our local TLS fingerprint gets
-//! a real HTML page; sometimes we land on a CAPTCHA interstitial;
-//! sometimes we land on a real page that for whatever reason ships
-//! no Product JSON-LD (Amazon A/B-tests this regularly). So the
-//! extractor has a two-stage fallback:
-//!
-//! 1. Try local fetch + parse. If we got Product JSON-LD back, great:
-//!    we have everything (title, brand, price, availability, rating).
-//! 2. If local fetch worked *but the page has no Product JSON-LD* AND
-//!    a cloud client is configured, force-escalate to api.webclaw.io.
-//!    Cloud's render + antibot pipeline reliably surfaces the
-//!    structured data. Without a cloud client we return whatever we
-//!    got from local (usually just title via `#productTitle` or OG
-//!    meta tags).
-//!
-//! Parsing tries JSON-LD first, DOM regex (`#productTitle`,
-//! `#landingImage`) second, OG `<meta>` tags third. The OG path
-//! matters because the cloud's synthesized HTML ships metadata as
-//! OG tags but lacks Amazon's DOM IDs.
-//!
-//! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/`
-//! path. ASINs are a stable Amazon identifier so we extract that as
-//! part of the response even when everything else is empty (tells
-//! callers the URL was at least recognised).
-
-use std::sync::OnceLock;
-
-use regex::Regex;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::cloud::{self, CloudError};
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "amazon_product",
-    label: "Amazon product",
-    description: "Returns product detail: title, brand, price, currency, availability, rating, image, ASIN. Requires WEBCLAW_API_KEY — Amazon's antibot means we always go through the cloud.",
-    url_patterns: &[
-        "https://www.amazon.com/dp/{ASIN}",
-        "https://www.amazon.co.uk/dp/{ASIN}",
-        "https://www.amazon.de/dp/{ASIN}",
-        "https://www.amazon.fr/dp/{ASIN}",
-        "https://www.amazon.it/dp/{ASIN}",
-        "https://www.amazon.es/dp/{ASIN}",
-        "https://www.amazon.co.jp/dp/{ASIN}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if !is_amazon_host(host) {
-        return false;
-    }
-    parse_asin(url).is_some()
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let asin = parse_asin(url)
-        .ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;
-
-    let mut fetched = cloud::smart_fetch_html(client, client.cloud(), url)
-        .await
-        .map_err(cloud_to_fetch_err)?;
-
-    // Amazon ships Product JSON-LD inconsistently even on non-CAPTCHA
-    // pages (they A/B-test it). When local fetch succeeded but has no
-    // Product JSON-LD, force-escalate to the cloud which runs the
-    // render pipeline and reliably surfaces structured data. No-op
-    // when cloud isn't configured — we return whatever local gave us.
-    if fetched.source == cloud::FetchSource::Local
-        && find_product_jsonld(&fetched.html).is_none()
-        && let Some(c) = client.cloud()
-    {
-        match c.fetch_html(url).await {
-            Ok(cloud_html) => {
-                fetched = cloud::FetchedHtml {
-                    html: cloud_html,
-                    final_url: url.to_string(),
-                    source: cloud::FetchSource::Cloud,
-                };
-            }
-            Err(e) => {
-                tracing::debug!(
-                    error = %e,
-                    "amazon_product: cloud escalation failed, keeping local"
-                );
-            }
-        }
-    }
-
-    let mut data = parse(&fetched.html, url, &asin);
-    if let Some(obj) = data.as_object_mut() {
-        obj.insert(
-            "data_source".into(),
-            match fetched.source {
-                cloud::FetchSource::Local => json!("local"),
-                cloud::FetchSource::Cloud => json!("cloud"),
-            },
-        );
-    }
-    Ok(data)
-}
-
-/// Pure parser. Given HTML (from anywhere — direct, cloud, or a fixture
-/// file) and the source URL, extract Amazon product detail. Returns a
-/// `Value` rather than a typed struct so callers can pass it through
-/// without carrying webclaw_fetch types.
-pub fn parse(html: &str, url: &str, asin: &str) -> Value {
-    let jsonld = find_product_jsonld(html);
-    // Three-tier title: JSON-LD `name` > Amazon's `#productTitle` span
-    // (only present on real static HTML) > cloud-synthesized og:title.
-    let title = jsonld
-        .as_ref()
-        .and_then(|v| get_text(v, "name"))
-        .or_else(|| dom_title(html))
-        .or_else(|| og(html, "title"));
-    let image = jsonld
-        .as_ref()
-        .and_then(get_first_image)
-        .or_else(|| dom_image(html))
-        .or_else(|| og(html, "image"));
-    let brand = jsonld.as_ref().and_then(get_brand);
-    let description = jsonld
-        .as_ref()
-        .and_then(|v| get_text(v, "description"))
-        .or_else(|| og(html, "description"));
-    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
-    let offer = jsonld.as_ref().and_then(first_offer);
-
-    let sku = jsonld.as_ref().and_then(|v| get_text(v, "sku"));
-    let mpn = jsonld.as_ref().and_then(|v| get_text(v, "mpn"));
-
-    json!({
-        "url":              url,
-        "asin":             asin,
-        "title":            title,
-        "brand":            brand,
-        "description":      description,
-        "image":            image,
-        "price":            offer.as_ref().and_then(|o| get_text(o, "price")),
-        "currency":         offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
-        "availability":     offer.as_ref().and_then(|o| {
-            get_text(o, "availability").map(|s|
-                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
-        }),
-        "condition":        offer.as_ref().and_then(|o| {
-            get_text(o, "itemCondition").map(|s|
-                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
-        }),
-        "sku":              sku,
-        "mpn":              mpn,
-        "aggregate_rating": aggregate_rating,
-    })
-}
-
-// ---------------------------------------------------------------------------
-// URL helpers
-// ---------------------------------------------------------------------------
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-fn is_amazon_host(host: &str) -> bool {
-    host.starts_with("www.amazon.") || host.starts_with("amazon.")
-}
-
-/// Pull a 10-char ASIN out of any recognised Amazon URL shape:
-/// - /dp/{ASIN}
-/// - /gp/product/{ASIN}
-/// - /product/{ASIN}
-/// - /exec/obidos/ASIN/{ASIN}
-fn parse_asin(url: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r"/(?:dp|gp/product|product|ASIN)/([A-Z0-9]{10})(?:[/?#]|$)").unwrap()
-    });
-    re.captures(url)
-        .and_then(|c| c.get(1))
-        .map(|m| m.as_str().to_string())
-}
-
-// ---------------------------------------------------------------------------
-// JSON-LD walkers — light reuse of ecommerce_product's style
-// ---------------------------------------------------------------------------
-
-fn find_product_jsonld(html: &str) -> Option<Value> {
-    let blocks = webclaw_core::structured_data::extract_json_ld(html);
-    for b in blocks {
-        if let Some(found) = find_product_in(&b) {
-            return Some(found);
-        }
-    }
-    None
-}
-
-fn find_product_in(v: &Value) -> Option<Value> {
-    if is_product_type(v) {
-        return Some(v.clone());
-    }
-    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
-        for item in graph {
-            if let Some(found) = find_product_in(item) {
-                return Some(found);
-            }
-        }
-    }
-    if let Some(arr) = v.as_array() {
-        for item in arr {
-            if let Some(found) = find_product_in(item) {
-                return Some(found);
-            }
-        }
-    }
-    None
-}
-
-fn is_product_type(v: &Value) -> bool {
-    let Some(t) = v.get("@type") else {
-        return false;
-    };
-    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
-    match t {
-        Value::String(s) => is_prod(s),
-        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
-        _ => false,
-    }
-}
-
-fn get_text(v: &Value, key: &str) -> Option<String> {
-    v.get(key).and_then(|x| match x {
-        Value::String(s) => Some(s.clone()),
-        Value::Number(n) => Some(n.to_string()),
-        _ => None,
-    })
-}
-
-fn get_brand(v: &Value) -> Option<String> {
-    let brand = v.get("brand")?;
-    if let Some(s) = brand.as_str() {
-        return Some(s.to_string());
-    }
-    brand
-        .as_object()
-        .and_then(|o| o.get("name"))
-        .and_then(|n| n.as_str())
-        .map(String::from)
-}
-
-fn get_first_image(v: &Value) -> Option<String> {
-    match v.get("image")? {
-        Value::String(s) => Some(s.clone()),
-        Value::Array(arr) => arr.iter().find_map(|x| match x {
-            Value::String(s) => Some(s.clone()),
-            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
-            _ => None,
-        }),
-        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
-        _ => None,
-    }
-}
-
-fn first_offer(v: &Value) -> Option<Value> {
-    let offers = v.get("offers")?;
-    match offers {
-        Value::Array(arr) => arr.first().cloned(),
-        Value::Object(_) => Some(offers.clone()),
-        _ => None,
-    }
-}
-
-fn get_aggregate_rating(v: &Value) -> Option<Value> {
-    let r = v.get("aggregateRating")?;
-    Some(json!({
-        "rating_value": get_text(r, "ratingValue"),
-        "review_count": get_text(r, "reviewCount"),
-        "best_rating":  get_text(r, "bestRating"),
-    }))
-}
-
-// ---------------------------------------------------------------------------
-// DOM fallbacks — cheap regex for the two fields most likely to be
-// missing from JSON-LD on Amazon.
-// ---------------------------------------------------------------------------
-
-fn dom_title(html: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| Regex::new(r#"(?s)id="productTitle"[^>]*>([^<]+)<"#).unwrap());
-    re.captures(html)
-        .and_then(|c| c.get(1))
-        .map(|m| m.as_str().trim().to_string())
-}
-
-fn dom_image(html: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| Regex::new(r#"id="landingImage"[^>]+src="([^"]+)""#).unwrap());
-    re.captures(html)
-        .and_then(|c| c.get(1))
-        .map(|m| m.as_str().to_string())
-}
-
-/// OG meta tag lookup. Cloud-synthesized HTML ships these even when
-/// JSON-LD and Amazon-DOM-IDs are both absent, so they're the last
-/// line of defence for `title`, `image`, `description`.
-fn og(html: &str, prop: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
-    });
-    for c in re.captures_iter(html) {
-        if c.get(1).is_some_and(|m| m.as_str() == prop) {
-            return c.get(2).map(|m| html_unescape(m.as_str()));
-        }
-    }
-    None
-}
-
-/// Undo the synthesize_html attribute escaping for the few entities it
-/// emits. Keeps us off a heavier HTML-entity dep.
-fn html_unescape(s: &str) -> String {
-    s.replace("&quot;", "\"")
-        .replace("&amp;", "&")
-        .replace("&lt;", "<")
-        .replace("&gt;", ">")
-}
-
-fn cloud_to_fetch_err(e: CloudError) -> FetchError {
-    FetchError::Build(e.to_string())
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_multi_locale() {
-        assert!(matches("https://www.amazon.com/dp/B0CHX1W1XY"));
-        assert!(matches("https://www.amazon.co.uk/dp/B0CHX1W1XY/"));
-        assert!(matches("https://www.amazon.de/dp/B0CHX1W1XY?psc=1"));
-        assert!(matches(
-            "https://www.amazon.com/gp/product/B0CHX1W1XY/ref=foo"
-        ));
-    }
-
-    #[test]
-    fn rejects_non_product_urls() {
-        assert!(!matches("https://www.amazon.com/"));
-        assert!(!matches("https://www.amazon.com/gp/cart"));
-        assert!(!matches("https://example.com/dp/B0CHX1W1XY"));
-    }
-
-    #[test]
-    fn parse_asin_extracts_from_multiple_shapes() {
-        assert_eq!(
-            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY"),
-            Some("B0CHX1W1XY".into())
-        );
-        assert_eq!(
-            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY/"),
-            Some("B0CHX1W1XY".into())
-        );
-        assert_eq!(
-            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY?psc=1"),
-            Some("B0CHX1W1XY".into())
-        );
-        assert_eq!(
-            parse_asin("https://www.amazon.com/gp/product/B0CHX1W1XY/ref=bar"),
-            Some("B0CHX1W1XY".into())
-        );
-        assert_eq!(
-            parse_asin("https://www.amazon.com/exec/obidos/ASIN/B0CHX1W1XY/baz"),
-            Some("B0CHX1W1XY".into())
-        );
-        assert_eq!(parse_asin("https://www.amazon.com/"), None);
-    }
-
-    #[test]
-    fn parse_extracts_from_fixture_jsonld() {
-        // Minimal Amazon-style fixture with a Product JSON-LD block.
-        let html = r##"
-<html><head>
-<script type="application/ld+json">
-{"@context":"https://schema.org","@type":"Product",
- "name":"ACME Widget","sku":"B0CHX1W1XY",
- "brand":{"@type":"Brand","name":"ACME"},
- "image":"https://m.media-amazon.com/images/I/abc.jpg",
- "offers":{"@type":"Offer","price":"19.99","priceCurrency":"USD",
-           "availability":"https://schema.org/InStock"},
- "aggregateRating":{"@type":"AggregateRating","ratingValue":"4.6","reviewCount":"1234"}}
-</script>
-</head><body></body></html>"##;
-        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
-        assert_eq!(v["asin"], "B0CHX1W1XY");
-        assert_eq!(v["title"], "ACME Widget");
-        assert_eq!(v["brand"], "ACME");
-        assert_eq!(v["price"], "19.99");
-        assert_eq!(v["currency"], "USD");
-        assert_eq!(v["availability"], "InStock");
-        assert_eq!(v["aggregate_rating"]["rating_value"], "4.6");
-        assert_eq!(v["aggregate_rating"]["review_count"], "1234");
-    }
-
-    #[test]
-    fn parse_falls_back_to_dom_when_jsonld_missing_fields() {
-        let html = r#"
-<html><body>
-<span id="productTitle">Fallback Title</span>
-<img id="landingImage" src="https://m.media-amazon.com/images/I/fallback.jpg" />
-</body></html>
-"#;
-        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
-        assert_eq!(v["title"], "Fallback Title");
-        assert_eq!(
-            v["image"],
-            "https://m.media-amazon.com/images/I/fallback.jpg"
-        );
-    }
-
-    #[test]
-    fn parse_falls_back_to_og_meta_when_no_jsonld_no_dom() {
-        // Shape we see from the cloud synthesize_html path: OG tags
-        // only, no JSON-LD, no Amazon DOM IDs.
-        let html = r##"<html><head>
-<meta property="og:title" content="Cloud-sourced MacBook Pro">
-<meta property="og:image" content="https://m.media-amazon.com/images/I/cloud.jpg">
-<meta property="og:description" content="Via api.webclaw.io">
-</head></html>"##;
-        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
-        assert_eq!(v["title"], "Cloud-sourced MacBook Pro");
-        assert_eq!(v["image"], "https://m.media-amazon.com/images/I/cloud.jpg");
-        assert_eq!(v["description"], "Via api.webclaw.io");
-    }
-
-    #[test]
-    fn og_unescape_handles_quot_entity() {
-        let html = r#"<meta property="og:title" content="Apple &quot;M2 Pro&quot; Laptop">"#;
-        assert_eq!(
-            og(html, "title").as_deref(),
-            Some(r#"Apple "M2 Pro" Laptop"#)
-        );
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/arxiv.rs b/crates/webclaw-fetch/src/extractors/arxiv.rs
deleted file mode 100644
index c2b85c0..0000000
--- a/crates/webclaw-fetch/src/extractors/arxiv.rs
+++ /dev/null
@@ -1,314 +0,0 @@
-//! ArXiv paper structured extractor.
-//!
-//! Uses the public ArXiv API at `export.arxiv.org/api/query?id_list={id}`
-//! which returns Atom XML. We parse just enough to surface title, authors,
-//! abstract, categories, and the canonical PDF link. No HTML scraping
-//! required and no auth.
-
-use quick_xml::Reader;
-use quick_xml::events::Event;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "arxiv",
-    label: "ArXiv paper",
-    description: "Returns paper metadata: title, authors, abstract, categories, primary category, PDF URL.",
-    url_patterns: &[
-        "https://arxiv.org/abs/{id}",
-        "https://arxiv.org/abs/{id}v{n}",
-        "https://arxiv.org/pdf/{id}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if host != "arxiv.org" && host != "www.arxiv.org" {
-        return false;
-    }
-    url.contains("/abs/") || url.contains("/pdf/")
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let id = parse_id(url)
-        .ok_or_else(|| FetchError::Build(format!("arxiv: cannot parse id from '{url}'")))?;
-
-    let api_url = format!("https://export.arxiv.org/api/query?id_list={id}");
-    let resp = client.fetch(&api_url).await?;
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "arxiv api returned status {}",
-            resp.status
-        )));
-    }
-
-    let entry = parse_atom_entry(&resp.html)
-        .ok_or_else(|| FetchError::BodyDecode("arxiv: no <entry> in response".into()))?;
-    if entry.title.is_none() && entry.summary.is_none() {
-        return Err(FetchError::BodyDecode(format!(
-            "arxiv: paper '{id}' returned empty entry (likely withdrawn or invalid id)"
-        )));
-    }
-
-    Ok(json!({
-        "url":              url,
-        "id":               id,
-        "arxiv_id":         entry.id,
-        "title":            entry.title,
-        "authors":          entry.authors,
-        "abstract":         entry.summary.map(|s| collapse_whitespace(&s)),
-        "published":        entry.published,
-        "updated":          entry.updated,
-        "primary_category": entry.primary_category,
-        "categories":       entry.categories,
-        "doi":              entry.doi,
-        "comment":          entry.comment,
-        "pdf_url":          entry.pdf_url,
-        "abs_url":          entry.abs_url,
-    }))
-}
-
-// ---------------------------------------------------------------------------
-// Helpers
-// ---------------------------------------------------------------------------
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-/// Parse an arxiv id from a URL. Strips the version suffix (`v2`, `v3`)
-/// and the `.pdf` extension when present.
-fn parse_id(url: &str) -> Option<String> {
-    let after = url
-        .split("/abs/")
-        .nth(1)
-        .or_else(|| url.split("/pdf/").nth(1))?;
-    let stripped = after
-        .split(['?', '#'])
-        .next()?
-        .trim_end_matches('/')
-        .trim_end_matches(".pdf");
-    // Strip optional version suffix, e.g. "2401.12345v2" → "2401.12345"
-    let no_version = match stripped.rfind('v') {
-        Some(i) if stripped[i + 1..].chars().all(|c| c.is_ascii_digit()) => &stripped[..i],
-        _ => stripped,
-    };
-    if no_version.is_empty() {
-        None
-    } else {
-        Some(no_version.to_string())
-    }
-}
-
-fn collapse_whitespace(s: &str) -> String {
-    s.split_whitespace().collect::<Vec<_>>().join(" ")
-}
-
-#[derive(Default)]
-struct AtomEntry {
-    id: Option<String>,
-    title: Option<String>,
-    summary: Option<String>,
-    published: Option<String>,
-    updated: Option<String>,
-    primary_category: Option<String>,
-    categories: Vec<String>,
-    authors: Vec<String>,
-    doi: Option<String>,
-    comment: Option<String>,
-    pdf_url: Option<String>,
-    abs_url: Option<String>,
-}
-
-/// Parse the first `<entry>` block of an ArXiv Atom feed.
-fn parse_atom_entry(xml: &str) -> Option<AtomEntry> {
-    let mut reader = Reader::from_str(xml);
-    let mut buf = Vec::new();
-
-    // States
-    let mut in_entry = false;
-    let mut current: Option<&'static str> = None;
-    let mut in_author = false;
-    let mut in_author_name = false;
-    let mut entry = AtomEntry::default();
-
-    loop {
-        match reader.read_event_into(&mut buf) {
-            Ok(Event::Start(ref e)) => {
-                let local = e.local_name();
-                match local.as_ref() {
-                    b"entry" => in_entry = true,
-                    b"id" if in_entry && !in_author => current = Some("id"),
-                    b"title" if in_entry => current = Some("title"),
-                    b"summary" if in_entry => current = Some("summary"),
-                    b"published" if in_entry => current = Some("published"),
-                    b"updated" if in_entry => current = Some("updated"),
-                    b"author" if in_entry => in_author = true,
-                    b"name" if in_author => {
-                        in_author_name = true;
-                        current = Some("author_name");
-                    }
-                    b"category" if in_entry => {
-                        // The primary category is a separate namespaced element
-                        // (`arxiv:primary_category`, local name `primary_category`),
-                        // which this arm does not match. We only see plain
-                        // `<category>` elements and treat the first one as primary.
-                        for attr in e.attributes().flatten() {
-                            if attr.key.as_ref() == b"term"
-                                && let Ok(v) = attr.unescape_value()
-                            {
-                                let term = v.to_string();
-                                if entry.primary_category.is_none() {
-                                    entry.primary_category = Some(term.clone());
-                                }
-                                entry.categories.push(term);
-                            }
-                        }
-                    }
-                    b"link" if in_entry => {
-                        let mut href = None;
-                        let mut rel = None;
-                        let mut typ = None;
-                        for attr in e.attributes().flatten() {
-                            match attr.key.as_ref() {
-                                b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()),
-                                b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()),
-                                b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()),
-                                _ => {}
-                            }
-                        }
-                        if let Some(h) = href {
-                            if typ.as_deref() == Some("application/pdf") {
-                                entry.pdf_url = Some(h.clone());
-                            }
-                            if rel.as_deref() == Some("alternate") {
-                                entry.abs_url = Some(h);
-                            }
-                        }
-                    }
-                    _ => current = None,
-                }
-            }
-            Ok(Event::Empty(ref e)) => {
-                // Self-closing tags (<link href="..." />). Same handling as Start.
-                let local = e.local_name();
-                if (local.as_ref() == b"link" || local.as_ref() == b"category") && in_entry {
-                    let mut href = None;
-                    let mut rel = None;
-                    let mut typ = None;
-                    let mut term = None;
-                    for attr in e.attributes().flatten() {
-                        match attr.key.as_ref() {
-                            b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()),
-                            b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()),
-                            b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()),
-                            b"term" => term = attr.unescape_value().ok().map(|s| s.to_string()),
-                            _ => {}
-                        }
-                    }
-                    if let Some(t) = term {
-                        if entry.primary_category.is_none() {
-                            entry.primary_category = Some(t.clone());
-                        }
-                        entry.categories.push(t);
-                    }
-                    if let Some(h) = href {
-                        if typ.as_deref() == Some("application/pdf") {
-                            entry.pdf_url = Some(h.clone());
-                        }
-                        if rel.as_deref() == Some("alternate") {
-                            entry.abs_url = Some(h);
-                        }
-                    }
-                }
-            }
-            Ok(Event::Text(ref e)) => {
-                if let (Some(field), Ok(text)) = (current, e.unescape()) {
-                    let text = text.to_string();
-                    match field {
-                        "id" => entry.id = Some(text.trim().to_string()),
-                        "title" => entry.title = append_text(entry.title.take(), &text),
-                        "summary" => entry.summary = append_text(entry.summary.take(), &text),
-                        "published" => entry.published = Some(text.trim().to_string()),
-                        "updated" => entry.updated = Some(text.trim().to_string()),
-                        "author_name" => entry.authors.push(text.trim().to_string()),
-                        _ => {}
-                    }
-                }
-            }
-            Ok(Event::End(ref e)) => {
-                let local = e.local_name();
-                match local.as_ref() {
-                    b"entry" => break,
-                    b"author" => in_author = false,
-                    b"name" => in_author_name = false,
-                    _ => {}
-                }
-                if !in_author_name {
-                    current = None;
-                }
-            }
-            Ok(Event::Eof) => break,
-            Err(_) => return None,
-            _ => {}
-        }
-        buf.clear();
-    }
-
-    if in_entry { Some(entry) } else { None }
-}
-
-/// Concatenate text fragments (long fields can be split across multiple
-/// text events if they contain entities or CDATA).
-fn append_text(prev: Option<String>, next: &str) -> Option<String> {
-    match prev {
-        Some(mut s) => {
-            s.push_str(next);
-            Some(s)
-        }
-        None => Some(next.to_string()),
-    }
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_arxiv_urls() {
-        assert!(matches("https://arxiv.org/abs/2401.12345"));
-        assert!(matches("https://arxiv.org/abs/2401.12345v2"));
-        assert!(matches("https://arxiv.org/pdf/2401.12345.pdf"));
-        assert!(!matches("https://arxiv.org/"));
-        assert!(!matches("https://example.com/abs/foo"));
-    }
-
-    #[test]
-    fn parse_id_strips_version_and_extension() {
-        assert_eq!(
-            parse_id("https://arxiv.org/abs/2401.12345"),
-            Some("2401.12345".into())
-        );
-        assert_eq!(
-            parse_id("https://arxiv.org/abs/2401.12345v3"),
-            Some("2401.12345".into())
-        );
-        assert_eq!(
-            parse_id("https://arxiv.org/pdf/2401.12345v2.pdf"),
-            Some("2401.12345".into())
-        );
-    }
-
-    #[test]
-    fn collapse_whitespace_handles_newlines_and_tabs() {
-        assert_eq!(collapse_whitespace("a   b\n\tc  "), "a b c");
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/crates_io.rs b/crates/webclaw-fetch/src/extractors/crates_io.rs
deleted file mode 100644
index 719579f..0000000
--- a/crates/webclaw-fetch/src/extractors/crates_io.rs
+++ /dev/null
@@ -1,168 +0,0 @@
-//! crates.io structured extractor.
-//!
-//! Uses the public JSON API at `crates.io/api/v1/crates/{name}`. No
-//! auth, no rate limit at normal usage. The response includes both
-//! the crate metadata and the full version list, which we summarize
-//! down to a count + latest release info to keep the payload small.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "crates_io",
-    label: "crates.io package",
-    description: "Returns crate metadata: max/newest versions, release count, downloads, categories, keywords, latest-release license, repository.",
-    url_patterns: &[
-        "https://crates.io/crates/{name}",
-        "https://crates.io/crates/{name}/{version}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if host != "crates.io" && host != "www.crates.io" {
-        return false;
-    }
-    url.contains("/crates/")
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let name = parse_name(url)
-        .ok_or_else(|| FetchError::Build(format!("crates.io: cannot parse name from '{url}'")))?;
-
-    let api_url = format!("https://crates.io/api/v1/crates/{name}");
-    let resp = client.fetch(&api_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "crates.io: crate '{name}' not found"
-        )));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "crates.io api returned status {}",
-            resp.status
-        )));
-    }
-
-    let body: CratesResponse = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("crates.io parse: {e}")))?;
-
-    let c = body.crate_;
-    let latest_version = body
-        .versions
-        .iter()
-        .find(|v| !v.yanked.unwrap_or(false))
-        .or_else(|| body.versions.first());
-
-    Ok(json!({
-        "url":                 url,
-        "name":                c.id,
-        "description":         c.description,
-        "homepage":            c.homepage,
-        "documentation":       c.documentation,
-        "repository":          c.repository,
-        "max_stable_version":  c.max_stable_version,
-        "max_version":         c.max_version,
-        "newest_version":      c.newest_version,
-        "downloads":           c.downloads,
-        "recent_downloads":    c.recent_downloads,
-        "categories":          c.categories,
-        "keywords":            c.keywords,
-        "release_count":       body.versions.len(),
-        "latest_release_date": latest_version.and_then(|v| v.created_at.clone()),
-        "latest_license":      latest_version.and_then(|v| v.license.clone()),
-        "latest_rust_version": latest_version.and_then(|v| v.rust_version.clone()),
-        "latest_yanked":       latest_version.and_then(|v| v.yanked),
-        "created_at":          c.created_at,
-        "updated_at":          c.updated_at,
-    }))
-}
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-fn parse_name(url: &str) -> Option<String> {
-    let after = url.split("/crates/").nth(1)?;
-    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
-    let first = stripped.split('/').find(|s| !s.is_empty())?;
-    Some(first.to_string())
-}
-
-// ---------------------------------------------------------------------------
-// crates.io API types
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct CratesResponse {
-    #[serde(rename = "crate")]
-    crate_: CrateInfo,
-    #[serde(default)]
-    versions: Vec<VersionInfo>,
-}
-
-#[derive(Deserialize)]
-struct CrateInfo {
-    id: Option<String>,
-    description: Option<String>,
-    homepage: Option<String>,
-    documentation: Option<String>,
-    repository: Option<String>,
-    max_stable_version: Option<String>,
-    max_version: Option<String>,
-    newest_version: Option<String>,
-    downloads: Option<i64>,
-    recent_downloads: Option<i64>,
-    #[serde(default)]
-    categories: Vec<String>,
-    #[serde(default)]
-    keywords: Vec<String>,
-    created_at: Option<String>,
-    updated_at: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct VersionInfo {
-    license: Option<String>,
-    rust_version: Option<String>,
-    yanked: Option<bool>,
-    created_at: Option<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_crate_pages() {
-        assert!(matches("https://crates.io/crates/serde"));
-        assert!(matches("https://crates.io/crates/tokio/1.45.0"));
-        assert!(!matches("https://crates.io/"));
-        assert!(!matches("https://example.com/crates/foo"));
-    }
-
-    #[test]
-    fn parse_name_handles_versioned_urls() {
-        assert_eq!(
-            parse_name("https://crates.io/crates/serde"),
-            Some("serde".into())
-        );
-        assert_eq!(
-            parse_name("https://crates.io/crates/tokio/1.45.0"),
-            Some("tokio".into())
-        );
-        assert_eq!(
-            parse_name("https://crates.io/crates/scraper/?foo=bar"),
-            Some("scraper".into())
-        );
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/dev_to.rs b/crates/webclaw-fetch/src/extractors/dev_to.rs
deleted file mode 100644
index 86199d8..0000000
--- a/crates/webclaw-fetch/src/extractors/dev_to.rs
+++ /dev/null
@@ -1,188 +0,0 @@
-//! dev.to article structured extractor.
-//!
-//! `dev.to/api/articles/{username}/{slug}` returns the full article body,
-//! tags, reaction count, comment count, and reading time. Anonymous
-//! access works fine for published posts.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "dev_to",
-    label: "dev.to article",
-    description: "Returns article metadata + body: title, body markdown, tags, reactions, comments, reading time.",
-    url_patterns: &["https://dev.to/{username}/{slug}"],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if host != "dev.to" && host != "www.dev.to" {
-        return false;
-    }
-    let path = url
-        .split("://")
-        .nth(1)
-        .and_then(|s| s.split_once('/'))
-        .map(|(_, p)| p)
-        .unwrap_or("");
-    let stripped = path
-        .split(['?', '#'])
-        .next()
-        .unwrap_or("")
-        .trim_end_matches('/');
-    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
-    // Need exactly /{username}/{slug}, and the first segment must not be a reserved path.
-    segs.len() == 2 && !RESERVED_FIRST_SEGS.contains(&segs[0])
-}
-
-const RESERVED_FIRST_SEGS: &[&str] = &[
-    "api",
-    "tags",
-    "search",
-    "settings",
-    "enter",
-    "signup",
-    "about",
-    "code-of-conduct",
-    "privacy",
-    "terms",
-    "contact",
-    "sponsorships",
-    "sponsors",
-    "shop",
-    "videos",
-    "listings",
-    "podcasts",
-    "p",
-    "t",
-];
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let (username, slug) = parse_username_slug(url).ok_or_else(|| {
-        FetchError::Build(format!("dev_to: cannot parse username/slug from '{url}'"))
-    })?;
-
-    let api_url = format!("https://dev.to/api/articles/{username}/{slug}");
-    let resp = client.fetch(&api_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "dev_to: article '{username}/{slug}' not found"
-        )));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "dev.to api returned status {}",
-            resp.status
-        )));
-    }
-
-    let a: Article = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("dev.to parse: {e}")))?;
-
-    Ok(json!({
-        "url":               url,
-        "id":                a.id,
-        "title":             a.title,
-        "description":       a.description,
-        "body_markdown":     a.body_markdown,
-        "url_canonical":     a.canonical_url,
-        "published_at":      a.published_at,
-        "edited_at":         a.edited_at,
-        "reading_time_min":  a.reading_time_minutes,
-        "tags":              a.tag_list,
-        "positive_reactions": a.positive_reactions_count,
-        "public_reactions":  a.public_reactions_count,
-        "comments_count":    a.comments_count,
-        "page_views_count":  a.page_views_count,
-        "cover_image":       a.cover_image,
-        "author": json!({
-            "username":  a.user.as_ref().and_then(|u| u.username.clone()),
-            "name":      a.user.as_ref().and_then(|u| u.name.clone()),
-            "twitter":   a.user.as_ref().and_then(|u| u.twitter_username.clone()),
-            "github":    a.user.as_ref().and_then(|u| u.github_username.clone()),
-            "website":   a.user.as_ref().and_then(|u| u.website_url.clone()),
-        }),
-    }))
-}
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-fn parse_username_slug(url: &str) -> Option<(String, String)> {
-    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
-    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
-    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
-    let username = segs.next()?;
-    let slug = segs.next()?;
-    Some((username.to_string(), slug.to_string()))
-}
-
-// ---------------------------------------------------------------------------
-// dev.to API types
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct Article {
-    id: Option<i64>,
-    title: Option<String>,
-    description: Option<String>,
-    body_markdown: Option<String>,
-    canonical_url: Option<String>,
-    published_at: Option<String>,
-    edited_at: Option<String>,
-    reading_time_minutes: Option<i64>,
-    tag_list: Option<serde_json::Value>, // string OR array depending on endpoint
-    positive_reactions_count: Option<i64>,
-    public_reactions_count: Option<i64>,
-    comments_count: Option<i64>,
-    page_views_count: Option<i64>,
-    cover_image: Option<String>,
-    user: Option<UserRef>,
-}
-
-#[derive(Deserialize)]
-struct UserRef {
-    username: Option<String>,
-    name: Option<String>,
-    twitter_username: Option<String>,
-    github_username: Option<String>,
-    website_url: Option<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_article_urls() {
-        assert!(matches("https://dev.to/ben/welcome-thread"));
-        assert!(matches("https://dev.to/0xmassi/some-post-1abc"));
-        assert!(!matches("https://dev.to/"));
-        assert!(!matches("https://dev.to/api/articles/foo/bar"));
-        assert!(!matches("https://dev.to/tags/rust"));
-        assert!(!matches("https://dev.to/ben")); // user profile, not article
-        assert!(!matches("https://example.com/ben/post"));
-    }
-
-    #[test]
-    fn parse_pulls_username_and_slug() {
-        assert_eq!(
-            parse_username_slug("https://dev.to/ben/welcome-thread"),
-            Some(("ben".into(), "welcome-thread".into()))
-        );
-        assert_eq!(
-            parse_username_slug("https://dev.to/0xmassi/some-post-1abc/?foo=bar"),
-            Some(("0xmassi".into(), "some-post-1abc".into()))
-        );
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/docker_hub.rs b/crates/webclaw-fetch/src/extractors/docker_hub.rs
deleted file mode 100644
index bce9315..0000000
--- a/crates/webclaw-fetch/src/extractors/docker_hub.rs
+++ /dev/null
@@ -1,150 +0,0 @@
-//! Docker Hub repository structured extractor.
-//!
-//! Uses the v2 JSON API at `hub.docker.com/v2/repositories/{namespace}/{name}`.
-//! Anonymous access is allowed for public images. The official-image
-//! shorthand (e.g. `nginx`, `redis`) is normalized to `library/{name}`.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "docker_hub",
-    label: "Docker Hub repository",
-    description: "Returns image metadata: pull count, star count, last_updated, official flag, description.",
-    url_patterns: &[
-        "https://hub.docker.com/_/{name}",
-        "https://hub.docker.com/r/{namespace}/{name}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if host != "hub.docker.com" {
-        return false;
-    }
-    url.contains("/_/") || url.contains("/r/")
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let (namespace, name) = parse_repo(url)
-        .ok_or_else(|| FetchError::Build(format!("docker_hub: cannot parse repo from '{url}'")))?;
-
-    let api_url = format!("https://hub.docker.com/v2/repositories/{namespace}/{name}");
-    let resp = client.fetch(&api_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "docker_hub: repo '{namespace}/{name}' not found"
-        )));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "docker_hub api returned status {}",
-            resp.status
-        )));
-    }
-
-    let r: RepoResponse = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("docker_hub parse: {e}")))?;
-
-    Ok(json!({
-        "url":               url,
-        "namespace":         r.namespace,
-        "name":              r.name,
-        "full_name":         format!("{namespace}/{name}"),
-        "pull_count":        r.pull_count,
-        "star_count":        r.star_count,
-        "description":       r.description,
-        "full_description":  r.full_description,
-        "last_updated":      r.last_updated,
-        "date_registered":   r.date_registered,
-        "is_official":       namespace == "library",
-        "is_private":        r.is_private,
-        "status_description":r.status_description,
-        "categories":        r.categories,
-    }))
-}
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-/// Parse `(namespace, name)` from a Docker Hub URL. The official-image
-/// shorthand `/_/nginx` maps to `(library, nginx)`. Personal repos
-/// `/r/foo/bar` map to `(foo, bar)`.
-fn parse_repo(url: &str) -> Option<(String, String)> {
-    if let Some(after) = url.split("/_/").nth(1) {
-        let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
-        let name = stripped.split('/').next().filter(|s| !s.is_empty())?;
-        return Some(("library".into(), name.to_string()));
-    }
-    let after = url.split("/r/").nth(1)?;
-    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
-    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
-    let ns = segs.next()?;
-    let nm = segs.next()?;
-    Some((ns.to_string(), nm.to_string()))
-}
-
-#[derive(Deserialize)]
-struct RepoResponse {
-    namespace: Option<String>,
-    name: Option<String>,
-    pull_count: Option<i64>,
-    star_count: Option<i64>,
-    description: Option<String>,
-    full_description: Option<String>,
-    last_updated: Option<String>,
-    date_registered: Option<String>,
-    is_private: Option<bool>,
-    status_description: Option<String>,
-    #[serde(default)]
-    categories: Vec<DockerCategory>,
-}
-
-#[derive(Deserialize, serde::Serialize)]
-struct DockerCategory {
-    name: Option<String>,
-    slug: Option<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_docker_urls() {
-        assert!(matches("https://hub.docker.com/_/nginx"));
-        assert!(matches("https://hub.docker.com/r/grafana/grafana"));
-        assert!(!matches("https://hub.docker.com/"));
-        assert!(!matches("https://example.com/_/nginx"));
-    }
-
-    #[test]
-    fn parse_repo_handles_official_and_personal() {
-        assert_eq!(
-            parse_repo("https://hub.docker.com/_/nginx"),
-            Some(("library".into(), "nginx".into()))
-        );
-        assert_eq!(
-            parse_repo("https://hub.docker.com/_/nginx/tags"),
-            Some(("library".into(), "nginx".into()))
-        );
-        assert_eq!(
-            parse_repo("https://hub.docker.com/r/grafana/grafana"),
-            Some(("grafana".into(), "grafana".into()))
-        );
-        assert_eq!(
-            parse_repo("https://hub.docker.com/r/grafana/grafana/?foo=bar"),
-            Some(("grafana".into(), "grafana".into()))
-        );
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/ebay_listing.rs b/crates/webclaw-fetch/src/extractors/ebay_listing.rs
deleted file mode 100644
index dbc85ab..0000000
--- a/crates/webclaw-fetch/src/extractors/ebay_listing.rs
+++ /dev/null
@@ -1,337 +0,0 @@
-//! eBay listing extractor.
-//!
-//! eBay item pages at `ebay.com/itm/{id}` and international variants
-//! usually ship a `Product` JSON-LD block with title, price, currency,
-//! condition, and an `AggregateOffer` when bidding. eBay applies
-//! Cloudflare + custom WAF selectively — some item IDs return normal
-//! HTML to the Firefox profile, others 403 / get the "Pardon our
-//! interruption" page. We route through `cloud::smart_fetch_html` so
-//! both paths resolve to the same parser.
-
-use std::sync::OnceLock;
-
-use regex::Regex;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::cloud::{self, CloudError};
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "ebay_listing",
-    label: "eBay listing",
-    description: "Returns item title, price, currency, condition, seller, shipping, and bid info. Heavy listings may need WEBCLAW_API_KEY for antibot.",
-    url_patterns: &[
-        "https://www.ebay.com/itm/{id}",
-        "https://www.ebay.co.uk/itm/{id}",
-        "https://www.ebay.de/itm/{id}",
-        "https://www.ebay.fr/itm/{id}",
-        "https://www.ebay.it/itm/{id}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if !is_ebay_host(host) {
-        return false;
-    }
-    parse_item_id(url).is_some()
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let item_id = parse_item_id(url)
-        .ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?;
-
-    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
-        .await
-        .map_err(cloud_to_fetch_err)?;
-
-    let mut data = parse(&fetched.html, url, &item_id);
-    if let Some(obj) = data.as_object_mut() {
-        obj.insert(
-            "data_source".into(),
-            match fetched.source {
-                cloud::FetchSource::Local => json!("local"),
-                cloud::FetchSource::Cloud => json!("cloud"),
-            },
-        );
-    }
-    Ok(data)
-}
-
-pub fn parse(html: &str, url: &str, item_id: &str) -> Value {
-    let jsonld = find_product_jsonld(html);
-    let title = jsonld
-        .as_ref()
-        .and_then(|v| get_text(v, "name"))
-        .or_else(|| og(html, "title"));
-    let image = jsonld
-        .as_ref()
-        .and_then(get_first_image)
-        .or_else(|| og(html, "image"));
-    let brand = jsonld.as_ref().and_then(get_brand);
-    let description = jsonld
-        .as_ref()
-        .and_then(|v| get_text(v, "description"))
-        .or_else(|| og(html, "description"));
-    let offer = jsonld.as_ref().and_then(first_offer);
-
-    // eBay's AggregateOffer uses lowPrice/highPrice. Offer uses price.
-    let (low_price, high_price, single_price) = match offer.as_ref() {
-        Some(o) => (
-            get_text(o, "lowPrice"),
-            get_text(o, "highPrice"),
-            get_text(o, "price"),
-        ),
-        None => (None, None, None),
-    };
-    let offer_count = offer.as_ref().and_then(|o| get_text(o, "offerCount"));
-
-    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
-
-    json!({
-        "url":             url,
-        "item_id":         item_id,
-        "title":           title,
-        "brand":           brand,
-        "description":     description,
-        "image":           image,
-        "price":           single_price,
-        "low_price":       low_price,
-        "high_price":      high_price,
-        "offer_count":     offer_count,
-        "currency":        offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
-        "availability":    offer.as_ref().and_then(|o| {
-            get_text(o, "availability").map(|s|
-                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
-        }),
-        "condition":       offer.as_ref().and_then(|o| {
-            get_text(o, "itemCondition").map(|s|
-                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
-        }),
-        "seller":          offer.as_ref().and_then(|o|
-            o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from)),
-        "aggregate_rating": aggregate_rating,
-    })
-}
-
-// ---------------------------------------------------------------------------
-// URL helpers
-// ---------------------------------------------------------------------------
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-fn is_ebay_host(host: &str) -> bool {
-    host.starts_with("www.ebay.") || host.starts_with("ebay.")
-}
-
-/// Pull the numeric item id out of `/itm/{id}` or `/itm/{slug}/{id}`
-/// URLs. IDs are 10-15 digits today, but we accept any all-digit
-/// segment of 8 or more digits so the extractor stays forward-compatible.
-fn parse_item_id(url: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        // /itm/(optional-slug/)?(8+ digits)([/?#]|end)
-        Regex::new(r"/itm/(?:[^/]+/)?(\d{8,})(?:[/?#]|$)").unwrap()
-    });
-    re.captures(url)
-        .and_then(|c| c.get(1))
-        .map(|m| m.as_str().to_string())
-}
-
-// ---------------------------------------------------------------------------
-// JSON-LD walkers
-// ---------------------------------------------------------------------------
-
-fn find_product_jsonld(html: &str) -> Option<Value> {
-    let blocks = webclaw_core::structured_data::extract_json_ld(html);
-    for b in blocks {
-        if let Some(found) = find_product_in(&b) {
-            return Some(found);
-        }
-    }
-    None
-}
-
-fn find_product_in(v: &Value) -> Option<Value> {
-    if is_product_type(v) {
-        return Some(v.clone());
-    }
-    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
-        for item in graph {
-            if let Some(found) = find_product_in(item) {
-                return Some(found);
-            }
-        }
-    }
-    if let Some(arr) = v.as_array() {
-        for item in arr {
-            if let Some(found) = find_product_in(item) {
-                return Some(found);
-            }
-        }
-    }
-    None
-}
-
-fn is_product_type(v: &Value) -> bool {
-    let Some(t) = v.get("@type") else {
-        return false;
-    };
-    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
-    match t {
-        Value::String(s) => is_prod(s),
-        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
-        _ => false,
-    }
-}
-
-fn get_text(v: &Value, key: &str) -> Option<String> {
-    v.get(key).and_then(|x| match x {
-        Value::String(s) => Some(s.clone()),
-        Value::Number(n) => Some(n.to_string()),
-        _ => None,
-    })
-}
-
-fn get_brand(v: &Value) -> Option<String> {
-    let brand = v.get("brand")?;
-    if let Some(s) = brand.as_str() {
-        return Some(s.to_string());
-    }
-    brand
-        .as_object()
-        .and_then(|o| o.get("name"))
-        .and_then(|n| n.as_str())
-        .map(String::from)
-}
-
-fn get_first_image(v: &Value) -> Option<String> {
-    match v.get("image")? {
-        Value::String(s) => Some(s.clone()),
-        Value::Array(arr) => arr.iter().find_map(|x| match x {
-            Value::String(s) => Some(s.clone()),
-            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
-            _ => None,
-        }),
-        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
-        _ => None,
-    }
-}
-
-fn first_offer(v: &Value) -> Option<Value> {
-    let offers = v.get("offers")?;
-    match offers {
-        Value::Array(arr) => arr.first().cloned(),
-        Value::Object(_) => Some(offers.clone()),
-        _ => None,
-    }
-}
-
-fn get_aggregate_rating(v: &Value) -> Option<Value> {
-    let r = v.get("aggregateRating")?;
-    Some(json!({
-        "rating_value": get_text(r, "ratingValue"),
-        "review_count": get_text(r, "reviewCount"),
-        "best_rating":  get_text(r, "bestRating"),
-    }))
-}
-
-fn og(html: &str, prop: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
-    });
-    for c in re.captures_iter(html) {
-        if c.get(1).is_some_and(|m| m.as_str() == prop) {
-            return c.get(2).map(|m| m.as_str().to_string());
-        }
-    }
-    None
-}
-
-fn cloud_to_fetch_err(e: CloudError) -> FetchError {
-    FetchError::Build(e.to_string())
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_ebay_item_urls() {
-        assert!(matches("https://www.ebay.com/itm/325478156234"));
-        assert!(matches(
-            "https://www.ebay.com/itm/vintage-typewriter/325478156234"
-        ));
-        assert!(matches("https://www.ebay.co.uk/itm/325478156234"));
-        assert!(!matches("https://www.ebay.com/"));
-        assert!(!matches("https://www.ebay.com/sch/foo"));
-        assert!(!matches("https://example.com/itm/325478156234"));
-    }
-
-    #[test]
-    fn parse_item_id_handles_slugged_urls() {
-        assert_eq!(
-            parse_item_id("https://www.ebay.com/itm/325478156234"),
-            Some("325478156234".into())
-        );
-        assert_eq!(
-            parse_item_id("https://www.ebay.com/itm/vintage-typewriter/325478156234"),
-            Some("325478156234".into())
-        );
-        assert_eq!(
-            parse_item_id("https://www.ebay.com/itm/325478156234?hash=abc"),
-            Some("325478156234".into())
-        );
-    }
-
-    #[test]
-    fn parse_extracts_from_fixture_jsonld() {
-        let html = r##"
-<html><head>
-<script type="application/ld+json">
-{"@context":"https://schema.org","@type":"Product",
- "name":"Vintage Typewriter","sku":"TW-001",
- "brand":{"@type":"Brand","name":"Olivetti"},
- "image":"https://i.ebayimg.com/images/abc.jpg",
- "offers":{"@type":"Offer","price":"79.99","priceCurrency":"GBP",
-           "availability":"https://schema.org/InStock",
-           "itemCondition":"https://schema.org/UsedCondition",
-           "seller":{"@type":"Person","name":"vintage_seller_99"}}}
-</script>
-</head></html>"##;
-        let v = parse(html, "https://www.ebay.co.uk/itm/325", "325");
-        assert_eq!(v["title"], "Vintage Typewriter");
-        assert_eq!(v["price"], "79.99");
-        assert_eq!(v["currency"], "GBP");
-        assert_eq!(v["availability"], "InStock");
-        assert_eq!(v["condition"], "UsedCondition");
-        assert_eq!(v["seller"], "vintage_seller_99");
-        assert_eq!(v["brand"], "Olivetti");
-    }
-
-    #[test]
-    fn parse_handles_aggregate_offer_price_range() {
-        let html = r##"
-<script type="application/ld+json">
-{"@type":"Product","name":"Used Copies",
- "offers":{"@type":"AggregateOffer","offerCount":"5",
-           "lowPrice":"10.00","highPrice":"50.00","priceCurrency":"USD"}}
-</script>
-"##;
-        let v = parse(html, "https://www.ebay.com/itm/1", "1");
-        assert_eq!(v["low_price"], "10.00");
-        assert_eq!(v["high_price"], "50.00");
-        assert_eq!(v["offer_count"], "5");
-        assert_eq!(v["currency"], "USD");
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/ecommerce_product.rs b/crates/webclaw-fetch/src/extractors/ecommerce_product.rs
deleted file mode 100644
index 019fb68..0000000
--- a/crates/webclaw-fetch/src/extractors/ecommerce_product.rs
+++ /dev/null
@@ -1,553 +0,0 @@
-//! Generic ecommerce product extractor via Schema.org JSON-LD.
-//!
-//! Most modern ecommerce sites ship a `<script type="application/ld+json">`
-//! Product block for SEO / rich-result snippets. Google's structured-data
-//! docs ask for this markup from anyone who wants product rich results.
-//! We take advantage of it: one extractor that works on Shopify,
-//! BigCommerce, WooCommerce, Squarespace, Magento, custom storefronts,
-//! and anything else that follows Schema.org.
-//!
-//! **Explicit-call only** (`/v1/scrape/ecommerce_product`). Not in the
-//! auto-dispatch because we can't identify "this is a product page"
-//! from the URL alone. When the caller knows they have a product URL,
-//! this is the reliable fallback for stores where shopify_product
-//! doesn't apply.
-//!
-//! The extractor reuses `webclaw_core::structured_data::extract_json_ld`
-//! so JSON-LD parsing is shared with the rest of the extraction
-//! pipeline. We walk all blocks (including `@graph` and bare arrays)
-//! looking for a `Product`, `ProductGroup`, or similar product node.
-//!
-//! ## OG fallback
-//!
-//! Two real-world cases JSON-LD alone can't cover:
-//!
-//! 1. Site has no Product JSON-LD at all (smaller Squarespace / custom
-//!    storefronts, many European shops).
-//! 2. Site has Product JSON-LD but the `offers` block is empty (seen on
-//!    Patagonia and other catalog-style sites that split price onto a
-//!    separate widget).
-//!
-//! For case 1 we build a minimal payload from OG / product meta tags
-//! (`og:title`, `og:image`, `og:description`, `product:price:amount`,
-//! `product:price:currency`, `product:availability`, `product:brand`).
-//! For case 2 we augment the JSON-LD offers list with an OG-derived
-//! offer so callers get a price either way. A `data_source` field
-//! (`"jsonld"` / `"jsonld+og"` / `"og_fallback"`) tells the caller
-//! which branch produced the data.
-
-use std::sync::OnceLock;
-
-use regex::Regex;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "ecommerce_product",
-    label: "Ecommerce product (generic)",
-    description: "Returns product info from any site that ships Schema.org Product JSON-LD: name, description, images, brand, SKU, price, availability, aggregate rating.",
-    url_patterns: &[
-        "https://{any-ecom-store}/products/{slug}",
-        "https://{any-ecom-store}/product/{slug}",
-        "https://{any-ecom-store}/p/{slug}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    // Maximally permissive: explicit-call-only extractor. We trust the
-    // caller knows they're pointing at a product page. Custom ecom
-    // sites use every conceivable URL shape (warbyparker.com uses
-    // `/eyeglasses/{category}/{slug}/{colour}`, etc.), so path-pattern
-    // matching would false-negative a lot. All we gate on is a valid
-    // http(s) URL with a host.
-    if !(url.starts_with("http://") || url.starts_with("https://")) {
-        return false;
-    }
-    !host_of(url).is_empty()
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let resp = client.fetch(url).await?;
-    if !(200..300).contains(&resp.status) {
-        return Err(FetchError::Build(format!(
-            "ecommerce_product: status {} for {url}",
-            resp.status
-        )));
-    }
-    parse(&resp.html, url).ok_or_else(|| {
-        FetchError::BodyDecode(format!(
-            "ecommerce_product: no Schema.org Product JSON-LD and no OG product tags on {url}"
-        ))
-    })
-}
-
-/// Pure parser: try JSON-LD first, fall back to OG meta tags. Returns
-/// `None` when neither path has enough to say "this is a product page".
-pub fn parse(html: &str, url: &str) -> Option<Value> {
-    // Reuse the core JSON-LD parser so we benefit from whatever
-    // robustness it gains over time (handling @graph, arrays, etc.).
-    let blocks = webclaw_core::structured_data::extract_json_ld(html);
-    let product = find_product(&blocks);
-
-    if let Some(p) = product {
-        Some(build_jsonld_payload(&p, html, url))
-    } else if has_og_product_signal(html) {
-        Some(build_og_payload(html, url))
-    } else {
-        None
-    }
-}
-
-/// Build the rich payload from a Product JSON-LD node. Augments the
-/// `offers` array with an OG-derived offer when JSON-LD offers is empty
-/// so callers get a price on sites like Patagonia.
-fn build_jsonld_payload(product: &Value, html: &str, url: &str) -> Value {
-    let mut offers = collect_offers(product);
-    let mut data_source = "jsonld";
-    if offers.is_empty()
-        && let Some(og_offer) = build_og_offer(html)
-    {
-        offers.push(og_offer);
-        data_source = "jsonld+og";
-    }
-
-    json!({
-        "url":                url,
-        "data_source":        data_source,
-        "name":               get_text(product, "name").or_else(|| og(html, "title")),
-        "description":        get_text(product, "description").or_else(|| og(html, "description")),
-        "brand":              get_brand(product).or_else(|| meta_property(html, "product:brand")),
-        "sku":                get_text(product, "sku"),
-        "mpn":                get_text(product, "mpn"),
-        "gtin":               get_text(product, "gtin")
-                                 .or_else(|| get_text(product, "gtin13"))
-                                 .or_else(|| get_text(product, "gtin12"))
-                                 .or_else(|| get_text(product, "gtin8")),
-        "product_id":         get_text(product, "productID"),
-        "category":           get_text(product, "category"),
-        "color":              get_text(product, "color"),
-        "material":           get_text(product, "material"),
-        "images":             nonempty_or_og(collect_images(product), html),
-        "offers":             offers,
-        "aggregate_rating":   get_aggregate_rating(product),
-        "review_count":       get_review_count(product),
-        "raw_schema_type":    get_text(product, "@type"),
-        "raw_jsonld":         product.clone(),
-    })
-}
-
-/// Build a minimal payload from OG / product meta tags. Used when a
-/// page has no Product JSON-LD at all.
-fn build_og_payload(html: &str, url: &str) -> Value {
-    let offers = build_og_offer(html).map(|o| vec![o]).unwrap_or_default();
-    let image = og(html, "image");
-    let images: Vec<Value> = image.map(|i| vec![Value::String(i)]).unwrap_or_default();
-
-    json!({
-        "url":                url,
-        "data_source":        "og_fallback",
-        "name":               og(html, "title"),
-        "description":        og(html, "description"),
-        "brand":              meta_property(html, "product:brand"),
-        "sku":                None::<String>,
-        "mpn":                None::<String>,
-        "gtin":               None::<String>,
-        "product_id":         None::<String>,
-        "category":           None::<String>,
-        "color":              None::<String>,
-        "material":           None::<String>,
-        "images":             images,
-        "offers":             offers,
-        "aggregate_rating":   Value::Null,
-        "review_count":       None::<String>,
-        "raw_schema_type":    None::<String>,
-        "raw_jsonld":         Value::Null,
-    })
-}
-
-fn nonempty_or_og(imgs: Vec<Value>, html: &str) -> Vec<Value> {
-    if !imgs.is_empty() {
-        return imgs;
-    }
-    og(html, "image")
-        .map(|s| vec![Value::String(s)])
-        .unwrap_or_default()
-}
-
-// ---------------------------------------------------------------------------
-// JSON-LD walkers
-// ---------------------------------------------------------------------------
-
-/// Recursively walk the JSON-LD blocks and return the first node whose
-/// `@type` is one of the product types accepted by `is_product_type`.
-fn find_product(blocks: &[Value]) -> Option<Value> {
-    for b in blocks {
-        if let Some(found) = find_product_in(b) {
-            return Some(found);
-        }
-    }
-    None
-}
-
-fn find_product_in(v: &Value) -> Option<Value> {
-    if is_product_type(v) {
-        return Some(v.clone());
-    }
-    // @graph: [ {...}, {...} ]
-    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
-        for item in graph {
-            if let Some(found) = find_product_in(item) {
-                return Some(found);
-            }
-        }
-    }
-    // Bare array wrapper
-    if let Some(arr) = v.as_array() {
-        for item in arr {
-            if let Some(found) = find_product_in(item) {
-                return Some(found);
-            }
-        }
-    }
-    None
-}
-
-fn is_product_type(v: &Value) -> bool {
-    let t = match v.get("@type") {
-        Some(t) => t,
-        None => return false,
-    };
-    let match_str = |s: &str| {
-        matches!(
-            s,
-            "Product" | "ProductGroup" | "IndividualProduct" | "Vehicle" | "SomeProducts"
-        )
-    };
-    match t {
-        Value::String(s) => match_str(s),
-        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
-        _ => false,
-    }
-}
-
-fn get_text(v: &Value, key: &str) -> Option<String> {
-    v.get(key).and_then(|x| match x {
-        Value::String(s) => Some(s.clone()),
-        Value::Number(n) => Some(n.to_string()),
-        _ => None,
-    })
-}
-
-fn get_brand(v: &Value) -> Option<String> {
-    let brand = v.get("brand")?;
-    if let Some(s) = brand.as_str() {
-        return Some(s.to_string());
-    }
-    if let Some(obj) = brand.as_object()
-        && let Some(n) = obj.get("name").and_then(|x| x.as_str())
-    {
-        return Some(n.to_string());
-    }
-    None
-}
-
-fn collect_images(v: &Value) -> Vec<Value> {
-    match v.get("image") {
-        Some(Value::String(s)) => vec![Value::String(s.clone())],
-        Some(Value::Array(arr)) => arr
-            .iter()
-            .filter_map(|x| match x {
-                Value::String(s) => Some(Value::String(s.clone())),
-                Value::Object(_) => x.get("url").cloned(),
-                _ => None,
-            })
-            .collect(),
-        Some(Value::Object(o)) => o.get("url").cloned().into_iter().collect(),
-        _ => Vec::new(),
-    }
-}
-
-/// Normalise both bare Offer and AggregateOffer into a uniform array.
-fn collect_offers(v: &Value) -> Vec<Value> {
-    let offers = match v.get("offers") {
-        Some(o) => o,
-        None => return Vec::new(),
-    };
-    let collect_single = |o: &Value| -> Option<Value> {
-        Some(json!({
-            "price":            get_text(o, "price"),
-            "low_price":        get_text(o, "lowPrice"),
-            "high_price":       get_text(o, "highPrice"),
-            "currency":         get_text(o, "priceCurrency"),
-            "availability":     get_text(o, "availability").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
-            "item_condition":   get_text(o, "itemCondition").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
-            "valid_until":      get_text(o, "priceValidUntil"),
-            "url":              get_text(o, "url"),
-            "seller":           o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from),
-            "offer_count":      get_text(o, "offerCount"),
-        }))
-    };
-    match offers {
-        Value::Array(arr) => arr.iter().filter_map(collect_single).collect(),
-        Value::Object(_) => collect_single(offers).into_iter().collect(),
-        _ => Vec::new(),
-    }
-}
-
-fn get_aggregate_rating(v: &Value) -> Option<Value> {
-    let r = v.get("aggregateRating")?;
-    Some(json!({
-        "rating_value":  get_text(r, "ratingValue"),
-        "best_rating":   get_text(r, "bestRating"),
-        "worst_rating":  get_text(r, "worstRating"),
-        "rating_count":  get_text(r, "ratingCount"),
-        "review_count":  get_text(r, "reviewCount"),
-    }))
-}
-
-fn get_review_count(v: &Value) -> Option<String> {
-    v.get("aggregateRating")
-        .and_then(|r| get_text(r, "reviewCount"))
-        .or_else(|| get_text(v, "reviewCount"))
-}
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-// ---------------------------------------------------------------------------
-// OG / product meta-tag helpers
-// ---------------------------------------------------------------------------
-
-/// True when the HTML has enough OG / product meta tags to justify
-/// building a fallback payload. A single `og:title` isn't enough on its
-/// own — every blog post has that. We require either a product price
-/// tag or an `og:type` of `product`, `og:product`, or `product.item`
-/// to avoid mis-classifying articles as products.
-fn has_og_product_signal(html: &str) -> bool {
-    let has_price = meta_property(html, "product:price:amount").is_some()
-        || meta_property(html, "og:price:amount").is_some();
-    if has_price {
-        return true;
-    }
-    // `<meta property="og:type" content="product">` is the Open Graph
-    // marker for product pages.
-    let og_type = og(html, "type").unwrap_or_default().to_lowercase();
-    matches!(og_type.as_str(), "product" | "og:product" | "product.item")
-}
-
-/// Build a single Offer-shaped Value from OG / product meta tags, or
-/// `None` if there's no price info at all.
-fn build_og_offer(html: &str) -> Option<Value> {
-    let price = meta_property(html, "product:price:amount")
-        .or_else(|| meta_property(html, "og:price:amount"));
-    let currency = meta_property(html, "product:price:currency")
-        .or_else(|| meta_property(html, "og:price:currency"));
-    let availability = meta_property(html, "product:availability")
-        .or_else(|| meta_property(html, "og:availability"));
-    price.as_ref()?;
-    Some(json!({
-        "price":            price,
-        "low_price":        None::<String>,
-        "high_price":       None::<String>,
-        "currency":         currency,
-        "availability":     availability,
-        "item_condition":   None::<String>,
-        "valid_until":      None::<String>,
-        "url":              None::<String>,
-        "seller":           None::<String>,
-        "offer_count":      None::<String>,
-    }))
-}
-
-/// Pull the value of `<meta property="og:{prop}" content="...">`.
-fn og(html: &str, prop: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
-    });
-    for c in re.captures_iter(html) {
-        if c.get(1).is_some_and(|m| m.as_str() == prop) {
-            return c.get(2).map(|m| m.as_str().to_string());
-        }
-    }
-    None
-}
-
-/// Pull the value of any `<meta property="..." content="...">` tag.
-/// Needed for namespaced OG variants like `product:price:amount` that
-/// the simple `og:*` matcher above doesn't cover.
-fn meta_property(html: &str, prop: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+property="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
-    });
-    for c in re.captures_iter(html) {
-        if c.get(1).is_some_and(|m| m.as_str() == prop) {
-            return c.get(2).map(|m| m.as_str().to_string());
-        }
-    }
-    None
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-    use serde_json::json;
-
-    #[test]
-    fn matches_any_http_url_with_host() {
-        assert!(matches("https://www.allbirds.com/products/tree-runner"));
-        assert!(matches(
-            "https://www.warbyparker.com/eyeglasses/women/percey/jet-black-with-polished-gold"
-        ));
-        assert!(matches("https://example.com/p/widget"));
-        assert!(matches("http://shop.example.com/foo/bar"));
-    }
-
-    #[test]
-    fn rejects_empty_or_non_http() {
-        assert!(!matches(""));
-        assert!(!matches("not-a-url"));
-        assert!(!matches("ftp://example.com/file"));
-    }
-
-    #[test]
-    fn find_product_walks_graph() {
-        let block = json!({
-            "@context": "https://schema.org",
-            "@graph": [
-                {"@type": "Organization", "name": "ACME"},
-                {"@type": "Product", "name": "Widget", "sku": "ABC"}
-            ]
-        });
-        let blocks = vec![block];
-        let p = find_product(&blocks).unwrap();
-        assert_eq!(p.get("name").and_then(|v| v.as_str()), Some("Widget"));
-    }
-
-    #[test]
-    fn find_product_handles_array_type() {
-        let block = json!({
-            "@type": ["Product", "Clothing"],
-            "name": "Tee"
-        });
-        assert!(is_product_type(&block));
-    }
-
-    #[test]
-    fn get_brand_from_string_or_object() {
-        assert_eq!(get_brand(&json!({"brand": "ACME"})), Some("ACME".into()));
-        assert_eq!(
-            get_brand(&json!({"brand": {"@type": "Brand", "name": "ACME"}})),
-            Some("ACME".into())
-        );
-    }
-
-    #[test]
-    fn collect_offers_handles_single_and_aggregate() {
-        let p = json!({
-            "offers": {
-                "@type": "Offer",
-                "price": "19.99",
-                "priceCurrency": "USD",
-                "availability": "https://schema.org/InStock"
-            }
-        });
-        let offers = collect_offers(&p);
-        assert_eq!(offers.len(), 1);
-        assert_eq!(
-            offers[0].get("price").and_then(|v| v.as_str()),
-            Some("19.99")
-        );
-        assert_eq!(
-            offers[0].get("availability").and_then(|v| v.as_str()),
-            Some("InStock")
-        );
-    }
-
-    // --- OG fallback --------------------------------------------------------
-
-    #[test]
-    fn has_og_product_signal_accepts_product_type_or_price() {
-        let type_only = r#"<meta property="og:type" content="product">"#;
-        let price_only = r#"<meta property="product:price:amount" content="49.00">"#;
-        let neither = r#"<meta property="og:title" content="My Article"><meta property="og:type" content="article">"#;
-        assert!(has_og_product_signal(type_only));
-        assert!(has_og_product_signal(price_only));
-        assert!(!has_og_product_signal(neither));
-    }
-
-    #[test]
-    fn og_fallback_builds_payload_without_jsonld() {
-        let html = r##"<html><head>
-            <meta property="og:type" content="product">
-            <meta property="og:title" content="Handmade Candle">
-            <meta property="og:image" content="https://cdn.example.com/candle.jpg">
-            <meta property="og:description" content="Small-batch soy candle.">
-            <meta property="product:price:amount" content="18.00">
-            <meta property="product:price:currency" content="USD">
-            <meta property="product:availability" content="in stock">
-            <meta property="product:brand" content="Little Studio">
-        </head></html>"##;
-        let v = parse(html, "https://example.com/p/candle").unwrap();
-        assert_eq!(v["data_source"], "og_fallback");
-        assert_eq!(v["name"], "Handmade Candle");
-        assert_eq!(v["description"], "Small-batch soy candle.");
-        assert_eq!(v["brand"], "Little Studio");
-        assert_eq!(v["offers"][0]["price"], "18.00");
-        assert_eq!(v["offers"][0]["currency"], "USD");
-        assert_eq!(v["offers"][0]["availability"], "in stock");
-        assert_eq!(v["images"][0], "https://cdn.example.com/candle.jpg");
-    }
-
-    #[test]
-    fn jsonld_augments_empty_offers_with_og_price() {
-        // Patagonia-shaped page: Product JSON-LD without an Offer, plus
-        // product:price:* OG tags. We should merge.
-        let html = r##"<html><head>
-            <script type="application/ld+json">
-            {"@context":"https://schema.org","@type":"Product",
-             "name":"Better Sweater","brand":"Patagonia",
-             "aggregateRating":{"@type":"AggregateRating","ratingValue":"4.4","reviewCount":"1142"}}
-            </script>
-            <meta property="product:price:amount" content="139.00">
-            <meta property="product:price:currency" content="USD">
-        </head></html>"##;
-        let v = parse(html, "https://patagonia.com/p/x").unwrap();
-        assert_eq!(v["data_source"], "jsonld+og");
-        assert_eq!(v["name"], "Better Sweater");
-        assert_eq!(v["offers"].as_array().unwrap().len(), 1);
-        assert_eq!(v["offers"][0]["price"], "139.00");
-    }
-
-    #[test]
-    fn jsonld_only_stays_pure_jsonld() {
-        let html = r##"<html><head>
-            <script type="application/ld+json">
-            {"@type":"Product","name":"Widget",
-             "offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
-            </script>
-        </head></html>"##;
-        let v = parse(html, "https://example.com/p/w").unwrap();
-        assert_eq!(v["data_source"], "jsonld");
-        assert_eq!(v["offers"][0]["price"], "9.99");
-    }
-
-    #[test]
-    fn parse_returns_none_on_no_product_signals() {
-        let html = r#"<html><head>
-            <meta property="og:title" content="My Blog Post">
-            <meta property="og:type" content="article">
-        </head></html>"#;
-        assert!(parse(html, "https://blog.example.com/post").is_none());
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/etsy_listing.rs b/crates/webclaw-fetch/src/extractors/etsy_listing.rs
deleted file mode 100644
index ea9ed0b..0000000
--- a/crates/webclaw-fetch/src/extractors/etsy_listing.rs
+++ /dev/null
@@ -1,572 +0,0 @@
-//! Etsy listing extractor.
-//!
-//! Etsy product pages at `etsy.com/listing/{id}` (and a sluggy variant
-//! `etsy.com/listing/{id}/{slug}`) ship a Schema.org `Product` JSON-LD
-//! block with title, price, currency, availability, shop seller, and
-//! an `AggregateRating` for the listing.
-//!
-//! Etsy puts Cloudflare plus a custom WAF in front of product pages,
-//! and results vary: the Firefox profile usually gets clean HTML, but
-//! some listings return a CF interstitial. We route through
-//! `cloud::smart_fetch_html` so both paths resolve to the same parser,
-//! same as `ebay_listing`.
-//!
-//! ## URL slug as last-resort title
-//!
-//! Even with cloud antibot bypass, Etsy frequently serves a generic
-//! page with minimal metadata (`og:title = "etsy.com"`, no JSON-LD,
-//! empty markdown). In that case we humanise the slug from the URL
-//! (`/listing/{id}/personalized-stainless-steel-tumbler` becomes
-//! "Personalized Stainless Steel Tumbler") so callers always get a
-//! meaningful title. Degrades gracefully when the URL has no slug.
-
-use std::sync::OnceLock;
-
-use regex::Regex;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::cloud::{self, CloudError};
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "etsy_listing",
-    label: "Etsy listing",
-    description: "Returns listing title, price, currency, availability, shop, rating, and image. Heavy listings may need WEBCLAW_API_KEY for antibot.",
-    url_patterns: &[
-        "https://www.etsy.com/listing/{id}",
-        "https://www.etsy.com/listing/{id}/{slug}",
-        "https://www.etsy.com/{locale}/listing/{id}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if !is_etsy_host(host) {
-        return false;
-    }
-    parse_listing_id(url).is_some()
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let listing_id = parse_listing_id(url)
-        .ok_or_else(|| FetchError::Build(format!("etsy_listing: no listing id in '{url}'")))?;
-
-    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
-        .await
-        .map_err(cloud_to_fetch_err)?;
-
-    let mut data = parse(&fetched.html, url, &listing_id);
-    if let Some(obj) = data.as_object_mut() {
-        obj.insert(
-            "data_source".into(),
-            match fetched.source {
-                cloud::FetchSource::Local => json!("local"),
-                cloud::FetchSource::Cloud => json!("cloud"),
-            },
-        );
-    }
-    Ok(data)
-}
-
-pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
-    let jsonld = find_product_jsonld(html);
-    let slug_title = humanise_slug(parse_slug(url).as_deref());
-
-    let title = jsonld
-        .as_ref()
-        .and_then(|v| get_text(v, "name"))
-        .or_else(|| og(html, "title").filter(|t| !is_generic_title(t)))
-        .or(slug_title);
-    let description = jsonld
-        .as_ref()
-        .and_then(|v| get_text(v, "description"))
-        .or_else(|| og(html, "description").filter(|d| !is_generic_description(d)));
-    let image = jsonld
-        .as_ref()
-        .and_then(get_first_image)
-        .or_else(|| og(html, "image"));
-    let brand = jsonld.as_ref().and_then(get_brand);
-
-    // Etsy listings often ship either a single Offer or an
-    // AggregateOffer when the listing has variants with different prices.
-    let offer = jsonld.as_ref().and_then(first_offer);
-    let (low_price, high_price, single_price) = match offer.as_ref() {
-        Some(o) => (
-            get_text(o, "lowPrice"),
-            get_text(o, "highPrice"),
-            get_text(o, "price"),
-        ),
-        None => (None, None, None),
-    };
-    let currency = offer.as_ref().and_then(|o| get_text(o, "priceCurrency"));
-    let availability = offer
-        .as_ref()
-        .and_then(|o| get_text(o, "availability").map(strip_schema_prefix));
-    let item_condition = jsonld
-        .as_ref()
-        .and_then(|v| get_text(v, "itemCondition"))
-        .map(strip_schema_prefix);
-
-    // Shop name: offers[0].seller.name on newer listings, top-level
-    // `brand` on older listings (Etsy changed the schema around 2022).
-    // Fall back through both so either shape resolves.
-    let shop = offer
-        .as_ref()
-        .and_then(|o| {
-            o.get("seller")
-                .and_then(|s| s.get("name"))
-                .and_then(|n| n.as_str())
-                .map(String::from)
-        })
-        .or_else(|| brand.clone());
-    let shop_url = shop_url_from_html(html);
-
-    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
-
-    json!({
-        "url":              url,
-        "listing_id":       listing_id,
-        "title":            title,
-        "description":      description,
-        "image":            image,
-        "brand":            brand,
-        "price":            single_price,
-        "low_price":        low_price,
-        "high_price":       high_price,
-        "currency":         currency,
-        "availability":     availability,
-        "item_condition":   item_condition,
-        "shop":             shop,
-        "shop_url":         shop_url,
-        "aggregate_rating": aggregate_rating,
-    })
-}
-
-// ---------------------------------------------------------------------------
-// URL helpers
-// ---------------------------------------------------------------------------
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-fn is_etsy_host(host: &str) -> bool {
-    host == "etsy.com" || host == "www.etsy.com" || host.ends_with(".etsy.com")
-}
-
-/// Extract the numeric listing id. Etsy ids are 9-11 digits today but
-/// we accept any run of six or more digits right after `/listing/`.
-///
-/// Handles `/listing/{id}`, `/listing/{id}/{slug}`, and the localised
-/// `/{locale}/listing/{id}` shape (e.g. `/fr/listing/...`).
-fn parse_listing_id(url: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| Regex::new(r"/listing/(\d{6,})(?:[/?#]|$)").unwrap());
-    re.captures(url)
-        .and_then(|c| c.get(1))
-        .map(|m| m.as_str().to_string())
-}
-
-/// Extract the URL slug after the listing id, e.g.
-/// `personalized-stainless-steel-tumbler`. Returns `None` when the URL
-/// is the bare `/listing/{id}` shape.
-fn parse_slug(url: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| Regex::new(r"/listing/\d{6,}/([^/?#]+)").unwrap());
-    re.captures(url)
-        .and_then(|c| c.get(1))
-        .map(|m| m.as_str().to_string())
-}
-
-/// Turn a URL slug into a human-ish title:
-/// `personalized-stainless-steel-tumbler` → `Personalized Stainless
-/// Steel Tumbler`. Capitalises each dash- or underscore-separated
-/// token and joins them with spaces. Returns `None` on empty input.
-fn humanise_slug(slug: Option<&str>) -> Option<String> {
-    let raw = slug?.trim();
-    if raw.is_empty() {
-        return None;
-    }
-    let words: Vec<String> = raw
-        .split(['-', '_'])
-        .filter(|w| !w.is_empty())
-        .map(capitalise_word)
-        .collect();
-    if words.is_empty() {
-        None
-    } else {
-        Some(words.join(" "))
-    }
-}
-
-fn capitalise_word(w: &str) -> String {
-    let mut chars = w.chars();
-    match chars.next() {
-        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
-        None => String::new(),
-    }
-}
-
-/// True when the OG title is Etsy's fallback-page title rather than a
-/// listing-specific title. Expired / region-blocked / antibot-filtered
-/// pages return Etsy's sitewide tagline:
-/// `"Etsy - Your place to buy and sell all things handmade..."`, or
-/// simply `"etsy.com"`. A real listing title always starts with the
-/// item name, never with "Etsy - " or the domain.
-fn is_generic_title(t: &str) -> bool {
-    let normalised = t.trim().to_lowercase();
-    if matches!(
-        normalised.as_str(),
-        "etsy.com" | "etsy" | "www.etsy.com" | ""
-    ) {
-        return true;
-    }
-    // Etsy's sitewide marketing tagline, served on 404 / blocked pages.
-    if normalised.starts_with("etsy - ")
-        || normalised.starts_with("etsy.com - ")
-        || normalised.starts_with("etsy uk - ")
-    {
-        return true;
-    }
-    // Etsy's "item unavailable" placeholder, served on delisted
-    // products. Keep the slug fallback so callers still see what the
-    // URL was about.
-    normalised.starts_with("this item is unavailable")
-        || normalised.starts_with("sorry, this item is")
-        || normalised == "item not available - etsy"
-}
-
-/// True when the OG description is an Etsy error-page placeholder or
-/// sitewide marketing blurb rather than a real listing description.
-fn is_generic_description(d: &str) -> bool {
-    let normalised = d.trim().to_lowercase();
-    if normalised.is_empty() {
-        return true;
-    }
-    normalised.starts_with("sorry, the page you were looking for")
-        || normalised.starts_with("page not found")
-        || normalised.starts_with("find the perfect handmade gift")
-}
-
-// ---------------------------------------------------------------------------
-// JSON-LD walkers (same shape as ebay_listing; kept separate so the two
-// extractors can diverge without cross-impact)
-// ---------------------------------------------------------------------------
-
-fn find_product_jsonld(html: &str) -> Option<Value> {
-    let blocks = webclaw_core::structured_data::extract_json_ld(html);
-    for b in blocks {
-        if let Some(found) = find_product_in(&b) {
-            return Some(found);
-        }
-    }
-    None
-}
-
-fn find_product_in(v: &Value) -> Option<Value> {
-    if is_product_type(v) {
-        return Some(v.clone());
-    }
-    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
-        for item in graph {
-            if let Some(found) = find_product_in(item) {
-                return Some(found);
-            }
-        }
-    }
-    if let Some(arr) = v.as_array() {
-        for item in arr {
-            if let Some(found) = find_product_in(item) {
-                return Some(found);
-            }
-        }
-    }
-    None
-}
-
-fn is_product_type(v: &Value) -> bool {
-    let Some(t) = v.get("@type") else {
-        return false;
-    };
-    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
-    match t {
-        Value::String(s) => is_prod(s),
-        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
-        _ => false,
-    }
-}
-
-fn get_text(v: &Value, key: &str) -> Option<String> {
-    v.get(key).and_then(|x| match x {
-        Value::String(s) => Some(s.clone()),
-        Value::Number(n) => Some(n.to_string()),
-        _ => None,
-    })
-}
-
-fn get_brand(v: &Value) -> Option<String> {
-    let brand = v.get("brand")?;
-    if let Some(s) = brand.as_str() {
-        return Some(s.to_string());
-    }
-    brand
-        .as_object()
-        .and_then(|o| o.get("name"))
-        .and_then(|n| n.as_str())
-        .map(String::from)
-}
-
-fn get_first_image(v: &Value) -> Option<String> {
-    match v.get("image")? {
-        Value::String(s) => Some(s.clone()),
-        Value::Array(arr) => arr.iter().find_map(|x| match x {
-            Value::String(s) => Some(s.clone()),
-            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
-            _ => None,
-        }),
-        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
-        _ => None,
-    }
-}
-
-fn first_offer(v: &Value) -> Option<Value> {
-    let offers = v.get("offers")?;
-    match offers {
-        Value::Array(arr) => arr.first().cloned(),
-        Value::Object(_) => Some(offers.clone()),
-        _ => None,
-    }
-}
-
-fn get_aggregate_rating(v: &Value) -> Option<Value> {
-    let r = v.get("aggregateRating")?;
-    Some(json!({
-        "rating_value": get_text(r, "ratingValue"),
-        "review_count": get_text(r, "reviewCount"),
-        "best_rating":  get_text(r, "bestRating"),
-    }))
-}
-
-fn strip_schema_prefix(s: String) -> String {
-    s.replace("http://schema.org/", "")
-        .replace("https://schema.org/", "")
-}
-
-fn og(html: &str, prop: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
-    });
-    for c in re.captures_iter(html) {
-        if c.get(1).is_some_and(|m| m.as_str() == prop) {
-            return c.get(2).map(|m| m.as_str().to_string());
-        }
-    }
-    None
-}
-
-/// Etsy links the owning shop with a canonical anchor like
-/// `<a href="/shop/ShopName" ...>`. Grab the first such href anywhere
-/// in the HTML.
-fn shop_url_from_html(html: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| Regex::new(r#"href="(/shop/[A-Za-z0-9_-]+)""#).unwrap());
-    re.captures(html)
-        .and_then(|c| c.get(1))
-        .map(|m| format!("https://www.etsy.com{}", m.as_str()))
-}
-
-fn cloud_to_fetch_err(e: CloudError) -> FetchError {
-    FetchError::Build(e.to_string())
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_etsy_listing_urls() {
-        assert!(matches("https://www.etsy.com/listing/123456789"));
-        assert!(matches(
-            "https://www.etsy.com/listing/123456789/vintage-typewriter"
-        ));
-        assert!(matches(
-            "https://www.etsy.com/fr/listing/123456789/vintage-typewriter"
-        ));
-        assert!(!matches("https://www.etsy.com/"));
-        assert!(!matches("https://www.etsy.com/shop/SomeShop"));
-        assert!(!matches("https://example.com/listing/123456789"));
-    }
-
-    #[test]
-    fn parse_listing_id_handles_slug_and_locale() {
-        assert_eq!(
-            parse_listing_id("https://www.etsy.com/listing/123456789"),
-            Some("123456789".into())
-        );
-        assert_eq!(
-            parse_listing_id("https://www.etsy.com/listing/123456789/slug-here"),
-            Some("123456789".into())
-        );
-        assert_eq!(
-            parse_listing_id("https://www.etsy.com/fr/listing/123456789/slug"),
-            Some("123456789".into())
-        );
-        assert_eq!(
-            parse_listing_id("https://www.etsy.com/listing/123456789?ref=foo"),
-            Some("123456789".into())
-        );
-    }
-
-    #[test]
-    fn parse_extracts_from_fixture_jsonld() {
-        let html = r##"
-<html><head>
-<script type="application/ld+json">
-{"@context":"https://schema.org","@type":"Product",
- "name":"Handmade Ceramic Mug","sku":"MUG-001",
- "brand":{"@type":"Brand","name":"Studio Clay"},
- "image":["https://i.etsystatic.com/abc.jpg","https://i.etsystatic.com/xyz.jpg"],
- "itemCondition":"https://schema.org/NewCondition",
- "offers":{"@type":"Offer","price":"24.00","priceCurrency":"USD",
-           "availability":"https://schema.org/InStock",
-           "seller":{"@type":"Organization","name":"StudioClay"}},
- "aggregateRating":{"@type":"AggregateRating","ratingValue":"4.9","reviewCount":"127","bestRating":"5"}}
-</script>
-<a href="/shop/StudioClay" class="wt-text-link">StudioClay</a>
-</head></html>"##;
-        let v = parse(html, "https://www.etsy.com/listing/1", "1");
-        assert_eq!(v["title"], "Handmade Ceramic Mug");
-        assert_eq!(v["price"], "24.00");
-        assert_eq!(v["currency"], "USD");
-        assert_eq!(v["availability"], "InStock");
-        assert_eq!(v["item_condition"], "NewCondition");
-        assert_eq!(v["shop"], "StudioClay");
-        assert_eq!(v["shop_url"], "https://www.etsy.com/shop/StudioClay");
-        assert_eq!(v["brand"], "Studio Clay");
-        assert_eq!(v["aggregate_rating"]["rating_value"], "4.9");
-        assert_eq!(v["aggregate_rating"]["review_count"], "127");
-    }
-
-    #[test]
-    fn parse_handles_aggregate_offer_price_range() {
-        let html = r##"
-<script type="application/ld+json">
-{"@type":"Product","name":"Mug Set",
- "offers":{"@type":"AggregateOffer",
-           "lowPrice":"18.00","highPrice":"36.00","priceCurrency":"USD"}}
-</script>
-"##;
-        let v = parse(html, "https://www.etsy.com/listing/2", "2");
-        assert_eq!(v["low_price"], "18.00");
-        assert_eq!(v["high_price"], "36.00");
-        assert_eq!(v["currency"], "USD");
-    }
-
-    #[test]
-    fn parse_falls_back_to_og_when_no_jsonld() {
-        let html = r#"
-<html><head>
-<meta property="og:title" content="Minimal Fallback Item">
-<meta property="og:description" content="OG-only extraction test.">
-<meta property="og:image" content="https://i.etsystatic.com/fallback.jpg">
-</head></html>"#;
-        let v = parse(html, "https://www.etsy.com/listing/3", "3");
-        assert_eq!(v["title"], "Minimal Fallback Item");
-        assert_eq!(v["description"], "OG-only extraction test.");
-        assert_eq!(v["image"], "https://i.etsystatic.com/fallback.jpg");
-        // No price fields when we only have OG.
-        assert!(v["price"].is_null());
-    }
-
-    #[test]
-    fn parse_slug_from_url() {
-        assert_eq!(
-            parse_slug("https://www.etsy.com/listing/123456789/vintage-typewriter"),
-            Some("vintage-typewriter".into())
-        );
-        assert_eq!(
-            parse_slug("https://www.etsy.com/listing/123456789/slug?ref=shop"),
-            Some("slug".into())
-        );
-        assert_eq!(parse_slug("https://www.etsy.com/listing/123456789"), None);
-        assert_eq!(
-            parse_slug("https://www.etsy.com/fr/listing/123456789/slug"),
-            Some("slug".into())
-        );
-    }
-
-    #[test]
-    fn humanise_slug_capitalises_each_word() {
-        assert_eq!(
-            humanise_slug(Some("personalized-stainless-steel-tumbler")).as_deref(),
-            Some("Personalized Stainless Steel Tumbler")
-        );
-        assert_eq!(
-            humanise_slug(Some("hand_crafted_mug")).as_deref(),
-            Some("Hand Crafted Mug")
-        );
-        assert_eq!(humanise_slug(Some("")), None);
-        assert_eq!(humanise_slug(None), None);
-    }
-
-    #[test]
-    fn is_generic_title_catches_common_shapes() {
-        assert!(is_generic_title("etsy.com"));
-        assert!(is_generic_title("Etsy"));
-        assert!(is_generic_title("  etsy.com  "));
-        assert!(is_generic_title(
-            "Etsy - Your place to buy and sell all things handmade, vintage, and supplies"
-        ));
-        assert!(is_generic_title("Etsy UK - Vintage & Handmade"));
-        assert!(!is_generic_title("Vintage Typewriter"));
-        assert!(!is_generic_title("Handmade Etsy-style Mug"));
-    }
-
-    #[test]
-    fn is_generic_description_catches_404_shapes() {
-        assert!(is_generic_description(""));
-        assert!(is_generic_description(
-            "Sorry, the page you were looking for was not found."
-        ));
-        assert!(is_generic_description("Page not found"));
-        assert!(!is_generic_description(
-            "Hand-thrown ceramic mug, dishwasher safe."
-        ));
-    }
-
-    #[test]
-    fn parse_uses_slug_when_og_is_generic() {
-        // Cloud-blocked Etsy listing: og:title is a site-wide generic
-        // placeholder, no JSON-LD, no description. Slug should win.
-        let html = r#"<html><head>
-<meta property="og:title" content="etsy.com">
-</head></html>"#;
-        let v = parse(
-            html,
-            "https://www.etsy.com/listing/1079113183/personalized-stainless-steel-tumbler",
-            "1079113183",
-        );
-        assert_eq!(v["title"], "Personalized Stainless Steel Tumbler");
-    }
-
-    #[test]
-    fn parse_prefers_real_og_over_slug() {
-        let html = r#"<html><head>
-<meta property="og:title" content="Real Listing Title">
-</head></html>"#;
-        let v = parse(
-            html,
-            "https://www.etsy.com/listing/1079113183/the-url-slug",
-            "1079113183",
-        );
-        assert_eq!(v["title"], "Real Listing Title");
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/github_issue.rs b/crates/webclaw-fetch/src/extractors/github_issue.rs
deleted file mode 100644
index 9a64f21..0000000
--- a/crates/webclaw-fetch/src/extractors/github_issue.rs
+++ /dev/null
@@ -1,172 +0,0 @@
-//! GitHub issue structured extractor.
-//!
-//! Mirror of `github_pr` but on `/issues/{number}`. Uses
-//! `api.github.com/repos/{owner}/{repo}/issues/{number}`. Returns the
-//! issue body + comment count + labels + milestone + author /
-//! assignees. Full per-comment bodies would be another call; kept for
-//! a follow-up.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "github_issue",
-    label: "GitHub issue",
-    description: "Returns issue metadata: title, body, state, author, labels, assignees, milestone, comment count.",
-    url_patterns: &["https://github.com/{owner}/{repo}/issues/{number}"],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = url
-        .split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("");
-    if host != "github.com" && host != "www.github.com" {
-        return false;
-    }
-    parse_issue(url).is_some()
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let (owner, repo, number) = parse_issue(url).ok_or_else(|| {
-        FetchError::Build(format!("github_issue: cannot parse issue URL '{url}'"))
-    })?;
-
-    let api_url = format!("https://api.github.com/repos/{owner}/{repo}/issues/{number}");
-    let resp = client.fetch(&api_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "github_issue: issue '{owner}/{repo}#{number}' not found"
-        )));
-    }
-    if resp.status == 403 {
-        return Err(FetchError::Build(
-            "github_issue: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
-        ));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "github api returned status {}",
-            resp.status
-        )));
-    }
-
-    let issue: Issue = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("github issue parse: {e}")))?;
-
-    // The same endpoint returns PRs too; reject if we got one so the caller
-    // uses /v1/scrape/github_pr instead of getting a half-shaped payload.
-    if issue.pull_request.is_some() {
-        return Err(FetchError::Build(format!(
-            "github_issue: '{owner}/{repo}#{number}' is a pull request, use /v1/scrape/github_pr"
-        )));
-    }
-
-    Ok(json!({
-        "url":         url,
-        "owner":       owner,
-        "repo":        repo,
-        "number":      issue.number,
-        "title":       issue.title,
-        "body":        issue.body,
-        "state":       issue.state,
-        "state_reason":issue.state_reason,
-        "author":      issue.user.as_ref().and_then(|u| u.login.clone()),
-        "labels":      issue.labels.iter().filter_map(|l| l.name.clone()).collect::<Vec<_>>(),
-        "assignees":   issue.assignees.iter().filter_map(|u| u.login.clone()).collect::<Vec<_>>(),
-        "milestone":   issue.milestone.as_ref().and_then(|m| m.title.clone()),
-        "comments":    issue.comments,
-        "locked":      issue.locked,
-        "created_at":  issue.created_at,
-        "updated_at":  issue.updated_at,
-        "closed_at":   issue.closed_at,
-        "html_url":    issue.html_url,
-    }))
-}
-
-fn parse_issue(url: &str) -> Option<(String, String, u64)> {
-    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
-    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
-    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
-    if segs.len() < 4 || segs[2] != "issues" {
-        return None;
-    }
-    let number: u64 = segs[3].parse().ok()?;
-    Some((segs[0].to_string(), segs[1].to_string(), number))
-}
-
-// ---------------------------------------------------------------------------
-// GitHub issue API types
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct Issue {
-    number: Option<i64>,
-    title: Option<String>,
-    body: Option<String>,
-    state: Option<String>,
-    state_reason: Option<String>,
-    locked: Option<bool>,
-    comments: Option<i64>,
-    created_at: Option<String>,
-    updated_at: Option<String>,
-    closed_at: Option<String>,
-    html_url: Option<String>,
-    user: Option<UserRef>,
-    #[serde(default)]
-    labels: Vec<LabelRef>,
-    #[serde(default)]
-    assignees: Vec<UserRef>,
-    milestone: Option<Milestone>,
-    /// Present when this "issue" is actually a pull request. The REST
-    /// API overloads the issues endpoint for PRs.
-    pull_request: Option<serde_json::Value>,
-}
-
-#[derive(Deserialize)]
-struct UserRef {
-    login: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct LabelRef {
-    name: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct Milestone {
-    title: Option<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_issue_urls() {
-        assert!(matches("https://github.com/rust-lang/rust/issues/100"));
-        assert!(matches("https://github.com/rust-lang/rust/issues/100/"));
-        assert!(!matches("https://github.com/rust-lang/rust"));
-        assert!(!matches("https://github.com/rust-lang/rust/pull/100"));
-        assert!(!matches("https://github.com/rust-lang/rust/issues"));
-    }
-
-    #[test]
-    fn parse_issue_extracts_owner_repo_number() {
-        assert_eq!(
-            parse_issue("https://github.com/rust-lang/rust/issues/100"),
-            Some(("rust-lang".into(), "rust".into(), 100))
-        );
-        assert_eq!(
-            parse_issue("https://github.com/rust-lang/rust/issues/100/?foo=bar"),
-            Some(("rust-lang".into(), "rust".into(), 100))
-        );
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/github_pr.rs b/crates/webclaw-fetch/src/extractors/github_pr.rs
deleted file mode 100644
index 266d3cd..0000000
--- a/crates/webclaw-fetch/src/extractors/github_pr.rs
+++ /dev/null
@@ -1,189 +0,0 @@
-//! GitHub pull request structured extractor.
-//!
-//! Uses `api.github.com/repos/{owner}/{repo}/pulls/{number}`. Returns
-//! the PR metadata + a counted summary of comments and review activity.
-//! Full diff and per-comment bodies require additional calls — left for
-//! a follow-up enhancement so the v1 stays one network round-trip.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "github_pr",
-    label: "GitHub pull request",
-    description: "Returns PR metadata: title, body, state, author, labels, additions/deletions, file count.",
-    url_patterns: &["https://github.com/{owner}/{repo}/pull/{number}"],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = url
-        .split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("");
-    if host != "github.com" && host != "www.github.com" {
-        return false;
-    }
-    parse_pr(url).is_some()
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let (owner, repo, number) = parse_pr(url).ok_or_else(|| {
-        FetchError::Build(format!("github_pr: cannot parse pull-request URL '{url}'"))
-    })?;
-
-    let api_url = format!("https://api.github.com/repos/{owner}/{repo}/pulls/{number}");
-    let resp = client.fetch(&api_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "github_pr: pull request '{owner}/{repo}#{number}' not found"
-        )));
-    }
-    if resp.status == 403 {
-        return Err(FetchError::Build(
-            "github_pr: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
-        ));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "github api returned status {}",
-            resp.status
-        )));
-    }
-
-    let p: PullRequest = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("github pr parse: {e}")))?;
-
-    Ok(json!({
-        "url":            url,
-        "owner":          owner,
-        "repo":           repo,
-        "number":         p.number,
-        "title":          p.title,
-        "body":           p.body,
-        "state":          p.state,
-        "draft":          p.draft,
-        "merged":         p.merged,
-        "merged_at":      p.merged_at,
-        "merge_commit_sha": p.merge_commit_sha,
-        "author":         p.user.as_ref().and_then(|u| u.login.clone()),
-        "labels":         p.labels.iter().filter_map(|l| l.name.clone()).collect::<Vec<_>>(),
-        "milestone":      p.milestone.as_ref().and_then(|m| m.title.clone()),
-        "head_ref":       p.head.as_ref().and_then(|r| r.ref_name.clone()),
-        "base_ref":       p.base.as_ref().and_then(|r| r.ref_name.clone()),
-        "head_sha":       p.head.as_ref().and_then(|r| r.sha.clone()),
-        "additions":      p.additions,
-        "deletions":      p.deletions,
-        "changed_files":  p.changed_files,
-        "commits":        p.commits,
-        "comments":       p.comments,
-        "review_comments":p.review_comments,
-        "created_at":     p.created_at,
-        "updated_at":     p.updated_at,
-        "closed_at":      p.closed_at,
-        "html_url":       p.html_url,
-    }))
-}
-
-fn parse_pr(url: &str) -> Option<(String, String, u64)> {
-    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
-    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
-    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
-    // /{owner}/{repo}/pull/{number} (or /pulls/{number} variant)
-    if segs.len() < 4 {
-        return None;
-    }
-    if segs[2] != "pull" && segs[2] != "pulls" {
-        return None;
-    }
-    let number: u64 = segs[3].parse().ok()?;
-    Some((segs[0].to_string(), segs[1].to_string(), number))
-}
-
-// ---------------------------------------------------------------------------
-// GitHub PR API types
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct PullRequest {
-    number: Option<i64>,
-    title: Option<String>,
-    body: Option<String>,
-    state: Option<String>,
-    draft: Option<bool>,
-    merged: Option<bool>,
-    merged_at: Option<String>,
-    merge_commit_sha: Option<String>,
-    user: Option<UserRef>,
-    #[serde(default)]
-    labels: Vec<LabelRef>,
-    milestone: Option<Milestone>,
-    head: Option<GitRef>,
-    base: Option<GitRef>,
-    additions: Option<i64>,
-    deletions: Option<i64>,
-    changed_files: Option<i64>,
-    commits: Option<i64>,
-    comments: Option<i64>,
-    review_comments: Option<i64>,
-    created_at: Option<String>,
-    updated_at: Option<String>,
-    closed_at: Option<String>,
-    html_url: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct UserRef {
-    login: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct LabelRef {
-    name: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct Milestone {
-    title: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct GitRef {
-    #[serde(rename = "ref")]
-    ref_name: Option<String>,
-    sha: Option<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_pr_urls() {
-        assert!(matches("https://github.com/rust-lang/rust/pull/12345"));
-        assert!(matches(
-            "https://github.com/rust-lang/rust/pull/12345/files"
-        ));
-        assert!(!matches("https://github.com/rust-lang/rust"));
-        assert!(!matches("https://github.com/rust-lang/rust/issues/100"));
-        assert!(!matches("https://github.com/rust-lang"));
-    }
-
-    #[test]
-    fn parse_pr_extracts_owner_repo_number() {
-        assert_eq!(
-            parse_pr("https://github.com/rust-lang/rust/pull/12345"),
-            Some(("rust-lang".into(), "rust".into(), 12345))
-        );
-        assert_eq!(
-            parse_pr("https://github.com/rust-lang/rust/pull/12345/files"),
-            Some(("rust-lang".into(), "rust".into(), 12345))
-        );
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/github_release.rs b/crates/webclaw-fetch/src/extractors/github_release.rs
deleted file mode 100644
index 7699d09..0000000
--- a/crates/webclaw-fetch/src/extractors/github_release.rs
+++ /dev/null
@@ -1,179 +0,0 @@
-//! GitHub release structured extractor.
-//!
-//! `api.github.com/repos/{owner}/{repo}/releases/tags/{tag}`. Returns
-//! the release notes body, asset list with download counts, and
-//! prerelease flag.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "github_release",
-    label: "GitHub release",
-    description: "Returns release metadata: tag, name, body (release notes), assets with download counts.",
-    url_patterns: &["https://github.com/{owner}/{repo}/releases/tag/{tag}"],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = url
-        .split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("");
-    if host != "github.com" && host != "www.github.com" {
-        return false;
-    }
-    parse_release(url).is_some()
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let (owner, repo, tag) = parse_release(url).ok_or_else(|| {
-        FetchError::Build(format!("github_release: cannot parse release URL '{url}'"))
-    })?;
-
-    let api_url = format!("https://api.github.com/repos/{owner}/{repo}/releases/tags/{tag}");
-    let resp = client.fetch(&api_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "github_release: release '{owner}/{repo}@{tag}' not found"
-        )));
-    }
-    if resp.status == 403 {
-        return Err(FetchError::Build(
-            "github_release: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour."
-                .into(),
-        ));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "github api returned status {}",
-            resp.status
-        )));
-    }
-
-    let r: Release = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("github release parse: {e}")))?;
-
-    let assets: Vec<Value> = r
-        .assets
-        .iter()
-        .map(|a| {
-            json!({
-                "name": a.name,
-                "size": a.size,
-                "download_count": a.download_count,
-                "browser_download_url": a.browser_download_url,
-                "content_type": a.content_type,
-                "created_at": a.created_at,
-                "updated_at": a.updated_at,
-            })
-        })
-        .collect();
-
-    Ok(json!({
-        "url":           url,
-        "owner":         owner,
-        "repo":          repo,
-        "tag_name":      r.tag_name,
-        "name":          r.name,
-        "body":          r.body,
-        "draft":         r.draft,
-        "prerelease":    r.prerelease,
-        "author":        r.author.as_ref().and_then(|u| u.login.clone()),
-        "created_at":    r.created_at,
-        "published_at":  r.published_at,
-        "asset_count":   assets.len(),
-        "total_downloads": r.assets.iter().map(|a| a.download_count.unwrap_or(0)).sum::<i64>(),
-        "assets":        assets,
-        "html_url":      r.html_url,
-    }))
-}
-
-fn parse_release(url: &str) -> Option<(String, String, String)> {
-    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
-    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
-    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
-    // /{owner}/{repo}/releases/tag/{tag}
-    if segs.len() < 5 {
-        return None;
-    }
-    if segs[2] != "releases" || segs[3] != "tag" {
-        return None;
-    }
-    Some((
-        segs[0].to_string(),
-        segs[1].to_string(),
-        segs[4].to_string(),
-    ))
-}
-
-// ---------------------------------------------------------------------------
-// GitHub Release API types
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct Release {
-    tag_name: Option<String>,
-    name: Option<String>,
-    body: Option<String>,
-    draft: Option<bool>,
-    prerelease: Option<bool>,
-    author: Option<UserRef>,
-    created_at: Option<String>,
-    published_at: Option<String>,
-    html_url: Option<String>,
-    #[serde(default)]
-    assets: Vec<Asset>,
-}
-
-#[derive(Deserialize)]
-struct UserRef {
-    login: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct Asset {
-    name: Option<String>,
-    size: Option<i64>,
-    download_count: Option<i64>,
-    browser_download_url: Option<String>,
-    content_type: Option<String>,
-    created_at: Option<String>,
-    updated_at: Option<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_release_urls() {
-        assert!(matches(
-            "https://github.com/rust-lang/rust/releases/tag/1.85.0"
-        ));
-        assert!(matches(
-            "https://github.com/0xMassi/webclaw/releases/tag/v0.4.0"
-        ));
-        assert!(!matches("https://github.com/rust-lang/rust"));
-        assert!(!matches("https://github.com/rust-lang/rust/releases"));
-        assert!(!matches("https://github.com/rust-lang/rust/pull/100"));
-    }
-
-    #[test]
-    fn parse_release_extracts_owner_repo_tag() {
-        assert_eq!(
-            parse_release("https://github.com/0xMassi/webclaw/releases/tag/v0.4.0"),
-            Some(("0xMassi".into(), "webclaw".into(), "v0.4.0".into()))
-        );
-        assert_eq!(
-            parse_release("https://github.com/rust-lang/rust/releases/tag/1.85.0/?foo=bar"),
-            Some(("rust-lang".into(), "rust".into(), "1.85.0".into()))
-        );
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/github_repo.rs b/crates/webclaw-fetch/src/extractors/github_repo.rs
deleted file mode 100644
index 2a62aa3..0000000
--- a/crates/webclaw-fetch/src/extractors/github_repo.rs
+++ /dev/null
@@ -1,212 +0,0 @@
-//! GitHub repository structured extractor.
-//!
-//! Uses GitHub's public REST API at `api.github.com/repos/{owner}/{repo}`.
-//! Unauthenticated requests get 60/hour per IP, which is fine for users
-//! self-hosting and for low-volume cloud usage. Production cloud should
-//! set a `GITHUB_TOKEN` to lift to 5,000/hour, but the extractor doesn't
-//! depend on it being set — it works open out of the box.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "github_repo",
-    label: "GitHub repository",
-    description: "Returns repo metadata: stars, forks, topics, license, default branch, recent activity.",
-    url_patterns: &["https://github.com/{owner}/{repo}"],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = url
-        .split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("");
-    if host != "github.com" && host != "www.github.com" {
-        return false;
-    }
-    // Path must be exactly /{owner}/{repo} (or with trailing slash). Reject
-    // sub-pages (issues, pulls, blob, etc.) so we don't claim URLs that the
-    // github_issue / github_pr extractors handle.
-    let path = url
-        .split("://")
-        .nth(1)
-        .and_then(|s| s.split_once('/'))
-        .map(|(_, p)| p)
-        .unwrap_or("");
-    let stripped = path
-        .split(['?', '#'])
-        .next()
-        .unwrap_or("")
-        .trim_end_matches('/');
-    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
-    segs.len() == 2 && !RESERVED_OWNERS.contains(&segs[0])
-}
-
-/// GitHub uses some top-level paths for non-repo pages.
-const RESERVED_OWNERS: &[&str] = &[
-    "settings",
-    "marketplace",
-    "explore",
-    "topics",
-    "trending",
-    "collections",
-    "events",
-    "sponsors",
-    "issues",
-    "pulls",
-    "notifications",
-    "new",
-    "organizations",
-    "login",
-    "join",
-    "search",
-    "about",
-];
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let (owner, repo) = parse_owner_repo(url).ok_or_else(|| {
-        FetchError::Build(format!("github_repo: cannot parse owner/repo from '{url}'"))
-    })?;
-
-    let api_url = format!("https://api.github.com/repos/{owner}/{repo}");
-    let resp = client.fetch(&api_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "github_repo: repo '{owner}/{repo}' not found"
-        )));
-    }
-    if resp.status == 403 {
-        return Err(FetchError::Build(
-            "github_repo: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
-        ));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "github api returned status {}",
-            resp.status
-        )));
-    }
-
-    let r: Repo = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("github api parse: {e}")))?;
-
-    Ok(json!({
-        "url":              url,
-        "owner":            r.owner.as_ref().map(|o| &o.login),
-        "name":             r.name,
-        "full_name":        r.full_name,
-        "description":      r.description,
-        "homepage":         r.homepage,
-        "language":         r.language,
-        "topics":           r.topics,
-        "license":          r.license.as_ref().and_then(|l| l.spdx_id.clone()),
-        "license_name":     r.license.as_ref().map(|l| l.name.clone()),
-        "default_branch":   r.default_branch,
-        "stars":            r.stargazers_count,
-        "forks":            r.forks_count,
-        "watchers":         r.subscribers_count,
-        "open_issues":      r.open_issues_count,
-        "size_kb":          r.size,
-        "archived":         r.archived,
-        "fork":             r.fork,
-        "is_template":      r.is_template,
-        "has_issues":       r.has_issues,
-        "has_wiki":         r.has_wiki,
-        "has_pages":        r.has_pages,
-        "has_discussions":  r.has_discussions,
-        "created_at":       r.created_at,
-        "updated_at":       r.updated_at,
-        "pushed_at":        r.pushed_at,
-        "html_url":         r.html_url,
-    }))
-}
-
-fn parse_owner_repo(url: &str) -> Option<(String, String)> {
-    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
-    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
-    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
-    let owner = segs.next()?.to_string();
-    let repo = segs.next()?.to_string();
-    Some((owner, repo))
-}
-
-// ---------------------------------------------------------------------------
-// GitHub API types — only the fields we surface
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct Repo {
-    name: Option<String>,
-    full_name: Option<String>,
-    description: Option<String>,
-    homepage: Option<String>,
-    language: Option<String>,
-    #[serde(default)]
-    topics: Vec<String>,
-    license: Option<License>,
-    default_branch: Option<String>,
-    stargazers_count: Option<i64>,
-    forks_count: Option<i64>,
-    subscribers_count: Option<i64>,
-    open_issues_count: Option<i64>,
-    size: Option<i64>,
-    archived: Option<bool>,
-    fork: Option<bool>,
-    is_template: Option<bool>,
-    has_issues: Option<bool>,
-    has_wiki: Option<bool>,
-    has_pages: Option<bool>,
-    has_discussions: Option<bool>,
-    created_at: Option<String>,
-    updated_at: Option<String>,
-    pushed_at: Option<String>,
-    html_url: Option<String>,
-    owner: Option<Owner>,
-}
-
-#[derive(Deserialize)]
-struct Owner {
-    login: String,
-}
-
-#[derive(Deserialize)]
-struct License {
-    name: String,
-    spdx_id: Option<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_repo_root_only() {
-        assert!(matches("https://github.com/rust-lang/rust"));
-        assert!(matches("https://github.com/rust-lang/rust/"));
-        assert!(!matches("https://github.com/rust-lang/rust/issues"));
-        assert!(!matches("https://github.com/rust-lang/rust/pulls/123"));
-        assert!(!matches("https://github.com/rust-lang"));
-        assert!(!matches("https://github.com/marketplace"));
-        assert!(!matches("https://github.com/topics/rust"));
-        assert!(!matches("https://example.com/foo/bar"));
-    }
-
-    #[test]
-    fn parse_owner_repo_handles_trailing_slash_and_query() {
-        assert_eq!(
-            parse_owner_repo("https://github.com/rust-lang/rust"),
-            Some(("rust-lang".into(), "rust".into()))
-        );
-        assert_eq!(
-            parse_owner_repo("https://github.com/rust-lang/rust/?tab=foo"),
-            Some(("rust-lang".into(), "rust".into()))
-        );
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/hackernews.rs b/crates/webclaw-fetch/src/extractors/hackernews.rs
deleted file mode 100644
index 91d4520..0000000
--- a/crates/webclaw-fetch/src/extractors/hackernews.rs
+++ /dev/null
@@ -1,186 +0,0 @@
-//! Hacker News structured extractor.
-//!
-//! Uses Algolia's HN API (`hn.algolia.com/api/v1/items/{id}`) which
-//! returns the full post + recursive comment tree in a single request.
-//! The official Firebase API at `hacker-news.firebaseio.com` requires
-//! one fetch per item (N+1 requests for N comments), so we'd hit a
-//! timeout or a rate limit on any non-trivial thread.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "hackernews",
-    label: "Hacker News story",
-    description: "Returns post + nested comment tree for a Hacker News item.",
-    url_patterns: &[
-        "https://news.ycombinator.com/item?id=N",
-        "https://hn.algolia.com/items/N",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = url
-        .split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("");
-    if host == "news.ycombinator.com" {
-        return url.contains("item?id=") || url.contains("item%3Fid=");
-    }
-    if host == "hn.algolia.com" {
-        return url.contains("/items/");
-    }
-    false
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let id = parse_item_id(url).ok_or_else(|| {
-        FetchError::Build(format!("hackernews: cannot parse item id from '{url}'"))
-    })?;
-
-    let api_url = format!("https://hn.algolia.com/api/v1/items/{id}");
-    let resp = client.fetch(&api_url).await?;
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "hn algolia returned status {}",
-            resp.status
-        )));
-    }
-
-    let item: AlgoliaItem = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("hn algolia parse: {e}")))?;
-
-    let post = post_json(&item);
-    let comments: Vec<Value> = item.children.iter().filter_map(comment_json).collect();
-
-    Ok(json!({
-        "url": url,
-        "post": post,
-        "comments": comments,
-    }))
-}
-
-// ---------------------------------------------------------------------------
-// Helpers
-// ---------------------------------------------------------------------------
-
-/// Pull the numeric id out of a HN URL. Handles `item?id=N` and the
-/// Algolia mirror's `/items/N` form.
-fn parse_item_id(url: &str) -> Option<u64> {
-    if let Some(after) = url.split("id=").nth(1) {
-        let n = after.split('&').next().unwrap_or(after);
-        if let Ok(id) = n.parse::<u64>() {
-            return Some(id);
-        }
-    }
-    if let Some(after) = url.split("/items/").nth(1) {
-        let n = after.split(['/', '?', '#']).next().unwrap_or(after);
-        if let Ok(id) = n.parse::<u64>() {
-            return Some(id);
-        }
-    }
-    None
-}
-
-fn post_json(item: &AlgoliaItem) -> Value {
-    json!({
-        "id":              item.id,
-        "type":            item.r#type,
-        "title":           item.title,
-        "url":             item.url,
-        "author":          item.author,
-        "points":          item.points,
-        "text":            item.text,                 // populated for ask/show/tell
-        "created_at":      item.created_at,
-        "created_at_unix": item.created_at_i,
-        "comment_count":   count_descendants(item),
-        "permalink":       item.id.map(|i| format!("https://news.ycombinator.com/item?id={i}")),
-    })
-}
-
-fn comment_json(item: &AlgoliaItem) -> Option<Value> {
-    if !matches!(item.r#type.as_deref(), Some("comment")) {
-        return None;
-    }
-    // Dead/deleted comments still appear in the tree; surface them honestly.
-    let replies: Vec<Value> = item.children.iter().filter_map(comment_json).collect();
-    Some(json!({
-        "id":              item.id,
-        "author":          item.author,
-        "text":            item.text,
-        "created_at":      item.created_at,
-        "created_at_unix": item.created_at_i,
-        "parent_id":       item.parent_id,
-        "story_id":        item.story_id,
-        "replies":         replies,
-    }))
-}
-
-fn count_descendants(item: &AlgoliaItem) -> usize {
-    item.children
-        .iter()
-        .filter(|c| matches!(c.r#type.as_deref(), Some("comment")))
-        .map(|c| 1 + count_descendants(c))
-        .sum()
-}
-
-// ---------------------------------------------------------------------------
-// Algolia API types
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct AlgoliaItem {
-    id: Option<u64>,
-    r#type: Option<String>,
-    title: Option<String>,
-    url: Option<String>,
-    author: Option<String>,
-    points: Option<i64>,
-    text: Option<String>,
-    created_at: Option<String>,
-    created_at_i: Option<i64>,
-    parent_id: Option<u64>,
-    story_id: Option<u64>,
-    #[serde(default)]
-    children: Vec<AlgoliaItem>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_hn_item_urls() {
-        assert!(matches("https://news.ycombinator.com/item?id=1"));
-        assert!(matches("https://news.ycombinator.com/item?id=12345"));
-        assert!(matches("https://hn.algolia.com/items/1"));
-    }
-
-    #[test]
-    fn rejects_non_item_urls() {
-        assert!(!matches("https://news.ycombinator.com/"));
-        assert!(!matches("https://news.ycombinator.com/news"));
-        assert!(!matches("https://example.com/item?id=1"));
-    }
-
-    #[test]
-    fn parse_item_id_handles_both_forms() {
-        assert_eq!(
-            parse_item_id("https://news.ycombinator.com/item?id=1"),
-            Some(1)
-        );
-        assert_eq!(
-            parse_item_id("https://news.ycombinator.com/item?id=12345&p=2"),
-            Some(12345)
-        );
-        assert_eq!(parse_item_id("https://hn.algolia.com/items/999"), Some(999));
-        assert_eq!(parse_item_id("https://example.com/foo"), None);
-    }
-}
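Review note on the removed `parse_item_id`: it splits on the bare substring `id=`, so an earlier parameter whose name merely ends in `id` (for example `uid=7&id=12345`) is read first and its value is returned as the item id. The sketch below shows one way to avoid that by matching the query key exactly. `parse_item_id_strict` is a hypothetical name, not something from this codebase, and the sketch deliberately does not handle the percent-encoded `item%3Fid=` form that `matches` accepts.

```rust
/// Hypothetical stricter variant of `parse_item_id` (illustrative only).
/// Accepts an exact `id=N` query pair or an `/items/N` path segment.
fn parse_item_id_strict(url: &str) -> Option<u64> {
    // Drop the fragment, then separate the path from the query string.
    let no_fragment = url.split('#').next().unwrap_or(url);
    let (path, query) = match no_fragment.split_once('?') {
        Some((p, q)) => (p, q),
        None => (no_fragment, ""),
    };

    // 1. Look for a query pair whose key is exactly `id`.
    for pair in query.split('&') {
        if let Some(("id", value)) = pair.split_once('=') {
            if let Ok(id) = value.parse::<u64>() {
                return Some(id);
            }
        }
    }

    // 2. Fall back to the Algolia mirror's `/items/N` path form.
    let after = path.split("/items/").nth(1)?;
    after.split('/').next()?.parse::<u64>().ok()
}
```

With this variant, `item?uid=7&id=12345` resolves to `12345`, whereas the substring split in the removed code would yield `7`.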
diff --git a/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs b/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs
deleted file mode 100644
index e1f84f7..0000000
--- a/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs
+++ /dev/null
@@ -1,189 +0,0 @@
-//! HuggingFace dataset structured extractor.
-//!
-//! Same shape as the model extractor, but hits the dataset endpoint at
-//! `huggingface.co/api/datasets/{owner}/{name}`.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "huggingface_dataset",
-    label: "HuggingFace dataset",
-    description: "Returns dataset metadata: downloads, likes, license, language, task categories, file list.",
-    url_patterns: &["https://huggingface.co/datasets/{owner}/{name}"],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if host != "huggingface.co" && host != "www.huggingface.co" {
-        return false;
-    }
-    let path = url
-        .split("://")
-        .nth(1)
-        .and_then(|s| s.split_once('/'))
-        .map(|(_, p)| p)
-        .unwrap_or("");
-    let stripped = path
-        .split(['?', '#'])
-        .next()
-        .unwrap_or("")
-        .trim_end_matches('/');
-    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
-    // /datasets/{name} (legacy top-level) or /datasets/{owner}/{name} (canonical).
-    segs.first().copied() == Some("datasets") && (segs.len() == 2 || segs.len() == 3)
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let dataset_path = parse_dataset_path(url).ok_or_else(|| {
-        FetchError::Build(format!(
-            "hf_dataset: cannot parse dataset path from '{url}'"
-        ))
-    })?;
-
-    let api_url = format!("https://huggingface.co/api/datasets/{dataset_path}");
-    let resp = client.fetch(&api_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "hf_dataset: '{dataset_path}' not found"
-        )));
-    }
-    if resp.status == 401 {
-        return Err(FetchError::Build(format!(
-            "hf_dataset: '{dataset_path}' requires authentication (gated)"
-        )));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "hf_dataset api returned status {}",
-            resp.status
-        )));
-    }
-
-    let d: DatasetInfo = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("hf_dataset parse: {e}")))?;
-
-    let files: Vec<Value> = d
-        .siblings
-        .iter()
-        .map(|s| json!({"rfilename": s.rfilename, "size": s.size}))
-        .collect();
-
-    Ok(json!({
-        "url":             url,
-        "id":              d.id,
-        "private":         d.private,
-        "gated":           d.gated,
-        "downloads":       d.downloads,
-        "downloads_all_time": d.downloads_all_time,
-        "likes":           d.likes,
-        "tags":            d.tags,
-        "license":         d.card_data.as_ref().and_then(|c| c.license.clone()),
-        "language":        d.card_data.as_ref().and_then(|c| c.language.clone()),
-        "task_categories": d.card_data.as_ref().and_then(|c| c.task_categories.clone()),
-        "size_categories": d.card_data.as_ref().and_then(|c| c.size_categories.clone()),
-        "annotations_creators": d.card_data.as_ref().and_then(|c| c.annotations_creators.clone()),
-        "configs":         d.card_data.as_ref().and_then(|c| c.configs.clone()),
-        "created_at":      d.created_at,
-        "last_modified":   d.last_modified,
-        "sha":             d.sha,
-        "file_count":      d.siblings.len(),
-        "files":           files,
-    }))
-}
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-/// Returns the part to append to the API URL — either `name` (legacy
-/// top-level dataset like `squad`) or `owner/name` (canonical form).
-fn parse_dataset_path(url: &str) -> Option<String> {
-    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
-    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
-    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
-    if segs.next() != Some("datasets") {
-        return None;
-    }
-    let first = segs.next()?.to_string();
-    match segs.next() {
-        Some(second) => Some(format!("{first}/{second}")),
-        None => Some(first),
-    }
-}
-
-#[derive(Deserialize)]
-struct DatasetInfo {
-    id: Option<String>,
-    private: Option<bool>,
-    gated: Option<serde_json::Value>,
-    downloads: Option<i64>,
-    #[serde(rename = "downloadsAllTime")]
-    downloads_all_time: Option<i64>,
-    likes: Option<i64>,
-    #[serde(default)]
-    tags: Vec<String>,
-    #[serde(rename = "createdAt")]
-    created_at: Option<String>,
-    #[serde(rename = "lastModified")]
-    last_modified: Option<String>,
-    sha: Option<String>,
-    #[serde(rename = "cardData")]
-    card_data: Option<DatasetCard>,
-    #[serde(default)]
-    siblings: Vec<Sibling>,
-}
-
-#[derive(Deserialize)]
-struct DatasetCard {
-    license: Option<serde_json::Value>,
-    language: Option<serde_json::Value>,
-    task_categories: Option<serde_json::Value>,
-    size_categories: Option<serde_json::Value>,
-    annotations_creators: Option<serde_json::Value>,
-    configs: Option<serde_json::Value>,
-}
-
-#[derive(Deserialize)]
-struct Sibling {
-    rfilename: String,
-    size: Option<i64>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_dataset_pages() {
-        assert!(matches("https://huggingface.co/datasets/squad")); // legacy top-level
-        assert!(matches("https://huggingface.co/datasets/openai/gsm8k")); // canonical owner/name
-        assert!(!matches("https://huggingface.co/openai/whisper-large-v3"));
-        assert!(!matches("https://huggingface.co/datasets/"));
-    }
-
-    #[test]
-    fn parse_dataset_path_works() {
-        assert_eq!(
-            parse_dataset_path("https://huggingface.co/datasets/squad"),
-            Some("squad".into())
-        );
-        assert_eq!(
-            parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k"),
-            Some("openai/gsm8k".into())
-        );
-        assert_eq!(
-            parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k/?lib=transformers"),
-            Some("openai/gsm8k".into())
-        );
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/huggingface_model.rs b/crates/webclaw-fetch/src/extractors/huggingface_model.rs
deleted file mode 100644
index 4c549e0..0000000
--- a/crates/webclaw-fetch/src/extractors/huggingface_model.rs
+++ /dev/null
@@ -1,223 +0,0 @@
-//! HuggingFace model card structured extractor.
-//!
-//! Uses the public model API at `huggingface.co/api/models/{owner}/{name}`.
-//! Returns metadata + the parsed model card front matter, but does not
-//! pull the full README body: READMEs are sometimes 100KB+ and the user
-//! can hit /v1/scrape if they want one as markdown.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "huggingface_model",
-    label: "HuggingFace model",
-    description: "Returns model metadata: downloads, likes, license, pipeline tag, library name, file list.",
-    url_patterns: &["https://huggingface.co/{owner}/{name}"],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if host != "huggingface.co" && host != "www.huggingface.co" {
-        return false;
-    }
-    let path = url
-        .split("://")
-        .nth(1)
-        .and_then(|s| s.split_once('/'))
-        .map(|(_, p)| p)
-        .unwrap_or("");
-    let stripped = path
-        .split(['?', '#'])
-        .next()
-        .unwrap_or("")
-        .trim_end_matches('/');
-    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
-    // /{owner}/{name} but reject HF-internal sections + sub-pages.
-    if segs.len() != 2 {
-        return false;
-    }
-    !RESERVED_NAMESPACES.contains(&segs[0])
-}
-
-const RESERVED_NAMESPACES: &[&str] = &[
-    "datasets",
-    "spaces",
-    "blog",
-    "docs",
-    "api",
-    "models",
-    "papers",
-    "pricing",
-    "tasks",
-    "join",
-    "login",
-    "settings",
-    "organizations",
-    "new",
-    "search",
-];
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let (owner, name) = parse_owner_name(url).ok_or_else(|| {
-        FetchError::Build(format!("hf model: cannot parse owner/name from '{url}'"))
-    })?;
-
-    let api_url = format!("https://huggingface.co/api/models/{owner}/{name}");
-    let resp = client.fetch(&api_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "hf model: '{owner}/{name}' not found"
-        )));
-    }
-    if resp.status == 401 {
-        return Err(FetchError::Build(format!(
-            "hf model: '{owner}/{name}' requires authentication (gated repo)"
-        )));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "hf api returned status {}",
-            resp.status
-        )));
-    }
-
-    let m: ModelInfo = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("hf api parse: {e}")))?;
-
-    // Surface a flat file list — full siblings can be hundreds of entries
-    // for big repos. We keep it as-is because callers want to know about
-    // every shard; if it bloats responses too much we'll add pagination.
-    let files: Vec<Value> = m
-        .siblings
-        .iter()
-        .map(|s| json!({"rfilename": s.rfilename, "size": s.size}))
-        .collect();
-
-    Ok(json!({
-        "url":             url,
-        "id":              m.id,
-        "model_id":        m.model_id,
-        "private":         m.private,
-        "gated":           m.gated,
-        "downloads":       m.downloads,
-        "downloads_all_time": m.downloads_all_time,
-        "likes":           m.likes,
-        "library_name":    m.library_name,
-        "pipeline_tag":    m.pipeline_tag,
-        "tags":            m.tags,
-        "license":         m.card_data.as_ref().and_then(|c| c.license.clone()),
-        "language":        m.card_data.as_ref().and_then(|c| c.language.clone()),
-        "datasets":        m.card_data.as_ref().and_then(|c| c.datasets.clone()),
-        "base_model":      m.card_data.as_ref().and_then(|c| c.base_model.clone()),
-        "model_type":      m.card_data.as_ref().and_then(|c| c.model_type.clone()),
-        "created_at":      m.created_at,
-        "last_modified":   m.last_modified,
-        "sha":             m.sha,
-        "file_count":      m.siblings.len(),
-        "files":           files,
-    }))
-}
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-fn parse_owner_name(url: &str) -> Option<(String, String)> {
-    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
-    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
-    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
-    let owner = segs.next()?.to_string();
-    let name = segs.next()?.to_string();
-    Some((owner, name))
-}
-
-// ---------------------------------------------------------------------------
-// HF API types
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct ModelInfo {
-    id: Option<String>,
-    #[serde(rename = "modelId")]
-    model_id: Option<String>,
-    private: Option<bool>,
-    gated: Option<serde_json::Value>, // bool or string ("auto" / "manual" / false)
-    downloads: Option<i64>,
-    #[serde(rename = "downloadsAllTime")]
-    downloads_all_time: Option<i64>,
-    likes: Option<i64>,
-    #[serde(rename = "library_name")]
-    library_name: Option<String>,
-    #[serde(rename = "pipeline_tag")]
-    pipeline_tag: Option<String>,
-    #[serde(default)]
-    tags: Vec<String>,
-    #[serde(rename = "createdAt")]
-    created_at: Option<String>,
-    #[serde(rename = "lastModified")]
-    last_modified: Option<String>,
-    sha: Option<String>,
-    #[serde(rename = "cardData")]
-    card_data: Option<CardData>,
-    #[serde(default)]
-    siblings: Vec<Sibling>,
-}
-
-#[derive(Deserialize)]
-struct CardData {
-    license: Option<serde_json::Value>, // string or array
-    language: Option<serde_json::Value>,
-    datasets: Option<serde_json::Value>,
-    #[serde(rename = "base_model")]
-    base_model: Option<serde_json::Value>,
-    #[serde(rename = "model_type")]
-    model_type: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct Sibling {
-    rfilename: String,
-    size: Option<i64>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_model_pages() {
-        assert!(matches("https://huggingface.co/meta-llama/Meta-Llama-3-8B"));
-        assert!(matches("https://huggingface.co/openai/whisper-large-v3"));
-        assert!(matches("https://huggingface.co/bert-base-uncased/main")); // owner=bert-base-uncased name=main: false positive but acceptable for v1
-    }
-
-    #[test]
-    fn rejects_hf_section_pages() {
-        assert!(!matches("https://huggingface.co/datasets/squad"));
-        assert!(!matches("https://huggingface.co/spaces/foo/bar"));
-        assert!(!matches("https://huggingface.co/blog/intro"));
-        assert!(!matches("https://huggingface.co/"));
-        assert!(!matches("https://huggingface.co/meta-llama"));
-    }
-
-    #[test]
-    fn parse_owner_name_pulls_both() {
-        assert_eq!(
-            parse_owner_name("https://huggingface.co/meta-llama/Meta-Llama-3-8B"),
-            Some(("meta-llama".into(), "Meta-Llama-3-8B".into()))
-        );
-        assert_eq!(
-            parse_owner_name("https://huggingface.co/openai/whisper-large-v3?library=transformers"),
-            Some(("openai".into(), "whisper-large-v3".into()))
-        );
-    }
-}
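Review note: the scheme/host/query stripping in `matches` and in the `parse_*` helpers is repeated almost verbatim across the HuggingFace and Instagram extractors in this diff. A shared helper along the following lines could replace those copies. This is a minimal sketch under assumptions: `path_segments` is a hypothetical name, and unlike the removed code it cuts the query string and fragment before looking for the first `/`, so a slash inside a query value is never mistaken for part of the path.

```rust
/// Hypothetical shared helper (illustrative only): the non-empty path
/// segments of a URL, with scheme, host, query string, and fragment removed.
fn path_segments(url: &str) -> Vec<&str> {
    // Everything after `scheme://`, or the whole input if there is no scheme.
    let after_scheme = url.split("://").nth(1).unwrap_or(url);
    // Cut the query string and fragment before splitting on `/`.
    let no_query = after_scheme.split(['?', '#']).next().unwrap_or("");
    no_query
        .split('/')
        .skip(1) // the first piece is the host
        .filter(|s| !s.is_empty())
        .collect()
}
```

A model URL would then match when `path_segments(url)` has exactly two entries and the first is not a reserved namespace; a dataset URL when the first entry is `datasets` and one or two entries follow.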
diff --git a/crates/webclaw-fetch/src/extractors/instagram_post.rs b/crates/webclaw-fetch/src/extractors/instagram_post.rs
deleted file mode 100644
index 8847e36..0000000
--- a/crates/webclaw-fetch/src/extractors/instagram_post.rs
+++ /dev/null
@@ -1,235 +0,0 @@
-//! Instagram post structured extractor.
-//!
-//! Uses Instagram's public embed endpoint
-//! `/p/{shortcode}/embed/captioned/` which returns SSR HTML with the
-//! full caption, author username, and thumbnail. No auth required.
-//! The same `/p/` endpoint also serves reels and IGTV, so `/reel/{code}`
-//! and `/tv/{code}` URLs are accepted as well and routed to it.
-
-use regex::Regex;
-use serde_json::{Value, json};
-use std::sync::OnceLock;
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "instagram_post",
-    label: "Instagram post",
-    description: "Returns full caption, author username, thumbnail, and post type (post / reel / tv) via Instagram's public embed.",
-    url_patterns: &[
-        "https://www.instagram.com/p/{shortcode}/",
-        "https://www.instagram.com/reel/{shortcode}/",
-        "https://www.instagram.com/tv/{shortcode}/",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if !matches!(host, "www.instagram.com" | "instagram.com") {
-        return false;
-    }
-    parse_shortcode(url).is_some()
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let (kind, shortcode) = parse_shortcode(url).ok_or_else(|| {
-        FetchError::Build(format!(
-            "instagram_post: cannot parse shortcode from '{url}'"
-        ))
-    })?;
-
-    // Instagram serves the same embed HTML for posts/reels/tv under /p/.
-    let embed_url = format!("https://www.instagram.com/p/{shortcode}/embed/captioned/");
-    let resp = client.fetch(&embed_url).await?;
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "instagram embed returned status {} for {shortcode}",
-            resp.status
-        )));
-    }
-
-    let html = &resp.html;
-    let username = parse_username(html);
-    let caption = parse_caption(html);
-    let thumbnail = parse_thumbnail(html);
-
-    Ok(json!({
-        "url":               url,
-        "embed_url":         embed_url,
-        "shortcode":         shortcode,
-        "kind":              kind,
-        "data_completeness": "embed",
-        "author_username":   username,
-        "caption":           caption,
-        "thumbnail_url":     thumbnail,
-        "canonical_url":     format!("https://www.instagram.com/{}/{shortcode}/", path_segment_for(kind)),
-    }))
-}
-
-// ---------------------------------------------------------------------------
-// URL parsing
-// ---------------------------------------------------------------------------
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-/// Returns `(kind, shortcode)` where kind ∈ {`post`, `reel`, `tv`}.
-fn parse_shortcode(url: &str) -> Option<(&'static str, String)> {
-    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
-    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
-    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
-    let first = segs.next()?;
-    let kind = match first {
-        "p" => "post",
-        "reel" | "reels" => "reel",
-        "tv" => "tv",
-        _ => return None,
-    };
-    let shortcode = segs.next()?;
-    if shortcode.is_empty() {
-        return None;
-    }
-    Some((kind, shortcode.to_string()))
-}
-
-fn path_segment_for(kind: &str) -> &'static str {
-    match kind {
-        "reel" => "reel",
-        "tv" => "tv",
-        _ => "p",
-    }
-}
-
-// ---------------------------------------------------------------------------
-// HTML scraping
-// ---------------------------------------------------------------------------
-
-/// Username appears as the anchor text inside `<a class="CaptionUsername">`.
-fn parse_username(html: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| Regex::new(r#"(?s)class="CaptionUsername"[^>]*>([^<]+)<"#).unwrap());
-    re.captures(html)
-        .and_then(|c| c.get(1))
-        .map(|m| html_decode(m.as_str().trim()))
-}
-
-/// Caption sits inside `<div class="Caption">` after the username anchor.
-/// We grab the whole Caption block, strip out the username link, then
-/// drop every remaining tag and collapse whitespace in what is left.
-fn parse_caption(html: &str) -> Option<String> {
-    static RE_OUTER: OnceLock<Regex> = OnceLock::new();
-    let outer = RE_OUTER
-        .get_or_init(|| Regex::new(r#"(?s)<div\s+class="Caption"[^>]*>(.*?)</div>"#).unwrap());
-    let block = outer.captures(html)?.get(1)?.as_str();
-
-    // Strip everything wrapped in <a class="CaptionUsername">...</a>.
-    static RE_USER: OnceLock<Regex> = OnceLock::new();
-    let user_re = RE_USER
-        .get_or_init(|| Regex::new(r#"(?s)<a[^>]*class="CaptionUsername"[^>]*>.*?</a>"#).unwrap());
-    let stripped = user_re.replace_all(block, "");
-
-    // Then strip any remaining tags.
-    static RE_TAGS: OnceLock<Regex> = OnceLock::new();
-    let tag_re = RE_TAGS.get_or_init(|| Regex::new(r"<[^>]+>").unwrap());
-    let text = tag_re.replace_all(&stripped, " ");
-
-    let cleaned = collapse_whitespace(&html_decode(text.trim()));
-    if cleaned.is_empty() {
-        None
-    } else {
-        Some(cleaned)
-    }
-}
-
-/// Thumbnail is the `<img class="EmbeddedMediaImage">` inside the embed
-/// (or the og:image as fallback).
-fn parse_thumbnail(html: &str) -> Option<String> {
-    static RE_IMG: OnceLock<Regex> = OnceLock::new();
-    let img_re = RE_IMG.get_or_init(|| {
-        Regex::new(r#"(?s)<img[^>]+class="[^"]*EmbeddedMediaImage[^"]*"[^>]+src="([^"]+)""#)
-            .unwrap()
-    });
-    if let Some(m) = img_re.captures(html).and_then(|c| c.get(1)) {
-        return Some(html_decode(m.as_str()));
-    }
-    static RE_OG: OnceLock<Regex> = OnceLock::new();
-    let og_re = RE_OG.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+property="og:image"[^>]+content="([^"]+)""#).unwrap()
-    });
-    og_re
-        .captures(html)
-        .and_then(|c| c.get(1))
-        .map(|m| html_decode(m.as_str()))
-}
-
-fn html_decode(s: &str) -> String {
-    s.replace("&lt;", "<")
-        .replace("&gt;", ">")
-        .replace("&quot;", "\"")
-        .replace("&#39;", "'")
-        .replace("&#064;", "@")
-        .replace("&#x2022;", "•")
-        .replace("&hellip;", "…")
-        .replace("&amp;", "&") // last, so "&amp;lt;" decodes to "&lt;", not "<"
-}
-
-fn collapse_whitespace(s: &str) -> String {
-    s.split_whitespace().collect::<Vec<_>>().join(" ")
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_post_reel_tv_urls() {
-        assert!(matches("https://www.instagram.com/p/DT-RICMjeK5/"));
-        assert!(matches(
-            "https://www.instagram.com/p/DT-RICMjeK5/?img_index=1"
-        ));
-        assert!(matches("https://www.instagram.com/reel/abc123/"));
-        assert!(matches("https://www.instagram.com/tv/abc123/"));
-        assert!(!matches("https://www.instagram.com/ticketswave"));
-        assert!(!matches("https://www.instagram.com/"));
-        assert!(!matches("https://example.com/p/abc/"));
-    }
-
-    #[test]
-    fn parse_shortcode_reads_each_kind() {
-        assert_eq!(
-            parse_shortcode("https://www.instagram.com/p/DT-RICMjeK5/?img_index=1"),
-            Some(("post", "DT-RICMjeK5".into()))
-        );
-        assert_eq!(
-            parse_shortcode("https://www.instagram.com/reel/abc123/"),
-            Some(("reel", "abc123".into()))
-        );
-        assert_eq!(
-            parse_shortcode("https://www.instagram.com/tv/abc123"),
-            Some(("tv", "abc123".into()))
-        );
-    }
-
-    #[test]
-    fn parse_username_pulls_anchor_text() {
-        let html = r#"<a class="CaptionUsername" href="...">ticketswave</a>"#;
-        assert_eq!(parse_username(html).as_deref(), Some("ticketswave"));
-    }
-
-    #[test]
-    fn parse_caption_strips_username_anchor() {
-        let html = r#"<div class="Caption"><a class="CaptionUsername" href="...">ticketswave</a> Some caption text here</div>"#;
-        assert_eq!(
-            parse_caption(html).as_deref(),
-            Some("Some caption text here")
-        );
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/instagram_profile.rs b/crates/webclaw-fetch/src/extractors/instagram_profile.rs
deleted file mode 100644
index 9a92b4c..0000000
--- a/crates/webclaw-fetch/src/extractors/instagram_profile.rs
+++ /dev/null
@@ -1,465 +0,0 @@
-//! Instagram profile structured extractor.
-//!
-//! Hits Instagram's internal `web_profile_info` endpoint at
-//! `instagram.com/api/v1/users/web_profile_info/?username=X`. The
-//! `x-ig-app-id` header is Instagram's own public web-app id (not a
-//! secret) — the same value Instagram's own JavaScript bundle sends.
-//!
-//! Returns the full profile (bio, exact follower count, verified /
-//! business flags, profile picture) plus the **12 most recent posts**
-//! with shortcodes, like counts, types, thumbnails, and caption
-//! previews. Callers can fan out to `/v1/scrape/instagram_post` per
-//! shortcode to get the full caption + media.
-//!
-//! Pagination beyond 12 requires authenticated cookies + a CSRF token;
-//! we accept that as the practical ceiling for the unauth path. The
-//! cloud (with stored sessions) can paginate later as a follow-up.
-//!
-//! Falls back to OG-tag scraping of the public profile page if the API
-//! returns a non-2xx status other than 404 (typically 401/403). Instagram
-//! has tightened this endpoint multiple times, so we keep that path warm.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "instagram_profile",
-    label: "Instagram profile",
-    description: "Returns full profile metadata + the 12 most recent posts (shortcode, url, type, likes, thumbnail).",
-    url_patterns: &["https://www.instagram.com/{username}/"],
-};
-
-/// Instagram's own public web-app identifier. Sent by their JS bundle
-/// on every API call, accepted by the unauth endpoint, not a secret.
-const IG_APP_ID: &str = "936619743392459";
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if !matches!(host, "www.instagram.com" | "instagram.com") {
-        return false;
-    }
-    let path = url
-        .split("://")
-        .nth(1)
-        .and_then(|s| s.split_once('/'))
-        .map(|(_, p)| p)
-        .unwrap_or("");
-    let stripped = path
-        .split(['?', '#'])
-        .next()
-        .unwrap_or("")
-        .trim_end_matches('/');
-    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
-    segs.len() == 1 && !RESERVED.contains(&segs[0])
-}
-
-const RESERVED: &[&str] = &[
-    "p",
-    "reel",
-    "reels",
-    "tv",
-    "explore",
-    "stories",
-    "directory",
-    "accounts",
-    "about",
-    "developer",
-    "press",
-    "api",
-    "ads",
-    "blog",
-    "fragments",
-    "terms",
-    "privacy",
-    "session",
-    "login",
-    "signup",
-];
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let username = parse_username(url).ok_or_else(|| {
-        FetchError::Build(format!(
-            "instagram_profile: cannot parse username from '{url}'"
-        ))
-    })?;
-
-    let api_url =
-        format!("https://www.instagram.com/api/v1/users/web_profile_info/?username={username}");
-    let extra_headers: &[(&str, &str)] = &[
-        ("x-ig-app-id", IG_APP_ID),
-        ("accept", "*/*"),
-        ("sec-fetch-site", "same-origin"),
-        ("x-requested-with", "XMLHttpRequest"),
-    ];
-    let resp = client.fetch_with_headers(&api_url, extra_headers).await?;
-
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "instagram_profile: '{username}' not found"
-        )));
-    }
-    // Auth wall fallback: Instagram occasionally tightens this endpoint
-    // and starts returning 401/403, or a 302 to a login page. Any other
-    // non-2xx status is treated the same way, so the caller still gets
-    // the OG tags from the public HTML page (no post list, but bio etc.).
-    if !(200..300).contains(&resp.status) {
-        return og_fallback(client, &username, url, resp.status).await;
-    }
-
-    let body: ApiResponse = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("instagram_profile parse: {e}")))?;
-    let user = body.data.user;
-
-    let recent_posts: Vec<Value> = user
-        .edge_owner_to_timeline_media
-        .as_ref()
-        .map(|m| m.edges.iter().map(|e| post_summary(&e.node)).collect())
-        .unwrap_or_default();
-
-    Ok(json!({
-        "url":               url,
-        "canonical_url":     format!("https://www.instagram.com/{username}/"),
-        "username":          user.username.unwrap_or(username),
-        "data_completeness": "api",
-        "user_id":           user.id,
-        "full_name":         user.full_name,
-        "biography":         user.biography,
-        "biography_links":   user.bio_links,
-        "external_url":      user.external_url,
-        "category":          user.category_name,
-        "follower_count":    user.edge_followed_by.map(|c| c.count),
-        "following_count":   user.edge_follow.map(|c| c.count),
-        "post_count":        user.edge_owner_to_timeline_media.as_ref().map(|m| m.count),
-        "is_verified":       user.is_verified,
-        "is_private":        user.is_private,
-        "is_business":       user.is_business_account,
-        "is_professional":   user.is_professional_account,
-        "profile_pic_url":   user.profile_pic_url_hd.or(user.profile_pic_url),
-        "recent_posts":      recent_posts,
-    }))
-}
-
-/// Build the per-post summary the caller fans out from. Includes a
-/// constructed `url` so the loop is `for p in recent_posts: scrape('instagram_post', p.url)`.
-fn post_summary(n: &MediaNode) -> Value {
-    let kind = classify(n);
-    let url = match kind {
-        "reel" => format!(
-            "https://www.instagram.com/reel/{}/",
-            n.shortcode.as_deref().unwrap_or("")
-        ),
-        _ => format!(
-            "https://www.instagram.com/p/{}/",
-            n.shortcode.as_deref().unwrap_or("")
-        ),
-    };
-    let caption = n
-        .edge_media_to_caption
-        .as_ref()
-        .and_then(|c| c.edges.first())
-        .and_then(|e| e.node.text.clone());
-    json!({
-        "shortcode":     n.shortcode,
-        "url":           url,
-        "kind":          kind,
-        "is_video":      n.is_video.unwrap_or(false),
-        "video_views":   n.video_view_count,
-        "thumbnail_url": n.thumbnail_src.clone().or_else(|| n.display_url.clone()),
-        "display_url":   n.display_url,
-        "like_count":    n.edge_media_preview_like.as_ref().map(|c| c.count),
-        "comment_count": n.edge_media_to_comment.as_ref().map(|c| c.count),
-        "taken_at":      n.taken_at_timestamp,
-        "caption":       caption,
-        "alt_text":      n.accessibility_caption,
-        "dimensions":    n.dimensions.as_ref().map(|d| json!({"width": d.width, "height": d.height})),
-        "product_type":  n.product_type,
-    })
-}
-
-/// Best-effort post-type classification. `clips` is reels; `feed` is
-/// the regular grid. Sidecar = multi-photo carousel.
-fn classify(n: &MediaNode) -> &'static str {
-    if n.product_type.as_deref() == Some("clips") {
-        return "reel";
-    }
-    match n.typename.as_deref() {
-        Some("GraphSidecar") => "carousel",
-        Some("GraphVideo") => "video",
-        Some("GraphImage") => "photo",
-        _ => "post",
-    }
-}
-
-/// Fallback when the API path is blocked: hit the public profile HTML,
-/// pull whatever OG tags we can. Returns less data and explicitly
-/// flags `data_completeness: "og_only"` so callers know.
-async fn og_fallback(
-    client: &dyn Fetcher,
-    username: &str,
-    original_url: &str,
-    api_status: u16,
-) -> Result<Value, FetchError> {
-    let canonical = format!("https://www.instagram.com/{username}/");
-    let resp = client.fetch(&canonical).await?;
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "instagram_profile: api status {api_status}, html status {} for {username}",
-            resp.status
-        )));
-    }
-    let og = parse_og_tags(&resp.html);
-    let (followers, following, posts) =
-        parse_counts_from_og_description(og.get("description").map(String::as_str));
-
-    Ok(json!({
-        "url":               original_url,
-        "canonical_url":     canonical,
-        "username":          username,
-        "data_completeness": "og_only",
-        "fallback_reason":   format!("api returned {api_status}"),
-        "full_name":         parse_full_name(&og.get("title").cloned().unwrap_or_default()),
-        "follower_count":    followers,
-        "following_count":   following,
-        "post_count":        posts,
-        "profile_pic_url":   og.get("image").cloned(),
-        "biography":         null_value(),
-        "is_verified":       null_value(),
-        "is_business":       null_value(),
-        "recent_posts":      Vec::<Value>::new(),
-    }))
-}
-
-fn null_value() -> Value {
-    Value::Null
-}
-
-// ---------------------------------------------------------------------------
-// URL parsing
-// ---------------------------------------------------------------------------
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-fn parse_username(url: &str) -> Option<String> {
-    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
-    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
-    stripped
-        .split('/')
-        .find(|s| !s.is_empty())
-        .map(|s| s.to_string())
-}
-
-// ---------------------------------------------------------------------------
-// OG-fallback helpers (kept self-contained — same shape as the previous
-// version we shipped, retained as the safety net)
-// ---------------------------------------------------------------------------
-
-fn parse_og_tags(html: &str) -> std::collections::HashMap<String, String> {
-    use regex::Regex;
-    use std::sync::OnceLock;
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
-    });
-    let mut out = std::collections::HashMap::new();
-    for c in re.captures_iter(html) {
-        let k = c
-            .get(1)
-            .map(|m| m.as_str().to_lowercase())
-            .unwrap_or_default();
-        let v = c
-            .get(2)
-            .map(|m| html_decode(m.as_str()))
-            .unwrap_or_default();
-        out.entry(k).or_insert(v);
-    }
-    out
-}
-
-fn parse_full_name(og_title: &str) -> Option<String> {
-    if og_title.is_empty() {
-        return None;
-    }
-    let decoded = html_decode(og_title);
-    let trimmed = decoded.split('(').next().unwrap_or(&decoded).trim();
-    if trimmed.is_empty() {
-        None
-    } else {
-        Some(trimmed.to_string())
-    }
-}
-
-fn parse_counts_from_og_description(desc: Option<&str>) -> (Option<i64>, Option<i64>, Option<i64>) {
-    let Some(text) = desc else {
-        return (None, None, None);
-    };
-    let decoded = html_decode(text);
-    use regex::Regex;
-    use std::sync::OnceLock;
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r"(?i)([\d.,]+[KMB]?)\s*Followers,\s*([\d.,]+[KMB]?)\s*Following,\s*([\d.,]+[KMB]?)\s*Posts").unwrap()
-    });
-    if let Some(c) = re.captures(&decoded) {
-        return (
-            c.get(1).and_then(|m| parse_compact_number(m.as_str())),
-            c.get(2).and_then(|m| parse_compact_number(m.as_str())),
-            c.get(3).and_then(|m| parse_compact_number(m.as_str())),
-        );
-    }
-    (None, None, None)
-}
-
-fn parse_compact_number(s: &str) -> Option<i64> {
-    let s = s.trim();
-    let (num_str, mul) = match s.chars().last().map(|c| c.to_ascii_uppercase()) {
-        Some('K') => (&s[..s.len() - 1], 1_000i64),
-        Some('M') => (&s[..s.len() - 1], 1_000_000i64),
-        Some('B') => (&s[..s.len() - 1], 1_000_000_000i64),
-        _ => (s, 1i64),
-    };
-    let cleaned: String = num_str.chars().filter(|c| *c != ',').collect();
-    cleaned.parse::<f64>().ok().map(|f| (f * mul as f64) as i64)
-}
-
-fn html_decode(s: &str) -> String {
-    s.replace("&amp;", "&")
-        .replace("&lt;", "<")
-        .replace("&gt;", ">")
-        .replace("&quot;", "\"")
-        .replace("&#39;", "'")
-        .replace("&#064;", "@")
-        .replace("&#x2022;", "•")
-        .replace("&hellip;", "…")
-}
-
-// ---------------------------------------------------------------------------
-// Instagram web_profile_info API types
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct ApiResponse {
-    data: ApiData,
-}
-
-#[derive(Deserialize)]
-struct ApiData {
-    user: User,
-}
-
-#[derive(Deserialize)]
-struct User {
-    id: Option<String>,
-    username: Option<String>,
-    full_name: Option<String>,
-    biography: Option<String>,
-    bio_links: Option<Vec<serde_json::Value>>,
-    external_url: Option<String>,
-    category_name: Option<String>,
-    profile_pic_url: Option<String>,
-    profile_pic_url_hd: Option<String>,
-    is_verified: Option<bool>,
-    is_private: Option<bool>,
-    is_business_account: Option<bool>,
-    is_professional_account: Option<bool>,
-    edge_followed_by: Option<EdgeCount>,
-    edge_follow: Option<EdgeCount>,
-    edge_owner_to_timeline_media: Option<MediaEdges>,
-}
-
-#[derive(Deserialize)]
-struct EdgeCount {
-    count: i64,
-}
-
-#[derive(Deserialize)]
-struct MediaEdges {
-    count: i64,
-    edges: Vec<MediaEdge>,
-}
-
-#[derive(Deserialize)]
-struct MediaEdge {
-    node: MediaNode,
-}
-
-#[derive(Deserialize)]
-struct MediaNode {
-    #[serde(rename = "__typename")]
-    typename: Option<String>,
-    shortcode: Option<String>,
-    is_video: Option<bool>,
-    video_view_count: Option<i64>,
-    display_url: Option<String>,
-    thumbnail_src: Option<String>,
-    accessibility_caption: Option<String>,
-    taken_at_timestamp: Option<i64>,
-    product_type: Option<String>,
-    dimensions: Option<Dimensions>,
-    edge_media_preview_like: Option<EdgeCount>,
-    edge_media_to_comment: Option<EdgeCount>,
-    edge_media_to_caption: Option<CaptionEdges>,
-}
-
-#[derive(Deserialize)]
-struct Dimensions {
-    width: i64,
-    height: i64,
-}
-
-#[derive(Deserialize)]
-struct CaptionEdges {
-    edges: Vec<CaptionEdge>,
-}
-
-#[derive(Deserialize)]
-struct CaptionEdge {
-    node: CaptionNode,
-}
-
-#[derive(Deserialize)]
-struct CaptionNode {
-    text: Option<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_profile_urls() {
-        assert!(matches("https://www.instagram.com/ticketswave"));
-        assert!(matches("https://www.instagram.com/ticketswave/"));
-        assert!(matches("https://instagram.com/0xmassi/?hl=en"));
-        assert!(!matches("https://www.instagram.com/p/DT-RICMjeK5/"));
-        assert!(!matches("https://www.instagram.com/explore"));
-        assert!(!matches("https://www.instagram.com/"));
-        assert!(!matches("https://example.com/foo"));
-    }
-
-    #[test]
-    fn parse_full_name_strips_handle() {
-        assert_eq!(
-            parse_full_name("Ticket Wave (&#064;ticketswave) &#x2022; Instagram photos and videos"),
-            Some("Ticket Wave".into())
-        );
-    }
-
-    #[test]
-    fn compact_number_handles_kmb() {
-        assert_eq!(parse_compact_number("18K"), Some(18_000));
-        assert_eq!(parse_compact_number("1.5M"), Some(1_500_000));
-        assert_eq!(parse_compact_number("1,234"), Some(1_234));
-        assert_eq!(parse_compact_number("641"), Some(641));
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/linkedin_post.rs b/crates/webclaw-fetch/src/extractors/linkedin_post.rs
deleted file mode 100644
index ed7e07b..0000000
--- a/crates/webclaw-fetch/src/extractors/linkedin_post.rs
+++ /dev/null
@@ -1,266 +0,0 @@
-//! LinkedIn post structured extractor.
-//!
-//! Uses the public embed endpoint `/embed/feed/update/{urn}` which
-//! LinkedIn provides for sites that want to render a post inline. No
-//! auth required, returns SSR HTML with the full post body, OG tags,
-//! image, and a link back to the original post.
-//!
-//! Accepts both URN forms (`urn:li:share:N` and `urn:li:activity:N`)
-//! and pretty post URLs (`/posts/{user}_{slug}-{id}-{suffix}`) by
-//! pulling the trailing numeric id and converting to an activity URN.
-
-use regex::Regex;
-use serde_json::{Value, json};
-use std::sync::OnceLock;
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "linkedin_post",
-    label: "LinkedIn post",
-    description: "Returns post body, author name, image, and original URL via LinkedIn's public embed endpoint.",
-    url_patterns: &[
-        "https://www.linkedin.com/feed/update/urn:li:share:{id}",
-        "https://www.linkedin.com/feed/update/urn:li:activity:{id}",
-        "https://www.linkedin.com/posts/{user}_{slug}-{id}-{suffix}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if !matches!(host, "www.linkedin.com" | "linkedin.com") {
-        return false;
-    }
-    url.contains("/feed/update/urn:li:") || url.contains("/posts/")
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let urn = extract_urn(url).ok_or_else(|| {
-        FetchError::Build(format!(
-            "linkedin_post: cannot extract URN from '{url}' (expected /feed/update/urn:li:... or /posts/{{slug}}-{{id}})"
-        ))
-    })?;
-
-    let embed_url = format!("https://www.linkedin.com/embed/feed/update/{urn}");
-    let resp = client.fetch(&embed_url).await?;
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "linkedin embed returned status {} for {urn}",
-            resp.status
-        )));
-    }
-
-    let html = &resp.html;
-    let og = parse_og_tags(html);
-    let body = parse_post_body(html);
-    let author = parse_author(html);
-    let canonical_url = og.get("url").cloned().unwrap_or_else(|| embed_url.clone());
-
-    Ok(json!({
-        "url":               url,
-        "embed_url":         embed_url,
-        "urn":               urn,
-        "canonical_url":     canonical_url,
-        "data_completeness": "embed",
-        "title":             og.get("title").cloned(),
-        "body":              body,
-        "author_name":       author,
-        "image_url":         og.get("image").cloned(),
-        "site_name":         og.get("site_name").cloned().unwrap_or_else(|| "LinkedIn".into()),
-    }))
-}
-
-// ---------------------------------------------------------------------------
-// URN extraction
-// ---------------------------------------------------------------------------
-
-/// Pull a `urn:li:share:N` or `urn:li:activity:N` from any LinkedIn URL.
-/// `/posts/{slug}-{id}-{suffix}` URLs encode the activity id as the second-
-/// to-last `-` separated chunk. Both forms map to a URN we can hit the
-/// embed endpoint with.
-fn extract_urn(url: &str) -> Option<String> {
-    if let Some(idx) = url.find("urn:li:") {
-        let tail = &url[idx..];
-        let end = tail.find(['/', '?', '#']).unwrap_or(tail.len());
-        let urn = &tail[..end];
-        // Validate shape: urn:li:{type}:{digits}
-        let mut parts = urn.split(':');
-        if parts.next() == Some("urn")
-            && parts.next() == Some("li")
-            && parts.next().is_some()
-            && parts
-                .next()
-                .filter(|p| p.chars().all(|c| c.is_ascii_digit()))
-                .is_some()
-        {
-            return Some(urn.to_string());
-        }
-    }
-
-    // /posts/{user}_{slug}-{19-digit-id}-{4-char-hash}/: the id is the
-    // second-to-last `-`-separated segment (the regex accepts 15+ digits).
-    if url.contains("/posts/") {
-        static RE: OnceLock<Regex> = OnceLock::new();
-        let re =
-            RE.get_or_init(|| Regex::new(r"/posts/[^/]*?-(\d{15,})-[A-Za-z0-9]{2,}/?").unwrap());
-        if let Some(c) = re.captures(url)
-            && let Some(id) = c.get(1)
-        {
-            return Some(format!("urn:li:activity:{}", id.as_str()));
-        }
-    }
-    None
-}
-
-// ---------------------------------------------------------------------------
-// HTML scraping
-// ---------------------------------------------------------------------------
-
-/// Pull `og:foo` → value pairs out of `<meta property="og:..." content="...">`.
-/// Returns lowercased keys with leading `og:` stripped.
-fn parse_og_tags(html: &str) -> std::collections::HashMap<String, String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
-    });
-    let mut out = std::collections::HashMap::new();
-    for c in re.captures_iter(html) {
-        let k = c
-            .get(1)
-            .map(|m| m.as_str().to_lowercase())
-            .unwrap_or_default();
-        let v = c
-            .get(2)
-            .map(|m| html_decode(m.as_str()))
-            .unwrap_or_default();
-        out.entry(k).or_insert(v);
-    }
-    out
-}
-
-/// Extract the post body text from the embed page. LinkedIn renders it
-/// inside `<p class="attributed-text-segment-list__content ...">{text}</p>`
-/// where the inner content can include nested `<a>` tags for links.
-fn parse_post_body(html: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(
-            r#"(?s)<p[^>]+class="[^"]*attributed-text-segment-list__content[^"]*"[^>]*>(.*?)</p>"#,
-        )
-        .unwrap()
-    });
-    let inner = re.captures(html).and_then(|c| c.get(1))?.as_str();
-    Some(strip_tags(inner).trim().to_string())
-}
-
-/// Author name lives in the `<title>` like:
-///   "55 founding members are in… | Orc Dev"
-/// The chunk after the final `|` is the author display name. Returns
-/// `None` when there is no `<title>` or the title contains no `|`.
-fn parse_author(html: &str) -> Option<String> {
-    static RE_TITLE: OnceLock<Regex> = OnceLock::new();
-    let re = RE_TITLE.get_or_init(|| Regex::new(r"<title>([^<]+)</title>").unwrap());
-    let title = re.captures(html).and_then(|c| c.get(1))?.as_str();
-    title
-        .rsplit_once('|')
-        .map(|(_, name)| html_decode(name.trim()))
-}
-
-/// Replace the small set of HTML entities LinkedIn (and Instagram, etc.)
-/// stuff into OG content attributes.
-fn html_decode(s: &str) -> String {
-    s.replace("&amp;", "&")
-        .replace("&lt;", "<")
-        .replace("&gt;", ">")
-        .replace("&quot;", "\"")
-        .replace("&#39;", "'")
-        .replace("&#064;", "@")
-        .replace("&#x2022;", "•")
-        .replace("&hellip;", "…")
-}
-
-/// Crude HTML tag stripper for the post body. Preserves text inside
-/// nested anchors so link text doesn't disappear, then decodes HTML
-/// entities. Whitespace is left as-is; the caller trims the result.
-fn strip_tags(html: &str) -> String {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| Regex::new(r"<[^>]+>").unwrap());
-    let no_tags = re.replace_all(html, "").to_string();
-    html_decode(&no_tags)
-}
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_li_post_urls() {
-        assert!(matches(
-            "https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/"
-        ));
-        assert!(matches(
-            "https://www.linkedin.com/feed/update/urn:li:activity:7452618583290892288"
-        ));
-        assert!(matches(
-            "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c"
-        ));
-        assert!(!matches("https://www.linkedin.com/in/foo"));
-        assert!(!matches("https://www.linkedin.com/"));
-        assert!(!matches("https://example.com/feed/update/urn:li:share:1"));
-    }
-
-    #[test]
-    fn extract_urn_from_share_url() {
-        assert_eq!(
-            extract_urn("https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/"),
-            Some("urn:li:share:7452618582213144577".into())
-        );
-    }
-
-    #[test]
-    fn extract_urn_from_pretty_post_url() {
-        assert_eq!(
-            extract_urn(
-                "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c/"
-            ),
-            Some("urn:li:activity:7452618583290892288".into())
-        );
-    }
-
-    #[test]
-    fn parse_og_tags_basic() {
-        let html = r#"<meta property="og:image" content="https://x.com/a.png">
-<meta property="og:url" content="https://example.com/x">"#;
-        let og = parse_og_tags(html);
-        assert_eq!(
-            og.get("image").map(String::as_str),
-            Some("https://x.com/a.png")
-        );
-        assert_eq!(
-            og.get("url").map(String::as_str),
-            Some("https://example.com/x")
-        );
-    }
-
-    #[test]
-    fn parse_post_body_strips_anchor_tags() {
-        let html = r#"<p class="attributed-text-segment-list__content text-color-text" dir="ltr">Hello <a href="x">link</a> world</p>"#;
-        assert_eq!(parse_post_body(html).as_deref(), Some("Hello link world"));
-    }
-
-    #[test]
-    fn html_decode_handles_common_entities() {
-        assert_eq!(html_decode("AT&amp;T &#064;jane"), "AT&T @jane");
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/mod.rs b/crates/webclaw-fetch/src/extractors/mod.rs
deleted file mode 100644
index 91ef8d0..0000000
--- a/crates/webclaw-fetch/src/extractors/mod.rs
+++ /dev/null
@@ -1,502 +0,0 @@
-//! Vertical extractors: site-specific parsers that return typed JSON
-//! instead of generic markdown.
-//!
-//! Each extractor handles a single site or platform and exposes:
-//! - `matches(url)` to claim ownership of a URL pattern
-//! - `extract(client, url)` to fetch + parse into a typed JSON `Value`
-//! - `INFO` static for the catalog (`/v1/extractors`)
-//!
-//! The dispatch in this module is a simple `match`-style chain rather than
-//! a trait registry. With ~30 extractors that's still fast and avoids the
-//! ceremony of dynamic dispatch. If we hit 50+ we'll revisit.
-//!
-//! Extractors prefer official JSON APIs over HTML scraping where one
-//! exists (Reddit, HN/Algolia, PyPI, npm, GitHub, HuggingFace all have
-//! one). HTML extraction is the fallback for sites that don't.
-
-pub mod amazon_product;
-pub mod arxiv;
-pub mod crates_io;
-pub mod dev_to;
-pub mod docker_hub;
-pub mod ebay_listing;
-pub mod ecommerce_product;
-pub mod etsy_listing;
-pub mod github_issue;
-pub mod github_pr;
-pub mod github_release;
-pub mod github_repo;
-pub mod hackernews;
-pub mod huggingface_dataset;
-pub mod huggingface_model;
-pub mod instagram_post;
-pub mod instagram_profile;
-pub mod linkedin_post;
-pub mod npm;
-pub mod pypi;
-pub mod reddit;
-pub mod shopify_collection;
-pub mod shopify_product;
-pub mod stackoverflow;
-pub mod substack_post;
-pub mod trustpilot_reviews;
-pub mod woocommerce_product;
-pub mod youtube_video;
-
-use serde::Serialize;
-use serde_json::Value;
-
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-/// Public catalog entry for `/v1/extractors`. Stable shape — clients
-/// rely on `name` to pick the right `/v1/scrape/{name}` route.
-#[derive(Debug, Clone, Serialize)]
-pub struct ExtractorInfo {
-    /// URL-safe identifier (`reddit`, `hackernews`, `github_repo`, ...).
-    pub name: &'static str,
-    /// Human-friendly display name.
-    pub label: &'static str,
-    /// One-line description of what the extractor returns.
-    pub description: &'static str,
-    /// Glob-ish URL pattern(s) the extractor claims. For documentation;
-    /// the actual matching is done by the extractor's `matches` fn.
-    pub url_patterns: &'static [&'static str],
-}
-
-/// Full catalog. Order is stable; new entries append.
-pub fn list() -> Vec<ExtractorInfo> {
-    vec![
-        reddit::INFO,
-        hackernews::INFO,
-        github_repo::INFO,
-        github_pr::INFO,
-        github_issue::INFO,
-        github_release::INFO,
-        pypi::INFO,
-        npm::INFO,
-        crates_io::INFO,
-        huggingface_model::INFO,
-        huggingface_dataset::INFO,
-        arxiv::INFO,
-        docker_hub::INFO,
-        dev_to::INFO,
-        stackoverflow::INFO,
-        substack_post::INFO,
-        youtube_video::INFO,
-        linkedin_post::INFO,
-        instagram_post::INFO,
-        instagram_profile::INFO,
-        shopify_product::INFO,
-        shopify_collection::INFO,
-        ecommerce_product::INFO,
-        woocommerce_product::INFO,
-        amazon_product::INFO,
-        ebay_listing::INFO,
-        etsy_listing::INFO,
-        trustpilot_reviews::INFO,
-    ]
-}
-
-/// Auto-detect mode: try each auto-dispatchable extractor's `matches` and
-/// return the first one that claims the URL. Used by `/v1/scrape` when the
-/// caller doesn't pick a vertical explicitly.
-pub async fn dispatch_by_url(
-    client: &dyn Fetcher,
-    url: &str,
-) -> Option<Result<(&'static str, Value), FetchError>> {
-    if reddit::matches(url) {
-        return Some(
-            reddit::extract(client, url)
-                .await
-                .map(|v| (reddit::INFO.name, v)),
-        );
-    }
-    if hackernews::matches(url) {
-        return Some(
-            hackernews::extract(client, url)
-                .await
-                .map(|v| (hackernews::INFO.name, v)),
-        );
-    }
-    if github_repo::matches(url) {
-        return Some(
-            github_repo::extract(client, url)
-                .await
-                .map(|v| (github_repo::INFO.name, v)),
-        );
-    }
-    if pypi::matches(url) {
-        return Some(
-            pypi::extract(client, url)
-                .await
-                .map(|v| (pypi::INFO.name, v)),
-        );
-    }
-    if npm::matches(url) {
-        return Some(npm::extract(client, url).await.map(|v| (npm::INFO.name, v)));
-    }
-    if github_pr::matches(url) {
-        return Some(
-            github_pr::extract(client, url)
-                .await
-                .map(|v| (github_pr::INFO.name, v)),
-        );
-    }
-    if github_issue::matches(url) {
-        return Some(
-            github_issue::extract(client, url)
-                .await
-                .map(|v| (github_issue::INFO.name, v)),
-        );
-    }
-    if github_release::matches(url) {
-        return Some(
-            github_release::extract(client, url)
-                .await
-                .map(|v| (github_release::INFO.name, v)),
-        );
-    }
-    if crates_io::matches(url) {
-        return Some(
-            crates_io::extract(client, url)
-                .await
-                .map(|v| (crates_io::INFO.name, v)),
-        );
-    }
-    if huggingface_model::matches(url) {
-        return Some(
-            huggingface_model::extract(client, url)
-                .await
-                .map(|v| (huggingface_model::INFO.name, v)),
-        );
-    }
-    if huggingface_dataset::matches(url) {
-        return Some(
-            huggingface_dataset::extract(client, url)
-                .await
-                .map(|v| (huggingface_dataset::INFO.name, v)),
-        );
-    }
-    if arxiv::matches(url) {
-        return Some(
-            arxiv::extract(client, url)
-                .await
-                .map(|v| (arxiv::INFO.name, v)),
-        );
-    }
-    if docker_hub::matches(url) {
-        return Some(
-            docker_hub::extract(client, url)
-                .await
-                .map(|v| (docker_hub::INFO.name, v)),
-        );
-    }
-    if dev_to::matches(url) {
-        return Some(
-            dev_to::extract(client, url)
-                .await
-                .map(|v| (dev_to::INFO.name, v)),
-        );
-    }
-    if stackoverflow::matches(url) {
-        return Some(
-            stackoverflow::extract(client, url)
-                .await
-                .map(|v| (stackoverflow::INFO.name, v)),
-        );
-    }
-    if linkedin_post::matches(url) {
-        return Some(
-            linkedin_post::extract(client, url)
-                .await
-                .map(|v| (linkedin_post::INFO.name, v)),
-        );
-    }
-    if instagram_post::matches(url) {
-        return Some(
-            instagram_post::extract(client, url)
-                .await
-                .map(|v| (instagram_post::INFO.name, v)),
-        );
-    }
-    if instagram_profile::matches(url) {
-        return Some(
-            instagram_profile::extract(client, url)
-                .await
-                .map(|v| (instagram_profile::INFO.name, v)),
-        );
-    }
-    // Antibot-gated verticals with unique hosts: safe to auto-dispatch
-    // because the matcher can't confuse the URL for anything else. The
-    // extractor's smart_fetch_html path handles the blocked-without-
-    // API-key case with a clear actionable error.
-    if amazon_product::matches(url) {
-        return Some(
-            amazon_product::extract(client, url)
-                .await
-                .map(|v| (amazon_product::INFO.name, v)),
-        );
-    }
-    if ebay_listing::matches(url) {
-        return Some(
-            ebay_listing::extract(client, url)
-                .await
-                .map(|v| (ebay_listing::INFO.name, v)),
-        );
-    }
-    if etsy_listing::matches(url) {
-        return Some(
-            etsy_listing::extract(client, url)
-                .await
-                .map(|v| (etsy_listing::INFO.name, v)),
-        );
-    }
-    if trustpilot_reviews::matches(url) {
-        return Some(
-            trustpilot_reviews::extract(client, url)
-                .await
-                .map(|v| (trustpilot_reviews::INFO.name, v)),
-        );
-    }
-    if youtube_video::matches(url) {
-        return Some(
-            youtube_video::extract(client, url)
-                .await
-                .map(|v| (youtube_video::INFO.name, v)),
-        );
-    }
-    // NOTE: shopify_product, shopify_collection, ecommerce_product,
-    // woocommerce_product, and substack_post are intentionally NOT
-    // in auto-dispatch. Their `matches()` functions are permissive
-    // (any URL with `/products/`, `/product/`, `/p/`, etc.) and
-    // claiming those generically would steal URLs from the default
-    // `/v1/scrape` markdown flow. Callers opt in via
-    // `/v1/scrape/shopify_product` or `/v1/scrape/ecommerce_product`.
-    None
-}
-
-/// Explicit mode: caller picked the vertical (`POST /v1/scrape/reddit`).
-/// We still validate that the URL plausibly belongs to that vertical so
-/// users get a clear "wrong route" error instead of a confusing parse
-/// failure deep in the extractor.
-pub async fn dispatch_by_name(
-    client: &dyn Fetcher,
-    name: &str,
-    url: &str,
-) -> Result<Value, ExtractorDispatchError> {
-    match name {
-        n if n == reddit::INFO.name => {
-            run_or_mismatch(reddit::matches(url), n, url, || {
-                reddit::extract(client, url)
-            })
-            .await
-        }
-        n if n == hackernews::INFO.name => {
-            run_or_mismatch(hackernews::matches(url), n, url, || {
-                hackernews::extract(client, url)
-            })
-            .await
-        }
-        n if n == github_repo::INFO.name => {
-            run_or_mismatch(github_repo::matches(url), n, url, || {
-                github_repo::extract(client, url)
-            })
-            .await
-        }
-        n if n == pypi::INFO.name => {
-            run_or_mismatch(pypi::matches(url), n, url, || pypi::extract(client, url)).await
-        }
-        n if n == npm::INFO.name => {
-            run_or_mismatch(npm::matches(url), n, url, || npm::extract(client, url)).await
-        }
-        n if n == github_pr::INFO.name => {
-            run_or_mismatch(github_pr::matches(url), n, url, || {
-                github_pr::extract(client, url)
-            })
-            .await
-        }
-        n if n == github_issue::INFO.name => {
-            run_or_mismatch(github_issue::matches(url), n, url, || {
-                github_issue::extract(client, url)
-            })
-            .await
-        }
-        n if n == github_release::INFO.name => {
-            run_or_mismatch(github_release::matches(url), n, url, || {
-                github_release::extract(client, url)
-            })
-            .await
-        }
-        n if n == crates_io::INFO.name => {
-            run_or_mismatch(crates_io::matches(url), n, url, || {
-                crates_io::extract(client, url)
-            })
-            .await
-        }
-        n if n == huggingface_model::INFO.name => {
-            run_or_mismatch(huggingface_model::matches(url), n, url, || {
-                huggingface_model::extract(client, url)
-            })
-            .await
-        }
-        n if n == huggingface_dataset::INFO.name => {
-            run_or_mismatch(huggingface_dataset::matches(url), n, url, || {
-                huggingface_dataset::extract(client, url)
-            })
-            .await
-        }
-        n if n == arxiv::INFO.name => {
-            run_or_mismatch(arxiv::matches(url), n, url, || arxiv::extract(client, url)).await
-        }
-        n if n == docker_hub::INFO.name => {
-            run_or_mismatch(docker_hub::matches(url), n, url, || {
-                docker_hub::extract(client, url)
-            })
-            .await
-        }
-        n if n == dev_to::INFO.name => {
-            run_or_mismatch(dev_to::matches(url), n, url, || {
-                dev_to::extract(client, url)
-            })
-            .await
-        }
-        n if n == stackoverflow::INFO.name => {
-            run_or_mismatch(stackoverflow::matches(url), n, url, || {
-                stackoverflow::extract(client, url)
-            })
-            .await
-        }
-        n if n == linkedin_post::INFO.name => {
-            run_or_mismatch(linkedin_post::matches(url), n, url, || {
-                linkedin_post::extract(client, url)
-            })
-            .await
-        }
-        n if n == instagram_post::INFO.name => {
-            run_or_mismatch(instagram_post::matches(url), n, url, || {
-                instagram_post::extract(client, url)
-            })
-            .await
-        }
-        n if n == instagram_profile::INFO.name => {
-            run_or_mismatch(instagram_profile::matches(url), n, url, || {
-                instagram_profile::extract(client, url)
-            })
-            .await
-        }
-        n if n == shopify_product::INFO.name => {
-            run_or_mismatch(shopify_product::matches(url), n, url, || {
-                shopify_product::extract(client, url)
-            })
-            .await
-        }
-        n if n == ecommerce_product::INFO.name => {
-            run_or_mismatch(ecommerce_product::matches(url), n, url, || {
-                ecommerce_product::extract(client, url)
-            })
-            .await
-        }
-        n if n == amazon_product::INFO.name => {
-            run_or_mismatch(amazon_product::matches(url), n, url, || {
-                amazon_product::extract(client, url)
-            })
-            .await
-        }
-        n if n == ebay_listing::INFO.name => {
-            run_or_mismatch(ebay_listing::matches(url), n, url, || {
-                ebay_listing::extract(client, url)
-            })
-            .await
-        }
-        n if n == etsy_listing::INFO.name => {
-            run_or_mismatch(etsy_listing::matches(url), n, url, || {
-                etsy_listing::extract(client, url)
-            })
-            .await
-        }
-        n if n == trustpilot_reviews::INFO.name => {
-            run_or_mismatch(trustpilot_reviews::matches(url), n, url, || {
-                trustpilot_reviews::extract(client, url)
-            })
-            .await
-        }
-        n if n == youtube_video::INFO.name => {
-            run_or_mismatch(youtube_video::matches(url), n, url, || {
-                youtube_video::extract(client, url)
-            })
-            .await
-        }
-        n if n == substack_post::INFO.name => {
-            run_or_mismatch(substack_post::matches(url), n, url, || {
-                substack_post::extract(client, url)
-            })
-            .await
-        }
-        n if n == shopify_collection::INFO.name => {
-            run_or_mismatch(shopify_collection::matches(url), n, url, || {
-                shopify_collection::extract(client, url)
-            })
-            .await
-        }
-        n if n == woocommerce_product::INFO.name => {
-            run_or_mismatch(woocommerce_product::matches(url), n, url, || {
-                woocommerce_product::extract(client, url)
-            })
-            .await
-        }
-        _ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())),
-    }
-}
-
-/// Errors that the dispatcher itself raises (vs. errors from inside an
-/// extractor, which come back wrapped in `Fetch`).
-#[derive(Debug, thiserror::Error)]
-pub enum ExtractorDispatchError {
-    #[error("unknown vertical: '{0}'")]
-    UnknownVertical(String),
-
-    #[error("URL '{url}' does not match the '{vertical}' extractor")]
-    UrlMismatch { vertical: String, url: String },
-
-    #[error(transparent)]
-    Fetch(#[from] FetchError),
-}
-
-/// Helper: when the caller explicitly picked a vertical but their URL
-/// doesn't match it, return `UrlMismatch` instead of running the
-/// extractor (which would just fail with a less-clear error).
-async fn run_or_mismatch<F, Fut>(
-    matches: bool,
-    vertical: &str,
-    url: &str,
-    f: F,
-) -> Result<Value, ExtractorDispatchError>
-where
-    F: FnOnce() -> Fut,
-    Fut: std::future::Future<Output = Result<Value, FetchError>>,
-{
-    if !matches {
-        return Err(ExtractorDispatchError::UrlMismatch {
-            vertical: vertical.to_string(),
-            url: url.to_string(),
-        });
-    }
-    f().await.map_err(ExtractorDispatchError::Fetch)
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn list_is_non_empty_and_unique() {
-        let entries = list();
-        assert!(!entries.is_empty());
-        let mut names: Vec<_> = entries.iter().map(|e| e.name).collect();
-        names.sort();
-        let before = names.len();
-        names.dedup();
-        assert_eq!(before, names.len(), "extractor names must be unique");
-    }
-}
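The guard in `run_or_mismatch` above can be sketched without an async runtime. This is a minimal synchronous stand-in, not the crate's code: `DispatchError` and `run_or_mismatch_sync` are hypothetical names, and the extractor is modeled as a closure that returns `Result<String, String>` instead of a future yielding `Result<Value, FetchError>`.

```rust
/// Simplified, hypothetical version of `ExtractorDispatchError`.
#[derive(Debug, PartialEq)]
enum DispatchError {
    UrlMismatch { vertical: String, url: String },
    Fetch(String),
}

/// Synchronous stand-in for `run_or_mismatch`: check the match flag first
/// and only invoke the extractor closure when the URL belongs to it.
fn run_or_mismatch_sync<F>(
    matches: bool,
    vertical: &str,
    url: &str,
    f: F,
) -> Result<String, DispatchError>
where
    F: FnOnce() -> Result<String, String>,
{
    if !matches {
        // The closure is never called, so no network work would happen here.
        return Err(DispatchError::UrlMismatch {
            vertical: vertical.to_string(),
            url: url.to_string(),
        });
    }
    // Errors from inside the extractor are wrapped, as with `Fetch` above.
    f().map_err(DispatchError::Fetch)
}
```

Taking the extractor as a closure rather than an already-computed result is what lets the guard skip the call entirely on a mismatch.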
diff --git a/crates/webclaw-fetch/src/extractors/npm.rs b/crates/webclaw-fetch/src/extractors/npm.rs
deleted file mode 100644
index f84da0e..0000000
--- a/crates/webclaw-fetch/src/extractors/npm.rs
+++ /dev/null
@@ -1,235 +0,0 @@
-//! npm package structured extractor.
-//!
-//! Uses two APIs operated by npm:
-//!   - `registry.npmjs.org/{name}` for full package metadata
-//!   - `api.npmjs.org/downloads/point/last-week/{name}` for usage signal
-//!
-//! The registry API returns the *full* document including every version
-//! ever published, which can be tens of MB for popular packages
-//! (`@types/node` etc). We strip down to the latest version's manifest
-//! and a count of releases — full history would explode the response.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "npm",
-    label: "npm package",
-    description: "Returns package metadata: latest version manifest, dependencies, weekly downloads, license.",
-    url_patterns: &["https://www.npmjs.com/package/{name}"],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if host != "www.npmjs.com" && host != "npmjs.com" {
-        return false;
-    }
-    url.contains("/package/")
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let name = parse_name(url)
-        .ok_or_else(|| FetchError::Build(format!("npm: cannot parse name from '{url}'")))?;
-
-    let registry_url = format!("https://registry.npmjs.org/{}", urlencode_segment(&name));
-    let resp = client.fetch(&registry_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "npm: package '{name}' not found"
-        )));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "npm registry returned status {}",
-            resp.status
-        )));
-    }
-
-    let pkg: PackageDoc = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("npm registry parse: {e}")))?;
-
-    // Resolve "latest" via dist-tags; the fallback takes the lexicographically last version key, which is not semver order.
-    let latest_version = pkg
-        .dist_tags
-        .as_ref()
-        .and_then(|t| t.get("latest"))
-        .cloned()
-        .or_else(|| pkg.versions.as_ref().and_then(|v| v.keys().last().cloned()));
-
-    let latest_manifest = latest_version
-        .as_deref()
-        .and_then(|v| pkg.versions.as_ref().and_then(|m| m.get(v)));
-
-    let release_count = pkg.versions.as_ref().map(|v| v.len()).unwrap_or(0);
-    let latest_release_date = latest_version
-        .as_deref()
-        .and_then(|v| pkg.time.as_ref().and_then(|t| t.get(v).cloned()));
-
-    // Best-effort weekly downloads. If the api.npmjs.org call fails we
-    // surface `null` rather than failing the whole extractor — npm
-    // sometimes 503s the downloads endpoint while the registry is up.
-    let weekly_downloads = fetch_weekly_downloads(client, &name).await.ok();
-
-    Ok(json!({
-        "url":                 url,
-        "name":                pkg.name.clone().unwrap_or(name.clone()),
-        "description":         pkg.description,
-        "latest_version":      latest_version,
-        "license":             latest_manifest.and_then(|m| m.license.clone()),
-        "homepage":            pkg.homepage,
-        "repository":          pkg.repository.as_ref().and_then(|r| r.url.clone()),
-        "dependencies":        latest_manifest.and_then(|m| m.dependencies.clone()),
-        "dev_dependencies":    latest_manifest.and_then(|m| m.dev_dependencies.clone()),
-        "peer_dependencies":   latest_manifest.and_then(|m| m.peer_dependencies.clone()),
-        "keywords":            pkg.keywords,
-        "maintainers":         pkg.maintainers,
-        "deprecated":          latest_manifest.and_then(|m| m.deprecated.clone()),
-        "release_count":       release_count,
-        "latest_release_date": latest_release_date,
-        "weekly_downloads":    weekly_downloads,
-    }))
-}
-
-async fn fetch_weekly_downloads(client: &dyn Fetcher, name: &str) -> Result<i64, FetchError> {
-    let url = format!(
-        "https://api.npmjs.org/downloads/point/last-week/{}",
-        urlencode_segment(name)
-    );
-    let resp = client.fetch(&url).await?;
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "npm downloads api status {}",
-            resp.status
-        )));
-    }
-    let dl: Downloads = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("npm downloads parse: {e}")))?;
-    Ok(dl.downloads)
-}
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-/// Extract the package name from an npmjs.com URL. Handles scoped packages
-/// (`/package/@scope/name`) and trailing path segments (`/v/x.y.z`).
-fn parse_name(url: &str) -> Option<String> {
-    let after = url.split("/package/").nth(1)?;
-    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
-    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
-    let first = segs.next()?;
-    if first.starts_with('@') {
-        let second = segs.next()?;
-        Some(format!("{first}/{second}"))
-    } else {
-        Some(first.to_string())
-    }
-}
-
-/// `@scope/name` must encode the `/` for the registry path. Plain names
-/// pass through untouched.
-fn urlencode_segment(name: &str) -> String {
-    name.replace('/', "%2F")
-}
-
-// ---------------------------------------------------------------------------
-// Registry types
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct PackageDoc {
-    name: Option<String>,
-    description: Option<String>,
-    homepage: Option<serde_json::Value>, // sometimes string, sometimes object
-    repository: Option<Repository>,
-    keywords: Option<Vec<String>>,
-    maintainers: Option<Vec<Maintainer>>,
-    #[serde(rename = "dist-tags")]
-    dist_tags: Option<std::collections::BTreeMap<String, String>>,
-    versions: Option<std::collections::BTreeMap<String, VersionManifest>>,
-    time: Option<std::collections::BTreeMap<String, String>>,
-}
-
-#[derive(Deserialize, Default, Clone)]
-struct VersionManifest {
-    license: Option<serde_json::Value>, // string or object
-    dependencies: Option<std::collections::BTreeMap<String, String>>,
-    #[serde(rename = "devDependencies")]
-    dev_dependencies: Option<std::collections::BTreeMap<String, String>>,
-    #[serde(rename = "peerDependencies")]
-    peer_dependencies: Option<std::collections::BTreeMap<String, String>>,
-    // `deprecated` is sometimes a bool and sometimes a string in the
-    // registry. serde_json::Value covers both without failing the parse.
-    deprecated: Option<serde_json::Value>,
-}
-
-#[derive(Deserialize)]
-struct Repository {
-    url: Option<String>,
-}
-
-#[derive(Deserialize, Clone)]
-struct Maintainer {
-    name: Option<String>,
-    email: Option<String>,
-}
-
-impl serde::Serialize for Maintainer {
-    fn serialize<S: serde::Serializer>(&self, s: S) -> Result<S::Ok, S::Error> {
-        use serde::ser::SerializeMap;
-        let mut m = s.serialize_map(Some(2))?;
-        m.serialize_entry("name", &self.name)?;
-        m.serialize_entry("email", &self.email)?;
-        m.end()
-    }
-}
-
-#[derive(Deserialize)]
-struct Downloads {
-    downloads: i64,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_npm_package_urls() {
-        assert!(matches("https://www.npmjs.com/package/react"));
-        assert!(matches("https://www.npmjs.com/package/@types/node"));
-        assert!(matches("https://npmjs.com/package/lodash"));
-        assert!(!matches("https://www.npmjs.com/"));
-        assert!(!matches("https://example.com/package/foo"));
-    }
-
-    #[test]
-    fn parse_name_handles_scoped_and_unscoped() {
-        assert_eq!(
-            parse_name("https://www.npmjs.com/package/react"),
-            Some("react".into())
-        );
-        assert_eq!(
-            parse_name("https://www.npmjs.com/package/@types/node"),
-            Some("@types/node".into())
-        );
-        assert_eq!(
-            parse_name("https://www.npmjs.com/package/lodash/v/4.17.21"),
-            Some("lodash".into())
-        );
-    }
-
-    #[test]
-    fn urlencode_only_touches_scope_separator() {
-        assert_eq!(urlencode_segment("react"), "react");
-        assert_eq!(urlencode_segment("@types/node"), "@types%2Fnode");
-    }
-}
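The removed npm extractor falls back to `versions.keys().last()` when `dist-tags.latest` is missing. Because the map is a `BTreeMap<String, _>`, that picks the lexicographically last key, so `1.9.0` sorts after `1.10.0`. A numeric comparison is one possible alternative. The sketch below is an assumption-laden illustration, not the crate's behavior: `parse_triple` and `highest_version` are hypothetical helpers, and anything that is not a plain `MAJOR.MINOR.PATCH` string (pre-releases, build metadata) is skipped.

```rust
use std::collections::BTreeMap;

/// Parse a plain `MAJOR.MINOR.PATCH` string. Pre-release or otherwise
/// non-numeric versions return `None` and are skipped by the caller.
fn parse_triple(v: &str) -> Option<(u64, u64, u64)> {
    let mut parts = v.split('.');
    let major: u64 = parts.next()?.parse().ok()?;
    let minor: u64 = parts.next()?.parse().ok()?;
    let patch: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three dot-separated parts
    }
    Some((major, minor, patch))
}

/// Hypothetical fallback: the numerically highest stable version key.
fn highest_version<V>(versions: &BTreeMap<String, V>) -> Option<String> {
    versions
        .keys()
        .filter_map(|k| parse_triple(k).map(|t| (t, k)))
        .max_by_key(|(t, _)| *t)
        .map(|(_, k)| k.clone())
}
```

A full semver implementation would also order pre-release identifiers; this sketch deliberately ignores them to stay dependency-free.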
diff --git a/crates/webclaw-fetch/src/extractors/pypi.rs b/crates/webclaw-fetch/src/extractors/pypi.rs
deleted file mode 100644
index 33a4d1c..0000000
--- a/crates/webclaw-fetch/src/extractors/pypi.rs
+++ /dev/null
@@ -1,184 +0,0 @@
-//! PyPI package structured extractor.
-//!
-//! PyPI exposes a stable JSON API at `pypi.org/pypi/{name}/json` and
-//! a versioned form at `pypi.org/pypi/{name}/{version}/json`. Both
-//! return the full release info plus history. No auth, no rate limits
-//! that we hit at normal usage.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "pypi",
-    label: "PyPI package",
-    description: "Returns package metadata: latest version, dependencies, license, release history.",
-    url_patterns: &[
-        "https://pypi.org/project/{name}/",
-        "https://pypi.org/project/{name}/{version}/",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if host != "pypi.org" && host != "www.pypi.org" {
-        return false;
-    }
-    url.contains("/project/")
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let (name, version) = parse_project(url).ok_or_else(|| {
-        FetchError::Build(format!("pypi: cannot parse package name from '{url}'"))
-    })?;
-
-    let api_url = match &version {
-        Some(v) => format!("https://pypi.org/pypi/{name}/{v}/json"),
-        None => format!("https://pypi.org/pypi/{name}/json"),
-    };
-    let resp = client.fetch(&api_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "pypi: package '{name}' not found"
-        )));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "pypi api returned status {}",
-            resp.status
-        )));
-    }
-
-    let pkg: PypiResponse = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("pypi parse: {e}")))?;
-
-    let info = pkg.info;
-    let release_count = pkg.releases.as_ref().map(|r| r.len()).unwrap_or(0);
-
-    // Release date = max upload time across files of the version the API returned (the latest, unless the URL pinned one).
-    let latest_release_date = pkg
-        .releases
-        .as_ref()
-        .and_then(|map| info.version.as_deref().and_then(|v| map.get(v)))
-        .and_then(|files| files.iter().filter_map(|f| f.upload_time.clone()).max());
-
-    // Drop the long description from the JSON shape — it's frequently a 50KB
-    // README and bloats responses. Callers who need it can hit /v1/scrape.
-    Ok(json!({
-        "url":                 url,
-        "name":                info.name,
-        "version":             info.version,
-        "summary":             info.summary,
-        "homepage":            info.home_page,
-        "license":             info.license,
-        "license_classifier":  pick_license_classifier(&info.classifiers),
-        "author":              info.author,
-        "author_email":        info.author_email,
-        "maintainer":          info.maintainer,
-        "requires_python":     info.requires_python,
-        "requires_dist":       info.requires_dist,
-        "keywords":            info.keywords,
-        "classifiers":         info.classifiers,
-        "yanked":              info.yanked,
-        "yanked_reason":       info.yanked_reason,
-        "project_urls":        info.project_urls,
-        "release_count":       release_count,
-        "latest_release_date": latest_release_date,
-    }))
-}
-
-/// PyPI puts the SPDX-ish license under classifiers like
-/// `License :: OSI Approved :: Apache Software License`. Surface the most
-/// specific one when the `license` field itself is empty/junk.
-fn pick_license_classifier(classifiers: &Option<Vec<String>>) -> Option<String> {
-    classifiers
-        .as_ref()?
-        .iter()
-        .filter(|c| c.starts_with("License ::"))
-        .max_by_key(|c| c.len())
-        .cloned()
-}
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-fn parse_project(url: &str) -> Option<(String, Option<String>)> {
-    let after = url.split("/project/").nth(1)?;
-    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
-    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
-    let name = segs.next()?.to_string();
-    let version = segs.next().map(|v| v.to_string());
-    Some((name, version))
-}
-
-// ---------------------------------------------------------------------------
-// PyPI API types
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct PypiResponse {
-    info: Info,
-    releases: Option<std::collections::BTreeMap<String, Vec<File>>>,
-}
-
-#[derive(Deserialize)]
-struct Info {
-    name: Option<String>,
-    version: Option<String>,
-    summary: Option<String>,
-    home_page: Option<String>,
-    license: Option<String>,
-    author: Option<String>,
-    author_email: Option<String>,
-    maintainer: Option<String>,
-    requires_python: Option<String>,
-    requires_dist: Option<Vec<String>>,
-    keywords: Option<String>,
-    classifiers: Option<Vec<String>>,
-    yanked: Option<bool>,
-    yanked_reason: Option<String>,
-    project_urls: Option<std::collections::BTreeMap<String, String>>,
-}
-
-#[derive(Deserialize)]
-struct File {
-    upload_time: Option<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_project_urls() {
-        assert!(matches("https://pypi.org/project/requests/"));
-        assert!(matches("https://pypi.org/project/numpy/1.26.0/"));
-        assert!(!matches("https://pypi.org/"));
-        assert!(!matches("https://example.com/project/foo"));
-    }
-
-    #[test]
-    fn parse_project_pulls_name_and_version() {
-        assert_eq!(
-            parse_project("https://pypi.org/project/requests/"),
-            Some(("requests".into(), None))
-        );
-        assert_eq!(
-            parse_project("https://pypi.org/project/numpy/1.26.0/"),
-            Some(("numpy".into(), Some("1.26.0".into())))
-        );
-        assert_eq!(
-            parse_project("https://pypi.org/project/scikit-learn/?foo=bar"),
-            Some(("scikit-learn".into(), None))
-        );
-    }
-}
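The mapping from a project page URL to the JSON API URL in the removed PyPI extractor can be exercised on its own. In the sketch below, `parse_project` mirrors the helper in the diff above, while `api_url` is a hypothetical wrapper added only for illustration; it reproduces the `match` inside `extract` without doing any fetching.

```rust
/// Split a `pypi.org/project/...` URL into `(name, optional version)`,
/// mirroring the `parse_project` helper in the removed file.
fn parse_project(url: &str) -> Option<(String, Option<String>)> {
    let after = url.split("/project/").nth(1)?;
    // Drop any query string or fragment, then the trailing slash.
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let name = segs.next()?.to_string();
    let version = segs.next().map(|v| v.to_string());
    Some((name, version))
}

/// Hypothetical wrapper: build the JSON API URL for a project page URL.
fn api_url(url: &str) -> Option<String> {
    let (name, version) = parse_project(url)?;
    Some(match version {
        Some(v) => format!("https://pypi.org/pypi/{name}/{v}/json"),
        None => format!("https://pypi.org/pypi/{name}/json"),
    })
}
```

Returning `Option` keeps the "cannot parse" case explicit, which the original turns into a `FetchError::Build`.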
diff --git a/crates/webclaw-fetch/src/extractors/reddit.rs b/crates/webclaw-fetch/src/extractors/reddit.rs
deleted file mode 100644
index 13cdc16..0000000
--- a/crates/webclaw-fetch/src/extractors/reddit.rs
+++ /dev/null
@@ -1,234 +0,0 @@
-//! Reddit structured extractor — returns the full post + comment tree
-//! as typed JSON via Reddit's `.json` API.
-//!
-//! The same trick the markdown extractor in `crate::reddit` uses:
-//! appending `.json` to any post URL returns the data the new SPA
-//! frontend would load client-side. Zero antibot, zero JS rendering.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "reddit",
-    label: "Reddit thread",
-    description: "Returns post + nested comment tree with scores, authors, and timestamps.",
-    url_patterns: &[
-        "https://www.reddit.com/r/*/comments/*",
-        "https://reddit.com/r/*/comments/*",
-        "https://old.reddit.com/r/*/comments/*",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    let is_reddit_host = matches!(
-        host,
-        "reddit.com" | "www.reddit.com" | "old.reddit.com" | "np.reddit.com" | "new.reddit.com"
-    );
-    is_reddit_host && url.contains("/comments/")
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let json_url = build_json_url(url);
-    let resp = client.fetch(&json_url).await?;
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "reddit api returned status {}",
-            resp.status
-        )));
-    }
-
-    let listings: Vec<Listing> = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("reddit json parse: {e}")))?;
-
-    if listings.is_empty() {
-        return Err(FetchError::BodyDecode("reddit response empty".into()));
-    }
-
-    // First listing = the post (single t3 child).
-    let post = listings
-        .first()
-        .and_then(|l| l.data.children.first())
-        .filter(|t| t.kind == "t3")
-        .map(|t| post_json(&t.data))
-        .unwrap_or(Value::Null);
-
-    // Second listing = the comment tree.
-    let comments: Vec<Value> = listings
-        .get(1)
-        .map(|l| l.data.children.iter().filter_map(comment_json).collect())
-        .unwrap_or_default();
-
-    Ok(json!({
-        "url": url,
-        "post": post,
-        "comments": comments,
-    }))
-}
-
-// ---------------------------------------------------------------------------
-// JSON shapers
-// ---------------------------------------------------------------------------
-
-fn post_json(d: &ThingData) -> Value {
-    json!({
-        "id":               d.id,
-        "title":            d.title,
-        "author":           d.author,
-        "subreddit":        d.subreddit_name_prefixed,
-        "permalink":        d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")),
-        "url":              d.url_overridden_by_dest,
-        "is_self":          d.is_self,
-        "selftext":         d.selftext,
-        "score":            d.score,
-        "upvote_ratio":     d.upvote_ratio,
-        "num_comments":     d.num_comments,
-        "created_utc":      d.created_utc,
-        "link_flair_text":  d.link_flair_text,
-        "over_18":          d.over_18,
-        "spoiler":          d.spoiler,
-        "stickied":         d.stickied,
-        "locked":           d.locked,
-    })
-}
-
-/// Render a single comment + its reply tree. Returns `None` for non-t1
-/// kinds (the trailing `more` placeholder Reddit injects at depth limits).
-fn comment_json(thing: &Thing) -> Option<Value> {
-    if thing.kind != "t1" {
-        return None;
-    }
-    let d = &thing.data;
-    let replies: Vec<Value> = match &d.replies {
-        Some(Replies::Listing(l)) => l.data.children.iter().filter_map(comment_json).collect(),
-        _ => Vec::new(),
-    };
-    Some(json!({
-        "id":             d.id,
-        "author":         d.author,
-        "body":           d.body,
-        "score":          d.score,
-        "created_utc":    d.created_utc,
-        "is_submitter":   d.is_submitter,
-        "stickied":       d.stickied,
-        "depth":          d.depth,
-        "permalink":      d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")),
-        "replies":        replies,
-    }))
-}
-
-// ---------------------------------------------------------------------------
-// URL helpers
-// ---------------------------------------------------------------------------
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-/// Build the Reddit JSON URL. We keep the original host (`www.reddit.com`
-/// or `old.reddit.com` as the caller gave us). Routing through
-/// `old.reddit.com` unconditionally looks appealing but that host has
-/// stricter UA-based blocking than `www.reddit.com`, while the main
-/// host accepts our Chrome-fingerprinted client fine.
-fn build_json_url(url: &str) -> String {
-    let clean = url.split('?').next().unwrap_or(url).trim_end_matches('/');
-    format!("{clean}.json?raw_json=1")
-}
-
-// ---------------------------------------------------------------------------
-// Reddit JSON types — only fields we render. Everything else is dropped.
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct Listing {
-    data: ListingData,
-}
-
-#[derive(Deserialize)]
-struct ListingData {
-    children: Vec<Thing>,
-}
-
-#[derive(Deserialize)]
-struct Thing {
-    kind: String,
-    data: ThingData,
-}
-
-#[derive(Deserialize, Default)]
-struct ThingData {
-    // post (t3)
-    id: Option<String>,
-    title: Option<String>,
-    selftext: Option<String>,
-    subreddit_name_prefixed: Option<String>,
-    url_overridden_by_dest: Option<String>,
-    is_self: Option<bool>,
-    upvote_ratio: Option<f64>,
-    num_comments: Option<i64>,
-    over_18: Option<bool>,
-    spoiler: Option<bool>,
-    stickied: Option<bool>,
-    locked: Option<bool>,
-    link_flair_text: Option<String>,
-
-    // comment (t1)
-    author: Option<String>,
-    body: Option<String>,
-    score: Option<i64>,
-    created_utc: Option<f64>,
-    is_submitter: Option<bool>,
-    depth: Option<i64>,
-    permalink: Option<String>,
-
-    // recursive
-    replies: Option<Replies>,
-}
-
-#[derive(Deserialize)]
-#[serde(untagged)]
-enum Replies {
-    Listing(Listing),
-    #[allow(dead_code)]
-    Empty(String),
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_reddit_post_urls() {
-        assert!(matches(
-            "https://www.reddit.com/r/rust/comments/abc123/some_title/"
-        ));
-        assert!(matches(
-            "https://reddit.com/r/rust/comments/abc123/some_title"
-        ));
-        assert!(matches("https://old.reddit.com/r/rust/comments/abc123/x/"));
-    }
-
-    #[test]
-    fn rejects_non_post_reddit_urls() {
-        assert!(!matches("https://www.reddit.com/r/rust"));
-        assert!(!matches("https://www.reddit.com/user/foo"));
-        assert!(!matches("https://example.com/r/rust/comments/x"));
-    }
-
-    #[test]
-    fn json_url_appends_suffix_and_drops_query() {
-        assert_eq!(
-            build_json_url("https://www.reddit.com/r/rust/comments/abc/x/?utm=foo"),
-            "https://www.reddit.com/r/rust/comments/abc/x.json?raw_json=1"
-        );
-    }
-}
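The recursive shaping done by `comment_json` in the removed Reddit extractor can be illustrated with plain structs. This is a simplified sketch under stated assumptions: `Thing` and `Comment` here are hypothetical stand-ins that carry only a kind, an id, and replies, whereas the original works on deserialized Reddit JSON and emits `serde_json::Value`.

```rust
/// Simplified stand-in for the deserialized Reddit `Thing`.
struct Thing {
    kind: &'static str,
    id: &'static str,
    replies: Vec<Thing>,
}

/// Hypothetical output shape: a comment and its reply tree.
#[derive(Debug, PartialEq)]
struct Comment {
    id: String,
    replies: Vec<Comment>,
}

/// Convert a `t1` thing into a `Comment`, recursing into replies.
/// Any other kind (such as the `more` placeholder) yields `None`,
/// so `filter_map` drops it at every depth of the tree.
fn to_comment(thing: &Thing) -> Option<Comment> {
    if thing.kind != "t1" {
        return None;
    }
    Some(Comment {
        id: thing.id.to_string(),
        replies: thing.replies.iter().filter_map(to_comment).collect(),
    })
}
```

Passing the function itself to `filter_map` is the same pattern the original uses for both the top-level listing and nested replies.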
diff --git a/crates/webclaw-fetch/src/extractors/shopify_collection.rs b/crates/webclaw-fetch/src/extractors/shopify_collection.rs
deleted file mode 100644
index 23d57c6..0000000
--- a/crates/webclaw-fetch/src/extractors/shopify_collection.rs
+++ /dev/null
@@ -1,242 +0,0 @@
-//! Shopify collection structured extractor.
-//!
-//! Every Shopify store exposes `/collections/{handle}.json` and
-//! `/collections/{handle}/products.json` on the public surface. This
-//! extractor hits `.json` (collection metadata) and falls through to
-//! `/products.json` for the first page of products. Same caveat as
-//! `shopify_product`: stores with Cloudflare in front of the shop
-//! will 403 the public path.
-//!
-//! Explicit-call only (like `shopify_product`). `/collections/{slug}`
-//! is a URL shape used by non-Shopify stores too, so auto-dispatch
-//! would claim too many URLs.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "shopify_collection",
-    label: "Shopify collection",
-    description: "Returns collection metadata + first page of products (handle, title, vendor, price, available) on ANY Shopify store via /collections/{handle}.json + /products.json.",
-    url_patterns: &[
-        "https://{shop}/collections/{handle}",
-        "https://{shop}.myshopify.com/collections/{handle}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
-        return false;
-    }
-    url.contains("/collections/") && !url.ends_with("/collections/")
-}
-
-const NON_SHOPIFY_HOSTS: &[&str] = &[
-    "amazon.com",
-    "amazon.co.uk",
-    "amazon.de",
-    "ebay.com",
-    "etsy.com",
-    "walmart.com",
-    "target.com",
-    "aliexpress.com",
-    "huggingface.co", // has /collections/ for models
-    "github.com",
-];
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let (coll_meta_url, coll_products_url) = build_json_urls(url);
-
-    // Step 1: collection metadata. Shopify returns 200 on missing
-    // collections sometimes; check "collection" key below.
-    let meta_resp = client.fetch(&coll_meta_url).await?;
-    if meta_resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "shopify_collection: '{url}' not found"
-        )));
-    }
-    if meta_resp.status == 403 {
-        return Err(FetchError::Build(format!(
-            "shopify_collection: {coll_meta_url} returned 403. The store has antibot in front of the .json endpoint. Use /v1/scrape/ecommerce_product or api.webclaw.io for this store."
-        )));
-    }
-    if meta_resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "shopify returned status {} for {coll_meta_url}",
-            meta_resp.status
-        )));
-    }
-
-    let meta: MetaWrapper = serde_json::from_str(&meta_resp.html).map_err(|e| {
-        FetchError::BodyDecode(format!(
-            "shopify_collection: '{url}' didn't return Shopify JSON, likely not a Shopify store ({e})"
-        ))
-    })?;
-
-    // Step 2: first page of products for this collection.
-    let products = match client.fetch(&coll_products_url).await {
-        Ok(r) if r.status == 200 => serde_json::from_str::<ProductsWrapper>(&r.html)
-            .ok()
-            .map(|pw| pw.products)
-            .unwrap_or_default(),
-        _ => Vec::new(),
-    };
-
-    let product_summaries: Vec<Value> = products
-        .iter()
-        .map(|p| {
-            let first_variant = p.variants.first();
-            json!({
-                "id":              p.id,
-                "handle":          p.handle,
-                "title":           p.title,
-                "vendor":          p.vendor,
-                "product_type":    p.product_type,
-                "price":           first_variant.and_then(|v| v.price.clone()),
-                "compare_at_price":first_variant.and_then(|v| v.compare_at_price.clone()),
-                "available":       p.variants.iter().any(|v| v.available.unwrap_or(false)),
-                "variant_count":   p.variants.len(),
-                "image":           p.images.first().and_then(|i| i.src.clone()),
-                "created_at":      p.created_at,
-                "updated_at":      p.updated_at,
-            })
-        })
-        .collect();
-
-    let c = meta.collection;
-    Ok(json!({
-        "url":               url,
-        "meta_json_url":     coll_meta_url,
-        "products_json_url": coll_products_url,
-        "collection_id":     c.id,
-        "handle":            c.handle,
-        "title":             c.title,
-        "description_html":  c.body_html,
-        "published_at":      c.published_at,
-        "updated_at":        c.updated_at,
-        "sort_order":        c.sort_order,
-        "products_in_page":  product_summaries.len(),
-        "products":          product_summaries,
-    }))
-}
-
-// ---------------------------------------------------------------------------
-// URL helpers
-// ---------------------------------------------------------------------------
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-/// Build `(collection.json, collection/products.json)` from a user URL.
-fn build_json_urls(url: &str) -> (String, String) {
-    let (path_part, _query_part) = match url.split_once('?') {
-        Some((a, b)) => (a, Some(b)),
-        None => (url, None),
-    };
-    let clean = path_part.trim_end_matches('/').trim_end_matches(".json");
-    (
-        format!("{clean}.json"),
-        format!("{clean}/products.json?limit=50"),
-    )
-}
-
-// ---------------------------------------------------------------------------
-// Shopify collection + product JSON shapes (subsets)
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct MetaWrapper {
-    collection: Collection,
-}
-
-#[derive(Deserialize)]
-struct Collection {
-    id: Option<i64>,
-    handle: Option<String>,
-    title: Option<String>,
-    body_html: Option<String>,
-    published_at: Option<String>,
-    updated_at: Option<String>,
-    sort_order: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct ProductsWrapper {
-    #[serde(default)]
-    products: Vec<ProductSummary>,
-}
-
-#[derive(Deserialize)]
-struct ProductSummary {
-    id: Option<i64>,
-    handle: Option<String>,
-    title: Option<String>,
-    vendor: Option<String>,
-    product_type: Option<String>,
-    created_at: Option<String>,
-    updated_at: Option<String>,
-    #[serde(default)]
-    variants: Vec<VariantSummary>,
-    #[serde(default)]
-    images: Vec<ImageSummary>,
-}
-
-#[derive(Deserialize)]
-struct VariantSummary {
-    price: Option<String>,
-    compare_at_price: Option<String>,
-    available: Option<bool>,
-}
-
-#[derive(Deserialize)]
-struct ImageSummary {
-    src: Option<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_shopify_collection_urls() {
-        assert!(matches("https://www.allbirds.com/collections/mens"));
-        assert!(matches(
-            "https://shop.example.com/collections/new-arrivals?page=2"
-        ));
-    }
-
-    #[test]
-    fn rejects_non_shopify() {
-        assert!(!matches("https://github.com/collections/foo"));
-        assert!(!matches("https://huggingface.co/collections/foo"));
-        assert!(!matches("https://example.com/"));
-        assert!(!matches("https://example.com/collections/"));
-    }
-
-    #[test]
-    fn build_json_urls_derives_both_paths() {
-        let (meta, products) = build_json_urls("https://shop.example.com/collections/mens");
-        assert_eq!(meta, "https://shop.example.com/collections/mens.json");
-        assert_eq!(
-            products,
-            "https://shop.example.com/collections/mens/products.json?limit=50"
-        );
-    }
-
-    #[test]
-    fn build_json_urls_handles_trailing_slash() {
-        let (meta, _) = build_json_urls("https://shop.example.com/collections/mens/");
-        assert_eq!(meta, "https://shop.example.com/collections/mens.json");
-    }
-}
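Reviewer note: the URL derivation in the removed `build_json_urls` helper can be exercised on its own. The block below is a minimal standalone sketch that mirrors that helper; the `limit=50` page size and the dropped query string come from the removed code, while the `main` driver and the sample store URL are illustrative assumptions.

```rust
/// Derive the collection metadata URL and the first products-page URL
/// from a user-supplied collection URL. Mirrors the removed helper: the
/// query string is discarded and a trailing `/` or `.json` is stripped.
fn build_json_urls(url: &str) -> (String, String) {
    // Keep only the part before `?`; paging parameters are not forwarded.
    let path_part = url.split_once('?').map_or(url, |(path, _query)| path);
    let clean = path_part.trim_end_matches('/').trim_end_matches(".json");
    (
        format!("{clean}.json"),
        format!("{clean}/products.json?limit=50"),
    )
}

fn main() {
    // Sample URL is an assumption for illustration only.
    let (meta, products) =
        build_json_urls("https://shop.example.com/collections/new-arrivals/?page=2");
    println!("{meta}");
    println!("{products}");
}
```

Because the caller's query string is dropped, a `?page=2` on the input has no effect on which products page is fetched; the sketch keeps that behavior rather than changing it.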
diff --git a/crates/webclaw-fetch/src/extractors/shopify_product.rs b/crates/webclaw-fetch/src/extractors/shopify_product.rs
deleted file mode 100644
index b52ef36..0000000
--- a/crates/webclaw-fetch/src/extractors/shopify_product.rs
+++ /dev/null
@@ -1,318 +0,0 @@
-//! Shopify product structured extractor.
-//!
-//! Every Shopify store exposes a public JSON endpoint for each product
-//! by appending `.json` to the product URL:
-//!
-//!   https://shop.example.com/products/cool-tshirt
-//!   → https://shop.example.com/products/cool-tshirt.json
-//!
-//! There are ~4 million Shopify stores. The `.json` endpoint is
-//! undocumented but has been stable for 10+ years. When a store puts
-//! Cloudflare / antibot in front of the shop, this path can 403 just
-//! like any other — for those cases the caller should fall back to
-//! `ecommerce_product` (JSON-LD) or the cloud tier.
-//!
-//! This extractor is **explicit-call only** — it is NOT auto-dispatched
-//! from `/v1/scrape` because we cannot tell ahead of time whether an
-//! arbitrary `/products/{slug}` URL is a Shopify store. Callers hit
-//! `/v1/scrape/shopify_product` when they know.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "shopify_product",
-    label: "Shopify product",
-    description: "Returns product metadata on ANY Shopify store via the public /products/{handle}.json endpoint: title, vendor, variants with prices + stock, images, options.",
-    url_patterns: &[
-        "https://{shop}/products/{handle}",
-        "https://{shop}.myshopify.com/products/{handle}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    // Any URL whose path contains /products/{something}. We do not
-    // filter by host because Shopify powers custom-domain stores. The
-    // `.json` request in `extract` is what confirms Shopify; `matches`
-    // just says "this is a plausible shape." Still reject known
-    // non-Shopify hosts to save a failed request.
-    let host = host_of(url);
-    if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
-        return false;
-    }
-    url.contains("/products/") && !url.ends_with("/products/")
-}
-
-/// Hosts we know are not Shopify — reject so we don't burn a request.
-const NON_SHOPIFY_HOSTS: &[&str] = &[
-    "amazon.com",
-    "amazon.co.uk",
-    "amazon.de",
-    "amazon.fr",
-    "amazon.it",
-    "ebay.com",
-    "etsy.com",
-    "walmart.com",
-    "target.com",
-    "aliexpress.com",
-    "bestbuy.com",
-    "wayfair.com",
-    "homedepot.com",
-    "github.com", // /products is a marketing page
-];
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let json_url = build_json_url(url);
-    let resp = client.fetch(&json_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "shopify_product: '{url}' not found (got 404 from {json_url})"
-        )));
-    }
-    if resp.status == 403 {
-        return Err(FetchError::Build(format!(
-            "shopify_product: {json_url} returned 403 — the store has antibot in front of the .json endpoint. Try /v1/scrape/ecommerce_product for the HTML + JSON-LD fallback."
-        )));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "shopify returned status {} for {json_url}",
-            resp.status
-        )));
-    }
-
-    let body: Wrapper = serde_json::from_str(&resp.html).map_err(|e| {
-        FetchError::BodyDecode(format!(
-            "shopify_product: '{url}' didn't return Shopify JSON — likely not a Shopify store ({e})"
-        ))
-    })?;
-    let p = body.product;
-
-    let variants: Vec<Value> = p
-        .variants
-        .iter()
-        .map(|v| {
-            json!({
-                "id":                  v.id,
-                "title":               v.title,
-                "sku":                 v.sku,
-                "barcode":             v.barcode,
-                "price":               v.price,
-                "compare_at_price":    v.compare_at_price,
-                "available":           v.available,
-                "inventory_quantity":  v.inventory_quantity,
-                "position":            v.position,
-                "weight":              v.weight,
-                "weight_unit":         v.weight_unit,
-                "requires_shipping":   v.requires_shipping,
-                "taxable":             v.taxable,
-                "option1":             v.option1,
-                "option2":             v.option2,
-                "option3":             v.option3,
-            })
-        })
-        .collect();
-
-    let images: Vec<Value> = p
-        .images
-        .iter()
-        .map(|i| {
-            json!({
-                "src":      i.src,
-                "width":    i.width,
-                "height":   i.height,
-                "position": i.position,
-                "alt":      i.alt,
-            })
-        })
-        .collect();
-
-    let options: Vec<Value> = p
-        .options
-        .iter()
-        .map(|o| json!({"name": o.name, "values": o.values, "position": o.position}))
-        .collect();
-
-    // Price range + availability summary across variants (the shape
-    // agents typically want without walking the variants array).
-    let prices: Vec<f64> = p
-        .variants
-        .iter()
-        .filter_map(|v| v.price.as_deref().and_then(|s| s.parse::<f64>().ok()))
-        .collect();
-    let any_available = p.variants.iter().any(|v| v.available.unwrap_or(false));
-
-    Ok(json!({
-        "url":             url,
-        "json_url":        json_url,
-        "product_id":      p.id,
-        "handle":          p.handle,
-        "title":           p.title,
-        "vendor":          p.vendor,
-        "product_type":    p.product_type,
-        "tags":            p.tags,
-        "description_html":p.body_html,
-        "published_at":    p.published_at,
-        "created_at":      p.created_at,
-        "updated_at":      p.updated_at,
-        "variant_count":   variants.len(),
-        "image_count":     images.len(),
-        "any_available":   any_available,
-        "price_min":       prices.iter().copied().reduce(f64::min),
-        "price_max":       prices.iter().copied().reduce(f64::max),
-        "variants":        variants,
-        "images":          images,
-        "options":         options,
-    }))
-}
-
-/// Build the `.json` URL from a product URL. Handles URLs that already
-/// end in `.json`, trailing slashes, and query strings.
-fn build_json_url(url: &str) -> String {
-    let (path_part, query_part) = match url.split_once('?') {
-        Some((a, b)) => (a, Some(b)),
-        None => (url, None),
-    };
-    let clean = path_part.trim_end_matches('/');
-    let with_json = if clean.ends_with(".json") {
-        clean.to_string()
-    } else {
-        format!("{clean}.json")
-    };
-    match query_part {
-        Some(q) => format!("{with_json}?{q}"),
-        None => with_json,
-    }
-}
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-// ---------------------------------------------------------------------------
-// Shopify product JSON shape (a subset of the full response)
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct Wrapper {
-    product: Product,
-}
-
-#[derive(Deserialize)]
-struct Product {
-    id: Option<i64>,
-    title: Option<String>,
-    handle: Option<String>,
-    vendor: Option<String>,
-    product_type: Option<String>,
-    body_html: Option<String>,
-    published_at: Option<String>,
-    created_at: Option<String>,
-    updated_at: Option<String>,
-    #[serde(default)]
-    tags: serde_json::Value, // array OR comma-joined string depending on store
-    #[serde(default)]
-    variants: Vec<Variant>,
-    #[serde(default)]
-    images: Vec<Image>,
-    #[serde(default)]
-    options: Vec<Option_>,
-}
-
-#[derive(Deserialize)]
-struct Variant {
-    id: Option<i64>,
-    title: Option<String>,
-    sku: Option<String>,
-    barcode: Option<String>,
-    price: Option<String>,
-    compare_at_price: Option<String>,
-    available: Option<bool>,
-    inventory_quantity: Option<i64>,
-    position: Option<i64>,
-    weight: Option<f64>,
-    weight_unit: Option<String>,
-    requires_shipping: Option<bool>,
-    taxable: Option<bool>,
-    option1: Option<String>,
-    option2: Option<String>,
-    option3: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct Image {
-    src: Option<String>,
-    width: Option<i64>,
-    height: Option<i64>,
-    position: Option<i64>,
-    alt: Option<String>,
-}
-
-#[derive(Deserialize)]
-#[serde(rename_all = "lowercase")]
-struct Option_ {
-    name: Option<String>,
-    position: Option<i64>,
-    #[serde(default)]
-    values: Vec<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_plausible_shopify_urls() {
-        assert!(matches(
-            "https://www.allbirds.com/products/mens-tree-runners"
-        ));
-        assert!(matches(
-            "https://shop.example.com/products/cool-tshirt?variant=123"
-        ));
-        assert!(matches("https://somestore.myshopify.com/products/thing-1"));
-    }
-
-    #[test]
-    fn rejects_known_non_shopify() {
-        assert!(!matches("https://www.amazon.com/dp/B0C123"));
-        assert!(!matches("https://www.etsy.com/listing/12345/foo"));
-        assert!(!matches("https://www.amazon.co.uk/products/thing"));
-        assert!(!matches("https://github.com/products"));
-    }
-
-    #[test]
-    fn rejects_non_product_urls() {
-        assert!(!matches("https://example.com/"));
-        assert!(!matches("https://example.com/products/"));
-        assert!(!matches("https://example.com/collections/all"));
-    }
-
-    #[test]
-    fn build_json_url_handles_slash_and_query() {
-        assert_eq!(
-            build_json_url("https://shop.example.com/products/foo"),
-            "https://shop.example.com/products/foo.json"
-        );
-        assert_eq!(
-            build_json_url("https://shop.example.com/products/foo/"),
-            "https://shop.example.com/products/foo.json"
-        );
-        assert_eq!(
-            build_json_url("https://shop.example.com/products/foo?variant=123"),
-            "https://shop.example.com/products/foo.json?variant=123"
-        );
-        assert_eq!(
-            build_json_url("https://shop.example.com/products/foo.json"),
-            "https://shop.example.com/products/foo.json"
-        );
-    }
-}
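Reviewer note: the removed `matches` function rejects known hosts with `host.ends_with(h)`, which would also reject an unrelated domain that merely ends in the same characters (a hypothetical `notamazon.com`, for instance). The block below is a sketch of a label-boundary check, under the assumption that exact-domain or subdomain matching is the intent. `is_blocked_host` is a hypothetical helper, not part of the crate, and the blocklist here is a truncated copy.

```rust
/// Truncated copy of the blocklist, for illustration only.
const NON_SHOPIFY_HOSTS: &[&str] = &["amazon.com", "etsy.com", "github.com"];

/// Hypothetical helper: true when `host` is a blocked domain or one of
/// its subdomains, but not when it only shares a trailing substring.
fn is_blocked_host(host: &str) -> bool {
    // Drop an optional `:port` and compare case-insensitively.
    let host = host.split(':').next().unwrap_or("").to_ascii_lowercase();
    NON_SHOPIFY_HOSTS.iter().any(|&blocked| {
        host == blocked
            || host
                .strip_suffix(blocked)
                // A subdomain leaves a prefix ending in `.`, e.g. `www.`.
                .is_some_and(|prefix| prefix.ends_with('.'))
    })
}

fn main() {
    for host in ["www.amazon.com", "notamazon.com", "shop.example.com"] {
        println!("{host}: blocked = {}", is_blocked_host(host));
    }
}
```

The same check would apply to the collection extractor's blocklist, which uses the same `host_of` helper.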
diff --git a/crates/webclaw-fetch/src/extractors/stackoverflow.rs b/crates/webclaw-fetch/src/extractors/stackoverflow.rs
deleted file mode 100644
index 03597a3..0000000
--- a/crates/webclaw-fetch/src/extractors/stackoverflow.rs
+++ /dev/null
@@ -1,216 +0,0 @@
-//! Stack Overflow Q&A structured extractor.
-//!
-//! Uses the Stack Exchange API at `api.stackexchange.com/2.3/questions/{id}`
-//! with `site=stackoverflow`. Two calls: one for the question, one for
-//! its answers. Both come pre-filtered to include the rendered HTML body
-//! so we don't re-parse the question page itself.
-//!
-//! Anonymous access caps at 300 requests per IP per day. Production
-//! cloud should set `STACKAPPS_KEY` to lift to 10,000/day, but we don't
-//! require it to work out of the box.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "stackoverflow",
-    label: "Stack Overflow Q&A",
-    description: "Returns question + answers: title, body, tags, votes, accepted answer, top answers.",
-    url_patterns: &["https://stackoverflow.com/questions/{id}/{slug}"],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if host != "stackoverflow.com" && host != "www.stackoverflow.com" {
-        return false;
-    }
-    parse_question_id(url).is_some()
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let id = parse_question_id(url).ok_or_else(|| {
-        FetchError::Build(format!(
-            "stackoverflow: cannot parse question id from '{url}'"
-        ))
-    })?;
-
-    // Filter `withbody` includes the rendered HTML body for both questions
-    // and answers. Stack Exchange's filter system is documented at
-    // api.stackexchange.com/docs/filters.
-    let q_url = format!(
-        "https://api.stackexchange.com/2.3/questions/{id}?site=stackoverflow&filter=withbody"
-    );
-    let q_resp = client.fetch(&q_url).await?;
-    if q_resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "stackexchange api returned status {}",
-            q_resp.status
-        )));
-    }
-    let q_body: QResponse = serde_json::from_str(&q_resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("stackoverflow q parse: {e}")))?;
-    let q = q_body
-        .items
-        .first()
-        .ok_or_else(|| FetchError::Build(format!("stackoverflow: question {id} not found")))?;
-
-    let a_url = format!(
-        "https://api.stackexchange.com/2.3/questions/{id}/answers?site=stackoverflow&filter=withbody&order=desc&sort=votes"
-    );
-    let a_resp = client.fetch(&a_url).await?;
-    let answers = if a_resp.status == 200 {
-        let a_body: AResponse = serde_json::from_str(&a_resp.html)
-            .map_err(|e| FetchError::BodyDecode(format!("stackoverflow a parse: {e}")))?;
-        a_body
-            .items
-            .iter()
-            .map(|a| {
-                json!({
-                    "answer_id":     a.answer_id,
-                    "is_accepted":   a.is_accepted,
-                    "score":         a.score,
-                    "body":          a.body,
-                    "creation_date": a.creation_date,
-                    "last_edit_date":a.last_edit_date,
-                    "author":        a.owner.as_ref().and_then(|o| o.display_name.clone()),
-                    "author_rep":    a.owner.as_ref().and_then(|o| o.reputation),
-                })
-            })
-            .collect::<Vec<_>>()
-    } else {
-        Vec::new()
-    };
-
-    let accepted = answers
-        .iter()
-        .find(|a| {
-            a.get("is_accepted")
-                .and_then(|v| v.as_bool())
-                .unwrap_or(false)
-        })
-        .cloned();
-
-    Ok(json!({
-        "url":            url,
-        "question_id":    q.question_id,
-        "title":          q.title,
-        "body":           q.body,
-        "tags":           q.tags,
-        "score":          q.score,
-        "view_count":     q.view_count,
-        "answer_count":   q.answer_count,
-        "is_answered":    q.is_answered,
-        "accepted_answer_id": q.accepted_answer_id,
-        "creation_date":  q.creation_date,
-        "last_activity_date": q.last_activity_date,
-        "author":         q.owner.as_ref().and_then(|o| o.display_name.clone()),
-        "author_rep":     q.owner.as_ref().and_then(|o| o.reputation),
-        "link":           q.link,
-        "accepted_answer": accepted,
-        "top_answers":    answers,
-    }))
-}
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-/// Parse question id from a URL of the form `/questions/{id}/{slug}`.
-fn parse_question_id(url: &str) -> Option<u64> {
-    let after = url.split("/questions/").nth(1)?;
-    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
-    let first = stripped.split('/').next()?;
-    first.parse::<u64>().ok()
-}
-
-// ---------------------------------------------------------------------------
-// Stack Exchange API types
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct QResponse {
-    #[serde(default)]
-    items: Vec<Question>,
-}
-
-#[derive(Deserialize)]
-struct Question {
-    question_id: Option<u64>,
-    title: Option<String>,
-    body: Option<String>,
-    #[serde(default)]
-    tags: Vec<String>,
-    score: Option<i64>,
-    view_count: Option<i64>,
-    answer_count: Option<i64>,
-    is_answered: Option<bool>,
-    accepted_answer_id: Option<u64>,
-    creation_date: Option<i64>,
-    last_activity_date: Option<i64>,
-    owner: Option<Owner>,
-    link: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct AResponse {
-    #[serde(default)]
-    items: Vec<Answer>,
-}
-
-#[derive(Deserialize)]
-struct Answer {
-    answer_id: Option<u64>,
-    is_accepted: Option<bool>,
-    score: Option<i64>,
-    body: Option<String>,
-    creation_date: Option<i64>,
-    last_edit_date: Option<i64>,
-    owner: Option<Owner>,
-}
-
-#[derive(Deserialize)]
-struct Owner {
-    display_name: Option<String>,
-    reputation: Option<i64>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_question_urls() {
-        assert!(matches(
-            "https://stackoverflow.com/questions/12345/some-slug"
-        ));
-        assert!(matches(
-            "https://stackoverflow.com/questions/12345/some-slug?answertab=votes"
-        ));
-        assert!(!matches("https://stackoverflow.com/"));
-        assert!(!matches("https://stackoverflow.com/questions"));
-        assert!(!matches("https://stackoverflow.com/users/100"));
-        assert!(!matches("https://example.com/questions/12345/x"));
-    }
-
-    #[test]
-    fn parse_question_id_handles_slug_and_query() {
-        assert_eq!(
-            parse_question_id("https://stackoverflow.com/questions/12345/some-slug"),
-            Some(12345)
-        );
-        assert_eq!(
-            parse_question_id("https://stackoverflow.com/questions/12345/some-slug?tab=newest"),
-            Some(12345)
-        );
-        assert_eq!(parse_question_id("https://stackoverflow.com/foo"), None);
-    }
-}
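Reviewer note: the removed Stack Overflow extractor reduces to two steps before any network call: parse the numeric id from the question URL, then build the two Stack Exchange API URLs. The block below is a minimal standalone sketch of those steps. `api_urls` is a hypothetical helper that groups the two `format!` calls from the removed `extract`, and no API key is attached.

```rust
/// Parse the numeric id from a `/questions/{id}/{slug}` URL.
fn parse_question_id(url: &str) -> Option<u64> {
    let after = url.split("/questions/").nth(1)?;
    // Ignore any query string or fragment before reading the id segment.
    let path = after.split(['?', '#']).next()?;
    path.split('/').next()?.parse().ok()
}

/// Hypothetical helper: build the question URL and the answers URL
/// (answers sorted by votes, highest first), both with rendered bodies.
fn api_urls(id: u64) -> (String, String) {
    let base = "https://api.stackexchange.com/2.3/questions";
    (
        format!("{base}/{id}?site=stackoverflow&filter=withbody"),
        format!("{base}/{id}/answers?site=stackoverflow&filter=withbody&order=desc&sort=votes"),
    )
}

fn main() {
    let url = "https://stackoverflow.com/questions/12345/some-slug?tab=newest";
    if let Some(id) = parse_question_id(url) {
        let (question_url, answers_url) = api_urls(id);
        println!("{question_url}");
        println!("{answers_url}");
    }
}
```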
diff --git a/crates/webclaw-fetch/src/extractors/substack_post.rs b/crates/webclaw-fetch/src/extractors/substack_post.rs
deleted file mode 100644
index c5b5019..0000000
--- a/crates/webclaw-fetch/src/extractors/substack_post.rs
+++ /dev/null
@@ -1,565 +0,0 @@
-//! Substack post extractor.
-//!
-//! Every Substack publication exposes `/api/v1/posts/{slug}` that
-//! returns the full post as JSON: body HTML, cover image, author,
-//! publication info, reactions, paywall state. No auth on public
-//! posts.
-//!
-//! Works on both `*.substack.com` subdomains and custom domains
-//! that a publication maps to Substack. Detection is
-//! "URL has `/p/{slug}`" because that's the canonical Substack post
-//! path. Explicit-call only because the `/p/{slug}` URL shape is
-//! used by non-Substack sites too.
-//!
-//! ## Fallback
-//!
-//! The API endpoint is rate-limited aggressively on popular publications
-//! and occasionally returns 403 on custom domains with Cloudflare in
-//! front. When that happens we escalate to an HTML fetch (via
-//! `smart_fetch_html`, so antibot-protected custom domains still work)
-//! and extract OG tags + Article JSON-LD for a degraded-but-useful
-//! payload. The response shape stays stable across both paths; a
-//! `data_source` field tells the caller which branch ran.
-
-use std::sync::OnceLock;
-
-use regex::Regex;
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::cloud::{self, CloudError};
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "substack_post",
-    label: "Substack post",
-    description: "Returns post HTML, title, subtitle, author, publication, reactions, paywall status via the Substack public API. Falls back to OG + JSON-LD HTML parsing when the API is rate-limited.",
-    url_patterns: &[
-        "https://{pub}.substack.com/p/{slug}",
-        "https://{custom-domain}/p/{slug}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    if !(url.starts_with("http://") || url.starts_with("https://")) {
-        return false;
-    }
-    url.contains("/p/")
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let slug = parse_slug(url).ok_or_else(|| {
-        FetchError::Build(format!("substack_post: cannot parse slug from '{url}'"))
-    })?;
-    let host = host_of(url);
-    if host.is_empty() {
-        return Err(FetchError::Build(format!(
-            "substack_post: empty host in '{url}'"
-        )));
-    }
-    let scheme = if url.starts_with("http://") {
-        "http"
-    } else {
-        "https"
-    };
-    let api_url = format!("{scheme}://{host}/api/v1/posts/{slug}");
-
-    // 1. Try the public API. 200 = full payload; 404 = real miss; any
-    //    other status hands off to the HTML fallback so a transient rate
-    //    limit or a hardened custom domain doesn't fail the whole call.
-    let resp = client.fetch(&api_url).await?;
-    match resp.status {
-        200 => match serde_json::from_str::<Post>(&resp.html) {
-            Ok(p) => Ok(build_api_payload(url, &api_url, &slug, p)),
-            Err(e) => {
-                // API returned 200 but the body isn't the Post shape we
-                // expect. Could be a custom-domain site that exposes
-                // something else at /api/v1/posts/. Fall back to HTML
-                // rather than hard-failing.
-                html_fallback(
-                    client,
-                    url,
-                    &api_url,
-                    &slug,
-                    Some(format!(
-                        "api returned 200 but body was not Substack JSON ({e})"
-                    )),
-                )
-                .await
-            }
-        },
-        404 => Err(FetchError::Build(format!(
-            "substack_post: '{slug}' not found on {host} (got 404). \
-             If the publication isn't actually on Substack, use /v1/scrape instead."
-        ))),
-        _ => {
-            // Rate limit, 403, 5xx, whatever: try HTML.
-            let reason = format!("api returned status {} for {api_url}", resp.status);
-            html_fallback(client, url, &api_url, &slug, Some(reason)).await
-        }
-    }
-}
-
-// ---------------------------------------------------------------------------
-// API-path payload builder
-// ---------------------------------------------------------------------------
-
-fn build_api_payload(url: &str, api_url: &str, slug: &str, p: Post) -> Value {
-    json!({
-        "url":                  url,
-        "api_url":              api_url,
-        "data_source":          "api",
-        "id":                   p.id,
-        "type":                 p.r#type,
-        "slug":                 p.slug.or_else(|| Some(slug.to_string())),
-        "title":                p.title,
-        "subtitle":             p.subtitle,
-        "description":          p.description,
-        "canonical_url":        p.canonical_url,
-        "post_date":            p.post_date,
-        "updated_at":           p.updated_at,
-        "audience":             p.audience,
-        "has_paywall":          matches!(p.audience.as_deref(), Some("only_paid") | Some("founding")),
-        "is_free_preview":      p.is_free_preview,
-        "cover_image":          p.cover_image,
-        "word_count":           p.wordcount,
-        "reactions":            p.reactions,
-        "comment_count":        p.comment_count,
-        "body_html":            p.body_html,
-        "body_text":            p.truncated_body_text.or(p.body_text),
-        "publication": json!({
-            "id":           p.publication.as_ref().and_then(|pub_| pub_.id),
-            "name":         p.publication.as_ref().and_then(|pub_| pub_.name.clone()),
-            "subdomain":    p.publication.as_ref().and_then(|pub_| pub_.subdomain.clone()),
-            "custom_domain":p.publication.as_ref().and_then(|pub_| pub_.custom_domain.clone()),
-        }),
-        "authors": p.published_bylines.iter().map(|a| json!({
-            "id":     a.id,
-            "name":   a.name,
-            "handle": a.handle,
-            "photo":  a.photo_url,
-        })).collect::<Vec<_>>(),
-    })
-}
-
-// ---------------------------------------------------------------------------
-// HTML fallback: OG + Article JSON-LD
-// ---------------------------------------------------------------------------
-
-async fn html_fallback(
-    client: &dyn Fetcher,
-    url: &str,
-    api_url: &str,
-    slug: &str,
-    fallback_reason: Option<String>,
-) -> Result<Value, FetchError> {
-    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
-        .await
-        .map_err(cloud_to_fetch_err)?;
-
-    let mut data = parse_html(&fetched.html, url, api_url, slug);
-    if let Some(obj) = data.as_object_mut() {
-        obj.insert(
-            "fetch_source".into(),
-            match fetched.source {
-                cloud::FetchSource::Local => json!("local"),
-                cloud::FetchSource::Cloud => json!("cloud"),
-            },
-        );
-        if let Some(reason) = fallback_reason {
-            obj.insert("fallback_reason".into(), json!(reason));
-        }
-    }
-    Ok(data)
-}
-
-/// Pure HTML parser. Pulls title, subtitle, description, cover image,
-/// publish date, and authors from OG tags and Article JSON-LD. Kept
-/// public so tests can exercise it with fixtures.
-pub fn parse_html(html: &str, url: &str, api_url: &str, slug: &str) -> Value {
-    let article = find_article_jsonld(html);
-
-    let title = article
-        .as_ref()
-        .and_then(|v| get_text(v, "headline"))
-        .or_else(|| og(html, "title"));
-    let description = article
-        .as_ref()
-        .and_then(|v| get_text(v, "description"))
-        .or_else(|| og(html, "description"));
-    let cover_image = article
-        .as_ref()
-        .and_then(get_first_image)
-        .or_else(|| og(html, "image"));
-    let post_date = article
-        .as_ref()
-        .and_then(|v| get_text(v, "datePublished"))
-        .or_else(|| meta_property(html, "article:published_time"));
-    let updated_at = article.as_ref().and_then(|v| get_text(v, "dateModified"));
-    let publication_name = og(html, "site_name");
-    let authors = article.as_ref().map(extract_authors).unwrap_or_default();
-
-    json!({
-        "url":                  url,
-        "api_url":              api_url,
-        "data_source":          "html_fallback",
-        "slug":                 slug,
-        "title":                title,
-        "subtitle":             None::<String>,
-        "description":          description,
-        "canonical_url":        canonical_url(html).or_else(|| Some(url.to_string())),
-        "post_date":            post_date,
-        "updated_at":           updated_at,
-        "cover_image":          cover_image,
-        "body_html":            None::<String>,
-        "body_text":            None::<String>,
-        "word_count":           None::<i64>,
-        "comment_count":        None::<i64>,
-        "reactions":            Value::Null,
-        "has_paywall":          None::<bool>,
-        "is_free_preview":      None::<bool>,
-        "publication": json!({
-            "name": publication_name,
-        }),
-        "authors": authors,
-    })
-}
-
-fn extract_authors(v: &Value) -> Vec<Value> {
-    let Some(a) = v.get("author") else {
-        return Vec::new();
-    };
-    let one = |val: &Value| -> Option<Value> {
-        match val {
-            Value::String(s) => Some(json!({"name": s})),
-            Value::Object(_) => {
-                let name = val.get("name").and_then(|n| n.as_str())?;
-                let handle = val
-                    .get("url")
-                    .and_then(|u| u.as_str())
-                    .and_then(handle_from_author_url);
-                Some(json!({
-                    "name":   name,
-                    "handle": handle,
-                }))
-            }
-            _ => None,
-        }
-    };
-    match a {
-        Value::Array(arr) => arr.iter().filter_map(one).collect(),
-        _ => one(a).into_iter().collect(),
-    }
-}
-
-// ---------------------------------------------------------------------------
-// URL helpers
-// ---------------------------------------------------------------------------
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-fn parse_slug(url: &str) -> Option<String> {
-    let after = url.split("/p/").nth(1)?;
-    let stripped = after
-        .split(['?', '#'])
-        .next()?
-        .trim_end_matches('/')
-        .split('/')
-        .next()
-        .unwrap_or("");
-    if stripped.is_empty() {
-        None
-    } else {
-        Some(stripped.to_string())
-    }
-}
-
-/// Extract the Substack handle from an author URL like
-/// `https://substack.com/@handle` or `https://pub.substack.com/@handle`.
-///
-/// Returns `None` when the URL has no `@` segment (e.g. a non-Substack
-/// author page) so we don't synthesise a fake handle.
-fn handle_from_author_url(u: &str) -> Option<String> {
-    let after = u.rsplit_once('@').map(|(_, tail)| tail)?;
-    let clean = after.split(['/', '?', '#']).next()?;
-    if clean.is_empty() {
-        None
-    } else {
-        Some(clean.to_string())
-    }
-}
-
-// ---------------------------------------------------------------------------
-// HTML tag helpers
-// ---------------------------------------------------------------------------
-
-fn og(html: &str, prop: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
-    });
-    for c in re.captures_iter(html) {
-        if c.get(1).is_some_and(|m| m.as_str() == prop) {
-            return c.get(2).map(|m| m.as_str().to_string());
-        }
-    }
-    None
-}
-
-/// Pull `<meta property="article:published_time" content="...">` and
-/// similar structured meta tags.
-fn meta_property(html: &str, prop: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+property="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
-    });
-    for c in re.captures_iter(html) {
-        if c.get(1).is_some_and(|m| m.as_str() == prop) {
-            return c.get(2).map(|m| m.as_str().to_string());
-        }
-    }
-    None
-}
-
-fn canonical_url(html: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE
-        .get_or_init(|| Regex::new(r#"(?i)<link[^>]+rel="canonical"[^>]+href="([^"]+)""#).unwrap());
-    re.captures(html)
-        .and_then(|c| c.get(1))
-        .map(|m| m.as_str().to_string())
-}
-
-// ---------------------------------------------------------------------------
-// JSON-LD walkers (Article / NewsArticle)
-// ---------------------------------------------------------------------------
-
-fn find_article_jsonld(html: &str) -> Option<Value> {
-    let blocks = webclaw_core::structured_data::extract_json_ld(html);
-    for b in blocks {
-        if let Some(found) = find_article_in(&b) {
-            return Some(found);
-        }
-    }
-    None
-}
-
-fn find_article_in(v: &Value) -> Option<Value> {
-    if is_article_type(v) {
-        return Some(v.clone());
-    }
-    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
-        for item in graph {
-            if let Some(found) = find_article_in(item) {
-                return Some(found);
-            }
-        }
-    }
-    if let Some(arr) = v.as_array() {
-        for item in arr {
-            if let Some(found) = find_article_in(item) {
-                return Some(found);
-            }
-        }
-    }
-    None
-}
-
-fn is_article_type(v: &Value) -> bool {
-    let Some(t) = v.get("@type") else {
-        return false;
-    };
-    let is_art = |s: &str| {
-        matches!(
-            s,
-            "Article" | "NewsArticle" | "BlogPosting" | "SocialMediaPosting"
-        )
-    };
-    match t {
-        Value::String(s) => is_art(s),
-        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_art)),
-        _ => false,
-    }
-}
-
-fn get_text(v: &Value, key: &str) -> Option<String> {
-    v.get(key).and_then(|x| match x {
-        Value::String(s) => Some(s.clone()),
-        Value::Number(n) => Some(n.to_string()),
-        _ => None,
-    })
-}
-
-fn get_first_image(v: &Value) -> Option<String> {
-    match v.get("image")? {
-        Value::String(s) => Some(s.clone()),
-        Value::Array(arr) => arr.iter().find_map(|x| match x {
-            Value::String(s) => Some(s.clone()),
-            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
-            _ => None,
-        }),
-        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
-        _ => None,
-    }
-}
-
-fn cloud_to_fetch_err(e: CloudError) -> FetchError {
-    FetchError::Build(e.to_string())
-}
-
-// ---------------------------------------------------------------------------
-// Substack API types (subset)
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct Post {
-    id: Option<i64>,
-    r#type: Option<String>,
-    slug: Option<String>,
-    title: Option<String>,
-    subtitle: Option<String>,
-    description: Option<String>,
-    canonical_url: Option<String>,
-    post_date: Option<String>,
-    updated_at: Option<String>,
-    audience: Option<String>,
-    is_free_preview: Option<bool>,
-    cover_image: Option<String>,
-    wordcount: Option<i64>,
-    reactions: Option<serde_json::Value>,
-    comment_count: Option<i64>,
-    body_html: Option<String>,
-    body_text: Option<String>,
-    truncated_body_text: Option<String>,
-    publication: Option<Publication>,
-    #[serde(default, rename = "publishedBylines")]
-    published_bylines: Vec<Byline>,
-}
-
-#[derive(Deserialize)]
-struct Publication {
-    id: Option<i64>,
-    name: Option<String>,
-    subdomain: Option<String>,
-    custom_domain: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct Byline {
-    id: Option<i64>,
-    name: Option<String>,
-    handle: Option<String>,
-    photo_url: Option<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_post_urls() {
-        assert!(matches(
-            "https://stratechery.substack.com/p/the-tech-letter"
-        ));
-        assert!(matches("https://simonwillison.net/p/2024-08-01-something"));
-        assert!(!matches("https://example.com/"));
-        assert!(!matches("ftp://example.com/p/foo"));
-    }
-
-    #[test]
-    fn parse_slug_strips_query_and_trailing_slash() {
-        assert_eq!(
-            parse_slug("https://example.substack.com/p/my-post"),
-            Some("my-post".into())
-        );
-        assert_eq!(
-            parse_slug("https://example.substack.com/p/my-post/"),
-            Some("my-post".into())
-        );
-        assert_eq!(
-            parse_slug("https://example.substack.com/p/my-post?ref=123"),
-            Some("my-post".into())
-        );
-    }
-
-    #[test]
-    fn parse_html_extracts_from_og_tags() {
-        let html = r##"
-<html><head>
-<meta property="og:title" content="My Great Post">
-<meta property="og:description" content="A short summary.">
-<meta property="og:image" content="https://cdn.substack.com/cover.jpg">
-<meta property="og:site_name" content="My Publication">
-<meta property="article:published_time" content="2025-09-01T10:00:00Z">
-<link rel="canonical" href="https://mypub.substack.com/p/my-post">
-</head></html>"##;
-        let v = parse_html(
-            html,
-            "https://mypub.substack.com/p/my-post",
-            "https://mypub.substack.com/api/v1/posts/my-post",
-            "my-post",
-        );
-        assert_eq!(v["data_source"], "html_fallback");
-        assert_eq!(v["title"], "My Great Post");
-        assert_eq!(v["description"], "A short summary.");
-        assert_eq!(v["cover_image"], "https://cdn.substack.com/cover.jpg");
-        assert_eq!(v["post_date"], "2025-09-01T10:00:00Z");
-        assert_eq!(v["publication"]["name"], "My Publication");
-        assert_eq!(v["canonical_url"], "https://mypub.substack.com/p/my-post");
-    }
-
-    #[test]
-    fn parse_html_prefers_jsonld_when_present() {
-        let html = r##"
-<html><head>
-<meta property="og:title" content="OG Title">
-<script type="application/ld+json">
-{"@context":"https://schema.org","@type":"NewsArticle",
- "headline":"JSON-LD Title",
- "description":"JSON-LD desc.",
- "image":"https://cdn.substack.com/hero.jpg",
- "datePublished":"2025-10-12T08:30:00Z",
- "dateModified":"2025-10-12T09:00:00Z",
- "author":[{"@type":"Person","name":"Alice Author","url":"https://substack.com/@alice"}]}
-</script>
-</head></html>"##;
-        let v = parse_html(
-            html,
-            "https://example.com/p/a",
-            "https://example.com/api/v1/posts/a",
-            "a",
-        );
-        assert_eq!(v["title"], "JSON-LD Title");
-        assert_eq!(v["description"], "JSON-LD desc.");
-        assert_eq!(v["cover_image"], "https://cdn.substack.com/hero.jpg");
-        assert_eq!(v["post_date"], "2025-10-12T08:30:00Z");
-        assert_eq!(v["updated_at"], "2025-10-12T09:00:00Z");
-        assert_eq!(v["authors"][0]["name"], "Alice Author");
-        assert_eq!(v["authors"][0]["handle"], "alice");
-    }
-
-    #[test]
-    fn handle_from_author_url_pulls_handle() {
-        assert_eq!(
-            handle_from_author_url("https://substack.com/@alice"),
-            Some("alice".into())
-        );
-        assert_eq!(
-            handle_from_author_url("https://mypub.substack.com/@bob/"),
-            Some("bob".into())
-        );
-        assert_eq!(
-            handle_from_author_url("https://not-substack.com/author/carol"),
-            None
-        );
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs b/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs
deleted file mode 100644
index 8b77a29..0000000
--- a/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs
+++ /dev/null
@@ -1,572 +0,0 @@
-//! Trustpilot company reviews extractor.
-//!
-//! `trustpilot.com/review/{domain}` pages are always behind AWS WAF's
-//! "Verifying your connection" interstitial, so this extractor always
-//! routes through [`cloud::smart_fetch_html`]. Without
-//! `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` it returns a clean
-//! "set API key" error; with one it escalates to api.webclaw.io.
-//!
-//! ## 2025 JSON-LD schema
-//!
-//! Trustpilot replaced the old single-Organization + aggregateRating
-//! shape with three separate JSON-LD blocks:
-//!
-//! 1. `Organization` block for Trustpilot the platform itself
-//!    (company info, addresses, social profiles). Not the business
-//!    being reviewed. We detect and skip this.
-//! 2. `Dataset` block with a csvw:Table mainEntity that contains the
-//!    per-star-bucket counts for the target business plus a Total
-//!    column. The Dataset's `name` is the business display name.
-//! 3. `aiSummary` + `aiSummaryReviews` block: the AI-generated
-//!    summary of reviews plus the individual review objects
-//!    (consumer, dates, rating, title, text, language, likes).
-//!
-//! In addition, the page's `og:title` has the shape
-//! `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"` and
-//! `og:description` carries `"{N} customers have already said"`. The title
-//! supplies the rating label; both are fallbacks when the Dataset block is absent.
-
-use std::sync::OnceLock;
-
-use regex::Regex;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::cloud::{self, CloudError};
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "trustpilot_reviews",
-    label: "Trustpilot reviews",
-    description: "Returns business name, aggregate rating, star distribution, recent reviews, and the AI summary for a Trustpilot /review/{domain} page.",
-    url_patterns: &["https://www.trustpilot.com/review/{domain}"],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if !matches!(host, "www.trustpilot.com" | "trustpilot.com") {
-        return false;
-    }
-    url.contains("/review/")
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
-        .await
-        .map_err(cloud_to_fetch_err)?;
-
-    let mut data = parse(&fetched.html, url)?;
-    if let Some(obj) = data.as_object_mut() {
-        obj.insert(
-            "data_source".into(),
-            match fetched.source {
-                cloud::FetchSource::Local => json!("local"),
-                cloud::FetchSource::Cloud => json!("cloud"),
-            },
-        );
-    }
-    Ok(data)
-}
-
-/// Pure parser. Kept public so the cloud pipeline can reuse it on its
-/// own fetched HTML without going through the async extract path.
-pub fn parse(html: &str, url: &str) -> Result<Value, FetchError> {
-    let domain = parse_review_domain(url).ok_or_else(|| {
-        FetchError::Build(format!(
-            "trustpilot_reviews: cannot parse /review/{{domain}} from '{url}'"
-        ))
-    })?;
-
-    let blocks = webclaw_core::structured_data::extract_json_ld(html);
-
-    // The business Dataset block has `about.@id` pointing to the target
-    // domain's Organization (e.g. `.../Organization/anthropic.com`).
-    let dataset = find_business_dataset(&blocks, &domain);
-
-    // The aiSummary block: not typed (no `@type`), detect by key.
-    let ai_block = find_ai_summary_block(&blocks);
-
-    // Business name: Dataset > metadata.title regex > URL domain.
-    let business_name = dataset
-        .as_ref()
-        .and_then(|d| get_string(d, "name"))
-        .or_else(|| parse_name_from_og_title(html))
-        .or_else(|| Some(domain.clone()));
-
-    // Rating distribution from the csvw:Table columns. Each column has
-    // csvw:name like "1 star" / "Total" and a single cell with the
-    // integer count.
-    let distribution = dataset.as_ref().and_then(parse_star_distribution);
-    let (rating_from_dist, total_from_dist) = distribution
-        .as_ref()
-        .map(compute_rating_stats)
-        .unwrap_or((None, None));
-
-    // Page-title / page-description fallbacks. OG title format:
-    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
-    let (rating_label, rating_from_og) = parse_rating_from_og_title(html);
-    let total_from_desc = parse_review_count_from_og_description(html);
-
-    // Recent reviews carried by the aiSummary block.
-    let recent_reviews: Vec<Value> = ai_block
-        .as_ref()
-        .and_then(|a| a.get("aiSummaryReviews"))
-        .and_then(|arr| arr.as_array())
-        .map(|arr| arr.iter().map(extract_review).collect())
-        .unwrap_or_default();
-
-    let ai_summary = ai_block
-        .as_ref()
-        .and_then(|a| a.get("aiSummary"))
-        .and_then(|s| s.get("summary"))
-        .and_then(|t| t.as_str())
-        .map(String::from);
-
-    Ok(json!({
-        "url":               url,
-        "domain":            domain,
-        "business_name":     business_name,
-        "rating_label":      rating_label,
-        "average_rating":    rating_from_dist.or(rating_from_og),
-        "review_count":      total_from_dist.or(total_from_desc),
-        "rating_distribution": distribution,
-        "ai_summary":        ai_summary,
-        "recent_reviews":    recent_reviews,
-        "review_count_listed": recent_reviews.len(),
-    }))
-}
-
-fn cloud_to_fetch_err(e: CloudError) -> FetchError {
-    FetchError::Build(e.to_string())
-}
-
-// ---------------------------------------------------------------------------
-// URL helpers
-// ---------------------------------------------------------------------------
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-/// Pull the target domain from `trustpilot.com/review/{domain}`.
-fn parse_review_domain(url: &str) -> Option<String> {
-    let after = url.split("/review/").nth(1)?;
-    let stripped = after
-        .split(['?', '#'])
-        .next()?
-        .trim_end_matches('/')
-        .split('/')
-        .next()
-        .unwrap_or("");
-    if stripped.is_empty() {
-        None
-    } else {
-        Some(stripped.to_string())
-    }
-}
-
-// ---------------------------------------------------------------------------
-// JSON-LD block walkers
-// ---------------------------------------------------------------------------
-
-/// Find the Dataset block whose `about.@id` references the target
-/// domain's Organization. Falls back to the first Dataset found if no
-/// `@id` matches (Trustpilot occasionally varies the URL).
-fn find_business_dataset(blocks: &[Value], domain: &str) -> Option<Value> {
-    let mut fallback_any_dataset: Option<Value> = None;
-    for block in blocks {
-        for node in walk_graph(block) {
-            if !is_dataset(&node) {
-                continue;
-            }
-            if dataset_about_matches_domain(&node, domain) {
-                return Some(node);
-            }
-            if fallback_any_dataset.is_none() {
-                fallback_any_dataset = Some(node);
-            }
-        }
-    }
-    fallback_any_dataset
-}
-
-fn is_dataset(v: &Value) -> bool {
-    v.get("@type")
-        .and_then(|t| t.as_str())
-        .is_some_and(|s| s == "Dataset")
-}
-
-fn dataset_about_matches_domain(v: &Value, domain: &str) -> bool {
-    let about_id = v
-        .get("about")
-        .and_then(|a| a.get("@id"))
-        .and_then(|id| id.as_str());
-    let Some(id) = about_id else {
-        return false;
-    };
-    id.contains(&format!("/Organization/{domain}"))
-}
-
-/// The aiSummary / aiSummaryReviews block has no `@type`, so match by
-/// presence of the `aiSummary` key.
-fn find_ai_summary_block(blocks: &[Value]) -> Option<Value> {
-    for block in blocks {
-        for node in walk_graph(block) {
-            if node.get("aiSummary").is_some() {
-                return Some(node);
-            }
-        }
-    }
-    None
-}
-
-/// Flatten each block (and its `@graph`) into a list of nodes we can
-/// iterate over. Handles both `@graph: [ ... ]` (array) and
-/// `@graph: { ... }` (single object) shapes — Trustpilot uses both.
-fn walk_graph(block: &Value) -> Vec<Value> {
-    let mut out = vec![block.clone()];
-    if let Some(graph) = block.get("@graph") {
-        match graph {
-            Value::Array(arr) => out.extend(arr.iter().cloned()),
-            Value::Object(_) => out.push(graph.clone()),
-            _ => {}
-        }
-    }
-    out
-}
-
-// ---------------------------------------------------------------------------
-// Rating distribution (csvw:Table)
-// ---------------------------------------------------------------------------
-
-/// Parse the per-star distribution from the Dataset block. Returns
-/// `{"one_star": {count, percent}, ..., "total": {count, percent}}`.
-fn parse_star_distribution(dataset: &Value) -> Option<Value> {
-    let columns = dataset
-        .get("mainEntity")?
-        .get("csvw:tableSchema")?
-        .get("csvw:columns")?
-        .as_array()?;
-    let mut out = serde_json::Map::new();
-    for col in columns {
-        let name = col.get("csvw:name").and_then(|n| n.as_str())?;
-        let cell = col.get("csvw:cells").and_then(|c| c.as_array())?.first()?;
-        let count = cell
-            .get("csvw:value")
-            .and_then(|v| v.as_str())
-            .and_then(|s| s.parse::<i64>().ok());
-        let percent = cell
-            .get("csvw:notes")
-            .and_then(|n| n.as_array())
-            .and_then(|arr| arr.first())
-            .and_then(|s| s.as_str())
-            .map(String::from);
-        let key = normalise_star_key(name);
-        out.insert(
-            key,
-            json!({
-                "count":   count,
-                "percent": percent,
-            }),
-        );
-    }
-    if out.is_empty() {
-        None
-    } else {
-        Some(Value::Object(out))
-    }
-}
-
-/// "1 star" -> "one_star", "Total" -> "total". Easier to consume than
-/// the raw "1 star" key which fights YAML/JS property access.
-fn normalise_star_key(name: &str) -> String {
-    let trimmed = name.trim().to_lowercase();
-    match trimmed.as_str() {
-        "1 star" => "one_star".into(),
-        "2 stars" => "two_stars".into(),
-        "3 stars" => "three_stars".into(),
-        "4 stars" => "four_stars".into(),
-        "5 stars" => "five_stars".into(),
-        "total" => "total".into(),
-        other => other.replace(' ', "_"),
-    }
-}
-
-/// Compute average rating (weighted by bucket) and total count from the
-/// parsed distribution. Returns `(average, total)`.
-fn compute_rating_stats(distribution: &Value) -> (Option<String>, Option<i64>) {
-    let Some(obj) = distribution.as_object() else {
-        return (None, None);
-    };
-    let get_count = |key: &str| -> i64 {
-        obj.get(key)
-            .and_then(|v| v.get("count"))
-            .and_then(|v| v.as_i64())
-            .unwrap_or(0)
-    };
-    let one = get_count("one_star");
-    let two = get_count("two_stars");
-    let three = get_count("three_stars");
-    let four = get_count("four_stars");
-    let five = get_count("five_stars");
-    let total_bucket = one + two + three + four + five;
-    let total = obj
-        .get("total")
-        .and_then(|v| v.get("count"))
-        .and_then(|v| v.as_i64())
-        .unwrap_or(total_bucket);
-    if total == 0 {
-        return (None, Some(0));
-    }
-    let weighted = one + (two * 2) + (three * 3) + (four * 4) + (five * 5);
-    let avg = weighted as f64 / total_bucket.max(1) as f64;
-    // One decimal place, matching how Trustpilot displays the score.
-    (Some(format!("{avg:.1}")), Some(total))
-}
-
-// ---------------------------------------------------------------------------
-// OG / meta-tag fallbacks
-// ---------------------------------------------------------------------------
-
-/// Regex out the business name from the standard Trustpilot OG title
-/// shape: `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"`.
-fn parse_name_from_og_title(html: &str) -> Option<String> {
-    let title = og(html, "title")?;
-    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| Regex::new(r"^(.+?)\s+is rated\b").unwrap());
-    re.captures(&title)
-        .and_then(|c| c.get(1))
-        .map(|m| m.as_str().to_string())
-}
-
-/// Pull the rating label (e.g. "Bad", "Excellent") and numeric value
-/// from the OG title.
-fn parse_rating_from_og_title(html: &str) -> (Option<String>, Option<String>) {
-    let Some(title) = og(html, "title") else {
-        return (None, None);
-    };
-    static RE: OnceLock<Regex> = OnceLock::new();
-    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"is rated\s*[\\"]+([^"\\]+)[\\"]+\s*with\s*([\d.]+)\s*/\s*5"#).unwrap()
-    });
-    let Some(caps) = re.captures(&title) else {
-        return (None, None);
-    };
-    (
-        caps.get(1).map(|m| m.as_str().trim().to_string()),
-        caps.get(2).map(|m| m.as_str().to_string()),
-    )
-}
-
-/// Parse "hear what 226 customers have already said" from the OG
-/// description tag.
-fn parse_review_count_from_og_description(html: &str) -> Option<i64> {
-    let desc = og(html, "description")?;
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| Regex::new(r"(\d[\d,]*)\s+customers").unwrap());
-    re.captures(&desc)?
-        .get(1)?
-        .as_str()
-        .replace(',', "")
-        .parse::<i64>()
-        .ok()
-}
-
-fn og(html: &str, prop: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
-    });
-    for c in re.captures_iter(html) {
-        if c.get(1).is_some_and(|m| m.as_str() == prop) {
-            let raw = c.get(2).map(|m| m.as_str())?;
-            return Some(html_unescape(raw));
-        }
-    }
-    None
-}
-
-/// Minimal HTML entity unescaping for the four entities the
-/// synthesize_html escaper might produce. Keeps us off a heavier dep.
-fn html_unescape(s: &str) -> String {
-    s.replace("&quot;", "\"")
-        .replace("&lt;", "<")
-        .replace("&gt;", ">")
-        .replace("&amp;", "&") // last, so "&amp;lt;" is not unescaped twice
-}
-
-fn get_string(v: &Value, key: &str) -> Option<String> {
-    v.get(key).and_then(|x| x.as_str().map(String::from))
-}
-
-// ---------------------------------------------------------------------------
-// Review extraction
-// ---------------------------------------------------------------------------
-
-fn extract_review(r: &Value) -> Value {
-    json!({
-        "id":          r.get("id").and_then(|v| v.as_str()),
-        "rating":      r.get("rating").and_then(|v| v.as_i64()),
-        "title":       r.get("title").and_then(|v| v.as_str()),
-        "text":        r.get("text").and_then(|v| v.as_str()),
-        "language":    r.get("language").and_then(|v| v.as_str()),
-        "source":      r.get("source").and_then(|v| v.as_str()),
-        "likes":       r.get("likes").and_then(|v| v.as_i64()),
-        "author":      r.get("consumer").and_then(|c| c.get("displayName")).and_then(|v| v.as_str()),
-        "author_country": r.get("consumer").and_then(|c| c.get("countryCode")).and_then(|v| v.as_str()),
-        "author_review_count": r.get("consumer").and_then(|c| c.get("numberOfReviews")).and_then(|v| v.as_i64()),
-        "verified":    r.get("consumer").and_then(|c| c.get("isVerified")).and_then(|v| v.as_bool()),
-        "date_experienced": r.get("dates").and_then(|d| d.get("experiencedDate")).and_then(|v| v.as_str()),
-        "date_published":   r.get("dates").and_then(|d| d.get("publishedDate")).and_then(|v| v.as_str()),
-    })
-}
-
-// ---------------------------------------------------------------------------
-// Tests
-// ---------------------------------------------------------------------------
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_trustpilot_review_urls() {
-        assert!(matches("https://www.trustpilot.com/review/stripe.com"));
-        assert!(matches("https://trustpilot.com/review/example.com"));
-        assert!(!matches("https://www.trustpilot.com/"));
-        assert!(!matches("https://example.com/review/foo"));
-    }
-
-    #[test]
-    fn parse_review_domain_handles_query_and_slash() {
-        assert_eq!(
-            parse_review_domain("https://www.trustpilot.com/review/anthropic.com"),
-            Some("anthropic.com".into())
-        );
-        assert_eq!(
-            parse_review_domain("https://www.trustpilot.com/review/anthropic.com/"),
-            Some("anthropic.com".into())
-        );
-        assert_eq!(
-            parse_review_domain("https://www.trustpilot.com/review/anthropic.com?stars=5"),
-            Some("anthropic.com".into())
-        );
-    }
-
-    #[test]
-    fn normalise_star_key_covers_all_buckets() {
-        assert_eq!(normalise_star_key("1 star"), "one_star");
-        assert_eq!(normalise_star_key("2 stars"), "two_stars");
-        assert_eq!(normalise_star_key("5 stars"), "five_stars");
-        assert_eq!(normalise_star_key("Total"), "total");
-    }
-
-    #[test]
-    fn compute_rating_stats_weighted_average() {
-        // 100 1-stars, 100 5-stars → avg 3.0 over 200 reviews.
-        let dist = json!({
-            "one_star":   { "count": 100, "percent": "50%" },
-            "two_stars":  { "count": 0,   "percent": "0%" },
-            "three_stars":{ "count": 0,   "percent": "0%" },
-            "four_stars": { "count": 0,   "percent": "0%" },
-            "five_stars": { "count": 100, "percent": "50%" },
-            "total":      { "count": 200, "percent": "100%" },
-        });
-        let (avg, total) = compute_rating_stats(&dist);
-        assert_eq!(avg.as_deref(), Some("3.0"));
-        assert_eq!(total, Some(200));
-    }
-
-    #[test]
-    fn parse_og_title_extracts_name_and_rating() {
-        let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">"#;
-        assert_eq!(parse_name_from_og_title(html), Some("Anthropic".into()));
-        let (label, rating) = parse_rating_from_og_title(html);
-        assert_eq!(label.as_deref(), Some("Bad"));
-        assert_eq!(rating.as_deref(), Some("1.5"));
-    }
-
-    #[test]
-    fn parse_review_count_from_og_description_picks_number() {
-        let html = r#"<meta property="og:description" content="Do you agree? Voice your opinion today and hear what 226 customers have already said.">"#;
-        assert_eq!(parse_review_count_from_og_description(html), Some(226));
-    }
-
-    #[test]
-    fn parse_full_fixture_assembles_all_fields() {
-        let html = r##"<html><head>
-<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
-<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">
-<script type="application/ld+json">
-{"@context":"https://schema.org","@graph":[
-  {"@id":"https://www.trustpilot.com/#/schema/Organization/1","@type":"Organization","name":"Trustpilot"}
-]}
-</script>
-<script type="application/ld+json">
-{"@context":["https://schema.org",{"csvw":"http://www.w3.org/ns/csvw#"}],
- "@graph":{"@id":"https://www.trustpilot.com/#/schema/DataSet/anthropic.com/1",
- "@type":"Dataset",
- "about":{"@id":"https://www.trustpilot.com/#/schema/Organization/anthropic.com"},
- "name":"Anthropic",
- "mainEntity":{"@type":"csvw:Table","csvw:tableSchema":{"csvw:columns":[
-   {"csvw:name":"1 star","csvw:cells":[{"csvw:value":"196","csvw:notes":["87%"]}]},
-   {"csvw:name":"2 stars","csvw:cells":[{"csvw:value":"9","csvw:notes":["4%"]}]},
-   {"csvw:name":"3 stars","csvw:cells":[{"csvw:value":"5","csvw:notes":["2%"]}]},
-   {"csvw:name":"4 stars","csvw:cells":[{"csvw:value":"1","csvw:notes":["0%"]}]},
-   {"csvw:name":"5 stars","csvw:cells":[{"csvw:value":"15","csvw:notes":["7%"]}]},
-   {"csvw:name":"Total","csvw:cells":[{"csvw:value":"226","csvw:notes":["100%"]}]}
- ]}}}}
-</script>
-<script type="application/ld+json">
-{"aiSummary":{"modelVersion":"2.0.0","summary":"Mixed reviews."},
- "aiSummaryReviews":[
-  {"id":"abc","rating":1,"title":"Bad","text":"Didn't work.","language":"en",
-   "source":"Organic","likes":2,"consumer":{"displayName":"W.FRH","countryCode":"DE","numberOfReviews":69,"isVerified":false},
-   "dates":{"experiencedDate":"2026-01-05T00:00:00.000Z","publishedDate":"2026-01-05T16:29:31.000Z"}}]}
-</script>
-</head></html>"##;
-        let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
-        assert_eq!(v["domain"], "anthropic.com");
-        assert_eq!(v["business_name"], "Anthropic");
-        assert_eq!(v["rating_label"], "Bad");
-        assert_eq!(v["review_count"], 226);
-        assert_eq!(v["rating_distribution"]["one_star"]["count"], 196);
-        assert_eq!(v["rating_distribution"]["total"]["count"], 226);
-        assert_eq!(v["ai_summary"], "Mixed reviews.");
-        assert_eq!(v["recent_reviews"].as_array().unwrap().len(), 1);
-        assert_eq!(v["recent_reviews"][0]["author"], "W.FRH");
-        assert_eq!(v["recent_reviews"][0]["rating"], 1);
-        assert_eq!(v["recent_reviews"][0]["title"], "Bad");
-    }
-
-    #[test]
-    fn parse_falls_back_to_og_when_no_jsonld() {
-        let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
-<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">"#;
-        let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
-        assert_eq!(v["domain"], "anthropic.com");
-        assert_eq!(v["business_name"], "Anthropic");
-        assert_eq!(v["average_rating"], "1.5");
-        assert_eq!(v["review_count"], 226);
-        assert_eq!(v["rating_label"], "Bad");
-    }
-
-    #[test]
-    fn parse_returns_ok_with_url_domain_when_nothing_else() {
-        let v = parse(
-            "<html><head></head></html>",
-            "https://www.trustpilot.com/review/example.com",
-        )
-        .unwrap();
-        assert_eq!(v["domain"], "example.com");
-        assert_eq!(v["business_name"], "example.com");
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/woocommerce_product.rs b/crates/webclaw-fetch/src/extractors/woocommerce_product.rs
deleted file mode 100644
index db6dd78..0000000
--- a/crates/webclaw-fetch/src/extractors/woocommerce_product.rs
+++ /dev/null
@@ -1,237 +0,0 @@
-//! WooCommerce product structured extractor.
-//!
-//! Targets WooCommerce's Store API: `/wp-json/wc/store/v1/products?slug={slug}`.
-//! About 30-50% of WooCommerce stores expose this endpoint publicly
-//! (it's on by default, but common security plugins disable it).
-//! When it's off, the server returns 404 at /wp-json. We surface a
-//! clean error and point callers at `/v1/scrape/ecommerce_product`
-//! which works on any store with Schema.org JSON-LD.
-//!
-//! Explicit-call only. `/product/{slug}` is the default permalink for
-//! WooCommerce but custom stores use every variation imaginable, so
-//! auto-dispatch is unreliable.
-
-use serde::Deserialize;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "woocommerce_product",
-    label: "WooCommerce product",
-    description: "Returns product via the WooCommerce Store REST API (requires the /wp-json/wc/store endpoint to be enabled on the target store).",
-    url_patterns: &[
-        "https://{shop}/product/{slug}",
-        "https://{shop}/shop/{slug}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    let host = host_of(url);
-    if host.is_empty() {
-        return false;
-    }
-    // Permissive: WooCommerce stores use custom domains + custom
-    // permalinks. The extractor's API probe is what confirms it's
-    // really WooCommerce.
-    url.contains("/product/")
-        || url.contains("/shop/")
-        || url.contains("/producto/") // common es locale
-        || url.contains("/produit/") // common fr locale
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let slug = parse_slug(url).ok_or_else(|| {
-        FetchError::Build(format!(
-            "woocommerce_product: cannot parse slug from '{url}'"
-        ))
-    })?;
-    let host = host_of(url);
-    if host.is_empty() {
-        return Err(FetchError::Build(format!(
-            "woocommerce_product: empty host in '{url}'"
-        )));
-    }
-    let scheme = if url.starts_with("http://") {
-        "http"
-    } else {
-        "https"
-    };
-    let api_url = format!("{scheme}://{host}/wp-json/wc/store/v1/products?slug={slug}&per_page=1");
-    let resp = client.fetch(&api_url).await?;
-    if resp.status == 404 {
-        return Err(FetchError::Build(format!(
-            "woocommerce_product: {host} does not expose /wp-json/wc/store (404). \
-             Use /v1/scrape/ecommerce_product for JSON-LD fallback."
-        )));
-    }
-    if resp.status == 401 || resp.status == 403 {
-        return Err(FetchError::Build(format!(
-            "woocommerce_product: {host} requires auth for /wp-json/wc/store ({}). \
-             Use /v1/scrape/ecommerce_product for the public JSON-LD fallback.",
-            resp.status
-        )));
-    }
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "woocommerce api returned status {} for {api_url}",
-            resp.status
-        )));
-    }
-
-    let products: Vec<Product> = serde_json::from_str(&resp.html)
-        .map_err(|e| FetchError::BodyDecode(format!("woocommerce parse: {e}")))?;
-    let p = products.into_iter().next().ok_or_else(|| {
-        FetchError::Build(format!(
-            "woocommerce_product: no product found for slug '{slug}' on {host}"
-        ))
-    })?;
-
-    let images: Vec<Value> = p
-        .images
-        .iter()
-        .map(|i| json!({"src": i.src, "thumbnail": i.thumbnail, "alt": i.alt}))
-        .collect();
-    let variations_count = p.variations.as_ref().map(|v| v.len()).unwrap_or(0);
-
-    Ok(json!({
-        "url":             url,
-        "api_url":         api_url,
-        "product_id":      p.id,
-        "name":            p.name,
-        "slug":            p.slug,
-        "sku":             p.sku,
-        "permalink":       p.permalink,
-        "on_sale":         p.on_sale,
-        "in_stock":        p.is_in_stock,
-        "is_purchasable":  p.is_purchasable,
-        "price":           p.prices.as_ref().and_then(|pr| pr.price.clone()),
-        "regular_price":   p.prices.as_ref().and_then(|pr| pr.regular_price.clone()),
-        "sale_price":      p.prices.as_ref().and_then(|pr| pr.sale_price.clone()),
-        "currency":        p.prices.as_ref().and_then(|pr| pr.currency_code.clone()),
-        "currency_minor":  p.prices.as_ref().and_then(|pr| pr.currency_minor_unit),
-        "price_range":     p.prices.as_ref().and_then(|pr| pr.price_range.clone()),
-        "average_rating":  p.average_rating,
-        "review_count":    p.review_count,
-        "description":     p.description,
-        "short_description": p.short_description,
-        "categories":      p.categories.iter().filter_map(|c| c.name.clone()).collect::<Vec<_>>(),
-        "tags":            p.tags.iter().filter_map(|t| t.name.clone()).collect::<Vec<_>>(),
-        "variation_count": variations_count,
-        "image_count":     images.len(),
-        "images":          images,
-    }))
-}
-
-// ---------------------------------------------------------------------------
-// URL helpers
-// ---------------------------------------------------------------------------
-
-fn host_of(url: &str) -> &str {
-    url.split("://")
-        .nth(1)
-        .unwrap_or(url)
-        .split('/')
-        .next()
-        .unwrap_or("")
-}
-
-/// Extract the product slug from common WooCommerce permalinks.
-fn parse_slug(url: &str) -> Option<String> {
-    for needle in ["/product/", "/shop/", "/producto/", "/produit/"] {
-        if let Some(after) = url.split(needle).nth(1) {
-            let stripped = after
-                .split(['?', '#'])
-                .next()?
-                .trim_end_matches('/')
-                .split('/')
-                .next()
-                .unwrap_or("");
-            if !stripped.is_empty() {
-                return Some(stripped.to_string());
-            }
-        }
-    }
-    None
-}
-
-// ---------------------------------------------------------------------------
-// Store API types (subset of the full response)
-// ---------------------------------------------------------------------------
-
-#[derive(Deserialize)]
-struct Product {
-    id: Option<i64>,
-    name: Option<String>,
-    slug: Option<String>,
-    sku: Option<String>,
-    permalink: Option<String>,
-    description: Option<String>,
-    short_description: Option<String>,
-    on_sale: Option<bool>,
-    is_in_stock: Option<bool>,
-    is_purchasable: Option<bool>,
-    average_rating: Option<serde_json::Value>, // string or number
-    review_count: Option<i64>,
-    prices: Option<Prices>,
-    #[serde(default)]
-    categories: Vec<Term>,
-    #[serde(default)]
-    tags: Vec<Term>,
-    #[serde(default)]
-    images: Vec<Img>,
-    variations: Option<Vec<serde_json::Value>>,
-}
-
-#[derive(Deserialize)]
-struct Prices {
-    price: Option<String>,
-    regular_price: Option<String>,
-    sale_price: Option<String>,
-    currency_code: Option<String>,
-    currency_minor_unit: Option<i64>,
-    price_range: Option<serde_json::Value>,
-}
-
-#[derive(Deserialize)]
-struct Term {
-    name: Option<String>,
-}
-
-#[derive(Deserialize)]
-struct Img {
-    src: Option<String>,
-    thumbnail: Option<String>,
-    alt: Option<String>,
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_common_permalinks() {
-        assert!(matches("https://shop.example.com/product/cool-widget"));
-        assert!(matches("https://shop.example.com/shop/cool-widget"));
-        assert!(matches("https://tienda.example.com/producto/cosa"));
-        assert!(matches("https://boutique.example.com/produit/chose"));
-    }
-
-    #[test]
-    fn parse_slug_handles_locale_and_suffix() {
-        assert_eq!(
-            parse_slug("https://shop.example.com/product/cool-widget"),
-            Some("cool-widget".into())
-        );
-        assert_eq!(
-            parse_slug("https://shop.example.com/product/cool-widget/?attr=red"),
-            Some("cool-widget".into())
-        );
-        assert_eq!(
-            parse_slug("https://tienda.example.com/producto/cosa/"),
-            Some("cosa".into())
-        );
-    }
-}
diff --git a/crates/webclaw-fetch/src/extractors/youtube_video.rs b/crates/webclaw-fetch/src/extractors/youtube_video.rs
deleted file mode 100644
index 2551ff8..0000000
--- a/crates/webclaw-fetch/src/extractors/youtube_video.rs
+++ /dev/null
@@ -1,378 +0,0 @@
-//! YouTube video structured extractor.
-//!
-//! YouTube embeds the full player configuration in a
-//! `ytInitialPlayerResponse` JavaScript assignment at the top of
-//! every `/watch`, `/shorts`, and `youtu.be` HTML page. We reuse the
-//! core crate's already-proven regex + parse to surface typed JSON
-//! from it: video id, title, author + channel id, view count,
-//! duration, upload date, keywords, thumbnails, caption-track URLs.
-//!
-//! Auto-dispatched: YouTube host is unique and the `v=` or `/shorts/`
-//! shape is stable.
-//!
-//! ## Fallback
-//!
-//! `ytInitialPlayerResponse` is missing on EU-consent interstitials,
-//! some live-stream pre-show pages, and age-gated videos. In those
-//! cases we drop down to OG tags for `title`, `description`,
-//! `thumbnail`, and `channel`, and return a `data_source:
-//! "og_fallback"` payload so the caller can tell they got a degraded
-//! shape (no view count, duration, captions).
-
-use std::sync::OnceLock;
-
-use regex::Regex;
-use serde_json::{Value, json};
-
-use super::ExtractorInfo;
-use crate::error::FetchError;
-use crate::fetcher::Fetcher;
-
-pub const INFO: ExtractorInfo = ExtractorInfo {
-    name: "youtube_video",
-    label: "YouTube video",
-    description: "Returns video id, title, channel, view count, duration, upload date, thumbnails, keywords, and caption-track URLs. Falls back to OG metadata on consent / age-gate pages.",
-    url_patterns: &[
-        "https://www.youtube.com/watch?v={id}",
-        "https://youtu.be/{id}",
-        "https://www.youtube.com/shorts/{id}",
-    ],
-};
-
-pub fn matches(url: &str) -> bool {
-    webclaw_core::youtube::is_youtube_url(url)
-        || url.contains("youtube.com/shorts/")
-        || url.contains("youtube-nocookie.com/embed/")
-}
-
-pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
-    let video_id = parse_video_id(url).ok_or_else(|| {
-        FetchError::Build(format!("youtube_video: cannot parse video id from '{url}'"))
-    })?;
-
-    // Always fetch the canonical /watch URL. /shorts/ and youtu.be
-    // sometimes serve a thinner page without the player blob.
-    let canonical = format!("https://www.youtube.com/watch?v={video_id}");
-    let resp = client.fetch(&canonical).await?;
-    if resp.status != 200 {
-        return Err(FetchError::Build(format!(
-            "youtube returned status {} for {canonical}",
-            resp.status
-        )));
-    }
-
-    if let Some(player) = extract_player_response(&resp.html) {
-        return Ok(build_player_payload(
-            &player, &resp.html, url, &canonical, &video_id,
-        ));
-    }
-
-    // No player blob. Fall back to OG tags so the call still returns
-    // something useful for consent / age-gate pages.
-    Ok(build_og_fallback(&resp.html, url, &canonical, &video_id))
-}
-
-// ---------------------------------------------------------------------------
-// Player-blob path (rich payload)
-// ---------------------------------------------------------------------------
-
-fn build_player_payload(
-    player: &Value,
-    html: &str,
-    url: &str,
-    canonical: &str,
-    video_id: &str,
-) -> Value {
-    let video_details = player.get("videoDetails");
-    let microformat = player
-        .get("microformat")
-        .and_then(|m| m.get("playerMicroformatRenderer"));
-
-    let thumbnails: Vec<Value> = video_details
-        .and_then(|vd| vd.get("thumbnail"))
-        .and_then(|t| t.get("thumbnails"))
-        .and_then(|t| t.as_array())
-        .cloned()
-        .unwrap_or_default();
-
-    let keywords: Vec<Value> = video_details
-        .and_then(|vd| vd.get("keywords"))
-        .and_then(|k| k.as_array())
-        .cloned()
-        .unwrap_or_default();
-
-    let caption_tracks = webclaw_core::youtube::extract_caption_tracks(html);
-    let captions: Vec<Value> = caption_tracks
-        .iter()
-        .map(|c| {
-            json!({
-                "url":  c.url,
-                "lang": c.lang,
-                "name": c.name,
-            })
-        })
-        .collect();
-
-    json!({
-        "url":          url,
-        "canonical_url":canonical,
-        "data_source":  "player_response",
-        "video_id":     video_id,
-        "title":        get_str(video_details, "title"),
-        "description":  get_str(video_details, "shortDescription"),
-        "author":       get_str(video_details, "author"),
-        "channel_id":   get_str(video_details, "channelId"),
-        "channel_url":  get_str(microformat, "ownerProfileUrl"),
-        "view_count":   get_int(video_details, "viewCount"),
-        "length_seconds": get_int(video_details, "lengthSeconds"),
-        "is_live":      video_details.and_then(|vd| vd.get("isLiveContent")).and_then(|v| v.as_bool()),
-        "is_private":   video_details.and_then(|vd| vd.get("isPrivate")).and_then(|v| v.as_bool()),
-        "is_unlisted":  microformat.and_then(|m| m.get("isUnlisted")).and_then(|v| v.as_bool()),
-        "allow_ratings":video_details.and_then(|vd| vd.get("allowRatings")).and_then(|v| v.as_bool()),
-        "category":     get_str(microformat, "category"),
-        "upload_date":  get_str(microformat, "uploadDate"),
-        "publish_date": get_str(microformat, "publishDate"),
-        "keywords":     keywords,
-        "thumbnails":   thumbnails,
-        "caption_tracks": captions,
-    })
-}
-
-// ---------------------------------------------------------------------------
-// OG fallback path (degraded payload)
-// ---------------------------------------------------------------------------
-
-fn build_og_fallback(html: &str, url: &str, canonical: &str, video_id: &str) -> Value {
-    let title = og(html, "title");
-    let description = og(html, "description");
-    let thumbnail = og(html, "image");
-    // YouTube sets `<meta name="channel_name" ...>` on some pages but
-    // OG-only pages reliably carry `og:video:tag` and the channel in
-    // `<link itemprop="name">`. We keep this lean: just what's stable.
-    let channel = meta_name(html, "author");
-
-    json!({
-        "url":          url,
-        "canonical_url":canonical,
-        "data_source":  "og_fallback",
-        "video_id":     video_id,
-        "title":        title,
-        "description":  description,
-        "author":       channel,
-        // OG path: these are null so the caller doesn't have to guess.
-        "channel_id":   None::<String>,
-        "channel_url":  None::<String>,
-        "view_count":   None::<i64>,
-        "length_seconds": None::<i64>,
-        "is_live":      None::<bool>,
-        "is_private":   None::<bool>,
-        "is_unlisted":  None::<bool>,
-        "allow_ratings":None::<bool>,
-        "category":     None::<String>,
-        "upload_date":  None::<String>,
-        "publish_date": None::<String>,
-        "keywords":     Vec::<Value>::new(),
-        "thumbnails":   thumbnail.as_ref().map(|t| vec![json!({"url": t})]).unwrap_or_default(),
-        "caption_tracks": Vec::<Value>::new(),
-    })
-}
-
-// ---------------------------------------------------------------------------
-// URL helpers
-// ---------------------------------------------------------------------------
-
-fn parse_video_id(url: &str) -> Option<String> {
-    // youtu.be/{id}
-    if let Some(after) = url.split("youtu.be/").nth(1) {
-        let id = after
-            .split(['?', '#', '/'])
-            .next()
-            .unwrap_or("")
-            .trim_end_matches('/');
-        if !id.is_empty() {
-            return Some(id.to_string());
-        }
-    }
-    // youtube.com/shorts/{id}
-    if let Some(after) = url.split("youtube.com/shorts/").nth(1) {
-        let id = after
-            .split(['?', '#', '/'])
-            .next()
-            .unwrap_or("")
-            .trim_end_matches('/');
-        if !id.is_empty() {
-            return Some(id.to_string());
-        }
-    }
-    // youtube-nocookie.com/embed/{id}
-    if let Some(after) = url.split("/embed/").nth(1) {
-        let id = after
-            .split(['?', '#', '/'])
-            .next()
-            .unwrap_or("")
-            .trim_end_matches('/');
-        if !id.is_empty() {
-            return Some(id.to_string());
-        }
-    }
-    // youtube.com/watch?v={id} (also matches youtube.com/watch?foo=bar&v={id})
-    if let Some(q) = url.split_once('?').map(|(_, q)| q)
-        && let Some(id) = q
-            .split('&')
-            .find_map(|p| p.strip_prefix("v=").map(|v| v.to_string()))
-    {
-        let id = id.split(['#', '/']).next().unwrap_or(&id).to_string();
-        if !id.is_empty() {
-            return Some(id);
-        }
-    }
-    None
-}
-
-// ---------------------------------------------------------------------------
-// Player-response parsing
-// ---------------------------------------------------------------------------
-
-fn extract_player_response(html: &str) -> Option<Value> {
-    // Same regex as webclaw_core::youtube. Duplicated here because
-    // core's regex is module-private. Kept in lockstep; changes are
-    // rare and we cover with tests in both places.
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE
-        .get_or_init(|| Regex::new(r"var\s+ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;").unwrap());
-    let json_str = re.captures(html)?.get(1)?.as_str();
-    serde_json::from_str(json_str).ok()
-}
-
-// ---------------------------------------------------------------------------
-// Meta-tag helpers (for OG fallback)
-// ---------------------------------------------------------------------------
-
-fn og(html: &str, prop: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
-    });
-    for c in re.captures_iter(html) {
-        if c.get(1).is_some_and(|m| m.as_str() == prop) {
-            return c.get(2).map(|m| m.as_str().to_string());
-        }
-    }
-    None
-}
-
-fn meta_name(html: &str, name: &str) -> Option<String> {
-    static RE: OnceLock<Regex> = OnceLock::new();
-    let re = RE.get_or_init(|| {
-        Regex::new(r#"(?i)<meta[^>]+name="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
-    });
-    for c in re.captures_iter(html) {
-        if c.get(1).is_some_and(|m| m.as_str() == name) {
-            return c.get(2).map(|m| m.as_str().to_string());
-        }
-    }
-    None
-}
-
-fn get_str(v: Option<&Value>, key: &str) -> Option<String> {
-    v.and_then(|x| x.get(key))
-        .and_then(|x| x.as_str().map(String::from))
-}
-
-fn get_int(v: Option<&Value>, key: &str) -> Option<i64> {
-    v.and_then(|x| x.get(key)).and_then(|x| {
-        x.as_i64()
-            .or_else(|| x.as_str().and_then(|s| s.parse::<i64>().ok()))
-    })
-}
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn matches_watch_urls() {
-        assert!(matches("https://www.youtube.com/watch?v=dQw4w9WgXcQ"));
-        assert!(matches("https://youtu.be/dQw4w9WgXcQ"));
-        assert!(matches("https://www.youtube.com/shorts/abc123"));
-        assert!(matches(
-            "https://www.youtube-nocookie.com/embed/dQw4w9WgXcQ"
-        ));
-    }
-
-    #[test]
-    fn rejects_non_video_urls() {
-        assert!(!matches("https://www.youtube.com/"));
-        assert!(!matches("https://www.youtube.com/channel/abc"));
-        assert!(!matches("https://example.com/watch?v=abc"));
-    }
-
-    #[test]
-    fn parse_video_id_from_each_shape() {
-        assert_eq!(
-            parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"),
-            Some("dQw4w9WgXcQ".into())
-        );
-        assert_eq!(
-            parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=10s"),
-            Some("dQw4w9WgXcQ".into())
-        );
-        assert_eq!(
-            parse_video_id("https://www.youtube.com/watch?feature=share&v=dQw4w9WgXcQ"),
-            Some("dQw4w9WgXcQ".into())
-        );
-        assert_eq!(
-            parse_video_id("https://youtu.be/dQw4w9WgXcQ"),
-            Some("dQw4w9WgXcQ".into())
-        );
-        assert_eq!(
-            parse_video_id("https://youtu.be/dQw4w9WgXcQ?t=30"),
-            Some("dQw4w9WgXcQ".into())
-        );
-        assert_eq!(
-            parse_video_id("https://www.youtube.com/shorts/abc123"),
-            Some("abc123".into())
-        );
-    }
-
-    #[test]
-    fn extract_player_response_happy_path() {
-        let html = r#"
-<html><body>
-<script>
-var ytInitialPlayerResponse = {"videoDetails":{"videoId":"abc","title":"T","author":"A","viewCount":"100","lengthSeconds":"60","shortDescription":"d"}};
-</script>
-</body></html>
-"#;
-        let v = extract_player_response(html).unwrap();
-        let vd = v.get("videoDetails").unwrap();
-        assert_eq!(vd.get("title").unwrap().as_str(), Some("T"));
-    }
-
-    #[test]
-    fn og_fallback_extracts_basics_from_meta_tags() {
-        let html = r##"
-<html><head>
-<meta property="og:title" content="Example Video Title">
-<meta property="og:description" content="A cool video description.">
-<meta property="og:image" content="https://i.ytimg.com/vi/abc/maxresdefault.jpg">
-<meta name="author" content="Example Channel">
-</head></html>"##;
-        let v = build_og_fallback(
-            html,
-            "https://www.youtube.com/watch?v=abc",
-            "https://www.youtube.com/watch?v=abc",
-            "abc",
-        );
-        assert_eq!(v["data_source"], "og_fallback");
-        assert_eq!(v["title"], "Example Video Title");
-        assert_eq!(v["description"], "A cool video description.");
-        assert_eq!(v["author"], "Example Channel");
-        assert_eq!(
-            v["thumbnails"][0]["url"],
-            "https://i.ytimg.com/vi/abc/maxresdefault.jpg"
-        );
-        assert!(v["view_count"].is_null());
-        assert!(v["caption_tracks"].as_array().unwrap().is_empty());
-    }
-}
diff --git a/crates/webclaw-fetch/src/fetcher.rs b/crates/webclaw-fetch/src/fetcher.rs
deleted file mode 100644
index fabcf44..0000000
--- a/crates/webclaw-fetch/src/fetcher.rs
+++ /dev/null
@@ -1,118 +0,0 @@
-//! Pluggable fetcher abstraction for vertical extractors.
-//!
-//! Extractors call the network through this trait instead of hard-
-//! coding [`FetchClient`]. The OSS CLI / MCP / self-hosted server all
-//! pass `&FetchClient` (wreq-backed BoringSSL). The production API
-//! server, which must not use in-process TLS fingerprinting, provides
-//! its own implementation that routes through the Go tls-sidecar.
-//!
-//! Both paths expose the same [`FetchResult`] shape and the same
-//! optional cloud-escalation client, so extractor logic stays
-//! identical across environments.
-//!
-//! ## Choosing an implementation
-//!
-//! - CLI, MCP, self-hosted `webclaw-server`: build a [`FetchClient`]
-//!   with [`FetchClient::with_cloud`] to attach cloud fallback, pass
-//!   it to extractors as `&client`.
-//! - `api.webclaw.io` production server: build a `TlsSidecarFetcher`
-//!   (in `server/src/engine/`) that delegates to `engine::tls_client`
-//!   and wraps it in `Arc<dyn Fetcher>` for handler injection.
-//!
-//! ## Why a trait and not a free function
-//!
-//! Extractors need state beyond a single fetch: the cloud client for
-//! antibot escalation, and in the future per-user proxy pools, tenant
-//! headers, circuit breakers. A trait keeps that state encapsulated
-//! behind the fetch interface instead of threading it through every
-//! extractor signature.
-
-use async_trait::async_trait;
-
-use crate::client::FetchResult;
-use crate::cloud::CloudClient;
-use crate::error::FetchError;
-
-/// HTTP fetch surface used by vertical extractors.
-///
-/// Implementations must be `Send + Sync` because extractor dispatchers
-/// run them inside tokio tasks, potentially across many requests.
-#[async_trait]
-pub trait Fetcher: Send + Sync {
-    /// Fetch a URL and return the raw response body + metadata. The
-    /// body is in `FetchResult::html` regardless of the actual content
-    /// type — JSON API endpoints put JSON there, HTML pages put HTML.
-    /// Extractors branch on response status and body shape.
-    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError>;
-
-    /// Fetch with additional request headers. Needed for endpoints
-    /// that authenticate via a specific header (Instagram's
-    /// `x-ig-app-id`, for example). Default implementation routes to
-    /// [`Self::fetch`] so implementers without header support stay
-    /// functional, though the `Option<String>` field they'd set won't
-    /// be populated on the request.
-    async fn fetch_with_headers(
-        &self,
-        url: &str,
-        _headers: &[(&str, &str)],
-    ) -> Result<FetchResult, FetchError> {
-        self.fetch(url).await
-    }
-
-    /// Optional cloud-escalation client for antibot bypass. Returning
-    /// `Some` tells extractors they can call into the hosted API when
-    /// local fetch hits a challenge page. Returning `None` makes
-    /// cloud-gated extractors emit [`CloudError::NotConfigured`] with
-    /// an actionable signup link.
-    ///
-    /// The default implementation returns `None` because not every
-    /// deployment wants cloud fallback (self-hosts that don't have a
-    /// webclaw.io subscription, for instance).
-    ///
-    /// [`CloudError::NotConfigured`]: crate::cloud::CloudError::NotConfigured
-    fn cloud(&self) -> Option<&CloudClient> {
-        None
-    }
-}
-
-// ---------------------------------------------------------------------------
-// Blanket impls: make `&T` and `Arc<T>` behave like the wrapped `T`.
-// ---------------------------------------------------------------------------
-
-#[async_trait]
-impl<T: Fetcher + ?Sized> Fetcher for &T {
-    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
-        (**self).fetch(url).await
-    }
-
-    async fn fetch_with_headers(
-        &self,
-        url: &str,
-        headers: &[(&str, &str)],
-    ) -> Result<FetchResult, FetchError> {
-        (**self).fetch_with_headers(url, headers).await
-    }
-
-    fn cloud(&self) -> Option<&CloudClient> {
-        (**self).cloud()
-    }
-}
-
-#[async_trait]
-impl<T: Fetcher + ?Sized> Fetcher for std::sync::Arc<T> {
-    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
-        (**self).fetch(url).await
-    }
-
-    async fn fetch_with_headers(
-        &self,
-        url: &str,
-        headers: &[(&str, &str)],
-    ) -> Result<FetchResult, FetchError> {
-        (**self).fetch_with_headers(url, headers).await
-    }
-
-    fn cloud(&self) -> Option<&CloudClient> {
-        (**self).cloud()
-    }
-}
diff --git a/crates/webclaw-fetch/src/lib.rs b/crates/webclaw-fetch/src/lib.rs
index 83664a1..517cb6e 100644
--- a/crates/webclaw-fetch/src/lib.rs
+++ b/crates/webclaw-fetch/src/lib.rs
@@ -3,12 +3,9 @@
 //! Automatically detects PDF responses and delegates to webclaw-pdf.
 pub mod browser;
 pub mod client;
-pub mod cloud;
 pub mod crawler;
 pub mod document;
 pub mod error;
-pub mod extractors;
-pub mod fetcher;
 pub mod linkedin;
 pub mod proxy;
 pub mod reddit;
@@ -19,7 +16,6 @@ pub use browser::BrowserProfile;
 pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult};
 pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};
 pub use error::FetchError;
-pub use fetcher::Fetcher;
 pub use http::HeaderMap;
 pub use proxy::{parse_proxy_file, parse_proxy_line};
 pub use sitemap::SitemapEntry;
diff --git a/crates/webclaw-mcp/Cargo.toml b/crates/webclaw-mcp/Cargo.toml
index ec3b2b4..df9dd97 100644
--- a/crates/webclaw-mcp/Cargo.toml
+++ b/crates/webclaw-mcp/Cargo.toml
@@ -22,5 +22,6 @@ serde_json = { workspace = true }
 tokio = { workspace = true }
 tracing = { workspace = true }
 tracing-subscriber = { workspace = true }
+reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
 url = "2"
 dirs = "6.0.0"
diff --git a/crates/webclaw-mcp/src/cloud.rs b/crates/webclaw-mcp/src/cloud.rs
new file mode 100644
index 0000000..ac602e4
--- /dev/null
+++ b/crates/webclaw-mcp/src/cloud.rs
@@ -0,0 +1,302 @@
+//! Cloud API fallback for protected sites.
+//!
+//! When local fetch returns a challenge page, this module retries
+//! via api.webclaw.io. Requires WEBCLAW_API_KEY to be set.
+use std::time::Duration;
+
+use serde_json::{Value, json};
+use tracing::info;
+
+const API_BASE: &str = "https://api.webclaw.io/v1";
+
+/// Lightweight client for the webclaw cloud API.
+pub struct CloudClient {
+    api_key: String,
+    http: reqwest::Client,
+}
+
+impl CloudClient {
+    /// Create a new cloud client from WEBCLAW_API_KEY env var.
+    /// Returns None if the key is not set.
+    pub fn from_env() -> Option<Self> {
+        let key = std::env::var("WEBCLAW_API_KEY").ok()?;
+        if key.is_empty() {
+            return None;
+        }
+        let http = reqwest::Client::builder()
+            .timeout(Duration::from_secs(60))
+            .build()
+            .unwrap_or_default();
+        Some(Self { api_key: key, http })
+    }
+
+    /// Scrape a URL via the cloud API. Returns the response JSON.
+    pub async fn scrape(
+        &self,
+        url: &str,
+        formats: &[&str],
+        include_selectors: &[String],
+        exclude_selectors: &[String],
+        only_main_content: bool,
+    ) -> Result<Value, String> {
+        let mut body = json!({
+            "url": url,
+            "formats": formats,
+        });
+
+        if only_main_content {
+            body["only_main_content"] = json!(true);
+        }
+        if !include_selectors.is_empty() {
+            body["include_selectors"] = json!(include_selectors);
+        }
+        if !exclude_selectors.is_empty() {
+            body["exclude_selectors"] = json!(exclude_selectors);
+        }
+
+        self.post("scrape", body).await
+    }
+
+    /// Generic POST to the cloud API.
+    pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> {
+        let resp = self
+            .http
+            .post(format!("{API_BASE}/{endpoint}"))
+            .header("Authorization", format!("Bearer {}", self.api_key))
+            .json(&body)
+            .send()
+            .await
+            .map_err(|e| format!("Cloud API request failed: {e}"))?;
+
+        let status = resp.status();
+        if !status.is_success() {
+            let text = resp.text().await.unwrap_or_default();
+            let truncated = truncate_error(&text);
+            return Err(format!("Cloud API error {status}: {truncated}"));
+        }
+
+        resp.json::<Value>()
+            .await
+            .map_err(|e| format!("Cloud API response parse failed: {e}"))
+    }
+
+    /// Generic GET from the cloud API.
+    pub async fn get(&self, endpoint: &str) -> Result<Value, String> {
+        let resp = self
+            .http
+            .get(format!("{API_BASE}/{endpoint}"))
+            .header("Authorization", format!("Bearer {}", self.api_key))
+            .send()
+            .await
+            .map_err(|e| format!("Cloud API request failed: {e}"))?;
+
+        let status = resp.status();
+        if !status.is_success() {
+            let text = resp.text().await.unwrap_or_default();
+            let truncated = truncate_error(&text);
+            return Err(format!("Cloud API error {status}: {truncated}"));
+        }
+
+        resp.json::<Value>()
+            .await
+            .map_err(|e| format!("Cloud API response parse failed: {e}"))
+    }
+}
+
+/// Truncate error body to avoid flooding logs with huge HTML responses.
+fn truncate_error(text: &str) -> &str {
+    const MAX_LEN: usize = 500;
+    match text.char_indices().nth(MAX_LEN) {
+        Some((byte_pos, _)) => &text[..byte_pos],
+        None => text,
+    }
+}
+
+/// Check if fetched HTML looks like a bot protection challenge page.
+/// Matches Cloudflare, DataDome, AWS WAF, and hCaptcha markers.
+pub fn is_bot_protected(html: &str, headers: &webclaw_fetch::HeaderMap) -> bool {
+    let html_lower = html.to_lowercase();
+
+    // Cloudflare challenge page
+    if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
+        return true;
+    }
+
+    // Cloudflare "checking your browser" spinner
+    if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
+        && html_lower.contains("cf-spinner")
+    {
+        return true;
+    }
+
+    // Cloudflare Turnstile: only short pages count as challenges; real content may embed the widget
+    if (html_lower.contains("cf-turnstile")
+        || html_lower.contains("challenges.cloudflare.com/turnstile"))
+        && html.len() < 100_000
+    {
+        return true;
+    }
+
+    // DataDome
+    if html_lower.contains("geo.captcha-delivery.com")
+        || html_lower.contains("captcha-delivery.com/captcha")
+    {
+        return true;
+    }
+
+    // AWS WAF
+    if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
+        return true;
+    }
+
+    // hCaptcha blocking page
+    if html_lower.contains("hcaptcha.com")
+        && html_lower.contains("h-captcha")
+        && html.len() < 50_000
+    {
+        return true;
+    }
+
+    // Cloudflare via headers + challenge body
+    let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some();
+    if has_cf_headers
+        && (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
+    {
+        return true;
+    }
+
+    false
+}
+
+/// Check if a page likely needs JS rendering (SPA with almost no text content).
+pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
+    let has_scripts = html.contains("<script");
+
+    // Tier 1: almost no extractable text from a large page
+    if word_count < 50 && html.len() > 5_000 && has_scripts {
+        return true;
+    }
+
+    // Tier 2: SPA framework detected with suspiciously low content-to-HTML ratio
+    if word_count < 800 && html.len() > 50_000 && has_scripts {
+        let html_lower = html.to_lowercase();
+        let has_spa_marker = html_lower.contains("react-app")
+            || html_lower.contains("id=\"__next\"")
+            || html_lower.contains("id=\"root\"")
+            || html_lower.contains("id=\"app\"")
+            || html_lower.contains("__next_data__")
+            || html_lower.contains("nuxt")
+            || html_lower.contains("ng-app");
+
+        if has_spa_marker {
+            return true;
+        }
+    }
+
+    false
+}
+
+/// Result of a smart fetch: either local extraction or cloud API response.
+pub enum SmartFetchResult {
+    /// Successfully extracted locally.
+    Local(Box<webclaw_core::ExtractionResult>),
+    /// Fell back to cloud API. Contains the API response JSON.
+    Cloud(Value),
+}
+
+/// Try local fetch first, fall back to cloud API if bot-protected or JS-rendered.
+///
+/// Returns the extraction result (local) or the cloud API response JSON.
+/// If no API key is configured and local fetch is blocked, returns an error
+/// with a helpful message.
+pub async fn smart_fetch(
+    client: &webclaw_fetch::FetchClient,
+    cloud: Option<&CloudClient>,
+    url: &str,
+    include_selectors: &[String],
+    exclude_selectors: &[String],
+    only_main_content: bool,
+    formats: &[&str],
+) -> Result<SmartFetchResult, String> {
+    // Step 1: Try local fetch (with timeout to avoid hanging on slow servers)
+    let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url))
+        .await
+        .map_err(|_| format!("Fetch timed out after 30s for {url}"))?
+        .map_err(|e| format!("Fetch failed: {e}"))?;
+
+    // Step 2: Check for bot protection
+    if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
+        info!(url, "bot protection detected, falling back to cloud API");
+        return cloud_fallback(
+            cloud,
+            url,
+            include_selectors,
+            exclude_selectors,
+            only_main_content,
+            formats,
+        )
+        .await;
+    }
+
+    // Step 3: Extract locally
+    let options = webclaw_core::ExtractionOptions {
+        include_selectors: include_selectors.to_vec(),
+        exclude_selectors: exclude_selectors.to_vec(),
+        only_main_content,
+        include_raw_html: false,
+    };
+
+    let extraction =
+        webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
+            .map_err(|e| format!("Extraction failed: {e}"))?;
+
+    // Step 4: Check for JS-rendered pages (low content from large HTML)
+    if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
+        info!(
+            url,
+            word_count = extraction.metadata.word_count,
+            html_len = fetch_result.html.len(),
+            "JS-rendered page detected, falling back to cloud API"
+        );
+        return cloud_fallback(
+            cloud,
+            url,
+            include_selectors,
+            exclude_selectors,
+            only_main_content,
+            formats,
+        )
+        .await;
+    }
+
+    Ok(SmartFetchResult::Local(Box::new(extraction)))
+}
+
+async fn cloud_fallback(
+    cloud: Option<&CloudClient>,
+    url: &str,
+    include_selectors: &[String],
+    exclude_selectors: &[String],
+    only_main_content: bool,
+    formats: &[&str],
+) -> Result<SmartFetchResult, String> {
+    match cloud {
+        Some(c) => {
+            let resp = c
+                .scrape(
+                    url,
+                    formats,
+                    include_selectors,
+                    exclude_selectors,
+                    only_main_content,
+                )
+                .await?;
+            info!(url, "cloud API fallback successful");
+            Ok(SmartFetchResult::Cloud(resp))
+        }
+        None => Err(format!(
+            "Local fetch of {url} was blocked or needs JS rendering. Set WEBCLAW_API_KEY for \
+             automatic cloud fallback. Get a key at https://webclaw.io"
+        )),
+    }
+}
diff --git a/crates/webclaw-mcp/src/main.rs b/crates/webclaw-mcp/src/main.rs
index 89a4755..8576562 100644
--- a/crates/webclaw-mcp/src/main.rs
+++ b/crates/webclaw-mcp/src/main.rs
@@ -1,6 +1,7 @@
 /// webclaw-mcp: MCP (Model Context Protocol) server for webclaw.
 /// Exposes web extraction tools over stdio transport for AI agents
 /// like Claude Desktop, Claude Code, and other MCP clients.
+mod cloud;
 mod server;
 mod tools;
 
diff --git a/crates/webclaw-mcp/src/server.rs b/crates/webclaw-mcp/src/server.rs
index a4af79d..f00eae7 100644
--- a/crates/webclaw-mcp/src/server.rs
+++ b/crates/webclaw-mcp/src/server.rs
@@ -15,8 +15,7 @@ use serde_json::json;
 use tracing::{error, info, warn};
 use url::Url;
 
-use webclaw_fetch::cloud::{self, CloudClient, SmartFetchResult};
-
+use crate::cloud::{self, CloudClient, SmartFetchResult};
 use crate::tools::*;
 
 pub struct WebclawMcp {
@@ -718,50 +717,6 @@ impl WebclawMcp {
             Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
         }
     }
-
-    /// List every vertical extractor the server knows about. Returns a
-    /// JSON array of `{name, label, description, url_patterns}` entries.
-    /// Call this to discover what verticals are available before using
-    /// `vertical_scrape`.
-    #[tool]
-    async fn list_extractors(
-        &self,
-        Parameters(_params): Parameters<ListExtractorsParams>,
-    ) -> Result<String, String> {
-        let catalog = webclaw_fetch::extractors::list();
-        serde_json::to_string_pretty(&catalog)
-            .map_err(|e| format!("failed to serialise extractor catalog: {e}"))
-    }
-
-    /// Run a vertical extractor by name and return typed JSON specific
-    /// to the target site (title, price, rating, author, etc.), not
-    /// generic markdown. Use `list_extractors` to discover available
-    /// names. Example names: `reddit`, `github_repo`, `trustpilot_reviews`,
-    /// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`.
-    ///
-    /// Antibot-gated verticals (amazon_product, ebay_listing,
-    /// etsy_listing, trustpilot_reviews) will automatically escalate to
-    /// the webclaw cloud API when local fetch hits bot protection,
-    /// provided `WEBCLAW_API_KEY` is set.
-    #[tool]
-    async fn vertical_scrape(
-        &self,
-        Parameters(params): Parameters<VerticalParams>,
-    ) -> Result<String, String> {
-        validate_url(&params.url)?;
-        // Reuse the long-lived default FetchClient. Extractors accept
-        // `&dyn Fetcher`; FetchClient implements the trait so this just
-        // works (see webclaw_fetch::Fetcher and client::FetchClient).
-        let data = webclaw_fetch::extractors::dispatch_by_name(
-            self.fetch_client.as_ref(),
-            &params.name,
-            &params.url,
-        )
-        .await
-        .map_err(|e| e.to_string())?;
-        serde_json::to_string_pretty(&data)
-            .map_err(|e| format!("failed to serialise extractor output: {e}"))
-    }
 }
 
 #[tool_handler]
@@ -771,8 +726,7 @@ impl ServerHandler for WebclawMcp {
             .with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION")))
             .with_instructions(String::from(
                 "Webclaw MCP server -- web content extraction for AI agents. \
-                 Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \
-                 list_extractors, vertical_scrape.",
+                 Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.",
             ))
     }
 }
diff --git a/crates/webclaw-mcp/src/tools.rs b/crates/webclaw-mcp/src/tools.rs
index 02bf534..e0195f1 100644
--- a/crates/webclaw-mcp/src/tools.rs
+++ b/crates/webclaw-mcp/src/tools.rs
@@ -103,20 +103,3 @@ pub struct SearchParams {
     /// Number of results to return (default: 10)
     pub num_results: Option<u32>,
 }
-
-/// Parameters for `vertical_scrape`: run a site-specific extractor by name.
-#[derive(Debug, Deserialize, JsonSchema)]
-pub struct VerticalParams {
-    /// Name of the vertical extractor. Call `list_extractors` to see all
-    /// available names. Examples: "reddit", "github_repo", "pypi",
-    /// "trustpilot_reviews", "youtube_video", "shopify_product".
-    pub name: String,
-    /// URL to extract. Must match the URL patterns the extractor claims;
-    /// otherwise the tool returns a clear "URL mismatch" error.
-    pub url: String,
-}
-
-/// `list_extractors` takes no arguments but we still need an empty struct
-/// so rmcp can generate a schema and parse the (empty) JSON-RPC params.
-#[derive(Debug, Deserialize, JsonSchema)]
-pub struct ListExtractorsParams {}
diff --git a/crates/webclaw-server/src/main.rs b/crates/webclaw-server/src/main.rs
index f4cfdcb..c57fed8 100644
--- a/crates/webclaw-server/src/main.rs
+++ b/crates/webclaw-server/src/main.rs
@@ -79,15 +79,10 @@ async fn main() -> anyhow::Result<()> {
 
     let v1 = Router::new()
         .route("/scrape", post(routes::scrape::scrape))
-        .route(
-            "/scrape/{vertical}",
-            post(routes::structured::scrape_vertical),
-        )
         .route("/crawl", post(routes::crawl::crawl))
         .route("/map", post(routes::map::map))
         .route("/batch", post(routes::batch::batch))
         .route("/extract", post(routes::extract::extract))
-        .route("/extractors", get(routes::structured::list_extractors))
         .route("/summarize", post(routes::summarize::summarize_route))
         .route("/diff", post(routes::diff::diff_route))
         .route("/brand", post(routes::brand::brand))
diff --git a/crates/webclaw-server/src/routes/mod.rs b/crates/webclaw-server/src/routes/mod.rs
index 01f1052..7c3d68e 100644
--- a/crates/webclaw-server/src/routes/mod.rs
+++ b/crates/webclaw-server/src/routes/mod.rs
@@ -15,5 +15,4 @@ pub mod extract;
 pub mod health;
 pub mod map;
 pub mod scrape;
-pub mod structured;
 pub mod summarize;
diff --git a/crates/webclaw-server/src/routes/structured.rs b/crates/webclaw-server/src/routes/structured.rs
deleted file mode 100644
index c9cdc1a..0000000
--- a/crates/webclaw-server/src/routes/structured.rs
+++ /dev/null
@@ -1,55 +0,0 @@
-//! `POST /v1/scrape/{vertical}` and `GET /v1/extractors`.
-//!
-//! Vertical extractors return typed JSON instead of generic markdown.
-//! See `webclaw_fetch::extractors` for the catalog and per-site logic.
-
-use axum::{
-    Json,
-    extract::{Path, State},
-};
-use serde::Deserialize;
-use serde_json::{Value, json};
-use webclaw_fetch::extractors::{self, ExtractorDispatchError};
-
-use crate::{error::ApiError, state::AppState};
-
-#[derive(Debug, Deserialize)]
-pub struct ScrapeRequest {
-    pub url: String,
-}
-
-/// Map dispatcher errors to ApiError so users get clean HTTP statuses
-/// instead of opaque 500s.
-impl From<ExtractorDispatchError> for ApiError {
-    fn from(e: ExtractorDispatchError) -> Self {
-        match e {
-            ExtractorDispatchError::UnknownVertical(_) => ApiError::NotFound,
-            ExtractorDispatchError::UrlMismatch { .. } => ApiError::bad_request(e.to_string()),
-            ExtractorDispatchError::Fetch(f) => ApiError::Fetch(f.to_string()),
-        }
-    }
-}
-
-/// `GET /v1/extractors` — catalog of all available verticals.
-pub async fn list_extractors() -> Json<Value> {
-    Json(json!({
-        "extractors": extractors::list(),
-    }))
-}
-
-/// `POST /v1/scrape/{vertical}` — explicit vertical, e.g. /v1/scrape/reddit.
-pub async fn scrape_vertical(
-    State(state): State<AppState>,
-    Path(vertical): Path<String>,
-    Json(req): Json<ScrapeRequest>,
-) -> Result<Json<Value>, ApiError> {
-    if req.url.trim().is_empty() {
-        return Err(ApiError::bad_request("`url` is required"));
-    }
-    let data = extractors::dispatch_by_name(state.fetch(), &vertical, &req.url).await?;
-    Ok(Json(json!({
-        "vertical": vertical,
-        "url": req.url,
-        "data": data,
-    })))
-}
diff --git a/crates/webclaw-server/src/state.rs b/crates/webclaw-server/src/state.rs
index 6c2e8f7..b3f9b6b 100644
--- a/crates/webclaw-server/src/state.rs
+++ b/crates/webclaw-server/src/state.rs
@@ -1,24 +1,7 @@
 //! Shared application state. Cheap to clone via Arc; held by the axum
 //! Router for the life of the process.
-//!
-//! Two unrelated keys get carried here:
-//!
-//! 1. [`AppState::api_key`] — the **bearer token clients must present**
-//!    to call this server. Set via `WEBCLAW_API_KEY` / `--api-key`.
-//!    Unset = open mode.
-//! 2. The inner [`webclaw_fetch::cloud::CloudClient`] (if any) — our
-//!    **outbound** credential for api.webclaw.io, used by extractors
-//!    that escalate on antibot. Set via `WEBCLAW_CLOUD_API_KEY`.
-//!    Unset = hard-site extractors return a "set WEBCLAW_CLOUD_API_KEY"
-//!    error with a signup link.
-//!
-//! Different variables on purpose: conflating the two means operators
-//! who want their server behind an auth token can't also enable cloud
-//! fallback, and vice versa.
 
 use std::sync::Arc;
-use tracing::info;
-use webclaw_fetch::cloud::CloudClient;
 use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig};
 
 /// Single-process state shared across all request handlers.
@@ -34,7 +17,6 @@ struct Inner {
     /// auto-deref `&Arc<FetchClient>` -> `&FetchClient`, so this costs
     /// them nothing.
     pub fetch: Arc<FetchClient>,
-    /// Inbound bearer-auth token for this server's own `/v1/*` surface.
     pub api_key: Option<String>,
 }
 
@@ -42,34 +24,17 @@ impl AppState {
     /// Build the application state. The fetch client is constructed once
     /// and shared across requests so connection pools + browser profile
     /// state don't churn per request.
-    ///
-    /// `inbound_api_key` is the bearer token clients must present;
-    /// cloud-fallback credentials come from the env (checked here).
-    pub fn new(inbound_api_key: Option<String>) -> anyhow::Result<Self> {
+    pub fn new(api_key: Option<String>) -> anyhow::Result<Self> {
         let config = FetchConfig {
-            browser: BrowserProfile::Firefox,
+            browser: BrowserProfile::Chrome,
             ..FetchConfig::default()
         };
-        let mut fetch = FetchClient::new(config)
+        let fetch = FetchClient::new(config)
             .map_err(|e| anyhow::anyhow!("failed to build fetch client: {e}"))?;
-
-        // Cloud fallback: only activates when the operator has provided
-        // an api.webclaw.io key. Supports both WEBCLAW_CLOUD_API_KEY
-        // (preferred, disambiguates from the inbound-auth key) and
-        // WEBCLAW_API_KEY as a fallback when there's no inbound key
-        // configured (backwards compat with MCP / CLI conventions).
-        if let Some(cloud) = build_cloud_client(inbound_api_key.as_deref()) {
-            info!(
-                base = cloud.base_url(),
-                "cloud fallback enabled — antibot-protected sites will escalate via api.webclaw.io"
-            );
-            fetch = fetch.with_cloud(cloud);
-        }
-
         Ok(Self {
             inner: Arc::new(Inner {
                 fetch: Arc::new(fetch),
-                api_key: inbound_api_key,
+                api_key,
             }),
         })
     }
@@ -82,26 +47,3 @@ impl AppState {
         self.inner.api_key.as_deref()
     }
 }
-
-/// Resolve the outbound cloud key. Prefers `WEBCLAW_CLOUD_API_KEY`;
-/// falls back to `WEBCLAW_API_KEY` *only* when no inbound key is
-/// configured (i.e. open mode — the same env var can't mean two
-/// things to one process).
-fn build_cloud_client(inbound_api_key: Option<&str>) -> Option<CloudClient> {
-    let cloud_key = std::env::var("WEBCLAW_CLOUD_API_KEY").ok();
-    if let Some(k) = cloud_key.as_deref()
-        && !k.trim().is_empty()
-    {
-        return Some(CloudClient::with_key(k));
-    }
-    // Reuse WEBCLAW_API_KEY only when not also acting as our own
-    // inbound-auth token — otherwise we'd be telling the operator
-    // they can't have both.
-    if inbound_api_key.is_none()
-        && let Ok(k) = std::env::var("WEBCLAW_API_KEY")
-        && !k.trim().is_empty()
-    {
-        return Some(CloudClient::with_key(k));
-    }
-    None
-}