From 2ba682adf353d06b2707b66f15944340fdf76256 Mon Sep 17 00:00:00 2001 From: Valerio Date: Wed, 22 Apr 2026 12:25:11 +0200 Subject: [PATCH 01/30] feat(server): add OSS webclaw-server REST API binary (closes #29) Self-hosters hitting docs/self-hosting were promised three binaries but the OSS Docker image only shipped two. webclaw-server lived in the closed-source hosted-platform repo, which couldn't be opened. This adds a minimal axum REST API in the OSS repo so self-hosting actually works without pretending to ship the cloud platform. Crate at crates/webclaw-server/. Stateless, no database, no job queue, single binary. Endpoints: GET /health, POST /v1/{scrape, crawl, map, batch, extract, summarize, diff, brand}. JSON shapes mirror api.webclaw.io for the endpoints OSS can support, so swapping between self-hosted and hosted is a base-URL change. Auth: optional bearer token via WEBCLAW_API_KEY / --api-key. Comparison is constant-time (subtle::ConstantTimeEq). Open mode (no key) is allowed and binds 127.0.0.1 by default; the Docker image flips WEBCLAW_HOST=0.0.0.0 so the container is reachable out of the box. Hard caps to keep naive callers from OOMing the process: crawl capped at 500 pages synchronously, batch capped at 100 URLs / 20 concurrent. For unbounded crawls or anti-bot bypass the docs point users at the hosted API. Dockerfile + Dockerfile.ci updated to copy webclaw-server into /usr/local/bin and EXPOSE 3000. Workspace version bumped to 0.4.0 (new public binary). 
--- CLAUDE.md | 15 +- Cargo.lock | 130 +++++++++++++++++- Cargo.toml | 2 +- Dockerfile | 28 +++- Dockerfile.ci | 9 ++ crates/webclaw-server/Cargo.toml | 29 ++++ crates/webclaw-server/src/auth.rs | 48 +++++++ crates/webclaw-server/src/error.rs | 87 ++++++++++++ crates/webclaw-server/src/main.rs | 118 ++++++++++++++++ crates/webclaw-server/src/routes/batch.rs | 85 ++++++++++++ crates/webclaw-server/src/routes/brand.rs | 32 +++++ crates/webclaw-server/src/routes/crawl.rs | 85 ++++++++++++ crates/webclaw-server/src/routes/diff.rs | 92 +++++++++++++ crates/webclaw-server/src/routes/extract.rs | 81 +++++++++++ crates/webclaw-server/src/routes/health.rs | 10 ++ crates/webclaw-server/src/routes/map.rs | 49 +++++++ crates/webclaw-server/src/routes/mod.rs | 18 +++ crates/webclaw-server/src/routes/scrape.rs | 108 +++++++++++++++ crates/webclaw-server/src/routes/summarize.rs | 52 +++++++ crates/webclaw-server/src/state.rs | 49 +++++++ 20 files changed, 1116 insertions(+), 11 deletions(-) create mode 100644 crates/webclaw-server/Cargo.toml create mode 100644 crates/webclaw-server/src/auth.rs create mode 100644 crates/webclaw-server/src/error.rs create mode 100644 crates/webclaw-server/src/main.rs create mode 100644 crates/webclaw-server/src/routes/batch.rs create mode 100644 crates/webclaw-server/src/routes/brand.rs create mode 100644 crates/webclaw-server/src/routes/crawl.rs create mode 100644 crates/webclaw-server/src/routes/diff.rs create mode 100644 crates/webclaw-server/src/routes/extract.rs create mode 100644 crates/webclaw-server/src/routes/health.rs create mode 100644 crates/webclaw-server/src/routes/map.rs create mode 100644 crates/webclaw-server/src/routes/mod.rs create mode 100644 crates/webclaw-server/src/routes/scrape.rs create mode 100644 crates/webclaw-server/src/routes/summarize.rs create mode 100644 crates/webclaw-server/src/state.rs diff --git a/CLAUDE.md b/CLAUDE.md index ad15cf1..eac2f9f 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -20,9 +20,11 @@ webclaw/ 
webclaw-pdf/ # PDF text extraction via pdf-extract webclaw-mcp/ # MCP server (Model Context Protocol) for AI agents webclaw-cli/ # CLI binary + webclaw-server/ # Minimal axum REST API (self-hosting; OSS counterpart + # of api.webclaw.io, without anti-bot / JS / jobs / auth) ``` -Two binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server). +Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (REST API for self-hosting). ### Core Modules (`webclaw-core`) - `extractor.rs` — Readability-style scoring: text density, semantic tags, link density penalty @@ -60,6 +62,17 @@ Two binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server). - Works with Claude Desktop, Claude Code, and any MCP client - Uses `rmcp` crate (official Rust MCP SDK) +### REST API Server (`webclaw-server`) +- Axum 0.8, stateless, no database, no job queue +- 8 POST routes + /health, JSON shapes mirror api.webclaw.io where the + capability exists in OSS +- Constant-time bearer-token auth via `subtle::ConstantTimeEq` when + `--api-key` / `WEBCLAW_API_KEY` is set; otherwise open mode +- Hard caps: crawl ≤ 500 pages, batch ≤ 100 URLs, 20 concurrent +- Does NOT include: anti-bot bypass, JS rendering, async jobs, + multi-tenant auth, billing, proxy rotation, search/research/watch/ + agent-scrape. Those live behind api.webclaw.io and are closed-source. + ## Hard Rules - **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible. 
diff --git a/Cargo.lock b/Cargo.lock index e5c30e7..0f5fc5c 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -182,6 +182,70 @@ version = "1.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8" +[[package]] +name = "axum" +version = "0.8.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "31b698c5f9a010f6573133b09e0de5408834d0c82f8d7475a89fc1867a71cd90" +dependencies = [ + "axum-core", + "axum-macros", + "bytes", + "form_urlencoded", + "futures-util", + "http", + "http-body", + "http-body-util", + "hyper", + "hyper-util", + "itoa", + "matchit", + "memchr", + "mime", + "percent-encoding", + "pin-project-lite", + "serde_core", + "serde_json", + "serde_path_to_error", + "serde_urlencoded", + "sync_wrapper", + "tokio", + "tower", + "tower-layer", + "tower-service", + "tracing", +] + +[[package]] +name = "axum-core" +version = "0.5.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "08c78f31d7b1291f7ee735c1c6780ccde7785daae9a9206026862dab7d8792d1" +dependencies = [ + "bytes", + "futures-core", + "http", + "http-body", + "http-body-util", + "mime", + "pin-project-lite", + "sync_wrapper", + "tower-layer", + "tower-service", + "tracing", +] + +[[package]] +name = "axum-macros" +version = "0.5.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7aa268c23bfbbd2c4363b9cd302a4f504fb2a9dfe7e3451d66f35dd392e20aca" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + [[package]] name = "base64" version = "0.22.1" @@ -1132,6 +1196,12 @@ version = "1.10.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6dbf3de79e51f3d586ab4cb9d5c3e2c14aa28ed23d180cf89b4df0454a69cc87" +[[package]] +name = "httpdate" +version = "1.0.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "df3b46402a9d5adb4c86a0cf463f42e19994e3ee891101b1841f30a545cb49a9" + 
[[package]] name = "hyper" version = "1.9.0" @@ -1145,6 +1215,7 @@ dependencies = [ "http", "http-body", "httparse", + "httpdate", "itoa", "pin-project-lite", "smallvec", @@ -1559,6 +1630,12 @@ dependencies = [ "regex-automata", ] +[[package]] +name = "matchit" +version = "0.8.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "47e1ffaa40ddd1f3ed91f717a33c8c0ee23fff369e3aa8772b9605cc1d22f4c3" + [[package]] name = "md-5" version = "0.10.6" @@ -1575,6 +1652,12 @@ version = "2.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f8ca58f447f06ed17d5fc4043ce1b10dd205e060fb3ce5b979b8ed8e59ff3f79" +[[package]] +name = "mime" +version = "0.3.17" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6877bb514081ee2a7ff5ef9de3281f14a4dd4bceac4c09388074a6b5df8a139a" + [[package]] name = "minimal-lexical" version = "0.2.1" @@ -2403,6 +2486,17 @@ dependencies = [ "zmij", ] +[[package]] +name = "serde_path_to_error" +version = "0.1.20" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "10a9ff822e371bb5403e391ecd83e182e0e77ba7f6fe0160b795797109d1b457" +dependencies = [ + "itoa", + "serde", + "serde_core", +] + [[package]] name = "serde_urlencoded" version = "0.7.1" @@ -2757,6 +2851,7 @@ dependencies = [ "tokio", "tower-layer", "tower-service", + "tracing", ] [[package]] @@ -2780,6 +2875,7 @@ dependencies = [ "tower", "tower-layer", "tower-service", + "tracing", ] [[package]] @@ -2800,6 +2896,7 @@ version = "0.1.44" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "63e71662fa4b2a2c3a26f570f037eb95bb1f85397f3cd8076caed2f026a6d100" dependencies = [ + "log", "pin-project-lite", "tracing-attributes", "tracing-core", @@ -3102,7 +3199,7 @@ dependencies = [ [[package]] name = "webclaw-cli" -version = "0.3.19" +version = "0.4.0" dependencies = [ "clap", "dotenvy", @@ -3123,7 +3220,7 @@ dependencies = [ [[package]] name = "webclaw-core" -version = 
"0.3.19" +version = "0.4.0" dependencies = [ "ego-tree", "once_cell", @@ -3141,7 +3238,7 @@ dependencies = [ [[package]] name = "webclaw-fetch" -version = "0.3.19" +version = "0.4.0" dependencies = [ "bytes", "calamine", @@ -3163,7 +3260,7 @@ dependencies = [ [[package]] name = "webclaw-llm" -version = "0.3.19" +version = "0.4.0" dependencies = [ "async-trait", "reqwest", @@ -3176,7 +3273,7 @@ dependencies = [ [[package]] name = "webclaw-mcp" -version = "0.3.19" +version = "0.4.0" dependencies = [ "dirs", "dotenvy", @@ -3197,13 +3294,34 @@ dependencies = [ [[package]] name = "webclaw-pdf" -version = "0.3.19" +version = "0.4.0" dependencies = [ "pdf-extract", "thiserror", "tracing", ] +[[package]] +name = "webclaw-server" +version = "0.4.0" +dependencies = [ + "anyhow", + "axum", + "clap", + "serde", + "serde_json", + "subtle", + "thiserror", + "tokio", + "tower-http", + "tracing", + "tracing-subscriber", + "webclaw-core", + "webclaw-fetch", + "webclaw-llm", + "webclaw-pdf", +] + [[package]] name = "webpki-root-certs" version = "1.0.6" diff --git a/Cargo.toml b/Cargo.toml index 41e78ac..e17d843 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -3,7 +3,7 @@ resolver = "2" members = ["crates/*"] [workspace.package] -version = "0.3.19" +version = "0.4.0" edition = "2024" license = "AGPL-3.0" repository = "https://github.com/0xMassi/webclaw" diff --git a/Dockerfile b/Dockerfile index 36fa67f..6f84e06 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,5 +1,12 @@ # webclaw — Multi-stage Docker build -# Produces 2 binaries: webclaw (CLI) and webclaw-mcp (MCP server) +# Produces 3 binaries: +# webclaw — CLI (single-shot extraction, crawl, MCP-less use) +# webclaw-mcp — MCP server (stdio, for AI agents) +# webclaw-server — minimal REST API for self-hosting (OSS, stateless) +# +# NOTE: this is NOT the hosted API at api.webclaw.io — the cloud service +# adds anti-bot bypass, JS rendering, multi-tenant auth and async jobs +# that are intentionally not open-source. See docs/self-hosting. 
# --------------------------------------------------------------------------- # Stage 1: Build all binaries in release mode @@ -25,6 +32,7 @@ COPY crates/webclaw-llm/Cargo.toml crates/webclaw-llm/Cargo.toml COPY crates/webclaw-pdf/Cargo.toml crates/webclaw-pdf/Cargo.toml COPY crates/webclaw-mcp/Cargo.toml crates/webclaw-mcp/Cargo.toml COPY crates/webclaw-cli/Cargo.toml crates/webclaw-cli/Cargo.toml +COPY crates/webclaw-server/Cargo.toml crates/webclaw-server/Cargo.toml # Copy .cargo config if present (optional build flags) COPY .cargo .cargo @@ -35,7 +43,8 @@ RUN mkdir -p crates/webclaw-core/src && echo "" > crates/webclaw-core/src/lib.rs && mkdir -p crates/webclaw-llm/src && echo "" > crates/webclaw-llm/src/lib.rs \ && mkdir -p crates/webclaw-pdf/src && echo "" > crates/webclaw-pdf/src/lib.rs \ && mkdir -p crates/webclaw-mcp/src && echo "fn main() {}" > crates/webclaw-mcp/src/main.rs \ - && mkdir -p crates/webclaw-cli/src && echo "fn main() {}" > crates/webclaw-cli/src/main.rs + && mkdir -p crates/webclaw-cli/src && echo "fn main() {}" > crates/webclaw-cli/src/main.rs \ + && mkdir -p crates/webclaw-server/src && echo "fn main() {}" > crates/webclaw-server/src/main.rs # Pre-build dependencies (this layer is cached until Cargo.toml/lock changes) RUN cargo build --release 2>/dev/null || true @@ -54,9 +63,22 @@ RUN apt-get update && apt-get install -y --no-install-recommends \ ca-certificates \ && rm -rf /var/lib/apt/lists/* -# Copy both binaries +# Copy all three binaries COPY --from=builder /build/target/release/webclaw /usr/local/bin/webclaw COPY --from=builder /build/target/release/webclaw-mcp /usr/local/bin/webclaw-mcp +COPY --from=builder /build/target/release/webclaw-server /usr/local/bin/webclaw-server + +# Default port the REST API listens on when you run `webclaw-server` inside +# the container. Override with -e WEBCLAW_PORT=... or --port. Published only +# as documentation; callers still need `-p 3000:3000` on `docker run`. 
+EXPOSE 3000 + +# Container default: bind all interfaces so `-p 3000:3000` works. The binary +# itself defaults to 127.0.0.1 (safe for `cargo run` on a laptop); inside +# Docker that would make the server unreachable, so we flip it here. +# Override with -e WEBCLAW_HOST=127.0.0.1 if you front this with another +# process in the same container. +ENV WEBCLAW_HOST=0.0.0.0 # Entrypoint shim: forwards webclaw args/URL to the binary, but exec's other # commands directly so this image can be used as a FROM base with custom CMD. diff --git a/Dockerfile.ci b/Dockerfile.ci index dd1efcb..ccd8a33 100644 --- a/Dockerfile.ci +++ b/Dockerfile.ci @@ -12,6 +12,15 @@ RUN apt-get update && apt-get install -y --no-install-recommends \ ARG BINARY_DIR COPY ${BINARY_DIR}/webclaw /usr/local/bin/webclaw COPY ${BINARY_DIR}/webclaw-mcp /usr/local/bin/webclaw-mcp +COPY ${BINARY_DIR}/webclaw-server /usr/local/bin/webclaw-server + +# Default REST API port when running `webclaw-server` inside the container. +EXPOSE 3000 + +# Container default: bind all interfaces so `-p 3000:3000` works. The +# binary itself defaults to 127.0.0.1; flipping here keeps the CLI safe on +# a laptop but makes the container reachable out of the box. +ENV WEBCLAW_HOST=0.0.0.0 # Entrypoint shim: forwards webclaw args/URL to the binary, but exec's other # commands directly so this image can be used as a FROM base with custom CMD. diff --git a/crates/webclaw-server/Cargo.toml b/crates/webclaw-server/Cargo.toml new file mode 100644 index 0000000..3d4c075 --- /dev/null +++ b/crates/webclaw-server/Cargo.toml @@ -0,0 +1,29 @@ +[package] +name = "webclaw-server" +version.workspace = true +edition.workspace = true +license.workspace = true +repository.workspace = true +description = "Minimal REST API server for self-hosting webclaw extraction. Wraps the OSS extraction crates with HTTP endpoints. 
NOT the production hosted API at api.webclaw.io — this is a stateless, single-binary reference server for local + self-hosted deployments." + +[[bin]] +name = "webclaw-server" +path = "src/main.rs" + +[dependencies] +webclaw-core = { workspace = true } +webclaw-fetch = { workspace = true } +webclaw-llm = { workspace = true } +webclaw-pdf = { workspace = true } + +axum = { version = "0.8", features = ["macros"] } +tokio = { workspace = true } +tower-http = { version = "0.6", features = ["trace", "cors"] } +clap = { workspace = true, features = ["derive", "env"] } +serde = { workspace = true } +serde_json = { workspace = true } +tracing = { workspace = true } +tracing-subscriber = { workspace = true, features = ["env-filter"] } +anyhow = "1" +thiserror = { workspace = true } +subtle = "2.6" diff --git a/crates/webclaw-server/src/auth.rs b/crates/webclaw-server/src/auth.rs new file mode 100644 index 0000000..390afc5 --- /dev/null +++ b/crates/webclaw-server/src/auth.rs @@ -0,0 +1,48 @@ +//! Optional bearer-token middleware. +//! +//! When the server is started without `--api-key`, every request is allowed +//! through (server runs in "open" mode — appropriate for `localhost`-only +//! deployments). When a key is configured, every `/v1/*` request must +//! present `Authorization: Bearer <token>` and the comparison is constant- +//! time to avoid timing-leaking the key. + +use axum::{ + extract::{Request, State}, + http::StatusCode, + middleware::Next, + response::Response, +}; +use subtle::ConstantTimeEq; + +use crate::state::AppState; + +/// Axum middleware. Mount with `axum::middleware::from_fn_with_state`. +pub async fn require_bearer( + State(state): State<AppState>, + request: Request, + next: Next, +) -> Result<Response, StatusCode> { + let Some(expected) = state.api_key() else { + // Open mode — no key configured. Allow everything.
+ return Ok(next.run(request).await); + }; + + let Some(header) = request + .headers() + .get("authorization") + .and_then(|v| v.to_str().ok()) + else { + return Err(StatusCode::UNAUTHORIZED); + }; + + let presented = header + .strip_prefix("Bearer ") + .or_else(|| header.strip_prefix("bearer ")) + .ok_or(StatusCode::UNAUTHORIZED)?; + + if presented.as_bytes().ct_eq(expected.as_bytes()).into() { + Ok(next.run(request).await) + } else { + Err(StatusCode::UNAUTHORIZED) + } +} diff --git a/crates/webclaw-server/src/error.rs b/crates/webclaw-server/src/error.rs new file mode 100644 index 0000000..c49a1c9 --- /dev/null +++ b/crates/webclaw-server/src/error.rs @@ -0,0 +1,87 @@ +//! API error type. Maps internal errors to HTTP status codes + JSON. + +use axum::{ + Json, + http::StatusCode, + response::{IntoResponse, Response}, +}; +use serde_json::json; +use thiserror::Error; + +/// Public-facing API error. Always serializes as `{ "error": "..." }`. +/// Keep messages user-actionable; internal details belong in tracing logs. +/// +/// `Unauthorized` / `NotFound` / `Internal` are kept on the enum as +/// stable variants for handlers that don't exist yet (planned: per-key +/// rate-limit responses, dynamic route 404s). Marking them dead-code-OK +/// is preferable to inventing them later in three places. 
+#[allow(dead_code)] +#[derive(Debug, Error)] +pub enum ApiError { + #[error("{0}")] + BadRequest(String), + + #[error("unauthorized")] + Unauthorized, + + #[error("not found")] + NotFound, + + #[error("upstream fetch failed: {0}")] + Fetch(String), + + #[error("extraction failed: {0}")] + Extract(String), + + #[error("LLM provider error: {0}")] + Llm(String), + + #[error("internal: {0}")] + Internal(String), +} + +impl ApiError { + pub fn bad_request(msg: impl Into<String>) -> Self { + Self::BadRequest(msg.into()) + } + #[allow(dead_code)] + pub fn internal(msg: impl Into<String>) -> Self { + Self::Internal(msg.into()) + } + + fn status(&self) -> StatusCode { + match self { + Self::BadRequest(_) => StatusCode::BAD_REQUEST, + Self::Unauthorized => StatusCode::UNAUTHORIZED, + Self::NotFound => StatusCode::NOT_FOUND, + Self::Fetch(_) => StatusCode::BAD_GATEWAY, + Self::Extract(_) | Self::Llm(_) => StatusCode::UNPROCESSABLE_ENTITY, + Self::Internal(_) => StatusCode::INTERNAL_SERVER_ERROR, + } + } +} + +impl IntoResponse for ApiError { + fn into_response(self) -> Response { + let body = Json(json!({ "error": self.to_string() })); + (self.status(), body).into_response() + } +} + +impl From<webclaw_fetch::FetchError> for ApiError { + fn from(e: webclaw_fetch::FetchError) -> Self { + Self::Fetch(e.to_string()) + } +} + +impl From<webclaw_core::ExtractError> for ApiError { + fn from(e: webclaw_core::ExtractError) -> Self { + Self::Extract(e.to_string()) + } +} + +impl From<webclaw_llm::LlmError> for ApiError { + fn from(e: webclaw_llm::LlmError) -> Self { + Self::Llm(e.to_string()) + } +} diff --git a/crates/webclaw-server/src/main.rs b/crates/webclaw-server/src/main.rs new file mode 100644 index 0000000..c57fed8 --- /dev/null +++ b/crates/webclaw-server/src/main.rs @@ -0,0 +1,118 @@ +//! webclaw-server — minimal REST API for self-hosting webclaw extraction. +//! +//! This is the OSS reference server. It is intentionally small: +//! single binary, stateless, no database, no job queue. It wraps the +//!
same extraction crates the CLI and MCP server use, exposed over +//! HTTP with JSON shapes that mirror the hosted API at +//! api.webclaw.io where the underlying capability exists in OSS. +//! +//! Hosted-only features (anti-bot bypass, JS rendering, async crawl +//! jobs, multi-tenant auth, billing) are *not* implemented here and +//! never will be — they're closed-source. See the docs for the full +//! "what self-hosting gives you vs. what the cloud gives you" matrix. + +mod auth; +mod error; +mod routes; +mod state; + +use std::net::{IpAddr, SocketAddr}; +use std::time::Duration; + +use axum::{ + Router, + middleware::from_fn_with_state, + routing::{get, post}, +}; +use clap::Parser; +use tower_http::cors::{Any, CorsLayer}; +use tower_http::trace::TraceLayer; +use tracing::info; +use tracing_subscriber::{EnvFilter, fmt}; + +use crate::state::AppState; + +#[derive(Parser, Debug)] +#[command( + name = "webclaw-server", + version, + about = "Minimal self-hosted REST API for webclaw extraction.", + long_about = "Stateless single-binary REST API. Wraps the OSS extraction \ + crates over HTTP. For the full hosted platform (anti-bot, \ + JS render, async jobs, multi-tenant), use api.webclaw.io." +)] +struct Args { + /// Port to listen on. Env: WEBCLAW_PORT. + #[arg(short, long, env = "WEBCLAW_PORT", default_value_t = 3000)] + port: u16, + + /// Host to bind to. Env: WEBCLAW_HOST. + /// Default `127.0.0.1` keeps the server local-only; set to + /// `0.0.0.0` to expose on all interfaces (only do this with + /// `--api-key` set or behind a reverse proxy that adds auth). + #[arg(long, env = "WEBCLAW_HOST", default_value = "127.0.0.1")] + host: IpAddr, + + /// Optional bearer token. Env: WEBCLAW_API_KEY. When set, every + /// `/v1/*` request must present `Authorization: Bearer <token>`. + /// When unset, the server runs in open mode (no auth) — only + /// safe on a local-bound interface or behind another auth layer.
+ #[arg(long, env = "WEBCLAW_API_KEY")] + api_key: Option<String>, + + /// Tracing filter. Env: RUST_LOG. + #[arg(long, env = "RUST_LOG", default_value = "info,webclaw_server=info")] + log: String, +} + +#[tokio::main] +async fn main() -> anyhow::Result<()> { + let args = Args::parse(); + + fmt() + .with_env_filter(EnvFilter::try_new(&args.log).unwrap_or_else(|_| EnvFilter::new("info"))) + .with_target(false) + .compact() + .init(); + + let state = AppState::new(args.api_key.clone())?; + + let v1 = Router::new() + .route("/scrape", post(routes::scrape::scrape)) + .route("/crawl", post(routes::crawl::crawl)) + .route("/map", post(routes::map::map)) + .route("/batch", post(routes::batch::batch)) + .route("/extract", post(routes::extract::extract)) + .route("/summarize", post(routes::summarize::summarize_route)) + .route("/diff", post(routes::diff::diff_route)) + .route("/brand", post(routes::brand::brand)) + .layer(from_fn_with_state(state.clone(), auth::require_bearer)); + + let app = Router::new() + .route("/health", get(routes::health::health)) + .nest("/v1", v1) + .layer( + // Permissive CORS — same posture as a self-hosted dev tool. + // Tighten in front with a reverse proxy if you expose this + // publicly. + CorsLayer::new() + .allow_origin(Any) + .allow_methods(Any) + .allow_headers(Any) + .max_age(Duration::from_secs(3600)), + ) + .layer(TraceLayer::new_for_http()) + .with_state(state); + + let addr = SocketAddr::from((args.host, args.port)); + let listener = tokio::net::TcpListener::bind(addr).await?; + let auth_status = if args.api_key.is_some() { + "bearer auth required" + } else { + "open mode (no auth)" + }; + info!(%addr, mode = auth_status, "webclaw-server listening"); + + axum::serve(listener, app).await?; + Ok(()) +} diff --git a/crates/webclaw-server/src/routes/batch.rs b/crates/webclaw-server/src/routes/batch.rs new file mode 100644 index 0000000..99533c9 --- /dev/null +++ b/crates/webclaw-server/src/routes/batch.rs @@ -0,0 +1,85 @@ +//!
POST /v1/batch — fetch + extract many URLs in parallel. +//! +//! `concurrency` is hard-capped at 20 to avoid hammering targets and +//! to bound memory growth for naive callers. For larger batches use +//! the hosted API. + +use axum::{Json, extract::State}; +use serde::Deserialize; +use serde_json::{Value, json}; +use webclaw_core::ExtractionOptions; + +use crate::{error::ApiError, state::AppState}; + +const HARD_MAX_URLS: usize = 100; +const HARD_MAX_CONCURRENCY: usize = 20; + +#[derive(Debug, Deserialize, Default)] +#[serde(default)] +pub struct BatchRequest { + pub urls: Vec<String>, + pub concurrency: Option<usize>, + pub include_selectors: Vec<String>, + pub exclude_selectors: Vec<String>, + pub only_main_content: bool, +} + +pub async fn batch( + State(state): State<AppState>, + Json(req): Json<BatchRequest>, +) -> Result<Json<Value>, ApiError> { + if req.urls.is_empty() { + return Err(ApiError::bad_request("`urls` is required")); + } + if req.urls.len() > HARD_MAX_URLS { + return Err(ApiError::bad_request(format!( + "too many urls: {} (max {HARD_MAX_URLS})", + req.urls.len() + ))); + } + + let concurrency = req.concurrency.unwrap_or(5).clamp(1, HARD_MAX_CONCURRENCY); + + let options = ExtractionOptions { + include_selectors: req.include_selectors, + exclude_selectors: req.exclude_selectors, + only_main_content: req.only_main_content, + include_raw_html: false, + }; + + let url_refs: Vec<&str> = req.urls.iter().map(|s| s.as_str()).collect(); + let results = state + .fetch() + .fetch_and_extract_batch_with_options(&url_refs, concurrency, &options) + .await; + + let mut ok = 0usize; + let mut errors = 0usize; + let mut out: Vec<Value> = Vec::with_capacity(results.len()); + for r in results { + match r.result { + Ok(extraction) => { + ok += 1; + out.push(json!({ + "url": r.url, + "metadata": extraction.metadata, + "markdown": extraction.content.markdown, + })); + } + Err(e) => { + errors += 1; + out.push(json!({ + "url": r.url, + "error": e.to_string(), + })); + } + } + } + + Ok(Json(json!({ + "total": out.len(), + "completed": ok,
+ "errors": errors, + "results": out, + }))) +} diff --git a/crates/webclaw-server/src/routes/brand.rs b/crates/webclaw-server/src/routes/brand.rs new file mode 100644 index 0000000..908976a --- /dev/null +++ b/crates/webclaw-server/src/routes/brand.rs @@ -0,0 +1,32 @@ +//! POST /v1/brand — extract brand identity (colors, fonts, logo) from a page. +//! +//! Pure DOM/CSS analysis — no LLM, no network beyond the page fetch itself. + +use axum::{Json, extract::State}; +use serde::Deserialize; +use serde_json::{Value, json}; +use webclaw_core::brand::extract_brand; + +use crate::{error::ApiError, state::AppState}; + +#[derive(Debug, Deserialize)] +pub struct BrandRequest { + pub url: String, +} + +pub async fn brand( + State(state): State<AppState>, + Json(req): Json<BrandRequest>, +) -> Result<Json<Value>, ApiError> { + if req.url.trim().is_empty() { + return Err(ApiError::bad_request("`url` is required")); + } + + let fetched = state.fetch().fetch(&req.url).await?; + let brand = extract_brand(&fetched.html, Some(&fetched.url)); + + Ok(Json(json!({ + "url": req.url, + "brand": brand, + }))) +} diff --git a/crates/webclaw-server/src/routes/crawl.rs b/crates/webclaw-server/src/routes/crawl.rs new file mode 100644 index 0000000..4d15195 --- /dev/null +++ b/crates/webclaw-server/src/routes/crawl.rs @@ -0,0 +1,85 @@ +//! POST /v1/crawl — synchronous BFS crawl. +//! +//! NOTE: this server is stateless — there is no job queue. Crawls run +//! inline and return when complete. `max_pages` is hard-capped at 500 +//! to avoid OOM on naive callers. For large crawls + async jobs, use +//! the hosted API at api.webclaw.io.
+ +use axum::{Json, extract::State}; +use serde::Deserialize; +use serde_json::{Value, json}; +use std::time::Duration; +use webclaw_fetch::{CrawlConfig, Crawler, FetchConfig}; + +use crate::{error::ApiError, state::AppState}; + +const HARD_MAX_PAGES: usize = 500; + +#[derive(Debug, Deserialize, Default)] +#[serde(default)] +pub struct CrawlRequest { + pub url: String, + pub max_depth: Option<usize>, + pub max_pages: Option<usize>, + pub use_sitemap: bool, + pub concurrency: Option<usize>, + pub allow_subdomains: bool, + pub allow_external_links: bool, + pub include_patterns: Vec<String>, + pub exclude_patterns: Vec<String>, +} + +pub async fn crawl( + State(_state): State<AppState>, + Json(req): Json<CrawlRequest>, +) -> Result<Json<Value>, ApiError> { + if req.url.trim().is_empty() { + return Err(ApiError::bad_request("`url` is required")); + } + let max_pages = req.max_pages.unwrap_or(50).min(HARD_MAX_PAGES); + let max_depth = req.max_depth.unwrap_or(3); + let concurrency = req.concurrency.unwrap_or(5).min(20); + + let config = CrawlConfig { + fetch: FetchConfig::default(), + max_depth, + max_pages, + concurrency, + delay: Duration::from_millis(200), + path_prefix: None, + use_sitemap: req.use_sitemap, + include_patterns: req.include_patterns, + exclude_patterns: req.exclude_patterns, + allow_subdomains: req.allow_subdomains, + allow_external_links: req.allow_external_links, + progress_tx: None, + cancel_flag: None, + }; + + let crawler = Crawler::new(&req.url, config).map_err(ApiError::from)?; + let result = crawler.crawl(&req.url, None).await; + + let pages: Vec<Value> = result + .pages + .iter() + .map(|p| { + json!({ + "url": p.url, + "depth": p.depth, + "metadata": p.extraction.as_ref().map(|e| &e.metadata), + "markdown": p.extraction.as_ref().map(|e| e.content.markdown.as_str()).unwrap_or(""), + "error": p.error, + }) + }) + .collect(); + + Ok(Json(json!({ + "url": req.url, + "status": "completed", + "total": result.total, + "completed": result.ok, + "errors": result.errors, + "elapsed_secs": result.elapsed_secs, + "pages": pages, +
}))) +} diff --git a/crates/webclaw-server/src/routes/diff.rs b/crates/webclaw-server/src/routes/diff.rs new file mode 100644 index 0000000..e4e038d --- /dev/null +++ b/crates/webclaw-server/src/routes/diff.rs @@ -0,0 +1,92 @@ +//! POST /v1/diff — compare current page content against a prior snapshot. +//! +//! Caller passes either a full prior `ExtractionResult` or the minimal +//! `{ markdown, metadata }` shape used by the hosted API. We re-fetch +//! the URL, extract, and run `webclaw_core::diff::diff` over the pair. + +use axum::{Json, extract::State}; +use serde::Deserialize; +use serde_json::{Value, json}; +use webclaw_core::{Content, ExtractionResult, Metadata, diff::diff}; + +use crate::{error::ApiError, state::AppState}; + +#[derive(Debug, Deserialize)] +pub struct DiffRequest { + pub url: String, + pub previous: PreviousSnapshot, +} + +/// Either a full prior extraction, or the minimal `{ markdown, metadata }` +/// shape returned by /v1/scrape. Untagged so callers can send whichever +/// they have on hand. 
+#[derive(Debug, Deserialize)] +#[serde(untagged)] +pub enum PreviousSnapshot { + Full(ExtractionResult), + Minimal { + #[serde(default)] + markdown: String, + #[serde(default)] + metadata: Option<Metadata>, + }, +} + +impl PreviousSnapshot { + fn into_extraction(self) -> ExtractionResult { + match self { + Self::Full(r) => r, + Self::Minimal { markdown, metadata } => ExtractionResult { + metadata: metadata.unwrap_or_else(empty_metadata), + content: Content { + markdown, + plain_text: String::new(), + links: Vec::new(), + images: Vec::new(), + code_blocks: Vec::new(), + raw_html: None, + }, + domain_data: None, + structured_data: Vec::new(), + }, + } + } +} + +fn empty_metadata() -> Metadata { + Metadata { + title: None, + description: None, + author: None, + published_date: None, + language: None, + url: None, + site_name: None, + image: None, + favicon: None, + word_count: 0, + } +} + +pub async fn diff_route( + State(state): State<AppState>, + Json(req): Json<DiffRequest>, +) -> Result<Json<Value>, ApiError> { + if req.url.trim().is_empty() { + return Err(ApiError::bad_request("`url` is required")); + } + + let current = state.fetch().fetch_and_extract(&req.url).await?; + let previous = req.previous.into_extraction(); + let result = diff(&previous, &current); + + Ok(Json(json!({ + "url": req.url, + "status": result.status, + "diff": result.text_diff, + "metadata_changes": result.metadata_changes, + "links_added": result.links_added, + "links_removed": result.links_removed, + "word_count_delta": result.word_count_delta, + }))) +} diff --git a/crates/webclaw-server/src/routes/extract.rs b/crates/webclaw-server/src/routes/extract.rs new file mode 100644 index 0000000..05b8909 --- /dev/null +++ b/crates/webclaw-server/src/routes/extract.rs @@ -0,0 +1,81 @@ +//! POST /v1/extract — LLM-powered structured extraction. +//! +//! Two modes: +//! * `schema` — JSON Schema describing what to extract. +//! * `prompt` — natural-language instructions. +//! +//! At least one must be provided. The provider chain is built per +//!
request from env (Ollama -> OpenAI -> Anthropic). Self-hosters +//! get the same fallback behaviour as the CLI. + +use axum::{Json, extract::State}; +use serde::Deserialize; +use serde_json::{Value, json}; +use webclaw_llm::{ProviderChain, extract::extract_json, extract::extract_with_prompt}; + +use crate::{error::ApiError, state::AppState}; + +#[derive(Debug, Deserialize, Default)] +#[serde(default)] +pub struct ExtractRequest { + pub url: String, + pub schema: Option, + pub prompt: Option, + /// Optional override of the provider model name (e.g. `gpt-4o-mini`). + pub model: Option, +} + +pub async fn extract( + State(state): State, + Json(req): Json, +) -> Result, ApiError> { + if req.url.trim().is_empty() { + return Err(ApiError::bad_request("`url` is required")); + } + let has_schema = req.schema.is_some(); + let has_prompt = req + .prompt + .as_deref() + .map(|p| !p.trim().is_empty()) + .unwrap_or(false); + if !has_schema && !has_prompt { + return Err(ApiError::bad_request( + "either `schema` or `prompt` is required", + )); + } + + // Fetch + extract first so we feed the LLM clean markdown instead of + // raw HTML. Cheaper tokens, better signal. + let extraction = state.fetch().fetch_and_extract(&req.url).await?; + let content = if extraction.content.markdown.trim().is_empty() { + extraction.content.plain_text.clone() + } else { + extraction.content.markdown.clone() + }; + if content.trim().is_empty() { + return Err(ApiError::Extract( + "no extractable content on page".to_string(), + )); + } + + let chain = ProviderChain::default().await; + if chain.is_empty() { + return Err(ApiError::Llm( + "no LLM providers configured (set OLLAMA_HOST, OPENAI_API_KEY, or ANTHROPIC_API_KEY)" + .to_string(), + )); + } + + let model = req.model.as_deref(); + let data = if let Some(schema) = req.schema.as_ref() { + extract_json(&content, schema, &chain, model).await? 
+    } else {
+        let prompt = req.prompt.as_deref().unwrap_or_default();
+        extract_with_prompt(&content, prompt, &chain, model).await?
+    };
+
+    Ok(Json(json!({
+        "url": req.url,
+        "data": data,
+    })))
+}
diff --git a/crates/webclaw-server/src/routes/health.rs b/crates/webclaw-server/src/routes/health.rs
new file mode 100644
index 0000000..7ccd165
--- /dev/null
+++ b/crates/webclaw-server/src/routes/health.rs
@@ -0,0 +1,10 @@
+use axum::Json;
+use serde_json::{Value, json};
+
+pub async fn health() -> Json<Value> {
+    Json(json!({
+        "status": "ok",
+        "version": env!("CARGO_PKG_VERSION"),
+        "service": "webclaw-server",
+    }))
+}
diff --git a/crates/webclaw-server/src/routes/map.rs b/crates/webclaw-server/src/routes/map.rs
new file mode 100644
index 0000000..846183a
--- /dev/null
+++ b/crates/webclaw-server/src/routes/map.rs
@@ -0,0 +1,49 @@
+//! POST /v1/map — discover URLs from a site's sitemaps.
+//!
+//! Walks robots.txt + common sitemap paths, recursively resolves
+//! `<sitemapindex>` files, and returns the deduplicated list of URLs.
+
+use axum::{Json, extract::State};
+use serde::Deserialize;
+use serde_json::{Value, json};
+use webclaw_fetch::sitemap;
+
+use crate::{error::ApiError, state::AppState};
+
+#[derive(Debug, Deserialize)]
+pub struct MapRequest {
+    pub url: String,
+    /// When true, return the full SitemapEntry objects (with lastmod,
+    /// priority, changefreq). Defaults to false → bare URL strings,
+    /// matching the hosted-API shape.
+    #[serde(default)]
+    pub include_metadata: bool,
+}
+
+pub async fn map(
+    State(state): State<AppState>,
+    Json(req): Json<MapRequest>,
+) -> Result<Json<Value>, ApiError> {
+    if req.url.trim().is_empty() {
+        return Err(ApiError::bad_request("`url` is required"));
+    }
+
+    let entries = sitemap::discover(state.fetch(), &req.url).await?;
+
+    let body = if req.include_metadata {
+        json!({
+            "url": req.url,
+            "count": entries.len(),
+            "urls": entries,
+        })
+    } else {
+        let urls: Vec<&str> = entries.iter().map(|e| e.url.as_str()).collect();
+        json!({
+            "url": req.url,
+            "count": urls.len(),
+            "urls": urls,
+        })
+    };
+
+    Ok(Json(body))
+}
diff --git a/crates/webclaw-server/src/routes/mod.rs b/crates/webclaw-server/src/routes/mod.rs
new file mode 100644
index 0000000..7c3d68e
--- /dev/null
+++ b/crates/webclaw-server/src/routes/mod.rs
@@ -0,0 +1,18 @@
+//! HTTP route handlers.
+//!
+//! The OSS server exposes a deliberately small surface that mirrors the
+//! hosted-API JSON shapes where the underlying capability exists in the
+//! OSS crates. Endpoints that depend on private infrastructure
+//! (anti-bot bypass with stealth Chrome, JS rendering at scale,
+//! per-user auth, billing, async job queues, agent loops) are
+//! intentionally not implemented here. Use api.webclaw.io for those.
+
+pub mod batch;
+pub mod brand;
+pub mod crawl;
+pub mod diff;
+pub mod extract;
+pub mod health;
+pub mod map;
+pub mod scrape;
+pub mod summarize;
diff --git a/crates/webclaw-server/src/routes/scrape.rs b/crates/webclaw-server/src/routes/scrape.rs
new file mode 100644
index 0000000..1c5fc52
--- /dev/null
+++ b/crates/webclaw-server/src/routes/scrape.rs
@@ -0,0 +1,108 @@
+//! POST /v1/scrape — fetch a URL, run extraction, return the requested
+//! formats. JSON shape mirrors the hosted-API response where possible so
+//! migrating from self-hosted → cloud is a config change, not a code one.
+ +use axum::{Json, extract::State}; +use serde::Deserialize; +use serde_json::{Value, json}; +use webclaw_core::{ExtractionOptions, llm::to_llm_text}; + +use crate::{error::ApiError, state::AppState}; + +#[derive(Debug, Deserialize, Default)] +#[serde(default)] +pub struct ScrapeRequest { + pub url: String, + /// Output formats. Allowed: "markdown", "text", "llm", "json", "html". + /// Defaults to ["markdown"]. Accepts a single string ("format") + /// or an array ("formats") for hosted-API compatibility. + #[serde(alias = "format")] + pub formats: ScrapeFormats, + pub include_selectors: Vec, + pub exclude_selectors: Vec, + pub only_main_content: bool, +} + +#[derive(Debug, Deserialize)] +#[serde(untagged)] +pub enum ScrapeFormats { + One(String), + Many(Vec), +} + +impl Default for ScrapeFormats { + fn default() -> Self { + Self::Many(vec!["markdown".into()]) + } +} + +impl ScrapeFormats { + fn as_vec(&self) -> Vec { + match self { + Self::One(s) => vec![s.clone()], + Self::Many(v) => v.clone(), + } + } +} + +pub async fn scrape( + State(state): State, + Json(req): Json, +) -> Result, ApiError> { + if req.url.trim().is_empty() { + return Err(ApiError::bad_request("`url` is required")); + } + let formats = req.formats.as_vec(); + + let options = ExtractionOptions { + include_selectors: req.include_selectors, + exclude_selectors: req.exclude_selectors, + only_main_content: req.only_main_content, + include_raw_html: formats.iter().any(|f| f == "html"), + }; + + let extraction = state + .fetch() + .fetch_and_extract_with_options(&req.url, &options) + .await?; + + let mut body = json!({ + "url": extraction.metadata.url.clone().unwrap_or_else(|| req.url.clone()), + "metadata": extraction.metadata, + }); + let obj = body.as_object_mut().expect("json::object"); + + for f in &formats { + match f.as_str() { + "markdown" => { + obj.insert("markdown".into(), json!(extraction.content.markdown)); + } + "text" => { + obj.insert("text".into(), 
json!(extraction.content.plain_text)); + } + "llm" => { + let llm = to_llm_text(&extraction, extraction.metadata.url.as_deref()); + obj.insert("llm".into(), json!(llm)); + } + "html" => { + if let Some(raw) = &extraction.content.raw_html { + obj.insert("html".into(), json!(raw)); + } + } + "json" => { + obj.insert("json".into(), json!(extraction)); + } + other => { + return Err(ApiError::bad_request(format!( + "unknown format: '{other}' (allowed: markdown, text, llm, html, json)" + ))); + } + } + } + + if !extraction.structured_data.is_empty() { + obj.insert("structured_data".into(), json!(extraction.structured_data)); + } + + Ok(Json(body)) +} diff --git a/crates/webclaw-server/src/routes/summarize.rs b/crates/webclaw-server/src/routes/summarize.rs new file mode 100644 index 0000000..b967f1f --- /dev/null +++ b/crates/webclaw-server/src/routes/summarize.rs @@ -0,0 +1,52 @@ +//! POST /v1/summarize — LLM-powered page summary. + +use axum::{Json, extract::State}; +use serde::Deserialize; +use serde_json::{Value, json}; +use webclaw_llm::{ProviderChain, summarize::summarize}; + +use crate::{error::ApiError, state::AppState}; + +#[derive(Debug, Deserialize, Default)] +#[serde(default)] +pub struct SummarizeRequest { + pub url: String, + pub max_sentences: Option, + pub model: Option, +} + +pub async fn summarize_route( + State(state): State, + Json(req): Json, +) -> Result, ApiError> { + if req.url.trim().is_empty() { + return Err(ApiError::bad_request("`url` is required")); + } + + let extraction = state.fetch().fetch_and_extract(&req.url).await?; + let content = if extraction.content.markdown.trim().is_empty() { + extraction.content.plain_text.clone() + } else { + extraction.content.markdown.clone() + }; + if content.trim().is_empty() { + return Err(ApiError::Extract( + "no extractable content on page".to_string(), + )); + } + + let chain = ProviderChain::default().await; + if chain.is_empty() { + return Err(ApiError::Llm( + "no LLM providers configured (set 
OLLAMA_HOST, OPENAI_API_KEY, or ANTHROPIC_API_KEY)" + .to_string(), + )); + } + + let summary = summarize(&content, req.max_sentences, &chain, req.model.as_deref()).await?; + + Ok(Json(json!({ + "url": req.url, + "summary": summary, + }))) +} diff --git a/crates/webclaw-server/src/state.rs b/crates/webclaw-server/src/state.rs new file mode 100644 index 0000000..b3f9b6b --- /dev/null +++ b/crates/webclaw-server/src/state.rs @@ -0,0 +1,49 @@ +//! Shared application state. Cheap to clone via Arc; held by the axum +//! Router for the life of the process. + +use std::sync::Arc; +use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig}; + +/// Single-process state shared across all request handlers. +#[derive(Clone)] +pub struct AppState { + inner: Arc, +} + +struct Inner { + /// Wrapped in `Arc` because `fetch_and_extract_batch_with_options` + /// (used by the /v1/batch handler) takes `self: &Arc` so it + /// can clone the client into spawned tasks. The single-call handlers + /// auto-deref `&Arc` -> `&FetchClient`, so this costs + /// them nothing. + pub fetch: Arc, + pub api_key: Option, +} + +impl AppState { + /// Build the application state. The fetch client is constructed once + /// and shared across requests so connection pools + browser profile + /// state don't churn per request. 
+    pub fn new(api_key: Option<String>) -> anyhow::Result<Self> {
+        let config = FetchConfig {
+            browser: BrowserProfile::Chrome,
+            ..FetchConfig::default()
+        };
+        let fetch = FetchClient::new(config)
+            .map_err(|e| anyhow::anyhow!("failed to build fetch client: {e}"))?;
+        Ok(Self {
+            inner: Arc::new(Inner {
+                fetch: Arc::new(fetch),
+                api_key,
+            }),
+        })
+    }
+
+    pub fn fetch(&self) -> &Arc<FetchClient> {
+        &self.inner.fetch
+    }
+
+    pub fn api_key(&self) -> Option<&str> {
+        self.inner.api_key.as_deref()
+    }
+}

From d91ad9c1f4ca1ff7456a5e4c6a2ff7aa76efdab2 Mon Sep 17 00:00:00 2001
From: Valerio
Date: Wed, 22 Apr 2026 12:25:29 +0200
Subject: [PATCH 02/30] feat(cli): add webclaw bench subcommand (closes #26)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Per-URL extraction micro-benchmark. Fetches a URL once, runs the same
pipeline as --format llm, prints a small ASCII table comparing raw HTML
vs. llm output on tokens, bytes, and extraction time.

    webclaw bench https://stripe.com              # ASCII table
    webclaw bench https://stripe.com --json       # one-line JSON
    webclaw bench https://stripe.com --facts FILE # adds fidelity row

The --facts file uses the same schema as benchmarks/facts.json (curated
visible-fact list per URL). URLs not in the file produce no fidelity
row, so an uncurated site doesn't show 0/0.

v1 uses an approximate tokenizer (chars/4 Latin, chars/2 when CJK
dominates). Off by ~10% vs cl100k_base but the signal — 'is the LLM
output 90% smaller than the raw HTML' — is order-of-magnitude, not
precise accounting. Output is labeled '~ tokens' so nobody mistakes it
for a real BPE count. Swapping in tiktoken-rs later is a one function
change; left out of v1 to avoid the 2 MB BPE-data binary bloat for a
feature most users will run a handful of times.

Implemented as a real clap subcommand (clap::Subcommand) rather than yet
another flag, with the existing flag-based flow falling through when no
subcommand is given. Existing 'webclaw --format ...' invocations work
exactly as before. Lays the groundwork for future subcommands without
disrupting the legacy flat-flag UX.

12 new unit tests cover the tokenizer, formatters, host extraction, and
fact-matching. Verified end-to-end on example.com and tavily.com (5/5
facts preserved at 93% token reduction).
---
 crates/webclaw-cli/src/bench.rs | 422 ++++++++++++++++++++++++++++++++
 crates/webclaw-cli/src/main.rs  |  50 +++-
 2 files changed, 471 insertions(+), 1 deletion(-)
 create mode 100644 crates/webclaw-cli/src/bench.rs

diff --git a/crates/webclaw-cli/src/bench.rs b/crates/webclaw-cli/src/bench.rs
new file mode 100644
index 0000000..3e45da4
--- /dev/null
+++ b/crates/webclaw-cli/src/bench.rs
@@ -0,0 +1,422 @@
+//! `webclaw bench <url>` — per-URL extraction micro-benchmark.
+//!
+//! Fetches a page, extracts it via the same pipeline that powers
+//! `--format llm`, and reports how many tokens the LLM pipeline
+//! removed vs. the raw HTML. Optional `--facts` reuses the
+//! benchmark harness's curated fact lists to score fidelity.
+//!
+//! v1 uses an *approximate* tokenizer (chars/4 for Latin text,
+//! chars/2 for CJK-heavy text). Output is clearly labeled
+//! "≈ tokens" so nobody mistakes it for a real tiktoken run.
+//! Swapping to tiktoken-rs later is a one-function change.
+
+use std::path::{Path, PathBuf};
+use std::time::Instant;
+
+use webclaw_core::{extract, to_llm_text};
+use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig};
+
+/// Inputs collected from the clap subcommand.
+pub struct BenchArgs {
+    pub url: String,
+    pub json: bool,
+    pub facts: Option<PathBuf>,
+}
+
+/// What a single bench run measures.
+struct BenchResult {
+    url: String,
+    raw_tokens: usize,
+    raw_bytes: usize,
+    llm_tokens: usize,
+    llm_bytes: usize,
+    reduction_pct: f64,
+    elapsed_secs: f64,
+    /// `Some((found, total))` when `--facts` is supplied and the URL has
+    /// an entry in the facts file; `None` otherwise.
+    facts: Option<(usize, usize)>,
+}
+
+pub async fn run(args: &BenchArgs) -> Result<(), String> {
+    // Dedicated client so bench doesn't care about global CLI flags
+    // (proxies, custom headers, etc.). A reproducible microbench is
+    // more useful than an over-configurable one; if someone wants to
+    // bench behind a proxy they can set WEBCLAW_PROXY — respected
+    // by FetchConfig via the regular channels if we extend later.
+    let config = FetchConfig {
+        browser: BrowserProfile::Chrome,
+        ..FetchConfig::default()
+    };
+    let client = FetchClient::new(config).map_err(|e| format!("build client: {e}"))?;
+
+    let start = Instant::now();
+    let fetched = client
+        .fetch(&args.url)
+        .await
+        .map_err(|e| format!("fetch: {e}"))?;
+
+    let extraction =
+        extract(&fetched.html, Some(&fetched.url)).map_err(|e| format!("extract: {e}"))?;
+    let llm_text = to_llm_text(&extraction, Some(&fetched.url));
+    let elapsed = start.elapsed();
+
+    let raw_tokens = approx_tokens(&fetched.html);
+    let llm_tokens = approx_tokens(&llm_text);
+    let raw_bytes = fetched.html.len();
+    let llm_bytes = llm_text.len();
+    let reduction_pct = if raw_tokens == 0 {
+        0.0
+    } else {
+        100.0 * (1.0 - llm_tokens as f64 / raw_tokens as f64)
+    };
+
+    let facts = match args.facts.as_deref() {
+        Some(path) => check_facts(path, &args.url, &llm_text)?,
+        None => None,
+    };
+
+    let result = BenchResult {
+        url: args.url.clone(),
+        raw_tokens,
+        raw_bytes,
+        llm_tokens,
+        llm_bytes,
+        reduction_pct,
+        elapsed_secs: elapsed.as_secs_f64(),
+        facts,
+    };
+
+    if args.json {
+        print_json(&result);
+    } else {
+        print_box(&result);
+    }
+    Ok(())
+}
+
+// ---------------------------------------------------------------------------
+// Approximate tokenizer
+// ---------------------------------------------------------------------------
+
+/// Rough token count. `chars / 4` is the classic English rule of thumb
+/// (close to cl100k_base for typical prose). CJK scripts pack ~2 chars
+/// per token, so we switch to `chars / 2` when CJK dominates.
+///
+/// Off by ±10% vs. a real BPE tokenizer, which is fine for "is webclaw's
+/// output 66% smaller or 66% bigger than raw HTML" — the signal is
+/// order-of-magnitude, not precise accounting.
+fn approx_tokens(s: &str) -> usize {
+    let total: usize = s.chars().count();
+    if total == 0 {
+        return 0;
+    }
+    let cjk = s.chars().filter(|c| is_cjk(*c)).count();
+    let cjk_ratio = cjk as f64 / total as f64;
+    if cjk_ratio > 0.30 {
+        total.div_ceil(2)
+    } else {
+        total.div_ceil(4)
+    }
+}
+
+fn is_cjk(c: char) -> bool {
+    let n = c as u32;
+    (0x4E00..=0x9FFF).contains(&n) // CJK Unified Ideographs
+        || (0x3040..=0x309F).contains(&n) // Hiragana
+        || (0x30A0..=0x30FF).contains(&n) // Katakana
+        || (0xAC00..=0xD7AF).contains(&n) // Hangul Syllables
+        || (0x3400..=0x4DBF).contains(&n) // CJK Extension A
+}
+
+// ---------------------------------------------------------------------------
+// Output: ASCII / Unicode box
+// ---------------------------------------------------------------------------
+
+const BOX_WIDTH: usize = 62; // inner width between the two side borders
+
+fn print_box(r: &BenchResult) {
+    let host = display_host(&r.url);
+    let version = env!("CARGO_PKG_VERSION");
+
+    let top = "─".repeat(BOX_WIDTH);
+    let sep = "─".repeat(BOX_WIDTH);
+
+    // Header: host on the left, "webclaw X.Y.Z" on the right.
+    let left = host;
+    let right = format!("webclaw {version}");
+    let pad = BOX_WIDTH.saturating_sub(left.chars().count() + right.chars().count() + 2);
+    let header = format!(" {}{}{} ", left, " ".repeat(pad), right);
+
+    println!("┌{top}┐");
+    println!("│{header}│");
+    println!("├{sep}┤");
+    print_row(
+        "raw HTML",
+        &format!("{} ≈ tokens", fmt_int(r.raw_tokens)),
+        &fmt_bytes(r.raw_bytes),
+    );
+    print_row(
+        "--format llm",
+        &format!("{} ≈ tokens", fmt_int(r.llm_tokens)),
+        &fmt_bytes(r.llm_bytes),
+    );
+    print_row("token reduction", &format!("{:.1}%", r.reduction_pct), "");
+    print_row("extraction time", &format!("{:.2} s", r.elapsed_secs), "");
+    if let Some((found, total)) = r.facts {
+        let pct = if total == 0 {
+            0.0
+        } else {
+            100.0 * found as f64 / total as f64
+        };
+        print_row(
+            "facts preserved",
+            &format!("{found}/{total} ({pct:.1}%)"),
+            "",
+        );
+    }
+    println!("└{top}┘");
+    println!();
+    println!("note: token counts are approximate (chars/4 Latin, chars/2 CJK).");
+}
+
+fn print_row(label: &str, middle: &str, right: &str) {
+    // Layout inside the box:
+    // "