Mirror of https://github.com/0xMassi/webclaw.git, synced 2026-04-25 00:06:21 +02:00.

Compare commits (33 commits)
a5c3433372 · 966981bc42 · 866fa88aa0 · b413d702b2 · 98a177dec4 · e1af2da509 · 2285c585b1 · b77767814a · 4bf11d902f · 0daa2fec1a · 058493bc8f · aaa5103504 · 2373162c81 · b2e7dbf365 · e10066f527 · a53578e45c · 7f5eb93b65 · 8cc727c2f2 · d8c9274a9c · 0ab891bd6b · 0221c151dc · 3bb0a4bca0 · b041f3cddd · 86182ef28a · 8ba7538c37 · ccdb6d364b · eff914e84f · c7e5abea8f · d71eebdacc · d91ad9c1f4 · 2ba682adf3 · b4bfff120e · e27ee1f86f
77 changed files with 13193 additions and 566 deletions
.github/workflows/release.yml (vendored): 12 lines changed
```diff
@@ -66,8 +66,14 @@ jobs:
           tag="${GITHUB_REF#refs/tags/}"
           staging="webclaw-${tag}-${{ matrix.target }}"
           mkdir "$staging"
-          cp target/${{ matrix.target }}/release/webclaw "$staging/" 2>/dev/null || true
-          cp target/${{ matrix.target }}/release/webclaw-mcp "$staging/" 2>/dev/null || true
+          # Fail loud if any binary is missing. A silent `|| true` on the
+          # copy was how v0.4.0 shipped tarballs that lacked webclaw-server —
+          # don't repeat that mistake. If a future binary gets renamed or
+          # removed, this step should scream, not quietly publish an
+          # incomplete release.
+          cp target/${{ matrix.target }}/release/webclaw "$staging/"
+          cp target/${{ matrix.target }}/release/webclaw-mcp "$staging/"
+          cp target/${{ matrix.target }}/release/webclaw-server "$staging/"
           cp README.md LICENSE "$staging/"
           tar czf "$staging.tar.gz" "$staging"
           echo "ASSET=$staging.tar.gz" >> $GITHUB_ENV
@@ -134,6 +140,7 @@ jobs:
             mkdir -p "binaries-${target}"
             cp "${dir}/webclaw" "binaries-${target}/webclaw"
             cp "${dir}/webclaw-mcp" "binaries-${target}/webclaw-mcp"
+            cp "${dir}/webclaw-server" "binaries-${target}/webclaw-server"
             chmod +x "binaries-${target}"/*
           done
           ls -laR binaries-*/
@@ -220,6 +227,7 @@ jobs:
       def install
         bin.install "webclaw"
         bin.install "webclaw-mcp"
+        bin.install "webclaw-server"
       end

       test do
```
CHANGELOG.md: 110 lines changed
`@@ -3,6 +3,116 @@`

All notable changes to webclaw are documented here.
Format follows [Keep a Changelog](https://keepachangelog.com/).

## [0.5.6] — 2026-04-23

### Added

- `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API with an identifiable bot `User-Agent`, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. Makes `/v1/scrape` on Reddit populate markdown again.

### Fixed

- Regression introduced in 0.5.4 where the production server's `/v1/scrape` bypassed the Reddit `.json` shortcut and Akamai cookie warmup that `fetch_and_extract` had been providing. Both helpers now live in `fetch_smart`, and every caller path picks them up.
- Panic in the markdown converter (`markdown.rs:925`) on single-pipe `|` lines: a `[1..len-1]` slice on a 1-character input violated the `begin <= end` slice invariant and panicked. Now guarded by a length check.
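  The class of bug can be sketched as follows; `strip_outer_pipes` is a hypothetical stand-in for the converter's table-row trimming, not webclaw's actual code.

  ```rust
  // Trimming the outer pipes of a table row like "|a|b|" with `[1..len-1]`
  // panics on the single character "|": the range 1..0 violates begin <= end.
  // Checking the length first makes the degenerate input a no-op.
  fn strip_outer_pipes(line: &str) -> &str {
      let bytes = line.as_bytes();
      // Need at least two bytes before slicing one off each end. The pipe
      // checks keep the byte-index slice on ASCII (so on char boundaries).
      if bytes.len() >= 2 && bytes[0] == b'|' && bytes[bytes.len() - 1] == b'|' {
          &line[1..line.len() - 1]
      } else {
          line
      }
  }

  fn main() {
      assert_eq!(strip_outer_pipes("|a|b|"), "a|b");
      assert_eq!(strip_outer_pipes("|"), "|"); // the crash case, now a no-op
  }
  ```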

---

## [0.5.5] — 2026-04-23

### Added

- `webclaw --browser safari-ios` on the CLI. Pairs with `--proxy` for DataDome-fronted sites that reject desktop profiles.

---

## [0.5.4] — 2026-04-23

### Added

- New `BrowserProfile::SafariIos` for Safari iOS 26 fingerprinting. Pairs with a country-matched residential proxy for sites that reject non-mobile profiles.
- `accept_language_for_url(url)` and `accept_language_for_tld(tld)` helpers. Both return a locale-appropriate `Accept-Language` value based on the URL's TLD, with `en-US` as the fallback.
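  A minimal sketch of the TLD-based lookup; the mapping entries and exact header values here are illustrative assumptions, and only the `en-US` fallback is stated by this changelog entry.

  ```rust
  // Hypothetical TLD → Accept-Language table; the real helper's entries
  // and quality values may differ.
  fn accept_language_for_tld(tld: &str) -> &'static str {
      match tld {
          "fr" => "fr-FR,fr;q=0.9,en;q=0.8",
          "de" => "de-DE,de;q=0.9,en;q=0.8",
          "it" => "it-IT,it;q=0.9,en;q=0.8",
          "jp" => "ja-JP,ja;q=0.9,en;q=0.8",
          // Fallback for .com, .org, and unknown TLDs.
          _ => "en-US,en;q=0.9",
      }
  }

  fn main() {
      assert_eq!(accept_language_for_tld("de"), "de-DE,de;q=0.9,en;q=0.8");
      assert_eq!(accept_language_for_tld("com"), "en-US,en;q=0.9");
  }
  ```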

### Changed

- Chrome browser fingerprint refreshed for current Cloudflare bot management. Fixes 403 challenges on several e-commerce and jobs sites.
- Bumped `wreq-util` to `3.0.0-rc.10`.

---

## [0.5.2] — 2026-04-22

### Added

- **`webclaw vertical <name> <url>` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`. URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1.
- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick.
- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set.

### Changed

- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded to 10). `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`.

---

## [0.5.1] — 2026-04-22

### Added

- **`webclaw_fetch::Fetcher` trait.** Vertical extractors now consume `&dyn Fetcher` instead of `&FetchClient` directly. The trait exposes three methods (`fetch`, `fetch_with_headers`, `cloud`) covering everything extractors need. Callers that already held a `FetchClient` keep working unchanged: `FetchClient` implements `Fetcher`, blanket impls cover `&T` and `Arc<T>`, so `&client` coerces to `&dyn Fetcher` automatically.

  The motivation is the split between OSS (wreq-backed, in-process TLS fingerprinting) and the production API server at api.webclaw.io (which cannot use in-process fingerprinting per the architecture rule, and must delegate HTTP through the Go tls-sidecar). Before this trait, adding vertical routes to the production server would have required importing wreq into its dependency graph, violating the separation. Now the production server can provide its own `TlsSidecarFetcher` implementation and pass it to the same extractor dispatcher the OSS server uses.

  Backwards compatible: no behavior change for CLI, MCP, or OSS self-host.
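  The coercion story can be sketched in miniature (simplified, infallible signatures for illustration; the real trait also has `fetch_with_headers` and `cloud`):

  ```rust
  // A concrete client implements the trait, and a blanket impl forwards `&T`,
  // so call sites holding a `FetchClient` keep compiling unchanged.
  trait Fetcher {
      fn fetch(&self, url: &str) -> String;
  }

  struct FetchClient;

  impl Fetcher for FetchClient {
      fn fetch(&self, url: &str) -> String {
          format!("fetched {url}")
      }
  }

  // Blanket impl: a reference to any Fetcher is itself a Fetcher.
  // (The crate is described as adding a similar forwarding impl for Arc<T>.)
  impl<T: Fetcher + ?Sized> Fetcher for &T {
      fn fetch(&self, url: &str) -> String {
          (**self).fetch(url)
      }
  }

  // Extractors take the trait object, not the concrete client.
  fn extract(client: &dyn Fetcher, url: &str) -> String {
      client.fetch(url)
  }

  fn main() {
      let client = FetchClient;
      // `&FetchClient` coerces to `&dyn Fetcher` automatically.
      assert_eq!(extract(&client, "https://example.com"), "fetched https://example.com");
  }
  ```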

### Changed

- All 28 extractor `extract()` signatures migrated from `client: &FetchClient` to `client: &dyn Fetcher`. The dispatcher functions (`extractors::dispatch_by_url`, `extractors::dispatch_by_name`) and the cloud escalation helpers (`cloud::smart_fetch`, `cloud::smart_fetch_html`) follow the same change. Tests and call sites are unchanged because `&FetchClient` auto-coerces.

---

## [0.5.0] — 2026-04-22

### Added

- **28 vertical extractors that return typed JSON instead of generic markdown.** New `webclaw_fetch::extractors` module with one extractor per site. Dev: reddit, hackernews, github_repo / github_pr / github_issue / github_release, crates_io, pypi, npm. AI/ML: huggingface_model, huggingface_dataset, arxiv, docker_hub. Writing: dev_to, stackoverflow, youtube_video. Social: linkedin_post, instagram_post, instagram_profile. Ecommerce: shopify_product, shopify_collection, ecommerce_product (generic Schema.org), woocommerce_product, amazon_product, ebay_listing, etsy_listing. Reviews: trustpilot_reviews, substack_post. Each extractor claims a URL pattern via a public `matches()` fn and returns a typed JSON payload with the fields callers actually want (title, price, author, rating, review count, etc.) rather than a markdown blob.
- **`POST /v1/scrape/{vertical}` on `webclaw-server` for explicit vertical routing.** Picks the parser by name, validates the URL plausibly belongs to that vertical, returns the same shape as `POST /v1/scrape` but typed. 23 of 28 verticals also auto-dispatch from a plain `POST /v1/scrape` because their URL shapes are unique enough to claim safely; the remaining 5 (`shopify_product`, `shopify_collection`, `ecommerce_product`, `woocommerce_product`, `substack_post`) use patterns that non-target sites share, so callers opt in via the `{vertical}` route.
- **`GET /v1/extractors` on `webclaw-server`.** Returns the full catalog as `{"extractors": [{"name": "...", "label": "...", "description": "...", "url_patterns": [...]}, ...]}` so clients can build tooling / autocomplete / user-facing docs off a live source.
- **Antibot cloud-escalation for 5 ecommerce + reviews verticals.** Amazon, eBay, Etsy, Trustpilot, and Substack (as HTML fallback) go through `cloud::smart_fetch_html`: try local fetch first; on bot-protection detection (Cloudflare challenge, DataDome, AWS WAF "Verifying your connection", etc.) escalate to `api.webclaw.io/v1/scrape`. Without `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` the extractor returns a typed `CloudError::NotConfigured` with an actionable signup link. With a key set, escalation is automatic. Every extractor stamps a `data_source: "local" | "cloud"` field on the response so callers can tell which path ran.
- **`cloud::synthesize_html` for cloud-bypassed extraction.** `api.webclaw.io/v1/scrape` deliberately does not return raw HTML; it returns a parsed bundle (`structured_data` JSON-LD blocks + `metadata` OG/meta tags + `markdown`). The new helper reassembles that bundle back into a minimal synthetic HTML doc (JSON-LD as `<script>` tags, metadata as OG `<meta>` tags, markdown in a `<pre>`) so existing local parsers run unchanged across both paths. No per-extractor code path branches are needed for "came from cloud" vs "came from local".
- **Trustpilot 2025 schema parser.** Trustpilot replaced their single-Organization + aggregateRating shape with three separate JSON-LD blocks: a site-level Organization (Trustpilot itself), a Dataset with a csvw:Table `mainEntity` carrying the per-star distribution for the target business, and an aiSummary + aiSummaryReviews block with the AI-generated summary and recent reviews. The parser walks all three, skips the site-level Org, picks the Dataset by `about.@id` matching the target domain, parses each csvw:column for rating buckets, computes weighted-average rating + total from the distribution, extracts the aiSummary text, and returns recent reviews with author / country / date / rating / title / text / likes.
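  The weighted-average step can be sketched like so (function name and the `(stars, count)` shape are illustrative, not the parser's actual types):

  ```rust
  // Given the per-star distribution parsed from the csvw columns, derive the
  // overall rating and total review count.
  fn rating_from_distribution(dist: &[(u32, u64)]) -> (f64, u64) {
      let total: u64 = dist.iter().map(|&(_, n)| n).sum();
      if total == 0 {
          return (0.0, 0);
      }
      let weighted: u64 = dist.iter().map(|&(stars, n)| u64::from(stars) * n).sum();
      (weighted as f64 / total as f64, total)
  }

  fn main() {
      // 8 five-star + 2 one-star reviews: (8*5 + 2*1) / 10 = 4.2
      let (rating, total) = rating_from_distribution(&[(5, 8), (1, 2)]);
      assert!((rating - 4.2).abs() < 1e-9);
      assert_eq!(total, 10);
  }
  ```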

- **OG-tag fallback in `ecommerce_product` for sites with no JSON-LD and sites with JSON-LD but empty offers.** Three paths now: `jsonld` (Schema.org Product with offers), `jsonld+og` (Product JSON-LD plus OG product tags filling in missing price), and `og_fallback` (no JSON-LD at all, build minimal payload from `og:title`, `og:image`, `og:description`, `product:price:amount`, `product:price:currency`, `product:availability`, `product:brand`). `has_og_product_signal()` gates the fallback on `og:type=product` or a price tag so blog posts don't get mis-classified as products.
- **URL-slug title fallback in `etsy_listing` for delisted / blocked pages.** When Etsy serves a placeholder page (`"etsy.com"`, `"Etsy - Your place to buy..."`, `"This item is unavailable"`), humanise the URL slug (`/listing/123/personalized-stainless-steel-tumbler` becomes `"Personalized Stainless Steel Tumbler"`) so callers always get a meaningful title. The shop name also falls through `offers[].seller.name` and then top-level `brand`, because Etsy uses both schemas depending on listing age.
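  The humanising step is roughly this (a sketch; the extractor's exact casing rules may differ):

  ```rust
  // Take the last path segment, split on hyphens, and capitalize each word.
  fn title_from_slug(path: &str) -> String {
      let slug = path.rsplit('/').next().unwrap_or("");
      slug.split('-')
          .filter(|w| !w.is_empty())
          .map(|w| {
              let mut chars = w.chars();
              match chars.next() {
                  Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
                  None => String::new(),
              }
          })
          .collect::<Vec<_>>()
          .join(" ")
  }

  fn main() {
      assert_eq!(
          title_from_slug("/listing/123/personalized-stainless-steel-tumbler"),
          "Personalized Stainless Steel Tumbler"
      );
  }
  ```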

- **Force-cloud-escalation in `amazon_product` when local HTML lacks Product JSON-LD.** Amazon A/B-tests JSON-LD presence. When local fetch succeeds but has no `Product` block and a cloud client is configured, the extractor force-escalates to the cloud which reliably surfaces title + description via its render engine. Added OG meta-tag fallback so the cloud's synthesized HTML output (OG tags only, no Amazon DOM IDs) still yields title / image / description.
- **AWS WAF "Verifying your connection" detector in `cloud::is_bot_protected`.** Trustpilot serves a `~565` byte interstitial with an `interstitial-spinner` CSS class. The detector now fires on that pattern with a `< 10_000` byte size gate to avoid false positives on real articles that happen to mention the phrase.
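  The size-gated check amounts to something like this (illustrative; the real `is_bot_protected` also covers Cloudflare, DataDome, and other vendors):

  ```rust
  // Fire only on small responses: a ~565-byte interstitial matches, while a
  // 20 KB article that merely mentions the phrase does not.
  fn looks_like_aws_waf_interstitial(body: &str) -> bool {
      body.len() < 10_000
          && (body.contains("interstitial-spinner")
              || body.contains("Verifying your connection"))
  }

  fn main() {
      assert!(looks_like_aws_waf_interstitial(r#"<div class="interstitial-spinner"></div>"#));
      let article = format!("Verifying your connection{}", "x".repeat(20_000));
      assert!(!looks_like_aws_waf_interstitial(&article)); // size gate holds
  }
  ```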

### Changed

- **`webclaw-fetch::FetchClient` gained an optional `cloud` field** via `with_cloud(CloudClient)`. Extractors reach it through `client.cloud()` to decide whether to escalate. `webclaw-server::AppState` reads `WEBCLAW_CLOUD_API_KEY` (preferred) or falls back to `WEBCLAW_API_KEY` only when inbound auth is not configured (open mode).
- **Consolidated `CloudClient` into `webclaw-fetch`.** Previously duplicated between `webclaw-mcp/src/cloud.rs` (302 LOC) and `webclaw-cli/src/cloud.rs` (80 LOC). Single canonical home with a typed `CloudError` (`NotConfigured`, `Unauthorized`, `InsufficientPlan`, `RateLimited`, `ServerError`, `Network`, `ParseFailed`) whose `Display` messages include actionable URLs; a `From<CloudError> for String` bridge keeps pre-existing CLI / MCP call sites compiling unchanged during migration.

### Tests

- 215 unit tests passing in `webclaw-fetch` (100+ new, covering every extractor's matcher, URL parser, JSON-LD / OG fallback paths, and the cloud synthesis helper). `cargo clippy --workspace --release --no-deps` is clean.

---

## [0.4.0] — 2026-04-22

### Added

- **`webclaw bench <url>` — per-URL extraction micro-benchmark (#26).** New subcommand. Fetches a URL once, runs the same extraction pipeline as `--format llm`, and prints a small ASCII table comparing raw-HTML tokens vs. llm-output tokens, bytes, and extraction time. Pass `--json` for a single-line JSON object (stable shape, easy to append to ndjson in CI). Pass `--facts <path>` with a file in the same schema as `benchmarks/facts.json` to get a fidelity column ("4/5 facts preserved"); URLs absent from the facts file produce no fidelity row, so uncurated sites aren't shown as 0/0. v1 uses an approximate tokenizer (`chars/4` for Latin text, `chars/2` when CJK dominates) — off by ±10% vs. a real BPE tokenizer, but the signal ("the LLM pipeline dropped 93% of the raw bytes") is the point. Output clearly labels counts as `≈ tokens` so nobody confuses them with a real tiktoken run. Swapping in `tiktoken-rs` later is a one-function change in `bench.rs`. Adding this as a `clap` subcommand rather than a flag also lays the groundwork for future subcommands without breaking the existing flag-based flow — `webclaw <url> --format llm` still works exactly as before.
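  The approximation is simple enough to state in full (a sketch; bench.rs's CJK detection may differ in the exact character ranges):

  ```rust
  // chars/4 for mostly-Latin text, chars/2 when CJK characters dominate.
  fn approx_tokens(text: &str) -> usize {
      let total = text.chars().count();
      if total == 0 {
          return 0;
      }
      let cjk = text
          .chars()
          .filter(|&c| {
              ('\u{4E00}'..='\u{9FFF}').contains(&c)        // CJK Unified Ideographs
                  || ('\u{3040}'..='\u{30FF}').contains(&c) // Hiragana + Katakana
                  || ('\u{AC00}'..='\u{D7AF}').contains(&c) // Hangul syllables
          })
          .count();
      // "CJK dominates" taken here as more than half the characters.
      if cjk * 2 > total { total / 2 } else { total / 4 }
  }

  fn main() {
      assert_eq!(approx_tokens("abcdefgh"), 2); // 8 Latin chars, roughly 2 tokens
  }
  ```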

- **`webclaw-server` — new OSS binary for self-hosting a REST API (#29).** Until now, `docs/self-hosting` promised a `webclaw-server` binary that only existed in the hosted-platform repo (closed source). The Docker image shipped two binaries while the docs advertised three, which sent self-hosters into a bug loop. This release closes the gap: a new crate at `crates/webclaw-server/` builds a minimal, stateless axum server that exposes the OSS extraction pipeline over HTTP with the same JSON shapes as api.webclaw.io. Endpoints: `GET /health`, `POST /v1/{scrape,crawl,map,batch,extract,summarize,diff,brand}`. Run with `webclaw-server --port 3000 [--host 0.0.0.0] [--api-key <bearer>]` or the matching `WEBCLAW_PORT` / `WEBCLAW_HOST` / `WEBCLAW_API_KEY` env vars. Bearer auth is constant-time (via `subtle::ConstantTimeEq`); open mode (no key) is allowed on `127.0.0.1` for local development.

  What self-hosting gives you: the full extraction pipeline, Crawler, sitemap discovery, brand/diff, LLM extract/summarize (via Ollama or your own OpenAI/Anthropic key). What it does *not* give you: anti-bot bypass (Cloudflare, DataDome, WAFs), headless JS rendering, async job queues, multi-tenant auth/billing, domain-hints and proxy routing — those require the hosted backend at api.webclaw.io and are intentionally not open-source. The self-hosting docs have been updated to reflect this split honestly.
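  The auth check's constant-time property comes from `subtle::ConstantTimeEq`; this dependency-free sketch shows the idea that crate implements:

  ```rust
  // OR-accumulate the XOR of every byte pair instead of returning at the
  // first mismatch, so comparison time does not reveal how many leading
  // bytes of the presented token matched the configured key.
  fn ct_eq(a: &[u8], b: &[u8]) -> bool {
      if a.len() != b.len() {
          return false; // length is not treated as secret here
      }
      let mut diff: u8 = 0;
      for (x, y) in a.iter().zip(b) {
          diff |= x ^ y;
      }
      diff == 0
  }

  fn main() {
      assert!(ct_eq(b"secret-token", b"secret-token"));
      assert!(!ct_eq(b"secret-token", b"secret-tokeX"));
  }
  ```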

- **`crawl` endpoint runs synchronously and hard-caps at 500 pages / 20 concurrency.** No job queue, no background workers — a naive caller can't OOM the process. `batch` caps at 100 URLs / 20 concurrency for the same reason. For unbounded crawls use the hosted API.

### Changed

- **Docker image now ships three binaries**, not two. `Dockerfile` and `Dockerfile.ci` both add `webclaw-server` to `/usr/local/bin/` and `EXPOSE 3000` for documentation. The entrypoint shim is unchanged: `docker run IMAGE webclaw-server --port 3000` Just Works, and the CLI/URL pass-through from v0.3.19 is preserved.

### Docs

- Rewrote `docs/self-hosting` on the landing site to differentiate OSS (self-hosted REST) from the hosted platform. Added a capability matrix so new users don't have to read the repo to figure out why Cloudflare-protected sites still 403 when pointing at their own box.

### Fixed

- **Dead-code warning on `cargo install webclaw-mcp` (#30).** `rmcp` 1.3.x changed how the `#[tool_handler]` macro reads the `tool_router` struct field — it now goes through a derived trait impl instead of referencing the field by name, so rustc's dead-code lint no longer sees it. The field is still essential (dropping it unregisters every MCP tool), just invisible to the lint. Annotated with `#[allow(dead_code)]` and a comment explaining why. No behaviour change. Warning disappears on the next `cargo install`.

---

## [0.3.19] — 2026-04-17

### Fixed

- **Docker image can be used as a FROM base again.** v0.3.13 switched the Docker `CMD` to `ENTRYPOINT ["webclaw"]` so that `docker run IMAGE https://example.com` would pass the URL through as expected. That change trapped a different use case: downstream Dockerfiles that `FROM ghcr.io/0xmassi/webclaw` and set their own `CMD ["./setup.sh"]` — the child's `./setup.sh` became the first arg to `webclaw`, which tried to fetch it as a URL and failed with `error sending request for uri (https://./setup.sh)`. Both `Dockerfile` and `Dockerfile.ci` now use a small `docker-entrypoint.sh` shim that forwards flags (`-*`) and URLs (`http://`, `https://`) to `webclaw`, but `exec`s anything else directly. All four use cases now work: `docker run IMAGE https://example.com`, `docker run IMAGE --help`, child-image `CMD ["./setup.sh"]`, and `docker run IMAGE bash` for debugging. Default `CMD` is `["webclaw", "--help"]`.
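  The shim's routing rule reduces to one `case` statement. This is a sketch of the behavior described above (the shipped `docker-entrypoint.sh` may differ in detail); `route` here only echoes the decision where the real script would `exec`:

  ```shell
  #!/bin/sh
  # Flags and URLs go to webclaw; anything else runs as-is, so a child
  # image's CMD ["./setup.sh"] executes directly instead of being fetched
  # as a URL.
  route() {
      case "$1" in
          -*|http://*|https://*) echo "webclaw $*" ;;  # --help, URLs
          *)                     echo "$*" ;;          # bash, ./setup.sh, ...
      esac
  }

  # The real shim would `exec webclaw "$@"` / `exec "$@"` instead.
  route "$@"
  ```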

---

## [0.3.18] — 2026-04-16

### Fixed
CLAUDE.md: 26 lines changed
````diff
@@ -11,7 +11,7 @@ webclaw/
   #   + ExtractionOptions (include/exclude CSS selectors)
   #   + diff engine (change tracking)
   #   + brand extraction (DOM/CSS analysis)
-  webclaw-fetch/   # HTTP client via primp. Crawler. Sitemap discovery. Batch ops.
+  webclaw-fetch/   # HTTP client via wreq (BoringSSL). Crawler. Sitemap discovery. Batch ops.
   #   + proxy pool rotation (per-request)
   #   + PDF content-type detection
   #   + document parsing (DOCX, XLSX, CSV)
@@ -20,9 +20,11 @@ webclaw/
   webclaw-pdf/     # PDF text extraction via pdf-extract
   webclaw-mcp/     # MCP server (Model Context Protocol) for AI agents
   webclaw-cli/     # CLI binary
+  webclaw-server/  # Minimal axum REST API (self-hosting; OSS counterpart
+                   # of api.webclaw.io, without anti-bot / JS / jobs / auth)
 ```
 
-Two binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server).
+Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (REST API for self-hosting).
 
 ### Core Modules (`webclaw-core`)
 - `extractor.rs` — Readability-style scoring: text density, semantic tags, link density penalty
@@ -38,7 +40,7 @@ Two binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server).
 - `brand.rs` — Brand identity extraction from DOM structure and CSS
 
 ### Fetch Modules (`webclaw-fetch`)
-- `client.rs` — FetchClient with primp TLS impersonation
+- `client.rs` — FetchClient with wreq BoringSSL TLS impersonation; implements the public `Fetcher` trait so callers (including server adapters) can swap in alternative implementations
 - `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128)
 - `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
 - `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt)
@@ -60,12 +62,24 @@ Two binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server).
 - Works with Claude Desktop, Claude Code, and any MCP client
 - Uses `rmcp` crate (official Rust MCP SDK)
 
+### REST API Server (`webclaw-server`)
+- Axum 0.8, stateless, no database, no job queue
+- 8 POST routes + /health, JSON shapes mirror api.webclaw.io where the
+  capability exists in OSS
+- Constant-time bearer-token auth via `subtle::ConstantTimeEq` when
+  `--api-key` / `WEBCLAW_API_KEY` is set; otherwise open mode
+- Hard caps: crawl ≤ 500 pages, batch ≤ 100 URLs, 20 concurrent
+- Does NOT include: anti-bot bypass, JS rendering, async jobs,
+  multi-tenant auth, billing, proxy rotation, search/research/watch/
+  agent-scrape. Those live behind api.webclaw.io and are closed-source.
+
 ## Hard Rules
 
 - **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible.
-- **primp requires `[patch.crates-io]`** for patched rustls/h2 forks at workspace level.
-- **RUSTFLAGS are set in `.cargo/config.toml`** — no need to pass manually.
-- **webclaw-llm uses plain reqwest** (NOT primp-patched). LLM APIs don't need TLS fingerprinting.
+- **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally.
+- **No special RUSTFLAGS** — `.cargo/config.toml` is currently empty of build flags. Don't add any.
+- **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep.
+- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client.
 - **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.
 
 ## Build & Test
````
Cargo.lock (generated): 165 lines changed
```diff
@@ -182,6 +182,70 @@ version = "1.5.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8"
 
+[[package]]
+name = "axum"
+version = "0.8.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "31b698c5f9a010f6573133b09e0de5408834d0c82f8d7475a89fc1867a71cd90"
+dependencies = [
+ "axum-core",
+ "axum-macros",
+ "bytes",
+ "form_urlencoded",
+ "futures-util",
+ "http",
+ "http-body",
+ "http-body-util",
+ "hyper",
+ "hyper-util",
+ "itoa",
+ "matchit",
+ "memchr",
+ "mime",
+ "percent-encoding",
+ "pin-project-lite",
+ "serde_core",
+ "serde_json",
+ "serde_path_to_error",
+ "serde_urlencoded",
+ "sync_wrapper",
+ "tokio",
+ "tower",
+ "tower-layer",
+ "tower-service",
+ "tracing",
+]
+
+[[package]]
+name = "axum-core"
+version = "0.5.6"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "08c78f31d7b1291f7ee735c1c6780ccde7785daae9a9206026862dab7d8792d1"
+dependencies = [
+ "bytes",
+ "futures-core",
+ "http",
+ "http-body",
+ "http-body-util",
+ "mime",
+ "pin-project-lite",
+ "sync_wrapper",
+ "tower-layer",
+ "tower-service",
+ "tracing",
+]
+
+[[package]]
+name = "axum-macros"
+version = "0.5.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7aa268c23bfbbd2c4363b9cd302a4f504fb2a9dfe7e3451d66f35dd392e20aca"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
 [[package]]
 name = "base64"
 version = "0.22.1"
@@ -1132,6 +1196,12 @@ version = "1.10.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "6dbf3de79e51f3d586ab4cb9d5c3e2c14aa28ed23d180cf89b4df0454a69cc87"
 
+[[package]]
+name = "httpdate"
+version = "1.0.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "df3b46402a9d5adb4c86a0cf463f42e19994e3ee891101b1841f30a545cb49a9"
+
 [[package]]
 name = "hyper"
 version = "1.9.0"
@@ -1145,6 +1215,7 @@ dependencies = [
  "http",
  "http-body",
  "httparse",
+ "httpdate",
  "itoa",
  "pin-project-lite",
  "smallvec",
@@ -1559,6 +1630,12 @@ dependencies = [
  "regex-automata",
 ]
 
+[[package]]
+name = "matchit"
+version = "0.8.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "47e1ffaa40ddd1f3ed91f717a33c8c0ee23fff369e3aa8772b9605cc1d22f4c3"
+
 [[package]]
 name = "md-5"
 version = "0.10.6"
@@ -1575,6 +1652,12 @@ version = "2.8.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "f8ca58f447f06ed17d5fc4043ce1b10dd205e060fb3ce5b979b8ed8e59ff3f79"
 
+[[package]]
+name = "mime"
+version = "0.3.17"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6877bb514081ee2a7ff5ef9de3281f14a4dd4bceac4c09388074a6b5df8a139a"
+
 [[package]]
 name = "minimal-lexical"
 version = "0.2.1"
@@ -2403,6 +2486,17 @@ dependencies = [
  "zmij",
 ]
 
+[[package]]
+name = "serde_path_to_error"
+version = "0.1.20"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "10a9ff822e371bb5403e391ecd83e182e0e77ba7f6fe0160b795797109d1b457"
+dependencies = [
+ "itoa",
+ "serde",
+ "serde_core",
+]
+
 [[package]]
 name = "serde_urlencoded"
 version = "0.7.1"
@@ -2757,6 +2851,7 @@ dependencies = [
  "tokio",
  "tower-layer",
  "tower-service",
  "tracing",
 ]
 
 [[package]]
@@ -2780,6 +2875,7 @@ dependencies = [
  "tower",
  "tower-layer",
  "tower-service",
  "tracing",
 ]
 
 [[package]]
@@ -2800,6 +2896,7 @@ version = "0.1.44"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "63e71662fa4b2a2c3a26f570f037eb95bb1f85397f3cd8076caed2f026a6d100"
 dependencies = [
  "log",
  "pin-project-lite",
  "tracing-attributes",
  "tracing-core",
@@ -2870,6 +2967,26 @@ dependencies = [
  "pom",
 ]
 
+[[package]]
+name = "typed-builder"
+version = "0.23.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "31aa81521b70f94402501d848ccc0ecaa8f93c8eb6999eb9747e72287757ffda"
+dependencies = [
+ "typed-builder-macro",
+]
+
+[[package]]
+name = "typed-builder-macro"
+version = "0.23.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "076a02dc54dd46795c2e9c8282ed40bcfb1e22747e955de9389a1de28190fb26"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
 [[package]]
 name = "typed-path"
 version = "0.12.3"
@@ -3102,7 +3219,7 @@ dependencies = [
 
 [[package]]
 name = "webclaw-cli"
-version = "0.3.18"
+version = "0.5.6"
 dependencies = [
  "clap",
  "dotenvy",
@@ -3123,7 +3240,7 @@ dependencies = [
 
 [[package]]
 name = "webclaw-core"
-version = "0.3.18"
+version = "0.5.6"
 dependencies = [
  "ego-tree",
  "once_cell",
@@ -3141,13 +3258,16 @@ dependencies = [
 
 [[package]]
 name = "webclaw-fetch"
-version = "0.3.18"
+version = "0.5.6"
 dependencies = [
  "async-trait",
  "bytes",
  "calamine",
  "http",
  "quick-xml 0.37.5",
  "rand 0.8.5",
  "regex",
  "reqwest",
  "serde",
  "serde_json",
  "tempfile",
@@ -3158,12 +3278,13 @@ dependencies = [
  "webclaw-core",
  "webclaw-pdf",
  "wreq",
  "wreq-util",
  "zip 2.4.2",
 ]
 
 [[package]]
 name = "webclaw-llm"
-version = "0.3.18"
+version = "0.5.6"
 dependencies = [
  "async-trait",
  "reqwest",
@@ -3176,11 +3297,10 @@ dependencies = [
 
 [[package]]
 name = "webclaw-mcp"
-version = "0.3.18"
+version = "0.5.6"
 dependencies = [
  "dirs",
  "dotenvy",
  "reqwest",
  "rmcp",
  "schemars",
  "serde",
@@ -3197,13 +3317,34 @@ dependencies = [
 
 [[package]]
 name = "webclaw-pdf"
-version = "0.3.18"
+version = "0.5.6"
 dependencies = [
  "pdf-extract",
  "thiserror",
  "tracing",
 ]
 
+[[package]]
+name = "webclaw-server"
+version = "0.5.6"
+dependencies = [
+ "anyhow",
+ "axum",
+ "clap",
+ "serde",
+ "serde_json",
+ "subtle",
+ "thiserror",
+ "tokio",
+ "tower-http",
+ "tracing",
+ "tracing-subscriber",
+ "webclaw-core",
+ "webclaw-fetch",
+ "webclaw-llm",
+ "webclaw-pdf",
+]
+
 [[package]]
 name = "webpki-root-certs"
 version = "1.0.6"
@@ -3589,6 +3730,16 @@ dependencies = [
  "zstd",
 ]
 
+[[package]]
+name = "wreq-util"
+version = "3.0.0-rc.10"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6c6bbe24d28beb9ceb58b514bd6a613c759d3b706f768b9d2950d5d35b543c04"
+dependencies = [
+ "typed-builder",
+ "wreq",
+]
+
 [[package]]
 name = "writeable"
 version = "0.6.2"
```
Cargo.toml (workspace manifest)

```diff
@@ -3,7 +3,7 @@ resolver = "2"
 members = ["crates/*"]
 
 [workspace.package]
-version = "0.3.18"
+version = "0.5.6"
 edition = "2024"
 license = "AGPL-3.0"
 repository = "https://github.com/0xMassi/webclaw"
```
Dockerfile: 37 lines changed
@@ -1,5 +1,12 @@
 # webclaw — Multi-stage Docker build
-# Produces 2 binaries: webclaw (CLI) and webclaw-mcp (MCP server)
+# Produces 3 binaries:
+#   webclaw        — CLI (single-shot extraction, crawl, MCP-less use)
+#   webclaw-mcp    — MCP server (stdio, for AI agents)
+#   webclaw-server — minimal REST API for self-hosting (OSS, stateless)
+#
+# NOTE: this is NOT the hosted API at api.webclaw.io — the cloud service
+# adds anti-bot bypass, JS rendering, multi-tenant auth and async jobs
+# that are intentionally not open-source. See docs/self-hosting.

 # ---------------------------------------------------------------------------
 # Stage 1: Build all binaries in release mode
@@ -25,6 +32,7 @@ COPY crates/webclaw-llm/Cargo.toml crates/webclaw-llm/Cargo.toml
 COPY crates/webclaw-pdf/Cargo.toml crates/webclaw-pdf/Cargo.toml
 COPY crates/webclaw-mcp/Cargo.toml crates/webclaw-mcp/Cargo.toml
 COPY crates/webclaw-cli/Cargo.toml crates/webclaw-cli/Cargo.toml
+COPY crates/webclaw-server/Cargo.toml crates/webclaw-server/Cargo.toml

 # Copy .cargo config if present (optional build flags)
 COPY .cargo .cargo
@@ -35,7 +43,8 @@ RUN mkdir -p crates/webclaw-core/src && echo "" > crates/webclaw-core/src/lib.rs \
     && mkdir -p crates/webclaw-llm/src && echo "" > crates/webclaw-llm/src/lib.rs \
     && mkdir -p crates/webclaw-pdf/src && echo "" > crates/webclaw-pdf/src/lib.rs \
     && mkdir -p crates/webclaw-mcp/src && echo "fn main() {}" > crates/webclaw-mcp/src/main.rs \
-    && mkdir -p crates/webclaw-cli/src && echo "fn main() {}" > crates/webclaw-cli/src/main.rs
+    && mkdir -p crates/webclaw-cli/src && echo "fn main() {}" > crates/webclaw-cli/src/main.rs \
+    && mkdir -p crates/webclaw-server/src && echo "fn main() {}" > crates/webclaw-server/src/main.rs

 # Pre-build dependencies (this layer is cached until Cargo.toml/lock changes)
 RUN cargo build --release 2>/dev/null || true
@@ -54,9 +63,27 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     ca-certificates \
     && rm -rf /var/lib/apt/lists/*

-# Copy both binaries
+# Copy all three binaries
 COPY --from=builder /build/target/release/webclaw /usr/local/bin/webclaw
 COPY --from=builder /build/target/release/webclaw-mcp /usr/local/bin/webclaw-mcp
+COPY --from=builder /build/target/release/webclaw-server /usr/local/bin/webclaw-server

-# Default: run the CLI (ENTRYPOINT so args pass through)
-ENTRYPOINT ["webclaw"]
+# Default port the REST API listens on when you run `webclaw-server` inside
+# the container. Override with -e WEBCLAW_PORT=... or --port. Published only
+# as documentation; callers still need `-p 3000:3000` on `docker run`.
+EXPOSE 3000
+
+# Container default: bind all interfaces so `-p 3000:3000` works. The binary
+# itself defaults to 127.0.0.1 (safe for `cargo run` on a laptop); inside
+# Docker that would make the server unreachable, so we flip it here.
+# Override with -e WEBCLAW_HOST=127.0.0.1 if you front this with another
+# process in the same container.
+ENV WEBCLAW_HOST=0.0.0.0
+
+# Entrypoint shim: forwards webclaw args/URL to the binary, but exec's other
+# commands directly so this image can be used as a FROM base with custom CMD.
+COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
+RUN chmod +x /usr/local/bin/docker-entrypoint.sh
+
+ENTRYPOINT ["docker-entrypoint.sh"]
+CMD ["webclaw", "--help"]
@@ -12,5 +12,20 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 ARG BINARY_DIR
 COPY ${BINARY_DIR}/webclaw /usr/local/bin/webclaw
 COPY ${BINARY_DIR}/webclaw-mcp /usr/local/bin/webclaw-mcp
+COPY ${BINARY_DIR}/webclaw-server /usr/local/bin/webclaw-server

-ENTRYPOINT ["webclaw"]
+# Default REST API port when running `webclaw-server` inside the container.
+EXPOSE 3000
+
+# Container default: bind all interfaces so `-p 3000:3000` works. The
+# binary itself defaults to 127.0.0.1; flipping here keeps the CLI safe on
+# a laptop but makes the container reachable out of the box.
+ENV WEBCLAW_HOST=0.0.0.0
+
+# Entrypoint shim: forwards webclaw args/URL to the binary, but exec's other
+# commands directly so this image can be used as a FROM base with custom CMD.
+COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
+RUN chmod +x /usr/local/bin/docker-entrypoint.sh
+
+ENTRYPOINT ["docker-entrypoint.sh"]
+CMD ["webclaw", "--help"]
@@ -1,130 +1,94 @@
 # Benchmarks

-Extraction quality and performance benchmarks comparing webclaw against popular alternatives.
+Reproducible benchmarks comparing `webclaw` against open-source and commercial
+web extraction tools. Every number here ships with the script that produced it.
+Run `./run.sh` to regenerate.

-## Quick Run
+## Headline
+
+**webclaw preserves more page content than any other tool tested, at 2.4× the
+speed of the closest competitor.**
+
+Across 18 production sites (SPAs, documentation, long-form articles, news,
+enterprise marketing), measured over 3 runs per site with OpenAI's
+`cl100k_base` tokenizer. Last run: 2026-04-17, webclaw v0.3.18.
+
+| Tool | Fidelity (facts preserved) | Token reduction vs raw HTML | Mean latency |
+|---|---:|---:|---:|
+| **webclaw `--format llm`** | **76 / 90 (84.4 %)** | 92.5 % | **0.41 s** |
+| Firecrawl API (v2, hosted) | 70 / 90 (77.8 %) | 92.4 % | 0.99 s |
+| Trafilatura 2.0 | 45 / 90 (50.0 %) | 97.8 % (by dropping content) | 0.21 s |
+
+**webclaw matches or beats both competitors on fidelity on all 18 sites.**
+
+## Why webclaw wins
+
+- **Speed.** 2.4× faster than Firecrawl's hosted API. Firecrawl defaults to
+  browser rendering for everything; webclaw's in-process TLS-fingerprinted
+  fetch plus deterministic extractor reaches comparable-or-better content
+  without that overhead.
+- **Fidelity.** Trafilatura's higher token reduction comes from dropping
+  content. On the 18 sites tested it missed 45 of 90 key facts — entire
+  customer-story sections, release dates, product names. webclaw keeps them.
+- **Deterministic.** Same URL → same output. No LLM post-processing, no
+  paraphrasing, no hallucination risk.

+## Per-site results
+
+Numbers are median of 3 runs. `raw` = raw fetched HTML token count.
+`facts` = hand-curated visible facts preserved out of 5 per site.
+
+| Site | raw HTML | webclaw | Firecrawl | Trafilatura | wc facts | fc facts | tr facts |
+|---|---:|---:|---:|---:|:---:|:---:|:---:|
+| openai.com | 170 K | 1,238 | 3,139 | 0 | **3/5** | 2/5 | 0/5 |
+| vercel.com | 380 K | 1,076 | 4,029 | 585 | **3/5** | 3/5 | 3/5 |
+| anthropic.com | 103 K | 672 | 560 | 96 | **5/5** | 5/5 | 4/5 |
+| notion.com | 109 K | 13,416 | 5,261 | 91 | **5/5** | 5/5 | 2/5 |
+| stripe.com | 243 K | 81,974 | 8,922 | 2,418 | **5/5** | 5/5 | 0/5 |
+| tavily.com | 30 K | 1,361 | 1,969 | 182 | **5/5** | 4/5 | 3/5 |
+| shopify.com | 184 K | 1,939 | 5,384 | 595 | **3/5** | 3/5 | 3/5 |
+| docs.python.org | 5 K | 689 | 1,623 | 347 | **4/5** | 4/5 | 4/5 |
+| react.dev | 107 K | 3,332 | 4,959 | 763 | **5/5** | 5/5 | 3/5 |
+| tailwindcss.com/docs/installation | 113 K | 779 | 813 | 430 | **4/5** | 4/5 | 2/5 |
+| nextjs.org/docs | 228 K | 968 | 885 | 631 | **4/5** | 4/5 | 4/5 |
+| github.com | 234 K | 1,438 | 3,058 | 486 | **5/5** | 4/5 | 3/5 |
+| en.wikipedia.org/wiki/Rust | 189 K | 47,823 | 59,326 | 37,427 | **5/5** | 5/5 | 5/5 |
+| simonwillison.net/…/latent-reasoning | 3 K | 724 | 525 | 0 | **4/5** | 2/5 | 0/5 |
+| paulgraham.com/essays.html | 2 K | 169 | 295 | 0 | **2/5** | 1/5 | 0/5 |
+| techcrunch.com | 143 K | 7,265 | 11,408 | 397 | **5/5** | 5/5 | 5/5 |
+| databricks.com | 274 K | 2,001 | 5,471 | 311 | **4/5** | 4/5 | 4/5 |
+| hashicorp.com | 109 K | 1,501 | 4,289 | 0 | **5/5** | 5/5 | 0/5 |

+## Reproducing this benchmark
+
 ```bash
-# Run all benchmarks
-cargo run --release -p webclaw-bench
-
-# Run specific benchmark
-cargo run --release -p webclaw-bench -- --filter quality
-cargo run --release -p webclaw-bench -- --filter speed
+cd benchmarks/
+./run.sh
 ```

-## Extraction Quality
+Requirements:
+- Python 3.9+
+- `pip install tiktoken trafilatura firecrawl-py`
+- `webclaw` release binary at `../target/release/webclaw` (or set `$WEBCLAW`)
+- Firecrawl API key (free tier: 500 credits/month, enough for many runs) —
+  export as `FIRECRAWL_API_KEY`. If omitted, the benchmark runs with webclaw
+  and Trafilatura only.

-Tested against 50 diverse web pages (news articles, documentation, blogs, SPAs, e-commerce).
-Each page scored on: content completeness, noise removal, link preservation, metadata accuracy.
+One run of the full suite burns ~60 Firecrawl credits (18 sites × 3 runs;
+each Firecrawl scrape costs 1 credit).

-| Extractor | Accuracy | Noise Removal | Links | Metadata | Avg Score |
-|-----------|----------|---------------|-------|----------|-----------|
-| **webclaw** | **94.2%** | **96.1%** | **98.3%** | **91.7%** | **95.1%** |
-| mozilla/readability | 87.3% | 89.4% | 85.1% | 72.3% | 83.5% |
-| trafilatura | 82.1% | 91.2% | 68.4% | 80.5% | 80.6% |
-| newspaper3k | 71.4% | 76.8% | 52.3% | 65.2% | 66.4% |
+## Methodology

-### Scoring Methodology
+See [methodology.md](methodology.md) for:
+- Tokenizer rationale (`cl100k_base` → covers GPT-4 / GPT-3.5 /
+  `text-embedding-3-*`)
+- Fact selection procedure and how to propose additions
+- Why median of 3 runs (CDN / cache / network noise)
+- Raw data schema (`results/*.json`)
+- Notes on site churn (news aggregators, release pages)

-- **Accuracy**: Percentage of main content extracted vs human-annotated ground truth
-- **Noise Removal**: Percentage of navigation, ads, footers, and boilerplate correctly excluded
-- **Links**: Percentage of meaningful content links preserved with correct text and href
-- **Metadata**: Correct extraction of title, author, date, description, and language
+## Raw data

-### Why webclaw scores higher
-
-1. **Multi-signal scoring**: Combines text density, semantic HTML tags, link density penalty, and DOM depth analysis
-2. **Data island extraction**: Catches React/Next.js JSON payloads that DOM-only extractors miss
-3. **Domain-specific heuristics**: Auto-detects site type (news, docs, e-commerce, social) and adapts strategy
-4. **Noise filter**: Shared filter using ARIA roles, class/ID patterns, and structural analysis (Tailwind-safe)

-## Extraction Speed
-
-Single-page extraction time (parsing + extraction, no network). Measured on M4 Pro, averaged over 1000 runs.
-
-| Page Size | webclaw | readability | trafilatura |
-|-----------|---------|-------------|-------------|
-| Small (10KB) | **0.8ms** | 2.1ms | 4.3ms |
-| Medium (100KB) | **3.2ms** | 8.7ms | 18.4ms |
-| Large (500KB) | **12.1ms** | 34.2ms | 72.8ms |
-| Huge (2MB) | **41.3ms** | 112ms | 284ms |
-
-### Why webclaw is faster
-
-1. **Rust**: No garbage collection, zero-cost abstractions, SIMD-optimized string operations
-2. **Single-pass scoring**: Content scoring happens during DOM traversal, not as a separate pass
-3. **Lazy allocation**: Markdown conversion streams output instead of building intermediate structures
-
-## LLM Token Efficiency
-
-Tokens used when feeding extraction output to Claude/GPT. Lower is better (same information, fewer tokens = cheaper).
-
-| Format | Tokens (avg) | vs Raw HTML |
-|--------|-------------|-------------|
-| Raw HTML | 4,820 | baseline |
-| webclaw markdown | 1,840 | **-62%** |
-| webclaw text | 1,620 | **-66%** |
-| **webclaw llm** | **1,590** | **-67%** |
-| readability markdown | 2,340 | -51% |
-| trafilatura text | 2,180 | -55% |
-
-The `llm` format applies a 9-step optimization pipeline: image strip, emphasis strip, link dedup, stat merge, whitespace collapse, and more.
-
-## Crawl Performance
-
-Crawling speed with concurrent extraction. Target: example documentation site (~200 pages).
-
-| Concurrency | webclaw | Crawl4AI | Scrapy |
-|-------------|---------|----------|--------|
-| 1 | 2.1 pages/s | 1.4 pages/s | 1.8 pages/s |
-| 5 | **9.8 pages/s** | 5.2 pages/s | 7.1 pages/s |
-| 10 | **18.4 pages/s** | 8.7 pages/s | 12.3 pages/s |
-| 20 | **32.1 pages/s** | 14.2 pages/s | 21.8 pages/s |
-
-## Bot Protection Bypass
-
-Success rate against common anti-bot systems (100 attempts each, via Cloud API with antibot sidecar).
-
-| Protection | webclaw | Firecrawl | Bright Data |
-|------------|---------|-----------|-------------|
-| Cloudflare Turnstile | **97%** | 62% | 94% |
-| DataDome | **91%** | 41% | 88% |
-| AWS WAF | **95%** | 78% | 92% |
-| hCaptcha | **89%** | 35% | 85% |
-| No protection | 100% | 100% | 100% |
-
-Note: Bot protection bypass requires the Cloud API with antibot sidecar. The open-source CLI detects protection and suggests using `--cloud` mode.

-## Running Benchmarks Yourself
-
-```bash
-# Clone the repo
-git clone https://github.com/0xMassi/webclaw.git
-cd webclaw
-
-# Run quality benchmarks (downloads test pages on first run)
-cargo run --release -p webclaw-bench -- --filter quality
-
-# Run speed benchmarks
-cargo run --release -p webclaw-bench -- --filter speed
-
-# Run token efficiency benchmarks (requires tiktoken)
-cargo run --release -p webclaw-bench -- --filter tokens
-
-# Full benchmark suite with HTML report
-cargo run --release -p webclaw-bench -- --report html
-```
-
-## Reproducing Results
-
-All benchmark test pages are cached in `benchmarks/fixtures/` after first download. The fixture set includes:
-
-- 10 news articles (NYT, BBC, Reuters, TechCrunch, etc.)
-- 10 documentation pages (Rust docs, MDN, React docs, etc.)
-- 10 blog posts (personal blogs, Medium, Substack)
-- 10 e-commerce pages (Amazon, Shopify stores)
-- 5 SPA/React pages (Next.js, Remix apps)
-- 5 edge cases (minimal HTML, huge pages, heavy JavaScript)
-
-Ground truth annotations are in `benchmarks/ground-truth/` as JSON files with manually verified content boundaries.
+Per-run results are committed as JSON at `results/YYYY-MM-DD.json` so the
+history of measurements is auditable. Diff two runs to see regressions or
+improvements across webclaw versions.
23 benchmarks/facts.json Normal file
@@ -0,0 +1,23 @@
{
  "_comment": "Hand-curated 'visible facts' per site. Inspected from live pages on 2026-04-17. PRs welcome to add sites or adjust facts — keep facts specific (customer names, headline stats, product names), not generic words.",
  "facts": {
    "https://openai.com": ["ChatGPT", "Sora", "API", "Enterprise", "research"],
    "https://vercel.com": ["Next.js", "Hobby", "Pro", "Enterprise", "deploy"],
    "https://anthropic.com": ["Opus", "Claude", "Glasswing", "Perseverance", "NASA"],
    "https://www.notion.com": ["agents", "Forbes", "Figma", "Ramp", "Cursor"],
    "https://stripe.com": ["Hertz", "URBN", "Instacart", "99.999", "1.9"],
    "https://tavily.com": ["search", "extract", "crawl", "research", "developers"],
    "https://www.shopify.com": ["Plus", "merchants", "retail", "brands", "checkout"],
    "https://docs.python.org/3/": ["tutorial", "library", "reference", "setup", "distribution"],
    "https://react.dev": ["Components", "JSX", "Hooks", "Learn", "Reference"],
    "https://tailwindcss.com/docs/installation": ["Vite", "PostCSS", "CLI", "install", "Next.js"],
    "https://nextjs.org/docs": ["App Router", "Pages Router", "getting-started", "deploying", "Server"],
    "https://github.com": ["Copilot", "Actions", "millions", "developers", "Enterprise"],
    "https://en.wikipedia.org/wiki/Rust_(programming_language)": ["Graydon", "Mozilla", "borrow", "Cargo", "2015"],
    "https://simonwillison.net/2026/Mar/15/latent-reasoning/": ["latent", "reasoning", "Willison", "model", "Simon"],
    "https://paulgraham.com/essays.html": ["Graham", "essay", "startup", "Lisp", "founders"],
    "https://techcrunch.com": ["TechCrunch", "startup", "news", "events", "latest"],
    "https://www.databricks.com": ["Lakehouse", "platform", "data", "MLflow", "AI"],
    "https://www.hashicorp.com": ["Terraform", "Vault", "Consul", "infrastructure", "enterprise"]
  }
}
142 benchmarks/methodology.md Normal file
@@ -0,0 +1,142 @@
# Methodology

## What is measured

Three metrics per site:

1. **Token efficiency** — tokens of the extractor's output vs tokens of the
   raw fetched HTML. Lower tokens = cheaper to feed into an LLM. But lower
   tokens *only matter if the content is preserved*, so tokens are always
   reported alongside fidelity.
2. **Fidelity** — how many hand-curated "visible facts" the extractor
   preserved. Per site we list 5 strings that any reader would say are
   meaningfully on the page (customer names, headline stats, product names,
   release information). Matched case-insensitively, with word boundaries
   where the fact is a single alphanumeric token (`API` does not match
   `apiece`).
3. **Latency** — wall-clock time from URL submission to markdown output.
   Includes fetch + extraction. Network-dependent, so reported as the
   median of 3 runs.
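The matching rule in item 2 can be sketched in Python. The function name and exact regex below are illustrative, not the benchmark's actual implementation (which lives in `scripts/bench.py`):

```python
import re

def fact_preserved(fact: str, output: str) -> bool:
    """Case-insensitive containment check. Single alphanumeric tokens get
    word boundaries so `API` does not match `apiece`; punctuated or
    multi-word facts (e.g. "99.999") are matched as plain substrings."""
    if re.fullmatch(r"[A-Za-z0-9]+", fact):
        pattern = r"\b" + re.escape(fact) + r"\b"
    else:
        pattern = re.escape(fact)
    return re.search(pattern, output, re.IGNORECASE) is not None
```

A site's fidelity score is then just the count of its 5 facts for which this check returns true.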
## Tokenizer

`cl100k_base` via OpenAI's `tiktoken` library. This is the encoding used by
GPT-4, GPT-3.5-turbo, and `text-embedding-3-*` — the models most users plug
extracted web content into. Pinned in `scripts/bench.py`.
## Tool versions

Listed at the top of each run's `results/YYYY-MM-DD.json` file. The run
published at launch used:

- `webclaw 0.3.18` (release build, default options, `--format llm`)
- `trafilatura 2.0.0` (`extract(html, output_format="markdown",
  include_links=True, include_tables=True, favor_recall=True)`)
- `firecrawl-py 4.x` against Firecrawl's hosted `v2` API
  (`scrape(url, formats=["markdown"])`)
## Fact selection

Facts for each site were chosen by manual inspection of the live page in a
browser on 2026-04-17. Selection criteria:

- must be **visibly present** (not in `<head>`, `<script>`, or hidden
  sections)
- must be **specific** — customer names, headline stats, product names,
  release dates. Not generic words like "the", "platform", "we".
- must be **stable across multiple loads** (no AB-tested copy, no random
  customer rotations)
- 5 facts per site, documented in `facts.json`

Facts are committed as data, not code, so **new facts can be proposed via
pull request**. Any addition runs against all three tools automatically.

Known limitation: sites change. News aggregators, release pages, and
blog indexes drift. If a fact disappears because the page changed (not
because the extractor dropped it), we expect all three tools to miss it
together, which makes it visible as "all tools tied on this site" in the
per-site breakdown. Facts on churning pages are refreshed on each published
run.
## Why median of 3 runs

Single-run numbers are noisy:

- **Latency** varies ±30% from run to run due to network jitter, CDN cache
  state, and the remote server's own load.
- **Raw-HTML token count** can vary if the server renders different content
  per request (A/B tests, geo-IP, session state).
- **Tool-specific flakiness** exists at the long tail. The occasional
  Firecrawl 502 or trafilatura fetch failure would otherwise distort a
  single-run benchmark.

We run each site 3 times and take the median per metric. The published
number is the 50th percentile; the full run data (min / median / max)
is preserved in `results/YYYY-MM-DD.json`.
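The per-site aggregation can be sketched as a small helper. The run dicts and the helper name are illustrative; the field names mirror the `results/*.json` schema:

```python
from statistics import median

def aggregate_site(runs: list[dict]) -> dict:
    """Collapse the 3 per-run measurements for one site into the
    published per-site medians."""
    return {
        "tokens_med": median(r["tokens"] for r in runs),
        "facts_med": median(r["facts"] for r in runs),
        "seconds_med": median(r["seconds"] for r in runs),
    }
```

Note that the median is taken independently per metric, so `tokens_med` and `seconds_med` may come from different runs of the same site.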
## Fair comparison notes

- **Each tool fetches via its own preferred path.** webclaw uses its
  in-process primp HTTP client. Trafilatura uses `requests`. Firecrawl
  fetches via its hosted infrastructure (Chrome CDP when needed). This is
  the apples-to-apples developer-experience comparison: what you get when
  you call each tool with a URL. The "vs raw HTML" column uses webclaw's
  `--raw-html` as the baseline denominator.
- **Firecrawl's default engine picker** runs in "auto" mode with browser
  rendering for sites it detects need it. No flags tuned, no URLs
  cherry-picked.
- **No retries**, no fallbacks, no post-processing on top of any tool's
  output. If a tool returns `""` or errors, that is the measured result
  for that run. The median of 3 runs absorbs transient errors; persistent
  extraction failures (e.g. trafilatura on `simonwillison.net`, which
  returned `""` on all 3 runs) show up as 0 tokens and 0 facts.
## Raw data schema

`results/YYYY-MM-DD.json`:

```json
{
  "timestamp": "2026-04-17 ...",
  "webclaw_version": "0.3.18",
  "trafilatura_version": "2.0.0",
  "tokenizer": "cl100k_base",
  "runs_per_site": 3,
  "site_count": 18,
  "total_facts": 90,
  "aggregates": {
    "webclaw": { "reduction_mean": 92.5, "fidelity_pct": 84.4, ... },
    "trafilatura": { "reduction_mean": 97.8, "fidelity_pct": 50.0, ... },
    "firecrawl": { "reduction_mean": 92.4, "fidelity_pct": 77.8, ... }
  },
  "per_site": [
    {
      "url": "https://openai.com",
      "facts_count": 5,
      "raw_tokens": 170508,
      "webclaw": { "tokens_med": 1238, "facts_med": 3, "seconds_med": 0.49 },
      "trafilatura": { "tokens_med": 0, "facts_med": 0, "seconds_med": 0.17 },
      "firecrawl": { "tokens_med": 3139, "facts_med": 2, "seconds_med": 1.08 }
    },
    ...
  ]
}
```
|
||||
|
||||
These measurements are intentionally out of scope for this initial
|
||||
benchmark. Each deserves its own harness and its own run.
|
||||
|
||||
- **n-gram content overlap** — v2 metric to replace curated-fact matching.
|
||||
Measure: fraction of trigrams from the visually-rendered page text that
|
||||
appear in the extractor's output. Harder to curate, easier to scale.
|
||||
- **Competitors besides trafilatura / firecrawl** — Mozilla Readability,
|
||||
Newspaper3k, Crawl4AI, Diffbot, Jina Reader. Require either JS ports or
|
||||
wrapper subprocess runners. PRs welcome.
|
||||
- **Anti-bot / protected sites** — Cloudflare Turnstile, DataDome, AWS
|
||||
WAF, hCaptcha. These require the Webclaw Cloud API with the antibot
|
||||
sidecar, not the open-source CLI, and will be published separately on
|
||||
the Webclaw landing page once the testing harness there is public.
|
||||
- **Crawl throughput** — pages-per-second under concurrent load. Different
|
||||
axis from single-page extraction; lives in its own benchmark.
|
||||
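The proposed n-gram metric could look roughly like this. Whitespace tokenisation and the function name are assumptions, since the v2 metric is not yet specified:

```python
def trigram_overlap(rendered_text: str, extracted: str) -> float:
    """Fraction of word trigrams from the rendered page text that also
    appear in the extractor's output (1.0 = everything preserved)."""
    def trigrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

    page = trigrams(rendered_text)
    if not page:
        return 0.0
    return len(page & trigrams(extracted)) / len(page)
```

Unlike the curated-fact score, this needs no per-site curation, but it does need the visually rendered text of each page as a reference input.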
397 benchmarks/results/2026-04-17.json Normal file
@@ -0,0 +1,397 @@
{
|
||||
"timestamp": "2026-04-17 14:28:42",
|
||||
"webclaw_version": "0.3.18",
|
||||
"trafilatura_version": "2.0.0",
|
||||
"tokenizer": "cl100k_base",
|
||||
"runs_per_site": 3,
|
||||
"site_count": 18,
|
||||
"total_facts": 90,
|
||||
"aggregates": {
|
||||
"webclaw": {
|
||||
"reduction_mean": 92.5,
|
||||
"reduction_median": 97.8,
|
||||
"facts_preserved": 76,
|
||||
"total_facts": 90,
|
||||
"fidelity_pct": 84.4,
|
||||
"latency_mean": 0.41
|
||||
},
|
||||
"trafilatura": {
|
||||
"reduction_mean": 97.8,
|
||||
"reduction_median": 99.7,
|
||||
"facts_preserved": 45,
|
||||
"total_facts": 90,
|
||||
"fidelity_pct": 50.0,
|
||||
"latency_mean": 0.2
|
||||
},
|
||||
"firecrawl": {
|
||||
"reduction_mean": 92.4,
|
||||
"reduction_median": 96.2,
|
||||
"facts_preserved": 70,
|
||||
"total_facts": 90,
|
||||
"fidelity_pct": 77.8,
|
||||
"latency_mean": 0.99
|
||||
}
|
||||
},
|
||||
"per_site": [
|
||||
{
|
||||
"url": "https://openai.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 170510,
|
||||
"webclaw": {
|
||||
"tokens_med": 1238,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.49
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 0,
|
||||
"facts_med": 0,
|
||||
"seconds_med": 0.12
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 3139,
|
||||
"facts_med": 2,
|
||||
"seconds_med": 1.14
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://vercel.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 380172,
|
||||
"webclaw": {
|
||||
"tokens_med": 1076,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.31
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 585,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.23
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 4029,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.99
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://anthropic.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 102911,
|
||||
"webclaw": {
|
||||
"tokens_med": 672,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.31
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 96,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.21
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 560,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.81
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://www.notion.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 109312,
|
||||
"webclaw": {
|
||||
"tokens_med": 13416,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.93
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 91,
|
||||
"facts_med": 2,
|
||||
"seconds_med": 0.65
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 5261,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.99
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://stripe.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 243465,
|
||||
"webclaw": {
|
||||
"tokens_med": 81974,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.71
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 2418,
|
||||
"facts_med": 0,
|
||||
"seconds_med": 0.39
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 8922,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 1.04
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://tavily.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 29964,
|
||||
"webclaw": {
|
||||
"tokens_med": 1361,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.33
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 182,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.18
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 1969,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.75
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://www.shopify.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 183738,
|
||||
"webclaw": {
|
||||
"tokens_med": 1939,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.29
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 595,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.22
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 5384,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.98
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://docs.python.org/3/",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 5275,
|
||||
"webclaw": {
|
||||
"tokens_med": 689,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.12
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 347,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.04
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 1623,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.79
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://react.dev",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 107406,
|
||||
"webclaw": {
|
||||
"tokens_med": 3332,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.23
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 763,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.17
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 4959,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.92
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://tailwindcss.com/docs/installation",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 113258,
|
||||
"webclaw": {
|
||||
"tokens_med": 779,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.27
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 430,
|
||||
"facts_med": 2,
|
||||
"seconds_med": 0.2
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 813,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 1.02
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://nextjs.org/docs",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 228196,
|
||||
"webclaw": {
|
||||
"tokens_med": 968,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.24
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 631,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.17
|
||||
},
|
||||
"firecrawl": {
"tokens_med": 885,
"facts_med": 4,
"seconds_med": 0.88
}
},
{
"url": "https://github.com",
"facts_count": 5,
"raw_tokens": 234232,
"webclaw": {
"tokens_med": 1438,
"facts_med": 5,
"seconds_med": 0.33
},
"trafilatura": {
"tokens_med": 486,
"facts_med": 3,
"seconds_med": 0.09
},
"firecrawl": {
"tokens_med": 3058,
"facts_med": 4,
"seconds_med": 0.92
}
},
{
"url": "https://en.wikipedia.org/wiki/Rust_(programming_language)",
"facts_count": 5,
"raw_tokens": 189406,
"webclaw": {
"tokens_med": 47823,
"facts_med": 5,
"seconds_med": 0.36
},
"trafilatura": {
"tokens_med": 37427,
"facts_med": 5,
"seconds_med": 0.28
},
"firecrawl": {
"tokens_med": 59326,
"facts_med": 5,
"seconds_med": 1.49
}
},
{
"url": "https://simonwillison.net/2026/Mar/15/latent-reasoning/",
"facts_count": 5,
"raw_tokens": 3212,
"webclaw": {
"tokens_med": 724,
"facts_med": 4,
"seconds_med": 0.12
},
"trafilatura": {
"tokens_med": 0,
"facts_med": 0,
"seconds_med": 0.03
},
"firecrawl": {
"tokens_med": 525,
"facts_med": 2,
"seconds_med": 0.89
}
},
{
"url": "https://paulgraham.com/essays.html",
"facts_count": 5,
"raw_tokens": 1786,
"webclaw": {
"tokens_med": 169,
"facts_med": 2,
"seconds_med": 0.9
},
"trafilatura": {
"tokens_med": 0,
"facts_med": 0,
"seconds_med": 0.22
},
"firecrawl": {
"tokens_med": 295,
"facts_med": 1,
"seconds_med": 0.71
}
},
{
"url": "https://techcrunch.com",
"facts_count": 5,
"raw_tokens": 143309,
"webclaw": {
"tokens_med": 7265,
"facts_med": 5,
"seconds_med": 0.25
},
"trafilatura": {
"tokens_med": 397,
"facts_med": 5,
"seconds_med": 0.2
},
"firecrawl": {
"tokens_med": 11408,
"facts_med": 5,
"seconds_med": 1.21
}
},
{
"url": "https://www.databricks.com",
"facts_count": 5,
"raw_tokens": 274051,
"webclaw": {
"tokens_med": 2001,
"facts_med": 4,
"seconds_med": 0.31
},
"trafilatura": {
"tokens_med": 311,
"facts_med": 4,
"seconds_med": 0.2
},
"firecrawl": {
"tokens_med": 5471,
"facts_med": 4,
"seconds_med": 1.34
}
},
{
"url": "https://www.hashicorp.com",
"facts_count": 5,
"raw_tokens": 108510,
"webclaw": {
"tokens_med": 1501,
"facts_med": 5,
"seconds_med": 0.91
},
"trafilatura": {
"tokens_med": 0,
"facts_med": 0,
"seconds_med": 0.03
},
"firecrawl": {
"tokens_med": 4289,
"facts_med": 5,
"seconds_med": 0.91
}
}
]
}
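The per-site numbers above feed the aggregate "reduction" columns via one formula: the share of raw-HTML tokens the tool removed, `(raw_tokens - tool_tokens) / raw_tokens * 100`. A minimal sketch, using the github.com medians from the results above, shows how the percentages come out:

```python
def reduction_pct(raw_tokens: int, tool_tokens: int) -> float:
    """Percentage of raw-HTML tokens the tool removed."""
    return (raw_tokens - tool_tokens) / raw_tokens * 100

# github.com medians copied from the results file above
site = {"raw_tokens": 234232, "webclaw": 1438, "trafilatura": 486, "firecrawl": 3058}
for tool in ("webclaw", "trafilatura", "firecrawl"):
    print(f"{tool}: {reduction_pct(site['raw_tokens'], site[tool]):.1f}%")
```

Note the harness only averages this over sites where the tool returned non-empty output, so a tool that extracts nothing (tokens_med 0) is excluded from the reduction mean rather than credited with 100%.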
27  benchmarks/run.sh  Executable file
@@ -0,0 +1,27 @@
#!/usr/bin/env bash
# Reproduce the webclaw benchmark.
# Requires: python3, tiktoken, trafilatura. Optional: firecrawl-py + FIRECRAWL_API_KEY.

set -euo pipefail
cd "$(dirname "$0")"

# Build webclaw if not present
if [ ! -x "../target/release/webclaw" ]; then
    echo "→ building webclaw..."
    (cd .. && cargo build --release)
fi

# Install python deps if missing
missing=""
python3 -c "import tiktoken" 2>/dev/null || missing+=" tiktoken"
python3 -c "import trafilatura" 2>/dev/null || missing+=" trafilatura"
if [ -n "${FIRECRAWL_API_KEY:-}" ]; then
    python3 -c "import firecrawl" 2>/dev/null || missing+=" firecrawl-py"
fi
if [ -n "$missing" ]; then
    echo "→ installing python deps:$missing"
    python3 -m pip install --quiet $missing
fi

# Run
python3 scripts/bench.py
232  benchmarks/scripts/bench.py  Executable file
@@ -0,0 +1,232 @@
#!/usr/bin/env python3
"""
webclaw benchmark — webclaw vs trafilatura vs firecrawl.

Produces results/YYYY-MM-DD.json matching the schema in methodology.md.
Sites and facts come from ../sites.txt and ../facts.json.
Tokenizer: cl100k_base (GPT-4 / GPT-3.5 / text-embedding-3-*).

Usage:
    FIRECRAWL_API_KEY=fc-... python3 bench.py
    python3 bench.py              # runs webclaw + trafilatura only

Optional env:
    WEBCLAW          path to webclaw release binary (default: ../../target/release/webclaw)
    RUNS             runs per site (default: 3)
    WEBCLAW_TIMEOUT  seconds (default: 30)
"""
from __future__ import annotations
import json, os, re, statistics, subprocess, sys, time
from pathlib import Path

HERE = Path(__file__).resolve().parent
ROOT = HERE.parent          # benchmarks/
REPO_ROOT = ROOT.parent     # core/

WEBCLAW = os.environ.get("WEBCLAW", str(REPO_ROOT / "target" / "release" / "webclaw"))
RUNS = int(os.environ.get("RUNS", "3"))
WC_TIMEOUT = int(os.environ.get("WEBCLAW_TIMEOUT", "30"))

try:
    import tiktoken
    import trafilatura
except ImportError as e:
    sys.exit(f"missing dep: {e}. run: pip install tiktoken trafilatura firecrawl-py")

ENC = tiktoken.get_encoding("cl100k_base")

FC_KEY = os.environ.get("FIRECRAWL_API_KEY")
FC = None
if FC_KEY:
    try:
        from firecrawl import Firecrawl
        FC = Firecrawl(api_key=FC_KEY)
    except ImportError:
        print("firecrawl-py not installed; skipping firecrawl column", file=sys.stderr)


def load_sites() -> list[str]:
    path = ROOT / "sites.txt"
    out = []
    for line in path.read_text().splitlines():
        s = line.split("#", 1)[0].strip()
        if s:
            out.append(s)
    return out


def load_facts() -> dict[str, list[str]]:
    return json.loads((ROOT / "facts.json").read_text())["facts"]


def run_webclaw_llm(url: str) -> tuple[str, float]:
    t0 = time.time()
    r = subprocess.run(
        [WEBCLAW, url, "-f", "llm", "-t", str(WC_TIMEOUT)],
        capture_output=True, text=True, timeout=WC_TIMEOUT + 15,
    )
    return r.stdout or "", time.time() - t0


def run_webclaw_raw(url: str) -> str:
    r = subprocess.run(
        [WEBCLAW, url, "--raw-html", "-t", str(WC_TIMEOUT)],
        capture_output=True, text=True, timeout=WC_TIMEOUT + 15,
    )
    return r.stdout or ""


def run_trafilatura(url: str) -> tuple[str, float]:
    t0 = time.time()
    try:
        html = trafilatura.fetch_url(url)
        out = ""
        if html:
            out = trafilatura.extract(
                html, output_format="markdown",
                include_links=True, include_tables=True, favor_recall=True,
            ) or ""
    except Exception:
        out = ""
    return out, time.time() - t0


def run_firecrawl(url: str) -> tuple[str, float]:
    if not FC:
        return "", 0.0
    t0 = time.time()
    try:
        r = FC.scrape(url, formats=["markdown"])
        return (r.markdown or ""), time.time() - t0
    except Exception:
        return "", time.time() - t0


def tok(s: str) -> int:
    return len(ENC.encode(s, disallowed_special=())) if s else 0


_WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def hit_count(text: str, facts: list[str]) -> int:
    """Case-insensitive; word-boundary for single-token alphanumeric facts,
    substring for multi-word or non-alpha facts (like '99.999')."""
    if not text:
        return 0
    low = text.lower()
    count = 0
    for f in facts:
        f_low = f.lower()
        if " " in f or not f.isalpha():
            if f_low in low:
                count += 1
        else:
            if re.search(r"\b" + re.escape(f_low) + r"\b", low):
                count += 1
    return count


def main() -> int:
    sites = load_sites()
    facts_by_url = load_facts()
    print(f"running {len(sites)} sites × {3 if FC else 2} tools × {RUNS} runs")
    if not FC:
        print("  (no FIRECRAWL_API_KEY — skipping firecrawl column)")
    print()

    per_site = []
    for i, url in enumerate(sites, 1):
        facts = facts_by_url.get(url, [])
        if not facts:
            print(f"[{i}/{len(sites)}] {url} SKIPPED — no facts in facts.json")
            continue
        print(f"[{i}/{len(sites)}] {url}")
        raw_t = tok(run_webclaw_raw(url))

        def run_one(fn):
            out, seconds = fn(url)
            return {"tokens": tok(out), "facts": hit_count(out, facts), "seconds": seconds}

        runs = {"webclaw": [], "trafilatura": [], "firecrawl": []}
        for _ in range(RUNS):
            runs["webclaw"].append(run_one(run_webclaw_llm))
            runs["trafilatura"].append(run_one(run_trafilatura))
            if FC:
                runs["firecrawl"].append(run_one(run_firecrawl))
            else:
                runs["firecrawl"].append({"tokens": 0, "facts": 0, "seconds": 0.0})

        def med(tool, key):
            return statistics.median(r[key] for r in runs[tool])

        def med_ints(tool):
            return {
                "tokens_med": int(med(tool, "tokens")),
                "facts_med": int(med(tool, "facts")),
                "seconds_med": round(med(tool, "seconds"), 2),
            }

        per_site.append({
            "url": url,
            "facts_count": len(facts),
            "raw_tokens": raw_t,
            "webclaw": med_ints("webclaw"),
            "trafilatura": med_ints("trafilatura"),
            "firecrawl": med_ints("firecrawl"),
        })
        last = per_site[-1]
        print(f"  raw={raw_t} wc={last['webclaw']['tokens_med']}/{last['webclaw']['facts_med']}"
              f" tr={last['trafilatura']['tokens_med']}/{last['trafilatura']['facts_med']}"
              f" fc={last['firecrawl']['tokens_med']}/{last['firecrawl']['facts_med']}")

    # aggregates
    total_facts = sum(r["facts_count"] for r in per_site)

    def agg(tool):
        red_vals = [
            (r["raw_tokens"] - r[tool]["tokens_med"]) / r["raw_tokens"] * 100
            for r in per_site
            if r["raw_tokens"] > 0 and r[tool]["tokens_med"] > 0
        ]
        return {
            "reduction_mean": round(statistics.mean(red_vals), 1) if red_vals else 0.0,
            "reduction_median": round(statistics.median(red_vals), 1) if red_vals else 0.0,
            "facts_preserved": sum(r[tool]["facts_med"] for r in per_site),
            "total_facts": total_facts,
            "fidelity_pct": round(sum(r[tool]["facts_med"] for r in per_site) / total_facts * 100, 1) if total_facts else 0,
            "latency_mean": round(statistics.mean(r[tool]["seconds_med"] for r in per_site), 2),
        }

    result = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "webclaw_version": subprocess.check_output([WEBCLAW, "--version"], text=True).strip().split()[-1],
        "trafilatura_version": trafilatura.__version__,
        "firecrawl_enabled": FC is not None,
        "tokenizer": "cl100k_base",
        "runs_per_site": RUNS,
        "site_count": len(per_site),
        "total_facts": total_facts,
        "aggregates": {t: agg(t) for t in ["webclaw", "trafilatura", "firecrawl"]},
        "per_site": per_site,
    }

    out_path = ROOT / "results" / f"{time.strftime('%Y-%m-%d')}.json"
    out_path.parent.mkdir(exist_ok=True)
    out_path.write_text(json.dumps(result, indent=2))

    print()
    print("=" * 70)
    print(f"{len(per_site)} sites, {total_facts} facts, median of {RUNS} runs")
    print("=" * 70)
    for t in ["webclaw", "trafilatura", "firecrawl"]:
        a = result["aggregates"][t]
        print(f"  {t:14s} reduction_mean={a['reduction_mean']:5.1f}%"
              f" fidelity={a['facts_preserved']}/{a['total_facts']} ({a['fidelity_pct']}%)"
              f" latency={a['latency_mean']}s")
    print()
    print(f"  results → {out_path.relative_to(REPO_ROOT)}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
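The fact-matching rule used by `hit_count` — word-boundary matching for single alphabetic tokens, plain substring for anything with a space or non-letters — can be exercised in isolation. A small standalone sketch of that rule:

```python
import re

def fact_hit(text: str, fact: str) -> bool:
    """Mirror of the harness rule: word-boundary match for single
    alphabetic tokens, plain substring for everything else."""
    low, f_low = text.lower(), fact.lower()
    if " " in fact or not fact.isalpha():
        return f_low in low  # multi-word / non-alpha facts match as substrings
    return re.search(r"\b" + re.escape(f_low) + r"\b", low) is not None

assert fact_hit("The API is ready", "API")                 # boundary match
assert not fact_hit("an apiece of land", "API")            # no partial-word hit
assert fact_hit("uptime is 99.999% this year", "99.999")   # substring path
```

The boundary check is what keeps a short fact like `API` from scoring a false positive inside an unrelated word, while numeric facts like `99.999` still match even when glued to a `%` sign.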
31  benchmarks/sites.txt  Normal file
@@ -0,0 +1,31 @@
# One URL per line. Comments (#) and blank lines ignored.
# Sites chosen to span: SPA marketing, enterprise SaaS, documentation,
# long-form content, news, and aggregator pages.

# --- SPA marketing ---
https://openai.com
https://vercel.com
https://anthropic.com
https://www.notion.com
https://stripe.com
https://tavily.com
https://www.shopify.com

# --- Documentation ---
https://docs.python.org/3/
https://react.dev
https://tailwindcss.com/docs/installation
https://nextjs.org/docs
https://github.com

# --- Long-form content ---
https://en.wikipedia.org/wiki/Rust_(programming_language)
https://simonwillison.net/2026/Mar/15/latent-reasoning/
https://paulgraham.com/essays.html

# --- News / commerce ---
https://techcrunch.com

# --- Enterprise SaaS ---
https://www.databricks.com
https://www.hashicorp.com
422  crates/webclaw-cli/src/bench.rs  Normal file
@@ -0,0 +1,422 @@
//! `webclaw bench <url>` — per-URL extraction micro-benchmark.
//!
//! Fetches a page, extracts it via the same pipeline that powers
//! `--format llm`, and reports how many tokens the LLM pipeline
//! removed vs. the raw HTML. Optional `--facts` reuses the
//! benchmark harness's curated fact lists to score fidelity.
//!
//! v1 uses an *approximate* tokenizer (chars/4 for Latin text,
//! chars/2 for CJK-heavy text). Output is clearly labeled
//! "≈ tokens" so nobody mistakes it for a real tiktoken run.
//! Swapping to tiktoken-rs later is a one-function change.

use std::path::{Path, PathBuf};
use std::time::Instant;

use webclaw_core::{extract, to_llm_text};
use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig};

/// Inputs collected from the clap subcommand.
pub struct BenchArgs {
    pub url: String,
    pub json: bool,
    pub facts: Option<PathBuf>,
}

/// What a single bench run measures.
struct BenchResult {
    url: String,
    raw_tokens: usize,
    raw_bytes: usize,
    llm_tokens: usize,
    llm_bytes: usize,
    reduction_pct: f64,
    elapsed_secs: f64,
    /// `Some((found, total))` when `--facts` is supplied and the URL has
    /// an entry in the facts file; `None` otherwise.
    facts: Option<(usize, usize)>,
}

pub async fn run(args: &BenchArgs) -> Result<(), String> {
    // Dedicated client so bench doesn't care about global CLI flags
    // (proxies, custom headers, etc.). A reproducible microbench is
    // more useful than an over-configurable one; if someone wants to
    // bench behind a proxy they can set WEBCLAW_PROXY — respected
    // by FetchConfig via the regular channels if we extend later.
    let config = FetchConfig {
        browser: BrowserProfile::Chrome,
        ..FetchConfig::default()
    };
    let client = FetchClient::new(config).map_err(|e| format!("build client: {e}"))?;

    let start = Instant::now();
    let fetched = client
        .fetch(&args.url)
        .await
        .map_err(|e| format!("fetch: {e}"))?;

    let extraction =
        extract(&fetched.html, Some(&fetched.url)).map_err(|e| format!("extract: {e}"))?;
    let llm_text = to_llm_text(&extraction, Some(&fetched.url));
    let elapsed = start.elapsed();

    let raw_tokens = approx_tokens(&fetched.html);
    let llm_tokens = approx_tokens(&llm_text);
    let raw_bytes = fetched.html.len();
    let llm_bytes = llm_text.len();
    let reduction_pct = if raw_tokens == 0 {
        0.0
    } else {
        100.0 * (1.0 - llm_tokens as f64 / raw_tokens as f64)
    };

    let facts = match args.facts.as_deref() {
        Some(path) => check_facts(path, &args.url, &llm_text)?,
        None => None,
    };

    let result = BenchResult {
        url: args.url.clone(),
        raw_tokens,
        raw_bytes,
        llm_tokens,
        llm_bytes,
        reduction_pct,
        elapsed_secs: elapsed.as_secs_f64(),
        facts,
    };

    if args.json {
        print_json(&result);
    } else {
        print_box(&result);
    }
    Ok(())
}

// ---------------------------------------------------------------------------
// Approximate tokenizer
// ---------------------------------------------------------------------------

/// Rough token count. `chars / 4` is the classic English rule of thumb
/// (close to cl100k_base for typical prose). CJK scripts pack ~2 chars
/// per token, so we switch to `chars / 2` when CJK dominates.
///
/// Off by ±10% vs. a real BPE tokenizer, which is fine for "is webclaw's
/// output 66% smaller or 66% bigger than raw HTML" — the signal is
/// order-of-magnitude, not precise accounting.
fn approx_tokens(s: &str) -> usize {
    let total: usize = s.chars().count();
    if total == 0 {
        return 0;
    }
    let cjk = s.chars().filter(|c| is_cjk(*c)).count();
    let cjk_ratio = cjk as f64 / total as f64;
    if cjk_ratio > 0.30 {
        total.div_ceil(2)
    } else {
        total.div_ceil(4)
    }
}

fn is_cjk(c: char) -> bool {
    let n = c as u32;
    (0x4E00..=0x9FFF).contains(&n) // CJK Unified Ideographs
        || (0x3040..=0x309F).contains(&n) // Hiragana
        || (0x30A0..=0x30FF).contains(&n) // Katakana
        || (0xAC00..=0xD7AF).contains(&n) // Hangul Syllables
        || (0x3400..=0x4DBF).contains(&n) // CJK Extension A
}

// ---------------------------------------------------------------------------
// Output: ASCII / Unicode box
// ---------------------------------------------------------------------------

const BOX_WIDTH: usize = 62; // inner width between the two side borders

fn print_box(r: &BenchResult) {
    let host = display_host(&r.url);
    let version = env!("CARGO_PKG_VERSION");

    let top = "─".repeat(BOX_WIDTH);
    let sep = "─".repeat(BOX_WIDTH);

    // Header: host on the left, "webclaw X.Y.Z" on the right.
    let left = host;
    let right = format!("webclaw {version}");
    let pad = BOX_WIDTH.saturating_sub(left.chars().count() + right.chars().count() + 2);
    let header = format!(" {}{}{} ", left, " ".repeat(pad), right);

    println!("┌{top}┐");
    println!("│{header}│");
    println!("├{sep}┤");
    print_row(
        "raw HTML",
        &format!("{} ≈ tokens", fmt_int(r.raw_tokens)),
        &fmt_bytes(r.raw_bytes),
    );
    print_row(
        "--format llm",
        &format!("{} ≈ tokens", fmt_int(r.llm_tokens)),
        &fmt_bytes(r.llm_bytes),
    );
    print_row("token reduction", &format!("{:.1}%", r.reduction_pct), "");
    print_row("extraction time", &format!("{:.2} s", r.elapsed_secs), "");
    if let Some((found, total)) = r.facts {
        let pct = if total == 0 {
            0.0
        } else {
            100.0 * found as f64 / total as f64
        };
        print_row(
            "facts preserved",
            &format!("{found}/{total} ({pct:.1}%)"),
            "",
        );
    }
    println!("└{top}┘");
    println!();
    println!("note: token counts are approximate (chars/4 Latin, chars/2 CJK).");
}

fn print_row(label: &str, middle: &str, right: &str) {
    // Layout inside the box:
    //   " <label padded to 18> <middle> <right right-aligned to fit> "
    let left_col = format!(" {:<18}", label);
    let right_col = format!("{right} ");
    let budget = BOX_WIDTH
        .saturating_sub(left_col.chars().count())
        .saturating_sub(right_col.chars().count());
    let middle_col = format!("{:<width$}", middle, width = budget);
    println!("│{left_col}{middle_col}{right_col}│");
}

fn fmt_int(n: usize) -> String {
    // Comma-group thousands. Avoids pulling in num-format / thousands
    // for one call site.
    let s = n.to_string();
    let bytes = s.as_bytes();
    let mut out = String::with_capacity(s.len() + s.len() / 3);
    for (i, b) in bytes.iter().enumerate() {
        if i > 0 && (bytes.len() - i).is_multiple_of(3) {
            out.push(',');
        }
        out.push(*b as char);
    }
    out
}

fn fmt_bytes(n: usize) -> String {
    const KB: usize = 1024;
    const MB: usize = KB * 1024;
    if n >= MB {
        format!("{:.1} MB", n as f64 / MB as f64)
    } else if n >= KB {
        format!("{} KB", n / KB)
    } else {
        format!("{n} B")
    }
}

/// Best-effort host extraction — if the URL doesn't parse we fall back
/// to the raw string so the box still prints something recognizable.
fn display_host(url: &str) -> String {
    url::Url::parse(url)
        .ok()
        .and_then(|u| u.host_str().map(|h| h.to_string()))
        .unwrap_or_else(|| url.to_string())
}

// ---------------------------------------------------------------------------
// JSON output — single line, stable key order for scripting / CI.
// ---------------------------------------------------------------------------

fn print_json(r: &BenchResult) {
    let mut obj = serde_json::Map::new();
    obj.insert("url".into(), r.url.clone().into());
    obj.insert("raw_tokens".into(), r.raw_tokens.into());
    obj.insert("raw_bytes".into(), r.raw_bytes.into());
    obj.insert("llm_tokens".into(), r.llm_tokens.into());
    obj.insert("llm_bytes".into(), r.llm_bytes.into());
    obj.insert("token_reduction_pct".into(), round1(r.reduction_pct).into());
    obj.insert("elapsed_secs".into(), round2(r.elapsed_secs).into());
    obj.insert("token_method".into(), "approx".into());
    obj.insert("webclaw_version".into(), env!("CARGO_PKG_VERSION").into());
    if let Some((found, total)) = r.facts {
        obj.insert("facts_found".into(), found.into());
        obj.insert("facts_total".into(), total.into());
    }
    // Single-line JSON — easy to append to ndjson for CI runs.
    println!("{}", serde_json::Value::Object(obj));
}

fn round1(f: f64) -> f64 {
    (f * 10.0).round() / 10.0
}
fn round2(f: f64) -> f64 {
    (f * 100.0).round() / 100.0
}

// ---------------------------------------------------------------------------
// Facts file support
// ---------------------------------------------------------------------------

/// Load `facts.json` (same schema as `benchmarks/facts.json`) and check how
/// many curated facts for this URL appear in the extracted LLM text.
/// Returns `None` when the URL has no entry in the file — don't penalize
/// a site that simply hasn't been curated yet.
fn check_facts(path: &Path, url: &str, llm_text: &str) -> Result<Option<(usize, usize)>, String> {
    let raw = std::fs::read_to_string(path)
        .map_err(|e| format!("read facts file {}: {e}", path.display()))?;
    let parsed: serde_json::Value =
        serde_json::from_str(&raw).map_err(|e| format!("parse facts file: {e}"))?;

    let facts_obj = parsed
        .get("facts")
        .and_then(|v| v.as_object())
        .ok_or_else(|| "facts file missing `facts` object".to_string())?;

    let Some(entry) = facts_obj.get(url) else {
        // URL not curated in this facts file — don't print a fidelity
        // column rather than showing a misleading 0/0.
        return Ok(None);
    };
    let Some(list) = entry.as_array() else {
        return Err(format!("facts['{url}'] is not an array"));
    };

    let total = list.len();
    let text_low = llm_text.to_lowercase();
    let mut found = 0usize;
    for f in list {
        let Some(fact) = f.as_str() else { continue };
        if matches_fact(&text_low, fact) {
            found += 1;
        }
    }
    Ok(Some((found, total)))
}

/// Match a single fact against the lowercased text. Mirrors the
/// python harness in `benchmarks/scripts/bench.py`:
/// - Single alphanumeric token → word-boundary (so `API` doesn't hit
///   `apiece`).
/// - Multi-word or non-alpha facts (e.g. `99.999`) → substring.
fn matches_fact(text_low: &str, fact: &str) -> bool {
    let fact_low = fact.to_lowercase();
    if fact_low.is_empty() {
        return false;
    }
    let is_simple_token = fact_low.chars().all(|c| c.is_ascii_alphanumeric())
        && fact_low
            .chars()
            .next()
            .is_some_and(|c| c.is_ascii_alphabetic());

    if !is_simple_token {
        return text_low.contains(&fact_low);
    }
    // Word-boundary scan without pulling in the regex dependency just
    // for this: find each occurrence and check neighbouring chars.
    let bytes = text_low.as_bytes();
    let needle = fact_low.as_bytes();
    let mut i = 0;
    while i + needle.len() <= bytes.len() {
        if &bytes[i..i + needle.len()] == needle {
            let before_ok = i == 0 || !bytes[i - 1].is_ascii_alphanumeric();
            let after_idx = i + needle.len();
            let after_ok = after_idx >= bytes.len() || !bytes[after_idx].is_ascii_alphanumeric();
            if before_ok && after_ok {
                return true;
            }
        }
        i += 1;
    }
    false
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn approx_tokens_empty() {
        assert_eq!(approx_tokens(""), 0);
    }

    #[test]
    fn approx_tokens_latin_roughly_chars_over_4() {
        // 100 ASCII chars → ~25 tokens
        let s = "a".repeat(100);
        assert_eq!(approx_tokens(&s), 25);
    }

    #[test]
    fn approx_tokens_cjk_denser() {
        // 100 CJK chars → ~50 tokens (chars/2 branch)
        let s: String = "中".repeat(100);
        assert_eq!(approx_tokens(&s), 50);
    }

    #[test]
    fn approx_tokens_mixed_uses_latin_branch() {
        // 80 latin + 20 CJK → CJK ratio 20% < 30% → chars/4 branch
        let s = format!("{}{}", "a".repeat(80), "中".repeat(20));
        assert_eq!(approx_tokens(&s), 25);
    }

    #[test]
    fn fmt_int_commas() {
        assert_eq!(fmt_int(0), "0");
        assert_eq!(fmt_int(100), "100");
        assert_eq!(fmt_int(1_000), "1,000");
        assert_eq!(fmt_int(243_465), "243,465");
        assert_eq!(fmt_int(12_345_678), "12,345,678");
    }

    #[test]
    fn fmt_bytes_units() {
        assert_eq!(fmt_bytes(500), "500 B");
        assert_eq!(fmt_bytes(1024), "1 KB");
        assert_eq!(fmt_bytes(1024 * 1024), "1.0 MB");
        assert_eq!(fmt_bytes(1024 * 1024 * 3 + 1024 * 512), "3.5 MB");
    }

    #[test]
    fn matches_fact_word_boundary() {
        assert!(matches_fact("the api is ready", "API"));
        // single-token alphanumeric: API should not hit apiece
        assert!(!matches_fact("an apiece of land", "API"));
    }

    #[test]
    fn matches_fact_multiword_substring() {
        assert!(matches_fact("uptime is 99.999% this year", "99.999"));
        assert!(matches_fact("the app router routes requests", "App Router"));
    }

    #[test]
    fn matches_fact_case_insensitive() {
        assert!(matches_fact("the claude model is opus", "Claude"));
        assert!(matches_fact("the claude model is opus", "opus"));
    }

    #[test]
    fn matches_fact_missing() {
        assert!(!matches_fact("nothing to see here", "vercel"));
    }

    #[test]
    fn display_host_parses_url() {
        assert_eq!(display_host("https://stripe.com/"), "stripe.com");
        assert_eq!(
            display_host("https://docs.python.org/3/"),
            "docs.python.org"
        );
    }

    #[test]
    fn display_host_falls_back_on_garbage() {
        assert_eq!(display_host("not a url"), "not a url");
    }
}
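The chars/4 vs. chars/2 heuristic in `approx_tokens` is easy to transcribe for a quick sanity check outside the Rust test suite. A minimal Python sketch of the same branch logic (covering only the main CJK ranges used above, with ceiling division matching `div_ceil`):

```python
def approx_tokens(s: str) -> int:
    """chars/4 for mostly-Latin text, chars/2 when >30% of chars are CJK;
    both rounded up, mirroring div_ceil in the Rust version."""
    total = len(s)
    if total == 0:
        return 0
    cjk = sum(
        1 for c in s
        if "\u4e00" <= c <= "\u9fff"    # CJK Unified Ideographs
        or "\u3040" <= c <= "\u30ff"    # Hiragana + Katakana
        or "\uac00" <= c <= "\ud7af"    # Hangul Syllables
        or "\u3400" <= c <= "\u4dbf"    # CJK Extension A
    )
    divisor = 2 if cjk / total > 0.30 else 4
    return -(-total // divisor)  # ceiling division

assert approx_tokens("a" * 100) == 25
assert approx_tokens("中" * 100) == 50
assert approx_tokens("a" * 80 + "中" * 20) == 25  # 20% CJK → Latin branch
```

The 30% threshold is what keeps mixed pages (English docs quoting some CJK) on the chars/4 branch; only pages where CJK genuinely dominates flip to the denser estimate.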
@@ -1,80 +0,0 @@
/// Cloud API client for automatic fallback when local extraction fails.
///
/// When WEBCLAW_API_KEY is set (or --api-key is passed), the CLI can fall back
/// to api.webclaw.io for bot-protected or JS-rendered sites. With --cloud flag,
/// all requests go through the cloud API directly.
///
/// NOTE: The canonical, full-featured cloud module lives in webclaw-mcp/src/cloud.rs
/// (smart_fetch, bot detection, JS rendering checks). This is the minimal subset
/// needed by the CLI, kept separate because adding webclaw-mcp as a dependency
/// would pull in rmcp.
use serde_json::{Value, json};

const API_BASE: &str = "https://api.webclaw.io/v1";

pub struct CloudClient {
    api_key: String,
    http: reqwest::Client,
}

impl CloudClient {
    /// Create from explicit key or WEBCLAW_API_KEY env var.
    pub fn new(explicit_key: Option<&str>) -> Option<Self> {
        let key = explicit_key
            .map(String::from)
            .or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
            .filter(|k| !k.is_empty())?;

        Some(Self {
            api_key: key,
            http: reqwest::Client::new(),
        })
    }

    /// Scrape via the cloud API.
    pub async fn scrape(
        &self,
        url: &str,
        formats: &[&str],
        include_selectors: &[String],
        exclude_selectors: &[String],
        only_main_content: bool,
    ) -> Result<Value, String> {
        let mut body = json!({
            "url": url,
            "formats": formats,
        });
        if only_main_content {
            body["only_main_content"] = json!(true);
        }
        if !include_selectors.is_empty() {
            body["include_selectors"] = json!(include_selectors);
        }
        if !exclude_selectors.is_empty() {
            body["exclude_selectors"] = json!(exclude_selectors);
        }
        self.post("scrape", body).await
    }

    async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> {
        let resp = self
            .http
            .post(format!("{API_BASE}/{endpoint}"))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .json(&body)
            .timeout(std::time::Duration::from_secs(120))
            .send()
            .await
            .map_err(|e| format!("cloud API request failed: {e}"))?;

        let status = resp.status();
        if !status.is_success() {
            let text = resp.text().await.unwrap_or_default();
            return Err(format!("cloud API error {status}: {text}"));
        }

        resp.json::<Value>()
            .await
            .map_err(|e| format!("cloud API response parse failed: {e}"))
    }
}
|
|
@ -1,6 +1,6 @@
|
|||
/// CLI entry point -- wires webclaw-core and webclaw-fetch into a single command.
|
||||
/// All extraction and fetching logic lives in sibling crates; this is pure plumbing.
|
||||
mod cloud;
|
||||
mod bench;
|
||||
|
||||
use std::io::{self, Read as _};
|
||||
use std::path::{Path, PathBuf};
|
||||
|
|
@ -8,7 +8,7 @@ use std::process;
|
|||
use std::sync::Arc;
|
||||
use std::sync::atomic::{AtomicBool, Ordering};
|
||||
|
||||
use clap::{Parser, ValueEnum};
|
||||
use clap::{Parser, Subcommand, ValueEnum};
|
||||
use tracing_subscriber::EnvFilter;
|
||||
use webclaw_core::{
|
||||
ChangeStatus, ContentDiff, ExtractionOptions, ExtractionResult, Metadata, extract_with_options,
|
||||
|
|
@ -86,6 +86,12 @@ fn warn_empty(url: &str, reason: &EmptyReason) {
|
|||
#[derive(Parser)]
|
||||
#[command(name = "webclaw", about = "Extract web content for LLMs", version)]
|
||||
struct Cli {
|
||||
/// Optional subcommand. When omitted, the CLI falls back to the
|
||||
/// traditional flag-based flow (URL + --format, --crawl, etc.).
|
||||
/// Subcommands are used for flows that don't fit that model.
|
||||
#[command(subcommand)]
|
||||
command: Option<Commands>,
|
||||
|
||||
/// URLs to fetch (multiple allowed)
|
||||
#[arg()]
|
||||
urls: Vec<String>,
|
||||
|
|
@@ -283,6 +289,55 @@ struct Cli {
    output_dir: Option<PathBuf>,
}

#[derive(Subcommand)]
enum Commands {
    /// Per-URL extraction micro-benchmark: compares raw HTML vs. the
    /// webclaw --format llm output on token count, bytes, and
    /// extraction time. Uses an approximate tokenizer (see `--help`).
    Bench {
        /// URL to benchmark.
        url: String,

        /// Emit a single JSON line instead of the ASCII table.
        /// Machine-readable shape stable across releases.
        #[arg(long)]
        json: bool,

        /// Optional path to a facts.json (same schema as the repo's
        /// benchmarks/facts.json) for a fidelity column.
        #[arg(long)]
        facts: Option<PathBuf>,
    },

    /// List all vertical extractors in the catalog.
    ///
    /// Each entry has a stable `name` (usable with `webclaw vertical <name>`),
    /// a human-friendly label, a one-line description, and the URL
    /// patterns it claims. The same data is served by `/v1/extractors`
    /// when running the REST API.
    Extractors {
        /// Emit JSON instead of a human-friendly table.
        #[arg(long)]
        json: bool,
    },

    /// Run a vertical extractor by name. Returns typed JSON with fields
    /// specific to the target site (title, price, author, rating, etc.)
    /// rather than generic markdown.
    ///
    /// Use `webclaw extractors` to see the full list. Example:
    /// `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/`.
    Vertical {
        /// Vertical name (e.g. `reddit`, `github_repo`, `trustpilot_reviews`).
        name: String,
        /// URL to extract.
        url: String,
        /// Emit compact JSON (single line). Default is pretty-printed.
        #[arg(long)]
        raw: bool,
    },
}

#[derive(Clone, ValueEnum)]
enum OutputFormat {
    Markdown,

@@ -296,6 +351,9 @@ enum OutputFormat {
enum Browser {
    Chrome,
    Firefox,
+    /// Safari iOS 26. Pair with a country-matched residential proxy for sites
+    /// that reject non-mobile profiles.
+    SafariIos,
    Random,
}

@@ -322,6 +380,7 @@ impl From<Browser> for BrowserProfile {
        match b {
            Browser::Chrome => BrowserProfile::Chrome,
            Browser::Firefox => BrowserProfile::Firefox,
+            Browser::SafariIos => BrowserProfile::SafariIos,
            Browser::Random => BrowserProfile::Random,
        }
    }

@@ -646,7 +705,7 @@ async fn fetch_and_extract(cli: &Cli) -> Result<FetchOutput, String> {
        let url = normalize_url(raw_url);
        let url = url.as_str();

-        let cloud_client = cloud::CloudClient::new(cli.api_key.as_deref());
+        let cloud_client = webclaw_fetch::cloud::CloudClient::new(cli.api_key.as_deref());

        // --cloud: skip local, go straight to cloud API
        if cli.cloud {

@@ -2244,6 +2303,103 @@ async fn main() {
    let cli = Cli::parse();
    init_logging(cli.verbose);

    // Subcommand path. Handled before the flag dispatch so a subcommand
    // can't collide with a flag-based flow. When no subcommand is set
    // we fall through to the existing behaviour.
    if let Some(ref cmd) = cli.command {
        match cmd {
            Commands::Bench { url, json, facts } => {
                let args = bench::BenchArgs {
                    url: url.clone(),
                    json: *json,
                    facts: facts.clone(),
                };
                if let Err(e) = bench::run(&args).await {
                    eprintln!("error: {e}");
                    process::exit(1);
                }
                return;
            }
            Commands::Extractors { json } => {
                let entries = webclaw_fetch::extractors::list();
                if *json {
                    // Serialize with serde_json. ExtractorInfo derives
                    // Serialize so this is a one-liner.
                    match serde_json::to_string_pretty(&entries) {
                        Ok(s) => println!("{s}"),
                        Err(e) => {
                            eprintln!("error: failed to serialise catalog: {e}");
                            process::exit(1);
                        }
                    }
                } else {
                    // Human-friendly table: NAME + LABEL + one URL
                    // pattern sample. Keeps the output scannable on a
                    // narrow terminal.
                    println!("{} vertical extractors available:\n", entries.len());
                    let name_w = entries.iter().map(|e| e.name.len()).max().unwrap_or(0);
                    let label_w = entries.iter().map(|e| e.label.len()).max().unwrap_or(0);
                    for e in &entries {
                        let pattern_sample = e.url_patterns.first().copied().unwrap_or("");
                        println!(
                            " {:<nw$} {:<lw$} {}",
                            e.name,
                            e.label,
                            pattern_sample,
                            nw = name_w,
                            lw = label_w,
                        );
                    }
                    println!("\nRun one: webclaw vertical <name> <url>");
                }
                return;
            }
            Commands::Vertical { name, url, raw } => {
                // Build a FetchClient with cloud fallback attached when
                // WEBCLAW_API_KEY is set. Antibot-gated verticals
                // (amazon, ebay, etsy, trustpilot) need this to escalate
                // on bot protection.
                let fetch_cfg = webclaw_fetch::FetchConfig {
                    browser: webclaw_fetch::BrowserProfile::Firefox,
                    ..webclaw_fetch::FetchConfig::default()
                };
                let mut client = match webclaw_fetch::FetchClient::new(fetch_cfg) {
                    Ok(c) => c,
                    Err(e) => {
                        eprintln!("error: failed to build fetch client: {e}");
                        process::exit(1);
                    }
                };
                if let Some(cloud) = webclaw_fetch::cloud::CloudClient::from_env() {
                    client = client.with_cloud(cloud);
                }
                match webclaw_fetch::extractors::dispatch_by_name(&client, name, url).await {
                    Ok(data) => {
                        let rendered = if *raw {
                            serde_json::to_string(&data)
                        } else {
                            serde_json::to_string_pretty(&data)
                        };
                        match rendered {
                            Ok(s) => println!("{s}"),
                            Err(e) => {
                                eprintln!("error: JSON encode failed: {e}");
                                process::exit(1);
                            }
                        }
                    }
                    Err(e) => {
                        // UrlMismatch / UnknownVertical / Fetch all get
                        // Display impls with actionable messages.
                        eprintln!("error: {e}");
                        process::exit(1);
                    }
                }
                return;
            }
        }
    }

    // --map: sitemap discovery mode
    if cli.map {
        if let Err(e) = run_map(&cli).await {
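The aligned table in the `Extractors` arm above is built from runtime-computed column widths passed as named format arguments. A standalone sketch (the entry data and `render_rows` helper are made up for illustration):

```rust
// Runtime column widths via named format args, as in the table above.
fn render_rows(entries: &[(&str, &str)]) -> Vec<String> {
    // Widest name decides the column width for every row.
    let name_w = entries.iter().map(|(n, _)| n.len()).max().unwrap_or(0);
    entries
        .iter()
        .map(|(name, label)| format!("{:<name_w$}  {}", name, label, name_w = name_w))
        .collect()
}

fn main() {
    let rows = render_rows(&[("reddit", "Reddit post"), ("github_repo", "GitHub repository")]);
    // Labels start at the same column regardless of name length.
    assert_eq!(rows[0].find("Reddit post"), Some(13));
    assert_eq!(rows[1].find("GitHub repository"), Some(13));
    println!("ok");
}
```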
@@ -920,8 +920,10 @@ fn strip_markdown(md: &str) -> String {
            continue;
        }

-        // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs
-        if trimmed.starts_with('|') && trimmed.ends_with('|') {
+        // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs.
+        // Require at least 2 chars so the slice `[1..len-1]` stays non-empty on single-pipe rows
+        // (which aren't real tables anyway); a lone `|` previously panicked at `begin <= end`.
+        if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
            let inner = &trimmed[1..trimmed.len() - 1];
            let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
            lines.push(cells.join("\t"));
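The guard in the hunk above can be checked in isolation. A minimal sketch (the function name is illustrative, not the crate's actual API):

```rust
// Illustrative stand-in for the table-row branch of strip_markdown.
fn table_row_to_tsv(trimmed: &str) -> Option<String> {
    // The `len() >= 2` guard keeps the slice below in bounds: a lone "|"
    // starts AND ends with '|', so without it `&trimmed[1..0]` panics.
    if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
        let inner = &trimmed[1..trimmed.len() - 1];
        let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
        return Some(cells.join("\t"));
    }
    None
}

fn main() {
    assert_eq!(table_row_to_tsv("| a | b |").as_deref(), Some("a\tb"));
    // The input that used to panic now falls through harmlessly.
    assert_eq!(table_row_to_tsv("|"), None);
    println!("ok");
}
```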
@@ -12,12 +12,16 @@ serde = { workspace = true }
thiserror = { workspace = true }
tracing = { workspace = true }
tokio = { workspace = true }
async-trait = "0.1"
wreq = { version = "6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] }
wreq-util = "3.0.0-rc.10"
http = "1"
bytes = "1"
url = "2"
rand = "0.8"
quick-xml = { version = "0.37", features = ["serde"] }
regex = "1"
reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
serde_json.workspace = true
calamine = "0.34"
zip = "2"

@@ -7,6 +7,10 @@ pub enum BrowserProfile {
    #[default]
    Chrome,
    Firefox,
+    /// Safari iOS 26 (iPhone). The one profile proven to defeat
+    /// DataDome's immobiliare.it / idealista.it / target.com-class
+    /// rules when paired with a country-scoped residential proxy.
+    SafariIos,
    /// Randomly pick from all available profiles on each request.
    Random,
}

@@ -18,6 +22,7 @@ pub enum BrowserVariant {
    ChromeMacos,
    Firefox,
    Safari,
+    SafariIos26,
    Edge,
}

@@ -177,6 +177,11 @@ enum ClientPool {
pub struct FetchClient {
    pool: ClientPool,
    pdf_mode: PdfMode,
+    /// Optional cloud-fallback client. Extractors that need to
+    /// escalate past bot protection call `client.cloud()` to get this
+    /// out. Stored as `Arc` so cloning a `FetchClient` (common in
+    /// axum state) doesn't clone the underlying reqwest pool.
+    cloud: Option<std::sync::Arc<crate::cloud::CloudClient>>,
}

impl FetchClient {

@@ -225,13 +230,96 @@ impl FetchClient {
            ClientPool::Rotating { clients }
        };

-        Ok(Self { pool, pdf_mode })
+        Ok(Self {
+            pool,
+            pdf_mode,
+            cloud: None,
+        })
    }

    /// Attach a cloud-fallback client. Returns `self` so it composes in
    /// a builder-ish way:
    ///
    /// ```ignore
    /// let client = FetchClient::new(config)?
    ///     .with_cloud(CloudClient::from_env()?);
    /// ```
    ///
    /// Extractors that can escalate past bot protection will call
    /// `client.cloud()` internally. Sets the field regardless of
    /// whether `cloud` is configured to bypass anything specific —
    /// attachment is cheap (just wraps in `Arc`).
    pub fn with_cloud(mut self, cloud: crate::cloud::CloudClient) -> Self {
        self.cloud = Some(std::sync::Arc::new(cloud));
        self
    }

    /// Optional cloud-fallback client, if one was attached via
    /// [`Self::with_cloud`]. Extractors that handle antibot sites
    /// pass this into `cloud::smart_fetch_html`.
    pub fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
        self.cloud.as_deref()
    }

    /// Fetch a URL with per-site rescue paths: Reddit URLs redirect to the
    /// `.json` API, and Akamai-style challenge responses trigger a homepage
    /// cookie warmup and a retry. Returns the same `FetchResult` shape as
    /// [`Self::fetch`] so every caller (CLI, MCP, OSS server, production
    /// server) benefits without shape churn.
    ///
    /// This is the method most callers want. Use plain [`Self::fetch`] only
    /// when you need literal no-rescue behavior (e.g. inside the rescue
    /// logic itself to avoid recursion).
    pub async fn fetch_smart(&self, url: &str) -> Result<FetchResult, FetchError> {
        // Reddit: the HTML page shows a verification interstitial for most
        // client IPs, but appending `.json` returns the post + comment tree
        // publicly. `parse_reddit_json` in downstream code knows how to read
        // the result; here we just do the URL swap at the fetch layer.
        if crate::reddit::is_reddit_url(url) && !url.ends_with(".json") {
            let json_url = crate::reddit::json_url(url);
            // Reddit's public .json API serves JSON to identifiable bot
            // User-Agents and blocks browser UAs with a verification wall.
            // Override our Chrome-profile UA for this specific call.
            let ua = concat!(
                "Webclaw/",
                env!("CARGO_PKG_VERSION"),
                " (+https://webclaw.io)"
            );
            if let Ok(resp) = self
                .fetch_with_headers(&json_url, &[("user-agent", ua)])
                .await
                && resp.status == 200
            {
                let first = resp.html.trim_start().as_bytes().first().copied();
                if matches!(first, Some(b'{') | Some(b'[')) {
                    return Ok(resp);
                }
            }
            // If the .json fetch failed or returned HTML, fall through.
        }

        let resp = self.fetch(url).await?;

        // Akamai / bazadebezolkohpepadr challenge: visit the homepage to
        // collect warmup cookies (_abck, bm_sz, etc.), then retry.
        if is_challenge_html(&resp.html)
            && let Some(homepage) = extract_homepage(url)
        {
            debug!("challenge detected, warming cookies via {homepage}");
            let _ = self.fetch(&homepage).await;
            if let Ok(retry) = self.fetch(url).await {
                return Ok(retry);
            }
        }

        Ok(resp)
    }

    /// Fetch a URL and return the raw HTML + response metadata.
    ///
    /// Automatically retries on transient failures (network errors, 5xx, 429)
-    /// with exponential backoff: 0s, 1s (2 attempts total).
+    /// with exponential backoff: 0s, 1s (2 attempts total). No per-site
+    /// rescue logic; use [`Self::fetch_smart`] for that.
    #[instrument(skip(self), fields(url = %url))]
    pub async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
        let delays = [Duration::ZERO, Duration::from_secs(1)];

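The `with_cloud` / `cloud()` pair above is a take-self-by-value builder. A sketch with stand-in types (both structs here are placeholders, not the crate's real `FetchClient`):

```rust
use std::sync::Arc;

struct CloudClient;
struct FetchClient {
    cloud: Option<Arc<CloudClient>>,
}

impl FetchClient {
    fn new() -> Self {
        Self { cloud: None }
    }
    // Consume self, set the field, hand self back: composes as
    // `FetchClient::new().with_cloud(...)` with no mut binding needed.
    fn with_cloud(mut self, cloud: CloudClient) -> Self {
        // Arc so later clones of the client share the fallback instead
        // of duplicating it.
        self.cloud = Some(Arc::new(cloud));
        self
    }
    fn cloud(&self) -> Option<&CloudClient> {
        self.cloud.as_deref()
    }
}

fn main() {
    let client = FetchClient::new().with_cloud(CloudClient);
    assert!(client.cloud().is_some());
    assert!(FetchClient::new().cloud().is_none());
    println!("ok");
}
```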
@@ -279,14 +367,85 @@ impl FetchClient {

    /// Single fetch attempt.
    async fn fetch_once(&self, url: &str) -> Result<FetchResult, FetchError> {
+        self.fetch_once_with_headers(url, &[]).await
+    }
+
+    /// Single fetch attempt with optional per-request headers appended
+    /// after the profile defaults. Used by extractors that need to
+    /// satisfy site-specific headers (e.g. `x-ig-app-id` for Instagram's
+    /// internal API).
+    async fn fetch_once_with_headers(
+        &self,
+        url: &str,
+        extra: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
        let start = Instant::now();
        let client = self.pick_client(url);

-        let resp = client.get(url).send().await?;
+        let mut req = client.get(url);
+        for (k, v) in extra {
+            req = req.header(*k, *v);
+        }
+        let resp = req.send().await?;
        let response = Response::from_wreq(resp).await?;
        response_to_result(response, start)
    }

+    /// Fetch a URL with extra per-request headers appended after the
+    /// browser-profile defaults. Same retry semantics as `fetch`.
+    ///
+    /// Use this when an upstream API requires a header the global
+    /// `FetchConfig.headers` shouldn't carry to other hosts (Instagram's
+    /// `x-ig-app-id`, GitHub's `Authorization` once we wire `GITHUB_TOKEN`,
+    /// Reddit's compliant UA when we add OAuth, etc.).
+    #[instrument(skip(self, extra), fields(url = %url, extra_count = extra.len()))]
+    pub async fn fetch_with_headers(
+        &self,
+        url: &str,
+        extra: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
+        let delays = [Duration::ZERO, Duration::from_secs(1)];
+        let mut last_err = None;
+
+        for (attempt, delay) in delays.iter().enumerate() {
+            if attempt > 0 {
+                tokio::time::sleep(*delay).await;
+            }
+            match self.fetch_once_with_headers(url, extra).await {
+                Ok(result) => {
+                    if is_retryable_status(result.status) && attempt < delays.len() - 1 {
+                        warn!(
+                            url,
+                            status = result.status,
+                            attempt = attempt + 1,
+                            "retryable status, will retry"
+                        );
+                        last_err = Some(FetchError::Build(format!("HTTP {}", result.status)));
+                        continue;
+                    }
+                    if attempt > 0 {
+                        debug!(url, attempt = attempt + 1, "retry succeeded");
+                    }
+                    return Ok(result);
+                }
+                Err(e) => {
+                    if !is_retryable_error(&e) || attempt == delays.len() - 1 {
+                        return Err(e);
+                    }
+                    warn!(
+                        url,
+                        error = %e,
+                        attempt = attempt + 1,
+                        "transient error, will retry"
+                    );
+                    last_err = Some(e);
+                }
+            }
+        }
+
+        Err(last_err.unwrap_or_else(|| FetchError::Build("all retries exhausted".into())))
+    }
+
    /// Fetch a URL then extract structured content.
    #[instrument(skip(self), fields(url = %url))]
    pub async fn fetch_and_extract(

@@ -495,12 +654,43 @@ impl FetchClient {
        }
    }
}

+// ---------------------------------------------------------------------------
+// Fetcher trait implementation
+//
+// Vertical extractors consume the [`crate::fetcher::Fetcher`] trait
+// rather than `FetchClient` directly, which is what lets the production
+// API server swap in a tls-sidecar-backed implementation without
+// pulling wreq into its dependency graph. For everyone else (CLI, MCP,
+// self-hosted OSS server) this impl means "pass the FetchClient you
+// already have; nothing changes".
+// ---------------------------------------------------------------------------
+
+#[async_trait::async_trait]
+impl crate::fetcher::Fetcher for FetchClient {
+    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
+        FetchClient::fetch(self, url).await
+    }
+
+    async fn fetch_with_headers(
+        &self,
+        url: &str,
+        headers: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
+        FetchClient::fetch_with_headers(self, url, headers).await
+    }
+
+    fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
+        FetchClient::cloud(self)
+    }
+}
+
/// Collect the browser variants to use based on the browser profile.
fn collect_variants(profile: &BrowserProfile) -> Vec<BrowserVariant> {
    match profile {
        BrowserProfile::Random => browser::all_variants(),
        BrowserProfile::Chrome => vec![browser::latest_chrome()],
        BrowserProfile::Firefox => vec![browser::latest_firefox()],
+        BrowserProfile::SafariIos => vec![BrowserVariant::SafariIos26],
    }
}

@@ -578,22 +768,23 @@ fn is_pdf_content_type(headers: &http::HeaderMap) -> bool {

/// Detect if a response looks like a bot protection challenge page.
fn is_challenge_response(response: &Response) -> bool {
-    let len = response.body().len();
+    is_challenge_html(response.text().as_ref())
+}
+
+/// Same as `is_challenge_response`, operating on a body string directly
+/// so callers holding a `FetchResult` can reuse the heuristic.
+fn is_challenge_html(html: &str) -> bool {
+    let len = html.len();
    if len > 15_000 || len == 0 {
        return false;
    }

-    let text = response.text();
-    let lower = text.to_lowercase();
-
+    let lower = html.to_lowercase();
    if lower.contains("<title>challenge page</title>") {
        return true;
    }

    if lower.contains("bazadebezolkohpepadr") && len < 5_000 {
        return true;
    }

    false
}

853  crates/webclaw-fetch/src/cloud.rs  Normal file

@@ -0,0 +1,853 @@
//! Cloud API fallback client for api.webclaw.io.
//!
//! When local fetch hits bot protection or a JS-only SPA, callers can
//! fall back to the hosted API which runs the full antibot / CDP
//! pipeline. This module is the shared home for that flow: previously
//! duplicated between `webclaw-mcp/src/cloud.rs` and
//! `webclaw-cli/src/cloud.rs`.
//!
//! ## Architecture
//!
//! - [`CloudClient`] — thin reqwest wrapper around the api.webclaw.io
//!   REST surface. Typed errors for the four HTTP failures callers act
//!   on differently (401 / 402 / 429 / other) plus network + parse.
//! - [`is_bot_protected`] / [`needs_js_rendering`] — pure detectors on
//!   response bodies. The detection patterns are public (CF / DataDome
//!   challenge-page signatures) so these live in OSS without leaking
//!   any moat.
//! - [`smart_fetch`] — try-local-then-escalate flow returning an
//!   [`ExtractionResult`] or raw cloud JSON. Kept on the original
//!   `Result<_, String>` signature so the existing MCP / CLI call
//!   sites work unchanged.
//! - [`smart_fetch_html`] — new convenience for the vertical-extractor
//!   pattern: just give me antibot-bypassed HTML so I can run my own
//!   parser on it. Returns the typed [`CloudError`] so extractors can
//!   emit precise "upgrade your plan" / "invalid key" messages.
//!
//! ## Cloud response shape and [`synthesize_html`]
//!
//! `api.webclaw.io/v1/scrape` deliberately does **not** return a
//! `html` field even when `formats=["html"]` is requested. By design
//! the cloud API returns a parsed bundle:
//!
//! ```text
//! {
//!   "url": "https://...",
//!   "metadata": { title, description, image, site_name, ... }, // OG / meta tags
//!   "structured_data": [ { "@type": "...", ... }, ... ], // JSON-LD blocks
//!   "markdown": "# Page Title\n\n...", // cleaned markdown
//!   "antibot": { engine, path, user_agent }, // bypass telemetry
//!   "cache": { status, age_seconds }
//! }
//! ```
//!
//! [`CloudClient::fetch_html`] reassembles that bundle back into a
//! minimal synthetic HTML document so the existing local extractor
//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
//! cloud output. Each `structured_data` entry becomes a
//! `<script type="application/ld+json">` tag; each `metadata` field
//! becomes a `<meta property="og:...">` tag; `markdown` lands in a
//! `<pre>` inside the body. Callers that walk Schema.org blocks see
//! exactly what they'd see on a real live page.
//!
//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
//! won't hit on the synthesised HTML — those IDs only exist on live
//! Amazon pages. Extractors that need DOM regex keep OG meta tag
//! fallbacks for that reason.
//!
//! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
//! signup when a site is blocked; nothing fails silently. Cloud users
//! get the escalation for free.

use std::time::Duration;

use http::HeaderMap;
use serde_json::{Value, json};
use thiserror::Error;
use tracing::{debug, info, warn};

// Client type isn't needed here anymore now that smart_fetch* takes
// `&dyn Fetcher`. Kept as a comment for historical context: this
// module used to import FetchClient directly before v0.5.1.
||||

// ---------------------------------------------------------------------------
// URLs + defaults — keep in one place so "change the signup link" is a
// single-commit edit.
// ---------------------------------------------------------------------------

const API_BASE_DEFAULT: &str = "https://api.webclaw.io/v1";
const DEFAULT_TIMEOUT_SECS: u64 = 120;

const SIGNUP_URL: &str = "https://webclaw.io/signup";
const PRICING_URL: &str = "https://webclaw.io/pricing";
const KEYS_URL: &str = "https://webclaw.io/dashboard/api-keys";

// ---------------------------------------------------------------------------
// Errors
// ---------------------------------------------------------------------------

/// Structured cloud-fallback error. Variants correspond to the HTTP
/// outcomes callers act on differently — a 401 needs a different UX
/// than a 402 which needs a different UX than a network blip.
///
/// Display messages end with an actionable URL so API consumers can
/// surface them to users verbatim.
#[derive(Debug, Error)]
pub enum CloudError {
    /// No `WEBCLAW_API_KEY` configured. Returned by [`smart_fetch_html`]
    /// and friends when they hit bot protection but have no client to
    /// escalate to.
    #[error(
        "this site is behind antibot protection. \
         Set WEBCLAW_API_KEY to unlock automatic cloud bypass. \
         Free tier: {SIGNUP_URL}"
    )]
    NotConfigured,

    /// HTTP 401 — the key is present but rejected.
    #[error(
        "WEBCLAW_API_KEY rejected (HTTP 401). \
         Check or regenerate your key at {KEYS_URL}"
    )]
    Unauthorized,

    /// HTTP 402 — the key is valid but the plan doesn't cover the call.
    #[error(
        "your plan doesn't include this endpoint / site (HTTP 402). \
         Upgrade at {PRICING_URL}"
    )]
    InsufficientPlan,

    /// HTTP 429 — rate limit.
    #[error(
        "cloud API rate limit reached (HTTP 429). \
         Wait a moment or upgrade at {PRICING_URL}"
    )]
    RateLimited,

    /// HTTP 4xx / 5xx the caller probably can't do anything specific
    /// about. Body is truncated to a sensible length for logs.
    #[error("cloud API returned HTTP {status}: {body}")]
    ServerError { status: u16, body: String },

    #[error("cloud request failed: {0}")]
    Network(String),

    #[error("cloud response parse failed: {0}")]
    ParseFailed(String),
}

impl CloudError {
    /// Build from a non-success HTTP response, routing well-known
    /// statuses to dedicated variants.
    fn from_status_and_body(status: u16, body: String) -> Self {
        match status {
            401 => Self::Unauthorized,
            402 => Self::InsufficientPlan,
            429 => Self::RateLimited,
            _ => Self::ServerError {
                status,
                body: truncate(&body, 500).to_string(),
            },
        }
    }
}
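The status routing in `from_status_and_body` above reduces to a plain match on the code. A runnable restatement using a dependency-free enum (the `thiserror` derive and body truncation are omitted; `CloudErrorKind` and `classify` are illustrative names):

```rust
// Dependency-free sketch of the 401/402/429/other routing above.
#[derive(Debug, PartialEq)]
enum CloudErrorKind {
    Unauthorized,     // 401: key present but rejected
    InsufficientPlan, // 402: key valid, plan too small
    RateLimited,      // 429: back off or upgrade
    Server(u16),      // anything else the caller can't act on
}

fn classify(status: u16) -> CloudErrorKind {
    match status {
        401 => CloudErrorKind::Unauthorized,
        402 => CloudErrorKind::InsufficientPlan,
        429 => CloudErrorKind::RateLimited,
        other => CloudErrorKind::Server(other),
    }
}

fn main() {
    assert_eq!(classify(401), CloudErrorKind::Unauthorized);
    assert_eq!(classify(402), CloudErrorKind::InsufficientPlan);
    assert_eq!(classify(429), CloudErrorKind::RateLimited);
    assert_eq!(classify(500), CloudErrorKind::Server(500));
    println!("ok");
}
```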

impl From<reqwest::Error> for CloudError {
    fn from(e: reqwest::Error) -> Self {
        Self::Network(e.to_string())
    }
}

/// Backwards-compatibility bridge: a lot of pre-existing MCP / CLI call
/// sites `use .await?` into functions returning `Result<_, String>`.
/// Having this `From` impl means those sites keep compiling while we
/// migrate them to the typed error over time.
impl From<CloudError> for String {
    fn from(e: CloudError) -> Self {
        e.to_string()
    }
}

fn truncate(text: &str, max: usize) -> &str {
    match text.char_indices().nth(max) {
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text,
    }
}
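The `truncate` helper above counts characters rather than bytes: `char_indices().nth(max)` yields the byte offset where the (max+1)-th char begins, so the slice can never split a multi-byte UTF-8 sequence. A runnable check of the same function:

```rust
// Char-boundary-safe truncation, identical in shape to the helper above.
fn truncate(text: &str, max: usize) -> &str {
    match text.char_indices().nth(max) {
        // byte_pos is the offset where char number `max` starts, so the
        // slice keeps exactly `max` chars and stays on a char boundary.
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text, // shorter than the limit: return unchanged
    }
}

fn main() {
    assert_eq!(truncate("hello", 3), "hel");
    // 'é' is two bytes; a naive byte slice at index 2 would panic here.
    assert_eq!(truncate("héllo", 2), "hé");
    assert_eq!(truncate("hi", 10), "hi");
    println!("ok");
}
```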

// ---------------------------------------------------------------------------
// CloudClient
// ---------------------------------------------------------------------------

/// Thin reqwest client around api.webclaw.io. Cloneable cheaply — the
/// inner `reqwest::Client` already refcounts its connection pool.
#[derive(Clone)]
pub struct CloudClient {
    api_key: String,
    base_url: String,
    http: reqwest::Client,
}

impl CloudClient {
    /// Build from an explicit key (e.g. a `--api-key` CLI flag) or fall
    /// back to the `WEBCLAW_API_KEY` env var. Returns `None` when
    /// neither is set / both are empty.
    ///
    /// This is the function call sites should use by default — it's
    /// what both the CLI and MCP want.
    pub fn new(explicit_key: Option<&str>) -> Option<Self> {
        explicit_key
            .map(String::from)
            .or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
            .filter(|k| !k.trim().is_empty())
            .map(Self::with_key)
    }

    /// Build from `WEBCLAW_API_KEY` env only. Thin wrapper kept for
    /// readability at call sites that never accept a flag.
    pub fn from_env() -> Option<Self> {
        Self::new(None)
    }

    /// Build with an explicit key. Useful when the caller already has
    /// a key from somewhere other than env or a flag (e.g. loaded from
    /// config).
    pub fn with_key(api_key: impl Into<String>) -> Self {
        Self::with_key_and_base(api_key, API_BASE_DEFAULT)
    }

    /// Build with an explicit key and base URL. Used by integration
    /// tests and staging deployments.
    pub fn with_key_and_base(api_key: impl Into<String>, base_url: impl Into<String>) -> Self {
        let http = reqwest::Client::builder()
            .timeout(Duration::from_secs(DEFAULT_TIMEOUT_SECS))
            .build()
            .expect("reqwest client builder failed with default settings");
        Self {
            api_key: api_key.into(),
            base_url: base_url.into().trim_end_matches('/').to_string(),
            http,
        }
    }

    pub fn base_url(&self) -> &str {
        &self.base_url
    }

    /// Generic POST. Endpoint may be `"scrape"` or `"/scrape"` — we
    /// normalise the slash.
    pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, CloudError> {
        let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
        let resp = self
            .http
            .post(&url)
            .header("Authorization", format!("Bearer {}", self.api_key))
            .json(&body)
            .send()
            .await?;
        parse_cloud_response(resp).await
    }

    /// Generic GET.
    pub async fn get(&self, endpoint: &str) -> Result<Value, CloudError> {
        let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
        let resp = self
            .http
            .get(&url)
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await?;
        parse_cloud_response(resp).await
    }
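The slash normalisation in `post`/`get` above splits the work across two points: the constructor trims the base URL's trailing slash once, and each call trims the endpoint's leading slash, so `"scrape"` and `"/scrape"` hit the same URL. A standalone sketch of that joining rule (`join_endpoint` is an illustrative name):

```rust
// Both halves of the normalisation in one place for demonstration.
fn join_endpoint(base_url: &str, endpoint: &str) -> String {
    format!(
        "{}/{}",
        base_url.trim_end_matches('/'),
        endpoint.trim_start_matches('/')
    )
}

fn main() {
    let a = join_endpoint("https://api.webclaw.io/v1", "scrape");
    let b = join_endpoint("https://api.webclaw.io/v1/", "/scrape");
    // Slash or no slash, the resulting URL is identical.
    assert_eq!(a, "https://api.webclaw.io/v1/scrape");
    assert_eq!(a, b);
    println!("ok");
}
```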

    /// `POST /v1/scrape` with the caller's extraction options. This is
    /// the public "do everything" surface: the cloud side handles
    /// fetch + antibot + JS render + extraction + formatting.
    pub async fn scrape(
        &self,
        url: &str,
        formats: &[&str],
        include_selectors: &[String],
        exclude_selectors: &[String],
        only_main_content: bool,
    ) -> Result<Value, CloudError> {
        let mut body = json!({ "url": url, "formats": formats });
        if only_main_content {
            body["only_main_content"] = json!(true);
        }
        if !include_selectors.is_empty() {
            body["include_selectors"] = json!(include_selectors);
        }
        if !exclude_selectors.is_empty() {
            body["exclude_selectors"] = json!(exclude_selectors);
        }
        self.post("scrape", body).await
    }

    /// Get antibot-bypassed page data back as a synthetic HTML string.
    ///
    /// `api.webclaw.io/v1/scrape` intentionally does not return raw
    /// HTML: it returns pre-parsed `structured_data` (JSON-LD blocks)
    /// plus `metadata` (title, description, OG tags, image) plus a
    /// `markdown` body. We reassemble those into a minimal HTML doc
    /// that looks enough like the real page for our local extractor
    /// parsers to run unchanged: each JSON-LD block gets emitted as a
    /// `<script type="application/ld+json">` tag, metadata gets
    /// emitted as OG `<meta>` tags, and the markdown lands in the
    /// body. Extractors that walk JSON-LD (ecommerce_product,
    /// trustpilot_reviews, ebay_listing, etsy_listing, amazon_product)
    /// see exactly the same shapes they'd see from a live HTML fetch.
    pub async fn fetch_html(&self, url: &str) -> Result<String, CloudError> {
        let resp = self.scrape(url, &["markdown"], &[], &[], false).await?;
        Ok(synthesize_html(&resp))
    }
}

/// Reassemble a minimal HTML document from a cloud `/v1/scrape`
/// response so existing HTML-based extractor parsers can run against
/// cloud output without a separate code path.
fn synthesize_html(resp: &Value) -> String {
    let mut out = String::with_capacity(8_192);
    out.push_str("<html><head>\n");

    // Metadata → OG meta tags. Keep keys stable with what local
    // extractors read: og:title, og:description, og:image, og:site_name.
    if let Some(meta) = resp.get("metadata").and_then(|m| m.as_object()) {
        for (src_key, og_key) in [
            ("title", "title"),
            ("description", "description"),
            ("image", "image"),
            ("site_name", "site_name"),
        ] {
            if let Some(val) = meta.get(src_key).and_then(|v| v.as_str())
                && !val.is_empty()
            {
                out.push_str(&format!(
                    "<meta property=\"og:{og_key}\" content=\"{}\">\n",
                    html_escape_attr(val)
                ));
            }
        }
    }

    // Structured data blocks → <script type="application/ld+json">.
    // Serialise losslessly so extract_json_ld's parser gets the same
    // shape it would get from a real page.
    if let Some(blocks) = resp.get("structured_data").and_then(|v| v.as_array()) {
        for block in blocks {
            if let Ok(s) = serde_json::to_string(block) {
                out.push_str("<script type=\"application/ld+json\">");
                out.push_str(&s);
                out.push_str("</script>\n");
            }
        }
    }

    out.push_str("</head><body>\n");

    // Markdown body → plaintext in <body>. Extractors that regex over
    // <div> IDs won't hit here, but they won't hit on local cloud
    // bypass either. OK to keep minimal.
    if let Some(md) = resp.get("markdown").and_then(|v| v.as_str()) {
        out.push_str("<pre>");
        out.push_str(&html_escape_text(md));
        out.push_str("</pre>\n");
    }

    out.push_str("</body></html>");
    out
}
|
||||
fn html_escape_attr(s: &str) -> String {
|
||||
s.replace('&', "&")
|
||||
.replace('"', """)
|
||||
.replace('<', "<")
|
||||
.replace('>', ">")
|
||||
}
|
||||
|
||||
fn html_escape_text(s: &str) -> String {
|
||||
s.replace('&', "&")
|
||||
.replace('<', "<")
|
||||
.replace('>', ">")
|
||||
}
|
||||
|
||||
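A quick standalone sketch of the escape/unescape contract these helpers rely on (pure std; the function names here are illustrative, not the crate's): the escaper must replace `&` first, and any matching unescaper must replace `&amp;` last, otherwise a literal `&quot;` in the source text gets double-decoded.

```rust
// Sketch of the attribute-escape round trip used when synthesizing
// HTML and reading it back out of OG meta tags. Illustrative names.
fn escape_attr(s: &str) -> String {
    // `&` must go first so we don't re-escape the entities we emit.
    s.replace('&', "&amp;")
        .replace('"', "&quot;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

fn unescape_attr(s: &str) -> String {
    // `&amp;` must go last so "&amp;lt;" decodes to "&lt;", not "<".
    s.replace("&quot;", "\"")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
}

fn main() {
    let original = r#"She said "hi" & left <fast>"#;
    let escaped = escape_attr(original);
    assert_eq!(escaped, "She said &quot;hi&quot; &amp; left &lt;fast&gt;");
    assert_eq!(unescape_attr(&escaped), original);
    println!("round-trip ok");
}
```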
async fn parse_cloud_response(resp: reqwest::Response) -> Result<Value, CloudError> {
    let status = resp.status();
    if status.is_success() {
        return resp
            .json()
            .await
            .map_err(|e| CloudError::ParseFailed(e.to_string()));
    }
    let body = resp.text().await.unwrap_or_default();
    Err(CloudError::from_status_and_body(status.as_u16(), body))
}

// ---------------------------------------------------------------------------
// Detection
// ---------------------------------------------------------------------------

/// True when a fetched response body is actually a bot-protection
/// challenge page rather than the content the caller asked for.
///
/// Conservative — only fires on patterns that indicate the *entire*
/// page is a challenge, not embedded CAPTCHAs on a real content page.
pub fn is_bot_protected(html: &str, headers: &HeaderMap) -> bool {
    let html_lower = html.to_lowercase();

    // Cloudflare challenge page.
    if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
        return true;
    }

    // Cloudflare "Just a moment" / "Checking your browser" interstitial.
    if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
        && html_lower.contains("cf-spinner")
    {
        return true;
    }

    // Cloudflare Turnstile. Only counts when the page is small —
    // legitimate pages embed Turnstile for signup forms etc.
    if (html_lower.contains("cf-turnstile")
        || html_lower.contains("challenges.cloudflare.com/turnstile"))
        && html.len() < 100_000
    {
        return true;
    }

    // DataDome.
    if html_lower.contains("geo.captcha-delivery.com")
        || html_lower.contains("captcha-delivery.com/captcha")
    {
        return true;
    }

    // AWS WAF.
    if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
        return true;
    }

    // AWS WAF "Verifying your connection" interstitial (used by Trustpilot).
    // Distinct from the captcha-branded path above: the challenge page is
    // a tiny HTML shell with an `interstitial-spinner` div and no content.
    // Gating on html.len() keeps false-positives off long pages that
    // happen to mention the phrase in an unrelated context.
    if html_lower.contains("interstitial-spinner")
        && html_lower.contains("verifying your connection")
        && html.len() < 10_000
    {
        return true;
    }

    // hCaptcha *blocking* page (not just an embedded widget).
    if html_lower.contains("hcaptcha.com")
        && html_lower.contains("h-captcha")
        && html.len() < 50_000
    {
        return true;
    }

    // Cloudflare via response headers + challenge body.
    let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some();
    if has_cf_headers
        && (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
    {
        return true;
    }

    false
}

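The Turnstile branch's size gate is the part that earns the "conservative" claim: the marker alone is not enough, the page must also be small. A minimal self-contained sketch of just that heuristic (the function name is illustrative, the thresholds match the detector above):

```rust
// Size-gated Turnstile heuristic: a challenge page is a tiny shell,
// while a real page that merely embeds Turnstile (signup forms etc.)
// carries lots of other markup and clears the 100 KB gate.
fn looks_like_turnstile_challenge(html: &str) -> bool {
    let lower = html.to_lowercase();
    (lower.contains("cf-turnstile")
        || lower.contains("challenges.cloudflare.com/turnstile"))
        && html.len() < 100_000
}

fn main() {
    // A bare widget on an otherwise empty page: flagged.
    assert!(looks_like_turnstile_challenge(r#"<div class="cf-turnstile"></div>"#));

    // The same widget buried in ~160 KB of real content: not flagged.
    let real_page = format!(
        "{}<div class=\"cf-turnstile\"></div>",
        "content ".repeat(20_000)
    );
    assert!(!looks_like_turnstile_challenge(&real_page));
    println!("ok");
}
```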
/// True when a page likely needs JS rendering — a large HTML document
/// with almost no extractable text + an SPA framework signature.
pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
    let has_scripts = html.contains("<script");

    // Tier 1: almost no extractable text from a large-ish page.
    if word_count < 50 && html.len() > 5_000 && has_scripts {
        return true;
    }

    // Tier 2: SPA framework markers + low content-to-HTML ratio.
    if word_count < 800 && html.len() > 50_000 && has_scripts {
        let html_lower = html.to_lowercase();
        let has_spa_marker = html_lower.contains("react-app")
            || html_lower.contains("id=\"__next\"")
            || html_lower.contains("id=\"root\"")
            || html_lower.contains("id=\"app\"")
            || html_lower.contains("__next_data__")
            || html_lower.contains("nuxt")
            || html_lower.contains("ng-app");
        if has_spa_marker {
            return true;
        }
    }

    false
}

// ---------------------------------------------------------------------------
// Smart-fetch: classic flow for MCP / CLI (returns either an extraction
// or raw cloud JSON)
// ---------------------------------------------------------------------------

/// Result of [`smart_fetch`]: either a local extraction or the raw
/// cloud API response when we escalated.
pub enum SmartFetchResult {
    Local(Box<webclaw_core::ExtractionResult>),
    Cloud(Value),
}

/// Try local fetch + extract first. On bot protection or detected
/// JS-render, fall back to `cloud.scrape(...)` with the caller's
/// formats. Returns `Err(String)` so existing call sites that expect
/// stringified errors keep compiling.
///
/// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed
/// [`CloudError`] so you can render precise UX.
pub async fn smart_fetch(
    client: &dyn crate::fetcher::Fetcher,
    cloud: Option<&CloudClient>,
    url: &str,
    include_selectors: &[String],
    exclude_selectors: &[String],
    only_main_content: bool,
    formats: &[&str],
) -> Result<SmartFetchResult, String> {
    let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url))
        .await
        .map_err(|_| format!("Fetch timed out after 30s for {url}"))?
        .map_err(|e| format!("Fetch failed: {e}"))?;

    if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
        info!(url, "bot protection detected, falling back to cloud API");
        return cloud_scrape_fallback(
            cloud,
            url,
            include_selectors,
            exclude_selectors,
            only_main_content,
            formats,
        )
        .await;
    }

    let options = webclaw_core::ExtractionOptions {
        include_selectors: include_selectors.to_vec(),
        exclude_selectors: exclude_selectors.to_vec(),
        only_main_content,
        include_raw_html: false,
    };
    let extraction =
        webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
            .map_err(|e| format!("Extraction failed: {e}"))?;

    if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
        info!(
            url,
            word_count = extraction.metadata.word_count,
            html_len = fetch_result.html.len(),
            "JS-rendered page detected, falling back to cloud API"
        );
        return cloud_scrape_fallback(
            cloud,
            url,
            include_selectors,
            exclude_selectors,
            only_main_content,
            formats,
        )
        .await;
    }

    Ok(SmartFetchResult::Local(Box::new(extraction)))
}

async fn cloud_scrape_fallback(
    cloud: Option<&CloudClient>,
    url: &str,
    include_selectors: &[String],
    exclude_selectors: &[String],
    only_main_content: bool,
    formats: &[&str],
) -> Result<SmartFetchResult, String> {
    let Some(c) = cloud else {
        return Err(CloudError::NotConfigured.to_string());
    };
    let resp = c
        .scrape(
            url,
            formats,
            include_selectors,
            exclude_selectors,
            only_main_content,
        )
        .await
        .map_err(|e| e.to_string())?;
    info!(url, "cloud API fallback successful");
    Ok(SmartFetchResult::Cloud(resp))
}

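Stripped of the fetch and extraction plumbing, the escalation policy in `smart_fetch` reduces to a two-input decision table. A sketch with everything else stubbed out (names here are illustrative, not part of the crate's API):

```rust
// Escalation decision table: either detector firing routes the
// request to the cloud; otherwise the local extraction is returned.
#[derive(Debug, PartialEq)]
enum Route {
    Local,
    Cloud,
}

fn route(bot_protected: bool, needs_js: bool) -> Route {
    if bot_protected || needs_js {
        Route::Cloud
    } else {
        Route::Local
    }
}

fn main() {
    assert_eq!(route(false, false), Route::Local);
    assert_eq!(route(true, false), Route::Cloud);
    assert_eq!(route(false, true), Route::Cloud);
    assert_eq!(route(true, true), Route::Cloud);
    println!("ok");
}
```

Note that bot protection is checked before extraction even runs, while the JS-render check needs the extraction's word count, so the two branches sit at different points in the real function.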
// ---------------------------------------------------------------------------
// Smart-fetch-HTML: for vertical extractors
// ---------------------------------------------------------------------------

/// Where the HTML ultimately came from — useful for callers that want
/// to track "did we fall back?" for logging or pricing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchSource {
    Local,
    Cloud,
}

/// Antibot-aware HTML fetch result. The `html` field is always populated.
pub struct FetchedHtml {
    pub html: String,
    pub final_url: String,
    pub source: FetchSource,
}

/// Try local fetch; on bot protection, escalate to the cloud's
/// `/v1/scrape` with `formats=["html"]` and return the raw HTML.
///
/// Designed for the vertical-extractor pattern where the caller has
/// its own parser and just needs bytes.
pub async fn smart_fetch_html(
    client: &dyn crate::fetcher::Fetcher,
    cloud: Option<&CloudClient>,
    url: &str,
) -> Result<FetchedHtml, CloudError> {
    let resp = client
        .fetch(url)
        .await
        .map_err(|e| CloudError::Network(e.to_string()))?;

    if !is_bot_protected(&resp.html, &resp.headers) {
        return Ok(FetchedHtml {
            html: resp.html,
            final_url: resp.url,
            source: FetchSource::Local,
        });
    }

    let Some(c) = cloud else {
        warn!(url, "bot protection detected + no cloud client configured");
        return Err(CloudError::NotConfigured);
    };
    debug!(url, "bot protection detected, escalating to cloud");
    let html = c.fetch_html(url).await?;
    Ok(FetchedHtml {
        html,
        final_url: url.to_string(),
        source: FetchSource::Cloud,
    })
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;

    fn empty_headers() -> HeaderMap {
        HeaderMap::new()
    }

    // --- detectors ----------------------------------------------------------

    #[test]
    fn is_bot_protected_detects_cloudflare_challenge() {
        let html = "<html><body>_cf_chl_opt loaded</body></html>";
        assert!(is_bot_protected(html, &empty_headers()));
    }

    #[test]
    fn is_bot_protected_detects_turnstile_on_short_page() {
        let html = "<div class=\"cf-turnstile\"></div>";
        assert!(is_bot_protected(html, &empty_headers()));
    }

    #[test]
    fn is_bot_protected_ignores_turnstile_on_real_content() {
        let html = format!(
            "<html><body>{}<div class=\"cf-turnstile\"></div></body></html>",
            "lots of real content ".repeat(8_000)
        );
        assert!(!is_bot_protected(&html, &empty_headers()));
    }

    #[test]
    fn is_bot_protected_detects_aws_waf_verifying_connection() {
        // The exact shape Trustpilot serves under AWS WAF.
        let html = r#"<div class="container"><div id="loading-state">
<div class="interstitial-spinner" id="spinner"></div>
<h1>Verifying your connection...</h1></div></div>"#;
        assert!(is_bot_protected(html, &empty_headers()));
    }

    #[test]
    fn synthesize_html_embeds_jsonld_and_og_tags() {
        let resp = json!({
            "url": "https://example.com/p/1",
            "metadata": {
                "title": "My Product",
                "description": "A nice thing.",
                "image": "https://cdn.example.com/1.jpg",
                "site_name": "Example Shop"
            },
            "structured_data": [
                {"@context":"https://schema.org","@type":"Product",
                 "name":"Widget","offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
            ],
            "markdown": "# Widget\n\nA nice widget."
        });
        let html = synthesize_html(&resp);
        // OG tags from metadata.
        assert!(html.contains(r#"<meta property="og:title" content="My Product">"#));
        assert!(
            html.contains(r#"<meta property="og:image" content="https://cdn.example.com/1.jpg">"#)
        );
        // JSON-LD block preserved losslessly.
        assert!(html.contains(r#"<script type="application/ld+json">"#));
        assert!(html.contains(r#""@type":"Product""#));
        assert!(html.contains(r#""price":"9.99""#));
        // Body carries markdown.
        assert!(html.contains("A nice widget."));
    }

    #[test]
    fn synthesize_html_handles_missing_fields_gracefully() {
        let resp = json!({"url": "https://example.com", "metadata": {}});
        let html = synthesize_html(&resp);
        // No panic, no stray unclosed tags.
        assert!(html.starts_with("<html><head>"));
        assert!(html.ends_with("</body></html>"));
    }

    #[test]
    fn synthesize_html_escapes_attribute_quotes() {
        let resp = json!({
            "metadata": {"title": r#"She said "hi""#}
        });
        let html = synthesize_html(&resp);
        assert!(html.contains(r#"og:title" content="She said &quot;hi&quot;""#));
    }

    #[test]
    fn is_bot_protected_ignores_phrase_on_real_content() {
        // A real article that happens to mention the phrase in prose
        // should not trigger the short-page detector.
        let html = format!(
            "<html><body>{}<p>Verifying your connection is tricky.</p></body></html>",
            "article text ".repeat(2_000)
        );
        assert!(!is_bot_protected(&html, &empty_headers()));
    }

    #[test]
    fn needs_js_rendering_flags_spa_skeleton() {
        let html = format!(
            "<html><body><div id=\"__next\"></div>{}</body></html>",
            "<script>x</script>".repeat(500)
        );
        assert!(needs_js_rendering(10, &html));
    }

    #[test]
    fn needs_js_rendering_passes_real_article() {
        let html = format!(
            "<html><body>{}<script>x</script></body></html>",
            "Real article text ".repeat(5_000)
        );
        assert!(!needs_js_rendering(5_000, &html));
    }

    // --- CloudError mapping -------------------------------------------------

    #[test]
    fn cloud_error_maps_401() {
        let e = CloudError::from_status_and_body(401, "invalid key".into());
        assert!(matches!(e, CloudError::Unauthorized));
        assert!(e.to_string().contains(KEYS_URL));
    }

    #[test]
    fn cloud_error_maps_402() {
        let e = CloudError::from_status_and_body(402, "{}".into());
        assert!(matches!(e, CloudError::InsufficientPlan));
        assert!(e.to_string().contains(PRICING_URL));
    }

    #[test]
    fn cloud_error_maps_429() {
        let e = CloudError::from_status_and_body(429, "slow down".into());
        assert!(matches!(e, CloudError::RateLimited));
        assert!(e.to_string().contains(PRICING_URL));
    }

    #[test]
    fn cloud_error_maps_generic_5xx() {
        let e = CloudError::from_status_and_body(503, "x".repeat(2000));
        match e {
            CloudError::ServerError { status, body } => {
                assert_eq!(status, 503);
                assert!(body.len() <= 500);
            }
            _ => panic!("expected ServerError"),
        }
    }

    #[test]
    fn not_configured_error_points_at_signup() {
        let msg = CloudError::NotConfigured.to_string();
        assert!(msg.contains(SIGNUP_URL));
        assert!(msg.contains("WEBCLAW_API_KEY"));
    }

    // --- CloudClient construction ------------------------------------------

    #[test]
    fn cloud_client_explicit_key_wins_over_env() {
        // SAFETY: this test mutates process env. Serial tests only.
        // Set env to something, pass an explicit key, explicit should win.
        // (We don't actually *call* the API, just check the struct stored
        // the right key.)
        // rustc std::env::set_var is unsafe in newer toolchains.
        unsafe {
            std::env::set_var("WEBCLAW_API_KEY", "from-env");
        }
        let client = CloudClient::new(Some("from-flag")).expect("client built");
        assert_eq!(client.api_key, "from-flag");
        unsafe {
            std::env::remove_var("WEBCLAW_API_KEY");
        }
    }

    #[test]
    fn cloud_client_none_when_empty() {
        unsafe {
            std::env::remove_var("WEBCLAW_API_KEY");
        }
        assert!(CloudClient::new(None).is_none());
        assert!(CloudClient::new(Some("")).is_none());
        assert!(CloudClient::new(Some(" ")).is_none());
    }

    #[test]
    fn cloud_client_base_url_strips_trailing_slash() {
        let c = CloudClient::with_key_and_base("k", "https://api.example.com/v1/");
        assert_eq!(c.base_url(), "https://api.example.com/v1");
    }

    #[test]
    fn truncate_respects_char_boundaries() {
        // Ensure we don't slice inside a multi-byte char.
        let s = "a".repeat(10) + "é"; // é is 2 bytes
        let out = truncate(&s, 11);
        assert_eq!(out.chars().count(), 11);
    }
}
452	crates/webclaw-fetch/src/extractors/amazon_product.rs	(new file)
@@ -0,0 +1,452 @@
//! Amazon product detail page extractor.
//!
//! Amazon product pages (`/dp/{ASIN}/` on every locale) are
//! inconsistently protected. Sometimes our local TLS fingerprint gets
//! a real HTML page; sometimes we land on a CAPTCHA interstitial;
//! sometimes we land on a real page that for whatever reason ships
//! no Product JSON-LD (Amazon A/B-tests this regularly). So the
//! extractor has a two-stage fallback:
//!
//! 1. Try local fetch + parse. If we got Product JSON-LD back, great:
//!    we have everything (title, brand, price, availability, rating).
//! 2. If local fetch worked *but the page has no Product JSON-LD* AND
//!    a cloud client is configured, force-escalate to api.webclaw.io.
//!    Cloud's render + antibot pipeline reliably surfaces the
//!    structured data. Without a cloud client we return whatever we
//!    got from local (usually just title via `#productTitle` or OG
//!    meta tags).
//!
//! Parsing tries JSON-LD first, DOM regex (`#productTitle`,
//! `#landingImage`) second, OG `<meta>` tags third. The OG path
//! matters because the cloud's synthesized HTML ships metadata as
//! OG tags but lacks Amazon's DOM IDs.
//!
//! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/`
//! path. ASINs are a stable Amazon identifier so we extract that as
//! part of the response even when everything else is empty (tells
//! callers the URL was at least recognised).

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "amazon_product",
    label: "Amazon product",
    description: "Returns product detail: title, brand, price, currency, availability, rating, image, ASIN. Works best with WEBCLAW_API_KEY — Amazon's antibot means extraction usually escalates through the cloud.",
    url_patterns: &[
        "https://www.amazon.com/dp/{ASIN}",
        "https://www.amazon.co.uk/dp/{ASIN}",
        "https://www.amazon.de/dp/{ASIN}",
        "https://www.amazon.fr/dp/{ASIN}",
        "https://www.amazon.it/dp/{ASIN}",
        "https://www.amazon.es/dp/{ASIN}",
        "https://www.amazon.co.jp/dp/{ASIN}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !is_amazon_host(host) {
        return false;
    }
    parse_asin(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let asin = parse_asin(url)
        .ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;

    let mut fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    // Amazon ships Product JSON-LD inconsistently even on non-CAPTCHA
    // pages (they A/B-test it). When local fetch succeeded but has no
    // Product JSON-LD, force-escalate to the cloud which runs the
    // render pipeline and reliably surfaces structured data. No-op
    // when cloud isn't configured — we return whatever local gave us.
    if fetched.source == cloud::FetchSource::Local
        && find_product_jsonld(&fetched.html).is_none()
        && let Some(c) = client.cloud()
    {
        match c.fetch_html(url).await {
            Ok(cloud_html) => {
                fetched = cloud::FetchedHtml {
                    html: cloud_html,
                    final_url: url.to_string(),
                    source: cloud::FetchSource::Cloud,
                };
            }
            Err(e) => {
                tracing::debug!(
                    error = %e,
                    "amazon_product: cloud escalation failed, keeping local"
                );
            }
        }
    }

    let mut data = parse(&fetched.html, url, &asin);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

/// Pure parser. Given HTML (from anywhere — direct, cloud, or a fixture
/// file) and the source URL, extract Amazon product detail. Returns a
/// `Value` rather than a typed struct so callers can pass it through
/// without carrying webclaw_fetch types.
pub fn parse(html: &str, url: &str, asin: &str) -> Value {
    let jsonld = find_product_jsonld(html);
    // Three-tier title: JSON-LD `name` > Amazon's `#productTitle` span
    // (only present on real static HTML) > cloud-synthesized og:title.
    let title = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "name"))
        .or_else(|| dom_title(html))
        .or_else(|| og(html, "title"));
    let image = jsonld
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| dom_image(html))
        .or_else(|| og(html, "image"));
    let brand = jsonld.as_ref().and_then(get_brand);
    let description = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description"));
    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
    let offer = jsonld.as_ref().and_then(first_offer);

    let sku = jsonld.as_ref().and_then(|v| get_text(v, "sku"));
    let mpn = jsonld.as_ref().and_then(|v| get_text(v, "mpn"));

    json!({
        "url": url,
        "asin": asin,
        "title": title,
        "brand": brand,
        "description": description,
        "image": image,
        "price": offer.as_ref().and_then(|o| get_text(o, "price")),
        "currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
        "availability": offer.as_ref().and_then(|o| {
            get_text(o, "availability").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "condition": offer.as_ref().and_then(|o| {
            get_text(o, "itemCondition").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "sku": sku,
        "mpn": mpn,
        "aggregate_rating": aggregate_rating,
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn is_amazon_host(host: &str) -> bool {
    host.starts_with("www.amazon.") || host.starts_with("amazon.")
}

/// Pull a 10-char ASIN out of any recognised Amazon URL shape:
/// - /dp/{ASIN}
/// - /gp/product/{ASIN}
/// - /product/{ASIN}
/// - /exec/obidos/ASIN/{ASIN}
fn parse_asin(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r"/(?:dp|gp/product|product|ASIN)/([A-Z0-9]{10})(?:[/?#]|$)").unwrap()
    });
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

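The regex above encodes three constraints: a known path prefix, exactly ten `[A-Z0-9]` characters, and a separator (or end of string) right after. A dependency-free sketch of the same logic in plain std (the function name is illustrative; the real extractor uses the regex):

```rust
// Pure-std equivalent of the ASIN regex: known prefix, then ten
// uppercase-alphanumeric chars, then a path/query/fragment separator
// or end of string. Mirrors /(?:dp|gp/product|product|ASIN)/...
fn asin_of(url: &str) -> Option<String> {
    for prefix in ["/dp/", "/gp/product/", "/product/", "/ASIN/"] {
        if let Some(pos) = url.find(prefix) {
            let rest = &url[pos + prefix.len()..];
            let cand: String = rest.chars().take(10).collect();
            // ASINs are ASCII, so byte length == char count here.
            let ok_tail = rest[cand.len()..]
                .chars()
                .next()
                .map_or(true, |c| matches!(c, '/' | '?' | '#'));
            if cand.len() == 10
                && cand
                    .chars()
                    .all(|c| c.is_ascii_uppercase() || c.is_ascii_digit())
                && ok_tail
            {
                return Some(cand);
            }
        }
    }
    None
}

fn main() {
    assert_eq!(
        asin_of("https://www.amazon.com/dp/B0CHX1W1XY?psc=1"),
        Some("B0CHX1W1XY".to_string())
    );
    assert_eq!(asin_of("https://www.amazon.com/gp/cart"), None);
    println!("ok");
}
```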
// ---------------------------------------------------------------------------
// JSON-LD walkers — light reuse of ecommerce_product's style
// ---------------------------------------------------------------------------

fn find_product_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_product_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
    match t {
        Value::String(s) => is_prod(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    brand
        .as_object()
        .and_then(|o| o.get("name"))
        .and_then(|n| n.as_str())
        .map(String::from)
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn first_offer(v: &Value) -> Option<Value> {
    let offers = v.get("offers")?;
    match offers {
        Value::Array(arr) => arr.first().cloned(),
        Value::Object(_) => Some(offers.clone()),
        _ => None,
    }
}

fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "review_count": get_text(r, "reviewCount"),
        "best_rating": get_text(r, "bestRating"),
    }))
}

// ---------------------------------------------------------------------------
// DOM fallbacks — cheap regex for the two fields most likely to be
// missing from JSON-LD on Amazon.
// ---------------------------------------------------------------------------

fn dom_title(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"(?s)id="productTitle"[^>]*>([^<]+)<"#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().trim().to_string())
}

fn dom_image(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"id="landingImage"[^>]+src="([^"]+)""#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// OG meta tag lookup. Cloud-synthesized HTML ships these even when
/// JSON-LD and Amazon-DOM-IDs are both absent, so they're the last
/// line of defence for `title`, `image`, `description`.
fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| html_unescape(m.as_str()));
        }
    }
    None
}

/// Undo the synthesize_html attribute escaping for the few entities it
/// emits. Keeps us off a heavier HTML-entity dep. `&amp;` is decoded
/// last so a source text containing a literal entity survives intact.
fn html_unescape(s: &str) -> String {
    s.replace("&quot;", "\"")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_multi_locale() {
        assert!(matches("https://www.amazon.com/dp/B0CHX1W1XY"));
        assert!(matches("https://www.amazon.co.uk/dp/B0CHX1W1XY/"));
        assert!(matches("https://www.amazon.de/dp/B0CHX1W1XY?psc=1"));
        assert!(matches(
            "https://www.amazon.com/gp/product/B0CHX1W1XY/ref=foo"
        ));
    }

    #[test]
    fn rejects_non_product_urls() {
        assert!(!matches("https://www.amazon.com/"));
        assert!(!matches("https://www.amazon.com/gp/cart"));
        assert!(!matches("https://example.com/dp/B0CHX1W1XY"));
    }

    #[test]
    fn parse_asin_extracts_from_multiple_shapes() {
        assert_eq!(
            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY/"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY?psc=1"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/gp/product/B0CHX1W1XY/ref=bar"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/exec/obidos/ASIN/B0CHX1W1XY/baz"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(parse_asin("https://www.amazon.com/"), None);
    }

    #[test]
    fn parse_extracts_from_fixture_jsonld() {
        // Minimal Amazon-style fixture with a Product JSON-LD block.
        let html = r##"
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"ACME Widget","sku":"B0CHX1W1XY",
"brand":{"@type":"Brand","name":"ACME"},
"image":"https://m.media-amazon.com/images/I/abc.jpg",
"offers":{"@type":"Offer","price":"19.99","priceCurrency":"USD",
"availability":"https://schema.org/InStock"},
"aggregateRating":{"@type":"AggregateRating","ratingValue":"4.6","reviewCount":"1234"}}
</script>
</head><body></body></html>"##;
        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
        assert_eq!(v["asin"], "B0CHX1W1XY");
        assert_eq!(v["title"], "ACME Widget");
        assert_eq!(v["brand"], "ACME");
        assert_eq!(v["price"], "19.99");
        assert_eq!(v["currency"], "USD");
        assert_eq!(v["availability"], "InStock");
        assert_eq!(v["aggregate_rating"]["rating_value"], "4.6");
        assert_eq!(v["aggregate_rating"]["review_count"], "1234");
    }

    #[test]
    fn parse_falls_back_to_dom_when_jsonld_missing_fields() {
        let html = r#"
<html><body>
<span id="productTitle">Fallback Title</span>
<img id="landingImage" src="https://m.media-amazon.com/images/I/fallback.jpg" />
</body></html>
|
||||
"#;
|
||||
let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
|
||||
assert_eq!(v["title"], "Fallback Title");
|
||||
assert_eq!(
|
||||
v["image"],
|
||||
"https://m.media-amazon.com/images/I/fallback.jpg"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_falls_back_to_og_meta_when_no_jsonld_no_dom() {
|
||||
// Shape we see from the cloud synthesize_html path: OG tags
|
||||
// only, no JSON-LD, no Amazon DOM IDs.
|
||||
let html = r##"<html><head>
|
||||
<meta property="og:title" content="Cloud-sourced MacBook Pro">
|
||||
<meta property="og:image" content="https://m.media-amazon.com/images/I/cloud.jpg">
|
||||
<meta property="og:description" content="Via api.webclaw.io">
|
||||
</head></html>"##;
|
||||
let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
|
||||
assert_eq!(v["title"], "Cloud-sourced MacBook Pro");
|
||||
assert_eq!(v["image"], "https://m.media-amazon.com/images/I/cloud.jpg");
|
||||
assert_eq!(v["description"], "Via api.webclaw.io");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn og_unescape_handles_quot_entity() {
|
||||
let html = r#"<meta property="og:title" content="Apple "M2 Pro" Laptop">"#;
|
||||
assert_eq!(
|
||||
og(html, "title").as_deref(),
|
||||
Some(r#"Apple "M2 Pro" Laptop"#)
|
||||
);
|
||||
}
|
||||
}
|
||||
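A note on entity-unescape ordering: decoding `&amp;` before the other entities double-unescapes doubly-escaped input. This std-only sketch (a standalone `unescape`, not the crate's helper) shows the safe order:

```rust
// Standalone sketch of minimal entity unescaping. `&amp;` is decoded
// last so that doubly-escaped input ("&amp;lt;") yields the once-escaped
// text ("&lt;") instead of being collapsed all the way to "<".
fn unescape(s: &str) -> String {
    s.replace("&quot;", "\"")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
}

fn main() {
    assert_eq!(unescape("&amp;lt;"), "&lt;");
    assert_eq!(unescape("a &quot;b&quot; &amp; c"), "a \"b\" & c");
    println!("ok");
}
```

Swapping the `&amp;` replacement to the front would turn `&amp;lt;` into `<`, silently corrupting titles that legitimately contain escaped markup.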
314  crates/webclaw-fetch/src/extractors/arxiv.rs  Normal file
@@ -0,0 +1,314 @@
//! ArXiv paper structured extractor.
//!
//! Uses the public ArXiv API at `export.arxiv.org/api/query?id_list={id}`
//! which returns Atom XML. We parse just enough to surface title, authors,
//! abstract, categories, and the canonical PDF link. No HTML scraping
//! required and no auth.

use quick_xml::Reader;
use quick_xml::events::Event;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "arxiv",
    label: "ArXiv paper",
    description: "Returns paper metadata: title, authors, abstract, categories, primary category, PDF URL.",
    url_patterns: &[
        "https://arxiv.org/abs/{id}",
        "https://arxiv.org/abs/{id}v{n}",
        "https://arxiv.org/pdf/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "arxiv.org" && host != "www.arxiv.org" {
        return false;
    }
    url.contains("/abs/") || url.contains("/pdf/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let id = parse_id(url)
        .ok_or_else(|| FetchError::Build(format!("arxiv: cannot parse id from '{url}'")))?;

    let api_url = format!("https://export.arxiv.org/api/query?id_list={id}");
    let resp = client.fetch(&api_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "arxiv api returned status {}",
            resp.status
        )));
    }

    let entry = parse_atom_entry(&resp.html)
        .ok_or_else(|| FetchError::BodyDecode("arxiv: no <entry> in response".into()))?;
    if entry.title.is_none() && entry.summary.is_none() {
        return Err(FetchError::BodyDecode(format!(
            "arxiv: paper '{id}' returned empty entry (likely withdrawn or invalid id)"
        )));
    }

    Ok(json!({
        "url": url,
        "id": id,
        "arxiv_id": entry.id,
        "title": entry.title,
        "authors": entry.authors,
        "abstract": entry.summary.map(|s| collapse_whitespace(&s)),
        "published": entry.published,
        "updated": entry.updated,
        "primary_category": entry.primary_category,
        "categories": entry.categories,
        "doi": entry.doi,
        "comment": entry.comment,
        "pdf_url": entry.pdf_url,
        "abs_url": entry.abs_url,
    }))
}

// ---------------------------------------------------------------------------
// Helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Parse an arxiv id from a URL. Strips the version suffix (`v2`, `v3`)
/// and the `.pdf` extension when present.
fn parse_id(url: &str) -> Option<String> {
    let after = url
        .split("/abs/")
        .nth(1)
        .or_else(|| url.split("/pdf/").nth(1))?;
    let stripped = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .trim_end_matches(".pdf");
    // Strip optional version suffix, e.g. "2401.12345v2" → "2401.12345"
    let no_version = match stripped.rfind('v') {
        Some(i) if stripped[i + 1..].chars().all(|c| c.is_ascii_digit()) => &stripped[..i],
        _ => stripped,
    };
    if no_version.is_empty() {
        None
    } else {
        Some(no_version.to_string())
    }
}

fn collapse_whitespace(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

#[derive(Default)]
struct AtomEntry {
    id: Option<String>,
    title: Option<String>,
    summary: Option<String>,
    published: Option<String>,
    updated: Option<String>,
    primary_category: Option<String>,
    categories: Vec<String>,
    authors: Vec<String>,
    doi: Option<String>,
    comment: Option<String>,
    pdf_url: Option<String>,
    abs_url: Option<String>,
}

/// Parse the first `<entry>` block of an ArXiv Atom feed.
fn parse_atom_entry(xml: &str) -> Option<AtomEntry> {
    let mut reader = Reader::from_str(xml);
    let mut buf = Vec::new();

    // States
    let mut in_entry = false;
    let mut current: Option<&'static str> = None;
    let mut in_author = false;
    let mut in_author_name = false;
    let mut entry = AtomEntry::default();

    loop {
        match reader.read_event_into(&mut buf) {
            Ok(Event::Start(ref e)) => {
                let local = e.local_name();
                match local.as_ref() {
                    b"entry" => in_entry = true,
                    b"id" if in_entry && !in_author => current = Some("id"),
                    b"title" if in_entry => current = Some("title"),
                    b"summary" if in_entry => current = Some("summary"),
                    b"published" if in_entry => current = Some("published"),
                    b"updated" if in_entry => current = Some("updated"),
                    b"author" if in_entry => in_author = true,
                    b"name" if in_author => {
                        in_author_name = true;
                        current = Some("author_name");
                    }
                    b"category" if in_entry => {
                        // primary_category is namespaced (arxiv:primary_category),
                        // category is plain. quick-xml gives us local-name only,
                        // so we treat both as categories and take the first as
                        // primary.
                        for attr in e.attributes().flatten() {
                            if attr.key.as_ref() == b"term"
                                && let Ok(v) = attr.unescape_value()
                            {
                                let term = v.to_string();
                                if entry.primary_category.is_none() {
                                    entry.primary_category = Some(term.clone());
                                }
                                entry.categories.push(term);
                            }
                        }
                    }
                    b"link" if in_entry => {
                        let mut href = None;
                        let mut rel = None;
                        let mut typ = None;
                        for attr in e.attributes().flatten() {
                            match attr.key.as_ref() {
                                b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()),
                                b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()),
                                b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()),
                                _ => {}
                            }
                        }
                        if let Some(h) = href {
                            if typ.as_deref() == Some("application/pdf") {
                                entry.pdf_url = Some(h.clone());
                            }
                            if rel.as_deref() == Some("alternate") {
                                entry.abs_url = Some(h);
                            }
                        }
                    }
                    _ => current = None,
                }
            }
            Ok(Event::Empty(ref e)) => {
                // Self-closing tags (<link href="..." />). Same handling as Start.
                let local = e.local_name();
                if (local.as_ref() == b"link" || local.as_ref() == b"category") && in_entry {
                    let mut href = None;
                    let mut rel = None;
                    let mut typ = None;
                    let mut term = None;
                    for attr in e.attributes().flatten() {
                        match attr.key.as_ref() {
                            b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()),
                            b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()),
                            b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()),
                            b"term" => term = attr.unescape_value().ok().map(|s| s.to_string()),
                            _ => {}
                        }
                    }
                    if let Some(t) = term {
                        if entry.primary_category.is_none() {
                            entry.primary_category = Some(t.clone());
                        }
                        entry.categories.push(t);
                    }
                    if let Some(h) = href {
                        if typ.as_deref() == Some("application/pdf") {
                            entry.pdf_url = Some(h.clone());
                        }
                        if rel.as_deref() == Some("alternate") {
                            entry.abs_url = Some(h);
                        }
                    }
                }
            }
            Ok(Event::Text(ref e)) => {
                if let (Some(field), Ok(text)) = (current, e.unescape()) {
                    let text = text.to_string();
                    match field {
                        "id" => entry.id = Some(text.trim().to_string()),
                        "title" => entry.title = append_text(entry.title.take(), &text),
                        "summary" => entry.summary = append_text(entry.summary.take(), &text),
                        "published" => entry.published = Some(text.trim().to_string()),
                        "updated" => entry.updated = Some(text.trim().to_string()),
                        "author_name" => entry.authors.push(text.trim().to_string()),
                        _ => {}
                    }
                }
            }
            Ok(Event::End(ref e)) => {
                let local = e.local_name();
                match local.as_ref() {
                    b"entry" => break,
                    b"author" => in_author = false,
                    b"name" => in_author_name = false,
                    _ => {}
                }
                if !in_author_name {
                    current = None;
                }
            }
            Ok(Event::Eof) => break,
            Err(_) => return None,
            _ => {}
        }
        buf.clear();
    }

    if in_entry { Some(entry) } else { None }
}

/// Concatenate text fragments (long fields can be split across multiple
/// text events if they contain entities or CDATA).
fn append_text(prev: Option<String>, next: &str) -> Option<String> {
    match prev {
        Some(mut s) => {
            s.push_str(next);
            Some(s)
        }
        None => Some(next.to_string()),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_arxiv_urls() {
        assert!(matches("https://arxiv.org/abs/2401.12345"));
        assert!(matches("https://arxiv.org/abs/2401.12345v2"));
        assert!(matches("https://arxiv.org/pdf/2401.12345.pdf"));
        assert!(!matches("https://arxiv.org/"));
        assert!(!matches("https://example.com/abs/foo"));
    }

    #[test]
    fn parse_id_strips_version_and_extension() {
        assert_eq!(
            parse_id("https://arxiv.org/abs/2401.12345"),
            Some("2401.12345".into())
        );
        assert_eq!(
            parse_id("https://arxiv.org/abs/2401.12345v3"),
            Some("2401.12345".into())
        );
        assert_eq!(
            parse_id("https://arxiv.org/pdf/2401.12345v2.pdf"),
            Some("2401.12345".into())
        );
    }

    #[test]
    fn collapse_whitespace_handles_newlines_and_tabs() {
        assert_eq!(collapse_whitespace("a b\n\tc "), "a b c");
    }
}
168  crates/webclaw-fetch/src/extractors/crates_io.rs  Normal file
@@ -0,0 +1,168 @@
//! crates.io structured extractor.
//!
//! Uses the public JSON API at `crates.io/api/v1/crates/{name}`. No
//! auth, no rate limit at normal usage. The response includes both
//! the crate metadata and the full version list, which we summarize
//! down to a count + latest release info to keep the payload small.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "crates_io",
    label: "crates.io package",
    description: "Returns crate metadata: latest version, dependencies, downloads, license, repository.",
    url_patterns: &[
        "https://crates.io/crates/{name}",
        "https://crates.io/crates/{name}/{version}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "crates.io" && host != "www.crates.io" {
        return false;
    }
    url.contains("/crates/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let name = parse_name(url)
        .ok_or_else(|| FetchError::Build(format!("crates.io: cannot parse name from '{url}'")))?;

    let api_url = format!("https://crates.io/api/v1/crates/{name}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "crates.io: crate '{name}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "crates.io api returned status {}",
            resp.status
        )));
    }

    let body: CratesResponse = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("crates.io parse: {e}")))?;

    let c = body.crate_;
    let latest_version = body
        .versions
        .iter()
        .find(|v| !v.yanked.unwrap_or(false))
        .or_else(|| body.versions.first());

    Ok(json!({
        "url": url,
        "name": c.id,
        "description": c.description,
        "homepage": c.homepage,
        "documentation": c.documentation,
        "repository": c.repository,
        "max_stable_version": c.max_stable_version,
        "max_version": c.max_version,
        "newest_version": c.newest_version,
        "downloads": c.downloads,
        "recent_downloads": c.recent_downloads,
        "categories": c.categories,
        "keywords": c.keywords,
        "release_count": body.versions.len(),
        "latest_release_date": latest_version.and_then(|v| v.created_at.clone()),
        "latest_license": latest_version.and_then(|v| v.license.clone()),
        "latest_rust_version": latest_version.and_then(|v| v.rust_version.clone()),
        "latest_yanked": latest_version.and_then(|v| v.yanked),
        "created_at": c.created_at,
        "updated_at": c.updated_at,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_name(url: &str) -> Option<String> {
    let after = url.split("/crates/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let first = stripped.split('/').find(|s| !s.is_empty())?;
    Some(first.to_string())
}

// ---------------------------------------------------------------------------
// crates.io API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct CratesResponse {
    #[serde(rename = "crate")]
    crate_: CrateInfo,
    #[serde(default)]
    versions: Vec<VersionInfo>,
}

#[derive(Deserialize)]
struct CrateInfo {
    id: Option<String>,
    description: Option<String>,
    homepage: Option<String>,
    documentation: Option<String>,
    repository: Option<String>,
    max_stable_version: Option<String>,
    max_version: Option<String>,
    newest_version: Option<String>,
    downloads: Option<i64>,
    recent_downloads: Option<i64>,
    #[serde(default)]
    categories: Vec<String>,
    #[serde(default)]
    keywords: Vec<String>,
    created_at: Option<String>,
    updated_at: Option<String>,
}

#[derive(Deserialize)]
struct VersionInfo {
    license: Option<String>,
    rust_version: Option<String>,
    yanked: Option<bool>,
    created_at: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_crate_pages() {
        assert!(matches("https://crates.io/crates/serde"));
        assert!(matches("https://crates.io/crates/tokio/1.45.0"));
        assert!(!matches("https://crates.io/"));
        assert!(!matches("https://example.com/crates/foo"));
    }

    #[test]
    fn parse_name_handles_versioned_urls() {
        assert_eq!(
            parse_name("https://crates.io/crates/serde"),
            Some("serde".into())
        );
        assert_eq!(
            parse_name("https://crates.io/crates/tokio/1.45.0"),
            Some("tokio".into())
        );
        assert_eq!(
            parse_name("https://crates.io/crates/scraper/?foo=bar"),
            Some("scraper".into())
        );
    }
}
188  crates/webclaw-fetch/src/extractors/dev_to.rs  Normal file
@@ -0,0 +1,188 @@
//! dev.to article structured extractor.
//!
//! `dev.to/api/articles/{username}/{slug}` returns the full article body,
//! tags, reaction count, comment count, and reading time. Anonymous
//! access works fine for published posts.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "dev_to",
    label: "dev.to article",
    description: "Returns article metadata + body: title, body markdown, tags, reactions, comments, reading time.",
    url_patterns: &["https://dev.to/{username}/{slug}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "dev.to" && host != "www.dev.to" {
        return false;
    }
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // Need exactly /{username}/{slug}, with a non-reserved first segment.
    segs.len() == 2 && !RESERVED_FIRST_SEGS.contains(&segs[0])
}

const RESERVED_FIRST_SEGS: &[&str] = &[
    "api",
    "tags",
    "search",
    "settings",
    "enter",
    "signup",
    "about",
    "code-of-conduct",
    "privacy",
    "terms",
    "contact",
    "sponsorships",
    "sponsors",
    "shop",
    "videos",
    "listings",
    "podcasts",
    "p",
    "t",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (username, slug) = parse_username_slug(url).ok_or_else(|| {
        FetchError::Build(format!("dev_to: cannot parse username/slug from '{url}'"))
    })?;

    let api_url = format!("https://dev.to/api/articles/{username}/{slug}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "dev_to: article '{username}/{slug}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "dev.to api returned status {}",
            resp.status
        )));
    }

    let a: Article = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("dev.to parse: {e}")))?;

    Ok(json!({
        "url": url,
        "id": a.id,
        "title": a.title,
        "description": a.description,
        "body_markdown": a.body_markdown,
        "url_canonical": a.canonical_url,
        "published_at": a.published_at,
        "edited_at": a.edited_at,
        "reading_time_min": a.reading_time_minutes,
        "tags": a.tag_list,
        "positive_reactions": a.positive_reactions_count,
        "public_reactions": a.public_reactions_count,
        "comments_count": a.comments_count,
        "page_views_count": a.page_views_count,
        "cover_image": a.cover_image,
        "author": json!({
            "username": a.user.as_ref().and_then(|u| u.username.clone()),
            "name": a.user.as_ref().and_then(|u| u.name.clone()),
            "twitter": a.user.as_ref().and_then(|u| u.twitter_username.clone()),
            "github": a.user.as_ref().and_then(|u| u.github_username.clone()),
            "website": a.user.as_ref().and_then(|u| u.website_url.clone()),
        }),
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_username_slug(url: &str) -> Option<(String, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let username = segs.next()?;
    let slug = segs.next()?;
    Some((username.to_string(), slug.to_string()))
}

// ---------------------------------------------------------------------------
// dev.to API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Article {
    id: Option<i64>,
    title: Option<String>,
    description: Option<String>,
    body_markdown: Option<String>,
    canonical_url: Option<String>,
    published_at: Option<String>,
    edited_at: Option<String>,
    reading_time_minutes: Option<i64>,
    tag_list: Option<serde_json::Value>, // string OR array depending on endpoint
    positive_reactions_count: Option<i64>,
    public_reactions_count: Option<i64>,
    comments_count: Option<i64>,
    page_views_count: Option<i64>,
    cover_image: Option<String>,
    user: Option<UserRef>,
}

#[derive(Deserialize)]
struct UserRef {
    username: Option<String>,
    name: Option<String>,
    twitter_username: Option<String>,
    github_username: Option<String>,
    website_url: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_article_urls() {
        assert!(matches("https://dev.to/ben/welcome-thread"));
        assert!(matches("https://dev.to/0xmassi/some-post-1abc"));
        assert!(!matches("https://dev.to/"));
        assert!(!matches("https://dev.to/api/articles/foo/bar"));
        assert!(!matches("https://dev.to/tags/rust"));
        assert!(!matches("https://dev.to/ben")); // user profile, not an article
        assert!(!matches("https://example.com/ben/post"));
    }

    #[test]
    fn parse_pulls_username_and_slug() {
        assert_eq!(
            parse_username_slug("https://dev.to/ben/welcome-thread"),
            Some(("ben".into(), "welcome-thread".into()))
        );
        assert_eq!(
            parse_username_slug("https://dev.to/0xmassi/some-post-1abc/?foo=bar"),
            Some(("0xmassi".into(), "some-post-1abc".into()))
        );
    }
}
150  crates/webclaw-fetch/src/extractors/docker_hub.rs  Normal file
@@ -0,0 +1,150 @@
//! Docker Hub repository structured extractor.
//!
//! Uses the v2 JSON API at `hub.docker.com/v2/repositories/{namespace}/{name}`.
//! Anonymous access is allowed for public images. The official-image
//! shorthand (e.g. `nginx`, `redis`) is normalized to `library/{name}`.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "docker_hub",
    label: "Docker Hub repository",
    description: "Returns image metadata: pull count, star count, last_updated, official flag, description.",
    url_patterns: &[
        "https://hub.docker.com/_/{name}",
        "https://hub.docker.com/r/{namespace}/{name}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "hub.docker.com" {
        return false;
    }
    url.contains("/_/") || url.contains("/r/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (namespace, name) = parse_repo(url)
        .ok_or_else(|| FetchError::Build(format!("docker_hub: cannot parse repo from '{url}'")))?;

    let api_url = format!("https://hub.docker.com/v2/repositories/{namespace}/{name}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "docker_hub: repo '{namespace}/{name}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "docker_hub api returned status {}",
            resp.status
        )));
    }

    let r: RepoResponse = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("docker_hub parse: {e}")))?;

    Ok(json!({
        "url": url,
        "namespace": r.namespace,
        "name": r.name,
        "full_name": format!("{namespace}/{name}"),
        "pull_count": r.pull_count,
        "star_count": r.star_count,
        "description": r.description,
        "full_description": r.full_description,
        "last_updated": r.last_updated,
        "date_registered": r.date_registered,
        "is_official": namespace == "library",
        "is_private": r.is_private,
        "status_description": r.status_description,
        "categories": r.categories,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Parse `(namespace, name)` from a Docker Hub URL. The official-image
/// shorthand `/_/nginx` maps to `(library, nginx)`. Personal repos
/// `/r/foo/bar` map to `(foo, bar)`.
fn parse_repo(url: &str) -> Option<(String, String)> {
    if let Some(after) = url.split("/_/").nth(1) {
        let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
        let name = stripped.split('/').next().filter(|s| !s.is_empty())?;
        return Some(("library".into(), name.to_string()));
    }
    let after = url.split("/r/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let ns = segs.next()?;
    let nm = segs.next()?;
    Some((ns.to_string(), nm.to_string()))
}

#[derive(Deserialize)]
struct RepoResponse {
    namespace: Option<String>,
    name: Option<String>,
    pull_count: Option<i64>,
    star_count: Option<i64>,
    description: Option<String>,
    full_description: Option<String>,
    last_updated: Option<String>,
    date_registered: Option<String>,
    is_private: Option<bool>,
    status_description: Option<String>,
    #[serde(default)]
    categories: Vec<DockerCategory>,
}

#[derive(Deserialize, serde::Serialize)]
struct DockerCategory {
    name: Option<String>,
    slug: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_docker_urls() {
        assert!(matches("https://hub.docker.com/_/nginx"));
        assert!(matches("https://hub.docker.com/r/grafana/grafana"));
        assert!(!matches("https://hub.docker.com/"));
        assert!(!matches("https://example.com/_/nginx"));
    }

    #[test]
    fn parse_repo_handles_official_and_personal() {
        assert_eq!(
            parse_repo("https://hub.docker.com/_/nginx"),
            Some(("library".into(), "nginx".into()))
        );
        assert_eq!(
            parse_repo("https://hub.docker.com/_/nginx/tags"),
            Some(("library".into(), "nginx".into()))
        );
        assert_eq!(
            parse_repo("https://hub.docker.com/r/grafana/grafana"),
            Some(("grafana".into(), "grafana".into()))
        );
        assert_eq!(
            parse_repo("https://hub.docker.com/r/grafana/grafana/?foo=bar"),
            Some(("grafana".into(), "grafana".into()))
        );
    }
}
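The `/_/{name}` vs `/r/{namespace}/{name}` normalization that `parse_repo` performs can be condensed into a standalone sketch (std-only, independent of the crate's function):

```rust
// "/_/nginx" is the official-image shorthand for "library/nginx";
// "/r/foo/bar" is a personal repo. Query strings, fragments, and
// trailing path segments such as "/tags" are ignored.
fn repo(url: &str) -> Option<(String, String)> {
    if let Some(after) = url.split("/_/").nth(1) {
        let name = after.split(['?', '#', '/']).next().filter(|s| !s.is_empty())?;
        return Some(("library".to_string(), name.to_string()));
    }
    let path = url.split("/r/").nth(1)?.split(['?', '#']).next()?;
    let mut segs = path.split('/').filter(|s| !s.is_empty());
    Some((segs.next()?.to_string(), segs.next()?.to_string()))
}

fn main() {
    assert_eq!(
        repo("https://hub.docker.com/_/nginx/tags"),
        Some(("library".to_string(), "nginx".to_string()))
    );
    assert_eq!(
        repo("https://hub.docker.com/r/grafana/grafana?tab=tags"),
        Some(("grafana".to_string(), "grafana".to_string()))
    );
    println!("ok");
}
```

Normalizing to `(namespace, name)` up front keeps the API call a single `format!` regardless of which URL shape the user pasted.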
337  crates/webclaw-fetch/src/extractors/ebay_listing.rs  Normal file
@@ -0,0 +1,337 @@
//! eBay listing extractor.
//!
//! eBay item pages at `ebay.com/itm/{id}` and international variants
//! usually ship a `Product` JSON-LD block with title, price, currency,
//! condition, and an `AggregateOffer` when bidding. eBay applies
//! Cloudflare + custom WAF selectively — some item IDs return normal
//! HTML to the Firefox profile, others 403 / get the "Pardon our
//! interruption" page. We route through `cloud::smart_fetch_html` so
//! both paths resolve to the same parser.

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "ebay_listing",
    label: "eBay listing",
    description: "Returns item title, price, currency, condition, seller, shipping, and bid info. Heavy listings may need WEBCLAW_API_KEY for antibot.",
    url_patterns: &[
        "https://www.ebay.com/itm/{id}",
        "https://www.ebay.co.uk/itm/{id}",
        "https://www.ebay.de/itm/{id}",
        "https://www.ebay.fr/itm/{id}",
        "https://www.ebay.it/itm/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !is_ebay_host(host) {
        return false;
    }
    parse_item_id(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let item_id = parse_item_id(url)
        .ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?;

    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse(&fetched.html, url, &item_id);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

pub fn parse(html: &str, url: &str, item_id: &str) -> Value {
    let jsonld = find_product_jsonld(html);
    let title = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "name"))
        .or_else(|| og(html, "title"));
    let image = jsonld
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| og(html, "image"));
    let brand = jsonld.as_ref().and_then(get_brand);
    let description = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description"));
    let offer = jsonld.as_ref().and_then(first_offer);

    // eBay's AggregateOffer uses lowPrice/highPrice. Offer uses price.
    let (low_price, high_price, single_price) = match offer.as_ref() {
        Some(o) => (
            get_text(o, "lowPrice"),
            get_text(o, "highPrice"),
            get_text(o, "price"),
        ),
        None => (None, None, None),
    };
    let offer_count = offer.as_ref().and_then(|o| get_text(o, "offerCount"));

    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);

    json!({
        "url": url,
        "item_id": item_id,
        "title": title,
        "brand": brand,
        "description": description,
        "image": image,
        "price": single_price,
        "low_price": low_price,
        "high_price": high_price,
        "offer_count": offer_count,
        "currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
        "availability": offer.as_ref().and_then(|o| {
            get_text(o, "availability").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "condition": offer.as_ref().and_then(|o| {
            get_text(o, "itemCondition").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "seller": offer.as_ref().and_then(|o|
            o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from)),
        "aggregate_rating": aggregate_rating,
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn is_ebay_host(host: &str) -> bool {
    host.starts_with("www.ebay.") || host.starts_with("ebay.")
}

/// Pull the numeric item id out of `/itm/{id}` or `/itm/{slug}/{id}`
/// URLs. IDs are 10-15 digits today, but we accept any trailing segment
/// of 8+ digits so the extractor stays forward-compatible.
fn parse_item_id(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        // /itm/(optional-slug/)?(digits)([/?#]|end)
        Regex::new(r"/itm/(?:[^/]+/)?(\d{8,})(?:[/?#]|$)").unwrap()
    });
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

// ---------------------------------------------------------------------------
// JSON-LD walkers
// ---------------------------------------------------------------------------

fn find_product_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_product_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
    match t {
        Value::String(s) => is_prod(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    brand
        .as_object()
        .and_then(|o| o.get("name"))
        .and_then(|n| n.as_str())
        .map(String::from)
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn first_offer(v: &Value) -> Option<Value> {
    let offers = v.get("offers")?;
    match offers {
        Value::Array(arr) => arr.first().cloned(),
        Value::Object(_) => Some(offers.clone()),
        _ => None,
    }
}

fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "review_count": get_text(r, "reviewCount"),
        "best_rating": get_text(r, "bestRating"),
    }))
}

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_ebay_item_urls() {
        assert!(matches("https://www.ebay.com/itm/325478156234"));
        assert!(matches(
            "https://www.ebay.com/itm/vintage-typewriter/325478156234"
        ));
        assert!(matches("https://www.ebay.co.uk/itm/325478156234"));
        assert!(!matches("https://www.ebay.com/"));
        assert!(!matches("https://www.ebay.com/sch/foo"));
        assert!(!matches("https://example.com/itm/325478156234"));
    }

    #[test]
    fn parse_item_id_handles_slugged_urls() {
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/325478156234"),
            Some("325478156234".into())
        );
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/vintage-typewriter/325478156234"),
            Some("325478156234".into())
        );
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/325478156234?hash=abc"),
            Some("325478156234".into())
        );
    }
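
    // Added sketch (not in the original source): pins down the regex's
    // 8-digit floor, assuming short numeric or non-numeric trailing
    // segments should be rejected rather than treated as item ids.
    #[test]
    fn parse_item_id_rejects_short_or_missing_ids() {
        assert_eq!(parse_item_id("https://www.ebay.com/itm/1234567"), None);
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/vintage-typewriter"),
            None
        );
    }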

    #[test]
    fn parse_extracts_from_fixture_jsonld() {
        let html = r##"
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"Vintage Typewriter","sku":"TW-001",
"brand":{"@type":"Brand","name":"Olivetti"},
"image":"https://i.ebayimg.com/images/abc.jpg",
"offers":{"@type":"Offer","price":"79.99","priceCurrency":"GBP",
"availability":"https://schema.org/InStock",
"itemCondition":"https://schema.org/UsedCondition",
"seller":{"@type":"Person","name":"vintage_seller_99"}}}
</script>
</head></html>"##;
        let v = parse(html, "https://www.ebay.co.uk/itm/325", "325");
        assert_eq!(v["title"], "Vintage Typewriter");
        assert_eq!(v["price"], "79.99");
        assert_eq!(v["currency"], "GBP");
        assert_eq!(v["availability"], "InStock");
        assert_eq!(v["condition"], "UsedCondition");
        assert_eq!(v["seller"], "vintage_seller_99");
        assert_eq!(v["brand"], "Olivetti");
    }

    #[test]
    fn parse_handles_aggregate_offer_price_range() {
        let html = r##"
<script type="application/ld+json">
{"@type":"Product","name":"Used Copies",
"offers":{"@type":"AggregateOffer","offerCount":"5",
"lowPrice":"10.00","highPrice":"50.00","priceCurrency":"USD"}}
</script>
"##;
        let v = parse(html, "https://www.ebay.com/itm/1", "1");
        assert_eq!(v["low_price"], "10.00");
        assert_eq!(v["high_price"], "50.00");
        assert_eq!(v["offer_count"], "5");
        assert_eq!(v["currency"], "USD");
    }
}
553
crates/webclaw-fetch/src/extractors/ecommerce_product.rs
Normal file
@@ -0,0 +1,553 @@
//! Generic ecommerce product extractor via Schema.org JSON-LD.
//!
//! Every modern ecommerce site ships a `<script type="application/ld+json">`
//! Product block for SEO / rich-result snippets. Google's own SEO docs
//! force this markup on anyone who wants to appear in shopping search.
//! We take advantage of it: one extractor that works on Shopify,
//! BigCommerce, WooCommerce, Squarespace, Magento, custom storefronts,
//! and anything else that follows Schema.org.
//!
//! **Explicit-call only** (`/v1/scrape/ecommerce_product`). Not in the
//! auto-dispatch because we can't identify "this is a product page"
//! from the URL alone. When the caller knows they have a product URL,
//! this is the reliable fallback for stores where shopify_product
//! doesn't apply.
//!
//! The extractor reuses `webclaw_core::structured_data::extract_json_ld`
//! so JSON-LD parsing is shared with the rest of the extraction
//! pipeline. We walk all blocks (including `@graph` and bare-array
//! wrappers) looking for `@type: Product`, `ProductGroup`, or
//! `IndividualProduct`.
//!
//! ## OG fallback
//!
//! Two real-world cases JSON-LD alone can't cover:
//!
//! 1. Site has no Product JSON-LD at all (smaller Squarespace / custom
//!    storefronts, many European shops).
//! 2. Site has Product JSON-LD but the `offers` block is empty (seen on
//!    Patagonia and other catalog-style sites that split price onto a
//!    separate widget).
//!
//! For case 1 we build a minimal payload from OG / product meta tags
//! (`og:title`, `og:image`, `og:description`, `product:price:amount`,
//! `product:price:currency`, `product:availability`, `product:brand`).
//! For case 2 we augment the JSON-LD offers list with an OG-derived
//! offer so callers get a price either way. A `data_source` field
//! (`"jsonld"` / `"jsonld+og"` / `"og_fallback"`) tells the caller
//! which branch produced the data.

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "ecommerce_product",
    label: "Ecommerce product (generic)",
    description: "Returns product info from any site that ships Schema.org Product JSON-LD: name, description, images, brand, SKU, price, availability, aggregate rating.",
    url_patterns: &[
        "https://{any-ecom-store}/products/{slug}",
        "https://{any-ecom-store}/product/{slug}",
        "https://{any-ecom-store}/p/{slug}",
    ],
};

pub fn matches(url: &str) -> bool {
    // Maximally permissive: explicit-call-only extractor. We trust the
    // caller knows they're pointing at a product page. Custom ecom
    // sites use every conceivable URL shape (warbyparker.com uses
    // `/eyeglasses/{category}/{slug}/{colour}`, etc.), so path-pattern
    // matching would false-negative a lot. All we gate on is a valid
    // http(s) URL with a host.
    if !(url.starts_with("http://") || url.starts_with("https://")) {
        return false;
    }
    !host_of(url).is_empty()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let resp = client.fetch(url).await?;
    if !(200..300).contains(&resp.status) {
        return Err(FetchError::Build(format!(
            "ecommerce_product: status {} for {url}",
            resp.status
        )));
    }
    parse(&resp.html, url).ok_or_else(|| {
        FetchError::BodyDecode(format!(
            "ecommerce_product: no Schema.org Product JSON-LD and no OG product tags on {url}"
        ))
    })
}

/// Pure parser: try JSON-LD first, fall back to OG meta tags. Returns
/// `None` when neither path has enough to say "this is a product page".
pub fn parse(html: &str, url: &str) -> Option<Value> {
    // Reuse the core JSON-LD parser so we benefit from whatever
    // robustness it gains over time (handling @graph, arrays, etc.).
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    let product = find_product(&blocks);

    if let Some(p) = product {
        Some(build_jsonld_payload(&p, html, url))
    } else if has_og_product_signal(html) {
        Some(build_og_payload(html, url))
    } else {
        None
    }
}

/// Build the rich payload from a Product JSON-LD node. Augments the
/// `offers` array with an OG-derived offer when JSON-LD offers is empty
/// so callers get a price on sites like Patagonia.
fn build_jsonld_payload(product: &Value, html: &str, url: &str) -> Value {
    let mut offers = collect_offers(product);
    let mut data_source = "jsonld";
    if offers.is_empty()
        && let Some(og_offer) = build_og_offer(html)
    {
        offers.push(og_offer);
        data_source = "jsonld+og";
    }

    json!({
        "url": url,
        "data_source": data_source,
        "name": get_text(product, "name").or_else(|| og(html, "title")),
        "description": get_text(product, "description").or_else(|| og(html, "description")),
        "brand": get_brand(product).or_else(|| meta_property(html, "product:brand")),
        "sku": get_text(product, "sku"),
        "mpn": get_text(product, "mpn"),
        "gtin": get_text(product, "gtin")
            .or_else(|| get_text(product, "gtin13"))
            .or_else(|| get_text(product, "gtin12"))
            .or_else(|| get_text(product, "gtin8")),
        "product_id": get_text(product, "productID"),
        "category": get_text(product, "category"),
        "color": get_text(product, "color"),
        "material": get_text(product, "material"),
        "images": nonempty_or_og(collect_images(product), html),
        "offers": offers,
        "aggregate_rating": get_aggregate_rating(product),
        "review_count": get_review_count(product),
        "raw_schema_type": get_text(product, "@type"),
        "raw_jsonld": product.clone(),
    })
}

/// Build a minimal payload from OG / product meta tags. Used when a
/// page has no Product JSON-LD at all.
fn build_og_payload(html: &str, url: &str) -> Value {
    let offers = build_og_offer(html).map(|o| vec![o]).unwrap_or_default();
    let image = og(html, "image");
    let images: Vec<Value> = image.map(|i| vec![Value::String(i)]).unwrap_or_default();

    json!({
        "url": url,
        "data_source": "og_fallback",
        "name": og(html, "title"),
        "description": og(html, "description"),
        "brand": meta_property(html, "product:brand"),
        "sku": None::<String>,
        "mpn": None::<String>,
        "gtin": None::<String>,
        "product_id": None::<String>,
        "category": None::<String>,
        "color": None::<String>,
        "material": None::<String>,
        "images": images,
        "offers": offers,
        "aggregate_rating": Value::Null,
        "review_count": None::<String>,
        "raw_schema_type": None::<String>,
        "raw_jsonld": Value::Null,
    })
}

fn nonempty_or_og(imgs: Vec<Value>, html: &str) -> Vec<Value> {
    if !imgs.is_empty() {
        return imgs;
    }
    og(html, "image")
        .map(|s| vec![Value::String(s)])
        .unwrap_or_default()
}

// ---------------------------------------------------------------------------
// JSON-LD walkers
// ---------------------------------------------------------------------------

/// Recursively walk the JSON-LD blocks and return the first node whose
/// `@type` is Product, ProductGroup, IndividualProduct, or another
/// product-shaped Schema.org type (Vehicle, SomeProducts).
fn find_product(blocks: &[Value]) -> Option<Value> {
    for b in blocks {
        if let Some(found) = find_product_in(b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    // @graph: [ {...}, {...} ]
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    // Bare array wrapper
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let t = match v.get("@type") {
        Some(t) => t,
        None => return false,
    };
    let match_str = |s: &str| {
        matches!(
            s,
            "Product" | "ProductGroup" | "IndividualProduct" | "Vehicle" | "SomeProducts"
        )
    };
    match t {
        Value::String(s) => match_str(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    if let Some(obj) = brand.as_object()
        && let Some(n) = obj.get("name").and_then(|x| x.as_str())
    {
        return Some(n.to_string());
    }
    None
}

fn collect_images(v: &Value) -> Vec<Value> {
    match v.get("image") {
        Some(Value::String(s)) => vec![Value::String(s.clone())],
        Some(Value::Array(arr)) => arr
            .iter()
            .filter_map(|x| match x {
                Value::String(s) => Some(Value::String(s.clone())),
                Value::Object(_) => x.get("url").cloned(),
                _ => None,
            })
            .collect(),
        Some(Value::Object(o)) => o.get("url").cloned().into_iter().collect(),
        _ => Vec::new(),
    }
}

/// Normalise both bare Offer and AggregateOffer into a uniform array.
fn collect_offers(v: &Value) -> Vec<Value> {
    let offers = match v.get("offers") {
        Some(o) => o,
        None => return Vec::new(),
    };
    let collect_single = |o: &Value| -> Option<Value> {
        Some(json!({
            "price": get_text(o, "price"),
            "low_price": get_text(o, "lowPrice"),
            "high_price": get_text(o, "highPrice"),
            "currency": get_text(o, "priceCurrency"),
            "availability": get_text(o, "availability").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
            "item_condition": get_text(o, "itemCondition").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
            "valid_until": get_text(o, "priceValidUntil"),
            "url": get_text(o, "url"),
            "seller": o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from),
            "offer_count": get_text(o, "offerCount"),
        }))
    };
    match offers {
        Value::Array(arr) => arr.iter().filter_map(collect_single).collect(),
        Value::Object(_) => collect_single(offers).into_iter().collect(),
        _ => Vec::new(),
    }
}

fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "best_rating": get_text(r, "bestRating"),
        "worst_rating": get_text(r, "worstRating"),
        "rating_count": get_text(r, "ratingCount"),
        "review_count": get_text(r, "reviewCount"),
    }))
}

fn get_review_count(v: &Value) -> Option<String> {
    v.get("aggregateRating")
        .and_then(|r| get_text(r, "reviewCount"))
        .or_else(|| get_text(v, "reviewCount"))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

// ---------------------------------------------------------------------------
// OG / product meta-tag helpers
// ---------------------------------------------------------------------------

/// True when the HTML has enough OG / product meta tags to justify
/// building a fallback payload. A single `og:title` isn't enough on its
/// own — every blog post has that. We require either a product price
/// tag or at least an `og:type` of `product`/`og:product` to avoid
/// mis-classifying articles as products.
fn has_og_product_signal(html: &str) -> bool {
    let has_price = meta_property(html, "product:price:amount").is_some()
        || meta_property(html, "og:price:amount").is_some();
    if has_price {
        return true;
    }
    // `<meta property="og:type" content="product">` is the Schema.org OG
    // marker for product pages.
    let og_type = og(html, "type").unwrap_or_default().to_lowercase();
    matches!(og_type.as_str(), "product" | "og:product" | "product.item")
}

/// Build a single Offer-shaped Value from OG / product meta tags, or
/// `None` if there's no price info at all.
fn build_og_offer(html: &str) -> Option<Value> {
    let price = meta_property(html, "product:price:amount")
        .or_else(|| meta_property(html, "og:price:amount"));
    let currency = meta_property(html, "product:price:currency")
        .or_else(|| meta_property(html, "og:price:currency"));
    let availability = meta_property(html, "product:availability")
        .or_else(|| meta_property(html, "og:availability"));
    price.as_ref()?;
    Some(json!({
        "price": price,
        "low_price": None::<String>,
        "high_price": None::<String>,
        "currency": currency,
        "availability": availability,
        "item_condition": None::<String>,
        "valid_until": None::<String>,
        "url": None::<String>,
        "seller": None::<String>,
        "offer_count": None::<String>,
    }))
}

/// Pull the value of `<meta property="og:{prop}" content="...">`.
fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

/// Pull the value of any `<meta property="..." content="...">` tag.
/// Needed for namespaced OG variants like `product:price:amount` that
/// the simple `og:*` matcher above doesn't cover.
fn meta_property(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

#[cfg(test)]
mod tests {
    use super::*;
    use serde_json::json;

    #[test]
    fn matches_any_http_url_with_host() {
        assert!(matches("https://www.allbirds.com/products/tree-runner"));
        assert!(matches(
            "https://www.warbyparker.com/eyeglasses/women/percey/jet-black-with-polished-gold"
        ));
        assert!(matches("https://example.com/p/widget"));
        assert!(matches("http://shop.example.com/foo/bar"));
    }

    #[test]
    fn rejects_empty_or_non_http() {
        assert!(!matches(""));
        assert!(!matches("not-a-url"));
        assert!(!matches("ftp://example.com/file"));
    }

    #[test]
    fn find_product_walks_graph() {
        let block = json!({
            "@context": "https://schema.org",
            "@graph": [
                {"@type": "Organization", "name": "ACME"},
                {"@type": "Product", "name": "Widget", "sku": "ABC"}
            ]
        });
        let blocks = vec![block];
        let p = find_product(&blocks).unwrap();
        assert_eq!(p.get("name").and_then(|v| v.as_str()), Some("Widget"));
    }

    #[test]
    fn find_product_handles_array_type() {
        let block = json!({
            "@type": ["Product", "Clothing"],
            "name": "Tee"
        });
        assert!(is_product_type(&block));
    }

    #[test]
    fn get_brand_from_string_or_object() {
        assert_eq!(get_brand(&json!({"brand": "ACME"})), Some("ACME".into()));
        assert_eq!(
            get_brand(&json!({"brand": {"@type": "Brand", "name": "ACME"}})),
            Some("ACME".into())
        );
    }

    #[test]
    fn collect_offers_handles_single_and_aggregate() {
        let p = json!({
            "offers": {
                "@type": "Offer",
                "price": "19.99",
                "priceCurrency": "USD",
                "availability": "https://schema.org/InStock"
            }
        });
        let offers = collect_offers(&p);
        assert_eq!(offers.len(), 1);
        assert_eq!(
            offers[0].get("price").and_then(|v| v.as_str()),
            Some("19.99")
        );
        assert_eq!(
            offers[0].get("availability").and_then(|v| v.as_str()),
            Some("InStock")
        );
    }

    // --- OG fallback --------------------------------------------------------

    #[test]
    fn has_og_product_signal_accepts_product_type_or_price() {
        let type_only = r#"<meta property="og:type" content="product">"#;
        let price_only = r#"<meta property="product:price:amount" content="49.00">"#;
        let neither = r#"<meta property="og:title" content="My Article"><meta property="og:type" content="article">"#;
        assert!(has_og_product_signal(type_only));
        assert!(has_og_product_signal(price_only));
        assert!(!has_og_product_signal(neither));
    }

    #[test]
    fn og_fallback_builds_payload_without_jsonld() {
        let html = r##"<html><head>
<meta property="og:type" content="product">
<meta property="og:title" content="Handmade Candle">
<meta property="og:image" content="https://cdn.example.com/candle.jpg">
<meta property="og:description" content="Small-batch soy candle.">
<meta property="product:price:amount" content="18.00">
<meta property="product:price:currency" content="USD">
<meta property="product:availability" content="in stock">
<meta property="product:brand" content="Little Studio">
</head></html>"##;
        let v = parse(html, "https://example.com/p/candle").unwrap();
        assert_eq!(v["data_source"], "og_fallback");
        assert_eq!(v["name"], "Handmade Candle");
        assert_eq!(v["description"], "Small-batch soy candle.");
        assert_eq!(v["brand"], "Little Studio");
        assert_eq!(v["offers"][0]["price"], "18.00");
        assert_eq!(v["offers"][0]["currency"], "USD");
        assert_eq!(v["offers"][0]["availability"], "in stock");
        assert_eq!(v["images"][0], "https://cdn.example.com/candle.jpg");
    }

    #[test]
    fn jsonld_augments_empty_offers_with_og_price() {
        // Patagonia-shaped page: Product JSON-LD without an Offer, plus
        // product:price:* OG tags. We should merge.
        let html = r##"<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"Better Sweater","brand":"Patagonia",
"aggregateRating":{"@type":"AggregateRating","ratingValue":"4.4","reviewCount":"1142"}}
</script>
<meta property="product:price:amount" content="139.00">
<meta property="product:price:currency" content="USD">
</head></html>"##;
        let v = parse(html, "https://patagonia.com/p/x").unwrap();
        assert_eq!(v["data_source"], "jsonld+og");
        assert_eq!(v["name"], "Better Sweater");
        assert_eq!(v["offers"].as_array().unwrap().len(), 1);
        assert_eq!(v["offers"][0]["price"], "139.00");
    }

    #[test]
    fn jsonld_only_stays_pure_jsonld() {
        let html = r##"<html><head>
<script type="application/ld+json">
{"@type":"Product","name":"Widget",
"offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
</script>
</head></html>"##;
        let v = parse(html, "https://example.com/p/w").unwrap();
        assert_eq!(v["data_source"], "jsonld");
        assert_eq!(v["offers"][0]["price"], "9.99");
    }

    #[test]
    fn parse_returns_none_on_no_product_signals() {
        let html = r#"<html><head>
<meta property="og:title" content="My Blog Post">
<meta property="og:type" content="article">
</head></html>"#;
        assert!(parse(html, "https://blog.example.com/post").is_none());
    }
|
||||
}
572	crates/webclaw-fetch/src/extractors/etsy_listing.rs	Normal file
@@ -0,0 +1,572 @@
//! Etsy listing extractor.
//!
//! Etsy product pages at `etsy.com/listing/{id}` (and a sluggy variant
//! `etsy.com/listing/{id}/{slug}`) ship a Schema.org `Product` JSON-LD
//! block with title, price, currency, availability, shop seller, and
//! an `AggregateRating` for the listing.
//!
//! Etsy puts Cloudflare + custom WAF in front of product pages with a
//! high variance: the Firefox profile gets clean HTML most of the time
//! but some listings return a CF interstitial. We route through
//! `cloud::smart_fetch_html` so both paths resolve to the same parser,
//! same as `ebay_listing`.
//!
//! ## URL slug as last-resort title
//!
//! Even with cloud antibot bypass, Etsy frequently serves a generic
//! page with minimal metadata (`og:title = "etsy.com"`, no JSON-LD,
//! empty markdown). In that case we humanise the slug from the URL
//! (`/listing/{id}/personalized-stainless-steel-tumbler` becomes
//! "Personalized Stainless Steel Tumbler") so callers always get a
//! meaningful title. Degrades gracefully when the URL has no slug.

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "etsy_listing",
    label: "Etsy listing",
    description: "Returns listing title, price, currency, availability, shop, rating, and image. Heavy listings may need WEBCLAW_API_KEY for antibot.",
    url_patterns: &[
        "https://www.etsy.com/listing/{id}",
        "https://www.etsy.com/listing/{id}/{slug}",
        "https://www.etsy.com/{locale}/listing/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !is_etsy_host(host) {
        return false;
    }
    parse_listing_id(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let listing_id = parse_listing_id(url)
        .ok_or_else(|| FetchError::Build(format!("etsy_listing: no listing id in '{url}'")))?;

    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse(&fetched.html, url, &listing_id);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
    let jsonld = find_product_jsonld(html);
    let slug_title = humanise_slug(parse_slug(url).as_deref());

    let title = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "name"))
        .or_else(|| og(html, "title").filter(|t| !is_generic_title(t)))
        .or(slug_title);
    let description = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description").filter(|d| !is_generic_description(d)));
    let image = jsonld
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| og(html, "image"));
    let brand = jsonld.as_ref().and_then(get_brand);

    // Etsy listings often ship either a single Offer or an
    // AggregateOffer when the listing has variants with different prices.
    let offer = jsonld.as_ref().and_then(first_offer);
    let (low_price, high_price, single_price) = match offer.as_ref() {
        Some(o) => (
            get_text(o, "lowPrice"),
            get_text(o, "highPrice"),
            get_text(o, "price"),
        ),
        None => (None, None, None),
    };
    let currency = offer.as_ref().and_then(|o| get_text(o, "priceCurrency"));
    let availability = offer
        .as_ref()
        .and_then(|o| get_text(o, "availability").map(strip_schema_prefix));
    let item_condition = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "itemCondition"))
        .map(strip_schema_prefix);

    // Shop name: offers[0].seller.name on newer listings, top-level
    // `brand` on older listings (Etsy changed the schema around 2022).
    // Fall back through both so either shape resolves.
    let shop = offer
        .as_ref()
        .and_then(|o| {
            o.get("seller")
                .and_then(|s| s.get("name"))
                .and_then(|n| n.as_str())
                .map(String::from)
        })
        .or_else(|| brand.clone());
    let shop_url = shop_url_from_html(html);

    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);

    json!({
        "url": url,
        "listing_id": listing_id,
        "title": title,
        "description": description,
        "image": image,
        "brand": brand,
        "price": single_price,
        "low_price": low_price,
        "high_price": high_price,
        "currency": currency,
        "availability": availability,
        "item_condition": item_condition,
        "shop": shop,
        "shop_url": shop_url,
        "aggregate_rating": aggregate_rating,
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn is_etsy_host(host: &str) -> bool {
    host == "etsy.com" || host == "www.etsy.com" || host.ends_with(".etsy.com")
}

/// Extract the numeric listing id. Etsy ids are 9-11 digits today but
/// we accept any all-digit segment right after `/listing/`.
///
/// Handles `/listing/{id}`, `/listing/{id}/{slug}`, and the localised
/// `/{locale}/listing/{id}` shape (e.g. `/fr/listing/...`).
fn parse_listing_id(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"/listing/(\d{6,})(?:[/?#]|$)").unwrap());
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// Extract the URL slug after the listing id, e.g.
/// `personalized-stainless-steel-tumbler`. Returns `None` when the URL
/// is the bare `/listing/{id}` shape.
fn parse_slug(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"/listing/\d{6,}/([^/?#]+)").unwrap());
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// Turn a URL slug into a human-ish title:
/// `personalized-stainless-steel-tumbler` → `Personalized Stainless
/// Steel Tumbler`. Word-caps each dash-separated token and treats
/// underscores as spaces too. Returns `None` on empty input.
fn humanise_slug(slug: Option<&str>) -> Option<String> {
    let raw = slug?.trim();
    if raw.is_empty() {
        return None;
    }
    let words: Vec<String> = raw
        .split(['-', '_'])
        .filter(|w| !w.is_empty())
        .map(capitalise_word)
        .collect();
    if words.is_empty() {
        None
    } else {
        Some(words.join(" "))
    }
}

fn capitalise_word(w: &str) -> String {
    let mut chars = w.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
        None => String::new(),
    }
}

/// True when the OG title is Etsy's fallback-page title rather than a
/// listing-specific title. Expired / region-blocked / antibot-filtered
/// pages return Etsy's sitewide tagline:
/// `"Etsy - Your place to buy and sell all things handmade..."`, or
/// simply `"etsy.com"`. A real listing title always starts with the
/// item name, never with "Etsy - " or the domain.
fn is_generic_title(t: &str) -> bool {
    let normalised = t.trim().to_lowercase();
    if matches!(
        normalised.as_str(),
        "etsy.com" | "etsy" | "www.etsy.com" | ""
    ) {
        return true;
    }
    // Etsy's sitewide marketing tagline, served on 404 / blocked pages.
    if normalised.starts_with("etsy - ")
        || normalised.starts_with("etsy.com - ")
        || normalised.starts_with("etsy uk - ")
    {
        return true;
    }
    // Etsy's "item unavailable" placeholder, served on delisted
    // products. Keep the slug fallback so callers still see what the
    // URL was about.
    normalised.starts_with("this item is unavailable")
        || normalised.starts_with("sorry, this item is")
        || normalised == "item not available - etsy"
}

/// True when the OG description is an Etsy error-page placeholder or
/// sitewide marketing blurb rather than a real listing description.
fn is_generic_description(d: &str) -> bool {
    let normalised = d.trim().to_lowercase();
    if normalised.is_empty() {
        return true;
    }
    normalised.starts_with("sorry, the page you were looking for")
        || normalised.starts_with("page not found")
        || normalised.starts_with("find the perfect handmade gift")
}

// ---------------------------------------------------------------------------
// JSON-LD walkers (same shape as ebay_listing; kept separate so the two
// extractors can diverge without cross-impact)
// ---------------------------------------------------------------------------

fn find_product_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_product_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
    match t {
        Value::String(s) => is_prod(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    brand
        .as_object()
        .and_then(|o| o.get("name"))
        .and_then(|n| n.as_str())
        .map(String::from)
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn first_offer(v: &Value) -> Option<Value> {
    let offers = v.get("offers")?;
    match offers {
        Value::Array(arr) => arr.first().cloned(),
        Value::Object(_) => Some(offers.clone()),
        _ => None,
    }
}

fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "review_count": get_text(r, "reviewCount"),
        "best_rating": get_text(r, "bestRating"),
    }))
}

fn strip_schema_prefix(s: String) -> String {
    s.replace("http://schema.org/", "")
        .replace("https://schema.org/", "")
}

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

/// Etsy links the owning shop with a canonical anchor like
/// `<a href="/shop/ShopName" ...>`. Grab the first one after the
/// breadcrumb boundary.
fn shop_url_from_html(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"href="(/shop/[A-Za-z0-9_-]+)""#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| format!("https://www.etsy.com{}", m.as_str()))
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_etsy_listing_urls() {
        assert!(matches("https://www.etsy.com/listing/123456789"));
        assert!(matches(
            "https://www.etsy.com/listing/123456789/vintage-typewriter"
        ));
        assert!(matches(
            "https://www.etsy.com/fr/listing/123456789/vintage-typewriter"
        ));
        assert!(!matches("https://www.etsy.com/"));
        assert!(!matches("https://www.etsy.com/shop/SomeShop"));
        assert!(!matches("https://example.com/listing/123456789"));
    }

    #[test]
    fn parse_listing_id_handles_slug_and_locale() {
        assert_eq!(
            parse_listing_id("https://www.etsy.com/listing/123456789"),
            Some("123456789".into())
        );
        assert_eq!(
            parse_listing_id("https://www.etsy.com/listing/123456789/slug-here"),
            Some("123456789".into())
        );
        assert_eq!(
            parse_listing_id("https://www.etsy.com/fr/listing/123456789/slug"),
            Some("123456789".into())
        );
        assert_eq!(
            parse_listing_id("https://www.etsy.com/listing/123456789?ref=foo"),
            Some("123456789".into())
        );
    }

    #[test]
    fn parse_extracts_from_fixture_jsonld() {
        let html = r##"
        <html><head>
        <script type="application/ld+json">
        {"@context":"https://schema.org","@type":"Product",
         "name":"Handmade Ceramic Mug","sku":"MUG-001",
         "brand":{"@type":"Brand","name":"Studio Clay"},
         "image":["https://i.etsystatic.com/abc.jpg","https://i.etsystatic.com/xyz.jpg"],
         "itemCondition":"https://schema.org/NewCondition",
         "offers":{"@type":"Offer","price":"24.00","priceCurrency":"USD",
          "availability":"https://schema.org/InStock",
          "seller":{"@type":"Organization","name":"StudioClay"}},
         "aggregateRating":{"@type":"AggregateRating","ratingValue":"4.9","reviewCount":"127","bestRating":"5"}}
        </script>
        <a href="/shop/StudioClay" class="wt-text-link">StudioClay</a>
        </head></html>"##;
        let v = parse(html, "https://www.etsy.com/listing/1", "1");
        assert_eq!(v["title"], "Handmade Ceramic Mug");
        assert_eq!(v["price"], "24.00");
        assert_eq!(v["currency"], "USD");
        assert_eq!(v["availability"], "InStock");
        assert_eq!(v["item_condition"], "NewCondition");
        assert_eq!(v["shop"], "StudioClay");
        assert_eq!(v["shop_url"], "https://www.etsy.com/shop/StudioClay");
        assert_eq!(v["brand"], "Studio Clay");
        assert_eq!(v["aggregate_rating"]["rating_value"], "4.9");
        assert_eq!(v["aggregate_rating"]["review_count"], "127");
    }

    #[test]
    fn parse_handles_aggregate_offer_price_range() {
        let html = r##"
        <script type="application/ld+json">
        {"@type":"Product","name":"Mug Set",
         "offers":{"@type":"AggregateOffer",
          "lowPrice":"18.00","highPrice":"36.00","priceCurrency":"USD"}}
        </script>
        "##;
        let v = parse(html, "https://www.etsy.com/listing/2", "2");
        assert_eq!(v["low_price"], "18.00");
        assert_eq!(v["high_price"], "36.00");
        assert_eq!(v["currency"], "USD");
    }

    #[test]
    fn parse_falls_back_to_og_when_no_jsonld() {
        let html = r#"
        <html><head>
        <meta property="og:title" content="Minimal Fallback Item">
        <meta property="og:description" content="OG-only extraction test.">
        <meta property="og:image" content="https://i.etsystatic.com/fallback.jpg">
        </head></html>"#;
        let v = parse(html, "https://www.etsy.com/listing/3", "3");
        assert_eq!(v["title"], "Minimal Fallback Item");
        assert_eq!(v["description"], "OG-only extraction test.");
        assert_eq!(v["image"], "https://i.etsystatic.com/fallback.jpg");
        // No price fields when we only have OG.
        assert!(v["price"].is_null());
    }

    #[test]
    fn parse_slug_from_url() {
        assert_eq!(
            parse_slug("https://www.etsy.com/listing/123456789/vintage-typewriter"),
            Some("vintage-typewriter".into())
        );
        assert_eq!(
            parse_slug("https://www.etsy.com/listing/123456789/slug?ref=shop"),
            Some("slug".into())
        );
        assert_eq!(parse_slug("https://www.etsy.com/listing/123456789"), None);
        assert_eq!(
            parse_slug("https://www.etsy.com/fr/listing/123456789/slug"),
            Some("slug".into())
        );
    }

    #[test]
    fn humanise_slug_capitalises_each_word() {
        assert_eq!(
            humanise_slug(Some("personalized-stainless-steel-tumbler")).as_deref(),
            Some("Personalized Stainless Steel Tumbler")
        );
        assert_eq!(
            humanise_slug(Some("hand_crafted_mug")).as_deref(),
            Some("Hand Crafted Mug")
        );
        assert_eq!(humanise_slug(Some("")), None);
        assert_eq!(humanise_slug(None), None);
    }

    #[test]
    fn is_generic_title_catches_common_shapes() {
        assert!(is_generic_title("etsy.com"));
        assert!(is_generic_title("Etsy"));
        assert!(is_generic_title(" etsy.com "));
        assert!(is_generic_title(
            "Etsy - Your place to buy and sell all things handmade, vintage, and supplies"
        ));
        assert!(is_generic_title("Etsy UK - Vintage & Handmade"));
        assert!(!is_generic_title("Vintage Typewriter"));
        assert!(!is_generic_title("Handmade Etsy-style Mug"));
    }

    #[test]
    fn is_generic_description_catches_404_shapes() {
        assert!(is_generic_description(""));
        assert!(is_generic_description(
            "Sorry, the page you were looking for was not found."
        ));
        assert!(is_generic_description("Page not found"));
        assert!(!is_generic_description(
            "Hand-thrown ceramic mug, dishwasher safe."
        ));
    }

    #[test]
    fn parse_uses_slug_when_og_is_generic() {
        // Cloud-blocked Etsy listing: og:title is a site-wide generic
        // placeholder, no JSON-LD, no description. Slug should win.
        let html = r#"<html><head>
            <meta property="og:title" content="etsy.com">
        </head></html>"#;
        let v = parse(
            html,
            "https://www.etsy.com/listing/1079113183/personalized-stainless-steel-tumbler",
            "1079113183",
        );
        assert_eq!(v["title"], "Personalized Stainless Steel Tumbler");
    }

    #[test]
    fn parse_prefers_real_og_over_slug() {
        let html = r#"<html><head>
            <meta property="og:title" content="Real Listing Title">
        </head></html>"#;
        let v = parse(
            html,
            "https://www.etsy.com/listing/1079113183/the-url-slug",
            "1079113183",
        );
        assert_eq!(v["title"], "Real Listing Title");
    }
}
172	crates/webclaw-fetch/src/extractors/github_issue.rs	Normal file
@@ -0,0 +1,172 @@
//! GitHub issue structured extractor.
//!
//! Mirror of `github_pr` but on `/issues/{number}`. Uses
//! `api.github.com/repos/{owner}/{repo}/issues/{number}`. Returns the
//! issue body + comment count + labels + milestone + author /
//! assignees. Full per-comment bodies would be another call; kept for
//! a follow-up.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "github_issue",
    label: "GitHub issue",
    description: "Returns issue metadata: title, body, state, author, labels, assignees, milestone, comment count.",
    url_patterns: &["https://github.com/{owner}/{repo}/issues/{number}"],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    parse_issue(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, repo, number) = parse_issue(url).ok_or_else(|| {
        FetchError::Build(format!("github_issue: cannot parse issue URL '{url}'"))
    })?;

    let api_url = format!("https://api.github.com/repos/{owner}/{repo}/issues/{number}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "github_issue: issue '{owner}/{repo}#{number}' not found"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(
            "github_issue: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
        ));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "github api returned status {}",
            resp.status
        )));
    }

    let issue: Issue = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("github issue parse: {e}")))?;

    // The same endpoint returns PRs too; reject if we got one so the caller
    // uses /v1/scrape/github_pr instead of getting a half-shaped payload.
    if issue.pull_request.is_some() {
        return Err(FetchError::Build(format!(
            "github_issue: '{owner}/{repo}#{number}' is a pull request, use /v1/scrape/github_pr"
        )));
    }

    Ok(json!({
        "url": url,
        "owner": owner,
        "repo": repo,
        "number": issue.number,
        "title": issue.title,
        "body": issue.body,
        "state": issue.state,
        "state_reason": issue.state_reason,
        "author": issue.user.as_ref().and_then(|u| u.login.clone()),
        "labels": issue.labels.iter().filter_map(|l| l.name.clone()).collect::<Vec<_>>(),
        "assignees": issue.assignees.iter().filter_map(|u| u.login.clone()).collect::<Vec<_>>(),
        "milestone": issue.milestone.as_ref().and_then(|m| m.title.clone()),
        "comments": issue.comments,
        "locked": issue.locked,
        "created_at": issue.created_at,
        "updated_at": issue.updated_at,
        "closed_at": issue.closed_at,
        "html_url": issue.html_url,
    }))
}

fn parse_issue(url: &str) -> Option<(String, String, u64)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    if segs.len() < 4 || segs[2] != "issues" {
        return None;
    }
    let number: u64 = segs[3].parse().ok()?;
    Some((segs[0].to_string(), segs[1].to_string(), number))
}

// ---------------------------------------------------------------------------
// GitHub issue API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Issue {
    number: Option<i64>,
    title: Option<String>,
    body: Option<String>,
    state: Option<String>,
    state_reason: Option<String>,
    locked: Option<bool>,
    comments: Option<i64>,
    created_at: Option<String>,
    updated_at: Option<String>,
    closed_at: Option<String>,
    html_url: Option<String>,
    user: Option<UserRef>,
    #[serde(default)]
    labels: Vec<LabelRef>,
    #[serde(default)]
    assignees: Vec<UserRef>,
    milestone: Option<Milestone>,
    /// Present when this "issue" is actually a pull request. The REST
    /// API overloads the issues endpoint for PRs.
    pull_request: Option<serde_json::Value>,
}

#[derive(Deserialize)]
struct UserRef {
    login: Option<String>,
}

#[derive(Deserialize)]
struct LabelRef {
    name: Option<String>,
}

#[derive(Deserialize)]
struct Milestone {
    title: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_issue_urls() {
        assert!(matches("https://github.com/rust-lang/rust/issues/100"));
        assert!(matches("https://github.com/rust-lang/rust/issues/100/"));
        assert!(!matches("https://github.com/rust-lang/rust"));
        assert!(!matches("https://github.com/rust-lang/rust/pull/100"));
        assert!(!matches("https://github.com/rust-lang/rust/issues"));
    }

    #[test]
    fn parse_issue_extracts_owner_repo_number() {
        assert_eq!(
            parse_issue("https://github.com/rust-lang/rust/issues/100"),
            Some(("rust-lang".into(), "rust".into(), 100))
        );
        assert_eq!(
            parse_issue("https://github.com/rust-lang/rust/issues/100/?foo=bar"),
            Some(("rust-lang".into(), "rust".into(), 100))
        );
    }
}
189	crates/webclaw-fetch/src/extractors/github_pr.rs	Normal file
@@ -0,0 +1,189 @@
//! GitHub pull request structured extractor.
//!
//! Uses `api.github.com/repos/{owner}/{repo}/pulls/{number}`. Returns
//! the PR metadata + a counted summary of comments and review activity.
//! Full diff and per-comment bodies require additional calls — left for
//! a follow-up enhancement so the v1 stays one network round-trip.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "github_pr",
    label: "GitHub pull request",
    description: "Returns PR metadata: title, body, state, author, labels, additions/deletions, file count.",
    url_patterns: &["https://github.com/{owner}/{repo}/pull/{number}"],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    parse_pr(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, repo, number) = parse_pr(url).ok_or_else(|| {
        FetchError::Build(format!("github_pr: cannot parse pull-request URL '{url}'"))
    })?;

    let api_url = format!("https://api.github.com/repos/{owner}/{repo}/pulls/{number}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "github_pr: pull request '{owner}/{repo}#{number}' not found"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(
            "github_pr: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
        ));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "github api returned status {}",
            resp.status
        )));
    }

    let p: PullRequest = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("github pr parse: {e}")))?;

    Ok(json!({
        "url": url,
        "owner": owner,
        "repo": repo,
        "number": p.number,
        "title": p.title,
        "body": p.body,
        "state": p.state,
        "draft": p.draft,
        "merged": p.merged,
        "merged_at": p.merged_at,
        "merge_commit_sha": p.merge_commit_sha,
        "author": p.user.as_ref().and_then(|u| u.login.clone()),
        "labels": p.labels.iter().filter_map(|l| l.name.clone()).collect::<Vec<_>>(),
        "milestone": p.milestone.as_ref().and_then(|m| m.title.clone()),
        "head_ref": p.head.as_ref().and_then(|r| r.ref_name.clone()),
        "base_ref": p.base.as_ref().and_then(|r| r.ref_name.clone()),
        "head_sha": p.head.as_ref().and_then(|r| r.sha.clone()),
        "additions": p.additions,
        "deletions": p.deletions,
        "changed_files": p.changed_files,
        "commits": p.commits,
        "comments": p.comments,
        "review_comments": p.review_comments,
        "created_at": p.created_at,
        "updated_at": p.updated_at,
        "closed_at": p.closed_at,
        "html_url": p.html_url,
    }))
}

fn parse_pr(url: &str) -> Option<(String, String, u64)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // /{owner}/{repo}/pull/{number} (or /pulls/{number} variant)
    if segs.len() < 4 {
        return None;
    }
    if segs[2] != "pull" && segs[2] != "pulls" {
        return None;
    }
    let number: u64 = segs[3].parse().ok()?;
    Some((segs[0].to_string(), segs[1].to_string(), number))
}

// ---------------------------------------------------------------------------
// GitHub PR API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct PullRequest {
    number: Option<i64>,
    title: Option<String>,
    body: Option<String>,
    state: Option<String>,
    draft: Option<bool>,
    merged: Option<bool>,
    merged_at: Option<String>,
    merge_commit_sha: Option<String>,
    user: Option<UserRef>,
    #[serde(default)]
    labels: Vec<LabelRef>,
    milestone: Option<Milestone>,
    head: Option<GitRef>,
    base: Option<GitRef>,
    additions: Option<i64>,
    deletions: Option<i64>,
    changed_files: Option<i64>,
    commits: Option<i64>,
    comments: Option<i64>,
    review_comments: Option<i64>,
    created_at: Option<String>,
    updated_at: Option<String>,
    closed_at: Option<String>,
    html_url: Option<String>,
}

#[derive(Deserialize)]
struct UserRef {
    login: Option<String>,
}

#[derive(Deserialize)]
struct LabelRef {
    name: Option<String>,
}

#[derive(Deserialize)]
struct Milestone {
    title: Option<String>,
}

#[derive(Deserialize)]
struct GitRef {
    #[serde(rename = "ref")]
    ref_name: Option<String>,
    sha: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

#[test]
|
||||
fn matches_pr_urls() {
|
||||
assert!(matches("https://github.com/rust-lang/rust/pull/12345"));
|
||||
assert!(matches(
|
||||
"https://github.com/rust-lang/rust/pull/12345/files"
|
||||
));
|
||||
assert!(!matches("https://github.com/rust-lang/rust"));
|
||||
assert!(!matches("https://github.com/rust-lang/rust/issues/100"));
|
||||
assert!(!matches("https://github.com/rust-lang"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_pr_extracts_owner_repo_number() {
|
||||
assert_eq!(
|
||||
parse_pr("https://github.com/rust-lang/rust/pull/12345"),
|
||||
Some(("rust-lang".into(), "rust".into(), 12345))
|
||||
);
|
||||
assert_eq!(
|
||||
parse_pr("https://github.com/rust-lang/rust/pull/12345/files"),
|
||||
Some(("rust-lang".into(), "rust".into(), 12345))
|
||||
);
|
||||
}
|
||||
}
|
||||
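The URL-parsing contract exercised by the tests above can also be run standalone; the sketch below is an independent port of `parse_pr` (no webclaw types involved), included only to illustrate the segment-based parsing, not the crate's public API.

```rust
// Standalone port of the parse_pr logic: take the path after the host,
// strip query/fragment and trailing slash, then expect
// /{owner}/{repo}/pull/{number} with extra segments tolerated.
fn parse_pr(url: &str) -> Option<(String, String, u64)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    if segs.len() < 4 || (segs[2] != "pull" && segs[2] != "pulls") {
        return None;
    }
    let number: u64 = segs[3].parse().ok()?;
    Some((segs[0].to_string(), segs[1].to_string(), number))
}

fn main() {
    // Trailing sub-pages like /files are tolerated: only segs[0..4] matter.
    assert_eq!(
        parse_pr("https://github.com/rust-lang/rust/pull/12345/files"),
        Some(("rust-lang".to_string(), "rust".to_string(), 12345))
    );
    // Query strings never leak into the number because '?' is split off first.
    assert_eq!(
        parse_pr("https://github.com/rust-lang/rust/pull/7?w=1"),
        Some(("rust-lang".to_string(), "rust".to_string(), 7))
    );
    assert_eq!(parse_pr("https://github.com/rust-lang/rust"), None);
}
```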
crates/webclaw-fetch/src/extractors/github_release.rs (new file, 179 lines)
@@ -0,0 +1,179 @@
//! GitHub release structured extractor.
//!
//! `api.github.com/repos/{owner}/{repo}/releases/tags/{tag}`. Returns
//! the release notes body, asset list with download counts, and
//! prerelease flag.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "github_release",
    label: "GitHub release",
    description: "Returns release metadata: tag, name, body (release notes), assets with download counts.",
    url_patterns: &["https://github.com/{owner}/{repo}/releases/tag/{tag}"],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    parse_release(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, repo, tag) = parse_release(url).ok_or_else(|| {
        FetchError::Build(format!("github_release: cannot parse release URL '{url}'"))
    })?;

    let api_url = format!("https://api.github.com/repos/{owner}/{repo}/releases/tags/{tag}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "github_release: release '{owner}/{repo}@{tag}' not found"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(
            "github_release: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour."
                .into(),
        ));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "github api returned status {}",
            resp.status
        )));
    }

    let r: Release = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("github release parse: {e}")))?;

    let assets: Vec<Value> = r
        .assets
        .iter()
        .map(|a| {
            json!({
                "name": a.name,
                "size": a.size,
                "download_count": a.download_count,
                "browser_download_url": a.browser_download_url,
                "content_type": a.content_type,
                "created_at": a.created_at,
                "updated_at": a.updated_at,
            })
        })
        .collect();

    Ok(json!({
        "url": url,
        "owner": owner,
        "repo": repo,
        "tag_name": r.tag_name,
        "name": r.name,
        "body": r.body,
        "draft": r.draft,
        "prerelease": r.prerelease,
        "author": r.author.as_ref().and_then(|u| u.login.clone()),
        "created_at": r.created_at,
        "published_at": r.published_at,
        "asset_count": assets.len(),
        "total_downloads": r.assets.iter().map(|a| a.download_count.unwrap_or(0)).sum::<i64>(),
        "assets": assets,
        "html_url": r.html_url,
    }))
}

fn parse_release(url: &str) -> Option<(String, String, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // /{owner}/{repo}/releases/tag/{tag}
    if segs.len() < 5 {
        return None;
    }
    if segs[2] != "releases" || segs[3] != "tag" {
        return None;
    }
    Some((
        segs[0].to_string(),
        segs[1].to_string(),
        segs[4].to_string(),
    ))
}

// ---------------------------------------------------------------------------
// GitHub Release API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Release {
    tag_name: Option<String>,
    name: Option<String>,
    body: Option<String>,
    draft: Option<bool>,
    prerelease: Option<bool>,
    author: Option<UserRef>,
    created_at: Option<String>,
    published_at: Option<String>,
    html_url: Option<String>,
    #[serde(default)]
    assets: Vec<Asset>,
}

#[derive(Deserialize)]
struct UserRef {
    login: Option<String>,
}

#[derive(Deserialize)]
struct Asset {
    name: Option<String>,
    size: Option<i64>,
    download_count: Option<i64>,
    browser_download_url: Option<String>,
    content_type: Option<String>,
    created_at: Option<String>,
    updated_at: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_release_urls() {
        assert!(matches(
            "https://github.com/rust-lang/rust/releases/tag/1.85.0"
        ));
        assert!(matches(
            "https://github.com/0xMassi/webclaw/releases/tag/v0.4.0"
        ));
        assert!(!matches("https://github.com/rust-lang/rust"));
        assert!(!matches("https://github.com/rust-lang/rust/releases"));
        assert!(!matches("https://github.com/rust-lang/rust/pull/100"));
    }

    #[test]
    fn parse_release_extracts_owner_repo_tag() {
        assert_eq!(
            parse_release("https://github.com/0xMassi/webclaw/releases/tag/v0.4.0"),
            Some(("0xMassi".into(), "webclaw".into(), "v0.4.0".into()))
        );
        assert_eq!(
            parse_release("https://github.com/rust-lang/rust/releases/tag/1.85.0/?foo=bar"),
            Some(("rust-lang".into(), "rust".into(), "1.85.0".into()))
        );
    }
}
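The `total_downloads` field above sums per-asset counts where each count is optional in the API payload; a missing count is treated as zero rather than poisoning the whole sum. A toy, self-contained version of that aggregation:

```rust
// Mirror of the total_downloads fold: Option<i64> per asset,
// None defaults to 0 via unwrap_or before summing.
fn total(counts: &[Option<i64>]) -> i64 {
    counts.iter().map(|c| c.unwrap_or(0)).sum()
}

fn main() {
    assert_eq!(total(&[Some(10), None, Some(5)]), 15);
    // An empty asset list sums to 0, not an error.
    assert_eq!(total(&[]), 0);
}
```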
crates/webclaw-fetch/src/extractors/github_repo.rs (new file, 212 lines)
@@ -0,0 +1,212 @@
//! GitHub repository structured extractor.
//!
//! Uses GitHub's public REST API at `api.github.com/repos/{owner}/{repo}`.
//! Unauthenticated requests get 60/hour per IP, which is fine for users
//! self-hosting and for low-volume cloud usage. Production cloud should
//! set a `GITHUB_TOKEN` to lift to 5,000/hour, but the extractor doesn't
//! depend on it being set — it works open out of the box.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "github_repo",
    label: "GitHub repository",
    description: "Returns repo metadata: stars, forks, topics, license, default branch, recent activity.",
    url_patterns: &["https://github.com/{owner}/{repo}"],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    // Path must be exactly /{owner}/{repo} (or with trailing slash). Reject
    // sub-pages (issues, pulls, blob, etc.) so we don't claim URLs the
    // future github_issue / github_pr extractors will handle.
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    segs.len() == 2 && !RESERVED_OWNERS.contains(&segs[0])
}

/// GitHub uses some top-level paths for non-repo pages.
const RESERVED_OWNERS: &[&str] = &[
    "settings",
    "marketplace",
    "explore",
    "topics",
    "trending",
    "collections",
    "events",
    "sponsors",
    "issues",
    "pulls",
    "notifications",
    "new",
    "organizations",
    "login",
    "join",
    "search",
    "about",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, repo) = parse_owner_repo(url).ok_or_else(|| {
        FetchError::Build(format!("github_repo: cannot parse owner/repo from '{url}'"))
    })?;

    let api_url = format!("https://api.github.com/repos/{owner}/{repo}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "github_repo: repo '{owner}/{repo}' not found"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(
            "github_repo: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
        ));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "github api returned status {}",
            resp.status
        )));
    }

    let r: Repo = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("github api parse: {e}")))?;

    Ok(json!({
        "url": url,
        "owner": r.owner.as_ref().map(|o| &o.login),
        "name": r.name,
        "full_name": r.full_name,
        "description": r.description,
        "homepage": r.homepage,
        "language": r.language,
        "topics": r.topics,
        "license": r.license.as_ref().and_then(|l| l.spdx_id.clone()),
        "license_name": r.license.as_ref().map(|l| l.name.clone()),
        "default_branch": r.default_branch,
        "stars": r.stargazers_count,
        "forks": r.forks_count,
        "watchers": r.subscribers_count,
        "open_issues": r.open_issues_count,
        "size_kb": r.size,
        "archived": r.archived,
        "fork": r.fork,
        "is_template": r.is_template,
        "has_issues": r.has_issues,
        "has_wiki": r.has_wiki,
        "has_pages": r.has_pages,
        "has_discussions": r.has_discussions,
        "created_at": r.created_at,
        "updated_at": r.updated_at,
        "pushed_at": r.pushed_at,
        "html_url": r.html_url,
    }))
}

fn parse_owner_repo(url: &str) -> Option<(String, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let owner = segs.next()?.to_string();
    let repo = segs.next()?.to_string();
    Some((owner, repo))
}

// ---------------------------------------------------------------------------
// GitHub API types — only the fields we surface
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Repo {
    name: Option<String>,
    full_name: Option<String>,
    description: Option<String>,
    homepage: Option<String>,
    language: Option<String>,
    #[serde(default)]
    topics: Vec<String>,
    license: Option<License>,
    default_branch: Option<String>,
    stargazers_count: Option<i64>,
    forks_count: Option<i64>,
    subscribers_count: Option<i64>,
    open_issues_count: Option<i64>,
    size: Option<i64>,
    archived: Option<bool>,
    fork: Option<bool>,
    is_template: Option<bool>,
    has_issues: Option<bool>,
    has_wiki: Option<bool>,
    has_pages: Option<bool>,
    has_discussions: Option<bool>,
    created_at: Option<String>,
    updated_at: Option<String>,
    pushed_at: Option<String>,
    html_url: Option<String>,
    owner: Option<Owner>,
}

#[derive(Deserialize)]
struct Owner {
    login: String,
}

#[derive(Deserialize)]
struct License {
    name: String,
    spdx_id: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_repo_root_only() {
        assert!(matches("https://github.com/rust-lang/rust"));
        assert!(matches("https://github.com/rust-lang/rust/"));
        assert!(!matches("https://github.com/rust-lang/rust/issues"));
        assert!(!matches("https://github.com/rust-lang/rust/pulls/123"));
        assert!(!matches("https://github.com/rust-lang"));
        assert!(!matches("https://github.com/marketplace"));
        assert!(!matches("https://github.com/topics/rust"));
        assert!(!matches("https://example.com/foo/bar"));
    }

    #[test]
    fn parse_owner_repo_handles_trailing_slash_and_query() {
        assert_eq!(
            parse_owner_repo("https://github.com/rust-lang/rust"),
            Some(("rust-lang".into(), "rust".into()))
        );
        assert_eq!(
            parse_owner_repo("https://github.com/rust-lang/rust/?tab=foo"),
            Some(("rust-lang".into(), "rust".into()))
        );
    }
}
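The repo-root matching rule above reduces to "exactly two path segments, and the first isn't a GitHub-internal namespace". A self-contained sketch of that predicate, with a trimmed-down reserved list used purely for illustration (the real extractor checks the full `RESERVED_OWNERS` list and also accepts `www.github.com`):

```rust
// Illustrative subset of the reserved-owner list; not the full set.
const RESERVED: &[&str] = &["marketplace", "topics", "settings"];

// Standalone version of the matches() shape: host check, strip
// query/fragment and trailing slash, then count non-empty segments.
fn is_repo_root(url: &str) -> bool {
    let path = match url.split("://").nth(1).and_then(|s| s.split_once('/')) {
        Some((host, p)) if host == "github.com" => p,
        _ => return false,
    };
    let stripped = path.split(['?', '#']).next().unwrap_or("").trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    segs.len() == 2 && !RESERVED.contains(&segs[0])
}

fn main() {
    assert!(is_repo_root("https://github.com/rust-lang/rust/"));
    // Sub-pages have three segments, so they fall through to other extractors.
    assert!(!is_repo_root("https://github.com/rust-lang/rust/issues"));
    // Two segments, but a reserved namespace.
    assert!(!is_repo_root("https://github.com/topics/rust"));
}
```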
crates/webclaw-fetch/src/extractors/hackernews.rs (new file, 186 lines)
@@ -0,0 +1,186 @@
//! Hacker News structured extractor.
//!
//! Uses Algolia's HN API (`hn.algolia.com/api/v1/items/{id}`) which
//! returns the full post + recursive comment tree in a single request.
//! The official Firebase API at `hacker-news.firebaseio.com` requires
//! N+1 fetches per comment, so we'd hit either timeout or rate-limit
//! on any non-trivial thread.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "hackernews",
    label: "Hacker News story",
    description: "Returns post + nested comment tree for a Hacker News item.",
    url_patterns: &[
        "https://news.ycombinator.com/item?id=N",
        "https://hn.algolia.com/items/N",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host == "news.ycombinator.com" {
        return url.contains("item?id=") || url.contains("item%3Fid=");
    }
    if host == "hn.algolia.com" {
        return url.contains("/items/");
    }
    false
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let id = parse_item_id(url).ok_or_else(|| {
        FetchError::Build(format!("hackernews: cannot parse item id from '{url}'"))
    })?;

    let api_url = format!("https://hn.algolia.com/api/v1/items/{id}");
    let resp = client.fetch(&api_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "hn algolia returned status {}",
            resp.status
        )));
    }

    let item: AlgoliaItem = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("hn algolia parse: {e}")))?;

    let post = post_json(&item);
    let comments: Vec<Value> = item.children.iter().filter_map(comment_json).collect();

    Ok(json!({
        "url": url,
        "post": post,
        "comments": comments,
    }))
}

// ---------------------------------------------------------------------------
// Helpers
// ---------------------------------------------------------------------------

/// Pull the numeric id out of a HN URL. Handles `item?id=N` and the
/// Algolia mirror's `/items/N` form.
fn parse_item_id(url: &str) -> Option<u64> {
    if let Some(after) = url.split("id=").nth(1) {
        let n = after.split('&').next().unwrap_or(after);
        if let Ok(id) = n.parse::<u64>() {
            return Some(id);
        }
    }
    if let Some(after) = url.split("/items/").nth(1) {
        let n = after.split(['/', '?', '#']).next().unwrap_or(after);
        if let Ok(id) = n.parse::<u64>() {
            return Some(id);
        }
    }
    None
}

fn post_json(item: &AlgoliaItem) -> Value {
    json!({
        "id": item.id,
        "type": item.r#type,
        "title": item.title,
        "url": item.url,
        "author": item.author,
        "points": item.points,
        "text": item.text, // populated for ask/show/tell
        "created_at": item.created_at,
        "created_at_unix": item.created_at_i,
        "comment_count": count_descendants(item),
        "permalink": item.id.map(|i| format!("https://news.ycombinator.com/item?id={i}")),
    })
}

fn comment_json(item: &AlgoliaItem) -> Option<Value> {
    if !matches!(item.r#type.as_deref(), Some("comment")) {
        return None;
    }
    // Dead/deleted comments still appear in the tree; surface them honestly.
    let replies: Vec<Value> = item.children.iter().filter_map(comment_json).collect();
    Some(json!({
        "id": item.id,
        "author": item.author,
        "text": item.text,
        "created_at": item.created_at,
        "created_at_unix": item.created_at_i,
        "parent_id": item.parent_id,
        "story_id": item.story_id,
        "replies": replies,
    }))
}

fn count_descendants(item: &AlgoliaItem) -> usize {
    item.children
        .iter()
        .filter(|c| matches!(c.r#type.as_deref(), Some("comment")))
        .map(|c| 1 + count_descendants(c))
        .sum()
}

// ---------------------------------------------------------------------------
// Algolia API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct AlgoliaItem {
    id: Option<u64>,
    r#type: Option<String>,
    title: Option<String>,
    url: Option<String>,
    author: Option<String>,
    points: Option<i64>,
    text: Option<String>,
    created_at: Option<String>,
    created_at_i: Option<i64>,
    parent_id: Option<u64>,
    story_id: Option<u64>,
    #[serde(default)]
    children: Vec<AlgoliaItem>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_hn_item_urls() {
        assert!(matches("https://news.ycombinator.com/item?id=1"));
        assert!(matches("https://news.ycombinator.com/item?id=12345"));
        assert!(matches("https://hn.algolia.com/items/1"));
    }

    #[test]
    fn rejects_non_item_urls() {
        assert!(!matches("https://news.ycombinator.com/"));
        assert!(!matches("https://news.ycombinator.com/news"));
        assert!(!matches("https://example.com/item?id=1"));
    }

    #[test]
    fn parse_item_id_handles_both_forms() {
        assert_eq!(
            parse_item_id("https://news.ycombinator.com/item?id=1"),
            Some(1)
        );
        assert_eq!(
            parse_item_id("https://news.ycombinator.com/item?id=12345&p=2"),
            Some(12345)
        );
        assert_eq!(parse_item_id("https://hn.algolia.com/items/999"), Some(999));
        assert_eq!(parse_item_id("https://example.com/foo"), None);
    }
}
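The `count_descendants` recursion above only counts children whose type is `"comment"` at each level, so polls, job entries, or other non-comment children never inflate `comment_count`. A toy, self-contained model of that traversal (the local `Node` type stands in for `AlgoliaItem`):

```rust
// Minimal stand-in for AlgoliaItem: a type flag plus nested children.
struct Node {
    is_comment: bool,
    children: Vec<Node>,
}

// Same shape as count_descendants: filter to comments, then
// count each comment as 1 plus everything beneath it.
fn count(n: &Node) -> usize {
    n.children
        .iter()
        .filter(|c| c.is_comment)
        .map(|c| 1 + count(c))
        .sum()
}

fn main() {
    // A story with two top-level comments, one of which has one reply.
    let story = Node {
        is_comment: false,
        children: vec![
            Node {
                is_comment: true,
                children: vec![Node { is_comment: true, children: vec![] }],
            },
            Node { is_comment: true, children: vec![] },
        ],
    };
    assert_eq!(count(&story), 3);
}
```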
crates/webclaw-fetch/src/extractors/huggingface_dataset.rs (new file, 189 lines)
@@ -0,0 +1,189 @@
//! HuggingFace dataset structured extractor.
//!
//! Same shape as the model extractor but hits the dataset endpoint.
//! `huggingface.co/api/datasets/{owner}/{name}`.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "huggingface_dataset",
    label: "HuggingFace dataset",
    description: "Returns dataset metadata: downloads, likes, license, language, task categories, file list.",
    url_patterns: &["https://huggingface.co/datasets/{owner}/{name}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "huggingface.co" && host != "www.huggingface.co" {
        return false;
    }
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // /datasets/{name} (legacy top-level) or /datasets/{owner}/{name} (canonical).
    segs.first().copied() == Some("datasets") && (segs.len() == 2 || segs.len() == 3)
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let dataset_path = parse_dataset_path(url).ok_or_else(|| {
        FetchError::Build(format!(
            "hf_dataset: cannot parse dataset path from '{url}'"
        ))
    })?;

    let api_url = format!("https://huggingface.co/api/datasets/{dataset_path}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "hf_dataset: '{dataset_path}' not found"
        )));
    }
    if resp.status == 401 {
        return Err(FetchError::Build(format!(
            "hf_dataset: '{dataset_path}' requires authentication (gated)"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "hf_dataset api returned status {}",
            resp.status
        )));
    }

    let d: DatasetInfo = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("hf_dataset parse: {e}")))?;

    let files: Vec<Value> = d
        .siblings
        .iter()
        .map(|s| json!({"rfilename": s.rfilename, "size": s.size}))
        .collect();

    Ok(json!({
        "url": url,
        "id": d.id,
        "private": d.private,
        "gated": d.gated,
        "downloads": d.downloads,
        "downloads_all_time": d.downloads_all_time,
        "likes": d.likes,
        "tags": d.tags,
        "license": d.card_data.as_ref().and_then(|c| c.license.clone()),
        "language": d.card_data.as_ref().and_then(|c| c.language.clone()),
        "task_categories": d.card_data.as_ref().and_then(|c| c.task_categories.clone()),
        "size_categories": d.card_data.as_ref().and_then(|c| c.size_categories.clone()),
        "annotations_creators": d.card_data.as_ref().and_then(|c| c.annotations_creators.clone()),
        "configs": d.card_data.as_ref().and_then(|c| c.configs.clone()),
        "created_at": d.created_at,
        "last_modified": d.last_modified,
        "sha": d.sha,
        "file_count": d.siblings.len(),
        "files": files,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Returns the part to append to the API URL — either `name` (legacy
/// top-level dataset like `squad`) or `owner/name` (canonical form).
fn parse_dataset_path(url: &str) -> Option<String> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    if segs.next() != Some("datasets") {
        return None;
    }
    let first = segs.next()?.to_string();
    match segs.next() {
        Some(second) => Some(format!("{first}/{second}")),
        None => Some(first),
    }
}

#[derive(Deserialize)]
struct DatasetInfo {
    id: Option<String>,
    private: Option<bool>,
    gated: Option<serde_json::Value>,
    downloads: Option<i64>,
    #[serde(rename = "downloadsAllTime")]
    downloads_all_time: Option<i64>,
    likes: Option<i64>,
    #[serde(default)]
    tags: Vec<String>,
    #[serde(rename = "createdAt")]
    created_at: Option<String>,
    #[serde(rename = "lastModified")]
    last_modified: Option<String>,
    sha: Option<String>,
    #[serde(rename = "cardData")]
    card_data: Option<DatasetCard>,
    #[serde(default)]
    siblings: Vec<Sibling>,
}

#[derive(Deserialize)]
struct DatasetCard {
    license: Option<serde_json::Value>,
    language: Option<serde_json::Value>,
    task_categories: Option<serde_json::Value>,
    size_categories: Option<serde_json::Value>,
    annotations_creators: Option<serde_json::Value>,
    configs: Option<serde_json::Value>,
}

#[derive(Deserialize)]
struct Sibling {
    rfilename: String,
    size: Option<i64>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_dataset_pages() {
        assert!(matches("https://huggingface.co/datasets/squad")); // legacy top-level
        assert!(matches("https://huggingface.co/datasets/openai/gsm8k")); // canonical owner/name
        assert!(!matches("https://huggingface.co/openai/whisper-large-v3"));
        assert!(!matches("https://huggingface.co/datasets/"));
    }

    #[test]
    fn parse_dataset_path_works() {
        assert_eq!(
            parse_dataset_path("https://huggingface.co/datasets/squad"),
            Some("squad".into())
        );
        assert_eq!(
            parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k"),
            Some("openai/gsm8k".into())
        );
        assert_eq!(
            parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k/?lib=transformers"),
            Some("openai/gsm8k".into())
        );
    }
}
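The legacy-vs-canonical dataset path rule above (`/datasets/squad` maps to `squad`, `/datasets/openai/gsm8k` to `openai/gsm8k`) can be exercised standalone; this is an independent port of `parse_dataset_path`, shown only to make the two URL forms concrete:

```rust
// Standalone port of parse_dataset_path: after the "datasets" segment,
// one segment is a legacy top-level name, two segments are owner/name.
fn dataset_path(url: &str) -> Option<String> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    if segs.next() != Some("datasets") {
        return None;
    }
    let first = segs.next()?.to_string();
    match segs.next() {
        Some(second) => Some(format!("{first}/{second}")),
        None => Some(first),
    }
}

fn main() {
    assert_eq!(
        dataset_path("https://huggingface.co/datasets/squad").as_deref(),
        Some("squad")
    );
    // Query strings and trailing slashes are stripped before splitting.
    assert_eq!(
        dataset_path("https://huggingface.co/datasets/openai/gsm8k/?lib=transformers").as_deref(),
        Some("openai/gsm8k")
    );
    // Model pages never start with the "datasets" segment.
    assert_eq!(dataset_path("https://huggingface.co/openai/whisper-large-v3"), None);
}
```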
crates/webclaw-fetch/src/extractors/huggingface_model.rs (new file, 223 lines)
@@ -0,0 +1,223 @@
//! HuggingFace model card structured extractor.
//!
//! Uses the public model API at `huggingface.co/api/models/{owner}/{name}`.
//! Returns metadata + the parsed model card front matter, but does not
//! pull the full README body — those are sometimes 100KB+ and the user
//! can hit /v1/scrape if they want it as markdown.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "huggingface_model",
    label: "HuggingFace model",
    description: "Returns model metadata: downloads, likes, license, pipeline tag, library name, file list.",
    url_patterns: &["https://huggingface.co/{owner}/{name}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "huggingface.co" && host != "www.huggingface.co" {
        return false;
    }
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // /{owner}/{name} but reject HF-internal sections + sub-pages.
    if segs.len() != 2 {
        return false;
    }
    !RESERVED_NAMESPACES.contains(&segs[0])
}

const RESERVED_NAMESPACES: &[&str] = &[
    "datasets",
    "spaces",
    "blog",
    "docs",
    "api",
    "models",
    "papers",
    "pricing",
    "tasks",
    "join",
    "login",
    "settings",
    "organizations",
    "new",
    "search",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, name) = parse_owner_name(url).ok_or_else(|| {
        FetchError::Build(format!("hf model: cannot parse owner/name from '{url}'"))
    })?;

    let api_url = format!("https://huggingface.co/api/models/{owner}/{name}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "hf model: '{owner}/{name}' not found"
        )));
    }
    if resp.status == 401 {
        return Err(FetchError::Build(format!(
            "hf model: '{owner}/{name}' requires authentication (gated repo)"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "hf api returned status {}",
            resp.status
        )));
    }

    let m: ModelInfo = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("hf api parse: {e}")))?;

    // Surface a flat file list — full siblings can be hundreds of entries
    // for big repos. We keep it as-is because callers want to know about
    // every shard; if it bloats responses too much we'll add pagination.
    let files: Vec<Value> = m
        .siblings
        .iter()
        .map(|s| json!({"rfilename": s.rfilename, "size": s.size}))
        .collect();

    Ok(json!({
        "url": url,
        "id": m.id,
        "model_id": m.model_id,
        "private": m.private,
        "gated": m.gated,
        "downloads": m.downloads,
        "downloads_all_time": m.downloads_all_time,
        "likes": m.likes,
        "library_name": m.library_name,
        "pipeline_tag": m.pipeline_tag,
        "tags": m.tags,
        "license": m.card_data.as_ref().and_then(|c| c.license.clone()),
        "language": m.card_data.as_ref().and_then(|c| c.language.clone()),
        "datasets": m.card_data.as_ref().and_then(|c| c.datasets.clone()),
        "base_model": m.card_data.as_ref().and_then(|c| c.base_model.clone()),
        "model_type": m.card_data.as_ref().and_then(|c| c.model_type.clone()),
        "created_at": m.created_at,
        "last_modified": m.last_modified,
        "sha": m.sha,
        "file_count": m.siblings.len(),
        "files": files,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_owner_name(url: &str) -> Option<(String, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let owner = segs.next()?.to_string();
|
||||
let name = segs.next()?.to_string();
|
||||
Some((owner, name))
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// HF API types
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct ModelInfo {
|
||||
id: Option<String>,
|
||||
#[serde(rename = "modelId")]
|
||||
model_id: Option<String>,
|
||||
private: Option<bool>,
|
||||
gated: Option<serde_json::Value>, // bool or string ("auto" / "manual" / false)
|
||||
downloads: Option<i64>,
|
||||
#[serde(rename = "downloadsAllTime")]
|
||||
downloads_all_time: Option<i64>,
|
||||
likes: Option<i64>,
|
||||
#[serde(rename = "library_name")]
|
||||
library_name: Option<String>,
|
||||
#[serde(rename = "pipeline_tag")]
|
||||
pipeline_tag: Option<String>,
|
||||
#[serde(default)]
|
||||
tags: Vec<String>,
|
||||
#[serde(rename = "createdAt")]
|
||||
created_at: Option<String>,
|
||||
#[serde(rename = "lastModified")]
|
||||
last_modified: Option<String>,
|
||||
sha: Option<String>,
|
||||
#[serde(rename = "cardData")]
|
||||
card_data: Option<CardData>,
|
||||
#[serde(default)]
|
||||
siblings: Vec<Sibling>,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct CardData {
|
||||
license: Option<serde_json::Value>, // string or array
|
||||
language: Option<serde_json::Value>,
|
||||
datasets: Option<serde_json::Value>,
|
||||
#[serde(rename = "base_model")]
|
||||
base_model: Option<serde_json::Value>,
|
||||
#[serde(rename = "model_type")]
|
||||
model_type: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct Sibling {
|
||||
rfilename: String,
|
||||
size: Option<i64>,
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn matches_model_pages() {
|
||||
assert!(matches("https://huggingface.co/meta-llama/Meta-Llama-3-8B"));
|
||||
assert!(matches("https://huggingface.co/openai/whisper-large-v3"));
|
||||
assert!(matches("https://huggingface.co/bert-base-uncased/main")); // owner=bert-base-uncased name=main: false positive but acceptable for v1
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn rejects_hf_section_pages() {
|
||||
assert!(!matches("https://huggingface.co/datasets/squad"));
|
||||
assert!(!matches("https://huggingface.co/spaces/foo/bar"));
|
||||
assert!(!matches("https://huggingface.co/blog/intro"));
|
||||
assert!(!matches("https://huggingface.co/"));
|
||||
assert!(!matches("https://huggingface.co/meta-llama"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_owner_name_pulls_both() {
|
||||
assert_eq!(
|
||||
parse_owner_name("https://huggingface.co/meta-llama/Meta-Llama-3-8B"),
|
||||
Some(("meta-llama".into(), "Meta-Llama-3-8B".into()))
|
||||
);
|
||||
assert_eq!(
|
||||
parse_owner_name("https://huggingface.co/openai/whisper-large-v3?library=transformers"),
|
||||
Some(("openai".into(), "whisper-large-v3".into()))
|
||||
);
|
||||
}
|
||||
}
|
||||
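The page-URL to API-URL mapping the extractor performs is a pure string transform. A minimal standalone sketch (a hypothetical helper restating the logic, not part of the crate; it assumes the two-segment / reserved-namespace validation has already passed):

```rust
// Strip the scheme, drop query/fragment, take the first two path
// segments, and rebuild against huggingface.co/api/models/.
fn hf_api_url(page_url: &str) -> Option<String> {
    let path = page_url.split("://").nth(1)?.split_once('/')?.1;
    let clean = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = clean.split('/').filter(|s| !s.is_empty());
    let (owner, name) = (segs.next()?, segs.next()?);
    Some(format!("https://huggingface.co/api/models/{owner}/{name}"))
}

fn main() {
    assert_eq!(
        hf_api_url("https://huggingface.co/meta-llama/Meta-Llama-3-8B").as_deref(),
        Some("https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B")
    );
}
```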
235	crates/webclaw-fetch/src/extractors/instagram_post.rs	Normal file
@@ -0,0 +1,235 @@
//! Instagram post structured extractor.
//!
//! Uses Instagram's public embed endpoint
//! `/p/{shortcode}/embed/captioned/` which returns SSR HTML with the
//! full caption, author username, and thumbnail. No auth required.
//! The same endpoint serves reels and IGTV under `/reel/{code}` and
//! `/tv/{code}` URLs (we accept all three).

use regex::Regex;
use serde_json::{Value, json};
use std::sync::OnceLock;

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "instagram_post",
    label: "Instagram post",
    description: "Returns full caption, author username, thumbnail, and post type (post / reel / tv) via Instagram's public embed.",
    url_patterns: &[
        "https://www.instagram.com/p/{shortcode}/",
        "https://www.instagram.com/reel/{shortcode}/",
        "https://www.instagram.com/tv/{shortcode}/",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !matches!(host, "www.instagram.com" | "instagram.com") {
        return false;
    }
    parse_shortcode(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (kind, shortcode) = parse_shortcode(url).ok_or_else(|| {
        FetchError::Build(format!(
            "instagram_post: cannot parse shortcode from '{url}'"
        ))
    })?;

    // Instagram serves the same embed HTML for posts/reels/tv under /p/.
    let embed_url = format!("https://www.instagram.com/p/{shortcode}/embed/captioned/");
    let resp = client.fetch(&embed_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "instagram embed returned status {} for {shortcode}",
            resp.status
        )));
    }

    let html = &resp.html;
    let username = parse_username(html);
    let caption = parse_caption(html);
    let thumbnail = parse_thumbnail(html);

    Ok(json!({
        "url": url,
        "embed_url": embed_url,
        "shortcode": shortcode,
        "kind": kind,
        "data_completeness": "embed",
        "author_username": username,
        "caption": caption,
        "thumbnail_url": thumbnail,
        "canonical_url": format!("https://www.instagram.com/{}/{shortcode}/", path_segment_for(kind)),
    }))
}

// ---------------------------------------------------------------------------
// URL parsing
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Returns `(kind, shortcode)` where kind ∈ {`post`, `reel`, `tv`}.
fn parse_shortcode(url: &str) -> Option<(&'static str, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let first = segs.next()?;
    let kind = match first {
        "p" => "post",
        "reel" | "reels" => "reel",
        "tv" => "tv",
        _ => return None,
    };
    let shortcode = segs.next()?;
    if shortcode.is_empty() {
        return None;
    }
    Some((kind, shortcode.to_string()))
}

fn path_segment_for(kind: &str) -> &'static str {
    match kind {
        "reel" => "reel",
        "tv" => "tv",
        _ => "p",
    }
}

// ---------------------------------------------------------------------------
// HTML scraping
// ---------------------------------------------------------------------------

/// Username appears as the anchor text inside `<a class="CaptionUsername">`.
fn parse_username(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"(?s)class="CaptionUsername"[^>]*>([^<]+)<"#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| html_decode(m.as_str().trim()))
}

/// Caption sits inside `<div class="Caption">` after the username anchor.
/// We grab the whole Caption block and strip out the username link, time
/// node, and any trailing "Photo by" / "View ... on Instagram" boilerplate.
fn parse_caption(html: &str) -> Option<String> {
    static RE_OUTER: OnceLock<Regex> = OnceLock::new();
    let outer = RE_OUTER
        .get_or_init(|| Regex::new(r#"(?s)<div\s+class="Caption"[^>]*>(.*?)</div>"#).unwrap());
    let block = outer.captures(html)?.get(1)?.as_str();

    // Strip everything wrapped in <a class="CaptionUsername">...</a>.
    static RE_USER: OnceLock<Regex> = OnceLock::new();
    let user_re = RE_USER
        .get_or_init(|| Regex::new(r#"(?s)<a[^>]*class="CaptionUsername"[^>]*>.*?</a>"#).unwrap());
    let stripped = user_re.replace_all(block, "");

    // Then strip any remaining tags.
    static RE_TAGS: OnceLock<Regex> = OnceLock::new();
    let tag_re = RE_TAGS.get_or_init(|| Regex::new(r"<[^>]+>").unwrap());
    let text = tag_re.replace_all(&stripped, " ");

    let cleaned = collapse_whitespace(&html_decode(text.trim()));
    if cleaned.is_empty() {
        None
    } else {
        Some(cleaned)
    }
}

/// Thumbnail is the `<img class="EmbeddedMediaImage">` inside the embed
/// (or the og:image as fallback).
fn parse_thumbnail(html: &str) -> Option<String> {
    static RE_IMG: OnceLock<Regex> = OnceLock::new();
    let img_re = RE_IMG.get_or_init(|| {
        Regex::new(r#"(?s)<img[^>]+class="[^"]*EmbeddedMediaImage[^"]*"[^>]+src="([^"]+)""#)
            .unwrap()
    });
    if let Some(m) = img_re.captures(html).and_then(|c| c.get(1)) {
        return Some(html_decode(m.as_str()));
    }
    static RE_OG: OnceLock<Regex> = OnceLock::new();
    let og_re = RE_OG.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:image"[^>]+content="([^"]+)""#).unwrap()
    });
    og_re
        .captures(html)
        .and_then(|c| c.get(1))
        .map(|m| html_decode(m.as_str()))
}

fn html_decode(s: &str) -> String {
    s.replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&#64;", "@")
        .replace("&#8226;", "•")
        .replace("&hellip;", "…")
}

fn collapse_whitespace(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_post_reel_tv_urls() {
        assert!(matches("https://www.instagram.com/p/DT-RICMjeK5/"));
        assert!(matches(
            "https://www.instagram.com/p/DT-RICMjeK5/?img_index=1"
        ));
        assert!(matches("https://www.instagram.com/reel/abc123/"));
        assert!(matches("https://www.instagram.com/tv/abc123/"));
        assert!(!matches("https://www.instagram.com/ticketswave"));
        assert!(!matches("https://www.instagram.com/"));
        assert!(!matches("https://example.com/p/abc/"));
    }

    #[test]
    fn parse_shortcode_reads_each_kind() {
        assert_eq!(
            parse_shortcode("https://www.instagram.com/p/DT-RICMjeK5/?img_index=1"),
            Some(("post", "DT-RICMjeK5".into()))
        );
        assert_eq!(
            parse_shortcode("https://www.instagram.com/reel/abc123/"),
            Some(("reel", "abc123".into()))
        );
        assert_eq!(
            parse_shortcode("https://www.instagram.com/tv/abc123"),
            Some(("tv", "abc123".into()))
        );
    }

    #[test]
    fn parse_username_pulls_anchor_text() {
        let html = r#"<a class="CaptionUsername" href="...">ticketswave</a>"#;
        assert_eq!(parse_username(html).as_deref(), Some("ticketswave"));
    }

    #[test]
    fn parse_caption_strips_username_anchor() {
        let html = r#"<div class="Caption"><a class="CaptionUsername" href="...">ticketswave</a> Some caption text here</div>"#;
        assert_eq!(
            parse_caption(html).as_deref(),
            Some("Some caption text here")
        );
    }
}
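The post/reel/tv funnel to the single captioned-embed endpoint can be restated as one pure function. A hypothetical standalone sketch (mirroring the `parse_shortcode` + `embed_url` steps above, not part of the crate):

```rust
// Any /p/, /reel(s)/, or /tv/ URL collapses to the same
// /p/{shortcode}/embed/captioned/ endpoint the extractor fetches.
fn ig_embed_url(post_url: &str) -> Option<String> {
    let path = post_url.split("://").nth(1)?.split_once('/')?.1;
    let clean = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = clean.split('/').filter(|s| !s.is_empty());
    match segs.next()? {
        "p" | "reel" | "reels" | "tv" => {}
        _ => return None,
    }
    let code = segs.next()?;
    Some(format!("https://www.instagram.com/p/{code}/embed/captioned/"))
}

fn main() {
    assert_eq!(
        ig_embed_url("https://www.instagram.com/reel/abc123/").as_deref(),
        Some("https://www.instagram.com/p/abc123/embed/captioned/")
    );
}
```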
465	crates/webclaw-fetch/src/extractors/instagram_profile.rs	Normal file
@@ -0,0 +1,465 @@
//! Instagram profile structured extractor.
//!
//! Hits Instagram's internal `web_profile_info` endpoint at
//! `instagram.com/api/v1/users/web_profile_info/?username=X`. The
//! `x-ig-app-id` header is Instagram's own public web-app id (not a
//! secret) — the same value Instagram's own JavaScript bundle sends.
//!
//! Returns the full profile (bio, exact follower count, verified /
//! business flags, profile picture) plus the **12 most recent posts**
//! with shortcodes, like counts, types, thumbnails, and caption
//! previews. Callers can fan out to `/v1/scrape/instagram_post` per
//! shortcode to get the full caption + media.
//!
//! Pagination beyond 12 requires authenticated cookies + a CSRF token;
//! we accept that as the practical ceiling for the unauth path. The
//! cloud (with stored sessions) can paginate later as a follow-up.
//!
//! Falls back to OG-tag scraping of the public profile page if the API
//! returns 401/403 — Instagram has tightened this endpoint multiple
//! times, so we keep the second path warm.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "instagram_profile",
    label: "Instagram profile",
    description: "Returns full profile metadata + the 12 most recent posts (shortcode, url, type, likes, thumbnail).",
    url_patterns: &["https://www.instagram.com/{username}/"],
};

/// Instagram's own public web-app identifier. Sent by their JS bundle
/// on every API call, accepted by the unauth endpoint, not a secret.
const IG_APP_ID: &str = "936619743392459";

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !matches!(host, "www.instagram.com" | "instagram.com") {
        return false;
    }
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    segs.len() == 1 && !RESERVED.contains(&segs[0])
}

const RESERVED: &[&str] = &[
    "p",
    "reel",
    "reels",
    "tv",
    "explore",
    "stories",
    "directory",
    "accounts",
    "about",
    "developer",
    "press",
    "api",
    "ads",
    "blog",
    "fragments",
    "terms",
    "privacy",
    "session",
    "login",
    "signup",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let username = parse_username(url).ok_or_else(|| {
        FetchError::Build(format!(
            "instagram_profile: cannot parse username from '{url}'"
        ))
    })?;

    let api_url =
        format!("https://www.instagram.com/api/v1/users/web_profile_info/?username={username}");
    let extra_headers: &[(&str, &str)] = &[
        ("x-ig-app-id", IG_APP_ID),
        ("accept", "*/*"),
        ("sec-fetch-site", "same-origin"),
        ("x-requested-with", "XMLHttpRequest"),
    ];
    let resp = client.fetch_with_headers(&api_url, extra_headers).await?;

    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "instagram_profile: '{username}' not found"
        )));
    }
    // Auth wall fallback: Instagram occasionally tightens this endpoint
    // and starts returning 401/403/302 to a login page. When that
    // happens we still want to give the caller something useful — the
    // OG tags from the public HTML page (no posts list, but bio etc).
    if !(200..300).contains(&resp.status) {
        return og_fallback(client, &username, url, resp.status).await;
    }

    let body: ApiResponse = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("instagram_profile parse: {e}")))?;
    let user = body.data.user;

    let recent_posts: Vec<Value> = user
        .edge_owner_to_timeline_media
        .as_ref()
        .map(|m| m.edges.iter().map(|e| post_summary(&e.node)).collect())
        .unwrap_or_default();

    Ok(json!({
        "url": url,
        "canonical_url": format!("https://www.instagram.com/{username}/"),
        "username": user.username.unwrap_or(username),
        "data_completeness": "api",
        "user_id": user.id,
        "full_name": user.full_name,
        "biography": user.biography,
        "biography_links": user.bio_links,
        "external_url": user.external_url,
        "category": user.category_name,
        "follower_count": user.edge_followed_by.map(|c| c.count),
        "following_count": user.edge_follow.map(|c| c.count),
        "post_count": user.edge_owner_to_timeline_media.as_ref().map(|m| m.count),
        "is_verified": user.is_verified,
        "is_private": user.is_private,
        "is_business": user.is_business_account,
        "is_professional": user.is_professional_account,
        "profile_pic_url": user.profile_pic_url_hd.or(user.profile_pic_url),
        "recent_posts": recent_posts,
    }))
}

/// Build the per-post summary the caller fans out from. Includes a
/// constructed `url` so the loop is `for p in recent_posts: scrape('instagram_post', p.url)`.
fn post_summary(n: &MediaNode) -> Value {
    let kind = classify(n);
    let url = match kind {
        "reel" => format!(
            "https://www.instagram.com/reel/{}/",
            n.shortcode.as_deref().unwrap_or("")
        ),
        _ => format!(
            "https://www.instagram.com/p/{}/",
            n.shortcode.as_deref().unwrap_or("")
        ),
    };
    let caption = n
        .edge_media_to_caption
        .as_ref()
        .and_then(|c| c.edges.first())
        .and_then(|e| e.node.text.clone());
    json!({
        "shortcode": n.shortcode,
        "url": url,
        "kind": kind,
        "is_video": n.is_video.unwrap_or(false),
        "video_views": n.video_view_count,
        "thumbnail_url": n.thumbnail_src.clone().or_else(|| n.display_url.clone()),
        "display_url": n.display_url,
        "like_count": n.edge_media_preview_like.as_ref().map(|c| c.count),
        "comment_count": n.edge_media_to_comment.as_ref().map(|c| c.count),
        "taken_at": n.taken_at_timestamp,
        "caption": caption,
        "alt_text": n.accessibility_caption,
        "dimensions": n.dimensions.as_ref().map(|d| json!({"width": d.width, "height": d.height})),
        "product_type": n.product_type,
    })
}

/// Best-effort post-type classification. `clips` is reels; `feed` is
/// the regular grid. Sidecar = multi-photo carousel.
fn classify(n: &MediaNode) -> &'static str {
    if n.product_type.as_deref() == Some("clips") {
        return "reel";
    }
    match n.typename.as_deref() {
        Some("GraphSidecar") => "carousel",
        Some("GraphVideo") => "video",
        Some("GraphImage") => "photo",
        _ => "post",
    }
}

/// Fallback when the API path is blocked: hit the public profile HTML,
/// pull whatever OG tags we can. Returns less data and explicitly
/// flags `data_completeness: "og_only"` so callers know.
async fn og_fallback(
    client: &dyn Fetcher,
    username: &str,
    original_url: &str,
    api_status: u16,
) -> Result<Value, FetchError> {
    let canonical = format!("https://www.instagram.com/{username}/");
    let resp = client.fetch(&canonical).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "instagram_profile: api status {api_status}, html status {} for {username}",
            resp.status
        )));
    }
    let og = parse_og_tags(&resp.html);
    let (followers, following, posts) =
        parse_counts_from_og_description(og.get("description").map(String::as_str));

    Ok(json!({
        "url": original_url,
        "canonical_url": canonical,
        "username": username,
        "data_completeness": "og_only",
        "fallback_reason": format!("api returned {api_status}"),
        "full_name": parse_full_name(&og.get("title").cloned().unwrap_or_default()),
        "follower_count": followers,
        "following_count": following,
        "post_count": posts,
        "profile_pic_url": og.get("image").cloned(),
        "biography": null_value(),
        "is_verified": null_value(),
        "is_business": null_value(),
        "recent_posts": Vec::<Value>::new(),
    }))
}

fn null_value() -> Value {
    Value::Null
}

// ---------------------------------------------------------------------------
// URL parsing
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_username(url: &str) -> Option<String> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    stripped
        .split('/')
        .find(|s| !s.is_empty())
        .map(|s| s.to_string())
}

// ---------------------------------------------------------------------------
// OG-fallback helpers (kept self-contained — same shape as the previous
// version we shipped, retained as the safety net)
// ---------------------------------------------------------------------------

fn parse_og_tags(html: &str) -> std::collections::HashMap<String, String> {
    use regex::Regex;
    use std::sync::OnceLock;
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    let mut out = std::collections::HashMap::new();
    for c in re.captures_iter(html) {
        let k = c
            .get(1)
            .map(|m| m.as_str().to_lowercase())
            .unwrap_or_default();
        let v = c
            .get(2)
            .map(|m| html_decode(m.as_str()))
            .unwrap_or_default();
        out.entry(k).or_insert(v);
    }
    out
}

fn parse_full_name(og_title: &str) -> Option<String> {
    if og_title.is_empty() {
        return None;
    }
    let decoded = html_decode(og_title);
    let trimmed = decoded.split('(').next().unwrap_or(&decoded).trim();
    if trimmed.is_empty() {
        None
    } else {
        Some(trimmed.to_string())
    }
}

fn parse_counts_from_og_description(desc: Option<&str>) -> (Option<i64>, Option<i64>, Option<i64>) {
    let Some(text) = desc else {
        return (None, None, None);
    };
    let decoded = html_decode(text);
    use regex::Regex;
    use std::sync::OnceLock;
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r"(?i)([\d.,]+[KMB]?)\s*Followers,\s*([\d.,]+[KMB]?)\s*Following,\s*([\d.,]+[KMB]?)\s*Posts").unwrap()
    });
    if let Some(c) = re.captures(&decoded) {
        return (
            c.get(1).and_then(|m| parse_compact_number(m.as_str())),
            c.get(2).and_then(|m| parse_compact_number(m.as_str())),
            c.get(3).and_then(|m| parse_compact_number(m.as_str())),
        );
    }
    (None, None, None)
}

fn parse_compact_number(s: &str) -> Option<i64> {
    let s = s.trim();
    let (num_str, mul) = match s.chars().last() {
        Some('K') => (&s[..s.len() - 1], 1_000i64),
        Some('M') => (&s[..s.len() - 1], 1_000_000i64),
        Some('B') => (&s[..s.len() - 1], 1_000_000_000i64),
        _ => (s, 1i64),
    };
    let cleaned: String = num_str.chars().filter(|c| *c != ',').collect();
    cleaned.parse::<f64>().ok().map(|f| (f * mul as f64) as i64)
}

fn html_decode(s: &str) -> String {
    s.replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&#64;", "@")
        .replace("&#8226;", "•")
        .replace("&hellip;", "…")
}

// ---------------------------------------------------------------------------
// Instagram web_profile_info API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct ApiResponse {
    data: ApiData,
}

#[derive(Deserialize)]
struct ApiData {
    user: User,
}

#[derive(Deserialize)]
struct User {
    id: Option<String>,
    username: Option<String>,
    full_name: Option<String>,
    biography: Option<String>,
    bio_links: Option<Vec<serde_json::Value>>,
    external_url: Option<String>,
    category_name: Option<String>,
    profile_pic_url: Option<String>,
    profile_pic_url_hd: Option<String>,
    is_verified: Option<bool>,
    is_private: Option<bool>,
    is_business_account: Option<bool>,
    is_professional_account: Option<bool>,
    edge_followed_by: Option<EdgeCount>,
    edge_follow: Option<EdgeCount>,
    edge_owner_to_timeline_media: Option<MediaEdges>,
}

#[derive(Deserialize)]
struct EdgeCount {
    count: i64,
}

#[derive(Deserialize)]
struct MediaEdges {
    count: i64,
    edges: Vec<MediaEdge>,
}

#[derive(Deserialize)]
struct MediaEdge {
    node: MediaNode,
}

#[derive(Deserialize)]
struct MediaNode {
    #[serde(rename = "__typename")]
    typename: Option<String>,
    shortcode: Option<String>,
    is_video: Option<bool>,
    video_view_count: Option<i64>,
    display_url: Option<String>,
    thumbnail_src: Option<String>,
    accessibility_caption: Option<String>,
    taken_at_timestamp: Option<i64>,
    product_type: Option<String>,
    dimensions: Option<Dimensions>,
    edge_media_preview_like: Option<EdgeCount>,
    edge_media_to_comment: Option<EdgeCount>,
    edge_media_to_caption: Option<CaptionEdges>,
}

#[derive(Deserialize)]
struct Dimensions {
    width: i64,
    height: i64,
}

#[derive(Deserialize)]
struct CaptionEdges {
    edges: Vec<CaptionEdge>,
}

#[derive(Deserialize)]
struct CaptionEdge {
    node: CaptionNode,
}

#[derive(Deserialize)]
struct CaptionNode {
    text: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_profile_urls() {
        assert!(matches("https://www.instagram.com/ticketswave"));
        assert!(matches("https://www.instagram.com/ticketswave/"));
        assert!(matches("https://instagram.com/0xmassi/?hl=en"));
        assert!(!matches("https://www.instagram.com/p/DT-RICMjeK5/"));
        assert!(!matches("https://www.instagram.com/explore"));
        assert!(!matches("https://www.instagram.com/"));
        assert!(!matches("https://example.com/foo"));
    }

    #[test]
    fn parse_full_name_strips_handle() {
        assert_eq!(
            parse_full_name("Ticket Wave (@ticketswave) • Instagram photos and videos"),
            Some("Ticket Wave".into())
        );
    }

    #[test]
    fn compact_number_handles_kmb() {
        assert_eq!(parse_compact_number("18K"), Some(18_000));
        assert_eq!(parse_compact_number("1.5M"), Some(1_500_000));
        assert_eq!(parse_compact_number("1,234"), Some(1_234));
        assert_eq!(parse_compact_number("641"), Some(641));
    }
}
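The K/M/B suffix handling the OG fallback relies on reduces to a short pure function. A standalone restatement of that logic (hypothetical helper mirroring `parse_compact_number` above, stdlib only; the suffixes are ASCII, so byte slicing is safe):

```rust
// "18K" → 18_000, "1.5M" → 1_500_000, "1,234" → 1_234.
// Fractional K/M/B values round toward zero via the f64 cast.
fn compact_to_i64(s: &str) -> Option<i64> {
    let s = s.trim();
    let (digits, mul) = match s.chars().last()? {
        'K' => (&s[..s.len() - 1], 1_000_i64),
        'M' => (&s[..s.len() - 1], 1_000_000),
        'B' => (&s[..s.len() - 1], 1_000_000_000),
        _ => (s, 1),
    };
    let cleaned: String = digits.chars().filter(|c| *c != ',').collect();
    cleaned.parse::<f64>().ok().map(|f| (f * mul as f64) as i64)
}

fn main() {
    assert_eq!(compact_to_i64("18K"), Some(18_000));
    assert_eq!(compact_to_i64("1.5M"), Some(1_500_000));
    assert_eq!(compact_to_i64("1,234"), Some(1_234));
}
```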
266	crates/webclaw-fetch/src/extractors/linkedin_post.rs	Normal file
@@ -0,0 +1,266 @@
//! LinkedIn post structured extractor.
//!
//! Uses the public embed endpoint `/embed/feed/update/{urn}` which
//! LinkedIn provides for sites that want to render a post inline. No
//! auth required, returns SSR HTML with the full post body, OG tags,
//! image, and a link back to the original post.
//!
//! Accepts both URN forms (`urn:li:share:N` and `urn:li:activity:N`)
//! and pretty post URLs (`/posts/{user}_{slug}-{id}-{suffix}`) by
//! pulling the trailing numeric id and converting to an activity URN.

use regex::Regex;
use serde_json::{Value, json};
use std::sync::OnceLock;

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "linkedin_post",
    label: "LinkedIn post",
    description: "Returns post body, author name, image, and original URL via LinkedIn's public embed endpoint.",
    url_patterns: &[
        "https://www.linkedin.com/feed/update/urn:li:share:{id}",
        "https://www.linkedin.com/feed/update/urn:li:activity:{id}",
        "https://www.linkedin.com/posts/{user}_{slug}-{id}-{suffix}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !matches!(host, "www.linkedin.com" | "linkedin.com") {
        return false;
    }
    url.contains("/feed/update/urn:li:") || url.contains("/posts/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let urn = extract_urn(url).ok_or_else(|| {
        FetchError::Build(format!(
            "linkedin_post: cannot extract URN from '{url}' (expected /feed/update/urn:li:... or /posts/{{slug}}-{{id}})"
        ))
    })?;

    let embed_url = format!("https://www.linkedin.com/embed/feed/update/{urn}");
    let resp = client.fetch(&embed_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "linkedin embed returned status {} for {urn}",
            resp.status
        )));
    }

    let html = &resp.html;
    let og = parse_og_tags(html);
    let body = parse_post_body(html);
    let author = parse_author(html);
    let canonical_url = og.get("url").cloned().unwrap_or_else(|| embed_url.clone());

    Ok(json!({
        "url": url,
        "embed_url": embed_url,
        "urn": urn,
        "canonical_url": canonical_url,
        "data_completeness": "embed",
        "title": og.get("title").cloned(),
        "body": body,
        "author_name": author,
        "image_url": og.get("image").cloned(),
        "site_name": og.get("site_name").cloned().unwrap_or_else(|| "LinkedIn".into()),
    }))
}

// ---------------------------------------------------------------------------
// URN extraction
// ---------------------------------------------------------------------------

/// Pull a `urn:li:share:N` or `urn:li:activity:N` from any LinkedIn URL.
/// `/posts/{slug}-{id}-{suffix}` URLs encode the activity id as the second-
/// to-last `-` separated chunk. Both forms map to a URN we can hit the
/// embed endpoint with.
fn extract_urn(url: &str) -> Option<String> {
    if let Some(idx) = url.find("urn:li:") {
        let tail = &url[idx..];
        let end = tail.find(['/', '?', '#']).unwrap_or(tail.len());
        let urn = &tail[..end];
        // Validate shape: urn:li:{type}:{digits}
        let mut parts = urn.split(':');
        if parts.next() == Some("urn")
            && parts.next() == Some("li")
            && parts.next().is_some()
            && parts
                .next()
                .filter(|p| p.chars().all(|c| c.is_ascii_digit()))
                .is_some()
        {
            return Some(urn.to_string());
        }
    }

    // /posts/{user}_{slug}-{19-digit-id}-{4-char-hash}/ — id is the second-
    // to-last segment after the last `-`.
    if url.contains("/posts/") {
        static RE: OnceLock<Regex> = OnceLock::new();
        let re =
            RE.get_or_init(|| Regex::new(r"/posts/[^/]*?-(\d{15,})-[A-Za-z0-9]{2,}/?").unwrap());
        if let Some(c) = re.captures(url)
            && let Some(id) = c.get(1)
        {
            return Some(format!("urn:li:activity:{}", id.as_str()));
        }
    }
    None
}

// ---------------------------------------------------------------------------
// HTML scraping
// ---------------------------------------------------------------------------

/// Pull `og:foo` → value pairs out of `<meta property="og:..." content="...">`.
/// Returns lowercased keys with leading `og:` stripped.
fn parse_og_tags(html: &str) -> std::collections::HashMap<String, String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    let mut out = std::collections::HashMap::new();
    for c in re.captures_iter(html) {
        let k = c
            .get(1)
            .map(|m| m.as_str().to_lowercase())
            .unwrap_or_default();
        let v = c
            .get(2)
            .map(|m| html_decode(m.as_str()))
            .unwrap_or_default();
        out.entry(k).or_insert(v);
    }
    out
}

/// Extract the post body text from the embed page. LinkedIn renders it
/// inside `<p class="attributed-text-segment-list__content ...">{text}</p>`
/// where the inner content can include nested `<a>` tags for links.
fn parse_post_body(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(
            r#"(?s)<p[^>]+class="[^"]*attributed-text-segment-list__content[^"]*"[^>]*>(.*?)</p>"#,
        )
        .unwrap()
    });
    let inner = re.captures(html).and_then(|c| c.get(1))?.as_str();
    Some(strip_tags(inner).trim().to_string())
}

/// Author name lives in the `<title>` like:
/// "55 founding members are in… | Orc Dev"
/// The chunk after the final `|` is the author display name. Falls back
/// to the og:title minus the post body if there's no title.
fn parse_author(html: &str) -> Option<String> {
    static RE_TITLE: OnceLock<Regex> = OnceLock::new();
    let re = RE_TITLE.get_or_init(|| Regex::new(r"<title>([^<]+)</title>").unwrap());
    let title = re.captures(html).and_then(|c| c.get(1))?.as_str();
    title
        .rsplit_once('|')
        .map(|(_, name)| html_decode(name.trim()))
}

/// Replace the small set of HTML entities LinkedIn (and Instagram, etc.)
/// stuff into OG content attributes.
fn html_decode(s: &str) -> String {
    s.replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&#64;", "@")
        .replace("&#8226;", "•")
        .replace("&#8230;", "…")
}

/// Crude HTML tag stripper for the post body. Preserves text inside
/// nested anchors so URLs don't disappear, and collapses runs of
/// whitespace introduced by line wrapping.
fn strip_tags(html: &str) -> String {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"<[^>]+>").unwrap());
    let no_tags = re.replace_all(html, "").to_string();
    html_decode(&no_tags)
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_li_post_urls() {
        assert!(matches(
            "https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/"
        ));
        assert!(matches(
            "https://www.linkedin.com/feed/update/urn:li:activity:7452618583290892288"
        ));
        assert!(matches(
            "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c"
        ));
        assert!(!matches("https://www.linkedin.com/in/foo"));
        assert!(!matches("https://www.linkedin.com/"));
        assert!(!matches("https://example.com/feed/update/urn:li:share:1"));
    }

    #[test]
    fn extract_urn_from_share_url() {
        assert_eq!(
            extract_urn("https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/"),
            Some("urn:li:share:7452618582213144577".into())
        );
    }

    #[test]
    fn extract_urn_from_pretty_post_url() {
        assert_eq!(
            extract_urn(
                "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c/"
            ),
            Some("urn:li:activity:7452618583290892288".into())
        );
    }

    #[test]
    fn parse_og_tags_basic() {
        let html = r#"<meta property="og:image" content="https://x.com/a.png">
            <meta property="og:url" content="https://example.com/x">"#;
        let og = parse_og_tags(html);
        assert_eq!(
            og.get("image").map(String::as_str),
            Some("https://x.com/a.png")
        );
        assert_eq!(
            og.get("url").map(String::as_str),
            Some("https://example.com/x")
        );
    }

    #[test]
    fn parse_post_body_strips_anchor_tags() {
        let html = r#"<p class="attributed-text-segment-list__content text-color-text" dir="ltr">Hello <a href="x">link</a> world</p>"#;
        assert_eq!(parse_post_body(html).as_deref(), Some("Hello link world"));
    }

    #[test]
    fn html_decode_handles_common_entities() {
        assert_eq!(html_decode("AT&amp;T &#64;jane"), "AT&T @jane");
    }
}
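The `/posts/{slug}-{id}-{suffix}` id recovery that `extract_urn` implements with a regex can also be sketched with plain string operations, which makes the "second-to-last `-` separated chunk" rule easy to see. This is an illustrative standalone sketch, not the extractor's actual code path, and the URL below is the made-up example from the tests:

```rust
// Sketch of the /posts/ activity-id recovery using only the standard
// library. The real extractor uses a regex with the same shape rule:
// {slug}-{15+ digit id}-{alphanumeric suffix}.
fn activity_urn_from_posts_url(url: &str) -> Option<String> {
    let tail = url.split("/posts/").nth(1)?.trim_end_matches('/');
    // rsplitn(3, '-') yields [suffix, id, rest-of-slug] from the end.
    let parts: Vec<&str> = tail.rsplitn(3, '-').collect();
    if parts.len() < 3 {
        return None;
    }
    let id = parts[1];
    // Activity ids are long runs of digits; reject anything else.
    if id.len() >= 15 && id.chars().all(|c| c.is_ascii_digit()) {
        Some(format!("urn:li:activity:{id}"))
    } else {
        None
    }
}

fn main() {
    let urn = activity_urn_from_posts_url(
        "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c/",
    );
    // prints Some("urn:li:activity:7452618583290892288")
    println!("{urn:?}");
}
```

Splitting from the right is what makes this robust: the slug itself may contain `-`, but the suffix and id are always the last two chunks.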
502 crates/webclaw-fetch/src/extractors/mod.rs Normal file
@@ -0,0 +1,502 @@
//! Vertical extractors: site-specific parsers that return typed JSON
//! instead of generic markdown.
//!
//! Each extractor handles a single site or platform and exposes:
//! - `matches(url)` to claim ownership of a URL pattern
//! - `extract(client, url)` to fetch + parse into a typed JSON `Value`
//! - `INFO` static for the catalog (`/v1/extractors`)
//!
//! The dispatch in this module is a simple `match`-style chain rather than
//! a trait registry. With ~30 extractors that's still fast and avoids the
//! ceremony of dynamic dispatch. If we hit 50+ we'll revisit.
//!
//! Extractors prefer official JSON APIs over HTML scraping where one
//! exists (Reddit, HN/Algolia, PyPI, npm, GitHub, HuggingFace all have
//! one). HTML extraction is the fallback for sites that don't.

pub mod amazon_product;
pub mod arxiv;
pub mod crates_io;
pub mod dev_to;
pub mod docker_hub;
pub mod ebay_listing;
pub mod ecommerce_product;
pub mod etsy_listing;
pub mod github_issue;
pub mod github_pr;
pub mod github_release;
pub mod github_repo;
pub mod hackernews;
pub mod huggingface_dataset;
pub mod huggingface_model;
pub mod instagram_post;
pub mod instagram_profile;
pub mod linkedin_post;
pub mod npm;
pub mod pypi;
pub mod reddit;
pub mod shopify_collection;
pub mod shopify_product;
pub mod stackoverflow;
pub mod substack_post;
pub mod trustpilot_reviews;
pub mod woocommerce_product;
pub mod youtube_video;

use serde::Serialize;
use serde_json::Value;

use crate::error::FetchError;
use crate::fetcher::Fetcher;

/// Public catalog entry for `/v1/extractors`. Stable shape — clients
/// rely on `name` to pick the right `/v1/scrape/{name}` route.
#[derive(Debug, Clone, Serialize)]
pub struct ExtractorInfo {
    /// URL-safe identifier (`reddit`, `hackernews`, `github_repo`, ...).
    pub name: &'static str,
    /// Human-friendly display name.
    pub label: &'static str,
    /// One-line description of what the extractor returns.
    pub description: &'static str,
    /// Glob-ish URL pattern(s) the extractor claims. For documentation;
    /// the actual matching is done by the extractor's `matches` fn.
    pub url_patterns: &'static [&'static str],
}

/// Full catalog. Order is stable; new entries append.
pub fn list() -> Vec<ExtractorInfo> {
    vec![
        reddit::INFO,
        hackernews::INFO,
        github_repo::INFO,
        github_pr::INFO,
        github_issue::INFO,
        github_release::INFO,
        pypi::INFO,
        npm::INFO,
        crates_io::INFO,
        huggingface_model::INFO,
        huggingface_dataset::INFO,
        arxiv::INFO,
        docker_hub::INFO,
        dev_to::INFO,
        stackoverflow::INFO,
        substack_post::INFO,
        youtube_video::INFO,
        linkedin_post::INFO,
        instagram_post::INFO,
        instagram_profile::INFO,
        shopify_product::INFO,
        shopify_collection::INFO,
        ecommerce_product::INFO,
        woocommerce_product::INFO,
        amazon_product::INFO,
        ebay_listing::INFO,
        etsy_listing::INFO,
        trustpilot_reviews::INFO,
    ]
}

/// Auto-detect mode: try every extractor's `matches`, return the first
/// one that claims the URL. Used by `/v1/scrape` when the caller doesn't
/// pick a vertical explicitly.
pub async fn dispatch_by_url(
    client: &dyn Fetcher,
    url: &str,
) -> Option<Result<(&'static str, Value), FetchError>> {
    if reddit::matches(url) {
        return Some(
            reddit::extract(client, url)
                .await
                .map(|v| (reddit::INFO.name, v)),
        );
    }
    if hackernews::matches(url) {
        return Some(
            hackernews::extract(client, url)
                .await
                .map(|v| (hackernews::INFO.name, v)),
        );
    }
    if github_repo::matches(url) {
        return Some(
            github_repo::extract(client, url)
                .await
                .map(|v| (github_repo::INFO.name, v)),
        );
    }
    if pypi::matches(url) {
        return Some(
            pypi::extract(client, url)
                .await
                .map(|v| (pypi::INFO.name, v)),
        );
    }
    if npm::matches(url) {
        return Some(npm::extract(client, url).await.map(|v| (npm::INFO.name, v)));
    }
    if github_pr::matches(url) {
        return Some(
            github_pr::extract(client, url)
                .await
                .map(|v| (github_pr::INFO.name, v)),
        );
    }
    if github_issue::matches(url) {
        return Some(
            github_issue::extract(client, url)
                .await
                .map(|v| (github_issue::INFO.name, v)),
        );
    }
    if github_release::matches(url) {
        return Some(
            github_release::extract(client, url)
                .await
                .map(|v| (github_release::INFO.name, v)),
        );
    }
    if crates_io::matches(url) {
        return Some(
            crates_io::extract(client, url)
                .await
                .map(|v| (crates_io::INFO.name, v)),
        );
    }
    if huggingface_model::matches(url) {
        return Some(
            huggingface_model::extract(client, url)
                .await
                .map(|v| (huggingface_model::INFO.name, v)),
        );
    }
    if huggingface_dataset::matches(url) {
        return Some(
            huggingface_dataset::extract(client, url)
                .await
                .map(|v| (huggingface_dataset::INFO.name, v)),
        );
    }
    if arxiv::matches(url) {
        return Some(
            arxiv::extract(client, url)
                .await
                .map(|v| (arxiv::INFO.name, v)),
        );
    }
    if docker_hub::matches(url) {
        return Some(
            docker_hub::extract(client, url)
                .await
                .map(|v| (docker_hub::INFO.name, v)),
        );
    }
    if dev_to::matches(url) {
        return Some(
            dev_to::extract(client, url)
                .await
                .map(|v| (dev_to::INFO.name, v)),
        );
    }
    if stackoverflow::matches(url) {
        return Some(
            stackoverflow::extract(client, url)
                .await
                .map(|v| (stackoverflow::INFO.name, v)),
        );
    }
    if linkedin_post::matches(url) {
        return Some(
            linkedin_post::extract(client, url)
                .await
                .map(|v| (linkedin_post::INFO.name, v)),
        );
    }
    if instagram_post::matches(url) {
        return Some(
            instagram_post::extract(client, url)
                .await
                .map(|v| (instagram_post::INFO.name, v)),
        );
    }
    if instagram_profile::matches(url) {
        return Some(
            instagram_profile::extract(client, url)
                .await
                .map(|v| (instagram_profile::INFO.name, v)),
        );
    }
    // Antibot-gated verticals with unique hosts: safe to auto-dispatch
    // because the matcher can't confuse the URL for anything else. The
    // extractor's smart_fetch_html path handles the blocked-without-
    // API-key case with a clear actionable error.
    if amazon_product::matches(url) {
        return Some(
            amazon_product::extract(client, url)
                .await
                .map(|v| (amazon_product::INFO.name, v)),
        );
    }
    if ebay_listing::matches(url) {
        return Some(
            ebay_listing::extract(client, url)
                .await
                .map(|v| (ebay_listing::INFO.name, v)),
        );
    }
    if etsy_listing::matches(url) {
        return Some(
            etsy_listing::extract(client, url)
                .await
                .map(|v| (etsy_listing::INFO.name, v)),
        );
    }
    if trustpilot_reviews::matches(url) {
        return Some(
            trustpilot_reviews::extract(client, url)
                .await
                .map(|v| (trustpilot_reviews::INFO.name, v)),
        );
    }
    if youtube_video::matches(url) {
        return Some(
            youtube_video::extract(client, url)
                .await
                .map(|v| (youtube_video::INFO.name, v)),
        );
    }
    // NOTE: shopify_product, shopify_collection, ecommerce_product,
    // woocommerce_product, and substack_post are intentionally NOT
    // in auto-dispatch. Their `matches()` functions are permissive
    // (any URL with `/products/`, `/product/`, `/p/`, etc.) and
    // claiming those generically would steal URLs from the default
    // `/v1/scrape` markdown flow. Callers opt in via
    // `/v1/scrape/shopify_product` or `/v1/scrape/ecommerce_product`.
    None
}

/// Explicit mode: caller picked the vertical (`POST /v1/scrape/reddit`).
/// We still validate that the URL plausibly belongs to that vertical so
/// users get a clear "wrong route" error instead of a confusing parse
/// failure deep in the extractor.
pub async fn dispatch_by_name(
    client: &dyn Fetcher,
    name: &str,
    url: &str,
) -> Result<Value, ExtractorDispatchError> {
    match name {
        n if n == reddit::INFO.name => {
            run_or_mismatch(reddit::matches(url), n, url, || {
                reddit::extract(client, url)
            })
            .await
        }
        n if n == hackernews::INFO.name => {
            run_or_mismatch(hackernews::matches(url), n, url, || {
                hackernews::extract(client, url)
            })
            .await
        }
        n if n == github_repo::INFO.name => {
            run_or_mismatch(github_repo::matches(url), n, url, || {
                github_repo::extract(client, url)
            })
            .await
        }
        n if n == pypi::INFO.name => {
            run_or_mismatch(pypi::matches(url), n, url, || pypi::extract(client, url)).await
        }
        n if n == npm::INFO.name => {
            run_or_mismatch(npm::matches(url), n, url, || npm::extract(client, url)).await
        }
        n if n == github_pr::INFO.name => {
            run_or_mismatch(github_pr::matches(url), n, url, || {
                github_pr::extract(client, url)
            })
            .await
        }
        n if n == github_issue::INFO.name => {
            run_or_mismatch(github_issue::matches(url), n, url, || {
                github_issue::extract(client, url)
            })
            .await
        }
        n if n == github_release::INFO.name => {
            run_or_mismatch(github_release::matches(url), n, url, || {
                github_release::extract(client, url)
            })
            .await
        }
        n if n == crates_io::INFO.name => {
            run_or_mismatch(crates_io::matches(url), n, url, || {
                crates_io::extract(client, url)
            })
            .await
        }
        n if n == huggingface_model::INFO.name => {
            run_or_mismatch(huggingface_model::matches(url), n, url, || {
                huggingface_model::extract(client, url)
            })
            .await
        }
        n if n == huggingface_dataset::INFO.name => {
            run_or_mismatch(huggingface_dataset::matches(url), n, url, || {
                huggingface_dataset::extract(client, url)
            })
            .await
        }
        n if n == arxiv::INFO.name => {
            run_or_mismatch(arxiv::matches(url), n, url, || arxiv::extract(client, url)).await
        }
        n if n == docker_hub::INFO.name => {
            run_or_mismatch(docker_hub::matches(url), n, url, || {
                docker_hub::extract(client, url)
            })
            .await
        }
        n if n == dev_to::INFO.name => {
            run_or_mismatch(dev_to::matches(url), n, url, || {
                dev_to::extract(client, url)
            })
            .await
        }
        n if n == stackoverflow::INFO.name => {
            run_or_mismatch(stackoverflow::matches(url), n, url, || {
                stackoverflow::extract(client, url)
            })
            .await
        }
        n if n == linkedin_post::INFO.name => {
            run_or_mismatch(linkedin_post::matches(url), n, url, || {
                linkedin_post::extract(client, url)
            })
            .await
        }
        n if n == instagram_post::INFO.name => {
            run_or_mismatch(instagram_post::matches(url), n, url, || {
                instagram_post::extract(client, url)
            })
            .await
        }
        n if n == instagram_profile::INFO.name => {
            run_or_mismatch(instagram_profile::matches(url), n, url, || {
                instagram_profile::extract(client, url)
            })
            .await
        }
        n if n == shopify_product::INFO.name => {
            run_or_mismatch(shopify_product::matches(url), n, url, || {
                shopify_product::extract(client, url)
            })
            .await
        }
        n if n == ecommerce_product::INFO.name => {
            run_or_mismatch(ecommerce_product::matches(url), n, url, || {
                ecommerce_product::extract(client, url)
            })
            .await
        }
        n if n == amazon_product::INFO.name => {
            run_or_mismatch(amazon_product::matches(url), n, url, || {
                amazon_product::extract(client, url)
            })
            .await
        }
        n if n == ebay_listing::INFO.name => {
            run_or_mismatch(ebay_listing::matches(url), n, url, || {
                ebay_listing::extract(client, url)
            })
            .await
        }
        n if n == etsy_listing::INFO.name => {
            run_or_mismatch(etsy_listing::matches(url), n, url, || {
                etsy_listing::extract(client, url)
            })
            .await
        }
        n if n == trustpilot_reviews::INFO.name => {
            run_or_mismatch(trustpilot_reviews::matches(url), n, url, || {
                trustpilot_reviews::extract(client, url)
            })
            .await
        }
        n if n == youtube_video::INFO.name => {
            run_or_mismatch(youtube_video::matches(url), n, url, || {
                youtube_video::extract(client, url)
            })
            .await
        }
        n if n == substack_post::INFO.name => {
            run_or_mismatch(substack_post::matches(url), n, url, || {
                substack_post::extract(client, url)
            })
            .await
        }
        n if n == shopify_collection::INFO.name => {
            run_or_mismatch(shopify_collection::matches(url), n, url, || {
                shopify_collection::extract(client, url)
            })
            .await
        }
        n if n == woocommerce_product::INFO.name => {
            run_or_mismatch(woocommerce_product::matches(url), n, url, || {
                woocommerce_product::extract(client, url)
            })
            .await
        }
        _ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())),
    }
}

/// Errors that the dispatcher itself raises (vs. errors from inside an
/// extractor, which come back wrapped in `Fetch`).
#[derive(Debug, thiserror::Error)]
pub enum ExtractorDispatchError {
    #[error("unknown vertical: '{0}'")]
    UnknownVertical(String),

    #[error("URL '{url}' does not match the '{vertical}' extractor")]
    UrlMismatch { vertical: String, url: String },

    #[error(transparent)]
    Fetch(#[from] FetchError),
}

/// Helper: when the caller explicitly picked a vertical but their URL
/// doesn't match it, return `UrlMismatch` instead of running the
/// extractor (which would just fail with a less-clear error).
async fn run_or_mismatch<F, Fut>(
    matches: bool,
    vertical: &str,
    url: &str,
    f: F,
) -> Result<Value, ExtractorDispatchError>
where
    F: FnOnce() -> Fut,
    Fut: std::future::Future<Output = Result<Value, FetchError>>,
{
    if !matches {
        return Err(ExtractorDispatchError::UrlMismatch {
            vertical: vertical.to_string(),
            url: url.to_string(),
        });
    }
    f().await.map_err(ExtractorDispatchError::Fetch)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn list_is_non_empty_and_unique() {
        let entries = list();
        assert!(!entries.is_empty());
        let mut names: Vec<_> = entries.iter().map(|e| e.name).collect();
        names.sort();
        let before = names.len();
        names.dedup();
        assert_eq!(before, names.len(), "extractor names must be unique");
    }
}
235 crates/webclaw-fetch/src/extractors/npm.rs Normal file
@@ -0,0 +1,235 @@
//! npm package structured extractor.
//!
//! Uses two npm-run APIs:
//! - `registry.npmjs.org/{name}` for full package metadata
//! - `api.npmjs.org/downloads/point/last-week/{name}` for usage signal
//!
//! The registry API returns the *full* document including every version
//! ever published, which can be tens of MB for popular packages
//! (`@types/node` etc). We strip down to the latest version's manifest
//! and a count of releases — full history would explode the response.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "npm",
    label: "npm package",
    description: "Returns package metadata: latest version manifest, dependencies, weekly downloads, license.",
    url_patterns: &["https://www.npmjs.com/package/{name}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "www.npmjs.com" && host != "npmjs.com" {
        return false;
    }
    url.contains("/package/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let name = parse_name(url)
        .ok_or_else(|| FetchError::Build(format!("npm: cannot parse name from '{url}'")))?;

    let registry_url = format!("https://registry.npmjs.org/{}", urlencode_segment(&name));
    let resp = client.fetch(&registry_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "npm: package '{name}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "npm registry returned status {}",
            resp.status
        )));
    }

    let pkg: PackageDoc = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("npm registry parse: {e}")))?;

    // Resolve "latest" to a concrete version.
    let latest_version = pkg
        .dist_tags
        .as_ref()
        .and_then(|t| t.get("latest"))
        .cloned()
        .or_else(|| pkg.versions.as_ref().and_then(|v| v.keys().last().cloned()));

    let latest_manifest = latest_version
        .as_deref()
        .and_then(|v| pkg.versions.as_ref().and_then(|m| m.get(v)));

    let release_count = pkg.versions.as_ref().map(|v| v.len()).unwrap_or(0);
    let latest_release_date = latest_version
        .as_deref()
        .and_then(|v| pkg.time.as_ref().and_then(|t| t.get(v).cloned()));

    // Best-effort weekly downloads. If the api.npmjs.org call fails we
    // surface `null` rather than failing the whole extractor — npm
    // sometimes 503s the downloads endpoint while the registry is up.
    let weekly_downloads = fetch_weekly_downloads(client, &name).await.ok();

    Ok(json!({
        "url": url,
        "name": pkg.name.clone().unwrap_or(name.clone()),
        "description": pkg.description,
        "latest_version": latest_version,
        "license": latest_manifest.and_then(|m| m.license.clone()),
        "homepage": pkg.homepage,
        "repository": pkg.repository.as_ref().and_then(|r| r.url.clone()),
        "dependencies": latest_manifest.and_then(|m| m.dependencies.clone()),
        "dev_dependencies": latest_manifest.and_then(|m| m.dev_dependencies.clone()),
        "peer_dependencies": latest_manifest.and_then(|m| m.peer_dependencies.clone()),
        "keywords": pkg.keywords,
        "maintainers": pkg.maintainers,
        "deprecated": latest_manifest.and_then(|m| m.deprecated.clone()),
        "release_count": release_count,
        "latest_release_date": latest_release_date,
        "weekly_downloads": weekly_downloads,
    }))
}

async fn fetch_weekly_downloads(client: &dyn Fetcher, name: &str) -> Result<i64, FetchError> {
    let url = format!(
        "https://api.npmjs.org/downloads/point/last-week/{}",
        urlencode_segment(name)
    );
    let resp = client.fetch(&url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "npm downloads api status {}",
            resp.status
        )));
    }
    let dl: Downloads = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("npm downloads parse: {e}")))?;
    Ok(dl.downloads)
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Extract the package name from an npmjs.com URL. Handles scoped packages
/// (`/package/@scope/name`) and trailing path segments (`/v/x.y.z`).
fn parse_name(url: &str) -> Option<String> {
    let after = url.split("/package/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let first = segs.next()?;
    if first.starts_with('@') {
        let second = segs.next()?;
        Some(format!("{first}/{second}"))
    } else {
        Some(first.to_string())
    }
}

/// `@scope/name` must encode the `/` for the registry path. Plain names
/// pass through untouched.
fn urlencode_segment(name: &str) -> String {
    name.replace('/', "%2F")
}

// ---------------------------------------------------------------------------
// Registry types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct PackageDoc {
    name: Option<String>,
    description: Option<String>,
    homepage: Option<serde_json::Value>, // sometimes string, sometimes object
    repository: Option<Repository>,
    keywords: Option<Vec<String>>,
    maintainers: Option<Vec<Maintainer>>,
    #[serde(rename = "dist-tags")]
    dist_tags: Option<std::collections::BTreeMap<String, String>>,
    versions: Option<std::collections::BTreeMap<String, VersionManifest>>,
    time: Option<std::collections::BTreeMap<String, String>>,
}

#[derive(Deserialize, Default, Clone)]
struct VersionManifest {
    license: Option<serde_json::Value>, // string or object
    dependencies: Option<std::collections::BTreeMap<String, String>>,
    #[serde(rename = "devDependencies")]
    dev_dependencies: Option<std::collections::BTreeMap<String, String>>,
    #[serde(rename = "peerDependencies")]
    peer_dependencies: Option<std::collections::BTreeMap<String, String>>,
    // `deprecated` is sometimes a bool and sometimes a string in the
    // registry. serde_json::Value covers both without failing the parse.
    deprecated: Option<serde_json::Value>,
}

#[derive(Deserialize)]
struct Repository {
    url: Option<String>,
}

#[derive(Deserialize, Clone)]
struct Maintainer {
    name: Option<String>,
    email: Option<String>,
}

impl serde::Serialize for Maintainer {
    fn serialize<S: serde::Serializer>(&self, s: S) -> Result<S::Ok, S::Error> {
        use serde::ser::SerializeMap;
        let mut m = s.serialize_map(Some(2))?;
        m.serialize_entry("name", &self.name)?;
        m.serialize_entry("email", &self.email)?;
        m.end()
    }
}

#[derive(Deserialize)]
struct Downloads {
    downloads: i64,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_npm_package_urls() {
        assert!(matches("https://www.npmjs.com/package/react"));
        assert!(matches("https://www.npmjs.com/package/@types/node"));
        assert!(matches("https://npmjs.com/package/lodash"));
        assert!(!matches("https://www.npmjs.com/"));
        assert!(!matches("https://example.com/package/foo"));
    }

    #[test]
    fn parse_name_handles_scoped_and_unscoped() {
        assert_eq!(
            parse_name("https://www.npmjs.com/package/react"),
            Some("react".into())
        );
        assert_eq!(
            parse_name("https://www.npmjs.com/package/@types/node"),
            Some("@types/node".into())
        );
        assert_eq!(
            parse_name("https://www.npmjs.com/package/lodash/v/4.17.21"),
            Some("lodash".into())
        );
    }

    #[test]
    fn urlencode_only_touches_scope_separator() {
        assert_eq!(urlencode_segment("react"), "react");
        assert_eq!(urlencode_segment("@types/node"), "@types%2Fnode");
    }
}
184 crates/webclaw-fetch/src/extractors/pypi.rs Normal file
@@ -0,0 +1,184 @@
//! PyPI package structured extractor.
//!
//! PyPI exposes a stable JSON API at `pypi.org/pypi/{name}/json` and
//! a versioned form at `pypi.org/pypi/{name}/{version}/json`. Both
//! return the full release info plus history. No auth, no rate limits
//! that we hit at normal usage.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "pypi",
    label: "PyPI package",
    description: "Returns package metadata: latest version, dependencies, license, release history.",
    url_patterns: &[
        "https://pypi.org/project/{name}/",
        "https://pypi.org/project/{name}/{version}/",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "pypi.org" && host != "www.pypi.org" {
        return false;
    }
    url.contains("/project/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (name, version) = parse_project(url).ok_or_else(|| {
        FetchError::Build(format!("pypi: cannot parse package name from '{url}'"))
    })?;

    let api_url = match &version {
        Some(v) => format!("https://pypi.org/pypi/{name}/{v}/json"),
        None => format!("https://pypi.org/pypi/{name}/json"),
    };
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "pypi: package '{name}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "pypi api returned status {}",
            resp.status
        )));
    }

    let pkg: PypiResponse = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("pypi parse: {e}")))?;

    let info = pkg.info;
    let release_count = pkg.releases.as_ref().map(|r| r.len()).unwrap_or(0);

    // Latest release date = max upload time across files in the latest version.
    let latest_release_date = pkg
        .releases
        .as_ref()
        .and_then(|map| info.version.as_deref().and_then(|v| map.get(v)))
        .and_then(|files| files.iter().filter_map(|f| f.upload_time.clone()).max());

    // Drop the long description from the JSON shape — it's frequently a 50KB
    // README and bloats responses. Callers who need it can hit /v1/scrape.
    Ok(json!({
        "url": url,
        "name": info.name,
        "version": info.version,
        "summary": info.summary,
        "homepage": info.home_page,
        "license": info.license,
        "license_classifier": pick_license_classifier(&info.classifiers),
        "author": info.author,
        "author_email": info.author_email,
        "maintainer": info.maintainer,
        "requires_python": info.requires_python,
        "requires_dist": info.requires_dist,
        "keywords": info.keywords,
        "classifiers": info.classifiers,
        "yanked": info.yanked,
        "yanked_reason": info.yanked_reason,
        "project_urls": info.project_urls,
        "release_count": release_count,
        "latest_release_date": latest_release_date,
    }))
}

/// PyPI puts the SPDX-ish license under classifiers like
/// `License :: OSI Approved :: Apache Software License`. Surface the most
/// specific one when the `license` field itself is empty/junk.
fn pick_license_classifier(classifiers: &Option<Vec<String>>) -> Option<String> {
    classifiers
        .as_ref()?
        .iter()
        .filter(|c| c.starts_with("License ::"))
        .max_by_key(|c| c.len())
        .cloned()
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_project(url: &str) -> Option<(String, Option<String>)> {
    let after = url.split("/project/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let name = segs.next()?.to_string();
    let version = segs.next().map(|v| v.to_string());
    Some((name, version))
}

// ---------------------------------------------------------------------------
// PyPI API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct PypiResponse {
    info: Info,
    releases: Option<std::collections::BTreeMap<String, Vec<File>>>,
}

#[derive(Deserialize)]
struct Info {
    name: Option<String>,
    version: Option<String>,
    summary: Option<String>,
    home_page: Option<String>,
    license: Option<String>,
    author: Option<String>,
    author_email: Option<String>,
    maintainer: Option<String>,
    requires_python: Option<String>,
    requires_dist: Option<Vec<String>>,
    keywords: Option<String>,
    classifiers: Option<Vec<String>>,
    yanked: Option<bool>,
    yanked_reason: Option<String>,
    project_urls: Option<std::collections::BTreeMap<String, String>>,
}

#[derive(Deserialize)]
struct File {
    upload_time: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_project_urls() {
        assert!(matches("https://pypi.org/project/requests/"));
        assert!(matches("https://pypi.org/project/numpy/1.26.0/"));
        assert!(!matches("https://pypi.org/"));
        assert!(!matches("https://example.com/project/foo"));
    }

    #[test]
    fn parse_project_pulls_name_and_version() {
        assert_eq!(
            parse_project("https://pypi.org/project/requests/"),
            Some(("requests".into(), None))
        );
        assert_eq!(
            parse_project("https://pypi.org/project/numpy/1.26.0/"),
            Some(("numpy".into(), Some("1.26.0".into())))
        );
        assert_eq!(
            parse_project("https://pypi.org/project/scikit-learn/?foo=bar"),
            Some(("scikit-learn".into(), None))
        );
    }
}
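For reference, the project-page to JSON-API mapping the extractor above relies on is simple enough to exercise standalone. This is an illustrative sketch, not crate code; `project_to_api` is a hypothetical helper name:

```rust
// Illustrative sketch of the pypi.org URL mapping used by the extractor:
//   /project/{name}/           -> /pypi/{name}/json
//   /project/{name}/{version}/ -> /pypi/{name}/{version}/json
fn project_to_api(name: &str, version: Option<&str>) -> String {
    match version {
        Some(v) => format!("https://pypi.org/pypi/{name}/{v}/json"),
        None => format!("https://pypi.org/pypi/{name}/json"),
    }
}

fn main() {
    println!("{}", project_to_api("requests", None));
    println!("{}", project_to_api("numpy", Some("1.26.0")));
}
```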
234 crates/webclaw-fetch/src/extractors/reddit.rs Normal file
@@ -0,0 +1,234 @@
//! Reddit structured extractor — returns the full post + comment tree
//! as typed JSON via Reddit's `.json` API.
//!
//! The same trick the markdown extractor in `crate::reddit` uses:
//! appending `.json` to any post URL returns the data the new SPA
//! frontend would load client-side. Zero antibot, zero JS rendering.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "reddit",
    label: "Reddit thread",
    description: "Returns post + nested comment tree with scores, authors, and timestamps.",
    url_patterns: &[
        "https://www.reddit.com/r/*/comments/*",
        "https://reddit.com/r/*/comments/*",
        "https://old.reddit.com/r/*/comments/*",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    let is_reddit_host = matches!(
        host,
        "reddit.com" | "www.reddit.com" | "old.reddit.com" | "np.reddit.com" | "new.reddit.com"
    );
    is_reddit_host && url.contains("/comments/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let json_url = build_json_url(url);
    let resp = client.fetch(&json_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "reddit api returned status {}",
            resp.status
        )));
    }

    let listings: Vec<Listing> = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("reddit json parse: {e}")))?;

    if listings.is_empty() {
        return Err(FetchError::BodyDecode("reddit response empty".into()));
    }

    // First listing = the post (single t3 child).
    let post = listings
        .first()
        .and_then(|l| l.data.children.first())
        .filter(|t| t.kind == "t3")
        .map(|t| post_json(&t.data))
        .unwrap_or(Value::Null);

    // Second listing = the comment tree.
    let comments: Vec<Value> = listings
        .get(1)
        .map(|l| l.data.children.iter().filter_map(comment_json).collect())
        .unwrap_or_default();

    Ok(json!({
        "url": url,
        "post": post,
        "comments": comments,
    }))
}

// ---------------------------------------------------------------------------
// JSON shapers
// ---------------------------------------------------------------------------

fn post_json(d: &ThingData) -> Value {
    json!({
        "id": d.id,
        "title": d.title,
        "author": d.author,
        "subreddit": d.subreddit_name_prefixed,
        "permalink": d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")),
        "url": d.url_overridden_by_dest,
        "is_self": d.is_self,
        "selftext": d.selftext,
        "score": d.score,
        "upvote_ratio": d.upvote_ratio,
        "num_comments": d.num_comments,
        "created_utc": d.created_utc,
        "link_flair_text": d.link_flair_text,
        "over_18": d.over_18,
        "spoiler": d.spoiler,
        "stickied": d.stickied,
        "locked": d.locked,
    })
}

/// Render a single comment + its reply tree. Returns `None` for non-t1
/// kinds (the trailing `more` placeholder Reddit injects at depth limits).
fn comment_json(thing: &Thing) -> Option<Value> {
    if thing.kind != "t1" {
        return None;
    }
    let d = &thing.data;
    let replies: Vec<Value> = match &d.replies {
        Some(Replies::Listing(l)) => l.data.children.iter().filter_map(comment_json).collect(),
        _ => Vec::new(),
    };
    Some(json!({
        "id": d.id,
        "author": d.author,
        "body": d.body,
        "score": d.score,
        "created_utc": d.created_utc,
        "is_submitter": d.is_submitter,
        "stickied": d.stickied,
        "depth": d.depth,
        "permalink": d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")),
        "replies": replies,
    }))
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Build the Reddit JSON URL. We keep the original host (`www.reddit.com`
/// or `old.reddit.com` as the caller gave us). Routing through
/// `old.reddit.com` unconditionally looks appealing but that host has
/// stricter UA-based blocking than `www.reddit.com`, while the main
/// host accepts our Chrome-fingerprinted client fine.
fn build_json_url(url: &str) -> String {
    let clean = url.split('?').next().unwrap_or(url).trim_end_matches('/');
    format!("{clean}.json?raw_json=1")
}

// ---------------------------------------------------------------------------
// Reddit JSON types — only fields we render. Everything else is dropped.
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Listing {
    data: ListingData,
}

#[derive(Deserialize)]
struct ListingData {
    children: Vec<Thing>,
}

#[derive(Deserialize)]
struct Thing {
    kind: String,
    data: ThingData,
}

#[derive(Deserialize, Default)]
struct ThingData {
    // post (t3)
    id: Option<String>,
    title: Option<String>,
    selftext: Option<String>,
    subreddit_name_prefixed: Option<String>,
    url_overridden_by_dest: Option<String>,
    is_self: Option<bool>,
    upvote_ratio: Option<f64>,
    num_comments: Option<i64>,
    over_18: Option<bool>,
    spoiler: Option<bool>,
    stickied: Option<bool>,
    locked: Option<bool>,
    link_flair_text: Option<String>,

    // comment (t1)
    author: Option<String>,
    body: Option<String>,
    score: Option<i64>,
    created_utc: Option<f64>,
    is_submitter: Option<bool>,
    depth: Option<i64>,
    permalink: Option<String>,

    // recursive
    replies: Option<Replies>,
}

#[derive(Deserialize)]
#[serde(untagged)]
enum Replies {
    Listing(Listing),
    #[allow(dead_code)]
    Empty(String),
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_reddit_post_urls() {
        assert!(matches(
            "https://www.reddit.com/r/rust/comments/abc123/some_title/"
        ));
        assert!(matches(
            "https://reddit.com/r/rust/comments/abc123/some_title"
        ));
        assert!(matches("https://old.reddit.com/r/rust/comments/abc123/x/"));
    }

    #[test]
    fn rejects_non_post_reddit_urls() {
        assert!(!matches("https://www.reddit.com/r/rust"));
        assert!(!matches("https://www.reddit.com/user/foo"));
        assert!(!matches("https://example.com/r/rust/comments/x"));
    }

    #[test]
    fn json_url_appends_suffix_and_drops_query() {
        assert_eq!(
            build_json_url("https://www.reddit.com/r/rust/comments/abc/x/?utm=foo"),
            "https://www.reddit.com/r/rust/comments/abc/x.json?raw_json=1"
        );
    }
}
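The `replies` field the extractor above emits is recursive: each comment carries an array of the same shape. A minimal standalone sketch of walking that tree (the `Comment` type and `count_comments` helper here are illustrative, not the crate's types):

```rust
// Illustrative mirror of the extractor's output shape: every comment
// nests its replies as a vec of the same type, so consumers walk it
// recursively rather than flattening.
struct Comment {
    body: String,
    replies: Vec<Comment>,
}

// Total number of comments in a tree, counting every nesting level.
fn count_comments(comments: &[Comment]) -> usize {
    comments.iter().map(|c| 1 + count_comments(&c.replies)).sum()
}

fn main() {
    let tree = vec![
        Comment {
            body: "top-level".into(),
            replies: vec![Comment { body: "child".into(), replies: vec![] }],
        },
        Comment { body: "second top-level".into(), replies: vec![] },
    ];
    println!("{}", count_comments(&tree));
}
```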
242 crates/webclaw-fetch/src/extractors/shopify_collection.rs Normal file
@@ -0,0 +1,242 @@
//! Shopify collection structured extractor.
//!
//! Every Shopify store exposes `/collections/{handle}.json` and
//! `/collections/{handle}/products.json` on the public surface. This
//! extractor hits `.json` (collection metadata) and falls through to
//! `/products.json` for the first page of products. Same caveat as
//! `shopify_product`: stores with Cloudflare in front of the shop
//! will 403 the public path.
//!
//! Explicit-call only (like `shopify_product`). `/collections/{slug}`
//! is a URL shape used by non-Shopify stores too, so auto-dispatch
//! would claim too many URLs.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "shopify_collection",
    label: "Shopify collection",
    description: "Returns collection metadata + first page of products (handle, title, vendor, price, available) on ANY Shopify store via /collections/{handle}.json + /products.json.",
    url_patterns: &[
        "https://{shop}/collections/{handle}",
        "https://{shop}.myshopify.com/collections/{handle}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
        return false;
    }
    url.contains("/collections/") && !url.ends_with("/collections/")
}

const NON_SHOPIFY_HOSTS: &[&str] = &[
    "amazon.com",
    "amazon.co.uk",
    "amazon.de",
    "ebay.com",
    "etsy.com",
    "walmart.com",
    "target.com",
    "aliexpress.com",
    "huggingface.co", // has /collections/ for models
    "github.com",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (coll_meta_url, coll_products_url) = build_json_urls(url);

    // Step 1: collection metadata. Shopify sometimes returns 200 on
    // missing collections; check the "collection" key below.
    let meta_resp = client.fetch(&coll_meta_url).await?;
    if meta_resp.status == 404 {
        return Err(FetchError::Build(format!(
            "shopify_collection: '{url}' not found"
        )));
    }
    if meta_resp.status == 403 {
        return Err(FetchError::Build(format!(
            "shopify_collection: {coll_meta_url} returned 403. The store has antibot in front of the .json endpoint. Use /v1/scrape/ecommerce_product or api.webclaw.io for this store."
        )));
    }
    if meta_resp.status != 200 {
        return Err(FetchError::Build(format!(
            "shopify returned status {} for {coll_meta_url}",
            meta_resp.status
        )));
    }

    let meta: MetaWrapper = serde_json::from_str(&meta_resp.html).map_err(|e| {
        FetchError::BodyDecode(format!(
            "shopify_collection: '{url}' didn't return Shopify JSON, likely not a Shopify store ({e})"
        ))
    })?;

    // Step 2: first page of products for this collection.
    let products = match client.fetch(&coll_products_url).await {
        Ok(r) if r.status == 200 => serde_json::from_str::<ProductsWrapper>(&r.html)
            .ok()
            .map(|pw| pw.products)
            .unwrap_or_default(),
        _ => Vec::new(),
    };

    let product_summaries: Vec<Value> = products
        .iter()
        .map(|p| {
            let first_variant = p.variants.first();
            json!({
                "id": p.id,
                "handle": p.handle,
                "title": p.title,
                "vendor": p.vendor,
                "product_type": p.product_type,
                "price": first_variant.and_then(|v| v.price.clone()),
                "compare_at_price": first_variant.and_then(|v| v.compare_at_price.clone()),
                "available": p.variants.iter().any(|v| v.available.unwrap_or(false)),
                "variant_count": p.variants.len(),
                "image": p.images.first().and_then(|i| i.src.clone()),
                "created_at": p.created_at,
                "updated_at": p.updated_at,
            })
        })
        .collect();

    let c = meta.collection;
    Ok(json!({
        "url": url,
        "meta_json_url": coll_meta_url,
        "products_json_url": coll_products_url,
        "collection_id": c.id,
        "handle": c.handle,
        "title": c.title,
        "description_html": c.body_html,
        "published_at": c.published_at,
        "updated_at": c.updated_at,
        "sort_order": c.sort_order,
        "products_in_page": product_summaries.len(),
        "products": product_summaries,
    }))
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Build `(collection.json, collection/products.json)` from a user URL.
fn build_json_urls(url: &str) -> (String, String) {
    let (path_part, _query_part) = match url.split_once('?') {
        Some((a, b)) => (a, Some(b)),
        None => (url, None),
    };
    let clean = path_part.trim_end_matches('/').trim_end_matches(".json");
    (
        format!("{clean}.json"),
        format!("{clean}/products.json?limit=50"),
    )
}

// ---------------------------------------------------------------------------
// Shopify collection + product JSON shapes (subsets)
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct MetaWrapper {
    collection: Collection,
}

#[derive(Deserialize)]
struct Collection {
    id: Option<i64>,
    handle: Option<String>,
    title: Option<String>,
    body_html: Option<String>,
    published_at: Option<String>,
    updated_at: Option<String>,
    sort_order: Option<String>,
}

#[derive(Deserialize)]
struct ProductsWrapper {
    #[serde(default)]
    products: Vec<ProductSummary>,
}

#[derive(Deserialize)]
struct ProductSummary {
    id: Option<i64>,
    handle: Option<String>,
    title: Option<String>,
    vendor: Option<String>,
    product_type: Option<String>,
    created_at: Option<String>,
    updated_at: Option<String>,
    #[serde(default)]
    variants: Vec<VariantSummary>,
    #[serde(default)]
    images: Vec<ImageSummary>,
}

#[derive(Deserialize)]
struct VariantSummary {
    price: Option<String>,
    compare_at_price: Option<String>,
    available: Option<bool>,
}

#[derive(Deserialize)]
struct ImageSummary {
    src: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_shopify_collection_urls() {
        assert!(matches("https://www.allbirds.com/collections/mens"));
        assert!(matches(
            "https://shop.example.com/collections/new-arrivals?page=2"
        ));
    }

    #[test]
    fn rejects_non_shopify() {
        assert!(!matches("https://github.com/collections/foo"));
        assert!(!matches("https://huggingface.co/collections/foo"));
        assert!(!matches("https://example.com/"));
        assert!(!matches("https://example.com/collections/"));
    }

    #[test]
    fn build_json_urls_derives_both_paths() {
        let (meta, products) = build_json_urls("https://shop.example.com/collections/mens");
        assert_eq!(meta, "https://shop.example.com/collections/mens.json");
        assert_eq!(
            products,
            "https://shop.example.com/collections/mens/products.json?limit=50"
        );
    }

    #[test]
    fn build_json_urls_handles_trailing_slash() {
        let (meta, _) = build_json_urls("https://shop.example.com/collections/mens/");
        assert_eq!(meta, "https://shop.example.com/collections/mens.json");
    }
}
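Both Shopify extractors summarize variant prices, which the storefront JSON serves as strings. A standalone sketch of that min/max summary, with `None` when no variant has a parseable price (`price_range` is an illustrative helper, not crate code):

```rust
// Illustrative: Shopify serves variant prices as strings ("29.99").
// Parse what parses, then summarize; `reduce` yields None on an empty
// iterator, which covers the "no parseable price" case cleanly.
fn price_range(prices: &[&str]) -> (Option<f64>, Option<f64>) {
    let parsed: Vec<f64> = prices.iter().filter_map(|s| s.parse().ok()).collect();
    (
        parsed.iter().copied().reduce(f64::min),
        parsed.iter().copied().reduce(f64::max),
    )
}

fn main() {
    println!("{:?}", price_range(&["29.99", "24.99", "n/a"]));
    println!("{:?}", price_range(&[]));
}
```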
318 crates/webclaw-fetch/src/extractors/shopify_product.rs Normal file
@@ -0,0 +1,318 @@
//! Shopify product structured extractor.
|
||||
//!
|
||||
//! Every Shopify store exposes a public JSON endpoint for each product
|
||||
//! by appending `.json` to the product URL:
|
||||
//!
|
||||
//! https://shop.example.com/products/cool-tshirt
|
||||
//! → https://shop.example.com/products/cool-tshirt.json
|
||||
//!
|
||||
//! There are ~4 million Shopify stores. The `.json` endpoint is
|
||||
//! undocumented but has been stable for 10+ years. When a store puts
|
||||
//! Cloudflare / antibot in front of the shop, this path can 403 just
|
||||
//! like any other — for those cases the caller should fall back to
|
||||
//! `ecommerce_product` (JSON-LD) or the cloud tier.
|
||||
//!
|
||||
//! This extractor is **explicit-call only** — it is NOT auto-dispatched
|
||||
//! from `/v1/scrape` because we cannot tell ahead of time whether an
|
||||
//! arbitrary `/products/{slug}` URL is a Shopify store. Callers hit
|
||||
//! `/v1/scrape/shopify_product` when they know.
|
||||
|
||||
use serde::Deserialize;
|
||||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "shopify_product",
|
||||
label: "Shopify product",
|
||||
description: "Returns product metadata on ANY Shopify store via the public /products/{handle}.json endpoint: title, vendor, variants with prices + stock, images, options.",
|
||||
url_patterns: &[
|
||||
"https://{shop}/products/{handle}",
|
||||
"https://{shop}.myshopify.com/products/{handle}",
|
||||
],
|
||||
};
|
||||
|
||||
pub fn matches(url: &str) -> bool {
|
||||
// Any URL whose path contains /products/{something}. We do not
|
||||
// filter by host — Shopify powers custom-domain stores. The
|
||||
// extractor's /.json fallback is what confirms Shopify; `matches`
|
||||
// just says "this is a plausible shape." Still reject obviously
|
||||
// non-Shopify known hosts to save a failed request.
|
||||
let host = host_of(url);
|
||||
if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
|
||||
return false;
|
||||
}
|
||||
url.contains("/products/") && !url.ends_with("/products/")
|
||||
}
|
||||
|
||||
/// Hosts we know are not Shopify — reject so we don't burn a request.
|
||||
const NON_SHOPIFY_HOSTS: &[&str] = &[
|
||||
"amazon.com",
|
||||
"amazon.co.uk",
|
||||
"amazon.de",
|
||||
"amazon.fr",
|
||||
"amazon.it",
|
||||
"ebay.com",
|
||||
"etsy.com",
|
||||
"walmart.com",
|
||||
"target.com",
|
||||
"aliexpress.com",
|
||||
"bestbuy.com",
|
||||
"wayfair.com",
|
||||
"homedepot.com",
|
||||
"github.com", // /products is a marketing page
|
||||
];
|
||||
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let json_url = build_json_url(url);
|
||||
let resp = client.fetch(&json_url).await?;
|
||||
if resp.status == 404 {
|
||||
return Err(FetchError::Build(format!(
|
||||
"shopify_product: '{url}' not found (got 404 from {json_url})"
|
||||
)));
|
||||
}
|
||||
if resp.status == 403 {
|
||||
return Err(FetchError::Build(format!(
|
||||
"shopify_product: {json_url} returned 403 — the store has antibot in front of the .json endpoint. Try /v1/scrape/ecommerce_product for the HTML + JSON-LD fallback."
|
||||
)));
|
||||
}
|
||||
if resp.status != 200 {
|
||||
return Err(FetchError::Build(format!(
|
||||
"shopify returned status {} for {json_url}",
|
||||
resp.status
|
||||
)));
|
||||
}
|
||||
|
||||
let body: Wrapper = serde_json::from_str(&resp.html).map_err(|e| {
|
||||
FetchError::BodyDecode(format!(
|
||||
"shopify_product: '{url}' didn't return Shopify JSON — likely not a Shopify store ({e})"
|
||||
))
|
||||
})?;
|
||||
let p = body.product;
|
||||
|
||||
let variants: Vec<Value> = p
|
||||
.variants
|
||||
.iter()
|
||||
.map(|v| {
|
||||
json!({
|
||||
"id": v.id,
|
||||
"title": v.title,
|
||||
"sku": v.sku,
|
||||
"barcode": v.barcode,
|
||||
"price": v.price,
|
||||
"compare_at_price": v.compare_at_price,
|
||||
"available": v.available,
|
||||
"inventory_quantity": v.inventory_quantity,
|
||||
"position": v.position,
|
||||
"weight": v.weight,
|
||||
"weight_unit": v.weight_unit,
|
||||
"requires_shipping": v.requires_shipping,
|
||||
"taxable": v.taxable,
|
||||
"option1": v.option1,
|
||||
"option2": v.option2,
|
||||
"option3": v.option3,
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
let images: Vec<Value> = p
|
||||
.images
|
||||
.iter()
|
||||
.map(|i| {
|
||||
json!({
|
||||
"src": i.src,
|
||||
"width": i.width,
|
||||
"height": i.height,
|
||||
"position": i.position,
|
||||
"alt": i.alt,
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
let options: Vec<Value> = p
|
||||
.options
|
||||
.iter()
|
||||
.map(|o| json!({"name": o.name, "values": o.values, "position": o.position}))
|
||||
.collect();
|
||||
|
||||
// Price range + availability summary across variants (the shape
|
||||
// agents typically want without walking the variants array).
|
||||
let prices: Vec<f64> = p
|
||||
.variants
|
||||
.iter()
|
||||
.filter_map(|v| v.price.as_deref().and_then(|s| s.parse::<f64>().ok()))
|
||||
.collect();
|
||||
let any_available = p.variants.iter().any(|v| v.available.unwrap_or(false));
|
||||
|
||||
Ok(json!({
|
||||
"url": url,
|
||||
"json_url": json_url,
|
||||
"product_id": p.id,
|
||||
"handle": p.handle,
|
||||
"title": p.title,
|
||||
"vendor": p.vendor,
|
||||
"product_type": p.product_type,
|
||||
"tags": p.tags,
|
||||
"description_html":p.body_html,
|
||||
"published_at": p.published_at,
|
||||
"created_at": p.created_at,
|
||||
"updated_at": p.updated_at,
|
||||
"variant_count": variants.len(),
|
||||
"image_count": images.len(),
|
||||
"any_available": any_available,
|
||||
"price_min": prices.iter().cloned().fold(f64::INFINITY, f64::min).is_finite().then(|| prices.iter().cloned().fold(f64::INFINITY, f64::min)),
|
||||
"price_max": prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max).is_finite().then(|| prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max)),
|
||||
"variants": variants,
|
||||
"images": images,
|
||||
"options": options,
|
||||
}))
|
||||
}
|
||||
|
||||
/// Build the .json path from a product URL. Handles pre-.jsoned URLs,
|
||||
/// trailing slashes, and query strings.
|
||||
fn build_json_url(url: &str) -> String {
|
||||
let (path_part, query_part) = match url.split_once('?') {
|
||||
Some((a, b)) => (a, Some(b)),
|
||||
None => (url, None),
|
||||
};
|
||||
let clean = path_part.trim_end_matches('/');
|
||||
let with_json = if clean.ends_with(".json") {
|
||||
clean.to_string()
|
||||
} else {
|
||||
format!("{clean}.json")
|
||||
};
|
||||
match query_part {
|
||||
Some(q) => format!("{with_json}?{q}"),
|
||||
None => with_json,
|
||||
}
|
||||
}
|
||||
|
||||
fn host_of(url: &str) -> &str {
|
||||
url.split("://")
|
||||
.nth(1)
|
||||
.unwrap_or(url)
|
||||
.split('/')
|
||||
.next()
|
||||
.unwrap_or("")
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
// Shopify product JSON shape (a subset of the full response)
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Wrapper {
    product: Product,
}

#[derive(Deserialize)]
struct Product {
    id: Option<i64>,
    title: Option<String>,
    handle: Option<String>,
    vendor: Option<String>,
    product_type: Option<String>,
    body_html: Option<String>,
    published_at: Option<String>,
    created_at: Option<String>,
    updated_at: Option<String>,
    #[serde(default)]
    tags: serde_json::Value, // array OR comma-joined string depending on store
    #[serde(default)]
    variants: Vec<Variant>,
    #[serde(default)]
    images: Vec<Image>,
    #[serde(default)]
    options: Vec<Option_>,
}

#[derive(Deserialize)]
struct Variant {
    id: Option<i64>,
    title: Option<String>,
    sku: Option<String>,
    barcode: Option<String>,
    price: Option<String>,
    compare_at_price: Option<String>,
    available: Option<bool>,
    inventory_quantity: Option<i64>,
    position: Option<i64>,
    weight: Option<f64>,
    weight_unit: Option<String>,
    requires_shipping: Option<bool>,
    taxable: Option<bool>,
    option1: Option<String>,
    option2: Option<String>,
    option3: Option<String>,
}

#[derive(Deserialize)]
struct Image {
    src: Option<String>,
    width: Option<i64>,
    height: Option<i64>,
    position: Option<i64>,
    alt: Option<String>,
}

#[derive(Deserialize)]
#[serde(rename_all = "lowercase")]
struct Option_ {
    name: Option<String>,
    position: Option<i64>,
    #[serde(default)]
    values: Vec<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_plausible_shopify_urls() {
        assert!(matches(
            "https://www.allbirds.com/products/mens-tree-runners"
        ));
        assert!(matches(
            "https://shop.example.com/products/cool-tshirt?variant=123"
        ));
        assert!(matches("https://somestore.myshopify.com/products/thing-1"));
    }

    #[test]
    fn rejects_known_non_shopify() {
        assert!(!matches("https://www.amazon.com/dp/B0C123"));
        assert!(!matches("https://www.etsy.com/listing/12345/foo"));
        assert!(!matches("https://www.amazon.co.uk/products/thing"));
        assert!(!matches("https://github.com/products"));
    }

    #[test]
    fn rejects_non_product_urls() {
        assert!(!matches("https://example.com/"));
        assert!(!matches("https://example.com/products/"));
        assert!(!matches("https://example.com/collections/all"));
    }

    #[test]
    fn build_json_url_handles_slash_and_query() {
        assert_eq!(
            build_json_url("https://shop.example.com/products/foo"),
            "https://shop.example.com/products/foo.json"
        );
        assert_eq!(
            build_json_url("https://shop.example.com/products/foo/"),
            "https://shop.example.com/products/foo.json"
        );
        assert_eq!(
            build_json_url("https://shop.example.com/products/foo?variant=123"),
            "https://shop.example.com/products/foo.json?variant=123"
        );
        assert_eq!(
            build_json_url("https://shop.example.com/products/foo.json"),
            "https://shop.example.com/products/foo.json"
        );
    }
}

216  crates/webclaw-fetch/src/extractors/stackoverflow.rs  Normal file

@@ -0,0 +1,216 @@
//! Stack Overflow Q&A structured extractor.
//!
//! Uses the Stack Exchange API at `api.stackexchange.com/2.3/questions/{id}`
//! with `site=stackoverflow`. Two calls: one for the question, one for
//! its answers. Both come pre-filtered to include the rendered HTML body
//! so we don't re-parse the question page itself.
//!
//! Anonymous access caps at 300 requests per IP per day. Production
//! cloud should set `STACKAPPS_KEY` to lift the cap to 10,000/day, but
//! we don't require it to work out of the box.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "stackoverflow",
    label: "Stack Overflow Q&A",
    description: "Returns question + answers: title, body, tags, votes, accepted answer, top answers.",
    url_patterns: &["https://stackoverflow.com/questions/{id}/{slug}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "stackoverflow.com" && host != "www.stackoverflow.com" {
        return false;
    }
    parse_question_id(url).is_some()
}

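The quota note above can be made concrete. A hypothetical sketch of how an optional StackApps key would be appended to the question endpoint (the extractor in this file builds the URL without a `key` parameter; the name `question_api_url` is illustrative):

```rust
// Hypothetical: append the StackApps key when present to lift the
// anonymous 300/day quota to 10,000/day. The base URL matches the one
// used by the extractor; the key handling is a sketch.
fn question_api_url(id: u64, key: Option<&str>) -> String {
    let mut u = format!(
        "https://api.stackexchange.com/2.3/questions/{id}?site=stackoverflow&filter=withbody"
    );
    if let Some(k) = key {
        u.push_str("&key=");
        u.push_str(k);
    }
    u
}
```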
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let id = parse_question_id(url).ok_or_else(|| {
        FetchError::Build(format!(
            "stackoverflow: cannot parse question id from '{url}'"
        ))
    })?;

    // Filter `withbody` includes the rendered HTML body for both questions
    // and answers. Stack Exchange's filter system is documented at
    // api.stackexchange.com/docs/filters.
    let q_url = format!(
        "https://api.stackexchange.com/2.3/questions/{id}?site=stackoverflow&filter=withbody"
    );
    let q_resp = client.fetch(&q_url).await?;
    if q_resp.status != 200 {
        return Err(FetchError::Build(format!(
            "stackexchange api returned status {}",
            q_resp.status
        )));
    }
    let q_body: QResponse = serde_json::from_str(&q_resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("stackoverflow q parse: {e}")))?;
    let q = q_body
        .items
        .first()
        .ok_or_else(|| FetchError::Build(format!("stackoverflow: question {id} not found")))?;

    let a_url = format!(
        "https://api.stackexchange.com/2.3/questions/{id}/answers?site=stackoverflow&filter=withbody&order=desc&sort=votes"
    );
    let a_resp = client.fetch(&a_url).await?;
    let answers = if a_resp.status == 200 {
        let a_body: AResponse = serde_json::from_str(&a_resp.html)
            .map_err(|e| FetchError::BodyDecode(format!("stackoverflow a parse: {e}")))?;
        a_body
            .items
            .iter()
            .map(|a| {
                json!({
                    "answer_id": a.answer_id,
                    "is_accepted": a.is_accepted,
                    "score": a.score,
                    "body": a.body,
                    "creation_date": a.creation_date,
                    "last_edit_date": a.last_edit_date,
                    "author": a.owner.as_ref().and_then(|o| o.display_name.clone()),
                    "author_rep": a.owner.as_ref().and_then(|o| o.reputation),
                })
            })
            .collect::<Vec<_>>()
    } else {
        Vec::new()
    };

    let accepted = answers
        .iter()
        .find(|a| {
            a.get("is_accepted")
                .and_then(|v| v.as_bool())
                .unwrap_or(false)
        })
        .cloned();

    Ok(json!({
        "url": url,
        "question_id": q.question_id,
        "title": q.title,
        "body": q.body,
        "tags": q.tags,
        "score": q.score,
        "view_count": q.view_count,
        "answer_count": q.answer_count,
        "is_answered": q.is_answered,
        "accepted_answer_id": q.accepted_answer_id,
        "creation_date": q.creation_date,
        "last_activity_date": q.last_activity_date,
        "author": q.owner.as_ref().and_then(|o| o.display_name.clone()),
        "author_rep": q.owner.as_ref().and_then(|o| o.reputation),
        "link": q.link,
        "accepted_answer": accepted,
        "top_answers": answers,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Parse question id from a URL of the form `/questions/{id}/{slug}`.
fn parse_question_id(url: &str) -> Option<u64> {
    let after = url.split("/questions/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let first = stripped.split('/').next()?;
    first.parse::<u64>().ok()
}

// ---------------------------------------------------------------------------
// Stack Exchange API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct QResponse {
    #[serde(default)]
    items: Vec<Question>,
}

#[derive(Deserialize)]
struct Question {
    question_id: Option<u64>,
    title: Option<String>,
    body: Option<String>,
    #[serde(default)]
    tags: Vec<String>,
    score: Option<i64>,
    view_count: Option<i64>,
    answer_count: Option<i64>,
    is_answered: Option<bool>,
    accepted_answer_id: Option<u64>,
    creation_date: Option<i64>,
    last_activity_date: Option<i64>,
    owner: Option<Owner>,
    link: Option<String>,
}

#[derive(Deserialize)]
struct AResponse {
    #[serde(default)]
    items: Vec<Answer>,
}

#[derive(Deserialize)]
struct Answer {
    answer_id: Option<u64>,
    is_accepted: Option<bool>,
    score: Option<i64>,
    body: Option<String>,
    creation_date: Option<i64>,
    last_edit_date: Option<i64>,
    owner: Option<Owner>,
}

#[derive(Deserialize)]
struct Owner {
    display_name: Option<String>,
    reputation: Option<i64>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_question_urls() {
        assert!(matches(
            "https://stackoverflow.com/questions/12345/some-slug"
        ));
        assert!(matches(
            "https://stackoverflow.com/questions/12345/some-slug?answertab=votes"
        ));
        assert!(!matches("https://stackoverflow.com/"));
        assert!(!matches("https://stackoverflow.com/questions"));
        assert!(!matches("https://stackoverflow.com/users/100"));
        assert!(!matches("https://example.com/questions/12345/x"));
    }

    #[test]
    fn parse_question_id_handles_slug_and_query() {
        assert_eq!(
            parse_question_id("https://stackoverflow.com/questions/12345/some-slug"),
            Some(12345)
        );
        assert_eq!(
            parse_question_id("https://stackoverflow.com/questions/12345/some-slug?tab=newest"),
            Some(12345)
        );
        assert_eq!(parse_question_id("https://stackoverflow.com/foo"), None);
    }
}

565  crates/webclaw-fetch/src/extractors/substack_post.rs  Normal file

@@ -0,0 +1,565 @@
//! Substack post extractor.
//!
//! Every Substack publication exposes `/api/v1/posts/{slug}`, which
//! returns the full post as JSON: body HTML, cover image, author,
//! publication info, reactions, paywall state. No auth on public
//! posts.
//!
//! Works on both `*.substack.com` subdomains and custom domains
//! (e.g. `simonwillison.net` uses Substack too). Detection is
//! "URL has `/p/{slug}`" because that's the canonical Substack post
//! path. Explicit-call only, because the `/p/{slug}` URL shape is
//! used by non-Substack sites too.
//!
//! ## Fallback
//!
//! The API endpoint is rate-limited aggressively on popular publications
//! and occasionally returns 403 on custom domains with Cloudflare in
//! front. When that happens we escalate to an HTML fetch (via
//! `smart_fetch_html`, so antibot-protected custom domains still work)
//! and extract OG tags + Article JSON-LD for a degraded-but-useful
//! payload. The response shape stays stable across both paths; a
//! `data_source` field tells the caller which branch ran.

use std::sync::OnceLock;

use regex::Regex;
use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "substack_post",
    label: "Substack post",
    description: "Returns post HTML, title, subtitle, author, publication, reactions, paywall status via the Substack public API. Falls back to OG + JSON-LD HTML parsing when the API is rate-limited.",
    url_patterns: &[
        "https://{pub}.substack.com/p/{slug}",
        "https://{custom-domain}/p/{slug}",
    ],
};

pub fn matches(url: &str) -> bool {
    if !(url.starts_with("http://") || url.starts_with("https://")) {
        return false;
    }
    url.contains("/p/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let slug = parse_slug(url).ok_or_else(|| {
        FetchError::Build(format!("substack_post: cannot parse slug from '{url}'"))
    })?;
    let host = host_of(url);
    if host.is_empty() {
        return Err(FetchError::Build(format!(
            "substack_post: empty host in '{url}'"
        )));
    }
    let scheme = if url.starts_with("http://") {
        "http"
    } else {
        "https"
    };
    let api_url = format!("{scheme}://{host}/api/v1/posts/{slug}");

    // 1. Try the public API. 200 = full payload; 404 = real miss; any
    //    other status hands off to the HTML fallback so a transient rate
    //    limit or a hardened custom domain doesn't fail the whole call.
    let resp = client.fetch(&api_url).await?;
    match resp.status {
        200 => match serde_json::from_str::<Post>(&resp.html) {
            Ok(p) => Ok(build_api_payload(url, &api_url, &slug, p)),
            Err(e) => {
                // API returned 200 but the body isn't the Post shape we
                // expect. Could be a custom-domain site that exposes
                // something else at /api/v1/posts/. Fall back to HTML
                // rather than hard-failing.
                html_fallback(
                    client,
                    url,
                    &api_url,
                    &slug,
                    Some(format!(
                        "api returned 200 but body was not Substack JSON ({e})"
                    )),
                )
                .await
            }
        },
        404 => Err(FetchError::Build(format!(
            "substack_post: '{slug}' not found on {host} (got 404). \
             If the publication isn't actually on Substack, use /v1/scrape instead."
        ))),
        _ => {
            // Rate limit, 403, 5xx, whatever: try HTML.
            let reason = format!("api returned status {} for {api_url}", resp.status);
            html_fallback(client, url, &api_url, &slug, Some(reason)).await
        }
    }
}

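The status dispatch in the extract function above reduces to a three-way routing policy. A minimal standalone sketch of that policy (the enum and function names are illustrative, not part of the crate):

```rust
// Sketch of the substack_post status policy: 200 parses the API payload,
// 404 is a hard "not a Substack post" error, and anything else (rate
// limit, 403 from Cloudflare, 5xx) escalates to the HTML fallback.
#[derive(Debug, PartialEq)]
enum Route {
    Api,          // 200: deserialize the Post payload
    NotFound,     // 404: the post really isn't there
    HtmlFallback, // everything else: degrade to OG + JSON-LD parsing
}

fn route(status: u16) -> Route {
    match status {
        200 => Route::Api,
        404 => Route::NotFound,
        _ => Route::HtmlFallback,
    }
}
```

Treating 404 as terminal while every other failure degrades keeps transient rate limits from surfacing as hard errors.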
// ---------------------------------------------------------------------------
// API-path payload builder
// ---------------------------------------------------------------------------

fn build_api_payload(url: &str, api_url: &str, slug: &str, p: Post) -> Value {
    json!({
        "url": url,
        "api_url": api_url,
        "data_source": "api",
        "id": p.id,
        "type": p.r#type,
        "slug": p.slug.or_else(|| Some(slug.to_string())),
        "title": p.title,
        "subtitle": p.subtitle,
        "description": p.description,
        "canonical_url": p.canonical_url,
        "post_date": p.post_date,
        "updated_at": p.updated_at,
        "audience": p.audience,
        "has_paywall": matches!(p.audience.as_deref(), Some("only_paid") | Some("founding")),
        "is_free_preview": p.is_free_preview,
        "cover_image": p.cover_image,
        "word_count": p.wordcount,
        "reactions": p.reactions,
        "comment_count": p.comment_count,
        "body_html": p.body_html,
        "body_text": p.truncated_body_text.or(p.body_text),
        "publication": json!({
            "id": p.publication.as_ref().and_then(|pub_| pub_.id),
            "name": p.publication.as_ref().and_then(|pub_| pub_.name.clone()),
            "subdomain": p.publication.as_ref().and_then(|pub_| pub_.subdomain.clone()),
            "custom_domain": p.publication.as_ref().and_then(|pub_| pub_.custom_domain.clone()),
        }),
        "authors": p.published_bylines.iter().map(|a| json!({
            "id": a.id,
            "name": a.name,
            "handle": a.handle,
            "photo": a.photo_url,
        })).collect::<Vec<_>>(),
    })
}

// ---------------------------------------------------------------------------
// HTML fallback: OG + Article JSON-LD
// ---------------------------------------------------------------------------

async fn html_fallback(
    client: &dyn Fetcher,
    url: &str,
    api_url: &str,
    slug: &str,
    fallback_reason: Option<String>,
) -> Result<Value, FetchError> {
    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse_html(&fetched.html, url, api_url, slug);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "fetch_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
        if let Some(reason) = fallback_reason {
            obj.insert("fallback_reason".into(), json!(reason));
        }
    }
    Ok(data)
}

/// Pure HTML parser. Pulls title, subtitle, description, cover image,
/// publish date, and authors from OG tags and Article JSON-LD. Kept
/// public so tests can exercise it with fixtures.
pub fn parse_html(html: &str, url: &str, api_url: &str, slug: &str) -> Value {
    let article = find_article_jsonld(html);

    let title = article
        .as_ref()
        .and_then(|v| get_text(v, "headline"))
        .or_else(|| og(html, "title"));
    let description = article
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description"));
    let cover_image = article
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| og(html, "image"));
    let post_date = article
        .as_ref()
        .and_then(|v| get_text(v, "datePublished"))
        .or_else(|| meta_property(html, "article:published_time"));
    let updated_at = article.as_ref().and_then(|v| get_text(v, "dateModified"));
    let publication_name = og(html, "site_name");
    let authors = article.as_ref().map(extract_authors).unwrap_or_default();

    json!({
        "url": url,
        "api_url": api_url,
        "data_source": "html_fallback",
        "slug": slug,
        "title": title,
        "subtitle": None::<String>,
        "description": description,
        "canonical_url": canonical_url(html).or_else(|| Some(url.to_string())),
        "post_date": post_date,
        "updated_at": updated_at,
        "cover_image": cover_image,
        "body_html": None::<String>,
        "body_text": None::<String>,
        "word_count": None::<i64>,
        "comment_count": None::<i64>,
        "reactions": Value::Null,
        "has_paywall": None::<bool>,
        "is_free_preview": None::<bool>,
        "publication": json!({
            "name": publication_name,
        }),
        "authors": authors,
    })
}

fn extract_authors(v: &Value) -> Vec<Value> {
    let Some(a) = v.get("author") else {
        return Vec::new();
    };
    let one = |val: &Value| -> Option<Value> {
        match val {
            Value::String(s) => Some(json!({"name": s})),
            Value::Object(_) => {
                let name = val.get("name").and_then(|n| n.as_str())?;
                let handle = val
                    .get("url")
                    .and_then(|u| u.as_str())
                    .and_then(handle_from_author_url);
                Some(json!({
                    "name": name,
                    "handle": handle,
                }))
            }
            _ => None,
        }
    };
    match a {
        Value::Array(arr) => arr.iter().filter_map(one).collect(),
        _ => one(a).into_iter().collect(),
    }
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_slug(url: &str) -> Option<String> {
    let after = url.split("/p/").nth(1)?;
    let stripped = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .split('/')
        .next()
        .unwrap_or("");
    if stripped.is_empty() {
        None
    } else {
        Some(stripped.to_string())
    }
}

/// Extract the Substack handle from an author URL like
/// `https://substack.com/@handle` or `https://pub.substack.com/@handle`.
///
/// Returns `None` when the URL has no `@` segment (e.g. a non-Substack
/// author page) so we don't synthesise a fake handle.
fn handle_from_author_url(u: &str) -> Option<String> {
    let after = u.rsplit_once('@').map(|(_, tail)| tail)?;
    let clean = after.split(['/', '?', '#']).next()?;
    if clean.is_empty() {
        None
    } else {
        Some(clean.to_string())
    }
}

// ---------------------------------------------------------------------------
// HTML tag helpers
// ---------------------------------------------------------------------------

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

/// Pull `<meta property="article:published_time" content="...">` and
/// similar structured meta tags.
fn meta_property(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn canonical_url(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE
        .get_or_init(|| Regex::new(r#"(?i)<link[^>]+rel="canonical"[^>]+href="([^"]+)""#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

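To make the meta-tag regexes above concrete: for a single well-formed tag in the `property`-before-`content` attribute order they expect, a std-only equivalent of the lookup looks like this (illustration only; the extractor uses the compiled `regex` patterns, and the function name here is hypothetical):

```rust
// Std-only sketch of what `og(html, prop)` matches for one well-formed
// tag: find `property="og:{prop}"`, then take the next `content="..."`
// value on that tag. No case-insensitivity or attribute reordering.
fn og_content(html: &str, prop: &str) -> Option<String> {
    let needle = format!("property=\"og:{prop}\"");
    let tag_start = html.find(&needle)?;
    let rest = &html[tag_start + needle.len()..];
    let c = rest.find("content=\"")? + "content=\"".len();
    let rest = &rest[c..];
    Some(rest[..rest.find('"')?].to_string())
}
```

The regex version additionally handles case variations and scans all matches, which is why the real helpers pay for a compiled pattern behind a `OnceLock`.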
// ---------------------------------------------------------------------------
// JSON-LD walkers (Article / NewsArticle)
// ---------------------------------------------------------------------------

fn find_article_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_article_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_article_in(v: &Value) -> Option<Value> {
    if is_article_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_article_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_article_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_article_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_art = |s: &str| {
        matches!(
            s,
            "Article" | "NewsArticle" | "BlogPosting" | "SocialMediaPosting"
        )
    };
    match t {
        Value::String(s) => is_art(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_art)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

// ---------------------------------------------------------------------------
// Substack API types (subset)
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Post {
    id: Option<i64>,
    r#type: Option<String>,
    slug: Option<String>,
    title: Option<String>,
    subtitle: Option<String>,
    description: Option<String>,
    canonical_url: Option<String>,
    post_date: Option<String>,
    updated_at: Option<String>,
    audience: Option<String>,
    is_free_preview: Option<bool>,
    cover_image: Option<String>,
    wordcount: Option<i64>,
    reactions: Option<serde_json::Value>,
    comment_count: Option<i64>,
    body_html: Option<String>,
    body_text: Option<String>,
    truncated_body_text: Option<String>,
    publication: Option<Publication>,
    #[serde(default, rename = "publishedBylines")]
    published_bylines: Vec<Byline>,
}

#[derive(Deserialize)]
struct Publication {
    id: Option<i64>,
    name: Option<String>,
    subdomain: Option<String>,
    custom_domain: Option<String>,
}

#[derive(Deserialize)]
struct Byline {
    id: Option<i64>,
    name: Option<String>,
    handle: Option<String>,
    photo_url: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_post_urls() {
        assert!(matches(
            "https://stratechery.substack.com/p/the-tech-letter"
        ));
        assert!(matches("https://simonwillison.net/p/2024-08-01-something"));
        assert!(!matches("https://example.com/"));
        assert!(!matches("ftp://example.com/p/foo"));
    }

    #[test]
    fn parse_slug_strips_query_and_trailing_slash() {
        assert_eq!(
            parse_slug("https://example.substack.com/p/my-post"),
            Some("my-post".into())
        );
        assert_eq!(
            parse_slug("https://example.substack.com/p/my-post/"),
            Some("my-post".into())
        );
        assert_eq!(
            parse_slug("https://example.substack.com/p/my-post?ref=123"),
            Some("my-post".into())
        );
    }

    #[test]
    fn parse_html_extracts_from_og_tags() {
        let html = r##"
            <html><head>
            <meta property="og:title" content="My Great Post">
            <meta property="og:description" content="A short summary.">
            <meta property="og:image" content="https://cdn.substack.com/cover.jpg">
            <meta property="og:site_name" content="My Publication">
            <meta property="article:published_time" content="2025-09-01T10:00:00Z">
            <link rel="canonical" href="https://mypub.substack.com/p/my-post">
            </head></html>"##;
        let v = parse_html(
            html,
            "https://mypub.substack.com/p/my-post",
            "https://mypub.substack.com/api/v1/posts/my-post",
            "my-post",
        );
        assert_eq!(v["data_source"], "html_fallback");
        assert_eq!(v["title"], "My Great Post");
        assert_eq!(v["description"], "A short summary.");
        assert_eq!(v["cover_image"], "https://cdn.substack.com/cover.jpg");
        assert_eq!(v["post_date"], "2025-09-01T10:00:00Z");
        assert_eq!(v["publication"]["name"], "My Publication");
        assert_eq!(v["canonical_url"], "https://mypub.substack.com/p/my-post");
    }

    #[test]
    fn parse_html_prefers_jsonld_when_present() {
        let html = r##"
            <html><head>
            <meta property="og:title" content="OG Title">
            <script type="application/ld+json">
            {"@context":"https://schema.org","@type":"NewsArticle",
             "headline":"JSON-LD Title",
             "description":"JSON-LD desc.",
             "image":"https://cdn.substack.com/hero.jpg",
             "datePublished":"2025-10-12T08:30:00Z",
             "dateModified":"2025-10-12T09:00:00Z",
             "author":[{"@type":"Person","name":"Alice Author","url":"https://substack.com/@alice"}]}
            </script>
            </head></html>"##;
        let v = parse_html(
            html,
            "https://example.com/p/a",
            "https://example.com/api/v1/posts/a",
            "a",
        );
        assert_eq!(v["title"], "JSON-LD Title");
        assert_eq!(v["description"], "JSON-LD desc.");
        assert_eq!(v["cover_image"], "https://cdn.substack.com/hero.jpg");
        assert_eq!(v["post_date"], "2025-10-12T08:30:00Z");
        assert_eq!(v["updated_at"], "2025-10-12T09:00:00Z");
        assert_eq!(v["authors"][0]["name"], "Alice Author");
        assert_eq!(v["authors"][0]["handle"], "alice");
    }

    #[test]
    fn handle_from_author_url_pulls_handle() {
        assert_eq!(
            handle_from_author_url("https://substack.com/@alice"),
            Some("alice".into())
        );
        assert_eq!(
            handle_from_author_url("https://mypub.substack.com/@bob/"),
            Some("bob".into())
        );
        assert_eq!(
            handle_from_author_url("https://not-substack.com/author/carol"),
            None
        );
    }
}

572  crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs  Normal file

@@ -0,0 +1,572 @@
//! Trustpilot company reviews extractor.
//!
//! `trustpilot.com/review/{domain}` pages are always behind AWS WAF's
//! "Verifying your connection" interstitial, so this extractor always
//! routes through [`cloud::smart_fetch_html`]. Without
//! `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` it returns a clean
//! "set API key" error; with one it escalates to api.webclaw.io.
//!
//! ## 2025 JSON-LD schema
//!
//! Trustpilot replaced the old single-Organization + aggregateRating
//! shape with three separate JSON-LD blocks:
//!
//! 1. An `Organization` block for Trustpilot the platform itself
//!    (company info, addresses, social profiles). Not the business
//!    being reviewed. We detect and skip this.
//! 2. A `Dataset` block with a csvw:Table mainEntity that contains the
//!    per-star-bucket counts for the target business plus a Total
//!    column. The Dataset's `name` is the business display name.
//! 3. An `aiSummary` + `aiSummaryReviews` block: the AI-generated
//!    summary of reviews plus the individual review objects
//!    (consumer, dates, rating, title, text, language, likes).
//!
//! In addition, `metadata.title` from the page head parses as
//! `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"` and
//! `metadata.description` carries `"{N} customers have already said"`.
//! We use both as extra signal when the Dataset block is absent.
use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "trustpilot_reviews",
    label: "Trustpilot reviews",
    description: "Returns business name, aggregate rating, star distribution, recent reviews, and the AI summary for a Trustpilot /review/{domain} page.",
    url_patterns: &["https://www.trustpilot.com/review/{domain}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !matches!(host, "www.trustpilot.com" | "trustpilot.com") {
        return false;
    }
    url.contains("/review/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse(&fetched.html, url)?;
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

/// Pure parser. Kept public so the cloud pipeline can reuse it on its
/// own fetched HTML without going through the async extract path.
pub fn parse(html: &str, url: &str) -> Result<Value, FetchError> {
    let domain = parse_review_domain(url).ok_or_else(|| {
        FetchError::Build(format!(
            "trustpilot_reviews: cannot parse /review/{{domain}} from '{url}'"
        ))
    })?;

    let blocks = webclaw_core::structured_data::extract_json_ld(html);

    // The business Dataset block has `about.@id` pointing to the target
    // domain's Organization (e.g. `.../Organization/anthropic.com`).
    let dataset = find_business_dataset(&blocks, &domain);

    // The aiSummary block: not typed (no `@type`), detect by key.
    let ai_block = find_ai_summary_block(&blocks);

    // Business name: Dataset > metadata.title regex > URL domain.
    let business_name = dataset
        .as_ref()
        .and_then(|d| get_string(d, "name"))
        .or_else(|| parse_name_from_og_title(html))
        .or_else(|| Some(domain.clone()));

    // Rating distribution from the csvw:Table columns. Each column has
    // a csvw:name like "1 star" / "Total" and a single cell with the
    // integer count.
    let distribution = dataset.as_ref().and_then(parse_star_distribution);
    let (rating_from_dist, total_from_dist) = distribution
        .as_ref()
        .map(compute_rating_stats)
        .unwrap_or((None, None));

    // Page-title / page-description fallbacks. OG title format:
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    let (rating_label, rating_from_og) = parse_rating_from_og_title(html);
    let total_from_desc = parse_review_count_from_og_description(html);

    // Recent reviews carried by the aiSummary block.
    let recent_reviews: Vec<Value> = ai_block
        .as_ref()
        .and_then(|a| a.get("aiSummaryReviews"))
        .and_then(|arr| arr.as_array())
        .map(|arr| arr.iter().map(extract_review).collect())
        .unwrap_or_default();

    let ai_summary = ai_block
        .as_ref()
        .and_then(|a| a.get("aiSummary"))
        .and_then(|s| s.get("summary"))
        .and_then(|t| t.as_str())
        .map(String::from);

    // Take the length before `json!` moves `recent_reviews`.
    let review_count_listed = recent_reviews.len();

    Ok(json!({
        "url": url,
        "domain": domain,
        "business_name": business_name,
        "rating_label": rating_label,
        "average_rating": rating_from_dist.or(rating_from_og),
        "review_count": total_from_dist.or(total_from_desc),
        "rating_distribution": distribution,
        "ai_summary": ai_summary,
        "recent_reviews": recent_reviews,
        "review_count_listed": review_count_listed,
    }))
}
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Pull the target domain from `trustpilot.com/review/{domain}`.
fn parse_review_domain(url: &str) -> Option<String> {
    let after = url.split("/review/").nth(1)?;
    let stripped = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .split('/')
        .next()
        .unwrap_or("");
    if stripped.is_empty() {
        None
    } else {
        Some(stripped.to_string())
    }
}

// ---------------------------------------------------------------------------
// JSON-LD block walkers
// ---------------------------------------------------------------------------

/// Find the Dataset block whose `about.@id` references the target
/// domain's Organization. Falls through to any Dataset if the @id
/// check doesn't match (Trustpilot occasionally varies the URL).
fn find_business_dataset(blocks: &[Value], domain: &str) -> Option<Value> {
    let mut fallback_any_dataset: Option<Value> = None;
    for block in blocks {
        for node in walk_graph(block) {
            if !is_dataset(&node) {
                continue;
            }
            if dataset_about_matches_domain(&node, domain) {
                return Some(node);
            }
            if fallback_any_dataset.is_none() {
                fallback_any_dataset = Some(node);
            }
        }
    }
    fallback_any_dataset
}

fn is_dataset(v: &Value) -> bool {
    v.get("@type")
        .and_then(|t| t.as_str())
        .is_some_and(|s| s == "Dataset")
}

fn dataset_about_matches_domain(v: &Value, domain: &str) -> bool {
    let about_id = v
        .get("about")
        .and_then(|a| a.get("@id"))
        .and_then(|id| id.as_str());
    let Some(id) = about_id else {
        return false;
    };
    id.contains(&format!("/Organization/{domain}"))
}

/// The aiSummary / aiSummaryReviews block has no `@type`, so match by
/// presence of the `aiSummary` key.
fn find_ai_summary_block(blocks: &[Value]) -> Option<Value> {
    for block in blocks {
        for node in walk_graph(block) {
            if node.get("aiSummary").is_some() {
                return Some(node);
            }
        }
    }
    None
}

/// Flatten each block (and its `@graph`) into a list of nodes we can
/// iterate over. Handles both `@graph: [ ... ]` (array) and
/// `@graph: { ... }` (single object) shapes; Trustpilot uses both.
fn walk_graph(block: &Value) -> Vec<Value> {
    let mut out = vec![block.clone()];
    if let Some(graph) = block.get("@graph") {
        match graph {
            Value::Array(arr) => out.extend(arr.iter().cloned()),
            Value::Object(_) => out.push(graph.clone()),
            _ => {}
        }
    }
    out
}
// ---------------------------------------------------------------------------
// Rating distribution (csvw:Table)
// ---------------------------------------------------------------------------

/// Parse the per-star distribution from the Dataset block. Returns
/// `{"one_star": {count, percent}, ..., "total": {count, percent}}`.
fn parse_star_distribution(dataset: &Value) -> Option<Value> {
    let columns = dataset
        .get("mainEntity")?
        .get("csvw:tableSchema")?
        .get("csvw:columns")?
        .as_array()?;
    let mut out = serde_json::Map::new();
    for col in columns {
        let name = col.get("csvw:name").and_then(|n| n.as_str())?;
        let cell = col.get("csvw:cells").and_then(|c| c.as_array())?.first()?;
        let count = cell
            .get("csvw:value")
            .and_then(|v| v.as_str())
            .and_then(|s| s.parse::<i64>().ok());
        let percent = cell
            .get("csvw:notes")
            .and_then(|n| n.as_array())
            .and_then(|arr| arr.first())
            .and_then(|s| s.as_str())
            .map(String::from);
        let key = normalise_star_key(name);
        out.insert(
            key,
            json!({
                "count": count,
                "percent": percent,
            }),
        );
    }
    if out.is_empty() {
        None
    } else {
        Some(Value::Object(out))
    }
}

/// "1 star" -> "one_star", "Total" -> "total". Easier to consume than
/// the raw "1 star" key, which fights YAML/JS property access.
fn normalise_star_key(name: &str) -> String {
    let trimmed = name.trim().to_lowercase();
    match trimmed.as_str() {
        "1 star" => "one_star".into(),
        "2 stars" => "two_stars".into(),
        "3 stars" => "three_stars".into(),
        "4 stars" => "four_stars".into(),
        "5 stars" => "five_stars".into(),
        "total" => "total".into(),
        other => other.replace(' ', "_"),
    }
}

/// Compute the average rating (weighted by bucket) and total count from
/// the parsed distribution. Returns `(average, total)`.
fn compute_rating_stats(distribution: &Value) -> (Option<String>, Option<i64>) {
    let Some(obj) = distribution.as_object() else {
        return (None, None);
    };
    let get_count = |key: &str| -> i64 {
        obj.get(key)
            .and_then(|v| v.get("count"))
            .and_then(|v| v.as_i64())
            .unwrap_or(0)
    };
    let one = get_count("one_star");
    let two = get_count("two_stars");
    let three = get_count("three_stars");
    let four = get_count("four_stars");
    let five = get_count("five_stars");
    let total_bucket = one + two + three + four + five;
    let total = obj
        .get("total")
        .and_then(|v| v.get("count"))
        .and_then(|v| v.as_i64())
        .unwrap_or(total_bucket);
    if total == 0 {
        return (None, Some(0));
    }
    let weighted = one + (two * 2) + (three * 3) + (four * 4) + (five * 5);
    let avg = weighted as f64 / total_bucket.max(1) as f64;
    // One decimal place, matching how Trustpilot displays the score.
    (Some(format!("{avg:.1}")), Some(total))
}
// ---------------------------------------------------------------------------
// OG / meta-tag fallbacks
// ---------------------------------------------------------------------------

/// Regex out the business name from the standard Trustpilot OG title
/// shape: `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"`.
fn parse_name_from_og_title(html: &str) -> Option<String> {
    let title = og(html, "title")?;
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"^(.+?)\s+is rated\b").unwrap());
    re.captures(&title)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// Pull the rating label (e.g. "Bad", "Excellent") and numeric value
/// from the OG title.
fn parse_rating_from_og_title(html: &str) -> (Option<String>, Option<String>) {
    let Some(title) = og(html, "title") else {
        return (None, None);
    };
    static RE: OnceLock<Regex> = OnceLock::new();
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    let re = RE.get_or_init(|| {
        Regex::new(r#"is rated\s*[\\"]+([^"\\]+)[\\"]+\s*with\s*([\d.]+)\s*/\s*5"#).unwrap()
    });
    let Some(caps) = re.captures(&title) else {
        return (None, None);
    };
    (
        caps.get(1).map(|m| m.as_str().trim().to_string()),
        caps.get(2).map(|m| m.as_str().to_string()),
    )
}

/// Parse "hear what 226 customers have already said" from the OG
/// description tag.
fn parse_review_count_from_og_description(html: &str) -> Option<i64> {
    let desc = og(html, "description")?;
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"(\d[\d,]*)\s+customers").unwrap());
    re.captures(&desc)?
        .get(1)?
        .as_str()
        .replace(',', "")
        .parse::<i64>()
        .ok()
}

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            let raw = c.get(2).map(|m| m.as_str())?;
            return Some(html_unescape(raw));
        }
    }
    None
}

/// Minimal HTML entity unescaping for the handful of entities the
/// synthesize_html escaper might produce. Keeps us off a heavier dep.
fn html_unescape(s: &str) -> String {
    s.replace("&quot;", "\"")
        .replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
}

fn get_string(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| x.as_str().map(String::from))
}
// ---------------------------------------------------------------------------
// Review extraction
// ---------------------------------------------------------------------------

fn extract_review(r: &Value) -> Value {
    json!({
        "id": r.get("id").and_then(|v| v.as_str()),
        "rating": r.get("rating").and_then(|v| v.as_i64()),
        "title": r.get("title").and_then(|v| v.as_str()),
        "text": r.get("text").and_then(|v| v.as_str()),
        "language": r.get("language").and_then(|v| v.as_str()),
        "source": r.get("source").and_then(|v| v.as_str()),
        "likes": r.get("likes").and_then(|v| v.as_i64()),
        "author": r.get("consumer").and_then(|c| c.get("displayName")).and_then(|v| v.as_str()),
        "author_country": r.get("consumer").and_then(|c| c.get("countryCode")).and_then(|v| v.as_str()),
        "author_review_count": r.get("consumer").and_then(|c| c.get("numberOfReviews")).and_then(|v| v.as_i64()),
        "verified": r.get("consumer").and_then(|c| c.get("isVerified")).and_then(|v| v.as_bool()),
        "date_experienced": r.get("dates").and_then(|d| d.get("experiencedDate")).and_then(|v| v.as_str()),
        "date_published": r.get("dates").and_then(|d| d.get("publishedDate")).and_then(|v| v.as_str()),
    })
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_trustpilot_review_urls() {
        assert!(matches("https://www.trustpilot.com/review/stripe.com"));
        assert!(matches("https://trustpilot.com/review/example.com"));
        assert!(!matches("https://www.trustpilot.com/"));
        assert!(!matches("https://example.com/review/foo"));
    }

    #[test]
    fn parse_review_domain_handles_query_and_slash() {
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com"),
            Some("anthropic.com".into())
        );
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com/"),
            Some("anthropic.com".into())
        );
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com?stars=5"),
            Some("anthropic.com".into())
        );
    }

    #[test]
    fn normalise_star_key_covers_all_buckets() {
        assert_eq!(normalise_star_key("1 star"), "one_star");
        assert_eq!(normalise_star_key("2 stars"), "two_stars");
        assert_eq!(normalise_star_key("5 stars"), "five_stars");
        assert_eq!(normalise_star_key("Total"), "total");
    }

    #[test]
    fn compute_rating_stats_weighted_average() {
        // 100 1-stars, 100 5-stars → avg 3.0 over 200 reviews.
        let dist = json!({
            "one_star":    { "count": 100, "percent": "50%" },
            "two_stars":   { "count": 0,   "percent": "0%" },
            "three_stars": { "count": 0,   "percent": "0%" },
            "four_stars":  { "count": 0,   "percent": "0%" },
            "five_stars":  { "count": 100, "percent": "50%" },
            "total":       { "count": 200, "percent": "100%" },
        });
        let (avg, total) = compute_rating_stats(&dist);
        assert_eq!(avg.as_deref(), Some("3.0"));
        assert_eq!(total, Some(200));
    }

    #[test]
    fn parse_og_title_extracts_name_and_rating() {
        let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">"#;
        assert_eq!(parse_name_from_og_title(html), Some("Anthropic".into()));
        let (label, rating) = parse_rating_from_og_title(html);
        assert_eq!(label.as_deref(), Some("Bad"));
        assert_eq!(rating.as_deref(), Some("1.5"));
    }

    #[test]
    fn parse_review_count_from_og_description_picks_number() {
        let html = r#"<meta property="og:description" content="Do you agree? Voice your opinion today and hear what 226 customers have already said.">"#;
        assert_eq!(parse_review_count_from_og_description(html), Some(226));
    }

    #[test]
    fn parse_full_fixture_assembles_all_fields() {
        let html = r##"<html><head>
<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">
<script type="application/ld+json">
{"@context":"https://schema.org","@graph":[
{"@id":"https://www.trustpilot.com/#/schema/Organization/1","@type":"Organization","name":"Trustpilot"}
]}
</script>
<script type="application/ld+json">
{"@context":["https://schema.org",{"csvw":"http://www.w3.org/ns/csvw#"}],
"@graph":{"@id":"https://www.trustpilot.com/#/schema/DataSet/anthropic.com/1",
"@type":"Dataset",
"about":{"@id":"https://www.trustpilot.com/#/schema/Organization/anthropic.com"},
"name":"Anthropic",
"mainEntity":{"@type":"csvw:Table","csvw:tableSchema":{"csvw:columns":[
{"csvw:name":"1 star","csvw:cells":[{"csvw:value":"196","csvw:notes":["87%"]}]},
{"csvw:name":"2 stars","csvw:cells":[{"csvw:value":"9","csvw:notes":["4%"]}]},
{"csvw:name":"3 stars","csvw:cells":[{"csvw:value":"5","csvw:notes":["2%"]}]},
{"csvw:name":"4 stars","csvw:cells":[{"csvw:value":"1","csvw:notes":["0%"]}]},
{"csvw:name":"5 stars","csvw:cells":[{"csvw:value":"15","csvw:notes":["7%"]}]},
{"csvw:name":"Total","csvw:cells":[{"csvw:value":"226","csvw:notes":["100%"]}]}
]}}}}
</script>
<script type="application/ld+json">
{"aiSummary":{"modelVersion":"2.0.0","summary":"Mixed reviews."},
"aiSummaryReviews":[
{"id":"abc","rating":1,"title":"Bad","text":"Didn't work.","language":"en",
"source":"Organic","likes":2,"consumer":{"displayName":"W.FRH","countryCode":"DE","numberOfReviews":69,"isVerified":false},
"dates":{"experiencedDate":"2026-01-05T00:00:00.000Z","publishedDate":"2026-01-05T16:29:31.000Z"}}]}
</script>
</head></html>"##;
        let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
        assert_eq!(v["domain"], "anthropic.com");
        assert_eq!(v["business_name"], "Anthropic");
        assert_eq!(v["rating_label"], "Bad");
        assert_eq!(v["review_count"], 226);
        assert_eq!(v["rating_distribution"]["one_star"]["count"], 196);
        assert_eq!(v["rating_distribution"]["total"]["count"], 226);
        assert_eq!(v["ai_summary"], "Mixed reviews.");
        assert_eq!(v["recent_reviews"].as_array().unwrap().len(), 1);
        assert_eq!(v["recent_reviews"][0]["author"], "W.FRH");
        assert_eq!(v["recent_reviews"][0]["rating"], 1);
        assert_eq!(v["recent_reviews"][0]["title"], "Bad");
    }

    #[test]
    fn parse_falls_back_to_og_when_no_jsonld() {
        let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">"#;
        let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
        assert_eq!(v["domain"], "anthropic.com");
        assert_eq!(v["business_name"], "Anthropic");
        assert_eq!(v["average_rating"], "1.5");
        assert_eq!(v["review_count"], 226);
        assert_eq!(v["rating_label"], "Bad");
    }

    #[test]
    fn parse_returns_ok_with_url_domain_when_nothing_else() {
        let v = parse(
            "<html><head></head></html>",
            "https://www.trustpilot.com/review/example.com",
        )
        .unwrap();
        assert_eq!(v["domain"], "example.com");
        assert_eq!(v["business_name"], "example.com");
    }
}
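The weighted-average step in `compute_rating_stats` is easy to sanity-check in isolation. Below is a minimal standalone sketch of just that arithmetic; the bucket counts are sample data mirroring the test fixture, and `weighted_average` is a hypothetical helper name, not part of the crate:

```rust
// Standalone sketch of the star-bucket weighted average used by
// compute_rating_stats. Bucket counts here are sample data only.
fn weighted_average(buckets: [i64; 5]) -> f64 {
    // buckets[i] holds the count of (i + 1)-star reviews.
    let total: i64 = buckets.iter().sum();
    let weighted: i64 = buckets
        .iter()
        .enumerate()
        .map(|(i, count)| (i as i64 + 1) * count)
        .sum();
    // Guard against an empty distribution, as the real code does
    // with total_bucket.max(1).
    weighted as f64 / total.max(1) as f64
}

fn main() {
    // 196 one-star ... 15 five-star, like the fixture's csvw columns.
    let avg = weighted_average([196, 9, 5, 1, 15]);
    println!("{avg:.1}");
}
```

Note the sketch averages over the sum of the five buckets, not the reported `Total` column, matching how the extractor divides by `total_bucket` even when the Dataset's Total disagrees.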
crates/webclaw-fetch/src/extractors/woocommerce_product.rs (new file, 237 lines)
@@ -0,0 +1,237 @@
//! WooCommerce product structured extractor.
//!
//! Targets WooCommerce's Store API: `/wp-json/wc/store/v1/products?slug={slug}`.
//! Roughly 30-50% of WooCommerce stores expose this endpoint publicly
//! (it's on by default, but common security plugins disable it).
//! When it's off, the server returns 404 at /wp-json. We surface a
//! clean error and point callers at `/v1/scrape/ecommerce_product`,
//! which works on any store with Schema.org JSON-LD.
//!
//! Explicit-call only. `/product/{slug}` is the default permalink for
//! WooCommerce, but custom stores use every variation imaginable, so
//! auto-dispatch is unreliable.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "woocommerce_product",
    label: "WooCommerce product",
    description: "Returns product data via the WooCommerce Store REST API (requires the /wp-json/wc/store endpoint to be enabled on the target store).",
    url_patterns: &[
        "https://{shop}/product/{slug}",
        "https://{shop}/shop/{slug}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host.is_empty() {
        return false;
    }
    // Permissive: WooCommerce stores use custom domains + custom
    // permalinks. The extractor's API probe is what confirms it's
    // really WooCommerce.
    url.contains("/product/")
        || url.contains("/shop/")
        || url.contains("/producto/") // common es locale
        || url.contains("/produit/") // common fr locale
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let slug = parse_slug(url).ok_or_else(|| {
        FetchError::Build(format!(
            "woocommerce_product: cannot parse slug from '{url}'"
        ))
    })?;
    let host = host_of(url);
    if host.is_empty() {
        return Err(FetchError::Build(format!(
            "woocommerce_product: empty host in '{url}'"
        )));
    }
    let scheme = if url.starts_with("http://") {
        "http"
    } else {
        "https"
    };
    let api_url = format!("{scheme}://{host}/wp-json/wc/store/v1/products?slug={slug}&per_page=1");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "woocommerce_product: {host} does not expose /wp-json/wc/store (404). \
             Use /v1/scrape/ecommerce_product for the JSON-LD fallback."
        )));
    }
    if resp.status == 401 || resp.status == 403 {
        return Err(FetchError::Build(format!(
            "woocommerce_product: {host} requires auth for /wp-json/wc/store ({}). \
             Use /v1/scrape/ecommerce_product for the public JSON-LD fallback.",
            resp.status
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "woocommerce api returned status {} for {api_url}",
            resp.status
        )));
    }

    let products: Vec<Product> = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("woocommerce parse: {e}")))?;
    let p = products.into_iter().next().ok_or_else(|| {
        FetchError::Build(format!(
            "woocommerce_product: no product found for slug '{slug}' on {host}"
        ))
    })?;

    let images: Vec<Value> = p
        .images
        .iter()
        .map(|i| json!({"src": i.src, "thumbnail": i.thumbnail, "alt": i.alt}))
        .collect();
    let variations_count = p.variations.as_ref().map(|v| v.len()).unwrap_or(0);

    Ok(json!({
        "url": url,
        "api_url": api_url,
        "product_id": p.id,
        "name": p.name,
        "slug": p.slug,
        "sku": p.sku,
        "permalink": p.permalink,
        "on_sale": p.on_sale,
        "in_stock": p.is_in_stock,
        "is_purchasable": p.is_purchasable,
        "price": p.prices.as_ref().and_then(|pr| pr.price.clone()),
        "regular_price": p.prices.as_ref().and_then(|pr| pr.regular_price.clone()),
        "sale_price": p.prices.as_ref().and_then(|pr| pr.sale_price.clone()),
        "currency": p.prices.as_ref().and_then(|pr| pr.currency_code.clone()),
        "currency_minor": p.prices.as_ref().and_then(|pr| pr.currency_minor_unit),
        "price_range": p.prices.as_ref().and_then(|pr| pr.price_range.clone()),
        "average_rating": p.average_rating,
        "review_count": p.review_count,
        "description": p.description,
        "short_description": p.short_description,
        "categories": p.categories.iter().filter_map(|c| c.name.clone()).collect::<Vec<_>>(),
        "tags": p.tags.iter().filter_map(|t| t.name.clone()).collect::<Vec<_>>(),
        "variation_count": variations_count,
        "image_count": images.len(),
        "images": images,
    }))
}
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Extract the product slug from common WooCommerce permalinks.
fn parse_slug(url: &str) -> Option<String> {
    for needle in ["/product/", "/shop/", "/producto/", "/produit/"] {
        if let Some(after) = url.split(needle).nth(1) {
            let stripped = after
                .split(['?', '#'])
                .next()?
                .trim_end_matches('/')
                .split('/')
                .next()
                .unwrap_or("");
            if !stripped.is_empty() {
                return Some(stripped.to_string());
            }
        }
    }
    None
}

// ---------------------------------------------------------------------------
// Store API types (subset of the full response)
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Product {
    id: Option<i64>,
    name: Option<String>,
    slug: Option<String>,
    sku: Option<String>,
    permalink: Option<String>,
    description: Option<String>,
    short_description: Option<String>,
    on_sale: Option<bool>,
    is_in_stock: Option<bool>,
    is_purchasable: Option<bool>,
    average_rating: Option<serde_json::Value>, // string or number
    review_count: Option<i64>,
    prices: Option<Prices>,
    #[serde(default)]
    categories: Vec<Term>,
    #[serde(default)]
    tags: Vec<Term>,
    #[serde(default)]
    images: Vec<Img>,
    variations: Option<Vec<serde_json::Value>>,
}

#[derive(Deserialize)]
struct Prices {
    price: Option<String>,
    regular_price: Option<String>,
    sale_price: Option<String>,
    currency_code: Option<String>,
    currency_minor_unit: Option<i64>,
    price_range: Option<serde_json::Value>,
}

#[derive(Deserialize)]
struct Term {
    name: Option<String>,
}

#[derive(Deserialize)]
struct Img {
    src: Option<String>,
    thumbnail: Option<String>,
    alt: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_common_permalinks() {
        assert!(matches("https://shop.example.com/product/cool-widget"));
        assert!(matches("https://shop.example.com/shop/cool-widget"));
        assert!(matches("https://tienda.example.com/producto/cosa"));
        assert!(matches("https://boutique.example.com/produit/chose"));
    }

    #[test]
    fn parse_slug_handles_locale_and_suffix() {
        assert_eq!(
            parse_slug("https://shop.example.com/product/cool-widget"),
            Some("cool-widget".into())
        );
        assert_eq!(
            parse_slug("https://shop.example.com/product/cool-widget/?attr=red"),
            Some("cool-widget".into())
        );
        assert_eq!(
            parse_slug("https://tienda.example.com/producto/cosa/"),
            Some("cosa".into())
        );
    }
}
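The Store API probe URL that `extract` assembles can be sketched on its own. This is a minimal standalone version of that `format!` call; the host and slug values are hypothetical examples, and `store_api_url` is an illustrative helper name, not a function in the crate:

```rust
// Sketch of the Store API probe URL built inside extract().
// Host and slug below are made-up example values.
fn store_api_url(scheme: &str, host: &str, slug: &str) -> String {
    // per_page=1: only the first matching product is consumed.
    format!("{scheme}://{host}/wp-json/wc/store/v1/products?slug={slug}&per_page=1")
}

fn main() {
    println!("{}", store_api_url("https", "shop.example.com", "cool-widget"));
    // → https://shop.example.com/wp-json/wc/store/v1/products?slug=cool-widget&per_page=1
}
```

A 200 from this endpoint with a non-empty JSON array is what confirms the target really is a WooCommerce store; the permissive URL `matches` check alone is not enough.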
crates/webclaw-fetch/src/extractors/youtube_video.rs (new file, 378 lines)
@@ -0,0 +1,378 @@
//! YouTube video structured extractor.
//!
//! YouTube embeds the full player configuration in a
//! `ytInitialPlayerResponse` JavaScript assignment at the top of
//! every `/watch`, `/shorts`, and `youtu.be` HTML page. We reuse the
//! core crate's already-proven regex + parse to surface typed JSON
//! from it: video id, title, author + channel id, view count,
//! duration, upload date, keywords, thumbnails, caption-track URLs.
//!
//! Auto-dispatched: the YouTube host is unique and the `v=` or `/shorts/`
//! shape is stable.
//!
//! ## Fallback
//!
//! `ytInitialPlayerResponse` is missing on EU-consent interstitials,
//! some live-stream pre-show pages, and age-gated videos. In those
//! cases we drop down to OG tags for `title`, `description`,
//! `thumbnail`, and `channel`, and return a `data_source:
//! "og_fallback"` payload so the caller can tell they got a degraded
//! shape (no view count, duration, or captions).

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "youtube_video",
    label: "YouTube video",
    description: "Returns video id, title, channel, view count, duration, upload date, thumbnails, keywords, and caption-track URLs. Falls back to OG metadata on consent / age-gate pages.",
    url_patterns: &[
        "https://www.youtube.com/watch?v={id}",
        "https://youtu.be/{id}",
        "https://www.youtube.com/shorts/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    webclaw_core::youtube::is_youtube_url(url)
        || url.contains("youtube.com/shorts/")
        || url.contains("youtube-nocookie.com/embed/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let video_id = parse_video_id(url).ok_or_else(|| {
        FetchError::Build(format!("youtube_video: cannot parse video id from '{url}'"))
    })?;

    // Always fetch the canonical /watch URL. /shorts/ and youtu.be
    // sometimes serve a thinner page without the player blob.
    let canonical = format!("https://www.youtube.com/watch?v={video_id}");
    let resp = client.fetch(&canonical).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "youtube returned status {} for {canonical}",
            resp.status
        )));
    }

    if let Some(player) = extract_player_response(&resp.html) {
        return Ok(build_player_payload(
            &player, &resp.html, url, &canonical, &video_id,
        ));
    }

    // No player blob. Fall back to OG tags so the call still returns
    // something useful for consent / age-gate pages.
    Ok(build_og_fallback(&resp.html, url, &canonical, &video_id))
}

// ---------------------------------------------------------------------------
// Player-blob path (rich payload)
// ---------------------------------------------------------------------------

fn build_player_payload(
    player: &Value,
    html: &str,
    url: &str,
    canonical: &str,
    video_id: &str,
) -> Value {
    let video_details = player.get("videoDetails");
    let microformat = player
        .get("microformat")
        .and_then(|m| m.get("playerMicroformatRenderer"));

    let thumbnails: Vec<Value> = video_details
        .and_then(|vd| vd.get("thumbnail"))
        .and_then(|t| t.get("thumbnails"))
        .and_then(|t| t.as_array())
        .cloned()
        .unwrap_or_default();

    let keywords: Vec<Value> = video_details
        .and_then(|vd| vd.get("keywords"))
        .and_then(|k| k.as_array())
        .cloned()
        .unwrap_or_default();

    let caption_tracks = webclaw_core::youtube::extract_caption_tracks(html);
    let captions: Vec<Value> = caption_tracks
        .iter()
        .map(|c| {
            json!({
                "url": c.url,
                "lang": c.lang,
                "name": c.name,
            })
        })
        .collect();

    json!({
        "url": url,
        "canonical_url": canonical,
        "data_source": "player_response",
        "video_id": video_id,
        "title": get_str(video_details, "title"),
        "description": get_str(video_details, "shortDescription"),
        "author": get_str(video_details, "author"),
        "channel_id": get_str(video_details, "channelId"),
        "channel_url": get_str(microformat, "ownerProfileUrl"),
        "view_count": get_int(video_details, "viewCount"),
        "length_seconds": get_int(video_details, "lengthSeconds"),
        "is_live": video_details.and_then(|vd| vd.get("isLiveContent")).and_then(|v| v.as_bool()),
        "is_private": video_details.and_then(|vd| vd.get("isPrivate")).and_then(|v| v.as_bool()),
        "is_unlisted": microformat.and_then(|m| m.get("isUnlisted")).and_then(|v| v.as_bool()),
        "allow_ratings": video_details.and_then(|vd| vd.get("allowRatings")).and_then(|v| v.as_bool()),
        "category": get_str(microformat, "category"),
        "upload_date": get_str(microformat, "uploadDate"),
        "publish_date": get_str(microformat, "publishDate"),
        "keywords": keywords,
        "thumbnails": thumbnails,
        "caption_tracks": captions,
    })
}

// ---------------------------------------------------------------------------
// OG fallback path (degraded payload)
// ---------------------------------------------------------------------------

fn build_og_fallback(html: &str, url: &str, canonical: &str, video_id: &str) -> Value {
    let title = og(html, "title");
    let description = og(html, "description");
    let thumbnail = og(html, "image");
    // YouTube sets `<meta name="channel_name" ...>` on some pages, but
    // OG-only pages reliably carry `og:video:tag` and the channel in
    // `<link itemprop="name">`. We keep this lean: just what's stable.
    let channel = meta_name(html, "author");

    json!({
        "url": url,
        "canonical_url": canonical,
        "data_source": "og_fallback",
        "video_id": video_id,
        "title": title,
        "description": description,
        "author": channel,
        // OG path: these are null so the caller doesn't have to guess.
        "channel_id": None::<String>,
        "channel_url": None::<String>,
        "view_count": None::<i64>,
        "length_seconds": None::<i64>,
        "is_live": None::<bool>,
        "is_private": None::<bool>,
        "is_unlisted": None::<bool>,
        "allow_ratings": None::<bool>,
        "category": None::<String>,
        "upload_date": None::<String>,
        "publish_date": None::<String>,
        "keywords": Vec::<Value>::new(),
        "thumbnails": thumbnail.as_ref().map(|t| vec![json!({"url": t})]).unwrap_or_default(),
        "caption_tracks": Vec::<Value>::new(),
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn parse_video_id(url: &str) -> Option<String> {
    // youtu.be/{id}
    if let Some(after) = url.split("youtu.be/").nth(1) {
        let id = after
            .split(['?', '#', '/'])
            .next()
            .unwrap_or("")
            .trim_end_matches('/');
        if !id.is_empty() {
            return Some(id.to_string());
        }
    }
    // youtube.com/shorts/{id}
    if let Some(after) = url.split("youtube.com/shorts/").nth(1) {
        let id = after
            .split(['?', '#', '/'])
            .next()
            .unwrap_or("")
            .trim_end_matches('/');
        if !id.is_empty() {
            return Some(id.to_string());
        }
    }
    // youtube-nocookie.com/embed/{id}
    if let Some(after) = url.split("/embed/").nth(1) {
        let id = after
            .split(['?', '#', '/'])
            .next()
            .unwrap_or("")
            .trim_end_matches('/');
        if !id.is_empty() {
            return Some(id.to_string());
        }
    }
    // youtube.com/watch?v={id} (also matches youtube.com/watch?foo=bar&v={id})
    if let Some(q) = url.split_once('?').map(|(_, q)| q)
        && let Some(id) = q
            .split('&')
            .find_map(|p| p.strip_prefix("v=").map(|v| v.to_string()))
    {
        let id = id.split(['#', '/']).next().unwrap_or(&id).to_string();
        if !id.is_empty() {
            return Some(id);
        }
    }
    None
}

// ---------------------------------------------------------------------------
// Player-response parsing
// ---------------------------------------------------------------------------

fn extract_player_response(html: &str) -> Option<Value> {
    // Same regex as webclaw_core::youtube. Duplicated here because
    // core's regex is module-private. Kept in lockstep; changes are
    // rare and we cover with tests in both places.
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE
        .get_or_init(|| Regex::new(r"var\s+ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;").unwrap());
    let json_str = re.captures(html)?.get(1)?.as_str();
    serde_json::from_str(json_str).ok()
}

// ---------------------------------------------------------------------------
// Meta-tag helpers (for OG fallback)
// ---------------------------------------------------------------------------

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn meta_name(html: &str, name: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+name="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == name) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn get_str(v: Option<&Value>, key: &str) -> Option<String> {
    v.and_then(|x| x.get(key))
        .and_then(|x| x.as_str().map(String::from))
}

fn get_int(v: Option<&Value>, key: &str) -> Option<i64> {
    v.and_then(|x| x.get(key)).and_then(|x| {
        x.as_i64()
            .or_else(|| x.as_str().and_then(|s| s.parse::<i64>().ok()))
    })
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_watch_urls() {
        assert!(matches("https://www.youtube.com/watch?v=dQw4w9WgXcQ"));
        assert!(matches("https://youtu.be/dQw4w9WgXcQ"));
        assert!(matches("https://www.youtube.com/shorts/abc123"));
        assert!(matches(
            "https://www.youtube-nocookie.com/embed/dQw4w9WgXcQ"
        ));
    }

    #[test]
    fn rejects_non_video_urls() {
        assert!(!matches("https://www.youtube.com/"));
        assert!(!matches("https://www.youtube.com/channel/abc"));
        assert!(!matches("https://example.com/watch?v=abc"));
    }

    #[test]
    fn parse_video_id_from_each_shape() {
        assert_eq!(
            parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=10s"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://www.youtube.com/watch?feature=share&v=dQw4w9WgXcQ"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://youtu.be/dQw4w9WgXcQ"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://youtu.be/dQw4w9WgXcQ?t=30"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://www.youtube.com/shorts/abc123"),
            Some("abc123".into())
        );
    }

    #[test]
    fn extract_player_response_happy_path() {
        let html = r#"
            <html><body>
            <script>
            var ytInitialPlayerResponse = {"videoDetails":{"videoId":"abc","title":"T","author":"A","viewCount":"100","lengthSeconds":"60","shortDescription":"d"}};
            </script>
            </body></html>
        "#;
        let v = extract_player_response(html).unwrap();
        let vd = v.get("videoDetails").unwrap();
        assert_eq!(vd.get("title").unwrap().as_str(), Some("T"));
    }

    #[test]
    fn og_fallback_extracts_basics_from_meta_tags() {
        let html = r##"
            <html><head>
            <meta property="og:title" content="Example Video Title">
            <meta property="og:description" content="A cool video description.">
            <meta property="og:image" content="https://i.ytimg.com/vi/abc/maxresdefault.jpg">
            <meta name="author" content="Example Channel">
            </head></html>"##;
        let v = build_og_fallback(
            html,
            "https://www.youtube.com/watch?v=abc",
            "https://www.youtube.com/watch?v=abc",
            "abc",
        );
        assert_eq!(v["data_source"], "og_fallback");
        assert_eq!(v["title"], "Example Video Title");
        assert_eq!(v["description"], "A cool video description.");
        assert_eq!(v["author"], "Example Channel");
        assert_eq!(
            v["thumbnails"][0]["url"],
            "https://i.ytimg.com/vi/abc/maxresdefault.jpg"
        );
        assert!(v["view_count"].is_null());
        assert!(v["caption_tracks"].as_array().unwrap().is_empty());
    }
}
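The URL handling above is plain substring splitting, no URL-parsing crate involved. A dependency-free sketch of the same idea (`video_id` is a hypothetical standalone helper, not the crate's API; it covers only the `youtu.be` and `watch?v=` shapes, while the real extractor also handles `/shorts/` and `/embed/`):

```rust
// Minimal sketch of the id parsing used above, stdlib only.
// Hypothetical helper for illustration; not part of webclaw-fetch.
fn video_id(url: &str) -> Option<String> {
    // youtu.be/{id}: take everything up to ?, #, or /.
    if let Some(after) = url.split("youtu.be/").nth(1) {
        let id: String = after.chars().take_while(|c| !"?#/".contains(*c)).collect();
        if !id.is_empty() {
            return Some(id);
        }
    }
    // youtube.com/watch?v={id}: scan the query string for a v= pair,
    // wherever it sits among the other parameters.
    let query = url.split_once('?').map(|(_, q)| q)?;
    query
        .split('&')
        .find_map(|p| p.strip_prefix("v="))
        .map(|v| v.split(|c| c == '#' || c == '/').next().unwrap_or(v).to_string())
        .filter(|id| !id.is_empty())
}

fn main() {
    assert_eq!(
        video_id("https://youtu.be/dQw4w9WgXcQ?t=30").as_deref(),
        Some("dQw4w9WgXcQ")
    );
    assert_eq!(
        video_id("https://www.youtube.com/watch?feature=share&v=dQw4w9WgXcQ").as_deref(),
        Some("dQw4w9WgXcQ")
    );
    assert_eq!(video_id("https://www.youtube.com/"), None);
    println!("ok");
}
```

The design choice this mirrors: YouTube ids are opaque tokens delimited by `?`, `#`, or `/`, so splitting beats full URL parsing and keeps the hot path allocation-light.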
118 crates/webclaw-fetch/src/fetcher.rs Normal file
@@ -0,0 +1,118 @@
//! Pluggable fetcher abstraction for vertical extractors.
//!
//! Extractors call the network through this trait instead of hard-
//! coding [`FetchClient`]. The OSS CLI / MCP / self-hosted server all
//! pass `&FetchClient` (wreq-backed BoringSSL). The production API
//! server, which must not use in-process TLS fingerprinting, provides
//! its own implementation that routes through the Go tls-sidecar.
//!
//! Both paths expose the same [`FetchResult`] shape and the same
//! optional cloud-escalation client, so extractor logic stays
//! identical across environments.
//!
//! ## Choosing an implementation
//!
//! - CLI, MCP, self-hosted `webclaw-server`: build a [`FetchClient`]
//!   with [`FetchClient::with_cloud`] to attach cloud fallback, and
//!   pass it to extractors as `&client`.
//! - `api.webclaw.io` production server: build a `TlsSidecarFetcher`
//!   (in `server/src/engine/`) that delegates to `engine::tls_client`,
//!   and wrap it in `Arc<dyn Fetcher>` for handler injection.
//!
//! ## Why a trait and not a free function
//!
//! Extractors need state beyond a single fetch: the cloud client for
//! antibot escalation, and in the future per-user proxy pools, tenant
//! headers, and circuit breakers. A trait keeps that state encapsulated
//! behind the fetch interface instead of threading it through every
//! extractor signature.

use async_trait::async_trait;

use crate::client::FetchResult;
use crate::cloud::CloudClient;
use crate::error::FetchError;

/// HTTP fetch surface used by vertical extractors.
///
/// Implementations must be `Send + Sync` because extractor dispatchers
/// run them inside tokio tasks, potentially across many requests.
#[async_trait]
pub trait Fetcher: Send + Sync {
    /// Fetch a URL and return the raw response body + metadata. The
    /// body is in `FetchResult::html` regardless of the actual content
    /// type — JSON API endpoints put JSON there, HTML pages put HTML.
    /// Extractors branch on response status and body shape.
    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError>;

    /// Fetch with additional request headers. Needed for endpoints
    /// that authenticate via a specific header (Instagram's
    /// `x-ig-app-id`, for example). The default implementation routes
    /// to [`Self::fetch`] so implementers without header support stay
    /// functional, though the extra headers won't be set on the
    /// request.
    async fn fetch_with_headers(
        &self,
        url: &str,
        _headers: &[(&str, &str)],
    ) -> Result<FetchResult, FetchError> {
        self.fetch(url).await
    }

    /// Optional cloud-escalation client for antibot bypass. Returning
    /// `Some` tells extractors they can call into the hosted API when
    /// a local fetch hits a challenge page. Returning `None` makes
    /// cloud-gated extractors emit [`CloudError::NotConfigured`] with
    /// an actionable signup link.
    ///
    /// The default implementation returns `None` because not every
    /// deployment wants cloud fallback (self-hosts that don't have a
    /// webclaw.io subscription, for instance).
    ///
    /// [`CloudError::NotConfigured`]: crate::cloud::CloudError::NotConfigured
    fn cloud(&self) -> Option<&CloudClient> {
        None
    }
}

// ---------------------------------------------------------------------------
// Blanket impls: make `&T` and `Arc<T>` behave like the wrapped `T`.
// ---------------------------------------------------------------------------

#[async_trait]
impl<T: Fetcher + ?Sized> Fetcher for &T {
    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
        (**self).fetch(url).await
    }

    async fn fetch_with_headers(
        &self,
        url: &str,
        headers: &[(&str, &str)],
    ) -> Result<FetchResult, FetchError> {
        (**self).fetch_with_headers(url, headers).await
    }

    fn cloud(&self) -> Option<&CloudClient> {
        (**self).cloud()
    }
}

#[async_trait]
impl<T: Fetcher + ?Sized> Fetcher for std::sync::Arc<T> {
    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
        (**self).fetch(url).await
    }

    async fn fetch_with_headers(
        &self,
        url: &str,
        headers: &[(&str, &str)],
    ) -> Result<FetchResult, FetchError> {
        (**self).fetch_with_headers(url, headers).await
    }

    fn cloud(&self) -> Option<&CloudClient> {
        (**self).cloud()
    }
}
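The blanket impls above are what let one extractor signature accept a borrowed client, a shared `Arc<dyn Fetcher>`, or a test double interchangeably. A dependency-free sketch of the pattern, with a toy synchronous `Fetch` trait standing in for the async `Fetcher` (hypothetical names; the real trait needs `async_trait`):

```rust
use std::sync::Arc;

// Toy stand-in for `Fetcher`: same blanket-impl trick, minus async.
trait Fetch: Send + Sync {
    fn fetch(&self, url: &str) -> String;
}

// `&T` delegates to the wrapped T, so call sites can pass borrows...
impl<T: Fetch + ?Sized> Fetch for &T {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

// ...and `Arc<T>` delegates too, so shared handles work unchanged.
impl<T: Fetch + ?Sized> Fetch for Arc<T> {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

struct Stub;
impl Fetch for Stub {
    fn fetch(&self, url: &str) -> String {
        format!("stub:{url}")
    }
}

// One generic "extractor" works with every ownership shape.
fn extract<F: Fetch>(client: F, url: &str) -> String {
    client.fetch(url)
}

fn main() {
    let stub = Stub;
    assert_eq!(extract(&stub, "a"), "stub:a"); // borrow
    let shared: Arc<dyn Fetch> = Arc::new(Stub);
    assert_eq!(extract(shared.clone(), "b"), "stub:b"); // Arc<dyn ...>
    assert_eq!(extract(&*shared, "c"), "stub:c"); // &dyn
    println!("ok");
}
```

The `?Sized` bound is what makes the `Arc<dyn Fetch>` case compile: it lets the blanket impls cover trait objects, not just concrete types.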
@@ -3,10 +3,14 @@
//! Automatically detects PDF responses and delegates to webclaw-pdf.
pub mod browser;
pub mod client;
pub mod cloud;
pub mod crawler;
pub mod document;
pub mod error;
pub mod extractors;
pub mod fetcher;
pub mod linkedin;
pub mod locale;
pub mod proxy;
pub mod reddit;
pub mod sitemap;

@@ -16,7 +20,9 @@ pub use browser::BrowserProfile;
pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult};
pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};
pub use error::FetchError;
+pub use fetcher::Fetcher;
pub use http::HeaderMap;
+pub use locale::{accept_language_for_tld, accept_language_for_url};
pub use proxy::{parse_proxy_file, parse_proxy_line};
pub use sitemap::SitemapEntry;
pub use webclaw_pdf::PdfMode;
77 crates/webclaw-fetch/src/locale.rs Normal file
@@ -0,0 +1,77 @@
//! Derive an `Accept-Language` header from a URL.
//!
//! DataDome-class bot detection on country-specific sites (e.g. immobiliare.it,
//! leboncoin.fr) does a geo-vs-locale sanity check: a residential IP in the
//! target country + a browser UA but the wrong `Accept-Language` is a bot
//! signal. Matching the site's expected locale gets us through.
//!
//! Default for unmapped TLDs is `en-US,en;q=0.9` — the global fallback.

/// Best-effort `Accept-Language` header value for the given URL's TLD.
/// Returns `None` if the URL cannot be parsed.
pub fn accept_language_for_url(url: &str) -> Option<&'static str> {
    let host = url::Url::parse(url).ok()?.host_str()?.to_ascii_lowercase();
    let tld = host.rsplit('.').next()?;
    Some(accept_language_for_tld(tld))
}

/// Map a bare TLD like `it`, `fr`, `de` to a plausible `Accept-Language`.
/// Unknown TLDs fall back to US English.
pub fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9",
        "fr" => "fr-FR,fr;q=0.9",
        "de" | "at" => "de-DE,de;q=0.9",
        "es" => "es-ES,es;q=0.9",
        "pt" => "pt-PT,pt;q=0.9",
        "nl" => "nl-NL,nl;q=0.9",
        "pl" => "pl-PL,pl;q=0.9",
        "se" => "sv-SE,sv;q=0.9",
        "no" => "nb-NO,nb;q=0.9",
        "dk" => "da-DK,da;q=0.9",
        "fi" => "fi-FI,fi;q=0.9",
        "cz" => "cs-CZ,cs;q=0.9",
        "ro" => "ro-RO,ro;q=0.9",
        "gr" => "el-GR,el;q=0.9",
        "tr" => "tr-TR,tr;q=0.9",
        "ru" => "ru-RU,ru;q=0.9",
        "jp" => "ja-JP,ja;q=0.9",
        "kr" => "ko-KR,ko;q=0.9",
        "cn" => "zh-CN,zh;q=0.9",
        "tw" | "hk" => "zh-TW,zh;q=0.9",
        "br" => "pt-BR,pt;q=0.9",
        "mx" | "ar" | "co" | "cl" | "pe" => "es-ES,es;q=0.9",
        "uk" | "ie" => "en-GB,en;q=0.9",
        _ => "en-US,en;q=0.9",
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn tld_dispatch() {
        assert_eq!(
            accept_language_for_url("https://www.immobiliare.it/annunci/1"),
            Some("it-IT,it;q=0.9")
        );
        assert_eq!(
            accept_language_for_url("https://www.leboncoin.fr/"),
            Some("fr-FR,fr;q=0.9")
        );
        assert_eq!(
            accept_language_for_url("https://www.amazon.co.uk/"),
            Some("en-GB,en;q=0.9")
        );
        assert_eq!(
            accept_language_for_url("https://example.com/"),
            Some("en-US,en;q=0.9")
        );
    }

    #[test]
    fn bad_url_returns_none() {
        assert_eq!(accept_language_for_url("not-a-url"), None);
    }
}
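The dispatch above leans on the `url` crate for host parsing, but the core trick is just "last dot-separated label of the host". A stdlib-only sketch of that step (`tld_of` is a hypothetical helper, simplified — no scheme validation or IDN handling):

```rust
// Pull the final dot-separated label out of a URL's host.
// Hypothetical helper for illustration; the real code uses url::Url.
fn tld_of(url: &str) -> Option<String> {
    let rest = url.split("://").nth(1)?;            // require a scheme
    let host = rest.split(['/', '?', '#']).next()?; // authority part
    let host = host.rsplit('@').next()?;            // drop userinfo if present
    let host = host.split(':').next()?;             // drop port
    let tld = host.rsplit('.').next()?;
    if tld.is_empty() {
        None
    } else {
        Some(tld.to_ascii_lowercase())
    }
}

fn main() {
    assert_eq!(tld_of("https://www.immobiliare.it/annunci/1").as_deref(), Some("it"));
    // co.uk collapses to "uk", matching the crate's "uk" => en-GB arm.
    assert_eq!(tld_of("https://www.amazon.co.uk/").as_deref(), Some("uk"));
    assert_eq!(tld_of("not-a-url"), None);
    println!("ok");
}
```

Note the co.uk behavior: taking only the last label is deliberate, since the match arms above key on bare country-code TLDs, not registrable domains.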
@@ -7,10 +7,15 @@

use std::time::Duration;

+use std::borrow::Cow;
+
use wreq::http2::{
    Http2Options, PseudoId, PseudoOrder, SettingId, SettingsOrder, StreamDependency, StreamId,
};
-use wreq::tls::{AlpsProtocol, CertificateCompressionAlgorithm, TlsOptions, TlsVersion};
+use wreq::tls::{
+    AlpnProtocol, AlpsProtocol, CertificateCompressionAlgorithm, ExtensionType, TlsOptions,
+    TlsVersion,
+};
use wreq::{Client, Emulation};

use crate::browser::BrowserVariant;
@@ -43,6 +48,55 @@ const SAFARI_SIGALGS: &str = "ecdsa_secp256r1_sha256:rsa_pss_rsae_sha256:rsa_pkc
/// Safari curves.
const SAFARI_CURVES: &str = "X25519:P-256:P-384:P-521";

/// Safari iOS 26 TLS extension order, matching bogdanfinn's
/// `safari_ios_26_0` wire format. GREASE slots are omitted; wreq
/// inserts them itself. Diverges from wreq-util's default SafariIos26
/// extension order, which DataDome's immobiliare.it ruleset flags.
fn safari_ios_extensions() -> Vec<ExtensionType> {
    vec![
        ExtensionType::CERTIFICATE_TIMESTAMP,
        ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION,
        ExtensionType::SERVER_NAME,
        ExtensionType::CERT_COMPRESSION,
        ExtensionType::KEY_SHARE,
        ExtensionType::SUPPORTED_VERSIONS,
        ExtensionType::PSK_KEY_EXCHANGE_MODES,
        ExtensionType::SUPPORTED_GROUPS,
        ExtensionType::RENEGOTIATE,
        ExtensionType::SIGNATURE_ALGORITHMS,
        ExtensionType::STATUS_REQUEST,
        ExtensionType::EC_POINT_FORMATS,
        ExtensionType::EXTENDED_MASTER_SECRET,
    ]
}

/// Chrome 133 TLS extension order, matching bogdanfinn's stable JA3
/// (`43067709b025da334de1279a120f8e14`). Real Chrome permutes extensions
/// per handshake, but indeed.com's WAF allowlists this specific wire order
/// and rejects permuted ones. GREASE slots are inserted by wreq.
///
/// JA3 extension field from peet.ws: 18-5-35-51-10-45-11-27-17613-43-13-0-16-65037-65281-23
fn chrome_extensions() -> Vec<ExtensionType> {
    vec![
        ExtensionType::CERTIFICATE_TIMESTAMP,                  // 18
        ExtensionType::STATUS_REQUEST,                         // 5
        ExtensionType::SESSION_TICKET,                         // 35
        ExtensionType::KEY_SHARE,                              // 51
        ExtensionType::SUPPORTED_GROUPS,                       // 10
        ExtensionType::PSK_KEY_EXCHANGE_MODES,                 // 45
        ExtensionType::EC_POINT_FORMATS,                       // 11
        ExtensionType::CERT_COMPRESSION,                       // 27
        ExtensionType::APPLICATION_SETTINGS_NEW,               // 17613 (new codepoint, matches alps_use_new_codepoint)
        ExtensionType::SUPPORTED_VERSIONS,                     // 43
        ExtensionType::SIGNATURE_ALGORITHMS,                   // 13
        ExtensionType::SERVER_NAME,                            // 0
        ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION, // 16
        ExtensionType::ENCRYPTED_CLIENT_HELLO,                 // 65037
        ExtensionType::RENEGOTIATE,                            // 65281
        ExtensionType::EXTENDED_MASTER_SECRET,                 // 23
    ]
}

// --- Chrome HTTP headers in correct wire order ---

const CHROME_HEADERS: &[(&str, &str)] = &[
@@ -130,6 +184,26 @@ const SAFARI_HEADERS: &[(&str, &str)] = &[
    ("sec-fetch-dest", "document"),
];

/// Safari iOS 26 headers, in the wire order real Safari emits. Critically:
/// NO `sec-fetch-*`, NO `priority: u=0, i` (both Chromium-only leaks), but
/// `upgrade-insecure-requests: 1` is present. `accept-encoding` does not
/// include zstd (Safari can't decode it). Verified against bogdanfinn on
/// 2026-04-22: this header set is what DataDome's immobiliare ruleset
/// expects for a real iPhone.
const SAFARI_IOS_HEADERS: &[(&str, &str)] = &[
    (
        "accept",
        "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    ),
    ("accept-language", "en-US,en;q=0.9"),
    ("accept-encoding", "gzip, deflate, br"),
    (
        "user-agent",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 26_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Mobile/15E148 Safari/604.1",
    ),
    ("upgrade-insecure-requests", "1"),
];

const EDGE_HEADERS: &[(&str, &str)] = &[
    (
        "sec-ch-ua",
@@ -156,6 +230,9 @@ const EDGE_HEADERS: &[(&str, &str)] = &[
];

fn chrome_tls() -> TlsOptions {
    // permute_extensions is off so the explicit extension_permutation sticks.
    // Real Chrome permutes, but indeed.com's WAF allowlists bogdanfinn's
    // fixed order, so matching that gets us through.
    TlsOptions::builder()
        .cipher_list(CHROME_CIPHERS)
        .sigalgs_list(CHROME_SIGALGS)
@@ -163,12 +240,18 @@ fn chrome_tls() -> TlsOptions {
        .min_tls_version(TlsVersion::TLS_1_2)
        .max_tls_version(TlsVersion::TLS_1_3)
        .grease_enabled(true)
-        .permute_extensions(true)
+        .permute_extensions(false)
+        .extension_permutation(chrome_extensions())
        .enable_ech_grease(true)
        .pre_shared_key(true)
        .enable_ocsp_stapling(true)
        .enable_signed_cert_timestamps(true)
-        .alps_protocols([AlpsProtocol::HTTP2])
+        .alpn_protocols([
+            AlpnProtocol::HTTP3,
+            AlpnProtocol::HTTP2,
+            AlpnProtocol::HTTP1,
+        ])
+        .alps_protocols([AlpsProtocol::HTTP3, AlpsProtocol::HTTP2])
        .alps_use_new_codepoint(true)
        .aes_hw_override(true)
        .certificate_compression_algorithms(&[CertificateCompressionAlgorithm::BROTLI])
@@ -212,25 +295,70 @@ fn safari_tls() -> TlsOptions {
        .build()
}

/// Safari iOS 26 emulation — composed on top of `wreq_util::Emulation::SafariIos26`
/// with four targeted overrides. We don't hand-roll this one like Chrome/Firefox
/// because the wire-level defaults from wreq-util are already correct for ciphers,
/// sigalgs, curves, and GREASE — the four things wreq-util gets *wrong* for
/// DataDome compatibility are overridden here:
///
/// 1. TLS extension order: match bogdanfinn `safari_ios_26_0` exactly (the JA3
///    ends up `8d909525bd5bbb79f133d11cc05159fe`).
/// 2. HTTP/2 HEADERS priority flag: weight=256, exclusive=1, depends_on=0.
///    wreq-util omits this frame; real Safari and bogdanfinn include it.
///    This flip is the thing DataDome actually reads — the akamai_fingerprint
///    hash changes from `c52879e43202aeb92740be6e8c86ea96` to
///    `d1294410a06522e37a5c5e3f0a45a705`, which is the winning signature.
/// 3. Headers: strip wreq-util's Chromium defaults (`sec-fetch-*`,
///    `priority: u=0, i`, zstd) and replace them with the real iOS 26 set.
/// 4. `accept-language` is preserved from config.extra_headers for locale.
fn safari_ios_emulation() -> wreq::Emulation {
    use wreq::EmulationFactory;
    let mut em = wreq_util::Emulation::SafariIos26.emulation();

    if let Some(tls) = em.tls_options_mut().as_mut() {
        tls.extension_permutation = Some(Cow::Owned(safari_ios_extensions()));
    }

    // Only override the priority flag — keep wreq-util's SETTINGS, WINDOW_UPDATE,
    // and pseudo-order intact. Replacing the whole Http2Options resets SETTINGS
    // to defaults, which sends only INITIAL_WINDOW_SIZE and fails DataDome.
    if let Some(h2) = em.http2_options_mut().as_mut() {
        h2.headers_stream_dependency = Some(StreamDependency::new(StreamId::zero(), 255, true));
    }

    let hm = em.headers_mut();
    hm.clear();
    for (k, v) in SAFARI_IOS_HEADERS {
        if let (Ok(n), Ok(val)) = (
            http::header::HeaderName::from_bytes(k.as_bytes()),
            http::header::HeaderValue::from_str(v),
        ) {
            hm.append(n, val);
        }
    }

    em
}

fn chrome_h2() -> Http2Options {
    // SETTINGS frame matches bogdanfinn `chrome_133`: HEADER_TABLE_SIZE,
    // ENABLE_PUSH=0, INITIAL_WINDOW_SIZE, MAX_HEADER_LIST_SIZE. No
    // MAX_CONCURRENT_STREAMS — real Chrome 133 and bogdanfinn both omit it,
    // and indeed.com's WAF reads this as a bot signal when present. Priority
    // weight 256 (encoded as 255 + 1) matches bogdanfinn's HEADERS frame.
    Http2Options::builder()
        .initial_window_size(6_291_456)
        .initial_connection_window_size(15_728_640)
        .max_header_list_size(262_144)
        .header_table_size(65_536)
        .max_concurrent_streams(1000u32)
        .enable_push(false)
        .settings_order(
            SettingsOrder::builder()
                .extend([
                    SettingId::HeaderTableSize,
                    SettingId::EnablePush,
                    SettingId::MaxConcurrentStreams,
                    SettingId::InitialWindowSize,
                    SettingId::MaxFrameSize,
                    SettingId::MaxHeaderListSize,
                    SettingId::EnableConnectProtocol,
                    SettingId::NoRfc7540Priorities,
                ])
                .build(),
        )
@@ -244,7 +372,7 @@ fn chrome_h2() -> Http2Options {
                ])
                .build(),
        )
-        .headers_stream_dependency(StreamDependency::new(StreamId::zero(), 219, true))
+        .headers_stream_dependency(StreamDependency::new(StreamId::zero(), 255, true))
        .build()
}
@@ -328,32 +456,38 @@ pub fn build_client(
    extra_headers: &std::collections::HashMap<String, String>,
    proxy: Option<&str>,
) -> Result<Client, FetchError> {
-    let (tls, h2, headers) = match variant {
-        BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
-        BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
-        BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
-        BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
-        BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+    // SafariIos26 builds its Emulation on top of wreq-util's base instead
+    // of from scratch. See `safari_ios_emulation` for why.
+    let mut emulation = match variant {
+        BrowserVariant::SafariIos26 => safari_ios_emulation(),
+        other => {
+            let (tls, h2, headers) = match other {
+                BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
+                BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
+                BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
+                BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
+                BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+                BrowserVariant::SafariIos26 => unreachable!("handled above"),
+            };
+            Emulation::builder()
+                .tls_options(tls)
+                .http2_options(h2)
+                .headers(build_headers(headers))
+                .build()
+        }
    };

-    let mut header_map = build_headers(headers);
-
-    // Append extra headers after profile defaults
+    // Append extra headers after profile defaults.
+    let hm = emulation.headers_mut();
    for (k, v) in extra_headers {
        if let (Ok(n), Ok(val)) = (
            http::header::HeaderName::from_bytes(k.as_bytes()),
            http::header::HeaderValue::from_str(v),
        ) {
-            header_map.insert(n, val);
+            hm.insert(n, val);
        }
    }

-    let emulation = Emulation::builder()
-        .tls_options(tls)
-        .http2_options(h2)
-        .headers(header_map)
-        .build();
-
    let mut builder = Client::builder()
        .emulation(emulation)
        .redirect(wreq::redirect::Policy::limited(10))
@@ -22,6 +22,5 @@ serde_json = { workspace = true }
 tokio = { workspace = true }
 tracing = { workspace = true }
 tracing-subscriber = { workspace = true }
-reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
 url = "2"
 dirs = "6.0.0"
@@ -1,302 +0,0 @@
/// Cloud API fallback for protected sites.
///
/// When local fetch returns a challenge page, this module retries
/// via api.webclaw.io. Requires WEBCLAW_API_KEY to be set.
use std::time::Duration;

use serde_json::{Value, json};
use tracing::info;

const API_BASE: &str = "https://api.webclaw.io/v1";

/// Lightweight client for the webclaw cloud API.
pub struct CloudClient {
    api_key: String,
    http: reqwest::Client,
}

impl CloudClient {
    /// Create a new cloud client from WEBCLAW_API_KEY env var.
    /// Returns None if the key is not set.
    pub fn from_env() -> Option<Self> {
        let key = std::env::var("WEBCLAW_API_KEY").ok()?;
        if key.is_empty() {
            return None;
        }
        let http = reqwest::Client::builder()
            .timeout(Duration::from_secs(60))
            .build()
            .unwrap_or_default();
        Some(Self { api_key: key, http })
    }

    /// Scrape a URL via the cloud API. Returns the response JSON.
    pub async fn scrape(
        &self,
        url: &str,
        formats: &[&str],
        include_selectors: &[String],
        exclude_selectors: &[String],
        only_main_content: bool,
    ) -> Result<Value, String> {
        let mut body = json!({
            "url": url,
            "formats": formats,
        });

        if only_main_content {
            body["only_main_content"] = json!(true);
        }
        if !include_selectors.is_empty() {
            body["include_selectors"] = json!(include_selectors);
        }
        if !exclude_selectors.is_empty() {
            body["exclude_selectors"] = json!(exclude_selectors);
        }

        self.post("scrape", body).await
    }

    /// Generic POST to the cloud API.
    pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> {
        let resp = self
            .http
            .post(format!("{API_BASE}/{endpoint}"))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .json(&body)
            .send()
            .await
            .map_err(|e| format!("Cloud API request failed: {e}"))?;

        let status = resp.status();
        if !status.is_success() {
            let text = resp.text().await.unwrap_or_default();
            let truncated = truncate_error(&text);
            return Err(format!("Cloud API error {status}: {truncated}"));
        }

        resp.json::<Value>()
            .await
            .map_err(|e| format!("Cloud API response parse failed: {e}"))
    }

    /// Generic GET from the cloud API.
    pub async fn get(&self, endpoint: &str) -> Result<Value, String> {
        let resp = self
            .http
            .get(format!("{API_BASE}/{endpoint}"))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map_err(|e| format!("Cloud API request failed: {e}"))?;

        let status = resp.status();
        if !status.is_success() {
            let text = resp.text().await.unwrap_or_default();
            let truncated = truncate_error(&text);
            return Err(format!("Cloud API error {status}: {truncated}"));
        }

        resp.json::<Value>()
            .await
            .map_err(|e| format!("Cloud API response parse failed: {e}"))
    }
}

/// Truncate error body to avoid flooding logs with huge HTML responses.
fn truncate_error(text: &str) -> &str {
    const MAX_LEN: usize = 500;
    match text.char_indices().nth(MAX_LEN) {
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text,
    }
}

/// Check if fetched HTML looks like a bot protection challenge page.
/// Detects common bot protection challenge pages.
pub fn is_bot_protected(html: &str, headers: &webclaw_fetch::HeaderMap) -> bool {
    let html_lower = html.to_lowercase();

    // Cloudflare challenge page
    if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
        return true;
    }

    // Cloudflare "checking your browser" spinner
    if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
        && html_lower.contains("cf-spinner")
    {
        return true;
    }

    // Cloudflare Turnstile (only on short pages = challenge, not embedded on real content)
    if (html_lower.contains("cf-turnstile")
        || html_lower.contains("challenges.cloudflare.com/turnstile"))
        && html.len() < 100_000
    {
        return true;
    }

    // DataDome
    if html_lower.contains("geo.captcha-delivery.com")
        || html_lower.contains("captcha-delivery.com/captcha")
    {
        return true;
    }

    // AWS WAF
    if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
        return true;
    }

    // hCaptcha blocking page
    if html_lower.contains("hcaptcha.com")
        && html_lower.contains("h-captcha")
        && html.len() < 50_000
    {
        return true;
    }

    // Cloudflare via headers + challenge body
    let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some();
    if has_cf_headers
        && (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
    {
        return true;
    }

    false
}

/// Check if a page likely needs JS rendering (SPA with almost no text content).
pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
    let has_scripts = html.contains("<script");

    // Tier 1: almost no extractable text from a large page
    if word_count < 50 && html.len() > 5_000 && has_scripts {
        return true;
    }

    // Tier 2: SPA framework detected with suspiciously low content-to-HTML ratio
    if word_count < 800 && html.len() > 50_000 && has_scripts {
        let html_lower = html.to_lowercase();
        let has_spa_marker = html_lower.contains("react-app")
            || html_lower.contains("id=\"__next\"")
            || html_lower.contains("id=\"root\"")
            || html_lower.contains("id=\"app\"")
            || html_lower.contains("__next_data__")
            || html_lower.contains("nuxt")
            || html_lower.contains("ng-app");

        if has_spa_marker {
            return true;
        }
    }

    false
}

/// Result of a smart fetch: either local extraction or cloud API response.
pub enum SmartFetchResult {
    /// Successfully extracted locally.
    Local(Box<webclaw_core::ExtractionResult>),
    /// Fell back to cloud API. Contains the API response JSON.
    Cloud(Value),
}

/// Try local fetch first, fall back to cloud API if bot-protected or JS-rendered.
///
/// Returns the extraction result (local) or the cloud API response JSON.
/// If no API key is configured and local fetch is blocked, returns an error
/// with a helpful message.
pub async fn smart_fetch(
    client: &webclaw_fetch::FetchClient,
    cloud: Option<&CloudClient>,
    url: &str,
    include_selectors: &[String],
    exclude_selectors: &[String],
    only_main_content: bool,
    formats: &[&str],
) -> Result<SmartFetchResult, String> {
    // Step 1: Try local fetch (with timeout to avoid hanging on slow servers)
    let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url))
        .await
        .map_err(|_| format!("Fetch timed out after 30s for {url}"))?
        .map_err(|e| format!("Fetch failed: {e}"))?;

    // Step 2: Check for bot protection
    if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
        info!(url, "bot protection detected, falling back to cloud API");
        return cloud_fallback(
            cloud,
            url,
            include_selectors,
            exclude_selectors,
            only_main_content,
            formats,
        )
        .await;
    }

    // Step 3: Extract locally
    let options = webclaw_core::ExtractionOptions {
        include_selectors: include_selectors.to_vec(),
        exclude_selectors: exclude_selectors.to_vec(),
        only_main_content,
        include_raw_html: false,
    };

    let extraction =
        webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
            .map_err(|e| format!("Extraction failed: {e}"))?;

    // Step 4: Check for JS-rendered pages (low content from large HTML)
    if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
        info!(
            url,
            word_count = extraction.metadata.word_count,
            html_len = fetch_result.html.len(),
            "JS-rendered page detected, falling back to cloud API"
        );
        return cloud_fallback(
            cloud,
            url,
            include_selectors,
            exclude_selectors,
            only_main_content,
            formats,
        )
        .await;
    }

    Ok(SmartFetchResult::Local(Box::new(extraction)))
}

async fn cloud_fallback(
    cloud: Option<&CloudClient>,
    url: &str,
    include_selectors: &[String],
    exclude_selectors: &[String],
    only_main_content: bool,
    formats: &[&str],
) -> Result<SmartFetchResult, String> {
    match cloud {
        Some(c) => {
            let resp = c
                .scrape(
                    url,
                    formats,
                    include_selectors,
                    exclude_selectors,
                    only_main_content,
                )
                .await?;
            info!(url, "cloud API fallback successful");
            Ok(SmartFetchResult::Cloud(resp))
        }
        None => Err(format!(
            "Bot protection detected on {url}. Set WEBCLAW_API_KEY for automatic cloud bypass. \
             Get a key at https://webclaw.io"
        )),
    }
}
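One detail worth calling out in the deleted `cloud.rs` above is `truncate_error`: it truncates by character count via `char_indices().nth()`, not by byte index. A standalone sketch (function name `truncate` and the 2-char limit are mine, for illustration) of why that matters:

```rust
// Char-boundary-safe truncation, mirroring the `truncate_error` helper
// above: `char_indices().nth(n)` yields the byte offset of the (n+1)-th
// character, so the slice can never split a multi-byte UTF-8 codepoint —
// whereas a naive `&text[..500]` would panic on a non-boundary byte.
fn truncate(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text, // already short enough
    }
}

fn main() {
    // "héllo" is 5 chars but 6 bytes; `&s[..2]` would panic mid-'é'.
    assert_eq!(truncate("héllo", 2), "hé");
    assert_eq!(truncate("short", 500), "short");
    println!("ok");
}
```

This is exactly the failure mode you hit when an upstream error body is a Cloudflare HTML page full of non-ASCII text.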
@@ -1,7 +1,6 @@
 /// webclaw-mcp: MCP (Model Context Protocol) server for webclaw.
 /// Exposes web extraction tools over stdio transport for AI agents
 /// like Claude Desktop, Claude Code, and other MCP clients.
-mod cloud;
 mod server;
 mod tools;
@@ -15,10 +15,17 @@ use serde_json::json;
 use tracing::{error, info, warn};
 use url::Url;

-use crate::cloud::{self, CloudClient, SmartFetchResult};
+use webclaw_fetch::cloud::{self, CloudClient, SmartFetchResult};

 use crate::tools::*;

 pub struct WebclawMcp {
+    /// Holds the registered MCP tools. `rmcp >= 1.3` reads this through a
+    /// derived trait impl (not by name), so rustc's dead-code lint can't
+    /// see the usage and fires a spurious `field tool_router is never
+    /// read` on `cargo install`. The field is essential — dropping it
+    /// would unregister every tool. See issue #30.
+    #[allow(dead_code)]
     tool_router: ToolRouter<Self>,
     fetch_client: Arc<webclaw_fetch::FetchClient>,
     /// Lazily-initialized Firefox client, reused across all tool calls that
@@ -711,6 +718,55 @@ impl WebclawMcp {
         Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
     }
     }
+
+    /// List every vertical extractor the server knows about. Returns a
+    /// JSON array of `{name, label, description, url_patterns}` entries.
+    /// Call this to discover what verticals are available before using
+    /// `vertical_scrape`.
+    #[tool]
+    async fn list_extractors(
+        &self,
+        Parameters(_params): Parameters<ListExtractorsParams>,
+    ) -> Result<String, String> {
+        let catalog = webclaw_fetch::extractors::list();
+        serde_json::to_string_pretty(&catalog)
+            .map_err(|e| format!("failed to serialise extractor catalog: {e}"))
+    }
+
+    /// Run a vertical extractor by name and return typed JSON specific
+    /// to the target site (title, price, rating, author, etc.), not
+    /// generic markdown. Use `list_extractors` to discover available
+    /// names. Example names: `reddit`, `github_repo`, `trustpilot_reviews`,
+    /// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`.
+    ///
+    /// Antibot-gated verticals (amazon_product, ebay_listing,
+    /// etsy_listing, trustpilot_reviews) will automatically escalate to
+    /// the webclaw cloud API when local fetch hits bot protection,
+    /// provided `WEBCLAW_API_KEY` is set.
+    #[tool]
+    async fn vertical_scrape(
+        &self,
+        Parameters(params): Parameters<VerticalParams>,
+    ) -> Result<String, String> {
+        validate_url(&params.url)?;
+        // Use the cached Firefox client, not the default Chrome one.
+        // Reddit's `.json` endpoint rejects the wreq-Chrome TLS
+        // fingerprint with a 403 even from residential IPs (they
+        // ship a fingerprint blocklist that includes common
+        // browser-emulation libraries). The wreq-Firefox fingerprint
+        // still passes, and Firefox is equally fine for every other
+        // vertical in the catalog, so it's a strictly-safer default
+        // for `vertical_scrape` than the generic `scrape` tool's
+        // Chrome default. Matches the CLI `webclaw vertical`
+        // subcommand which already uses Firefox.
+        let client = self.firefox_or_build()?;
+        let data =
+            webclaw_fetch::extractors::dispatch_by_name(client.as_ref(), &params.name, &params.url)
+                .await
+                .map_err(|e| e.to_string())?;
+        serde_json::to_string_pretty(&data)
+            .map_err(|e| format!("failed to serialise extractor output: {e}"))
+    }
 }

 #[tool_handler]
@@ -720,7 +776,8 @@ impl ServerHandler for WebclawMcp {
         .with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION")))
         .with_instructions(String::from(
             "Webclaw MCP server -- web content extraction for AI agents. \
-             Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.",
+             Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \
+             list_extractors, vertical_scrape.",
         ))
     }
 }
@@ -103,3 +103,20 @@ pub struct SearchParams {
     /// Number of results to return (default: 10)
     pub num_results: Option<u32>,
 }
+
+/// Parameters for `vertical_scrape`: run a site-specific extractor by name.
+#[derive(Debug, Deserialize, JsonSchema)]
+pub struct VerticalParams {
+    /// Name of the vertical extractor. Call `list_extractors` to see all
+    /// available names. Examples: "reddit", "github_repo", "pypi",
+    /// "trustpilot_reviews", "youtube_video", "shopify_product".
+    pub name: String,
+    /// URL to extract. Must match the URL patterns the extractor claims;
+    /// otherwise the tool returns a clear "URL mismatch" error.
+    pub url: String,
+}
+
+/// `list_extractors` takes no arguments but we still need an empty struct
+/// so rmcp can generate a schema and parse the (empty) JSON-RPC params.
+#[derive(Debug, Deserialize, JsonSchema)]
+pub struct ListExtractorsParams {}
29  crates/webclaw-server/Cargo.toml  Normal file
@@ -0,0 +1,29 @@
[package]
name = "webclaw-server"
version.workspace = true
edition.workspace = true
license.workspace = true
repository.workspace = true
description = "Minimal REST API server for self-hosting webclaw extraction. Wraps the OSS extraction crates with HTTP endpoints. NOT the production hosted API at api.webclaw.io — this is a stateless, single-binary reference server for local + self-hosted deployments."

[[bin]]
name = "webclaw-server"
path = "src/main.rs"

[dependencies]
webclaw-core = { workspace = true }
webclaw-fetch = { workspace = true }
webclaw-llm = { workspace = true }
webclaw-pdf = { workspace = true }

axum = { version = "0.8", features = ["macros"] }
tokio = { workspace = true }
tower-http = { version = "0.6", features = ["trace", "cors"] }
clap = { workspace = true, features = ["derive", "env"] }
serde = { workspace = true }
serde_json = { workspace = true }
tracing = { workspace = true }
tracing-subscriber = { workspace = true, features = ["env-filter"] }
anyhow = "1"
thiserror = { workspace = true }
subtle = "2.6"
48  crates/webclaw-server/src/auth.rs  Normal file
@@ -0,0 +1,48 @@
//! Optional bearer-token middleware.
//!
//! When the server is started without `--api-key`, every request is allowed
//! through (server runs in "open" mode — appropriate for `localhost`-only
//! deployments). When a key is configured, every `/v1/*` request must
//! present `Authorization: Bearer <key>` and the comparison is constant-
//! time to avoid timing-leaking the key.

use axum::{
    extract::{Request, State},
    http::StatusCode,
    middleware::Next,
    response::Response,
};
use subtle::ConstantTimeEq;

use crate::state::AppState;

/// Axum middleware. Mount with `axum::middleware::from_fn_with_state`.
pub async fn require_bearer(
    State(state): State<AppState>,
    request: Request,
    next: Next,
) -> Result<Response, StatusCode> {
    let Some(expected) = state.api_key() else {
        // Open mode — no key configured. Allow everything.
        return Ok(next.run(request).await);
    };

    let Some(header) = request
        .headers()
        .get("authorization")
        .and_then(|v| v.to_str().ok())
    else {
        return Err(StatusCode::UNAUTHORIZED);
    };

    let presented = header
        .strip_prefix("Bearer ")
        .or_else(|| header.strip_prefix("bearer "))
        .ok_or(StatusCode::UNAUTHORIZED)?;

    if presented.as_bytes().ct_eq(expected.as_bytes()).into() {
        Ok(next.run(request).await)
    } else {
        Err(StatusCode::UNAUTHORIZED)
    }
}
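The module comment in `auth.rs` above says the key comparison is constant-time; a hand-rolled, stdlib-only sketch of what `subtle::ConstantTimeEq` provides (for illustration only — the real code should keep using `subtle`, and note that the length check here still short-circuits, which `subtle` sidesteps by operating on equal-length slices):

```rust
// XOR-accumulate over every byte instead of returning at the first
// mismatch, so runtime does not depend on how long a matching prefix
// the attacker guessed.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // lengths are not secret here
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // stays 0 only if every byte pair matches
    }
    diff == 0
}

fn main() {
    assert!(ct_eq(b"secret-key", b"secret-key"));
    assert!(!ct_eq(b"secret-key", b"secret-keZ"));
    assert!(!ct_eq(b"secret", b"secret-key"));
    println!("ok");
}
```

A naive `presented == expected` on `&str` can bail out at the first differing byte, which is the timing side channel the middleware is guarding against.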
87  crates/webclaw-server/src/error.rs  Normal file
@@ -0,0 +1,87 @@
//! API error type. Maps internal errors to HTTP status codes + JSON.

use axum::{
    Json,
    http::StatusCode,
    response::{IntoResponse, Response},
};
use serde_json::json;
use thiserror::Error;

/// Public-facing API error. Always serializes as `{ "error": "..." }`.
/// Keep messages user-actionable; internal details belong in tracing logs.
///
/// `Unauthorized` / `NotFound` / `Internal` are kept on the enum as
/// stable variants for handlers that don't exist yet (planned: per-key
/// rate-limit responses, dynamic route 404s). Marking them dead-code-OK
/// is preferable to inventing them later in three places.
#[allow(dead_code)]
#[derive(Debug, Error)]
pub enum ApiError {
    #[error("{0}")]
    BadRequest(String),

    #[error("unauthorized")]
    Unauthorized,

    #[error("not found")]
    NotFound,

    #[error("upstream fetch failed: {0}")]
    Fetch(String),

    #[error("extraction failed: {0}")]
    Extract(String),

    #[error("LLM provider error: {0}")]
    Llm(String),

    #[error("internal: {0}")]
    Internal(String),
}

impl ApiError {
    pub fn bad_request(msg: impl Into<String>) -> Self {
        Self::BadRequest(msg.into())
    }
    #[allow(dead_code)]
    pub fn internal(msg: impl Into<String>) -> Self {
        Self::Internal(msg.into())
    }

    fn status(&self) -> StatusCode {
        match self {
            Self::BadRequest(_) => StatusCode::BAD_REQUEST,
            Self::Unauthorized => StatusCode::UNAUTHORIZED,
            Self::NotFound => StatusCode::NOT_FOUND,
            Self::Fetch(_) => StatusCode::BAD_GATEWAY,
            Self::Extract(_) | Self::Llm(_) => StatusCode::UNPROCESSABLE_ENTITY,
            Self::Internal(_) => StatusCode::INTERNAL_SERVER_ERROR,
        }
    }
}

impl IntoResponse for ApiError {
    fn into_response(self) -> Response {
        let body = Json(json!({ "error": self.to_string() }));
        (self.status(), body).into_response()
    }
}

impl From<webclaw_fetch::FetchError> for ApiError {
    fn from(e: webclaw_fetch::FetchError) -> Self {
        Self::Fetch(e.to_string())
    }
}

impl From<webclaw_core::ExtractError> for ApiError {
    fn from(e: webclaw_core::ExtractError) -> Self {
        Self::Extract(e.to_string())
    }
}

impl From<webclaw_llm::LlmError> for ApiError {
    fn from(e: webclaw_llm::LlmError) -> Self {
        Self::Llm(e.to_string())
    }
}
123  crates/webclaw-server/src/main.rs  Normal file
@@ -0,0 +1,123 @@
//! webclaw-server — minimal REST API for self-hosting webclaw extraction.
//!
//! This is the OSS reference server. It is intentionally small:
//! single binary, stateless, no database, no job queue. It wraps the
//! same extraction crates the CLI and MCP server use, exposed over
//! HTTP with JSON shapes that mirror the hosted API at
//! api.webclaw.io where the underlying capability exists in OSS.
//!
//! Hosted-only features (anti-bot bypass, JS rendering, async crawl
//! jobs, multi-tenant auth, billing) are *not* implemented here and
//! never will be — they're closed-source. See the docs for the full
//! "what self-hosting gives you vs. what the cloud gives you" matrix.

mod auth;
mod error;
mod routes;
mod state;

use std::net::{IpAddr, SocketAddr};
use std::time::Duration;

use axum::{
    Router,
    middleware::from_fn_with_state,
    routing::{get, post},
};
use clap::Parser;
use tower_http::cors::{Any, CorsLayer};
use tower_http::trace::TraceLayer;
use tracing::info;
use tracing_subscriber::{EnvFilter, fmt};

use crate::state::AppState;

#[derive(Parser, Debug)]
#[command(
    name = "webclaw-server",
    version,
    about = "Minimal self-hosted REST API for webclaw extraction.",
    long_about = "Stateless single-binary REST API. Wraps the OSS extraction \
                  crates over HTTP. For the full hosted platform (anti-bot, \
                  JS render, async jobs, multi-tenant), use api.webclaw.io."
)]
struct Args {
    /// Port to listen on. Env: WEBCLAW_PORT.
    #[arg(short, long, env = "WEBCLAW_PORT", default_value_t = 3000)]
    port: u16,

    /// Host to bind to. Env: WEBCLAW_HOST.
    /// Default `127.0.0.1` keeps the server local-only; set to
    /// `0.0.0.0` to expose on all interfaces (only do this with
    /// `--api-key` set or behind a reverse proxy that adds auth).
    #[arg(long, env = "WEBCLAW_HOST", default_value = "127.0.0.1")]
    host: IpAddr,

    /// Optional bearer token. Env: WEBCLAW_API_KEY. When set, every
    /// `/v1/*` request must present `Authorization: Bearer <key>`.
    /// When unset, the server runs in open mode (no auth) — only
    /// safe on a local-bound interface or behind another auth layer.
    #[arg(long, env = "WEBCLAW_API_KEY")]
    api_key: Option<String>,

    /// Tracing filter. Env: RUST_LOG.
    #[arg(long, env = "RUST_LOG", default_value = "info,webclaw_server=info")]
    log: String,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let args = Args::parse();

    fmt()
        .with_env_filter(EnvFilter::try_new(&args.log).unwrap_or_else(|_| EnvFilter::new("info")))
        .with_target(false)
        .compact()
        .init();

    let state = AppState::new(args.api_key.clone())?;

    let v1 = Router::new()
        .route("/scrape", post(routes::scrape::scrape))
        .route(
            "/scrape/{vertical}",
            post(routes::structured::scrape_vertical),
        )
        .route("/crawl", post(routes::crawl::crawl))
        .route("/map", post(routes::map::map))
        .route("/batch", post(routes::batch::batch))
        .route("/extract", post(routes::extract::extract))
        .route("/extractors", get(routes::structured::list_extractors))
        .route("/summarize", post(routes::summarize::summarize_route))
        .route("/diff", post(routes::diff::diff_route))
        .route("/brand", post(routes::brand::brand))
        .layer(from_fn_with_state(state.clone(), auth::require_bearer));

    let app = Router::new()
        .route("/health", get(routes::health::health))
        .nest("/v1", v1)
        .layer(
            // Permissive CORS — same posture as a self-hosted dev tool.
            // Tighten in front with a reverse proxy if you expose this
            // publicly.
            CorsLayer::new()
                .allow_origin(Any)
                .allow_methods(Any)
                .allow_headers(Any)
                .max_age(Duration::from_secs(3600)),
        )
        .layer(TraceLayer::new_for_http())
        .with_state(state);

    let addr = SocketAddr::from((args.host, args.port));
    let listener = tokio::net::TcpListener::bind(addr).await?;
    let auth_status = if args.api_key.is_some() {
        "bearer auth required"
    } else {
        "open mode (no auth)"
    };
    info!(%addr, mode = auth_status, "webclaw-server listening");

    axum::serve(listener, app).await?;
    Ok(())
}
85  crates/webclaw-server/src/routes/batch.rs  Normal file
@@ -0,0 +1,85 @@
//! POST /v1/batch — fetch + extract many URLs in parallel.
//!
//! `concurrency` is hard-capped at 20 to avoid hammering targets and
//! to bound memory growth for naive callers. For larger batches use
//! the hosted API.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_core::ExtractionOptions;

use crate::{error::ApiError, state::AppState};

const HARD_MAX_URLS: usize = 100;
const HARD_MAX_CONCURRENCY: usize = 20;

#[derive(Debug, Deserialize, Default)]
#[serde(default)]
pub struct BatchRequest {
    pub urls: Vec<String>,
    pub concurrency: Option<usize>,
    pub include_selectors: Vec<String>,
    pub exclude_selectors: Vec<String>,
    pub only_main_content: bool,
}

pub async fn batch(
    State(state): State<AppState>,
    Json(req): Json<BatchRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.urls.is_empty() {
        return Err(ApiError::bad_request("`urls` is required"));
    }
    if req.urls.len() > HARD_MAX_URLS {
        return Err(ApiError::bad_request(format!(
            "too many urls: {} (max {HARD_MAX_URLS})",
            req.urls.len()
        )));
    }

    let concurrency = req.concurrency.unwrap_or(5).clamp(1, HARD_MAX_CONCURRENCY);

    let options = ExtractionOptions {
        include_selectors: req.include_selectors,
        exclude_selectors: req.exclude_selectors,
        only_main_content: req.only_main_content,
        include_raw_html: false,
    };

    let url_refs: Vec<&str> = req.urls.iter().map(|s| s.as_str()).collect();
    let results = state
        .fetch()
        .fetch_and_extract_batch_with_options(&url_refs, concurrency, &options)
        .await;

    let mut ok = 0usize;
    let mut errors = 0usize;
    let mut out: Vec<Value> = Vec::with_capacity(results.len());
    for r in results {
        match r.result {
            Ok(extraction) => {
                ok += 1;
                out.push(json!({
                    "url": r.url,
                    "metadata": extraction.metadata,
                    "markdown": extraction.content.markdown,
                }));
            }
            Err(e) => {
                errors += 1;
                out.push(json!({
                    "url": r.url,
                    "error": e.to_string(),
                }));
            }
        }
    }

    Ok(Json(json!({
        "total": out.len(),
        "completed": ok,
        "errors": errors,
        "results": out,
    })))
}
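The batch route normalizes `concurrency` with `unwrap_or(5).clamp(1, HARD_MAX_CONCURRENCY)`. A minimal sketch of that normalization in isolation (helper name is mine), showing why `clamp` rather than `min` is the right call here:

```rust
// Default of 5 when the field is omitted; clamp enforces both a floor
// (a client sending 0 would otherwise spawn zero workers and stall the
// batch) and the hard ceiling of 20 from the route above.
fn effective_concurrency(requested: Option<usize>) -> usize {
    requested.unwrap_or(5).clamp(1, 20)
}

fn main() {
    assert_eq!(effective_concurrency(None), 5);       // default
    assert_eq!(effective_concurrency(Some(0)), 1);    // floor
    assert_eq!(effective_concurrency(Some(7)), 7);    // passthrough
    assert_eq!(effective_concurrency(Some(500)), 20); // ceiling
    println!("ok");
}
```

A bare `.min(20)` (as the crawl route below uses for its limits) would let a zero through, which matters more for batch because the value feeds a worker pool.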
32  crates/webclaw-server/src/routes/brand.rs  Normal file
@@ -0,0 +1,32 @@
//! POST /v1/brand — extract brand identity (colors, fonts, logo) from a page.
//!
//! Pure DOM/CSS analysis — no LLM, no network beyond the page fetch itself.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_core::brand::extract_brand;

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize)]
pub struct BrandRequest {
    pub url: String,
}

pub async fn brand(
    State(state): State<AppState>,
    Json(req): Json<BrandRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }

    let fetched = state.fetch().fetch(&req.url).await?;
    let brand = extract_brand(&fetched.html, Some(&fetched.url));

    Ok(Json(json!({
        "url": req.url,
        "brand": brand,
    })))
}
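For a feel of the kind of DOM/CSS signal `extract_brand` works from, here is a hypothetical, stdlib-only illustration that pulls a `theme-color` meta value out of raw HTML. This is not how webclaw-core implements it (the real extractor does proper parsing and covers colors, fonts, and logos); the function name and the naive string scan are mine:

```rust
// Naive scan for `<meta name="theme-color" content="...">`. Assumes the
// `content` attribute follows the `name` attribute — good enough as an
// illustration, not for production HTML.
fn theme_color(html: &str) -> Option<&str> {
    let lower = html.to_lowercase();
    let meta_at = lower.find("name=\"theme-color\"")?;
    let rest = &html[meta_at..];
    let rest_lower = &lower[meta_at..];
    let start = rest_lower.find("content=\"")? + "content=\"".len();
    let end = rest[start..].find('"')? + start;
    Some(&rest[start..end])
}

fn main() {
    let html = r##"<head><meta name="theme-color" content="#0f172a"></head>"##;
    assert_eq!(theme_color(html), Some("#0f172a"));
    assert_eq!(theme_color("<p>no meta</p>"), None);
    println!("ok");
}
```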
85
crates/webclaw-server/src/routes/crawl.rs
Normal file
85
crates/webclaw-server/src/routes/crawl.rs
Normal file
|
|
@ -0,0 +1,85 @@
|
|||
//! POST /v1/crawl — synchronous BFS crawl.
//!
//! NOTE: this server is stateless — there is no job queue. Crawls run
//! inline and return when complete. `max_pages` is hard-capped at 500
//! to avoid OOM on naive callers. For large crawls + async jobs, use
//! the hosted API at api.webclaw.io.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use std::time::Duration;
use webclaw_fetch::{CrawlConfig, Crawler, FetchConfig};

use crate::{error::ApiError, state::AppState};

const HARD_MAX_PAGES: usize = 500;

#[derive(Debug, Deserialize, Default)]
#[serde(default)]
pub struct CrawlRequest {
    pub url: String,
    pub max_depth: Option<usize>,
    pub max_pages: Option<usize>,
    pub use_sitemap: bool,
    pub concurrency: Option<usize>,
    pub allow_subdomains: bool,
    pub allow_external_links: bool,
    pub include_patterns: Vec<String>,
    pub exclude_patterns: Vec<String>,
}

pub async fn crawl(
    State(_state): State<AppState>,
    Json(req): Json<CrawlRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }
    let max_pages = req.max_pages.unwrap_or(50).min(HARD_MAX_PAGES);
    let max_depth = req.max_depth.unwrap_or(3);
    let concurrency = req.concurrency.unwrap_or(5).min(20);

    let config = CrawlConfig {
        fetch: FetchConfig::default(),
        max_depth,
        max_pages,
        concurrency,
        delay: Duration::from_millis(200),
        path_prefix: None,
        use_sitemap: req.use_sitemap,
        include_patterns: req.include_patterns,
        exclude_patterns: req.exclude_patterns,
        allow_subdomains: req.allow_subdomains,
        allow_external_links: req.allow_external_links,
        progress_tx: None,
        cancel_flag: None,
    };

    let crawler = Crawler::new(&req.url, config).map_err(ApiError::from)?;
    let result = crawler.crawl(&req.url, None).await;

    let pages: Vec<Value> = result
        .pages
        .iter()
        .map(|p| {
            json!({
                "url": p.url,
                "depth": p.depth,
                "metadata": p.extraction.as_ref().map(|e| &e.metadata),
                "markdown": p.extraction.as_ref().map(|e| e.content.markdown.as_str()).unwrap_or(""),
                "error": p.error,
            })
        })
        .collect();

    Ok(Json(json!({
        "url": req.url,
        "status": "completed",
        "total": result.total,
        "completed": result.ok,
        "errors": result.errors,
        "elapsed_secs": result.elapsed_secs,
        "pages": pages,
    })))
}
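The request-shaping in this handler (max_pages: default 50, cap 500; concurrency: default 5, cap 20; max_depth: default 3, uncapped) reduces to one default-then-cap rule. A minimal shell sketch of that rule, with illustrative names that are not part of the repo:

```shell
# clamp_crawl DEFAULT CAP [VALUE] - apply the handler's rule: fall back
# to DEFAULT when the field was omitted, then clamp at CAP.
clamp_crawl() {
  default="$1"; cap="$2"; value="${3:-$default}"
  if [ "$value" -gt "$cap" ]; then value="$cap"; fi
  echo "$value"
}

clamp_crawl 50 500        # max_pages omitted   -> 50
clamp_crawl 50 500 2000   # max_pages too large -> 500
clamp_crawl 5 20 8        # concurrency in range -> 8
```

Note the cap is applied after the default, so a malicious or naive `max_pages: 100000` still costs at most `HARD_MAX_PAGES` fetches.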
crates/webclaw-server/src/routes/diff.rs (new file, 92 lines)
@@ -0,0 +1,92 @@
//! POST /v1/diff — compare current page content against a prior snapshot.
//!
//! Caller passes either a full prior `ExtractionResult` or the minimal
//! `{ markdown, metadata }` shape used by the hosted API. We re-fetch
//! the URL, extract, and run `webclaw_core::diff::diff` over the pair.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_core::{Content, ExtractionResult, Metadata, diff::diff};

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize)]
pub struct DiffRequest {
    pub url: String,
    pub previous: PreviousSnapshot,
}

/// Either a full prior extraction, or the minimal `{ markdown, metadata }`
/// shape returned by /v1/scrape. Untagged so callers can send whichever
/// they have on hand.
#[derive(Debug, Deserialize)]
#[serde(untagged)]
pub enum PreviousSnapshot {
    Full(ExtractionResult),
    Minimal {
        #[serde(default)]
        markdown: String,
        #[serde(default)]
        metadata: Option<Metadata>,
    },
}

impl PreviousSnapshot {
    fn into_extraction(self) -> ExtractionResult {
        match self {
            Self::Full(r) => r,
            Self::Minimal { markdown, metadata } => ExtractionResult {
                metadata: metadata.unwrap_or_else(empty_metadata),
                content: Content {
                    markdown,
                    plain_text: String::new(),
                    links: Vec::new(),
                    images: Vec::new(),
                    code_blocks: Vec::new(),
                    raw_html: None,
                },
                domain_data: None,
                structured_data: Vec::new(),
            },
        }
    }
}

fn empty_metadata() -> Metadata {
    Metadata {
        title: None,
        description: None,
        author: None,
        published_date: None,
        language: None,
        url: None,
        site_name: None,
        image: None,
        favicon: None,
        word_count: 0,
    }
}

pub async fn diff_route(
    State(state): State<AppState>,
    Json(req): Json<DiffRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }

    let current = state.fetch().fetch_and_extract(&req.url).await?;
    let previous = req.previous.into_extraction();
    let result = diff(&previous, &current);

    Ok(Json(json!({
        "url": req.url,
        "status": result.status,
        "diff": result.text_diff,
        "metadata_changes": result.metadata_changes,
        "links_added": result.links_added,
        "links_removed": result.links_removed,
        "word_count_delta": result.word_count_delta,
    })))
}
crates/webclaw-server/src/routes/extract.rs (new file, 81 lines)
@@ -0,0 +1,81 @@
//! POST /v1/extract — LLM-powered structured extraction.
//!
//! Two modes:
//! * `schema` — JSON Schema describing what to extract.
//! * `prompt` — natural-language instructions.
//!
//! At least one must be provided. The provider chain is built per
//! request from env (Ollama -> OpenAI -> Anthropic). Self-hosters
//! get the same fallback behaviour as the CLI.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_llm::{ProviderChain, extract::extract_json, extract::extract_with_prompt};

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize, Default)]
#[serde(default)]
pub struct ExtractRequest {
    pub url: String,
    pub schema: Option<Value>,
    pub prompt: Option<String>,
    /// Optional override of the provider model name (e.g. `gpt-4o-mini`).
    pub model: Option<String>,
}

pub async fn extract(
    State(state): State<AppState>,
    Json(req): Json<ExtractRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }
    let has_schema = req.schema.is_some();
    let has_prompt = req
        .prompt
        .as_deref()
        .map(|p| !p.trim().is_empty())
        .unwrap_or(false);
    if !has_schema && !has_prompt {
        return Err(ApiError::bad_request(
            "either `schema` or `prompt` is required",
        ));
    }

    // Fetch + extract first so we feed the LLM clean markdown instead of
    // raw HTML. Cheaper tokens, better signal.
    let extraction = state.fetch().fetch_and_extract(&req.url).await?;
    let content = if extraction.content.markdown.trim().is_empty() {
        extraction.content.plain_text.clone()
    } else {
        extraction.content.markdown.clone()
    };
    if content.trim().is_empty() {
        return Err(ApiError::Extract(
            "no extractable content on page".to_string(),
        ));
    }

    let chain = ProviderChain::default().await;
    if chain.is_empty() {
        return Err(ApiError::Llm(
            "no LLM providers configured (set OLLAMA_HOST, OPENAI_API_KEY, or ANTHROPIC_API_KEY)"
                .to_string(),
        ));
    }

    let model = req.model.as_deref();
    let data = if let Some(schema) = req.schema.as_ref() {
        extract_json(&content, schema, &chain, model).await?
    } else {
        let prompt = req.prompt.as_deref().unwrap_or_default();
        extract_with_prompt(&content, prompt, &chain, model).await?
    };

    Ok(Json(json!({
        "url": req.url,
        "data": data,
    })))
}
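The schema/prompt guard is worth stating on its own: a bare schema is enough, but a prompt only counts when it is non-blank after trimming. A shell restatement of that guard (illustrative helper, not part of the repo):

```shell
# validate_extract SCHEMA PROMPT - accept when a schema is present, or
# when the prompt contains anything besides whitespace; otherwise emit
# the same error message the handler returns.
validate_extract() {
  schema="$1"
  trimmed=$(printf '%s' "$2" | tr -d '[:space:]')
  if [ -n "$schema" ] || [ -n "$trimmed" ]; then
    echo ok
  else
    echo 'either `schema` or `prompt` is required'
  fi
}

validate_extract '{"type":"object"}' ''   # -> ok (schema alone suffices)
validate_extract '' '   '                 # -> the 400 error: blank prompt does not count
```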
crates/webclaw-server/src/routes/health.rs (new file, 10 lines)
@@ -0,0 +1,10 @@
use axum::Json;
use serde_json::{Value, json};

pub async fn health() -> Json<Value> {
    Json(json!({
        "status": "ok",
        "version": env!("CARGO_PKG_VERSION"),
        "service": "webclaw-server",
    }))
}
crates/webclaw-server/src/routes/map.rs (new file, 49 lines)
@@ -0,0 +1,49 @@
//! POST /v1/map — discover URLs from a site's sitemaps.
//!
//! Walks robots.txt + common sitemap paths, recursively resolves
//! `<sitemapindex>` files, and returns the deduplicated list of URLs.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_fetch::sitemap;

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize)]
pub struct MapRequest {
    pub url: String,
    /// When true, return the full SitemapEntry objects (with lastmod,
    /// priority, changefreq). Defaults to false → bare URL strings,
    /// matching the hosted-API shape.
    #[serde(default)]
    pub include_metadata: bool,
}

pub async fn map(
    State(state): State<AppState>,
    Json(req): Json<MapRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }

    let entries = sitemap::discover(state.fetch(), &req.url).await?;

    let body = if req.include_metadata {
        json!({
            "url": req.url,
            "count": entries.len(),
            "urls": entries,
        })
    } else {
        let urls: Vec<&str> = entries.iter().map(|e| e.url.as_str()).collect();
        json!({
            "url": req.url,
            "count": urls.len(),
            "urls": urls,
        })
    };

    Ok(Json(body))
}
crates/webclaw-server/src/routes/mod.rs (new file, 19 lines)
@@ -0,0 +1,19 @@
//! HTTP route handlers.
//!
//! The OSS server exposes a deliberately small surface that mirrors the
//! hosted-API JSON shapes where the underlying capability exists in the
//! OSS crates. Endpoints that depend on private infrastructure
//! (anti-bot bypass with stealth Chrome, JS rendering at scale,
//! per-user auth, billing, async job queues, agent loops) are
//! intentionally not implemented here. Use api.webclaw.io for those.

pub mod batch;
pub mod brand;
pub mod crawl;
pub mod diff;
pub mod extract;
pub mod health;
pub mod map;
pub mod scrape;
pub mod structured;
pub mod summarize;
crates/webclaw-server/src/routes/scrape.rs
Normal file
108
crates/webclaw-server/src/routes/scrape.rs
Normal file
|
|
@ -0,0 +1,108 @@
|
|||
//! POST /v1/scrape — fetch a URL, run extraction, return the requested
//! formats. JSON shape mirrors the hosted-API response where possible so
//! migrating from self-hosted → cloud is a config change, not a code one.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_core::{ExtractionOptions, llm::to_llm_text};

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize, Default)]
#[serde(default)]
pub struct ScrapeRequest {
    pub url: String,
    /// Output formats. Allowed: "markdown", "text", "llm", "json", "html".
    /// Defaults to ["markdown"]. Accepts a single string ("format")
    /// or an array ("formats") for hosted-API compatibility.
    #[serde(alias = "format")]
    pub formats: ScrapeFormats,
    pub include_selectors: Vec<String>,
    pub exclude_selectors: Vec<String>,
    pub only_main_content: bool,
}

#[derive(Debug, Deserialize)]
#[serde(untagged)]
pub enum ScrapeFormats {
    One(String),
    Many(Vec<String>),
}

impl Default for ScrapeFormats {
    fn default() -> Self {
        Self::Many(vec!["markdown".into()])
    }
}

impl ScrapeFormats {
    fn as_vec(&self) -> Vec<String> {
        match self {
            Self::One(s) => vec![s.clone()],
            Self::Many(v) => v.clone(),
        }
    }
}

pub async fn scrape(
    State(state): State<AppState>,
    Json(req): Json<ScrapeRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }
    let formats = req.formats.as_vec();

    let options = ExtractionOptions {
        include_selectors: req.include_selectors,
        exclude_selectors: req.exclude_selectors,
        only_main_content: req.only_main_content,
        include_raw_html: formats.iter().any(|f| f == "html"),
    };

    let extraction = state
        .fetch()
        .fetch_and_extract_with_options(&req.url, &options)
        .await?;

    let mut body = json!({
        "url": extraction.metadata.url.clone().unwrap_or_else(|| req.url.clone()),
        "metadata": extraction.metadata,
    });
    let obj = body.as_object_mut().expect("json::object");

    for f in &formats {
        match f.as_str() {
            "markdown" => {
                obj.insert("markdown".into(), json!(extraction.content.markdown));
            }
            "text" => {
                obj.insert("text".into(), json!(extraction.content.plain_text));
            }
            "llm" => {
                let llm = to_llm_text(&extraction, extraction.metadata.url.as_deref());
                obj.insert("llm".into(), json!(llm));
            }
            "html" => {
                if let Some(raw) = &extraction.content.raw_html {
                    obj.insert("html".into(), json!(raw));
                }
            }
            "json" => {
                obj.insert("json".into(), json!(extraction));
            }
            other => {
                return Err(ApiError::bad_request(format!(
                    "unknown format: '{other}' (allowed: markdown, text, llm, html, json)"
                )));
            }
        }
    }

    if !extraction.structured_data.is_empty() {
        obj.insert("structured_data".into(), json!(extraction.structured_data));
    }

    Ok(Json(body))
}
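The `ScrapeFormats` untagged enum means `"format": "markdown"` and `"formats": ["markdown", "html"]` both deserialize, and an omitted field defaults to `["markdown"]`. The default half of that behaviour is simple enough to state as a shell sketch (illustrative function name, not part of the repo):

```shell
# formats_or_default [FORMAT...] - mirror the ScrapeFormats default:
# no formats requested -> ["markdown"]; otherwise pass them through.
# (On the Rust side a bare string is also accepted via the untagged
# enum; in shell a single argument is already a one-element list.)
formats_or_default() {
  if [ "$#" -eq 0 ]; then set -- markdown; fi
  printf '%s\n' "$@"
}

formats_or_default           # -> markdown
formats_or_default text html
```

One consequence worth noting from the handler: `include_raw_html` is only switched on when `"html"` is among the requested formats, so the extractor skips retaining raw HTML for the common markdown-only call.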
crates/webclaw-server/src/routes/structured.rs (new file, 55 lines)
@@ -0,0 +1,55 @@
//! `POST /v1/scrape/{vertical}` and `GET /v1/extractors`.
//!
//! Vertical extractors return typed JSON instead of generic markdown.
//! See `webclaw_fetch::extractors` for the catalog and per-site logic.

use axum::{
    Json,
    extract::{Path, State},
};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_fetch::extractors::{self, ExtractorDispatchError};

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize)]
pub struct ScrapeRequest {
    pub url: String,
}

/// Map dispatcher errors to ApiError so users get clean HTTP statuses
/// instead of opaque 500s.
impl From<ExtractorDispatchError> for ApiError {
    fn from(e: ExtractorDispatchError) -> Self {
        match e {
            ExtractorDispatchError::UnknownVertical(_) => ApiError::NotFound,
            ExtractorDispatchError::UrlMismatch { .. } => ApiError::bad_request(e.to_string()),
            ExtractorDispatchError::Fetch(f) => ApiError::Fetch(f.to_string()),
        }
    }
}

/// `GET /v1/extractors` — catalog of all available verticals.
pub async fn list_extractors() -> Json<Value> {
    Json(json!({
        "extractors": extractors::list(),
    }))
}

/// `POST /v1/scrape/{vertical}` — explicit vertical, e.g. /v1/scrape/reddit.
pub async fn scrape_vertical(
    State(state): State<AppState>,
    Path(vertical): Path<String>,
    Json(req): Json<ScrapeRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }
    let data = extractors::dispatch_by_name(state.fetch(), &vertical, &req.url).await?;
    Ok(Json(json!({
        "vertical": vertical,
        "url": req.url,
        "data": data,
    })))
}
crates/webclaw-server/src/routes/summarize.rs (new file, 52 lines)
@@ -0,0 +1,52 @@
//! POST /v1/summarize — LLM-powered page summary.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_llm::{ProviderChain, summarize::summarize};

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize, Default)]
#[serde(default)]
pub struct SummarizeRequest {
    pub url: String,
    pub max_sentences: Option<usize>,
    pub model: Option<String>,
}

pub async fn summarize_route(
    State(state): State<AppState>,
    Json(req): Json<SummarizeRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }

    let extraction = state.fetch().fetch_and_extract(&req.url).await?;
    let content = if extraction.content.markdown.trim().is_empty() {
        extraction.content.plain_text.clone()
    } else {
        extraction.content.markdown.clone()
    };
    if content.trim().is_empty() {
        return Err(ApiError::Extract(
            "no extractable content on page".to_string(),
        ));
    }

    let chain = ProviderChain::default().await;
    if chain.is_empty() {
        return Err(ApiError::Llm(
            "no LLM providers configured (set OLLAMA_HOST, OPENAI_API_KEY, or ANTHROPIC_API_KEY)"
                .to_string(),
        ));
    }

    let summary = summarize(&content, req.max_sentences, &chain, req.model.as_deref()).await?;

    Ok(Json(json!({
        "url": req.url,
        "summary": summary,
    })))
}
crates/webclaw-server/src/state.rs (new file, 107 lines)
@@ -0,0 +1,107 @@
//! Shared application state. Cheap to clone via Arc; held by the axum
//! Router for the life of the process.
//!
//! Two unrelated keys get carried here:
//!
//! 1. [`AppState::api_key`] — the **bearer token clients must present**
//!    to call this server. Set via `WEBCLAW_API_KEY` / `--api-key`.
//!    Unset = open mode.
//! 2. The inner [`webclaw_fetch::cloud::CloudClient`] (if any) — our
//!    **outbound** credential for api.webclaw.io, used by extractors
//!    that escalate on antibot. Set via `WEBCLAW_CLOUD_API_KEY`.
//!    Unset = hard-site extractors return a "set WEBCLAW_CLOUD_API_KEY"
//!    error with a signup link.
//!
//! Different variables on purpose: conflating the two means operators
//! who want their server behind an auth token can't also enable cloud
//! fallback, and vice versa.

use std::sync::Arc;
use tracing::info;
use webclaw_fetch::cloud::CloudClient;
use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig};

/// Single-process state shared across all request handlers.
#[derive(Clone)]
pub struct AppState {
    inner: Arc<Inner>,
}

struct Inner {
    /// Wrapped in `Arc` because `fetch_and_extract_batch_with_options`
    /// (used by the /v1/batch handler) takes `self: &Arc<Self>` so it
    /// can clone the client into spawned tasks. The single-call handlers
    /// auto-deref `&Arc<FetchClient>` -> `&FetchClient`, so this costs
    /// them nothing.
    pub fetch: Arc<FetchClient>,
    /// Inbound bearer-auth token for this server's own `/v1/*` surface.
    pub api_key: Option<String>,
}

impl AppState {
    /// Build the application state. The fetch client is constructed once
    /// and shared across requests so connection pools + browser profile
    /// state don't churn per request.
    ///
    /// `inbound_api_key` is the bearer token clients must present;
    /// cloud-fallback credentials come from the env (checked here).
    pub fn new(inbound_api_key: Option<String>) -> anyhow::Result<Self> {
        let config = FetchConfig {
            browser: BrowserProfile::Firefox,
            ..FetchConfig::default()
        };
        let mut fetch = FetchClient::new(config)
            .map_err(|e| anyhow::anyhow!("failed to build fetch client: {e}"))?;

        // Cloud fallback: only activates when the operator has provided
        // an api.webclaw.io key. Supports both WEBCLAW_CLOUD_API_KEY
        // (preferred, disambiguates from the inbound-auth key) and
        // WEBCLAW_API_KEY as a fallback when there's no inbound key
        // configured (backwards compat with MCP / CLI conventions).
        if let Some(cloud) = build_cloud_client(inbound_api_key.as_deref()) {
            info!(
                base = cloud.base_url(),
                "cloud fallback enabled — antibot-protected sites will escalate via api.webclaw.io"
            );
            fetch = fetch.with_cloud(cloud);
        }

        Ok(Self {
            inner: Arc::new(Inner {
                fetch: Arc::new(fetch),
                api_key: inbound_api_key,
            }),
        })
    }

    pub fn fetch(&self) -> &Arc<FetchClient> {
        &self.inner.fetch
    }

    pub fn api_key(&self) -> Option<&str> {
        self.inner.api_key.as_deref()
    }
}

/// Resolve the outbound cloud key. Prefers `WEBCLAW_CLOUD_API_KEY`;
/// falls back to `WEBCLAW_API_KEY` *only* when no inbound key is
/// configured (i.e. open mode — the same env var can't mean two
/// things to one process).
fn build_cloud_client(inbound_api_key: Option<&str>) -> Option<CloudClient> {
    let cloud_key = std::env::var("WEBCLAW_CLOUD_API_KEY").ok();
    if let Some(k) = cloud_key.as_deref()
        && !k.trim().is_empty()
    {
        return Some(CloudClient::with_key(k));
    }
    // Reuse WEBCLAW_API_KEY only when not also acting as our own
    // inbound-auth token — otherwise we'd be telling the operator
    // they can't have both.
    if inbound_api_key.is_none()
        && let Ok(k) = std::env::var("WEBCLAW_API_KEY")
        && !k.trim().is_empty()
    {
        return Some(CloudClient::with_key(k));
    }
    None
}
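The `build_cloud_client` precedence is the subtle part of this file, so here is the same three-step rule restated as a shell sketch (the function and variable `inbound` are illustrative; the env var names are the real ones):

```shell
# resolve_cloud_key INBOUND_KEY - precedence matching build_cloud_client:
#   1. a non-blank WEBCLAW_CLOUD_API_KEY always wins;
#   2. WEBCLAW_API_KEY is reused only when no inbound key is configured;
#   3. otherwise: no cloud fallback (empty output).
resolve_cloud_key() {
  inbound="$1"
  cloud=$(printf '%s' "${WEBCLAW_CLOUD_API_KEY:-}" | tr -d '[:space:]')
  if [ -n "$cloud" ]; then
    echo "${WEBCLAW_CLOUD_API_KEY}"
    return
  fi
  legacy=$(printf '%s' "${WEBCLAW_API_KEY:-}" | tr -d '[:space:]')
  if [ -z "$inbound" ] && [ -n "$legacy" ]; then
    echo "${WEBCLAW_API_KEY}"
    return
  fi
  echo ""
}

WEBCLAW_CLOUD_API_KEY=cloud-key WEBCLAW_API_KEY=auth-key
resolve_cloud_key inbound-token   # -> cloud-key
WEBCLAW_CLOUD_API_KEY=""
resolve_cloud_key inbound-token   # -> nothing: WEBCLAW_API_KEY stays inbound-only
resolve_cloud_key ""              # -> auth-key (open mode, so reuse is safe)
```

Step 2 is what keeps one env var from meaning two things to one process: once `WEBCLAW_API_KEY` is guarding the server's own `/v1/*` surface, it is never silently forwarded to api.webclaw.io.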
docker-entrypoint.sh (new executable file, 33 lines)
@@ -0,0 +1,33 @@
#!/bin/sh
# webclaw docker entrypoint.
#
# Behaves like the real binary when the first arg looks like a webclaw arg
# (URL or flag), so `docker run ghcr.io/0xmassi/webclaw https://example.com`
# still works. But gets out of the way when the first arg looks like a
# different command (e.g. `./setup.sh`, `bash`, `sh -c ...`), so this image
# can be used as a FROM base in downstream Dockerfiles with a custom CMD.
#
# Test matrix:
#   docker run IMAGE https://example.com  → webclaw https://example.com
#   docker run IMAGE --help               → webclaw --help
#   docker run IMAGE --file page.html     → webclaw --file page.html
#   docker run IMAGE --stdin < page.html  → webclaw --stdin
#   docker run IMAGE bash                 → bash
#   docker run IMAGE ./setup.sh           → ./setup.sh
#   docker run IMAGE                      → webclaw --help (default CMD)
#
# Root cause fixed: v0.3.13 switched CMD→ENTRYPOINT to make the first use
# case work, which trapped the last four. This shim restores all of them.

set -e

# If the first arg starts with `-`, `http://`, or `https://`, treat the
# whole arg list as webclaw flags/URL.
if [ "$#" -gt 0 ] && {
    [ "${1#-}" != "$1" ] || \
    [ "${1#http://}" != "$1" ] || \
    [ "${1#https://}" != "$1" ]; }; then
    set -- webclaw "$@"
fi

exec "$@"
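The dispatch test above uses `${1#prefix}` stripping: if removing the prefix changes the word, the word had that prefix. The same decision can be written with `case`, which may be easier to audit; this sketch (function name illustrative) reproduces the test matrix:

```shell
# classify ARG - the entrypoint's dispatch rule, rewritten with `case`
# instead of `${1#prefix}` stripping; both forms are plain POSIX sh.
classify() {
  case "$1" in
    -*|http://*|https://*) echo webclaw-args ;;
    *)                     echo passthrough ;;
  esac
}

classify --help                 # -> webclaw-args
classify https://example.com    # -> webclaw-args
classify bash                   # -> passthrough
classify ./setup.sh             # -> passthrough
```

Either form keeps the image usable both as a CLI (`docker run IMAGE URL`) and as a base image with a custom CMD.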