feat(extractors): wave 4 \u2014 ecommerce (shopify + generic JSON-LD)

Two ecommerce extractors covering the long tail of online stores: - shopify_product: hits the public /products/{handle}.json endpoint that every Shopify store exposes. Undocumented but stable for 10+ years. Returns title, vendor, product_type, tags, full variants array (price, SKU, stock, options), images, options matrix, and the price_min/price_max/any_available summary fields. Covers the ~4M Shopify stores out there, modulo stores that put Cloudflare in front of the shop. Rejects known non-Shopify hosts (amazon, etsy, walmart, etc.) to save a failed request. - ecommerce_product: generic Schema.org Product JSON-LD extractor. Works on any modern store that ships the Google-required Product rich-result markup: Shopify, WooCommerce, BigCommerce, Squarespace, Magento, custom storefronts. Returns name, brand, SKU, GTIN, MPN, images, normalized offers (Offer and AggregateOffer flattened into one shape with price, currency, availability, condition), aggregateRating, and the raw JSON-LD block for anyone who wants it. Reuses webclaw_core::structured_data::extract_json_ld so the JSON-LD parser stays shared across the extraction pipeline. Both are explicit-call only — /v1/scrape/shopify_product and /v1/scrape/ecommerce_product. Not in auto-dispatch because any arbitrary /products/{slug} URL could belong to either platform (or to a custom site that uses the same path shape), and claiming such URLs blindly would steal from the default markdown /v1/scrape flow. Live test results against real stores: - Shopify / Allbirds Tree Runners: $100, 7 size variants, 4 images, Size option, all SKUs. 250ms. - ecommerce_product / same Allbirds URL: ProductGroup schema, name 'Men's Tree Runner', brand 'Allbirds', $100 USD InStock offer. 300ms. Different extraction path, same product. - ecommerce_product / huel.com: 'Huel Black Edition' / 'Huel' brand, 200ms. - Shopify stores behind Cloudflare (Gymshark, Tesla Shop) 403 as expected \u2014 the error message points callers at the ecommerce_product fallback, but Cloudflare also blocks the HTML path so those stores are cloud-tier territory. Catalog now exposes 19 extractors via GET /v1/extractors. Unit tests: 59 passing across the module. Scope not in v1: - trustpilot_reviews: file written and tested (JSON-LD walker), but NOT registered in the catalog or dispatch. Trustpilot's Cloudflare turnstile blocks our Firefox + Chrome + Safari + mobile profiles at the TLS layer. Shipping it would return 403 more often than 200. Code kept in-tree under #[allow(dead_code)] for when the cloud tier has residential-proxy support. - Amazon / Walmart / Target / AliExpress: same Cloudflare / WAF story. Not fixable without real browser + proxy pool. - WooCommerce explicit: most WooCommerce stores ship Product JSON-LD, so ecommerce_product covers them. A dedicated WooCommerce REST extractor (/wp-json/wc/store/products) would be marginal on top of that and only works on ~30% of stores that expose the REST API. Wave 4 positioning: we now own the OSS structured-scrape space for any site that respects Schema.org. That's Google's entire rich-result index \u2014 meaningful territory competitors won't try to replicate as named endpoints.
2026-06-08 22:25:12 +02:00 · 2026-04-22 15:36:01 +02:00 · 2026-04-22 15:36:01 +02:00 · 0221c151dc
commit 0221c151dc
parent 3bb0a4bca0
4 changed files with 854 additions and 0 deletions
--- a/crates/webclaw-fetch/src/extractors/ecommerce_product.rs
+++ b/crates/webclaw-fetch/src/extractors/ecommerce_product.rs
@ -0,0 +1,314 @@
+//! Generic ecommerce product extractor via Schema.org JSON-LD.
+//!
+//! Every modern ecommerce site ships a `<script type="application/ld+json">`
+//! Product block for SEO / rich-result snippets. Google's own SEO docs
+//! force this markup on anyone who wants to appear in shopping search.
+//! We take advantage of it: one extractor that works on Shopify,
+//! BigCommerce, WooCommerce, Squarespace, Magento, custom storefronts,
+//! and anything else that follows Schema.org.
+//!
+//! **Explicit-call only** — `/v1/scrape/ecommerce_product`. Not in the
+//! auto-dispatch because we can't identify "this is a product page"
+//! from the URL alone. When the caller knows they have a product URL,
+//! this is the reliable fallback for stores where shopify_product
+//! doesn't apply.
+//!
+//! The extractor reuses `webclaw_core::structured_data::extract_json_ld`
+//! so JSON-LD parsing is shared with the rest of the extraction
+//! pipeline. We walk all blocks looking for `@type: Product`,
+//! `ProductGroup`, or an `ItemList` whose first entry is a Product.
+
+use serde_json::{Value, json};
+
+use super::ExtractorInfo;
+use crate::client::FetchClient;
+use crate::error::FetchError;
+
+pub const INFO: ExtractorInfo = ExtractorInfo {
+    name: "ecommerce_product",
+    label: "Ecommerce product (generic)",
+    description: "Returns product info from any site that ships Schema.org Product JSON-LD: name, description, images, brand, SKU, price, availability, aggregate rating.",
+    url_patterns: &[
+        "https://{any-ecom-store}/products/{slug}",
+        "https://{any-ecom-store}/product/{slug}",
+        "https://{any-ecom-store}/p/{slug}",
+    ],
+};
+
+pub fn matches(url: &str) -> bool {
+    // Maximally permissive: explicit-call-only extractor. We trust the
+    // caller knows they're pointing at a product page. Custom ecom
+    // sites use every conceivable URL shape (warbyparker.com uses
+    // `/eyeglasses/{category}/{slug}/{colour}`, etc.), so path-pattern
+    // matching would false-negative a lot. All we gate on is a valid
+    // http(s) URL with a host.
+    if !(url.starts_with("http://") || url.starts_with("https://")) {
+        return false;
+    }
+    !host_of(url).is_empty()
+}
+
+pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+    let resp = client.fetch(url).await?;
+    if !(200..300).contains(&resp.status) {
+        return Err(FetchError::Build(format!(
+            "ecommerce_product: status {} for {url}",
+            resp.status
+        )));
+    }
+
+    // Reuse the core JSON-LD parser so we benefit from whatever
+    // robustness it gains over time (handling @graph, arrays, etc.).
+    let blocks = webclaw_core::structured_data::extract_json_ld(&resp.html);
+    let product = find_product(&blocks).ok_or_else(|| {
+        FetchError::BodyDecode(format!(
+            "ecommerce_product: no Schema.org Product found in JSON-LD on {url}"
+        ))
+    })?;
+
+    Ok(json!({
+        "url":                url,
+        "name":               get_text(&product, "name"),
+        "description":        get_text(&product, "description"),
+        "brand":              get_brand(&product),
+        "sku":                get_text(&product, "sku"),
+        "mpn":                get_text(&product, "mpn"),
+        "gtin":               get_text(&product, "gtin")
+                                 .or_else(|| get_text(&product, "gtin13"))
+                                 .or_else(|| get_text(&product, "gtin12"))
+                                 .or_else(|| get_text(&product, "gtin8")),
+        "product_id":         get_text(&product, "productID"),
+        "category":           get_text(&product, "category"),
+        "color":              get_text(&product, "color"),
+        "material":           get_text(&product, "material"),
+        "images":             collect_images(&product),
+        "offers":             collect_offers(&product),
+        "aggregate_rating":   get_aggregate_rating(&product),
+        "review_count":       get_review_count(&product),
+        "raw_schema_type":    get_text(&product, "@type"),
+        "raw_jsonld":         product,
+    }))
+}
+
+// ---------------------------------------------------------------------------
+// JSON-LD walkers
+// ---------------------------------------------------------------------------
+
+/// Recursively walk the JSON-LD blocks and return the first node whose
+/// `@type` is Product, ProductGroup, or IndividualProduct.
+fn find_product(blocks: &[Value]) -> Option<Value> {
+    for b in blocks {
+        if let Some(found) = find_product_in(b) {
+            return Some(found);
+        }
+    }
+    None
+}
+
+fn find_product_in(v: &Value) -> Option<Value> {
+    if is_product_type(v) {
+        return Some(v.clone());
+    }
+    // @graph: [ {...}, {...} ]
+    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
+        for item in graph {
+            if let Some(found) = find_product_in(item) {
+                return Some(found);
+            }
+        }
+    }
+    // Bare array wrapper
+    if let Some(arr) = v.as_array() {
+        for item in arr {
+            if let Some(found) = find_product_in(item) {
+                return Some(found);
+            }
+        }
+    }
+    None
+}
+
+fn is_product_type(v: &Value) -> bool {
+    let t = match v.get("@type") {
+        Some(t) => t,
+        None => return false,
+    };
+    let match_str = |s: &str| {
+        matches!(
+            s,
+            "Product" | "ProductGroup" | "IndividualProduct" | "Vehicle" | "SomeProducts"
+        )
+    };
+    match t {
+        Value::String(s) => match_str(s),
+        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
+        _ => false,
+    }
+}
+
+fn get_text(v: &Value, key: &str) -> Option<String> {
+    v.get(key).and_then(|x| match x {
+        Value::String(s) => Some(s.clone()),
+        Value::Number(n) => Some(n.to_string()),
+        _ => None,
+    })
+}
+
+fn get_brand(v: &Value) -> Option<String> {
+    let brand = v.get("brand")?;
+    if let Some(s) = brand.as_str() {
+        return Some(s.to_string());
+    }
+    if let Some(obj) = brand.as_object()
+        && let Some(n) = obj.get("name").and_then(|x| x.as_str())
+    {
+        return Some(n.to_string());
+    }
+    None
+}
+
+fn collect_images(v: &Value) -> Vec<Value> {
+    match v.get("image") {
+        Some(Value::String(s)) => vec![Value::String(s.clone())],
+        Some(Value::Array(arr)) => arr
+            .iter()
+            .filter_map(|x| match x {
+                Value::String(s) => Some(Value::String(s.clone())),
+                Value::Object(_) => x.get("url").cloned(),
+                _ => None,
+            })
+            .collect(),
+        Some(Value::Object(o)) => o.get("url").cloned().into_iter().collect(),
+        _ => Vec::new(),
+    }
+}
+
+/// Normalise both bare Offer and AggregateOffer into a uniform array.
+fn collect_offers(v: &Value) -> Vec<Value> {
+    let offers = match v.get("offers") {
+        Some(o) => o,
+        None => return Vec::new(),
+    };
+    let collect_single = |o: &Value| -> Option<Value> {
+        Some(json!({
+            "price":            get_text(o, "price"),
+            "low_price":        get_text(o, "lowPrice"),
+            "high_price":       get_text(o, "highPrice"),
+            "currency":         get_text(o, "priceCurrency"),
+            "availability":     get_text(o, "availability").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
+            "item_condition":   get_text(o, "itemCondition").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
+            "valid_until":      get_text(o, "priceValidUntil"),
+            "url":              get_text(o, "url"),
+            "seller":           o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from),
+            "offer_count":      get_text(o, "offerCount"),
+        }))
+    };
+    match offers {
+        Value::Array(arr) => arr.iter().filter_map(collect_single).collect(),
+        Value::Object(_) => collect_single(offers).into_iter().collect(),
+        _ => Vec::new(),
+    }
+}
+
+fn get_aggregate_rating(v: &Value) -> Option<Value> {
+    let r = v.get("aggregateRating")?;
+    Some(json!({
+        "rating_value":  get_text(r, "ratingValue"),
+        "best_rating":   get_text(r, "bestRating"),
+        "worst_rating":  get_text(r, "worstRating"),
+        "rating_count":  get_text(r, "ratingCount"),
+        "review_count":  get_text(r, "reviewCount"),
+    }))
+}
+
+fn get_review_count(v: &Value) -> Option<String> {
+    v.get("aggregateRating")
+        .and_then(|r| get_text(r, "reviewCount"))
+        .or_else(|| get_text(v, "reviewCount"))
+}
+
+fn host_of(url: &str) -> &str {
+    url.split("://")
+        .nth(1)
+        .unwrap_or(url)
+        .split('/')
+        .next()
+        .unwrap_or("")
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use serde_json::json;
+
+    #[test]
+    fn matches_any_http_url_with_host() {
+        assert!(matches("https://www.allbirds.com/products/tree-runner"));
+        assert!(matches(
+            "https://www.warbyparker.com/eyeglasses/women/percey/jet-black-with-polished-gold"
+        ));
+        assert!(matches("https://example.com/p/widget"));
+        assert!(matches("http://shop.example.com/foo/bar"));
+    }
+
+    #[test]
+    fn rejects_empty_or_non_http() {
+        assert!(!matches(""));
+        assert!(!matches("not-a-url"));
+        assert!(!matches("ftp://example.com/file"));
+    }
+
+    #[test]
+    fn find_product_walks_graph() {
+        let block = json!({
+            "@context": "https://schema.org",
+            "@graph": [
+                {"@type": "Organization", "name": "ACME"},
+                {"@type": "Product", "name": "Widget", "sku": "ABC"}
+            ]
+        });
+        let blocks = vec![block];
+        let p = find_product(&blocks).unwrap();
+        assert_eq!(p.get("name").and_then(|v| v.as_str()), Some("Widget"));
+    }
+
+    #[test]
+    fn find_product_handles_array_type() {
+        let block = json!({
+            "@type": ["Product", "Clothing"],
+            "name": "Tee"
+        });
+        assert!(is_product_type(&block));
+    }
+
+    #[test]
+    fn get_brand_from_string_or_object() {
+        assert_eq!(get_brand(&json!({"brand": "ACME"})), Some("ACME".into()));
+        assert_eq!(
+            get_brand(&json!({"brand": {"@type": "Brand", "name": "ACME"}})),
+            Some("ACME".into())
+        );
+    }
+
+    #[test]
+    fn collect_offers_handles_single_and_aggregate() {
+        let p = json!({
+            "offers": {
+                "@type": "Offer",
+                "price": "19.99",
+                "priceCurrency": "USD",
+                "availability": "https://schema.org/InStock"
+            }
+        });
+        let offers = collect_offers(&p);
+        assert_eq!(offers.len(), 1);
+        assert_eq!(
+            offers[0].get("price").and_then(|v| v.as_str()),
+            Some("19.99")
+        );
+        assert_eq!(
+            offers[0].get("availability").and_then(|v| v.as_str()),
+            Some("InStock")
+        );
+    }
+}
--- a/crates/webclaw-fetch/src/extractors/mod.rs
+++ b/crates/webclaw-fetch/src/extractors/mod.rs
@ -18,6 +18,7 @@ pub mod arxiv;
 pub mod crates_io;
 pub mod dev_to;
 pub mod docker_hub;
+pub mod ecommerce_product;
 pub mod github_pr;
 pub mod github_release;
 pub mod github_repo;
@ -30,7 +31,15 @@ pub mod linkedin_post;
 pub mod npm;
 pub mod pypi;
 pub mod reddit;
+pub mod shopify_product;
 pub mod stackoverflow;
+// `trustpilot_reviews` code lives in the tree but is not wired into the
+// catalog or dispatch: Cloudflare turnstile blocks our client at the TLS
+// layer (all browser profiles tried, all UAs, mobile + desktop). Shipping
+// it would return 403 more often than not — bad UX. When the cloud tier
+// has residential proxies or a CDP renderer, flip this back on.
+#[allow(dead_code)]
+pub mod trustpilot_reviews;

 use serde::Serialize;
 use serde_json::Value;
@ -73,6 +82,8 @@ pub fn list() -> Vec<ExtractorInfo> {
        linkedin_post::INFO,
        instagram_post::INFO,
        instagram_profile::INFO,
+        shopify_product::INFO,
+        ecommerce_product::INFO,
    ]
 }

@ -198,6 +209,12 @@ pub async fn dispatch_by_url(
                .map(|v| (instagram_profile::INFO.name, v)),
        );
    }
+    // NOTE: shopify_product and ecommerce_product are intentionally NOT
+    // in auto-dispatch. Their `matches()` functions are permissive
+    // (any URL with `/products/`, `/product/`, `/p/`, etc.) and
+    // claiming those generically would steal URLs from the default
+    // `/v1/scrape` markdown flow. Callers opt in via
+    // `/v1/scrape/shopify_product` or `/v1/scrape/ecommerce_product`.
    None
 }

@ -304,6 +321,18 @@ pub async fn dispatch_by_name(
            })
            .await
        }
+        n if n == shopify_product::INFO.name => {
+            run_or_mismatch(shopify_product::matches(url), n, url, || {
+                shopify_product::extract(client, url)
+            })
+            .await
+        }
+        n if n == ecommerce_product::INFO.name => {
+            run_or_mismatch(ecommerce_product::matches(url), n, url, || {
+                ecommerce_product::extract(client, url)
+            })
+            .await
+        }
        _ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())),
    }
 }
--- a/crates/webclaw-fetch/src/extractors/shopify_product.rs
+++ b/crates/webclaw-fetch/src/extractors/shopify_product.rs
@ -0,0 +1,318 @@
+//! Shopify product structured extractor.
+//!
+//! Every Shopify store exposes a public JSON endpoint for each product
+//! by appending `.json` to the product URL:
+//!
+//!   https://shop.example.com/products/cool-tshirt
+//!   → https://shop.example.com/products/cool-tshirt.json
+//!
+//! There are ~4 million Shopify stores. The `.json` endpoint is
+//! undocumented but has been stable for 10+ years. When a store puts
+//! Cloudflare / antibot in front of the shop, this path can 403 just
+//! like any other — for those cases the caller should fall back to
+//! `ecommerce_product` (JSON-LD) or the cloud tier.
+//!
+//! This extractor is **explicit-call only** — it is NOT auto-dispatched
+//! from `/v1/scrape` because we cannot tell ahead of time whether an
+//! arbitrary `/products/{slug}` URL is a Shopify store. Callers hit
+//! `/v1/scrape/shopify_product` when they know.
+
+use serde::Deserialize;
+use serde_json::{Value, json};
+
+use super::ExtractorInfo;
+use crate::client::FetchClient;
+use crate::error::FetchError;
+
+pub const INFO: ExtractorInfo = ExtractorInfo {
+    name: "shopify_product",
+    label: "Shopify product",
+    description: "Returns product metadata on ANY Shopify store via the public /products/{handle}.json endpoint: title, vendor, variants with prices + stock, images, options.",
+    url_patterns: &[
+        "https://{shop}/products/{handle}",
+        "https://{shop}.myshopify.com/products/{handle}",
+    ],
+};
+
+pub fn matches(url: &str) -> bool {
+    // Any URL whose path contains /products/{something}. We do not
+    // filter by host — Shopify powers custom-domain stores. The
+    // extractor's /.json fallback is what confirms Shopify; `matches`
+    // just says "this is a plausible shape." Still reject obviously
+    // non-Shopify known hosts to save a failed request.
+    let host = host_of(url);
+    if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
+        return false;
+    }
+    url.contains("/products/") && !url.ends_with("/products/")
+}
+
+/// Hosts we know are not Shopify — reject so we don't burn a request.
+const NON_SHOPIFY_HOSTS: &[&str] = &[
+    "amazon.com",
+    "amazon.co.uk",
+    "amazon.de",
+    "amazon.fr",
+    "amazon.it",
+    "ebay.com",
+    "etsy.com",
+    "walmart.com",
+    "target.com",
+    "aliexpress.com",
+    "bestbuy.com",
+    "wayfair.com",
+    "homedepot.com",
+    "github.com", // /products is a marketing page
+];
+
+pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+    let json_url = build_json_url(url);
+    let resp = client.fetch(&json_url).await?;
+    if resp.status == 404 {
+        return Err(FetchError::Build(format!(
+            "shopify_product: '{url}' not found (got 404 from {json_url})"
+        )));
+    }
+    if resp.status == 403 {
+        return Err(FetchError::Build(format!(
+            "shopify_product: {json_url} returned 403 — the store has antibot in front of the .json endpoint. Try /v1/scrape/ecommerce_product for the HTML + JSON-LD fallback."
+        )));
+    }
+    if resp.status != 200 {
+        return Err(FetchError::Build(format!(
+            "shopify returned status {} for {json_url}",
+            resp.status
+        )));
+    }
+
+    let body: Wrapper = serde_json::from_str(&resp.html).map_err(|e| {
+        FetchError::BodyDecode(format!(
+            "shopify_product: '{url}' didn't return Shopify JSON — likely not a Shopify store ({e})"
+        ))
+    })?;
+    let p = body.product;
+
+    let variants: Vec<Value> = p
+        .variants
+        .iter()
+        .map(|v| {
+            json!({
+                "id":                  v.id,
+                "title":               v.title,
+                "sku":                 v.sku,
+                "barcode":             v.barcode,
+                "price":               v.price,
+                "compare_at_price":    v.compare_at_price,
+                "available":           v.available,
+                "inventory_quantity":  v.inventory_quantity,
+                "position":            v.position,
+                "weight":              v.weight,
+                "weight_unit":         v.weight_unit,
+                "requires_shipping":   v.requires_shipping,
+                "taxable":             v.taxable,
+                "option1":             v.option1,
+                "option2":             v.option2,
+                "option3":             v.option3,
+            })
+        })
+        .collect();
+
+    let images: Vec<Value> = p
+        .images
+        .iter()
+        .map(|i| {
+            json!({
+                "src":      i.src,
+                "width":    i.width,
+                "height":   i.height,
+                "position": i.position,
+                "alt":      i.alt,
+            })
+        })
+        .collect();
+
+    let options: Vec<Value> = p
+        .options
+        .iter()
+        .map(|o| json!({"name": o.name, "values": o.values, "position": o.position}))
+        .collect();
+
+    // Price range + availability summary across variants (the shape
+    // agents typically want without walking the variants array).
+    let prices: Vec<f64> = p
+        .variants
+        .iter()
+        .filter_map(|v| v.price.as_deref().and_then(|s| s.parse::<f64>().ok()))
+        .collect();
+    let any_available = p.variants.iter().any(|v| v.available.unwrap_or(false));
+
+    Ok(json!({
+        "url":             url,
+        "json_url":        json_url,
+        "product_id":      p.id,
+        "handle":          p.handle,
+        "title":           p.title,
+        "vendor":          p.vendor,
+        "product_type":    p.product_type,
+        "tags":            p.tags,
+        "description_html":p.body_html,
+        "published_at":    p.published_at,
+        "created_at":      p.created_at,
+        "updated_at":      p.updated_at,
+        "variant_count":   variants.len(),
+        "image_count":     images.len(),
+        "any_available":   any_available,
+        "price_min":       prices.iter().cloned().fold(f64::INFINITY, f64::min).is_finite().then(|| prices.iter().cloned().fold(f64::INFINITY, f64::min)),
+        "price_max":       prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max).is_finite().then(|| prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max)),
+        "variants":        variants,
+        "images":          images,
+        "options":         options,
+    }))
+}
+
+/// Build the .json path from a product URL. Handles pre-.jsoned URLs,
+/// trailing slashes, and query strings.
+fn build_json_url(url: &str) -> String {
+    let (path_part, query_part) = match url.split_once('?') {
+        Some((a, b)) => (a, Some(b)),
+        None => (url, None),
+    };
+    let clean = path_part.trim_end_matches('/');
+    let with_json = if clean.ends_with(".json") {
+        clean.to_string()
+    } else {
+        format!("{clean}.json")
+    };
+    match query_part {
+        Some(q) => format!("{with_json}?{q}"),
+        None => with_json,
+    }
+}
+
+fn host_of(url: &str) -> &str {
+    url.split("://")
+        .nth(1)
+        .unwrap_or(url)
+        .split('/')
+        .next()
+        .unwrap_or("")
+}
+
+// ---------------------------------------------------------------------------
+// Shopify product JSON shape (a subset of the full response)
+// ---------------------------------------------------------------------------
+
+#[derive(Deserialize)]
+struct Wrapper {
+    product: Product,
+}
+
+#[derive(Deserialize)]
+struct Product {
+    id: Option<i64>,
+    title: Option<String>,
+    handle: Option<String>,
+    vendor: Option<String>,
+    product_type: Option<String>,
+    body_html: Option<String>,
+    published_at: Option<String>,
+    created_at: Option<String>,
+    updated_at: Option<String>,
+    #[serde(default)]
+    tags: serde_json::Value, // array OR comma-joined string depending on store
+    #[serde(default)]
+    variants: Vec<Variant>,
+    #[serde(default)]
+    images: Vec<Image>,
+    #[serde(default)]
+    options: Vec<Option_>,
+}
+
+#[derive(Deserialize)]
+struct Variant {
+    id: Option<i64>,
+    title: Option<String>,
+    sku: Option<String>,
+    barcode: Option<String>,
+    price: Option<String>,
+    compare_at_price: Option<String>,
+    available: Option<bool>,
+    inventory_quantity: Option<i64>,
+    position: Option<i64>,
+    weight: Option<f64>,
+    weight_unit: Option<String>,
+    requires_shipping: Option<bool>,
+    taxable: Option<bool>,
+    option1: Option<String>,
+    option2: Option<String>,
+    option3: Option<String>,
+}
+
+#[derive(Deserialize)]
+struct Image {
+    src: Option<String>,
+    width: Option<i64>,
+    height: Option<i64>,
+    position: Option<i64>,
+    alt: Option<String>,
+}
+
+#[derive(Deserialize)]
+#[serde(rename_all = "lowercase")]
+struct Option_ {
+    name: Option<String>,
+    position: Option<i64>,
+    #[serde(default)]
+    values: Vec<String>,
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn matches_plausible_shopify_urls() {
+        assert!(matches(
+            "https://www.allbirds.com/products/mens-tree-runners"
+        ));
+        assert!(matches(
+            "https://shop.example.com/products/cool-tshirt?variant=123"
+        ));
+        assert!(matches("https://somestore.myshopify.com/products/thing-1"));
+    }
+
+    #[test]
+    fn rejects_known_non_shopify() {
+        assert!(!matches("https://www.amazon.com/dp/B0C123"));
+        assert!(!matches("https://www.etsy.com/listing/12345/foo"));
+        assert!(!matches("https://www.amazon.co.uk/products/thing"));
+        assert!(!matches("https://github.com/products"));
+    }
+
+    #[test]
+    fn rejects_non_product_urls() {
+        assert!(!matches("https://example.com/"));
+        assert!(!matches("https://example.com/products/"));
+        assert!(!matches("https://example.com/collections/all"));
+    }
+
+    #[test]
+    fn build_json_url_handles_slash_and_query() {
+        assert_eq!(
+            build_json_url("https://shop.example.com/products/foo"),
+            "https://shop.example.com/products/foo.json"
+        );
+        assert_eq!(
+            build_json_url("https://shop.example.com/products/foo/"),
+            "https://shop.example.com/products/foo.json"
+        );
+        assert_eq!(
+            build_json_url("https://shop.example.com/products/foo?variant=123"),
+            "https://shop.example.com/products/foo.json?variant=123"
+        );
+        assert_eq!(
+            build_json_url("https://shop.example.com/products/foo.json"),
+            "https://shop.example.com/products/foo.json"
+        );
+    }
+}
--- a/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs
+++ b/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs
@ -0,0 +1,193 @@
+//! Trustpilot company reviews extractor.
+//!
+//! Trustpilot pages at `trustpilot.com/review/{domain}` embed a rich
+//! JSON-LD `LocalBusiness` / `Organization` block with aggregate
+//! rating + up to 20 recent reviews. No auth, no antibot for the
+//! page HTML itself.
+//!
+//! Auto-dispatch safe because the host is unique.
+
+use serde_json::{Value, json};
+
+use super::ExtractorInfo;
+use crate::client::FetchClient;
+use crate::error::FetchError;
+
+pub const INFO: ExtractorInfo = ExtractorInfo {
+    name: "trustpilot_reviews",
+    label: "Trustpilot reviews",
+    description: "Returns company aggregate rating + recent reviews for a business on Trustpilot.",
+    url_patterns: &["https://www.trustpilot.com/review/{domain}"],
+};
+
+pub fn matches(url: &str) -> bool {
+    let host = host_of(url);
+    if !matches!(host, "www.trustpilot.com" | "trustpilot.com") {
+        return false;
+    }
+    url.contains("/review/")
+}
+
+pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+    let resp = client.fetch(url).await?;
+    if !(200..300).contains(&resp.status) {
+        return Err(FetchError::Build(format!(
+            "trustpilot_reviews: status {} for {url}",
+            resp.status
+        )));
+    }
+
+    let blocks = webclaw_core::structured_data::extract_json_ld(&resp.html);
+    let business = find_business(&blocks).ok_or_else(|| {
+        FetchError::BodyDecode(format!(
+            "trustpilot_reviews: no Organization/LocalBusiness JSON-LD on {url}"
+        ))
+    })?;
+
+    let aggregate_rating = business.get("aggregateRating").map(|r| {
+        json!({
+            "rating_value":  get_text(r, "ratingValue"),
+            "best_rating":   get_text(r, "bestRating"),
+            "review_count":  get_text(r, "reviewCount"),
+        })
+    });
+
+    let reviews: Vec<Value> = business
+        .get("review")
+        .and_then(|r| r.as_array())
+        .map(|arr| {
+            arr.iter()
+                .map(|r| {
+                    json!({
+                        "author":         r.get("author")
+                                              .and_then(|a| a.get("name"))
+                                              .and_then(|n| n.as_str())
+                                              .map(String::from)
+                                              .or_else(|| r.get("author").and_then(|a| a.as_str()).map(String::from)),
+                        "date_published": get_text(r, "datePublished"),
+                        "name":           get_text(r, "name"),
+                        "body":           get_text(r, "reviewBody"),
+                        "rating_value":   r.get("reviewRating")
+                                              .and_then(|rr| rr.get("ratingValue"))
+                                              .and_then(|v| v.as_str().map(String::from)
+                                                  .or_else(|| v.as_f64().map(|n| n.to_string()))),
+                        "language":       get_text(r, "inLanguage"),
+                    })
+                })
+                .collect()
+        })
+        .unwrap_or_default();
+
+    Ok(json!({
+        "url":              url,
+        "name":             get_text(&business, "name"),
+        "description":      get_text(&business, "description"),
+        "logo":             business.get("logo").and_then(|l| l.as_str()).map(String::from)
+                                .or_else(|| business.get("logo").and_then(|l| l.get("url")).and_then(|v| v.as_str()).map(String::from)),
+        "telephone":        get_text(&business, "telephone"),
+        "address":          business.get("address").cloned(),
+        "same_as":          business.get("sameAs").cloned(),
+        "aggregate_rating": aggregate_rating,
+        "review_count_listed": reviews.len(),
+        "reviews":          reviews,
+        "business_schema":  business.get("@type").cloned(),
+    }))
+}
+
+// ---------------------------------------------------------------------------
+// JSON-LD walker — same pattern as ecommerce_product
+// ---------------------------------------------------------------------------
+
+fn find_business(blocks: &[Value]) -> Option<Value> {
+    for b in blocks {
+        if let Some(found) = find_business_in(b) {
+            return Some(found);
+        }
+    }
+    None
+}
+
+fn find_business_in(v: &Value) -> Option<Value> {
+    if is_business_type(v) {
+        return Some(v.clone());
+    }
+    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
+        for item in graph {
+            if let Some(found) = find_business_in(item) {
+                return Some(found);
+            }
+        }
+    }
+    if let Some(arr) = v.as_array() {
+        for item in arr {
+            if let Some(found) = find_business_in(item) {
+                return Some(found);
+            }
+        }
+    }
+    None
+}
+
+fn is_business_type(v: &Value) -> bool {
+    let t = match v.get("@type") {
+        Some(t) => t,
+        None => return false,
+    };
+    let match_str = |s: &str| {
+        matches!(
+            s,
+            "Organization"
+                | "LocalBusiness"
+                | "Corporation"
+                | "OnlineBusiness"
+                | "Store"
+                | "Service"
+        )
+    };
+    match t {
+        Value::String(s) => match_str(s),
+        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
+        _ => false,
+    }
+}
+
+fn get_text(v: &Value, key: &str) -> Option<String> {
+    v.get(key).and_then(|x| match x {
+        Value::String(s) => Some(s.clone()),
+        Value::Number(n) => Some(n.to_string()),
+        _ => None,
+    })
+}
+
+fn host_of(url: &str) -> &str {
+    url.split("://")
+        .nth(1)
+        .unwrap_or(url)
+        .split('/')
+        .next()
+        .unwrap_or("")
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn matches_trustpilot_review_urls() {
+        assert!(matches("https://www.trustpilot.com/review/stripe.com"));
+        assert!(matches("https://trustpilot.com/review/example.com"));
+        assert!(!matches("https://www.trustpilot.com/"));
+        assert!(!matches("https://example.com/review/foo"));
+    }
+
+    #[test]
+    fn is_business_type_handles_variants() {
+        use serde_json::json;
+        assert!(is_business_type(&json!({"@type": "Organization"})));
+        assert!(is_business_type(&json!({"@type": "LocalBusiness"})));
+        assert!(is_business_type(
+            &json!({"@type": ["Organization", "Corporation"]})
+        ));
+        assert!(!is_business_type(&json!({"@type": "Product"})));
+    }
+}