feat(extractors): wave 4 — ecommerce (shopify + generic JSON-LD)

Two ecommerce extractors covering the long tail of online stores:

- shopify_product: hits the public /products/{handle}.json endpoint
  that every Shopify store exposes. Undocumented but stable for 10+
  years. Returns title, vendor, product_type, tags, full variants
  array (price, SKU, stock, options), images, options matrix, and
  the price_min/price_max/any_available summary fields. Covers the
  ~4M Shopify stores out there, modulo stores that put Cloudflare
  in front of the shop. Rejects known non-Shopify hosts (amazon,
  etsy, walmart, etc.) to save a failed request.

- ecommerce_product: generic Schema.org Product JSON-LD extractor.
  Works on any modern store that ships the Google-required Product
  rich-result markup: Shopify, WooCommerce, BigCommerce, Squarespace,
  Magento, custom storefronts. Returns name, brand, SKU, GTIN, MPN,
  images, normalized offers (Offer and AggregateOffer flattened into
  one shape with price, currency, availability, condition),
  aggregateRating, and the raw JSON-LD block for anyone who wants it.
  Reuses webclaw_core::structured_data::extract_json_ld so the
  JSON-LD parser stays shared across the extraction pipeline.
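
  The generic path boils down to "parse the JSON-LD, find the Product
  node, flatten the interesting fields". A minimal standalone sketch of
  that flattening — the sample block and the `summarize_product` helper
  are illustrative only, not the module's API:

  ```rust
  use serde_json::{Value, json};

  // Sketch of the field flattening described above. The real extractor
  // additionally accepts ProductGroup / IndividualProduct, descends
  // into @graph wrappers, and normalises AggregateOffer.
  fn summarize_product(block: &Value) -> Option<(String, String, String)> {
      // Only Product-typed nodes qualify.
      if block["@type"] != "Product" {
          return None;
      }
      let name = block["name"].as_str()?.to_string();
      // `brand` can be a bare string or a {"name": ...} object.
      let brand = block["brand"]
          .as_str()
          .or_else(|| block["brand"]["name"].as_str())?
          .to_string();
      // Availability arrives as a schema.org URL; keep just the label.
      let availability = block["offers"]["availability"]
          .as_str()?
          .trim_start_matches("https://schema.org/")
          .trim_start_matches("http://schema.org/")
          .to_string();
      Some((name, brand, availability))
  }

  fn main() {
      let block = json!({
          "@type": "Product",
          "name": "Tree Runner",
          "brand": {"@type": "Brand", "name": "Allbirds"},
          "offers": {"price": "100.00", "priceCurrency": "USD",
                     "availability": "https://schema.org/InStock"}
      });
      let (name, brand, avail) =
          summarize_product(&block).expect("sample block is a Product");
      println!("{name} / {brand} / {avail}");
  }
  ```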

Both are explicit-call only — /v1/scrape/shopify_product and
/v1/scrape/ecommerce_product. Not in auto-dispatch because any
arbitrary /products/{slug} URL could belong to either platform
(or to a custom site that uses the same path shape), and claiming
such URLs blindly would steal from the default markdown /v1/scrape
flow.

Live test results against real stores:
- Shopify / Allbirds Tree Runners: $100, 7 size variants, 4 images,
  Size option, all SKUs. 250ms.
- ecommerce_product / same Allbirds URL: ProductGroup schema, name
  "Men's Tree Runner", brand "Allbirds", $100 USD InStock offer.
  300ms. Different extraction path, same product.
- ecommerce_product / huel.com: 'Huel Black Edition' / 'Huel' brand,
  200ms.
- Shopify stores behind Cloudflare (Gymshark, Tesla Shop) 403 as
  expected — the error message points callers at the ecommerce_product
  fallback, but Cloudflare also blocks the HTML path so those stores
  are cloud-tier territory.

Catalog now exposes 19 extractors via GET /v1/extractors. Unit
tests: 59 passing across the module.

Scope not in v1:
- trustpilot_reviews: file written and tested (JSON-LD walker), but
  NOT registered in the catalog or dispatch. Trustpilot's Cloudflare
  turnstile blocks our Firefox + Chrome + Safari + mobile profiles
  at the TLS layer. Shipping it would return 403 more often than 200.
  Code kept in-tree under #[allow(dead_code)] for when the cloud
  tier has residential-proxy support.
- Amazon / Walmart / Target / AliExpress: same Cloudflare / WAF
  story. Not fixable without real browser + proxy pool.
- WooCommerce explicit: most WooCommerce stores ship Product JSON-LD,
  so ecommerce_product covers them. A dedicated WooCommerce REST
  extractor (/wp-json/wc/store/products) would be marginal on top of
  that and only works on ~30% of stores that expose the REST API.

Wave 4 positioning: we now own the OSS structured-scrape space for
any site that respects Schema.org. That's Google's entire rich-result
index — meaningful territory competitors won't try to replicate as
named endpoints.
commit 0221c151dc (parent 3bb0a4bca0)
Valerio, 2026-04-22 15:36:01 +02:00
4 changed files with 854 additions and 0 deletions


@@ -0,0 +1,314 @@
//! Generic ecommerce product extractor via Schema.org JSON-LD.
//!
//! Every modern ecommerce site ships a `<script type="application/ld+json">`
//! Product block for SEO / rich-result snippets. Google's own SEO docs
//! force this markup on anyone who wants to appear in shopping search.
//! We take advantage of it: one extractor that works on Shopify,
//! BigCommerce, WooCommerce, Squarespace, Magento, custom storefronts,
//! and anything else that follows Schema.org.
//!
//! **Explicit-call only** — `/v1/scrape/ecommerce_product`. Not in the
//! auto-dispatch because we can't identify "this is a product page"
//! from the URL alone. When the caller knows they have a product URL,
//! this is the reliable fallback for stores where shopify_product
//! doesn't apply.
//!
//! The extractor reuses `webclaw_core::structured_data::extract_json_ld`
//! so JSON-LD parsing is shared with the rest of the extraction
//! pipeline. We walk all blocks looking for `@type: Product`,
//! `ProductGroup`, or an `ItemList` whose first entry is a Product.
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::client::FetchClient;
use crate::error::FetchError;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "ecommerce_product",
label: "Ecommerce product (generic)",
description: "Returns product info from any site that ships Schema.org Product JSON-LD: name, description, images, brand, SKU, price, availability, aggregate rating.",
url_patterns: &[
"https://{any-ecom-store}/products/{slug}",
"https://{any-ecom-store}/product/{slug}",
"https://{any-ecom-store}/p/{slug}",
],
};
pub fn matches(url: &str) -> bool {
// Maximally permissive: explicit-call-only extractor. We trust the
// caller knows they're pointing at a product page. Custom ecom
// sites use every conceivable URL shape (warbyparker.com uses
// `/eyeglasses/{category}/{slug}/{colour}`, etc.), so path-pattern
// matching would false-negative a lot. All we gate on is a valid
// http(s) URL with a host.
if !(url.starts_with("http://") || url.starts_with("https://")) {
return false;
}
!host_of(url).is_empty()
}
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
let resp = client.fetch(url).await?;
if !(200..300).contains(&resp.status) {
return Err(FetchError::Build(format!(
"ecommerce_product: status {} for {url}",
resp.status
)));
}
// Reuse the core JSON-LD parser so we benefit from whatever
// robustness it gains over time (handling @graph, arrays, etc.).
let blocks = webclaw_core::structured_data::extract_json_ld(&resp.html);
let product = find_product(&blocks).ok_or_else(|| {
FetchError::BodyDecode(format!(
"ecommerce_product: no Schema.org Product found in JSON-LD on {url}"
))
})?;
Ok(json!({
"url": url,
"name": get_text(&product, "name"),
"description": get_text(&product, "description"),
"brand": get_brand(&product),
"sku": get_text(&product, "sku"),
"mpn": get_text(&product, "mpn"),
"gtin": get_text(&product, "gtin")
.or_else(|| get_text(&product, "gtin13"))
.or_else(|| get_text(&product, "gtin12"))
.or_else(|| get_text(&product, "gtin8")),
"product_id": get_text(&product, "productID"),
"category": get_text(&product, "category"),
"color": get_text(&product, "color"),
"material": get_text(&product, "material"),
"images": collect_images(&product),
"offers": collect_offers(&product),
"aggregate_rating": get_aggregate_rating(&product),
"review_count": get_review_count(&product),
"raw_schema_type": get_text(&product, "@type"),
"raw_jsonld": product,
}))
}
// ---------------------------------------------------------------------------
// JSON-LD walkers
// ---------------------------------------------------------------------------
/// Recursively walk the JSON-LD blocks and return the first node whose
/// `@type` is Product, ProductGroup, or IndividualProduct.
fn find_product(blocks: &[Value]) -> Option<Value> {
for b in blocks {
if let Some(found) = find_product_in(b) {
return Some(found);
}
}
None
}
fn find_product_in(v: &Value) -> Option<Value> {
if is_product_type(v) {
return Some(v.clone());
}
// @graph: [ {...}, {...} ]
if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
for item in graph {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
// Bare array wrapper
if let Some(arr) = v.as_array() {
for item in arr {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
None
}
fn is_product_type(v: &Value) -> bool {
let t = match v.get("@type") {
Some(t) => t,
None => return false,
};
let match_str = |s: &str| {
matches!(
s,
"Product" | "ProductGroup" | "IndividualProduct" | "Vehicle" | "SomeProducts"
)
};
match t {
Value::String(s) => match_str(s),
Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
_ => false,
}
}
fn get_text(v: &Value, key: &str) -> Option<String> {
v.get(key).and_then(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Number(n) => Some(n.to_string()),
_ => None,
})
}
fn get_brand(v: &Value) -> Option<String> {
let brand = v.get("brand")?;
if let Some(s) = brand.as_str() {
return Some(s.to_string());
}
if let Some(obj) = brand.as_object()
&& let Some(n) = obj.get("name").and_then(|x| x.as_str())
{
return Some(n.to_string());
}
None
}
fn collect_images(v: &Value) -> Vec<Value> {
match v.get("image") {
Some(Value::String(s)) => vec![Value::String(s.clone())],
Some(Value::Array(arr)) => arr
.iter()
.filter_map(|x| match x {
Value::String(s) => Some(Value::String(s.clone())),
Value::Object(_) => x.get("url").cloned(),
_ => None,
})
.collect(),
Some(Value::Object(o)) => o.get("url").cloned().into_iter().collect(),
_ => Vec::new(),
}
}
/// Normalise both bare Offer and AggregateOffer into a uniform array.
fn collect_offers(v: &Value) -> Vec<Value> {
let offers = match v.get("offers") {
Some(o) => o,
None => return Vec::new(),
};
let collect_single = |o: &Value| -> Option<Value> {
Some(json!({
"price": get_text(o, "price"),
"low_price": get_text(o, "lowPrice"),
"high_price": get_text(o, "highPrice"),
"currency": get_text(o, "priceCurrency"),
"availability": get_text(o, "availability").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
"item_condition": get_text(o, "itemCondition").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
"valid_until": get_text(o, "priceValidUntil"),
"url": get_text(o, "url"),
"seller": o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from),
"offer_count": get_text(o, "offerCount"),
}))
};
match offers {
Value::Array(arr) => arr.iter().filter_map(collect_single).collect(),
Value::Object(_) => collect_single(offers).into_iter().collect(),
_ => Vec::new(),
}
}
fn get_aggregate_rating(v: &Value) -> Option<Value> {
let r = v.get("aggregateRating")?;
Some(json!({
"rating_value": get_text(r, "ratingValue"),
"best_rating": get_text(r, "bestRating"),
"worst_rating": get_text(r, "worstRating"),
"rating_count": get_text(r, "ratingCount"),
"review_count": get_text(r, "reviewCount"),
}))
}
fn get_review_count(v: &Value) -> Option<String> {
v.get("aggregateRating")
.and_then(|r| get_text(r, "reviewCount"))
.or_else(|| get_text(v, "reviewCount"))
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
#[cfg(test)]
mod tests {
use super::*;
use serde_json::json;
#[test]
fn matches_any_http_url_with_host() {
assert!(matches("https://www.allbirds.com/products/tree-runner"));
assert!(matches(
"https://www.warbyparker.com/eyeglasses/women/percey/jet-black-with-polished-gold"
));
assert!(matches("https://example.com/p/widget"));
assert!(matches("http://shop.example.com/foo/bar"));
}
#[test]
fn rejects_empty_or_non_http() {
assert!(!matches(""));
assert!(!matches("not-a-url"));
assert!(!matches("ftp://example.com/file"));
}
#[test]
fn find_product_walks_graph() {
let block = json!({
"@context": "https://schema.org",
"@graph": [
{"@type": "Organization", "name": "ACME"},
{"@type": "Product", "name": "Widget", "sku": "ABC"}
]
});
let blocks = vec![block];
let p = find_product(&blocks).unwrap();
assert_eq!(p.get("name").and_then(|v| v.as_str()), Some("Widget"));
}
#[test]
fn find_product_handles_array_type() {
let block = json!({
"@type": ["Product", "Clothing"],
"name": "Tee"
});
assert!(is_product_type(&block));
}
#[test]
fn get_brand_from_string_or_object() {
assert_eq!(get_brand(&json!({"brand": "ACME"})), Some("ACME".into()));
assert_eq!(
get_brand(&json!({"brand": {"@type": "Brand", "name": "ACME"}})),
Some("ACME".into())
);
}
#[test]
fn collect_offers_handles_single_and_aggregate() {
let p = json!({
"offers": {
"@type": "Offer",
"price": "19.99",
"priceCurrency": "USD",
"availability": "https://schema.org/InStock"
}
});
let offers = collect_offers(&p);
assert_eq!(offers.len(), 1);
assert_eq!(
offers[0].get("price").and_then(|v| v.as_str()),
Some("19.99")
);
assert_eq!(
offers[0].get("availability").and_then(|v| v.as_str()),
Some("InStock")
);
}
}


@@ -18,6 +18,7 @@ pub mod arxiv;
pub mod crates_io;
pub mod dev_to;
pub mod docker_hub;
pub mod ecommerce_product;
pub mod github_pr;
pub mod github_release;
pub mod github_repo;
@@ -30,7 +31,15 @@ pub mod linkedin_post;
pub mod npm;
pub mod pypi;
pub mod reddit;
pub mod shopify_product;
pub mod stackoverflow;
// `trustpilot_reviews` code lives in the tree but is not wired into the
// catalog or dispatch: Cloudflare turnstile blocks our client at the TLS
// layer (all browser profiles tried, all UAs, mobile + desktop). Shipping
// it would return 403 more often than not — bad UX. When the cloud tier
// has residential proxies or a CDP renderer, flip this back on.
#[allow(dead_code)]
pub mod trustpilot_reviews;
use serde::Serialize;
use serde_json::Value;
@@ -73,6 +82,8 @@ pub fn list() -> Vec<ExtractorInfo> {
linkedin_post::INFO,
instagram_post::INFO,
instagram_profile::INFO,
shopify_product::INFO,
ecommerce_product::INFO,
]
}
@@ -198,6 +209,12 @@ pub async fn dispatch_by_url(
.map(|v| (instagram_profile::INFO.name, v)),
);
}
// NOTE: shopify_product and ecommerce_product are intentionally NOT
// in auto-dispatch. Their `matches()` functions are permissive
// (any URL with `/products/`, `/product/`, `/p/`, etc.) and
// claiming those generically would steal URLs from the default
// `/v1/scrape` markdown flow. Callers opt in via
// `/v1/scrape/shopify_product` or `/v1/scrape/ecommerce_product`.
None
}
@@ -304,6 +321,18 @@ pub async fn dispatch_by_name(
})
.await
}
n if n == shopify_product::INFO.name => {
run_or_mismatch(shopify_product::matches(url), n, url, || {
shopify_product::extract(client, url)
})
.await
}
n if n == ecommerce_product::INFO.name => {
run_or_mismatch(ecommerce_product::matches(url), n, url, || {
ecommerce_product::extract(client, url)
})
.await
}
_ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())),
}
}


@@ -0,0 +1,318 @@
//! Shopify product structured extractor.
//!
//! Every Shopify store exposes a public JSON endpoint for each product
//! by appending `.json` to the product URL:
//!
//! https://shop.example.com/products/cool-tshirt
//! → https://shop.example.com/products/cool-tshirt.json
//!
//! There are ~4 million Shopify stores. The `.json` endpoint is
//! undocumented but has been stable for 10+ years. When a store puts
//! Cloudflare / antibot in front of the shop, this path can 403 just
//! like any other — for those cases the caller should fall back to
//! `ecommerce_product` (JSON-LD) or the cloud tier.
//!
//! This extractor is **explicit-call only** — it is NOT auto-dispatched
//! from `/v1/scrape` because we cannot tell ahead of time whether an
//! arbitrary `/products/{slug}` URL is a Shopify store. Callers hit
//! `/v1/scrape/shopify_product` when they know.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::client::FetchClient;
use crate::error::FetchError;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "shopify_product",
label: "Shopify product",
description: "Returns product metadata on ANY Shopify store via the public /products/{handle}.json endpoint: title, vendor, variants with prices + stock, images, options.",
url_patterns: &[
"https://{shop}/products/{handle}",
"https://{shop}.myshopify.com/products/{handle}",
],
};
pub fn matches(url: &str) -> bool {
// Any URL whose path contains /products/{something}. We do not
// filter by host — Shopify powers custom-domain stores. The
// extractor's /.json fallback is what confirms Shopify; `matches`
// just says "this is a plausible shape." Still reject obviously
// non-Shopify known hosts to save a failed request.
let host = host_of(url);
if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
return false;
}
url.contains("/products/") && !url.ends_with("/products/")
}
/// Hosts we know are not Shopify — reject so we don't burn a request.
const NON_SHOPIFY_HOSTS: &[&str] = &[
"amazon.com",
"amazon.co.uk",
"amazon.de",
"amazon.fr",
"amazon.it",
"ebay.com",
"etsy.com",
"walmart.com",
"target.com",
"aliexpress.com",
"bestbuy.com",
"wayfair.com",
"homedepot.com",
"github.com", // /products is a marketing page
];
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
let json_url = build_json_url(url);
let resp = client.fetch(&json_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"shopify_product: '{url}' not found (got 404 from {json_url})"
)));
}
if resp.status == 403 {
return Err(FetchError::Build(format!(
"shopify_product: {json_url} returned 403 — the store has antibot in front of the .json endpoint. Try /v1/scrape/ecommerce_product for the HTML + JSON-LD fallback."
)));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"shopify returned status {} for {json_url}",
resp.status
)));
}
let body: Wrapper = serde_json::from_str(&resp.html).map_err(|e| {
FetchError::BodyDecode(format!(
"shopify_product: '{url}' didn't return Shopify JSON — likely not a Shopify store ({e})"
))
})?;
let p = body.product;
let variants: Vec<Value> = p
.variants
.iter()
.map(|v| {
json!({
"id": v.id,
"title": v.title,
"sku": v.sku,
"barcode": v.barcode,
"price": v.price,
"compare_at_price": v.compare_at_price,
"available": v.available,
"inventory_quantity": v.inventory_quantity,
"position": v.position,
"weight": v.weight,
"weight_unit": v.weight_unit,
"requires_shipping": v.requires_shipping,
"taxable": v.taxable,
"option1": v.option1,
"option2": v.option2,
"option3": v.option3,
})
})
.collect();
let images: Vec<Value> = p
.images
.iter()
.map(|i| {
json!({
"src": i.src,
"width": i.width,
"height": i.height,
"position": i.position,
"alt": i.alt,
})
})
.collect();
let options: Vec<Value> = p
.options
.iter()
.map(|o| json!({"name": o.name, "values": o.values, "position": o.position}))
.collect();
// Price range + availability summary across variants (the shape
// agents typically want without walking the variants array).
let prices: Vec<f64> = p
.variants
.iter()
.filter_map(|v| v.price.as_deref().and_then(|s| s.parse::<f64>().ok()))
.collect();
let any_available = p.variants.iter().any(|v| v.available.unwrap_or(false));
Ok(json!({
"url": url,
"json_url": json_url,
"product_id": p.id,
"handle": p.handle,
"title": p.title,
"vendor": p.vendor,
"product_type": p.product_type,
"tags": p.tags,
"description_html": p.body_html,
"published_at": p.published_at,
"created_at": p.created_at,
"updated_at": p.updated_at,
"variant_count": variants.len(),
"image_count": images.len(),
"any_available": any_available,
"price_min": prices.iter().copied().reduce(f64::min),
"price_max": prices.iter().copied().reduce(f64::max),
"variants": variants,
"images": images,
"options": options,
}))
}
/// Build the .json path from a product URL. Handles pre-.jsoned URLs,
/// trailing slashes, and query strings.
fn build_json_url(url: &str) -> String {
let (path_part, query_part) = match url.split_once('?') {
Some((a, b)) => (a, Some(b)),
None => (url, None),
};
let clean = path_part.trim_end_matches('/');
let with_json = if clean.ends_with(".json") {
clean.to_string()
} else {
format!("{clean}.json")
};
match query_part {
Some(q) => format!("{with_json}?{q}"),
None => with_json,
}
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
// ---------------------------------------------------------------------------
// Shopify product JSON shape (a subset of the full response)
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct Wrapper {
product: Product,
}
#[derive(Deserialize)]
struct Product {
id: Option<i64>,
title: Option<String>,
handle: Option<String>,
vendor: Option<String>,
product_type: Option<String>,
body_html: Option<String>,
published_at: Option<String>,
created_at: Option<String>,
updated_at: Option<String>,
#[serde(default)]
tags: serde_json::Value, // array OR comma-joined string depending on store
#[serde(default)]
variants: Vec<Variant>,
#[serde(default)]
images: Vec<Image>,
#[serde(default)]
options: Vec<Option_>,
}
#[derive(Deserialize)]
struct Variant {
id: Option<i64>,
title: Option<String>,
sku: Option<String>,
barcode: Option<String>,
price: Option<String>,
compare_at_price: Option<String>,
available: Option<bool>,
inventory_quantity: Option<i64>,
position: Option<i64>,
weight: Option<f64>,
weight_unit: Option<String>,
requires_shipping: Option<bool>,
taxable: Option<bool>,
option1: Option<String>,
option2: Option<String>,
option3: Option<String>,
}
#[derive(Deserialize)]
struct Image {
src: Option<String>,
width: Option<i64>,
height: Option<i64>,
position: Option<i64>,
alt: Option<String>,
}
#[derive(Deserialize)]
#[serde(rename_all = "lowercase")]
struct Option_ {
name: Option<String>,
position: Option<i64>,
#[serde(default)]
values: Vec<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_plausible_shopify_urls() {
assert!(matches(
"https://www.allbirds.com/products/mens-tree-runners"
));
assert!(matches(
"https://shop.example.com/products/cool-tshirt?variant=123"
));
assert!(matches("https://somestore.myshopify.com/products/thing-1"));
}
#[test]
fn rejects_known_non_shopify() {
assert!(!matches("https://www.amazon.com/dp/B0C123"));
assert!(!matches("https://www.etsy.com/listing/12345/foo"));
assert!(!matches("https://www.amazon.co.uk/products/thing"));
assert!(!matches("https://github.com/products"));
}
#[test]
fn rejects_non_product_urls() {
assert!(!matches("https://example.com/"));
assert!(!matches("https://example.com/products/"));
assert!(!matches("https://example.com/collections/all"));
}
#[test]
fn build_json_url_handles_slash_and_query() {
assert_eq!(
build_json_url("https://shop.example.com/products/foo"),
"https://shop.example.com/products/foo.json"
);
assert_eq!(
build_json_url("https://shop.example.com/products/foo/"),
"https://shop.example.com/products/foo.json"
);
assert_eq!(
build_json_url("https://shop.example.com/products/foo?variant=123"),
"https://shop.example.com/products/foo.json?variant=123"
);
assert_eq!(
build_json_url("https://shop.example.com/products/foo.json"),
"https://shop.example.com/products/foo.json"
);
}
}


@@ -0,0 +1,193 @@
//! Trustpilot company reviews extractor.
//!
//! Trustpilot pages at `trustpilot.com/review/{domain}` embed a rich
//! JSON-LD `LocalBusiness` / `Organization` block with aggregate
//! rating + up to 20 recent reviews. No auth, no antibot for the
//! page HTML itself.
//!
//! Auto-dispatch would be safe here (the host is unique), but the module
//! is currently not registered in the catalog or dispatch; see the
//! Cloudflare note in mod.rs.
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::client::FetchClient;
use crate::error::FetchError;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "trustpilot_reviews",
label: "Trustpilot reviews",
description: "Returns company aggregate rating + recent reviews for a business on Trustpilot.",
url_patterns: &["https://www.trustpilot.com/review/{domain}"],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if !matches!(host, "www.trustpilot.com" | "trustpilot.com") {
return false;
}
url.contains("/review/")
}
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
let resp = client.fetch(url).await?;
if !(200..300).contains(&resp.status) {
return Err(FetchError::Build(format!(
"trustpilot_reviews: status {} for {url}",
resp.status
)));
}
let blocks = webclaw_core::structured_data::extract_json_ld(&resp.html);
let business = find_business(&blocks).ok_or_else(|| {
FetchError::BodyDecode(format!(
"trustpilot_reviews: no Organization/LocalBusiness JSON-LD on {url}"
))
})?;
let aggregate_rating = business.get("aggregateRating").map(|r| {
json!({
"rating_value": get_text(r, "ratingValue"),
"best_rating": get_text(r, "bestRating"),
"review_count": get_text(r, "reviewCount"),
})
});
let reviews: Vec<Value> = business
.get("review")
.and_then(|r| r.as_array())
.map(|arr| {
arr.iter()
.map(|r| {
json!({
"author": r.get("author")
.and_then(|a| a.get("name"))
.and_then(|n| n.as_str())
.map(String::from)
.or_else(|| r.get("author").and_then(|a| a.as_str()).map(String::from)),
"date_published": get_text(r, "datePublished"),
"name": get_text(r, "name"),
"body": get_text(r, "reviewBody"),
"rating_value": r.get("reviewRating")
.and_then(|rr| rr.get("ratingValue"))
.and_then(|v| v.as_str().map(String::from)
.or_else(|| v.as_f64().map(|n| n.to_string()))),
"language": get_text(r, "inLanguage"),
})
})
.collect()
})
.unwrap_or_default();
Ok(json!({
"url": url,
"name": get_text(&business, "name"),
"description": get_text(&business, "description"),
"logo": business.get("logo").and_then(|l| l.as_str()).map(String::from)
.or_else(|| business.get("logo").and_then(|l| l.get("url")).and_then(|v| v.as_str()).map(String::from)),
"telephone": get_text(&business, "telephone"),
"address": business.get("address").cloned(),
"same_as": business.get("sameAs").cloned(),
"aggregate_rating": aggregate_rating,
"review_count_listed": reviews.len(),
"reviews": reviews,
"business_schema": business.get("@type").cloned(),
}))
}
// ---------------------------------------------------------------------------
// JSON-LD walker — same pattern as ecommerce_product
// ---------------------------------------------------------------------------
fn find_business(blocks: &[Value]) -> Option<Value> {
for b in blocks {
if let Some(found) = find_business_in(b) {
return Some(found);
}
}
None
}
fn find_business_in(v: &Value) -> Option<Value> {
if is_business_type(v) {
return Some(v.clone());
}
if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
for item in graph {
if let Some(found) = find_business_in(item) {
return Some(found);
}
}
}
if let Some(arr) = v.as_array() {
for item in arr {
if let Some(found) = find_business_in(item) {
return Some(found);
}
}
}
None
}
fn is_business_type(v: &Value) -> bool {
let t = match v.get("@type") {
Some(t) => t,
None => return false,
};
let match_str = |s: &str| {
matches!(
s,
"Organization"
| "LocalBusiness"
| "Corporation"
| "OnlineBusiness"
| "Store"
| "Service"
)
};
match t {
Value::String(s) => match_str(s),
Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
_ => false,
}
}
fn get_text(v: &Value, key: &str) -> Option<String> {
v.get(key).and_then(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Number(n) => Some(n.to_string()),
_ => None,
})
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_trustpilot_review_urls() {
assert!(matches("https://www.trustpilot.com/review/stripe.com"));
assert!(matches("https://trustpilot.com/review/example.com"));
assert!(!matches("https://www.trustpilot.com/"));
assert!(!matches("https://example.com/review/foo"));
}
#[test]
fn is_business_type_handles_variants() {
use serde_json::json;
assert!(is_business_type(&json!({"@type": "Organization"})));
assert!(is_business_type(&json!({"@type": "LocalBusiness"})));
assert!(is_business_type(
&json!({"@type": ["Organization", "Corporation"]})
));
assert!(!is_business_type(&json!({"@type": "Product"})));
}
}