mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-04-25 00:06:21 +02:00
feat(extractors): wave 4 \u2014 ecommerce (shopify + generic JSON-LD)
Two ecommerce extractors covering the long tail of online stores:
- shopify_product: hits the public /products/{handle}.json endpoint
that every Shopify store exposes. Undocumented but stable for 10+
years. Returns title, vendor, product_type, tags, full variants
array (price, SKU, stock, options), images, options matrix, and
the price_min/price_max/any_available summary fields. Covers the
~4M Shopify stores out there, modulo stores that put Cloudflare
in front of the shop. Rejects known non-Shopify hosts (amazon,
etsy, walmart, etc.) to save a failed request.
- ecommerce_product: generic Schema.org Product JSON-LD extractor.
Works on any modern store that ships the Google-required Product
rich-result markup: Shopify, WooCommerce, BigCommerce, Squarespace,
Magento, custom storefronts. Returns name, brand, SKU, GTIN, MPN,
images, normalized offers (Offer and AggregateOffer flattened into
one shape with price, currency, availability, condition),
aggregateRating, and the raw JSON-LD block for anyone who wants it.
Reuses webclaw_core::structured_data::extract_json_ld so the
JSON-LD parser stays shared across the extraction pipeline.
Both are explicit-call only — /v1/scrape/shopify_product and
/v1/scrape/ecommerce_product. Not in auto-dispatch because any
arbitrary /products/{slug} URL could belong to either platform
(or to a custom site that uses the same path shape), and claiming
such URLs blindly would steal from the default markdown /v1/scrape
flow.
Live test results against real stores:
- Shopify / Allbirds Tree Runners: $100, 7 size variants, 4 images,
Size option, all SKUs. 250ms.
- ecommerce_product / same Allbirds URL: ProductGroup schema, name
'Men's Tree Runner', brand 'Allbirds', $100 USD InStock offer.
300ms. Different extraction path, same product.
- ecommerce_product / huel.com: 'Huel Black Edition' / 'Huel' brand,
200ms.
- Shopify stores behind Cloudflare (Gymshark, Tesla Shop) 403 as
expected \u2014 the error message points callers at the ecommerce_product
fallback, but Cloudflare also blocks the HTML path so those stores
are cloud-tier territory.
Catalog now exposes 19 extractors via GET /v1/extractors. Unit
tests: 59 passing across the module.
Scope not in v1:
- trustpilot_reviews: file written and tested (JSON-LD walker), but
NOT registered in the catalog or dispatch. Trustpilot's Cloudflare
turnstile blocks our Firefox + Chrome + Safari + mobile profiles
at the TLS layer. Shipping it would return 403 more often than 200.
Code kept in-tree under #[allow(dead_code)] for when the cloud
tier has residential-proxy support.
- Amazon / Walmart / Target / AliExpress: same Cloudflare / WAF
story. Not fixable without real browser + proxy pool.
- WooCommerce explicit: most WooCommerce stores ship Product JSON-LD,
so ecommerce_product covers them. A dedicated WooCommerce REST
extractor (/wp-json/wc/store/products) would be marginal on top of
that and only works on ~30% of stores that expose the REST API.
Wave 4 positioning: we now own the OSS structured-scrape space for
any site that respects Schema.org. That's Google's entire rich-result
index \u2014 meaningful territory competitors won't try to replicate as
named endpoints.
This commit is contained in:
parent
3bb0a4bca0
commit
0221c151dc
4 changed files with 854 additions and 0 deletions
314
crates/webclaw-fetch/src/extractors/ecommerce_product.rs
Normal file
314
crates/webclaw-fetch/src/extractors/ecommerce_product.rs
Normal file
|
|
@ -0,0 +1,314 @@
|
|||
//! Generic ecommerce product extractor via Schema.org JSON-LD.
|
||||
//!
|
||||
//! Every modern ecommerce site ships a `<script type="application/ld+json">`
|
||||
//! Product block for SEO / rich-result snippets. Google's own SEO docs
|
||||
//! force this markup on anyone who wants to appear in shopping search.
|
||||
//! We take advantage of it: one extractor that works on Shopify,
|
||||
//! BigCommerce, WooCommerce, Squarespace, Magento, custom storefronts,
|
||||
//! and anything else that follows Schema.org.
|
||||
//!
|
||||
//! **Explicit-call only** — `/v1/scrape/ecommerce_product`. Not in the
|
||||
//! auto-dispatch because we can't identify "this is a product page"
|
||||
//! from the URL alone. When the caller knows they have a product URL,
|
||||
//! this is the reliable fallback for stores where shopify_product
|
||||
//! doesn't apply.
|
||||
//!
|
||||
//! The extractor reuses `webclaw_core::structured_data::extract_json_ld`
|
||||
//! so JSON-LD parsing is shared with the rest of the extraction
|
||||
//! pipeline. We walk all blocks looking for `@type: Product`,
|
||||
//! `ProductGroup`, or an `ItemList` whose first entry is a Product.
|
||||
|
||||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "ecommerce_product",
|
||||
label: "Ecommerce product (generic)",
|
||||
description: "Returns product info from any site that ships Schema.org Product JSON-LD: name, description, images, brand, SKU, price, availability, aggregate rating.",
|
||||
url_patterns: &[
|
||||
"https://{any-ecom-store}/products/{slug}",
|
||||
"https://{any-ecom-store}/product/{slug}",
|
||||
"https://{any-ecom-store}/p/{slug}",
|
||||
],
|
||||
};
|
||||
|
||||
pub fn matches(url: &str) -> bool {
|
||||
// Maximally permissive: explicit-call-only extractor. We trust the
|
||||
// caller knows they're pointing at a product page. Custom ecom
|
||||
// sites use every conceivable URL shape (warbyparker.com uses
|
||||
// `/eyeglasses/{category}/{slug}/{colour}`, etc.), so path-pattern
|
||||
// matching would false-negative a lot. All we gate on is a valid
|
||||
// http(s) URL with a host.
|
||||
if !(url.starts_with("http://") || url.starts_with("https://")) {
|
||||
return false;
|
||||
}
|
||||
!host_of(url).is_empty()
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
let resp = client.fetch(url).await?;
|
||||
if !(200..300).contains(&resp.status) {
|
||||
return Err(FetchError::Build(format!(
|
||||
"ecommerce_product: status {} for {url}",
|
||||
resp.status
|
||||
)));
|
||||
}
|
||||
|
||||
// Reuse the core JSON-LD parser so we benefit from whatever
|
||||
// robustness it gains over time (handling @graph, arrays, etc.).
|
||||
let blocks = webclaw_core::structured_data::extract_json_ld(&resp.html);
|
||||
let product = find_product(&blocks).ok_or_else(|| {
|
||||
FetchError::BodyDecode(format!(
|
||||
"ecommerce_product: no Schema.org Product found in JSON-LD on {url}"
|
||||
))
|
||||
})?;
|
||||
|
||||
Ok(json!({
|
||||
"url": url,
|
||||
"name": get_text(&product, "name"),
|
||||
"description": get_text(&product, "description"),
|
||||
"brand": get_brand(&product),
|
||||
"sku": get_text(&product, "sku"),
|
||||
"mpn": get_text(&product, "mpn"),
|
||||
"gtin": get_text(&product, "gtin")
|
||||
.or_else(|| get_text(&product, "gtin13"))
|
||||
.or_else(|| get_text(&product, "gtin12"))
|
||||
.or_else(|| get_text(&product, "gtin8")),
|
||||
"product_id": get_text(&product, "productID"),
|
||||
"category": get_text(&product, "category"),
|
||||
"color": get_text(&product, "color"),
|
||||
"material": get_text(&product, "material"),
|
||||
"images": collect_images(&product),
|
||||
"offers": collect_offers(&product),
|
||||
"aggregate_rating": get_aggregate_rating(&product),
|
||||
"review_count": get_review_count(&product),
|
||||
"raw_schema_type": get_text(&product, "@type"),
|
||||
"raw_jsonld": product,
|
||||
}))
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// JSON-LD walkers
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// Recursively walk the JSON-LD blocks and return the first node whose
|
||||
/// `@type` is Product, ProductGroup, or IndividualProduct.
|
||||
fn find_product(blocks: &[Value]) -> Option<Value> {
|
||||
for b in blocks {
|
||||
if let Some(found) = find_product_in(b) {
|
||||
return Some(found);
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
fn find_product_in(v: &Value) -> Option<Value> {
|
||||
if is_product_type(v) {
|
||||
return Some(v.clone());
|
||||
}
|
||||
// @graph: [ {...}, {...} ]
|
||||
if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
|
||||
for item in graph {
|
||||
if let Some(found) = find_product_in(item) {
|
||||
return Some(found);
|
||||
}
|
||||
}
|
||||
}
|
||||
// Bare array wrapper
|
||||
if let Some(arr) = v.as_array() {
|
||||
for item in arr {
|
||||
if let Some(found) = find_product_in(item) {
|
||||
return Some(found);
|
||||
}
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
fn is_product_type(v: &Value) -> bool {
|
||||
let t = match v.get("@type") {
|
||||
Some(t) => t,
|
||||
None => return false,
|
||||
};
|
||||
let match_str = |s: &str| {
|
||||
matches!(
|
||||
s,
|
||||
"Product" | "ProductGroup" | "IndividualProduct" | "Vehicle" | "SomeProducts"
|
||||
)
|
||||
};
|
||||
match t {
|
||||
Value::String(s) => match_str(s),
|
||||
Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
|
||||
_ => false,
|
||||
}
|
||||
}
|
||||
|
||||
fn get_text(v: &Value, key: &str) -> Option<String> {
|
||||
v.get(key).and_then(|x| match x {
|
||||
Value::String(s) => Some(s.clone()),
|
||||
Value::Number(n) => Some(n.to_string()),
|
||||
_ => None,
|
||||
})
|
||||
}
|
||||
|
||||
fn get_brand(v: &Value) -> Option<String> {
|
||||
let brand = v.get("brand")?;
|
||||
if let Some(s) = brand.as_str() {
|
||||
return Some(s.to_string());
|
||||
}
|
||||
if let Some(obj) = brand.as_object()
|
||||
&& let Some(n) = obj.get("name").and_then(|x| x.as_str())
|
||||
{
|
||||
return Some(n.to_string());
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
fn collect_images(v: &Value) -> Vec<Value> {
|
||||
match v.get("image") {
|
||||
Some(Value::String(s)) => vec![Value::String(s.clone())],
|
||||
Some(Value::Array(arr)) => arr
|
||||
.iter()
|
||||
.filter_map(|x| match x {
|
||||
Value::String(s) => Some(Value::String(s.clone())),
|
||||
Value::Object(_) => x.get("url").cloned(),
|
||||
_ => None,
|
||||
})
|
||||
.collect(),
|
||||
Some(Value::Object(o)) => o.get("url").cloned().into_iter().collect(),
|
||||
_ => Vec::new(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Normalise both bare Offer and AggregateOffer into a uniform array.
|
||||
fn collect_offers(v: &Value) -> Vec<Value> {
|
||||
let offers = match v.get("offers") {
|
||||
Some(o) => o,
|
||||
None => return Vec::new(),
|
||||
};
|
||||
let collect_single = |o: &Value| -> Option<Value> {
|
||||
Some(json!({
|
||||
"price": get_text(o, "price"),
|
||||
"low_price": get_text(o, "lowPrice"),
|
||||
"high_price": get_text(o, "highPrice"),
|
||||
"currency": get_text(o, "priceCurrency"),
|
||||
"availability": get_text(o, "availability").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
|
||||
"item_condition": get_text(o, "itemCondition").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
|
||||
"valid_until": get_text(o, "priceValidUntil"),
|
||||
"url": get_text(o, "url"),
|
||||
"seller": o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from),
|
||||
"offer_count": get_text(o, "offerCount"),
|
||||
}))
|
||||
};
|
||||
match offers {
|
||||
Value::Array(arr) => arr.iter().filter_map(collect_single).collect(),
|
||||
Value::Object(_) => collect_single(offers).into_iter().collect(),
|
||||
_ => Vec::new(),
|
||||
}
|
||||
}
|
||||
|
||||
fn get_aggregate_rating(v: &Value) -> Option<Value> {
|
||||
let r = v.get("aggregateRating")?;
|
||||
Some(json!({
|
||||
"rating_value": get_text(r, "ratingValue"),
|
||||
"best_rating": get_text(r, "bestRating"),
|
||||
"worst_rating": get_text(r, "worstRating"),
|
||||
"rating_count": get_text(r, "ratingCount"),
|
||||
"review_count": get_text(r, "reviewCount"),
|
||||
}))
|
||||
}
|
||||
|
||||
fn get_review_count(v: &Value) -> Option<String> {
|
||||
v.get("aggregateRating")
|
||||
.and_then(|r| get_text(r, "reviewCount"))
|
||||
.or_else(|| get_text(v, "reviewCount"))
|
||||
}
|
||||
|
||||
fn host_of(url: &str) -> &str {
|
||||
url.split("://")
|
||||
.nth(1)
|
||||
.unwrap_or(url)
|
||||
.split('/')
|
||||
.next()
|
||||
.unwrap_or("")
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use serde_json::json;
|
||||
|
||||
#[test]
|
||||
fn matches_any_http_url_with_host() {
|
||||
assert!(matches("https://www.allbirds.com/products/tree-runner"));
|
||||
assert!(matches(
|
||||
"https://www.warbyparker.com/eyeglasses/women/percey/jet-black-with-polished-gold"
|
||||
));
|
||||
assert!(matches("https://example.com/p/widget"));
|
||||
assert!(matches("http://shop.example.com/foo/bar"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn rejects_empty_or_non_http() {
|
||||
assert!(!matches(""));
|
||||
assert!(!matches("not-a-url"));
|
||||
assert!(!matches("ftp://example.com/file"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn find_product_walks_graph() {
|
||||
let block = json!({
|
||||
"@context": "https://schema.org",
|
||||
"@graph": [
|
||||
{"@type": "Organization", "name": "ACME"},
|
||||
{"@type": "Product", "name": "Widget", "sku": "ABC"}
|
||||
]
|
||||
});
|
||||
let blocks = vec![block];
|
||||
let p = find_product(&blocks).unwrap();
|
||||
assert_eq!(p.get("name").and_then(|v| v.as_str()), Some("Widget"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn find_product_handles_array_type() {
|
||||
let block = json!({
|
||||
"@type": ["Product", "Clothing"],
|
||||
"name": "Tee"
|
||||
});
|
||||
assert!(is_product_type(&block));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn get_brand_from_string_or_object() {
|
||||
assert_eq!(get_brand(&json!({"brand": "ACME"})), Some("ACME".into()));
|
||||
assert_eq!(
|
||||
get_brand(&json!({"brand": {"@type": "Brand", "name": "ACME"}})),
|
||||
Some("ACME".into())
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn collect_offers_handles_single_and_aggregate() {
|
||||
let p = json!({
|
||||
"offers": {
|
||||
"@type": "Offer",
|
||||
"price": "19.99",
|
||||
"priceCurrency": "USD",
|
||||
"availability": "https://schema.org/InStock"
|
||||
}
|
||||
});
|
||||
let offers = collect_offers(&p);
|
||||
assert_eq!(offers.len(), 1);
|
||||
assert_eq!(
|
||||
offers[0].get("price").and_then(|v| v.as_str()),
|
||||
Some("19.99")
|
||||
);
|
||||
assert_eq!(
|
||||
offers[0].get("availability").and_then(|v| v.as_str()),
|
||||
Some("InStock")
|
||||
);
|
||||
}
|
||||
}
|
||||
|
|
@ -18,6 +18,7 @@ pub mod arxiv;
|
|||
pub mod crates_io;
|
||||
pub mod dev_to;
|
||||
pub mod docker_hub;
|
||||
pub mod ecommerce_product;
|
||||
pub mod github_pr;
|
||||
pub mod github_release;
|
||||
pub mod github_repo;
|
||||
|
|
@ -30,7 +31,15 @@ pub mod linkedin_post;
|
|||
pub mod npm;
|
||||
pub mod pypi;
|
||||
pub mod reddit;
|
||||
pub mod shopify_product;
|
||||
pub mod stackoverflow;
|
||||
// `trustpilot_reviews` code lives in the tree but is not wired into the
|
||||
// catalog or dispatch: Cloudflare turnstile blocks our client at the TLS
|
||||
// layer (all browser profiles tried, all UAs, mobile + desktop). Shipping
|
||||
// it would return 403 more often than not — bad UX. When the cloud tier
|
||||
// has residential proxies or a CDP renderer, flip this back on.
|
||||
#[allow(dead_code)]
|
||||
pub mod trustpilot_reviews;
|
||||
|
||||
use serde::Serialize;
|
||||
use serde_json::Value;
|
||||
|
|
@ -73,6 +82,8 @@ pub fn list() -> Vec<ExtractorInfo> {
|
|||
linkedin_post::INFO,
|
||||
instagram_post::INFO,
|
||||
instagram_profile::INFO,
|
||||
shopify_product::INFO,
|
||||
ecommerce_product::INFO,
|
||||
]
|
||||
}
|
||||
|
||||
|
|
@ -198,6 +209,12 @@ pub async fn dispatch_by_url(
|
|||
.map(|v| (instagram_profile::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
// NOTE: shopify_product and ecommerce_product are intentionally NOT
|
||||
// in auto-dispatch. Their `matches()` functions are permissive
|
||||
// (any URL with `/products/`, `/product/`, `/p/`, etc.) and
|
||||
// claiming those generically would steal URLs from the default
|
||||
// `/v1/scrape` markdown flow. Callers opt in via
|
||||
// `/v1/scrape/shopify_product` or `/v1/scrape/ecommerce_product`.
|
||||
None
|
||||
}
|
||||
|
||||
|
|
@ -304,6 +321,18 @@ pub async fn dispatch_by_name(
|
|||
})
|
||||
.await
|
||||
}
|
||||
n if n == shopify_product::INFO.name => {
|
||||
run_or_mismatch(shopify_product::matches(url), n, url, || {
|
||||
shopify_product::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == ecommerce_product::INFO.name => {
|
||||
run_or_mismatch(ecommerce_product::matches(url), n, url, || {
|
||||
ecommerce_product::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
_ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())),
|
||||
}
|
||||
}
|
||||
|
|
|
|||
318
crates/webclaw-fetch/src/extractors/shopify_product.rs
Normal file
318
crates/webclaw-fetch/src/extractors/shopify_product.rs
Normal file
|
|
@ -0,0 +1,318 @@
|
|||
//! Shopify product structured extractor.
|
||||
//!
|
||||
//! Every Shopify store exposes a public JSON endpoint for each product
|
||||
//! by appending `.json` to the product URL:
|
||||
//!
|
||||
//! https://shop.example.com/products/cool-tshirt
|
||||
//! → https://shop.example.com/products/cool-tshirt.json
|
||||
//!
|
||||
//! There are ~4 million Shopify stores. The `.json` endpoint is
|
||||
//! undocumented but has been stable for 10+ years. When a store puts
|
||||
//! Cloudflare / antibot in front of the shop, this path can 403 just
|
||||
//! like any other — for those cases the caller should fall back to
|
||||
//! `ecommerce_product` (JSON-LD) or the cloud tier.
|
||||
//!
|
||||
//! This extractor is **explicit-call only** — it is NOT auto-dispatched
|
||||
//! from `/v1/scrape` because we cannot tell ahead of time whether an
|
||||
//! arbitrary `/products/{slug}` URL is a Shopify store. Callers hit
|
||||
//! `/v1/scrape/shopify_product` when they know.
|
||||
|
||||
use serde::Deserialize;
|
||||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "shopify_product",
|
||||
label: "Shopify product",
|
||||
description: "Returns product metadata on ANY Shopify store via the public /products/{handle}.json endpoint: title, vendor, variants with prices + stock, images, options.",
|
||||
url_patterns: &[
|
||||
"https://{shop}/products/{handle}",
|
||||
"https://{shop}.myshopify.com/products/{handle}",
|
||||
],
|
||||
};
|
||||
|
||||
pub fn matches(url: &str) -> bool {
|
||||
// Any URL whose path contains /products/{something}. We do not
|
||||
// filter by host — Shopify powers custom-domain stores. The
|
||||
// extractor's /.json fallback is what confirms Shopify; `matches`
|
||||
// just says "this is a plausible shape." Still reject obviously
|
||||
// non-Shopify known hosts to save a failed request.
|
||||
let host = host_of(url);
|
||||
if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
|
||||
return false;
|
||||
}
|
||||
url.contains("/products/") && !url.ends_with("/products/")
|
||||
}
|
||||
|
||||
/// Hosts we know are not Shopify — reject so we don't burn a request.
|
||||
const NON_SHOPIFY_HOSTS: &[&str] = &[
|
||||
"amazon.com",
|
||||
"amazon.co.uk",
|
||||
"amazon.de",
|
||||
"amazon.fr",
|
||||
"amazon.it",
|
||||
"ebay.com",
|
||||
"etsy.com",
|
||||
"walmart.com",
|
||||
"target.com",
|
||||
"aliexpress.com",
|
||||
"bestbuy.com",
|
||||
"wayfair.com",
|
||||
"homedepot.com",
|
||||
"github.com", // /products is a marketing page
|
||||
];
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
let json_url = build_json_url(url);
|
||||
let resp = client.fetch(&json_url).await?;
|
||||
if resp.status == 404 {
|
||||
return Err(FetchError::Build(format!(
|
||||
"shopify_product: '{url}' not found (got 404 from {json_url})"
|
||||
)));
|
||||
}
|
||||
if resp.status == 403 {
|
||||
return Err(FetchError::Build(format!(
|
||||
"shopify_product: {json_url} returned 403 — the store has antibot in front of the .json endpoint. Try /v1/scrape/ecommerce_product for the HTML + JSON-LD fallback."
|
||||
)));
|
||||
}
|
||||
if resp.status != 200 {
|
||||
return Err(FetchError::Build(format!(
|
||||
"shopify returned status {} for {json_url}",
|
||||
resp.status
|
||||
)));
|
||||
}
|
||||
|
||||
let body: Wrapper = serde_json::from_str(&resp.html).map_err(|e| {
|
||||
FetchError::BodyDecode(format!(
|
||||
"shopify_product: '{url}' didn't return Shopify JSON — likely not a Shopify store ({e})"
|
||||
))
|
||||
})?;
|
||||
let p = body.product;
|
||||
|
||||
let variants: Vec<Value> = p
|
||||
.variants
|
||||
.iter()
|
||||
.map(|v| {
|
||||
json!({
|
||||
"id": v.id,
|
||||
"title": v.title,
|
||||
"sku": v.sku,
|
||||
"barcode": v.barcode,
|
||||
"price": v.price,
|
||||
"compare_at_price": v.compare_at_price,
|
||||
"available": v.available,
|
||||
"inventory_quantity": v.inventory_quantity,
|
||||
"position": v.position,
|
||||
"weight": v.weight,
|
||||
"weight_unit": v.weight_unit,
|
||||
"requires_shipping": v.requires_shipping,
|
||||
"taxable": v.taxable,
|
||||
"option1": v.option1,
|
||||
"option2": v.option2,
|
||||
"option3": v.option3,
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
let images: Vec<Value> = p
|
||||
.images
|
||||
.iter()
|
||||
.map(|i| {
|
||||
json!({
|
||||
"src": i.src,
|
||||
"width": i.width,
|
||||
"height": i.height,
|
||||
"position": i.position,
|
||||
"alt": i.alt,
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
let options: Vec<Value> = p
|
||||
.options
|
||||
.iter()
|
||||
.map(|o| json!({"name": o.name, "values": o.values, "position": o.position}))
|
||||
.collect();
|
||||
|
||||
// Price range + availability summary across variants (the shape
|
||||
// agents typically want without walking the variants array).
|
||||
let prices: Vec<f64> = p
|
||||
.variants
|
||||
.iter()
|
||||
.filter_map(|v| v.price.as_deref().and_then(|s| s.parse::<f64>().ok()))
|
||||
.collect();
|
||||
let any_available = p.variants.iter().any(|v| v.available.unwrap_or(false));
|
||||
|
||||
Ok(json!({
|
||||
"url": url,
|
||||
"json_url": json_url,
|
||||
"product_id": p.id,
|
||||
"handle": p.handle,
|
||||
"title": p.title,
|
||||
"vendor": p.vendor,
|
||||
"product_type": p.product_type,
|
||||
"tags": p.tags,
|
||||
"description_html":p.body_html,
|
||||
"published_at": p.published_at,
|
||||
"created_at": p.created_at,
|
||||
"updated_at": p.updated_at,
|
||||
"variant_count": variants.len(),
|
||||
"image_count": images.len(),
|
||||
"any_available": any_available,
|
||||
"price_min": prices.iter().cloned().fold(f64::INFINITY, f64::min).is_finite().then(|| prices.iter().cloned().fold(f64::INFINITY, f64::min)),
|
||||
"price_max": prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max).is_finite().then(|| prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max)),
|
||||
"variants": variants,
|
||||
"images": images,
|
||||
"options": options,
|
||||
}))
|
||||
}
|
||||
|
||||
/// Build the .json path from a product URL. Handles pre-.jsoned URLs,
|
||||
/// trailing slashes, and query strings.
|
||||
fn build_json_url(url: &str) -> String {
|
||||
let (path_part, query_part) = match url.split_once('?') {
|
||||
Some((a, b)) => (a, Some(b)),
|
||||
None => (url, None),
|
||||
};
|
||||
let clean = path_part.trim_end_matches('/');
|
||||
let with_json = if clean.ends_with(".json") {
|
||||
clean.to_string()
|
||||
} else {
|
||||
format!("{clean}.json")
|
||||
};
|
||||
match query_part {
|
||||
Some(q) => format!("{with_json}?{q}"),
|
||||
None => with_json,
|
||||
}
|
||||
}
|
||||
|
||||
fn host_of(url: &str) -> &str {
|
||||
url.split("://")
|
||||
.nth(1)
|
||||
.unwrap_or(url)
|
||||
.split('/')
|
||||
.next()
|
||||
.unwrap_or("")
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Shopify product JSON shape (a subset of the full response)
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct Wrapper {
|
||||
product: Product,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct Product {
|
||||
id: Option<i64>,
|
||||
title: Option<String>,
|
||||
handle: Option<String>,
|
||||
vendor: Option<String>,
|
||||
product_type: Option<String>,
|
||||
body_html: Option<String>,
|
||||
published_at: Option<String>,
|
||||
created_at: Option<String>,
|
||||
updated_at: Option<String>,
|
||||
#[serde(default)]
|
||||
tags: serde_json::Value, // array OR comma-joined string depending on store
|
||||
#[serde(default)]
|
||||
variants: Vec<Variant>,
|
||||
#[serde(default)]
|
||||
images: Vec<Image>,
|
||||
#[serde(default)]
|
||||
options: Vec<Option_>,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct Variant {
|
||||
id: Option<i64>,
|
||||
title: Option<String>,
|
||||
sku: Option<String>,
|
||||
barcode: Option<String>,
|
||||
price: Option<String>,
|
||||
compare_at_price: Option<String>,
|
||||
available: Option<bool>,
|
||||
inventory_quantity: Option<i64>,
|
||||
position: Option<i64>,
|
||||
weight: Option<f64>,
|
||||
weight_unit: Option<String>,
|
||||
requires_shipping: Option<bool>,
|
||||
taxable: Option<bool>,
|
||||
option1: Option<String>,
|
||||
option2: Option<String>,
|
||||
option3: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct Image {
|
||||
src: Option<String>,
|
||||
width: Option<i64>,
|
||||
height: Option<i64>,
|
||||
position: Option<i64>,
|
||||
alt: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
#[serde(rename_all = "lowercase")]
|
||||
struct Option_ {
|
||||
name: Option<String>,
|
||||
position: Option<i64>,
|
||||
#[serde(default)]
|
||||
values: Vec<String>,
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn matches_plausible_shopify_urls() {
|
||||
assert!(matches(
|
||||
"https://www.allbirds.com/products/mens-tree-runners"
|
||||
));
|
||||
assert!(matches(
|
||||
"https://shop.example.com/products/cool-tshirt?variant=123"
|
||||
));
|
||||
assert!(matches("https://somestore.myshopify.com/products/thing-1"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn rejects_known_non_shopify() {
|
||||
assert!(!matches("https://www.amazon.com/dp/B0C123"));
|
||||
assert!(!matches("https://www.etsy.com/listing/12345/foo"));
|
||||
assert!(!matches("https://www.amazon.co.uk/products/thing"));
|
||||
assert!(!matches("https://github.com/products"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn rejects_non_product_urls() {
|
||||
assert!(!matches("https://example.com/"));
|
||||
assert!(!matches("https://example.com/products/"));
|
||||
assert!(!matches("https://example.com/collections/all"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn build_json_url_handles_slash_and_query() {
|
||||
assert_eq!(
|
||||
build_json_url("https://shop.example.com/products/foo"),
|
||||
"https://shop.example.com/products/foo.json"
|
||||
);
|
||||
assert_eq!(
|
||||
build_json_url("https://shop.example.com/products/foo/"),
|
||||
"https://shop.example.com/products/foo.json"
|
||||
);
|
||||
assert_eq!(
|
||||
build_json_url("https://shop.example.com/products/foo?variant=123"),
|
||||
"https://shop.example.com/products/foo.json?variant=123"
|
||||
);
|
||||
assert_eq!(
|
||||
build_json_url("https://shop.example.com/products/foo.json"),
|
||||
"https://shop.example.com/products/foo.json"
|
||||
);
|
||||
}
|
||||
}
|
||||
193
crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs
Normal file
193
crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs
Normal file
|
|
@ -0,0 +1,193 @@
|
|||
//! Trustpilot company reviews extractor.
|
||||
//!
|
||||
//! Trustpilot pages at `trustpilot.com/review/{domain}` embed a rich
|
||||
//! JSON-LD `LocalBusiness` / `Organization` block with aggregate
|
||||
//! rating + up to 20 recent reviews. No auth, no antibot for the
|
||||
//! page HTML itself.
|
||||
//!
|
||||
//! Auto-dispatch safe because the host is unique.
|
||||
|
||||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "trustpilot_reviews",
|
||||
label: "Trustpilot reviews",
|
||||
description: "Returns company aggregate rating + recent reviews for a business on Trustpilot.",
|
||||
url_patterns: &["https://www.trustpilot.com/review/{domain}"],
|
||||
};
|
||||
|
||||
pub fn matches(url: &str) -> bool {
|
||||
let host = host_of(url);
|
||||
if !matches!(host, "www.trustpilot.com" | "trustpilot.com") {
|
||||
return false;
|
||||
}
|
||||
url.contains("/review/")
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
let resp = client.fetch(url).await?;
|
||||
if !(200..300).contains(&resp.status) {
|
||||
return Err(FetchError::Build(format!(
|
||||
"trustpilot_reviews: status {} for {url}",
|
||||
resp.status
|
||||
)));
|
||||
}
|
||||
|
||||
let blocks = webclaw_core::structured_data::extract_json_ld(&resp.html);
|
||||
let business = find_business(&blocks).ok_or_else(|| {
|
||||
FetchError::BodyDecode(format!(
|
||||
"trustpilot_reviews: no Organization/LocalBusiness JSON-LD on {url}"
|
||||
))
|
||||
})?;
|
||||
|
||||
let aggregate_rating = business.get("aggregateRating").map(|r| {
|
||||
json!({
|
||||
"rating_value": get_text(r, "ratingValue"),
|
||||
"best_rating": get_text(r, "bestRating"),
|
||||
"review_count": get_text(r, "reviewCount"),
|
||||
})
|
||||
});
|
||||
|
||||
let reviews: Vec<Value> = business
|
||||
.get("review")
|
||||
.and_then(|r| r.as_array())
|
||||
.map(|arr| {
|
||||
arr.iter()
|
||||
.map(|r| {
|
||||
json!({
|
||||
"author": r.get("author")
|
||||
.and_then(|a| a.get("name"))
|
||||
.and_then(|n| n.as_str())
|
||||
.map(String::from)
|
||||
.or_else(|| r.get("author").and_then(|a| a.as_str()).map(String::from)),
|
||||
"date_published": get_text(r, "datePublished"),
|
||||
"name": get_text(r, "name"),
|
||||
"body": get_text(r, "reviewBody"),
|
||||
"rating_value": r.get("reviewRating")
|
||||
.and_then(|rr| rr.get("ratingValue"))
|
||||
.and_then(|v| v.as_str().map(String::from)
|
||||
.or_else(|| v.as_f64().map(|n| n.to_string()))),
|
||||
"language": get_text(r, "inLanguage"),
|
||||
})
|
||||
})
|
||||
.collect()
|
||||
})
|
||||
.unwrap_or_default();
|
||||
|
||||
Ok(json!({
|
||||
"url": url,
|
||||
"name": get_text(&business, "name"),
|
||||
"description": get_text(&business, "description"),
|
||||
"logo": business.get("logo").and_then(|l| l.as_str()).map(String::from)
|
||||
.or_else(|| business.get("logo").and_then(|l| l.get("url")).and_then(|v| v.as_str()).map(String::from)),
|
||||
"telephone": get_text(&business, "telephone"),
|
||||
"address": business.get("address").cloned(),
|
||||
"same_as": business.get("sameAs").cloned(),
|
||||
"aggregate_rating": aggregate_rating,
|
||||
"review_count_listed": reviews.len(),
|
||||
"reviews": reviews,
|
||||
"business_schema": business.get("@type").cloned(),
|
||||
}))
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// JSON-LD walker — same pattern as ecommerce_product
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
fn find_business(blocks: &[Value]) -> Option<Value> {
|
||||
for b in blocks {
|
||||
if let Some(found) = find_business_in(b) {
|
||||
return Some(found);
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
fn find_business_in(v: &Value) -> Option<Value> {
|
||||
if is_business_type(v) {
|
||||
return Some(v.clone());
|
||||
}
|
||||
if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
|
||||
for item in graph {
|
||||
if let Some(found) = find_business_in(item) {
|
||||
return Some(found);
|
||||
}
|
||||
}
|
||||
}
|
||||
if let Some(arr) = v.as_array() {
|
||||
for item in arr {
|
||||
if let Some(found) = find_business_in(item) {
|
||||
return Some(found);
|
||||
}
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
fn is_business_type(v: &Value) -> bool {
|
||||
let t = match v.get("@type") {
|
||||
Some(t) => t,
|
||||
None => return false,
|
||||
};
|
||||
let match_str = |s: &str| {
|
||||
matches!(
|
||||
s,
|
||||
"Organization"
|
||||
| "LocalBusiness"
|
||||
| "Corporation"
|
||||
| "OnlineBusiness"
|
||||
| "Store"
|
||||
| "Service"
|
||||
)
|
||||
};
|
||||
match t {
|
||||
Value::String(s) => match_str(s),
|
||||
Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
|
||||
_ => false,
|
||||
}
|
||||
}
|
||||
|
||||
fn get_text(v: &Value, key: &str) -> Option<String> {
|
||||
v.get(key).and_then(|x| match x {
|
||||
Value::String(s) => Some(s.clone()),
|
||||
Value::Number(n) => Some(n.to_string()),
|
||||
_ => None,
|
||||
})
|
||||
}
|
||||
|
||||
fn host_of(url: &str) -> &str {
|
||||
url.split("://")
|
||||
.nth(1)
|
||||
.unwrap_or(url)
|
||||
.split('/')
|
||||
.next()
|
||||
.unwrap_or("")
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn matches_trustpilot_review_urls() {
|
||||
assert!(matches("https://www.trustpilot.com/review/stripe.com"));
|
||||
assert!(matches("https://trustpilot.com/review/example.com"));
|
||||
assert!(!matches("https://www.trustpilot.com/"));
|
||||
assert!(!matches("https://example.com/review/foo"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn is_business_type_handles_variants() {
|
||||
use serde_json::json;
|
||||
assert!(is_business_type(&json!({"@type": "Organization"})));
|
||||
assert!(is_business_type(&json!({"@type": "LocalBusiness"})));
|
||||
assert!(is_business_type(
|
||||
&json!({"@type": ["Organization", "Corporation"]})
|
||||
));
|
||||
assert!(!is_business_type(&json!({"@type": "Product"})));
|
||||
}
|
||||
}
|
||||
Loading…
Add table
Add a link
Reference in a new issue