fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs)

Addresses the four follow-ups surfaced by the cloud-key smoke test.

trustpilot_reviews — full rewrite for 2025 schema:
- Trustpilot moved from single-Organization+aggregateRating to three
  separate JSON-LD blocks: a site-level Organization (Trustpilot
  itself), a Dataset with a csvw:Table mainEntity carrying the
  per-star distribution for the target business, and an aiSummary +
  aiSummaryReviews block with the AI-generated summary and recent
  review objects.
- Parser now: skips the site-level Org, walks @graph as either array
  or single object, picks the Dataset whose about.@id references the
  target domain, parses each csvw:column for rating buckets, computes
  weighted-average rating + total from the distribution, extracts the
  aiSummary text, and turns aiSummaryReviews into a clean reviews
  array with author/country/date/rating/title/text/likes.
- OG-title regex fallbacks for business_name, rating_label, and
  average_rating when the Dataset block is absent. OG-description
  regex for review_count.
- Returned shape: url, domain, business_name, rating_label,
  average_rating, review_count, rating_distribution (per-star count
  and percent), ai_summary, recent_reviews, review_count_listed,
  data_source.
- Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 /
  226 reviews with full distribution + AI summary + 2 recent reviews.
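
The weighted-average-from-distribution step above can be sketched in isolation. This is a minimal illustration, not the extractor's actual compute_rating_stats; the bucket counts in the example are hypothetical:

```rust
// Minimal sketch of the weighted-average computation: bucket index i
// holds the count of (i + 1)-star reviews; the score is formatted to
// one decimal place, the way Trustpilot displays it.
fn weighted_average(buckets: [i64; 5]) -> Option<String> {
    let total: i64 = buckets.iter().sum();
    if total == 0 {
        return None; // no reviews, no score
    }
    let weighted: i64 = buckets
        .iter()
        .enumerate()
        .map(|(i, n)| (i as i64 + 1) * n)
        .sum();
    Some(format!("{:.1}", weighted as f64 / total as f64))
}

fn main() {
    // Hypothetical distribution dominated by 1-star reviews.
    assert_eq!(
        weighted_average([185, 18, 8, 6, 9]).as_deref(),
        Some("1.4")
    );
    assert_eq!(weighted_average([0; 5]), None);
}
```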

amazon_product — force-cloud-escalation + OG fallback:
- Amazon serves Product JSON-LD intermittently even on non-CAPTCHA
  pages. When local fetch returns HTML without Product JSON-LD and
  a cloud client is configured, force-escalate to the cloud path,
  which reliably surfaces title + description via its render engine.
- New OG meta-tag fallback for title/image/description so the
  cloud's synthesize_html output (OG tags only, no #productTitle DOM
  ID) still yields useful data. Real Amazon pages still prefer the
  DOM regex.
- Verified live: B0BSHF7WHW escalates to cloud, returns Apple
  MacBook Pro title + description + asin.

etsy_listing — slug humanization + generic-page filtering + shop
from brand:
- Etsy serves various placeholder pages when a listing is delisted,
  blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...",
  and "This item is unavailable - Etsy", plus the OG description
  "Sorry, the page you were looking for was not found." The
  is_generic_* helpers catch all of these shapes.
- When the OG title is generic, humanise the URL slug: the path
  `/listing/123456789/personalized-stainless-steel-tumbler` becomes
  `Personalized Stainless Steel Tumbler` so callers always get a
  meaningful title even on dead listings.
- Etsy uses `brand` (top-level JSON-LD field) for the shop name on
  listings that don't ship offers[].seller.name. Shop now falls
  through offers -> brand so either schema resolves.
- Verified live: listing/1097462299 returns full rich data
  (title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating /
  225 reviews, InStock).
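
The title-resolution order above can be sketched compactly. The helpers here are simplified stand-ins for the real is_generic_title / humanise_slug functions, not the parser itself:

```rust
// Sketch of the three-step title fallback: JSON-LD name first, then a
// non-generic og:title, then the humanised URL slug.
fn is_generic(t: &str) -> bool {
    let t = t.trim().to_lowercase();
    t == "etsy.com" || t == "etsy" || t.starts_with("etsy - ")
}

fn humanise(slug: &str) -> Option<String> {
    let words: Vec<String> = slug
        .split(['-', '_'])
        .filter(|w| !w.is_empty())
        .map(|w| {
            let mut c = w.chars();
            match c.next() {
                Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
                None => String::new(),
            }
        })
        .collect();
    (!words.is_empty()).then(|| words.join(" "))
}

fn resolve_title(
    jsonld_name: Option<&str>,
    og_title: Option<&str>,
    slug: Option<&str>,
) -> Option<String> {
    jsonld_name
        .map(String::from)
        .or_else(|| og_title.filter(|t| !is_generic(t)).map(String::from))
        .or_else(|| slug.and_then(humanise))
}

fn main() {
    // Generic og:title on a dead listing: the slug wins.
    assert_eq!(
        resolve_title(None, Some("etsy.com"), Some("vintage-typewriter")).as_deref(),
        Some("Vintage Typewriter")
    );
    // A real og:title beats the slug.
    assert_eq!(
        resolve_title(None, Some("Real Listing Title"), Some("the-url-slug")).as_deref(),
        Some("Real Listing Title")
    );
}
```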

cloud.rs — module doc update:
- Added an architecture section documenting that api.webclaw.io does
  not return raw HTML by design and that [`synthesize_html`]
  reassembles the parsed response (metadata + structured_data +
  markdown) back into minimal HTML so existing local parsers run
  unchanged across both paths. Also notes the DOM-regex limitation
  for extractors that need live-page-specific DOM IDs.
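
The reassembly idea can be illustrated with a toy version. This is a hypothetical sketch of the shape the docs describe (metadata becomes og: meta tags, structured_data becomes ld+json scripts, markdown lands in a pre); the real [`synthesize_html`] implementation differs:

```rust
// Toy reassembly of a parsed cloud bundle into minimal synthetic HTML,
// mirroring the structure described in the module docs. Field names
// follow the doc comment; everything else is illustrative.
fn synthesize_html(
    metadata: &[(&str, &str)],
    structured_data: &[&str],
    markdown: &str,
) -> String {
    let mut html = String::from("<html><head>");
    for (prop, content) in metadata {
        // Escape & before " so attribute values stay well-formed.
        let content = content.replace('&', "&amp;").replace('"', "&quot;");
        html.push_str(&format!(
            r#"<meta property="og:{prop}" content="{content}">"#
        ));
    }
    for block in structured_data {
        html.push_str(&format!(
            r#"<script type="application/ld+json">{block}</script>"#
        ));
    }
    html.push_str("</head><body><pre>");
    html.push_str(markdown);
    html.push_str("</pre></body></html>");
    html
}

fn main() {
    let html = synthesize_html(
        &[("title", "Example Product")],
        &[r#"{"@type":"Product","name":"Example Product"}"#],
        "# Example Product",
    );
    assert!(html.contains(r#"<meta property="og:title" content="Example Product">"#));
    assert!(html.contains("application/ld+json"));
}
```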

Tests: 215 passing in webclaw-fetch (18 new), clippy clean.
Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY:
28/28 clean, 0 partial, 0 failed.
Valerio 2026-04-22 17:49:50 +02:00
parent e10066f527
commit b2e7dbf365
4 changed files with 825 additions and 172 deletions


@ -24,6 +24,37 @@
//! parser on it. Returns the typed [`CloudError`] so extractors can
//! emit precise "upgrade your plan" / "invalid key" messages.
//!
//! ## Cloud response shape and [`synthesize_html`]
//!
//! `api.webclaw.io/v1/scrape` deliberately does **not** return a
//! `html` field even when `formats=["html"]` is requested. By design
//! the cloud API returns a parsed bundle:
//!
//! ```text
//! {
//! "url": "https://...",
//! "metadata": { title, description, image, site_name, ... }, // OG / meta tags
//! "structured_data": [ { "@type": "...", ... }, ... ], // JSON-LD blocks
//! "markdown": "# Page Title\n\n...", // cleaned markdown
//! "antibot": { engine, path, user_agent }, // bypass telemetry
//! "cache": { status, age_seconds }
//! }
//! ```
//!
//! [`CloudClient::fetch_html`] reassembles that bundle back into a
//! minimal synthetic HTML document so the existing local extractor
//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
//! cloud output. Each `structured_data` entry becomes a
//! `<script type="application/ld+json">` tag; each `metadata` field
//! becomes a `<meta property="og:...">` tag; `markdown` lands in a
//! `<pre>` inside the body. Callers that walk Schema.org blocks see
//! exactly what they'd see on a real live page.
//!
//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
//! won't hit on the synthesised HTML — those IDs only exist on live
//! Amazon pages. Extractors that need DOM regex keep OG meta tag
//! fallbacks for that reason.
//!
//! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
//! signup when a site is blocked; nothing fails silently. Cloud users
//! get the escalation for free.


@ -1,16 +1,25 @@
//! Amazon product detail page extractor.
//!
//! Amazon product pages (`/dp/{ASIN}/` on every locale) always return
//! a "Sorry, we need to verify you're human" interstitial to any
//! client without a warm Amazon session + residential IP. Detection
//! fires immediately in [`cloud::is_bot_protected`] via the dedicated
//! Amazon heuristic, so this extractor always hits the cloud fallback
//! path in practice.
//! Amazon product pages (`/dp/{ASIN}/` on every locale) are
//! inconsistently protected. Sometimes our local TLS fingerprint gets
//! a real HTML page; sometimes we land on a CAPTCHA interstitial;
//! sometimes we land on a real page that for whatever reason ships
//! no Product JSON-LD (Amazon A/B-tests this regularly). So the
//! extractor has a two-stage fallback:
//!
//! Parsing logic works on the final HTML, local or cloud-sourced. We
//! read the product details primarily from JSON-LD `Product` blocks
//! (Amazon exposes a solid subset for SEO) plus a couple of Amazon-
//! specific DOM IDs picked up with cheap regex.
//! 1. Try local fetch + parse. If we got Product JSON-LD back, great:
//! we have everything (title, brand, price, availability, rating).
//! 2. If local fetch worked *but the page has no Product JSON-LD* AND
//! a cloud client is configured, force-escalate to api.webclaw.io.
//! Cloud's render + antibot pipeline reliably surfaces the
//! structured data. Without a cloud client we return whatever we
//! got from local (usually just title via `#productTitle` or OG
//! meta tags).
//!
//! Parsing tries JSON-LD first, DOM regex (`#productTitle`,
//! `#landingImage`) second, OG `<meta>` tags third. The OG path
//! matters because the cloud's synthesized HTML ships metadata as
//! OG tags but lacks Amazon's DOM IDs.
//!
//! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/`
//! path. ASINs are a stable Amazon identifier so we extract that as
@ -54,10 +63,36 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
let asin = parse_asin(url)
.ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;
let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
let mut fetched = cloud::smart_fetch_html(client, client.cloud(), url)
.await
.map_err(cloud_to_fetch_err)?;
// Amazon ships Product JSON-LD inconsistently even on non-CAPTCHA
// pages (they A/B-test it). When local fetch succeeded but has no
// Product JSON-LD, force-escalate to the cloud which runs the
// render pipeline and reliably surfaces structured data. No-op
// when cloud isn't configured — we return whatever local gave us.
if fetched.source == cloud::FetchSource::Local
&& find_product_jsonld(&fetched.html).is_none()
&& let Some(c) = client.cloud()
{
match c.fetch_html(url).await {
Ok(cloud_html) => {
fetched = cloud::FetchedHtml {
html: cloud_html,
final_url: url.to_string(),
source: cloud::FetchSource::Cloud,
};
}
Err(e) => {
tracing::debug!(
error = %e,
"amazon_product: cloud escalation failed, keeping local"
);
}
}
}
let mut data = parse(&fetched.html, url, &asin);
if let Some(obj) = data.as_object_mut() {
obj.insert(
@ -77,16 +112,23 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
/// without carrying webclaw_fetch types.
pub fn parse(html: &str, url: &str, asin: &str) -> Value {
let jsonld = find_product_jsonld(html);
// Three-tier title: JSON-LD `name` > Amazon's `#productTitle` span
// (only present on real static HTML) > cloud-synthesized og:title.
let title = jsonld
.as_ref()
.and_then(|v| get_text(v, "name"))
.or_else(|| dom_title(html));
.or_else(|| dom_title(html))
.or_else(|| og(html, "title"));
let image = jsonld
.as_ref()
.and_then(get_first_image)
.or_else(|| dom_image(html));
.or_else(|| dom_image(html))
.or_else(|| og(html, "image"));
let brand = jsonld.as_ref().and_then(get_brand);
let description = jsonld.as_ref().and_then(|v| get_text(v, "description"));
let description = jsonld
.as_ref()
.and_then(|v| get_text(v, "description"))
.or_else(|| og(html, "description"));
let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
let offer = jsonld.as_ref().and_then(first_offer);
@ -267,6 +309,31 @@ fn dom_image(html: &str) -> Option<String> {
.map(|m| m.as_str().to_string())
}
/// OG meta tag lookup. Cloud-synthesized HTML ships these even when
/// JSON-LD and Amazon-DOM-IDs are both absent, so they're the last
/// line of defence for `title`, `image`, `description`.
fn og(html: &str, prop: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == prop) {
return c.get(2).map(|m| html_unescape(m.as_str()));
}
}
None
}
/// Undo the synthesize_html attribute escaping for the few entities it
/// emits. Keeps us off a heavier HTML-entity dep.
fn html_unescape(s: &str) -> String {
s.replace("&quot;", "\"")
.replace("&amp;", "&")
.replace("&lt;", "<")
.replace("&gt;", ">")
}
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
FetchError::Build(e.to_string())
}
@ -358,4 +425,28 @@ mod tests {
"https://m.media-amazon.com/images/I/fallback.jpg"
);
}
#[test]
fn parse_falls_back_to_og_meta_when_no_jsonld_no_dom() {
// Shape we see from the cloud synthesize_html path: OG tags
// only, no JSON-LD, no Amazon DOM IDs.
let html = r##"<html><head>
<meta property="og:title" content="Cloud-sourced MacBook Pro">
<meta property="og:image" content="https://m.media-amazon.com/images/I/cloud.jpg">
<meta property="og:description" content="Via api.webclaw.io">
</head></html>"##;
let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
assert_eq!(v["title"], "Cloud-sourced MacBook Pro");
assert_eq!(v["image"], "https://m.media-amazon.com/images/I/cloud.jpg");
assert_eq!(v["description"], "Via api.webclaw.io");
}
#[test]
fn og_unescape_handles_quot_entity() {
let html = r#"<meta property="og:title" content="Apple &quot;M2 Pro&quot; Laptop">"#;
assert_eq!(
og(html, "title").as_deref(),
Some(r#"Apple "M2 Pro" Laptop"#)
);
}
}


@ -10,6 +10,15 @@
//! but some listings return a CF interstitial. We route through
//! `cloud::smart_fetch_html` so both paths resolve to the same parser,
//! same as `ebay_listing`.
//!
//! ## URL slug as last-resort title
//!
//! Even with cloud antibot bypass, Etsy frequently serves a generic
//! page with minimal metadata (`og:title = "etsy.com"`, no JSON-LD,
//! empty markdown). In that case we humanise the slug from the URL
//! (`/listing/{id}/personalized-stainless-steel-tumbler` becomes
//! "Personalized Stainless Steel Tumbler") so callers always get a
//! meaningful title. Degrades gracefully when the URL has no slug.
use std::sync::OnceLock;
@ -63,15 +72,17 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
let jsonld = find_product_jsonld(html);
let slug_title = humanise_slug(parse_slug(url).as_deref());
let title = jsonld
.as_ref()
.and_then(|v| get_text(v, "name"))
.or_else(|| og(html, "title"));
.or_else(|| og(html, "title").filter(|t| !is_generic_title(t)))
.or(slug_title);
let description = jsonld
.as_ref()
.and_then(|v| get_text(v, "description"))
.or_else(|| og(html, "description"));
.or_else(|| og(html, "description").filter(|d| !is_generic_description(d)));
let image = jsonld
.as_ref()
.and_then(get_first_image)
@ -98,13 +109,18 @@ pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
.and_then(|v| get_text(v, "itemCondition"))
.map(strip_schema_prefix);
// Shop name lives under offers[0].seller.name on Etsy.
let shop = offer.as_ref().and_then(|o| {
o.get("seller")
.and_then(|s| s.get("name"))
.and_then(|n| n.as_str())
.map(String::from)
});
// Shop name: offers[0].seller.name on newer listings, top-level
// `brand` on older listings (Etsy changed the schema around 2022).
// Fall back through both so either shape resolves.
let shop = offer
.as_ref()
.and_then(|o| {
o.get("seller")
.and_then(|s| s.get("name"))
.and_then(|n| n.as_str())
.map(String::from)
})
.or_else(|| brand.clone());
let shop_url = shop_url_from_html(html);
let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
@ -158,6 +174,87 @@ fn parse_listing_id(url: &str) -> Option<String> {
.map(|m| m.as_str().to_string())
}
/// Extract the URL slug after the listing id, e.g.
/// `personalized-stainless-steel-tumbler`. Returns `None` when the URL
/// is the bare `/listing/{id}` shape.
fn parse_slug(url: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r"/listing/\d{6,}/([^/?#]+)").unwrap());
re.captures(url)
.and_then(|c| c.get(1))
.map(|m| m.as_str().to_string())
}
/// Turn a URL slug into a human-ish title:
/// `personalized-stainless-steel-tumbler` → `Personalized Stainless
/// Steel Tumbler`. Word-cap each dash-separated token; preserves
/// underscores as spaces too. Returns `None` on empty input.
fn humanise_slug(slug: Option<&str>) -> Option<String> {
let raw = slug?.trim();
if raw.is_empty() {
return None;
}
let words: Vec<String> = raw
.split(['-', '_'])
.filter(|w| !w.is_empty())
.map(capitalise_word)
.collect();
if words.is_empty() {
None
} else {
Some(words.join(" "))
}
}
fn capitalise_word(w: &str) -> String {
let mut chars = w.chars();
match chars.next() {
Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
None => String::new(),
}
}
/// True when the OG title is Etsy's fallback-page title rather than a
/// listing-specific title. Expired / region-blocked / antibot-filtered
/// pages return Etsy's sitewide tagline:
/// `"Etsy - Your place to buy and sell all things handmade..."`, or
/// simply `"etsy.com"`. A real listing title always starts with the
/// item name, never with "Etsy - " or the domain.
fn is_generic_title(t: &str) -> bool {
let normalised = t.trim().to_lowercase();
if matches!(
normalised.as_str(),
"etsy.com" | "etsy" | "www.etsy.com" | ""
) {
return true;
}
// Etsy's sitewide marketing tagline, served on 404 / blocked pages.
if normalised.starts_with("etsy - ")
|| normalised.starts_with("etsy.com - ")
|| normalised.starts_with("etsy uk - ")
{
return true;
}
// Etsy's "item unavailable" placeholder, served on delisted
// products. Keep the slug fallback so callers still see what the
// URL was about.
normalised.starts_with("this item is unavailable")
|| normalised.starts_with("sorry, this item is")
|| normalised == "item not available - etsy"
}
/// True when the OG description is an Etsy error-page placeholder or
/// sitewide marketing blurb rather than a real listing description.
fn is_generic_description(d: &str) -> bool {
let normalised = d.trim().to_lowercase();
if normalised.is_empty() {
return true;
}
normalised.starts_with("sorry, the page you were looking for")
|| normalised.starts_with("page not found")
|| normalised.starts_with("find the perfect handmade gift")
}
// ---------------------------------------------------------------------------
// JSON-LD walkers (same shape as ebay_listing; kept separate so the two
// extractors can diverge without cross-impact)
@ -388,4 +485,88 @@ mod tests {
// No price fields when we only have OG.
assert!(v["price"].is_null());
}
#[test]
fn parse_slug_from_url() {
assert_eq!(
parse_slug("https://www.etsy.com/listing/123456789/vintage-typewriter"),
Some("vintage-typewriter".into())
);
assert_eq!(
parse_slug("https://www.etsy.com/listing/123456789/slug?ref=shop"),
Some("slug".into())
);
assert_eq!(parse_slug("https://www.etsy.com/listing/123456789"), None);
assert_eq!(
parse_slug("https://www.etsy.com/fr/listing/123456789/slug"),
Some("slug".into())
);
}
#[test]
fn humanise_slug_capitalises_each_word() {
assert_eq!(
humanise_slug(Some("personalized-stainless-steel-tumbler")).as_deref(),
Some("Personalized Stainless Steel Tumbler")
);
assert_eq!(
humanise_slug(Some("hand_crafted_mug")).as_deref(),
Some("Hand Crafted Mug")
);
assert_eq!(humanise_slug(Some("")), None);
assert_eq!(humanise_slug(None), None);
}
#[test]
fn is_generic_title_catches_common_shapes() {
assert!(is_generic_title("etsy.com"));
assert!(is_generic_title("Etsy"));
assert!(is_generic_title(" etsy.com "));
assert!(is_generic_title(
"Etsy - Your place to buy and sell all things handmade, vintage, and supplies"
));
assert!(is_generic_title("Etsy UK - Vintage & Handmade"));
assert!(!is_generic_title("Vintage Typewriter"));
assert!(!is_generic_title("Handmade Etsy-style Mug"));
}
#[test]
fn is_generic_description_catches_404_shapes() {
assert!(is_generic_description(""));
assert!(is_generic_description(
"Sorry, the page you were looking for was not found."
));
assert!(is_generic_description("Page not found"));
assert!(!is_generic_description(
"Hand-thrown ceramic mug, dishwasher safe."
));
}
#[test]
fn parse_uses_slug_when_og_is_generic() {
// Cloud-blocked Etsy listing: og:title is a site-wide generic
// placeholder, no JSON-LD, no description. Slug should win.
let html = r#"<html><head>
<meta property="og:title" content="etsy.com">
</head></html>"#;
let v = parse(
html,
"https://www.etsy.com/listing/1079113183/personalized-stainless-steel-tumbler",
"1079113183",
);
assert_eq!(v["title"], "Personalized Stainless Steel Tumbler");
}
#[test]
fn parse_prefers_real_og_over_slug() {
let html = r#"<html><head>
<meta property="og:title" content="Real Listing Title">
</head></html>"#;
let v = parse(
html,
"https://www.etsy.com/listing/1079113183/the-url-slug",
"1079113183",
);
assert_eq!(v["title"], "Real Listing Title");
}
}


@ -1,13 +1,34 @@
//! Trustpilot company reviews extractor.
//!
//! `trustpilot.com/review/{domain}` pages embed a JSON-LD
//! `Organization` / `LocalBusiness` block with aggregate rating + up
//! to 20 recent reviews. The page HTML itself is usually behind AWS
//! WAF's "Verifying Connection" interstitial — so this extractor
//! always uses [`cloud::smart_fetch_html`] and only returns data when
//! the caller has `WEBCLAW_API_KEY` set (cloud handles the bypass).
//! OSS users without a key get a clear error pointing at signup.
//! `trustpilot.com/review/{domain}` pages are always behind AWS WAF's
//! "Verifying your connection" interstitial, so this extractor always
//! routes through [`cloud::smart_fetch_html`]. Without
//! `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` it returns a clean
//! "set API key" error; with one it escalates to api.webclaw.io.
//!
//! ## 2025 JSON-LD schema
//!
//! Trustpilot replaced the old single-Organization + aggregateRating
//! shape with three separate JSON-LD blocks:
//!
//! 1. `Organization` block for Trustpilot the platform itself
//! (company info, addresses, social profiles). Not the business
//! being reviewed. We detect and skip this.
//! 2. `Dataset` block with a csvw:Table mainEntity that contains the
//! per-star-bucket counts for the target business plus a Total
//! column. The Dataset's `name` is the business display name.
//! 3. `aiSummary` + `aiSummaryReviews` block: the AI-generated
//! summary of reviews plus the individual review objects
//! (consumer, dates, rating, title, text, language, likes).
//!
//! Additionally, `metadata.title` from the page head parses as
//! `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"` and
//! `metadata.description` carries `"{N} customers have already said"`.
//! We use both as extra signals when the Dataset block is absent.
use std::sync::OnceLock;
use regex::Regex;
use serde_json::{Value, json};
use super::ExtractorInfo;
@ -18,7 +39,7 @@ use crate::error::FetchError;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "trustpilot_reviews",
label: "Trustpilot reviews",
description: "Returns company aggregate rating + recent reviews for a business on Trustpilot.",
description: "Returns business name, aggregate rating, star distribution, recent reviews, and the AI summary for a Trustpilot /review/{domain} page.",
url_patterns: &["https://www.trustpilot.com/review/{domain}"],
};
@ -31,75 +52,88 @@ pub fn matches(url: &str) -> bool {
}
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
// Trustpilot is always behind AWS WAF, so we go through smart_fetch
// which tries local first (which will hit the challenge interstitial),
// detects it, and escalates to cloud /v1/scrape for the real HTML.
let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
.await
.map_err(cloud_to_fetch_err)?;
let html = parse(&fetched.html, url)?;
Ok(html_with_source(html, fetched.source))
let mut data = parse(&fetched.html, url)?;
if let Some(obj) = data.as_object_mut() {
obj.insert(
"data_source".into(),
match fetched.source {
cloud::FetchSource::Local => json!("local"),
cloud::FetchSource::Cloud => json!("cloud"),
},
);
}
Ok(data)
}
/// Run the pure parser on already-fetched HTML. Split out so the cloud
/// pipeline can call it directly after its own antibot-aware fetch
/// without going through [`extract`].
/// Pure parser. Kept public so the cloud pipeline can reuse it on its
/// own fetched HTML without going through the async extract path.
pub fn parse(html: &str, url: &str) -> Result<Value, FetchError> {
let blocks = webclaw_core::structured_data::extract_json_ld(html);
let business = find_business(&blocks).ok_or_else(|| {
FetchError::BodyDecode(format!(
"trustpilot_reviews: no Organization/LocalBusiness JSON-LD on {url}"
let domain = parse_review_domain(url).ok_or_else(|| {
FetchError::Build(format!(
"trustpilot_reviews: cannot parse /review/{{domain}} from '{url}'"
))
})?;
let aggregate_rating = business.get("aggregateRating").map(|r| {
json!({
"rating_value": get_text(r, "ratingValue"),
"best_rating": get_text(r, "bestRating"),
"review_count": get_text(r, "reviewCount"),
})
});
let blocks = webclaw_core::structured_data::extract_json_ld(html);
let reviews: Vec<Value> = business
.get("review")
.and_then(|r| r.as_array())
.map(|arr| {
arr.iter()
.map(|r| {
json!({
"author": r.get("author")
.and_then(|a| a.get("name"))
.and_then(|n| n.as_str())
.map(String::from)
.or_else(|| r.get("author").and_then(|a| a.as_str()).map(String::from)),
"date_published": get_text(r, "datePublished"),
"name": get_text(r, "name"),
"body": get_text(r, "reviewBody"),
"rating_value": r.get("reviewRating")
.and_then(|rr| rr.get("ratingValue"))
.and_then(|v| v.as_str().map(String::from)
.or_else(|| v.as_f64().map(|n| n.to_string()))),
"language": get_text(r, "inLanguage"),
})
})
.collect()
})
// The business Dataset block has `about.@id` pointing to the target
// domain's Organization (e.g. `.../Organization/anthropic.com`).
let dataset = find_business_dataset(&blocks, &domain);
// The aiSummary block: not typed (no `@type`), detect by key.
let ai_block = find_ai_summary_block(&blocks);
// Business name: Dataset > metadata.title regex > URL domain.
let business_name = dataset
.as_ref()
.and_then(|d| get_string(d, "name"))
.or_else(|| parse_name_from_og_title(html))
.or_else(|| Some(domain.clone()));
// Rating distribution from the csvw:Table columns. Each column has
// csvw:name like "1 star" / "Total" and a single cell with the
// integer count.
let distribution = dataset.as_ref().and_then(parse_star_distribution);
let (rating_from_dist, total_from_dist) = distribution
.as_ref()
.map(compute_rating_stats)
.unwrap_or((None, None));
// Page-title / page-description fallbacks. OG title format:
// "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
let (rating_label, rating_from_og) = parse_rating_from_og_title(html);
let total_from_desc = parse_review_count_from_og_description(html);
// Recent reviews carried by the aiSummary block.
let recent_reviews: Vec<Value> = ai_block
.as_ref()
.and_then(|a| a.get("aiSummaryReviews"))
.and_then(|arr| arr.as_array())
.map(|arr| arr.iter().map(extract_review).collect())
.unwrap_or_default();
let ai_summary = ai_block
.as_ref()
.and_then(|a| a.get("aiSummary"))
.and_then(|s| s.get("summary"))
.and_then(|t| t.as_str())
.map(String::from);
Ok(json!({
"url": url,
"name": get_text(&business, "name"),
"description": get_text(&business, "description"),
"logo": business.get("logo").and_then(|l| l.as_str()).map(String::from)
.or_else(|| business.get("logo").and_then(|l| l.get("url")).and_then(|v| v.as_str()).map(String::from)),
"telephone": get_text(&business, "telephone"),
"address": business.get("address").cloned(),
"same_as": business.get("sameAs").cloned(),
"aggregate_rating": aggregate_rating,
"review_count_listed": reviews.len(),
"reviews": reviews,
"business_schema": business.get("@type").cloned(),
"url": url,
"domain": domain,
"business_name": business_name,
"rating_label": rating_label,
"average_rating": rating_from_dist.or(rating_from_og),
"review_count": total_from_dist.or(total_from_desc),
"rating_distribution": distribution,
"ai_summary": ai_summary,
"recent_reviews": recent_reviews,
"review_count_listed": recent_reviews.len(),
}))
}
@ -107,87 +141,10 @@ fn cloud_to_fetch_err(e: CloudError) -> FetchError {
FetchError::Build(e.to_string())
}
/// Stamp `data_source` onto the parser output so callers can tell at a
/// glance whether this row came from local or cloud. Useful for UX and
/// for pricing-aware pipelines.
fn html_with_source(mut v: Value, source: cloud::FetchSource) -> Value {
if let Some(obj) = v.as_object_mut() {
obj.insert(
"data_source".into(),
match source {
cloud::FetchSource::Local => json!("local"),
cloud::FetchSource::Cloud => json!("cloud"),
},
);
}
v
}
// ---------------------------------------------------------------------------
// JSON-LD walker — same pattern as ecommerce_product
// URL helpers
// ---------------------------------------------------------------------------
fn find_business(blocks: &[Value]) -> Option<Value> {
for b in blocks {
if let Some(found) = find_business_in(b) {
return Some(found);
}
}
None
}
fn find_business_in(v: &Value) -> Option<Value> {
if is_business_type(v) {
return Some(v.clone());
}
if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
for item in graph {
if let Some(found) = find_business_in(item) {
return Some(found);
}
}
}
if let Some(arr) = v.as_array() {
for item in arr {
if let Some(found) = find_business_in(item) {
return Some(found);
}
}
}
None
}
fn is_business_type(v: &Value) -> bool {
let t = match v.get("@type") {
Some(t) => t,
None => return false,
};
let match_str = |s: &str| {
matches!(
s,
"Organization"
| "LocalBusiness"
| "Corporation"
| "OnlineBusiness"
| "Store"
| "Service"
)
};
match t {
Value::String(s) => match_str(s),
Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
_ => false,
}
}
fn get_text(v: &Value, key: &str) -> Option<String> {
v.get(key).and_then(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Number(n) => Some(n.to_string()),
_ => None,
})
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
@ -197,6 +154,285 @@ fn host_of(url: &str) -> &str {
.unwrap_or("")
}
/// Pull the target domain from `trustpilot.com/review/{domain}`.
fn parse_review_domain(url: &str) -> Option<String> {
let after = url.split("/review/").nth(1)?;
let stripped = after
.split(['?', '#'])
.next()?
.trim_end_matches('/')
.split('/')
.next()
.unwrap_or("");
if stripped.is_empty() {
None
} else {
Some(stripped.to_string())
}
}
// ---------------------------------------------------------------------------
// JSON-LD block walkers
// ---------------------------------------------------------------------------
/// Find the Dataset block whose `about.@id` references the target
/// domain's Organization. Falls through to any Dataset if the @id
/// check doesn't match (Trustpilot occasionally varies the URL).
fn find_business_dataset(blocks: &[Value], domain: &str) -> Option<Value> {
let mut fallback_any_dataset: Option<Value> = None;
for block in blocks {
for node in walk_graph(block) {
if !is_dataset(&node) {
continue;
}
if dataset_about_matches_domain(&node, domain) {
return Some(node);
}
if fallback_any_dataset.is_none() {
fallback_any_dataset = Some(node);
}
}
}
fallback_any_dataset
}
fn is_dataset(v: &Value) -> bool {
v.get("@type")
.and_then(|t| t.as_str())
.is_some_and(|s| s == "Dataset")
}
fn dataset_about_matches_domain(v: &Value, domain: &str) -> bool {
let about_id = v
.get("about")
.and_then(|a| a.get("@id"))
.and_then(|id| id.as_str());
let Some(id) = about_id else {
return false;
};
id.contains(&format!("/Organization/{domain}"))
}
/// The aiSummary / aiSummaryReviews block has no `@type`, so match by
/// presence of the `aiSummary` key.
fn find_ai_summary_block(blocks: &[Value]) -> Option<Value> {
for block in blocks {
for node in walk_graph(block) {
if node.get("aiSummary").is_some() {
return Some(node);
}
}
}
None
}
/// Flatten each block (and its `@graph`) into a list of nodes we can
/// iterate over. Handles both `@graph: [ ... ]` (array) and
/// `@graph: { ... }` (single object) shapes — Trustpilot uses both.
fn walk_graph(block: &Value) -> Vec<Value> {
let mut out = vec![block.clone()];
if let Some(graph) = block.get("@graph") {
match graph {
Value::Array(arr) => out.extend(arr.iter().cloned()),
Value::Object(_) => out.push(graph.clone()),
_ => {}
}
}
out
}
// ---------------------------------------------------------------------------
// Rating distribution (csvw:Table)
// ---------------------------------------------------------------------------
/// Parse the per-star distribution from the Dataset block. Returns
/// `{"1_star": {count, percent}, ..., "total": {count, percent}}`.
fn parse_star_distribution(dataset: &Value) -> Option<Value> {
let columns = dataset
.get("mainEntity")?
.get("csvw:tableSchema")?
.get("csvw:columns")?
.as_array()?;
let mut out = serde_json::Map::new();
for col in columns {
let name = col.get("csvw:name").and_then(|n| n.as_str())?;
let cell = col.get("csvw:cells").and_then(|c| c.as_array())?.first()?;
let count = cell
.get("csvw:value")
.and_then(|v| v.as_str())
.and_then(|s| s.parse::<i64>().ok());
let percent = cell
.get("csvw:notes")
.and_then(|n| n.as_array())
.and_then(|arr| arr.first())
.and_then(|s| s.as_str())
.map(String::from);
let key = normalise_star_key(name);
out.insert(
key,
json!({
"count": count,
"percent": percent,
}),
);
}
if out.is_empty() {
None
} else {
Some(Value::Object(out))
}
}
/// "1 star" -> "one_star", "Total" -> "total". Easier to consume than
/// the raw "1 star" key which fights YAML/JS property access.
fn normalise_star_key(name: &str) -> String {
let trimmed = name.trim().to_lowercase();
match trimmed.as_str() {
"1 star" => "one_star".into(),
"2 stars" => "two_stars".into(),
"3 stars" => "three_stars".into(),
"4 stars" => "four_stars".into(),
"5 stars" => "five_stars".into(),
"total" => "total".into(),
other => other.replace(' ', "_"),
}
}
/// Compute average rating (weighted by bucket) and total count from the
/// parsed distribution. Returns `(average, total)`.
fn compute_rating_stats(distribution: &Value) -> (Option<String>, Option<i64>) {
let Some(obj) = distribution.as_object() else {
return (None, None);
};
let get_count = |key: &str| -> i64 {
obj.get(key)
.and_then(|v| v.get("count"))
.and_then(|v| v.as_i64())
.unwrap_or(0)
};
let one = get_count("one_star");
let two = get_count("two_stars");
let three = get_count("three_stars");
let four = get_count("four_stars");
let five = get_count("five_stars");
let total_bucket = one + two + three + four + five;
let total = obj
.get("total")
.and_then(|v| v.get("count"))
.and_then(|v| v.as_i64())
.unwrap_or(total_bucket);
if total == 0 {
return (None, Some(0));
}
if total_bucket == 0 {
// Only a "total" bucket was present; there is no per-star data to
// average, so report the count without fabricating a 0.0 score.
return (None, Some(total));
}
let weighted = one + (two * 2) + (three * 3) + (four * 4) + (five * 5);
let avg = weighted as f64 / total_bucket as f64;
// One decimal place, matching how Trustpilot displays the score.
(Some(format!("{avg:.1}")), Some(total))
}
// ---------------------------------------------------------------------------
// OG / meta-tag fallbacks
// ---------------------------------------------------------------------------
/// Regex out the business name from the standard Trustpilot OG title
/// shape: `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"`.
fn parse_name_from_og_title(html: &str) -> Option<String> {
let title = og(html, "title")?;
// "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r"^(.+?)\s+is rated\b").unwrap());
re.captures(&title)
.and_then(|c| c.get(1))
.map(|m| m.as_str().to_string())
}
/// Pull the rating label (e.g. "Bad", "Excellent") and numeric value
/// from the OG title.
fn parse_rating_from_og_title(html: &str) -> (Option<String>, Option<String>) {
let Some(title) = og(html, "title") else {
return (None, None);
};
static RE: OnceLock<Regex> = OnceLock::new();
// "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
let re = RE.get_or_init(|| {
Regex::new(r#"is rated\s*[\\"]+([^"\\]+)[\\"]+\s*with\s*([\d.]+)\s*/\s*5"#).unwrap()
});
let Some(caps) = re.captures(&title) else {
return (None, None);
};
(
caps.get(1).map(|m| m.as_str().trim().to_string()),
caps.get(2).map(|m| m.as_str().to_string()),
)
}
/// Parse "hear what 226 customers have already said" from the OG
/// description tag.
fn parse_review_count_from_og_description(html: &str) -> Option<i64> {
let desc = og(html, "description")?;
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r"(\d[\d,]*)\s+customers").unwrap());
re.captures(&desc)?
.get(1)?
.as_str()
.replace(',', "")
.parse::<i64>()
.ok()
}
fn og(html: &str, prop: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == prop) {
let raw = c.get(2).map(|m| m.as_str())?;
return Some(html_unescape(raw));
}
}
None
}
/// Minimal HTML entity unescaping for the four entities the
/// synthesize_html escaper might produce. `&amp;` is replaced last so
/// that already-escaped input like "&amp;lt;" yields "&lt;" rather than
/// being unescaped twice. Keeps us off a heavier dep.
fn html_unescape(s: &str) -> String {
s.replace("&quot;", "\"")
.replace("&lt;", "<")
.replace("&gt;", ">")
.replace("&amp;", "&")
}
fn get_string(v: &Value, key: &str) -> Option<String> {
v.get(key).and_then(|x| x.as_str().map(String::from))
}
// ---------------------------------------------------------------------------
// Review extraction
// ---------------------------------------------------------------------------
fn extract_review(r: &Value) -> Value {
json!({
"id": r.get("id").and_then(|v| v.as_str()),
"rating": r.get("rating").and_then(|v| v.as_i64()),
"title": r.get("title").and_then(|v| v.as_str()),
"text": r.get("text").and_then(|v| v.as_str()),
"language": r.get("language").and_then(|v| v.as_str()),
"source": r.get("source").and_then(|v| v.as_str()),
"likes": r.get("likes").and_then(|v| v.as_i64()),
"author": r.get("consumer").and_then(|c| c.get("displayName")).and_then(|v| v.as_str()),
"author_country": r.get("consumer").and_then(|c| c.get("countryCode")).and_then(|v| v.as_str()),
"author_review_count": r.get("consumer").and_then(|c| c.get("numberOfReviews")).and_then(|v| v.as_i64()),
"verified": r.get("consumer").and_then(|c| c.get("isVerified")).and_then(|v| v.as_bool()),
"date_experienced": r.get("dates").and_then(|d| d.get("experiencedDate")).and_then(|v| v.as_str()),
"date_published": r.get("dates").and_then(|d| d.get("publishedDate")).and_then(|v| v.as_str()),
})
}
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn parse_review_domain_handles_query_and_slash() {
assert_eq!(
parse_review_domain("https://www.trustpilot.com/review/anthropic.com"),
Some("anthropic.com".into())
);
assert_eq!(
parse_review_domain("https://www.trustpilot.com/review/anthropic.com/"),
Some("anthropic.com".into())
);
assert_eq!(
parse_review_domain("https://www.trustpilot.com/review/anthropic.com?stars=5"),
Some("anthropic.com".into())
);
}
#[test]
fn normalise_star_key_covers_all_buckets() {
assert_eq!(normalise_star_key("1 star"), "one_star");
assert_eq!(normalise_star_key("2 stars"), "two_stars");
assert_eq!(normalise_star_key("5 stars"), "five_stars");
assert_eq!(normalise_star_key("Total"), "total");
}
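// Hedged extra coverage: the fallback arm of `normalise_star_key`
// snake_cases anything it does not explicitly match, so an unrecognised
// label should survive in a usable form rather than vanish.
#[test]
fn normalise_star_key_falls_back_to_snake_case() {
assert_eq!(normalise_star_key("Some Other Label"), "some_other_label");
}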
#[test]
fn compute_rating_stats_weighted_average() {
// 100 1-stars, 100 5-stars → avg 3.0 over 200 reviews.
let dist = json!({
"one_star": { "count": 100, "percent": "50%" },
"two_stars": { "count": 0, "percent": "0%" },
"three_stars":{ "count": 0, "percent": "0%" },
"four_stars": { "count": 0, "percent": "0%" },
"five_stars": { "count": 100, "percent": "50%" },
"total": { "count": 200, "percent": "100%" },
});
let (avg, total) = compute_rating_stats(&dist);
assert_eq!(avg.as_deref(), Some("3.0"));
assert_eq!(total, Some(200));
}
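// Hedged extra coverage: degenerate inputs to `compute_rating_stats`
// must not panic and must not fabricate an average.
#[test]
fn compute_rating_stats_handles_degenerate_input() {
// Not a JSON object at all.
assert_eq!(compute_rating_stats(&json!(null)), (None, None));
// An object with no buckets: no average, zero total.
assert_eq!(compute_rating_stats(&json!({})), (None, Some(0)));
}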
#[test]
fn parse_og_title_extracts_name_and_rating() {
let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">"#;
assert_eq!(parse_name_from_og_title(html), Some("Anthropic".into()));
let (label, rating) = parse_rating_from_og_title(html);
assert_eq!(label.as_deref(), Some("Bad"));
assert_eq!(rating.as_deref(), Some("1.5"));
}
#[test]
fn parse_review_count_from_og_description_picks_number() {
let html = r#"<meta property="og:description" content="Do you agree? Voice your opinion today and hear what 226 customers have already said.">"#;
assert_eq!(parse_review_count_from_og_description(html), Some(226));
}
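// Hedged extra coverage for the shared OG helper: it should return only
// the requested property and unescape entities in the content attribute.
#[test]
fn og_picks_requested_property_and_unescapes() {
let html = r#"<meta property="og:title" content="A &quot;B&quot;"><meta property="og:description" content="C &amp; D">"#;
assert_eq!(og(html, "title").as_deref(), Some("A \"B\""));
assert_eq!(og(html, "description").as_deref(), Some("C & D"));
assert_eq!(og(html, "image"), None);
}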
#[test]
fn parse_full_fixture_assembles_all_fields() {
let html = r##"<html><head>
<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">
<script type="application/ld+json">
{"@context":"https://schema.org","@graph":[
{"@id":"https://www.trustpilot.com/#/schema/Organization/1","@type":"Organization","name":"Trustpilot"}
]}
</script>
<script type="application/ld+json">
{"@context":["https://schema.org",{"csvw":"http://www.w3.org/ns/csvw#"}],
"@graph":{"@id":"https://www.trustpilot.com/#/schema/DataSet/anthropic.com/1",
"@type":"Dataset",
"about":{"@id":"https://www.trustpilot.com/#/schema/Organization/anthropic.com"},
"name":"Anthropic",
"mainEntity":{"@type":"csvw:Table","csvw:tableSchema":{"csvw:columns":[
{"csvw:name":"1 star","csvw:cells":[{"csvw:value":"196","csvw:notes":["87%"]}]},
{"csvw:name":"2 stars","csvw:cells":[{"csvw:value":"9","csvw:notes":["4%"]}]},
{"csvw:name":"3 stars","csvw:cells":[{"csvw:value":"5","csvw:notes":["2%"]}]},
{"csvw:name":"4 stars","csvw:cells":[{"csvw:value":"1","csvw:notes":["0%"]}]},
{"csvw:name":"5 stars","csvw:cells":[{"csvw:value":"15","csvw:notes":["7%"]}]},
{"csvw:name":"Total","csvw:cells":[{"csvw:value":"226","csvw:notes":["100%"]}]}
]}}}}
</script>
<script type="application/ld+json">
{"aiSummary":{"modelVersion":"2.0.0","summary":"Mixed reviews."},
"aiSummaryReviews":[
{"id":"abc","rating":1,"title":"Bad","text":"Didn't work.","language":"en",
"source":"Organic","likes":2,"consumer":{"displayName":"W.FRH","countryCode":"DE","numberOfReviews":69,"isVerified":false},
"dates":{"experiencedDate":"2026-01-05T00:00:00.000Z","publishedDate":"2026-01-05T16:29:31.000Z"}}]}
</script>
</head></html>"##;
let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
assert_eq!(v["domain"], "anthropic.com");
assert_eq!(v["business_name"], "Anthropic");
assert_eq!(v["rating_label"], "Bad");
assert_eq!(v["review_count"], 226);
assert_eq!(v["rating_distribution"]["one_star"]["count"], 196);
assert_eq!(v["rating_distribution"]["total"]["count"], 226);
assert_eq!(v["ai_summary"], "Mixed reviews.");
assert_eq!(v["recent_reviews"].as_array().unwrap().len(), 1);
assert_eq!(v["recent_reviews"][0]["author"], "W.FRH");
assert_eq!(v["recent_reviews"][0]["rating"], 1);
assert_eq!(v["recent_reviews"][0]["title"], "Bad");
}
#[test]
fn parse_falls_back_to_og_when_no_jsonld() {
let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">"#;
let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
assert_eq!(v["domain"], "anthropic.com");
assert_eq!(v["business_name"], "Anthropic");
assert_eq!(v["average_rating"], "1.5");
assert_eq!(v["review_count"], 226);
assert_eq!(v["rating_label"], "Bad");
}
#[test]
fn parse_returns_ok_with_url_domain_when_nothing_else() {
let v = parse(
"<html><head></head></html>",
"https://www.trustpilot.com/review/example.com",
)
.unwrap();
assert_eq!(v["domain"], "example.com");
assert_eq!(v["business_name"], "example.com");
}
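// Hedged extra coverage for the @graph walker and the keyed aiSummary
// lookup, mirroring the two `@graph` shapes the doc comments describe.
#[test]
fn walk_graph_handles_array_and_object_shapes() {
let arr = json!({"@graph": [{"a": 1}, {"b": 2}]});
assert_eq!(walk_graph(&arr).len(), 3); // root node + two graph nodes
let obj = json!({"@graph": {"a": 1}});
assert_eq!(walk_graph(&obj).len(), 2); // root node + single graph object
assert_eq!(walk_graph(&json!({"a": 1})).len(), 1);
}
#[test]
fn find_ai_summary_block_matches_by_key() {
let blocks = vec![
json!({"@type": "Organization", "name": "Trustpilot"}),
json!({"aiSummary": {"summary": "ok"}, "aiSummaryReviews": []}),
];
let found = find_ai_summary_block(&blocks).expect("aiSummary block");
assert!(found.get("aiSummary").is_some());
}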
}