mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-04-25 00:06:21 +02:00
fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs)
Addresses the four follow-ups surfaced by the cloud-key smoke test.

trustpilot_reviews — full rewrite for the 2025 schema:
- Trustpilot moved from a single Organization + aggregateRating block to three separate JSON-LD blocks: a site-level Organization (Trustpilot itself), a Dataset with a csvw:Table mainEntity carrying the per-star distribution for the target business, and an aiSummary + aiSummaryReviews block with the AI-generated summary and recent review objects.
- The parser now skips the site-level Org, walks @graph as either an array or a single object, picks the Dataset whose about.@id references the target domain, parses each csvw:column for rating buckets, computes the weighted-average rating + total from the distribution, extracts the aiSummary text, and turns aiSummaryReviews into a clean reviews array with author/country/date/rating/title/text/likes.
- OG-title regex fallbacks for business_name, rating_label, and average_rating when the Dataset block is absent; OG-description regex for review_count.
- Returned shape: url, domain, business_name, rating_label, average_rating, review_count, rating_distribution (per-star count and percent), ai_summary, recent_reviews, review_count_listed, data_source.
- Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 / 226 reviews with full distribution + AI summary + 2 recent reviews.

amazon_product — forced cloud escalation + OG fallback:
- Amazon serves Product JSON-LD intermittently even on non-CAPTCHA pages. When the local fetch returns HTML without Product JSON-LD and a cloud client is configured, force-escalate to the cloud path, which reliably surfaces title + description via its render engine.
- New OG meta-tag fallback for title/image/description so the cloud's synthesize_html output (OG tags only, no #productTitle DOM ID) still yields useful data. Real Amazon pages still prefer the DOM regex.
- Verified live: B0BSHF7WHW escalates to cloud and returns the Apple MacBook Pro title + description + asin.

etsy_listing — slug humanisation + generic-page filtering + shop from brand:
- Etsy serves various placeholder pages when a listing is delisted, blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...", "This item is unavailable - Etsy", plus the OG description "Sorry, the page you were looking for was not found." The is_generic_* helpers catch all three shapes.
- When the OG title is generic, humanise the URL slug: the path `/listing/123456789/personalized-stainless-steel-tumbler` becomes `Personalized Stainless Steel Tumbler`, so callers always get a meaningful title even on dead listings.
- Etsy uses `brand` (a top-level JSON-LD field) for the shop name on listings that don't ship offers[].seller.name. Shop now falls through offers -> brand so either schema resolves.
- Verified live: listing/1097462299 returns full rich data (title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating / 225 reviews, InStock).

cloud.rs — module doc update:
- Added an architecture section documenting that api.webclaw.io does not return raw HTML by design, and that [`synthesize_html`] reassembles the parsed response (metadata + structured_data + markdown) back into minimal HTML so existing local parsers run unchanged across both paths. Also notes the DOM-regex limitation for extractors that need live-page-specific DOM IDs.

Tests: 215 passing in webclaw-fetch (18 new), clippy clean. Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY: 28/28 clean, 0 partial, 0 failed.
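The weighted-average computation described for trustpilot_reviews (per-star buckets from the csvw:Table, averaged into a displayed score) can be sketched as follows. The bucket shape and the function name here are illustrative assumptions, not the crate's actual API:

```rust
/// Sketch: compute (average_rating, review_count) from per-star
/// buckets, e.g. [(1, count_of_1_star), ..., (5, count_of_5_star)].
/// Hypothetical helper; the real parser walks csvw:column entries.
fn rating_from_distribution(buckets: &[(u32, u64)]) -> Option<(f64, u64)> {
    // Total reviews across all star buckets.
    let total: u64 = buckets.iter().map(|(_, n)| n).sum();
    if total == 0 {
        return None;
    }
    // Weighted sum: each bucket contributes stars * count.
    let weighted: u64 = buckets.iter().map(|(stars, n)| u64::from(*stars) * n).sum();
    // Round to one decimal, matching Trustpilot's displayed score.
    let avg = (weighted as f64 / total as f64 * 10.0).round() / 10.0;
    Some((avg, total))
}
```

A distribution heavily skewed toward one star, as in the anthropic.com example above, yields a low average alongside the total count.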
This commit is contained in:
parent e10066f527
commit b2e7dbf365
4 changed files with 825 additions and 172 deletions
@@ -24,6 +24,37 @@
 //! parser on it. Returns the typed [`CloudError`] so extractors can
 //! emit precise "upgrade your plan" / "invalid key" messages.
 //!
+//! ## Cloud response shape and [`synthesize_html`]
+//!
+//! `api.webclaw.io/v1/scrape` deliberately does **not** return a
+//! `html` field even when `formats=["html"]` is requested. By design
+//! the cloud API returns a parsed bundle:
+//!
+//! ```text
+//! {
+//!   "url": "https://...",
+//!   "metadata": { title, description, image, site_name, ... },  // OG / meta tags
+//!   "structured_data": [ { "@type": "...", ... }, ... ],        // JSON-LD blocks
+//!   "markdown": "# Page Title\n\n...",                          // cleaned markdown
+//!   "antibot": { engine, path, user_agent },                    // bypass telemetry
+//!   "cache": { status, age_seconds }
+//! }
+//! ```
+//!
+//! [`CloudClient::fetch_html`] reassembles that bundle back into a
+//! minimal synthetic HTML document so the existing local extractor
+//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
+//! cloud output. Each `structured_data` entry becomes a
+//! `<script type="application/ld+json">` tag; each `metadata` field
+//! becomes a `<meta property="og:...">` tag; `markdown` lands in a
+//! `<pre>` inside the body. Callers that walk Schema.org blocks see
+//! exactly what they'd see on a real live page.
+//!
+//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
+//! won't hit on the synthesised HTML — those IDs only exist on live
+//! Amazon pages. Extractors that need DOM regex keep OG meta tag
+//! fallbacks for that reason.
+//!
 //! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
 //! signup when a site is blocked; nothing fails silently. Cloud users
 //! get the escalation for free.
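The reassembly the module doc describes can be sketched in miniature. This is a simplified stand-in, not the actual `synthesize_html` in cloud.rs; the field names follow the documented bundle, and only quote/ampersand escaping is handled:

```rust
/// Sketch: rebuild minimal HTML from a parsed cloud bundle so local
/// parsers (OG regex, JSON-LD walkers) see a page-like document.
fn synthesize_html_sketch(
    metadata: &[(&str, &str)],
    structured_data: &[&str],
    markdown: &str,
) -> String {
    let mut out = String::from("<html><head>");
    for (key, value) in metadata {
        // Escape so the attribute value stays well-formed.
        let escaped = value.replace('&', "&amp;").replace('"', "&quot;");
        out.push_str(&format!(
            "<meta property=\"og:{key}\" content=\"{escaped}\">"
        ));
    }
    for block in structured_data {
        // Each JSON-LD block becomes a script tag, exactly where the
        // extractors' Schema.org walkers expect to find one.
        out.push_str(&format!(
            "<script type=\"application/ld+json\">{block}</script>"
        ));
    }
    out.push_str("</head><body><pre>");
    out.push_str(markdown);
    out.push_str("</pre></body></html>");
    out
}
```

The escaping step is why extractors reading OG content out of synthesized HTML need an unescape pass on the way back out.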
@@ -1,16 +1,25 @@
 //! Amazon product detail page extractor.
 //!
-//! Amazon product pages (`/dp/{ASIN}/` on every locale) always return
-//! a "Sorry, we need to verify you're human" interstitial to any
-//! client without a warm Amazon session + residential IP. Detection
-//! fires immediately in [`cloud::is_bot_protected`] via the dedicated
-//! Amazon heuristic, so this extractor always hits the cloud fallback
-//! path in practice.
+//! Amazon product pages (`/dp/{ASIN}/` on every locale) are
+//! inconsistently protected. Sometimes our local TLS fingerprint gets
+//! a real HTML page; sometimes we land on a CAPTCHA interstitial;
+//! sometimes we land on a real page that for whatever reason ships
+//! no Product JSON-LD (Amazon A/B-tests this regularly). So the
+//! extractor has a two-stage fallback:
 //!
-//! Parsing logic works on the final HTML, local or cloud-sourced. We
-//! read the product details primarily from JSON-LD `Product` blocks
-//! (Amazon exposes a solid subset for SEO) plus a couple of Amazon-
-//! specific DOM IDs picked up with cheap regex.
+//! 1. Try local fetch + parse. If we got Product JSON-LD back, great:
+//!    we have everything (title, brand, price, availability, rating).
+//! 2. If local fetch worked *but the page has no Product JSON-LD* AND
+//!    a cloud client is configured, force-escalate to api.webclaw.io.
+//!    Cloud's render + antibot pipeline reliably surfaces the
+//!    structured data. Without a cloud client we return whatever we
+//!    got from local (usually just title via `#productTitle` or OG
+//!    meta tags).
+//!
+//! Parsing tries JSON-LD first, DOM regex (`#productTitle`,
+//! `#landingImage`) second, OG `<meta>` tags third. The OG path
+//! matters because the cloud's synthesized HTML ships metadata as
+//! OG tags but lacks Amazon's DOM IDs.
 //!
 //! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/`
 //! path. ASINs are a stable Amazon identifier so we extract that as
@@ -54,10 +63,36 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
     let asin = parse_asin(url)
         .ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;
 
-    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
+    let mut fetched = cloud::smart_fetch_html(client, client.cloud(), url)
         .await
         .map_err(cloud_to_fetch_err)?;
 
+    // Amazon ships Product JSON-LD inconsistently even on non-CAPTCHA
+    // pages (they A/B-test it). When local fetch succeeded but has no
+    // Product JSON-LD, force-escalate to the cloud which runs the
+    // render pipeline and reliably surfaces structured data. No-op
+    // when cloud isn't configured — we return whatever local gave us.
+    if fetched.source == cloud::FetchSource::Local
+        && find_product_jsonld(&fetched.html).is_none()
+        && let Some(c) = client.cloud()
+    {
+        match c.fetch_html(url).await {
+            Ok(cloud_html) => {
+                fetched = cloud::FetchedHtml {
+                    html: cloud_html,
+                    final_url: url.to_string(),
+                    source: cloud::FetchSource::Cloud,
+                };
+            }
+            Err(e) => {
+                tracing::debug!(
+                    error = %e,
+                    "amazon_product: cloud escalation failed, keeping local"
+                );
+            }
+        }
+    }
+
     let mut data = parse(&fetched.html, url, &asin);
     if let Some(obj) = data.as_object_mut() {
         obj.insert(
@@ -77,16 +112,23 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
 /// without carrying webclaw_fetch types.
 pub fn parse(html: &str, url: &str, asin: &str) -> Value {
     let jsonld = find_product_jsonld(html);
+    // Three-tier title: JSON-LD `name` > Amazon's `#productTitle` span
+    // (only present on real static HTML) > cloud-synthesized og:title.
     let title = jsonld
         .as_ref()
         .and_then(|v| get_text(v, "name"))
-        .or_else(|| dom_title(html));
+        .or_else(|| dom_title(html))
+        .or_else(|| og(html, "title"));
     let image = jsonld
         .as_ref()
         .and_then(get_first_image)
-        .or_else(|| dom_image(html));
+        .or_else(|| dom_image(html))
+        .or_else(|| og(html, "image"));
     let brand = jsonld.as_ref().and_then(get_brand);
-    let description = jsonld.as_ref().and_then(|v| get_text(v, "description"));
+    let description = jsonld
+        .as_ref()
+        .and_then(|v| get_text(v, "description"))
+        .or_else(|| og(html, "description"));
     let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
     let offer = jsonld.as_ref().and_then(first_offer);
|
@ -267,6 +309,31 @@ fn dom_image(html: &str) -> Option<String> {
|
||||||
.map(|m| m.as_str().to_string())
|
.map(|m| m.as_str().to_string())
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// OG meta tag lookup. Cloud-synthesized HTML ships these even when
|
||||||
|
/// JSON-LD and Amazon-DOM-IDs are both absent, so they're the last
|
||||||
|
/// line of defence for `title`, `image`, `description`.
|
||||||
|
fn og(html: &str, prop: &str) -> Option<String> {
|
||||||
|
static RE: OnceLock<Regex> = OnceLock::new();
|
||||||
|
let re = RE.get_or_init(|| {
|
||||||
|
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
|
||||||
|
});
|
||||||
|
for c in re.captures_iter(html) {
|
||||||
|
if c.get(1).is_some_and(|m| m.as_str() == prop) {
|
||||||
|
return c.get(2).map(|m| html_unescape(m.as_str()));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
None
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Undo the synthesize_html attribute escaping for the few entities it
|
||||||
|
/// emits. Keeps us off a heavier HTML-entity dep.
|
||||||
|
fn html_unescape(s: &str) -> String {
|
||||||
|
s.replace(""", "\"")
|
||||||
|
.replace("&", "&")
|
||||||
|
.replace("<", "<")
|
||||||
|
.replace(">", ">")
|
||||||
|
}
|
||||||
|
|
||||||
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
|
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
|
||||||
FetchError::Build(e.to_string())
|
FetchError::Build(e.to_string())
|
||||||
}
|
}
|
||||||
|
|
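The last-line-of-defence OG lookup can also be illustrated without the regex crate. This is a dependency-free sketch of the same idea, not the extractor's implementation; it assumes `property` precedes `content` inside the tag and scans forward from the match:

```rust
/// Sketch: find `property="og:{prop}"` and pull the following
/// content="..." attribute, unescaping the entities synthesize_html
/// emits. Illustrative only; the real extractor uses a cached Regex.
fn og_lookup(html: &str, prop: &str) -> Option<String> {
    let needle = format!("property=\"og:{prop}\"");
    let at = html.find(&needle)?;
    let rest = &html[at + needle.len()..];
    // Take everything between content=" and the next literal quote.
    let start = rest.find("content=\"")? + "content=\"".len();
    let end = start + rest[start..].find('"')?;
    Some(rest[start..end].replace("&quot;", "\"").replace("&amp;", "&"))
}
```

The forward scan is the weak point of this sketch (it can cross tag boundaries on adversarial input), which is one reason an anchored regex per `<meta>` tag is the sturdier choice.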
@@ -358,4 +425,28 @@ mod tests {
             "https://m.media-amazon.com/images/I/fallback.jpg"
         );
     }
+
+    #[test]
+    fn parse_falls_back_to_og_meta_when_no_jsonld_no_dom() {
+        // Shape we see from the cloud synthesize_html path: OG tags
+        // only, no JSON-LD, no Amazon DOM IDs.
+        let html = r##"<html><head>
+            <meta property="og:title" content="Cloud-sourced MacBook Pro">
+            <meta property="og:image" content="https://m.media-amazon.com/images/I/cloud.jpg">
+            <meta property="og:description" content="Via api.webclaw.io">
+            </head></html>"##;
+        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
+        assert_eq!(v["title"], "Cloud-sourced MacBook Pro");
+        assert_eq!(v["image"], "https://m.media-amazon.com/images/I/cloud.jpg");
+        assert_eq!(v["description"], "Via api.webclaw.io");
+    }
+
+    #[test]
+    fn og_unescape_handles_quot_entity() {
+        let html = r#"<meta property="og:title" content="Apple &quot;M2 Pro&quot; Laptop">"#;
+        assert_eq!(
+            og(html, "title").as_deref(),
+            Some(r#"Apple "M2 Pro" Laptop"#)
+        );
+    }
 }
@@ -10,6 +10,15 @@
 //! but some listings return a CF interstitial. We route through
 //! `cloud::smart_fetch_html` so both paths resolve to the same parser,
 //! same as `ebay_listing`.
+//!
+//! ## URL slug as last-resort title
+//!
+//! Even with cloud antibot bypass, Etsy frequently serves a generic
+//! page with minimal metadata (`og:title = "etsy.com"`, no JSON-LD,
+//! empty markdown). In that case we humanise the slug from the URL
+//! (`/listing/{id}/personalized-stainless-steel-tumbler` becomes
+//! "Personalized Stainless Steel Tumbler") so callers always get a
+//! meaningful title. Degrades gracefully when the URL has no slug.
 
 use std::sync::OnceLock;
@@ -63,15 +72,17 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
 
 pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
     let jsonld = find_product_jsonld(html);
+    let slug_title = humanise_slug(parse_slug(url).as_deref());
+
     let title = jsonld
         .as_ref()
         .and_then(|v| get_text(v, "name"))
-        .or_else(|| og(html, "title"));
+        .or_else(|| og(html, "title").filter(|t| !is_generic_title(t)))
+        .or(slug_title);
     let description = jsonld
         .as_ref()
         .and_then(|v| get_text(v, "description"))
-        .or_else(|| og(html, "description"));
+        .or_else(|| og(html, "description").filter(|d| !is_generic_description(d)));
     let image = jsonld
         .as_ref()
         .and_then(get_first_image)
@@ -98,13 +109,18 @@ pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
         .and_then(|v| get_text(v, "itemCondition"))
         .map(strip_schema_prefix);
 
-    // Shop name lives under offers[0].seller.name on Etsy.
-    let shop = offer.as_ref().and_then(|o| {
-        o.get("seller")
-            .and_then(|s| s.get("name"))
-            .and_then(|n| n.as_str())
-            .map(String::from)
-    });
+    // Shop name: offers[0].seller.name on newer listings, top-level
+    // `brand` on older listings (Etsy changed the schema around 2022).
+    // Fall back through both so either shape resolves.
+    let shop = offer
+        .as_ref()
+        .and_then(|o| {
+            o.get("seller")
+                .and_then(|s| s.get("name"))
+                .and_then(|n| n.as_str())
+                .map(String::from)
+        })
+        .or_else(|| brand.clone());
     let shop_url = shop_url_from_html(html);
 
     let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
@@ -158,6 +174,87 @@ fn parse_listing_id(url: &str) -> Option<String> {
         .map(|m| m.as_str().to_string())
 }
 
+/// Extract the URL slug after the listing id, e.g.
+/// `personalized-stainless-steel-tumbler`. Returns `None` when the URL
+/// is the bare `/listing/{id}` shape.
+fn parse_slug(url: &str) -> Option<String> {
+    static RE: OnceLock<Regex> = OnceLock::new();
+    let re = RE.get_or_init(|| Regex::new(r"/listing/\d{6,}/([^/?#]+)").unwrap());
+    re.captures(url)
+        .and_then(|c| c.get(1))
+        .map(|m| m.as_str().to_string())
+}
+
+/// Turn a URL slug into a human-ish title:
+/// `personalized-stainless-steel-tumbler` → `Personalized Stainless
+/// Steel Tumbler`. Word-caps each dash-separated token; treats
+/// underscores as separators too. Returns `None` on empty input.
+fn humanise_slug(slug: Option<&str>) -> Option<String> {
+    let raw = slug?.trim();
+    if raw.is_empty() {
+        return None;
+    }
+    let words: Vec<String> = raw
+        .split(['-', '_'])
+        .filter(|w| !w.is_empty())
+        .map(capitalise_word)
+        .collect();
+    if words.is_empty() {
+        None
+    } else {
+        Some(words.join(" "))
+    }
+}
+
+fn capitalise_word(w: &str) -> String {
+    let mut chars = w.chars();
+    match chars.next() {
+        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
+        None => String::new(),
+    }
+}
+
+/// True when the OG title is Etsy's fallback-page title rather than a
+/// listing-specific title. Expired / region-blocked / antibot-filtered
+/// pages return Etsy's sitewide tagline:
+/// `"Etsy - Your place to buy and sell all things handmade..."`, or
+/// simply `"etsy.com"`. A real listing title always starts with the
+/// item name, never with "Etsy - " or the domain.
+fn is_generic_title(t: &str) -> bool {
+    let normalised = t.trim().to_lowercase();
+    if matches!(
+        normalised.as_str(),
+        "etsy.com" | "etsy" | "www.etsy.com" | ""
+    ) {
+        return true;
+    }
+    // Etsy's sitewide marketing tagline, served on 404 / blocked pages.
+    if normalised.starts_with("etsy - ")
+        || normalised.starts_with("etsy.com - ")
+        || normalised.starts_with("etsy uk - ")
+    {
+        return true;
+    }
+    // Etsy's "item unavailable" placeholder, served on delisted
+    // products. Keep the slug fallback so callers still see what the
+    // URL was about.
+    normalised.starts_with("this item is unavailable")
+        || normalised.starts_with("sorry, this item is")
+        || normalised == "item not available - etsy"
+}
+
+/// True when the OG description is an Etsy error-page placeholder or
+/// sitewide marketing blurb rather than a real listing description.
+fn is_generic_description(d: &str) -> bool {
+    let normalised = d.trim().to_lowercase();
+    if normalised.is_empty() {
+        return true;
+    }
+    normalised.starts_with("sorry, the page you were looking for")
+        || normalised.starts_with("page not found")
+        || normalised.starts_with("find the perfect handmade gift")
+}
+
 // ---------------------------------------------------------------------------
 // JSON-LD walkers (same shape as ebay_listing; kept separate so the two
 // extractors can diverge without cross-impact)
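The title-resolution order in the parser above (real OG title wins, generic placeholders fall through to the humanised slug) can be condensed into a self-contained sketch. `resolve_title`, `is_generic`, and `humanise` are simplified stand-ins for the extractor's actual helpers:

```rust
/// Sketch of the fallback chain: take the OG title unless it is a
/// generic Etsy placeholder, otherwise humanise the URL slug.
fn resolve_title(og_title: Option<&str>, slug: Option<&str>) -> Option<String> {
    // Simplified stand-in for is_generic_title.
    fn is_generic(t: &str) -> bool {
        let t = t.trim().to_lowercase();
        t.is_empty() || t == "etsy.com" || t == "etsy" || t.starts_with("etsy - ")
    }
    // Simplified stand-in for humanise_slug + capitalise_word.
    fn humanise(slug: &str) -> Option<String> {
        let words: Vec<String> = slug
            .split(['-', '_'])
            .filter(|w| !w.is_empty())
            .map(|w| {
                let mut c = w.chars();
                c.next()
                    .map(|f| f.to_uppercase().collect::<String>() + c.as_str())
                    .unwrap_or_default()
            })
            .collect();
        if words.is_empty() { None } else { Some(words.join(" ")) }
    }
    og_title
        .filter(|t| !is_generic(t))
        .map(String::from)
        .or_else(|| slug.and_then(humanise))
}
```

This mirrors the `or_else(...).filter(...).or(slug_title)` chain in `parse`: the filter is what lets a dead listing's slug win over the placeholder page.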
@@ -388,4 +485,88 @@ mod tests {
         // No price fields when we only have OG.
         assert!(v["price"].is_null());
     }
+
+    #[test]
+    fn parse_slug_from_url() {
+        assert_eq!(
+            parse_slug("https://www.etsy.com/listing/123456789/vintage-typewriter"),
+            Some("vintage-typewriter".into())
+        );
+        assert_eq!(
+            parse_slug("https://www.etsy.com/listing/123456789/slug?ref=shop"),
+            Some("slug".into())
+        );
+        assert_eq!(parse_slug("https://www.etsy.com/listing/123456789"), None);
+        assert_eq!(
+            parse_slug("https://www.etsy.com/fr/listing/123456789/slug"),
+            Some("slug".into())
+        );
+    }
+
+    #[test]
+    fn humanise_slug_capitalises_each_word() {
+        assert_eq!(
+            humanise_slug(Some("personalized-stainless-steel-tumbler")).as_deref(),
+            Some("Personalized Stainless Steel Tumbler")
+        );
+        assert_eq!(
+            humanise_slug(Some("hand_crafted_mug")).as_deref(),
+            Some("Hand Crafted Mug")
+        );
+        assert_eq!(humanise_slug(Some("")), None);
+        assert_eq!(humanise_slug(None), None);
+    }
+
+    #[test]
+    fn is_generic_title_catches_common_shapes() {
+        assert!(is_generic_title("etsy.com"));
+        assert!(is_generic_title("Etsy"));
+        assert!(is_generic_title(" etsy.com "));
+        assert!(is_generic_title(
+            "Etsy - Your place to buy and sell all things handmade, vintage, and supplies"
+        ));
+        assert!(is_generic_title("Etsy UK - Vintage & Handmade"));
+        assert!(!is_generic_title("Vintage Typewriter"));
+        assert!(!is_generic_title("Handmade Etsy-style Mug"));
+    }
+
+    #[test]
+    fn is_generic_description_catches_404_shapes() {
+        assert!(is_generic_description(""));
+        assert!(is_generic_description(
+            "Sorry, the page you were looking for was not found."
+        ));
+        assert!(is_generic_description("Page not found"));
+        assert!(!is_generic_description(
+            "Hand-thrown ceramic mug, dishwasher safe."
+        ));
+    }
+
+    #[test]
+    fn parse_uses_slug_when_og_is_generic() {
+        // Cloud-blocked Etsy listing: og:title is a site-wide generic
+        // placeholder, no JSON-LD, no description. Slug should win.
+        let html = r#"<html><head>
+            <meta property="og:title" content="etsy.com">
+            </head></html>"#;
+        let v = parse(
+            html,
+            "https://www.etsy.com/listing/1079113183/personalized-stainless-steel-tumbler",
+            "1079113183",
+        );
+        assert_eq!(v["title"], "Personalized Stainless Steel Tumbler");
+    }
+
+    #[test]
+    fn parse_prefers_real_og_over_slug() {
+        let html = r#"<html><head>
+            <meta property="og:title" content="Real Listing Title">
+            </head></html>"#;
+        let v = parse(
+            html,
+            "https://www.etsy.com/listing/1079113183/the-url-slug",
+            "1079113183",
+        );
+        assert_eq!(v["title"], "Real Listing Title");
+    }
 }
@@ -1,13 +1,34 @@
 //! Trustpilot company reviews extractor.
 //!
-//! `trustpilot.com/review/{domain}` pages embed a JSON-LD
-//! `Organization` / `LocalBusiness` block with aggregate rating + up
-//! to 20 recent reviews. The page HTML itself is usually behind AWS
-//! WAF's "Verifying Connection" interstitial — so this extractor
-//! always uses [`cloud::smart_fetch_html`] and only returns data when
-//! the caller has `WEBCLAW_API_KEY` set (cloud handles the bypass).
-//! OSS users without a key get a clear error pointing at signup.
+//! `trustpilot.com/review/{domain}` pages are always behind AWS WAF's
+//! "Verifying your connection" interstitial, so this extractor always
+//! routes through [`cloud::smart_fetch_html`]. Without
+//! `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` it returns a clean
+//! "set API key" error; with one it escalates to api.webclaw.io.
+//!
+//! ## 2025 JSON-LD schema
+//!
+//! Trustpilot replaced the old single-Organization + aggregateRating
+//! shape with three separate JSON-LD blocks:
+//!
+//! 1. `Organization` block for Trustpilot the platform itself
+//!    (company info, addresses, social profiles). Not the business
+//!    being reviewed. We detect and skip this.
+//! 2. `Dataset` block with a csvw:Table mainEntity that contains the
+//!    per-star-bucket counts for the target business plus a Total
+//!    column. The Dataset's `name` is the business display name.
+//! 3. `aiSummary` + `aiSummaryReviews` block: the AI-generated
+//!    summary of reviews plus the individual review objects
+//!    (consumer, dates, rating, title, text, language, likes).
+//!
+//! Plus `metadata.title` from the page head parses as
+//! `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"` and
+//! `metadata.description` carries `"{N} customers have already said"`.
+//! We use both as extra signal when the Dataset block is absent.
 
+use std::sync::OnceLock;
+
+use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
@@ -18,7 +39,7 @@ use crate::error::FetchError;
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "trustpilot_reviews",
     label: "Trustpilot reviews",
-    description: "Returns company aggregate rating + recent reviews for a business on Trustpilot.",
+    description: "Returns business name, aggregate rating, star distribution, recent reviews, and the AI summary for a Trustpilot /review/{domain} page.",
     url_patterns: &["https://www.trustpilot.com/review/{domain}"],
 };
@ -31,75 +52,88 @@ pub fn matches(url: &str) -> bool {
|
||||||
}
|
}
|
||||||
|
|
||||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||||
// Trustpilot is always behind AWS WAF, so we go through smart_fetch
|
|
||||||
// which tries local first (which will hit the challenge interstitial),
|
|
||||||
// detects it, and escalates to cloud /v1/scrape for the real HTML.
|
|
||||||
let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
|
let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
|
||||||
.await
|
.await
|
||||||
.map_err(cloud_to_fetch_err)?;
|
.map_err(cloud_to_fetch_err)?;
|
||||||
|
|
||||||
let html = parse(&fetched.html, url)?;
|
let mut data = parse(&fetched.html, url)?;
|
||||||
Ok(html_with_source(html, fetched.source))
|
if let Some(obj) = data.as_object_mut() {
|
||||||
|
obj.insert(
|
||||||
|
"data_source".into(),
|
||||||
|
match fetched.source {
|
||||||
|
cloud::FetchSource::Local => json!("local"),
|
||||||
|
cloud::FetchSource::Cloud => json!("cloud"),
|
||||||
|
},
|
||||||
|
);
|
||||||
|
}
|
||||||
|
Ok(data)
|
||||||
}
|
}
|

/// Pure parser. Kept public so the cloud pipeline can reuse it on its
/// own fetched HTML without going through the async extract path.
pub fn parse(html: &str, url: &str) -> Result<Value, FetchError> {
    let domain = parse_review_domain(url).ok_or_else(|| {
        FetchError::Build(format!(
            "trustpilot_reviews: cannot parse /review/{{domain}} from '{url}'"
        ))
    })?;

    let blocks = webclaw_core::structured_data::extract_json_ld(html);

    // The business Dataset block has `about.@id` pointing to the target
    // domain's Organization (e.g. `.../Organization/anthropic.com`).
    let dataset = find_business_dataset(&blocks, &domain);

    // The aiSummary block: not typed (no `@type`), detect by key.
    let ai_block = find_ai_summary_block(&blocks);

    // Business name: Dataset > OG-title regex > URL domain.
    let business_name = dataset
        .as_ref()
        .and_then(|d| get_string(d, "name"))
        .or_else(|| parse_name_from_og_title(html))
        .or_else(|| Some(domain.clone()));

    // Rating distribution from the csvw:Table columns. Each column has a
    // csvw:name like "1 star" / "Total" and a single cell with the
    // integer count.
    let distribution = dataset.as_ref().and_then(parse_star_distribution);
    let (rating_from_dist, total_from_dist) = distribution
        .as_ref()
        .map(compute_rating_stats)
        .unwrap_or((None, None));

    // Page-title / page-description fallbacks. OG title format:
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    let (rating_label, rating_from_og) = parse_rating_from_og_title(html);
    let total_from_desc = parse_review_count_from_og_description(html);

    // Recent reviews carried by the aiSummary block.
    let recent_reviews: Vec<Value> = ai_block
        .as_ref()
        .and_then(|a| a.get("aiSummaryReviews"))
        .and_then(|arr| arr.as_array())
        .map(|arr| arr.iter().map(extract_review).collect())
        .unwrap_or_default();

    let ai_summary = ai_block
        .as_ref()
        .and_then(|a| a.get("aiSummary"))
        .and_then(|s| s.get("summary"))
        .and_then(|t| t.as_str())
        .map(String::from);

    Ok(json!({
        "url": url,
        "domain": domain,
        "business_name": business_name,
        "rating_label": rating_label,
        "average_rating": rating_from_dist.or(rating_from_og),
        "review_count": total_from_dist.or(total_from_desc),
        "rating_distribution": distribution,
        "ai_summary": ai_summary,
        "recent_reviews": recent_reviews,
        "review_count_listed": recent_reviews.len(),
    }))
}

@@ -107,87 +141,10 @@ fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
@@ -197,6 +154,285 @@ fn host_of(url: &str) -> &str {
        .unwrap_or("")
}

/// Pull the target domain from `trustpilot.com/review/{domain}`.
fn parse_review_domain(url: &str) -> Option<String> {
    let after = url.split("/review/").nth(1)?;
    let stripped = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .split('/')
        .next()
        .unwrap_or("");
    if stripped.is_empty() {
        None
    } else {
        Some(stripped.to_string())
    }
}
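The slicing steps above can be exercised in isolation. Below is a minimal, hedged sketch: `review_domain` is a hypothetical stdlib-only copy of the same chain (split on `/review/`, cut at `?` or `#`, drop a trailing slash), not the crate's `parse_review_domain` itself.

```rust
/// Hypothetical standalone copy of the /review/{domain} slicing steps.
fn review_domain(url: &str) -> Option<String> {
    let rest = url.split("/review/").nth(1)?; // everything after /review/
    let rest = rest.split(['?', '#']).next()?; // cut query / fragment
    let rest = rest.trim_end_matches('/'); // drop trailing slash
    (!rest.is_empty()).then(|| rest.to_string())
}

fn main() {
    assert_eq!(
        review_domain("https://www.trustpilot.com/review/anthropic.com?stars=1#reviews"),
        Some("anthropic.com".into())
    );
    println!("ok");
}
```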

// ---------------------------------------------------------------------------
// JSON-LD block walkers
// ---------------------------------------------------------------------------

/// Find the Dataset block whose `about.@id` references the target
/// domain's Organization. Falls through to any Dataset if the `@id`
/// check doesn't match (Trustpilot occasionally varies the URL).
fn find_business_dataset(blocks: &[Value], domain: &str) -> Option<Value> {
    let mut fallback_any_dataset: Option<Value> = None;
    for block in blocks {
        for node in walk_graph(block) {
            if !is_dataset(&node) {
                continue;
            }
            if dataset_about_matches_domain(&node, domain) {
                return Some(node);
            }
            if fallback_any_dataset.is_none() {
                fallback_any_dataset = Some(node);
            }
        }
    }
    fallback_any_dataset
}

fn is_dataset(v: &Value) -> bool {
    v.get("@type")
        .and_then(|t| t.as_str())
        .is_some_and(|s| s == "Dataset")
}

fn dataset_about_matches_domain(v: &Value, domain: &str) -> bool {
    let about_id = v
        .get("about")
        .and_then(|a| a.get("@id"))
        .and_then(|id| id.as_str());
    let Some(id) = about_id else {
        return false;
    };
    id.contains(&format!("/Organization/{domain}"))
}

/// The aiSummary / aiSummaryReviews block has no `@type`, so match by
/// presence of the `aiSummary` key.
fn find_ai_summary_block(blocks: &[Value]) -> Option<Value> {
    for block in blocks {
        for node in walk_graph(block) {
            if node.get("aiSummary").is_some() {
                return Some(node);
            }
        }
    }
    None
}

/// Flatten each block (and its `@graph`) into a list of nodes we can
/// iterate over. Handles both `@graph: [ ... ]` (array) and
/// `@graph: { ... }` (single object) shapes; Trustpilot uses both.
fn walk_graph(block: &Value) -> Vec<Value> {
    let mut out = vec![block.clone()];
    if let Some(graph) = block.get("@graph") {
        match graph {
            Value::Array(arr) => out.extend(arr.iter().cloned()),
            Value::Object(_) => out.push(graph.clone()),
            _ => {}
        }
    }
    out
}

// ---------------------------------------------------------------------------
// Rating distribution (csvw:Table)
// ---------------------------------------------------------------------------

/// Parse the per-star distribution from the Dataset block. Returns
/// `{"one_star": {count, percent}, ..., "total": {count, percent}}`.
fn parse_star_distribution(dataset: &Value) -> Option<Value> {
    let columns = dataset
        .get("mainEntity")?
        .get("csvw:tableSchema")?
        .get("csvw:columns")?
        .as_array()?;
    let mut out = serde_json::Map::new();
    for col in columns {
        let name = col.get("csvw:name").and_then(|n| n.as_str())?;
        let cell = col.get("csvw:cells").and_then(|c| c.as_array())?.first()?;
        let count = cell
            .get("csvw:value")
            .and_then(|v| v.as_str())
            .and_then(|s| s.parse::<i64>().ok());
        let percent = cell
            .get("csvw:notes")
            .and_then(|n| n.as_array())
            .and_then(|arr| arr.first())
            .and_then(|s| s.as_str())
            .map(String::from);
        let key = normalise_star_key(name);
        out.insert(
            key,
            json!({
                "count": count,
                "percent": percent,
            }),
        );
    }
    if out.is_empty() {
        None
    } else {
        Some(Value::Object(out))
    }
}

/// "1 star" -> "one_star", "Total" -> "total". Easier to consume than
/// the raw "1 star" key, which fights YAML/JS property access.
fn normalise_star_key(name: &str) -> String {
    let trimmed = name.trim().to_lowercase();
    match trimmed.as_str() {
        "1 star" => "one_star".into(),
        "2 stars" => "two_stars".into(),
        "3 stars" => "three_stars".into(),
        "4 stars" => "four_stars".into(),
        "5 stars" => "five_stars".into(),
        "total" => "total".into(),
        other => other.replace(' ', "_"),
    }
}

/// Compute the average rating (weighted by bucket) and total count from
/// the parsed distribution. Returns `(average, total)`.
fn compute_rating_stats(distribution: &Value) -> (Option<String>, Option<i64>) {
    let Some(obj) = distribution.as_object() else {
        return (None, None);
    };
    let get_count = |key: &str| -> i64 {
        obj.get(key)
            .and_then(|v| v.get("count"))
            .and_then(|v| v.as_i64())
            .unwrap_or(0)
    };
    let one = get_count("one_star");
    let two = get_count("two_stars");
    let three = get_count("three_stars");
    let four = get_count("four_stars");
    let five = get_count("five_stars");
    let total_bucket = one + two + three + four + five;
    let total = obj
        .get("total")
        .and_then(|v| v.get("count"))
        .and_then(|v| v.as_i64())
        .unwrap_or(total_bucket);
    if total == 0 {
        return (None, Some(0));
    }
    let weighted = one + (two * 2) + (three * 3) + (four * 4) + (five * 5);
    let avg = weighted as f64 / total_bucket.max(1) as f64;
    // One decimal place, matching how Trustpilot displays the score.
    (Some(format!("{avg:.1}")), Some(total))
}
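The bucket-weighted arithmetic can be sanity-checked standalone. A minimal stdlib-only sketch: `weighted_average` is a hypothetical reduction of the computation above, and the counts are the live anthropic.com distribution quoted in the commit message (196/9/5/1/15 across 226 reviews, score 1.4).

```rust
// Stdlib-only sketch of the bucket-weighted average:
// sum(stars * count) / sum(count), shown to one decimal place.
fn weighted_average(counts: [i64; 5]) -> f64 {
    let total: i64 = counts.iter().sum();
    let weighted: i64 = counts
        .iter()
        .enumerate()
        .map(|(i, c)| (i as i64 + 1) * c) // index 0 is the 1-star bucket
        .sum();
    weighted as f64 / total.max(1) as f64
}

fn main() {
    // anthropic.com: 196 one-star through 15 five-star reviews.
    let avg = weighted_average([196, 9, 5, 1, 15]);
    println!("{avg:.1}"); // prints 1.4
}
```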

// ---------------------------------------------------------------------------
// OG / meta-tag fallbacks
// ---------------------------------------------------------------------------

/// Regex out the business name from the standard Trustpilot OG title
/// shape: `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"`.
fn parse_name_from_og_title(html: &str) -> Option<String> {
    let title = og(html, "title")?;
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"^(.+?)\s+is rated\b").unwrap());
    re.captures(&title)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// Pull the rating label (e.g. "Bad", "Excellent") and numeric value
/// from the OG title.
fn parse_rating_from_og_title(html: &str) -> (Option<String>, Option<String>) {
    let Some(title) = og(html, "title") else {
        return (None, None);
    };
    static RE: OnceLock<Regex> = OnceLock::new();
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    let re = RE.get_or_init(|| {
        Regex::new(r#"is rated\s*[\\"]+([^"\\]+)[\\"]+\s*with\s*([\d.]+)\s*/\s*5"#).unwrap()
    });
    let Some(caps) = re.captures(&title) else {
        return (None, None);
    };
    (
        caps.get(1).map(|m| m.as_str().trim().to_string()),
        caps.get(2).map(|m| m.as_str().to_string()),
    )
}

/// Parse "hear what 226 customers have already said" from the OG
/// description tag.
fn parse_review_count_from_og_description(html: &str) -> Option<i64> {
    let desc = og(html, "description")?;
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"(\d[\d,]*)\s+customers").unwrap());
    re.captures(&desc)?
        .get(1)?
        .as_str()
        .replace(',', "")
        .parse::<i64>()
        .ok()
}

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            let raw = c.get(2).map(|m| m.as_str())?;
            return Some(html_unescape(raw));
        }
    }
    None
}

/// Minimal HTML entity unescaping for the handful of entities the
/// synthesize_html escaper might produce. Keeps us off a heavier dep.
fn html_unescape(s: &str) -> String {
    s.replace("&quot;", "\"")
        .replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
}

fn get_string(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| x.as_str().map(String::from))
}

// ---------------------------------------------------------------------------
// Review extraction
// ---------------------------------------------------------------------------

fn extract_review(r: &Value) -> Value {
    json!({
        "id": r.get("id").and_then(|v| v.as_str()),
        "rating": r.get("rating").and_then(|v| v.as_i64()),
        "title": r.get("title").and_then(|v| v.as_str()),
        "text": r.get("text").and_then(|v| v.as_str()),
        "language": r.get("language").and_then(|v| v.as_str()),
        "source": r.get("source").and_then(|v| v.as_str()),
        "likes": r.get("likes").and_then(|v| v.as_i64()),
        "author": r.get("consumer").and_then(|c| c.get("displayName")).and_then(|v| v.as_str()),
        "author_country": r.get("consumer").and_then(|c| c.get("countryCode")).and_then(|v| v.as_str()),
        "author_review_count": r.get("consumer").and_then(|c| c.get("numberOfReviews")).and_then(|v| v.as_i64()),
        "verified": r.get("consumer").and_then(|c| c.get("isVerified")).and_then(|v| v.as_bool()),
        "date_experienced": r.get("dates").and_then(|d| d.get("experiencedDate")).and_then(|v| v.as_str()),
        "date_published": r.get("dates").and_then(|d| d.get("publishedDate")).and_then(|v| v.as_str()),
    })
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;

@@ -210,13 +446,127 @@ mod tests {
    }

    #[test]
    fn parse_review_domain_handles_query_and_slash() {
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com"),
            Some("anthropic.com".into())
        );
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com/"),
            Some("anthropic.com".into())
        );
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com?stars=5"),
            Some("anthropic.com".into())
        );
    }

    #[test]
    fn normalise_star_key_covers_all_buckets() {
        assert_eq!(normalise_star_key("1 star"), "one_star");
        assert_eq!(normalise_star_key("2 stars"), "two_stars");
        assert_eq!(normalise_star_key("5 stars"), "five_stars");
        assert_eq!(normalise_star_key("Total"), "total");
    }

    #[test]
    fn compute_rating_stats_weighted_average() {
        // 100 1-stars, 100 5-stars -> avg 3.0 over 200 reviews.
        let dist = json!({
            "one_star":    { "count": 100, "percent": "50%" },
            "two_stars":   { "count": 0,   "percent": "0%" },
            "three_stars": { "count": 0,   "percent": "0%" },
            "four_stars":  { "count": 0,   "percent": "0%" },
            "five_stars":  { "count": 100, "percent": "50%" },
            "total":       { "count": 200, "percent": "100%" },
        });
        let (avg, total) = compute_rating_stats(&dist);
        assert_eq!(avg.as_deref(), Some("3.0"));
        assert_eq!(total, Some(200));
    }

    #[test]
    fn parse_og_title_extracts_name_and_rating() {
        let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">"#;
        assert_eq!(parse_name_from_og_title(html), Some("Anthropic".into()));
        let (label, rating) = parse_rating_from_og_title(html);
        assert_eq!(label.as_deref(), Some("Bad"));
        assert_eq!(rating.as_deref(), Some("1.5"));
    }

    #[test]
    fn parse_review_count_from_og_description_picks_number() {
        let html = r#"<meta property="og:description" content="Do you agree? Voice your opinion today and hear what 226 customers have already said.">"#;
        assert_eq!(parse_review_count_from_og_description(html), Some(226));
    }

    #[test]
    fn parse_full_fixture_assembles_all_fields() {
        let html = r##"<html><head>
<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">
<script type="application/ld+json">
{"@context":"https://schema.org","@graph":[
{"@id":"https://www.trustpilot.com/#/schema/Organization/1","@type":"Organization","name":"Trustpilot"}
]}
</script>
<script type="application/ld+json">
{"@context":["https://schema.org",{"csvw":"http://www.w3.org/ns/csvw#"}],
"@graph":{"@id":"https://www.trustpilot.com/#/schema/DataSet/anthropic.com/1",
"@type":"Dataset",
"about":{"@id":"https://www.trustpilot.com/#/schema/Organization/anthropic.com"},
"name":"Anthropic",
"mainEntity":{"@type":"csvw:Table","csvw:tableSchema":{"csvw:columns":[
{"csvw:name":"1 star","csvw:cells":[{"csvw:value":"196","csvw:notes":["87%"]}]},
{"csvw:name":"2 stars","csvw:cells":[{"csvw:value":"9","csvw:notes":["4%"]}]},
{"csvw:name":"3 stars","csvw:cells":[{"csvw:value":"5","csvw:notes":["2%"]}]},
{"csvw:name":"4 stars","csvw:cells":[{"csvw:value":"1","csvw:notes":["0%"]}]},
{"csvw:name":"5 stars","csvw:cells":[{"csvw:value":"15","csvw:notes":["7%"]}]},
{"csvw:name":"Total","csvw:cells":[{"csvw:value":"226","csvw:notes":["100%"]}]}
]}}}}
</script>
<script type="application/ld+json">
{"aiSummary":{"modelVersion":"2.0.0","summary":"Mixed reviews."},
"aiSummaryReviews":[
{"id":"abc","rating":1,"title":"Bad","text":"Didn't work.","language":"en",
"source":"Organic","likes":2,"consumer":{"displayName":"W.FRH","countryCode":"DE","numberOfReviews":69,"isVerified":false},
"dates":{"experiencedDate":"2026-01-05T00:00:00.000Z","publishedDate":"2026-01-05T16:29:31.000Z"}}]}
</script>
</head></html>"##;
        let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
        assert_eq!(v["domain"], "anthropic.com");
        assert_eq!(v["business_name"], "Anthropic");
        assert_eq!(v["rating_label"], "Bad");
        assert_eq!(v["review_count"], 226);
        assert_eq!(v["rating_distribution"]["one_star"]["count"], 196);
        assert_eq!(v["rating_distribution"]["total"]["count"], 226);
        assert_eq!(v["ai_summary"], "Mixed reviews.");
        assert_eq!(v["recent_reviews"].as_array().unwrap().len(), 1);
        assert_eq!(v["recent_reviews"][0]["author"], "W.FRH");
        assert_eq!(v["recent_reviews"][0]["rating"], 1);
        assert_eq!(v["recent_reviews"][0]["title"], "Bad");
    }

    #[test]
    fn parse_falls_back_to_og_when_no_jsonld() {
        let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">"#;
        let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
        assert_eq!(v["domain"], "anthropic.com");
        assert_eq!(v["business_name"], "Anthropic");
        assert_eq!(v["average_rating"], "1.5");
        assert_eq!(v["review_count"], 226);
        assert_eq!(v["rating_label"], "Bad");
    }

    #[test]
    fn parse_returns_ok_with_url_domain_when_nothing_else() {
        let v = parse(
            "<html><head></head></html>",
            "https://www.trustpilot.com/review/example.com",
        )
        .unwrap();
        assert_eq!(v["domain"], "example.com");
        assert_eq!(v["business_name"], "example.com");
    }
}