mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-04-25 00:06:21 +02:00
fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs)
Addresses the four follow-ups surfaced by the cloud-key smoke test.

trustpilot_reviews — full rewrite for the 2025 schema:
- Trustpilot moved from a single Organization + aggregateRating block to
  three separate JSON-LD blocks: a site-level Organization (Trustpilot
  itself), a Dataset with a csvw:Table mainEntity carrying the per-star
  distribution for the target business, and an aiSummary +
  aiSummaryReviews block with the AI-generated summary and recent review
  objects.
- The parser now: skips the site-level Org, walks @graph as either an
  array or a single object, picks the Dataset whose about.@id references
  the target domain, parses each csvw:column for rating buckets, computes
  the weighted-average rating + total from the distribution, extracts the
  aiSummary text, and turns aiSummaryReviews into a clean reviews array
  with author/country/date/rating/title/text/likes.
- OG-title regex fallbacks for business_name, rating_label, and
  average_rating when the Dataset block is absent. OG-description regex
  for review_count.
- Returned shape: url, domain, business_name, rating_label,
  average_rating, review_count, rating_distribution (per-star count and
  percent), ai_summary, recent_reviews, review_count_listed, data_source.
- Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 / 226
  reviews with full distribution + AI summary + 2 recent reviews.

amazon_product — forced cloud escalation + OG fallback:
- Amazon serves Product JSON-LD intermittently even on non-CAPTCHA pages.
  When the local fetch returns HTML without Product JSON-LD and a cloud
  client is configured, force-escalate to the cloud path, which reliably
  surfaces title + description via its render engine.
- New OG meta-tag fallback for title/image/description so the cloud's
  synthesize_html output (OG tags only, no #productTitle DOM ID) still
  yields useful data. Real Amazon pages still prefer the DOM regex.
- Verified live: B0BSHF7WHW escalates to cloud and returns the Apple
  MacBook Pro title + description + asin.

etsy_listing — slug humanisation + generic-page filtering + shop from brand:
- Etsy serves various placeholder pages when a listing is delisted,
  blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...",
  "This item is unavailable - Etsy", plus the OG description "Sorry, the
  page you were looking for was not found." The is_generic_* helpers
  catch all of these shapes.
- When the OG title is generic, humanise the URL slug: the path
  `/listing/123456789/personalized-stainless-steel-tumbler` becomes
  `Personalized Stainless Steel Tumbler`, so callers always get a
  meaningful title even on dead listings.
- Etsy uses `brand` (a top-level JSON-LD field) for the shop name on
  listings that don't ship offers[].seller.name. Shop now falls through
  offers -> brand so either schema resolves.
- Verified live: listing/1097462299 returns full rich data (title, price
  51.43 EUR, shop BlankEarthCeramics, 4.9 rating / 225 reviews, InStock).

cloud.rs — module doc update:
- Added an architecture section documenting that api.webclaw.io does not
  return raw HTML by design and that [`synthesize_html`] reassembles the
  parsed response (metadata + structured_data + markdown) back into
  minimal HTML so existing local parsers run unchanged across both paths.
  Also notes the DOM-regex limitation for extractors that need
  live-page-specific DOM IDs.

Tests: 215 passing in webclaw-fetch (18 new), clippy clean. Smoke test
against all 28 extractors with WEBCLAW_CLOUD_API_KEY: 28/28 clean,
0 partial, 0 failed.
This commit is contained in:
parent e10066f527
commit b2e7dbf365
4 changed files with 825 additions and 172 deletions
@@ -24,6 +24,37 @@
 //! parser on it. Returns the typed [`CloudError`] so extractors can
 //! emit precise "upgrade your plan" / "invalid key" messages.
+//!
+//! ## Cloud response shape and [`synthesize_html`]
+//!
+//! `api.webclaw.io/v1/scrape` deliberately does **not** return an
+//! `html` field even when `formats=["html"]` is requested. By design
+//! the cloud API returns a parsed bundle:
+//!
+//! ```text
+//! {
+//!   "url": "https://...",
+//!   "metadata": { title, description, image, site_name, ... },  // OG / meta tags
+//!   "structured_data": [ { "@type": "...", ... }, ... ],        // JSON-LD blocks
+//!   "markdown": "# Page Title\n\n...",                          // cleaned markdown
+//!   "antibot": { engine, path, user_agent },                    // bypass telemetry
+//!   "cache": { status, age_seconds }
+//! }
+//! ```
+//!
+//! [`CloudClient::fetch_html`] reassembles that bundle back into a
+//! minimal synthetic HTML document so the existing local extractor
+//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
+//! cloud output. Each `structured_data` entry becomes a
+//! `<script type="application/ld+json">` tag; each `metadata` field
+//! becomes a `<meta property="og:...">` tag; `markdown` lands in a
+//! `<pre>` inside the body. Callers that walk Schema.org blocks see
+//! exactly what they'd see on a real live page.
+//!
+//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
+//! won't hit on the synthesised HTML — those IDs only exist on live
+//! Amazon pages. Extractors that need DOM regex keep OG meta tag
+//! fallbacks for that reason.
 //!
 //! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
 //! signup when a site is blocked; nothing fails silently. Cloud users
 //! get the escalation for free.
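The reassembly the module doc describes can be sketched in a few lines. This is a hypothetical illustration, not the real `synthesize_html`: the function name, signature, and the flat `&[(&str, &str)]` metadata shape are assumptions here, and real attribute escaping (the `&quot;`-style entities the Amazon extractor later undoes) is elided.

```rust
/// Illustrative sketch: metadata fields become OG <meta> tags, each
/// structured-data block becomes a JSON-LD <script>, and the markdown
/// lands in a <pre>, matching the bundle layout shown above.
fn synthesize_html(metadata: &[(&str, &str)], jsonld_blocks: &[&str], markdown: &str) -> String {
    let mut html = String::from("<html><head>");
    for (key, value) in metadata {
        // e.g. ("title", "Page") -> <meta property="og:title" content="Page">
        html.push_str(&format!(r#"<meta property="og:{key}" content="{value}">"#));
    }
    for block in jsonld_blocks {
        // JSON-LD blocks are re-emitted verbatim so Schema.org walkers still match.
        html.push_str(&format!(r#"<script type="application/ld+json">{block}</script>"#));
    }
    html.push_str("</head><body><pre>");
    html.push_str(markdown);
    html.push_str("</pre></body></html>");
    html
}
```

Because the output is ordinary HTML, the local JSON-LD and OG-regex parsers run over it unchanged; only live-page DOM IDs are missing, as the doc notes.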
@@ -1,16 +1,25 @@
 //! Amazon product detail page extractor.
 //!
-//! Amazon product pages (`/dp/{ASIN}/` on every locale) always return
-//! a "Sorry, we need to verify you're human" interstitial to any
-//! client without a warm Amazon session + residential IP. Detection
-//! fires immediately in [`cloud::is_bot_protected`] via the dedicated
-//! Amazon heuristic, so this extractor always hits the cloud fallback
-//! path in practice.
+//! Amazon product pages (`/dp/{ASIN}/` on every locale) are
+//! inconsistently protected. Sometimes our local TLS fingerprint gets
+//! a real HTML page; sometimes we land on a CAPTCHA interstitial;
+//! sometimes we land on a real page that for whatever reason ships
+//! no Product JSON-LD (Amazon A/B-tests this regularly). So the
+//! extractor has a two-stage fallback:
 //!
-//! Parsing logic works on the final HTML, local or cloud-sourced. We
-//! read the product details primarily from JSON-LD `Product` blocks
-//! (Amazon exposes a solid subset for SEO) plus a couple of Amazon-
-//! specific DOM IDs picked up with cheap regex.
+//! 1. Try local fetch + parse. If we got Product JSON-LD back, great:
+//!    we have everything (title, brand, price, availability, rating).
+//! 2. If local fetch worked *but the page has no Product JSON-LD* AND
+//!    a cloud client is configured, force-escalate to api.webclaw.io.
+//!    Cloud's render + antibot pipeline reliably surfaces the
+//!    structured data. Without a cloud client we return whatever we
+//!    got from local (usually just title via `#productTitle` or OG
+//!    meta tags).
+//!
+//! Parsing tries JSON-LD first, DOM regex (`#productTitle`,
+//! `#landingImage`) second, OG `<meta>` tags third. The OG path
+//! matters because the cloud's synthesized HTML ships metadata as
+//! OG tags but lacks Amazon's DOM IDs.
 //!
 //! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/`
 //! path. ASINs are a stable Amazon identifier so we extract that as
@@ -54,10 +63,36 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
     let asin = parse_asin(url)
         .ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;
 
-    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
+    let mut fetched = cloud::smart_fetch_html(client, client.cloud(), url)
         .await
         .map_err(cloud_to_fetch_err)?;
 
+    // Amazon ships Product JSON-LD inconsistently even on non-CAPTCHA
+    // pages (they A/B-test it). When local fetch succeeded but has no
+    // Product JSON-LD, force-escalate to the cloud which runs the
+    // render pipeline and reliably surfaces structured data. No-op
+    // when cloud isn't configured — we return whatever local gave us.
+    if fetched.source == cloud::FetchSource::Local
+        && find_product_jsonld(&fetched.html).is_none()
+        && let Some(c) = client.cloud()
+    {
+        match c.fetch_html(url).await {
+            Ok(cloud_html) => {
+                fetched = cloud::FetchedHtml {
+                    html: cloud_html,
+                    final_url: url.to_string(),
+                    source: cloud::FetchSource::Cloud,
+                };
+            }
+            Err(e) => {
+                tracing::debug!(
+                    error = %e,
+                    "amazon_product: cloud escalation failed, keeping local"
+                );
+            }
+        }
+    }
+
     let mut data = parse(&fetched.html, url, &asin);
     if let Some(obj) = data.as_object_mut() {
         obj.insert(
@@ -77,16 +112,23 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
 /// without carrying webclaw_fetch types.
 pub fn parse(html: &str, url: &str, asin: &str) -> Value {
     let jsonld = find_product_jsonld(html);
+    // Three-tier title: JSON-LD `name` > Amazon's `#productTitle` span
+    // (only present on real static HTML) > cloud-synthesized og:title.
     let title = jsonld
         .as_ref()
         .and_then(|v| get_text(v, "name"))
-        .or_else(|| dom_title(html));
+        .or_else(|| dom_title(html))
+        .or_else(|| og(html, "title"));
     let image = jsonld
         .as_ref()
         .and_then(get_first_image)
-        .or_else(|| dom_image(html));
+        .or_else(|| dom_image(html))
+        .or_else(|| og(html, "image"));
     let brand = jsonld.as_ref().and_then(get_brand);
-    let description = jsonld.as_ref().and_then(|v| get_text(v, "description"));
+    let description = jsonld
+        .as_ref()
+        .and_then(|v| get_text(v, "description"))
+        .or_else(|| og(html, "description"));
     let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
     let offer = jsonld.as_ref().and_then(first_offer);
 
@@ -267,6 +309,31 @@ fn dom_image(html: &str) -> Option<String> {
         .map(|m| m.as_str().to_string())
 }
 
+/// OG meta tag lookup. Cloud-synthesized HTML ships these even when
+/// JSON-LD and Amazon-DOM-IDs are both absent, so they're the last
+/// line of defence for `title`, `image`, `description`.
+fn og(html: &str, prop: &str) -> Option<String> {
+    static RE: OnceLock<Regex> = OnceLock::new();
+    let re = RE.get_or_init(|| {
+        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
+    });
+    for c in re.captures_iter(html) {
+        if c.get(1).is_some_and(|m| m.as_str() == prop) {
+            return c.get(2).map(|m| html_unescape(m.as_str()));
+        }
+    }
+    None
+}
+
+/// Undo the synthesize_html attribute escaping for the few entities it
+/// emits. Keeps us off a heavier HTML-entity dep.
+fn html_unescape(s: &str) -> String {
+    s.replace("&quot;", "\"")
+        .replace("&amp;", "&")
+        .replace("&lt;", "<")
+        .replace("&gt;", ">")
+}
+
 fn cloud_to_fetch_err(e: CloudError) -> FetchError {
     FetchError::Build(e.to_string())
 }
@@ -358,4 +425,28 @@ mod tests {
             "https://m.media-amazon.com/images/I/fallback.jpg"
         );
     }
+
+    #[test]
+    fn parse_falls_back_to_og_meta_when_no_jsonld_no_dom() {
+        // Shape we see from the cloud synthesize_html path: OG tags
+        // only, no JSON-LD, no Amazon DOM IDs.
+        let html = r##"<html><head>
+            <meta property="og:title" content="Cloud-sourced MacBook Pro">
+            <meta property="og:image" content="https://m.media-amazon.com/images/I/cloud.jpg">
+            <meta property="og:description" content="Via api.webclaw.io">
+        </head></html>"##;
+        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
+        assert_eq!(v["title"], "Cloud-sourced MacBook Pro");
+        assert_eq!(v["image"], "https://m.media-amazon.com/images/I/cloud.jpg");
+        assert_eq!(v["description"], "Via api.webclaw.io");
+    }
+
+    #[test]
+    fn og_unescape_handles_quot_entity() {
+        let html = r#"<meta property="og:title" content="Apple &quot;M2 Pro&quot; Laptop">"#;
+        assert_eq!(
+            og(html, "title").as_deref(),
+            Some(r#"Apple "M2 Pro" Laptop"#)
+        );
+    }
 }
@@ -10,6 +10,15 @@
 //! but some listings return a CF interstitial. We route through
 //! `cloud::smart_fetch_html` so both paths resolve to the same parser,
 //! same as `ebay_listing`.
+//!
+//! ## URL slug as last-resort title
+//!
+//! Even with cloud antibot bypass, Etsy frequently serves a generic
+//! page with minimal metadata (`og:title = "etsy.com"`, no JSON-LD,
+//! empty markdown). In that case we humanise the slug from the URL
+//! (`/listing/{id}/personalized-stainless-steel-tumbler` becomes
+//! "Personalized Stainless Steel Tumbler") so callers always get a
+//! meaningful title. Degrades gracefully when the URL has no slug.
 
 use std::sync::OnceLock;
 
@@ -63,15 +72,17 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
 
 pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
     let jsonld = find_product_jsonld(html);
+    let slug_title = humanise_slug(parse_slug(url).as_deref());
 
     let title = jsonld
         .as_ref()
         .and_then(|v| get_text(v, "name"))
-        .or_else(|| og(html, "title"));
+        .or_else(|| og(html, "title").filter(|t| !is_generic_title(t)))
+        .or(slug_title);
     let description = jsonld
         .as_ref()
         .and_then(|v| get_text(v, "description"))
-        .or_else(|| og(html, "description"));
+        .or_else(|| og(html, "description").filter(|d| !is_generic_description(d)));
     let image = jsonld
         .as_ref()
         .and_then(get_first_image)
@@ -98,13 +109,18 @@ pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
         .and_then(|v| get_text(v, "itemCondition"))
         .map(strip_schema_prefix);
 
-    // Shop name lives under offers[0].seller.name on Etsy.
-    let shop = offer.as_ref().and_then(|o| {
+    // Shop name: offers[0].seller.name on newer listings, top-level
+    // `brand` on older listings (Etsy changed the schema around 2022).
+    // Fall back through both so either shape resolves.
+    let shop = offer
+        .as_ref()
+        .and_then(|o| {
             o.get("seller")
                 .and_then(|s| s.get("name"))
                 .and_then(|n| n.as_str())
                 .map(String::from)
-    });
+        })
+        .or_else(|| brand.clone());
     let shop_url = shop_url_from_html(html);
 
     let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
@@ -158,6 +174,87 @@ fn parse_listing_id(url: &str) -> Option<String> {
         .map(|m| m.as_str().to_string())
 }
 
+/// Extract the URL slug after the listing id, e.g.
+/// `personalized-stainless-steel-tumbler`. Returns `None` when the URL
+/// is the bare `/listing/{id}` shape.
+fn parse_slug(url: &str) -> Option<String> {
+    static RE: OnceLock<Regex> = OnceLock::new();
+    let re = RE.get_or_init(|| Regex::new(r"/listing/\d{6,}/([^/?#]+)").unwrap());
+    re.captures(url)
+        .and_then(|c| c.get(1))
+        .map(|m| m.as_str().to_string())
+}
+
+/// Turn a URL slug into a human-ish title:
+/// `personalized-stainless-steel-tumbler` → `Personalized Stainless
+/// Steel Tumbler`. Word-caps each dash-separated token; preserves
+/// underscores as spaces too. Returns `None` on empty input.
+fn humanise_slug(slug: Option<&str>) -> Option<String> {
+    let raw = slug?.trim();
+    if raw.is_empty() {
+        return None;
+    }
+    let words: Vec<String> = raw
+        .split(['-', '_'])
+        .filter(|w| !w.is_empty())
+        .map(capitalise_word)
+        .collect();
+    if words.is_empty() {
+        None
+    } else {
+        Some(words.join(" "))
+    }
+}
+
+fn capitalise_word(w: &str) -> String {
+    let mut chars = w.chars();
+    match chars.next() {
+        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
+        None => String::new(),
+    }
+}
+
+/// True when the OG title is Etsy's fallback-page title rather than a
+/// listing-specific title. Expired / region-blocked / antibot-filtered
+/// pages return Etsy's sitewide tagline:
+/// `"Etsy - Your place to buy and sell all things handmade..."`, or
+/// simply `"etsy.com"`. A real listing title always starts with the
+/// item name, never with "Etsy - " or the domain.
+fn is_generic_title(t: &str) -> bool {
+    let normalised = t.trim().to_lowercase();
+    if matches!(
+        normalised.as_str(),
+        "etsy.com" | "etsy" | "www.etsy.com" | ""
+    ) {
+        return true;
+    }
+    // Etsy's sitewide marketing tagline, served on 404 / blocked pages.
+    if normalised.starts_with("etsy - ")
+        || normalised.starts_with("etsy.com - ")
+        || normalised.starts_with("etsy uk - ")
+    {
+        return true;
+    }
+    // Etsy's "item unavailable" placeholder, served on delisted
+    // products. Keep the slug fallback so callers still see what the
+    // URL was about.
+    normalised.starts_with("this item is unavailable")
+        || normalised.starts_with("sorry, this item is")
+        || normalised == "item not available - etsy"
+}
+
+/// True when the OG description is an Etsy error-page placeholder or
+/// sitewide marketing blurb rather than a real listing description.
+fn is_generic_description(d: &str) -> bool {
+    let normalised = d.trim().to_lowercase();
+    if normalised.is_empty() {
+        return true;
+    }
+    normalised.starts_with("sorry, the page you were looking for")
+        || normalised.starts_with("page not found")
+        || normalised.starts_with("find the perfect handmade gift")
+}
+
 // ---------------------------------------------------------------------------
 // JSON-LD walkers (same shape as ebay_listing; kept separate so the two
 // extractors can diverge without cross-impact)
@@ -388,4 +485,88 @@ mod tests {
         // No price fields when we only have OG.
         assert!(v["price"].is_null());
     }
+
+    #[test]
+    fn parse_slug_from_url() {
+        assert_eq!(
+            parse_slug("https://www.etsy.com/listing/123456789/vintage-typewriter"),
+            Some("vintage-typewriter".into())
+        );
+        assert_eq!(
+            parse_slug("https://www.etsy.com/listing/123456789/slug?ref=shop"),
+            Some("slug".into())
+        );
+        assert_eq!(parse_slug("https://www.etsy.com/listing/123456789"), None);
+        assert_eq!(
+            parse_slug("https://www.etsy.com/fr/listing/123456789/slug"),
+            Some("slug".into())
+        );
+    }
+
+    #[test]
+    fn humanise_slug_capitalises_each_word() {
+        assert_eq!(
+            humanise_slug(Some("personalized-stainless-steel-tumbler")).as_deref(),
+            Some("Personalized Stainless Steel Tumbler")
+        );
+        assert_eq!(
+            humanise_slug(Some("hand_crafted_mug")).as_deref(),
+            Some("Hand Crafted Mug")
+        );
+        assert_eq!(humanise_slug(Some("")), None);
+        assert_eq!(humanise_slug(None), None);
+    }
+
+    #[test]
+    fn is_generic_title_catches_common_shapes() {
+        assert!(is_generic_title("etsy.com"));
+        assert!(is_generic_title("Etsy"));
+        assert!(is_generic_title(" etsy.com "));
+        assert!(is_generic_title(
+            "Etsy - Your place to buy and sell all things handmade, vintage, and supplies"
+        ));
+        assert!(is_generic_title("Etsy UK - Vintage & Handmade"));
+        assert!(!is_generic_title("Vintage Typewriter"));
+        assert!(!is_generic_title("Handmade Etsy-style Mug"));
+    }
+
+    #[test]
+    fn is_generic_description_catches_404_shapes() {
+        assert!(is_generic_description(""));
+        assert!(is_generic_description(
+            "Sorry, the page you were looking for was not found."
+        ));
+        assert!(is_generic_description("Page not found"));
+        assert!(!is_generic_description(
+            "Hand-thrown ceramic mug, dishwasher safe."
+        ));
+    }
+
+    #[test]
+    fn parse_uses_slug_when_og_is_generic() {
+        // Cloud-blocked Etsy listing: og:title is a site-wide generic
+        // placeholder, no JSON-LD, no description. Slug should win.
+        let html = r#"<html><head>
+            <meta property="og:title" content="etsy.com">
+        </head></html>"#;
+        let v = parse(
+            html,
+            "https://www.etsy.com/listing/1079113183/personalized-stainless-steel-tumbler",
+            "1079113183",
+        );
+        assert_eq!(v["title"], "Personalized Stainless Steel Tumbler");
+    }
+
+    #[test]
+    fn parse_prefers_real_og_over_slug() {
+        let html = r#"<html><head>
+            <meta property="og:title" content="Real Listing Title">
+        </head></html>"#;
+        let v = parse(
+            html,
+            "https://www.etsy.com/listing/1079113183/the-url-slug",
+            "1079113183",
+        );
+        assert_eq!(v["title"], "Real Listing Title");
+    }
 }
@@ -1,13 +1,34 @@
 //! Trustpilot company reviews extractor.
 //!
-//! `trustpilot.com/review/{domain}` pages embed a JSON-LD
-//! `Organization` / `LocalBusiness` block with aggregate rating + up
-//! to 20 recent reviews. The page HTML itself is usually behind AWS
-//! WAF's "Verifying Connection" interstitial — so this extractor
-//! always uses [`cloud::smart_fetch_html`] and only returns data when
-//! the caller has `WEBCLAW_API_KEY` set (cloud handles the bypass).
-//! OSS users without a key get a clear error pointing at signup.
+//! `trustpilot.com/review/{domain}` pages are always behind AWS WAF's
+//! "Verifying your connection" interstitial, so this extractor always
+//! routes through [`cloud::smart_fetch_html`]. Without
+//! `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` it returns a clean
+//! "set API key" error; with one it escalates to api.webclaw.io.
+//!
+//! ## 2025 JSON-LD schema
+//!
+//! Trustpilot replaced the old single-Organization + aggregateRating
+//! shape with three separate JSON-LD blocks:
+//!
+//! 1. `Organization` block for Trustpilot the platform itself
+//!    (company info, addresses, social profiles). Not the business
+//!    being reviewed. We detect and skip this.
+//! 2. `Dataset` block with a csvw:Table mainEntity that contains the
+//!    per-star-bucket counts for the target business plus a Total
+//!    column. The Dataset's `name` is the business display name.
+//! 3. `aiSummary` + `aiSummaryReviews` block: the AI-generated
+//!    summary of reviews plus the individual review objects
+//!    (consumer, dates, rating, title, text, language, likes).
+//!
+//! Plus `metadata.title` from the page head parses as
+//! `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"` and
+//! `metadata.description` carries `"{N} customers have already said"`.
+//! We use both as extra signal when the Dataset block is absent.
 
+use std::sync::OnceLock;
+
+use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
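For orientation, a minimal example of the Dataset shape the numbered list above describes might look like the following. This is an illustrative sketch, not captured markup: the exact csvw nesting and the sample counts are assumptions, beyond the `about.@id`, `csvw:column`/`csvw:name`, and per-star-count structure the parser comments mention (the 226 total is the anthropic.com figure from the commit message).

```text
{
  "@type": "Dataset",
  "name": "Anthropic",
  "about": { "@id": ".../Organization/anthropic.com" },
  "mainEntity": {
    "@type": "csvw:Table",
    "csvw:tableSchema": {
      "csvw:column": [
        { "csvw:name": "1 star",  "cells": [ { "value": 200 } ] },
        { "csvw:name": "5 stars", "cells": [ { "value": 26 } ] },
        { "csvw:name": "Total",   "cells": [ { "value": 226 } ] }
      ]
    }
  }
}
```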
@@ -18,7 +39,7 @@ use crate::error::FetchError;
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "trustpilot_reviews",
     label: "Trustpilot reviews",
-    description: "Returns company aggregate rating + recent reviews for a business on Trustpilot.",
+    description: "Returns business name, aggregate rating, star distribution, recent reviews, and the AI summary for a Trustpilot /review/{domain} page.",
     url_patterns: &["https://www.trustpilot.com/review/{domain}"],
 };
 
@@ -31,75 +52,88 @@ pub fn matches(url: &str) -> bool {
 }
 
 pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
     // Trustpilot is always behind AWS WAF, so we go through smart_fetch
     // which tries local first (which will hit the challenge interstitial),
     // detects it, and escalates to cloud /v1/scrape for the real HTML.
     let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
         .await
         .map_err(cloud_to_fetch_err)?;
 
-    let html = parse(&fetched.html, url)?;
-    Ok(html_with_source(html, fetched.source))
+    let mut data = parse(&fetched.html, url)?;
+    if let Some(obj) = data.as_object_mut() {
+        obj.insert(
+            "data_source".into(),
+            match fetched.source {
+                cloud::FetchSource::Local => json!("local"),
+                cloud::FetchSource::Cloud => json!("cloud"),
+            },
+        );
+    }
+    Ok(data)
 }
 
-/// Run the pure parser on already-fetched HTML. Split out so the cloud
-/// pipeline can call it directly after its own antibot-aware fetch
-/// without going through [`extract`].
+/// Pure parser. Kept public so the cloud pipeline can reuse it on its
+/// own fetched HTML without going through the async extract path.
 pub fn parse(html: &str, url: &str) -> Result<Value, FetchError> {
-    let blocks = webclaw_core::structured_data::extract_json_ld(html);
-    let business = find_business(&blocks).ok_or_else(|| {
-        FetchError::BodyDecode(format!(
-            "trustpilot_reviews: no Organization/LocalBusiness JSON-LD on {url}"
+    let domain = parse_review_domain(url).ok_or_else(|| {
+        FetchError::Build(format!(
+            "trustpilot_reviews: cannot parse /review/{{domain}} from '{url}'"
         ))
     })?;
 
-    let aggregate_rating = business.get("aggregateRating").map(|r| {
-        json!({
-            "rating_value": get_text(r, "ratingValue"),
-            "best_rating": get_text(r, "bestRating"),
-            "review_count": get_text(r, "reviewCount"),
-        })
-    });
+    let blocks = webclaw_core::structured_data::extract_json_ld(html);
 
-    let reviews: Vec<Value> = business
-        .get("review")
-        .and_then(|r| r.as_array())
-        .map(|arr| {
-            arr.iter()
-                .map(|r| {
-                    json!({
-                        "author": r.get("author")
-                            .and_then(|a| a.get("name"))
-                            .and_then(|n| n.as_str())
-                            .map(String::from)
-                            .or_else(|| r.get("author").and_then(|a| a.as_str()).map(String::from)),
-                        "date_published": get_text(r, "datePublished"),
-                        "name": get_text(r, "name"),
-                        "body": get_text(r, "reviewBody"),
-                        "rating_value": r.get("reviewRating")
-                            .and_then(|rr| rr.get("ratingValue"))
-                            .and_then(|v| v.as_str().map(String::from)
-                                .or_else(|| v.as_f64().map(|n| n.to_string()))),
-                        "language": get_text(r, "inLanguage"),
-                    })
-                })
-                .collect()
-        })
-        .unwrap_or_default();
+    // The business Dataset block has `about.@id` pointing to the target
+    // domain's Organization (e.g. `.../Organization/anthropic.com`).
+    let dataset = find_business_dataset(&blocks, &domain);
+
+    // The aiSummary block: not typed (no `@type`), detect by key.
+    let ai_block = find_ai_summary_block(&blocks);
+
+    // Business name: Dataset > metadata.title regex > URL domain.
+    let business_name = dataset
+        .as_ref()
+        .and_then(|d| get_string(d, "name"))
+        .or_else(|| parse_name_from_og_title(html))
+        .or_else(|| Some(domain.clone()));
+
+    // Rating distribution from the csvw:Table columns. Each column has
+    // csvw:name like "1 star" / "Total" and a single cell with the
+    // integer count.
+    let distribution = dataset.as_ref().and_then(parse_star_distribution);
+    let (rating_from_dist, total_from_dist) = distribution
+        .as_ref()
+        .map(compute_rating_stats)
+        .unwrap_or((None, None));
+
+    // Page-title / page-description fallbacks. OG title format:
+    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
+    let (rating_label, rating_from_og) = parse_rating_from_og_title(html);
+    let total_from_desc = parse_review_count_from_og_description(html);
+
+    // Recent reviews carried by the aiSummary block.
+    let recent_reviews: Vec<Value> = ai_block
+        .as_ref()
+        .and_then(|a| a.get("aiSummaryReviews"))
+        .and_then(|arr| arr.as_array())
+        .map(|arr| arr.iter().map(extract_review).collect())
+        .unwrap_or_default();
+
+    let ai_summary = ai_block
+        .as_ref()
+        .and_then(|a| a.get("aiSummary"))
+        .and_then(|s| s.get("summary"))
+        .and_then(|t| t.as_str())
+        .map(String::from);
 
     Ok(json!({
         "url": url,
-        "name": get_text(&business, "name"),
-        "description": get_text(&business, "description"),
-        "logo": business.get("logo").and_then(|l| l.as_str()).map(String::from)
-            .or_else(|| business.get("logo").and_then(|l| l.get("url")).and_then(|v| v.as_str()).map(String::from)),
-        "telephone": get_text(&business, "telephone"),
-        "address": business.get("address").cloned(),
-        "same_as": business.get("sameAs").cloned(),
-        "aggregate_rating": aggregate_rating,
-        "review_count_listed": reviews.len(),
-        "reviews": reviews,
-        "business_schema": business.get("@type").cloned(),
+        "domain": domain,
+        "business_name": business_name,
+        "rating_label": rating_label,
+        "average_rating": rating_from_dist.or(rating_from_og),
+        "review_count": total_from_dist.or(total_from_desc),
+        "rating_distribution": distribution,
+        "ai_summary": ai_summary,
+        "recent_reviews": recent_reviews,
+        "review_count_listed": recent_reviews.len(),
    }))
 }
@@ -107,87 +141,10 @@ fn cloud_to_fetch_err(e: CloudError) -> FetchError {
     FetchError::Build(e.to_string())
 }
 
-/// Stamp `data_source` onto the parser output so callers can tell at a
-/// glance whether this row came from local or cloud. Useful for UX and
-/// for pricing-aware pipelines.
-fn html_with_source(mut v: Value, source: cloud::FetchSource) -> Value {
-    if let Some(obj) = v.as_object_mut() {
-        obj.insert(
-            "data_source".into(),
-            match source {
-                cloud::FetchSource::Local => json!("local"),
-                cloud::FetchSource::Cloud => json!("cloud"),
-            },
-        );
-    }
-    v
-}
-
 // ---------------------------------------------------------------------------
-// JSON-LD walker — same pattern as ecommerce_product
+// URL helpers
 // ---------------------------------------------------------------------------
 
-fn find_business(blocks: &[Value]) -> Option<Value> {
-    for b in blocks {
-        if let Some(found) = find_business_in(b) {
-            return Some(found);
-        }
-    }
-    None
-}
-
-fn find_business_in(v: &Value) -> Option<Value> {
-    if is_business_type(v) {
-        return Some(v.clone());
-    }
-    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
-        for item in graph {
-            if let Some(found) = find_business_in(item) {
-                return Some(found);
-            }
-        }
-    }
-    if let Some(arr) = v.as_array() {
-        for item in arr {
-            if let Some(found) = find_business_in(item) {
-                return Some(found);
-            }
-        }
-    }
-    None
-}
-
-fn is_business_type(v: &Value) -> bool {
-    let t = match v.get("@type") {
-        Some(t) => t,
-        None => return false,
-    };
-    let match_str = |s: &str| {
-        matches!(
-            s,
-            "Organization"
-                | "LocalBusiness"
-                | "Corporation"
-                | "OnlineBusiness"
-                | "Store"
-                | "Service"
-        )
-    };
-    match t {
-        Value::String(s) => match_str(s),
-        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
-        _ => false,
-    }
-}
-
-fn get_text(v: &Value, key: &str) -> Option<String> {
-    v.get(key).and_then(|x| match x {
-        Value::String(s) => Some(s.clone()),
-        Value::Number(n) => Some(n.to_string()),
-        _ => None,
-    })
-}
-
 fn host_of(url: &str) -> &str {
     url.split("://")
         .nth(1)
@@ -197,6 +154,285 @@ fn host_of(url: &str) -> &str {
        .unwrap_or("")
}

/// Pull the target domain from `trustpilot.com/review/{domain}`.
fn parse_review_domain(url: &str) -> Option<String> {
    let after = url.split("/review/").nth(1)?;
    let stripped = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .split('/')
        .next()
        .unwrap_or("");
    if stripped.is_empty() {
        None
    } else {
        Some(stripped.to_string())
    }
}

// ---------------------------------------------------------------------------
// JSON-LD block walkers
// ---------------------------------------------------------------------------

/// Find the Dataset block whose `about.@id` references the target
/// domain's Organization. Falls through to any Dataset if the @id
/// check doesn't match (Trustpilot occasionally varies the URL).
fn find_business_dataset(blocks: &[Value], domain: &str) -> Option<Value> {
    let mut fallback_any_dataset: Option<Value> = None;
    for block in blocks {
        for node in walk_graph(block) {
            if !is_dataset(&node) {
                continue;
            }
            if dataset_about_matches_domain(&node, domain) {
                return Some(node);
            }
            if fallback_any_dataset.is_none() {
                fallback_any_dataset = Some(node);
            }
        }
    }
    fallback_any_dataset
}

fn is_dataset(v: &Value) -> bool {
    v.get("@type")
        .and_then(|t| t.as_str())
        .is_some_and(|s| s == "Dataset")
}

fn dataset_about_matches_domain(v: &Value, domain: &str) -> bool {
    let about_id = v
        .get("about")
        .and_then(|a| a.get("@id"))
        .and_then(|id| id.as_str());
    let Some(id) = about_id else {
        return false;
    };
    id.contains(&format!("/Organization/{domain}"))
}

/// The aiSummary / aiSummaryReviews block has no `@type`, so match by
/// presence of the `aiSummary` key.
fn find_ai_summary_block(blocks: &[Value]) -> Option<Value> {
    for block in blocks {
        for node in walk_graph(block) {
            if node.get("aiSummary").is_some() {
                return Some(node);
            }
        }
    }
    None
}

/// Flatten each block (and its `@graph`) into a list of nodes we can
/// iterate over. Handles both `@graph: [ ... ]` (array) and
/// `@graph: { ... }` (single object) shapes — Trustpilot uses both.
fn walk_graph(block: &Value) -> Vec<Value> {
    let mut out = vec![block.clone()];
    if let Some(graph) = block.get("@graph") {
        match graph {
            Value::Array(arr) => out.extend(arr.iter().cloned()),
            Value::Object(_) => out.push(graph.clone()),
            _ => {}
        }
    }
    out
}

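The URL-helper logic above (everything after `/review/`, minus query, fragment, trailing slash, and any extra path segments) can be exercised on its own. A minimal standalone sketch — `review_domain` is a hypothetical free-standing copy of the same split/trim pipeline, using only the standard library:

```rust
// Standalone sketch of the parse_review_domain pipeline:
// take the text after "/review/", cut at '?' or '#', drop a
// trailing '/', and keep only the first path segment.
fn review_domain(url: &str) -> Option<String> {
    let after = url.split("/review/").nth(1)?;
    let first = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .split('/')
        .next()
        .unwrap_or("");
    // Empty segment means the URL had no target domain at all.
    (!first.is_empty()).then(|| first.to_string())
}

fn main() {
    assert_eq!(
        review_domain("https://www.trustpilot.com/review/anthropic.com?stars=5"),
        Some("anthropic.com".to_string())
    );
    // No "/review/" segment: nothing to extract.
    assert_eq!(review_domain("https://www.trustpilot.com/"), None);
    println!("ok");
}
```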
// ---------------------------------------------------------------------------
|
||||
// Rating distribution (csvw:Table)
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// Parse the per-star distribution from the Dataset block. Returns
|
||||
/// `{"1_star": {count, percent}, ..., "total": {count, percent}}`.
|
||||
fn parse_star_distribution(dataset: &Value) -> Option<Value> {
|
||||
let columns = dataset
|
||||
.get("mainEntity")?
|
||||
.get("csvw:tableSchema")?
|
||||
.get("csvw:columns")?
|
||||
.as_array()?;
|
||||
let mut out = serde_json::Map::new();
|
||||
for col in columns {
|
||||
let name = col.get("csvw:name").and_then(|n| n.as_str())?;
|
||||
let cell = col.get("csvw:cells").and_then(|c| c.as_array())?.first()?;
|
||||
let count = cell
|
||||
.get("csvw:value")
|
||||
.and_then(|v| v.as_str())
|
||||
.and_then(|s| s.parse::<i64>().ok());
|
||||
let percent = cell
|
||||
.get("csvw:notes")
|
||||
.and_then(|n| n.as_array())
|
||||
.and_then(|arr| arr.first())
|
||||
.and_then(|s| s.as_str())
|
||||
.map(String::from);
|
||||
let key = normalise_star_key(name);
|
||||
out.insert(
|
||||
key,
|
||||
json!({
|
||||
"count": count,
|
||||
"percent": percent,
|
||||
}),
|
||||
);
|
||||
}
|
||||
if out.is_empty() {
|
||||
None
|
||||
} else {
|
||||
Some(Value::Object(out))
|
||||
}
|
||||
}
|
||||
|
||||
/// "1 star" -> "one_star", "Total" -> "total". Easier to consume than
|
||||
/// the raw "1 star" key which fights YAML/JS property access.
|
||||
fn normalise_star_key(name: &str) -> String {
|
||||
let trimmed = name.trim().to_lowercase();
|
||||
match trimmed.as_str() {
|
||||
"1 star" => "one_star".into(),
|
||||
"2 stars" => "two_stars".into(),
|
||||
"3 stars" => "three_stars".into(),
|
||||
"4 stars" => "four_stars".into(),
|
||||
"5 stars" => "five_stars".into(),
|
||||
"total" => "total".into(),
|
||||
other => other.replace(' ', "_"),
|
||||
}
|
||||
}
|
||||
|
||||
/// Compute average rating (weighted by bucket) and total count from the
|
||||
/// parsed distribution. Returns `(average, total)`.
|
||||
fn compute_rating_stats(distribution: &Value) -> (Option<String>, Option<i64>) {
|
||||
let Some(obj) = distribution.as_object() else {
|
||||
return (None, None);
|
||||
};
|
||||
let get_count = |key: &str| -> i64 {
|
||||
obj.get(key)
|
||||
.and_then(|v| v.get("count"))
|
||||
.and_then(|v| v.as_i64())
|
||||
.unwrap_or(0)
|
||||
};
|
||||
let one = get_count("one_star");
|
||||
let two = get_count("two_stars");
|
||||
let three = get_count("three_stars");
|
||||
let four = get_count("four_stars");
|
||||
let five = get_count("five_stars");
|
||||
let total_bucket = one + two + three + four + five;
|
||||
let total = obj
|
||||
.get("total")
|
||||
.and_then(|v| v.get("count"))
|
||||
.and_then(|v| v.as_i64())
|
||||
.unwrap_or(total_bucket);
|
||||
if total == 0 {
|
||||
return (None, Some(0));
|
||||
}
|
||||
let weighted = one + (two * 2) + (three * 3) + (four * 4) + (five * 5);
|
||||
let avg = weighted as f64 / total_bucket.max(1) as f64;
|
||||
// One decimal place, matching how Trustpilot displays the score.
|
||||
(Some(format!("{avg:.1}")), Some(total))
|
||||
}
|
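The weighted average reduces to `sum(stars × count) / sum(count)`, rounded to one decimal. A standalone sketch (hypothetical free-standing `weighted_average`, independent of the extractor's JSON types) that reproduces the 1.4 score the live smoke test saw for anthropic.com's 196/9/5/1/15 distribution:

```rust
// Standalone sketch of compute_rating_stats' arithmetic:
// buckets[i] holds the count of (i + 1)-star reviews.
fn weighted_average(buckets: [i64; 5]) -> Option<String> {
    let total: i64 = buckets.iter().sum();
    if total == 0 {
        return None; // no reviews, no average
    }
    let weighted: i64 = buckets
        .iter()
        .enumerate()
        .map(|(i, count)| (i as i64 + 1) * count)
        .sum();
    // One decimal place, as Trustpilot displays the score.
    Some(format!("{:.1}", weighted as f64 / total as f64))
}

fn main() {
    // anthropic.com distribution from the live check: 1★..5★.
    // 308 weighted / 226 total = 1.36… → "1.4".
    assert_eq!(weighted_average([196, 9, 5, 1, 15]).as_deref(), Some("1.4"));
    println!("ok");
}
```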
// ---------------------------------------------------------------------------
// OG / meta-tag fallbacks
// ---------------------------------------------------------------------------

/// Regex out the business name from the standard Trustpilot OG title
/// shape: `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"`.
fn parse_name_from_og_title(html: &str) -> Option<String> {
    let title = og(html, "title")?;
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"^(.+?)\s+is rated\b").unwrap());
    re.captures(&title)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// Pull the rating label (e.g. "Bad", "Excellent") and numeric value
/// from the OG title.
fn parse_rating_from_og_title(html: &str) -> (Option<String>, Option<String>) {
    let Some(title) = og(html, "title") else {
        return (None, None);
    };
    static RE: OnceLock<Regex> = OnceLock::new();
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    let re = RE.get_or_init(|| {
        Regex::new(r#"is rated\s*[\\"]+([^"\\]+)[\\"]+\s*with\s*([\d.]+)\s*/\s*5"#).unwrap()
    });
    let Some(caps) = re.captures(&title) else {
        return (None, None);
    };
    (
        caps.get(1).map(|m| m.as_str().trim().to_string()),
        caps.get(2).map(|m| m.as_str().to_string()),
    )
}

/// Parse "hear what 226 customers have already said" from the OG
/// description tag.
fn parse_review_count_from_og_description(html: &str) -> Option<i64> {
    let desc = og(html, "description")?;
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"(\d[\d,]*)\s+customers").unwrap());
    re.captures(&desc)?
        .get(1)?
        .as_str()
        .replace(',', "")
        .parse::<i64>()
        .ok()
}

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            let raw = c.get(2).map(|m| m.as_str())?;
            return Some(html_unescape(raw));
        }
    }
    None
}

/// Minimal HTML entity unescaping for the few entities the
/// synthesize_html escaper might produce. Keeps us off a heavier dep.
fn html_unescape(s: &str) -> String {
    s.replace("&quot;", "\"")
        .replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
}

fn get_string(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| x.as_str().map(String::from))
}

// ---------------------------------------------------------------------------
// Review extraction
// ---------------------------------------------------------------------------

fn extract_review(r: &Value) -> Value {
    json!({
        "id": r.get("id").and_then(|v| v.as_str()),
        "rating": r.get("rating").and_then(|v| v.as_i64()),
        "title": r.get("title").and_then(|v| v.as_str()),
        "text": r.get("text").and_then(|v| v.as_str()),
        "language": r.get("language").and_then(|v| v.as_str()),
        "source": r.get("source").and_then(|v| v.as_str()),
        "likes": r.get("likes").and_then(|v| v.as_i64()),
        "author": r.get("consumer").and_then(|c| c.get("displayName")).and_then(|v| v.as_str()),
        "author_country": r.get("consumer").and_then(|c| c.get("countryCode")).and_then(|v| v.as_str()),
        "author_review_count": r.get("consumer").and_then(|c| c.get("numberOfReviews")).and_then(|v| v.as_i64()),
        "verified": r.get("consumer").and_then(|c| c.get("isVerified")).and_then(|v| v.as_bool()),
        "date_experienced": r.get("dates").and_then(|d| d.get("experiencedDate")).and_then(|v| v.as_str()),
        "date_published": r.get("dates").and_then(|d| d.get("publishedDate")).and_then(|v| v.as_str()),
    })
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;

@@ -210,13 +446,127 @@ mod tests {
    }

    #[test]
    fn is_business_type_handles_variants() {
        use serde_json::json;
        assert!(is_business_type(&json!({"@type": "Organization"})));
        assert!(is_business_type(&json!({"@type": "LocalBusiness"})));
        assert!(is_business_type(
            &json!({"@type": ["Organization", "Corporation"]})
        ));
        assert!(!is_business_type(&json!({"@type": "Product"})));
    }

    #[test]
    fn parse_review_domain_handles_query_and_slash() {
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com"),
            Some("anthropic.com".into())
        );
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com/"),
            Some("anthropic.com".into())
        );
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com?stars=5"),
            Some("anthropic.com".into())
        );
    }

    #[test]
    fn normalise_star_key_covers_all_buckets() {
        assert_eq!(normalise_star_key("1 star"), "one_star");
        assert_eq!(normalise_star_key("2 stars"), "two_stars");
        assert_eq!(normalise_star_key("5 stars"), "five_stars");
        assert_eq!(normalise_star_key("Total"), "total");
    }

    #[test]
    fn compute_rating_stats_weighted_average() {
        // 100 1-stars, 100 5-stars → avg 3.0 over 200 reviews.
        let dist = json!({
            "one_star":    { "count": 100, "percent": "50%" },
            "two_stars":   { "count": 0,   "percent": "0%" },
            "three_stars": { "count": 0,   "percent": "0%" },
            "four_stars":  { "count": 0,   "percent": "0%" },
            "five_stars":  { "count": 100, "percent": "50%" },
            "total":       { "count": 200, "percent": "100%" },
        });
        let (avg, total) = compute_rating_stats(&dist);
        assert_eq!(avg.as_deref(), Some("3.0"));
        assert_eq!(total, Some(200));
    }

    #[test]
    fn parse_og_title_extracts_name_and_rating() {
        let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">"#;
        assert_eq!(parse_name_from_og_title(html), Some("Anthropic".into()));
        let (label, rating) = parse_rating_from_og_title(html);
        assert_eq!(label.as_deref(), Some("Bad"));
        assert_eq!(rating.as_deref(), Some("1.5"));
    }

    #[test]
    fn parse_review_count_from_og_description_picks_number() {
        let html = r#"<meta property="og:description" content="Do you agree? Voice your opinion today and hear what 226 customers have already said.">"#;
        assert_eq!(parse_review_count_from_og_description(html), Some(226));
    }

    #[test]
    fn parse_full_fixture_assembles_all_fields() {
        let html = r##"<html><head>
<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">
<script type="application/ld+json">
{"@context":"https://schema.org","@graph":[
{"@id":"https://www.trustpilot.com/#/schema/Organization/1","@type":"Organization","name":"Trustpilot"}
]}
</script>
<script type="application/ld+json">
{"@context":["https://schema.org",{"csvw":"http://www.w3.org/ns/csvw#"}],
"@graph":{"@id":"https://www.trustpilot.com/#/schema/DataSet/anthropic.com/1",
"@type":"Dataset",
"about":{"@id":"https://www.trustpilot.com/#/schema/Organization/anthropic.com"},
"name":"Anthropic",
"mainEntity":{"@type":"csvw:Table","csvw:tableSchema":{"csvw:columns":[
{"csvw:name":"1 star","csvw:cells":[{"csvw:value":"196","csvw:notes":["87%"]}]},
{"csvw:name":"2 stars","csvw:cells":[{"csvw:value":"9","csvw:notes":["4%"]}]},
{"csvw:name":"3 stars","csvw:cells":[{"csvw:value":"5","csvw:notes":["2%"]}]},
{"csvw:name":"4 stars","csvw:cells":[{"csvw:value":"1","csvw:notes":["0%"]}]},
{"csvw:name":"5 stars","csvw:cells":[{"csvw:value":"15","csvw:notes":["7%"]}]},
{"csvw:name":"Total","csvw:cells":[{"csvw:value":"226","csvw:notes":["100%"]}]}
]}}}}
</script>
<script type="application/ld+json">
{"aiSummary":{"modelVersion":"2.0.0","summary":"Mixed reviews."},
"aiSummaryReviews":[
{"id":"abc","rating":1,"title":"Bad","text":"Didn't work.","language":"en",
"source":"Organic","likes":2,"consumer":{"displayName":"W.FRH","countryCode":"DE","numberOfReviews":69,"isVerified":false},
"dates":{"experiencedDate":"2026-01-05T00:00:00.000Z","publishedDate":"2026-01-05T16:29:31.000Z"}}]}
</script>
</head></html>"##;
        let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
        assert_eq!(v["domain"], "anthropic.com");
        assert_eq!(v["business_name"], "Anthropic");
        assert_eq!(v["rating_label"], "Bad");
        assert_eq!(v["review_count"], 226);
        assert_eq!(v["rating_distribution"]["one_star"]["count"], 196);
        assert_eq!(v["rating_distribution"]["total"]["count"], 226);
        assert_eq!(v["ai_summary"], "Mixed reviews.");
        assert_eq!(v["recent_reviews"].as_array().unwrap().len(), 1);
        assert_eq!(v["recent_reviews"][0]["author"], "W.FRH");
        assert_eq!(v["recent_reviews"][0]["rating"], 1);
        assert_eq!(v["recent_reviews"][0]["title"], "Bad");
    }

    #[test]
    fn parse_falls_back_to_og_when_no_jsonld() {
        let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">"#;
        let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
        assert_eq!(v["domain"], "anthropic.com");
        assert_eq!(v["average_rating"], "1.5");
        assert_eq!(v["review_count"], 226);
        assert_eq!(v["rating_label"], "Bad");
    }

    #[test]
    fn parse_returns_ok_with_url_domain_when_nothing_else() {
        let v = parse(
            "<html><head></head></html>",
            "https://www.trustpilot.com/review/example.com",
        )
        .unwrap();
        assert_eq!(v["domain"], "example.com");
        assert_eq!(v["business_name"], "example.com");
    }
}