mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-04-25 00:06:21 +02:00
fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs)
Addresses the four follow-ups surfaced by the cloud-key smoke test.

trustpilot_reviews — full rewrite for the 2025 schema:
- Trustpilot moved from a single Organization + aggregateRating block to three separate JSON-LD blocks: a site-level Organization (Trustpilot itself), a Dataset with a csvw:Table mainEntity carrying the per-star distribution for the target business, and an aiSummary + aiSummaryReviews block with the AI-generated summary and recent review objects.
- The parser now skips the site-level Org, walks @graph as either an array or a single object, picks the Dataset whose about.@id references the target domain, parses each csvw:column for rating buckets, computes the weighted-average rating + total from the distribution, extracts the aiSummary text, and turns aiSummaryReviews into a clean reviews array with author/country/date/rating/title/text/likes.
- OG-title regex fallbacks for business_name, rating_label, and average_rating when the Dataset block is absent; OG-description regex for review_count.
- Returned shape: url, domain, business_name, rating_label, average_rating, review_count, rating_distribution (per-star count and percent), ai_summary, recent_reviews, review_count_listed, data_source.
- Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 / 226 reviews with full distribution + AI summary + 2 recent reviews.

amazon_product — forced cloud escalation + OG fallback:
- Amazon serves Product JSON-LD intermittently even on non-CAPTCHA pages. When the local fetch returns HTML without Product JSON-LD and a cloud client is configured, force-escalate to the cloud path, which reliably surfaces title + description via its render engine.
- New OG meta-tag fallback for title/image/description so the cloud's synthesize_html output (OG tags only, no #productTitle DOM ID) still yields useful data. Real Amazon pages still prefer the DOM regex.
- Verified live: B0BSHF7WHW escalates to cloud and returns the Apple MacBook Pro title + description + asin.

etsy_listing — slug humanisation + generic-page filtering + shop from brand:
- Etsy serves various placeholder pages when a listing is delisted, blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...", "This item is unavailable - Etsy", plus the OG description "Sorry, the page you were looking for was not found." The is_generic_* helpers catch all three shapes.
- When the OG title is generic, humanise the URL slug: the path `/listing/123456789/personalized-stainless-steel-tumbler` becomes `Personalized Stainless Steel Tumbler`, so callers always get a meaningful title even on dead listings.
- Etsy uses `brand` (a top-level JSON-LD field) for the shop name on listings that don't ship offers[].seller.name. Shop now falls through offers -> brand so either schema resolves.
- Verified live: listing/1097462299 returns full rich data (title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating / 225 reviews, InStock).

cloud.rs — module doc update:
- Added an architecture section documenting that api.webclaw.io does not return raw HTML by design, and that [`synthesize_html`] reassembles the parsed response (metadata + structured_data + markdown) back into minimal HTML so existing local parsers run unchanged across both paths. Also notes the DOM-regex limitation for extractors that need live-page-specific DOM IDs.

Tests: 215 passing in webclaw-fetch (18 new), clippy clean. Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY: 28/28 clean, 0 partial, 0 failed.
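The weighted-average computation described for trustpilot_reviews (per-star buckets from the csvw:Table, averaged into a displayed score) can be sketched as follows. The bucket shape and the function name here are illustrative assumptions, not the crate's actual API:

```rust
/// Sketch: compute (average_rating, review_count) from per-star
/// buckets, e.g. [(1, count_of_1_star), ..., (5, count_of_5_star)].
/// Hypothetical helper; the real parser walks csvw:column entries.
fn rating_from_distribution(buckets: &[(u32, u64)]) -> Option<(f64, u64)> {
    // Total reviews across all star buckets.
    let total: u64 = buckets.iter().map(|(_, n)| n).sum();
    if total == 0 {
        return None;
    }
    // Weighted sum: each bucket contributes stars * count.
    let weighted: u64 = buckets.iter().map(|(stars, n)| u64::from(*stars) * n).sum();
    // Round to one decimal, matching Trustpilot's displayed score.
    let avg = (weighted as f64 / total as f64 * 10.0).round() / 10.0;
    Some((avg, total))
}
```

A distribution heavily skewed toward one star, as in the anthropic.com example above, yields a low average alongside the total count.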
This commit is contained in:
parent e10066f527
commit b2e7dbf365
4 changed files with 825 additions and 172 deletions
@@ -24,6 +24,37 @@
 //! parser on it. Returns the typed [`CloudError`] so extractors can
 //! emit precise "upgrade your plan" / "invalid key" messages.
 //!
+//! ## Cloud response shape and [`synthesize_html`]
+//!
+//! `api.webclaw.io/v1/scrape` deliberately does **not** return a
+//! `html` field even when `formats=["html"]` is requested. By design
+//! the cloud API returns a parsed bundle:
+//!
+//! ```text
+//! {
+//!   "url": "https://...",
+//!   "metadata": { title, description, image, site_name, ... },  // OG / meta tags
+//!   "structured_data": [ { "@type": "...", ... }, ... ],        // JSON-LD blocks
+//!   "markdown": "# Page Title\n\n...",                          // cleaned markdown
+//!   "antibot": { engine, path, user_agent },                    // bypass telemetry
+//!   "cache": { status, age_seconds }
+//! }
+//! ```
+//!
+//! [`CloudClient::fetch_html`] reassembles that bundle back into a
+//! minimal synthetic HTML document so the existing local extractor
+//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
+//! cloud output. Each `structured_data` entry becomes a
+//! `<script type="application/ld+json">` tag; each `metadata` field
+//! becomes a `<meta property="og:...">` tag; `markdown` lands in a
+//! `<pre>` inside the body. Callers that walk Schema.org blocks see
+//! exactly what they'd see on a real live page.
+//!
+//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
+//! won't hit on the synthesised HTML — those IDs only exist on live
+//! Amazon pages. Extractors that need DOM regex keep OG meta tag
+//! fallbacks for that reason.
+//!
 //! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
 //! signup when a site is blocked; nothing fails silently. Cloud users
 //! get the escalation for free.
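The reassembly the module doc describes can be sketched in miniature. This is a simplified stand-in, not the actual `synthesize_html` in cloud.rs; the field names follow the documented bundle, and only quote/ampersand escaping is handled:

```rust
/// Sketch: rebuild minimal HTML from a parsed cloud bundle so local
/// parsers (OG regex, JSON-LD walkers) see a page-like document.
fn synthesize_html_sketch(
    metadata: &[(&str, &str)],
    structured_data: &[&str],
    markdown: &str,
) -> String {
    let mut out = String::from("<html><head>");
    for (key, value) in metadata {
        // Escape so the attribute value stays well-formed.
        let escaped = value.replace('&', "&amp;").replace('"', "&quot;");
        out.push_str(&format!(
            "<meta property=\"og:{key}\" content=\"{escaped}\">"
        ));
    }
    for block in structured_data {
        // Each JSON-LD block becomes a script tag, exactly where the
        // extractors' Schema.org walkers expect to find one.
        out.push_str(&format!(
            "<script type=\"application/ld+json\">{block}</script>"
        ));
    }
    out.push_str("</head><body><pre>");
    out.push_str(markdown);
    out.push_str("</pre></body></html>");
    out
}
```

The escaping step is why extractors reading OG content out of synthesized HTML need an unescape pass on the way back out.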
@@ -1,16 +1,25 @@
 //! Amazon product detail page extractor.
 //!
-//! Amazon product pages (`/dp/{ASIN}/` on every locale) always return
-//! a "Sorry, we need to verify you're human" interstitial to any
-//! client without a warm Amazon session + residential IP. Detection
-//! fires immediately in [`cloud::is_bot_protected`] via the dedicated
-//! Amazon heuristic, so this extractor always hits the cloud fallback
-//! path in practice.
+//! Amazon product pages (`/dp/{ASIN}/` on every locale) are
+//! inconsistently protected. Sometimes our local TLS fingerprint gets
+//! a real HTML page; sometimes we land on a CAPTCHA interstitial;
+//! sometimes we land on a real page that for whatever reason ships
+//! no Product JSON-LD (Amazon A/B-tests this regularly). So the
+//! extractor has a two-stage fallback:
 //!
-//! Parsing logic works on the final HTML, local or cloud-sourced. We
-//! read the product details primarily from JSON-LD `Product` blocks
-//! (Amazon exposes a solid subset for SEO) plus a couple of Amazon-
-//! specific DOM IDs picked up with cheap regex.
+//! 1. Try local fetch + parse. If we got Product JSON-LD back, great:
+//!    we have everything (title, brand, price, availability, rating).
+//! 2. If local fetch worked *but the page has no Product JSON-LD* AND
+//!    a cloud client is configured, force-escalate to api.webclaw.io.
+//!    Cloud's render + antibot pipeline reliably surfaces the
+//!    structured data. Without a cloud client we return whatever we
+//!    got from local (usually just title via `#productTitle` or OG
+//!    meta tags).
+//!
+//! Parsing tries JSON-LD first, DOM regex (`#productTitle`,
+//! `#landingImage`) second, OG `<meta>` tags third. The OG path
+//! matters because the cloud's synthesized HTML ships metadata as
+//! OG tags but lacks Amazon's DOM IDs.
 //!
 //! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/`
 //! path. ASINs are a stable Amazon identifier so we extract that as
@@ -54,10 +63,36 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
     let asin = parse_asin(url)
         .ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;
 
-    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
+    let mut fetched = cloud::smart_fetch_html(client, client.cloud(), url)
         .await
         .map_err(cloud_to_fetch_err)?;
 
+    // Amazon ships Product JSON-LD inconsistently even on non-CAPTCHA
+    // pages (they A/B-test it). When local fetch succeeded but has no
+    // Product JSON-LD, force-escalate to the cloud which runs the
+    // render pipeline and reliably surfaces structured data. No-op
+    // when cloud isn't configured — we return whatever local gave us.
+    if fetched.source == cloud::FetchSource::Local
+        && find_product_jsonld(&fetched.html).is_none()
+        && let Some(c) = client.cloud()
+    {
+        match c.fetch_html(url).await {
+            Ok(cloud_html) => {
+                fetched = cloud::FetchedHtml {
+                    html: cloud_html,
+                    final_url: url.to_string(),
+                    source: cloud::FetchSource::Cloud,
+                };
+            }
+            Err(e) => {
+                tracing::debug!(
+                    error = %e,
+                    "amazon_product: cloud escalation failed, keeping local"
+                );
+            }
+        }
+    }
+
     let mut data = parse(&fetched.html, url, &asin);
     if let Some(obj) = data.as_object_mut() {
         obj.insert(
@@ -77,16 +112,23 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
 /// without carrying webclaw_fetch types.
 pub fn parse(html: &str, url: &str, asin: &str) -> Value {
     let jsonld = find_product_jsonld(html);
+    // Three-tier title: JSON-LD `name` > Amazon's `#productTitle` span
+    // (only present on real static HTML) > cloud-synthesized og:title.
     let title = jsonld
         .as_ref()
         .and_then(|v| get_text(v, "name"))
-        .or_else(|| dom_title(html));
+        .or_else(|| dom_title(html))
+        .or_else(|| og(html, "title"));
     let image = jsonld
         .as_ref()
         .and_then(get_first_image)
-        .or_else(|| dom_image(html));
+        .or_else(|| dom_image(html))
+        .or_else(|| og(html, "image"));
     let brand = jsonld.as_ref().and_then(get_brand);
-    let description = jsonld.as_ref().and_then(|v| get_text(v, "description"));
+    let description = jsonld
+        .as_ref()
+        .and_then(|v| get_text(v, "description"))
+        .or_else(|| og(html, "description"));
     let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
     let offer = jsonld.as_ref().and_then(first_offer);
|
@ -267,6 +309,31 @@ fn dom_image(html: &str) -> Option<String> {
|
||||||
.map(|m| m.as_str().to_string())
|
.map(|m| m.as_str().to_string())
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// OG meta tag lookup. Cloud-synthesized HTML ships these even when
|
||||||
|
/// JSON-LD and Amazon-DOM-IDs are both absent, so they're the last
|
||||||
|
/// line of defence for `title`, `image`, `description`.
|
||||||
|
fn og(html: &str, prop: &str) -> Option<String> {
|
||||||
|
static RE: OnceLock<Regex> = OnceLock::new();
|
||||||
|
let re = RE.get_or_init(|| {
|
||||||
|
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
|
||||||
|
});
|
||||||
|
for c in re.captures_iter(html) {
|
||||||
|
if c.get(1).is_some_and(|m| m.as_str() == prop) {
|
||||||
|
return c.get(2).map(|m| html_unescape(m.as_str()));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
None
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Undo the synthesize_html attribute escaping for the few entities it
|
||||||
|
/// emits. Keeps us off a heavier HTML-entity dep.
|
||||||
|
fn html_unescape(s: &str) -> String {
|
||||||
|
s.replace(""", "\"")
|
||||||
|
.replace("&", "&")
|
||||||
|
.replace("<", "<")
|
||||||
|
.replace(">", ">")
|
||||||
|
}
|
||||||
|
|
||||||
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
|
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
|
||||||
FetchError::Build(e.to_string())
|
FetchError::Build(e.to_string())
|
||||||
}
|
}
|
||||||
|
|
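The last-line-of-defence OG lookup can also be illustrated without the regex crate. This is a dependency-free sketch of the same idea, not the extractor's implementation; it assumes `property` precedes `content` inside the tag and scans forward from the match:

```rust
/// Sketch: find `property="og:{prop}"` and pull the following
/// content="..." attribute, unescaping the entities synthesize_html
/// emits. Illustrative only; the real extractor uses a cached Regex.
fn og_lookup(html: &str, prop: &str) -> Option<String> {
    let needle = format!("property=\"og:{prop}\"");
    let at = html.find(&needle)?;
    let rest = &html[at + needle.len()..];
    // Take everything between content=" and the next literal quote.
    let start = rest.find("content=\"")? + "content=\"".len();
    let end = start + rest[start..].find('"')?;
    Some(rest[start..end].replace("&quot;", "\"").replace("&amp;", "&"))
}
```

The forward scan is the weak point of this sketch (it can cross tag boundaries on adversarial input), which is one reason an anchored regex per `<meta>` tag is the sturdier choice.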
@@ -358,4 +425,28 @@ mod tests {
             "https://m.media-amazon.com/images/I/fallback.jpg"
         );
     }
+
+    #[test]
+    fn parse_falls_back_to_og_meta_when_no_jsonld_no_dom() {
+        // Shape we see from the cloud synthesize_html path: OG tags
+        // only, no JSON-LD, no Amazon DOM IDs.
+        let html = r##"<html><head>
+            <meta property="og:title" content="Cloud-sourced MacBook Pro">
+            <meta property="og:image" content="https://m.media-amazon.com/images/I/cloud.jpg">
+            <meta property="og:description" content="Via api.webclaw.io">
+            </head></html>"##;
+        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
+        assert_eq!(v["title"], "Cloud-sourced MacBook Pro");
+        assert_eq!(v["image"], "https://m.media-amazon.com/images/I/cloud.jpg");
+        assert_eq!(v["description"], "Via api.webclaw.io");
+    }
+
+    #[test]
+    fn og_unescape_handles_quot_entity() {
+        let html = r#"<meta property="og:title" content="Apple &quot;M2 Pro&quot; Laptop">"#;
+        assert_eq!(
+            og(html, "title").as_deref(),
+            Some(r#"Apple "M2 Pro" Laptop"#)
+        );
+    }
 }
@@ -10,6 +10,15 @@
 //! but some listings return a CF interstitial. We route through
 //! `cloud::smart_fetch_html` so both paths resolve to the same parser,
 //! same as `ebay_listing`.
+//!
+//! ## URL slug as last-resort title
+//!
+//! Even with cloud antibot bypass, Etsy frequently serves a generic
+//! page with minimal metadata (`og:title = "etsy.com"`, no JSON-LD,
+//! empty markdown). In that case we humanise the slug from the URL
+//! (`/listing/{id}/personalized-stainless-steel-tumbler` becomes
+//! "Personalized Stainless Steel Tumbler") so callers always get a
+//! meaningful title. Degrades gracefully when the URL has no slug.
 
 use std::sync::OnceLock;
@@ -63,15 +72,17 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
 
 pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
     let jsonld = find_product_jsonld(html);
+    let slug_title = humanise_slug(parse_slug(url).as_deref());
+
     let title = jsonld
         .as_ref()
         .and_then(|v| get_text(v, "name"))
-        .or_else(|| og(html, "title"));
+        .or_else(|| og(html, "title").filter(|t| !is_generic_title(t)))
+        .or(slug_title);
     let description = jsonld
         .as_ref()
         .and_then(|v| get_text(v, "description"))
-        .or_else(|| og(html, "description"));
+        .or_else(|| og(html, "description").filter(|d| !is_generic_description(d)));
     let image = jsonld
         .as_ref()
         .and_then(get_first_image)
@@ -98,13 +109,18 @@ pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
         .and_then(|v| get_text(v, "itemCondition"))
         .map(strip_schema_prefix);
 
-    // Shop name lives under offers[0].seller.name on Etsy.
-    let shop = offer.as_ref().and_then(|o| {
-        o.get("seller")
-            .and_then(|s| s.get("name"))
-            .and_then(|n| n.as_str())
-            .map(String::from)
-    });
+    // Shop name: offers[0].seller.name on newer listings, top-level
+    // `brand` on older listings (Etsy changed the schema around 2022).
+    // Fall back through both so either shape resolves.
+    let shop = offer
+        .as_ref()
+        .and_then(|o| {
+            o.get("seller")
+                .and_then(|s| s.get("name"))
+                .and_then(|n| n.as_str())
+                .map(String::from)
+        })
+        .or_else(|| brand.clone());
     let shop_url = shop_url_from_html(html);
 
     let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
@@ -158,6 +174,87 @@ fn parse_listing_id(url: &str) -> Option<String> {
         .map(|m| m.as_str().to_string())
 }
 
+/// Extract the URL slug after the listing id, e.g.
+/// `personalized-stainless-steel-tumbler`. Returns `None` when the URL
+/// is the bare `/listing/{id}` shape.
+fn parse_slug(url: &str) -> Option<String> {
+    static RE: OnceLock<Regex> = OnceLock::new();
+    let re = RE.get_or_init(|| Regex::new(r"/listing/\d{6,}/([^/?#]+)").unwrap());
+    re.captures(url)
+        .and_then(|c| c.get(1))
+        .map(|m| m.as_str().to_string())
+}
+
+/// Turn a URL slug into a human-ish title:
+/// `personalized-stainless-steel-tumbler` → `Personalized Stainless
+/// Steel Tumbler`. Word-caps each dash-separated token; treats
+/// underscores as separators too. Returns `None` on empty input.
+fn humanise_slug(slug: Option<&str>) -> Option<String> {
+    let raw = slug?.trim();
+    if raw.is_empty() {
+        return None;
+    }
+    let words: Vec<String> = raw
+        .split(['-', '_'])
+        .filter(|w| !w.is_empty())
+        .map(capitalise_word)
+        .collect();
+    if words.is_empty() {
+        None
+    } else {
+        Some(words.join(" "))
+    }
+}
+
+fn capitalise_word(w: &str) -> String {
+    let mut chars = w.chars();
+    match chars.next() {
+        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
+        None => String::new(),
+    }
+}
+
+/// True when the OG title is Etsy's fallback-page title rather than a
+/// listing-specific title. Expired / region-blocked / antibot-filtered
+/// pages return Etsy's sitewide tagline:
+/// `"Etsy - Your place to buy and sell all things handmade..."`, or
+/// simply `"etsy.com"`. A real listing title always starts with the
+/// item name, never with "Etsy - " or the domain.
+fn is_generic_title(t: &str) -> bool {
+    let normalised = t.trim().to_lowercase();
+    if matches!(
+        normalised.as_str(),
+        "etsy.com" | "etsy" | "www.etsy.com" | ""
+    ) {
+        return true;
+    }
+    // Etsy's sitewide marketing tagline, served on 404 / blocked pages.
+    if normalised.starts_with("etsy - ")
+        || normalised.starts_with("etsy.com - ")
+        || normalised.starts_with("etsy uk - ")
+    {
+        return true;
+    }
+    // Etsy's "item unavailable" placeholder, served on delisted
+    // products. Keep the slug fallback so callers still see what the
+    // URL was about.
+    normalised.starts_with("this item is unavailable")
+        || normalised.starts_with("sorry, this item is")
+        || normalised == "item not available - etsy"
+}
+
+/// True when the OG description is an Etsy error-page placeholder or
+/// sitewide marketing blurb rather than a real listing description.
+fn is_generic_description(d: &str) -> bool {
+    let normalised = d.trim().to_lowercase();
+    if normalised.is_empty() {
+        return true;
+    }
+    normalised.starts_with("sorry, the page you were looking for")
+        || normalised.starts_with("page not found")
+        || normalised.starts_with("find the perfect handmade gift")
+}
+
 // ---------------------------------------------------------------------------
 // JSON-LD walkers (same shape as ebay_listing; kept separate so the two
 // extractors can diverge without cross-impact)
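The title-resolution order in the parser above (real OG title wins, generic placeholders fall through to the humanised slug) can be condensed into a self-contained sketch. `resolve_title`, `is_generic`, and `humanise` are simplified stand-ins for the extractor's actual helpers:

```rust
/// Sketch of the fallback chain: take the OG title unless it is a
/// generic Etsy placeholder, otherwise humanise the URL slug.
fn resolve_title(og_title: Option<&str>, slug: Option<&str>) -> Option<String> {
    // Simplified stand-in for is_generic_title.
    fn is_generic(t: &str) -> bool {
        let t = t.trim().to_lowercase();
        t.is_empty() || t == "etsy.com" || t == "etsy" || t.starts_with("etsy - ")
    }
    // Simplified stand-in for humanise_slug + capitalise_word.
    fn humanise(slug: &str) -> Option<String> {
        let words: Vec<String> = slug
            .split(['-', '_'])
            .filter(|w| !w.is_empty())
            .map(|w| {
                let mut c = w.chars();
                c.next()
                    .map(|f| f.to_uppercase().collect::<String>() + c.as_str())
                    .unwrap_or_default()
            })
            .collect();
        if words.is_empty() { None } else { Some(words.join(" ")) }
    }
    og_title
        .filter(|t| !is_generic(t))
        .map(String::from)
        .or_else(|| slug.and_then(humanise))
}
```

This mirrors the `or_else(...).filter(...).or(slug_title)` chain in `parse`: the filter is what lets a dead listing's slug win over the placeholder page.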
@@ -388,4 +485,88 @@ mod tests {
         // No price fields when we only have OG.
         assert!(v["price"].is_null());
     }
+
+    #[test]
+    fn parse_slug_from_url() {
+        assert_eq!(
+            parse_slug("https://www.etsy.com/listing/123456789/vintage-typewriter"),
+            Some("vintage-typewriter".into())
+        );
+        assert_eq!(
+            parse_slug("https://www.etsy.com/listing/123456789/slug?ref=shop"),
+            Some("slug".into())
+        );
+        assert_eq!(parse_slug("https://www.etsy.com/listing/123456789"), None);
+        assert_eq!(
+            parse_slug("https://www.etsy.com/fr/listing/123456789/slug"),
+            Some("slug".into())
+        );
+    }
+
+    #[test]
+    fn humanise_slug_capitalises_each_word() {
+        assert_eq!(
+            humanise_slug(Some("personalized-stainless-steel-tumbler")).as_deref(),
+            Some("Personalized Stainless Steel Tumbler")
+        );
+        assert_eq!(
+            humanise_slug(Some("hand_crafted_mug")).as_deref(),
+            Some("Hand Crafted Mug")
+        );
+        assert_eq!(humanise_slug(Some("")), None);
+        assert_eq!(humanise_slug(None), None);
+    }
+
+    #[test]
+    fn is_generic_title_catches_common_shapes() {
+        assert!(is_generic_title("etsy.com"));
+        assert!(is_generic_title("Etsy"));
+        assert!(is_generic_title(" etsy.com "));
+        assert!(is_generic_title(
+            "Etsy - Your place to buy and sell all things handmade, vintage, and supplies"
+        ));
+        assert!(is_generic_title("Etsy UK - Vintage & Handmade"));
+        assert!(!is_generic_title("Vintage Typewriter"));
+        assert!(!is_generic_title("Handmade Etsy-style Mug"));
+    }
+
+    #[test]
+    fn is_generic_description_catches_404_shapes() {
+        assert!(is_generic_description(""));
+        assert!(is_generic_description(
+            "Sorry, the page you were looking for was not found."
+        ));
+        assert!(is_generic_description("Page not found"));
+        assert!(!is_generic_description(
+            "Hand-thrown ceramic mug, dishwasher safe."
+        ));
+    }
+
+    #[test]
+    fn parse_uses_slug_when_og_is_generic() {
+        // Cloud-blocked Etsy listing: og:title is a site-wide generic
+        // placeholder, no JSON-LD, no description. Slug should win.
+        let html = r#"<html><head>
+            <meta property="og:title" content="etsy.com">
+            </head></html>"#;
+        let v = parse(
+            html,
+            "https://www.etsy.com/listing/1079113183/personalized-stainless-steel-tumbler",
+            "1079113183",
+        );
+        assert_eq!(v["title"], "Personalized Stainless Steel Tumbler");
+    }
+
+    #[test]
+    fn parse_prefers_real_og_over_slug() {
+        let html = r#"<html><head>
+            <meta property="og:title" content="Real Listing Title">
+            </head></html>"#;
+        let v = parse(
+            html,
+            "https://www.etsy.com/listing/1079113183/the-url-slug",
+            "1079113183",
+        );
+        assert_eq!(v["title"], "Real Listing Title");
+    }
 }
@@ -1,13 +1,34 @@
 //! Trustpilot company reviews extractor.
 //!
-//! `trustpilot.com/review/{domain}` pages embed a JSON-LD
-//! `Organization` / `LocalBusiness` block with aggregate rating + up
-//! to 20 recent reviews. The page HTML itself is usually behind AWS
-//! WAF's "Verifying Connection" interstitial — so this extractor
-//! always uses [`cloud::smart_fetch_html`] and only returns data when
-//! the caller has `WEBCLAW_API_KEY` set (cloud handles the bypass).
-//! OSS users without a key get a clear error pointing at signup.
+//! `trustpilot.com/review/{domain}` pages are always behind AWS WAF's
+//! "Verifying your connection" interstitial, so this extractor always
+//! routes through [`cloud::smart_fetch_html`]. Without
+//! `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` it returns a clean
+//! "set API key" error; with one it escalates to api.webclaw.io.
+//!
+//! ## 2025 JSON-LD schema
+//!
+//! Trustpilot replaced the old single-Organization + aggregateRating
+//! shape with three separate JSON-LD blocks:
+//!
+//! 1. `Organization` block for Trustpilot the platform itself
+//!    (company info, addresses, social profiles). Not the business
+//!    being reviewed. We detect and skip this.
+//! 2. `Dataset` block with a csvw:Table mainEntity that contains the
+//!    per-star-bucket counts for the target business plus a Total
+//!    column. The Dataset's `name` is the business display name.
+//! 3. `aiSummary` + `aiSummaryReviews` block: the AI-generated
+//!    summary of reviews plus the individual review objects
+//!    (consumer, dates, rating, title, text, language, likes).
+//!
+//! Plus `metadata.title` from the page head parses as
+//! `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"` and
+//! `metadata.description` carries `"{N} customers have already said"`.
+//! We use both as extra signal when the Dataset block is absent.
 
+use std::sync::OnceLock;
+
+use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
@@ -18,7 +39,7 @@ use crate::error::FetchError;
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "trustpilot_reviews",
     label: "Trustpilot reviews",
-    description: "Returns company aggregate rating + recent reviews for a business on Trustpilot.",
+    description: "Returns business name, aggregate rating, star distribution, recent reviews, and the AI summary for a Trustpilot /review/{domain} page.",
     url_patterns: &["https://www.trustpilot.com/review/{domain}"],
 };
@ -31,75 +52,88 @@ pub fn matches(url: &str) -> bool {
|
||||||
}
|
}
|
||||||
|
|
||||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||||
// Trustpilot is always behind AWS WAF, so we go through smart_fetch
|
|
||||||
// which tries local first (which will hit the challenge interstitial),
|
|
||||||
// detects it, and escalates to cloud /v1/scrape for the real HTML.
|
|
||||||
let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
|
let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
|
||||||
.await
|
.await
|
||||||
.map_err(cloud_to_fetch_err)?;
|
.map_err(cloud_to_fetch_err)?;
|
||||||
|
|
||||||
let html = parse(&fetched.html, url)?;
|
let mut data = parse(&fetched.html, url)?;
|
||||||
Ok(html_with_source(html, fetched.source))
|
if let Some(obj) = data.as_object_mut() {
|
||||||
|
obj.insert(
|
||||||
|
"data_source".into(),
|
||||||
|
match fetched.source {
|
||||||
|
cloud::FetchSource::Local => json!("local"),
|
||||||
|
cloud::FetchSource::Cloud => json!("cloud"),
|
||||||
|
},
|
||||||
|
);
|
||||||
|
}
|
||||||
|
Ok(data)
|
||||||
}
|
}
|

/// Pure parser. Kept public so the cloud pipeline can reuse it on its
/// own fetched HTML without going through the async extract path.
pub fn parse(html: &str, url: &str) -> Result<Value, FetchError> {
    let domain = parse_review_domain(url).ok_or_else(|| {
        FetchError::Build(format!(
            "trustpilot_reviews: cannot parse /review/{{domain}} from '{url}'"
        ))
    })?;

    let blocks = webclaw_core::structured_data::extract_json_ld(html);

    // The business Dataset block has `about.@id` pointing to the target
    // domain's Organization (e.g. `.../Organization/anthropic.com`).
    let dataset = find_business_dataset(&blocks, &domain);

    // The aiSummary block: not typed (no `@type`), detect by key.
    let ai_block = find_ai_summary_block(&blocks);

    // Business name: Dataset > OG-title regex > URL domain.
    let business_name = dataset
        .as_ref()
        .and_then(|d| get_string(d, "name"))
        .or_else(|| parse_name_from_og_title(html))
        .or_else(|| Some(domain.clone()));

    // Rating distribution from the csvw:Table columns. Each column has a
    // csvw:name like "1 star" / "Total" and a single cell with the
    // integer count.
    let distribution = dataset.as_ref().and_then(parse_star_distribution);
    let (rating_from_dist, total_from_dist) = distribution
        .as_ref()
        .map(compute_rating_stats)
        .unwrap_or((None, None));

    // Page-title / page-description fallbacks. OG title format:
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    let (rating_label, rating_from_og) = parse_rating_from_og_title(html);
    let total_from_desc = parse_review_count_from_og_description(html);

    // Recent reviews carried by the aiSummary block.
    let recent_reviews: Vec<Value> = ai_block
        .as_ref()
        .and_then(|a| a.get("aiSummaryReviews"))
        .and_then(|arr| arr.as_array())
        .map(|arr| arr.iter().map(extract_review).collect())
        .unwrap_or_default();

    let ai_summary = ai_block
        .as_ref()
        .and_then(|a| a.get("aiSummary"))
        .and_then(|s| s.get("summary"))
        .and_then(|t| t.as_str())
        .map(String::from);

    Ok(json!({
        "url": url,
        "domain": domain,
        "business_name": business_name,
        "rating_label": rating_label,
        "average_rating": rating_from_dist.or(rating_from_og),
        "review_count": total_from_dist.or(total_from_desc),
        "rating_distribution": distribution,
        "ai_summary": ai_summary,
        "recent_reviews": recent_reviews,
        "review_count_listed": recent_reviews.len(),
    }))
}

@@ -107,87 +141,10 @@ fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
@@ -197,6 +154,285 @@ fn host_of(url: &str) -> &str {
        .unwrap_or("")
}

/// Pull the target domain from `trustpilot.com/review/{domain}`.
fn parse_review_domain(url: &str) -> Option<String> {
    let after = url.split("/review/").nth(1)?;
    let stripped = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .split('/')
        .next()
        .unwrap_or("");
    if stripped.is_empty() {
        None
    } else {
        Some(stripped.to_string())
    }
}
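The slicing steps above can be exercised in isolation. Below is a minimal, hedged sketch: `review_domain` is a hypothetical stdlib-only copy of the same chain (split on `/review/`, cut at `?` or `#`, drop a trailing slash), not the crate's `parse_review_domain` itself.

```rust
/// Hypothetical standalone copy of the /review/{domain} slicing steps.
fn review_domain(url: &str) -> Option<String> {
    let rest = url.split("/review/").nth(1)?; // everything after /review/
    let rest = rest.split(['?', '#']).next()?; // cut query / fragment
    let rest = rest.trim_end_matches('/'); // drop trailing slash
    (!rest.is_empty()).then(|| rest.to_string())
}

fn main() {
    assert_eq!(
        review_domain("https://www.trustpilot.com/review/anthropic.com?stars=1#reviews"),
        Some("anthropic.com".into())
    );
    println!("ok");
}
```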

// ---------------------------------------------------------------------------
// JSON-LD block walkers
// ---------------------------------------------------------------------------

/// Find the Dataset block whose `about.@id` references the target
/// domain's Organization. Falls through to any Dataset if the `@id`
/// check doesn't match (Trustpilot occasionally varies the URL).
fn find_business_dataset(blocks: &[Value], domain: &str) -> Option<Value> {
    let mut fallback_any_dataset: Option<Value> = None;
    for block in blocks {
        for node in walk_graph(block) {
            if !is_dataset(&node) {
                continue;
            }
            if dataset_about_matches_domain(&node, domain) {
                return Some(node);
            }
            if fallback_any_dataset.is_none() {
                fallback_any_dataset = Some(node);
            }
        }
    }
    fallback_any_dataset
}

fn is_dataset(v: &Value) -> bool {
    v.get("@type")
        .and_then(|t| t.as_str())
        .is_some_and(|s| s == "Dataset")
}

fn dataset_about_matches_domain(v: &Value, domain: &str) -> bool {
    let about_id = v
        .get("about")
        .and_then(|a| a.get("@id"))
        .and_then(|id| id.as_str());
    let Some(id) = about_id else {
        return false;
    };
    id.contains(&format!("/Organization/{domain}"))
}

/// The aiSummary / aiSummaryReviews block has no `@type`, so match by
/// presence of the `aiSummary` key.
fn find_ai_summary_block(blocks: &[Value]) -> Option<Value> {
    for block in blocks {
        for node in walk_graph(block) {
            if node.get("aiSummary").is_some() {
                return Some(node);
            }
        }
    }
    None
}

/// Flatten each block (and its `@graph`) into a list of nodes we can
/// iterate over. Handles both `@graph: [ ... ]` (array) and
/// `@graph: { ... }` (single object) shapes; Trustpilot uses both.
fn walk_graph(block: &Value) -> Vec<Value> {
    let mut out = vec![block.clone()];
    if let Some(graph) = block.get("@graph") {
        match graph {
            Value::Array(arr) => out.extend(arr.iter().cloned()),
            Value::Object(_) => out.push(graph.clone()),
            _ => {}
        }
    }
    out
}

// ---------------------------------------------------------------------------
// Rating distribution (csvw:Table)
// ---------------------------------------------------------------------------

/// Parse the per-star distribution from the Dataset block. Returns
/// `{"one_star": {count, percent}, ..., "total": {count, percent}}`.
fn parse_star_distribution(dataset: &Value) -> Option<Value> {
    let columns = dataset
        .get("mainEntity")?
        .get("csvw:tableSchema")?
        .get("csvw:columns")?
        .as_array()?;
    let mut out = serde_json::Map::new();
    for col in columns {
        let name = col.get("csvw:name").and_then(|n| n.as_str())?;
        let cell = col.get("csvw:cells").and_then(|c| c.as_array())?.first()?;
        let count = cell
            .get("csvw:value")
            .and_then(|v| v.as_str())
            .and_then(|s| s.parse::<i64>().ok());
        let percent = cell
            .get("csvw:notes")
            .and_then(|n| n.as_array())
            .and_then(|arr| arr.first())
            .and_then(|s| s.as_str())
            .map(String::from);
        let key = normalise_star_key(name);
        out.insert(
            key,
            json!({
                "count": count,
                "percent": percent,
            }),
        );
    }
    if out.is_empty() {
        None
    } else {
        Some(Value::Object(out))
    }
}

/// "1 star" -> "one_star", "Total" -> "total". Easier to consume than
/// the raw "1 star" key, which fights YAML/JS property access.
fn normalise_star_key(name: &str) -> String {
    let trimmed = name.trim().to_lowercase();
    match trimmed.as_str() {
        "1 star" => "one_star".into(),
        "2 stars" => "two_stars".into(),
        "3 stars" => "three_stars".into(),
        "4 stars" => "four_stars".into(),
        "5 stars" => "five_stars".into(),
        "total" => "total".into(),
        other => other.replace(' ', "_"),
    }
}

/// Compute the average rating (weighted by bucket) and total count from
/// the parsed distribution. Returns `(average, total)`.
fn compute_rating_stats(distribution: &Value) -> (Option<String>, Option<i64>) {
    let Some(obj) = distribution.as_object() else {
        return (None, None);
    };
    let get_count = |key: &str| -> i64 {
        obj.get(key)
            .and_then(|v| v.get("count"))
            .and_then(|v| v.as_i64())
            .unwrap_or(0)
    };
    let one = get_count("one_star");
    let two = get_count("two_stars");
    let three = get_count("three_stars");
    let four = get_count("four_stars");
    let five = get_count("five_stars");
    let total_bucket = one + two + three + four + five;
    let total = obj
        .get("total")
        .and_then(|v| v.get("count"))
        .and_then(|v| v.as_i64())
        .unwrap_or(total_bucket);
    if total == 0 {
        return (None, Some(0));
    }
    let weighted = one + (two * 2) + (three * 3) + (four * 4) + (five * 5);
    let avg = weighted as f64 / total_bucket.max(1) as f64;
    // One decimal place, matching how Trustpilot displays the score.
    (Some(format!("{avg:.1}")), Some(total))
}
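The bucket-weighted arithmetic can be sanity-checked standalone. A minimal stdlib-only sketch: `weighted_average` is a hypothetical reduction of the computation above, and the counts are the live anthropic.com distribution quoted in the commit message (196/9/5/1/15 across 226 reviews, score 1.4).

```rust
// Stdlib-only sketch of the bucket-weighted average:
// sum(stars * count) / sum(count), shown to one decimal place.
fn weighted_average(counts: [i64; 5]) -> f64 {
    let total: i64 = counts.iter().sum();
    let weighted: i64 = counts
        .iter()
        .enumerate()
        .map(|(i, c)| (i as i64 + 1) * c) // index 0 is the 1-star bucket
        .sum();
    weighted as f64 / total.max(1) as f64
}

fn main() {
    // anthropic.com: 196 one-star through 15 five-star reviews.
    let avg = weighted_average([196, 9, 5, 1, 15]);
    println!("{avg:.1}"); // prints 1.4
}
```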

// ---------------------------------------------------------------------------
// OG / meta-tag fallbacks
// ---------------------------------------------------------------------------

/// Regex out the business name from the standard Trustpilot OG title
/// shape: `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"`.
fn parse_name_from_og_title(html: &str) -> Option<String> {
    let title = og(html, "title")?;
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"^(.+?)\s+is rated\b").unwrap());
    re.captures(&title)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// Pull the rating label (e.g. "Bad", "Excellent") and numeric value
/// from the OG title.
fn parse_rating_from_og_title(html: &str) -> (Option<String>, Option<String>) {
    let Some(title) = og(html, "title") else {
        return (None, None);
    };
    static RE: OnceLock<Regex> = OnceLock::new();
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    let re = RE.get_or_init(|| {
        Regex::new(r#"is rated\s*[\\"]+([^"\\]+)[\\"]+\s*with\s*([\d.]+)\s*/\s*5"#).unwrap()
    });
    let Some(caps) = re.captures(&title) else {
        return (None, None);
    };
    (
        caps.get(1).map(|m| m.as_str().trim().to_string()),
        caps.get(2).map(|m| m.as_str().to_string()),
    )
}

/// Parse "hear what 226 customers have already said" from the OG
/// description tag.
fn parse_review_count_from_og_description(html: &str) -> Option<i64> {
    let desc = og(html, "description")?;
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"(\d[\d,]*)\s+customers").unwrap());
    re.captures(&desc)?
        .get(1)?
        .as_str()
        .replace(',', "")
        .parse::<i64>()
        .ok()
}

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            let raw = c.get(2).map(|m| m.as_str())?;
            return Some(html_unescape(raw));
        }
    }
    None
}

/// Minimal HTML entity unescaping for the handful of entities the
/// synthesize_html escaper might produce. Keeps us off a heavier dep.
fn html_unescape(s: &str) -> String {
    s.replace("&quot;", "\"")
        .replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
}

fn get_string(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| x.as_str().map(String::from))
}

// ---------------------------------------------------------------------------
// Review extraction
// ---------------------------------------------------------------------------

fn extract_review(r: &Value) -> Value {
    json!({
        "id": r.get("id").and_then(|v| v.as_str()),
        "rating": r.get("rating").and_then(|v| v.as_i64()),
        "title": r.get("title").and_then(|v| v.as_str()),
        "text": r.get("text").and_then(|v| v.as_str()),
        "language": r.get("language").and_then(|v| v.as_str()),
        "source": r.get("source").and_then(|v| v.as_str()),
        "likes": r.get("likes").and_then(|v| v.as_i64()),
        "author": r.get("consumer").and_then(|c| c.get("displayName")).and_then(|v| v.as_str()),
        "author_country": r.get("consumer").and_then(|c| c.get("countryCode")).and_then(|v| v.as_str()),
        "author_review_count": r.get("consumer").and_then(|c| c.get("numberOfReviews")).and_then(|v| v.as_i64()),
        "verified": r.get("consumer").and_then(|c| c.get("isVerified")).and_then(|v| v.as_bool()),
        "date_experienced": r.get("dates").and_then(|d| d.get("experiencedDate")).and_then(|v| v.as_str()),
        "date_published": r.get("dates").and_then(|d| d.get("publishedDate")).and_then(|v| v.as_str()),
    })
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;

@@ -210,13 +446,127 @@ mod tests {
    }

    #[test]
    fn parse_review_domain_handles_query_and_slash() {
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com"),
            Some("anthropic.com".into())
        );
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com/"),
            Some("anthropic.com".into())
        );
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com?stars=5"),
            Some("anthropic.com".into())
        );
    }

    #[test]
    fn normalise_star_key_covers_all_buckets() {
        assert_eq!(normalise_star_key("1 star"), "one_star");
        assert_eq!(normalise_star_key("2 stars"), "two_stars");
        assert_eq!(normalise_star_key("5 stars"), "five_stars");
        assert_eq!(normalise_star_key("Total"), "total");
    }

    #[test]
    fn compute_rating_stats_weighted_average() {
        // 100 1-stars, 100 5-stars -> avg 3.0 over 200 reviews.
        let dist = json!({
            "one_star":    { "count": 100, "percent": "50%" },
            "two_stars":   { "count": 0,   "percent": "0%" },
            "three_stars": { "count": 0,   "percent": "0%" },
            "four_stars":  { "count": 0,   "percent": "0%" },
            "five_stars":  { "count": 100, "percent": "50%" },
            "total":       { "count": 200, "percent": "100%" },
        });
        let (avg, total) = compute_rating_stats(&dist);
        assert_eq!(avg.as_deref(), Some("3.0"));
        assert_eq!(total, Some(200));
    }

    #[test]
    fn parse_og_title_extracts_name_and_rating() {
        let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">"#;
        assert_eq!(parse_name_from_og_title(html), Some("Anthropic".into()));
        let (label, rating) = parse_rating_from_og_title(html);
        assert_eq!(label.as_deref(), Some("Bad"));
        assert_eq!(rating.as_deref(), Some("1.5"));
    }

    #[test]
    fn parse_review_count_from_og_description_picks_number() {
        let html = r#"<meta property="og:description" content="Do you agree? Voice your opinion today and hear what 226 customers have already said.">"#;
        assert_eq!(parse_review_count_from_og_description(html), Some(226));
    }

    #[test]
    fn parse_full_fixture_assembles_all_fields() {
        let html = r##"<html><head>
<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">
<script type="application/ld+json">
{"@context":"https://schema.org","@graph":[
{"@id":"https://www.trustpilot.com/#/schema/Organization/1","@type":"Organization","name":"Trustpilot"}
]}
</script>
<script type="application/ld+json">
{"@context":["https://schema.org",{"csvw":"http://www.w3.org/ns/csvw#"}],
"@graph":{"@id":"https://www.trustpilot.com/#/schema/DataSet/anthropic.com/1",
"@type":"Dataset",
"about":{"@id":"https://www.trustpilot.com/#/schema/Organization/anthropic.com"},
"name":"Anthropic",
"mainEntity":{"@type":"csvw:Table","csvw:tableSchema":{"csvw:columns":[
{"csvw:name":"1 star","csvw:cells":[{"csvw:value":"196","csvw:notes":["87%"]}]},
{"csvw:name":"2 stars","csvw:cells":[{"csvw:value":"9","csvw:notes":["4%"]}]},
{"csvw:name":"3 stars","csvw:cells":[{"csvw:value":"5","csvw:notes":["2%"]}]},
{"csvw:name":"4 stars","csvw:cells":[{"csvw:value":"1","csvw:notes":["0%"]}]},
{"csvw:name":"5 stars","csvw:cells":[{"csvw:value":"15","csvw:notes":["7%"]}]},
{"csvw:name":"Total","csvw:cells":[{"csvw:value":"226","csvw:notes":["100%"]}]}
]}}}}
</script>
<script type="application/ld+json">
{"aiSummary":{"modelVersion":"2.0.0","summary":"Mixed reviews."},
"aiSummaryReviews":[
{"id":"abc","rating":1,"title":"Bad","text":"Didn't work.","language":"en",
"source":"Organic","likes":2,"consumer":{"displayName":"W.FRH","countryCode":"DE","numberOfReviews":69,"isVerified":false},
"dates":{"experiencedDate":"2026-01-05T00:00:00.000Z","publishedDate":"2026-01-05T16:29:31.000Z"}}]}
</script>
</head></html>"##;
        let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
        assert_eq!(v["domain"], "anthropic.com");
        assert_eq!(v["business_name"], "Anthropic");
        assert_eq!(v["rating_label"], "Bad");
        assert_eq!(v["review_count"], 226);
        assert_eq!(v["rating_distribution"]["one_star"]["count"], 196);
        assert_eq!(v["rating_distribution"]["total"]["count"], 226);
        assert_eq!(v["ai_summary"], "Mixed reviews.");
        assert_eq!(v["recent_reviews"].as_array().unwrap().len(), 1);
        assert_eq!(v["recent_reviews"][0]["author"], "W.FRH");
        assert_eq!(v["recent_reviews"][0]["rating"], 1);
        assert_eq!(v["recent_reviews"][0]["title"], "Bad");
    }

    #[test]
    fn parse_falls_back_to_og_when_no_jsonld() {
        let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">"#;
        let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
        assert_eq!(v["domain"], "anthropic.com");
        assert_eq!(v["business_name"], "Anthropic");
        assert_eq!(v["average_rating"], "1.5");
        assert_eq!(v["review_count"], 226);
        assert_eq!(v["rating_label"], "Bad");
    }

    #[test]
    fn parse_returns_ok_with_url_domain_when_nothing_else() {
        let v = parse(
            "<html><head></head></html>",
            "https://www.trustpilot.com/review/example.com",
        )
        .unwrap();
        assert_eq!(v["domain"], "example.com");
        assert_eq!(v["business_name"], "example.com");
    }
}