fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs)

Addresses the four follow-ups surfaced by the cloud-key smoke test.

trustpilot_reviews — full rewrite for 2025 schema:
- Trustpilot moved from single-Organization+aggregateRating to three
  separate JSON-LD blocks: a site-level Organization (Trustpilot
  itself), a Dataset with a csvw:Table mainEntity carrying the
  per-star distribution for the target business, and an aiSummary +
  aiSummaryReviews block with the AI-generated summary and recent
  review objects.
- Parser now: skips the site-level Org, walks @graph as either array
  or single object, picks the Dataset whose about.@id references the
  target domain, parses each csvw:column for rating buckets, computes
  weighted-average rating + total from the distribution, extracts the
  aiSummary text, and turns aiSummaryReviews into a clean reviews
  array with author/country/date/rating/title/text/likes.
- OG-title regex fallbacks for business_name, rating_label, and
  average_rating when the Dataset block is absent. OG-description
  regex for review_count.
- Returned shape: url, domain, business_name, rating_label,
  average_rating, review_count, rating_distribution (per-star count
  and percent), ai_summary, recent_reviews, review_count_listed,
  data_source.
- Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 /
  226 reviews with full distribution + AI summary + 2 recent reviews.

amazon_product — force-cloud-escalation + OG fallback:
- Amazon serves Product JSON-LD intermittently even on non-CAPTCHA
  pages. When local fetch returns HTML without Product JSON-LD and a
  cloud client is configured, force-escalate to the cloud path which
  reliably surfaces title + description via its render engine.
- New OG meta-tag fallback for title/image/description so the cloud's
  synthesize_html output (OG tags only, no #productTitle DOM ID)
  still yields useful data. Real Amazon pages still prefer the DOM
  regex.
- Verified live: B0BSHF7WHW escalates to cloud, returns Apple MacBook
  Pro title + description + asin.

etsy_listing — slug humanization + generic-page filtering + shop from brand:
- Etsy serves various placeholder pages when a listing is delisted,
  blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...",
  "This item is unavailable - Etsy", plus the OG description "Sorry,
  the page you were looking for was not found." is_generic_* helpers
  catch all three shapes.
- When the OG title is generic, humanise the URL slug: the path
  `/listing/123456789/personalized-stainless-steel-tumbler` becomes
  `Personalized Stainless Steel Tumbler` so callers always get a
  meaningful title even on dead listings.
- Etsy uses `brand` (top-level JSON-LD field) for the shop name on
  listings that don't ship offers[].seller.name. Shop now falls
  through offers -> brand so either schema resolves.
- Verified live: listing/1097462299 returns full rich data (title,
  price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating / 225 reviews,
  InStock).

cloud.rs — module doc update:
- Added an architecture section documenting that api.webclaw.io does
  not return raw HTML by design and that [`synthesize_html`]
  reassembles the parsed response (metadata + structured_data +
  markdown) back into minimal HTML so existing local parsers run
  unchanged across both paths. Also notes the DOM-regex limitation
  for extractors that need live-page-specific DOM IDs.

Tests: 215 passing in webclaw-fetch (18 new), clippy clean.
Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY:
28/28 clean, 0 partial, 0 failed.
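
For illustration, the slug humanisation described for etsy_listing could look roughly like the sketch below. This is not the code from this commit: the function name `humanize_listing_slug` and the exact validation rules (numeric listing ID, hyphen-separated slug, path only with no query string) are assumptions made for the example.

```rust
/// Hypothetical helper (name and rules assumed, not taken from the commit):
/// turn an Etsy listing path such as
/// `/listing/123456789/personalized-stainless-steel-tumbler`
/// into a title-cased string.
fn humanize_listing_slug(path: &str) -> Option<String> {
    // Expect "/listing/<numeric id>/<slug>"; anything else yields None.
    let mut parts = path.trim_matches('/').split('/');
    if parts.next()? != "listing" {
        return None;
    }
    let id = parts.next()?;
    if id.is_empty() || !id.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    let slug = parts.next()?;

    // Split on hyphens and upper-case the first character of each word.
    let words: Vec<String> = slug
        .split('-')
        .filter(|w| !w.is_empty())
        .map(|w| {
            let mut chars = w.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect();

    if words.is_empty() {
        None
    } else {
        Some(words.join(" "))
    }
}
```

Under these assumptions, the path from the message above maps to `Personalized Stainless Steel Tumbler`, and a path without a slug segment returns `None`, so a caller can fall back to whatever generic title it already had.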
2026-06-06 22:05:13 +02:00 · 2026-04-22 17:49:50 +02:00 · b2e7dbf365
commit b2e7dbf365
parent e10066f527
4 changed files with 825 additions and 172 deletions
--- a/crates/webclaw-fetch/src/cloud.rs
+++ b/crates/webclaw-fetch/src/cloud.rs
@@ -24,6 +24,37 @@
 //!   parser on it. Returns the typed [`CloudError`] so extractors can
 //!   emit precise "upgrade your plan" / "invalid key" messages.
 //!
+//! ## Cloud response shape and [`synthesize_html`]
+//!
+//! `api.webclaw.io/v1/scrape` deliberately does **not** return an
+//! `html` field even when `formats=["html"]` is requested. By design
+//! the cloud API returns a parsed bundle:
+//!
+//! ```text
+//! {
+//!   "url":             "https://...",
+//!   "metadata":        { title, description, image, site_name, ... },  // OG / meta tags
+//!   "structured_data": [ { "@type": "...", ... }, ... ],               // JSON-LD blocks
+//!   "markdown":        "# Page Title\n\n...",                          // cleaned markdown
+//!   "antibot":         { engine, path, user_agent },                   // bypass telemetry
+//!   "cache":           { status, age_seconds }
+//! }
+//! ```
+//!
+//! [`CloudClient::fetch_html`] reassembles that bundle back into a
+//! minimal synthetic HTML document so the existing local extractor
+//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
+//! cloud output. Each `structured_data` entry becomes a
+//! `<script type="application/ld+json">` tag; each `metadata` field
+//! becomes a `<meta property="og:...">` tag; `markdown` lands in a
+//! `<pre>` inside the body. Callers that walk Schema.org blocks see
+//! exactly what they'd see on a real live page.
+//!
+//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
+//! won't hit on the synthesised HTML — those IDs only exist on live
+//! Amazon pages. Extractors that need DOM regex keep OG meta tag
+//! fallbacks for that reason.
+//!
 //! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
 //! signup when a site is blocked; nothing fails silently. Cloud users
 //! get the escalation for free.