webclaw/crates/webclaw-fetch
Valerio b2e7dbf365 fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs)
Addresses the four follow-ups surfaced by the cloud-key smoke test.

trustpilot_reviews — full rewrite for 2025 schema:
- Trustpilot moved from single-Organization+aggregateRating to three
  separate JSON-LD blocks: a site-level Organization (Trustpilot
  itself), a Dataset with a csvw:Table mainEntity carrying the
  per-star distribution for the target business, and an aiSummary +
  aiSummaryReviews block with the AI-generated summary and recent
  review objects.
- Parser now: skips the site-level Org, walks @graph as either array
  or single object, picks the Dataset whose about.@id references the
  target domain, parses each csvw:column for rating buckets, computes
  weighted-average rating + total from the distribution, extracts the
  aiSummary text, and turns aiSummaryReviews into a clean reviews
  array with author/country/date/rating/title/text/likes.
- OG-title regex fallbacks for business_name, rating_label, and
  average_rating when the Dataset block is absent. OG-description
  regex for review_count.
- Returned shape: url, domain, business_name, rating_label,
  average_rating, review_count, rating_distribution (per-star count
  and percent), ai_summary, recent_reviews, review_count_listed,
  data_source.
- Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 /
  226 reviews with full distribution + AI summary + 2 recent reviews.

amazon_product — force-cloud-escalation + OG fallback:
- Amazon serves Product JSON-LD intermittently even on non-CAPTCHA
  pages. When local fetch returns HTML without Product JSON-LD and
  a cloud client is configured, force-escalate to the cloud path
  which reliably surfaces title + description via its render engine.
- New OG meta-tag fallback for title/image/description so the
  cloud's synthesize_html output (OG tags only, no #productTitle DOM
  ID) still yields useful data. Real Amazon pages still prefer the
  DOM regex.
- Verified live: B0BSHF7WHW escalates to cloud, returns Apple
  MacBook Pro title + description + asin.

etsy_listing — slug humanization + generic-page filtering + shop
from brand:
- Etsy serves various placeholder pages when a listing is delisted,
  blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...",
  "This item is unavailable - Etsy", plus the OG description
  "Sorry, the page you were looking for was not found." is_generic_*
  helpers catch all three shapes.
- When the OG title is generic, humanise the URL slug: the path
  `/listing/123456789/personalized-stainless-steel-tumbler` becomes
  `Personalized Stainless Steel Tumbler` so callers always get a
  meaningful title even on dead listings.
- Etsy uses `brand` (top-level JSON-LD field) for the shop name on
  listings that don't ship offers[].seller.name. Shop now falls
  through offers -> brand so either schema resolves.
- Verified live: listing/1097462299 returns full rich data
  (title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating /
  225 reviews, InStock).

cloud.rs — module doc update:
- Added an architecture section documenting that api.webclaw.io does
  not return raw HTML by design and that [`synthesize_html`]
  reassembles the parsed response (metadata + structured_data +
  markdown) back into minimal HTML so existing local parsers run
  unchanged across both paths. Also notes the DOM-regex limitation
  for extractors that need live-page-specific DOM IDs.

Tests: 215 passing in webclaw-fetch (18 new), clippy clean.
Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY:
28/28 clean, 0 partial, 0 failed.
2026-04-22 17:49:50 +02:00
..
src fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs) 2026-04-22 17:49:50 +02:00
tests fix: handle raw newlines in JSON-LD strings 2026-04-16 11:40:25 +02:00
Cargo.toml refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch 2026-04-22 16:05:44 +02:00