mirror of https://github.com/0xMassi/webclaw.git
synced 2026-06-06 22:05:13 +02:00
fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs)
Addresses the four follow-ups surfaced by the cloud-key smoke test.

trustpilot_reviews (full rewrite for 2025 schema):
- Trustpilot moved from single-Organization+aggregateRating to three
  separate JSON-LD blocks: a site-level Organization (Trustpilot
  itself), a Dataset with a csvw:Table mainEntity carrying the per-star
  distribution for the target business, and an aiSummary +
  aiSummaryReviews block with the AI-generated summary and recent
  review objects.
- Parser now: skips the site-level Org, walks @graph as either array or
  single object, picks the Dataset whose about.@id references the
  target domain, parses each csvw:column for rating buckets, computes
  weighted-average rating + total from the distribution, extracts the
  aiSummary text, and turns aiSummaryReviews into a clean reviews array
  with author/country/date/rating/title/text/likes.
- OG-title regex fallbacks for business_name, rating_label, and
  average_rating when the Dataset block is absent. OG-description regex
  for review_count.
- Returned shape: url, domain, business_name, rating_label,
  average_rating, review_count, rating_distribution (per-star count and
  percent), ai_summary, recent_reviews, review_count_listed,
  data_source.
- Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 / 226
  reviews with full distribution + AI summary + 2 recent reviews.

amazon_product (force-cloud-escalation + OG fallback):
- Amazon serves Product JSON-LD intermittently even on non-CAPTCHA
  pages. When local fetch returns HTML without Product JSON-LD and a
  cloud client is configured, force-escalate to the cloud path which
  reliably surfaces title + description via its render engine.
- New OG meta-tag fallback for title/image/description so the cloud's
  synthesize_html output (OG tags only, no #productTitle DOM ID) still
  yields useful data. Real Amazon pages still prefer the DOM regex.
- Verified live: B0BSHF7WHW escalates to cloud, returns Apple MacBook
  Pro title + description + asin.
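The "computes weighted-average rating + total from the distribution" step for trustpilot_reviews can be sketched roughly as below. This is a minimal illustration, not the code from this commit: `StarBucket`, `summarize_distribution`, `percent`, and the rounding to one decimal place are all assumptions made for the example.

```rust
/// Hypothetical per-star bucket; not webclaw's actual type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct StarBucket {
    stars: u8,  // 1..=5
    count: u64, // number of reviews with that star value
}

/// Returns (average rating, total review count) for a distribution,
/// or None when the distribution contains no reviews.
/// Rounding to one decimal is an assumption of this sketch.
fn summarize_distribution(buckets: &[StarBucket]) -> Option<(f64, u64)> {
    let total: u64 = buckets.iter().map(|b| b.count).sum();
    if total == 0 {
        return None; // avoid dividing by zero on an empty distribution
    }
    // Weight each star value by how many reviews carry it.
    let weighted: u64 = buckets
        .iter()
        .map(|b| u64::from(b.stars) * b.count)
        .sum();
    let average = weighted as f64 / total as f64;
    Some(((average * 10.0).round() / 10.0, total))
}

/// Share of `count` in `total`, as a percentage (0.0 when total is 0).
fn percent(count: u64, total: u64) -> f64 {
    if total == 0 {
        0.0
    } else {
        count as f64 * 100.0 / total as f64
    }
}
```

For example, nine 1-star reviews and one 5-star review give a weighted sum of 14 over 10 reviews, so the sketch yields an average of 1.4 and a total of 10.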
etsy_listing (slug humanization + generic-page filtering + shop from brand):
- Etsy serves various placeholder pages when a listing is delisted,
  blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...",
  "This item is unavailable - Etsy", plus the OG description "Sorry,
  the page you were looking for was not found." is_generic_* helpers
  catch all three shapes.
- When the OG title is generic, humanise the URL slug: the path
  `/listing/123456789/personalized-stainless-steel-tumbler` becomes
  `Personalized Stainless Steel Tumbler` so callers always get a
  meaningful title even on dead listings.
- Etsy uses `brand` (top-level JSON-LD field) for the shop name on
  listings that don't ship offers[].seller.name. Shop now falls through
  offers -> brand so either schema resolves.
- Verified live: listing/1097462299 returns full rich data (title,
  price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating / 225 reviews,
  InStock).

cloud.rs (module doc update):
- Added an architecture section documenting that api.webclaw.io does
  not return raw HTML by design and that [`synthesize_html`]
  reassembles the parsed response (metadata + structured_data +
  markdown) back into minimal HTML so existing local parsers run
  unchanged across both paths. Also notes the DOM-regex limitation for
  extractors that need live-page-specific DOM IDs.

Tests: 215 passing in webclaw-fetch (18 new), clippy clean. Smoke test
against all 28 extractors with WEBCLAW_CLOUD_API_KEY: 28/28 clean, 0
partial, 0 failed.
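The etsy_listing title fallback described above (generic-title detection, then slug humanization) might look something like the following. It is a sketch under stated assumptions: `is_generic_title`, `humanize_listing_slug`, and `listing_title` are hypothetical names, the input is assumed to be a URL path without query string, and the real is_generic_* helpers may match differently.

```rust
/// Hypothetical check for the placeholder titles named in the commit message.
fn is_generic_title(title: &str) -> bool {
    let t = title.trim();
    t.eq_ignore_ascii_case("etsy.com")
        || t.starts_with("Etsy - Your place to buy")
        || t.starts_with("This item is unavailable")
}

/// Turns `/listing/<numeric id>/<slug>` into a title-cased string.
/// Returns None when the path does not have that shape.
fn humanize_listing_slug(path: &str) -> Option<String> {
    let mut segments = path.split('/').filter(|s| !s.is_empty());
    if segments.next()? != "listing" {
        return None;
    }
    let id = segments.next()?;
    if !id.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    let slug = segments.next()?;
    let words: Vec<String> = slug
        .split('-')
        .filter(|w| !w.is_empty())
        .map(|w| {
            // Uppercase the first character, keep the rest as-is.
            let mut chars = w.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().chain(chars).collect(),
                None => String::new(),
            }
        })
        .collect();
    if words.is_empty() {
        None
    } else {
        Some(words.join(" "))
    }
}

/// Prefers a non-generic OG title, otherwise falls back to the slug.
fn listing_title(og_title: Option<&str>, path: &str) -> Option<String> {
    match og_title.map(str::trim) {
        Some(t) if !t.is_empty() && !is_generic_title(t) => Some(t.to_string()),
        _ => humanize_listing_slug(path),
    }
}
```

Keeping the generic-title check separate from the slug conversion means a real OG title always wins, and the slug is only consulted when the page is one of the placeholder shapes.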
parent e10066f527
commit b2e7dbf365
4 changed files with 825 additions and 172 deletions
@@ -24,6 +24,37 @@
//! parser on it. Returns the typed [`CloudError`] so extractors can
//! emit precise "upgrade your plan" / "invalid key" messages.
//!
//! ## Cloud response shape and [`synthesize_html`]
//!
//! `api.webclaw.io/v1/scrape` deliberately does **not** return a
//! `html` field even when `formats=["html"]` is requested. By design
//! the cloud API returns a parsed bundle:
//!
//! ```text
//! {
//!   "url": "https://...",
//!   "metadata": { title, description, image, site_name, ... }, // OG / meta tags
//!   "structured_data": [ { "@type": "...", ... }, ... ], // JSON-LD blocks
//!   "markdown": "# Page Title\n\n...", // cleaned markdown
//!   "antibot": { engine, path, user_agent }, // bypass telemetry
//!   "cache": { status, age_seconds }
//! }
//! ```
//!
//! [`CloudClient::fetch_html`] reassembles that bundle back into a
//! minimal synthetic HTML document so the existing local extractor
//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
//! cloud output. Each `structured_data` entry becomes a
//! `<script type="application/ld+json">` tag; each `metadata` field
//! becomes a `<meta property="og:...">` tag; `markdown` lands in a
//! `<pre>` inside the body. Callers that walk Schema.org blocks see
//! exactly what they'd see on a real live page.
//!
//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
//! won't hit on the synthesised HTML — those IDs only exist on live
//! Amazon pages. Extractors that need DOM regex keep OG meta tag
//! fallbacks for that reason.
//!
//! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
//! signup when a site is blocked; nothing fails silently. Cloud users
//! get the escalation for free.
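
The reassembly described in the module doc (JSON-LD entries into `<script>` tags, metadata into `og:` meta tags, markdown into a `<pre>`) could be sketched along these lines. This is an illustration only, not the `synthesize_html` from this commit: `CloudBundle`, `escape_html`, and `synthesize_html_sketch` are hypothetical, and structured data is assumed to arrive as already-serialised JSON strings so the example needs no external crates.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified view of the parsed cloud response.
struct CloudBundle {
    metadata: BTreeMap<String, String>, // e.g. "title" -> "Page Title"
    structured_data: Vec<String>,       // raw JSON-LD text, one entry per block
    markdown: String,
}

/// Escapes text for use in HTML attribute values and element content.
fn escape_html(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Builds a minimal HTML document that local parsers can read.
fn synthesize_html_sketch(bundle: &CloudBundle) -> String {
    let mut html = String::from("<!DOCTYPE html>\n<html>\n<head>\n");
    // Each metadata field becomes an OG meta tag.
    for (key, value) in &bundle.metadata {
        html.push_str(&format!(
            "<meta property=\"og:{}\" content=\"{}\">\n",
            escape_html(key),
            escape_html(value)
        ));
    }
    // Each JSON-LD block becomes its own script tag. "</" is rewritten
    // so a string inside the JSON cannot close the script element early.
    for block in &bundle.structured_data {
        html.push_str("<script type=\"application/ld+json\">");
        html.push_str(&block.replace("</", "<\\/"));
        html.push_str("</script>\n");
    }
    // Markdown goes into a <pre> in the body, escaped as plain text.
    html.push_str("</head>\n<body>\n<pre>");
    html.push_str(&escape_html(&bundle.markdown));
    html.push_str("</pre>\n</body>\n</html>\n");
    html
}
```

Because the output only carries OG tags, JSON-LD, and a `<pre>`, a DOM-ID regex such as one for `#productTitle` finds nothing in it, which is why the OG fallbacks mentioned in the doc comment are needed.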