mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-07 22:15:12 +02:00
4 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
b2e7dbf365 |
fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs)
Addresses the four follow-ups surfaced by the cloud-key smoke test. trustpilot_reviews — full rewrite for 2025 schema: - Trustpilot moved from single-Organization+aggregateRating to three separate JSON-LD blocks: a site-level Organization (Trustpilot itself), a Dataset with a csvw:Table mainEntity carrying the per-star distribution for the target business, and an aiSummary + aiSummaryReviews block with the AI-generated summary and recent review objects. - Parser now: skips the site-level Org, walks @graph as either array or single object, picks the Dataset whose about.@id references the target domain, parses each csvw:column for rating buckets, computes weighted-average rating + total from the distribution, extracts the aiSummary text, and turns aiSummaryReviews into a clean reviews array with author/country/date/rating/title/text/likes. - OG-title regex fallbacks for business_name, rating_label, and average_rating when the Dataset block is absent. OG-description regex for review_count. - Returned shape: url, domain, business_name, rating_label, average_rating, review_count, rating_distribution (per-star count and percent), ai_summary, recent_reviews, review_count_listed, data_source. - Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 / 226 reviews with full distribution + AI summary + 2 recent reviews. amazon_product — force-cloud-escalation + OG fallback: - Amazon serves Product JSON-LD intermittently even on non-CAPTCHA pages. When local fetch returns HTML without Product JSON-LD and a cloud client is configured, force-escalate to the cloud path which reliably surfaces title + description via its render engine. - New OG meta-tag fallback for title/image/description so the cloud's synthesize_html output (OG tags only, no #productTitle DOM ID) still yields useful data. Real Amazon pages still prefer the DOM regex. - Verified live: B0BSHF7WHW escalates to cloud, returns Apple MacBook Pro title + description + asin. etsy_listing — slug humanization + generic-page filtering + shop from brand: - Etsy serves various placeholder pages when a listing is delisted, blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...", "This item is unavailable - Etsy", plus the OG description "Sorry, the page you were looking for was not found." is_generic_* helpers catch all three shapes. - When the OG title is generic, humanise the URL slug: the path `/listing/123456789/personalized-stainless-steel-tumbler` becomes `Personalized Stainless Steel Tumbler` so callers always get a meaningful title even on dead listings. - Etsy uses `brand` (top-level JSON-LD field) for the shop name on listings that don't ship offers[].seller.name. Shop now falls through offers -> brand so either schema resolves. - Verified live: listing/1097462299 returns full rich data (title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating / 225 reviews, InStock). cloud.rs — module doc update: - Added an architecture section documenting that api.webclaw.io does not return raw HTML by design and that [`synthesize_html`] reassembles the parsed response (metadata + structured_data + markdown) back into minimal HTML so existing local parsers run unchanged across both paths. Also notes the DOM-regex limitation for extractors that need live-page-specific DOM IDs. Tests: 215 passing in webclaw-fetch (18 new), clippy clean. Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY: 28/28 clean, 0 partial, 0 failed. |
||
|
|
e10066f527 |
fix(cloud): synthesize HTML from cloud response instead of requesting raw html
api.webclaw.io/v1/scrape does not return a `html` field even when `formats=["html"]` is requested, by design: the cloud API returns pre-parsed `structured_data` (JSON-LD blocks), `metadata` (OG tags, title, description, image, site_name), and `markdown`. Our CloudClient::fetch_html helper was premised on the API returning raw HTML. Without a key set, the error message was hidden behind CloudError::NotConfigured so the bug never surfaced. With a key set, every extractor that escalated to cloud (trustpilot_reviews, etsy_listing, amazon_product, ebay_listing, substack_post HTML fallback) got back "cloud /v1/scrape returned no html field". Fix: reassemble a minimal synthetic HTML document from the cloud's parsed output. Each JSON-LD block goes back into a `<script type="application/ld+json">` tag, metadata fields become OG `<meta>` tags, and the markdown body lands in a `<pre>` tag. Existing local extractor parsers (find_product_jsonld, find_business, og() regex) see the same shapes they'd see from a real page, so no per-extractor changes needed. Verified end-to-end with WEBCLAW_CLOUD_API_KEY set: - trustpilot_reviews: escalates, returns Organization JSON-LD data (parser picks Trustpilot site-level Org not the reviewed business; tracked as a follow-up to update Trustpilot schema handling) - etsy_listing: escalates via antibot render path; listing-specific data depends on target listing having JSON-LD (many Etsy listings don't) - amazon_product, ebay_listing: stay local because their pages ship enough content not to trigger bot-detection escalation - The other 24 extractors unchanged (local path, zero cloud credits) Tests: 200 passing in webclaw-fetch (3 new), clippy clean. |
||
|
|
a53578e45c |
fix(extractors): detect AWS WAF verifying-connection page, add OG fallback to ecommerce_product
Two targeted fixes surfaced by the manual extractor smoke test.
cloud::is_bot_protected:
- Trustpilot serves a ~565-byte AWS WAF interstitial with the string
"Verifying your connection..." and an `interstitial-spinner` div.
That pattern was not in our detector, so local fetch returned the
challenge page, JSON-LD parsing found nothing, and the extractor
emitted a confusing "no Organization/LocalBusiness JSON-LD" error.
- Added the pattern plus a <10KB size gate so real articles that
happen to mention the phrase aren't misclassified. Two new tests
cover positive + negative cases.
- With the fix, trustpilot_reviews now correctly escalates via
smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY"
actionable error without a key, or cloud-bypassed HTML with one.
ecommerce_product:
- Previously hard-failed when a page had no Product JSON-LD, and
produced an empty `offers` list when JSON-LD was present but its
`offers` node was. Many sites (Patagonia-style catalog pages,
smaller Squarespace stores) ship one or the other of OG / JSON-LD
but not both with price data.
- Added OG meta-tag fallback that handles:
* no JSON-LD at all -> build minimal payload from og:title,
og:image, og:description, product:price:amount,
product:price:currency, product:availability, product:brand
* JSON-LD present but offers empty -> augment with an OG-derived
offer so price comes through
- New `data_source` field: "jsonld", "jsonld+og", or "og_fallback"
so callers can tell which branch populated the data.
- `has_og_product_signal()` requires og:type=product or a price tag
so blog posts don't get mis-classified as products.
Tests: 197 passing in webclaw-fetch (6 new), clippy clean.
|
||
|
|
0ab891bd6b |
refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch
The local-first / cloud-fallback flow was duplicated in two places:
- webclaw-mcp/src/cloud.rs (302 lines, canonical)
- webclaw-cli/src/cloud.rs (80 lines, minimal subset kept to avoid
pulling rmcp as a dep)
Move to the shared crate where all vertical extractors and the new
webclaw-server can also reach it.
## New module: webclaw-fetch/src/cloud.rs
Single canonical home. Consolidates both previous versions and
promotes the error type from stringy to typed:
- `CloudError` enum with dedicated variants for the four HTTP
outcomes callers act on differently — 401 (key rejected),
402 (insufficient plan), 429 (rate limited), plus ServerError /
Network / ParseFailed. Each variant's Display message ends with
an actionable URL (signup / pricing / dashboard) so API consumers
can surface it verbatim.
- `From<CloudError> for String` bridge so the dozen existing
`.await?` call sites in MCP / CLI that expected `Result<_, String>`
keep compiling. We can migrate them to the typed error per-site
later without a churn commit.
- `CloudClient::new(Option<&str>)` matches the CLI's `--api-key`
flag pattern (explicit key wins, env fallback, None when empty).
`::from_env()` kept for MCP-style call sites.
- `with_key_and_base` for staging / integration tests.
- `scrape / post / get / fetch_html` — `fetch_html` is new, a
convenience that calls /v1/scrape with formats=["html"] and
returns the raw HTML string so vertical extractors can plug
antibot-bypassed HTML straight into their parsers.
- `is_bot_protected` + `needs_js_rendering` detectors moved
over verbatim. Detection patterns are public (CF / DataDome /
AWS WAF challenge-page signatures) — no moat leak.
- `smart_fetch` kept on the original `Result<_, String>`
signature so MCP's six call sites compile unchanged.
- `smart_fetch_html` is new: the local-first-then-cloud flow
for the vertical-extractor pattern, returning the typed
`CloudError` so extractors can emit precise upgrade-path
messages.
## Cleanup
- Deleted webclaw-mcp/src/cloud.rs — all imports now resolve to
`webclaw_fetch:☁️:*`. Dropped reqwest as a direct dep of
webclaw-mcp (it only used it for the old cloud client).
- Deleted webclaw-cli/src/cloud.rs. CLI keeps reqwest for its
webhook / on-change / research HTTP calls.
- webclaw-fetch now has reqwest as a direct dep. It was already
transitively pulled in by webclaw-llm; this just makes the
dependency relationship explicit at the call site.
## Tests
16 new unit tests cover:
- CloudError status mapping (401/402/429/5xx)
- NotConfigured error includes signup URL
- CloudClient::new explicit-key-wins-over-env + empty-string = None
- base_url strips trailing slash
- Detector matrix (CF challenge / Turnstile / real content with
embedded Turnstile / SPA skeleton / real article with script tags)
- truncate respects char boundaries (don't slice inside UTF-8)
Full workspace test suite still passes (~500 tests). fmt + clippy
clean. No behavior change for existing MCP / CLI call sites.
|