webclaw/crates
Valerio e10066f527 fix(cloud): synthesize HTML from cloud response instead of requesting raw html
api.webclaw.io/v1/scrape does not return a `html` field even when
`formats=["html"]` is requested, by design: the cloud API returns
pre-parsed `structured_data` (JSON-LD blocks), `metadata` (OG tags,
title, description, image, site_name), and `markdown`.

Our CloudClient::fetch_html helper was premised on the API returning
raw HTML. Without a key set, the error message was hidden behind
CloudError::NotConfigured so the bug never surfaced. With a key set,
every extractor that escalated to cloud (trustpilot_reviews,
etsy_listing, amazon_product, ebay_listing, substack_post HTML
fallback) got back "cloud /v1/scrape returned no html field".

Fix: reassemble a minimal synthetic HTML document from the cloud's
parsed output. Each JSON-LD block goes back into a
`<script type="application/ld+json">` tag, metadata fields become OG
`<meta>` tags, and the markdown body lands in a `<pre>` tag. Existing
local extractor parsers (find_product_jsonld, find_business,
og() regex) see the same shapes they'd see from a real page, so no
per-extractor changes needed.

Verified end-to-end with WEBCLAW_CLOUD_API_KEY set:
- trustpilot_reviews: escalates, returns Organization JSON-LD data
  (parser picks Trustpilot site-level Org not the reviewed business;
  tracked as a follow-up to update Trustpilot schema handling)
- etsy_listing: escalates via antibot render path; listing-specific
  data depends on target listing having JSON-LD (many Etsy listings
  don't)
- amazon_product, ebay_listing: stay local because their pages ship
  enough content not to trigger bot-detection escalation
- The other 24 extractors unchanged (local path, zero cloud credits)

Tests: 200 passing in webclaw-fetch (3 new), clippy clean.
2026-04-22 17:24:50 +02:00
..
webclaw-cli refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch 2026-04-22 16:05:44 +02:00
webclaw-core style: cargo fmt 2026-04-17 12:03:22 +02:00
webclaw-fetch fix(cloud): synthesize HTML from cloud response instead of requesting raw html 2026-04-22 17:24:50 +02:00
webclaw-llm feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22) 2026-04-16 19:44:08 +02:00
webclaw-mcp refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch 2026-04-22 16:05:44 +02:00
webclaw-pdf Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
webclaw-server feat(extractors): wave 5 \u2014 Amazon, eBay, Trustpilot via cloud fallback 2026-04-22 16:16:11 +02:00