webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-07-24 07:31:01 +02:00

Valerio e10066f527 fix(cloud): synthesize HTML from cloud response instead of requesting raw html	2026-04-22 17:24:50 +02:00

api.webclaw.io/v1/scrape does not return a `html` field even when `formats=["html"]` is requested, by design: the cloud API returns pre-parsed `structured_data` (JSON-LD blocks), `metadata` (OG tags, title, description, image, site_name), and `markdown`.

Our CloudClient::fetch_html helper was premised on the API returning raw HTML. Without a key set, the error message was hidden behind CloudError::NotConfigured so the bug never surfaced. With a key set, every extractor that escalated to cloud (trustpilot_reviews, etsy_listing, amazon_product, ebay_listing, substack_post HTML fallback) got back "cloud /v1/scrape returned no html field".

Fix: reassemble a minimal synthetic HTML document from the cloud's parsed output. Each JSON-LD block goes back into a `<script type="application/ld+json">` tag, metadata fields become OG `<meta>` tags, and the markdown body lands in a `<pre>` tag. Existing local extractor parsers (find_product_jsonld, find_business, og() regex) see the same shapes they'd see from a real page, so no per-extractor changes needed.

Verified end-to-end with WEBCLAW_CLOUD_API_KEY set:
- trustpilot_reviews: escalates, returns Organization JSON-LD data (parser picks Trustpilot site-level Org not the reviewed business; tracked as a follow-up to update Trustpilot schema handling)
- etsy_listing: escalates via antibot render path; listing-specific data depends on target listing having JSON-LD (many Etsy listings don't)
- amazon_product, ebay_listing: stay local because their pages ship enough content not to trigger bot-detection escalation
- The other 24 extractors unchanged (local path, zero cloud credits)

Tests: 200 passing in webclaw-fetch (3 new), clippy clean.
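The reassembly step described in the fix can be sketched roughly as follows. This is a minimal illustration and not the code from the commit: the `CloudResponse` struct, its field types, and the `synthesize_html` and `escape_html` names are all hypothetical, chosen only to mirror the `structured_data`, `metadata`, and `markdown` fields named in the message. It uses the standard library only, so JSON-LD blocks are assumed to arrive as already-serialized strings.

```rust
/// Hypothetical shape of the parsed cloud response. Field names follow the
/// commit message; the real types in webclaw-fetch may differ.
pub struct CloudResponse {
    /// JSON-LD blocks, each assumed to be an already-serialized JSON string.
    pub structured_data: Vec<String>,
    /// (property, content) pairs, e.g. ("og:title", "Some title").
    pub metadata: Vec<(String, String)>,
    /// Markdown body returned by the cloud API.
    pub markdown: String,
}

/// Escape text for HTML element content and double-quoted attribute values.
fn escape_html(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Build a minimal synthetic HTML document from the parsed response:
/// metadata becomes <meta> tags, JSON-LD goes into <script> tags, and the
/// markdown body lands in a <pre> tag.
pub fn synthesize_html(resp: &CloudResponse) -> String {
    let mut html = String::from("<!DOCTYPE html>\n<html>\n<head>\n");

    for (property, content) in &resp.metadata {
        html.push_str(&format!(
            "<meta property=\"{}\" content=\"{}\">\n",
            escape_html(property),
            escape_html(content)
        ));
    }

    for block in &resp.structured_data {
        // Script content is not entity-decoded, so instead of HTML-escaping
        // we rewrite "</" as "<\/" (a valid JSON string escape) to keep a
        // stray "</script>" from closing the tag early.
        let safe = block.replace("</", "<\\/");
        html.push_str(&format!(
            "<script type=\"application/ld+json\">{}</script>\n",
            safe
        ));
    }

    html.push_str("</head>\n<body>\n<pre>");
    html.push_str(&escape_html(&resp.markdown));
    html.push_str("</pre>\n</body>\n</html>\n");
    html
}
```

Because the output has the same tag shapes as a real page, a parser that scans for `application/ld+json` scripts or OG `<meta>` tags should be able to consume it without changes, which is the property the commit relies on.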
webclaw-cli	refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch	2026-04-22 16:05:44 +02:00
webclaw-core	style: cargo fmt	2026-04-17 12:03:22 +02:00
webclaw-fetch	fix(cloud): synthesize HTML from cloud response instead of requesting raw html	2026-04-22 17:24:50 +02:00
webclaw-llm	feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22)	2026-04-16 19:44:08 +02:00
webclaw-mcp	refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch	2026-04-22 16:05:44 +02:00
webclaw-pdf	Initial release: webclaw v0.1.0 — web content extraction for LLMs	2026-03-23 18:31:11 +01:00
webclaw-server	feat(extractors): wave 5 — Amazon, eBay, Trustpilot via cloud fallback	2026-04-22 16:16:11 +02:00