Commit graph

3 commits

Author SHA1 Message Date
Valerio
e10066f527 fix(cloud): synthesize HTML from cloud response instead of requesting raw html
api.webclaw.io/v1/scrape does not return a `html` field even when
`formats=["html"]` is requested, by design: the cloud API returns
pre-parsed `structured_data` (JSON-LD blocks), `metadata` (OG tags,
title, description, image, site_name), and `markdown`.

Our CloudClient::fetch_html helper was premised on the API returning
raw HTML. Without a key set, the error message was hidden behind
CloudError::NotConfigured so the bug never surfaced. With a key set,
every extractor that escalated to cloud (trustpilot_reviews,
etsy_listing, amazon_product, ebay_listing, substack_post HTML
fallback) got back "cloud /v1/scrape returned no html field".

Fix: reassemble a minimal synthetic HTML document from the cloud's
parsed output. Each JSON-LD block goes back into a
`<script type="application/ld+json">` tag, metadata fields become OG
`<meta>` tags, and the markdown body lands in a `<pre>` tag. Existing
local extractor parsers (find_product_jsonld, find_business,
og() regex) see the same shapes they'd see from a real page, so no
per-extractor changes needed.

Verified end-to-end with WEBCLAW_CLOUD_API_KEY set:
- trustpilot_reviews: escalates, returns Organization JSON-LD data
  (parser picks Trustpilot site-level Org not the reviewed business;
  tracked as a follow-up to update Trustpilot schema handling)
- etsy_listing: escalates via antibot render path; listing-specific
  data depends on target listing having JSON-LD (many Etsy listings
  don't)
- amazon_product, ebay_listing: stay local because their pages ship
  enough content not to trigger bot-detection escalation
- The other 24 extractors unchanged (local path, zero cloud credits)

Tests: 200 passing in webclaw-fetch (3 new), clippy clean.
2026-04-22 17:24:50 +02:00
Valerio
a53578e45c fix(extractors): detect AWS WAF verifying-connection page, add OG fallback to ecommerce_product
Two targeted fixes surfaced by the manual extractor smoke test.

cloud::is_bot_protected:
- Trustpilot serves a ~565-byte AWS WAF interstitial with the string
  "Verifying your connection..." and an `interstitial-spinner` div.
  That pattern was not in our detector, so local fetch returned the
  challenge page, JSON-LD parsing found nothing, and the extractor
  emitted a confusing "no Organization/LocalBusiness JSON-LD" error.
- Added the pattern plus a <10KB size gate so real articles that
  happen to mention the phrase aren't misclassified. Two new tests
  cover positive + negative cases.
- With the fix, trustpilot_reviews now correctly escalates via
  smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY"
  actionable error without a key, or cloud-bypassed HTML with one.

ecommerce_product:
- Previously hard-failed when a page had no Product JSON-LD, and
  produced an empty `offers` list when JSON-LD was present but its
  `offers` node was. Many sites (Patagonia-style catalog pages,
  smaller Squarespace stores) ship one or the other of OG / JSON-LD
  but not both with price data.
- Added OG meta-tag fallback that handles:
  * no JSON-LD at all -> build minimal payload from og:title,
    og:image, og:description, product:price:amount,
    product:price:currency, product:availability, product:brand
  * JSON-LD present but offers empty -> augment with an OG-derived
    offer so price comes through
- New `data_source` field: "jsonld", "jsonld+og", or "og_fallback"
  so callers can tell which branch populated the data.
- `has_og_product_signal()` requires og:type=product or a price tag
  so blog posts don't get mis-classified as products.

Tests: 197 passing in webclaw-fetch (6 new), clippy clean.
2026-04-22 17:07:31 +02:00
Valerio
0ab891bd6b refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch
The local-first / cloud-fallback flow was duplicated in two places:
- webclaw-mcp/src/cloud.rs (302 lines, canonical)
- webclaw-cli/src/cloud.rs (80 lines, minimal subset kept to avoid
  pulling rmcp as a dep)

Move to the shared crate where all vertical extractors and the new
webclaw-server can also reach it.

## New module: webclaw-fetch/src/cloud.rs

Single canonical home. Consolidates both previous versions and
promotes the error type from stringy to typed:

- `CloudError` enum with dedicated variants for the four HTTP
  outcomes callers act on differently — 401 (key rejected),
  402 (insufficient plan), 429 (rate limited), plus ServerError /
  Network / ParseFailed. Each variant's Display message ends with
  an actionable URL (signup / pricing / dashboard) so API consumers
  can surface it verbatim.

- `From<CloudError> for String` bridge so the dozen existing
  `.await?` call sites in MCP / CLI that expected `Result<_, String>`
  keep compiling. We can migrate them to the typed error per-site
  later without a churn commit.

- `CloudClient::new(Option<&str>)` matches the CLI's `--api-key`
  flag pattern (explicit key wins, env fallback, None when empty).
  `::from_env()` kept for MCP-style call sites.

- `with_key_and_base` for staging / integration tests.

- `scrape / post / get / fetch_html` — `fetch_html` is new, a
  convenience that calls /v1/scrape with formats=["html"] and
  returns the raw HTML string so vertical extractors can plug
  antibot-bypassed HTML straight into their parsers.

- `is_bot_protected` + `needs_js_rendering` detectors moved
  over verbatim. Detection patterns are public (CF / DataDome /
  AWS WAF challenge-page signatures) — no moat leak.

- `smart_fetch` kept on the original `Result<_, String>`
  signature so MCP's six call sites compile unchanged.

- `smart_fetch_html` is new: the local-first-then-cloud flow
  for the vertical-extractor pattern, returning the typed
  `CloudError` so extractors can emit precise upgrade-path
  messages.

## Cleanup

- Deleted webclaw-mcp/src/cloud.rs — all imports now resolve to
  `webclaw_fetch:☁️:*`. Dropped reqwest as a direct dep of
  webclaw-mcp (it only used it for the old cloud client).
- Deleted webclaw-cli/src/cloud.rs. CLI keeps reqwest for its
  webhook / on-change / research HTTP calls.
- webclaw-fetch now has reqwest as a direct dep. It was already
  transitively pulled in by webclaw-llm; this just makes the
  dependency relationship explicit at the call site.

## Tests

16 new unit tests cover:
- CloudError status mapping (401/402/429/5xx)
- NotConfigured error includes signup URL
- CloudClient::new explicit-key-wins-over-env + empty-string = None
- base_url strips trailing slash
- Detector matrix (CF challenge / Turnstile / real content with
  embedded Turnstile / SPA skeleton / real article with script tags)
- truncate respects char boundaries (don't slice inside UTF-8)

Full workspace test suite still passes (~500 tests). fmt + clippy
clean. No behavior change for existing MCP / CLI call sites.
2026-04-22 16:05:44 +02:00