webclaw/crates/webclaw-fetch/src/cloud.rs

852 lines
30 KiB
Rust

refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch

The local-first / cloud-fallback flow was duplicated in two places:
- webclaw-mcp/src/cloud.rs (302 lines, canonical)
- webclaw-cli/src/cloud.rs (80 lines, minimal subset kept to avoid pulling rmcp as a dep)

Move to the shared crate where all vertical extractors and the new webclaw-server can also reach it.

## New module: webclaw-fetch/src/cloud.rs

Single canonical home. Consolidates both previous versions and promotes the error type from stringy to typed:
- `CloudError` enum with dedicated variants for the four HTTP outcomes callers act on differently: 401 (key rejected), 402 (insufficient plan), 429 (rate limited), plus ServerError / Network / ParseFailed. Each variant's Display message ends with an actionable URL (signup / pricing / dashboard) so API consumers can surface it verbatim.
- `From<CloudError> for String` bridge so the dozen existing `.await?` call sites in MCP / CLI that expected `Result<_, String>` keep compiling. We can migrate them to the typed error per-site later without a churn commit.
- `CloudClient::new(Option<&str>)` matches the CLI's `--api-key` flag pattern (explicit key wins, env fallback, None when empty). `::from_env()` kept for MCP-style call sites.
- `with_key_and_base` for staging / integration tests.
- `scrape / post / get / fetch_html`: `fetch_html` is new, a convenience that calls /v1/scrape with formats=["html"] and returns the raw HTML string so vertical extractors can plug antibot-bypassed HTML straight into their parsers.
- `is_bot_protected` + `needs_js_rendering` detectors moved over verbatim. Detection patterns are public (CF / DataDome / AWS WAF challenge-page signatures), so no moat leak.
- `smart_fetch` kept on the original `Result<_, String>` signature so MCP's six call sites compile unchanged.
- `smart_fetch_html` is new: the local-first-then-cloud flow for the vertical-extractor pattern, returning the typed `CloudError` so extractors can emit precise upgrade-path messages.

## Cleanup

- Deleted webclaw-mcp/src/cloud.rs; all imports now resolve to `webclaw_fetch::cloud::*`. Dropped reqwest as a direct dep of webclaw-mcp (it only used it for the old cloud client).
- Deleted webclaw-cli/src/cloud.rs. CLI keeps reqwest for its webhook / on-change / research HTTP calls.
- webclaw-fetch now has reqwest as a direct dep. It was already transitively pulled in by webclaw-llm; this just makes the dependency relationship explicit at the call site.

## Tests

16 new unit tests cover:
- CloudError status mapping (401/402/429/5xx)
- NotConfigured error includes signup URL
- CloudClient::new explicit-key-wins-over-env + empty-string = None
- base_url strips trailing slash
- Detector matrix (CF challenge / Turnstile / real content with embedded Turnstile / SPA skeleton / real article with script tags)
- truncate respects char boundaries (don't slice inside UTF-8)

Full workspace test suite still passes (~500 tests). fmt + clippy clean. No behavior change for existing MCP / CLI call sites.
2026-04-22 16:05:44 +02:00
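The status-to-variant routing and the `String` bridge described in this commit can be sketched as follows. This is a minimal std-only illustration, not the crate's definition: the real enum derives `thiserror::Error` and has more variants, the Display strings here are shortened placeholders, and `check` / `legacy_call` are hypothetical names standing in for a legacy call site.

```rust
use std::fmt;

/// Illustrative stand-in for the typed cloud error (reduced variant set).
#[derive(Debug, PartialEq)]
pub enum CloudError {
    Unauthorized,
    InsufficientPlan,
    RateLimited,
    ServerError { status: u16, body: String },
}

impl CloudError {
    /// Route well-known statuses to dedicated variants; everything else
    /// falls through to `ServerError`.
    pub fn from_status_and_body(status: u16, body: String) -> Self {
        match status {
            401 => Self::Unauthorized,
            402 => Self::InsufficientPlan,
            429 => Self::RateLimited,
            _ => Self::ServerError { status, body },
        }
    }
}

impl fmt::Display for CloudError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Unauthorized => write!(f, "API key rejected (HTTP 401)"),
            Self::InsufficientPlan => write!(f, "plan does not cover this call (HTTP 402)"),
            Self::RateLimited => write!(f, "rate limit reached (HTTP 429)"),
            Self::ServerError { status, body } => {
                write!(f, "cloud API returned HTTP {status}: {body}")
            }
        }
    }
}

/// The bridge: lets `?` convert a `CloudError` into a `String` error.
impl From<CloudError> for String {
    fn from(e: CloudError) -> Self {
        e.to_string()
    }
}

/// Hypothetical typed helper that fails for any non-200 status.
fn check(status: u16) -> Result<(), CloudError> {
    if status == 200 {
        Ok(())
    } else {
        Err(CloudError::from_status_and_body(status, "upstream error".to_string()))
    }
}

/// Hypothetical legacy call site that still returns `Result<_, String>`.
pub fn legacy_call(status: u16) -> Result<&'static str, String> {
    check(status)?; // CloudError -> String through the From impl
    Ok("ok")
}
```

Because the conversion happens inside `?`, a call site can later switch its return type to `Result<_, CloudError>` without touching the function body.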
//! Cloud API fallback client for api.webclaw.io.
//!
//! When local fetch hits bot protection or a JS-only SPA, callers can
//! fall back to the hosted API which runs the full antibot / CDP
//! pipeline. This module is the shared home for that flow: previously
//! duplicated between `webclaw-mcp/src/cloud.rs` and
//! `webclaw-cli/src/cloud.rs`.
//!
//! ## Architecture
//!
//! - [`CloudClient`] — thin reqwest wrapper around the api.webclaw.io
//! REST surface. Typed errors for the four HTTP failures callers act
//! on differently (401 / 402 / 429 / other) plus network + parse.
//! - [`is_bot_protected`] / [`needs_js_rendering`] — pure detectors on
//! response bodies. The detection patterns are public (CF / DataDome
//! challenge-page signatures) so these live in OSS without leaking
//! any moat.
//! - [`smart_fetch`] — try-local-then-escalate flow returning an
//! [`ExtractionResult`] or raw cloud JSON. Kept on the original
//! `Result<_, String>` signature so the existing MCP / CLI call
//! sites work unchanged.
//! - [`smart_fetch_html`] — new convenience for the vertical-extractor
//! pattern: just give me antibot-bypassed HTML so I can run my own
//! parser on it. Returns the typed [`CloudError`] so extractors can
//! emit precise "upgrade your plan" / "invalid key" messages.
//!
fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs)

Addresses the four follow-ups surfaced by the cloud-key smoke test.

trustpilot_reviews: full rewrite for 2025 schema:
- Trustpilot moved from single-Organization+aggregateRating to three separate JSON-LD blocks: a site-level Organization (Trustpilot itself), a Dataset with a csvw:Table mainEntity carrying the per-star distribution for the target business, and an aiSummary + aiSummaryReviews block with the AI-generated summary and recent review objects.
- Parser now: skips the site-level Org, walks @graph as either array or single object, picks the Dataset whose about.@id references the target domain, parses each csvw:column for rating buckets, computes weighted-average rating + total from the distribution, extracts the aiSummary text, and turns aiSummaryReviews into a clean reviews array with author/country/date/rating/title/text/likes.
- OG-title regex fallbacks for business_name, rating_label, and average_rating when the Dataset block is absent. OG-description regex for review_count.
- Returned shape: url, domain, business_name, rating_label, average_rating, review_count, rating_distribution (per-star count and percent), ai_summary, recent_reviews, review_count_listed, data_source.
- Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 / 226 reviews with full distribution + AI summary + 2 recent reviews.

amazon_product: force-cloud-escalation + OG fallback:
- Amazon serves Product JSON-LD intermittently even on non-CAPTCHA pages. When local fetch returns HTML without Product JSON-LD and a cloud client is configured, force-escalate to the cloud path which reliably surfaces title + description via its render engine.
- New OG meta-tag fallback for title/image/description so the cloud's synthesize_html output (OG tags only, no #productTitle DOM ID) still yields useful data. Real Amazon pages still prefer the DOM regex.
- Verified live: B0BSHF7WHW escalates to cloud, returns Apple MacBook Pro title + description + asin.

etsy_listing: slug humanization + generic-page filtering + shop from brand:
- Etsy serves various placeholder pages when a listing is delisted, blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...", "This item is unavailable - Etsy", plus the OG description "Sorry, the page you were looking for was not found." is_generic_* helpers catch all three shapes.
- When the OG title is generic, humanise the URL slug: the path `/listing/123456789/personalized-stainless-steel-tumbler` becomes `Personalized Stainless Steel Tumbler` so callers always get a meaningful title even on dead listings.
- Etsy uses `brand` (top-level JSON-LD field) for the shop name on listings that don't ship offers[].seller.name. Shop now falls through offers -> brand so either schema resolves.
- Verified live: listing/1097462299 returns full rich data (title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating / 225 reviews, InStock).

cloud.rs: module doc update:
- Added an architecture section documenting that api.webclaw.io does not return raw HTML by design and that [`synthesize_html`] reassembles the parsed response (metadata + structured_data + markdown) back into minimal HTML so existing local parsers run unchanged across both paths. Also notes the DOM-regex limitation for extractors that need live-page-specific DOM IDs.

Tests: 215 passing in webclaw-fetch (18 new), clippy clean. Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY: 28/28 clean, 0 partial, 0 failed.
2026-04-22 17:49:50 +02:00
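The slug humanisation step from the etsy_listing notes can be sketched like this. It is a minimal illustration under assumptions: `humanize_listing_slug` is a hypothetical name, and the real extractor's handling of edge cases (query strings, locale prefixes, non-ASCII slugs) is not shown in this file.

```rust
/// Hypothetical helper: turn the trailing slug of an Etsy listing path
/// (`/listing/<numeric id>/<slug>`) into a title-cased string.
/// Returns `None` when the path does not have that shape.
pub fn humanize_listing_slug(path: &str) -> Option<String> {
    let mut parts = path.trim_matches('/').split('/');

    // First segment must be the literal "listing".
    if parts.next()? != "listing" {
        return None;
    }

    // Second segment must be a non-empty numeric listing id.
    let id = parts.next()?;
    if id.is_empty() || !id.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }

    // Third segment is the hyphen-separated slug.
    let slug = parts.next()?;
    let words: Vec<String> = slug
        .split('-')
        .filter(|w| !w.is_empty())
        .map(|w| {
            // Uppercase the first character, keep the rest as-is.
            let mut chars = w.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect();

    if words.is_empty() {
        None
    } else {
        Some(words.join(" "))
    }
}
```

Returning `Option` lets the caller keep the generic OG title as a last resort when the URL carries no slug at all.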
//! ## Cloud response shape and [`synthesize_html`]
//!
//! `api.webclaw.io/v1/scrape` deliberately does **not** return a
//! `html` field even when `formats=["html"]` is requested. By design
//! the cloud API returns a parsed bundle:
//!
//! ```text
//! {
//!   "url": "https://...",
//!   "metadata": { title, description, image, site_name, ... },  // OG / meta tags
//!   "structured_data": [ { "@type": "...", ... }, ... ],       // JSON-LD blocks
//!   "markdown": "# Page Title\n\n...",                         // cleaned markdown
//!   "antibot": { engine, path, user_agent },                   // bypass telemetry
//!   "cache": { status, age_seconds }
//! }
//! ```
//!
//! [`CloudClient::fetch_html`] reassembles that bundle back into a
//! minimal synthetic HTML document so the existing local extractor
//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
//! cloud output. Each `structured_data` entry becomes a
//! `<script type="application/ld+json">` tag; each `metadata` field
//! becomes a `<meta property="og:...">` tag; `markdown` lands in a
//! `<pre>` inside the body. Callers that walk Schema.org blocks see
//! exactly what they'd see on a real live page.
//!
//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
//! won't hit on the synthesised HTML — those IDs only exist on live
//! Amazon pages. Extractors that need DOM regex keep OG meta tag
//! fallbacks for that reason.
//!
//! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
//! signup when a site is blocked; nothing fails silently. Cloud users
//! get the escalation for free.
use std::time::Duration;
use http::HeaderMap;
use serde_json::{Value, json};
use thiserror::Error;
use tracing::{debug, info, warn};
use crate::client::FetchClient;
// ---------------------------------------------------------------------------
// URLs + defaults — keep in one place so "change the signup link" is a
// single-commit edit.
// ---------------------------------------------------------------------------
const API_BASE_DEFAULT: &str = "https://api.webclaw.io/v1";
const DEFAULT_TIMEOUT_SECS: u64 = 120;
const SIGNUP_URL: &str = "https://webclaw.io/signup";
const PRICING_URL: &str = "https://webclaw.io/pricing";
const KEYS_URL: &str = "https://webclaw.io/dashboard/api-keys";
// ---------------------------------------------------------------------------
// Errors
// ---------------------------------------------------------------------------
/// Structured cloud-fallback error. Variants correspond to the HTTP
/// outcomes callers act on differently: a 401 needs a different UX
/// than a 402, which needs a different UX than a network blip.
///
/// Display messages for the configuration, auth, plan, and rate-limit
/// variants end with an actionable URL so API consumers can surface
/// them to users verbatim.
#[derive(Debug, Error)]
pub enum CloudError {
    /// No `WEBCLAW_API_KEY` configured. Returned by [`smart_fetch_html`]
    /// and friends when they hit bot protection but have no client to
    /// escalate to.
    #[error(
        "this site is behind antibot protection. \
         Set WEBCLAW_API_KEY to unlock automatic cloud bypass. \
         Free tier: {SIGNUP_URL}"
    )]
    NotConfigured,

    /// HTTP 401: the key is present but rejected.
    #[error(
        "WEBCLAW_API_KEY rejected (HTTP 401). \
         Check or regenerate your key at {KEYS_URL}"
    )]
    Unauthorized,

    /// HTTP 402: the key is valid but the plan doesn't cover the call.
    #[error(
        "your plan doesn't include this endpoint / site (HTTP 402). \
         Upgrade at {PRICING_URL}"
    )]
    InsufficientPlan,

    /// HTTP 429: rate limit.
    #[error(
        "cloud API rate limit reached (HTTP 429). \
         Wait a moment or upgrade at {PRICING_URL}"
    )]
    RateLimited,

    /// Any other non-success status (4xx / 5xx) that the caller
    /// probably can't do anything specific about. Body is truncated
    /// to a sensible length for logs.
    #[error("cloud API returned HTTP {status}: {body}")]
    ServerError { status: u16, body: String },

    /// Transport-level failure (connect, TLS, timeout, ...).
    #[error("cloud request failed: {0}")]
    Network(String),

    /// The response body could not be parsed.
    #[error("cloud response parse failed: {0}")]
    ParseFailed(String),
}
impl CloudError {
    /// Build from a non-success HTTP response, routing well-known
    /// statuses to dedicated variants.
    fn from_status_and_body(status: u16, body: String) -> Self {
        match status {
            401 => Self::Unauthorized,
            402 => Self::InsufficientPlan,
            429 => Self::RateLimited,
            _ => Self::ServerError {
                status,
                body: truncate(&body, 500).to_string(),
            },
        }
    }
}

impl From<reqwest::Error> for CloudError {
    fn from(e: reqwest::Error) -> Self {
        Self::Network(e.to_string())
    }
}
/// Backwards-compatibility bridge: many pre-existing MCP / CLI call
/// sites use `.await?` inside functions returning `Result<_, String>`.
/// This `From` impl keeps those sites compiling while we migrate them
/// to the typed error over time.
impl From<CloudError> for String {
    fn from(e: CloudError) -> Self {
        e.to_string()
    }
}
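/// Return at most the first `max` characters of `text`. Cuts on a
/// char boundary, so it never slices inside a multi-byte UTF-8
/// sequence.
fn truncate(text: &str, max: usize) -> &str {
    match text.char_indices().nth(max) {
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text,
    }
}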
// ---------------------------------------------------------------------------
// CloudClient
// ---------------------------------------------------------------------------
/// Thin reqwest client around api.webclaw.io. Cheap to clone: the
/// inner `reqwest::Client` already refcounts its connection pool.
#[derive(Clone)]
pub struct CloudClient {
    api_key: String,
    base_url: String,
    http: reqwest::Client,
}

impl CloudClient {
    /// Build from an explicit key (e.g. a `--api-key` CLI flag) or fall
    /// back to the `WEBCLAW_API_KEY` env var. Returns `None` when
    /// neither is set / both are empty.
    ///
    /// This is the function call sites should use by default; it's
    /// what both the CLI and MCP want.
    pub fn new(explicit_key: Option<&str>) -> Option<Self> {
        explicit_key
            .map(String::from)
            .or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
            .filter(|k| !k.trim().is_empty())
            .map(Self::with_key)
    }

    /// Build from `WEBCLAW_API_KEY` env only. Thin wrapper kept for
    /// readability at call sites that never accept a flag.
    pub fn from_env() -> Option<Self> {
        Self::new(None)
    }

    /// Build with an explicit key. Useful when the caller already has
    /// a key from somewhere other than env or a flag (e.g. loaded from
    /// config).
    pub fn with_key(api_key: impl Into<String>) -> Self {
        Self::with_key_and_base(api_key, API_BASE_DEFAULT)
    }

    /// Build with an explicit key and base URL. Used by integration
    /// tests and staging deployments.
    pub fn with_key_and_base(api_key: impl Into<String>, base_url: impl Into<String>) -> Self {
        let http = reqwest::Client::builder()
            .timeout(Duration::from_secs(DEFAULT_TIMEOUT_SECS))
            .build()
            .expect("reqwest client builder failed with default settings");
        Self {
            api_key: api_key.into(),
            base_url: base_url.into().trim_end_matches('/').to_string(),
            http,
        }
    }

    pub fn base_url(&self) -> &str {
        &self.base_url
    }
    /// Generic POST. Endpoint may be `"scrape"` or `"/scrape"`; we
    /// normalise the slash.
    pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, CloudError> {
        let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
        let resp = self
            .http
            .post(&url)
            .header("Authorization", format!("Bearer {}", self.api_key))
            .json(&body)
            .send()
            .await?;
        parse_cloud_response(resp).await
    }

    /// Generic GET.
    pub async fn get(&self, endpoint: &str) -> Result<Value, CloudError> {
        let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
        let resp = self
            .http
            .get(&url)
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await?;
        parse_cloud_response(resp).await
    }

    /// `POST /v1/scrape` with the caller's extraction options. This is
    /// the public "do everything" surface: the cloud side handles
    /// fetch + antibot + JS render + extraction + formatting.
    pub async fn scrape(
        &self,
        url: &str,
        formats: &[&str],
        include_selectors: &[String],
        exclude_selectors: &[String],
        only_main_content: bool,
    ) -> Result<Value, CloudError> {
        let mut body = json!({ "url": url, "formats": formats });
        if only_main_content {
            body["only_main_content"] = json!(true);
        }
        if !include_selectors.is_empty() {
            body["include_selectors"] = json!(include_selectors);
        }
        if !exclude_selectors.is_empty() {
            body["exclude_selectors"] = json!(exclude_selectors);
        }
        self.post("scrape", body).await
    }
fix(cloud): synthesize HTML from cloud response instead of requesting raw html

api.webclaw.io/v1/scrape does not return a `html` field even when `formats=["html"]` is requested, by design: the cloud API returns pre-parsed `structured_data` (JSON-LD blocks), `metadata` (OG tags, title, description, image, site_name), and `markdown`.

Our CloudClient::fetch_html helper was premised on the API returning raw HTML. Without a key set, the error message was hidden behind CloudError::NotConfigured so the bug never surfaced. With a key set, every extractor that escalated to cloud (trustpilot_reviews, etsy_listing, amazon_product, ebay_listing, substack_post HTML fallback) got back "cloud /v1/scrape returned no html field".

Fix: reassemble a minimal synthetic HTML document from the cloud's parsed output. Each JSON-LD block goes back into a `<script type="application/ld+json">` tag, metadata fields become OG `<meta>` tags, and the markdown body lands in a `<pre>` tag. Existing local extractor parsers (find_product_jsonld, find_business, og() regex) see the same shapes they'd see from a real page, so no per-extractor changes needed.

Verified end-to-end with WEBCLAW_CLOUD_API_KEY set:
- trustpilot_reviews: escalates, returns Organization JSON-LD data (parser picks Trustpilot site-level Org not the reviewed business; tracked as a follow-up to update Trustpilot schema handling)
- etsy_listing: escalates via antibot render path; listing-specific data depends on target listing having JSON-LD (many Etsy listings don't)
- amazon_product, ebay_listing: stay local because their pages ship enough content not to trigger bot-detection escalation
- The other 24 extractors unchanged (local path, zero cloud credits)

Tests: 200 passing in webclaw-fetch (3 new), clippy clean.
2026-04-22 17:24:50 +02:00
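To illustrate why the synthetic document lets OG-based extractors run unchanged, here is a std-only round-trip sketch. All names are hypothetical: `meta_tag` mimics the tag shape the fix describes, and `og_content` is a simplified string-search stand-in for an extractor's OG lookup (the commit mentions an `og()` regex, whose actual implementation is not shown in this file).

```rust
/// Escape a value for use inside a double-quoted HTML attribute.
pub fn escape_attr(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('"', "&quot;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

/// Emit one OG meta tag in the shape the synthesized document uses.
pub fn meta_tag(key: &str, value: &str) -> String {
    format!(
        "<meta property=\"og:{key}\" content=\"{}\">\n",
        escape_attr(value)
    )
}

/// Reverse of `escape_attr`. `&amp;` is handled last so an escaped
/// literal such as `&amp;lt;` decodes to `&lt;`, not `<`.
fn unescape_attr(s: &str) -> String {
    s.replace("&quot;", "\"")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
}

/// Hypothetical extractor-side lookup: find `og:<key>` and return its
/// decoded `content` attribute, or `None` if the tag is absent.
pub fn og_content(html: &str, key: &str) -> Option<String> {
    let needle = format!("<meta property=\"og:{key}\" content=\"");
    let start = html.find(&needle)? + needle.len();
    // Escaping guarantees no raw quote inside the value, so the next
    // quote closes the attribute.
    let end = start + html[start..].find('"')?;
    Some(unescape_attr(&html[start..end]))
}
```

A parser written against live pages only needs the tag shape to match; it cannot tell whether the tag came from the origin server or was rebuilt from the cloud's `metadata` object.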
    /// Get antibot-bypassed page data back as a synthetic HTML string.
    ///
    /// `api.webclaw.io/v1/scrape` intentionally does not return raw
    /// HTML: it returns pre-parsed `structured_data` (JSON-LD blocks)
    /// plus `metadata` (title, description, OG tags, image) plus a
    /// `markdown` body. We reassemble those into a minimal HTML doc
    /// that looks enough like the real page for our local extractor
    /// parsers to run unchanged: each JSON-LD block gets emitted as a
    /// `<script type="application/ld+json">` tag, metadata gets
    /// emitted as OG `<meta>` tags, and the markdown lands in the
    /// body. Extractors that walk JSON-LD (ecommerce_product,
    /// trustpilot_reviews, ebay_listing, etsy_listing, amazon_product)
    /// see exactly the same shapes they'd see from a live HTML fetch.
    pub async fn fetch_html(&self, url: &str) -> Result<String, CloudError> {
        let resp = self.scrape(url, &["markdown"], &[], &[], false).await?;
        Ok(synthesize_html(&resp))
    }
}
/// Reassemble a minimal HTML document from a cloud `/v1/scrape`
/// response so existing HTML-based extractor parsers can run against
/// cloud output without a separate code path.
fn synthesize_html(resp: &Value) -> String {
    let mut out = String::with_capacity(8_192);
    out.push_str("<html><head>\n");

    // Metadata → OG meta tags. Keep keys stable with what local
    // extractors read: og:title, og:description, og:image, og:site_name.
    if let Some(meta) = resp.get("metadata").and_then(|m| m.as_object()) {
        for (src_key, og_key) in [
            ("title", "title"),
            ("description", "description"),
            ("image", "image"),
            ("site_name", "site_name"),
        ] {
            if let Some(val) = meta.get(src_key).and_then(|v| v.as_str())
                && !val.is_empty()
            {
                out.push_str(&format!(
                    "<meta property=\"og:{og_key}\" content=\"{}\">\n",
                    html_escape_attr(val)
                ));
            }
        }
    }
// Structured data blocks → <script type="application/ld+json">.
// Serialise losslessly so extract_json_ld's parser gets the same
// shape it would get from a real page.
if let Some(blocks) = resp.get("structured_data").and_then(|v| v.as_array()) {
for block in blocks {
if let Ok(s) = serde_json::to_string(block) {
out.push_str("<script type=\"application/ld+json\">");
out.push_str(&s);
out.push_str("</script>\n");
}
}
}
out.push_str("</head><body>\n");
// Markdown body → escaped text inside <pre>. Extractors that regex
// over <div> IDs will find nothing here, but the cloud response
// carries no raw markup for them to match anyway. OK to keep minimal.
if let Some(md) = resp.get("markdown").and_then(|v| v.as_str()) {
out.push_str("<pre>");
out.push_str(&html_escape_text(md));
out.push_str("</pre>\n");
}
out.push_str("</body></html>");
out
}
fn html_escape_attr(s: &str) -> String {
s.replace('&', "&amp;")
.replace('"', "&quot;")
.replace('<', "&lt;")
.replace('>', "&gt;")
}
fn html_escape_text(s: &str) -> String {
s.replace('&', "&amp;")
.replace('<', "&lt;")
.replace('>', "&gt;")
}
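// Illustration only: a minimal sketch of why both escape helpers above
// replace `&` before anything else. The function names below are
// hypothetical stand-ins, not part of this module. If `&` were replaced
// last, the `&` inside an already-produced `&lt;` would be escaped a
// second time.

```rust
/// Escape text for use between HTML tags. `&` is replaced first so the
/// entities produced by the later replacements are not escaped again.
fn escape_text(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

/// Deliberately wrong ordering, kept only to show the double-escaping bug.
fn escape_text_wrong_order(s: &str) -> String {
    s.replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('&', "&amp;")
}
```

// With the correct ordering, "a < b" becomes "a &lt; b"; with the wrong
// ordering it becomes "a &amp;lt; b", which a browser would render as the
// literal text "&lt;".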
async fn parse_cloud_response(resp: reqwest::Response) -> Result<Value, CloudError> {
let status = resp.status();
if status.is_success() {
return resp
.json()
.await
.map_err(|e| CloudError::ParseFailed(e.to_string()));
}
let body = resp.text().await.unwrap_or_default();
Err(CloudError::from_status_and_body(status.as_u16(), body))
}
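// Illustration only: `CloudError::from_status_and_body` is defined outside
// this section, so the sketch below is an assumption about its shape,
// inferred from the unit tests in this file (401 / 402 / 429 map to
// dedicated variants, everything else to a server error whose body is
// capped at 500 bytes). `SketchError`, `map_status`, and
// `truncate_on_char_boundary` are hypothetical names. The real `truncate`
// helper may count characters rather than bytes.

```rust
#[derive(Debug, PartialEq)]
enum SketchError {
    Unauthorized,
    InsufficientPlan,
    RateLimited,
    ServerError { status: u16, body: String },
}

/// Assumed cap on how much of an error body is kept.
const MAX_BODY_BYTES: usize = 500;

/// Cut `s` to at most `max_bytes` bytes without splitting a UTF-8 char.
fn truncate_on_char_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Map an HTTP status and response body to a typed error.
fn map_status(status: u16, body: &str) -> SketchError {
    match status {
        401 => SketchError::Unauthorized,
        402 => SketchError::InsufficientPlan,
        429 => SketchError::RateLimited,
        _ => SketchError::ServerError {
            status,
            body: truncate_on_char_boundary(body, MAX_BODY_BYTES).to_string(),
        },
    }
}
```

// Keeping the match on the raw status code means callers can branch on
// the variant instead of parsing an error string.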
// ---------------------------------------------------------------------------
// Detection
// ---------------------------------------------------------------------------
/// True when a fetched response body is actually a bot-protection
/// challenge page rather than the content the caller asked for.
///
/// Conservative — only fires on patterns that indicate the *entire*
/// page is a challenge, not embedded CAPTCHAs on a real content page.
pub fn is_bot_protected(html: &str, headers: &HeaderMap) -> bool {
let html_lower = html.to_lowercase();
// Cloudflare challenge page.
if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
return true;
}
// Cloudflare "Just a moment" / "Checking your browser" interstitial.
if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
&& html_lower.contains("cf-spinner")
{
return true;
}
// Cloudflare Turnstile. Only counts when the page is small —
// legitimate pages embed Turnstile for signup forms etc.
if (html_lower.contains("cf-turnstile")
|| html_lower.contains("challenges.cloudflare.com/turnstile"))
&& html.len() < 100_000
{
return true;
}
// DataDome.
if html_lower.contains("geo.captcha-delivery.com")
|| html_lower.contains("captcha-delivery.com/captcha")
{
return true;
}
// AWS WAF.
if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
return true;
}
fix(extractors): detect AWS WAF verifying-connection page, add OG fallback to ecommerce_product Two targeted fixes surfaced by the manual extractor smoke test. cloud::is_bot_protected: - Trustpilot serves a ~565-byte AWS WAF interstitial with the string "Verifying your connection..." and an `interstitial-spinner` div. That pattern was not in our detector, so local fetch returned the challenge page, JSON-LD parsing found nothing, and the extractor emitted a confusing "no Organization/LocalBusiness JSON-LD" error. - Added the pattern plus a <10KB size gate so real articles that happen to mention the phrase aren't misclassified. Two new tests cover positive + negative cases. - With the fix, trustpilot_reviews now correctly escalates via smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY" actionable error without a key, or cloud-bypassed HTML with one. ecommerce_product: - Previously hard-failed when a page had no Product JSON-LD, and produced an empty `offers` list when JSON-LD was present but its `offers` node was. Many sites (Patagonia-style catalog pages, smaller Squarespace stores) ship one or the other of OG / JSON-LD but not both with price data. - Added OG meta-tag fallback that handles: * no JSON-LD at all -> build minimal payload from og:title, og:image, og:description, product:price:amount, product:price:currency, product:availability, product:brand * JSON-LD present but offers empty -> augment with an OG-derived offer so price comes through - New `data_source` field: "jsonld", "jsonld+og", or "og_fallback" so callers can tell which branch populated the data. - `has_og_product_signal()` requires og:type=product or a price tag so blog posts don't get mis-classified as products. Tests: 197 passing in webclaw-fetch (6 new), clippy clean.
2026-04-22 17:07:31 +02:00
// AWS WAF "Verifying your connection" interstitial (used by Trustpilot).
// Distinct from the captcha-branded path above: the challenge page is
// a tiny HTML shell with an `interstitial-spinner` div and no content.
// Gating on html.len() keeps false-positives off long pages that
// happen to mention the phrase in an unrelated context.
if html_lower.contains("interstitial-spinner")
&& html_lower.contains("verifying your connection")
&& html.len() < 10_000
{
return true;
}
// hCaptcha *blocking* page (not just an embedded widget).
if html_lower.contains("hcaptcha.com")
&& html_lower.contains("h-captcha")
&& html.len() < 50_000
{
return true;
}
// Cloudflare via response headers + challenge body.
let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some();
if has_cf_headers
&& (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
{
return true;
}
false
}
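// Illustration only: several branches of `is_bot_protected` share one
// pattern, "all markers present AND the page is small". The helper below
// is a hypothetical, std-only sketch of that pattern; it is not how the
// function above is actually factored. Markers are assumed to be
// lowercase already.

```rust
/// True when every marker appears (case-insensitively) and the page is
/// shorter than `max_len` bytes. The size gate keeps long, real pages that
/// merely mention a marker from being classified as challenge pages.
fn is_small_challenge_page(html: &str, markers: &[&str], max_len: usize) -> bool {
    if html.len() >= max_len {
        return false;
    }
    let lower = html.to_lowercase();
    markers.iter().all(|m| lower.contains(*m))
}
```

// For example, the AWS WAF branch above corresponds to the markers
// ["interstitial-spinner", "verifying your connection"] with a
// 10_000-byte gate.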
/// True when a page likely needs JS rendering: either a script-bearing
/// page with almost no extractable text, or a large page with an SPA
/// framework signature and a low content-to-HTML ratio.
pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
let has_scripts = html.contains("<script");
// Tier 1: almost no extractable text from a large-ish page.
if word_count < 50 && html.len() > 5_000 && has_scripts {
return true;
}
// Tier 2: SPA framework markers + low content-to-HTML ratio.
if word_count < 800 && html.len() > 50_000 && has_scripts {
let html_lower = html.to_lowercase();
let has_spa_marker = html_lower.contains("react-app")
|| html_lower.contains("id=\"__next\"")
|| html_lower.contains("id=\"root\"")
|| html_lower.contains("id=\"app\"")
|| html_lower.contains("__next_data__")
|| html_lower.contains("nuxt")
|| html_lower.contains("ng-app");
if has_spa_marker {
return true;
}
}
false
}
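// Illustration only: a hypothetical variant of `needs_js_rendering` that
// reports which tier fired, which can help when tuning the thresholds.
// The thresholds and markers are copied from the function above;
// `JsRenderReason` and `js_render_reason` are names invented for this
// sketch.

```rust
#[derive(Debug, PartialEq)]
enum JsRenderReason {
    /// Tier 1: script-bearing page with almost no extractable text.
    NearlyEmpty,
    /// Tier 2: large page, little text, SPA framework marker present.
    SpaSkeleton,
}

fn js_render_reason(word_count: usize, html: &str) -> Option<JsRenderReason> {
    // Both tiers require at least one <script> tag.
    if !html.contains("<script") {
        return None;
    }
    if word_count < 50 && html.len() > 5_000 {
        return Some(JsRenderReason::NearlyEmpty);
    }
    if word_count < 800 && html.len() > 50_000 {
        let lower = html.to_lowercase();
        let markers = [
            "react-app",
            "id=\"__next\"",
            "id=\"root\"",
            "id=\"app\"",
            "__next_data__",
            "nuxt",
            "ng-app",
        ];
        if markers.iter().any(|m| lower.contains(*m)) {
            return Some(JsRenderReason::SpaSkeleton);
        }
    }
    None
}
```

// `js_render_reason(..).is_some()` should agree with `needs_js_rendering`
// for the same inputs, since the conditions are the same.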
// ---------------------------------------------------------------------------
// Smart-fetch: classic flow for MCP / CLI (returns either an extraction
// or raw cloud JSON)
// ---------------------------------------------------------------------------
/// Result of [`smart_fetch`]: either a local extraction or the raw
/// cloud API response when we escalated.
pub enum SmartFetchResult {
Local(Box<webclaw_core::ExtractionResult>),
Cloud(Value),
}
/// Try local fetch + extract first. On bot protection or detected
/// JS-render, fall back to `cloud.scrape(...)` with the caller's
/// formats. Returns `Err(String)` so existing call sites that expect
/// stringified errors keep compiling.
///
/// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed
/// [`CloudError`] so you can render precise UX.
pub async fn smart_fetch(
client: &FetchClient,
cloud: Option<&CloudClient>,
url: &str,
include_selectors: &[String],
exclude_selectors: &[String],
only_main_content: bool,
formats: &[&str],
) -> Result<SmartFetchResult, String> {
let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url))
.await
.map_err(|_| format!("Fetch timed out after 30s for {url}"))?
.map_err(|e| format!("Fetch failed: {e}"))?;
if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
info!(url, "bot protection detected, falling back to cloud API");
return cloud_scrape_fallback(
cloud,
url,
include_selectors,
exclude_selectors,
only_main_content,
formats,
)
.await;
}
let options = webclaw_core::ExtractionOptions {
include_selectors: include_selectors.to_vec(),
exclude_selectors: exclude_selectors.to_vec(),
only_main_content,
include_raw_html: false,
};
let extraction =
webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
.map_err(|e| format!("Extraction failed: {e}"))?;
if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
info!(
url,
word_count = extraction.metadata.word_count,
html_len = fetch_result.html.len(),
"JS-rendered page detected, falling back to cloud API"
);
return cloud_scrape_fallback(
cloud,
url,
include_selectors,
exclude_selectors,
only_main_content,
formats,
)
.await;
}
Ok(SmartFetchResult::Local(Box::new(extraction)))
}
async fn cloud_scrape_fallback(
cloud: Option<&CloudClient>,
url: &str,
include_selectors: &[String],
exclude_selectors: &[String],
only_main_content: bool,
formats: &[&str],
) -> Result<SmartFetchResult, String> {
let Some(c) = cloud else {
return Err(CloudError::NotConfigured.to_string());
};
let resp = c
.scrape(
url,
formats,
include_selectors,
exclude_selectors,
only_main_content,
)
.await
.map_err(|e| e.to_string())?;
info!(url, "cloud API fallback successful");
Ok(SmartFetchResult::Cloud(resp))
}
// ---------------------------------------------------------------------------
// Smart-fetch-HTML: for vertical extractors
// ---------------------------------------------------------------------------
/// Where the HTML ultimately came from — useful for callers that want
/// to track "did we fall back?" for logging or pricing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchSource {
Local,
Cloud,
}
/// Antibot-aware HTML fetch result. The `html` field is always populated.
pub struct FetchedHtml {
pub html: String,
pub final_url: String,
pub source: FetchSource,
}
/// Try local fetch; on bot protection, escalate to the cloud via
/// [`CloudClient::fetch_html`] and return the HTML it produces.
///
/// Designed for the vertical-extractor pattern where the caller has
/// its own parser and just needs bytes.
pub async fn smart_fetch_html(
client: &FetchClient,
cloud: Option<&CloudClient>,
url: &str,
) -> Result<FetchedHtml, CloudError> {
let resp = client
.fetch(url)
.await
.map_err(|e| CloudError::Network(e.to_string()))?;
if !is_bot_protected(&resp.html, &resp.headers) {
return Ok(FetchedHtml {
html: resp.html,
final_url: resp.url,
source: FetchSource::Local,
});
}
let Some(c) = cloud else {
warn!(url, "bot protection detected + no cloud client configured");
return Err(CloudError::NotConfigured);
};
debug!(url, "bot protection detected, escalating to cloud");
let html = c.fetch_html(url).await?;
Ok(FetchedHtml {
html,
final_url: url.to_string(),
source: FetchSource::Cloud,
})
}
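// Illustration only: the control flow of `smart_fetch_html`, reduced to a
// synchronous, std-only sketch with closures standing in for the local
// fetcher, the bot-protection detector, and the cloud client. Every name
// below (`Source`, `EscalationError`, `fetch_local_first`) is
// hypothetical; the real function is async and uses `FetchClient`,
// `CloudClient`, and `CloudError`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Source {
    Local,
    Cloud,
}

#[derive(Debug, PartialEq)]
enum EscalationError {
    LocalFailed(String),
    NotConfigured,
    CloudFailed(String),
}

/// Fetch locally first; escalate to the cloud only when the local body
/// looks like a challenge page. A missing cloud client is an error only
/// in the case where escalation is actually needed.
fn fetch_local_first(
    local: impl FnOnce() -> Result<String, String>,
    is_blocked: impl Fn(&str) -> bool,
    cloud: Option<&dyn Fn() -> Result<String, String>>,
) -> Result<(String, Source), EscalationError> {
    let html = local().map_err(EscalationError::LocalFailed)?;
    if !is_blocked(&html) {
        return Ok((html, Source::Local));
    }
    let cloud = cloud.ok_or(EscalationError::NotConfigured)?;
    let html = cloud().map_err(EscalationError::CloudFailed)?;
    Ok((html, Source::Cloud))
}
```

// Returning the source alongside the HTML lets a caller log or meter
// cloud usage without re-running the detector.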
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
mod tests {
use super::*;
fn empty_headers() -> HeaderMap {
HeaderMap::new()
}
// --- detectors ----------------------------------------------------------
#[test]
fn is_bot_protected_detects_cloudflare_challenge() {
let html = "<html><body>_cf_chl_opt loaded</body></html>";
assert!(is_bot_protected(html, &empty_headers()));
}
#[test]
fn is_bot_protected_detects_turnstile_on_short_page() {
let html = "<div class=\"cf-turnstile\"></div>";
assert!(is_bot_protected(html, &empty_headers()));
}
#[test]
fn is_bot_protected_ignores_turnstile_on_real_content() {
let html = format!(
"<html><body>{}<div class=\"cf-turnstile\"></div></body></html>",
"lots of real content ".repeat(8_000)
);
assert!(!is_bot_protected(&html, &empty_headers()));
}
#[test]
fn is_bot_protected_detects_aws_waf_verifying_connection() {
// The exact shape Trustpilot serves under AWS WAF.
let html = r#"<div class="container"><div id="loading-state">
<div class="interstitial-spinner" id="spinner"></div>
<h1>Verifying your connection...</h1></div></div>"#;
assert!(is_bot_protected(html, &empty_headers()));
}
#[test]
fn synthesize_html_embeds_jsonld_and_og_tags() {
let resp = json!({
"url": "https://example.com/p/1",
"metadata": {
"title": "My Product",
"description": "A nice thing.",
"image": "https://cdn.example.com/1.jpg",
"site_name": "Example Shop"
},
"structured_data": [
{"@context":"https://schema.org","@type":"Product",
"name":"Widget","offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
],
"markdown": "# Widget\n\nA nice widget."
});
let html = synthesize_html(&resp);
// OG tags from metadata.
assert!(html.contains(r#"<meta property="og:title" content="My Product">"#));
assert!(
html.contains(r#"<meta property="og:image" content="https://cdn.example.com/1.jpg">"#)
);
// JSON-LD block preserved losslessly.
assert!(html.contains(r#"<script type="application/ld+json">"#));
assert!(html.contains(r#""@type":"Product""#));
assert!(html.contains(r#""price":"9.99""#));
// Body carries markdown.
assert!(html.contains("A nice widget."));
}
#[test]
fn synthesize_html_handles_missing_fields_gracefully() {
let resp = json!({"url": "https://example.com", "metadata": {}});
let html = synthesize_html(&resp);
// No panic, no stray unclosed tags.
assert!(html.starts_with("<html><head>"));
assert!(html.ends_with("</body></html>"));
}
#[test]
fn synthesize_html_escapes_attribute_quotes() {
let resp = json!({
"metadata": {"title": r#"She said "hi""#}
});
let html = synthesize_html(&resp);
assert!(html.contains(r#"og:title" content="She said &quot;hi&quot;""#));
}
#[test]
fn is_bot_protected_ignores_phrase_on_real_content() {
// A real article that happens to mention the phrase in prose
// should not trigger the short-page detector.
let html = format!(
"<html><body>{}<p>Verifying your connection is tricky.</p></body></html>",
"article text ".repeat(2_000)
);
assert!(!is_bot_protected(&html, &empty_headers()));
}
#[test]
fn needs_js_rendering_flags_spa_skeleton() {
let html = format!(
"<html><body><div id=\"__next\"></div>{}</body></html>",
"<script>x</script>".repeat(500)
);
assert!(needs_js_rendering(10, &html));
}
#[test]
fn needs_js_rendering_passes_real_article() {
let html = format!(
"<html><body>{}<script>x</script></body></html>",
"Real article text ".repeat(5_000)
);
assert!(!needs_js_rendering(5_000, &html));
}
// --- CloudError mapping -------------------------------------------------
#[test]
fn cloud_error_maps_401() {
let e = CloudError::from_status_and_body(401, "invalid key".into());
assert!(matches!(e, CloudError::Unauthorized));
assert!(e.to_string().contains(KEYS_URL));
}
#[test]
fn cloud_error_maps_402() {
let e = CloudError::from_status_and_body(402, "{}".into());
assert!(matches!(e, CloudError::InsufficientPlan));
assert!(e.to_string().contains(PRICING_URL));
}
#[test]
fn cloud_error_maps_429() {
let e = CloudError::from_status_and_body(429, "slow down".into());
assert!(matches!(e, CloudError::RateLimited));
assert!(e.to_string().contains(PRICING_URL));
}
#[test]
fn cloud_error_maps_generic_5xx() {
let e = CloudError::from_status_and_body(503, "x".repeat(2000));
match e {
CloudError::ServerError { status, body } => {
assert_eq!(status, 503);
assert!(body.len() <= 500);
}
_ => panic!("expected ServerError"),
}
}
#[test]
fn not_configured_error_points_at_signup() {
let msg = CloudError::NotConfigured.to_string();
assert!(msg.contains(SIGNUP_URL));
assert!(msg.contains("WEBCLAW_API_KEY"));
}
// --- CloudClient construction ------------------------------------------
#[test]
fn cloud_client_explicit_key_wins_over_env() {
// SAFETY: this test mutates process-wide env, so it must not run in
// parallel with other tests that read WEBCLAW_API_KEY.
// Set env to something, pass an explicit key, explicit should win.
// (We don't actually *call* the API, just check the struct stored
// the right key.)
// `std::env::set_var` / `remove_var` are `unsafe` as of the Rust 2024
// edition, hence the `unsafe` blocks.
unsafe {
std::env::set_var("WEBCLAW_API_KEY", "from-env");
}
let client = CloudClient::new(Some("from-flag")).expect("client built");
assert_eq!(client.api_key, "from-flag");
unsafe {
std::env::remove_var("WEBCLAW_API_KEY");
}
}
#[test]
fn cloud_client_none_when_empty() {
unsafe {
std::env::remove_var("WEBCLAW_API_KEY");
}
assert!(CloudClient::new(None).is_none());
assert!(CloudClient::new(Some("")).is_none());
assert!(CloudClient::new(Some(" ")).is_none());
}
#[test]
fn cloud_client_base_url_strips_trailing_slash() {
let c = CloudClient::with_key_and_base("k", "https://api.example.com/v1/");
assert_eq!(c.base_url(), "https://api.example.com/v1");
}
#[test]
fn truncate_respects_char_boundaries() {
// Ensure we don't slice inside a multi-byte char.
let s = "a".repeat(10) + "é"; // é is 2 bytes
let out = truncate(&s, 11);
assert_eq!(out.chars().count(), 11);
}
}