webclaw/crates/webclaw-fetch/src/cloud.rs

854 lines
30 KiB
Rust

refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch The local-first / cloud-fallback flow was duplicated in two places: - webclaw-mcp/src/cloud.rs (302 lines, canonical) - webclaw-cli/src/cloud.rs (80 lines, minimal subset kept to avoid pulling rmcp as a dep) Move to the shared crate where all vertical extractors and the new webclaw-server can also reach it. ## New module: webclaw-fetch/src/cloud.rs Single canonical home. Consolidates both previous versions and promotes the error type from stringy to typed: - `CloudError` enum with dedicated variants for the four HTTP outcomes callers act on differently — 401 (key rejected), 402 (insufficient plan), 429 (rate limited), plus ServerError / Network / ParseFailed. Each variant's Display message ends with an actionable URL (signup / pricing / dashboard) so API consumers can surface it verbatim. - `From<CloudError> for String` bridge so the dozen existing `.await?` call sites in MCP / CLI that expected `Result<_, String>` keep compiling. We can migrate them to the typed error per-site later without a churn commit. - `CloudClient::new(Option<&str>)` matches the CLI's `--api-key` flag pattern (explicit key wins, env fallback, None when empty). `::from_env()` kept for MCP-style call sites. - `with_key_and_base` for staging / integration tests. - `scrape / post / get / fetch_html` — `fetch_html` is new, a convenience that calls /v1/scrape with formats=["html"] and returns the raw HTML string so vertical extractors can plug antibot-bypassed HTML straight into their parsers. - `is_bot_protected` + `needs_js_rendering` detectors moved over verbatim. Detection patterns are public (CF / DataDome / AWS WAF challenge-page signatures) — no moat leak. - `smart_fetch` kept on the original `Result<_, String>` signature so MCP's six call sites compile unchanged. 
- `smart_fetch_html` is new: the local-first-then-cloud flow for the vertical-extractor pattern, returning the typed `CloudError` so extractors can emit precise upgrade-path messages. ## Cleanup - Deleted webclaw-mcp/src/cloud.rs — all imports now resolve to `webclaw_fetch::cloud::*`. Dropped reqwest as a direct dep of webclaw-mcp (it only used it for the old cloud client). - Deleted webclaw-cli/src/cloud.rs. CLI keeps reqwest for its webhook / on-change / research HTTP calls. - webclaw-fetch now has reqwest as a direct dep. It was already transitively pulled in by webclaw-llm; this just makes the dependency relationship explicit at the call site. ## Tests 16 new unit tests cover: - CloudError status mapping (401/402/429/5xx) - NotConfigured error includes signup URL - CloudClient::new explicit-key-wins-over-env + empty-string = None - base_url strips trailing slash - Detector matrix (CF challenge / Turnstile / real content with embedded Turnstile / SPA skeleton / real article with script tags) - truncate respects char boundaries (don't slice inside UTF-8) Full workspace test suite still passes (~500 tests). fmt + clippy clean. No behavior change for existing MCP / CLI call sites.
2026-04-22 16:05:44 +02:00
//! Cloud API fallback client for api.webclaw.io.
//!
//! When local fetch hits bot protection or a JS-only SPA, callers can
//! fall back to the hosted API which runs the full antibot / CDP
//! pipeline. This module is the shared home for that flow: previously
//! duplicated between `webclaw-mcp/src/cloud.rs` and
//! `webclaw-cli/src/cloud.rs`.
//!
//! ## Architecture
//!
//! - [`CloudClient`] — thin reqwest wrapper around the api.webclaw.io
//! REST surface. Typed errors for the four HTTP failures callers act
//! on differently (401 / 402 / 429 / other) plus network + parse.
//! - [`is_bot_protected`] / [`needs_js_rendering`] — pure detectors on
//! response bodies. The detection patterns are public (CF / DataDome
//! challenge-page signatures) so these live in OSS without leaking
//! any moat.
//! - [`smart_fetch`] — try-local-then-escalate flow returning an
//! [`ExtractionResult`] or raw cloud JSON. Kept on the original
//! `Result<_, String>` signature so the existing MCP / CLI call
//! sites work unchanged.
//! - [`smart_fetch_html`] — new convenience for the vertical-extractor
//! pattern: just give me antibot-bypassed HTML so I can run my own
//! parser on it. Returns the typed [`CloudError`] so extractors can
//! emit precise "upgrade your plan" / "invalid key" messages.
//!
fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs) Addresses the four follow-ups surfaced by the cloud-key smoke test. trustpilot_reviews — full rewrite for 2025 schema: - Trustpilot moved from single-Organization+aggregateRating to three separate JSON-LD blocks: a site-level Organization (Trustpilot itself), a Dataset with a csvw:Table mainEntity carrying the per-star distribution for the target business, and an aiSummary + aiSummaryReviews block with the AI-generated summary and recent review objects. - Parser now: skips the site-level Org, walks @graph as either array or single object, picks the Dataset whose about.@id references the target domain, parses each csvw:column for rating buckets, computes weighted-average rating + total from the distribution, extracts the aiSummary text, and turns aiSummaryReviews into a clean reviews array with author/country/date/rating/title/text/likes. - OG-title regex fallbacks for business_name, rating_label, and average_rating when the Dataset block is absent. OG-description regex for review_count. - Returned shape: url, domain, business_name, rating_label, average_rating, review_count, rating_distribution (per-star count and percent), ai_summary, recent_reviews, review_count_listed, data_source. - Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 / 226 reviews with full distribution + AI summary + 2 recent reviews. amazon_product — force-cloud-escalation + OG fallback: - Amazon serves Product JSON-LD intermittently even on non-CAPTCHA pages. When local fetch returns HTML without Product JSON-LD and a cloud client is configured, force-escalate to the cloud path which reliably surfaces title + description via its render engine. - New OG meta-tag fallback for title/image/description so the cloud's synthesize_html output (OG tags only, no #productTitle DOM ID) still yields useful data. Real Amazon pages still prefer the DOM regex. 
- Verified live: B0BSHF7WHW escalates to cloud, returns Apple MacBook Pro title + description + asin. etsy_listing — slug humanization + generic-page filtering + shop from brand: - Etsy serves various placeholder pages when a listing is delisted, blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...", "This item is unavailable - Etsy", plus the OG description "Sorry, the page you were looking for was not found." is_generic_* helpers catch all three shapes. - When the OG title is generic, humanise the URL slug: the path `/listing/123456789/personalized-stainless-steel-tumbler` becomes `Personalized Stainless Steel Tumbler` so callers always get a meaningful title even on dead listings. - Etsy uses `brand` (top-level JSON-LD field) for the shop name on listings that don't ship offers[].seller.name. Shop now falls through offers -> brand so either schema resolves. - Verified live: listing/1097462299 returns full rich data (title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating / 225 reviews, InStock). cloud.rs — module doc update: - Added an architecture section documenting that api.webclaw.io does not return raw HTML by design and that [`synthesize_html`] reassembles the parsed response (metadata + structured_data + markdown) back into minimal HTML so existing local parsers run unchanged across both paths. Also notes the DOM-regex limitation for extractors that need live-page-specific DOM IDs. Tests: 215 passing in webclaw-fetch (18 new), clippy clean. Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY: 28/28 clean, 0 partial, 0 failed.
2026-04-22 17:49:50 +02:00
//! ## Cloud response shape and [`synthesize_html`]
//!
//! `api.webclaw.io/v1/scrape` deliberately does **not** return an
//! `html` field, even when `formats=["html"]` is requested. By design
//! the cloud API returns a parsed bundle:
//!
//! ```text
//! {
//!   "url": "https://...",
//!   "metadata": { title, description, image, site_name, ... },  // OG / meta tags
//!   "structured_data": [ { "@type": "...", ... }, ... ],        // JSON-LD blocks
//!   "markdown": "# Page Title\n\n...",                          // cleaned markdown
//!   "antibot": { engine, path, user_agent },                    // bypass telemetry
//!   "cache": { status, age_seconds }
//! }
//! ```
//!
//! [`CloudClient::fetch_html`] reassembles that bundle back into a
//! minimal synthetic HTML document so the existing local extractor
//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
//! cloud output. Each `structured_data` entry becomes a
//! `<script type="application/ld+json">` tag; the `title`,
//! `description`, `image`, and `site_name` metadata fields become
//! `<meta property="og:...">` tags; `markdown` lands in a `<pre>`
//! inside the body. Callers that walk Schema.org blocks see the same
//! JSON-LD shapes they'd see on a live page.
//!
//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
//! won't hit on the synthesised HTML — those IDs only exist on live
//! Amazon pages. Extractors that need DOM regex keep OG meta tag
//! fallbacks for that reason.
//!
//! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
//! signup when a site is blocked; nothing fails silently. Cloud users
//! get the escalation for free.
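The reassembly described in the module docs above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `Bundle`, `escape`, and `synthesize` are hypothetical names, only the `title` metadata field is shown, and the real client works on a `serde_json::Value` rather than a struct.

```rust
/// Hypothetical stand-in for the parsed cloud bundle.
pub struct Bundle<'a> {
    pub title: &'a str,
    pub json_ld: Vec<&'a str>,
    pub markdown: &'a str,
}

/// Escape characters that would break out of an attribute or text node.
/// Illustrative only; not the crate's own escaping helper.
pub fn escape(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Build a minimal HTML document: OG meta tag, JSON-LD scripts, markdown in <pre>.
pub fn synthesize(b: &Bundle<'_>) -> String {
    let mut out = String::from("<html><head>\n");
    if !b.title.is_empty() {
        out.push_str(&format!(
            "<meta property=\"og:title\" content=\"{}\">\n",
            escape(b.title)
        ));
    }
    // Each JSON-LD block goes back into its own script tag, unmodified.
    for block in &b.json_ld {
        out.push_str(&format!(
            "<script type=\"application/ld+json\">{block}</script>\n"
        ));
    }
    out.push_str("</head><body>\n<pre>");
    out.push_str(&escape(b.markdown));
    out.push_str("</pre>\n</body></html>\n");
    out
}
```

A parser that only looks for `og:title` or `application/ld+json` blocks can then treat this output like a fetched page, which is the point of the design: one parsing path for both local and cloud results.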
use std::time::Duration;
use http::HeaderMap;
use serde_json::{Value, json};
use thiserror::Error;
use tracing::{debug, info, warn};
feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now take `client: &dyn Fetcher` instead of `client: &FetchClient` directly. Backwards-compatible: FetchClient implements Fetcher, blanket impls cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server callers keep working unchanged. Motivation: the production API server (api.webclaw.io) must not do in-process TLS fingerprinting; it delegates all HTTP to the Go tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on production would have required importing wreq into the server's dep graph, violating the CLAUDE.md rule. Now production can provide its own TlsSidecarFetcher implementation and pass it to the same dispatcher the OSS server uses. Changes: - New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus blanket impls for `&T` and `Arc<T>`. - `FetchClient` gains a tiny impl block in client.rs that forwards to its existing public methods. - All 28 extractor signatures migrated from `&FetchClient` to `&dyn Fetcher` (sed-driven bulk rewrite, no semantic change). - `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`. - `extractors::dispatch_by_url` and `extractors::dispatch_by_name` take `&dyn Fetcher`. - `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has native async-fn-in-trait but dyn dispatch still needs async_trait). - Version bumped to 0.5.1, CHANGELOG updated. Tests: 215 passing in webclaw-fetch (no new tests needed — the existing extractor tests exercise the trait methods transparently). Clippy: clean workspace-wide.
2026-04-22 21:17:50 +02:00
// Client type isn't needed here anymore now that smart_fetch* takes
// `&dyn Fetcher`. Kept as a comment for historical context: this
// module used to import FetchClient directly before v0.5.1.
// ---------------------------------------------------------------------------
// URLs + defaults — keep in one place so "change the signup link" is a
// single-commit edit.
// ---------------------------------------------------------------------------
const API_BASE_DEFAULT: &str = "https://api.webclaw.io/v1";
const DEFAULT_TIMEOUT_SECS: u64 = 120;
const SIGNUP_URL: &str = "https://webclaw.io/signup";
const PRICING_URL: &str = "https://webclaw.io/pricing";
const KEYS_URL: &str = "https://webclaw.io/dashboard/api-keys";
// ---------------------------------------------------------------------------
// Errors
// ---------------------------------------------------------------------------
/// Structured cloud-fallback error. Variants correspond to the HTTP
/// outcomes callers act on differently — a 401 needs a different UX
/// than a 402 which needs a different UX than a network blip.
///
/// Display messages end with an actionable URL so API consumers can
/// surface them to users verbatim.
#[derive(Debug, Error)]
pub enum CloudError {
/// No `WEBCLAW_API_KEY` configured. Returned by [`smart_fetch_html`]
/// and friends when they hit bot protection but have no client to
/// escalate to.
#[error(
"this site is behind antibot protection. \
Set WEBCLAW_API_KEY to unlock automatic cloud bypass. \
Free tier: {SIGNUP_URL}"
)]
NotConfigured,
/// HTTP 401 — the key is present but rejected.
#[error(
"WEBCLAW_API_KEY rejected (HTTP 401). \
Check or regenerate your key at {KEYS_URL}"
)]
Unauthorized,
/// HTTP 402 — the key is valid but the plan doesn't cover the call.
#[error(
"your plan doesn't include this endpoint / site (HTTP 402). \
Upgrade at {PRICING_URL}"
)]
InsufficientPlan,
/// HTTP 429 — rate limit.
#[error(
"cloud API rate limit reached (HTTP 429). \
Wait a moment or upgrade at {PRICING_URL}"
)]
RateLimited,
/// HTTP 4xx / 5xx the caller probably can't do anything specific
/// about. Body is truncated to a sensible length for logs.
#[error("cloud API returned HTTP {status}: {body}")]
ServerError { status: u16, body: String },
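/// The request itself failed; any `reqwest::Error` converts into
/// this variant.
#[error("cloud request failed: {0}")]
Network(String),
/// The response arrived but its body could not be parsed.
#[error("cloud response parse failed: {0}")]
ParseFailed(String),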
}
impl CloudError {
/// Build from a non-success HTTP response, routing well-known
/// statuses to dedicated variants.
fn from_status_and_body(status: u16, body: String) -> Self {
match status {
401 => Self::Unauthorized,
402 => Self::InsufficientPlan,
429 => Self::RateLimited,
_ => Self::ServerError {
status,
body: truncate(&body, 500).to_string(),
},
}
}
}
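The status routing above can be mirrored in a std-only sketch to show how a caller might branch on the variants. `SketchError`, `from_status`, and `worth_retrying` are hypothetical names, and the retry policy is an assumption for illustration, not something this module prescribes.

```rust
/// Simplified, std-only mirror of the typed error (hypothetical name).
#[derive(Debug, PartialEq)]
pub enum SketchError {
    Unauthorized,
    InsufficientPlan,
    RateLimited,
    ServerError { status: u16, body: String },
}

/// Route well-known statuses to dedicated variants; everything else
/// keeps its status and a body capped at 500 characters.
pub fn from_status(status: u16, body: &str) -> SketchError {
    match status {
        401 => SketchError::Unauthorized,
        402 => SketchError::InsufficientPlan,
        429 => SketchError::RateLimited,
        _ => SketchError::ServerError {
            status,
            body: body.chars().take(500).collect(),
        },
    }
}

/// Assumed caller-side policy: rate limits and 5xx responses may clear
/// up on their own, while key and plan problems need user action.
pub fn worth_retrying(e: &SketchError) -> bool {
    matches!(
        e,
        SketchError::RateLimited | SketchError::ServerError { status: 500..=599, .. }
    )
}
```

Keeping 401, 402, and 429 as separate variants is what makes a policy like `worth_retrying` a one-line pattern match instead of string inspection.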
impl From<reqwest::Error> for CloudError {
fn from(e: reqwest::Error) -> Self {
Self::Network(e.to_string())
}
}
/// Backwards-compatibility bridge: many pre-existing MCP / CLI call
/// sites use `.await?` inside functions returning `Result<_, String>`.
/// Having this `From` impl means those sites keep compiling while we
/// migrate them to the typed error over time.
impl From<CloudError> for String {
fn from(e: CloudError) -> Self {
e.to_string()
}
}
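/// Return at most the first `max` characters of `text`. `max` counts
/// chars, not bytes, so the cut never lands inside a multi-byte UTF-8
/// sequence.
fn truncate(text: &str, max: usize) -> &str {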
match text.char_indices().nth(max) {
Some((byte_pos, _)) => &text[..byte_pos],
None => text,
}
}
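A standalone copy of the helper above shows why it walks `char_indices` rather than slicing at a byte offset. The function body is the same as in the module; only the sample strings are made up for illustration.

```rust
/// Return at most the first `max` characters of `text`, cutting only
/// on a char boundary.
pub fn truncate(text: &str, max: usize) -> &str {
    match text.char_indices().nth(max) {
        // `byte_pos` is the byte offset where character number `max`
        // starts, so slicing up to it keeps exactly `max` characters.
        Some((byte_pos, _)) => &text[..byte_pos],
        // Fewer than `max + 1` characters: nothing to cut.
        None => text,
    }
}

/// Sample inputs: 'é' is two bytes in UTF-8, so a naive `&s[..2]` on
/// "héllo" would panic because byte 2 is in the middle of that character.
pub fn demo() -> (&'static str, &'static str, &'static str) {
    (
        truncate("héllo", 2),
        truncate("日本語", 1),
        truncate("abc", 10),
    )
}
```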
// ---------------------------------------------------------------------------
// CloudClient
// ---------------------------------------------------------------------------
/// Thin reqwest client around api.webclaw.io. Cloneable cheaply — the
/// inner `reqwest::Client` already refcounts its connection pool.
#[derive(Clone)]
pub struct CloudClient {
api_key: String,
base_url: String,
http: reqwest::Client,
}
impl CloudClient {
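/// Build from an explicit key (e.g. a `--api-key` CLI flag) or fall
/// back to the `WEBCLAW_API_KEY` env var. An explicit key always
/// wins: the env var is consulted only when `explicit_key` is `None`.
/// Returns `None` when the chosen key is empty or whitespace-only.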
///
/// This is the function call sites should use by default — it's
/// what both the CLI and MCP want.
pub fn new(explicit_key: Option<&str>) -> Option<Self> {
explicit_key
.map(String::from)
.or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
.filter(|k| !k.trim().is_empty())
.map(Self::with_key)
}
/// Build from `WEBCLAW_API_KEY` env only. Thin wrapper kept for
/// readability at call sites that never accept a flag.
pub fn from_env() -> Option<Self> {
Self::new(None)
}
/// Build with an explicit key. Useful when the caller already has
/// a key from somewhere other than env or a flag (e.g. loaded from
/// config).
pub fn with_key(api_key: impl Into<String>) -> Self {
Self::with_key_and_base(api_key, API_BASE_DEFAULT)
}
/// Build with an explicit key and base URL. Used by integration
/// tests and staging deployments.
pub fn with_key_and_base(api_key: impl Into<String>, base_url: impl Into<String>) -> Self {
let http = reqwest::Client::builder()
.timeout(Duration::from_secs(DEFAULT_TIMEOUT_SECS))
.build()
.expect("reqwest client builder failed with default settings");
Self {
api_key: api_key.into(),
base_url: base_url.into().trim_end_matches('/').to_string(),
http,
}
}
/// Base URL this client talks to, with any trailing slash stripped.
pub fn base_url(&self) -> &str {
&self.base_url
}
/// Generic POST. Endpoint may be `"scrape"` or `"/scrape"` — we
/// normalise the slash.
pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, CloudError> {
let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
let resp = self
.http
.post(&url)
.header("Authorization", format!("Bearer {}", self.api_key))
.json(&body)
.send()
.await?;
parse_cloud_response(resp).await
}
/// Generic GET.
pub async fn get(&self, endpoint: &str) -> Result<Value, CloudError> {
let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
let resp = self
.http
.get(&url)
.header("Authorization", format!("Bearer {}", self.api_key))
.send()
.await?;
parse_cloud_response(resp).await
}
/// `POST /v1/scrape` with the caller's extraction options. This is
/// the public "do everything" surface: the cloud side handles
/// fetch + antibot + JS render + extraction + formatting.
pub async fn scrape(
&self,
url: &str,
formats: &[&str],
include_selectors: &[String],
exclude_selectors: &[String],
only_main_content: bool,
) -> Result<Value, CloudError> {
let mut body = json!({ "url": url, "formats": formats });
if only_main_content {
body["only_main_content"] = json!(true);
}
if !include_selectors.is_empty() {
body["include_selectors"] = json!(include_selectors);
}
if !exclude_selectors.is_empty() {
body["exclude_selectors"] = json!(exclude_selectors);
}
self.post("scrape", body).await
}
fix(cloud): synthesize HTML from cloud response instead of requesting raw html api.webclaw.io/v1/scrape does not return a `html` field even when `formats=["html"]` is requested, by design: the cloud API returns pre-parsed `structured_data` (JSON-LD blocks), `metadata` (OG tags, title, description, image, site_name), and `markdown`. Our CloudClient::fetch_html helper was premised on the API returning raw HTML. Without a key set, the error message was hidden behind CloudError::NotConfigured so the bug never surfaced. With a key set, every extractor that escalated to cloud (trustpilot_reviews, etsy_listing, amazon_product, ebay_listing, substack_post HTML fallback) got back "cloud /v1/scrape returned no html field". Fix: reassemble a minimal synthetic HTML document from the cloud's parsed output. Each JSON-LD block goes back into a `<script type="application/ld+json">` tag, metadata fields become OG `<meta>` tags, and the markdown body lands in a `<pre>` tag. Existing local extractor parsers (find_product_jsonld, find_business, og() regex) see the same shapes they'd see from a real page, so no per-extractor changes needed. Verified end-to-end with WEBCLAW_CLOUD_API_KEY set: - trustpilot_reviews: escalates, returns Organization JSON-LD data (parser picks Trustpilot site-level Org not the reviewed business; tracked as a follow-up to update Trustpilot schema handling) - etsy_listing: escalates via antibot render path; listing-specific data depends on target listing having JSON-LD (many Etsy listings don't) - amazon_product, ebay_listing: stay local because their pages ship enough content not to trigger bot-detection escalation - The other 24 extractors unchanged (local path, zero cloud credits) Tests: 200 passing in webclaw-fetch (3 new), clippy clean.
2026-04-22 17:24:50 +02:00
/// Get antibot-bypassed page data back as a synthetic HTML string.
///
/// `api.webclaw.io/v1/scrape` intentionally does not return raw
/// HTML: it returns pre-parsed `structured_data` (JSON-LD blocks)
/// plus `metadata` (title, description, OG tags, image) plus a
/// `markdown` body. We reassemble those into a minimal HTML doc
/// that looks enough like the real page for our local extractor
/// parsers to run unchanged: each JSON-LD block gets emitted as a
/// `<script type="application/ld+json">` tag, metadata gets
/// emitted as OG `<meta>` tags, and the markdown lands in the
/// body. Extractors that walk JSON-LD (ecommerce_product,
/// trustpilot_reviews, ebay_listing, etsy_listing, amazon_product)
/// see exactly the same shapes they'd see from a live HTML fetch.
pub async fn fetch_html(&self, url: &str) -> Result<String, CloudError> {
let resp = self.scrape(url, &["markdown"], &[], &[], false).await?;
Ok(synthesize_html(&resp))
}
}
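Two small rules inside `CloudClient` are easy to misread: how the key is chosen in `new`, and how `post` / `get` join the base URL and endpoint. The sketch below restates both with the standard library only. `resolve_key` and `join_url` are hypothetical helper names, and the env value is passed in as a parameter here instead of being read with `std::env::var`.

```rust
/// Key resolution as in `CloudClient::new`: an explicit key wins, the
/// env value is used only when no explicit key was given, and an empty
/// or whitespace-only result counts as "not configured".
pub fn resolve_key(explicit: Option<&str>, env: Option<&str>) -> Option<String> {
    explicit
        .map(String::from)
        .or_else(|| env.map(String::from))
        .filter(|k| !k.trim().is_empty())
}

/// URL joining as in `post` / `get`: the base loses trailing slashes,
/// the endpoint loses leading slashes, and exactly one slash goes between.
pub fn join_url(base: &str, endpoint: &str) -> String {
    format!(
        "{}/{}",
        base.trim_end_matches('/'),
        endpoint.trim_start_matches('/')
    )
}
```

One consequence worth noting: because the emptiness filter runs after the explicit key has already been chosen, an explicit empty string yields `None` even when the env value is set.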
/// Reassemble a minimal HTML document from a cloud `/v1/scrape`
/// response so existing HTML-based extractor parsers can run against
/// cloud output without a separate code path.
fn synthesize_html(resp: &Value) -> String {
let mut out = String::with_capacity(8_192);
out.push_str("<html><head>\n");
// Metadata → OG meta tags. Keep keys stable with what local
// extractors read: og:title, og:description, og:image, og:site_name.
if let Some(meta) = resp.get("metadata").and_then(|m| m.as_object()) {
for (src_key, og_key) in [
("title", "title"),
("description", "description"),
("image", "image"),
("site_name", "site_name"),
] {
if let Some(val) = meta.get(src_key).and_then(|v| v.as_str())
&& !val.is_empty()
{
out.push_str(&format!(
"<meta property=\"og:{og_key}\" content=\"{}\">\n",
html_escape_attr(val)
));
}
}
}
// Structured data blocks → <script type="application/ld+json">.
// Serialise losslessly so extract_json_ld's parser gets the same
// shape it would get from a real page.
if let Some(blocks) = resp.get("structured_data").and_then(|v| v.as_array()) {
for block in blocks {
if let Ok(s) = serde_json::to_string(block) {
out.push_str("<script type=\"application/ld+json\">");
out.push_str(&s);
out.push_str("</script>\n");
}
}
}
out.push_str("</head><body>\n");
// Markdown body → escaped text in a <pre> inside <body>. Extractors
// that regex over <div> IDs won't match here, but they won't hit on
// local cloud bypass either. OK to keep minimal.
if let Some(md) = resp.get("markdown").and_then(|v| v.as_str()) {
out.push_str("<pre>");
out.push_str(&html_escape_text(md));
out.push_str("</pre>\n");
}
out.push_str("</body></html>");
out
}
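To make the output shape concrete, here is a minimal std-only sketch of the same reassembly. It is not the function above: `CloudParts`, `escape_html`, and `synthesize` are hypothetical names, and plain strings stand in for the `serde_json::Value` response so the example runs without external crates.

```rust
/// Hypothetical stand-in for the parsed cloud response (plain strings
/// instead of serde_json::Value, so this compiles with std alone).
struct CloudParts<'a> {
    title: Option<&'a str>,
    json_ld: Vec<&'a str>,
    markdown: Option<&'a str>,
}

/// Single escape helper for the sketch; `&` is replaced first.
fn escape_html(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('"', "&quot;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

fn synthesize(parts: &CloudParts) -> String {
    let mut out = String::from("<html><head>\n");
    // Metadata becomes an OG <meta> tag; empty values are skipped.
    if let Some(t) = parts.title.filter(|t| !t.is_empty()) {
        out.push_str(&format!(
            "<meta property=\"og:title\" content=\"{}\">\n",
            escape_html(t)
        ));
    }
    // Each JSON-LD block goes back into its own <script> tag, unescaped.
    for block in &parts.json_ld {
        out.push_str("<script type=\"application/ld+json\">");
        out.push_str(block);
        out.push_str("</script>\n");
    }
    out.push_str("</head><body>\n");
    // The markdown body is escaped and wrapped in <pre>.
    if let Some(md) = parts.markdown {
        out.push_str("<pre>");
        out.push_str(&escape_html(md));
        out.push_str("</pre>\n");
    }
    out.push_str("</body></html>");
    out
}

fn main() {
    let html = synthesize(&CloudParts {
        title: Some(r#"Acme "Pro" Widget"#),
        json_ld: vec![r#"{"@type":"Product","name":"Widget"}"#],
        markdown: Some("# Widget\n1 < 2"),
    });
    println!("{html}");
}
```

One thing the sketch does not handle: a JSON-LD string value containing `</script>` would end the tag early, because the block is written into the document without escaping.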
fn html_escape_attr(s: &str) -> String {
s.replace('&', "&amp;")
.replace('"', "&quot;")
.replace('<', "&lt;")
.replace('>', "&gt;")
}
fn html_escape_text(s: &str) -> String {
s.replace('&', "&amp;")
.replace('<', "&lt;")
.replace('>', "&gt;")
}
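Both helpers replace `&` before anything else, and the order matters. The standalone copy below (same bodies as above, std only) shows the expected output for a mixed string and what happens if `&` is escaped last instead.

```rust
/// Escape text for use inside a double-quoted HTML attribute value.
fn html_escape_attr(s: &str) -> String {
    // `&` goes first so the entities produced below are not re-escaped.
    s.replace('&', "&amp;")
        .replace('"', "&quot;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

/// Escape text for use as element content; quotes are left alone.
fn html_escape_text(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

fn main() {
    let raw = r#"Tom & "Jerry" <b>"#;
    assert_eq!(
        html_escape_attr(raw),
        "Tom &amp; &quot;Jerry&quot; &lt;b&gt;"
    );
    assert_eq!(html_escape_text(raw), "Tom &amp; \"Jerry\" &lt;b&gt;");

    // Escaping `&` last double-escapes the entity that `<` just became.
    let wrong_order = raw.replace('<', "&lt;").replace('&', "&amp;");
    assert!(wrong_order.contains("&amp;lt;"));
}
```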
async fn parse_cloud_response(resp: reqwest::Response) -> Result<Value, CloudError> {
let status = resp.status();
if status.is_success() {
return resp
.json()
.await
.map_err(|e| CloudError::ParseFailed(e.to_string()));
}
let body = resp.text().await.unwrap_or_default();
Err(CloudError::from_status_and_body(status.as_u16(), body))
}
// ---------------------------------------------------------------------------
// Detection
// ---------------------------------------------------------------------------
/// True when a fetched response body is actually a bot-protection
/// challenge page rather than the content the caller asked for.
///
/// Conservative — only fires on patterns that indicate the *entire*
/// page is a challenge, not embedded CAPTCHAs on a real content page.
pub fn is_bot_protected(html: &str, headers: &HeaderMap) -> bool {
let html_lower = html.to_lowercase();
// Cloudflare challenge page.
if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
return true;
}
// Cloudflare "Just a moment" / "Checking your browser" interstitial.
if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
&& html_lower.contains("cf-spinner")
{
return true;
}
// Cloudflare Turnstile. Only counts when the page is small —
// legitimate pages embed Turnstile for signup forms etc.
if (html_lower.contains("cf-turnstile")
|| html_lower.contains("challenges.cloudflare.com/turnstile"))
&& html.len() < 100_000
{
return true;
}
// DataDome.
if html_lower.contains("geo.captcha-delivery.com")
|| html_lower.contains("captcha-delivery.com/captcha")
{
return true;
}
// AWS WAF.
if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
return true;
}
fix(extractors): detect AWS WAF verifying-connection page, add OG fallback to ecommerce_product Two targeted fixes surfaced by the manual extractor smoke test. cloud::is_bot_protected: - Trustpilot serves a ~565-byte AWS WAF interstitial with the string "Verifying your connection..." and an `interstitial-spinner` div. That pattern was not in our detector, so local fetch returned the challenge page, JSON-LD parsing found nothing, and the extractor emitted a confusing "no Organization/LocalBusiness JSON-LD" error. - Added the pattern plus a <10KB size gate so real articles that happen to mention the phrase aren't misclassified. Two new tests cover positive + negative cases. - With the fix, trustpilot_reviews now correctly escalates via smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY" actionable error without a key, or cloud-bypassed HTML with one. ecommerce_product: - Previously hard-failed when a page had no Product JSON-LD, and produced an empty `offers` list when JSON-LD was present but its `offers` node was. Many sites (Patagonia-style catalog pages, smaller Squarespace stores) ship one or the other of OG / JSON-LD but not both with price data. - Added OG meta-tag fallback that handles: * no JSON-LD at all -> build minimal payload from og:title, og:image, og:description, product:price:amount, product:price:currency, product:availability, product:brand * JSON-LD present but offers empty -> augment with an OG-derived offer so price comes through - New `data_source` field: "jsonld", "jsonld+og", or "og_fallback" so callers can tell which branch populated the data. - `has_og_product_signal()` requires og:type=product or a price tag so blog posts don't get mis-classified as products. Tests: 197 passing in webclaw-fetch (6 new), clippy clean.
2026-04-22 17:07:31 +02:00
// AWS WAF "Verifying your connection" interstitial (used by Trustpilot).
// Distinct from the captcha-branded path above: the challenge page is
// a tiny HTML shell with an `interstitial-spinner` div and no content.
// Gating on html.len() keeps false-positives off long pages that
// happen to mention the phrase in an unrelated context.
if html_lower.contains("interstitial-spinner")
&& html_lower.contains("verifying your connection")
&& html.len() < 10_000
{
return true;
}
// hCaptcha *blocking* page (not just an embedded widget).
if html_lower.contains("hcaptcha.com")
&& html_lower.contains("h-captcha")
&& html.len() < 50_000
{
return true;
}
// Cloudflare via response headers + challenge body.
let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some();
if has_cf_headers
&& (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
{
return true;
}
false
}
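Several branches above pair a marker with a size gate so that real pages embedding a widget are not flagged. As a rough illustration, the Turnstile branch can be reduced to a std-only function. `is_turnstile_challenge` is a hypothetical name, and the header check from the real detector is left out.

```rust
/// Hypothetical reduction of the Turnstile branch: the marker only
/// counts as a challenge when the whole page is under 100_000 bytes.
fn is_turnstile_challenge(html: &str) -> bool {
    let lower = html.to_lowercase();
    let has_marker = lower.contains("cf-turnstile")
        || lower.contains("challenges.cloudflare.com/turnstile");
    has_marker && html.len() < 100_000
}

fn main() {
    // A bare widget on a tiny page is treated as a blocking challenge.
    let short = "<div class=\"cf-turnstile\"></div>";
    assert!(is_turnstile_challenge(short));

    // The same widget embedded in ~168 KB of content is not.
    let long = format!(
        "<html><body>{}<div class=\"cf-turnstile\"></div></body></html>",
        "lots of real content ".repeat(8_000)
    );
    assert!(!is_turnstile_challenge(&long));

    // No marker at all: never a challenge, regardless of size.
    assert!(!is_turnstile_challenge("<p>hello</p>"));
}
```

Note that the gate compares `html.len()`, which is a byte count, so multi-byte text reaches the threshold sooner than its character count suggests.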
/// True when a page likely needs JS rendering: either a script-bearing
/// page over 5 KB with almost no extractable text, or a page over 50 KB
/// with under 800 words and an SPA framework marker.
pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
let has_scripts = html.contains("<script");
// Tier 1: almost no extractable text from a large-ish page.
if word_count < 50 && html.len() > 5_000 && has_scripts {
return true;
}
// Tier 2: SPA framework markers + low content-to-HTML ratio.
if word_count < 800 && html.len() > 50_000 && has_scripts {
let html_lower = html.to_lowercase();
let has_spa_marker = html_lower.contains("react-app")
|| html_lower.contains("id=\"__next\"")
|| html_lower.contains("id=\"root\"")
|| html_lower.contains("id=\"app\"")
|| html_lower.contains("__next_data__")
|| html_lower.contains("nuxt")
|| html_lower.contains("ng-app");
if has_spa_marker {
return true;
}
}
false
}
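The function only uses `std`, so the two tiers are easy to exercise in isolation. The copy below has the same body as above; the inputs are made up for illustration.

```rust
/// Standalone copy of the two-tier heuristic.
fn needs_js_rendering(word_count: usize, html: &str) -> bool {
    let has_scripts = html.contains("<script");
    // Tier 1: almost no extractable text from a large-ish page.
    if word_count < 50 && html.len() > 5_000 && has_scripts {
        return true;
    }
    // Tier 2: SPA framework markers + low content-to-HTML ratio.
    if word_count < 800 && html.len() > 50_000 && has_scripts {
        let html_lower = html.to_lowercase();
        return html_lower.contains("react-app")
            || html_lower.contains("id=\"__next\"")
            || html_lower.contains("id=\"root\"")
            || html_lower.contains("id=\"app\"")
            || html_lower.contains("__next_data__")
            || html_lower.contains("nuxt")
            || html_lower.contains("ng-app");
    }
    false
}

fn main() {
    // Tier 1: ~6 KB shell with a script tag and three words of text.
    let shell = format!("<script src=\"app.js\"></script>{}", " ".repeat(6_000));
    assert!(needs_js_rendering(3, &shell));

    // Tier 2: ~60 KB page, 400 words, Next.js root element.
    let spa = format!("<div id=\"__next\"></div><script></script>{}", "x".repeat(60_000));
    assert!(needs_js_rendering(400, &spa));

    // Same size and word count but no SPA marker: stays local.
    let plain = format!("<script></script>{}", "x".repeat(60_000));
    assert!(!needs_js_rendering(400, &plain));

    // A real article with plenty of words never triggers either tier.
    let article = format!("<article>{}</article><script></script>", "word ".repeat(20_000));
    assert!(!needs_js_rendering(20_000, &article));
}
```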
// ---------------------------------------------------------------------------
// Smart-fetch: classic flow for MCP / CLI (returns either an extraction
// or raw cloud JSON)
// ---------------------------------------------------------------------------
/// Result of [`smart_fetch`]: either a local extraction or the raw
/// cloud API response when we escalated.
pub enum SmartFetchResult {
Local(Box<webclaw_core::ExtractionResult>),
Cloud(Value),
}
/// Try local fetch + extract first. On bot protection or detected
/// JS-render, fall back to `cloud.scrape(...)` with the caller's
/// formats. Returns `Err(String)` so existing call sites that expect
/// stringified errors keep compiling.
///
/// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed
/// [`CloudError`] so you can render precise UX.
pub async fn smart_fetch(
feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now take `client: &dyn Fetcher` instead of `client: &FetchClient` directly. Backwards-compatible: FetchClient implements Fetcher, blanket impls cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server callers keep working unchanged. Motivation: the production API server (api.webclaw.io) must not do in-process TLS fingerprinting; it delegates all HTTP to the Go tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on production would have required importing wreq into the server's dep graph, violating the CLAUDE.md rule. Now production can provide its own TlsSidecarFetcher implementation and pass it to the same dispatcher the OSS server uses. Changes: - New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus blanket impls for `&T` and `Arc<T>`. - `FetchClient` gains a tiny impl block in client.rs that forwards to its existing public methods. - All 28 extractor signatures migrated from `&FetchClient` to `&dyn Fetcher` (sed-driven bulk rewrite, no semantic change). - `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`. - `extractors::dispatch_by_url` and `extractors::dispatch_by_name` take `&dyn Fetcher`. - `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has native async-fn-in-trait but dyn dispatch still needs async_trait). - Version bumped to 0.5.1, CHANGELOG updated. Tests: 215 passing in webclaw-fetch (no new tests needed — the existing extractor tests exercise the trait methods transparently). Clippy: clean workspace-wide.
2026-04-22 21:17:50 +02:00
client: &dyn crate::fetcher::Fetcher,
cloud: Option<&CloudClient>,
url: &str,
include_selectors: &[String],
exclude_selectors: &[String],
only_main_content: bool,
formats: &[&str],
) -> Result<SmartFetchResult, String> {
let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url))
.await
.map_err(|_| format!("Fetch timed out after 30s for {url}"))?
.map_err(|e| format!("Fetch failed: {e}"))?;
if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
info!(url, "bot protection detected, falling back to cloud API");
return cloud_scrape_fallback(
cloud,
url,
include_selectors,
exclude_selectors,
only_main_content,
formats,
)
.await;
}
let options = webclaw_core::ExtractionOptions {
include_selectors: include_selectors.to_vec(),
exclude_selectors: exclude_selectors.to_vec(),
only_main_content,
include_raw_html: false,
};
let extraction =
webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
.map_err(|e| format!("Extraction failed: {e}"))?;
if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
info!(
url,
word_count = extraction.metadata.word_count,
html_len = fetch_result.html.len(),
"JS-rendered page detected, falling back to cloud API"
);
return cloud_scrape_fallback(
cloud,
url,
include_selectors,
exclude_selectors,
only_main_content,
formats,
)
.await;
}
Ok(SmartFetchResult::Local(Box::new(extraction)))
}
async fn cloud_scrape_fallback(
cloud: Option<&CloudClient>,
url: &str,
include_selectors: &[String],
exclude_selectors: &[String],
only_main_content: bool,
formats: &[&str],
) -> Result<SmartFetchResult, String> {
let Some(c) = cloud else {
return Err(CloudError::NotConfigured.to_string());
};
let resp = c
.scrape(
url,
formats,
include_selectors,
exclude_selectors,
only_main_content,
)
.await
.map_err(|e| e.to_string())?;
info!(url, "cloud API fallback successful");
Ok(SmartFetchResult::Cloud(resp))
}
// ---------------------------------------------------------------------------
// Smart-fetch-HTML: for vertical extractors
// ---------------------------------------------------------------------------
/// Where the HTML ultimately came from — useful for callers that want
/// to track "did we fall back?" for logging or pricing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchSource {
Local,
Cloud,
}
/// Antibot-aware HTML fetch result. The `html` field is always populated.
pub struct FetchedHtml {
pub html: String,
pub final_url: String,
pub source: FetchSource,
}
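/// Try local fetch; on bot protection, escalate to the cloud's
/// `/v1/scrape` and return an HTML document synthesized from its parsed
/// output (JSON-LD blocks, metadata, markdown). The cloud API does not
/// return raw HTML, so the result on the cloud path is a reconstruction.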
///
/// Designed for the vertical-extractor pattern where the caller has
/// its own parser and just needs bytes.
pub async fn smart_fetch_html(
client: &dyn crate::fetcher::Fetcher,
cloud: Option<&CloudClient>,
url: &str,
) -> Result<FetchedHtml, CloudError> {
let resp = client
.fetch(url)
.await
.map_err(|e| CloudError::Network(e.to_string()))?;
if !is_bot_protected(&resp.html, &resp.headers) {
return Ok(FetchedHtml {
html: resp.html,
final_url: resp.url,
source: FetchSource::Local,
});
}
let Some(c) = cloud else {
warn!(url, "bot protection detected + no cloud client configured");
return Err(CloudError::NotConfigured);
};
debug!(url, "bot protection detected, escalating to cloud");
let html = c.fetch_html(url).await?;
Ok(FetchedHtml {
html,
final_url: url.to_string(),
source: FetchSource::Cloud,
})
}
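The escalation logic has three outcomes: local HTML, cloud HTML, or a `NotConfigured` error. The sketch below is a synchronous, std-only reduction of that decision, not the real function: `Source`, `FetchError`, `fetch_with_escalation`, and the closure standing in for the cloud client are all hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum Source {
    Local,
    Cloud,
}

#[derive(Debug, PartialEq)]
enum FetchError {
    NotConfigured,
}

/// Hypothetical stand-in for the bot-protection detector.
fn is_challenge(html: &str) -> bool {
    html.contains("_cf_chl_opt")
}

/// Return the local HTML unless it is a challenge page; then use the
/// cloud fetcher if one is configured, otherwise report NotConfigured.
fn fetch_with_escalation(
    local_html: String,
    cloud: Option<&dyn Fn() -> String>,
) -> Result<(String, Source), FetchError> {
    if !is_challenge(&local_html) {
        return Ok((local_html, Source::Local));
    }
    match cloud {
        Some(fetch) => Ok((fetch(), Source::Cloud)),
        None => Err(FetchError::NotConfigured),
    }
}

fn main() {
    let cloud: &dyn Fn() -> String = &|| String::from("<html>synthesized</html>");

    // Normal page: served locally, cloud never called.
    let ok = fetch_with_escalation("<p>article</p>".into(), Some(cloud));
    assert_eq!(ok, Ok(("<p>article</p>".to_string(), Source::Local)));

    // Challenge page with a cloud client: escalates.
    let esc = fetch_with_escalation("_cf_chl_opt".into(), Some(cloud));
    assert_eq!(esc, Ok(("<html>synthesized</html>".to_string(), Source::Cloud)));

    // Challenge page without a cloud client: typed error for the caller.
    let err = fetch_with_escalation("_cf_chl_opt".into(), None);
    assert_eq!(err, Err(FetchError::NotConfigured));
}
```

Returning the source alongside the HTML lets a caller log or meter cloud usage without re-running detection.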
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;

    fn empty_headers() -> HeaderMap {
        HeaderMap::new()
    }

    // --- detectors ----------------------------------------------------------

    #[test]
    fn is_bot_protected_detects_cloudflare_challenge() {
        let html = "<html><body>_cf_chl_opt loaded</body></html>";
        assert!(is_bot_protected(html, &empty_headers()));
    }

    #[test]
    fn is_bot_protected_detects_turnstile_on_short_page() {
        let html = "<div class=\"cf-turnstile\"></div>";
        assert!(is_bot_protected(html, &empty_headers()));
    }

    #[test]
    fn is_bot_protected_ignores_turnstile_on_real_content() {
        let html = format!(
            "<html><body>{}<div class=\"cf-turnstile\"></div></body></html>",
            "lots of real content ".repeat(8_000)
        );
        assert!(!is_bot_protected(&html, &empty_headers()));
    }
fix(extractors): detect AWS WAF verifying-connection page, add OG fallback to ecommerce_product

Two targeted fixes surfaced by the manual extractor smoke test.

cloud::is_bot_protected:
- Trustpilot serves a ~565-byte AWS WAF interstitial with the string "Verifying your connection..." and an `interstitial-spinner` div. That pattern was not in our detector, so local fetch returned the challenge page, JSON-LD parsing found nothing, and the extractor emitted a confusing "no Organization/LocalBusiness JSON-LD" error.
- Added the pattern plus a <10KB size gate so real articles that happen to mention the phrase aren't misclassified. Two new tests cover positive + negative cases.
- With the fix, trustpilot_reviews now correctly escalates via smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY" actionable error without a key, or cloud-bypassed HTML with one.

ecommerce_product:
- Previously hard-failed when a page had no Product JSON-LD, and produced an empty `offers` list when JSON-LD was present but its `offers` node was absent. Many sites (Patagonia-style catalog pages, smaller Squarespace stores) ship one or the other of OG / JSON-LD but not both with price data.
- Added OG meta-tag fallback that handles:
  * no JSON-LD at all -> build minimal payload from og:title, og:image, og:description, product:price:amount, product:price:currency, product:availability, product:brand
  * JSON-LD present but offers empty -> augment with an OG-derived offer so price comes through
- New `data_source` field: "jsonld", "jsonld+og", or "og_fallback" so callers can tell which branch populated the data.
- `has_og_product_signal()` requires og:type=product or a price tag so blog posts don't get mis-classified as products.

Tests: 197 passing in webclaw-fetch (6 new), clippy clean.
2026-04-22 17:07:31 +02:00
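The size gate described in the commit message above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual `is_bot_protected` implementation: the function name `looks_like_waf_interstitial`, the constant `SHORT_PAGE_LIMIT`, reading "10KB" as 10 * 1024 bytes, and the exact marker strings checked are all hypothetical choices that mirror the commit description.

```rust
/// Assumed size gate: pages at or above this many bytes are treated as real
/// content. The 10 KB figure comes from the commit message; the constant name
/// and the 1024-byte interpretation are assumptions.
const SHORT_PAGE_LIMIT: usize = 10 * 1024;

/// Hypothetical helper, not part of webclaw-fetch.
/// Flags a page only when it is short AND carries an interstitial marker,
/// so a long article that merely mentions the phrase is left alone.
fn looks_like_waf_interstitial(html: &str) -> bool {
    if html.len() >= SHORT_PAGE_LIMIT {
        return false;
    }
    html.contains("Verifying your connection") || html.contains("interstitial-spinner")
}
```

Checking the length first keeps the common case (a large, real page) cheap and makes the false-positive guard explicit.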

    #[test]
    fn is_bot_protected_detects_aws_waf_verifying_connection() {
        // The exact shape Trustpilot serves under AWS WAF.
        let html = r#"<div class="container"><div id="loading-state">
<div class="interstitial-spinner" id="spinner"></div>
<h1>Verifying your connection...</h1></div></div>"#;
        assert!(is_bot_protected(html, &empty_headers()));
    }
fix(cloud): synthesize HTML from cloud response instead of requesting raw html

api.webclaw.io/v1/scrape does not return a `html` field even when `formats=["html"]` is requested, by design: the cloud API returns pre-parsed `structured_data` (JSON-LD blocks), `metadata` (OG tags, title, description, image, site_name), and `markdown`. Our CloudClient::fetch_html helper was premised on the API returning raw HTML.

Without a key set, the error message was hidden behind CloudError::NotConfigured so the bug never surfaced. With a key set, every extractor that escalated to cloud (trustpilot_reviews, etsy_listing, amazon_product, ebay_listing, substack_post HTML fallback) got back "cloud /v1/scrape returned no html field".

Fix: reassemble a minimal synthetic HTML document from the cloud's parsed output. Each JSON-LD block goes back into a `<script type="application/ld+json">` tag, metadata fields become OG `<meta>` tags, and the markdown body lands in a `<pre>` tag. Existing local extractor parsers (find_product_jsonld, find_business, og() regex) see the same shapes they'd see from a real page, so no per-extractor changes needed.

Verified end-to-end with WEBCLAW_CLOUD_API_KEY set:
- trustpilot_reviews: escalates, returns Organization JSON-LD data (parser picks Trustpilot site-level Org not the reviewed business; tracked as a follow-up to update Trustpilot schema handling)
- etsy_listing: escalates via antibot render path; listing-specific data depends on target listing having JSON-LD (many Etsy listings don't)
- amazon_product, ebay_listing: stay local because their pages ship enough content not to trigger bot-detection escalation
- The other 24 extractors unchanged (local path, zero cloud credits)

Tests: 200 passing in webclaw-fetch (3 new), clippy clean.
2026-04-22 17:24:50 +02:00
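The reassembly step in the commit message above depends on escaping metadata values before they are placed into attributes. A minimal std-only sketch of that idea follows. The helper names `escape_attr`, `og_meta`, and `jsonld_script` are hypothetical and are not taken from `synthesize_html`, whose real internals are not shown here; the `</` handling in `jsonld_script` is likewise an assumption about what a careful implementation might do.

```rust
/// Hypothetical helper: escape a value for use inside a double-quoted
/// HTML attribute.
fn escape_attr(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '"' => out.push_str("&quot;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            other => out.push(other),
        }
    }
    out
}

/// Hypothetical helper: render one OG `<meta>` tag from a metadata field.
fn og_meta(property: &str, content: &str) -> String {
    format!(
        r#"<meta property="{}" content="{}">"#,
        escape_attr(property),
        escape_attr(content)
    )
}

/// Hypothetical helper: wrap an already-serialized JSON-LD string in a script
/// tag. `</` is rewritten to `<\/` (a valid JSON escape for `/`) so the
/// payload cannot close the element early.
fn jsonld_script(json: &str) -> String {
    format!(
        r#"<script type="application/ld+json">{}</script>"#,
        json.replace("</", "<\\/")
    )
}
```

With these helpers, a title containing double quotes produces the same `&quot;` shape that `synthesize_html_escapes_attribute_quotes` asserts on.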

    #[test]
    fn synthesize_html_embeds_jsonld_and_og_tags() {
        let resp = json!({
            "url": "https://example.com/p/1",
            "metadata": {
                "title": "My Product",
                "description": "A nice thing.",
                "image": "https://cdn.example.com/1.jpg",
                "site_name": "Example Shop"
            },
            "structured_data": [
                {"@context":"https://schema.org","@type":"Product",
                 "name":"Widget","offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
            ],
            "markdown": "# Widget\n\nA nice widget."
        });
        let html = synthesize_html(&resp);
        // OG tags from metadata.
        assert!(html.contains(r#"<meta property="og:title" content="My Product">"#));
        assert!(
            html.contains(r#"<meta property="og:image" content="https://cdn.example.com/1.jpg">"#)
        );
        // JSON-LD block preserved losslessly.
        assert!(html.contains(r#"<script type="application/ld+json">"#));
        assert!(html.contains(r#""@type":"Product""#));
        assert!(html.contains(r#""price":"9.99""#));
        // Body carries markdown.
        assert!(html.contains("A nice widget."));
    }

    #[test]
    fn synthesize_html_handles_missing_fields_gracefully() {
        let resp = json!({"url": "https://example.com", "metadata": {}});
        let html = synthesize_html(&resp);
        // No panic, no stray unclosed tags.
        assert!(html.starts_with("<html><head>"));
        assert!(html.ends_with("</body></html>"));
    }

    #[test]
    fn synthesize_html_escapes_attribute_quotes() {
        let resp = json!({
            "metadata": {"title": r#"She said "hi""#}
        });
        let html = synthesize_html(&resp);
        assert!(html.contains(r#"og:title" content="She said &quot;hi&quot;""#));
    }

    #[test]
    fn is_bot_protected_ignores_phrase_on_real_content() {
        // A real article that happens to mention the phrase in prose
        // should not trigger the short-page detector.
        let html = format!(
            "<html><body>{}<p>Verifying your connection is tricky.</p></body></html>",
            "article text ".repeat(2_000)
        );
        assert!(!is_bot_protected(&html, &empty_headers()));
    }

    #[test]
    fn needs_js_rendering_flags_spa_skeleton() {
        let html = format!(
            "<html><body><div id=\"__next\"></div>{}</body></html>",
            "<script>x</script>".repeat(500)
        );
        assert!(needs_js_rendering(10, &html));
    }

    #[test]
    fn needs_js_rendering_passes_real_article() {
        let html = format!(
            "<html><body>{}<script>x</script></body></html>",
            "Real article text ".repeat(5_000)
        );
        assert!(!needs_js_rendering(5_000, &html));
    }

    // --- CloudError mapping -------------------------------------------------

    #[test]
    fn cloud_error_maps_401() {
        let e = CloudError::from_status_and_body(401, "invalid key".into());
        assert!(matches!(e, CloudError::Unauthorized));
        assert!(e.to_string().contains(KEYS_URL));
    }

    #[test]
    fn cloud_error_maps_402() {
        let e = CloudError::from_status_and_body(402, "{}".into());
        assert!(matches!(e, CloudError::InsufficientPlan));
        assert!(e.to_string().contains(PRICING_URL));
    }

    #[test]
    fn cloud_error_maps_429() {
        let e = CloudError::from_status_and_body(429, "slow down".into());
        assert!(matches!(e, CloudError::RateLimited));
        assert!(e.to_string().contains(PRICING_URL));
    }

    #[test]
    fn cloud_error_maps_generic_5xx() {
        let e = CloudError::from_status_and_body(503, "x".repeat(2000));
        match e {
            CloudError::ServerError { status, body } => {
                assert_eq!(status, 503);
                assert!(body.len() <= 500);
            }
            _ => panic!("expected ServerError"),
        }
    }

    #[test]
    fn not_configured_error_points_at_signup() {
        let msg = CloudError::NotConfigured.to_string();
        assert!(msg.contains(SIGNUP_URL));
        assert!(msg.contains("WEBCLAW_API_KEY"));
    }

    // --- CloudClient construction ------------------------------------------

    #[test]
    fn cloud_client_explicit_key_wins_over_env() {
        // SAFETY: this test mutates process env. Serial tests only.
        // Set env to something, pass an explicit key, explicit should win.
        // (We don't actually *call* the API, just check the struct stored
        // the right key.)
        // `std::env::set_var` and `remove_var` are `unsafe fn` as of the
        // Rust 2024 edition, hence the `unsafe` blocks.
        unsafe {
            std::env::set_var("WEBCLAW_API_KEY", "from-env");
        }
        let client = CloudClient::new(Some("from-flag")).expect("client built");
        assert_eq!(client.api_key, "from-flag");
        unsafe {
            std::env::remove_var("WEBCLAW_API_KEY");
        }
    }

    #[test]
    fn cloud_client_none_when_empty() {
        // SAFETY: same caveat as above, this mutates process env.
        unsafe {
            std::env::remove_var("WEBCLAW_API_KEY");
        }
        assert!(CloudClient::new(None).is_none());
        assert!(CloudClient::new(Some("")).is_none());
        assert!(CloudClient::new(Some(" ")).is_none());
    }

    #[test]
    fn cloud_client_base_url_strips_trailing_slash() {
        let c = CloudClient::with_key_and_base("k", "https://api.example.com/v1/");
        assert_eq!(c.base_url(), "https://api.example.com/v1");
    }

    #[test]
    fn truncate_respects_char_boundaries() {
        // Ensure we don't slice inside a multi-byte char.
        let s = "a".repeat(10) + "é"; // é is 2 bytes
        let out = truncate(&s, 11);
        assert_eq!(out.chars().count(), 11);
    }
}
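
The last test above checks that truncation never slices inside a multi-byte character. One way to get that property is sketched below. This is a hedged illustration, not the module's actual `truncate`: the name `truncate_chars` is hypothetical, and it assumes the limit is counted in characters rather than bytes, which is consistent with the test expecting 11 characters back from a 12-byte input.

```rust
/// Hypothetical stand-in for the module's `truncate`: keep at most
/// `max_chars` characters and return a borrowed prefix that always ends on a
/// char boundary. Counting in chars (not bytes) is an assumption.
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        // Byte offset of the first character past the limit; slicing here is
        // safe because `char_indices` only yields char-boundary offsets.
        Some((byte_idx, _)) => &s[..byte_idx],
        // The string has `max_chars` characters or fewer: nothing to cut.
        None => s,
    }
}
```

A naive `&s[..max]` byte slice would panic when `max` lands inside a multi-byte character such as `é`; walking `char_indices` avoids that without allocating.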