feat(extractors): wave 5 - Amazon, eBay, Trustpilot via cloud fallback

Adds three hard-site extractors, all of which need antibot bypass to
return usable data. They ship in OSS so the parsers + schema live
with the rest of the vertical extractors, but the fetch path routes
through cloud::smart_fetch_html, which means:

- With WEBCLAW_CLOUD_API_KEY configured on webclaw-server (or
  WEBCLAW_API_KEY in MCP / CLI), local fetch is tried first; on
  challenge-page detection we escalate to api.webclaw.io/v1/scrape
  with formats=['html'] and parse the antibot-bypassed HTML locally.

- Without a cloud key, callers get a typed CloudError::NotConfigured
  whose Display message points at https://webclaw.io/signup.
  Self-hosters without a webclaw.io account know exactly what to do.
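
The two cases above can be sketched as a small decision function. This is only an illustration under stated assumptions: `fetch_with_fallback`, `looks_like_challenge`, and the error text are hypothetical stand-ins, not the real cloud::smart_fetch_html signature, and the challenge check is reduced to a single marker string.

```rust
use std::fmt;

/// Hypothetical stand-in for the crate's typed cloud error.
#[derive(Debug, PartialEq)]
enum CloudError {
    /// No cloud key is configured, so escalation is impossible.
    NotConfigured,
}

impl fmt::Display for CloudError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Assumed wording; the source only says the message points at the signup URL.
            CloudError::NotConfigured => write!(
                f,
                "cloud fallback not configured: set WEBCLAW_CLOUD_API_KEY (https://webclaw.io/signup)"
            ),
        }
    }
}

/// Assumed heuristic: a page counts as a challenge if it carries a known interstitial marker.
fn looks_like_challenge(html: &str) -> bool {
    html.contains("Verifying Connection")
}

/// Return the local HTML unless it is a challenge page; escalate only in that case.
/// `cloud_fetch` is `None` when no cloud key is configured.
fn fetch_with_fallback<F>(local_html: &str, cloud_fetch: Option<F>) -> Result<String, CloudError>
where
    F: FnOnce() -> String,
{
    if !looks_like_challenge(local_html) {
        return Ok(local_html.to_string());
    }
    match cloud_fetch {
        Some(fetch) => Ok(fetch()),
        None => Err(CloudError::NotConfigured),
    }
}
```

A usable local response never touches the cloud path, so a missing key only surfaces as an error when a challenge page is actually hit.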

## New extractors (all auto-dispatched; each has a unique host)

- amazon_product: ASIN extraction from /dp/, /gp/product/,
  /product/, /exec/obidos/ASIN/ URL shapes across every amazon.*
  locale. Parses the Product JSON-LD Amazon ships for SEO; falls
  back to #productTitle and #landingImage DOM selectors when
  JSON-LD is absent. Returns price, currency, availability,
  condition, brand, image, aggregate rating, SKU / MPN.

- ebay_listing: item-id extraction from /itm/{id} and
  /itm/{slug}/{id} URLs across ebay.com / .co.uk / .de / .fr /
  .it. Parses both bare Offer (Buy It Now) and AggregateOffer
  (used-copies / auctions) from the Product JSON-LD. Returns
  price or low/high-price range, currency, condition, seller,
  offer_count, aggregate rating.

- trustpilot_reviews: reactivated from the `trustpilot_reviews`
  file that was previously dead-code'd. Parser already worked; it
  just needed the smart_fetch_html path to get past AWS WAF's
  'Verifying Connection' interstitial. The Organization / LocalBusiness
  JSON-LD block gives aggregate rating + up to 20 recent reviews.
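
The URL shapes listed for Amazon and eBay can be matched roughly as follows. This is a sketch rather than the shipped parser: the function bodies are illustrative, and the validation rules (a 10-character alphanumeric ASIN, an all-digit eBay item id) are assumptions not stated in this commit.

```rust
/// Path markers after which an ASIN appears, per the URL shapes listed above.
const ASIN_MARKERS: [&str; 4] = ["/dp/", "/gp/product/", "/exec/obidos/ASIN/", "/product/"];

/// Strip the query string and fragment so only scheme, host, and path remain.
fn strip_query(url: &str) -> &str {
    url.split(|c: char| c == '?' || c == '#').next().unwrap_or(url)
}

/// Extract an ASIN from any of the supported URL shapes.
/// Assumption: an ASIN is exactly 10 ASCII alphanumeric characters.
fn parse_asin(url: &str) -> Option<String> {
    let path = strip_query(url);
    for &marker in ASIN_MARKERS.iter() {
        if let Some(idx) = path.find(marker) {
            // The ASIN is the path segment immediately after the marker.
            let candidate = path[idx + marker.len()..].split('/').next().unwrap_or("");
            if candidate.len() == 10 && candidate.chars().all(|c| c.is_ascii_alphanumeric()) {
                return Some(candidate.to_ascii_uppercase());
            }
        }
    }
    None
}

/// Extract the item id from /itm/{id} or /itm/{slug}/{id}.
/// Assumption: item ids consist of ASCII digits only.
fn parse_ebay_item_id(url: &str) -> Option<String> {
    let path = strip_query(url);
    let after = &path[path.find("/itm/")? + "/itm/".len()..];
    let mut segments = after.split('/').filter(|s| !s.is_empty());
    let first = segments.next()?;
    // With a slug present, the id is the second segment; otherwise the first.
    let id = segments.next().unwrap_or(first);
    if id.chars().all(|c| c.is_ascii_digit()) {
        Some(id.to_string())
    } else {
        None
    }
}
```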

## FetchClient change

- Added optional `cloud: Option<Arc<CloudClient>>` field with a
  `FetchClient::with_cloud(cloud) -> Self` builder and a `cloud(&self)`
  accessor. Extractors call client.cloud() to decide whether they
  can escalate. Clones stay cheap because the cloud client is
  Arc-wrapped.
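
The builder-plus-accessor shape described above might look like the sketch below. Both structs are simplified stand-ins for the real types, and the accessor's return type (`Option<&CloudClient>`) is an assumption; only the field type and the `with_cloud` / `cloud` names come from this commit.

```rust
use std::sync::Arc;

/// Simplified stand-in for the real cloud client.
#[derive(Debug)]
struct CloudClient {
    base_url: String,
}

/// Simplified stand-in for FetchClient; every other field is omitted.
#[derive(Clone, Debug, Default)]
struct FetchClient {
    cloud: Option<Arc<CloudClient>>,
}

impl FetchClient {
    /// Builder: attach a cloud client, wrapping it in an Arc so clones share it.
    fn with_cloud(mut self, cloud: CloudClient) -> Self {
        self.cloud = Some(Arc::new(cloud));
        self
    }

    /// Accessor extractors use to decide whether escalation is possible.
    /// The return type is an assumption made for this sketch.
    fn cloud(&self) -> Option<&CloudClient> {
        self.cloud.as_deref()
    }
}
```

Cloning the client copies only the Arc pointer, so every clone sees the same cloud client.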

## webclaw-server wiring

AppState::new() now reads the cloud credential from env:

1. WEBCLAW_CLOUD_API_KEY: preferred, disambiguates from the
   server's own inbound bearer token.
2. WEBCLAW_API_KEY: fallback only when the server is in open
   mode (no inbound-auth key set), matching the MCP / CLI
   convention of that env var.

When a key is present, state.rs builds a CloudClient and attaches it
to the FetchClient via with_cloud(). A startup log line tells
operators when cloud fallback is active.
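
The precedence above can be written as a pure function, which makes it testable without touching process env. This is a sketch: `resolve_cloud_key` and its parameters are hypothetical names, and the real code reads the variables directly via std::env::var.

```rust
/// Keep a key only if it is not blank.
fn non_empty(key: Option<&str>) -> Option<String> {
    key.filter(|k| !k.trim().is_empty()).map(str::to_owned)
}

/// Resolve the outbound cloud key from already-read env values.
///
/// - `cloud_key`: value of WEBCLAW_CLOUD_API_KEY, if set.
/// - `shared_key`: value of WEBCLAW_API_KEY, if set.
/// - `inbound_configured`: whether the server has an inbound bearer token.
fn resolve_cloud_key(
    cloud_key: Option<&str>,
    shared_key: Option<&str>,
    inbound_configured: bool,
) -> Option<String> {
    // 1. The dedicated variable always wins.
    if let Some(k) = non_empty(cloud_key) {
        return Some(k);
    }
    // 2. The shared variable is reused only in open mode.
    if !inbound_configured {
        return non_empty(shared_key);
    }
    None
}
```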

## Catalog + dispatch

All three extractors are registered in list() and in dispatch_by_url.
/v1/extractors catalog now exposes 22 verticals. Explicit
/v1/scrape/{vertical} routes work per the existing pattern.
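
Because each of the three extractors owns a distinct host family, auto-dispatch can key off the host alone. The function below is a hypothetical illustration of that idea, not the real dispatch_by_url: the name `vertical_for_host` and the matching rules are assumptions, and it skips the URL-shape checks a real dispatcher would also need.

```rust
/// Map a host to the vertical that claims it, if any.
/// Assumed rules: amazon.* and ebay.* after an optional "www." prefix,
/// and trustpilot.com including its subdomains.
fn vertical_for_host(host: &str) -> Option<&'static str> {
    let lower = host.to_ascii_lowercase();
    let host = lower.strip_prefix("www.").unwrap_or(lower.as_str());
    if host.starts_with("amazon.") {
        Some("amazon_product")
    } else if host.starts_with("ebay.") {
        Some("ebay_listing")
    } else if host == "trustpilot.com" || host.ends_with(".trustpilot.com") {
        Some("trustpilot_reviews")
    } else {
        None
    }
}
```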

## Tests

- 7 new unit tests (parse_asin multi-shape + parse from JSON-LD
  fixture + DOM-fallback on missing JSON-LD for Amazon; ebay
  URL-matching + slugged-URL parsing + both Offer and AggregateOffer
  fixtures).
- Full extractors suite: 68 passing (was 59, +9 from the new files).
- fmt + clippy clean.
- No live-test story for these three inside CI: verifying them
  means having WEBCLAW_CLOUD_API_KEY set against a real cloud
  backend. Integration-test harness is a separate follow-up.

Catalog summary: 22 verticals total across waves 1-5. The three
hard-site extractors are gated behind an actionable cloud-fallback
upgrade path rather than silently returning nothing or 403-ing the
caller.
Valerio 2026-04-22 16:16:11 +02:00
parent 0ab891bd6b
commit d8c9274a9c
6 changed files with 884 additions and 24 deletions

@@ -1,7 +1,24 @@
 //! Shared application state. Cheap to clone via Arc; held by the axum
 //! Router for the life of the process.
+//!
+//! Two unrelated keys get carried here:
+//!
+//! 1. [`AppState::api_key`] — the **bearer token clients must present**
+//!    to call this server. Set via `WEBCLAW_API_KEY` / `--api-key`.
+//!    Unset = open mode.
+//! 2. The inner [`webclaw_fetch::cloud::CloudClient`] (if any) — our
+//!    **outbound** credential for api.webclaw.io, used by extractors
+//!    that escalate on antibot. Set via `WEBCLAW_CLOUD_API_KEY`.
+//!    Unset = hard-site extractors return a "set WEBCLAW_CLOUD_API_KEY"
+//!    error with a signup link.
+//!
+//! Different variables on purpose: conflating the two means operators
+//! who want their server behind an auth token can't also enable cloud
+//! fallback, and vice versa.
 use std::sync::Arc;
+use tracing::info;
+use webclaw_fetch::cloud::CloudClient;
 use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig};
 /// Single-process state shared across all request handlers.
@@ -17,6 +34,7 @@ struct Inner {
     /// auto-deref `&Arc<FetchClient>` -> `&FetchClient`, so this costs
     /// them nothing.
     pub fetch: Arc<FetchClient>,
+    /// Inbound bearer-auth token for this server's own `/v1/*` surface.
     pub api_key: Option<String>,
 }
@@ -24,17 +42,34 @@ impl AppState {
     /// Build the application state. The fetch client is constructed once
     /// and shared across requests so connection pools + browser profile
     /// state don't churn per request.
-    pub fn new(api_key: Option<String>) -> anyhow::Result<Self> {
+    ///
+    /// `inbound_api_key` is the bearer token clients must present;
+    /// cloud-fallback credentials come from the env (checked here).
+    pub fn new(inbound_api_key: Option<String>) -> anyhow::Result<Self> {
         let config = FetchConfig {
             browser: BrowserProfile::Firefox,
             ..FetchConfig::default()
         };
-        let fetch = FetchClient::new(config)
+        let mut fetch = FetchClient::new(config)
             .map_err(|e| anyhow::anyhow!("failed to build fetch client: {e}"))?;
+        // Cloud fallback: only activates when the operator has provided
+        // an api.webclaw.io key. Supports both WEBCLAW_CLOUD_API_KEY
+        // (preferred, disambiguates from the inbound-auth key) and
+        // WEBCLAW_API_KEY as a fallback when there's no inbound key
+        // configured (backwards compat with MCP / CLI conventions).
+        if let Some(cloud) = build_cloud_client(inbound_api_key.as_deref()) {
+            info!(
+                base = cloud.base_url(),
+                "cloud fallback enabled — antibot-protected sites will escalate via api.webclaw.io"
+            );
+            fetch = fetch.with_cloud(cloud);
+        }
         Ok(Self {
             inner: Arc::new(Inner {
                 fetch: Arc::new(fetch),
-                api_key,
+                api_key: inbound_api_key,
             }),
         })
     }
@@ -47,3 +82,26 @@ impl AppState {
         self.inner.api_key.as_deref()
     }
 }
+/// Resolve the outbound cloud key. Prefers `WEBCLAW_CLOUD_API_KEY`;
+/// falls back to `WEBCLAW_API_KEY` *only* when no inbound key is
+/// configured (i.e. open mode — the same env var can't mean two
+/// things to one process).
+fn build_cloud_client(inbound_api_key: Option<&str>) -> Option<CloudClient> {
+    let cloud_key = std::env::var("WEBCLAW_CLOUD_API_KEY").ok();
+    if let Some(k) = cloud_key.as_deref()
+        && !k.trim().is_empty()
+    {
+        return Some(CloudClient::with_key(k));
+    }
+    // Reuse WEBCLAW_API_KEY only when not also acting as our own
+    // inbound-auth token — otherwise we'd be telling the operator
+    // they can't have both.
+    if inbound_api_key.is_none()
+        && let Ok(k) = std::env::var("WEBCLAW_API_KEY")
+        && !k.trim().is_empty()
+    {
+        return Some(CloudClient::with_key(k));
+    }
+    None
+}