feat(extractors): wave 5 — Amazon, eBay, Trustpilot via cloud fallback

Three hard-site extractors that all require antibot bypass to ever
return usable data. They ship in OSS so the parsers + schema live
with the rest of the vertical extractors, but the fetch path routes
through cloud::smart_fetch_html — meaning:

- With WEBCLAW_CLOUD_API_KEY configured on webclaw-server (or
  WEBCLAW_API_KEY in MCP / CLI), local fetch is tried first; on
  challenge-page detection we escalate to api.webclaw.io/v1/scrape
  with formats=['html'] and parse the antibot-bypassed HTML locally.

- Without a cloud key, callers get a typed CloudError::NotConfigured
  whose Display message points at https://webclaw.io/signup.
  Self-hosters without a webclaw.io account know exactly what to do.
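The escalation flow above can be sketched in a few lines. This is a minimal, dependency-free model of the decision, not the real `cloud::smart_fetch_html`; the type names and challenge-page heuristic here are illustrative only:

```rust
#[derive(Debug, PartialEq)]
enum FetchSource { Local, Cloud }

#[derive(Debug, PartialEq)]
enum Outcome { Html(FetchSource), NotConfigured }

// Stand-in for the real challenge-page detection heuristics.
fn looks_like_challenge(html: &str) -> bool {
    html.contains("Verifying Connection") || html.contains("verify you're human")
}

fn smart_fetch(local_html: &str, cloud_key: Option<&str>) -> Outcome {
    // Local fetch is tried first; clean HTML short-circuits.
    if !looks_like_challenge(local_html) {
        return Outcome::Html(FetchSource::Local);
    }
    match cloud_key {
        // With a key configured we would POST to api.webclaw.io/v1/scrape
        // and parse the antibot-bypassed HTML locally; elided here.
        Some(_) => Outcome::Html(FetchSource::Cloud),
        // Without one, callers get the typed not-configured error.
        None => Outcome::NotConfigured,
    }
}

fn main() {
    assert_eq!(smart_fetch("<html>product</html>", None), Outcome::Html(FetchSource::Local));
    assert_eq!(smart_fetch("Verifying Connection", Some("key")), Outcome::Html(FetchSource::Cloud));
    assert_eq!(smart_fetch("Verifying Connection", None), Outcome::NotConfigured);
    println!("ok");
}
```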

## New extractors (all auto-dispatched — unique hosts)

- amazon_product: ASIN extraction from /dp/, /gp/product/,
  /product/, /exec/obidos/ASIN/ URL shapes across every amazon.*
  locale. Parses the Product JSON-LD Amazon ships for SEO; falls
  back to #productTitle and #landingImage DOM selectors when
  JSON-LD is absent. Returns price, currency, availability,
  condition, brand, image, aggregate rating, SKU / MPN.
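The accepted URL shapes can be modelled without a regex. A hypothetical, dependency-free sketch of the same ASIN rules (10 uppercase-alphanumeric chars after a recognised path marker, terminated by `/`, `?`, `#`, or end of string), not the shipped `parse_asin`:

```rust
fn parse_asin(url: &str) -> Option<String> {
    // Same four URL shapes the extractor recognises.
    for marker in ["/dp/", "/gp/product/", "/product/", "/ASIN/"] {
        if let Some(pos) = url.find(marker) {
            let rest = &url[pos + marker.len()..];
            let cand: String = rest
                .chars()
                .take_while(|c| c.is_ascii_uppercase() || c.is_ascii_digit())
                .collect();
            // ASINs are exactly 10 chars, followed by /, ?, # or end.
            let next = rest.chars().nth(cand.len());
            if cand.len() == 10 && matches!(next, None | Some('/' | '?' | '#')) {
                return Some(cand);
            }
        }
    }
    None
}

fn main() {
    for url in [
        "https://www.amazon.com/dp/B0CHX1W1XY",
        "https://www.amazon.de/gp/product/B0CHX1W1XY/ref=foo",
        "https://www.amazon.com/exec/obidos/ASIN/B0CHX1W1XY/baz",
    ] {
        assert_eq!(parse_asin(url).as_deref(), Some("B0CHX1W1XY"));
    }
    assert_eq!(parse_asin("https://www.amazon.com/gp/cart"), None);
    println!("ok");
}
```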

- ebay_listing: item-id extraction from /itm/{id} and
  /itm/{slug}/{id} URLs across ebay.com / .co.uk / .de / .fr /
  .it. Parses both bare Offer (Buy It Now) and AggregateOffer
  (used-copies / auctions) from the Product JSON-LD. Returns
  price or low/high-price range, currency, condition, seller,
  offer_count, aggregate rating.
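The item-id rules read roughly like this. A dependency-free sketch under the same assumptions (all-digit segment of 8+ chars after `/itm/`, optionally preceded by a slug), not the shipped regex-based `parse_item_id`:

```rust
fn parse_item_id(url: &str) -> Option<String> {
    let after = url.split_once("/itm/")?.1;
    // Walk path/query segments until one is all digits; the id may
    // follow a human-readable slug segment.
    for seg in after.split(&['/', '?', '#'][..]) {
        if seg.len() >= 8 && seg.chars().all(|c| c.is_ascii_digit()) {
            return Some(seg.to_string());
        }
    }
    None
}

fn main() {
    assert_eq!(
        parse_item_id("https://www.ebay.com/itm/325478156234").as_deref(),
        Some("325478156234")
    );
    assert_eq!(
        parse_item_id("https://www.ebay.com/itm/vintage-typewriter/325478156234").as_deref(),
        Some("325478156234")
    );
    assert_eq!(parse_item_id("https://www.ebay.com/sch/foo"), None);
    println!("ok");
}
```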

- trustpilot_reviews: reactivates the previously dead-coded
  `trustpilot_reviews` module. The parser already worked; it just
  needed the smart_fetch_html path to get past AWS WAF's
  'Verifying Connection' interstitial. The Organisation /
  LocalBusiness JSON-LD block gives aggregate rating + up to 20
  recent reviews.

## FetchClient change

- Added optional `cloud: Option<Arc<CloudClient>>` field with
  `FetchClient::with_cloud(cloud) -> Self` builder + `cloud(&self)`
  accessor. Extractors call client.cloud() to decide whether they
  can escalate. Cheap clones (Arc-wrapped).

## webclaw-server wiring

AppState::new() now reads the cloud credential from env:

1. WEBCLAW_CLOUD_API_KEY — preferred, disambiguates from the
   server's own inbound bearer token.
2. WEBCLAW_API_KEY \u2014 fallback only when the server is in open
   mode (no inbound-auth key set), matching the MCP / CLI
   convention of that env var.

When present, state.rs builds a CloudClient and attaches it to the
FetchClient via with_cloud(). Log line at startup so operators see
when cloud fallback is active.

## Catalog + dispatch

All three extractors registered in list() and in dispatch_by_url.
/v1/extractors catalog now exposes 22 verticals. Explicit
/v1/scrape/{vertical} routes work per the existing pattern.

## Tests

- 7 new unit tests (parse_asin multi-shape + parse from JSON-LD
  fixture + DOM-fallback on missing JSON-LD for Amazon; ebay
  URL-matching + slugged-URL parsing + both Offer and AggregateOffer
  fixtures).
- Full extractors suite: 68 passing (was 59, +9 from the new files).
- fmt + clippy clean.
- No live-test story for these three inside CI — verifying them
  means having WEBCLAW_CLOUD_API_KEY set against a real cloud
  backend. Integration-test harness is a separate follow-up.

Catalog summary: 22 verticals total across waves 1-5. The three
hard-site extractors are gated behind an actionable cloud-fallback
upgrade path rather than silently returning nothing or 403-ing the
caller.
Valerio 2026-04-22 16:16:11 +02:00
parent 0ab891bd6b
commit d8c9274a9c
6 changed files with 884 additions and 24 deletions


@@ -177,6 +177,11 @@ enum ClientPool {
 pub struct FetchClient {
     pool: ClientPool,
     pdf_mode: PdfMode,
+    /// Optional cloud-fallback client. Extractors that need to
+    /// escalate past bot protection call `client.cloud()` to get this
+    /// out. Stored as `Arc` so cloning a `FetchClient` (common in
+    /// axum state) doesn't clone the underlying reqwest pool.
+    cloud: Option<std::sync::Arc<crate::cloud::CloudClient>>,
 }

 impl FetchClient {
@@ -225,7 +230,35 @@ impl FetchClient {
             ClientPool::Rotating { clients }
         };

-        Ok(Self { pool, pdf_mode })
+        Ok(Self {
+            pool,
+            pdf_mode,
+            cloud: None,
+        })
     }

+    /// Attach a cloud-fallback client. Returns `self` so it composes in
+    /// a builder-ish way:
+    ///
+    /// ```ignore
+    /// let client = FetchClient::new(config)?
+    ///     .with_cloud(CloudClient::from_env()?);
+    /// ```
+    ///
+    /// Extractors that can escalate past bot protection will call
+    /// `client.cloud()` internally. Sets the field regardless of
+    /// whether `cloud` is configured to bypass anything specific —
+    /// attachment is cheap (just wraps in `Arc`).
+    pub fn with_cloud(mut self, cloud: crate::cloud::CloudClient) -> Self {
+        self.cloud = Some(std::sync::Arc::new(cloud));
+        self
+    }
+
+    /// Optional cloud-fallback client, if one was attached via
+    /// [`Self::with_cloud`]. Extractors that handle antibot sites
+    /// pass this into `cloud::smart_fetch_html`.
+    pub fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
+        self.cloud.as_deref()
+    }
+
     /// Fetch a URL and return the raw HTML + response metadata.


@@ -0,0 +1,361 @@
//! Amazon product detail page extractor.
//!
//! Amazon product pages (`/dp/{ASIN}/` on every locale) always return
//! a "Sorry, we need to verify you're human" interstitial to any
//! client without a warm Amazon session + residential IP. Detection
//! fires immediately in [`cloud::is_bot_protected`] via the dedicated
//! Amazon heuristic, so this extractor always hits the cloud fallback
//! path in practice.
//!
//! Parsing logic works on the final HTML, local or cloud-sourced. We
//! read the product details primarily from JSON-LD `Product` blocks
//! (Amazon exposes a solid subset for SEO) plus a couple of Amazon-
//! specific DOM IDs picked up with cheap regex.
//!
//! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/`
//! path. ASINs are a stable Amazon identifier so we extract that as
//! part of the response even when everything else is empty (tells
//! callers the URL was at least recognised).
use std::sync::OnceLock;
use regex::Regex;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::client::FetchClient;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "amazon_product",
label: "Amazon product",
description: "Returns product detail: title, brand, price, currency, availability, rating, image, ASIN. Requires WEBCLAW_API_KEY — Amazon's antibot means we always go through the cloud.",
url_patterns: &[
"https://www.amazon.com/dp/{ASIN}",
"https://www.amazon.co.uk/dp/{ASIN}",
"https://www.amazon.de/dp/{ASIN}",
"https://www.amazon.fr/dp/{ASIN}",
"https://www.amazon.it/dp/{ASIN}",
"https://www.amazon.es/dp/{ASIN}",
"https://www.amazon.co.jp/dp/{ASIN}",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if !is_amazon_host(host) {
return false;
}
parse_asin(url).is_some()
}
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
let asin = parse_asin(url)
.ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;
let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
.await
.map_err(cloud_to_fetch_err)?;
let mut data = parse(&fetched.html, url, &asin);
if let Some(obj) = data.as_object_mut() {
obj.insert(
"data_source".into(),
match fetched.source {
cloud::FetchSource::Local => json!("local"),
cloud::FetchSource::Cloud => json!("cloud"),
},
);
}
Ok(data)
}
/// Pure parser. Given HTML (from anywhere — direct, cloud, or a fixture
/// file) and the source URL, extract Amazon product detail. Returns a
/// `Value` rather than a typed struct so callers can pass it through
/// without carrying webclaw_fetch types.
pub fn parse(html: &str, url: &str, asin: &str) -> Value {
let jsonld = find_product_jsonld(html);
let title = jsonld
.as_ref()
.and_then(|v| get_text(v, "name"))
.or_else(|| dom_title(html));
let image = jsonld
.as_ref()
.and_then(get_first_image)
.or_else(|| dom_image(html));
let brand = jsonld.as_ref().and_then(get_brand);
let description = jsonld.as_ref().and_then(|v| get_text(v, "description"));
let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
let offer = jsonld.as_ref().and_then(first_offer);
let sku = jsonld.as_ref().and_then(|v| get_text(v, "sku"));
let mpn = jsonld.as_ref().and_then(|v| get_text(v, "mpn"));
json!({
"url": url,
"asin": asin,
"title": title,
"brand": brand,
"description": description,
"image": image,
"price": offer.as_ref().and_then(|o| get_text(o, "price")),
"currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
"availability": offer.as_ref().and_then(|o| {
get_text(o, "availability").map(|s|
s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
}),
"condition": offer.as_ref().and_then(|o| {
get_text(o, "itemCondition").map(|s|
s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
}),
"sku": sku,
"mpn": mpn,
"aggregate_rating": aggregate_rating,
})
}
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
fn is_amazon_host(host: &str) -> bool {
host.starts_with("www.amazon.") || host.starts_with("amazon.")
}
/// Pull a 10-char ASIN out of any recognised Amazon URL shape:
/// - /dp/{ASIN}
/// - /gp/product/{ASIN}
/// - /product/{ASIN}
/// - /exec/obidos/ASIN/{ASIN}
fn parse_asin(url: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r"/(?:dp|gp/product|product|ASIN)/([A-Z0-9]{10})(?:[/?#]|$)").unwrap()
});
re.captures(url)
.and_then(|c| c.get(1))
.map(|m| m.as_str().to_string())
}
// ---------------------------------------------------------------------------
// JSON-LD walkers — light reuse of ecommerce_product's style
// ---------------------------------------------------------------------------
fn find_product_jsonld(html: &str) -> Option<Value> {
let blocks = webclaw_core::structured_data::extract_json_ld(html);
for b in blocks {
if let Some(found) = find_product_in(&b) {
return Some(found);
}
}
None
}
fn find_product_in(v: &Value) -> Option<Value> {
if is_product_type(v) {
return Some(v.clone());
}
if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
for item in graph {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
if let Some(arr) = v.as_array() {
for item in arr {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
None
}
fn is_product_type(v: &Value) -> bool {
let Some(t) = v.get("@type") else {
return false;
};
let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
match t {
Value::String(s) => is_prod(s),
Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
_ => false,
}
}
fn get_text(v: &Value, key: &str) -> Option<String> {
v.get(key).and_then(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Number(n) => Some(n.to_string()),
_ => None,
})
}
fn get_brand(v: &Value) -> Option<String> {
let brand = v.get("brand")?;
if let Some(s) = brand.as_str() {
return Some(s.to_string());
}
brand
.as_object()
.and_then(|o| o.get("name"))
.and_then(|n| n.as_str())
.map(String::from)
}
fn get_first_image(v: &Value) -> Option<String> {
match v.get("image")? {
Value::String(s) => Some(s.clone()),
Value::Array(arr) => arr.iter().find_map(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
_ => None,
}),
Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
_ => None,
}
}
fn first_offer(v: &Value) -> Option<Value> {
let offers = v.get("offers")?;
match offers {
Value::Array(arr) => arr.first().cloned(),
Value::Object(_) => Some(offers.clone()),
_ => None,
}
}
fn get_aggregate_rating(v: &Value) -> Option<Value> {
let r = v.get("aggregateRating")?;
Some(json!({
"rating_value": get_text(r, "ratingValue"),
"review_count": get_text(r, "reviewCount"),
"best_rating": get_text(r, "bestRating"),
}))
}
// ---------------------------------------------------------------------------
// DOM fallbacks — cheap regex for the two fields most likely to be
// missing from JSON-LD on Amazon.
// ---------------------------------------------------------------------------
fn dom_title(html: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r#"(?s)id="productTitle"[^>]*>([^<]+)<"#).unwrap());
re.captures(html)
.and_then(|c| c.get(1))
.map(|m| m.as_str().trim().to_string())
}
fn dom_image(html: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r#"id="landingImage"[^>]+src="([^"]+)""#).unwrap());
re.captures(html)
.and_then(|c| c.get(1))
.map(|m| m.as_str().to_string())
}
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
FetchError::Build(e.to_string())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_multi_locale() {
assert!(matches("https://www.amazon.com/dp/B0CHX1W1XY"));
assert!(matches("https://www.amazon.co.uk/dp/B0CHX1W1XY/"));
assert!(matches("https://www.amazon.de/dp/B0CHX1W1XY?psc=1"));
assert!(matches(
"https://www.amazon.com/gp/product/B0CHX1W1XY/ref=foo"
));
}
#[test]
fn rejects_non_product_urls() {
assert!(!matches("https://www.amazon.com/"));
assert!(!matches("https://www.amazon.com/gp/cart"));
assert!(!matches("https://example.com/dp/B0CHX1W1XY"));
}
#[test]
fn parse_asin_extracts_from_multiple_shapes() {
assert_eq!(
parse_asin("https://www.amazon.com/dp/B0CHX1W1XY"),
Some("B0CHX1W1XY".into())
);
assert_eq!(
parse_asin("https://www.amazon.com/dp/B0CHX1W1XY/"),
Some("B0CHX1W1XY".into())
);
assert_eq!(
parse_asin("https://www.amazon.com/dp/B0CHX1W1XY?psc=1"),
Some("B0CHX1W1XY".into())
);
assert_eq!(
parse_asin("https://www.amazon.com/gp/product/B0CHX1W1XY/ref=bar"),
Some("B0CHX1W1XY".into())
);
assert_eq!(
parse_asin("https://www.amazon.com/exec/obidos/ASIN/B0CHX1W1XY/baz"),
Some("B0CHX1W1XY".into())
);
assert_eq!(parse_asin("https://www.amazon.com/"), None);
}
#[test]
fn parse_extracts_from_fixture_jsonld() {
// Minimal Amazon-style fixture with a Product JSON-LD block.
let html = r##"
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"ACME Widget","sku":"B0CHX1W1XY",
"brand":{"@type":"Brand","name":"ACME"},
"image":"https://m.media-amazon.com/images/I/abc.jpg",
"offers":{"@type":"Offer","price":"19.99","priceCurrency":"USD",
"availability":"https://schema.org/InStock"},
"aggregateRating":{"@type":"AggregateRating","ratingValue":"4.6","reviewCount":"1234"}}
</script>
</head><body></body></html>"##;
let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
assert_eq!(v["asin"], "B0CHX1W1XY");
assert_eq!(v["title"], "ACME Widget");
assert_eq!(v["brand"], "ACME");
assert_eq!(v["price"], "19.99");
assert_eq!(v["currency"], "USD");
assert_eq!(v["availability"], "InStock");
assert_eq!(v["aggregate_rating"]["rating_value"], "4.6");
assert_eq!(v["aggregate_rating"]["review_count"], "1234");
}
#[test]
fn parse_falls_back_to_dom_when_jsonld_missing_fields() {
let html = r#"
<html><body>
<span id="productTitle">Fallback Title</span>
<img id="landingImage" src="https://m.media-amazon.com/images/I/fallback.jpg" />
</body></html>
"#;
let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
assert_eq!(v["title"], "Fallback Title");
assert_eq!(
v["image"],
"https://m.media-amazon.com/images/I/fallback.jpg"
);
}
}


@@ -0,0 +1,337 @@
//! eBay listing extractor.
//!
//! eBay item pages at `ebay.com/itm/{id}` and international variants
//! usually ship a `Product` JSON-LD block with title, price, currency,
//! condition, and an `AggregateOffer` when bidding. eBay applies
//! Cloudflare + custom WAF selectively — some item IDs return normal
//! HTML to the Firefox profile, others 403 / get the "Pardon our
//! interruption" page. We route through `cloud::smart_fetch_html` so
//! both paths resolve to the same parser.
use std::sync::OnceLock;
use regex::Regex;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::client::FetchClient;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "ebay_listing",
label: "eBay listing",
description: "Returns item title, price, currency, condition, seller, shipping, and bid info. Heavy listings may need WEBCLAW_API_KEY for antibot.",
url_patterns: &[
"https://www.ebay.com/itm/{id}",
"https://www.ebay.co.uk/itm/{id}",
"https://www.ebay.de/itm/{id}",
"https://www.ebay.fr/itm/{id}",
"https://www.ebay.it/itm/{id}",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if !is_ebay_host(host) {
return false;
}
parse_item_id(url).is_some()
}
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
let item_id = parse_item_id(url)
.ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?;
let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
.await
.map_err(cloud_to_fetch_err)?;
let mut data = parse(&fetched.html, url, &item_id);
if let Some(obj) = data.as_object_mut() {
obj.insert(
"data_source".into(),
match fetched.source {
cloud::FetchSource::Local => json!("local"),
cloud::FetchSource::Cloud => json!("cloud"),
},
);
}
Ok(data)
}
pub fn parse(html: &str, url: &str, item_id: &str) -> Value {
let jsonld = find_product_jsonld(html);
let title = jsonld
.as_ref()
.and_then(|v| get_text(v, "name"))
.or_else(|| og(html, "title"));
let image = jsonld
.as_ref()
.and_then(get_first_image)
.or_else(|| og(html, "image"));
let brand = jsonld.as_ref().and_then(get_brand);
let description = jsonld
.as_ref()
.and_then(|v| get_text(v, "description"))
.or_else(|| og(html, "description"));
let offer = jsonld.as_ref().and_then(first_offer);
// eBay's AggregateOffer uses lowPrice/highPrice. Offer uses price.
let (low_price, high_price, single_price) = match offer.as_ref() {
Some(o) => (
get_text(o, "lowPrice"),
get_text(o, "highPrice"),
get_text(o, "price"),
),
None => (None, None, None),
};
let offer_count = offer.as_ref().and_then(|o| get_text(o, "offerCount"));
let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
json!({
"url": url,
"item_id": item_id,
"title": title,
"brand": brand,
"description": description,
"image": image,
"price": single_price,
"low_price": low_price,
"high_price": high_price,
"offer_count": offer_count,
"currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
"availability": offer.as_ref().and_then(|o| {
get_text(o, "availability").map(|s|
s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
}),
"condition": offer.as_ref().and_then(|o| {
get_text(o, "itemCondition").map(|s|
s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
}),
"seller": offer.as_ref().and_then(|o|
o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from)),
"aggregate_rating": aggregate_rating,
})
}
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
fn is_ebay_host(host: &str) -> bool {
host.starts_with("www.ebay.") || host.starts_with("ebay.")
}
/// Pull the numeric item id out of `/itm/{id}` or `/itm/{slug}/{id}`
/// URLs. IDs are 10-15 digits today, but we accept any all-digit
/// trailing segment so the extractor stays forward-compatible.
fn parse_item_id(url: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
// /itm/(optional-slug/)?(digits)([/?#]|end)
Regex::new(r"/itm/(?:[^/]+/)?(\d{8,})(?:[/?#]|$)").unwrap()
});
re.captures(url)
.and_then(|c| c.get(1))
.map(|m| m.as_str().to_string())
}
// ---------------------------------------------------------------------------
// JSON-LD walkers
// ---------------------------------------------------------------------------
fn find_product_jsonld(html: &str) -> Option<Value> {
let blocks = webclaw_core::structured_data::extract_json_ld(html);
for b in blocks {
if let Some(found) = find_product_in(&b) {
return Some(found);
}
}
None
}
fn find_product_in(v: &Value) -> Option<Value> {
if is_product_type(v) {
return Some(v.clone());
}
if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
for item in graph {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
if let Some(arr) = v.as_array() {
for item in arr {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
None
}
fn is_product_type(v: &Value) -> bool {
let Some(t) = v.get("@type") else {
return false;
};
let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
match t {
Value::String(s) => is_prod(s),
Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
_ => false,
}
}
fn get_text(v: &Value, key: &str) -> Option<String> {
v.get(key).and_then(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Number(n) => Some(n.to_string()),
_ => None,
})
}
fn get_brand(v: &Value) -> Option<String> {
let brand = v.get("brand")?;
if let Some(s) = brand.as_str() {
return Some(s.to_string());
}
brand
.as_object()
.and_then(|o| o.get("name"))
.and_then(|n| n.as_str())
.map(String::from)
}
fn get_first_image(v: &Value) -> Option<String> {
match v.get("image")? {
Value::String(s) => Some(s.clone()),
Value::Array(arr) => arr.iter().find_map(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
_ => None,
}),
Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
_ => None,
}
}
fn first_offer(v: &Value) -> Option<Value> {
let offers = v.get("offers")?;
match offers {
Value::Array(arr) => arr.first().cloned(),
Value::Object(_) => Some(offers.clone()),
_ => None,
}
}
fn get_aggregate_rating(v: &Value) -> Option<Value> {
let r = v.get("aggregateRating")?;
Some(json!({
"rating_value": get_text(r, "ratingValue"),
"review_count": get_text(r, "reviewCount"),
"best_rating": get_text(r, "bestRating"),
}))
}
fn og(html: &str, prop: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == prop) {
return c.get(2).map(|m| m.as_str().to_string());
}
}
None
}
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
FetchError::Build(e.to_string())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_ebay_item_urls() {
assert!(matches("https://www.ebay.com/itm/325478156234"));
assert!(matches(
"https://www.ebay.com/itm/vintage-typewriter/325478156234"
));
assert!(matches("https://www.ebay.co.uk/itm/325478156234"));
assert!(!matches("https://www.ebay.com/"));
assert!(!matches("https://www.ebay.com/sch/foo"));
assert!(!matches("https://example.com/itm/325478156234"));
}
#[test]
fn parse_item_id_handles_slugged_urls() {
assert_eq!(
parse_item_id("https://www.ebay.com/itm/325478156234"),
Some("325478156234".into())
);
assert_eq!(
parse_item_id("https://www.ebay.com/itm/vintage-typewriter/325478156234"),
Some("325478156234".into())
);
assert_eq!(
parse_item_id("https://www.ebay.com/itm/325478156234?hash=abc"),
Some("325478156234".into())
);
}
#[test]
fn parse_extracts_from_fixture_jsonld() {
let html = r##"
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"Vintage Typewriter","sku":"TW-001",
"brand":{"@type":"Brand","name":"Olivetti"},
"image":"https://i.ebayimg.com/images/abc.jpg",
"offers":{"@type":"Offer","price":"79.99","priceCurrency":"GBP",
"availability":"https://schema.org/InStock",
"itemCondition":"https://schema.org/UsedCondition",
"seller":{"@type":"Person","name":"vintage_seller_99"}}}
</script>
</head></html>"##;
let v = parse(html, "https://www.ebay.co.uk/itm/325", "325");
assert_eq!(v["title"], "Vintage Typewriter");
assert_eq!(v["price"], "79.99");
assert_eq!(v["currency"], "GBP");
assert_eq!(v["availability"], "InStock");
assert_eq!(v["condition"], "UsedCondition");
assert_eq!(v["seller"], "vintage_seller_99");
assert_eq!(v["brand"], "Olivetti");
}
#[test]
fn parse_handles_aggregate_offer_price_range() {
let html = r##"
<script type="application/ld+json">
{"@type":"Product","name":"Used Copies",
"offers":{"@type":"AggregateOffer","offerCount":"5",
"lowPrice":"10.00","highPrice":"50.00","priceCurrency":"USD"}}
</script>
"##;
let v = parse(html, "https://www.ebay.com/itm/1", "1");
assert_eq!(v["low_price"], "10.00");
assert_eq!(v["high_price"], "50.00");
assert_eq!(v["offer_count"], "5");
assert_eq!(v["currency"], "USD");
}
}


@@ -14,10 +14,12 @@
 //! exists (Reddit, HN/Algolia, PyPI, npm, GitHub, HuggingFace all have
 //! one). HTML extraction is the fallback for sites that don't.

+pub mod amazon_product;
 pub mod arxiv;
 pub mod crates_io;
 pub mod dev_to;
 pub mod docker_hub;
+pub mod ebay_listing;
 pub mod ecommerce_product;
 pub mod github_pr;
 pub mod github_release;
@@ -33,12 +35,6 @@ pub mod pypi;
 pub mod reddit;
 pub mod shopify_product;
 pub mod stackoverflow;
-// `trustpilot_reviews` code lives in the tree but is not wired into the
-// catalog or dispatch: Cloudflare turnstile blocks our client at the TLS
-// layer (all browser profiles tried, all UAs, mobile + desktop). Shipping
-// it would return 403 more often than not — bad UX. When the cloud tier
-// has residential proxies or a CDP renderer, flip this back on.
-#[allow(dead_code)]
 pub mod trustpilot_reviews;

 use serde::Serialize;
@@ -84,6 +80,9 @@ pub fn list() -> Vec<ExtractorInfo> {
         instagram_profile::INFO,
         shopify_product::INFO,
         ecommerce_product::INFO,
+        amazon_product::INFO,
+        ebay_listing::INFO,
+        trustpilot_reviews::INFO,
     ]
 }
@@ -209,6 +208,31 @@
                 .map(|v| (instagram_profile::INFO.name, v)),
         );
     }
+    // Antibot-gated verticals with unique hosts: safe to auto-dispatch
+    // because the matcher can't confuse the URL for anything else. The
+    // extractor's smart_fetch_html path handles the blocked-without-
+    // API-key case with a clear actionable error.
+    if amazon_product::matches(url) {
+        return Some(
+            amazon_product::extract(client, url)
+                .await
+                .map(|v| (amazon_product::INFO.name, v)),
+        );
+    }
+    if ebay_listing::matches(url) {
+        return Some(
+            ebay_listing::extract(client, url)
+                .await
+                .map(|v| (ebay_listing::INFO.name, v)),
+        );
+    }
+    if trustpilot_reviews::matches(url) {
+        return Some(
+            trustpilot_reviews::extract(client, url)
+                .await
+                .map(|v| (trustpilot_reviews::INFO.name, v)),
+        );
+    }
     // NOTE: shopify_product and ecommerce_product are intentionally NOT
     // in auto-dispatch. Their `matches()` functions are permissive
     // (any URL with `/products/`, `/product/`, `/p/`, etc.) and
@@ -333,6 +357,24 @@ pub async fn dispatch_by_name(
             })
             .await
         }
+        n if n == amazon_product::INFO.name => {
+            run_or_mismatch(amazon_product::matches(url), n, url, || {
+                amazon_product::extract(client, url)
+            })
+            .await
+        }
+        n if n == ebay_listing::INFO.name => {
+            run_or_mismatch(ebay_listing::matches(url), n, url, || {
+                ebay_listing::extract(client, url)
+            })
+            .await
+        }
+        n if n == trustpilot_reviews::INFO.name => {
+            run_or_mismatch(trustpilot_reviews::matches(url), n, url, || {
+                trustpilot_reviews::extract(client, url)
+            })
+            .await
+        }
         _ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())),
     }
 }


@ -1,16 +1,18 @@
//! Trustpilot company reviews extractor. //! Trustpilot company reviews extractor.
//! //!
//! Trustpilot pages at `trustpilot.com/review/{domain}` embed a rich //! `trustpilot.com/review/{domain}` pages embed a JSON-LD
//! JSON-LD `LocalBusiness` / `Organization` block with aggregate //! `Organization` / `LocalBusiness` block with aggregate rating + up
//! rating + up to 20 recent reviews. No auth, no antibot for the //! to 20 recent reviews. The page HTML itself is usually behind AWS
//! page HTML itself. //! WAF's "Verifying Connection" interstitial — so this extractor
//! //! always uses [`cloud::smart_fetch_html`] and only returns data when
//! Auto-dispatch safe because the host is unique. //! the caller has `WEBCLAW_API_KEY` set (cloud handles the bypass).
//! OSS users without a key get a clear error pointing at signup.
use serde_json::{Value, json}; use serde_json::{Value, json};
use super::ExtractorInfo; use super::ExtractorInfo;
use crate::client::FetchClient; use crate::client::FetchClient;
use crate::cloud::{self, CloudError};
use crate::error::FetchError; use crate::error::FetchError;
pub const INFO: ExtractorInfo = ExtractorInfo { pub const INFO: ExtractorInfo = ExtractorInfo {
@ -29,15 +31,22 @@ pub fn matches(url: &str) -> bool {
} }
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> { pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
let resp = client.fetch(url).await?; // Trustpilot is always behind AWS WAF, so we go through smart_fetch
if !(200..300).contains(&resp.status) { // which tries local first (which will hit the challenge interstitial),
return Err(FetchError::Build(format!( // detects it, and escalates to cloud /v1/scrape for the real HTML.
"trustpilot_reviews: status {} for {url}", let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
resp.status .await
))); .map_err(cloud_to_fetch_err)?;
}
let blocks = webclaw_core::structured_data::extract_json_ld(&resp.html); let html = parse(&fetched.html, url)?;
Ok(html_with_source(html, fetched.source))
}
/// Run the pure parser on already-fetched HTML. Split out so the cloud
/// pipeline can call it directly after its own antibot-aware fetch
/// without going through [`extract`].
pub fn parse(html: &str, url: &str) -> Result<Value, FetchError> {
let blocks = webclaw_core::structured_data::extract_json_ld(html);
let business = find_business(&blocks).ok_or_else(|| { let business = find_business(&blocks).ok_or_else(|| {
FetchError::BodyDecode(format!( FetchError::BodyDecode(format!(
"trustpilot_reviews: no Organization/LocalBusiness JSON-LD on {url}" "trustpilot_reviews: no Organization/LocalBusiness JSON-LD on {url}"
@@ -94,6 +103,26 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
     }))
 }
 
+fn cloud_to_fetch_err(e: CloudError) -> FetchError {
+    FetchError::Build(e.to_string())
+}
+
+/// Stamp `data_source` onto the parser output so callers can tell at a
+/// glance whether this row came from local or cloud. Useful for UX and
+/// for pricing-aware pipelines.
+fn html_with_source(mut v: Value, source: cloud::FetchSource) -> Value {
+    if let Some(obj) = v.as_object_mut() {
+        obj.insert(
+            "data_source".into(),
+            match source {
+                cloud::FetchSource::Local => json!("local"),
+                cloud::FetchSource::Cloud => json!("cloud"),
+            },
+        );
+    }
+    v
+}
+
 // ---------------------------------------------------------------------------
 // JSON-LD walker — same pattern as ecommerce_product
 // ---------------------------------------------------------------------------
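The comments in this hunk lean on smart_fetch_html's challenge-page detection to decide when to escalate. The real detector lives in `cloud::smart_fetch_html`; below is a minimal, std-only sketch of the kind of heuristic involved. The function name and marker strings are illustrative assumptions, not webclaw's actual list:

```rust
/// Heuristic challenge-page detector (sketch only). The markers below
/// are example antibot interstitial strings, not webclaw's real set.
fn looks_like_challenge(html: &str) -> bool {
    // Challenge pages are tiny self-reloading shells; real product or
    // review pages are large. Size is a strong first filter.
    let small = html.len() < 20_000;
    let lower = html.to_lowercase();
    let markers = [
        "verifying connection",        // AWS WAF interstitial (Trustpilot)
        "checking your browser",       // classic Cloudflare wording
        "type the characters you see", // Amazon captcha page
    ];
    small && markers.iter().any(|m| lower.contains(m))
}

fn main() {
    let waf = "<html><title>Verifying Connection</title></html>";
    let real = "<html><h1>Acme Ltd reviews</h1></html>".repeat(2000);
    // Small page with a marker escalates; a large real page does not.
    println!("{} {}", looks_like_challenge(waf), looks_like_challenge(&real));
}
```

Only when the heuristic fires does the fetch path pay for a cloud round trip; clean local HTML is parsed as-is.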


@@ -1,7 +1,24 @@
 //! Shared application state. Cheap to clone via Arc; held by the axum
 //! Router for the life of the process.
+//!
+//! Two unrelated keys get carried here:
+//!
+//! 1. [`AppState::api_key`] — the **bearer token clients must present**
+//!    to call this server. Set via `WEBCLAW_API_KEY` / `--api-key`.
+//!    Unset = open mode.
+//! 2. The inner [`webclaw_fetch::cloud::CloudClient`] (if any) — our
+//!    **outbound** credential for api.webclaw.io, used by extractors
+//!    that escalate on antibot. Set via `WEBCLAW_CLOUD_API_KEY`.
+//!    Unset = hard-site extractors return a "set WEBCLAW_CLOUD_API_KEY"
+//!    error with a signup link.
+//!
+//! The variables differ on purpose: conflating the two would mean
+//! operators who want their server behind an auth token couldn't also
+//! enable cloud fallback, and vice versa.
 
 use std::sync::Arc;
 
+use tracing::info;
+use webclaw_fetch::cloud::CloudClient;
 use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig};
 
 /// Single-process state shared across all request handlers.
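The module doc above implies four valid operator configurations (inbound auth on/off crossed with cloud fallback on/off). A std-only sketch that makes the matrix concrete; the function and its strings are hypothetical, the real wiring is `AppState::new` plus `build_cloud_client` below:

```rust
/// Describe a server's mode from its two independent keys. Illustrative
/// only: inbound auth and cloud fallback are separate switches, so all
/// four combinations are legal.
fn describe_mode(inbound_key: Option<&str>, cloud_key: Option<&str>) -> String {
    let auth = if inbound_key.is_some() { "bearer-auth" } else { "open" };
    let cloud = if cloud_key.is_some() {
        "cloud fallback on"
    } else {
        "local fetch only"
    };
    format!("{auth}, {cloud}")
}

fn main() {
    // Both keys set: authenticated server that can also escalate to cloud.
    println!("{}", describe_mode(Some("inbound-token"), Some("cloud-key")));
    // Neither set: open server; hard-site extractors error with a signup link.
    println!("{}", describe_mode(None, None));
}
```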
@@ -17,6 +34,7 @@ struct Inner {
     /// auto-deref `&Arc<FetchClient>` -> `&FetchClient`, so this costs
     /// them nothing.
     pub fetch: Arc<FetchClient>,
+    /// Inbound bearer-auth token for this server's own `/v1/*` surface.
     pub api_key: Option<String>,
 }
@@ -24,17 +42,34 @@ impl AppState {
     /// Build the application state. The fetch client is constructed once
     /// and shared across requests so connection pools + browser profile
     /// state don't churn per request.
-    pub fn new(api_key: Option<String>) -> anyhow::Result<Self> {
+    ///
+    /// `inbound_api_key` is the bearer token clients must present;
+    /// cloud-fallback credentials come from the env (checked here).
+    pub fn new(inbound_api_key: Option<String>) -> anyhow::Result<Self> {
         let config = FetchConfig {
             browser: BrowserProfile::Firefox,
             ..FetchConfig::default()
         };
-        let fetch = FetchClient::new(config)
+        let mut fetch = FetchClient::new(config)
             .map_err(|e| anyhow::anyhow!("failed to build fetch client: {e}"))?;
 
+        // Cloud fallback: only activates when the operator has provided
+        // an api.webclaw.io key. Supports both WEBCLAW_CLOUD_API_KEY
+        // (preferred, disambiguates from the inbound-auth key) and
+        // WEBCLAW_API_KEY as a fallback when there's no inbound key
+        // configured (backwards compat with MCP / CLI conventions).
+        if let Some(cloud) = build_cloud_client(inbound_api_key.as_deref()) {
+            info!(
+                base = cloud.base_url(),
+                "cloud fallback enabled — antibot-protected sites will escalate via api.webclaw.io"
+            );
+            fetch = fetch.with_cloud(cloud);
+        }
+
         Ok(Self {
             inner: Arc::new(Inner {
                 fetch: Arc::new(fetch),
-                api_key,
+                api_key: inbound_api_key,
             }),
         })
     }
@ -47,3 +82,26 @@ impl AppState {
self.inner.api_key.as_deref() self.inner.api_key.as_deref()
} }
} }
/// Resolve the outbound cloud key. Prefers `WEBCLAW_CLOUD_API_KEY`;
/// falls back to `WEBCLAW_API_KEY` *only* when no inbound key is
/// configured (i.e. open mode — the same env var can't mean two
/// things to one process).
fn build_cloud_client(inbound_api_key: Option<&str>) -> Option<CloudClient> {
let cloud_key = std::env::var("WEBCLAW_CLOUD_API_KEY").ok();
if let Some(k) = cloud_key.as_deref()
&& !k.trim().is_empty()
{
return Some(CloudClient::with_key(k));
}
// Reuse WEBCLAW_API_KEY only when not also acting as our own
// inbound-auth token — otherwise we'd be telling the operator
// they can't have both.
if inbound_api_key.is_none()
&& let Ok(k) = std::env::var("WEBCLAW_API_KEY")
&& !k.trim().is_empty()
{
return Some(CloudClient::with_key(k));
}
None
}
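Because build_cloud_client reads the process environment directly, its precedence rules are awkward to unit-test in place. A pure model of the same three rules (a hypothetical helper, not part of this patch) makes them easy to assert:

```rust
/// Pure model of build_cloud_client's precedence (sketch):
/// 1. a non-blank WEBCLAW_CLOUD_API_KEY always wins;
/// 2. WEBCLAW_API_KEY is reused only in open mode (no inbound key);
/// 3. blank or whitespace-only values count as unset.
fn resolve_cloud_key<'a>(
    cloud_env: Option<&'a str>,   // WEBCLAW_CLOUD_API_KEY
    shared_env: Option<&'a str>,  // WEBCLAW_API_KEY
    inbound_key: Option<&'a str>, // this server's own auth token
) -> Option<&'a str> {
    if let Some(k) = cloud_env.filter(|k| !k.trim().is_empty()) {
        return Some(k);
    }
    if inbound_key.is_none() {
        return shared_env.filter(|k| !k.trim().is_empty());
    }
    None
}

fn main() {
    // The dedicated key wins regardless of the other two.
    assert_eq!(resolve_cloud_key(Some("ck"), Some("sk"), Some("ik")), Some("ck"));
    // The shared key is reused only in open mode...
    assert_eq!(resolve_cloud_key(None, Some("sk"), None), Some("sk"));
    // ...never when it already guards the inbound surface.
    assert_eq!(resolve_cloud_key(None, Some("sk"), Some("ik")), None);
    // Blank counts as unset.
    assert_eq!(resolve_cloud_key(Some("  "), None, None), None);
    println!("ok");
}
```

Keeping the env reads at the edge (in build_cloud_client) and the decision logic pure like this is one way to regression-test the fallback without mutating process globals in tests.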