mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-04-25 00:06:21 +02:00
feat(extractors): wave 5 - Amazon, eBay, Trustpilot via cloud fallback
Three hard-site extractors that all require antibot bypass to ever return usable data. They ship in OSS so the parsers and schema live with the rest of the vertical extractors, but the fetch path routes through cloud::smart_fetch_html, meaning:

- With WEBCLAW_CLOUD_API_KEY configured on webclaw-server (or WEBCLAW_API_KEY in MCP / CLI), local fetch is tried first; on challenge-page detection we escalate to api.webclaw.io/v1/scrape with formats=['html'] and parse the antibot-bypassed HTML locally.
- Without a cloud key, callers get a typed CloudError::NotConfigured whose Display message points at https://webclaw.io/signup. Self-hosters without a webclaw.io account know exactly what to do.

## New extractors (all auto-dispatched; unique hosts)

- amazon_product: ASIN extraction from /dp/, /gp/product/, /product/, and /exec/obidos/ASIN/ URL shapes across every amazon.* locale. Parses the Product JSON-LD Amazon ships for SEO; falls back to the #productTitle and #landingImage DOM selectors when JSON-LD is absent. Returns price, currency, availability, condition, brand, image, aggregate rating, SKU / MPN.
- ebay_listing: item-id extraction from /itm/{id} and /itm/{slug}/{id} URLs across ebay.com / .co.uk / .de / .fr / .it. Parses both bare Offer (Buy It Now) and AggregateOffer (used copies / auctions) from the Product JSON-LD. Returns price or a low/high price range, currency, condition, seller, offer_count, and aggregate rating.
- trustpilot_reviews: reactivated from the `trustpilot_reviews` file that was previously dead-code'd. The parser already worked; it just needed the smart_fetch_html path to get past AWS WAF's "Verifying Connection" interstitial. The Organization / LocalBusiness JSON-LD block gives the aggregate rating plus up to 20 recent reviews.

## FetchClient change

- Added an optional `cloud: Option<Arc<CloudClient>>` field with a `FetchClient::with_cloud(cloud) -> Self` builder and a `cloud(&self)` accessor. Extractors call client.cloud() to decide whether they can escalate. Clones stay cheap (the cloud client is Arc-wrapped).

## webclaw-server wiring

AppState::new() now reads the cloud credential from env:

1. WEBCLAW_CLOUD_API_KEY: preferred; disambiguates from the server's own inbound bearer token.
2. WEBCLAW_API_KEY: fallback only when the server is in open mode (no inbound-auth key set), matching the MCP / CLI convention for that env var.

When present, state.rs builds a CloudClient and attaches it to the FetchClient via with_cloud(). A log line at startup lets operators see when cloud fallback is active.

## Catalog + dispatch

All three extractors are registered in list() and in dispatch_by_url. The /v1/extractors catalog now exposes 22 verticals. Explicit /v1/scrape/{vertical} routes work per the existing pattern.

## Tests

- 7 new unit tests (parse_asin multi-shape, parsing from a JSON-LD fixture, and DOM fallback on missing JSON-LD for Amazon; eBay URL matching, slugged-URL parsing, and both Offer and AggregateOffer fixtures).
- Full extractors suite: 68 passing (was 59; +9 from the new files).
- fmt + clippy clean.
- No live-test story for these three inside CI: verifying them means having WEBCLAW_CLOUD_API_KEY set against a real cloud backend. An integration-test harness is a separate follow-up.

Catalog summary: 22 verticals total across waves 1-5. The hard-site three are gated behind an actionable cloud-fallback upgrade path rather than silently returning nothing or 403-ing the caller.
parent 0ab891bd6b
commit d8c9274a9c
6 changed files with 884 additions and 24 deletions
@@ -177,6 +177,11 @@ enum ClientPool {
 pub struct FetchClient {
     pool: ClientPool,
     pdf_mode: PdfMode,
+    /// Optional cloud-fallback client. Extractors that need to
+    /// escalate past bot protection call `client.cloud()` to get this
+    /// out. Stored as `Arc` so cloning a `FetchClient` (common in
+    /// axum state) doesn't clone the underlying reqwest pool.
+    cloud: Option<std::sync::Arc<crate::cloud::CloudClient>>,
 }
 
 impl FetchClient {
@@ -225,7 +230,35 @@ impl FetchClient {
             ClientPool::Rotating { clients }
         };
 
-        Ok(Self { pool, pdf_mode })
+        Ok(Self {
+            pool,
+            pdf_mode,
+            cloud: None,
+        })
     }
 
+    /// Attach a cloud-fallback client. Returns `self` so it composes in
+    /// a builder-ish way:
+    ///
+    /// ```ignore
+    /// let client = FetchClient::new(config)?
+    ///     .with_cloud(CloudClient::from_env()?);
+    /// ```
+    ///
+    /// Extractors that can escalate past bot protection will call
+    /// `client.cloud()` internally. Sets the field regardless of
+    /// whether `cloud` is configured to bypass anything specific —
+    /// attachment is cheap (just wraps in `Arc`).
+    pub fn with_cloud(mut self, cloud: crate::cloud::CloudClient) -> Self {
+        self.cloud = Some(std::sync::Arc::new(cloud));
+        self
+    }
+
+    /// Optional cloud-fallback client, if one was attached via
+    /// [`Self::with_cloud`]. Extractors that handle antibot sites
+    /// pass this into `cloud::smart_fetch_html`.
+    pub fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
+        self.cloud.as_deref()
+    }
+
     /// Fetch a URL and return the raw HTML + response metadata.
crates/webclaw-fetch/src/extractors/amazon_product.rs (new file, 361 lines)
@@ -0,0 +1,361 @@
//! Amazon product detail page extractor.
//!
//! Amazon product pages (`/dp/{ASIN}/` on every locale) always return
//! a "Sorry, we need to verify you're human" interstitial to any
//! client without a warm Amazon session + residential IP. Detection
//! fires immediately in [`cloud::is_bot_protected`] via the dedicated
//! Amazon heuristic, so this extractor always hits the cloud fallback
//! path in practice.
//!
//! Parsing logic works on the final HTML, local or cloud-sourced. We
//! read the product details primarily from JSON-LD `Product` blocks
//! (Amazon exposes a solid subset for SEO) plus a couple of Amazon-
//! specific DOM IDs picked up with cheap regex.
//!
//! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/`
//! path. ASINs are a stable Amazon identifier so we extract that as
//! part of the response even when everything else is empty (tells
//! callers the URL was at least recognised).

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::client::FetchClient;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "amazon_product",
    label: "Amazon product",
    description: "Returns product detail: title, brand, price, currency, availability, rating, image, ASIN. Requires WEBCLAW_API_KEY — Amazon's antibot means we always go through the cloud.",
    url_patterns: &[
        "https://www.amazon.com/dp/{ASIN}",
        "https://www.amazon.co.uk/dp/{ASIN}",
        "https://www.amazon.de/dp/{ASIN}",
        "https://www.amazon.fr/dp/{ASIN}",
        "https://www.amazon.it/dp/{ASIN}",
        "https://www.amazon.es/dp/{ASIN}",
        "https://www.amazon.co.jp/dp/{ASIN}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !is_amazon_host(host) {
        return false;
    }
    parse_asin(url).is_some()
}

pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
    let asin = parse_asin(url)
        .ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;

    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse(&fetched.html, url, &asin);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

/// Pure parser. Given HTML (from anywhere — direct, cloud, or a fixture
/// file) and the source URL, extract Amazon product detail. Returns a
/// `Value` rather than a typed struct so callers can pass it through
/// without carrying webclaw_fetch types.
pub fn parse(html: &str, url: &str, asin: &str) -> Value {
    let jsonld = find_product_jsonld(html);
    let title = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "name"))
        .or_else(|| dom_title(html));
    let image = jsonld
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| dom_image(html));
    let brand = jsonld.as_ref().and_then(get_brand);
    let description = jsonld.as_ref().and_then(|v| get_text(v, "description"));
    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
    let offer = jsonld.as_ref().and_then(first_offer);

    let sku = jsonld.as_ref().and_then(|v| get_text(v, "sku"));
    let mpn = jsonld.as_ref().and_then(|v| get_text(v, "mpn"));

    json!({
        "url": url,
        "asin": asin,
        "title": title,
        "brand": brand,
        "description": description,
        "image": image,
        "price": offer.as_ref().and_then(|o| get_text(o, "price")),
        "currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
        "availability": offer.as_ref().and_then(|o| {
            get_text(o, "availability").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "condition": offer.as_ref().and_then(|o| {
            get_text(o, "itemCondition").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "sku": sku,
        "mpn": mpn,
        "aggregate_rating": aggregate_rating,
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn is_amazon_host(host: &str) -> bool {
    host.starts_with("www.amazon.") || host.starts_with("amazon.")
}

/// Pull a 10-char ASIN out of any recognised Amazon URL shape:
/// - /dp/{ASIN}
/// - /gp/product/{ASIN}
/// - /product/{ASIN}
/// - /exec/obidos/ASIN/{ASIN}
fn parse_asin(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r"/(?:dp|gp/product|product|ASIN)/([A-Z0-9]{10})(?:[/?#]|$)").unwrap()
    });
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

// ---------------------------------------------------------------------------
// JSON-LD walkers — light reuse of ecommerce_product's style
// ---------------------------------------------------------------------------

fn find_product_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_product_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
    match t {
        Value::String(s) => is_prod(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    brand
        .as_object()
        .and_then(|o| o.get("name"))
        .and_then(|n| n.as_str())
        .map(String::from)
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn first_offer(v: &Value) -> Option<Value> {
    let offers = v.get("offers")?;
    match offers {
        Value::Array(arr) => arr.first().cloned(),
        Value::Object(_) => Some(offers.clone()),
        _ => None,
    }
}

fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "review_count": get_text(r, "reviewCount"),
        "best_rating": get_text(r, "bestRating"),
    }))
}

// ---------------------------------------------------------------------------
// DOM fallbacks — cheap regex for the two fields most likely to be
// missing from JSON-LD on Amazon.
// ---------------------------------------------------------------------------

fn dom_title(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"(?s)id="productTitle"[^>]*>([^<]+)<"#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().trim().to_string())
}

fn dom_image(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"id="landingImage"[^>]+src="([^"]+)""#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_multi_locale() {
        assert!(matches("https://www.amazon.com/dp/B0CHX1W1XY"));
        assert!(matches("https://www.amazon.co.uk/dp/B0CHX1W1XY/"));
        assert!(matches("https://www.amazon.de/dp/B0CHX1W1XY?psc=1"));
        assert!(matches(
            "https://www.amazon.com/gp/product/B0CHX1W1XY/ref=foo"
        ));
    }

    #[test]
    fn rejects_non_product_urls() {
        assert!(!matches("https://www.amazon.com/"));
        assert!(!matches("https://www.amazon.com/gp/cart"));
        assert!(!matches("https://example.com/dp/B0CHX1W1XY"));
    }

    #[test]
    fn parse_asin_extracts_from_multiple_shapes() {
        assert_eq!(
            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY/"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY?psc=1"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/gp/product/B0CHX1W1XY/ref=bar"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/exec/obidos/ASIN/B0CHX1W1XY/baz"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(parse_asin("https://www.amazon.com/"), None);
    }

    #[test]
    fn parse_extracts_from_fixture_jsonld() {
        // Minimal Amazon-style fixture with a Product JSON-LD block.
        let html = r##"
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
 "name":"ACME Widget","sku":"B0CHX1W1XY",
 "brand":{"@type":"Brand","name":"ACME"},
 "image":"https://m.media-amazon.com/images/I/abc.jpg",
 "offers":{"@type":"Offer","price":"19.99","priceCurrency":"USD",
  "availability":"https://schema.org/InStock"},
 "aggregateRating":{"@type":"AggregateRating","ratingValue":"4.6","reviewCount":"1234"}}
</script>
</head><body></body></html>"##;
        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
        assert_eq!(v["asin"], "B0CHX1W1XY");
        assert_eq!(v["title"], "ACME Widget");
        assert_eq!(v["brand"], "ACME");
        assert_eq!(v["price"], "19.99");
        assert_eq!(v["currency"], "USD");
        assert_eq!(v["availability"], "InStock");
        assert_eq!(v["aggregate_rating"]["rating_value"], "4.6");
        assert_eq!(v["aggregate_rating"]["review_count"], "1234");
    }

    #[test]
    fn parse_falls_back_to_dom_when_jsonld_missing_fields() {
        let html = r#"
<html><body>
<span id="productTitle">Fallback Title</span>
<img id="landingImage" src="https://m.media-amazon.com/images/I/fallback.jpg" />
</body></html>
"#;
        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
        assert_eq!(v["title"], "Fallback Title");
        assert_eq!(
            v["image"],
            "https://m.media-amazon.com/images/I/fallback.jpg"
        );
    }
}
crates/webclaw-fetch/src/extractors/ebay_listing.rs (new file, 337 lines)
@@ -0,0 +1,337 @@
//! eBay listing extractor.
//!
//! eBay item pages at `ebay.com/itm/{id}` and international variants
//! usually ship a `Product` JSON-LD block with title, price, currency,
//! condition, and an `AggregateOffer` when bidding. eBay applies
//! Cloudflare + custom WAF selectively — some item IDs return normal
//! HTML to the Firefox profile, others 403 / get the "Pardon our
//! interruption" page. We route through `cloud::smart_fetch_html` so
//! both paths resolve to the same parser.

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::client::FetchClient;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "ebay_listing",
    label: "eBay listing",
    description: "Returns item title, price, currency, condition, seller, shipping, and bid info. Heavy listings may need WEBCLAW_API_KEY for antibot.",
    url_patterns: &[
        "https://www.ebay.com/itm/{id}",
        "https://www.ebay.co.uk/itm/{id}",
        "https://www.ebay.de/itm/{id}",
        "https://www.ebay.fr/itm/{id}",
        "https://www.ebay.it/itm/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !is_ebay_host(host) {
        return false;
    }
    parse_item_id(url).is_some()
}

pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
    let item_id = parse_item_id(url)
        .ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?;

    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse(&fetched.html, url, &item_id);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

pub fn parse(html: &str, url: &str, item_id: &str) -> Value {
    let jsonld = find_product_jsonld(html);
    let title = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "name"))
        .or_else(|| og(html, "title"));
    let image = jsonld
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| og(html, "image"));
    let brand = jsonld.as_ref().and_then(get_brand);
    let description = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description"));
    let offer = jsonld.as_ref().and_then(first_offer);

    // eBay's AggregateOffer uses lowPrice/highPrice. Offer uses price.
    let (low_price, high_price, single_price) = match offer.as_ref() {
        Some(o) => (
            get_text(o, "lowPrice"),
            get_text(o, "highPrice"),
            get_text(o, "price"),
        ),
        None => (None, None, None),
    };
    let offer_count = offer.as_ref().and_then(|o| get_text(o, "offerCount"));

    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);

    json!({
        "url": url,
        "item_id": item_id,
        "title": title,
        "brand": brand,
        "description": description,
        "image": image,
        "price": single_price,
        "low_price": low_price,
        "high_price": high_price,
        "offer_count": offer_count,
        "currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
        "availability": offer.as_ref().and_then(|o| {
            get_text(o, "availability").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "condition": offer.as_ref().and_then(|o| {
            get_text(o, "itemCondition").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "seller": offer.as_ref().and_then(|o|
            o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from)),
        "aggregate_rating": aggregate_rating,
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn is_ebay_host(host: &str) -> bool {
    host.starts_with("www.ebay.") || host.starts_with("ebay.")
}

/// Pull the numeric item id out of `/itm/{id}` or `/itm/{slug}/{id}`
/// URLs. IDs are 10-15 digits today, but we accept any all-digit
/// trailing segment so the extractor stays forward-compatible.
fn parse_item_id(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        // /itm/(optional-slug/)?(digits)([/?#]|end)
        Regex::new(r"/itm/(?:[^/]+/)?(\d{8,})(?:[/?#]|$)").unwrap()
    });
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

// ---------------------------------------------------------------------------
// JSON-LD walkers
// ---------------------------------------------------------------------------

fn find_product_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_product_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
    match t {
        Value::String(s) => is_prod(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    brand
        .as_object()
        .and_then(|o| o.get("name"))
        .and_then(|n| n.as_str())
        .map(String::from)
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn first_offer(v: &Value) -> Option<Value> {
    let offers = v.get("offers")?;
    match offers {
        Value::Array(arr) => arr.first().cloned(),
        Value::Object(_) => Some(offers.clone()),
        _ => None,
    }
}

fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "review_count": get_text(r, "reviewCount"),
        "best_rating": get_text(r, "bestRating"),
    }))
}

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_ebay_item_urls() {
        assert!(matches("https://www.ebay.com/itm/325478156234"));
        assert!(matches(
            "https://www.ebay.com/itm/vintage-typewriter/325478156234"
        ));
        assert!(matches("https://www.ebay.co.uk/itm/325478156234"));
        assert!(!matches("https://www.ebay.com/"));
        assert!(!matches("https://www.ebay.com/sch/foo"));
        assert!(!matches("https://example.com/itm/325478156234"));
    }

    #[test]
    fn parse_item_id_handles_slugged_urls() {
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/325478156234"),
            Some("325478156234".into())
        );
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/vintage-typewriter/325478156234"),
            Some("325478156234".into())
        );
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/325478156234?hash=abc"),
            Some("325478156234".into())
        );
    }

    #[test]
    fn parse_extracts_from_fixture_jsonld() {
        let html = r##"
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
 "name":"Vintage Typewriter","sku":"TW-001",
 "brand":{"@type":"Brand","name":"Olivetti"},
 "image":"https://i.ebayimg.com/images/abc.jpg",
 "offers":{"@type":"Offer","price":"79.99","priceCurrency":"GBP",
  "availability":"https://schema.org/InStock",
  "itemCondition":"https://schema.org/UsedCondition",
  "seller":{"@type":"Person","name":"vintage_seller_99"}}}
</script>
</head></html>"##;
        let v = parse(html, "https://www.ebay.co.uk/itm/325", "325");
        assert_eq!(v["title"], "Vintage Typewriter");
        assert_eq!(v["price"], "79.99");
        assert_eq!(v["currency"], "GBP");
        assert_eq!(v["availability"], "InStock");
        assert_eq!(v["condition"], "UsedCondition");
        assert_eq!(v["seller"], "vintage_seller_99");
        assert_eq!(v["brand"], "Olivetti");
    }

    #[test]
    fn parse_handles_aggregate_offer_price_range() {
        let html = r##"
<script type="application/ld+json">
{"@type":"Product","name":"Used Copies",
 "offers":{"@type":"AggregateOffer","offerCount":"5",
  "lowPrice":"10.00","highPrice":"50.00","priceCurrency":"USD"}}
</script>
"##;
        let v = parse(html, "https://www.ebay.com/itm/1", "1");
        assert_eq!(v["low_price"], "10.00");
        assert_eq!(v["high_price"], "50.00");
        assert_eq!(v["offer_count"], "5");
        assert_eq!(v["currency"], "USD");
    }
}
@ -14,10 +14,12 @@
|
|||
//! exists (Reddit, HN/Algolia, PyPI, npm, GitHub, HuggingFace all have
|
||||
//! one). HTML extraction is the fallback for sites that don't.
|
||||
|
||||
pub mod amazon_product;
|
||||
pub mod arxiv;
|
||||
pub mod crates_io;
|
||||
pub mod dev_to;
|
||||
pub mod docker_hub;
|
||||
pub mod ebay_listing;
|
||||
pub mod ecommerce_product;
|
||||
pub mod github_pr;
|
||||
pub mod github_release;
|
||||
|
|
@ -33,12 +35,6 @@ pub mod pypi;
|
|||
pub mod reddit;
|
||||
pub mod shopify_product;
|
||||
pub mod stackoverflow;
|
||||
// `trustpilot_reviews` code lives in the tree but is not wired into the
|
||||
// catalog or dispatch: Cloudflare turnstile blocks our client at the TLS
|
||||
// layer (all browser profiles tried, all UAs, mobile + desktop). Shipping
|
||||
// it would return 403 more often than not — bad UX. When the cloud tier
|
||||
// has residential proxies or a CDP renderer, flip this back on.
|
||||
#[allow(dead_code)]
|
||||
pub mod trustpilot_reviews;
|
||||
|
||||
use serde::Serialize;
|
||||
|
|
@ -84,6 +80,9 @@ pub fn list() -> Vec<ExtractorInfo> {
|
|||
instagram_profile::INFO,
|
||||
shopify_product::INFO,
|
||||
ecommerce_product::INFO,
|
||||
amazon_product::INFO,
|
||||
ebay_listing::INFO,
|
||||
trustpilot_reviews::INFO,
|
||||
]
|
||||
}
|
||||
|
||||
|
|
@@ -209,6 +208,31 @@ pub async fn dispatch_by_url(
                 .map(|v| (instagram_profile::INFO.name, v)),
         );
     }
+    // Antibot-gated verticals with unique hosts: safe to auto-dispatch
+    // because the matcher can't confuse the URL for anything else. The
+    // extractor's smart_fetch_html path handles the blocked-without-
+    // API-key case with a clear actionable error.
+    if amazon_product::matches(url) {
+        return Some(
+            amazon_product::extract(client, url)
+                .await
+                .map(|v| (amazon_product::INFO.name, v)),
+        );
+    }
+    if ebay_listing::matches(url) {
+        return Some(
+            ebay_listing::extract(client, url)
+                .await
+                .map(|v| (ebay_listing::INFO.name, v)),
+        );
+    }
+    if trustpilot_reviews::matches(url) {
+        return Some(
+            trustpilot_reviews::extract(client, url)
+                .await
+                .map(|v| (trustpilot_reviews::INFO.name, v)),
+        );
+    }
     // NOTE: shopify_product and ecommerce_product are intentionally NOT
     // in auto-dispatch. Their `matches()` functions are permissive
     // (any URL with `/products/`, `/product/`, `/p/`, etc.) and
|
@@ -333,6 +357,24 @@ pub async fn dispatch_by_name(
             })
             .await
         }
+        n if n == amazon_product::INFO.name => {
+            run_or_mismatch(amazon_product::matches(url), n, url, || {
+                amazon_product::extract(client, url)
+            })
+            .await
+        }
+        n if n == ebay_listing::INFO.name => {
+            run_or_mismatch(ebay_listing::matches(url), n, url, || {
+                ebay_listing::extract(client, url)
+            })
+            .await
+        }
+        n if n == trustpilot_reviews::INFO.name => {
+            run_or_mismatch(trustpilot_reviews::matches(url), n, url, || {
+                trustpilot_reviews::extract(client, url)
+            })
+            .await
+        }
        _ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())),
     }
 }
@@ -1,16 +1,18 @@
 //! Trustpilot company reviews extractor.
 //!
-//! Trustpilot pages at `trustpilot.com/review/{domain}` embed a rich
-//! JSON-LD `LocalBusiness` / `Organization` block with aggregate
-//! rating + up to 20 recent reviews. No auth, no antibot for the
-//! page HTML itself.
-//!
-//! Auto-dispatch safe because the host is unique.
+//! `trustpilot.com/review/{domain}` pages embed a JSON-LD
+//! `Organization` / `LocalBusiness` block with aggregate rating + up
+//! to 20 recent reviews. The page HTML itself is usually behind AWS
+//! WAF's "Verifying Connection" interstitial — so this extractor
+//! always uses [`cloud::smart_fetch_html`] and only returns data when
+//! the caller has `WEBCLAW_API_KEY` set (cloud handles the bypass).
+//! OSS users without a key get a clear error pointing at signup.
 
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
 use crate::client::FetchClient;
+use crate::cloud::{self, CloudError};
 use crate::error::FetchError;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
|
@@ -29,15 +31,22 @@ pub fn matches(url: &str) -> bool {
 }
 
 pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
-    let resp = client.fetch(url).await?;
-    if !(200..300).contains(&resp.status) {
-        return Err(FetchError::Build(format!(
-            "trustpilot_reviews: status {} for {url}",
-            resp.status
-        )));
-    }
+    // Trustpilot is always behind AWS WAF, so we go through smart_fetch
+    // which tries local first (which will hit the challenge interstitial),
+    // detects it, and escalates to cloud /v1/scrape for the real HTML.
+    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
+        .await
+        .map_err(cloud_to_fetch_err)?;
 
-    let blocks = webclaw_core::structured_data::extract_json_ld(&resp.html);
+    let html = parse(&fetched.html, url)?;
+    Ok(html_with_source(html, fetched.source))
+}
+
+/// Run the pure parser on already-fetched HTML. Split out so the cloud
+/// pipeline can call it directly after its own antibot-aware fetch
+/// without going through [`extract`].
+pub fn parse(html: &str, url: &str) -> Result<Value, FetchError> {
+    let blocks = webclaw_core::structured_data::extract_json_ld(html);
     let business = find_business(&blocks).ok_or_else(|| {
         FetchError::BodyDecode(format!(
             "trustpilot_reviews: no Organization/LocalBusiness JSON-LD on {url}"
|
@@ -94,6 +103,26 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
     }))
 }
 
+fn cloud_to_fetch_err(e: CloudError) -> FetchError {
+    FetchError::Build(e.to_string())
+}
+
+/// Stamp `data_source` onto the parser output so callers can tell at a
+/// glance whether this row came from local or cloud. Useful for UX and
+/// for pricing-aware pipelines.
+fn html_with_source(mut v: Value, source: cloud::FetchSource) -> Value {
+    if let Some(obj) = v.as_object_mut() {
+        obj.insert(
+            "data_source".into(),
+            match source {
+                cloud::FetchSource::Local => json!("local"),
+                cloud::FetchSource::Cloud => json!("cloud"),
+            },
+        );
+    }
+    v
+}
+
 // ---------------------------------------------------------------------------
 // JSON-LD walker — same pattern as ecommerce_product
 // ---------------------------------------------------------------------------
@@ -1,7 +1,24 @@
 //! Shared application state. Cheap to clone via Arc; held by the axum
 //! Router for the life of the process.
+//!
+//! Two unrelated keys get carried here:
+//!
+//! 1. [`AppState::api_key`] — the **bearer token clients must present**
+//!    to call this server. Set via `WEBCLAW_API_KEY` / `--api-key`.
+//!    Unset = open mode.
+//! 2. The inner [`webclaw_fetch::cloud::CloudClient`] (if any) — our
+//!    **outbound** credential for api.webclaw.io, used by extractors
+//!    that escalate on antibot. Set via `WEBCLAW_CLOUD_API_KEY`.
+//!    Unset = hard-site extractors return a "set WEBCLAW_CLOUD_API_KEY"
+//!    error with a signup link.
+//!
+//! Different variables on purpose: conflating the two means operators
+//! who want their server behind an auth token can't also enable cloud
+//! fallback, and vice versa.
 
 use std::sync::Arc;
+use tracing::info;
+use webclaw_fetch::cloud::CloudClient;
 use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig};
 
 /// Single-process state shared across all request handlers.
|
@@ -17,6 +34,7 @@ struct Inner {
     /// auto-deref `&Arc<FetchClient>` -> `&FetchClient`, so this costs
     /// them nothing.
     pub fetch: Arc<FetchClient>,
+    /// Inbound bearer-auth token for this server's own `/v1/*` surface.
     pub api_key: Option<String>,
 }
 
|
@@ -24,17 +42,34 @@ impl AppState {
     /// Build the application state. The fetch client is constructed once
     /// and shared across requests so connection pools + browser profile
     /// state don't churn per request.
-    pub fn new(api_key: Option<String>) -> anyhow::Result<Self> {
+    ///
+    /// `inbound_api_key` is the bearer token clients must present;
+    /// cloud-fallback credentials come from the env (checked here).
+    pub fn new(inbound_api_key: Option<String>) -> anyhow::Result<Self> {
         let config = FetchConfig {
             browser: BrowserProfile::Firefox,
             ..FetchConfig::default()
         };
-        let fetch = FetchClient::new(config)
+        let mut fetch = FetchClient::new(config)
             .map_err(|e| anyhow::anyhow!("failed to build fetch client: {e}"))?;
 
+        // Cloud fallback: only activates when the operator has provided
+        // an api.webclaw.io key. Supports both WEBCLAW_CLOUD_API_KEY
+        // (preferred, disambiguates from the inbound-auth key) and
+        // WEBCLAW_API_KEY as a fallback when there's no inbound key
+        // configured (backwards compat with MCP / CLI conventions).
+        if let Some(cloud) = build_cloud_client(inbound_api_key.as_deref()) {
+            info!(
+                base = cloud.base_url(),
+                "cloud fallback enabled — antibot-protected sites will escalate via api.webclaw.io"
+            );
+            fetch = fetch.with_cloud(cloud);
+        }
+
         Ok(Self {
             inner: Arc::new(Inner {
                 fetch: Arc::new(fetch),
-                api_key,
+                api_key: inbound_api_key,
             }),
         })
     }
|
@@ -47,3 +82,26 @@ impl AppState {
         self.inner.api_key.as_deref()
     }
 }
+
+/// Resolve the outbound cloud key. Prefers `WEBCLAW_CLOUD_API_KEY`;
+/// falls back to `WEBCLAW_API_KEY` *only* when no inbound key is
+/// configured (i.e. open mode — the same env var can't mean two
+/// things to one process).
+fn build_cloud_client(inbound_api_key: Option<&str>) -> Option<CloudClient> {
+    let cloud_key = std::env::var("WEBCLAW_CLOUD_API_KEY").ok();
+    if let Some(k) = cloud_key.as_deref()
+        && !k.trim().is_empty()
+    {
+        return Some(CloudClient::with_key(k));
+    }
+    // Reuse WEBCLAW_API_KEY only when not also acting as our own
+    // inbound-auth token — otherwise we'd be telling the operator
+    // they can't have both.
+    if inbound_api_key.is_none()
+        && let Ok(k) = std::env::var("WEBCLAW_API_KEY")
+        && !k.trim().is_empty()
+    {
+        return Some(CloudClient::with_key(k));
+    }
+    None
+}