mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-05-29 20:45:12 +02:00
feat(extractors): wave 5 - Amazon, eBay, Trustpilot via cloud fallback

Three hard-site extractors that all require antibot bypass to ever return usable data. They ship in OSS so the parsers + schema live with the rest of the vertical extractors, but the fetch path routes through cloud::smart_fetch_html, meaning:

- With WEBCLAW_CLOUD_API_KEY configured on webclaw-server (or WEBCLAW_API_KEY in MCP / CLI), local fetch is tried first; on challenge-page detection we escalate to api.webclaw.io/v1/scrape with formats=['html'] and parse the antibot-bypassed HTML locally.
- Without a cloud key, callers get a typed CloudError::NotConfigured whose Display message points at https://webclaw.io/signup. Self-hosters without a webclaw.io account know exactly what to do.

## New extractors (all auto-dispatched, unique hosts)

- amazon_product: ASIN extraction from /dp/, /gp/product/, /product/, /exec/obidos/ASIN/ URL shapes across every amazon.* locale. Parses the Product JSON-LD Amazon ships for SEO; falls back to #productTitle and #landingImage DOM selectors when JSON-LD is absent. Returns price, currency, availability, condition, brand, image, aggregate rating, SKU / MPN.
- ebay_listing: item-id extraction from /itm/{id} and /itm/{slug}/{id} URLs across ebay.com / .co.uk / .de / .fr / .it. Parses both bare Offer (Buy It Now) and AggregateOffer (used-copies / auctions) from the Product JSON-LD. Returns price or low/high-price range, currency, condition, seller, offer_count, aggregate rating.
- trustpilot_reviews: reactivated from the `trustpilot_reviews` file that was previously dead-code'd. Parser already worked; it just needed the smart_fetch_html path to get past AWS WAF's 'Verifying Connection' interstitial. Organisation / LocalBusiness JSON-LD block gives aggregate rating + up to 20 recent reviews.

## FetchClient change

- Added optional `cloud: Option<Arc<CloudClient>>` field with `FetchClient::with_cloud(cloud) -> Self` builder + `cloud(&self)` accessor. Extractors call client.cloud() to decide whether they can escalate. Cheap clones (Arc-wrapped).

## webclaw-server wiring

AppState::new() now reads the cloud credential from env:

1. WEBCLAW_CLOUD_API_KEY: preferred, disambiguates from the server's own inbound bearer token.
2. WEBCLAW_API_KEY: fallback only when the server is in open mode (no inbound-auth key set), matching the MCP / CLI convention of that env var.

When present, state.rs builds a CloudClient and attaches it to the FetchClient via with_cloud(). Log line at startup so operators see when cloud fallback is active.

## Catalog + dispatch

All three extractors registered in list() and in dispatch_by_url. /v1/extractors catalog now exposes 22 verticals. Explicit /v1/scrape/{vertical} routes work per the existing pattern.

## Tests

- 7 new unit tests (parse_asin multi-shape + parse from JSON-LD fixture + DOM-fallback on missing JSON-LD for Amazon; ebay URL-matching + slugged-URL parsing + both Offer and AggregateOffer fixtures).
- Full extractors suite: 68 passing (was 59, +9 from the new files).
- fmt + clippy clean.
- No live-test story for these three inside CI: verifying them means having WEBCLAW_CLOUD_API_KEY set against a real cloud backend. Integration-test harness is a separate follow-up.

Catalog summary: 22 verticals total across wave 1-5. Hard-site three are gated behind an actionable cloud-fallback upgrade path rather than silently returning nothing or 403-ing the caller.
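
To make the ASIN URL shapes listed under "New extractors" concrete, here is a minimal sketch of how such a matcher could look. It is not the commit's `parse_asin`; the function name `parse_asin_sketch`, the assumption that an ASIN is exactly 10 ASCII alphanumerics, and the choice to ignore the host entirely are all illustrative assumptions.

```rust
/// Hypothetical helper (not the extractor's real `parse_asin`): pull an
/// ASIN out of an Amazon product URL using the four path shapes named in
/// the commit message.
/// Assumption: an ASIN is exactly 10 ASCII alphanumeric characters.
/// Assumption: the caller has already checked that the host is amazon.*.
fn parse_asin_sketch(url: &str) -> Option<String> {
    // Drop the query string and fragment so they never end up in the candidate.
    let path = url.split(|c: char| c == '?' || c == '#').next().unwrap_or(url);

    // Longer markers come first so "/gp/product/" is tried before "/product/".
    const MARKERS: [&str; 4] = ["/exec/obidos/ASIN/", "/gp/product/", "/product/", "/dp/"];

    for marker in MARKERS.iter() {
        if let Some(idx) = path.find(*marker) {
            // The ASIN is the path segment immediately after the marker.
            let rest = &path[idx + marker.len()..];
            let candidate = rest.split('/').next().unwrap_or("");
            if candidate.len() == 10 && candidate.chars().all(|c| c.is_ascii_alphanumeric()) {
                return Some(candidate.to_ascii_uppercase());
            }
        }
    }
    None
}
```

Calling `parse_asin_sketch("https://www.amazon.de/gp/product/B08N5WRWNW/ref=abc?x=1")` yields `Some("B08N5WRWNW")`, while a non-product URL such as `https://www.amazon.com/gp/help/customer` yields `None`. Matching on the path alone keeps the sketch locale-agnostic, which is one plausible way to cover every amazon.* host.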
parent 0ab891bd6b
commit d8c9274a9c
6 changed files with 884 additions and 24 deletions
@@ -1,7 +1,24 @@
 //! Shared application state. Cheap to clone via Arc; held by the axum
 //! Router for the life of the process.
+//!
+//! Two unrelated keys get carried here:
+//!
+//! 1. [`AppState::api_key`] — the **bearer token clients must present**
+//! to call this server. Set via `WEBCLAW_API_KEY` / `--api-key`.
+//! Unset = open mode.
+//! 2. The inner [`webclaw_fetch::cloud::CloudClient`] (if any) — our
+//! **outbound** credential for api.webclaw.io, used by extractors
+//! that escalate on antibot. Set via `WEBCLAW_CLOUD_API_KEY`.
+//! Unset = hard-site extractors return a "set WEBCLAW_CLOUD_API_KEY"
+//! error with a signup link.
+//!
+//! Different variables on purpose: conflating the two means operators
+//! who want their server behind an auth token can't also enable cloud
+//! fallback, and vice versa.
 
 use std::sync::Arc;
+use tracing::info;
+use webclaw_fetch::cloud::CloudClient;
 use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig};
 
 /// Single-process state shared across all request handlers.
@@ -17,6 +34,7 @@ struct Inner {
     /// auto-deref `&Arc<FetchClient>` -> `&FetchClient`, so this costs
     /// them nothing.
     pub fetch: Arc<FetchClient>,
+    /// Inbound bearer-auth token for this server's own `/v1/*` surface.
     pub api_key: Option<String>,
 }
 
@@ -24,17 +42,34 @@ impl AppState {
     /// Build the application state. The fetch client is constructed once
     /// and shared across requests so connection pools + browser profile
     /// state don't churn per request.
-    pub fn new(api_key: Option<String>) -> anyhow::Result<Self> {
+    ///
+    /// `inbound_api_key` is the bearer token clients must present;
+    /// cloud-fallback credentials come from the env (checked here).
+    pub fn new(inbound_api_key: Option<String>) -> anyhow::Result<Self> {
         let config = FetchConfig {
             browser: BrowserProfile::Firefox,
             ..FetchConfig::default()
         };
-        let fetch = FetchClient::new(config)
+        let mut fetch = FetchClient::new(config)
             .map_err(|e| anyhow::anyhow!("failed to build fetch client: {e}"))?;
+
+        // Cloud fallback: only activates when the operator has provided
+        // an api.webclaw.io key. Supports both WEBCLAW_CLOUD_API_KEY
+        // (preferred, disambiguates from the inbound-auth key) and
+        // WEBCLAW_API_KEY as a fallback when there's no inbound key
+        // configured (backwards compat with MCP / CLI conventions).
+        if let Some(cloud) = build_cloud_client(inbound_api_key.as_deref()) {
+            info!(
+                base = cloud.base_url(),
+                "cloud fallback enabled — antibot-protected sites will escalate via api.webclaw.io"
+            );
+            fetch = fetch.with_cloud(cloud);
+        }
+
         Ok(Self {
             inner: Arc::new(Inner {
                 fetch: Arc::new(fetch),
-                api_key,
+                api_key: inbound_api_key,
             }),
         })
     }
@@ -47,3 +82,26 @@ impl AppState {
         self.inner.api_key.as_deref()
     }
 }
+
+/// Resolve the outbound cloud key. Prefers `WEBCLAW_CLOUD_API_KEY`;
+/// falls back to `WEBCLAW_API_KEY` *only* when no inbound key is
+/// configured (i.e. open mode — the same env var can't mean two
+/// things to one process).
+fn build_cloud_client(inbound_api_key: Option<&str>) -> Option<CloudClient> {
+    let cloud_key = std::env::var("WEBCLAW_CLOUD_API_KEY").ok();
+    if let Some(k) = cloud_key.as_deref()
+        && !k.trim().is_empty()
+    {
+        return Some(CloudClient::with_key(k));
+    }
+    // Reuse WEBCLAW_API_KEY only when not also acting as our own
+    // inbound-auth token — otherwise we'd be telling the operator
+    // they can't have both.
+    if inbound_api_key.is_none()
+        && let Ok(k) = std::env::var("WEBCLAW_API_KEY")
+        && !k.trim().is_empty()
+    {
+        return Some(CloudClient::with_key(k));
+    }
+    None
+}
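
The precedence rule in `build_cloud_client` reads from the process environment, which makes it awkward to exercise in isolation. As a rough illustration only, the same decision can be written as a pure function over already-read values. The names `KeySource`, `non_blank`, and `resolve_cloud_key` are hypothetical and do not appear in the commit; the sketch returns the chosen key instead of building a `CloudClient`.

```rust
/// Hypothetical: which variable supplied the outbound cloud key.
#[derive(Debug, PartialEq)]
enum KeySource {
    /// Value came from WEBCLAW_CLOUD_API_KEY.
    CloudVar,
    /// Value came from WEBCLAW_API_KEY (open mode only).
    SharedVar,
}

/// Treat unset, empty, and whitespace-only values the same way.
fn non_blank(value: Option<&str>) -> Option<&str> {
    value.filter(|k| !k.trim().is_empty())
}

/// Pure restatement of the precedence described above (a sketch, not the
/// commit's code). `cloud_var` and `shared_var` stand in for the values of
/// WEBCLAW_CLOUD_API_KEY and WEBCLAW_API_KEY as read by the caller.
fn resolve_cloud_key<'a>(
    cloud_var: Option<&'a str>,
    shared_var: Option<&'a str>,
    inbound_api_key: Option<&str>,
) -> Option<(KeySource, &'a str)> {
    // 1. The dedicated variable always wins when it holds a real value.
    if let Some(k) = non_blank(cloud_var) {
        return Some((KeySource::CloudVar, k));
    }
    // 2. The shared variable is reused only in open mode, i.e. when it is
    //    not already serving as the inbound bearer token.
    if inbound_api_key.is_none() {
        if let Some(k) = non_blank(shared_var) {
            return Some((KeySource::SharedVar, k));
        }
    }
    // 3. Otherwise cloud fallback stays disabled.
    None
}
```

With an inbound key configured and only the shared variable set, `resolve_cloud_key(None, Some("s"), Some("inbound"))` returns `None`, which corresponds to the case where hard-site extractors report that `WEBCLAW_CLOUD_API_KEY` is needed. Separating the decision from `std::env::var` is a design choice of this sketch that avoids mutating process-wide state in tests.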