diff --git a/CHANGELOG.md b/CHANGELOG.md index 4069d54..ef2d2f2 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,57 @@ All notable changes to webclaw are documented here. Format follows [Keep a Changelog](https://keepachangelog.com/). +## [0.5.2] — 2026-04-22 + +### Added +- **`webclaw vertical ` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`. URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1. + +- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick. + +- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set. + +### Changed +- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded 10). Also `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`. + +--- + +## [0.5.1] — 2026-04-22 + +### Added +- **`webclaw_fetch::Fetcher` trait.** Vertical extractors now consume `&dyn Fetcher` instead of `&FetchClient` directly. The trait exposes three methods (`fetch`, `fetch_with_headers`, `cloud`) covering everything extractors need. 
Callers that already held a `FetchClient` keep working unchanged: `FetchClient` implements `Fetcher`, blanket impls cover `&T` and `Arc`, so `&client` coerces to `&dyn Fetcher` automatically. + + The motivation is the split between OSS (wreq-backed, in-process TLS fingerprinting) and the production API server at api.webclaw.io (which cannot use in-process fingerprinting per the architecture rule, and must delegate HTTP through the Go tls-sidecar). Before this trait, adding vertical routes to the production server would have required importing wreq into its dependency graph, violating the separation. Now the production server can provide its own `TlsSidecarFetcher` implementation and pass it to the same extractor dispatcher the OSS server uses. + + Backwards compatible. No behavior change for CLI, MCP, or OSS self-host. + +### Changed +- All 28 extractor `extract()` signatures migrated from `client: &FetchClient` to `client: &dyn Fetcher`. The dispatcher functions (`extractors::dispatch_by_url`, `extractors::dispatch_by_name`) and the cloud escalation helpers (`cloud::smart_fetch`, `cloud::smart_fetch_html`) follow the same change. Tests and call sites are unchanged because `&FetchClient` auto-coerces. + +--- + +## [0.5.0] — 2026-04-22 + +### Added +- **28 vertical extractors that return typed JSON instead of generic markdown.** New `webclaw_fetch::extractors` module with one extractor per site. Dev: reddit, hackernews, github_repo / github_pr / github_issue / github_release, crates_io, pypi, npm. AI/ML: huggingface_model, huggingface_dataset, arxiv, docker_hub. Writing: dev_to, stackoverflow, youtube_video. Social: linkedin_post, instagram_post, instagram_profile. Ecommerce: shopify_product, shopify_collection, ecommerce_product (generic Schema.org), woocommerce_product, amazon_product, ebay_listing, etsy_listing. Reviews: trustpilot_reviews, substack_post. 
Each extractor claims a URL pattern via a public `matches()` fn and returns a typed JSON payload with the fields callers actually want (title, price, author, rating, review count, etc.) rather than a markdown blob. +- **`POST /v1/scrape/{vertical}` on `webclaw-server` for explicit vertical routing.** Picks the parser by name, validates the URL plausibly belongs to that vertical, returns the same shape as `POST /v1/scrape` but typed. 23 of 28 verticals also auto-dispatch from a plain `POST /v1/scrape` because their URL shapes are unique enough to claim safely; the remaining 5 (`shopify_product`, `shopify_collection`, `ecommerce_product`, `woocommerce_product`, `substack_post`) use patterns that non-target sites share, so callers opt in via the `{vertical}` route. +- **`GET /v1/extractors` on `webclaw-server`.** Returns the full catalog as `{"extractors": [{"name": "...", "label": "...", "description": "...", "url_patterns": [...]}, ...]}` so clients can build tooling / autocomplete / user-facing docs off a live source. +- **Antibot cloud-escalation for 5 ecommerce + reviews verticals.** Amazon, eBay, Etsy, Trustpilot, and Substack (as HTML fallback) go through `cloud::smart_fetch_html`: try local fetch first; on bot-protection detection (Cloudflare challenge, DataDome, AWS WAF "Verifying your connection", etc.) escalate to `api.webclaw.io/v1/scrape`. Without `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` the extractor returns a typed `CloudError::NotConfigured` with an actionable signup link. With a key set, escalation is automatic. Every extractor stamps a `data_source: "local" | "cloud"` field on the response so callers can tell which path ran. +- **`cloud::synthesize_html` for cloud-bypassed extraction.** `api.webclaw.io/v1/scrape` deliberately does not return raw HTML; it returns a parsed bundle (`structured_data` JSON-LD blocks + `metadata` OG/meta tags + `markdown`). 
The new helper reassembles that bundle back into a minimal synthetic HTML doc (JSON-LD as `\n"); + } + } + } + + out.push_str("\n"); + + // Markdown body → plaintext in . Extractors that regex over + //
IDs won't hit here, but they won't hit on local cloud + // bypass either. OK to keep minimal. + if let Some(md) = resp.get("markdown").and_then(|v| v.as_str()) { + out.push_str("
");
+        out.push_str(&html_escape_text(md));
+        out.push_str("
\n"); + } + + out.push_str(""); + out +} + +fn html_escape_attr(s: &str) -> String { + s.replace('&', "&amp;") + .replace('"', "&quot;") + .replace('<', "&lt;") + .replace('>', "&gt;") +} + +fn html_escape_text(s: &str) -> String { + s.replace('&', "&amp;") + .replace('<', "&lt;") + .replace('>', "&gt;") +} + +async fn parse_cloud_response(resp: reqwest::Response) -> Result<Value, CloudError> { + let status = resp.status(); + if status.is_success() { + return resp + .json() + .await + .map_err(|e| CloudError::ParseFailed(e.to_string())); + } + let body = resp.text().await.unwrap_or_default(); + Err(CloudError::from_status_and_body(status.as_u16(), body)) +} + +// --------------------------------------------------------------------------- +// Detection +// --------------------------------------------------------------------------- + +/// True when a fetched response body is actually a bot-protection +/// challenge page rather than the content the caller asked for. +/// +/// Conservative — only fires on patterns that indicate the *entire* +/// page is a challenge, not embedded CAPTCHAs on a real content page. +pub fn is_bot_protected(html: &str, headers: &HeaderMap) -> bool { + let html_lower = html.to_lowercase(); + + // Cloudflare challenge page. + if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") { + return true; + } + + // Cloudflare "Just a moment" / "Checking your browser" interstitial. + if (html_lower.contains("just a moment") || html_lower.contains("checking your browser")) + && html_lower.contains("cf-spinner") + { + return true; + } + + // Cloudflare Turnstile. Only counts when the page is small — + // legitimate pages embed Turnstile for signup forms etc. + if (html_lower.contains("cf-turnstile") + || html_lower.contains("challenges.cloudflare.com/turnstile")) + && html.len() < 100_000 + { + return true; + } + + // DataDome. 
+ if html_lower.contains("geo.captcha-delivery.com") + || html_lower.contains("captcha-delivery.com/captcha") + { + return true; + } + + // AWS WAF. + if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") { + return true; + } + + // AWS WAF "Verifying your connection" interstitial (used by Trustpilot). + // Distinct from the captcha-branded path above: the challenge page is + // a tiny HTML shell with an `interstitial-spinner` div and no content. + // Gating on html.len() keeps false-positives off long pages that + // happen to mention the phrase in an unrelated context. + if html_lower.contains("interstitial-spinner") + && html_lower.contains("verifying your connection") + && html.len() < 10_000 + { + return true; + } + + // hCaptcha *blocking* page (not just an embedded widget). + if html_lower.contains("hcaptcha.com") + && html_lower.contains("h-captcha") + && html.len() < 50_000 + { + return true; + } + + // Cloudflare via response headers + challenge body. + let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some(); + if has_cf_headers + && (html_lower.contains("just a moment") || html_lower.contains("checking your browser")) + { + return true; + } + + false +} + +/// True when a page likely needs JS rendering — a large HTML document +/// with almost no extractable text + an SPA framework signature. +pub fn needs_js_rendering(word_count: usize, html: &str) -> bool { + let has_scripts = html.contains(" 5_000 && has_scripts { + return true; + } + + // Tier 2: SPA framework markers + low content-to-HTML ratio. 
+ if word_count < 800 && html.len() > 50_000 && has_scripts { + let html_lower = html.to_lowercase(); + let has_spa_marker = html_lower.contains("react-app") + || html_lower.contains("id=\"__next\"") + || html_lower.contains("id=\"root\"") + || html_lower.contains("id=\"app\"") + || html_lower.contains("__next_data__") + || html_lower.contains("nuxt") + || html_lower.contains("ng-app"); + if has_spa_marker { + return true; + } + } + + false +} + +// --------------------------------------------------------------------------- +// Smart-fetch: classic flow for MCP / CLI (returns either an extraction +// or raw cloud JSON) +// --------------------------------------------------------------------------- + +/// Result of [`smart_fetch`]: either a local extraction or the raw +/// cloud API response when we escalated. +pub enum SmartFetchResult { + Local(Box), + Cloud(Value), +} + +/// Try local fetch + extract first. On bot protection or detected +/// JS-render, fall back to `cloud.scrape(...)` with the caller's +/// formats. Returns `Err(String)` so existing call sites that expect +/// stringified errors keep compiling. +/// +/// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed +/// [`CloudError`] so you can render precise UX. +pub async fn smart_fetch( + client: &dyn crate::fetcher::Fetcher, + cloud: Option<&CloudClient>, + url: &str, + include_selectors: &[String], + exclude_selectors: &[String], + only_main_content: bool, + formats: &[&str], +) -> Result { + let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url)) + .await + .map_err(|_| format!("Fetch timed out after 30s for {url}"))? 
+ .map_err(|e| format!("Fetch failed: {e}"))?; + + if is_bot_protected(&fetch_result.html, &fetch_result.headers) { + info!(url, "bot protection detected, falling back to cloud API"); + return cloud_scrape_fallback( + cloud, + url, + include_selectors, + exclude_selectors, + only_main_content, + formats, + ) + .await; + } + + let options = webclaw_core::ExtractionOptions { + include_selectors: include_selectors.to_vec(), + exclude_selectors: exclude_selectors.to_vec(), + only_main_content, + include_raw_html: false, + }; + let extraction = + webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options) + .map_err(|e| format!("Extraction failed: {e}"))?; + + if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) { + info!( + url, + word_count = extraction.metadata.word_count, + html_len = fetch_result.html.len(), + "JS-rendered page detected, falling back to cloud API" + ); + return cloud_scrape_fallback( + cloud, + url, + include_selectors, + exclude_selectors, + only_main_content, + formats, + ) + .await; + } + + Ok(SmartFetchResult::Local(Box::new(extraction))) +} + +async fn cloud_scrape_fallback( + cloud: Option<&CloudClient>, + url: &str, + include_selectors: &[String], + exclude_selectors: &[String], + only_main_content: bool, + formats: &[&str], +) -> Result { + let Some(c) = cloud else { + return Err(CloudError::NotConfigured.to_string()); + }; + let resp = c + .scrape( + url, + formats, + include_selectors, + exclude_selectors, + only_main_content, + ) + .await + .map_err(|e| e.to_string())?; + info!(url, "cloud API fallback successful"); + Ok(SmartFetchResult::Cloud(resp)) +} + +// --------------------------------------------------------------------------- +// Smart-fetch-HTML: for vertical extractors +// --------------------------------------------------------------------------- + +/// Where the HTML ultimately came from — useful for callers that want +/// to track "did we fall back?" for logging or pricing. 
+#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum FetchSource { + Local, + Cloud, +} + +/// Antibot-aware HTML fetch result. The `html` field is always populated. +pub struct FetchedHtml { + pub html: String, + pub final_url: String, + pub source: FetchSource, +} + +/// Try local fetch; on bot protection, escalate to the cloud's +/// `/v1/scrape` with `formats=["html"]` and return the raw HTML. +/// +/// Designed for the vertical-extractor pattern where the caller has +/// its own parser and just needs bytes. +pub async fn smart_fetch_html( + client: &dyn crate::fetcher::Fetcher, + cloud: Option<&CloudClient>, + url: &str, +) -> Result { + let resp = client + .fetch(url) + .await + .map_err(|e| CloudError::Network(e.to_string()))?; + + if !is_bot_protected(&resp.html, &resp.headers) { + return Ok(FetchedHtml { + html: resp.html, + final_url: resp.url, + source: FetchSource::Local, + }); + } + + let Some(c) = cloud else { + warn!(url, "bot protection detected + no cloud client configured"); + return Err(CloudError::NotConfigured); + }; + debug!(url, "bot protection detected, escalating to cloud"); + let html = c.fetch_html(url).await?; + Ok(FetchedHtml { + html, + final_url: url.to_string(), + source: FetchSource::Cloud, + }) +} + +// --------------------------------------------------------------------------- +// Tests +// --------------------------------------------------------------------------- + +#[cfg(test)] +mod tests { + use super::*; + + fn empty_headers() -> HeaderMap { + HeaderMap::new() + } + + // --- detectors ---------------------------------------------------------- + + #[test] + fn is_bot_protected_detects_cloudflare_challenge() { + let html = "_cf_chl_opt loaded"; + assert!(is_bot_protected(html, &empty_headers())); + } + + #[test] + fn is_bot_protected_detects_turnstile_on_short_page() { + let html = "
"; + assert!(is_bot_protected(html, &empty_headers())); + } + + #[test] + fn is_bot_protected_ignores_turnstile_on_real_content() { + let html = format!( + "{}
", + "lots of real content ".repeat(8_000) + ); + assert!(!is_bot_protected(&html, &empty_headers())); + } + + #[test] + fn is_bot_protected_detects_aws_waf_verifying_connection() { + // The exact shape Trustpilot serves under AWS WAF. + let html = r#"
+
+

Verifying your connection...

"#; + assert!(is_bot_protected(html, &empty_headers())); + } + + #[test] + fn synthesize_html_embeds_jsonld_and_og_tags() { + let resp = json!({ + "url": "https://example.com/p/1", + "metadata": { + "title": "My Product", + "description": "A nice thing.", + "image": "https://cdn.example.com/1.jpg", + "site_name": "Example Shop" + }, + "structured_data": [ + {"@context":"https://schema.org","@type":"Product", + "name":"Widget","offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}} + ], + "markdown": "# Widget\n\nA nice widget." + }); + let html = synthesize_html(&resp); + // OG tags from metadata. + assert!(html.contains(r#""#)); + assert!( + html.contains(r#""#) + ); + // JSON-LD block preserved losslessly. + assert!(html.contains(r#"".repeat(500) + ); + assert!(needs_js_rendering(10, &html)); + } + + #[test] + fn needs_js_rendering_passes_real_article() { + let html = format!( + "{}", + "Real article text ".repeat(5_000) + ); + assert!(!needs_js_rendering(5_000, &html)); + } + + // --- CloudError mapping ------------------------------------------------- + + #[test] + fn cloud_error_maps_401() { + let e = CloudError::from_status_and_body(401, "invalid key".into()); + assert!(matches!(e, CloudError::Unauthorized)); + assert!(e.to_string().contains(KEYS_URL)); + } + + #[test] + fn cloud_error_maps_402() { + let e = CloudError::from_status_and_body(402, "{}".into()); + assert!(matches!(e, CloudError::InsufficientPlan)); + assert!(e.to_string().contains(PRICING_URL)); + } + + #[test] + fn cloud_error_maps_429() { + let e = CloudError::from_status_and_body(429, "slow down".into()); + assert!(matches!(e, CloudError::RateLimited)); + assert!(e.to_string().contains(PRICING_URL)); + } + + #[test] + fn cloud_error_maps_generic_5xx() { + let e = CloudError::from_status_and_body(503, "x".repeat(2000)); + match e { + CloudError::ServerError { status, body } => { + assert_eq!(status, 503); + assert!(body.len() <= 500); + } + _ => panic!("expected ServerError"), + } + } 
+ + #[test] + fn not_configured_error_points_at_signup() { + let msg = CloudError::NotConfigured.to_string(); + assert!(msg.contains(SIGNUP_URL)); + assert!(msg.contains("WEBCLAW_API_KEY")); + } + + // --- CloudClient construction ------------------------------------------ + + #[test] + fn cloud_client_explicit_key_wins_over_env() { + // SAFETY: this test mutates process env. Serial tests only. + // Set env to something, pass an explicit key, explicit should win. + // (We don't actually *call* the API, just check the struct stored + // the right key.) + // rustc std::env::set_var is unsafe in newer toolchains. + unsafe { + std::env::set_var("WEBCLAW_API_KEY", "from-env"); + } + let client = CloudClient::new(Some("from-flag")).expect("client built"); + assert_eq!(client.api_key, "from-flag"); + unsafe { + std::env::remove_var("WEBCLAW_API_KEY"); + } + } + + #[test] + fn cloud_client_none_when_empty() { + unsafe { + std::env::remove_var("WEBCLAW_API_KEY"); + } + assert!(CloudClient::new(None).is_none()); + assert!(CloudClient::new(Some("")).is_none()); + assert!(CloudClient::new(Some(" ")).is_none()); + } + + #[test] + fn cloud_client_base_url_strips_trailing_slash() { + let c = CloudClient::with_key_and_base("k", "https://api.example.com/v1/"); + assert_eq!(c.base_url(), "https://api.example.com/v1"); + } + + #[test] + fn truncate_respects_char_boundaries() { + // Ensure we don't slice inside a multi-byte char. + let s = "a".repeat(10) + "é"; // é is 2 bytes + let out = truncate(&s, 11); + assert_eq!(out.chars().count(), 11); + } +} diff --git a/crates/webclaw-fetch/src/extractors/amazon_product.rs b/crates/webclaw-fetch/src/extractors/amazon_product.rs new file mode 100644 index 0000000..fed6b9f --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/amazon_product.rs @@ -0,0 +1,452 @@ +//! Amazon product detail page extractor. +//! +//! Amazon product pages (`/dp/{ASIN}/` on every locale) are +//! inconsistently protected. 
Sometimes our local TLS fingerprint gets +//! a real HTML page; sometimes we land on a CAPTCHA interstitial; +//! sometimes we land on a real page that for whatever reason ships +//! no Product JSON-LD (Amazon A/B-tests this regularly). So the +//! extractor has a two-stage fallback: +//! +//! 1. Try local fetch + parse. If we got Product JSON-LD back, great: +//! we have everything (title, brand, price, availability, rating). +//! 2. If local fetch worked *but the page has no Product JSON-LD* AND +//! a cloud client is configured, force-escalate to api.webclaw.io. +//! Cloud's render + antibot pipeline reliably surfaces the +//! structured data. Without a cloud client we return whatever we +//! got from local (usually just title via `#productTitle` or OG +//! meta tags). +//! +//! Parsing tries JSON-LD first, DOM regex (`#productTitle`, +//! `#landingImage`) second, OG `` tags third. The OG path +//! matters because the cloud's synthesized HTML ships metadata as +//! OG tags but lacks Amazon's DOM IDs. +//! +//! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/` +//! path. ASINs are a stable Amazon identifier so we extract that as +//! part of the response even when everything else is empty (tells +//! callers the URL was at least recognised). + +use std::sync::OnceLock; + +use regex::Regex; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::cloud::{self, CloudError}; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "amazon_product", + label: "Amazon product", + description: "Returns product detail: title, brand, price, currency, availability, rating, image, ASIN. 
Requires WEBCLAW_API_KEY — Amazon's antibot means we always go through the cloud.", + url_patterns: &[ + "https://www.amazon.com/dp/{ASIN}", + "https://www.amazon.co.uk/dp/{ASIN}", + "https://www.amazon.de/dp/{ASIN}", + "https://www.amazon.fr/dp/{ASIN}", + "https://www.amazon.it/dp/{ASIN}", + "https://www.amazon.es/dp/{ASIN}", + "https://www.amazon.co.jp/dp/{ASIN}", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if !is_amazon_host(host) { + return false; + } + parse_asin(url).is_some() +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let asin = parse_asin(url) + .ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?; + + let mut fetched = cloud::smart_fetch_html(client, client.cloud(), url) + .await + .map_err(cloud_to_fetch_err)?; + + // Amazon ships Product JSON-LD inconsistently even on non-CAPTCHA + // pages (they A/B-test it). When local fetch succeeded but has no + // Product JSON-LD, force-escalate to the cloud which runs the + // render pipeline and reliably surfaces structured data. No-op + // when cloud isn't configured — we return whatever local gave us. + if fetched.source == cloud::FetchSource::Local + && find_product_jsonld(&fetched.html).is_none() + && let Some(c) = client.cloud() + { + match c.fetch_html(url).await { + Ok(cloud_html) => { + fetched = cloud::FetchedHtml { + html: cloud_html, + final_url: url.to_string(), + source: cloud::FetchSource::Cloud, + }; + } + Err(e) => { + tracing::debug!( + error = %e, + "amazon_product: cloud escalation failed, keeping local" + ); + } + } + } + + let mut data = parse(&fetched.html, url, &asin); + if let Some(obj) = data.as_object_mut() { + obj.insert( + "data_source".into(), + match fetched.source { + cloud::FetchSource::Local => json!("local"), + cloud::FetchSource::Cloud => json!("cloud"), + }, + ); + } + Ok(data) +} + +/// Pure parser. 
Given HTML (from anywhere — direct, cloud, or a fixture +/// file) and the source URL, extract Amazon product detail. Returns a +/// `Value` rather than a typed struct so callers can pass it through +/// without carrying webclaw_fetch types. +pub fn parse(html: &str, url: &str, asin: &str) -> Value { + let jsonld = find_product_jsonld(html); + // Three-tier title: JSON-LD `name` > Amazon's `#productTitle` span + // (only present on real static HTML) > cloud-synthesized og:title. + let title = jsonld + .as_ref() + .and_then(|v| get_text(v, "name")) + .or_else(|| dom_title(html)) + .or_else(|| og(html, "title")); + let image = jsonld + .as_ref() + .and_then(get_first_image) + .or_else(|| dom_image(html)) + .or_else(|| og(html, "image")); + let brand = jsonld.as_ref().and_then(get_brand); + let description = jsonld + .as_ref() + .and_then(|v| get_text(v, "description")) + .or_else(|| og(html, "description")); + let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating); + let offer = jsonld.as_ref().and_then(first_offer); + + let sku = jsonld.as_ref().and_then(|v| get_text(v, "sku")); + let mpn = jsonld.as_ref().and_then(|v| get_text(v, "mpn")); + + json!({ + "url": url, + "asin": asin, + "title": title, + "brand": brand, + "description": description, + "image": image, + "price": offer.as_ref().and_then(|o| get_text(o, "price")), + "currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")), + "availability": offer.as_ref().and_then(|o| { + get_text(o, "availability").map(|s| + s.replace("http://schema.org/", "").replace("https://schema.org/", "")) + }), + "condition": offer.as_ref().and_then(|o| { + get_text(o, "itemCondition").map(|s| + s.replace("http://schema.org/", "").replace("https://schema.org/", "")) + }), + "sku": sku, + "mpn": mpn, + "aggregate_rating": aggregate_rating, + }) +} + +// --------------------------------------------------------------------------- +// URL helpers +// 
--------------------------------------------------------------------------- + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +fn is_amazon_host(host: &str) -> bool { + host.starts_with("www.amazon.") || host.starts_with("amazon.") +} + +/// Pull a 10-char ASIN out of any recognised Amazon URL shape: +/// - /dp/{ASIN} +/// - /gp/product/{ASIN} +/// - /product/{ASIN} +/// - /exec/obidos/ASIN/{ASIN} +fn parse_asin(url: &str) -> Option { + static RE: OnceLock = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new(r"/(?:dp|gp/product|product|ASIN)/([A-Z0-9]{10})(?:[/?#]|$)").unwrap() + }); + re.captures(url) + .and_then(|c| c.get(1)) + .map(|m| m.as_str().to_string()) +} + +// --------------------------------------------------------------------------- +// JSON-LD walkers — light reuse of ecommerce_product's style +// --------------------------------------------------------------------------- + +fn find_product_jsonld(html: &str) -> Option { + let blocks = webclaw_core::structured_data::extract_json_ld(html); + for b in blocks { + if let Some(found) = find_product_in(&b) { + return Some(found); + } + } + None +} + +fn find_product_in(v: &Value) -> Option { + if is_product_type(v) { + return Some(v.clone()); + } + if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) { + for item in graph { + if let Some(found) = find_product_in(item) { + return Some(found); + } + } + } + if let Some(arr) = v.as_array() { + for item in arr { + if let Some(found) = find_product_in(item) { + return Some(found); + } + } + } + None +} + +fn is_product_type(v: &Value) -> bool { + let Some(t) = v.get("@type") else { + return false; + }; + let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct"); + match t { + Value::String(s) => is_prod(s), + Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)), + _ => false, + } +} + +fn get_text(v: &Value, key: 
&str) -> Option { + v.get(key).and_then(|x| match x { + Value::String(s) => Some(s.clone()), + Value::Number(n) => Some(n.to_string()), + _ => None, + }) +} + +fn get_brand(v: &Value) -> Option { + let brand = v.get("brand")?; + if let Some(s) = brand.as_str() { + return Some(s.to_string()); + } + brand + .as_object() + .and_then(|o| o.get("name")) + .and_then(|n| n.as_str()) + .map(String::from) +} + +fn get_first_image(v: &Value) -> Option { + match v.get("image")? { + Value::String(s) => Some(s.clone()), + Value::Array(arr) => arr.iter().find_map(|x| match x { + Value::String(s) => Some(s.clone()), + Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from), + _ => None, + }), + Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from), + _ => None, + } +} + +fn first_offer(v: &Value) -> Option { + let offers = v.get("offers")?; + match offers { + Value::Array(arr) => arr.first().cloned(), + Value::Object(_) => Some(offers.clone()), + _ => None, + } +} + +fn get_aggregate_rating(v: &Value) -> Option { + let r = v.get("aggregateRating")?; + Some(json!({ + "rating_value": get_text(r, "ratingValue"), + "review_count": get_text(r, "reviewCount"), + "best_rating": get_text(r, "bestRating"), + })) +} + +// --------------------------------------------------------------------------- +// DOM fallbacks — cheap regex for the two fields most likely to be +// missing from JSON-LD on Amazon. 
+// --------------------------------------------------------------------------- + +fn dom_title(html: &str) -> Option { + static RE: OnceLock = OnceLock::new(); + let re = RE.get_or_init(|| Regex::new(r#"(?s)id="productTitle"[^>]*>([^<]+)<"#).unwrap()); + re.captures(html) + .and_then(|c| c.get(1)) + .map(|m| m.as_str().trim().to_string()) +} + +fn dom_image(html: &str) -> Option { + static RE: OnceLock = OnceLock::new(); + let re = RE.get_or_init(|| Regex::new(r#"id="landingImage"[^>]+src="([^"]+)""#).unwrap()); + re.captures(html) + .and_then(|c| c.get(1)) + .map(|m| m.as_str().to_string()) +} + +/// OG meta tag lookup. Cloud-synthesized HTML ships these even when +/// JSON-LD and Amazon-DOM-IDs are both absent, so they're the last +/// line of defence for `title`, `image`, `description`. +fn og(html: &str, prop: &str) -> Option { + static RE: OnceLock = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new(r#"(?i)]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() + }); + for c in re.captures_iter(html) { + if c.get(1).is_some_and(|m| m.as_str() == prop) { + return c.get(2).map(|m| html_unescape(m.as_str())); + } + } + None +} + +/// Undo the synthesize_html attribute escaping for the few entities it +/// emits. Keeps us off a heavier HTML-entity dep. 
+fn html_unescape(s: &str) -> String { + s.replace("&quot;", "\"") + .replace("&amp;", "&") + .replace("&lt;", "<") + .replace("&gt;", ">") +} + +fn cloud_to_fetch_err(e: CloudError) -> FetchError { + FetchError::Build(e.to_string()) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_multi_locale() { + assert!(matches("https://www.amazon.com/dp/B0CHX1W1XY")); + assert!(matches("https://www.amazon.co.uk/dp/B0CHX1W1XY/")); + assert!(matches("https://www.amazon.de/dp/B0CHX1W1XY?psc=1")); + assert!(matches( + "https://www.amazon.com/gp/product/B0CHX1W1XY/ref=foo" + )); + } + + #[test] + fn rejects_non_product_urls() { + assert!(!matches("https://www.amazon.com/")); + assert!(!matches("https://www.amazon.com/gp/cart")); + assert!(!matches("https://example.com/dp/B0CHX1W1XY")); + } + + #[test] + fn parse_asin_extracts_from_multiple_shapes() { + assert_eq!( + parse_asin("https://www.amazon.com/dp/B0CHX1W1XY"), + Some("B0CHX1W1XY".into()) + ); + assert_eq!( + parse_asin("https://www.amazon.com/dp/B0CHX1W1XY/"), + Some("B0CHX1W1XY".into()) + ); + assert_eq!( + parse_asin("https://www.amazon.com/dp/B0CHX1W1XY?psc=1"), + Some("B0CHX1W1XY".into()) + ); + assert_eq!( + parse_asin("https://www.amazon.com/gp/product/B0CHX1W1XY/ref=bar"), + Some("B0CHX1W1XY".into()) + ); + assert_eq!( + parse_asin("https://www.amazon.com/exec/obidos/ASIN/B0CHX1W1XY/baz"), + Some("B0CHX1W1XY".into()) + ); + assert_eq!(parse_asin("https://www.amazon.com/"), None); + } + + #[test] + fn parse_extracts_from_fixture_jsonld() { + // Minimal Amazon-style fixture with a Product JSON-LD block. 
+ let html = r##" + + +"##; + let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY"); + assert_eq!(v["asin"], "B0CHX1W1XY"); + assert_eq!(v["title"], "ACME Widget"); + assert_eq!(v["brand"], "ACME"); + assert_eq!(v["price"], "19.99"); + assert_eq!(v["currency"], "USD"); + assert_eq!(v["availability"], "InStock"); + assert_eq!(v["aggregate_rating"]["rating_value"], "4.6"); + assert_eq!(v["aggregate_rating"]["review_count"], "1234"); + } + + #[test] + fn parse_falls_back_to_dom_when_jsonld_missing_fields() { + let html = r#" + +Fallback Title + + +"#; + let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY"); + assert_eq!(v["title"], "Fallback Title"); + assert_eq!( + v["image"], + "https://m.media-amazon.com/images/I/fallback.jpg" + ); + } + + #[test] + fn parse_falls_back_to_og_meta_when_no_jsonld_no_dom() { + // Shape we see from the cloud synthesize_html path: OG tags + // only, no JSON-LD, no Amazon DOM IDs. + let html = r##" + + + +"##; + let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY"); + assert_eq!(v["title"], "Cloud-sourced MacBook Pro"); + assert_eq!(v["image"], "https://m.media-amazon.com/images/I/cloud.jpg"); + assert_eq!(v["description"], "Via api.webclaw.io"); + } + + #[test] + fn og_unescape_handles_quot_entity() { + let html = r#""#; + assert_eq!( + og(html, "title").as_deref(), + Some(r#"Apple "M2 Pro" Laptop"#) + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/arxiv.rs b/crates/webclaw-fetch/src/extractors/arxiv.rs new file mode 100644 index 0000000..c2b85c0 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/arxiv.rs @@ -0,0 +1,314 @@ +//! ArXiv paper structured extractor. +//! +//! Uses the public ArXiv API at `export.arxiv.org/api/query?id_list={id}` +//! which returns Atom XML. We parse just enough to surface title, authors, +//! abstract, categories, and the canonical PDF link. No HTML scraping +//! required and no auth. 
+ +use quick_xml::Reader; +use quick_xml::events::Event; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "arxiv", + label: "ArXiv paper", + description: "Returns paper metadata: title, authors, abstract, categories, primary category, PDF URL.", + url_patterns: &[ + "https://arxiv.org/abs/{id}", + "https://arxiv.org/abs/{id}v{n}", + "https://arxiv.org/pdf/{id}", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if host != "arxiv.org" && host != "www.arxiv.org" { + return false; + } + url.contains("/abs/") || url.contains("/pdf/") +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let id = parse_id(url) + .ok_or_else(|| FetchError::Build(format!("arxiv: cannot parse id from '{url}'")))?; + + let api_url = format!("https://export.arxiv.org/api/query?id_list={id}"); + let resp = client.fetch(&api_url).await?; + if resp.status != 200 { + return Err(FetchError::Build(format!( + "arxiv api returned status {}", + resp.status + ))); + } + + let entry = parse_atom_entry(&resp.html) + .ok_or_else(|| FetchError::BodyDecode("arxiv: no <entry> in response".into()))?; + if entry.title.is_none() && entry.summary.is_none() { + return Err(FetchError::BodyDecode(format!( + "arxiv: paper '{id}' returned empty entry (likely withdrawn or invalid id)" + ))); + } + + Ok(json!({ + "url": url, + "id": id, + "arxiv_id": entry.id, + "title": entry.title, + "authors": entry.authors, + "abstract": entry.summary.map(|s| collapse_whitespace(&s)), + "published": entry.published, + "updated": entry.updated, + "primary_category": entry.primary_category, + "categories": entry.categories, + "doi": entry.doi, + "comment": entry.comment, + "pdf_url": entry.pdf_url, + "abs_url": entry.abs_url, + })) +} + +// --------------------------------------------------------------------------- +// Helpers +// 
--------------------------------------------------------------------------- + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +/// Parse an arxiv id from a URL. Strips the version suffix (`v2`, `v3`) +/// and the `.pdf` extension when present. +fn parse_id(url: &str) -> Option<String> { + let after = url + .split("/abs/") + .nth(1) + .or_else(|| url.split("/pdf/").nth(1))?; + let stripped = after + .split(['?', '#']) + .next()? + .trim_end_matches('/') + .trim_end_matches(".pdf"); + // Strip optional version suffix, e.g. "2401.12345v2" → "2401.12345" + let no_version = match stripped.rfind('v') { + Some(i) if stripped[i + 1..].chars().all(|c| c.is_ascii_digit()) => &stripped[..i], + _ => stripped, + }; + if no_version.is_empty() { + None + } else { + Some(no_version.to_string()) + } +} + +fn collapse_whitespace(s: &str) -> String { + s.split_whitespace().collect::<Vec<_>>().join(" ") +} + +#[derive(Default)] +struct AtomEntry { + id: Option<String>, + title: Option<String>, + summary: Option<String>, + published: Option<String>, + updated: Option<String>, + primary_category: Option<String>, + categories: Vec<String>, + authors: Vec<String>, + doi: Option<String>, + comment: Option<String>, + pdf_url: Option<String>, + abs_url: Option<String>, +} + +/// Parse the first `<entry>` block of an ArXiv Atom feed. 
+fn parse_atom_entry(xml: &str) -> Option<AtomEntry> { + let mut reader = Reader::from_str(xml); + let mut buf = Vec::new(); + + // States + let mut in_entry = false; + let mut current: Option<&'static str> = None; + let mut in_author = false; + let mut in_author_name = false; + let mut entry = AtomEntry::default(); + + loop { + match reader.read_event_into(&mut buf) { + Ok(Event::Start(ref e)) => { + let local = e.local_name(); + match local.as_ref() { + b"entry" => in_entry = true, + b"id" if in_entry && !in_author => current = Some("id"), + b"title" if in_entry => current = Some("title"), + b"summary" if in_entry => current = Some("summary"), + b"published" if in_entry => current = Some("published"), + b"updated" if in_entry => current = Some("updated"), + b"author" if in_entry => in_author = true, + b"name" if in_author => { + in_author_name = true; + current = Some("author_name"); + } + b"category" if in_entry => { + // primary_category is namespaced (arxiv:primary_category) + // category is plain. quick-xml gives us local-name only, + // so we treat both as categories and take the first as + // primary. 
+ for attr in e.attributes().flatten() { + if attr.key.as_ref() == b"term" + && let Ok(v) = attr.unescape_value() + { + let term = v.to_string(); + if entry.primary_category.is_none() { + entry.primary_category = Some(term.clone()); + } + entry.categories.push(term); + } + } + } + b"link" if in_entry => { + let mut href = None; + let mut rel = None; + let mut typ = None; + for attr in e.attributes().flatten() { + match attr.key.as_ref() { + b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()), + b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()), + b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()), + _ => {} + } + } + if let Some(h) = href { + if typ.as_deref() == Some("application/pdf") { + entry.pdf_url = Some(h.clone()); + } + if rel.as_deref() == Some("alternate") { + entry.abs_url = Some(h); + } + } + } + _ => current = None, + } + } + Ok(Event::Empty(ref e)) => { + // Self-closing tags (). Same handling as Start. + let local = e.local_name(); + if (local.as_ref() == b"link" || local.as_ref() == b"category") && in_entry { + let mut href = None; + let mut rel = None; + let mut typ = None; + let mut term = None; + for attr in e.attributes().flatten() { + match attr.key.as_ref() { + b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()), + b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()), + b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()), + b"term" => term = attr.unescape_value().ok().map(|s| s.to_string()), + _ => {} + } + } + if let Some(t) = term { + if entry.primary_category.is_none() { + entry.primary_category = Some(t.clone()); + } + entry.categories.push(t); + } + if let Some(h) = href { + if typ.as_deref() == Some("application/pdf") { + entry.pdf_url = Some(h.clone()); + } + if rel.as_deref() == Some("alternate") { + entry.abs_url = Some(h); + } + } + } + } + Ok(Event::Text(ref e)) => { + if let (Some(field), Ok(text)) = (current, e.unescape()) { + let text = 
text.to_string(); + match field { + "id" => entry.id = Some(text.trim().to_string()), + "title" => entry.title = append_text(entry.title.take(), &text), + "summary" => entry.summary = append_text(entry.summary.take(), &text), + "published" => entry.published = Some(text.trim().to_string()), + "updated" => entry.updated = Some(text.trim().to_string()), + "author_name" => entry.authors.push(text.trim().to_string()), + _ => {} + } + } + } + Ok(Event::End(ref e)) => { + let local = e.local_name(); + match local.as_ref() { + b"entry" => break, + b"author" => in_author = false, + b"name" => in_author_name = false, + _ => {} + } + if !in_author_name { + current = None; + } + } + Ok(Event::Eof) => break, + Err(_) => return None, + _ => {} + } + buf.clear(); + } + + if in_entry { Some(entry) } else { None } +} + +/// Concatenate text fragments (long fields can be split across multiple +/// text events if they contain entities or CDATA). +fn append_text(prev: Option<String>, next: &str) -> Option<String> { + match prev { + Some(mut s) => { + s.push_str(next); + Some(s) + } + None => Some(next.to_string()), + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_arxiv_urls() { + assert!(matches("https://arxiv.org/abs/2401.12345")); + assert!(matches("https://arxiv.org/abs/2401.12345v2")); + assert!(matches("https://arxiv.org/pdf/2401.12345.pdf")); + assert!(!matches("https://arxiv.org/")); + assert!(!matches("https://example.com/abs/foo")); + } + + #[test] + fn parse_id_strips_version_and_extension() { + assert_eq!( + parse_id("https://arxiv.org/abs/2401.12345"), + Some("2401.12345".into()) + ); + assert_eq!( + parse_id("https://arxiv.org/abs/2401.12345v3"), + Some("2401.12345".into()) + ); + assert_eq!( + parse_id("https://arxiv.org/pdf/2401.12345v2.pdf"), + Some("2401.12345".into()) + ); + } + + #[test] + fn collapse_whitespace_handles_newlines_and_tabs() { + assert_eq!(collapse_whitespace("a b\n\tc "), "a b c"); + } +} diff --git 
a/crates/webclaw-fetch/src/extractors/crates_io.rs b/crates/webclaw-fetch/src/extractors/crates_io.rs new file mode 100644 index 0000000..719579f --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/crates_io.rs @@ -0,0 +1,168 @@ +//! crates.io structured extractor. +//! +//! Uses the public JSON API at `crates.io/api/v1/crates/{name}`. No +//! auth, no rate limit at normal usage. The response includes both +//! the crate metadata and the full version list, which we summarize +//! down to a count + latest release info to keep the payload small. + +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "crates_io", + label: "crates.io package", + description: "Returns crate metadata: latest version, dependencies, downloads, license, repository.", + url_patterns: &[ + "https://crates.io/crates/{name}", + "https://crates.io/crates/{name}/{version}", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if host != "crates.io" && host != "www.crates.io" { + return false; + } + url.contains("/crates/") +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let name = parse_name(url) + .ok_or_else(|| FetchError::Build(format!("crates.io: cannot parse name from '{url}'")))?; + + let api_url = format!("https://crates.io/api/v1/crates/{name}"); + let resp = client.fetch(&api_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "crates.io: crate '{name}' not found" + ))); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "crates.io api returned status {}", + resp.status + ))); + } + + let body: CratesResponse = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("crates.io parse: {e}")))?; + + let c = body.crate_; + let latest_version = body + .versions + .iter() + .find(|v| 
!v.yanked.unwrap_or(false)) + 
.or_else(|| body.versions.first()); + + Ok(json!({ + "url": url, + "name": c.id, + "description": c.description, + "homepage": c.homepage, + "documentation": c.documentation, + "repository": c.repository, + "max_stable_version": c.max_stable_version, + "max_version": c.max_version, + "newest_version": c.newest_version, + "downloads": c.downloads, + "recent_downloads": c.recent_downloads, + "categories": c.categories, + "keywords": c.keywords, + "release_count": body.versions.len(), + "latest_release_date": latest_version.and_then(|v| v.created_at.clone()), + "latest_license": latest_version.and_then(|v| v.license.clone()), + "latest_rust_version": latest_version.and_then(|v| v.rust_version.clone()), + "latest_yanked": latest_version.and_then(|v| v.yanked), + "created_at": c.created_at, + "updated_at": c.updated_at, + })) +} + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +fn parse_name(url: &str) -> Option { + let after = url.split("/crates/").nth(1)?; + let stripped = after.split(['?', '#']).next()?.trim_end_matches('/'); + let first = stripped.split('/').find(|s| !s.is_empty())?; + Some(first.to_string()) +} + +// --------------------------------------------------------------------------- +// crates.io API types +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct CratesResponse { + #[serde(rename = "crate")] + crate_: CrateInfo, + #[serde(default)] + versions: Vec, +} + +#[derive(Deserialize)] +struct CrateInfo { + id: Option, + description: Option, + homepage: Option, + documentation: Option, + repository: Option, + max_stable_version: Option, + max_version: Option, + newest_version: Option, + downloads: Option, + recent_downloads: Option, + #[serde(default)] + categories: Vec, + #[serde(default)] + keywords: Vec, + created_at: Option, + updated_at: Option, +} + +#[derive(Deserialize)] +struct VersionInfo { + license: 
Option, + rust_version: Option, + yanked: Option, + created_at: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_crate_pages() { + assert!(matches("https://crates.io/crates/serde")); + assert!(matches("https://crates.io/crates/tokio/1.45.0")); + assert!(!matches("https://crates.io/")); + assert!(!matches("https://example.com/crates/foo")); + } + + #[test] + fn parse_name_handles_versioned_urls() { + assert_eq!( + parse_name("https://crates.io/crates/serde"), + Some("serde".into()) + ); + assert_eq!( + parse_name("https://crates.io/crates/tokio/1.45.0"), + Some("tokio".into()) + ); + assert_eq!( + parse_name("https://crates.io/crates/scraper/?foo=bar"), + Some("scraper".into()) + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/dev_to.rs b/crates/webclaw-fetch/src/extractors/dev_to.rs new file mode 100644 index 0000000..86199d8 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/dev_to.rs @@ -0,0 +1,188 @@ +//! dev.to article structured extractor. +//! +//! `dev.to/api/articles/{username}/{slug}` returns the full article body, +//! tags, reaction count, comment count, and reading time. Anonymous +//! access works fine for published posts. 
+ +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "dev_to", + label: "dev.to article", + description: "Returns article metadata + body: title, body markdown, tags, reactions, comments, reading time.", + url_patterns: &["https://dev.to/{username}/{slug}"], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if host != "dev.to" && host != "www.dev.to" { + return false; + } + let path = url + .split("://") + .nth(1) + .and_then(|s| s.split_once('/')) + .map(|(_, p)| p) + .unwrap_or(""); + let stripped = path + .split(['?', '#']) + .next() + .unwrap_or("") + .trim_end_matches('/'); + let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); + // Need exactly /{username}/{slug}, with username starting with non-reserved. + segs.len() == 2 && !RESERVED_FIRST_SEGS.contains(&segs[0]) +} + +const RESERVED_FIRST_SEGS: &[&str] = &[ + "api", + "tags", + "search", + "settings", + "enter", + "signup", + "about", + "code-of-conduct", + "privacy", + "terms", + "contact", + "sponsorships", + "sponsors", + "shop", + "videos", + "listings", + "podcasts", + "p", + "t", +]; + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let (username, slug) = parse_username_slug(url).ok_or_else(|| { + FetchError::Build(format!("dev_to: cannot parse username/slug from '{url}'")) + })?; + + let api_url = format!("https://dev.to/api/articles/{username}/{slug}"); + let resp = client.fetch(&api_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "dev_to: article '{username}/{slug}' not found" + ))); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "dev.to api returned status {}", + resp.status + ))); + } + + let a: Article = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("dev.to parse: 
{e}")))?; + + 
Ok(json!({ + "url": url, + "id": a.id, + "title": a.title, + "description": a.description, + "body_markdown": a.body_markdown, + "url_canonical": a.canonical_url, + "published_at": a.published_at, + "edited_at": a.edited_at, + "reading_time_min": a.reading_time_minutes, + "tags": a.tag_list, + "positive_reactions": a.positive_reactions_count, + "public_reactions": a.public_reactions_count, + "comments_count": a.comments_count, + "page_views_count": a.page_views_count, + "cover_image": a.cover_image, + "author": json!({ + "username": a.user.as_ref().and_then(|u| u.username.clone()), + "name": a.user.as_ref().and_then(|u| u.name.clone()), + "twitter": a.user.as_ref().and_then(|u| u.twitter_username.clone()), + "github": a.user.as_ref().and_then(|u| u.github_username.clone()), + "website": a.user.as_ref().and_then(|u| u.website_url.clone()), + }), + })) +} + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +fn parse_username_slug(url: &str) -> Option<(String, String)> { + let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; + let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); + let mut segs = stripped.split('/').filter(|s| !s.is_empty()); + let username = segs.next()?; + let slug = segs.next()?; + Some((username.to_string(), slug.to_string())) +} + +// --------------------------------------------------------------------------- +// dev.to API types +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct Article { + id: Option, + title: Option, + description: Option, + body_markdown: Option, + canonical_url: Option, + published_at: Option, + edited_at: Option, + reading_time_minutes: Option, + tag_list: Option, // string OR array depending on endpoint + positive_reactions_count: Option, + public_reactions_count: Option, + comments_count: Option, + page_views_count: Option, + cover_image: Option, + user: 
Option, +} + +#[derive(Deserialize)] +struct UserRef { + username: Option, + name: Option, + twitter_username: Option, + github_username: Option, + website_url: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_article_urls() { + assert!(matches("https://dev.to/ben/welcome-thread")); + assert!(matches("https://dev.to/0xmassi/some-post-1abc")); + assert!(!matches("https://dev.to/")); + assert!(!matches("https://dev.to/api/articles/foo/bar")); + assert!(!matches("https://dev.to/tags/rust")); + assert!(!matches("https://dev.to/ben")); // user profile, not article + assert!(!matches("https://example.com/ben/post")); + } + + #[test] + fn parse_pulls_username_and_slug() { + assert_eq!( + parse_username_slug("https://dev.to/ben/welcome-thread"), + Some(("ben".into(), "welcome-thread".into())) + ); + assert_eq!( + parse_username_slug("https://dev.to/0xmassi/some-post-1abc/?foo=bar"), + Some(("0xmassi".into(), "some-post-1abc".into())) + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/docker_hub.rs b/crates/webclaw-fetch/src/extractors/docker_hub.rs new file mode 100644 index 0000000..bce9315 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/docker_hub.rs @@ -0,0 +1,150 @@ +//! Docker Hub repository structured extractor. +//! +//! Uses the v2 JSON API at `hub.docker.com/v2/repositories/{namespace}/{name}`. +//! Anonymous access is allowed for public images. The official-image +//! shorthand (e.g. `nginx`, `redis`) is normalized to `library/{name}`. 
+ +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "docker_hub", + label: "Docker Hub repository", + description: "Returns image metadata: pull count, star count, last_updated, official flag, description.", + url_patterns: &[ + "https://hub.docker.com/_/{name}", + "https://hub.docker.com/r/{namespace}/{name}", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if host != "hub.docker.com" { + return false; + } + url.contains("/_/") || url.contains("/r/") +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let (namespace, name) = parse_repo(url) + .ok_or_else(|| FetchError::Build(format!("docker_hub: cannot parse repo from '{url}'")))?; + + let api_url = format!("https://hub.docker.com/v2/repositories/{namespace}/{name}"); + let resp = client.fetch(&api_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "docker_hub: repo '{namespace}/{name}' not found" + ))); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "docker_hub api returned status {}", + resp.status + ))); + } + + let r: RepoResponse = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("docker_hub parse: {e}")))?; + + Ok(json!({ + "url": url, + "namespace": r.namespace, + "name": r.name, + "full_name": format!("{namespace}/{name}"), + "pull_count": r.pull_count, + "star_count": r.star_count, + "description": r.description, + "full_description": r.full_description, + "last_updated": r.last_updated, + "date_registered": r.date_registered, + "is_official": namespace == "library", + "is_private": r.is_private, + "status_description":r.status_description, + "categories": r.categories, + })) +} + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") 
+} + +/// Parse 
`(namespace, name)` from a Docker Hub URL. The official-image +/// shorthand `/_/nginx` maps to `(library, nginx)`. Personal repos +/// `/r/foo/bar` map to `(foo, bar)`. +fn parse_repo(url: &str) -> Option<(String, String)> { + if let Some(after) = url.split("/_/").nth(1) { + let stripped = after.split(['?', '#']).next()?.trim_end_matches('/'); + let name = stripped.split('/').next().filter(|s| !s.is_empty())?; + return Some(("library".into(), name.to_string())); + } + let after = url.split("/r/").nth(1)?; + let stripped = after.split(['?', '#']).next()?.trim_end_matches('/'); + let mut segs = stripped.split('/').filter(|s| !s.is_empty()); + let ns = segs.next()?; + let nm = segs.next()?; + Some((ns.to_string(), nm.to_string())) +} + +#[derive(Deserialize)] +struct RepoResponse { + namespace: Option, + name: Option, + pull_count: Option, + star_count: Option, + description: Option, + full_description: Option, + last_updated: Option, + date_registered: Option, + is_private: Option, + status_description: Option, + #[serde(default)] + categories: Vec, +} + +#[derive(Deserialize, serde::Serialize)] +struct DockerCategory { + name: Option, + slug: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_docker_urls() { + assert!(matches("https://hub.docker.com/_/nginx")); + assert!(matches("https://hub.docker.com/r/grafana/grafana")); + assert!(!matches("https://hub.docker.com/")); + assert!(!matches("https://example.com/_/nginx")); + } + + #[test] + fn parse_repo_handles_official_and_personal() { + assert_eq!( + parse_repo("https://hub.docker.com/_/nginx"), + Some(("library".into(), "nginx".into())) + ); + assert_eq!( + parse_repo("https://hub.docker.com/_/nginx/tags"), + Some(("library".into(), "nginx".into())) + ); + assert_eq!( + parse_repo("https://hub.docker.com/r/grafana/grafana"), + Some(("grafana".into(), "grafana".into())) + ); + assert_eq!( + parse_repo("https://hub.docker.com/r/grafana/grafana/?foo=bar"), + Some(("grafana".into(), 
"grafana".into())) + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/ebay_listing.rs b/crates/webclaw-fetch/src/extractors/ebay_listing.rs new file mode 100644 index 0000000..dbc85ab --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/ebay_listing.rs @@ -0,0 +1,337 @@ +//! eBay listing extractor. +//! +//! eBay item pages at `ebay.com/itm/{id}` and international variants +//! usually ship a `Product` JSON-LD block with title, price, currency, +//! condition, and an `AggregateOffer` when bidding. eBay applies +//! Cloudflare + custom WAF selectively — some item IDs return normal +//! HTML to the Firefox profile, others 403 / get the "Pardon our +//! interruption" page. We route through `cloud::smart_fetch_html` so +//! both paths resolve to the same parser. + +use std::sync::OnceLock; + +use regex::Regex; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::cloud::{self, CloudError}; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "ebay_listing", + label: "eBay listing", + description: "Returns item title, price, currency, condition, seller, shipping, and bid info. 
Heavy listings may need WEBCLAW_API_KEY for antibot.", + url_patterns: &[ + "https://www.ebay.com/itm/{id}", + "https://www.ebay.co.uk/itm/{id}", + "https://www.ebay.de/itm/{id}", + "https://www.ebay.fr/itm/{id}", + "https://www.ebay.it/itm/{id}", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if !is_ebay_host(host) { + return false; + } + parse_item_id(url).is_some() +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let item_id = parse_item_id(url) + .ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?; + + let fetched = cloud::smart_fetch_html(client, client.cloud(), url) + .await + .map_err(cloud_to_fetch_err)?; + + let mut data = parse(&fetched.html, url, &item_id); + if let Some(obj) = data.as_object_mut() { + obj.insert( + "data_source".into(), + match fetched.source { + cloud::FetchSource::Local => json!("local"), + cloud::FetchSource::Cloud => json!("cloud"), + }, + ); + } + Ok(data) +} + +pub fn parse(html: &str, url: &str, item_id: &str) -> Value { + let jsonld = find_product_jsonld(html); + let title = jsonld + .as_ref() + .and_then(|v| get_text(v, "name")) + .or_else(|| og(html, "title")); + let image = jsonld + .as_ref() + .and_then(get_first_image) + .or_else(|| og(html, "image")); + let brand = jsonld.as_ref().and_then(get_brand); + let description = jsonld + .as_ref() + .and_then(|v| get_text(v, "description")) + .or_else(|| og(html, "description")); + let offer = jsonld.as_ref().and_then(first_offer); + + // eBay's AggregateOffer uses lowPrice/highPrice. Offer uses price. 
+ let (low_price, high_price, single_price) = match offer.as_ref() { + Some(o) => ( + get_text(o, "lowPrice"), + get_text(o, "highPrice"), + get_text(o, "price"), + ), + None => (None, None, None), + }; + let offer_count = offer.as_ref().and_then(|o| get_text(o, "offerCount")); + + let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating); + + json!({ + "url": url, + "item_id": item_id, + "title": title, + "brand": brand, + "description": description, + "image": image, + "price": single_price, + "low_price": low_price, + "high_price": high_price, + "offer_count": offer_count, + "currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")), + "availability": offer.as_ref().and_then(|o| { + get_text(o, "availability").map(|s| + s.replace("http://schema.org/", "").replace("https://schema.org/", "")) + }), + "condition": offer.as_ref().and_then(|o| { + get_text(o, "itemCondition").map(|s| + s.replace("http://schema.org/", "").replace("https://schema.org/", "")) + }), + "seller": offer.as_ref().and_then(|o| + o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from)), + "aggregate_rating": aggregate_rating, + }) +} + +// --------------------------------------------------------------------------- +// URL helpers +// --------------------------------------------------------------------------- + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +fn is_ebay_host(host: &str) -> bool { + host.starts_with("www.ebay.") || host.starts_with("ebay.") +} + +/// Pull the numeric item id out of `/itm/{id}` or `/itm/{slug}/{id}` +/// URLs. IDs are 10-15 digits today, but we accept any all-digit +/// trailing segment so the extractor stays forward-compatible. 
+fn parse_item_id(url: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| { + // /itm/(optional-slug/)?(digits)([/?#]|end) + Regex::new(r"/itm/(?:[^/]+/)?(\d{8,})(?:[/?#]|$)").unwrap() + }); + re.captures(url) + .and_then(|c| c.get(1)) + .map(|m| m.as_str().to_string()) +} + +// --------------------------------------------------------------------------- +// JSON-LD walkers +// --------------------------------------------------------------------------- + +fn find_product_jsonld(html: &str) -> Option<Value> { + let blocks = webclaw_core::structured_data::extract_json_ld(html); + for b in blocks { + if let Some(found) = find_product_in(&b) { + return Some(found); + } + } + None +} + +fn find_product_in(v: &Value) -> Option<Value> { + if is_product_type(v) { + return Some(v.clone()); + } + if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) { + for item in graph { + if let Some(found) = find_product_in(item) { + return Some(found); + } + } + } + if let Some(arr) = v.as_array() { + for item in arr { + if let Some(found) = find_product_in(item) { + return Some(found); + } + } + } + None +} + +fn is_product_type(v: &Value) -> bool { + let Some(t) = v.get("@type") else { + return false; + }; + let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct"); + match t { + Value::String(s) => is_prod(s), + Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)), + _ => false, + } +} + +fn get_text(v: &Value, key: &str) -> Option<String> { + v.get(key).and_then(|x| match x { + Value::String(s) => Some(s.clone()), + Value::Number(n) => Some(n.to_string()), + _ => None, + }) +} + +fn get_brand(v: &Value) -> Option<String> { + let brand = v.get("brand")?; + if let Some(s) = brand.as_str() { + return Some(s.to_string()); + } + brand + .as_object() + .and_then(|o| o.get("name")) + .and_then(|n| n.as_str()) + .map(String::from) +} + +fn get_first_image(v: &Value) -> Option<String> { + 
match v.get("image")? 
{ + Value::String(s) => Some(s.clone()), + Value::Array(arr) => arr.iter().find_map(|x| match x { + Value::String(s) => Some(s.clone()), + Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from), + _ => None, + }), + Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from), + _ => None, + } +} + +fn first_offer(v: &Value) -> Option<Value> { + let offers = v.get("offers")?; + match offers { + Value::Array(arr) => arr.first().cloned(), + Value::Object(_) => Some(offers.clone()), + _ => None, + } +} + +fn get_aggregate_rating(v: &Value) -> Option<Value> { + let r = v.get("aggregateRating")?; + Some(json!({ + "rating_value": get_text(r, "ratingValue"), + "review_count": get_text(r, "reviewCount"), + "best_rating": get_text(r, "bestRating"), + })) +} + +fn og(html: &str, prop: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() + }); + for c in re.captures_iter(html) { + if c.get(1).is_some_and(|m| m.as_str() == prop) { + return c.get(2).map(|m| m.as_str().to_string()); + } + } + None +} + +fn cloud_to_fetch_err(e: CloudError) -> FetchError { + FetchError::Build(e.to_string()) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_ebay_item_urls() { + assert!(matches("https://www.ebay.com/itm/325478156234")); + assert!(matches( + "https://www.ebay.com/itm/vintage-typewriter/325478156234" + )); + assert!(matches("https://www.ebay.co.uk/itm/325478156234")); + assert!(!matches("https://www.ebay.com/")); + assert!(!matches("https://www.ebay.com/sch/foo")); + assert!(!matches("https://example.com/itm/325478156234")); + } + + #[test] + fn parse_item_id_handles_slugged_urls() { + assert_eq!( + parse_item_id("https://www.ebay.com/itm/325478156234"), + Some("325478156234".into()) + ); + assert_eq!( + parse_item_id("https://www.ebay.com/itm/vintage-typewriter/325478156234"), + 
Some("325478156234".into()) + ); + 
assert_eq!( + parse_item_id("https://www.ebay.com/itm/325478156234?hash=abc"), + Some("325478156234".into()) + ); + } + + #[test] + fn parse_extracts_from_fixture_jsonld() { + let html = r##" + + +"##; + let v = parse(html, "https://www.ebay.co.uk/itm/325", "325"); + assert_eq!(v["title"], "Vintage Typewriter"); + assert_eq!(v["price"], "79.99"); + assert_eq!(v["currency"], "GBP"); + assert_eq!(v["availability"], "InStock"); + assert_eq!(v["condition"], "UsedCondition"); + assert_eq!(v["seller"], "vintage_seller_99"); + assert_eq!(v["brand"], "Olivetti"); + } + + #[test] + fn parse_handles_aggregate_offer_price_range() { + let html = r##" + +"##; + let v = parse(html, "https://www.ebay.com/itm/1", "1"); + assert_eq!(v["low_price"], "10.00"); + assert_eq!(v["high_price"], "50.00"); + assert_eq!(v["offer_count"], "5"); + assert_eq!(v["currency"], "USD"); + } +} diff --git a/crates/webclaw-fetch/src/extractors/ecommerce_product.rs b/crates/webclaw-fetch/src/extractors/ecommerce_product.rs new file mode 100644 index 0000000..019fb68 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/ecommerce_product.rs @@ -0,0 +1,553 @@ +//! Generic ecommerce product extractor via Schema.org JSON-LD. +//! +//! 
Every modern ecommerce site ships a ` + + + "##; + let v = parse(html, "https://patagonia.com/p/x").unwrap(); + assert_eq!(v["data_source"], "jsonld+og"); + assert_eq!(v["name"], "Better Sweater"); + assert_eq!(v["offers"].as_array().unwrap().len(), 1); + assert_eq!(v["offers"][0]["price"], "139.00"); + } + + #[test] + fn jsonld_only_stays_pure_jsonld() { + let html = r##" + + "##; + let v = parse(html, "https://example.com/p/w").unwrap(); + assert_eq!(v["data_source"], "jsonld"); + assert_eq!(v["offers"][0]["price"], "9.99"); + } + + #[test] + fn parse_returns_none_on_no_product_signals() { + let html = r#" + + + "#; + assert!(parse(html, "https://blog.example.com/post").is_none()); + } +} diff --git a/crates/webclaw-fetch/src/extractors/etsy_listing.rs b/crates/webclaw-fetch/src/extractors/etsy_listing.rs new file mode 100644 index 0000000..ea9ed0b --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/etsy_listing.rs @@ -0,0 +1,572 @@ +//! Etsy listing extractor. +//! +//! Etsy product pages at `etsy.com/listing/{id}` (and a sluggy variant +//! `etsy.com/listing/{id}/{slug}`) ship a Schema.org `Product` JSON-LD +//! block with title, price, currency, availability, shop seller, and +//! an `AggregateRating` for the listing. +//! +//! Etsy puts Cloudflare + custom WAF in front of product pages with a +//! high variance: the Firefox profile gets clean HTML most of the time +//! but some listings return a CF interstitial. We route through +//! `cloud::smart_fetch_html` so both paths resolve to the same parser, +//! same as `ebay_listing`. +//! +//! ## URL slug as last-resort title +//! +//! Even with cloud antibot bypass, Etsy frequently serves a generic +//! page with minimal metadata (`og:title = "etsy.com"`, no JSON-LD, +//! empty markdown). In that case we humanise the slug from the URL +//! (`/listing/{id}/personalized-stainless-steel-tumbler` becomes +//! "Personalized Stainless Steel Tumbler") so callers always get a +//! meaningful title. 
Degrades gracefully when the URL has no slug. + +use std::sync::OnceLock; + +use regex::Regex; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::cloud::{self, CloudError}; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "etsy_listing", + label: "Etsy listing", + description: "Returns listing title, price, currency, availability, shop, rating, and image. Heavy listings may need WEBCLAW_API_KEY for antibot.", + url_patterns: &[ + "https://www.etsy.com/listing/{id}", + "https://www.etsy.com/listing/{id}/{slug}", + "https://www.etsy.com/{locale}/listing/{id}", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if !is_etsy_host(host) { + return false; + } + parse_listing_id(url).is_some() +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let listing_id = parse_listing_id(url) + .ok_or_else(|| FetchError::Build(format!("etsy_listing: no listing id in '{url}'")))?; + + let fetched = cloud::smart_fetch_html(client, client.cloud(), url) + .await + .map_err(cloud_to_fetch_err)?; + + let mut data = parse(&fetched.html, url, &listing_id); + if let Some(obj) = data.as_object_mut() { + obj.insert( + "data_source".into(), + match fetched.source { + cloud::FetchSource::Local => json!("local"), + cloud::FetchSource::Cloud => json!("cloud"), + }, + ); + } + Ok(data) +} + +pub fn parse(html: &str, url: &str, listing_id: &str) -> Value { + let jsonld = find_product_jsonld(html); + let slug_title = humanise_slug(parse_slug(url).as_deref()); + + let title = jsonld + .as_ref() + .and_then(|v| get_text(v, "name")) + .or_else(|| og(html, "title").filter(|t| !is_generic_title(t))) + .or(slug_title); + let description = jsonld + .as_ref() + .and_then(|v| get_text(v, "description")) + .or_else(|| og(html, "description").filter(|d| !is_generic_description(d))); + let image = jsonld + .as_ref() + .and_then(get_first_image) + .or_else(|| og(html, 
"image")); + let brand = jsonld.as_ref().and_then(get_brand); + + // Etsy listings often ship either a single Offer or an + // AggregateOffer when the listing has variants with different prices. + let offer = jsonld.as_ref().and_then(first_offer); + let (low_price, high_price, single_price) = match offer.as_ref() { + Some(o) => ( + get_text(o, "lowPrice"), + get_text(o, "highPrice"), + get_text(o, "price"), + ), + None => (None, None, None), + }; + let currency = offer.as_ref().and_then(|o| get_text(o, "priceCurrency")); + let availability = offer + .as_ref() + .and_then(|o| get_text(o, "availability").map(strip_schema_prefix)); + let item_condition = jsonld + .as_ref() + .and_then(|v| get_text(v, "itemCondition")) + .map(strip_schema_prefix); + + // Shop name: offers[0].seller.name on newer listings, top-level + // `brand` on older listings (Etsy changed the schema around 2022). + // Fall back through both so either shape resolves. + let shop = offer + .as_ref() + .and_then(|o| { + o.get("seller") + .and_then(|s| s.get("name")) + .and_then(|n| n.as_str()) + .map(String::from) + }) + .or_else(|| brand.clone()); + let shop_url = shop_url_from_html(html); + + let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating); + + json!({ + "url": url, + "listing_id": listing_id, + "title": title, + "description": description, + "image": image, + "brand": brand, + "price": single_price, + "low_price": low_price, + "high_price": high_price, + "currency": currency, + "availability": availability, + "item_condition": item_condition, + "shop": shop, + "shop_url": shop_url, + "aggregate_rating": aggregate_rating, + }) +} + +// --------------------------------------------------------------------------- +// URL helpers +// --------------------------------------------------------------------------- + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +fn is_etsy_host(host: &str) -> bool { + host == 
"etsy.com" || host == "www.etsy.com" || host.ends_with(".etsy.com") +} + +/// Extract the numeric listing id. Etsy ids are 9-11 digits today but +/// we accept any run of six or more digits right after `/listing/`. +/// +/// Handles `/listing/{id}`, `/listing/{id}/{slug}`, and the localised +/// `/{locale}/listing/{id}` shape (e.g. `/fr/listing/...`). +fn parse_listing_id(url: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| Regex::new(r"/listing/(\d{6,})(?:[/?#]|$)").unwrap()); + re.captures(url) + .and_then(|c| c.get(1)) + .map(|m| m.as_str().to_string()) +} + +/// Extract the URL slug after the listing id, e.g. +/// `personalized-stainless-steel-tumbler`. Returns `None` when the URL +/// is the bare `/listing/{id}` shape. +fn parse_slug(url: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| Regex::new(r"/listing/\d{6,}/([^/?#]+)").unwrap()); + re.captures(url) + .and_then(|c| c.get(1)) + .map(|m| m.as_str().to_string()) +} + +/// Turn a URL slug into a human-ish title: +/// `personalized-stainless-steel-tumbler` → `Personalized Stainless +/// Steel Tumbler`. Capitalises each dash- or underscore-separated +/// token and joins the tokens with spaces. Returns `None` on empty input. +fn humanise_slug(slug: Option<&str>) -> Option<String> { + let raw = slug?.trim(); + if raw.is_empty() { + return None; + } + let words: Vec<String> = raw + .split(['-', '_']) + .filter(|w| !w.is_empty()) + .map(capitalise_word) + .collect(); + if words.is_empty() { + None + } else { + Some(words.join(" ")) + } +} + +fn capitalise_word(w: &str) -> String { + let mut chars = w.chars(); + match chars.next() { + Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(), + None => String::new(), + } +} + +/// True when the OG title is Etsy's fallback-page title rather than a +/// listing-specific title. 
Expired / region-blocked / antibot-filtered +/// pages return Etsy's sitewide tagline: +/// `"Etsy - Your place to buy and sell all things handmade..."`, or +/// simply `"etsy.com"`. A real listing title always starts with the +/// item name, never with "Etsy - " or the domain. +fn is_generic_title(t: &str) -> bool { + let normalised = t.trim().to_lowercase(); + if matches!( + normalised.as_str(), + "etsy.com" | "etsy" | "www.etsy.com" | "" + ) { + return true; + } + // Etsy's sitewide marketing tagline, served on 404 / blocked pages. + if normalised.starts_with("etsy - ") + || normalised.starts_with("etsy.com - ") + || normalised.starts_with("etsy uk - ") + { + return true; + } + // Etsy's "item unavailable" placeholder, served on delisted + // products. Keep the slug fallback so callers still see what the + // URL was about. + normalised.starts_with("this item is unavailable") + || normalised.starts_with("sorry, this item is") + || normalised == "item not available - etsy" +} + +/// True when the OG description is an Etsy error-page placeholder or +/// sitewide marketing blurb rather than a real listing description. 
+fn is_generic_description(d: &str) -> bool { + let normalised = d.trim().to_lowercase(); + if normalised.is_empty() { + return true; + } + normalised.starts_with("sorry, the page you were looking for") + || normalised.starts_with("page not found") + || normalised.starts_with("find the perfect handmade gift") +} + +// --------------------------------------------------------------------------- +// JSON-LD walkers (same shape as ebay_listing; kept separate so the two +// extractors can diverge without cross-impact) +// --------------------------------------------------------------------------- + +fn find_product_jsonld(html: &str) -> Option<Value> { + let blocks = webclaw_core::structured_data::extract_json_ld(html); + for b in blocks { + if let Some(found) = find_product_in(&b) { + return Some(found); + } + } + None +} + +fn find_product_in(v: &Value) -> Option<Value> { + if is_product_type(v) { + return Some(v.clone()); + } + if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) { + for item in graph { + if let Some(found) = find_product_in(item) { + return Some(found); + } + } + } + if let Some(arr) = v.as_array() { + for item in arr { + if let Some(found) = find_product_in(item) { + return Some(found); + } + } + } + None +} + +fn is_product_type(v: &Value) -> bool { + let Some(t) = v.get("@type") else { + return false; + }; + let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct"); + match t { + Value::String(s) => is_prod(s), + Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)), + _ => false, + } +} + +fn get_text(v: &Value, key: &str) -> Option<String> { + v.get(key).and_then(|x| match x { + Value::String(s) => Some(s.clone()), + Value::Number(n) => Some(n.to_string()), + _ => None, + }) +} + +fn get_brand(v: &Value) -> Option<String> { + let brand = v.get("brand")?; + if let Some(s) = brand.as_str() { + return Some(s.to_string()); + } + brand + .as_object() + .and_then(|o| o.get("name")) + .and_then(|n| n.as_str()) + 
.map(String::from) +} + +fn get_first_image(v: &Value) -> Option<String> { + match v.get("image")? { + Value::String(s) => Some(s.clone()), + Value::Array(arr) => arr.iter().find_map(|x| match x { + Value::String(s) => Some(s.clone()), + Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from), + _ => None, + }), + Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from), + _ => None, + } +} + +fn first_offer(v: &Value) -> Option<Value> { + let offers = v.get("offers")?; + match offers { + Value::Array(arr) => arr.first().cloned(), + Value::Object(_) => Some(offers.clone()), + _ => None, + } +} + +fn get_aggregate_rating(v: &Value) -> Option<Value> { + let r = v.get("aggregateRating")?; + Some(json!({ + "rating_value": get_text(r, "ratingValue"), + "review_count": get_text(r, "reviewCount"), + "best_rating": get_text(r, "bestRating"), + })) +} + +fn strip_schema_prefix(s: String) -> String { + s.replace("http://schema.org/", "") + .replace("https://schema.org/", "") +} + +fn og(html: &str, prop: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() + }); + for c in re.captures_iter(html) { + if c.get(1).is_some_and(|m| m.as_str() == prop) { + return c.get(2).map(|m| m.as_str().to_string()); + } + } + None +} + +/// Etsy links the owning shop with an anchor whose `href` is +/// `/shop/{name}`. Grab the first such `href` in the page and make it +/// absolute. 
+fn shop_url_from_html(html: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| Regex::new(r#"href="(/shop/[A-Za-z0-9_-]+)""#).unwrap()); + re.captures(html) + .and_then(|c| c.get(1)) + .map(|m| format!("https://www.etsy.com{}", m.as_str())) +} + +fn cloud_to_fetch_err(e: CloudError) -> FetchError { + FetchError::Build(e.to_string()) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_etsy_listing_urls() { + assert!(matches("https://www.etsy.com/listing/123456789")); + assert!(matches( + "https://www.etsy.com/listing/123456789/vintage-typewriter" + )); + assert!(matches( + "https://www.etsy.com/fr/listing/123456789/vintage-typewriter" + )); + assert!(!matches("https://www.etsy.com/")); + assert!(!matches("https://www.etsy.com/shop/SomeShop")); + assert!(!matches("https://example.com/listing/123456789")); + } + + #[test] + fn parse_listing_id_handles_slug_and_locale() { + assert_eq!( + parse_listing_id("https://www.etsy.com/listing/123456789"), + Some("123456789".into()) + ); + assert_eq!( + parse_listing_id("https://www.etsy.com/listing/123456789/slug-here"), + Some("123456789".into()) + ); + assert_eq!( + parse_listing_id("https://www.etsy.com/fr/listing/123456789/slug"), + Some("123456789".into()) + ); + assert_eq!( + parse_listing_id("https://www.etsy.com/listing/123456789?ref=foo"), + Some("123456789".into()) + ); + } + + #[test] + fn parse_extracts_from_fixture_jsonld() { + let html = r##" + + +StudioClay +"##; + let v = parse(html, "https://www.etsy.com/listing/1", "1"); + assert_eq!(v["title"], "Handmade Ceramic Mug"); + assert_eq!(v["price"], "24.00"); + assert_eq!(v["currency"], "USD"); + assert_eq!(v["availability"], "InStock"); + assert_eq!(v["item_condition"], "NewCondition"); + assert_eq!(v["shop"], "StudioClay"); + assert_eq!(v["shop_url"], "https://www.etsy.com/shop/StudioClay"); + assert_eq!(v["brand"], "Studio Clay"); + assert_eq!(v["aggregate_rating"]["rating_value"], "4.9"); + 
assert_eq!(v["aggregate_rating"]["review_count"], "127"); + } + + #[test] + fn parse_handles_aggregate_offer_price_range() { + let html = r##" + +"##; + let v = parse(html, "https://www.etsy.com/listing/2", "2"); + assert_eq!(v["low_price"], "18.00"); + assert_eq!(v["high_price"], "36.00"); + assert_eq!(v["currency"], "USD"); + } + + #[test] + fn parse_falls_back_to_og_when_no_jsonld() { + let html = r#" + + + + +"#; + let v = parse(html, "https://www.etsy.com/listing/3", "3"); + assert_eq!(v["title"], "Minimal Fallback Item"); + assert_eq!(v["description"], "OG-only extraction test."); + assert_eq!(v["image"], "https://i.etsystatic.com/fallback.jpg"); + // No price fields when we only have OG. + assert!(v["price"].is_null()); + } + + #[test] + fn parse_slug_from_url() { + assert_eq!( + parse_slug("https://www.etsy.com/listing/123456789/vintage-typewriter"), + Some("vintage-typewriter".into()) + ); + assert_eq!( + parse_slug("https://www.etsy.com/listing/123456789/slug?ref=shop"), + Some("slug".into()) + ); + assert_eq!(parse_slug("https://www.etsy.com/listing/123456789"), None); + assert_eq!( + parse_slug("https://www.etsy.com/fr/listing/123456789/slug"), + Some("slug".into()) + ); + } + + #[test] + fn humanise_slug_capitalises_each_word() { + assert_eq!( + humanise_slug(Some("personalized-stainless-steel-tumbler")).as_deref(), + Some("Personalized Stainless Steel Tumbler") + ); + assert_eq!( + humanise_slug(Some("hand_crafted_mug")).as_deref(), + Some("Hand Crafted Mug") + ); + assert_eq!(humanise_slug(Some("")), None); + assert_eq!(humanise_slug(None), None); + } + + #[test] + fn is_generic_title_catches_common_shapes() { + assert!(is_generic_title("etsy.com")); + assert!(is_generic_title("Etsy")); + assert!(is_generic_title(" etsy.com ")); + assert!(is_generic_title( + "Etsy - Your place to buy and sell all things handmade, vintage, and supplies" + )); + assert!(is_generic_title("Etsy UK - Vintage & Handmade")); + assert!(!is_generic_title("Vintage 
Typewriter")); + assert!(!is_generic_title("Handmade Etsy-style Mug")); + } + + #[test] + fn is_generic_description_catches_404_shapes() { + assert!(is_generic_description("")); + assert!(is_generic_description( + "Sorry, the page you were looking for was not found." + )); + assert!(is_generic_description("Page not found")); + assert!(!is_generic_description( + "Hand-thrown ceramic mug, dishwasher safe." + )); + } + + #[test] + fn parse_uses_slug_when_og_is_generic() { + // Cloud-blocked Etsy listing: og:title is a site-wide generic + // placeholder, no JSON-LD, no description. Slug should win. + let html = r#" + +"#; + let v = parse( + html, + "https://www.etsy.com/listing/1079113183/personalized-stainless-steel-tumbler", + "1079113183", + ); + assert_eq!(v["title"], "Personalized Stainless Steel Tumbler"); + } + + #[test] + fn parse_prefers_real_og_over_slug() { + let html = r#" + +"#; + let v = parse( + html, + "https://www.etsy.com/listing/1079113183/the-url-slug", + "1079113183", + ); + assert_eq!(v["title"], "Real Listing Title"); + } +} diff --git a/crates/webclaw-fetch/src/extractors/github_issue.rs b/crates/webclaw-fetch/src/extractors/github_issue.rs new file mode 100644 index 0000000..9a64f21 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/github_issue.rs @@ -0,0 +1,172 @@ +//! GitHub issue structured extractor. +//! +//! Mirror of `github_pr` but on `/issues/{number}`. Uses +//! `api.github.com/repos/{owner}/{repo}/issues/{number}`. Returns the +//! issue body + comment count + labels + milestone + author / +//! assignees. Full per-comment bodies would be another call; kept for +//! a follow-up. 
+ +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "github_issue", + label: "GitHub issue", + description: "Returns issue metadata: title, body, state, author, labels, assignees, milestone, comment count.", + url_patterns: &["https://github.com/{owner}/{repo}/issues/{number}"], +}; + +pub fn matches(url: &str) -> bool { + let host = url + .split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or(""); + if host != "github.com" && host != "www.github.com" { + return false; + } + parse_issue(url).is_some() +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let (owner, repo, number) = parse_issue(url).ok_or_else(|| { + FetchError::Build(format!("github_issue: cannot parse issue URL '{url}'")) + })?; + + let api_url = format!("https://api.github.com/repos/{owner}/{repo}/issues/{number}"); + let resp = client.fetch(&api_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "github_issue: issue '{owner}/{repo}#{number}' not found" + ))); + } + if resp.status == 403 { + return Err(FetchError::Build( + "github_issue: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(), + )); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "github api returned status {}", + resp.status + ))); + } + + let issue: Issue = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("github issue parse: {e}")))?; + + // The same endpoint returns PRs too; reject if we got one so the caller + // uses /v1/scrape/github_pr instead of getting a half-shaped payload. 
+ if issue.pull_request.is_some() { + return Err(FetchError::Build(format!( + "github_issue: '{owner}/{repo}#{number}' is a pull request, use /v1/scrape/github_pr" + ))); + } + + Ok(json!({ + "url": url, + "owner": owner, + "repo": repo, + "number": issue.number, + "title": issue.title, + "body": issue.body, + "state": issue.state, + "state_reason":issue.state_reason, + "author": issue.user.as_ref().and_then(|u| u.login.clone()), + "labels": issue.labels.iter().filter_map(|l| l.name.clone()).collect::>(), + "assignees": issue.assignees.iter().filter_map(|u| u.login.clone()).collect::>(), + "milestone": issue.milestone.as_ref().and_then(|m| m.title.clone()), + "comments": issue.comments, + "locked": issue.locked, + "created_at": issue.created_at, + "updated_at": issue.updated_at, + "closed_at": issue.closed_at, + "html_url": issue.html_url, + })) +} + +fn parse_issue(url: &str) -> Option<(String, String, u64)> { + let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; + let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); + let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); + if segs.len() < 4 || segs[2] != "issues" { + return None; + } + let number: u64 = segs[3].parse().ok()?; + Some((segs[0].to_string(), segs[1].to_string(), number)) +} + +// --------------------------------------------------------------------------- +// GitHub issue API types +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct Issue { + number: Option, + title: Option, + body: Option, + state: Option, + state_reason: Option, + locked: Option, + comments: Option, + created_at: Option, + updated_at: Option, + closed_at: Option, + html_url: Option, + user: Option, + #[serde(default)] + labels: Vec, + #[serde(default)] + assignees: Vec, + milestone: Option, + /// Present when this "issue" is actually a pull request. The REST + /// API overloads the issues endpoint for PRs. 
+ pull_request: Option, +} + +#[derive(Deserialize)] +struct UserRef { + login: Option, +} + +#[derive(Deserialize)] +struct LabelRef { + name: Option, +} + +#[derive(Deserialize)] +struct Milestone { + title: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_issue_urls() { + assert!(matches("https://github.com/rust-lang/rust/issues/100")); + assert!(matches("https://github.com/rust-lang/rust/issues/100/")); + assert!(!matches("https://github.com/rust-lang/rust")); + assert!(!matches("https://github.com/rust-lang/rust/pull/100")); + assert!(!matches("https://github.com/rust-lang/rust/issues")); + } + + #[test] + fn parse_issue_extracts_owner_repo_number() { + assert_eq!( + parse_issue("https://github.com/rust-lang/rust/issues/100"), + Some(("rust-lang".into(), "rust".into(), 100)) + ); + assert_eq!( + parse_issue("https://github.com/rust-lang/rust/issues/100/?foo=bar"), + Some(("rust-lang".into(), "rust".into(), 100)) + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/github_pr.rs b/crates/webclaw-fetch/src/extractors/github_pr.rs new file mode 100644 index 0000000..266d3cd --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/github_pr.rs @@ -0,0 +1,189 @@ +//! GitHub pull request structured extractor. +//! +//! Uses `api.github.com/repos/{owner}/{repo}/pulls/{number}`. Returns +//! the PR metadata + a counted summary of comments and review activity. +//! Full diff and per-comment bodies require additional calls — left for +//! a follow-up enhancement so the v1 stays one network round-trip. 
+ +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "github_pr", + label: "GitHub pull request", + description: "Returns PR metadata: title, body, state, author, labels, additions/deletions, file count.", + url_patterns: &["https://github.com/{owner}/{repo}/pull/{number}"], +}; + +pub fn matches(url: &str) -> bool { + let host = url + .split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or(""); + if host != "github.com" && host != "www.github.com" { + return false; + } + parse_pr(url).is_some() +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let (owner, repo, number) = parse_pr(url).ok_or_else(|| { + FetchError::Build(format!("github_pr: cannot parse pull-request URL '{url}'")) + })?; + + let api_url = format!("https://api.github.com/repos/{owner}/{repo}/pulls/{number}"); + let resp = client.fetch(&api_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "github_pr: pull request '{owner}/{repo}#{number}' not found" + ))); + } + if resp.status == 403 { + return Err(FetchError::Build( + "github_pr: rate limited (60/hour unauth). 
Set GITHUB_TOKEN for 5,000/hour.".into(), + )); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "github api returned status {}", + resp.status + ))); + } + + let p: PullRequest = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("github pr parse: {e}")))?; + + Ok(json!({ + "url": url, + "owner": owner, + "repo": repo, + "number": p.number, + "title": p.title, + "body": p.body, + "state": p.state, + "draft": p.draft, + "merged": p.merged, + "merged_at": p.merged_at, + "merge_commit_sha": p.merge_commit_sha, + "author": p.user.as_ref().and_then(|u| u.login.clone()), + "labels": p.labels.iter().filter_map(|l| l.name.clone()).collect::>(), + "milestone": p.milestone.as_ref().and_then(|m| m.title.clone()), + "head_ref": p.head.as_ref().and_then(|r| r.ref_name.clone()), + "base_ref": p.base.as_ref().and_then(|r| r.ref_name.clone()), + "head_sha": p.head.as_ref().and_then(|r| r.sha.clone()), + "additions": p.additions, + "deletions": p.deletions, + "changed_files": p.changed_files, + "commits": p.commits, + "comments": p.comments, + "review_comments":p.review_comments, + "created_at": p.created_at, + "updated_at": p.updated_at, + "closed_at": p.closed_at, + "html_url": p.html_url, + })) +} + +fn parse_pr(url: &str) -> Option<(String, String, u64)> { + let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; + let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); + let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); + // /{owner}/{repo}/pull/{number} (or /pulls/{number} variant) + if segs.len() < 4 { + return None; + } + if segs[2] != "pull" && segs[2] != "pulls" { + return None; + } + let number: u64 = segs[3].parse().ok()?; + Some((segs[0].to_string(), segs[1].to_string(), number)) +} + +// --------------------------------------------------------------------------- +// GitHub PR API types +// 
--------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct PullRequest { + number: Option, + title: Option, + body: Option, + state: Option, + draft: Option, + merged: Option, + merged_at: Option, + merge_commit_sha: Option, + user: Option, + #[serde(default)] + labels: Vec, + milestone: Option, + head: Option, + base: Option, + additions: Option, + deletions: Option, + changed_files: Option, + commits: Option, + comments: Option, + review_comments: Option, + created_at: Option, + updated_at: Option, + closed_at: Option, + html_url: Option, +} + +#[derive(Deserialize)] +struct UserRef { + login: Option, +} + +#[derive(Deserialize)] +struct LabelRef { + name: Option, +} + +#[derive(Deserialize)] +struct Milestone { + title: Option, +} + +#[derive(Deserialize)] +struct GitRef { + #[serde(rename = "ref")] + ref_name: Option, + sha: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_pr_urls() { + assert!(matches("https://github.com/rust-lang/rust/pull/12345")); + assert!(matches( + "https://github.com/rust-lang/rust/pull/12345/files" + )); + assert!(!matches("https://github.com/rust-lang/rust")); + assert!(!matches("https://github.com/rust-lang/rust/issues/100")); + assert!(!matches("https://github.com/rust-lang")); + } + + #[test] + fn parse_pr_extracts_owner_repo_number() { + assert_eq!( + parse_pr("https://github.com/rust-lang/rust/pull/12345"), + Some(("rust-lang".into(), "rust".into(), 12345)) + ); + assert_eq!( + parse_pr("https://github.com/rust-lang/rust/pull/12345/files"), + Some(("rust-lang".into(), "rust".into(), 12345)) + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/github_release.rs b/crates/webclaw-fetch/src/extractors/github_release.rs new file mode 100644 index 0000000..7699d09 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/github_release.rs @@ -0,0 +1,179 @@ +//! GitHub release structured extractor. +//! +//! 
`api.github.com/repos/{owner}/{repo}/releases/tags/{tag}`. Returns +//! the release notes body, asset list with download counts, and +//! prerelease flag. + +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "github_release", + label: "GitHub release", + description: "Returns release metadata: tag, name, body (release notes), assets with download counts.", + url_patterns: &["https://github.com/{owner}/{repo}/releases/tag/{tag}"], +}; + +pub fn matches(url: &str) -> bool { + let host = url + .split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or(""); + if host != "github.com" && host != "www.github.com" { + return false; + } + parse_release(url).is_some() +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let (owner, repo, tag) = parse_release(url).ok_or_else(|| { + FetchError::Build(format!("github_release: cannot parse release URL '{url}'")) + })?; + + let api_url = format!("https://api.github.com/repos/{owner}/{repo}/releases/tags/{tag}"); + let resp = client.fetch(&api_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "github_release: release '{owner}/{repo}@{tag}' not found" + ))); + } + if resp.status == 403 { + return Err(FetchError::Build( + "github_release: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour." 
+ .into(), + )); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "github api returned status {}", + resp.status + ))); + } + + let r: Release = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("github release parse: {e}")))?; + + let assets: Vec = r + .assets + .iter() + .map(|a| { + json!({ + "name": a.name, + "size": a.size, + "download_count": a.download_count, + "browser_download_url": a.browser_download_url, + "content_type": a.content_type, + "created_at": a.created_at, + "updated_at": a.updated_at, + }) + }) + .collect(); + + Ok(json!({ + "url": url, + "owner": owner, + "repo": repo, + "tag_name": r.tag_name, + "name": r.name, + "body": r.body, + "draft": r.draft, + "prerelease": r.prerelease, + "author": r.author.as_ref().and_then(|u| u.login.clone()), + "created_at": r.created_at, + "published_at": r.published_at, + "asset_count": assets.len(), + "total_downloads": r.assets.iter().map(|a| a.download_count.unwrap_or(0)).sum::(), + "assets": assets, + "html_url": r.html_url, + })) +} + +fn parse_release(url: &str) -> Option<(String, String, String)> { + let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; + let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); + let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); + // /{owner}/{repo}/releases/tag/{tag} + if segs.len() < 5 { + return None; + } + if segs[2] != "releases" || segs[3] != "tag" { + return None; + } + Some(( + segs[0].to_string(), + segs[1].to_string(), + segs[4].to_string(), + )) +} + +// --------------------------------------------------------------------------- +// GitHub Release API types +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct Release { + tag_name: Option, + name: Option, + body: Option, + draft: Option, + prerelease: Option, + author: Option, + created_at: Option, + published_at: Option, + html_url: Option, + 
#[serde(default)] + assets: Vec, +} + +#[derive(Deserialize)] +struct UserRef { + login: Option, +} + +#[derive(Deserialize)] +struct Asset { + name: Option, + size: Option, + download_count: Option, + browser_download_url: Option, + content_type: Option, + created_at: Option, + updated_at: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_release_urls() { + assert!(matches( + "https://github.com/rust-lang/rust/releases/tag/1.85.0" + )); + assert!(matches( + "https://github.com/0xMassi/webclaw/releases/tag/v0.4.0" + )); + assert!(!matches("https://github.com/rust-lang/rust")); + assert!(!matches("https://github.com/rust-lang/rust/releases")); + assert!(!matches("https://github.com/rust-lang/rust/pull/100")); + } + + #[test] + fn parse_release_extracts_owner_repo_tag() { + assert_eq!( + parse_release("https://github.com/0xMassi/webclaw/releases/tag/v0.4.0"), + Some(("0xMassi".into(), "webclaw".into(), "v0.4.0".into())) + ); + assert_eq!( + parse_release("https://github.com/rust-lang/rust/releases/tag/1.85.0/?foo=bar"), + Some(("rust-lang".into(), "rust".into(), "1.85.0".into())) + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/github_repo.rs b/crates/webclaw-fetch/src/extractors/github_repo.rs new file mode 100644 index 0000000..2a62aa3 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/github_repo.rs @@ -0,0 +1,212 @@ +//! GitHub repository structured extractor. +//! +//! Uses GitHub's public REST API at `api.github.com/repos/{owner}/{repo}`. +//! Unauthenticated requests get 60/hour per IP, which is fine for users +//! self-hosting and for low-volume cloud usage. Production cloud should +//! set a `GITHUB_TOKEN` to lift to 5,000/hour, but the extractor doesn't +//! depend on it being set — it works open out of the box. 
+ +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "github_repo", + label: "GitHub repository", + description: "Returns repo metadata: stars, forks, topics, license, default branch, recent activity.", + url_patterns: &["https://github.com/{owner}/{repo}"], +}; + +pub fn matches(url: &str) -> bool { + let host = url + .split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or(""); + if host != "github.com" && host != "www.github.com" { + return false; + } + // Path must be exactly /{owner}/{repo} (or with trailing slash). Reject + // sub-pages (issues, pulls, blob, etc.) so we don't claim URLs the + // future github_issue / github_pr extractors will handle. + let path = url + .split("://") + .nth(1) + .and_then(|s| s.split_once('/')) + .map(|(_, p)| p) + .unwrap_or(""); + let stripped = path + .split(['?', '#']) + .next() + .unwrap_or("") + .trim_end_matches('/'); + let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); + segs.len() == 2 && !RESERVED_OWNERS.contains(&segs[0]) +} + +/// GitHub uses some top-level paths for non-repo pages. 
+const RESERVED_OWNERS: &[&str] = &[ + "settings", + "marketplace", + "explore", + "topics", + "trending", + "collections", + "events", + "sponsors", + "issues", + "pulls", + "notifications", + "new", + "organizations", + "login", + "join", + "search", + "about", +]; + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let (owner, repo) = parse_owner_repo(url).ok_or_else(|| { + FetchError::Build(format!("github_repo: cannot parse owner/repo from '{url}'")) + })?; + + let api_url = format!("https://api.github.com/repos/{owner}/{repo}"); + let resp = client.fetch(&api_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "github_repo: repo '{owner}/{repo}' not found" + ))); + } + if resp.status == 403 { + return Err(FetchError::Build( + "github_repo: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(), + )); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "github api returned status {}", + resp.status + ))); + } + + let r: Repo = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("github api parse: {e}")))?; + + Ok(json!({ + "url": url, + "owner": r.owner.as_ref().map(|o| &o.login), + "name": r.name, + "full_name": r.full_name, + "description": r.description, + "homepage": r.homepage, + "language": r.language, + "topics": r.topics, + "license": r.license.as_ref().and_then(|l| l.spdx_id.clone()), + "license_name": r.license.as_ref().map(|l| l.name.clone()), + "default_branch": r.default_branch, + "stars": r.stargazers_count, + "forks": r.forks_count, + "watchers": r.subscribers_count, + "open_issues": r.open_issues_count, + "size_kb": r.size, + "archived": r.archived, + "fork": r.fork, + "is_template": r.is_template, + "has_issues": r.has_issues, + "has_wiki": r.has_wiki, + "has_pages": r.has_pages, + "has_discussions": r.has_discussions, + "created_at": r.created_at, + "updated_at": r.updated_at, + "pushed_at": r.pushed_at, + "html_url": r.html_url, + 
})) +} + +fn parse_owner_repo(url: &str) -> Option<(String, String)> { + let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; + let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); + let mut segs = stripped.split('/').filter(|s| !s.is_empty()); + let owner = segs.next()?.to_string(); + let repo = segs.next()?.to_string(); + Some((owner, repo)) +} + +// --------------------------------------------------------------------------- +// GitHub API types — only the fields we surface +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct Repo { + name: Option, + full_name: Option, + description: Option, + homepage: Option, + language: Option, + #[serde(default)] + topics: Vec, + license: Option, + default_branch: Option, + stargazers_count: Option, + forks_count: Option, + subscribers_count: Option, + open_issues_count: Option, + size: Option, + archived: Option, + fork: Option, + is_template: Option, + has_issues: Option, + has_wiki: Option, + has_pages: Option, + has_discussions: Option, + created_at: Option, + updated_at: Option, + pushed_at: Option, + html_url: Option, + owner: Option, +} + +#[derive(Deserialize)] +struct Owner { + login: String, +} + +#[derive(Deserialize)] +struct License { + name: String, + spdx_id: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_repo_root_only() { + assert!(matches("https://github.com/rust-lang/rust")); + assert!(matches("https://github.com/rust-lang/rust/")); + assert!(!matches("https://github.com/rust-lang/rust/issues")); + assert!(!matches("https://github.com/rust-lang/rust/pulls/123")); + assert!(!matches("https://github.com/rust-lang")); + assert!(!matches("https://github.com/marketplace")); + assert!(!matches("https://github.com/topics/rust")); + assert!(!matches("https://example.com/foo/bar")); + } + + #[test] + fn parse_owner_repo_handles_trailing_slash_and_query() { + assert_eq!( + 
parse_owner_repo("https://github.com/rust-lang/rust"), + Some(("rust-lang".into(), "rust".into())) + ); + assert_eq!( + parse_owner_repo("https://github.com/rust-lang/rust/?tab=foo"), + Some(("rust-lang".into(), "rust".into())) + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/hackernews.rs b/crates/webclaw-fetch/src/extractors/hackernews.rs new file mode 100644 index 0000000..91d4520 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/hackernews.rs @@ -0,0 +1,186 @@ +//! Hacker News structured extractor. +//! +//! Uses Algolia's HN API (`hn.algolia.com/api/v1/items/{id}`) which +//! returns the full post + recursive comment tree in a single request. +//! The official Firebase API at `hacker-news.firebaseio.com` requires +//! N+1 fetches per comment, so we'd hit either timeout or rate-limit +//! on any non-trivial thread. + +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "hackernews", + label: "Hacker News story", + description: "Returns post + nested comment tree for a Hacker News item.", + url_patterns: &[ + "https://news.ycombinator.com/item?id=N", + "https://hn.algolia.com/items/N", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = url + .split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or(""); + if host == "news.ycombinator.com" { + return url.contains("item?id=") || url.contains("item%3Fid="); + } + if host == "hn.algolia.com" { + return url.contains("/items/"); + } + false +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let id = parse_item_id(url).ok_or_else(|| { + FetchError::Build(format!("hackernews: cannot parse item id from '{url}'")) + })?; + + let api_url = format!("https://hn.algolia.com/api/v1/items/{id}"); + let resp = client.fetch(&api_url).await?; + if resp.status != 200 { + return Err(FetchError::Build(format!( + 
"hn algolia returned status {}", + resp.status + ))); + } + + let item: AlgoliaItem = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("hn algolia parse: {e}")))?; + + let post = post_json(&item); + let comments: Vec = item.children.iter().filter_map(comment_json).collect(); + + Ok(json!({ + "url": url, + "post": post, + "comments": comments, + })) +} + +// --------------------------------------------------------------------------- +// Helpers +// --------------------------------------------------------------------------- + +/// Pull the numeric id out of a HN URL. Handles `item?id=N` and the +/// Algolia mirror's `/items/N` form. +fn parse_item_id(url: &str) -> Option { + if let Some(after) = url.split("id=").nth(1) { + let n = after.split('&').next().unwrap_or(after); + if let Ok(id) = n.parse::() { + return Some(id); + } + } + if let Some(after) = url.split("/items/").nth(1) { + let n = after.split(['/', '?', '#']).next().unwrap_or(after); + if let Ok(id) = n.parse::() { + return Some(id); + } + } + None +} + +fn post_json(item: &AlgoliaItem) -> Value { + json!({ + "id": item.id, + "type": item.r#type, + "title": item.title, + "url": item.url, + "author": item.author, + "points": item.points, + "text": item.text, // populated for ask/show/tell + "created_at": item.created_at, + "created_at_unix": item.created_at_i, + "comment_count": count_descendants(item), + "permalink": item.id.map(|i| format!("https://news.ycombinator.com/item?id={i}")), + }) +} + +fn comment_json(item: &AlgoliaItem) -> Option { + if !matches!(item.r#type.as_deref(), Some("comment")) { + return None; + } + // Dead/deleted comments still appear in the tree; surface them honestly. 
+ let replies: Vec = item.children.iter().filter_map(comment_json).collect(); + Some(json!({ + "id": item.id, + "author": item.author, + "text": item.text, + "created_at": item.created_at, + "created_at_unix": item.created_at_i, + "parent_id": item.parent_id, + "story_id": item.story_id, + "replies": replies, + })) +} + +fn count_descendants(item: &AlgoliaItem) -> usize { + item.children + .iter() + .filter(|c| matches!(c.r#type.as_deref(), Some("comment"))) + .map(|c| 1 + count_descendants(c)) + .sum() +} + +// --------------------------------------------------------------------------- +// Algolia API types +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct AlgoliaItem { + id: Option, + r#type: Option, + title: Option, + url: Option, + author: Option, + points: Option, + text: Option, + created_at: Option, + created_at_i: Option, + parent_id: Option, + story_id: Option, + #[serde(default)] + children: Vec, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_hn_item_urls() { + assert!(matches("https://news.ycombinator.com/item?id=1")); + assert!(matches("https://news.ycombinator.com/item?id=12345")); + assert!(matches("https://hn.algolia.com/items/1")); + } + + #[test] + fn rejects_non_item_urls() { + assert!(!matches("https://news.ycombinator.com/")); + assert!(!matches("https://news.ycombinator.com/news")); + assert!(!matches("https://example.com/item?id=1")); + } + + #[test] + fn parse_item_id_handles_both_forms() { + assert_eq!( + parse_item_id("https://news.ycombinator.com/item?id=1"), + Some(1) + ); + assert_eq!( + parse_item_id("https://news.ycombinator.com/item?id=12345&p=2"), + Some(12345) + ); + assert_eq!(parse_item_id("https://hn.algolia.com/items/999"), Some(999)); + assert_eq!(parse_item_id("https://example.com/foo"), None); + } +} diff --git a/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs b/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs new 
file mode 100644 index 0000000..e1f84f7 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs @@ -0,0 +1,189 @@ +//! HuggingFace dataset structured extractor. +//! +//! Same shape as the model extractor but hits the dataset endpoint. +//! `huggingface.co/api/datasets/{owner}/{name}`. + +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "huggingface_dataset", + label: "HuggingFace dataset", + description: "Returns dataset metadata: downloads, likes, license, language, task categories, file list.", + url_patterns: &["https://huggingface.co/datasets/{owner}/{name}"], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if host != "huggingface.co" && host != "www.huggingface.co" { + return false; + } + let path = url + .split("://") + .nth(1) + .and_then(|s| s.split_once('/')) + .map(|(_, p)| p) + .unwrap_or(""); + let stripped = path + .split(['?', '#']) + .next() + .unwrap_or("") + .trim_end_matches('/'); + let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); + // /datasets/{name} (legacy top-level) or /datasets/{owner}/{name} (canonical). 
+ segs.first().copied() == Some("datasets") && (segs.len() == 2 || segs.len() == 3) +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let dataset_path = parse_dataset_path(url).ok_or_else(|| { + FetchError::Build(format!( + "hf_dataset: cannot parse dataset path from '{url}'" + )) + })?; + + let api_url = format!("https://huggingface.co/api/datasets/{dataset_path}"); + let resp = client.fetch(&api_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "hf_dataset: '{dataset_path}' not found" + ))); + } + if resp.status == 401 { + return Err(FetchError::Build(format!( + "hf_dataset: '{dataset_path}' requires authentication (gated)" + ))); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "hf_dataset api returned status {}", + resp.status + ))); + } + + let d: DatasetInfo = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("hf_dataset parse: {e}")))?; + + let files: Vec<Value> = d + .siblings + .iter() + .map(|s| json!({"rfilename": s.rfilename, "size": s.size})) + .collect(); + + Ok(json!({ + "url": url, + "id": d.id, + "private": d.private, + "gated": d.gated, + "downloads": d.downloads, + "downloads_all_time": d.downloads_all_time, + "likes": d.likes, + "tags": d.tags, + "license": d.card_data.as_ref().and_then(|c| c.license.clone()), + "language": d.card_data.as_ref().and_then(|c| c.language.clone()), + "task_categories": d.card_data.as_ref().and_then(|c| c.task_categories.clone()), + "size_categories": d.card_data.as_ref().and_then(|c| c.size_categories.clone()), + "annotations_creators": d.card_data.as_ref().and_then(|c| c.annotations_creators.clone()), + "configs": d.card_data.as_ref().and_then(|c| c.configs.clone()), + "created_at": d.created_at, + "last_modified": d.last_modified, + "sha": d.sha, + "file_count": d.siblings.len(), + "files": files, + })) +} + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + 
.unwrap_or("") +} + +/// Returns the part to append to the API URL — either `name` (legacy +/// top-level dataset like `squad`) or `owner/name` (canonical form). +fn parse_dataset_path(url: &str) -> Option { + let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; + let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); + let mut segs = stripped.split('/').filter(|s| !s.is_empty()); + if segs.next() != Some("datasets") { + return None; + } + let first = segs.next()?.to_string(); + match segs.next() { + Some(second) => Some(format!("{first}/{second}")), + None => Some(first), + } +} + +#[derive(Deserialize)] +struct DatasetInfo { + id: Option, + private: Option, + gated: Option, + downloads: Option, + #[serde(rename = "downloadsAllTime")] + downloads_all_time: Option, + likes: Option, + #[serde(default)] + tags: Vec, + #[serde(rename = "createdAt")] + created_at: Option, + #[serde(rename = "lastModified")] + last_modified: Option, + sha: Option, + #[serde(rename = "cardData")] + card_data: Option, + #[serde(default)] + siblings: Vec, +} + +#[derive(Deserialize)] +struct DatasetCard { + license: Option, + language: Option, + task_categories: Option, + size_categories: Option, + annotations_creators: Option, + configs: Option, +} + +#[derive(Deserialize)] +struct Sibling { + rfilename: String, + size: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_dataset_pages() { + assert!(matches("https://huggingface.co/datasets/squad")); // legacy top-level + assert!(matches("https://huggingface.co/datasets/openai/gsm8k")); // canonical owner/name + assert!(!matches("https://huggingface.co/openai/whisper-large-v3")); + assert!(!matches("https://huggingface.co/datasets/")); + } + + #[test] + fn parse_dataset_path_works() { + assert_eq!( + parse_dataset_path("https://huggingface.co/datasets/squad"), + Some("squad".into()) + ); + assert_eq!( + parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k"), + 
Some("openai/gsm8k".into()) + ); + assert_eq!( + parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k/?lib=transformers"), + Some("openai/gsm8k".into()) + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/huggingface_model.rs b/crates/webclaw-fetch/src/extractors/huggingface_model.rs new file mode 100644 index 0000000..4c549e0 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/huggingface_model.rs @@ -0,0 +1,223 @@ +//! HuggingFace model card structured extractor. +//! +//! Uses the public model API at `huggingface.co/api/models/{owner}/{name}`. +//! Returns metadata + the parsed model card front matter, but does not +//! pull the full README body — those are sometimes 100KB+ and the user +//! can hit /v1/scrape if they want it as markdown. + +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "huggingface_model", + label: "HuggingFace model", + description: "Returns model metadata: downloads, likes, license, pipeline tag, library name, file list.", + url_patterns: &["https://huggingface.co/{owner}/{name}"], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if host != "huggingface.co" && host != "www.huggingface.co" { + return false; + } + let path = url + .split("://") + .nth(1) + .and_then(|s| s.split_once('/')) + .map(|(_, p)| p) + .unwrap_or(""); + let stripped = path + .split(['?', '#']) + .next() + .unwrap_or("") + .trim_end_matches('/'); + let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); + // /{owner}/{name} but reject HF-internal sections + sub-pages. 
+ if segs.len() != 2 { + return false; + } + !RESERVED_NAMESPACES.contains(&segs[0]) +} + +const RESERVED_NAMESPACES: &[&str] = &[ + "datasets", + "spaces", + "blog", + "docs", + "api", + "models", + "papers", + "pricing", + "tasks", + "join", + "login", + "settings", + "organizations", + "new", + "search", +]; + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let (owner, name) = parse_owner_name(url).ok_or_else(|| { + FetchError::Build(format!("hf model: cannot parse owner/name from '{url}'")) + })?; + + let api_url = format!("https://huggingface.co/api/models/{owner}/{name}"); + let resp = client.fetch(&api_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "hf model: '{owner}/{name}' not found" + ))); + } + if resp.status == 401 { + return Err(FetchError::Build(format!( + "hf model: '{owner}/{name}' requires authentication (gated repo)" + ))); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "hf api returned status {}", + resp.status + ))); + } + + let m: ModelInfo = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("hf api parse: {e}")))?; + + // Surface a flat file list. Full siblings can be hundreds of entries + // for big repos. We keep it as-is because callers want to know about + // every shard; if it bloats responses too much we'll add pagination. 
+ let files: Vec<Value> = m + .siblings + .iter() + .map(|s| json!({"rfilename": s.rfilename, "size": s.size})) + .collect(); + + Ok(json!({ + "url": url, + "id": m.id, + "model_id": m.model_id, + "private": m.private, + "gated": m.gated, + "downloads": m.downloads, + "downloads_all_time": m.downloads_all_time, + "likes": m.likes, + "library_name": m.library_name, + "pipeline_tag": m.pipeline_tag, + "tags": m.tags, + "license": m.card_data.as_ref().and_then(|c| c.license.clone()), + "language": m.card_data.as_ref().and_then(|c| c.language.clone()), + "datasets": m.card_data.as_ref().and_then(|c| c.datasets.clone()), + "base_model": m.card_data.as_ref().and_then(|c| c.base_model.clone()), + "model_type": m.card_data.as_ref().and_then(|c| c.model_type.clone()), + "created_at": m.created_at, + "last_modified": m.last_modified, + "sha": m.sha, + "file_count": m.siblings.len(), + "files": files, + })) +} + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +fn parse_owner_name(url: &str) -> Option<(String, String)> { + let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; + let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); + let mut segs = stripped.split('/').filter(|s| !s.is_empty()); + let owner = segs.next()?.to_string(); + let name = segs.next()?.to_string(); + Some((owner, name)) +} + +// --------------------------------------------------------------------------- +// HF API types +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct ModelInfo { + id: Option, + #[serde(rename = "modelId")] + model_id: Option, + private: Option, + gated: Option, // bool or string ("auto" / "manual" / false) + downloads: Option, + #[serde(rename = "downloadsAllTime")] + downloads_all_time: Option, + likes: Option, + #[serde(rename = "library_name")] + library_name: Option, + #[serde(rename = "pipeline_tag")] + pipeline_tag: 
Option, + #[serde(default)] + tags: Vec, + #[serde(rename = "createdAt")] + created_at: Option, + #[serde(rename = "lastModified")] + last_modified: Option, + sha: Option, + #[serde(rename = "cardData")] + card_data: Option, + #[serde(default)] + siblings: Vec, +} + +#[derive(Deserialize)] +struct CardData { + license: Option, // string or array + language: Option, + datasets: Option, + #[serde(rename = "base_model")] + base_model: Option, + #[serde(rename = "model_type")] + model_type: Option, +} + +#[derive(Deserialize)] +struct Sibling { + rfilename: String, + size: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_model_pages() { + assert!(matches("https://huggingface.co/meta-llama/Meta-Llama-3-8B")); + assert!(matches("https://huggingface.co/openai/whisper-large-v3")); + assert!(matches("https://huggingface.co/bert-base-uncased/main")); // owner=bert-base-uncased name=main: false positive but acceptable for v1 + } + + #[test] + fn rejects_hf_section_pages() { + assert!(!matches("https://huggingface.co/datasets/squad")); + assert!(!matches("https://huggingface.co/spaces/foo/bar")); + assert!(!matches("https://huggingface.co/blog/intro")); + assert!(!matches("https://huggingface.co/")); + assert!(!matches("https://huggingface.co/meta-llama")); + } + + #[test] + fn parse_owner_name_pulls_both() { + assert_eq!( + parse_owner_name("https://huggingface.co/meta-llama/Meta-Llama-3-8B"), + Some(("meta-llama".into(), "Meta-Llama-3-8B".into())) + ); + assert_eq!( + parse_owner_name("https://huggingface.co/openai/whisper-large-v3?library=transformers"), + Some(("openai".into(), "whisper-large-v3".into())) + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/instagram_post.rs b/crates/webclaw-fetch/src/extractors/instagram_post.rs new file mode 100644 index 0000000..8847e36 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/instagram_post.rs @@ -0,0 +1,235 @@ +//! Instagram post structured extractor. +//! +//! 
Uses Instagram's public embed endpoint +//! `/p/{shortcode}/embed/captioned/` which returns SSR HTML with the +//! full caption, author username, and thumbnail. No auth required. +//! The same endpoint serves reels and IGTV under `/reel/{code}` and +//! `/tv/{code}` URLs (we accept all three). + +use regex::Regex; +use serde_json::{Value, json}; +use std::sync::OnceLock; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "instagram_post", + label: "Instagram post", + description: "Returns full caption, author username, thumbnail, and post type (post / reel / tv) via Instagram's public embed.", + url_patterns: &[ + "https://www.instagram.com/p/{shortcode}/", + "https://www.instagram.com/reel/{shortcode}/", + "https://www.instagram.com/tv/{shortcode}/", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if !matches!(host, "www.instagram.com" | "instagram.com") { + return false; + } + parse_shortcode(url).is_some() +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let (kind, shortcode) = parse_shortcode(url).ok_or_else(|| { + FetchError::Build(format!( + "instagram_post: cannot parse shortcode from '{url}'" + )) + })?; + + // Instagram serves the same embed HTML for posts/reels/tv under /p/. 
+ let embed_url = format!("https://www.instagram.com/p/{shortcode}/embed/captioned/"); + let resp = client.fetch(&embed_url).await?; + if resp.status != 200 { + return Err(FetchError::Build(format!( + "instagram embed returned status {} for {shortcode}", + resp.status + ))); + } + + let html = &resp.html; + let username = parse_username(html); + let caption = parse_caption(html); + let thumbnail = parse_thumbnail(html); + + Ok(json!({ + "url": url, + "embed_url": embed_url, + "shortcode": shortcode, + "kind": kind, + "data_completeness": "embed", + "author_username": username, + "caption": caption, + "thumbnail_url": thumbnail, + "canonical_url": format!("https://www.instagram.com/{}/{shortcode}/", path_segment_for(kind)), + })) +} + +// --------------------------------------------------------------------------- +// URL parsing +// --------------------------------------------------------------------------- + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +/// Returns `(kind, shortcode)` where kind ∈ {`post`, `reel`, `tv`}. 
+fn parse_shortcode(url: &str) -> Option<(&'static str, String)> { + let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; + let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); + let mut segs = stripped.split('/').filter(|s| !s.is_empty()); + let first = segs.next()?; + let kind = match first { + "p" => "post", + "reel" | "reels" => "reel", + "tv" => "tv", + _ => return None, + }; + let shortcode = segs.next()?; + if shortcode.is_empty() { + return None; + } + Some((kind, shortcode.to_string())) +} + +fn path_segment_for(kind: &str) -> &'static str { + match kind { + "reel" => "reel", + "tv" => "tv", + _ => "p", + } +} + +// --------------------------------------------------------------------------- +// HTML scraping +// --------------------------------------------------------------------------- + +/// Username appears as the anchor text inside ``. +fn parse_username(html: &str) -> Option { + static RE: OnceLock = OnceLock::new(); + let re = RE.get_or_init(|| Regex::new(r#"(?s)class="CaptionUsername"[^>]*>([^<]+)<"#).unwrap()); + re.captures(html) + .and_then(|c| c.get(1)) + .map(|m| html_decode(m.as_str().trim())) +} + +/// Caption sits inside `
` after the username anchor. +/// We grab the whole Caption block and strip out the username link, time +/// node, and any trailing "Photo by" / "View ... on Instagram" boilerplate. +fn parse_caption(html: &str) -> Option { + static RE_OUTER: OnceLock = OnceLock::new(); + let outer = RE_OUTER + .get_or_init(|| Regex::new(r#"(?s)]*>(.*?)
"#).unwrap()); + let block = outer.captures(html)?.get(1)?.as_str(); + + // Strip everything wrapped in
.... + static RE_USER: OnceLock = OnceLock::new(); + let user_re = RE_USER + .get_or_init(|| Regex::new(r#"(?s)]*class="CaptionUsername"[^>]*>.*?"#).unwrap()); + let stripped = user_re.replace_all(block, ""); + + // Then strip anything remaining tagged. + static RE_TAGS: OnceLock = OnceLock::new(); + let tag_re = RE_TAGS.get_or_init(|| Regex::new(r"<[^>]+>").unwrap()); + let text = tag_re.replace_all(&stripped, " "); + + let cleaned = collapse_whitespace(&html_decode(text.trim())); + if cleaned.is_empty() { + None + } else { + Some(cleaned) + } +} + +/// Thumbnail is the `` inside the embed +/// (or the og:image as fallback). +fn parse_thumbnail(html: &str) -> Option { + static RE_IMG: OnceLock = OnceLock::new(); + let img_re = RE_IMG.get_or_init(|| { + Regex::new(r#"(?s)]+class="[^"]*EmbeddedMediaImage[^"]*"[^>]+src="([^"]+)""#) + .unwrap() + }); + if let Some(m) = img_re.captures(html).and_then(|c| c.get(1)) { + return Some(html_decode(m.as_str())); + } + static RE_OG: OnceLock = OnceLock::new(); + let og_re = RE_OG.get_or_init(|| { + Regex::new(r#"(?i)]+property="og:image"[^>]+content="([^"]+)""#).unwrap() + }); + og_re + .captures(html) + .and_then(|c| c.get(1)) + .map(|m| html_decode(m.as_str())) +} + +fn html_decode(s: &str) -> String { + s.replace("&", "&") + .replace("<", "<") + .replace(">", ">") + .replace(""", "\"") + .replace("'", "'") + .replace("@", "@") + .replace("•", "•") + .replace("…", "…") +} + +fn collapse_whitespace(s: &str) -> String { + s.split_whitespace().collect::>().join(" ") +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_post_reel_tv_urls() { + assert!(matches("https://www.instagram.com/p/DT-RICMjeK5/")); + assert!(matches( + "https://www.instagram.com/p/DT-RICMjeK5/?img_index=1" + )); + assert!(matches("https://www.instagram.com/reel/abc123/")); + assert!(matches("https://www.instagram.com/tv/abc123/")); + assert!(!matches("https://www.instagram.com/ticketswave")); + 
assert!(!matches("https://www.instagram.com/")); + assert!(!matches("https://example.com/p/abc/")); + } + + #[test] + fn parse_shortcode_reads_each_kind() { + assert_eq!( + parse_shortcode("https://www.instagram.com/p/DT-RICMjeK5/?img_index=1"), + Some(("post", "DT-RICMjeK5".into())) + ); + assert_eq!( + parse_shortcode("https://www.instagram.com/reel/abc123/"), + Some(("reel", "abc123".into())) + ); + assert_eq!( + parse_shortcode("https://www.instagram.com/tv/abc123"), + Some(("tv", "abc123".into())) + ); + } + + #[test] + fn parse_username_pulls_anchor_text() { + let html = r#"ticketswave"#; + assert_eq!(parse_username(html).as_deref(), Some("ticketswave")); + } + + #[test] + fn parse_caption_strips_username_anchor() { + let html = r#"
ticketswave Some caption text here
"#; + assert_eq!( + parse_caption(html).as_deref(), + Some("Some caption text here") + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/instagram_profile.rs b/crates/webclaw-fetch/src/extractors/instagram_profile.rs new file mode 100644 index 0000000..9a92b4c --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/instagram_profile.rs @@ -0,0 +1,465 @@ +//! Instagram profile structured extractor. +//! +//! Hits Instagram's internal `web_profile_info` endpoint at +//! `instagram.com/api/v1/users/web_profile_info/?username=X`. The +//! `x-ig-app-id` header is Instagram's own public web-app id (not a +//! secret) — the same value Instagram's own JavaScript bundle sends. +//! +//! Returns the full profile (bio, exact follower count, verified / +//! business flags, profile picture) plus the **12 most recent posts** +//! with shortcodes, like counts, types, thumbnails, and caption +//! previews. Callers can fan out to `/v1/scrape/instagram_post` per +//! shortcode to get the full caption + media. +//! +//! Pagination beyond 12 requires authenticated cookies + a CSRF token; +//! we accept that as the practical ceiling for the unauth path. The +//! cloud (with stored sessions) can paginate later as a follow-up. +//! +//! Falls back to OG-tag scraping of the public profile page if the API +//! returns 401/403 — Instagram has tightened this endpoint multiple +//! times, so we keep the second path warm. + +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "instagram_profile", + label: "Instagram profile", + description: "Returns full profile metadata + the 12 most recent posts (shortcode, url, type, likes, thumbnail).", + url_patterns: &["https://www.instagram.com/{username}/"], +}; + +/// Instagram's own public web-app identifier. 
Sent by their JS bundle +/// on every API call, accepted by the unauth endpoint, not a secret. +const IG_APP_ID: &str = "936619743392459"; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if !matches!(host, "www.instagram.com" | "instagram.com") { + return false; + } + let path = url + .split("://") + .nth(1) + .and_then(|s| s.split_once('/')) + .map(|(_, p)| p) + .unwrap_or(""); + let stripped = path + .split(['?', '#']) + .next() + .unwrap_or("") + .trim_end_matches('/'); + let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); + segs.len() == 1 && !RESERVED.contains(&segs[0]) +} + +const RESERVED: &[&str] = &[ + "p", + "reel", + "reels", + "tv", + "explore", + "stories", + "directory", + "accounts", + "about", + "developer", + "press", + "api", + "ads", + "blog", + "fragments", + "terms", + "privacy", + "session", + "login", + "signup", +]; + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let username = parse_username(url).ok_or_else(|| { + FetchError::Build(format!( + "instagram_profile: cannot parse username from '{url}'" + )) + })?; + + let api_url = + format!("https://www.instagram.com/api/v1/users/web_profile_info/?username={username}"); + let extra_headers: &[(&str, &str)] = &[ + ("x-ig-app-id", IG_APP_ID), + ("accept", "*/*"), + ("sec-fetch-site", "same-origin"), + ("x-requested-with", "XMLHttpRequest"), + ]; + let resp = client.fetch_with_headers(&api_url, extra_headers).await?; + + if resp.status == 404 { + return Err(FetchError::Build(format!( + "instagram_profile: '{username}' not found" + ))); + } + // Auth wall fallback: Instagram occasionally tightens this endpoint + // and starts returning 401/403/302 to a login page. When that + // happens we still want to give the caller something useful — the + // OG tags from the public HTML page (no posts list or bio, but name, + // counts, and profile picture). Any other non-2xx status takes the + // same path.
+ if !(200..300).contains(&resp.status) { + return og_fallback(client, &username, url, resp.status).await; + } + + let body: ApiResponse = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("instagram_profile parse: {e}")))?; + let user = body.data.user; + + let recent_posts: Vec<Value> = user + .edge_owner_to_timeline_media + .as_ref() + .map(|m| m.edges.iter().map(|e| post_summary(&e.node)).collect()) + .unwrap_or_default(); + + Ok(json!({ + "url": url, + "canonical_url": format!("https://www.instagram.com/{username}/"), + "username": user.username.unwrap_or(username), + "data_completeness": "api", + "user_id": user.id, + "full_name": user.full_name, + "biography": user.biography, + "biography_links": user.bio_links, + "external_url": user.external_url, + "category": user.category_name, + "follower_count": user.edge_followed_by.map(|c| c.count), + "following_count": user.edge_follow.map(|c| c.count), + "post_count": user.edge_owner_to_timeline_media.as_ref().map(|m| m.count), + "is_verified": user.is_verified, + "is_private": user.is_private, + "is_business": user.is_business_account, + "is_professional": user.is_professional_account, + "profile_pic_url": user.profile_pic_url_hd.or(user.profile_pic_url), + "recent_posts": recent_posts, + })) +} + +/// Build the per-post summary the caller fans out from. Includes a +/// constructed `url` so the loop is `for p in recent_posts: scrape('instagram_post', p.url)`.
+fn post_summary(n: &MediaNode) -> Value { + let kind = classify(n); + let url = match kind { + "reel" => format!( + "https://www.instagram.com/reel/{}/", + n.shortcode.as_deref().unwrap_or("") + ), + _ => format!( + "https://www.instagram.com/p/{}/", + n.shortcode.as_deref().unwrap_or("") + ), + }; + let caption = n + .edge_media_to_caption + .as_ref() + .and_then(|c| c.edges.first()) + .and_then(|e| e.node.text.clone()); + json!({ + "shortcode": n.shortcode, + "url": url, + "kind": kind, + "is_video": n.is_video.unwrap_or(false), + "video_views": n.video_view_count, + "thumbnail_url": n.thumbnail_src.clone().or_else(|| n.display_url.clone()), + "display_url": n.display_url, + "like_count": n.edge_media_preview_like.as_ref().map(|c| c.count), + "comment_count": n.edge_media_to_comment.as_ref().map(|c| c.count), + "taken_at": n.taken_at_timestamp, + "caption": caption, + "alt_text": n.accessibility_caption, + "dimensions": n.dimensions.as_ref().map(|d| json!({"width": d.width, "height": d.height})), + "product_type": n.product_type, + }) +} + +/// Best-effort post-type classification. `clips` is reels; `feed` is +/// the regular grid. Sidecar = multi-photo carousel. +fn classify(n: &MediaNode) -> &'static str { + if n.product_type.as_deref() == Some("clips") { + return "reel"; + } + match n.typename.as_deref() { + Some("GraphSidecar") => "carousel", + Some("GraphVideo") => "video", + Some("GraphImage") => "photo", + _ => "post", + } +} + +/// Fallback when the API path is blocked: hit the public profile HTML, +/// pull whatever OG tags we can. Returns less data and explicitly +/// flags `data_completeness: "og_only"` so callers know. 
+async fn og_fallback( + client: &dyn Fetcher, + username: &str, + original_url: &str, + api_status: u16, +) -> Result<Value, FetchError> { + let canonical = format!("https://www.instagram.com/{username}/"); + let resp = client.fetch(&canonical).await?; + if resp.status != 200 { + return Err(FetchError::Build(format!( + "instagram_profile: api status {api_status}, html status {} for {username}", + resp.status + ))); + } + let og = parse_og_tags(&resp.html); + let (followers, following, posts) = + parse_counts_from_og_description(og.get("description").map(String::as_str)); + + Ok(json!({ + "url": original_url, + "canonical_url": canonical, + "username": username, + "data_completeness": "og_only", + "fallback_reason": format!("api returned {api_status}"), + "full_name": parse_full_name(&og.get("title").cloned().unwrap_or_default()), + "follower_count": followers, + "following_count": following, + "post_count": posts, + "profile_pic_url": og.get("image").cloned(), + "biography": null_value(), + "is_verified": null_value(), + "is_business": null_value(), + "recent_posts": Vec::<Value>::new(), + })) +} + +fn null_value() -> Value { + Value::Null +} + +// --------------------------------------------------------------------------- +// URL parsing +// --------------------------------------------------------------------------- + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +fn parse_username(url: &str) -> Option<String> { + let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; + let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); + stripped + .split('/') + .find(|s| !s.is_empty()) + .map(|s| s.to_string()) +} + +// --------------------------------------------------------------------------- +// OG-fallback helpers (kept self-contained — same shape as the previous +// version we shipped, retained as the safety net) +// --------------------------------------------------------------------------- +
+fn parse_og_tags(html: &str) -> std::collections::HashMap<String, String> { + use regex::Regex; + use std::sync::OnceLock; + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() + }); + let mut out = std::collections::HashMap::new(); + for c in re.captures_iter(html) { + let k = c + .get(1) + .map(|m| m.as_str().to_lowercase()) + .unwrap_or_default(); + let v = c + .get(2) + .map(|m| html_decode(m.as_str())) + .unwrap_or_default(); + out.entry(k).or_insert(v); + } + out +} + +fn parse_full_name(og_title: &str) -> Option<String> { + if og_title.is_empty() { + return None; + } + let decoded = html_decode(og_title); + let trimmed = decoded.split('(').next().unwrap_or(&decoded).trim(); + if trimmed.is_empty() { + None + } else { + Some(trimmed.to_string()) + } +} + +fn parse_counts_from_og_description(desc: Option<&str>) -> (Option<i64>, Option<i64>, Option<i64>) { + let Some(text) = desc else { + return (None, None, None); + }; + let decoded = html_decode(text); + use regex::Regex; + use std::sync::OnceLock; + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new(r"(?i)([\d.,]+[KMB]?)\s*Followers,\s*([\d.,]+[KMB]?)\s*Following,\s*([\d.,]+[KMB]?)\s*Posts").unwrap() + }); + if let Some(c) = re.captures(&decoded) { + return ( + c.get(1).and_then(|m| parse_compact_number(m.as_str())), + c.get(2).and_then(|m| parse_compact_number(m.as_str())), + c.get(3).and_then(|m| parse_compact_number(m.as_str())), + ); + } + (None, None, None) +} + +fn parse_compact_number(s: &str) -> Option<i64> { + let s = s.trim(); + let (num_str, mul) = match s.chars().last() { + Some('K') => (&s[..s.len() - 1], 1_000i64), + Some('M') => (&s[..s.len() - 1], 1_000_000i64), + Some('B') => (&s[..s.len() - 1], 1_000_000_000i64), + _ => (s, 1i64), + }; + let cleaned: String = num_str.chars().filter(|c| *c != ',').collect(); + cleaned.parse::<f64>().ok().map(|f| (f * mul as f64) as i64) +} + +fn html_decode(s: &str) -> String {
+ s.replace("&amp;", "&") + .replace("&lt;", "<") + .replace("&gt;", ">") + .replace("&quot;", "\"") + .replace("&#39;", "'") + .replace("&#064;", "@") + .replace("&#x2022;", "•") + .replace("&hellip;", "…") +} + +// --------------------------------------------------------------------------- +// Instagram web_profile_info API types +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct ApiResponse { + data: ApiData, +} + +#[derive(Deserialize)] +struct ApiData { + user: User, +} + +#[derive(Deserialize)] +struct User { + id: Option<String>, + username: Option<String>, + full_name: Option<String>, + biography: Option<String>, + bio_links: Option<Vec<Value>>, + external_url: Option<String>, + category_name: Option<String>, + profile_pic_url: Option<String>, + profile_pic_url_hd: Option<String>, + is_verified: Option<bool>, + is_private: Option<bool>, + is_business_account: Option<bool>, + is_professional_account: Option<bool>, + edge_followed_by: Option<EdgeCount>, + edge_follow: Option<EdgeCount>, + edge_owner_to_timeline_media: Option<MediaEdges>, +} + +#[derive(Deserialize)] +struct EdgeCount { + count: i64, +} + +#[derive(Deserialize)] +struct MediaEdges { + count: i64, + edges: Vec<MediaEdge>, +} + +#[derive(Deserialize)] +struct MediaEdge { + node: MediaNode, +} + +#[derive(Deserialize)] +struct MediaNode { + #[serde(rename = "__typename")] + typename: Option<String>, + shortcode: Option<String>, + is_video: Option<bool>, + video_view_count: Option<i64>, + display_url: Option<String>, + thumbnail_src: Option<String>, + accessibility_caption: Option<String>, + taken_at_timestamp: Option<i64>, + product_type: Option<String>, + dimensions: Option<Dimensions>, + edge_media_preview_like: Option<EdgeCount>, + edge_media_to_comment: Option<EdgeCount>, + edge_media_to_caption: Option<CaptionEdges>, +} + +#[derive(Deserialize)] +struct Dimensions { + width: i64, + height: i64, +} + +#[derive(Deserialize)] +struct CaptionEdges { + edges: Vec<CaptionEdge>, +} + +#[derive(Deserialize)] +struct CaptionEdge { + node: CaptionNode, +} + +#[derive(Deserialize)] +struct CaptionNode { + text: Option<String>, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_profile_urls() { +
assert!(matches("https://www.instagram.com/ticketswave")); + assert!(matches("https://www.instagram.com/ticketswave/")); + assert!(matches("https://instagram.com/0xmassi/?hl=en")); + assert!(!matches("https://www.instagram.com/p/DT-RICMjeK5/")); + assert!(!matches("https://www.instagram.com/explore")); + assert!(!matches("https://www.instagram.com/")); + assert!(!matches("https://example.com/foo")); + } + + #[test] + fn parse_full_name_strips_handle() { + assert_eq!( + parse_full_name("Ticket Wave (@ticketswave) • Instagram photos and videos"), + Some("Ticket Wave".into()) + ); + } + + #[test] + fn compact_number_handles_kmb() { + assert_eq!(parse_compact_number("18K"), Some(18_000)); + assert_eq!(parse_compact_number("1.5M"), Some(1_500_000)); + assert_eq!(parse_compact_number("1,234"), Some(1_234)); + assert_eq!(parse_compact_number("641"), Some(641)); + } +} diff --git a/crates/webclaw-fetch/src/extractors/linkedin_post.rs b/crates/webclaw-fetch/src/extractors/linkedin_post.rs new file mode 100644 index 0000000..ed7e07b --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/linkedin_post.rs @@ -0,0 +1,266 @@ +//! LinkedIn post structured extractor. +//! +//! Uses the public embed endpoint `/embed/feed/update/{urn}` which +//! LinkedIn provides for sites that want to render a post inline. No +//! auth required, returns SSR HTML with the full post body, OG tags, +//! image, and a link back to the original post. +//! +//! Accepts both URN forms (`urn:li:share:N` and `urn:li:activity:N`) +//! and pretty post URLs (`/posts/{user}_{slug}-{id}-{suffix}`) by +//! pulling the trailing numeric id and converting to an activity URN. 
+ +use regex::Regex; +use serde_json::{Value, json}; +use std::sync::OnceLock; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "linkedin_post", + label: "LinkedIn post", + description: "Returns post body, author name, image, and original URL via LinkedIn's public embed endpoint.", + url_patterns: &[ + "https://www.linkedin.com/feed/update/urn:li:share:{id}", + "https://www.linkedin.com/feed/update/urn:li:activity:{id}", + "https://www.linkedin.com/posts/{user}_{slug}-{id}-{suffix}", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if !matches!(host, "www.linkedin.com" | "linkedin.com") { + return false; + } + url.contains("/feed/update/urn:li:") || url.contains("/posts/") +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let urn = extract_urn(url).ok_or_else(|| { + FetchError::Build(format!( + "linkedin_post: cannot extract URN from '{url}' (expected /feed/update/urn:li:... or /posts/{{slug}}-{{id}})" + )) + })?; + + let embed_url = format!("https://www.linkedin.com/embed/feed/update/{urn}"); + let resp = client.fetch(&embed_url).await?; + if resp.status != 200 { + return Err(FetchError::Build(format!( + "linkedin embed returned status {} for {urn}", + resp.status + ))); + } + + let html = &resp.html; + let og = parse_og_tags(html); + let body = parse_post_body(html); + let author = parse_author(html); + let canonical_url = og.get("url").cloned().unwrap_or_else(|| embed_url.clone()); + + Ok(json!({ + "url": url, + "embed_url": embed_url, + "urn": urn, + "canonical_url": canonical_url, + "data_completeness": "embed", + "title": og.get("title").cloned(), + "body": body, + "author_name": author, + "image_url": og.get("image").cloned(), + "site_name": og.get("site_name").cloned().unwrap_or_else(|| "LinkedIn".into()), + })) +} + +// --------------------------------------------------------------------------- +// URN extraction +// --------------------------------------------------------------------------- + +/// Pull a `urn:li:share:N` or `urn:li:activity:N` from any LinkedIn URL. +/// `/posts/{slug}-{id}-{suffix}` URLs encode the activity id as the second- +/// to-last `-` separated chunk. Both forms map to a URN we can hit the +/// embed endpoint with. +fn extract_urn(url: &str) -> Option<String> { + if let Some(idx) = url.find("urn:li:") { + let tail = &url[idx..]; + let end = tail.find(['/', '?', '#']).unwrap_or(tail.len()); + let urn = &tail[..end]; + // Validate shape: urn:li:{type}:{digits} + let mut parts = urn.split(':'); + if parts.next() == Some("urn") + && parts.next() == Some("li") + && parts.next().is_some() + && parts + .next() + .filter(|p| p.chars().all(|c| c.is_ascii_digit())) + .is_some() + { + return Some(urn.to_string()); + } + } + + // /posts/{user}_{slug}-{19-digit-id}-{4-char-hash}/ — id is the second- + // to-last segment after the last `-`.
+ if url.contains("/posts/") { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = + RE.get_or_init(|| Regex::new(r"/posts/[^/]*?-(\d{15,})-[A-Za-z0-9]{2,}/?").unwrap()); + if let Some(c) = re.captures(url) + && let Some(id) = c.get(1) + { + return Some(format!("urn:li:activity:{}", id.as_str())); + } + } + None +} + +// --------------------------------------------------------------------------- +// HTML scraping +// --------------------------------------------------------------------------- + +/// Pull `og:foo` → value pairs out of `<meta property="og:foo" content="...">`. +/// Returns lowercased keys with leading `og:` stripped. +fn parse_og_tags(html: &str) -> std::collections::HashMap<String, String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() + }); + let mut out = std::collections::HashMap::new(); + for c in re.captures_iter(html) { + let k = c + .get(1) + .map(|m| m.as_str().to_lowercase()) + .unwrap_or_default(); + let v = c + .get(2) + .map(|m| html_decode(m.as_str())) + .unwrap_or_default(); + out.entry(k).or_insert(v); + } + out +} + +/// Extract the post body text from the embed page. LinkedIn renders it +/// inside `<p class="attributed-text-segment-list__content ...">{text}</p>` +/// where the inner content can include nested `<a>` tags for links. +fn parse_post_body(html: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new( + r#"(?s)<p[^>]+class="[^"]*attributed-text-segment-list__content[^"]*"[^>]*>(.*?)</p>"#, + ) + .unwrap() + }); + let inner = re.captures(html).and_then(|c| c.get(1))?.as_str(); + Some(strip_tags(inner).trim().to_string()) +} + +/// Author name lives in the `<title>` like: +/// "55 founding members are in… | Orc Dev" +/// The chunk after the final `|` is the author display name. Returns +/// `None` if the page has no `<title>` or the title contains no `|`. +fn parse_author(html: &str) -> Option<String> { + static RE_TITLE: OnceLock<Regex> = OnceLock::new(); + let re = RE_TITLE.get_or_init(|| Regex::new(r"<title>([^<]+)").unwrap()); + let title = re.captures(html).and_then(|c| c.get(1))?.as_str(); + title + .rsplit_once('|') + .map(|(_, name)| html_decode(name.trim())) +} + +/// Replace the small set of HTML entities LinkedIn (and Instagram, etc.) +/// stuff into OG content attributes. +fn html_decode(s: &str) -> String { + s.replace("&amp;", "&") + .replace("&lt;", "<") + .replace("&gt;", ">") + .replace("&quot;", "\"") + .replace("&#39;", "'") + .replace("&#064;", "@") + .replace("&#x2022;", "•") + .replace("&hellip;", "…") +} + +/// Crude HTML tag stripper for the post body. Preserves text inside +/// nested anchors so link text doesn't disappear, then decodes HTML +/// entities. Whitespace is left untouched; the caller trims the result.
+fn strip_tags(html: &str) -> String { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| Regex::new(r"<[^>]+>").unwrap()); + let no_tags = re.replace_all(html, "").to_string(); + html_decode(&no_tags) +} + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_li_post_urls() { + assert!(matches( + "https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/" + )); + assert!(matches( + "https://www.linkedin.com/feed/update/urn:li:activity:7452618583290892288" + )); + assert!(matches( + "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c" + )); + assert!(!matches("https://www.linkedin.com/in/foo")); + assert!(!matches("https://www.linkedin.com/")); + assert!(!matches("https://example.com/feed/update/urn:li:share:1")); + } + + #[test] + fn extract_urn_from_share_url() { + assert_eq!( + extract_urn("https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/"), + Some("urn:li:share:7452618582213144577".into()) + ); + } + + #[test] + fn extract_urn_from_pretty_post_url() { + assert_eq!( + extract_urn( + "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c/" + ), + Some("urn:li:activity:7452618583290892288".into()) + ); + } + + #[test] + fn parse_og_tags_basic() { + let html = r#"<meta property="og:image" content="https://x.com/a.png"> +<meta property="og:url" content="https://example.com/x">"#; + let og = parse_og_tags(html); + assert_eq!( + og.get("image").map(String::as_str), + Some("https://x.com/a.png") + ); + assert_eq!( + og.get("url").map(String::as_str), + Some("https://example.com/x") + ); + } + + #[test] + fn parse_post_body_strips_anchor_tags() { + let html = r#"<p class="attributed-text-segment-list__content">Hello <a href="https://example.com">link</a> world</p>"#; + assert_eq!(parse_post_body(html).as_deref(), Some("Hello link world")); + } + + #[test] + fn html_decode_handles_common_entities() { + assert_eq!(html_decode("AT&amp;T &#064;jane"), "AT&T @jane"); + } +} diff --git a/crates/webclaw-fetch/src/extractors/mod.rs b/crates/webclaw-fetch/src/extractors/mod.rs new file mode 100644 index 0000000..91ef8d0 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/mod.rs @@ -0,0 +1,502 @@ +//! Vertical extractors: site-specific parsers that return typed JSON +//! instead of generic markdown. +//! +//! Each extractor handles a single site or platform and exposes: +//! - `matches(url)` to claim ownership of a URL pattern +//! - `extract(client, url)` to fetch + parse into a typed JSON `Value` +//! - `INFO` static for the catalog (`/v1/extractors`) +//! +//! The dispatch in this module is a simple `match`-style chain rather than +//! a trait registry. With ~30 extractors that's still fast and avoids the +//! ceremony of dynamic dispatch. If we hit 50+ we'll revisit. +//! +//! Extractors prefer official JSON APIs over HTML scraping where one +//! exists (Reddit, HN/Algolia, PyPI, npm, GitHub, HuggingFace all have +//! one). HTML extraction is the fallback for sites that don't.
+ +pub mod amazon_product; +pub mod arxiv; +pub mod crates_io; +pub mod dev_to; +pub mod docker_hub; +pub mod ebay_listing; +pub mod ecommerce_product; +pub mod etsy_listing; +pub mod github_issue; +pub mod github_pr; +pub mod github_release; +pub mod github_repo; +pub mod hackernews; +pub mod huggingface_dataset; +pub mod huggingface_model; +pub mod instagram_post; +pub mod instagram_profile; +pub mod linkedin_post; +pub mod npm; +pub mod pypi; +pub mod reddit; +pub mod shopify_collection; +pub mod shopify_product; +pub mod stackoverflow; +pub mod substack_post; +pub mod trustpilot_reviews; +pub mod woocommerce_product; +pub mod youtube_video; + +use serde::Serialize; +use serde_json::Value; + +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +/// Public catalog entry for `/v1/extractors`. Stable shape — clients +/// rely on `name` to pick the right `/v1/scrape/{name}` route. +#[derive(Debug, Clone, Serialize)] +pub struct ExtractorInfo { + /// URL-safe identifier (`reddit`, `hackernews`, `github_repo`, ...). + pub name: &'static str, + /// Human-friendly display name. + pub label: &'static str, + /// One-line description of what the extractor returns. + pub description: &'static str, + /// Glob-ish URL pattern(s) the extractor claims. For documentation; + /// the actual matching is done by the extractor's `matches` fn. + pub url_patterns: &'static [&'static str], +} + +/// Full catalog. Order is stable; new entries append. 
+pub fn list() -> Vec<ExtractorInfo> { + vec![ + reddit::INFO, + hackernews::INFO, + github_repo::INFO, + github_pr::INFO, + github_issue::INFO, + github_release::INFO, + pypi::INFO, + npm::INFO, + crates_io::INFO, + huggingface_model::INFO, + huggingface_dataset::INFO, + arxiv::INFO, + docker_hub::INFO, + dev_to::INFO, + stackoverflow::INFO, + substack_post::INFO, + youtube_video::INFO, + linkedin_post::INFO, + instagram_post::INFO, + instagram_profile::INFO, + shopify_product::INFO, + shopify_collection::INFO, + ecommerce_product::INFO, + woocommerce_product::INFO, + amazon_product::INFO, + ebay_listing::INFO, + etsy_listing::INFO, + trustpilot_reviews::INFO, + ] +} + +/// Auto-detect mode: try every extractor's `matches`, return the first +/// one that claims the URL. Used by `/v1/scrape` when the caller doesn't +/// pick a vertical explicitly. +pub async fn dispatch_by_url( + client: &dyn Fetcher, + url: &str, +) -> Option<Result<(&'static str, Value), FetchError>> { + if reddit::matches(url) { + return Some( + reddit::extract(client, url) + .await + .map(|v| (reddit::INFO.name, v)), + ); + } + if hackernews::matches(url) { + return Some( + hackernews::extract(client, url) + .await + .map(|v| (hackernews::INFO.name, v)), + ); + } + if github_repo::matches(url) { + return Some( + github_repo::extract(client, url) + .await + .map(|v| (github_repo::INFO.name, v)), + ); + } + if pypi::matches(url) { + return Some( + pypi::extract(client, url) + .await + .map(|v| (pypi::INFO.name, v)), + ); + } + if npm::matches(url) { + return Some(npm::extract(client, url).await.map(|v| (npm::INFO.name, v))); + } + if github_pr::matches(url) { + return Some( + github_pr::extract(client, url) + .await + .map(|v| (github_pr::INFO.name, v)), + ); + } + if github_issue::matches(url) { + return Some( + github_issue::extract(client, url) + .await + .map(|v| (github_issue::INFO.name, v)), + ); + } + if github_release::matches(url) { + return Some( + github_release::extract(client, url) + .await + .map(|v| (github_release::INFO.name, v)), + );
+ } + if crates_io::matches(url) { + return Some( + crates_io::extract(client, url) + .await + .map(|v| (crates_io::INFO.name, v)), + ); + } + if huggingface_model::matches(url) { + return Some( + huggingface_model::extract(client, url) + .await + .map(|v| (huggingface_model::INFO.name, v)), + ); + } + if huggingface_dataset::matches(url) { + return Some( + huggingface_dataset::extract(client, url) + .await + .map(|v| (huggingface_dataset::INFO.name, v)), + ); + } + if arxiv::matches(url) { + return Some( + arxiv::extract(client, url) + .await + .map(|v| (arxiv::INFO.name, v)), + ); + } + if docker_hub::matches(url) { + return Some( + docker_hub::extract(client, url) + .await + .map(|v| (docker_hub::INFO.name, v)), + ); + } + if dev_to::matches(url) { + return Some( + dev_to::extract(client, url) + .await + .map(|v| (dev_to::INFO.name, v)), + ); + } + if stackoverflow::matches(url) { + return Some( + stackoverflow::extract(client, url) + .await + .map(|v| (stackoverflow::INFO.name, v)), + ); + } + if linkedin_post::matches(url) { + return Some( + linkedin_post::extract(client, url) + .await + .map(|v| (linkedin_post::INFO.name, v)), + ); + } + if instagram_post::matches(url) { + return Some( + instagram_post::extract(client, url) + .await + .map(|v| (instagram_post::INFO.name, v)), + ); + } + if instagram_profile::matches(url) { + return Some( + instagram_profile::extract(client, url) + .await + .map(|v| (instagram_profile::INFO.name, v)), + ); + } + // Antibot-gated verticals with unique hosts: safe to auto-dispatch + // because the matcher can't confuse the URL for anything else. The + // extractor's smart_fetch_html path handles the blocked-without- + // API-key case with a clear actionable error. 
+ if amazon_product::matches(url) { + return Some( + amazon_product::extract(client, url) + .await + .map(|v| (amazon_product::INFO.name, v)), + ); + } + if ebay_listing::matches(url) { + return Some( + ebay_listing::extract(client, url) + .await + .map(|v| (ebay_listing::INFO.name, v)), + ); + } + if etsy_listing::matches(url) { + return Some( + etsy_listing::extract(client, url) + .await + .map(|v| (etsy_listing::INFO.name, v)), + ); + } + if trustpilot_reviews::matches(url) { + return Some( + trustpilot_reviews::extract(client, url) + .await + .map(|v| (trustpilot_reviews::INFO.name, v)), + ); + } + if youtube_video::matches(url) { + return Some( + youtube_video::extract(client, url) + .await + .map(|v| (youtube_video::INFO.name, v)), + ); + } + // NOTE: shopify_product, shopify_collection, ecommerce_product, + // woocommerce_product, and substack_post are intentionally NOT + // in auto-dispatch. Their `matches()` functions are permissive + // (any URL with `/products/`, `/product/`, `/p/`, etc.) and + // claiming those generically would steal URLs from the default + // `/v1/scrape` markdown flow. Callers opt in via + // `/v1/scrape/shopify_product` or `/v1/scrape/ecommerce_product`. + None +} + +/// Explicit mode: caller picked the vertical (`POST /v1/scrape/reddit`). +/// We still validate that the URL plausibly belongs to that vertical so +/// users get a clear "wrong route" error instead of a confusing parse +/// failure deep in the extractor. 
+pub async fn dispatch_by_name( + client: &dyn Fetcher, + name: &str, + url: &str, +) -> Result<Value, ExtractorDispatchError> { + match name { + n if n == reddit::INFO.name => { + run_or_mismatch(reddit::matches(url), n, url, || { + reddit::extract(client, url) + }) + .await + } + n if n == hackernews::INFO.name => { + run_or_mismatch(hackernews::matches(url), n, url, || { + hackernews::extract(client, url) + }) + .await + } + n if n == github_repo::INFO.name => { + run_or_mismatch(github_repo::matches(url), n, url, || { + github_repo::extract(client, url) + }) + .await + } + n if n == pypi::INFO.name => { + run_or_mismatch(pypi::matches(url), n, url, || pypi::extract(client, url)).await + } + n if n == npm::INFO.name => { + run_or_mismatch(npm::matches(url), n, url, || npm::extract(client, url)).await + } + n if n == github_pr::INFO.name => { + run_or_mismatch(github_pr::matches(url), n, url, || { + github_pr::extract(client, url) + }) + .await + } + n if n == github_issue::INFO.name => { + run_or_mismatch(github_issue::matches(url), n, url, || { + github_issue::extract(client, url) + }) + .await + } + n if n == github_release::INFO.name => { + run_or_mismatch(github_release::matches(url), n, url, || { + github_release::extract(client, url) + }) + .await + } + n if n == crates_io::INFO.name => { + run_or_mismatch(crates_io::matches(url), n, url, || { + crates_io::extract(client, url) + }) + .await + } + n if n == huggingface_model::INFO.name => { + run_or_mismatch(huggingface_model::matches(url), n, url, || { + huggingface_model::extract(client, url) + }) + .await + } + n if n == huggingface_dataset::INFO.name => { + run_or_mismatch(huggingface_dataset::matches(url), n, url, || { + huggingface_dataset::extract(client, url) + }) + .await + } + n if n == arxiv::INFO.name => { + run_or_mismatch(arxiv::matches(url), n, url, || arxiv::extract(client, url)).await + } + n if n == docker_hub::INFO.name => { + run_or_mismatch(docker_hub::matches(url), n, url, || { + docker_hub::extract(client, url) +
}) + .await + } + n if n == dev_to::INFO.name => { + run_or_mismatch(dev_to::matches(url), n, url, || { + dev_to::extract(client, url) + }) + .await + } + n if n == stackoverflow::INFO.name => { + run_or_mismatch(stackoverflow::matches(url), n, url, || { + stackoverflow::extract(client, url) + }) + .await + } + n if n == linkedin_post::INFO.name => { + run_or_mismatch(linkedin_post::matches(url), n, url, || { + linkedin_post::extract(client, url) + }) + .await + } + n if n == instagram_post::INFO.name => { + run_or_mismatch(instagram_post::matches(url), n, url, || { + instagram_post::extract(client, url) + }) + .await + } + n if n == instagram_profile::INFO.name => { + run_or_mismatch(instagram_profile::matches(url), n, url, || { + instagram_profile::extract(client, url) + }) + .await + } + n if n == shopify_product::INFO.name => { + run_or_mismatch(shopify_product::matches(url), n, url, || { + shopify_product::extract(client, url) + }) + .await + } + n if n == ecommerce_product::INFO.name => { + run_or_mismatch(ecommerce_product::matches(url), n, url, || { + ecommerce_product::extract(client, url) + }) + .await + } + n if n == amazon_product::INFO.name => { + run_or_mismatch(amazon_product::matches(url), n, url, || { + amazon_product::extract(client, url) + }) + .await + } + n if n == ebay_listing::INFO.name => { + run_or_mismatch(ebay_listing::matches(url), n, url, || { + ebay_listing::extract(client, url) + }) + .await + } + n if n == etsy_listing::INFO.name => { + run_or_mismatch(etsy_listing::matches(url), n, url, || { + etsy_listing::extract(client, url) + }) + .await + } + n if n == trustpilot_reviews::INFO.name => { + run_or_mismatch(trustpilot_reviews::matches(url), n, url, || { + trustpilot_reviews::extract(client, url) + }) + .await + } + n if n == youtube_video::INFO.name => { + run_or_mismatch(youtube_video::matches(url), n, url, || { + youtube_video::extract(client, url) + }) + .await + } + n if n == substack_post::INFO.name => { + 
run_or_mismatch(substack_post::matches(url), n, url, || { + substack_post::extract(client, url) + }) + .await + } + n if n == shopify_collection::INFO.name => { + run_or_mismatch(shopify_collection::matches(url), n, url, || { + shopify_collection::extract(client, url) + }) + .await + } + n if n == woocommerce_product::INFO.name => { + run_or_mismatch(woocommerce_product::matches(url), n, url, || { + woocommerce_product::extract(client, url) + }) + .await + } + _ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())), + } +} + +/// Errors that the dispatcher itself raises (vs. errors from inside an +/// extractor, which come back wrapped in `Fetch`). +#[derive(Debug, thiserror::Error)] +pub enum ExtractorDispatchError { + #[error("unknown vertical: '{0}'")] + UnknownVertical(String), + + #[error("URL '{url}' does not match the '{vertical}' extractor")] + UrlMismatch { vertical: String, url: String }, + + #[error(transparent)] + Fetch(#[from] FetchError), +} + +/// Helper: when the caller explicitly picked a vertical but their URL +/// doesn't match it, return `UrlMismatch` instead of running the +/// extractor (which would just fail with a less-clear error). 
+async fn run_or_mismatch( + matches: bool, + vertical: &str, + url: &str, + f: F, +) -> Result +where + F: FnOnce() -> Fut, + Fut: std::future::Future>, +{ + if !matches { + return Err(ExtractorDispatchError::UrlMismatch { + vertical: vertical.to_string(), + url: url.to_string(), + }); + } + f().await.map_err(ExtractorDispatchError::Fetch) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn list_is_non_empty_and_unique() { + let entries = list(); + assert!(!entries.is_empty()); + let mut names: Vec<_> = entries.iter().map(|e| e.name).collect(); + names.sort(); + let before = names.len(); + names.dedup(); + assert_eq!(before, names.len(), "extractor names must be unique"); + } +} diff --git a/crates/webclaw-fetch/src/extractors/npm.rs b/crates/webclaw-fetch/src/extractors/npm.rs new file mode 100644 index 0000000..f84da0e --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/npm.rs @@ -0,0 +1,235 @@ +//! npm package structured extractor. +//! +//! Uses two npm-run APIs: +//! - `registry.npmjs.org/{name}` for full package metadata +//! - `api.npmjs.org/downloads/point/last-week/{name}` for usage signal +//! +//! The registry API returns the *full* document including every version +//! ever published, which can be tens of MB for popular packages +//! (`@types/node` etc). We strip down to the latest version's manifest +//! and a count of releases — full history would explode the response. 
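Reviewer note: the two endpoints named in the npm module doc above can be derived from a package name roughly as follows. This is a minimal sketch using only the standard library; `npm_api_urls` is a hypothetical helper for illustration and is not part of the crate. The `%2F` encoding of the scope separator is assumed to match what `urlencode_segment` does in this file.

```rust
/// Hypothetical helper (not in the crate): derive the registry URL and the
/// weekly-downloads URL for an npm package name.
fn npm_api_urls(name: &str) -> (String, String) {
    // Scoped names such as `@types/node` contain a `/`. It is
    // percent-encoded here so the name stays a single path segment.
    let encoded = name.replace('/', "%2F");
    (
        format!("https://registry.npmjs.org/{encoded}"),
        format!("https://api.npmjs.org/downloads/point/last-week/{encoded}"),
    )
}
```

Calling `npm_api_urls("react")` leaves the name untouched, while a scoped name has only its `/` rewritten.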
+ +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "npm", + label: "npm package", + description: "Returns package metadata: latest version manifest, dependencies, weekly downloads, license.", + url_patterns: &["https://www.npmjs.com/package/{name}"], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if host != "www.npmjs.com" && host != "npmjs.com" { + return false; + } + url.contains("/package/") +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let name = parse_name(url) + .ok_or_else(|| FetchError::Build(format!("npm: cannot parse name from '{url}'")))?; + + let registry_url = format!("https://registry.npmjs.org/{}", urlencode_segment(&name)); + let resp = client.fetch(&registry_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "npm: package '{name}' not found" + ))); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "npm registry returned status {}", + resp.status + ))); + } + + let pkg: PackageDoc = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("npm registry parse: {e}")))?; + + // Resolve "latest" to a concrete version. + let latest_version = pkg + .dist_tags + .as_ref() + .and_then(|t| t.get("latest")) + .cloned() + .or_else(|| pkg.versions.as_ref().and_then(|v| v.keys().last().cloned())); + + let latest_manifest = latest_version + .as_deref() + .and_then(|v| pkg.versions.as_ref().and_then(|m| m.get(v))); + + let release_count = pkg.versions.as_ref().map(|v| v.len()).unwrap_or(0); + let latest_release_date = latest_version + .as_deref() + .and_then(|v| pkg.time.as_ref().and_then(|t| t.get(v).cloned())); + + // Best-effort weekly downloads. 
If the api.npmjs.org call fails we + // surface `null` rather than failing the whole extractor — npm + // sometimes 503s the downloads endpoint while the registry is up. + let weekly_downloads = fetch_weekly_downloads(client, &name).await.ok(); + + Ok(json!({ + "url": url, + "name": pkg.name.clone().unwrap_or(name.clone()), + "description": pkg.description, + "latest_version": latest_version, + "license": latest_manifest.and_then(|m| m.license.clone()), + "homepage": pkg.homepage, + "repository": pkg.repository.as_ref().and_then(|r| r.url.clone()), + "dependencies": latest_manifest.and_then(|m| m.dependencies.clone()), + "dev_dependencies": latest_manifest.and_then(|m| m.dev_dependencies.clone()), + "peer_dependencies": latest_manifest.and_then(|m| m.peer_dependencies.clone()), + "keywords": pkg.keywords, + "maintainers": pkg.maintainers, + "deprecated": latest_manifest.and_then(|m| m.deprecated.clone()), + "release_count": release_count, + "latest_release_date": latest_release_date, + "weekly_downloads": weekly_downloads, + })) +} + +async fn fetch_weekly_downloads(client: &dyn Fetcher, name: &str) -> Result { + let url = format!( + "https://api.npmjs.org/downloads/point/last-week/{}", + urlencode_segment(name) + ); + let resp = client.fetch(&url).await?; + if resp.status != 200 { + return Err(FetchError::Build(format!( + "npm downloads api status {}", + resp.status + ))); + } + let dl: Downloads = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("npm downloads parse: {e}")))?; + Ok(dl.downloads) +} + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +/// Extract the package name from an npmjs.com URL. Handles scoped packages +/// (`/package/@scope/name`) and trailing path segments (`/v/x.y.z`). 
+fn parse_name(url: &str) -> Option { + let after = url.split("/package/").nth(1)?; + let stripped = after.split(['?', '#']).next()?.trim_end_matches('/'); + let mut segs = stripped.split('/').filter(|s| !s.is_empty()); + let first = segs.next()?; + if first.starts_with('@') { + let second = segs.next()?; + Some(format!("{first}/{second}")) + } else { + Some(first.to_string()) + } +} + +/// `@scope/name` must encode the `/` for the registry path. Plain names +/// pass through untouched. +fn urlencode_segment(name: &str) -> String { + name.replace('/', "%2F") +} + +// --------------------------------------------------------------------------- +// Registry types +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct PackageDoc { + name: Option, + description: Option, + homepage: Option, // sometimes string, sometimes object + repository: Option, + keywords: Option>, + maintainers: Option>, + #[serde(rename = "dist-tags")] + dist_tags: Option>, + versions: Option>, + time: Option>, +} + +#[derive(Deserialize, Default, Clone)] +struct VersionManifest { + license: Option, // string or object + dependencies: Option>, + #[serde(rename = "devDependencies")] + dev_dependencies: Option>, + #[serde(rename = "peerDependencies")] + peer_dependencies: Option>, + // `deprecated` is sometimes a bool and sometimes a string in the + // registry. serde_json::Value covers both without failing the parse. 
+ deprecated: Option, +} + +#[derive(Deserialize)] +struct Repository { + url: Option, +} + +#[derive(Deserialize, Clone)] +struct Maintainer { + name: Option, + email: Option, +} + +impl serde::Serialize for Maintainer { + fn serialize(&self, s: S) -> Result { + use serde::ser::SerializeMap; + let mut m = s.serialize_map(Some(2))?; + m.serialize_entry("name", &self.name)?; + m.serialize_entry("email", &self.email)?; + m.end() + } +} + +#[derive(Deserialize)] +struct Downloads { + downloads: i64, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_npm_package_urls() { + assert!(matches("https://www.npmjs.com/package/react")); + assert!(matches("https://www.npmjs.com/package/@types/node")); + assert!(matches("https://npmjs.com/package/lodash")); + assert!(!matches("https://www.npmjs.com/")); + assert!(!matches("https://example.com/package/foo")); + } + + #[test] + fn parse_name_handles_scoped_and_unscoped() { + assert_eq!( + parse_name("https://www.npmjs.com/package/react"), + Some("react".into()) + ); + assert_eq!( + parse_name("https://www.npmjs.com/package/@types/node"), + Some("@types/node".into()) + ); + assert_eq!( + parse_name("https://www.npmjs.com/package/lodash/v/4.17.21"), + Some("lodash".into()) + ); + } + + #[test] + fn urlencode_only_touches_scope_separator() { + assert_eq!(urlencode_segment("react"), "react"); + assert_eq!(urlencode_segment("@types/node"), "@types%2Fnode"); + } +} diff --git a/crates/webclaw-fetch/src/extractors/pypi.rs b/crates/webclaw-fetch/src/extractors/pypi.rs new file mode 100644 index 0000000..33a4d1c --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/pypi.rs @@ -0,0 +1,184 @@ +//! PyPI package structured extractor. +//! +//! PyPI exposes a stable JSON API at `pypi.org/pypi/{name}/json` and +//! a versioned form at `pypi.org/pypi/{name}/{version}/json`. Both +//! return the full release info plus history. No auth, no rate limits +//! that we hit at normal usage. 
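Reviewer note: the mapping from a `pypi.org/project/...` page URL to the two JSON API forms described in the PyPI module doc above can be sketched as below. `pypi_api_url` is a hypothetical standalone helper, not a function in the crate. It assumes the same URL shapes listed in `INFO.url_patterns` and does no host validation.

```rust
/// Hypothetical helper (not in the crate): map a PyPI project page URL to
/// the matching JSON API URL, versioned when the page URL carries a version.
fn pypi_api_url(project_url: &str) -> Option<String> {
    // Everything after `/project/` is `{name}/` or `{name}/{version}/`.
    let after = project_url.split("/project/").nth(1)?;
    // Drop any query string or fragment before splitting the path.
    let path = after.split(['?', '#']).next()?;
    let mut segs = path.split('/').filter(|s| !s.is_empty());
    let name = segs.next()?;
    Some(match segs.next() {
        Some(version) => format!("https://pypi.org/pypi/{name}/{version}/json"),
        None => format!("https://pypi.org/pypi/{name}/json"),
    })
}
```

Returning `Option` keeps the "cannot parse" case explicit, so a caller can turn `None` into whatever error type it uses.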
+ +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "pypi", + label: "PyPI package", + description: "Returns package metadata: latest version, dependencies, license, release history.", + url_patterns: &[ + "https://pypi.org/project/{name}/", + "https://pypi.org/project/{name}/{version}/", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if host != "pypi.org" && host != "www.pypi.org" { + return false; + } + url.contains("/project/") +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let (name, version) = parse_project(url).ok_or_else(|| { + FetchError::Build(format!("pypi: cannot parse package name from '{url}'")) + })?; + + let api_url = match &version { + Some(v) => format!("https://pypi.org/pypi/{name}/{v}/json"), + None => format!("https://pypi.org/pypi/{name}/json"), + }; + let resp = client.fetch(&api_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "pypi: package '{name}' not found" + ))); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "pypi api returned status {}", + resp.status + ))); + } + + let pkg: PypiResponse = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("pypi parse: {e}")))?; + + let info = pkg.info; + let release_count = pkg.releases.as_ref().map(|r| r.len()).unwrap_or(0); + + // Latest release date = max upload time across files in the latest version. + let latest_release_date = pkg + .releases + .as_ref() + .and_then(|map| info.version.as_deref().and_then(|v| map.get(v))) + .and_then(|files| files.iter().filter_map(|f| f.upload_time.clone()).max()); + + // Drop the long description from the JSON shape — it's frequently a 50KB + // README and bloats responses. Callers who need it can hit /v1/scrape. 
+ Ok(json!({ + "url": url, + "name": info.name, + "version": info.version, + "summary": info.summary, + "homepage": info.home_page, + "license": info.license, + "license_classifier": pick_license_classifier(&info.classifiers), + "author": info.author, + "author_email": info.author_email, + "maintainer": info.maintainer, + "requires_python": info.requires_python, + "requires_dist": info.requires_dist, + "keywords": info.keywords, + "classifiers": info.classifiers, + "yanked": info.yanked, + "yanked_reason": info.yanked_reason, + "project_urls": info.project_urls, + "release_count": release_count, + "latest_release_date": latest_release_date, + })) +} + +/// PyPI puts the SPDX-ish license under classifiers like +/// `License :: OSI Approved :: Apache Software License`. Surface the most +/// specific one when the `license` field itself is empty/junk. +fn pick_license_classifier(classifiers: &Option>) -> Option { + classifiers + .as_ref()? + .iter() + .filter(|c| c.starts_with("License ::")) + .max_by_key(|c| c.len()) + .cloned() +} + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +fn parse_project(url: &str) -> Option<(String, Option)> { + let after = url.split("/project/").nth(1)?; + let stripped = after.split(['?', '#']).next()?.trim_end_matches('/'); + let mut segs = stripped.split('/').filter(|s| !s.is_empty()); + let name = segs.next()?.to_string(); + let version = segs.next().map(|v| v.to_string()); + Some((name, version)) +} + +// --------------------------------------------------------------------------- +// PyPI API types +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct PypiResponse { + info: Info, + releases: Option>>, +} + +#[derive(Deserialize)] +struct Info { + name: Option, + version: Option, + summary: Option, + home_page: Option, + license: Option, + author: Option, + author_email: Option, + maintainer: Option, 
+ requires_python: Option, + requires_dist: Option>, + keywords: Option, + classifiers: Option>, + yanked: Option, + yanked_reason: Option, + project_urls: Option>, +} + +#[derive(Deserialize)] +struct File { + upload_time: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_project_urls() { + assert!(matches("https://pypi.org/project/requests/")); + assert!(matches("https://pypi.org/project/numpy/1.26.0/")); + assert!(!matches("https://pypi.org/")); + assert!(!matches("https://example.com/project/foo")); + } + + #[test] + fn parse_project_pulls_name_and_version() { + assert_eq!( + parse_project("https://pypi.org/project/requests/"), + Some(("requests".into(), None)) + ); + assert_eq!( + parse_project("https://pypi.org/project/numpy/1.26.0/"), + Some(("numpy".into(), Some("1.26.0".into()))) + ); + assert_eq!( + parse_project("https://pypi.org/project/scikit-learn/?foo=bar"), + Some(("scikit-learn".into(), None)) + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/reddit.rs b/crates/webclaw-fetch/src/extractors/reddit.rs new file mode 100644 index 0000000..13cdc16 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/reddit.rs @@ -0,0 +1,234 @@ +//! Reddit structured extractor — returns the full post + comment tree +//! as typed JSON via Reddit's `.json` API. +//! +//! The same trick the markdown extractor in `crate::reddit` uses: +//! appending `.json` to any post URL returns the data the new SPA +//! frontend would load client-side. Zero antibot, zero JS rendering. 
+ +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "reddit", + label: "Reddit thread", + description: "Returns post + nested comment tree with scores, authors, and timestamps.", + url_patterns: &[ + "https://www.reddit.com/r/*/comments/*", + "https://reddit.com/r/*/comments/*", + "https://old.reddit.com/r/*/comments/*", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + let is_reddit_host = matches!( + host, + "reddit.com" | "www.reddit.com" | "old.reddit.com" | "np.reddit.com" | "new.reddit.com" + ); + is_reddit_host && url.contains("/comments/") +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let json_url = build_json_url(url); + let resp = client.fetch(&json_url).await?; + if resp.status != 200 { + return Err(FetchError::Build(format!( + "reddit api returned status {}", + resp.status + ))); + } + + let listings: Vec = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("reddit json parse: {e}")))?; + + if listings.is_empty() { + return Err(FetchError::BodyDecode("reddit response empty".into())); + } + + // First listing = the post (single t3 child). + let post = listings + .first() + .and_then(|l| l.data.children.first()) + .filter(|t| t.kind == "t3") + .map(|t| post_json(&t.data)) + .unwrap_or(Value::Null); + + // Second listing = the comment tree. 
+ let comments: Vec = listings + .get(1) + .map(|l| l.data.children.iter().filter_map(comment_json).collect()) + .unwrap_or_default(); + + Ok(json!({ + "url": url, + "post": post, + "comments": comments, + })) +} + +// --------------------------------------------------------------------------- +// JSON shapers +// --------------------------------------------------------------------------- + +fn post_json(d: &ThingData) -> Value { + json!({ + "id": d.id, + "title": d.title, + "author": d.author, + "subreddit": d.subreddit_name_prefixed, + "permalink": d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")), + "url": d.url_overridden_by_dest, + "is_self": d.is_self, + "selftext": d.selftext, + "score": d.score, + "upvote_ratio": d.upvote_ratio, + "num_comments": d.num_comments, + "created_utc": d.created_utc, + "link_flair_text": d.link_flair_text, + "over_18": d.over_18, + "spoiler": d.spoiler, + "stickied": d.stickied, + "locked": d.locked, + }) +} + +/// Render a single comment + its reply tree. Returns `None` for non-t1 +/// kinds (the trailing `more` placeholder Reddit injects at depth limits). 
+fn comment_json(thing: &Thing) -> Option { + if thing.kind != "t1" { + return None; + } + let d = &thing.data; + let replies: Vec = match &d.replies { + Some(Replies::Listing(l)) => l.data.children.iter().filter_map(comment_json).collect(), + _ => Vec::new(), + }; + Some(json!({ + "id": d.id, + "author": d.author, + "body": d.body, + "score": d.score, + "created_utc": d.created_utc, + "is_submitter": d.is_submitter, + "stickied": d.stickied, + "depth": d.depth, + "permalink": d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")), + "replies": replies, + })) +} + +// --------------------------------------------------------------------------- +// URL helpers +// --------------------------------------------------------------------------- + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +/// Build the Reddit JSON URL. We keep the original host (`www.reddit.com` +/// or `old.reddit.com` as the caller gave us). Routing through +/// `old.reddit.com` unconditionally looks appealing but that host has +/// stricter UA-based blocking than `www.reddit.com`, while the main +/// host accepts our Chrome-fingerprinted client fine. +fn build_json_url(url: &str) -> String { + let clean = url.split('?').next().unwrap_or(url).trim_end_matches('/'); + format!("{clean}.json?raw_json=1") +} + +// --------------------------------------------------------------------------- +// Reddit JSON types — only fields we render. Everything else is dropped. 
+// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct Listing { + data: ListingData, +} + +#[derive(Deserialize)] +struct ListingData { + children: Vec, +} + +#[derive(Deserialize)] +struct Thing { + kind: String, + data: ThingData, +} + +#[derive(Deserialize, Default)] +struct ThingData { + // post (t3) + id: Option, + title: Option, + selftext: Option, + subreddit_name_prefixed: Option, + url_overridden_by_dest: Option, + is_self: Option, + upvote_ratio: Option, + num_comments: Option, + over_18: Option, + spoiler: Option, + stickied: Option, + locked: Option, + link_flair_text: Option, + + // comment (t1) + author: Option, + body: Option, + score: Option, + created_utc: Option, + is_submitter: Option, + depth: Option, + permalink: Option, + + // recursive + replies: Option, +} + +#[derive(Deserialize)] +#[serde(untagged)] +enum Replies { + Listing(Listing), + #[allow(dead_code)] + Empty(String), +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_reddit_post_urls() { + assert!(matches( + "https://www.reddit.com/r/rust/comments/abc123/some_title/" + )); + assert!(matches( + "https://reddit.com/r/rust/comments/abc123/some_title" + )); + assert!(matches("https://old.reddit.com/r/rust/comments/abc123/x/")); + } + + #[test] + fn rejects_non_post_reddit_urls() { + assert!(!matches("https://www.reddit.com/r/rust")); + assert!(!matches("https://www.reddit.com/user/foo")); + assert!(!matches("https://example.com/r/rust/comments/x")); + } + + #[test] + fn json_url_appends_suffix_and_drops_query() { + assert_eq!( + build_json_url("https://www.reddit.com/r/rust/comments/abc/x/?utm=foo"), + "https://www.reddit.com/r/rust/comments/abc/x.json?raw_json=1" + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/shopify_collection.rs b/crates/webclaw-fetch/src/extractors/shopify_collection.rs new file mode 100644 index 0000000..23d57c6 --- /dev/null +++ 
b/crates/webclaw-fetch/src/extractors/shopify_collection.rs @@ -0,0 +1,242 @@ +//! Shopify collection structured extractor. +//! +//! Every Shopify store exposes `/collections/{handle}.json` and +//! `/collections/{handle}/products.json` on the public surface. This +//! extractor hits `.json` (collection metadata) and falls through to +//! `/products.json` for the first page of products. Same caveat as +//! `shopify_product`: stores with Cloudflare in front of the shop +//! will 403 the public path. +//! +//! Explicit-call only (like `shopify_product`). `/collections/{slug}` +//! is a URL shape used by non-Shopify stores too, so auto-dispatch +//! would claim too many URLs. + +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "shopify_collection", + label: "Shopify collection", + description: "Returns collection metadata + first page of products (handle, title, vendor, price, available) on ANY Shopify store via /collections/{handle}.json + /products.json.", + url_patterns: &[ + "https://{shop}/collections/{handle}", + "https://{shop}.myshopify.com/collections/{handle}", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) { + return false; + } + url.contains("/collections/") && !url.ends_with("/collections/") +} + +const NON_SHOPIFY_HOSTS: &[&str] = &[ + "amazon.com", + "amazon.co.uk", + "amazon.de", + "ebay.com", + "etsy.com", + "walmart.com", + "target.com", + "aliexpress.com", + "huggingface.co", // has /collections/ for models + "github.com", +]; + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let (coll_meta_url, coll_products_url) = build_json_urls(url); + + // Step 1: collection metadata. Shopify returns 200 on missing + // collections sometimes; check "collection" key below. 
+ let meta_resp = client.fetch(&coll_meta_url).await?; + if meta_resp.status == 404 { + return Err(FetchError::Build(format!( + "shopify_collection: '{url}' not found" + ))); + } + if meta_resp.status == 403 { + return Err(FetchError::Build(format!( + "shopify_collection: {coll_meta_url} returned 403. The store has antibot in front of the .json endpoint. Use /v1/scrape/ecommerce_product or api.webclaw.io for this store." + ))); + } + if meta_resp.status != 200 { + return Err(FetchError::Build(format!( + "shopify returned status {} for {coll_meta_url}", + meta_resp.status + ))); + } + + let meta: MetaWrapper = serde_json::from_str(&meta_resp.html).map_err(|e| { + FetchError::BodyDecode(format!( + "shopify_collection: '{url}' didn't return Shopify JSON, likely not a Shopify store ({e})" + )) + })?; + + // Step 2: first page of products for this collection. + let products = match client.fetch(&coll_products_url).await { + Ok(r) if r.status == 200 => serde_json::from_str::(&r.html) + .ok() + .map(|pw| pw.products) + .unwrap_or_default(), + _ => Vec::new(), + }; + + let product_summaries: Vec = products + .iter() + .map(|p| { + let first_variant = p.variants.first(); + json!({ + "id": p.id, + "handle": p.handle, + "title": p.title, + "vendor": p.vendor, + "product_type": p.product_type, + "price": first_variant.and_then(|v| v.price.clone()), + "compare_at_price":first_variant.and_then(|v| v.compare_at_price.clone()), + "available": p.variants.iter().any(|v| v.available.unwrap_or(false)), + "variant_count": p.variants.len(), + "image": p.images.first().and_then(|i| i.src.clone()), + "created_at": p.created_at, + "updated_at": p.updated_at, + }) + }) + .collect(); + + let c = meta.collection; + Ok(json!({ + "url": url, + "meta_json_url": coll_meta_url, + "products_json_url": coll_products_url, + "collection_id": c.id, + "handle": c.handle, + "title": c.title, + "description_html": c.body_html, + "published_at": c.published_at, + "updated_at": c.updated_at, + "sort_order": 
c.sort_order, + "products_in_page": product_summaries.len(), + "products": product_summaries, + })) +} + +// --------------------------------------------------------------------------- +// URL helpers +// --------------------------------------------------------------------------- + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +/// Build `(collection.json, collection/products.json)` from a user URL. +fn build_json_urls(url: &str) -> (String, String) { + let (path_part, _query_part) = match url.split_once('?') { + Some((a, b)) => (a, Some(b)), + None => (url, None), + }; + let clean = path_part.trim_end_matches('/').trim_end_matches(".json"); + ( + format!("{clean}.json"), + format!("{clean}/products.json?limit=50"), + ) +} + +// --------------------------------------------------------------------------- +// Shopify collection + product JSON shapes (subsets) +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct MetaWrapper { + collection: Collection, +} + +#[derive(Deserialize)] +struct Collection { + id: Option, + handle: Option, + title: Option, + body_html: Option, + published_at: Option, + updated_at: Option, + sort_order: Option, +} + +#[derive(Deserialize)] +struct ProductsWrapper { + #[serde(default)] + products: Vec, +} + +#[derive(Deserialize)] +struct ProductSummary { + id: Option, + handle: Option, + title: Option, + vendor: Option, + product_type: Option, + created_at: Option, + updated_at: Option, + #[serde(default)] + variants: Vec, + #[serde(default)] + images: Vec, +} + +#[derive(Deserialize)] +struct VariantSummary { + price: Option, + compare_at_price: Option, + available: Option, +} + +#[derive(Deserialize)] +struct ImageSummary { + src: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_shopify_collection_urls() { + assert!(matches("https://www.allbirds.com/collections/mens")); + 
assert!(matches( + "https://shop.example.com/collections/new-arrivals?page=2" + )); + } + + #[test] + fn rejects_non_shopify() { + assert!(!matches("https://github.com/collections/foo")); + assert!(!matches("https://huggingface.co/collections/foo")); + assert!(!matches("https://example.com/")); + assert!(!matches("https://example.com/collections/")); + } + + #[test] + fn build_json_urls_derives_both_paths() { + let (meta, products) = build_json_urls("https://shop.example.com/collections/mens"); + assert_eq!(meta, "https://shop.example.com/collections/mens.json"); + assert_eq!( + products, + "https://shop.example.com/collections/mens/products.json?limit=50" + ); + } + + #[test] + fn build_json_urls_handles_trailing_slash() { + let (meta, _) = build_json_urls("https://shop.example.com/collections/mens/"); + assert_eq!(meta, "https://shop.example.com/collections/mens.json"); + } +} diff --git a/crates/webclaw-fetch/src/extractors/shopify_product.rs b/crates/webclaw-fetch/src/extractors/shopify_product.rs new file mode 100644 index 0000000..b52ef36 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/shopify_product.rs @@ -0,0 +1,318 @@ +//! Shopify product structured extractor. +//! +//! Every Shopify store exposes a public JSON endpoint for each product +//! by appending `.json` to the product URL: +//! +//! https://shop.example.com/products/cool-tshirt +//! → https://shop.example.com/products/cool-tshirt.json +//! +//! There are ~4 million Shopify stores. The `.json` endpoint is +//! undocumented but has been stable for 10+ years. When a store puts +//! Cloudflare / antibot in front of the shop, this path can 403 just +//! like any other — for those cases the caller should fall back to +//! `ecommerce_product` (JSON-LD) or the cloud tier. +//! +//! This extractor is **explicit-call only** — it is NOT auto-dispatched +//! from `/v1/scrape` because we cannot tell ahead of time whether an +//! arbitrary `/products/{slug}` URL is a Shopify store. Callers hit +//! 
`/v1/scrape/shopify_product` when they know. + +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "shopify_product", + label: "Shopify product", + description: "Returns product metadata on ANY Shopify store via the public /products/{handle}.json endpoint: title, vendor, variants with prices + stock, images, options.", + url_patterns: &[ + "https://{shop}/products/{handle}", + "https://{shop}.myshopify.com/products/{handle}", + ], +}; + +pub fn matches(url: &str) -> bool { + // Any URL whose path contains /products/{something}. We do not + // filter by host — Shopify powers custom-domain stores. The + // extractor's /.json fallback is what confirms Shopify; `matches` + // just says "this is a plausible shape." Still reject obviously + // non-Shopify known hosts to save a failed request. + let host = host_of(url); + if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) { + return false; + } + url.contains("/products/") && !url.ends_with("/products/") +} + +/// Hosts we know are not Shopify — reject so we don't burn a request. +const NON_SHOPIFY_HOSTS: &[&str] = &[ + "amazon.com", + "amazon.co.uk", + "amazon.de", + "amazon.fr", + "amazon.it", + "ebay.com", + "etsy.com", + "walmart.com", + "target.com", + "aliexpress.com", + "bestbuy.com", + "wayfair.com", + "homedepot.com", + "github.com", // /products is a marketing page +]; + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let json_url = build_json_url(url); + let resp = client.fetch(&json_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "shopify_product: '{url}' not found (got 404 from {json_url})" + ))); + } + if resp.status == 403 { + return Err(FetchError::Build(format!( + "shopify_product: {json_url} returned 403 — the store has antibot in front of the .json endpoint. 
Try /v1/scrape/ecommerce_product for the HTML + JSON-LD fallback." + ))); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "shopify returned status {} for {json_url}", + resp.status + ))); + } + + let body: Wrapper = serde_json::from_str(&resp.html).map_err(|e| { + FetchError::BodyDecode(format!( + "shopify_product: '{url}' didn't return Shopify JSON — likely not a Shopify store ({e})" + )) + })?; + let p = body.product; + + let variants: Vec = p + .variants + .iter() + .map(|v| { + json!({ + "id": v.id, + "title": v.title, + "sku": v.sku, + "barcode": v.barcode, + "price": v.price, + "compare_at_price": v.compare_at_price, + "available": v.available, + "inventory_quantity": v.inventory_quantity, + "position": v.position, + "weight": v.weight, + "weight_unit": v.weight_unit, + "requires_shipping": v.requires_shipping, + "taxable": v.taxable, + "option1": v.option1, + "option2": v.option2, + "option3": v.option3, + }) + }) + .collect(); + + let images: Vec = p + .images + .iter() + .map(|i| { + json!({ + "src": i.src, + "width": i.width, + "height": i.height, + "position": i.position, + "alt": i.alt, + }) + }) + .collect(); + + let options: Vec = p + .options + .iter() + .map(|o| json!({"name": o.name, "values": o.values, "position": o.position})) + .collect(); + + // Price range + availability summary across variants (the shape + // agents typically want without walking the variants array). 
+ let prices: Vec = p + .variants + .iter() + .filter_map(|v| v.price.as_deref().and_then(|s| s.parse::().ok())) + .collect(); + let any_available = p.variants.iter().any(|v| v.available.unwrap_or(false)); + + Ok(json!({ + "url": url, + "json_url": json_url, + "product_id": p.id, + "handle": p.handle, + "title": p.title, + "vendor": p.vendor, + "product_type": p.product_type, + "tags": p.tags, + "description_html":p.body_html, + "published_at": p.published_at, + "created_at": p.created_at, + "updated_at": p.updated_at, + "variant_count": variants.len(), + "image_count": images.len(), + "any_available": any_available, + "price_min": prices.iter().cloned().fold(f64::INFINITY, f64::min).is_finite().then(|| prices.iter().cloned().fold(f64::INFINITY, f64::min)), + "price_max": prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max).is_finite().then(|| prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max)), + "variants": variants, + "images": images, + "options": options, + })) +} + +/// Build the .json path from a product URL. Handles pre-.jsoned URLs, +/// trailing slashes, and query strings. 
+fn build_json_url(url: &str) -> String { + let (path_part, query_part) = match url.split_once('?') { + Some((a, b)) => (a, Some(b)), + None => (url, None), + }; + let clean = path_part.trim_end_matches('/'); + let with_json = if clean.ends_with(".json") { + clean.to_string() + } else { + format!("{clean}.json") + }; + match query_part { + Some(q) => format!("{with_json}?{q}"), + None => with_json, + } +} + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +// --------------------------------------------------------------------------- +// Shopify product JSON shape (a subset of the full response) +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct Wrapper { + product: Product, +} + +#[derive(Deserialize)] +struct Product { + id: Option, + title: Option, + handle: Option, + vendor: Option, + product_type: Option, + body_html: Option, + published_at: Option, + created_at: Option, + updated_at: Option, + #[serde(default)] + tags: serde_json::Value, // array OR comma-joined string depending on store + #[serde(default)] + variants: Vec, + #[serde(default)] + images: Vec, + #[serde(default)] + options: Vec, +} + +#[derive(Deserialize)] +struct Variant { + id: Option, + title: Option, + sku: Option, + barcode: Option, + price: Option, + compare_at_price: Option, + available: Option, + inventory_quantity: Option, + position: Option, + weight: Option, + weight_unit: Option, + requires_shipping: Option, + taxable: Option, + option1: Option, + option2: Option, + option3: Option, +} + +#[derive(Deserialize)] +struct Image { + src: Option, + width: Option, + height: Option, + position: Option, + alt: Option, +} + +#[derive(Deserialize)] +#[serde(rename_all = "lowercase")] +struct Option_ { + name: Option, + position: Option, + #[serde(default)] + values: Vec, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn 
matches_plausible_shopify_urls() { + assert!(matches( + "https://www.allbirds.com/products/mens-tree-runners" + )); + assert!(matches( + "https://shop.example.com/products/cool-tshirt?variant=123" + )); + assert!(matches("https://somestore.myshopify.com/products/thing-1")); + } + + #[test] + fn rejects_known_non_shopify() { + assert!(!matches("https://www.amazon.com/dp/B0C123")); + assert!(!matches("https://www.etsy.com/listing/12345/foo")); + assert!(!matches("https://www.amazon.co.uk/products/thing")); + assert!(!matches("https://github.com/products")); + } + + #[test] + fn rejects_non_product_urls() { + assert!(!matches("https://example.com/")); + assert!(!matches("https://example.com/products/")); + assert!(!matches("https://example.com/collections/all")); + } + + #[test] + fn build_json_url_handles_slash_and_query() { + assert_eq!( + build_json_url("https://shop.example.com/products/foo"), + "https://shop.example.com/products/foo.json" + ); + assert_eq!( + build_json_url("https://shop.example.com/products/foo/"), + "https://shop.example.com/products/foo.json" + ); + assert_eq!( + build_json_url("https://shop.example.com/products/foo?variant=123"), + "https://shop.example.com/products/foo.json?variant=123" + ); + assert_eq!( + build_json_url("https://shop.example.com/products/foo.json"), + "https://shop.example.com/products/foo.json" + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/stackoverflow.rs b/crates/webclaw-fetch/src/extractors/stackoverflow.rs new file mode 100644 index 0000000..03597a3 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/stackoverflow.rs @@ -0,0 +1,216 @@ +//! Stack Overflow Q&A structured extractor. +//! +//! Uses the Stack Exchange API at `api.stackexchange.com/2.3/questions/{id}` +//! with `site=stackoverflow`. Two calls: one for the question, one for +//! its answers. Both come pre-filtered to include the rendered HTML body +//! so we don't re-parse the question page itself. +//! +//! 
Anonymous access caps at 300 requests per IP per day. Production +//! cloud should set `STACKAPPS_KEY` to lift the cap to 10,000/day, but we don't +//! require it to work out of the box. + +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "stackoverflow", + label: "Stack Overflow Q&A", + description: "Returns question + answers: title, body, tags, votes, accepted answer, top answers.", + url_patterns: &["https://stackoverflow.com/questions/{id}/{slug}"], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if host != "stackoverflow.com" && host != "www.stackoverflow.com" { + return false; + } + parse_question_id(url).is_some() +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> { + let id = parse_question_id(url).ok_or_else(|| { + FetchError::Build(format!( + "stackoverflow: cannot parse question id from '{url}'" + )) + })?; + + // Filter `withbody` includes the rendered HTML body for both questions + // and answers. Stack Exchange's filter system is documented at + // api.stackexchange.com/docs/filters. 
+ let q_url = format!( + "https://api.stackexchange.com/2.3/questions/{id}?site=stackoverflow&filter=withbody" + ); + let q_resp = client.fetch(&q_url).await?; + if q_resp.status != 200 { + return Err(FetchError::Build(format!( + "stackexchange api returned status {}", + q_resp.status + ))); + } + let q_body: QResponse = serde_json::from_str(&q_resp.html) + .map_err(|e| FetchError::BodyDecode(format!("stackoverflow q parse: {e}")))?; + let q = q_body + .items + .first() + .ok_or_else(|| FetchError::Build(format!("stackoverflow: question {id} not found")))?; + + let a_url = format!( + "https://api.stackexchange.com/2.3/questions/{id}/answers?site=stackoverflow&filter=withbody&order=desc&sort=votes" + ); + let a_resp = client.fetch(&a_url).await?; + let answers = if a_resp.status == 200 { + let a_body: AResponse = serde_json::from_str(&a_resp.html) + .map_err(|e| FetchError::BodyDecode(format!("stackoverflow a parse: {e}")))?; + a_body + .items + .iter() + .map(|a| { + json!({ + "answer_id": a.answer_id, + "is_accepted": a.is_accepted, + "score": a.score, + "body": a.body, + "creation_date": a.creation_date, + "last_edit_date":a.last_edit_date, + "author": a.owner.as_ref().and_then(|o| o.display_name.clone()), + "author_rep": a.owner.as_ref().and_then(|o| o.reputation), + }) + }) + .collect::>() + } else { + Vec::new() + }; + + let accepted = answers + .iter() + .find(|a| { + a.get("is_accepted") + .and_then(|v| v.as_bool()) + .unwrap_or(false) + }) + .cloned(); + + Ok(json!({ + "url": url, + "question_id": q.question_id, + "title": q.title, + "body": q.body, + "tags": q.tags, + "score": q.score, + "view_count": q.view_count, + "answer_count": q.answer_count, + "is_answered": q.is_answered, + "accepted_answer_id": q.accepted_answer_id, + "creation_date": q.creation_date, + "last_activity_date": q.last_activity_date, + "author": q.owner.as_ref().and_then(|o| o.display_name.clone()), + "author_rep": q.owner.as_ref().and_then(|o| o.reputation), + "link": q.link, + 
"accepted_answer": accepted, + "top_answers": answers, + })) +} + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +/// Parse question id from a URL of the form `/questions/{id}/{slug}`. +fn parse_question_id(url: &str) -> Option { + let after = url.split("/questions/").nth(1)?; + let stripped = after.split(['?', '#']).next()?.trim_end_matches('/'); + let first = stripped.split('/').next()?; + first.parse::().ok() +} + +// --------------------------------------------------------------------------- +// Stack Exchange API types +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct QResponse { + #[serde(default)] + items: Vec, +} + +#[derive(Deserialize)] +struct Question { + question_id: Option, + title: Option, + body: Option, + #[serde(default)] + tags: Vec, + score: Option, + view_count: Option, + answer_count: Option, + is_answered: Option, + accepted_answer_id: Option, + creation_date: Option, + last_activity_date: Option, + owner: Option, + link: Option, +} + +#[derive(Deserialize)] +struct AResponse { + #[serde(default)] + items: Vec, +} + +#[derive(Deserialize)] +struct Answer { + answer_id: Option, + is_accepted: Option, + score: Option, + body: Option, + creation_date: Option, + last_edit_date: Option, + owner: Option, +} + +#[derive(Deserialize)] +struct Owner { + display_name: Option, + reputation: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_question_urls() { + assert!(matches( + "https://stackoverflow.com/questions/12345/some-slug" + )); + assert!(matches( + "https://stackoverflow.com/questions/12345/some-slug?answertab=votes" + )); + assert!(!matches("https://stackoverflow.com/")); + assert!(!matches("https://stackoverflow.com/questions")); + assert!(!matches("https://stackoverflow.com/users/100")); + assert!(!matches("https://example.com/questions/12345/x")); + } + + #[test] + fn 
parse_question_id_handles_slug_and_query() { + assert_eq!( + parse_question_id("https://stackoverflow.com/questions/12345/some-slug"), + Some(12345) + ); + assert_eq!( + parse_question_id("https://stackoverflow.com/questions/12345/some-slug?tab=newest"), + Some(12345) + ); + assert_eq!(parse_question_id("https://stackoverflow.com/foo"), None); + } +} diff --git a/crates/webclaw-fetch/src/extractors/substack_post.rs b/crates/webclaw-fetch/src/extractors/substack_post.rs new file mode 100644 index 0000000..c5b5019 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/substack_post.rs @@ -0,0 +1,565 @@ +//! Substack post extractor. +//! +//! Every Substack publication exposes `/api/v1/posts/{slug}` that +//! returns the full post as JSON: body HTML, cover image, author, +//! publication info, reactions, paywall state. No auth on public +//! posts. +//! +//! Works on both `*.substack.com` subdomains and custom domains +//! (e.g. `simonwillison.net` uses Substack too). Detection is +//! "URL has `/p/{slug}`" because that's the canonical Substack post +//! path. Explicit-call only because the `/p/{slug}` URL shape is +//! used by non-Substack sites too. +//! +//! ## Fallback +//! +//! The API endpoint is rate-limited aggressively on popular publications +//! and occasionally returns 403 on custom domains with Cloudflare in +//! front. When that happens we escalate to an HTML fetch (via +//! `smart_fetch_html`, so antibot-protected custom domains still work) +//! and extract OG tags + Article JSON-LD for a degraded-but-useful +//! payload. The response shape stays stable across both paths; a +//! `data_source` field tells the caller which branch ran. 
+ +use std::sync::OnceLock; + +use regex::Regex; +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::cloud::{self, CloudError}; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "substack_post", + label: "Substack post", + description: "Returns post HTML, title, subtitle, author, publication, reactions, paywall status via the Substack public API. Falls back to OG + JSON-LD HTML parsing when the API is rate-limited.", + url_patterns: &[ + "https://{pub}.substack.com/p/{slug}", + "https://{custom-domain}/p/{slug}", + ], +}; + +pub fn matches(url: &str) -> bool { + if !(url.starts_with("http://") || url.starts_with("https://")) { + return false; + } + url.contains("/p/") +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let slug = parse_slug(url).ok_or_else(|| { + FetchError::Build(format!("substack_post: cannot parse slug from '{url}'")) + })?; + let host = host_of(url); + if host.is_empty() { + return Err(FetchError::Build(format!( + "substack_post: empty host in '{url}'" + ))); + } + let scheme = if url.starts_with("http://") { + "http" + } else { + "https" + }; + let api_url = format!("{scheme}://{host}/api/v1/posts/{slug}"); + + // 1. Try the public API. 200 = full payload; 404 = real miss; any + // other status hands off to the HTML fallback so a transient rate + // limit or a hardened custom domain doesn't fail the whole call. + let resp = client.fetch(&api_url).await?; + match resp.status { + 200 => match serde_json::from_str::(&resp.html) { + Ok(p) => Ok(build_api_payload(url, &api_url, &slug, p)), + Err(e) => { + // API returned 200 but the body isn't the Post shape we + // expect. Could be a custom-domain site that exposes + // something else at /api/v1/posts/. Fall back to HTML + // rather than hard-failing. 
+ html_fallback( + client, + url, + &api_url, + &slug, + Some(format!( + "api returned 200 but body was not Substack JSON ({e})" + )), + ) + .await + } + }, + 404 => Err(FetchError::Build(format!( + "substack_post: '{slug}' not found on {host} (got 404). \ + If the publication isn't actually on Substack, use /v1/scrape instead." + ))), + _ => { + // Rate limit, 403, 5xx, whatever: try HTML. + let reason = format!("api returned status {} for {api_url}", resp.status); + html_fallback(client, url, &api_url, &slug, Some(reason)).await + } + } +} + +// --------------------------------------------------------------------------- +// API-path payload builder +// --------------------------------------------------------------------------- + +fn build_api_payload(url: &str, api_url: &str, slug: &str, p: Post) -> Value { + json!({ + "url": url, + "api_url": api_url, + "data_source": "api", + "id": p.id, + "type": p.r#type, + "slug": p.slug.or_else(|| Some(slug.to_string())), + "title": p.title, + "subtitle": p.subtitle, + "description": p.description, + "canonical_url": p.canonical_url, + "post_date": p.post_date, + "updated_at": p.updated_at, + "audience": p.audience, + "has_paywall": matches!(p.audience.as_deref(), Some("only_paid") | Some("founding")), + "is_free_preview": p.is_free_preview, + "cover_image": p.cover_image, + "word_count": p.wordcount, + "reactions": p.reactions, + "comment_count": p.comment_count, + "body_html": p.body_html, + "body_text": p.truncated_body_text.or(p.body_text), + "publication": json!({ + "id": p.publication.as_ref().and_then(|pub_| pub_.id), + "name": p.publication.as_ref().and_then(|pub_| pub_.name.clone()), + "subdomain": p.publication.as_ref().and_then(|pub_| pub_.subdomain.clone()), + "custom_domain":p.publication.as_ref().and_then(|pub_| pub_.custom_domain.clone()), + }), + "authors": p.published_bylines.iter().map(|a| json!({ + "id": a.id, + "name": a.name, + "handle": a.handle, + "photo": a.photo_url, + })).collect::>(), + }) +} + 
+// --------------------------------------------------------------------------- +// HTML fallback: OG + Article JSON-LD +// --------------------------------------------------------------------------- + +async fn html_fallback( + client: &dyn Fetcher, + url: &str, + api_url: &str, + slug: &str, + fallback_reason: Option, +) -> Result { + let fetched = cloud::smart_fetch_html(client, client.cloud(), url) + .await + .map_err(cloud_to_fetch_err)?; + + let mut data = parse_html(&fetched.html, url, api_url, slug); + if let Some(obj) = data.as_object_mut() { + obj.insert( + "fetch_source".into(), + match fetched.source { + cloud::FetchSource::Local => json!("local"), + cloud::FetchSource::Cloud => json!("cloud"), + }, + ); + if let Some(reason) = fallback_reason { + obj.insert("fallback_reason".into(), json!(reason)); + } + } + Ok(data) +} + +/// Pure HTML parser. Pulls title, subtitle, description, cover image, +/// publish date, and authors from OG tags and Article JSON-LD. Kept +/// public so tests can exercise it with fixtures. 
+pub fn parse_html(html: &str, url: &str, api_url: &str, slug: &str) -> Value { + let article = find_article_jsonld(html); + + let title = article + .as_ref() + .and_then(|v| get_text(v, "headline")) + .or_else(|| og(html, "title")); + let description = article + .as_ref() + .and_then(|v| get_text(v, "description")) + .or_else(|| og(html, "description")); + let cover_image = article + .as_ref() + .and_then(get_first_image) + .or_else(|| og(html, "image")); + let post_date = article + .as_ref() + .and_then(|v| get_text(v, "datePublished")) + .or_else(|| meta_property(html, "article:published_time")); + let updated_at = article.as_ref().and_then(|v| get_text(v, "dateModified")); + let publication_name = og(html, "site_name"); + let authors = article.as_ref().map(extract_authors).unwrap_or_default(); + + json!({ + "url": url, + "api_url": api_url, + "data_source": "html_fallback", + "slug": slug, + "title": title, + "subtitle": None::, + "description": description, + "canonical_url": canonical_url(html).or_else(|| Some(url.to_string())), + "post_date": post_date, + "updated_at": updated_at, + "cover_image": cover_image, + "body_html": None::, + "body_text": None::, + "word_count": None::, + "comment_count": None::, + "reactions": Value::Null, + "has_paywall": None::, + "is_free_preview": None::, + "publication": json!({ + "name": publication_name, + }), + "authors": authors, + }) +} + +fn extract_authors(v: &Value) -> Vec { + let Some(a) = v.get("author") else { + return Vec::new(); + }; + let one = |val: &Value| -> Option { + match val { + Value::String(s) => Some(json!({"name": s})), + Value::Object(_) => { + let name = val.get("name").and_then(|n| n.as_str())?; + let handle = val + .get("url") + .and_then(|u| u.as_str()) + .and_then(handle_from_author_url); + Some(json!({ + "name": name, + "handle": handle, + })) + } + _ => None, + } + }; + match a { + Value::Array(arr) => arr.iter().filter_map(one).collect(), + _ => one(a).into_iter().collect(), + } +} + +// 
--------------------------------------------------------------------------- +// URL helpers +// --------------------------------------------------------------------------- + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +fn parse_slug(url: &str) -> Option<String> { + let after = url.split("/p/").nth(1)?; + let stripped = after + .split(['?', '#']) + .next()? + .trim_end_matches('/') + .split('/') + .next() + .unwrap_or(""); + if stripped.is_empty() { + None + } else { + Some(stripped.to_string()) + } +} + +/// Extract the Substack handle from an author URL like +/// `https://substack.com/@handle` or `https://pub.substack.com/@handle`. +/// +/// Returns `None` when the URL has no `@` segment (e.g. a non-Substack +/// author page) so we don't synthesise a fake handle. +fn handle_from_author_url(u: &str) -> Option<String> { + let after = u.rsplit_once('@').map(|(_, tail)| tail)?; + let clean = after.split(['/', '?', '#']).next()?; + if clean.is_empty() { + None + } else { + Some(clean.to_string()) + } +} + +// --------------------------------------------------------------------------- +// HTML tag helpers +// --------------------------------------------------------------------------- + +fn og(html: &str, prop: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() + }); + for c in re.captures_iter(html) { + if c.get(1).is_some_and(|m| m.as_str() == prop) { + return c.get(2).map(|m| m.as_str().to_string()); + } + } + None +} + +/// Pull `<meta property="..." content="...">` and + /// similar structured meta tags. 
+fn meta_property(html: &str, prop: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new(r#"(?i)<meta[^>]+property="([^"]+)"[^>]+content="([^"]+)""#).unwrap() + }); + for c in re.captures_iter(html) { + if c.get(1).is_some_and(|m| m.as_str() == prop) { + return c.get(2).map(|m| m.as_str().to_string()); + } + } + None +} + +fn canonical_url(html: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE + .get_or_init(|| Regex::new(r#"(?i)<link[^>]+rel="canonical"[^>]+href="([^"]+)""#).unwrap()); + re.captures(html) + .and_then(|c| c.get(1)) + .map(|m| m.as_str().to_string()) +} + +// --------------------------------------------------------------------------- +// JSON-LD walkers (Article / NewsArticle) +// --------------------------------------------------------------------------- + +fn find_article_jsonld(html: &str) -> Option<Value> { + let blocks = webclaw_core::structured_data::extract_json_ld(html); + for b in blocks { + if let Some(found) = find_article_in(&b) { + return Some(found); + } + } + None +} + +fn find_article_in(v: &Value) -> Option<Value> { + if is_article_type(v) { + return Some(v.clone()); + } + if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) { + for item in graph { + if let Some(found) = find_article_in(item) { + return Some(found); + } + } + } + if let Some(arr) = v.as_array() { + for item in arr { + if let Some(found) = find_article_in(item) { + return Some(found); + } + } + } + None +} + +fn is_article_type(v: &Value) -> bool { + let Some(t) = v.get("@type") else { + return false; + }; + let is_art = |s: &str| { + matches!( + s, + "Article" | "NewsArticle" | "BlogPosting" | "SocialMediaPosting" + ) + }; + match t { + Value::String(s) => is_art(s), + Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_art)), + _ => false, + } +} + +fn get_text(v: &Value, key: &str) -> Option<String> { + v.get(key).and_then(|x| match x { + Value::String(s) => Some(s.clone()), + Value::Number(n) => 
Some(n.to_string()), + _ => None, + }) +} + +fn get_first_image(v: &Value) -> Option { + match v.get("image")? { + Value::String(s) => Some(s.clone()), + Value::Array(arr) => arr.iter().find_map(|x| match x { + Value::String(s) => Some(s.clone()), + Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from), + _ => None, + }), + Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from), + _ => None, + } +} + +fn cloud_to_fetch_err(e: CloudError) -> FetchError { + FetchError::Build(e.to_string()) +} + +// --------------------------------------------------------------------------- +// Substack API types (subset) +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct Post { + id: Option, + r#type: Option, + slug: Option, + title: Option, + subtitle: Option, + description: Option, + canonical_url: Option, + post_date: Option, + updated_at: Option, + audience: Option, + is_free_preview: Option, + cover_image: Option, + wordcount: Option, + reactions: Option, + comment_count: Option, + body_html: Option, + body_text: Option, + truncated_body_text: Option, + publication: Option, + #[serde(default, rename = "publishedBylines")] + published_bylines: Vec, +} + +#[derive(Deserialize)] +struct Publication { + id: Option, + name: Option, + subdomain: Option, + custom_domain: Option, +} + +#[derive(Deserialize)] +struct Byline { + id: Option, + name: Option, + handle: Option, + photo_url: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_post_urls() { + assert!(matches( + "https://stratechery.substack.com/p/the-tech-letter" + )); + assert!(matches("https://simonwillison.net/p/2024-08-01-something")); + assert!(!matches("https://example.com/")); + assert!(!matches("ftp://example.com/p/foo")); + } + + #[test] + fn parse_slug_strips_query_and_trailing_slash() { + assert_eq!( + parse_slug("https://example.substack.com/p/my-post"), + Some("my-post".into()) + ); 
+ assert_eq!( + parse_slug("https://example.substack.com/p/my-post/"), + Some("my-post".into()) + ); + assert_eq!( + parse_slug("https://example.substack.com/p/my-post?ref=123"), + Some("my-post".into()) + ); + } + + #[test] + fn parse_html_extracts_from_og_tags() { + let html = r##" + + + + + + + +"##; + let v = parse_html( + html, + "https://mypub.substack.com/p/my-post", + "https://mypub.substack.com/api/v1/posts/my-post", + "my-post", + ); + assert_eq!(v["data_source"], "html_fallback"); + assert_eq!(v["title"], "My Great Post"); + assert_eq!(v["description"], "A short summary."); + assert_eq!(v["cover_image"], "https://cdn.substack.com/cover.jpg"); + assert_eq!(v["post_date"], "2025-09-01T10:00:00Z"); + assert_eq!(v["publication"]["name"], "My Publication"); + assert_eq!(v["canonical_url"], "https://mypub.substack.com/p/my-post"); + } + + #[test] + fn parse_html_prefers_jsonld_when_present() { + let html = r##" + + + +"##; + let v = parse_html( + html, + "https://example.com/p/a", + "https://example.com/api/v1/posts/a", + "a", + ); + assert_eq!(v["title"], "JSON-LD Title"); + assert_eq!(v["description"], "JSON-LD desc."); + assert_eq!(v["cover_image"], "https://cdn.substack.com/hero.jpg"); + assert_eq!(v["post_date"], "2025-10-12T08:30:00Z"); + assert_eq!(v["updated_at"], "2025-10-12T09:00:00Z"); + assert_eq!(v["authors"][0]["name"], "Alice Author"); + assert_eq!(v["authors"][0]["handle"], "alice"); + } + + #[test] + fn handle_from_author_url_pulls_handle() { + assert_eq!( + handle_from_author_url("https://substack.com/@alice"), + Some("alice".into()) + ); + assert_eq!( + handle_from_author_url("https://mypub.substack.com/@bob/"), + Some("bob".into()) + ); + assert_eq!( + handle_from_author_url("https://not-substack.com/author/carol"), + None + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs b/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs new file mode 100644 index 0000000..8b77a29 --- /dev/null +++ 
b/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs @@ -0,0 +1,572 @@ +//! Trustpilot company reviews extractor. +//! +//! `trustpilot.com/review/{domain}` pages are always behind AWS WAF's +//! "Verifying your connection" interstitial, so this extractor always +//! routes through [`cloud::smart_fetch_html`]. Without +//! `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` it returns a clean +//! "set API key" error; with one it escalates to api.webclaw.io. +//! +//! ## 2025 JSON-LD schema +//! +//! Trustpilot replaced the old single-Organization + aggregateRating +//! shape with three separate JSON-LD blocks: +//! +//! 1. `Organization` block for Trustpilot the platform itself +//! (company info, addresses, social profiles). Not the business +//! being reviewed. We detect and skip this. +//! 2. `Dataset` block with a csvw:Table mainEntity that contains the +//! per-star-bucket counts for the target business plus a Total +//! column. The Dataset's `name` is the business display name. +//! 3. `aiSummary` + `aiSummaryReviews` block: the AI-generated +//! summary of reviews plus the individual review objects +//! (consumer, dates, rating, title, text, language, likes). +//! +//! Plus `metadata.title` from the page head parses as +//! `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"` and +//! `metadata.description` carries `"{N} customers have already said"`. +//! We use both as extra signal when the Dataset block is absent. 
+ +use std::sync::OnceLock; + +use regex::Regex; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::cloud::{self, CloudError}; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "trustpilot_reviews", + label: "Trustpilot reviews", + description: "Returns business name, aggregate rating, star distribution, recent reviews, and the AI summary for a Trustpilot /review/{domain} page.", + url_patterns: &["https://www.trustpilot.com/review/{domain}"], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if !matches!(host, "www.trustpilot.com" | "trustpilot.com") { + return false; + } + url.contains("/review/") +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let fetched = cloud::smart_fetch_html(client, client.cloud(), url) + .await + .map_err(cloud_to_fetch_err)?; + + let mut data = parse(&fetched.html, url)?; + if let Some(obj) = data.as_object_mut() { + obj.insert( + "data_source".into(), + match fetched.source { + cloud::FetchSource::Local => json!("local"), + cloud::FetchSource::Cloud => json!("cloud"), + }, + ); + } + Ok(data) +} + +/// Pure parser. Kept public so the cloud pipeline can reuse it on its +/// own fetched HTML without going through the async extract path. +pub fn parse(html: &str, url: &str) -> Result { + let domain = parse_review_domain(url).ok_or_else(|| { + FetchError::Build(format!( + "trustpilot_reviews: cannot parse /review/{{domain}} from '{url}'" + )) + })?; + + let blocks = webclaw_core::structured_data::extract_json_ld(html); + + // The business Dataset block has `about.@id` pointing to the target + // domain's Organization (e.g. `.../Organization/anthropic.com`). + let dataset = find_business_dataset(&blocks, &domain); + + // The aiSummary block: not typed (no `@type`), detect by key. + let ai_block = find_ai_summary_block(&blocks); + + // Business name: Dataset > metadata.title regex > URL domain. 
+ let business_name = dataset + .as_ref() + .and_then(|d| get_string(d, "name")) + .or_else(|| parse_name_from_og_title(html)) + .or_else(|| Some(domain.clone())); + + // Rating distribution from the csvw:Table columns. Each column has + // csvw:name like "1 star" / "Total" and a single cell with the + // integer count. + let distribution = dataset.as_ref().and_then(parse_star_distribution); + let (rating_from_dist, total_from_dist) = distribution + .as_ref() + .map(compute_rating_stats) + .unwrap_or((None, None)); + + // Page-title / page-description fallbacks. OG title format: + // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot" + let (rating_label, rating_from_og) = parse_rating_from_og_title(html); + let total_from_desc = parse_review_count_from_og_description(html); + + // Recent reviews carried by the aiSummary block. + let recent_reviews: Vec = ai_block + .as_ref() + .and_then(|a| a.get("aiSummaryReviews")) + .and_then(|arr| arr.as_array()) + .map(|arr| arr.iter().map(extract_review).collect()) + .unwrap_or_default(); + + let ai_summary = ai_block + .as_ref() + .and_then(|a| a.get("aiSummary")) + .and_then(|s| s.get("summary")) + .and_then(|t| t.as_str()) + .map(String::from); + + Ok(json!({ + "url": url, + "domain": domain, + "business_name": business_name, + "rating_label": rating_label, + "average_rating": rating_from_dist.or(rating_from_og), + "review_count": total_from_dist.or(total_from_desc), + "rating_distribution": distribution, + "ai_summary": ai_summary, + "recent_reviews": recent_reviews, + "review_count_listed": recent_reviews.len(), + })) +} + +fn cloud_to_fetch_err(e: CloudError) -> FetchError { + FetchError::Build(e.to_string()) +} + +// --------------------------------------------------------------------------- +// URL helpers +// --------------------------------------------------------------------------- + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + 
+/// Pull the target domain from `trustpilot.com/review/{domain}`. +fn parse_review_domain(url: &str) -> Option { + let after = url.split("/review/").nth(1)?; + let stripped = after + .split(['?', '#']) + .next()? + .trim_end_matches('/') + .split('/') + .next() + .unwrap_or(""); + if stripped.is_empty() { + None + } else { + Some(stripped.to_string()) + } +} + +// --------------------------------------------------------------------------- +// JSON-LD block walkers +// --------------------------------------------------------------------------- + +/// Find the Dataset block whose `about.@id` references the target +/// domain's Organization. Falls through to any Dataset if the @id +/// check doesn't match (Trustpilot occasionally varies the URL). +fn find_business_dataset(blocks: &[Value], domain: &str) -> Option { + let mut fallback_any_dataset: Option = None; + for block in blocks { + for node in walk_graph(block) { + if !is_dataset(&node) { + continue; + } + if dataset_about_matches_domain(&node, domain) { + return Some(node); + } + if fallback_any_dataset.is_none() { + fallback_any_dataset = Some(node); + } + } + } + fallback_any_dataset +} + +fn is_dataset(v: &Value) -> bool { + v.get("@type") + .and_then(|t| t.as_str()) + .is_some_and(|s| s == "Dataset") +} + +fn dataset_about_matches_domain(v: &Value, domain: &str) -> bool { + let about_id = v + .get("about") + .and_then(|a| a.get("@id")) + .and_then(|id| id.as_str()); + let Some(id) = about_id else { + return false; + }; + id.contains(&format!("/Organization/{domain}")) +} + +/// The aiSummary / aiSummaryReviews block has no `@type`, so match by +/// presence of the `aiSummary` key. +fn find_ai_summary_block(blocks: &[Value]) -> Option { + for block in blocks { + for node in walk_graph(block) { + if node.get("aiSummary").is_some() { + return Some(node); + } + } + } + None +} + +/// Flatten each block (and its `@graph`) into a list of nodes we can +/// iterate over. Handles both `@graph: [ ... 
]` (array) and +/// `@graph: { ... }` (single object) shapes; Trustpilot uses both. +fn walk_graph(block: &Value) -> Vec<Value> { + let mut out = vec![block.clone()]; + if let Some(graph) = block.get("@graph") { + match graph { + Value::Array(arr) => out.extend(arr.iter().cloned()), + Value::Object(_) => out.push(graph.clone()), + _ => {} + } + } + out +} + +// --------------------------------------------------------------------------- +// Rating distribution (csvw:Table) +// --------------------------------------------------------------------------- + +/// Parse the per-star distribution from the Dataset block. Returns +/// `{"one_star": {count, percent}, ..., "total": {count, percent}}`. +fn parse_star_distribution(dataset: &Value) -> Option<Value> { + let columns = dataset + .get("mainEntity")? + .get("csvw:tableSchema")? + .get("csvw:columns")? + .as_array()?; + let mut out = serde_json::Map::new(); + for col in columns { + let name = col.get("csvw:name").and_then(|n| n.as_str())?; + let cell = col.get("csvw:cells").and_then(|c| c.as_array())?.first()?; + let count = cell + .get("csvw:value") + .and_then(|v| v.as_str()) + .and_then(|s| s.parse::<i64>().ok()); + let percent = cell + .get("csvw:notes") + .and_then(|n| n.as_array()) + .and_then(|arr| arr.first()) + .and_then(|s| s.as_str()) + .map(String::from); + let key = normalise_star_key(name); + out.insert( + key, + json!({ + "count": count, + "percent": percent, + }), + ); + } + if out.is_empty() { + None + } else { + Some(Value::Object(out)) + } +} + +/// "1 star" -> "one_star", "Total" -> "total". Easier to consume than +/// the raw "1 star" key which fights YAML/JS property access. 
+fn normalise_star_key(name: &str) -> String { + let trimmed = name.trim().to_lowercase(); + match trimmed.as_str() { + "1 star" => "one_star".into(), + "2 stars" => "two_stars".into(), + "3 stars" => "three_stars".into(), + "4 stars" => "four_stars".into(), + "5 stars" => "five_stars".into(), + "total" => "total".into(), + other => other.replace(' ', "_"), + } +} + +/// Compute average rating (weighted by bucket) and total count from the +/// parsed distribution. Returns `(average, total)`. +fn compute_rating_stats(distribution: &Value) -> (Option<String>, Option<i64>) { + let Some(obj) = distribution.as_object() else { + return (None, None); + }; + let get_count = |key: &str| -> i64 { + obj.get(key) + .and_then(|v| v.get("count")) + .and_then(|v| v.as_i64()) + .unwrap_or(0) + }; + let one = get_count("one_star"); + let two = get_count("two_stars"); + let three = get_count("three_stars"); + let four = get_count("four_stars"); + let five = get_count("five_stars"); + let total_bucket = one + two + three + four + five; + let total = obj + .get("total") + .and_then(|v| v.get("count")) + .and_then(|v| v.as_i64()) + .unwrap_or(total_bucket); + if total == 0 { + return (None, Some(0)); + } + let weighted = one + (two * 2) + (three * 3) + (four * 4) + (five * 5); + let avg = weighted as f64 / total_bucket.max(1) as f64; + // One decimal place, matching how Trustpilot displays the score. + (Some(format!("{avg:.1}")), Some(total)) +} + +// --------------------------------------------------------------------------- +// OG / meta-tag fallbacks +// --------------------------------------------------------------------------- + +/// Regex out the business name from the standard Trustpilot OG title +/// shape: `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"`. 
+fn parse_name_from_og_title(html: &str) -> Option<String> { + let title = og(html, "title")?; + // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot" + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| Regex::new(r"^(.+?)\s+is rated\b").unwrap()); + re.captures(&title) + .and_then(|c| c.get(1)) + .map(|m| m.as_str().to_string()) +} + +/// Pull the rating label (e.g. "Bad", "Excellent") and numeric value +/// from the OG title. +fn parse_rating_from_og_title(html: &str) -> (Option<String>, Option<String>) { + let Some(title) = og(html, "title") else { + return (None, None); + }; + static RE: OnceLock<Regex> = OnceLock::new(); + // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot" + let re = RE.get_or_init(|| { + Regex::new(r#"is rated\s*[\\"]+([^"\\]+)[\\"]+\s*with\s*([\d.]+)\s*/\s*5"#).unwrap() + }); + let Some(caps) = re.captures(&title) else { + return (None, None); + }; + ( + caps.get(1).map(|m| m.as_str().trim().to_string()), + caps.get(2).map(|m| m.as_str().to_string()), + ) +} + +/// Parse "hear what 226 customers have already said" from the OG +/// description tag. +fn parse_review_count_from_og_description(html: &str) -> Option<i64> { + let desc = og(html, "description")?; + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| Regex::new(r"(\d[\d,]*)\s+customers").unwrap()); + re.captures(&desc)? + .get(1)? + .as_str() + .replace(',', "") + .parse::<i64>() + .ok() +} + +fn og(html: &str, prop: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() + }); + for c in re.captures_iter(html) { + if c.get(1).is_some_and(|m| m.as_str() == prop) { + let raw = c.get(2).map(|m| m.as_str())?; + return Some(html_unescape(raw)); + } + } + None +} + +/// Minimal HTML entity unescaping for the three entities the +/// synthesize_html escaper might produce. Keeps us off a heavier dep. 
+fn html_unescape(s: &str) -> String { + s.replace("&quot;", "\"") + .replace("&amp;", "&") + .replace("&lt;", "<") + .replace("&gt;", ">") +} + +fn get_string(v: &Value, key: &str) -> Option<String> { + v.get(key).and_then(|x| x.as_str().map(String::from)) +} + +// --------------------------------------------------------------------------- +// Review extraction +// --------------------------------------------------------------------------- + +fn extract_review(r: &Value) -> Value { + json!({ + "id": r.get("id").and_then(|v| v.as_str()), + "rating": r.get("rating").and_then(|v| v.as_i64()), + "title": r.get("title").and_then(|v| v.as_str()), + "text": r.get("text").and_then(|v| v.as_str()), + "language": r.get("language").and_then(|v| v.as_str()), + "source": r.get("source").and_then(|v| v.as_str()), + "likes": r.get("likes").and_then(|v| v.as_i64()), + "author": r.get("consumer").and_then(|c| c.get("displayName")).and_then(|v| v.as_str()), + "author_country": r.get("consumer").and_then(|c| c.get("countryCode")).and_then(|v| v.as_str()), + "author_review_count": r.get("consumer").and_then(|c| c.get("numberOfReviews")).and_then(|v| v.as_i64()), + "verified": r.get("consumer").and_then(|c| c.get("isVerified")).and_then(|v| v.as_bool()), + "date_experienced": r.get("dates").and_then(|d| d.get("experiencedDate")).and_then(|v| v.as_str()), + "date_published": r.get("dates").and_then(|d| d.get("publishedDate")).and_then(|v| v.as_str()), + }) +} + +// --------------------------------------------------------------------------- +// Tests +// --------------------------------------------------------------------------- + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_trustpilot_review_urls() { + assert!(matches("https://www.trustpilot.com/review/stripe.com")); + assert!(matches("https://trustpilot.com/review/example.com")); + assert!(!matches("https://www.trustpilot.com/")); + assert!(!matches("https://example.com/review/foo")); + } + + #[test] + fn 
parse_review_domain_handles_query_and_slash() { + assert_eq!( + parse_review_domain("https://www.trustpilot.com/review/anthropic.com"), + Some("anthropic.com".into()) + ); + assert_eq!( + parse_review_domain("https://www.trustpilot.com/review/anthropic.com/"), + Some("anthropic.com".into()) + ); + assert_eq!( + parse_review_domain("https://www.trustpilot.com/review/anthropic.com?stars=5"), + Some("anthropic.com".into()) + ); + } + + #[test] + fn normalise_star_key_covers_all_buckets() { + assert_eq!(normalise_star_key("1 star"), "one_star"); + assert_eq!(normalise_star_key("2 stars"), "two_stars"); + assert_eq!(normalise_star_key("5 stars"), "five_stars"); + assert_eq!(normalise_star_key("Total"), "total"); + } + + #[test] + fn compute_rating_stats_weighted_average() { + // 100 1-stars, 100 5-stars → avg 3.0 over 200 reviews. + let dist = json!({ + "one_star": { "count": 100, "percent": "50%" }, + "two_stars": { "count": 0, "percent": "0%" }, + "three_stars":{ "count": 0, "percent": "0%" }, + "four_stars": { "count": 0, "percent": "0%" }, + "five_stars": { "count": 100, "percent": "50%" }, + "total": { "count": 200, "percent": "100%" }, + }); + let (avg, total) = compute_rating_stats(&dist); + assert_eq!(avg.as_deref(), Some("3.0")); + assert_eq!(total, Some(200)); + } + + #[test] + fn parse_og_title_extracts_name_and_rating() { + let html = r#""#; + assert_eq!(parse_name_from_og_title(html), Some("Anthropic".into())); + let (label, rating) = parse_rating_from_og_title(html); + assert_eq!(label.as_deref(), Some("Bad")); + assert_eq!(rating.as_deref(), Some("1.5")); + } + + #[test] + fn parse_review_count_from_og_description_picks_number() { + let html = r#""#; + assert_eq!(parse_review_count_from_og_description(html), Some(226)); + } + + #[test] + fn parse_full_fixture_assembles_all_fields() { + let html = r##" + + + + + +"##; + let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap(); + assert_eq!(v["domain"], "anthropic.com"); + 
assert_eq!(v["business_name"], "Anthropic"); + assert_eq!(v["rating_label"], "Bad"); + assert_eq!(v["review_count"], 226); + assert_eq!(v["rating_distribution"]["one_star"]["count"], 196); + assert_eq!(v["rating_distribution"]["total"]["count"], 226); + assert_eq!(v["ai_summary"], "Mixed reviews."); + assert_eq!(v["recent_reviews"].as_array().unwrap().len(), 1); + assert_eq!(v["recent_reviews"][0]["author"], "W.FRH"); + assert_eq!(v["recent_reviews"][0]["rating"], 1); + assert_eq!(v["recent_reviews"][0]["title"], "Bad"); + } + + #[test] + fn parse_falls_back_to_og_when_no_jsonld() { + let html = r#" +"#; + let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap(); + assert_eq!(v["domain"], "anthropic.com"); + assert_eq!(v["business_name"], "Anthropic"); + assert_eq!(v["average_rating"], "1.5"); + assert_eq!(v["review_count"], 226); + assert_eq!(v["rating_label"], "Bad"); + } + + #[test] + fn parse_returns_ok_with_url_domain_when_nothing_else() { + let v = parse( + "", + "https://www.trustpilot.com/review/example.com", + ) + .unwrap(); + assert_eq!(v["domain"], "example.com"); + assert_eq!(v["business_name"], "example.com"); + } +} diff --git a/crates/webclaw-fetch/src/extractors/woocommerce_product.rs b/crates/webclaw-fetch/src/extractors/woocommerce_product.rs new file mode 100644 index 0000000..db6dd78 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/woocommerce_product.rs @@ -0,0 +1,237 @@ +//! WooCommerce product structured extractor. +//! +//! Targets WooCommerce's Store API: `/wp-json/wc/store/v1/products?slug={slug}`. +//! About 30-50% of WooCommerce stores expose this endpoint publicly +//! (it's on by default, but common security plugins disable it). +//! When it's off, the server returns 404 at /wp-json. We surface a +//! clean error and point callers at `/v1/scrape/ecommerce_product` +//! which works on any store with Schema.org JSON-LD. +//! +//! Explicit-call only. `/product/{slug}` is the default permalink for +//! 
WooCommerce but custom stores use every variation imaginable, so +//! auto-dispatch is unreliable. + +use serde::Deserialize; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "woocommerce_product", + label: "WooCommerce product", + description: "Returns product via the WooCommerce Store REST API (requires the /wp-json/wc/store endpoint to be enabled on the target store).", + url_patterns: &[ + "https://{shop}/product/{slug}", + "https://{shop}/shop/{slug}", + ], +}; + +pub fn matches(url: &str) -> bool { + let host = host_of(url); + if host.is_empty() { + return false; + } + // Permissive: WooCommerce stores use custom domains + custom + // permalinks. The extractor's API probe is what confirms it's + // really WooCommerce. + url.contains("/product/") + || url.contains("/shop/") + || url.contains("/producto/") // common es locale + || url.contains("/produit/") // common fr locale +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let slug = parse_slug(url).ok_or_else(|| { + FetchError::Build(format!( + "woocommerce_product: cannot parse slug from '{url}'" + )) + })?; + let host = host_of(url); + if host.is_empty() { + return Err(FetchError::Build(format!( + "woocommerce_product: empty host in '{url}'" + ))); + } + let scheme = if url.starts_with("http://") { + "http" + } else { + "https" + }; + let api_url = format!("{scheme}://{host}/wp-json/wc/store/v1/products?slug={slug}&per_page=1"); + let resp = client.fetch(&api_url).await?; + if resp.status == 404 { + return Err(FetchError::Build(format!( + "woocommerce_product: {host} does not expose /wp-json/wc/store (404). \ + Use /v1/scrape/ecommerce_product for JSON-LD fallback." + ))); + } + if resp.status == 401 || resp.status == 403 { + return Err(FetchError::Build(format!( + "woocommerce_product: {host} requires auth for /wp-json/wc/store ({}). 
\ + Use /v1/scrape/ecommerce_product for the public JSON-LD fallback.", + resp.status + ))); + } + if resp.status != 200 { + return Err(FetchError::Build(format!( + "woocommerce api returned status {} for {api_url}", + resp.status + ))); + } + + let products: Vec = serde_json::from_str(&resp.html) + .map_err(|e| FetchError::BodyDecode(format!("woocommerce parse: {e}")))?; + let p = products.into_iter().next().ok_or_else(|| { + FetchError::Build(format!( + "woocommerce_product: no product found for slug '{slug}' on {host}" + )) + })?; + + let images: Vec = p + .images + .iter() + .map(|i| json!({"src": i.src, "thumbnail": i.thumbnail, "alt": i.alt})) + .collect(); + let variations_count = p.variations.as_ref().map(|v| v.len()).unwrap_or(0); + + Ok(json!({ + "url": url, + "api_url": api_url, + "product_id": p.id, + "name": p.name, + "slug": p.slug, + "sku": p.sku, + "permalink": p.permalink, + "on_sale": p.on_sale, + "in_stock": p.is_in_stock, + "is_purchasable": p.is_purchasable, + "price": p.prices.as_ref().and_then(|pr| pr.price.clone()), + "regular_price": p.prices.as_ref().and_then(|pr| pr.regular_price.clone()), + "sale_price": p.prices.as_ref().and_then(|pr| pr.sale_price.clone()), + "currency": p.prices.as_ref().and_then(|pr| pr.currency_code.clone()), + "currency_minor": p.prices.as_ref().and_then(|pr| pr.currency_minor_unit), + "price_range": p.prices.as_ref().and_then(|pr| pr.price_range.clone()), + "average_rating": p.average_rating, + "review_count": p.review_count, + "description": p.description, + "short_description": p.short_description, + "categories": p.categories.iter().filter_map(|c| c.name.clone()).collect::>(), + "tags": p.tags.iter().filter_map(|t| t.name.clone()).collect::>(), + "variation_count": variations_count, + "image_count": images.len(), + "images": images, + })) +} + +// --------------------------------------------------------------------------- +// URL helpers +// 
--------------------------------------------------------------------------- + +fn host_of(url: &str) -> &str { + url.split("://") + .nth(1) + .unwrap_or(url) + .split('/') + .next() + .unwrap_or("") +} + +/// Extract the product slug from common WooCommerce permalinks. +fn parse_slug(url: &str) -> Option { + for needle in ["/product/", "/shop/", "/producto/", "/produit/"] { + if let Some(after) = url.split(needle).nth(1) { + let stripped = after + .split(['?', '#']) + .next()? + .trim_end_matches('/') + .split('/') + .next() + .unwrap_or(""); + if !stripped.is_empty() { + return Some(stripped.to_string()); + } + } + } + None +} + +// --------------------------------------------------------------------------- +// Store API types (subset of the full response) +// --------------------------------------------------------------------------- + +#[derive(Deserialize)] +struct Product { + id: Option, + name: Option, + slug: Option, + sku: Option, + permalink: Option, + description: Option, + short_description: Option, + on_sale: Option, + is_in_stock: Option, + is_purchasable: Option, + average_rating: Option, // string or number + review_count: Option, + prices: Option, + #[serde(default)] + categories: Vec, + #[serde(default)] + tags: Vec, + #[serde(default)] + images: Vec, + variations: Option>, +} + +#[derive(Deserialize)] +struct Prices { + price: Option, + regular_price: Option, + sale_price: Option, + currency_code: Option, + currency_minor_unit: Option, + price_range: Option, +} + +#[derive(Deserialize)] +struct Term { + name: Option, +} + +#[derive(Deserialize)] +struct Img { + src: Option, + thumbnail: Option, + alt: Option, +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_common_permalinks() { + assert!(matches("https://shop.example.com/product/cool-widget")); + assert!(matches("https://shop.example.com/shop/cool-widget")); + assert!(matches("https://tienda.example.com/producto/cosa")); + 
assert!(matches("https://boutique.example.com/produit/chose")); + } + + #[test] + fn parse_slug_handles_locale_and_suffix() { + assert_eq!( + parse_slug("https://shop.example.com/product/cool-widget"), + Some("cool-widget".into()) + ); + assert_eq!( + parse_slug("https://shop.example.com/product/cool-widget/?attr=red"), + Some("cool-widget".into()) + ); + assert_eq!( + parse_slug("https://tienda.example.com/producto/cosa/"), + Some("cosa".into()) + ); + } +} diff --git a/crates/webclaw-fetch/src/extractors/youtube_video.rs b/crates/webclaw-fetch/src/extractors/youtube_video.rs new file mode 100644 index 0000000..2551ff8 --- /dev/null +++ b/crates/webclaw-fetch/src/extractors/youtube_video.rs @@ -0,0 +1,378 @@ +//! YouTube video structured extractor. +//! +//! YouTube embeds the full player configuration in a +//! `ytInitialPlayerResponse` JavaScript assignment at the top of +//! every `/watch`, `/shorts`, and `youtu.be` HTML page. We reuse the +//! core crate's already-proven regex + parse to surface typed JSON +//! from it: video id, title, author + channel id, view count, +//! duration, upload date, keywords, thumbnails, caption-track URLs. +//! +//! Auto-dispatched: YouTube host is unique and the `v=` or `/shorts/` +//! shape is stable. +//! +//! ## Fallback +//! +//! `ytInitialPlayerResponse` is missing on EU-consent interstitials, +//! some live-stream pre-show pages, and age-gated videos. In those +//! cases we drop down to OG tags for `title`, `description`, +//! `thumbnail`, and `channel`, and return a `data_source: +//! "og_fallback"` payload so the caller can tell they got a degraded +//! shape (no view count, duration, captions). 
+ +use std::sync::OnceLock; + +use regex::Regex; +use serde_json::{Value, json}; + +use super::ExtractorInfo; +use crate::error::FetchError; +use crate::fetcher::Fetcher; + +pub const INFO: ExtractorInfo = ExtractorInfo { + name: "youtube_video", + label: "YouTube video", + description: "Returns video id, title, channel, view count, duration, upload date, thumbnails, keywords, and caption-track URLs. Falls back to OG metadata on consent / age-gate pages.", + url_patterns: &[ + "https://www.youtube.com/watch?v={id}", + "https://youtu.be/{id}", + "https://www.youtube.com/shorts/{id}", + ], +}; + +pub fn matches(url: &str) -> bool { + webclaw_core::youtube::is_youtube_url(url) + || url.contains("youtube.com/shorts/") + || url.contains("youtube-nocookie.com/embed/") +} + +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { + let video_id = parse_video_id(url).ok_or_else(|| { + FetchError::Build(format!("youtube_video: cannot parse video id from '{url}'")) + })?; + + // Always fetch the canonical /watch URL. /shorts/ and youtu.be + // sometimes serve a thinner page without the player blob. + let canonical = format!("https://www.youtube.com/watch?v={video_id}"); + let resp = client.fetch(&canonical).await?; + if resp.status != 200 { + return Err(FetchError::Build(format!( + "youtube returned status {} for {canonical}", + resp.status + ))); + } + + if let Some(player) = extract_player_response(&resp.html) { + return Ok(build_player_payload( + &player, &resp.html, url, &canonical, &video_id, + )); + } + + // No player blob. Fall back to OG tags so the call still returns + // something useful for consent / age-gate pages. 
+ Ok(build_og_fallback(&resp.html, url, &canonical, &video_id)) +} + +// --------------------------------------------------------------------------- +// Player-blob path (rich payload) +// --------------------------------------------------------------------------- + +fn build_player_payload( + player: &Value, + html: &str, + url: &str, + canonical: &str, + video_id: &str, +) -> Value { + let video_details = player.get("videoDetails"); + let microformat = player + .get("microformat") + .and_then(|m| m.get("playerMicroformatRenderer")); + + let thumbnails: Vec = video_details + .and_then(|vd| vd.get("thumbnail")) + .and_then(|t| t.get("thumbnails")) + .and_then(|t| t.as_array()) + .cloned() + .unwrap_or_default(); + + let keywords: Vec = video_details + .and_then(|vd| vd.get("keywords")) + .and_then(|k| k.as_array()) + .cloned() + .unwrap_or_default(); + + let caption_tracks = webclaw_core::youtube::extract_caption_tracks(html); + let captions: Vec = caption_tracks + .iter() + .map(|c| { + json!({ + "url": c.url, + "lang": c.lang, + "name": c.name, + }) + }) + .collect(); + + json!({ + "url": url, + "canonical_url":canonical, + "data_source": "player_response", + "video_id": video_id, + "title": get_str(video_details, "title"), + "description": get_str(video_details, "shortDescription"), + "author": get_str(video_details, "author"), + "channel_id": get_str(video_details, "channelId"), + "channel_url": get_str(microformat, "ownerProfileUrl"), + "view_count": get_int(video_details, "viewCount"), + "length_seconds": get_int(video_details, "lengthSeconds"), + "is_live": video_details.and_then(|vd| vd.get("isLiveContent")).and_then(|v| v.as_bool()), + "is_private": video_details.and_then(|vd| vd.get("isPrivate")).and_then(|v| v.as_bool()), + "is_unlisted": microformat.and_then(|m| m.get("isUnlisted")).and_then(|v| v.as_bool()), + "allow_ratings":video_details.and_then(|vd| vd.get("allowRatings")).and_then(|v| v.as_bool()), + "category": get_str(microformat, "category"), + 
"upload_date": get_str(microformat, "uploadDate"), + "publish_date": get_str(microformat, "publishDate"), + "keywords": keywords, + "thumbnails": thumbnails, + "caption_tracks": captions, + }) +} + +// --------------------------------------------------------------------------- +// OG fallback path (degraded payload) +// --------------------------------------------------------------------------- + +fn build_og_fallback(html: &str, url: &str, canonical: &str, video_id: &str) -> Value { + let title = og(html, "title"); + let description = og(html, "description"); + let thumbnail = og(html, "image"); + // YouTube sets `` on some pages but + // OG-only pages reliably carry `og:video:tag` and the channel in + // ``. We keep this lean: just what's stable. + let channel = meta_name(html, "author"); + + json!({ + "url": url, + "canonical_url":canonical, + "data_source": "og_fallback", + "video_id": video_id, + "title": title, + "description": description, + "author": channel, + // OG path: these are null so the caller doesn't have to guess. 
+ "channel_id": None::, + "channel_url": None::, + "view_count": None::, + "length_seconds": None::, + "is_live": None::, + "is_private": None::, + "is_unlisted": None::, + "allow_ratings":None::, + "category": None::, + "upload_date": None::, + "publish_date": None::, + "keywords": Vec::::new(), + "thumbnails": thumbnail.as_ref().map(|t| vec![json!({"url": t})]).unwrap_or_default(), + "caption_tracks": Vec::::new(), + }) +} + +// --------------------------------------------------------------------------- +// URL helpers +// --------------------------------------------------------------------------- + +fn parse_video_id(url: &str) -> Option { + // youtu.be/{id} + if let Some(after) = url.split("youtu.be/").nth(1) { + let id = after + .split(['?', '#', '/']) + .next() + .unwrap_or("") + .trim_end_matches('/'); + if !id.is_empty() { + return Some(id.to_string()); + } + } + // youtube.com/shorts/{id} + if let Some(after) = url.split("youtube.com/shorts/").nth(1) { + let id = after + .split(['?', '#', '/']) + .next() + .unwrap_or("") + .trim_end_matches('/'); + if !id.is_empty() { + return Some(id.to_string()); + } + } + // youtube-nocookie.com/embed/{id} + if let Some(after) = url.split("/embed/").nth(1) { + let id = after + .split(['?', '#', '/']) + .next() + .unwrap_or("") + .trim_end_matches('/'); + if !id.is_empty() { + return Some(id.to_string()); + } + } + // youtube.com/watch?v={id} (also matches youtube.com/watch?foo=bar&v={id}) + if let Some(q) = url.split_once('?').map(|(_, q)| q) + && let Some(id) = q + .split('&') + .find_map(|p| p.strip_prefix("v=").map(|v| v.to_string())) + { + let id = id.split(['#', '/']).next().unwrap_or(&id).to_string(); + if !id.is_empty() { + return Some(id); + } + } + None +} + +// --------------------------------------------------------------------------- +// Player-response parsing +// --------------------------------------------------------------------------- + +fn extract_player_response(html: &str) -> Option { + // Same regex 
as webclaw_core::youtube. Duplicated here because + // core's regex is module-private. Kept in lockstep; changes are + // rare and we cover with tests in both places. + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE + .get_or_init(|| Regex::new(r"var\s+ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;").unwrap()); + let json_str = re.captures(html)?.get(1)?.as_str(); + serde_json::from_str(json_str).ok() +} + +// --------------------------------------------------------------------------- +// Meta-tag helpers (for OG fallback) +// --------------------------------------------------------------------------- + +fn og(html: &str, prop: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() + }); + for c in re.captures_iter(html) { + if c.get(1).is_some_and(|m| m.as_str() == prop) { + return c.get(2).map(|m| m.as_str().to_string()); + } + } + None +} + +fn meta_name(html: &str, name: &str) -> Option<String> { + static RE: OnceLock<Regex> = OnceLock::new(); + let re = RE.get_or_init(|| { + Regex::new(r#"(?i)<meta[^>]+name="([^"]+)"[^>]+content="([^"]+)""#).unwrap() + }); + for c in re.captures_iter(html) { + if c.get(1).is_some_and(|m| m.as_str() == name) { + return c.get(2).map(|m| m.as_str().to_string()); + } + } + None +} + +fn get_str(v: Option<&Value>, key: &str) -> Option<String> { + v.and_then(|x| x.get(key)) + .and_then(|x| x.as_str().map(String::from)) +} + +fn get_int(v: Option<&Value>, key: &str) -> Option<i64> { + v.and_then(|x| x.get(key)).and_then(|x| { + x.as_i64() + .or_else(|| x.as_str().and_then(|s| s.parse::<i64>().ok())) + }) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn matches_watch_urls() { + assert!(matches("https://www.youtube.com/watch?v=dQw4w9WgXcQ")); + assert!(matches("https://youtu.be/dQw4w9WgXcQ")); + assert!(matches("https://www.youtube.com/shorts/abc123")); + assert!(matches( + "https://www.youtube-nocookie.com/embed/dQw4w9WgXcQ" + )); + } + + 
#[test] + fn rejects_non_video_urls() { + assert!(!matches("https://www.youtube.com/")); + assert!(!matches("https://www.youtube.com/channel/abc")); + assert!(!matches("https://example.com/watch?v=abc")); + } + + #[test] + fn parse_video_id_from_each_shape() { + assert_eq!( + parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"), + Some("dQw4w9WgXcQ".into()) + ); + assert_eq!( + parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=10s"), + Some("dQw4w9WgXcQ".into()) + ); + assert_eq!( + parse_video_id("https://www.youtube.com/watch?feature=share&v=dQw4w9WgXcQ"), + Some("dQw4w9WgXcQ".into()) + ); + assert_eq!( + parse_video_id("https://youtu.be/dQw4w9WgXcQ"), + Some("dQw4w9WgXcQ".into()) + ); + assert_eq!( + parse_video_id("https://youtu.be/dQw4w9WgXcQ?t=30"), + Some("dQw4w9WgXcQ".into()) + ); + assert_eq!( + parse_video_id("https://www.youtube.com/shorts/abc123"), + Some("abc123".into()) + ); + } + + #[test] + fn extract_player_response_happy_path() { + let html = r#" + + + +"#; + let v = extract_player_response(html).unwrap(); + let vd = v.get("videoDetails").unwrap(); + assert_eq!(vd.get("title").unwrap().as_str(), Some("T")); + } + + #[test] + fn og_fallback_extracts_basics_from_meta_tags() { + let html = r##" + + + + + +"##; + let v = build_og_fallback( + html, + "https://www.youtube.com/watch?v=abc", + "https://www.youtube.com/watch?v=abc", + "abc", + ); + assert_eq!(v["data_source"], "og_fallback"); + assert_eq!(v["title"], "Example Video Title"); + assert_eq!(v["description"], "A cool video description."); + assert_eq!(v["author"], "Example Channel"); + assert_eq!( + v["thumbnails"][0]["url"], + "https://i.ytimg.com/vi/abc/maxresdefault.jpg" + ); + assert!(v["view_count"].is_null()); + assert!(v["caption_tracks"].as_array().unwrap().is_empty()); + } +} diff --git a/crates/webclaw-fetch/src/fetcher.rs b/crates/webclaw-fetch/src/fetcher.rs new file mode 100644 index 0000000..fabcf44 --- /dev/null +++ b/crates/webclaw-fetch/src/fetcher.rs @@ 
-0,0 +1,118 @@ +//! Pluggable fetcher abstraction for vertical extractors. +//! +//! Extractors call the network through this trait instead of hard- +//! coding [`FetchClient`]. The OSS CLI / MCP / self-hosted server all +//! pass `&FetchClient` (wreq-backed BoringSSL). The production API +//! server, which must not use in-process TLS fingerprinting, provides +//! its own implementation that routes through the Go tls-sidecar. +//! +//! Both paths expose the same [`FetchResult`] shape and the same +//! optional cloud-escalation client, so extractor logic stays +//! identical across environments. +//! +//! ## Choosing an implementation +//! +//! - CLI, MCP, self-hosted `webclaw-server`: build a [`FetchClient`] +//! with [`FetchClient::with_cloud`] to attach cloud fallback, pass +//! it to extractors as `&client`. +//! - `api.webclaw.io` production server: build a `TlsSidecarFetcher` +//! (in `server/src/engine/`) that delegates to `engine::tls_client` +//! and wraps it in `Arc` for handler injection. +//! +//! ## Why a trait and not a free function +//! +//! Extractors need state beyond a single fetch: the cloud client for +//! antibot escalation, and in the future per-user proxy pools, tenant +//! headers, circuit breakers. A trait keeps that state encapsulated +//! behind the fetch interface instead of threading it through every +//! extractor signature. + +use async_trait::async_trait; + +use crate::client::FetchResult; +use crate::cloud::CloudClient; +use crate::error::FetchError; + +/// HTTP fetch surface used by vertical extractors. +/// +/// Implementations must be `Send + Sync` because extractor dispatchers +/// run them inside tokio tasks, potentially across many requests. +#[async_trait] +pub trait Fetcher: Send + Sync { + /// Fetch a URL and return the raw response body + metadata. The + /// body is in `FetchResult::html` regardless of the actual content + /// type — JSON API endpoints put JSON there, HTML pages put HTML. 
+ /// Extractors branch on response status and body shape. + async fn fetch(&self, url: &str) -> Result; + + /// Fetch with additional request headers. Needed for endpoints + /// that authenticate via a specific header (Instagram's + /// `x-ig-app-id`, for example). Default implementation routes to + /// [`Self::fetch`] so implementers without header support stay + /// functional, though the `Option` field they'd set won't + /// be populated on the request. + async fn fetch_with_headers( + &self, + url: &str, + _headers: &[(&str, &str)], + ) -> Result { + self.fetch(url).await + } + + /// Optional cloud-escalation client for antibot bypass. Returning + /// `Some` tells extractors they can call into the hosted API when + /// local fetch hits a challenge page. Returning `None` makes + /// cloud-gated extractors emit [`CloudError::NotConfigured`] with + /// an actionable signup link. + /// + /// The default implementation returns `None` because not every + /// deployment wants cloud fallback (self-hosts that don't have a + /// webclaw.io subscription, for instance). + /// + /// [`CloudError::NotConfigured`]: crate::cloud::CloudError::NotConfigured + fn cloud(&self) -> Option<&CloudClient> { + None + } +} + +// --------------------------------------------------------------------------- +// Blanket impls: make `&T` and `Arc` behave like the wrapped `T`. 
+// --------------------------------------------------------------------------- + +#[async_trait] +impl<T: Fetcher + ?Sized> Fetcher for &T { + async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> { + (**self).fetch(url).await + } + + async fn fetch_with_headers( + &self, + url: &str, + headers: &[(&str, &str)], + ) -> Result<FetchResult, FetchError> { + (**self).fetch_with_headers(url, headers).await + } + + fn cloud(&self) -> Option<&CloudClient> { + (**self).cloud() + } +} + +#[async_trait] +impl<T: Fetcher + ?Sized> Fetcher for std::sync::Arc<T> { + async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> { + (**self).fetch(url).await + } + + async fn fetch_with_headers( + &self, + url: &str, + headers: &[(&str, &str)], + ) -> Result<FetchResult, FetchError> { + (**self).fetch_with_headers(url, headers).await + } + + fn cloud(&self) -> Option<&CloudClient> { + (**self).cloud() + } +} diff --git a/crates/webclaw-fetch/src/lib.rs b/crates/webclaw-fetch/src/lib.rs index 517cb6e..83664a1 100644 --- a/crates/webclaw-fetch/src/lib.rs +++ b/crates/webclaw-fetch/src/lib.rs @@ -3,9 +3,12 @@ //! Automatically detects PDF responses and delegates to webclaw-pdf. 
pub mod browser; pub mod client; +pub mod cloud; pub mod crawler; pub mod document; pub mod error; +pub mod extractors; +pub mod fetcher; pub mod linkedin; pub mod proxy; pub mod reddit; @@ -16,6 +19,7 @@ pub use browser::BrowserProfile; pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult}; pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult}; pub use error::FetchError; +pub use fetcher::Fetcher; pub use http::HeaderMap; pub use proxy::{parse_proxy_file, parse_proxy_line}; pub use sitemap::SitemapEntry; diff --git a/crates/webclaw-mcp/Cargo.toml b/crates/webclaw-mcp/Cargo.toml index df9dd97..ec3b2b4 100644 --- a/crates/webclaw-mcp/Cargo.toml +++ b/crates/webclaw-mcp/Cargo.toml @@ -22,6 +22,5 @@ serde_json = { workspace = true } tokio = { workspace = true } tracing = { workspace = true } tracing-subscriber = { workspace = true } -reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] } url = "2" dirs = "6.0.0" diff --git a/crates/webclaw-mcp/src/cloud.rs b/crates/webclaw-mcp/src/cloud.rs deleted file mode 100644 index ac602e4..0000000 --- a/crates/webclaw-mcp/src/cloud.rs +++ /dev/null @@ -1,302 +0,0 @@ -/// Cloud API fallback for protected sites. -/// -/// When local fetch returns a challenge page, this module retries -/// via api.webclaw.io. Requires WEBCLAW_API_KEY to be set. -use std::time::Duration; - -use serde_json::{Value, json}; -use tracing::info; - -const API_BASE: &str = "https://api.webclaw.io/v1"; - -/// Lightweight client for the webclaw cloud API. -pub struct CloudClient { - api_key: String, - http: reqwest::Client, -} - -impl CloudClient { - /// Create a new cloud client from WEBCLAW_API_KEY env var. - /// Returns None if the key is not set. 
- pub fn from_env() -> Option<Self> { - let key = std::env::var("WEBCLAW_API_KEY").ok()?; - if key.is_empty() { - return None; - } - let http = reqwest::Client::builder() - .timeout(Duration::from_secs(60)) - .build() - .unwrap_or_default(); - Some(Self { api_key: key, http }) - } - - /// Scrape a URL via the cloud API. Returns the response JSON. - pub async fn scrape( - &self, - url: &str, - formats: &[&str], - include_selectors: &[String], - exclude_selectors: &[String], - only_main_content: bool, - ) -> Result<Value, String> { - let mut body = json!({ - "url": url, - "formats": formats, - }); - - if only_main_content { - body["only_main_content"] = json!(true); - } - if !include_selectors.is_empty() { - body["include_selectors"] = json!(include_selectors); - } - if !exclude_selectors.is_empty() { - body["exclude_selectors"] = json!(exclude_selectors); - } - - self.post("scrape", body).await - } - - /// Generic POST to the cloud API. - pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> { - let resp = self - .http - .post(format!("{API_BASE}/{endpoint}")) - .header("Authorization", format!("Bearer {}", self.api_key)) - .json(&body) - .send() - .await - .map_err(|e| format!("Cloud API request failed: {e}"))?; - - let status = resp.status(); - if !status.is_success() { - let text = resp.text().await.unwrap_or_default(); - let truncated = truncate_error(&text); - return Err(format!("Cloud API error {status}: {truncated}")); - } - - resp.json::<Value>() - .await - .map_err(|e| format!("Cloud API response parse failed: {e}")) - } - - /// Generic GET from the cloud API. 
- pub async fn get(&self, endpoint: &str) -> Result<Value, String> { - let resp = self - .http - .get(format!("{API_BASE}/{endpoint}")) - .header("Authorization", format!("Bearer {}", self.api_key)) - .send() - .await - .map_err(|e| format!("Cloud API request failed: {e}"))?; - - let status = resp.status(); - if !status.is_success() { - let text = resp.text().await.unwrap_or_default(); - let truncated = truncate_error(&text); - return Err(format!("Cloud API error {status}: {truncated}")); - } - - resp.json::<Value>() - .await - .map_err(|e| format!("Cloud API response parse failed: {e}")) - } -} - -/// Truncate error body to avoid flooding logs with huge HTML responses. -fn truncate_error(text: &str) -> &str { - const MAX_LEN: usize = 500; - match text.char_indices().nth(MAX_LEN) { - Some((byte_pos, _)) => &text[..byte_pos], - None => text, - } -} - -/// Check if fetched HTML looks like a bot protection challenge page. -/// Detects common bot protection challenge pages. -pub fn is_bot_protected(html: &str, headers: &webclaw_fetch::HeaderMap) -> bool { - let html_lower = html.to_lowercase(); - - // Cloudflare challenge page - if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") { - return true; - } - - // Cloudflare "checking your browser" spinner - if (html_lower.contains("just a moment") || html_lower.contains("checking your browser")) - && html_lower.contains("cf-spinner") - { - return true; - } - - // Cloudflare Turnstile (only on short pages = challenge, not embedded on real content) - if (html_lower.contains("cf-turnstile") - || html_lower.contains("challenges.cloudflare.com/turnstile")) - && html.len() < 100_000 - { - return true; - } - - // DataDome - if html_lower.contains("geo.captcha-delivery.com") - || html_lower.contains("captcha-delivery.com/captcha") - { - return true; - } - - // AWS WAF - if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") { - return true; - } - - // hCaptcha blocking page - if 
html_lower.contains("hcaptcha.com") - && html_lower.contains("h-captcha") - && html.len() < 50_000 - { - return true; - } - - // Cloudflare via headers + challenge body - let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some(); - if has_cf_headers - && (html_lower.contains("just a moment") || html_lower.contains("checking your browser")) - { - return true; - } - - false -} - -/// Check if a page likely needs JS rendering (SPA with almost no text content). -pub fn needs_js_rendering(word_count: usize, html: &str) -> bool { - let has_scripts = html.contains(" 5_000 && has_scripts { - return true; - } - - // Tier 2: SPA framework detected with suspiciously low content-to-HTML ratio - if word_count < 800 && html.len() > 50_000 && has_scripts { - let html_lower = html.to_lowercase(); - let has_spa_marker = html_lower.contains("react-app") - || html_lower.contains("id=\"__next\"") - || html_lower.contains("id=\"root\"") - || html_lower.contains("id=\"app\"") - || html_lower.contains("__next_data__") - || html_lower.contains("nuxt") - || html_lower.contains("ng-app"); - - if has_spa_marker { - return true; - } - } - - false -} - -/// Result of a smart fetch: either local extraction or cloud API response. -pub enum SmartFetchResult { - /// Successfully extracted locally. - Local(Box), - /// Fell back to cloud API. Contains the API response JSON. - Cloud(Value), -} - -/// Try local fetch first, fall back to cloud API if bot-protected or JS-rendered. -/// -/// Returns the extraction result (local) or the cloud API response JSON. -/// If no API key is configured and local fetch is blocked, returns an error -/// with a helpful message. 
-pub async fn smart_fetch( - client: &webclaw_fetch::FetchClient, - cloud: Option<&CloudClient>, - url: &str, - include_selectors: &[String], - exclude_selectors: &[String], - only_main_content: bool, - formats: &[&str], -) -> Result<SmartFetchResult, String> { - // Step 1: Try local fetch (with timeout to avoid hanging on slow servers) - let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url)) - .await - .map_err(|_| format!("Fetch timed out after 30s for {url}"))? - .map_err(|e| format!("Fetch failed: {e}"))?; - - // Step 2: Check for bot protection - if is_bot_protected(&fetch_result.html, &fetch_result.headers) { - info!(url, "bot protection detected, falling back to cloud API"); - return cloud_fallback( - cloud, - url, - include_selectors, - exclude_selectors, - only_main_content, - formats, - ) - .await; - } - - // Step 3: Extract locally - let options = webclaw_core::ExtractionOptions { - include_selectors: include_selectors.to_vec(), - exclude_selectors: exclude_selectors.to_vec(), - only_main_content, - include_raw_html: false, - }; - - let extraction = - webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options) - .map_err(|e| format!("Extraction failed: {e}"))?; - - // Step 4: Check for JS-rendered pages (low content from large HTML) - if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) { - info!( - url, - word_count = extraction.metadata.word_count, - html_len = fetch_result.html.len(), - "JS-rendered page detected, falling back to cloud API" - ); - return cloud_fallback( - cloud, - url, - include_selectors, - exclude_selectors, - only_main_content, - formats, - ) - .await; - } - - Ok(SmartFetchResult::Local(Box::new(extraction))) -} - -async fn cloud_fallback( - cloud: Option<&CloudClient>, - url: &str, - include_selectors: &[String], - exclude_selectors: &[String], - only_main_content: bool, - formats: &[&str], -) -> Result<SmartFetchResult, String> { - match cloud { - Some(c) => { - let resp = c - .scrape( - url, - 
formats, - include_selectors, - exclude_selectors, - only_main_content, - ) - .await?; - info!(url, "cloud API fallback successful"); - Ok(SmartFetchResult::Cloud(resp)) - } - None => Err(format!( - "Bot protection detected on {url}. Set WEBCLAW_API_KEY for automatic cloud bypass. \ - Get a key at https://webclaw.io" - )), - } -} diff --git a/crates/webclaw-mcp/src/main.rs b/crates/webclaw-mcp/src/main.rs index 8576562..89a4755 100644 --- a/crates/webclaw-mcp/src/main.rs +++ b/crates/webclaw-mcp/src/main.rs @@ -1,7 +1,6 @@ /// webclaw-mcp: MCP (Model Context Protocol) server for webclaw. /// Exposes web extraction tools over stdio transport for AI agents /// like Claude Desktop, Claude Code, and other MCP clients. -mod cloud; mod server; mod tools; diff --git a/crates/webclaw-mcp/src/server.rs b/crates/webclaw-mcp/src/server.rs index f00eae7..a4af79d 100644 --- a/crates/webclaw-mcp/src/server.rs +++ b/crates/webclaw-mcp/src/server.rs @@ -15,7 +15,8 @@ use serde_json::json; use tracing::{error, info, warn}; use url::Url; -use crate::cloud::{self, CloudClient, SmartFetchResult}; +use webclaw_fetch::cloud::{self, CloudClient, SmartFetchResult}; + use crate::tools::*; pub struct WebclawMcp { @@ -717,6 +718,50 @@ impl WebclawMcp { Ok(serde_json::to_string_pretty(&resp).unwrap_or_default()) } } + + /// List every vertical extractor the server knows about. Returns a + /// JSON array of `{name, label, description, url_patterns}` entries. + /// Call this to discover what verticals are available before using + /// `vertical_scrape`. + #[tool] + async fn list_extractors( + &self, + Parameters(_params): Parameters<ListExtractorsParams>, + ) -> Result<String, String> { + let catalog = webclaw_fetch::extractors::list(); + serde_json::to_string_pretty(&catalog) + .map_err(|e| format!("failed to serialise extractor catalog: {e}")) + } + + /// Run a vertical extractor by name and return typed JSON specific + /// to the target site (title, price, rating, author, etc.), not + /// generic markdown. 
Use `list_extractors` to discover available + /// names. Example names: `reddit`, `github_repo`, `trustpilot_reviews`, + /// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`. + /// + /// Antibot-gated verticals (amazon_product, ebay_listing, + /// etsy_listing, trustpilot_reviews) will automatically escalate to + /// the webclaw cloud API when local fetch hits bot protection, + /// provided `WEBCLAW_API_KEY` is set. + #[tool] + async fn vertical_scrape( + &self, + Parameters(params): Parameters<VerticalParams>, + ) -> Result<String, String> { + validate_url(&params.url)?; + // Reuse the long-lived default FetchClient. Extractors accept + // `&dyn Fetcher`; FetchClient implements the trait so this just + // works (see webclaw_fetch::Fetcher and client::FetchClient). + let data = webclaw_fetch::extractors::dispatch_by_name( + self.fetch_client.as_ref(), + &params.name, + &params.url, + ) + .await + .map_err(|e| e.to_string())?; + serde_json::to_string_pretty(&data) + .map_err(|e| format!("failed to serialise extractor output: {e}")) + } } #[tool_handler] @@ -726,7 +771,8 @@ impl ServerHandler for WebclawMcp { .with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION"))) .with_instructions(String::from( "Webclaw MCP server -- web content extraction for AI agents. \ - Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.", + Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \ + list_extractors, vertical_scrape.", )) } } diff --git a/crates/webclaw-mcp/src/tools.rs b/crates/webclaw-mcp/src/tools.rs index e0195f1..02bf534 100644 --- a/crates/webclaw-mcp/src/tools.rs +++ b/crates/webclaw-mcp/src/tools.rs @@ -103,3 +103,20 @@ pub struct SearchParams { /// Number of results to return (default: 10) pub num_results: Option, } + +/// Parameters for `vertical_scrape`: run a site-specific extractor by name. +#[derive(Debug, Deserialize, JsonSchema)] +pub struct VerticalParams { + /// Name of the vertical extractor. 
Call `list_extractors` to see all + /// available names. Examples: "reddit", "github_repo", "pypi", + /// "trustpilot_reviews", "youtube_video", "shopify_product". + pub name: String, + /// URL to extract. Must match the URL patterns the extractor claims; + /// otherwise the tool returns a clear "URL mismatch" error. + pub url: String, +} + +/// `list_extractors` takes no arguments but we still need an empty struct +/// so rmcp can generate a schema and parse the (empty) JSON-RPC params. +#[derive(Debug, Deserialize, JsonSchema)] +pub struct ListExtractorsParams {} diff --git a/crates/webclaw-server/src/main.rs b/crates/webclaw-server/src/main.rs index c57fed8..f4cfdcb 100644 --- a/crates/webclaw-server/src/main.rs +++ b/crates/webclaw-server/src/main.rs @@ -79,10 +79,15 @@ async fn main() -> anyhow::Result<()> { let v1 = Router::new() .route("/scrape", post(routes::scrape::scrape)) + .route( + "/scrape/{vertical}", + post(routes::structured::scrape_vertical), + ) .route("/crawl", post(routes::crawl::crawl)) .route("/map", post(routes::map::map)) .route("/batch", post(routes::batch::batch)) .route("/extract", post(routes::extract::extract)) + .route("/extractors", get(routes::structured::list_extractors)) .route("/summarize", post(routes::summarize::summarize_route)) .route("/diff", post(routes::diff::diff_route)) .route("/brand", post(routes::brand::brand)) diff --git a/crates/webclaw-server/src/routes/mod.rs b/crates/webclaw-server/src/routes/mod.rs index 7c3d68e..01f1052 100644 --- a/crates/webclaw-server/src/routes/mod.rs +++ b/crates/webclaw-server/src/routes/mod.rs @@ -15,4 +15,5 @@ pub mod extract; pub mod health; pub mod map; pub mod scrape; +pub mod structured; pub mod summarize; diff --git a/crates/webclaw-server/src/routes/structured.rs b/crates/webclaw-server/src/routes/structured.rs new file mode 100644 index 0000000..c9cdc1a --- /dev/null +++ b/crates/webclaw-server/src/routes/structured.rs @@ -0,0 +1,55 @@ +//! 
`POST /v1/scrape/{vertical}` and `GET /v1/extractors`. +//! +//! Vertical extractors return typed JSON instead of generic markdown. +//! See `webclaw_fetch::extractors` for the catalog and per-site logic. + +use axum::{ + Json, + extract::{Path, State}, +}; +use serde::Deserialize; +use serde_json::{Value, json}; +use webclaw_fetch::extractors::{self, ExtractorDispatchError}; + +use crate::{error::ApiError, state::AppState}; + +#[derive(Debug, Deserialize)] +pub struct ScrapeRequest { + pub url: String, +} + +/// Map dispatcher errors to ApiError so users get clean HTTP statuses +/// instead of opaque 500s. +impl From<ExtractorDispatchError> for ApiError { + fn from(e: ExtractorDispatchError) -> Self { + match e { + ExtractorDispatchError::UnknownVertical(_) => ApiError::NotFound, + ExtractorDispatchError::UrlMismatch { .. } => ApiError::bad_request(e.to_string()), + ExtractorDispatchError::Fetch(f) => ApiError::Fetch(f.to_string()), + } + } +} + +/// `GET /v1/extractors` — catalog of all available verticals. +pub async fn list_extractors() -> Json<Value> { + Json(json!({ + "extractors": extractors::list(), + })) +} + +/// `POST /v1/scrape/{vertical}` — explicit vertical, e.g. /v1/scrape/reddit. +pub async fn scrape_vertical( + State(state): State<AppState>, + Path(vertical): Path<String>, + Json(req): Json<ScrapeRequest>, +) -> Result<Json<Value>, ApiError> { + if req.url.trim().is_empty() { + return Err(ApiError::bad_request("`url` is required")); + } + let data = extractors::dispatch_by_name(state.fetch(), &vertical, &req.url).await?; + Ok(Json(json!({ + "vertical": vertical, + "url": req.url, + "data": data, + }))) +} diff --git a/crates/webclaw-server/src/state.rs b/crates/webclaw-server/src/state.rs index b3f9b6b..6c2e8f7 100644 --- a/crates/webclaw-server/src/state.rs +++ b/crates/webclaw-server/src/state.rs @@ -1,7 +1,24 @@ //! Shared application state. Cheap to clone via Arc; held by the axum //! Router for the life of the process. +//! +//! Two unrelated keys get carried here: +//! +//! 1. 
[`AppState::api_key`] — the **bearer token clients must present** +//! to call this server. Set via `WEBCLAW_API_KEY` / `--api-key`. +//! Unset = open mode. +//! 2. The inner [`webclaw_fetch::cloud::CloudClient`] (if any) — our +//! **outbound** credential for api.webclaw.io, used by extractors +//! that escalate on antibot. Set via `WEBCLAW_CLOUD_API_KEY`. +//! Unset = hard-site extractors return a "set WEBCLAW_CLOUD_API_KEY" +//! error with a signup link. +//! +//! Different variables on purpose: conflating the two means operators +//! who want their server behind an auth token can't also enable cloud +//! fallback, and vice versa. use std::sync::Arc; +use tracing::info; +use webclaw_fetch::cloud::CloudClient; use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig}; /// Single-process state shared across all request handlers. @@ -17,6 +34,7 @@ struct Inner { /// auto-deref `&Arc<FetchClient>` -> `&FetchClient`, so this costs /// them nothing. pub fetch: Arc<FetchClient>, + /// Inbound bearer-auth token for this server's own `/v1/*` surface. pub api_key: Option<String>, } @@ -24,17 +42,34 @@ impl AppState { /// Build the application state. The fetch client is constructed once /// and shared across requests so connection pools + browser profile /// state don't churn per request. - pub fn new(api_key: Option<String>) -> anyhow::Result<Self> { + /// + /// `inbound_api_key` is the bearer token clients must present; + /// cloud-fallback credentials come from the env (checked here). + pub fn new(inbound_api_key: Option<String>) -> anyhow::Result<Self> { let config = FetchConfig { - browser: BrowserProfile::Firefox, ..FetchConfig::default() }; - let fetch = FetchClient::new(config) + let mut fetch = FetchClient::new(config) .map_err(|e| anyhow::anyhow!("failed to build fetch client: {e}"))?; + + // Cloud fallback: only activates when the operator has provided + // an api.webclaw.io key. 
Supports both WEBCLAW_CLOUD_API_KEY + // (preferred, disambiguates from the inbound-auth key) and + // WEBCLAW_API_KEY as a fallback when there's no inbound key + // configured (backwards compat with MCP / CLI conventions). + if let Some(cloud) = build_cloud_client(inbound_api_key.as_deref()) { + info!( + base = cloud.base_url(), + "cloud fallback enabled — antibot-protected sites will escalate via api.webclaw.io" + ); + fetch = fetch.with_cloud(cloud); + } + Ok(Self { inner: Arc::new(Inner { fetch: Arc::new(fetch), - api_key, + api_key: inbound_api_key, }), }) } @@ -47,3 +82,26 @@ impl AppState { self.inner.api_key.as_deref() } } + +/// Resolve the outbound cloud key. Prefers `WEBCLAW_CLOUD_API_KEY`; +/// falls back to `WEBCLAW_API_KEY` *only* when no inbound key is +/// configured (i.e. open mode — the same env var can't mean two +/// things to one process). +fn build_cloud_client(inbound_api_key: Option<&str>) -> Option<CloudClient> { + let cloud_key = std::env::var("WEBCLAW_CLOUD_API_KEY").ok(); + if let Some(k) = cloud_key.as_deref() + && !k.trim().is_empty() + { + return Some(CloudClient::with_key(k)); + } + // Reuse WEBCLAW_API_KEY only when not also acting as our own + // inbound-auth token — otherwise we'd be telling the operator + // they can't have both. + if inbound_api_key.is_none() + && let Ok(k) = std::env::var("WEBCLAW_API_KEY") + && !k.trim().is_empty() + { + return Some(CloudClient::with_key(k)); + } + None +}
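
Review note: the key-precedence rule in `build_cloud_client` reads the process environment directly, which makes it awkward to unit-test. A minimal sketch of the same precedence as a pure function follows. The names `resolve_cloud_key` and `non_blank` are hypothetical and not part of this diff; the sketch assumes the env values are read by the caller and passed in.

```rust
/// Hypothetical helper (not in the diff): true when the value is
/// neither empty nor whitespace-only.
fn non_blank(v: &&str) -> bool {
    !v.trim().is_empty()
}

/// Hypothetical pure version of the precedence used by
/// `build_cloud_client`. `cloud_env` stands in for the value of
/// WEBCLAW_CLOUD_API_KEY and `api_env` for WEBCLAW_API_KEY.
fn resolve_cloud_key<'a>(
    cloud_env: Option<&'a str>,
    api_env: Option<&'a str>,
    inbound_api_key: Option<&str>,
) -> Option<&'a str> {
    // 1. A non-blank dedicated cloud key always wins.
    if let Some(k) = cloud_env.filter(non_blank) {
        return Some(k);
    }
    // 2. WEBCLAW_API_KEY is reused only in open mode, i.e. when it
    //    is not already serving as the inbound bearer token.
    if inbound_api_key.is_none() {
        return api_env.filter(non_blank);
    }
    // 3. Otherwise cloud fallback stays disabled.
    None
}
```

With a helper of this shape, `build_cloud_client` could read the two variables once and wrap the result in `CloudClient::with_key`, leaving the precedence itself testable without mutating process environment.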