diff --git a/CHANGELOG.md b/CHANGELOG.md index ef2d2f2..4069d54 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,57 +3,6 @@ All notable changes to webclaw are documented here. Format follows [Keep a Changelog](https://keepachangelog.com/). -## [0.5.2] — 2026-04-22 - -### Added -- **`webclaw vertical ` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`. URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1. - -- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick. - -- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set. - -### Changed -- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded 10). Also `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`. - ---- - -## [0.5.1] — 2026-04-22 - -### Added -- **`webclaw_fetch::Fetcher` trait.** Vertical extractors now consume `&dyn Fetcher` instead of `&FetchClient` directly. The trait exposes three methods (`fetch`, `fetch_with_headers`, `cloud`) covering everything extractors need. Callers that already held a `FetchClient` keep working unchanged: `FetchClient` implements `Fetcher`, blanket impls cover `&T` and `Arc`, so `&client` coerces to `&dyn Fetcher` automatically. - - The motivation is the split between OSS (wreq-backed, in-process TLS fingerprinting) and the production API server at api.webclaw.io (which cannot use in-process fingerprinting per the architecture rule, and must delegate HTTP through the Go tls-sidecar). Before this trait, adding vertical routes to the production server would have required importing wreq into its dependency graph, violating the separation. Now the production server can provide its own `TlsSidecarFetcher` implementation and pass it to the same extractor dispatcher the OSS server uses. - - Backwards compatible. No behavior change for CLI, MCP, or OSS self-host. - -### Changed -- All 28 extractor `extract()` signatures migrated from `client: &FetchClient` to `client: &dyn Fetcher`. The dispatcher functions (`extractors::dispatch_by_url`, `extractors::dispatch_by_name`) and the cloud escalation helpers (`cloud::smart_fetch`, `cloud::smart_fetch_html`) follow the same change. Tests and call sites are unchanged because `&FetchClient` auto-coerces. - ---- - -## [0.5.0] — 2026-04-22 - -### Added -- **28 vertical extractors that return typed JSON instead of generic markdown.** New `webclaw_fetch::extractors` module with one extractor per site. Dev: reddit, hackernews, github_repo / github_pr / github_issue / github_release, crates_io, pypi, npm. AI/ML: huggingface_model, huggingface_dataset, arxiv, docker_hub. Writing: dev_to, stackoverflow, youtube_video. Social: linkedin_post, instagram_post, instagram_profile. Ecommerce: shopify_product, shopify_collection, ecommerce_product (generic Schema.org), woocommerce_product, amazon_product, ebay_listing, etsy_listing. Reviews: trustpilot_reviews, substack_post. Each extractor claims a URL pattern via a public `matches()` fn and returns a typed JSON payload with the fields callers actually want (title, price, author, rating, review count, etc.) rather than a markdown blob. -- **`POST /v1/scrape/{vertical}` on `webclaw-server` for explicit vertical routing.** Picks the parser by name, validates the URL plausibly belongs to that vertical, returns the same shape as `POST /v1/scrape` but typed. 23 of 28 verticals also auto-dispatch from a plain `POST /v1/scrape` because their URL shapes are unique enough to claim safely; the remaining 5 (`shopify_product`, `shopify_collection`, `ecommerce_product`, `woocommerce_product`, `substack_post`) use patterns that non-target sites share, so callers opt in via the `{vertical}` route. -- **`GET /v1/extractors` on `webclaw-server`.** Returns the full catalog as `{"extractors": [{"name": "...", "label": "...", "description": "...", "url_patterns": [...]}, ...]}` so clients can build tooling / autocomplete / user-facing docs off a live source. -- **Antibot cloud-escalation for 5 ecommerce + reviews verticals.** Amazon, eBay, Etsy, Trustpilot, and Substack (as HTML fallback) go through `cloud::smart_fetch_html`: try local fetch first; on bot-protection detection (Cloudflare challenge, DataDome, AWS WAF "Verifying your connection", etc.) escalate to `api.webclaw.io/v1/scrape`. Without `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` the extractor returns a typed `CloudError::NotConfigured` with an actionable signup link. With a key set, escalation is automatic. Every extractor stamps a `data_source: "local" | "cloud"` field on the response so callers can tell which path ran. -- **`cloud::synthesize_html` for cloud-bypassed extraction.** `api.webclaw.io/v1/scrape` deliberately does not return raw HTML; it returns a parsed bundle (`structured_data` JSON-LD blocks + `metadata` OG/meta tags + `markdown`). The new helper reassembles that bundle back into a minimal synthetic HTML doc (JSON-LD as `\n"); - } - } - } - - out.push_str("\n"); - - // Markdown body → plaintext in . Extractors that regex over - //
IDs won't hit here, but they won't hit on local cloud - // bypass either. OK to keep minimal. - if let Some(md) = resp.get("markdown").and_then(|v| v.as_str()) { - out.push_str("
");
-        out.push_str(&html_escape_text(md));
-        out.push_str("
\n"); - } - - out.push_str(""); - out -} - -fn html_escape_attr(s: &str) -> String { - s.replace('&', "&") - .replace('"', """) - .replace('<', "<") - .replace('>', ">") -} - -fn html_escape_text(s: &str) -> String { - s.replace('&', "&") - .replace('<', "<") - .replace('>', ">") -} - -async fn parse_cloud_response(resp: reqwest::Response) -> Result { - let status = resp.status(); - if status.is_success() { - return resp - .json() - .await - .map_err(|e| CloudError::ParseFailed(e.to_string())); - } - let body = resp.text().await.unwrap_or_default(); - Err(CloudError::from_status_and_body(status.as_u16(), body)) -} - -// --------------------------------------------------------------------------- -// Detection -// --------------------------------------------------------------------------- - -/// True when a fetched response body is actually a bot-protection -/// challenge page rather than the content the caller asked for. -/// -/// Conservative — only fires on patterns that indicate the *entire* -/// page is a challenge, not embedded CAPTCHAs on a real content page. -pub fn is_bot_protected(html: &str, headers: &HeaderMap) -> bool { - let html_lower = html.to_lowercase(); - - // Cloudflare challenge page. - if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") { - return true; - } - - // Cloudflare "Just a moment" / "Checking your browser" interstitial. - if (html_lower.contains("just a moment") || html_lower.contains("checking your browser")) - && html_lower.contains("cf-spinner") - { - return true; - } - - // Cloudflare Turnstile. Only counts when the page is small — - // legitimate pages embed Turnstile for signup forms etc. - if (html_lower.contains("cf-turnstile") - || html_lower.contains("challenges.cloudflare.com/turnstile")) - && html.len() < 100_000 - { - return true; - } - - // DataDome. - if html_lower.contains("geo.captcha-delivery.com") - || html_lower.contains("captcha-delivery.com/captcha") - { - return true; - } - - // AWS WAF. - if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") { - return true; - } - - // AWS WAF "Verifying your connection" interstitial (used by Trustpilot). - // Distinct from the captcha-branded path above: the challenge page is - // a tiny HTML shell with an `interstitial-spinner` div and no content. - // Gating on html.len() keeps false-positives off long pages that - // happen to mention the phrase in an unrelated context. - if html_lower.contains("interstitial-spinner") - && html_lower.contains("verifying your connection") - && html.len() < 10_000 - { - return true; - } - - // hCaptcha *blocking* page (not just an embedded widget). - if html_lower.contains("hcaptcha.com") - && html_lower.contains("h-captcha") - && html.len() < 50_000 - { - return true; - } - - // Cloudflare via response headers + challenge body. - let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some(); - if has_cf_headers - && (html_lower.contains("just a moment") || html_lower.contains("checking your browser")) - { - return true; - } - - false -} - -/// True when a page likely needs JS rendering — a large HTML document -/// with almost no extractable text + an SPA framework signature. -pub fn needs_js_rendering(word_count: usize, html: &str) -> bool { - let has_scripts = html.contains(" 5_000 && has_scripts { - return true; - } - - // Tier 2: SPA framework markers + low content-to-HTML ratio. - if word_count < 800 && html.len() > 50_000 && has_scripts { - let html_lower = html.to_lowercase(); - let has_spa_marker = html_lower.contains("react-app") - || html_lower.contains("id=\"__next\"") - || html_lower.contains("id=\"root\"") - || html_lower.contains("id=\"app\"") - || html_lower.contains("__next_data__") - || html_lower.contains("nuxt") - || html_lower.contains("ng-app"); - if has_spa_marker { - return true; - } - } - - false -} - -// --------------------------------------------------------------------------- -// Smart-fetch: classic flow for MCP / CLI (returns either an extraction -// or raw cloud JSON) -// --------------------------------------------------------------------------- - -/// Result of [`smart_fetch`]: either a local extraction or the raw -/// cloud API response when we escalated. -pub enum SmartFetchResult { - Local(Box), - Cloud(Value), -} - -/// Try local fetch + extract first. On bot protection or detected -/// JS-render, fall back to `cloud.scrape(...)` with the caller's -/// formats. Returns `Err(String)` so existing call sites that expect -/// stringified errors keep compiling. -/// -/// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed -/// [`CloudError`] so you can render precise UX. -pub async fn smart_fetch( - client: &dyn crate::fetcher::Fetcher, - cloud: Option<&CloudClient>, - url: &str, - include_selectors: &[String], - exclude_selectors: &[String], - only_main_content: bool, - formats: &[&str], -) -> Result { - let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url)) - .await - .map_err(|_| format!("Fetch timed out after 30s for {url}"))? - .map_err(|e| format!("Fetch failed: {e}"))?; - - if is_bot_protected(&fetch_result.html, &fetch_result.headers) { - info!(url, "bot protection detected, falling back to cloud API"); - return cloud_scrape_fallback( - cloud, - url, - include_selectors, - exclude_selectors, - only_main_content, - formats, - ) - .await; - } - - let options = webclaw_core::ExtractionOptions { - include_selectors: include_selectors.to_vec(), - exclude_selectors: exclude_selectors.to_vec(), - only_main_content, - include_raw_html: false, - }; - let extraction = - webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options) - .map_err(|e| format!("Extraction failed: {e}"))?; - - if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) { - info!( - url, - word_count = extraction.metadata.word_count, - html_len = fetch_result.html.len(), - "JS-rendered page detected, falling back to cloud API" - ); - return cloud_scrape_fallback( - cloud, - url, - include_selectors, - exclude_selectors, - only_main_content, - formats, - ) - .await; - } - - Ok(SmartFetchResult::Local(Box::new(extraction))) -} - -async fn cloud_scrape_fallback( - cloud: Option<&CloudClient>, - url: &str, - include_selectors: &[String], - exclude_selectors: &[String], - only_main_content: bool, - formats: &[&str], -) -> Result { - let Some(c) = cloud else { - return Err(CloudError::NotConfigured.to_string()); - }; - let resp = c - .scrape( - url, - formats, - include_selectors, - exclude_selectors, - only_main_content, - ) - .await - .map_err(|e| e.to_string())?; - info!(url, "cloud API fallback successful"); - Ok(SmartFetchResult::Cloud(resp)) -} - -// --------------------------------------------------------------------------- -// Smart-fetch-HTML: for vertical extractors -// --------------------------------------------------------------------------- - -/// Where the HTML ultimately came from — useful for callers that want -/// to track "did we fall back?" for logging or pricing. -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -pub enum FetchSource { - Local, - Cloud, -} - -/// Antibot-aware HTML fetch result. The `html` field is always populated. -pub struct FetchedHtml { - pub html: String, - pub final_url: String, - pub source: FetchSource, -} - -/// Try local fetch; on bot protection, escalate to the cloud's -/// `/v1/scrape` with `formats=["html"]` and return the raw HTML. -/// -/// Designed for the vertical-extractor pattern where the caller has -/// its own parser and just needs bytes. -pub async fn smart_fetch_html( - client: &dyn crate::fetcher::Fetcher, - cloud: Option<&CloudClient>, - url: &str, -) -> Result { - let resp = client - .fetch(url) - .await - .map_err(|e| CloudError::Network(e.to_string()))?; - - if !is_bot_protected(&resp.html, &resp.headers) { - return Ok(FetchedHtml { - html: resp.html, - final_url: resp.url, - source: FetchSource::Local, - }); - } - - let Some(c) = cloud else { - warn!(url, "bot protection detected + no cloud client configured"); - return Err(CloudError::NotConfigured); - }; - debug!(url, "bot protection detected, escalating to cloud"); - let html = c.fetch_html(url).await?; - Ok(FetchedHtml { - html, - final_url: url.to_string(), - source: FetchSource::Cloud, - }) -} - -// --------------------------------------------------------------------------- -// Tests -// --------------------------------------------------------------------------- - -#[cfg(test)] -mod tests { - use super::*; - - fn empty_headers() -> HeaderMap { - HeaderMap::new() - } - - // --- detectors ---------------------------------------------------------- - - #[test] - fn is_bot_protected_detects_cloudflare_challenge() { - let html = "_cf_chl_opt loaded"; - assert!(is_bot_protected(html, &empty_headers())); - } - - #[test] - fn is_bot_protected_detects_turnstile_on_short_page() { - let html = "
"; - assert!(is_bot_protected(html, &empty_headers())); - } - - #[test] - fn is_bot_protected_ignores_turnstile_on_real_content() { - let html = format!( - "{}
", - "lots of real content ".repeat(8_000) - ); - assert!(!is_bot_protected(&html, &empty_headers())); - } - - #[test] - fn is_bot_protected_detects_aws_waf_verifying_connection() { - // The exact shape Trustpilot serves under AWS WAF. - let html = r#"
-
-

Verifying your connection...

"#; - assert!(is_bot_protected(html, &empty_headers())); - } - - #[test] - fn synthesize_html_embeds_jsonld_and_og_tags() { - let resp = json!({ - "url": "https://example.com/p/1", - "metadata": { - "title": "My Product", - "description": "A nice thing.", - "image": "https://cdn.example.com/1.jpg", - "site_name": "Example Shop" - }, - "structured_data": [ - {"@context":"https://schema.org","@type":"Product", - "name":"Widget","offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}} - ], - "markdown": "# Widget\n\nA nice widget." - }); - let html = synthesize_html(&resp); - // OG tags from metadata. - assert!(html.contains(r#""#)); - assert!( - html.contains(r#""#) - ); - // JSON-LD block preserved losslessly. - assert!(html.contains(r#"".repeat(500) - ); - assert!(needs_js_rendering(10, &html)); - } - - #[test] - fn needs_js_rendering_passes_real_article() { - let html = format!( - "{}", - "Real article text ".repeat(5_000) - ); - assert!(!needs_js_rendering(5_000, &html)); - } - - // --- CloudError mapping ------------------------------------------------- - - #[test] - fn cloud_error_maps_401() { - let e = CloudError::from_status_and_body(401, "invalid key".into()); - assert!(matches!(e, CloudError::Unauthorized)); - assert!(e.to_string().contains(KEYS_URL)); - } - - #[test] - fn cloud_error_maps_402() { - let e = CloudError::from_status_and_body(402, "{}".into()); - assert!(matches!(e, CloudError::InsufficientPlan)); - assert!(e.to_string().contains(PRICING_URL)); - } - - #[test] - fn cloud_error_maps_429() { - let e = CloudError::from_status_and_body(429, "slow down".into()); - assert!(matches!(e, CloudError::RateLimited)); - assert!(e.to_string().contains(PRICING_URL)); - } - - #[test] - fn cloud_error_maps_generic_5xx() { - let e = CloudError::from_status_and_body(503, "x".repeat(2000)); - match e { - CloudError::ServerError { status, body } => { - assert_eq!(status, 503); - assert!(body.len() <= 500); - } - _ => panic!("expected ServerError"), - } - } - - #[test] - fn not_configured_error_points_at_signup() { - let msg = CloudError::NotConfigured.to_string(); - assert!(msg.contains(SIGNUP_URL)); - assert!(msg.contains("WEBCLAW_API_KEY")); - } - - // --- CloudClient construction ------------------------------------------ - - #[test] - fn cloud_client_explicit_key_wins_over_env() { - // SAFETY: this test mutates process env. Serial tests only. - // Set env to something, pass an explicit key, explicit should win. - // (We don't actually *call* the API, just check the struct stored - // the right key.) - // rustc std::env::set_var is unsafe in newer toolchains. - unsafe { - std::env::set_var("WEBCLAW_API_KEY", "from-env"); - } - let client = CloudClient::new(Some("from-flag")).expect("client built"); - assert_eq!(client.api_key, "from-flag"); - unsafe { - std::env::remove_var("WEBCLAW_API_KEY"); - } - } - - #[test] - fn cloud_client_none_when_empty() { - unsafe { - std::env::remove_var("WEBCLAW_API_KEY"); - } - assert!(CloudClient::new(None).is_none()); - assert!(CloudClient::new(Some("")).is_none()); - assert!(CloudClient::new(Some(" ")).is_none()); - } - - #[test] - fn cloud_client_base_url_strips_trailing_slash() { - let c = CloudClient::with_key_and_base("k", "https://api.example.com/v1/"); - assert_eq!(c.base_url(), "https://api.example.com/v1"); - } - - #[test] - fn truncate_respects_char_boundaries() { - // Ensure we don't slice inside a multi-byte char. - let s = "a".repeat(10) + "é"; // é is 2 bytes - let out = truncate(&s, 11); - assert_eq!(out.chars().count(), 11); - } -} diff --git a/crates/webclaw-fetch/src/extractors/amazon_product.rs b/crates/webclaw-fetch/src/extractors/amazon_product.rs deleted file mode 100644 index fed6b9f..0000000 --- a/crates/webclaw-fetch/src/extractors/amazon_product.rs +++ /dev/null @@ -1,452 +0,0 @@ -//! Amazon product detail page extractor. -//! -//! Amazon product pages (`/dp/{ASIN}/` on every locale) are -//! inconsistently protected. Sometimes our local TLS fingerprint gets -//! a real HTML page; sometimes we land on a CAPTCHA interstitial; -//! sometimes we land on a real page that for whatever reason ships -//! no Product JSON-LD (Amazon A/B-tests this regularly). So the -//! extractor has a two-stage fallback: -//! -//! 1. Try local fetch + parse. If we got Product JSON-LD back, great: -//! we have everything (title, brand, price, availability, rating). -//! 2. If local fetch worked *but the page has no Product JSON-LD* AND -//! a cloud client is configured, force-escalate to api.webclaw.io. -//! Cloud's render + antibot pipeline reliably surfaces the -//! structured data. Without a cloud client we return whatever we -//! got from local (usually just title via `#productTitle` or OG -//! meta tags). -//! -//! Parsing tries JSON-LD first, DOM regex (`#productTitle`, -//! `#landingImage`) second, OG `` tags third. The OG path -//! matters because the cloud's synthesized HTML ships metadata as -//! OG tags but lacks Amazon's DOM IDs. -//! -//! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/` -//! path. ASINs are a stable Amazon identifier so we extract that as -//! part of the response even when everything else is empty (tells -//! callers the URL was at least recognised). - -use std::sync::OnceLock; - -use regex::Regex; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::cloud::{self, CloudError}; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "amazon_product", - label: "Amazon product", - description: "Returns product detail: title, brand, price, currency, availability, rating, image, ASIN. Requires WEBCLAW_API_KEY — Amazon's antibot means we always go through the cloud.", - url_patterns: &[ - "https://www.amazon.com/dp/{ASIN}", - "https://www.amazon.co.uk/dp/{ASIN}", - "https://www.amazon.de/dp/{ASIN}", - "https://www.amazon.fr/dp/{ASIN}", - "https://www.amazon.it/dp/{ASIN}", - "https://www.amazon.es/dp/{ASIN}", - "https://www.amazon.co.jp/dp/{ASIN}", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if !is_amazon_host(host) { - return false; - } - parse_asin(url).is_some() -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let asin = parse_asin(url) - .ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?; - - let mut fetched = cloud::smart_fetch_html(client, client.cloud(), url) - .await - .map_err(cloud_to_fetch_err)?; - - // Amazon ships Product JSON-LD inconsistently even on non-CAPTCHA - // pages (they A/B-test it). When local fetch succeeded but has no - // Product JSON-LD, force-escalate to the cloud which runs the - // render pipeline and reliably surfaces structured data. No-op - // when cloud isn't configured — we return whatever local gave us. - if fetched.source == cloud::FetchSource::Local - && find_product_jsonld(&fetched.html).is_none() - && let Some(c) = client.cloud() - { - match c.fetch_html(url).await { - Ok(cloud_html) => { - fetched = cloud::FetchedHtml { - html: cloud_html, - final_url: url.to_string(), - source: cloud::FetchSource::Cloud, - }; - } - Err(e) => { - tracing::debug!( - error = %e, - "amazon_product: cloud escalation failed, keeping local" - ); - } - } - } - - let mut data = parse(&fetched.html, url, &asin); - if let Some(obj) = data.as_object_mut() { - obj.insert( - "data_source".into(), - match fetched.source { - cloud::FetchSource::Local => json!("local"), - cloud::FetchSource::Cloud => json!("cloud"), - }, - ); - } - Ok(data) -} - -/// Pure parser. Given HTML (from anywhere — direct, cloud, or a fixture -/// file) and the source URL, extract Amazon product detail. Returns a -/// `Value` rather than a typed struct so callers can pass it through -/// without carrying webclaw_fetch types. -pub fn parse(html: &str, url: &str, asin: &str) -> Value { - let jsonld = find_product_jsonld(html); - // Three-tier title: JSON-LD `name` > Amazon's `#productTitle` span - // (only present on real static HTML) > cloud-synthesized og:title. - let title = jsonld - .as_ref() - .and_then(|v| get_text(v, "name")) - .or_else(|| dom_title(html)) - .or_else(|| og(html, "title")); - let image = jsonld - .as_ref() - .and_then(get_first_image) - .or_else(|| dom_image(html)) - .or_else(|| og(html, "image")); - let brand = jsonld.as_ref().and_then(get_brand); - let description = jsonld - .as_ref() - .and_then(|v| get_text(v, "description")) - .or_else(|| og(html, "description")); - let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating); - let offer = jsonld.as_ref().and_then(first_offer); - - let sku = jsonld.as_ref().and_then(|v| get_text(v, "sku")); - let mpn = jsonld.as_ref().and_then(|v| get_text(v, "mpn")); - - json!({ - "url": url, - "asin": asin, - "title": title, - "brand": brand, - "description": description, - "image": image, - "price": offer.as_ref().and_then(|o| get_text(o, "price")), - "currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")), - "availability": offer.as_ref().and_then(|o| { - get_text(o, "availability").map(|s| - s.replace("http://schema.org/", "").replace("https://schema.org/", "")) - }), - "condition": offer.as_ref().and_then(|o| { - get_text(o, "itemCondition").map(|s| - s.replace("http://schema.org/", "").replace("https://schema.org/", "")) - }), - "sku": sku, - "mpn": mpn, - "aggregate_rating": aggregate_rating, - }) -} - -// --------------------------------------------------------------------------- -// URL helpers -// --------------------------------------------------------------------------- - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -fn is_amazon_host(host: &str) -> bool { - host.starts_with("www.amazon.") || host.starts_with("amazon.") -} - -/// Pull a 10-char ASIN out of any recognised Amazon URL shape: -/// - /dp/{ASIN} -/// - /gp/product/{ASIN} -/// - /product/{ASIN} -/// - /exec/obidos/ASIN/{ASIN} -fn parse_asin(url: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new(r"/(?:dp|gp/product|product|ASIN)/([A-Z0-9]{10})(?:[/?#]|$)").unwrap() - }); - re.captures(url) - .and_then(|c| c.get(1)) - .map(|m| m.as_str().to_string()) -} - -// --------------------------------------------------------------------------- -// JSON-LD walkers — light reuse of ecommerce_product's style -// --------------------------------------------------------------------------- - -fn find_product_jsonld(html: &str) -> Option { - let blocks = webclaw_core::structured_data::extract_json_ld(html); - for b in blocks { - if let Some(found) = find_product_in(&b) { - return Some(found); - } - } - None -} - -fn find_product_in(v: &Value) -> Option { - if is_product_type(v) { - return Some(v.clone()); - } - if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) { - for item in graph { - if let Some(found) = find_product_in(item) { - return Some(found); - } - } - } - if let Some(arr) = v.as_array() { - for item in arr { - if let Some(found) = find_product_in(item) { - return Some(found); - } - } - } - None -} - -fn is_product_type(v: &Value) -> bool { - let Some(t) = v.get("@type") else { - return false; - }; - let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct"); - match t { - Value::String(s) => is_prod(s), - Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)), - _ => false, - } -} - -fn get_text(v: &Value, key: &str) -> Option { - v.get(key).and_then(|x| match x { - Value::String(s) => Some(s.clone()), - Value::Number(n) => Some(n.to_string()), - _ => None, - }) -} - -fn get_brand(v: &Value) -> Option { - let brand = v.get("brand")?; - if let Some(s) = brand.as_str() { - return Some(s.to_string()); - } - brand - .as_object() - .and_then(|o| o.get("name")) - .and_then(|n| n.as_str()) - .map(String::from) -} - -fn get_first_image(v: &Value) -> Option { - match v.get("image")? { - Value::String(s) => Some(s.clone()), - Value::Array(arr) => arr.iter().find_map(|x| match x { - Value::String(s) => Some(s.clone()), - Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from), - _ => None, - }), - Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from), - _ => None, - } -} - -fn first_offer(v: &Value) -> Option { - let offers = v.get("offers")?; - match offers { - Value::Array(arr) => arr.first().cloned(), - Value::Object(_) => Some(offers.clone()), - _ => None, - } -} - -fn get_aggregate_rating(v: &Value) -> Option { - let r = v.get("aggregateRating")?; - Some(json!({ - "rating_value": get_text(r, "ratingValue"), - "review_count": get_text(r, "reviewCount"), - "best_rating": get_text(r, "bestRating"), - })) -} - -// --------------------------------------------------------------------------- -// DOM fallbacks — cheap regex for the two fields most likely to be -// missing from JSON-LD on Amazon. -// --------------------------------------------------------------------------- - -fn dom_title(html: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| Regex::new(r#"(?s)id="productTitle"[^>]*>([^<]+)<"#).unwrap()); - re.captures(html) - .and_then(|c| c.get(1)) - .map(|m| m.as_str().trim().to_string()) -} - -fn dom_image(html: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| Regex::new(r#"id="landingImage"[^>]+src="([^"]+)""#).unwrap()); - re.captures(html) - .and_then(|c| c.get(1)) - .map(|m| m.as_str().to_string()) -} - -/// OG meta tag lookup. Cloud-synthesized HTML ships these even when -/// JSON-LD and Amazon-DOM-IDs are both absent, so they're the last -/// line of defence for `title`, `image`, `description`. -fn og(html: &str, prop: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new(r#"(?i)]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() - }); - for c in re.captures_iter(html) { - if c.get(1).is_some_and(|m| m.as_str() == prop) { - return c.get(2).map(|m| html_unescape(m.as_str())); - } - } - None -} - -/// Undo the synthesize_html attribute escaping for the few entities it -/// emits. Keeps us off a heavier HTML-entity dep. -fn html_unescape(s: &str) -> String { - s.replace(""", "\"") - .replace("&", "&") - .replace("<", "<") - .replace(">", ">") -} - -fn cloud_to_fetch_err(e: CloudError) -> FetchError { - FetchError::Build(e.to_string()) -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_multi_locale() { - assert!(matches("https://www.amazon.com/dp/B0CHX1W1XY")); - assert!(matches("https://www.amazon.co.uk/dp/B0CHX1W1XY/")); - assert!(matches("https://www.amazon.de/dp/B0CHX1W1XY?psc=1")); - assert!(matches( - "https://www.amazon.com/gp/product/B0CHX1W1XY/ref=foo" - )); - } - - #[test] - fn rejects_non_product_urls() { - assert!(!matches("https://www.amazon.com/")); - assert!(!matches("https://www.amazon.com/gp/cart")); - assert!(!matches("https://example.com/dp/B0CHX1W1XY")); - } - - #[test] - fn parse_asin_extracts_from_multiple_shapes() { - assert_eq!( - parse_asin("https://www.amazon.com/dp/B0CHX1W1XY"), - Some("B0CHX1W1XY".into()) - ); - assert_eq!( - parse_asin("https://www.amazon.com/dp/B0CHX1W1XY/"), - Some("B0CHX1W1XY".into()) - ); - assert_eq!( - parse_asin("https://www.amazon.com/dp/B0CHX1W1XY?psc=1"), - Some("B0CHX1W1XY".into()) - ); - assert_eq!( - parse_asin("https://www.amazon.com/gp/product/B0CHX1W1XY/ref=bar"), - Some("B0CHX1W1XY".into()) - ); - assert_eq!( - parse_asin("https://www.amazon.com/exec/obidos/ASIN/B0CHX1W1XY/baz"), - Some("B0CHX1W1XY".into()) - ); - assert_eq!(parse_asin("https://www.amazon.com/"), None); - } - - #[test] - fn parse_extracts_from_fixture_jsonld() { - // Minimal Amazon-style fixture with a Product JSON-LD block. - let html = r##" - - -"##; - let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY"); - assert_eq!(v["asin"], "B0CHX1W1XY"); - assert_eq!(v["title"], "ACME Widget"); - assert_eq!(v["brand"], "ACME"); - assert_eq!(v["price"], "19.99"); - assert_eq!(v["currency"], "USD"); - assert_eq!(v["availability"], "InStock"); - assert_eq!(v["aggregate_rating"]["rating_value"], "4.6"); - assert_eq!(v["aggregate_rating"]["review_count"], "1234"); - } - - #[test] - fn parse_falls_back_to_dom_when_jsonld_missing_fields() { - let html = r#" - -Fallback Title - - -"#; - let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY"); - assert_eq!(v["title"], "Fallback Title"); - assert_eq!( - v["image"], - "https://m.media-amazon.com/images/I/fallback.jpg" - ); - } - - #[test] - fn parse_falls_back_to_og_meta_when_no_jsonld_no_dom() { - // Shape we see from the cloud synthesize_html path: OG tags - // only, no JSON-LD, no Amazon DOM IDs. - let html = r##" - - - -"##; - let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY"); - assert_eq!(v["title"], "Cloud-sourced MacBook Pro"); - assert_eq!(v["image"], "https://m.media-amazon.com/images/I/cloud.jpg"); - assert_eq!(v["description"], "Via api.webclaw.io"); - } - - #[test] - fn og_unescape_handles_quot_entity() { - let html = r#""#; - assert_eq!( - og(html, "title").as_deref(), - Some(r#"Apple "M2 Pro" Laptop"#) - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/arxiv.rs b/crates/webclaw-fetch/src/extractors/arxiv.rs deleted file mode 100644 index c2b85c0..0000000 --- a/crates/webclaw-fetch/src/extractors/arxiv.rs +++ /dev/null @@ -1,314 +0,0 @@ -//! ArXiv paper structured extractor. -//! -//! Uses the public ArXiv API at `export.arxiv.org/api/query?id_list={id}` -//! which returns Atom XML. We parse just enough to surface title, authors, -//! abstract, categories, and the canonical PDF link. No HTML scraping -//! required and no auth. - -use quick_xml::Reader; -use quick_xml::events::Event; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "arxiv", - label: "ArXiv paper", - description: "Returns paper metadata: title, authors, abstract, categories, primary category, PDF URL.", - url_patterns: &[ - "https://arxiv.org/abs/{id}", - "https://arxiv.org/abs/{id}v{n}", - "https://arxiv.org/pdf/{id}", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if host != "arxiv.org" && host != "www.arxiv.org" { - return false; - } - url.contains("/abs/") || url.contains("/pdf/") -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let id = parse_id(url) - .ok_or_else(|| FetchError::Build(format!("arxiv: cannot parse id from '{url}'")))?; - - let api_url = format!("https://export.arxiv.org/api/query?id_list={id}"); - let resp = client.fetch(&api_url).await?; - if resp.status != 200 { - return Err(FetchError::Build(format!( - "arxiv api returned status {}", - resp.status - ))); - } - - let entry = parse_atom_entry(&resp.html) - .ok_or_else(|| FetchError::BodyDecode("arxiv: no in response".into()))?; - if entry.title.is_none() && entry.summary.is_none() { - return Err(FetchError::BodyDecode(format!( - "arxiv: paper '{id}' returned empty entry (likely withdrawn or invalid id)" - ))); - } - - Ok(json!({ - "url": url, - "id": id, - "arxiv_id": entry.id, - "title": entry.title, - "authors": entry.authors, - "abstract": entry.summary.map(|s| collapse_whitespace(&s)), - "published": entry.published, - "updated": entry.updated, - "primary_category": entry.primary_category, - "categories": entry.categories, - "doi": entry.doi, - "comment": entry.comment, - "pdf_url": entry.pdf_url, - "abs_url": entry.abs_url, - })) -} - -// --------------------------------------------------------------------------- -// Helpers -// --------------------------------------------------------------------------- - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -/// Parse an arxiv id from a URL. Strips the version suffix (`v2`, `v3`) -/// and the `.pdf` extension when present. -fn parse_id(url: &str) -> Option { - let after = url - .split("/abs/") - .nth(1) - .or_else(|| url.split("/pdf/").nth(1))?; - let stripped = after - .split(['?', '#']) - .next()? - .trim_end_matches('/') - .trim_end_matches(".pdf"); - // Strip optional version suffix, e.g. "2401.12345v2" → "2401.12345" - let no_version = match stripped.rfind('v') { - Some(i) if stripped[i + 1..].chars().all(|c| c.is_ascii_digit()) => &stripped[..i], - _ => stripped, - }; - if no_version.is_empty() { - None - } else { - Some(no_version.to_string()) - } -} - -fn collapse_whitespace(s: &str) -> String { - s.split_whitespace().collect::>().join(" ") -} - -#[derive(Default)] -struct AtomEntry { - id: Option, - title: Option, - summary: Option, - published: Option, - updated: Option, - primary_category: Option, - categories: Vec, - authors: Vec, - doi: Option, - comment: Option, - pdf_url: Option, - abs_url: Option, -} - -/// Parse the first `` block of an ArXiv Atom feed. -fn parse_atom_entry(xml: &str) -> Option { - let mut reader = Reader::from_str(xml); - let mut buf = Vec::new(); - - // States - let mut in_entry = false; - let mut current: Option<&'static str> = None; - let mut in_author = false; - let mut in_author_name = false; - let mut entry = AtomEntry::default(); - - loop { - match reader.read_event_into(&mut buf) { - Ok(Event::Start(ref e)) => { - let local = e.local_name(); - match local.as_ref() { - b"entry" => in_entry = true, - b"id" if in_entry && !in_author => current = Some("id"), - b"title" if in_entry => current = Some("title"), - b"summary" if in_entry => current = Some("summary"), - b"published" if in_entry => current = Some("published"), - b"updated" if in_entry => current = Some("updated"), - b"author" if in_entry => in_author = true, - b"name" if in_author => { - in_author_name = true; - current = Some("author_name"); - } - b"category" if in_entry => { - // primary_category is namespaced (arxiv:primary_category) - // category is plain. quick-xml gives us local-name only, - // so we treat both as categories and take the first as - // primary. - for attr in e.attributes().flatten() { - if attr.key.as_ref() == b"term" - && let Ok(v) = attr.unescape_value() - { - let term = v.to_string(); - if entry.primary_category.is_none() { - entry.primary_category = Some(term.clone()); - } - entry.categories.push(term); - } - } - } - b"link" if in_entry => { - let mut href = None; - let mut rel = None; - let mut typ = None; - for attr in e.attributes().flatten() { - match attr.key.as_ref() { - b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()), - b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()), - b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()), - _ => {} - } - } - if let Some(h) = href { - if typ.as_deref() == Some("application/pdf") { - entry.pdf_url = Some(h.clone()); - } - if rel.as_deref() == Some("alternate") { - entry.abs_url = Some(h); - } - } - } - _ => current = None, - } - } - Ok(Event::Empty(ref e)) => { - // Self-closing tags (). Same handling as Start. - let local = e.local_name(); - if (local.as_ref() == b"link" || local.as_ref() == b"category") && in_entry { - let mut href = None; - let mut rel = None; - let mut typ = None; - let mut term = None; - for attr in e.attributes().flatten() { - match attr.key.as_ref() { - b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()), - b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()), - b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()), - b"term" => term = attr.unescape_value().ok().map(|s| s.to_string()), - _ => {} - } - } - if let Some(t) = term { - if entry.primary_category.is_none() { - entry.primary_category = Some(t.clone()); - } - entry.categories.push(t); - } - if let Some(h) = href { - if typ.as_deref() == Some("application/pdf") { - entry.pdf_url = Some(h.clone()); - } - if rel.as_deref() == Some("alternate") { - entry.abs_url = Some(h); - } - } - } - } - Ok(Event::Text(ref e)) => { - if let (Some(field), Ok(text)) = (current, e.unescape()) { - let text = text.to_string(); - match field { - "id" => entry.id = Some(text.trim().to_string()), - "title" => entry.title = append_text(entry.title.take(), &text), - "summary" => entry.summary = append_text(entry.summary.take(), &text), - "published" => entry.published = Some(text.trim().to_string()), - "updated" => entry.updated = Some(text.trim().to_string()), - "author_name" => entry.authors.push(text.trim().to_string()), - _ => {} - } - } - } - Ok(Event::End(ref e)) => { - let local = e.local_name(); - match local.as_ref() { - b"entry" => break, - b"author" => in_author = false, - b"name" => in_author_name = false, - _ => {} - } - if !in_author_name { - current = None; - } - } - Ok(Event::Eof) => break, - Err(_) => return None, - _ => {} - } - buf.clear(); - } - - if in_entry { Some(entry) } else { None } -} - -/// Concatenate text fragments (long fields can be split across multiple -/// text events if they contain entities or CDATA). -fn append_text(prev: Option, next: &str) -> Option { - match prev { - Some(mut s) => { - s.push_str(next); - Some(s) - } - None => Some(next.to_string()), - } -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_arxiv_urls() { - assert!(matches("https://arxiv.org/abs/2401.12345")); - assert!(matches("https://arxiv.org/abs/2401.12345v2")); - assert!(matches("https://arxiv.org/pdf/2401.12345.pdf")); - assert!(!matches("https://arxiv.org/")); - assert!(!matches("https://example.com/abs/foo")); - } - - #[test] - fn parse_id_strips_version_and_extension() { - assert_eq!( - parse_id("https://arxiv.org/abs/2401.12345"), - Some("2401.12345".into()) - ); - assert_eq!( - parse_id("https://arxiv.org/abs/2401.12345v3"), - Some("2401.12345".into()) - ); - assert_eq!( - parse_id("https://arxiv.org/pdf/2401.12345v2.pdf"), - Some("2401.12345".into()) - ); - } - - #[test] - fn collapse_whitespace_handles_newlines_and_tabs() { - assert_eq!(collapse_whitespace("a b\n\tc "), "a b c"); - } -} diff --git a/crates/webclaw-fetch/src/extractors/crates_io.rs b/crates/webclaw-fetch/src/extractors/crates_io.rs deleted file mode 100644 index 719579f..0000000 --- a/crates/webclaw-fetch/src/extractors/crates_io.rs +++ /dev/null @@ -1,168 +0,0 @@ -//! crates.io structured extractor. -//! -//! Uses the public JSON API at `crates.io/api/v1/crates/{name}`. No -//! auth, no rate limit at normal usage. The response includes both -//! the crate metadata and the full version list, which we summarize -//! down to a count + latest release info to keep the payload small. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "crates_io", - label: "crates.io package", - description: "Returns crate metadata: latest version, dependencies, downloads, license, repository.", - url_patterns: &[ - "https://crates.io/crates/{name}", - "https://crates.io/crates/{name}/{version}", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if host != "crates.io" && host != "www.crates.io" { - return false; - } - url.contains("/crates/") -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let name = parse_name(url) - .ok_or_else(|| FetchError::Build(format!("crates.io: cannot parse name from '{url}'")))?; - - let api_url = format!("https://crates.io/api/v1/crates/{name}"); - let resp = client.fetch(&api_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "crates.io: crate '{name}' not found" - ))); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "crates.io api returned status {}", - resp.status - ))); - } - - let body: CratesResponse = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("crates.io parse: {e}")))?; - - let c = body.crate_; - let latest_version = body - .versions - .iter() - .find(|v| !v.yanked.unwrap_or(false)) - .or_else(|| body.versions.first()); - - Ok(json!({ - "url": url, - "name": c.id, - "description": c.description, - "homepage": c.homepage, - "documentation": c.documentation, - "repository": c.repository, - "max_stable_version": c.max_stable_version, - "max_version": c.max_version, - "newest_version": c.newest_version, - "downloads": c.downloads, - "recent_downloads": c.recent_downloads, - "categories": c.categories, - "keywords": c.keywords, - "release_count": body.versions.len(), - "latest_release_date": latest_version.and_then(|v| v.created_at.clone()), - "latest_license": latest_version.and_then(|v| v.license.clone()), - "latest_rust_version": latest_version.and_then(|v| v.rust_version.clone()), - "latest_yanked": latest_version.and_then(|v| v.yanked), - "created_at": c.created_at, - "updated_at": c.updated_at, - })) -} - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -fn parse_name(url: &str) -> Option { - let after = url.split("/crates/").nth(1)?; - let stripped = after.split(['?', '#']).next()?.trim_end_matches('/'); - let first = stripped.split('/').find(|s| !s.is_empty())?; - Some(first.to_string()) -} - -// --------------------------------------------------------------------------- -// crates.io API types -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct CratesResponse { - #[serde(rename = "crate")] - crate_: CrateInfo, - #[serde(default)] - versions: Vec, -} - -#[derive(Deserialize)] -struct CrateInfo { - id: Option, - description: Option, - homepage: Option, - documentation: Option, - repository: Option, - max_stable_version: Option, - max_version: Option, - newest_version: Option, - downloads: Option, - recent_downloads: Option, - #[serde(default)] - categories: Vec, - #[serde(default)] - keywords: Vec, - created_at: Option, - updated_at: Option, -} - -#[derive(Deserialize)] -struct VersionInfo { - license: Option, - rust_version: Option, - yanked: Option, - created_at: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_crate_pages() { - assert!(matches("https://crates.io/crates/serde")); - assert!(matches("https://crates.io/crates/tokio/1.45.0")); - assert!(!matches("https://crates.io/")); - assert!(!matches("https://example.com/crates/foo")); - } - - #[test] - fn parse_name_handles_versioned_urls() { - assert_eq!( - parse_name("https://crates.io/crates/serde"), - Some("serde".into()) - ); - assert_eq!( - parse_name("https://crates.io/crates/tokio/1.45.0"), - Some("tokio".into()) - ); - assert_eq!( - parse_name("https://crates.io/crates/scraper/?foo=bar"), - Some("scraper".into()) - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/dev_to.rs b/crates/webclaw-fetch/src/extractors/dev_to.rs deleted file mode 100644 index 86199d8..0000000 --- a/crates/webclaw-fetch/src/extractors/dev_to.rs +++ /dev/null @@ -1,188 +0,0 @@ -//! dev.to article structured extractor. -//! -//! `dev.to/api/articles/{username}/{slug}` returns the full article body, -//! tags, reaction count, comment count, and reading time. Anonymous -//! access works fine for published posts. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "dev_to", - label: "dev.to article", - description: "Returns article metadata + body: title, body markdown, tags, reactions, comments, reading time.", - url_patterns: &["https://dev.to/{username}/{slug}"], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if host != "dev.to" && host != "www.dev.to" { - return false; - } - let path = url - .split("://") - .nth(1) - .and_then(|s| s.split_once('/')) - .map(|(_, p)| p) - .unwrap_or(""); - let stripped = path - .split(['?', '#']) - .next() - .unwrap_or("") - .trim_end_matches('/'); - let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); - // Need exactly /{username}/{slug}, with username starting with non-reserved. - segs.len() == 2 && !RESERVED_FIRST_SEGS.contains(&segs[0]) -} - -const RESERVED_FIRST_SEGS: &[&str] = &[ - "api", - "tags", - "search", - "settings", - "enter", - "signup", - "about", - "code-of-conduct", - "privacy", - "terms", - "contact", - "sponsorships", - "sponsors", - "shop", - "videos", - "listings", - "podcasts", - "p", - "t", -]; - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let (username, slug) = parse_username_slug(url).ok_or_else(|| { - FetchError::Build(format!("dev_to: cannot parse username/slug from '{url}'")) - })?; - - let api_url = format!("https://dev.to/api/articles/{username}/{slug}"); - let resp = client.fetch(&api_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "dev_to: article '{username}/{slug}' not found" - ))); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "dev.to api returned status {}", - resp.status - ))); - } - - let a: Article = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("dev.to parse: {e}")))?; - - Ok(json!({ - "url": url, - "id": a.id, - "title": a.title, - "description": a.description, - "body_markdown": a.body_markdown, - "url_canonical": a.canonical_url, - "published_at": a.published_at, - "edited_at": a.edited_at, - "reading_time_min": a.reading_time_minutes, - "tags": a.tag_list, - "positive_reactions": a.positive_reactions_count, - "public_reactions": a.public_reactions_count, - "comments_count": a.comments_count, - "page_views_count": a.page_views_count, - "cover_image": a.cover_image, - "author": json!({ - "username": a.user.as_ref().and_then(|u| u.username.clone()), - "name": a.user.as_ref().and_then(|u| u.name.clone()), - "twitter": a.user.as_ref().and_then(|u| u.twitter_username.clone()), - "github": a.user.as_ref().and_then(|u| u.github_username.clone()), - "website": a.user.as_ref().and_then(|u| u.website_url.clone()), - }), - })) -} - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -fn parse_username_slug(url: &str) -> Option<(String, String)> { - let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; - let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); - let mut segs = stripped.split('/').filter(|s| !s.is_empty()); - let username = segs.next()?; - let slug = segs.next()?; - Some((username.to_string(), slug.to_string())) -} - -// --------------------------------------------------------------------------- -// dev.to API types -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct Article { - id: Option, - title: Option, - description: Option, - body_markdown: Option, - canonical_url: Option, - published_at: Option, - edited_at: Option, - reading_time_minutes: Option, - tag_list: Option, // string OR array depending on endpoint - positive_reactions_count: Option, - public_reactions_count: Option, - comments_count: Option, - page_views_count: Option, - cover_image: Option, - user: Option, -} - -#[derive(Deserialize)] -struct UserRef { - username: Option, - name: Option, - twitter_username: Option, - github_username: Option, - website_url: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_article_urls() { - assert!(matches("https://dev.to/ben/welcome-thread")); - assert!(matches("https://dev.to/0xmassi/some-post-1abc")); - assert!(!matches("https://dev.to/")); - assert!(!matches("https://dev.to/api/articles/foo/bar")); - assert!(!matches("https://dev.to/tags/rust")); - assert!(!matches("https://dev.to/ben")); // user profile, not article - assert!(!matches("https://example.com/ben/post")); - } - - #[test] - fn parse_pulls_username_and_slug() { - assert_eq!( - parse_username_slug("https://dev.to/ben/welcome-thread"), - Some(("ben".into(), "welcome-thread".into())) - ); - assert_eq!( - parse_username_slug("https://dev.to/0xmassi/some-post-1abc/?foo=bar"), - Some(("0xmassi".into(), "some-post-1abc".into())) - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/docker_hub.rs b/crates/webclaw-fetch/src/extractors/docker_hub.rs deleted file mode 100644 index bce9315..0000000 --- a/crates/webclaw-fetch/src/extractors/docker_hub.rs +++ /dev/null @@ -1,150 +0,0 @@ -//! Docker Hub repository structured extractor. -//! -//! Uses the v2 JSON API at `hub.docker.com/v2/repositories/{namespace}/{name}`. -//! Anonymous access is allowed for public images. The official-image -//! shorthand (e.g. `nginx`, `redis`) is normalized to `library/{name}`. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "docker_hub", - label: "Docker Hub repository", - description: "Returns image metadata: pull count, star count, last_updated, official flag, description.", - url_patterns: &[ - "https://hub.docker.com/_/{name}", - "https://hub.docker.com/r/{namespace}/{name}", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if host != "hub.docker.com" { - return false; - } - url.contains("/_/") || url.contains("/r/") -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let (namespace, name) = parse_repo(url) - .ok_or_else(|| FetchError::Build(format!("docker_hub: cannot parse repo from '{url}'")))?; - - let api_url = format!("https://hub.docker.com/v2/repositories/{namespace}/{name}"); - let resp = client.fetch(&api_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "docker_hub: repo '{namespace}/{name}' not found" - ))); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "docker_hub api returned status {}", - resp.status - ))); - } - - let r: RepoResponse = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("docker_hub parse: {e}")))?; - - Ok(json!({ - "url": url, - "namespace": r.namespace, - "name": r.name, - "full_name": format!("{namespace}/{name}"), - "pull_count": r.pull_count, - "star_count": r.star_count, - "description": r.description, - "full_description": r.full_description, - "last_updated": r.last_updated, - "date_registered": r.date_registered, - "is_official": namespace == "library", - "is_private": r.is_private, - "status_description":r.status_description, - "categories": r.categories, - })) -} - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -/// Parse `(namespace, name)` from a Docker Hub URL. The official-image -/// shorthand `/_/nginx` maps to `(library, nginx)`. Personal repos -/// `/r/foo/bar` map to `(foo, bar)`. -fn parse_repo(url: &str) -> Option<(String, String)> { - if let Some(after) = url.split("/_/").nth(1) { - let stripped = after.split(['?', '#']).next()?.trim_end_matches('/'); - let name = stripped.split('/').next().filter(|s| !s.is_empty())?; - return Some(("library".into(), name.to_string())); - } - let after = url.split("/r/").nth(1)?; - let stripped = after.split(['?', '#']).next()?.trim_end_matches('/'); - let mut segs = stripped.split('/').filter(|s| !s.is_empty()); - let ns = segs.next()?; - let nm = segs.next()?; - Some((ns.to_string(), nm.to_string())) -} - -#[derive(Deserialize)] -struct RepoResponse { - namespace: Option, - name: Option, - pull_count: Option, - star_count: Option, - description: Option, - full_description: Option, - last_updated: Option, - date_registered: Option, - is_private: Option, - status_description: Option, - #[serde(default)] - categories: Vec, -} - -#[derive(Deserialize, serde::Serialize)] -struct DockerCategory { - name: Option, - slug: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_docker_urls() { - assert!(matches("https://hub.docker.com/_/nginx")); - assert!(matches("https://hub.docker.com/r/grafana/grafana")); - assert!(!matches("https://hub.docker.com/")); - assert!(!matches("https://example.com/_/nginx")); - } - - #[test] - fn parse_repo_handles_official_and_personal() { - assert_eq!( - parse_repo("https://hub.docker.com/_/nginx"), - Some(("library".into(), "nginx".into())) - ); - assert_eq!( - parse_repo("https://hub.docker.com/_/nginx/tags"), - Some(("library".into(), "nginx".into())) - ); - assert_eq!( - parse_repo("https://hub.docker.com/r/grafana/grafana"), - Some(("grafana".into(), "grafana".into())) - ); - assert_eq!( - parse_repo("https://hub.docker.com/r/grafana/grafana/?foo=bar"), - Some(("grafana".into(), "grafana".into())) - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/ebay_listing.rs b/crates/webclaw-fetch/src/extractors/ebay_listing.rs deleted file mode 100644 index dbc85ab..0000000 --- a/crates/webclaw-fetch/src/extractors/ebay_listing.rs +++ /dev/null @@ -1,337 +0,0 @@ -//! eBay listing extractor. -//! -//! eBay item pages at `ebay.com/itm/{id}` and international variants -//! usually ship a `Product` JSON-LD block with title, price, currency, -//! condition, and an `AggregateOffer` when bidding. eBay applies -//! Cloudflare + custom WAF selectively — some item IDs return normal -//! HTML to the Firefox profile, others 403 / get the "Pardon our -//! interruption" page. We route through `cloud::smart_fetch_html` so -//! both paths resolve to the same parser. - -use std::sync::OnceLock; - -use regex::Regex; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::cloud::{self, CloudError}; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "ebay_listing", - label: "eBay listing", - description: "Returns item title, price, currency, condition, seller, shipping, and bid info. Heavy listings may need WEBCLAW_API_KEY for antibot.", - url_patterns: &[ - "https://www.ebay.com/itm/{id}", - "https://www.ebay.co.uk/itm/{id}", - "https://www.ebay.de/itm/{id}", - "https://www.ebay.fr/itm/{id}", - "https://www.ebay.it/itm/{id}", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if !is_ebay_host(host) { - return false; - } - parse_item_id(url).is_some() -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let item_id = parse_item_id(url) - .ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?; - - let fetched = cloud::smart_fetch_html(client, client.cloud(), url) - .await - .map_err(cloud_to_fetch_err)?; - - let mut data = parse(&fetched.html, url, &item_id); - if let Some(obj) = data.as_object_mut() { - obj.insert( - "data_source".into(), - match fetched.source { - cloud::FetchSource::Local => json!("local"), - cloud::FetchSource::Cloud => json!("cloud"), - }, - ); - } - Ok(data) -} - -pub fn parse(html: &str, url: &str, item_id: &str) -> Value { - let jsonld = find_product_jsonld(html); - let title = jsonld - .as_ref() - .and_then(|v| get_text(v, "name")) - .or_else(|| og(html, "title")); - let image = jsonld - .as_ref() - .and_then(get_first_image) - .or_else(|| og(html, "image")); - let brand = jsonld.as_ref().and_then(get_brand); - let description = jsonld - .as_ref() - .and_then(|v| get_text(v, "description")) - .or_else(|| og(html, "description")); - let offer = jsonld.as_ref().and_then(first_offer); - - // eBay's AggregateOffer uses lowPrice/highPrice. Offer uses price. - let (low_price, high_price, single_price) = match offer.as_ref() { - Some(o) => ( - get_text(o, "lowPrice"), - get_text(o, "highPrice"), - get_text(o, "price"), - ), - None => (None, None, None), - }; - let offer_count = offer.as_ref().and_then(|o| get_text(o, "offerCount")); - - let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating); - - json!({ - "url": url, - "item_id": item_id, - "title": title, - "brand": brand, - "description": description, - "image": image, - "price": single_price, - "low_price": low_price, - "high_price": high_price, - "offer_count": offer_count, - "currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")), - "availability": offer.as_ref().and_then(|o| { - get_text(o, "availability").map(|s| - s.replace("http://schema.org/", "").replace("https://schema.org/", "")) - }), - "condition": offer.as_ref().and_then(|o| { - get_text(o, "itemCondition").map(|s| - s.replace("http://schema.org/", "").replace("https://schema.org/", "")) - }), - "seller": offer.as_ref().and_then(|o| - o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from)), - "aggregate_rating": aggregate_rating, - }) -} - -// --------------------------------------------------------------------------- -// URL helpers -// --------------------------------------------------------------------------- - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -fn is_ebay_host(host: &str) -> bool { - host.starts_with("www.ebay.") || host.starts_with("ebay.") -} - -/// Pull the numeric item id out of `/itm/{id}` or `/itm/{slug}/{id}` -/// URLs. IDs are 10-15 digits today, but we accept any all-digit -/// trailing segment so the extractor stays forward-compatible. -fn parse_item_id(url: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - // /itm/(optional-slug/)?(digits)([/?#]|end) - Regex::new(r"/itm/(?:[^/]+/)?(\d{8,})(?:[/?#]|$)").unwrap() - }); - re.captures(url) - .and_then(|c| c.get(1)) - .map(|m| m.as_str().to_string()) -} - -// --------------------------------------------------------------------------- -// JSON-LD walkers -// --------------------------------------------------------------------------- - -fn find_product_jsonld(html: &str) -> Option { - let blocks = webclaw_core::structured_data::extract_json_ld(html); - for b in blocks { - if let Some(found) = find_product_in(&b) { - return Some(found); - } - } - None -} - -fn find_product_in(v: &Value) -> Option { - if is_product_type(v) { - return Some(v.clone()); - } - if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) { - for item in graph { - if let Some(found) = find_product_in(item) { - return Some(found); - } - } - } - if let Some(arr) = v.as_array() { - for item in arr { - if let Some(found) = find_product_in(item) { - return Some(found); - } - } - } - None -} - -fn is_product_type(v: &Value) -> bool { - let Some(t) = v.get("@type") else { - return false; - }; - let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct"); - match t { - Value::String(s) => is_prod(s), - Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)), - _ => false, - } -} - -fn get_text(v: &Value, key: &str) -> Option { - v.get(key).and_then(|x| match x { - Value::String(s) => Some(s.clone()), - Value::Number(n) => Some(n.to_string()), - _ => None, - }) -} - -fn get_brand(v: &Value) -> Option { - let brand = v.get("brand")?; - if let Some(s) = brand.as_str() { - return Some(s.to_string()); - } - brand - .as_object() - .and_then(|o| o.get("name")) - .and_then(|n| n.as_str()) - .map(String::from) -} - -fn get_first_image(v: &Value) -> Option { - match v.get("image")? { - Value::String(s) => Some(s.clone()), - Value::Array(arr) => arr.iter().find_map(|x| match x { - Value::String(s) => Some(s.clone()), - Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from), - _ => None, - }), - Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from), - _ => None, - } -} - -fn first_offer(v: &Value) -> Option { - let offers = v.get("offers")?; - match offers { - Value::Array(arr) => arr.first().cloned(), - Value::Object(_) => Some(offers.clone()), - _ => None, - } -} - -fn get_aggregate_rating(v: &Value) -> Option { - let r = v.get("aggregateRating")?; - Some(json!({ - "rating_value": get_text(r, "ratingValue"), - "review_count": get_text(r, "reviewCount"), - "best_rating": get_text(r, "bestRating"), - })) -} - -fn og(html: &str, prop: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new(r#"(?i)]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() - }); - for c in re.captures_iter(html) { - if c.get(1).is_some_and(|m| m.as_str() == prop) { - return c.get(2).map(|m| m.as_str().to_string()); - } - } - None -} - -fn cloud_to_fetch_err(e: CloudError) -> FetchError { - FetchError::Build(e.to_string()) -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_ebay_item_urls() { - assert!(matches("https://www.ebay.com/itm/325478156234")); - assert!(matches( - "https://www.ebay.com/itm/vintage-typewriter/325478156234" - )); - assert!(matches("https://www.ebay.co.uk/itm/325478156234")); - assert!(!matches("https://www.ebay.com/")); - assert!(!matches("https://www.ebay.com/sch/foo")); - assert!(!matches("https://example.com/itm/325478156234")); - } - - #[test] - fn parse_item_id_handles_slugged_urls() { - assert_eq!( - parse_item_id("https://www.ebay.com/itm/325478156234"), - Some("325478156234".into()) - ); - assert_eq!( - parse_item_id("https://www.ebay.com/itm/vintage-typewriter/325478156234"), - Some("325478156234".into()) - ); - assert_eq!( - parse_item_id("https://www.ebay.com/itm/325478156234?hash=abc"), - Some("325478156234".into()) - ); - } - - #[test] - fn parse_extracts_from_fixture_jsonld() { - let html = r##" - - -"##; - let v = parse(html, "https://www.ebay.co.uk/itm/325", "325"); - assert_eq!(v["title"], "Vintage Typewriter"); - assert_eq!(v["price"], "79.99"); - assert_eq!(v["currency"], "GBP"); - assert_eq!(v["availability"], "InStock"); - assert_eq!(v["condition"], "UsedCondition"); - assert_eq!(v["seller"], "vintage_seller_99"); - assert_eq!(v["brand"], "Olivetti"); - } - - #[test] - fn parse_handles_aggregate_offer_price_range() { - let html = r##" - -"##; - let v = parse(html, "https://www.ebay.com/itm/1", "1"); - assert_eq!(v["low_price"], "10.00"); - assert_eq!(v["high_price"], "50.00"); - assert_eq!(v["offer_count"], "5"); - assert_eq!(v["currency"], "USD"); - } -} diff --git a/crates/webclaw-fetch/src/extractors/ecommerce_product.rs b/crates/webclaw-fetch/src/extractors/ecommerce_product.rs deleted file mode 100644 index 019fb68..0000000 --- a/crates/webclaw-fetch/src/extractors/ecommerce_product.rs +++ /dev/null @@ -1,553 +0,0 @@ -//! Generic ecommerce product extractor via Schema.org JSON-LD. -//! -//! Every modern ecommerce site ships a ` - - - "##; - let v = parse(html, "https://patagonia.com/p/x").unwrap(); - assert_eq!(v["data_source"], "jsonld+og"); - assert_eq!(v["name"], "Better Sweater"); - assert_eq!(v["offers"].as_array().unwrap().len(), 1); - assert_eq!(v["offers"][0]["price"], "139.00"); - } - - #[test] - fn jsonld_only_stays_pure_jsonld() { - let html = r##" - - "##; - let v = parse(html, "https://example.com/p/w").unwrap(); - assert_eq!(v["data_source"], "jsonld"); - assert_eq!(v["offers"][0]["price"], "9.99"); - } - - #[test] - fn parse_returns_none_on_no_product_signals() { - let html = r#" - - - "#; - assert!(parse(html, "https://blog.example.com/post").is_none()); - } -} diff --git a/crates/webclaw-fetch/src/extractors/etsy_listing.rs b/crates/webclaw-fetch/src/extractors/etsy_listing.rs deleted file mode 100644 index ea9ed0b..0000000 --- a/crates/webclaw-fetch/src/extractors/etsy_listing.rs +++ /dev/null @@ -1,572 +0,0 @@ -//! Etsy listing extractor. -//! -//! Etsy product pages at `etsy.com/listing/{id}` (and a sluggy variant -//! `etsy.com/listing/{id}/{slug}`) ship a Schema.org `Product` JSON-LD -//! block with title, price, currency, availability, shop seller, and -//! an `AggregateRating` for the listing. -//! -//! Etsy puts Cloudflare + custom WAF in front of product pages with a -//! high variance: the Firefox profile gets clean HTML most of the time -//! but some listings return a CF interstitial. We route through -//! `cloud::smart_fetch_html` so both paths resolve to the same parser, -//! same as `ebay_listing`. -//! -//! ## URL slug as last-resort title -//! -//! Even with cloud antibot bypass, Etsy frequently serves a generic -//! page with minimal metadata (`og:title = "etsy.com"`, no JSON-LD, -//! empty markdown). In that case we humanise the slug from the URL -//! (`/listing/{id}/personalized-stainless-steel-tumbler` becomes -//! "Personalized Stainless Steel Tumbler") so callers always get a -//! meaningful title. Degrades gracefully when the URL has no slug. - -use std::sync::OnceLock; - -use regex::Regex; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::cloud::{self, CloudError}; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "etsy_listing", - label: "Etsy listing", - description: "Returns listing title, price, currency, availability, shop, rating, and image. Heavy listings may need WEBCLAW_API_KEY for antibot.", - url_patterns: &[ - "https://www.etsy.com/listing/{id}", - "https://www.etsy.com/listing/{id}/{slug}", - "https://www.etsy.com/{locale}/listing/{id}", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if !is_etsy_host(host) { - return false; - } - parse_listing_id(url).is_some() -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let listing_id = parse_listing_id(url) - .ok_or_else(|| FetchError::Build(format!("etsy_listing: no listing id in '{url}'")))?; - - let fetched = cloud::smart_fetch_html(client, client.cloud(), url) - .await - .map_err(cloud_to_fetch_err)?; - - let mut data = parse(&fetched.html, url, &listing_id); - if let Some(obj) = data.as_object_mut() { - obj.insert( - "data_source".into(), - match fetched.source { - cloud::FetchSource::Local => json!("local"), - cloud::FetchSource::Cloud => json!("cloud"), - }, - ); - } - Ok(data) -} - -pub fn parse(html: &str, url: &str, listing_id: &str) -> Value { - let jsonld = find_product_jsonld(html); - let slug_title = humanise_slug(parse_slug(url).as_deref()); - - let title = jsonld - .as_ref() - .and_then(|v| get_text(v, "name")) - .or_else(|| og(html, "title").filter(|t| !is_generic_title(t))) - .or(slug_title); - let description = jsonld - .as_ref() - .and_then(|v| get_text(v, "description")) - .or_else(|| og(html, "description").filter(|d| !is_generic_description(d))); - let image = jsonld - .as_ref() - .and_then(get_first_image) - .or_else(|| og(html, "image")); - let brand = jsonld.as_ref().and_then(get_brand); - - // Etsy listings often ship either a single Offer or an - // AggregateOffer when the listing has variants with different prices. - let offer = jsonld.as_ref().and_then(first_offer); - let (low_price, high_price, single_price) = match offer.as_ref() { - Some(o) => ( - get_text(o, "lowPrice"), - get_text(o, "highPrice"), - get_text(o, "price"), - ), - None => (None, None, None), - }; - let currency = offer.as_ref().and_then(|o| get_text(o, "priceCurrency")); - let availability = offer - .as_ref() - .and_then(|o| get_text(o, "availability").map(strip_schema_prefix)); - let item_condition = jsonld - .as_ref() - .and_then(|v| get_text(v, "itemCondition")) - .map(strip_schema_prefix); - - // Shop name: offers[0].seller.name on newer listings, top-level - // `brand` on older listings (Etsy changed the schema around 2022). - // Fall back through both so either shape resolves. - let shop = offer - .as_ref() - .and_then(|o| { - o.get("seller") - .and_then(|s| s.get("name")) - .and_then(|n| n.as_str()) - .map(String::from) - }) - .or_else(|| brand.clone()); - let shop_url = shop_url_from_html(html); - - let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating); - - json!({ - "url": url, - "listing_id": listing_id, - "title": title, - "description": description, - "image": image, - "brand": brand, - "price": single_price, - "low_price": low_price, - "high_price": high_price, - "currency": currency, - "availability": availability, - "item_condition": item_condition, - "shop": shop, - "shop_url": shop_url, - "aggregate_rating": aggregate_rating, - }) -} - -// --------------------------------------------------------------------------- -// URL helpers -// --------------------------------------------------------------------------- - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -fn is_etsy_host(host: &str) -> bool { - host == "etsy.com" || host == "www.etsy.com" || host.ends_with(".etsy.com") -} - -/// Extract the numeric listing id. Etsy ids are 9-11 digits today but -/// we accept any all-digit segment right after `/listing/`. -/// -/// Handles `/listing/{id}`, `/listing/{id}/{slug}`, and the localised -/// `/{locale}/listing/{id}` shape (e.g. `/fr/listing/...`). -fn parse_listing_id(url: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| Regex::new(r"/listing/(\d{6,})(?:[/?#]|$)").unwrap()); - re.captures(url) - .and_then(|c| c.get(1)) - .map(|m| m.as_str().to_string()) -} - -/// Extract the URL slug after the listing id, e.g. -/// `personalized-stainless-steel-tumbler`. Returns `None` when the URL -/// is the bare `/listing/{id}` shape. -fn parse_slug(url: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| Regex::new(r"/listing/\d{6,}/([^/?#]+)").unwrap()); - re.captures(url) - .and_then(|c| c.get(1)) - .map(|m| m.as_str().to_string()) -} - -/// Turn a URL slug into a human-ish title: -/// `personalized-stainless-steel-tumbler` → `Personalized Stainless -/// Steel Tumbler`. Word-cap each dash-separated token; preserves -/// underscores as spaces too. Returns `None` on empty input. -fn humanise_slug(slug: Option<&str>) -> Option { - let raw = slug?.trim(); - if raw.is_empty() { - return None; - } - let words: Vec = raw - .split(['-', '_']) - .filter(|w| !w.is_empty()) - .map(capitalise_word) - .collect(); - if words.is_empty() { - None - } else { - Some(words.join(" ")) - } -} - -fn capitalise_word(w: &str) -> String { - let mut chars = w.chars(); - match chars.next() { - Some(first) => first.to_uppercase().collect::() + chars.as_str(), - None => String::new(), - } -} - -/// True when the OG title is Etsy's fallback-page title rather than a -/// listing-specific title. Expired / region-blocked / antibot-filtered -/// pages return Etsy's sitewide tagline: -/// `"Etsy - Your place to buy and sell all things handmade..."`, or -/// simply `"etsy.com"`. A real listing title always starts with the -/// item name, never with "Etsy - " or the domain. -fn is_generic_title(t: &str) -> bool { - let normalised = t.trim().to_lowercase(); - if matches!( - normalised.as_str(), - "etsy.com" | "etsy" | "www.etsy.com" | "" - ) { - return true; - } - // Etsy's sitewide marketing tagline, served on 404 / blocked pages. - if normalised.starts_with("etsy - ") - || normalised.starts_with("etsy.com - ") - || normalised.starts_with("etsy uk - ") - { - return true; - } - // Etsy's "item unavailable" placeholder, served on delisted - // products. Keep the slug fallback so callers still see what the - // URL was about. - normalised.starts_with("this item is unavailable") - || normalised.starts_with("sorry, this item is") - || normalised == "item not available - etsy" -} - -/// True when the OG description is an Etsy error-page placeholder or -/// sitewide marketing blurb rather than a real listing description. -fn is_generic_description(d: &str) -> bool { - let normalised = d.trim().to_lowercase(); - if normalised.is_empty() { - return true; - } - normalised.starts_with("sorry, the page you were looking for") - || normalised.starts_with("page not found") - || normalised.starts_with("find the perfect handmade gift") -} - -// --------------------------------------------------------------------------- -// JSON-LD walkers (same shape as ebay_listing; kept separate so the two -// extractors can diverge without cross-impact) -// --------------------------------------------------------------------------- - -fn find_product_jsonld(html: &str) -> Option { - let blocks = webclaw_core::structured_data::extract_json_ld(html); - for b in blocks { - if let Some(found) = find_product_in(&b) { - return Some(found); - } - } - None -} - -fn find_product_in(v: &Value) -> Option { - if is_product_type(v) { - return Some(v.clone()); - } - if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) { - for item in graph { - if let Some(found) = find_product_in(item) { - return Some(found); - } - } - } - if let Some(arr) = v.as_array() { - for item in arr { - if let Some(found) = find_product_in(item) { - return Some(found); - } - } - } - None -} - -fn is_product_type(v: &Value) -> bool { - let Some(t) = v.get("@type") else { - return false; - }; - let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct"); - match t { - Value::String(s) => is_prod(s), - Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)), - _ => false, - } -} - -fn get_text(v: &Value, key: &str) -> Option { - v.get(key).and_then(|x| match x { - Value::String(s) => Some(s.clone()), - Value::Number(n) => Some(n.to_string()), - _ => None, - }) -} - -fn get_brand(v: &Value) -> Option { - let brand = v.get("brand")?; - if let Some(s) = brand.as_str() { - return Some(s.to_string()); - } - brand - .as_object() - .and_then(|o| o.get("name")) - .and_then(|n| n.as_str()) - .map(String::from) -} - -fn get_first_image(v: &Value) -> Option { - match v.get("image")? { - Value::String(s) => Some(s.clone()), - Value::Array(arr) => arr.iter().find_map(|x| match x { - Value::String(s) => Some(s.clone()), - Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from), - _ => None, - }), - Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from), - _ => None, - } -} - -fn first_offer(v: &Value) -> Option { - let offers = v.get("offers")?; - match offers { - Value::Array(arr) => arr.first().cloned(), - Value::Object(_) => Some(offers.clone()), - _ => None, - } -} - -fn get_aggregate_rating(v: &Value) -> Option { - let r = v.get("aggregateRating")?; - Some(json!({ - "rating_value": get_text(r, "ratingValue"), - "review_count": get_text(r, "reviewCount"), - "best_rating": get_text(r, "bestRating"), - })) -} - -fn strip_schema_prefix(s: String) -> String { - s.replace("http://schema.org/", "") - .replace("https://schema.org/", "") -} - -fn og(html: &str, prop: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new(r#"(?i)]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() - }); - for c in re.captures_iter(html) { - if c.get(1).is_some_and(|m| m.as_str() == prop) { - return c.get(2).map(|m| m.as_str().to_string()); - } - } - None -} - -/// Etsy links the owning shop with a canonical anchor like -/// ``. Grab the first one after the -/// breadcrumb boundary. -fn shop_url_from_html(html: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| Regex::new(r#"href="(/shop/[A-Za-z0-9_-]+)""#).unwrap()); - re.captures(html) - .and_then(|c| c.get(1)) - .map(|m| format!("https://www.etsy.com{}", m.as_str())) -} - -fn cloud_to_fetch_err(e: CloudError) -> FetchError { - FetchError::Build(e.to_string()) -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_etsy_listing_urls() { - assert!(matches("https://www.etsy.com/listing/123456789")); - assert!(matches( - "https://www.etsy.com/listing/123456789/vintage-typewriter" - )); - assert!(matches( - "https://www.etsy.com/fr/listing/123456789/vintage-typewriter" - )); - assert!(!matches("https://www.etsy.com/")); - assert!(!matches("https://www.etsy.com/shop/SomeShop")); - assert!(!matches("https://example.com/listing/123456789")); - } - - #[test] - fn parse_listing_id_handles_slug_and_locale() { - assert_eq!( - parse_listing_id("https://www.etsy.com/listing/123456789"), - Some("123456789".into()) - ); - assert_eq!( - parse_listing_id("https://www.etsy.com/listing/123456789/slug-here"), - Some("123456789".into()) - ); - assert_eq!( - parse_listing_id("https://www.etsy.com/fr/listing/123456789/slug"), - Some("123456789".into()) - ); - assert_eq!( - parse_listing_id("https://www.etsy.com/listing/123456789?ref=foo"), - Some("123456789".into()) - ); - } - - #[test] - fn parse_extracts_from_fixture_jsonld() { - let html = r##" - - -StudioClay -"##; - let v = parse(html, "https://www.etsy.com/listing/1", "1"); - assert_eq!(v["title"], "Handmade Ceramic Mug"); - assert_eq!(v["price"], "24.00"); - assert_eq!(v["currency"], "USD"); - assert_eq!(v["availability"], "InStock"); - assert_eq!(v["item_condition"], "NewCondition"); - assert_eq!(v["shop"], "StudioClay"); - assert_eq!(v["shop_url"], "https://www.etsy.com/shop/StudioClay"); - assert_eq!(v["brand"], "Studio Clay"); - assert_eq!(v["aggregate_rating"]["rating_value"], "4.9"); - assert_eq!(v["aggregate_rating"]["review_count"], "127"); - } - - #[test] - fn parse_handles_aggregate_offer_price_range() { - let html = r##" - -"##; - let v = parse(html, "https://www.etsy.com/listing/2", "2"); - assert_eq!(v["low_price"], "18.00"); - assert_eq!(v["high_price"], "36.00"); - assert_eq!(v["currency"], "USD"); - } - - #[test] - fn parse_falls_back_to_og_when_no_jsonld() { - let html = r#" - - - - -"#; - let v = parse(html, "https://www.etsy.com/listing/3", "3"); - assert_eq!(v["title"], "Minimal Fallback Item"); - assert_eq!(v["description"], "OG-only extraction test."); - assert_eq!(v["image"], "https://i.etsystatic.com/fallback.jpg"); - // No price fields when we only have OG. - assert!(v["price"].is_null()); - } - - #[test] - fn parse_slug_from_url() { - assert_eq!( - parse_slug("https://www.etsy.com/listing/123456789/vintage-typewriter"), - Some("vintage-typewriter".into()) - ); - assert_eq!( - parse_slug("https://www.etsy.com/listing/123456789/slug?ref=shop"), - Some("slug".into()) - ); - assert_eq!(parse_slug("https://www.etsy.com/listing/123456789"), None); - assert_eq!( - parse_slug("https://www.etsy.com/fr/listing/123456789/slug"), - Some("slug".into()) - ); - } - - #[test] - fn humanise_slug_capitalises_each_word() { - assert_eq!( - humanise_slug(Some("personalized-stainless-steel-tumbler")).as_deref(), - Some("Personalized Stainless Steel Tumbler") - ); - assert_eq!( - humanise_slug(Some("hand_crafted_mug")).as_deref(), - Some("Hand Crafted Mug") - ); - assert_eq!(humanise_slug(Some("")), None); - assert_eq!(humanise_slug(None), None); - } - - #[test] - fn is_generic_title_catches_common_shapes() { - assert!(is_generic_title("etsy.com")); - assert!(is_generic_title("Etsy")); - assert!(is_generic_title(" etsy.com ")); - assert!(is_generic_title( - "Etsy - Your place to buy and sell all things handmade, vintage, and supplies" - )); - assert!(is_generic_title("Etsy UK - Vintage & Handmade")); - assert!(!is_generic_title("Vintage Typewriter")); - assert!(!is_generic_title("Handmade Etsy-style Mug")); - } - - #[test] - fn is_generic_description_catches_404_shapes() { - assert!(is_generic_description("")); - assert!(is_generic_description( - "Sorry, the page you were looking for was not found." - )); - assert!(is_generic_description("Page not found")); - assert!(!is_generic_description( - "Hand-thrown ceramic mug, dishwasher safe." - )); - } - - #[test] - fn parse_uses_slug_when_og_is_generic() { - // Cloud-blocked Etsy listing: og:title is a site-wide generic - // placeholder, no JSON-LD, no description. Slug should win. - let html = r#" - -"#; - let v = parse( - html, - "https://www.etsy.com/listing/1079113183/personalized-stainless-steel-tumbler", - "1079113183", - ); - assert_eq!(v["title"], "Personalized Stainless Steel Tumbler"); - } - - #[test] - fn parse_prefers_real_og_over_slug() { - let html = r#" - -"#; - let v = parse( - html, - "https://www.etsy.com/listing/1079113183/the-url-slug", - "1079113183", - ); - assert_eq!(v["title"], "Real Listing Title"); - } -} diff --git a/crates/webclaw-fetch/src/extractors/github_issue.rs b/crates/webclaw-fetch/src/extractors/github_issue.rs deleted file mode 100644 index 9a64f21..0000000 --- a/crates/webclaw-fetch/src/extractors/github_issue.rs +++ /dev/null @@ -1,172 +0,0 @@ -//! GitHub issue structured extractor. -//! -//! Mirror of `github_pr` but on `/issues/{number}`. Uses -//! `api.github.com/repos/{owner}/{repo}/issues/{number}`. Returns the -//! issue body + comment count + labels + milestone + author / -//! assignees. Full per-comment bodies would be another call; kept for -//! a follow-up. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "github_issue", - label: "GitHub issue", - description: "Returns issue metadata: title, body, state, author, labels, assignees, milestone, comment count.", - url_patterns: &["https://github.com/{owner}/{repo}/issues/{number}"], -}; - -pub fn matches(url: &str) -> bool { - let host = url - .split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or(""); - if host != "github.com" && host != "www.github.com" { - return false; - } - parse_issue(url).is_some() -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let (owner, repo, number) = parse_issue(url).ok_or_else(|| { - FetchError::Build(format!("github_issue: cannot parse issue URL '{url}'")) - })?; - - let api_url = format!("https://api.github.com/repos/{owner}/{repo}/issues/{number}"); - let resp = client.fetch(&api_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "github_issue: issue '{owner}/{repo}#{number}' not found" - ))); - } - if resp.status == 403 { - return Err(FetchError::Build( - "github_issue: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(), - )); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "github api returned status {}", - resp.status - ))); - } - - let issue: Issue = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("github issue parse: {e}")))?; - - // The same endpoint returns PRs too; reject if we got one so the caller - // uses /v1/scrape/github_pr instead of getting a half-shaped payload. - if issue.pull_request.is_some() { - return Err(FetchError::Build(format!( - "github_issue: '{owner}/{repo}#{number}' is a pull request, use /v1/scrape/github_pr" - ))); - } - - Ok(json!({ - "url": url, - "owner": owner, - "repo": repo, - "number": issue.number, - "title": issue.title, - "body": issue.body, - "state": issue.state, - "state_reason":issue.state_reason, - "author": issue.user.as_ref().and_then(|u| u.login.clone()), - "labels": issue.labels.iter().filter_map(|l| l.name.clone()).collect::>(), - "assignees": issue.assignees.iter().filter_map(|u| u.login.clone()).collect::>(), - "milestone": issue.milestone.as_ref().and_then(|m| m.title.clone()), - "comments": issue.comments, - "locked": issue.locked, - "created_at": issue.created_at, - "updated_at": issue.updated_at, - "closed_at": issue.closed_at, - "html_url": issue.html_url, - })) -} - -fn parse_issue(url: &str) -> Option<(String, String, u64)> { - let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; - let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); - let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); - if segs.len() < 4 || segs[2] != "issues" { - return None; - } - let number: u64 = segs[3].parse().ok()?; - Some((segs[0].to_string(), segs[1].to_string(), number)) -} - -// --------------------------------------------------------------------------- -// GitHub issue API types -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct Issue { - number: Option, - title: Option, - body: Option, - state: Option, - state_reason: Option, - locked: Option, - comments: Option, - created_at: Option, - updated_at: Option, - closed_at: Option, - html_url: Option, - user: Option, - #[serde(default)] - labels: Vec, - #[serde(default)] - assignees: Vec, - milestone: Option, - /// Present when this "issue" is actually a pull request. The REST - /// API overloads the issues endpoint for PRs. - pull_request: Option, -} - -#[derive(Deserialize)] -struct UserRef { - login: Option, -} - -#[derive(Deserialize)] -struct LabelRef { - name: Option, -} - -#[derive(Deserialize)] -struct Milestone { - title: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_issue_urls() { - assert!(matches("https://github.com/rust-lang/rust/issues/100")); - assert!(matches("https://github.com/rust-lang/rust/issues/100/")); - assert!(!matches("https://github.com/rust-lang/rust")); - assert!(!matches("https://github.com/rust-lang/rust/pull/100")); - assert!(!matches("https://github.com/rust-lang/rust/issues")); - } - - #[test] - fn parse_issue_extracts_owner_repo_number() { - assert_eq!( - parse_issue("https://github.com/rust-lang/rust/issues/100"), - Some(("rust-lang".into(), "rust".into(), 100)) - ); - assert_eq!( - parse_issue("https://github.com/rust-lang/rust/issues/100/?foo=bar"), - Some(("rust-lang".into(), "rust".into(), 100)) - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/github_pr.rs b/crates/webclaw-fetch/src/extractors/github_pr.rs deleted file mode 100644 index 266d3cd..0000000 --- a/crates/webclaw-fetch/src/extractors/github_pr.rs +++ /dev/null @@ -1,189 +0,0 @@ -//! GitHub pull request structured extractor. -//! -//! Uses `api.github.com/repos/{owner}/{repo}/pulls/{number}`. Returns -//! the PR metadata + a counted summary of comments and review activity. -//! Full diff and per-comment bodies require additional calls — left for -//! a follow-up enhancement so the v1 stays one network round-trip. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "github_pr", - label: "GitHub pull request", - description: "Returns PR metadata: title, body, state, author, labels, additions/deletions, file count.", - url_patterns: &["https://github.com/{owner}/{repo}/pull/{number}"], -}; - -pub fn matches(url: &str) -> bool { - let host = url - .split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or(""); - if host != "github.com" && host != "www.github.com" { - return false; - } - parse_pr(url).is_some() -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let (owner, repo, number) = parse_pr(url).ok_or_else(|| { - FetchError::Build(format!("github_pr: cannot parse pull-request URL '{url}'")) - })?; - - let api_url = format!("https://api.github.com/repos/{owner}/{repo}/pulls/{number}"); - let resp = client.fetch(&api_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "github_pr: pull request '{owner}/{repo}#{number}' not found" - ))); - } - if resp.status == 403 { - return Err(FetchError::Build( - "github_pr: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(), - )); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "github api returned status {}", - resp.status - ))); - } - - let p: PullRequest = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("github pr parse: {e}")))?; - - Ok(json!({ - "url": url, - "owner": owner, - "repo": repo, - "number": p.number, - "title": p.title, - "body": p.body, - "state": p.state, - "draft": p.draft, - "merged": p.merged, - "merged_at": p.merged_at, - "merge_commit_sha": p.merge_commit_sha, - "author": p.user.as_ref().and_then(|u| u.login.clone()), - "labels": p.labels.iter().filter_map(|l| l.name.clone()).collect::>(), - "milestone": p.milestone.as_ref().and_then(|m| m.title.clone()), - "head_ref": p.head.as_ref().and_then(|r| r.ref_name.clone()), - "base_ref": p.base.as_ref().and_then(|r| r.ref_name.clone()), - "head_sha": p.head.as_ref().and_then(|r| r.sha.clone()), - "additions": p.additions, - "deletions": p.deletions, - "changed_files": p.changed_files, - "commits": p.commits, - "comments": p.comments, - "review_comments":p.review_comments, - "created_at": p.created_at, - "updated_at": p.updated_at, - "closed_at": p.closed_at, - "html_url": p.html_url, - })) -} - -fn parse_pr(url: &str) -> Option<(String, String, u64)> { - let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; - let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); - let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); - // /{owner}/{repo}/pull/{number} (or /pulls/{number} variant) - if segs.len() < 4 { - return None; - } - if segs[2] != "pull" && segs[2] != "pulls" { - return None; - } - let number: u64 = segs[3].parse().ok()?; - Some((segs[0].to_string(), segs[1].to_string(), number)) -} - -// --------------------------------------------------------------------------- -// GitHub PR API types -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct PullRequest { - number: Option, - title: Option, - body: Option, - state: Option, - draft: Option, - merged: Option, - merged_at: Option, - merge_commit_sha: Option, - user: Option, - #[serde(default)] - labels: Vec, - milestone: Option, - head: Option, - base: Option, - additions: Option, - deletions: Option, - changed_files: Option, - commits: Option, - comments: Option, - review_comments: Option, - created_at: Option, - updated_at: Option, - closed_at: Option, - html_url: Option, -} - -#[derive(Deserialize)] -struct UserRef { - login: Option, -} - -#[derive(Deserialize)] -struct LabelRef { - name: Option, -} - -#[derive(Deserialize)] -struct Milestone { - title: Option, -} - -#[derive(Deserialize)] -struct GitRef { - #[serde(rename = "ref")] - ref_name: Option, - sha: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_pr_urls() { - assert!(matches("https://github.com/rust-lang/rust/pull/12345")); - assert!(matches( - "https://github.com/rust-lang/rust/pull/12345/files" - )); - assert!(!matches("https://github.com/rust-lang/rust")); - assert!(!matches("https://github.com/rust-lang/rust/issues/100")); - assert!(!matches("https://github.com/rust-lang")); - } - - #[test] - fn parse_pr_extracts_owner_repo_number() { - assert_eq!( - parse_pr("https://github.com/rust-lang/rust/pull/12345"), - Some(("rust-lang".into(), "rust".into(), 12345)) - ); - assert_eq!( - parse_pr("https://github.com/rust-lang/rust/pull/12345/files"), - Some(("rust-lang".into(), "rust".into(), 12345)) - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/github_release.rs b/crates/webclaw-fetch/src/extractors/github_release.rs deleted file mode 100644 index 7699d09..0000000 --- a/crates/webclaw-fetch/src/extractors/github_release.rs +++ /dev/null @@ -1,179 +0,0 @@ -//! GitHub release structured extractor. -//! -//! `api.github.com/repos/{owner}/{repo}/releases/tags/{tag}`. Returns -//! the release notes body, asset list with download counts, and -//! prerelease flag. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "github_release", - label: "GitHub release", - description: "Returns release metadata: tag, name, body (release notes), assets with download counts.", - url_patterns: &["https://github.com/{owner}/{repo}/releases/tag/{tag}"], -}; - -pub fn matches(url: &str) -> bool { - let host = url - .split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or(""); - if host != "github.com" && host != "www.github.com" { - return false; - } - parse_release(url).is_some() -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let (owner, repo, tag) = parse_release(url).ok_or_else(|| { - FetchError::Build(format!("github_release: cannot parse release URL '{url}'")) - })?; - - let api_url = format!("https://api.github.com/repos/{owner}/{repo}/releases/tags/{tag}"); - let resp = client.fetch(&api_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "github_release: release '{owner}/{repo}@{tag}' not found" - ))); - } - if resp.status == 403 { - return Err(FetchError::Build( - "github_release: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour." - .into(), - )); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "github api returned status {}", - resp.status - ))); - } - - let r: Release = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("github release parse: {e}")))?; - - let assets: Vec = r - .assets - .iter() - .map(|a| { - json!({ - "name": a.name, - "size": a.size, - "download_count": a.download_count, - "browser_download_url": a.browser_download_url, - "content_type": a.content_type, - "created_at": a.created_at, - "updated_at": a.updated_at, - }) - }) - .collect(); - - Ok(json!({ - "url": url, - "owner": owner, - "repo": repo, - "tag_name": r.tag_name, - "name": r.name, - "body": r.body, - "draft": r.draft, - "prerelease": r.prerelease, - "author": r.author.as_ref().and_then(|u| u.login.clone()), - "created_at": r.created_at, - "published_at": r.published_at, - "asset_count": assets.len(), - "total_downloads": r.assets.iter().map(|a| a.download_count.unwrap_or(0)).sum::(), - "assets": assets, - "html_url": r.html_url, - })) -} - -fn parse_release(url: &str) -> Option<(String, String, String)> { - let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; - let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); - let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); - // /{owner}/{repo}/releases/tag/{tag} - if segs.len() < 5 { - return None; - } - if segs[2] != "releases" || segs[3] != "tag" { - return None; - } - Some(( - segs[0].to_string(), - segs[1].to_string(), - segs[4].to_string(), - )) -} - -// --------------------------------------------------------------------------- -// GitHub Release API types -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct Release { - tag_name: Option, - name: Option, - body: Option, - draft: Option, - prerelease: Option, - author: Option, - created_at: Option, - published_at: Option, - html_url: Option, - #[serde(default)] - assets: Vec, -} - -#[derive(Deserialize)] -struct UserRef { - login: Option, -} - -#[derive(Deserialize)] -struct Asset { - name: Option, - size: Option, - download_count: Option, - browser_download_url: Option, - content_type: Option, - created_at: Option, - updated_at: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_release_urls() { - assert!(matches( - "https://github.com/rust-lang/rust/releases/tag/1.85.0" - )); - assert!(matches( - "https://github.com/0xMassi/webclaw/releases/tag/v0.4.0" - )); - assert!(!matches("https://github.com/rust-lang/rust")); - assert!(!matches("https://github.com/rust-lang/rust/releases")); - assert!(!matches("https://github.com/rust-lang/rust/pull/100")); - } - - #[test] - fn parse_release_extracts_owner_repo_tag() { - assert_eq!( - parse_release("https://github.com/0xMassi/webclaw/releases/tag/v0.4.0"), - Some(("0xMassi".into(), "webclaw".into(), "v0.4.0".into())) - ); - assert_eq!( - parse_release("https://github.com/rust-lang/rust/releases/tag/1.85.0/?foo=bar"), - Some(("rust-lang".into(), "rust".into(), "1.85.0".into())) - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/github_repo.rs b/crates/webclaw-fetch/src/extractors/github_repo.rs deleted file mode 100644 index 2a62aa3..0000000 --- a/crates/webclaw-fetch/src/extractors/github_repo.rs +++ /dev/null @@ -1,212 +0,0 @@ -//! GitHub repository structured extractor. -//! -//! Uses GitHub's public REST API at `api.github.com/repos/{owner}/{repo}`. -//! Unauthenticated requests get 60/hour per IP, which is fine for users -//! self-hosting and for low-volume cloud usage. Production cloud should -//! set a `GITHUB_TOKEN` to lift to 5,000/hour, but the extractor doesn't -//! depend on it being set — it works open out of the box. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "github_repo", - label: "GitHub repository", - description: "Returns repo metadata: stars, forks, topics, license, default branch, recent activity.", - url_patterns: &["https://github.com/{owner}/{repo}"], -}; - -pub fn matches(url: &str) -> bool { - let host = url - .split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or(""); - if host != "github.com" && host != "www.github.com" { - return false; - } - // Path must be exactly /{owner}/{repo} (or with trailing slash). Reject - // sub-pages (issues, pulls, blob, etc.) so we don't claim URLs the - // future github_issue / github_pr extractors will handle. - let path = url - .split("://") - .nth(1) - .and_then(|s| s.split_once('/')) - .map(|(_, p)| p) - .unwrap_or(""); - let stripped = path - .split(['?', '#']) - .next() - .unwrap_or("") - .trim_end_matches('/'); - let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); - segs.len() == 2 && !RESERVED_OWNERS.contains(&segs[0]) -} - -/// GitHub uses some top-level paths for non-repo pages. -const RESERVED_OWNERS: &[&str] = &[ - "settings", - "marketplace", - "explore", - "topics", - "trending", - "collections", - "events", - "sponsors", - "issues", - "pulls", - "notifications", - "new", - "organizations", - "login", - "join", - "search", - "about", -]; - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let (owner, repo) = parse_owner_repo(url).ok_or_else(|| { - FetchError::Build(format!("github_repo: cannot parse owner/repo from '{url}'")) - })?; - - let api_url = format!("https://api.github.com/repos/{owner}/{repo}"); - let resp = client.fetch(&api_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "github_repo: repo '{owner}/{repo}' not found" - ))); - } - if resp.status == 403 { - return Err(FetchError::Build( - "github_repo: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(), - )); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "github api returned status {}", - resp.status - ))); - } - - let r: Repo = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("github api parse: {e}")))?; - - Ok(json!({ - "url": url, - "owner": r.owner.as_ref().map(|o| &o.login), - "name": r.name, - "full_name": r.full_name, - "description": r.description, - "homepage": r.homepage, - "language": r.language, - "topics": r.topics, - "license": r.license.as_ref().and_then(|l| l.spdx_id.clone()), - "license_name": r.license.as_ref().map(|l| l.name.clone()), - "default_branch": r.default_branch, - "stars": r.stargazers_count, - "forks": r.forks_count, - "watchers": r.subscribers_count, - "open_issues": r.open_issues_count, - "size_kb": r.size, - "archived": r.archived, - "fork": r.fork, - "is_template": r.is_template, - "has_issues": r.has_issues, - "has_wiki": r.has_wiki, - "has_pages": r.has_pages, - "has_discussions": r.has_discussions, - "created_at": r.created_at, - "updated_at": r.updated_at, - "pushed_at": r.pushed_at, - "html_url": r.html_url, - })) -} - -fn parse_owner_repo(url: &str) -> Option<(String, String)> { - let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; - let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); - let mut segs = stripped.split('/').filter(|s| !s.is_empty()); - let owner = segs.next()?.to_string(); - let repo = segs.next()?.to_string(); - Some((owner, repo)) -} - -// --------------------------------------------------------------------------- -// GitHub API types — only the fields we surface -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct Repo { - name: Option, - full_name: Option, - description: Option, - homepage: Option, - language: Option, - #[serde(default)] - topics: Vec, - license: Option, - default_branch: Option, - stargazers_count: Option, - forks_count: Option, - subscribers_count: Option, - open_issues_count: Option, - size: Option, - archived: Option, - fork: Option, - is_template: Option, - has_issues: Option, - has_wiki: Option, - has_pages: Option, - has_discussions: Option, - created_at: Option, - updated_at: Option, - pushed_at: Option, - html_url: Option, - owner: Option, -} - -#[derive(Deserialize)] -struct Owner { - login: String, -} - -#[derive(Deserialize)] -struct License { - name: String, - spdx_id: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_repo_root_only() { - assert!(matches("https://github.com/rust-lang/rust")); - assert!(matches("https://github.com/rust-lang/rust/")); - assert!(!matches("https://github.com/rust-lang/rust/issues")); - assert!(!matches("https://github.com/rust-lang/rust/pulls/123")); - assert!(!matches("https://github.com/rust-lang")); - assert!(!matches("https://github.com/marketplace")); - assert!(!matches("https://github.com/topics/rust")); - assert!(!matches("https://example.com/foo/bar")); - } - - #[test] - fn parse_owner_repo_handles_trailing_slash_and_query() { - assert_eq!( - parse_owner_repo("https://github.com/rust-lang/rust"), - Some(("rust-lang".into(), "rust".into())) - ); - assert_eq!( - parse_owner_repo("https://github.com/rust-lang/rust/?tab=foo"), - Some(("rust-lang".into(), "rust".into())) - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/hackernews.rs b/crates/webclaw-fetch/src/extractors/hackernews.rs deleted file mode 100644 index 91d4520..0000000 --- a/crates/webclaw-fetch/src/extractors/hackernews.rs +++ /dev/null @@ -1,186 +0,0 @@ -//! Hacker News structured extractor. -//! -//! Uses Algolia's HN API (`hn.algolia.com/api/v1/items/{id}`) which -//! returns the full post + recursive comment tree in a single request. -//! The official Firebase API at `hacker-news.firebaseio.com` requires -//! N+1 fetches per comment, so we'd hit either timeout or rate-limit -//! on any non-trivial thread. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "hackernews", - label: "Hacker News story", - description: "Returns post + nested comment tree for a Hacker News item.", - url_patterns: &[ - "https://news.ycombinator.com/item?id=N", - "https://hn.algolia.com/items/N", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = url - .split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or(""); - if host == "news.ycombinator.com" { - return url.contains("item?id=") || url.contains("item%3Fid="); - } - if host == "hn.algolia.com" { - return url.contains("/items/"); - } - false -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let id = parse_item_id(url).ok_or_else(|| { - FetchError::Build(format!("hackernews: cannot parse item id from '{url}'")) - })?; - - let api_url = format!("https://hn.algolia.com/api/v1/items/{id}"); - let resp = client.fetch(&api_url).await?; - if resp.status != 200 { - return Err(FetchError::Build(format!( - "hn algolia returned status {}", - resp.status - ))); - } - - let item: AlgoliaItem = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("hn algolia parse: {e}")))?; - - let post = post_json(&item); - let comments: Vec = item.children.iter().filter_map(comment_json).collect(); - - Ok(json!({ - "url": url, - "post": post, - "comments": comments, - })) -} - -// --------------------------------------------------------------------------- -// Helpers -// --------------------------------------------------------------------------- - -/// Pull the numeric id out of a HN URL. Handles `item?id=N` and the -/// Algolia mirror's `/items/N` form. -fn parse_item_id(url: &str) -> Option { - if let Some(after) = url.split("id=").nth(1) { - let n = after.split('&').next().unwrap_or(after); - if let Ok(id) = n.parse::() { - return Some(id); - } - } - if let Some(after) = url.split("/items/").nth(1) { - let n = after.split(['/', '?', '#']).next().unwrap_or(after); - if let Ok(id) = n.parse::() { - return Some(id); - } - } - None -} - -fn post_json(item: &AlgoliaItem) -> Value { - json!({ - "id": item.id, - "type": item.r#type, - "title": item.title, - "url": item.url, - "author": item.author, - "points": item.points, - "text": item.text, // populated for ask/show/tell - "created_at": item.created_at, - "created_at_unix": item.created_at_i, - "comment_count": count_descendants(item), - "permalink": item.id.map(|i| format!("https://news.ycombinator.com/item?id={i}")), - }) -} - -fn comment_json(item: &AlgoliaItem) -> Option { - if !matches!(item.r#type.as_deref(), Some("comment")) { - return None; - } - // Dead/deleted comments still appear in the tree; surface them honestly. - let replies: Vec = item.children.iter().filter_map(comment_json).collect(); - Some(json!({ - "id": item.id, - "author": item.author, - "text": item.text, - "created_at": item.created_at, - "created_at_unix": item.created_at_i, - "parent_id": item.parent_id, - "story_id": item.story_id, - "replies": replies, - })) -} - -fn count_descendants(item: &AlgoliaItem) -> usize { - item.children - .iter() - .filter(|c| matches!(c.r#type.as_deref(), Some("comment"))) - .map(|c| 1 + count_descendants(c)) - .sum() -} - -// --------------------------------------------------------------------------- -// Algolia API types -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct AlgoliaItem { - id: Option, - r#type: Option, - title: Option, - url: Option, - author: Option, - points: Option, - text: Option, - created_at: Option, - created_at_i: Option, - parent_id: Option, - story_id: Option, - #[serde(default)] - children: Vec, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_hn_item_urls() { - assert!(matches("https://news.ycombinator.com/item?id=1")); - assert!(matches("https://news.ycombinator.com/item?id=12345")); - assert!(matches("https://hn.algolia.com/items/1")); - } - - #[test] - fn rejects_non_item_urls() { - assert!(!matches("https://news.ycombinator.com/")); - assert!(!matches("https://news.ycombinator.com/news")); - assert!(!matches("https://example.com/item?id=1")); - } - - #[test] - fn parse_item_id_handles_both_forms() { - assert_eq!( - parse_item_id("https://news.ycombinator.com/item?id=1"), - Some(1) - ); - assert_eq!( - parse_item_id("https://news.ycombinator.com/item?id=12345&p=2"), - Some(12345) - ); - assert_eq!(parse_item_id("https://hn.algolia.com/items/999"), Some(999)); - assert_eq!(parse_item_id("https://example.com/foo"), None); - } -} diff --git a/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs b/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs deleted file mode 100644 index e1f84f7..0000000 --- a/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs +++ /dev/null @@ -1,189 +0,0 @@ -//! HuggingFace dataset structured extractor. -//! -//! Same shape as the model extractor but hits the dataset endpoint. -//! `huggingface.co/api/datasets/{owner}/{name}`. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "huggingface_dataset", - label: "HuggingFace dataset", - description: "Returns dataset metadata: downloads, likes, license, language, task categories, file list.", - url_patterns: &["https://huggingface.co/datasets/{owner}/{name}"], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if host != "huggingface.co" && host != "www.huggingface.co" { - return false; - } - let path = url - .split("://") - .nth(1) - .and_then(|s| s.split_once('/')) - .map(|(_, p)| p) - .unwrap_or(""); - let stripped = path - .split(['?', '#']) - .next() - .unwrap_or("") - .trim_end_matches('/'); - let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); - // /datasets/{name} (legacy top-level) or /datasets/{owner}/{name} (canonical). - segs.first().copied() == Some("datasets") && (segs.len() == 2 || segs.len() == 3) -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let dataset_path = parse_dataset_path(url).ok_or_else(|| { - FetchError::Build(format!( - "hf_dataset: cannot parse dataset path from '{url}'" - )) - })?; - - let api_url = format!("https://huggingface.co/api/datasets/{dataset_path}"); - let resp = client.fetch(&api_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "hf_dataset: '{dataset_path}' not found" - ))); - } - if resp.status == 401 { - return Err(FetchError::Build(format!( - "hf_dataset: '{dataset_path}' requires authentication (gated)" - ))); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "hf_dataset api returned status {}", - resp.status - ))); - } - - let d: DatasetInfo = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("hf_dataset parse: {e}")))?; - - let files: Vec = d - .siblings - .iter() - .map(|s| json!({"rfilename": s.rfilename, "size": s.size})) - .collect(); - - Ok(json!({ - "url": url, - "id": d.id, - "private": d.private, - "gated": d.gated, - "downloads": d.downloads, - "downloads_30d": d.downloads_all_time, - "likes": d.likes, - "tags": d.tags, - "license": d.card_data.as_ref().and_then(|c| c.license.clone()), - "language": d.card_data.as_ref().and_then(|c| c.language.clone()), - "task_categories": d.card_data.as_ref().and_then(|c| c.task_categories.clone()), - "size_categories": d.card_data.as_ref().and_then(|c| c.size_categories.clone()), - "annotations_creators": d.card_data.as_ref().and_then(|c| c.annotations_creators.clone()), - "configs": d.card_data.as_ref().and_then(|c| c.configs.clone()), - "created_at": d.created_at, - "last_modified": d.last_modified, - "sha": d.sha, - "file_count": d.siblings.len(), - "files": files, - })) -} - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -/// Returns the part to append to the API URL — either `name` (legacy -/// top-level dataset like `squad`) or `owner/name` (canonical form). -fn parse_dataset_path(url: &str) -> Option { - let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; - let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); - let mut segs = stripped.split('/').filter(|s| !s.is_empty()); - if segs.next() != Some("datasets") { - return None; - } - let first = segs.next()?.to_string(); - match segs.next() { - Some(second) => Some(format!("{first}/{second}")), - None => Some(first), - } -} - -#[derive(Deserialize)] -struct DatasetInfo { - id: Option, - private: Option, - gated: Option, - downloads: Option, - #[serde(rename = "downloadsAllTime")] - downloads_all_time: Option, - likes: Option, - #[serde(default)] - tags: Vec, - #[serde(rename = "createdAt")] - created_at: Option, - #[serde(rename = "lastModified")] - last_modified: Option, - sha: Option, - #[serde(rename = "cardData")] - card_data: Option, - #[serde(default)] - siblings: Vec, -} - -#[derive(Deserialize)] -struct DatasetCard { - license: Option, - language: Option, - task_categories: Option, - size_categories: Option, - annotations_creators: Option, - configs: Option, -} - -#[derive(Deserialize)] -struct Sibling { - rfilename: String, - size: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_dataset_pages() { - assert!(matches("https://huggingface.co/datasets/squad")); // legacy top-level - assert!(matches("https://huggingface.co/datasets/openai/gsm8k")); // canonical owner/name - assert!(!matches("https://huggingface.co/openai/whisper-large-v3")); - assert!(!matches("https://huggingface.co/datasets/")); - } - - #[test] - fn parse_dataset_path_works() { - assert_eq!( - parse_dataset_path("https://huggingface.co/datasets/squad"), - Some("squad".into()) - ); - assert_eq!( - parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k"), - Some("openai/gsm8k".into()) - ); - assert_eq!( - parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k/?lib=transformers"), - Some("openai/gsm8k".into()) - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/huggingface_model.rs b/crates/webclaw-fetch/src/extractors/huggingface_model.rs deleted file mode 100644 index 4c549e0..0000000 --- a/crates/webclaw-fetch/src/extractors/huggingface_model.rs +++ /dev/null @@ -1,223 +0,0 @@ -//! HuggingFace model card structured extractor. -//! -//! Uses the public model API at `huggingface.co/api/models/{owner}/{name}`. -//! Returns metadata + the parsed model card front matter, but does not -//! pull the full README body — those are sometimes 100KB+ and the user -//! can hit /v1/scrape if they want it as markdown. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "huggingface_model", - label: "HuggingFace model", - description: "Returns model metadata: downloads, likes, license, pipeline tag, library name, file list.", - url_patterns: &["https://huggingface.co/{owner}/{name}"], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if host != "huggingface.co" && host != "www.huggingface.co" { - return false; - } - let path = url - .split("://") - .nth(1) - .and_then(|s| s.split_once('/')) - .map(|(_, p)| p) - .unwrap_or(""); - let stripped = path - .split(['?', '#']) - .next() - .unwrap_or("") - .trim_end_matches('/'); - let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); - // /{owner}/{name} but reject HF-internal sections + sub-pages. - if segs.len() != 2 { - return false; - } - !RESERVED_NAMESPACES.contains(&segs[0]) -} - -const RESERVED_NAMESPACES: &[&str] = &[ - "datasets", - "spaces", - "blog", - "docs", - "api", - "models", - "papers", - "pricing", - "tasks", - "join", - "login", - "settings", - "organizations", - "new", - "search", -]; - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let (owner, name) = parse_owner_name(url).ok_or_else(|| { - FetchError::Build(format!("hf model: cannot parse owner/name from '{url}'")) - })?; - - let api_url = format!("https://huggingface.co/api/models/{owner}/{name}"); - let resp = client.fetch(&api_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "hf model: '{owner}/{name}' not found" - ))); - } - if resp.status == 401 { - return Err(FetchError::Build(format!( - "hf model: '{owner}/{name}' requires authentication (gated repo)" - ))); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "hf api returned status {}", - resp.status - ))); - } - - let m: ModelInfo = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("hf api parse: {e}")))?; - - // Surface a flat file list — full siblings can be hundreds of entries - // for big repos. We keep it as-is because callers want to know about - // every shard; if it bloats responses too much we'll add pagination. - let files: Vec = m - .siblings - .iter() - .map(|s| json!({"rfilename": s.rfilename, "size": s.size})) - .collect(); - - Ok(json!({ - "url": url, - "id": m.id, - "model_id": m.model_id, - "private": m.private, - "gated": m.gated, - "downloads": m.downloads, - "downloads_30d": m.downloads_all_time, - "likes": m.likes, - "library_name": m.library_name, - "pipeline_tag": m.pipeline_tag, - "tags": m.tags, - "license": m.card_data.as_ref().and_then(|c| c.license.clone()), - "language": m.card_data.as_ref().and_then(|c| c.language.clone()), - "datasets": m.card_data.as_ref().and_then(|c| c.datasets.clone()), - "base_model": m.card_data.as_ref().and_then(|c| c.base_model.clone()), - "model_type": m.card_data.as_ref().and_then(|c| c.model_type.clone()), - "created_at": m.created_at, - "last_modified": m.last_modified, - "sha": m.sha, - "file_count": m.siblings.len(), - "files": files, - })) -} - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -fn parse_owner_name(url: &str) -> Option<(String, String)> { - let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; - let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); - let mut segs = stripped.split('/').filter(|s| !s.is_empty()); - let owner = segs.next()?.to_string(); - let name = segs.next()?.to_string(); - Some((owner, name)) -} - -// --------------------------------------------------------------------------- -// HF API types -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct ModelInfo { - id: Option, - #[serde(rename = "modelId")] - model_id: Option, - private: Option, - gated: Option, // bool or string ("auto" / "manual" / false) - downloads: Option, - #[serde(rename = "downloadsAllTime")] - downloads_all_time: Option, - likes: Option, - #[serde(rename = "library_name")] - library_name: Option, - #[serde(rename = "pipeline_tag")] - pipeline_tag: Option, - #[serde(default)] - tags: Vec, - #[serde(rename = "createdAt")] - created_at: Option, - #[serde(rename = "lastModified")] - last_modified: Option, - sha: Option, - #[serde(rename = "cardData")] - card_data: Option, - #[serde(default)] - siblings: Vec, -} - -#[derive(Deserialize)] -struct CardData { - license: Option, // string or array - language: Option, - datasets: Option, - #[serde(rename = "base_model")] - base_model: Option, - #[serde(rename = "model_type")] - model_type: Option, -} - -#[derive(Deserialize)] -struct Sibling { - rfilename: String, - size: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_model_pages() { - assert!(matches("https://huggingface.co/meta-llama/Meta-Llama-3-8B")); - assert!(matches("https://huggingface.co/openai/whisper-large-v3")); - assert!(matches("https://huggingface.co/bert-base-uncased/main")); // owner=bert-base-uncased name=main: false positive but acceptable for v1 - } - - #[test] - fn rejects_hf_section_pages() { - assert!(!matches("https://huggingface.co/datasets/squad")); - assert!(!matches("https://huggingface.co/spaces/foo/bar")); - assert!(!matches("https://huggingface.co/blog/intro")); - assert!(!matches("https://huggingface.co/")); - assert!(!matches("https://huggingface.co/meta-llama")); - } - - #[test] - fn parse_owner_name_pulls_both() { - assert_eq!( - parse_owner_name("https://huggingface.co/meta-llama/Meta-Llama-3-8B"), - Some(("meta-llama".into(), "Meta-Llama-3-8B".into())) - ); - assert_eq!( - parse_owner_name("https://huggingface.co/openai/whisper-large-v3?library=transformers"), - Some(("openai".into(), "whisper-large-v3".into())) - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/instagram_post.rs b/crates/webclaw-fetch/src/extractors/instagram_post.rs deleted file mode 100644 index 8847e36..0000000 --- a/crates/webclaw-fetch/src/extractors/instagram_post.rs +++ /dev/null @@ -1,235 +0,0 @@ -//! Instagram post structured extractor. -//! -//! Uses Instagram's public embed endpoint -//! `/p/{shortcode}/embed/captioned/` which returns SSR HTML with the -//! full caption, author username, and thumbnail. No auth required. -//! The same endpoint serves reels and IGTV under `/reel/{code}` and -//! `/tv/{code}` URLs (we accept all three). - -use regex::Regex; -use serde_json::{Value, json}; -use std::sync::OnceLock; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "instagram_post", - label: "Instagram post", - description: "Returns full caption, author username, thumbnail, and post type (post / reel / tv) via Instagram's public embed.", - url_patterns: &[ - "https://www.instagram.com/p/{shortcode}/", - "https://www.instagram.com/reel/{shortcode}/", - "https://www.instagram.com/tv/{shortcode}/", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if !matches!(host, "www.instagram.com" | "instagram.com") { - return false; - } - parse_shortcode(url).is_some() -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let (kind, shortcode) = parse_shortcode(url).ok_or_else(|| { - FetchError::Build(format!( - "instagram_post: cannot parse shortcode from '{url}'" - )) - })?; - - // Instagram serves the same embed HTML for posts/reels/tv under /p/. - let embed_url = format!("https://www.instagram.com/p/{shortcode}/embed/captioned/"); - let resp = client.fetch(&embed_url).await?; - if resp.status != 200 { - return Err(FetchError::Build(format!( - "instagram embed returned status {} for {shortcode}", - resp.status - ))); - } - - let html = &resp.html; - let username = parse_username(html); - let caption = parse_caption(html); - let thumbnail = parse_thumbnail(html); - - Ok(json!({ - "url": url, - "embed_url": embed_url, - "shortcode": shortcode, - "kind": kind, - "data_completeness": "embed", - "author_username": username, - "caption": caption, - "thumbnail_url": thumbnail, - "canonical_url": format!("https://www.instagram.com/{}/{shortcode}/", path_segment_for(kind)), - })) -} - -// --------------------------------------------------------------------------- -// URL parsing -// --------------------------------------------------------------------------- - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -/// Returns `(kind, shortcode)` where kind ∈ {`post`, `reel`, `tv`}. -fn parse_shortcode(url: &str) -> Option<(&'static str, String)> { - let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; - let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); - let mut segs = stripped.split('/').filter(|s| !s.is_empty()); - let first = segs.next()?; - let kind = match first { - "p" => "post", - "reel" | "reels" => "reel", - "tv" => "tv", - _ => return None, - }; - let shortcode = segs.next()?; - if shortcode.is_empty() { - return None; - } - Some((kind, shortcode.to_string())) -} - -fn path_segment_for(kind: &str) -> &'static str { - match kind { - "reel" => "reel", - "tv" => "tv", - _ => "p", - } -} - -// --------------------------------------------------------------------------- -// HTML scraping -// --------------------------------------------------------------------------- - -/// Username appears as the anchor text inside ``. -fn parse_username(html: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| Regex::new(r#"(?s)class="CaptionUsername"[^>]*>([^<]+)<"#).unwrap()); - re.captures(html) - .and_then(|c| c.get(1)) - .map(|m| html_decode(m.as_str().trim())) -} - -/// Caption sits inside `
` after the username anchor. -/// We grab the whole Caption block and strip out the username link, time -/// node, and any trailing "Photo by" / "View ... on Instagram" boilerplate. -fn parse_caption(html: &str) -> Option { - static RE_OUTER: OnceLock = OnceLock::new(); - let outer = RE_OUTER - .get_or_init(|| Regex::new(r#"(?s)]*>(.*?)
"#).unwrap()); - let block = outer.captures(html)?.get(1)?.as_str(); - - // Strip everything wrapped in
.... - static RE_USER: OnceLock = OnceLock::new(); - let user_re = RE_USER - .get_or_init(|| Regex::new(r#"(?s)]*class="CaptionUsername"[^>]*>.*?"#).unwrap()); - let stripped = user_re.replace_all(block, ""); - - // Then strip anything remaining tagged. - static RE_TAGS: OnceLock = OnceLock::new(); - let tag_re = RE_TAGS.get_or_init(|| Regex::new(r"<[^>]+>").unwrap()); - let text = tag_re.replace_all(&stripped, " "); - - let cleaned = collapse_whitespace(&html_decode(text.trim())); - if cleaned.is_empty() { - None - } else { - Some(cleaned) - } -} - -/// Thumbnail is the `` inside the embed -/// (or the og:image as fallback). -fn parse_thumbnail(html: &str) -> Option { - static RE_IMG: OnceLock = OnceLock::new(); - let img_re = RE_IMG.get_or_init(|| { - Regex::new(r#"(?s)]+class="[^"]*EmbeddedMediaImage[^"]*"[^>]+src="([^"]+)""#) - .unwrap() - }); - if let Some(m) = img_re.captures(html).and_then(|c| c.get(1)) { - return Some(html_decode(m.as_str())); - } - static RE_OG: OnceLock = OnceLock::new(); - let og_re = RE_OG.get_or_init(|| { - Regex::new(r#"(?i)]+property="og:image"[^>]+content="([^"]+)""#).unwrap() - }); - og_re - .captures(html) - .and_then(|c| c.get(1)) - .map(|m| html_decode(m.as_str())) -} - -fn html_decode(s: &str) -> String { - s.replace("&", "&") - .replace("<", "<") - .replace(">", ">") - .replace(""", "\"") - .replace("'", "'") - .replace("@", "@") - .replace("•", "•") - .replace("…", "…") -} - -fn collapse_whitespace(s: &str) -> String { - s.split_whitespace().collect::>().join(" ") -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_post_reel_tv_urls() { - assert!(matches("https://www.instagram.com/p/DT-RICMjeK5/")); - assert!(matches( - "https://www.instagram.com/p/DT-RICMjeK5/?img_index=1" - )); - assert!(matches("https://www.instagram.com/reel/abc123/")); - assert!(matches("https://www.instagram.com/tv/abc123/")); - assert!(!matches("https://www.instagram.com/ticketswave")); - assert!(!matches("https://www.instagram.com/")); - assert!(!matches("https://example.com/p/abc/")); - } - - #[test] - fn parse_shortcode_reads_each_kind() { - assert_eq!( - parse_shortcode("https://www.instagram.com/p/DT-RICMjeK5/?img_index=1"), - Some(("post", "DT-RICMjeK5".into())) - ); - assert_eq!( - parse_shortcode("https://www.instagram.com/reel/abc123/"), - Some(("reel", "abc123".into())) - ); - assert_eq!( - parse_shortcode("https://www.instagram.com/tv/abc123"), - Some(("tv", "abc123".into())) - ); - } - - #[test] - fn parse_username_pulls_anchor_text() { - let html = r#"ticketswave"#; - assert_eq!(parse_username(html).as_deref(), Some("ticketswave")); - } - - #[test] - fn parse_caption_strips_username_anchor() { - let html = r#"
ticketswave Some caption text here
"#; - assert_eq!( - parse_caption(html).as_deref(), - Some("Some caption text here") - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/instagram_profile.rs b/crates/webclaw-fetch/src/extractors/instagram_profile.rs deleted file mode 100644 index 9a92b4c..0000000 --- a/crates/webclaw-fetch/src/extractors/instagram_profile.rs +++ /dev/null @@ -1,465 +0,0 @@ -//! Instagram profile structured extractor. -//! -//! Hits Instagram's internal `web_profile_info` endpoint at -//! `instagram.com/api/v1/users/web_profile_info/?username=X`. The -//! `x-ig-app-id` header is Instagram's own public web-app id (not a -//! secret) — the same value Instagram's own JavaScript bundle sends. -//! -//! Returns the full profile (bio, exact follower count, verified / -//! business flags, profile picture) plus the **12 most recent posts** -//! with shortcodes, like counts, types, thumbnails, and caption -//! previews. Callers can fan out to `/v1/scrape/instagram_post` per -//! shortcode to get the full caption + media. -//! -//! Pagination beyond 12 requires authenticated cookies + a CSRF token; -//! we accept that as the practical ceiling for the unauth path. The -//! cloud (with stored sessions) can paginate later as a follow-up. -//! -//! Falls back to OG-tag scraping of the public profile page if the API -//! returns 401/403 — Instagram has tightened this endpoint multiple -//! times, so we keep the second path warm. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "instagram_profile", - label: "Instagram profile", - description: "Returns full profile metadata + the 12 most recent posts (shortcode, url, type, likes, thumbnail).", - url_patterns: &["https://www.instagram.com/{username}/"], -}; - -/// Instagram's own public web-app identifier. Sent by their JS bundle -/// on every API call, accepted by the unauth endpoint, not a secret. -const IG_APP_ID: &str = "936619743392459"; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if !matches!(host, "www.instagram.com" | "instagram.com") { - return false; - } - let path = url - .split("://") - .nth(1) - .and_then(|s| s.split_once('/')) - .map(|(_, p)| p) - .unwrap_or(""); - let stripped = path - .split(['?', '#']) - .next() - .unwrap_or("") - .trim_end_matches('/'); - let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect(); - segs.len() == 1 && !RESERVED.contains(&segs[0]) -} - -const RESERVED: &[&str] = &[ - "p", - "reel", - "reels", - "tv", - "explore", - "stories", - "directory", - "accounts", - "about", - "developer", - "press", - "api", - "ads", - "blog", - "fragments", - "terms", - "privacy", - "session", - "login", - "signup", -]; - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let username = parse_username(url).ok_or_else(|| { - FetchError::Build(format!( - "instagram_profile: cannot parse username from '{url}'" - )) - })?; - - let api_url = - format!("https://www.instagram.com/api/v1/users/web_profile_info/?username={username}"); - let extra_headers: &[(&str, &str)] = &[ - ("x-ig-app-id", IG_APP_ID), - ("accept", "*/*"), - ("sec-fetch-site", "same-origin"), - ("x-requested-with", "XMLHttpRequest"), - ]; - let resp = client.fetch_with_headers(&api_url, extra_headers).await?; - - if resp.status == 404 { - return Err(FetchError::Build(format!( - "instagram_profile: '{username}' not found" - ))); - } - // Auth wall fallback: Instagram occasionally tightens this endpoint - // and starts returning 401/403/302 to a login page. When that - // happens we still want to give the caller something useful — the - // OG tags from the public HTML page (no posts list, but bio etc). - if !(200..300).contains(&resp.status) { - return og_fallback(client, &username, url, resp.status).await; - } - - let body: ApiResponse = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("instagram_profile parse: {e}")))?; - let user = body.data.user; - - let recent_posts: Vec = user - .edge_owner_to_timeline_media - .as_ref() - .map(|m| m.edges.iter().map(|e| post_summary(&e.node)).collect()) - .unwrap_or_default(); - - Ok(json!({ - "url": url, - "canonical_url": format!("https://www.instagram.com/{username}/"), - "username": user.username.unwrap_or(username), - "data_completeness": "api", - "user_id": user.id, - "full_name": user.full_name, - "biography": user.biography, - "biography_links": user.bio_links, - "external_url": user.external_url, - "category": user.category_name, - "follower_count": user.edge_followed_by.map(|c| c.count), - "following_count": user.edge_follow.map(|c| c.count), - "post_count": user.edge_owner_to_timeline_media.as_ref().map(|m| m.count), - "is_verified": user.is_verified, - "is_private": user.is_private, - "is_business": user.is_business_account, - "is_professional": user.is_professional_account, - "profile_pic_url": user.profile_pic_url_hd.or(user.profile_pic_url), - "recent_posts": recent_posts, - })) -} - -/// Build the per-post summary the caller fans out from. Includes a -/// constructed `url` so the loop is `for p in recent_posts: scrape('instagram_post', p.url)`. -fn post_summary(n: &MediaNode) -> Value { - let kind = classify(n); - let url = match kind { - "reel" => format!( - "https://www.instagram.com/reel/{}/", - n.shortcode.as_deref().unwrap_or("") - ), - _ => format!( - "https://www.instagram.com/p/{}/", - n.shortcode.as_deref().unwrap_or("") - ), - }; - let caption = n - .edge_media_to_caption - .as_ref() - .and_then(|c| c.edges.first()) - .and_then(|e| e.node.text.clone()); - json!({ - "shortcode": n.shortcode, - "url": url, - "kind": kind, - "is_video": n.is_video.unwrap_or(false), - "video_views": n.video_view_count, - "thumbnail_url": n.thumbnail_src.clone().or_else(|| n.display_url.clone()), - "display_url": n.display_url, - "like_count": n.edge_media_preview_like.as_ref().map(|c| c.count), - "comment_count": n.edge_media_to_comment.as_ref().map(|c| c.count), - "taken_at": n.taken_at_timestamp, - "caption": caption, - "alt_text": n.accessibility_caption, - "dimensions": n.dimensions.as_ref().map(|d| json!({"width": d.width, "height": d.height})), - "product_type": n.product_type, - }) -} - -/// Best-effort post-type classification. `clips` is reels; `feed` is -/// the regular grid. Sidecar = multi-photo carousel. -fn classify(n: &MediaNode) -> &'static str { - if n.product_type.as_deref() == Some("clips") { - return "reel"; - } - match n.typename.as_deref() { - Some("GraphSidecar") => "carousel", - Some("GraphVideo") => "video", - Some("GraphImage") => "photo", - _ => "post", - } -} - -/// Fallback when the API path is blocked: hit the public profile HTML, -/// pull whatever OG tags we can. Returns less data and explicitly -/// flags `data_completeness: "og_only"` so callers know. -async fn og_fallback( - client: &dyn Fetcher, - username: &str, - original_url: &str, - api_status: u16, -) -> Result { - let canonical = format!("https://www.instagram.com/{username}/"); - let resp = client.fetch(&canonical).await?; - if resp.status != 200 { - return Err(FetchError::Build(format!( - "instagram_profile: api status {api_status}, html status {} for {username}", - resp.status - ))); - } - let og = parse_og_tags(&resp.html); - let (followers, following, posts) = - parse_counts_from_og_description(og.get("description").map(String::as_str)); - - Ok(json!({ - "url": original_url, - "canonical_url": canonical, - "username": username, - "data_completeness": "og_only", - "fallback_reason": format!("api returned {api_status}"), - "full_name": parse_full_name(&og.get("title").cloned().unwrap_or_default()), - "follower_count": followers, - "following_count": following, - "post_count": posts, - "profile_pic_url": og.get("image").cloned(), - "biography": null_value(), - "is_verified": null_value(), - "is_business": null_value(), - "recent_posts": Vec::::new(), - })) -} - -fn null_value() -> Value { - Value::Null -} - -// --------------------------------------------------------------------------- -// URL parsing -// --------------------------------------------------------------------------- - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -fn parse_username(url: &str) -> Option { - let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?; - let stripped = path.split(['?', '#']).next()?.trim_end_matches('/'); - stripped - .split('/') - .find(|s| !s.is_empty()) - .map(|s| s.to_string()) -} - -// --------------------------------------------------------------------------- -// OG-fallback helpers (kept self-contained — same shape as the previous -// version we shipped, retained as the safety net) -// --------------------------------------------------------------------------- - -fn parse_og_tags(html: &str) -> std::collections::HashMap { - use regex::Regex; - use std::sync::OnceLock; - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new(r#"(?i)]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() - }); - let mut out = std::collections::HashMap::new(); - for c in re.captures_iter(html) { - let k = c - .get(1) - .map(|m| m.as_str().to_lowercase()) - .unwrap_or_default(); - let v = c - .get(2) - .map(|m| html_decode(m.as_str())) - .unwrap_or_default(); - out.entry(k).or_insert(v); - } - out -} - -fn parse_full_name(og_title: &str) -> Option { - if og_title.is_empty() { - return None; - } - let decoded = html_decode(og_title); - let trimmed = decoded.split('(').next().unwrap_or(&decoded).trim(); - if trimmed.is_empty() { - None - } else { - Some(trimmed.to_string()) - } -} - -fn parse_counts_from_og_description(desc: Option<&str>) -> (Option, Option, Option) { - let Some(text) = desc else { - return (None, None, None); - }; - let decoded = html_decode(text); - use regex::Regex; - use std::sync::OnceLock; - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new(r"(?i)([\d.,]+[KMB]?)\s*Followers,\s*([\d.,]+[KMB]?)\s*Following,\s*([\d.,]+[KMB]?)\s*Posts").unwrap() - }); - if let Some(c) = re.captures(&decoded) { - return ( - c.get(1).and_then(|m| parse_compact_number(m.as_str())), - c.get(2).and_then(|m| parse_compact_number(m.as_str())), - c.get(3).and_then(|m| parse_compact_number(m.as_str())), - ); - } - (None, None, None) -} - -fn parse_compact_number(s: &str) -> Option { - let s = s.trim(); - let (num_str, mul) = match s.chars().last() { - Some('K') => (&s[..s.len() - 1], 1_000i64), - Some('M') => (&s[..s.len() - 1], 1_000_000i64), - Some('B') => (&s[..s.len() - 1], 1_000_000_000i64), - _ => (s, 1i64), - }; - let cleaned: String = num_str.chars().filter(|c| *c != ',').collect(); - cleaned.parse::().ok().map(|f| (f * mul as f64) as i64) -} - -fn html_decode(s: &str) -> String { - s.replace("&", "&") - .replace("<", "<") - .replace(">", ">") - .replace(""", "\"") - .replace("'", "'") - .replace("@", "@") - .replace("•", "•") - .replace("…", "…") -} - -// --------------------------------------------------------------------------- -// Instagram web_profile_info API types -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct ApiResponse { - data: ApiData, -} - -#[derive(Deserialize)] -struct ApiData { - user: User, -} - -#[derive(Deserialize)] -struct User { - id: Option, - username: Option, - full_name: Option, - biography: Option, - bio_links: Option>, - external_url: Option, - category_name: Option, - profile_pic_url: Option, - profile_pic_url_hd: Option, - is_verified: Option, - is_private: Option, - is_business_account: Option, - is_professional_account: Option, - edge_followed_by: Option, - edge_follow: Option, - edge_owner_to_timeline_media: Option, -} - -#[derive(Deserialize)] -struct EdgeCount { - count: i64, -} - -#[derive(Deserialize)] -struct MediaEdges { - count: i64, - edges: Vec, -} - -#[derive(Deserialize)] -struct MediaEdge { - node: MediaNode, -} - -#[derive(Deserialize)] -struct MediaNode { - #[serde(rename = "__typename")] - typename: Option, - shortcode: Option, - is_video: Option, - video_view_count: Option, - display_url: Option, - thumbnail_src: Option, - accessibility_caption: Option, - taken_at_timestamp: Option, - product_type: Option, - dimensions: Option, - edge_media_preview_like: Option, - edge_media_to_comment: Option, - edge_media_to_caption: Option, -} - -#[derive(Deserialize)] -struct Dimensions { - width: i64, - height: i64, -} - -#[derive(Deserialize)] -struct CaptionEdges { - edges: Vec, -} - -#[derive(Deserialize)] -struct CaptionEdge { - node: CaptionNode, -} - -#[derive(Deserialize)] -struct CaptionNode { - text: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_profile_urls() { - assert!(matches("https://www.instagram.com/ticketswave")); - assert!(matches("https://www.instagram.com/ticketswave/")); - assert!(matches("https://instagram.com/0xmassi/?hl=en")); - assert!(!matches("https://www.instagram.com/p/DT-RICMjeK5/")); - assert!(!matches("https://www.instagram.com/explore")); - assert!(!matches("https://www.instagram.com/")); - assert!(!matches("https://example.com/foo")); - } - - #[test] - fn parse_full_name_strips_handle() { - assert_eq!( - parse_full_name("Ticket Wave (@ticketswave) • Instagram photos and videos"), - Some("Ticket Wave".into()) - ); - } - - #[test] - fn compact_number_handles_kmb() { - assert_eq!(parse_compact_number("18K"), Some(18_000)); - assert_eq!(parse_compact_number("1.5M"), Some(1_500_000)); - assert_eq!(parse_compact_number("1,234"), Some(1_234)); - assert_eq!(parse_compact_number("641"), Some(641)); - } -} diff --git a/crates/webclaw-fetch/src/extractors/linkedin_post.rs b/crates/webclaw-fetch/src/extractors/linkedin_post.rs deleted file mode 100644 index ed7e07b..0000000 --- a/crates/webclaw-fetch/src/extractors/linkedin_post.rs +++ /dev/null @@ -1,266 +0,0 @@ -//! LinkedIn post structured extractor. -//! -//! Uses the public embed endpoint `/embed/feed/update/{urn}` which -//! LinkedIn provides for sites that want to render a post inline. No -//! auth required, returns SSR HTML with the full post body, OG tags, -//! image, and a link back to the original post. -//! -//! Accepts both URN forms (`urn:li:share:N` and `urn:li:activity:N`) -//! and pretty post URLs (`/posts/{user}_{slug}-{id}-{suffix}`) by -//! pulling the trailing numeric id and converting to an activity URN. - -use regex::Regex; -use serde_json::{Value, json}; -use std::sync::OnceLock; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "linkedin_post", - label: "LinkedIn post", - description: "Returns post body, author name, image, and original URL via LinkedIn's public embed endpoint.", - url_patterns: &[ - "https://www.linkedin.com/feed/update/urn:li:share:{id}", - "https://www.linkedin.com/feed/update/urn:li:activity:{id}", - "https://www.linkedin.com/posts/{user}_{slug}-{id}-{suffix}", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if !matches!(host, "www.linkedin.com" | "linkedin.com") { - return false; - } - url.contains("/feed/update/urn:li:") || url.contains("/posts/") -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let urn = extract_urn(url).ok_or_else(|| { - FetchError::Build(format!( - "linkedin_post: cannot extract URN from '{url}' (expected /feed/update/urn:li:... or /posts/{{slug}}-{{id}})" - )) - })?; - - let embed_url = format!("https://www.linkedin.com/embed/feed/update/{urn}"); - let resp = client.fetch(&embed_url).await?; - if resp.status != 200 { - return Err(FetchError::Build(format!( - "linkedin embed returned status {} for {urn}", - resp.status - ))); - } - - let html = &resp.html; - let og = parse_og_tags(html); - let body = parse_post_body(html); - let author = parse_author(html); - let canonical_url = og.get("url").cloned().unwrap_or_else(|| embed_url.clone()); - - Ok(json!({ - "url": url, - "embed_url": embed_url, - "urn": urn, - "canonical_url": canonical_url, - "data_completeness": "embed", - "title": og.get("title").cloned(), - "body": body, - "author_name": author, - "image_url": og.get("image").cloned(), - "site_name": og.get("site_name").cloned().unwrap_or_else(|| "LinkedIn".into()), - })) -} - -// --------------------------------------------------------------------------- -// URN extraction -// --------------------------------------------------------------------------- - -/// Pull a `urn:li:share:N` or `urn:li:activity:N` from any LinkedIn URL. -/// `/posts/{slug}-{id}-{suffix}` URLs encode the activity id as the second- -/// to-last `-` separated chunk. Both forms map to a URN we can hit the -/// embed endpoint with. -fn extract_urn(url: &str) -> Option { - if let Some(idx) = url.find("urn:li:") { - let tail = &url[idx..]; - let end = tail.find(['/', '?', '#']).unwrap_or(tail.len()); - let urn = &tail[..end]; - // Validate shape: urn:li:{type}:{digits} - let mut parts = urn.split(':'); - if parts.next() == Some("urn") - && parts.next() == Some("li") - && parts.next().is_some() - && parts - .next() - .filter(|p| p.chars().all(|c| c.is_ascii_digit())) - .is_some() - { - return Some(urn.to_string()); - } - } - - // /posts/{user}_{slug}-{19-digit-id}-{4-char-hash}/ — id is the second- - // to-last segment after the last `-`. - if url.contains("/posts/") { - static RE: OnceLock = OnceLock::new(); - let re = - RE.get_or_init(|| Regex::new(r"/posts/[^/]*?-(\d{15,})-[A-Za-z0-9]{2,}/?").unwrap()); - if let Some(c) = re.captures(url) - && let Some(id) = c.get(1) - { - return Some(format!("urn:li:activity:{}", id.as_str())); - } - } - None -} - -// --------------------------------------------------------------------------- -// HTML scraping -// --------------------------------------------------------------------------- - -/// Pull `og:foo` → value pairs out of ``. -/// Returns lowercased keys with leading `og:` stripped. -fn parse_og_tags(html: &str) -> std::collections::HashMap { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new(r#"(?i)]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() - }); - let mut out = std::collections::HashMap::new(); - for c in re.captures_iter(html) { - let k = c - .get(1) - .map(|m| m.as_str().to_lowercase()) - .unwrap_or_default(); - let v = c - .get(2) - .map(|m| html_decode(m.as_str())) - .unwrap_or_default(); - out.entry(k).or_insert(v); - } - out -} - -/// Extract the post body text from the embed page. LinkedIn renders it -/// inside `

{text}

` -/// where the inner content can include nested `` tags for links. -fn parse_post_body(html: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new( - r#"(?s)]+class="[^"]*attributed-text-segment-list__content[^"]*"[^>]*>(.*?)

"#, - ) - .unwrap() - }); - let inner = re.captures(html).and_then(|c| c.get(1))?.as_str(); - Some(strip_tags(inner).trim().to_string()) -} - -/// Author name lives in the `` like: -/// "55 founding members are in… | Orc Dev" -/// The chunk after the final `|` is the author display name. Falls back -/// to the og:title minus the post body if there's no title. -fn parse_author(html: &str) -> Option<String> { - static RE_TITLE: OnceLock<Regex> = OnceLock::new(); - let re = RE_TITLE.get_or_init(|| Regex::new(r"<title>([^<]+)").unwrap()); - let title = re.captures(html).and_then(|c| c.get(1))?.as_str(); - title - .rsplit_once('|') - .map(|(_, name)| html_decode(name.trim())) -} - -/// Replace the small set of HTML entities LinkedIn (and Instagram, etc.) -/// stuff into OG content attributes. -fn html_decode(s: &str) -> String { - s.replace("&", "&") - .replace("<", "<") - .replace(">", ">") - .replace(""", "\"") - .replace("'", "'") - .replace("@", "@") - .replace("•", "•") - .replace("…", "…") -} - -/// Crude HTML tag stripper for the post body. Preserves text inside -/// nested anchors so URLs don't disappear, and collapses runs of -/// whitespace introduced by line wrapping. -fn strip_tags(html: &str) -> String { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| Regex::new(r"<[^>]+>").unwrap()); - let no_tags = re.replace_all(html, "").to_string(); - html_decode(&no_tags) -} - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_li_post_urls() { - assert!(matches( - "https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/" - )); - assert!(matches( - "https://www.linkedin.com/feed/update/urn:li:activity:7452618583290892288" - )); - assert!(matches( - "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c" - )); - assert!(!matches("https://www.linkedin.com/in/foo")); - assert!(!matches("https://www.linkedin.com/")); - assert!(!matches("https://example.com/feed/update/urn:li:share:1")); - } - - #[test] - fn extract_urn_from_share_url() { - assert_eq!( - extract_urn("https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/"), - Some("urn:li:share:7452618582213144577".into()) - ); - } - - #[test] - fn extract_urn_from_pretty_post_url() { - assert_eq!( - extract_urn( - "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c/" - ), - Some("urn:li:activity:7452618583290892288".into()) - ); - } - - #[test] - fn parse_og_tags_basic() { - let html = r#" -"#; - let og = parse_og_tags(html); - assert_eq!( - og.get("image").map(String::as_str), - Some("https://x.com/a.png") - ); - assert_eq!( - og.get("url").map(String::as_str), - Some("https://example.com/x") - ); - } - - #[test] - fn parse_post_body_strips_anchor_tags() { - let html = r#"

Hello link world

"#; - assert_eq!(parse_post_body(html).as_deref(), Some("Hello link world")); - } - - #[test] - fn html_decode_handles_common_entities() { - assert_eq!(html_decode("AT&T @jane"), "AT&T @jane"); - } -} diff --git a/crates/webclaw-fetch/src/extractors/mod.rs b/crates/webclaw-fetch/src/extractors/mod.rs deleted file mode 100644 index 91ef8d0..0000000 --- a/crates/webclaw-fetch/src/extractors/mod.rs +++ /dev/null @@ -1,502 +0,0 @@ -//! Vertical extractors: site-specific parsers that return typed JSON -//! instead of generic markdown. -//! -//! Each extractor handles a single site or platform and exposes: -//! - `matches(url)` to claim ownership of a URL pattern -//! - `extract(client, url)` to fetch + parse into a typed JSON `Value` -//! - `INFO` static for the catalog (`/v1/extractors`) -//! -//! The dispatch in this module is a simple `match`-style chain rather than -//! a trait registry. With ~30 extractors that's still fast and avoids the -//! ceremony of dynamic dispatch. If we hit 50+ we'll revisit. -//! -//! Extractors prefer official JSON APIs over HTML scraping where one -//! exists (Reddit, HN/Algolia, PyPI, npm, GitHub, HuggingFace all have -//! one). HTML extraction is the fallback for sites that don't. - -pub mod amazon_product; -pub mod arxiv; -pub mod crates_io; -pub mod dev_to; -pub mod docker_hub; -pub mod ebay_listing; -pub mod ecommerce_product; -pub mod etsy_listing; -pub mod github_issue; -pub mod github_pr; -pub mod github_release; -pub mod github_repo; -pub mod hackernews; -pub mod huggingface_dataset; -pub mod huggingface_model; -pub mod instagram_post; -pub mod instagram_profile; -pub mod linkedin_post; -pub mod npm; -pub mod pypi; -pub mod reddit; -pub mod shopify_collection; -pub mod shopify_product; -pub mod stackoverflow; -pub mod substack_post; -pub mod trustpilot_reviews; -pub mod woocommerce_product; -pub mod youtube_video; - -use serde::Serialize; -use serde_json::Value; - -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -/// Public catalog entry for `/v1/extractors`. Stable shape — clients -/// rely on `name` to pick the right `/v1/scrape/{name}` route. -#[derive(Debug, Clone, Serialize)] -pub struct ExtractorInfo { - /// URL-safe identifier (`reddit`, `hackernews`, `github_repo`, ...). - pub name: &'static str, - /// Human-friendly display name. - pub label: &'static str, - /// One-line description of what the extractor returns. - pub description: &'static str, - /// Glob-ish URL pattern(s) the extractor claims. For documentation; - /// the actual matching is done by the extractor's `matches` fn. - pub url_patterns: &'static [&'static str], -} - -/// Full catalog. Order is stable; new entries append. -pub fn list() -> Vec { - vec![ - reddit::INFO, - hackernews::INFO, - github_repo::INFO, - github_pr::INFO, - github_issue::INFO, - github_release::INFO, - pypi::INFO, - npm::INFO, - crates_io::INFO, - huggingface_model::INFO, - huggingface_dataset::INFO, - arxiv::INFO, - docker_hub::INFO, - dev_to::INFO, - stackoverflow::INFO, - substack_post::INFO, - youtube_video::INFO, - linkedin_post::INFO, - instagram_post::INFO, - instagram_profile::INFO, - shopify_product::INFO, - shopify_collection::INFO, - ecommerce_product::INFO, - woocommerce_product::INFO, - amazon_product::INFO, - ebay_listing::INFO, - etsy_listing::INFO, - trustpilot_reviews::INFO, - ] -} - -/// Auto-detect mode: try every extractor's `matches`, return the first -/// one that claims the URL. Used by `/v1/scrape` when the caller doesn't -/// pick a vertical explicitly. -pub async fn dispatch_by_url( - client: &dyn Fetcher, - url: &str, -) -> Option> { - if reddit::matches(url) { - return Some( - reddit::extract(client, url) - .await - .map(|v| (reddit::INFO.name, v)), - ); - } - if hackernews::matches(url) { - return Some( - hackernews::extract(client, url) - .await - .map(|v| (hackernews::INFO.name, v)), - ); - } - if github_repo::matches(url) { - return Some( - github_repo::extract(client, url) - .await - .map(|v| (github_repo::INFO.name, v)), - ); - } - if pypi::matches(url) { - return Some( - pypi::extract(client, url) - .await - .map(|v| (pypi::INFO.name, v)), - ); - } - if npm::matches(url) { - return Some(npm::extract(client, url).await.map(|v| (npm::INFO.name, v))); - } - if github_pr::matches(url) { - return Some( - github_pr::extract(client, url) - .await - .map(|v| (github_pr::INFO.name, v)), - ); - } - if github_issue::matches(url) { - return Some( - github_issue::extract(client, url) - .await - .map(|v| (github_issue::INFO.name, v)), - ); - } - if github_release::matches(url) { - return Some( - github_release::extract(client, url) - .await - .map(|v| (github_release::INFO.name, v)), - ); - } - if crates_io::matches(url) { - return Some( - crates_io::extract(client, url) - .await - .map(|v| (crates_io::INFO.name, v)), - ); - } - if huggingface_model::matches(url) { - return Some( - huggingface_model::extract(client, url) - .await - .map(|v| (huggingface_model::INFO.name, v)), - ); - } - if huggingface_dataset::matches(url) { - return Some( - huggingface_dataset::extract(client, url) - .await - .map(|v| (huggingface_dataset::INFO.name, v)), - ); - } - if arxiv::matches(url) { - return Some( - arxiv::extract(client, url) - .await - .map(|v| (arxiv::INFO.name, v)), - ); - } - if docker_hub::matches(url) { - return Some( - docker_hub::extract(client, url) - .await - .map(|v| (docker_hub::INFO.name, v)), - ); - } - if dev_to::matches(url) { - return Some( - dev_to::extract(client, url) - .await - .map(|v| (dev_to::INFO.name, v)), - ); - } - if stackoverflow::matches(url) { - return Some( - stackoverflow::extract(client, url) - .await - .map(|v| (stackoverflow::INFO.name, v)), - ); - } - if linkedin_post::matches(url) { - return Some( - linkedin_post::extract(client, url) - .await - .map(|v| (linkedin_post::INFO.name, v)), - ); - } - if instagram_post::matches(url) { - return Some( - instagram_post::extract(client, url) - .await - .map(|v| (instagram_post::INFO.name, v)), - ); - } - if instagram_profile::matches(url) { - return Some( - instagram_profile::extract(client, url) - .await - .map(|v| (instagram_profile::INFO.name, v)), - ); - } - // Antibot-gated verticals with unique hosts: safe to auto-dispatch - // because the matcher can't confuse the URL for anything else. The - // extractor's smart_fetch_html path handles the blocked-without- - // API-key case with a clear actionable error. - if amazon_product::matches(url) { - return Some( - amazon_product::extract(client, url) - .await - .map(|v| (amazon_product::INFO.name, v)), - ); - } - if ebay_listing::matches(url) { - return Some( - ebay_listing::extract(client, url) - .await - .map(|v| (ebay_listing::INFO.name, v)), - ); - } - if etsy_listing::matches(url) { - return Some( - etsy_listing::extract(client, url) - .await - .map(|v| (etsy_listing::INFO.name, v)), - ); - } - if trustpilot_reviews::matches(url) { - return Some( - trustpilot_reviews::extract(client, url) - .await - .map(|v| (trustpilot_reviews::INFO.name, v)), - ); - } - if youtube_video::matches(url) { - return Some( - youtube_video::extract(client, url) - .await - .map(|v| (youtube_video::INFO.name, v)), - ); - } - // NOTE: shopify_product, shopify_collection, ecommerce_product, - // woocommerce_product, and substack_post are intentionally NOT - // in auto-dispatch. Their `matches()` functions are permissive - // (any URL with `/products/`, `/product/`, `/p/`, etc.) and - // claiming those generically would steal URLs from the default - // `/v1/scrape` markdown flow. Callers opt in via - // `/v1/scrape/shopify_product` or `/v1/scrape/ecommerce_product`. - None -} - -/// Explicit mode: caller picked the vertical (`POST /v1/scrape/reddit`). -/// We still validate that the URL plausibly belongs to that vertical so -/// users get a clear "wrong route" error instead of a confusing parse -/// failure deep in the extractor. -pub async fn dispatch_by_name( - client: &dyn Fetcher, - name: &str, - url: &str, -) -> Result { - match name { - n if n == reddit::INFO.name => { - run_or_mismatch(reddit::matches(url), n, url, || { - reddit::extract(client, url) - }) - .await - } - n if n == hackernews::INFO.name => { - run_or_mismatch(hackernews::matches(url), n, url, || { - hackernews::extract(client, url) - }) - .await - } - n if n == github_repo::INFO.name => { - run_or_mismatch(github_repo::matches(url), n, url, || { - github_repo::extract(client, url) - }) - .await - } - n if n == pypi::INFO.name => { - run_or_mismatch(pypi::matches(url), n, url, || pypi::extract(client, url)).await - } - n if n == npm::INFO.name => { - run_or_mismatch(npm::matches(url), n, url, || npm::extract(client, url)).await - } - n if n == github_pr::INFO.name => { - run_or_mismatch(github_pr::matches(url), n, url, || { - github_pr::extract(client, url) - }) - .await - } - n if n == github_issue::INFO.name => { - run_or_mismatch(github_issue::matches(url), n, url, || { - github_issue::extract(client, url) - }) - .await - } - n if n == github_release::INFO.name => { - run_or_mismatch(github_release::matches(url), n, url, || { - github_release::extract(client, url) - }) - .await - } - n if n == crates_io::INFO.name => { - run_or_mismatch(crates_io::matches(url), n, url, || { - crates_io::extract(client, url) - }) - .await - } - n if n == huggingface_model::INFO.name => { - run_or_mismatch(huggingface_model::matches(url), n, url, || { - huggingface_model::extract(client, url) - }) - .await - } - n if n == huggingface_dataset::INFO.name => { - run_or_mismatch(huggingface_dataset::matches(url), n, url, || { - huggingface_dataset::extract(client, url) - }) - .await - } - n if n == arxiv::INFO.name => { - run_or_mismatch(arxiv::matches(url), n, url, || arxiv::extract(client, url)).await - } - n if n == docker_hub::INFO.name => { - run_or_mismatch(docker_hub::matches(url), n, url, || { - docker_hub::extract(client, url) - }) - .await - } - n if n == dev_to::INFO.name => { - run_or_mismatch(dev_to::matches(url), n, url, || { - dev_to::extract(client, url) - }) - .await - } - n if n == stackoverflow::INFO.name => { - run_or_mismatch(stackoverflow::matches(url), n, url, || { - stackoverflow::extract(client, url) - }) - .await - } - n if n == linkedin_post::INFO.name => { - run_or_mismatch(linkedin_post::matches(url), n, url, || { - linkedin_post::extract(client, url) - }) - .await - } - n if n == instagram_post::INFO.name => { - run_or_mismatch(instagram_post::matches(url), n, url, || { - instagram_post::extract(client, url) - }) - .await - } - n if n == instagram_profile::INFO.name => { - run_or_mismatch(instagram_profile::matches(url), n, url, || { - instagram_profile::extract(client, url) - }) - .await - } - n if n == shopify_product::INFO.name => { - run_or_mismatch(shopify_product::matches(url), n, url, || { - shopify_product::extract(client, url) - }) - .await - } - n if n == ecommerce_product::INFO.name => { - run_or_mismatch(ecommerce_product::matches(url), n, url, || { - ecommerce_product::extract(client, url) - }) - .await - } - n if n == amazon_product::INFO.name => { - run_or_mismatch(amazon_product::matches(url), n, url, || { - amazon_product::extract(client, url) - }) - .await - } - n if n == ebay_listing::INFO.name => { - run_or_mismatch(ebay_listing::matches(url), n, url, || { - ebay_listing::extract(client, url) - }) - .await - } - n if n == etsy_listing::INFO.name => { - run_or_mismatch(etsy_listing::matches(url), n, url, || { - etsy_listing::extract(client, url) - }) - .await - } - n if n == trustpilot_reviews::INFO.name => { - run_or_mismatch(trustpilot_reviews::matches(url), n, url, || { - trustpilot_reviews::extract(client, url) - }) - .await - } - n if n == youtube_video::INFO.name => { - run_or_mismatch(youtube_video::matches(url), n, url, || { - youtube_video::extract(client, url) - }) - .await - } - n if n == substack_post::INFO.name => { - run_or_mismatch(substack_post::matches(url), n, url, || { - substack_post::extract(client, url) - }) - .await - } - n if n == shopify_collection::INFO.name => { - run_or_mismatch(shopify_collection::matches(url), n, url, || { - shopify_collection::extract(client, url) - }) - .await - } - n if n == woocommerce_product::INFO.name => { - run_or_mismatch(woocommerce_product::matches(url), n, url, || { - woocommerce_product::extract(client, url) - }) - .await - } - _ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())), - } -} - -/// Errors that the dispatcher itself raises (vs. errors from inside an -/// extractor, which come back wrapped in `Fetch`). -#[derive(Debug, thiserror::Error)] -pub enum ExtractorDispatchError { - #[error("unknown vertical: '{0}'")] - UnknownVertical(String), - - #[error("URL '{url}' does not match the '{vertical}' extractor")] - UrlMismatch { vertical: String, url: String }, - - #[error(transparent)] - Fetch(#[from] FetchError), -} - -/// Helper: when the caller explicitly picked a vertical but their URL -/// doesn't match it, return `UrlMismatch` instead of running the -/// extractor (which would just fail with a less-clear error). -async fn run_or_mismatch( - matches: bool, - vertical: &str, - url: &str, - f: F, -) -> Result -where - F: FnOnce() -> Fut, - Fut: std::future::Future>, -{ - if !matches { - return Err(ExtractorDispatchError::UrlMismatch { - vertical: vertical.to_string(), - url: url.to_string(), - }); - } - f().await.map_err(ExtractorDispatchError::Fetch) -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn list_is_non_empty_and_unique() { - let entries = list(); - assert!(!entries.is_empty()); - let mut names: Vec<_> = entries.iter().map(|e| e.name).collect(); - names.sort(); - let before = names.len(); - names.dedup(); - assert_eq!(before, names.len(), "extractor names must be unique"); - } -} diff --git a/crates/webclaw-fetch/src/extractors/npm.rs b/crates/webclaw-fetch/src/extractors/npm.rs deleted file mode 100644 index f84da0e..0000000 --- a/crates/webclaw-fetch/src/extractors/npm.rs +++ /dev/null @@ -1,235 +0,0 @@ -//! npm package structured extractor. -//! -//! Uses two npm-run APIs: -//! - `registry.npmjs.org/{name}` for full package metadata -//! - `api.npmjs.org/downloads/point/last-week/{name}` for usage signal -//! -//! The registry API returns the *full* document including every version -//! ever published, which can be tens of MB for popular packages -//! (`@types/node` etc). We strip down to the latest version's manifest -//! and a count of releases — full history would explode the response. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "npm", - label: "npm package", - description: "Returns package metadata: latest version manifest, dependencies, weekly downloads, license.", - url_patterns: &["https://www.npmjs.com/package/{name}"], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if host != "www.npmjs.com" && host != "npmjs.com" { - return false; - } - url.contains("/package/") -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let name = parse_name(url) - .ok_or_else(|| FetchError::Build(format!("npm: cannot parse name from '{url}'")))?; - - let registry_url = format!("https://registry.npmjs.org/{}", urlencode_segment(&name)); - let resp = client.fetch(®istry_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "npm: package '{name}' not found" - ))); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "npm registry returned status {}", - resp.status - ))); - } - - let pkg: PackageDoc = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("npm registry parse: {e}")))?; - - // Resolve "latest" to a concrete version. - let latest_version = pkg - .dist_tags - .as_ref() - .and_then(|t| t.get("latest")) - .cloned() - .or_else(|| pkg.versions.as_ref().and_then(|v| v.keys().last().cloned())); - - let latest_manifest = latest_version - .as_deref() - .and_then(|v| pkg.versions.as_ref().and_then(|m| m.get(v))); - - let release_count = pkg.versions.as_ref().map(|v| v.len()).unwrap_or(0); - let latest_release_date = latest_version - .as_deref() - .and_then(|v| pkg.time.as_ref().and_then(|t| t.get(v).cloned())); - - // Best-effort weekly downloads. If the api.npmjs.org call fails we - // surface `null` rather than failing the whole extractor — npm - // sometimes 503s the downloads endpoint while the registry is up. - let weekly_downloads = fetch_weekly_downloads(client, &name).await.ok(); - - Ok(json!({ - "url": url, - "name": pkg.name.clone().unwrap_or(name.clone()), - "description": pkg.description, - "latest_version": latest_version, - "license": latest_manifest.and_then(|m| m.license.clone()), - "homepage": pkg.homepage, - "repository": pkg.repository.as_ref().and_then(|r| r.url.clone()), - "dependencies": latest_manifest.and_then(|m| m.dependencies.clone()), - "dev_dependencies": latest_manifest.and_then(|m| m.dev_dependencies.clone()), - "peer_dependencies": latest_manifest.and_then(|m| m.peer_dependencies.clone()), - "keywords": pkg.keywords, - "maintainers": pkg.maintainers, - "deprecated": latest_manifest.and_then(|m| m.deprecated.clone()), - "release_count": release_count, - "latest_release_date": latest_release_date, - "weekly_downloads": weekly_downloads, - })) -} - -async fn fetch_weekly_downloads(client: &dyn Fetcher, name: &str) -> Result { - let url = format!( - "https://api.npmjs.org/downloads/point/last-week/{}", - urlencode_segment(name) - ); - let resp = client.fetch(&url).await?; - if resp.status != 200 { - return Err(FetchError::Build(format!( - "npm downloads api status {}", - resp.status - ))); - } - let dl: Downloads = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("npm downloads parse: {e}")))?; - Ok(dl.downloads) -} - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -/// Extract the package name from an npmjs.com URL. Handles scoped packages -/// (`/package/@scope/name`) and trailing path segments (`/v/x.y.z`). -fn parse_name(url: &str) -> Option { - let after = url.split("/package/").nth(1)?; - let stripped = after.split(['?', '#']).next()?.trim_end_matches('/'); - let mut segs = stripped.split('/').filter(|s| !s.is_empty()); - let first = segs.next()?; - if first.starts_with('@') { - let second = segs.next()?; - Some(format!("{first}/{second}")) - } else { - Some(first.to_string()) - } -} - -/// `@scope/name` must encode the `/` for the registry path. Plain names -/// pass through untouched. -fn urlencode_segment(name: &str) -> String { - name.replace('/', "%2F") -} - -// --------------------------------------------------------------------------- -// Registry types -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct PackageDoc { - name: Option, - description: Option, - homepage: Option, // sometimes string, sometimes object - repository: Option, - keywords: Option>, - maintainers: Option>, - #[serde(rename = "dist-tags")] - dist_tags: Option>, - versions: Option>, - time: Option>, -} - -#[derive(Deserialize, Default, Clone)] -struct VersionManifest { - license: Option, // string or object - dependencies: Option>, - #[serde(rename = "devDependencies")] - dev_dependencies: Option>, - #[serde(rename = "peerDependencies")] - peer_dependencies: Option>, - // `deprecated` is sometimes a bool and sometimes a string in the - // registry. serde_json::Value covers both without failing the parse. - deprecated: Option, -} - -#[derive(Deserialize)] -struct Repository { - url: Option, -} - -#[derive(Deserialize, Clone)] -struct Maintainer { - name: Option, - email: Option, -} - -impl serde::Serialize for Maintainer { - fn serialize(&self, s: S) -> Result { - use serde::ser::SerializeMap; - let mut m = s.serialize_map(Some(2))?; - m.serialize_entry("name", &self.name)?; - m.serialize_entry("email", &self.email)?; - m.end() - } -} - -#[derive(Deserialize)] -struct Downloads { - downloads: i64, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_npm_package_urls() { - assert!(matches("https://www.npmjs.com/package/react")); - assert!(matches("https://www.npmjs.com/package/@types/node")); - assert!(matches("https://npmjs.com/package/lodash")); - assert!(!matches("https://www.npmjs.com/")); - assert!(!matches("https://example.com/package/foo")); - } - - #[test] - fn parse_name_handles_scoped_and_unscoped() { - assert_eq!( - parse_name("https://www.npmjs.com/package/react"), - Some("react".into()) - ); - assert_eq!( - parse_name("https://www.npmjs.com/package/@types/node"), - Some("@types/node".into()) - ); - assert_eq!( - parse_name("https://www.npmjs.com/package/lodash/v/4.17.21"), - Some("lodash".into()) - ); - } - - #[test] - fn urlencode_only_touches_scope_separator() { - assert_eq!(urlencode_segment("react"), "react"); - assert_eq!(urlencode_segment("@types/node"), "@types%2Fnode"); - } -} diff --git a/crates/webclaw-fetch/src/extractors/pypi.rs b/crates/webclaw-fetch/src/extractors/pypi.rs deleted file mode 100644 index 33a4d1c..0000000 --- a/crates/webclaw-fetch/src/extractors/pypi.rs +++ /dev/null @@ -1,184 +0,0 @@ -//! PyPI package structured extractor. -//! -//! PyPI exposes a stable JSON API at `pypi.org/pypi/{name}/json` and -//! a versioned form at `pypi.org/pypi/{name}/{version}/json`. Both -//! return the full release info plus history. No auth, no rate limits -//! that we hit at normal usage. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "pypi", - label: "PyPI package", - description: "Returns package metadata: latest version, dependencies, license, release history.", - url_patterns: &[ - "https://pypi.org/project/{name}/", - "https://pypi.org/project/{name}/{version}/", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if host != "pypi.org" && host != "www.pypi.org" { - return false; - } - url.contains("/project/") -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let (name, version) = parse_project(url).ok_or_else(|| { - FetchError::Build(format!("pypi: cannot parse package name from '{url}'")) - })?; - - let api_url = match &version { - Some(v) => format!("https://pypi.org/pypi/{name}/{v}/json"), - None => format!("https://pypi.org/pypi/{name}/json"), - }; - let resp = client.fetch(&api_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "pypi: package '{name}' not found" - ))); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "pypi api returned status {}", - resp.status - ))); - } - - let pkg: PypiResponse = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("pypi parse: {e}")))?; - - let info = pkg.info; - let release_count = pkg.releases.as_ref().map(|r| r.len()).unwrap_or(0); - - // Latest release date = max upload time across files in the latest version. - let latest_release_date = pkg - .releases - .as_ref() - .and_then(|map| info.version.as_deref().and_then(|v| map.get(v))) - .and_then(|files| files.iter().filter_map(|f| f.upload_time.clone()).max()); - - // Drop the long description from the JSON shape — it's frequently a 50KB - // README and bloats responses. Callers who need it can hit /v1/scrape. - Ok(json!({ - "url": url, - "name": info.name, - "version": info.version, - "summary": info.summary, - "homepage": info.home_page, - "license": info.license, - "license_classifier": pick_license_classifier(&info.classifiers), - "author": info.author, - "author_email": info.author_email, - "maintainer": info.maintainer, - "requires_python": info.requires_python, - "requires_dist": info.requires_dist, - "keywords": info.keywords, - "classifiers": info.classifiers, - "yanked": info.yanked, - "yanked_reason": info.yanked_reason, - "project_urls": info.project_urls, - "release_count": release_count, - "latest_release_date": latest_release_date, - })) -} - -/// PyPI puts the SPDX-ish license under classifiers like -/// `License :: OSI Approved :: Apache Software License`. Surface the most -/// specific one when the `license` field itself is empty/junk. -fn pick_license_classifier(classifiers: &Option>) -> Option { - classifiers - .as_ref()? - .iter() - .filter(|c| c.starts_with("License ::")) - .max_by_key(|c| c.len()) - .cloned() -} - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -fn parse_project(url: &str) -> Option<(String, Option)> { - let after = url.split("/project/").nth(1)?; - let stripped = after.split(['?', '#']).next()?.trim_end_matches('/'); - let mut segs = stripped.split('/').filter(|s| !s.is_empty()); - let name = segs.next()?.to_string(); - let version = segs.next().map(|v| v.to_string()); - Some((name, version)) -} - -// --------------------------------------------------------------------------- -// PyPI API types -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct PypiResponse { - info: Info, - releases: Option>>, -} - -#[derive(Deserialize)] -struct Info { - name: Option, - version: Option, - summary: Option, - home_page: Option, - license: Option, - author: Option, - author_email: Option, - maintainer: Option, - requires_python: Option, - requires_dist: Option>, - keywords: Option, - classifiers: Option>, - yanked: Option, - yanked_reason: Option, - project_urls: Option>, -} - -#[derive(Deserialize)] -struct File { - upload_time: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_project_urls() { - assert!(matches("https://pypi.org/project/requests/")); - assert!(matches("https://pypi.org/project/numpy/1.26.0/")); - assert!(!matches("https://pypi.org/")); - assert!(!matches("https://example.com/project/foo")); - } - - #[test] - fn parse_project_pulls_name_and_version() { - assert_eq!( - parse_project("https://pypi.org/project/requests/"), - Some(("requests".into(), None)) - ); - assert_eq!( - parse_project("https://pypi.org/project/numpy/1.26.0/"), - Some(("numpy".into(), Some("1.26.0".into()))) - ); - assert_eq!( - parse_project("https://pypi.org/project/scikit-learn/?foo=bar"), - Some(("scikit-learn".into(), None)) - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/reddit.rs b/crates/webclaw-fetch/src/extractors/reddit.rs deleted file mode 100644 index 13cdc16..0000000 --- a/crates/webclaw-fetch/src/extractors/reddit.rs +++ /dev/null @@ -1,234 +0,0 @@ -//! Reddit structured extractor — returns the full post + comment tree -//! as typed JSON via Reddit's `.json` API. -//! -//! The same trick the markdown extractor in `crate::reddit` uses: -//! appending `.json` to any post URL returns the data the new SPA -//! frontend would load client-side. Zero antibot, zero JS rendering. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "reddit", - label: "Reddit thread", - description: "Returns post + nested comment tree with scores, authors, and timestamps.", - url_patterns: &[ - "https://www.reddit.com/r/*/comments/*", - "https://reddit.com/r/*/comments/*", - "https://old.reddit.com/r/*/comments/*", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - let is_reddit_host = matches!( - host, - "reddit.com" | "www.reddit.com" | "old.reddit.com" | "np.reddit.com" | "new.reddit.com" - ); - is_reddit_host && url.contains("/comments/") -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let json_url = build_json_url(url); - let resp = client.fetch(&json_url).await?; - if resp.status != 200 { - return Err(FetchError::Build(format!( - "reddit api returned status {}", - resp.status - ))); - } - - let listings: Vec = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("reddit json parse: {e}")))?; - - if listings.is_empty() { - return Err(FetchError::BodyDecode("reddit response empty".into())); - } - - // First listing = the post (single t3 child). - let post = listings - .first() - .and_then(|l| l.data.children.first()) - .filter(|t| t.kind == "t3") - .map(|t| post_json(&t.data)) - .unwrap_or(Value::Null); - - // Second listing = the comment tree. - let comments: Vec = listings - .get(1) - .map(|l| l.data.children.iter().filter_map(comment_json).collect()) - .unwrap_or_default(); - - Ok(json!({ - "url": url, - "post": post, - "comments": comments, - })) -} - -// --------------------------------------------------------------------------- -// JSON shapers -// --------------------------------------------------------------------------- - -fn post_json(d: &ThingData) -> Value { - json!({ - "id": d.id, - "title": d.title, - "author": d.author, - "subreddit": d.subreddit_name_prefixed, - "permalink": d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")), - "url": d.url_overridden_by_dest, - "is_self": d.is_self, - "selftext": d.selftext, - "score": d.score, - "upvote_ratio": d.upvote_ratio, - "num_comments": d.num_comments, - "created_utc": d.created_utc, - "link_flair_text": d.link_flair_text, - "over_18": d.over_18, - "spoiler": d.spoiler, - "stickied": d.stickied, - "locked": d.locked, - }) -} - -/// Render a single comment + its reply tree. Returns `None` for non-t1 -/// kinds (the trailing `more` placeholder Reddit injects at depth limits). -fn comment_json(thing: &Thing) -> Option { - if thing.kind != "t1" { - return None; - } - let d = &thing.data; - let replies: Vec = match &d.replies { - Some(Replies::Listing(l)) => l.data.children.iter().filter_map(comment_json).collect(), - _ => Vec::new(), - }; - Some(json!({ - "id": d.id, - "author": d.author, - "body": d.body, - "score": d.score, - "created_utc": d.created_utc, - "is_submitter": d.is_submitter, - "stickied": d.stickied, - "depth": d.depth, - "permalink": d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")), - "replies": replies, - })) -} - -// --------------------------------------------------------------------------- -// URL helpers -// --------------------------------------------------------------------------- - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -/// Build the Reddit JSON URL. We keep the original host (`www.reddit.com` -/// or `old.reddit.com` as the caller gave us). Routing through -/// `old.reddit.com` unconditionally looks appealing but that host has -/// stricter UA-based blocking than `www.reddit.com`, while the main -/// host accepts our Chrome-fingerprinted client fine. -fn build_json_url(url: &str) -> String { - let clean = url.split('?').next().unwrap_or(url).trim_end_matches('/'); - format!("{clean}.json?raw_json=1") -} - -// --------------------------------------------------------------------------- -// Reddit JSON types — only fields we render. Everything else is dropped. -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct Listing { - data: ListingData, -} - -#[derive(Deserialize)] -struct ListingData { - children: Vec, -} - -#[derive(Deserialize)] -struct Thing { - kind: String, - data: ThingData, -} - -#[derive(Deserialize, Default)] -struct ThingData { - // post (t3) - id: Option, - title: Option, - selftext: Option, - subreddit_name_prefixed: Option, - url_overridden_by_dest: Option, - is_self: Option, - upvote_ratio: Option, - num_comments: Option, - over_18: Option, - spoiler: Option, - stickied: Option, - locked: Option, - link_flair_text: Option, - - // comment (t1) - author: Option, - body: Option, - score: Option, - created_utc: Option, - is_submitter: Option, - depth: Option, - permalink: Option, - - // recursive - replies: Option, -} - -#[derive(Deserialize)] -#[serde(untagged)] -enum Replies { - Listing(Listing), - #[allow(dead_code)] - Empty(String), -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_reddit_post_urls() { - assert!(matches( - "https://www.reddit.com/r/rust/comments/abc123/some_title/" - )); - assert!(matches( - "https://reddit.com/r/rust/comments/abc123/some_title" - )); - assert!(matches("https://old.reddit.com/r/rust/comments/abc123/x/")); - } - - #[test] - fn rejects_non_post_reddit_urls() { - assert!(!matches("https://www.reddit.com/r/rust")); - assert!(!matches("https://www.reddit.com/user/foo")); - assert!(!matches("https://example.com/r/rust/comments/x")); - } - - #[test] - fn json_url_appends_suffix_and_drops_query() { - assert_eq!( - build_json_url("https://www.reddit.com/r/rust/comments/abc/x/?utm=foo"), - "https://www.reddit.com/r/rust/comments/abc/x.json?raw_json=1" - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/shopify_collection.rs b/crates/webclaw-fetch/src/extractors/shopify_collection.rs deleted file mode 100644 index 23d57c6..0000000 --- a/crates/webclaw-fetch/src/extractors/shopify_collection.rs +++ /dev/null @@ -1,242 +0,0 @@ -//! Shopify collection structured extractor. -//! -//! Every Shopify store exposes `/collections/{handle}.json` and -//! `/collections/{handle}/products.json` on the public surface. This -//! extractor hits `.json` (collection metadata) and falls through to -//! `/products.json` for the first page of products. Same caveat as -//! `shopify_product`: stores with Cloudflare in front of the shop -//! will 403 the public path. -//! -//! Explicit-call only (like `shopify_product`). `/collections/{slug}` -//! is a URL shape used by non-Shopify stores too, so auto-dispatch -//! would claim too many URLs. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "shopify_collection", - label: "Shopify collection", - description: "Returns collection metadata + first page of products (handle, title, vendor, price, available) on ANY Shopify store via /collections/{handle}.json + /products.json.", - url_patterns: &[ - "https://{shop}/collections/{handle}", - "https://{shop}.myshopify.com/collections/{handle}", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) { - return false; - } - url.contains("/collections/") && !url.ends_with("/collections/") -} - -const NON_SHOPIFY_HOSTS: &[&str] = &[ - "amazon.com", - "amazon.co.uk", - "amazon.de", - "ebay.com", - "etsy.com", - "walmart.com", - "target.com", - "aliexpress.com", - "huggingface.co", // has /collections/ for models - "github.com", -]; - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let (coll_meta_url, coll_products_url) = build_json_urls(url); - - // Step 1: collection metadata. Shopify returns 200 on missing - // collections sometimes; check "collection" key below. - let meta_resp = client.fetch(&coll_meta_url).await?; - if meta_resp.status == 404 { - return Err(FetchError::Build(format!( - "shopify_collection: '{url}' not found" - ))); - } - if meta_resp.status == 403 { - return Err(FetchError::Build(format!( - "shopify_collection: {coll_meta_url} returned 403. The store has antibot in front of the .json endpoint. Use /v1/scrape/ecommerce_product or api.webclaw.io for this store." - ))); - } - if meta_resp.status != 200 { - return Err(FetchError::Build(format!( - "shopify returned status {} for {coll_meta_url}", - meta_resp.status - ))); - } - - let meta: MetaWrapper = serde_json::from_str(&meta_resp.html).map_err(|e| { - FetchError::BodyDecode(format!( - "shopify_collection: '{url}' didn't return Shopify JSON, likely not a Shopify store ({e})" - )) - })?; - - // Step 2: first page of products for this collection. - let products = match client.fetch(&coll_products_url).await { - Ok(r) if r.status == 200 => serde_json::from_str::(&r.html) - .ok() - .map(|pw| pw.products) - .unwrap_or_default(), - _ => Vec::new(), - }; - - let product_summaries: Vec = products - .iter() - .map(|p| { - let first_variant = p.variants.first(); - json!({ - "id": p.id, - "handle": p.handle, - "title": p.title, - "vendor": p.vendor, - "product_type": p.product_type, - "price": first_variant.and_then(|v| v.price.clone()), - "compare_at_price":first_variant.and_then(|v| v.compare_at_price.clone()), - "available": p.variants.iter().any(|v| v.available.unwrap_or(false)), - "variant_count": p.variants.len(), - "image": p.images.first().and_then(|i| i.src.clone()), - "created_at": p.created_at, - "updated_at": p.updated_at, - }) - }) - .collect(); - - let c = meta.collection; - Ok(json!({ - "url": url, - "meta_json_url": coll_meta_url, - "products_json_url": coll_products_url, - "collection_id": c.id, - "handle": c.handle, - "title": c.title, - "description_html": c.body_html, - "published_at": c.published_at, - "updated_at": c.updated_at, - "sort_order": c.sort_order, - "products_in_page": product_summaries.len(), - "products": product_summaries, - })) -} - -// --------------------------------------------------------------------------- -// URL helpers -// --------------------------------------------------------------------------- - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -/// Build `(collection.json, collection/products.json)` from a user URL. -fn build_json_urls(url: &str) -> (String, String) { - let (path_part, _query_part) = match url.split_once('?') { - Some((a, b)) => (a, Some(b)), - None => (url, None), - }; - let clean = path_part.trim_end_matches('/').trim_end_matches(".json"); - ( - format!("{clean}.json"), - format!("{clean}/products.json?limit=50"), - ) -} - -// --------------------------------------------------------------------------- -// Shopify collection + product JSON shapes (subsets) -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct MetaWrapper { - collection: Collection, -} - -#[derive(Deserialize)] -struct Collection { - id: Option, - handle: Option, - title: Option, - body_html: Option, - published_at: Option, - updated_at: Option, - sort_order: Option, -} - -#[derive(Deserialize)] -struct ProductsWrapper { - #[serde(default)] - products: Vec, -} - -#[derive(Deserialize)] -struct ProductSummary { - id: Option, - handle: Option, - title: Option, - vendor: Option, - product_type: Option, - created_at: Option, - updated_at: Option, - #[serde(default)] - variants: Vec, - #[serde(default)] - images: Vec, -} - -#[derive(Deserialize)] -struct VariantSummary { - price: Option, - compare_at_price: Option, - available: Option, -} - -#[derive(Deserialize)] -struct ImageSummary { - src: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_shopify_collection_urls() { - assert!(matches("https://www.allbirds.com/collections/mens")); - assert!(matches( - "https://shop.example.com/collections/new-arrivals?page=2" - )); - } - - #[test] - fn rejects_non_shopify() { - assert!(!matches("https://github.com/collections/foo")); - assert!(!matches("https://huggingface.co/collections/foo")); - assert!(!matches("https://example.com/")); - assert!(!matches("https://example.com/collections/")); - } - - #[test] - fn build_json_urls_derives_both_paths() { - let (meta, products) = build_json_urls("https://shop.example.com/collections/mens"); - assert_eq!(meta, "https://shop.example.com/collections/mens.json"); - assert_eq!( - products, - "https://shop.example.com/collections/mens/products.json?limit=50" - ); - } - - #[test] - fn build_json_urls_handles_trailing_slash() { - let (meta, _) = build_json_urls("https://shop.example.com/collections/mens/"); - assert_eq!(meta, "https://shop.example.com/collections/mens.json"); - } -} diff --git a/crates/webclaw-fetch/src/extractors/shopify_product.rs b/crates/webclaw-fetch/src/extractors/shopify_product.rs deleted file mode 100644 index b52ef36..0000000 --- a/crates/webclaw-fetch/src/extractors/shopify_product.rs +++ /dev/null @@ -1,318 +0,0 @@ -//! Shopify product structured extractor. -//! -//! Every Shopify store exposes a public JSON endpoint for each product -//! by appending `.json` to the product URL: -//! -//! https://shop.example.com/products/cool-tshirt -//! → https://shop.example.com/products/cool-tshirt.json -//! -//! There are ~4 million Shopify stores. The `.json` endpoint is -//! undocumented but has been stable for 10+ years. When a store puts -//! Cloudflare / antibot in front of the shop, this path can 403 just -//! like any other — for those cases the caller should fall back to -//! `ecommerce_product` (JSON-LD) or the cloud tier. -//! -//! This extractor is **explicit-call only** — it is NOT auto-dispatched -//! from `/v1/scrape` because we cannot tell ahead of time whether an -//! arbitrary `/products/{slug}` URL is a Shopify store. Callers hit -//! `/v1/scrape/shopify_product` when they know. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "shopify_product", - label: "Shopify product", - description: "Returns product metadata on ANY Shopify store via the public /products/{handle}.json endpoint: title, vendor, variants with prices + stock, images, options.", - url_patterns: &[ - "https://{shop}/products/{handle}", - "https://{shop}.myshopify.com/products/{handle}", - ], -}; - -pub fn matches(url: &str) -> bool { - // Any URL whose path contains /products/{something}. We do not - // filter by host — Shopify powers custom-domain stores. The - // extractor's /.json fallback is what confirms Shopify; `matches` - // just says "this is a plausible shape." Still reject obviously - // non-Shopify known hosts to save a failed request. - let host = host_of(url); - if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) { - return false; - } - url.contains("/products/") && !url.ends_with("/products/") -} - -/// Hosts we know are not Shopify — reject so we don't burn a request. -const NON_SHOPIFY_HOSTS: &[&str] = &[ - "amazon.com", - "amazon.co.uk", - "amazon.de", - "amazon.fr", - "amazon.it", - "ebay.com", - "etsy.com", - "walmart.com", - "target.com", - "aliexpress.com", - "bestbuy.com", - "wayfair.com", - "homedepot.com", - "github.com", // /products is a marketing page -]; - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let json_url = build_json_url(url); - let resp = client.fetch(&json_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "shopify_product: '{url}' not found (got 404 from {json_url})" - ))); - } - if resp.status == 403 { - return Err(FetchError::Build(format!( - "shopify_product: {json_url} returned 403 — the store has antibot in front of the .json endpoint. Try /v1/scrape/ecommerce_product for the HTML + JSON-LD fallback." - ))); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "shopify returned status {} for {json_url}", - resp.status - ))); - } - - let body: Wrapper = serde_json::from_str(&resp.html).map_err(|e| { - FetchError::BodyDecode(format!( - "shopify_product: '{url}' didn't return Shopify JSON — likely not a Shopify store ({e})" - )) - })?; - let p = body.product; - - let variants: Vec = p - .variants - .iter() - .map(|v| { - json!({ - "id": v.id, - "title": v.title, - "sku": v.sku, - "barcode": v.barcode, - "price": v.price, - "compare_at_price": v.compare_at_price, - "available": v.available, - "inventory_quantity": v.inventory_quantity, - "position": v.position, - "weight": v.weight, - "weight_unit": v.weight_unit, - "requires_shipping": v.requires_shipping, - "taxable": v.taxable, - "option1": v.option1, - "option2": v.option2, - "option3": v.option3, - }) - }) - .collect(); - - let images: Vec = p - .images - .iter() - .map(|i| { - json!({ - "src": i.src, - "width": i.width, - "height": i.height, - "position": i.position, - "alt": i.alt, - }) - }) - .collect(); - - let options: Vec = p - .options - .iter() - .map(|o| json!({"name": o.name, "values": o.values, "position": o.position})) - .collect(); - - // Price range + availability summary across variants (the shape - // agents typically want without walking the variants array). - let prices: Vec = p - .variants - .iter() - .filter_map(|v| v.price.as_deref().and_then(|s| s.parse::().ok())) - .collect(); - let any_available = p.variants.iter().any(|v| v.available.unwrap_or(false)); - - Ok(json!({ - "url": url, - "json_url": json_url, - "product_id": p.id, - "handle": p.handle, - "title": p.title, - "vendor": p.vendor, - "product_type": p.product_type, - "tags": p.tags, - "description_html":p.body_html, - "published_at": p.published_at, - "created_at": p.created_at, - "updated_at": p.updated_at, - "variant_count": variants.len(), - "image_count": images.len(), - "any_available": any_available, - "price_min": prices.iter().cloned().fold(f64::INFINITY, f64::min).is_finite().then(|| prices.iter().cloned().fold(f64::INFINITY, f64::min)), - "price_max": prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max).is_finite().then(|| prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max)), - "variants": variants, - "images": images, - "options": options, - })) -} - -/// Build the .json path from a product URL. Handles pre-.jsoned URLs, -/// trailing slashes, and query strings. -fn build_json_url(url: &str) -> String { - let (path_part, query_part) = match url.split_once('?') { - Some((a, b)) => (a, Some(b)), - None => (url, None), - }; - let clean = path_part.trim_end_matches('/'); - let with_json = if clean.ends_with(".json") { - clean.to_string() - } else { - format!("{clean}.json") - }; - match query_part { - Some(q) => format!("{with_json}?{q}"), - None => with_json, - } -} - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -// --------------------------------------------------------------------------- -// Shopify product JSON shape (a subset of the full response) -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct Wrapper { - product: Product, -} - -#[derive(Deserialize)] -struct Product { - id: Option, - title: Option, - handle: Option, - vendor: Option, - product_type: Option, - body_html: Option, - published_at: Option, - created_at: Option, - updated_at: Option, - #[serde(default)] - tags: serde_json::Value, // array OR comma-joined string depending on store - #[serde(default)] - variants: Vec, - #[serde(default)] - images: Vec, - #[serde(default)] - options: Vec, -} - -#[derive(Deserialize)] -struct Variant { - id: Option, - title: Option, - sku: Option, - barcode: Option, - price: Option, - compare_at_price: Option, - available: Option, - inventory_quantity: Option, - position: Option, - weight: Option, - weight_unit: Option, - requires_shipping: Option, - taxable: Option, - option1: Option, - option2: Option, - option3: Option, -} - -#[derive(Deserialize)] -struct Image { - src: Option, - width: Option, - height: Option, - position: Option, - alt: Option, -} - -#[derive(Deserialize)] -#[serde(rename_all = "lowercase")] -struct Option_ { - name: Option, - position: Option, - #[serde(default)] - values: Vec, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_plausible_shopify_urls() { - assert!(matches( - "https://www.allbirds.com/products/mens-tree-runners" - )); - assert!(matches( - "https://shop.example.com/products/cool-tshirt?variant=123" - )); - assert!(matches("https://somestore.myshopify.com/products/thing-1")); - } - - #[test] - fn rejects_known_non_shopify() { - assert!(!matches("https://www.amazon.com/dp/B0C123")); - assert!(!matches("https://www.etsy.com/listing/12345/foo")); - assert!(!matches("https://www.amazon.co.uk/products/thing")); - assert!(!matches("https://github.com/products")); - } - - #[test] - fn rejects_non_product_urls() { - assert!(!matches("https://example.com/")); - assert!(!matches("https://example.com/products/")); - assert!(!matches("https://example.com/collections/all")); - } - - #[test] - fn build_json_url_handles_slash_and_query() { - assert_eq!( - build_json_url("https://shop.example.com/products/foo"), - "https://shop.example.com/products/foo.json" - ); - assert_eq!( - build_json_url("https://shop.example.com/products/foo/"), - "https://shop.example.com/products/foo.json" - ); - assert_eq!( - build_json_url("https://shop.example.com/products/foo?variant=123"), - "https://shop.example.com/products/foo.json?variant=123" - ); - assert_eq!( - build_json_url("https://shop.example.com/products/foo.json"), - "https://shop.example.com/products/foo.json" - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/stackoverflow.rs b/crates/webclaw-fetch/src/extractors/stackoverflow.rs deleted file mode 100644 index 03597a3..0000000 --- a/crates/webclaw-fetch/src/extractors/stackoverflow.rs +++ /dev/null @@ -1,216 +0,0 @@ -//! Stack Overflow Q&A structured extractor. -//! -//! Uses the Stack Exchange API at `api.stackexchange.com/2.3/questions/{id}` -//! with `site=stackoverflow`. Two calls: one for the question, one for -//! its answers. Both come pre-filtered to include the rendered HTML body -//! so we don't re-parse the question page itself. -//! -//! Anonymous access caps at 300 requests per IP per day. Production -//! cloud should set `STACKAPPS_KEY` to lift to 10,000/day, but we don't -//! require it to work out of the box. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "stackoverflow", - label: "Stack Overflow Q&A", - description: "Returns question + answers: title, body, tags, votes, accepted answer, top answers.", - url_patterns: &["https://stackoverflow.com/questions/{id}/{slug}"], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if host != "stackoverflow.com" && host != "www.stackoverflow.com" { - return false; - } - parse_question_id(url).is_some() -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let id = parse_question_id(url).ok_or_else(|| { - FetchError::Build(format!( - "stackoverflow: cannot parse question id from '{url}'" - )) - })?; - - // Filter `withbody` includes the rendered HTML body for both questions - // and answers. Stack Exchange's filter system is documented at - // api.stackexchange.com/docs/filters. - let q_url = format!( - "https://api.stackexchange.com/2.3/questions/{id}?site=stackoverflow&filter=withbody" - ); - let q_resp = client.fetch(&q_url).await?; - if q_resp.status != 200 { - return Err(FetchError::Build(format!( - "stackexchange api returned status {}", - q_resp.status - ))); - } - let q_body: QResponse = serde_json::from_str(&q_resp.html) - .map_err(|e| FetchError::BodyDecode(format!("stackoverflow q parse: {e}")))?; - let q = q_body - .items - .first() - .ok_or_else(|| FetchError::Build(format!("stackoverflow: question {id} not found")))?; - - let a_url = format!( - "https://api.stackexchange.com/2.3/questions/{id}/answers?site=stackoverflow&filter=withbody&order=desc&sort=votes" - ); - let a_resp = client.fetch(&a_url).await?; - let answers = if a_resp.status == 200 { - let a_body: AResponse = serde_json::from_str(&a_resp.html) - .map_err(|e| FetchError::BodyDecode(format!("stackoverflow a parse: {e}")))?; - a_body - .items - .iter() - .map(|a| { - json!({ - "answer_id": a.answer_id, - "is_accepted": a.is_accepted, - "score": a.score, - "body": a.body, - "creation_date": a.creation_date, - "last_edit_date":a.last_edit_date, - "author": a.owner.as_ref().and_then(|o| o.display_name.clone()), - "author_rep": a.owner.as_ref().and_then(|o| o.reputation), - }) - }) - .collect::>() - } else { - Vec::new() - }; - - let accepted = answers - .iter() - .find(|a| { - a.get("is_accepted") - .and_then(|v| v.as_bool()) - .unwrap_or(false) - }) - .cloned(); - - Ok(json!({ - "url": url, - "question_id": q.question_id, - "title": q.title, - "body": q.body, - "tags": q.tags, - "score": q.score, - "view_count": q.view_count, - "answer_count": q.answer_count, - "is_answered": q.is_answered, - "accepted_answer_id": q.accepted_answer_id, - "creation_date": q.creation_date, - "last_activity_date": q.last_activity_date, - "author": q.owner.as_ref().and_then(|o| o.display_name.clone()), - "author_rep": q.owner.as_ref().and_then(|o| o.reputation), - "link": q.link, - "accepted_answer": accepted, - "top_answers": answers, - })) -} - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -/// Parse question id from a URL of the form `/questions/{id}/{slug}`. -fn parse_question_id(url: &str) -> Option { - let after = url.split("/questions/").nth(1)?; - let stripped = after.split(['?', '#']).next()?.trim_end_matches('/'); - let first = stripped.split('/').next()?; - first.parse::().ok() -} - -// --------------------------------------------------------------------------- -// Stack Exchange API types -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct QResponse { - #[serde(default)] - items: Vec, -} - -#[derive(Deserialize)] -struct Question { - question_id: Option, - title: Option, - body: Option, - #[serde(default)] - tags: Vec, - score: Option, - view_count: Option, - answer_count: Option, - is_answered: Option, - accepted_answer_id: Option, - creation_date: Option, - last_activity_date: Option, - owner: Option, - link: Option, -} - -#[derive(Deserialize)] -struct AResponse { - #[serde(default)] - items: Vec, -} - -#[derive(Deserialize)] -struct Answer { - answer_id: Option, - is_accepted: Option, - score: Option, - body: Option, - creation_date: Option, - last_edit_date: Option, - owner: Option, -} - -#[derive(Deserialize)] -struct Owner { - display_name: Option, - reputation: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_question_urls() { - assert!(matches( - "https://stackoverflow.com/questions/12345/some-slug" - )); - assert!(matches( - "https://stackoverflow.com/questions/12345/some-slug?answertab=votes" - )); - assert!(!matches("https://stackoverflow.com/")); - assert!(!matches("https://stackoverflow.com/questions")); - assert!(!matches("https://stackoverflow.com/users/100")); - assert!(!matches("https://example.com/questions/12345/x")); - } - - #[test] - fn parse_question_id_handles_slug_and_query() { - assert_eq!( - parse_question_id("https://stackoverflow.com/questions/12345/some-slug"), - Some(12345) - ); - assert_eq!( - parse_question_id("https://stackoverflow.com/questions/12345/some-slug?tab=newest"), - Some(12345) - ); - assert_eq!(parse_question_id("https://stackoverflow.com/foo"), None); - } -} diff --git a/crates/webclaw-fetch/src/extractors/substack_post.rs b/crates/webclaw-fetch/src/extractors/substack_post.rs deleted file mode 100644 index c5b5019..0000000 --- a/crates/webclaw-fetch/src/extractors/substack_post.rs +++ /dev/null @@ -1,565 +0,0 @@ -//! Substack post extractor. -//! -//! Every Substack publication exposes `/api/v1/posts/{slug}` that -//! returns the full post as JSON: body HTML, cover image, author, -//! publication info, reactions, paywall state. No auth on public -//! posts. -//! -//! Works on both `*.substack.com` subdomains and custom domains -//! (e.g. `simonwillison.net` uses Substack too). Detection is -//! "URL has `/p/{slug}`" because that's the canonical Substack post -//! path. Explicit-call only because the `/p/{slug}` URL shape is -//! used by non-Substack sites too. -//! -//! ## Fallback -//! -//! The API endpoint is rate-limited aggressively on popular publications -//! and occasionally returns 403 on custom domains with Cloudflare in -//! front. When that happens we escalate to an HTML fetch (via -//! `smart_fetch_html`, so antibot-protected custom domains still work) -//! and extract OG tags + Article JSON-LD for a degraded-but-useful -//! payload. The response shape stays stable across both paths; a -//! `data_source` field tells the caller which branch ran. - -use std::sync::OnceLock; - -use regex::Regex; -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::cloud::{self, CloudError}; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "substack_post", - label: "Substack post", - description: "Returns post HTML, title, subtitle, author, publication, reactions, paywall status via the Substack public API. Falls back to OG + JSON-LD HTML parsing when the API is rate-limited.", - url_patterns: &[ - "https://{pub}.substack.com/p/{slug}", - "https://{custom-domain}/p/{slug}", - ], -}; - -pub fn matches(url: &str) -> bool { - if !(url.starts_with("http://") || url.starts_with("https://")) { - return false; - } - url.contains("/p/") -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let slug = parse_slug(url).ok_or_else(|| { - FetchError::Build(format!("substack_post: cannot parse slug from '{url}'")) - })?; - let host = host_of(url); - if host.is_empty() { - return Err(FetchError::Build(format!( - "substack_post: empty host in '{url}'" - ))); - } - let scheme = if url.starts_with("http://") { - "http" - } else { - "https" - }; - let api_url = format!("{scheme}://{host}/api/v1/posts/{slug}"); - - // 1. Try the public API. 200 = full payload; 404 = real miss; any - // other status hands off to the HTML fallback so a transient rate - // limit or a hardened custom domain doesn't fail the whole call. - let resp = client.fetch(&api_url).await?; - match resp.status { - 200 => match serde_json::from_str::(&resp.html) { - Ok(p) => Ok(build_api_payload(url, &api_url, &slug, p)), - Err(e) => { - // API returned 200 but the body isn't the Post shape we - // expect. Could be a custom-domain site that exposes - // something else at /api/v1/posts/. Fall back to HTML - // rather than hard-failing. - html_fallback( - client, - url, - &api_url, - &slug, - Some(format!( - "api returned 200 but body was not Substack JSON ({e})" - )), - ) - .await - } - }, - 404 => Err(FetchError::Build(format!( - "substack_post: '{slug}' not found on {host} (got 404). \ - If the publication isn't actually on Substack, use /v1/scrape instead." - ))), - _ => { - // Rate limit, 403, 5xx, whatever: try HTML. - let reason = format!("api returned status {} for {api_url}", resp.status); - html_fallback(client, url, &api_url, &slug, Some(reason)).await - } - } -} - -// --------------------------------------------------------------------------- -// API-path payload builder -// --------------------------------------------------------------------------- - -fn build_api_payload(url: &str, api_url: &str, slug: &str, p: Post) -> Value { - json!({ - "url": url, - "api_url": api_url, - "data_source": "api", - "id": p.id, - "type": p.r#type, - "slug": p.slug.or_else(|| Some(slug.to_string())), - "title": p.title, - "subtitle": p.subtitle, - "description": p.description, - "canonical_url": p.canonical_url, - "post_date": p.post_date, - "updated_at": p.updated_at, - "audience": p.audience, - "has_paywall": matches!(p.audience.as_deref(), Some("only_paid") | Some("founding")), - "is_free_preview": p.is_free_preview, - "cover_image": p.cover_image, - "word_count": p.wordcount, - "reactions": p.reactions, - "comment_count": p.comment_count, - "body_html": p.body_html, - "body_text": p.truncated_body_text.or(p.body_text), - "publication": json!({ - "id": p.publication.as_ref().and_then(|pub_| pub_.id), - "name": p.publication.as_ref().and_then(|pub_| pub_.name.clone()), - "subdomain": p.publication.as_ref().and_then(|pub_| pub_.subdomain.clone()), - "custom_domain":p.publication.as_ref().and_then(|pub_| pub_.custom_domain.clone()), - }), - "authors": p.published_bylines.iter().map(|a| json!({ - "id": a.id, - "name": a.name, - "handle": a.handle, - "photo": a.photo_url, - })).collect::>(), - }) -} - -// --------------------------------------------------------------------------- -// HTML fallback: OG + Article JSON-LD -// --------------------------------------------------------------------------- - -async fn html_fallback( - client: &dyn Fetcher, - url: &str, - api_url: &str, - slug: &str, - fallback_reason: Option, -) -> Result { - let fetched = cloud::smart_fetch_html(client, client.cloud(), url) - .await - .map_err(cloud_to_fetch_err)?; - - let mut data = parse_html(&fetched.html, url, api_url, slug); - if let Some(obj) = data.as_object_mut() { - obj.insert( - "fetch_source".into(), - match fetched.source { - cloud::FetchSource::Local => json!("local"), - cloud::FetchSource::Cloud => json!("cloud"), - }, - ); - if let Some(reason) = fallback_reason { - obj.insert("fallback_reason".into(), json!(reason)); - } - } - Ok(data) -} - -/// Pure HTML parser. Pulls title, subtitle, description, cover image, -/// publish date, and authors from OG tags and Article JSON-LD. Kept -/// public so tests can exercise it with fixtures. -pub fn parse_html(html: &str, url: &str, api_url: &str, slug: &str) -> Value { - let article = find_article_jsonld(html); - - let title = article - .as_ref() - .and_then(|v| get_text(v, "headline")) - .or_else(|| og(html, "title")); - let description = article - .as_ref() - .and_then(|v| get_text(v, "description")) - .or_else(|| og(html, "description")); - let cover_image = article - .as_ref() - .and_then(get_first_image) - .or_else(|| og(html, "image")); - let post_date = article - .as_ref() - .and_then(|v| get_text(v, "datePublished")) - .or_else(|| meta_property(html, "article:published_time")); - let updated_at = article.as_ref().and_then(|v| get_text(v, "dateModified")); - let publication_name = og(html, "site_name"); - let authors = article.as_ref().map(extract_authors).unwrap_or_default(); - - json!({ - "url": url, - "api_url": api_url, - "data_source": "html_fallback", - "slug": slug, - "title": title, - "subtitle": None::, - "description": description, - "canonical_url": canonical_url(html).or_else(|| Some(url.to_string())), - "post_date": post_date, - "updated_at": updated_at, - "cover_image": cover_image, - "body_html": None::, - "body_text": None::, - "word_count": None::, - "comment_count": None::, - "reactions": Value::Null, - "has_paywall": None::, - "is_free_preview": None::, - "publication": json!({ - "name": publication_name, - }), - "authors": authors, - }) -} - -fn extract_authors(v: &Value) -> Vec { - let Some(a) = v.get("author") else { - return Vec::new(); - }; - let one = |val: &Value| -> Option { - match val { - Value::String(s) => Some(json!({"name": s})), - Value::Object(_) => { - let name = val.get("name").and_then(|n| n.as_str())?; - let handle = val - .get("url") - .and_then(|u| u.as_str()) - .and_then(handle_from_author_url); - Some(json!({ - "name": name, - "handle": handle, - })) - } - _ => None, - } - }; - match a { - Value::Array(arr) => arr.iter().filter_map(one).collect(), - _ => one(a).into_iter().collect(), - } -} - -// --------------------------------------------------------------------------- -// URL helpers -// --------------------------------------------------------------------------- - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -fn parse_slug(url: &str) -> Option { - let after = url.split("/p/").nth(1)?; - let stripped = after - .split(['?', '#']) - .next()? - .trim_end_matches('/') - .split('/') - .next() - .unwrap_or(""); - if stripped.is_empty() { - None - } else { - Some(stripped.to_string()) - } -} - -/// Extract the Substack handle from an author URL like -/// `https://substack.com/@handle` or `https://pub.substack.com/@handle`. -/// -/// Returns `None` when the URL has no `@` segment (e.g. a non-Substack -/// author page) so we don't synthesise a fake handle. -fn handle_from_author_url(u: &str) -> Option { - let after = u.rsplit_once('@').map(|(_, tail)| tail)?; - let clean = after.split(['/', '?', '#']).next()?; - if clean.is_empty() { - None - } else { - Some(clean.to_string()) - } -} - -// --------------------------------------------------------------------------- -// HTML tag helpers -// --------------------------------------------------------------------------- - -fn og(html: &str, prop: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new(r#"(?i)]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() - }); - for c in re.captures_iter(html) { - if c.get(1).is_some_and(|m| m.as_str() == prop) { - return c.get(2).map(|m| m.as_str().to_string()); - } - } - None -} - -/// Pull `` and -/// similar structured meta tags. -fn meta_property(html: &str, prop: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new(r#"(?i)]+property="([^"]+)"[^>]+content="([^"]+)""#).unwrap() - }); - for c in re.captures_iter(html) { - if c.get(1).is_some_and(|m| m.as_str() == prop) { - return c.get(2).map(|m| m.as_str().to_string()); - } - } - None -} - -fn canonical_url(html: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE - .get_or_init(|| Regex::new(r#"(?i)]+rel="canonical"[^>]+href="([^"]+)""#).unwrap()); - re.captures(html) - .and_then(|c| c.get(1)) - .map(|m| m.as_str().to_string()) -} - -// --------------------------------------------------------------------------- -// JSON-LD walkers (Article / NewsArticle) -// --------------------------------------------------------------------------- - -fn find_article_jsonld(html: &str) -> Option { - let blocks = webclaw_core::structured_data::extract_json_ld(html); - for b in blocks { - if let Some(found) = find_article_in(&b) { - return Some(found); - } - } - None -} - -fn find_article_in(v: &Value) -> Option { - if is_article_type(v) { - return Some(v.clone()); - } - if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) { - for item in graph { - if let Some(found) = find_article_in(item) { - return Some(found); - } - } - } - if let Some(arr) = v.as_array() { - for item in arr { - if let Some(found) = find_article_in(item) { - return Some(found); - } - } - } - None -} - -fn is_article_type(v: &Value) -> bool { - let Some(t) = v.get("@type") else { - return false; - }; - let is_art = |s: &str| { - matches!( - s, - "Article" | "NewsArticle" | "BlogPosting" | "SocialMediaPosting" - ) - }; - match t { - Value::String(s) => is_art(s), - Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_art)), - _ => false, - } -} - -fn get_text(v: &Value, key: &str) -> Option { - v.get(key).and_then(|x| match x { - Value::String(s) => Some(s.clone()), - Value::Number(n) => Some(n.to_string()), - _ => None, - }) -} - -fn get_first_image(v: &Value) -> Option { - match v.get("image")? { - Value::String(s) => Some(s.clone()), - Value::Array(arr) => arr.iter().find_map(|x| match x { - Value::String(s) => Some(s.clone()), - Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from), - _ => None, - }), - Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from), - _ => None, - } -} - -fn cloud_to_fetch_err(e: CloudError) -> FetchError { - FetchError::Build(e.to_string()) -} - -// --------------------------------------------------------------------------- -// Substack API types (subset) -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct Post { - id: Option, - r#type: Option, - slug: Option, - title: Option, - subtitle: Option, - description: Option, - canonical_url: Option, - post_date: Option, - updated_at: Option, - audience: Option, - is_free_preview: Option, - cover_image: Option, - wordcount: Option, - reactions: Option, - comment_count: Option, - body_html: Option, - body_text: Option, - truncated_body_text: Option, - publication: Option, - #[serde(default, rename = "publishedBylines")] - published_bylines: Vec, -} - -#[derive(Deserialize)] -struct Publication { - id: Option, - name: Option, - subdomain: Option, - custom_domain: Option, -} - -#[derive(Deserialize)] -struct Byline { - id: Option, - name: Option, - handle: Option, - photo_url: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_post_urls() { - assert!(matches( - "https://stratechery.substack.com/p/the-tech-letter" - )); - assert!(matches("https://simonwillison.net/p/2024-08-01-something")); - assert!(!matches("https://example.com/")); - assert!(!matches("ftp://example.com/p/foo")); - } - - #[test] - fn parse_slug_strips_query_and_trailing_slash() { - assert_eq!( - parse_slug("https://example.substack.com/p/my-post"), - Some("my-post".into()) - ); - assert_eq!( - parse_slug("https://example.substack.com/p/my-post/"), - Some("my-post".into()) - ); - assert_eq!( - parse_slug("https://example.substack.com/p/my-post?ref=123"), - Some("my-post".into()) - ); - } - - #[test] - fn parse_html_extracts_from_og_tags() { - let html = r##" - - - - - - - -"##; - let v = parse_html( - html, - "https://mypub.substack.com/p/my-post", - "https://mypub.substack.com/api/v1/posts/my-post", - "my-post", - ); - assert_eq!(v["data_source"], "html_fallback"); - assert_eq!(v["title"], "My Great Post"); - assert_eq!(v["description"], "A short summary."); - assert_eq!(v["cover_image"], "https://cdn.substack.com/cover.jpg"); - assert_eq!(v["post_date"], "2025-09-01T10:00:00Z"); - assert_eq!(v["publication"]["name"], "My Publication"); - assert_eq!(v["canonical_url"], "https://mypub.substack.com/p/my-post"); - } - - #[test] - fn parse_html_prefers_jsonld_when_present() { - let html = r##" - - - -"##; - let v = parse_html( - html, - "https://example.com/p/a", - "https://example.com/api/v1/posts/a", - "a", - ); - assert_eq!(v["title"], "JSON-LD Title"); - assert_eq!(v["description"], "JSON-LD desc."); - assert_eq!(v["cover_image"], "https://cdn.substack.com/hero.jpg"); - assert_eq!(v["post_date"], "2025-10-12T08:30:00Z"); - assert_eq!(v["updated_at"], "2025-10-12T09:00:00Z"); - assert_eq!(v["authors"][0]["name"], "Alice Author"); - assert_eq!(v["authors"][0]["handle"], "alice"); - } - - #[test] - fn handle_from_author_url_pulls_handle() { - assert_eq!( - handle_from_author_url("https://substack.com/@alice"), - Some("alice".into()) - ); - assert_eq!( - handle_from_author_url("https://mypub.substack.com/@bob/"), - Some("bob".into()) - ); - assert_eq!( - handle_from_author_url("https://not-substack.com/author/carol"), - None - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs b/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs deleted file mode 100644 index 8b77a29..0000000 --- a/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs +++ /dev/null @@ -1,572 +0,0 @@ -//! Trustpilot company reviews extractor. -//! -//! `trustpilot.com/review/{domain}` pages are always behind AWS WAF's -//! "Verifying your connection" interstitial, so this extractor always -//! routes through [`cloud::smart_fetch_html`]. Without -//! `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` it returns a clean -//! "set API key" error; with one it escalates to api.webclaw.io. -//! -//! ## 2025 JSON-LD schema -//! -//! Trustpilot replaced the old single-Organization + aggregateRating -//! shape with three separate JSON-LD blocks: -//! -//! 1. `Organization` block for Trustpilot the platform itself -//! (company info, addresses, social profiles). Not the business -//! being reviewed. We detect and skip this. -//! 2. `Dataset` block with a csvw:Table mainEntity that contains the -//! per-star-bucket counts for the target business plus a Total -//! column. The Dataset's `name` is the business display name. -//! 3. `aiSummary` + `aiSummaryReviews` block: the AI-generated -//! summary of reviews plus the individual review objects -//! (consumer, dates, rating, title, text, language, likes). -//! -//! Plus `metadata.title` from the page head parses as -//! `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"` and -//! `metadata.description` carries `"{N} customers have already said"`. -//! We use both as extra signal when the Dataset block is absent. - -use std::sync::OnceLock; - -use regex::Regex; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::cloud::{self, CloudError}; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "trustpilot_reviews", - label: "Trustpilot reviews", - description: "Returns business name, aggregate rating, star distribution, recent reviews, and the AI summary for a Trustpilot /review/{domain} page.", - url_patterns: &["https://www.trustpilot.com/review/{domain}"], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if !matches!(host, "www.trustpilot.com" | "trustpilot.com") { - return false; - } - url.contains("/review/") -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let fetched = cloud::smart_fetch_html(client, client.cloud(), url) - .await - .map_err(cloud_to_fetch_err)?; - - let mut data = parse(&fetched.html, url)?; - if let Some(obj) = data.as_object_mut() { - obj.insert( - "data_source".into(), - match fetched.source { - cloud::FetchSource::Local => json!("local"), - cloud::FetchSource::Cloud => json!("cloud"), - }, - ); - } - Ok(data) -} - -/// Pure parser. Kept public so the cloud pipeline can reuse it on its -/// own fetched HTML without going through the async extract path. -pub fn parse(html: &str, url: &str) -> Result { - let domain = parse_review_domain(url).ok_or_else(|| { - FetchError::Build(format!( - "trustpilot_reviews: cannot parse /review/{{domain}} from '{url}'" - )) - })?; - - let blocks = webclaw_core::structured_data::extract_json_ld(html); - - // The business Dataset block has `about.@id` pointing to the target - // domain's Organization (e.g. `.../Organization/anthropic.com`). - let dataset = find_business_dataset(&blocks, &domain); - - // The aiSummary block: not typed (no `@type`), detect by key. - let ai_block = find_ai_summary_block(&blocks); - - // Business name: Dataset > metadata.title regex > URL domain. - let business_name = dataset - .as_ref() - .and_then(|d| get_string(d, "name")) - .or_else(|| parse_name_from_og_title(html)) - .or_else(|| Some(domain.clone())); - - // Rating distribution from the csvw:Table columns. Each column has - // csvw:name like "1 star" / "Total" and a single cell with the - // integer count. - let distribution = dataset.as_ref().and_then(parse_star_distribution); - let (rating_from_dist, total_from_dist) = distribution - .as_ref() - .map(compute_rating_stats) - .unwrap_or((None, None)); - - // Page-title / page-description fallbacks. OG title format: - // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot" - let (rating_label, rating_from_og) = parse_rating_from_og_title(html); - let total_from_desc = parse_review_count_from_og_description(html); - - // Recent reviews carried by the aiSummary block. - let recent_reviews: Vec = ai_block - .as_ref() - .and_then(|a| a.get("aiSummaryReviews")) - .and_then(|arr| arr.as_array()) - .map(|arr| arr.iter().map(extract_review).collect()) - .unwrap_or_default(); - - let ai_summary = ai_block - .as_ref() - .and_then(|a| a.get("aiSummary")) - .and_then(|s| s.get("summary")) - .and_then(|t| t.as_str()) - .map(String::from); - - Ok(json!({ - "url": url, - "domain": domain, - "business_name": business_name, - "rating_label": rating_label, - "average_rating": rating_from_dist.or(rating_from_og), - "review_count": total_from_dist.or(total_from_desc), - "rating_distribution": distribution, - "ai_summary": ai_summary, - "recent_reviews": recent_reviews, - "review_count_listed": recent_reviews.len(), - })) -} - -fn cloud_to_fetch_err(e: CloudError) -> FetchError { - FetchError::Build(e.to_string()) -} - -// --------------------------------------------------------------------------- -// URL helpers -// --------------------------------------------------------------------------- - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -/// Pull the target domain from `trustpilot.com/review/{domain}`. -fn parse_review_domain(url: &str) -> Option { - let after = url.split("/review/").nth(1)?; - let stripped = after - .split(['?', '#']) - .next()? - .trim_end_matches('/') - .split('/') - .next() - .unwrap_or(""); - if stripped.is_empty() { - None - } else { - Some(stripped.to_string()) - } -} - -// --------------------------------------------------------------------------- -// JSON-LD block walkers -// --------------------------------------------------------------------------- - -/// Find the Dataset block whose `about.@id` references the target -/// domain's Organization. Falls through to any Dataset if the @id -/// check doesn't match (Trustpilot occasionally varies the URL). -fn find_business_dataset(blocks: &[Value], domain: &str) -> Option { - let mut fallback_any_dataset: Option = None; - for block in blocks { - for node in walk_graph(block) { - if !is_dataset(&node) { - continue; - } - if dataset_about_matches_domain(&node, domain) { - return Some(node); - } - if fallback_any_dataset.is_none() { - fallback_any_dataset = Some(node); - } - } - } - fallback_any_dataset -} - -fn is_dataset(v: &Value) -> bool { - v.get("@type") - .and_then(|t| t.as_str()) - .is_some_and(|s| s == "Dataset") -} - -fn dataset_about_matches_domain(v: &Value, domain: &str) -> bool { - let about_id = v - .get("about") - .and_then(|a| a.get("@id")) - .and_then(|id| id.as_str()); - let Some(id) = about_id else { - return false; - }; - id.contains(&format!("/Organization/{domain}")) -} - -/// The aiSummary / aiSummaryReviews block has no `@type`, so match by -/// presence of the `aiSummary` key. -fn find_ai_summary_block(blocks: &[Value]) -> Option { - for block in blocks { - for node in walk_graph(block) { - if node.get("aiSummary").is_some() { - return Some(node); - } - } - } - None -} - -/// Flatten each block (and its `@graph`) into a list of nodes we can -/// iterate over. Handles both `@graph: [ ... ]` (array) and -/// `@graph: { ... }` (single object) shapes — Trustpilot uses both. -fn walk_graph(block: &Value) -> Vec { - let mut out = vec![block.clone()]; - if let Some(graph) = block.get("@graph") { - match graph { - Value::Array(arr) => out.extend(arr.iter().cloned()), - Value::Object(_) => out.push(graph.clone()), - _ => {} - } - } - out -} - -// --------------------------------------------------------------------------- -// Rating distribution (csvw:Table) -// --------------------------------------------------------------------------- - -/// Parse the per-star distribution from the Dataset block. Returns -/// `{"1_star": {count, percent}, ..., "total": {count, percent}}`. -fn parse_star_distribution(dataset: &Value) -> Option { - let columns = dataset - .get("mainEntity")? - .get("csvw:tableSchema")? - .get("csvw:columns")? - .as_array()?; - let mut out = serde_json::Map::new(); - for col in columns { - let name = col.get("csvw:name").and_then(|n| n.as_str())?; - let cell = col.get("csvw:cells").and_then(|c| c.as_array())?.first()?; - let count = cell - .get("csvw:value") - .and_then(|v| v.as_str()) - .and_then(|s| s.parse::().ok()); - let percent = cell - .get("csvw:notes") - .and_then(|n| n.as_array()) - .and_then(|arr| arr.first()) - .and_then(|s| s.as_str()) - .map(String::from); - let key = normalise_star_key(name); - out.insert( - key, - json!({ - "count": count, - "percent": percent, - }), - ); - } - if out.is_empty() { - None - } else { - Some(Value::Object(out)) - } -} - -/// "1 star" -> "one_star", "Total" -> "total". Easier to consume than -/// the raw "1 star" key which fights YAML/JS property access. -fn normalise_star_key(name: &str) -> String { - let trimmed = name.trim().to_lowercase(); - match trimmed.as_str() { - "1 star" => "one_star".into(), - "2 stars" => "two_stars".into(), - "3 stars" => "three_stars".into(), - "4 stars" => "four_stars".into(), - "5 stars" => "five_stars".into(), - "total" => "total".into(), - other => other.replace(' ', "_"), - } -} - -/// Compute average rating (weighted by bucket) and total count from the -/// parsed distribution. Returns `(average, total)`. -fn compute_rating_stats(distribution: &Value) -> (Option, Option) { - let Some(obj) = distribution.as_object() else { - return (None, None); - }; - let get_count = |key: &str| -> i64 { - obj.get(key) - .and_then(|v| v.get("count")) - .and_then(|v| v.as_i64()) - .unwrap_or(0) - }; - let one = get_count("one_star"); - let two = get_count("two_stars"); - let three = get_count("three_stars"); - let four = get_count("four_stars"); - let five = get_count("five_stars"); - let total_bucket = one + two + three + four + five; - let total = obj - .get("total") - .and_then(|v| v.get("count")) - .and_then(|v| v.as_i64()) - .unwrap_or(total_bucket); - if total == 0 { - return (None, Some(0)); - } - let weighted = one + (two * 2) + (three * 3) + (four * 4) + (five * 5); - let avg = weighted as f64 / total_bucket.max(1) as f64; - // One decimal place, matching how Trustpilot displays the score. - (Some(format!("{avg:.1}")), Some(total)) -} - -// --------------------------------------------------------------------------- -// OG / meta-tag fallbacks -// --------------------------------------------------------------------------- - -/// Regex out the business name from the standard Trustpilot OG title -/// shape: `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"`. -fn parse_name_from_og_title(html: &str) -> Option { - let title = og(html, "title")?; - // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot" - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| Regex::new(r"^(.+?)\s+is rated\b").unwrap()); - re.captures(&title) - .and_then(|c| c.get(1)) - .map(|m| m.as_str().to_string()) -} - -/// Pull the rating label (e.g. "Bad", "Excellent") and numeric value -/// from the OG title. -fn parse_rating_from_og_title(html: &str) -> (Option, Option) { - let Some(title) = og(html, "title") else { - return (None, None); - }; - static RE: OnceLock = OnceLock::new(); - // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot" - let re = RE.get_or_init(|| { - Regex::new(r#"is rated\s*[\\"]+([^"\\]+)[\\"]+\s*with\s*([\d.]+)\s*/\s*5"#).unwrap() - }); - let Some(caps) = re.captures(&title) else { - return (None, None); - }; - ( - caps.get(1).map(|m| m.as_str().trim().to_string()), - caps.get(2).map(|m| m.as_str().to_string()), - ) -} - -/// Parse "hear what 226 customers have already said" from the OG -/// description tag. -fn parse_review_count_from_og_description(html: &str) -> Option { - let desc = og(html, "description")?; - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| Regex::new(r"(\d[\d,]*)\s+customers").unwrap()); - re.captures(&desc)? - .get(1)? - .as_str() - .replace(',', "") - .parse::() - .ok() -} - -fn og(html: &str, prop: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new(r#"(?i)]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() - }); - for c in re.captures_iter(html) { - if c.get(1).is_some_and(|m| m.as_str() == prop) { - let raw = c.get(2).map(|m| m.as_str())?; - return Some(html_unescape(raw)); - } - } - None -} - -/// Minimal HTML entity unescaping for the three entities the -/// synthesize_html escaper might produce. Keeps us off a heavier dep. -fn html_unescape(s: &str) -> String { - s.replace(""", "\"") - .replace("&", "&") - .replace("<", "<") - .replace(">", ">") -} - -fn get_string(v: &Value, key: &str) -> Option { - v.get(key).and_then(|x| x.as_str().map(String::from)) -} - -// --------------------------------------------------------------------------- -// Review extraction -// --------------------------------------------------------------------------- - -fn extract_review(r: &Value) -> Value { - json!({ - "id": r.get("id").and_then(|v| v.as_str()), - "rating": r.get("rating").and_then(|v| v.as_i64()), - "title": r.get("title").and_then(|v| v.as_str()), - "text": r.get("text").and_then(|v| v.as_str()), - "language": r.get("language").and_then(|v| v.as_str()), - "source": r.get("source").and_then(|v| v.as_str()), - "likes": r.get("likes").and_then(|v| v.as_i64()), - "author": r.get("consumer").and_then(|c| c.get("displayName")).and_then(|v| v.as_str()), - "author_country": r.get("consumer").and_then(|c| c.get("countryCode")).and_then(|v| v.as_str()), - "author_review_count": r.get("consumer").and_then(|c| c.get("numberOfReviews")).and_then(|v| v.as_i64()), - "verified": r.get("consumer").and_then(|c| c.get("isVerified")).and_then(|v| v.as_bool()), - "date_experienced": r.get("dates").and_then(|d| d.get("experiencedDate")).and_then(|v| v.as_str()), - "date_published": r.get("dates").and_then(|d| d.get("publishedDate")).and_then(|v| v.as_str()), - }) -} - -// --------------------------------------------------------------------------- -// Tests -// --------------------------------------------------------------------------- - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_trustpilot_review_urls() { - assert!(matches("https://www.trustpilot.com/review/stripe.com")); - assert!(matches("https://trustpilot.com/review/example.com")); - assert!(!matches("https://www.trustpilot.com/")); - assert!(!matches("https://example.com/review/foo")); - } - - #[test] - fn parse_review_domain_handles_query_and_slash() { - assert_eq!( - parse_review_domain("https://www.trustpilot.com/review/anthropic.com"), - Some("anthropic.com".into()) - ); - assert_eq!( - parse_review_domain("https://www.trustpilot.com/review/anthropic.com/"), - Some("anthropic.com".into()) - ); - assert_eq!( - parse_review_domain("https://www.trustpilot.com/review/anthropic.com?stars=5"), - Some("anthropic.com".into()) - ); - } - - #[test] - fn normalise_star_key_covers_all_buckets() { - assert_eq!(normalise_star_key("1 star"), "one_star"); - assert_eq!(normalise_star_key("2 stars"), "two_stars"); - assert_eq!(normalise_star_key("5 stars"), "five_stars"); - assert_eq!(normalise_star_key("Total"), "total"); - } - - #[test] - fn compute_rating_stats_weighted_average() { - // 100 1-stars, 100 5-stars → avg 3.0 over 200 reviews. - let dist = json!({ - "one_star": { "count": 100, "percent": "50%" }, - "two_stars": { "count": 0, "percent": "0%" }, - "three_stars":{ "count": 0, "percent": "0%" }, - "four_stars": { "count": 0, "percent": "0%" }, - "five_stars": { "count": 100, "percent": "50%" }, - "total": { "count": 200, "percent": "100%" }, - }); - let (avg, total) = compute_rating_stats(&dist); - assert_eq!(avg.as_deref(), Some("3.0")); - assert_eq!(total, Some(200)); - } - - #[test] - fn parse_og_title_extracts_name_and_rating() { - let html = r#""#; - assert_eq!(parse_name_from_og_title(html), Some("Anthropic".into())); - let (label, rating) = parse_rating_from_og_title(html); - assert_eq!(label.as_deref(), Some("Bad")); - assert_eq!(rating.as_deref(), Some("1.5")); - } - - #[test] - fn parse_review_count_from_og_description_picks_number() { - let html = r#""#; - assert_eq!(parse_review_count_from_og_description(html), Some(226)); - } - - #[test] - fn parse_full_fixture_assembles_all_fields() { - let html = r##" - - - - - -"##; - let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap(); - assert_eq!(v["domain"], "anthropic.com"); - assert_eq!(v["business_name"], "Anthropic"); - assert_eq!(v["rating_label"], "Bad"); - assert_eq!(v["review_count"], 226); - assert_eq!(v["rating_distribution"]["one_star"]["count"], 196); - assert_eq!(v["rating_distribution"]["total"]["count"], 226); - assert_eq!(v["ai_summary"], "Mixed reviews."); - assert_eq!(v["recent_reviews"].as_array().unwrap().len(), 1); - assert_eq!(v["recent_reviews"][0]["author"], "W.FRH"); - assert_eq!(v["recent_reviews"][0]["rating"], 1); - assert_eq!(v["recent_reviews"][0]["title"], "Bad"); - } - - #[test] - fn parse_falls_back_to_og_when_no_jsonld() { - let html = r#" -"#; - let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap(); - assert_eq!(v["domain"], "anthropic.com"); - assert_eq!(v["business_name"], "Anthropic"); - assert_eq!(v["average_rating"], "1.5"); - assert_eq!(v["review_count"], 226); - assert_eq!(v["rating_label"], "Bad"); - } - - #[test] - fn parse_returns_ok_with_url_domain_when_nothing_else() { - let v = parse( - "", - "https://www.trustpilot.com/review/example.com", - ) - .unwrap(); - assert_eq!(v["domain"], "example.com"); - assert_eq!(v["business_name"], "example.com"); - } -} diff --git a/crates/webclaw-fetch/src/extractors/woocommerce_product.rs b/crates/webclaw-fetch/src/extractors/woocommerce_product.rs deleted file mode 100644 index db6dd78..0000000 --- a/crates/webclaw-fetch/src/extractors/woocommerce_product.rs +++ /dev/null @@ -1,237 +0,0 @@ -//! WooCommerce product structured extractor. -//! -//! Targets WooCommerce's Store API: `/wp-json/wc/store/v1/products?slug={slug}`. -//! About 30-50% of WooCommerce stores expose this endpoint publicly -//! (it's on by default, but common security plugins disable it). -//! When it's off, the server returns 404 at /wp-json. We surface a -//! clean error and point callers at `/v1/scrape/ecommerce_product` -//! which works on any store with Schema.org JSON-LD. -//! -//! Explicit-call only. `/product/{slug}` is the default permalink for -//! WooCommerce but custom stores use every variation imaginable, so -//! auto-dispatch is unreliable. - -use serde::Deserialize; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "woocommerce_product", - label: "WooCommerce product", - description: "Returns product via the WooCommerce Store REST API (requires the /wp-json/wc/store endpoint to be enabled on the target store).", - url_patterns: &[ - "https://{shop}/product/{slug}", - "https://{shop}/shop/{slug}", - ], -}; - -pub fn matches(url: &str) -> bool { - let host = host_of(url); - if host.is_empty() { - return false; - } - // Permissive: WooCommerce stores use custom domains + custom - // permalinks. The extractor's API probe is what confirms it's - // really WooCommerce. - url.contains("/product/") - || url.contains("/shop/") - || url.contains("/producto/") // common es locale - || url.contains("/produit/") // common fr locale -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let slug = parse_slug(url).ok_or_else(|| { - FetchError::Build(format!( - "woocommerce_product: cannot parse slug from '{url}'" - )) - })?; - let host = host_of(url); - if host.is_empty() { - return Err(FetchError::Build(format!( - "woocommerce_product: empty host in '{url}'" - ))); - } - let scheme = if url.starts_with("http://") { - "http" - } else { - "https" - }; - let api_url = format!("{scheme}://{host}/wp-json/wc/store/v1/products?slug={slug}&per_page=1"); - let resp = client.fetch(&api_url).await?; - if resp.status == 404 { - return Err(FetchError::Build(format!( - "woocommerce_product: {host} does not expose /wp-json/wc/store (404). \ - Use /v1/scrape/ecommerce_product for JSON-LD fallback." - ))); - } - if resp.status == 401 || resp.status == 403 { - return Err(FetchError::Build(format!( - "woocommerce_product: {host} requires auth for /wp-json/wc/store ({}). \ - Use /v1/scrape/ecommerce_product for the public JSON-LD fallback.", - resp.status - ))); - } - if resp.status != 200 { - return Err(FetchError::Build(format!( - "woocommerce api returned status {} for {api_url}", - resp.status - ))); - } - - let products: Vec = serde_json::from_str(&resp.html) - .map_err(|e| FetchError::BodyDecode(format!("woocommerce parse: {e}")))?; - let p = products.into_iter().next().ok_or_else(|| { - FetchError::Build(format!( - "woocommerce_product: no product found for slug '{slug}' on {host}" - )) - })?; - - let images: Vec = p - .images - .iter() - .map(|i| json!({"src": i.src, "thumbnail": i.thumbnail, "alt": i.alt})) - .collect(); - let variations_count = p.variations.as_ref().map(|v| v.len()).unwrap_or(0); - - Ok(json!({ - "url": url, - "api_url": api_url, - "product_id": p.id, - "name": p.name, - "slug": p.slug, - "sku": p.sku, - "permalink": p.permalink, - "on_sale": p.on_sale, - "in_stock": p.is_in_stock, - "is_purchasable": p.is_purchasable, - "price": p.prices.as_ref().and_then(|pr| pr.price.clone()), - "regular_price": p.prices.as_ref().and_then(|pr| pr.regular_price.clone()), - "sale_price": p.prices.as_ref().and_then(|pr| pr.sale_price.clone()), - "currency": p.prices.as_ref().and_then(|pr| pr.currency_code.clone()), - "currency_minor": p.prices.as_ref().and_then(|pr| pr.currency_minor_unit), - "price_range": p.prices.as_ref().and_then(|pr| pr.price_range.clone()), - "average_rating": p.average_rating, - "review_count": p.review_count, - "description": p.description, - "short_description": p.short_description, - "categories": p.categories.iter().filter_map(|c| c.name.clone()).collect::>(), - "tags": p.tags.iter().filter_map(|t| t.name.clone()).collect::>(), - "variation_count": variations_count, - "image_count": images.len(), - "images": images, - })) -} - -// --------------------------------------------------------------------------- -// URL helpers -// --------------------------------------------------------------------------- - -fn host_of(url: &str) -> &str { - url.split("://") - .nth(1) - .unwrap_or(url) - .split('/') - .next() - .unwrap_or("") -} - -/// Extract the product slug from common WooCommerce permalinks. -fn parse_slug(url: &str) -> Option { - for needle in ["/product/", "/shop/", "/producto/", "/produit/"] { - if let Some(after) = url.split(needle).nth(1) { - let stripped = after - .split(['?', '#']) - .next()? - .trim_end_matches('/') - .split('/') - .next() - .unwrap_or(""); - if !stripped.is_empty() { - return Some(stripped.to_string()); - } - } - } - None -} - -// --------------------------------------------------------------------------- -// Store API types (subset of the full response) -// --------------------------------------------------------------------------- - -#[derive(Deserialize)] -struct Product { - id: Option, - name: Option, - slug: Option, - sku: Option, - permalink: Option, - description: Option, - short_description: Option, - on_sale: Option, - is_in_stock: Option, - is_purchasable: Option, - average_rating: Option, // string or number - review_count: Option, - prices: Option, - #[serde(default)] - categories: Vec, - #[serde(default)] - tags: Vec, - #[serde(default)] - images: Vec, - variations: Option>, -} - -#[derive(Deserialize)] -struct Prices { - price: Option, - regular_price: Option, - sale_price: Option, - currency_code: Option, - currency_minor_unit: Option, - price_range: Option, -} - -#[derive(Deserialize)] -struct Term { - name: Option, -} - -#[derive(Deserialize)] -struct Img { - src: Option, - thumbnail: Option, - alt: Option, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_common_permalinks() { - assert!(matches("https://shop.example.com/product/cool-widget")); - assert!(matches("https://shop.example.com/shop/cool-widget")); - assert!(matches("https://tienda.example.com/producto/cosa")); - assert!(matches("https://boutique.example.com/produit/chose")); - } - - #[test] - fn parse_slug_handles_locale_and_suffix() { - assert_eq!( - parse_slug("https://shop.example.com/product/cool-widget"), - Some("cool-widget".into()) - ); - assert_eq!( - parse_slug("https://shop.example.com/product/cool-widget/?attr=red"), - Some("cool-widget".into()) - ); - assert_eq!( - parse_slug("https://tienda.example.com/producto/cosa/"), - Some("cosa".into()) - ); - } -} diff --git a/crates/webclaw-fetch/src/extractors/youtube_video.rs b/crates/webclaw-fetch/src/extractors/youtube_video.rs deleted file mode 100644 index 2551ff8..0000000 --- a/crates/webclaw-fetch/src/extractors/youtube_video.rs +++ /dev/null @@ -1,378 +0,0 @@ -//! YouTube video structured extractor. -//! -//! YouTube embeds the full player configuration in a -//! `ytInitialPlayerResponse` JavaScript assignment at the top of -//! every `/watch`, `/shorts`, and `youtu.be` HTML page. We reuse the -//! core crate's already-proven regex + parse to surface typed JSON -//! from it: video id, title, author + channel id, view count, -//! duration, upload date, keywords, thumbnails, caption-track URLs. -//! -//! Auto-dispatched: YouTube host is unique and the `v=` or `/shorts/` -//! shape is stable. -//! -//! ## Fallback -//! -//! `ytInitialPlayerResponse` is missing on EU-consent interstitials, -//! some live-stream pre-show pages, and age-gated videos. In those -//! cases we drop down to OG tags for `title`, `description`, -//! `thumbnail`, and `channel`, and return a `data_source: -//! "og_fallback"` payload so the caller can tell they got a degraded -//! shape (no view count, duration, captions). - -use std::sync::OnceLock; - -use regex::Regex; -use serde_json::{Value, json}; - -use super::ExtractorInfo; -use crate::error::FetchError; -use crate::fetcher::Fetcher; - -pub const INFO: ExtractorInfo = ExtractorInfo { - name: "youtube_video", - label: "YouTube video", - description: "Returns video id, title, channel, view count, duration, upload date, thumbnails, keywords, and caption-track URLs. Falls back to OG metadata on consent / age-gate pages.", - url_patterns: &[ - "https://www.youtube.com/watch?v={id}", - "https://youtu.be/{id}", - "https://www.youtube.com/shorts/{id}", - ], -}; - -pub fn matches(url: &str) -> bool { - webclaw_core::youtube::is_youtube_url(url) - || url.contains("youtube.com/shorts/") - || url.contains("youtube-nocookie.com/embed/") -} - -pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { - let video_id = parse_video_id(url).ok_or_else(|| { - FetchError::Build(format!("youtube_video: cannot parse video id from '{url}'")) - })?; - - // Always fetch the canonical /watch URL. /shorts/ and youtu.be - // sometimes serve a thinner page without the player blob. - let canonical = format!("https://www.youtube.com/watch?v={video_id}"); - let resp = client.fetch(&canonical).await?; - if resp.status != 200 { - return Err(FetchError::Build(format!( - "youtube returned status {} for {canonical}", - resp.status - ))); - } - - if let Some(player) = extract_player_response(&resp.html) { - return Ok(build_player_payload( - &player, &resp.html, url, &canonical, &video_id, - )); - } - - // No player blob. Fall back to OG tags so the call still returns - // something useful for consent / age-gate pages. - Ok(build_og_fallback(&resp.html, url, &canonical, &video_id)) -} - -// --------------------------------------------------------------------------- -// Player-blob path (rich payload) -// --------------------------------------------------------------------------- - -fn build_player_payload( - player: &Value, - html: &str, - url: &str, - canonical: &str, - video_id: &str, -) -> Value { - let video_details = player.get("videoDetails"); - let microformat = player - .get("microformat") - .and_then(|m| m.get("playerMicroformatRenderer")); - - let thumbnails: Vec = video_details - .and_then(|vd| vd.get("thumbnail")) - .and_then(|t| t.get("thumbnails")) - .and_then(|t| t.as_array()) - .cloned() - .unwrap_or_default(); - - let keywords: Vec = video_details - .and_then(|vd| vd.get("keywords")) - .and_then(|k| k.as_array()) - .cloned() - .unwrap_or_default(); - - let caption_tracks = webclaw_core::youtube::extract_caption_tracks(html); - let captions: Vec = caption_tracks - .iter() - .map(|c| { - json!({ - "url": c.url, - "lang": c.lang, - "name": c.name, - }) - }) - .collect(); - - json!({ - "url": url, - "canonical_url":canonical, - "data_source": "player_response", - "video_id": video_id, - "title": get_str(video_details, "title"), - "description": get_str(video_details, "shortDescription"), - "author": get_str(video_details, "author"), - "channel_id": get_str(video_details, "channelId"), - "channel_url": get_str(microformat, "ownerProfileUrl"), - "view_count": get_int(video_details, "viewCount"), - "length_seconds": get_int(video_details, "lengthSeconds"), - "is_live": video_details.and_then(|vd| vd.get("isLiveContent")).and_then(|v| v.as_bool()), - "is_private": video_details.and_then(|vd| vd.get("isPrivate")).and_then(|v| v.as_bool()), - "is_unlisted": microformat.and_then(|m| m.get("isUnlisted")).and_then(|v| v.as_bool()), - "allow_ratings":video_details.and_then(|vd| vd.get("allowRatings")).and_then(|v| v.as_bool()), - "category": get_str(microformat, "category"), - "upload_date": get_str(microformat, "uploadDate"), - "publish_date": get_str(microformat, "publishDate"), - "keywords": keywords, - "thumbnails": thumbnails, - "caption_tracks": captions, - }) -} - -// --------------------------------------------------------------------------- -// OG fallback path (degraded payload) -// --------------------------------------------------------------------------- - -fn build_og_fallback(html: &str, url: &str, canonical: &str, video_id: &str) -> Value { - let title = og(html, "title"); - let description = og(html, "description"); - let thumbnail = og(html, "image"); - // YouTube sets `` on some pages but - // OG-only pages reliably carry `og:video:tag` and the channel in - // ``. We keep this lean: just what's stable. - let channel = meta_name(html, "author"); - - json!({ - "url": url, - "canonical_url":canonical, - "data_source": "og_fallback", - "video_id": video_id, - "title": title, - "description": description, - "author": channel, - // OG path: these are null so the caller doesn't have to guess. - "channel_id": None::, - "channel_url": None::, - "view_count": None::, - "length_seconds": None::, - "is_live": None::, - "is_private": None::, - "is_unlisted": None::, - "allow_ratings":None::, - "category": None::, - "upload_date": None::, - "publish_date": None::, - "keywords": Vec::::new(), - "thumbnails": thumbnail.as_ref().map(|t| vec![json!({"url": t})]).unwrap_or_default(), - "caption_tracks": Vec::::new(), - }) -} - -// --------------------------------------------------------------------------- -// URL helpers -// --------------------------------------------------------------------------- - -fn parse_video_id(url: &str) -> Option { - // youtu.be/{id} - if let Some(after) = url.split("youtu.be/").nth(1) { - let id = after - .split(['?', '#', '/']) - .next() - .unwrap_or("") - .trim_end_matches('/'); - if !id.is_empty() { - return Some(id.to_string()); - } - } - // youtube.com/shorts/{id} - if let Some(after) = url.split("youtube.com/shorts/").nth(1) { - let id = after - .split(['?', '#', '/']) - .next() - .unwrap_or("") - .trim_end_matches('/'); - if !id.is_empty() { - return Some(id.to_string()); - } - } - // youtube-nocookie.com/embed/{id} - if let Some(after) = url.split("/embed/").nth(1) { - let id = after - .split(['?', '#', '/']) - .next() - .unwrap_or("") - .trim_end_matches('/'); - if !id.is_empty() { - return Some(id.to_string()); - } - } - // youtube.com/watch?v={id} (also matches youtube.com/watch?foo=bar&v={id}) - if let Some(q) = url.split_once('?').map(|(_, q)| q) - && let Some(id) = q - .split('&') - .find_map(|p| p.strip_prefix("v=").map(|v| v.to_string())) - { - let id = id.split(['#', '/']).next().unwrap_or(&id).to_string(); - if !id.is_empty() { - return Some(id); - } - } - None -} - -// --------------------------------------------------------------------------- -// Player-response parsing -// --------------------------------------------------------------------------- - -fn extract_player_response(html: &str) -> Option { - // Same regex as webclaw_core::youtube. Duplicated here because - // core's regex is module-private. Kept in lockstep; changes are - // rare and we cover with tests in both places. - static RE: OnceLock = OnceLock::new(); - let re = RE - .get_or_init(|| Regex::new(r"var\s+ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;").unwrap()); - let json_str = re.captures(html)?.get(1)?.as_str(); - serde_json::from_str(json_str).ok() -} - -// --------------------------------------------------------------------------- -// Meta-tag helpers (for OG fallback) -// --------------------------------------------------------------------------- - -fn og(html: &str, prop: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new(r#"(?i)]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap() - }); - for c in re.captures_iter(html) { - if c.get(1).is_some_and(|m| m.as_str() == prop) { - return c.get(2).map(|m| m.as_str().to_string()); - } - } - None -} - -fn meta_name(html: &str, name: &str) -> Option { - static RE: OnceLock = OnceLock::new(); - let re = RE.get_or_init(|| { - Regex::new(r#"(?i)]+name="([^"]+)"[^>]+content="([^"]+)""#).unwrap() - }); - for c in re.captures_iter(html) { - if c.get(1).is_some_and(|m| m.as_str() == name) { - return c.get(2).map(|m| m.as_str().to_string()); - } - } - None -} - -fn get_str(v: Option<&Value>, key: &str) -> Option { - v.and_then(|x| x.get(key)) - .and_then(|x| x.as_str().map(String::from)) -} - -fn get_int(v: Option<&Value>, key: &str) -> Option { - v.and_then(|x| x.get(key)).and_then(|x| { - x.as_i64() - .or_else(|| x.as_str().and_then(|s| s.parse::().ok())) - }) -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn matches_watch_urls() { - assert!(matches("https://www.youtube.com/watch?v=dQw4w9WgXcQ")); - assert!(matches("https://youtu.be/dQw4w9WgXcQ")); - assert!(matches("https://www.youtube.com/shorts/abc123")); - assert!(matches( - "https://www.youtube-nocookie.com/embed/dQw4w9WgXcQ" - )); - } - - #[test] - fn rejects_non_video_urls() { - assert!(!matches("https://www.youtube.com/")); - assert!(!matches("https://www.youtube.com/channel/abc")); - assert!(!matches("https://example.com/watch?v=abc")); - } - - #[test] - fn parse_video_id_from_each_shape() { - assert_eq!( - parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"), - Some("dQw4w9WgXcQ".into()) - ); - assert_eq!( - parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=10s"), - Some("dQw4w9WgXcQ".into()) - ); - assert_eq!( - parse_video_id("https://www.youtube.com/watch?feature=share&v=dQw4w9WgXcQ"), - Some("dQw4w9WgXcQ".into()) - ); - assert_eq!( - parse_video_id("https://youtu.be/dQw4w9WgXcQ"), - Some("dQw4w9WgXcQ".into()) - ); - assert_eq!( - parse_video_id("https://youtu.be/dQw4w9WgXcQ?t=30"), - Some("dQw4w9WgXcQ".into()) - ); - assert_eq!( - parse_video_id("https://www.youtube.com/shorts/abc123"), - Some("abc123".into()) - ); - } - - #[test] - fn extract_player_response_happy_path() { - let html = r#" - - - -"#; - let v = extract_player_response(html).unwrap(); - let vd = v.get("videoDetails").unwrap(); - assert_eq!(vd.get("title").unwrap().as_str(), Some("T")); - } - - #[test] - fn og_fallback_extracts_basics_from_meta_tags() { - let html = r##" - - - - - -"##; - let v = build_og_fallback( - html, - "https://www.youtube.com/watch?v=abc", - "https://www.youtube.com/watch?v=abc", - "abc", - ); - assert_eq!(v["data_source"], "og_fallback"); - assert_eq!(v["title"], "Example Video Title"); - assert_eq!(v["description"], "A cool video description."); - assert_eq!(v["author"], "Example Channel"); - assert_eq!( - v["thumbnails"][0]["url"], - "https://i.ytimg.com/vi/abc/maxresdefault.jpg" - ); - assert!(v["view_count"].is_null()); - assert!(v["caption_tracks"].as_array().unwrap().is_empty()); - } -} diff --git a/crates/webclaw-fetch/src/fetcher.rs b/crates/webclaw-fetch/src/fetcher.rs deleted file mode 100644 index fabcf44..0000000 --- a/crates/webclaw-fetch/src/fetcher.rs +++ /dev/null @@ -1,118 +0,0 @@ -//! Pluggable fetcher abstraction for vertical extractors. -//! -//! Extractors call the network through this trait instead of hard- -//! coding [`FetchClient`]. The OSS CLI / MCP / self-hosted server all -//! pass `&FetchClient` (wreq-backed BoringSSL). The production API -//! server, which must not use in-process TLS fingerprinting, provides -//! its own implementation that routes through the Go tls-sidecar. -//! -//! Both paths expose the same [`FetchResult`] shape and the same -//! optional cloud-escalation client, so extractor logic stays -//! identical across environments. -//! -//! ## Choosing an implementation -//! -//! - CLI, MCP, self-hosted `webclaw-server`: build a [`FetchClient`] -//! with [`FetchClient::with_cloud`] to attach cloud fallback, pass -//! it to extractors as `&client`. -//! - `api.webclaw.io` production server: build a `TlsSidecarFetcher` -//! (in `server/src/engine/`) that delegates to `engine::tls_client` -//! and wraps it in `Arc` for handler injection. -//! -//! ## Why a trait and not a free function -//! -//! Extractors need state beyond a single fetch: the cloud client for -//! antibot escalation, and in the future per-user proxy pools, tenant -//! headers, circuit breakers. A trait keeps that state encapsulated -//! behind the fetch interface instead of threading it through every -//! extractor signature. - -use async_trait::async_trait; - -use crate::client::FetchResult; -use crate::cloud::CloudClient; -use crate::error::FetchError; - -/// HTTP fetch surface used by vertical extractors. -/// -/// Implementations must be `Send + Sync` because extractor dispatchers -/// run them inside tokio tasks, potentially across many requests. -#[async_trait] -pub trait Fetcher: Send + Sync { - /// Fetch a URL and return the raw response body + metadata. The - /// body is in `FetchResult::html` regardless of the actual content - /// type — JSON API endpoints put JSON there, HTML pages put HTML. - /// Extractors branch on response status and body shape. - async fn fetch(&self, url: &str) -> Result; - - /// Fetch with additional request headers. Needed for endpoints - /// that authenticate via a specific header (Instagram's - /// `x-ig-app-id`, for example). Default implementation routes to - /// [`Self::fetch`] so implementers without header support stay - /// functional, though the `Option` field they'd set won't - /// be populated on the request. - async fn fetch_with_headers( - &self, - url: &str, - _headers: &[(&str, &str)], - ) -> Result { - self.fetch(url).await - } - - /// Optional cloud-escalation client for antibot bypass. Returning - /// `Some` tells extractors they can call into the hosted API when - /// local fetch hits a challenge page. Returning `None` makes - /// cloud-gated extractors emit [`CloudError::NotConfigured`] with - /// an actionable signup link. - /// - /// The default implementation returns `None` because not every - /// deployment wants cloud fallback (self-hosts that don't have a - /// webclaw.io subscription, for instance). - /// - /// [`CloudError::NotConfigured`]: crate::cloud::CloudError::NotConfigured - fn cloud(&self) -> Option<&CloudClient> { - None - } -} - -// --------------------------------------------------------------------------- -// Blanket impls: make `&T` and `Arc` behave like the wrapped `T`. -// --------------------------------------------------------------------------- - -#[async_trait] -impl Fetcher for &T { - async fn fetch(&self, url: &str) -> Result { - (**self).fetch(url).await - } - - async fn fetch_with_headers( - &self, - url: &str, - headers: &[(&str, &str)], - ) -> Result { - (**self).fetch_with_headers(url, headers).await - } - - fn cloud(&self) -> Option<&CloudClient> { - (**self).cloud() - } -} - -#[async_trait] -impl Fetcher for std::sync::Arc { - async fn fetch(&self, url: &str) -> Result { - (**self).fetch(url).await - } - - async fn fetch_with_headers( - &self, - url: &str, - headers: &[(&str, &str)], - ) -> Result { - (**self).fetch_with_headers(url, headers).await - } - - fn cloud(&self) -> Option<&CloudClient> { - (**self).cloud() - } -} diff --git a/crates/webclaw-fetch/src/lib.rs b/crates/webclaw-fetch/src/lib.rs index 83664a1..517cb6e 100644 --- a/crates/webclaw-fetch/src/lib.rs +++ b/crates/webclaw-fetch/src/lib.rs @@ -3,12 +3,9 @@ //! Automatically detects PDF responses and delegates to webclaw-pdf. pub mod browser; pub mod client; -pub mod cloud; pub mod crawler; pub mod document; pub mod error; -pub mod extractors; -pub mod fetcher; pub mod linkedin; pub mod proxy; pub mod reddit; @@ -19,7 +16,6 @@ pub use browser::BrowserProfile; pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult}; pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult}; pub use error::FetchError; -pub use fetcher::Fetcher; pub use http::HeaderMap; pub use proxy::{parse_proxy_file, parse_proxy_line}; pub use sitemap::SitemapEntry; diff --git a/crates/webclaw-mcp/Cargo.toml b/crates/webclaw-mcp/Cargo.toml index ec3b2b4..df9dd97 100644 --- a/crates/webclaw-mcp/Cargo.toml +++ b/crates/webclaw-mcp/Cargo.toml @@ -22,5 +22,6 @@ serde_json = { workspace = true } tokio = { workspace = true } tracing = { workspace = true } tracing-subscriber = { workspace = true } +reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] } url = "2" dirs = "6.0.0" diff --git a/crates/webclaw-mcp/src/cloud.rs b/crates/webclaw-mcp/src/cloud.rs new file mode 100644 index 0000000..ac602e4 --- /dev/null +++ b/crates/webclaw-mcp/src/cloud.rs @@ -0,0 +1,302 @@ +/// Cloud API fallback for protected sites. +/// +/// When local fetch returns a challenge page, this module retries +/// via api.webclaw.io. Requires WEBCLAW_API_KEY to be set. +use std::time::Duration; + +use serde_json::{Value, json}; +use tracing::info; + +const API_BASE: &str = "https://api.webclaw.io/v1"; + +/// Lightweight client for the webclaw cloud API. +pub struct CloudClient { + api_key: String, + http: reqwest::Client, +} + +impl CloudClient { + /// Create a new cloud client from WEBCLAW_API_KEY env var. + /// Returns None if the key is not set. + pub fn from_env() -> Option { + let key = std::env::var("WEBCLAW_API_KEY").ok()?; + if key.is_empty() { + return None; + } + let http = reqwest::Client::builder() + .timeout(Duration::from_secs(60)) + .build() + .unwrap_or_default(); + Some(Self { api_key: key, http }) + } + + /// Scrape a URL via the cloud API. Returns the response JSON. + pub async fn scrape( + &self, + url: &str, + formats: &[&str], + include_selectors: &[String], + exclude_selectors: &[String], + only_main_content: bool, + ) -> Result { + let mut body = json!({ + "url": url, + "formats": formats, + }); + + if only_main_content { + body["only_main_content"] = json!(true); + } + if !include_selectors.is_empty() { + body["include_selectors"] = json!(include_selectors); + } + if !exclude_selectors.is_empty() { + body["exclude_selectors"] = json!(exclude_selectors); + } + + self.post("scrape", body).await + } + + /// Generic POST to the cloud API. + pub async fn post(&self, endpoint: &str, body: Value) -> Result { + let resp = self + .http + .post(format!("{API_BASE}/{endpoint}")) + .header("Authorization", format!("Bearer {}", self.api_key)) + .json(&body) + .send() + .await + .map_err(|e| format!("Cloud API request failed: {e}"))?; + + let status = resp.status(); + if !status.is_success() { + let text = resp.text().await.unwrap_or_default(); + let truncated = truncate_error(&text); + return Err(format!("Cloud API error {status}: {truncated}")); + } + + resp.json::() + .await + .map_err(|e| format!("Cloud API response parse failed: {e}")) + } + + /// Generic GET from the cloud API. + pub async fn get(&self, endpoint: &str) -> Result { + let resp = self + .http + .get(format!("{API_BASE}/{endpoint}")) + .header("Authorization", format!("Bearer {}", self.api_key)) + .send() + .await + .map_err(|e| format!("Cloud API request failed: {e}"))?; + + let status = resp.status(); + if !status.is_success() { + let text = resp.text().await.unwrap_or_default(); + let truncated = truncate_error(&text); + return Err(format!("Cloud API error {status}: {truncated}")); + } + + resp.json::() + .await + .map_err(|e| format!("Cloud API response parse failed: {e}")) + } +} + +/// Truncate error body to avoid flooding logs with huge HTML responses. +fn truncate_error(text: &str) -> &str { + const MAX_LEN: usize = 500; + match text.char_indices().nth(MAX_LEN) { + Some((byte_pos, _)) => &text[..byte_pos], + None => text, + } +} + +/// Check if fetched HTML looks like a bot protection challenge page. +/// Detects common bot protection challenge pages. +pub fn is_bot_protected(html: &str, headers: &webclaw_fetch::HeaderMap) -> bool { + let html_lower = html.to_lowercase(); + + // Cloudflare challenge page + if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") { + return true; + } + + // Cloudflare "checking your browser" spinner + if (html_lower.contains("just a moment") || html_lower.contains("checking your browser")) + && html_lower.contains("cf-spinner") + { + return true; + } + + // Cloudflare Turnstile (only on short pages = challenge, not embedded on real content) + if (html_lower.contains("cf-turnstile") + || html_lower.contains("challenges.cloudflare.com/turnstile")) + && html.len() < 100_000 + { + return true; + } + + // DataDome + if html_lower.contains("geo.captcha-delivery.com") + || html_lower.contains("captcha-delivery.com/captcha") + { + return true; + } + + // AWS WAF + if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") { + return true; + } + + // hCaptcha blocking page + if html_lower.contains("hcaptcha.com") + && html_lower.contains("h-captcha") + && html.len() < 50_000 + { + return true; + } + + // Cloudflare via headers + challenge body + let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some(); + if has_cf_headers + && (html_lower.contains("just a moment") || html_lower.contains("checking your browser")) + { + return true; + } + + false +} + +/// Check if a page likely needs JS rendering (SPA with almost no text content). +pub fn needs_js_rendering(word_count: usize, html: &str) -> bool { + let has_scripts = html.contains(" 5_000 && has_scripts { + return true; + } + + // Tier 2: SPA framework detected with suspiciously low content-to-HTML ratio + if word_count < 800 && html.len() > 50_000 && has_scripts { + let html_lower = html.to_lowercase(); + let has_spa_marker = html_lower.contains("react-app") + || html_lower.contains("id=\"__next\"") + || html_lower.contains("id=\"root\"") + || html_lower.contains("id=\"app\"") + || html_lower.contains("__next_data__") + || html_lower.contains("nuxt") + || html_lower.contains("ng-app"); + + if has_spa_marker { + return true; + } + } + + false +} + +/// Result of a smart fetch: either local extraction or cloud API response. +pub enum SmartFetchResult { + /// Successfully extracted locally. + Local(Box), + /// Fell back to cloud API. Contains the API response JSON. + Cloud(Value), +} + +/// Try local fetch first, fall back to cloud API if bot-protected or JS-rendered. +/// +/// Returns the extraction result (local) or the cloud API response JSON. +/// If no API key is configured and local fetch is blocked, returns an error +/// with a helpful message. +pub async fn smart_fetch( + client: &webclaw_fetch::FetchClient, + cloud: Option<&CloudClient>, + url: &str, + include_selectors: &[String], + exclude_selectors: &[String], + only_main_content: bool, + formats: &[&str], +) -> Result { + // Step 1: Try local fetch (with timeout to avoid hanging on slow servers) + let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url)) + .await + .map_err(|_| format!("Fetch timed out after 30s for {url}"))? + .map_err(|e| format!("Fetch failed: {e}"))?; + + // Step 2: Check for bot protection + if is_bot_protected(&fetch_result.html, &fetch_result.headers) { + info!(url, "bot protection detected, falling back to cloud API"); + return cloud_fallback( + cloud, + url, + include_selectors, + exclude_selectors, + only_main_content, + formats, + ) + .await; + } + + // Step 3: Extract locally + let options = webclaw_core::ExtractionOptions { + include_selectors: include_selectors.to_vec(), + exclude_selectors: exclude_selectors.to_vec(), + only_main_content, + include_raw_html: false, + }; + + let extraction = + webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options) + .map_err(|e| format!("Extraction failed: {e}"))?; + + // Step 4: Check for JS-rendered pages (low content from large HTML) + if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) { + info!( + url, + word_count = extraction.metadata.word_count, + html_len = fetch_result.html.len(), + "JS-rendered page detected, falling back to cloud API" + ); + return cloud_fallback( + cloud, + url, + include_selectors, + exclude_selectors, + only_main_content, + formats, + ) + .await; + } + + Ok(SmartFetchResult::Local(Box::new(extraction))) +} + +async fn cloud_fallback( + cloud: Option<&CloudClient>, + url: &str, + include_selectors: &[String], + exclude_selectors: &[String], + only_main_content: bool, + formats: &[&str], +) -> Result { + match cloud { + Some(c) => { + let resp = c + .scrape( + url, + formats, + include_selectors, + exclude_selectors, + only_main_content, + ) + .await?; + info!(url, "cloud API fallback successful"); + Ok(SmartFetchResult::Cloud(resp)) + } + None => Err(format!( + "Bot protection detected on {url}. Set WEBCLAW_API_KEY for automatic cloud bypass. \ + Get a key at https://webclaw.io" + )), + } +} diff --git a/crates/webclaw-mcp/src/main.rs b/crates/webclaw-mcp/src/main.rs index 89a4755..8576562 100644 --- a/crates/webclaw-mcp/src/main.rs +++ b/crates/webclaw-mcp/src/main.rs @@ -1,6 +1,7 @@ /// webclaw-mcp: MCP (Model Context Protocol) server for webclaw. /// Exposes web extraction tools over stdio transport for AI agents /// like Claude Desktop, Claude Code, and other MCP clients. +mod cloud; mod server; mod tools; diff --git a/crates/webclaw-mcp/src/server.rs b/crates/webclaw-mcp/src/server.rs index a4af79d..f00eae7 100644 --- a/crates/webclaw-mcp/src/server.rs +++ b/crates/webclaw-mcp/src/server.rs @@ -15,8 +15,7 @@ use serde_json::json; use tracing::{error, info, warn}; use url::Url; -use webclaw_fetch::cloud::{self, CloudClient, SmartFetchResult}; - +use crate::cloud::{self, CloudClient, SmartFetchResult}; use crate::tools::*; pub struct WebclawMcp { @@ -718,50 +717,6 @@ impl WebclawMcp { Ok(serde_json::to_string_pretty(&resp).unwrap_or_default()) } } - - /// List every vertical extractor the server knows about. Returns a - /// JSON array of `{name, label, description, url_patterns}` entries. - /// Call this to discover what verticals are available before using - /// `vertical_scrape`. - #[tool] - async fn list_extractors( - &self, - Parameters(_params): Parameters, - ) -> Result { - let catalog = webclaw_fetch::extractors::list(); - serde_json::to_string_pretty(&catalog) - .map_err(|e| format!("failed to serialise extractor catalog: {e}")) - } - - /// Run a vertical extractor by name and return typed JSON specific - /// to the target site (title, price, rating, author, etc.), not - /// generic markdown. Use `list_extractors` to discover available - /// names. Example names: `reddit`, `github_repo`, `trustpilot_reviews`, - /// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`. - /// - /// Antibot-gated verticals (amazon_product, ebay_listing, - /// etsy_listing, trustpilot_reviews) will automatically escalate to - /// the webclaw cloud API when local fetch hits bot protection, - /// provided `WEBCLAW_API_KEY` is set. - #[tool] - async fn vertical_scrape( - &self, - Parameters(params): Parameters, - ) -> Result { - validate_url(¶ms.url)?; - // Reuse the long-lived default FetchClient. Extractors accept - // `&dyn Fetcher`; FetchClient implements the trait so this just - // works (see webclaw_fetch::Fetcher and client::FetchClient). - let data = webclaw_fetch::extractors::dispatch_by_name( - self.fetch_client.as_ref(), - ¶ms.name, - ¶ms.url, - ) - .await - .map_err(|e| e.to_string())?; - serde_json::to_string_pretty(&data) - .map_err(|e| format!("failed to serialise extractor output: {e}")) - } } #[tool_handler] @@ -771,8 +726,7 @@ impl ServerHandler for WebclawMcp { .with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION"))) .with_instructions(String::from( "Webclaw MCP server -- web content extraction for AI agents. \ - Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \ - list_extractors, vertical_scrape.", + Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.", )) } } diff --git a/crates/webclaw-mcp/src/tools.rs b/crates/webclaw-mcp/src/tools.rs index 02bf534..e0195f1 100644 --- a/crates/webclaw-mcp/src/tools.rs +++ b/crates/webclaw-mcp/src/tools.rs @@ -103,20 +103,3 @@ pub struct SearchParams { /// Number of results to return (default: 10) pub num_results: Option, } - -/// Parameters for `vertical_scrape`: run a site-specific extractor by name. -#[derive(Debug, Deserialize, JsonSchema)] -pub struct VerticalParams { - /// Name of the vertical extractor. Call `list_extractors` to see all - /// available names. Examples: "reddit", "github_repo", "pypi", - /// "trustpilot_reviews", "youtube_video", "shopify_product". - pub name: String, - /// URL to extract. Must match the URL patterns the extractor claims; - /// otherwise the tool returns a clear "URL mismatch" error. - pub url: String, -} - -/// `list_extractors` takes no arguments but we still need an empty struct -/// so rmcp can generate a schema and parse the (empty) JSON-RPC params. -#[derive(Debug, Deserialize, JsonSchema)] -pub struct ListExtractorsParams {} diff --git a/crates/webclaw-server/src/main.rs b/crates/webclaw-server/src/main.rs index f4cfdcb..c57fed8 100644 --- a/crates/webclaw-server/src/main.rs +++ b/crates/webclaw-server/src/main.rs @@ -79,15 +79,10 @@ async fn main() -> anyhow::Result<()> { let v1 = Router::new() .route("/scrape", post(routes::scrape::scrape)) - .route( - "/scrape/{vertical}", - post(routes::structured::scrape_vertical), - ) .route("/crawl", post(routes::crawl::crawl)) .route("/map", post(routes::map::map)) .route("/batch", post(routes::batch::batch)) .route("/extract", post(routes::extract::extract)) - .route("/extractors", get(routes::structured::list_extractors)) .route("/summarize", post(routes::summarize::summarize_route)) .route("/diff", post(routes::diff::diff_route)) .route("/brand", post(routes::brand::brand)) diff --git a/crates/webclaw-server/src/routes/mod.rs b/crates/webclaw-server/src/routes/mod.rs index 01f1052..7c3d68e 100644 --- a/crates/webclaw-server/src/routes/mod.rs +++ b/crates/webclaw-server/src/routes/mod.rs @@ -15,5 +15,4 @@ pub mod extract; pub mod health; pub mod map; pub mod scrape; -pub mod structured; pub mod summarize; diff --git a/crates/webclaw-server/src/routes/structured.rs b/crates/webclaw-server/src/routes/structured.rs deleted file mode 100644 index c9cdc1a..0000000 --- a/crates/webclaw-server/src/routes/structured.rs +++ /dev/null @@ -1,55 +0,0 @@ -//! `POST /v1/scrape/{vertical}` and `GET /v1/extractors`. -//! -//! Vertical extractors return typed JSON instead of generic markdown. -//! See `webclaw_fetch::extractors` for the catalog and per-site logic. - -use axum::{ - Json, - extract::{Path, State}, -}; -use serde::Deserialize; -use serde_json::{Value, json}; -use webclaw_fetch::extractors::{self, ExtractorDispatchError}; - -use crate::{error::ApiError, state::AppState}; - -#[derive(Debug, Deserialize)] -pub struct ScrapeRequest { - pub url: String, -} - -/// Map dispatcher errors to ApiError so users get clean HTTP statuses -/// instead of opaque 500s. -impl From for ApiError { - fn from(e: ExtractorDispatchError) -> Self { - match e { - ExtractorDispatchError::UnknownVertical(_) => ApiError::NotFound, - ExtractorDispatchError::UrlMismatch { .. } => ApiError::bad_request(e.to_string()), - ExtractorDispatchError::Fetch(f) => ApiError::Fetch(f.to_string()), - } - } -} - -/// `GET /v1/extractors` — catalog of all available verticals. -pub async fn list_extractors() -> Json { - Json(json!({ - "extractors": extractors::list(), - })) -} - -/// `POST /v1/scrape/{vertical}` — explicit vertical, e.g. /v1/scrape/reddit. -pub async fn scrape_vertical( - State(state): State, - Path(vertical): Path, - Json(req): Json, -) -> Result, ApiError> { - if req.url.trim().is_empty() { - return Err(ApiError::bad_request("`url` is required")); - } - let data = extractors::dispatch_by_name(state.fetch(), &vertical, &req.url).await?; - Ok(Json(json!({ - "vertical": vertical, - "url": req.url, - "data": data, - }))) -} diff --git a/crates/webclaw-server/src/state.rs b/crates/webclaw-server/src/state.rs index 6c2e8f7..b3f9b6b 100644 --- a/crates/webclaw-server/src/state.rs +++ b/crates/webclaw-server/src/state.rs @@ -1,24 +1,7 @@ //! Shared application state. Cheap to clone via Arc; held by the axum //! Router for the life of the process. -//! -//! Two unrelated keys get carried here: -//! -//! 1. [`AppState::api_key`] — the **bearer token clients must present** -//! to call this server. Set via `WEBCLAW_API_KEY` / `--api-key`. -//! Unset = open mode. -//! 2. The inner [`webclaw_fetch::cloud::CloudClient`] (if any) — our -//! **outbound** credential for api.webclaw.io, used by extractors -//! that escalate on antibot. Set via `WEBCLAW_CLOUD_API_KEY`. -//! Unset = hard-site extractors return a "set WEBCLAW_CLOUD_API_KEY" -//! error with a signup link. -//! -//! Different variables on purpose: conflating the two means operators -//! who want their server behind an auth token can't also enable cloud -//! fallback, and vice versa. use std::sync::Arc; -use tracing::info; -use webclaw_fetch::cloud::CloudClient; use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig}; /// Single-process state shared across all request handlers. @@ -34,7 +17,6 @@ struct Inner { /// auto-deref `&Arc` -> `&FetchClient`, so this costs /// them nothing. pub fetch: Arc, - /// Inbound bearer-auth token for this server's own `/v1/*` surface. pub api_key: Option, } @@ -42,34 +24,17 @@ impl AppState { /// Build the application state. The fetch client is constructed once /// and shared across requests so connection pools + browser profile /// state don't churn per request. - /// - /// `inbound_api_key` is the bearer token clients must present; - /// cloud-fallback credentials come from the env (checked here). - pub fn new(inbound_api_key: Option) -> anyhow::Result { + pub fn new(api_key: Option) -> anyhow::Result { let config = FetchConfig { - browser: BrowserProfile::Firefox, + browser: BrowserProfile::Chrome, ..FetchConfig::default() }; - let mut fetch = FetchClient::new(config) + let fetch = FetchClient::new(config) .map_err(|e| anyhow::anyhow!("failed to build fetch client: {e}"))?; - - // Cloud fallback: only activates when the operator has provided - // an api.webclaw.io key. Supports both WEBCLAW_CLOUD_API_KEY - // (preferred, disambiguates from the inbound-auth key) and - // WEBCLAW_API_KEY as a fallback when there's no inbound key - // configured (backwards compat with MCP / CLI conventions). - if let Some(cloud) = build_cloud_client(inbound_api_key.as_deref()) { - info!( - base = cloud.base_url(), - "cloud fallback enabled — antibot-protected sites will escalate via api.webclaw.io" - ); - fetch = fetch.with_cloud(cloud); - } - Ok(Self { inner: Arc::new(Inner { fetch: Arc::new(fetch), - api_key: inbound_api_key, + api_key, }), }) } @@ -82,26 +47,3 @@ impl AppState { self.inner.api_key.as_deref() } } - -/// Resolve the outbound cloud key. Prefers `WEBCLAW_CLOUD_API_KEY`; -/// falls back to `WEBCLAW_API_KEY` *only* when no inbound key is -/// configured (i.e. open mode — the same env var can't mean two -/// things to one process). -fn build_cloud_client(inbound_api_key: Option<&str>) -> Option { - let cloud_key = std::env::var("WEBCLAW_CLOUD_API_KEY").ok(); - if let Some(k) = cloud_key.as_deref() - && !k.trim().is_empty() - { - return Some(CloudClient::with_key(k)); - } - // Reuse WEBCLAW_API_KEY only when not also acting as our own - // inbound-auth token — otherwise we'd be telling the operator - // they can't have both. - if inbound_api_key.is_none() - && let Ok(k) = std::env::var("WEBCLAW_API_KEY") - && !k.trim().is_empty() - { - return Some(CloudClient::with_key(k)); - } - None -}