mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-04-25 00:06:21 +02:00
Compare commits
25 commits
| SHA1 |
|---|
| a5c3433372 |
| 966981bc42 |
| 866fa88aa0 |
| b413d702b2 |
| 98a177dec4 |
| e1af2da509 |
| 2285c585b1 |
| b77767814a |
| 4bf11d902f |
| 0daa2fec1a |
| 058493bc8f |
| aaa5103504 |
| 2373162c81 |
| b2e7dbf365 |
| e10066f527 |
| a53578e45c |
| 7f5eb93b65 |
| 8cc727c2f2 |
| d8c9274a9c |
| 0ab891bd6b |
| 0221c151dc |
| 3bb0a4bca0 |
| b041f3cddd |
| 86182ef28a |
| 8ba7538c37 |
53 changed files with 10592 additions and 443 deletions
CHANGELOG.md (81)
@@ -3,6 +3,87 @@
All notable changes to webclaw are documented here.
Format follows [Keep a Changelog](https://keepachangelog.com/).
## [0.5.6] — 2026-04-23

### Added

- `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API with an identifiable bot `User-Agent`, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. Makes `/v1/scrape` on Reddit populate markdown again.

### Fixed

- Regression introduced in 0.5.4 where the production server's `/v1/scrape` bypassed the Reddit `.json` shortcut and Akamai cookie warmup that `fetch_and_extract` had been providing. Both helpers now live in `fetch_smart`, and every caller path picks them up.
- Panic in the markdown converter (`markdown.rs:925`) on single-pipe `|` lines. A `[1..len-1]` slice on a 1-char input violated the `begin <= end` slice invariant. Now guarded by a minimum-length check.

---
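The guarded slice from the fix above reduces to a few lines. `table_row_cells` is a hypothetical helper name for illustration; the real logic lives inline in `strip_markdown`.

```rust
// Hypothetical helper mirroring the guarded table-row logic. A lone "|"
// has length 1, so the old unguarded `[1..len-1]` slice panicked because
// the range start (1) exceeded the range end (0).
fn table_row_cells(trimmed: &str) -> Option<Vec<&str>> {
    if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
        let inner = &trimmed[1..trimmed.len() - 1];
        Some(inner.split('|').map(str::trim).collect())
    } else {
        None // not a table row; a bare "|" lands here instead of panicking
    }
}

fn main() {
    assert_eq!(table_row_cells("|"), None);
    assert_eq!(table_row_cells("| a | b |"), Some(vec!["a", "b"]));
}
```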
## [0.5.5] — 2026-04-23

### Added

- `webclaw --browser safari-ios` on the CLI. Pairs with `--proxy` for DataDome-fronted sites that reject desktop profiles.

---
## [0.5.4] — 2026-04-23

### Added

- New `BrowserProfile::SafariIos` for Safari iOS 26 fingerprinting. Pairs with a country-matched residential proxy for sites that reject non-mobile profiles.
- `accept_language_for_url(url)` and `accept_language_for_tld(tld)` helpers. They return a locale-appropriate `Accept-Language` based on the URL's TLD, with `en-US` as the fallback.

### Changed

- Chrome browser fingerprint refreshed for current Cloudflare bot management. Fixes 403 challenges on several e-commerce and jobs sites.
- Bumped `wreq-util` to `3.0.0-rc.10`.

---
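The TLD-to-locale helper can be sketched as a lookup with the documented `en-US` fallback. The specific locale strings and TLD coverage below are assumptions for illustration; only the fallback behavior is stated in the changelog.

```rust
// Sketch of accept_language_for_tld. The per-TLD locale strings here are
// illustrative assumptions; the documented contract is "locale-appropriate
// Accept-Language from the TLD, en-US fallback".
fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "fr" => "fr-FR,fr;q=0.9,en;q=0.6",
        "de" => "de-DE,de;q=0.9,en;q=0.6",
        "it" => "it-IT,it;q=0.9,en;q=0.6",
        "es" => "es-ES,es;q=0.9,en;q=0.6",
        _ => "en-US,en;q=0.9", // documented fallback
    }
}

fn main() {
    assert_eq!(accept_language_for_tld("fr"), "fr-FR,fr;q=0.9,en;q=0.6");
    assert_eq!(accept_language_for_tld("xyz"), "en-US,en;q=0.9");
}
```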
## [0.5.2] — 2026-04-22

### Added

- **`webclaw vertical <name> <url>` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`. URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1.
- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick.
- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set.

### Changed

- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded to 10). Also `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`.

---
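The name-based dispatch with the documented URL-mismatch error can be sketched as a registry lookup. The registry type and matcher below are illustrative stand-ins for `webclaw_fetch::extractors::dispatch_by_name`; only the error string format comes from the changelog.

```rust
// Minimal sketch of dispatch-by-name. The Extractor struct and the reddit
// matcher are illustrative; the documented part is the mismatch error shape.
struct Extractor {
    name: &'static str,
    matches: fn(&str) -> bool,
}

fn dispatch<'a>(registry: &'a [Extractor], name: &str, url: &str) -> Result<&'a Extractor, String> {
    let ex = registry
        .iter()
        .find(|e| e.name == name)
        .ok_or_else(|| format!("unknown vertical '{name}'"))?;
    if !(ex.matches)(url) {
        // Matches the error string the CLI prints on stderr with exit code 1.
        return Err(format!("URL '{url}' does not match the '{}' extractor", ex.name));
    }
    Ok(ex)
}

fn main() {
    let registry = [Extractor {
        name: "reddit",
        matches: |u| u.contains("reddit.com/r/") && u.contains("/comments/"),
    }];
    assert!(dispatch(&registry, "reddit", "https://www.reddit.com/r/rust/comments/abc/").is_ok());
    assert!(dispatch(&registry, "reddit", "https://example.com/").is_err());
}
```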
## [0.5.1] — 2026-04-22

### Added

- **`webclaw_fetch::Fetcher` trait.** Vertical extractors now consume `&dyn Fetcher` instead of `&FetchClient` directly. The trait exposes three methods (`fetch`, `fetch_with_headers`, `cloud`) covering everything extractors need. Callers that already held a `FetchClient` keep working unchanged: `FetchClient` implements `Fetcher`, blanket impls cover `&T` and `Arc<T>`, so `&client` coerces to `&dyn Fetcher` automatically.

  The motivation is the split between OSS (wreq-backed, in-process TLS fingerprinting) and the production API server at api.webclaw.io (which cannot use in-process fingerprinting per the architecture rule, and must delegate HTTP through the Go tls-sidecar). Before this trait, adding vertical routes to the production server would have required importing wreq into its dependency graph, violating the separation. Now the production server can provide its own `TlsSidecarFetcher` implementation and pass it to the same extractor dispatcher the OSS server uses.

  Backwards compatible. No behavior change for CLI, MCP, or OSS self-host.

### Changed

- All 28 extractor `extract()` signatures migrated from `client: &FetchClient` to `client: &dyn Fetcher`. The dispatcher functions (`extractors::dispatch_by_url`, `extractors::dispatch_by_name`) and the cloud escalation helpers (`cloud::smart_fetch`, `cloud::smart_fetch_html`) follow the same change. Tests and call sites are unchanged because `&FetchClient` auto-coerces.

---
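The coercion story above can be shown with a simplified, synchronous sketch (the real trait is async and exposes `fetch`, `fetch_with_headers`, and `cloud`). It demonstrates why `&client` keeps compiling after the migration: `FetchClient` implements the trait, and blanket impls cover `&T` and `Arc<T>`.

```rust
use std::sync::Arc;

// Simplified, synchronous stand-in for the async Fetcher trait.
trait Fetcher {
    fn fetch(&self, url: &str) -> String;
}

struct FetchClient;

impl Fetcher for FetchClient {
    fn fetch(&self, url: &str) -> String {
        format!("fetched {url}")
    }
}

// Blanket impls so references and Arcs also satisfy Fetcher bounds.
impl<T: Fetcher + ?Sized> Fetcher for &T {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}
impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

// Extractors take &dyn Fetcher, so any implementation plugs in.
fn extract(client: &dyn Fetcher, url: &str) -> String {
    client.fetch(url)
}

fn main() {
    let client = FetchClient;
    assert_eq!(extract(&client, "x"), "fetched x"); // &FetchClient coerces
    let shared = Arc::new(FetchClient);
    assert_eq!(extract(&shared, "y"), "fetched y"); // Arc<T> via blanket impl
}
```

This is the same shape that lets the production server pass a `TlsSidecarFetcher` into the shared dispatcher without pulling wreq into its dependency graph.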
## [0.5.0] — 2026-04-22

### Added

- **28 vertical extractors that return typed JSON instead of generic markdown.** New `webclaw_fetch::extractors` module with one extractor per site. Dev: reddit, hackernews, github_repo / github_pr / github_issue / github_release, crates_io, pypi, npm. AI/ML: huggingface_model, huggingface_dataset, arxiv, docker_hub. Writing: dev_to, stackoverflow, youtube_video. Social: linkedin_post, instagram_post, instagram_profile. Ecommerce: shopify_product, shopify_collection, ecommerce_product (generic Schema.org), woocommerce_product, amazon_product, ebay_listing, etsy_listing. Reviews: trustpilot_reviews, substack_post. Each extractor claims a URL pattern via a public `matches()` fn and returns a typed JSON payload with the fields callers actually want (title, price, author, rating, review count, etc.) rather than a markdown blob.
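A `matches()` fn in the style each extractor exposes might look like the sketch below. The pattern is illustrative; the real patterns live in `webclaw_fetch::extractors`.

```rust
// Illustrative URL matcher for a github_repo-style extractor: accept
// https://github.com/<owner>/<repo> and reject deeper paths.
fn github_repo_matches(url: &str) -> bool {
    let rest = match url.strip_prefix("https://github.com/") {
        Some(r) => r,
        None => return false,
    };
    let segs: Vec<&str> = rest.trim_end_matches('/').split('/').collect();
    segs.len() == 2 && segs.iter().all(|s| !s.is_empty())
}

fn main() {
    assert!(github_repo_matches("https://github.com/0xMassi/webclaw"));
    assert!(!github_repo_matches("https://github.com/0xMassi/webclaw/pull/1"));
}
```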
- **`POST /v1/scrape/{vertical}` on `webclaw-server` for explicit vertical routing.** Picks the parser by name, validates the URL plausibly belongs to that vertical, returns the same shape as `POST /v1/scrape` but typed. 23 of 28 verticals also auto-dispatch from a plain `POST /v1/scrape` because their URL shapes are unique enough to claim safely; the remaining 5 (`shopify_product`, `shopify_collection`, `ecommerce_product`, `woocommerce_product`, `substack_post`) use patterns that non-target sites share, so callers opt in via the `{vertical}` route.
- **`GET /v1/extractors` on `webclaw-server`.** Returns the full catalog as `{"extractors": [{"name": "...", "label": "...", "description": "...", "url_patterns": [...]}, ...]}` so clients can build tooling / autocomplete / user-facing docs off a live source.
- **Antibot cloud-escalation for 5 ecommerce + reviews verticals.** Amazon, eBay, Etsy, Trustpilot, and Substack (as HTML fallback) go through `cloud::smart_fetch_html`: try local fetch first; on bot-protection detection (Cloudflare challenge, DataDome, AWS WAF "Verifying your connection", etc.) escalate to `api.webclaw.io/v1/scrape`. Without `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` the extractor returns a typed `CloudError::NotConfigured` with an actionable signup link. With a key set, escalation is automatic. Every extractor stamps a `data_source: "local" | "cloud"` field on the response so callers can tell which path ran.
- **`cloud::synthesize_html` for cloud-bypassed extraction.** `api.webclaw.io/v1/scrape` deliberately does not return raw HTML; it returns a parsed bundle (`structured_data` JSON-LD blocks + `metadata` OG/meta tags + `markdown`). The new helper reassembles that bundle back into a minimal synthetic HTML doc (JSON-LD as `<script>` tags, metadata as OG `<meta>` tags, markdown in a `<pre>`) so existing local parsers run unchanged across both paths. No per-extractor code path branches are needed for "came from cloud" vs "came from local".
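The bundle-to-synthetic-HTML reassembly can be sketched as below. Field handling is illustrative and the cloud bundle shape may differ in detail; the documented idea is JSON-LD back into `<script>` tags, metadata back into `<meta>` tags, markdown into a `<pre>`.

```rust
// Sketch of cloud::synthesize_html: rebuild a minimal HTML doc from the
// parsed cloud bundle so local Schema.org / OG parsers run unchanged.
fn synthesize_html(jsonld_blocks: &[String], metadata: &[(String, String)], markdown: &str) -> String {
    let mut html = String::from("<html><head>");
    for (prop, content) in metadata {
        // OG/meta tags so OG-fallback parsers keep working.
        html.push_str(&format!(r#"<meta property="{prop}" content="{content}">"#));
    }
    for block in jsonld_blocks {
        // JSON-LD re-wrapped as script tags so Schema.org parsers run unchanged.
        html.push_str(&format!(r#"<script type="application/ld+json">{block}</script>"#));
    }
    html.push_str("</head><body><pre>");
    html.push_str(markdown);
    html.push_str("</pre></body></html>");
    html
}

fn main() {
    let doc = synthesize_html(
        &[r#"{"@type":"Product"}"#.to_string()],
        &[("og:title".to_string(), "Widget".to_string())],
        "# Widget",
    );
    assert!(doc.contains(r#"<script type="application/ld+json">{"@type":"Product"}</script>"#));
    assert!(doc.contains(r#"<meta property="og:title" content="Widget">"#));
}
```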
- **Trustpilot 2025 schema parser.** Trustpilot replaced their single-Organization + aggregateRating shape with three separate JSON-LD blocks: a site-level Organization (Trustpilot itself), a Dataset with a csvw:Table `mainEntity` carrying the per-star distribution for the target business, and an aiSummary + aiSummaryReviews block with the AI-generated summary and recent reviews. The parser walks all three, skips the site-level Org, picks the Dataset by `about.@id` matching the target domain, parses each csvw:column for rating buckets, computes weighted-average rating + total from the distribution, extracts the aiSummary text, and returns recent reviews with author / country / date / rating / title / text / likes.
- **OG-tag fallback in `ecommerce_product` for sites with no JSON-LD and sites with JSON-LD but empty offers.** Three paths now: `jsonld` (Schema.org Product with offers), `jsonld+og` (Product JSON-LD plus OG product tags filling in missing price), and `og_fallback` (no JSON-LD at all, build minimal payload from `og:title`, `og:image`, `og:description`, `product:price:amount`, `product:price:currency`, `product:availability`, `product:brand`). `has_og_product_signal()` gates the fallback on `og:type=product` or a price tag so blog posts don't get mis-classified as products.
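The `has_og_product_signal()` gate can be sketched as a naive string scan; the real implementation presumably inspects parsed meta tags rather than raw HTML, but the documented rule is the same: require `og:type=product` or a price tag before treating a page as a product.

```rust
// Naive string-scan sketch of the OG product gate described above.
fn has_og_product_signal(html: &str) -> bool {
    html.contains(r#"property="og:type" content="product""#)
        || html.contains(r#"property="product:price:amount""#)
}

fn main() {
    assert!(has_og_product_signal(r#"<meta property="og:type" content="product">"#));
    assert!(!has_og_product_signal("<p>just a blog post</p>"));
}
```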
- **URL-slug title fallback in `etsy_listing` for delisted / blocked pages.** When Etsy serves a placeholder page (`"etsy.com"`, `"Etsy - Your place to buy..."`, `"This item is unavailable"`), humanise the URL slug (`/listing/123/personalized-stainless-steel-tumbler` becomes `"Personalized Stainless Steel Tumbler"`) so callers always get a meaningful title. The shop name also falls back from `offers[].seller.name` to the top-level `brand`, because Etsy uses both schemas depending on listing age.
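The slug-humanising step is just "last path segment, split on hyphens, capitalise each word". The helper name below is illustrative:

```rust
// Sketch of the slug-humanising fallback described above.
fn humanise_slug(url: &str) -> Option<String> {
    let slug = url.trim_end_matches('/').rsplit('/').next()?;
    if slug.is_empty() || slug.chars().all(|c| c.is_ascii_digit()) {
        return None; // a bare numeric id carries no title information
    }
    let title = slug
        .split('-')
        .filter(|w| !w.is_empty())
        .map(|w| {
            let mut chars = w.chars();
            match chars.next() {
                Some(f) => f.to_uppercase().collect::<String>() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect::<Vec<_>>()
        .join(" ");
    Some(title)
}

fn main() {
    assert_eq!(
        humanise_slug("https://www.etsy.com/listing/123/personalized-stainless-steel-tumbler").as_deref(),
        Some("Personalized Stainless Steel Tumbler")
    );
}
```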
- **Force-cloud-escalation in `amazon_product` when local HTML lacks Product JSON-LD.** Amazon A/B-tests JSON-LD presence. When local fetch succeeds but has no `Product` block and a cloud client is configured, the extractor force-escalates to the cloud, which reliably surfaces title + description via its render engine. Added OG meta-tag fallback so the cloud's synthesized HTML output (OG tags only, no Amazon DOM IDs) still yields title / image / description.
- **AWS WAF "Verifying your connection" detector in `cloud::is_bot_protected`.** Trustpilot serves a `~565` byte interstitial with an `interstitial-spinner` CSS class. The detector now fires on that pattern with a `< 10_000` byte size gate to avoid false positives on real articles that happen to mention the phrase.
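The size-gated detector reduces to a short predicate. The marker strings and the 10,000-byte gate come from the entry above; treat the exact heuristics as approximations of `cloud::is_bot_protected`.

```rust
// Sketch of the size-gated interstitial detector: a tiny page containing
// the spinner class or the WAF phrase is a challenge; a long real article
// mentioning the phrase is not.
fn looks_like_waf_interstitial(html: &str) -> bool {
    html.len() < 10_000
        && (html.contains("interstitial-spinner")
            || html.contains("Verifying your connection"))
}

fn main() {
    assert!(looks_like_waf_interstitial(r#"<div class="interstitial-spinner"></div>"#));
    // A ~12,000-byte article mentioning the phrase is not flagged.
    let mut article = "lorem ipsum ".repeat(1000);
    article.push_str("Verifying your connection");
    assert!(!looks_like_waf_interstitial(&article));
}
```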
### Changed

- **`webclaw-fetch::FetchClient` gained an optional `cloud` field** via `with_cloud(CloudClient)`. Extractors reach it through `client.cloud()` to decide whether to escalate. `webclaw-server::AppState` reads `WEBCLAW_CLOUD_API_KEY` (preferred) or falls back to `WEBCLAW_API_KEY` only when inbound auth is not configured (open mode).
- **Consolidated `CloudClient` into `webclaw-fetch`.** Previously duplicated between `webclaw-mcp/src/cloud.rs` (302 LOC) and `webclaw-cli/src/cloud.rs` (80 LOC). Single canonical home with typed `CloudError` (`NotConfigured`, `Unauthorized`, `InsufficientPlan`, `RateLimited`, `ServerError`, `Network`, `ParseFailed`) whose `Display` output includes actionable URLs; a `From<CloudError> for String` bridge keeps pre-existing CLI / MCP call sites compiling unchanged during migration.

### Tests

- 215 unit tests passing in `webclaw-fetch` (100+ new, covering every extractor's matcher, URL parser, JSON-LD / OG fallback paths, and the cloud synthesis helper). `cargo clippy --workspace --release --no-deps` clean.

---
## [0.4.0] — 2026-04-22

### Added
CLAUDE.md (11)
@@ -11,7 +11,7 @@ webclaw/
 # + ExtractionOptions (include/exclude CSS selectors)
 # + diff engine (change tracking)
 # + brand extraction (DOM/CSS analysis)
-webclaw-fetch/ # HTTP client via primp. Crawler. Sitemap discovery. Batch ops.
+webclaw-fetch/ # HTTP client via wreq (BoringSSL). Crawler. Sitemap discovery. Batch ops.
 # + proxy pool rotation (per-request)
 # + PDF content-type detection
 # + document parsing (DOCX, XLSX, CSV)
@@ -40,7 +40,7 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
 - `brand.rs` — Brand identity extraction from DOM structure and CSS

 ### Fetch Modules (`webclaw-fetch`)
-- `client.rs` — FetchClient with primp TLS impersonation
+- `client.rs` — FetchClient with wreq BoringSSL TLS impersonation; implements the public `Fetcher` trait so callers (including server adapters) can swap in alternative implementations
 - `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128)
 - `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
 - `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt)
@@ -76,9 +76,10 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
 ## Hard Rules

 - **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible.
-- **primp requires `[patch.crates-io]`** for patched rustls/h2 forks at workspace level.
-- **RUSTFLAGS are set in `.cargo/config.toml`** — no need to pass manually.
-- **webclaw-llm uses plain reqwest** (NOT primp-patched). LLM APIs don't need TLS fingerprinting.
+- **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally.
+- **No special RUSTFLAGS** — `.cargo/config.toml` is currently empty of build flags. Don't add any.
+- **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep.
+- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client.
 - **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.

 ## Build & Test
|||
49
Cargo.lock
generated
49
Cargo.lock
generated
|
|
@@ -2967,6 +2967,26 @@ dependencies = [
  "pom",
 ]

+[[package]]
+name = "typed-builder"
+version = "0.23.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "31aa81521b70f94402501d848ccc0ecaa8f93c8eb6999eb9747e72287757ffda"
+dependencies = [
+ "typed-builder-macro",
+]
+
+[[package]]
+name = "typed-builder-macro"
+version = "0.23.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "076a02dc54dd46795c2e9c8282ed40bcfb1e22747e955de9389a1de28190fb26"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
 [[package]]
 name = "typed-path"
 version = "0.12.3"
@@ -3199,7 +3219,7 @@ dependencies = [

 [[package]]
 name = "webclaw-cli"
-version = "0.4.0"
+version = "0.5.6"
 dependencies = [
  "clap",
  "dotenvy",

@@ -3220,7 +3240,7 @@ dependencies = [

 [[package]]
 name = "webclaw-core"
-version = "0.4.0"
+version = "0.5.6"
 dependencies = [
  "ego-tree",
  "once_cell",

@@ -3238,13 +3258,16 @@ dependencies = [

 [[package]]
 name = "webclaw-fetch"
-version = "0.4.0"
+version = "0.5.6"
 dependencies = [
  "async-trait",
  "bytes",
  "calamine",
  "http",
  "quick-xml 0.37.5",
  "rand 0.8.5",
  "regex",
  "reqwest",
  "serde",
  "serde_json",
  "tempfile",

@@ -3255,12 +3278,13 @@ dependencies = [
  "webclaw-core",
  "webclaw-pdf",
  "wreq",
  "wreq-util",
  "zip 2.4.2",
 ]

 [[package]]
 name = "webclaw-llm"
-version = "0.4.0"
+version = "0.5.6"
 dependencies = [
  "async-trait",
  "reqwest",

@@ -3273,11 +3297,10 @@ dependencies = [

 [[package]]
 name = "webclaw-mcp"
-version = "0.4.0"
+version = "0.5.6"
 dependencies = [
  "dirs",
  "dotenvy",
  "reqwest",
  "rmcp",
  "schemars",
  "serde",

@@ -3294,7 +3317,7 @@ dependencies = [

 [[package]]
 name = "webclaw-pdf"
-version = "0.4.0"
+version = "0.5.6"
 dependencies = [
  "pdf-extract",
  "thiserror",

@@ -3303,7 +3326,7 @@ dependencies = [

 [[package]]
 name = "webclaw-server"
-version = "0.4.0"
+version = "0.5.6"
 dependencies = [
  "anyhow",
  "axum",

@@ -3707,6 +3730,16 @@ dependencies = [
  "zstd",
 ]

+[[package]]
+name = "wreq-util"
+version = "3.0.0-rc.10"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6c6bbe24d28beb9ceb58b514bd6a613c759d3b706f768b9d2950d5d35b543c04"
+dependencies = [
+ "typed-builder",
+ "wreq",
+]
+
 [[package]]
 name = "writeable"
 version = "0.6.2"
Cargo.toml

@@ -3,7 +3,7 @@ resolver = "2"
 members = ["crates/*"]

 [workspace.package]
-version = "0.4.0"
+version = "0.5.6"
 edition = "2024"
 license = "AGPL-3.0"
 repository = "https://github.com/0xMassi/webclaw"
webclaw-cli/src/cloud.rs (deleted)

@@ -1,80 +0,0 @@
/// Cloud API client for automatic fallback when local extraction fails.
///
/// When WEBCLAW_API_KEY is set (or --api-key is passed), the CLI can fall back
/// to api.webclaw.io for bot-protected or JS-rendered sites. With the --cloud
/// flag, all requests go through the cloud API directly.
///
/// NOTE: The canonical, full-featured cloud module lives in webclaw-mcp/src/cloud.rs
/// (smart_fetch, bot detection, JS rendering checks). This is the minimal subset
/// needed by the CLI, kept separate because adding webclaw-mcp as a dependency
/// would pull in rmcp.
use serde_json::{Value, json};

const API_BASE: &str = "https://api.webclaw.io/v1";

pub struct CloudClient {
    api_key: String,
    http: reqwest::Client,
}

impl CloudClient {
    /// Create from explicit key or WEBCLAW_API_KEY env var.
    pub fn new(explicit_key: Option<&str>) -> Option<Self> {
        let key = explicit_key
            .map(String::from)
            .or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
            .filter(|k| !k.is_empty())?;

        Some(Self {
            api_key: key,
            http: reqwest::Client::new(),
        })
    }

    /// Scrape via the cloud API.
    pub async fn scrape(
        &self,
        url: &str,
        formats: &[&str],
        include_selectors: &[String],
        exclude_selectors: &[String],
        only_main_content: bool,
    ) -> Result<Value, String> {
        let mut body = json!({
            "url": url,
            "formats": formats,
        });
        if only_main_content {
            body["only_main_content"] = json!(true);
        }
        if !include_selectors.is_empty() {
            body["include_selectors"] = json!(include_selectors);
        }
        if !exclude_selectors.is_empty() {
            body["exclude_selectors"] = json!(exclude_selectors);
        }
        self.post("scrape", body).await
    }

    async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> {
        let resp = self
            .http
            .post(format!("{API_BASE}/{endpoint}"))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .json(&body)
            .timeout(std::time::Duration::from_secs(120))
            .send()
            .await
            .map_err(|e| format!("cloud API request failed: {e}"))?;

        let status = resp.status();
        if !status.is_success() {
            let text = resp.text().await.unwrap_or_default();
            return Err(format!("cloud API error {status}: {text}"));
        }

        resp.json::<Value>()
            .await
            .map_err(|e| format!("cloud API response parse failed: {e}"))
    }
}
webclaw-cli/src/main.rs

@@ -1,7 +1,6 @@
 /// CLI entry point -- wires webclaw-core and webclaw-fetch into a single command.
 /// All extraction and fetching logic lives in sibling crates; this is pure plumbing.
 mod bench;
-mod cloud;

 use std::io::{self, Read as _};
 use std::path::{Path, PathBuf};
@@ -309,6 +308,34 @@ enum Commands {
        #[arg(long)]
        facts: Option<PathBuf>,
    },

    /// List all vertical extractors in the catalog.
    ///
    /// Each entry has a stable `name` (usable with `webclaw vertical <name>`),
    /// a human-friendly label, a one-line description, and the URL
    /// patterns it claims. The same data is served by `/v1/extractors`
    /// when running the REST API.
    Extractors {
        /// Emit JSON instead of a human-friendly table.
        #[arg(long)]
        json: bool,
    },

    /// Run a vertical extractor by name. Returns typed JSON with fields
    /// specific to the target site (title, price, author, rating, etc.)
    /// rather than generic markdown.
    ///
    /// Use `webclaw extractors` to see the full list. Example:
    /// `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/`.
    Vertical {
        /// Vertical name (e.g. `reddit`, `github_repo`, `trustpilot_reviews`).
        name: String,
        /// URL to extract.
        url: String,
        /// Emit compact JSON (single line). Default is pretty-printed.
        #[arg(long)]
        raw: bool,
    },
}

#[derive(Clone, ValueEnum)]
@@ -324,6 +351,9 @@ enum OutputFormat {
enum Browser {
    Chrome,
    Firefox,
    /// Safari iOS 26. Pair with a country-matched residential proxy for sites
    /// that reject non-mobile profiles.
    SafariIos,
    Random,
}
@@ -350,6 +380,7 @@ impl From<Browser> for BrowserProfile {
        match b {
            Browser::Chrome => BrowserProfile::Chrome,
            Browser::Firefox => BrowserProfile::Firefox,
            Browser::SafariIos => BrowserProfile::SafariIos,
            Browser::Random => BrowserProfile::Random,
        }
    }
@@ -674,7 +705,7 @@ async fn fetch_and_extract(cli: &Cli) -> Result<FetchOutput, String> {
    let url = normalize_url(raw_url);
    let url = url.as_str();

-   let cloud_client = cloud::CloudClient::new(cli.api_key.as_deref());
+   let cloud_client = webclaw_fetch::cloud::CloudClient::new(cli.api_key.as_deref());

    // --cloud: skip local, go straight to cloud API
    if cli.cloud {
@@ -2289,6 +2320,83 @@ async fn main() {
            }
            return;
        }
        Commands::Extractors { json } => {
            let entries = webclaw_fetch::extractors::list();
            if *json {
                // Serialize with serde_json. ExtractorInfo derives
                // Serialize so this is a one-liner.
                match serde_json::to_string_pretty(&entries) {
                    Ok(s) => println!("{s}"),
                    Err(e) => {
                        eprintln!("error: failed to serialise catalog: {e}");
                        process::exit(1);
                    }
                }
            } else {
                // Human-friendly table: NAME + LABEL + one URL
                // pattern sample. Keeps the output scannable on a
                // narrow terminal.
                println!("{} vertical extractors available:\n", entries.len());
                let name_w = entries.iter().map(|e| e.name.len()).max().unwrap_or(0);
                let label_w = entries.iter().map(|e| e.label.len()).max().unwrap_or(0);
                for e in &entries {
                    let pattern_sample = e.url_patterns.first().copied().unwrap_or("");
                    println!(
                        "  {:<nw$}  {:<lw$}  {}",
                        e.name,
                        e.label,
                        pattern_sample,
                        nw = name_w,
                        lw = label_w,
                    );
                }
                println!("\nRun one: webclaw vertical <name> <url>");
            }
            return;
        }
        Commands::Vertical { name, url, raw } => {
            // Build a FetchClient with cloud fallback attached when
            // WEBCLAW_API_KEY is set. Antibot-gated verticals
            // (amazon, ebay, etsy, trustpilot) need this to escalate
            // on bot protection.
            let fetch_cfg = webclaw_fetch::FetchConfig {
                browser: webclaw_fetch::BrowserProfile::Firefox,
                ..webclaw_fetch::FetchConfig::default()
            };
            let mut client = match webclaw_fetch::FetchClient::new(fetch_cfg) {
                Ok(c) => c,
                Err(e) => {
                    eprintln!("error: failed to build fetch client: {e}");
                    process::exit(1);
                }
            };
            if let Some(cloud) = webclaw_fetch::cloud::CloudClient::from_env() {
                client = client.with_cloud(cloud);
            }
            match webclaw_fetch::extractors::dispatch_by_name(&client, name, url).await {
                Ok(data) => {
                    let rendered = if *raw {
                        serde_json::to_string(&data)
                    } else {
                        serde_json::to_string_pretty(&data)
                    };
                    match rendered {
                        Ok(s) => println!("{s}"),
                        Err(e) => {
                            eprintln!("error: JSON encode failed: {e}");
                            process::exit(1);
                        }
                    }
                }
                Err(e) => {
                    // UrlMismatch / UnknownVertical / Fetch all get
                    // Display impls with actionable messages.
                    eprintln!("error: {e}");
                    process::exit(1);
                }
            }
            return;
        }
    }
}
markdown.rs

@@ -920,8 +920,10 @@ fn strip_markdown(md: &str) -> String {
            continue;
        }

-       // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs
-       if trimmed.starts_with('|') && trimmed.ends_with('|') {
+       // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs.
+       // Require at least 2 chars so the slice `[1..len-1]` stays non-empty on single-pipe rows
+       // (which aren't real tables anyway); a lone `|` previously panicked at `begin <= end`.
+       if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
            let inner = &trimmed[1..trimmed.len() - 1];
            let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
            lines.push(cells.join("\t"));
webclaw-fetch/Cargo.toml

@@ -12,12 +12,16 @@ serde = { workspace = true }
thiserror = { workspace = true }
tracing = { workspace = true }
tokio = { workspace = true }
async-trait = "0.1"
wreq = { version = "6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] }
wreq-util = "3.0.0-rc.10"
http = "1"
bytes = "1"
url = "2"
rand = "0.8"
quick-xml = { version = "0.37", features = ["serde"] }
regex = "1"
reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
serde_json.workspace = true
calamine = "0.34"
zip = "2"
webclaw-fetch/src/browser.rs

@@ -7,6 +7,10 @@ pub enum BrowserProfile {
    #[default]
    Chrome,
    Firefox,
    /// Safari iOS 26 (iPhone). The one profile proven to defeat
    /// DataDome's immobiliare.it / idealista.it / target.com-class
    /// rules when paired with a country-scoped residential proxy.
    SafariIos,
    /// Randomly pick from all available profiles on each request.
    Random,
}

@@ -18,6 +22,7 @@ pub enum BrowserVariant {
    ChromeMacos,
    Firefox,
    Safari,
    SafariIos26,
    Edge,
}
webclaw-fetch/src/client.rs

@@ -177,6 +177,11 @@ enum ClientPool {
pub struct FetchClient {
    pool: ClientPool,
    pdf_mode: PdfMode,
    /// Optional cloud-fallback client. Extractors that need to
    /// escalate past bot protection call `client.cloud()` to get this
    /// out. Stored as `Arc` so cloning a `FetchClient` (common in
    /// axum state) doesn't clone the underlying reqwest pool.
    cloud: Option<std::sync::Arc<crate::cloud::CloudClient>>,
}

impl FetchClient {
@ -225,13 +230,96 @@ impl FetchClient {
|
|||
ClientPool::Rotating { clients }
|
||||
};
|
||||
|
||||
Ok(Self { pool, pdf_mode })
|
||||
Ok(Self {
|
||||
pool,
|
||||
pdf_mode,
|
||||
cloud: None,
|
||||
})
|
||||
}
|
||||
|
||||
/// Attach a cloud-fallback client. Returns `self` so it composes in
|
||||
/// a builder-ish way:
|
||||
///
|
||||
/// ```ignore
|
||||
/// let client = FetchClient::new(config)?
|
||||
/// .with_cloud(CloudClient::from_env()?);
|
||||
/// ```
|
||||
///
|
||||
/// Extractors that can escalate past bot protection will call
|
||||
/// `client.cloud()` internally. Sets the field regardless of
|
||||
/// whether `cloud` is configured to bypass anything specific —
|
||||
/// attachment is cheap (just wraps in `Arc`).
|
||||
pub fn with_cloud(mut self, cloud: crate::cloud::CloudClient) -> Self {
|
||||
self.cloud = Some(std::sync::Arc::new(cloud));
|
||||
self
|
||||
}
|
||||
|
||||
/// Optional cloud-fallback client, if one was attached via
|
||||
    /// [`Self::with_cloud`]. Extractors that handle antibot sites
    /// pass this into `cloud::smart_fetch_html`.
    pub fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
        self.cloud.as_deref()
    }

    /// Fetch a URL with per-site rescue paths: Reddit URLs redirect to the
    /// `.json` API, and Akamai-style challenge responses trigger a homepage
    /// cookie warmup and a retry. Returns the same `FetchResult` shape as
    /// [`Self::fetch`] so every caller (CLI, MCP, OSS server, production
    /// server) benefits without shape churn.
    ///
    /// This is the method most callers want. Use plain [`Self::fetch`] only
    /// when you need literal no-rescue behavior (e.g. inside the rescue
    /// logic itself to avoid recursion).
    pub async fn fetch_smart(&self, url: &str) -> Result<FetchResult, FetchError> {
        // Reddit: the HTML page shows a verification interstitial for most
        // client IPs, but appending `.json` returns the post + comment tree
        // publicly. `parse_reddit_json` in downstream code knows how to read
        // the result; here we just do the URL swap at the fetch layer.
        if crate::reddit::is_reddit_url(url) && !url.ends_with(".json") {
            let json_url = crate::reddit::json_url(url);
            // Reddit's public .json API serves JSON to identifiable bot
            // User-Agents and blocks browser UAs with a verification wall.
            // Override our Chrome-profile UA for this specific call.
            let ua = concat!(
                "Webclaw/",
                env!("CARGO_PKG_VERSION"),
                " (+https://webclaw.io)"
            );
            if let Ok(resp) = self
                .fetch_with_headers(&json_url, &[("user-agent", ua)])
                .await
                && resp.status == 200
            {
                let first = resp.html.trim_start().as_bytes().first().copied();
                if matches!(first, Some(b'{') | Some(b'[')) {
                    return Ok(resp);
                }
            }
            // If the .json fetch failed or returned HTML, fall through.
        }

        let resp = self.fetch(url).await?;

        // Akamai / bazadebezolkohpepadr challenge: visit the homepage to
        // collect warmup cookies (_abck, bm_sz, etc.), then retry.
        if is_challenge_html(&resp.html)
            && let Some(homepage) = extract_homepage(url)
        {
            debug!("challenge detected, warming cookies via {homepage}");
            let _ = self.fetch(&homepage).await;
            if let Ok(retry) = self.fetch(url).await {
                return Ok(retry);
            }
        }

        Ok(resp)
    }

    /// Fetch a URL and return the raw HTML + response metadata.
    ///
    /// Automatically retries on transient failures (network errors, 5xx, 429)
    /// with exponential backoff: 0s, 1s (2 attempts total). No per-site
    /// rescue logic; use [`Self::fetch_smart`] for that.
    #[instrument(skip(self), fields(url = %url))]
    pub async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
        let delays = [Duration::ZERO, Duration::from_secs(1)];

@@ -279,14 +367,85 @@ impl FetchClient {

    /// Single fetch attempt.
    async fn fetch_once(&self, url: &str) -> Result<FetchResult, FetchError> {
        self.fetch_once_with_headers(url, &[]).await
    }

    /// Single fetch attempt with optional per-request headers appended
    /// after the profile defaults. Used by extractors that need to
    /// satisfy site-specific headers (e.g. `x-ig-app-id` for Instagram's
    /// internal API).
    async fn fetch_once_with_headers(
        &self,
        url: &str,
        extra: &[(&str, &str)],
    ) -> Result<FetchResult, FetchError> {
        let start = Instant::now();
        let client = self.pick_client(url);

        let mut req = client.get(url);
        for (k, v) in extra {
            req = req.header(*k, *v);
        }
        let resp = req.send().await?;
        let response = Response::from_wreq(resp).await?;
        response_to_result(response, start)
    }

    /// Fetch a URL with extra per-request headers appended after the
    /// browser-profile defaults. Same retry semantics as `fetch`.
    ///
    /// Use this when an upstream API requires a header the global
    /// `FetchConfig.headers` shouldn't carry to other hosts (Instagram's
    /// `x-ig-app-id`, GitHub's `Authorization` once we wire `GITHUB_TOKEN`,
    /// Reddit's compliant UA when we add OAuth, etc.).
    #[instrument(skip(self, extra), fields(url = %url, extra_count = extra.len()))]
    pub async fn fetch_with_headers(
        &self,
        url: &str,
        extra: &[(&str, &str)],
    ) -> Result<FetchResult, FetchError> {
        let delays = [Duration::ZERO, Duration::from_secs(1)];
        let mut last_err = None;

        for (attempt, delay) in delays.iter().enumerate() {
            if attempt > 0 {
                tokio::time::sleep(*delay).await;
            }
            match self.fetch_once_with_headers(url, extra).await {
                Ok(result) => {
                    if is_retryable_status(result.status) && attempt < delays.len() - 1 {
                        warn!(
                            url,
                            status = result.status,
                            attempt = attempt + 1,
                            "retryable status, will retry"
                        );
                        last_err = Some(FetchError::Build(format!("HTTP {}", result.status)));
                        continue;
                    }
                    if attempt > 0 {
                        debug!(url, attempt = attempt + 1, "retry succeeded");
                    }
                    return Ok(result);
                }
                Err(e) => {
                    if !is_retryable_error(&e) || attempt == delays.len() - 1 {
                        return Err(e);
                    }
                    warn!(
                        url,
                        error = %e,
                        attempt = attempt + 1,
                        "transient error, will retry"
                    );
                    last_err = Some(e);
                }
            }
        }

        Err(last_err.unwrap_or_else(|| FetchError::Build("all retries exhausted".into())))
    }

    /// Fetch a URL then extract structured content.
    #[instrument(skip(self), fields(url = %url))]
    pub async fn fetch_and_extract(
@@ -495,12 +654,43 @@ impl FetchClient {
    }
}

// ---------------------------------------------------------------------------
// Fetcher trait implementation
//
// Vertical extractors consume the [`crate::fetcher::Fetcher`] trait
// rather than `FetchClient` directly, which is what lets the production
// API server swap in a tls-sidecar-backed implementation without
// pulling wreq into its dependency graph. For everyone else (CLI, MCP,
// self-hosted OSS server) this impl means "pass the FetchClient you
// already have; nothing changes".
// ---------------------------------------------------------------------------

#[async_trait::async_trait]
impl crate::fetcher::Fetcher for FetchClient {
    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
        FetchClient::fetch(self, url).await
    }

    async fn fetch_with_headers(
        &self,
        url: &str,
        headers: &[(&str, &str)],
    ) -> Result<FetchResult, FetchError> {
        FetchClient::fetch_with_headers(self, url, headers).await
    }

    fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
        FetchClient::cloud(self)
    }
}

/// Collect the browser variants to use based on the browser profile.
fn collect_variants(profile: &BrowserProfile) -> Vec<BrowserVariant> {
    match profile {
        BrowserProfile::Random => browser::all_variants(),
        BrowserProfile::Chrome => vec![browser::latest_chrome()],
        BrowserProfile::Firefox => vec![browser::latest_firefox()],
        BrowserProfile::SafariIos => vec![BrowserVariant::SafariIos26],
    }
}

@@ -578,22 +768,23 @@ fn is_pdf_content_type(headers: &http::HeaderMap) -> bool {

/// Detect if a response looks like a bot protection challenge page.
fn is_challenge_response(response: &Response) -> bool {
    is_challenge_html(response.text().as_ref())
}

/// Same as `is_challenge_response`, operating on a body string directly
/// so callers holding a `FetchResult` can reuse the heuristic.
fn is_challenge_html(html: &str) -> bool {
    let len = html.len();
    if len > 15_000 || len == 0 {
        return false;
    }

    let lower = html.to_lowercase();
    if lower.contains("<title>challenge page</title>") {
        return true;
    }

    if lower.contains("bazadebezolkohpepadr") && len < 5_000 {
        return true;
    }

    false
}

853 crates/webclaw-fetch/src/cloud.rs Normal file

@@ -0,0 +1,853 @@
//! Cloud API fallback client for api.webclaw.io.
//!
//! When local fetch hits bot protection or a JS-only SPA, callers can
//! fall back to the hosted API which runs the full antibot / CDP
//! pipeline. This module is the shared home for that flow: previously
//! duplicated between `webclaw-mcp/src/cloud.rs` and
//! `webclaw-cli/src/cloud.rs`.
//!
//! ## Architecture
//!
//! - [`CloudClient`] — thin reqwest wrapper around the api.webclaw.io
//!   REST surface. Typed errors for the four HTTP failures callers act
//!   on differently (401 / 402 / 429 / other) plus network + parse.
//! - [`is_bot_protected`] / [`needs_js_rendering`] — pure detectors on
//!   response bodies. The detection patterns are public (CF / DataDome
//!   challenge-page signatures) so these live in OSS without leaking
//!   any moat.
//! - [`smart_fetch`] — try-local-then-escalate flow returning an
//!   [`ExtractionResult`] or raw cloud JSON. Kept on the original
//!   `Result<_, String>` signature so the existing MCP / CLI call
//!   sites work unchanged.
//! - [`smart_fetch_html`] — new convenience for the vertical-extractor
//!   pattern: just give me antibot-bypassed HTML so I can run my own
//!   parser on it. Returns the typed [`CloudError`] so extractors can
//!   emit precise "upgrade your plan" / "invalid key" messages.
//!
//! ## Cloud response shape and [`synthesize_html`]
//!
//! `api.webclaw.io/v1/scrape` deliberately does **not** return a
//! `html` field even when `formats=["html"]` is requested. By design
//! the cloud API returns a parsed bundle:
//!
//! ```text
//! {
//!     "url": "https://...",
//!     "metadata": { title, description, image, site_name, ... },  // OG / meta tags
//!     "structured_data": [ { "@type": "...", ... }, ... ],        // JSON-LD blocks
//!     "markdown": "# Page Title\n\n...",                          // cleaned markdown
//!     "antibot": { engine, path, user_agent },                    // bypass telemetry
//!     "cache": { status, age_seconds }
//! }
//! ```
//!
//! [`CloudClient::fetch_html`] reassembles that bundle back into a
//! minimal synthetic HTML document so the existing local extractor
//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
//! cloud output. Each `structured_data` entry becomes a
//! `<script type="application/ld+json">` tag; each `metadata` field
//! becomes a `<meta property="og:...">` tag; `markdown` lands in a
//! `<pre>` inside the body. Callers that walk Schema.org blocks see
//! exactly what they'd see on a real live page.
//!
//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
//! won't hit on the synthesised HTML — those IDs only exist on live
//! Amazon pages. Extractors that need DOM regex keep OG meta tag
//! fallbacks for that reason.
//!
//! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
//! signup when a site is blocked; nothing fails silently. Cloud users
//! get the escalation for free.

use std::time::Duration;

use http::HeaderMap;
use serde_json::{Value, json};
use thiserror::Error;
use tracing::{debug, info, warn};

// Client type isn't needed here anymore now that smart_fetch* takes
// `&dyn Fetcher`. Kept as a comment for historical context: this
// module used to import FetchClient directly before v0.5.1.

// ---------------------------------------------------------------------------
// URLs + defaults — keep in one place so "change the signup link" is a
// single-commit edit.
// ---------------------------------------------------------------------------

const API_BASE_DEFAULT: &str = "https://api.webclaw.io/v1";
const DEFAULT_TIMEOUT_SECS: u64 = 120;

const SIGNUP_URL: &str = "https://webclaw.io/signup";
const PRICING_URL: &str = "https://webclaw.io/pricing";
const KEYS_URL: &str = "https://webclaw.io/dashboard/api-keys";

// ---------------------------------------------------------------------------
// Errors
// ---------------------------------------------------------------------------

/// Structured cloud-fallback error. Variants correspond to the HTTP
/// outcomes callers act on differently — a 401 needs a different UX
/// than a 402 which needs a different UX than a network blip.
///
/// Display messages end with an actionable URL so API consumers can
/// surface them to users verbatim.
#[derive(Debug, Error)]
pub enum CloudError {
    /// No `WEBCLAW_API_KEY` configured. Returned by [`smart_fetch_html`]
    /// and friends when they hit bot protection but have no client to
    /// escalate to.
    #[error(
        "this site is behind antibot protection. \
         Set WEBCLAW_API_KEY to unlock automatic cloud bypass. \
         Free tier: {SIGNUP_URL}"
    )]
    NotConfigured,

    /// HTTP 401 — the key is present but rejected.
    #[error(
        "WEBCLAW_API_KEY rejected (HTTP 401). \
         Check or regenerate your key at {KEYS_URL}"
    )]
    Unauthorized,

    /// HTTP 402 — the key is valid but the plan doesn't cover the call.
    #[error(
        "your plan doesn't include this endpoint / site (HTTP 402). \
         Upgrade at {PRICING_URL}"
    )]
    InsufficientPlan,

    /// HTTP 429 — rate limit.
    #[error(
        "cloud API rate limit reached (HTTP 429). \
         Wait a moment or upgrade at {PRICING_URL}"
    )]
    RateLimited,

    /// HTTP 4xx / 5xx the caller probably can't do anything specific
    /// about. Body is truncated to a sensible length for logs.
    #[error("cloud API returned HTTP {status}: {body}")]
    ServerError { status: u16, body: String },

    #[error("cloud request failed: {0}")]
    Network(String),

    #[error("cloud response parse failed: {0}")]
    ParseFailed(String),
}

impl CloudError {
    /// Build from a non-success HTTP response, routing well-known
    /// statuses to dedicated variants.
    fn from_status_and_body(status: u16, body: String) -> Self {
        match status {
            401 => Self::Unauthorized,
            402 => Self::InsufficientPlan,
            429 => Self::RateLimited,
            _ => Self::ServerError {
                status,
                body: truncate(&body, 500).to_string(),
            },
        }
    }
}

impl From<reqwest::Error> for CloudError {
    fn from(e: reqwest::Error) -> Self {
        Self::Network(e.to_string())
    }
}

/// Backwards-compatibility bridge: a lot of pre-existing MCP / CLI call
/// sites use `.await?` in functions returning `Result<_, String>`.
/// Having this `From` impl means those sites keep compiling while we
/// migrate them to the typed error over time.
impl From<CloudError> for String {
    fn from(e: CloudError) -> Self {
        e.to_string()
    }
}

fn truncate(text: &str, max: usize) -> &str {
    match text.char_indices().nth(max) {
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text,
    }
}

// ---------------------------------------------------------------------------
// CloudClient
// ---------------------------------------------------------------------------

/// Thin reqwest client around api.webclaw.io. Cloneable cheaply — the
/// inner `reqwest::Client` already refcounts its connection pool.
#[derive(Clone)]
pub struct CloudClient {
    api_key: String,
    base_url: String,
    http: reqwest::Client,
}

impl CloudClient {
    /// Build from an explicit key (e.g. a `--api-key` CLI flag) or fall
    /// back to the `WEBCLAW_API_KEY` env var. Returns `None` when
    /// neither is set / both are empty.
    ///
    /// This is the function call sites should use by default — it's
    /// what both the CLI and MCP want.
    pub fn new(explicit_key: Option<&str>) -> Option<Self> {
        explicit_key
            .map(String::from)
            .or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
            .filter(|k| !k.trim().is_empty())
            .map(Self::with_key)
    }

    /// Build from `WEBCLAW_API_KEY` env only. Thin wrapper kept for
    /// readability at call sites that never accept a flag.
    pub fn from_env() -> Option<Self> {
        Self::new(None)
    }

    /// Build with an explicit key. Useful when the caller already has
    /// a key from somewhere other than env or a flag (e.g. loaded from
    /// config).
    pub fn with_key(api_key: impl Into<String>) -> Self {
        Self::with_key_and_base(api_key, API_BASE_DEFAULT)
    }

    /// Build with an explicit key and base URL. Used by integration
    /// tests and staging deployments.
    pub fn with_key_and_base(api_key: impl Into<String>, base_url: impl Into<String>) -> Self {
        let http = reqwest::Client::builder()
            .timeout(Duration::from_secs(DEFAULT_TIMEOUT_SECS))
            .build()
            .expect("reqwest client builder failed with default settings");
        Self {
            api_key: api_key.into(),
            base_url: base_url.into().trim_end_matches('/').to_string(),
            http,
        }
    }

    pub fn base_url(&self) -> &str {
        &self.base_url
    }

    /// Generic POST. Endpoint may be `"scrape"` or `"/scrape"` — we
    /// normalise the slash.
    pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, CloudError> {
        let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
        let resp = self
            .http
            .post(&url)
            .header("Authorization", format!("Bearer {}", self.api_key))
            .json(&body)
            .send()
            .await?;
        parse_cloud_response(resp).await
    }

    /// Generic GET.
    pub async fn get(&self, endpoint: &str) -> Result<Value, CloudError> {
        let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
        let resp = self
            .http
            .get(&url)
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await?;
        parse_cloud_response(resp).await
    }

    /// `POST /v1/scrape` with the caller's extraction options. This is
    /// the public "do everything" surface: the cloud side handles
    /// fetch + antibot + JS render + extraction + formatting.
    pub async fn scrape(
        &self,
        url: &str,
        formats: &[&str],
        include_selectors: &[String],
        exclude_selectors: &[String],
        only_main_content: bool,
    ) -> Result<Value, CloudError> {
        let mut body = json!({ "url": url, "formats": formats });
        if only_main_content {
            body["only_main_content"] = json!(true);
        }
        if !include_selectors.is_empty() {
            body["include_selectors"] = json!(include_selectors);
        }
        if !exclude_selectors.is_empty() {
            body["exclude_selectors"] = json!(exclude_selectors);
        }
        self.post("scrape", body).await
    }

    /// Get antibot-bypassed page data back as a synthetic HTML string.
    ///
    /// `api.webclaw.io/v1/scrape` intentionally does not return raw
    /// HTML: it returns pre-parsed `structured_data` (JSON-LD blocks)
    /// plus `metadata` (title, description, OG tags, image) plus a
    /// `markdown` body. We reassemble those into a minimal HTML doc
    /// that looks enough like the real page for our local extractor
    /// parsers to run unchanged: each JSON-LD block gets emitted as a
    /// `<script type="application/ld+json">` tag, metadata gets
    /// emitted as OG `<meta>` tags, and the markdown lands in the
    /// body. Extractors that walk JSON-LD (ecommerce_product,
    /// trustpilot_reviews, ebay_listing, etsy_listing, amazon_product)
    /// see exactly the same shapes they'd see from a live HTML fetch.
    pub async fn fetch_html(&self, url: &str) -> Result<String, CloudError> {
        let resp = self.scrape(url, &["markdown"], &[], &[], false).await?;
        Ok(synthesize_html(&resp))
    }
}

/// Reassemble a minimal HTML document from a cloud `/v1/scrape`
/// response so existing HTML-based extractor parsers can run against
/// cloud output without a separate code path.
fn synthesize_html(resp: &Value) -> String {
    let mut out = String::with_capacity(8_192);
    out.push_str("<html><head>\n");

    // Metadata → OG meta tags. Keep keys stable with what local
    // extractors read: og:title, og:description, og:image, og:site_name.
    if let Some(meta) = resp.get("metadata").and_then(|m| m.as_object()) {
        for (src_key, og_key) in [
            ("title", "title"),
            ("description", "description"),
            ("image", "image"),
            ("site_name", "site_name"),
        ] {
            if let Some(val) = meta.get(src_key).and_then(|v| v.as_str())
                && !val.is_empty()
            {
                out.push_str(&format!(
                    "<meta property=\"og:{og_key}\" content=\"{}\">\n",
                    html_escape_attr(val)
                ));
            }
        }
    }

    // Structured data blocks → <script type="application/ld+json">.
    // Serialise losslessly so extract_json_ld's parser gets the same
    // shape it would get from a real page.
    if let Some(blocks) = resp.get("structured_data").and_then(|v| v.as_array()) {
        for block in blocks {
            if let Ok(s) = serde_json::to_string(block) {
                out.push_str("<script type=\"application/ld+json\">");
                out.push_str(&s);
                out.push_str("</script>\n");
            }
        }
    }

    out.push_str("</head><body>\n");

    // Markdown body → plaintext in <body>. Extractors that regex over
    // <div> IDs won't hit here, but they won't hit on local cloud
    // bypass either. OK to keep minimal.
    if let Some(md) = resp.get("markdown").and_then(|v| v.as_str()) {
        out.push_str("<pre>");
        out.push_str(&html_escape_text(md));
        out.push_str("</pre>\n");
    }

    out.push_str("</body></html>");
    out
}

fn html_escape_attr(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('"', "&quot;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

fn html_escape_text(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

async fn parse_cloud_response(resp: reqwest::Response) -> Result<Value, CloudError> {
    let status = resp.status();
    if status.is_success() {
        return resp
            .json()
            .await
            .map_err(|e| CloudError::ParseFailed(e.to_string()));
    }
    let body = resp.text().await.unwrap_or_default();
    Err(CloudError::from_status_and_body(status.as_u16(), body))
}

// ---------------------------------------------------------------------------
// Detection
// ---------------------------------------------------------------------------

/// True when a fetched response body is actually a bot-protection
/// challenge page rather than the content the caller asked for.
///
/// Conservative — only fires on patterns that indicate the *entire*
/// page is a challenge, not embedded CAPTCHAs on a real content page.
pub fn is_bot_protected(html: &str, headers: &HeaderMap) -> bool {
    let html_lower = html.to_lowercase();

    // Cloudflare challenge page.
    if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
        return true;
    }

    // Cloudflare "Just a moment" / "Checking your browser" interstitial.
    if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
        && html_lower.contains("cf-spinner")
    {
        return true;
    }

    // Cloudflare Turnstile. Only counts when the page is small —
    // legitimate pages embed Turnstile for signup forms etc.
    if (html_lower.contains("cf-turnstile")
        || html_lower.contains("challenges.cloudflare.com/turnstile"))
        && html.len() < 100_000
    {
        return true;
    }

    // DataDome.
    if html_lower.contains("geo.captcha-delivery.com")
        || html_lower.contains("captcha-delivery.com/captcha")
    {
        return true;
    }

    // AWS WAF.
    if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
        return true;
    }

    // AWS WAF "Verifying your connection" interstitial (used by Trustpilot).
    // Distinct from the captcha-branded path above: the challenge page is
    // a tiny HTML shell with an `interstitial-spinner` div and no content.
    // Gating on html.len() keeps false-positives off long pages that
    // happen to mention the phrase in an unrelated context.
    if html_lower.contains("interstitial-spinner")
        && html_lower.contains("verifying your connection")
        && html.len() < 10_000
    {
        return true;
    }

    // hCaptcha *blocking* page (not just an embedded widget).
    if html_lower.contains("hcaptcha.com")
        && html_lower.contains("h-captcha")
        && html.len() < 50_000
    {
        return true;
    }

    // Cloudflare via response headers + challenge body.
    let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some();
    if has_cf_headers
        && (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
    {
        return true;
    }

    false
}

/// True when a page likely needs JS rendering — a large HTML document
/// with almost no extractable text + an SPA framework signature.
pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
    let has_scripts = html.contains("<script");

    // Tier 1: almost no extractable text from a large-ish page.
    if word_count < 50 && html.len() > 5_000 && has_scripts {
        return true;
    }

    // Tier 2: SPA framework markers + low content-to-HTML ratio.
    if word_count < 800 && html.len() > 50_000 && has_scripts {
        let html_lower = html.to_lowercase();
        let has_spa_marker = html_lower.contains("react-app")
            || html_lower.contains("id=\"__next\"")
            || html_lower.contains("id=\"root\"")
            || html_lower.contains("id=\"app\"")
            || html_lower.contains("__next_data__")
            || html_lower.contains("nuxt")
            || html_lower.contains("ng-app");
        if has_spa_marker {
            return true;
        }
    }

    false
}

// ---------------------------------------------------------------------------
// Smart-fetch: classic flow for MCP / CLI (returns either an extraction
// or raw cloud JSON)
// ---------------------------------------------------------------------------

/// Result of [`smart_fetch`]: either a local extraction or the raw
/// cloud API response when we escalated.
pub enum SmartFetchResult {
    Local(Box<webclaw_core::ExtractionResult>),
    Cloud(Value),
}

/// Try local fetch + extract first. On bot protection or detected
/// JS-render, fall back to `cloud.scrape(...)` with the caller's
/// formats. Returns `Err(String)` so existing call sites that expect
/// stringified errors keep compiling.
///
/// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed
/// [`CloudError`] so you can render precise UX.
pub async fn smart_fetch(
    client: &dyn crate::fetcher::Fetcher,
    cloud: Option<&CloudClient>,
    url: &str,
    include_selectors: &[String],
    exclude_selectors: &[String],
    only_main_content: bool,
    formats: &[&str],
) -> Result<SmartFetchResult, String> {
    let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url))
        .await
        .map_err(|_| format!("Fetch timed out after 30s for {url}"))?
        .map_err(|e| format!("Fetch failed: {e}"))?;

    if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
        info!(url, "bot protection detected, falling back to cloud API");
        return cloud_scrape_fallback(
            cloud,
            url,
            include_selectors,
            exclude_selectors,
            only_main_content,
            formats,
        )
        .await;
    }

    let options = webclaw_core::ExtractionOptions {
        include_selectors: include_selectors.to_vec(),
        exclude_selectors: exclude_selectors.to_vec(),
        only_main_content,
        include_raw_html: false,
    };
    let extraction =
        webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
            .map_err(|e| format!("Extraction failed: {e}"))?;

    if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
        info!(
            url,
            word_count = extraction.metadata.word_count,
            html_len = fetch_result.html.len(),
            "JS-rendered page detected, falling back to cloud API"
        );
        return cloud_scrape_fallback(
            cloud,
            url,
            include_selectors,
            exclude_selectors,
            only_main_content,
            formats,
        )
        .await;
    }

    Ok(SmartFetchResult::Local(Box::new(extraction)))
}

async fn cloud_scrape_fallback(
    cloud: Option<&CloudClient>,
    url: &str,
    include_selectors: &[String],
    exclude_selectors: &[String],
    only_main_content: bool,
    formats: &[&str],
) -> Result<SmartFetchResult, String> {
    let Some(c) = cloud else {
        return Err(CloudError::NotConfigured.to_string());
    };
    let resp = c
        .scrape(
            url,
            formats,
            include_selectors,
            exclude_selectors,
            only_main_content,
        )
        .await
        .map_err(|e| e.to_string())?;
    info!(url, "cloud API fallback successful");
    Ok(SmartFetchResult::Cloud(resp))
}

// ---------------------------------------------------------------------------
// Smart-fetch-HTML: for vertical extractors
// ---------------------------------------------------------------------------

/// Where the HTML ultimately came from — useful for callers that want
/// to track "did we fall back?" for logging or pricing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchSource {
    Local,
    Cloud,
}

/// Antibot-aware HTML fetch result. The `html` field is always populated.
pub struct FetchedHtml {
    pub html: String,
    pub final_url: String,
    pub source: FetchSource,
}

/// Try local fetch; on bot protection, escalate to the cloud's
/// `/v1/scrape` via [`CloudClient::fetch_html`] and return the
/// synthesised HTML.
///
/// Designed for the vertical-extractor pattern where the caller has
/// its own parser and just needs bytes.
pub async fn smart_fetch_html(
    client: &dyn crate::fetcher::Fetcher,
    cloud: Option<&CloudClient>,
    url: &str,
) -> Result<FetchedHtml, CloudError> {
    let resp = client
        .fetch(url)
        .await
        .map_err(|e| CloudError::Network(e.to_string()))?;

    if !is_bot_protected(&resp.html, &resp.headers) {
        return Ok(FetchedHtml {
            html: resp.html,
            final_url: resp.url,
            source: FetchSource::Local,
        });
    }

    let Some(c) = cloud else {
        warn!(url, "bot protection detected + no cloud client configured");
        return Err(CloudError::NotConfigured);
    };
    debug!(url, "bot protection detected, escalating to cloud");
    let html = c.fetch_html(url).await?;
    Ok(FetchedHtml {
        html,
        final_url: url.to_string(),
        source: FetchSource::Cloud,
    })
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;

    fn empty_headers() -> HeaderMap {
        HeaderMap::new()
    }
|
||||
|
||||
// --- detectors ----------------------------------------------------------
|
||||
|
||||
#[test]
|
||||
fn is_bot_protected_detects_cloudflare_challenge() {
|
||||
let html = "<html><body>_cf_chl_opt loaded</body></html>";
|
||||
assert!(is_bot_protected(html, &empty_headers()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn is_bot_protected_detects_turnstile_on_short_page() {
|
||||
let html = "<div class=\"cf-turnstile\"></div>";
|
||||
assert!(is_bot_protected(html, &empty_headers()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn is_bot_protected_ignores_turnstile_on_real_content() {
|
||||
let html = format!(
|
||||
"<html><body>{}<div class=\"cf-turnstile\"></div></body></html>",
|
||||
"lots of real content ".repeat(8_000)
|
||||
);
|
||||
assert!(!is_bot_protected(&html, &empty_headers()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn is_bot_protected_detects_aws_waf_verifying_connection() {
|
||||
// The exact shape Trustpilot serves under AWS WAF.
|
||||
let html = r#"<div class="container"><div id="loading-state">
|
||||
<div class="interstitial-spinner" id="spinner"></div>
|
||||
<h1>Verifying your connection...</h1></div></div>"#;
|
||||
assert!(is_bot_protected(html, &empty_headers()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn synthesize_html_embeds_jsonld_and_og_tags() {
|
||||
let resp = json!({
|
||||
"url": "https://example.com/p/1",
|
||||
"metadata": {
|
||||
"title": "My Product",
|
||||
"description": "A nice thing.",
|
||||
"image": "https://cdn.example.com/1.jpg",
|
||||
"site_name": "Example Shop"
|
||||
},
|
||||
"structured_data": [
|
||||
{"@context":"https://schema.org","@type":"Product",
|
||||
"name":"Widget","offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
|
||||
],
|
||||
"markdown": "# Widget\n\nA nice widget."
|
||||
});
|
||||
let html = synthesize_html(&resp);
|
||||
// OG tags from metadata.
|
||||
assert!(html.contains(r#"<meta property="og:title" content="My Product">"#));
|
||||
assert!(
|
||||
html.contains(r#"<meta property="og:image" content="https://cdn.example.com/1.jpg">"#)
|
||||
);
|
||||
// JSON-LD block preserved losslessly.
|
||||
assert!(html.contains(r#"<script type="application/ld+json">"#));
|
||||
assert!(html.contains(r#""@type":"Product""#));
|
||||
assert!(html.contains(r#""price":"9.99""#));
|
||||
// Body carries markdown.
|
||||
assert!(html.contains("A nice widget."));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn synthesize_html_handles_missing_fields_gracefully() {
|
||||
let resp = json!({"url": "https://example.com", "metadata": {}});
|
||||
let html = synthesize_html(&resp);
|
||||
// No panic, no stray unclosed tags.
|
||||
assert!(html.starts_with("<html><head>"));
|
||||
assert!(html.ends_with("</body></html>"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn synthesize_html_escapes_attribute_quotes() {
|
||||
let resp = json!({
|
||||
"metadata": {"title": r#"She said "hi""#}
|
||||
});
|
||||
let html = synthesize_html(&resp);
|
||||
assert!(html.contains(r#"og:title" content="She said "hi"""#));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn is_bot_protected_ignores_phrase_on_real_content() {
|
||||
// A real article that happens to mention the phrase in prose
|
||||
// should not trigger the short-page detector.
|
||||
let html = format!(
|
||||
"<html><body>{}<p>Verifying your connection is tricky.</p></body></html>",
|
||||
"article text ".repeat(2_000)
|
||||
);
|
||||
assert!(!is_bot_protected(&html, &empty_headers()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn needs_js_rendering_flags_spa_skeleton() {
|
||||
let html = format!(
|
||||
"<html><body><div id=\"__next\"></div>{}</body></html>",
|
||||
"<script>x</script>".repeat(500)
|
||||
);
|
||||
assert!(needs_js_rendering(10, &html));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn needs_js_rendering_passes_real_article() {
|
||||
let html = format!(
|
||||
"<html><body>{}<script>x</script></body></html>",
|
||||
"Real article text ".repeat(5_000)
|
||||
);
|
||||
assert!(!needs_js_rendering(5_000, &html));
|
||||
}
|
||||
|
||||
// --- CloudError mapping -------------------------------------------------
|
||||
|
||||
#[test]
|
||||
fn cloud_error_maps_401() {
|
||||
let e = CloudError::from_status_and_body(401, "invalid key".into());
|
||||
assert!(matches!(e, CloudError::Unauthorized));
|
||||
assert!(e.to_string().contains(KEYS_URL));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cloud_error_maps_402() {
|
||||
let e = CloudError::from_status_and_body(402, "{}".into());
|
||||
assert!(matches!(e, CloudError::InsufficientPlan));
|
||||
assert!(e.to_string().contains(PRICING_URL));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cloud_error_maps_429() {
|
||||
let e = CloudError::from_status_and_body(429, "slow down".into());
|
||||
assert!(matches!(e, CloudError::RateLimited));
|
||||
assert!(e.to_string().contains(PRICING_URL));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cloud_error_maps_generic_5xx() {
|
||||
let e = CloudError::from_status_and_body(503, "x".repeat(2000));
|
||||
match e {
|
||||
CloudError::ServerError { status, body } => {
|
||||
assert_eq!(status, 503);
|
||||
assert!(body.len() <= 500);
|
||||
}
|
||||
_ => panic!("expected ServerError"),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn not_configured_error_points_at_signup() {
|
||||
let msg = CloudError::NotConfigured.to_string();
|
||||
assert!(msg.contains(SIGNUP_URL));
|
||||
assert!(msg.contains("WEBCLAW_API_KEY"));
|
||||
}
|
||||
|
||||
// --- CloudClient construction ------------------------------------------
|
||||
|
||||
#[test]
|
||||
fn cloud_client_explicit_key_wins_over_env() {
|
||||
// SAFETY: this test mutates process env. Serial tests only.
|
||||
// Set env to something, pass an explicit key, explicit should win.
|
||||
// (We don't actually *call* the API, just check the struct stored
|
||||
// the right key.)
|
||||
// rustc std::env::set_var is unsafe in newer toolchains.
|
||||
unsafe {
|
||||
std::env::set_var("WEBCLAW_API_KEY", "from-env");
|
||||
}
|
||||
let client = CloudClient::new(Some("from-flag")).expect("client built");
|
||||
assert_eq!(client.api_key, "from-flag");
|
||||
unsafe {
|
||||
std::env::remove_var("WEBCLAW_API_KEY");
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cloud_client_none_when_empty() {
|
||||
unsafe {
|
||||
std::env::remove_var("WEBCLAW_API_KEY");
|
||||
}
|
||||
assert!(CloudClient::new(None).is_none());
|
||||
assert!(CloudClient::new(Some("")).is_none());
|
||||
assert!(CloudClient::new(Some(" ")).is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cloud_client_base_url_strips_trailing_slash() {
|
||||
let c = CloudClient::with_key_and_base("k", "https://api.example.com/v1/");
|
||||
assert_eq!(c.base_url(), "https://api.example.com/v1");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn truncate_respects_char_boundaries() {
|
||||
// Ensure we don't slice inside a multi-byte char.
|
||||
let s = "a".repeat(10) + "é"; // é is 2 bytes
|
||||
let out = truncate(&s, 11);
|
||||
assert_eq!(out.chars().count(), 11);
|
||||
}
|
||||
}
|
||||
452  crates/webclaw-fetch/src/extractors/amazon_product.rs  Normal file

@@ -0,0 +1,452 @@
//! Amazon product detail page extractor.
//!
//! Amazon product pages (`/dp/{ASIN}/` on every locale) are
//! inconsistently protected. Sometimes our local TLS fingerprint gets
//! a real HTML page; sometimes we land on a CAPTCHA interstitial;
//! sometimes we land on a real page that for whatever reason ships
//! no Product JSON-LD (Amazon A/B-tests this regularly). So the
//! extractor has a two-stage fallback:
//!
//! 1. Try local fetch + parse. If we got Product JSON-LD back, great:
//!    we have everything (title, brand, price, availability, rating).
//! 2. If local fetch worked *but the page has no Product JSON-LD* AND
//!    a cloud client is configured, force-escalate to api.webclaw.io.
//!    Cloud's render + antibot pipeline reliably surfaces the
//!    structured data. Without a cloud client we return whatever we
//!    got from local (usually just title via `#productTitle` or OG
//!    meta tags).
//!
//! Parsing tries JSON-LD first, DOM regex (`#productTitle`,
//! `#landingImage`) second, OG `<meta>` tags third. The OG path
//! matters because the cloud's synthesized HTML ships metadata as
//! OG tags but lacks Amazon's DOM IDs.
//!
//! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/`
//! path. ASINs are a stable Amazon identifier so we extract that as
//! part of the response even when everything else is empty (tells
//! callers the URL was at least recognised).

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "amazon_product",
    label: "Amazon product",
    description: "Returns product detail: title, brand, price, currency, availability, rating, image, ASIN. Works best with WEBCLAW_API_KEY — Amazon's antibot means we usually go through the cloud; without a key you may only get title and OG metadata.",
    url_patterns: &[
        "https://www.amazon.com/dp/{ASIN}",
        "https://www.amazon.co.uk/dp/{ASIN}",
        "https://www.amazon.de/dp/{ASIN}",
        "https://www.amazon.fr/dp/{ASIN}",
        "https://www.amazon.it/dp/{ASIN}",
        "https://www.amazon.es/dp/{ASIN}",
        "https://www.amazon.co.jp/dp/{ASIN}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !is_amazon_host(host) {
        return false;
    }
    parse_asin(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let asin = parse_asin(url)
        .ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;

    let mut fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    // Amazon ships Product JSON-LD inconsistently even on non-CAPTCHA
    // pages (they A/B-test it). When local fetch succeeded but has no
    // Product JSON-LD, force-escalate to the cloud which runs the
    // render pipeline and reliably surfaces structured data. No-op
    // when cloud isn't configured — we return whatever local gave us.
    if fetched.source == cloud::FetchSource::Local
        && find_product_jsonld(&fetched.html).is_none()
        && let Some(c) = client.cloud()
    {
        match c.fetch_html(url).await {
            Ok(cloud_html) => {
                fetched = cloud::FetchedHtml {
                    html: cloud_html,
                    final_url: url.to_string(),
                    source: cloud::FetchSource::Cloud,
                };
            }
            Err(e) => {
                tracing::debug!(
                    error = %e,
                    "amazon_product: cloud escalation failed, keeping local"
                );
            }
        }
    }

    let mut data = parse(&fetched.html, url, &asin);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

/// Pure parser. Given HTML (from anywhere — direct, cloud, or a fixture
/// file) and the source URL, extract Amazon product detail. Returns a
/// `Value` rather than a typed struct so callers can pass it through
/// without carrying webclaw_fetch types.
pub fn parse(html: &str, url: &str, asin: &str) -> Value {
    let jsonld = find_product_jsonld(html);
    // Three-tier title: JSON-LD `name` > Amazon's `#productTitle` span
    // (only present on real static HTML) > cloud-synthesized og:title.
    let title = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "name"))
        .or_else(|| dom_title(html))
        .or_else(|| og(html, "title"));
    let image = jsonld
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| dom_image(html))
        .or_else(|| og(html, "image"));
    let brand = jsonld.as_ref().and_then(get_brand);
    let description = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description"));
    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
    let offer = jsonld.as_ref().and_then(first_offer);

    let sku = jsonld.as_ref().and_then(|v| get_text(v, "sku"));
    let mpn = jsonld.as_ref().and_then(|v| get_text(v, "mpn"));

    json!({
        "url": url,
        "asin": asin,
        "title": title,
        "brand": brand,
        "description": description,
        "image": image,
        "price": offer.as_ref().and_then(|o| get_text(o, "price")),
        "currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
        "availability": offer.as_ref().and_then(|o| {
            get_text(o, "availability").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "condition": offer.as_ref().and_then(|o| {
            get_text(o, "itemCondition").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "sku": sku,
        "mpn": mpn,
        "aggregate_rating": aggregate_rating,
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn is_amazon_host(host: &str) -> bool {
    host.starts_with("www.amazon.") || host.starts_with("amazon.")
}

/// Pull a 10-char ASIN out of any recognised Amazon URL shape:
/// - /dp/{ASIN}
/// - /gp/product/{ASIN}
/// - /product/{ASIN}
/// - /exec/obidos/ASIN/{ASIN}
fn parse_asin(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r"/(?:dp|gp/product|product|ASIN)/([A-Z0-9]{10})(?:[/?#]|$)").unwrap()
    });
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

// ---------------------------------------------------------------------------
// JSON-LD walkers — light reuse of ecommerce_product's style
// ---------------------------------------------------------------------------

fn find_product_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_product_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
    match t {
        Value::String(s) => is_prod(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    brand
        .as_object()
        .and_then(|o| o.get("name"))
        .and_then(|n| n.as_str())
        .map(String::from)
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn first_offer(v: &Value) -> Option<Value> {
    let offers = v.get("offers")?;
    match offers {
        Value::Array(arr) => arr.first().cloned(),
        Value::Object(_) => Some(offers.clone()),
        _ => None,
    }
}

fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "review_count": get_text(r, "reviewCount"),
        "best_rating": get_text(r, "bestRating"),
    }))
}

// ---------------------------------------------------------------------------
// DOM fallbacks — cheap regex for the two fields most likely to be
// missing from JSON-LD on Amazon.
// ---------------------------------------------------------------------------

fn dom_title(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"(?s)id="productTitle"[^>]*>([^<]+)<"#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().trim().to_string())
}

fn dom_image(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"id="landingImage"[^>]+src="([^"]+)""#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// OG meta tag lookup. Cloud-synthesized HTML ships these even when
/// JSON-LD and Amazon-DOM-IDs are both absent, so they're the last
/// line of defence for `title`, `image`, `description`.
fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| html_unescape(m.as_str()));
        }
    }
    None
}

/// Undo the synthesize_html attribute escaping for the few entities it
/// emits. Keeps us off a heavier HTML-entity dep. `&amp;` is unescaped
/// last so an escaped ampersand can't be double-unescaped into another
/// entity.
fn html_unescape(s: &str) -> String {
    s.replace("&quot;", "\"")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_multi_locale() {
        assert!(matches("https://www.amazon.com/dp/B0CHX1W1XY"));
        assert!(matches("https://www.amazon.co.uk/dp/B0CHX1W1XY/"));
        assert!(matches("https://www.amazon.de/dp/B0CHX1W1XY?psc=1"));
        assert!(matches(
            "https://www.amazon.com/gp/product/B0CHX1W1XY/ref=foo"
        ));
    }

    #[test]
    fn rejects_non_product_urls() {
        assert!(!matches("https://www.amazon.com/"));
        assert!(!matches("https://www.amazon.com/gp/cart"));
        assert!(!matches("https://example.com/dp/B0CHX1W1XY"));
    }

    #[test]
    fn parse_asin_extracts_from_multiple_shapes() {
        assert_eq!(
            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY/"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY?psc=1"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/gp/product/B0CHX1W1XY/ref=bar"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/exec/obidos/ASIN/B0CHX1W1XY/baz"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(parse_asin("https://www.amazon.com/"), None);
    }

    #[test]
    fn parse_extracts_from_fixture_jsonld() {
        // Minimal Amazon-style fixture with a Product JSON-LD block.
        let html = r##"
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
 "name":"ACME Widget","sku":"B0CHX1W1XY",
 "brand":{"@type":"Brand","name":"ACME"},
 "image":"https://m.media-amazon.com/images/I/abc.jpg",
 "offers":{"@type":"Offer","price":"19.99","priceCurrency":"USD",
  "availability":"https://schema.org/InStock"},
 "aggregateRating":{"@type":"AggregateRating","ratingValue":"4.6","reviewCount":"1234"}}
</script>
</head><body></body></html>"##;
        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
        assert_eq!(v["asin"], "B0CHX1W1XY");
        assert_eq!(v["title"], "ACME Widget");
        assert_eq!(v["brand"], "ACME");
        assert_eq!(v["price"], "19.99");
        assert_eq!(v["currency"], "USD");
        assert_eq!(v["availability"], "InStock");
        assert_eq!(v["aggregate_rating"]["rating_value"], "4.6");
        assert_eq!(v["aggregate_rating"]["review_count"], "1234");
    }

    #[test]
    fn parse_falls_back_to_dom_when_jsonld_missing_fields() {
        let html = r#"
<html><body>
<span id="productTitle">Fallback Title</span>
<img id="landingImage" src="https://m.media-amazon.com/images/I/fallback.jpg" />
</body></html>
"#;
        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
        assert_eq!(v["title"], "Fallback Title");
        assert_eq!(
            v["image"],
            "https://m.media-amazon.com/images/I/fallback.jpg"
        );
    }

    #[test]
    fn parse_falls_back_to_og_meta_when_no_jsonld_no_dom() {
        // Shape we see from the cloud synthesize_html path: OG tags
        // only, no JSON-LD, no Amazon DOM IDs.
        let html = r##"<html><head>
<meta property="og:title" content="Cloud-sourced MacBook Pro">
<meta property="og:image" content="https://m.media-amazon.com/images/I/cloud.jpg">
<meta property="og:description" content="Via api.webclaw.io">
</head></html>"##;
        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
        assert_eq!(v["title"], "Cloud-sourced MacBook Pro");
        assert_eq!(v["image"], "https://m.media-amazon.com/images/I/cloud.jpg");
        assert_eq!(v["description"], "Via api.webclaw.io");
    }

    #[test]
    fn og_unescape_handles_quot_entity() {
        let html = r#"<meta property="og:title" content="Apple &quot;M2 Pro&quot; Laptop">"#;
        assert_eq!(
            og(html, "title").as_deref(),
            Some(r#"Apple "M2 Pro" Laptop"#)
        );
    }
}
314  crates/webclaw-fetch/src/extractors/arxiv.rs  Normal file

@@ -0,0 +1,314 @@
//! ArXiv paper structured extractor.
//!
//! Uses the public ArXiv API at `export.arxiv.org/api/query?id_list={id}`
//! which returns Atom XML. We parse just enough to surface title, authors,
//! abstract, categories, and the canonical PDF link. No HTML scraping
//! required and no auth.

use quick_xml::Reader;
use quick_xml::events::Event;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "arxiv",
    label: "ArXiv paper",
    description: "Returns paper metadata: title, authors, abstract, categories, primary category, PDF URL.",
    url_patterns: &[
        "https://arxiv.org/abs/{id}",
        "https://arxiv.org/abs/{id}v{n}",
        "https://arxiv.org/pdf/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "arxiv.org" && host != "www.arxiv.org" {
        return false;
    }
    url.contains("/abs/") || url.contains("/pdf/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let id = parse_id(url)
        .ok_or_else(|| FetchError::Build(format!("arxiv: cannot parse id from '{url}'")))?;

    let api_url = format!("https://export.arxiv.org/api/query?id_list={id}");
    let resp = client.fetch(&api_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "arxiv api returned status {}",
            resp.status
        )));
    }

    let entry = parse_atom_entry(&resp.html)
        .ok_or_else(|| FetchError::BodyDecode("arxiv: no <entry> in response".into()))?;
    if entry.title.is_none() && entry.summary.is_none() {
        return Err(FetchError::BodyDecode(format!(
            "arxiv: paper '{id}' returned empty entry (likely withdrawn or invalid id)"
        )));
    }

    Ok(json!({
        "url": url,
        "id": id,
        "arxiv_id": entry.id,
        "title": entry.title,
        "authors": entry.authors,
        "abstract": entry.summary.map(|s| collapse_whitespace(&s)),
        "published": entry.published,
        "updated": entry.updated,
        "primary_category": entry.primary_category,
        "categories": entry.categories,
        "doi": entry.doi,
        "comment": entry.comment,
        "pdf_url": entry.pdf_url,
        "abs_url": entry.abs_url,
    }))
}

// ---------------------------------------------------------------------------
// Helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Parse an arxiv id from a URL. Strips the version suffix (`v2`, `v3`)
/// and the `.pdf` extension when present.
fn parse_id(url: &str) -> Option<String> {
    let after = url
        .split("/abs/")
        .nth(1)
        .or_else(|| url.split("/pdf/").nth(1))?;
    let stripped = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .trim_end_matches(".pdf");
    // Strip optional version suffix, e.g. "2401.12345v2" → "2401.12345"
    let no_version = match stripped.rfind('v') {
        Some(i) if stripped[i + 1..].chars().all(|c| c.is_ascii_digit()) => &stripped[..i],
        _ => stripped,
    };
    if no_version.is_empty() {
        None
    } else {
        Some(no_version.to_string())
    }
}

fn collapse_whitespace(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

#[derive(Default)]
struct AtomEntry {
    id: Option<String>,
    title: Option<String>,
    summary: Option<String>,
    published: Option<String>,
    updated: Option<String>,
    primary_category: Option<String>,
    categories: Vec<String>,
    authors: Vec<String>,
    doi: Option<String>,
    comment: Option<String>,
    pdf_url: Option<String>,
    abs_url: Option<String>,
}

/// Parse the first `<entry>` block of an ArXiv Atom feed.
fn parse_atom_entry(xml: &str) -> Option<AtomEntry> {
    let mut reader = Reader::from_str(xml);
    let mut buf = Vec::new();

    // States
    let mut in_entry = false;
    let mut current: Option<&'static str> = None;
    let mut in_author = false;
    let mut in_author_name = false;
    let mut entry = AtomEntry::default();

    loop {
        match reader.read_event_into(&mut buf) {
            Ok(Event::Start(ref e)) => {
                let local = e.local_name();
                match local.as_ref() {
                    b"entry" => in_entry = true,
                    b"id" if in_entry && !in_author => current = Some("id"),
                    b"title" if in_entry => current = Some("title"),
                    b"summary" if in_entry => current = Some("summary"),
                    b"published" if in_entry => current = Some("published"),
                    b"updated" if in_entry => current = Some("updated"),
                    b"author" if in_entry => in_author = true,
                    b"name" if in_author => {
                        in_author_name = true;
                        current = Some("author_name");
                    }
                    b"category" if in_entry => {
                        // primary_category is namespaced (arxiv:primary_category)
                        // category is plain. quick-xml gives us local-name only,
                        // so we treat both as categories and take the first as
                        // primary.
                        for attr in e.attributes().flatten() {
                            if attr.key.as_ref() == b"term"
                                && let Ok(v) = attr.unescape_value()
                            {
                                let term = v.to_string();
                                if entry.primary_category.is_none() {
                                    entry.primary_category = Some(term.clone());
                                }
                                entry.categories.push(term);
                            }
                        }
                    }
                    b"link" if in_entry => {
                        let mut href = None;
                        let mut rel = None;
                        let mut typ = None;
                        for attr in e.attributes().flatten() {
                            match attr.key.as_ref() {
                                b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()),
                                b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()),
                                b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()),
                                _ => {}
                            }
                        }
                        if let Some(h) = href {
                            if typ.as_deref() == Some("application/pdf") {
                                entry.pdf_url = Some(h.clone());
                            }
                            if rel.as_deref() == Some("alternate") {
                                entry.abs_url = Some(h);
                            }
                        }
                    }
                    _ => current = None,
                }
            }
            Ok(Event::Empty(ref e)) => {
                // Self-closing tags (<link href="..." />). Same handling as Start.
                let local = e.local_name();
                if (local.as_ref() == b"link" || local.as_ref() == b"category") && in_entry {
                    let mut href = None;
                    let mut rel = None;
                    let mut typ = None;
                    let mut term = None;
                    for attr in e.attributes().flatten() {
                        match attr.key.as_ref() {
                            b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()),
                            b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()),
                            b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()),
                            b"term" => term = attr.unescape_value().ok().map(|s| s.to_string()),
                            _ => {}
                        }
                    }
                    if let Some(t) = term {
                        if entry.primary_category.is_none() {
                            entry.primary_category = Some(t.clone());
                        }
                        entry.categories.push(t);
                    }
                    if let Some(h) = href {
                        if typ.as_deref() == Some("application/pdf") {
                            entry.pdf_url = Some(h.clone());
                        }
                        if rel.as_deref() == Some("alternate") {
                            entry.abs_url = Some(h);
                        }
                    }
                }
            }
            Ok(Event::Text(ref e)) => {
                if let (Some(field), Ok(text)) = (current, e.unescape()) {
                    let text = text.to_string();
                    match field {
                        "id" => entry.id = Some(text.trim().to_string()),
                        "title" => entry.title = append_text(entry.title.take(), &text),
                        "summary" => entry.summary = append_text(entry.summary.take(), &text),
                        "published" => entry.published = Some(text.trim().to_string()),
                        "updated" => entry.updated = Some(text.trim().to_string()),
                        "author_name" => entry.authors.push(text.trim().to_string()),
                        _ => {}
                    }
                }
            }
            Ok(Event::End(ref e)) => {
                let local = e.local_name();
                match local.as_ref() {
                    b"entry" => break,
                    b"author" => in_author = false,
                    b"name" => in_author_name = false,
                    _ => {}
                }
                if !in_author_name {
                    current = None;
                }
            }
            Ok(Event::Eof) => break,
            Err(_) => return None,
            _ => {}
        }
        buf.clear();
    }

    if in_entry { Some(entry) } else { None }
}

/// Concatenate text fragments (long fields can be split across multiple
/// text events if they contain entities or CDATA).
fn append_text(prev: Option<String>, next: &str) -> Option<String> {
    match prev {
        Some(mut s) => {
            s.push_str(next);
            Some(s)
        }
        None => Some(next.to_string()),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_arxiv_urls() {
        assert!(matches("https://arxiv.org/abs/2401.12345"));
        assert!(matches("https://arxiv.org/abs/2401.12345v2"));
        assert!(matches("https://arxiv.org/pdf/2401.12345.pdf"));
|
||||
assert!(!matches("https://arxiv.org/"));
|
||||
assert!(!matches("https://example.com/abs/foo"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_id_strips_version_and_extension() {
|
||||
assert_eq!(
|
||||
parse_id("https://arxiv.org/abs/2401.12345"),
|
||||
Some("2401.12345".into())
|
||||
);
|
||||
assert_eq!(
|
||||
parse_id("https://arxiv.org/abs/2401.12345v3"),
|
||||
Some("2401.12345".into())
|
||||
);
|
||||
assert_eq!(
|
||||
parse_id("https://arxiv.org/pdf/2401.12345v2.pdf"),
|
||||
Some("2401.12345".into())
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn collapse_whitespace_handles_newlines_and_tabs() {
|
||||
assert_eq!(collapse_whitespace("a b\n\tc "), "a b c");
|
||||
}
|
||||
}
|
||||
168 crates/webclaw-fetch/src/extractors/crates_io.rs Normal file
@@ -0,0 +1,168 @@
//! crates.io structured extractor.
//!
//! Uses the public JSON API at `crates.io/api/v1/crates/{name}`. No
//! auth, no rate limit at normal usage. The response includes both
//! the crate metadata and the full version list, which we summarize
//! down to a count + latest release info to keep the payload small.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "crates_io",
    label: "crates.io package",
    description: "Returns crate metadata: latest version, dependencies, downloads, license, repository.",
    url_patterns: &[
        "https://crates.io/crates/{name}",
        "https://crates.io/crates/{name}/{version}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "crates.io" && host != "www.crates.io" {
        return false;
    }
    url.contains("/crates/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let name = parse_name(url)
        .ok_or_else(|| FetchError::Build(format!("crates.io: cannot parse name from '{url}'")))?;

    let api_url = format!("https://crates.io/api/v1/crates/{name}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "crates.io: crate '{name}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "crates.io api returned status {}",
            resp.status
        )));
    }

    let body: CratesResponse = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("crates.io parse: {e}")))?;

    let c = body.crate_;
    let latest_version = body
        .versions
        .iter()
        .find(|v| !v.yanked.unwrap_or(false))
        .or_else(|| body.versions.first());

    Ok(json!({
        "url": url,
        "name": c.id,
        "description": c.description,
        "homepage": c.homepage,
        "documentation": c.documentation,
        "repository": c.repository,
        "max_stable_version": c.max_stable_version,
        "max_version": c.max_version,
        "newest_version": c.newest_version,
        "downloads": c.downloads,
        "recent_downloads": c.recent_downloads,
        "categories": c.categories,
        "keywords": c.keywords,
        "release_count": body.versions.len(),
        "latest_release_date": latest_version.and_then(|v| v.created_at.clone()),
        "latest_license": latest_version.and_then(|v| v.license.clone()),
        "latest_rust_version": latest_version.and_then(|v| v.rust_version.clone()),
        "latest_yanked": latest_version.and_then(|v| v.yanked),
        "created_at": c.created_at,
        "updated_at": c.updated_at,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_name(url: &str) -> Option<String> {
    let after = url.split("/crates/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let first = stripped.split('/').find(|s| !s.is_empty())?;
    Some(first.to_string())
}

// ---------------------------------------------------------------------------
// crates.io API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct CratesResponse {
    #[serde(rename = "crate")]
    crate_: CrateInfo,
    #[serde(default)]
    versions: Vec<VersionInfo>,
}

#[derive(Deserialize)]
struct CrateInfo {
    id: Option<String>,
    description: Option<String>,
    homepage: Option<String>,
    documentation: Option<String>,
    repository: Option<String>,
    max_stable_version: Option<String>,
    max_version: Option<String>,
    newest_version: Option<String>,
    downloads: Option<i64>,
    recent_downloads: Option<i64>,
    #[serde(default)]
    categories: Vec<String>,
    #[serde(default)]
    keywords: Vec<String>,
    created_at: Option<String>,
    updated_at: Option<String>,
}

#[derive(Deserialize)]
struct VersionInfo {
    license: Option<String>,
    rust_version: Option<String>,
    yanked: Option<bool>,
    created_at: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_crate_pages() {
        assert!(matches("https://crates.io/crates/serde"));
        assert!(matches("https://crates.io/crates/tokio/1.45.0"));
        assert!(!matches("https://crates.io/"));
        assert!(!matches("https://example.com/crates/foo"));
    }

    #[test]
    fn parse_name_handles_versioned_urls() {
        assert_eq!(
            parse_name("https://crates.io/crates/serde"),
            Some("serde".into())
        );
        assert_eq!(
            parse_name("https://crates.io/crates/tokio/1.45.0"),
            Some("tokio".into())
        );
        assert_eq!(
            parse_name("https://crates.io/crates/scraper/?foo=bar"),
            Some("scraper".into())
        );
    }
}
188 crates/webclaw-fetch/src/extractors/dev_to.rs Normal file
@@ -0,0 +1,188 @@
//! dev.to article structured extractor.
//!
//! `dev.to/api/articles/{username}/{slug}` returns the full article body,
//! tags, reaction count, comment count, and reading time. Anonymous
//! access works fine for published posts.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "dev_to",
    label: "dev.to article",
    description: "Returns article metadata + body: title, body markdown, tags, reactions, comments, reading time.",
    url_patterns: &["https://dev.to/{username}/{slug}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "dev.to" && host != "www.dev.to" {
        return false;
    }
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // Need exactly /{username}/{slug}, and the first segment must not be reserved.
    segs.len() == 2 && !RESERVED_FIRST_SEGS.contains(&segs[0])
}

const RESERVED_FIRST_SEGS: &[&str] = &[
    "api",
    "tags",
    "search",
    "settings",
    "enter",
    "signup",
    "about",
    "code-of-conduct",
    "privacy",
    "terms",
    "contact",
    "sponsorships",
    "sponsors",
    "shop",
    "videos",
    "listings",
    "podcasts",
    "p",
    "t",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (username, slug) = parse_username_slug(url).ok_or_else(|| {
        FetchError::Build(format!("dev_to: cannot parse username/slug from '{url}'"))
    })?;

    let api_url = format!("https://dev.to/api/articles/{username}/{slug}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "dev_to: article '{username}/{slug}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "dev.to api returned status {}",
            resp.status
        )));
    }

    let a: Article = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("dev.to parse: {e}")))?;

    Ok(json!({
        "url": url,
        "id": a.id,
        "title": a.title,
        "description": a.description,
        "body_markdown": a.body_markdown,
        "url_canonical": a.canonical_url,
        "published_at": a.published_at,
        "edited_at": a.edited_at,
        "reading_time_min": a.reading_time_minutes,
        "tags": a.tag_list,
        "positive_reactions": a.positive_reactions_count,
        "public_reactions": a.public_reactions_count,
        "comments_count": a.comments_count,
        "page_views_count": a.page_views_count,
        "cover_image": a.cover_image,
        "author": json!({
            "username": a.user.as_ref().and_then(|u| u.username.clone()),
            "name": a.user.as_ref().and_then(|u| u.name.clone()),
            "twitter": a.user.as_ref().and_then(|u| u.twitter_username.clone()),
            "github": a.user.as_ref().and_then(|u| u.github_username.clone()),
            "website": a.user.as_ref().and_then(|u| u.website_url.clone()),
        }),
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_username_slug(url: &str) -> Option<(String, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let username = segs.next()?;
    let slug = segs.next()?;
    Some((username.to_string(), slug.to_string()))
}

// ---------------------------------------------------------------------------
// dev.to API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Article {
    id: Option<i64>,
    title: Option<String>,
    description: Option<String>,
    body_markdown: Option<String>,
    canonical_url: Option<String>,
    published_at: Option<String>,
    edited_at: Option<String>,
    reading_time_minutes: Option<i64>,
    tag_list: Option<serde_json::Value>, // string OR array depending on endpoint
    positive_reactions_count: Option<i64>,
    public_reactions_count: Option<i64>,
    comments_count: Option<i64>,
    page_views_count: Option<i64>,
    cover_image: Option<String>,
    user: Option<UserRef>,
}

#[derive(Deserialize)]
struct UserRef {
    username: Option<String>,
    name: Option<String>,
    twitter_username: Option<String>,
    github_username: Option<String>,
    website_url: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_article_urls() {
        assert!(matches("https://dev.to/ben/welcome-thread"));
        assert!(matches("https://dev.to/0xmassi/some-post-1abc"));
        assert!(!matches("https://dev.to/"));
        assert!(!matches("https://dev.to/api/articles/foo/bar"));
        assert!(!matches("https://dev.to/tags/rust"));
        assert!(!matches("https://dev.to/ben")); // user profile, not article
        assert!(!matches("https://example.com/ben/post"));
    }

    #[test]
    fn parse_pulls_username_and_slug() {
        assert_eq!(
            parse_username_slug("https://dev.to/ben/welcome-thread"),
            Some(("ben".into(), "welcome-thread".into()))
        );
        assert_eq!(
            parse_username_slug("https://dev.to/0xmassi/some-post-1abc/?foo=bar"),
            Some(("0xmassi".into(), "some-post-1abc".into()))
        );
    }
}
150 crates/webclaw-fetch/src/extractors/docker_hub.rs Normal file
@@ -0,0 +1,150 @@
//! Docker Hub repository structured extractor.
//!
//! Uses the v2 JSON API at `hub.docker.com/v2/repositories/{namespace}/{name}`.
//! Anonymous access is allowed for public images. The official-image
//! shorthand (e.g. `nginx`, `redis`) is normalized to `library/{name}`.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "docker_hub",
    label: "Docker Hub repository",
    description: "Returns image metadata: pull count, star count, last_updated, official flag, description.",
    url_patterns: &[
        "https://hub.docker.com/_/{name}",
        "https://hub.docker.com/r/{namespace}/{name}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "hub.docker.com" {
        return false;
    }
    url.contains("/_/") || url.contains("/r/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (namespace, name) = parse_repo(url)
        .ok_or_else(|| FetchError::Build(format!("docker_hub: cannot parse repo from '{url}'")))?;

    let api_url = format!("https://hub.docker.com/v2/repositories/{namespace}/{name}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "docker_hub: repo '{namespace}/{name}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "docker_hub api returned status {}",
            resp.status
        )));
    }

    let r: RepoResponse = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("docker_hub parse: {e}")))?;

    Ok(json!({
        "url": url,
        "namespace": r.namespace,
        "name": r.name,
        "full_name": format!("{namespace}/{name}"),
        "pull_count": r.pull_count,
        "star_count": r.star_count,
        "description": r.description,
        "full_description": r.full_description,
        "last_updated": r.last_updated,
        "date_registered": r.date_registered,
        "is_official": namespace == "library",
        "is_private": r.is_private,
        "status_description": r.status_description,
        "categories": r.categories,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Parse `(namespace, name)` from a Docker Hub URL. The official-image
/// shorthand `/_/nginx` maps to `(library, nginx)`. Personal repos
/// `/r/foo/bar` map to `(foo, bar)`.
fn parse_repo(url: &str) -> Option<(String, String)> {
    if let Some(after) = url.split("/_/").nth(1) {
        let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
        let name = stripped.split('/').next().filter(|s| !s.is_empty())?;
        return Some(("library".into(), name.to_string()));
    }
    let after = url.split("/r/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let ns = segs.next()?;
    let nm = segs.next()?;
    Some((ns.to_string(), nm.to_string()))
}

#[derive(Deserialize)]
struct RepoResponse {
    namespace: Option<String>,
    name: Option<String>,
    pull_count: Option<i64>,
    star_count: Option<i64>,
    description: Option<String>,
    full_description: Option<String>,
    last_updated: Option<String>,
    date_registered: Option<String>,
    is_private: Option<bool>,
    status_description: Option<String>,
    #[serde(default)]
    categories: Vec<DockerCategory>,
}

#[derive(Deserialize, serde::Serialize)]
struct DockerCategory {
    name: Option<String>,
    slug: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_docker_urls() {
        assert!(matches("https://hub.docker.com/_/nginx"));
        assert!(matches("https://hub.docker.com/r/grafana/grafana"));
        assert!(!matches("https://hub.docker.com/"));
        assert!(!matches("https://example.com/_/nginx"));
    }

    #[test]
    fn parse_repo_handles_official_and_personal() {
        assert_eq!(
            parse_repo("https://hub.docker.com/_/nginx"),
            Some(("library".into(), "nginx".into()))
        );
        assert_eq!(
            parse_repo("https://hub.docker.com/_/nginx/tags"),
            Some(("library".into(), "nginx".into()))
        );
        assert_eq!(
            parse_repo("https://hub.docker.com/r/grafana/grafana"),
            Some(("grafana".into(), "grafana".into()))
        );
        assert_eq!(
            parse_repo("https://hub.docker.com/r/grafana/grafana/?foo=bar"),
            Some(("grafana".into(), "grafana".into()))
        );
    }
}
337 crates/webclaw-fetch/src/extractors/ebay_listing.rs Normal file
@@ -0,0 +1,337 @@
//! eBay listing extractor.
//!
//! eBay item pages at `ebay.com/itm/{id}` and international variants
//! usually ship a `Product` JSON-LD block with title, price, currency,
//! condition, and an `AggregateOffer` when bidding. eBay applies
//! Cloudflare + custom WAF selectively — some item IDs return normal
//! HTML to the Firefox profile, others 403 / get the "Pardon our
//! interruption" page. We route through `cloud::smart_fetch_html` so
//! both paths resolve to the same parser.

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "ebay_listing",
    label: "eBay listing",
    description: "Returns item title, price, currency, condition, seller, shipping, and bid info. Heavy listings may need WEBCLAW_API_KEY for antibot.",
    url_patterns: &[
        "https://www.ebay.com/itm/{id}",
        "https://www.ebay.co.uk/itm/{id}",
        "https://www.ebay.de/itm/{id}",
        "https://www.ebay.fr/itm/{id}",
        "https://www.ebay.it/itm/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !is_ebay_host(host) {
        return false;
    }
    parse_item_id(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let item_id = parse_item_id(url)
        .ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?;

    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse(&fetched.html, url, &item_id);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

pub fn parse(html: &str, url: &str, item_id: &str) -> Value {
    let jsonld = find_product_jsonld(html);
    let title = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "name"))
        .or_else(|| og(html, "title"));
    let image = jsonld
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| og(html, "image"));
    let brand = jsonld.as_ref().and_then(get_brand);
    let description = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description"));
    let offer = jsonld.as_ref().and_then(first_offer);

    // eBay's AggregateOffer uses lowPrice/highPrice. Offer uses price.
    let (low_price, high_price, single_price) = match offer.as_ref() {
        Some(o) => (
            get_text(o, "lowPrice"),
            get_text(o, "highPrice"),
            get_text(o, "price"),
        ),
        None => (None, None, None),
    };
    let offer_count = offer.as_ref().and_then(|o| get_text(o, "offerCount"));

    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);

    json!({
        "url": url,
        "item_id": item_id,
        "title": title,
        "brand": brand,
        "description": description,
        "image": image,
        "price": single_price,
        "low_price": low_price,
        "high_price": high_price,
        "offer_count": offer_count,
        "currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
        "availability": offer.as_ref().and_then(|o| {
            get_text(o, "availability").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "condition": offer.as_ref().and_then(|o| {
            get_text(o, "itemCondition").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "seller": offer.as_ref().and_then(|o|
            o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from)),
        "aggregate_rating": aggregate_rating,
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn is_ebay_host(host: &str) -> bool {
    host.starts_with("www.ebay.") || host.starts_with("ebay.")
}

/// Pull the numeric item id out of `/itm/{id}` or `/itm/{slug}/{id}`
/// URLs. IDs are 10-15 digits today, but we accept any all-digit
/// trailing segment so the extractor stays forward-compatible.
fn parse_item_id(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        // /itm/(optional-slug/)?(digits)([/?#]|end)
        Regex::new(r"/itm/(?:[^/]+/)?(\d{8,})(?:[/?#]|$)").unwrap()
    });
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

// ---------------------------------------------------------------------------
// JSON-LD walkers
// ---------------------------------------------------------------------------

fn find_product_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_product_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
    match t {
        Value::String(s) => is_prod(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    brand
        .as_object()
        .and_then(|o| o.get("name"))
        .and_then(|n| n.as_str())
        .map(String::from)
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn first_offer(v: &Value) -> Option<Value> {
    let offers = v.get("offers")?;
    match offers {
        Value::Array(arr) => arr.first().cloned(),
        Value::Object(_) => Some(offers.clone()),
        _ => None,
    }
}

fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "review_count": get_text(r, "reviewCount"),
        "best_rating": get_text(r, "bestRating"),
    }))
}

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_ebay_item_urls() {
        assert!(matches("https://www.ebay.com/itm/325478156234"));
        assert!(matches(
            "https://www.ebay.com/itm/vintage-typewriter/325478156234"
        ));
        assert!(matches("https://www.ebay.co.uk/itm/325478156234"));
        assert!(!matches("https://www.ebay.com/"));
        assert!(!matches("https://www.ebay.com/sch/foo"));
        assert!(!matches("https://example.com/itm/325478156234"));
    }

    #[test]
    fn parse_item_id_handles_slugged_urls() {
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/325478156234"),
            Some("325478156234".into())
        );
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/vintage-typewriter/325478156234"),
            Some("325478156234".into())
        );
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/325478156234?hash=abc"),
            Some("325478156234".into())
        );
    }

    #[test]
    fn parse_extracts_from_fixture_jsonld() {
        let html = r##"
        <html><head>
        <script type="application/ld+json">
        {"@context":"https://schema.org","@type":"Product",
         "name":"Vintage Typewriter","sku":"TW-001",
         "brand":{"@type":"Brand","name":"Olivetti"},
         "image":"https://i.ebayimg.com/images/abc.jpg",
         "offers":{"@type":"Offer","price":"79.99","priceCurrency":"GBP",
          "availability":"https://schema.org/InStock",
          "itemCondition":"https://schema.org/UsedCondition",
          "seller":{"@type":"Person","name":"vintage_seller_99"}}}
        </script>
        </head></html>"##;
        let v = parse(html, "https://www.ebay.co.uk/itm/325", "325");
        assert_eq!(v["title"], "Vintage Typewriter");
        assert_eq!(v["price"], "79.99");
        assert_eq!(v["currency"], "GBP");
        assert_eq!(v["availability"], "InStock");
        assert_eq!(v["condition"], "UsedCondition");
        assert_eq!(v["seller"], "vintage_seller_99");
        assert_eq!(v["brand"], "Olivetti");
    }

    #[test]
    fn parse_handles_aggregate_offer_price_range() {
        let html = r##"
        <script type="application/ld+json">
        {"@type":"Product","name":"Used Copies",
         "offers":{"@type":"AggregateOffer","offerCount":"5",
          "lowPrice":"10.00","highPrice":"50.00","priceCurrency":"USD"}}
        </script>
        "##;
        let v = parse(html, "https://www.ebay.com/itm/1", "1");
        assert_eq!(v["low_price"], "10.00");
        assert_eq!(v["high_price"], "50.00");
        assert_eq!(v["offer_count"], "5");
        assert_eq!(v["currency"], "USD");
    }
}
553 crates/webclaw-fetch/src/extractors/ecommerce_product.rs Normal file
@@ -0,0 +1,553 @@
//! Generic ecommerce product extractor via Schema.org JSON-LD.
//!
//! Every modern ecommerce site ships a `<script type="application/ld+json">`
//! Product block for SEO / rich-result snippets. Google's own SEO docs
//! force this markup on anyone who wants to appear in shopping search.
//! We take advantage of it: one extractor that works on Shopify,
//! BigCommerce, WooCommerce, Squarespace, Magento, custom storefronts,
//! and anything else that follows Schema.org.
//!
//! **Explicit-call only** (`/v1/scrape/ecommerce_product`). Not in the
//! auto-dispatch because we can't identify "this is a product page"
//! from the URL alone. When the caller knows they have a product URL,
//! this is the reliable fallback for stores where `shopify_product`
//! doesn't apply.
//!
//! The extractor reuses `webclaw_core::structured_data::extract_json_ld`
//! so JSON-LD parsing is shared with the rest of the extraction
//! pipeline. We walk all blocks looking for `@type: Product`,
//! `ProductGroup`, or an `ItemList` whose first entry is a Product.
//!
//! ## OG fallback
//!
//! Two real-world cases JSON-LD alone can't cover:
//!
//! 1. The site has no Product JSON-LD at all (smaller Squarespace /
//!    custom storefronts, many European shops).
//! 2. The site has Product JSON-LD but the `offers` block is empty
//!    (seen on Patagonia and other catalog-style sites that split the
//!    price onto a separate widget).
//!
//! For case 1 we build a minimal payload from OG / product meta tags
//! (`og:title`, `og:image`, `og:description`, `product:price:amount`,
//! `product:price:currency`, `product:availability`, `product:brand`).
//! For case 2 we augment the JSON-LD offers list with an OG-derived
//! offer so callers get a price either way. A `data_source` field
//! (`"jsonld"` / `"jsonld+og"` / `"og_fallback"`) tells the caller
//! which branch produced the data.
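The OG fallback described above hinges on pulling `content` values out of `<meta property="...">` tags. The extractor below does this with a compiled regex; as a rough illustration of the idea, here is a std-only sketch that scans tag by tag instead (the function name mirrors the in-file helper, but this simplified version is not the actual implementation and skips its case-insensitivity and attribute-order handling):

```rust
// Simplified, std-only sketch of a `<meta property="..." content="...">`
// lookup. Assumes double-quoted attributes and `property` before `content`.
fn meta_property(html: &str, prop: &str) -> Option<String> {
    let needle = format!("property=\"{prop}\"");
    for tag in html.split("<meta").skip(1) {
        // Only look inside this tag, up to its closing `>`.
        let end = tag.find('>').unwrap_or(tag.len());
        let tag = &tag[..end];
        if tag.contains(needle.as_str()) {
            let rest = tag.split("content=\"").nth(1)?;
            return rest.split('"').next().map(str::to_string);
        }
    }
    None
}

fn main() {
    let html = r#"<head>
<meta property="og:type" content="product">
<meta property="product:price:amount" content="18.00">
</head>"#;
    assert_eq!(
        meta_property(html, "product:price:amount").as_deref(),
        Some("18.00")
    );
    assert_eq!(meta_property(html, "product:brand"), None);
    println!("ok");
}
```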
use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "ecommerce_product",
    label: "Ecommerce product (generic)",
    description: "Returns product info from any site that ships Schema.org Product JSON-LD: name, description, images, brand, SKU, price, availability, aggregate rating.",
    url_patterns: &[
        "https://{any-ecom-store}/products/{slug}",
        "https://{any-ecom-store}/product/{slug}",
        "https://{any-ecom-store}/p/{slug}",
    ],
};

pub fn matches(url: &str) -> bool {
    // Maximally permissive: explicit-call-only extractor. We trust that
    // the caller knows they're pointing at a product page. Custom ecom
    // sites use every conceivable URL shape (warbyparker.com uses
    // `/eyeglasses/{category}/{slug}/{colour}`, etc.), so path-pattern
    // matching would false-negative a lot. All we gate on is a valid
    // http(s) URL with a host.
    if !(url.starts_with("http://") || url.starts_with("https://")) {
        return false;
    }
    !host_of(url).is_empty()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let resp = client.fetch(url).await?;
    if !(200..300).contains(&resp.status) {
        return Err(FetchError::Build(format!(
            "ecommerce_product: status {} for {url}",
            resp.status
        )));
    }
    parse(&resp.html, url).ok_or_else(|| {
        FetchError::BodyDecode(format!(
            "ecommerce_product: no Schema.org Product JSON-LD and no OG product tags on {url}"
        ))
    })
}

/// Pure parser: try JSON-LD first, fall back to OG meta tags. Returns
/// `None` when neither path has enough to say "this is a product page".
pub fn parse(html: &str, url: &str) -> Option<Value> {
    // Reuse the core JSON-LD parser so we benefit from whatever
    // robustness it gains over time (handling @graph, arrays, etc.).
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    let product = find_product(&blocks);

    if let Some(p) = product {
        Some(build_jsonld_payload(&p, html, url))
    } else if has_og_product_signal(html) {
        Some(build_og_payload(html, url))
    } else {
        None
    }
}

/// Build the rich payload from a Product JSON-LD node. Augments the
/// `offers` array with an OG-derived offer when the JSON-LD offers are
/// empty so callers get a price on sites like Patagonia.
fn build_jsonld_payload(product: &Value, html: &str, url: &str) -> Value {
    let mut offers = collect_offers(product);
    let mut data_source = "jsonld";
    if offers.is_empty()
        && let Some(og_offer) = build_og_offer(html)
    {
        offers.push(og_offer);
        data_source = "jsonld+og";
    }

    json!({
        "url": url,
        "data_source": data_source,
        "name": get_text(product, "name").or_else(|| og(html, "title")),
        "description": get_text(product, "description").or_else(|| og(html, "description")),
        "brand": get_brand(product).or_else(|| meta_property(html, "product:brand")),
        "sku": get_text(product, "sku"),
        "mpn": get_text(product, "mpn"),
        "gtin": get_text(product, "gtin")
            .or_else(|| get_text(product, "gtin13"))
            .or_else(|| get_text(product, "gtin12"))
            .or_else(|| get_text(product, "gtin8")),
        "product_id": get_text(product, "productID"),
        "category": get_text(product, "category"),
        "color": get_text(product, "color"),
        "material": get_text(product, "material"),
        "images": nonempty_or_og(collect_images(product), html),
        "offers": offers,
        "aggregate_rating": get_aggregate_rating(product),
        "review_count": get_review_count(product),
        "raw_schema_type": get_text(product, "@type"),
        "raw_jsonld": product.clone(),
    })
}

/// Build a minimal payload from OG / product meta tags. Used when a
/// page has no Product JSON-LD at all.
fn build_og_payload(html: &str, url: &str) -> Value {
    let offers = build_og_offer(html).map(|o| vec![o]).unwrap_or_default();
    let image = og(html, "image");
    let images: Vec<Value> = image.map(|i| vec![Value::String(i)]).unwrap_or_default();

    json!({
        "url": url,
        "data_source": "og_fallback",
        "name": og(html, "title"),
        "description": og(html, "description"),
        "brand": meta_property(html, "product:brand"),
        "sku": None::<String>,
        "mpn": None::<String>,
        "gtin": None::<String>,
        "product_id": None::<String>,
        "category": None::<String>,
        "color": None::<String>,
        "material": None::<String>,
        "images": images,
        "offers": offers,
        "aggregate_rating": Value::Null,
        "review_count": None::<String>,
        "raw_schema_type": None::<String>,
        "raw_jsonld": Value::Null,
    })
}

fn nonempty_or_og(imgs: Vec<Value>, html: &str) -> Vec<Value> {
    if !imgs.is_empty() {
        return imgs;
    }
    og(html, "image")
        .map(|s| vec![Value::String(s)])
        .unwrap_or_default()
}

// ---------------------------------------------------------------------------
// JSON-LD walkers
// ---------------------------------------------------------------------------

/// Recursively walk the JSON-LD blocks and return the first node whose
/// `@type` is Product, ProductGroup, or IndividualProduct.
fn find_product(blocks: &[Value]) -> Option<Value> {
    for b in blocks {
        if let Some(found) = find_product_in(b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    // @graph: [ {...}, {...} ]
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    // Bare array wrapper
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let t = match v.get("@type") {
        Some(t) => t,
        None => return false,
    };
    let match_str = |s: &str| {
        matches!(
            s,
            "Product" | "ProductGroup" | "IndividualProduct" | "Vehicle" | "SomeProducts"
        )
    };
    match t {
        Value::String(s) => match_str(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    if let Some(obj) = brand.as_object()
        && let Some(n) = obj.get("name").and_then(|x| x.as_str())
    {
        return Some(n.to_string());
    }
    None
}

fn collect_images(v: &Value) -> Vec<Value> {
    match v.get("image") {
        Some(Value::String(s)) => vec![Value::String(s.clone())],
        Some(Value::Array(arr)) => arr
            .iter()
            .filter_map(|x| match x {
                Value::String(s) => Some(Value::String(s.clone())),
                Value::Object(_) => x.get("url").cloned(),
                _ => None,
            })
            .collect(),
        Some(Value::Object(o)) => o.get("url").cloned().into_iter().collect(),
        _ => Vec::new(),
    }
}

/// Normalise both bare Offer and AggregateOffer into a uniform array.
fn collect_offers(v: &Value) -> Vec<Value> {
    let offers = match v.get("offers") {
        Some(o) => o,
        None => return Vec::new(),
    };
    let collect_single = |o: &Value| -> Option<Value> {
        Some(json!({
            "price": get_text(o, "price"),
            "low_price": get_text(o, "lowPrice"),
            "high_price": get_text(o, "highPrice"),
            "currency": get_text(o, "priceCurrency"),
            "availability": get_text(o, "availability").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
            "item_condition": get_text(o, "itemCondition").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
            "valid_until": get_text(o, "priceValidUntil"),
            "url": get_text(o, "url"),
            "seller": o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from),
            "offer_count": get_text(o, "offerCount"),
        }))
    };
    match offers {
        Value::Array(arr) => arr.iter().filter_map(collect_single).collect(),
        Value::Object(_) => collect_single(offers).into_iter().collect(),
        _ => Vec::new(),
    }
}
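The offer normaliser strips the schema.org URL prefix so callers get bare enum names ("InStock", "UsedCondition") rather than full URLs. The two chained `.replace` calls can be sketched in isolation like this (the etsy extractor in this same change factors the identical logic into a `strip_schema_prefix` helper):

```rust
// Strip the schema.org prefix from availability / condition values,
// matching the inline `.replace` chains used in the offer normaliser.
fn strip_schema_prefix(s: &str) -> String {
    s.replace("http://schema.org/", "")
        .replace("https://schema.org/", "")
}

fn main() {
    assert_eq!(strip_schema_prefix("https://schema.org/InStock"), "InStock");
    assert_eq!(
        strip_schema_prefix("http://schema.org/UsedCondition"),
        "UsedCondition"
    );
    // Values without the prefix pass through unchanged.
    assert_eq!(strip_schema_prefix("InStock"), "InStock");
    println!("ok");
}
```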
fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "best_rating": get_text(r, "bestRating"),
        "worst_rating": get_text(r, "worstRating"),
        "rating_count": get_text(r, "ratingCount"),
        "review_count": get_text(r, "reviewCount"),
    }))
}

fn get_review_count(v: &Value) -> Option<String> {
    v.get("aggregateRating")
        .and_then(|r| get_text(r, "reviewCount"))
        .or_else(|| get_text(v, "reviewCount"))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

// ---------------------------------------------------------------------------
// OG / product meta-tag helpers
// ---------------------------------------------------------------------------

/// True when the HTML has enough OG / product meta tags to justify
/// building a fallback payload. A single `og:title` isn't enough on its
/// own: every blog post has that. We require either a product price
/// tag or at least an `og:type` of `product`/`og:product` to avoid
/// mis-classifying articles as products.
fn has_og_product_signal(html: &str) -> bool {
    let has_price = meta_property(html, "product:price:amount").is_some()
        || meta_property(html, "og:price:amount").is_some();
    if has_price {
        return true;
    }
    // `<meta property="og:type" content="product">` is the Schema.org OG
    // marker for product pages.
    let og_type = og(html, "type").unwrap_or_default().to_lowercase();
    matches!(og_type.as_str(), "product" | "og:product" | "product.item")
}

/// Build a single Offer-shaped Value from OG / product meta tags, or
/// `None` if there's no price info at all.
fn build_og_offer(html: &str) -> Option<Value> {
    let price = meta_property(html, "product:price:amount")
        .or_else(|| meta_property(html, "og:price:amount"));
    let currency = meta_property(html, "product:price:currency")
        .or_else(|| meta_property(html, "og:price:currency"));
    let availability = meta_property(html, "product:availability")
        .or_else(|| meta_property(html, "og:availability"));
    price.as_ref()?;
    Some(json!({
        "price": price,
        "low_price": None::<String>,
        "high_price": None::<String>,
        "currency": currency,
        "availability": availability,
        "item_condition": None::<String>,
        "valid_until": None::<String>,
        "url": None::<String>,
        "seller": None::<String>,
        "offer_count": None::<String>,
    }))
}

/// Pull the value of `<meta property="og:{prop}" content="...">`.
fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

/// Pull the value of any `<meta property="..." content="...">` tag.
/// Needed for namespaced OG variants like `product:price:amount` that
/// the simple `og:*` matcher above doesn't cover.
fn meta_property(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

#[cfg(test)]
mod tests {
    use super::*;
    use serde_json::json;

    #[test]
    fn matches_any_http_url_with_host() {
        assert!(matches("https://www.allbirds.com/products/tree-runner"));
        assert!(matches(
            "https://www.warbyparker.com/eyeglasses/women/percey/jet-black-with-polished-gold"
        ));
        assert!(matches("https://example.com/p/widget"));
        assert!(matches("http://shop.example.com/foo/bar"));
    }

    #[test]
    fn rejects_empty_or_non_http() {
        assert!(!matches(""));
        assert!(!matches("not-a-url"));
        assert!(!matches("ftp://example.com/file"));
    }

    #[test]
    fn find_product_walks_graph() {
        let block = json!({
            "@context": "https://schema.org",
            "@graph": [
                {"@type": "Organization", "name": "ACME"},
                {"@type": "Product", "name": "Widget", "sku": "ABC"}
            ]
        });
        let blocks = vec![block];
        let p = find_product(&blocks).unwrap();
        assert_eq!(p.get("name").and_then(|v| v.as_str()), Some("Widget"));
    }

    #[test]
    fn find_product_handles_array_type() {
        let block = json!({
            "@type": ["Product", "Clothing"],
            "name": "Tee"
        });
        assert!(is_product_type(&block));
    }

    #[test]
    fn get_brand_from_string_or_object() {
        assert_eq!(get_brand(&json!({"brand": "ACME"})), Some("ACME".into()));
        assert_eq!(
            get_brand(&json!({"brand": {"@type": "Brand", "name": "ACME"}})),
            Some("ACME".into())
        );
    }

    #[test]
    fn collect_offers_handles_single_and_aggregate() {
        let p = json!({
            "offers": {
                "@type": "Offer",
                "price": "19.99",
                "priceCurrency": "USD",
                "availability": "https://schema.org/InStock"
            }
        });
        let offers = collect_offers(&p);
        assert_eq!(offers.len(), 1);
        assert_eq!(
            offers[0].get("price").and_then(|v| v.as_str()),
            Some("19.99")
        );
        assert_eq!(
            offers[0].get("availability").and_then(|v| v.as_str()),
            Some("InStock")
        );
    }

    // --- OG fallback --------------------------------------------------------

    #[test]
    fn has_og_product_signal_accepts_product_type_or_price() {
        let type_only = r#"<meta property="og:type" content="product">"#;
        let price_only = r#"<meta property="product:price:amount" content="49.00">"#;
        let neither = r#"<meta property="og:title" content="My Article"><meta property="og:type" content="article">"#;
        assert!(has_og_product_signal(type_only));
        assert!(has_og_product_signal(price_only));
        assert!(!has_og_product_signal(neither));
    }

    #[test]
    fn og_fallback_builds_payload_without_jsonld() {
        let html = r##"<html><head>
<meta property="og:type" content="product">
<meta property="og:title" content="Handmade Candle">
<meta property="og:image" content="https://cdn.example.com/candle.jpg">
<meta property="og:description" content="Small-batch soy candle.">
<meta property="product:price:amount" content="18.00">
<meta property="product:price:currency" content="USD">
<meta property="product:availability" content="in stock">
<meta property="product:brand" content="Little Studio">
</head></html>"##;
        let v = parse(html, "https://example.com/p/candle").unwrap();
        assert_eq!(v["data_source"], "og_fallback");
        assert_eq!(v["name"], "Handmade Candle");
        assert_eq!(v["description"], "Small-batch soy candle.");
        assert_eq!(v["brand"], "Little Studio");
        assert_eq!(v["offers"][0]["price"], "18.00");
        assert_eq!(v["offers"][0]["currency"], "USD");
        assert_eq!(v["offers"][0]["availability"], "in stock");
        assert_eq!(v["images"][0], "https://cdn.example.com/candle.jpg");
    }

    #[test]
    fn jsonld_augments_empty_offers_with_og_price() {
        // Patagonia-shaped page: Product JSON-LD without an Offer, plus
        // product:price:* OG tags. We should merge.
        let html = r##"<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"Better Sweater","brand":"Patagonia",
"aggregateRating":{"@type":"AggregateRating","ratingValue":"4.4","reviewCount":"1142"}}
</script>
<meta property="product:price:amount" content="139.00">
<meta property="product:price:currency" content="USD">
</head></html>"##;
        let v = parse(html, "https://patagonia.com/p/x").unwrap();
        assert_eq!(v["data_source"], "jsonld+og");
        assert_eq!(v["name"], "Better Sweater");
        assert_eq!(v["offers"].as_array().unwrap().len(), 1);
        assert_eq!(v["offers"][0]["price"], "139.00");
    }

    #[test]
    fn jsonld_only_stays_pure_jsonld() {
        let html = r##"<html><head>
<script type="application/ld+json">
{"@type":"Product","name":"Widget",
"offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
</script>
</head></html>"##;
        let v = parse(html, "https://example.com/p/w").unwrap();
        assert_eq!(v["data_source"], "jsonld");
        assert_eq!(v["offers"][0]["price"], "9.99");
    }

    #[test]
    fn parse_returns_none_on_no_product_signals() {
        let html = r#"<html><head>
<meta property="og:title" content="My Blog Post">
<meta property="og:type" content="article">
</head></html>"#;
        assert!(parse(html, "https://blog.example.com/post").is_none());
    }
}
572 crates/webclaw-fetch/src/extractors/etsy_listing.rs Normal file
@@ -0,0 +1,572 @@
//! Etsy listing extractor.
//!
//! Etsy product pages at `etsy.com/listing/{id}` (and a sluggy variant
//! `etsy.com/listing/{id}/{slug}`) ship a Schema.org `Product` JSON-LD
//! block with title, price, currency, availability, shop seller, and
//! an `AggregateRating` for the listing.
//!
//! Etsy puts Cloudflare + a custom WAF in front of product pages with
//! high variance: the Firefox profile gets clean HTML most of the time
//! but some listings return a CF interstitial. We route through
//! `cloud::smart_fetch_html` so both paths resolve to the same parser,
//! same as `ebay_listing`.
//!
//! ## URL slug as last-resort title
//!
//! Even with cloud antibot bypass, Etsy frequently serves a generic
//! page with minimal metadata (`og:title = "etsy.com"`, no JSON-LD,
//! empty markdown). In that case we humanise the slug from the URL
//! (`/listing/{id}/personalized-stainless-steel-tumbler` becomes
//! "Personalized Stainless Steel Tumbler") so callers always get a
//! meaningful title. Degrades gracefully when the URL has no slug.

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "etsy_listing",
    label: "Etsy listing",
    description: "Returns listing title, price, currency, availability, shop, rating, and image. Heavy listings may need WEBCLAW_API_KEY for antibot.",
    url_patterns: &[
        "https://www.etsy.com/listing/{id}",
        "https://www.etsy.com/listing/{id}/{slug}",
        "https://www.etsy.com/{locale}/listing/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !is_etsy_host(host) {
        return false;
    }
    parse_listing_id(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let listing_id = parse_listing_id(url)
        .ok_or_else(|| FetchError::Build(format!("etsy_listing: no listing id in '{url}'")))?;

    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse(&fetched.html, url, &listing_id);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
    let jsonld = find_product_jsonld(html);
    let slug_title = humanise_slug(parse_slug(url).as_deref());

    let title = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "name"))
        .or_else(|| og(html, "title").filter(|t| !is_generic_title(t)))
        .or(slug_title);
    let description = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description").filter(|d| !is_generic_description(d)));
    let image = jsonld
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| og(html, "image"));
    let brand = jsonld.as_ref().and_then(get_brand);

    // Etsy listings often ship either a single Offer or an
    // AggregateOffer when the listing has variants with different prices.
    let offer = jsonld.as_ref().and_then(first_offer);
    let (low_price, high_price, single_price) = match offer.as_ref() {
        Some(o) => (
            get_text(o, "lowPrice"),
            get_text(o, "highPrice"),
            get_text(o, "price"),
        ),
        None => (None, None, None),
    };
    let currency = offer.as_ref().and_then(|o| get_text(o, "priceCurrency"));
    let availability = offer
        .as_ref()
        .and_then(|o| get_text(o, "availability").map(strip_schema_prefix));
    let item_condition = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "itemCondition"))
        .map(strip_schema_prefix);

    // Shop name: offers[0].seller.name on newer listings, top-level
    // `brand` on older listings (Etsy changed the schema around 2022).
    // Fall back through both so either shape resolves.
    let shop = offer
        .as_ref()
        .and_then(|o| {
            o.get("seller")
                .and_then(|s| s.get("name"))
                .and_then(|n| n.as_str())
                .map(String::from)
        })
        .or_else(|| brand.clone());
    let shop_url = shop_url_from_html(html);

    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);

    json!({
        "url": url,
        "listing_id": listing_id,
        "title": title,
        "description": description,
        "image": image,
        "brand": brand,
        "price": single_price,
        "low_price": low_price,
        "high_price": high_price,
        "currency": currency,
        "availability": availability,
        "item_condition": item_condition,
        "shop": shop,
        "shop_url": shop_url,
        "aggregate_rating": aggregate_rating,
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn is_etsy_host(host: &str) -> bool {
    host == "etsy.com" || host == "www.etsy.com" || host.ends_with(".etsy.com")
}

/// Extract the numeric listing id. Etsy ids are 9-11 digits today but
/// we accept any all-digit segment right after `/listing/`.
///
/// Handles `/listing/{id}`, `/listing/{id}/{slug}`, and the localised
/// `/{locale}/listing/{id}` shape (e.g. `/fr/listing/...`).
fn parse_listing_id(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"/listing/(\d{6,})(?:[/?#]|$)").unwrap());
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// Extract the URL slug after the listing id, e.g.
/// `personalized-stainless-steel-tumbler`. Returns `None` when the URL
/// is the bare `/listing/{id}` shape.
fn parse_slug(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"/listing/\d{6,}/([^/?#]+)").unwrap());
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// Turn a URL slug into a human-ish title:
/// `personalized-stainless-steel-tumbler` → `Personalized Stainless
/// Steel Tumbler`. Capitalises each dash-separated token; underscores
/// are treated as separators too. Returns `None` on empty input.
fn humanise_slug(slug: Option<&str>) -> Option<String> {
    let raw = slug?.trim();
    if raw.is_empty() {
        return None;
    }
    let words: Vec<String> = raw
        .split(['-', '_'])
        .filter(|w| !w.is_empty())
        .map(capitalise_word)
        .collect();
    if words.is_empty() {
        None
    } else {
        Some(words.join(" "))
    }
}

fn capitalise_word(w: &str) -> String {
    let mut chars = w.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
        None => String::new(),
    }
}
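The slug-to-title fallback can be exercised in isolation. This self-contained sketch mirrors the `humanise_slug` / `capitalise_word` pair defined in this file (inlined here so it runs without the crate) and checks the example from the module doc:

```rust
// Std-only mirror of the slug humaniser: split on '-' and '_',
// drop empty tokens, capitalise the first char of each word.
fn humanise_slug(slug: &str) -> Option<String> {
    let words: Vec<String> = slug
        .trim()
        .split(['-', '_'])
        .filter(|w| !w.is_empty())
        .map(|w| {
            let mut chars = w.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect();
    if words.is_empty() {
        None
    } else {
        Some(words.join(" "))
    }
}

fn main() {
    // The module-doc example: URL slug -> last-resort title.
    assert_eq!(
        humanise_slug("personalized-stainless-steel-tumbler").as_deref(),
        Some("Personalized Stainless Steel Tumbler")
    );
    // Degrades gracefully: no usable tokens means no title.
    assert_eq!(humanise_slug("---"), None);
    println!("ok");
}
```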
/// True when the OG title is Etsy's fallback-page title rather than a
/// listing-specific title. Expired / region-blocked / antibot-filtered
/// pages return Etsy's sitewide tagline:
/// `"Etsy - Your place to buy and sell all things handmade..."`, or
/// simply `"etsy.com"`. A real listing title always starts with the
/// item name, never with "Etsy - " or the domain.
fn is_generic_title(t: &str) -> bool {
    let normalised = t.trim().to_lowercase();
    if matches!(
        normalised.as_str(),
        "etsy.com" | "etsy" | "www.etsy.com" | ""
    ) {
        return true;
    }
    // Etsy's sitewide marketing tagline, served on 404 / blocked pages.
    if normalised.starts_with("etsy - ")
        || normalised.starts_with("etsy.com - ")
        || normalised.starts_with("etsy uk - ")
    {
        return true;
    }
    // Etsy's "item unavailable" placeholder, served on delisted
    // products. Keep the slug fallback so callers still see what the
    // URL was about.
    normalised.starts_with("this item is unavailable")
        || normalised.starts_with("sorry, this item is")
        || normalised == "item not available - etsy"
}

/// True when the OG description is an Etsy error-page placeholder or
/// sitewide marketing blurb rather than a real listing description.
fn is_generic_description(d: &str) -> bool {
    let normalised = d.trim().to_lowercase();
    if normalised.is_empty() {
        return true;
    }
    normalised.starts_with("sorry, the page you were looking for")
        || normalised.starts_with("page not found")
        || normalised.starts_with("find the perfect handmade gift")
}

// ---------------------------------------------------------------------------
// JSON-LD walkers (same shape as ebay_listing; kept separate so the two
// extractors can diverge without cross-impact)
// ---------------------------------------------------------------------------

fn find_product_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_product_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
    match t {
        Value::String(s) => is_prod(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    brand
        .as_object()
        .and_then(|o| o.get("name"))
        .and_then(|n| n.as_str())
        .map(String::from)
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
|
||||
Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
|
||||
_ => None,
|
||||
}),
|
||||
Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
|
||||
fn first_offer(v: &Value) -> Option<Value> {
|
||||
let offers = v.get("offers")?;
|
||||
match offers {
|
||||
Value::Array(arr) => arr.first().cloned(),
|
||||
Value::Object(_) => Some(offers.clone()),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
|
||||
fn get_aggregate_rating(v: &Value) -> Option<Value> {
|
||||
let r = v.get("aggregateRating")?;
|
||||
Some(json!({
|
||||
"rating_value": get_text(r, "ratingValue"),
|
||||
"review_count": get_text(r, "reviewCount"),
|
||||
"best_rating": get_text(r, "bestRating"),
|
||||
}))
|
||||
}
|
||||
|
||||
fn strip_schema_prefix(s: String) -> String {
|
||||
s.replace("http://schema.org/", "")
|
||||
.replace("https://schema.org/", "")
|
||||
}
|
||||
|
||||
fn og(html: &str, prop: &str) -> Option<String> {
|
||||
static RE: OnceLock<Regex> = OnceLock::new();
|
||||
let re = RE.get_or_init(|| {
|
||||
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
|
||||
});
|
||||
for c in re.captures_iter(html) {
|
||||
if c.get(1).is_some_and(|m| m.as_str() == prop) {
|
||||
return c.get(2).map(|m| m.as_str().to_string());
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
/// Etsy links the owning shop with a canonical anchor like
|
||||
/// `<a href="/shop/ShopName" ...>`. Grab the first one after the
|
||||
/// breadcrumb boundary.
|
||||
fn shop_url_from_html(html: &str) -> Option<String> {
|
||||
static RE: OnceLock<Regex> = OnceLock::new();
|
||||
let re = RE.get_or_init(|| Regex::new(r#"href="(/shop/[A-Za-z0-9_-]+)""#).unwrap());
|
||||
re.captures(html)
|
||||
.and_then(|c| c.get(1))
|
||||
.map(|m| format!("https://www.etsy.com{}", m.as_str()))
|
||||
}
|
||||
|
||||
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
|
||||
FetchError::Build(e.to_string())
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn matches_etsy_listing_urls() {
|
||||
assert!(matches("https://www.etsy.com/listing/123456789"));
|
||||
assert!(matches(
|
||||
"https://www.etsy.com/listing/123456789/vintage-typewriter"
|
||||
));
|
||||
assert!(matches(
|
||||
"https://www.etsy.com/fr/listing/123456789/vintage-typewriter"
|
||||
));
|
||||
assert!(!matches("https://www.etsy.com/"));
|
||||
assert!(!matches("https://www.etsy.com/shop/SomeShop"));
|
||||
assert!(!matches("https://example.com/listing/123456789"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_listing_id_handles_slug_and_locale() {
|
||||
assert_eq!(
|
||||
parse_listing_id("https://www.etsy.com/listing/123456789"),
|
||||
Some("123456789".into())
|
||||
);
|
||||
assert_eq!(
|
||||
parse_listing_id("https://www.etsy.com/listing/123456789/slug-here"),
|
||||
Some("123456789".into())
|
||||
);
|
||||
assert_eq!(
|
||||
parse_listing_id("https://www.etsy.com/fr/listing/123456789/slug"),
|
||||
Some("123456789".into())
|
||||
);
|
||||
assert_eq!(
|
||||
parse_listing_id("https://www.etsy.com/listing/123456789?ref=foo"),
|
||||
Some("123456789".into())
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_extracts_from_fixture_jsonld() {
|
||||
let html = r##"
|
||||
<html><head>
|
||||
<script type="application/ld+json">
|
||||
{"@context":"https://schema.org","@type":"Product",
|
||||
"name":"Handmade Ceramic Mug","sku":"MUG-001",
|
||||
"brand":{"@type":"Brand","name":"Studio Clay"},
|
||||
"image":["https://i.etsystatic.com/abc.jpg","https://i.etsystatic.com/xyz.jpg"],
|
||||
"itemCondition":"https://schema.org/NewCondition",
|
||||
"offers":{"@type":"Offer","price":"24.00","priceCurrency":"USD",
|
||||
"availability":"https://schema.org/InStock",
|
||||
"seller":{"@type":"Organization","name":"StudioClay"}},
|
||||
"aggregateRating":{"@type":"AggregateRating","ratingValue":"4.9","reviewCount":"127","bestRating":"5"}}
|
||||
</script>
|
||||
<a href="/shop/StudioClay" class="wt-text-link">StudioClay</a>
|
||||
</head></html>"##;
|
||||
let v = parse(html, "https://www.etsy.com/listing/1", "1");
|
||||
assert_eq!(v["title"], "Handmade Ceramic Mug");
|
||||
assert_eq!(v["price"], "24.00");
|
||||
assert_eq!(v["currency"], "USD");
|
||||
assert_eq!(v["availability"], "InStock");
|
||||
assert_eq!(v["item_condition"], "NewCondition");
|
||||
assert_eq!(v["shop"], "StudioClay");
|
||||
assert_eq!(v["shop_url"], "https://www.etsy.com/shop/StudioClay");
|
||||
assert_eq!(v["brand"], "Studio Clay");
|
||||
assert_eq!(v["aggregate_rating"]["rating_value"], "4.9");
|
||||
assert_eq!(v["aggregate_rating"]["review_count"], "127");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_handles_aggregate_offer_price_range() {
|
||||
let html = r##"
|
||||
<script type="application/ld+json">
|
||||
{"@type":"Product","name":"Mug Set",
|
||||
"offers":{"@type":"AggregateOffer",
|
||||
"lowPrice":"18.00","highPrice":"36.00","priceCurrency":"USD"}}
|
||||
</script>
|
||||
"##;
|
||||
let v = parse(html, "https://www.etsy.com/listing/2", "2");
|
||||
assert_eq!(v["low_price"], "18.00");
|
||||
assert_eq!(v["high_price"], "36.00");
|
||||
assert_eq!(v["currency"], "USD");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_falls_back_to_og_when_no_jsonld() {
|
||||
let html = r#"
|
||||
<html><head>
|
||||
<meta property="og:title" content="Minimal Fallback Item">
|
||||
<meta property="og:description" content="OG-only extraction test.">
|
||||
<meta property="og:image" content="https://i.etsystatic.com/fallback.jpg">
|
||||
</head></html>"#;
|
||||
let v = parse(html, "https://www.etsy.com/listing/3", "3");
|
||||
assert_eq!(v["title"], "Minimal Fallback Item");
|
||||
assert_eq!(v["description"], "OG-only extraction test.");
|
||||
assert_eq!(v["image"], "https://i.etsystatic.com/fallback.jpg");
|
||||
// No price fields when we only have OG.
|
||||
assert!(v["price"].is_null());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_slug_from_url() {
|
||||
assert_eq!(
|
||||
parse_slug("https://www.etsy.com/listing/123456789/vintage-typewriter"),
|
||||
Some("vintage-typewriter".into())
|
||||
);
|
||||
assert_eq!(
|
||||
parse_slug("https://www.etsy.com/listing/123456789/slug?ref=shop"),
|
||||
Some("slug".into())
|
||||
);
|
||||
assert_eq!(parse_slug("https://www.etsy.com/listing/123456789"), None);
|
||||
assert_eq!(
|
||||
parse_slug("https://www.etsy.com/fr/listing/123456789/slug"),
|
||||
Some("slug".into())
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn humanise_slug_capitalises_each_word() {
|
||||
assert_eq!(
|
||||
humanise_slug(Some("personalized-stainless-steel-tumbler")).as_deref(),
|
||||
Some("Personalized Stainless Steel Tumbler")
|
||||
);
|
||||
assert_eq!(
|
||||
humanise_slug(Some("hand_crafted_mug")).as_deref(),
|
||||
Some("Hand Crafted Mug")
|
||||
);
|
||||
assert_eq!(humanise_slug(Some("")), None);
|
||||
assert_eq!(humanise_slug(None), None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn is_generic_title_catches_common_shapes() {
|
||||
assert!(is_generic_title("etsy.com"));
|
||||
assert!(is_generic_title("Etsy"));
|
||||
assert!(is_generic_title(" etsy.com "));
|
||||
assert!(is_generic_title(
|
||||
"Etsy - Your place to buy and sell all things handmade, vintage, and supplies"
|
||||
));
|
||||
assert!(is_generic_title("Etsy UK - Vintage & Handmade"));
|
||||
assert!(!is_generic_title("Vintage Typewriter"));
|
||||
assert!(!is_generic_title("Handmade Etsy-style Mug"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn is_generic_description_catches_404_shapes() {
|
||||
assert!(is_generic_description(""));
|
||||
assert!(is_generic_description(
|
||||
"Sorry, the page you were looking for was not found."
|
||||
));
|
||||
assert!(is_generic_description("Page not found"));
|
||||
assert!(!is_generic_description(
|
||||
"Hand-thrown ceramic mug, dishwasher safe."
|
||||
));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_uses_slug_when_og_is_generic() {
|
||||
// Cloud-blocked Etsy listing: og:title is a site-wide generic
|
||||
// placeholder, no JSON-LD, no description. Slug should win.
|
||||
let html = r#"<html><head>
|
||||
<meta property="og:title" content="etsy.com">
|
||||
</head></html>"#;
|
||||
let v = parse(
|
||||
html,
|
||||
"https://www.etsy.com/listing/1079113183/personalized-stainless-steel-tumbler",
|
||||
"1079113183",
|
||||
);
|
||||
assert_eq!(v["title"], "Personalized Stainless Steel Tumbler");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_prefers_real_og_over_slug() {
|
||||
let html = r#"<html><head>
|
||||
<meta property="og:title" content="Real Listing Title">
|
||||
</head></html>"#;
|
||||
let v = parse(
|
||||
html,
|
||||
"https://www.etsy.com/listing/1079113183/the-url-slug",
|
||||
"1079113183",
|
||||
);
|
||||
assert_eq!(v["title"], "Real Listing Title");
|
||||
}
|
||||
}
|
||||
172 crates/webclaw-fetch/src/extractors/github_issue.rs (new file)
@@ -0,0 +1,172 @@
//! GitHub issue structured extractor.
//!
//! Mirror of `github_pr` but on `/issues/{number}`. Uses
//! `api.github.com/repos/{owner}/{repo}/issues/{number}`. Returns the
//! issue body + comment count + labels + milestone + author /
//! assignees. Full per-comment bodies would be another call; kept for
//! a follow-up.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "github_issue",
    label: "GitHub issue",
    description: "Returns issue metadata: title, body, state, author, labels, assignees, milestone, comment count.",
    url_patterns: &["https://github.com/{owner}/{repo}/issues/{number}"],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    parse_issue(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, repo, number) = parse_issue(url).ok_or_else(|| {
        FetchError::Build(format!("github_issue: cannot parse issue URL '{url}'"))
    })?;

    let api_url = format!("https://api.github.com/repos/{owner}/{repo}/issues/{number}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "github_issue: issue '{owner}/{repo}#{number}' not found"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(
            "github_issue: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
        ));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "github api returned status {}",
            resp.status
        )));
    }

    let issue: Issue = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("github issue parse: {e}")))?;

    // The same endpoint returns PRs too; reject if we got one so the caller
    // uses /v1/scrape/github_pr instead of getting a half-shaped payload.
    if issue.pull_request.is_some() {
        return Err(FetchError::Build(format!(
            "github_issue: '{owner}/{repo}#{number}' is a pull request, use /v1/scrape/github_pr"
        )));
    }

    Ok(json!({
        "url": url,
        "owner": owner,
        "repo": repo,
        "number": issue.number,
        "title": issue.title,
        "body": issue.body,
        "state": issue.state,
        "state_reason": issue.state_reason,
        "author": issue.user.as_ref().and_then(|u| u.login.clone()),
        "labels": issue.labels.iter().filter_map(|l| l.name.clone()).collect::<Vec<_>>(),
        "assignees": issue.assignees.iter().filter_map(|u| u.login.clone()).collect::<Vec<_>>(),
        "milestone": issue.milestone.as_ref().and_then(|m| m.title.clone()),
        "comments": issue.comments,
        "locked": issue.locked,
        "created_at": issue.created_at,
        "updated_at": issue.updated_at,
        "closed_at": issue.closed_at,
        "html_url": issue.html_url,
    }))
}

fn parse_issue(url: &str) -> Option<(String, String, u64)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    if segs.len() < 4 || segs[2] != "issues" {
        return None;
    }
    let number: u64 = segs[3].parse().ok()?;
    Some((segs[0].to_string(), segs[1].to_string(), number))
}

// ---------------------------------------------------------------------------
// GitHub issue API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Issue {
    number: Option<i64>,
    title: Option<String>,
    body: Option<String>,
    state: Option<String>,
    state_reason: Option<String>,
    locked: Option<bool>,
    comments: Option<i64>,
    created_at: Option<String>,
    updated_at: Option<String>,
    closed_at: Option<String>,
    html_url: Option<String>,
    user: Option<UserRef>,
    #[serde(default)]
    labels: Vec<LabelRef>,
    #[serde(default)]
    assignees: Vec<UserRef>,
    milestone: Option<Milestone>,
    /// Present when this "issue" is actually a pull request. The REST
    /// API overloads the issues endpoint for PRs.
    pull_request: Option<serde_json::Value>,
}

#[derive(Deserialize)]
struct UserRef {
    login: Option<String>,
}

#[derive(Deserialize)]
struct LabelRef {
    name: Option<String>,
}

#[derive(Deserialize)]
struct Milestone {
    title: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_issue_urls() {
        assert!(matches("https://github.com/rust-lang/rust/issues/100"));
        assert!(matches("https://github.com/rust-lang/rust/issues/100/"));
        assert!(!matches("https://github.com/rust-lang/rust"));
        assert!(!matches("https://github.com/rust-lang/rust/pull/100"));
        assert!(!matches("https://github.com/rust-lang/rust/issues"));
    }

    #[test]
    fn parse_issue_extracts_owner_repo_number() {
        assert_eq!(
            parse_issue("https://github.com/rust-lang/rust/issues/100"),
            Some(("rust-lang".into(), "rust".into(), 100))
        );
        assert_eq!(
            parse_issue("https://github.com/rust-lang/rust/issues/100/?foo=bar"),
            Some(("rust-lang".into(), "rust".into(), 100))
        );
    }
}
189 crates/webclaw-fetch/src/extractors/github_pr.rs (new file)
@@ -0,0 +1,189 @@
//! GitHub pull request structured extractor.
//!
//! Uses `api.github.com/repos/{owner}/{repo}/pulls/{number}`. Returns
//! the PR metadata + a counted summary of comments and review activity.
//! Full diff and per-comment bodies require additional calls — left for
//! a follow-up enhancement so the v1 stays one network round-trip.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "github_pr",
    label: "GitHub pull request",
    description: "Returns PR metadata: title, body, state, author, labels, additions/deletions, file count.",
    url_patterns: &["https://github.com/{owner}/{repo}/pull/{number}"],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    parse_pr(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, repo, number) = parse_pr(url).ok_or_else(|| {
        FetchError::Build(format!("github_pr: cannot parse pull-request URL '{url}'"))
    })?;

    let api_url = format!("https://api.github.com/repos/{owner}/{repo}/pulls/{number}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "github_pr: pull request '{owner}/{repo}#{number}' not found"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(
            "github_pr: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
        ));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "github api returned status {}",
            resp.status
        )));
    }

    let p: PullRequest = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("github pr parse: {e}")))?;

    Ok(json!({
        "url": url,
        "owner": owner,
        "repo": repo,
        "number": p.number,
        "title": p.title,
        "body": p.body,
        "state": p.state,
        "draft": p.draft,
        "merged": p.merged,
        "merged_at": p.merged_at,
        "merge_commit_sha": p.merge_commit_sha,
        "author": p.user.as_ref().and_then(|u| u.login.clone()),
        "labels": p.labels.iter().filter_map(|l| l.name.clone()).collect::<Vec<_>>(),
        "milestone": p.milestone.as_ref().and_then(|m| m.title.clone()),
        "head_ref": p.head.as_ref().and_then(|r| r.ref_name.clone()),
        "base_ref": p.base.as_ref().and_then(|r| r.ref_name.clone()),
        "head_sha": p.head.as_ref().and_then(|r| r.sha.clone()),
        "additions": p.additions,
        "deletions": p.deletions,
        "changed_files": p.changed_files,
        "commits": p.commits,
        "comments": p.comments,
        "review_comments": p.review_comments,
        "created_at": p.created_at,
        "updated_at": p.updated_at,
        "closed_at": p.closed_at,
        "html_url": p.html_url,
    }))
}

fn parse_pr(url: &str) -> Option<(String, String, u64)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // /{owner}/{repo}/pull/{number} (or /pulls/{number} variant)
    if segs.len() < 4 {
        return None;
    }
    if segs[2] != "pull" && segs[2] != "pulls" {
        return None;
    }
    let number: u64 = segs[3].parse().ok()?;
    Some((segs[0].to_string(), segs[1].to_string(), number))
}

// ---------------------------------------------------------------------------
// GitHub PR API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct PullRequest {
    number: Option<i64>,
    title: Option<String>,
    body: Option<String>,
    state: Option<String>,
    draft: Option<bool>,
    merged: Option<bool>,
    merged_at: Option<String>,
    merge_commit_sha: Option<String>,
    user: Option<UserRef>,
    #[serde(default)]
    labels: Vec<LabelRef>,
    milestone: Option<Milestone>,
    head: Option<GitRef>,
    base: Option<GitRef>,
    additions: Option<i64>,
    deletions: Option<i64>,
    changed_files: Option<i64>,
    commits: Option<i64>,
    comments: Option<i64>,
    review_comments: Option<i64>,
    created_at: Option<String>,
    updated_at: Option<String>,
    closed_at: Option<String>,
    html_url: Option<String>,
}

#[derive(Deserialize)]
struct UserRef {
    login: Option<String>,
}

#[derive(Deserialize)]
struct LabelRef {
    name: Option<String>,
}

#[derive(Deserialize)]
struct Milestone {
    title: Option<String>,
}

#[derive(Deserialize)]
struct GitRef {
    #[serde(rename = "ref")]
    ref_name: Option<String>,
    sha: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_pr_urls() {
        assert!(matches("https://github.com/rust-lang/rust/pull/12345"));
        assert!(matches(
            "https://github.com/rust-lang/rust/pull/12345/files"
        ));
        assert!(!matches("https://github.com/rust-lang/rust"));
        assert!(!matches("https://github.com/rust-lang/rust/issues/100"));
        assert!(!matches("https://github.com/rust-lang"));
    }

    #[test]
    fn parse_pr_extracts_owner_repo_number() {
        assert_eq!(
            parse_pr("https://github.com/rust-lang/rust/pull/12345"),
            Some(("rust-lang".into(), "rust".into(), 12345))
        );
        assert_eq!(
            parse_pr("https://github.com/rust-lang/rust/pull/12345/files"),
            Some(("rust-lang".into(), "rust".into(), 12345))
        );
    }
}
179 crates/webclaw-fetch/src/extractors/github_release.rs (new file)
@@ -0,0 +1,179 @@
//! GitHub release structured extractor.
//!
//! `api.github.com/repos/{owner}/{repo}/releases/tags/{tag}`. Returns
//! the release notes body, asset list with download counts, and
//! prerelease flag.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "github_release",
    label: "GitHub release",
    description: "Returns release metadata: tag, name, body (release notes), assets with download counts.",
    url_patterns: &["https://github.com/{owner}/{repo}/releases/tag/{tag}"],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    parse_release(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, repo, tag) = parse_release(url).ok_or_else(|| {
        FetchError::Build(format!("github_release: cannot parse release URL '{url}'"))
    })?;

    let api_url = format!("https://api.github.com/repos/{owner}/{repo}/releases/tags/{tag}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "github_release: release '{owner}/{repo}@{tag}' not found"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(
            "github_release: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour."
                .into(),
        ));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "github api returned status {}",
            resp.status
        )));
    }

    let r: Release = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("github release parse: {e}")))?;

    let assets: Vec<Value> = r
        .assets
        .iter()
        .map(|a| {
            json!({
                "name": a.name,
                "size": a.size,
                "download_count": a.download_count,
                "browser_download_url": a.browser_download_url,
                "content_type": a.content_type,
                "created_at": a.created_at,
                "updated_at": a.updated_at,
            })
        })
        .collect();

    Ok(json!({
        "url": url,
        "owner": owner,
        "repo": repo,
        "tag_name": r.tag_name,
        "name": r.name,
        "body": r.body,
        "draft": r.draft,
        "prerelease": r.prerelease,
        "author": r.author.as_ref().and_then(|u| u.login.clone()),
        "created_at": r.created_at,
        "published_at": r.published_at,
        "asset_count": assets.len(),
        "total_downloads": r.assets.iter().map(|a| a.download_count.unwrap_or(0)).sum::<i64>(),
        "assets": assets,
        "html_url": r.html_url,
    }))
}

fn parse_release(url: &str) -> Option<(String, String, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // /{owner}/{repo}/releases/tag/{tag}
    if segs.len() < 5 {
        return None;
    }
    if segs[2] != "releases" || segs[3] != "tag" {
        return None;
    }
    Some((
        segs[0].to_string(),
        segs[1].to_string(),
        segs[4].to_string(),
    ))
}

// ---------------------------------------------------------------------------
// GitHub Release API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Release {
    tag_name: Option<String>,
    name: Option<String>,
    body: Option<String>,
    draft: Option<bool>,
    prerelease: Option<bool>,
    author: Option<UserRef>,
    created_at: Option<String>,
    published_at: Option<String>,
    html_url: Option<String>,
    #[serde(default)]
    assets: Vec<Asset>,
}

#[derive(Deserialize)]
struct UserRef {
    login: Option<String>,
}

#[derive(Deserialize)]
struct Asset {
    name: Option<String>,
    size: Option<i64>,
    download_count: Option<i64>,
    browser_download_url: Option<String>,
    content_type: Option<String>,
    created_at: Option<String>,
    updated_at: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_release_urls() {
        assert!(matches(
            "https://github.com/rust-lang/rust/releases/tag/1.85.0"
        ));
        assert!(matches(
            "https://github.com/0xMassi/webclaw/releases/tag/v0.4.0"
        ));
        assert!(!matches("https://github.com/rust-lang/rust"));
        assert!(!matches("https://github.com/rust-lang/rust/releases"));
        assert!(!matches("https://github.com/rust-lang/rust/pull/100"));
    }

    #[test]
    fn parse_release_extracts_owner_repo_tag() {
        assert_eq!(
            parse_release("https://github.com/0xMassi/webclaw/releases/tag/v0.4.0"),
            Some(("0xMassi".into(), "webclaw".into(), "v0.4.0".into()))
        );
        assert_eq!(
            parse_release("https://github.com/rust-lang/rust/releases/tag/1.85.0/?foo=bar"),
            Some(("rust-lang".into(), "rust".into(), "1.85.0".into()))
        );
    }
}
212 crates/webclaw-fetch/src/extractors/github_repo.rs (new file)
@@ -0,0 +1,212 @@
//! GitHub repository structured extractor.
//!
//! Uses GitHub's public REST API at `api.github.com/repos/{owner}/{repo}`.
//! Unauthenticated requests get 60/hour per IP, which is fine for users
//! self-hosting and for low-volume cloud usage. Production cloud should
//! set a `GITHUB_TOKEN` to lift to 5,000/hour, but the extractor doesn't
//! depend on it being set — it works open out of the box.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "github_repo",
    label: "GitHub repository",
    description: "Returns repo metadata: stars, forks, topics, license, default branch, recent activity.",
    url_patterns: &["https://github.com/{owner}/{repo}"],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    // Path must be exactly /{owner}/{repo} (or with trailing slash). Reject
    // sub-pages (issues, pulls, blob, etc.) so we don't claim URLs the
    // future github_issue / github_pr extractors will handle.
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    segs.len() == 2 && !RESERVED_OWNERS.contains(&segs[0])
}

/// GitHub uses some top-level paths for non-repo pages.
const RESERVED_OWNERS: &[&str] = &[
    "settings",
    "marketplace",
    "explore",
    "topics",
    "trending",
    "collections",
    "events",
    "sponsors",
    "issues",
    "pulls",
    "notifications",
    "new",
    "organizations",
    "login",
    "join",
    "search",
    "about",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, repo) = parse_owner_repo(url).ok_or_else(|| {
        FetchError::Build(format!("github_repo: cannot parse owner/repo from '{url}'"))
    })?;

    let api_url = format!("https://api.github.com/repos/{owner}/{repo}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "github_repo: repo '{owner}/{repo}' not found"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(
            "github_repo: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
        ));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "github api returned status {}",
            resp.status
        )));
    }

    let r: Repo = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("github api parse: {e}")))?;

    Ok(json!({
        "url": url,
        "owner": r.owner.as_ref().map(|o| &o.login),
        "name": r.name,
        "full_name": r.full_name,
        "description": r.description,
        "homepage": r.homepage,
        "language": r.language,
        "topics": r.topics,
        "license": r.license.as_ref().and_then(|l| l.spdx_id.clone()),
        "license_name": r.license.as_ref().map(|l| l.name.clone()),
        "default_branch": r.default_branch,
        "stars": r.stargazers_count,
        "forks": r.forks_count,
        "watchers": r.subscribers_count,
        "open_issues": r.open_issues_count,
        "size_kb": r.size,
        "archived": r.archived,
        "fork": r.fork,
        "is_template": r.is_template,
        "has_issues": r.has_issues,
        "has_wiki": r.has_wiki,
        "has_pages": r.has_pages,
"has_discussions": r.has_discussions,
|
||||
"created_at": r.created_at,
|
||||
"updated_at": r.updated_at,
|
||||
"pushed_at": r.pushed_at,
|
||||
"html_url": r.html_url,
|
||||
}))
|
||||
}
|
||||
|
||||
fn parse_owner_repo(url: &str) -> Option<(String, String)> {
|
||||
let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
|
||||
let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
|
||||
let mut segs = stripped.split('/').filter(|s| !s.is_empty());
|
||||
let owner = segs.next()?.to_string();
|
||||
let repo = segs.next()?.to_string();
|
||||
Some((owner, repo))
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// GitHub API types — only the fields we surface
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct Repo {
|
||||
name: Option<String>,
|
||||
full_name: Option<String>,
|
||||
description: Option<String>,
|
||||
homepage: Option<String>,
|
||||
language: Option<String>,
|
||||
#[serde(default)]
|
||||
topics: Vec<String>,
|
||||
license: Option<License>,
|
||||
default_branch: Option<String>,
|
||||
stargazers_count: Option<i64>,
|
||||
forks_count: Option<i64>,
|
||||
subscribers_count: Option<i64>,
|
||||
open_issues_count: Option<i64>,
|
||||
size: Option<i64>,
|
||||
archived: Option<bool>,
|
||||
fork: Option<bool>,
|
||||
is_template: Option<bool>,
|
||||
has_issues: Option<bool>,
|
||||
has_wiki: Option<bool>,
|
||||
has_pages: Option<bool>,
|
||||
has_discussions: Option<bool>,
|
||||
created_at: Option<String>,
|
||||
updated_at: Option<String>,
|
||||
pushed_at: Option<String>,
|
||||
html_url: Option<String>,
|
||||
owner: Option<Owner>,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct Owner {
|
||||
login: String,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct License {
|
||||
name: String,
|
||||
spdx_id: Option<String>,
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn matches_repo_root_only() {
|
||||
assert!(matches("https://github.com/rust-lang/rust"));
|
||||
assert!(matches("https://github.com/rust-lang/rust/"));
|
||||
assert!(!matches("https://github.com/rust-lang/rust/issues"));
|
||||
assert!(!matches("https://github.com/rust-lang/rust/pulls/123"));
|
||||
assert!(!matches("https://github.com/rust-lang"));
|
||||
assert!(!matches("https://github.com/marketplace"));
|
||||
assert!(!matches("https://github.com/topics/rust"));
|
||||
assert!(!matches("https://example.com/foo/bar"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_owner_repo_handles_trailing_slash_and_query() {
|
||||
assert_eq!(
|
||||
parse_owner_repo("https://github.com/rust-lang/rust"),
|
||||
Some(("rust-lang".into(), "rust".into()))
|
||||
);
|
||||
assert_eq!(
|
||||
parse_owner_repo("https://github.com/rust-lang/rust/?tab=foo"),
|
||||
Some(("rust-lang".into(), "rust".into()))
|
||||
);
|
||||
}
|
||||
}
|
||||
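The repo-root test in `matches` boils down to: strip the scheme, split host from path, drop query/fragment and trailing slash, then require exactly two path segments whose first is not a reserved top-level name. A standalone sketch of that pipeline (`is_repo_root` and the trimmed `RESERVED` subset here are illustrative, not part of the crate):

```rust
// Illustrative subset of RESERVED_OWNERS — just enough to show the rejection.
const RESERVED: &[&str] = &["settings", "marketplace", "topics"];

fn is_repo_root(url: &str) -> bool {
    // Strip the scheme, then split "host/rest-of-path".
    let rest = url.split("://").nth(1).unwrap_or(url);
    let (host, path) = rest.split_once('/').unwrap_or((rest, ""));
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    // Drop ?query / #fragment and a trailing slash before segmenting.
    let stripped = path.split(['?', '#']).next().unwrap_or("").trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    segs.len() == 2 && !RESERVED.contains(&segs[0])
}

fn main() {
    assert!(is_repo_root("https://github.com/rust-lang/rust/"));
    assert!(!is_repo_root("https://github.com/rust-lang/rust/issues"));
    assert!(!is_repo_root("https://github.com/topics/rust"));
    println!("ok");
}
```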
186 crates/webclaw-fetch/src/extractors/hackernews.rs Normal file
@@ -0,0 +1,186 @@
//! Hacker News structured extractor.
//!
//! Uses Algolia's HN API (`hn.algolia.com/api/v1/items/{id}`) which
//! returns the full post + recursive comment tree in a single request.
//! The official Firebase API at `hacker-news.firebaseio.com` requires
//! N+1 fetches per comment, so we'd hit either timeout or rate-limit
//! on any non-trivial thread.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "hackernews",
    label: "Hacker News story",
    description: "Returns post + nested comment tree for a Hacker News item.",
    url_patterns: &[
        "https://news.ycombinator.com/item?id=N",
        "https://hn.algolia.com/items/N",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host == "news.ycombinator.com" {
        return url.contains("item?id=") || url.contains("item%3Fid=");
    }
    if host == "hn.algolia.com" {
        return url.contains("/items/");
    }
    false
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let id = parse_item_id(url).ok_or_else(|| {
        FetchError::Build(format!("hackernews: cannot parse item id from '{url}'"))
    })?;

    let api_url = format!("https://hn.algolia.com/api/v1/items/{id}");
    let resp = client.fetch(&api_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "hn algolia returned status {}",
            resp.status
        )));
    }

    let item: AlgoliaItem = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("hn algolia parse: {e}")))?;

    let post = post_json(&item);
    let comments: Vec<Value> = item.children.iter().filter_map(comment_json).collect();

    Ok(json!({
        "url": url,
        "post": post,
        "comments": comments,
    }))
}

// ---------------------------------------------------------------------------
// Helpers
// ---------------------------------------------------------------------------

/// Pull the numeric id out of a HN URL. Handles `item?id=N` and the
/// Algolia mirror's `/items/N` form.
fn parse_item_id(url: &str) -> Option<u64> {
    if let Some(after) = url.split("id=").nth(1) {
        let n = after.split('&').next().unwrap_or(after);
        if let Ok(id) = n.parse::<u64>() {
            return Some(id);
        }
    }
    if let Some(after) = url.split("/items/").nth(1) {
        let n = after.split(['/', '?', '#']).next().unwrap_or(after);
        if let Ok(id) = n.parse::<u64>() {
            return Some(id);
        }
    }
    None
}

fn post_json(item: &AlgoliaItem) -> Value {
    json!({
        "id": item.id,
        "type": item.r#type,
        "title": item.title,
        "url": item.url,
        "author": item.author,
        "points": item.points,
        "text": item.text, // populated for ask/show/tell
        "created_at": item.created_at,
        "created_at_unix": item.created_at_i,
        "comment_count": count_descendants(item),
        "permalink": item.id.map(|i| format!("https://news.ycombinator.com/item?id={i}")),
    })
}

fn comment_json(item: &AlgoliaItem) -> Option<Value> {
    if !matches!(item.r#type.as_deref(), Some("comment")) {
        return None;
    }
    // Dead/deleted comments still appear in the tree; surface them honestly.
    let replies: Vec<Value> = item.children.iter().filter_map(comment_json).collect();
    Some(json!({
        "id": item.id,
        "author": item.author,
        "text": item.text,
        "created_at": item.created_at,
        "created_at_unix": item.created_at_i,
        "parent_id": item.parent_id,
        "story_id": item.story_id,
        "replies": replies,
    }))
}

fn count_descendants(item: &AlgoliaItem) -> usize {
    item.children
        .iter()
        .filter(|c| matches!(c.r#type.as_deref(), Some("comment")))
        .map(|c| 1 + count_descendants(c))
        .sum()
}

// ---------------------------------------------------------------------------
// Algolia API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct AlgoliaItem {
    id: Option<u64>,
    r#type: Option<String>,
    title: Option<String>,
    url: Option<String>,
    author: Option<String>,
    points: Option<i64>,
    text: Option<String>,
    created_at: Option<String>,
    created_at_i: Option<i64>,
    parent_id: Option<u64>,
    story_id: Option<u64>,
    #[serde(default)]
    children: Vec<AlgoliaItem>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_hn_item_urls() {
        assert!(matches("https://news.ycombinator.com/item?id=1"));
        assert!(matches("https://news.ycombinator.com/item?id=12345"));
        assert!(matches("https://hn.algolia.com/items/1"));
    }

    #[test]
    fn rejects_non_item_urls() {
        assert!(!matches("https://news.ycombinator.com/"));
        assert!(!matches("https://news.ycombinator.com/news"));
        assert!(!matches("https://example.com/item?id=1"));
    }

    #[test]
    fn parse_item_id_handles_both_forms() {
        assert_eq!(
            parse_item_id("https://news.ycombinator.com/item?id=1"),
            Some(1)
        );
        assert_eq!(
            parse_item_id("https://news.ycombinator.com/item?id=12345&p=2"),
            Some(12345)
        );
        assert_eq!(parse_item_id("https://hn.algolia.com/items/999"), Some(999));
        assert_eq!(parse_item_id("https://example.com/foo"), None);
    }
}
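The `comment_count` surfaced by `post_json` comes from the recursion in `count_descendants`: each comment child contributes itself plus its own descendant count, and non-comment children are skipped. A minimal standalone sketch of that recursion (`Node` and `count` are illustrative stand-ins for `AlgoliaItem` and `count_descendants`):

```rust
// Illustrative stand-in for AlgoliaItem, keeping only what the recursion needs.
struct Node {
    is_comment: bool,
    children: Vec<Node>,
}

// Each comment child counts as 1 plus all of its own comment descendants.
fn count(n: &Node) -> usize {
    n.children
        .iter()
        .filter(|c| c.is_comment)
        .map(|c| 1 + count(c))
        .sum()
}

fn main() {
    // A story with two top-level comments, one of which has a reply: 3 total.
    let reply = Node { is_comment: true, children: vec![] };
    let story = Node {
        is_comment: false,
        children: vec![
            Node { is_comment: true, children: vec![reply] },
            Node { is_comment: true, children: vec![] },
        ],
    };
    assert_eq!(count(&story), 3);
    println!("ok");
}
```

Note the same caveat applies as in the real function: comments nested under a non-comment child would not be visited, which matches the Algolia tree shape where comments only hang off the story or other comments.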
189 crates/webclaw-fetch/src/extractors/huggingface_dataset.rs Normal file
@@ -0,0 +1,189 @@
//! HuggingFace dataset structured extractor.
//!
//! Same shape as the model extractor but hits the dataset endpoint,
//! `huggingface.co/api/datasets/{owner}/{name}`.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "huggingface_dataset",
    label: "HuggingFace dataset",
    description: "Returns dataset metadata: downloads, likes, license, language, task categories, file list.",
    url_patterns: &["https://huggingface.co/datasets/{owner}/{name}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "huggingface.co" && host != "www.huggingface.co" {
        return false;
    }
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // /datasets/{name} (legacy top-level) or /datasets/{owner}/{name} (canonical).
    segs.first().copied() == Some("datasets") && (segs.len() == 2 || segs.len() == 3)
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let dataset_path = parse_dataset_path(url).ok_or_else(|| {
        FetchError::Build(format!(
            "hf_dataset: cannot parse dataset path from '{url}'"
        ))
    })?;

    let api_url = format!("https://huggingface.co/api/datasets/{dataset_path}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "hf_dataset: '{dataset_path}' not found"
        )));
    }
    if resp.status == 401 {
        return Err(FetchError::Build(format!(
            "hf_dataset: '{dataset_path}' requires authentication (gated)"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "hf_dataset api returned status {}",
            resp.status
        )));
    }

    let d: DatasetInfo = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("hf_dataset parse: {e}")))?;

    let files: Vec<Value> = d
        .siblings
        .iter()
        .map(|s| json!({"rfilename": s.rfilename, "size": s.size}))
        .collect();

    Ok(json!({
        "url": url,
        "id": d.id,
        "private": d.private,
        "gated": d.gated,
        "downloads": d.downloads,
        "downloads_30d": d.downloads_all_time,
        "likes": d.likes,
        "tags": d.tags,
        "license": d.card_data.as_ref().and_then(|c| c.license.clone()),
        "language": d.card_data.as_ref().and_then(|c| c.language.clone()),
        "task_categories": d.card_data.as_ref().and_then(|c| c.task_categories.clone()),
        "size_categories": d.card_data.as_ref().and_then(|c| c.size_categories.clone()),
        "annotations_creators": d.card_data.as_ref().and_then(|c| c.annotations_creators.clone()),
        "configs": d.card_data.as_ref().and_then(|c| c.configs.clone()),
        "created_at": d.created_at,
        "last_modified": d.last_modified,
        "sha": d.sha,
        "file_count": d.siblings.len(),
        "files": files,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Returns the part to append to the API URL — either `name` (legacy
/// top-level dataset like `squad`) or `owner/name` (canonical form).
fn parse_dataset_path(url: &str) -> Option<String> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    if segs.next() != Some("datasets") {
        return None;
    }
    let first = segs.next()?.to_string();
    match segs.next() {
        Some(second) => Some(format!("{first}/{second}")),
        None => Some(first),
    }
}

#[derive(Deserialize)]
struct DatasetInfo {
    id: Option<String>,
    private: Option<bool>,
    gated: Option<serde_json::Value>,
    downloads: Option<i64>,
    #[serde(rename = "downloadsAllTime")]
    downloads_all_time: Option<i64>,
    likes: Option<i64>,
    #[serde(default)]
    tags: Vec<String>,
    #[serde(rename = "createdAt")]
    created_at: Option<String>,
    #[serde(rename = "lastModified")]
    last_modified: Option<String>,
    sha: Option<String>,
    #[serde(rename = "cardData")]
    card_data: Option<DatasetCard>,
    #[serde(default)]
    siblings: Vec<Sibling>,
}

#[derive(Deserialize)]
struct DatasetCard {
    license: Option<serde_json::Value>,
    language: Option<serde_json::Value>,
    task_categories: Option<serde_json::Value>,
    size_categories: Option<serde_json::Value>,
    annotations_creators: Option<serde_json::Value>,
    configs: Option<serde_json::Value>,
}

#[derive(Deserialize)]
struct Sibling {
    rfilename: String,
    size: Option<i64>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_dataset_pages() {
        assert!(matches("https://huggingface.co/datasets/squad")); // legacy top-level
        assert!(matches("https://huggingface.co/datasets/openai/gsm8k")); // canonical owner/name
        assert!(!matches("https://huggingface.co/openai/whisper-large-v3"));
        assert!(!matches("https://huggingface.co/datasets/"));
    }

    #[test]
    fn parse_dataset_path_works() {
        assert_eq!(
            parse_dataset_path("https://huggingface.co/datasets/squad"),
            Some("squad".into())
        );
        assert_eq!(
            parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k"),
            Some("openai/gsm8k".into())
        );
        assert_eq!(
            parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k/?lib=transformers"),
            Some("openai/gsm8k".into())
        );
    }
}
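The legacy-vs-canonical handling in `parse_dataset_path` is the subtle part: `/datasets/squad` must map to `squad` while `/datasets/openai/gsm8k` maps to `openai/gsm8k`, because the HF API accepts both shapes under `/api/datasets/`. A standalone, std-only sketch of that branch (`dataset_path` here is an illustrative copy, not the crate's function):

```rust
// Mirrors the legacy/canonical split: one segment after "datasets" yields
// the bare name, two segments yield "owner/name", anything else is None.
fn dataset_path(url: &str) -> Option<String> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    if segs.next() != Some("datasets") {
        return None;
    }
    let first = segs.next()?.to_string();
    match segs.next() {
        Some(second) => Some(format!("{first}/{second}")),
        None => Some(first),
    }
}

fn main() {
    assert_eq!(dataset_path("https://huggingface.co/datasets/squad").as_deref(), Some("squad"));
    assert_eq!(
        dataset_path("https://huggingface.co/datasets/openai/gsm8k").as_deref(),
        Some("openai/gsm8k")
    );
    assert_eq!(dataset_path("https://huggingface.co/models/x"), None);
    println!("ok");
}
```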
223 crates/webclaw-fetch/src/extractors/huggingface_model.rs Normal file
@@ -0,0 +1,223 @@
//! HuggingFace model card structured extractor.
//!
//! Uses the public model API at `huggingface.co/api/models/{owner}/{name}`.
//! Returns metadata + the parsed model card front matter, but does not
//! pull the full README body — those are sometimes 100KB+ and the user
//! can hit /v1/scrape if they want it as markdown.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "huggingface_model",
    label: "HuggingFace model",
    description: "Returns model metadata: downloads, likes, license, pipeline tag, library name, file list.",
    url_patterns: &["https://huggingface.co/{owner}/{name}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "huggingface.co" && host != "www.huggingface.co" {
        return false;
    }
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // /{owner}/{name} but reject HF-internal sections + sub-pages.
    if segs.len() != 2 {
        return false;
    }
    !RESERVED_NAMESPACES.contains(&segs[0])
}

const RESERVED_NAMESPACES: &[&str] = &[
    "datasets",
    "spaces",
    "blog",
    "docs",
    "api",
    "models",
    "papers",
    "pricing",
    "tasks",
    "join",
    "login",
    "settings",
    "organizations",
    "new",
    "search",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, name) = parse_owner_name(url).ok_or_else(|| {
        FetchError::Build(format!("hf model: cannot parse owner/name from '{url}'"))
    })?;

    let api_url = format!("https://huggingface.co/api/models/{owner}/{name}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "hf model: '{owner}/{name}' not found"
        )));
    }
    if resp.status == 401 {
        return Err(FetchError::Build(format!(
            "hf model: '{owner}/{name}' requires authentication (gated repo)"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "hf api returned status {}",
            resp.status
        )));
    }

    let m: ModelInfo = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("hf api parse: {e}")))?;

    // Surface a flat file list — full siblings can be hundreds of entries
    // for big repos. We keep it as-is because callers want to know about
    // every shard; if it bloats responses too much we'll add pagination.
    let files: Vec<Value> = m
        .siblings
        .iter()
        .map(|s| json!({"rfilename": s.rfilename, "size": s.size}))
        .collect();

    Ok(json!({
        "url": url,
        "id": m.id,
        "model_id": m.model_id,
        "private": m.private,
        "gated": m.gated,
        "downloads": m.downloads,
        "downloads_30d": m.downloads_all_time,
        "likes": m.likes,
        "library_name": m.library_name,
        "pipeline_tag": m.pipeline_tag,
        "tags": m.tags,
        "license": m.card_data.as_ref().and_then(|c| c.license.clone()),
        "language": m.card_data.as_ref().and_then(|c| c.language.clone()),
        "datasets": m.card_data.as_ref().and_then(|c| c.datasets.clone()),
        "base_model": m.card_data.as_ref().and_then(|c| c.base_model.clone()),
        "model_type": m.card_data.as_ref().and_then(|c| c.model_type.clone()),
        "created_at": m.created_at,
        "last_modified": m.last_modified,
        "sha": m.sha,
        "file_count": m.siblings.len(),
        "files": files,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_owner_name(url: &str) -> Option<(String, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let owner = segs.next()?.to_string();
    let name = segs.next()?.to_string();
    Some((owner, name))
}

// ---------------------------------------------------------------------------
// HF API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct ModelInfo {
    id: Option<String>,
    #[serde(rename = "modelId")]
    model_id: Option<String>,
    private: Option<bool>,
    gated: Option<serde_json::Value>, // bool or string ("auto" / "manual" / false)
    downloads: Option<i64>,
    #[serde(rename = "downloadsAllTime")]
    downloads_all_time: Option<i64>,
    likes: Option<i64>,
    #[serde(rename = "library_name")]
    library_name: Option<String>,
    #[serde(rename = "pipeline_tag")]
    pipeline_tag: Option<String>,
    #[serde(default)]
    tags: Vec<String>,
    #[serde(rename = "createdAt")]
    created_at: Option<String>,
    #[serde(rename = "lastModified")]
    last_modified: Option<String>,
    sha: Option<String>,
    #[serde(rename = "cardData")]
    card_data: Option<CardData>,
    #[serde(default)]
    siblings: Vec<Sibling>,
}

#[derive(Deserialize)]
struct CardData {
    license: Option<serde_json::Value>, // string or array
    language: Option<serde_json::Value>,
    datasets: Option<serde_json::Value>,
    #[serde(rename = "base_model")]
    base_model: Option<serde_json::Value>,
    #[serde(rename = "model_type")]
    model_type: Option<String>,
}

#[derive(Deserialize)]
struct Sibling {
    rfilename: String,
    size: Option<i64>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_model_pages() {
        assert!(matches("https://huggingface.co/meta-llama/Meta-Llama-3-8B"));
        assert!(matches("https://huggingface.co/openai/whisper-large-v3"));
        assert!(matches("https://huggingface.co/bert-base-uncased/main")); // owner=bert-base-uncased name=main: false positive but acceptable for v1
    }

    #[test]
    fn rejects_hf_section_pages() {
        assert!(!matches("https://huggingface.co/datasets/squad"));
        assert!(!matches("https://huggingface.co/spaces/foo/bar"));
        assert!(!matches("https://huggingface.co/blog/intro"));
        assert!(!matches("https://huggingface.co/"));
        assert!(!matches("https://huggingface.co/meta-llama"));
    }

    #[test]
    fn parse_owner_name_pulls_both() {
        assert_eq!(
            parse_owner_name("https://huggingface.co/meta-llama/Meta-Llama-3-8B"),
            Some(("meta-llama".into(), "Meta-Llama-3-8B".into()))
        );
        assert_eq!(
            parse_owner_name("https://huggingface.co/openai/whisper-large-v3?library=transformers"),
            Some(("openai".into(), "whisper-large-v3".into()))
        );
    }
}
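Every `cardData` field above is read through the same nested-Option chain (`m.card_data.as_ref().and_then(|c| c.field.clone())`), so a missing card and a missing field both collapse to `null` in the output without any unwraps. A minimal std-only sketch of that access pattern (`Card`, `Model`, and `license_of` are illustrative types, not the crate's):

```rust
// Illustrative stand-ins for CardData / ModelInfo.
struct Card {
    license: Option<String>,
}

struct Model {
    card_data: Option<Card>,
}

// None card or None field both flatten to None — no unwraps, no panics.
fn license_of(m: &Model) -> Option<String> {
    m.card_data.as_ref().and_then(|c| c.license.clone())
}

fn main() {
    let no_card = Model { card_data: None };
    let no_field = Model { card_data: Some(Card { license: None }) };
    let full = Model { card_data: Some(Card { license: Some("mit".into()) }) };
    assert_eq!(license_of(&no_card), None);
    assert_eq!(license_of(&no_field), None);
    assert_eq!(license_of(&full).as_deref(), Some("mit"));
    println!("ok");
}
```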
235 crates/webclaw-fetch/src/extractors/instagram_post.rs Normal file
@@ -0,0 +1,235 @@
//! Instagram post structured extractor.
//!
//! Uses Instagram's public embed endpoint
//! `/p/{shortcode}/embed/captioned/` which returns SSR HTML with the
//! full caption, author username, and thumbnail. No auth required.
//! The same endpoint serves reels and IGTV under `/reel/{code}` and
//! `/tv/{code}` URLs (we accept all three).

use regex::Regex;
use serde_json::{Value, json};
use std::sync::OnceLock;

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "instagram_post",
    label: "Instagram post",
    description: "Returns full caption, author username, thumbnail, and post type (post / reel / tv) via Instagram's public embed.",
    url_patterns: &[
        "https://www.instagram.com/p/{shortcode}/",
        "https://www.instagram.com/reel/{shortcode}/",
        "https://www.instagram.com/tv/{shortcode}/",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !matches!(host, "www.instagram.com" | "instagram.com") {
        return false;
    }
    parse_shortcode(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (kind, shortcode) = parse_shortcode(url).ok_or_else(|| {
        FetchError::Build(format!(
            "instagram_post: cannot parse shortcode from '{url}'"
        ))
    })?;

    // Instagram serves the same embed HTML for posts/reels/tv under /p/.
    let embed_url = format!("https://www.instagram.com/p/{shortcode}/embed/captioned/");
    let resp = client.fetch(&embed_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "instagram embed returned status {} for {shortcode}",
            resp.status
        )));
    }

    let html = &resp.html;
    let username = parse_username(html);
    let caption = parse_caption(html);
    let thumbnail = parse_thumbnail(html);

    Ok(json!({
        "url": url,
        "embed_url": embed_url,
        "shortcode": shortcode,
        "kind": kind,
        "data_completeness": "embed",
        "author_username": username,
        "caption": caption,
        "thumbnail_url": thumbnail,
        "canonical_url": format!("https://www.instagram.com/{}/{shortcode}/", path_segment_for(kind)),
    }))
}

// ---------------------------------------------------------------------------
// URL parsing
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Returns `(kind, shortcode)` where kind ∈ {`post`, `reel`, `tv`}.
fn parse_shortcode(url: &str) -> Option<(&'static str, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let first = segs.next()?;
    let kind = match first {
        "p" => "post",
        "reel" | "reels" => "reel",
        "tv" => "tv",
        _ => return None,
    };
    let shortcode = segs.next()?;
    if shortcode.is_empty() {
        return None;
    }
    Some((kind, shortcode.to_string()))
}

fn path_segment_for(kind: &str) -> &'static str {
    match kind {
        "reel" => "reel",
        "tv" => "tv",
        _ => "p",
    }
}

// ---------------------------------------------------------------------------
// HTML scraping
// ---------------------------------------------------------------------------

/// Username appears as the anchor text inside `<a class="CaptionUsername">`.
fn parse_username(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"(?s)class="CaptionUsername"[^>]*>([^<]+)<"#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| html_decode(m.as_str().trim()))
}

/// Caption sits inside `<div class="Caption">` after the username anchor.
/// We grab the whole Caption block and strip out the username link, time
/// node, and any trailing "Photo by" / "View ... on Instagram" boilerplate.
fn parse_caption(html: &str) -> Option<String> {
    static RE_OUTER: OnceLock<Regex> = OnceLock::new();
    let outer = RE_OUTER
        .get_or_init(|| Regex::new(r#"(?s)<div\s+class="Caption"[^>]*>(.*?)</div>"#).unwrap());
    let block = outer.captures(html)?.get(1)?.as_str();

    // Strip everything wrapped in <a class="CaptionUsername">...</a>.
    static RE_USER: OnceLock<Regex> = OnceLock::new();
    let user_re = RE_USER
        .get_or_init(|| Regex::new(r#"(?s)<a[^>]*class="CaptionUsername"[^>]*>.*?</a>"#).unwrap());
    let stripped = user_re.replace_all(block, "");

    // Then strip any remaining tags.
    static RE_TAGS: OnceLock<Regex> = OnceLock::new();
    let tag_re = RE_TAGS.get_or_init(|| Regex::new(r"<[^>]+>").unwrap());
    let text = tag_re.replace_all(&stripped, " ");

    let cleaned = collapse_whitespace(&html_decode(text.trim()));
    if cleaned.is_empty() {
        None
    } else {
        Some(cleaned)
    }
}

/// Thumbnail is the `<img class="EmbeddedMediaImage">` inside the embed
/// (or the og:image as fallback).
fn parse_thumbnail(html: &str) -> Option<String> {
    static RE_IMG: OnceLock<Regex> = OnceLock::new();
    let img_re = RE_IMG.get_or_init(|| {
        Regex::new(r#"(?s)<img[^>]+class="[^"]*EmbeddedMediaImage[^"]*"[^>]+src="([^"]+)""#)
            .unwrap()
    });
    if let Some(m) = img_re.captures(html).and_then(|c| c.get(1)) {
        return Some(html_decode(m.as_str()));
    }
    static RE_OG: OnceLock<Regex> = OnceLock::new();
    let og_re = RE_OG.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:image"[^>]+content="([^"]+)""#).unwrap()
    });
    og_re
        .captures(html)
        .and_then(|c| c.get(1))
        .map(|m| html_decode(m.as_str()))
}
|
||||
s.replace("&", "&")
|
||||
.replace("<", "<")
|
||||
.replace(">", ">")
|
||||
.replace(""", "\"")
|
||||
.replace("'", "'")
|
||||
.replace("@", "@")
|
||||
.replace("•", "•")
|
||||
.replace("…", "…")
|
||||
}
|
||||
|
||||
fn collapse_whitespace(s: &str) -> String {
|
||||
s.split_whitespace().collect::<Vec<_>>().join(" ")
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn matches_post_reel_tv_urls() {
|
||||
assert!(matches("https://www.instagram.com/p/DT-RICMjeK5/"));
|
||||
assert!(matches(
|
||||
"https://www.instagram.com/p/DT-RICMjeK5/?img_index=1"
|
||||
));
|
||||
assert!(matches("https://www.instagram.com/reel/abc123/"));
|
||||
assert!(matches("https://www.instagram.com/tv/abc123/"));
|
||||
assert!(!matches("https://www.instagram.com/ticketswave"));
|
||||
assert!(!matches("https://www.instagram.com/"));
|
||||
assert!(!matches("https://example.com/p/abc/"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_shortcode_reads_each_kind() {
|
||||
assert_eq!(
|
||||
parse_shortcode("https://www.instagram.com/p/DT-RICMjeK5/?img_index=1"),
|
||||
Some(("post", "DT-RICMjeK5".into()))
|
||||
);
|
||||
assert_eq!(
|
||||
parse_shortcode("https://www.instagram.com/reel/abc123/"),
|
||||
Some(("reel", "abc123".into()))
|
||||
);
|
||||
assert_eq!(
|
||||
parse_shortcode("https://www.instagram.com/tv/abc123"),
|
||||
Some(("tv", "abc123".into()))
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_username_pulls_anchor_text() {
|
||||
let html = r#"<a class="CaptionUsername" href="...">ticketswave</a>"#;
|
||||
assert_eq!(parse_username(html).as_deref(), Some("ticketswave"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_caption_strips_username_anchor() {
|
||||
let html = r#"<div class="Caption"><a class="CaptionUsername" href="...">ticketswave</a> Some caption text here</div>"#;
|
||||
assert_eq!(
|
||||
parse_caption(html).as_deref(),
|
||||
Some("Some caption text here")
|
||||
);
|
||||
}
|
||||
}
|
||||
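The `parse_shortcode` helper exercised by the tests above maps a post/reel/tv URL to a `(kind, shortcode)` pair. A minimal stdlib-only sketch of that mapping (the real helper lives earlier in this file and may differ in detail; `shortcode_of` is a hypothetical name used here for illustration):

```rust
/// Sketch: map an Instagram media URL to ("post" | "reel" | "tv", shortcode).
fn shortcode_of(url: &str) -> Option<(&'static str, String)> {
    // Drop the scheme and host, keep the path + query.
    let path = url.split("://").nth(1)?.split_once('/')?.1;
    // Trim query string / fragment, then walk path segments.
    let mut segs = path
        .split(['?', '#'])
        .next()?
        .split('/')
        .filter(|s| !s.is_empty());
    let kind = match segs.next()? {
        "p" => "post",
        "reel" => "reel",
        "tv" => "tv",
        _ => return None, // profile pages and reserved paths don't qualify
    };
    Some((kind, segs.next()?.to_string()))
}
```

Usage mirrors the test expectations: `/p/{code}` URLs come back as `("post", code)` even when an `?img_index=` query is attached.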
465 crates/webclaw-fetch/src/extractors/instagram_profile.rs Normal file
@@ -0,0 +1,465 @@
//! Instagram profile structured extractor.
//!
//! Hits Instagram's internal `web_profile_info` endpoint at
//! `instagram.com/api/v1/users/web_profile_info/?username=X`. The
//! `x-ig-app-id` header is Instagram's own public web-app id (not a
//! secret) — the same value Instagram's own JavaScript bundle sends.
//!
//! Returns the full profile (bio, exact follower count, verified /
//! business flags, profile picture) plus the **12 most recent posts**
//! with shortcodes, like counts, types, thumbnails, and caption
//! previews. Callers can fan out to `/v1/scrape/instagram_post` per
//! shortcode to get the full caption + media.
//!
//! Pagination beyond 12 requires authenticated cookies + a CSRF token;
//! we accept that as the practical ceiling for the unauth path. The
//! cloud (with stored sessions) can paginate later as a follow-up.
//!
//! Falls back to OG-tag scraping of the public profile page if the API
//! returns 401/403 — Instagram has tightened this endpoint multiple
//! times, so we keep the second path warm.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "instagram_profile",
    label: "Instagram profile",
    description: "Returns full profile metadata + the 12 most recent posts (shortcode, url, type, likes, thumbnail).",
    url_patterns: &["https://www.instagram.com/{username}/"],
};

/// Instagram's own public web-app identifier. Sent by their JS bundle
/// on every API call, accepted by the unauth endpoint, not a secret.
const IG_APP_ID: &str = "936619743392459";

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !matches!(host, "www.instagram.com" | "instagram.com") {
        return false;
    }
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    segs.len() == 1 && !RESERVED.contains(&segs[0])
}

const RESERVED: &[&str] = &[
    "p",
    "reel",
    "reels",
    "tv",
    "explore",
    "stories",
    "directory",
    "accounts",
    "about",
    "developer",
    "press",
    "api",
    "ads",
    "blog",
    "fragments",
    "terms",
    "privacy",
    "session",
    "login",
    "signup",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let username = parse_username(url).ok_or_else(|| {
        FetchError::Build(format!(
            "instagram_profile: cannot parse username from '{url}'"
        ))
    })?;

    let api_url =
        format!("https://www.instagram.com/api/v1/users/web_profile_info/?username={username}");
    let extra_headers: &[(&str, &str)] = &[
        ("x-ig-app-id", IG_APP_ID),
        ("accept", "*/*"),
        ("sec-fetch-site", "same-origin"),
        ("x-requested-with", "XMLHttpRequest"),
    ];
    let resp = client.fetch_with_headers(&api_url, extra_headers).await?;

    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "instagram_profile: '{username}' not found"
        )));
    }
    // Auth wall fallback: Instagram occasionally tightens this endpoint
    // and starts returning 401/403/302 to a login page. When that
    // happens we still want to give the caller something useful — the
    // OG tags from the public HTML page (no posts list, but bio etc).
    if !(200..300).contains(&resp.status) {
        return og_fallback(client, &username, url, resp.status).await;
    }

    let body: ApiResponse = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("instagram_profile parse: {e}")))?;
    let user = body.data.user;

    let recent_posts: Vec<Value> = user
        .edge_owner_to_timeline_media
        .as_ref()
        .map(|m| m.edges.iter().map(|e| post_summary(&e.node)).collect())
        .unwrap_or_default();

    Ok(json!({
        "url": url,
        "canonical_url": format!("https://www.instagram.com/{username}/"),
        "username": user.username.unwrap_or(username),
        "data_completeness": "api",
        "user_id": user.id,
        "full_name": user.full_name,
        "biography": user.biography,
        "biography_links": user.bio_links,
        "external_url": user.external_url,
        "category": user.category_name,
        "follower_count": user.edge_followed_by.map(|c| c.count),
        "following_count": user.edge_follow.map(|c| c.count),
        "post_count": user.edge_owner_to_timeline_media.as_ref().map(|m| m.count),
        "is_verified": user.is_verified,
        "is_private": user.is_private,
        "is_business": user.is_business_account,
        "is_professional": user.is_professional_account,
        "profile_pic_url": user.profile_pic_url_hd.or(user.profile_pic_url),
        "recent_posts": recent_posts,
    }))
}

/// Build the per-post summary the caller fans out from. Includes a
/// constructed `url` so the loop is `for p in recent_posts: scrape('instagram_post', p.url)`.
fn post_summary(n: &MediaNode) -> Value {
    let kind = classify(n);
    let url = match kind {
        "reel" => format!(
            "https://www.instagram.com/reel/{}/",
            n.shortcode.as_deref().unwrap_or("")
        ),
        _ => format!(
            "https://www.instagram.com/p/{}/",
            n.shortcode.as_deref().unwrap_or("")
        ),
    };
    let caption = n
        .edge_media_to_caption
        .as_ref()
        .and_then(|c| c.edges.first())
        .and_then(|e| e.node.text.clone());
    json!({
        "shortcode": n.shortcode,
        "url": url,
        "kind": kind,
        "is_video": n.is_video.unwrap_or(false),
        "video_views": n.video_view_count,
        "thumbnail_url": n.thumbnail_src.clone().or_else(|| n.display_url.clone()),
        "display_url": n.display_url,
        "like_count": n.edge_media_preview_like.as_ref().map(|c| c.count),
        "comment_count": n.edge_media_to_comment.as_ref().map(|c| c.count),
        "taken_at": n.taken_at_timestamp,
        "caption": caption,
        "alt_text": n.accessibility_caption,
        "dimensions": n.dimensions.as_ref().map(|d| json!({"width": d.width, "height": d.height})),
        "product_type": n.product_type,
    })
}

/// Best-effort post-type classification. `clips` is reels; `feed` is
/// the regular grid. Sidecar = multi-photo carousel.
fn classify(n: &MediaNode) -> &'static str {
    if n.product_type.as_deref() == Some("clips") {
        return "reel";
    }
    match n.typename.as_deref() {
        Some("GraphSidecar") => "carousel",
        Some("GraphVideo") => "video",
        Some("GraphImage") => "photo",
        _ => "post",
    }
}

/// Fallback when the API path is blocked: hit the public profile HTML,
/// pull whatever OG tags we can. Returns less data and explicitly
/// flags `data_completeness: "og_only"` so callers know.
async fn og_fallback(
    client: &dyn Fetcher,
    username: &str,
    original_url: &str,
    api_status: u16,
) -> Result<Value, FetchError> {
    let canonical = format!("https://www.instagram.com/{username}/");
    let resp = client.fetch(&canonical).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "instagram_profile: api status {api_status}, html status {} for {username}",
            resp.status
        )));
    }
    let og = parse_og_tags(&resp.html);
    let (followers, following, posts) =
        parse_counts_from_og_description(og.get("description").map(String::as_str));

    Ok(json!({
        "url": original_url,
        "canonical_url": canonical,
        "username": username,
        "data_completeness": "og_only",
        "fallback_reason": format!("api returned {api_status}"),
        "full_name": parse_full_name(&og.get("title").cloned().unwrap_or_default()),
        "follower_count": followers,
        "following_count": following,
        "post_count": posts,
        "profile_pic_url": og.get("image").cloned(),
        "biography": null_value(),
        "is_verified": null_value(),
        "is_business": null_value(),
        "recent_posts": Vec::<Value>::new(),
    }))
}

fn null_value() -> Value {
    Value::Null
}

// ---------------------------------------------------------------------------
// URL parsing
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_username(url: &str) -> Option<String> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    stripped
        .split('/')
        .find(|s| !s.is_empty())
        .map(|s| s.to_string())
}

// ---------------------------------------------------------------------------
// OG-fallback helpers (kept self-contained — same shape as the previous
// version we shipped, retained as the safety net)
// ---------------------------------------------------------------------------

fn parse_og_tags(html: &str) -> std::collections::HashMap<String, String> {
    use regex::Regex;
    use std::sync::OnceLock;
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    let mut out = std::collections::HashMap::new();
    for c in re.captures_iter(html) {
        let k = c
            .get(1)
            .map(|m| m.as_str().to_lowercase())
            .unwrap_or_default();
        let v = c
            .get(2)
            .map(|m| html_decode(m.as_str()))
            .unwrap_or_default();
        out.entry(k).or_insert(v);
    }
    out
}

fn parse_full_name(og_title: &str) -> Option<String> {
    if og_title.is_empty() {
        return None;
    }
    let decoded = html_decode(og_title);
    let trimmed = decoded.split('(').next().unwrap_or(&decoded).trim();
    if trimmed.is_empty() {
        None
    } else {
        Some(trimmed.to_string())
    }
}

fn parse_counts_from_og_description(desc: Option<&str>) -> (Option<i64>, Option<i64>, Option<i64>) {
    let Some(text) = desc else {
        return (None, None, None);
    };
    let decoded = html_decode(text);
    use regex::Regex;
    use std::sync::OnceLock;
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r"(?i)([\d.,]+[KMB]?)\s*Followers,\s*([\d.,]+[KMB]?)\s*Following,\s*([\d.,]+[KMB]?)\s*Posts").unwrap()
    });
    if let Some(c) = re.captures(&decoded) {
        return (
            c.get(1).and_then(|m| parse_compact_number(m.as_str())),
            c.get(2).and_then(|m| parse_compact_number(m.as_str())),
            c.get(3).and_then(|m| parse_compact_number(m.as_str())),
        );
    }
    (None, None, None)
}

fn parse_compact_number(s: &str) -> Option<i64> {
    let s = s.trim();
    let (num_str, mul) = match s.chars().last() {
        Some('K') => (&s[..s.len() - 1], 1_000i64),
        Some('M') => (&s[..s.len() - 1], 1_000_000i64),
        Some('B') => (&s[..s.len() - 1], 1_000_000_000i64),
        _ => (s, 1i64),
    };
    let cleaned: String = num_str.chars().filter(|c| *c != ',').collect();
    cleaned.parse::<f64>().ok().map(|f| (f * mul as f64) as i64)
}

fn html_decode(s: &str) -> String {
    s.replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&#64;", "@")
        .replace("&bull;", "•")
        .replace("&hellip;", "…")
}

// ---------------------------------------------------------------------------
// Instagram web_profile_info API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct ApiResponse {
    data: ApiData,
}

#[derive(Deserialize)]
struct ApiData {
    user: User,
}

#[derive(Deserialize)]
struct User {
    id: Option<String>,
    username: Option<String>,
    full_name: Option<String>,
    biography: Option<String>,
    bio_links: Option<Vec<serde_json::Value>>,
    external_url: Option<String>,
    category_name: Option<String>,
    profile_pic_url: Option<String>,
    profile_pic_url_hd: Option<String>,
    is_verified: Option<bool>,
    is_private: Option<bool>,
    is_business_account: Option<bool>,
    is_professional_account: Option<bool>,
    edge_followed_by: Option<EdgeCount>,
    edge_follow: Option<EdgeCount>,
    edge_owner_to_timeline_media: Option<MediaEdges>,
}

#[derive(Deserialize)]
struct EdgeCount {
    count: i64,
}

#[derive(Deserialize)]
struct MediaEdges {
    count: i64,
    edges: Vec<MediaEdge>,
}

#[derive(Deserialize)]
struct MediaEdge {
    node: MediaNode,
}

#[derive(Deserialize)]
struct MediaNode {
    #[serde(rename = "__typename")]
    typename: Option<String>,
    shortcode: Option<String>,
    is_video: Option<bool>,
    video_view_count: Option<i64>,
    display_url: Option<String>,
    thumbnail_src: Option<String>,
    accessibility_caption: Option<String>,
    taken_at_timestamp: Option<i64>,
    product_type: Option<String>,
    dimensions: Option<Dimensions>,
    edge_media_preview_like: Option<EdgeCount>,
    edge_media_to_comment: Option<EdgeCount>,
    edge_media_to_caption: Option<CaptionEdges>,
}

#[derive(Deserialize)]
struct Dimensions {
    width: i64,
    height: i64,
}

#[derive(Deserialize)]
struct CaptionEdges {
    edges: Vec<CaptionEdge>,
}

#[derive(Deserialize)]
struct CaptionEdge {
    node: CaptionNode,
}

#[derive(Deserialize)]
struct CaptionNode {
    text: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_profile_urls() {
        assert!(matches("https://www.instagram.com/ticketswave"));
        assert!(matches("https://www.instagram.com/ticketswave/"));
        assert!(matches("https://instagram.com/0xmassi/?hl=en"));
        assert!(!matches("https://www.instagram.com/p/DT-RICMjeK5/"));
        assert!(!matches("https://www.instagram.com/explore"));
        assert!(!matches("https://www.instagram.com/"));
        assert!(!matches("https://example.com/foo"));
    }

    #[test]
    fn parse_full_name_strips_handle() {
        assert_eq!(
            parse_full_name("Ticket Wave (@ticketswave) • Instagram photos and videos"),
            Some("Ticket Wave".into())
        );
    }

    #[test]
    fn compact_number_handles_kmb() {
        assert_eq!(parse_compact_number("18K"), Some(18_000));
        assert_eq!(parse_compact_number("1.5M"), Some(1_500_000));
        assert_eq!(parse_compact_number("1,234"), Some(1_234));
        assert_eq!(parse_compact_number("641"), Some(641));
    }
}
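The `post_summary` doc comment above describes the intended fan-out: one `/v1/scrape/instagram_post` call per entry in `recent_posts`. A sketch of building those request URLs, assuming a local webclaw server and a `?url=` query parameter (both are assumptions for illustration; the real route shape is defined by the server crate):

```rust
/// Percent-encode everything outside the RFC 3986 unreserved set.
/// Minimal stand-in for a proper URL-encoding crate.
fn urlencode(s: &str) -> String {
    s.bytes()
        .map(|b| match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                (b as char).to_string()
            }
            _ => format!("%{b:02X}"),
        })
        .collect()
}

/// Build the scrape URL for one extractor + target, e.g. the per-post
/// fan-out loop: one call per `recent_posts[i].url`.
fn scrape_url(base: &str, extractor: &str, target: &str) -> String {
    format!("{base}/v1/scrape/{extractor}?url={}", urlencode(target))
}
```

With this shape the fan-out is just `recent_posts.iter().map(|p| scrape_url(base, "instagram_post", &p.url))`; pagination past the first 12 posts stays out of scope, per the module docs.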
266 crates/webclaw-fetch/src/extractors/linkedin_post.rs Normal file
@@ -0,0 +1,266 @@
//! LinkedIn post structured extractor.
//!
//! Uses the public embed endpoint `/embed/feed/update/{urn}` which
//! LinkedIn provides for sites that want to render a post inline. No
//! auth required, returns SSR HTML with the full post body, OG tags,
//! image, and a link back to the original post.
//!
//! Accepts both URN forms (`urn:li:share:N` and `urn:li:activity:N`)
//! and pretty post URLs (`/posts/{user}_{slug}-{id}-{suffix}`) by
//! pulling the trailing numeric id and converting to an activity URN.

use regex::Regex;
use serde_json::{Value, json};
use std::sync::OnceLock;

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "linkedin_post",
    label: "LinkedIn post",
    description: "Returns post body, author name, image, and original URL via LinkedIn's public embed endpoint.",
    url_patterns: &[
        "https://www.linkedin.com/feed/update/urn:li:share:{id}",
        "https://www.linkedin.com/feed/update/urn:li:activity:{id}",
        "https://www.linkedin.com/posts/{user}_{slug}-{id}-{suffix}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !matches!(host, "www.linkedin.com" | "linkedin.com") {
        return false;
    }
    url.contains("/feed/update/urn:li:") || url.contains("/posts/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let urn = extract_urn(url).ok_or_else(|| {
        FetchError::Build(format!(
            "linkedin_post: cannot extract URN from '{url}' (expected /feed/update/urn:li:... or /posts/{{slug}}-{{id}})"
        ))
    })?;

    let embed_url = format!("https://www.linkedin.com/embed/feed/update/{urn}");
    let resp = client.fetch(&embed_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "linkedin embed returned status {} for {urn}",
            resp.status
        )));
    }

    let html = &resp.html;
    let og = parse_og_tags(html);
    let body = parse_post_body(html);
    let author = parse_author(html);
    let canonical_url = og.get("url").cloned().unwrap_or_else(|| embed_url.clone());

    Ok(json!({
        "url": url,
        "embed_url": embed_url,
        "urn": urn,
        "canonical_url": canonical_url,
        "data_completeness": "embed",
        "title": og.get("title").cloned(),
        "body": body,
        "author_name": author,
        "image_url": og.get("image").cloned(),
        "site_name": og.get("site_name").cloned().unwrap_or_else(|| "LinkedIn".into()),
    }))
}

// ---------------------------------------------------------------------------
// URN extraction
// ---------------------------------------------------------------------------

/// Pull a `urn:li:share:N` or `urn:li:activity:N` from any LinkedIn URL.
/// `/posts/{slug}-{id}-{suffix}` URLs encode the activity id as the second-
/// to-last `-` separated chunk. Both forms map to a URN we can hit the
/// embed endpoint with.
fn extract_urn(url: &str) -> Option<String> {
    if let Some(idx) = url.find("urn:li:") {
        let tail = &url[idx..];
        let end = tail.find(['/', '?', '#']).unwrap_or(tail.len());
        let urn = &tail[..end];
        // Validate shape: urn:li:{type}:{digits}
        let mut parts = urn.split(':');
        if parts.next() == Some("urn")
            && parts.next() == Some("li")
            && parts.next().is_some()
            && parts
                .next()
                .filter(|p| p.chars().all(|c| c.is_ascii_digit()))
                .is_some()
        {
            return Some(urn.to_string());
        }
    }

    // /posts/{user}_{slug}-{19-digit-id}-{4-char-hash}/ — id is the second-
    // to-last segment after the last `-`.
    if url.contains("/posts/") {
        static RE: OnceLock<Regex> = OnceLock::new();
        let re =
            RE.get_or_init(|| Regex::new(r"/posts/[^/]*?-(\d{15,})-[A-Za-z0-9]{2,}/?").unwrap());
        if let Some(c) = re.captures(url)
            && let Some(id) = c.get(1)
        {
            return Some(format!("urn:li:activity:{}", id.as_str()));
        }
    }
    None
}

// ---------------------------------------------------------------------------
// HTML scraping
// ---------------------------------------------------------------------------

/// Pull `og:foo` → value pairs out of `<meta property="og:..." content="...">`.
/// Returns lowercased keys with leading `og:` stripped.
fn parse_og_tags(html: &str) -> std::collections::HashMap<String, String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    let mut out = std::collections::HashMap::new();
    for c in re.captures_iter(html) {
        let k = c
            .get(1)
            .map(|m| m.as_str().to_lowercase())
            .unwrap_or_default();
        let v = c
            .get(2)
            .map(|m| html_decode(m.as_str()))
            .unwrap_or_default();
        out.entry(k).or_insert(v);
    }
    out
}

/// Extract the post body text from the embed page. LinkedIn renders it
/// inside `<p class="attributed-text-segment-list__content ...">{text}</p>`
/// where the inner content can include nested `<a>` tags for links.
fn parse_post_body(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(
            r#"(?s)<p[^>]+class="[^"]*attributed-text-segment-list__content[^"]*"[^>]*>(.*?)</p>"#,
        )
        .unwrap()
    });
    let inner = re.captures(html).and_then(|c| c.get(1))?.as_str();
    Some(strip_tags(inner).trim().to_string())
}

/// Author name lives in the `<title>` like:
/// "55 founding members are in… | Orc Dev"
/// The chunk after the final `|` is the author display name. Falls back
/// to the og:title minus the post body if there's no title.
fn parse_author(html: &str) -> Option<String> {
    static RE_TITLE: OnceLock<Regex> = OnceLock::new();
    let re = RE_TITLE.get_or_init(|| Regex::new(r"<title>([^<]+)</title>").unwrap());
    let title = re.captures(html).and_then(|c| c.get(1))?.as_str();
    title
        .rsplit_once('|')
        .map(|(_, name)| html_decode(name.trim()))
}

/// Replace the small set of HTML entities LinkedIn (and Instagram, etc.)
/// stuff into OG content attributes.
fn html_decode(s: &str) -> String {
    s.replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&#64;", "@")
        .replace("&bull;", "•")
        .replace("&hellip;", "…")
}

/// Crude HTML tag stripper for the post body. Preserves text inside
/// nested anchors so URLs don't disappear, and collapses runs of
/// whitespace introduced by line wrapping.
fn strip_tags(html: &str) -> String {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"<[^>]+>").unwrap());
    let no_tags = re.replace_all(html, "").to_string();
    html_decode(&no_tags)
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_li_post_urls() {
        assert!(matches(
            "https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/"
        ));
        assert!(matches(
            "https://www.linkedin.com/feed/update/urn:li:activity:7452618583290892288"
        ));
        assert!(matches(
            "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c"
        ));
        assert!(!matches("https://www.linkedin.com/in/foo"));
        assert!(!matches("https://www.linkedin.com/"));
        assert!(!matches("https://example.com/feed/update/urn:li:share:1"));
    }

    #[test]
    fn extract_urn_from_share_url() {
        assert_eq!(
            extract_urn("https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/"),
            Some("urn:li:share:7452618582213144577".into())
        );
    }

    #[test]
    fn extract_urn_from_pretty_post_url() {
        assert_eq!(
            extract_urn(
                "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c/"
            ),
            Some("urn:li:activity:7452618583290892288".into())
        );
    }

    #[test]
    fn parse_og_tags_basic() {
        let html = r#"<meta property="og:image" content="https://x.com/a.png">
<meta property="og:url" content="https://example.com/x">"#;
        let og = parse_og_tags(html);
        assert_eq!(
            og.get("image").map(String::as_str),
            Some("https://x.com/a.png")
        );
        assert_eq!(
            og.get("url").map(String::as_str),
            Some("https://example.com/x")
        );
    }

    #[test]
    fn parse_post_body_strips_anchor_tags() {
        let html = r#"<p class="attributed-text-segment-list__content text-color-text" dir="ltr">Hello <a href="x">link</a> world</p>"#;
        assert_eq!(parse_post_body(html).as_deref(), Some("Hello link world"));
    }

    #[test]
    fn html_decode_handles_common_entities() {
        assert_eq!(html_decode("AT&amp;T &#64;jane"), "AT&T @jane");
    }
}
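The pretty-URL branch of `extract_urn` pulls the 15+ digit activity id out of the second-to-last `-` separated chunk. The same parse can be sketched without a regex, using reverse splitting; a stdlib-only approximation of that logic (the shipped code keeps the regex for its anchoring guarantees):

```rust
/// Sketch: /posts/{user}_{slug}-{id}-{suffix} -> urn:li:activity:{id}.
/// Walks backwards: last `-` chunk is the hash suffix, the one before
/// it must be an all-digit id of at least 15 characters.
fn activity_urn(url: &str) -> Option<String> {
    let tail = url.split("/posts/").nth(1)?.trim_end_matches('/');
    let mut parts = tail.rsplit('-');
    let _suffix = parts.next()?; // trailing short hash, discarded
    let id = parts.next()?;
    (id.len() >= 15 && id.chars().all(|c| c.is_ascii_digit()))
        .then(|| format!("urn:li:activity:{id}"))
}
```

Feeding the resulting URN into `/embed/feed/update/{urn}` gives the same embed page the share/activity URL forms resolve to.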
502 crates/webclaw-fetch/src/extractors/mod.rs Normal file
@@ -0,0 +1,502 @@
//! Vertical extractors: site-specific parsers that return typed JSON
//! instead of generic markdown.
//!
//! Each extractor handles a single site or platform and exposes:
//! - `matches(url)` to claim ownership of a URL pattern
//! - `extract(client, url)` to fetch + parse into a typed JSON `Value`
//! - `INFO` static for the catalog (`/v1/extractors`)
//!
//! The dispatch in this module is a simple `match`-style chain rather than
//! a trait registry. With ~30 extractors that's still fast and avoids the
//! ceremony of dynamic dispatch. If we hit 50+ we'll revisit.
//!
//! Extractors prefer official JSON APIs over HTML scraping where one
//! exists (Reddit, HN/Algolia, PyPI, npm, GitHub, HuggingFace all have
//! one). HTML extraction is the fallback for sites that don't.

pub mod amazon_product;
pub mod arxiv;
pub mod crates_io;
pub mod dev_to;
pub mod docker_hub;
pub mod ebay_listing;
pub mod ecommerce_product;
pub mod etsy_listing;
pub mod github_issue;
pub mod github_pr;
pub mod github_release;
pub mod github_repo;
pub mod hackernews;
pub mod huggingface_dataset;
pub mod huggingface_model;
pub mod instagram_post;
pub mod instagram_profile;
pub mod linkedin_post;
pub mod npm;
pub mod pypi;
pub mod reddit;
pub mod shopify_collection;
pub mod shopify_product;
pub mod stackoverflow;
pub mod substack_post;
pub mod trustpilot_reviews;
pub mod woocommerce_product;
pub mod youtube_video;

use serde::Serialize;
use serde_json::Value;

use crate::error::FetchError;
use crate::fetcher::Fetcher;

/// Public catalog entry for `/v1/extractors`. Stable shape — clients
/// rely on `name` to pick the right `/v1/scrape/{name}` route.
#[derive(Debug, Clone, Serialize)]
pub struct ExtractorInfo {
    /// URL-safe identifier (`reddit`, `hackernews`, `github_repo`, ...).
    pub name: &'static str,
    /// Human-friendly display name.
    pub label: &'static str,
    /// One-line description of what the extractor returns.
    pub description: &'static str,
    /// Glob-ish URL pattern(s) the extractor claims. For documentation;
    /// the actual matching is done by the extractor's `matches` fn.
    pub url_patterns: &'static [&'static str],
}

/// Full catalog. Order is stable; new entries append.
pub fn list() -> Vec<ExtractorInfo> {
    vec![
        reddit::INFO,
        hackernews::INFO,
        github_repo::INFO,
        github_pr::INFO,
        github_issue::INFO,
        github_release::INFO,
        pypi::INFO,
        npm::INFO,
        crates_io::INFO,
        huggingface_model::INFO,
        huggingface_dataset::INFO,
        arxiv::INFO,
        docker_hub::INFO,
        dev_to::INFO,
        stackoverflow::INFO,
        substack_post::INFO,
        youtube_video::INFO,
        linkedin_post::INFO,
        instagram_post::INFO,
        instagram_profile::INFO,
        shopify_product::INFO,
        shopify_collection::INFO,
        ecommerce_product::INFO,
        woocommerce_product::INFO,
        amazon_product::INFO,
        ebay_listing::INFO,
        etsy_listing::INFO,
        trustpilot_reviews::INFO,
    ]
}

/// Auto-detect mode: try every extractor's `matches`, return the first
/// one that claims the URL. Used by `/v1/scrape` when the caller doesn't
/// pick a vertical explicitly.
pub async fn dispatch_by_url(
    client: &dyn Fetcher,
    url: &str,
) -> Option<Result<(&'static str, Value), FetchError>> {
    if reddit::matches(url) {
        return Some(
            reddit::extract(client, url)
                .await
                .map(|v| (reddit::INFO.name, v)),
        );
    }
    if hackernews::matches(url) {
        return Some(
            hackernews::extract(client, url)
                .await
                .map(|v| (hackernews::INFO.name, v)),
        );
    }
    if github_repo::matches(url) {
        return Some(
            github_repo::extract(client, url)
                .await
                .map(|v| (github_repo::INFO.name, v)),
        );
    }
    if pypi::matches(url) {
        return Some(
            pypi::extract(client, url)
                .await
                .map(|v| (pypi::INFO.name, v)),
        );
    }
    if npm::matches(url) {
        return Some(npm::extract(client, url).await.map(|v| (npm::INFO.name, v)));
    }
    if github_pr::matches(url) {
        return Some(
||||
github_pr::extract(client, url)
|
||||
.await
|
||||
.map(|v| (github_pr::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if github_issue::matches(url) {
|
||||
return Some(
|
||||
github_issue::extract(client, url)
|
||||
.await
|
||||
.map(|v| (github_issue::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if github_release::matches(url) {
|
||||
return Some(
|
||||
github_release::extract(client, url)
|
||||
.await
|
||||
.map(|v| (github_release::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if crates_io::matches(url) {
|
||||
return Some(
|
||||
crates_io::extract(client, url)
|
||||
.await
|
||||
.map(|v| (crates_io::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if huggingface_model::matches(url) {
|
||||
return Some(
|
||||
huggingface_model::extract(client, url)
|
||||
.await
|
||||
.map(|v| (huggingface_model::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if huggingface_dataset::matches(url) {
|
||||
return Some(
|
||||
huggingface_dataset::extract(client, url)
|
||||
.await
|
||||
.map(|v| (huggingface_dataset::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if arxiv::matches(url) {
|
||||
return Some(
|
||||
arxiv::extract(client, url)
|
||||
.await
|
||||
.map(|v| (arxiv::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if docker_hub::matches(url) {
|
||||
return Some(
|
||||
docker_hub::extract(client, url)
|
||||
.await
|
||||
.map(|v| (docker_hub::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if dev_to::matches(url) {
|
||||
return Some(
|
||||
dev_to::extract(client, url)
|
||||
.await
|
||||
.map(|v| (dev_to::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if stackoverflow::matches(url) {
|
||||
return Some(
|
||||
stackoverflow::extract(client, url)
|
||||
.await
|
||||
.map(|v| (stackoverflow::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if linkedin_post::matches(url) {
|
||||
return Some(
|
||||
linkedin_post::extract(client, url)
|
||||
.await
|
||||
.map(|v| (linkedin_post::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if instagram_post::matches(url) {
|
||||
return Some(
|
||||
instagram_post::extract(client, url)
|
||||
.await
|
||||
.map(|v| (instagram_post::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if instagram_profile::matches(url) {
|
||||
return Some(
|
||||
instagram_profile::extract(client, url)
|
||||
.await
|
||||
.map(|v| (instagram_profile::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
// Antibot-gated verticals with unique hosts: safe to auto-dispatch
|
||||
// because the matcher can't confuse the URL for anything else. The
|
||||
// extractor's smart_fetch_html path handles the blocked-without-
|
||||
// API-key case with a clear actionable error.
|
||||
if amazon_product::matches(url) {
|
||||
return Some(
|
||||
amazon_product::extract(client, url)
|
||||
.await
|
||||
.map(|v| (amazon_product::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if ebay_listing::matches(url) {
|
||||
return Some(
|
||||
ebay_listing::extract(client, url)
|
||||
.await
|
||||
.map(|v| (ebay_listing::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if etsy_listing::matches(url) {
|
||||
return Some(
|
||||
etsy_listing::extract(client, url)
|
||||
.await
|
||||
.map(|v| (etsy_listing::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if trustpilot_reviews::matches(url) {
|
||||
return Some(
|
||||
trustpilot_reviews::extract(client, url)
|
||||
.await
|
||||
.map(|v| (trustpilot_reviews::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
if youtube_video::matches(url) {
|
||||
return Some(
|
||||
youtube_video::extract(client, url)
|
||||
.await
|
||||
.map(|v| (youtube_video::INFO.name, v)),
|
||||
);
|
||||
}
|
||||
// NOTE: shopify_product, shopify_collection, ecommerce_product,
|
||||
// woocommerce_product, and substack_post are intentionally NOT
|
||||
// in auto-dispatch. Their `matches()` functions are permissive
|
||||
// (any URL with `/products/`, `/product/`, `/p/`, etc.) and
|
||||
// claiming those generically would steal URLs from the default
|
||||
// `/v1/scrape` markdown flow. Callers opt in via
|
||||
// `/v1/scrape/shopify_product` or `/v1/scrape/ecommerce_product`.
|
||||
None
|
||||
}
|
||||
|
||||
/// Explicit mode: caller picked the vertical (`POST /v1/scrape/reddit`).
|
||||
/// We still validate that the URL plausibly belongs to that vertical so
|
||||
/// users get a clear "wrong route" error instead of a confusing parse
|
||||
/// failure deep in the extractor.
|
||||
pub async fn dispatch_by_name(
|
||||
client: &dyn Fetcher,
|
||||
name: &str,
|
||||
url: &str,
|
||||
) -> Result<Value, ExtractorDispatchError> {
|
||||
match name {
|
||||
n if n == reddit::INFO.name => {
|
||||
run_or_mismatch(reddit::matches(url), n, url, || {
|
||||
reddit::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == hackernews::INFO.name => {
|
||||
run_or_mismatch(hackernews::matches(url), n, url, || {
|
||||
hackernews::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == github_repo::INFO.name => {
|
||||
run_or_mismatch(github_repo::matches(url), n, url, || {
|
||||
github_repo::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == pypi::INFO.name => {
|
||||
run_or_mismatch(pypi::matches(url), n, url, || pypi::extract(client, url)).await
|
||||
}
|
||||
n if n == npm::INFO.name => {
|
||||
run_or_mismatch(npm::matches(url), n, url, || npm::extract(client, url)).await
|
||||
}
|
||||
n if n == github_pr::INFO.name => {
|
||||
run_or_mismatch(github_pr::matches(url), n, url, || {
|
||||
github_pr::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == github_issue::INFO.name => {
|
||||
run_or_mismatch(github_issue::matches(url), n, url, || {
|
||||
github_issue::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == github_release::INFO.name => {
|
||||
run_or_mismatch(github_release::matches(url), n, url, || {
|
||||
github_release::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == crates_io::INFO.name => {
|
||||
run_or_mismatch(crates_io::matches(url), n, url, || {
|
||||
crates_io::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == huggingface_model::INFO.name => {
|
||||
run_or_mismatch(huggingface_model::matches(url), n, url, || {
|
||||
huggingface_model::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == huggingface_dataset::INFO.name => {
|
||||
run_or_mismatch(huggingface_dataset::matches(url), n, url, || {
|
||||
huggingface_dataset::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == arxiv::INFO.name => {
|
||||
run_or_mismatch(arxiv::matches(url), n, url, || arxiv::extract(client, url)).await
|
||||
}
|
||||
n if n == docker_hub::INFO.name => {
|
||||
run_or_mismatch(docker_hub::matches(url), n, url, || {
|
||||
docker_hub::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == dev_to::INFO.name => {
|
||||
run_or_mismatch(dev_to::matches(url), n, url, || {
|
||||
dev_to::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == stackoverflow::INFO.name => {
|
||||
run_or_mismatch(stackoverflow::matches(url), n, url, || {
|
||||
stackoverflow::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == linkedin_post::INFO.name => {
|
||||
run_or_mismatch(linkedin_post::matches(url), n, url, || {
|
||||
linkedin_post::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == instagram_post::INFO.name => {
|
||||
run_or_mismatch(instagram_post::matches(url), n, url, || {
|
||||
instagram_post::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == instagram_profile::INFO.name => {
|
||||
run_or_mismatch(instagram_profile::matches(url), n, url, || {
|
||||
instagram_profile::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == shopify_product::INFO.name => {
|
||||
run_or_mismatch(shopify_product::matches(url), n, url, || {
|
||||
shopify_product::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == ecommerce_product::INFO.name => {
|
||||
run_or_mismatch(ecommerce_product::matches(url), n, url, || {
|
||||
ecommerce_product::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == amazon_product::INFO.name => {
|
||||
run_or_mismatch(amazon_product::matches(url), n, url, || {
|
||||
amazon_product::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == ebay_listing::INFO.name => {
|
||||
run_or_mismatch(ebay_listing::matches(url), n, url, || {
|
||||
ebay_listing::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == etsy_listing::INFO.name => {
|
||||
run_or_mismatch(etsy_listing::matches(url), n, url, || {
|
||||
etsy_listing::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == trustpilot_reviews::INFO.name => {
|
||||
run_or_mismatch(trustpilot_reviews::matches(url), n, url, || {
|
||||
trustpilot_reviews::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == youtube_video::INFO.name => {
|
||||
run_or_mismatch(youtube_video::matches(url), n, url, || {
|
||||
youtube_video::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == substack_post::INFO.name => {
|
||||
run_or_mismatch(substack_post::matches(url), n, url, || {
|
||||
substack_post::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == shopify_collection::INFO.name => {
|
||||
run_or_mismatch(shopify_collection::matches(url), n, url, || {
|
||||
shopify_collection::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
n if n == woocommerce_product::INFO.name => {
|
||||
run_or_mismatch(woocommerce_product::matches(url), n, url, || {
|
||||
woocommerce_product::extract(client, url)
|
||||
})
|
||||
.await
|
||||
}
|
||||
_ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())),
|
||||
}
|
||||
}
|
||||
|
||||
/// Errors that the dispatcher itself raises (vs. errors from inside an
|
||||
/// extractor, which come back wrapped in `Fetch`).
|
||||
#[derive(Debug, thiserror::Error)]
|
||||
pub enum ExtractorDispatchError {
|
||||
#[error("unknown vertical: '{0}'")]
|
||||
UnknownVertical(String),
|
||||
|
||||
#[error("URL '{url}' does not match the '{vertical}' extractor")]
|
||||
UrlMismatch { vertical: String, url: String },
|
||||
|
||||
#[error(transparent)]
|
||||
Fetch(#[from] FetchError),
|
||||
}
|
||||
|
||||
/// Helper: when the caller explicitly picked a vertical but their URL
|
||||
/// doesn't match it, return `UrlMismatch` instead of running the
|
||||
/// extractor (which would just fail with a less-clear error).
|
||||
async fn run_or_mismatch<F, Fut>(
|
||||
matches: bool,
|
||||
vertical: &str,
|
||||
url: &str,
|
||||
f: F,
|
||||
) -> Result<Value, ExtractorDispatchError>
|
||||
where
|
||||
F: FnOnce() -> Fut,
|
||||
Fut: std::future::Future<Output = Result<Value, FetchError>>,
|
||||
{
|
||||
if !matches {
|
||||
return Err(ExtractorDispatchError::UrlMismatch {
|
||||
vertical: vertical.to_string(),
|
||||
url: url.to_string(),
|
||||
});
|
||||
}
|
||||
f().await.map_err(ExtractorDispatchError::Fetch)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn list_is_non_empty_and_unique() {
|
||||
let entries = list();
|
||||
assert!(!entries.is_empty());
|
||||
let mut names: Vec<_> = entries.iter().map(|e| e.name).collect();
|
||||
names.sort();
|
||||
let before = names.len();
|
||||
names.dedup();
|
||||
assert_eq!(before, names.len(), "extractor names must be unique");
|
||||
}
|
||||
}
|
||||
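The first-claimant-wins behaviour of `dispatch_by_url` can be sketched standalone as a table of `(name, matcher)` pairs. The predicates below are simplified stand-ins for illustration, not the crate's real `matches` functions:

```rust
// Minimal sketch of first-match dispatch: each vertical contributes a
// predicate, and the dispatcher returns the first one that claims the URL.
type Matcher = fn(&str) -> bool;

fn dispatch(url: &str) -> Option<&'static str> {
    // Order matters: earlier entries shadow later ones, like the if-chain.
    let table: &[(&'static str, Matcher)] = &[
        ("reddit", |u| u.contains("reddit.com") && u.contains("/comments/")),
        ("pypi", |u| u.contains("pypi.org/project/")),
        ("npm", |u| u.contains("npmjs.com/package/")),
    ];
    table.iter().find(|(_, m)| m(url)).map(|(name, _)| *name)
}

fn main() {
    assert_eq!(
        dispatch("https://www.reddit.com/r/rust/comments/abc/x/"),
        Some("reddit")
    );
    assert_eq!(dispatch("https://pypi.org/project/requests/"), Some("pypi"));
    assert_eq!(dispatch("https://example.com/"), None);
    println!("ok");
}
```

A table form like this trades the explicitness of the if-chain for easier reordering; the crate keeps the chain so each branch can carry its own comments.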
crates/webclaw-fetch/src/extractors/npm.rs (new file, 235 lines)

@@ -0,0 +1,235 @@
//! npm package structured extractor.
//!
//! Uses two npm-operated APIs:
//! - `registry.npmjs.org/{name}` for full package metadata
//! - `api.npmjs.org/downloads/point/last-week/{name}` for usage signal
//!
//! The registry API returns the *full* document including every version
//! ever published, which can be tens of MB for popular packages
//! (`@types/node` etc). We strip down to the latest version's manifest
//! and a count of releases — full history would explode the response.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "npm",
    label: "npm package",
    description: "Returns package metadata: latest version manifest, dependencies, weekly downloads, license.",
    url_patterns: &["https://www.npmjs.com/package/{name}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "www.npmjs.com" && host != "npmjs.com" {
        return false;
    }
    url.contains("/package/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let name = parse_name(url)
        .ok_or_else(|| FetchError::Build(format!("npm: cannot parse name from '{url}'")))?;

    let registry_url = format!("https://registry.npmjs.org/{}", urlencode_segment(&name));
    let resp = client.fetch(&registry_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "npm: package '{name}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "npm registry returned status {}",
            resp.status
        )));
    }

    let pkg: PackageDoc = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("npm registry parse: {e}")))?;

    // Resolve "latest" to a concrete version.
    let latest_version = pkg
        .dist_tags
        .as_ref()
        .and_then(|t| t.get("latest"))
        .cloned()
        .or_else(|| pkg.versions.as_ref().and_then(|v| v.keys().last().cloned()));

    let latest_manifest = latest_version
        .as_deref()
        .and_then(|v| pkg.versions.as_ref().and_then(|m| m.get(v)));

    let release_count = pkg.versions.as_ref().map(|v| v.len()).unwrap_or(0);
    let latest_release_date = latest_version
        .as_deref()
        .and_then(|v| pkg.time.as_ref().and_then(|t| t.get(v).cloned()));

    // Best-effort weekly downloads. If the api.npmjs.org call fails we
    // surface `null` rather than failing the whole extractor — npm
    // sometimes 503s the downloads endpoint while the registry is up.
    let weekly_downloads = fetch_weekly_downloads(client, &name).await.ok();

    Ok(json!({
        "url": url,
        "name": pkg.name.clone().unwrap_or(name.clone()),
        "description": pkg.description,
        "latest_version": latest_version,
        "license": latest_manifest.and_then(|m| m.license.clone()),
        "homepage": pkg.homepage,
        "repository": pkg.repository.as_ref().and_then(|r| r.url.clone()),
        "dependencies": latest_manifest.and_then(|m| m.dependencies.clone()),
        "dev_dependencies": latest_manifest.and_then(|m| m.dev_dependencies.clone()),
        "peer_dependencies": latest_manifest.and_then(|m| m.peer_dependencies.clone()),
        "keywords": pkg.keywords,
        "maintainers": pkg.maintainers,
        "deprecated": latest_manifest.and_then(|m| m.deprecated.clone()),
        "release_count": release_count,
        "latest_release_date": latest_release_date,
        "weekly_downloads": weekly_downloads,
    }))
}

async fn fetch_weekly_downloads(client: &dyn Fetcher, name: &str) -> Result<i64, FetchError> {
    let url = format!(
        "https://api.npmjs.org/downloads/point/last-week/{}",
        urlencode_segment(name)
    );
    let resp = client.fetch(&url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "npm downloads api status {}",
            resp.status
        )));
    }
    let dl: Downloads = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("npm downloads parse: {e}")))?;
    Ok(dl.downloads)
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Extract the package name from an npmjs.com URL. Handles scoped packages
/// (`/package/@scope/name`) and trailing path segments (`/v/x.y.z`).
fn parse_name(url: &str) -> Option<String> {
    let after = url.split("/package/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let first = segs.next()?;
    if first.starts_with('@') {
        let second = segs.next()?;
        Some(format!("{first}/{second}"))
    } else {
        Some(first.to_string())
    }
}

/// `@scope/name` must encode the `/` for the registry path. Plain names
/// pass through untouched.
fn urlencode_segment(name: &str) -> String {
    name.replace('/', "%2F")
}

// ---------------------------------------------------------------------------
// Registry types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct PackageDoc {
    name: Option<String>,
    description: Option<String>,
    homepage: Option<serde_json::Value>, // sometimes string, sometimes object
    repository: Option<Repository>,
    keywords: Option<Vec<String>>,
    maintainers: Option<Vec<Maintainer>>,
    #[serde(rename = "dist-tags")]
    dist_tags: Option<std::collections::BTreeMap<String, String>>,
    versions: Option<std::collections::BTreeMap<String, VersionManifest>>,
    time: Option<std::collections::BTreeMap<String, String>>,
}

#[derive(Deserialize, Default, Clone)]
struct VersionManifest {
    license: Option<serde_json::Value>, // string or object
    dependencies: Option<std::collections::BTreeMap<String, String>>,
    #[serde(rename = "devDependencies")]
    dev_dependencies: Option<std::collections::BTreeMap<String, String>>,
    #[serde(rename = "peerDependencies")]
    peer_dependencies: Option<std::collections::BTreeMap<String, String>>,
    // `deprecated` is sometimes a bool and sometimes a string in the
    // registry. serde_json::Value covers both without failing the parse.
    deprecated: Option<serde_json::Value>,
}

#[derive(Deserialize)]
struct Repository {
    url: Option<String>,
}

#[derive(Deserialize, Clone)]
struct Maintainer {
    name: Option<String>,
    email: Option<String>,
}

impl serde::Serialize for Maintainer {
    fn serialize<S: serde::Serializer>(&self, s: S) -> Result<S::Ok, S::Error> {
        use serde::ser::SerializeMap;
        let mut m = s.serialize_map(Some(2))?;
        m.serialize_entry("name", &self.name)?;
        m.serialize_entry("email", &self.email)?;
        m.end()
    }
}

#[derive(Deserialize)]
struct Downloads {
    downloads: i64,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_npm_package_urls() {
        assert!(matches("https://www.npmjs.com/package/react"));
        assert!(matches("https://www.npmjs.com/package/@types/node"));
        assert!(matches("https://npmjs.com/package/lodash"));
        assert!(!matches("https://www.npmjs.com/"));
        assert!(!matches("https://example.com/package/foo"));
    }

    #[test]
    fn parse_name_handles_scoped_and_unscoped() {
        assert_eq!(
            parse_name("https://www.npmjs.com/package/react"),
            Some("react".into())
        );
        assert_eq!(
            parse_name("https://www.npmjs.com/package/@types/node"),
            Some("@types/node".into())
        );
        assert_eq!(
            parse_name("https://www.npmjs.com/package/lodash/v/4.17.21"),
            Some("lodash".into())
        );
    }

    #[test]
    fn urlencode_only_touches_scope_separator() {
        assert_eq!(urlencode_segment("react"), "react");
        assert_eq!(urlencode_segment("@types/node"), "@types%2Fnode");
    }
}
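The scope-encoding rule can be seen end-to-end in a standalone sketch (the `registry_url` helper name is illustrative, not part of the crate): only the `/` between scope and name is percent-encoded; the `@` is sent literally, which the npm registry accepts.

```rust
// Standalone sketch: building the registry document URL for a package name.
// Only the scope separator `/` needs encoding; `@` passes through as-is.
fn registry_url(name: &str) -> String {
    format!("https://registry.npmjs.org/{}", name.replace('/', "%2F"))
}

fn main() {
    assert_eq!(registry_url("react"), "https://registry.npmjs.org/react");
    assert_eq!(
        registry_url("@types/node"),
        "https://registry.npmjs.org/@types%2Fnode"
    );
    println!("ok");
}
```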
crates/webclaw-fetch/src/extractors/pypi.rs (new file, 184 lines)

@@ -0,0 +1,184 @@
//! PyPI package structured extractor.
//!
//! PyPI exposes a stable JSON API at `pypi.org/pypi/{name}/json` and
//! a versioned form at `pypi.org/pypi/{name}/{version}/json`. Both
//! return the full release info plus history. No auth, no rate limits
//! that we hit at normal usage.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "pypi",
    label: "PyPI package",
    description: "Returns package metadata: latest version, dependencies, license, release history.",
    url_patterns: &[
        "https://pypi.org/project/{name}/",
        "https://pypi.org/project/{name}/{version}/",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "pypi.org" && host != "www.pypi.org" {
        return false;
    }
    url.contains("/project/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (name, version) = parse_project(url).ok_or_else(|| {
        FetchError::Build(format!("pypi: cannot parse package name from '{url}'"))
    })?;

    let api_url = match &version {
        Some(v) => format!("https://pypi.org/pypi/{name}/{v}/json"),
        None => format!("https://pypi.org/pypi/{name}/json"),
    };
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "pypi: package '{name}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "pypi api returned status {}",
            resp.status
        )));
    }

    let pkg: PypiResponse = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("pypi parse: {e}")))?;

    let info = pkg.info;
    let release_count = pkg.releases.as_ref().map(|r| r.len()).unwrap_or(0);

    // Latest release date = max upload time across files in the latest version.
    let latest_release_date = pkg
        .releases
        .as_ref()
        .and_then(|map| info.version.as_deref().and_then(|v| map.get(v)))
        .and_then(|files| files.iter().filter_map(|f| f.upload_time.clone()).max());

    // Drop the long description from the JSON shape — it's frequently a 50KB
    // README and bloats responses. Callers who need it can hit /v1/scrape.
    Ok(json!({
        "url": url,
        "name": info.name,
        "version": info.version,
        "summary": info.summary,
        "homepage": info.home_page,
        "license": info.license,
        "license_classifier": pick_license_classifier(&info.classifiers),
        "author": info.author,
        "author_email": info.author_email,
        "maintainer": info.maintainer,
        "requires_python": info.requires_python,
        "requires_dist": info.requires_dist,
        "keywords": info.keywords,
        "classifiers": info.classifiers,
        "yanked": info.yanked,
        "yanked_reason": info.yanked_reason,
        "project_urls": info.project_urls,
        "release_count": release_count,
        "latest_release_date": latest_release_date,
    }))
}

/// PyPI puts the SPDX-ish license under classifiers like
/// `License :: OSI Approved :: Apache Software License`. Surface the most
/// specific one when the `license` field itself is empty/junk.
fn pick_license_classifier(classifiers: &Option<Vec<String>>) -> Option<String> {
    classifiers
        .as_ref()?
        .iter()
        .filter(|c| c.starts_with("License ::"))
        .max_by_key(|c| c.len())
        .cloned()
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_project(url: &str) -> Option<(String, Option<String>)> {
    let after = url.split("/project/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let name = segs.next()?.to_string();
    let version = segs.next().map(|v| v.to_string());
    Some((name, version))
}

// ---------------------------------------------------------------------------
// PyPI API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct PypiResponse {
    info: Info,
    releases: Option<std::collections::BTreeMap<String, Vec<File>>>,
}

#[derive(Deserialize)]
struct Info {
    name: Option<String>,
    version: Option<String>,
    summary: Option<String>,
    home_page: Option<String>,
    license: Option<String>,
    author: Option<String>,
    author_email: Option<String>,
    maintainer: Option<String>,
    requires_python: Option<String>,
    requires_dist: Option<Vec<String>>,
    keywords: Option<String>,
    classifiers: Option<Vec<String>>,
    yanked: Option<bool>,
    yanked_reason: Option<String>,
    project_urls: Option<std::collections::BTreeMap<String, String>>,
}

#[derive(Deserialize)]
struct File {
    upload_time: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_project_urls() {
        assert!(matches("https://pypi.org/project/requests/"));
        assert!(matches("https://pypi.org/project/numpy/1.26.0/"));
        assert!(!matches("https://pypi.org/"));
        assert!(!matches("https://example.com/project/foo"));
    }

    #[test]
    fn parse_project_pulls_name_and_version() {
        assert_eq!(
            parse_project("https://pypi.org/project/requests/"),
            Some(("requests".into(), None))
        );
        assert_eq!(
            parse_project("https://pypi.org/project/numpy/1.26.0/"),
            Some(("numpy".into(), Some("1.26.0".into())))
        );
        assert_eq!(
            parse_project("https://pypi.org/project/scikit-learn/?foo=bar"),
            Some(("scikit-learn".into(), None))
        );
    }
}
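The two endpoint forms the module doc describes can be sketched standalone (the `api_url` helper name is illustrative): the bare form resolves to the latest release, the versioned form pins a specific one.

```rust
// Sketch of the two PyPI JSON endpoints the extractor targets.
fn api_url(name: &str, version: Option<&str>) -> String {
    match version {
        // Versioned form: pins one release's metadata.
        Some(v) => format!("https://pypi.org/pypi/{name}/{v}/json"),
        // Bare form: PyPI resolves to the latest release.
        None => format!("https://pypi.org/pypi/{name}/json"),
    }
}

fn main() {
    assert_eq!(
        api_url("requests", None),
        "https://pypi.org/pypi/requests/json"
    );
    assert_eq!(
        api_url("numpy", Some("1.26.0")),
        "https://pypi.org/pypi/numpy/1.26.0/json"
    );
    println!("ok");
}
```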
crates/webclaw-fetch/src/extractors/reddit.rs (new file, 234 lines)

@@ -0,0 +1,234 @@
//! Reddit structured extractor — returns the full post + comment tree
|
||||
//! as typed JSON via Reddit's `.json` API.
|
||||
//!
|
||||
//! The same trick the markdown extractor in `crate::reddit` uses:
|
||||
//! appending `.json` to any post URL returns the data the new SPA
|
||||
//! frontend would load client-side. Zero antibot, zero JS rendering.
|
||||
|
||||
use serde::Deserialize;
|
||||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "reddit",
|
||||
label: "Reddit thread",
|
||||
description: "Returns post + nested comment tree with scores, authors, and timestamps.",
|
||||
url_patterns: &[
|
||||
"https://www.reddit.com/r/*/comments/*",
|
||||
"https://reddit.com/r/*/comments/*",
|
||||
"https://old.reddit.com/r/*/comments/*",
|
||||
],
|
||||
};
|
||||
|
||||
pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    let is_reddit_host = matches!(
        host,
        "reddit.com" | "www.reddit.com" | "old.reddit.com" | "np.reddit.com" | "new.reddit.com"
    );
    is_reddit_host && url.contains("/comments/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let json_url = build_json_url(url);
    let resp = client.fetch(&json_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "reddit api returned status {}",
            resp.status
        )));
    }

    let listings: Vec<Listing> = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("reddit json parse: {e}")))?;

    if listings.is_empty() {
        return Err(FetchError::BodyDecode("reddit response empty".into()));
    }

    // First listing = the post (single t3 child).
    let post = listings
        .first()
        .and_then(|l| l.data.children.first())
        .filter(|t| t.kind == "t3")
        .map(|t| post_json(&t.data))
        .unwrap_or(Value::Null);

    // Second listing = the comment tree.
    let comments: Vec<Value> = listings
        .get(1)
        .map(|l| l.data.children.iter().filter_map(comment_json).collect())
        .unwrap_or_default();

    Ok(json!({
        "url": url,
        "post": post,
        "comments": comments,
    }))
}

// ---------------------------------------------------------------------------
// JSON shapers
// ---------------------------------------------------------------------------

fn post_json(d: &ThingData) -> Value {
    json!({
        "id": d.id,
        "title": d.title,
        "author": d.author,
        "subreddit": d.subreddit_name_prefixed,
        "permalink": d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")),
        "url": d.url_overridden_by_dest,
        "is_self": d.is_self,
        "selftext": d.selftext,
        "score": d.score,
        "upvote_ratio": d.upvote_ratio,
        "num_comments": d.num_comments,
        "created_utc": d.created_utc,
        "link_flair_text": d.link_flair_text,
        "over_18": d.over_18,
        "spoiler": d.spoiler,
        "stickied": d.stickied,
        "locked": d.locked,
    })
}

/// Render a single comment + its reply tree. Returns `None` for non-t1
/// kinds (the trailing `more` placeholder Reddit injects at depth limits).
fn comment_json(thing: &Thing) -> Option<Value> {
    if thing.kind != "t1" {
        return None;
    }
    let d = &thing.data;
    let replies: Vec<Value> = match &d.replies {
        Some(Replies::Listing(l)) => l.data.children.iter().filter_map(comment_json).collect(),
        _ => Vec::new(),
    };
    Some(json!({
        "id": d.id,
        "author": d.author,
        "body": d.body,
        "score": d.score,
        "created_utc": d.created_utc,
        "is_submitter": d.is_submitter,
        "stickied": d.stickied,
        "depth": d.depth,
        "permalink": d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")),
        "replies": replies,
    }))
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Build the Reddit JSON URL. We keep the original host (`www.reddit.com`
/// or `old.reddit.com`, as the caller gave us). Routing through
/// `old.reddit.com` unconditionally looks appealing, but that host has
/// stricter UA-based blocking than `www.reddit.com`, while the main
/// host accepts our Chrome-fingerprinted client fine.
fn build_json_url(url: &str) -> String {
    let clean = url.split('?').next().unwrap_or(url).trim_end_matches('/');
    format!("{clean}.json?raw_json=1")
}

// ---------------------------------------------------------------------------
// Reddit JSON types — only fields we render. Everything else is dropped.
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Listing {
    data: ListingData,
}

#[derive(Deserialize)]
struct ListingData {
    children: Vec<Thing>,
}

#[derive(Deserialize)]
struct Thing {
    kind: String,
    data: ThingData,
}

#[derive(Deserialize, Default)]
struct ThingData {
    // post (t3)
    id: Option<String>,
    title: Option<String>,
    selftext: Option<String>,
    subreddit_name_prefixed: Option<String>,
    url_overridden_by_dest: Option<String>,
    is_self: Option<bool>,
    upvote_ratio: Option<f64>,
    num_comments: Option<i64>,
    over_18: Option<bool>,
    spoiler: Option<bool>,
    stickied: Option<bool>,
    locked: Option<bool>,
    link_flair_text: Option<String>,

    // comment (t1)
    author: Option<String>,
    body: Option<String>,
    score: Option<i64>,
    created_utc: Option<f64>,
    is_submitter: Option<bool>,
    depth: Option<i64>,
    permalink: Option<String>,

    // recursive
    replies: Option<Replies>,
}

#[derive(Deserialize)]
#[serde(untagged)]
enum Replies {
    Listing(Listing),
    #[allow(dead_code)]
    Empty(String),
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_reddit_post_urls() {
        assert!(matches(
            "https://www.reddit.com/r/rust/comments/abc123/some_title/"
        ));
        assert!(matches(
            "https://reddit.com/r/rust/comments/abc123/some_title"
        ));
        assert!(matches("https://old.reddit.com/r/rust/comments/abc123/x/"));
    }

    #[test]
    fn rejects_non_post_reddit_urls() {
        assert!(!matches("https://www.reddit.com/r/rust"));
        assert!(!matches("https://www.reddit.com/user/foo"));
        assert!(!matches("https://example.com/r/rust/comments/x"));
    }

    #[test]
    fn json_url_appends_suffix_and_drops_query() {
        assert_eq!(
            build_json_url("https://www.reddit.com/r/rust/comments/abc/x/?utm=foo"),
            "https://www.reddit.com/r/rust/comments/abc/x.json?raw_json=1"
        );
    }
}
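Consumers of this extractor get comments as the same nested `{"body": …, "replies": […]}` shape `comment_json` emits. A minimal sketch of walking that tree depth-first to a flat list — `Comment` and `flatten` are hypothetical illustration names, not crate code, and the struct stands in for the JSON so no serde is needed:

```rust
// Hypothetical mirror of the extractor's per-comment output: a body plus
// nested replies.
struct Comment {
    body: String,
    replies: Vec<Comment>,
}

// Depth-first walk: parent first, then each reply subtree, matching the
// nesting order of the extractor's `replies` arrays.
fn flatten(c: &Comment, out: &mut Vec<String>) {
    out.push(c.body.clone());
    for r in &c.replies {
        flatten(r, out);
    }
}

fn main() {
    let tree = Comment {
        body: "top".into(),
        replies: vec![
            Comment {
                body: "child".into(),
                replies: vec![Comment { body: "grandchild".into(), replies: vec![] }],
            },
            Comment { body: "sibling".into(), replies: vec![] },
        ],
    };
    let mut out = Vec::new();
    flatten(&tree, &mut out);
    assert_eq!(out, ["top", "child", "grandchild", "sibling"]);
}
```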
242 crates/webclaw-fetch/src/extractors/shopify_collection.rs Normal file
@@ -0,0 +1,242 @@
//! Shopify collection structured extractor.
//!
//! Every Shopify store exposes `/collections/{handle}.json` and
//! `/collections/{handle}/products.json` on the public surface. This
//! extractor hits `.json` (collection metadata) and falls through to
//! `/products.json` for the first page of products. Same caveat as
//! `shopify_product`: stores with Cloudflare in front of the shop
//! will 403 the public path.
//!
//! Explicit-call only (like `shopify_product`). `/collections/{slug}`
//! is a URL shape used by non-Shopify stores too, so auto-dispatch
//! would claim too many URLs.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "shopify_collection",
    label: "Shopify collection",
    description: "Returns collection metadata + first page of products (handle, title, vendor, price, available) on ANY Shopify store via /collections/{handle}.json + /products.json.",
    url_patterns: &[
        "https://{shop}/collections/{handle}",
        "https://{shop}.myshopify.com/collections/{handle}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
        return false;
    }
    url.contains("/collections/") && !url.ends_with("/collections/")
}

const NON_SHOPIFY_HOSTS: &[&str] = &[
    "amazon.com",
    "amazon.co.uk",
    "amazon.de",
    "ebay.com",
    "etsy.com",
    "walmart.com",
    "target.com",
    "aliexpress.com",
    "huggingface.co", // has /collections/ for models
    "github.com",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (coll_meta_url, coll_products_url) = build_json_urls(url);

    // Step 1: collection metadata. Shopify sometimes returns 200 on
    // missing collections; the "collection" key check below covers that.
    let meta_resp = client.fetch(&coll_meta_url).await?;
    if meta_resp.status == 404 {
        return Err(FetchError::Build(format!(
            "shopify_collection: '{url}' not found"
        )));
    }
    if meta_resp.status == 403 {
        return Err(FetchError::Build(format!(
            "shopify_collection: {coll_meta_url} returned 403. The store has antibot in front of the .json endpoint. Use /v1/scrape/ecommerce_product or api.webclaw.io for this store."
        )));
    }
    if meta_resp.status != 200 {
        return Err(FetchError::Build(format!(
            "shopify returned status {} for {coll_meta_url}",
            meta_resp.status
        )));
    }

    let meta: MetaWrapper = serde_json::from_str(&meta_resp.html).map_err(|e| {
        FetchError::BodyDecode(format!(
            "shopify_collection: '{url}' didn't return Shopify JSON, likely not a Shopify store ({e})"
        ))
    })?;

    // Step 2: first page of products for this collection.
    let products = match client.fetch(&coll_products_url).await {
        Ok(r) if r.status == 200 => serde_json::from_str::<ProductsWrapper>(&r.html)
            .ok()
            .map(|pw| pw.products)
            .unwrap_or_default(),
        _ => Vec::new(),
    };

    let product_summaries: Vec<Value> = products
        .iter()
        .map(|p| {
            let first_variant = p.variants.first();
            json!({
                "id": p.id,
                "handle": p.handle,
                "title": p.title,
                "vendor": p.vendor,
                "product_type": p.product_type,
                "price": first_variant.and_then(|v| v.price.clone()),
                "compare_at_price": first_variant.and_then(|v| v.compare_at_price.clone()),
                "available": p.variants.iter().any(|v| v.available.unwrap_or(false)),
                "variant_count": p.variants.len(),
                "image": p.images.first().and_then(|i| i.src.clone()),
                "created_at": p.created_at,
                "updated_at": p.updated_at,
            })
        })
        .collect();

    let c = meta.collection;
    Ok(json!({
        "url": url,
        "meta_json_url": coll_meta_url,
        "products_json_url": coll_products_url,
        "collection_id": c.id,
        "handle": c.handle,
        "title": c.title,
        "description_html": c.body_html,
        "published_at": c.published_at,
        "updated_at": c.updated_at,
        "sort_order": c.sort_order,
        "products_in_page": product_summaries.len(),
        "products": product_summaries,
    }))
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Build `(collection.json, collection/products.json)` from a user URL.
fn build_json_urls(url: &str) -> (String, String) {
    let path_part = url.split('?').next().unwrap_or(url);
    let clean = path_part.trim_end_matches('/').trim_end_matches(".json");
    (
        format!("{clean}.json"),
        format!("{clean}/products.json?limit=50"),
    )
}

// ---------------------------------------------------------------------------
// Shopify collection + product JSON shapes (subsets)
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct MetaWrapper {
    collection: Collection,
}

#[derive(Deserialize)]
struct Collection {
    id: Option<i64>,
    handle: Option<String>,
    title: Option<String>,
    body_html: Option<String>,
    published_at: Option<String>,
    updated_at: Option<String>,
    sort_order: Option<String>,
}

#[derive(Deserialize)]
struct ProductsWrapper {
    #[serde(default)]
    products: Vec<ProductSummary>,
}

#[derive(Deserialize)]
struct ProductSummary {
    id: Option<i64>,
    handle: Option<String>,
    title: Option<String>,
    vendor: Option<String>,
    product_type: Option<String>,
    created_at: Option<String>,
    updated_at: Option<String>,
    #[serde(default)]
    variants: Vec<VariantSummary>,
    #[serde(default)]
    images: Vec<ImageSummary>,
}

#[derive(Deserialize)]
struct VariantSummary {
    price: Option<String>,
    compare_at_price: Option<String>,
    available: Option<bool>,
}

#[derive(Deserialize)]
struct ImageSummary {
    src: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_shopify_collection_urls() {
        assert!(matches("https://www.allbirds.com/collections/mens"));
        assert!(matches(
            "https://shop.example.com/collections/new-arrivals?page=2"
        ));
    }

    #[test]
    fn rejects_non_shopify() {
        assert!(!matches("https://github.com/collections/foo"));
        assert!(!matches("https://huggingface.co/collections/foo"));
        assert!(!matches("https://example.com/"));
        assert!(!matches("https://example.com/collections/"));
    }

    #[test]
    fn build_json_urls_derives_both_paths() {
        let (meta, products) = build_json_urls("https://shop.example.com/collections/mens");
        assert_eq!(meta, "https://shop.example.com/collections/mens.json");
        assert_eq!(
            products,
            "https://shop.example.com/collections/mens/products.json?limit=50"
        );
    }

    #[test]
    fn build_json_urls_handles_trailing_slash() {
        let (meta, _) = build_json_urls("https://shop.example.com/collections/mens/");
        assert_eq!(meta, "https://shop.example.com/collections/mens.json");
    }
}
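The extractor deliberately stops at the first page (`limit=50`). A sketch of how a caller could page further, assuming Shopify's standard `page=N` parameter on `/products.json` — `fetch_all_pages` and the closure-based fake fetch are illustrative only, not part of this crate:

```rust
// Collect product handles across pages until an empty page comes back,
// which is how Shopify's /products.json signals the end of a collection.
// The network call is mocked as a closure so the loop shape is the point.
fn fetch_all_pages(mut fetch_page: impl FnMut(u32) -> Vec<String>) -> Vec<String> {
    let mut all = Vec::new();
    let mut page = 1;
    loop {
        let batch = fetch_page(page);
        if batch.is_empty() {
            break; // empty products array => past the last page
        }
        all.extend(batch);
        page += 1;
    }
    all
}

fn main() {
    // Two fake pages, then nothing.
    let pages = vec![
        vec!["a".to_string(), "b".to_string()],
        vec!["c".to_string()],
    ];
    let got = fetch_all_pages(|p| pages.get((p - 1) as usize).cloned().unwrap_or_default());
    assert_eq!(got, ["a", "b", "c"]);
}
```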
318 crates/webclaw-fetch/src/extractors/shopify_product.rs Normal file
@@ -0,0 +1,318 @@
//! Shopify product structured extractor.
//!
//! Every Shopify store exposes a public JSON endpoint for each product
//! by appending `.json` to the product URL:
//!
//!     https://shop.example.com/products/cool-tshirt
//!     → https://shop.example.com/products/cool-tshirt.json
//!
//! There are ~4 million Shopify stores. The `.json` endpoint is
//! undocumented but has been stable for 10+ years. When a store puts
//! Cloudflare / antibot in front of the shop, this path can 403 just
//! like any other — for those cases the caller should fall back to
//! `ecommerce_product` (JSON-LD) or the cloud tier.
//!
//! This extractor is **explicit-call only** — it is NOT auto-dispatched
//! from `/v1/scrape` because we cannot tell ahead of time whether an
//! arbitrary `/products/{slug}` URL is a Shopify store. Callers hit
//! `/v1/scrape/shopify_product` when they know.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "shopify_product",
    label: "Shopify product",
    description: "Returns product metadata on ANY Shopify store via the public /products/{handle}.json endpoint: title, vendor, variants with prices + stock, images, options.",
    url_patterns: &[
        "https://{shop}/products/{handle}",
        "https://{shop}.myshopify.com/products/{handle}",
    ],
};

pub fn matches(url: &str) -> bool {
    // Any URL whose path contains /products/{something}. We do not
    // filter by host — Shopify powers custom-domain stores. The
    // extractor's /.json fallback is what confirms Shopify; `matches`
    // just says "this is a plausible shape." Still reject obviously
    // non-Shopify known hosts to save a failed request.
    let host = host_of(url);
    if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
        return false;
    }
    url.contains("/products/") && !url.ends_with("/products/")
}

/// Hosts we know are not Shopify — reject so we don't burn a request.
const NON_SHOPIFY_HOSTS: &[&str] = &[
    "amazon.com",
    "amazon.co.uk",
    "amazon.de",
    "amazon.fr",
    "amazon.it",
    "ebay.com",
    "etsy.com",
    "walmart.com",
    "target.com",
    "aliexpress.com",
    "bestbuy.com",
    "wayfair.com",
    "homedepot.com",
    "github.com", // /products is a marketing page
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let json_url = build_json_url(url);
    let resp = client.fetch(&json_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "shopify_product: '{url}' not found (got 404 from {json_url})"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(format!(
            "shopify_product: {json_url} returned 403 — the store has antibot in front of the .json endpoint. Try /v1/scrape/ecommerce_product for the HTML + JSON-LD fallback."
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "shopify returned status {} for {json_url}",
            resp.status
        )));
    }

    let body: Wrapper = serde_json::from_str(&resp.html).map_err(|e| {
        FetchError::BodyDecode(format!(
            "shopify_product: '{url}' didn't return Shopify JSON — likely not a Shopify store ({e})"
        ))
    })?;
    let p = body.product;

    let variants: Vec<Value> = p
        .variants
        .iter()
        .map(|v| {
            json!({
                "id": v.id,
                "title": v.title,
                "sku": v.sku,
                "barcode": v.barcode,
                "price": v.price,
                "compare_at_price": v.compare_at_price,
                "available": v.available,
                "inventory_quantity": v.inventory_quantity,
                "position": v.position,
                "weight": v.weight,
                "weight_unit": v.weight_unit,
                "requires_shipping": v.requires_shipping,
                "taxable": v.taxable,
                "option1": v.option1,
                "option2": v.option2,
                "option3": v.option3,
            })
        })
        .collect();

    let images: Vec<Value> = p
        .images
        .iter()
        .map(|i| {
            json!({
                "src": i.src,
                "width": i.width,
                "height": i.height,
                "position": i.position,
                "alt": i.alt,
            })
        })
        .collect();

    let options: Vec<Value> = p
        .options
        .iter()
        .map(|o| json!({"name": o.name, "values": o.values, "position": o.position}))
        .collect();

    // Price range + availability summary across variants (the shape
    // agents typically want without walking the variants array).
    let prices: Vec<f64> = p
        .variants
        .iter()
        .filter_map(|v| v.price.as_deref().and_then(|s| s.parse::<f64>().ok()))
        .collect();
    let any_available = p.variants.iter().any(|v| v.available.unwrap_or(false));
    let price_min = prices.iter().cloned().fold(f64::INFINITY, f64::min);
    let price_max = prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max);

    Ok(json!({
        "url": url,
        "json_url": json_url,
        "product_id": p.id,
        "handle": p.handle,
        "title": p.title,
        "vendor": p.vendor,
        "product_type": p.product_type,
        "tags": p.tags,
        "description_html": p.body_html,
        "published_at": p.published_at,
        "created_at": p.created_at,
        "updated_at": p.updated_at,
        "variant_count": variants.len(),
        "image_count": images.len(),
        "any_available": any_available,
        "price_min": price_min.is_finite().then_some(price_min),
        "price_max": price_max.is_finite().then_some(price_max),
        "variants": variants,
        "images": images,
        "options": options,
    }))
}

/// Build the .json path from a product URL. Handles pre-.jsoned URLs,
/// trailing slashes, and query strings.
fn build_json_url(url: &str) -> String {
    let (path_part, query_part) = match url.split_once('?') {
        Some((a, b)) => (a, Some(b)),
        None => (url, None),
    };
    let clean = path_part.trim_end_matches('/');
    let with_json = if clean.ends_with(".json") {
        clean.to_string()
    } else {
        format!("{clean}.json")
    };
    match query_part {
        Some(q) => format!("{with_json}?{q}"),
        None => with_json,
    }
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

// ---------------------------------------------------------------------------
// Shopify product JSON shape (a subset of the full response)
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Wrapper {
    product: Product,
}

#[derive(Deserialize)]
struct Product {
    id: Option<i64>,
    title: Option<String>,
    handle: Option<String>,
    vendor: Option<String>,
    product_type: Option<String>,
    body_html: Option<String>,
    published_at: Option<String>,
    created_at: Option<String>,
    updated_at: Option<String>,
    #[serde(default)]
    tags: serde_json::Value, // array OR comma-joined string depending on store
    #[serde(default)]
    variants: Vec<Variant>,
    #[serde(default)]
    images: Vec<Image>,
    #[serde(default)]
    options: Vec<Option_>,
}

#[derive(Deserialize)]
struct Variant {
    id: Option<i64>,
    title: Option<String>,
    sku: Option<String>,
    barcode: Option<String>,
    price: Option<String>,
    compare_at_price: Option<String>,
    available: Option<bool>,
    inventory_quantity: Option<i64>,
    position: Option<i64>,
    weight: Option<f64>,
    weight_unit: Option<String>,
    requires_shipping: Option<bool>,
    taxable: Option<bool>,
    option1: Option<String>,
    option2: Option<String>,
    option3: Option<String>,
}

#[derive(Deserialize)]
struct Image {
    src: Option<String>,
    width: Option<i64>,
    height: Option<i64>,
    position: Option<i64>,
    alt: Option<String>,
}

#[derive(Deserialize)]
#[serde(rename_all = "lowercase")]
struct Option_ {
    name: Option<String>,
    position: Option<i64>,
    #[serde(default)]
    values: Vec<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_plausible_shopify_urls() {
        assert!(matches(
            "https://www.allbirds.com/products/mens-tree-runners"
        ));
        assert!(matches(
            "https://shop.example.com/products/cool-tshirt?variant=123"
        ));
        assert!(matches("https://somestore.myshopify.com/products/thing-1"));
    }

    #[test]
    fn rejects_known_non_shopify() {
        assert!(!matches("https://www.amazon.com/dp/B0C123"));
        assert!(!matches("https://www.etsy.com/listing/12345/foo"));
        assert!(!matches("https://www.amazon.co.uk/products/thing"));
        assert!(!matches("https://github.com/products"));
    }

    #[test]
    fn rejects_non_product_urls() {
        assert!(!matches("https://example.com/"));
        assert!(!matches("https://example.com/products/"));
        assert!(!matches("https://example.com/collections/all"));
    }

    #[test]
    fn build_json_url_handles_slash_and_query() {
        assert_eq!(
            build_json_url("https://shop.example.com/products/foo"),
            "https://shop.example.com/products/foo.json"
        );
        assert_eq!(
            build_json_url("https://shop.example.com/products/foo/"),
            "https://shop.example.com/products/foo.json"
        );
        assert_eq!(
            build_json_url("https://shop.example.com/products/foo?variant=123"),
            "https://shop.example.com/products/foo.json?variant=123"
        );
        assert_eq!(
            build_json_url("https://shop.example.com/products/foo.json"),
            "https://shop.example.com/products/foo.json"
        );
    }
}
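The price-range summary relies on a fold-and-guard trick: fold string prices into a min/max starting from ±infinity, then convert a still-infinite result (no parsable prices) into `None`. A standalone restatement — `price_range` is an illustrative name, not crate code:

```rust
// Shopify serializes prices as strings; unparsable entries are skipped,
// and an all-unparsable (or empty) list yields None instead of ±infinity.
fn price_range(prices: &[&str]) -> (Option<f64>, Option<f64>) {
    let parsed: Vec<f64> = prices.iter().filter_map(|s| s.parse().ok()).collect();
    let min = parsed.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = parsed.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    (min.is_finite().then_some(min), max.is_finite().then_some(max))
}

fn main() {
    assert_eq!(
        price_range(&["19.99", "24.50", "oops"]),
        (Some(19.99), Some(24.5))
    );
    assert_eq!(price_range(&["n/a"]), (None, None));
}
```

Folding from infinity rather than `Option`-chaining keeps NaN out of the comparison path and makes the empty case fall out of `is_finite()` for free.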
216 crates/webclaw-fetch/src/extractors/stackoverflow.rs Normal file
@@ -0,0 +1,216 @@
//! Stack Overflow Q&A structured extractor.
//!
//! Uses the Stack Exchange API at `api.stackexchange.com/2.3/questions/{id}`
//! with `site=stackoverflow`. Two calls: one for the question, one for
//! its answers. Both come pre-filtered to include the rendered HTML body
//! so we don't re-parse the question page itself.
//!
//! Anonymous access caps at 300 requests per IP per day. Production
//! cloud should set `STACKAPPS_KEY` to lift that to 10,000/day, but we
//! don't require it to work out of the box.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "stackoverflow",
    label: "Stack Overflow Q&A",
    description: "Returns question + answers: title, body, tags, votes, accepted answer, top answers.",
    url_patterns: &["https://stackoverflow.com/questions/{id}/{slug}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "stackoverflow.com" && host != "www.stackoverflow.com" {
        return false;
    }
    parse_question_id(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let id = parse_question_id(url).ok_or_else(|| {
        FetchError::Build(format!(
            "stackoverflow: cannot parse question id from '{url}'"
        ))
    })?;

    // Filter `withbody` includes the rendered HTML body for both questions
    // and answers. Stack Exchange's filter system is documented at
    // api.stackexchange.com/docs/filters.
    let q_url = format!(
        "https://api.stackexchange.com/2.3/questions/{id}?site=stackoverflow&filter=withbody"
    );
    let q_resp = client.fetch(&q_url).await?;
    if q_resp.status != 200 {
        return Err(FetchError::Build(format!(
            "stackexchange api returned status {}",
            q_resp.status
        )));
    }
    let q_body: QResponse = serde_json::from_str(&q_resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("stackoverflow q parse: {e}")))?;
    let q = q_body
        .items
        .first()
        .ok_or_else(|| FetchError::Build(format!("stackoverflow: question {id} not found")))?;

    let a_url = format!(
        "https://api.stackexchange.com/2.3/questions/{id}/answers?site=stackoverflow&filter=withbody&order=desc&sort=votes"
    );
    let a_resp = client.fetch(&a_url).await?;
    let answers = if a_resp.status == 200 {
        let a_body: AResponse = serde_json::from_str(&a_resp.html)
            .map_err(|e| FetchError::BodyDecode(format!("stackoverflow a parse: {e}")))?;
        a_body
            .items
            .iter()
            .map(|a| {
                json!({
                    "answer_id": a.answer_id,
                    "is_accepted": a.is_accepted,
                    "score": a.score,
                    "body": a.body,
                    "creation_date": a.creation_date,
                    "last_edit_date": a.last_edit_date,
                    "author": a.owner.as_ref().and_then(|o| o.display_name.clone()),
                    "author_rep": a.owner.as_ref().and_then(|o| o.reputation),
                })
            })
            .collect::<Vec<_>>()
    } else {
        Vec::new()
    };

    let accepted = answers
        .iter()
        .find(|a| {
            a.get("is_accepted")
                .and_then(|v| v.as_bool())
                .unwrap_or(false)
        })
        .cloned();

    Ok(json!({
        "url": url,
        "question_id": q.question_id,
        "title": q.title,
        "body": q.body,
        "tags": q.tags,
        "score": q.score,
        "view_count": q.view_count,
        "answer_count": q.answer_count,
        "is_answered": q.is_answered,
        "accepted_answer_id": q.accepted_answer_id,
        "creation_date": q.creation_date,
        "last_activity_date": q.last_activity_date,
        "author": q.owner.as_ref().and_then(|o| o.display_name.clone()),
        "author_rep": q.owner.as_ref().and_then(|o| o.reputation),
        "link": q.link,
        "accepted_answer": accepted,
        "top_answers": answers,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Parse the question id from a URL of the form `/questions/{id}/{slug}`.
fn parse_question_id(url: &str) -> Option<u64> {
    let after = url.split("/questions/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let first = stripped.split('/').next()?;
    first.parse::<u64>().ok()
}

// ---------------------------------------------------------------------------
// Stack Exchange API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct QResponse {
    #[serde(default)]
    items: Vec<Question>,
}

#[derive(Deserialize)]
struct Question {
    question_id: Option<u64>,
    title: Option<String>,
    body: Option<String>,
    #[serde(default)]
    tags: Vec<String>,
    score: Option<i64>,
    view_count: Option<i64>,
    answer_count: Option<i64>,
    is_answered: Option<bool>,
    accepted_answer_id: Option<u64>,
    creation_date: Option<i64>,
    last_activity_date: Option<i64>,
    owner: Option<Owner>,
    link: Option<String>,
}

#[derive(Deserialize)]
struct AResponse {
    #[serde(default)]
    items: Vec<Answer>,
}

#[derive(Deserialize)]
struct Answer {
    answer_id: Option<u64>,
    is_accepted: Option<bool>,
    score: Option<i64>,
    body: Option<String>,
    creation_date: Option<i64>,
    last_edit_date: Option<i64>,
    owner: Option<Owner>,
}

#[derive(Deserialize)]
struct Owner {
    display_name: Option<String>,
    reputation: Option<i64>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_question_urls() {
        assert!(matches(
            "https://stackoverflow.com/questions/12345/some-slug"
        ));
        assert!(matches(
            "https://stackoverflow.com/questions/12345/some-slug?answertab=votes"
        ));
        assert!(!matches("https://stackoverflow.com/"));
        assert!(!matches("https://stackoverflow.com/questions"));
        assert!(!matches("https://stackoverflow.com/users/100"));
        assert!(!matches("https://example.com/questions/12345/x"));
    }

    #[test]
    fn parse_question_id_handles_slug_and_query() {
        assert_eq!(
            parse_question_id("https://stackoverflow.com/questions/12345/some-slug"),
            Some(12345)
        );
        assert_eq!(
            parse_question_id("https://stackoverflow.com/questions/12345/some-slug?tab=newest"),
            Some(12345)
        );
        assert_eq!(parse_question_id("https://stackoverflow.com/foo"), None);
    }
}
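The module doc mentions `STACKAPPS_KEY` lifting the anonymous quota. The Stack Exchange API accepts an app key as a `key` query parameter; a sketch of building the question URL with an optional key (`question_api_url` is an illustrative helper, not crate code):

```rust
// Append the Stack Apps key when one is configured; without it the same
// URL still works under the 300/day anonymous quota.
fn question_api_url(id: u64, key: Option<&str>) -> String {
    let mut url = format!(
        "https://api.stackexchange.com/2.3/questions/{id}?site=stackoverflow&filter=withbody"
    );
    if let Some(k) = key {
        url.push_str("&key=");
        url.push_str(k);
    }
    url
}

fn main() {
    let anon = question_api_url(12345, None);
    assert!(anon.ends_with("filter=withbody"));
    let keyed = question_api_url(12345, Some("abc"));
    assert!(keyed.ends_with("&key=abc"));
}
```

In a deployment the key would typically come from `std::env::var("STACKAPPS_KEY").ok()` at the call site.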
565 crates/webclaw-fetch/src/extractors/substack_post.rs Normal file
@@ -0,0 +1,565 @@
//! Substack post extractor.
//!
//! Every Substack publication exposes `/api/v1/posts/{slug}`, which
//! returns the full post as JSON: body HTML, cover image, author,
//! publication info, reactions, paywall state. No auth on public
//! posts.
//!
//! Works on both `*.substack.com` subdomains and custom domains
//! (e.g. `simonwillison.net` uses Substack too). Detection is
//! "URL has `/p/{slug}`" because that's the canonical Substack post
//! path. Explicit-call only, because the `/p/{slug}` URL shape is
//! used by non-Substack sites too.
//!
//! ## Fallback
//!
//! The API endpoint is rate-limited aggressively on popular publications
//! and occasionally returns 403 on custom domains with Cloudflare in
//! front. When that happens we escalate to an HTML fetch (via
//! `smart_fetch_html`, so antibot-protected custom domains still work)
//! and extract OG tags + Article JSON-LD for a degraded-but-useful
//! payload. The response shape stays stable across both paths; a
//! `data_source` field tells the caller which branch ran.

use std::sync::OnceLock;

use regex::Regex;
use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "substack_post",
    label: "Substack post",
    description: "Returns post HTML, title, subtitle, author, publication, reactions, paywall status via the Substack public API. Falls back to OG + JSON-LD HTML parsing when the API is rate-limited.",
    url_patterns: &[
        "https://{pub}.substack.com/p/{slug}",
        "https://{custom-domain}/p/{slug}",
    ],
};

pub fn matches(url: &str) -> bool {
    if !(url.starts_with("http://") || url.starts_with("https://")) {
        return false;
    }
    url.contains("/p/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let slug = parse_slug(url).ok_or_else(|| {
        FetchError::Build(format!("substack_post: cannot parse slug from '{url}'"))
    })?;
    let host = host_of(url);
    if host.is_empty() {
        return Err(FetchError::Build(format!(
            "substack_post: empty host in '{url}'"
        )));
    }
    let scheme = if url.starts_with("http://") {
        "http"
    } else {
        "https"
    };
    let api_url = format!("{scheme}://{host}/api/v1/posts/{slug}");

    // 1. Try the public API. 200 = full payload; 404 = real miss; any
    //    other status hands off to the HTML fallback so a transient rate
    //    limit or a hardened custom domain doesn't fail the whole call.
    let resp = client.fetch(&api_url).await?;
    match resp.status {
        200 => match serde_json::from_str::<Post>(&resp.html) {
            Ok(p) => Ok(build_api_payload(url, &api_url, &slug, p)),
            Err(e) => {
                // API returned 200 but the body isn't the Post shape we
                // expect. Could be a custom-domain site that exposes
                // something else at /api/v1/posts/. Fall back to HTML
                // rather than hard-failing.
                html_fallback(
                    client,
                    url,
                    &api_url,
                    &slug,
                    Some(format!(
                        "api returned 200 but body was not Substack JSON ({e})"
                    )),
                )
                .await
            }
        },
        404 => Err(FetchError::Build(format!(
            "substack_post: '{slug}' not found on {host} (got 404). \
             If the publication isn't actually on Substack, use /v1/scrape instead."
        ))),
        _ => {
            // Rate limit, 403, 5xx, whatever: try HTML.
            let reason = format!("api returned status {} for {api_url}", resp.status);
            html_fallback(client, url, &api_url, &slug, Some(reason)).await
        }
    }
}

// ---------------------------------------------------------------------------
// API-path payload builder
// ---------------------------------------------------------------------------

fn build_api_payload(url: &str, api_url: &str, slug: &str, p: Post) -> Value {
    json!({
        "url": url,
        "api_url": api_url,
        "data_source": "api",
        "id": p.id,
        "type": p.r#type,
        "slug": p.slug.or_else(|| Some(slug.to_string())),
        "title": p.title,
        "subtitle": p.subtitle,
        "description": p.description,
        "canonical_url": p.canonical_url,
        "post_date": p.post_date,
        "updated_at": p.updated_at,
        "audience": p.audience,
        "has_paywall": matches!(p.audience.as_deref(), Some("only_paid") | Some("founding")),
        "is_free_preview": p.is_free_preview,
        "cover_image": p.cover_image,
        "word_count": p.wordcount,
        "reactions": p.reactions,
        "comment_count": p.comment_count,
        "body_html": p.body_html,
        "body_text": p.truncated_body_text.or(p.body_text),
        "publication": json!({
            "id": p.publication.as_ref().and_then(|pub_| pub_.id),
            "name": p.publication.as_ref().and_then(|pub_| pub_.name.clone()),
            "subdomain": p.publication.as_ref().and_then(|pub_| pub_.subdomain.clone()),
            "custom_domain": p.publication.as_ref().and_then(|pub_| pub_.custom_domain.clone()),
        }),
        "authors": p.published_bylines.iter().map(|a| json!({
            "id": a.id,
            "name": a.name,
            "handle": a.handle,
            "photo": a.photo_url,
        })).collect::<Vec<_>>(),
    })
}

// ---------------------------------------------------------------------------
// HTML fallback: OG + Article JSON-LD
// ---------------------------------------------------------------------------

async fn html_fallback(
    client: &dyn Fetcher,
    url: &str,
    api_url: &str,
    slug: &str,
    fallback_reason: Option<String>,
) -> Result<Value, FetchError> {
    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse_html(&fetched.html, url, api_url, slug);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "fetch_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
        if let Some(reason) = fallback_reason {
            obj.insert("fallback_reason".into(), json!(reason));
        }
    }
    Ok(data)
}

/// Pure HTML parser. Pulls title, subtitle, description, cover image,
/// publish date, and authors from OG tags and Article JSON-LD. Kept
/// public so tests can exercise it with fixtures.
pub fn parse_html(html: &str, url: &str, api_url: &str, slug: &str) -> Value {
    let article = find_article_jsonld(html);

    let title = article
        .as_ref()
        .and_then(|v| get_text(v, "headline"))
        .or_else(|| og(html, "title"));
    let description = article
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description"));
    let cover_image = article
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| og(html, "image"));
    let post_date = article
        .as_ref()
        .and_then(|v| get_text(v, "datePublished"))
        .or_else(|| meta_property(html, "article:published_time"));
    let updated_at = article.as_ref().and_then(|v| get_text(v, "dateModified"));
    let publication_name = og(html, "site_name");
    let authors = article.as_ref().map(extract_authors).unwrap_or_default();

    json!({
        "url": url,
        "api_url": api_url,
        "data_source": "html_fallback",
        "slug": slug,
        "title": title,
        "subtitle": None::<String>,
        "description": description,
        "canonical_url": canonical_url(html).or_else(|| Some(url.to_string())),
        "post_date": post_date,
        "updated_at": updated_at,
        "cover_image": cover_image,
        "body_html": None::<String>,
        "body_text": None::<String>,
        "word_count": None::<i64>,
        "comment_count": None::<i64>,
        "reactions": Value::Null,
        "has_paywall": None::<bool>,
        "is_free_preview": None::<bool>,
        "publication": json!({
            "name": publication_name,
        }),
        "authors": authors,
    })
}

fn extract_authors(v: &Value) -> Vec<Value> {
    let Some(a) = v.get("author") else {
        return Vec::new();
    };
    let one = |val: &Value| -> Option<Value> {
        match val {
            Value::String(s) => Some(json!({"name": s})),
            Value::Object(_) => {
                let name = val.get("name").and_then(|n| n.as_str())?;
                let handle = val
                    .get("url")
                    .and_then(|u| u.as_str())
                    .and_then(handle_from_author_url);
                Some(json!({
                    "name": name,
                    "handle": handle,
                }))
            }
            _ => None,
        }
    };
    match a {
        Value::Array(arr) => arr.iter().filter_map(one).collect(),
        _ => one(a).into_iter().collect(),
    }
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_slug(url: &str) -> Option<String> {
    let after = url.split("/p/").nth(1)?;
    let stripped = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .split('/')
        .next()
        .unwrap_or("");
    if stripped.is_empty() {
        None
    } else {
        Some(stripped.to_string())
    }
}

/// Extract the Substack handle from an author URL like
/// `https://substack.com/@handle` or `https://pub.substack.com/@handle`.
///
/// Returns `None` when the URL has no `@` segment (e.g. a non-Substack
/// author page) so we don't synthesise a fake handle.
fn handle_from_author_url(u: &str) -> Option<String> {
    let after = u.rsplit_once('@').map(|(_, tail)| tail)?;
    let clean = after.split(['/', '?', '#']).next()?;
    if clean.is_empty() {
        None
    } else {
        Some(clean.to_string())
    }
}

// ---------------------------------------------------------------------------
// HTML tag helpers
// ---------------------------------------------------------------------------

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

/// Pull `<meta property="article:published_time" content="...">` and
/// similar structured meta tags.
fn meta_property(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn canonical_url(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE
        .get_or_init(|| Regex::new(r#"(?i)<link[^>]+rel="canonical"[^>]+href="([^"]+)""#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

// ---------------------------------------------------------------------------
// JSON-LD walkers (Article / NewsArticle)
// ---------------------------------------------------------------------------

fn find_article_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_article_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_article_in(v: &Value) -> Option<Value> {
    if is_article_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_article_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_article_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_article_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_art = |s: &str| {
        matches!(
            s,
            "Article" | "NewsArticle" | "BlogPosting" | "SocialMediaPosting"
        )
    };
    match t {
        Value::String(s) => is_art(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_art)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

// ---------------------------------------------------------------------------
// Substack API types (subset)
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Post {
    id: Option<i64>,
    r#type: Option<String>,
    slug: Option<String>,
    title: Option<String>,
    subtitle: Option<String>,
    description: Option<String>,
    canonical_url: Option<String>,
    post_date: Option<String>,
    updated_at: Option<String>,
    audience: Option<String>,
    is_free_preview: Option<bool>,
    cover_image: Option<String>,
    wordcount: Option<i64>,
    reactions: Option<serde_json::Value>,
    comment_count: Option<i64>,
    body_html: Option<String>,
    body_text: Option<String>,
    truncated_body_text: Option<String>,
    publication: Option<Publication>,
    #[serde(default, rename = "publishedBylines")]
    published_bylines: Vec<Byline>,
}

#[derive(Deserialize)]
struct Publication {
    id: Option<i64>,
    name: Option<String>,
    subdomain: Option<String>,
    custom_domain: Option<String>,
}

#[derive(Deserialize)]
struct Byline {
    id: Option<i64>,
    name: Option<String>,
    handle: Option<String>,
    photo_url: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_post_urls() {
        assert!(matches(
            "https://stratechery.substack.com/p/the-tech-letter"
        ));
        assert!(matches("https://simonwillison.net/p/2024-08-01-something"));
        assert!(!matches("https://example.com/"));
        assert!(!matches("ftp://example.com/p/foo"));
    }

    #[test]
    fn parse_slug_strips_query_and_trailing_slash() {
        assert_eq!(
            parse_slug("https://example.substack.com/p/my-post"),
            Some("my-post".into())
        );
        assert_eq!(
            parse_slug("https://example.substack.com/p/my-post/"),
            Some("my-post".into())
        );
        assert_eq!(
            parse_slug("https://example.substack.com/p/my-post?ref=123"),
            Some("my-post".into())
        );
    }

    #[test]
    fn parse_html_extracts_from_og_tags() {
        let html = r##"
            <html><head>
            <meta property="og:title" content="My Great Post">
            <meta property="og:description" content="A short summary.">
            <meta property="og:image" content="https://cdn.substack.com/cover.jpg">
            <meta property="og:site_name" content="My Publication">
            <meta property="article:published_time" content="2025-09-01T10:00:00Z">
            <link rel="canonical" href="https://mypub.substack.com/p/my-post">
            </head></html>"##;
        let v = parse_html(
            html,
            "https://mypub.substack.com/p/my-post",
            "https://mypub.substack.com/api/v1/posts/my-post",
            "my-post",
        );
        assert_eq!(v["data_source"], "html_fallback");
        assert_eq!(v["title"], "My Great Post");
        assert_eq!(v["description"], "A short summary.");
        assert_eq!(v["cover_image"], "https://cdn.substack.com/cover.jpg");
        assert_eq!(v["post_date"], "2025-09-01T10:00:00Z");
        assert_eq!(v["publication"]["name"], "My Publication");
        assert_eq!(v["canonical_url"], "https://mypub.substack.com/p/my-post");
    }

    #[test]
    fn parse_html_prefers_jsonld_when_present() {
        let html = r##"
            <html><head>
            <meta property="og:title" content="OG Title">
            <script type="application/ld+json">
            {"@context":"https://schema.org","@type":"NewsArticle",
             "headline":"JSON-LD Title",
             "description":"JSON-LD desc.",
             "image":"https://cdn.substack.com/hero.jpg",
             "datePublished":"2025-10-12T08:30:00Z",
             "dateModified":"2025-10-12T09:00:00Z",
             "author":[{"@type":"Person","name":"Alice Author","url":"https://substack.com/@alice"}]}
            </script>
            </head></html>"##;
        let v = parse_html(
            html,
            "https://example.com/p/a",
            "https://example.com/api/v1/posts/a",
            "a",
        );
        assert_eq!(v["title"], "JSON-LD Title");
        assert_eq!(v["description"], "JSON-LD desc.");
        assert_eq!(v["cover_image"], "https://cdn.substack.com/hero.jpg");
        assert_eq!(v["post_date"], "2025-10-12T08:30:00Z");
        assert_eq!(v["updated_at"], "2025-10-12T09:00:00Z");
        assert_eq!(v["authors"][0]["name"], "Alice Author");
        assert_eq!(v["authors"][0]["handle"], "alice");
    }

    #[test]
    fn handle_from_author_url_pulls_handle() {
        assert_eq!(
            handle_from_author_url("https://substack.com/@alice"),
            Some("alice".into())
        );
        assert_eq!(
            handle_from_author_url("https://mypub.substack.com/@bob/"),
            Some("bob".into())
        );
        assert_eq!(
            handle_from_author_url("https://not-substack.com/author/carol"),
            None
        );
    }
}
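The module doc above describes how `extract` assembles the posts API URL from the page URL's scheme, host, and `/p/{slug}` segment. A minimal standalone sketch of that assembly (the function name `substack_api_url` is illustrative, not part of the crate):

```rust
// Illustrative sketch: mirrors how the extractor builds
// {scheme}://{host}/api/v1/posts/{slug} before its first fetch.
fn substack_api_url(scheme: &str, host: &str, slug: &str) -> String {
    format!("{scheme}://{host}/api/v1/posts/{slug}")
}

fn main() {
    let url = substack_api_url("https", "example.substack.com", "my-post");
    assert_eq!(url, "https://example.substack.com/api/v1/posts/my-post");
    println!("{url}");
}
```

Fetching that URL for any public post returns the JSON the `Post` struct deserialises; a 200 with a non-`Post` body or any non-200/404 status falls through to the HTML fallback.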
572 crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs Normal file
@@ -0,0 +1,572 @@
//! Trustpilot company reviews extractor.
//!
//! `trustpilot.com/review/{domain}` pages are always behind AWS WAF's
//! "Verifying your connection" interstitial, so this extractor always
//! routes through [`cloud::smart_fetch_html`]. Without
//! `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` it returns a clean
//! "set API key" error; with one it escalates to api.webclaw.io.
//!
//! ## 2025 JSON-LD schema
//!
//! Trustpilot replaced the old single-Organization + aggregateRating
//! shape with three separate JSON-LD blocks:
//!
//! 1. `Organization` block for Trustpilot the platform itself
//!    (company info, addresses, social profiles). Not the business
//!    being reviewed. We detect and skip this.
//! 2. `Dataset` block with a csvw:Table mainEntity that contains the
//!    per-star-bucket counts for the target business plus a Total
//!    column. The Dataset's `name` is the business display name.
//! 3. `aiSummary` + `aiSummaryReviews` block: the AI-generated
//!    summary of reviews plus the individual review objects
//!    (consumer, dates, rating, title, text, language, likes).
//!
//! In addition, the page head's `metadata.title` parses as
//! `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"` and
//! `metadata.description` carries `"{N} customers have already said"`.
//! We use both as extra signal when the Dataset block is absent.

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "trustpilot_reviews",
    label: "Trustpilot reviews",
    description: "Returns business name, aggregate rating, star distribution, recent reviews, and the AI summary for a Trustpilot /review/{domain} page.",
    url_patterns: &["https://www.trustpilot.com/review/{domain}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !matches!(host, "www.trustpilot.com" | "trustpilot.com") {
        return false;
    }
    url.contains("/review/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse(&fetched.html, url)?;
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

/// Pure parser. Kept public so the cloud pipeline can reuse it on its
/// own fetched HTML without going through the async extract path.
pub fn parse(html: &str, url: &str) -> Result<Value, FetchError> {
    let domain = parse_review_domain(url).ok_or_else(|| {
        FetchError::Build(format!(
            "trustpilot_reviews: cannot parse /review/{{domain}} from '{url}'"
        ))
    })?;

    let blocks = webclaw_core::structured_data::extract_json_ld(html);

    // The business Dataset block has `about.@id` pointing to the target
    // domain's Organization (e.g. `.../Organization/anthropic.com`).
    let dataset = find_business_dataset(&blocks, &domain);

    // The aiSummary block: not typed (no `@type`), detect by key.
    let ai_block = find_ai_summary_block(&blocks);

    // Business name: Dataset > metadata.title regex > URL domain.
    let business_name = dataset
        .as_ref()
        .and_then(|d| get_string(d, "name"))
        .or_else(|| parse_name_from_og_title(html))
        .or_else(|| Some(domain.clone()));

    // Rating distribution from the csvw:Table columns. Each column has
    // csvw:name like "1 star" / "Total" and a single cell with the
    // integer count.
    let distribution = dataset.as_ref().and_then(parse_star_distribution);
    let (rating_from_dist, total_from_dist) = distribution
        .as_ref()
        .map(compute_rating_stats)
        .unwrap_or((None, None));

    // Page-title / page-description fallbacks. OG title format:
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    let (rating_label, rating_from_og) = parse_rating_from_og_title(html);
    let total_from_desc = parse_review_count_from_og_description(html);

    // Recent reviews carried by the aiSummary block.
    let recent_reviews: Vec<Value> = ai_block
        .as_ref()
        .and_then(|a| a.get("aiSummaryReviews"))
        .and_then(|arr| arr.as_array())
        .map(|arr| arr.iter().map(extract_review).collect())
        .unwrap_or_default();

    let ai_summary = ai_block
        .as_ref()
        .and_then(|a| a.get("aiSummary"))
        .and_then(|s| s.get("summary"))
        .and_then(|t| t.as_str())
        .map(String::from);

    Ok(json!({
        "url": url,
        "domain": domain,
        "business_name": business_name,
        "rating_label": rating_label,
        "average_rating": rating_from_dist.or(rating_from_og),
        "review_count": total_from_dist.or(total_from_desc),
        "rating_distribution": distribution,
        "ai_summary": ai_summary,
        "recent_reviews": recent_reviews,
        "review_count_listed": recent_reviews.len(),
    }))
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Pull the target domain from `trustpilot.com/review/{domain}`.
fn parse_review_domain(url: &str) -> Option<String> {
    let after = url.split("/review/").nth(1)?;
    let stripped = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .split('/')
        .next()
        .unwrap_or("");
    if stripped.is_empty() {
        None
    } else {
        Some(stripped.to_string())
    }
}

// ---------------------------------------------------------------------------
// JSON-LD block walkers
// ---------------------------------------------------------------------------

/// Find the Dataset block whose `about.@id` references the target
/// domain's Organization. Falls through to any Dataset if the @id
/// check doesn't match (Trustpilot occasionally varies the URL).
fn find_business_dataset(blocks: &[Value], domain: &str) -> Option<Value> {
    let mut fallback_any_dataset: Option<Value> = None;
    for block in blocks {
        for node in walk_graph(block) {
            if !is_dataset(&node) {
                continue;
            }
            if dataset_about_matches_domain(&node, domain) {
                return Some(node);
            }
            if fallback_any_dataset.is_none() {
                fallback_any_dataset = Some(node);
            }
        }
    }
    fallback_any_dataset
}

fn is_dataset(v: &Value) -> bool {
    v.get("@type")
        .and_then(|t| t.as_str())
        .is_some_and(|s| s == "Dataset")
}

fn dataset_about_matches_domain(v: &Value, domain: &str) -> bool {
    let about_id = v
        .get("about")
        .and_then(|a| a.get("@id"))
        .and_then(|id| id.as_str());
    let Some(id) = about_id else {
        return false;
    };
    id.contains(&format!("/Organization/{domain}"))
}

/// The aiSummary / aiSummaryReviews block has no `@type`, so match by
/// presence of the `aiSummary` key.
fn find_ai_summary_block(blocks: &[Value]) -> Option<Value> {
    for block in blocks {
        for node in walk_graph(block) {
            if node.get("aiSummary").is_some() {
                return Some(node);
            }
        }
    }
    None
}

/// Flatten each block (and its `@graph`) into a list of nodes we can
/// iterate over. Handles both `@graph: [ ... ]` (array) and
/// `@graph: { ... }` (single object) shapes; Trustpilot uses both.
fn walk_graph(block: &Value) -> Vec<Value> {
    let mut out = vec![block.clone()];
    if let Some(graph) = block.get("@graph") {
        match graph {
            Value::Array(arr) => out.extend(arr.iter().cloned()),
            Value::Object(_) => out.push(graph.clone()),
            _ => {}
        }
    }
    out
}

// ---------------------------------------------------------------------------
// Rating distribution (csvw:Table)
// ---------------------------------------------------------------------------

/// Parse the per-star distribution from the Dataset block. Returns
/// `{"one_star": {count, percent}, ..., "total": {count, percent}}`.
fn parse_star_distribution(dataset: &Value) -> Option<Value> {
    let columns = dataset
        .get("mainEntity")?
        .get("csvw:tableSchema")?
        .get("csvw:columns")?
        .as_array()?;
    let mut out = serde_json::Map::new();
    for col in columns {
        let name = col.get("csvw:name").and_then(|n| n.as_str())?;
        let cell = col.get("csvw:cells").and_then(|c| c.as_array())?.first()?;
        let count = cell
            .get("csvw:value")
            .and_then(|v| v.as_str())
            .and_then(|s| s.parse::<i64>().ok());
        let percent = cell
            .get("csvw:notes")
            .and_then(|n| n.as_array())
            .and_then(|arr| arr.first())
            .and_then(|s| s.as_str())
            .map(String::from);
        let key = normalise_star_key(name);
        out.insert(
            key,
            json!({
                "count": count,
                "percent": percent,
            }),
        );
    }
    if out.is_empty() {
        None
    } else {
        Some(Value::Object(out))
    }
}

/// "1 star" -> "one_star", "Total" -> "total". Easier to consume than
/// the raw "1 star" key, which fights YAML/JS property access.
fn normalise_star_key(name: &str) -> String {
    let trimmed = name.trim().to_lowercase();
    match trimmed.as_str() {
        "1 star" => "one_star".into(),
        "2 stars" => "two_stars".into(),
        "3 stars" => "three_stars".into(),
        "4 stars" => "four_stars".into(),
        "5 stars" => "five_stars".into(),
        "total" => "total".into(),
        other => other.replace(' ', "_"),
    }
}

/// Compute the average rating (weighted by bucket) and the total count
/// from the parsed distribution. Returns `(average, total)`.
fn compute_rating_stats(distribution: &Value) -> (Option<String>, Option<i64>) {
    let Some(obj) = distribution.as_object() else {
        return (None, None);
    };
    let get_count = |key: &str| -> i64 {
        obj.get(key)
            .and_then(|v| v.get("count"))
            .and_then(|v| v.as_i64())
            .unwrap_or(0)
    };
    let one = get_count("one_star");
    let two = get_count("two_stars");
    let three = get_count("three_stars");
    let four = get_count("four_stars");
    let five = get_count("five_stars");
    let total_bucket = one + two + three + four + five;
    let total = obj
        .get("total")
        .and_then(|v| v.get("count"))
        .and_then(|v| v.as_i64())
        .unwrap_or(total_bucket);
    if total == 0 {
        return (None, Some(0));
    }
    let weighted = one + (two * 2) + (three * 3) + (four * 4) + (five * 5);
    let avg = weighted as f64 / total_bucket.max(1) as f64;
    // One decimal place, matching how Trustpilot displays the score.
    (Some(format!("{avg:.1}")), Some(total))
}
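The weighted-average computation above can be checked with a standalone sketch (the helper name `weighted_average` is illustrative; it reproduces the same bucket arithmetic, not the crate's API):

```rust
// Illustrative sketch of the star-bucket average: each bucket count is
// weighted by its star value (index + 1) and divided by the bucket sum.
fn weighted_average(buckets: [i64; 5]) -> f64 {
    let total: i64 = buckets.iter().sum();
    let weighted: i64 = buckets
        .iter()
        .enumerate()
        .map(|(i, c)| (i as i64 + 1) * c)
        .sum();
    weighted as f64 / total.max(1) as f64
}

fn main() {
    // 100 one-star + 26 five-star reviews: (100 + 130) / 126 ≈ 1.83,
    // shown as "1.8" after the one-decimal formatting the crate uses.
    let avg = weighted_average([100, 0, 0, 0, 26]);
    assert_eq!(format!("{avg:.1}"), "1.8");
    println!("{avg:.1}");
}
```

Note the `max(1)` guard mirrors the division guard above: an all-zero distribution yields 0.0 instead of a division by zero.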
// ---------------------------------------------------------------------------
// OG / meta-tag fallbacks
// ---------------------------------------------------------------------------

/// Regex out the business name from the standard Trustpilot OG title
/// shape: `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"`.
fn parse_name_from_og_title(html: &str) -> Option<String> {
    let title = og(html, "title")?;
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"^(.+?)\s+is rated\b").unwrap());
    re.captures(&title)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// Pull the rating label (e.g. "Bad", "Excellent") and numeric value
/// from the OG title.
fn parse_rating_from_og_title(html: &str) -> (Option<String>, Option<String>) {
    let Some(title) = og(html, "title") else {
        return (None, None);
    };
    static RE: OnceLock<Regex> = OnceLock::new();
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    let re = RE.get_or_init(|| {
        Regex::new(r#"is rated\s*[\\"]+([^"\\]+)[\\"]+\s*with\s*([\d.]+)\s*/\s*5"#).unwrap()
    });
    let Some(caps) = re.captures(&title) else {
        return (None, None);
    };
    (
        caps.get(1).map(|m| m.as_str().trim().to_string()),
        caps.get(2).map(|m| m.as_str().to_string()),
    )
}

/// Parse "hear what 226 customers have already said" from the OG
/// description tag.
fn parse_review_count_from_og_description(html: &str) -> Option<i64> {
    let desc = og(html, "description")?;
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"(\d[\d,]*)\s+customers").unwrap());
    re.captures(&desc)?
        .get(1)?
        .as_str()
        .replace(',', "")
        .parse::<i64>()
        .ok()
}

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
|
||||
});
|
||||
for c in re.captures_iter(html) {
|
||||
if c.get(1).is_some_and(|m| m.as_str() == prop) {
|
||||
let raw = c.get(2).map(|m| m.as_str())?;
|
||||
return Some(html_unescape(raw));
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
/// Minimal HTML entity unescaping for the three entities the
|
||||
/// synthesize_html escaper might produce. Keeps us off a heavier dep.
|
||||
fn html_unescape(s: &str) -> String {
|
||||
s.replace(""", "\"")
|
||||
.replace("&", "&")
|
||||
.replace("<", "<")
|
||||
.replace(">", ">")
|
||||
}
|
||||
|
||||
fn get_string(v: &Value, key: &str) -> Option<String> {
|
||||
v.get(key).and_then(|x| x.as_str().map(String::from))
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Review extraction
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
fn extract_review(r: &Value) -> Value {
|
||||
json!({
|
||||
"id": r.get("id").and_then(|v| v.as_str()),
|
||||
"rating": r.get("rating").and_then(|v| v.as_i64()),
|
||||
"title": r.get("title").and_then(|v| v.as_str()),
|
||||
"text": r.get("text").and_then(|v| v.as_str()),
|
||||
"language": r.get("language").and_then(|v| v.as_str()),
|
||||
"source": r.get("source").and_then(|v| v.as_str()),
|
||||
"likes": r.get("likes").and_then(|v| v.as_i64()),
|
||||
"author": r.get("consumer").and_then(|c| c.get("displayName")).and_then(|v| v.as_str()),
|
||||
"author_country": r.get("consumer").and_then(|c| c.get("countryCode")).and_then(|v| v.as_str()),
|
||||
"author_review_count": r.get("consumer").and_then(|c| c.get("numberOfReviews")).and_then(|v| v.as_i64()),
|
||||
"verified": r.get("consumer").and_then(|c| c.get("isVerified")).and_then(|v| v.as_bool()),
|
||||
"date_experienced": r.get("dates").and_then(|d| d.get("experiencedDate")).and_then(|v| v.as_str()),
|
||||
"date_published": r.get("dates").and_then(|d| d.get("publishedDate")).and_then(|v| v.as_str()),
|
||||
})
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Tests
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn matches_trustpilot_review_urls() {
|
||||
assert!(matches("https://www.trustpilot.com/review/stripe.com"));
|
||||
assert!(matches("https://trustpilot.com/review/example.com"));
|
||||
assert!(!matches("https://www.trustpilot.com/"));
|
||||
assert!(!matches("https://example.com/review/foo"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_review_domain_handles_query_and_slash() {
|
||||
assert_eq!(
|
||||
parse_review_domain("https://www.trustpilot.com/review/anthropic.com"),
|
||||
Some("anthropic.com".into())
|
||||
);
|
||||
assert_eq!(
|
||||
parse_review_domain("https://www.trustpilot.com/review/anthropic.com/"),
|
||||
Some("anthropic.com".into())
|
||||
);
|
||||
assert_eq!(
|
||||
parse_review_domain("https://www.trustpilot.com/review/anthropic.com?stars=5"),
|
||||
Some("anthropic.com".into())
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn normalise_star_key_covers_all_buckets() {
|
||||
assert_eq!(normalise_star_key("1 star"), "one_star");
|
||||
assert_eq!(normalise_star_key("2 stars"), "two_stars");
|
||||
assert_eq!(normalise_star_key("5 stars"), "five_stars");
|
||||
assert_eq!(normalise_star_key("Total"), "total");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn compute_rating_stats_weighted_average() {
|
||||
// 100 1-stars, 100 5-stars → avg 3.0 over 200 reviews.
|
||||
let dist = json!({
|
||||
"one_star": { "count": 100, "percent": "50%" },
|
||||
"two_stars": { "count": 0, "percent": "0%" },
|
||||
"three_stars":{ "count": 0, "percent": "0%" },
|
||||
"four_stars": { "count": 0, "percent": "0%" },
|
||||
"five_stars": { "count": 100, "percent": "50%" },
|
||||
"total": { "count": 200, "percent": "100%" },
|
||||
});
|
||||
let (avg, total) = compute_rating_stats(&dist);
|
||||
assert_eq!(avg.as_deref(), Some("3.0"));
|
||||
assert_eq!(total, Some(200));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_og_title_extracts_name_and_rating() {
|
||||
let html = r#"<meta property="og:title" content="Anthropic is rated "Bad" with 1.5 / 5 on Trustpilot">"#;
|
||||
assert_eq!(parse_name_from_og_title(html), Some("Anthropic".into()));
|
||||
let (label, rating) = parse_rating_from_og_title(html);
|
||||
assert_eq!(label.as_deref(), Some("Bad"));
|
||||
assert_eq!(rating.as_deref(), Some("1.5"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_review_count_from_og_description_picks_number() {
|
||||
let html = r#"<meta property="og:description" content="Do you agree? Voice your opinion today and hear what 226 customers have already said.">"#;
|
||||
assert_eq!(parse_review_count_from_og_description(html), Some(226));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_full_fixture_assembles_all_fields() {
|
||||
let html = r##"<html><head>
|
||||
<meta property="og:title" content="Anthropic is rated "Bad" with 1.5 / 5 on Trustpilot">
|
||||
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">
|
||||
<script type="application/ld+json">
|
||||
{"@context":"https://schema.org","@graph":[
|
||||
{"@id":"https://www.trustpilot.com/#/schema/Organization/1","@type":"Organization","name":"Trustpilot"}
|
||||
]}
|
||||
</script>
|
||||
<script type="application/ld+json">
|
||||
{"@context":["https://schema.org",{"csvw":"http://www.w3.org/ns/csvw#"}],
|
||||
"@graph":{"@id":"https://www.trustpilot.com/#/schema/DataSet/anthropic.com/1",
|
||||
"@type":"Dataset",
|
||||
"about":{"@id":"https://www.trustpilot.com/#/schema/Organization/anthropic.com"},
|
||||
"name":"Anthropic",
|
||||
"mainEntity":{"@type":"csvw:Table","csvw:tableSchema":{"csvw:columns":[
|
||||
{"csvw:name":"1 star","csvw:cells":[{"csvw:value":"196","csvw:notes":["87%"]}]},
|
||||
{"csvw:name":"2 stars","csvw:cells":[{"csvw:value":"9","csvw:notes":["4%"]}]},
|
||||
{"csvw:name":"3 stars","csvw:cells":[{"csvw:value":"5","csvw:notes":["2%"]}]},
|
||||
{"csvw:name":"4 stars","csvw:cells":[{"csvw:value":"1","csvw:notes":["0%"]}]},
|
||||
{"csvw:name":"5 stars","csvw:cells":[{"csvw:value":"15","csvw:notes":["7%"]}]},
|
||||
{"csvw:name":"Total","csvw:cells":[{"csvw:value":"226","csvw:notes":["100%"]}]}
|
||||
]}}}}
|
||||
</script>
|
||||
<script type="application/ld+json">
|
||||
{"aiSummary":{"modelVersion":"2.0.0","summary":"Mixed reviews."},
|
||||
"aiSummaryReviews":[
|
||||
{"id":"abc","rating":1,"title":"Bad","text":"Didn't work.","language":"en",
|
||||
"source":"Organic","likes":2,"consumer":{"displayName":"W.FRH","countryCode":"DE","numberOfReviews":69,"isVerified":false},
|
||||
"dates":{"experiencedDate":"2026-01-05T00:00:00.000Z","publishedDate":"2026-01-05T16:29:31.000Z"}}]}
|
||||
</script>
|
||||
</head></html>"##;
|
||||
let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
|
||||
assert_eq!(v["domain"], "anthropic.com");
|
||||
assert_eq!(v["business_name"], "Anthropic");
|
||||
assert_eq!(v["rating_label"], "Bad");
|
||||
assert_eq!(v["review_count"], 226);
|
||||
assert_eq!(v["rating_distribution"]["one_star"]["count"], 196);
|
||||
assert_eq!(v["rating_distribution"]["total"]["count"], 226);
|
||||
assert_eq!(v["ai_summary"], "Mixed reviews.");
|
||||
assert_eq!(v["recent_reviews"].as_array().unwrap().len(), 1);
|
||||
assert_eq!(v["recent_reviews"][0]["author"], "W.FRH");
|
||||
assert_eq!(v["recent_reviews"][0]["rating"], 1);
|
||||
assert_eq!(v["recent_reviews"][0]["title"], "Bad");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_falls_back_to_og_when_no_jsonld() {
|
||||
let html = r#"<meta property="og:title" content="Anthropic is rated "Bad" with 1.5 / 5 on Trustpilot">
|
||||
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">"#;
|
||||
let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
|
||||
assert_eq!(v["domain"], "anthropic.com");
|
||||
assert_eq!(v["business_name"], "Anthropic");
|
||||
assert_eq!(v["average_rating"], "1.5");
|
||||
assert_eq!(v["review_count"], 226);
|
||||
assert_eq!(v["rating_label"], "Bad");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_returns_ok_with_url_domain_when_nothing_else() {
|
||||
let v = parse(
|
||||
"<html><head></head></html>",
|
||||
"https://www.trustpilot.com/review/example.com",
|
||||
)
|
||||
.unwrap();
|
||||
assert_eq!(v["domain"], "example.com");
|
||||
assert_eq!(v["business_name"], "example.com");
|
||||
}
|
||||
}
|
||||
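The weighted-average rule in `compute_rating_stats` can be checked in isolation. Below is a minimal std-only sketch: plain bucket counts stand in for the JSON distribution, and `weighted_average` is a name introduced here for illustration, not part of the crate.

```rust
/// Illustrative restatement of the bucket-weighting rule: each bucket
/// contributes count * star_value, divided by the sum of the buckets.
fn weighted_average(buckets: [i64; 5]) -> f64 {
    let total: i64 = buckets.iter().sum();
    if total == 0 {
        // The real function returns (None, Some(0)) here; a plain 0.0
        // stands in for "no average" in this sketch.
        return 0.0;
    }
    let weighted: i64 = buckets
        .iter()
        .enumerate()
        .map(|(i, count)| count * (i as i64 + 1))
        .sum();
    weighted as f64 / total as f64
}

fn main() {
    // 100 one-star and 100 five-star reviews average out to 3.0,
    // matching the compute_rating_stats_weighted_average test above.
    println!("{:.1}", weighted_average([100, 0, 0, 0, 100])); // prints 3.0
}
```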
237  crates/webclaw-fetch/src/extractors/woocommerce_product.rs  Normal file
@@ -0,0 +1,237 @@
//! WooCommerce product structured extractor.
//!
//! Targets WooCommerce's Store API: `/wp-json/wc/store/v1/products?slug={slug}`.
//! About 30-50% of WooCommerce stores expose this endpoint publicly
//! (it's on by default, but common security plugins disable it).
//! When it's off, the server returns 404 at /wp-json. We surface a
//! clean error and point callers at `/v1/scrape/ecommerce_product`
//! which works on any store with Schema.org JSON-LD.
//!
//! Explicit-call only. `/product/{slug}` is the default permalink for
//! WooCommerce but custom stores use every variation imaginable, so
//! auto-dispatch is unreliable.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "woocommerce_product",
    label: "WooCommerce product",
    description: "Returns product via the WooCommerce Store REST API (requires the /wp-json/wc/store endpoint to be enabled on the target store).",
    url_patterns: &[
        "https://{shop}/product/{slug}",
        "https://{shop}/shop/{slug}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host.is_empty() {
        return false;
    }
    // Permissive: WooCommerce stores use custom domains + custom
    // permalinks. The extractor's API probe is what confirms it's
    // really WooCommerce.
    url.contains("/product/")
        || url.contains("/shop/")
        || url.contains("/producto/") // common es locale
        || url.contains("/produit/") // common fr locale
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let slug = parse_slug(url).ok_or_else(|| {
        FetchError::Build(format!(
            "woocommerce_product: cannot parse slug from '{url}'"
        ))
    })?;
    let host = host_of(url);
    if host.is_empty() {
        return Err(FetchError::Build(format!(
            "woocommerce_product: empty host in '{url}'"
        )));
    }
    let scheme = if url.starts_with("http://") {
        "http"
    } else {
        "https"
    };
    let api_url = format!("{scheme}://{host}/wp-json/wc/store/v1/products?slug={slug}&per_page=1");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "woocommerce_product: {host} does not expose /wp-json/wc/store (404). \
             Use /v1/scrape/ecommerce_product for JSON-LD fallback."
        )));
    }
    if resp.status == 401 || resp.status == 403 {
        return Err(FetchError::Build(format!(
            "woocommerce_product: {host} requires auth for /wp-json/wc/store ({}). \
             Use /v1/scrape/ecommerce_product for the public JSON-LD fallback.",
            resp.status
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "woocommerce api returned status {} for {api_url}",
            resp.status
        )));
    }

    let products: Vec<Product> = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("woocommerce parse: {e}")))?;
    let p = products.into_iter().next().ok_or_else(|| {
        FetchError::Build(format!(
            "woocommerce_product: no product found for slug '{slug}' on {host}"
        ))
    })?;

    let images: Vec<Value> = p
        .images
        .iter()
        .map(|i| json!({"src": i.src, "thumbnail": i.thumbnail, "alt": i.alt}))
        .collect();
    let variations_count = p.variations.as_ref().map(|v| v.len()).unwrap_or(0);

    Ok(json!({
        "url": url,
        "api_url": api_url,
        "product_id": p.id,
        "name": p.name,
        "slug": p.slug,
        "sku": p.sku,
        "permalink": p.permalink,
        "on_sale": p.on_sale,
        "in_stock": p.is_in_stock,
        "is_purchasable": p.is_purchasable,
        "price": p.prices.as_ref().and_then(|pr| pr.price.clone()),
        "regular_price": p.prices.as_ref().and_then(|pr| pr.regular_price.clone()),
        "sale_price": p.prices.as_ref().and_then(|pr| pr.sale_price.clone()),
        "currency": p.prices.as_ref().and_then(|pr| pr.currency_code.clone()),
        "currency_minor": p.prices.as_ref().and_then(|pr| pr.currency_minor_unit),
        "price_range": p.prices.as_ref().and_then(|pr| pr.price_range.clone()),
        "average_rating": p.average_rating,
        "review_count": p.review_count,
        "description": p.description,
        "short_description": p.short_description,
        "categories": p.categories.iter().filter_map(|c| c.name.clone()).collect::<Vec<_>>(),
        "tags": p.tags.iter().filter_map(|t| t.name.clone()).collect::<Vec<_>>(),
        "variation_count": variations_count,
        "image_count": images.len(),
        "images": images,
    }))
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Extract the product slug from common WooCommerce permalinks.
fn parse_slug(url: &str) -> Option<String> {
    for needle in ["/product/", "/shop/", "/producto/", "/produit/"] {
        if let Some(after) = url.split(needle).nth(1) {
            let stripped = after
                .split(['?', '#'])
                .next()?
                .trim_end_matches('/')
                .split('/')
                .next()
                .unwrap_or("");
            if !stripped.is_empty() {
                return Some(stripped.to_string());
            }
        }
    }
    None
}

// ---------------------------------------------------------------------------
// Store API types (subset of the full response)
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Product {
    id: Option<i64>,
    name: Option<String>,
    slug: Option<String>,
    sku: Option<String>,
    permalink: Option<String>,
    description: Option<String>,
    short_description: Option<String>,
    on_sale: Option<bool>,
    is_in_stock: Option<bool>,
    is_purchasable: Option<bool>,
    average_rating: Option<serde_json::Value>, // string or number
    review_count: Option<i64>,
    prices: Option<Prices>,
    #[serde(default)]
    categories: Vec<Term>,
    #[serde(default)]
    tags: Vec<Term>,
    #[serde(default)]
    images: Vec<Img>,
    variations: Option<Vec<serde_json::Value>>,
}

#[derive(Deserialize)]
struct Prices {
    price: Option<String>,
    regular_price: Option<String>,
    sale_price: Option<String>,
    currency_code: Option<String>,
    currency_minor_unit: Option<i64>,
    price_range: Option<serde_json::Value>,
}

#[derive(Deserialize)]
struct Term {
    name: Option<String>,
}

#[derive(Deserialize)]
struct Img {
    src: Option<String>,
    thumbnail: Option<String>,
    alt: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_common_permalinks() {
        assert!(matches("https://shop.example.com/product/cool-widget"));
        assert!(matches("https://shop.example.com/shop/cool-widget"));
        assert!(matches("https://tienda.example.com/producto/cosa"));
        assert!(matches("https://boutique.example.com/produit/chose"));
    }

    #[test]
    fn parse_slug_handles_locale_and_suffix() {
        assert_eq!(
            parse_slug("https://shop.example.com/product/cool-widget"),
            Some("cool-widget".into())
        );
        assert_eq!(
            parse_slug("https://shop.example.com/product/cool-widget/?attr=red"),
            Some("cool-widget".into())
        );
        assert_eq!(
            parse_slug("https://tienda.example.com/producto/cosa/"),
            Some("cosa".into())
        );
    }
}
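One consumer-side note on the payload above: the Store API reports `price`, `regular_price`, and `sale_price` as minor-unit strings, which is why the extractor surfaces `currency_minor`. A hedged sketch of rendering such a value follows; `format_minor` is illustrative, not part of the extractor, and negative prices or a zero minor unit are not handled.

```rust
// Illustrative only: turn a minor-unit price string ("1999" with a
// minor unit of 2) into a display value ("19.99").
fn format_minor(price: &str, minor_unit: u32) -> Option<String> {
    let units: i64 = price.parse().ok()?;
    let scale = 10_i64.pow(minor_unit);
    Some(format!(
        "{}.{:0width$}",
        units / scale,
        units % scale,
        width = minor_unit as usize
    ))
}

fn main() {
    println!("{:?}", format_minor("1999", 2)); // Some("19.99")
}
```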
378  crates/webclaw-fetch/src/extractors/youtube_video.rs  Normal file
@@ -0,0 +1,378 @@
//! YouTube video structured extractor.
//!
//! YouTube embeds the full player configuration in a
//! `ytInitialPlayerResponse` JavaScript assignment at the top of
//! every `/watch`, `/shorts`, and `youtu.be` HTML page. We reuse the
//! core crate's already-proven regex + parse to surface typed JSON
//! from it: video id, title, author + channel id, view count,
//! duration, upload date, keywords, thumbnails, caption-track URLs.
//!
//! Auto-dispatched: YouTube host is unique and the `v=` or `/shorts/`
//! shape is stable.
//!
//! ## Fallback
//!
//! `ytInitialPlayerResponse` is missing on EU-consent interstitials,
//! some live-stream pre-show pages, and age-gated videos. In those
//! cases we drop down to OG tags for `title`, `description`,
//! `thumbnail`, and `channel`, and return a `data_source:
//! "og_fallback"` payload so the caller can tell they got a degraded
//! shape (no view count, duration, captions).

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "youtube_video",
    label: "YouTube video",
    description: "Returns video id, title, channel, view count, duration, upload date, thumbnails, keywords, and caption-track URLs. Falls back to OG metadata on consent / age-gate pages.",
    url_patterns: &[
        "https://www.youtube.com/watch?v={id}",
        "https://youtu.be/{id}",
        "https://www.youtube.com/shorts/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    webclaw_core::youtube::is_youtube_url(url)
        || url.contains("youtube.com/shorts/")
        || url.contains("youtube-nocookie.com/embed/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let video_id = parse_video_id(url).ok_or_else(|| {
        FetchError::Build(format!("youtube_video: cannot parse video id from '{url}'"))
    })?;

    // Always fetch the canonical /watch URL. /shorts/ and youtu.be
    // sometimes serve a thinner page without the player blob.
    let canonical = format!("https://www.youtube.com/watch?v={video_id}");
    let resp = client.fetch(&canonical).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "youtube returned status {} for {canonical}",
            resp.status
        )));
    }

    if let Some(player) = extract_player_response(&resp.html) {
        return Ok(build_player_payload(
            &player, &resp.html, url, &canonical, &video_id,
        ));
    }

    // No player blob. Fall back to OG tags so the call still returns
    // something useful for consent / age-gate pages.
    Ok(build_og_fallback(&resp.html, url, &canonical, &video_id))
}

// ---------------------------------------------------------------------------
// Player-blob path (rich payload)
// ---------------------------------------------------------------------------

fn build_player_payload(
    player: &Value,
    html: &str,
    url: &str,
    canonical: &str,
    video_id: &str,
) -> Value {
    let video_details = player.get("videoDetails");
    let microformat = player
        .get("microformat")
        .and_then(|m| m.get("playerMicroformatRenderer"));

    let thumbnails: Vec<Value> = video_details
        .and_then(|vd| vd.get("thumbnail"))
        .and_then(|t| t.get("thumbnails"))
        .and_then(|t| t.as_array())
        .cloned()
        .unwrap_or_default();

    let keywords: Vec<Value> = video_details
        .and_then(|vd| vd.get("keywords"))
        .and_then(|k| k.as_array())
        .cloned()
        .unwrap_or_default();

    let caption_tracks = webclaw_core::youtube::extract_caption_tracks(html);
    let captions: Vec<Value> = caption_tracks
        .iter()
        .map(|c| {
            json!({
                "url": c.url,
                "lang": c.lang,
                "name": c.name,
            })
        })
        .collect();

    json!({
        "url": url,
        "canonical_url": canonical,
        "data_source": "player_response",
        "video_id": video_id,
        "title": get_str(video_details, "title"),
        "description": get_str(video_details, "shortDescription"),
        "author": get_str(video_details, "author"),
        "channel_id": get_str(video_details, "channelId"),
        "channel_url": get_str(microformat, "ownerProfileUrl"),
        "view_count": get_int(video_details, "viewCount"),
        "length_seconds": get_int(video_details, "lengthSeconds"),
        "is_live": video_details.and_then(|vd| vd.get("isLiveContent")).and_then(|v| v.as_bool()),
        "is_private": video_details.and_then(|vd| vd.get("isPrivate")).and_then(|v| v.as_bool()),
        "is_unlisted": microformat.and_then(|m| m.get("isUnlisted")).and_then(|v| v.as_bool()),
        "allow_ratings": video_details.and_then(|vd| vd.get("allowRatings")).and_then(|v| v.as_bool()),
        "category": get_str(microformat, "category"),
        "upload_date": get_str(microformat, "uploadDate"),
        "publish_date": get_str(microformat, "publishDate"),
        "keywords": keywords,
        "thumbnails": thumbnails,
        "caption_tracks": captions,
    })
}

// ---------------------------------------------------------------------------
// OG fallback path (degraded payload)
// ---------------------------------------------------------------------------

fn build_og_fallback(html: &str, url: &str, canonical: &str, video_id: &str) -> Value {
    let title = og(html, "title");
    let description = og(html, "description");
    let thumbnail = og(html, "image");
    // YouTube sets `<meta name="channel_name" ...>` on some pages but
    // OG-only pages reliably carry `og:video:tag` and the channel in
    // `<link itemprop="name">`. We keep this lean: just what's stable.
    let channel = meta_name(html, "author");

    json!({
        "url": url,
        "canonical_url": canonical,
        "data_source": "og_fallback",
        "video_id": video_id,
        "title": title,
        "description": description,
        "author": channel,
        // OG path: these are null so the caller doesn't have to guess.
        "channel_id": None::<String>,
        "channel_url": None::<String>,
        "view_count": None::<i64>,
        "length_seconds": None::<i64>,
        "is_live": None::<bool>,
        "is_private": None::<bool>,
        "is_unlisted": None::<bool>,
        "allow_ratings": None::<bool>,
        "category": None::<String>,
        "upload_date": None::<String>,
        "publish_date": None::<String>,
        "keywords": Vec::<Value>::new(),
        "thumbnails": thumbnail.as_ref().map(|t| vec![json!({"url": t})]).unwrap_or_default(),
        "caption_tracks": Vec::<Value>::new(),
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn parse_video_id(url: &str) -> Option<String> {
    // youtu.be/{id}
    if let Some(after) = url.split("youtu.be/").nth(1) {
        let id = after
            .split(['?', '#', '/'])
            .next()
            .unwrap_or("")
            .trim_end_matches('/');
        if !id.is_empty() {
            return Some(id.to_string());
        }
    }
    // youtube.com/shorts/{id}
    if let Some(after) = url.split("youtube.com/shorts/").nth(1) {
        let id = after
            .split(['?', '#', '/'])
            .next()
            .unwrap_or("")
            .trim_end_matches('/');
        if !id.is_empty() {
            return Some(id.to_string());
        }
    }
    // youtube-nocookie.com/embed/{id}
    if let Some(after) = url.split("/embed/").nth(1) {
        let id = after
            .split(['?', '#', '/'])
            .next()
            .unwrap_or("")
            .trim_end_matches('/');
        if !id.is_empty() {
            return Some(id.to_string());
        }
    }
    // youtube.com/watch?v={id} (also matches youtube.com/watch?foo=bar&v={id})
    if let Some(q) = url.split_once('?').map(|(_, q)| q)
        && let Some(id) = q
            .split('&')
            .find_map(|p| p.strip_prefix("v=").map(|v| v.to_string()))
    {
        let id = id.split(['#', '/']).next().unwrap_or(&id).to_string();
        if !id.is_empty() {
            return Some(id);
        }
    }
    None
}

// ---------------------------------------------------------------------------
// Player-response parsing
// ---------------------------------------------------------------------------

fn extract_player_response(html: &str) -> Option<Value> {
    // Same regex as webclaw_core::youtube. Duplicated here because
    // core's regex is module-private. Kept in lockstep; changes are
    // rare and we cover with tests in both places.
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE
        .get_or_init(|| Regex::new(r"var\s+ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;").unwrap());
    let json_str = re.captures(html)?.get(1)?.as_str();
    serde_json::from_str(json_str).ok()
}

// ---------------------------------------------------------------------------
// Meta-tag helpers (for OG fallback)
// ---------------------------------------------------------------------------

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn meta_name(html: &str, name: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+name="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == name) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn get_str(v: Option<&Value>, key: &str) -> Option<String> {
    v.and_then(|x| x.get(key))
        .and_then(|x| x.as_str().map(String::from))
}

fn get_int(v: Option<&Value>, key: &str) -> Option<i64> {
    v.and_then(|x| x.get(key)).and_then(|x| {
        x.as_i64()
            .or_else(|| x.as_str().and_then(|s| s.parse::<i64>().ok()))
    })
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_watch_urls() {
        assert!(matches("https://www.youtube.com/watch?v=dQw4w9WgXcQ"));
        assert!(matches("https://youtu.be/dQw4w9WgXcQ"));
        assert!(matches("https://www.youtube.com/shorts/abc123"));
        assert!(matches(
            "https://www.youtube-nocookie.com/embed/dQw4w9WgXcQ"
        ));
    }

    #[test]
    fn rejects_non_video_urls() {
        assert!(!matches("https://www.youtube.com/"));
        assert!(!matches("https://www.youtube.com/channel/abc"));
        assert!(!matches("https://example.com/watch?v=abc"));
    }

    #[test]
    fn parse_video_id_from_each_shape() {
        assert_eq!(
            parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=10s"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://www.youtube.com/watch?feature=share&v=dQw4w9WgXcQ"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://youtu.be/dQw4w9WgXcQ"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://youtu.be/dQw4w9WgXcQ?t=30"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://www.youtube.com/shorts/abc123"),
            Some("abc123".into())
        );
    }

    #[test]
    fn extract_player_response_happy_path() {
        let html = r#"
<html><body>
<script>
var ytInitialPlayerResponse = {"videoDetails":{"videoId":"abc","title":"T","author":"A","viewCount":"100","lengthSeconds":"60","shortDescription":"d"}};
</script>
</body></html>
"#;
        let v = extract_player_response(html).unwrap();
        let vd = v.get("videoDetails").unwrap();
        assert_eq!(vd.get("title").unwrap().as_str(), Some("T"));
    }

    #[test]
    fn og_fallback_extracts_basics_from_meta_tags() {
        let html = r##"
<html><head>
<meta property="og:title" content="Example Video Title">
<meta property="og:description" content="A cool video description.">
<meta property="og:image" content="https://i.ytimg.com/vi/abc/maxresdefault.jpg">
<meta name="author" content="Example Channel">
</head></html>"##;
        let v = build_og_fallback(
            html,
            "https://www.youtube.com/watch?v=abc",
            "https://www.youtube.com/watch?v=abc",
            "abc",
        );
        assert_eq!(v["data_source"], "og_fallback");
        assert_eq!(v["title"], "Example Video Title");
        assert_eq!(v["description"], "A cool video description.");
        assert_eq!(v["author"], "Example Channel");
        assert_eq!(
            v["thumbnails"][0]["url"],
            "https://i.ytimg.com/vi/abc/maxresdefault.jpg"
        );
        assert!(v["view_count"].is_null());
        assert!(v["caption_tracks"].as_array().unwrap().is_empty());
    }
}
118 crates/webclaw-fetch/src/fetcher.rs (new file)
@@ -0,0 +1,118 @@
//! Pluggable fetcher abstraction for vertical extractors.
//!
//! Extractors call the network through this trait instead of
//! hard-coding [`FetchClient`]. The OSS CLI / MCP / self-hosted server
//! all pass `&FetchClient` (wreq-backed BoringSSL). The production API
//! server, which must not use in-process TLS fingerprinting, provides
//! its own implementation that routes through the Go tls-sidecar.
//!
//! Both paths expose the same [`FetchResult`] shape and the same
//! optional cloud-escalation client, so extractor logic stays
//! identical across environments.
//!
//! ## Choosing an implementation
//!
//! - CLI, MCP, self-hosted `webclaw-server`: build a [`FetchClient`]
//!   with [`FetchClient::with_cloud`] to attach cloud fallback, pass
//!   it to extractors as `&client`.
//! - `api.webclaw.io` production server: build a `TlsSidecarFetcher`
//!   (in `server/src/engine/`) that delegates to `engine::tls_client`
//!   and wraps it in `Arc<dyn Fetcher>` for handler injection.
//!
//! ## Why a trait and not a free function
//!
//! Extractors need state beyond a single fetch: the cloud client for
//! antibot escalation, and in the future per-user proxy pools, tenant
//! headers, circuit breakers. A trait keeps that state encapsulated
//! behind the fetch interface instead of threading it through every
//! extractor signature.

use async_trait::async_trait;

use crate::client::FetchResult;
use crate::cloud::CloudClient;
use crate::error::FetchError;

/// HTTP fetch surface used by vertical extractors.
///
/// Implementations must be `Send + Sync` because extractor dispatchers
/// run them inside tokio tasks, potentially across many requests.
#[async_trait]
pub trait Fetcher: Send + Sync {
    /// Fetch a URL and return the raw response body + metadata. The
    /// body is in `FetchResult::html` regardless of the actual content
    /// type — JSON API endpoints put JSON there, HTML pages put HTML.
    /// Extractors branch on response status and body shape.
    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError>;

    /// Fetch with additional request headers. Needed for endpoints
    /// that authenticate via a specific header (Instagram's
    /// `x-ig-app-id`, for example). The default implementation routes to
    /// [`Self::fetch`] so implementers without header support stay
    /// functional, though the extra headers won't be populated on the
    /// request.
    async fn fetch_with_headers(
        &self,
        url: &str,
        _headers: &[(&str, &str)],
    ) -> Result<FetchResult, FetchError> {
        self.fetch(url).await
    }

    /// Optional cloud-escalation client for antibot bypass. Returning
    /// `Some` tells extractors they can call into the hosted API when
    /// local fetch hits a challenge page. Returning `None` makes
    /// cloud-gated extractors emit [`CloudError::NotConfigured`] with
    /// an actionable signup link.
    ///
    /// The default implementation returns `None` because not every
    /// deployment wants cloud fallback (self-hosts that don't have a
    /// webclaw.io subscription, for instance).
    ///
    /// [`CloudError::NotConfigured`]: crate::cloud::CloudError::NotConfigured
    fn cloud(&self) -> Option<&CloudClient> {
        None
    }
}

// ---------------------------------------------------------------------------
// Blanket impls: make `&T` and `Arc<T>` behave like the wrapped `T`.
// ---------------------------------------------------------------------------

#[async_trait]
impl<T: Fetcher + ?Sized> Fetcher for &T {
    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
        (**self).fetch(url).await
    }

    async fn fetch_with_headers(
        &self,
        url: &str,
        headers: &[(&str, &str)],
    ) -> Result<FetchResult, FetchError> {
        (**self).fetch_with_headers(url, headers).await
    }

    fn cloud(&self) -> Option<&CloudClient> {
        (**self).cloud()
    }
}

#[async_trait]
impl<T: Fetcher + ?Sized> Fetcher for std::sync::Arc<T> {
    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
        (**self).fetch(url).await
    }

    async fn fetch_with_headers(
        &self,
        url: &str,
        headers: &[(&str, &str)],
    ) -> Result<FetchResult, FetchError> {
        (**self).fetch_with_headers(url, headers).await
    }

    fn cloud(&self) -> Option<&CloudClient> {
        (**self).cloud()
    }
}
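The `&T` / `Arc<T>` blanket impls above are a standard forwarding pattern. A minimal standalone sketch of the same idea (synchronous, with a stub trait in place of webclaw's actual `Fetcher`, so it compiles without the crate):

```rust
use std::sync::Arc;

// Simplified stand-in for the Fetcher trait: sync, and returning a
// String instead of FetchResult. Illustrative only.
trait Fetcher: Send + Sync {
    fn fetch(&self, url: &str) -> String;
}

struct StubFetcher;

impl Fetcher for StubFetcher {
    fn fetch(&self, url: &str) -> String {
        format!("fetched {url}")
    }
}

// Blanket impls: `&T` and `Arc<T>` forward to the wrapped `T`, so a
// generic caller accepts owned values, borrows, and shared handles alike.
impl<T: Fetcher + ?Sized> Fetcher for &T {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

// A generic "extractor" entry point, like the dispatchers described above.
fn run_extractor<F: Fetcher>(f: F, url: &str) -> String {
    f.fetch(url)
}

fn main() {
    let stub = StubFetcher;
    assert_eq!(run_extractor(&stub, "https://example.com"), "fetched https://example.com");

    // `?Sized` on the blanket impl is what lets trait objects through.
    let shared: Arc<dyn Fetcher> = Arc::new(StubFetcher);
    assert_eq!(run_extractor(shared, "https://example.com"), "fetched https://example.com");
}
```

The `?Sized` bound is the detail that makes `Arc<dyn Fetcher>` usable wherever a concrete fetcher is expected, which is exactly the handler-injection shape the module docs describe for the production server.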
@@ -3,10 +3,14 @@
//! Automatically detects PDF responses and delegates to webclaw-pdf.
pub mod browser;
pub mod client;
pub mod cloud;
pub mod crawler;
pub mod document;
pub mod error;
pub mod extractors;
+pub mod fetcher;
pub mod linkedin;
+pub mod locale;
pub mod proxy;
pub mod reddit;
pub mod sitemap;
@@ -16,7 +20,9 @@ pub use browser::BrowserProfile;
pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult};
pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};
pub use error::FetchError;
+pub use fetcher::Fetcher;
pub use http::HeaderMap;
+pub use locale::{accept_language_for_tld, accept_language_for_url};
pub use proxy::{parse_proxy_file, parse_proxy_line};
pub use sitemap::SitemapEntry;
pub use webclaw_pdf::PdfMode;
77 crates/webclaw-fetch/src/locale.rs (new file)
@@ -0,0 +1,77 @@
//! Derive an `Accept-Language` header from a URL.
//!
//! DataDome-class bot detection on country-specific sites (e.g. immobiliare.it,
//! leboncoin.fr) does a geo-vs-locale sanity check: residential IP in the
//! target country + a browser UA but the wrong `Accept-Language` is a bot
//! signal. Matching the site's expected locale gets us through.
//!
//! Default for unmapped TLDs is `en-US,en;q=0.9` — the global fallback.

/// Best-effort `Accept-Language` header value for the given URL's TLD.
/// Returns `None` if the URL cannot be parsed.
pub fn accept_language_for_url(url: &str) -> Option<&'static str> {
    let host = url::Url::parse(url).ok()?.host_str()?.to_ascii_lowercase();
    let tld = host.rsplit('.').next()?;
    Some(accept_language_for_tld(tld))
}

/// Map a bare TLD like `it`, `fr`, `de` to a plausible `Accept-Language`.
/// Unknown TLDs fall back to US English.
pub fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9",
        "fr" => "fr-FR,fr;q=0.9",
        "de" | "at" => "de-DE,de;q=0.9",
        "es" => "es-ES,es;q=0.9",
        "pt" => "pt-PT,pt;q=0.9",
        "nl" => "nl-NL,nl;q=0.9",
        "pl" => "pl-PL,pl;q=0.9",
        "se" => "sv-SE,sv;q=0.9",
        "no" => "nb-NO,nb;q=0.9",
        "dk" => "da-DK,da;q=0.9",
        "fi" => "fi-FI,fi;q=0.9",
        "cz" => "cs-CZ,cs;q=0.9",
        "ro" => "ro-RO,ro;q=0.9",
        "gr" => "el-GR,el;q=0.9",
        "tr" => "tr-TR,tr;q=0.9",
        "ru" => "ru-RU,ru;q=0.9",
        "jp" => "ja-JP,ja;q=0.9",
        "kr" => "ko-KR,ko;q=0.9",
        "cn" => "zh-CN,zh;q=0.9",
        "tw" | "hk" => "zh-TW,zh;q=0.9",
        "br" => "pt-BR,pt;q=0.9",
        "mx" | "ar" | "co" | "cl" | "pe" => "es-ES,es;q=0.9",
        "uk" | "ie" => "en-GB,en;q=0.9",
        _ => "en-US,en;q=0.9",
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn tld_dispatch() {
        assert_eq!(
            accept_language_for_url("https://www.immobiliare.it/annunci/1"),
            Some("it-IT,it;q=0.9")
        );
        assert_eq!(
            accept_language_for_url("https://www.leboncoin.fr/"),
            Some("fr-FR,fr;q=0.9")
        );
        assert_eq!(
            accept_language_for_url("https://www.amazon.co.uk/"),
            Some("en-GB,en;q=0.9")
        );
        assert_eq!(
            accept_language_for_url("https://example.com/"),
            Some("en-US,en;q=0.9")
        );
    }

    #[test]
    fn bad_url_returns_none() {
        assert_eq!(accept_language_for_url("not-a-url"), None);
    }
}
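The host-to-TLD step in `accept_language_for_url` leans on the `url` crate for parsing; the final split itself is just string handling. A standalone sketch of that split (illustrative only, not the crate's parser — note that for `amazon.co.uk` a bare rightmost-label split yields `uk`, which is exactly what the `"uk" | "ie"` arm expects):

```rust
// Minimal sketch: pull the last dot-separated label out of a host name,
// mirroring the `host.rsplit('.').next()` step in accept_language_for_url.
fn tld_of(host: &str) -> Option<&str> {
    let host = host.trim_end_matches('.'); // tolerate a trailing root dot
    host.rsplit('.').next().filter(|t| !t.is_empty())
}

fn main() {
    assert_eq!(tld_of("www.immobiliare.it"), Some("it"));
    assert_eq!(tld_of("www.amazon.co.uk"), Some("uk"));
    assert_eq!(tld_of("localhost"), Some("localhost")); // no dot: whole host comes back
    assert_eq!(tld_of(""), None);
}
```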
@@ -7,10 +7,15 @@
use std::time::Duration;

+use std::borrow::Cow;
+
use wreq::http2::{
    Http2Options, PseudoId, PseudoOrder, SettingId, SettingsOrder, StreamDependency, StreamId,
};
-use wreq::tls::{AlpsProtocol, CertificateCompressionAlgorithm, TlsOptions, TlsVersion};
+use wreq::tls::{
+    AlpnProtocol, AlpsProtocol, CertificateCompressionAlgorithm, ExtensionType, TlsOptions,
+    TlsVersion,
+};
use wreq::{Client, Emulation};

use crate::browser::BrowserVariant;
@@ -43,6 +48,55 @@ const SAFARI_SIGALGS: &str = "ecdsa_secp256r1_sha256:rsa_pss_rsae_sha256:rsa_pkc
/// Safari curves.
const SAFARI_CURVES: &str = "X25519:P-256:P-384:P-521";

/// Safari iOS 26 TLS extension order, matching bogdanfinn's
/// `safari_ios_26_0` wire format. GREASE slots are omitted; wreq
/// inserts them itself. Diverges from wreq-util's default SafariIos26
/// extension order, which DataDome's immobiliare.it ruleset flags.
fn safari_ios_extensions() -> Vec<ExtensionType> {
    vec![
        ExtensionType::CERTIFICATE_TIMESTAMP,
        ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION,
        ExtensionType::SERVER_NAME,
        ExtensionType::CERT_COMPRESSION,
        ExtensionType::KEY_SHARE,
        ExtensionType::SUPPORTED_VERSIONS,
        ExtensionType::PSK_KEY_EXCHANGE_MODES,
        ExtensionType::SUPPORTED_GROUPS,
        ExtensionType::RENEGOTIATE,
        ExtensionType::SIGNATURE_ALGORITHMS,
        ExtensionType::STATUS_REQUEST,
        ExtensionType::EC_POINT_FORMATS,
        ExtensionType::EXTENDED_MASTER_SECRET,
    ]
}

/// Chrome 133 TLS extension order, matching bogdanfinn's stable JA3
/// (`43067709b025da334de1279a120f8e14`). Real Chrome permutes extensions
/// per handshake, but indeed.com's WAF allowlists this specific wire order
/// and rejects permuted ones. GREASE slots are inserted by wreq.
///
/// JA3 extension field from peet.ws: 18-5-35-51-10-45-11-27-17613-43-13-0-16-65037-65281-23
fn chrome_extensions() -> Vec<ExtensionType> {
    vec![
        ExtensionType::CERTIFICATE_TIMESTAMP,                  // 18
        ExtensionType::STATUS_REQUEST,                         // 5
        ExtensionType::SESSION_TICKET,                         // 35
        ExtensionType::KEY_SHARE,                              // 51
        ExtensionType::SUPPORTED_GROUPS,                       // 10
        ExtensionType::PSK_KEY_EXCHANGE_MODES,                 // 45
        ExtensionType::EC_POINT_FORMATS,                       // 11
        ExtensionType::CERT_COMPRESSION,                       // 27
        ExtensionType::APPLICATION_SETTINGS_NEW,               // 17613 (new codepoint, matches alps_use_new_codepoint)
        ExtensionType::SUPPORTED_VERSIONS,                     // 43
        ExtensionType::SIGNATURE_ALGORITHMS,                   // 13
        ExtensionType::SERVER_NAME,                            // 0
        ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION, // 16
        ExtensionType::ENCRYPTED_CLIENT_HELLO,                 // 65037
        ExtensionType::RENEGOTIATE,                            // 65281
        ExtensionType::EXTENDED_MASTER_SECRET,                 // 23
    ]
}

// --- Chrome HTTP headers in correct wire order ---

const CHROME_HEADERS: &[(&str, &str)] = &[
@@ -130,6 +184,26 @@ const SAFARI_HEADERS: &[(&str, &str)] = &[
    ("sec-fetch-dest", "document"),
];

/// Safari iOS 26 headers, in the wire order real Safari emits. Critically:
/// NO `sec-fetch-*`, NO `priority: u=0, i` (both Chromium-only leaks), but
/// `upgrade-insecure-requests: 1` is present. `accept-encoding` does not
/// include zstd (Safari can't decode it). Verified against bogdanfinn on
/// 2026-04-22: this header set is what DataDome's immobiliare ruleset
/// expects for a real iPhone.
const SAFARI_IOS_HEADERS: &[(&str, &str)] = &[
    (
        "accept",
        "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    ),
    ("accept-language", "en-US,en;q=0.9"),
    ("accept-encoding", "gzip, deflate, br"),
    (
        "user-agent",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 26_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Mobile/15E148 Safari/604.1",
    ),
    ("upgrade-insecure-requests", "1"),
];

const EDGE_HEADERS: &[(&str, &str)] = &[
    (
        "sec-ch-ua",
@@ -156,7 +230,9 @@ const EDGE_HEADERS: &[(&str, &str)] = &[
];

fn chrome_tls() -> TlsOptions {
    // permute_extensions is off so the explicit extension_permutation sticks.
    // Real Chrome permutes, but indeed.com's WAF allowlists bogdanfinn's
    // fixed order, so matching that gets us through.
    TlsOptions::builder()
        .cipher_list(CHROME_CIPHERS)
        .sigalgs_list(CHROME_SIGALGS)
@@ -163,12 +240,18 @@ fn chrome_tls() -> TlsOptions {
        .min_tls_version(TlsVersion::TLS_1_2)
        .max_tls_version(TlsVersion::TLS_1_3)
        .grease_enabled(true)
-        .permute_extensions(true)
+        .permute_extensions(false)
+        .extension_permutation(chrome_extensions())
        .enable_ech_grease(true)
        .pre_shared_key(true)
        .enable_ocsp_stapling(true)
        .enable_signed_cert_timestamps(true)
-        .alps_protocols([AlpsProtocol::HTTP2])
+        .alpn_protocols([
+            AlpnProtocol::HTTP3,
+            AlpnProtocol::HTTP2,
+            AlpnProtocol::HTTP1,
+        ])
+        .alps_protocols([AlpsProtocol::HTTP3, AlpsProtocol::HTTP2])
        .alps_use_new_codepoint(true)
        .aes_hw_override(true)
        .certificate_compression_algorithms(&[CertificateCompressionAlgorithm::BROTLI])
@@ -212,25 +295,70 @@ fn safari_tls() -> TlsOptions {
        .build()
}

/// Safari iOS 26 emulation — composed on top of `wreq_util::Emulation::SafariIos26`
/// with four targeted overrides. We don't hand-roll this one like Chrome/Firefox
/// because the wire-level defaults from wreq-util are already correct for ciphers,
/// sigalgs, curves, and GREASE — the four things wreq-util gets *wrong* for
/// DataDome compatibility are overridden here:
///
/// 1. TLS extension order: match bogdanfinn `safari_ios_26_0` exactly (JA3
///    ends up `8d909525bd5bbb79f133d11cc05159fe`).
/// 2. HTTP/2 HEADERS priority flag: weight=256, exclusive=1, depends_on=0.
///    wreq-util omits this frame; real Safari and bogdanfinn include it.
///    This flip is the thing DataDome actually reads — the akamai_fingerprint
///    hash changes from `c52879e43202aeb92740be6e8c86ea96` to
///    `d1294410a06522e37a5c5e3f0a45a705`, which is the winning signature.
/// 3. Headers: strip wreq-util's Chromium defaults (`sec-fetch-*`,
///    `priority: u=0, i`, zstd), replace with the real iOS 26 set.
/// 4. `accept-language` preserved from config.extra_headers for locale.
fn safari_ios_emulation() -> wreq::Emulation {
    use wreq::EmulationFactory;
    let mut em = wreq_util::Emulation::SafariIos26.emulation();

    if let Some(tls) = em.tls_options_mut().as_mut() {
        tls.extension_permutation = Some(Cow::Owned(safari_ios_extensions()));
    }

    // Only override the priority flag — keep wreq-util's SETTINGS, WINDOW_UPDATE,
    // and pseudo-order intact. Replacing the whole Http2Options resets SETTINGS
    // to defaults, which sends only INITIAL_WINDOW_SIZE and fails DataDome.
    if let Some(h2) = em.http2_options_mut().as_mut() {
        h2.headers_stream_dependency = Some(StreamDependency::new(StreamId::zero(), 255, true));
    }

    let hm = em.headers_mut();
    hm.clear();
    for (k, v) in SAFARI_IOS_HEADERS {
        if let (Ok(n), Ok(val)) = (
            http::header::HeaderName::from_bytes(k.as_bytes()),
            http::header::HeaderValue::from_str(v),
        ) {
            hm.append(n, val);
        }
    }

    em
}

fn chrome_h2() -> Http2Options {
    // SETTINGS frame matches bogdanfinn `chrome_133`: HEADER_TABLE_SIZE,
    // ENABLE_PUSH=0, INITIAL_WINDOW_SIZE, MAX_HEADER_LIST_SIZE. No
    // MAX_CONCURRENT_STREAMS — real Chrome 133 and bogdanfinn both omit it,
    // and indeed.com's WAF reads this as a bot signal when present. Priority
    // weight 256 (encoded as 255 + 1) matches bogdanfinn's HEADERS frame.
    Http2Options::builder()
        .initial_window_size(6_291_456)
        .initial_connection_window_size(15_728_640)
        .max_header_list_size(262_144)
        .header_table_size(65_536)
        .max_concurrent_streams(1000u32)
        .enable_push(false)
        .settings_order(
            SettingsOrder::builder()
                .extend([
                    SettingId::HeaderTableSize,
                    SettingId::EnablePush,
                    SettingId::MaxConcurrentStreams,
                    SettingId::InitialWindowSize,
                    SettingId::MaxFrameSize,
                    SettingId::MaxHeaderListSize,
                    SettingId::EnableConnectProtocol,
                    SettingId::NoRfc7540Priorities,
                ])
                .build(),
        )
@@ -244,7 +372,7 @@ fn chrome_h2() -> Http2Options {
                ])
                .build(),
        )
-        .headers_stream_dependency(StreamDependency::new(StreamId::zero(), 219, true))
+        .headers_stream_dependency(StreamDependency::new(StreamId::zero(), 255, true))
        .build()
}
@@ -328,32 +456,38 @@ pub fn build_client(
    extra_headers: &std::collections::HashMap<String, String>,
    proxy: Option<&str>,
) -> Result<Client, FetchError> {
-    let (tls, h2, headers) = match variant {
-        BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
-        BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
-        BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
-        BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
-        BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+    // SafariIos26 builds its Emulation on top of wreq-util's base instead
+    // of from scratch. See `safari_ios_emulation` for why.
+    let mut emulation = match variant {
+        BrowserVariant::SafariIos26 => safari_ios_emulation(),
+        other => {
+            let (tls, h2, headers) = match other {
+                BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
+                BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
+                BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
+                BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
+                BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+                BrowserVariant::SafariIos26 => unreachable!("handled above"),
+            };
+            Emulation::builder()
+                .tls_options(tls)
+                .http2_options(h2)
+                .headers(build_headers(headers))
+                .build()
+        }
    };

-    let mut header_map = build_headers(headers);
-
-    // Append extra headers after profile defaults
+    // Append extra headers after profile defaults.
+    let hm = emulation.headers_mut();
    for (k, v) in extra_headers {
        if let (Ok(n), Ok(val)) = (
            http::header::HeaderName::from_bytes(k.as_bytes()),
            http::header::HeaderValue::from_str(v),
        ) {
-            header_map.insert(n, val);
+            hm.insert(n, val);
        }
    }

-    let emulation = Emulation::builder()
-        .tls_options(tls)
-        .http2_options(h2)
-        .headers(header_map)
-        .build();
-
    let mut builder = Client::builder()
        .emulation(emulation)
        .redirect(wreq::redirect::Policy::limited(10))
@@ -22,6 +22,5 @@ serde_json = { workspace = true }
tokio = { workspace = true }
tracing = { workspace = true }
tracing-subscriber = { workspace = true }
-reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
url = "2"
dirs = "6.0.0"
@@ -1,302 +0,0 @@
/// Cloud API fallback for protected sites.
///
/// When local fetch returns a challenge page, this module retries
/// via api.webclaw.io. Requires WEBCLAW_API_KEY to be set.
use std::time::Duration;

use serde_json::{Value, json};
use tracing::info;

const API_BASE: &str = "https://api.webclaw.io/v1";

/// Lightweight client for the webclaw cloud API.
pub struct CloudClient {
    api_key: String,
    http: reqwest::Client,
}

impl CloudClient {
    /// Create a new cloud client from WEBCLAW_API_KEY env var.
    /// Returns None if the key is not set.
    pub fn from_env() -> Option<Self> {
        let key = std::env::var("WEBCLAW_API_KEY").ok()?;
        if key.is_empty() {
            return None;
        }
        let http = reqwest::Client::builder()
            .timeout(Duration::from_secs(60))
            .build()
            .unwrap_or_default();
        Some(Self { api_key: key, http })
    }

    /// Scrape a URL via the cloud API. Returns the response JSON.
    pub async fn scrape(
        &self,
        url: &str,
        formats: &[&str],
        include_selectors: &[String],
        exclude_selectors: &[String],
        only_main_content: bool,
    ) -> Result<Value, String> {
        let mut body = json!({
            "url": url,
            "formats": formats,
        });

        if only_main_content {
            body["only_main_content"] = json!(true);
        }
        if !include_selectors.is_empty() {
            body["include_selectors"] = json!(include_selectors);
        }
        if !exclude_selectors.is_empty() {
            body["exclude_selectors"] = json!(exclude_selectors);
        }

        self.post("scrape", body).await
    }

    /// Generic POST to the cloud API.
    pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> {
        let resp = self
            .http
            .post(format!("{API_BASE}/{endpoint}"))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .json(&body)
            .send()
            .await
            .map_err(|e| format!("Cloud API request failed: {e}"))?;

        let status = resp.status();
        if !status.is_success() {
            let text = resp.text().await.unwrap_or_default();
            let truncated = truncate_error(&text);
            return Err(format!("Cloud API error {status}: {truncated}"));
        }

        resp.json::<Value>()
            .await
            .map_err(|e| format!("Cloud API response parse failed: {e}"))
    }

    /// Generic GET from the cloud API.
    pub async fn get(&self, endpoint: &str) -> Result<Value, String> {
        let resp = self
            .http
            .get(format!("{API_BASE}/{endpoint}"))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map_err(|e| format!("Cloud API request failed: {e}"))?;

        let status = resp.status();
        if !status.is_success() {
            let text = resp.text().await.unwrap_or_default();
            let truncated = truncate_error(&text);
            return Err(format!("Cloud API error {status}: {truncated}"));
        }

        resp.json::<Value>()
            .await
            .map_err(|e| format!("Cloud API response parse failed: {e}"))
    }
}

/// Truncate error body to avoid flooding logs with huge HTML responses.
fn truncate_error(text: &str) -> &str {
    const MAX_LEN: usize = 500;
    match text.char_indices().nth(MAX_LEN) {
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text,
    }
}
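`truncate_error` truncates by character position rather than byte index, which matters for multi-byte UTF-8: slicing at an arbitrary byte offset can land inside a code point and panic. A standalone check of the same `char_indices` pattern:

```rust
// Truncate to at most `max` characters without splitting a UTF-8 code
// point — the same char_indices-based pattern as truncate_error.
fn truncate_chars(text: &str, max: usize) -> &str {
    match text.char_indices().nth(max) {
        // nth(max) yields the byte offset where the (max+1)-th char starts,
        // which is guaranteed to be a valid slice boundary.
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text, // fewer than `max` chars: return unchanged
    }
}

fn main() {
    // 'é' occupies bytes 1..3, so a byte slice &s[..2] would panic here.
    assert_eq!(truncate_chars("héllo", 2), "hé");
    assert_eq!(truncate_chars("ok", 500), "ok");
    assert_eq!(truncate_chars("", 10), "");
}
```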
/// Check if fetched HTML looks like a bot protection challenge page.
/// Detects common bot protection challenge pages.
pub fn is_bot_protected(html: &str, headers: &webclaw_fetch::HeaderMap) -> bool {
    let html_lower = html.to_lowercase();

    // Cloudflare challenge page
    if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
        return true;
    }

    // Cloudflare "checking your browser" spinner
    if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
        && html_lower.contains("cf-spinner")
    {
        return true;
    }

    // Cloudflare Turnstile (only on short pages = challenge, not embedded on real content)
    if (html_lower.contains("cf-turnstile")
        || html_lower.contains("challenges.cloudflare.com/turnstile"))
        && html.len() < 100_000
    {
        return true;
    }

    // DataDome
    if html_lower.contains("geo.captcha-delivery.com")
        || html_lower.contains("captcha-delivery.com/captcha")
    {
        return true;
    }

    // AWS WAF
    if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
        return true;
    }

    // hCaptcha blocking page
    if html_lower.contains("hcaptcha.com")
        && html_lower.contains("h-captcha")
        && html.len() < 50_000
    {
        return true;
    }

    // Cloudflare via headers + challenge body
    let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some();
    if has_cf_headers
        && (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
    {
        return true;
    }

    false
}

/// Check if a page likely needs JS rendering (SPA with almost no text content).
pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
    let has_scripts = html.contains("<script");

    // Tier 1: almost no extractable text from a large page
    if word_count < 50 && html.len() > 5_000 && has_scripts {
        return true;
    }

    // Tier 2: SPA framework detected with suspiciously low content-to-HTML ratio
    if word_count < 800 && html.len() > 50_000 && has_scripts {
        let html_lower = html.to_lowercase();
        let has_spa_marker = html_lower.contains("react-app")
            || html_lower.contains("id=\"__next\"")
            || html_lower.contains("id=\"root\"")
            || html_lower.contains("id=\"app\"")
            || html_lower.contains("__next_data__")
            || html_lower.contains("nuxt")
            || html_lower.contains("ng-app");

        if has_spa_marker {
            return true;
        }
    }

    false
}

/// Result of a smart fetch: either local extraction or cloud API response.
pub enum SmartFetchResult {
    /// Successfully extracted locally.
    Local(Box<webclaw_core::ExtractionResult>),
    /// Fell back to cloud API. Contains the API response JSON.
    Cloud(Value),
}

/// Try local fetch first, fall back to cloud API if bot-protected or JS-rendered.
///
/// Returns the extraction result (local) or the cloud API response JSON.
/// If no API key is configured and local fetch is blocked, returns an error
/// with a helpful message.
pub async fn smart_fetch(
    client: &webclaw_fetch::FetchClient,
    cloud: Option<&CloudClient>,
    url: &str,
    include_selectors: &[String],
    exclude_selectors: &[String],
    only_main_content: bool,
    formats: &[&str],
) -> Result<SmartFetchResult, String> {
    // Step 1: Try local fetch (with timeout to avoid hanging on slow servers)
    let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url))
        .await
        .map_err(|_| format!("Fetch timed out after 30s for {url}"))?
        .map_err(|e| format!("Fetch failed: {e}"))?;

    // Step 2: Check for bot protection
    if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
        info!(url, "bot protection detected, falling back to cloud API");
        return cloud_fallback(
            cloud,
            url,
            include_selectors,
            exclude_selectors,
            only_main_content,
            formats,
        )
        .await;
    }

    // Step 3: Extract locally
    let options = webclaw_core::ExtractionOptions {
        include_selectors: include_selectors.to_vec(),
        exclude_selectors: exclude_selectors.to_vec(),
        only_main_content,
        include_raw_html: false,
    };

    let extraction =
        webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
            .map_err(|e| format!("Extraction failed: {e}"))?;

    // Step 4: Check for JS-rendered pages (low content from large HTML)
    if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
        info!(
            url,
            word_count = extraction.metadata.word_count,
            html_len = fetch_result.html.len(),
            "JS-rendered page detected, falling back to cloud API"
        );
        return cloud_fallback(
            cloud,
            url,
            include_selectors,
            exclude_selectors,
            only_main_content,
            formats,
        )
        .await;
    }

    Ok(SmartFetchResult::Local(Box::new(extraction)))
}

async fn cloud_fallback(
    cloud: Option<&CloudClient>,
    url: &str,
    include_selectors: &[String],
    exclude_selectors: &[String],
    only_main_content: bool,
    formats: &[&str],
) -> Result<SmartFetchResult, String> {
    match cloud {
        Some(c) => {
            let resp = c
                .scrape(
                    url,
                    formats,
                    include_selectors,
                    exclude_selectors,
                    only_main_content,
                )
                .await?;
            info!(url, "cloud API fallback successful");
            Ok(SmartFetchResult::Cloud(resp))
        }
        None => Err(format!(
            "Bot protection detected on {url}. Set WEBCLAW_API_KEY for automatic cloud bypass. \
             Get a key at https://webclaw.io"
        )),
    }
}
|
|
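The `timeout(...).map_err(...)?.map_err(...)?` ladder in step 1 flattens two error layers (the timer's `Elapsed`, then the fetch's own error) into one `Result`. A standalone sketch of that shape, with the real fetch types replaced by hypothetical string errors and the timeout modelled as a unit error:

```rust
// Sketch of the two-layer error flattening used in step 1.
// Outer Err: the timeout fired (modelled here as ()).
// Inner Result: the fetch's own success/failure.
fn flatten_timeout(
    res: Result<Result<String, String>, ()>,
    url: &str,
) -> Result<String, String> {
    res.map_err(|_| format!("Fetch timed out after 30s for {url}"))?
        .map_err(|e| format!("Fetch failed: {e}"))
}

fn main() {
    // Success passes straight through both layers.
    assert_eq!(
        flatten_timeout(Ok(Ok("html".to_string())), "u").unwrap(),
        "html"
    );
    // Outer failure becomes the timeout message.
    assert!(flatten_timeout(Err(()), "u").unwrap_err().contains("timed out"));
    // Inner failure becomes the fetch-failed message.
    assert!(flatten_timeout(Ok(Err("403".to_string())), "u")
        .unwrap_err()
        .contains("Fetch failed: 403"));
}
```

The first `?` propagates the timeout message; only then does the second `map_err` get a chance to wrap the fetch error.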
@@ -1,7 +1,6 @@
 /// webclaw-mcp: MCP (Model Context Protocol) server for webclaw.
 /// Exposes web extraction tools over stdio transport for AI agents
 /// like Claude Desktop, Claude Code, and other MCP clients.
-mod cloud;
 mod server;
 mod tools;

@@ -15,7 +15,8 @@ use serde_json::json;
 use tracing::{error, info, warn};
 use url::Url;

-use crate::cloud::{self, CloudClient, SmartFetchResult};
+use webclaw_fetch::cloud::{self, CloudClient, SmartFetchResult};
 use crate::tools::*;

 pub struct WebclawMcp {

@@ -717,6 +718,55 @@ impl WebclawMcp {
             Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
         }
     }
+
+    /// List every vertical extractor the server knows about. Returns a
+    /// JSON array of `{name, label, description, url_patterns}` entries.
+    /// Call this to discover what verticals are available before using
+    /// `vertical_scrape`.
+    #[tool]
+    async fn list_extractors(
+        &self,
+        Parameters(_params): Parameters<ListExtractorsParams>,
+    ) -> Result<String, String> {
+        let catalog = webclaw_fetch::extractors::list();
+        serde_json::to_string_pretty(&catalog)
+            .map_err(|e| format!("failed to serialise extractor catalog: {e}"))
+    }
+
+    /// Run a vertical extractor by name and return typed JSON specific
+    /// to the target site (title, price, rating, author, etc.), not
+    /// generic markdown. Use `list_extractors` to discover available
+    /// names. Example names: `reddit`, `github_repo`, `trustpilot_reviews`,
+    /// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`.
+    ///
+    /// Antibot-gated verticals (amazon_product, ebay_listing,
+    /// etsy_listing, trustpilot_reviews) will automatically escalate to
+    /// the webclaw cloud API when local fetch hits bot protection,
+    /// provided `WEBCLAW_API_KEY` is set.
+    #[tool]
+    async fn vertical_scrape(
+        &self,
+        Parameters(params): Parameters<VerticalParams>,
+    ) -> Result<String, String> {
+        validate_url(&params.url)?;
+        // Use the cached Firefox client, not the default Chrome one.
+        // Reddit's `.json` endpoint rejects the wreq-Chrome TLS
+        // fingerprint with a 403 even from residential IPs (they
+        // ship a fingerprint blocklist that includes common
+        // browser-emulation libraries). The wreq-Firefox fingerprint
+        // still passes, and Firefox is equally fine for every other
+        // vertical in the catalog, so it's a strictly-safer default
+        // for `vertical_scrape` than the generic `scrape` tool's
+        // Chrome default. Matches the CLI `webclaw vertical`
+        // subcommand which already uses Firefox.
+        let client = self.firefox_or_build()?;
+        let data =
+            webclaw_fetch::extractors::dispatch_by_name(client.as_ref(), &params.name, &params.url)
+                .await
+                .map_err(|e| e.to_string())?;
+        serde_json::to_string_pretty(&data)
+            .map_err(|e| format!("failed to serialise extractor output: {e}"))
+    }
 }

 #[tool_handler]

@@ -726,7 +776,8 @@ impl ServerHandler for WebclawMcp {
         .with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION")))
         .with_instructions(String::from(
             "Webclaw MCP server -- web content extraction for AI agents. \
-             Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.",
+             Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \
+             list_extractors, vertical_scrape.",
         ))
     }
 }

@@ -103,3 +103,20 @@ pub struct SearchParams {
     /// Number of results to return (default: 10)
     pub num_results: Option<u32>,
 }
+
+/// Parameters for `vertical_scrape`: run a site-specific extractor by name.
+#[derive(Debug, Deserialize, JsonSchema)]
+pub struct VerticalParams {
+    /// Name of the vertical extractor. Call `list_extractors` to see all
+    /// available names. Examples: "reddit", "github_repo", "pypi",
+    /// "trustpilot_reviews", "youtube_video", "shopify_product".
+    pub name: String,
+    /// URL to extract. Must match the URL patterns the extractor claims;
+    /// otherwise the tool returns a clear "URL mismatch" error.
+    pub url: String,
+}
+
+/// `list_extractors` takes no arguments but we still need an empty struct
+/// so rmcp can generate a schema and parse the (empty) JSON-RPC params.
+#[derive(Debug, Deserialize, JsonSchema)]
+pub struct ListExtractorsParams {}

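For illustration, a `vertical_scrape` tool call over MCP might carry arguments shaped like the following (hypothetical values; the JSON-RPC envelope itself is handled by the MCP client):

```
{
  "name": "vertical_scrape",
  "arguments": {
    "name": "github_repo",
    "url": "https://github.com/0xMassi/webclaw"
  }
}
```

`list_extractors` would be called the same way with an empty `arguments` object, which is exactly why the empty `ListExtractorsParams` struct exists.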
@@ -79,10 +79,15 @@ async fn main() -> anyhow::Result<()> {

     let v1 = Router::new()
         .route("/scrape", post(routes::scrape::scrape))
+        .route(
+            "/scrape/{vertical}",
+            post(routes::structured::scrape_vertical),
+        )
         .route("/crawl", post(routes::crawl::crawl))
         .route("/map", post(routes::map::map))
         .route("/batch", post(routes::batch::batch))
         .route("/extract", post(routes::extract::extract))
+        .route("/extractors", get(routes::structured::list_extractors))
         .route("/summarize", post(routes::summarize::summarize_route))
         .route("/diff", post(routes::diff::diff_route))
         .route("/brand", post(routes::brand::brand))

@@ -15,4 +15,5 @@ pub mod extract;
 pub mod health;
 pub mod map;
 pub mod scrape;
+pub mod structured;
 pub mod summarize;

crates/webclaw-server/src/routes/structured.rs (new file, 55 lines)
@@ -0,0 +1,55 @@
+//! `POST /v1/scrape/{vertical}` and `GET /v1/extractors`.
+//!
+//! Vertical extractors return typed JSON instead of generic markdown.
+//! See `webclaw_fetch::extractors` for the catalog and per-site logic.
+
+use axum::{
+    Json,
+    extract::{Path, State},
+};
+use serde::Deserialize;
+use serde_json::{Value, json};
+use webclaw_fetch::extractors::{self, ExtractorDispatchError};
+
+use crate::{error::ApiError, state::AppState};
+
+#[derive(Debug, Deserialize)]
+pub struct ScrapeRequest {
+    pub url: String,
+}
+
+/// Map dispatcher errors to ApiError so users get clean HTTP statuses
+/// instead of opaque 500s.
+impl From<ExtractorDispatchError> for ApiError {
+    fn from(e: ExtractorDispatchError) -> Self {
+        match e {
+            ExtractorDispatchError::UnknownVertical(_) => ApiError::NotFound,
+            ExtractorDispatchError::UrlMismatch { .. } => ApiError::bad_request(e.to_string()),
+            ExtractorDispatchError::Fetch(f) => ApiError::Fetch(f.to_string()),
+        }
+    }
+}
+
+/// `GET /v1/extractors` — catalog of all available verticals.
+pub async fn list_extractors() -> Json<Value> {
+    Json(json!({
+        "extractors": extractors::list(),
+    }))
+}
+
+/// `POST /v1/scrape/{vertical}` — explicit vertical, e.g. /v1/scrape/reddit.
+pub async fn scrape_vertical(
+    State(state): State<AppState>,
+    Path(vertical): Path<String>,
+    Json(req): Json<ScrapeRequest>,
+) -> Result<Json<Value>, ApiError> {
+    if req.url.trim().is_empty() {
+        return Err(ApiError::bad_request("`url` is required"));
+    }
+    let data = extractors::dispatch_by_name(state.fetch(), &vertical, &req.url).await?;
+    Ok(Json(json!({
+        "vertical": vertical,
+        "url": req.url,
+        "data": data,
+    })))
+}

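As a shape sketch of the two routes above (hypothetical host, abbreviated bodies; the response shapes follow directly from the handlers):

```
GET  /v1/extractors
     => {"extractors": [{"name": "reddit", "label": "...", ...}, ...]}

POST /v1/scrape/reddit
     {"url": "https://www.reddit.com/r/rust/"}
     => {"vertical": "reddit", "url": "https://www.reddit.com/r/rust/", "data": {...}}
```

An unknown `{vertical}` maps to 404 and a URL that doesn't match the extractor's patterns maps to 400, per the `From<ExtractorDispatchError>` impl.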
@@ -1,7 +1,24 @@
 //! Shared application state. Cheap to clone via Arc; held by the axum
 //! Router for the life of the process.
+//!
+//! Two unrelated keys get carried here:
+//!
+//! 1. [`AppState::api_key`] — the **bearer token clients must present**
+//!    to call this server. Set via `WEBCLAW_API_KEY` / `--api-key`.
+//!    Unset = open mode.
+//! 2. The inner [`webclaw_fetch::cloud::CloudClient`] (if any) — our
+//!    **outbound** credential for api.webclaw.io, used by extractors
+//!    that escalate on antibot. Set via `WEBCLAW_CLOUD_API_KEY`.
+//!    Unset = hard-site extractors return a "set WEBCLAW_CLOUD_API_KEY"
+//!    error with a signup link.
+//!
+//! Different variables on purpose: conflating the two means operators
+//! who want their server behind an auth token can't also enable cloud
+//! fallback, and vice versa.

 use std::sync::Arc;
+use tracing::info;
+use webclaw_fetch::cloud::CloudClient;
 use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig};

 /// Single-process state shared across all request handlers.

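The two-variable split described in the module doc maps onto deployment config along these lines (illustrative values, not real credentials):

```shell
# Inbound: token callers must present to this server's /v1/* surface (optional).
export WEBCLAW_API_KEY="inbound-secret"
# Outbound: credential for api.webclaw.io cloud fallback (optional, independent).
export WEBCLAW_CLOUD_API_KEY="cloud-secret"
```

Setting only `WEBCLAW_API_KEY` with no cloud key puts the server in authenticated mode without cloud fallback; setting only `WEBCLAW_CLOUD_API_KEY` gives an open server that can still escalate antibot-gated sites.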
@@ -17,6 +34,7 @@ struct Inner {
     /// auto-deref `&Arc<FetchClient>` -> `&FetchClient`, so this costs
     /// them nothing.
     pub fetch: Arc<FetchClient>,
+    /// Inbound bearer-auth token for this server's own `/v1/*` surface.
     pub api_key: Option<String>,
 }

@@ -24,17 +42,34 @@ impl AppState {
     /// Build the application state. The fetch client is constructed once
     /// and shared across requests so connection pools + browser profile
     /// state don't churn per request.
-    pub fn new(api_key: Option<String>) -> anyhow::Result<Self> {
+    ///
+    /// `inbound_api_key` is the bearer token clients must present;
+    /// cloud-fallback credentials come from the env (checked here).
+    pub fn new(inbound_api_key: Option<String>) -> anyhow::Result<Self> {
         let config = FetchConfig {
-            browser: BrowserProfile::Chrome,
+            browser: BrowserProfile::Firefox,
             ..FetchConfig::default()
         };
-        let fetch = FetchClient::new(config)
+        let mut fetch = FetchClient::new(config)
             .map_err(|e| anyhow::anyhow!("failed to build fetch client: {e}"))?;

+        // Cloud fallback: only activates when the operator has provided
+        // an api.webclaw.io key. Supports both WEBCLAW_CLOUD_API_KEY
+        // (preferred, disambiguates from the inbound-auth key) and
+        // WEBCLAW_API_KEY as a fallback when there's no inbound key
+        // configured (backwards compat with MCP / CLI conventions).
+        if let Some(cloud) = build_cloud_client(inbound_api_key.as_deref()) {
+            info!(
+                base = cloud.base_url(),
+                "cloud fallback enabled — antibot-protected sites will escalate via api.webclaw.io"
+            );
+            fetch = fetch.with_cloud(cloud);
+        }
+
         Ok(Self {
             inner: Arc::new(Inner {
                 fetch: Arc::new(fetch),
-                api_key,
+                api_key: inbound_api_key,
             }),
         })
     }

@@ -47,3 +82,26 @@ impl AppState {
         self.inner.api_key.as_deref()
     }
 }
+
+/// Resolve the outbound cloud key. Prefers `WEBCLAW_CLOUD_API_KEY`;
+/// falls back to `WEBCLAW_API_KEY` *only* when no inbound key is
+/// configured (i.e. open mode — the same env var can't mean two
+/// things to one process).
+fn build_cloud_client(inbound_api_key: Option<&str>) -> Option<CloudClient> {
+    let cloud_key = std::env::var("WEBCLAW_CLOUD_API_KEY").ok();
+    if let Some(k) = cloud_key.as_deref()
+        && !k.trim().is_empty()
+    {
+        return Some(CloudClient::with_key(k));
+    }
+    // Reuse WEBCLAW_API_KEY only when not also acting as our own
+    // inbound-auth token — otherwise we'd be telling the operator
+    // they can't have both.
+    if inbound_api_key.is_none()
+        && let Ok(k) = std::env::var("WEBCLAW_API_KEY")
+        && !k.trim().is_empty()
+    {
+        return Some(CloudClient::with_key(k));
+    }
+    None
+}
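The precedence `build_cloud_client` implements can be isolated as a pure function over the two candidate env values plus the inbound key. This is a sketch with hypothetical names; the real function reads the process environment and returns a `CloudClient` rather than a string:

```rust
// Pure sketch of the key-resolution precedence in build_cloud_client:
// 1. The dedicated cloud key always wins when non-empty.
// 2. The generic key is reused only when no inbound key is configured.
// 3. Whitespace-only values count as unset.
fn resolve_cloud_key(
    cloud_key: Option<&str>,
    generic_key: Option<&str>,
    inbound_key: Option<&str>,
) -> Option<String> {
    if let Some(k) = cloud_key {
        if !k.trim().is_empty() {
            return Some(k.to_string());
        }
    }
    if inbound_key.is_none() {
        if let Some(k) = generic_key {
            if !k.trim().is_empty() {
                return Some(k.to_string());
            }
        }
    }
    None
}

fn main() {
    // Dedicated cloud key wins regardless of inbound auth.
    assert_eq!(resolve_cloud_key(Some("c"), Some("g"), Some("in")).unwrap(), "c");
    // Generic key is only reused in open mode (no inbound key).
    assert_eq!(resolve_cloud_key(None, Some("g"), None).unwrap(), "g");
    assert!(resolve_cloud_key(None, Some("g"), Some("in")).is_none());
    // Empty / whitespace strings count as unset.
    assert!(resolve_cloud_key(Some("  "), None, None).is_none());
}
```

The third assertion is the interesting case: with an inbound key set, `WEBCLAW_API_KEY` is already spoken for, so cloud fallback stays off unless the dedicated variable is provided.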