Mirror of https://github.com/0xMassi/webclaw.git, synced 2026-04-25 00:06:21 +02:00

Compare commits (12 commits)
| Author | SHA1 | Date |
|---|---|---|
| | a5c3433372 | |
| | 966981bc42 | |
| | 866fa88aa0 | |
| | b413d702b2 | |
| | 98a177dec4 | |
| | e1af2da509 | |
| | 2285c585b1 | |
| | b77767814a | |
| | 4bf11d902f | |
| | 0daa2fec1a | |
| | 058493bc8f | |
| | aaa5103504 | |
45 changed files with 812 additions and 114 deletions
58
CHANGELOG.md
@@ -3,6 +3,64 @@
 All notable changes to webclaw are documented here.
 Format follows [Keep a Changelog](https://keepachangelog.com/).

+## [0.5.6] — 2026-04-23
+
+### Added
+
+- `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API with an identifiable bot `User-Agent`, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. Makes `/v1/scrape` on Reddit populate markdown again.
+
+### Fixed
+
+- Regression introduced in 0.5.4 where the production server's `/v1/scrape` bypassed the Reddit `.json` shortcut and Akamai cookie warmup that `fetch_and_extract` had been providing. Both helpers now live in `fetch_smart`, and every caller path picks them up.
+- Panic in the markdown converter (`markdown.rs:925`) on single-pipe `|` lines. A `[1..len-1]` slice on a 1-char input triggered a `begin <= end` slice panic. Now guarded.
+
+---
+
+## [0.5.5] — 2026-04-23
+
+### Added
+
+- `webclaw --browser safari-ios` on the CLI. Pairs with `--proxy` for DataDome-fronted sites that reject desktop profiles.
+
+---
+
+## [0.5.4] — 2026-04-23
+
+### Added
+
+- New `BrowserProfile::SafariIos` for Safari iOS 26 fingerprinting. Pairs with a country-matched residential proxy for sites that reject non-mobile profiles.
+- `accept_language_for_url(url)` and `accept_language_for_tld(tld)` helpers. Return a locale-appropriate `Accept-Language` based on the URL's TLD, with `en-US` as the fallback.
+
+### Changed
+
+- Chrome browser fingerprint refreshed for current Cloudflare bot management. Fixes 403 challenges on several e-commerce and jobs sites.
+- Bumped `wreq-util` to `3.0.0-rc.10`.
+
+---
+
+## [0.5.2] — 2026-04-22
+
+### Added
+
+- **`webclaw vertical <name> <url>` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`. URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1.
+- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick.
+- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set.
+
+### Changed
+
+- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded to 10). `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`.
+
+---
+
+## [0.5.1] — 2026-04-22
+
+### Added
+
+- **`webclaw_fetch::Fetcher` trait.** Vertical extractors now consume `&dyn Fetcher` instead of `&FetchClient` directly. The trait exposes three methods (`fetch`, `fetch_with_headers`, `cloud`) covering everything extractors need. Callers that already held a `FetchClient` keep working unchanged: `FetchClient` implements `Fetcher`, blanket impls cover `&T` and `Arc<T>`, so `&client` coerces to `&dyn Fetcher` automatically.
+
+  The motivation is the split between OSS (wreq-backed, in-process TLS fingerprinting) and the production API server at api.webclaw.io (which cannot use in-process fingerprinting per the architecture rule, and must delegate HTTP through the Go tls-sidecar). Before this trait, adding vertical routes to the production server would have required importing wreq into its dependency graph, violating the separation. Now the production server can provide its own `TlsSidecarFetcher` implementation and pass it to the same extractor dispatcher the OSS server uses.
+
+  Backwards compatible. No behavior change for CLI, MCP, or OSS self-host.
+
+### Changed
+
+- All 28 extractor `extract()` signatures migrated from `client: &FetchClient` to `client: &dyn Fetcher`. The dispatcher functions (`extractors::dispatch_by_url`, `extractors::dispatch_by_name`) and the cloud escalation helpers (`cloud::smart_fetch`, `cloud::smart_fetch_html`) follow the same change. Tests and call sites are unchanged because `&FetchClient` auto-coerces.
+
+---
+
 ## [0.5.0] — 2026-04-22

 ### Added
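The 0.5.4 entries above mention `accept_language_for_tld` and `accept_language_for_url`. A minimal sketch of how such helpers could work, assuming an illustrative TLD-to-locale table and a naive host parse — the crate's real mapping and parsing are not shown in this diff:

```rust
// Hypothetical sketch of the 0.5.4 Accept-Language helpers.
// The TLD → locale table here is illustrative, not the crate's actual data.
fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9,en;q=0.8",
        "de" => "de-DE,de;q=0.9,en;q=0.8",
        "fr" => "fr-FR,fr;q=0.9,en;q=0.8",
        "es" => "es-ES,es;q=0.9,en;q=0.8",
        // `en-US` is the documented fallback.
        _ => "en-US,en;q=0.9",
    }
}

fn accept_language_for_url(url: &str) -> &'static str {
    // Naive host extraction: strip the scheme, cut at the first slash,
    // then take the last dot-separated label as the TLD.
    let host = url
        .trim_start_matches("https://")
        .trim_start_matches("http://")
        .split('/')
        .next()
        .unwrap_or("");
    let tld = host.rsplit('.').next().unwrap_or("");
    accept_language_for_tld(tld)
}
```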
11
CLAUDE.md
@@ -11,7 +11,7 @@ webclaw/
 # + ExtractionOptions (include/exclude CSS selectors)
 # + diff engine (change tracking)
 # + brand extraction (DOM/CSS analysis)
-webclaw-fetch/ # HTTP client via primp. Crawler. Sitemap discovery. Batch ops.
+webclaw-fetch/ # HTTP client via wreq (BoringSSL). Crawler. Sitemap discovery. Batch ops.
 # + proxy pool rotation (per-request)
 # + PDF content-type detection
 # + document parsing (DOCX, XLSX, CSV)

@@ -40,7 +40,7 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
 - `brand.rs` — Brand identity extraction from DOM structure and CSS

 ### Fetch Modules (`webclaw-fetch`)
-- `client.rs` — FetchClient with primp TLS impersonation
+- `client.rs` — FetchClient with wreq BoringSSL TLS impersonation; implements the public `Fetcher` trait so callers (including server adapters) can swap in alternative implementations
 - `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128)
 - `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
 - `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt)

@@ -76,9 +76,10 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
 ## Hard Rules

 - **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible.
-- **primp requires `[patch.crates-io]`** for patched rustls/h2 forks at workspace level.
-- **RUSTFLAGS are set in `.cargo/config.toml`** — no need to pass manually.
-- **webclaw-llm uses plain reqwest** (NOT primp-patched). LLM APIs don't need TLS fingerprinting.
+- **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally.
+- **No special RUSTFLAGS** — `.cargo/config.toml` is currently empty of build flags. Don't add any.
+- **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep.
+- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client.
 - **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.

 ## Build & Test
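The `&dyn Fetcher` hard rule above leans on trait-object coercion plus blanket impls, as described in the 0.5.1 changelog. A self-contained sketch of that pattern, with hypothetical names (`Fetcher`, `FetchClient`, `dispatch`) standing in for webclaw's real types and a synchronous `fetch` for brevity:

```rust
use std::sync::Arc;

// Minimal stand-in for webclaw's Fetcher trait (sync, one method).
trait Fetcher {
    fn fetch(&self, url: &str) -> String;
}

struct FetchClient;

impl Fetcher for FetchClient {
    fn fetch(&self, url: &str) -> String {
        format!("fetched {url}")
    }
}

// Blanket impls: a reference or Arc to any Fetcher is itself a Fetcher,
// so dispatchers can take `&dyn Fetcher` while callers keep passing
// whatever handle they already hold.
impl<T: Fetcher + ?Sized> Fetcher for &T {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

// The dispatcher only sees the trait object; a production server could
// hand it a different implementation without touching this signature.
fn dispatch(client: &dyn Fetcher, url: &str) -> String {
    client.fetch(url)
}
```

Both a plain `&FetchClient` and an `&Arc<FetchClient>` coerce at the `dispatch` call site, which is why the migration left call sites untouched.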
46
Cargo.lock
generated
@@ -2967,6 +2967,26 @@ dependencies = [
 "pom",
 ]

+[[package]]
+name = "typed-builder"
+version = "0.23.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "31aa81521b70f94402501d848ccc0ecaa8f93c8eb6999eb9747e72287757ffda"
+dependencies = [
+ "typed-builder-macro",
+]
+
+[[package]]
+name = "typed-builder-macro"
+version = "0.23.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "076a02dc54dd46795c2e9c8282ed40bcfb1e22747e955de9389a1de28190fb26"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
 [[package]]
 name = "typed-path"
 version = "0.12.3"

@@ -3199,7 +3219,7 @@ dependencies = [
 [[package]]
 name = "webclaw-cli"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
 "clap",
 "dotenvy",

@@ -3220,7 +3240,7 @@ dependencies = [
 [[package]]
 name = "webclaw-core"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
 "ego-tree",
 "once_cell",

@@ -3238,8 +3258,9 @@ dependencies = [
 [[package]]
 name = "webclaw-fetch"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
+ "async-trait",
 "bytes",
 "calamine",
 "http",

@@ -3257,12 +3278,13 @@ dependencies = [
 "webclaw-core",
 "webclaw-pdf",
 "wreq",
+ "wreq-util",
 "zip 2.4.2",
 ]

 [[package]]
 name = "webclaw-llm"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
 "async-trait",
 "reqwest",

@@ -3275,7 +3297,7 @@ dependencies = [
 [[package]]
 name = "webclaw-mcp"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
 "dirs",
 "dotenvy",

@@ -3295,7 +3317,7 @@ dependencies = [
 [[package]]
 name = "webclaw-pdf"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
 "pdf-extract",
 "thiserror",

@@ -3304,7 +3326,7 @@ dependencies = [
 [[package]]
 name = "webclaw-server"
-version = "0.5.0"
+version = "0.5.6"
 dependencies = [
 "anyhow",
 "axum",

@@ -3708,6 +3730,16 @@ dependencies = [
 "zstd",
 ]

+[[package]]
+name = "wreq-util"
+version = "3.0.0-rc.10"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6c6bbe24d28beb9ceb58b514bd6a613c759d3b706f768b9d2950d5d35b543c04"
+dependencies = [
+ "typed-builder",
+ "wreq",
+]
+
 [[package]]
 name = "writeable"
 version = "0.6.2"
@@ -3,7 +3,7 @@ resolver = "2"
 members = ["crates/*"]

 [workspace.package]
-version = "0.5.0"
+version = "0.5.6"
 edition = "2024"
 license = "AGPL-3.0"
 repository = "https://github.com/0xMassi/webclaw"
@@ -308,6 +308,34 @@ enum Commands {
         #[arg(long)]
         facts: Option<PathBuf>,
     },
+
+    /// List all vertical extractors in the catalog.
+    ///
+    /// Each entry has a stable `name` (usable with `webclaw vertical <name>`),
+    /// a human-friendly label, a one-line description, and the URL
+    /// patterns it claims. The same data is served by `/v1/extractors`
+    /// when running the REST API.
+    Extractors {
+        /// Emit JSON instead of a human-friendly table.
+        #[arg(long)]
+        json: bool,
+    },
+
+    /// Run a vertical extractor by name. Returns typed JSON with fields
+    /// specific to the target site (title, price, author, rating, etc.)
+    /// rather than generic markdown.
+    ///
+    /// Use `webclaw extractors` to see the full list. Example:
+    /// `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/`.
+    Vertical {
+        /// Vertical name (e.g. `reddit`, `github_repo`, `trustpilot_reviews`).
+        name: String,
+        /// URL to extract.
+        url: String,
+        /// Emit compact JSON (single line). Default is pretty-printed.
+        #[arg(long)]
+        raw: bool,
+    },
 }

 #[derive(Clone, ValueEnum)]

@@ -323,6 +351,9 @@ enum OutputFormat {
 enum Browser {
     Chrome,
     Firefox,
+    /// Safari iOS 26. Pair with a country-matched residential proxy for sites
+    /// that reject non-mobile profiles.
+    SafariIos,
     Random,
 }

@@ -349,6 +380,7 @@ impl From<Browser> for BrowserProfile {
         match b {
             Browser::Chrome => BrowserProfile::Chrome,
             Browser::Firefox => BrowserProfile::Firefox,
+            Browser::SafariIos => BrowserProfile::SafariIos,
             Browser::Random => BrowserProfile::Random,
         }
     }

@@ -2288,6 +2320,83 @@ async fn main() {
             }
             return;
         }
+        Commands::Extractors { json } => {
+            let entries = webclaw_fetch::extractors::list();
+            if *json {
+                // Serialize with serde_json. ExtractorInfo derives
+                // Serialize so this is a one-liner.
+                match serde_json::to_string_pretty(&entries) {
+                    Ok(s) => println!("{s}"),
+                    Err(e) => {
+                        eprintln!("error: failed to serialise catalog: {e}");
+                        process::exit(1);
+                    }
+                }
+            } else {
+                // Human-friendly table: NAME + LABEL + one URL
+                // pattern sample. Keeps the output scannable on a
+                // narrow terminal.
+                println!("{} vertical extractors available:\n", entries.len());
+                let name_w = entries.iter().map(|e| e.name.len()).max().unwrap_or(0);
+                let label_w = entries.iter().map(|e| e.label.len()).max().unwrap_or(0);
+                for e in &entries {
+                    let pattern_sample = e.url_patterns.first().copied().unwrap_or("");
+                    println!(
+                        " {:<nw$} {:<lw$} {}",
+                        e.name,
+                        e.label,
+                        pattern_sample,
+                        nw = name_w,
+                        lw = label_w,
+                    );
+                }
+                println!("\nRun one: webclaw vertical <name> <url>");
+            }
+            return;
+        }
+        Commands::Vertical { name, url, raw } => {
+            // Build a FetchClient with cloud fallback attached when
+            // WEBCLAW_API_KEY is set. Antibot-gated verticals
+            // (amazon, ebay, etsy, trustpilot) need this to escalate
+            // on bot protection.
+            let fetch_cfg = webclaw_fetch::FetchConfig {
+                browser: webclaw_fetch::BrowserProfile::Firefox,
+                ..webclaw_fetch::FetchConfig::default()
+            };
+            let mut client = match webclaw_fetch::FetchClient::new(fetch_cfg) {
+                Ok(c) => c,
+                Err(e) => {
+                    eprintln!("error: failed to build fetch client: {e}");
+                    process::exit(1);
+                }
+            };
+            if let Some(cloud) = webclaw_fetch::cloud::CloudClient::from_env() {
+                client = client.with_cloud(cloud);
+            }
+            match webclaw_fetch::extractors::dispatch_by_name(&client, name, url).await {
+                Ok(data) => {
+                    let rendered = if *raw {
+                        serde_json::to_string(&data)
+                    } else {
+                        serde_json::to_string_pretty(&data)
+                    };
+                    match rendered {
+                        Ok(s) => println!("{s}"),
+                        Err(e) => {
+                            eprintln!("error: JSON encode failed: {e}");
+                            process::exit(1);
+                        }
+                    }
+                }
+                Err(e) => {
+                    // UrlMismatch / UnknownVertical / Fetch all get
+                    // Display impls with actionable messages.
+                    eprintln!("error: {e}");
+                    process::exit(1);
+                }
+            }
+            return;
+        }
     }
 }
@@ -920,8 +920,10 @@ fn strip_markdown(md: &str) -> String {
             continue;
         }

-        // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs
-        if trimmed.starts_with('|') && trimmed.ends_with('|') {
+        // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs.
+        // Require at least 2 chars so the slice `[1..len-1]` stays non-empty on single-pipe rows
+        // (which aren't real tables anyway); a lone `|` previously panicked at `begin <= end`.
+        if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
            let inner = &trimmed[1..trimmed.len() - 1];
            let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
            lines.push(cells.join("\t"));
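The guard above can be exercised in isolation. A standalone sketch of the same logic (the function name is hypothetical; the real code lives inside `strip_markdown`): slicing `&s[1..s.len() - 1]` on a one-character `|` computes the range `1..0`, which panics, and the length check sidesteps it.

```rust
// Hypothetical standalone version of the 0.5.6 single-pipe guard.
// Returns the tab-joined cells for a table row, or None for non-rows
// (including the lone `|` that used to panic).
fn table_row_to_tabs(trimmed: &str) -> Option<String> {
    if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
        let inner = &trimmed[1..trimmed.len() - 1];
        let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
        return Some(cells.join("\t"));
    }
    None
}
```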
@@ -12,7 +12,9 @@ serde = { workspace = true }
 thiserror = { workspace = true }
 tracing = { workspace = true }
 tokio = { workspace = true }
+async-trait = "0.1"
+wreq = { version = "6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] }
+wreq-util = "3.0.0-rc.10"
 http = "1"
 bytes = "1"
 url = "2"
@@ -7,6 +7,10 @@ pub enum BrowserProfile {
     #[default]
     Chrome,
     Firefox,
+    /// Safari iOS 26 (iPhone). The one profile proven to defeat
+    /// DataDome's immobiliare.it / idealista.it / target.com-class
+    /// rules when paired with a country-scoped residential proxy.
+    SafariIos,
     /// Randomly pick from all available profiles on each request.
     Random,
 }

@@ -18,6 +22,7 @@ pub enum BrowserVariant {
     ChromeMacos,
     Firefox,
     Safari,
+    SafariIos26,
     Edge,
 }
@@ -261,10 +261,65 @@ impl FetchClient {
         self.cloud.as_deref()
     }

+    /// Fetch a URL with per-site rescue paths: Reddit URLs redirect to the
+    /// `.json` API, and Akamai-style challenge responses trigger a homepage
+    /// cookie warmup and a retry. Returns the same `FetchResult` shape as
+    /// [`Self::fetch`] so every caller (CLI, MCP, OSS server, production
+    /// server) benefits without shape churn.
+    ///
+    /// This is the method most callers want. Use plain [`Self::fetch`] only
+    /// when you need literal no-rescue behavior (e.g. inside the rescue
+    /// logic itself to avoid recursion).
+    pub async fn fetch_smart(&self, url: &str) -> Result<FetchResult, FetchError> {
+        // Reddit: the HTML page shows a verification interstitial for most
+        // client IPs, but appending `.json` returns the post + comment tree
+        // publicly. `parse_reddit_json` in downstream code knows how to read
+        // the result; here we just do the URL swap at the fetch layer.
+        if crate::reddit::is_reddit_url(url) && !url.ends_with(".json") {
+            let json_url = crate::reddit::json_url(url);
+            // Reddit's public .json API serves JSON to identifiable bot
+            // User-Agents and blocks browser UAs with a verification wall.
+            // Override our Chrome-profile UA for this specific call.
+            let ua = concat!(
+                "Webclaw/",
+                env!("CARGO_PKG_VERSION"),
+                " (+https://webclaw.io)"
+            );
+            if let Ok(resp) = self
+                .fetch_with_headers(&json_url, &[("user-agent", ua)])
+                .await
+                && resp.status == 200
+            {
+                let first = resp.html.trim_start().as_bytes().first().copied();
+                if matches!(first, Some(b'{') | Some(b'[')) {
+                    return Ok(resp);
+                }
+            }
+            // If the .json fetch failed or returned HTML, fall through.
+        }
+
+        let resp = self.fetch(url).await?;
+
+        // Akamai / bazadebezolkohpepadr challenge: visit the homepage to
+        // collect warmup cookies (_abck, bm_sz, etc.), then retry.
+        if is_challenge_html(&resp.html)
+            && let Some(homepage) = extract_homepage(url)
+        {
+            debug!("challenge detected, warming cookies via {homepage}");
+            let _ = self.fetch(&homepage).await;
+            if let Ok(retry) = self.fetch(url).await {
+                return Ok(retry);
+            }
+        }
+
+        Ok(resp)
+    }
+
     /// Fetch a URL and return the raw HTML + response metadata.
     ///
     /// Automatically retries on transient failures (network errors, 5xx, 429)
-    /// with exponential backoff: 0s, 1s (2 attempts total).
+    /// with exponential backoff: 0s, 1s (2 attempts total). No per-site
+    /// rescue logic; use [`Self::fetch_smart`] for that.
     #[instrument(skip(self), fields(url = %url))]
     pub async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
         let delays = [Duration::ZERO, Duration::from_secs(1)];

@@ -599,12 +654,43 @@ impl FetchClient {
     }
 }

+// ---------------------------------------------------------------------------
+// Fetcher trait implementation
+//
+// Vertical extractors consume the [`crate::fetcher::Fetcher`] trait
+// rather than `FetchClient` directly, which is what lets the production
+// API server swap in a tls-sidecar-backed implementation without
+// pulling wreq into its dependency graph. For everyone else (CLI, MCP,
+// self-hosted OSS server) this impl means "pass the FetchClient you
+// already have; nothing changes".
+// ---------------------------------------------------------------------------
+
+#[async_trait::async_trait]
+impl crate::fetcher::Fetcher for FetchClient {
+    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
+        FetchClient::fetch(self, url).await
+    }
+
+    async fn fetch_with_headers(
+        &self,
+        url: &str,
+        headers: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
+        FetchClient::fetch_with_headers(self, url, headers).await
+    }
+
+    fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
+        FetchClient::cloud(self)
+    }
+}
+
 /// Collect the browser variants to use based on the browser profile.
 fn collect_variants(profile: &BrowserProfile) -> Vec<BrowserVariant> {
     match profile {
         BrowserProfile::Random => browser::all_variants(),
         BrowserProfile::Chrome => vec![browser::latest_chrome()],
         BrowserProfile::Firefox => vec![browser::latest_firefox()],
+        BrowserProfile::SafariIos => vec![BrowserVariant::SafariIos26],
     }
 }

@@ -682,22 +768,23 @@ fn is_pdf_content_type(headers: &http::HeaderMap) -> bool {

 /// Detect if a response looks like a bot protection challenge page.
 fn is_challenge_response(response: &Response) -> bool {
-    let len = response.body().len();
+    is_challenge_html(response.text().as_ref())
+}
+
+/// Same as `is_challenge_response`, operating on a body string directly
+/// so callers holding a `FetchResult` can reuse the heuristic.
+fn is_challenge_html(html: &str) -> bool {
+    let len = html.len();
     if len > 15_000 || len == 0 {
         return false;
     }

-    let text = response.text();
-    let lower = text.to_lowercase();
-
+    let lower = html.to_lowercase();
     if lower.contains("<title>challenge page</title>") {
         return true;
     }

     if lower.contains("bazadebezolkohpepadr") && len < 5_000 {
         return true;
     }

     false
 }
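The Reddit branch of `fetch_smart` above delegates the URL swap to `crate::reddit::json_url`, which this diff doesn't show. A hedged sketch of what such a helper might do — dropping the query string before appending `.json` is an assumption; the crate's actual helper may preserve it:

```rust
// Illustrative stand-in for `crate::reddit::json_url` (not the real code).
// Maps a reddit post URL to its public `.json` API equivalent.
fn reddit_json_url(url: &str) -> String {
    // Assumption: drop any query string, then strip the trailing slash
    // so `.json` attaches to the path segment itself.
    let base = url.split('?').next().unwrap_or(url);
    let base = base.trim_end_matches('/');
    format!("{base}.json")
}
```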
@@ -66,7 +66,9 @@ use serde_json::{Value, json};
 use thiserror::Error;
 use tracing::{debug, info, warn};

-use crate::client::FetchClient;
+// Client type isn't needed here anymore now that smart_fetch* takes
+// `&dyn Fetcher`. Kept as a comment for historical context: this
+// module used to import FetchClient directly before v0.5.1.

 // ---------------------------------------------------------------------------
 // URLs + defaults — keep in one place so "change the signup link" is a

@@ -506,7 +508,7 @@ pub enum SmartFetchResult {
 /// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed
 /// [`CloudError`] so you can render precise UX.
 pub async fn smart_fetch(
-    client: &FetchClient,
+    client: &dyn crate::fetcher::Fetcher,
     cloud: Option<&CloudClient>,
     url: &str,
     include_selectors: &[String],

@@ -613,7 +615,7 @@ pub struct FetchedHtml {
 /// Designed for the vertical-extractor pattern where the caller has
 /// its own parser and just needs bytes.
 pub async fn smart_fetch_html(
-    client: &FetchClient,
+    client: &dyn crate::fetcher::Fetcher,
     cloud: Option<&CloudClient>,
     url: &str,
 ) -> Result<FetchedHtml, CloudError> {
|||
|
|
@ -32,9 +32,9 @@ use regex::Regex;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::cloud::{self, CloudError};
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "amazon_product",
|
||||
|
|
@ -59,7 +59,7 @@ pub fn matches(url: &str) -> bool {
|
|||
parse_asin(url).is_some()
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let asin = parse_asin(url)
|
||||
.ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;
|
||||
|
||||
|
|
|
|||
|
|
@ -10,8 +10,8 @@ use quick_xml::events::Event;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "arxiv",
|
||||
|
|
@ -32,7 +32,7 @@ pub fn matches(url: &str) -> bool {
|
|||
url.contains("/abs/") || url.contains("/pdf/")
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let id = parse_id(url)
|
||||
.ok_or_else(|| FetchError::Build(format!("arxiv: cannot parse id from '{url}'")))?;
|
||||
|
||||
|
|
|
|||
|
|
@ -9,8 +9,8 @@ use serde::Deserialize;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "crates_io",
|
||||
|
|
@ -30,7 +30,7 @@ pub fn matches(url: &str) -> bool {
|
|||
url.contains("/crates/")
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let name = parse_name(url)
|
||||
.ok_or_else(|| FetchError::Build(format!("crates.io: cannot parse name from '{url}'")))?;
|
||||
|
||||
|
|
|
|||
|
|
@ -8,8 +8,8 @@ use serde::Deserialize;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "dev_to",
|
||||
|
|
@ -61,7 +61,7 @@ const RESERVED_FIRST_SEGS: &[&str] = &[
|
|||
"t",
|
||||
];
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let (username, slug) = parse_username_slug(url).ok_or_else(|| {
|
||||
FetchError::Build(format!("dev_to: cannot parse username/slug from '{url}'"))
|
||||
})?;
|
||||
|
|
|
|||
|
|
@ -8,8 +8,8 @@ use serde::Deserialize;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "docker_hub",
|
||||
|
|
@ -29,7 +29,7 @@ pub fn matches(url: &str) -> bool {
|
|||
url.contains("/_/") || url.contains("/r/")
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let (namespace, name) = parse_repo(url)
|
||||
.ok_or_else(|| FetchError::Build(format!("docker_hub: cannot parse repo from '{url}'")))?;
|
||||
|
||||
|
|
|
|||
|
|
@ -14,9 +14,9 @@ use regex::Regex;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::cloud::{self, CloudError};
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "ebay_listing",
|
||||
|
|
@ -39,7 +39,7 @@ pub fn matches(url: &str) -> bool {
|
|||
parse_item_id(url).is_some()
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let item_id = parse_item_id(url)
|
||||
.ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?;
|
||||
|
||||
|
|
|
|||
|
|
@ -42,8 +42,8 @@ use regex::Regex;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "ecommerce_product",
|
||||
|
|
@ -69,7 +69,7 @@ pub fn matches(url: &str) -> bool {
|
|||
!host_of(url).is_empty()
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let resp = client.fetch(url).await?;
|
||||
if !(200..300).contains(&resp.status) {
|
||||
return Err(FetchError::Build(format!(
|
||||
|
|
|
|||
|
|
@ -26,9 +26,9 @@ use regex::Regex;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::cloud::{self, CloudError};
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "etsy_listing",
|
||||
|
|
@ -49,7 +49,7 @@ pub fn matches(url: &str) -> bool {
|
|||
parse_listing_id(url).is_some()
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let listing_id = parse_listing_id(url)
|
||||
.ok_or_else(|| FetchError::Build(format!("etsy_listing: no listing id in '{url}'")))?;
|
||||
|
||||
|
|
|
|||
|
|
@ -10,8 +10,8 @@ use serde::Deserialize;
|
|||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "github_issue",
|
||||
|
|
@ -34,7 +34,7 @@ pub fn matches(url: &str) -> bool {
|
|||
parse_issue(url).is_some()
|
||||
}
|
||||
|
||||
pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let (owner, repo, number) = parse_issue(url).ok_or_else(|| {
|
||||
FetchError::Build(format!("github_issue: cannot parse issue URL '{url}'"))
|
||||
})?;
|
||||
|
|
|
|||
|
|
@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "github_pr",
@@ -33,7 +33,7 @@ pub fn matches(url: &str) -> bool {
     parse_pr(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, repo, number) = parse_pr(url).ok_or_else(|| {
         FetchError::Build(format!("github_pr: cannot parse pull-request URL '{url}'"))
     })?;
@@ -8,8 +8,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "github_release",
@@ -32,7 +32,7 @@ pub fn matches(url: &str) -> bool {
     parse_release(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, repo, tag) = parse_release(url).ok_or_else(|| {
         FetchError::Build(format!("github_release: cannot parse release URL '{url}'"))
     })?;
@@ -10,8 +10,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "github_repo",
@@ -70,7 +70,7 @@ const RESERVED_OWNERS: &[&str] = &[
     "about",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, repo) = parse_owner_repo(url).ok_or_else(|| {
         FetchError::Build(format!("github_repo: cannot parse owner/repo from '{url}'"))
     })?;
@@ -10,8 +10,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "hackernews",
@@ -40,7 +40,7 @@ pub fn matches(url: &str) -> bool {
     false
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let id = parse_item_id(url).ok_or_else(|| {
         FetchError::Build(format!("hackernews: cannot parse item id from '{url}'"))
     })?;
@@ -7,8 +7,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "huggingface_dataset",
@@ -38,7 +38,7 @@ pub fn matches(url: &str) -> bool {
     segs.first().copied() == Some("datasets") && (segs.len() == 2 || segs.len() == 3)
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let dataset_path = parse_dataset_path(url).ok_or_else(|| {
         FetchError::Build(format!(
             "hf_dataset: cannot parse dataset path from '{url}'"
@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "huggingface_model",
@@ -61,7 +61,7 @@ const RESERVED_NAMESPACES: &[&str] = &[
     "search",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, name) = parse_owner_name(url).ok_or_else(|| {
         FetchError::Build(format!("hf model: cannot parse owner/name from '{url}'"))
     })?;
@@ -11,8 +11,8 @@ use serde_json::{Value, json};
 use std::sync::OnceLock;
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "instagram_post",
@@ -33,7 +33,7 @@ pub fn matches(url: &str) -> bool {
     parse_shortcode(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (kind, shortcode) = parse_shortcode(url).ok_or_else(|| {
         FetchError::Build(format!(
             "instagram_post: cannot parse shortcode from '{url}'"
@@ -23,8 +23,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "instagram_profile",
@@ -80,7 +80,7 @@ const RESERVED: &[&str] = &[
     "signup",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let username = parse_username(url).ok_or_else(|| {
         FetchError::Build(format!(
             "instagram_profile: cannot parse username from '{url}'"
@@ -198,7 +198,7 @@ fn classify(n: &MediaNode) -> &'static str {
 /// pull whatever OG tags we can. Returns less data and explicitly
 /// flags `data_completeness: "og_only"` so callers know.
 async fn og_fallback(
-    client: &FetchClient,
+    client: &dyn Fetcher,
     username: &str,
     original_url: &str,
     api_status: u16,
@@ -14,8 +14,8 @@ use serde_json::{Value, json};
 use std::sync::OnceLock;
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "linkedin_post",
@@ -36,7 +36,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/feed/update/urn:li:") || url.contains("/posts/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let urn = extract_urn(url).ok_or_else(|| {
         FetchError::Build(format!(
             "linkedin_post: cannot extract URN from '{url}' (expected /feed/update/urn:li:... or /posts/{{slug}}-{{id}})"
@@ -46,8 +46,8 @@ pub mod youtube_video;
 use serde::Serialize;
 use serde_json::Value;
 
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 /// Public catalog entry for `/v1/extractors`. Stable shape — clients
 /// rely on `name` to pick the right `/v1/scrape/{name}` route.
@@ -102,7 +102,7 @@ pub fn list() -> Vec<ExtractorInfo> {
 /// one that claims the URL. Used by `/v1/scrape` when the caller doesn't
 /// pick a vertical explicitly.
 pub async fn dispatch_by_url(
-    client: &FetchClient,
+    client: &dyn Fetcher,
     url: &str,
 ) -> Option<Result<(&'static str, Value), FetchError>> {
     if reddit::matches(url) {
@@ -281,7 +281,7 @@ pub async fn dispatch_by_url(
 /// users get a clear "wrong route" error instead of a confusing parse
 /// failure deep in the extractor.
 pub async fn dispatch_by_name(
-    client: &FetchClient,
+    client: &dyn Fetcher,
     name: &str,
     url: &str,
 ) -> Result<Value, ExtractorDispatchError> {
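The `dispatch_by_url` change above keeps the first-match dispatch pattern: try each vertical's `matches` predicate, run the first extractor that claims the URL. A minimal synchronous sketch of that pattern (names and types here are illustrative stubs, not the crate's real async API):

```rust
// First-match URL dispatch: each vertical exposes a `matches` predicate
// and an `extract` function; the dispatcher runs the first claimant.
// The real dispatcher is async and returns serde_json::Value; this stub
// uses plain fn pointers and String to stay self-contained.
type Matcher = fn(&str) -> bool;
type Extract = fn(&str) -> Result<String, String>;

struct Vertical {
    name: &'static str,
    matches: Matcher,
    extract: Extract,
}

// Returns None when no vertical claims the URL, mirroring the
// Option<Result<...>> shape in the diff above.
fn dispatch_by_url(
    verticals: &[Vertical],
    url: &str,
) -> Option<(&'static str, Result<String, String>)> {
    verticals
        .iter()
        .find(|v| (v.matches)(url))
        .map(|v| (v.name, (v.extract)(url)))
}

fn main() {
    let verticals = [
        Vertical {
            name: "reddit",
            matches: |u: &str| u.contains("reddit.com") && u.contains("/comments/"),
            extract: |u: &str| Ok(format!("reddit:{u}")),
        },
        Vertical {
            name: "npm",
            matches: |u: &str| u.contains("npmjs.com/package/"),
            extract: |u: &str| Ok(format!("npm:{u}")),
        },
    ];
    let hit = dispatch_by_url(&verticals, "https://www.reddit.com/r/rust/comments/abc/x/");
    println!("{:?}", hit.map(|(n, _)| n));
}
```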
@@ -13,8 +13,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "npm",
@@ -31,7 +31,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/package/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let name = parse_name(url)
         .ok_or_else(|| FetchError::Build(format!("npm: cannot parse name from '{url}'")))?;
 
@@ -94,7 +94,7 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
     }))
 }
 
-async fn fetch_weekly_downloads(client: &FetchClient, name: &str) -> Result<i64, FetchError> {
+async fn fetch_weekly_downloads(client: &dyn Fetcher, name: &str) -> Result<i64, FetchError> {
     let url = format!(
         "https://api.npmjs.org/downloads/point/last-week/{}",
         urlencode_segment(name)
@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "pypi",
@@ -30,7 +30,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/project/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (name, version) = parse_project(url).ok_or_else(|| {
         FetchError::Build(format!("pypi: cannot parse package name from '{url}'"))
     })?;
@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "reddit",
@@ -32,7 +32,7 @@ pub fn matches(url: &str) -> bool {
     is_reddit_host && url.contains("/comments/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let json_url = build_json_url(url);
     let resp = client.fetch(&json_url).await?;
    if resp.status != 200 {
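The reddit extractor rewrites the comments URL into its `.json` API form via `build_json_url` before fetching. A plausible sketch of that helper, under the assumption that it just strips query/fragment and appends `.json` (the real implementation may handle more edge cases):

```rust
// Hypothetical sketch of turning a Reddit comments URL into its .json
// API form: drop the query string and fragment, trim a trailing slash,
// then append ".json". Reddit serves the post + comment tree as JSON
// at that path.
fn build_json_url(url: &str) -> String {
    let base = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url)
        .trim_end_matches('/');
    format!("{base}.json")
}

fn main() {
    println!(
        "{}",
        build_json_url("https://www.reddit.com/r/rust/comments/abc123/title/?share=1")
    );
}
```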
@@ -15,8 +15,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "shopify_collection",
@@ -49,7 +49,7 @@ const NON_SHOPIFY_HOSTS: &[&str] = &[
     "github.com",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (coll_meta_url, coll_products_url) = build_json_urls(url);
 
     // Step 1: collection metadata. Shopify returns 200 on missing
@@ -21,8 +21,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "shopify_product",
@@ -65,7 +65,7 @@ const NON_SHOPIFY_HOSTS: &[&str] = &[
     "github.com", // /products is a marketing page
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let json_url = build_json_url(url);
     let resp = client.fetch(&json_url).await?;
     if resp.status == 404 {
@@ -13,8 +13,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "stackoverflow",
@@ -31,7 +31,7 @@ pub fn matches(url: &str) -> bool {
     parse_question_id(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let id = parse_question_id(url).ok_or_else(|| {
         FetchError::Build(format!(
             "stackoverflow: cannot parse question id from '{url}'"
@@ -28,9 +28,9 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::cloud::{self, CloudError};
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "substack_post",
@@ -49,7 +49,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/p/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let slug = parse_slug(url).ok_or_else(|| {
         FetchError::Build(format!("substack_post: cannot parse slug from '{url}'"))
     })?;
@@ -149,7 +149,7 @@ fn build_api_payload(url: &str, api_url: &str, slug: &str, p: Post) -> Value {
 // ---------------------------------------------------------------------------
 
 async fn html_fallback(
-    client: &FetchClient,
+    client: &dyn Fetcher,
     url: &str,
     api_url: &str,
     slug: &str,
@@ -32,9 +32,9 @@ use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::cloud::{self, CloudError};
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "trustpilot_reviews",
@@ -51,7 +51,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/review/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
         .await
         .map_err(cloud_to_fetch_err)?;
@@ -15,8 +15,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "woocommerce_product",
@@ -42,7 +42,7 @@ pub fn matches(url: &str) -> bool {
     || url.contains("/produit/") // common fr locale
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let slug = parse_slug(url).ok_or_else(|| {
         FetchError::Build(format!(
             "woocommerce_product: cannot parse slug from '{url}'"
@@ -25,8 +25,8 @@ use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "youtube_video",
@@ -45,7 +45,7 @@ pub fn matches(url: &str) -> bool {
     || url.contains("youtube-nocookie.com/embed/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let video_id = parse_video_id(url).ok_or_else(|| {
         FetchError::Build(format!("youtube_video: cannot parse video id from '{url}'"))
     })?;
crates/webclaw-fetch/src/fetcher.rs (new file, 118 lines)
@@ -0,0 +1,118 @@
+//! Pluggable fetcher abstraction for vertical extractors.
+//!
+//! Extractors call the network through this trait instead of hard-
+//! coding [`FetchClient`]. The OSS CLI / MCP / self-hosted server all
+//! pass `&FetchClient` (wreq-backed BoringSSL). The production API
+//! server, which must not use in-process TLS fingerprinting, provides
+//! its own implementation that routes through the Go tls-sidecar.
+//!
+//! Both paths expose the same [`FetchResult`] shape and the same
+//! optional cloud-escalation client, so extractor logic stays
+//! identical across environments.
+//!
+//! ## Choosing an implementation
+//!
+//! - CLI, MCP, self-hosted `webclaw-server`: build a [`FetchClient`]
+//!   with [`FetchClient::with_cloud`] to attach cloud fallback, pass
+//!   it to extractors as `&client`.
+//! - `api.webclaw.io` production server: build a `TlsSidecarFetcher`
+//!   (in `server/src/engine/`) that delegates to `engine::tls_client`
+//!   and wraps it in `Arc<dyn Fetcher>` for handler injection.
+//!
+//! ## Why a trait and not a free function
+//!
+//! Extractors need state beyond a single fetch: the cloud client for
+//! antibot escalation, and in the future per-user proxy pools, tenant
+//! headers, circuit breakers. A trait keeps that state encapsulated
+//! behind the fetch interface instead of threading it through every
+//! extractor signature.
+
+use async_trait::async_trait;
+
+use crate::client::FetchResult;
+use crate::cloud::CloudClient;
+use crate::error::FetchError;
+
+/// HTTP fetch surface used by vertical extractors.
+///
+/// Implementations must be `Send + Sync` because extractor dispatchers
+/// run them inside tokio tasks, potentially across many requests.
+#[async_trait]
+pub trait Fetcher: Send + Sync {
+    /// Fetch a URL and return the raw response body + metadata. The
+    /// body is in `FetchResult::html` regardless of the actual content
+    /// type — JSON API endpoints put JSON there, HTML pages put HTML.
+    /// Extractors branch on response status and body shape.
+    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError>;
+
+    /// Fetch with additional request headers. Needed for endpoints
+    /// that authenticate via a specific header (Instagram's
+    /// `x-ig-app-id`, for example). Default implementation routes to
+    /// [`Self::fetch`] so implementers without header support stay
+    /// functional, though the `Option<String>` field they'd set won't
+    /// be populated on the request.
+    async fn fetch_with_headers(
+        &self,
+        url: &str,
+        _headers: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
+        self.fetch(url).await
+    }
+
+    /// Optional cloud-escalation client for antibot bypass. Returning
+    /// `Some` tells extractors they can call into the hosted API when
+    /// local fetch hits a challenge page. Returning `None` makes
+    /// cloud-gated extractors emit [`CloudError::NotConfigured`] with
+    /// an actionable signup link.
+    ///
+    /// The default implementation returns `None` because not every
+    /// deployment wants cloud fallback (self-hosts that don't have a
+    /// webclaw.io subscription, for instance).
+    ///
+    /// [`CloudError::NotConfigured`]: crate::cloud::CloudError::NotConfigured
+    fn cloud(&self) -> Option<&CloudClient> {
+        None
+    }
+}
+
+// ---------------------------------------------------------------------------
+// Blanket impls: make `&T` and `Arc<T>` behave like the wrapped `T`.
+// ---------------------------------------------------------------------------
+
+#[async_trait]
+impl<T: Fetcher + ?Sized> Fetcher for &T {
+    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
+        (**self).fetch(url).await
+    }
+
+    async fn fetch_with_headers(
+        &self,
+        url: &str,
+        headers: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
+        (**self).fetch_with_headers(url, headers).await
+    }
+
+    fn cloud(&self) -> Option<&CloudClient> {
+        (**self).cloud()
+    }
+}
+
+#[async_trait]
+impl<T: Fetcher + ?Sized> Fetcher for std::sync::Arc<T> {
+    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
+        (**self).fetch(url).await
+    }
+
+    async fn fetch_with_headers(
+        &self,
+        url: &str,
+        headers: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
+        (**self).fetch_with_headers(url, headers).await
+    }
+
+    fn cloud(&self) -> Option<&CloudClient> {
+        (**self).cloud()
+    }
+}
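The shape of this trait — a required `fetch`, a default `fetch_with_headers` that forwards, an optional capability accessor, and blanket impls so smart pointers work as trait objects — can be sketched synchronously without the async-trait crate. Everything below is a simplified stub (sync, `String` bodies), not the crate's real API:

```rust
// Sync sketch of the Fetcher pattern above: default methods keep
// minimal implementers working, and a blanket impl lets Arc<T> be
// used wherever the trait is required (handler injection).
use std::sync::Arc;

trait Fetcher: Send + Sync {
    fn fetch(&self, url: &str) -> Result<String, String>;

    // Default forwards to fetch(), so implementers without header
    // support stay functional (headers are simply dropped).
    fn fetch_with_headers(
        &self,
        url: &str,
        _headers: &[(&str, &str)],
    ) -> Result<String, String> {
        self.fetch(url)
    }

    // Default None: cloud escalation is opt-in per deployment.
    fn cloud(&self) -> Option<&str> {
        None
    }
}

// Blanket impl: Arc<T> delegates to the wrapped T, so callers can hold
// Arc<dyn Fetcher> and still satisfy `impl Fetcher` bounds.
impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> Result<String, String> {
        (**self).fetch(url)
    }
}

struct Stub;
impl Fetcher for Stub {
    fn fetch(&self, url: &str) -> Result<String, String> {
        Ok(format!("body-for:{url}"))
    }
}

fn main() {
    let f: Arc<dyn Fetcher> = Arc::new(Stub);
    println!("{:?}", f.fetch_with_headers("https://example.com", &[("x-test", "1")]));
}
```

The real trait needs `#[async_trait]` because `async fn` in traits is not object-safe without boxing; the blanket impls in the diff exist for exactly the reason the sketch shows.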
@@ -8,7 +8,9 @@ pub mod crawler;
 pub mod document;
 pub mod error;
 pub mod extractors;
+pub mod fetcher;
 pub mod linkedin;
+pub mod locale;
 pub mod proxy;
 pub mod reddit;
 pub mod sitemap;
@@ -18,7 +20,9 @@ pub use browser::BrowserProfile;
 pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult};
 pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};
 pub use error::FetchError;
+pub use fetcher::Fetcher;
 pub use http::HeaderMap;
+pub use locale::{accept_language_for_tld, accept_language_for_url};
 pub use proxy::{parse_proxy_file, parse_proxy_line};
 pub use sitemap::SitemapEntry;
 pub use webclaw_pdf::PdfMode;
crates/webclaw-fetch/src/locale.rs (new file, 77 lines)
@@ -0,0 +1,77 @@
+//! Derive an `Accept-Language` header from a URL.
+//!
+//! DataDome-class bot detection on country-specific sites (e.g. immobiliare.it,
+//! leboncoin.fr) does a geo-vs-locale sanity check: residential IP in the
+//! target country + a browser UA but the wrong `Accept-Language` is a bot
+//! signal. Matching the site's expected locale gets us through.
+//!
+//! Default for unmapped TLDs is `en-US,en;q=0.9` — the global fallback.
+
+/// Best-effort `Accept-Language` header value for the given URL's TLD.
+/// Returns `None` if the URL cannot be parsed.
+pub fn accept_language_for_url(url: &str) -> Option<&'static str> {
+    let host = url::Url::parse(url).ok()?.host_str()?.to_ascii_lowercase();
+    let tld = host.rsplit('.').next()?;
+    Some(accept_language_for_tld(tld))
+}
+
+/// Map a bare TLD like `it`, `fr`, `de` to a plausible `Accept-Language`.
+/// Unknown TLDs fall back to US English.
+pub fn accept_language_for_tld(tld: &str) -> &'static str {
+    match tld {
+        "it" => "it-IT,it;q=0.9",
+        "fr" => "fr-FR,fr;q=0.9",
+        "de" | "at" => "de-DE,de;q=0.9",
+        "es" => "es-ES,es;q=0.9",
+        "pt" => "pt-PT,pt;q=0.9",
+        "nl" => "nl-NL,nl;q=0.9",
+        "pl" => "pl-PL,pl;q=0.9",
+        "se" => "sv-SE,sv;q=0.9",
+        "no" => "nb-NO,nb;q=0.9",
+        "dk" => "da-DK,da;q=0.9",
+        "fi" => "fi-FI,fi;q=0.9",
+        "cz" => "cs-CZ,cs;q=0.9",
+        "ro" => "ro-RO,ro;q=0.9",
+        "gr" => "el-GR,el;q=0.9",
+        "tr" => "tr-TR,tr;q=0.9",
+        "ru" => "ru-RU,ru;q=0.9",
+        "jp" => "ja-JP,ja;q=0.9",
+        "kr" => "ko-KR,ko;q=0.9",
+        "cn" => "zh-CN,zh;q=0.9",
+        "tw" | "hk" => "zh-TW,zh;q=0.9",
+        "br" => "pt-BR,pt;q=0.9",
+        "mx" | "ar" | "co" | "cl" | "pe" => "es-ES,es;q=0.9",
+        "uk" | "ie" => "en-GB,en;q=0.9",
+        _ => "en-US,en;q=0.9",
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn tld_dispatch() {
+        assert_eq!(
+            accept_language_for_url("https://www.immobiliare.it/annunci/1"),
+            Some("it-IT,it;q=0.9")
+        );
+        assert_eq!(
+            accept_language_for_url("https://www.leboncoin.fr/"),
+            Some("fr-FR,fr;q=0.9")
+        );
+        assert_eq!(
+            accept_language_for_url("https://www.amazon.co.uk/"),
+            Some("en-GB,en;q=0.9")
+        );
+        assert_eq!(
+            accept_language_for_url("https://example.com/"),
+            Some("en-US,en;q=0.9")
+        );
+    }
+
+    #[test]
+    fn bad_url_returns_none() {
+        assert_eq!(accept_language_for_url("not-a-url"), None);
+    }
+}
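The locale module depends on the `url` crate for host parsing; the core TLD-to-locale idea can be demonstrated standalone with a naive host parse. A self-contained sketch covering a few entries from the table above (the naive parse is a simplification, not the crate's approach):

```rust
// TLD -> Accept-Language sketch. Covers a subset of the real table;
// unknown TLDs fall back to US English, matching the module above.
fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9",
        "fr" => "fr-FR,fr;q=0.9",
        "de" | "at" => "de-DE,de;q=0.9",
        "uk" | "ie" => "en-GB,en;q=0.9",
        _ => "en-US,en;q=0.9",
    }
}

// Naive host extraction instead of url::Url::parse: take what sits
// between "://" and the first '/' or ':', then the last dot-segment.
fn accept_language_for_url(url: &str) -> Option<&'static str> {
    let rest = url.split_once("://")?.1;
    let host = rest.split('/').next()?.split(':').next()?;
    let tld = host.rsplit('.').next()?;
    if tld.is_empty() || host == tld {
        return None; // not a dotted hostname
    }
    Some(accept_language_for_tld(&tld.to_ascii_lowercase()))
}

fn main() {
    println!("{:?}", accept_language_for_url("https://www.immobiliare.it/annunci/1"));
}
```

Note that `co.uk` style registries work by accident here (the last label `uk` is what the table keys on); a production version would need the real parser this sketch replaces.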
@ -7,10 +7,15 @@
|
|||
|
||||
use std::time::Duration;
|
||||
|
||||
use std::borrow::Cow;
|
||||
|
||||
use wreq::http2::{
|
||||
Http2Options, PseudoId, PseudoOrder, SettingId, SettingsOrder, StreamDependency, StreamId,
|
||||
};
|
||||
use wreq::tls::{AlpsProtocol, CertificateCompressionAlgorithm, TlsOptions, TlsVersion};
|
||||
use wreq::tls::{
|
||||
AlpnProtocol, AlpsProtocol, CertificateCompressionAlgorithm, ExtensionType, TlsOptions,
|
||||
TlsVersion,
|
||||
};
|
||||
use wreq::{Client, Emulation};
|
||||
|
||||
use crate::browser::BrowserVariant;
|
||||
|
|
@ -43,6 +48,55 @@ const SAFARI_SIGALGS: &str = "ecdsa_secp256r1_sha256:rsa_pss_rsae_sha256:rsa_pkc
|
|||
/// Safari curves.
|
||||
const SAFARI_CURVES: &str = "X25519:P-256:P-384:P-521";
|
||||
|
||||
/// Safari iOS 26 TLS extension order, matching bogdanfinn's
|
||||
/// `safari_ios_26_0` wire format. GREASE slots are omitted. wreq
|
||||
/// inserts them itself. Diverges from wreq-util's default SafariIos26
|
||||
/// extension order, which DataDome's immobiliare.it ruleset flags.
|
||||
fn safari_ios_extensions() -> Vec<ExtensionType> {
|
||||
vec![
|
||||
ExtensionType::CERTIFICATE_TIMESTAMP,
|
||||
ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION,
|
||||
ExtensionType::SERVER_NAME,
|
||||
ExtensionType::CERT_COMPRESSION,
|
||||
ExtensionType::KEY_SHARE,
|
||||
ExtensionType::SUPPORTED_VERSIONS,
|
||||
ExtensionType::PSK_KEY_EXCHANGE_MODES,
|
||||
ExtensionType::SUPPORTED_GROUPS,
|
||||
ExtensionType::RENEGOTIATE,
|
||||
ExtensionType::SIGNATURE_ALGORITHMS,
|
||||
ExtensionType::STATUS_REQUEST,
|
||||
ExtensionType::EC_POINT_FORMATS,
|
||||
ExtensionType::EXTENDED_MASTER_SECRET,
|
||||
]
|
||||
}
|
||||
|
||||
/// Chrome 133 TLS extension order, matching bogdanfinn's stable JA3
|
||||
/// (`43067709b025da334de1279a120f8e14`). Real Chrome permutes extensions
|
||||
/// per handshake, but indeed.com's WAF allowlists this specific wire order
|
||||
/// and rejects permuted ones. GREASE slots are inserted by wreq.
|
||||
///
|
||||
/// JA3 extension field from peet.ws: 18-5-35-51-10-45-11-27-17613-43-13-0-16-65037-65281-23
|
||||
fn chrome_extensions() -> Vec<ExtensionType> {
|
||||
vec![
|
||||
ExtensionType::CERTIFICATE_TIMESTAMP, // 18
|
||||
ExtensionType::STATUS_REQUEST, // 5
|
||||
ExtensionType::SESSION_TICKET, // 35
|
||||
ExtensionType::KEY_SHARE, // 51
|
||||
ExtensionType::SUPPORTED_GROUPS, // 10
|
||||
ExtensionType::PSK_KEY_EXCHANGE_MODES, // 45
|
||||
ExtensionType::EC_POINT_FORMATS, // 11
|
||||
ExtensionType::CERT_COMPRESSION, // 27
|
||||
ExtensionType::APPLICATION_SETTINGS_NEW, // 17613 (new codepoint, matches alps_use_new_codepoint)
|
||||
ExtensionType::SUPPORTED_VERSIONS, // 43
|
||||
ExtensionType::SIGNATURE_ALGORITHMS, // 13
|
||||
ExtensionType::SERVER_NAME, // 0
|
||||
ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION, // 16
|
||||
ExtensionType::ENCRYPTED_CLIENT_HELLO, // 65037
|
||||
ExtensionType::RENEGOTIATE, // 65281
|
||||
ExtensionType::EXTENDED_MASTER_SECRET, // 23
|
||||
]
|
||||
}
|
||||
|
||||
// --- Chrome HTTP headers in correct wire order ---
|
||||
|
||||
const CHROME_HEADERS: &[(&str, &str)] = &[
|
||||
|
|
@ -130,6 +184,26 @@ const SAFARI_HEADERS: &[(&str, &str)] = &[
|
|||
("sec-fetch-dest", "document"),
|
||||
];
|
||||
|
||||
/// Safari iOS 26 headers, in the wire order real Safari emits. Critically:
|
||||
/// NO `sec-fetch-*`, NO `priority: u=0, i` (both Chromium-only leaks), but
|
||||
/// `upgrade-insecure-requests: 1` is present. `accept-encoding` does not
|
||||
/// include zstd (Safari can't decode it). Verified against bogdanfinn on
|
||||
/// 2026-04-22: this header set is what DataDome's immobiliare ruleset
|
||||
/// expects for a real iPhone.
|
||||
const SAFARI_IOS_HEADERS: &[(&str, &str)] = &[
|
||||
(
|
||||
"accept",
|
||||
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
|
||||
),
|
||||
("accept-language", "en-US,en;q=0.9"),
|
||||
("accept-encoding", "gzip, deflate, br"),
|
||||
(
|
||||
"user-agent",
|
||||
"Mozilla/5.0 (iPhone; CPU iPhone OS 26_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Mobile/15E148 Safari/604.1",
|
||||
),
|
||||
("upgrade-insecure-requests", "1"),
|
||||
];
|
||||
|
||||
const EDGE_HEADERS: &[(&str, &str)] = &[
|
||||
(
|
||||
"sec-ch-ua",
|
||||
|
|
@ -156,6 +230,9 @@ const EDGE_HEADERS: &[(&str, &str)] = &[
|
|||
];
|
||||
|
||||
fn chrome_tls() -> TlsOptions {
|
||||
// permute_extensions is off so the explicit extension_permutation sticks.
|
||||
// Real Chrome permutes, but indeed.com's WAF allowlists bogdanfinn's
|
||||
// fixed order, so matching that gets us through.
|
||||
TlsOptions::builder()
|
||||
.cipher_list(CHROME_CIPHERS)
|
||||
.sigalgs_list(CHROME_SIGALGS)
|
||||
|
|
@ -163,12 +240,18 @@ fn chrome_tls() -> TlsOptions {
|
|||
.min_tls_version(TlsVersion::TLS_1_2)
|
||||
.max_tls_version(TlsVersion::TLS_1_3)
|
||||
.grease_enabled(true)
|
||||
.permute_extensions(true)
|
||||
.permute_extensions(false)
|
||||
.extension_permutation(chrome_extensions())
|
||||
.enable_ech_grease(true)
|
||||
.pre_shared_key(true)
|
||||
.enable_ocsp_stapling(true)
|
||||
.enable_signed_cert_timestamps(true)
|
||||
.alps_protocols([AlpsProtocol::HTTP2])
|
||||
.alpn_protocols([
|
||||
AlpnProtocol::HTTP3,
|
||||
AlpnProtocol::HTTP2,
|
||||
AlpnProtocol::HTTP1,
|
||||
])
|
||||
.alps_protocols([AlpsProtocol::HTTP3, AlpsProtocol::HTTP2])
|
||||
.alps_use_new_codepoint(true)
|
||||
.aes_hw_override(true)
|
||||
.certificate_compression_algorithms(&[CertificateCompressionAlgorithm::BROTLI])
|
||||

@@ -212,25 +295,70 @@ fn safari_tls() -> TlsOptions {
         .build()
 }

+/// Safari iOS 26 emulation — composed on top of `wreq_util::Emulation::SafariIos26`
+/// with four targeted overrides. We don't hand-roll this one like Chrome/Firefox
+/// because the wire-level defaults from wreq-util are already correct for ciphers,
+/// sigalgs, curves, and GREASE — the four things wreq-util gets *wrong* for
+/// DataDome compatibility are overridden here:
+///
+/// 1. TLS extension order: match bogdanfinn `safari_ios_26_0` exactly (JA3
+///    ends up `8d909525bd5bbb79f133d11cc05159fe`).
+/// 2. HTTP/2 HEADERS priority flag: weight=256, exclusive=1, depends_on=0.
+///    wreq-util omits this frame; real Safari and bogdanfinn include it.
+///    This flip is the thing DataDome actually reads — the akamai_fingerprint
+///    hash changes from `c52879e43202aeb92740be6e8c86ea96` to
+///    `d1294410a06522e37a5c5e3f0a45a705`, which is the winning signature.
+/// 3. Headers: strip wreq-util's Chromium defaults (`sec-fetch-*`,
+///    `priority: u=0, i`, zstd), replace with the real iOS 26 set.
+/// 4. `accept-language` preserved from config.extra_headers for locale.
+fn safari_ios_emulation() -> wreq::Emulation {
+    use wreq::EmulationFactory;
+    let mut em = wreq_util::Emulation::SafariIos26.emulation();
+
+    if let Some(tls) = em.tls_options_mut().as_mut() {
+        tls.extension_permutation = Some(Cow::Owned(safari_ios_extensions()));
+    }
+
+    // Only override the priority flag — keep wreq-util's SETTINGS, WINDOW_UPDATE,
+    // and pseudo-order intact. Replacing the whole Http2Options resets SETTINGS
+    // to defaults, which sends only INITIAL_WINDOW_SIZE and fails DataDome.
+    if let Some(h2) = em.http2_options_mut().as_mut() {
+        h2.headers_stream_dependency = Some(StreamDependency::new(StreamId::zero(), 255, true));
+    }
+
+    let hm = em.headers_mut();
+    hm.clear();
+    for (k, v) in SAFARI_IOS_HEADERS {
+        if let (Ok(n), Ok(val)) = (
+            http::header::HeaderName::from_bytes(k.as_bytes()),
+            http::header::HeaderValue::from_str(v),
+        ) {
+            hm.append(n, val);
+        }
+    }
+
+    em
+}
+
 fn chrome_h2() -> Http2Options {
+    // SETTINGS frame matches bogdanfinn `chrome_133`: HEADER_TABLE_SIZE,
+    // ENABLE_PUSH=0, INITIAL_WINDOW_SIZE, MAX_HEADER_LIST_SIZE. No
+    // MAX_CONCURRENT_STREAMS — real Chrome 133 and bogdanfinn both omit it,
+    // and indeed.com's WAF reads this as a bot signal when present. Priority
+    // weight 256 (encoded as 255 + 1) matches bogdanfinn's HEADERS frame.
     Http2Options::builder()
         .initial_window_size(6_291_456)
         .initial_connection_window_size(15_728_640)
         .max_header_list_size(262_144)
         .header_table_size(65_536)
-        .max_concurrent_streams(1000u32)
         .enable_push(false)
         .settings_order(
             SettingsOrder::builder()
                 .extend([
                     SettingId::HeaderTableSize,
                     SettingId::EnablePush,
                     SettingId::MaxConcurrentStreams,
                     SettingId::InitialWindowSize,
                     SettingId::MaxFrameSize,
                     SettingId::MaxHeaderListSize,
                     SettingId::EnableConnectProtocol,
                     SettingId::NoRfc7540Priorities,
                 ])
                 .build(),
         )

@@ -244,7 +372,7 @@ fn chrome_h2() -> Http2Options {
                 ])
                 .build(),
         )
-        .headers_stream_dependency(StreamDependency::new(StreamId::zero(), 219, true))
+        .headers_stream_dependency(StreamDependency::new(StreamId::zero(), 255, true))
         .build()
 }

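The `255` passed to `StreamDependency::new` above is the on-wire form of weight 256: RFC 7540 §5.3.2 transmits the weight minus one so the 1..=256 range fits in a single byte, and the exclusive flag occupies the top bit of the 31-bit stream-dependency field. A std-only sketch of that encoding (illustrative, not webclaw code):

```rust
// Sketch of the RFC 7540 §5.3.2 priority encoding referenced above.
// The weight is sent as (weight - 1), so 1..=256 fits in one byte; the
// exclusive flag is the high bit of the 31-bit stream-dependency field.
fn encode_priority(depends_on: u32, weight: u16, exclusive: bool) -> [u8; 5] {
    assert!((1..=256).contains(&weight), "weight must be 1..=256");
    let dep = (depends_on & 0x7FFF_FFFF) | if exclusive { 0x8000_0000 } else { 0 };
    let mut out = [0u8; 5];
    out[..4].copy_from_slice(&dep.to_be_bytes());
    out[4] = (weight - 1) as u8; // 256 -> 0xFF, matching the 255 in the builder call
    out
}

fn main() {
    // weight=256, exclusive=1, depends_on=0, the combination the profiles above send.
    assert_eq!(encode_priority(0, 256, true), [0x80, 0x00, 0x00, 0x00, 0xFF]);
    println!("{:02x?}", encode_priority(0, 256, true));
}
```

This also makes clear why the old `219` and new `255` weights differ by exactly one byte value on the wire.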

@@ -328,32 +456,38 @@ pub fn build_client(
     extra_headers: &std::collections::HashMap<String, String>,
     proxy: Option<&str>,
 ) -> Result<Client, FetchError> {
-    let (tls, h2, headers) = match variant {
-        BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
-        BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
-        BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
-        BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
-        BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+    // SafariIos26 builds its Emulation on top of wreq-util's base instead
+    // of from scratch. See `safari_ios_emulation` for why.
+    let mut emulation = match variant {
+        BrowserVariant::SafariIos26 => safari_ios_emulation(),
+        other => {
+            let (tls, h2, headers) = match other {
+                BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
+                BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
+                BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
+                BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
+                BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+                BrowserVariant::SafariIos26 => unreachable!("handled above"),
+            };
+            Emulation::builder()
+                .tls_options(tls)
+                .http2_options(h2)
+                .headers(build_headers(headers))
+                .build()
+        }
     };

-    let mut header_map = build_headers(headers);
-
-    // Append extra headers after profile defaults
+    // Append extra headers after profile defaults.
+    let hm = emulation.headers_mut();
     for (k, v) in extra_headers {
         if let (Ok(n), Ok(val)) = (
             http::header::HeaderName::from_bytes(k.as_bytes()),
             http::header::HeaderValue::from_str(v),
         ) {
-            header_map.insert(n, val);
+            hm.insert(n, val);
         }
     }

-    let emulation = Emulation::builder()
-        .tls_options(tls)
-        .http2_options(h2)
-        .headers(header_map)
-        .build();
-
     let mut builder = Client::builder()
         .emulation(emulation)
         .redirect(wreq::redirect::Policy::limited(10))

@@ -718,6 +718,55 @@ impl WebclawMcp {
             Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
         }
     }
+
+    /// List every vertical extractor the server knows about. Returns a
+    /// JSON array of `{name, label, description, url_patterns}` entries.
+    /// Call this to discover what verticals are available before using
+    /// `vertical_scrape`.
+    #[tool]
+    async fn list_extractors(
+        &self,
+        Parameters(_params): Parameters<ListExtractorsParams>,
+    ) -> Result<String, String> {
+        let catalog = webclaw_fetch::extractors::list();
+        serde_json::to_string_pretty(&catalog)
+            .map_err(|e| format!("failed to serialise extractor catalog: {e}"))
+    }
+
+    /// Run a vertical extractor by name and return typed JSON specific
+    /// to the target site (title, price, rating, author, etc.), not
+    /// generic markdown. Use `list_extractors` to discover available
+    /// names. Example names: `reddit`, `github_repo`, `trustpilot_reviews`,
+    /// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`.
+    ///
+    /// Antibot-gated verticals (amazon_product, ebay_listing,
+    /// etsy_listing, trustpilot_reviews) will automatically escalate to
+    /// the webclaw cloud API when local fetch hits bot protection,
+    /// provided `WEBCLAW_API_KEY` is set.
+    #[tool]
+    async fn vertical_scrape(
+        &self,
+        Parameters(params): Parameters<VerticalParams>,
+    ) -> Result<String, String> {
+        validate_url(&params.url)?;
+        // Use the cached Firefox client, not the default Chrome one.
+        // Reddit's `.json` endpoint rejects the wreq-Chrome TLS
+        // fingerprint with a 403 even from residential IPs (they
+        // ship a fingerprint blocklist that includes common
+        // browser-emulation libraries). The wreq-Firefox fingerprint
+        // still passes, and Firefox is equally fine for every other
+        // vertical in the catalog, so it's a strictly-safer default
+        // for `vertical_scrape` than the generic `scrape` tool's
+        // Chrome default. Matches the CLI `webclaw vertical`
+        // subcommand which already uses Firefox.
+        let client = self.firefox_or_build()?;
+        let data =
+            webclaw_fetch::extractors::dispatch_by_name(client.as_ref(), &params.name, &params.url)
+                .await
+                .map_err(|e| e.to_string())?;
+        serde_json::to_string_pretty(&data)
+            .map_err(|e| format!("failed to serialise extractor output: {e}"))
+    }
 }

 #[tool_handler]

@@ -727,7 +776,8 @@ impl ServerHandler for WebclawMcp {
             .with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION")))
             .with_instructions(String::from(
                 "Webclaw MCP server -- web content extraction for AI agents. \
-                 Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.",
+                 Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \
+                 list_extractors, vertical_scrape.",
             ))
     }
 }

@@ -103,3 +103,20 @@ pub struct SearchParams {
     /// Number of results to return (default: 10)
     pub num_results: Option<u32>,
 }
+
+/// Parameters for `vertical_scrape`: run a site-specific extractor by name.
+#[derive(Debug, Deserialize, JsonSchema)]
+pub struct VerticalParams {
+    /// Name of the vertical extractor. Call `list_extractors` to see all
+    /// available names. Examples: "reddit", "github_repo", "pypi",
+    /// "trustpilot_reviews", "youtube_video", "shopify_product".
+    pub name: String,
+    /// URL to extract. Must match the URL patterns the extractor claims;
+    /// otherwise the tool returns a clear "URL mismatch" error.
+    pub url: String,
+}
+
+/// `list_extractors` takes no arguments but we still need an empty struct
+/// so rmcp can generate a schema and parse the (empty) JSON-RPC params.
+#[derive(Debug, Deserialize, JsonSchema)]
+pub struct ListExtractorsParams {}
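Assuming a standard MCP `tools/call` envelope, a client request that deserializes into `VerticalParams` might look like this (hypothetical example values; only the `name` and `url` argument fields come from the struct above):

```json
{
  "method": "tools/call",
  "params": {
    "name": "vertical_scrape",
    "arguments": {
      "name": "github_repo",
      "url": "https://github.com/0xMassi/webclaw"
    }
  }
}
```

A `name`/`url` mismatch (say, a Reddit URL passed to `github_repo`) returns the "URL mismatch" error described in the field docs rather than silently running the wrong extractor.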