Compare commits


12 commits
v0.5.0 ... main

Author SHA1 Message Date
Valerio
a5c3433372 fix(core+server): guard markdown pipe slice + detect trustpilot/reddit verify walls
Some checks failed
CI / Test (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Docs (push) Has been cancelled
2026-04-23 15:26:31 +02:00
Valerio
966981bc42 fix(fetch): send bot-identifying UA on reddit .json API to bypass browser UA block
2026-04-23 15:17:04 +02:00
Valerio
866fa88aa0 fix(fetch): reject HTML verification pages served at .json reddit URL 2026-04-23 15:06:35 +02:00
Valerio
b413d702b2 feat(fetch): add fetch_smart with Reddit + Akamai rescue paths, bump 0.5.6 2026-04-23 14:59:29 +02:00
Valerio
98a177dec4 feat(cli): expose safari-ios browser profile + bump to 0.5.5 2026-04-23 13:32:55 +02:00
Valerio
e1af2da509 docs(claude): drop sidecar references, mention ProductionFetcher 2026-04-23 13:25:23 +02:00
Valerio
2285c585b1 docs(changelog): simplify 0.5.4 entry 2026-04-23 13:01:02 +02:00
Valerio
b77767814a Bump to 0.5.4: SafariIos profile + Chrome fingerprint alignment + locale helper
- New BrowserProfile::SafariIos mapped to BrowserVariant::SafariIos26.
  Built on wreq_util::Emulation::SafariIos26 with 4 overrides (TLS
  extension order, HTTP/2 HEADERS priority, real Safari iOS 26 headers,
  gzip/deflate/br). Matches bogdanfinn safari_ios_26_0 JA3
  8d909525bd5bbb79f133d11cc05159fe exactly. Empirically 9/10 on
  immobiliare.it with country-it residential.

- BrowserProfile::Chrome aligned to bogdanfinn chrome_133: dropped
  MAX_CONCURRENT_STREAMS from H2 SETTINGS, priority weight 256,
  explicit extension_permutation, advertise h3 in ALPN and ALPS.
  JA3 43067709b025da334de1279a120f8e14, akamai_fp
  52d84b11737d980aef856699f885ca86. Fixes indeed.com and other
  Cloudflare-fronted sites.

- New locale module: accept_language_for_url / accept_language_for_tld.
  TLD to Accept-Language mapping, unknown TLDs default to en-US.
  DataDome geo-vs-locale cross-checks are now trivially satisfiable.

- wreq-util bumped from 2.2.6 to 3.0.0-rc.10 for Emulation::SafariIos26.
2026-04-23 12:58:24 +02:00
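The locale helper described in this commit can be sketched as below. The TLD table and q-values here are illustrative assumptions (only the en-US fallback is documented above), and the real `accept_language_for_url` presumably handles ports and multi-part suffixes like `.co.uk`:

```rust
/// Illustrative TLD -> Accept-Language table; the real mapping in the
/// locale module is larger. Unknown TLDs fall back to en-US.
fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7",
        "de" => "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7",
        "fr" => "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7",
        _ => "en-US,en;q=0.9",
    }
}

/// Naive TLD extraction for the sketch: take the host, then the last
/// dot-separated label.
fn accept_language_for_url(url: &str) -> &'static str {
    let host = url.split("://").nth(1).unwrap_or(url);
    let host = host.split('/').next().unwrap_or(host);
    let tld = host.rsplit('.').next().unwrap_or("");
    accept_language_for_tld(tld)
}
```

A header built this way is what lets a `country-it` residential exit and the request locale agree, which is the DataDome cross-check the commit mentions.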
Valerio
4bf11d902f fix(mcp): vertical_scrape uses Firefox profile, not default Chrome
Reddit's .json API rejects the wreq-Chrome TLS fingerprint with a
403 even from residential IPs. Their block list includes known
browser-emulation library fingerprints. wreq-Firefox passes. The
CLI `vertical` subcommand already forced Firefox; MCP
`vertical_scrape` was still falling back to the long-lived
`self.fetch_client` which defaults to Chrome, so reddit failed
on MCP and nobody noticed because the earlier test runs all had
an API key set that masked the issue.

Switched vertical_scrape to reuse `self.firefox_or_build()` which
gives us the cached Firefox client (same pattern the scrape tool
uses when the caller requests `browser: firefox`). Firefox is
strictly safer than Chrome for every vertical in the catalog, so
making it the hard default for `vertical_scrape` is the right call.

Verified end-to-end from a clean shell with no WEBCLAW_API_KEY:
- MCP reddit: 679ms, post/author/6 comments correct
- MCP instagram_profile: 1157ms, 18471 followers

No change to the `scrape` tool -- it keeps the user-selectable
browser param.

Bumps version to 0.5.3.
2026-04-22 23:18:11 +02:00
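The `firefox_or_build()` caching this commit reuses can be sketched with a `OnceLock` (the struct and field names here are hypothetical; only the method name comes from the commit message):

```rust
use std::sync::OnceLock;

// Stand-in for the real client type; names are illustrative.
struct FetchClient {
    profile: &'static str,
}

struct McpServer {
    firefox: OnceLock<FetchClient>,
}

impl McpServer {
    /// Build the Firefox-profile client on first use, then hand out the
    /// same cached instance on every later call.
    fn firefox_or_build(&self) -> &FetchClient {
        self.firefox.get_or_init(|| FetchClient { profile: "firefox" })
    }
}
```

Because the cache lives on the server struct, every `vertical_scrape` call after the first reuses the warmed Firefox client instead of rebuilding TLS state.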
Valerio
0daa2fec1a feat(cli+mcp): vertical extractor support (28 extractors discoverable + callable)
Wires the vertical extractor catalog into both the CLI and the MCP
server so users don't have to hit the HTTP API to invoke them. Same
semantics as `/v1/scrape/{vertical}` + `/v1/extractors`.

CLI (webclaw-cli):
- New subcommand `webclaw extractors` lists all 28 extractors with
  name, label, and sample URL. `--json` flag emits the full catalog
  as machine-readable JSON.
- New subcommand `webclaw vertical <name> <url>` runs a specific
  extractor and prints typed JSON. Pretty-printed by default; `--raw`
  for single-line. Exits 1 with a clear "URL does not match" error
  on mismatch.
- FetchClient built with Firefox profile + cloud fallback attached
  when WEBCLAW_API_KEY is set, so antibot-gated verticals escalate.

MCP (webclaw-mcp):
- New tool `list_extractors` (no args) returns the catalog as
  pretty-printed JSON for in-session discovery.
- New tool `vertical_scrape` takes `{name, url}` and returns typed
  JSON. Reuses the long-lived self.fetch_client.
- Tool count goes from 10 to 12. Server-info instruction string
  updated accordingly.

Tests: 215 passing, clippy clean. Manually surface-tested end-to-end:
CLI prints real Reddit/GitHub/PyPI data; MCP JSON-RPC session returns
28-entry catalog + typed responses for pypi/requests + rust-lang/rust
in 200-400ms.

Version bumped to 0.5.2 (minor for API additions, backwards compatible).
2026-04-22 21:41:15 +02:00
Valerio
058493bc8f feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend
Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now
take `client: &dyn Fetcher` instead of `client: &FetchClient` directly.
Backwards-compatible: FetchClient implements Fetcher, blanket impls
cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server
callers keep working unchanged.

Motivation: the production API server (api.webclaw.io) must not do
in-process TLS fingerprinting; it delegates all HTTP to the Go
tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on
production would have required importing wreq into the server's
dep graph, violating the CLAUDE.md rule. Now production can provide
its own TlsSidecarFetcher implementation and pass it to the same
dispatcher the OSS server uses.

Changes:
- New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus
  blanket impls for `&T` and `Arc<T>`.
- `FetchClient` gains a tiny impl block in client.rs that forwards to
  its existing public methods.
- All 28 extractor signatures migrated from `&FetchClient` to
  `&dyn Fetcher` (sed-driven bulk rewrite, no semantic change).
- `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`.
- `extractors::dispatch_by_url` and `extractors::dispatch_by_name`
  take `&dyn Fetcher`.
- `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has
  native async-fn-in-trait but dyn dispatch still needs async_trait).
- Version bumped to 0.5.1, CHANGELOG updated.

Tests: 215 passing in webclaw-fetch (no new tests needed — the existing
extractor tests exercise the trait methods transparently).
Clippy: clean workspace-wide.
2026-04-22 21:17:50 +02:00
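The trait-object pattern this commit introduces — dyn-compatible dispatch plus blanket impls for `&T` and `Arc<T>` — can be shown with a sync sketch (the real `Fetcher` is async via `async-trait`, and the types here are simplified stand-ins):

```rust
use std::sync::Arc;

// Simplified stand-in for webclaw's FetchResult.
struct FetchResult {
    status: u16,
}

// Sync sketch of the Fetcher trait; the real one is async.
trait Fetcher {
    fn fetch(&self, url: &str) -> FetchResult;
}

// Blanket impls: `&T` and `Arc<T>` are Fetchers whenever `T` is,
// so existing call sites keep working unchanged.
impl<'a, T: Fetcher + ?Sized> Fetcher for &'a T {
    fn fetch(&self, url: &str) -> FetchResult {
        (**self).fetch(url)
    }
}
impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> FetchResult {
        (**self).fetch(url)
    }
}

// Any backend plugs in: wreq in OSS, a tls-sidecar client in production.
struct MockFetcher;
impl Fetcher for MockFetcher {
    fn fetch(&self, _url: &str) -> FetchResult {
        FetchResult { status: 200 }
    }
}

// Extractors take `&dyn Fetcher`, never a concrete client.
fn extract(client: &dyn Fetcher, url: &str) -> u16 {
    client.fetch(url).status
}
```

This is why the 28-extractor migration was a mechanical sed rewrite: `&client` coerces to `&dyn Fetcher` at every existing call site.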
Valerio
aaa5103504 docs(claude): fix stale primp references, document wreq + Fetcher trait
webclaw-fetch switched from primp to wreq 6.x (BoringSSL) a while ago
but CLAUDE.md still documented primp, the `[patch.crates-io]`
requirement, and RUSTFLAGS that no longer apply. Refreshed four
sections:

- Crate listing: webclaw-fetch uses wreq, not primp
- client.rs description: wreq BoringSSL, plus a note that FetchClient
  will implement the new Fetcher trait so production can swap in a
  tls-sidecar-backed fetcher without importing wreq
- Hard Rules: dropped obsolete `[patch.crates-io]` and RUSTFLAGS lines,
  added the "Vertical extractors take `&dyn Fetcher`" rule that makes
  the architectural separation explicit for the upcoming production
  integration
- Removed language about primp being "patched"; reqwest in webclaw-llm
  is now just "plain reqwest" with no relationship to wreq
2026-04-22 21:11:18 +02:00
45 changed files with 812 additions and 114 deletions

View file

@@ -3,6 +3,64 @@
All notable changes to webclaw are documented here.
Format follows [Keep a Changelog](https://keepachangelog.com/).
## [0.5.6] — 2026-04-23
### Added
- `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API with an identifiable bot `User-Agent`, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. Makes `/v1/scrape` on Reddit populate markdown again.
### Fixed
- Regression introduced in 0.5.4 where the production server's `/v1/scrape` bypassed the Reddit `.json` shortcut and Akamai cookie warmup that `fetch_and_extract` had been providing. Both helpers now live in `fetch_smart` and every caller path picks them up.
- Panic in the markdown converter (`markdown.rs:925`) on single-pipe `|` lines. A `[1..len-1]` slice on a 1-char input panicked with a `begin <= end` bounds error. Now guarded.
---
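A sketch of the Reddit branch of the `fetch_smart` routing described above (the helper name and URL handling are simplified assumptions; the real logic lives in the fetch layer and also covers the Akamai warmup path):

```rust
// Hypothetical sketch of the per-site rescue decision.
#[derive(Debug, PartialEq)]
enum Rescue {
    // Re-route to Reddit's public `.json` API (sent with a bot UA).
    RedditJson(String),
    // No rescue; plain fetch (Akamai warmup happens after the response).
    None,
}

fn rescue_for(url: &str) -> Rescue {
    if url.contains("reddit.com/") && !url.ends_with(".json") {
        // Drop any trailing slash, then append `.json`.
        let base = url.trim_end_matches('/');
        Rescue::RedditJson(format!("{base}.json"))
    } else {
        Rescue::None
    }
}
```

The point of routing at the fetch layer is that every caller path (CLI, MCP, both servers) picks the rescue up without shape changes.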
## [0.5.5] — 2026-04-23
### Added
- `webclaw --browser safari-ios` on the CLI. Pairs with `--proxy` for DataDome-fronted sites that reject desktop profiles.
---
## [0.5.4] — 2026-04-23
### Added
- New `BrowserProfile::SafariIos` for Safari iOS 26 fingerprinting. Pairs with a country-matched residential proxy for sites that reject non-mobile profiles.
- `accept_language_for_url(url)` and `accept_language_for_tld(tld)` helpers. Returns a locale-appropriate `Accept-Language` based on the URL's TLD, with `en-US` as the fallback.
### Changed
- Chrome browser fingerprint refreshed for current Cloudflare bot management. Fixes 403 challenges on several e-commerce and jobs sites.
- Bumped `wreq-util` to `3.0.0-rc.10`.
---
## [0.5.2] — 2026-04-22
### Added
- **`webclaw vertical <name> <url>` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`. URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1.
- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick.
- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set.
### Changed
- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded 10). Also `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`.
---
## [0.5.1] — 2026-04-22
### Added
- **`webclaw_fetch::Fetcher` trait.** Vertical extractors now consume `&dyn Fetcher` instead of `&FetchClient` directly. The trait exposes three methods (`fetch`, `fetch_with_headers`, `cloud`) covering everything extractors need. Callers that already held a `FetchClient` keep working unchanged: `FetchClient` implements `Fetcher`, blanket impls cover `&T` and `Arc<T>`, so `&client` coerces to `&dyn Fetcher` automatically.
The motivation is the split between OSS (wreq-backed, in-process TLS fingerprinting) and the production API server at api.webclaw.io (which cannot use in-process fingerprinting per the architecture rule, and must delegate HTTP through the Go tls-sidecar). Before this trait, adding vertical routes to the production server would have required importing wreq into its dependency graph, violating the separation. Now the production server can provide its own `TlsSidecarFetcher` implementation and pass it to the same extractor dispatcher the OSS server uses.
Backwards compatible. No behavior change for CLI, MCP, or OSS self-host.
### Changed
- All 28 extractor `extract()` signatures migrated from `client: &FetchClient` to `client: &dyn Fetcher`. The dispatcher functions (`extractors::dispatch_by_url`, `extractors::dispatch_by_name`) and the cloud escalation helpers (`cloud::smart_fetch`, `cloud::smart_fetch_html`) follow the same change. Tests and call sites are unchanged because `&FetchClient` auto-coerces.
---
## [0.5.0] — 2026-04-22
### Added

View file

@@ -11,7 +11,7 @@ webclaw/
# + ExtractionOptions (include/exclude CSS selectors)
# + diff engine (change tracking)
# + brand extraction (DOM/CSS analysis)
-webclaw-fetch/ # HTTP client via primp. Crawler. Sitemap discovery. Batch ops.
+webclaw-fetch/ # HTTP client via wreq (BoringSSL). Crawler. Sitemap discovery. Batch ops.
# + proxy pool rotation (per-request)
# + PDF content-type detection
# + document parsing (DOCX, XLSX, CSV)
@@ -40,7 +40,7 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
- `brand.rs` — Brand identity extraction from DOM structure and CSS
### Fetch Modules (`webclaw-fetch`)
-- `client.rs` — FetchClient with primp TLS impersonation
+- `client.rs` — FetchClient with wreq BoringSSL TLS impersonation; implements the public `Fetcher` trait so callers (including server adapters) can swap in alternative implementations
- `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128)
- `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
- `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt)
@@ -76,9 +76,10 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
## Hard Rules
- **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible.
-- **primp requires `[patch.crates-io]`** for patched rustls/h2 forks at workspace level.
+- **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally.
-- **RUSTFLAGS are set in `.cargo/config.toml`** — no need to pass manually.
+- **No special RUSTFLAGS** — `.cargo/config.toml` is currently empty of build flags. Don't add any.
-- **webclaw-llm uses plain reqwest** (NOT primp-patched). LLM APIs don't need TLS fingerprinting.
+- **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep.
+- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client.
- **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.
## Build & Test

Cargo.lock generated (46 changed lines)
View file

@@ -2967,6 +2967,26 @@ dependencies = [
"pom",
]
[[package]]
name = "typed-builder"
version = "0.23.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "31aa81521b70f94402501d848ccc0ecaa8f93c8eb6999eb9747e72287757ffda"
dependencies = [
"typed-builder-macro",
]
[[package]]
name = "typed-builder-macro"
version = "0.23.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "076a02dc54dd46795c2e9c8282ed40bcfb1e22747e955de9389a1de28190fb26"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "typed-path"
version = "0.12.3"
@@ -3199,7 +3219,7 @@ dependencies = [
[[package]]
name = "webclaw-cli"
-version = "0.5.0"
+version = "0.5.6"
dependencies = [
"clap",
"dotenvy",
@@ -3220,7 +3240,7 @@ dependencies = [
[[package]]
name = "webclaw-core"
-version = "0.5.0"
+version = "0.5.6"
dependencies = [
"ego-tree",
"once_cell",
@@ -3238,8 +3258,9 @@ dependencies = [
[[package]]
name = "webclaw-fetch"
-version = "0.5.0"
+version = "0.5.6"
dependencies = [
+"async-trait",
"bytes",
"calamine",
"http",
@@ -3257,12 +3278,13 @@ dependencies = [
"webclaw-core",
"webclaw-pdf",
"wreq",
+"wreq-util",
"zip 2.4.2",
]
[[package]]
name = "webclaw-llm"
-version = "0.5.0"
+version = "0.5.6"
dependencies = [
"async-trait",
"reqwest",
@@ -3275,7 +3297,7 @@ dependencies = [
[[package]]
name = "webclaw-mcp"
-version = "0.5.0"
+version = "0.5.6"
dependencies = [
"dirs",
"dotenvy",
@@ -3295,7 +3317,7 @@ dependencies = [
[[package]]
name = "webclaw-pdf"
-version = "0.5.0"
+version = "0.5.6"
dependencies = [
"pdf-extract",
"thiserror",
@@ -3304,7 +3326,7 @@ dependencies = [
[[package]]
name = "webclaw-server"
-version = "0.5.0"
+version = "0.5.6"
dependencies = [
"anyhow",
"axum",
@@ -3708,6 +3730,16 @@ dependencies = [
"zstd",
]
[[package]]
name = "wreq-util"
version = "3.0.0-rc.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6c6bbe24d28beb9ceb58b514bd6a613c759d3b706f768b9d2950d5d35b543c04"
dependencies = [
"typed-builder",
"wreq",
]
[[package]]
name = "writeable"
version = "0.6.2"

View file

@@ -3,7 +3,7 @@ resolver = "2"
members = ["crates/*"]
[workspace.package]
-version = "0.5.0"
+version = "0.5.6"
edition = "2024"
license = "AGPL-3.0"
repository = "https://github.com/0xMassi/webclaw"

View file

@@ -308,6 +308,34 @@ enum Commands {
#[arg(long)]
facts: Option<PathBuf>,
},
/// List all vertical extractors in the catalog.
///
/// Each entry has a stable `name` (usable with `webclaw vertical <name>`),
/// a human-friendly label, a one-line description, and the URL
/// patterns it claims. The same data is served by `/v1/extractors`
/// when running the REST API.
Extractors {
/// Emit JSON instead of a human-friendly table.
#[arg(long)]
json: bool,
},
/// Run a vertical extractor by name. Returns typed JSON with fields
/// specific to the target site (title, price, author, rating, etc.)
/// rather than generic markdown.
///
/// Use `webclaw extractors` to see the full list. Example:
/// `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/`.
Vertical {
/// Vertical name (e.g. `reddit`, `github_repo`, `trustpilot_reviews`).
name: String,
/// URL to extract.
url: String,
/// Emit compact JSON (single line). Default is pretty-printed.
#[arg(long)]
raw: bool,
},
}
#[derive(Clone, ValueEnum)]
@@ -323,6 +351,9 @@ enum OutputFormat {
enum Browser {
Chrome,
Firefox,
+/// Safari iOS 26. Pair with a country-matched residential proxy for sites
+/// that reject non-mobile profiles.
+SafariIos,
Random,
}
@@ -349,6 +380,7 @@ impl From<Browser> for BrowserProfile {
match b {
Browser::Chrome => BrowserProfile::Chrome,
Browser::Firefox => BrowserProfile::Firefox,
+Browser::SafariIos => BrowserProfile::SafariIos,
Browser::Random => BrowserProfile::Random,
}
}
@@ -2288,6 +2320,83 @@ async fn main() {
}
return;
}
Commands::Extractors { json } => {
let entries = webclaw_fetch::extractors::list();
if *json {
// Serialize with serde_json. ExtractorInfo derives
// Serialize so this is a one-liner.
match serde_json::to_string_pretty(&entries) {
Ok(s) => println!("{s}"),
Err(e) => {
eprintln!("error: failed to serialise catalog: {e}");
process::exit(1);
}
}
} else {
// Human-friendly table: NAME + LABEL + one URL
// pattern sample. Keeps the output scannable on a
// narrow terminal.
println!("{} vertical extractors available:\n", entries.len());
let name_w = entries.iter().map(|e| e.name.len()).max().unwrap_or(0);
let label_w = entries.iter().map(|e| e.label.len()).max().unwrap_or(0);
for e in &entries {
let pattern_sample = e.url_patterns.first().copied().unwrap_or("");
println!(
" {:<nw$} {:<lw$} {}",
e.name,
e.label,
pattern_sample,
nw = name_w,
lw = label_w,
);
}
println!("\nRun one: webclaw vertical <name> <url>");
}
return;
}
Commands::Vertical { name, url, raw } => {
// Build a FetchClient with cloud fallback attached when
// WEBCLAW_API_KEY is set. Antibot-gated verticals
// (amazon, ebay, etsy, trustpilot) need this to escalate
// on bot protection.
let fetch_cfg = webclaw_fetch::FetchConfig {
browser: webclaw_fetch::BrowserProfile::Firefox,
..webclaw_fetch::FetchConfig::default()
};
let mut client = match webclaw_fetch::FetchClient::new(fetch_cfg) {
Ok(c) => c,
Err(e) => {
eprintln!("error: failed to build fetch client: {e}");
process::exit(1);
}
};
if let Some(cloud) = webclaw_fetch::cloud::CloudClient::from_env() {
client = client.with_cloud(cloud);
}
match webclaw_fetch::extractors::dispatch_by_name(&client, name, url).await {
Ok(data) => {
let rendered = if *raw {
serde_json::to_string(&data)
} else {
serde_json::to_string_pretty(&data)
};
match rendered {
Ok(s) => println!("{s}"),
Err(e) => {
eprintln!("error: JSON encode failed: {e}");
process::exit(1);
}
}
}
Err(e) => {
// UrlMismatch / UnknownVertical / Fetch all get
// Display impls with actionable messages.
eprintln!("error: {e}");
process::exit(1);
}
}
return;
}
}
}

View file

@@ -920,8 +920,10 @@ fn strip_markdown(md: &str) -> String {
continue;
}
-// Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs
-if trimmed.starts_with('|') && trimmed.ends_with('|') {
+// Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs.
+// Require at least 2 chars so the slice `[1..len-1]` stays non-empty on single-pipe rows
+// (which aren't real tables anyway); a lone `|` previously panicked at `begin <= end`.
+if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
let inner = &trimmed[1..trimmed.len() - 1];
let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
lines.push(cells.join("\t"));

View file

@@ -12,7 +12,9 @@ serde = { workspace = true }
thiserror = { workspace = true }
tracing = { workspace = true }
tokio = { workspace = true }
+async-trait = "0.1"
wreq = { version = "6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] }
+wreq-util = "3.0.0-rc.10"
http = "1"
bytes = "1"
url = "2"

View file

@@ -7,6 +7,10 @@ pub enum BrowserProfile {
#[default]
Chrome,
Firefox,
+/// Safari iOS 26 (iPhone). The one profile proven to defeat
+/// DataDome's immobiliare.it / idealista.it / target.com-class
+/// rules when paired with a country-scoped residential proxy.
+SafariIos,
/// Randomly pick from all available profiles on each request.
Random,
}
@@ -18,6 +22,7 @@ pub enum BrowserVariant {
ChromeMacos,
Firefox,
Safari,
+SafariIos26,
Edge,
}

View file

@@ -261,10 +261,65 @@ impl FetchClient {
self.cloud.as_deref()
}
/// Fetch a URL with per-site rescue paths: Reddit URLs redirect to the
/// `.json` API, and Akamai-style challenge responses trigger a homepage
/// cookie warmup and a retry. Returns the same `FetchResult` shape as
/// [`Self::fetch`] so every caller (CLI, MCP, OSS server, production
/// server) benefits without shape churn.
///
/// This is the method most callers want. Use plain [`Self::fetch`] only
/// when you need literal no-rescue behavior (e.g. inside the rescue
/// logic itself to avoid recursion).
pub async fn fetch_smart(&self, url: &str) -> Result<FetchResult, FetchError> {
// Reddit: the HTML page shows a verification interstitial for most
// client IPs, but appending `.json` returns the post + comment tree
// publicly. `parse_reddit_json` in downstream code knows how to read
// the result; here we just do the URL swap at the fetch layer.
if crate::reddit::is_reddit_url(url) && !url.ends_with(".json") {
let json_url = crate::reddit::json_url(url);
// Reddit's public .json API serves JSON to identifiable bot
// User-Agents and blocks browser UAs with a verification wall.
// Override our Chrome-profile UA for this specific call.
let ua = concat!(
"Webclaw/",
env!("CARGO_PKG_VERSION"),
" (+https://webclaw.io)"
);
if let Ok(resp) = self
.fetch_with_headers(&json_url, &[("user-agent", ua)])
.await
&& resp.status == 200
{
let first = resp.html.trim_start().as_bytes().first().copied();
if matches!(first, Some(b'{') | Some(b'[')) {
return Ok(resp);
}
}
// If the .json fetch failed or returned HTML, fall through.
}
let resp = self.fetch(url).await?;
// Akamai / bazadebezolkohpepadr challenge: visit the homepage to
// collect warmup cookies (_abck, bm_sz, etc.), then retry.
if is_challenge_html(&resp.html)
&& let Some(homepage) = extract_homepage(url)
{
debug!("challenge detected, warming cookies via {homepage}");
let _ = self.fetch(&homepage).await;
if let Ok(retry) = self.fetch(url).await {
return Ok(retry);
}
}
Ok(resp)
}
/// Fetch a URL and return the raw HTML + response metadata.
///
/// Automatically retries on transient failures (network errors, 5xx, 429)
-/// with exponential backoff: 0s, 1s (2 attempts total).
+/// with exponential backoff: 0s, 1s (2 attempts total). No per-site
+/// rescue logic; use [`Self::fetch_smart`] for that.
#[instrument(skip(self), fields(url = %url))]
pub async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
let delays = [Duration::ZERO, Duration::from_secs(1)];
@@ -599,12 +654,43 @@
}
}
// ---------------------------------------------------------------------------
// Fetcher trait implementation
//
// Vertical extractors consume the [`crate::fetcher::Fetcher`] trait
// rather than `FetchClient` directly, which is what lets the production
// API server swap in a tls-sidecar-backed implementation without
// pulling wreq into its dependency graph. For everyone else (CLI, MCP,
// self-hosted OSS server) this impl means "pass the FetchClient you
// already have; nothing changes".
// ---------------------------------------------------------------------------
#[async_trait::async_trait]
impl crate::fetcher::Fetcher for FetchClient {
async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
FetchClient::fetch(self, url).await
}
async fn fetch_with_headers(
&self,
url: &str,
headers: &[(&str, &str)],
) -> Result<FetchResult, FetchError> {
FetchClient::fetch_with_headers(self, url, headers).await
}
fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
FetchClient::cloud(self)
}
}
/// Collect the browser variants to use based on the browser profile.
fn collect_variants(profile: &BrowserProfile) -> Vec<BrowserVariant> {
match profile {
BrowserProfile::Random => browser::all_variants(),
BrowserProfile::Chrome => vec![browser::latest_chrome()],
BrowserProfile::Firefox => vec![browser::latest_firefox()],
+BrowserProfile::SafariIos => vec![BrowserVariant::SafariIos26],
}
}
@@ -682,22 +768,23 @@ fn is_pdf_content_type(headers: &http::HeaderMap) -> bool {
/// Detect if a response looks like a bot protection challenge page.
fn is_challenge_response(response: &Response) -> bool {
-let len = response.body().len();
+is_challenge_html(response.text().as_ref())
+}
+
+/// Same as `is_challenge_response`, operating on a body string directly
+/// so callers holding a `FetchResult` can reuse the heuristic.
+fn is_challenge_html(html: &str) -> bool {
+let len = html.len();
if len > 15_000 || len == 0 {
return false;
}
-let text = response.text();
-let lower = text.to_lowercase();
+let lower = html.to_lowercase();
if lower.contains("<title>challenge page</title>") {
return true;
}
if lower.contains("bazadebezolkohpepadr") && len < 5_000 {
return true;
}
false
}

View file

@@ -66,7 +66,9 @@ use serde_json::{Value, json};
use thiserror::Error;
use tracing::{debug, info, warn};
-use crate::client::FetchClient;
+// Client type isn't needed here anymore now that smart_fetch* takes
+// `&dyn Fetcher`. Kept as a comment for historical context: this
+// module used to import FetchClient directly before v0.5.1.
// ---------------------------------------------------------------------------
// URLs + defaults — keep in one place so "change the signup link" is a
@@ -506,7 +508,7 @@ pub enum SmartFetchResult {
/// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed
/// [`CloudError`] so you can render precise UX.
pub async fn smart_fetch(
-client: &FetchClient,
+client: &dyn crate::fetcher::Fetcher,
cloud: Option<&CloudClient>,
url: &str,
include_selectors: &[String],
@@ -613,7 +615,7 @@ pub struct FetchedHtml {
/// Designed for the vertical-extractor pattern where the caller has
/// its own parser and just needs bytes.
pub async fn smart_fetch_html(
-client: &FetchClient,
+client: &dyn crate::fetcher::Fetcher,
cloud: Option<&CloudClient>,
url: &str,
) -> Result<FetchedHtml, CloudError> {


@@ -32,9 +32,9 @@ use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::cloud::{self, CloudError};
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "amazon_product",
@@ -59,7 +59,7 @@ pub fn matches(url: &str) -> bool {
     parse_asin(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let asin = parse_asin(url)
         .ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;


@@ -10,8 +10,8 @@ use quick_xml::events::Event;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "arxiv",
@@ -32,7 +32,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/abs/") || url.contains("/pdf/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let id = parse_id(url)
         .ok_or_else(|| FetchError::Build(format!("arxiv: cannot parse id from '{url}'")))?;


@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "crates_io",
@@ -30,7 +30,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/crates/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let name = parse_name(url)
         .ok_or_else(|| FetchError::Build(format!("crates.io: cannot parse name from '{url}'")))?;


@@ -8,8 +8,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "dev_to",
@@ -61,7 +61,7 @@ const RESERVED_FIRST_SEGS: &[&str] = &[
     "t",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (username, slug) = parse_username_slug(url).ok_or_else(|| {
         FetchError::Build(format!("dev_to: cannot parse username/slug from '{url}'"))
     })?;


@@ -8,8 +8,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "docker_hub",
@@ -29,7 +29,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/_/") || url.contains("/r/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (namespace, name) = parse_repo(url)
         .ok_or_else(|| FetchError::Build(format!("docker_hub: cannot parse repo from '{url}'")))?;


@@ -14,9 +14,9 @@ use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::cloud::{self, CloudError};
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "ebay_listing",
@@ -39,7 +39,7 @@ pub fn matches(url: &str) -> bool {
     parse_item_id(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let item_id = parse_item_id(url)
         .ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?;


@@ -42,8 +42,8 @@ use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "ecommerce_product",
@@ -69,7 +69,7 @@ pub fn matches(url: &str) -> bool {
     !host_of(url).is_empty()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let resp = client.fetch(url).await?;
     if !(200..300).contains(&resp.status) {
         return Err(FetchError::Build(format!(


@@ -26,9 +26,9 @@ use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::cloud::{self, CloudError};
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "etsy_listing",
@@ -49,7 +49,7 @@ pub fn matches(url: &str) -> bool {
     parse_listing_id(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let listing_id = parse_listing_id(url)
         .ok_or_else(|| FetchError::Build(format!("etsy_listing: no listing id in '{url}'")))?;


@@ -10,8 +10,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "github_issue",
@@ -34,7 +34,7 @@ pub fn matches(url: &str) -> bool {
     parse_issue(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, repo, number) = parse_issue(url).ok_or_else(|| {
         FetchError::Build(format!("github_issue: cannot parse issue URL '{url}'"))
     })?;


@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "github_pr",
@@ -33,7 +33,7 @@ pub fn matches(url: &str) -> bool {
     parse_pr(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, repo, number) = parse_pr(url).ok_or_else(|| {
         FetchError::Build(format!("github_pr: cannot parse pull-request URL '{url}'"))
     })?;


@@ -8,8 +8,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "github_release",
@@ -32,7 +32,7 @@ pub fn matches(url: &str) -> bool {
     parse_release(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, repo, tag) = parse_release(url).ok_or_else(|| {
         FetchError::Build(format!("github_release: cannot parse release URL '{url}'"))
     })?;


@@ -10,8 +10,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "github_repo",
@@ -70,7 +70,7 @@ const RESERVED_OWNERS: &[&str] = &[
     "about",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, repo) = parse_owner_repo(url).ok_or_else(|| {
         FetchError::Build(format!("github_repo: cannot parse owner/repo from '{url}'"))
     })?;


@@ -10,8 +10,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "hackernews",
@@ -40,7 +40,7 @@ pub fn matches(url: &str) -> bool {
     false
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let id = parse_item_id(url).ok_or_else(|| {
         FetchError::Build(format!("hackernews: cannot parse item id from '{url}'"))
     })?;


@@ -7,8 +7,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "huggingface_dataset",
@@ -38,7 +38,7 @@ pub fn matches(url: &str) -> bool {
     segs.first().copied() == Some("datasets") && (segs.len() == 2 || segs.len() == 3)
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let dataset_path = parse_dataset_path(url).ok_or_else(|| {
         FetchError::Build(format!(
             "hf_dataset: cannot parse dataset path from '{url}'"


@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "huggingface_model",
@@ -61,7 +61,7 @@ const RESERVED_NAMESPACES: &[&str] = &[
     "search",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (owner, name) = parse_owner_name(url).ok_or_else(|| {
         FetchError::Build(format!("hf model: cannot parse owner/name from '{url}'"))
     })?;


@@ -11,8 +11,8 @@ use serde_json::{Value, json};
 use std::sync::OnceLock;
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "instagram_post",
@@ -33,7 +33,7 @@ pub fn matches(url: &str) -> bool {
     parse_shortcode(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (kind, shortcode) = parse_shortcode(url).ok_or_else(|| {
         FetchError::Build(format!(
             "instagram_post: cannot parse shortcode from '{url}'"


@@ -23,8 +23,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "instagram_profile",
@@ -80,7 +80,7 @@ const RESERVED: &[&str] = &[
     "signup",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let username = parse_username(url).ok_or_else(|| {
         FetchError::Build(format!(
             "instagram_profile: cannot parse username from '{url}'"
@@ -198,7 +198,7 @@ fn classify(n: &MediaNode) -> &'static str {
 /// pull whatever OG tags we can. Returns less data and explicitly
 /// flags `data_completeness: "og_only"` so callers know.
 async fn og_fallback(
-    client: &FetchClient,
+    client: &dyn Fetcher,
     username: &str,
     original_url: &str,
     api_status: u16,


@@ -14,8 +14,8 @@ use serde_json::{Value, json};
 use std::sync::OnceLock;
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "linkedin_post",
@@ -36,7 +36,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/feed/update/urn:li:") || url.contains("/posts/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let urn = extract_urn(url).ok_or_else(|| {
         FetchError::Build(format!(
             "linkedin_post: cannot extract URN from '{url}' (expected /feed/update/urn:li:... or /posts/{{slug}}-{{id}})"


@@ -46,8 +46,8 @@ pub mod youtube_video;
 use serde::Serialize;
 use serde_json::Value;
 
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 /// Public catalog entry for `/v1/extractors`. Stable shape — clients
 /// rely on `name` to pick the right `/v1/scrape/{name}` route.
@@ -102,7 +102,7 @@ pub fn list() -> Vec<ExtractorInfo> {
 /// one that claims the URL. Used by `/v1/scrape` when the caller doesn't
 /// pick a vertical explicitly.
 pub async fn dispatch_by_url(
-    client: &FetchClient,
+    client: &dyn Fetcher,
     url: &str,
 ) -> Option<Result<(&'static str, Value), FetchError>> {
     if reddit::matches(url) {
@@ -281,7 +281,7 @@ pub async fn dispatch_by_url(
 /// users get a clear "wrong route" error instead of a confusing parse
 /// failure deep in the extractor.
 pub async fn dispatch_by_name(
-    client: &FetchClient,
+    client: &dyn Fetcher,
     name: &str,
     url: &str,
 ) -> Result<Value, ExtractorDispatchError> {


@@ -13,8 +13,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "npm",
@@ -31,7 +31,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/package/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let name = parse_name(url)
         .ok_or_else(|| FetchError::Build(format!("npm: cannot parse name from '{url}'")))?;
@@ -94,7 +94,7 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchErro
     }))
 }
 
-async fn fetch_weekly_downloads(client: &FetchClient, name: &str) -> Result<i64, FetchError> {
+async fn fetch_weekly_downloads(client: &dyn Fetcher, name: &str) -> Result<i64, FetchError> {
     let url = format!(
         "https://api.npmjs.org/downloads/point/last-week/{}",
         urlencode_segment(name)


@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "pypi",
@@ -30,7 +30,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/project/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (name, version) = parse_project(url).ok_or_else(|| {
         FetchError::Build(format!("pypi: cannot parse package name from '{url}'"))
     })?;


@@ -9,8 +9,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "reddit",
@@ -32,7 +32,7 @@ pub fn matches(url: &str) -> bool {
     is_reddit_host && url.contains("/comments/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let json_url = build_json_url(url);
     let resp = client.fetch(&json_url).await?;
     if resp.status != 200 {


@@ -15,8 +15,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "shopify_collection",
@@ -49,7 +49,7 @@ const NON_SHOPIFY_HOSTS: &[&str] = &[
     "github.com",
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let (coll_meta_url, coll_products_url) = build_json_urls(url);
 
     // Step 1: collection metadata. Shopify returns 200 on missing


@@ -21,8 +21,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "shopify_product",
@@ -65,7 +65,7 @@ const NON_SHOPIFY_HOSTS: &[&str] = &[
     "github.com", // /products is a marketing page
 ];
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let json_url = build_json_url(url);
     let resp = client.fetch(&json_url).await?;
     if resp.status == 404 {


@@ -13,8 +13,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "stackoverflow",
@@ -31,7 +31,7 @@ pub fn matches(url: &str) -> bool {
     parse_question_id(url).is_some()
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let id = parse_question_id(url).ok_or_else(|| {
         FetchError::Build(format!(
             "stackoverflow: cannot parse question id from '{url}'"


@@ -28,9 +28,9 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::cloud::{self, CloudError};
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "substack_post",
@@ -49,7 +49,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/p/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let slug = parse_slug(url).ok_or_else(|| {
         FetchError::Build(format!("substack_post: cannot parse slug from '{url}'"))
     })?;
@@ -149,7 +149,7 @@ fn build_api_payload(url: &str, api_url: &str, slug: &str, p: Post) -> Value {
 // ---------------------------------------------------------------------------
 
 async fn html_fallback(
-    client: &FetchClient,
+    client: &dyn Fetcher,
     url: &str,
     api_url: &str,
     slug: &str,


@@ -32,9 +32,9 @@ use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::cloud::{self, CloudError};
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "trustpilot_reviews",
@@ -51,7 +51,7 @@ pub fn matches(url: &str) -> bool {
     url.contains("/review/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
         .await
         .map_err(cloud_to_fetch_err)?;


@@ -15,8 +15,8 @@ use serde::Deserialize;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "woocommerce_product",
@@ -42,7 +42,7 @@ pub fn matches(url: &str) -> bool {
         || url.contains("/produit/") // common fr locale
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let slug = parse_slug(url).ok_or_else(|| {
         FetchError::Build(format!(
             "woocommerce_product: cannot parse slug from '{url}'"


@@ -25,8 +25,8 @@ use regex::Regex;
 use serde_json::{Value, json};
 
 use super::ExtractorInfo;
-use crate::client::FetchClient;
 use crate::error::FetchError;
+use crate::fetcher::Fetcher;
 
 pub const INFO: ExtractorInfo = ExtractorInfo {
     name: "youtube_video",
@@ -45,7 +45,7 @@ pub fn matches(url: &str) -> bool {
         || url.contains("youtube-nocookie.com/embed/")
 }
 
-pub async fn extract(client: &FetchClient, url: &str) -> Result<Value, FetchError> {
+pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
     let video_id = parse_video_id(url).ok_or_else(|| {
         FetchError::Build(format!("youtube_video: cannot parse video id from '{url}'"))
     })?;


@@ -0,0 +1,118 @@
//! Pluggable fetcher abstraction for vertical extractors.
//!
//! Extractors call the network through this trait instead of hard-
//! coding [`FetchClient`]. The OSS CLI / MCP / self-hosted server all
//! pass `&FetchClient` (wreq-backed BoringSSL). The production API
//! server, which must not use in-process TLS fingerprinting, provides
//! its own implementation that routes through the Go tls-sidecar.
//!
//! Both paths expose the same [`FetchResult`] shape and the same
//! optional cloud-escalation client, so extractor logic stays
//! identical across environments.
//!
//! ## Choosing an implementation
//!
//! - CLI, MCP, self-hosted `webclaw-server`: build a [`FetchClient`]
//! with [`FetchClient::with_cloud`] to attach cloud fallback, pass
//! it to extractors as `&client`.
//! - `api.webclaw.io` production server: build a `TlsSidecarFetcher`
//! (in `server/src/engine/`) that delegates to `engine::tls_client`
//! and wraps it in `Arc<dyn Fetcher>` for handler injection.
//!
//! ## Why a trait and not a free function
//!
//! Extractors need state beyond a single fetch: the cloud client for
//! antibot escalation, and in the future per-user proxy pools, tenant
//! headers, circuit breakers. A trait keeps that state encapsulated
//! behind the fetch interface instead of threading it through every
//! extractor signature.
use async_trait::async_trait;
use crate::client::FetchResult;
use crate::cloud::CloudClient;
use crate::error::FetchError;
/// HTTP fetch surface used by vertical extractors.
///
/// Implementations must be `Send + Sync` because extractor dispatchers
/// run them inside tokio tasks, potentially across many requests.
#[async_trait]
pub trait Fetcher: Send + Sync {
/// Fetch a URL and return the raw response body + metadata. The
/// body is in `FetchResult::html` regardless of the actual content
/// type — JSON API endpoints put JSON there, HTML pages put HTML.
/// Extractors branch on response status and body shape.
async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError>;
/// Fetch with additional request headers. Needed for endpoints
/// that authenticate via a specific header (Instagram's
/// `x-ig-app-id`, for example). The default implementation routes
/// to [`Self::fetch`], so implementers without header support stay
/// functional; the extra headers are simply dropped in that case.
async fn fetch_with_headers(
&self,
url: &str,
_headers: &[(&str, &str)],
) -> Result<FetchResult, FetchError> {
self.fetch(url).await
}
/// Optional cloud-escalation client for antibot bypass. Returning
/// `Some` tells extractors they can call into the hosted API when
/// local fetch hits a challenge page. Returning `None` makes
/// cloud-gated extractors emit [`CloudError::NotConfigured`] with
/// an actionable signup link.
///
/// The default implementation returns `None` because not every
/// deployment wants cloud fallback (self-hosts that don't have a
/// webclaw.io subscription, for instance).
///
/// [`CloudError::NotConfigured`]: crate::cloud::CloudError::NotConfigured
fn cloud(&self) -> Option<&CloudClient> {
None
}
}
// ---------------------------------------------------------------------------
// Blanket impls: make `&T` and `Arc<T>` behave like the wrapped `T`.
// ---------------------------------------------------------------------------
#[async_trait]
impl<T: Fetcher + ?Sized> Fetcher for &T {
async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
(**self).fetch(url).await
}
async fn fetch_with_headers(
&self,
url: &str,
headers: &[(&str, &str)],
) -> Result<FetchResult, FetchError> {
(**self).fetch_with_headers(url, headers).await
}
fn cloud(&self) -> Option<&CloudClient> {
(**self).cloud()
}
}
#[async_trait]
impl<T: Fetcher + ?Sized> Fetcher for std::sync::Arc<T> {
async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
(**self).fetch(url).await
}
async fn fetch_with_headers(
&self,
url: &str,
headers: &[(&str, &str)],
) -> Result<FetchResult, FetchError> {
(**self).fetch_with_headers(url, headers).await
}
fn cloud(&self) -> Option<&CloudClient> {
(**self).cloud()
}
}


@@ -8,7 +8,9 @@ pub mod crawler;
pub mod document;
pub mod error;
pub mod extractors;
+pub mod fetcher;
pub mod linkedin;
+pub mod locale;
pub mod proxy;
pub mod reddit;
pub mod sitemap;
@@ -18,7 +20,9 @@ pub use browser::BrowserProfile;
pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult};
pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};
pub use error::FetchError;
+pub use fetcher::Fetcher;
pub use http::HeaderMap;
+pub use locale::{accept_language_for_tld, accept_language_for_url};
pub use proxy::{parse_proxy_file, parse_proxy_line};
pub use sitemap::SitemapEntry;
pub use webclaw_pdf::PdfMode;


@@ -0,0 +1,77 @@
//! Derive an `Accept-Language` header from a URL.
//!
//! DataDome-class bot detection on country-specific sites (e.g. immobiliare.it,
//! leboncoin.fr) does a geo-vs-locale sanity check: residential IP in the
//! target country + a browser UA but the wrong `Accept-Language` is a bot
//! signal. Matching the site's expected locale gets us through.
//!
//! Default for unmapped TLDs is `en-US,en;q=0.9` — the global fallback.
/// Best-effort `Accept-Language` header value for the given URL's TLD.
/// Returns `None` if the URL cannot be parsed.
pub fn accept_language_for_url(url: &str) -> Option<&'static str> {
let host = url::Url::parse(url).ok()?.host_str()?.to_ascii_lowercase();
let tld = host.rsplit('.').next()?;
Some(accept_language_for_tld(tld))
}
/// Map a bare TLD like `it`, `fr`, `de` to a plausible `Accept-Language`.
/// Unknown TLDs fall back to US English.
pub fn accept_language_for_tld(tld: &str) -> &'static str {
match tld {
"it" => "it-IT,it;q=0.9",
"fr" => "fr-FR,fr;q=0.9",
"de" | "at" => "de-DE,de;q=0.9",
"es" => "es-ES,es;q=0.9",
"pt" => "pt-PT,pt;q=0.9",
"nl" => "nl-NL,nl;q=0.9",
"pl" => "pl-PL,pl;q=0.9",
"se" => "sv-SE,sv;q=0.9",
"no" => "nb-NO,nb;q=0.9",
"dk" => "da-DK,da;q=0.9",
"fi" => "fi-FI,fi;q=0.9",
"cz" => "cs-CZ,cs;q=0.9",
"ro" => "ro-RO,ro;q=0.9",
"gr" => "el-GR,el;q=0.9",
"tr" => "tr-TR,tr;q=0.9",
"ru" => "ru-RU,ru;q=0.9",
"jp" => "ja-JP,ja;q=0.9",
"kr" => "ko-KR,ko;q=0.9",
"cn" => "zh-CN,zh;q=0.9",
"tw" | "hk" => "zh-TW,zh;q=0.9",
"br" => "pt-BR,pt;q=0.9",
"mx" | "ar" | "co" | "cl" | "pe" => "es-ES,es;q=0.9",
"uk" | "ie" => "en-GB,en;q=0.9",
_ => "en-US,en;q=0.9",
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn tld_dispatch() {
assert_eq!(
accept_language_for_url("https://www.immobiliare.it/annunci/1"),
Some("it-IT,it;q=0.9")
);
assert_eq!(
accept_language_for_url("https://www.leboncoin.fr/"),
Some("fr-FR,fr;q=0.9")
);
assert_eq!(
accept_language_for_url("https://www.amazon.co.uk/"),
Some("en-GB,en;q=0.9")
);
assert_eq!(
accept_language_for_url("https://example.com/"),
Some("en-US,en;q=0.9")
);
}
#[test]
fn bad_url_returns_none() {
assert_eq!(accept_language_for_url("not-a-url"), None);
}
}

View file

@@ -7,10 +7,15 @@
use std::time::Duration;
+use std::borrow::Cow;
use wreq::http2::{
Http2Options, PseudoId, PseudoOrder, SettingId, SettingsOrder, StreamDependency, StreamId,
};
-use wreq::tls::{AlpsProtocol, CertificateCompressionAlgorithm, TlsOptions, TlsVersion};
+use wreq::tls::{
+AlpnProtocol, AlpsProtocol, CertificateCompressionAlgorithm, ExtensionType, TlsOptions,
+TlsVersion,
+};
use wreq::{Client, Emulation};
use crate::browser::BrowserVariant;
@@ -43,6 +48,55 @@ const SAFARI_SIGALGS: &str = "ecdsa_secp256r1_sha256:rsa_pss_rsae_sha256:rsa_pkc
/// Safari curves.
const SAFARI_CURVES: &str = "X25519:P-256:P-384:P-521";
/// Safari iOS 26 TLS extension order, matching bogdanfinn's
/// `safari_ios_26_0` wire format. GREASE slots are omitted; wreq
/// inserts them itself. Diverges from wreq-util's default SafariIos26
/// extension order, which DataDome's immobiliare.it ruleset flags.
fn safari_ios_extensions() -> Vec<ExtensionType> {
vec![
ExtensionType::CERTIFICATE_TIMESTAMP,
ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION,
ExtensionType::SERVER_NAME,
ExtensionType::CERT_COMPRESSION,
ExtensionType::KEY_SHARE,
ExtensionType::SUPPORTED_VERSIONS,
ExtensionType::PSK_KEY_EXCHANGE_MODES,
ExtensionType::SUPPORTED_GROUPS,
ExtensionType::RENEGOTIATE,
ExtensionType::SIGNATURE_ALGORITHMS,
ExtensionType::STATUS_REQUEST,
ExtensionType::EC_POINT_FORMATS,
ExtensionType::EXTENDED_MASTER_SECRET,
]
}
/// Chrome 133 TLS extension order, matching bogdanfinn's stable JA3
/// (`43067709b025da334de1279a120f8e14`). Real Chrome permutes extensions
/// per handshake, but indeed.com's WAF allowlists this specific wire order
/// and rejects permuted ones. GREASE slots are inserted by wreq.
///
/// JA3 extension field from peet.ws: 18-5-35-51-10-45-11-27-17613-43-13-0-16-65037-65281-23
fn chrome_extensions() -> Vec<ExtensionType> {
vec![
ExtensionType::CERTIFICATE_TIMESTAMP, // 18
ExtensionType::STATUS_REQUEST, // 5
ExtensionType::SESSION_TICKET, // 35
ExtensionType::KEY_SHARE, // 51
ExtensionType::SUPPORTED_GROUPS, // 10
ExtensionType::PSK_KEY_EXCHANGE_MODES, // 45
ExtensionType::EC_POINT_FORMATS, // 11
ExtensionType::CERT_COMPRESSION, // 27
ExtensionType::APPLICATION_SETTINGS_NEW, // 17613 (new codepoint, matches alps_use_new_codepoint)
ExtensionType::SUPPORTED_VERSIONS, // 43
ExtensionType::SIGNATURE_ALGORITHMS, // 13
ExtensionType::SERVER_NAME, // 0
ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION, // 16
ExtensionType::ENCRYPTED_CLIENT_HELLO, // 65037
ExtensionType::RENEGOTIATE, // 65281
ExtensionType::EXTENDED_MASTER_SECRET, // 23
]
}
// --- Chrome HTTP headers in correct wire order ---
const CHROME_HEADERS: &[(&str, &str)] = &[

@@ -130,6 +184,26 @@ const SAFARI_HEADERS: &[(&str, &str)] = &[
("sec-fetch-dest", "document"),
];
/// Safari iOS 26 headers, in the wire order real Safari emits. Critically:
/// NO `sec-fetch-*`, NO `priority: u=0, i` (both Chromium-only leaks), but
/// `upgrade-insecure-requests: 1` is present. `accept-encoding` does not
/// include zstd (Safari can't decode it). Verified against bogdanfinn on
/// 2026-04-22: this header set is what DataDome's immobiliare ruleset
/// expects for a real iPhone.
const SAFARI_IOS_HEADERS: &[(&str, &str)] = &[
(
"accept",
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
),
("accept-language", "en-US,en;q=0.9"),
("accept-encoding", "gzip, deflate, br"),
(
"user-agent",
"Mozilla/5.0 (iPhone; CPU iPhone OS 26_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Mobile/15E148 Safari/604.1",
),
("upgrade-insecure-requests", "1"),
];
const EDGE_HEADERS: &[(&str, &str)] = &[
(
"sec-ch-ua",

@@ -156,6 +230,9 @@ const EDGE_HEADERS: &[(&str, &str)] = &[
];
fn chrome_tls() -> TlsOptions {
// permute_extensions is off so the explicit extension_permutation sticks.
// Real Chrome permutes, but indeed.com's WAF allowlists bogdanfinn's
// fixed order, so matching that gets us through.
TlsOptions::builder()
.cipher_list(CHROME_CIPHERS)
.sigalgs_list(CHROME_SIGALGS)

@@ -163,12 +240,18 @@ fn chrome_tls() -> TlsOptions {
.min_tls_version(TlsVersion::TLS_1_2)
.max_tls_version(TlsVersion::TLS_1_3)
.grease_enabled(true)
-.permute_extensions(true)
+.permute_extensions(false)
+.extension_permutation(chrome_extensions())
.enable_ech_grease(true)
.pre_shared_key(true)
.enable_ocsp_stapling(true)
.enable_signed_cert_timestamps(true)
-.alps_protocols([AlpsProtocol::HTTP2])
+.alpn_protocols([
+AlpnProtocol::HTTP3,
+AlpnProtocol::HTTP2,
+AlpnProtocol::HTTP1,
+])
+.alps_protocols([AlpsProtocol::HTTP3, AlpsProtocol::HTTP2])
.alps_use_new_codepoint(true)
.aes_hw_override(true)
.certificate_compression_algorithms(&[CertificateCompressionAlgorithm::BROTLI])

@@ -212,25 +295,70 @@ fn safari_tls() -> TlsOptions {
.build()
}
/// Safari iOS 26 emulation — composed on top of `wreq_util::Emulation::SafariIos26`
/// with four targeted overrides. We don't hand-roll this one like Chrome/Firefox
/// because the wire-level defaults from wreq-util are already correct for ciphers,
/// sigalgs, curves, and GREASE — the four things wreq-util gets *wrong* for
/// DataDome compatibility are overridden here:
///
/// 1. TLS extension order: match bogdanfinn `safari_ios_26_0` exactly (JA3
/// ends up `8d909525bd5bbb79f133d11cc05159fe`).
/// 2. HTTP/2 HEADERS priority flag: weight=256, exclusive=1, depends_on=0.
/// wreq-util omits this frame; real Safari and bogdanfinn include it.
/// This flip is the thing DataDome actually reads — the akamai_fingerprint
/// hash changes from `c52879e43202aeb92740be6e8c86ea96` to
/// `d1294410a06522e37a5c5e3f0a45a705`, which is the winning signature.
/// 3. Headers: strip wreq-util's Chromium defaults (`sec-fetch-*`,
/// `priority: u=0, i`, zstd), replace with the real iOS 26 set.
/// 4. `accept-language` preserved from config.extra_headers for locale.
fn safari_ios_emulation() -> wreq::Emulation {
use wreq::EmulationFactory;
let mut em = wreq_util::Emulation::SafariIos26.emulation();
if let Some(tls) = em.tls_options_mut().as_mut() {
tls.extension_permutation = Some(Cow::Owned(safari_ios_extensions()));
}
// Only override the priority flag — keep wreq-util's SETTINGS, WINDOW_UPDATE,
// and pseudo-order intact. Replacing the whole Http2Options resets SETTINGS
// to defaults, which sends only INITIAL_WINDOW_SIZE and fails DataDome.
if let Some(h2) = em.http2_options_mut().as_mut() {
h2.headers_stream_dependency = Some(StreamDependency::new(StreamId::zero(), 255, true));
}
let hm = em.headers_mut();
hm.clear();
for (k, v) in SAFARI_IOS_HEADERS {
if let (Ok(n), Ok(val)) = (
http::header::HeaderName::from_bytes(k.as_bytes()),
http::header::HeaderValue::from_str(v),
) {
hm.append(n, val);
}
}
em
}
fn chrome_h2() -> Http2Options {
+// SETTINGS frame matches bogdanfinn `chrome_133`: HEADER_TABLE_SIZE,
+// ENABLE_PUSH=0, INITIAL_WINDOW_SIZE, MAX_HEADER_LIST_SIZE. No
+// MAX_CONCURRENT_STREAMS — real Chrome 133 and bogdanfinn both omit it,
+// and indeed.com's WAF reads this as a bot signal when present. Priority
+// weight 256 (255 on the wire) matches bogdanfinn's HEADERS frame.
Http2Options::builder()
.initial_window_size(6_291_456)
.initial_connection_window_size(15_728_640)
.max_header_list_size(262_144)
.header_table_size(65_536)
-.max_concurrent_streams(1000u32)
.enable_push(false)
.settings_order(
SettingsOrder::builder()
.extend([
SettingId::HeaderTableSize,
SettingId::EnablePush,
-SettingId::MaxConcurrentStreams,
SettingId::InitialWindowSize,
+SettingId::MaxFrameSize,
SettingId::MaxHeaderListSize,
+SettingId::EnableConnectProtocol,
+SettingId::NoRfc7540Priorities,
])
.build(),
)

@@ -244,7 +372,7 @@ fn chrome_h2() -> Http2Options {
])
.build(),
)
-.headers_stream_dependency(StreamDependency::new(StreamId::zero(), 219, true))
+.headers_stream_dependency(StreamDependency::new(StreamId::zero(), 255, true))
.build()
}
@@ -328,32 +456,38 @@ pub fn build_client(
extra_headers: &std::collections::HashMap<String, String>,
proxy: Option<&str>,
) -> Result<Client, FetchError> {
-let (tls, h2, headers) = match variant {
-BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
-BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
-BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
-BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
-BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
-};
+// SafariIos26 builds its Emulation on top of wreq-util's base instead
+// of from scratch. See `safari_ios_emulation` for why.
+let mut emulation = match variant {
+BrowserVariant::SafariIos26 => safari_ios_emulation(),
+other => {
+let (tls, h2, headers) = match other {
+BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
+BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
+BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
+BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
+BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+BrowserVariant::SafariIos26 => unreachable!("handled above"),
+};
+Emulation::builder()
+.tls_options(tls)
+.http2_options(h2)
+.headers(build_headers(headers))
+.build()
+}
};
-let mut header_map = build_headers(headers);
-// Append extra headers after profile defaults
+// Append extra headers after profile defaults.
+let hm = emulation.headers_mut();
for (k, v) in extra_headers {
if let (Ok(n), Ok(val)) = (
http::header::HeaderName::from_bytes(k.as_bytes()),
http::header::HeaderValue::from_str(v),
) {
-header_map.insert(n, val);
+hm.insert(n, val);
}
}
-let emulation = Emulation::builder()
-.tls_options(tls)
-.http2_options(h2)
-.headers(header_map)
-.build();
let mut builder = Client::builder()
.emulation(emulation)
.redirect(wreq::redirect::Policy::limited(10))


@@ -718,6 +718,55 @@ impl WebclawMcp {
Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
}
}
/// List every vertical extractor the server knows about. Returns a
/// JSON array of `{name, label, description, url_patterns}` entries.
/// Call this to discover what verticals are available before using
/// `vertical_scrape`.
#[tool]
async fn list_extractors(
&self,
Parameters(_params): Parameters<ListExtractorsParams>,
) -> Result<String, String> {
let catalog = webclaw_fetch::extractors::list();
serde_json::to_string_pretty(&catalog)
.map_err(|e| format!("failed to serialise extractor catalog: {e}"))
}
/// Run a vertical extractor by name and return typed JSON specific
/// to the target site (title, price, rating, author, etc.), not
/// generic markdown. Use `list_extractors` to discover available
/// names. Example names: `reddit`, `github_repo`, `trustpilot_reviews`,
/// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`.
///
/// Antibot-gated verticals (amazon_product, ebay_listing,
/// etsy_listing, trustpilot_reviews) will automatically escalate to
/// the webclaw cloud API when local fetch hits bot protection,
/// provided `WEBCLAW_API_KEY` is set.
#[tool]
async fn vertical_scrape(
&self,
Parameters(params): Parameters<VerticalParams>,
) -> Result<String, String> {
validate_url(&params.url)?;
// Use the cached Firefox client, not the default Chrome one.
// Reddit's `.json` endpoint rejects the wreq-Chrome TLS
// fingerprint with a 403 even from residential IPs (they
// ship a fingerprint blocklist that includes common
// browser-emulation libraries). The wreq-Firefox fingerprint
// still passes, and Firefox is equally fine for every other
// vertical in the catalog, so it's a strictly-safer default
// for `vertical_scrape` than the generic `scrape` tool's
// Chrome default. Matches the CLI `webclaw vertical`
// subcommand which already uses Firefox.
let client = self.firefox_or_build()?;
let data =
webclaw_fetch::extractors::dispatch_by_name(client.as_ref(), &params.name, &params.url)
.await
.map_err(|e| e.to_string())?;
serde_json::to_string_pretty(&data)
.map_err(|e| format!("failed to serialise extractor output: {e}"))
}
}

#[tool_handler]

@@ -727,7 +776,8 @@ impl ServerHandler for WebclawMcp {
.with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION")))
.with_instructions(String::from(
"Webclaw MCP server -- web content extraction for AI agents. \
-Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.",
+Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \
+list_extractors, vertical_scrape.",
))
}
}


@@ -103,3 +103,20 @@ pub struct SearchParams {
/// Number of results to return (default: 10)
pub num_results: Option<u32>,
}
/// Parameters for `vertical_scrape`: run a site-specific extractor by name.
#[derive(Debug, Deserialize, JsonSchema)]
pub struct VerticalParams {
/// Name of the vertical extractor. Call `list_extractors` to see all
/// available names. Examples: "reddit", "github_repo", "pypi",
/// "trustpilot_reviews", "youtube_video", "shopify_product".
pub name: String,
/// URL to extract. Must match the URL patterns the extractor claims;
/// otherwise the tool returns a clear "URL mismatch" error.
pub url: String,
}
/// `list_extractors` takes no arguments but we still need an empty struct
/// so rmcp can generate a schema and parse the (empty) JSON-RPC params.
#[derive(Debug, Deserialize, JsonSchema)]
pub struct ListExtractorsParams {}