Compare commits

...

10 commits
v0.5.1 ... main

Author SHA1 Message Date
Valerio
a5c3433372 fix(core+server): guard markdown pipe slice + detect trustpilot/reddit verify walls
2026-04-23 15:26:31 +02:00
Valerio
966981bc42 fix(fetch): send bot-identifying UA on reddit .json API to bypass browser UA block
2026-04-23 15:17:04 +02:00
Valerio
866fa88aa0 fix(fetch): reject HTML verification pages served at .json reddit URL 2026-04-23 15:06:35 +02:00
Valerio
b413d702b2 feat(fetch): add fetch_smart with Reddit + Akamai rescue paths, bump 0.5.6 2026-04-23 14:59:29 +02:00
Valerio
98a177dec4 feat(cli): expose safari-ios browser profile + bump to 0.5.5 2026-04-23 13:32:55 +02:00
Valerio
e1af2da509 docs(claude): drop sidecar references, mention ProductionFetcher 2026-04-23 13:25:23 +02:00
Valerio
2285c585b1 docs(changelog): simplify 0.5.4 entry 2026-04-23 13:01:02 +02:00
Valerio
b77767814a Bump to 0.5.4: SafariIos profile + Chrome fingerprint alignment + locale helper
- New BrowserProfile::SafariIos mapped to BrowserVariant::SafariIos26.
  Built on wreq_util::Emulation::SafariIos26 with 4 overrides (TLS
  extension order, HTTP/2 HEADERS priority, real Safari iOS 26 headers,
  gzip/deflate/br). Matches bogdanfinn safari_ios_26_0 JA3
  8d909525bd5bbb79f133d11cc05159fe exactly. Empirically passes 9/10 on
  immobiliare.it with a country-it residential proxy.

- BrowserProfile::Chrome aligned to bogdanfinn chrome_133: dropped
  MAX_CONCURRENT_STREAMS from H2 SETTINGS, priority weight 256,
  explicit extension_permutation, advertise h3 in ALPN and ALPS.
  JA3 43067709b025da334de1279a120f8e14, akamai_fp
  52d84b11737d980aef856699f885ca86. Fixes indeed.com and other
  Cloudflare-fronted sites.

- New locale module: accept_language_for_url / accept_language_for_tld.
  Maps TLDs to Accept-Language values; unknown TLDs default to en-US.
  DataDome geo-vs-locale cross-checks are now trivially satisfiable.

- wreq-util bumped from 2.2.6 to 3.0.0-rc.10 for Emulation::SafariIos26.
2026-04-23 12:58:24 +02:00
Valerio
4bf11d902f fix(mcp): vertical_scrape uses Firefox profile, not default Chrome
Reddit's .json API rejects the wreq-Chrome TLS fingerprint with a
403 even from residential IPs. Their block list includes known
browser-emulation library fingerprints. wreq-Firefox passes. The
CLI `vertical` subcommand already forced Firefox; MCP
`vertical_scrape` was still falling back to the long-lived
`self.fetch_client` which defaults to Chrome, so reddit failed
on MCP and nobody noticed because the earlier test runs all had
an API key set that masked the issue.

Switched vertical_scrape to reuse `self.firefox_or_build()`, which
gives us the cached Firefox client (the same pattern the scrape tool
uses when the caller requests `browser: firefox`). Firefox is
strictly safer than Chrome for every vertical in the catalog, so
making it the hard default for `vertical_scrape` is the right call.
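The `firefox_or_build()` pattern described here is essentially a get-or-init cache. A minimal std-only sketch (all names here are hypothetical stand-ins, not the real webclaw types):

```rust
use std::sync::OnceLock;

// Hypothetical stand-ins for the real client/server types.
struct Client {
    profile: &'static str,
}

struct McpServer {
    firefox: OnceLock<Client>,
}

impl McpServer {
    // Build the Firefox client on first use, then hand out the same
    // cached instance on every later call -- the firefox_or_build()
    // pattern the commit message refers to.
    fn firefox_or_build(&self) -> &Client {
        self.firefox.get_or_init(|| Client { profile: "firefox" })
    }
}
```

The point of `OnceLock` over a plain field is that construction is deferred until the first vertical actually needs Firefox, and concurrent callers race safely.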

Verified end-to-end from a clean shell with no WEBCLAW_API_KEY:
- MCP reddit: 679ms, post/author/6 comments correct
- MCP instagram_profile: 1157ms, 18471 followers

No change to the `scrape` tool -- it keeps the user-selectable
browser param.

Bumps version to 0.5.3.
2026-04-22 23:18:11 +02:00
Valerio
0daa2fec1a feat(cli+mcp): vertical extractor support (28 extractors discoverable + callable)
Wires the vertical extractor catalog into both the CLI and the MCP
server so users don't have to hit the HTTP API to invoke them. Same
semantics as `/v1/scrape/{vertical}` + `/v1/extractors`.

CLI (webclaw-cli):
- New subcommand `webclaw extractors` lists all 28 extractors with
  name, label, and sample URL. `--json` flag emits the full catalog
  as machine-readable JSON.
- New subcommand `webclaw vertical <name> <url>` runs a specific
  extractor and prints typed JSON. Pretty-printed by default; `--raw`
  for single-line. Exits 1 with a clear "URL does not match" error
  on mismatch.
- FetchClient built with Firefox profile + cloud fallback attached
  when WEBCLAW_API_KEY is set, so antibot-gated verticals escalate.

MCP (webclaw-mcp):
- New tool `list_extractors` (no args) returns the catalog as
  pretty-printed JSON for in-session discovery.
- New tool `vertical_scrape` takes `{name, url}` and returns typed
  JSON. Reuses the long-lived self.fetch_client.
- Tool count goes from 10 to 12. Server-info instruction string
  updated accordingly.

Tests: 215 passing, clippy clean. Manually surface-tested end-to-end:
the CLI prints real Reddit/GitHub/PyPI data; an MCP JSON-RPC session
returns the 28-entry catalog plus typed responses for pypi/requests and
rust-lang/rust in 200-400ms.

Version bumped to 0.5.2 (minor for API additions, backwards compatible).
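The dispatch shape described above (name lookup, then URL-match guard, then typed extraction) can be sketched standalone. This is a hypothetical reduction to two verticals; the real catalog and errors live in `webclaw_fetch::extractors`:

```rust
// Hypothetical sketch: resolve an extractor by name, then verify the URL
// matches its pattern before running. Mirrors the UnknownVertical /
// UrlMismatch errors the CLI surfaces with exit code 1.
#[derive(Debug, PartialEq)]
enum VerticalError {
    UnknownVertical(String),
    UrlMismatch { name: &'static str },
}

fn dispatch_by_name(name: &str, url: &str) -> Result<&'static str, VerticalError> {
    let (canonical, pattern) = match name {
        "reddit" => ("reddit", "reddit.com/r/"),
        "pypi" => ("pypi", "pypi.org/project/"),
        other => return Err(VerticalError::UnknownVertical(other.to_string())),
    };
    if !url.contains(pattern) {
        return Err(VerticalError::UrlMismatch { name: canonical });
    }
    // The real implementation fetches the page and returns typed JSON here.
    Ok(canonical)
}
```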
2026-04-22 21:41:15 +02:00
14 changed files with 574 additions and 45 deletions


@@ -3,6 +3,50 @@
All notable changes to webclaw are documented here.
Format follows [Keep a Changelog](https://keepachangelog.com/).
## [0.5.6] — 2026-04-23
### Added
- `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API with an identifiable bot `User-Agent`, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. Makes `/v1/scrape` on Reddit populate markdown again.
### Fixed
- Regression introduced in 0.5.4 where the production server's `/v1/scrape` bypassed the Reddit `.json` shortcut and Akamai cookie warmup that `fetch_and_extract` had been providing. Both helpers now live in `fetch_smart` and every caller path picks them up.
- Panic in the markdown converter (`markdown.rs:925`) on single-pipe `|` lines. A `[1..len-1]` slice on a 1-char input violated the `begin <= end` slice invariant. Now guarded.
---
## [0.5.5] — 2026-04-23
### Added
- `webclaw --browser safari-ios` on the CLI. Pairs with `--proxy` for DataDome-fronted sites that reject desktop profiles.
---
## [0.5.4] — 2026-04-23
### Added
- New `BrowserProfile::SafariIos` for Safari iOS 26 fingerprinting. Pairs with a country-matched residential proxy for sites that reject non-mobile profiles.
- `accept_language_for_url(url)` and `accept_language_for_tld(tld)` helpers. Returns a locale-appropriate `Accept-Language` based on the URL's TLD, with `en-US` as the fallback.
### Changed
- Chrome browser fingerprint refreshed for current Cloudflare bot management. Fixes 403 challenges on several e-commerce and jobs sites.
- Bumped `wreq-util` to `3.0.0-rc.10`.
---
## [0.5.2] — 2026-04-22
### Added
- **`webclaw vertical <name> <url>` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`. URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1.
- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick.
- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set.
### Changed
- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded 10). Also `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`.
---
## [0.5.1] — 2026-04-22
### Added


@@ -79,7 +79,7 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
- **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally.
- **No special RUSTFLAGS**: `.cargo/config.toml` is currently empty of build flags. Don't add any.
- **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep.
- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `TlsSidecarFetcher` that routes through the Go tls-sidecar instead of in-process wreq.
- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client.
- **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.
## Build & Test

Cargo.lock generated

@@ -2967,6 +2967,26 @@ dependencies = [
"pom",
]
[[package]]
name = "typed-builder"
version = "0.23.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "31aa81521b70f94402501d848ccc0ecaa8f93c8eb6999eb9747e72287757ffda"
dependencies = [
"typed-builder-macro",
]
[[package]]
name = "typed-builder-macro"
version = "0.23.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "076a02dc54dd46795c2e9c8282ed40bcfb1e22747e955de9389a1de28190fb26"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "typed-path"
version = "0.12.3"
@@ -3199,7 +3219,7 @@ dependencies = [
[[package]]
name = "webclaw-cli"
version = "0.5.1"
version = "0.5.6"
dependencies = [
"clap",
"dotenvy",
@@ -3220,7 +3240,7 @@ dependencies = [
[[package]]
name = "webclaw-core"
version = "0.5.1"
version = "0.5.6"
dependencies = [
"ego-tree",
"once_cell",
@@ -3238,7 +3258,7 @@ dependencies = [
[[package]]
name = "webclaw-fetch"
version = "0.5.1"
version = "0.5.6"
dependencies = [
"async-trait",
"bytes",
@@ -3258,12 +3278,13 @@ dependencies = [
"webclaw-core",
"webclaw-pdf",
"wreq",
"wreq-util",
"zip 2.4.2",
]
[[package]]
name = "webclaw-llm"
version = "0.5.1"
version = "0.5.6"
dependencies = [
"async-trait",
"reqwest",
@@ -3276,7 +3297,7 @@ dependencies = [
[[package]]
name = "webclaw-mcp"
version = "0.5.1"
version = "0.5.6"
dependencies = [
"dirs",
"dotenvy",
@@ -3296,7 +3317,7 @@ dependencies = [
[[package]]
name = "webclaw-pdf"
version = "0.5.1"
version = "0.5.6"
dependencies = [
"pdf-extract",
"thiserror",
@@ -3305,7 +3326,7 @@ dependencies = [
[[package]]
name = "webclaw-server"
version = "0.5.1"
version = "0.5.6"
dependencies = [
"anyhow",
"axum",
@@ -3709,6 +3730,16 @@ dependencies = [
"zstd",
]
[[package]]
name = "wreq-util"
version = "3.0.0-rc.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6c6bbe24d28beb9ceb58b514bd6a613c759d3b706f768b9d2950d5d35b543c04"
dependencies = [
"typed-builder",
"wreq",
]
[[package]]
name = "writeable"
version = "0.6.2"


@@ -3,7 +3,7 @@ resolver = "2"
members = ["crates/*"]
[workspace.package]
version = "0.5.1"
version = "0.5.6"
edition = "2024"
license = "AGPL-3.0"
repository = "https://github.com/0xMassi/webclaw"


@@ -308,6 +308,34 @@ enum Commands {
#[arg(long)]
facts: Option<PathBuf>,
},
/// List all vertical extractors in the catalog.
///
/// Each entry has a stable `name` (usable with `webclaw vertical <name>`),
/// a human-friendly label, a one-line description, and the URL
/// patterns it claims. The same data is served by `/v1/extractors`
/// when running the REST API.
Extractors {
/// Emit JSON instead of a human-friendly table.
#[arg(long)]
json: bool,
},
/// Run a vertical extractor by name. Returns typed JSON with fields
/// specific to the target site (title, price, author, rating, etc.)
/// rather than generic markdown.
///
/// Use `webclaw extractors` to see the full list. Example:
/// `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/`.
Vertical {
/// Vertical name (e.g. `reddit`, `github_repo`, `trustpilot_reviews`).
name: String,
/// URL to extract.
url: String,
/// Emit compact JSON (single line). Default is pretty-printed.
#[arg(long)]
raw: bool,
},
}
#[derive(Clone, ValueEnum)]
@@ -323,6 +351,9 @@ enum OutputFormat {
enum Browser {
Chrome,
Firefox,
/// Safari iOS 26. Pair with a country-matched residential proxy for sites
/// that reject non-mobile profiles.
SafariIos,
Random,
}
@@ -349,6 +380,7 @@ impl From<Browser> for BrowserProfile {
match b {
Browser::Chrome => BrowserProfile::Chrome,
Browser::Firefox => BrowserProfile::Firefox,
Browser::SafariIos => BrowserProfile::SafariIos,
Browser::Random => BrowserProfile::Random,
}
}
@@ -2288,6 +2320,83 @@ async fn main() {
}
return;
}
Commands::Extractors { json } => {
let entries = webclaw_fetch::extractors::list();
if *json {
// Serialize with serde_json. ExtractorInfo derives
// Serialize so this is a one-liner.
match serde_json::to_string_pretty(&entries) {
Ok(s) => println!("{s}"),
Err(e) => {
eprintln!("error: failed to serialise catalog: {e}");
process::exit(1);
}
}
} else {
// Human-friendly table: NAME + LABEL + one URL
// pattern sample. Keeps the output scannable on a
// narrow terminal.
println!("{} vertical extractors available:\n", entries.len());
let name_w = entries.iter().map(|e| e.name.len()).max().unwrap_or(0);
let label_w = entries.iter().map(|e| e.label.len()).max().unwrap_or(0);
for e in &entries {
let pattern_sample = e.url_patterns.first().copied().unwrap_or("");
println!(
" {:<nw$} {:<lw$} {}",
e.name,
e.label,
pattern_sample,
nw = name_w,
lw = label_w,
);
}
println!("\nRun one: webclaw vertical <name> <url>");
}
return;
}
Commands::Vertical { name, url, raw } => {
// Build a FetchClient with cloud fallback attached when
// WEBCLAW_API_KEY is set. Antibot-gated verticals
// (amazon, ebay, etsy, trustpilot) need this to escalate
// on bot protection.
let fetch_cfg = webclaw_fetch::FetchConfig {
browser: webclaw_fetch::BrowserProfile::Firefox,
..webclaw_fetch::FetchConfig::default()
};
let mut client = match webclaw_fetch::FetchClient::new(fetch_cfg) {
Ok(c) => c,
Err(e) => {
eprintln!("error: failed to build fetch client: {e}");
process::exit(1);
}
};
if let Some(cloud) = webclaw_fetch::cloud::CloudClient::from_env() {
client = client.with_cloud(cloud);
}
match webclaw_fetch::extractors::dispatch_by_name(&client, name, url).await {
Ok(data) => {
let rendered = if *raw {
serde_json::to_string(&data)
} else {
serde_json::to_string_pretty(&data)
};
match rendered {
Ok(s) => println!("{s}"),
Err(e) => {
eprintln!("error: JSON encode failed: {e}");
process::exit(1);
}
}
}
Err(e) => {
// UrlMismatch / UnknownVertical / Fetch all get
// Display impls with actionable messages.
eprintln!("error: {e}");
process::exit(1);
}
}
return;
}
}
}


@@ -920,8 +920,10 @@ fn strip_markdown(md: &str) -> String {
continue;
}
// Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs
if trimmed.starts_with('|') && trimmed.ends_with('|') {
// Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs.
// Require at least 2 chars so the slice `[1..len-1]` stays in bounds on single-pipe rows
// (which aren't real tables anyway); a lone `|` previously panicked on the `begin <= end` check.
if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
let inner = &trimmed[1..trimmed.len() - 1];
let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
lines.push(cells.join("\t"));
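Extracted into a standalone function, the guarded branch looks like this (a sketch; `table_row_to_tabs` is a hypothetical name, not the converter's actual structure):

```rust
// Sketch of the guard: only treat the line as a table row when it has at
// least 2 chars, so the [1..len-1] slice is always in bounds. A lone "|"
// previously panicked on the slice's begin <= end check.
fn table_row_to_tabs(trimmed: &str) -> Option<String> {
    if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
        let inner = &trimmed[1..trimmed.len() - 1];
        let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
        return Some(cells.join("\t"));
    }
    None
}
```

Because `starts_with('|')` and `ends_with('|')` pin single-byte chars at both ends, the byte indices `1` and `len - 1` are guaranteed char boundaries even for non-ASCII cell content.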


@@ -14,6 +14,7 @@ tracing = { workspace = true }
tokio = { workspace = true }
async-trait = "0.1"
wreq = { version = "6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] }
wreq-util = "3.0.0-rc.10"
http = "1"
bytes = "1"
url = "2"


@@ -7,6 +7,10 @@ pub enum BrowserProfile {
#[default]
Chrome,
Firefox,
/// Safari iOS 26 (iPhone). The one profile proven to defeat
/// DataDome's immobiliare.it / idealista.it / target.com-class
/// rules when paired with a country-scoped residential proxy.
SafariIos,
/// Randomly pick from all available profiles on each request.
Random,
}
@@ -18,6 +22,7 @@ pub enum BrowserVariant {
ChromeMacos,
Firefox,
Safari,
SafariIos26,
Edge,
}


@@ -261,10 +261,65 @@ impl FetchClient {
self.cloud.as_deref()
}
/// Fetch a URL with per-site rescue paths: Reddit URLs redirect to the
/// `.json` API, and Akamai-style challenge responses trigger a homepage
/// cookie warmup and a retry. Returns the same `FetchResult` shape as
/// [`Self::fetch`] so every caller (CLI, MCP, OSS server, production
/// server) benefits without shape churn.
///
/// This is the method most callers want. Use plain [`Self::fetch`] only
/// when you need literal no-rescue behavior (e.g. inside the rescue
/// logic itself to avoid recursion).
pub async fn fetch_smart(&self, url: &str) -> Result<FetchResult, FetchError> {
// Reddit: the HTML page shows a verification interstitial for most
// client IPs, but appending `.json` returns the post + comment tree
// publicly. `parse_reddit_json` in downstream code knows how to read
// the result; here we just do the URL swap at the fetch layer.
if crate::reddit::is_reddit_url(url) && !url.ends_with(".json") {
let json_url = crate::reddit::json_url(url);
// Reddit's public .json API serves JSON to identifiable bot
// User-Agents and blocks browser UAs with a verification wall.
// Override our Chrome-profile UA for this specific call.
let ua = concat!(
"Webclaw/",
env!("CARGO_PKG_VERSION"),
" (+https://webclaw.io)"
);
if let Ok(resp) = self
.fetch_with_headers(&json_url, &[("user-agent", ua)])
.await
&& resp.status == 200
{
let first = resp.html.trim_start().as_bytes().first().copied();
if matches!(first, Some(b'{') | Some(b'[')) {
return Ok(resp);
}
}
// If the .json fetch failed or returned HTML, fall through.
}
let resp = self.fetch(url).await?;
// Akamai / bazadebezolkohpepadr challenge: visit the homepage to
// collect warmup cookies (_abck, bm_sz, etc.), then retry.
if is_challenge_html(&resp.html)
&& let Some(homepage) = extract_homepage(url)
{
debug!("challenge detected, warming cookies via {homepage}");
let _ = self.fetch(&homepage).await;
if let Ok(retry) = self.fetch(url).await {
return Ok(retry);
}
}
Ok(resp)
}
/// Fetch a URL and return the raw HTML + response metadata.
///
/// Automatically retries on transient failures (network errors, 5xx, 429)
/// with exponential backoff: 0s, 1s (2 attempts total).
/// with exponential backoff: 0s, 1s (2 attempts total). No per-site
/// rescue logic; use [`Self::fetch_smart`] for that.
#[instrument(skip(self), fields(url = %url))]
pub async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
let delays = [Duration::ZERO, Duration::from_secs(1)];
@@ -635,6 +690,7 @@ fn collect_variants(profile: &BrowserProfile) -> Vec<BrowserVariant> {
BrowserProfile::Random => browser::all_variants(),
BrowserProfile::Chrome => vec![browser::latest_chrome()],
BrowserProfile::Firefox => vec![browser::latest_firefox()],
BrowserProfile::SafariIos => vec![BrowserVariant::SafariIos26],
}
}
@@ -712,22 +768,23 @@ fn is_pdf_content_type(headers: &http::HeaderMap) -> bool {
/// Detect if a response looks like a bot protection challenge page.
fn is_challenge_response(response: &Response) -> bool {
let len = response.body().len();
is_challenge_html(response.text().as_ref())
}
/// Same as `is_challenge_response`, operating on a body string directly
/// so callers holding a `FetchResult` can reuse the heuristic.
fn is_challenge_html(html: &str) -> bool {
let len = html.len();
if len > 15_000 || len == 0 {
return false;
}
let text = response.text();
let lower = text.to_lowercase();
let lower = html.to_lowercase();
if lower.contains("<title>challenge page</title>") {
return true;
}
if lower.contains("bazadebezolkohpepadr") && len < 5_000 {
return true;
}
false
}
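The reddit half of the rescue path leans on two small helpers from `crate::reddit`. A hypothetical reimplementation (sketch only, not the crate's actual code):

```rust
// Hypothetical sketch of the helpers fetch_smart calls: recognize reddit
// post URLs and rewrite them to the public .json endpoint, preserving any
// query string across the swap.
fn is_reddit_url(url: &str) -> bool {
    ["://www.reddit.com/", "://old.reddit.com/", "://reddit.com/"]
        .iter()
        .any(|h| url.contains(h))
}

fn json_url(url: &str) -> String {
    // Trim a trailing slash so ".json" attaches to the path segment, and
    // re-append the query string if one was present.
    match url.split_once('?') {
        Some((path, query)) => format!("{}.json?{}", path.trim_end_matches('/'), query),
        None => format!("{}.json", url.trim_end_matches('/')),
    }
}
```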


@@ -10,6 +10,7 @@ pub mod error;
pub mod extractors;
pub mod fetcher;
pub mod linkedin;
pub mod locale;
pub mod proxy;
pub mod reddit;
pub mod sitemap;
@@ -21,6 +22,7 @@ pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};
pub use error::FetchError;
pub use fetcher::Fetcher;
pub use http::HeaderMap;
pub use locale::{accept_language_for_tld, accept_language_for_url};
pub use proxy::{parse_proxy_file, parse_proxy_line};
pub use sitemap::SitemapEntry;
pub use webclaw_pdf::PdfMode;


@@ -0,0 +1,77 @@
//! Derive an `Accept-Language` header from a URL.
//!
//! DataDome-class bot detection on country-specific sites (e.g. immobiliare.it,
//! leboncoin.fr) does a geo-vs-locale sanity check: a residential IP in the
//! target country plus a browser UA but the wrong `Accept-Language` is a bot
//! signal. Matching the site's expected locale gets us through.
//!
//! Default for unmapped TLDs is `en-US,en;q=0.9` — the global fallback.
/// Best-effort `Accept-Language` header value for the given URL's TLD.
/// Returns `None` if the URL cannot be parsed.
pub fn accept_language_for_url(url: &str) -> Option<&'static str> {
let host = url::Url::parse(url).ok()?.host_str()?.to_ascii_lowercase();
let tld = host.rsplit('.').next()?;
Some(accept_language_for_tld(tld))
}
/// Map a bare TLD like `it`, `fr`, `de` to a plausible `Accept-Language`.
/// Unknown TLDs fall back to US English.
pub fn accept_language_for_tld(tld: &str) -> &'static str {
match tld {
"it" => "it-IT,it;q=0.9",
"fr" => "fr-FR,fr;q=0.9",
"de" | "at" => "de-DE,de;q=0.9",
"es" => "es-ES,es;q=0.9",
"pt" => "pt-PT,pt;q=0.9",
"nl" => "nl-NL,nl;q=0.9",
"pl" => "pl-PL,pl;q=0.9",
"se" => "sv-SE,sv;q=0.9",
"no" => "nb-NO,nb;q=0.9",
"dk" => "da-DK,da;q=0.9",
"fi" => "fi-FI,fi;q=0.9",
"cz" => "cs-CZ,cs;q=0.9",
"ro" => "ro-RO,ro;q=0.9",
"gr" => "el-GR,el;q=0.9",
"tr" => "tr-TR,tr;q=0.9",
"ru" => "ru-RU,ru;q=0.9",
"jp" => "ja-JP,ja;q=0.9",
"kr" => "ko-KR,ko;q=0.9",
"cn" => "zh-CN,zh;q=0.9",
"tw" | "hk" => "zh-TW,zh;q=0.9",
"br" => "pt-BR,pt;q=0.9",
"mx" | "ar" | "co" | "cl" | "pe" => "es-ES,es;q=0.9",
"uk" | "ie" => "en-GB,en;q=0.9",
_ => "en-US,en;q=0.9",
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn tld_dispatch() {
assert_eq!(
accept_language_for_url("https://www.immobiliare.it/annunci/1"),
Some("it-IT,it;q=0.9")
);
assert_eq!(
accept_language_for_url("https://www.leboncoin.fr/"),
Some("fr-FR,fr;q=0.9")
);
assert_eq!(
accept_language_for_url("https://www.amazon.co.uk/"),
Some("en-GB,en;q=0.9")
);
assert_eq!(
accept_language_for_url("https://example.com/"),
Some("en-US,en;q=0.9")
);
}
#[test]
fn bad_url_returns_none() {
assert_eq!(accept_language_for_url("not-a-url"), None);
}
}
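The dispatch above can also be exercised without the `url` dependency. A std-only sketch with a deliberately abbreviated mapping and hand-rolled host parsing (assumptions noted in comments):

```rust
// Std-only sketch: crude host extraction instead of the `url` crate, and
// only a subset of the TLD table. The fallback mirrors the module's
// en-US default for unmapped TLDs.
fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9",
        "fr" => "fr-FR,fr;q=0.9",
        "uk" | "ie" => "en-GB,en;q=0.9",
        _ => "en-US,en;q=0.9",
    }
}

fn accept_language_for_url(url: &str) -> Option<&'static str> {
    // Assumes an absolute URL with a scheme; cuts the host at the first
    // path, port, or query delimiter.
    let rest = url.split("://").nth(1)?;
    let host = rest
        .split(|c| c == '/' || c == ':' || c == '?')
        .next()?
        .to_ascii_lowercase();
    let tld = host.rsplit('.').next()?;
    Some(accept_language_for_tld(tld))
}
```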


@@ -7,10 +7,15 @@
use std::time::Duration;
use std::borrow::Cow;
use wreq::http2::{
Http2Options, PseudoId, PseudoOrder, SettingId, SettingsOrder, StreamDependency, StreamId,
};
use wreq::tls::{AlpsProtocol, CertificateCompressionAlgorithm, TlsOptions, TlsVersion};
use wreq::tls::{
AlpnProtocol, AlpsProtocol, CertificateCompressionAlgorithm, ExtensionType, TlsOptions,
TlsVersion,
};
use wreq::{Client, Emulation};
use crate::browser::BrowserVariant;
@@ -43,6 +48,55 @@ const SAFARI_SIGALGS: &str = "ecdsa_secp256r1_sha256:rsa_pss_rsae_sha256:rsa_pkc
/// Safari curves.
const SAFARI_CURVES: &str = "X25519:P-256:P-384:P-521";
/// Safari iOS 26 TLS extension order, matching bogdanfinn's
/// `safari_ios_26_0` wire format. GREASE slots are omitted; wreq
/// inserts them itself. Diverges from wreq-util's default SafariIos26
/// extension order, which DataDome's immobiliare.it ruleset flags.
fn safari_ios_extensions() -> Vec<ExtensionType> {
vec![
ExtensionType::CERTIFICATE_TIMESTAMP,
ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION,
ExtensionType::SERVER_NAME,
ExtensionType::CERT_COMPRESSION,
ExtensionType::KEY_SHARE,
ExtensionType::SUPPORTED_VERSIONS,
ExtensionType::PSK_KEY_EXCHANGE_MODES,
ExtensionType::SUPPORTED_GROUPS,
ExtensionType::RENEGOTIATE,
ExtensionType::SIGNATURE_ALGORITHMS,
ExtensionType::STATUS_REQUEST,
ExtensionType::EC_POINT_FORMATS,
ExtensionType::EXTENDED_MASTER_SECRET,
]
}
/// Chrome 133 TLS extension order, matching bogdanfinn's stable JA3
/// (`43067709b025da334de1279a120f8e14`). Real Chrome permutes extensions
/// per handshake, but indeed.com's WAF allowlists this specific wire order
/// and rejects permuted ones. GREASE slots are inserted by wreq.
///
/// JA3 extension field from peet.ws: 18-5-35-51-10-45-11-27-17613-43-13-0-16-65037-65281-23
fn chrome_extensions() -> Vec<ExtensionType> {
vec![
ExtensionType::CERTIFICATE_TIMESTAMP, // 18
ExtensionType::STATUS_REQUEST, // 5
ExtensionType::SESSION_TICKET, // 35
ExtensionType::KEY_SHARE, // 51
ExtensionType::SUPPORTED_GROUPS, // 10
ExtensionType::PSK_KEY_EXCHANGE_MODES, // 45
ExtensionType::EC_POINT_FORMATS, // 11
ExtensionType::CERT_COMPRESSION, // 27
ExtensionType::APPLICATION_SETTINGS_NEW, // 17613 (new codepoint, matches alps_use_new_codepoint)
ExtensionType::SUPPORTED_VERSIONS, // 43
ExtensionType::SIGNATURE_ALGORITHMS, // 13
ExtensionType::SERVER_NAME, // 0
ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION, // 16
ExtensionType::ENCRYPTED_CLIENT_HELLO, // 65037
ExtensionType::RENEGOTIATE, // 65281
ExtensionType::EXTENDED_MASTER_SECRET, // 23
]
}
// --- Chrome HTTP headers in correct wire order ---
const CHROME_HEADERS: &[(&str, &str)] = &[
@@ -130,6 +184,26 @@ const SAFARI_HEADERS: &[(&str, &str)] = &[
("sec-fetch-dest", "document"),
];
/// Safari iOS 26 headers, in the wire order real Safari emits. Critically:
/// NO `sec-fetch-*`, NO `priority: u=0, i` (both Chromium-only leaks), but
/// `upgrade-insecure-requests: 1` is present. `accept-encoding` does not
/// include zstd (Safari can't decode it). Verified against bogdanfinn on
/// 2026-04-22: this header set is what DataDome's immobiliare ruleset
/// expects for a real iPhone.
const SAFARI_IOS_HEADERS: &[(&str, &str)] = &[
(
"accept",
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
),
("accept-language", "en-US,en;q=0.9"),
("accept-encoding", "gzip, deflate, br"),
(
"user-agent",
"Mozilla/5.0 (iPhone; CPU iPhone OS 26_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Mobile/15E148 Safari/604.1",
),
("upgrade-insecure-requests", "1"),
];
const EDGE_HEADERS: &[(&str, &str)] = &[
(
"sec-ch-ua",
@@ -156,6 +230,9 @@ const EDGE_HEADERS: &[(&str, &str)] = &[
];
fn chrome_tls() -> TlsOptions {
// permute_extensions is off so the explicit extension_permutation sticks.
// Real Chrome permutes, but indeed.com's WAF allowlists bogdanfinn's
// fixed order, so matching that gets us through.
TlsOptions::builder()
.cipher_list(CHROME_CIPHERS)
.sigalgs_list(CHROME_SIGALGS)
@@ -163,12 +240,18 @@ fn chrome_tls() -> TlsOptions {
.min_tls_version(TlsVersion::TLS_1_2)
.max_tls_version(TlsVersion::TLS_1_3)
.grease_enabled(true)
.permute_extensions(true)
.permute_extensions(false)
.extension_permutation(chrome_extensions())
.enable_ech_grease(true)
.pre_shared_key(true)
.enable_ocsp_stapling(true)
.enable_signed_cert_timestamps(true)
.alps_protocols([AlpsProtocol::HTTP2])
.alpn_protocols([
AlpnProtocol::HTTP3,
AlpnProtocol::HTTP2,
AlpnProtocol::HTTP1,
])
.alps_protocols([AlpsProtocol::HTTP3, AlpsProtocol::HTTP2])
.alps_use_new_codepoint(true)
.aes_hw_override(true)
.certificate_compression_algorithms(&[CertificateCompressionAlgorithm::BROTLI])
@@ -212,25 +295,70 @@ fn safari_tls() -> TlsOptions {
.build()
}
/// Safari iOS 26 emulation — composed on top of `wreq_util::Emulation::SafariIos26`
/// with four targeted overrides. We don't hand-roll this one like Chrome/Firefox
/// because the wire-level defaults from wreq-util are already correct for ciphers,
/// sigalgs, curves, and GREASE — the four things wreq-util gets *wrong* for
/// DataDome compatibility are overridden here:
///
/// 1. TLS extension order: match bogdanfinn `safari_ios_26_0` exactly (JA3
/// ends up `8d909525bd5bbb79f133d11cc05159fe`).
/// 2. HTTP/2 HEADERS priority flag: weight=256, exclusive=1, depends_on=0.
/// wreq-util omits this frame; real Safari and bogdanfinn include it.
/// This flip is the thing DataDome actually reads — the akamai_fingerprint
/// hash changes from `c52879e43202aeb92740be6e8c86ea96` to
/// `d1294410a06522e37a5c5e3f0a45a705`, which is the winning signature.
/// 3. Headers: strip wreq-util's Chromium defaults (`sec-fetch-*`,
/// `priority: u=0, i`, zstd), replace with the real iOS 26 set.
/// 4. `accept-language` preserved from config.extra_headers for locale.
fn safari_ios_emulation() -> wreq::Emulation {
use wreq::EmulationFactory;
let mut em = wreq_util::Emulation::SafariIos26.emulation();
if let Some(tls) = em.tls_options_mut().as_mut() {
tls.extension_permutation = Some(Cow::Owned(safari_ios_extensions()));
}
// Only override the priority flag — keep wreq-util's SETTINGS, WINDOW_UPDATE,
// and pseudo-order intact. Replacing the whole Http2Options resets SETTINGS
// to defaults, which sends only INITIAL_WINDOW_SIZE and fails DataDome.
if let Some(h2) = em.http2_options_mut().as_mut() {
h2.headers_stream_dependency = Some(StreamDependency::new(StreamId::zero(), 255, true));
}
let hm = em.headers_mut();
hm.clear();
for (k, v) in SAFARI_IOS_HEADERS {
if let (Ok(n), Ok(val)) = (
http::header::HeaderName::from_bytes(k.as_bytes()),
http::header::HeaderValue::from_str(v),
) {
hm.append(n, val);
}
}
em
}
fn chrome_h2() -> Http2Options {
// SETTINGS frame matches bogdanfinn `chrome_133`: HEADER_TABLE_SIZE,
// ENABLE_PUSH=0, INITIAL_WINDOW_SIZE, MAX_HEADER_LIST_SIZE. No
// MAX_CONCURRENT_STREAMS — real Chrome 133 and bogdanfinn both omit it,
// and indeed.com's WAF reads this as a bot signal when present. Priority
// weight 256 (encoded as 255 + 1) matches bogdanfinn's HEADERS frame.
Http2Options::builder()
.initial_window_size(6_291_456)
.initial_connection_window_size(15_728_640)
.max_header_list_size(262_144)
.header_table_size(65_536)
.max_concurrent_streams(1000u32)
.enable_push(false)
.settings_order(
SettingsOrder::builder()
.extend([
SettingId::HeaderTableSize,
SettingId::EnablePush,
SettingId::MaxConcurrentStreams,
SettingId::InitialWindowSize,
SettingId::MaxFrameSize,
SettingId::MaxHeaderListSize,
SettingId::EnableConnectProtocol,
SettingId::NoRfc7540Priorities,
])
.build(),
)
@@ -244,7 +372,7 @@ fn chrome_h2() -> Http2Options {
])
.build(),
)
.headers_stream_dependency(StreamDependency::new(StreamId::zero(), 219, true))
.headers_stream_dependency(StreamDependency::new(StreamId::zero(), 255, true))
.build()
}
@@ -328,32 +456,38 @@ pub fn build_client(
    extra_headers: &std::collections::HashMap<String, String>,
    proxy: Option<&str>,
) -> Result<Client, FetchError> {
-    let (tls, h2, headers) = match variant {
-        BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
-        BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
-        BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
-        BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
-        BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+    // SafariIos26 builds its Emulation on top of wreq-util's base instead
+    // of from scratch. See `safari_ios_emulation` for why.
+    let mut emulation = match variant {
+        BrowserVariant::SafariIos26 => safari_ios_emulation(),
+        other => {
+            let (tls, h2, headers) = match other {
+                BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
+                BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
+                BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
+                BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
+                BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+                BrowserVariant::SafariIos26 => unreachable!("handled above"),
+            };
+            Emulation::builder()
+                .tls_options(tls)
+                .http2_options(h2)
+                .headers(build_headers(headers))
+                .build()
+        }
+    };
-    let mut header_map = build_headers(headers);
-    // Append extra headers after profile defaults
+    // Append extra headers after profile defaults.
+    let hm = emulation.headers_mut();
    for (k, v) in extra_headers {
        if let (Ok(n), Ok(val)) = (
            http::header::HeaderName::from_bytes(k.as_bytes()),
            http::header::HeaderValue::from_str(v),
        ) {
-            header_map.insert(n, val);
+            hm.insert(n, val);
        }
    }
-    let emulation = Emulation::builder()
-        .tls_options(tls)
-        .http2_options(h2)
-        .headers(header_map)
-        .build();
    let mut builder = Client::builder()
        .emulation(emulation)
        .redirect(wreq::redirect::Policy::limited(10))
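The hunk above changes where the extra headers are merged, but not the ordering contract: profile defaults are built first and caller-supplied extras are inserted afterwards, so an extra header with the same (case-insensitive) name replaces the profile default. A std-only sketch of that override semantics, using a plain `HashMap` with lowercased keys in place of the real case-insensitive `HeaderMap`:

```rust
use std::collections::HashMap;

// Sketch of build_client's merge order: defaults first, extras second.
// Later inserts overwrite earlier ones, so extras win on name collisions.
// Keys are lowercased to mimic HeaderMap's case-insensitive names.
fn merge_headers(
    defaults: &[(&str, &str)],
    extras: &[(&str, &str)],
) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    for (name, value) in defaults.iter().chain(extras.iter()) {
        merged.insert(name.to_ascii_lowercase(), value.to_string());
    }
    merged
}
```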


@@ -718,6 +718,55 @@ impl WebclawMcp {
Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
}
}
/// List every vertical extractor the server knows about. Returns a
/// JSON array of `{name, label, description, url_patterns}` entries.
/// Call this to discover what verticals are available before using
/// `vertical_scrape`.
#[tool]
async fn list_extractors(
&self,
Parameters(_params): Parameters<ListExtractorsParams>,
) -> Result<String, String> {
let catalog = webclaw_fetch::extractors::list();
serde_json::to_string_pretty(&catalog)
.map_err(|e| format!("failed to serialise extractor catalog: {e}"))
}
/// Run a vertical extractor by name and return typed JSON specific
/// to the target site (title, price, rating, author, etc.), not
/// generic markdown. Use `list_extractors` to discover available
/// names. Example names: `reddit`, `github_repo`, `trustpilot_reviews`,
/// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`.
///
/// Antibot-gated verticals (amazon_product, ebay_listing,
/// etsy_listing, trustpilot_reviews) will automatically escalate to
/// the webclaw cloud API when local fetch hits bot protection,
/// provided `WEBCLAW_API_KEY` is set.
#[tool]
async fn vertical_scrape(
&self,
Parameters(params): Parameters<VerticalParams>,
) -> Result<String, String> {
validate_url(&params.url)?;
// Use the cached Firefox client, not the default Chrome one.
// Reddit's `.json` endpoint rejects the wreq-Chrome TLS
// fingerprint with a 403 even from residential IPs (they
// ship a fingerprint blocklist that includes common
// browser-emulation libraries). The wreq-Firefox fingerprint
// still passes, and Firefox is equally fine for every other
// vertical in the catalog, so it's a strictly-safer default
// for `vertical_scrape` than the generic `scrape` tool's
// Chrome default. Matches the CLI `webclaw vertical`
// subcommand which already uses Firefox.
let client = self.firefox_or_build()?;
let data =
webclaw_fetch::extractors::dispatch_by_name(client.as_ref(), &params.name, &params.url)
.await
.map_err(|e| e.to_string())?;
serde_json::to_string_pretty(&data)
.map_err(|e| format!("failed to serialise extractor output: {e}"))
}
}
#[tool_handler]
@@ -727,7 +776,8 @@ impl ServerHandler for WebclawMcp {
.with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION")))
.with_instructions(String::from(
"Webclaw MCP server -- web content extraction for AI agents. \
-            Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.",
+            Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \
+            list_extractors, vertical_scrape.",
))
}
}


@@ -103,3 +103,20 @@ pub struct SearchParams {
/// Number of results to return (default: 10)
pub num_results: Option<u32>,
}
/// Parameters for `vertical_scrape`: run a site-specific extractor by name.
#[derive(Debug, Deserialize, JsonSchema)]
pub struct VerticalParams {
/// Name of the vertical extractor. Call `list_extractors` to see all
/// available names. Examples: "reddit", "github_repo", "pypi",
/// "trustpilot_reviews", "youtube_video", "shopify_product".
pub name: String,
/// URL to extract. Must match the URL patterns the extractor claims;
/// otherwise the tool returns a clear "URL mismatch" error.
pub url: String,
}
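A hypothetical std-only sketch of the lookup-then-validate contract these params describe: find the extractor by `name`, then require `url` to match one of its declared patterns before running it. Substring matching here stands in for whatever pattern matching `dispatch_by_name` actually performs, and the names and patterns are illustrative, not webclaw's real catalog.

```rust
// Illustrative catalog entry; real extractors also carry label/description.
struct Extractor {
    name: &'static str,
    url_patterns: &'static [&'static str],
}

// Look the extractor up by name, then reject URLs that match none of its
// patterns -- the "URL mismatch" error mentioned in the doc comment above.
fn find_and_check<'a>(
    catalog: &'a [Extractor],
    name: &str,
    url: &str,
) -> Result<&'a Extractor, String> {
    let ex = catalog
        .iter()
        .find(|e| e.name == name)
        .ok_or_else(|| format!("unknown extractor: {name}"))?;
    if ex.url_patterns.iter().any(|p| url.contains(p)) {
        Ok(ex)
    } else {
        Err(format!("URL mismatch for extractor `{name}`"))
    }
}
```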
/// `list_extractors` takes no arguments but we still need an empty struct
/// so rmcp can generate a schema and parse the (empty) JSON-RPC params.
#[derive(Debug, Deserialize, JsonSchema)]
pub struct ListExtractorsParams {}