Compare commits

...

9 commits
v0.5.2...main

Author SHA1 Message Date
Valerio
a5c3433372 fix(core+server): guard markdown pipe slice + detect trustpilot/reddit verify walls
Some checks failed
CI / Test (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Docs (push) Has been cancelled
2026-04-23 15:26:31 +02:00
Valerio
966981bc42 fix(fetch): send bot-identifying UA on reddit .json API to bypass browser UA block
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run
2026-04-23 15:17:04 +02:00
Valerio
866fa88aa0 fix(fetch): reject HTML verification pages served at .json reddit URL 2026-04-23 15:06:35 +02:00
Valerio
b413d702b2 feat(fetch): add fetch_smart with Reddit + Akamai rescue paths, bump 0.5.6 2026-04-23 14:59:29 +02:00
Valerio
98a177dec4 feat(cli): expose safari-ios browser profile + bump to 0.5.5 2026-04-23 13:32:55 +02:00
Valerio
e1af2da509 docs(claude): drop sidecar references, mention ProductionFetcher 2026-04-23 13:25:23 +02:00
Valerio
2285c585b1 docs(changelog): simplify 0.5.4 entry 2026-04-23 13:01:02 +02:00
Valerio
b77767814a Bump to 0.5.4: SafariIos profile + Chrome fingerprint alignment + locale helper
- New BrowserProfile::SafariIos mapped to BrowserVariant::SafariIos26.
  Built on wreq_util::Emulation::SafariIos26 with 4 overrides (TLS
  extension order, HTTP/2 HEADERS priority, real Safari iOS 26 headers,
  gzip/deflate/br). Matches bogdanfinn safari_ios_26_0 JA3
  8d909525bd5bbb79f133d11cc05159fe exactly. Empirically 9/10 on
  immobiliare.it with country-it residential.

- BrowserProfile::Chrome aligned to bogdanfinn chrome_133: dropped
  MAX_CONCURRENT_STREAMS from H2 SETTINGS, priority weight 256,
  explicit extension_permutation, advertise h3 in ALPN and ALPS.
  JA3 43067709b025da334de1279a120f8e14, akamai_fp
  52d84b11737d980aef856699f885ca86. Fixes indeed.com and other
  Cloudflare-fronted sites.

- New locale module: accept_language_for_url / accept_language_for_tld.
  TLD to Accept-Language mapping, unknown TLDs default to en-US.
  DataDome geo-vs-locale cross-checks are now trivially satisfiable.

- wreq-util bumped 2.2.6 to 3.0.0-rc.10 for Emulation::SafariIos26.
2026-04-23 12:58:24 +02:00
Valerio
4bf11d902f fix(mcp): vertical_scrape uses Firefox profile, not default Chrome
Reddit's .json API rejects the wreq-Chrome TLS fingerprint with a
403 even from residential IPs. Their block list includes known
browser-emulation library fingerprints. wreq-Firefox passes. The
CLI `vertical` subcommand already forced Firefox; MCP
`vertical_scrape` was still falling back to the long-lived
`self.fetch_client` which defaults to Chrome, so reddit failed
on MCP and nobody noticed because the earlier test runs all had
an API key set that masked the issue.

Switched vertical_scrape to reuse `self.firefox_or_build()` which
gives us the cached Firefox client (same pattern the scrape tool
uses when the caller requests `browser: firefox`). Firefox is
strictly-safer-than-Chrome for every vertical in the catalog, so
making it the hard default for `vertical_scrape` is the right call.

Verified end-to-end from a clean shell with no WEBCLAW_API_KEY:
- MCP reddit: 679ms, post/author/6 comments correct
- MCP instagram_profile: 1157ms, 18471 followers

No change to the `scrape` tool -- it keeps the user-selectable
browser param.

Bumps version to 0.5.3.
2026-04-22 23:18:11 +02:00
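The cached-client reuse described in the commit message can be sketched with `std::sync::OnceLock`; the names below are stand-ins for the repo's types, and the real `firefox_or_build` may differ in detail:

```rust
use std::sync::OnceLock;

// Stand-in for the real webclaw FetchClient; only the profile name matters here.
pub struct FetchClient {
    pub profile: &'static str,
}

pub struct Mcp {
    firefox: OnceLock<FetchClient>,
}

impl Mcp {
    pub fn new() -> Self {
        Mcp { firefox: OnceLock::new() }
    }

    // Build the Firefox client on first use, then hand back the cached one on
    // every later call, mirroring the `firefox_or_build` idea from the commit.
    pub fn firefox_or_build(&self) -> &FetchClient {
        self.firefox.get_or_init(|| FetchClient { profile: "firefox" })
    }
}
```

Repeated calls return references to the same cached value, so every `vertical_scrape` invocation shares one Firefox client instead of rebuilding per request.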
13 changed files with 402 additions and 54 deletions

View file

@@ -3,6 +3,36 @@
All notable changes to webclaw are documented here.
Format follows [Keep a Changelog](https://keepachangelog.com/).
## [0.5.6] — 2026-04-23
### Added
- `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API with an identifiable bot `User-Agent`, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. Makes `/v1/scrape` on Reddit populate markdown again.
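The Reddit URL rewrite can be sketched as a standalone helper. This is a hypothetical minimal version; the real routing lives in `crate::reddit::json_url`, which is not shown in this diff:

```rust
// Hypothetical sketch of the Reddit `.json` rewrite: drop query string and
// fragment, normalise the trailing slash, then append `.json` once.
pub fn reddit_json_url(url: &str) -> String {
    let base = url.split(|c| c == '?' || c == '#').next().unwrap_or(url);
    let trimmed = base.trim_end_matches('/');
    if trimmed.ends_with(".json") {
        trimmed.to_string()
    } else {
        format!("{trimmed}.json")
    }
}
```

The rewrite is idempotent, so a caller that already passed a `.json` URL gets it back unchanged.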
### Fixed
- Regression introduced in 0.5.4 where the production server's `/v1/scrape` bypassed the Reddit `.json` shortcut and Akamai cookie warmup that `fetch_and_extract` had been providing. Both helpers now live in `fetch_smart` and every caller path picks them up.
- Panic in the markdown converter (`markdown.rs:925`) on single-pipe `|` lines. A `[1..len-1]` slice on a 1-char input failed the `begin <= end` slice-bounds check and panicked. Now guarded with a length check.
---
## [0.5.5] — 2026-04-23
### Added
- `webclaw --browser safari-ios` on the CLI. Pairs with `--proxy` for DataDome-fronted sites that reject desktop profiles.
---
## [0.5.4] — 2026-04-23
### Added
- New `BrowserProfile::SafariIos` for Safari iOS 26 fingerprinting. Pairs with a country-matched residential proxy for sites that reject non-mobile profiles.
- `accept_language_for_url(url)` and `accept_language_for_tld(tld)` helpers. Returns a locale-appropriate `Accept-Language` based on the URL's TLD, with `en-US` as the fallback.
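The helpers boil down to a TLD table plus a host-parsing step. A two-entry stand-in (the real map covers many more TLDs, and the real code parses the full URL with the `url` crate):

```rust
// Two-entry stand-in for the full TLD table; unknown TLDs get the en-US fallback.
pub fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9",
        "fr" => "fr-FR,fr;q=0.9",
        _ => "en-US,en;q=0.9",
    }
}

// The URL-level helper reduces to: lowercase the host, take its last
// dot-separated label, and dispatch. Std-only stand-in for the URL parse.
pub fn accept_language_for_host(host: &str) -> &'static str {
    let tld = host.rsplit('.').next().unwrap_or(host);
    accept_language_for_tld(&tld.to_ascii_lowercase())
}
```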
### Changed
- Chrome browser fingerprint refreshed for current Cloudflare bot management. Fixes 403 challenges on several e-commerce and jobs sites.
- Bumped `wreq-util` to `3.0.0-rc.10`.
---
## [0.5.2] — 2026-04-22
### Added

View file

@@ -79,7 +79,7 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
- **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally.
- **No special RUSTFLAGS**: `.cargo/config.toml` is currently empty of build flags. Don't add any.
- **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep.
- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `TlsSidecarFetcher` that routes through the Go tls-sidecar instead of in-process wreq.
- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client.
- **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.
## Build & Test

Cargo.lock generated
View file

@@ -2967,6 +2967,26 @@ dependencies = [
"pom",
]
[[package]]
name = "typed-builder"
version = "0.23.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "31aa81521b70f94402501d848ccc0ecaa8f93c8eb6999eb9747e72287757ffda"
dependencies = [
"typed-builder-macro",
]
[[package]]
name = "typed-builder-macro"
version = "0.23.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "076a02dc54dd46795c2e9c8282ed40bcfb1e22747e955de9389a1de28190fb26"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "typed-path"
version = "0.12.3"
@@ -3199,7 +3219,7 @@ dependencies = [
[[package]]
name = "webclaw-cli"
version = "0.5.2"
version = "0.5.6"
dependencies = [
"clap",
"dotenvy",
@@ -3220,7 +3240,7 @@ dependencies = [
[[package]]
name = "webclaw-core"
version = "0.5.2"
version = "0.5.6"
dependencies = [
"ego-tree",
"once_cell",
@@ -3238,7 +3258,7 @@ dependencies = [
[[package]]
name = "webclaw-fetch"
version = "0.5.2"
version = "0.5.6"
dependencies = [
"async-trait",
"bytes",
@@ -3258,12 +3278,13 @@ dependencies = [
"webclaw-core",
"webclaw-pdf",
"wreq",
"wreq-util",
"zip 2.4.2",
]
[[package]]
name = "webclaw-llm"
version = "0.5.2"
version = "0.5.6"
dependencies = [
"async-trait",
"reqwest",
@@ -3276,7 +3297,7 @@ dependencies = [
[[package]]
name = "webclaw-mcp"
version = "0.5.2"
version = "0.5.6"
dependencies = [
"dirs",
"dotenvy",
@@ -3296,7 +3317,7 @@ dependencies = [
[[package]]
name = "webclaw-pdf"
version = "0.5.2"
version = "0.5.6"
dependencies = [
"pdf-extract",
"thiserror",
@@ -3305,7 +3326,7 @@ dependencies = [
[[package]]
name = "webclaw-server"
version = "0.5.2"
version = "0.5.6"
dependencies = [
"anyhow",
"axum",
@@ -3709,6 +3730,16 @@ dependencies = [
"zstd",
]
[[package]]
name = "wreq-util"
version = "3.0.0-rc.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6c6bbe24d28beb9ceb58b514bd6a613c759d3b706f768b9d2950d5d35b543c04"
dependencies = [
"typed-builder",
"wreq",
]
[[package]]
name = "writeable"
version = "0.6.2"

View file

@@ -3,7 +3,7 @@ resolver = "2"
members = ["crates/*"]
[workspace.package]
version = "0.5.2"
version = "0.5.6"
edition = "2024"
license = "AGPL-3.0"
repository = "https://github.com/0xMassi/webclaw"

View file

@@ -351,6 +351,9 @@ enum OutputFormat {
enum Browser {
Chrome,
Firefox,
/// Safari iOS 26. Pair with a country-matched residential proxy for sites
/// that reject non-mobile profiles.
SafariIos,
Random,
}
@@ -377,6 +380,7 @@ impl From<Browser> for BrowserProfile {
match b {
Browser::Chrome => BrowserProfile::Chrome,
Browser::Firefox => BrowserProfile::Firefox,
Browser::SafariIos => BrowserProfile::SafariIos,
Browser::Random => BrowserProfile::Random,
}
}

View file

@@ -920,8 +920,10 @@ fn strip_markdown(md: &str) -> String {
continue;
}
// Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs
if trimmed.starts_with('|') && trimmed.ends_with('|') {
// Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs.
// Require at least 2 chars so the slice `[1..len-1]` stays in bounds on single-pipe rows
// (which aren't real tables anyway); a lone `|` previously panicked at `begin <= end`.
if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
let inner = &trimmed[1..trimmed.len() - 1];
let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
lines.push(cells.join("\t"));
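A standalone sketch of the panic and the guard, using a hypothetical helper that mirrors the table-row logic above:

```rust
// Hypothetical stand-in for the table-row conversion in strip_markdown.
// Returns the tab-joined cells, or None for lines that aren't table rows.
pub fn table_row_to_tabs(trimmed: &str) -> Option<String> {
    // Without the len() >= 2 check, a lone "|" would evaluate `&trimmed[1..0]`
    // (len - 1 == 0) and panic with "slice index starts at 1 but ends at 0".
    if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
        let inner = &trimmed[1..trimmed.len() - 1];
        let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
        return Some(cells.join("\t"));
    }
    None
}
```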

View file

@@ -14,6 +14,7 @@ tracing = { workspace = true }
tokio = { workspace = true }
async-trait = "0.1"
wreq = { version = "6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] }
wreq-util = "3.0.0-rc.10"
http = "1"
bytes = "1"
url = "2"

View file

@@ -7,6 +7,10 @@ pub enum BrowserProfile {
#[default]
Chrome,
Firefox,
/// Safari iOS 26 (iPhone). The one profile proven to defeat
/// DataDome's immobiliare.it / idealista.it / target.com-class
/// rules when paired with a country-scoped residential proxy.
SafariIos,
/// Randomly pick from all available profiles on each request.
Random,
}
@@ -18,6 +22,7 @@ pub enum BrowserVariant {
ChromeMacos,
Firefox,
Safari,
SafariIos26,
Edge,
}

View file

@@ -261,10 +261,65 @@ impl FetchClient {
self.cloud.as_deref()
}
/// Fetch a URL with per-site rescue paths: Reddit URLs redirect to the
/// `.json` API, and Akamai-style challenge responses trigger a homepage
/// cookie warmup and a retry. Returns the same `FetchResult` shape as
/// [`Self::fetch`] so every caller (CLI, MCP, OSS server, production
/// server) benefits without shape churn.
///
/// This is the method most callers want. Use plain [`Self::fetch`] only
/// when you need literal no-rescue behavior (e.g. inside the rescue
/// logic itself to avoid recursion).
pub async fn fetch_smart(&self, url: &str) -> Result<FetchResult, FetchError> {
// Reddit: the HTML page shows a verification interstitial for most
// client IPs, but appending `.json` returns the post + comment tree
// publicly. `parse_reddit_json` in downstream code knows how to read
// the result; here we just do the URL swap at the fetch layer.
if crate::reddit::is_reddit_url(url) && !url.ends_with(".json") {
let json_url = crate::reddit::json_url(url);
// Reddit's public .json API serves JSON to identifiable bot
// User-Agents and blocks browser UAs with a verification wall.
// Override our Chrome-profile UA for this specific call.
let ua = concat!(
"Webclaw/",
env!("CARGO_PKG_VERSION"),
" (+https://webclaw.io)"
);
if let Ok(resp) = self
.fetch_with_headers(&json_url, &[("user-agent", ua)])
.await
&& resp.status == 200
{
let first = resp.html.trim_start().as_bytes().first().copied();
if matches!(first, Some(b'{') | Some(b'[')) {
return Ok(resp);
}
}
// If the .json fetch failed or returned HTML, fall through.
}
let resp = self.fetch(url).await?;
// Akamai / bazadebezolkohpepadr challenge: visit the homepage to
// collect warmup cookies (_abck, bm_sz, etc.), then retry.
if is_challenge_html(&resp.html)
&& let Some(homepage) = extract_homepage(url)
{
debug!("challenge detected, warming cookies via {homepage}");
let _ = self.fetch(&homepage).await;
if let Ok(retry) = self.fetch(url).await {
return Ok(retry);
}
}
Ok(resp)
}
/// Fetch a URL and return the raw HTML + response metadata.
///
/// Automatically retries on transient failures (network errors, 5xx, 429)
/// with exponential backoff: 0s, 1s (2 attempts total).
/// with exponential backoff: 0s, 1s (2 attempts total). No per-site
/// rescue logic; use [`Self::fetch_smart`] for that.
#[instrument(skip(self), fields(url = %url))]
pub async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
let delays = [Duration::ZERO, Duration::from_secs(1)];
@@ -635,6 +690,7 @@ fn collect_variants(profile: &BrowserProfile) -> Vec<BrowserVariant> {
BrowserProfile::Random => browser::all_variants(),
BrowserProfile::Chrome => vec![browser::latest_chrome()],
BrowserProfile::Firefox => vec![browser::latest_firefox()],
BrowserProfile::SafariIos => vec![BrowserVariant::SafariIos26],
}
}
@@ -712,22 +768,23 @@ fn is_pdf_content_type(headers: &http::HeaderMap) -> bool {
/// Detect if a response looks like a bot protection challenge page.
fn is_challenge_response(response: &Response) -> bool {
let len = response.body().len();
is_challenge_html(response.text().as_ref())
}
/// Same as `is_challenge_response`, operating on a body string directly
/// so callers holding a `FetchResult` can reuse the heuristic.
fn is_challenge_html(html: &str) -> bool {
let len = html.len();
if len > 15_000 || len == 0 {
return false;
}
let text = response.text();
let lower = text.to_lowercase();
let lower = html.to_lowercase();
if lower.contains("<title>challenge page</title>") {
return true;
}
if lower.contains("bazadebezolkohpepadr") && len < 5_000 {
return true;
}
false
}
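The `extract_homepage` helper that `fetch_smart` calls for the cookie warmup is not shown in this diff. A sketch under the assumption that it returns the scheme-plus-host origin root; this is an illustration, not the repo's implementation:

```rust
// Assumed behaviour: reduce a page URL to its origin root for cookie warmup.
pub fn extract_homepage(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let scheme = &url[..url.len() - rest.len()];
    // Host ends at the first path, query, or fragment delimiter.
    let host = rest.split(|c| c == '/' || c == '?' || c == '#').next()?;
    if host.is_empty() {
        return None;
    }
    Some(format!("{scheme}{host}/"))
}
```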

View file

@@ -10,6 +10,7 @@ pub mod error;
pub mod extractors;
pub mod fetcher;
pub mod linkedin;
pub mod locale;
pub mod proxy;
pub mod reddit;
pub mod sitemap;
@@ -21,6 +22,7 @@ pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};
pub use error::FetchError;
pub use fetcher::Fetcher;
pub use http::HeaderMap;
pub use locale::{accept_language_for_tld, accept_language_for_url};
pub use proxy::{parse_proxy_file, parse_proxy_line};
pub use sitemap::SitemapEntry;
pub use webclaw_pdf::PdfMode;

View file

@@ -0,0 +1,77 @@
//! Derive an `Accept-Language` header from a URL.
//!
//! DataDome-class bot detection on country-specific sites (e.g. immobiliare.it,
//! leboncoin.fr) does a geo-vs-locale sanity check: residential IP in the
//! target country + a browser UA but the wrong `Accept-Language` is a bot
//! signal. Matching the site's expected locale gets us through.
//!
//! Default for unmapped TLDs is `en-US,en;q=0.9` — the global fallback.
/// Best-effort `Accept-Language` header value for the given URL's TLD.
/// Returns `None` if the URL cannot be parsed.
pub fn accept_language_for_url(url: &str) -> Option<&'static str> {
let host = url::Url::parse(url).ok()?.host_str()?.to_ascii_lowercase();
let tld = host.rsplit('.').next()?;
Some(accept_language_for_tld(tld))
}
/// Map a bare TLD like `it`, `fr`, `de` to a plausible `Accept-Language`.
/// Unknown TLDs fall back to US English.
pub fn accept_language_for_tld(tld: &str) -> &'static str {
match tld {
"it" => "it-IT,it;q=0.9",
"fr" => "fr-FR,fr;q=0.9",
"de" | "at" => "de-DE,de;q=0.9",
"es" => "es-ES,es;q=0.9",
"pt" => "pt-PT,pt;q=0.9",
"nl" => "nl-NL,nl;q=0.9",
"pl" => "pl-PL,pl;q=0.9",
"se" => "sv-SE,sv;q=0.9",
"no" => "nb-NO,nb;q=0.9",
"dk" => "da-DK,da;q=0.9",
"fi" => "fi-FI,fi;q=0.9",
"cz" => "cs-CZ,cs;q=0.9",
"ro" => "ro-RO,ro;q=0.9",
"gr" => "el-GR,el;q=0.9",
"tr" => "tr-TR,tr;q=0.9",
"ru" => "ru-RU,ru;q=0.9",
"jp" => "ja-JP,ja;q=0.9",
"kr" => "ko-KR,ko;q=0.9",
"cn" => "zh-CN,zh;q=0.9",
"tw" | "hk" => "zh-TW,zh;q=0.9",
"br" => "pt-BR,pt;q=0.9",
"mx" | "ar" | "co" | "cl" | "pe" => "es-ES,es;q=0.9",
"uk" | "ie" => "en-GB,en;q=0.9",
_ => "en-US,en;q=0.9",
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn tld_dispatch() {
assert_eq!(
accept_language_for_url("https://www.immobiliare.it/annunci/1"),
Some("it-IT,it;q=0.9")
);
assert_eq!(
accept_language_for_url("https://www.leboncoin.fr/"),
Some("fr-FR,fr;q=0.9")
);
assert_eq!(
accept_language_for_url("https://www.amazon.co.uk/"),
Some("en-GB,en;q=0.9")
);
assert_eq!(
accept_language_for_url("https://example.com/"),
Some("en-US,en;q=0.9")
);
}
#[test]
fn bad_url_returns_none() {
assert_eq!(accept_language_for_url("not-a-url"), None);
}
}

View file

@@ -7,10 +7,15 @@
use std::time::Duration;
use std::borrow::Cow;
use wreq::http2::{
Http2Options, PseudoId, PseudoOrder, SettingId, SettingsOrder, StreamDependency, StreamId,
};
use wreq::tls::{AlpsProtocol, CertificateCompressionAlgorithm, TlsOptions, TlsVersion};
use wreq::tls::{
AlpnProtocol, AlpsProtocol, CertificateCompressionAlgorithm, ExtensionType, TlsOptions,
TlsVersion,
};
use wreq::{Client, Emulation};
use crate::browser::BrowserVariant;
@@ -43,6 +48,55 @@ const SAFARI_SIGALGS: &str = "ecdsa_secp256r1_sha256:rsa_pss_rsae_sha256:rsa_pkc
/// Safari curves.
const SAFARI_CURVES: &str = "X25519:P-256:P-384:P-521";
/// Safari iOS 26 TLS extension order, matching bogdanfinn's
/// `safari_ios_26_0` wire format. GREASE slots are omitted; wreq
/// inserts them itself. Diverges from wreq-util's default SafariIos26
/// extension order, which DataDome's immobiliare.it ruleset flags.
fn safari_ios_extensions() -> Vec<ExtensionType> {
vec![
ExtensionType::CERTIFICATE_TIMESTAMP,
ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION,
ExtensionType::SERVER_NAME,
ExtensionType::CERT_COMPRESSION,
ExtensionType::KEY_SHARE,
ExtensionType::SUPPORTED_VERSIONS,
ExtensionType::PSK_KEY_EXCHANGE_MODES,
ExtensionType::SUPPORTED_GROUPS,
ExtensionType::RENEGOTIATE,
ExtensionType::SIGNATURE_ALGORITHMS,
ExtensionType::STATUS_REQUEST,
ExtensionType::EC_POINT_FORMATS,
ExtensionType::EXTENDED_MASTER_SECRET,
]
}
/// Chrome 133 TLS extension order, matching bogdanfinn's stable JA3
/// (`43067709b025da334de1279a120f8e14`). Real Chrome permutes extensions
/// per handshake, but indeed.com's WAF allowlists this specific wire order
/// and rejects permuted ones. GREASE slots are inserted by wreq.
///
/// JA3 extension field from peet.ws: 18-5-35-51-10-45-11-27-17613-43-13-0-16-65037-65281-23
fn chrome_extensions() -> Vec<ExtensionType> {
vec![
ExtensionType::CERTIFICATE_TIMESTAMP, // 18
ExtensionType::STATUS_REQUEST, // 5
ExtensionType::SESSION_TICKET, // 35
ExtensionType::KEY_SHARE, // 51
ExtensionType::SUPPORTED_GROUPS, // 10
ExtensionType::PSK_KEY_EXCHANGE_MODES, // 45
ExtensionType::EC_POINT_FORMATS, // 11
ExtensionType::CERT_COMPRESSION, // 27
ExtensionType::APPLICATION_SETTINGS_NEW, // 17613 (new codepoint, matches alps_use_new_codepoint)
ExtensionType::SUPPORTED_VERSIONS, // 43
ExtensionType::SIGNATURE_ALGORITHMS, // 13
ExtensionType::SERVER_NAME, // 0
ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION, // 16
ExtensionType::ENCRYPTED_CLIENT_HELLO, // 65037
ExtensionType::RENEGOTIATE, // 65281
ExtensionType::EXTENDED_MASTER_SECRET, // 23
]
}
// --- Chrome HTTP headers in correct wire order ---
const CHROME_HEADERS: &[(&str, &str)] = &[
@@ -130,6 +184,26 @@ const SAFARI_HEADERS: &[(&str, &str)] = &[
("sec-fetch-dest", "document"),
];
/// Safari iOS 26 headers, in the wire order real Safari emits. Critically:
/// NO `sec-fetch-*`, NO `priority: u=0, i` (both Chromium-only leaks), but
/// `upgrade-insecure-requests: 1` is present. `accept-encoding` does not
/// include zstd (Safari can't decode it). Verified against bogdanfinn on
/// 2026-04-22: this header set is what DataDome's immobiliare ruleset
/// expects for a real iPhone.
const SAFARI_IOS_HEADERS: &[(&str, &str)] = &[
(
"accept",
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
),
("accept-language", "en-US,en;q=0.9"),
("accept-encoding", "gzip, deflate, br"),
(
"user-agent",
"Mozilla/5.0 (iPhone; CPU iPhone OS 26_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Mobile/15E148 Safari/604.1",
),
("upgrade-insecure-requests", "1"),
];
const EDGE_HEADERS: &[(&str, &str)] = &[
(
"sec-ch-ua",
@@ -156,6 +230,9 @@ const EDGE_HEADERS: &[(&str, &str)] = &[
];
fn chrome_tls() -> TlsOptions {
// permute_extensions is off so the explicit extension_permutation sticks.
// Real Chrome permutes, but indeed.com's WAF allowlists bogdanfinn's
// fixed order, so matching that gets us through.
TlsOptions::builder()
.cipher_list(CHROME_CIPHERS)
.sigalgs_list(CHROME_SIGALGS)
@@ -163,12 +240,18 @@ fn chrome_tls() -> TlsOptions {
.min_tls_version(TlsVersion::TLS_1_2)
.max_tls_version(TlsVersion::TLS_1_3)
.grease_enabled(true)
.permute_extensions(true)
.permute_extensions(false)
.extension_permutation(chrome_extensions())
.enable_ech_grease(true)
.pre_shared_key(true)
.enable_ocsp_stapling(true)
.enable_signed_cert_timestamps(true)
.alps_protocols([AlpsProtocol::HTTP2])
.alpn_protocols([
AlpnProtocol::HTTP3,
AlpnProtocol::HTTP2,
AlpnProtocol::HTTP1,
])
.alps_protocols([AlpsProtocol::HTTP3, AlpsProtocol::HTTP2])
.alps_use_new_codepoint(true)
.aes_hw_override(true)
.certificate_compression_algorithms(&[CertificateCompressionAlgorithm::BROTLI])
@@ -212,25 +295,70 @@ fn safari_tls() -> TlsOptions {
.build()
}
/// Safari iOS 26 emulation — composed on top of `wreq_util::Emulation::SafariIos26`
/// with four targeted overrides. We don't hand-roll this one like Chrome/Firefox
/// because the wire-level defaults from wreq-util are already correct for ciphers,
/// sigalgs, curves, and GREASE — the four things wreq-util gets *wrong* for
/// DataDome compatibility are overridden here:
///
/// 1. TLS extension order: match bogdanfinn `safari_ios_26_0` exactly (JA3
/// ends up `8d909525bd5bbb79f133d11cc05159fe`).
/// 2. HTTP/2 HEADERS priority flag: weight=256, exclusive=1, depends_on=0.
/// wreq-util omits this frame; real Safari and bogdanfinn include it.
/// This flip is the thing DataDome actually reads — the akamai_fingerprint
/// hash changes from `c52879e43202aeb92740be6e8c86ea96` to
/// `d1294410a06522e37a5c5e3f0a45a705`, which is the winning signature.
/// 3. Headers: strip wreq-util's Chromium defaults (`sec-fetch-*`,
/// `priority: u=0, i`, zstd), replace with the real iOS 26 set.
/// 4. `accept-language` preserved from config.extra_headers for locale.
fn safari_ios_emulation() -> wreq::Emulation {
use wreq::EmulationFactory;
let mut em = wreq_util::Emulation::SafariIos26.emulation();
if let Some(tls) = em.tls_options_mut().as_mut() {
tls.extension_permutation = Some(Cow::Owned(safari_ios_extensions()));
}
// Only override the priority flag — keep wreq-util's SETTINGS, WINDOW_UPDATE,
// and pseudo-order intact. Replacing the whole Http2Options resets SETTINGS
// to defaults, which sends only INITIAL_WINDOW_SIZE and fails DataDome.
if let Some(h2) = em.http2_options_mut().as_mut() {
h2.headers_stream_dependency = Some(StreamDependency::new(StreamId::zero(), 255, true));
}
let hm = em.headers_mut();
hm.clear();
for (k, v) in SAFARI_IOS_HEADERS {
if let (Ok(n), Ok(val)) = (
http::header::HeaderName::from_bytes(k.as_bytes()),
http::header::HeaderValue::from_str(v),
) {
hm.append(n, val);
}
}
em
}
fn chrome_h2() -> Http2Options {
// SETTINGS frame matches bogdanfinn `chrome_133`: HEADER_TABLE_SIZE,
// ENABLE_PUSH=0, INITIAL_WINDOW_SIZE, MAX_HEADER_LIST_SIZE. No
// MAX_CONCURRENT_STREAMS — real Chrome 133 and bogdanfinn both omit it,
// and indeed.com's WAF reads this as a bot signal when present. Priority
// weight 256 (encoded as 255 + 1) matches bogdanfinn's HEADERS frame.
Http2Options::builder()
.initial_window_size(6_291_456)
.initial_connection_window_size(15_728_640)
.max_header_list_size(262_144)
.header_table_size(65_536)
.max_concurrent_streams(1000u32)
.enable_push(false)
.settings_order(
SettingsOrder::builder()
.extend([
SettingId::HeaderTableSize,
SettingId::EnablePush,
SettingId::MaxConcurrentStreams,
SettingId::InitialWindowSize,
SettingId::MaxFrameSize,
SettingId::MaxHeaderListSize,
SettingId::EnableConnectProtocol,
SettingId::NoRfc7540Priorities,
])
.build(),
)
@@ -244,7 +372,7 @@ fn chrome_h2() -> Http2Options {
])
.build(),
)
.headers_stream_dependency(StreamDependency::new(StreamId::zero(), 219, true))
.headers_stream_dependency(StreamDependency::new(StreamId::zero(), 255, true))
.build()
}
@@ -328,32 +456,38 @@ pub fn build_client(
extra_headers: &std::collections::HashMap<String, String>,
proxy: Option<&str>,
) -> Result<Client, FetchError> {
let (tls, h2, headers) = match variant {
BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
// SafariIos26 builds its Emulation on top of wreq-util's base instead
// of from scratch. See `safari_ios_emulation` for why.
let mut emulation = match variant {
BrowserVariant::SafariIos26 => safari_ios_emulation(),
other => {
let (tls, h2, headers) = match other {
BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
BrowserVariant::SafariIos26 => unreachable!("handled above"),
};
Emulation::builder()
.tls_options(tls)
.http2_options(h2)
.headers(build_headers(headers))
.build()
}
};
let mut header_map = build_headers(headers);
// Append extra headers after profile defaults
// Append extra headers after profile defaults.
let hm = emulation.headers_mut();
for (k, v) in extra_headers {
if let (Ok(n), Ok(val)) = (
http::header::HeaderName::from_bytes(k.as_bytes()),
http::header::HeaderValue::from_str(v),
) {
header_map.insert(n, val);
hm.insert(n, val);
}
}
let emulation = Emulation::builder()
.tls_options(tls)
.http2_options(h2)
.headers(header_map)
.build();
let mut builder = Client::builder()
.emulation(emulation)
.redirect(wreq::redirect::Policy::limited(10))

View file

@@ -749,16 +749,21 @@ impl WebclawMcp {
Parameters(params): Parameters<VerticalParams>,
) -> Result<String, String> {
validate_url(&params.url)?;
// Reuse the long-lived default FetchClient. Extractors accept
// `&dyn Fetcher`; FetchClient implements the trait so this just
// works (see webclaw_fetch::Fetcher and client::FetchClient).
let data = webclaw_fetch::extractors::dispatch_by_name(
self.fetch_client.as_ref(),
&params.name,
&params.url,
)
.await
.map_err(|e| e.to_string())?;
// Use the cached Firefox client, not the default Chrome one.
// Reddit's `.json` endpoint rejects the wreq-Chrome TLS
// fingerprint with a 403 even from residential IPs (they
// ship a fingerprint blocklist that includes common
// browser-emulation libraries). The wreq-Firefox fingerprint
// still passes, and Firefox is equally fine for every other
// vertical in the catalog, so it's a strictly-safer default
// for `vertical_scrape` than the generic `scrape` tool's
// Chrome default. Matches the CLI `webclaw vertical`
// subcommand which already uses Firefox.
let client = self.firefox_or_build()?;
let data =
webclaw_fetch::extractors::dispatch_by_name(client.as_ref(), &params.name, &params.url)
.await
.map_err(|e| e.to_string())?;
serde_json::to_string_pretty(&data)
.map_err(|e| format!("failed to serialise extractor output: {e}"))
}