From aaa510350458da25dab15cc5632b62b7f01ec2f6 Mon Sep 17 00:00:00 2001 From: Valerio Date: Wed, 22 Apr 2026 21:11:18 +0200 Subject: [PATCH 01/12] docs(claude): fix stale primp references, document wreq + Fetcher trait webclaw-fetch switched from primp to wreq 6.x (BoringSSL) a while ago but CLAUDE.md still documented primp, the `[patch.crates-io]` requirement, and RUSTFLAGS that no longer apply. Refreshed four sections: - Crate listing: webclaw-fetch uses wreq, not primp - client.rs description: wreq BoringSSL, plus a note that FetchClient will implement the new Fetcher trait so production can swap in a tls-sidecar-backed fetcher without importing wreq - Hard Rules: dropped obsolete `[patch.crates-io]` and RUSTFLAGS lines, added the "Vertical extractors take `&dyn Fetcher`" rule that makes the architectural separation explicit for the upcoming production integration - Removed language about primp being "patched"; reqwest in webclaw-llm is now just "plain reqwest" with no relationship to wreq --- CLAUDE.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index eac2f9f..fcd27da 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -11,7 +11,7 @@ webclaw/ # + ExtractionOptions (include/exclude CSS selectors) # + diff engine (change tracking) # + brand extraction (DOM/CSS analysis) - webclaw-fetch/ # HTTP client via primp. Crawler. Sitemap discovery. Batch ops. + webclaw-fetch/ # HTTP client via wreq (BoringSSL). Crawler. Sitemap discovery. Batch ops. 
# + proxy pool rotation (per-request) # + PDF content-type detection # + document parsing (DOCX, XLSX, CSV) @@ -40,7 +40,7 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R - `brand.rs` — Brand identity extraction from DOM structure and CSS ### Fetch Modules (`webclaw-fetch`) -- `client.rs` — FetchClient with primp TLS impersonation +- `client.rs` — FetchClient with wreq BoringSSL TLS impersonation; implements the public `Fetcher` trait so callers (including server adapters) can swap in alternative implementations - `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128) - `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay - `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt) @@ -76,9 +76,10 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R ## Hard Rules - **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible. -- **primp requires `[patch.crates-io]`** for patched rustls/h2 forks at workspace level. -- **RUSTFLAGS are set in `.cargo/config.toml`** — no need to pass manually. -- **webclaw-llm uses plain reqwest** (NOT primp-patched). LLM APIs don't need TLS fingerprinting. +- **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally. +- **No special RUSTFLAGS** — `.cargo/config.toml` is currently empty of build flags. Don't add any. +- **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep. +- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `TlsSidecarFetcher` that routes through the Go tls-sidecar instead of in-process wreq. - **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels. 
## Build & Test From 058493bc8f64b0cf47e7aaccda15f34a0ede74b9 Mon Sep 17 00:00:00 2001 From: Valerio Date: Wed, 22 Apr 2026 21:17:50 +0200 Subject: [PATCH 02/12] feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now take `client: &dyn Fetcher` instead of `client: &FetchClient` directly. Backwards-compatible: FetchClient implements Fetcher, blanket impls cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server callers keep working unchanged. Motivation: the production API server (api.webclaw.io) must not do in-process TLS fingerprinting; it delegates all HTTP to the Go tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on production would have required importing wreq into the server's dep graph, violating the CLAUDE.md rule. Now production can provide its own TlsSidecarFetcher implementation and pass it to the same dispatcher the OSS server uses. Changes: - New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus blanket impls for `&T` and `Arc<T>`. - `FetchClient` gains a tiny impl block in client.rs that forwards to its existing public methods. - All 28 extractor signatures migrated from `&FetchClient` to `&dyn Fetcher` (sed-driven bulk rewrite, no semantic change). - `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`. - `extractors::dispatch_by_url` and `extractors::dispatch_by_name` take `&dyn Fetcher`. - `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has native async-fn-in-trait but dyn dispatch still needs async_trait). - Version bumped to 0.5.1, CHANGELOG updated. Tests: 215 passing in webclaw-fetch (no new tests needed — the existing extractor tests exercise the trait methods transparently). Clippy: clean workspace-wide. 
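Illustrative sketch of the pattern described above, not the contents of `fetcher.rs`. It is a simplified, synchronous stand-in: the real trait is async (via `async_trait`) and also exposes `fetch_with_headers` and `cloud`. The one-method trait, the `String` return and error types, and the `InProcessClient` / `SidecarClient` structs are all hypothetical names chosen for the example.

```rust
use std::sync::Arc;

// Hypothetical, synchronous stand-in for the real async trait.
trait Fetcher {
    fn fetch(&self, url: &str) -> Result<String, String>;
}

// Blanket impl: a shared reference to a fetcher is itself a fetcher.
impl<T: Fetcher + ?Sized> Fetcher for &T {
    fn fetch(&self, url: &str) -> Result<String, String> {
        (**self).fetch(url)
    }
}

// Blanket impl: an `Arc` around a fetcher is itself a fetcher.
impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> Result<String, String> {
        (**self).fetch(url)
    }
}

// Stand-in for an in-process client (the role FetchClient plays).
struct InProcessClient;

impl Fetcher for InProcessClient {
    fn fetch(&self, url: &str) -> Result<String, String> {
        Ok(format!("in-process:{url}"))
    }
}

// Stand-in for a backend that delegates HTTP elsewhere
// (the role a TlsSidecarFetcher would play).
struct SidecarClient;

impl Fetcher for SidecarClient {
    fn fetch(&self, url: &str) -> Result<String, String> {
        Ok(format!("sidecar:{url}"))
    }
}

// An extractor only sees the trait object, never the concrete backend.
fn extract(client: &dyn Fetcher, url: &str) -> Result<String, String> {
    client.fetch(url)
}

fn main() {
    let local = InProcessClient;
    let shared = Arc::new(SidecarClient);

    // `&local` coerces to `&dyn Fetcher`; `&shared` works because
    // `Arc<SidecarClient>` implements `Fetcher` through the blanket impl.
    println!("{}", extract(&local, "https://example.com").unwrap());
    println!("{}", extract(&shared, "https://example.com").unwrap());
}
```

Because `extract` depends only on `&dyn Fetcher`, swapping the backend is a call-site decision and the extractor body stays untouched, which is the property the migration relies on.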
--- CHANGELOG.md | 14 +++ Cargo.lock | 15 +-- Cargo.toml | 2 +- crates/webclaw-fetch/Cargo.toml | 1 + crates/webclaw-fetch/src/client.rs | 30 +++++ crates/webclaw-fetch/src/cloud.rs | 8 +- .../src/extractors/amazon_product.rs | 4 +- crates/webclaw-fetch/src/extractors/arxiv.rs | 4 +- .../webclaw-fetch/src/extractors/crates_io.rs | 4 +- crates/webclaw-fetch/src/extractors/dev_to.rs | 4 +- .../src/extractors/docker_hub.rs | 4 +- .../src/extractors/ebay_listing.rs | 4 +- .../src/extractors/ecommerce_product.rs | 4 +- .../src/extractors/etsy_listing.rs | 4 +- .../src/extractors/github_issue.rs | 4 +- .../webclaw-fetch/src/extractors/github_pr.rs | 4 +- .../src/extractors/github_release.rs | 4 +- .../src/extractors/github_repo.rs | 4 +- .../src/extractors/hackernews.rs | 4 +- .../src/extractors/huggingface_dataset.rs | 4 +- .../src/extractors/huggingface_model.rs | 4 +- .../src/extractors/instagram_post.rs | 4 +- .../src/extractors/instagram_profile.rs | 6 +- .../src/extractors/linkedin_post.rs | 4 +- crates/webclaw-fetch/src/extractors/mod.rs | 6 +- crates/webclaw-fetch/src/extractors/npm.rs | 6 +- crates/webclaw-fetch/src/extractors/pypi.rs | 4 +- crates/webclaw-fetch/src/extractors/reddit.rs | 4 +- .../src/extractors/shopify_collection.rs | 4 +- .../src/extractors/shopify_product.rs | 4 +- .../src/extractors/stackoverflow.rs | 4 +- .../src/extractors/substack_post.rs | 6 +- .../src/extractors/trustpilot_reviews.rs | 4 +- .../src/extractors/woocommerce_product.rs | 4 +- .../src/extractors/youtube_video.rs | 4 +- crates/webclaw-fetch/src/fetcher.rs | 118 ++++++++++++++++++ crates/webclaw-fetch/src/lib.rs | 2 + 37 files changed, 241 insertions(+), 73 deletions(-) create mode 100644 crates/webclaw-fetch/src/fetcher.rs diff --git a/CHANGELOG.md b/CHANGELOG.md index 938a0b4..7cfd1e5 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,20 @@ All notable changes to webclaw are documented here. Format follows [Keep a Changelog](https://keepachangelog.com/). 
+## [0.5.1] — 2026-04-22 + +### Added +- **`webclaw_fetch::Fetcher` trait.** Vertical extractors now consume `&dyn Fetcher` instead of `&FetchClient` directly. The trait exposes three methods (`fetch`, `fetch_with_headers`, `cloud`) covering everything extractors need. Callers that already held a `FetchClient` keep working unchanged: `FetchClient` implements `Fetcher`, blanket impls cover `&T` and `Arc<T>`, so `&client` coerces to `&dyn Fetcher` automatically. + + The motivation is the split between OSS (wreq-backed, in-process TLS fingerprinting) and the production API server at api.webclaw.io (which cannot use in-process fingerprinting per the architecture rule, and must delegate HTTP through the Go tls-sidecar). Before this trait, adding vertical routes to the production server would have required importing wreq into its dependency graph, violating the separation. Now the production server can provide its own `TlsSidecarFetcher` implementation and pass it to the same extractor dispatcher the OSS server uses. + + Backwards compatible. No behavior change for CLI, MCP, or OSS self-host. + +### Changed +- All 28 extractor `extract()` signatures migrated from `client: &FetchClient` to `client: &dyn Fetcher`. The dispatcher functions (`extractors::dispatch_by_url`, `extractors::dispatch_by_name`) and the cloud escalation helpers (`cloud::smart_fetch`, `cloud::smart_fetch_html`) follow the same change. Tests and call sites are unchanged because `&FetchClient` auto-coerces. 
+ +--- + ## [0.5.0] — 2026-04-22 ### Added diff --git a/Cargo.lock b/Cargo.lock index 3603981..bad52e3 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -3199,7 +3199,7 @@ dependencies = [ [[package]] name = "webclaw-cli" -version = "0.5.0" +version = "0.5.1" dependencies = [ "clap", "dotenvy", @@ -3220,7 +3220,7 @@ dependencies = [ [[package]] name = "webclaw-core" -version = "0.5.0" +version = "0.5.1" dependencies = [ "ego-tree", "once_cell", @@ -3238,8 +3238,9 @@ dependencies = [ [[package]] name = "webclaw-fetch" -version = "0.5.0" +version = "0.5.1" dependencies = [ + "async-trait", "bytes", "calamine", "http", @@ -3262,7 +3263,7 @@ dependencies = [ [[package]] name = "webclaw-llm" -version = "0.5.0" +version = "0.5.1" dependencies = [ "async-trait", "reqwest", @@ -3275,7 +3276,7 @@ dependencies = [ [[package]] name = "webclaw-mcp" -version = "0.5.0" +version = "0.5.1" dependencies = [ "dirs", "dotenvy", @@ -3295,7 +3296,7 @@ dependencies = [ [[package]] name = "webclaw-pdf" -version = "0.5.0" +version = "0.5.1" dependencies = [ "pdf-extract", "thiserror", @@ -3304,7 +3305,7 @@ dependencies = [ [[package]] name = "webclaw-server" -version = "0.5.0" +version = "0.5.1" dependencies = [ "anyhow", "axum", diff --git a/Cargo.toml b/Cargo.toml index e8b2677..92152f2 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -3,7 +3,7 @@ resolver = "2" members = ["crates/*"] [workspace.package] -version = "0.5.0" +version = "0.5.1" edition = "2024" license = "AGPL-3.0" repository = "https://github.com/0xMassi/webclaw" diff --git a/crates/webclaw-fetch/Cargo.toml b/crates/webclaw-fetch/Cargo.toml index 2ec9b9d..a47ba7e 100644 --- a/crates/webclaw-fetch/Cargo.toml +++ b/crates/webclaw-fetch/Cargo.toml @@ -12,6 +12,7 @@ serde = { workspace = true } thiserror = { workspace = true } tracing = { workspace = true } tokio = { workspace = true } +async-trait = "0.1" wreq = { version = "6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] } http = "1" bytes = "1" diff --git 
a/crates/webclaw-fetch/src/client.rs b/crates/webclaw-fetch/src/client.rs index 7ce16d7..8fd5ff5 100644 --- a/crates/webclaw-fetch/src/client.rs +++ b/crates/webclaw-fetch/src/client.rs @@ -599,6 +599,36 @@ impl FetchClient { } } +// --------------------------------------------------------------------------- +// Fetcher trait implementation +// +// Vertical extractors consume the [`crate::fetcher::Fetcher`] trait +// rather than `FetchClient` directly, which is what lets the production +// API server swap in a tls-sidecar-backed implementation without +// pulling wreq into its dependency graph. For everyone else (CLI, MCP, +// self-hosted OSS server) this impl means "pass the FetchClient you +// already have; nothing changes". +// --------------------------------------------------------------------------- + +#[async_trait::async_trait] +impl crate::fetcher::Fetcher for FetchClient { + async fn fetch(&self, url: &str) -> Result { + FetchClient::fetch(self, url).await + } + + async fn fetch_with_headers( + &self, + url: &str, + headers: &[(&str, &str)], + ) -> Result { + FetchClient::fetch_with_headers(self, url, headers).await + } + + fn cloud(&self) -> Option<&crate::cloud::CloudClient> { + FetchClient::cloud(self) + } +} + /// Collect the browser variants to use based on the browser profile. fn collect_variants(profile: &BrowserProfile) -> Vec { match profile { diff --git a/crates/webclaw-fetch/src/cloud.rs b/crates/webclaw-fetch/src/cloud.rs index c70a75e..3bad383 100644 --- a/crates/webclaw-fetch/src/cloud.rs +++ b/crates/webclaw-fetch/src/cloud.rs @@ -66,7 +66,9 @@ use serde_json::{Value, json}; use thiserror::Error; use tracing::{debug, info, warn}; -use crate::client::FetchClient; +// Client type isn't needed here anymore now that smart_fetch* takes +// `&dyn Fetcher`. Kept as a comment for historical context: this +// module used to import FetchClient directly before v0.5.1. 
// --------------------------------------------------------------------------- // URLs + defaults — keep in one place so "change the signup link" is a @@ -506,7 +508,7 @@ pub enum SmartFetchResult { /// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed /// [`CloudError`] so you can render precise UX. pub async fn smart_fetch( - client: &FetchClient, + client: &dyn crate::fetcher::Fetcher, cloud: Option<&CloudClient>, url: &str, include_selectors: &[String], @@ -613,7 +615,7 @@ pub struct FetchedHtml { /// Designed for the vertical-extractor pattern where the caller has /// its own parser and just needs bytes. pub async fn smart_fetch_html( - client: &FetchClient, + client: &dyn crate::fetcher::Fetcher, cloud: Option<&CloudClient>, url: &str, ) -> Result { diff --git a/crates/webclaw-fetch/src/extractors/amazon_product.rs b/crates/webclaw-fetch/src/extractors/amazon_product.rs index 7f022fb..fed6b9f 100644 --- a/crates/webclaw-fetch/src/extractors/amazon_product.rs +++ b/crates/webclaw-fetch/src/extractors/amazon_product.rs @@ -32,9 +32,9 @@ use regex::Regex; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::cloud::{self, CloudError}; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "amazon_product", @@ -59,7 +59,7 @@ pub fn matches(url: &str) -> bool { parse_asin(url).is_some() } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let asin = parse_asin(url) .ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?; diff --git a/crates/webclaw-fetch/src/extractors/arxiv.rs b/crates/webclaw-fetch/src/extractors/arxiv.rs index cbcb3d1..c2b85c0 100644 --- a/crates/webclaw-fetch/src/extractors/arxiv.rs +++ b/crates/webclaw-fetch/src/extractors/arxiv.rs @@ -10,8 +10,8 @@ use quick_xml::events::Event; use serde_json::{Value, json}; use 
super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "arxiv", @@ -32,7 +32,7 @@ pub fn matches(url: &str) -> bool { url.contains("/abs/") || url.contains("/pdf/") } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let id = parse_id(url) .ok_or_else(|| FetchError::Build(format!("arxiv: cannot parse id from '{url}'")))?; diff --git a/crates/webclaw-fetch/src/extractors/crates_io.rs b/crates/webclaw-fetch/src/extractors/crates_io.rs index 915b1c3..719579f 100644 --- a/crates/webclaw-fetch/src/extractors/crates_io.rs +++ b/crates/webclaw-fetch/src/extractors/crates_io.rs @@ -9,8 +9,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "crates_io", @@ -30,7 +30,7 @@ pub fn matches(url: &str) -> bool { url.contains("/crates/") } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let name = parse_name(url) .ok_or_else(|| FetchError::Build(format!("crates.io: cannot parse name from '{url}'")))?; diff --git a/crates/webclaw-fetch/src/extractors/dev_to.rs b/crates/webclaw-fetch/src/extractors/dev_to.rs index 49372ce..86199d8 100644 --- a/crates/webclaw-fetch/src/extractors/dev_to.rs +++ b/crates/webclaw-fetch/src/extractors/dev_to.rs @@ -8,8 +8,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "dev_to", @@ -61,7 +61,7 @@ const RESERVED_FIRST_SEGS: &[&str] = &[ "t", ]; -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub 
async fn extract(client: &dyn Fetcher, url: &str) -> Result { let (username, slug) = parse_username_slug(url).ok_or_else(|| { FetchError::Build(format!("dev_to: cannot parse username/slug from '{url}'")) })?; diff --git a/crates/webclaw-fetch/src/extractors/docker_hub.rs b/crates/webclaw-fetch/src/extractors/docker_hub.rs index 15c928c..bce9315 100644 --- a/crates/webclaw-fetch/src/extractors/docker_hub.rs +++ b/crates/webclaw-fetch/src/extractors/docker_hub.rs @@ -8,8 +8,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "docker_hub", @@ -29,7 +29,7 @@ pub fn matches(url: &str) -> bool { url.contains("/_/") || url.contains("/r/") } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let (namespace, name) = parse_repo(url) .ok_or_else(|| FetchError::Build(format!("docker_hub: cannot parse repo from '{url}'")))?; diff --git a/crates/webclaw-fetch/src/extractors/ebay_listing.rs b/crates/webclaw-fetch/src/extractors/ebay_listing.rs index 14c36ef..dbc85ab 100644 --- a/crates/webclaw-fetch/src/extractors/ebay_listing.rs +++ b/crates/webclaw-fetch/src/extractors/ebay_listing.rs @@ -14,9 +14,9 @@ use regex::Regex; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::cloud::{self, CloudError}; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "ebay_listing", @@ -39,7 +39,7 @@ pub fn matches(url: &str) -> bool { parse_item_id(url).is_some() } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let item_id = parse_item_id(url) .ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?; diff --git 
a/crates/webclaw-fetch/src/extractors/ecommerce_product.rs b/crates/webclaw-fetch/src/extractors/ecommerce_product.rs index 099a8fb..019fb68 100644 --- a/crates/webclaw-fetch/src/extractors/ecommerce_product.rs +++ b/crates/webclaw-fetch/src/extractors/ecommerce_product.rs @@ -42,8 +42,8 @@ use regex::Regex; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "ecommerce_product", @@ -69,7 +69,7 @@ pub fn matches(url: &str) -> bool { !host_of(url).is_empty() } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let resp = client.fetch(url).await?; if !(200..300).contains(&resp.status) { return Err(FetchError::Build(format!( diff --git a/crates/webclaw-fetch/src/extractors/etsy_listing.rs b/crates/webclaw-fetch/src/extractors/etsy_listing.rs index 060c3b6..ea9ed0b 100644 --- a/crates/webclaw-fetch/src/extractors/etsy_listing.rs +++ b/crates/webclaw-fetch/src/extractors/etsy_listing.rs @@ -26,9 +26,9 @@ use regex::Regex; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::cloud::{self, CloudError}; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "etsy_listing", @@ -49,7 +49,7 @@ pub fn matches(url: &str) -> bool { parse_listing_id(url).is_some() } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let listing_id = parse_listing_id(url) .ok_or_else(|| FetchError::Build(format!("etsy_listing: no listing id in '{url}'")))?; diff --git a/crates/webclaw-fetch/src/extractors/github_issue.rs b/crates/webclaw-fetch/src/extractors/github_issue.rs index 436faa9..9a64f21 100644 --- a/crates/webclaw-fetch/src/extractors/github_issue.rs +++ 
b/crates/webclaw-fetch/src/extractors/github_issue.rs @@ -10,8 +10,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "github_issue", @@ -34,7 +34,7 @@ pub fn matches(url: &str) -> bool { parse_issue(url).is_some() } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let (owner, repo, number) = parse_issue(url).ok_or_else(|| { FetchError::Build(format!("github_issue: cannot parse issue URL '{url}'")) })?; diff --git a/crates/webclaw-fetch/src/extractors/github_pr.rs b/crates/webclaw-fetch/src/extractors/github_pr.rs index 9d4b95a..266d3cd 100644 --- a/crates/webclaw-fetch/src/extractors/github_pr.rs +++ b/crates/webclaw-fetch/src/extractors/github_pr.rs @@ -9,8 +9,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "github_pr", @@ -33,7 +33,7 @@ pub fn matches(url: &str) -> bool { parse_pr(url).is_some() } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let (owner, repo, number) = parse_pr(url).ok_or_else(|| { FetchError::Build(format!("github_pr: cannot parse pull-request URL '{url}'")) })?; diff --git a/crates/webclaw-fetch/src/extractors/github_release.rs b/crates/webclaw-fetch/src/extractors/github_release.rs index b019550..7699d09 100644 --- a/crates/webclaw-fetch/src/extractors/github_release.rs +++ b/crates/webclaw-fetch/src/extractors/github_release.rs @@ -8,8 +8,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use 
crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "github_release", @@ -32,7 +32,7 @@ pub fn matches(url: &str) -> bool { parse_release(url).is_some() } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let (owner, repo, tag) = parse_release(url).ok_or_else(|| { FetchError::Build(format!("github_release: cannot parse release URL '{url}'")) })?; diff --git a/crates/webclaw-fetch/src/extractors/github_repo.rs b/crates/webclaw-fetch/src/extractors/github_repo.rs index d89d06a..2a62aa3 100644 --- a/crates/webclaw-fetch/src/extractors/github_repo.rs +++ b/crates/webclaw-fetch/src/extractors/github_repo.rs @@ -10,8 +10,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "github_repo", @@ -70,7 +70,7 @@ const RESERVED_OWNERS: &[&str] = &[ "about", ]; -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let (owner, repo) = parse_owner_repo(url).ok_or_else(|| { FetchError::Build(format!("github_repo: cannot parse owner/repo from '{url}'")) })?; diff --git a/crates/webclaw-fetch/src/extractors/hackernews.rs b/crates/webclaw-fetch/src/extractors/hackernews.rs index 7adaa1c..91d4520 100644 --- a/crates/webclaw-fetch/src/extractors/hackernews.rs +++ b/crates/webclaw-fetch/src/extractors/hackernews.rs @@ -10,8 +10,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "hackernews", @@ -40,7 +40,7 @@ pub fn matches(url: &str) -> bool { false } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn 
extract(client: &dyn Fetcher, url: &str) -> Result { let id = parse_item_id(url).ok_or_else(|| { FetchError::Build(format!("hackernews: cannot parse item id from '{url}'")) })?; diff --git a/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs b/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs index cb1f524..e1f84f7 100644 --- a/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs +++ b/crates/webclaw-fetch/src/extractors/huggingface_dataset.rs @@ -7,8 +7,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "huggingface_dataset", @@ -38,7 +38,7 @@ pub fn matches(url: &str) -> bool { segs.first().copied() == Some("datasets") && (segs.len() == 2 || segs.len() == 3) } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let dataset_path = parse_dataset_path(url).ok_or_else(|| { FetchError::Build(format!( "hf_dataset: cannot parse dataset path from '{url}'" diff --git a/crates/webclaw-fetch/src/extractors/huggingface_model.rs b/crates/webclaw-fetch/src/extractors/huggingface_model.rs index decc68a..4c549e0 100644 --- a/crates/webclaw-fetch/src/extractors/huggingface_model.rs +++ b/crates/webclaw-fetch/src/extractors/huggingface_model.rs @@ -9,8 +9,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "huggingface_model", @@ -61,7 +61,7 @@ const RESERVED_NAMESPACES: &[&str] = &[ "search", ]; -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let (owner, name) = parse_owner_name(url).ok_or_else(|| { FetchError::Build(format!("hf model: 
cannot parse owner/name from '{url}'")) })?; diff --git a/crates/webclaw-fetch/src/extractors/instagram_post.rs b/crates/webclaw-fetch/src/extractors/instagram_post.rs index 05c9b8a..8847e36 100644 --- a/crates/webclaw-fetch/src/extractors/instagram_post.rs +++ b/crates/webclaw-fetch/src/extractors/instagram_post.rs @@ -11,8 +11,8 @@ use serde_json::{Value, json}; use std::sync::OnceLock; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "instagram_post", @@ -33,7 +33,7 @@ pub fn matches(url: &str) -> bool { parse_shortcode(url).is_some() } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let (kind, shortcode) = parse_shortcode(url).ok_or_else(|| { FetchError::Build(format!( "instagram_post: cannot parse shortcode from '{url}'" diff --git a/crates/webclaw-fetch/src/extractors/instagram_profile.rs b/crates/webclaw-fetch/src/extractors/instagram_profile.rs index 4524090..9a92b4c 100644 --- a/crates/webclaw-fetch/src/extractors/instagram_profile.rs +++ b/crates/webclaw-fetch/src/extractors/instagram_profile.rs @@ -23,8 +23,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "instagram_profile", @@ -80,7 +80,7 @@ const RESERVED: &[&str] = &[ "signup", ]; -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let username = parse_username(url).ok_or_else(|| { FetchError::Build(format!( "instagram_profile: cannot parse username from '{url}'" @@ -198,7 +198,7 @@ fn classify(n: &MediaNode) -> &'static str { /// pull whatever OG tags we can. 
Returns less data and explicitly /// flags `data_completeness: "og_only"` so callers know. async fn og_fallback( - client: &FetchClient, + client: &dyn Fetcher, username: &str, original_url: &str, api_status: u16, diff --git a/crates/webclaw-fetch/src/extractors/linkedin_post.rs b/crates/webclaw-fetch/src/extractors/linkedin_post.rs index 2d6a399..ed7e07b 100644 --- a/crates/webclaw-fetch/src/extractors/linkedin_post.rs +++ b/crates/webclaw-fetch/src/extractors/linkedin_post.rs @@ -14,8 +14,8 @@ use serde_json::{Value, json}; use std::sync::OnceLock; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "linkedin_post", @@ -36,7 +36,7 @@ pub fn matches(url: &str) -> bool { url.contains("/feed/update/urn:li:") || url.contains("/posts/") } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let urn = extract_urn(url).ok_or_else(|| { FetchError::Build(format!( "linkedin_post: cannot extract URN from '{url}' (expected /feed/update/urn:li:... or /posts/{{slug}}-{{id}})" diff --git a/crates/webclaw-fetch/src/extractors/mod.rs b/crates/webclaw-fetch/src/extractors/mod.rs index 5d06158..91ef8d0 100644 --- a/crates/webclaw-fetch/src/extractors/mod.rs +++ b/crates/webclaw-fetch/src/extractors/mod.rs @@ -46,8 +46,8 @@ pub mod youtube_video; use serde::Serialize; use serde_json::Value; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; /// Public catalog entry for `/v1/extractors`. Stable shape — clients /// rely on `name` to pick the right `/v1/scrape/{name}` route. @@ -102,7 +102,7 @@ pub fn list() -> Vec { /// one that claims the URL. Used by `/v1/scrape` when the caller doesn't /// pick a vertical explicitly. 
pub async fn dispatch_by_url( - client: &FetchClient, + client: &dyn Fetcher, url: &str, ) -> Option> { if reddit::matches(url) { @@ -281,7 +281,7 @@ pub async fn dispatch_by_url( /// users get a clear "wrong route" error instead of a confusing parse /// failure deep in the extractor. pub async fn dispatch_by_name( - client: &FetchClient, + client: &dyn Fetcher, name: &str, url: &str, ) -> Result { diff --git a/crates/webclaw-fetch/src/extractors/npm.rs b/crates/webclaw-fetch/src/extractors/npm.rs index 4343890..f84da0e 100644 --- a/crates/webclaw-fetch/src/extractors/npm.rs +++ b/crates/webclaw-fetch/src/extractors/npm.rs @@ -13,8 +13,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "npm", @@ -31,7 +31,7 @@ pub fn matches(url: &str) -> bool { url.contains("/package/") } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let name = parse_name(url) .ok_or_else(|| FetchError::Build(format!("npm: cannot parse name from '{url}'")))?; @@ -94,7 +94,7 @@ pub async fn extract(client: &FetchClient, url: &str) -> Result Result { +async fn fetch_weekly_downloads(client: &dyn Fetcher, name: &str) -> Result { let url = format!( "https://api.npmjs.org/downloads/point/last-week/{}", urlencode_segment(name) diff --git a/crates/webclaw-fetch/src/extractors/pypi.rs b/crates/webclaw-fetch/src/extractors/pypi.rs index f6b7c64..33a4d1c 100644 --- a/crates/webclaw-fetch/src/extractors/pypi.rs +++ b/crates/webclaw-fetch/src/extractors/pypi.rs @@ -9,8 +9,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "pypi", @@ -30,7 +30,7 @@ pub fn 
matches(url: &str) -> bool { url.contains("/project/") } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let (name, version) = parse_project(url).ok_or_else(|| { FetchError::Build(format!("pypi: cannot parse package name from '{url}'")) })?; diff --git a/crates/webclaw-fetch/src/extractors/reddit.rs b/crates/webclaw-fetch/src/extractors/reddit.rs index 2d084dc..13cdc16 100644 --- a/crates/webclaw-fetch/src/extractors/reddit.rs +++ b/crates/webclaw-fetch/src/extractors/reddit.rs @@ -9,8 +9,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "reddit", @@ -32,7 +32,7 @@ pub fn matches(url: &str) -> bool { is_reddit_host && url.contains("/comments/") } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let json_url = build_json_url(url); let resp = client.fetch(&json_url).await?; if resp.status != 200 { diff --git a/crates/webclaw-fetch/src/extractors/shopify_collection.rs b/crates/webclaw-fetch/src/extractors/shopify_collection.rs index 095f7dd..23d57c6 100644 --- a/crates/webclaw-fetch/src/extractors/shopify_collection.rs +++ b/crates/webclaw-fetch/src/extractors/shopify_collection.rs @@ -15,8 +15,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "shopify_collection", @@ -49,7 +49,7 @@ const NON_SHOPIFY_HOSTS: &[&str] = &[ "github.com", ]; -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let (coll_meta_url, coll_products_url) = build_json_urls(url); // Step 1: 
collection metadata. Shopify returns 200 on missing diff --git a/crates/webclaw-fetch/src/extractors/shopify_product.rs b/crates/webclaw-fetch/src/extractors/shopify_product.rs index 19f0438..b52ef36 100644 --- a/crates/webclaw-fetch/src/extractors/shopify_product.rs +++ b/crates/webclaw-fetch/src/extractors/shopify_product.rs @@ -21,8 +21,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "shopify_product", @@ -65,7 +65,7 @@ const NON_SHOPIFY_HOSTS: &[&str] = &[ "github.com", // /products is a marketing page ]; -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let json_url = build_json_url(url); let resp = client.fetch(&json_url).await?; if resp.status == 404 { diff --git a/crates/webclaw-fetch/src/extractors/stackoverflow.rs b/crates/webclaw-fetch/src/extractors/stackoverflow.rs index d74b511..03597a3 100644 --- a/crates/webclaw-fetch/src/extractors/stackoverflow.rs +++ b/crates/webclaw-fetch/src/extractors/stackoverflow.rs @@ -13,8 +13,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "stackoverflow", @@ -31,7 +31,7 @@ pub fn matches(url: &str) -> bool { parse_question_id(url).is_some() } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let id = parse_question_id(url).ok_or_else(|| { FetchError::Build(format!( "stackoverflow: cannot parse question id from '{url}'" diff --git a/crates/webclaw-fetch/src/extractors/substack_post.rs b/crates/webclaw-fetch/src/extractors/substack_post.rs index 0571f3d..c5b5019 100644 --- 
a/crates/webclaw-fetch/src/extractors/substack_post.rs +++ b/crates/webclaw-fetch/src/extractors/substack_post.rs @@ -28,9 +28,9 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::cloud::{self, CloudError}; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "substack_post", @@ -49,7 +49,7 @@ pub fn matches(url: &str) -> bool { url.contains("/p/") } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let slug = parse_slug(url).ok_or_else(|| { FetchError::Build(format!("substack_post: cannot parse slug from '{url}'")) })?; @@ -149,7 +149,7 @@ fn build_api_payload(url: &str, api_url: &str, slug: &str, p: Post) -> Value { // --------------------------------------------------------------------------- async fn html_fallback( - client: &FetchClient, + client: &dyn Fetcher, url: &str, api_url: &str, slug: &str, diff --git a/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs b/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs index ae97c67..8b77a29 100644 --- a/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs +++ b/crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs @@ -32,9 +32,9 @@ use regex::Regex; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::cloud::{self, CloudError}; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "trustpilot_reviews", @@ -51,7 +51,7 @@ pub fn matches(url: &str) -> bool { url.contains("/review/") } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let fetched = cloud::smart_fetch_html(client, client.cloud(), url) .await .map_err(cloud_to_fetch_err)?; diff --git 
a/crates/webclaw-fetch/src/extractors/woocommerce_product.rs b/crates/webclaw-fetch/src/extractors/woocommerce_product.rs index 73f1109..db6dd78 100644 --- a/crates/webclaw-fetch/src/extractors/woocommerce_product.rs +++ b/crates/webclaw-fetch/src/extractors/woocommerce_product.rs @@ -15,8 +15,8 @@ use serde::Deserialize; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "woocommerce_product", @@ -42,7 +42,7 @@ pub fn matches(url: &str) -> bool { || url.contains("/produit/") // common fr locale } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let slug = parse_slug(url).ok_or_else(|| { FetchError::Build(format!( "woocommerce_product: cannot parse slug from '{url}'" diff --git a/crates/webclaw-fetch/src/extractors/youtube_video.rs b/crates/webclaw-fetch/src/extractors/youtube_video.rs index 81079f4..2551ff8 100644 --- a/crates/webclaw-fetch/src/extractors/youtube_video.rs +++ b/crates/webclaw-fetch/src/extractors/youtube_video.rs @@ -25,8 +25,8 @@ use regex::Regex; use serde_json::{Value, json}; use super::ExtractorInfo; -use crate::client::FetchClient; use crate::error::FetchError; +use crate::fetcher::Fetcher; pub const INFO: ExtractorInfo = ExtractorInfo { name: "youtube_video", @@ -45,7 +45,7 @@ pub fn matches(url: &str) -> bool { || url.contains("youtube-nocookie.com/embed/") } -pub async fn extract(client: &FetchClient, url: &str) -> Result { +pub async fn extract(client: &dyn Fetcher, url: &str) -> Result { let video_id = parse_video_id(url).ok_or_else(|| { FetchError::Build(format!("youtube_video: cannot parse video id from '{url}'")) })?; diff --git a/crates/webclaw-fetch/src/fetcher.rs b/crates/webclaw-fetch/src/fetcher.rs new file mode 100644 index 0000000..fabcf44 --- /dev/null +++ b/crates/webclaw-fetch/src/fetcher.rs 
@@ -0,0 +1,118 @@ +//! Pluggable fetcher abstraction for vertical extractors. +//! +//! Extractors call the network through this trait instead of hard- +//! coding [`FetchClient`]. The OSS CLI / MCP / self-hosted server all +//! pass `&FetchClient` (wreq-backed BoringSSL). The production API +//! server, which must not use in-process TLS fingerprinting, provides +//! its own implementation that routes through the Go tls-sidecar. +//! +//! Both paths expose the same [`FetchResult`] shape and the same +//! optional cloud-escalation client, so extractor logic stays +//! identical across environments. +//! +//! ## Choosing an implementation +//! +//! - CLI, MCP, self-hosted `webclaw-server`: build a [`FetchClient`] +//! with [`FetchClient::with_cloud`] to attach cloud fallback, pass +//! it to extractors as `&client`. +//! - `api.webclaw.io` production server: build a `TlsSidecarFetcher` +//! (in `server/src/engine/`) that delegates to `engine::tls_client` +//! and wraps it in `Arc` for handler injection. +//! +//! ## Why a trait and not a free function +//! +//! Extractors need state beyond a single fetch: the cloud client for +//! antibot escalation, and in the future per-user proxy pools, tenant +//! headers, circuit breakers. A trait keeps that state encapsulated +//! behind the fetch interface instead of threading it through every +//! extractor signature. + +use async_trait::async_trait; + +use crate::client::FetchResult; +use crate::cloud::CloudClient; +use crate::error::FetchError; + +/// HTTP fetch surface used by vertical extractors. +/// +/// Implementations must be `Send + Sync` because extractor dispatchers +/// run them inside tokio tasks, potentially across many requests. +#[async_trait] +pub trait Fetcher: Send + Sync { + /// Fetch a URL and return the raw response body + metadata. The + /// body is in `FetchResult::html` regardless of the actual content + /// type — JSON API endpoints put JSON there, HTML pages put HTML. 
+ /// Extractors branch on response status and body shape. + async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError>; + + /// Fetch with additional request headers. Needed for endpoints + /// that authenticate via a specific header (Instagram's + /// `x-ig-app-id`, for example). Default implementation routes to + /// [`Self::fetch`] so implementers without header support stay + /// functional, though the `Option` field they'd set won't + /// be populated on the request. + async fn fetch_with_headers( + &self, + url: &str, + _headers: &[(&str, &str)], + ) -> Result<FetchResult, FetchError> { + self.fetch(url).await + } + + /// Optional cloud-escalation client for antibot bypass. Returning + /// `Some` tells extractors they can call into the hosted API when + /// local fetch hits a challenge page. Returning `None` makes + /// cloud-gated extractors emit [`CloudError::NotConfigured`] with + /// an actionable signup link. + /// + /// The default implementation returns `None` because not every + /// deployment wants cloud fallback (self-hosts that don't have a + /// webclaw.io subscription, for instance). + /// + /// [`CloudError::NotConfigured`]: crate::cloud::CloudError::NotConfigured + fn cloud(&self) -> Option<&CloudClient> { + None + } +} + +// --------------------------------------------------------------------------- +// Blanket impls: make `&T` and `Arc<T>` behave like the wrapped `T`.
+// --------------------------------------------------------------------------- + +#[async_trait] +impl<T: Fetcher + ?Sized> Fetcher for &T { + async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> { + (**self).fetch(url).await + } + + async fn fetch_with_headers( + &self, + url: &str, + headers: &[(&str, &str)], + ) -> Result<FetchResult, FetchError> { + (**self).fetch_with_headers(url, headers).await + } + + fn cloud(&self) -> Option<&CloudClient> { + (**self).cloud() + } +} + +#[async_trait] +impl<T: Fetcher + ?Sized> Fetcher for std::sync::Arc<T> { + async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> { + (**self).fetch(url).await + } + + async fn fetch_with_headers( + &self, + url: &str, + headers: &[(&str, &str)], + ) -> Result<FetchResult, FetchError> { + (**self).fetch_with_headers(url, headers).await + } + + fn cloud(&self) -> Option<&CloudClient> { + (**self).cloud() + } +} diff --git a/crates/webclaw-fetch/src/lib.rs b/crates/webclaw-fetch/src/lib.rs index 3a4781e..83664a1 100644 --- a/crates/webclaw-fetch/src/lib.rs +++ b/crates/webclaw-fetch/src/lib.rs @@ -8,6 +8,7 @@ pub mod crawler; pub mod document; pub mod error; pub mod extractors; +pub mod fetcher; pub mod linkedin; pub mod proxy; pub mod reddit; @@ -18,6 +19,7 @@ pub use browser::BrowserProfile; pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult}; pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult}; pub use error::FetchError; +pub use fetcher::Fetcher; pub use http::HeaderMap; pub use proxy::{parse_proxy_file, parse_proxy_line}; pub use sitemap::SitemapEntry; From 0daa2fec1a3b32c1b00a58fa4ae498266910abbb Mon Sep 17 00:00:00 2001 From: Valerio Date: Wed, 22 Apr 2026 21:41:15 +0200 Subject: [PATCH 03/12] feat(cli+mcp): vertical extractor support (28 extractors discoverable + callable) Wires the vertical extractor catalog into both the CLI and the MCP server so users don't have to hit the HTTP API to invoke them. Same semantics as `/v1/scrape/{vertical}` + `/v1/extractors`.
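A minimal sketch of the name-based route check those semantics imply, assuming a simplified catalog. The `DispatchError` fields, the `check_route` helper, and the two matchers are illustrative stand-ins, not the crate's actual code; only the variant names and the mismatch message wording follow this patch.

```rust
use std::fmt;

/// Errors a name-based dispatcher can report. The variant names follow the
/// ones mentioned in the CLI comments; the fields are assumptions.
#[derive(Debug, PartialEq)]
pub enum DispatchError {
    UnknownVertical(String),
    UrlMismatch { vertical: String, url: String },
}

impl fmt::Display for DispatchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DispatchError::UnknownVertical(name) => {
                write!(f, "unknown vertical '{name}'")
            }
            DispatchError::UrlMismatch { vertical, url } => {
                write!(f, "URL '{url}' does not match the '{vertical}' extractor")
            }
        }
    }
}

/// Hypothetical matcher signature: each vertical claims URLs it can handle.
type Matcher = fn(&str) -> bool;

// Simplified stand-ins; the real `matches` functions may be stricter.
fn pypi_matches(url: &str) -> bool {
    url.contains("pypi.org") && url.contains("/project/")
}

fn npm_matches(url: &str) -> bool {
    url.contains("npmjs.com") && url.contains("/package/")
}

/// Hypothetical two-entry catalog keyed by the stable extractor name.
const CATALOG: &[(&str, Matcher)] = &[
    ("pypi", pypi_matches as Matcher),
    ("npm", npm_matches as Matcher),
];

/// Resolve the vertical by name, then verify the URL belongs to it, so a
/// wrong route fails early instead of deep inside the extractor.
pub fn check_route(name: &str, url: &str) -> Result<(), DispatchError> {
    let (_, is_match) = CATALOG
        .iter()
        .find(|(n, _)| *n == name)
        .ok_or_else(|| DispatchError::UnknownVertical(name.to_string()))?;
    if is_match(url) {
        Ok(())
    } else {
        Err(DispatchError::UrlMismatch {
            vertical: name.to_string(),
            url: url.to_string(),
        })
    }
}
```

Checking the route before any network call is what lets a caller report the mismatch on stderr and exit non-zero without having fetched anything.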
CLI (webclaw-cli): - New subcommand `webclaw extractors` lists all 28 extractors with name, label, and sample URL. `--json` flag emits the full catalog as machine-readable JSON. - New subcommand `webclaw vertical <name> <url>` runs a specific extractor and prints typed JSON. Pretty-printed by default; `--raw` for single-line. Exits 1 with a clear "URL does not match" error on mismatch. - FetchClient built with Firefox profile + cloud fallback attached when WEBCLAW_API_KEY is set, so antibot-gated verticals escalate. MCP (webclaw-mcp): - New tool `list_extractors` (no args) returns the catalog as pretty-printed JSON for in-session discovery. - New tool `vertical_scrape` takes `{name, url}` and returns typed JSON. Reuses the long-lived self.fetch_client. - Tool count goes from 10 to 12. Server-info instruction string updated accordingly. Tests: 215 passing, clippy clean. Manual surface-tested end-to-end: CLI prints real Reddit/github/pypi data; MCP JSON-RPC session returns 28-entry catalog + typed responses for pypi/requests + rust-lang/rust in 200-400ms. Version bumped to 0.5.2 (minor for API additions, backwards compatible). --- CHANGELOG.md | 14 +++++ Cargo.lock | 14 ++--- Cargo.toml | 2 +- crates/webclaw-cli/src/main.rs | 105 +++++++++++++++++++++++++++++++ crates/webclaw-mcp/src/server.rs | 47 +++++++++++++- crates/webclaw-mcp/src/tools.rs | 17 +++++ 6 files changed, 190 insertions(+), 9 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 7cfd1e5..ef2d2f2 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,20 @@ All notable changes to webclaw are documented here. Format follows [Keep a Changelog](https://keepachangelog.com/). +## [0.5.2] — 2026-04-22 + +### Added +- **`webclaw vertical <name> <url>` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`.
URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1. + +- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick. + +- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set. + +### Changed +- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded 10). Also `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`. 
+ +--- + ## [0.5.1] — 2026-04-22 ### Added diff --git a/Cargo.lock b/Cargo.lock index bad52e3..ed0f4fa 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -3199,7 +3199,7 @@ dependencies = [ [[package]] name = "webclaw-cli" -version = "0.5.1" +version = "0.5.2" dependencies = [ "clap", "dotenvy", @@ -3220,7 +3220,7 @@ dependencies = [ [[package]] name = "webclaw-core" -version = "0.5.1" +version = "0.5.2" dependencies = [ "ego-tree", "once_cell", @@ -3238,7 +3238,7 @@ dependencies = [ [[package]] name = "webclaw-fetch" -version = "0.5.1" +version = "0.5.2" dependencies = [ "async-trait", "bytes", @@ -3263,7 +3263,7 @@ dependencies = [ [[package]] name = "webclaw-llm" -version = "0.5.1" +version = "0.5.2" dependencies = [ "async-trait", "reqwest", @@ -3276,7 +3276,7 @@ dependencies = [ [[package]] name = "webclaw-mcp" -version = "0.5.1" +version = "0.5.2" dependencies = [ "dirs", "dotenvy", @@ -3296,7 +3296,7 @@ dependencies = [ [[package]] name = "webclaw-pdf" -version = "0.5.1" +version = "0.5.2" dependencies = [ "pdf-extract", "thiserror", @@ -3305,7 +3305,7 @@ dependencies = [ [[package]] name = "webclaw-server" -version = "0.5.1" +version = "0.5.2" dependencies = [ "anyhow", "axum", diff --git a/Cargo.toml b/Cargo.toml index 92152f2..a286972 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -3,7 +3,7 @@ resolver = "2" members = ["crates/*"] [workspace.package] -version = "0.5.1" +version = "0.5.2" edition = "2024" license = "AGPL-3.0" repository = "https://github.com/0xMassi/webclaw" diff --git a/crates/webclaw-cli/src/main.rs b/crates/webclaw-cli/src/main.rs index 91af384..a12cae1 100644 --- a/crates/webclaw-cli/src/main.rs +++ b/crates/webclaw-cli/src/main.rs @@ -308,6 +308,34 @@ enum Commands { #[arg(long)] facts: Option, }, + + /// List all vertical extractors in the catalog. + /// + /// Each entry has a stable `name` (usable with `webclaw vertical `), + /// a human-friendly label, a one-line description, and the URL + /// patterns it claims. 
The same data is served by `/v1/extractors` + /// when running the REST API. + Extractors { + /// Emit JSON instead of a human-friendly table. + #[arg(long)] + json: bool, + }, + + /// Run a vertical extractor by name. Returns typed JSON with fields + /// specific to the target site (title, price, author, rating, etc.) + /// rather than generic markdown. + /// + /// Use `webclaw extractors` to see the full list. Example: + /// `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/`. + Vertical { + /// Vertical name (e.g. `reddit`, `github_repo`, `trustpilot_reviews`). + name: String, + /// URL to extract. + url: String, + /// Emit compact JSON (single line). Default is pretty-printed. + #[arg(long)] + raw: bool, + }, } #[derive(Clone, ValueEnum)] @@ -2288,6 +2316,83 @@ async fn main() { } return; } + Commands::Extractors { json } => { + let entries = webclaw_fetch::extractors::list(); + if *json { + // Serialize with serde_json. ExtractorInfo derives + // Serialize so this is a one-liner. + match serde_json::to_string_pretty(&entries) { + Ok(s) => println!("{s}"), + Err(e) => { + eprintln!("error: failed to serialise catalog: {e}"); + process::exit(1); + } + } + } else { + // Human-friendly table: NAME + LABEL + one URL + // pattern sample. Keeps the output scannable on a + // narrow terminal. + println!("{} vertical extractors available:\n", entries.len()); + let name_w = entries.iter().map(|e| e.name.len()).max().unwrap_or(0); + let label_w = entries.iter().map(|e| e.label.len()).max().unwrap_or(0); + for e in &entries { + let pattern_sample = e.url_patterns.first().copied().unwrap_or(""); + println!( + " {: "); + } + return; + } + Commands::Vertical { name, url, raw } => { + // Build a FetchClient with cloud fallback attached when + // WEBCLAW_API_KEY is set. Antibot-gated verticals + // (amazon, ebay, etsy, trustpilot) need this to escalate + // on bot protection. 
+ let fetch_cfg = webclaw_fetch::FetchConfig { + browser: webclaw_fetch::BrowserProfile::Firefox, + ..webclaw_fetch::FetchConfig::default() + }; + let mut client = match webclaw_fetch::FetchClient::new(fetch_cfg) { + Ok(c) => c, + Err(e) => { + eprintln!("error: failed to build fetch client: {e}"); + process::exit(1); + } + }; + if let Some(cloud) = webclaw_fetch::cloud::CloudClient::from_env() { + client = client.with_cloud(cloud); + } + match webclaw_fetch::extractors::dispatch_by_name(&client, name, url).await { + Ok(data) => { + let rendered = if *raw { + serde_json::to_string(&data) + } else { + serde_json::to_string_pretty(&data) + }; + match rendered { + Ok(s) => println!("{s}"), + Err(e) => { + eprintln!("error: JSON encode failed: {e}"); + process::exit(1); + } + } + } + Err(e) => { + // UrlMismatch / UnknownVertical / Fetch all get + // Display impls with actionable messages. + eprintln!("error: {e}"); + process::exit(1); + } + } + return; + } } } diff --git a/crates/webclaw-mcp/src/server.rs b/crates/webclaw-mcp/src/server.rs index 87c222e..a4af79d 100644 --- a/crates/webclaw-mcp/src/server.rs +++ b/crates/webclaw-mcp/src/server.rs @@ -718,6 +718,50 @@ impl WebclawMcp { Ok(serde_json::to_string_pretty(&resp).unwrap_or_default()) } } + + /// List every vertical extractor the server knows about. Returns a + /// JSON array of `{name, label, description, url_patterns}` entries. + /// Call this to discover what verticals are available before using + /// `vertical_scrape`. + #[tool] + async fn list_extractors( + &self, + Parameters(_params): Parameters<ListExtractorsParams>, + ) -> Result<String, String> { + let catalog = webclaw_fetch::extractors::list(); + serde_json::to_string_pretty(&catalog) + .map_err(|e| format!("failed to serialise extractor catalog: {e}")) + } + + /// Run a vertical extractor by name and return typed JSON specific + /// to the target site (title, price, rating, author, etc.), not + /// generic markdown. Use `list_extractors` to discover available + /// names.
Example names: `reddit`, `github_repo`, `trustpilot_reviews`, + /// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`. + /// + /// Antibot-gated verticals (amazon_product, ebay_listing, + /// etsy_listing, trustpilot_reviews) will automatically escalate to + /// the webclaw cloud API when local fetch hits bot protection, + /// provided `WEBCLAW_API_KEY` is set. + #[tool] + async fn vertical_scrape( + &self, + Parameters(params): Parameters<VerticalParams>, + ) -> Result<String, String> { + validate_url(&params.url)?; + // Reuse the long-lived default FetchClient. Extractors accept + // `&dyn Fetcher`; FetchClient implements the trait so this just + // works (see webclaw_fetch::Fetcher and client::FetchClient). + let data = webclaw_fetch::extractors::dispatch_by_name( + self.fetch_client.as_ref(), + &params.name, + &params.url, + ) + .await + .map_err(|e| e.to_string())?; + serde_json::to_string_pretty(&data) + .map_err(|e| format!("failed to serialise extractor output: {e}")) + } } #[tool_handler] @@ -727,7 +771,8 @@ impl ServerHandler for WebclawMcp { .with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION"))) .with_instructions(String::from( "Webclaw MCP server -- web content extraction for AI agents. \ - Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.", + Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \ + list_extractors, vertical_scrape.", )) } } diff --git a/crates/webclaw-mcp/src/tools.rs b/crates/webclaw-mcp/src/tools.rs index e0195f1..02bf534 100644 --- a/crates/webclaw-mcp/src/tools.rs +++ b/crates/webclaw-mcp/src/tools.rs @@ -103,3 +103,20 @@ pub struct SearchParams { /// Number of results to return (default: 10) pub num_results: Option, } + +/// Parameters for `vertical_scrape`: run a site-specific extractor by name. +#[derive(Debug, Deserialize, JsonSchema)] +pub struct VerticalParams { + /// Name of the vertical extractor. Call `list_extractors` to see all + /// available names.
Examples: "reddit", "github_repo", "pypi", + /// "trustpilot_reviews", "youtube_video", "shopify_product". + pub name: String, + /// URL to extract. Must match the URL patterns the extractor claims; + /// otherwise the tool returns a clear "URL mismatch" error. + pub url: String, +} + +/// `list_extractors` takes no arguments but we still need an empty struct +/// so rmcp can generate a schema and parse the (empty) JSON-RPC params. +#[derive(Debug, Deserialize, JsonSchema)] +pub struct ListExtractorsParams {} From 4bf11d902f2bf51f327b4f62d820c9b8013cf173 Mon Sep 17 00:00:00 2001 From: Valerio Date: Wed, 22 Apr 2026 23:18:11 +0200 Subject: [PATCH 04/12] fix(mcp): vertical_scrape uses Firefox profile, not default Chrome Reddit's .json API rejects the wreq-Chrome TLS fingerprint with a 403 even from residential IPs. Their block list includes known browser-emulation library fingerprints. wreq-Firefox passes. The CLI `vertical` subcommand already forced Firefox; MCP `vertical_scrape` was still falling back to the long-lived `self.fetch_client` which defaults to Chrome, so reddit failed on MCP and nobody noticed because the earlier test runs all had an API key set that masked the issue. Switched vertical_scrape to reuse `self.firefox_or_build()` which gives us the cached Firefox client (same pattern the scrape tool uses when the caller requests `browser: firefox`). Firefox is strictly-safer-than-Chrome for every vertical in the catalog, so making it the hard default for `vertical_scrape` is the right call. Verified end-to-end from a clean shell with no WEBCLAW_API_KEY: - MCP reddit: 679ms, post/author/6 comments correct - MCP instagram_profile: 1157ms, 18471 followers No change to the `scrape` tool -- it keeps the user-selectable browser param. Bumps version to 0.5.3. 
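A rough sketch of the caching pattern behind `firefox_or_build()`, assuming a lazily initialised slot that sits next to the default client. `Profile`, `Client`, `Server`, and the `OnceLock` field are hypothetical stand-ins; the real method body is not part of this patch.

```rust
use std::sync::{Arc, OnceLock};

/// Hypothetical stand-in for a browser fingerprint profile.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Profile {
    Chrome,
    Firefox,
}

/// Hypothetical stand-in for an HTTP client; it only records its profile.
#[derive(Debug)]
pub struct Client {
    pub profile: Profile,
}

impl Client {
    /// Fallible constructor, since building a real TLS client can fail.
    pub fn new(profile: Profile) -> Result<Client, String> {
        Ok(Client { profile })
    }
}

/// Server state: one long-lived default client plus a lazily built
/// Firefox client that is shared once it exists.
pub struct Server {
    fetch_client: Arc<Client>,
    firefox: OnceLock<Arc<Client>>,
}

impl Server {
    pub fn new() -> Result<Server, String> {
        Ok(Server {
            fetch_client: Arc::new(Client::new(Profile::Chrome)?),
            firefox: OnceLock::new(),
        })
    }

    /// The default (Chrome) client, as a generic scrape path would use it.
    pub fn default_client(&self) -> Arc<Client> {
        Arc::clone(&self.fetch_client)
    }

    /// Return the cached Firefox client, building it on first use.
    pub fn firefox_or_build(&self) -> Result<Arc<Client>, String> {
        if let Some(cached) = self.firefox.get() {
            return Ok(Arc::clone(cached));
        }
        let built = Arc::new(Client::new(Profile::Firefox)?);
        // If another thread initialised the slot first, its client is kept
        // and the one built here is dropped.
        Ok(Arc::clone(self.firefox.get_or_init(|| built)))
    }
}
```

Building outside `get_or_init` keeps the fallible constructor's error visible to the caller, at the cost of occasionally constructing a client that loses the race.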
--- Cargo.lock | 14 +++++++------- Cargo.toml | 2 +- crates/webclaw-mcp/src/server.rs | 25 +++++++++++++++---------- 3 files changed, 23 insertions(+), 18 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index ed0f4fa..3b4aa51 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -3199,7 +3199,7 @@ dependencies = [ [[package]] name = "webclaw-cli" -version = "0.5.2" +version = "0.5.3" dependencies = [ "clap", "dotenvy", @@ -3220,7 +3220,7 @@ dependencies = [ [[package]] name = "webclaw-core" -version = "0.5.2" +version = "0.5.3" dependencies = [ "ego-tree", "once_cell", @@ -3238,7 +3238,7 @@ dependencies = [ [[package]] name = "webclaw-fetch" -version = "0.5.2" +version = "0.5.3" dependencies = [ "async-trait", "bytes", @@ -3263,7 +3263,7 @@ dependencies = [ [[package]] name = "webclaw-llm" -version = "0.5.2" +version = "0.5.3" dependencies = [ "async-trait", "reqwest", @@ -3276,7 +3276,7 @@ dependencies = [ [[package]] name = "webclaw-mcp" -version = "0.5.2" +version = "0.5.3" dependencies = [ "dirs", "dotenvy", @@ -3296,7 +3296,7 @@ dependencies = [ [[package]] name = "webclaw-pdf" -version = "0.5.2" +version = "0.5.3" dependencies = [ "pdf-extract", "thiserror", @@ -3305,7 +3305,7 @@ dependencies = [ [[package]] name = "webclaw-server" -version = "0.5.2" +version = "0.5.3" dependencies = [ "anyhow", "axum", diff --git a/Cargo.toml b/Cargo.toml index a286972..97c3055 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -3,7 +3,7 @@ resolver = "2" members = ["crates/*"] [workspace.package] -version = "0.5.2" +version = "0.5.3" edition = "2024" license = "AGPL-3.0" repository = "https://github.com/0xMassi/webclaw" diff --git a/crates/webclaw-mcp/src/server.rs b/crates/webclaw-mcp/src/server.rs index a4af79d..45e8647 100644 --- a/crates/webclaw-mcp/src/server.rs +++ b/crates/webclaw-mcp/src/server.rs @@ -749,16 +749,21 @@ impl WebclawMcp { Parameters(params): Parameters<VerticalParams>, ) -> Result<String, String> { validate_url(&params.url)?; - // Reuse the long-lived default FetchClient.
Extractors accept - // `&dyn Fetcher`; FetchClient implements the trait so this just - // works (see webclaw_fetch::Fetcher and client::FetchClient). - let data = webclaw_fetch::extractors::dispatch_by_name( - self.fetch_client.as_ref(), - &params.name, - &params.url, - ) - .await - .map_err(|e| e.to_string())?; + // Use the cached Firefox client, not the default Chrome one. + // Reddit's `.json` endpoint rejects the wreq-Chrome TLS + // fingerprint with a 403 even from residential IPs (they + // ship a fingerprint blocklist that includes common + // browser-emulation libraries). The wreq-Firefox fingerprint + // still passes, and Firefox is equally fine for every other + // vertical in the catalog, so it's a strictly-safer default + // for `vertical_scrape` than the generic `scrape` tool's + // Chrome default. Matches the CLI `webclaw vertical` + // subcommand which already uses Firefox. + let client = self.firefox_or_build()?; + let data = + webclaw_fetch::extractors::dispatch_by_name(client.as_ref(), &params.name, &params.url) + .await + .map_err(|e| e.to_string())?; serde_json::to_string_pretty(&data) .map_err(|e| format!("failed to serialise extractor output: {e}")) } From b77767814a5fc767d0d4ebee78f0737beb7223db Mon Sep 17 00:00:00 2001 From: Valerio Date: Thu, 23 Apr 2026 12:58:24 +0200 Subject: [PATCH 05/12] Bump to 0.5.4: SafariIos profile + Chrome fingerprint alignment + locale helper - New BrowserProfile::SafariIos mapped to BrowserVariant::SafariIos26. Built on wreq_util::Emulation::SafariIos26 with 4 overrides (TLS extension order, HTTP/2 HEADERS priority, real Safari iOS 26 headers, gzip/deflate/br). Matches bogdanfinn safari_ios_26_0 JA3 8d909525bd5bbb79f133d11cc05159fe exactly. Empirically 9/10 on immobiliare.it with country-it residential. - BrowserProfile::Chrome aligned to bogdanfinn chrome_133: dropped MAX_CONCURRENT_STREAMS from H2 SETTINGS, priority weight 256, explicit extension_permutation, advertise h3 in ALPN and ALPS.
JA3 43067709b025da334de1279a120f8e14, akamai_fp 52d84b11737d980aef856699f885ca86. Fixes indeed.com and other Cloudflare-fronted sites. - New locale module: accept_language_for_url / accept_language_for_tld. TLD to Accept-Language mapping, unknown TLDs default to en-US. DataDome geo-vs-locale cross-checks are now trivially satisfiable. - wreq-util bumped 2.2.6 to 3.0.0-rc.10 for Emulation::SafariIos26. --- CHANGELOG.md | 14 +++ Cargo.lock | 31 +++++ Cargo.toml | 2 +- crates/webclaw-fetch/Cargo.toml | 1 + crates/webclaw-fetch/src/browser.rs | 5 + crates/webclaw-fetch/src/client.rs | 1 + crates/webclaw-fetch/src/lib.rs | 2 + crates/webclaw-fetch/src/locale.rs | 77 ++++++++++++ crates/webclaw-fetch/src/tls.rs | 184 ++++++++++++++++++++++++---- 9 files changed, 291 insertions(+), 26 deletions(-) create mode 100644 crates/webclaw-fetch/src/locale.rs diff --git a/CHANGELOG.md b/CHANGELOG.md index ef2d2f2..de610e5 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,20 @@ All notable changes to webclaw are documented here. Format follows [Keep a Changelog](https://keepachangelog.com/). +## [0.5.4] — 2026-04-23 + +### Added +- **`BrowserProfile::SafariIos`** variant, mapped to a new `BrowserVariant::SafariIos26`. Built on top of `wreq_util::Emulation::SafariIos26` with four targeted overrides that close the gap against DataDome's immobiliare.it / target.com / bestbuy.com / sephora.com rulesets: TLS extension order pinned to bogdanfinn's `safari_ios_26_0` wire format, HTTP/2 HEADERS priority flag set (weight 256, exclusive, depends_on=0) while preserving wreq-util's SETTINGS + WINDOW_UPDATE, Safari iOS 26 header set without Chromium leaks, accept-encoding limited to `gzip, deflate, br` (no zstd). Empirically 9/10 on immobiliare with `country-it` residential, 2/2 on target/bestbuy/sephora with `country-us` residential. Matches bogdanfinn's JA3 `8d909525bd5bbb79f133d11cc05159fe` exactly. 
+ +- **`accept_language_for_url(url)` and `accept_language_for_tld(tld)` helpers** in a new `locale` module. TLD to `Accept-Language` mapping (`.it` to `it-IT,it;q=0.9`, `.fr` to `fr-FR,fr;q=0.9`, etc.). Unknown TLDs fall back to `en-US,en;q=0.9`. DataDome rules that cross-check geo vs locale (Italian IP + English `accept-language` = bot) are now trivially satisfiable by callers that plumb the target URL through this helper before building a `FetchConfig`. + +### Changed +- **`BrowserProfile::Chrome` fingerprint aligned to bogdanfinn `chrome_133`.** Three wire-level fixes: removed `MAX_CONCURRENT_STREAMS` from the HTTP/2 SETTINGS frame (real Chrome 133 does not send this setting), priority weight on the HEADERS frame changed from 220 to 256, TLS extension order pinned via `extension_permutation` to match bogdanfinn's stable JA3 `43067709b025da334de1279a120f8e14`. `alpn_protocols` extended to `[HTTP3, HTTP2, HTTP1]` and `alps_protocols` to `[HTTP3, HTTP2]` so Cloudflare's bot management sees the h3 advertisement real Chrome 133+ emits. Fixes indeed.com and other Cloudflare-protected sites that were serving the previous fingerprint a 403 "Security Check" challenge. Full matrix result (12 Chrome rows): 11/12 clean, the one failure is shared with bogdanfinn from the same proxy (IP reputation, not fingerprint). + +- **Bumped `wreq-util` from `2.2.6` to `3.0.0-rc.10`** to pick up `Emulation::SafariIos26`, which didn't ship until rc.10. 
+ +--- + ## [0.5.2] — 2026-04-22 ### Added diff --git a/Cargo.lock b/Cargo.lock index 3b4aa51..7302b9f 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -2967,6 +2967,26 @@ dependencies = [ "pom", ] +[[package]] +name = "typed-builder" +version = "0.23.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "31aa81521b70f94402501d848ccc0ecaa8f93c8eb6999eb9747e72287757ffda" +dependencies = [ + "typed-builder-macro", +] + +[[package]] +name = "typed-builder-macro" +version = "0.23.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "076a02dc54dd46795c2e9c8282ed40bcfb1e22747e955de9389a1de28190fb26" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + [[package]] name = "typed-path" version = "0.12.3" @@ -3258,6 +3278,7 @@ dependencies = [ "webclaw-core", "webclaw-pdf", "wreq", + "wreq-util", "zip 2.4.2", ] @@ -3709,6 +3730,16 @@ dependencies = [ "zstd", ] +[[package]] +name = "wreq-util" +version = "3.0.0-rc.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6c6bbe24d28beb9ceb58b514bd6a613c759d3b706f768b9d2950d5d35b543c04" +dependencies = [ + "typed-builder", + "wreq", +] + [[package]] name = "writeable" version = "0.6.2" diff --git a/Cargo.toml b/Cargo.toml index 97c3055..77a64a0 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -3,7 +3,7 @@ resolver = "2" members = ["crates/*"] [workspace.package] -version = "0.5.3" +version = "0.5.4" edition = "2024" license = "AGPL-3.0" repository = "https://github.com/0xMassi/webclaw" diff --git a/crates/webclaw-fetch/Cargo.toml b/crates/webclaw-fetch/Cargo.toml index a47ba7e..3bf5401 100644 --- a/crates/webclaw-fetch/Cargo.toml +++ b/crates/webclaw-fetch/Cargo.toml @@ -14,6 +14,7 @@ tracing = { workspace = true } tokio = { workspace = true } async-trait = "0.1" wreq = { version = "6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] } +wreq-util = "3.0.0-rc.10" http = "1" bytes = "1" url = "2" diff --git 
a/crates/webclaw-fetch/src/browser.rs b/crates/webclaw-fetch/src/browser.rs index 007ac9f..05f2c54 100644 --- a/crates/webclaw-fetch/src/browser.rs +++ b/crates/webclaw-fetch/src/browser.rs @@ -7,6 +7,10 @@ pub enum BrowserProfile { #[default] Chrome, Firefox, + /// Safari iOS 26 (iPhone). The one profile proven to defeat + /// DataDome's immobiliare.it / idealista.it / target.com-class + /// rules when paired with a country-scoped residential proxy. + SafariIos, /// Randomly pick from all available profiles on each request. Random, } @@ -18,6 +22,7 @@ pub enum BrowserVariant { ChromeMacos, Firefox, Safari, + SafariIos26, Edge, } diff --git a/crates/webclaw-fetch/src/client.rs b/crates/webclaw-fetch/src/client.rs index 8fd5ff5..e147337 100644 --- a/crates/webclaw-fetch/src/client.rs +++ b/crates/webclaw-fetch/src/client.rs @@ -635,6 +635,7 @@ fn collect_variants(profile: &BrowserProfile) -> Vec<BrowserVariant> { BrowserProfile::Random => browser::all_variants(), BrowserProfile::Chrome => vec![browser::latest_chrome()], BrowserProfile::Firefox => vec![browser::latest_firefox()], + BrowserProfile::SafariIos => vec![BrowserVariant::SafariIos26], } } diff --git a/crates/webclaw-fetch/src/lib.rs b/crates/webclaw-fetch/src/lib.rs index 83664a1..ca04bdb 100644 --- a/crates/webclaw-fetch/src/lib.rs +++ b/crates/webclaw-fetch/src/lib.rs @@ -10,6 +10,7 @@ pub mod error; pub mod extractors; pub mod fetcher; pub mod linkedin; +pub mod locale; pub mod proxy; pub mod reddit; pub mod sitemap; @@ -21,6 +22,7 @@ pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult}; pub use error::FetchError; pub use fetcher::Fetcher; pub use http::HeaderMap; +pub use locale::{accept_language_for_tld, accept_language_for_url}; pub use proxy::{parse_proxy_file, parse_proxy_line}; pub use sitemap::SitemapEntry; pub use webclaw_pdf::PdfMode; diff --git a/crates/webclaw-fetch/src/locale.rs b/crates/webclaw-fetch/src/locale.rs new file mode 100644 index 0000000..04079ec --- /dev/null +++ 
b/crates/webclaw-fetch/src/locale.rs @@ -0,0 +1,77 @@ +//! Derive an `Accept-Language` header from a URL. +//! +//! DataDome-class bot detection on country-specific sites (e.g. immobiliare.it, +//! leboncoin.fr) does a geo-vs-locale sanity check: residential IP in the +//! target country + a browser UA but the wrong `Accept-Language` is a bot +//! signal. Matching the site's expected locale gets us through. +//! +//! Default for unmapped TLDs is `en-US,en;q=0.9` — the global fallback. + +/// Best-effort `Accept-Language` header value for the given URL's TLD. +/// Returns `None` if the URL cannot be parsed. +pub fn accept_language_for_url(url: &str) -> Option<&'static str> { + let host = url::Url::parse(url).ok()?.host_str()?.to_ascii_lowercase(); + let tld = host.rsplit('.').next()?; + Some(accept_language_for_tld(tld)) +} + +/// Map a bare TLD like `it`, `fr`, `de` to a plausible `Accept-Language`. +/// Unknown TLDs fall back to US English. +pub fn accept_language_for_tld(tld: &str) -> &'static str { + match tld { + "it" => "it-IT,it;q=0.9", + "fr" => "fr-FR,fr;q=0.9", + "de" | "at" => "de-DE,de;q=0.9", + "es" => "es-ES,es;q=0.9", + "pt" => "pt-PT,pt;q=0.9", + "nl" => "nl-NL,nl;q=0.9", + "pl" => "pl-PL,pl;q=0.9", + "se" => "sv-SE,sv;q=0.9", + "no" => "nb-NO,nb;q=0.9", + "dk" => "da-DK,da;q=0.9", + "fi" => "fi-FI,fi;q=0.9", + "cz" => "cs-CZ,cs;q=0.9", + "ro" => "ro-RO,ro;q=0.9", + "gr" => "el-GR,el;q=0.9", + "tr" => "tr-TR,tr;q=0.9", + "ru" => "ru-RU,ru;q=0.9", + "jp" => "ja-JP,ja;q=0.9", + "kr" => "ko-KR,ko;q=0.9", + "cn" => "zh-CN,zh;q=0.9", + "tw" | "hk" => "zh-TW,zh;q=0.9", + "br" => "pt-BR,pt;q=0.9", + "mx" | "ar" | "co" | "cl" | "pe" => "es-ES,es;q=0.9", + "uk" | "ie" => "en-GB,en;q=0.9", + _ => "en-US,en;q=0.9", + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn tld_dispatch() { + assert_eq!( + accept_language_for_url("https://www.immobiliare.it/annunci/1"), + Some("it-IT,it;q=0.9") + ); + assert_eq!( + 
accept_language_for_url("https://www.leboncoin.fr/"), + Some("fr-FR,fr;q=0.9") + ); + assert_eq!( + accept_language_for_url("https://www.amazon.co.uk/"), + Some("en-GB,en;q=0.9") + ); + assert_eq!( + accept_language_for_url("https://example.com/"), + Some("en-US,en;q=0.9") + ); + } + + #[test] + fn bad_url_returns_none() { + assert_eq!(accept_language_for_url("not-a-url"), None); + } +} diff --git a/crates/webclaw-fetch/src/tls.rs b/crates/webclaw-fetch/src/tls.rs index 608ae96..308265b 100644 --- a/crates/webclaw-fetch/src/tls.rs +++ b/crates/webclaw-fetch/src/tls.rs @@ -7,10 +7,15 @@ use std::time::Duration; +use std::borrow::Cow; + use wreq::http2::{ Http2Options, PseudoId, PseudoOrder, SettingId, SettingsOrder, StreamDependency, StreamId, }; -use wreq::tls::{AlpsProtocol, CertificateCompressionAlgorithm, TlsOptions, TlsVersion}; +use wreq::tls::{ + AlpnProtocol, AlpsProtocol, CertificateCompressionAlgorithm, ExtensionType, TlsOptions, + TlsVersion, +}; use wreq::{Client, Emulation}; use crate::browser::BrowserVariant; @@ -43,6 +48,55 @@ const SAFARI_SIGALGS: &str = "ecdsa_secp256r1_sha256:rsa_pss_rsae_sha256:rsa_pkc /// Safari curves. const SAFARI_CURVES: &str = "X25519:P-256:P-384:P-521"; +/// Safari iOS 26 TLS extension order, matching bogdanfinn's +/// `safari_ios_26_0` wire format. GREASE slots are omitted. wreq +/// inserts them itself. Diverges from wreq-util's default SafariIos26 +/// extension order, which DataDome's immobiliare.it ruleset flags. 
+fn safari_ios_extensions() -> Vec<ExtensionType> { + vec![ + ExtensionType::CERTIFICATE_TIMESTAMP, + ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION, + ExtensionType::SERVER_NAME, + ExtensionType::CERT_COMPRESSION, + ExtensionType::KEY_SHARE, + ExtensionType::SUPPORTED_VERSIONS, + ExtensionType::PSK_KEY_EXCHANGE_MODES, + ExtensionType::SUPPORTED_GROUPS, + ExtensionType::RENEGOTIATE, + ExtensionType::SIGNATURE_ALGORITHMS, + ExtensionType::STATUS_REQUEST, + ExtensionType::EC_POINT_FORMATS, + ExtensionType::EXTENDED_MASTER_SECRET, + ] +} + +/// Chrome 133 TLS extension order, matching bogdanfinn's stable JA3 +/// (`43067709b025da334de1279a120f8e14`). Real Chrome permutes extensions +/// per handshake, but indeed.com's WAF allowlists this specific wire order +/// and rejects permuted ones. GREASE slots are inserted by wreq. +/// +/// JA3 extension field from peet.ws: 18-5-35-51-10-45-11-27-17613-43-13-0-16-65037-65281-23 +fn chrome_extensions() -> Vec<ExtensionType> { + vec![ + ExtensionType::CERTIFICATE_TIMESTAMP, // 18 + ExtensionType::STATUS_REQUEST, // 5 + ExtensionType::SESSION_TICKET, // 35 + ExtensionType::KEY_SHARE, // 51 + ExtensionType::SUPPORTED_GROUPS, // 10 + ExtensionType::PSK_KEY_EXCHANGE_MODES, // 45 + ExtensionType::EC_POINT_FORMATS, // 11 + ExtensionType::CERT_COMPRESSION, // 27 + ExtensionType::APPLICATION_SETTINGS_NEW, // 17613 (new codepoint, matches alps_use_new_codepoint) + ExtensionType::SUPPORTED_VERSIONS, // 43 + ExtensionType::SIGNATURE_ALGORITHMS, // 13 + ExtensionType::SERVER_NAME, // 0 + ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION, // 16 + ExtensionType::ENCRYPTED_CLIENT_HELLO, // 65037 + ExtensionType::RENEGOTIATE, // 65281 + ExtensionType::EXTENDED_MASTER_SECRET, // 23 + ] +} + // --- Chrome HTTP headers in correct wire order --- const CHROME_HEADERS: &[(&str, &str)] = &[ @@ -130,6 +184,26 @@ const SAFARI_HEADERS: &[(&str, &str)] = &[ ("sec-fetch-dest", "document"), ]; +/// Safari iOS 26 headers, in the wire order real Safari emits. 
Critically: +/// NO `sec-fetch-*`, NO `priority: u=0, i` (both Chromium-only leaks), but +/// `upgrade-insecure-requests: 1` is present. `accept-encoding` does not +/// include zstd (Safari can't decode it). Verified against bogdanfinn on +/// 2026-04-22: this header set is what DataDome's immobiliare ruleset +/// expects for a real iPhone. +const SAFARI_IOS_HEADERS: &[(&str, &str)] = &[ + ( + "accept", + "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", + ), + ("accept-language", "en-US,en;q=0.9"), + ("accept-encoding", "gzip, deflate, br"), + ( + "user-agent", + "Mozilla/5.0 (iPhone; CPU iPhone OS 26_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Mobile/15E148 Safari/604.1", + ), + ("upgrade-insecure-requests", "1"), +]; + const EDGE_HEADERS: &[(&str, &str)] = &[ ( "sec-ch-ua", @@ -156,6 +230,9 @@ const EDGE_HEADERS: &[(&str, &str)] = &[ ]; fn chrome_tls() -> TlsOptions { + // permute_extensions is off so the explicit extension_permutation sticks. + // Real Chrome permutes, but indeed.com's WAF allowlists bogdanfinn's + // fixed order, so matching that gets us through. 
TlsOptions::builder() .cipher_list(CHROME_CIPHERS) .sigalgs_list(CHROME_SIGALGS) @@ -163,12 +240,18 @@ fn chrome_tls() -> TlsOptions { .min_tls_version(TlsVersion::TLS_1_2) .max_tls_version(TlsVersion::TLS_1_3) .grease_enabled(true) - .permute_extensions(true) + .permute_extensions(false) + .extension_permutation(chrome_extensions()) .enable_ech_grease(true) .pre_shared_key(true) .enable_ocsp_stapling(true) .enable_signed_cert_timestamps(true) - .alps_protocols([AlpsProtocol::HTTP2]) + .alpn_protocols([ + AlpnProtocol::HTTP3, + AlpnProtocol::HTTP2, + AlpnProtocol::HTTP1, + ]) + .alps_protocols([AlpsProtocol::HTTP3, AlpsProtocol::HTTP2]) .alps_use_new_codepoint(true) .aes_hw_override(true) .certificate_compression_algorithms(&[CertificateCompressionAlgorithm::BROTLI]) @@ -212,25 +295,70 @@ fn safari_tls() -> TlsOptions { .build() } +/// Safari iOS 26 emulation — composed on top of `wreq_util::Emulation::SafariIos26` +/// with four targeted overrides. We don't hand-roll this one like Chrome/Firefox +/// because the wire-level defaults from wreq-util are already correct for ciphers, +/// sigalgs, curves, and GREASE — the four things wreq-util gets *wrong* for +/// DataDome compatibility are overridden here: +/// +/// 1. TLS extension order: match bogdanfinn `safari_ios_26_0` exactly (JA3 +/// ends up `8d909525bd5bbb79f133d11cc05159fe`). +/// 2. HTTP/2 HEADERS priority flag: weight=256, exclusive=1, depends_on=0. +/// wreq-util omits this frame; real Safari and bogdanfinn include it. +/// This flip is the thing DataDome actually reads — the akamai_fingerprint +/// hash changes from `c52879e43202aeb92740be6e8c86ea96` to +/// `d1294410a06522e37a5c5e3f0a45a705`, which is the winning signature. +/// 3. Headers: strip wreq-util's Chromium defaults (`sec-fetch-*`, +/// `priority: u=0, i`, zstd), replace with the real iOS 26 set. +/// 4. `accept-language` preserved from config.extra_headers for locale. 
+fn safari_ios_emulation() -> wreq::Emulation { + use wreq::EmulationFactory; + let mut em = wreq_util::Emulation::SafariIos26.emulation(); + + if let Some(tls) = em.tls_options_mut().as_mut() { + tls.extension_permutation = Some(Cow::Owned(safari_ios_extensions())); + } + + // Only override the priority flag — keep wreq-util's SETTINGS, WINDOW_UPDATE, + // and pseudo-order intact. Replacing the whole Http2Options resets SETTINGS + // to defaults, which sends only INITIAL_WINDOW_SIZE and fails DataDome. + if let Some(h2) = em.http2_options_mut().as_mut() { + h2.headers_stream_dependency = Some(StreamDependency::new(StreamId::zero(), 255, true)); + } + + let hm = em.headers_mut(); + hm.clear(); + for (k, v) in SAFARI_IOS_HEADERS { + if let (Ok(n), Ok(val)) = ( + http::header::HeaderName::from_bytes(k.as_bytes()), + http::header::HeaderValue::from_str(v), + ) { + hm.append(n, val); + } + } + + em +} + fn chrome_h2() -> Http2Options { + // SETTINGS frame matches bogdanfinn `chrome_133`: HEADER_TABLE_SIZE, + // ENABLE_PUSH=0, INITIAL_WINDOW_SIZE, MAX_HEADER_LIST_SIZE. No + // MAX_CONCURRENT_STREAMS — real Chrome 133 and bogdanfinn both omit it, + // and indeed.com's WAF reads this as a bot signal when present. Priority + // weight 256 (encoded as 255 + 1) matches bogdanfinn's HEADERS frame. 
Http2Options::builder() .initial_window_size(6_291_456) .initial_connection_window_size(15_728_640) .max_header_list_size(262_144) .header_table_size(65_536) - .max_concurrent_streams(1000u32) .enable_push(false) .settings_order( SettingsOrder::builder() .extend([ SettingId::HeaderTableSize, SettingId::EnablePush, - SettingId::MaxConcurrentStreams, SettingId::InitialWindowSize, - SettingId::MaxFrameSize, SettingId::MaxHeaderListSize, - SettingId::EnableConnectProtocol, - SettingId::NoRfc7540Priorities, ]) .build(), ) @@ -244,7 +372,7 @@ fn chrome_h2() -> Http2Options { ]) .build(), ) - .headers_stream_dependency(StreamDependency::new(StreamId::zero(), 219, true)) + .headers_stream_dependency(StreamDependency::new(StreamId::zero(), 255, true)) .build() } @@ -328,32 +456,38 @@ pub fn build_client( extra_headers: &std::collections::HashMap, proxy: Option<&str>, ) -> Result { - let (tls, h2, headers) = match variant { - BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS), - BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS), - BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS), - BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS), - BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS), + // SafariIos26 builds its Emulation on top of wreq-util's base instead + // of from scratch. See `safari_ios_emulation` for why. 
+ let mut emulation = match variant { + BrowserVariant::SafariIos26 => safari_ios_emulation(), + other => { + let (tls, h2, headers) = match other { + BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS), + BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS), + BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS), + BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS), + BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS), + BrowserVariant::SafariIos26 => unreachable!("handled above"), + }; + Emulation::builder() + .tls_options(tls) + .http2_options(h2) + .headers(build_headers(headers)) + .build() + } }; - let mut header_map = build_headers(headers); - - // Append extra headers after profile defaults + // Append extra headers after profile defaults. + let hm = emulation.headers_mut(); for (k, v) in extra_headers { if let (Ok(n), Ok(val)) = ( http::header::HeaderName::from_bytes(k.as_bytes()), http::header::HeaderValue::from_str(v), ) { - header_map.insert(n, val); + hm.insert(n, val); } } - let emulation = Emulation::builder() - .tls_options(tls) - .http2_options(h2) - .headers(header_map) - .build(); - let mut builder = Client::builder() .emulation(emulation) .redirect(wreq::redirect::Policy::limited(10)) From 2285c585b18bfe02a16f1bfc9ca118a0523f4f24 Mon Sep 17 00:00:00 2001 From: Valerio Date: Thu, 23 Apr 2026 13:01:02 +0200 Subject: [PATCH 06/12] docs(changelog): simplify 0.5.4 entry --- CHANGELOG.md | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index de610e5..c249734 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,14 +6,12 @@ Format follows [Keep a Changelog](https://keepachangelog.com/). ## [0.5.4] — 2026-04-23 ### Added -- **`BrowserProfile::SafariIos`** variant, mapped to a new `BrowserVariant::SafariIos26`. 
Built on top of `wreq_util::Emulation::SafariIos26` with four targeted overrides that close the gap against DataDome's immobiliare.it / target.com / bestbuy.com / sephora.com rulesets: TLS extension order pinned to bogdanfinn's `safari_ios_26_0` wire format, HTTP/2 HEADERS priority flag set (weight 256, exclusive, depends_on=0) while preserving wreq-util's SETTINGS + WINDOW_UPDATE, Safari iOS 26 header set without Chromium leaks, accept-encoding limited to `gzip, deflate, br` (no zstd). Empirically 9/10 on immobiliare with `country-it` residential, 2/2 on target/bestbuy/sephora with `country-us` residential. Matches bogdanfinn's JA3 `8d909525bd5bbb79f133d11cc05159fe` exactly. - -- **`accept_language_for_url(url)` and `accept_language_for_tld(tld)` helpers** in a new `locale` module. TLD to `Accept-Language` mapping (`.it` to `it-IT,it;q=0.9`, `.fr` to `fr-FR,fr;q=0.9`, etc.). Unknown TLDs fall back to `en-US,en;q=0.9`. DataDome rules that cross-check geo vs locale (Italian IP + English `accept-language` = bot) are now trivially satisfiable by callers that plumb the target URL through this helper before building a `FetchConfig`. +- New `BrowserProfile::SafariIos` for Safari iOS 26 fingerprinting. Pairs with a country-matched residential proxy for sites that reject non-mobile profiles. +- `accept_language_for_url(url)` and `accept_language_for_tld(tld)` helpers. Returns a locale-appropriate `Accept-Language` based on the URL's TLD, with `en-US` as the fallback. ### Changed -- **`BrowserProfile::Chrome` fingerprint aligned to bogdanfinn `chrome_133`.** Three wire-level fixes: removed `MAX_CONCURRENT_STREAMS` from the HTTP/2 SETTINGS frame (real Chrome 133 does not send this setting), priority weight on the HEADERS frame changed from 220 to 256, TLS extension order pinned via `extension_permutation` to match bogdanfinn's stable JA3 `43067709b025da334de1279a120f8e14`. 
`alpn_protocols` extended to `[HTTP3, HTTP2, HTTP1]` and `alps_protocols` to `[HTTP3, HTTP2]` so Cloudflare's bot management sees the h3 advertisement real Chrome 133+ emits. Fixes indeed.com and other Cloudflare-protected sites that were serving the previous fingerprint a 403 "Security Check" challenge. Full matrix result (12 Chrome rows): 11/12 clean, the one failure is shared with bogdanfinn from the same proxy (IP reputation, not fingerprint). - -- **Bumped `wreq-util` from `2.2.6` to `3.0.0-rc.10`** to pick up `Emulation::SafariIos26`, which didn't ship until rc.10. +- Chrome browser fingerprint refreshed for current Cloudflare bot management. Fixes 403 challenges on several e-commerce and jobs sites. +- Bumped `wreq-util` to `3.0.0-rc.10`. --- From e1af2da5092ab06e2d915209f93d0ccd74c548f8 Mon Sep 17 00:00:00 2001 From: Valerio Date: Thu, 23 Apr 2026 13:25:23 +0200 Subject: [PATCH 07/12] docs(claude): drop sidecar references, mention ProductionFetcher --- CLAUDE.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CLAUDE.md b/CLAUDE.md index fcd27da..c33d61f 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -79,7 +79,7 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R - **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally. - **No special RUSTFLAGS** — `.cargo/config.toml` is currently empty of build flags. Don't add any. - **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep. -- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `TlsSidecarFetcher` that routes through the Go tls-sidecar instead of in-process wreq. +- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client. 
- **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels. ## Build & Test From 98a177dec489c8d29551648854f502dd6900e067 Mon Sep 17 00:00:00 2001 From: Valerio Date: Thu, 23 Apr 2026 13:32:55 +0200 Subject: [PATCH 08/12] feat(cli): expose safari-ios browser profile + bump to 0.5.5 --- CHANGELOG.md | 7 +++++++ Cargo.lock | 14 +++++++------- Cargo.toml | 2 +- crates/webclaw-cli/src/main.rs | 4 ++++ 4 files changed, 19 insertions(+), 8 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index c249734..94b9ddb 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,13 @@ All notable changes to webclaw are documented here. Format follows [Keep a Changelog](https://keepachangelog.com/). +## [0.5.5] — 2026-04-23 + +### Added +- `webclaw --browser safari-ios` on the CLI. Pairs with `--proxy` for DataDome-fronted sites that reject desktop profiles. + +--- + ## [0.5.4] — 2026-04-23 ### Added diff --git a/Cargo.lock b/Cargo.lock index 7302b9f..30135cd 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -3219,7 +3219,7 @@ dependencies = [ [[package]] name = "webclaw-cli" -version = "0.5.3" +version = "0.5.5" dependencies = [ "clap", "dotenvy", @@ -3240,7 +3240,7 @@ dependencies = [ [[package]] name = "webclaw-core" -version = "0.5.3" +version = "0.5.5" dependencies = [ "ego-tree", "once_cell", @@ -3258,7 +3258,7 @@ dependencies = [ [[package]] name = "webclaw-fetch" -version = "0.5.3" +version = "0.5.5" dependencies = [ "async-trait", "bytes", @@ -3284,7 +3284,7 @@ dependencies = [ [[package]] name = "webclaw-llm" -version = "0.5.3" +version = "0.5.5" dependencies = [ "async-trait", "reqwest", @@ -3297,7 +3297,7 @@ dependencies = [ [[package]] name = "webclaw-mcp" -version = "0.5.3" +version = "0.5.5" dependencies = [ "dirs", "dotenvy", @@ -3317,7 +3317,7 @@ dependencies = [ [[package]] name = "webclaw-pdf" -version = "0.5.3" +version = "0.5.5" dependencies = [ "pdf-extract", "thiserror", @@ -3326,7 +3326,7 @@ dependencies = [ [[package]] name = 
"webclaw-server" -version = "0.5.3" +version = "0.5.5" dependencies = [ "anyhow", "axum", diff --git a/Cargo.toml b/Cargo.toml index 77a64a0..abd5816 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -3,7 +3,7 @@ resolver = "2" members = ["crates/*"] [workspace.package] -version = "0.5.4" +version = "0.5.5" edition = "2024" license = "AGPL-3.0" repository = "https://github.com/0xMassi/webclaw" diff --git a/crates/webclaw-cli/src/main.rs b/crates/webclaw-cli/src/main.rs index a12cae1..e97f15d 100644 --- a/crates/webclaw-cli/src/main.rs +++ b/crates/webclaw-cli/src/main.rs @@ -351,6 +351,9 @@ enum OutputFormat { enum Browser { Chrome, Firefox, + /// Safari iOS 26. Pair with a country-matched residential proxy for sites + /// that reject non-mobile profiles. + SafariIos, Random, } @@ -377,6 +380,7 @@ impl From for BrowserProfile { match b { Browser::Chrome => BrowserProfile::Chrome, Browser::Firefox => BrowserProfile::Firefox, + Browser::SafariIos => BrowserProfile::SafariIos, Browser::Random => BrowserProfile::Random, } } From b413d702b272960dcc3970394194f5328c784eeb Mon Sep 17 00:00:00 2001 From: Valerio Date: Thu, 23 Apr 2026 14:59:29 +0200 Subject: [PATCH 09/12] feat(fetch): add fetch_smart with Reddit + Akamai rescue paths, bump 0.5.6 --- CHANGELOG.md | 10 +++++ Cargo.lock | 14 +++---- Cargo.toml | 2 +- crates/webclaw-fetch/src/client.rs | 59 ++++++++++++++++++++++++++---- 4 files changed, 69 insertions(+), 16 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 94b9ddb..54cb31f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,16 @@ All notable changes to webclaw are documented here. Format follows [Keep a Changelog](https://keepachangelog.com/). +## [0.5.6] — 2026-04-23 + +### Added +- `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. 
Makes `/v1/scrape` on Reddit populate markdown again. + +### Fixed +- Regression introduced in 0.5.4 where the production server's `/v1/scrape` bypassed the Reddit `.json` shortcut and Akamai cookie warmup that `fetch_and_extract` had been providing. Both helpers now live in `fetch_smart` and every caller path picks them up. + +--- + ## [0.5.5] — 2026-04-23 ### Added diff --git a/Cargo.lock b/Cargo.lock index 30135cd..b382000 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -3219,7 +3219,7 @@ dependencies = [ [[package]] name = "webclaw-cli" -version = "0.5.5" +version = "0.5.6" dependencies = [ "clap", "dotenvy", @@ -3240,7 +3240,7 @@ dependencies = [ [[package]] name = "webclaw-core" -version = "0.5.5" +version = "0.5.6" dependencies = [ "ego-tree", "once_cell", @@ -3258,7 +3258,7 @@ dependencies = [ [[package]] name = "webclaw-fetch" -version = "0.5.5" +version = "0.5.6" dependencies = [ "async-trait", "bytes", @@ -3284,7 +3284,7 @@ dependencies = [ [[package]] name = "webclaw-llm" -version = "0.5.5" +version = "0.5.6" dependencies = [ "async-trait", "reqwest", @@ -3297,7 +3297,7 @@ dependencies = [ [[package]] name = "webclaw-mcp" -version = "0.5.5" +version = "0.5.6" dependencies = [ "dirs", "dotenvy", @@ -3317,7 +3317,7 @@ dependencies = [ [[package]] name = "webclaw-pdf" -version = "0.5.5" +version = "0.5.6" dependencies = [ "pdf-extract", "thiserror", @@ -3326,7 +3326,7 @@ dependencies = [ [[package]] name = "webclaw-server" -version = "0.5.5" +version = "0.5.6" dependencies = [ "anyhow", "axum", diff --git a/Cargo.toml b/Cargo.toml index abd5816..d9cfd92 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -3,7 +3,7 @@ resolver = "2" members = ["crates/*"] [workspace.package] -version = "0.5.5" +version = "0.5.6" edition = "2024" license = "AGPL-3.0" repository = "https://github.com/0xMassi/webclaw" diff --git a/crates/webclaw-fetch/src/client.rs b/crates/webclaw-fetch/src/client.rs index e147337..d61694f 100644 --- a/crates/webclaw-fetch/src/client.rs +++ 
b/crates/webclaw-fetch/src/client.rs @@ -261,10 +261,52 @@ impl FetchClient { self.cloud.as_deref() } + /// Fetch a URL with per-site rescue paths: Reddit URLs redirect to the + /// `.json` API, and Akamai-style challenge responses trigger a homepage + /// cookie warmup and a retry. Returns the same `FetchResult` shape as + /// [`Self::fetch`] so every caller (CLI, MCP, OSS server, production + /// server) benefits without shape churn. + /// + /// This is the method most callers want. Use plain [`Self::fetch`] only + /// when you need literal no-rescue behavior (e.g. inside the rescue + /// logic itself to avoid recursion). + pub async fn fetch_smart(&self, url: &str) -> Result { + // Reddit: the HTML page shows a verification interstitial for most + // client IPs, but appending `.json` returns the post + comment tree + // publicly. `parse_reddit_json` in downstream code knows how to read + // the result; here we just do the URL swap at the fetch layer. + if crate::reddit::is_reddit_url(url) { + let json_url = crate::reddit::json_url(url); + if let Ok(resp) = self.fetch(&json_url).await { + if resp.status == 200 && !resp.html.is_empty() { + return Ok(resp); + } + } + // If the .json fetch failed, fall through to the HTML path. + } + + let resp = self.fetch(url).await?; + + // Akamai / bazadebezolkohpepadr challenge: visit the homepage to + // collect warmup cookies (_abck, bm_sz, etc.), then retry. + if is_challenge_html(&resp.html) + && let Some(homepage) = extract_homepage(url) + { + debug!("challenge detected, warming cookies via {homepage}"); + let _ = self.fetch(&homepage).await; + if let Ok(retry) = self.fetch(url).await { + return Ok(retry); + } + } + + Ok(resp) + } + /// Fetch a URL and return the raw HTML + response metadata. /// /// Automatically retries on transient failures (network errors, 5xx, 429) - /// with exponential backoff: 0s, 1s (2 attempts total). + /// with exponential backoff: 0s, 1s (2 attempts total). 
No per-site + /// rescue logic; use [`Self::fetch_smart`] for that. #[instrument(skip(self), fields(url = %url))] pub async fn fetch(&self, url: &str) -> Result { let delays = [Duration::ZERO, Duration::from_secs(1)]; @@ -713,22 +755,23 @@ fn is_pdf_content_type(headers: &http::HeaderMap) -> bool { /// Detect if a response looks like a bot protection challenge page. fn is_challenge_response(response: &Response) -> bool { - let len = response.body().len(); + is_challenge_html(response.text().as_ref()) +} + +/// Same as `is_challenge_response`, operating on a body string directly +/// so callers holding a `FetchResult` can reuse the heuristic. +fn is_challenge_html(html: &str) -> bool { + let len = html.len(); if len > 15_000 || len == 0 { return false; } - - let text = response.text(); - let lower = text.to_lowercase(); - + let lower = html.to_lowercase(); if lower.contains("challenge page") { return true; } - if lower.contains("bazadebezolkohpepadr") && len < 5_000 { return true; } - false } From 866fa88aa05d208cb5389795cfc655876742cfbc Mon Sep 17 00:00:00 2001 From: Valerio Date: Thu, 23 Apr 2026 15:06:35 +0200 Subject: [PATCH 10/12] fix(fetch): reject HTML verification pages served at .json reddit URL --- crates/webclaw-fetch/src/client.rs | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/crates/webclaw-fetch/src/client.rs b/crates/webclaw-fetch/src/client.rs index d61694f..78731e5 100644 --- a/crates/webclaw-fetch/src/client.rs +++ b/crates/webclaw-fetch/src/client.rs @@ -277,12 +277,18 @@ impl FetchClient { // the result; here we just do the URL swap at the fetch layer. if crate::reddit::is_reddit_url(url) { let json_url = crate::reddit::json_url(url); - if let Ok(resp) = self.fetch(&json_url).await { - if resp.status == 200 && !resp.html.is_empty() { + if let Ok(resp) = self.fetch(&json_url).await + && resp.status == 200 + { + // Reddit will serve an HTML verification page at the .json + // URL too when the IP is flagged. 
Only return if the body + // actually starts with a JSON payload. + let first = resp.html.trim_start().as_bytes().first().copied(); + if matches!(first, Some(b'{') | Some(b'[')) { return Ok(resp); } } - // If the .json fetch failed, fall through to the HTML path. + // If the .json fetch failed or returned HTML, fall through. } let resp = self.fetch(url).await?; From 966981bc4299323721c2d43ff5aa157bf939b82c Mon Sep 17 00:00:00 2001 From: Valerio Date: Thu, 23 Apr 2026 15:17:04 +0200 Subject: [PATCH 11/12] fix(fetch): send bot-identifying UA on reddit .json API to bypass browser UA block --- crates/webclaw-fetch/src/client.rs | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/crates/webclaw-fetch/src/client.rs b/crates/webclaw-fetch/src/client.rs index 78731e5..94d698f 100644 --- a/crates/webclaw-fetch/src/client.rs +++ b/crates/webclaw-fetch/src/client.rs @@ -275,14 +275,21 @@ impl FetchClient { // client IPs, but appending `.json` returns the post + comment tree // publicly. `parse_reddit_json` in downstream code knows how to read // the result; here we just do the URL swap at the fetch layer. - if crate::reddit::is_reddit_url(url) { + if crate::reddit::is_reddit_url(url) && !url.ends_with(".json") { let json_url = crate::reddit::json_url(url); - if let Ok(resp) = self.fetch(&json_url).await + // Reddit's public .json API serves JSON to identifiable bot + // User-Agents and blocks browser UAs with a verification wall. + // Override our Chrome-profile UA for this specific call. + let ua = concat!( + "Webclaw/", + env!("CARGO_PKG_VERSION"), + " (+https://webclaw.io)" + ); + if let Ok(resp) = self + .fetch_with_headers(&json_url, &[("user-agent", ua)]) + .await && resp.status == 200 { - // Reddit will serve an HTML verification page at the .json - // URL too when the IP is flagged. Only return if the body - // actually starts with a JSON payload. 
                 let first = resp.html.trim_start().as_bytes().first().copied();
                 if matches!(first, Some(b'{') | Some(b'[')) {
                     return Ok(resp);
From a5c3433372f33517f2aa765c2544ab6abdfe1cc7 Mon Sep 17 00:00:00 2001
From: Valerio
Date: Thu, 23 Apr 2026 15:26:31 +0200
Subject: [PATCH 12/12] fix(core+server): guard markdown pipe slice + detect trustpilot/reddit verify walls

---
 CHANGELOG.md                        | 3 ++-
 crates/webclaw-core/src/markdown.rs | 6 ++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 54cb31f..3000593 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,10 +6,11 @@ Format follows [Keep a Changelog](https://keepachangelog.com/).
 ## [0.5.6] — 2026-04-23

 ### Added
-- `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. Makes `/v1/scrape` on Reddit populate markdown again.
+- `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API with an identifiable bot `User-Agent`, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. Makes `/v1/scrape` on Reddit populate markdown again.

 ### Fixed
 - Regression introduced in 0.5.4 where the production server's `/v1/scrape` bypassed the Reddit `.json` shortcut and Akamai cookie warmup that `fetch_and_extract` had been providing. Both helpers now live in `fetch_smart` and every caller path picks them up.
+- Panic in the markdown converter (`markdown.rs:925`) on single-pipe `|` lines. A `[1..len-1]` slice on a 1-char input triggered `begin <= end`. Guarded.
 ---

diff --git a/crates/webclaw-core/src/markdown.rs b/crates/webclaw-core/src/markdown.rs
index 1a61586..d0a2c23 100644
--- a/crates/webclaw-core/src/markdown.rs
+++ b/crates/webclaw-core/src/markdown.rs
@@ -920,8 +920,10 @@ fn strip_markdown(md: &str) -> String {
             continue;
         }

-        // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs
-        if trimmed.starts_with('|') && trimmed.ends_with('|') {
+        // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs.
+        // Require at least 2 chars so the slice `[1..len-1]` is never taken on single-pipe rows
+        // (which aren't real tables anyway); a lone `|` previously panicked at `begin <= end`.
+        if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
             let inner = &trimmed[1..trimmed.len() - 1];
             let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
             lines.push(cells.join("\t"));
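
Reviewer note: the guard in the `markdown.rs` hunk above can be exercised in isolation with a minimal sketch like the one below. The function name `table_row_to_tsv` and its `Option<String>` return type are assumptions made for this illustration only. In the patch itself the check sits inline in the loop of `strip_markdown` and pushes onto `lines`.

```rust
/// Hypothetical standalone helper (not part of the patch): converts one
/// markdown table row into tab-separated cells, or returns `None` when the
/// line is not a `|...|` row.
fn table_row_to_tsv(line: &str) -> Option<String> {
    let trimmed = line.trim();
    // A lone `|` both starts and ends with a pipe. Without the length check
    // the slice below would be `[1..0]`, which panics.
    if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
        // Both boundary bytes are ASCII pipes, so the indices fall on
        // char boundaries even when the cells contain multi-byte text.
        let inner = &trimmed[1..trimmed.len() - 1];
        let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
        Some(cells.join("\t"))
    } else {
        None
    }
}
```

Under these assumptions, `table_row_to_tsv("| a | b |")` yields the two cells joined by a tab, a lone `|` is rejected instead of panicking, and `||` is accepted as a row with a single empty cell, because the `[1..1]` slice is empty but still in bounds.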