fix(core+server): guard markdown pipe slice + detect trustpilot/reddit verify walls

fix(fetch): send bot-identifying UA on reddit .json API to bypass browser UA block
fix(fetch): reject HTML verification pages served at .json reddit URL
2026-04-25 00:06:21 +02:00 · 2026-04-23 15:26:31 +02:00 · 2026-04-23 15:17:04 +02:00 · 2026-04-23 15:06:35 +02:00 · 2026-04-23 14:59:29 +02:00 · 2026-04-23 13:32:55 +02:00
7 changed files with 99 additions and 19 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -3,6 +3,24 @@
 All notable changes to webclaw are documented here.
 Format follows [Keep a Changelog](https://keepachangelog.com/).
 ## [0.5.6] — 2026-04-23
 ### Added
 - `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API with an identifiable bot `User-Agent`, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. Makes `/v1/scrape` on Reddit populate markdown again.
 ### Fixed
 - Regression introduced in 0.5.4 where the production server's `/v1/scrape` bypassed the Reddit `.json` shortcut and Akamai cookie warmup that `fetch_and_extract` had been providing. Both helpers now live in `fetch_smart` and every caller path picks them up.
 - Panic in the markdown converter (`markdown.rs:925`) on single-pipe `|` lines. A `[1..len-1]` slice on a 1-char input evaluates to `[1..0]`, which fails the `begin <= end` check. Now guarded by a length check.
 ---
 ## [0.5.5] — 2026-04-23
 ### Added
 - `webclaw --browser safari-ios` on the CLI. Pairs with `--proxy` for DataDome-fronted sites that reject desktop profiles.
 ---
 ## [0.5.4] — 2026-04-23
 ### Added
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -79,7 +79,7 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
 - **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally.
 - **No special RUSTFLAGS** — `.cargo/config.toml` is currently empty of build flags. Don't add any.
 - **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep.
- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `TlsSidecarFetcher` that routes through the Go tls-sidecar instead of in-process wreq.
+- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client.
 - **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.
 ## Build & Test
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -3219,7 +3219,7 @@ dependencies = [
 [[package]]
 name = "webclaw-cli"
-version = "0.5.3"
+version = "0.5.6"
 dependencies = [
 "clap",
 "dotenvy",
@@ -3240,7 +3240,7 @@ dependencies = [
 [[package]]
 name = "webclaw-core"
-version = "0.5.3"
+version = "0.5.6"
 dependencies = [
 "ego-tree",
 "once_cell",
@@ -3258,7 +3258,7 @@ dependencies = [
 [[package]]
 name = "webclaw-fetch"
-version = "0.5.3"
+version = "0.5.6"
 dependencies = [
 "async-trait",
 "bytes",
@@ -3284,7 +3284,7 @@ dependencies = [
 [[package]]
 name = "webclaw-llm"
-version = "0.5.3"
+version = "0.5.6"
 dependencies = [
 "async-trait",
 "reqwest",
@@ -3297,7 +3297,7 @@ dependencies = [
 [[package]]
 name = "webclaw-mcp"
-version = "0.5.3"
+version = "0.5.6"
 dependencies = [
 "dirs",
 "dotenvy",
@@ -3317,7 +3317,7 @@ dependencies = [
 [[package]]
 name = "webclaw-pdf"
-version = "0.5.3"
+version = "0.5.6"
 dependencies = [
 "pdf-extract",
 "thiserror",
@@ -3326,7 +3326,7 @@ dependencies = [
 [[package]]
 name = "webclaw-server"
-version = "0.5.3"
+version = "0.5.6"
 dependencies = [
 "anyhow",
 "axum",
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -3,7 +3,7 @@ resolver = "2"
 members = ["crates/*"]
 [workspace.package]
-version = "0.5.4"
+version = "0.5.6"
 edition = "2024"
 license = "AGPL-3.0"
 repository = "https://github.com/0xMassi/webclaw"
--- a/crates/webclaw-cli/src/main.rs
+++ b/crates/webclaw-cli/src/main.rs
@@ -351,6 +351,9 @@ enum OutputFormat {
 enum Browser {
    Chrome,
    Firefox,
    /// Safari iOS 26. Pair with a country-matched residential proxy for sites
    /// that reject non-mobile profiles.
    SafariIos,
    Random,
 }
@@ -377,6 +380,7 @@ impl From<Browser> for BrowserProfile {
        match b {
            Browser::Chrome => BrowserProfile::Chrome,
            Browser::Firefox => BrowserProfile::Firefox,
            Browser::SafariIos => BrowserProfile::SafariIos,
            Browser::Random => BrowserProfile::Random,
        }
    }
--- a/crates/webclaw-core/src/markdown.rs
+++ b/crates/webclaw-core/src/markdown.rs
@@ -920,8 +920,10 @@ fn strip_markdown(md: &str) -> String {
            continue;
        }
-        // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs
-        if trimmed.starts_with('|') && trimmed.ends_with('|') {
+        // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs.
+        // Require at least 2 chars so the slice `[1..len-1]` stays in bounds on single-pipe rows
+        // (which aren't real tables anyway); a lone `|` previously panicked at `begin <= end`.
+        if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
            let inner = &trimmed[1..trimmed.len() - 1];
            let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
            lines.push(cells.join("\t"));
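The guard in this hunk can be exercised in isolation. The sketch below lifts the branch into a standalone function; `table_row_to_tsv` is a hypothetical name used only for illustration and is not part of `webclaw-core`. It assumes the surrounding code has already trimmed the line, as `strip_markdown` appears to do.

```rust
/// Hypothetical helper mirroring the guarded branch in `strip_markdown`.
/// Returns `None` for lines that are not pipe-delimited table rows.
fn table_row_to_tsv(line: &str) -> Option<String> {
    let trimmed = line.trim();
    // `len() >= 2` guarantees `1 <= len - 1`, so the slice below is in bounds.
    // Without it, a lone "|" would slice `[1..0]` and panic.
    if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
        let inner = &trimmed[1..trimmed.len() - 1];
        let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
        Some(cells.join("\t"))
    } else {
        None
    }
}

fn main() {
    // A normal row: outer pipes stripped, inner pipes become tabs.
    assert_eq!(table_row_to_tsv("| a | b |"), Some("a\tb".to_string()));
    // A lone pipe is rejected instead of panicking.
    assert_eq!(table_row_to_tsv("|"), None);
    // Two pipes produce a valid but empty inner slice.
    assert_eq!(table_row_to_tsv("||"), Some(String::new()));
}
```

Note that `||` still passes the guard: the inner slice `[1..1]` is empty but valid, so the row converts to an empty string rather than being skipped.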
--- a/crates/webclaw-fetch/src/client.rs
+++ b/crates/webclaw-fetch/src/client.rs
@@ -261,10 +261,65 @@ impl FetchClient {
        self.cloud.as_deref()
    }
    /// Fetch a URL with per-site rescue paths: Reddit URLs redirect to the
    /// `.json` API, and Akamai-style challenge responses trigger a homepage
    /// cookie warmup and a retry. Returns the same `FetchResult` shape as
    /// [`Self::fetch`] so every caller (CLI, MCP, OSS server, production
    /// server) benefits without shape churn.
    ///
    /// This is the method most callers want. Use plain [`Self::fetch`] only
    /// when you need literal no-rescue behavior (e.g. inside the rescue
    /// logic itself to avoid recursion).
    pub async fn fetch_smart(&self, url: &str) -> Result<FetchResult, FetchError> {
        // Reddit: the HTML page shows a verification interstitial for most
        // client IPs, but appending `.json` returns the post + comment tree
        // publicly. `parse_reddit_json` in downstream code knows how to read
        // the result; here we just do the URL swap at the fetch layer.
        if crate::reddit::is_reddit_url(url) && !url.ends_with(".json") {
            let json_url = crate::reddit::json_url(url);
            // Reddit's public .json API serves JSON to identifiable bot
            // User-Agents and blocks browser UAs with a verification wall.
            // Override our Chrome-profile UA for this specific call.
            let ua = concat!(
                "Webclaw/",
                env!("CARGO_PKG_VERSION"),
                " (+https://webclaw.io)"
            );
            if let Ok(resp) = self
                .fetch_with_headers(&json_url, &[("user-agent", ua)])
                .await
                && resp.status == 200
            {
                let first = resp.html.trim_start().as_bytes().first().copied();
                if matches!(first, Some(b'{') | Some(b'[')) {
                    return Ok(resp);
                }
            }
            // If the .json fetch failed or returned HTML, fall through.
        }
        let resp = self.fetch(url).await?;
        // Akamai / bazadebezolkohpepadr challenge: visit the homepage to
        // collect warmup cookies (_abck, bm_sz, etc.), then retry.
        if is_challenge_html(&resp.html)
            && let Some(homepage) = extract_homepage(url)
        {
            debug!("challenge detected, warming cookies via {homepage}");
            let _ = self.fetch(&homepage).await;
            if let Ok(retry) = self.fetch(url).await {
                return Ok(retry);
            }
        }
        Ok(resp)
    }
    /// Fetch a URL and return the raw HTML + response metadata.
    ///
    /// Automatically retries on transient failures (network errors, 5xx, 429)
-    /// with exponential backoff: 0s, 1s (2 attempts total).
+    /// with exponential backoff: 0s, 1s (2 attempts total). No per-site
    /// rescue logic; use [`Self::fetch_smart`] for that.
    #[instrument(skip(self), fields(url = %url))]
    pub async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
        let delays = [Duration::ZERO, Duration::from_secs(1)];
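The Reddit branch of `fetch_smart` accepts the `.json` response only when the body starts with `{` or `[`, so an HTML verification page served with status 200 is rejected and the normal fetch path runs instead. A minimal sketch of that check as a standalone function follows; `looks_like_json` is a hypothetical name, since the diff performs the check inline.

```rust
/// Hypothetical stand-in for the inline body check in `fetch_smart`.
/// Only inspects the first non-whitespace byte; it does not parse JSON.
fn looks_like_json(body: &str) -> bool {
    let first = body.trim_start().as_bytes().first().copied();
    matches!(first, Some(b'{') | Some(b'['))
}

fn main() {
    // JSON object and array bodies are accepted, with or without leading whitespace.
    assert!(looks_like_json("{\"kind\": \"Listing\"}"));
    assert!(looks_like_json("  \n[1, 2, 3]"));
    // An HTML interstitial or an empty body is rejected.
    assert!(!looks_like_json("<!DOCTYPE html><html></html>"));
    assert!(!looks_like_json(""));
}
```

This is a cheap sniff rather than validation: a body such as `{not json` would pass, and it is left to the downstream parser to reject it.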
@@ -713,22 +768,23 @@ fn is_pdf_content_type(headers: &http::HeaderMap) -> bool {
 /// Detect if a response looks like a bot protection challenge page.
 fn is_challenge_response(response: &Response) -> bool {
-    let len = response.body().len();
+    is_challenge_html(response.text().as_ref())
+}
+/// Same as `is_challenge_response`, operating on a body string directly
+/// so callers holding a `FetchResult` can reuse the heuristic.
+fn is_challenge_html(html: &str) -> bool {
+    let len = html.len();
     if len > 15_000 || len == 0 {
         return false;
     }
-
-    let text = response.text();
-    let lower = text.to_lowercase();
+    let lower = html.to_lowercase();
    if lower.contains("<title>challenge page</title>") {
        return true;
    }
    if lower.contains("bazadebezolkohpepadr") && len < 5_000 {
        return true;
    }
    false
 }
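The size thresholds in `is_challenge_html` are easier to see with concrete inputs. The sketch below copies the post-change function body from the hunk above into a standalone program; the sample strings are made up for illustration and are not taken from real challenge pages.

```rust
/// Copy of the heuristic from the diff, reproduced so it can run standalone.
fn is_challenge_html(html: &str) -> bool {
    let len = html.len();
    // Real content pages are assumed to be larger than 15 kB; empty bodies are ignored.
    if len > 15_000 || len == 0 {
        return false;
    }
    let lower = html.to_lowercase();
    if lower.contains("<title>challenge page</title>") {
        return true;
    }
    // The marker string only counts on very small pages.
    if lower.contains("bazadebezolkohpepadr") && len < 5_000 {
        return true;
    }
    false
}

fn main() {
    // Title match is case-insensitive.
    assert!(is_challenge_html("<html><title>Challenge Page</title></html>"));
    // Small page carrying the marker string.
    assert!(is_challenge_html("<script src=\"/bazadebezolkohpepadr\"></script>"));
    // Same marker on a page over 5_000 bytes is not flagged.
    let medium = format!("bazadebezolkohpepadr{}", "x".repeat(6_000));
    assert!(!is_challenge_html(&medium));
    // Anything over 15_000 bytes is never flagged, even with the title.
    let large = format!("<title>challenge page</title>{}", "x".repeat(20_000));
    assert!(!is_challenge_html(&large));
    assert!(!is_challenge_html(""));
}
```

Because the function now takes `&str`, `fetch_smart` can call it on `FetchResult.html` without needing the original `Response`.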
Author	SHA1	Message	Date
Valerio	a5c3433372	fix(core+server): guard markdown pipe slice + detect trustpilot/reddit verify walls	2026-04-23 15:26:31 +02:00
Valerio	966981bc42	fix(fetch): send bot-identifying UA on reddit .json API to bypass browser UA block	2026-04-23 15:17:04 +02:00
Valerio	866fa88aa0	fix(fetch): reject HTML verification pages served at .json reddit URL	2026-04-23 15:06:35 +02:00
Valerio	b413d702b2	feat(fetch): add fetch_smart with Reddit + Akamai rescue paths, bump 0.5.6	2026-04-23 14:59:29 +02:00
Valerio	98a177dec4	feat(cli): expose safari-ios browser profile + bump to 0.5.5	2026-04-23 13:32:55 +02:00
Valerio	e1af2da509	docs(claude): drop sidecar references, mention ProductionFetcher	2026-04-23 13:25:23 +02:00