Compare commits


19 commits
v0.6.5 ... main

Author SHA1 Message Date
Valerio
df7336d55b
Merge pull request #56 from 0xMassi/docs/nodemaven-partner
docs: add NodeMaven studio partner to README
2026-06-10 17:46:55 +02:00
Valerio
acd3021f38 docs(readme): add NodeMaven studio partner 2026-06-10 17:46:49 +02:00
Valerio
bcc58dbadd
Merge pull request #55 from 0xMassi/fix/docker-multiarch-single-build
ci(release): single multi-platform Docker build + dispatch re-publish
2026-06-10 15:56:36 +02:00
Valerio
8015de7db5 ci(release): build the Docker image in one multi-platform pass
The per-arch build + 'imagetools create' combine failed at the manifest
step with 'v0.6.9-arm64: not found' — buildx's default provenance/SBOM
attestations turn each per-arch tag into an index, and assembling them
races GHCR's read-after-write. Replace it with a single
'docker buildx build --platform linux/amd64,linux/arm64 --push'
(attestations off) so one manifest list is pushed atomically. Dockerfile.ci
now selects binaries by TARGETARCH. Adds a workflow_dispatch path to
re-publish an existing tag's image without rebuilding binaries or bumping
the version.
2026-06-10 15:54:28 +02:00
Valerio
be64409d62
Merge pull request #54 from 0xMassi/fix/docker-multiarch-release
chore: release v0.6.9 (fix multi-arch Docker publish)
2026-06-10 15:30:46 +02:00
Valerio
2773474984 chore: release v0.6.9
Publish the multi-arch Docker image with Buildx instead of the legacy
docker driver, whose GHCR push intermittently failed with 'unknown
blob'. The manifest list is now assembled registry-side with
`imagetools create`. This also unblocks the Homebrew formula update,
which depends on the Docker job. No library or CLI behavior changes.
2026-06-10 15:30:39 +02:00
Valerio
7dfa180e86 chore: release v0.6.8 2026-06-10 14:42:05 +02:00
Valerio
598f319bf3
Merge pull request #52 from 0xMassi/audit-fixes-2026-06-09
fix: harden LLM providers, UTF-8 handling, and webhook/batch reliability
2026-06-10 14:40:29 +02:00
Valerio
fae2766db1
Merge pull request #53 from 0xMassi/docs-coldproxy
docs: add ColdProxy proxy-backed crawling walkthrough
2026-06-10 14:40:01 +02:00
Valerio
d0909a25e3 docs: add ColdProxy proxy-backed crawling walkthrough 2026-06-10 10:42:47 +02:00
Valerio
499345046c fix: harden LLM providers, UTF-8 handling, and webhook/batch reliability
- webclaw-llm: add explicit request + connect timeouts to the reqwest
  client in every provider (anthropic, openai, ollama) with a shorter
  timeout on the ollama health check, so a stalled provider fails fast.
- webclaw-llm: fix a panic when truncating a provider error body that
  contains multibyte characters near the 500-char cut (char-safe take).
- webclaw-core: snap the endpoint-scan budget cut to a UTF-8 char
  boundary so oversized scripts with non-ASCII content no longer panic.
- webclaw-core: rewrite js_literal_to_json to copy raw bytes instead of
  `byte as char`, preserving multibyte UTF-8 in SvelteKit string values
  rather than producing Latin-1 mojibake.
- webclaw-cli: have fire_webhook return its JoinHandle and await it at
  the crawl/batch/batch-llm call sites, removing the fixed 500ms sleeps.
- webclaw-mcp: drop the up-front DNS pre-validation loop in batch that
  aborted the whole request on one bad URL; the fetch layer already
  applies the same SSRF guard per URL and reports per-URL errors.
- webclaw-fetch: include the port in the warmup homepage URL so hosts
  on a non-default port are warmed correctly.

Adds regression tests for the UTF-8 endpoint-scan and SvelteKit cases.
2026-06-09 21:10:15 +02:00
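The two UTF-8 panics fixed in this commit share one cause: slicing a `&str` at a byte offset that falls inside a multibyte character. A minimal sketch of the two remedies the message names (char-safe take, and snapping a byte budget to a char boundary) might look like this. The function names `truncate_chars` and `snap_to_char_boundary` are illustrative assumptions, not identifiers taken from webclaw.

```rust
/// Keep at most `max_chars` characters of `s` without splitting a code point.
/// (Hypothetical helper; mirrors the "char-safe take" idea.)
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        // Byte index where the (max_chars + 1)-th character starts.
        Some((byte_idx, _)) => &s[..byte_idx],
        // The string is already short enough.
        None => s,
    }
}

/// Largest index <= `budget` that lies on a UTF-8 char boundary of `s`.
/// (Hypothetical helper; mirrors the "snap the cut to a boundary" idea.)
fn snap_to_char_boundary(s: &str, budget: usize) -> usize {
    let mut cut = budget.min(s.len());
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(cut) {
        cut -= 1;
    }
    cut
}
```

Slicing with `&s[..snap_to_char_boundary(s, budget)]` cannot panic, whereas `&s[..budget]` panics whenever `budget` lands mid-character.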
Valerio
d0d7b835f2 docs(readme): update banner to new webclaw branding 2026-06-09 18:53:14 +02:00
Valerio
6519ac2a8b chore(release): v0.6.7 2026-06-09 12:38:03 +02:00
Valerio
14ded4b99e chore(deps): bump wreq 6.0.0-rc.29, wreq-util 3.0.0-rc.12
Ports the TLS/Response API breaks in the bump:
- certificate_compression_algorithms -> certificate_compressors with
  wreq-util's BrotliCompressor/ZlibCompressor trait objects
- ExtensionType::APPLICATION_SETTINGS_NEW -> APPLICATION_SETTINGS (same
  codepoint 17613)
- wreq_util::Emulation::SafariIos26.emulation() ->
  Profile::SafariIos26.into_emulation(); Emulation fields are now public
  so *_mut() accessors become direct field access; build() takes a Group
- Response::chunk() removed -> bytes_stream() (wreq 'stream' feature) with
  the running body-size ceiling preserved; adds futures-util

Browser fingerprints verified unchanged on tls.peet.ws: Chrome JA3
43067709b025da334de1279a120f8e14, Safari iOS JA3 8d909525bd5bbb79f133d11cc05159fe.
2026-06-09 12:38:03 +02:00
Valerio
72a451cfb6 chore(release): sync Cargo.lock to v0.6.6 2026-06-09 11:26:18 +02:00
Valerio
17fce81a95 chore(release): v0.6.6
Salvaged two CLI ergonomics fixes from #49:
- periodic progress line on slow fetches (stderr)
- --url-encoded flag + URL truncation warning
2026-06-09 11:24:13 +02:00
Valerio
84a0f9774d style: apply rustfmt to salvaged #49 commits 2026-06-09 11:24:13 +02:00
devnen
519dfb7864 feat(cli): URL truncation warning + --url-encoded flag
When bash splits a URL at & or ? (a common foot-gun), webclaw
receives only the truncated prefix and silently fetches the wrong
page. Per issue #6:

1. Heuristic warning: if the URL ends with '&' or contains '?' with
   no '=' after, emit a stderr warning before fetching:
     # webclaw: warning: URL looks truncated (ends with '&' or '?'); did the shell split it? Quote the URL or use --url-encoded.

2. New flag --url-encoded: parallel input that asserts the user has
   handled escaping. Suppresses the truncation warning since intent
   is explicit.

Fetch proceeds in both cases; this is informational only. 4 new
tests in webclaw-cli. Workspace 720 -> 724.

(cherry picked from commit 4ef27fcd33)
2026-06-09 11:24:13 +02:00
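The heuristic described in this commit (trailing `&`, or a `?` with no `=` after it) can be sketched as below. This is an illustration under stated assumptions, not the code from webclaw-cli: the names `looks_truncated` and `should_warn` are hypothetical, and the sketch ignores URL fragments.

```rust
/// Returns true when the URL looks like the shell split it:
/// it ends with '&', or it has a '?' with no '=' anywhere after it.
fn looks_truncated(url: &str) -> bool {
    if url.ends_with('&') {
        return true;
    }
    match url.split_once('?') {
        // Everything after the first '?' is treated as the query string.
        Some((_, query)) => !query.contains('='),
        None => false,
    }
}

/// The warning is informational only; `url_encoded` models the
/// `--url-encoded` flag, which suppresses it.
fn should_warn(url: &str, url_encoded: bool) -> bool {
    !url_encoded && looks_truncated(url)
}
```

As the commit message states, the fetch proceeds either way, so a caller would print the warning to stderr and continue.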
devnen
985a90b083 feat(fetch): periodic progress stderr line on slow fetches
Webclaw's default -t timeout is 30s; slow sites previously sat
silently with no feedback. Now during a fetch, every 10s of elapsed
time webclaw writes one line to stderr:

  # webclaw: still fetching <URL> (Ns)

Fetches completing in under 10s emit nothing (the timer never fires).
Stdout output is untouched - pure feedback signal on stderr.

No timeout change. No new flags. Default behavior is augmented at
stderr only.

Implemented via tokio::select! between the fetch future and a
tokio::time::interval. Latency cost: a single tokio task spawn
and a 10s tick - microseconds on the fast path.

10 new tests in webclaw-fetch::progress::tests (none ignored; the
slow-future test uses a 50ms test interval to keep cargo test fast).
Workspace total 710 -> 720.

(cherry picked from commit 06f065cb08)
2026-06-09 11:24:13 +02:00
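The commit implements the progress line with `tokio::select!` between the fetch future and a `tokio::time::interval`. To keep the example standalone, the sketch below is a std-only analogue of the same idea, using a worker thread and `mpsc::recv_timeout` in place of the async select. The name `run_with_progress` and its signature are assumptions made for illustration.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Runs `work` on a helper thread and calls `on_tick(elapsed)` each time
/// `tick` passes without a result. Work that finishes before the first
/// tick never triggers the callback.
fn run_with_progress<T, F, P>(work: F, tick: Duration, mut on_tick: P) -> T
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
    P: FnMut(Duration),
{
    let (tx, rx) = mpsc::channel();
    let started = Instant::now();
    thread::spawn(move || {
        // The receiver may already be gone; ignoring the error is fine here.
        let _ = tx.send(work());
    });
    loop {
        match rx.recv_timeout(tick) {
            Ok(value) => return value,
            Err(mpsc::RecvTimeoutError::Timeout) => on_tick(started.elapsed()),
            Err(mpsc::RecvTimeoutError::Disconnected) => {
                panic!("worker thread exited without a result")
            }
        }
    }
}
```

A caller mirroring the commit's behavior would pass a 10 s tick and an `on_tick` that writes `# webclaw: still fetching <URL> (Ns)` with `eprintln!`, leaving stdout untouched.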
21 changed files with 853 additions and 225 deletions

BIN
.github/banner.png vendored

Binary file not shown. Before: 44 KiB. After: 48 KiB.


@@ -3,6 +3,15 @@ name: Release
 on:
   push:
     tags: ["v*"]
+  # Manual re-publish of the Docker image for an existing release, without
+  # rebuilding binaries or cutting a new version. Runs only the docker (+
+  # homebrew) jobs against the given tag's already-published release assets.
+  workflow_dispatch:
+    inputs:
+      tag:
+        description: "Existing release tag to (re)build + push the Docker image for, e.g. v0.6.9"
+        required: true
+        type: string
 permissions:
   contents: read
@@ -12,6 +21,9 @@ env:
 jobs:
   build:
+    # Binaries are only built when a tag is pushed. A manual dispatch reuses
+    # the existing release's binaries, so it skips this job entirely.
+    if: github.event_name == 'push'
     permissions:
       contents: read
     name: Build ${{ matrix.target }}
@@ -105,6 +117,7 @@ jobs:
   release:
     name: Release
+    if: github.event_name == 'push'
     needs: build
     runs-on: ubuntu-latest
     permissions:
@@ -137,6 +150,10 @@ jobs:
   docker:
     name: Docker
     needs: release
+    # Runs after a successful release on tag push, or standalone via
+    # workflow_dispatch to (re)publish an existing tag's image. `always()` lets
+    # it run even though `release` is skipped on a manual dispatch.
+    if: ${{ always() && (github.event_name == 'workflow_dispatch' || needs.release.result == 'success') }}
     runs-on: ubuntu-latest
     permissions:
       contents: read
@@ -156,49 +173,48 @@ jobs:
           username: ${{ github.actor }}
           password: ${{ secrets.GITHUB_TOKEN }}
-      # Download pre-built binaries for both architectures
+      # The pushed tag, or the workflow_dispatch input for a manual re-publish.
+      - name: Resolve tag
+        id: tag
+        run: echo "tag=${{ github.event.inputs.tag || github.ref_name }}" >> "$GITHUB_OUTPUT"
+      # Download pre-built binaries into TARGETARCH-named dirs (amd64/arm64) so
+      # a single multi-platform build picks the matching binary per platform.
       - name: Download release binaries
         run: |
-          tag="${GITHUB_REF#refs/tags/}"
+          tag="${{ steps.tag.outputs.tag }}"
+          declare -A arch=( [x86_64-unknown-linux-gnu]=amd64 [aarch64-unknown-linux-gnu]=arm64 )
           for target in x86_64-unknown-linux-gnu aarch64-unknown-linux-gnu; do
             dir="webclaw-${tag}-${target}"
             curl -sSL "https://github.com/0xMassi/webclaw/releases/download/${tag}/${dir}.tar.gz" -o "${target}.tar.gz"
             tar xzf "${target}.tar.gz"
-            mkdir -p "binaries-${target}"
-            cp "${dir}/webclaw" "binaries-${target}/webclaw"
-            cp "${dir}/webclaw-mcp" "binaries-${target}/webclaw-mcp"
-            cp "${dir}/webclaw-server" "binaries-${target}/webclaw-server"
-            chmod +x "binaries-${target}"/*
+            a="${arch[$target]}"
+            mkdir -p "binaries-${a}"
+            cp "${dir}/webclaw" "${dir}/webclaw-mcp" "${dir}/webclaw-server" "binaries-${a}/"
+            chmod +x "binaries-${a}"/*
           done
           ls -laR binaries-*/
-      # Build per-arch images with plain docker build (no buildx manifest nesting)
+      # One atomic multi-platform build + push. buildx assembles a single
+      # manifest list and pushes it in one shot, so there is no separate
+      # `imagetools create` step to race GHCR's read-after-write (that is what
+      # failed before: "v0.6.9-arm64: not found"). Provenance/SBOM attestations
+      # are disabled so each platform entry stays a plain image manifest.
       - name: Build and push
         run: |
-          tag="${GITHUB_REF#refs/tags/}"
-          # amd64
-          docker build -f Dockerfile.ci --build-arg BINARY_DIR=binaries-x86_64-unknown-linux-gnu \
-            --platform linux/amd64 -t ghcr.io/0xmassi/webclaw:${tag}-amd64 --push .
-          # arm64
-          docker build -f Dockerfile.ci --build-arg BINARY_DIR=binaries-aarch64-unknown-linux-gnu \
-            --platform linux/arm64 -t ghcr.io/0xmassi/webclaw:${tag}-arm64 --push .
-          # Multi-arch manifest
-          docker manifest create ghcr.io/0xmassi/webclaw:${tag} \
-            ghcr.io/0xmassi/webclaw:${tag}-amd64 \
-            ghcr.io/0xmassi/webclaw:${tag}-arm64
-          docker manifest push ghcr.io/0xmassi/webclaw:${tag}
-          docker manifest create ghcr.io/0xmassi/webclaw:latest \
-            ghcr.io/0xmassi/webclaw:${tag}-amd64 \
-            ghcr.io/0xmassi/webclaw:${tag}-arm64
-          docker manifest push ghcr.io/0xmassi/webclaw:latest
+          tag="${{ steps.tag.outputs.tag }}"
+          docker buildx build -f Dockerfile.ci \
+            --platform linux/amd64,linux/arm64 \
+            --provenance=false --sbom=false \
+            -t "ghcr.io/0xmassi/webclaw:${tag}" \
+            -t ghcr.io/0xmassi/webclaw:latest \
+            --push .
   homebrew:
     name: Update Homebrew
     needs: [release, docker]
+    # Runs once Docker succeeds, on both tag push and manual re-publish.
+    if: ${{ always() && needs.docker.result == 'success' }}
     runs-on: ubuntu-latest
     permissions:
       contents: read
@@ -207,7 +223,7 @@ jobs:
         env:
           COMMITTER_TOKEN: ${{ secrets.HOMEBREW_TAP_TOKEN }}
         run: |
-          tag="${GITHUB_REF#refs/tags/}"
+          tag="${{ github.event.inputs.tag || github.ref_name }}"
           base="https://github.com/0xMassi/webclaw/releases/download/${tag}"
           # Download all tarballs (Linux + macOS) and compute SHAs


@@ -3,6 +3,40 @@
 All notable changes to webclaw are documented here.
 Format follows [Keep a Changelog](https://keepachangelog.com/).
+## [Unreleased]
+## [0.6.9] - 2026-06-10
+### Fixed
+- The multi-arch Docker image (linux/amd64 + linux/arm64) now publishes reliably on each release. The build moved to Buildx so registry pushes no longer fail intermittently, and the Homebrew formula update that depends on it is no longer skipped.
+## [0.6.8] - 2026-06-10
+### Fixed
+- Pages with multibyte text (accented or CJK characters) no longer panic or get mangled during extraction. API-endpoint discovery now cuts oversized scripts on a character boundary instead of crashing mid-character, and structured-data parsing preserves non-ASCII string values instead of turning them into mojibake.
+- LLM error messages from a provider no longer panic when the error body contains multibyte characters near the truncation point.
+- LLM provider requests now have explicit connect and overall timeouts, so a stalled or unreachable provider fails fast instead of hanging.
+- Batch extraction in the MCP server no longer aborts the whole batch when a single URL fails to resolve; bad URLs are reported as individual per-URL errors and the rest still run.
+- CLI crawl and batch runs now wait for the completion webhook to actually send before exiting, replacing a fixed delay that could cut the request off or waste time.
+- Homepage warm-up requests now include the port for hosts on a non-default port, so those sites are warmed correctly.
+---
+## [0.6.7] — 2026-06-09
+### Changed
+- Updated the HTTP/TLS engine (wreq 6.0.0-rc.29, wreq-util 3.0.0-rc.12). This pulls in upstream robustness fixes: no more panic on responses with non-UTF8 header values, a fix for short reads when decoding large compressed bodies, and the TCP nodelay setting is restored. Browser TLS fingerprints are unchanged.
+---
+## [0.6.6] — 2026-06-09
+### Added
+- Slow fetches now print a progress line to stderr every 10 seconds (`# webclaw: still fetching <url> (Ns)`) so a long request no longer looks like the CLI hung. Fast fetches stay silent and stdout is untouched.
+- New `--url-encoded` flag plus a warning when a URL looks like the shell split it on `&` or `?`. The warning suggests quoting the URL; pass `--url-encoded` to silence it when the URL is intentional.
+---
 ## [0.6.5] — 2026-06-04
 ### Changed
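The 0.6.8 entry about non-ASCII string values turning into mojibake refers to a classic Rust pitfall: widening each UTF-8 byte with `byte as char` reinterprets the input as Latin-1. The sketch below contrasts that pattern with a byte-preserving copy. Both function names are hypothetical, and this is not the actual `js_literal_to_json` code, only an illustration of the failure mode.

```rust
/// Buggy pattern: each byte becomes its own `char`, so a two-byte
/// UTF-8 sequence turns into two unrelated Latin-1 characters.
fn copy_bytes_as_chars(input: &str) -> String {
    input.bytes().map(|b| b as char).collect()
}

/// Byte-preserving pattern: copy the raw bytes and validate once at the end.
fn copy_raw_bytes(input: &str) -> String {
    let mut out: Vec<u8> = Vec::with_capacity(input.len());
    for &b in input.as_bytes() {
        out.push(b);
    }
    // The input was valid UTF-8 and was copied unchanged, so this succeeds.
    String::from_utf8(out).expect("copied bytes are valid UTF-8")
}
```

For the input `"é"` (bytes `0xC3 0xA9`), the first function yields `"Ã©"`, while the second returns the original string.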

221
Cargo.lock generated

@ -28,18 +28,6 @@ dependencies = [
"cpufeatures", "cpufeatures",
] ]
[[package]]
name = "ahash"
version = "0.8.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5a15f179cd60c4584b8a8c596927aadc462e27f2ca70c04e0071964a73ba7a75"
dependencies = [
"cfg-if",
"once_cell",
"version_check",
"zerocopy",
]
[[package]] [[package]]
name = "aho-corasick" name = "aho-corasick"
version = "1.1.4" version = "1.1.4"
@ -64,6 +52,12 @@ dependencies = [
"alloc-no-stdlib", "alloc-no-stdlib",
] ]
[[package]]
name = "allocator-api2"
version = "0.2.21"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "683d7910e743518b0e34f1186f92494becacb047c7b6bf616c96772180fef923"
[[package]] [[package]]
name = "android_system_properties" name = "android_system_properties"
version = "0.1.5" version = "0.1.5"
@ -272,9 +266,9 @@ dependencies = [
[[package]] [[package]]
name = "bitflags" name = "bitflags"
version = "2.11.0" version = "2.13.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "843867be96c8daad0d758b57df9392b6d8d271134fce549de6ce169ff98a92af" checksum = "b4388bee8683e3d04af747c73422af53102d2bd24d9eadb6cbc100baef4b43f8"
[[package]] [[package]]
name = "block-buffer" name = "block-buffer"
@ -285,31 +279,6 @@ dependencies = [
"generic-array", "generic-array",
] ]
[[package]]
name = "boring-sys2"
version = "5.0.0-alpha.13"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "455d79965f5155dcc88a7abce112c3590883889131b799beda10bf9a813ed669"
dependencies = [
"bindgen",
"cmake",
"fs_extra",
"fslock",
]
[[package]]
name = "boring2"
version = "5.0.0-alpha.13"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "183ccc3854411c035410dcdbffafca62084f3a6c33f013c77e83c025d2a08a28"
dependencies = [
"bitflags",
"boring-sys2",
"foreign-types",
"libc",
"openssl-macros",
]
[[package]] [[package]]
name = "brotli" name = "brotli"
version = "8.0.2" version = "8.0.2"
@ -331,6 +300,31 @@ dependencies = [
"alloc-stdlib", "alloc-stdlib",
] ]
[[package]]
name = "btls"
version = "0.5.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2c5e60b8c8d282c86360cab651ded04ab0335a7b5390c8d34145cbeab8cacf5f"
dependencies = [
"bitflags",
"btls-sys",
"foreign-types",
"libc",
"openssl-macros",
]
[[package]]
name = "btls-sys"
version = "0.5.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9b1b8638a2e1c38a5ae4efa90ae57e643baec35a30d03fc5b399b893adc4954b"
dependencies = [
"bindgen",
"cmake",
"fs_extra",
"fslock",
]
[[package]] [[package]]
name = "bumpalo" name = "bumpalo"
version = "3.20.2" version = "3.20.2"
@ -865,6 +859,12 @@ version = "0.1.5"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d9c4f5dac5e15c24eb999c26181a6ca40b39fe946cbe4c263c7209467bc83af2" checksum = "d9c4f5dac5e15c24eb999c26181a6ca40b39fe946cbe4c263c7209467bc83af2"
[[package]]
name = "foldhash"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "77ce24cb58228fbb8aa041425bb1050850ac19177686ea6e0f41a70416f56fdb"
[[package]] [[package]]
name = "foreign-types" name = "foreign-types"
version = "0.5.0" version = "0.5.0"
@ -1089,19 +1089,13 @@ version = "0.3.3"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0cc23270f6e1808e30a928bdc84dea0b9b4136a8bc82338574f23baf47bbd280" checksum = "0cc23270f6e1808e30a928bdc84dea0b9b4136a8bc82338574f23baf47bbd280"
[[package]]
name = "hashbrown"
version = "0.13.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "43a3c133739dddd0d2990f9a4bdf8eb4b21ef50e4851ca85ab661199821d510e"
[[package]] [[package]]
name = "hashbrown" name = "hashbrown"
version = "0.15.5" version = "0.15.5"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9229cfe53dfd69f0609a49f65461bd93001ea1ef889cd5529dd176593f5338a1" checksum = "9229cfe53dfd69f0609a49f65461bd93001ea1ef889cd5529dd176593f5338a1"
dependencies = [ dependencies = [
"foldhash", "foldhash 0.1.5",
] ]
[[package]] [[package]]
@ -1110,6 +1104,17 @@ version = "0.16.1"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "841d1cc9bed7f9236f321df977030373f4a4163ae1a7dbfe1a51a2c1a51d9100" checksum = "841d1cc9bed7f9236f321df977030373f4a4163ae1a7dbfe1a51a2c1a51d9100"
[[package]]
name = "hashbrown"
version = "0.17.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ed5909b6e89a2db4456e54cd5f673791d7eca6732202bbf2a9cc504fe2f9b84a"
dependencies = [
"allocator-api2",
"equivalent",
"foldhash 0.2.0",
]
[[package]] [[package]]
name = "heck" name = "heck"
version = "0.5.0" version = "0.5.0"
@ -1172,9 +1177,9 @@ dependencies = [
[[package]] [[package]]
name = "http2" name = "http2"
version = "0.5.15" version = "0.5.17"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c45c6490693ee8a8d0d95fdbdf76fead9fb87548f7894137259a7c6d22821948" checksum = "569ef7a780e853c4e1768f58a3c8168193b82cdcbab66638a0b1c6583ec5995e"
dependencies = [ dependencies = [
"atomic-waker", "atomic-waker",
"bytes", "bytes",
@ -1183,7 +1188,6 @@ dependencies = [
"futures-sink", "futures-sink",
"http", "http",
"indexmap", "indexmap",
"parking_lot",
"slab", "slab",
"smallvec", "smallvec",
"tokio", "tokio",
@ -1495,9 +1499,9 @@ checksum = "09edd9e8b54e49e587e4f6295a7d29c3ea94d469cb40ab8ca70b288248a81db2"
[[package]] [[package]]
name = "libc" name = "libc"
version = "0.2.183" version = "0.2.186"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b5b646652bf6661599e1da8901b3b9522896f01e736bad5f723fe7a3a27f899d" checksum = "68ab91017fe16c622486840e4c83c9a37afeff978bd239b5293d61ece587de66"
[[package]] [[package]]
name = "libloading" name = "libloading"
@ -1563,6 +1567,15 @@ dependencies = [
"weezl", "weezl",
] ]
[[package]]
name = "lru"
version = "0.18.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8a860605968fce16869fd239cf4237a82f3ac470723415db603b0e8b6c8d4fb9"
dependencies = [
"hashbrown 0.17.1",
]
[[package]] [[package]]
name = "lru-slab" name = "lru-slab"
version = "0.1.2" version = "0.1.2"
@ -2375,17 +2388,6 @@ dependencies = [
"syn", "syn",
] ]
[[package]]
name = "schnellru"
version = "0.2.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "356285bbf17bea63d9e52e96bd18f039672ac92b55b8cb997d6162a2a37d1649"
dependencies = [
"ahash",
"cfg-if",
"hashbrown 0.13.2",
]
[[package]] [[package]]
name = "scopeguard" name = "scopeguard"
version = "1.2.0" version = "1.2.0"
@ -2779,9 +2781,9 @@ checksum = "1f3ccbac311fea05f86f61904b462b55fb3df8837a366dfc601a0161d0532f20"
[[package]] [[package]]
name = "tokio" name = "tokio"
version = "1.50.0" version = "1.52.3"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "27ad5e34374e03cfffefc301becb44e9dc3c17584f414349ebe29ed26661822d" checksum = "8fc7f01b389ac15039e4dc9531aa973a135d7a4135281b12d7c1bc79fd57fffe"
dependencies = [ dependencies = [
"bytes", "bytes",
"libc", "libc",
@ -2795,20 +2797,20 @@ dependencies = [
] ]
[[package]] [[package]]
name = "tokio-boring2" name = "tokio-btls"
version = "5.0.0-alpha.13" version = "0.5.6"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0f81df1210d791f31d72d840de8fbd80b9c3cb324956523048b1413e2bd55756" checksum = "2e1fd638ec35427faf3b8f412e0fdd6fae76591d79dba40f38fa667d22bc44dd"
dependencies = [ dependencies = [
"boring2", "btls",
"tokio", "tokio",
] ]
[[package]] [[package]]
name = "tokio-macros" name = "tokio-macros"
version = "2.6.1" version = "2.7.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5c55a2eff8b69ce66c84f85e1da1c233edc36ceb85a2058d11b0d6a3c7e7569c" checksum = "385a6cb71ab9ab790c5fe8d67f1645e6c450a7ce006a33de03daa956cf70a496"
dependencies = [ dependencies = [
"proc-macro2", "proc-macro2",
"quote", "quote",
@ -3219,7 +3221,7 @@ dependencies = [
[[package]] [[package]]
name = "webclaw-cli" name = "webclaw-cli"
version = "0.6.5" version = "0.6.9"
dependencies = [ dependencies = [
"clap", "clap",
"dotenvy", "dotenvy",
@ -3240,7 +3242,7 @@ dependencies = [
[[package]] [[package]]
name = "webclaw-core" name = "webclaw-core"
version = "0.6.5" version = "0.6.9"
dependencies = [ dependencies = [
"ego-tree", "ego-tree",
"once_cell", "once_cell",
@ -3258,11 +3260,12 @@ dependencies = [
[[package]] [[package]]
name = "webclaw-fetch" name = "webclaw-fetch"
version = "0.6.5" version = "0.6.9"
dependencies = [ dependencies = [
"async-trait", "async-trait",
"bytes", "bytes",
"calamine", "calamine",
"futures-util",
"http", "http",
"quick-xml 0.37.5", "quick-xml 0.37.5",
"rand 0.8.5", "rand 0.8.5",
@ -3284,7 +3287,7 @@ dependencies = [
[[package]] [[package]]
name = "webclaw-llm" name = "webclaw-llm"
version = "0.6.5" version = "0.6.9"
dependencies = [ dependencies = [
"async-trait", "async-trait",
"reqwest", "reqwest",
@ -3297,7 +3300,7 @@ dependencies = [
[[package]] [[package]]
name = "webclaw-mcp" name = "webclaw-mcp"
version = "0.6.5" version = "0.6.9"
dependencies = [ dependencies = [
"dirs", "dirs",
"dotenvy", "dotenvy",
@ -3317,7 +3320,7 @@ dependencies = [
[[package]] [[package]]
name = "webclaw-pdf" name = "webclaw-pdf"
version = "0.6.5" version = "0.6.9"
dependencies = [ dependencies = [
"pdf-extract", "pdf-extract",
"thiserror", "thiserror",
@ -3326,7 +3329,7 @@ dependencies = [
[[package]] [[package]]
name = "webclaw-server" name = "webclaw-server"
version = "0.6.5" version = "0.6.9"
dependencies = [ dependencies = [
"anyhow", "anyhow",
"axum", "axum",
@ -3347,9 +3350,9 @@ dependencies = [
[[package]] [[package]]
name = "webpki-root-certs" name = "webpki-root-certs"
version = "1.0.6" version = "1.0.7"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "804f18a4ac2676ffb4e8b5b5fa9ae38af06df08162314f96a68d2a363e21a8ca" checksum = "f31141ce3fc3e300ae89b78c0dd67f9708061d1d2eda54b8209346fd6be9a92c"
dependencies = [ dependencies = [
"rustls-pki-types", "rustls-pki-types",
] ]
@ -3696,17 +3699,14 @@ dependencies = [
[[package]] [[package]]
name = "wreq" name = "wreq"
version = "6.0.0-rc.28" version = "6.0.0-rc.29"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f79937f6c4df65b3f6f78715b9de2977afe9ee3b3436483c7949a24511e25935" checksum = "3f0eba5f5814a94e5f1a99156f187133464e525b66bdbc69a9627d46530af2e1"
dependencies = [ dependencies = [
"ahash", "btls",
"boring2", "btls-sys",
"brotli",
"bytes", "bytes",
"cookie", "cookie",
"flate2",
"futures-channel",
"futures-util", "futures-util",
"http", "http",
"http-body", "http-body",
@ -3715,29 +3715,64 @@ dependencies = [
"httparse", "httparse",
"ipnet", "ipnet",
"libc", "libc",
"lru",
"percent-encoding", "percent-encoding",
"pin-project-lite", "pin-project-lite",
"schnellru",
"smallvec",
"socket2", "socket2",
"sync_wrapper",
"tokio", "tokio",
"tokio-boring2", "tokio-btls",
"tokio-util",
"tower", "tower",
"tower-http", "tower-http",
"url", "url",
"want",
"webpki-root-certs", "webpki-root-certs",
"zstd", "wreq-proto",
"wreq-rt",
]
[[package]]
name = "wreq-proto"
version = "0.2.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a43942f024bb303f1042c9aa3c87fa1d9149f507c65db6e5220a11ccdb207387"
dependencies = [
"bytes",
"futures-channel",
"futures-util",
"http",
"http-body",
"http2",
"httparse",
"pin-project-lite",
"smallvec",
"tokio",
"tokio-util",
"want",
]
[[package]]
name = "wreq-rt"
version = "0.2.2-rc.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "99e9bce67a3fa3dd3f1503f066d86661c9caf399a763d3bd184da7afaf886c8b"
dependencies = [
"pin-project-lite",
"tokio",
"wreq-proto",
] ]
[[package]] [[package]]
name = "wreq-util" name = "wreq-util"
version = "3.0.0-rc.10" version = "3.0.0-rc.12"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6c6bbe24d28beb9ceb58b514bd6a613c759d3b706f768b9d2950d5d35b543c04" checksum = "baa5d2ab72139256916ca352a3d05c53d74e1dd360052eb5ba7691033c417c65"
dependencies = [ dependencies = [
"brotli",
"flate2",
"typed-builder", "typed-builder",
"wreq", "wreq",
"zstd",
] ]
[[package]] [[package]]


@@ -3,7 +3,7 @@ resolver = "2"
 members = ["crates/*"]
 [workspace.package]
-version = "0.6.5"
+version = "0.6.9"
 edition = "2024"
 license = "AGPL-3.0"
 repository = "https://github.com/0xMassi/webclaw"


@@ -1,7 +1,6 @@
 # Slim runtime image — uses pre-built binaries from the release.
 # The full Dockerfile (multi-stage Rust build) is for local development.
 # CI uses this to avoid 60+ min QEMU cross-compilation.
-ARG BINARY_DIR=binaries
 FROM ubuntu:24.04
@@ -10,10 +9,13 @@ FROM ubuntu:24.04
 # CI runners and breaks the multi-arch release build. No build-time network.
 COPY --from=gcr.io/distroless/static-debian12 /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
-ARG BINARY_DIR
-COPY ${BINARY_DIR}/webclaw /usr/local/bin/webclaw
-COPY ${BINARY_DIR}/webclaw-mcp /usr/local/bin/webclaw-mcp
-COPY ${BINARY_DIR}/webclaw-server /usr/local/bin/webclaw-server
+# TARGETARCH (amd64 / arm64) is provided automatically by buildx for each
+# target platform, so one multi-platform build copies the matching binaries.
+# The release workflow stages them in binaries-amd64 / binaries-arm64.
+ARG TARGETARCH
+COPY binaries-${TARGETARCH}/webclaw /usr/local/bin/webclaw
+COPY binaries-${TARGETARCH}/webclaw-mcp /usr/local/bin/webclaw-mcp
+COPY binaries-${TARGETARCH}/webclaw-server /usr/local/bin/webclaw-server
 # Default REST API port when running `webclaw-server` inside the container.
 EXPOSE 3000
@@ -25,8 +27,9 @@ ENV WEBCLAW_HOST=0.0.0.0
 # Entrypoint shim: forwards webclaw args/URL to the binary, but exec's other
 # commands directly so this image can be used as a FROM base with custom CMD.
-COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
-RUN chmod +x /usr/local/bin/docker-entrypoint.sh
+# `--chmod` sets the bit at copy time so the build needs no in-container `RUN`
+# (and thus no QEMU emulation for the arm64 platform).
+COPY --chmod=755 docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
 ENTRYPOINT ["docker-entrypoint.sh"]
 CMD ["webclaw", "--help"]


@@ -142,7 +142,7 @@ webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
 - [HTML to Markdown for RAG](examples/html-to-markdown-rag/)
 - [Firecrawl-compatible API](examples/firecrawl-compatible-api/)
 - [MCP web scraping](examples/mcp-web-scraping/)
-- [Proxy-backed crawling](examples/proxy-backed-crawling/)
+- [Proxy-backed crawling with ColdProxy](examples/proxy-backed-crawling/)
 - [Cloudflare diagnostics](examples/cloudflare-diagnostics/)
 ### Extract brand assets
@@ -401,6 +401,8 @@ Please remove secrets, cookies, private tokens, and customer data from logs befo
 residential IPv6, and datacenter IPv6 proxy infrastructure across 195+ countries for public data
 collection, regional testing, monitoring, and web scraping workflows. Explore
 <a href="https://coldproxy.com/">ColdProxy</a>'s latest plans and available offers directly on the website.
+See the <a href="examples/proxy-backed-crawling/#using-coldproxy">proxy-backed crawling guide</a>
+for a hands-on walkthrough of wiring ColdProxy into webclaw.
 </td>
 </tr>
 </table>
@@ -410,6 +412,21 @@ Please remove secrets, cookies, private tokens, and customer data from logs befo
 ## Studio Partners
 <table>
+<tr>
+<td width="340" align="center">
+<a href="https://go.nodemaven.com/webclaw">
+<img src="./assets/sponsors/nodemaven-banner.png" alt="NodeMaven" width="300" />
+</a>
+</td>
+<td>
+<strong>NodeMaven</strong> is the most reliable proxy provider with the highest-quality IPs on the market.
+Best solution for automation, web scraping, SEO research, and social media management: 99.9% uptime,
+sticky sessions up to 7 days, IP filtering (all proxies under a 97% fraud score), no KYC, and cashback up
+to 10% on traffic. Use <code>WEBCLAW35</code> for 35% off Mobile and Residential proxies, or
+<code>WEBCLAW40</code> for 40% off ISP (Static) proxies at
+<a href="https://go.nodemaven.com/webclaw">NodeMaven</a>.
+</td>
+</tr>
 <tr>
 <td width="340" align="center">
 <a href="https://quantumproxies.net/?utm_source=webclaw&utm_medium=github&utm_campaign=sponsor">

Binary file not shown. Size: 302 KiB

View file

@ -166,6 +166,14 @@ struct Cli {
#[arg(long)] #[arg(long)]
urls_file: Option<String>, urls_file: Option<String>,
/// Assert that the URL has been handled for shell escaping. Suppresses
/// the URL-truncation stderr warning. Use when the URL is intentionally
/// passed with an empty/keyless query (e.g. legacy CGI) or when a
/// trailing `&` is genuinely part of the URL. The URL is fetched as-is
/// (no extra normalization beyond the standard scheme prepend).
#[arg(long)]
url_encoded: bool,
/// Output format (markdown, json, text, llm, html) /// Output format (markdown, json, text, llm, html)
#[arg(short, long, default_value = "markdown")] #[arg(short, long, default_value = "markdown")]
format: OutputFormat, format: OutputFormat,
@ -591,6 +599,31 @@ fn normalize_url(url: &str) -> String {
} }
} }
/// M14: detect URLs that look truncated by the shell (e.g. an unquoted URL
/// that the shell split on `&` or `?`). Returns `true` when:
/// - the URL ends with `&` (a trailing param separator suggests the next
/// param was lopped off), OR
/// - the URL contains `?` but no `=` after it (a query with bare keys is
/// rare; usually a real query has at least one `=`).
///
/// Informational only — caller decides whether to warn / abort. This is a
/// heuristic; legitimate URLs with bare-key queries will trigger a false
/// positive (suppressible via `--url-encoded`).
fn looks_truncated(url: &str) -> bool {
let trimmed = url.trim();
if trimmed.ends_with('&') {
return true;
}
if let Some((_before, after_q)) = trimmed.split_once('?') {
// Trim a trailing fragment so `?#section` etc. doesn't mask the check.
let query_part = after_q.split('#').next().unwrap_or(after_q);
if !query_part.contains('=') {
return true;
}
}
false
}
/// Derive a filename from a URL for `--output-dir`. /// Derive a filename from a URL for `--output-dir`.
/// ///
/// Strips the scheme/host, maps the path to a filesystem path, and appends /// Strips the scheme/host, maps the path to a filesystem path, and appends
@ -826,6 +859,14 @@ async fn fetch_and_extract(cli: &Cli) -> Result<FetchOutput, String> {
.urls .urls
.first() .first()
.ok_or("no input provided -- pass a URL, --file, or --stdin")?; .ok_or("no input provided -- pass a URL, --file, or --stdin")?;
// M14: warn when the URL looks like the shell split it on `&` or `?`.
// Informational only — fetch still proceeds. Suppressed by --url-encoded,
// which asserts the caller has handled escaping intentionally.
if !cli.url_encoded && looks_truncated(raw_url) {
eprintln!(
"# webclaw: warning: URL looks truncated (ends with '&' or has a query with no '='); did the shell split it? Quote the URL or use --url-encoded."
);
}
let url = normalize_url(raw_url); let url = normalize_url(raw_url);
let url = url.as_str(); let url = url.as_str();
@ -859,8 +900,11 @@ async fn fetch_and_extract(cli: &Cli) -> Result<FetchOutput, String> {
let client = let client =
FetchClient::new(build_fetch_config(cli)).map_err(|e| format!("client error: {e}"))?; FetchClient::new(build_fetch_config(cli)).map_err(|e| format!("client error: {e}"))?;
let options = build_extraction_options(cli); let options = build_extraction_options(cli);
let result = client // M13: wrap with periodic stderr progress emitter. Fast fetches see
.fetch_and_extract_with_options(url, &options) // zero emissions (timer never fires in <10s); slow fetches get a
// line every 10s of elapsed time so the CLI doesn't appear hung.
let fetch_fut = client.fetch_and_extract_with_options(url, &options);
let result = webclaw_fetch::with_progress(url, fetch_fut)
.await .await
.map_err(|e| format!("fetch error: {e}"))?; .map_err(|e| format!("fetch error: {e}"))?;
@ -1504,7 +1548,7 @@ async fn run_crawl(cli: &Cli) -> Result<(), String> {
// Fire webhook on crawl complete // Fire webhook on crawl complete
if let Some(ref webhook_url) = cli.webhook { if let Some(ref webhook_url) = cli.webhook {
let urls: Vec<&str> = result.pages.iter().map(|p| p.url.as_str()).collect(); let urls: Vec<&str> = result.pages.iter().map(|p| p.url.as_str()).collect();
fire_webhook( let handle = fire_webhook(
webhook_url, webhook_url,
&serde_json::json!({ &serde_json::json!({
"event": "crawl_complete", "event": "crawl_complete",
@ -1515,8 +1559,8 @@ async fn run_crawl(cli: &Cli) -> Result<(), String> {
"urls": urls, "urls": urls,
}), }),
); );
// Brief pause so the async webhook has time to fire // Wait for the webhook to finish so the process doesn't exit mid-send.
tokio::time::sleep(std::time::Duration::from_millis(500)).await; let _ = handle.await;
} }
if result.errors > 0 { if result.errors > 0 {
@ -1614,7 +1658,7 @@ async fn run_batch(cli: &Cli, entries: &[(String, Option<String>)]) -> Result<()
// Fire webhook on batch complete // Fire webhook on batch complete
if let Some(ref webhook_url) = cli.webhook { if let Some(ref webhook_url) = cli.webhook {
let urls: Vec<&str> = results.iter().map(|r| r.url.as_str()).collect(); let urls: Vec<&str> = results.iter().map(|r| r.url.as_str()).collect();
fire_webhook( let handle = fire_webhook(
webhook_url, webhook_url,
&serde_json::json!({ &serde_json::json!({
"event": "batch_complete", "event": "batch_complete",
@ -1624,7 +1668,7 @@ async fn run_batch(cli: &Cli, entries: &[(String, Option<String>)]) -> Result<()
"urls": urls, "urls": urls,
}), }),
); );
tokio::time::sleep(std::time::Duration::from_millis(500)).await; let _ = handle.await;
} }
if errors > 0 { if errors > 0 {
@ -1698,9 +1742,12 @@ async fn spawn_on_change(cmd: &str, stdin_payload: &[u8]) {
} }
} }
/// Fire a webhook POST with a JSON payload. Non-blocking — errors logged to stderr. /// Fire a webhook POST with a JSON payload. Spawns the send on a background task
/// Auto-detects Discord and Slack webhook URLs and wraps the payload accordingly. /// and returns its `JoinHandle` so callers that need delivery (e.g. one-shot
fn fire_webhook(url: &str, payload: &serde_json::Value) { /// crawl/batch runs that exit immediately after) can `.await` it; long-running
/// loops can drop the handle and let it run fire-and-forget. Errors are logged
/// to stderr. Auto-detects Discord and Slack webhook URLs and wraps the payload.
fn fire_webhook(url: &str, payload: &serde_json::Value) -> tokio::task::JoinHandle<()> {
let url = url.to_string(); let url = url.to_string();
let is_discord = url.contains("discord.com/api/webhooks"); let is_discord = url.contains("discord.com/api/webhooks");
let is_slack = url.contains("hooks.slack.com"); let is_slack = url.contains("hooks.slack.com");
@ -1762,7 +1809,7 @@ fn fire_webhook(url: &str, payload: &serde_json::Value) {
}, },
Err(e) => eprintln!("[webhook] client error: {e}"), Err(e) => eprintln!("[webhook] client error: {e}"),
} }
}); })
} }
async fn run_watch(cli: &Cli, urls: &[String]) -> Result<(), String> { async fn run_watch(cli: &Cli, urls: &[String]) -> Result<(), String> {
@ -2274,7 +2321,7 @@ async fn run_batch_llm(cli: &Cli, entries: &[(String, Option<String>)]) -> Resul
eprintln!("Processed {total} URLs ({ok} ok, {errors} errors)"); eprintln!("Processed {total} URLs ({ok} ok, {errors} errors)");
if let Some(ref webhook_url) = cli.webhook { if let Some(ref webhook_url) = cli.webhook {
fire_webhook( let handle = fire_webhook(
webhook_url, webhook_url,
&serde_json::json!({ &serde_json::json!({
"event": "batch_llm_complete", "event": "batch_llm_complete",
@ -2283,7 +2330,7 @@ async fn run_batch_llm(cli: &Cli, entries: &[(String, Option<String>)]) -> Resul
"errors": errors, "errors": errors,
}), }),
); );
tokio::time::sleep(std::time::Duration::from_millis(500)).await; let _ = handle.await;
} }
if errors > 0 { if errors > 0 {
@ -2879,6 +2926,61 @@ mod tests {
let _ = std::fs::remove_dir_all(&dir); let _ = std::fs::remove_dir_all(&dir);
} }
// M14: URL truncation heuristic tests.
#[test]
fn looks_truncated_fires_on_trailing_ampersand() {
// The most common shell-split shape: `?a=1&` lost the `b=2`.
assert!(looks_truncated("https://example.com/?a=1&"));
assert!(looks_truncated("https://example.com/path?key=val&"));
}
#[test]
fn looks_truncated_fires_on_query_with_no_equals() {
// `?foo` with no `=` is a strong signal the shell ate the `=value`.
assert!(looks_truncated("https://example.com/?foo"));
// Bare `?` (empty query) also looks like the shell ate the whole pair.
assert!(looks_truncated("https://example.com/?"));
// Same with a fragment after — strip fragment before checking.
assert!(looks_truncated("https://example.com/?foo#section"));
}
#[test]
fn looks_truncated_silent_on_clean_url() {
// Normal URLs (no query, or query with at least one `=`) are clean.
assert!(!looks_truncated("https://example.com/"));
assert!(!looks_truncated("https://example.com/path/to/page"));
assert!(!looks_truncated("https://example.com/?a=1"));
assert!(!looks_truncated("https://example.com/?a=1&b=2"));
assert!(!looks_truncated(
"https://example.com/?a=1&b=2&c=hello%20world"
));
// Hash anchors without a query are clean.
assert!(!looks_truncated("https://example.com/page#section"));
}
#[test]
fn looks_truncated_silent_with_url_encoded_assertion_modeled_via_skip() {
// The --url-encoded flag suppresses the warning at the call site
// (main.rs gates the eprintln! behind `if !cli.url_encoded`).
// This test models the gate logic directly: when --url-encoded is set,
// the warning branch is never entered, even on a truncated-looking URL.
let url = "https://example.com/?a=1&";
let url_encoded_flag = true;
let should_warn = !url_encoded_flag && looks_truncated(url);
assert!(
!should_warn,
"--url-encoded must suppress the warning even on URL ending with &"
);
// Sanity: same URL without --url-encoded does warn.
let url_encoded_flag = false;
let should_warn = !url_encoded_flag && looks_truncated(url);
assert!(
should_warn,
"without --url-encoded, the warning should fire on URL ending with &"
);
}
#[test] #[test]
fn research_slug_truncation_is_char_safe() { fn research_slug_truncation_is_char_safe() {
// Multibyte query: byte-slicing at 50 would panic mid-codepoint. // Multibyte query: byte-slicing at 50 would panic mid-codepoint.

View file

@ -233,7 +233,13 @@ pub fn extract_endpoints(
} }
let slice = if text.len() > *budget { let slice = if text.len() > *budget {
*truncated = true; *truncated = true;
&text[..*budget] // Snap the cut to a UTF-8 char boundary so non-ASCII content
// (multibyte codepoints straddling the budget) can't panic.
let mut cut = (*budget).min(text.len());
while cut > 0 && !text.is_char_boundary(cut) {
cut -= 1;
}
&text[..cut]
} else { } else {
text text
}; };
@ -512,4 +518,16 @@ mod tests {
); );
assert!(r.hosts.iter().any(|h| h == "pubapi.ticketmaster.co.uk")); assert!(r.hosts.iter().any(|h| h == "pubapi.ticketmaster.co.uk"));
} }
#[test]
fn scan_truncation_at_non_ascii_boundary_does_not_panic() {
// A bundle well over the scan budget, padded with a multibyte char
// ('é' is 2 bytes) so the cut lands mid-codepoint. The old
// `&text[..budget]` slice panicked here; the boundary snap must not.
let pad = "é".repeat(MAX_SCAN_BYTES); // ~2× budget in bytes
let bundle = format!("{pad} fetch(\"/api/x\")");
let bundles = vec![("big.js".to_string(), bundle)];
let r = extract_endpoints("<html></html>", "https://example.com/", &bundles);
assert!(r.truncated, "oversized bundle should mark truncated");
}
} }
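The boundary snap in the hunk above can be read in isolation as a small std-only helper. This is a minimal sketch, not code from the crate: `truncate_to_char_boundary` is a hypothetical name, and the real change applies the same loop inline against `*budget`.

```rust
/// Return the longest prefix of `text` that is at most `budget` bytes long
/// and ends on a UTF-8 character boundary. (Hypothetical helper for illustration.)
fn truncate_to_char_boundary(text: &str, budget: usize) -> &str {
    // Never cut past the end of the string.
    let mut cut = budget.min(text.len());
    // Walk backwards until the cut sits between two characters.
    // Offset 0 is always a boundary, so the loop terminates.
    while cut > 0 && !text.is_char_boundary(cut) {
        cut -= 1;
    }
    &text[..cut]
}
```

With `"héllo"` and a budget of 2 bytes, the naive `&text[..2]` would panic because byte 2 falls inside the two-byte `é`; the helper backs up one byte and returns `"h"` instead.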

View file

@ -178,7 +178,12 @@ pub fn extract_sveltekit(html: &str) -> Vec<Value> {
/// Preserves already-quoted keys and string values. /// Preserves already-quoted keys and string values.
fn js_literal_to_json(input: &str) -> String { fn js_literal_to_json(input: &str) -> String {
let bytes = input.as_bytes(); let bytes = input.as_bytes();
let mut out = String::with_capacity(input.len() + input.len() / 10); // Accumulate raw bytes, not `byte as char`. The input is valid UTF-8 and we
// only ever copy its bytes verbatim or insert ASCII quotes, so the result is
// guaranteed valid UTF-8 — copying byte-by-byte preserves multibyte
// codepoints (e.g. accented/CJK string values) instead of mangling them
// into Latin-1 mojibake.
let mut out: Vec<u8> = Vec::with_capacity(input.len() + input.len() / 10);
let mut i = 0; let mut i = 0;
let len = bytes.len(); let len = bytes.len();
@ -187,14 +192,14 @@ fn js_literal_to_json(input: &str) -> String {
// Skip through strings // Skip through strings
if b == b'"' { if b == b'"' {
out.push('"'); out.push(b'"');
i += 1; i += 1;
while i < len { while i < len {
let c = bytes[i]; let c = bytes[i];
out.push(c as char); out.push(c);
i += 1; i += 1;
if c == b'\\' && i < len { if c == b'\\' && i < len {
out.push(bytes[i] as char); out.push(bytes[i]);
i += 1; i += 1;
} else if c == b'"' { } else if c == b'"' {
break; break;
@ -205,11 +210,11 @@ fn js_literal_to_json(input: &str) -> String {
// After { or , — look for unquoted key followed by : // After { or , — look for unquoted key followed by :
if (b == b'{' || b == b',' || b == b'[') && i + 1 < len { if (b == b'{' || b == b',' || b == b'[') && i + 1 < len {
out.push(b as char); out.push(b);
i += 1; i += 1;
// Skip whitespace // Skip whitespace
while i < len && bytes[i].is_ascii_whitespace() { while i < len && bytes[i].is_ascii_whitespace() {
out.push(bytes[i] as char); out.push(bytes[i]);
i += 1; i += 1;
} }
// Check if next is an unquoted identifier (key) // Check if next is an unquoted identifier (key)
@ -218,29 +223,30 @@ fn js_literal_to_json(input: &str) -> String {
while i < len && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') { while i < len && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
i += 1; i += 1;
} }
let key = &input[key_start..i]; let key = &bytes[key_start..i];
// Skip whitespace after key // Skip whitespace after key
while i < len && bytes[i].is_ascii_whitespace() { while i < len && bytes[i].is_ascii_whitespace() {
i += 1; i += 1;
} }
// If followed by :, it's an unquoted key — quote it // If followed by :, it's an unquoted key — quote it
if i < len && bytes[i] == b':' { if i < len && bytes[i] == b':' {
out.push('"'); out.push(b'"');
out.push_str(key); out.extend_from_slice(key);
out.push('"'); out.push(b'"');
} else { } else {
// Not a key — might be a bare value like true/false/null // Not a key — might be a bare value like true/false/null
out.push_str(key); out.extend_from_slice(key);
} }
} }
continue; continue;
} }
out.push(b as char); out.push(b);
i += 1; i += 1;
} }
out // Safe: we only copied bytes from valid-UTF-8 `input` plus ASCII quotes.
String::from_utf8(out).unwrap_or_else(|e| String::from_utf8_lossy(e.as_bytes()).into_owned())
} }
/// Replace raw newlines/tabs inside JSON string values with escape sequences. /// Replace raw newlines/tabs inside JSON string values with escape sequences.
@ -440,4 +446,17 @@ newline"}"#;
assert_eq!(parsed["text"], "line1\nline2"); assert_eq!(parsed["text"], "line1\nline2");
assert_eq!(parsed["raw"], "has\nnewline"); assert_eq!(parsed["raw"], "has\nnewline");
} }
#[test]
fn js_literal_to_json_preserves_multibyte_utf8() {
// Unquoted ASCII keys with accented and CJK string values (the shape
// SvelteKit emits). The old `byte as char` path turned the multibyte
// values into Latin-1 mojibake; they must now survive intact.
let input = r#"{name:"déjà vu", city:"東京", emoji:"🌱"}"#;
let json = js_literal_to_json(input);
let parsed: Value = serde_json::from_str(&json).unwrap();
assert_eq!(parsed["name"], "déjà vu");
assert_eq!(parsed["city"], "東京");
assert_eq!(parsed["emoji"], "🌱");
}
} }
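To see why the `byte as char` path mangled non-ASCII values, the two copying strategies can be compared side by side. This is an illustrative std-only sketch; `copy_as_chars` and `copy_as_bytes` are hypothetical names that only model the old and new accumulation styles, not the full key-quoting logic of `js_literal_to_json`.

```rust
/// Old style: widen every UTF-8 byte to its own `char`. Each byte of a
/// multibyte sequence is reinterpreted as a Latin-1 code point.
fn copy_as_chars(input: &str) -> String {
    input.bytes().map(|b| b as char).collect()
}

/// New style: accumulate raw bytes and validate once at the end.
fn copy_as_bytes(input: &str) -> String {
    let mut out: Vec<u8> = Vec::with_capacity(input.len());
    out.extend_from_slice(input.as_bytes());
    // The bytes were copied verbatim from a valid `&str`, so this cannot fail.
    String::from_utf8(out).expect("verbatim copy of valid UTF-8")
}
```

`é` is encoded as the bytes `0xC3 0xA9`, so the char-widening version yields the two characters `Ã` and `©`, while the byte-accumulating version returns the input unchanged. ASCII-only input is identical under both.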

View file

@ -14,13 +14,16 @@ tracing = { workspace = true }
tokio = { workspace = true } tokio = { workspace = true }
async-trait = "0.1" async-trait = "0.1"
# Pinned to exact pre-release versions: wreq/wreq-util are release candidates # Pinned to exact pre-release versions: wreq/wreq-util are release candidates
# with no semver stability between rc.N builds (rc.29 broke the TLS + Response # with no semver stability between rc.N builds. An exact pin keeps `cargo build`,
# API). An exact pin keeps `cargo build`, `cargo install` (which ignores # `cargo install` (which ignores Cargo.lock), and the release workflow all on the
# Cargo.lock), and the release workflow all on the version that compiles. # version that compiles.
wreq = { version = "=6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] } wreq = { version = "=6.0.0-rc.29", features = ["cookies", "gzip", "brotli", "zstd", "deflate", "stream"] }
wreq-util = "=3.0.0-rc.10" wreq-util = "=3.0.0-rc.12"
http = "1" http = "1"
bytes = "1" bytes = "1"
# Stream adapter for `wreq::Response::bytes_stream()` (wreq 6.0.0-rc.29 dropped
# `Response::chunk()`); used to buffer bodies under the running size ceiling.
futures-util = "0.3"
url = "2" url = "2"
rand = "0.8" rand = "0.8"
quick-xml = { version = "0.37", features = ["serde"] } quick-xml = { version = "0.37", features = ["serde"] }

View file

@ -12,6 +12,7 @@ use std::hash::{Hash, Hasher};
use std::sync::Arc; use std::sync::Arc;
use std::time::{Duration, Instant}; use std::time::{Duration, Instant};
use futures_util::StreamExt;
use rand::seq::SliceRandom; use rand::seq::SliceRandom;
use tokio::sync::Semaphore; use tokio::sync::Semaphore;
use tracing::{debug, instrument, warn}; use tracing::{debug, instrument, warn};
@ -118,7 +119,7 @@ impl Response {
/// negotiated), so a tiny compressed payload that inflates to /// negotiated), so a tiny compressed payload that inflates to
/// gigabytes is aborted as soon as the accumulated size crosses the /// gigabytes is aborted as soon as the accumulated size crosses the
/// cap — it never gets fully buffered in memory. /// cap — it never gets fully buffered in memory.
async fn from_wreq(mut resp: wreq::Response) -> Result<Self, FetchError> { async fn from_wreq(resp: wreq::Response) -> Result<Self, FetchError> {
if let Some(len) = resp.content_length() if let Some(len) = resp.content_length()
&& len > MAX_BODY_BYTES && len > MAX_BODY_BYTES
{ {
@ -130,12 +131,13 @@ impl Response {
let url = resp.uri().to_string(); let url = resp.uri().to_string();
let headers = resp.headers().clone(); let headers = resp.headers().clone();
// wreq 6.0.0-rc.29 dropped `Response::chunk()`. Stream post-decompression
// bytes via `bytes_stream()` and keep enforcing the running ceiling so a
// compression bomb is aborted before it is fully buffered in memory.
let mut buf = bytes::BytesMut::new(); let mut buf = bytes::BytesMut::new();
while let Some(chunk) = resp let mut stream = resp.bytes_stream();
.chunk() while let Some(chunk) = stream.next().await {
.await let chunk = chunk.map_err(|e| FetchError::BodyDecode(e.to_string()))?;
.map_err(|e| FetchError::BodyDecode(e.to_string()))?
{
check_body_ceiling(buf.len(), chunk.len())?; check_body_ceiling(buf.len(), chunk.len())?;
buf.extend_from_slice(&chunk); buf.extend_from_slice(&chunk);
} }
@ -799,11 +801,17 @@ fn is_challenge_html(html: &str) -> bool {
false false
} }
/// Extract the homepage URL (scheme + host) from a full URL. /// Extract the homepage URL (scheme + host[:port]) from a full URL.
fn extract_homepage(url: &str) -> Option<String> { fn extract_homepage(url: &str) -> Option<String> {
url::Url::parse(url) url::Url::parse(url).ok().map(|u| {
.ok() let host = u.host_str().unwrap_or("");
.map(|u| format!("{}://{}/", u.scheme(), u.host_str().unwrap_or(""))) // `port()` is `Some` only for a non-default port; include it so a
// host like example.com:8443 is warmed on the right port.
match u.port() {
Some(port) => format!("{}://{}:{}/", u.scheme(), host, port),
None => format!("{}://{}/", u.scheme(), host),
}
})
} }
/// Convert a webclaw-pdf PdfResult into a webclaw-core ExtractionResult. /// Convert a webclaw-pdf PdfResult into a webclaw-core ExtractionResult.
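The running-ceiling loop in `from_wreq` follows a general pattern: check the size before appending each chunk, so an oversized body is rejected without ever being fully buffered. The sketch below models that pattern with a plain iterator instead of a `wreq` byte stream. `buffer_with_ceiling` and its `String` error are assumptions made for illustration; the body of the crate's `check_body_ceiling` is not shown in this diff and may differ.

```rust
/// Append chunks to a buffer, failing as soon as the running total would
/// exceed `cap`. (Hypothetical stand-in for the streaming loop.)
fn buffer_with_ceiling<I>(chunks: I, cap: usize) -> Result<Vec<u8>, String>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    let mut buf = Vec::new();
    for chunk in chunks {
        // Check before copying so the offending chunk is never appended.
        let next = buf
            .len()
            .checked_add(chunk.len())
            .ok_or("length overflow")?;
        if next > cap {
            return Err(format!("body exceeds {cap} bytes"));
        }
        buf.extend_from_slice(&chunk);
    }
    Ok(buf)
}
```

Two 8-byte chunks fit under a 16-byte cap; an 8-byte chunk followed by a 9-byte chunk is rejected on the second chunk, with only the first 8 bytes ever held in memory.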

View file

@ -11,6 +11,7 @@ pub mod extractors;
pub mod fetcher; pub mod fetcher;
pub mod linkedin; pub mod linkedin;
pub mod locale; pub mod locale;
pub mod progress;
pub mod proxy; pub mod proxy;
pub mod reddit; pub mod reddit;
pub mod sitemap; pub mod sitemap;
@ -24,6 +25,7 @@ pub use error::FetchError;
pub use fetcher::Fetcher; pub use fetcher::Fetcher;
pub use http::HeaderMap; pub use http::HeaderMap;
pub use locale::{accept_language_for_tld, accept_language_for_url}; pub use locale::{accept_language_for_tld, accept_language_for_url};
pub use progress::{PROGRESS_INTERVAL, with_progress};
pub use proxy::{parse_proxy_file, parse_proxy_line}; pub use proxy::{parse_proxy_file, parse_proxy_line};
pub use sitemap::SitemapEntry; pub use sitemap::SitemapEntry;
pub use webclaw_pdf::PdfMode; pub use webclaw_pdf::PdfMode;

View file

@ -0,0 +1,293 @@
//! Periodic stderr progress line emitter for slow fetches (M13).
//!
//! Wraps any async fetch future with a `tokio::select!` against a
//! `tokio::time::interval`. Every `PROGRESS_INTERVAL` (default 10s) of
//! elapsed time, emits one line to STDERR of the form:
//!
//! ```text
//! # webclaw: still fetching <URL> (Ns)
//! ```
//!
//! Fetches completing in under `PROGRESS_INTERVAL` emit zero lines (the
//! timer never fires). Stdout is untouched.
//!
//! The URL is truncated to at most 80 chars (head + `...` + tail) so
//! pathological query strings don't blow up the stderr line. Truncation
//! is char-boundary safe (operates on `chars`, not bytes).
use std::future::Future;
use std::time::Duration;
use tokio::time::{Instant, MissedTickBehavior, interval};
/// Default progress emission interval. The first tick fires at +10s
/// elapsed; subsequent ticks at +20s, +30s, etc.
pub const PROGRESS_INTERVAL: Duration = Duration::from_secs(10);
/// Maximum URL length in the progress line. Longer URLs are truncated
/// `head...tail` style.
const MAX_URL_LEN: usize = 80;
/// Wrap a fetch future with the default 10s progress emitter. Writes
/// progress lines to STDERR via `eprintln!`. Returns the inner future's
/// result unchanged.
pub async fn with_progress<F, T>(url: &str, future: F) -> T
where
F: Future<Output = T>,
{
with_progress_writer(url, future, PROGRESS_INTERVAL, |s| eprintln!("{s}")).await
}
/// Test-friendly variant of [`with_progress`]: caller supplies the tick
/// interval (so tests can use a 50ms period instead of 10s) and a
/// writer closure (so tests can capture emitted lines without touching
/// real stderr).
///
/// Production code uses [`with_progress`] which delegates here with
/// [`PROGRESS_INTERVAL`] and an `eprintln!` writer.
pub async fn with_progress_writer<F, T, W>(
url: &str,
future: F,
period: Duration,
mut writer: W,
) -> T
where
F: Future<Output = T>,
W: FnMut(String),
{
let start = Instant::now();
let mut ticker = interval(period);
// First tick of `tokio::time::interval(period)` fires *immediately*
// (at construction time). We don't want a t=0 emit — consume that
// first tick before entering the select loop. Subsequent ticks fire
// at `start + period`, `start + 2*period`, ...
ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);
ticker.tick().await;
tokio::pin!(future);
loop {
tokio::select! {
// Bias toward the future — if both are ready (rare), prefer
// returning the result over emitting a final tick.
biased;
result = &mut future => {
return result;
}
_ = ticker.tick() => {
let elapsed = start.elapsed();
writer(format_progress_line(url, elapsed));
}
}
}
}
/// Build the progress line: `# webclaw: still fetching <URL> (Ns)`.
/// URL is truncated via [`truncate_url`] to [`MAX_URL_LEN`] chars.
/// Elapsed is truncated to whole seconds (e.g. 9.5s prints as `9s`).
pub(crate) fn format_progress_line(url: &str, elapsed: Duration) -> String {
let truncated = truncate_url(url, MAX_URL_LEN);
let secs = elapsed.as_secs();
format!("# webclaw: still fetching {truncated} ({secs}s)")
}
/// Truncate `url` to at most `max` chars, using `head...tail` shape
/// when truncation is needed. Char-boundary safe (operates on `chars`).
pub(crate) fn truncate_url(url: &str, max: usize) -> String {
let total_chars = url.chars().count();
if total_chars <= max {
return url.to_string();
}
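// Reserve 3 chars for "..." and give the tail (query side) a fixed 17
// chars; the head (scheme/host/path side) gets the rest, e.g. 60 of 80.
// Caps below 20 chars are not handled: the result would exceed `max`.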
let avail = max.saturating_sub(3);
let head_chars = avail.saturating_sub(17);
let tail_chars = 17;
let head: String = url.chars().take(head_chars).collect();
let tail: String = url
.chars()
.rev()
.take(tail_chars)
.collect::<Vec<_>>()
.into_iter()
.rev()
.collect();
format!("{head}...{tail}")
}
#[cfg(test)]
mod tests {
use super::*;
use std::sync::{Arc, Mutex};
/// Collect emitted lines into a `Vec<String>` via a captured writer.
fn capture() -> (Arc<Mutex<Vec<String>>>, impl FnMut(String)) {
let sink: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(Vec::new()));
let sink_clone = Arc::clone(&sink);
let writer = move |s: String| {
sink_clone.lock().unwrap().push(s);
};
(sink, writer)
}
#[tokio::test]
async fn test_progress_emits_after_interval_elapsed() {
let (sink, writer) = capture();
// 250ms future, 50ms interval — expect ~4-5 ticks before resolution.
let fut = tokio::time::sleep(Duration::from_millis(250));
with_progress_writer(
"https://example.com/slow",
async {
fut.await;
42_i32
},
Duration::from_millis(50),
writer,
)
.await;
let lines = sink.lock().unwrap();
assert!(
!lines.is_empty(),
"expected >=1 progress line; got {} ({:?})",
lines.len(),
*lines
);
for line in lines.iter() {
assert!(
line.starts_with("# webclaw: still fetching"),
"line shape wrong: {line:?}"
);
assert!(
line.contains("https://example.com/slow"),
"url missing from line: {line:?}"
);
}
}
#[tokio::test]
async fn test_progress_silent_on_fast_future() {
let (sink, writer) = capture();
// 10ms future, 1s interval — zero ticks expected.
let result = with_progress_writer(
"https://example.com/fast",
async {
tokio::time::sleep(Duration::from_millis(10)).await;
"done"
},
Duration::from_secs(1),
writer,
)
.await;
assert_eq!(result, "done");
let lines = sink.lock().unwrap();
assert_eq!(
lines.len(),
0,
"expected 0 progress lines on fast future; got {:?}",
*lines
);
}
#[tokio::test]
async fn test_progress_line_includes_url() {
let (sink, writer) = capture();
let target_url = "https://news.ycombinator.com/item?id=12345";
with_progress_writer(
target_url,
async {
tokio::time::sleep(Duration::from_millis(150)).await;
},
Duration::from_millis(50),
writer,
)
.await;
let lines = sink.lock().unwrap();
assert!(!lines.is_empty(), "expected progress lines");
assert!(
lines.iter().all(|l| l.contains(target_url)),
"every line should contain the URL: {:?}",
*lines
);
}
#[tokio::test]
async fn test_progress_returns_inner_result_ok() {
let (_sink, writer) = capture();
let r: Result<i32, String> = with_progress_writer(
"https://example.com/",
async { Ok::<i32, String>(7) },
Duration::from_secs(1),
writer,
)
.await;
assert_eq!(r, Ok(7));
}
#[tokio::test]
async fn test_progress_propagates_error() {
let (_sink, writer) = capture();
let r: Result<i32, String> = with_progress_writer(
"https://example.com/",
async { Err::<i32, String>("boom".to_string()) },
Duration::from_secs(1),
writer,
)
.await;
assert_eq!(r, Err("boom".to_string()));
}
#[test]
fn test_truncate_url_short_passthrough() {
let url = "https://example.com/";
assert_eq!(truncate_url(url, 80), url);
}
#[test]
fn test_truncate_url_long_head_dots_tail() {
let url = "https://www.example.com/very/long/path/segments/with/lots/of/text/and/then?q=some_long_query_string_value_here&other=more&another=thing";
let truncated = truncate_url(url, 80);
assert!(
truncated.chars().count() <= 80,
"truncated length {} > 80: {truncated:?}",
truncated.chars().count()
);
assert!(
truncated.contains("..."),
"expected '...' marker in truncated url: {truncated:?}"
);
assert!(
truncated.starts_with("https://www.example.com/"),
"truncated should start with the URL head: {truncated:?}"
);
}
#[test]
fn test_truncate_url_unicode_safe() {
// Cyrillic URL longer than 80 chars — must not panic on a
// mid-codepoint split.
let url =
"https://example.com/путь/к/очень/длинной/странице/с/большим/количеством/кириллицы/тут";
let truncated = truncate_url(url, 80);
assert!(truncated.is_char_boundary(truncated.len()));
// Roundtrip through chars to confirm valid UTF-8 throughout.
let _: String = truncated.chars().collect();
}
#[test]
fn test_format_progress_line_shape() {
let line = format_progress_line("https://example.com/", Duration::from_secs(10));
assert_eq!(line, "# webclaw: still fetching https://example.com/ (10s)");
}
#[test]
fn test_format_progress_line_seconds_only() {
// Fractional seconds are truncated, not rounded: 9.5s prints as `9s`.
// (In practice the first tick fires at +PROGRESS_INTERVAL so this is
// mostly a defensive shape assertion.)
let line = format_progress_line("https://x/", Duration::from_millis(9_500));
assert!(
line.ends_with("(9s)"),
"line should end with `(9s)`: {line:?}"
);
}
}
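`truncate_url` is only ever called with `MAX_URL_LEN` (80), where the fixed 17-char tail always fits. If the helper were reused with a smaller cap, the fixed tail plus the ellipsis could exceed the requested length. The following is a hedged variant, not the crate's code: `truncate_middle` is a hypothetical name, and falling back to a plain prefix cut for small caps is one possible choice among several.

```rust
/// Middle-truncate `url` to at most `max` chars, keeping a 17-char tail.
/// Falls back to a plain prefix when `max` is too small for head + "..." + tail.
fn truncate_middle(url: &str, max: usize) -> String {
    const ELLIPSIS_CHARS: usize = 3;
    const TAIL_CHARS: usize = 17;

    let total = url.chars().count();
    if total <= max {
        return url.to_string();
    }
    // Not enough room for a non-empty head: just keep the first `max` chars.
    if max <= ELLIPSIS_CHARS + TAIL_CHARS {
        return url.chars().take(max).collect();
    }
    let head_chars = max - ELLIPSIS_CHARS - TAIL_CHARS;
    let head: String = url.chars().take(head_chars).collect();
    // `total > max > TAIL_CHARS` here, so the subtraction cannot underflow.
    let tail: String = url.chars().skip(total - TAIL_CHARS).collect();
    format!("{head}...{tail}")
}
```

Like the original, this counts `char`s rather than bytes, so it never splits a multibyte codepoint.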

View file

@ -10,15 +10,24 @@ use std::{borrow::Cow, io, time::Duration};
 use wreq::http2::{
     Http2Options, PseudoId, PseudoOrder, SettingId, SettingsOrder, StreamDependency, StreamId,
 };
-use wreq::tls::{
-    AlpnProtocol, AlpsProtocol, CertificateCompressionAlgorithm, ExtensionType, TlsOptions,
-    TlsVersion,
-};
-use wreq::{Client, Emulation};
+use wreq::tls::compress::CertificateCompressor;
+use wreq::tls::{AlpnProtocol, AlpsProtocol, ExtensionType, TlsOptions, TlsVersion};
+use wreq::{Client, Emulation, Group, IntoEmulation};
+use wreq_util::emulate::compress::{BrotliCompressor, ZlibCompressor};
 use crate::browser::BrowserVariant;
 use crate::error::FetchError;
+// Certificate-compression advertisement per profile. wreq 6.0.0-rc.29 replaced
+// the `CertificateCompressionAlgorithm` enum argument with `&dyn
+// CertificateCompressor` trait objects; wreq-util ships the concrete zlib/brotli
+// implementations. The advertised set (and order) is a TLS fingerprint signal,
+// so these mirror the previous enum lists exactly.
+static CHROME_CERT_COMPRESSORS: &[&'static dyn CertificateCompressor] = &[&BrotliCompressor];
+static FIREFOX_CERT_COMPRESSORS: &[&'static dyn CertificateCompressor] =
+    &[&ZlibCompressor, &BrotliCompressor];
+static SAFARI_CERT_COMPRESSORS: &[&'static dyn CertificateCompressor] = &[&ZlibCompressor];
 #[derive(Clone, Default)]
 struct PublicDnsResolver;
@ -119,14 +128,14 @@ fn chrome_extensions() -> Vec<ExtensionType> {
         ExtensionType::PSK_KEY_EXCHANGE_MODES, // 45
         ExtensionType::EC_POINT_FORMATS, // 11
         ExtensionType::CERT_COMPRESSION, // 27
-        ExtensionType::APPLICATION_SETTINGS_NEW, // 17613 (new codepoint, matches alps_use_new_codepoint)
+        ExtensionType::APPLICATION_SETTINGS, // 17613 (new codepoint, matches alps_use_new_codepoint)
         ExtensionType::SUPPORTED_VERSIONS, // 43
         ExtensionType::SIGNATURE_ALGORITHMS, // 13
         ExtensionType::SERVER_NAME, // 0
         ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION, // 16
         ExtensionType::ENCRYPTED_CLIENT_HELLO, // 65037
         ExtensionType::RENEGOTIATE, // 65281
         ExtensionType::EXTENDED_MASTER_SECRET, // 23
     ]
 }
@ -287,7 +296,7 @@ fn chrome_tls() -> TlsOptions {
         .alps_protocols([AlpsProtocol::HTTP3, AlpsProtocol::HTTP2])
         .alps_use_new_codepoint(true)
         .aes_hw_override(true)
-        .certificate_compression_algorithms(&[CertificateCompressionAlgorithm::BROTLI])
+        .certificate_compressors(CHROME_CERT_COMPRESSORS)
         .build()
 }
@ -304,10 +313,7 @@ fn firefox_tls() -> TlsOptions {
         .pre_shared_key(true)
         .enable_ocsp_stapling(true)
         .enable_signed_cert_timestamps(true)
-        .certificate_compression_algorithms(&[
-            CertificateCompressionAlgorithm::ZLIB,
-            CertificateCompressionAlgorithm::BROTLI,
-        ])
+        .certificate_compressors(FIREFOX_CERT_COMPRESSORS)
         .build()
 }
@ -324,7 +330,7 @@ fn safari_tls() -> TlsOptions {
         .pre_shared_key(false)
         .enable_ocsp_stapling(true)
         .enable_signed_cert_timestamps(true)
-        .certificate_compression_algorithms(&[CertificateCompressionAlgorithm::ZLIB])
+        .certificate_compressors(SAFARI_CERT_COMPRESSORS)
         .build()
 }
@ -345,21 +351,23 @@ fn safari_tls() -> TlsOptions {
 /// `priority: u=0, i`, zstd), replace with the real iOS 26 set.
 /// 4. `accept-language` preserved from config.extra_headers for locale.
 fn safari_ios_emulation() -> wreq::Emulation {
-    use wreq::EmulationFactory;
-    let mut em = wreq_util::Emulation::SafariIos26.emulation();
+    // wreq 6.0.0-rc.29 exposes the `Emulation` fields directly (no `*_mut()`
+    // accessors) and wreq-util 3.0.0-rc.12 renamed the enum to `Profile` with
+    // `IntoEmulation::into_emulation` replacing `EmulationFactory::emulation`.
+    let mut em = wreq_util::Profile::SafariIos26.into_emulation();
-    if let Some(tls) = em.tls_options_mut().as_mut() {
+    if let Some(tls) = em.tls_options.as_mut() {
         tls.extension_permutation = Some(Cow::Owned(safari_ios_extensions()));
     }
     // Only override the priority flag — keep wreq-util's SETTINGS, WINDOW_UPDATE,
     // and pseudo-order intact. Replacing the whole Http2Options resets SETTINGS
     // to defaults, which sends only INITIAL_WINDOW_SIZE and fails DataDome.
-    if let Some(h2) = em.http2_options_mut().as_mut() {
+    if let Some(h2) = em.http2_options.as_mut() {
         h2.headers_stream_dependency = Some(StreamDependency::new(StreamId::zero(), 255, true));
     }
-    let hm = em.headers_mut();
+    let hm = &mut em.headers;
     hm.clear();
     for (k, v) in SAFARI_IOS_HEADERS {
         if let (Ok(n), Ok(val)) = (
@ -508,12 +516,12 @@ pub fn build_client(
                 .tls_options(tls)
                 .http2_options(h2)
                 .headers(build_headers(headers))
-                .build()
+                .build(Group::default())
         }
     };
     // Append extra headers after profile defaults.
-    let hm = emulation.headers_mut();
+    let hm = &mut emulation.headers;
     for (k, v) in extra_headers {
         if let (Ok(n), Ok(val)) = (
             http::header::HeaderName::from_bytes(k.as_bytes()),


@ -1,6 +1,8 @@
 /// Anthropic provider — Claude models via api.anthropic.com.
 /// Anthropic's API differs from OpenAI: system message is a top-level param,
 /// not part of the messages array.
+use std::time::Duration;
 use async_trait::async_trait;
 use serde_json::json;
@ -35,7 +37,11 @@ impl AnthropicProvider {
         let key = load_api_key(key_override, "ANTHROPIC_API_KEY")?;
         Some(Self {
-            client: reqwest::Client::new(),
+            client: reqwest::Client::builder()
+                .timeout(Duration::from_secs(120))
+                .connect_timeout(Duration::from_secs(10))
+                .build()
+                .unwrap_or_else(|_| reqwest::Client::new()),
             key,
             base_url: base_url
                 .or_else(|| std::env::var("ANTHROPIC_BASE_URL").ok())
@ -108,11 +114,7 @@ impl LlmProvider for AnthropicProvider {
         if !resp.status().is_success() {
             let status = resp.status();
             let text = resp.text().await.unwrap_or_default();
-            let safe_text = if text.len() > 500 {
-                &text[..500]
-            } else {
-                &text
-            };
+            let safe_text = text.chars().take(500).collect::<String>();
             return Err(LlmError::ProviderError(format!(
                 "anthropic returned {status}: {safe_text}"
             )));
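
The truncation change above matters because `&text[..500]` indexes by byte and panics when byte 500 falls inside a multi-byte UTF-8 character, which is plausible for an error body that contains non-ASCII text. The sketch below contrasts the two approaches using only the standard library. `truncate_chars` and `byte_slice_would_panic` are hypothetical names chosen for illustration, not functions from this repository.

```rust
/// Hypothetical helper (not from the codebase): keep at most `max` characters.
/// Counting `char`s instead of bytes means a multi-byte character is never split.
fn truncate_chars(text: &str, max: usize) -> String {
    text.chars().take(max).collect()
}

/// Returns true when `&text[..max]` would panic because `max` is in range
/// but lands inside a multi-byte character rather than on a boundary.
fn byte_slice_would_panic(text: &str, max: usize) -> bool {
    max < text.len() && !text.is_char_boundary(max)
}
```

Note that 500 characters can occupy up to 2,000 bytes, so the new form bounds the character count of the logged body rather than its byte length.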


@ -1,5 +1,7 @@
 /// Ollama provider — talks to a local Ollama instance (default localhost:11434).
 /// First choice in the provider chain: free, private, fast on Apple Silicon.
+use std::time::Duration;
 use async_trait::async_trait;
 use serde_json::json;
@ -24,7 +26,11 @@ impl OllamaProvider {
             .unwrap_or_else(|| "qwen3:8b".into());
         Self {
-            client: reqwest::Client::new(),
+            client: reqwest::Client::builder()
+                .timeout(Duration::from_secs(120))
+                .connect_timeout(Duration::from_secs(10))
+                .build()
+                .unwrap_or_else(|_| reqwest::Client::new()),
             base_url,
             default_model,
         }
@ -70,11 +76,7 @@ impl LlmProvider for OllamaProvider {
         if !resp.status().is_success() {
             let status = resp.status();
             let text = resp.text().await.unwrap_or_default();
-            let safe_text = if text.len() > 500 {
-                &text[..500]
-            } else {
-                &text
-            };
+            let safe_text = text.chars().take(500).collect::<String>();
             return Err(LlmError::ProviderError(format!(
                 "ollama returned {status}: {safe_text}"
             )));
@ -98,7 +100,8 @@ impl LlmProvider for OllamaProvider {
     async fn is_available(&self) -> bool {
         let url = format!("{}/api/tags", self.base_url);
-        matches!(self.client.get(&url).send().await, Ok(r) if r.status().is_success())
+        let req = self.client.get(&url).timeout(Duration::from_secs(10));
+        matches!(req.send().await, Ok(r) if r.status().is_success())
     }
     fn name(&self) -> &str {


@ -1,4 +1,6 @@
 /// OpenAI provider — works with api.openai.com and any OpenAI-compatible endpoint.
+use std::time::Duration;
 use async_trait::async_trait;
 use serde_json::json;
@ -69,7 +71,11 @@ impl OpenAiProvider {
         let key = load_api_key(key_override, "OPENAI_API_KEY")?;
         Some(Self {
-            client: reqwest::Client::new(),
+            client: reqwest::Client::builder()
+                .timeout(Duration::from_secs(120))
+                .connect_timeout(Duration::from_secs(10))
+                .build()
+                .unwrap_or_else(|_| reqwest::Client::new()),
             key,
             base_url: base_url
                 .or_else(|| std::env::var("OPENAI_BASE_URL").ok())
@ -132,11 +138,7 @@ impl LlmProvider for OpenAiProvider {
         if !resp.status().is_success() {
             let status = resp.status();
             let text = resp.text().await.unwrap_or_default();
-            let safe_text = if text.len() > 500 {
-                &text[..500]
-            } else {
-                &text
-            };
+            let safe_text = text.chars().take(500).collect::<String>();
             return Err(LlmError::ProviderError(format!(
                 "openai returned {status}: {safe_text}"
             )));


@ -323,9 +323,10 @@ impl WebclawMcp {
         if params.urls.len() > 100 {
             return Err("batch is limited to 100 URLs per request".into());
         }
-        for u in &params.urls {
-            validate_url(u).await?;
-        }
+        // No up-front DNS pre-validation: it aborted the whole batch on a
+        // single unresolvable URL. The fetch layer applies the same SSRF
+        // guard (validate_public_http_url) per URL, so bad entries surface
+        // as individual per-URL errors below instead of failing the batch.
         let format = params.format.as_deref().unwrap_or("markdown");
         let concurrency = params.concurrency.unwrap_or(5);


@ -1,6 +1,68 @@
 # Proxy-Backed Crawling
-Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file.
+Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file, and accepts any standard HTTP/HTTPS or SOCKS5 proxy URL.
+## Using ColdProxy
+[ColdProxy](https://coldproxy.com/) is webclaw's infrastructure partner, providing residential IPv4, residential IPv6, and datacenter IPv6 proxies across 195+ countries. Use a ColdProxy endpoint as a full URL with `--proxy` / `WEBCLAW_PROXY`, or list several in a `--proxy-file` pool.
+### 1. Get your endpoint
+Sign in to your [ColdProxy dashboard](https://coldproxy.com/) and copy your proxy host, port, and credentials. Assemble them into a standard proxy URL:
+```text
+http://USERNAME:PASSWORD@HOST:PORT
+```
+### 2. One ColdProxy endpoint
+```bash
+export WEBCLAW_PROXY="http://USERNAME:PASSWORD@HOST:PORT"
+webclaw https://example.com --format markdown
+```
+Or pass it inline:
+```bash
+webclaw https://example.com \
+  --proxy "http://USERNAME:PASSWORD@HOST:PORT" \
+  --format markdown
+```
+### 3. Rotate a ColdProxy pool
+List one ColdProxy endpoint per line in `coldproxy.txt`. Pool files use `host:port:user:pass` (one entry per line; lines starting with `#` are ignored). Mix product types and regions to match your workload:
+```text
+# residential IPv4
+HOST:PORT:USERNAME:PASSWORD
+# residential IPv6
+HOST:PORT:USERNAME:PASSWORD
+# datacenter IPv6
+HOST:PORT:USERNAME:PASSWORD
+```
+webclaw rotates across the pool per request:
+```bash
+webclaw https://docs.example.com \
+  --crawl \
+  --depth 2 \
+  --max-pages 200 \
+  --concurrency 10 \
+  --delay 200 \
+  --proxy-file coldproxy.txt \
+  --format markdown
+```
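
The pool-file format described above (`host:port:user:pass`, one entry per line, `#` lines ignored) can be illustrated with a small parser. This is a sketch under stated assumptions and not webclaw's actual parser, which this diff does not show: `ProxyEntry`, `parse_pool`, and `to_proxy_url` are hypothetical names, blank lines are assumed to be skipped, the host is assumed to contain no `:` (so bare IPv6 literals are out of scope), and the `http` scheme in the rendered URL is an assumption.

```rust
/// One entry from a pool file. Type and field names are illustrative only.
#[derive(Debug, PartialEq)]
struct ProxyEntry {
    host: String,
    port: u16,
    user: String,
    pass: String,
}

/// Parse `host:port:user:pass` lines, skipping blank lines and `#` comments.
/// Assumption: the host contains no `:`; everything after the third `:` is
/// treated as the password, so passwords may themselves contain `:`.
fn parse_pool(contents: &str) -> Result<Vec<ProxyEntry>, String> {
    let mut entries = Vec::new();
    for (idx, raw) in contents.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let parts: Vec<&str> = line.splitn(4, ':').collect();
        if parts.len() != 4 {
            return Err(format!("line {}: expected host:port:user:pass", idx + 1));
        }
        let port = parts[1]
            .parse::<u16>()
            .map_err(|e| format!("line {}: bad port: {e}", idx + 1))?;
        entries.push(ProxyEntry {
            host: parts[0].to_string(),
            port,
            user: parts[2].to_string(),
            pass: parts[3].to_string(),
        });
    }
    Ok(entries)
}

/// Render an entry in the URL form accepted by `--proxy`.
/// The `http` scheme is an assumption made for this example.
fn to_proxy_url(e: &ProxyEntry) -> String {
    format!("http://{}:{}@{}:{}", e.user, e.pass, e.host, e.port)
}
```

The sketch also shows how the two formats in this document relate: a pool-file entry carries the same four fields as the `http://USERNAME:PASSWORD@HOST:PORT` URL, in a different order.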
+### 4. Target a country
+ColdProxy offers access across 195+ countries. Use the country-specific endpoint from your ColdProxy dashboard for each region you want to collect from (for example, a France residential endpoint for fr-localized pages). Add one endpoint per country to your pool file to spread a single crawl across regions.
+### Choosing a product
+- **Residential IPv4 / IPv6** — highest trust; best for consumer sites, geo-restricted content, and regional QA.
+- **Datacenter IPv6** — fastest and most cost-effective; best for high-volume crawling of tolerant endpoints.
 ## Single Proxy
@ -20,12 +82,12 @@ webclaw https://example.com \
 ## Proxy Pool
-Create `proxies.txt` with one proxy per line:
+Create `proxies.txt` with one proxy per line in `host:port:user:pass` format (lines starting with `#` are ignored):
 ```text
-http://user:pass@proxy-1.example.com:8080
-http://user:pass@proxy-2.example.com:8080
-http://user:pass@proxy-3.example.com:8080
+proxy-1.example.com:8080:user:pass
+proxy-2.example.com:8080:user:pass
+proxy-3.example.com:8080:user:pass
 ```
 Run a crawl with controlled concurrency: