From d0909a25e3154e8648615196d430bae9027ae4c9 Mon Sep 17 00:00:00 2001
From: Valerio
Date: Wed, 10 Jun 2026 10:39:46 +0200
Subject: [PATCH] docs: add ColdProxy proxy-backed crawling walkthrough

---
 README.md                                |  4 +-
 examples/proxy-backed-crawling/README.md | 72 ++++++++++++++++++++++--
 2 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 1d0a5ac..b97fc11 100644
--- a/README.md
+++ b/README.md
@@ -142,7 +142,7 @@ webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
 - [HTML to Markdown for RAG](examples/html-to-markdown-rag/)
 - [Firecrawl-compatible API](examples/firecrawl-compatible-api/)
 - [MCP web scraping](examples/mcp-web-scraping/)
-- [Proxy-backed crawling](examples/proxy-backed-crawling/)
+- [Proxy-backed crawling with ColdProxy](examples/proxy-backed-crawling/)
 - [Cloudflare diagnostics](examples/cloudflare-diagnostics/)
 
 ### Extract brand assets
@@ -401,6 +401,8 @@ Please remove secrets, cookies, private tokens, and customer data from logs befo
 residential IPv6, and datacenter IPv6 proxy infrastructure across 195+ countries for public data collection, regional testing, monitoring, and web scraping workflows. Explore ColdProxy's latest plans and available offers directly on the website.
+ See the proxy-backed crawling guide
+ for a hands-on walkthrough of wiring ColdProxy into webclaw.
diff --git a/examples/proxy-backed-crawling/README.md b/examples/proxy-backed-crawling/README.md
index fd49be9..d82a0ff 100644
--- a/examples/proxy-backed-crawling/README.md
+++ b/examples/proxy-backed-crawling/README.md
@@ -1,6 +1,68 @@
 # Proxy-Backed Crawling
 
-Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file.
+Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file, and accepts any standard HTTP/HTTPS or SOCKS5 proxy URL.
+
+## Using ColdProxy
+
+[ColdProxy](https://coldproxy.com/) is webclaw's infrastructure partner, providing residential IPv4, residential IPv6, and datacenter IPv6 proxies across 195+ countries. Use a ColdProxy endpoint as a full URL with `--proxy` / `WEBCLAW_PROXY`, or list several in a `--proxy-file` pool.
+
+### 1. Get your endpoint
+
+Sign in to your [ColdProxy dashboard](https://coldproxy.com/) and copy your proxy host, port, and credentials. Assemble them into a standard proxy URL:
+
+```text
+http://USERNAME:PASSWORD@HOST:PORT
+```
+
+### 2. One ColdProxy endpoint
+
+```bash
+export WEBCLAW_PROXY="http://USERNAME:PASSWORD@HOST:PORT"
+webclaw https://example.com --format markdown
+```
+
+Or pass it inline:
+
+```bash
+webclaw https://example.com \
+  --proxy "http://USERNAME:PASSWORD@HOST:PORT" \
+  --format markdown
+```
+
+### 3. Rotate a ColdProxy pool
+
+List one ColdProxy endpoint per line in `coldproxy.txt`. Pool files use `host:port:user:pass` (one entry per line; lines starting with `#` are ignored). Mix product types and regions to match your workload:
+
+```text
+# residential IPv4
+HOST:PORT:USERNAME:PASSWORD
+# residential IPv6
+HOST:PORT:USERNAME:PASSWORD
+# datacenter IPv6
+HOST:PORT:USERNAME:PASSWORD
+```
+
+webclaw rotates across the pool per request:
+
+```bash
+webclaw https://docs.example.com \
+  --crawl \
+  --depth 2 \
+  --max-pages 200 \
+  --concurrency 10 \
+  --delay 200 \
+  --proxy-file coldproxy.txt \
+  --format markdown
+```
+
+### 4. Target a country
+
+ColdProxy offers access across 195+ countries. Use the country-specific endpoint from your ColdProxy dashboard for each region you want to collect from (for example, a France residential endpoint for fr-localized pages). Add one endpoint per country to your pool file to spread a single crawl across regions.
+
+### Choosing a product
+
+- **Residential IPv4 / IPv6**: highest trust; best for consumer sites, geo-restricted content, and regional QA.
+- **Datacenter IPv6**: fastest and most cost-effective; best for high-volume crawling of tolerant endpoints.
 
 ## Single Proxy
 
@@ -20,12 +82,12 @@ webclaw https://example.com \
 
 ## Proxy Pool
 
-Create `proxies.txt` with one proxy per line:
+Create `proxies.txt` with one proxy per line in `host:port:user:pass` format (lines starting with `#` are ignored):
 
 ```text
-http://user:pass@proxy-1.example.com:8080
-http://user:pass@proxy-2.example.com:8080
-http://user:pass@proxy-3.example.com:8080
+proxy-1.example.com:8080:user:pass
+proxy-2.example.com:8080:user:pass
+proxy-3.example.com:8080:user:pass
 ```
 
 Run a crawl with controlled concurrency: