webclaw/examples/proxy-backed-crawling/README.md

3.2 KiB

Proxy-Backed Crawling

Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file, and accepts any standard HTTP/HTTPS or SOCKS5 proxy URL.

Using ColdProxy

ColdProxy is webclaw's infrastructure partner, providing residential IPv4, residential IPv6, and datacenter IPv6 proxies across 195+ countries. Use a ColdProxy endpoint as a full URL with --proxy / WEBCLAW_PROXY, or list several in a --proxy-file pool.

1. Get your endpoint

Sign in to your ColdProxy dashboard and copy your proxy host, port, and credentials. Assemble them into a standard proxy URL:

http://USERNAME:PASSWORD@HOST:PORT

2. One ColdProxy endpoint

export WEBCLAW_PROXY="http://USERNAME:PASSWORD@HOST:PORT"
webclaw https://example.com --format markdown

Or pass it inline:

webclaw https://example.com \
  --proxy "http://USERNAME:PASSWORD@HOST:PORT" \
  --format markdown

3. Rotate a ColdProxy pool

List one ColdProxy endpoint per line in coldproxy.txt. Pool files use host:port:user:pass (one entry per line; lines starting with # are ignored). Mix product types and regions to match your workload:

# residential IPv4
HOST:PORT:USERNAME:PASSWORD
# residential IPv6
HOST:PORT:USERNAME:PASSWORD
# datacenter IPv6
HOST:PORT:USERNAME:PASSWORD

webclaw rotates across the pool per request:

webclaw https://docs.example.com \
  --crawl \
  --depth 2 \
  --max-pages 200 \
  --concurrency 10 \
  --delay 200 \
  --proxy-file coldproxy.txt \
  --format markdown

4. Target a country

ColdProxy offers access across 195+ countries. Use the country-specific endpoint from your ColdProxy dashboard for each region you want to collect from (for example, a France residential endpoint for fr-localized pages). Add one endpoint per country to your pool file to spread a single crawl across regions.

Choosing a product

  • Residential IPv4 / IPv6 — highest trust; best for consumer sites, geo-restricted content, and regional QA.
  • Datacenter IPv6 — fastest and most cost-effective; best for high-volume crawling of tolerant endpoints.

Single Proxy

webclaw https://example.com \
  --proxy http://user:pass@proxy.example.com:8080 \
  --format markdown

SOCKS5 is supported too:

webclaw https://example.com \
  --proxy socks5://proxy.example.com:1080 \
  --format markdown

Proxy Pool

Create proxies.txt with one proxy per line in host:port:user:pass format (lines starting with # are ignored):

proxy-1.example.com:8080:user:pass
proxy-2.example.com:8080:user:pass
proxy-3.example.com:8080:user:pass

Run a crawl with controlled concurrency:

webclaw https://docs.example.com \
  --crawl \
  --depth 2 \
  --max-pages 100 \
  --concurrency 10 \
  --delay 200 \
  --proxy-file proxies.txt \
  --format markdown

Batch URLs

webclaw --urls-file urls.txt \
  --proxy-file proxies.txt \
  --concurrency 10 \
  --format json

Proxy rotation helps with throughput and IP reputation. It does not replace request fingerprinting, JS rendering, or challenge handling for heavily protected sites. For those, use hosted cloud mode with WEBCLAW_API_KEY.