# Proxy-Backed Crawling
Use proxy rotation to spread a crawl's requests across a pool of proxies instead of sending them all from one IP address. webclaw accepts either a single proxy or a proxy file, and works with any standard HTTP/HTTPS or SOCKS5 proxy URL.
## Using ColdProxy
[ColdProxy](https://coldproxy.com/) is webclaw's infrastructure partner, providing residential IPv4, residential IPv6, and datacenter IPv6 proxies across 195+ countries. Use a ColdProxy endpoint as a full URL with `--proxy` / `WEBCLAW_PROXY`, or list several in a `--proxy-file` pool.
### 1. Get your endpoint
Sign in to your [ColdProxy dashboard](https://coldproxy.com/) and copy your proxy host, port, and credentials. Assemble them into a standard proxy URL:
```text
http://USERNAME:PASSWORD@HOST:PORT
```
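Credentials that contain characters such as `@`, `:`, or `/` would break the URL structure if pasted in raw. The sketch below percent-encodes the username and password before assembling the URL. The helper names `urlencode` and `build_proxy_url` are hypothetical (they are not part of webclaw), and it assumes webclaw decodes percent-encoded credentials the way standard URL parsers do. It has only been reasoned through for ASCII input.

```bash
# urlencode STRING: percent-encode every character outside the RFC 3986
# unreserved set (letters, digits, ".", "_", "~", "-").
# The body runs in a subshell so its variables do not leak to the caller.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    s=$rest
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
  done
  printf '%s\n' "$out"
)

# build_proxy_url USERNAME PASSWORD HOST PORT (hypothetical helper):
# prints an http:// proxy URL with encoded credentials.
build_proxy_url() {
  printf 'http://%s:%s@%s:%s\n' "$(urlencode "$1")" "$(urlencode "$2")" "$3" "$4"
}
```

For example, `export WEBCLAW_PROXY="$(build_proxy_url "$USERNAME" "$PASSWORD" "$HOST" "$PORT")"` sets the variable without hand-editing the password.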
### 2. One ColdProxy endpoint
```bash
export WEBCLAW_PROXY="http://USERNAME:PASSWORD@HOST:PORT"
webclaw https://example.com --format markdown
```
Or pass it inline:
```bash
webclaw https://example.com \
  --proxy "http://USERNAME:PASSWORD@HOST:PORT" \
  --format markdown
```
### 3. Rotate a ColdProxy pool
List one ColdProxy endpoint per line in `coldproxy.txt`, in `host:port:user:pass` format. Lines starting with `#` are ignored. Mix product types and regions to match your workload:
```text
# residential IPv4
HOST:PORT:USERNAME:PASSWORD
# residential IPv6
HOST:PORT:USERNAME:PASSWORD
# datacenter IPv6
HOST:PORT:USERNAME:PASSWORD
```
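If you already have endpoints written as full proxy URLs, they need reshaping into the `host:port:user:pass` layout before they go into a pool file. A minimal sketch follows. `url_to_pool_line` is a hypothetical helper, not a webclaw command. It copies the credentials verbatim, so a password that itself contains `:` would produce an ambiguous line.

```bash
# url_to_pool_line URL (hypothetical helper):
# turns scheme://user:pass@host:port into host:port:user:pass.
# Runs in a subshell so "exit" only ends the function, not the caller.
url_to_pool_line() (
  rest=${1#*://}         # drop the scheme
  rest=${rest%/}         # drop one trailing slash, if present
  case $rest in
    *@*) ;;
    *) echo "proxy URL has no credentials" >&2; exit 1 ;;
  esac
  # host:port is what follows the last "@"; user:pass is what precedes it.
  printf '%s:%s\n' "${rest##*@}" "${rest%@*}"
)
```

For example, `url_to_pool_line "http://USERNAME:PASSWORD@HOST:PORT" >> coldproxy.txt` appends one entry to the pool file.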
webclaw rotates across the pool per request:
```bash
webclaw https://docs.example.com \
  --crawl \
  --depth 2 \
  --max-pages 200 \
  --concurrency 10 \
  --delay 200 \
  --proxy-file coldproxy.txt \
  --format markdown
```
### 4. Target a country
ColdProxy offers access across 195+ countries. Use the country-specific endpoint from your ColdProxy dashboard for each region you want to collect from (for example, a France residential endpoint for fr-localized pages). Add one endpoint per country to your pool file to spread a single crawl across regions.
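One way to keep a multi-country pool readable is to label each entry with a `#` comment, since comment lines are ignored. The sketch below writes such a file from `LABEL=ENTRY` pairs. `write_country_pool` is a hypothetical helper, and the entries you pass in are whatever country-specific endpoints your dashboard provides.

```bash
# write_country_pool OUTFILE LABEL=ENTRY... (hypothetical helper)
# Writes each ENTRY on its own line, preceded by a "# LABEL" comment line.
# Any existing OUTFILE is overwritten.
write_country_pool() (
  out=$1
  shift
  : > "$out"                       # create or truncate the pool file
  for pair in "$@"; do
    # Text before the first "=" is the label; text after it is the entry.
    printf '# %s\n%s\n' "${pair%%=*}" "${pair#*=}" >> "$out"
  done
)
```

For example, `write_country_pool coldproxy.txt "fr=HOST:PORT:USERNAME:PASSWORD" "de=HOST:PORT:USERNAME:PASSWORD"` produces a two-country pool that can be passed to `--proxy-file`.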
### Choosing a product
- **Residential IPv4 / IPv6** — highest trust; best for consumer sites, geo-restricted content, and regional QA.
- **Datacenter IPv6** — fastest and most cost-effective; best for high-volume crawling of tolerant endpoints.
## Single Proxy
```bash
webclaw https://example.com \
  --proxy "http://user:pass@proxy.example.com:8080" \
  --format markdown
```
SOCKS5 is supported too:
```bash
webclaw https://example.com \
  --proxy "socks5://proxy.example.com:1080" \
  --format markdown
```
## Proxy Pool
Create `proxies.txt` with one proxy per line in `host:port:user:pass` format (lines starting with `#` are ignored):
```text
proxy-1.example.com:8080:user:pass
proxy-2.example.com:8080:user:pass
proxy-3.example.com:8080:user:pass
```
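A malformed line in the pool file is easy to miss, so a quick check before a long crawl can help. The sketch below flags any entry that does not have exactly four colon-separated fields with a numeric port. `check_pool` is a hypothetical helper, not a webclaw command. Skipping blank lines is an assumption, and the check will reject entries whose password or host contains a `:`.

```bash
# check_pool FILE (hypothetical helper):
# reports malformed entries on stderr and returns 1 if any were found.
check_pool() {
  awk -F: '
    /^#/ || /^[[:space:]]*$/ { next }      # skip comments and blank lines
    NF != 4 || $2 !~ /^[0-9]+$/ {          # need 4 fields and a numeric port
      printf "line %d: expected host:port:user:pass\n", NR > "/dev/stderr"
      bad++
    }
    END { exit (bad > 0) }
  ' "$1"
}
```

For example, `check_pool proxies.txt && echo "pool looks well-formed"` prints the message only when every entry passes.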
Run a crawl with controlled concurrency:
```bash
webclaw https://docs.example.com \
  --crawl \
  --depth 2 \
  --max-pages 100 \
  --concurrency 10 \
  --delay 200 \
  --proxy-file proxies.txt \
  --format markdown
```
## Batch URLs
```bash
webclaw --urls-file urls.txt \
  --proxy-file proxies.txt \
  --concurrency 10 \
  --format json
```
Proxy rotation helps with throughput and IP reputation. It does not replace request fingerprinting, JS rendering, or challenge handling for heavily protected sites. For those, use hosted cloud mode with `WEBCLAW_API_KEY`.