mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-11 22:55:13 +02:00
115 lines
3.2 KiB
Markdown
115 lines
3.2 KiB
Markdown
# Proxy-Backed Crawling
|
|
|
|
Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file, and accepts any standard HTTP/HTTPS or SOCKS5 proxy URL.
|
|
|
|
## Using ColdProxy
|
|
|
|
[ColdProxy](https://coldproxy.com/) is webclaw's infrastructure partner, providing residential IPv4, residential IPv6, and datacenter IPv6 proxies across 195+ countries. Use a ColdProxy endpoint as a full URL with `--proxy` / `WEBCLAW_PROXY`, or list several in a `--proxy-file` pool.
|
|
|
|
### 1. Get your endpoint
|
|
|
|
Sign in to your [ColdProxy dashboard](https://coldproxy.com/) and copy your proxy host, port, and credentials. Assemble them into a standard proxy URL:
|
|
|
|
```text
|
|
http://USERNAME:PASSWORD@HOST:PORT
|
|
```
|
|
|
|
### 2. One ColdProxy endpoint
|
|
|
|
```bash
|
|
export WEBCLAW_PROXY="http://USERNAME:PASSWORD@HOST:PORT"
|
|
webclaw https://example.com --format markdown
|
|
```
|
|
|
|
Or pass it inline:
|
|
|
|
```bash
|
|
webclaw https://example.com \
|
|
--proxy "http://USERNAME:PASSWORD@HOST:PORT" \
|
|
--format markdown
|
|
```
|
|
|
|
### 3. Rotate a ColdProxy pool
|
|
|
|
List one ColdProxy endpoint per line in `coldproxy.txt`. Pool files use `host:port:user:pass` (one entry per line; lines starting with `#` are ignored). Mix product types and regions to match your workload:
|
|
|
|
```text
|
|
# residential IPv4
|
|
HOST:PORT:USERNAME:PASSWORD
|
|
# residential IPv6
|
|
HOST:PORT:USERNAME:PASSWORD
|
|
# datacenter IPv6
|
|
HOST:PORT:USERNAME:PASSWORD
|
|
```
|
|
|
|
webclaw rotates across the pool per request:
|
|
|
|
```bash
|
|
webclaw https://docs.example.com \
|
|
--crawl \
|
|
--depth 2 \
|
|
--max-pages 200 \
|
|
--concurrency 10 \
|
|
--delay 200 \
|
|
--proxy-file coldproxy.txt \
|
|
--format markdown
|
|
```
|
|
|
|
### 4. Target a country
|
|
|
|
ColdProxy offers access across 195+ countries. Use the country-specific endpoint from your ColdProxy dashboard for each region you want to collect from (for example, a France residential endpoint for fr-localized pages). Add one endpoint per country to your pool file to spread a single crawl across regions.
|
|
|
|
### Choosing a product
|
|
|
|
- **Residential IPv4 / IPv6** — highest trust; best for consumer sites, geo-restricted content, and regional QA.
|
|
- **Datacenter IPv6** — fastest and most cost-effective; best for high-volume crawling of tolerant endpoints.
|
|
|
|
## Single Proxy
|
|
|
|
```bash
|
|
webclaw https://example.com \
|
|
--proxy http://user:pass@proxy.example.com:8080 \
|
|
--format markdown
|
|
```
|
|
|
|
SOCKS5 is supported too:
|
|
|
|
```bash
|
|
webclaw https://example.com \
|
|
--proxy socks5://proxy.example.com:1080 \
|
|
--format markdown
|
|
```
|
|
|
|
## Proxy Pool
|
|
|
|
Create `proxies.txt` with one proxy per line in `host:port:user:pass` format (lines starting with `#` are ignored):
|
|
|
|
```text
|
|
proxy-1.example.com:8080:user:pass
|
|
proxy-2.example.com:8080:user:pass
|
|
proxy-3.example.com:8080:user:pass
|
|
```
|
|
|
|
Run a crawl with controlled concurrency:
|
|
|
|
```bash
|
|
webclaw https://docs.example.com \
|
|
--crawl \
|
|
--depth 2 \
|
|
--max-pages 100 \
|
|
--concurrency 10 \
|
|
--delay 200 \
|
|
--proxy-file proxies.txt \
|
|
--format markdown
|
|
```
|
|
|
|
## Batch URLs
|
|
|
|
```bash
|
|
webclaw --urls-file urls.txt \
|
|
--proxy-file proxies.txt \
|
|
--concurrency 10 \
|
|
--format json
|
|
```
|
|
|
|
Proxy rotation helps with throughput and IP reputation. It does not replace request fingerprinting, JS rendering, or challenge handling for heavily protected sites. For those, use hosted cloud mode with `WEBCLAW_API_KEY`.
|