Merge pull request #53 from 0xMassi/docs-coldproxy

docs: add ColdProxy proxy-backed crawling walkthrough
Valerio 2026-06-10 14:40:01 +02:00 committed by GitHub
commit fae2766db1
GPG key ID: B5690EEEBB952194
2 changed files with 70 additions and 6 deletions

@@ -142,7 +142,7 @@ webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
- [HTML to Markdown for RAG](examples/html-to-markdown-rag/)
- [Firecrawl-compatible API](examples/firecrawl-compatible-api/)
- [MCP web scraping](examples/mcp-web-scraping/)
-- [Proxy-backed crawling](examples/proxy-backed-crawling/)
+- [Proxy-backed crawling with ColdProxy](examples/proxy-backed-crawling/)
- [Cloudflare diagnostics](examples/cloudflare-diagnostics/)
### Extract brand assets
@@ -401,6 +401,8 @@ Please remove secrets, cookies, private tokens, and customer data from logs befo
residential IPv6, and datacenter IPv6 proxy infrastructure across 195+ countries for public data
collection, regional testing, monitoring, and web scraping workflows. Explore
<a href="https://coldproxy.com/">ColdProxy</a>'s latest plans and available offers directly on the website.
See the <a href="examples/proxy-backed-crawling/#using-coldproxy">proxy-backed crawling guide</a>
for a hands-on walkthrough of wiring ColdProxy into webclaw.
</td>
</tr>
</table>

@@ -1,6 +1,68 @@
# Proxy-Backed Crawling
-Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file.
+Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file, and accepts any standard HTTP/HTTPS or SOCKS5 proxy URL.
## Using ColdProxy
[ColdProxy](https://coldproxy.com/) is webclaw's infrastructure partner, providing residential IPv4, residential IPv6, and datacenter IPv6 proxies across 195+ countries. Use a ColdProxy endpoint as a full URL with `--proxy` / `WEBCLAW_PROXY`, or list several in a `--proxy-file` pool.
### 1. Get your endpoint
Sign in to your [ColdProxy dashboard](https://coldproxy.com/) and copy your proxy host, port, and credentials. Assemble them into a standard proxy URL:
```text
http://USERNAME:PASSWORD@HOST:PORT
```
### 2. One ColdProxy endpoint
```bash
export WEBCLAW_PROXY="http://USERNAME:PASSWORD@HOST:PORT"
webclaw https://example.com --format markdown
```
Or pass it inline:
```bash
webclaw https://example.com \
--proxy "http://USERNAME:PASSWORD@HOST:PORT" \
--format markdown
```
### 3. Rotate a ColdProxy pool
List your ColdProxy endpoints in `coldproxy.txt`, one per line. Pool files use the `host:port:user:pass` form rather than the full URL that `--proxy` takes; lines starting with `#` are ignored. Mix product types and regions to match your workload:
```text
# residential IPv4
HOST:PORT:USERNAME:PASSWORD
# residential IPv6
HOST:PORT:USERNAME:PASSWORD
# datacenter IPv6
HOST:PORT:USERNAME:PASSWORD
```
webclaw rotates across the pool per request:
```bash
webclaw https://docs.example.com \
--crawl \
--depth 2 \
--max-pages 200 \
--concurrency 10 \
--delay 200 \
--proxy-file coldproxy.txt \
--format markdown
```
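A malformed line is easy to miss in a long pool file, so a quick format check before the crawl can help. The `check_pool` function below is a hypothetical helper, not part of webclaw. It is a minimal sketch that assumes hosts are hostnames or IPv4 addresses (no bare IPv6 literals) and that usernames and passwords contain no `:`.

```bash
# Hypothetical helper, not a webclaw command.
# check_pool FILE: print every line that is not host:port:user:pass
# and return non-zero if any such line was found.
check_pool() {
  awk -F: '
    /^[ \t]*$/ || /^#/ { next }   # skip blank lines and comments
    NF != 4 || $1 == "" || $2 !~ /^[0-9]+$/ || $3 == "" || $4 == "" {
      printf "line %d: expected host:port:user:pass\n", NR
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

Running `check_pool coldproxy.txt` before the crawl command above means a bad entry is reported with its line number instead of surfacing mid-crawl.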
### 4. Target a country
ColdProxy offers access across 195+ countries. Use the country-specific endpoint from your ColdProxy dashboard for each region you want to collect from (for example, a France residential endpoint for fr-localized pages). Add one endpoint per country to your pool file to spread a single crawl across regions.
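If you keep each country's endpoints in its own file, one way to build the combined pool is to concatenate them with a comment header per region. This is only a sketch: `build_pool` and the per-country file names (`fr.txt`, `de.txt`) are hypothetical, and each input file is assumed to already use the `host:port:user:pass` form.

```bash
# Hypothetical helper, not a webclaw command.
# build_pool FILE...: merge per-country endpoint files into one pool,
# labelling each group with a "# <name>" comment taken from the file name.
build_pool() {
  for f in "$@"; do
    printf '# %s\n' "$(basename "$f" .txt)"
    # Drop existing comments and blank lines so only endpoints are copied.
    grep -v -e '^#' -e '^[[:space:]]*$' "$f"
  done
}
```

For example, `build_pool fr.txt de.txt > coldproxy.txt` writes a pool whose entries are grouped under `# fr` and `# de`, which `--proxy-file` can then read since `#` lines are ignored.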
### Choosing a product
- **Residential IPv4 / IPv6** — highest trust; best for consumer sites, geo-restricted content, and regional QA.
- **Datacenter IPv6** — fastest and most cost-effective; best for high-volume crawling of tolerant endpoints.
## Single Proxy
@@ -20,12 +82,12 @@ webclaw https://example.com \
## Proxy Pool
-Create `proxies.txt` with one proxy per line:
+Create `proxies.txt` with one proxy per line in `host:port:user:pass` format (lines starting with `#` are ignored):
```text
-http://user:pass@proxy-1.example.com:8080
-http://user:pass@proxy-2.example.com:8080
-http://user:pass@proxy-3.example.com:8080
+proxy-1.example.com:8080:user:pass
+proxy-2.example.com:8080:user:pass
+proxy-3.example.com:8080:user:pass
```
```
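If you have an older pool file written as proxy URLs, a small filter can rewrite it into the `host:port:user:pass` form. The `url_to_pool_line` function below is a hypothetical helper, not part of webclaw. It is a sketch that assumes usernames contain no `:` and passwords contain no `@`; it also drops the URL scheme, so any `socks5://` distinction is lost in the output.

```bash
# Hypothetical helper, not a webclaw command.
# Reads lines on stdin and rewrites scheme://user:pass@host:port
# as host:port:user:pass. Lines that do not match (comments, blanks,
# entries already in the colon form) pass through unchanged.
url_to_pool_line() {
  sed -E 's#^[a-zA-Z0-9]+://([^:@]+):([^@]+)@([^:/]+):([0-9]+)/?$#\3:\4:\1:\2#'
}

printf '%s\n' 'http://user:pass@proxy-1.example.com:8080' | url_to_pool_line
# prints: proxy-1.example.com:8080:user:pass
```

To convert a whole file, something like `url_to_pool_line < proxies-old.txt > proxies.txt` would work, where `proxies-old.txt` is a placeholder name for the URL-style file.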
Run a crawl with controlled concurrency: