mirror of https://github.com/0xMassi/webclaw.git, synced 2026-06-10 22:45:13 +02:00
docs: add ColdProxy proxy-backed crawling walkthrough
parent d0d7b835f2
commit d0909a25e3
2 changed files with 70 additions and 6 deletions
@@ -142,7 +142,7 @@ webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
 - [HTML to Markdown for RAG](examples/html-to-markdown-rag/)
 - [Firecrawl-compatible API](examples/firecrawl-compatible-api/)
 - [MCP web scraping](examples/mcp-web-scraping/)
-- [Proxy-backed crawling](examples/proxy-backed-crawling/)
+- [Proxy-backed crawling with ColdProxy](examples/proxy-backed-crawling/)
 - [Cloudflare diagnostics](examples/cloudflare-diagnostics/)
 
 ### Extract brand assets
@@ -401,6 +401,8 @@ Please remove secrets, cookies, private tokens, and customer data from logs befo
 residential IPv6, and datacenter IPv6 proxy infrastructure across 195+ countries for public data
 collection, regional testing, monitoring, and web scraping workflows. Explore
 <a href="https://coldproxy.com/">ColdProxy</a>'s latest plans and available offers directly on the website.
+See the <a href="examples/proxy-backed-crawling/#using-coldproxy">proxy-backed crawling guide</a>
+for a hands-on walkthrough of wiring ColdProxy into webclaw.
 </td>
 </tr>
 </table>

@@ -1,6 +1,68 @@
 # Proxy-Backed Crawling
 
-Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file.
+Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file, and accepts any standard HTTP/HTTPS or SOCKS5 proxy URL.
+
+## Using ColdProxy
+
+[ColdProxy](https://coldproxy.com/) is webclaw's infrastructure partner, providing residential IPv4, residential IPv6, and datacenter IPv6 proxies across 195+ countries. Use a ColdProxy endpoint as a full URL with `--proxy` / `WEBCLAW_PROXY`, or list several in a `--proxy-file` pool.
+
+### 1. Get your endpoint
+
+Sign in to your [ColdProxy dashboard](https://coldproxy.com/) and copy your proxy host, port, and credentials. Assemble them into a standard proxy URL:
+
+```text
+http://USERNAME:PASSWORD@HOST:PORT
+```
+
+### 2. One ColdProxy endpoint
+
+```bash
+export WEBCLAW_PROXY="http://USERNAME:PASSWORD@HOST:PORT"
+webclaw https://example.com --format markdown
+```
+
+Or pass it inline:
+
+```bash
+webclaw https://example.com \
+--proxy "http://USERNAME:PASSWORD@HOST:PORT" \
+--format markdown
+```
+
+### 3. Rotate a ColdProxy pool
+
+List one ColdProxy endpoint per line in `coldproxy.txt`. Pool files use `host:port:user:pass` (one entry per line; lines starting with `#` are ignored). Mix product types and regions to match your workload:
+
+```text
+# residential IPv4
+HOST:PORT:USERNAME:PASSWORD
+# residential IPv6
+HOST:PORT:USERNAME:PASSWORD
+# datacenter IPv6
+HOST:PORT:USERNAME:PASSWORD
+```
+
+webclaw rotates across the pool per request:
+
+```bash
+webclaw https://docs.example.com \
+--crawl \
+--depth 2 \
+--max-pages 200 \
+--concurrency 10 \
+--delay 200 \
+--proxy-file coldproxy.txt \
+--format markdown
+```
+
+### 4. Target a country
+
+ColdProxy offers access across 195+ countries. Use the country-specific endpoint from your ColdProxy dashboard for each region you want to collect from (for example, a France residential endpoint for fr-localized pages). Add one endpoint per country to your pool file to spread a single crawl across regions.
+
+### Choosing a product
+
+- **Residential IPv4 / IPv6** — highest trust; best for consumer sites, geo-restricted content, and regional QA.
+- **Datacenter IPv6** — fastest and most cost-effective; best for high-volume crawling of tolerant endpoints.
 
 ## Single Proxy
 
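The pool-file rules documented in this hunk (one `host:port:user:pass` entry per line, lines starting with `#` ignored) can be checked before a crawl starts. The following is a minimal sketch, not part of webclaw: `check_pool_file` is a hypothetical helper, it assumes hostnames rather than bare IPv6 literals and credentials that contain no colon, and its handling of blank lines (skipped) is an assumption the source does not state.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of webclaw): report pool-file lines that do
# not look like host:port:user:pass. Assumes no field contains a colon.
check_pool_file() {
  awk -F: '
    /^[[:space:]]*$/ || /^#/ { next }   # blank lines (assumed) and # comments
    index($0, "://") || NF != 4 {       # URL-style entry or wrong field count
      printf "line %d: expected host:port:user:pass\n", NR
      bad = 1
    }
    END { exit bad + 0 }                # non-zero exit if any line was flagged
  ' "$1"
}
```

A possible use is to gate the crawl on the check, for example `check_pool_file coldproxy.txt && webclaw https://docs.example.com --crawl --proxy-file coldproxy.txt`.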
@@ -20,12 +82,12 @@ webclaw https://example.com \
 
 ## Proxy Pool
 
-Create `proxies.txt` with one proxy per line:
+Create `proxies.txt` with one proxy per line in `host:port:user:pass` format (lines starting with `#` are ignored):
 
 ```text
-http://user:pass@proxy-1.example.com:8080
-http://user:pass@proxy-2.example.com:8080
-http://user:pass@proxy-3.example.com:8080
+proxy-1.example.com:8080:user:pass
+proxy-2.example.com:8080:user:pass
+proxy-3.example.com:8080:user:pass
 ```
 
 Run a crawl with controlled concurrency:
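This hunk changes the documented `proxies.txt` entries from URL form to `host:port:user:pass`, so an existing URL-style entry needs rewriting. The following is a minimal sketch using Bash parameter expansion; `to_pool_line` is a hypothetical helper, not part of webclaw, and it assumes the input carries a scheme, has no trailing path, and that the credentials contain no `@`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of webclaw): rewrite
#   scheme://USER:PASS@HOST:PORT  ->  HOST:PORT:USER:PASS
# Assumes a scheme is present, no trailing path, and no '@' in the credentials.
to_pool_line() {
  local url="${1#*://}"        # drop the scheme
  local creds="${url%%@*}"     # USER:PASS (everything before the first '@')
  local hostport="${url#*@}"   # HOST:PORT (everything after the first '@')
  printf '%s:%s\n' "$hostport" "$creds"
}

to_pool_line "http://user:pass@proxy-1.example.com:8080"
# prints: proxy-1.example.com:8080:user:pass
```

Passwords that themselves contain `:` would make the colon-separated form ambiguous; how webclaw parses such entries is not covered by this diff.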