Mirror of https://github.com/0xMassi/webclaw.git (synced 2026-06-06 22:05:13 +02:00)
docs: add workflow examples
Commit aab51bea91 (parent b75b768ec3): 7 changed files with 281 additions and 0 deletions
examples/proxy-backed-crawling/README.md (new file, 53 lines)

# Proxy-Backed Crawling

Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw accepts either a single proxy (`--proxy`) or a file of proxies (`--proxy-file`).

## Single Proxy

```bash
webclaw https://example.com \
  --proxy http://user:pass@proxy.example.com:8080 \
  --format markdown
```

SOCKS5 is supported too:

```bash
webclaw https://example.com \
  --proxy socks5://proxy.example.com:1080 \
  --format markdown
```

## Proxy Pool

Create `proxies.txt` with one proxy per line:

```text
http://user:pass@proxy-1.example.com:8080
http://user:pass@proxy-2.example.com:8080
http://user:pass@proxy-3.example.com:8080
```
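
Before a long crawl it can help to confirm that every line of the pool file looks like a proxy URL. The following is a minimal sketch and not part of webclaw: `check_proxy_file` is a hypothetical helper, and the accepted shape (`http`, `https`, or `socks5` scheme, optional `user:pass@`, host, port) is an assumption drawn from the examples in this README, not from webclaw's actual parsing rules.

```shell
# check_proxy_file: hypothetical helper, not part of webclaw.
# Assumed line format: scheme://[user:pass@]host:port
# Prints each offending line to stderr and returns 1 if any line fails the check.
check_proxy_file() {
  cpf_pattern='^(http|https|socks5)://([^:@/[:space:]]+:[^@/[:space:]]+@)?[^:@/[:space:]]+:[0-9]+$'
  # Drop blank lines, then keep only the lines that do NOT match the pattern.
  cpf_bad=$(grep -v -E '^[[:space:]]*$' "$1" | grep -v -E "$cpf_pattern" || true)
  if [ -n "$cpf_bad" ]; then
    printf '%s\n' "$cpf_bad" | sed 's/^/invalid proxy line: /' >&2
    return 1
  fi
  return 0
}
```

Calling `check_proxy_file proxies.txt` at the top of a wrapper script lets it stop early instead of discovering a malformed entry partway through a crawl.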

Run a crawl with controlled concurrency:

```bash
webclaw https://docs.example.com \
  --crawl \
  --depth 2 \
  --max-pages 100 \
  --concurrency 10 \
  --delay 200 \
  --proxy-file proxies.txt \
  --format markdown
```

## Batch URLs

```bash
webclaw --urls-file urls.txt \
  --proxy-file proxies.txt \
  --concurrency 10 \
  --format json
```
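
This README does not describe the format of `urls.txt`. Assuming it mirrors `proxies.txt` (one URL per line), a list gathered from several sources can be tidied before the batch run. A minimal sketch; `clean_url_list` is a hypothetical helper, not a webclaw command, and `raw-urls.txt` in the usage note is a placeholder name.

```shell
# clean_url_list: hypothetical helper, not part of webclaw.
# Reads a raw list on stdin and writes one URL per line to stdout.
clean_url_list() {
  # 1. Trim leading and trailing whitespace (including stray carriage returns).
  # 2. Keep only lines that start with http:// or https://.
  # 3. Drop duplicates while preserving first-seen order.
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -E '^https?://' \
    | awk '!seen[$0]++'
}
```

For example, `clean_url_list < raw-urls.txt > urls.txt` produces the file passed to `--urls-file`.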

Proxy rotation helps with throughput and IP reputation. It does not replace request fingerprinting, JS rendering, or challenge handling for heavily protected sites. For those, use hosted cloud mode with `WEBCLAW_API_KEY`.
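
The README names `WEBCLAW_API_KEY` without saying how it is supplied. The sketch below assumes it is read from the environment, as the name suggests. `require_api_key` is a hypothetical guard for wrapper scripts, not a webclaw feature.

```shell
# require_api_key: hypothetical guard, not part of webclaw.
# Assumes WEBCLAW_API_KEY is an environment variable.
# Succeeds only if it is set to a non-empty value.
require_api_key() {
  if [ -z "${WEBCLAW_API_KEY:-}" ]; then
    echo 'WEBCLAW_API_KEY is not set' >&2
    return 1
  fi
  return 0
}
```

A script that depends on hosted cloud mode could call `require_api_key || exit 1` before invoking webclaw, so a missing key fails fast with a clear message.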