mirror of https://github.com/0xMassi/webclaw.git
synced 2026-06-06 22:05:13 +02:00

docs: add workflow examples

parent b75b768ec3
commit aab51bea91
7 changed files with 281 additions and 0 deletions

@@ -2,6 +2,14 @@

Practical examples showing what webclaw can do. Each example is a self-contained command you can run immediately.

## Workflow Guides

- [HTML to Markdown for RAG](./html-to-markdown-rag/) turns web pages into markdown or compact LLM text for retrieval pipelines.
- [Firecrawl-Compatible API](./firecrawl-compatible-api/) shows the `/v2` compatibility routes for scrape, crawl, map, and search.
- [MCP Web Scraping](./mcp-web-scraping/) connects webclaw to MCP clients such as Claude Code, Claude Desktop, Cursor, and Codex CLI.
- [Proxy-Backed Crawling](./proxy-backed-crawling/) shows single-proxy and proxy-pool crawling from the CLI.
- [Cloudflare Diagnostics](./cloudflare-diagnostics/) gives a reproducible checklist for blocked or empty protected-site results.

## Basic Extraction

58  examples/cloudflare-diagnostics/README.md  Normal file

@@ -0,0 +1,58 @@

# Cloudflare Diagnostics

Use this checklist when a page works in the browser but fails from a scraper, returns a challenge page, or produces empty extracted content.

## 1. Save the Raw Response

```bash
webclaw https://protected.example.com --raw-html > raw.html
```

Inspect `raw.html` for challenge copy, blocked request text, empty shells, or application HTML that needs JavaScript rendering.
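
The inspection step can be partly automated. The sketch below is a rough heuristic and not part of webclaw: the function name `classify_raw_html`, the marker strings, and the 512-byte threshold are all assumptions chosen for illustration, so adjust them to what you actually see in `raw.html`.

```bash
# Hypothetical helper (not a webclaw command): guess what kind of response
# was saved to a file. Marker strings and the size threshold are assumptions.
classify_raw_html() (
  file="$1"
  if [ ! -s "$file" ]; then
    echo "empty"          # missing or zero-byte response
    exit 0
  fi
  if grep -qiE 'just a moment|cf-chl|challenge-platform|attention required' "$file"; then
    echo "challenge"      # text commonly seen on challenge or block pages
    exit 0
  fi
  size=$(wc -c < "$file")
  if [ "$((size))" -lt 512 ]; then
    echo "shell"          # very small document, may need JavaScript rendering
  else
    echo "content"
  fi
)
```

Running `classify_raw_html raw.html` prints one of `empty`, `challenge`, `shell`, or `content`. Treat the answer as a hint for where to look, not as a verdict.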

## 2. Compare Extracted Formats

```bash
webclaw https://protected.example.com --format markdown > page.md
webclaw https://protected.example.com --format json > page.json
webclaw https://protected.example.com --format llm > page.txt
```

If raw HTML has content but markdown is empty, tune extraction with selectors:

```bash
webclaw https://protected.example.com \
  --include "main, article, [role=main]" \
  --exclude "nav, footer, aside, .cookie-banner" \
  --format markdown
```
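
The condition that triggers selector tuning (raw HTML has content, markdown is empty) can be checked mechanically. This is a minimal sketch; `needs_selector_tuning` is a hypothetical helper name, and "empty" is assumed to mean "contains only whitespace".

```bash
# Hypothetical helper: exits 0 when the raw HTML file has bytes but the
# markdown file contains nothing except whitespace (or does not exist).
needs_selector_tuning() (
  raw="$1"
  md="$2"
  [ -s "$raw" ] || exit 1                 # no raw content, so selectors are not the problem
  if grep -q '[^[:space:]]' "$md" 2>/dev/null; then
    exit 1                                # markdown has visible text
  fi
  exit 0
)
```

For example, `needs_selector_tuning raw.html page.md && echo "try --include / --exclude"` prints the hint only when the files from steps 1 and 2 match that pattern.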

## 3. Try Another Browser Fingerprint

```bash
webclaw https://protected.example.com --browser firefox --format markdown
webclaw https://protected.example.com --browser random --format markdown
```

## 4. Use Cloud Fallback

```bash
export WEBCLAW_API_KEY=wc_your_key

webclaw https://protected.example.com --cloud --format markdown
```

Cloud mode can use hosted routing, JS rendering, and protected-site handling that are not part of the fully local open-source path.

## 5. Keep a Reproducible Report

When reporting a problem, include:

- target URL
- command used
- selected format
- whether `--raw-html` returned a challenge or normal page HTML
- whether `--browser firefox` changed the result
- whether cloud mode changed the result

Remove cookies, tokens, customer data, and private URLs before sharing logs.
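
Part of that cleanup can be scripted. The filter below is a minimal sketch under assumptions: `redact_log` is a hypothetical name, and the patterns only cover bearer tokens, cookie headers, and strings with the `wc_` prefix used for the placeholder key in this README. It does not detect customer data or private URLs, so review the output by hand before sharing it.

```bash
# Hypothetical stdin-to-stdout filter. The patterns are assumptions and
# cover only the most common secret shapes found in request logs.
redact_log() {
  sed -E \
    -e 's/([Aa]uthorization: [Bb]earer )[^[:space:]"]+/\1[REDACTED]/g' \
    -e 's/(([Ss]et-)?[Cc]ookie: ).*/\1[REDACTED]/' \
    -e 's/wc_[A-Za-z0-9_-]+/wc_[REDACTED]/g'
}
```

Usage, assuming a log file named `debug.log`: `redact_log < debug.log > debug.redacted.log`.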

60  examples/firecrawl-compatible-api/README.md  Normal file

@@ -0,0 +1,60 @@

# Firecrawl-Compatible API

webclaw exposes Firecrawl-compatible v2 routes for teams migrating existing scrape, crawl, map, or search calls.

## Scrape

```bash
curl https://api.webclaw.io/v2/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"]
  }'
```

## Crawl

```bash
curl https://api.webclaw.io/v2/crawl \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 25,
    "maxDepth": 2
  }'
```

Poll the returned crawl id:

```bash
curl https://api.webclaw.io/v2/crawl/$CRAWL_ID \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"
```
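
Polling usually means repeating that request until the job reaches a final state. The loop below is a sketch under stated assumptions: this README does not document the response body, so the `status` field and the values `completed`, `failed`, and `cancelled` are assumed to mirror Firecrawl's crawl status response. The helper names `fetch_crawl`, `extract_status`, and `poll_crawl` are hypothetical.

```bash
# Fetch the crawl status document (same request as the curl call above).
fetch_crawl() {
  curl -sS "https://api.webclaw.io/v2/crawl/$1" \
    -H "Authorization: Bearer $WEBCLAW_API_KEY"
}

# Pull a "status" string out of JSON on stdin without extra tools.
# Assumption: the field is named "status"; use a real JSON parser in production.
extract_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# poll_crawl ID [INTERVAL_SECONDS] [MAX_TRIES]
# Prints the final state; exits 0 on completed, 1 on failed/cancelled, 2 on timeout.
poll_crawl() (
  crawl_id="$1"
  interval="${2:-5}"
  max_tries="${3:-60}"
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    status=$(fetch_crawl "$crawl_id" | extract_status)
    case "$status" in
      completed) echo "completed"; exit 0 ;;
      failed|cancelled) echo "$status"; exit 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timeout"
  exit 2
)
```

For example, `poll_crawl "$CRAWL_ID" 5 60` checks every five seconds for up to five minutes. Keeping `fetch_crawl` separate makes it easy to swap in a different endpoint or a stub while testing.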

## Map

```bash
curl https://api.webclaw.io/v2/map \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com"
  }'
```

## Search

```bash
curl https://api.webclaw.io/v2/search \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "site:docs.rs tokio tutorial",
    "limit": 5
  }'
```

Compatibility routes are meant to reduce migration friction. For new projects, prefer the native `/v1` API because it exposes webclaw-specific options more directly.

50  examples/html-to-markdown-rag/README.md  Normal file

@@ -0,0 +1,50 @@

# HTML to Markdown for RAG

Turn web pages into clean markdown or compact LLM text before chunking, embedding, or passing the page to an agent.

## CLI

```bash
# Clean markdown with headings, links, and readable structure.
webclaw https://docs.anthropic.com --format markdown > page.md

# Token-optimized output for direct LLM context.
webclaw https://docs.anthropic.com --format llm > page.txt

# Keep the main article content and remove common navigation/footer noise.
webclaw https://docs.anthropic.com \
  --only-main-content \
  --format markdown \
  > page.md
```

## Batch a URL List

Create `urls.txt`:

```text
https://docs.anthropic.com/
https://docs.anthropic.com/en/docs/claude-code
https://docs.anthropic.com/en/api/messages
```

Run:

```bash
webclaw --urls-file urls.txt --format llm > corpus.txt
```

## Hosted API

```bash
curl https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.anthropic.com",
    "formats": ["markdown", "llm"],
    "only_main_content": true
  }'
```

Use `markdown` when humans may inspect the output. Use `llm` when the next step is chunking, embedding, summarization, or prompt context.
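
As an illustration of the chunking step, the sketch below splits a text file into chunks on blank-line paragraph boundaries. It is not part of webclaw: `chunk_text`, the `chunk-NNNN.txt` naming, and the default budget of 2000 characters are assumptions, and real pipelines often chunk by tokens instead of characters.

```bash
# chunk_text INPUT OUTDIR [MAX_CHARS]
# Hypothetical helper: groups paragraphs (separated by blank lines) into
# files of roughly MAX_CHARS characters. A paragraph longer than the budget
# is kept whole in a chunk of its own instead of being split.
chunk_text() (
  input="$1"
  outdir="$2"
  max_chars="${3:-2000}"
  mkdir -p "$outdir"
  awk -v max="$max_chars" -v dir="$outdir" '
    BEGIN { RS = ""; n = 1; size = 0 }   # RS="" reads one paragraph per record
    {
      # Start a new chunk when this paragraph would exceed the budget.
      if (size > 0 && size + length($0) > max) { close(file); n++; size = 0 }
      file = sprintf("%s/chunk-%04d.txt", dir, n)
      printf "%s\n\n", $0 > file
      size += length($0) + 2
    }
  ' "$input"
)
```

For example, `chunk_text page.txt chunks 2000` writes `chunks/chunk-0001.txt`, `chunks/chunk-0002.txt`, and so on, which can then be embedded one file at a time.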

44  examples/mcp-web-scraping/README.md  Normal file

@@ -0,0 +1,44 @@

# MCP Web Scraping

Use webclaw as a local MCP server so Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, or another MCP client can fetch clean web context.

## Install

```bash
npx create-webclaw
```

The installer detects supported MCP clients and can write the config for you.

## Manual Config

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp",
      "env": {
        "WEBCLAW_API_KEY": "wc_your_key"
      }
    }
  }
}
```

`WEBCLAW_API_KEY` is optional for local extraction. Add it when you want cloud fallback for protected sites, JS rendering, hosted search, or hosted research.
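
Some MCP clients may start the server process directly instead of through a shell, in which case a leading `~` in `command` would not be expanded. That is an assumption about client behavior, not something this README states. If the server fails to start, one thing to try is the same config with an absolute path. The sketch below prints it; `print_mcp_config` is a hypothetical helper, and where to save the output depends on your client.

```bash
# Hypothetical helper: print the manual config with $HOME expanded to an
# absolute path. Falls back to the placeholder key when WEBCLAW_API_KEY is unset.
print_mcp_config() {
  cat <<EOF
{
  "mcpServers": {
    "webclaw": {
      "command": "$HOME/.webclaw/webclaw-mcp",
      "env": {
        "WEBCLAW_API_KEY": "${WEBCLAW_API_KEY:-wc_your_key}"
      }
    }
  }
}
EOF
}
```

Run `print_mcp_config` and paste the result into the client's MCP configuration file. Since the key is optional for local extraction, the `env` block can be removed by hand if you do not need cloud fallback.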

## Example Prompts

```text
Scrape https://docs.rs/tokio and summarize the parts about task spawning.
```

```text
Crawl https://docs.example.com up to depth 2 and return the pages most relevant to authentication.
```

```text
Extract the pricing tiers from https://example.com/pricing as JSON with fields name, price, limits, and features.
```

The MCP server exposes tools for scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, and vertical extractors.

53  examples/proxy-backed-crawling/README.md  Normal file

@@ -0,0 +1,53 @@

# Proxy-Backed Crawling

Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file.

## Single Proxy

```bash
webclaw https://example.com \
  --proxy http://user:pass@proxy.example.com:8080 \
  --format markdown
```

SOCKS5 is supported too:

```bash
webclaw https://example.com \
  --proxy socks5://proxy.example.com:1080 \
  --format markdown
```

## Proxy Pool

Create `proxies.txt` with one proxy per line:

```text
http://user:pass@proxy-1.example.com:8080
http://user:pass@proxy-2.example.com:8080
http://user:pass@proxy-3.example.com:8080
```
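
A malformed line in the pool file is easy to miss, so it can help to check the file before a long crawl. The validator below is a sketch based only on the entry shapes shown in this README (`scheme://[user:pass@]host:port`). The name `validate_proxies` and the accepted schemes (`http`, `https`, `socks5`) are assumptions, and webclaw itself may accept other formats.

```bash
# validate_proxies FILE
# Hypothetical helper: exits 0 when every line looks like
# scheme://[user:pass@]host:port, otherwise lists offending lines on stderr.
validate_proxies() (
  file="$1"
  pattern='^(http|https|socks5)://([^:@/[:space:]]+:[^@/[:space:]]+@)?[^:@/[:space:]]+:[0-9]{1,5}$'
  # -v selects lines that do NOT match; -n prefixes them with line numbers.
  bad=$(grep -nvE "$pattern" "$file" || true)
  if [ -n "$bad" ]; then
    printf 'invalid proxy entries:\n%s\n' "$bad" >&2
    exit 1
  fi
)
```

For example, `validate_proxies proxies.txt && echo "pool looks well-formed"`. Blank lines are reported as invalid too, which keeps the file strictly one proxy per line.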

Run a crawl with controlled concurrency:

```bash
webclaw https://docs.example.com \
  --crawl \
  --depth 2 \
  --max-pages 100 \
  --concurrency 10 \
  --delay 200 \
  --proxy-file proxies.txt \
  --format markdown
```

## Batch URLs

```bash
webclaw --urls-file urls.txt \
  --proxy-file proxies.txt \
  --concurrency 10 \
  --format json
```

Proxy rotation helps with throughput and IP reputation. It does not replace request fingerprinting, JS rendering, or challenge handling for heavily protected sites. For those, use hosted cloud mode with `WEBCLAW_API_KEY`.