Mirror of https://github.com/0xMassi/webclaw.git, synced 2026-06-08 22:25:12 +02:00

docs: add workflow examples

Commit aab51bea91 (parent b75b768ec3)
7 changed files with 281 additions and 0 deletions

@@ -137,6 +137,14 @@ webclaw https://example.com \
webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
```

### Workflow examples

- [HTML to Markdown for RAG](examples/html-to-markdown-rag/)
- [Firecrawl-compatible API](examples/firecrawl-compatible-api/)
- [MCP web scraping](examples/mcp-web-scraping/)
- [Proxy-backed crawling](examples/proxy-backed-crawling/)
- [Cloudflare diagnostics](examples/cloudflare-diagnostics/)

### Extract brand assets

```bash

@@ -2,6 +2,14 @@

Practical examples showing what webclaw can do. Each example is a self-contained command you can run immediately.

## Workflow Guides

- [HTML to Markdown for RAG](./html-to-markdown-rag/) turns web pages into markdown or compact LLM text for retrieval pipelines.
- [Firecrawl-Compatible API](./firecrawl-compatible-api/) shows the `/v2` compatibility routes for scrape, crawl, map, and search.
- [MCP Web Scraping](./mcp-web-scraping/) connects webclaw to MCP clients such as Claude Code, Claude Desktop, Cursor, and Codex CLI.
- [Proxy-Backed Crawling](./proxy-backed-crawling/) shows single-proxy and proxy-pool crawling from the CLI.
- [Cloudflare Diagnostics](./cloudflare-diagnostics/) gives a reproducible checklist for blocked or empty protected-site results.

## Basic Extraction

```bash

examples/cloudflare-diagnostics/README.md (normal file, 58 lines changed)

@@ -0,0 +1,58 @@
# Cloudflare Diagnostics

Use this checklist when a page works in the browser but fails from a scraper, returns a challenge page, or produces empty extracted content.

## 1. Save the Raw Response

```bash
webclaw https://protected.example.com --raw-html > raw.html
```

Inspect `raw.html` for challenge copy, blocked request text, empty shells, or application HTML that needs JavaScript rendering.
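
The inspection step can be partly scripted. The sketch below is a rough heuristic and not part of webclaw: the function name `classify_html`, the marker strings, and the 500-byte "empty shell" threshold are all assumptions chosen for illustration, so adjust them to what you actually see in `raw.html`.

```bash
# Hypothetical helper: guess what kind of response a saved HTML file contains.
# The marker strings and the size threshold are illustrative assumptions.
classify_html() {
  file="$1"
  if [ ! -s "$file" ]; then
    # Missing or zero-byte file.
    echo "empty"
  elif grep -qiE 'just a moment|attention required|cf-chl|challenge-platform' "$file"; then
    # Text that commonly appears on challenge or block pages.
    echo "challenge"
  elif [ "$(wc -c < "$file" | tr -d ' ')" -lt 500 ]; then
    # Very small document, likely an app shell that needs JavaScript.
    echo "shell"
  else
    echo "content"
  fi
}

# Usage: classify_html raw.html
```

A result of `content` only means no marker matched. It is still worth opening the file before drawing conclusions.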

## 2. Compare Extracted Formats

```bash
webclaw https://protected.example.com --format markdown > page.md
webclaw https://protected.example.com --format json > page.json
webclaw https://protected.example.com --format llm > page.txt
```

If raw HTML has content but markdown is empty, tune extraction with selectors:

```bash
webclaw https://protected.example.com \
  --include "main, article, [role=main]" \
  --exclude "nav, footer, aside, .cookie-banner" \
  --format markdown
```

## 3. Try Another Browser Fingerprint

```bash
webclaw https://protected.example.com --browser firefox --format markdown
webclaw https://protected.example.com --browser random --format markdown
```

## 4. Use Cloud Fallback

```bash
export WEBCLAW_API_KEY=wc_your_key

webclaw https://protected.example.com --cloud --format markdown
```

Cloud mode can use hosted routing, JS rendering, and protected-site handling that are not part of the fully local open-source path.

## 5. Keep a Reproducible Report

When reporting a problem, include:

- target URL
- command used
- selected format
- whether `--raw-html` returned a challenge or normal page HTML
- whether `--browser firefox` changed the result
- whether cloud mode changed the result

Remove cookies, tokens, customer data, and private URLs before sharing logs.
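
Part of that cleanup can be automated. The following is a minimal sketch, assuming logs contain HTTP-style `Authorization` and `Cookie` headers and API keys with the `wc_` prefix used in the examples above. The function name `redact_log` and the patterns are illustrative, and they will not catch customer data or private URLs, which still need a manual pass.

```bash
# Hypothetical helper: mask common secrets in a log before sharing it.
# The patterns are illustrative assumptions and will not catch everything.
redact_log() {
  sed -E \
    -e 's/(Authorization: Bearer )[^"[:space:]]+/\1[REDACTED]/g' \
    -e 's/(Cookie: ).*/\1[REDACTED]/' \
    -e 's/wc_[A-Za-z0-9_]+/wc_[REDACTED]/g' \
    "$@"
}

# Usage: redact_log debug.log > debug.redacted.log
# With no file argument, the function reads from standard input.
```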

examples/firecrawl-compatible-api/README.md (normal file, 60 lines changed)

@@ -0,0 +1,60 @@
# Firecrawl-Compatible API

webclaw exposes Firecrawl-compatible v2 routes for teams migrating existing scrape, crawl, map, or search calls.

## Scrape

```bash
curl https://api.webclaw.io/v2/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"]
  }'
```

## Crawl

```bash
curl https://api.webclaw.io/v2/crawl \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 25,
    "maxDepth": 2
  }'
```

Poll the returned crawl id:

```bash
curl https://api.webclaw.io/v2/crawl/$CRAWL_ID \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"
```
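
Polling usually means repeating that request until the crawl reaches a terminal state. The sketch below shows one way to do it. It assumes the response is JSON with a top-level `status` field whose terminal values are `completed` and `failed`, and that `jq` is installed. Neither the field name nor its values are confirmed by this README, so check a real response before relying on it. The function names `fetch_status` and `wait_for_crawl` are hypothetical.

```bash
# Hypothetical sketch: poll a crawl until it reaches a terminal state.
# Assumes a top-level "status" field with terminal values "completed"
# and "failed"; verify against a real response.
fetch_status() {
  curl -s "https://api.webclaw.io/v2/crawl/$1" \
    -H "Authorization: Bearer $WEBCLAW_API_KEY" |
    jq -r '.status // empty'
}

wait_for_crawl() {
  crawl_id="$1"
  interval="${2:-5}"    # seconds between polls
  max_tries="${3:-60}"  # give up after this many polls
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    status=$(fetch_status "$crawl_id")
    case "$status" in
      completed) echo "completed"; return 0 ;;
      failed)    echo "failed"; return 1 ;;
    esac
    tries=$((tries + 1))
    sleep "$interval"
  done
  echo "timeout"
  return 2
}

# Usage: wait_for_crawl "$CRAWL_ID" 5 60
```

Bounding the loop with `max_tries` keeps a stuck crawl from blocking a script indefinitely.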

## Map

```bash
curl https://api.webclaw.io/v2/map \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com"
  }'
```

## Search

```bash
curl https://api.webclaw.io/v2/search \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "site:docs.rs tokio tutorial",
    "limit": 5
  }'
```

Compatibility routes are meant to reduce migration friction. For new projects, prefer the native `/v1` API because it exposes webclaw-specific options more directly.

examples/html-to-markdown-rag/README.md (normal file, 50 lines changed)

@@ -0,0 +1,50 @@
# HTML to Markdown for RAG

Turn web pages into clean markdown or compact LLM text before chunking, embedding, or passing the page to an agent.

## CLI

```bash
# Clean markdown with headings, links, and readable structure.
webclaw https://docs.anthropic.com --format markdown > page.md

# Token-optimized output for direct LLM context.
webclaw https://docs.anthropic.com --format llm > page.txt

# Keep the main article content and remove common navigation/footer noise.
webclaw https://docs.anthropic.com \
  --only-main-content \
  --format markdown \
  > page.md
```

## Batch a URL List

Create `urls.txt`:

```text
https://docs.anthropic.com/
https://docs.anthropic.com/en/docs/claude-code
https://docs.anthropic.com/en/api/messages
```

Run:

```bash
webclaw --urls-file urls.txt --format llm > corpus.txt
```
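
If a retrieval pipeline needs to keep track of which page each chunk came from, one file per URL can be easier to work with than a single corpus. The sketch below loops over `urls.txt` and calls webclaw once per line, using only the `--format markdown` flag shown above. The function names `url_to_filename` and `scrape_each`, the `pages` output directory, and the file-naming scheme are assumptions for illustration.

```bash
# Hypothetical helper: derive a filesystem-safe name from a URL by dropping
# the scheme and trailing slashes, then replacing other characters with "_".
url_to_filename() {
  printf '%s\n' "$1" |
    sed -E -e 's#^[a-zA-Z]+://##' -e 's#/+$##' -e 's#[^A-Za-z0-9._-]+#_#g'
}

# Write one markdown file per URL listed in the file given as $1.
scrape_each() {
  out_dir="${2:-pages}"
  mkdir -p "$out_dir"
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue  # skip blank lines
    webclaw "$url" --format markdown < /dev/null \
      > "$out_dir/$(url_to_filename "$url").md"
  done < "$1"
}

# Usage: scrape_each urls.txt pages
```

Unlike the batch command, this runs the URLs sequentially, so it trades speed for simpler per-page output.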

## Hosted API

```bash
curl https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.anthropic.com",
    "formats": ["markdown", "llm"],
    "only_main_content": true
  }'
```

Use `markdown` when humans may inspect the output. Use `llm` when the next step is chunking, embedding, summarization, or prompt context.

examples/mcp-web-scraping/README.md (normal file, 44 lines changed)

@@ -0,0 +1,44 @@
# MCP Web Scraping

Use webclaw as a local MCP server so Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, or another MCP client can fetch clean web context.

## Install

```bash
npx create-webclaw
```

The installer detects supported MCP clients and can write the config for you.

## Manual Config

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp",
      "env": {
        "WEBCLAW_API_KEY": "wc_your_key"
      }
    }
  }
}
```

`WEBCLAW_API_KEY` is optional for local extraction. Add it when you want cloud fallback for protected sites, JS rendering, hosted search, or hosted research.

## Example Prompts

```text
Scrape https://docs.rs/tokio and summarize the parts about task spawning.
```

```text
Crawl https://docs.example.com up to depth 2 and return the pages most relevant to authentication.
```

```text
Extract the pricing tiers from https://example.com/pricing as JSON with fields name, price, limits, and features.
```

The MCP server exposes tools for scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, and vertical extractors.

examples/proxy-backed-crawling/README.md (normal file, 53 lines changed)

@@ -0,0 +1,53 @@
# Proxy-Backed Crawling

Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file.

## Single Proxy

```bash
webclaw https://example.com \
  --proxy http://user:pass@proxy.example.com:8080 \
  --format markdown
```

SOCKS5 is supported too:

```bash
webclaw https://example.com \
  --proxy socks5://proxy.example.com:1080 \
  --format markdown
```

## Proxy Pool

Create `proxies.txt` with one proxy per line:

```text
http://user:pass@proxy-1.example.com:8080
http://user:pass@proxy-2.example.com:8080
http://user:pass@proxy-3.example.com:8080
```
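
A malformed line in the pool file is easy to miss, so a quick format check before a long crawl can save time. The sketch below is based only on the URL shapes shown in this example (`scheme://[user:pass@]host:port`). The function name `check_proxy_file` and the accepted schemes are assumptions, and webclaw itself may accept other formats that this check would flag.

```bash
# Hypothetical helper: print lines of a proxy list that do not look like
# scheme://[user:pass@]host:port. The accepted schemes are an assumption.
check_proxy_file() {
  pattern='^(https?|socks5)://([^:@/[:space:]]+:[^@/[:space:]]+@)?[^:@/[:space:]]+:[0-9]{1,5}$'
  # -v selects non-matching lines, -n prefixes each with its line number.
  bad=$(grep -vnE "$pattern" "$1" || true)
  if [ -n "$bad" ]; then
    echo "$bad"
    return 1
  fi
}

# Usage: check_proxy_file proxies.txt && echo "proxy file looks well-formed"
```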

Run a crawl with controlled concurrency:

```bash
webclaw https://docs.example.com \
  --crawl \
  --depth 2 \
  --max-pages 100 \
  --concurrency 10 \
  --delay 200 \
  --proxy-file proxies.txt \
  --format markdown
```

## Batch URLs

```bash
webclaw --urls-file urls.txt \
  --proxy-file proxies.txt \
  --concurrency 10 \
  --format json
```

Proxy rotation helps with throughput and IP reputation. It does not replace request fingerprinting, JS rendering, or challenge handling for heavily protected sites. For those, use hosted cloud mode with `WEBCLAW_API_KEY`.