docs: add workflow examples

Valerio 2026-05-18 18:56:00 +02:00
parent b75b768ec3
commit aab51bea91
7 changed files with 281 additions and 0 deletions

@@ -2,6 +2,14 @@
Practical examples showing what webclaw can do. Each example is a self-contained command you can run immediately.
## Workflow Guides
- [HTML to Markdown for RAG](./html-to-markdown-rag/) turns web pages into markdown or compact LLM text for retrieval pipelines.
- [Firecrawl-Compatible API](./firecrawl-compatible-api/) shows the `/v2` compatibility routes for scrape, crawl, map, and search.
- [MCP Web Scraping](./mcp-web-scraping/) connects webclaw to MCP clients such as Claude Code, Claude Desktop, Cursor, and Codex CLI.
- [Proxy-Backed Crawling](./proxy-backed-crawling/) shows single-proxy and proxy-pool crawling from the CLI.
- [Cloudflare Diagnostics](./cloudflare-diagnostics/) gives a reproducible checklist for blocked or empty protected-site results.
## Basic Extraction
```bash

@@ -0,0 +1,58 @@
# Cloudflare Diagnostics
Use this checklist when a page works in the browser but fails from a scraper, returns a challenge page, or produces empty extracted content.
## 1. Save the Raw Response
```bash
webclaw https://protected.example.com --raw-html > raw.html
```
Inspect `raw.html` for challenge copy, blocked request text, empty shells, or application HTML that needs JavaScript rendering.
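A first pass over the saved file can be scripted. The sketch below is a heuristic, not a webclaw feature: the helper name `classify_raw_html` and the marker strings are assumptions based on text commonly seen on challenge and block pages, so a `needs-review` result still means opening the file by hand.

```bash
# Hypothetical helper, not a webclaw command.
# The marker strings are assumptions drawn from commonly seen challenge pages.
classify_raw_html() {
  file="$1"
  if [ ! -s "$file" ]; then
    # Missing or zero-byte response.
    echo "empty"
  elif grep -qiE 'just a moment|attention required|cf-chl|challenge-platform' "$file"; then
    echo "challenge"
  else
    # Could be real content or a JavaScript shell; inspect manually.
    echo "needs-review"
  fi
}

# Usage: classify_raw_html raw.html
```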
## 2. Compare Extracted Formats
```bash
webclaw https://protected.example.com --format markdown > page.md
webclaw https://protected.example.com --format json > page.json
webclaw https://protected.example.com --format llm > page.txt
```
If raw HTML has content but markdown is empty, tune extraction with selectors:
```bash
webclaw https://protected.example.com \
  --include "main, article, [role=main]" \
  --exclude "nav, footer, aside, .cookie-banner" \
  --format markdown
```
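To decide whether selector tuning is the right next step, the saved files can be compared by size. This is a rough sketch: the helper name `needs_selector_tuning` and both byte thresholds are arbitrary assumptions to adjust per site.

```bash
# Hypothetical helper: succeeds when raw HTML looks substantial but the
# markdown output is nearly empty. Thresholds are illustrative assumptions.
needs_selector_tuning() {
  raw_file="$1"
  md_file="$2"
  raw_bytes=$(wc -c < "$raw_file")
  # Count non-whitespace characters so a file of blank lines counts as empty.
  md_chars=$(tr -d '[:space:]' < "$md_file" | wc -c)
  [ "$((raw_bytes))" -ge 2048 ] && [ "$((md_chars))" -lt 50 ]
}

# Usage:
#   if needs_selector_tuning raw.html page.md; then echo "try --include/--exclude"; fi
```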
## 3. Try Another Browser Fingerprint
```bash
webclaw https://protected.example.com --browser firefox --format markdown
webclaw https://protected.example.com --browser random --format markdown
```
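Comparing output sizes side by side makes it easier to see whether a fingerprint changed anything. The loop below is a minimal sketch that uses only the two `--browser` values shown above; the helper name `compare_fingerprints` is hypothetical, and byte count is only a rough proxy for "got real content".

```bash
# Hypothetical helper: print the markdown size in bytes for each fingerprint.
compare_fingerprints() {
  url="$1"
  for browser in firefox random; do
    bytes=$(webclaw "$url" --browser "$browser" --format markdown 2>/dev/null | wc -c)
    printf '%s\t%s\n' "$browser" "$((bytes))"
  done
}

# Usage: compare_fingerprints https://protected.example.com
```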
## 4. Use Cloud Fallback
```bash
export WEBCLAW_API_KEY=wc_your_key
webclaw https://protected.example.com --cloud --format markdown
```
Cloud mode can use hosted routing, JS rendering, and protected-site handling that are not part of the fully local open-source path.
## 5. Keep a Reproducible Report
When reporting a problem, include:
- target URL
- command used
- selected format
- whether `--raw-html` returned a challenge or normal page HTML
- whether `--browser firefox` changed the result
- whether cloud mode changed the result
Remove cookies, tokens, customer data, and private URLs before sharing logs.
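Part of that cleanup can be automated before a log is attached to a report. The filter below is a sketch under stated assumptions: `redact_log` is a hypothetical helper, and the `wc_` key prefix is inferred from the placeholder key used earlier in this guide. It cannot detect customer data or private URLs, so review the output manually as well.

```bash
# Hypothetical helper: mask bearer tokens, cookie headers, and wc_-prefixed keys.
# Reads a log on stdin and writes the redacted log to stdout.
redact_log() {
  sed -E \
    -e 's/(Authorization: Bearer )[^[:space:]"]+/\1[REDACTED]/g' \
    -e 's/(Cookie: ).*/\1[REDACTED]/' \
    -e 's/wc_[A-Za-z0-9_]+/wc_[REDACTED]/g'
}

# Usage: redact_log < debug.log > debug.redacted.log
```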

@@ -0,0 +1,60 @@
# Firecrawl-Compatible API
webclaw exposes Firecrawl-compatible v2 routes for teams migrating existing scrape, crawl, map, or search calls.
## Scrape
```bash
curl https://api.webclaw.io/v2/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"]
  }'
```
## Crawl
```bash
curl https://api.webclaw.io/v2/crawl \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 25,
    "maxDepth": 2
  }'
```
Poll the crawl status with the ID returned by the request above:
```bash
curl https://api.webclaw.io/v2/crawl/$CRAWL_ID \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"
```
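Polling is usually wrapped in a loop with a retry limit. The sketch below assumes the status response is JSON with a top-level `status` field whose terminal values are `completed` and `failed`, mirroring the Firecrawl convention; the response shape is not confirmed here, so check a real response before relying on it. The function names are hypothetical.

```bash
# Fetch the crawl status JSON for a crawl ID (same request as above).
fetch_crawl() {
  curl -s "https://api.webclaw.io/v2/crawl/$1" \
    -H "Authorization: Bearer $WEBCLAW_API_KEY"
}

# Extract the first "status" value from JSON on stdin without requiring jq.
extract_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll until a terminal status (assumed: completed/failed) or the attempt limit.
# Args: crawl_id [max_attempts=30] [interval_seconds=2]
wait_for_crawl() {
  crawl_id="$1"
  max_attempts="${2:-30}"
  interval="${3:-2}"
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$(fetch_crawl "$crawl_id" | extract_status)
    case "$status" in
      completed) echo "completed"; return 0 ;;
      failed)    echo "failed"; return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "timeout"
  return 2
}

# Usage: wait_for_crawl "$CRAWL_ID" 60 5
```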
## Map
```bash
curl https://api.webclaw.io/v2/map \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com"
  }'
```
## Search
```bash
curl https://api.webclaw.io/v2/search \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "site:docs.rs tokio tutorial",
    "limit": 5
  }'
```
Compatibility routes are meant to reduce migration friction. For new projects, prefer the native `/v1` API because it exposes webclaw-specific options more directly.

@@ -0,0 +1,50 @@
# HTML to Markdown for RAG
Turn web pages into clean markdown or compact LLM text before chunking, embedding, or passing the page to an agent.
## CLI
```bash
# Clean markdown with headings, links, and readable structure.
webclaw https://docs.anthropic.com --format markdown > page.md

# Token-optimized output for direct LLM context.
webclaw https://docs.anthropic.com --format llm > page.txt

# Keep the main article content and remove common navigation/footer noise.
webclaw https://docs.anthropic.com \
  --only-main-content \
  --format markdown \
  > page.md
```
## Batch a URL List
Create `urls.txt`:
```text
https://docs.anthropic.com/
https://docs.anthropic.com/en/docs/claude-code
https://docs.anthropic.com/en/api/messages
```
Run:
```bash
webclaw --urls-file urls.txt --format llm > corpus.txt
```
## Hosted API
```bash
curl https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.anthropic.com",
    "formats": ["markdown", "llm"],
    "only_main_content": true
  }'
```
Use `markdown` when humans may inspect the output. Use `llm` when the next step is chunking, embedding, summarization, or prompt context.
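If the next step is chunking, one simple approach is to split the markdown output at headings. The sketch below is not part of webclaw: `chunk_markdown` is a hypothetical helper that starts a new chunk at each level-1 or level-2 heading and ignores `#` lines inside fenced code blocks. Real pipelines often also cap chunk size by tokens, which this does not do.

```bash
# Hypothetical helper: split a markdown file into chunk-NNN.md files at
# "# " and "## " headings, skipping heading-like lines inside code fences.
chunk_markdown() {
  input="$1"
  outdir="$2"
  mkdir -p "$outdir"
  awk -v dir="$outdir" '
    BEGIN {
      fence = sprintf("%c%c%c", 96, 96, 96)   # three backticks
      n = 0
      file = sprintf("%s/chunk-%03d.md", dir, n)
    }
    # Toggle fence state on lines that start with three backticks.
    index($0, fence) == 1 { in_fence = !in_fence }
    # Outside a fence, a level-1 or level-2 heading starts a new chunk.
    !in_fence && /^##? / {
      close(file)
      n++
      file = sprintf("%s/chunk-%03d.md", dir, n)
    }
    { print > file }
  ' "$input"
}

# Usage: chunk_markdown page.md chunks/
```

Text before the first heading lands in `chunk-000.md`. Use a fresh output directory per page so chunks from an earlier run do not linger.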

@@ -0,0 +1,44 @@
# MCP Web Scraping
Use webclaw as a local MCP server so Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, or another MCP client can fetch clean web context.
## Install
```bash
npx create-webclaw
```
The installer detects supported MCP clients and can write the config for you.
## Manual Config
```json
{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp",
      "env": {
        "WEBCLAW_API_KEY": "wc_your_key"
      }
    }
  }
}
```
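`WEBCLAW_API_KEY` is optional for local extraction. Add it when you want cloud fallback for protected sites, JS rendering, hosted search, or hosted research. Some MCP clients launch the command without a shell, so `~` may not be expanded; if the server fails to start, replace it with the absolute path to `webclaw-mcp`.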
## Example Prompts
```text
Scrape https://docs.rs/tokio and summarize the parts about task spawning.
```
```text
Crawl https://docs.example.com up to depth 2 and return the pages most relevant to authentication.
```
```text
Extract the pricing tiers from https://example.com/pricing as JSON with fields name, price, limits, and features.
```
The MCP server exposes tools for scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, and vertical extractors.

@@ -0,0 +1,53 @@
# Proxy-Backed Crawling
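Route requests through a proxy when a crawl should not come from your own IP address, and rotate across a pool when you need to spread requests over several addresses. webclaw accepts either a single proxy (`--proxy`) or a file of proxies (`--proxy-file`).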
## Single Proxy
```bash
webclaw https://example.com \
  --proxy http://user:pass@proxy.example.com:8080 \
  --format markdown
```
SOCKS5 is supported too:
```bash
webclaw https://example.com \
  --proxy socks5://proxy.example.com:1080 \
  --format markdown
```
## Proxy Pool
Create `proxies.txt` with one proxy per line:
```text
http://user:pass@proxy-1.example.com:8080
http://user:pass@proxy-2.example.com:8080
http://user:pass@proxy-3.example.com:8080
```
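A malformed line in the pool file is easier to catch before the crawl starts. The check below is a sketch, not webclaw's own parser: `invalid_proxies` is a hypothetical helper that assumes every entry has the form `scheme://[user[:pass]@]host:port` with an `http`, `https`, or `socks5` scheme, as in the examples above. It skips blank lines and flags IPv6 hosts and passwords containing `@` or `/`, even though webclaw might accept them.

```bash
# Hypothetical helper: print lines of a proxy file that do not look like
# scheme://[user[:pass]@]host:port. Exits 0 if any invalid line was printed.
invalid_proxies() {
  grep -vE '^[[:space:]]*$' "$1" |
    grep -vE '^(https?|socks5)://([^:@/[:space:]]+(:[^@/[:space:]]*)?@)?[^:@/[:space:]]+:[0-9]+$'
}

# Usage:
#   if invalid_proxies proxies.txt; then echo "fix the lines above" >&2; fi
```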
Run a crawl with controlled concurrency:
```bash
webclaw https://docs.example.com \
  --crawl \
  --depth 2 \
  --max-pages 100 \
  --concurrency 10 \
  --delay 200 \
  --proxy-file proxies.txt \
  --format markdown
```
## Batch URLs
```bash
webclaw --urls-file urls.txt \
  --proxy-file proxies.txt \
  --concurrency 10 \
  --format json
```
Proxy rotation helps with throughput and IP reputation. It does not replace request fingerprinting, JS rendering, or challenge handling for heavily protected sites. For those, use hosted cloud mode: set `WEBCLAW_API_KEY` and pass `--cloud`.