docs: add workflow examples

2026-06-08 22:25:12 +02:00 · 2026-05-18 18:56:00 +02:00 · 2026-05-18 18:56:00 +02:00 · aab51bea91
commit aab51bea91
parent b75b768ec3
7 changed files with 281 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -137,6 +137,14 @@ webclaw https://example.com \
 webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
 ```
 ### Workflow examples
 - [HTML to Markdown for RAG](examples/html-to-markdown-rag/)
 - [Firecrawl-compatible API](examples/firecrawl-compatible-api/)
 - [MCP web scraping](examples/mcp-web-scraping/)
 - [Proxy-backed crawling](examples/proxy-backed-crawling/)
 - [Cloudflare diagnostics](examples/cloudflare-diagnostics/)
 ### Extract brand assets
 ```bash
--- a/examples/README.md
+++ b/examples/README.md
@@ -2,6 +2,14 @@
 Practical examples showing what webclaw can do. Each example is a self-contained command you can run immediately.
 ## Workflow Guides
 - [HTML to Markdown for RAG](./html-to-markdown-rag/) turns web pages into markdown or compact LLM text for retrieval pipelines.
 - [Firecrawl-Compatible API](./firecrawl-compatible-api/) shows the `/v2` compatibility routes for scrape, crawl, map, and search.
 - [MCP Web Scraping](./mcp-web-scraping/) connects webclaw to MCP clients such as Claude Code, Claude Desktop, Cursor, and Codex CLI.
 - [Proxy-Backed Crawling](./proxy-backed-crawling/) shows single-proxy and proxy-pool crawling from the CLI.
 - [Cloudflare Diagnostics](./cloudflare-diagnostics/) gives a reproducible checklist for blocked or empty protected-site results.
 ## Basic Extraction
 ```bash
--- a/examples/cloudflare-diagnostics/README.md
+++ b/examples/cloudflare-diagnostics/README.md
@@ -0,0 +1,58 @@
 # Cloudflare Diagnostics
 Use this checklist when a page works in the browser but fails from a scraper, returns a challenge page, or produces empty extracted content.
 ## 1. Save the Raw Response
 ```bash
 webclaw https://protected.example.com --raw-html > raw.html
 ```
 Inspect `raw.html` for challenge copy, blocked request text, empty shells, or application HTML that needs JavaScript rendering.
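One way to make this inspection repeatable is a small triage function over the saved file. This is a heuristic sketch, not a webclaw feature: the function name `classify_raw_html` is hypothetical, and the marker strings (challenge titles, block-page wording, empty single-page-app mount points) are assumptions based on commonly seen pages, so expect false positives and negatives.

```shell
# Heuristic triage for a saved response body.
# All marker strings below are assumptions, not an official or complete list.
classify_raw_html() {
  local file="$1"
  if [ ! -s "$file" ]; then
    # Missing or zero-byte file.
    echo "empty"
  elif grep -qiE 'just a moment|cf-chl|challenge-platform' "$file"; then
    # Text and paths often seen on interactive challenge pages.
    echo "challenge"
  elif grep -qiE 'attention required|access denied' "$file"; then
    # Wording often seen on hard block pages.
    echo "blocked"
  elif grep -qiE '<div id="(root|app|__next)"></div>' "$file"; then
    # Empty mount point: the page probably needs JavaScript rendering.
    echo "js-shell"
  else
    echo "html"
  fi
}
```

Run it as `classify_raw_html raw.html`. Treat the label as a hint and still open the file when the result is surprising.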
 ## 2. Compare Extracted Formats
 ```bash
 webclaw https://protected.example.com --format markdown > page.md
 webclaw https://protected.example.com --format json > page.json
 webclaw https://protected.example.com --format llm > page.txt
 ```
 If raw HTML has content but markdown is empty, tune extraction with selectors:
 ```bash
 webclaw https://protected.example.com \
  --include "main, article, [role=main]" \
  --exclude "nav, footer, aside, .cookie-banner" \
  --format markdown
 ```
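The "raw HTML has content but markdown is empty" check can be scripted roughly as follows. `extraction_status` is a hypothetical helper, and the 200-byte default threshold is an arbitrary assumption to tune for your pages.

```shell
# Compare a saved raw response with an extracted file.
# Prints raw-empty, extraction-empty, or ok.
extraction_status() {
  local raw="$1" extracted="$2" min_bytes="${3:-200}"
  local raw_size ext_size
  # Byte count of the raw response.
  raw_size=$(wc -c < "$raw" | tr -d ' ')
  # Byte count of the extracted output, ignoring whitespace.
  ext_size=$(tr -d '[:space:]' < "$extracted" | wc -c | tr -d ' ')
  if [ "$raw_size" -lt "$min_bytes" ]; then
    echo "raw-empty"
  elif [ "$ext_size" -lt "$min_bytes" ]; then
    echo "extraction-empty"
  else
    echo "ok"
  fi
}
```

For example, `extraction_status raw.html page.md` printing `extraction-empty` suggests the selectors are worth tuning, while `raw-empty` points at the fetch itself.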
 ## 3. Try Another Browser Fingerprint
 ```bash
 webclaw https://protected.example.com --browser firefox --format markdown
 webclaw https://protected.example.com --browser random --format markdown
 ```
 ## 4. Use Cloud Fallback
 ```bash
 export WEBCLAW_API_KEY=wc_your_key
 webclaw https://protected.example.com --cloud --format markdown
 ```
 Cloud mode can use hosted routing, JS rendering, and protected-site handling that are not part of the fully local open-source path.
 ## 5. Keep a Reproducible Report
 When reporting a problem, include:
 - target URL
 - command used
 - selected format
 - whether `--raw-html` returned a challenge or normal page HTML
 - whether `--browser firefox` changed the result
 - whether cloud mode changed the result
 Remove cookies, tokens, customer data, and private URLs before sharing logs.
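A first redaction pass over a log can be automated before a manual review. This is a minimal sketch under stated assumptions: `redact_log` is a hypothetical helper, the `wc_` key prefix follows the placeholder key used in this guide, and the patterns will not catch customer data or private URLs.

```shell
# Redact common secrets from text on stdin and write the result to stdout.
# The patterns are assumptions; extend them for your own token formats.
redact_log() {
  sed -E \
    -e 's/([Aa]uthorization: [Bb]earer )[^[:space:]"]+/\1[REDACTED]/g' \
    -e 's/wc_[A-Za-z0-9_-]+/wc_[REDACTED]/g' \
    -e 's/(([Ss]et-)?[Cc]ookie: ).*/\1[REDACTED]/' \
    -e 's#(://)[^/@[:space:]]+:[^/@[:space:]]+@#\1[REDACTED]@#g'
}
```

Usage: `redact_log < debug.log > debug.redacted.log`, then read the redacted file once more before attaching it to a report.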
--- a/examples/firecrawl-compatible-api/README.md
+++ b/examples/firecrawl-compatible-api/README.md
@@ -0,0 +1,60 @@
 # Firecrawl-Compatible API
 webclaw exposes Firecrawl-compatible v2 routes for teams migrating existing scrape, crawl, map, or search calls.
 ## Scrape
 ```bash
 curl https://api.webclaw.io/v2/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"]
  }'
 ```
 ## Crawl
 ```bash
 curl https://api.webclaw.io/v2/crawl \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 25,
    "maxDepth": 2
  }'
 ```
 The crawl request returns a crawl id. Export it as `CRAWL_ID` and poll the status route:
 ```bash
 curl https://api.webclaw.io/v2/crawl/$CRAWL_ID \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"
 ```
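Polling by hand gets tedious, so a loop along these lines may help. It is a sketch under assumptions this guide does not confirm: that the status response is JSON with a top-level `"status"` string, and that the values follow Firecrawl's convention (`completed`, `failed`, `cancelled`). The helper names `fetch_crawl`, `crawl_status`, and `poll_crawl` are hypothetical.

```shell
# Base URL for the compatibility routes; override with API_BASE if needed.
API_BASE="${API_BASE:-https://api.webclaw.io/v2}"

# Fetch the status document for a crawl id.
fetch_crawl() {
  curl -sS "$API_BASE/crawl/$1" -H "Authorization: Bearer $WEBCLAW_API_KEY"
}

# Read JSON on stdin and print the first "status" value (no jq required).
# Assumes a top-level "status" string field.
crawl_status() {
  grep -oE '"status"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 \
    | sed -E 's/.*"([^"]*)"$/\1/'
}

# poll_crawl ID [INTERVAL_SECONDS] [MAX_TRIES]
# Returns 0 on completed, 1 on failed or cancelled, 2 on timeout.
poll_crawl() {
  local id="$1" interval="${2:-5}" max_tries="${3:-60}" status i
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    i=$((i + 1))
    status=$(fetch_crawl "$id" | crawl_status)
    case "$status" in
      completed) echo "completed"; return 0 ;;
      failed|cancelled) echo "$status"; return 1 ;;
    esac
    sleep "$interval"
  done
  echo "timeout"
  return 2
}
```

Usage: `poll_crawl "$CRAWL_ID" 5 60` waits up to about five minutes. If the real response uses different field names or status values, adjust `crawl_status` and the `case` arms accordingly.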
 ## Map
 ```bash
 curl https://api.webclaw.io/v2/map \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com"
  }'
 ```
 ## Search
 ```bash
 curl https://api.webclaw.io/v2/search \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "site:docs.rs tokio tutorial",
    "limit": 5
  }'
 ```
 Compatibility routes are meant to reduce migration friction. For new projects, prefer the native `/v1` API because it exposes webclaw-specific options more directly.
--- a/examples/html-to-markdown-rag/README.md
+++ b/examples/html-to-markdown-rag/README.md
@@ -0,0 +1,50 @@
 # HTML to Markdown for RAG
 Turn web pages into clean markdown or compact LLM text before chunking, embedding, or passing the page to an agent.
 ## CLI
 ```bash
 # Clean markdown with headings, links, and readable structure.
 webclaw https://docs.anthropic.com --format markdown > page.md
 # Token-optimized output for direct LLM context.
 webclaw https://docs.anthropic.com --format llm > page.txt
 # Keep the main article content and remove common navigation/footer noise.
 webclaw https://docs.anthropic.com \
  --only-main-content \
  --format markdown \
  > page.md
 ```
 ## Batch a URL List
 Create `urls.txt`:
 ```text
 https://docs.anthropic.com/
 https://docs.anthropic.com/en/docs/claude-code
 https://docs.anthropic.com/en/api/messages
 ```
 Run:
 ```bash
 webclaw --urls-file urls.txt --format llm > corpus.txt
 ```
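If the chunking step expects one document per page rather than a single `corpus.txt`, a per-URL loop is one option. This is a sketch: `url_to_filename` and `scrape_each` are hypothetical helpers, and the filename scheme is an arbitrary choice. It only uses the single-URL invocation shown in the CLI section.

```shell
# Turn a URL into a filesystem-safe name: drop the scheme and trailing
# slashes, then replace anything outside [A-Za-z0-9._-] with underscores.
url_to_filename() {
  printf '%s' "$1" \
    | sed -E -e 's#^[a-zA-Z]+://##' -e 's#/+$##' -e 's#[^A-Za-z0-9._-]+#_#g'
}

# scrape_each URL_LIST OUTPUT_DIR
# Writes one .txt file per non-empty line of URL_LIST.
scrape_each() {
  local list="$1" outdir="$2" url
  mkdir -p "$outdir"
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue
    # Redirect stdin so the command cannot consume the rest of the list.
    webclaw "$url" --format llm < /dev/null > "$outdir/$(url_to_filename "$url").txt"
  done < "$list"
}
```

Usage: `scrape_each urls.txt corpus/`. This runs sequentially, so it trades the speed of a batch run for simpler per-page bookkeeping.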
 ## Hosted API
 ```bash
 curl https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.anthropic.com",
    "formats": ["markdown", "llm"],
    "only_main_content": true
  }'
 ```
 Use `markdown` when humans may inspect the output. Use `llm` when the next step is chunking, embedding, summarization, or prompt context.
--- a/examples/mcp-web-scraping/README.md
+++ b/examples/mcp-web-scraping/README.md
@@ -0,0 +1,44 @@
 # MCP Web Scraping
 Use webclaw as a local MCP server so Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, or another MCP client can fetch clean web context.
 ## Install
 ```bash
 npx create-webclaw
 ```
 The installer detects supported MCP clients and can write the config for you.
 ## Manual Config
 ```json
 {
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp",
      "env": {
        "WEBCLAW_API_KEY": "wc_your_key"
      }
    }
  }
 }
 ```
 `WEBCLAW_API_KEY` is optional for local extraction, so the `env` block can be omitted. Add it when you want cloud fallback for protected sites, JS rendering, hosted search, or hosted research.
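Some MCP clients launch the configured command directly rather than through a shell, in which case a leading `~` may not be expanded. If the server fails to start, an absolute path is worth trying. The sketch below prints a config with the path resolved; `print_mcp_config` is a hypothetical helper, and it assumes the binary lives at the location used in the manual config.

```shell
# Print an MCP config with an absolute binary path.
# Assumes the default location ~/.webclaw/webclaw-mcp; pass another path
# as the first argument if your install differs.
print_mcp_config() {
  local bin="${1:-$HOME/.webclaw/webclaw-mcp}"
  cat <<EOF
{
  "mcpServers": {
    "webclaw": {
      "command": "$bin"
    }
  }
}
EOF
}
```

Usage: run `print_mcp_config` and paste the output into your client's MCP config file. The `env` block is left out here because the key is optional; paths containing quotes or backslashes would need JSON escaping that this sketch does not do.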
 ## Example Prompts
 ```text
 Scrape https://docs.rs/tokio and summarize the parts about task spawning.
 ```
 ```text
 Crawl https://docs.example.com up to depth 2 and return the pages most relevant to authentication.
 ```
 ```text
 Extract the pricing tiers from https://example.com/pricing as JSON with fields name, price, limits, and features.
 ```
 The MCP server exposes tools for scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, and vertical extractors.
--- a/examples/proxy-backed-crawling/README.md
+++ b/examples/proxy-backed-crawling/README.md
@@ -0,0 +1,53 @@
 # Proxy-Backed Crawling
 Route requests through proxies when you need to spread a crawl across multiple IP addresses. webclaw accepts a single proxy with `--proxy` or a pool of proxies with `--proxy-file`.
 ## Single Proxy
 ```bash
 webclaw https://example.com \
  --proxy http://user:pass@proxy.example.com:8080 \
  --format markdown
 ```
 SOCKS5 is supported too:
 ```bash
 webclaw https://example.com \
  --proxy socks5://proxy.example.com:1080 \
  --format markdown
 ```
 ## Proxy Pool
 Create `proxies.txt` with one proxy per line:
 ```text
 http://user:pass@proxy-1.example.com:8080
 http://user:pass@proxy-2.example.com:8080
 http://user:pass@proxy-3.example.com:8080
 ```
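A malformed line in the pool is easy to miss, so a quick format check before a long crawl can save time. This is a sketch, not webclaw's own validation: `check_proxy_file` is a hypothetical helper, and it assumes every entry uses the `scheme://[user:pass@]host:port` form shown in this guide with an `http`, `https`, or `socks5` scheme. Other formats webclaw may accept would be flagged too.

```shell
# check_proxy_file FILE
# Returns 0 if every non-blank line looks like scheme://[user:pass@]host:port,
# otherwise prints the offending lines (with line numbers) to stderr and returns 1.
check_proxy_file() {
  local file="$1" pattern bad
  pattern='^(https?|socks5)://([^:@/[:space:]]+:[^@/[:space:]]+@)?[^:@/[:space:]]+:[0-9]+$'
  # grep -vn lists non-matching lines as "lineno:content"; blank lines are dropped.
  bad=$(grep -vnE "$pattern" "$file" | grep -vE '^[0-9]+:[[:space:]]*$' || true)
  if [ -n "$bad" ]; then
    printf 'Unexpected proxy entries:\n%s\n' "$bad" >&2
    return 1
  fi
}
```

Usage: `check_proxy_file proxies.txt && echo "proxy file looks fine"`.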
 Run a crawl with controlled concurrency:
 ```bash
 webclaw https://docs.example.com \
  --crawl \
  --depth 2 \
  --max-pages 100 \
  --concurrency 10 \
  --delay 200 \
  --proxy-file proxies.txt \
  --format markdown
 ```
 ## Batch URLs
 ```bash
 webclaw --urls-file urls.txt \
  --proxy-file proxies.txt \
  --concurrency 10 \
  --format json
 ```
 Proxy rotation helps with throughput and IP reputation. It does not replace request fingerprinting, JS rendering, or challenge handling for heavily protected sites. For those, use hosted cloud mode: set `WEBCLAW_API_KEY` and pass `--cloud`.