docs: add workflow examples

2026-07-24 07:31:01 +02:00 · 2026-05-18 18:56:00 +02:00 · 2026-05-18 18:56:00 +02:00 · aab51bea91
commit aab51bea91
parent b75b768ec3
7 changed files with 281 additions and 0 deletions
--- a/examples/README.md
+++ b/examples/README.md
@@ -2,6 +2,14 @@

 Practical examples showing what webclaw can do. Each example is a self-contained command you can run immediately.

+## Workflow Guides
+
+- [HTML to Markdown for RAG](./html-to-markdown-rag/) turns web pages into markdown or compact LLM text for retrieval pipelines.
+- [Firecrawl-Compatible API](./firecrawl-compatible-api/) shows the `/v2` compatibility routes for scrape, crawl, map, and search.
+- [MCP Web Scraping](./mcp-web-scraping/) connects webclaw to MCP clients such as Claude Code, Claude Desktop, Cursor, and Codex CLI.
+- [Proxy-Backed Crawling](./proxy-backed-crawling/) shows single-proxy and proxy-pool crawling from the CLI.
+- [Cloudflare Diagnostics](./cloudflare-diagnostics/) gives a reproducible checklist for blocked or empty protected-site results.
+
 ## Basic Extraction

 ```bash
--- a/examples/cloudflare-diagnostics/README.md
+++ b/examples/cloudflare-diagnostics/README.md
@@ -0,0 +1,58 @@
+# Cloudflare Diagnostics
+
+Use this checklist when a page works in the browser but fails from a scraper, returns a challenge page, or produces empty extracted content.
+
+## 1. Save the Raw Response
+
+```bash
+webclaw https://protected.example.com --raw-html > raw.html
+```
+
+Inspect `raw.html` for challenge-page text, blocked-request messages, an empty shell, or application HTML that needs JavaScript rendering.
+
+## 2. Compare Extracted Formats
+
+```bash
+webclaw https://protected.example.com --format markdown > page.md
+webclaw https://protected.example.com --format json > page.json
+webclaw https://protected.example.com --format llm > page.txt
+```
+
+If the raw HTML has content but the markdown output is empty, tune extraction with selectors:
+
+```bash
+webclaw https://protected.example.com \
+  --include "main, article, [role=main]" \
+  --exclude "nav, footer, aside, .cookie-banner" \
+  --format markdown
+```
+
+## 3. Try Another Browser Fingerprint
+
+```bash
+webclaw https://protected.example.com --browser firefox --format markdown
+webclaw https://protected.example.com --browser random --format markdown
+```
+
+## 4. Use Cloud Fallback
+
+```bash
+export WEBCLAW_API_KEY=wc_your_key
+
+webclaw https://protected.example.com --cloud --format markdown
+```
+
+Cloud mode can use hosted routing, JS rendering, and protected-site handling that are not part of the fully local open-source path.
+
+## 5. Keep a Reproducible Report
+
+When reporting a problem, include:
+
+- target URL
+- command used
+- selected format
+- whether `--raw-html` returned a challenge or normal page HTML
+- whether `--browser firefox` changed the result
+- whether cloud mode changed the result
+
+Remove cookies, tokens, customer data, and private URLs before sharing logs.
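The inspection in step 1 can be partly scripted. The following is a minimal sketch, not part of webclaw: `classify_raw_html` is a hypothetical helper, and both the marker strings and the 512-byte threshold are assumptions about what challenge pages and empty shells tend to look like.

```bash
#!/usr/bin/env bash
# Heuristic triage for a saved response such as raw.html from step 1.
# Marker strings and the size threshold are assumptions, not webclaw behavior.

classify_raw_html() {
  local file="$1"
  local size

  # Missing or zero-length file: nothing was returned.
  if [ ! -s "$file" ]; then
    echo "empty"
    return
  fi

  # Text that commonly appears on challenge or block pages (assumed list).
  if grep -qiE 'just a moment|cf-chl|challenge-platform|attention required' "$file"; then
    echo "challenge"
    return
  fi

  # Very small documents are often JavaScript shells with no rendered content.
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -lt 512 ]; then
    echo "shell"
  else
    echo "content"
  fi
}
```

Running `classify_raw_html raw.html` prints `empty`, `challenge`, `shell`, or `content`. The result is only a hint: a normal page that happens to quote one of the marker strings will be reported as `challenge`, so confirm by reading the file.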
--- a/examples/firecrawl-compatible-api/README.md
+++ b/examples/firecrawl-compatible-api/README.md
@@ -0,0 +1,60 @@
+# Firecrawl-Compatible API
+
+webclaw exposes Firecrawl-compatible v2 routes for teams migrating existing scrape, crawl, map, or search calls.
+
+## Scrape
+
+```bash
+curl https://api.webclaw.io/v2/scrape \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com",
+    "formats": ["markdown"]
+  }'
+```
+
+## Crawl
+
+```bash
+curl https://api.webclaw.io/v2/crawl \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://docs.example.com",
+    "limit": 25,
+    "maxDepth": 2
+  }'
+```
+
+Poll the returned crawl id:
+
+```bash
+curl https://api.webclaw.io/v2/crawl/$CRAWL_ID \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY"
+```
+
+## Map
+
+```bash
+curl https://api.webclaw.io/v2/map \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://docs.example.com"
+  }'
+```
+
+## Search
+
+```bash
+curl https://api.webclaw.io/v2/search \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "query": "site:docs.rs tokio tutorial",
+    "limit": 5
+  }'
+```
+
+Compatibility routes are meant to reduce migration friction. For new projects, prefer the native `/v1` API because it exposes webclaw-specific options more directly.
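The polling step above can be wrapped in a loop. This is a minimal sketch under stated assumptions: the status response is assumed to be JSON with a top-level `"status"` field whose terminal values are `completed` and `failed` (Firecrawl's convention, not confirmed here for webclaw), and `fetch_crawl_status`, `extract_status`, and `poll_crawl` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Poll a crawl job until it reaches a terminal state or attempts run out.
# The response shape ("status": "completed" | "failed") is an assumption.

# Fetch the raw status JSON for a crawl id (same request as the curl above).
fetch_crawl_status() {
  curl -sS "https://api.webclaw.io/v2/crawl/$1" \
    -H "Authorization: Bearer $WEBCLAW_API_KEY"
}

# Print the value of the first "status" string field found on stdin, without jq.
extract_status() {
  grep -o '"status"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 \
    | sed 's/.*"\([^"]*\)"$/\1/'
}

# poll_crawl ID [MAX_ATTEMPTS] [SLEEP_SECONDS]
# Prints completed, failed, or timeout; returns 0, 1, or 2 respectively.
poll_crawl() {
  local id="$1"
  local max="${2:-30}"
  local pause="${3:-2}"
  local attempt=1
  local status

  while [ "$attempt" -le "$max" ]; do
    status=$(fetch_crawl_status "$id" | extract_status)
    case "$status" in
      completed) echo "completed"; return 0 ;;
      failed)    echo "failed";    return 1 ;;
    esac
    sleep "$pause"
    attempt=$((attempt + 1))
  done

  echo "timeout"
  return 2
}
```

A call such as `poll_crawl "$CRAWL_ID" 60 5` checks up to 60 times, five seconds apart. Any status other than the two terminal values is treated as "still running", so an unexpected response ends in `timeout` rather than a false success.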
--- a/examples/html-to-markdown-rag/README.md
+++ b/examples/html-to-markdown-rag/README.md
@@ -0,0 +1,50 @@
+# HTML to Markdown for RAG
+
+Turn web pages into clean markdown or compact LLM text before chunking, embedding, or passing the page to an agent.
+
+## CLI
+
+```bash
+# Clean markdown with headings, links, and readable structure.
+webclaw https://docs.anthropic.com --format markdown > page.md
+
+# Token-optimized output for direct LLM context.
+webclaw https://docs.anthropic.com --format llm > page.txt
+
+# Keep the main article content and remove common navigation/footer noise.
+webclaw https://docs.anthropic.com \
+  --only-main-content \
+  --format markdown \
+  > page.md
+```
+
+## Batch a URL List
+
+Create `urls.txt`:
+
+```text
+https://docs.anthropic.com/
+https://docs.anthropic.com/en/docs/claude-code
+https://docs.anthropic.com/en/api/messages
+```
+
+Run:
+
+```bash
+webclaw --urls-file urls.txt --format llm > corpus.txt
+```
+
+## Hosted API
+
+```bash
+curl https://api.webclaw.io/v1/scrape \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://docs.anthropic.com",
+    "formats": ["markdown", "llm"],
+    "only_main_content": true
+  }'
+```
+
+Use `markdown` when humans may inspect the output. Use `llm` when the next step is chunking, embedding, summarization, or prompt context.
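The batch command above writes every page into one `corpus.txt`. When a retrieval pipeline needs one document per source URL, a per-URL loop is an alternative. This is a minimal sketch that uses only the `webclaw <url> --format llm` invocation shown above; `url_to_slug` and `fetch_each` are hypothetical helpers, and the file-naming scheme is an assumption.

```bash
#!/usr/bin/env bash
# Write each URL from a list into its own file so chunks can be traced
# back to their source page. Helper names are illustrative only.

# Turn a URL into a filesystem-safe name: drop the scheme and trailing
# slashes, then replace every other unsafe character with "_".
url_to_slug() {
  printf '%s\n' "$1" \
    | sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' -e 's|/*$||' \
    | tr -c 'A-Za-z0-9._\n-' '_'
}

# fetch_each URL_LIST OUT_DIR
fetch_each() {
  local list="$1"
  local out_dir="$2"
  local url

  mkdir -p "$out_dir"
  while IFS= read -r url || [ -n "$url" ]; do
    # Skip blank lines in the list.
    [ -z "$url" ] && continue
    # stdin is redirected so the command cannot consume the URL list.
    webclaw "$url" --format llm < /dev/null > "$out_dir/$(url_to_slug "$url").txt"
  done < "$list"
}
```

With the `urls.txt` from above, `fetch_each urls.txt corpus/` would produce files such as `corpus/docs.anthropic.com_en_api_messages.txt`. Two URLs that differ only in characters mapped to `_` would collide, so a real pipeline may want a hash suffix.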
--- a/examples/mcp-web-scraping/README.md
+++ b/examples/mcp-web-scraping/README.md
@@ -0,0 +1,44 @@
+# MCP Web Scraping
+
+Use webclaw as a local MCP server so Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, or another MCP client can fetch clean web context.
+
+## Install
+
+```bash
+npx create-webclaw
+```
+
+The installer detects supported MCP clients and can write the config for you.
+
+## Manual Config
+
+```json
+{
+  "mcpServers": {
+    "webclaw": {
+      "command": "~/.webclaw/webclaw-mcp",
+      "env": {
+        "WEBCLAW_API_KEY": "wc_your_key"
+      }
+    }
+  }
+}
+```
+
+`WEBCLAW_API_KEY` is optional for local extraction. Add it when you want cloud fallback for protected sites, JS rendering, hosted search, or hosted research.
+
+## Example Prompts
+
+```text
+Scrape https://docs.rs/tokio and summarize the parts about task spawning.
+```
+
+```text
+Crawl https://docs.example.com up to depth 2 and return the pages most relevant to authentication.
+```
+
+```text
+Extract the pricing tiers from https://example.com/pricing as JSON with fields name, price, limits, and features.
+```
+
+The MCP server exposes tools for scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, and vertical extractors.
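The manual config above points `command` at `~/.webclaw/webclaw-mcp`. Whether a given MCP client expands `~` is worth checking, since a client that spawns the command without a shell would pass the tilde through literally. The following minimal sketch prints the same entry with an absolute path. `render_mcp_config` is a hypothetical helper, the default location is taken from the manual config, and values are inserted without JSON escaping, so it suits simple paths and keys only.

```bash
#!/usr/bin/env bash
# Print an mcpServers entry with an absolute binary path.
# render_mcp_config [BINARY_PATH] [API_KEY]
# The "env" block is emitted only when an API key is supplied.

render_mcp_config() {
  local bin="${1:-$HOME/.webclaw/webclaw-mcp}"
  local key="${2:-}"

  printf '{\n  "mcpServers": {\n    "webclaw": {\n      "command": "%s"' "$bin"
  if [ -n "$key" ]; then
    printf ',\n      "env": {\n        "WEBCLAW_API_KEY": "%s"\n      }' "$key"
  fi
  printf '\n    }\n  }\n}\n'
}
```

`render_mcp_config` with no arguments uses `$HOME/.webclaw/webclaw-mcp` and omits the key, matching the local-extraction case. `render_mcp_config "$HOME/.webclaw/webclaw-mcp" "$WEBCLAW_API_KEY"` adds the `env` block for cloud fallback.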
--- a/examples/proxy-backed-crawling/README.md
+++ b/examples/proxy-backed-crawling/README.md
@@ -0,0 +1,53 @@
+# Proxy-Backed Crawling
+
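+Route requests through a proxy when a crawl needs a different egress IP, or rotate across a pool to spread requests over several. webclaw accepts either a single proxy URL (`--proxy`) or a file with one proxy per line (`--proxy-file`).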
+
+## Single Proxy
+
+```bash
+webclaw https://example.com \
+  --proxy http://user:pass@proxy.example.com:8080 \
+  --format markdown
+```
+
+SOCKS5 is supported too:
+
+```bash
+webclaw https://example.com \
+  --proxy socks5://proxy.example.com:1080 \
+  --format markdown
+```
+
+## Proxy Pool
+
+Create `proxies.txt` with one proxy per line:
+
+```text
+http://user:pass@proxy-1.example.com:8080
+http://user:pass@proxy-2.example.com:8080
+http://user:pass@proxy-3.example.com:8080
+```
+
+Run a crawl with controlled concurrency:
+
+```bash
+webclaw https://docs.example.com \
+  --crawl \
+  --depth 2 \
+  --max-pages 100 \
+  --concurrency 10 \
+  --delay 200 \
+  --proxy-file proxies.txt \
+  --format markdown
+```
+
+## Batch URLs
+
+```bash
+webclaw --urls-file urls.txt \
+  --proxy-file proxies.txt \
+  --concurrency 10 \
+  --format json
+```
+
+Proxy rotation helps with throughput and IP reputation. It does not replace request fingerprinting, JS rendering, or challenge handling for heavily protected sites. For those, use hosted cloud mode with `WEBCLAW_API_KEY`.
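A malformed line in `proxies.txt` is easy to miss before a long crawl. The following is a minimal sketch of a pre-flight check, not a webclaw feature: `check_proxy_file` is a hypothetical helper, and the accepted shape (`scheme://[user:pass@]host:port` with `http`, `https`, or `socks5`) is an assumption based on the examples above.

```bash
#!/usr/bin/env bash
# Report lines in a proxy file that do not look like
# scheme://[user:pass@]host:port. Accepted schemes are an assumption.

# check_proxy_file FILE
# Prints each suspicious line to stderr and the count of such lines to stdout.
check_proxy_file() {
  local file="$1"
  local n=0
  local bad=0
  local line
  local pattern='^(https?|socks5)://([^:@/[:space:]]+:[^@/[:space:]]+@)?[^:@/[:space:]]+:[0-9]{1,5}$'

  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    # Blank lines are ignored rather than reported.
    [ -z "$line" ] && continue
    if ! printf '%s\n' "$line" | grep -Eq "$pattern"; then
      echo "line $n: $line" >&2
      bad=$((bad + 1))
    fi
  done < "$file"

  echo "$bad"
}
```

For example, `[ "$(check_proxy_file proxies.txt)" -eq 0 ] || exit 1` stops a script before the crawl starts. The check is purely syntactic and does not test whether a proxy is reachable.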