From aab51bea919a1509a71767bd3e274bae0a6006d8 Mon Sep 17 00:00:00 2001
From: Valerio
Date: Mon, 18 May 2026 18:56:00 +0200
Subject: [PATCH] docs: add workflow examples

---
 README.md                                   |  8 +++
 examples/README.md                          |  8 +++
 examples/cloudflare-diagnostics/README.md   | 58 ++++++++++++++++++++
 examples/firecrawl-compatible-api/README.md | 60 +++++++++++++++++++++
 examples/html-to-markdown-rag/README.md     | 50 +++++++++++++++++
 examples/mcp-web-scraping/README.md         | 44 +++++++++++++++
 examples/proxy-backed-crawling/README.md    | 53 ++++++++++++++++++
 7 files changed, 281 insertions(+)
 create mode 100644 examples/cloudflare-diagnostics/README.md
 create mode 100644 examples/firecrawl-compatible-api/README.md
 create mode 100644 examples/html-to-markdown-rag/README.md
 create mode 100644 examples/mcp-web-scraping/README.md
 create mode 100644 examples/proxy-backed-crawling/README.md

diff --git a/README.md b/README.md
index 06bed01..1ca7380 100644
--- a/README.md
+++ b/README.md
@@ -137,6 +137,14 @@ webclaw https://example.com \
 webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
 ```
 
+### Workflow examples
+
+- [HTML to Markdown for RAG](examples/html-to-markdown-rag/)
+- [Firecrawl-compatible API](examples/firecrawl-compatible-api/)
+- [MCP web scraping](examples/mcp-web-scraping/)
+- [Proxy-backed crawling](examples/proxy-backed-crawling/)
+- [Cloudflare diagnostics](examples/cloudflare-diagnostics/)
+
 ### Extract brand assets
 
 ```bash
diff --git a/examples/README.md b/examples/README.md
index 142f132..0967d74 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -2,6 +2,14 @@
 
 Practical examples showing what webclaw can do. Each example is a self-contained command you can run immediately.
 
+## Workflow Guides
+
+- [HTML to Markdown for RAG](./html-to-markdown-rag/) turns web pages into markdown or compact LLM text for retrieval pipelines.
+- [Firecrawl-Compatible API](./firecrawl-compatible-api/) shows the `/v2` compatibility routes for scrape, crawl, map, and search.
+- [MCP Web Scraping](./mcp-web-scraping/) connects webclaw to MCP clients such as Claude Code, Claude Desktop, Cursor, and Codex CLI.
+- [Proxy-Backed Crawling](./proxy-backed-crawling/) shows single-proxy and proxy-pool crawling from the CLI.
+- [Cloudflare Diagnostics](./cloudflare-diagnostics/) gives a reproducible checklist for blocked or empty protected-site results.
+
 ## Basic Extraction
 
 ```bash
diff --git a/examples/cloudflare-diagnostics/README.md b/examples/cloudflare-diagnostics/README.md
new file mode 100644
index 0000000..e8fd197
--- /dev/null
+++ b/examples/cloudflare-diagnostics/README.md
@@ -0,0 +1,58 @@
+# Cloudflare Diagnostics
+
+Use this checklist when a page works in the browser but fails from a scraper, returns a challenge page, or produces empty extracted content.
+
+## 1. Save the Raw Response
+
+```bash
+webclaw https://protected.example.com --raw-html > raw.html
+```
+
+Inspect `raw.html` for challenge copy, blocked request text, empty shells, or application HTML that needs JavaScript rendering.
+
+## 2. Compare Extracted Formats
+
+```bash
+webclaw https://protected.example.com --format markdown > page.md
+webclaw https://protected.example.com --format json > page.json
+webclaw https://protected.example.com --format llm > page.txt
+```
+
+If raw HTML has content but markdown is empty, tune extraction with selectors:
+
+```bash
+webclaw https://protected.example.com \
+  --include "main, article, [role=main]" \
+  --exclude "nav, footer, aside, .cookie-banner" \
+  --format markdown
+```
+
+## 3. Try Another Browser Fingerprint
+
+```bash
+webclaw https://protected.example.com --browser firefox --format markdown
+webclaw https://protected.example.com --browser random --format markdown
+```
+
+## 4. Use Cloud Fallback
+
+```bash
+export WEBCLAW_API_KEY=wc_your_key
+
+webclaw https://protected.example.com --cloud --format markdown
+```
+
+Cloud mode can use hosted routing, JS rendering, and protected-site handling that are not part of the fully local open-source path.
+
+## 5. Keep a Reproducible Report
+
+When reporting a problem, include:
+
+- target URL
+- command used
+- selected format
+- whether `--raw-html` returned a challenge or normal page HTML
+- whether `--browser firefox` changed the result
+- whether cloud mode changed the result
+
+Remove cookies, tokens, customer data, and private URLs before sharing logs.
diff --git a/examples/firecrawl-compatible-api/README.md b/examples/firecrawl-compatible-api/README.md
new file mode 100644
index 0000000..e6c3f4b
--- /dev/null
+++ b/examples/firecrawl-compatible-api/README.md
@@ -0,0 +1,60 @@
+# Firecrawl-Compatible API
+
+webclaw exposes Firecrawl-compatible v2 routes for teams migrating existing scrape, crawl, map, or search calls.
+
+## Scrape
+
+```bash
+curl https://api.webclaw.io/v2/scrape \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com",
+    "formats": ["markdown"]
+  }'
+```
+
+## Crawl
+
+```bash
+curl https://api.webclaw.io/v2/crawl \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://docs.example.com",
+    "limit": 25,
+    "maxDepth": 2
+  }'
+```
+
+Poll the returned crawl id:
+
+```bash
+curl https://api.webclaw.io/v2/crawl/$CRAWL_ID \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY"
+```
+
+## Map
+
+```bash
+curl https://api.webclaw.io/v2/map \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://docs.example.com"
+  }'
+```
+
+## Search
+
+```bash
+curl https://api.webclaw.io/v2/search \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "query": "site:docs.rs tokio tutorial",
+    "limit": 5
+  }'
+```
+
+Compatibility routes are meant to reduce migration friction. For new projects, prefer the native `/v1` API because it exposes webclaw-specific options more directly.
diff --git a/examples/html-to-markdown-rag/README.md b/examples/html-to-markdown-rag/README.md
new file mode 100644
index 0000000..d4c29b3
--- /dev/null
+++ b/examples/html-to-markdown-rag/README.md
@@ -0,0 +1,50 @@
+# HTML to Markdown for RAG
+
+Turn web pages into clean markdown or compact LLM text before chunking, embedding, or passing the page to an agent.
+
+## CLI
+
+```bash
+# Clean markdown with headings, links, and readable structure.
+webclaw https://docs.anthropic.com --format markdown > page.md
+
+# Token-optimized output for direct LLM context.
+webclaw https://docs.anthropic.com --format llm > page.txt
+
+# Keep the main article content and remove common navigation/footer noise.
+webclaw https://docs.anthropic.com \
+  --only-main-content \
+  --format markdown \
+  > page.md
+```
+
+## Batch a URL List
+
+Create `urls.txt`:
+
+```text
+https://docs.anthropic.com/
+https://docs.anthropic.com/en/docs/claude-code
+https://docs.anthropic.com/en/api/messages
+```
+
+Run:
+
+```bash
+webclaw --urls-file urls.txt --format llm > corpus.txt
+```
+
+## Hosted API
+
+```bash
+curl https://api.webclaw.io/v1/scrape \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://docs.anthropic.com",
+    "formats": ["markdown", "llm"],
+    "only_main_content": true
+  }'
+```
+
+Use `markdown` when humans may inspect the output. Use `llm` when the next step is chunking, embedding, summarization, or prompt context.
diff --git a/examples/mcp-web-scraping/README.md b/examples/mcp-web-scraping/README.md
new file mode 100644
index 0000000..0663670
--- /dev/null
+++ b/examples/mcp-web-scraping/README.md
@@ -0,0 +1,44 @@
+# MCP Web Scraping
+
+Use webclaw as a local MCP server so Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, or another MCP client can fetch clean web context.
+
+## Install
+
+```bash
+npx create-webclaw
+```
+
+The installer detects supported MCP clients and can write the config for you.
+
+## Manual Config
+
+```json
+{
+  "mcpServers": {
+    "webclaw": {
+      "command": "~/.webclaw/webclaw-mcp",
+      "env": {
+        "WEBCLAW_API_KEY": "wc_your_key"
+      }
+    }
+  }
+}
+```
+
+`WEBCLAW_API_KEY` is optional for local extraction. Add it when you want cloud fallback for protected sites, JS rendering, hosted search, or hosted research.
+
+## Example Prompts
+
+```text
+Scrape https://docs.rs/tokio and summarize the parts about task spawning.
+```
+
+```text
+Crawl https://docs.example.com up to depth 2 and return the pages most relevant to authentication.
+```
+
+```text
+Extract the pricing tiers from https://example.com/pricing as JSON with fields name, price, limits, and features.
+```
+
+The MCP server exposes tools for scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, and vertical extractors.
diff --git a/examples/proxy-backed-crawling/README.md b/examples/proxy-backed-crawling/README.md
new file mode 100644
index 0000000..fd49be9
--- /dev/null
+++ b/examples/proxy-backed-crawling/README.md
@@ -0,0 +1,53 @@
+# Proxy-Backed Crawling
+
+Use proxy rotation when you need to distribute a crawl across a proxy pool. webclaw supports a single proxy or a proxy file.
+
+## Single Proxy
+
+```bash
+webclaw https://example.com \
+  --proxy http://user:pass@proxy.example.com:8080 \
+  --format markdown
+```
+
+SOCKS5 is supported too:
+
+```bash
+webclaw https://example.com \
+  --proxy socks5://proxy.example.com:1080 \
+  --format markdown
+```
+
+## Proxy Pool
+
+Create `proxies.txt` with one proxy per line:
+
+```text
+http://user:pass@proxy-1.example.com:8080
+http://user:pass@proxy-2.example.com:8080
+http://user:pass@proxy-3.example.com:8080
+```
+
+Run a crawl with controlled concurrency:
+
+```bash
+webclaw https://docs.example.com \
+  --crawl \
+  --depth 2 \
+  --max-pages 100 \
+  --concurrency 10 \
+  --delay 200 \
+  --proxy-file proxies.txt \
+  --format markdown
+```
+
+## Batch URLs
+
+```bash
+webclaw --urls-file urls.txt \
+  --proxy-file proxies.txt \
+  --concurrency 10 \
+  --format json
+```
+
+Proxy rotation helps with throughput and IP reputation. It does not replace request fingerprinting, JS rendering, or challenge handling for heavily protected sites. For those, use hosted cloud mode with `WEBCLAW_API_KEY`.
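
The one-proxy-per-line layout of `proxies.txt` described in the proxy pool section can be sanity-checked before a long crawl. The following is a minimal sketch and is not part of webclaw: `check_proxy_line` and `check_proxy_file` are hypothetical helpers, and the accepted shapes (`http://` or `socks5://`, optional `user:pass@`, then `host:port`) are an assumption taken from the examples in the patch, not a statement of what webclaw's own parser accepts. Adjust the pattern if your proxies use other schemes.

```bash
# Hypothetical pre-flight check for a proxy file; not part of webclaw.
# Assumes one proxy URL per line, shaped like the examples in the patch:
#   scheme://[user:pass@]host:port   with scheme http or socks5.

# Succeeds (exit 0) if the argument looks like a proxy URL, fails otherwise.
check_proxy_line() {
  printf '%s\n' "$1" | grep -Eq \
    '^(http|socks5)://([^:@/[:space:]]+:[^@/[:space:]]+@)?[^:@/[:space:]]+:[0-9]+$'
}

# Prints every malformed line (with its line number) to stderr.
# Returns 1 if at least one malformed line was found, 0 otherwise.
check_proxy_file() {
  proxy_file="$1"
  line_no=0
  bad=0
  while IFS= read -r line || [ -n "$line" ]; do
    line_no=$((line_no + 1))
    [ -z "$line" ] && continue          # blank lines are ignored
    if ! check_proxy_line "$line"; then
      echo "line $line_no: $line" >&2
      bad=1
    fi
  done < "$proxy_file"
  return "$bad"
}
```

With these helpers defined, `check_proxy_file proxies.txt && webclaw https://docs.example.com --crawl --proxy-file proxies.txt` would skip the crawl when a line is malformed instead of failing partway through.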