<p align="center">
  <a href="https://webclaw.io">
    <img src=".github/banner.png" alt="webclaw" width="760" />
  </a>
</p>

<h1 align="center">webclaw</h1>

<p align="center">
  <strong>Turn websites into clean markdown, JSON, and LLM-ready context.</strong><br />
  <sub>CLI, MCP server, REST API, and SDKs for AI agents and RAG pipelines.</sub>
</p>

<p align="center">
  <a href="https://github.com/0xMassi/webclaw/stargazers"><img src="https://shieldcn.dev/github/stars/0xMassi/webclaw.svg?variant=branded&logo=github" alt="Stars" /></a>
  <a href="https://github.com/0xMassi/webclaw/releases"><img src="https://shieldcn.dev/github/tag/0xMassi/webclaw.svg?variant=branded&logo=rust" alt="Version" /></a>
  <a href="https://github.com/0xMassi/webclaw/blob/main/LICENSE"><img src="https://shieldcn.dev/github/license/0xMassi/webclaw.svg?variant=branded" alt="License" /></a>
  <a href="https://www.npmjs.com/package/create-webclaw"><img src="https://shieldcn.dev/npm/dt/create-webclaw.svg?variant=branded" alt="npm installs" /></a>
</p>

<p align="center">
  <a href="https://discord.gg/KDfd48EpnW"><img src="https://shieldcn.dev/badge/Discord-Join.svg?variant=branded&logo=discord" alt="Discord" /></a>
  <a href="https://x.com/webclaw_io"><img src="https://shieldcn.dev/badge/Follow-@webclaw__io.svg?variant=branded&logo=x" alt="X / Twitter" /></a>
  <a href="https://webclaw.io"><img src="https://shieldcn.dev/badge/Hosted-webclaw.io.svg?variant=branded&logo=safari" alt="Hosted webclaw" /></a>
  <a href="https://webclaw.io/docs"><img src="https://shieldcn.dev/badge/Docs-Read.svg?variant=branded&logo=readthedocs" alt="Docs" /></a>
</p>

<p align="center">
  <img src="assets/demo.gif" alt="webclaw extracting clean markdown from a page" width="760" />
</p>

---

Most web scraping tools give your agent one of two bad outputs:

- a blocked page, login wall, or empty app shell
- raw HTML full of nav, scripts, styling, ads, and duplicated boilerplate

[webclaw.io](https://webclaw.io) is the hosted web extraction API for webclaw. This repo contains the open-source CLI, MCP server, extraction engine, and self-hostable server.

webclaw turns a URL into clean content your tools can actually use.

```bash
webclaw https://example.com --format markdown
```

```md
# Example Domain

This domain is for use in illustrative examples in documents.

You may use this domain in literature without prior coordination or asking for permission.
```

Use it from the terminal, wire it into Claude/Cursor through MCP, call the hosted API from your app, or self-host the OSS server.

---

## Install

### Agent setup

The fastest way to connect webclaw to Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, and other MCP-compatible tools:

```bash
npx create-webclaw
```

The installer detects supported clients and configures the MCP server for you.

### Homebrew

```bash
brew tap 0xMassi/webclaw
brew install webclaw
```

### Prebuilt binaries

Download macOS and Linux binaries from [GitHub Releases](https://github.com/0xMassi/webclaw/releases).

### Docker

```bash
docker run --rm ghcr.io/0xmassi/webclaw https://example.com
```

### Cargo

```bash
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-mcp
```

If building from source fails because native build tools are missing, install the platform prerequisites:

| OS | Command |
| --- | --- |
| Debian / Ubuntu | `sudo apt install -y pkg-config libssl-dev cmake clang git build-essential` |
| Fedora / RHEL | `sudo dnf install -y pkg-config openssl-devel cmake clang git make gcc` |
| Arch | `sudo pacman -S pkg-config openssl cmake clang git base-devel` |
| macOS | `xcode-select --install` |

---

## Quick Start

### Scrape one page

```bash
webclaw https://stripe.com --format markdown
```

### Return LLM-optimized text

```bash
webclaw https://docs.anthropic.com --format llm
```

### Keep only the main content

```bash
webclaw https://example.com/blog/post --only-main-content
```

### Include or exclude selectors

```bash
webclaw https://example.com \
  --include "article, main, .content" \
  --exclude "nav, footer, .sidebar, .ad"
```

### Crawl a documentation site

```bash
webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
```
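
The `--depth` and `--max-pages` flags bound how far a crawl spreads. As a rough illustration of that idea (not webclaw's actual crawler), the Python sketch below walks a hypothetical in-memory link map breadth-first, keeps only same-origin links, and stops at a depth or page limit. The function names, the `links` dictionary, and the choice to treat the start page as depth 0 are all assumptions made for this example.

```python
from collections import deque
from urllib.parse import urlsplit


def same_origin(a, b):
    """Return True when both URLs share scheme and host[:port]."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.netloc) == (pb.scheme, pb.netloc)


def plan_crawl(start_url, links, depth=2, max_pages=50):
    """Breadth-first walk over `links`, a dict of URL -> outgoing URLs.

    `links` stands in for fetching pages; no network access happens here.
    Assumption: the start page is depth 0 and counts toward max_pages.
    """
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        if level >= depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(url, []):
            if len(visited) >= max_pages:
                return visited
            if target in seen or not same_origin(start_url, target):
                continue
            seen.add(target)
            visited.append(target)
            queue.append((target, level + 1))
    return visited
```

With a link map in which the start page links to two same-origin pages and one external page, `plan_crawl` returns the start page plus the same-origin pages reachable within the depth limit, in the order they were discovered.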

### Workflow examples

- [HTML to Markdown for RAG](examples/html-to-markdown-rag/)
- [Firecrawl-compatible API](examples/firecrawl-compatible-api/)
- [MCP web scraping](examples/mcp-web-scraping/)
- [Proxy-backed crawling](examples/proxy-backed-crawling/)
- [Cloudflare diagnostics](examples/cloudflare-diagnostics/)

### Extract brand assets

```bash
webclaw https://github.com --brand
```

### Compare a page over time

```bash
webclaw https://example.com/pricing --format json > pricing-old.json
webclaw https://example.com/pricing --diff-with pricing-old.json
```
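
To make the idea of a content diff concrete, here is a minimal Python sketch that compares two text snapshots with the standard library's `difflib`. It is only an illustration: it does not reproduce how `--diff-with` works internally or what its output looks like, and the `snapshot_diff` helper and sample pricing text are made up for this example.

```python
import difflib


def snapshot_diff(old_text, new_text):
    """Return unified-diff lines between two text snapshots."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


# Hypothetical snapshots of the same page taken at two different times.
old_snapshot = "# Pricing\nPro: $20/month\n"
new_snapshot = "# Pricing\nPro: $25/month\n"

for line in snapshot_diff(old_snapshot, new_snapshot):
    print(line)
```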

---

## MCP Server

webclaw ships with an MCP server for AI agents.

```bash
npx create-webclaw
```

Manual config:

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp"
    }
  }
}
```

Then ask your agent things like:

```text
Scrape these competitor pricing pages and summarize the differences.
```

```text
Crawl this documentation site and prepare clean context for a RAG index.
```

```text
Extract the brand colors, fonts, and logos from this company website.
```

---

## Tools

| Tool | What it does | Local |
| --- | --- | :-: |
| `scrape` | Extract one URL as markdown, text, JSON, LLM format, or HTML | Yes |
| `crawl` | Follow same-origin links and extract discovered pages | Yes |
| `map` | Discover URLs without extracting every page | Yes |
| `batch` | Scrape multiple URLs in parallel | Yes |
| `extract` | Convert page content into structured data | Yes, with local or configured LLM |
| `summarize` | Summarize a page | Yes, with local or configured LLM |
| `diff` | Compare page content snapshots | Yes |
| `brand` | Extract colors, fonts, logos, and metadata | Yes |
| `search` | Search the web and scrape results | Hosted API |
| `research` | Multi-source research workflow | Hosted API |

---

## SDKs

```bash
npm install @webclaw/sdk
pip install webclaw
go get github.com/0xMassi/webclaw-go
```

<details>
<summary>TypeScript</summary>

```ts
import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });

const page = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});
console.log(page.markdown);
```

</details>

<details>
<summary>Python</summary>

```python
from webclaw import Webclaw

client = Webclaw(api_key="wc_your_key")
page = client.scrape(
    "https://example.com",
    formats=["markdown"],
    only_main_content=True,
)
print(page.markdown)
```

</details>

<details>
<summary>cURL</summary>

```bash
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'
```

</details>
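<details>
<summary>Python (standard library, no SDK)</summary>

If you would rather not install an SDK, the cURL request can be mirrored with Python's standard library. This is a sketch under stated assumptions: the endpoint, headers, and body fields are copied from the cURL example, while the helper names `build_scrape_request` and `send` are hypothetical, and the response is assumed to be JSON because its schema is not described in this README.

```python
import json
import urllib.request

# Endpoint taken from the cURL example.
API_URL = "https://api.webclaw.io/v1/scrape"


def build_scrape_request(url, api_key, formats=("markdown",), only_main_content=True):
    """Build (but do not send) a POST request mirroring the cURL example."""
    payload = {
        "url": url,
        "formats": list(formats),
        "only_main_content": only_main_content,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the body, assuming the API replies with JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (performs a real network call, so it is left commented out):
# import os
# result = send(build_scrape_request("https://example.com", os.environ["WEBCLAW_API_KEY"]))
# print(json.dumps(result, indent=2))
```

</details>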

---

## Output Formats

| Format | Use it when you need |
| --- | --- |
| `markdown` | Clean page content with structure preserved |
| `llm` | Compact context for agents and RAG pipelines |
| `text` | Plain text with minimal formatting |
| `json` | Structured metadata, links, images, and extracted fields |
| `html` | Cleaned HTML for custom processing |

---

## Local First, Hosted When Needed

The CLI and MCP server work locally without an account for the core extraction path.

Use the hosted API at [webclaw.io](https://webclaw.io) when you need:

- protected-site access without managing infrastructure
- JavaScript rendering
- async crawl and research jobs
- web search
- watches and production usage tracking
- SDKs for application code

```bash
export WEBCLAW_API_KEY=wc_your_key

webclaw https://example.com --cloud
```

---

## What You Can Build

| Use case | Example |
| --- | --- |
| AI agent web access | Give Claude, Cursor, or another MCP client clean page context |
| RAG ingestion | Crawl docs, help centers, blogs, and knowledge bases |
| Competitor monitoring | Track pricing pages, changelogs, docs, and product pages |
| Structured extraction | Turn messy pages into typed JSON for automations |
| Research workflows | Search, scrape, summarize, and cite multiple sources |
| Brand intelligence | Extract logos, colors, fonts, and social metadata |

## Architecture

```text
webclaw/
  crates/
    webclaw-core     HTML to markdown, text, JSON, and LLM-ready output
    webclaw-fetch    Fetching, crawling, batching, and mapping
    webclaw-llm      Local and hosted LLM provider support
    webclaw-pdf      PDF text extraction
    webclaw-mcp      MCP server for AI agents
    webclaw-cli      Command-line interface
```

`webclaw-core` is pure extraction logic: no network I/O, small surface area, and usable independently from the fetching layer.

---

## Configuration

| Variable | Description |
| --- | --- |
| `WEBCLAW_API_KEY` | Hosted API key |
| `OLLAMA_HOST` | Ollama URL for local LLM features |
| `OPENAI_API_KEY` | OpenAI-compatible LLM provider key |
| `OPENAI_BASE_URL` | OpenAI-compatible base URL |
| `ANTHROPIC_API_KEY` | Anthropic-compatible LLM provider key |
| `ANTHROPIC_BASE_URL` | Anthropic-compatible base URL |
| `WEBCLAW_PROXY` | Single proxy URL |
| `WEBCLAW_PROXY_FILE` | Proxy pool file |
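
A small script can report which of these variables are set before you run webclaw. The sketch below is a hypothetical helper, not part of webclaw: the variable names come from the table above, while the grouping and the `configured_groups` function are assumptions made for illustration.

```python
import os

# Variable names come from the configuration table; the grouping is illustrative only.
GROUPS = {
    "hosted_api": ["WEBCLAW_API_KEY"],
    "ollama": ["OLLAMA_HOST"],
    "openai": ["OPENAI_API_KEY", "OPENAI_BASE_URL"],
    "anthropic": ["ANTHROPIC_API_KEY", "ANTHROPIC_BASE_URL"],
    "proxy": ["WEBCLAW_PROXY", "WEBCLAW_PROXY_FILE"],
}


def configured_groups(env=None):
    """Map each group to the variables in it that are set to a non-empty value."""
    env = os.environ if env is None else env
    return {
        group: [name for name in names if env.get(name)]
        for group, names in GROUPS.items()
    }


# Print a short report for the current environment without revealing any values.
for group, names in configured_groups().items():
    print(f"{group}: {', '.join(names) if names else 'not configured'}")
```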

---

## Contributing

The most useful contributions right now are practical and small:

- add examples for real agent and RAG workflows
- improve SDK snippets
- report pages that extract poorly
- add failing fixtures for messy HTML
- improve docs for MCP clients and local setup
- test the CLI on more Linux/macOS environments

Good first places to start:

- [Good first issues](https://github.com/0xMassi/webclaw/issues?q=label%3A%22good+first+issue%22)
- [Open a bug report](https://github.com/0xMassi/webclaw/issues/new)
- [Start a discussion](https://github.com/0xMassi/webclaw/discussions)

If a page extracts badly, include:

```text
URL:
Command or API request:
Expected output:
Actual output:
Format used: markdown / llm / text / json / html
CLI, MCP, SDK, or API:
```

Please remove secrets, cookies, private tokens, and customer data from logs before posting.

---

## Infrastructure Partner

<table>
  <tr>
    <td align="center">
      <a href="https://coldproxy.com/">
        <img src="./assets/sponsors/coldproxy-banner.png" alt="ColdProxy" width="720" />
      </a>
    </td>
  </tr>
  <tr>
    <td>
      <strong>ColdProxy</strong> supports webclaw as an Infrastructure Partner, providing residential IPv4,
      residential IPv6, and datacenter IPv6 proxy infrastructure across 195+ countries for public data
      collection, regional testing, monitoring, and web scraping workflows. Explore
      <a href="https://coldproxy.com/">ColdProxy</a>'s latest plans and available offers directly on the website.
    </td>
  </tr>
</table>

---

## Studio Partners

<table>
  <tr>
    <td width="340" align="center">
      <a href="https://quantumproxies.net/?utm_source=webclaw&utm_medium=github&utm_campaign=sponsor">
        <img src="./assets/sponsors/quantum-proxies-banner.png" alt="Quantum Proxies" width="300" />
      </a>
    </td>
    <td>
      <strong>Quantum Proxies</strong> provides fast, reliable residential and ISP proxy infrastructure for developers running large-scale extraction workloads.
      Get 20% off any plan with code <code>WEBCLAW20</code> at
      <a href="https://quantumproxies.net/?utm_source=webclaw&utm_medium=github&utm_campaign=sponsor">quantumproxies.net</a>.
    </td>
  </tr>
  <tr>
    <td width="340" align="center">
      <a href="https://proxy-seller.com/?partner=KXMQNNLIGHXR4B">
        <img src="./assets/sponsors/proxy-seller-banner.png" alt="Proxy-Seller" width="300" />
      </a>
    </td>
    <td>
      <strong>Proxy-Seller</strong> maintains a global network of residential and datacenter proxies optimized for web extraction at scale.
      The service supports high-volume concurrent scraping, geographic rotation, and integration with web extraction tools.
      Use code <code>WBC15</code> for 15% off IPv4, IPv6, ISP, and Residential proxies, and 10% off Mobile at
      <a href="https://proxy-seller.com/?partner=KXMQNNLIGHXR4B">proxy-seller.com</a>.
    </td>
  </tr>
  <tr>
    <td width="340" align="center">
      <a href="https://www.rapidproxy.io/?ref=webclaw">
        <img src="./assets/sponsors/rapidproxy-banner.png" alt="RapidProxy" width="300" />
      </a>
    </td>
    <td>
      <strong>RapidProxy</strong> delivers fast, reliable proxy infrastructure for large-scale data collection.
      With 90M+ residential IPs, smart rotation, high concurrency, AI-powered CAPTCHA bypass, and non-expiring traffic, it helps keep scraping workflows stable at scale.
      Use code <code>webclaw</code> for 10% off, or
      <a href="https://www.rapidproxy.io/?ref=webclaw">Try it free</a>.
    </td>
  </tr>
</table>

---

## Community Plugins

Third-party plugins that integrate webclaw with AI agent platforms:

| Plugin | Platform | What it does |
|---|---|---|
| [openclaw-webclaw](https://github.com/jal-co/openclaw-webclaw) | [OpenClaw](https://openclaw.ai) | Native webclaw v1 API plugin with 9 tools: scrape, search, crawl, extract, summarize, diff, map, batch, brand |
| [hermes-webclaw](https://github.com/jal-co/hermes-webclaw) | [Hermes Agent](https://github.com/NousResearch/hermes-agent) | Web search provider and 9 dedicated tools for the full v1 API surface. Install with `hermes plugins install jal-co/hermes-webclaw` |

Built a webclaw integration? [Open a PR](https://github.com/0xMassi/webclaw/pulls) to add it here.

---

## Contributors

Thanks to everyone improving webclaw through issues, examples, docs, bug reports, and pull requests.

<a href="https://github.com/0xMassi/webclaw/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=0xMassi/webclaw" alt="webclaw contributors" />
</a>

---

## Star History

<a href="https://www.star-history.com/?repos=0xMassi%2Fwebclaw&type=date&legend=top-left">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/chart?repos=0xMassi/webclaw&type=date&theme=dark&legend=top-left" />
    <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/chart?repos=0xMassi/webclaw&type=date&legend=top-left" />
    <img alt="Star History Chart" src="https://api.star-history.com/chart?repos=0xMassi/webclaw&type=date&legend=top-left" />
  </picture>
</a>

---

## License

[AGPL-3.0](LICENSE)