<p align="center">
  <a href="https://webclaw.io">
    <img src=".github/banner.png" alt="webclaw" width="760" />
  </a>
</p>

<h1 align="center">webclaw</h1>

<p align="center">
  <strong>Turn websites into clean markdown, JSON, and LLM-ready context.</strong><br/>
  <sub>CLI, MCP server, REST API, and SDKs for AI agents and RAG pipelines.</sub>
</p>

<p align="center">
  <a href="https://github.com/0xMassi/webclaw/stargazers"><img src="https://shieldcn.dev/github/stars/0xMassi/webclaw.svg?variant=branded&logo=github" alt="Stars" /></a>
  <a href="https://github.com/0xMassi/webclaw/releases"><img src="https://shieldcn.dev/github/tag/0xMassi/webclaw.svg?variant=branded&logo=rust" alt="Version" /></a>
  <a href="https://github.com/0xMassi/webclaw/blob/main/LICENSE"><img src="https://shieldcn.dev/github/license/0xMassi/webclaw.svg?variant=branded" alt="License" /></a>
  <a href="https://www.npmjs.com/package/create-webclaw"><img src="https://shieldcn.dev/npm/dt/create-webclaw.svg?variant=branded" alt="npm installs" /></a>
</p>

<p align="center">
  <a href="https://discord.gg/KDfd48EpnW"><img src="https://shieldcn.dev/badge/Discord-Join.svg?variant=branded&logo=discord" alt="Discord" /></a>
  <a href="https://x.com/webclaw_io"><img src="https://shieldcn.dev/badge/Follow-@webclaw__io.svg?variant=branded&logo=x" alt="X / Twitter" /></a>
  <a href="https://webclaw.io"><img src="https://shieldcn.dev/badge/Hosted-webclaw.io.svg?variant=branded&logo=safari" alt="Hosted webclaw" /></a>
  <a href="https://webclaw.io/docs"><img src="https://shieldcn.dev/badge/Docs-Read.svg?variant=branded&logo=readthedocs" alt="Docs" /></a>
</p>

<p align="center">
  <img src="assets/demo.gif" alt="webclaw extracting clean markdown from a page" width="760" />
</p>

---
Most web scraping tools give your agent one of two bad outputs:

- a blocked page, login wall, or empty app shell
- raw HTML full of nav, scripts, styling, ads, and duplicated boilerplate

[webclaw.io](https://webclaw.io) is the hosted web extraction API for webclaw. This repo contains the open-source CLI, MCP server, extraction engine, and self-hostable server.

webclaw turns a URL into clean content your tools can actually use.

```bash
webclaw https://example.com --format markdown
```

```md
# Example Domain

This domain is for use in illustrative examples in documents.

You may use this domain in literature without prior coordination or asking for permission.
```

Use it from the terminal, wire it into Claude/Cursor through MCP, call the hosted API from your app, or self-host the OSS server.

---
## Install

### Agent setup

The fastest way to connect webclaw to Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, and other MCP-compatible tools:

```bash
npx create-webclaw
```

The installer detects supported clients and configures the MCP server for you.

### Homebrew

```bash
brew tap 0xMassi/webclaw
brew install webclaw
```

### Prebuilt binaries

Download macOS and Linux binaries from [GitHub Releases](https://github.com/0xMassi/webclaw/releases).

### Docker

```bash
docker run --rm ghcr.io/0xmassi/webclaw https://example.com
```

### Cargo

```bash
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-mcp
```

If building from source fails because native build tools are missing, install the platform prerequisites:

| OS | Command |
| --- | --- |
| Debian / Ubuntu | `sudo apt install -y pkg-config libssl-dev cmake clang git build-essential` |
| Fedora / RHEL | `sudo dnf install -y pkg-config openssl-devel cmake clang git make gcc` |
| Arch | `sudo pacman -S pkg-config openssl cmake clang git base-devel` |
| macOS | `xcode-select --install` |

---
## Quick Start

### Scrape one page

```bash
webclaw https://stripe.com --format markdown
```

### Return LLM-optimized text

```bash
webclaw https://docs.anthropic.com --format llm
```

### Keep only the main content

```bash
webclaw https://example.com/blog/post --only-main-content
```

### Include or exclude selectors

```bash
webclaw https://example.com \
  --include "article, main, .content" \
  --exclude "nav, footer, .sidebar, .ad"
```
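
When the CLI is driven from a script, it can help to assemble the arguments in one place. The sketch below builds an argument list from the flags shown above. It assumes those flags can be combined in a single invocation, which this README only shows separately, and `build_scrape_command` is a hypothetical helper, not part of webclaw.

```python
from typing import List, Sequence


def build_scrape_command(
    url: str,
    fmt: str = "markdown",
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
    only_main_content: bool = False,
) -> List[str]:
    """Return an argv list for the webclaw CLI (hypothetical helper)."""
    cmd = ["webclaw", url, "--format", fmt]
    if only_main_content:
        cmd.append("--only-main-content")
    if include:
        # The README examples pass selectors as one comma-separated string.
        cmd += ["--include", ", ".join(include)]
    if exclude:
        cmd += ["--exclude", ", ".join(exclude)]
    return cmd
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True)` avoids shell quoting problems with selectors that contain spaces or commas.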
### Crawl a documentation site

```bash
webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
```
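
The Tools table describes `crawl` as following same-origin links. As an illustration of what "same origin" means in practice, here is a small sketch using only the Python standard library. It is not webclaw's implementation; `origin` and `same_origin_links` are hypothetical names, and dropping URL fragments is a choice made for this example.

```python
from typing import Iterable, List, Optional, Tuple
from urllib.parse import urljoin, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, Optional[str], Optional[int]]:
    """Return (scheme, host, port), filling in the default port."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    return scheme, parts.hostname, parts.port or DEFAULT_PORTS.get(scheme)


def same_origin_links(page_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against page_url and keep those on the same origin."""
    base = origin(page_url)
    seen = set()
    kept = []
    for href in hrefs:
        # Resolve relative links, then drop the #fragment part.
        absolute = urljoin(page_url, href).split("#", 1)[0]
        if origin(absolute) == base and absolute not in seen:
            seen.add(absolute)
            kept.append(absolute)
    return kept
```

With this rule, links to other subdomains, `mailto:` links, and external sites are skipped, which keeps a crawl inside the site it started on.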
### Extract brand assets

```bash
webclaw https://github.com --brand
```

### Compare a page over time

```bash
webclaw https://example.com/pricing --format json > pricing-old.json
webclaw https://example.com/pricing --diff-with pricing-old.json
```
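
To make the idea of a snapshot comparison concrete, the sketch below diffs two text snapshots with Python's `difflib`. It only illustrates the concept: it is not how `--diff-with` is implemented, and it works on plain text because the structure of the JSON snapshot is not documented here. `snapshot_diff` is a hypothetical name.

```python
import difflib
from typing import List


def snapshot_diff(old_text: str, new_text: str) -> List[str]:
    """Return unified-diff lines describing how new_text differs from old_text."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",  # inputs have no trailing newlines, so add none
        )
    )
```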
---

## MCP Server

webclaw ships with an MCP server for AI agents.

```bash
npx create-webclaw
```

Manual config:

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp"
    }
  }
}
```
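
If you generate this config from a script, the sketch below shows one way to build the entry and merge it into an existing config without dropping other servers. It assumes that some MCP clients start the command without a shell, in which case `~` might not be expanded; that is an assumption about client behaviour, not something this README states. `webclaw_mcp_entry` and `merge_mcp_config` are hypothetical helpers, and the location of the client's config file is left to you.

```python
import os
from typing import Any, Dict


def webclaw_mcp_entry(binary: str = "~/.webclaw/webclaw-mcp") -> Dict[str, Any]:
    """Build the manual config shown above with the home directory expanded."""
    return {"mcpServers": {"webclaw": {"command": os.path.expanduser(binary)}}}


def merge_mcp_config(existing: Dict[str, Any], entry: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of existing with the entry's servers added or replaced."""
    merged = dict(existing)
    servers = dict(merged.get("mcpServers", {}))
    servers.update(entry["mcpServers"])
    merged["mcpServers"] = servers
    return merged
```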
Then ask your agent things like:

```text
Scrape these competitor pricing pages and summarize the differences.
```

```text
Crawl this documentation site and prepare clean context for a RAG index.
```

```text
Extract the brand colors, fonts, and logos from this company website.
```

---
## Tools

| Tool | What it does | Local |
| --- | --- | :-: |
| `scrape` | Extract one URL as markdown, text, JSON, LLM format, or HTML | Yes |
| `crawl` | Follow same-origin links and extract discovered pages | Yes |
| `map` | Discover URLs without extracting every page | Yes |
| `batch` | Scrape multiple URLs in parallel | Yes |
| `extract` | Convert page content into structured data | Yes, with local or configured LLM |
| `summarize` | Summarize a page | Yes, with local or configured LLM |
| `diff` | Compare page content snapshots | Yes |
| `brand` | Extract colors, fonts, logos, and metadata | Yes |
| `search` | Search the web and scrape results | Hosted API |
| `research` | Multi-source research workflow | Hosted API |
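
The `batch` tool already scrapes URLs in parallel. For callers that only have the single-URL command available, the same idea can be sketched with a thread pool around the CLI. This is an illustration, not the `batch` implementation: `run_webclaw` and `scrape_many` are hypothetical helpers, the worker count and timeout are arbitrary, and `run_webclaw` needs `webclaw` on your `PATH`. The `runner` parameter exists so the fan-out logic can be exercised without the binary.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_webclaw(url: str) -> str:
    """Run the CLI for one URL and return its stdout as markdown."""
    result = subprocess.run(
        ["webclaw", url, "--format", "markdown"],
        capture_output=True,
        text=True,
        check=True,   # raise if the CLI exits with a non-zero status
        timeout=60,   # arbitrary limit for this sketch
    )
    return result.stdout


def scrape_many(
    urls: Iterable[str],
    runner: Callable[[str], str] = run_webclaw,
    max_workers: int = 4,
) -> Dict[str, str]:
    """Scrape several URLs concurrently and map each URL to its output."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(url_list, pool.map(runner, url_list)))
```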

---

## SDKs

```bash
npm install @webclaw/sdk
pip install webclaw
go get github.com/0xMassi/webclaw-go
```

<details>
<summary>TypeScript</summary>

```ts
import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });

const page = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(page.markdown);
```

</details>

<details>
<summary>Python</summary>

```python
from webclaw import Webclaw

client = Webclaw(api_key="wc_your_key")

page = client.scrape(
    "https://example.com",
    formats=["markdown"],
    only_main_content=True,
)

print(page.markdown)
```

</details>

<details>
<summary>cURL</summary>

```bash
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'
```

</details>
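
If you would rather not install an SDK, the cURL request above can be reproduced with the Python standard library. The endpoint, headers, and body fields below are copied from that example; `build_scrape_request` and `send` are hypothetical helpers, and because the response schema is not described in this README, `send` returns the raw body without parsing it.

```python
import json
import urllib.request
from typing import Sequence

API_URL = "https://api.webclaw.io/v1/scrape"


def build_scrape_request(
    url: str,
    api_key: str,
    formats: Sequence[str] = ("markdown",),
    only_main_content: bool = True,
) -> urllib.request.Request:
    """Build the same POST request as the cURL example, without sending it."""
    payload = {
        "url": url,
        "formats": list(formats),
        "only_main_content": only_main_content,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```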

---

## Output Formats

| Format | Use it when you need |
| --- | --- |
| `markdown` | Clean page content with structure preserved |
| `llm` | Compact context for agents and RAG pipelines |
| `text` | Plain text with minimal formatting |
| `json` | Structured metadata, links, images, and extracted fields |
| `html` | Cleaned HTML for custom processing |

---

## Local First, Hosted When Needed

The CLI and MCP server work locally without an account for the core extraction path.

Use the hosted API at [webclaw.io](https://webclaw.io) when you need:

- protected-site access without managing infrastructure
- JavaScript rendering
- async crawl and research jobs
- web search
- watches and production usage tracking
- SDKs for application code

```bash
export WEBCLAW_API_KEY=wc_your_key

webclaw https://example.com --cloud
```
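
A wrapper script may want to stay local by default and add `--cloud` only when a key has been exported. The sketch below encodes that rule. It is a policy chosen for this example, not documented CLI behaviour, and `webclaw_command` is a hypothetical helper.

```python
import os
from typing import List, Mapping


def webclaw_command(url: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the argv for a scrape, adding --cloud only if an API key is set."""
    cmd = ["webclaw", url]
    if env.get("WEBCLAW_API_KEY"):
        # A non-empty key is present, so route through the hosted API.
        cmd.append("--cloud")
    return cmd
```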
---

## What You Can Build

| Use case | Example |
| --- | --- |
| AI agent web access | Give Claude, Cursor, or another MCP client clean page context |
| RAG ingestion | Crawl docs, help centers, blogs, and knowledge bases |
| Competitor monitoring | Track pricing pages, changelogs, docs, and product pages |
| Structured extraction | Turn messy pages into typed JSON for automations |
| Research workflows | Search, scrape, summarize, and cite multiple sources |
| Brand intelligence | Extract logos, colors, fonts, and social metadata |

## Architecture

```text
webclaw/
  crates/
    webclaw-core     HTML to markdown, text, JSON, and LLM-ready output
    webclaw-fetch    Fetching, crawling, batching, and mapping
    webclaw-llm      Local and hosted LLM provider support
    webclaw-pdf      PDF text extraction
    webclaw-mcp      MCP server for AI agents
    webclaw-cli      Command-line interface
```

`webclaw-core` is pure extraction logic: no network I/O, small surface area, and usable independently from the fetching layer.

---

## Configuration

| Variable | Description |
| --- | --- |
| `WEBCLAW_API_KEY` | Hosted API key |
| `OLLAMA_HOST` | Ollama URL for local LLM features |
| `OPENAI_API_KEY` | OpenAI-compatible LLM provider key |
| `OPENAI_BASE_URL` | OpenAI-compatible base URL |
| `ANTHROPIC_API_KEY` | Anthropic-compatible LLM provider key |
| `ANTHROPIC_BASE_URL` | Anthropic-compatible base URL |
| `WEBCLAW_PROXY` | Single proxy URL |
| `WEBCLAW_PROXY_FILE` | Proxy pool file |
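
Before running LLM-backed tools such as `extract` or `summarize`, a script can check which of the variables above are set. The sketch below only reports what is present in the environment. The provider labels are made up for this example, the order in which webclaw picks a provider is not documented here, and an unset `OLLAMA_HOST` may simply mean a default address is used.

```python
import os
from typing import List, Mapping

# Labels are hypothetical; the variable names come from the table above.
LLM_VARIABLES = {
    "ollama": "OLLAMA_HOST",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def configured_llm_providers(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the labels whose environment variable is set and non-empty."""
    return [label for label, variable in LLM_VARIABLES.items() if env.get(variable)]
```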

---

## Contributing

The most useful contributions right now are practical and small:

- add examples for real agent and RAG workflows
- improve SDK snippets
- report pages that extract poorly
- add failing fixtures for messy HTML
- improve docs for MCP clients and local setup
- test the CLI on more Linux/macOS environments

Good first places to start:

- [Good first issues](https://github.com/0xMassi/webclaw/issues?q=label%3A%22good+first+issue%22)
- [Open a bug report](https://github.com/0xMassi/webclaw/issues/new)
- [Start a discussion](https://github.com/0xMassi/webclaw/discussions)

If a page extracts badly, include:

```text
URL:
Command or API request:
Expected output:
Actual output:
Format used: markdown / llm / text / json / html
CLI, MCP, SDK, or API:
```

Please remove secrets, cookies, private tokens, and customer data from logs before posting.

---
## Contributors

Thanks to everyone improving webclaw through issues, examples, docs, bug reports, and pull requests.

<a href="https://github.com/0xMassi/webclaw/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=0xMassi/webclaw" alt="webclaw contributors" />
</a>

---

## Star History

<a href="https://www.star-history.com/?repos=0xMassi%2Fwebclaw&type=date&legend=top-left">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/chart?repos=0xMassi/webclaw&type=date&theme=dark&legend=top-left" />
    <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/chart?repos=0xMassi/webclaw&type=date&legend=top-left" />
    <img alt="Star History Chart" src="https://api.star-history.com/chart?repos=0xMassi/webclaw&type=date&legend=top-left" />
  </picture>
</a>

---

## License

[AGPL-3.0](LICENSE)