Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server. https://webclaw.io

webclaw

Turn websites into clean markdown, JSON, and LLM-ready context.
CLI, MCP server, REST API, and SDKs for AI agents and RAG pipelines.


Most web scraping tools give your agent one of two bad outputs:

  • a blocked page, login wall, or empty app shell
  • raw HTML full of nav, scripts, styling, ads, and duplicated boilerplate

webclaw.io is the hosted web extraction API for webclaw. This repo contains the open-source CLI, MCP server, extraction engine, and self-hostable server.

webclaw turns a URL into clean content your tools can actually use.

webclaw https://example.com --format markdown
# Example Domain

This domain is for use in illustrative examples in documents.

You may use this domain in literature without prior coordination or asking for permission.

Use it from the terminal, wire it into Claude/Cursor through MCP, call the hosted API from your app, or self-host the OSS server.


Install

Agent setup

The fastest way to connect webclaw to Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, and other MCP-compatible tools:

npx create-webclaw

The installer detects supported clients and configures the MCP server for you.

Homebrew

brew tap 0xMassi/webclaw
brew install webclaw

Prebuilt binaries

Download macOS and Linux binaries from GitHub Releases.

Docker

docker run --rm ghcr.io/0xmassi/webclaw https://example.com

Cargo

cargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-mcp

If building from source fails because native build tools are missing, install the platform prerequisites:

| OS | Command |
| --- | --- |
| Debian / Ubuntu | `sudo apt install -y pkg-config libssl-dev cmake clang git build-essential` |
| Fedora / RHEL | `sudo dnf install -y pkg-config openssl-devel cmake clang git make gcc` |
| Arch | `sudo pacman -S pkg-config openssl cmake clang git base-devel` |
| macOS | `xcode-select --install` |

Quick Start

Scrape one page

webclaw https://stripe.com --format markdown

Return LLM-optimized text

webclaw https://docs.anthropic.com --format llm

Keep only the main content

webclaw https://example.com/blog/post --only-main-content

Include or exclude selectors

webclaw https://example.com \
  --include "article, main, .content" \
  --exclude "nav, footer, .sidebar, .ad"

Crawl a documentation site

webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
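
To make the two limits concrete, here is a minimal sketch of how a depth bound and a page cap usually interact in a breadth-first, same-origin crawl. It is an illustration only, not webclaw's crawler: `plan_crawl` and the in-memory `links` map are hypothetical stand-ins for real fetching and link extraction, and the exact semantics of `--depth` and `--max-pages` in webclaw may differ.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start, links, depth, max_pages):
    """Return URLs in breadth-first order, bounded by depth and page count.

    `links` is a hypothetical map of URL -> outgoing URLs that stands in
    for fetching and parsing real pages.
    """
    origin = urlparse(start).netloc
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, level = queue.popleft()
        order.append(url)
        if level == depth:
            continue  # pages at the depth limit are kept, but their links are not followed
        for nxt in links.get(url, []):
            # Same-origin check compares host and port (netloc); scheme is ignored here.
            if nxt not in seen and urlparse(nxt).netloc == origin:
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return order
```

With `depth=2`, the start page counts as level 0, so pages up to two links away are included. Off-origin links are dropped, and the crawl stops early once `max_pages` pages have been collected.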

Workflow examples

Extract brand assets

webclaw https://github.com --brand

Compare a page over time

webclaw https://example.com/pricing --format json > pricing-old.json
webclaw https://example.com/pricing --diff-with pricing-old.json
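
For intuition about what a snapshot comparison reports, the sketch below diffs two text snapshots with Python's standard `difflib`. It is not webclaw's diff implementation and does not read webclaw's JSON snapshot format; `snapshot_diff` is a hypothetical helper, and the pricing strings are made-up sample data.

```python
import difflib


def snapshot_diff(old_text, new_text):
    """Return unified-diff lines between two text snapshots of a page."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",  # keep output lines free of trailing newlines
        )
    )


# Made-up sample snapshots of a pricing page.
old = "Pro: $10\nTeam: $20"
new = "Pro: $12\nTeam: $20"
for line in snapshot_diff(old, new):
    print(line)
```

Removed lines are prefixed with `-` and added lines with `+`. Two identical snapshots produce an empty list.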

MCP Server

webclaw ships with an MCP server for AI agents.

npx create-webclaw

Manual config:

{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp"
    }
  }
}
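
If you prefer to script the manual step, the sketch below merges the `webclaw` entry into an existing client config without touching other servers. It is a sketch under assumptions: `add_webclaw_server` is a hypothetical helper, the config file location depends on your MCP client and is not specified here, and expanding `~` up front is a precaution in case the client does not expand it.

```python
import json
import os


def add_webclaw_server(config_path, binary="~/.webclaw/webclaw-mcp"):
    """Add or update the webclaw entry under "mcpServers" in a JSON config."""
    config = {}
    if os.path.exists(config_path):
        with open(config_path, encoding="utf-8") as fh:
            config = json.load(fh)
    servers = config.setdefault("mcpServers", {})
    # Assumption: the client may launch the command without shell expansion,
    # so "~" is resolved to an absolute path here.
    servers["webclaw"] = {"command": os.path.expanduser(binary)}
    with open(config_path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
        fh.write("\n")
    return config
```

Call it with the path to your client's config file, for example `add_webclaw_server("/path/to/client-config.json")`, where the path is a placeholder.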

Then ask your agent things like:

Scrape these competitor pricing pages and summarize the differences.
Crawl this documentation site and prepare clean context for a RAG index.
Extract the brand colors, fonts, and logos from this company website.

Tools

| Tool | What it does | Local |
| --- | --- | --- |
| scrape | Extract one URL as markdown, text, JSON, LLM format, or HTML | Yes |
| crawl | Follow same-origin links and extract discovered pages | Yes |
| map | Discover URLs without extracting every page | Yes |
| batch | Scrape multiple URLs in parallel | Yes |
| extract | Convert page content into structured data | Yes, with local or configured LLM |
| summarize | Summarize a page | Yes, with local or configured LLM |
| diff | Compare page content snapshots | Yes |
| brand | Extract colors, fonts, logos, and metadata | Yes |
| search | Search the web and scrape results | Hosted API |
| research | Multi-source research workflow | Hosted API |

SDKs

npm install @webclaw/sdk
pip install webclaw
go get github.com/0xMassi/webclaw-go

TypeScript

import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });

const page = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(page.markdown);

Python

from webclaw import Webclaw

client = Webclaw(api_key="wc_your_key")

page = client.scrape(
    "https://example.com",
    formats=["markdown"],
    only_main_content=True,
)

print(page.markdown)

cURL

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'
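
If you would rather not install an SDK, the same request can be built with Python's standard library alone. This is a minimal sketch that mirrors the cURL call above: the endpoint, headers, and body fields are copied from that example, while `build_scrape_request` and `send` are hypothetical helper names and the response shape is not assumed.

```python
import json
import urllib.request

API_URL = "https://api.webclaw.io/v1/scrape"  # endpoint taken from the cURL example


def build_scrape_request(url, api_key):
    """Build (but do not send) a POST request equivalent to the cURL call."""
    payload = {"url": url, "formats": ["markdown"], "only_main_content": True}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the JSON body; the response schema is not assumed."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit test before any network call is made.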

Output Formats

| Format | Use it when you need |
| --- | --- |
| markdown | Clean page content with structure preserved |
| llm | Compact context for agents and RAG pipelines |
| text | Plain text with minimal formatting |
| json | Structured metadata, links, images, and extracted fields |
| html | Cleaned HTML for custom processing |

Local First, Hosted When Needed

The CLI and MCP server work locally without an account for the core extraction path.

Use the hosted API at webclaw.io when you need:

  • protected-site access without managing infrastructure
  • JavaScript rendering
  • async crawl and research jobs
  • web search
  • watches and production usage tracking
  • SDKs for application code

export WEBCLAW_API_KEY=wc_your_key

webclaw https://example.com --cloud

What You Can Build

| Use case | Example |
| --- | --- |
| AI agent web access | Give Claude, Cursor, or another MCP client clean page context |
| RAG ingestion | Crawl docs, help centers, blogs, and knowledge bases |
| Competitor monitoring | Track pricing pages, changelogs, docs, and product pages |
| Structured extraction | Turn messy pages into typed JSON for automations |
| Research workflows | Search, scrape, summarize, and cite multiple sources |
| Brand intelligence | Extract logos, colors, fonts, and social metadata |

Architecture

webclaw/
  crates/
    webclaw-core     HTML to markdown, text, JSON, and LLM-ready output
    webclaw-fetch    Fetching, crawling, batching, and mapping
    webclaw-llm      Local and hosted LLM provider support
    webclaw-pdf      PDF text extraction
    webclaw-mcp      MCP server for AI agents
    webclaw-cli      Command-line interface

webclaw-core is pure extraction logic: no network I/O, small surface area, and usable independently from the fetching layer.


Configuration

| Variable | Description |
| --- | --- |
| WEBCLAW_API_KEY | Hosted API key |
| OLLAMA_HOST | Ollama URL for local LLM features |
| OPENAI_API_KEY | OpenAI-compatible LLM provider key |
| OPENAI_BASE_URL | OpenAI-compatible base URL |
| ANTHROPIC_API_KEY | Anthropic-compatible LLM provider key |
| ANTHROPIC_BASE_URL | Anthropic-compatible base URL |
| WEBCLAW_PROXY | Single proxy URL |
| WEBCLAW_PROXY_FILE | Proxy pool file |
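
Before running LLM-backed features such as extract or summarize, it can help to check which provider variables are actually set. The sketch below is a hypothetical preflight helper, not part of webclaw: it only reports which of the variables listed above are present and makes no claim about the order in which webclaw itself picks a provider.

```python
import os

# Variable names come from the configuration table; the provider labels are illustrative.
LLM_VARS = {
    "ollama": "OLLAMA_HOST",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def configured_providers(env=None):
    """Return the provider labels whose variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in LLM_VARS.items() if env.get(var)]
```

Passing a plain dictionary instead of the real environment, as in `configured_providers({"OLLAMA_HOST": "http://localhost:11434"})`, makes the check easy to test.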

Contributing

The most useful contributions right now are practical and small:

  • add examples for real agent and RAG workflows
  • improve SDK snippets
  • report pages that extract poorly
  • add failing fixtures for messy HTML
  • improve docs for MCP clients and local setup
  • test the CLI on more Linux/macOS environments

Good first places to start:

If a page extracts badly, include:

URL:
Command or API request:
Expected output:
Actual output:
Format used: markdown / llm / text / json / html
CLI, MCP, SDK, or API:

Please remove secrets, cookies, private tokens, and customer data from logs before posting.


Infrastructure Partner

ColdProxy
ColdProxy supports webclaw as an Infrastructure Partner, providing residential IPv4, residential IPv6, and datacenter IPv6 proxy infrastructure across 195+ countries for public data collection, regional testing, monitoring, and web scraping workflows. Explore ColdProxy's latest plans and available offers directly on the website.

Studio Partners

Quantum Proxies provides fast, reliable residential and ISP proxy infrastructure for developers running large-scale extraction workloads. Get 20% off any plan with code WEBCLAW20 at quantumproxies.net.

Proxy-Seller maintains a global network of residential and datacenter proxies optimized for web extraction at scale. The service supports high-volume concurrent scraping, geographic rotation, and integration with web extraction tools. Use code WBC15 for 15% off IPv4, IPv6, ISP, and Residential proxies, and 10% off Mobile at proxy-seller.com.

RapidProxy delivers fast, reliable proxy infrastructure for large-scale data collection. With 90M+ residential IPs, smart rotation, high concurrency, AI-powered CAPTCHA bypass, and non-expiring traffic, it helps keep scraping workflows stable at scale. Use code webclaw for 10% off, or try it free.

Community Plugins

Third-party plugins that integrate webclaw with AI agent platforms:

| Plugin | Platform | What it does |
| --- | --- | --- |
| openclaw-webclaw | OpenClaw | Native webclaw v1 API plugin with 9 tools: scrape, search, crawl, extract, summarize, diff, map, batch, brand |
| hermes-webclaw | Hermes Agent | Web search provider and 9 dedicated tools for the full v1 API surface. Install with `hermes plugins install jal-co/hermes-webclaw` |

Built a webclaw integration? Open a PR to add it here.


Contributors

Thanks to everyone improving webclaw through issues, examples, docs, bug reports, and pull requests.

License

AGPL-3.0