| name | description | homepage | user-invocable | metadata |
|---|---|---|---|---|
| webclaw | Web extraction engine with antibot bypass. Scrape, crawl, extract, summarize, search, map, diff, monitor, research, and analyze any URL, including Cloudflare-protected sites. Use when you need reliable web content, the built-in web_fetch fails, or you need structured data extraction from web pages. | https://webclaw.io | true | |
webclaw
High-quality web extraction with automatic antibot bypass. Beats Firecrawl on extraction quality and handles Cloudflare, DataDome, and JS-rendered pages automatically.
When to use this skill
- Always when you need to fetch web content and want reliable results
- When web_fetch returns empty/blocked content (403, Cloudflare challenges)
- When you need structured data extraction (pricing tables, product info)
- When you need to crawl an entire site or discover all URLs
- When you need LLM-optimized content (cleaner than raw markdown)
- When you need to summarize a page without reading the full content
- When you need to detect content changes between visits
- When you need brand identity analysis (colors, fonts, logos)
- When you need web search results with optional page scraping
- When you need deep multi-source research on a topic
- When you need AI-guided scraping to accomplish a goal on a page
- When you need to monitor a URL for changes over time
API base
All requests go to https://api.webclaw.io/v1/.
Authentication: Authorization: Bearer $WEBCLAW_API_KEY
CLI API capture
Use the local CLI to capture browser network traffic from a public or authorized page, store learned endpoints locally, replay them safely, or export them as OpenAPI. Captures are written under %USERPROFILE%\.webclaw\api-captures by default, or under WEBCLAW_CAPTURE_DIR when set.
webclaw capture-network https://example.com --intent "discover product listing API" --wait-ms 3000
webclaw endpoints example.com/2026-05-16T12-00-00Z
webclaw replay-endpoint "GET https://example.com/api/products" --dry-run
webclaw export-openapi example.com/2026-05-16T12-00-00Z
Use webclaw show-endpoint "<endpoint-id>" to inspect one learned endpoint before replay. GET, HEAD, and OPTIONS endpoints can be replayed directly; POST, PUT, PATCH, and DELETE stay in dry-run preview unless you pass --confirm-unsafe.
MCP API capture tools
Use the MCP server tools when an agent needs to discover and reuse API calls made by a public or authorized page:
| Tool | Parameters | Use |
|---|---|---|
| capture_network | url, optional intent, wait_ms, headed | Open an HTTP(S) page in Chromium, capture network traffic, redact secrets, infer endpoints, and save the capture locally. |
| discover_endpoints | capture_id | Return all learned endpoint definitions for a saved capture. |
| show_endpoint | endpoint_id | Inspect one learned endpoint before replay or OpenAPI export. |
| replay_endpoint | endpoint_id, optional params_json, dry_run, confirm_unsafe, headers, body_json | Preview or replay a learned endpoint. Read-only methods can execute when dry_run is false; POST, PUT, PATCH, and DELETE stay dry-run unless confirm_unsafe is true. Redacted headers are never sent. |
| export_openapi | capture_id | Write openapi.json beside the saved capture's endpoints.json. |
| list_captures | {} | List saved captures from the configured capture root. |
Safety defaults: capture only pages and sessions the user is authorized to inspect, redact secrets by default, and do not use the capture tools to bypass CAPTCHAs, paywalls, login walls, rate limits, or access controls. Captures are stored under %USERPROFILE%\.webclaw\api-captures by default, or under WEBCLAW_CAPTURE_DIR when set.
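The replay rule above can be sketched as a small decision function. This is only an illustration of the documented behavior, not webclaw's implementation: the name will_execute is hypothetical, and treating unrecognized HTTP methods as preview-only is an assumption.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}
UNSAFE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def will_execute(method: str, *, dry_run: bool, confirm_unsafe: bool) -> bool:
    """Return True if a replay would send a real request, False if it stays a preview."""
    verb = method.upper()
    if dry_run:
        # A dry run never sends anything, whatever the method.
        return False
    if verb in SAFE_METHODS:
        return True
    if verb in UNSAFE_METHODS:
        # Mutating methods need an explicit opt-in.
        return confirm_unsafe
    # Assumption: methods outside both documented groups are never executed.
    return False
```

With these rules, will_execute("POST", dry_run=False, confirm_unsafe=False) stays a preview, while the same call with confirm_unsafe=True would execute.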
Endpoints
1. Scrape — extract content from a single URL
curl -X POST https://api.webclaw.io/v1/scrape \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown"],
"only_main_content": true
}'
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
| url | string | required | URL to scrape |
| formats | string[] | ["markdown"] | Output formats: markdown, text, llm, json |
| include_selectors | string[] | [] | CSS selectors to keep (e.g. ["article", ".content"]) |
| exclude_selectors | string[] | [] | CSS selectors to remove (e.g. ["nav", "footer", ".ads"]) |
| only_main_content | bool | false | Extract only the main article/content area |
| no_cache | bool | false | Skip cache, fetch fresh |
| max_cache_age | int | server default | Max acceptable cache age in seconds |
Response:
{
"url": "https://example.com",
"metadata": {
"title": "Example",
"description": "...",
"language": "en",
"word_count": 1234
},
"markdown": "# Page Title\n\nContent here...",
"cache": { "status": "miss" }
}
Format options:
- markdown: clean markdown, best for general use
- text: plain text without formatting
- llm: optimized for LLM consumption; includes page title, URL, and cleaned content with link references. Best for feeding to AI models.
- json: full extraction result with all metadata
When antibot bypass activates (automatic, no extra config):
{
"antibot": {
"bypass": true,
"elapsed_ms": 3200
}
}
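The same scrape call can be made from Python with only the standard library. This is a minimal sketch under stated assumptions: the helper names (build_scrape_payload, scrape, antibot_used) are illustrative, the 60-second timeout is arbitrary, and the antibot block is assumed to sit at the top level of the response.

```python
from __future__ import annotations

import json
import os
import urllib.request

SCRAPE_URL = "https://api.webclaw.io/v1/scrape"


def build_scrape_payload(url: str, formats: list[str] | None = None,
                         only_main_content: bool = False) -> dict:
    """Assemble the JSON body for POST /v1/scrape from documented fields."""
    return {
        "url": url,
        "formats": formats or ["markdown"],
        "only_main_content": only_main_content,
    }


def scrape(url: str, **options) -> dict:
    """Send the request; needs network access and WEBCLAW_API_KEY in the environment."""
    body = json.dumps(build_scrape_payload(url, **options)).encode("utf-8")
    request = urllib.request.Request(
        SCRAPE_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {os.environ['WEBCLAW_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    # Timeout value is an assumption, not a documented limit.
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def antibot_used(result: dict) -> bool:
    """True when the response carries an antibot block with bypass set."""
    return bool(result.get("antibot", {}).get("bypass"))
```

A call such as scrape("https://example.com", only_main_content=True) would mirror the curl example; antibot_used can then be applied to the returned dictionary.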
2. Crawl — scrape an entire website
Starts an async job. Poll for results.
Start crawl:
curl -X POST https://api.webclaw.io/v1/crawl \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.example.com",
"max_depth": 3,
"max_pages": 50,
"use_sitemap": true
}'
Response: { "job_id": "abc-123", "status": "running" }
Poll status:
curl https://api.webclaw.io/v1/crawl/abc-123 \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Response when complete:
{
"job_id": "abc-123",
"status": "completed",
"total": 47,
"completed": 45,
"errors": 2,
"pages": [
{
"url": "https://docs.example.com/intro",
"markdown": "# Introduction\n...",
"metadata": { "title": "Intro", "word_count": 500 }
}
]
}
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
| url | string | required | Starting URL |
| max_depth | int | 3 | How many links deep to follow |
| max_pages | int | 100 | Maximum pages to crawl |
| use_sitemap | bool | false | Seed URLs from sitemap.xml |
| formats | string[] | ["markdown"] | Output formats per page |
| include_selectors | string[] | [] | CSS selectors to keep |
| exclude_selectors | string[] | [] | CSS selectors to remove |
| only_main_content | bool | false | Main content only |
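Because a crawl is an async job, the caller needs a polling loop. The sketch below shows one way to write it; poll_job and its parameters are hypothetical names, the 2-second interval and 300-second limit are arbitrary, and any status other than "running" is assumed to be terminal.

```python
from __future__ import annotations

import time
from typing import Callable


def poll_job(fetch_status: Callable[[], dict], interval_s: float = 2.0,
             max_wait_s: float = 300.0,
             sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_status until the job leaves the 'running' state or time runs out."""
    waited = 0.0
    while True:
        job = fetch_status()
        if job.get("status") != "running":
            # Assumption: anything other than "running" is a final state.
            return job
        if waited >= max_wait_s:
            raise TimeoutError(f"job still running after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Here fetch_status would be a function that performs GET /v1/crawl/<job_id> with the bearer header and returns the parsed JSON. The same loop should also fit other poll-style jobs that report a status field.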
3. Map — discover all URLs on a site
Fast URL discovery without full content extraction.
curl -X POST https://api.webclaw.io/v1/map \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Response:
{
"url": "https://example.com",
"count": 142,
"urls": [
"https://example.com/about",
"https://example.com/pricing",
"https://example.com/docs/intro"
]
}
4. Batch — scrape multiple URLs in parallel
curl -X POST https://api.webclaw.io/v1/batch \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://a.com",
"https://b.com",
"https://c.com"
],
"formats": ["markdown"],
"concurrency": 5
}'
Response:
{
"total": 3,
"completed": 2,
"errors": 1,
"results": [
{ "url": "https://a.com", "markdown": "...", "metadata": {} },
{ "url": "https://b.com", "markdown": "...", "metadata": {} },
{ "url": "https://c.com", "error": "timeout" }
]
}
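A batch response can mix successful pages with failed ones, so callers usually separate the two. A minimal sketch, assuming a failed entry is identified by the presence of an "error" key as in the example above; split_batch_results is an illustrative name.

```python
from __future__ import annotations


def split_batch_results(response: dict) -> tuple[list[dict], list[dict]]:
    """Separate successful pages from failed ones in a /v1/batch response."""
    succeeded: list[dict] = []
    failed: list[dict] = []
    for item in response.get("results", []):
        # Assumption: an "error" key marks a failed URL.
        (failed if "error" in item else succeeded).append(item)
    return succeeded, failed
```

The failed list can then be retried individually or logged.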
5. Extract — LLM-powered structured extraction
Pull structured data from any page using a JSON schema or plain-text prompt.
With JSON schema:
curl -X POST https://api.webclaw.io/v1/extract \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"schema": {
"type": "object",
"properties": {
"plans": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "string" },
"features": { "type": "array", "items": { "type": "string" } }
}
}
}
}
}
}'
With prompt:
curl -X POST https://api.webclaw.io/v1/extract \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"prompt": "Extract all pricing tiers with names, monthly prices, and key features"
}'
Response:
{
"url": "https://example.com/pricing",
"data": {
"plans": [
{ "name": "Starter", "price": "$49/mo", "features": ["10k pages", "Email support"] },
{ "name": "Pro", "price": "$99/mo", "features": ["100k pages", "Priority support", "API access"] }
]
}
}
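The two request shapes above (schema or prompt) can be built programmatically. In this sketch both helper names are hypothetical, and requiring exactly one of schema or prompt is a choice made here to mirror the two documented modes; whether the API accepts both together is not stated.

```python
from __future__ import annotations


def string_fields_schema(list_name: str, fields: list[str]) -> dict:
    """JSON schema for an object holding a list of items with string fields."""
    return {
        "type": "object",
        "properties": {
            list_name: {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {name: {"type": "string"} for name in fields},
                },
            }
        },
    }


def build_extract_payload(url: str, schema: dict | None = None,
                          prompt: str | None = None) -> dict:
    """Body for POST /v1/extract with either a schema or a prompt."""
    # Sketch-level rule (assumption): exactly one of the two modes per request.
    if (schema is None) == (prompt is None):
        raise ValueError("pass exactly one of schema or prompt")
    payload: dict = {"url": url}
    if schema is not None:
        payload["schema"] = schema
    else:
        payload["prompt"] = prompt
    return payload
```

For instance, string_fields_schema("plans", ["name", "price"]) produces a simplified version of the pricing schema shown earlier, with string fields only.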
6. Summarize — get a quick summary of any page
curl -X POST https://api.webclaw.io/v1/summarize \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/long-article",
"max_sentences": 3
}'
Response:
{
"url": "https://example.com/long-article",
"summary": "The article discusses... Key findings include... The author concludes that..."
}
7. Diff — detect content changes
Compare current page content against a previous snapshot.
curl -X POST https://api.webclaw.io/v1/diff \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"previous": {
"markdown": "# Old content...",
"metadata": { "title": "Old Title" }
}
}'
Response:
{
"url": "https://example.com",
"status": "changed",
"diff": "--- previous\n+++ current\n@@ -1 +1 @@\n-# Old content\n+# New content",
"metadata_changes": [
{ "field": "title", "old": "Old Title", "new": "New Title" }
]
}
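A common pattern is to keep an earlier scrape result and pass it back as the previous snapshot. The sketch below assumes the previous object takes the markdown and metadata fields shown in the request above; the function names are illustrative.

```python
def build_diff_request(url: str, previous_scrape: dict) -> dict:
    """Turn a stored /v1/scrape response into the body for POST /v1/diff."""
    # Only the fields shown in the documented request are forwarded.
    previous = {"markdown": previous_scrape.get("markdown", "")}
    if "metadata" in previous_scrape:
        previous["metadata"] = previous_scrape["metadata"]
    return {"url": url, "previous": previous}


def has_changed(diff_response: dict) -> bool:
    """True when the diff endpoint reports status 'changed'."""
    return diff_response.get("status") == "changed"
```

Extra keys in the stored scrape, such as cache, are dropped rather than sent.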
8. Brand — extract brand identity
Analyze a website's visual identity: colors, fonts, logo.
curl -X POST https://api.webclaw.io/v1/brand \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Response:
{
"url": "https://example.com",
"brand": {
"colors": [
{ "hex": "#FF6B35", "usage": "primary" },
{ "hex": "#1A1A2E", "usage": "background" }
],
"fonts": ["Inter", "JetBrains Mono"],
"logo_url": "https://example.com/logo.svg",
"favicon_url": "https://example.com/favicon.ico"
}
}
9. Search — web search with optional scraping
Search the web and optionally scrape each result page.
curl -X POST https://api.webclaw.io/v1/search \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "best rust web frameworks 2026",
"num_results": 5,
"scrape": true,
"formats": ["markdown"]
}'
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
| query | string | required | Search query |
| num_results | int | 10 | Number of search results to return |
| scrape | bool | false | Also scrape each result page for full content |
| formats | string[] | ["markdown"] | Output formats when scrape is true |
| country | string | none | Country code for localized results (e.g. "us", "de") |
| lang | string | none | Language code for results (e.g. "en", "fr") |
Response:
{
"query": "best rust web frameworks 2026",
"results": [
{
"title": "Top Rust Web Frameworks in 2026",
"url": "https://blog.example.com/rust-frameworks",
"snippet": "A comprehensive comparison of Axum, Actix, and Rocket...",
"position": 1,
"markdown": "# Top Rust Web Frameworks\n\n..."
},
{
"title": "Choosing a Rust Backend Framework",
"url": "https://dev.to/rust-backends",
"snippet": "When starting a new Rust web project...",
"position": 2,
"markdown": "# Choosing a Rust Backend\n\n..."
}
]
}
The markdown field on each result is only present when scrape: true. Without it, you get titles, URLs, snippets, and positions only.
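Since markdown is optional on each result, client code should not assume it exists. A small sketch that orders results by position and records whether page content came back; summarize_results is a hypothetical helper, not part of the API.

```python
def summarize_results(response: dict) -> list:
    """Flatten /v1/search results, noting whether page content was scraped."""
    hits = sorted(response.get("results", []), key=lambda h: h.get("position", 0))
    rows = []
    for hit in hits:
        rows.append({
            "position": hit.get("position"),
            "title": hit.get("title"),
            "url": hit.get("url"),
            # markdown is only present when the request set scrape: true
            "has_content": "markdown" in hit,
        })
    return rows
```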
10. Research — deep multi-source research
Starts an async research job that searches, scrapes, and synthesizes information across multiple sources. Poll for results.
Start research:
curl -X POST https://api.webclaw.io/v1/research \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
"max_iterations": 5,
"max_sources": 10,
"topic": "security",
"deep": true
}'
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
| query | string | required | Research question or topic |
| max_iterations | int | server default | Maximum research iterations (search-read-analyze cycles) |
| max_sources | int | server default | Maximum number of sources to consult |
| topic | string | none | Topic hint to guide search strategy (e.g. "security", "finance", "engineering") |
| deep | bool | false | Enable deep research mode for more thorough analysis (costs 10 credits instead of 1) |
Response: { "id": "res-abc-123", "status": "running" }
Poll results:
curl https://api.webclaw.io/v1/research/res-abc-123 \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Response when complete:
{
"id": "res-abc-123",
"status": "completed",
"query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
"report": "# Cloudflare Turnstile Analysis\n\n## Overview\nCloudflare Turnstile is a CAPTCHA replacement that...\n\n## How It Works\n...\n\n## Known Bypass Methods\n...",
"sources": [
{ "url": "https://developers.cloudflare.com/turnstile/", "title": "Turnstile Documentation" },
{ "url": "https://blog.cloudflare.com/turnstile-ga/", "title": "Turnstile GA Announcement" }
],
"findings": [
"Turnstile uses browser environment signals and proof-of-work challenges",
"Managed mode auto-selects challenge difficulty based on visitor risk score",
"Known bypass approaches include instrumented browser automation"
],
"iterations": 5,
"elapsed_ms": 34200
}
Status values: running, completed, failed
11. Agent Scrape — AI-guided scraping
Use an AI agent to navigate and interact with a page to accomplish a specific goal. The agent can click, scroll, fill forms, and extract data across multiple steps.
curl -X POST https://api.webclaw.io/v1/agent-scrape \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/products",
"goal": "Find the cheapest laptop with at least 16GB RAM and extract its full specs",
"max_steps": 10
}'
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
| url | string | required | Starting URL |
| goal | string | required | What the agent should accomplish |
| max_steps | int | server default | Maximum number of actions the agent can take |
Response:
{
"url": "https://example.com/products",
"result": "The cheapest laptop with 16GB+ RAM is the ThinkPad E14 Gen 6 at $649. Specs: AMD Ryzen 5 7535U, 16GB DDR4, 512GB SSD, 14\" FHD IPS display, 57Wh battery.",
"steps": [
{ "action": "navigate", "detail": "Loaded products page" },
{ "action": "click", "detail": "Clicked 'Laptops' category filter" },
{ "action": "click", "detail": "Applied '16GB+' RAM filter" },
{ "action": "click", "detail": "Sorted by price: low to high" },
{ "action": "extract", "detail": "Extracted specs from first matching product" }
]
}
12. Watch — monitor a URL for changes
Create persistent monitors that check a URL on a schedule and notify via webhook when content changes.
Create a monitor:
curl -X POST https://api.webclaw.io/v1/watch \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"interval": "0 */6 * * *",
"webhook_url": "https://hooks.example.com/pricing-changed",
"formats": ["markdown"]
}'
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
| url | string | required | URL to monitor |
| interval | string | required | Check frequency as cron expression or seconds (e.g. "0 */6 * * *" or "3600") |
| webhook_url | string | none | URL to POST when changes are detected |
| formats | string[] | ["markdown"] | Output formats for snapshots |
Response:
{
"id": "watch-abc-123",
"url": "https://example.com/pricing",
"interval": "0 */6 * * *",
"webhook_url": "https://hooks.example.com/pricing-changed",
"formats": ["markdown"],
"created_at": "2026-03-20T10:00:00Z",
"last_check": null,
"status": "active"
}
List all monitors:
curl https://api.webclaw.io/v1/watch \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Response:
{
"monitors": [
{
"id": "watch-abc-123",
"url": "https://example.com/pricing",
"interval": "0 */6 * * *",
"status": "active",
"last_check": "2026-03-20T16:00:00Z",
"checks": 4
}
]
}
Get a monitor with snapshots:
curl https://api.webclaw.io/v1/watch/watch-abc-123 \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Response:
{
"id": "watch-abc-123",
"url": "https://example.com/pricing",
"interval": "0 */6 * * *",
"status": "active",
"snapshots": [
{
"checked_at": "2026-03-20T16:00:00Z",
"status": "changed",
"diff": "--- previous\n+++ current\n@@ -5 +5 @@\n-Pro: $99/mo\n+Pro: $119/mo"
},
{
"checked_at": "2026-03-20T10:00:00Z",
"status": "baseline"
}
]
}
Trigger an immediate check:
curl -X POST https://api.webclaw.io/v1/watch/watch-abc-123/check \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Delete a monitor:
curl -X DELETE https://api.webclaw.io/v1/watch/watch-abc-123 \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
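The interval field accepts either a cron expression or a number of seconds, so it can help to check the value before creating a monitor. This is a client-side sketch only: classify_interval is a hypothetical helper, and treating a cron expression as exactly five whitespace-separated fields is an assumption based on the example above, not a documented rule.

```python
def classify_interval(interval: str) -> str:
    """Return 'seconds' for a positive digit string, 'cron' for a five-field expression."""
    value = interval.strip()
    if value.isdigit() and int(value) > 0:
        return "seconds"
    # Assumption: standard five-field cron (minute hour day month weekday).
    if len(value.split()) == 5:
        return "cron"
    raise ValueError(f"unrecognized interval: {interval!r}")
```

The check only looks at shape; it does not validate the individual cron fields, which the server may still reject.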
Choosing the right format
| Goal | Format | Why |
|---|---|---|
| Read and understand a page | markdown | Clean structure, headings, links preserved |
| Feed content to an AI model | llm | Optimized: includes title + URL header, clean link refs |
| Search or index content | text | Plain text, no formatting noise |
| Programmatic analysis | json | Full metadata, structured data, DOM statistics |
Tips
- Use llm format when passing content to yourself or another AI; it is specifically optimized for LLM consumption with better context framing.
- Use only_main_content: true to skip navigation, sidebars, and footers. Reduces noise significantly.
- Use include_selectors / exclude_selectors for fine-grained control when only_main_content isn't enough.
- Batch over individual scrapes when fetching multiple URLs; it's faster and more efficient.
- Use map before crawl to discover the site structure first, then crawl specific sections.
- Use extract with a JSON schema for reliable structured output (e.g., pricing tables, product specs, contact info).
- Antibot bypass is automatic; no extra configuration needed. Works on Cloudflare, DataDome, AWS WAF, and JS-rendered SPAs.
- Use search with scrape: true to get full page content for each search result in one call instead of searching then scraping separately.
- Use research for complex questions that need multiple sources; it handles the search-read-synthesize loop automatically. Enable deep: true for thorough analysis.
- Use agent-scrape for interactive pages where data is behind filters, pagination, or form submissions that a simple scrape cannot reach.
- Use watch for ongoing monitoring: set up a cron schedule and a webhook to get notified when a page changes without polling manually.
Smart Fetch Architecture
The webclaw MCP server uses a local-first approach:
- Local fetch — fast, free, no API credits used (~80% of sites)
- Cloud API fallback — automatic when bot protection or JS rendering is detected
This means:
- Most scrapes cost zero credits (local extraction)
- Cloudflare, DataDome, AWS WAF sites automatically fall back to the cloud API
- JS-rendered SPAs (React, Next.js, Vue) also fall back automatically
- Set WEBCLAW_API_KEY to enable cloud fallback
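The local-first flow can be pictured with a short sketch. This is not the MCP server's actual logic: smart_fetch and the two fetcher callables are hypothetical, and using a 403 status or an empty body as the "blocked" signal is a simplifying stand-in for the real bot-protection and JS-rendering detection.

```python
from __future__ import annotations

from typing import Callable, Optional

BLOCKED_STATUSES = {403}  # assumption: status codes treated as "blocked"


def smart_fetch(url: str,
                local_fetch: Callable[[str], tuple[int, str]],
                cloud_fetch: Optional[Callable[[str], str]] = None) -> tuple[str, str]:
    """Try a local fetch first; fall back to the cloud API when it looks blocked or empty.

    Returns (source, content), where source is 'local' or 'cloud'.
    """
    status, body = local_fetch(url)
    looks_blocked = status in BLOCKED_STATUSES or not body.strip()
    if not looks_blocked:
        return "local", body
    if cloud_fetch is None:
        # Mirrors the note above: without an API key there is no fallback.
        raise RuntimeError("local fetch blocked and no cloud fallback configured")
    return "cloud", cloud_fetch(url)
```

In this picture, cloud_fetch would wrap a call to the /v1/scrape endpoint and would only be supplied when WEBCLAW_API_KEY is set.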
vs web_fetch
| | webclaw | web_fetch |
|---|---|---|
| Cloudflare bypass | Automatic (cloud fallback) | Fails (403) |
| JS-rendered pages | Automatic fallback | Readability only |
| Output quality | 20-step optimization pipeline | Basic HTML parsing |
| Structured extraction | LLM-powered, schema-based | None |
| Crawling | Full site crawl with sitemap | Single page only |
| Caching | Built-in, configurable TTL | Per-session |
| Rate limiting | Managed server-side | Client responsibility |
Use web_fetch for simple, fast lookups. Use webclaw when you need reliability, quality, or advanced features.