webclaw/SKILL.md
Valerio a1b9a55048 chore: add SKILL.md to repo root for skills.sh discoverability
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:27:17 +02:00


---
name: webclaw
description: Web extraction engine with antibot bypass. Scrape, crawl, extract, summarize, search, map, diff, monitor, research, and analyze any URL — including Cloudflare-protected sites. Use when you need reliable web content, the built-in web_fetch fails, or you need structured data extraction from web pages.
homepage: https://webclaw.io
user-invocable: true
metadata:
  openclaw:
    emoji: 🦀
    requires:
      env:
        - WEBCLAW_API_KEY
    primaryEnv: WEBCLAW_API_KEY
    homepage: https://webclaw.io
    install:
      - id: npx
        kind: node
        bins:
          - webclaw-mcp
        label: npx create-webclaw
---

webclaw

High-quality web extraction with automatic antibot bypass. Beats Firecrawl on extraction quality and handles Cloudflare, DataDome, and JS-rendered pages automatically.

When to use this skill

  • Always when you need to fetch web content and want reliable results
  • When web_fetch returns empty/blocked content (403, Cloudflare challenges)
  • When you need structured data extraction (pricing tables, product info)
  • When you need to crawl an entire site or discover all URLs
  • When you need LLM-optimized content (cleaner than raw markdown)
  • When you need to summarize a page without reading the full content
  • When you need to detect content changes between visits
  • When you need brand identity analysis (colors, fonts, logos)
  • When you need web search results with optional page scraping
  • When you need deep multi-source research on a topic
  • When you need AI-guided scraping to accomplish a goal on a page
  • When you need to monitor a URL for changes over time

API base

All requests go to https://api.webclaw.io/v1/.

Authentication: Authorization: Bearer $WEBCLAW_API_KEY

Endpoints

1. Scrape — extract content from a single URL

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | URL to scrape |
| formats | string[] | ["markdown"] | Output formats: markdown, text, llm, json |
| include_selectors | string[] | [] | CSS selectors to keep (e.g. ["article", ".content"]) |
| exclude_selectors | string[] | [] | CSS selectors to remove (e.g. ["nav", "footer", ".ads"]) |
| only_main_content | bool | false | Extract only the main article/content area |
| no_cache | bool | false | Skip cache, fetch fresh |
| max_cache_age | int | server default | Max acceptable cache age in seconds |

Response:

{
  "url": "https://example.com",
  "metadata": {
    "title": "Example",
    "description": "...",
    "language": "en",
    "word_count": 1234
  },
  "markdown": "# Page Title\n\nContent here...",
  "cache": { "status": "miss" }
}

Format options:

  • markdown — clean markdown, best for general use
  • text — plain text without formatting
  • llm — optimized for LLM consumption: includes page title, URL, and cleaned content with link references. Best for feeding to AI models.
  • json — full extraction result with all metadata

When antibot bypass activates (automatic, no extra config):

{
  "antibot": {
    "bypass": true,
    "elapsed_ms": 3200
  }
}
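
The same call is easy to make from Python with only the standard library. A minimal sketch — the helper names are illustrative, not part of the API; the endpoint, fields, and defaults come from the table above:

```python
import json
import urllib.request

API = "https://api.webclaw.io/v1"

def scrape_payload(url, formats=None, only_main_content=False, no_cache=False):
    """Build a /scrape request body, applying the documented defaults."""
    return {
        "url": url,
        "formats": formats or ["markdown"],
        "only_main_content": only_main_content,
        "no_cache": no_cache,
    }

def scrape(url, api_key, **opts):
    """POST to /scrape and return the parsed JSON response (needs a valid key)."""
    req = urllib.request.Request(
        f"{API}/scrape",
        data=json.dumps(scrape_payload(url, **opts)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```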

2. Crawl — scrape an entire website

Starts an async job. Poll for results.

Start crawl:

curl -X POST https://api.webclaw.io/v1/crawl \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "max_depth": 3,
    "max_pages": 50,
    "use_sitemap": true
  }'

Response: { "job_id": "abc-123", "status": "running" }

Poll status:

curl https://api.webclaw.io/v1/crawl/abc-123 \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Response when complete:

{
  "job_id": "abc-123",
  "status": "completed",
  "total": 47,
  "completed": 45,
  "errors": 2,
  "pages": [
    {
      "url": "https://docs.example.com/intro",
      "markdown": "# Introduction\n...",
      "metadata": { "title": "Intro", "word_count": 500 }
    }
  ]
}

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | Starting URL |
| max_depth | int | 3 | How many links deep to follow |
| max_pages | int | 100 | Maximum pages to crawl |
| use_sitemap | bool | false | Seed URLs from sitemap.xml |
| formats | string[] | ["markdown"] | Output formats per page |
| include_selectors | string[] | [] | CSS selectors to keep |
| exclude_selectors | string[] | [] | CSS selectors to remove |
| only_main_content | bool | false | Main content only |
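
The poll loop above is usually wrapped with a capped backoff so short crawls return quickly and long ones don't hammer the API. A sketch, assuming only the job endpoint and status values shown above (the backoff schedule and helper names are this sketch's own choices):

```python
import json
import time
import urllib.request

API = "https://api.webclaw.io/v1"

def poll_delays(base=2.0, cap=30.0):
    """Yield capped exponential backoff delays: 2, 4, 8, 16, 30, 30, ..."""
    d = base
    while True:
        yield min(d, cap)
        d *= 2

def wait_for_crawl(job_id, api_key, timeout=600):
    """Poll GET /crawl/{job_id} until the status leaves "running"."""
    deadline = time.monotonic() + timeout
    for delay in poll_delays():
        req = urllib.request.Request(
            f"{API}/crawl/{job_id}",
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(req) as resp:
            job = json.load(resp)
        if job["status"] != "running":
            return job
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"crawl {job_id} still running after {timeout}s")
        time.sleep(delay)
```

The same pattern works for the async research endpoint — only the path changes.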

3. Map — discover all URLs on a site

Fast URL discovery without full content extraction.

curl -X POST https://api.webclaw.io/v1/map \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Response:

{
  "url": "https://example.com",
  "count": 142,
  "urls": [
    "https://example.com/about",
    "https://example.com/pricing",
    "https://example.com/docs/intro"
  ]
}
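
A common pattern is to map first, filter the URL list client-side, and then crawl or batch-scrape only the section you care about. A sketch against the response shape above (the docs prefix is just an example):

```python
def urls_under(map_response, prefix):
    """Keep only discovered URLs under a given path prefix, deduplicated and sorted."""
    return sorted({u for u in map_response["urls"] if u.startswith(prefix)})

# Using the sample /map response shown above:
resp = {
    "url": "https://example.com",
    "count": 3,
    "urls": [
        "https://example.com/about",
        "https://example.com/pricing",
        "https://example.com/docs/intro",
    ],
}
docs_urls = urls_under(resp, "https://example.com/docs/")
# docs_urls can now seed a targeted /batch or /crawl request.
```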

4. Batch — scrape multiple URLs in parallel

curl -X POST https://api.webclaw.io/v1/batch \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://a.com",
      "https://b.com",
      "https://c.com"
    ],
    "formats": ["markdown"],
    "concurrency": 5
  }'

Response:

{
  "total": 3,
  "completed": 2,
  "errors": 1,
  "results": [
    { "url": "https://a.com", "markdown": "...", "metadata": {} },
    { "url": "https://b.com", "markdown": "...", "metadata": {} },
    { "url": "https://c.com", "error": "timeout" }
  ]
}
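
Per-URL failures are reported inline rather than failing the whole batch, so callers typically partition the results and retry the failures. A sketch (helper name is illustrative):

```python
def split_batch(batch_response):
    """Separate successful results from failed ones (failures carry an "error" key)."""
    ok = [r for r in batch_response["results"] if "error" not in r]
    failed = [r for r in batch_response["results"] if "error" in r]
    return ok, failed

resp = {
    "results": [
        {"url": "https://a.com", "markdown": "...", "metadata": {}},
        {"url": "https://b.com", "markdown": "...", "metadata": {}},
        {"url": "https://c.com", "error": "timeout"},
    ],
}
ok, failed = split_batch(resp)
retry_urls = [r["url"] for r in failed]  # candidates for a follow-up /batch call
```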

5. Extract — LLM-powered structured extraction

Pull structured data from any page using a JSON schema or plain-text prompt.

With JSON schema:

curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "schema": {
      "type": "object",
      "properties": {
        "plans": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "price": { "type": "string" },
              "features": { "type": "array", "items": { "type": "string" } }
            }
          }
        }
      }
    }
  }'

With prompt:

curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "prompt": "Extract all pricing tiers with names, monthly prices, and key features"
  }'

Response:

{
  "url": "https://example.com/pricing",
  "data": {
    "plans": [
      { "name": "Starter", "price": "$49/mo", "features": ["10k pages", "Email support"] },
      { "name": "Pro", "price": "$99/mo", "features": ["100k pages", "Priority support", "API access"] }
    ]
  }
}
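
Because the data field mirrors your schema, downstream code can consume it without any HTML parsing. A sketch that tabulates the sample response above (function name is illustrative):

```python
def plans_to_rows(extract_response):
    """Flatten extracted pricing plans into (name, price, feature_count) tuples."""
    return [
        (p["name"], p["price"], len(p.get("features", [])))
        for p in extract_response["data"]["plans"]
    ]

# The sample /extract response shown above:
resp = {
    "url": "https://example.com/pricing",
    "data": {
        "plans": [
            {"name": "Starter", "price": "$49/mo",
             "features": ["10k pages", "Email support"]},
            {"name": "Pro", "price": "$99/mo",
             "features": ["100k pages", "Priority support", "API access"]},
        ]
    },
}
rows = plans_to_rows(resp)
```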

6. Summarize — get a quick summary of any page

curl -X POST https://api.webclaw.io/v1/summarize \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/long-article",
    "max_sentences": 3
  }'

Response:

{
  "url": "https://example.com/long-article",
  "summary": "The article discusses... Key findings include... The author concludes that..."
}

7. Diff — detect content changes

Compare current page content against a previous snapshot.

curl -X POST https://api.webclaw.io/v1/diff \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "previous": {
      "markdown": "# Old content...",
      "metadata": { "title": "Old Title" }
    }
  }'

Response:

{
  "url": "https://example.com",
  "status": "changed",
  "diff": "--- previous\n+++ current\n@@ -1 +1 @@\n-# Old content\n+# New content",
  "metadata_changes": [
    { "field": "title", "old": "Old Title", "new": "New Title" }
  ]
}
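
The previous snapshot is just the markdown and metadata from an earlier scrape, so a caller persists those two fields between runs and sends them back. A minimal sketch (helper names are illustrative, not part of the API):

```python
def previous_from_scrape(scrape_result):
    """Build the "previous" field for /diff from an earlier /scrape response."""
    return {
        "markdown": scrape_result["markdown"],
        "metadata": scrape_result["metadata"],
    }

def title_changed(diff_response):
    """True if the diff response reports a change to the title field."""
    return any(c["field"] == "title"
               for c in diff_response.get("metadata_changes", []))

# Typical flow: scrape -> save previous_from_scrape(...) -> later, POST /diff
# with {"url": ..., "previous": saved} -> inspect status / metadata_changes.
```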

8. Brand — extract brand identity

Analyze a website's visual identity: colors, fonts, logo.

curl -X POST https://api.webclaw.io/v1/brand \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Response:

{
  "url": "https://example.com",
  "brand": {
    "colors": [
      { "hex": "#FF6B35", "usage": "primary" },
      { "hex": "#1A1A2E", "usage": "background" }
    ],
    "fonts": ["Inter", "JetBrains Mono"],
    "logo_url": "https://example.com/logo.svg",
    "favicon_url": "https://example.com/favicon.ico"
  }
}

9. Search — web search with optional scraping

Search the web and optionally scrape each result page.

curl -X POST https://api.webclaw.io/v1/search \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "best rust web frameworks 2026",
    "num_results": 5,
    "scrape": true,
    "formats": ["markdown"]
  }'

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| query | string | required | Search query |
| num_results | int | 10 | Number of search results to return |
| scrape | bool | false | Also scrape each result page for full content |
| formats | string[] | ["markdown"] | Output formats when scrape is true |
| country | string | none | Country code for localized results (e.g. "us", "de") |
| lang | string | none | Language code for results (e.g. "en", "fr") |

Response:

{
  "query": "best rust web frameworks 2026",
  "results": [
    {
      "title": "Top Rust Web Frameworks in 2026",
      "url": "https://blog.example.com/rust-frameworks",
      "snippet": "A comprehensive comparison of Axum, Actix, and Rocket...",
      "position": 1,
      "markdown": "# Top Rust Web Frameworks\n\n..."
    },
    {
      "title": "Choosing a Rust Backend Framework",
      "url": "https://dev.to/rust-backends",
      "snippet": "When starting a new Rust web project...",
      "position": 2,
      "markdown": "# Choosing a Rust Backend\n\n..."
    }
  ]
}

The markdown field on each result is present only when scrape: true. When scrape is false, you get titles, URLs, snippets, and positions only.

10. Research — deep multi-source research

Starts an async research job that searches, scrapes, and synthesizes information across multiple sources. Poll for results.

Start research:

curl -X POST https://api.webclaw.io/v1/research \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
    "max_iterations": 5,
    "max_sources": 10,
    "topic": "security",
    "deep": true
  }'

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| query | string | required | Research question or topic |
| max_iterations | int | server default | Maximum research iterations (search-read-analyze cycles) |
| max_sources | int | server default | Maximum number of sources to consult |
| topic | string | none | Topic hint to guide search strategy (e.g. "security", "finance", "engineering") |
| deep | bool | false | Enable deep research mode for more thorough analysis (costs 10 credits instead of 1) |

Response: { "id": "res-abc-123", "status": "running" }

Poll results:

curl https://api.webclaw.io/v1/research/res-abc-123 \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Response when complete:

{
  "id": "res-abc-123",
  "status": "completed",
  "query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
  "report": "# Cloudflare Turnstile Analysis\n\n## Overview\nCloudflare Turnstile is a CAPTCHA replacement that...\n\n## How It Works\n...\n\n## Known Bypass Methods\n...",
  "sources": [
    { "url": "https://developers.cloudflare.com/turnstile/", "title": "Turnstile Documentation" },
    { "url": "https://blog.cloudflare.com/turnstile-ga/", "title": "Turnstile GA Announcement" }
  ],
  "findings": [
    "Turnstile uses browser environment signals and proof-of-work challenges",
    "Managed mode auto-selects challenge difficulty based on visitor risk score",
    "Known bypass approaches include instrumented browser automation"
  ],
  "iterations": 5,
  "elapsed_ms": 34200
}

Status values: running, completed, failed

11. Agent Scrape — AI-guided scraping

Use an AI agent to navigate and interact with a page to accomplish a specific goal. The agent can click, scroll, fill forms, and extract data across multiple steps.

curl -X POST https://api.webclaw.io/v1/agent-scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products",
    "goal": "Find the cheapest laptop with at least 16GB RAM and extract its full specs",
    "max_steps": 10
  }'

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | Starting URL |
| goal | string | required | What the agent should accomplish |
| max_steps | int | server default | Maximum number of actions the agent can take |

Response:

{
  "url": "https://example.com/products",
  "result": "The cheapest laptop with 16GB+ RAM is the ThinkPad E14 Gen 6 at $649. Specs: AMD Ryzen 5 7535U, 16GB DDR4, 512GB SSD, 14\" FHD IPS display, 57Wh battery.",
  "steps": [
    { "action": "navigate", "detail": "Loaded products page" },
    { "action": "click", "detail": "Clicked 'Laptops' category filter" },
    { "action": "click", "detail": "Applied '16GB+' RAM filter" },
    { "action": "click", "detail": "Sorted by price: low to high" },
    { "action": "extract", "detail": "Extracted specs from first matching product" }
  ]
}

12. Watch — monitor a URL for changes

Create persistent monitors that check a URL on a schedule and notify via webhook when content changes.

Create a monitor:

curl -X POST https://api.webclaw.io/v1/watch \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "interval": "0 */6 * * *",
    "webhook_url": "https://hooks.example.com/pricing-changed",
    "formats": ["markdown"]
  }'

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | URL to monitor |
| interval | string | required | Check frequency as cron expression or seconds (e.g. "0 */6 * * *" or "3600") |
| webhook_url | string | none | URL to POST when changes are detected |
| formats | string[] | ["markdown"] | Output formats for snapshots |
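
Since interval accepts either form, a client can validate before creating the monitor. A sketch — the rules here (all-digit string means seconds, otherwise expect a 5-field cron expression) are this sketch's assumptions, not documented API validation:

```python
def normalize_interval(value):
    """Return the interval string for /watch: plain digits are treated as
    seconds; anything else is passed through as a cron expression."""
    s = str(value).strip()
    if s.isdigit():
        if int(s) <= 0:
            raise ValueError("interval in seconds must be positive")
        return s
    if len(s.split()) != 5:
        raise ValueError(f"not a 5-field cron expression: {s!r}")
    return s
```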

Response:

{
  "id": "watch-abc-123",
  "url": "https://example.com/pricing",
  "interval": "0 */6 * * *",
  "webhook_url": "https://hooks.example.com/pricing-changed",
  "formats": ["markdown"],
  "created_at": "2026-03-20T10:00:00Z",
  "last_check": null,
  "status": "active"
}

List all monitors:

curl https://api.webclaw.io/v1/watch \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Response:

{
  "monitors": [
    {
      "id": "watch-abc-123",
      "url": "https://example.com/pricing",
      "interval": "0 */6 * * *",
      "status": "active",
      "last_check": "2026-03-20T16:00:00Z",
      "checks": 4
    }
  ]
}

Get a monitor with snapshots:

curl https://api.webclaw.io/v1/watch/watch-abc-123 \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Response:

{
  "id": "watch-abc-123",
  "url": "https://example.com/pricing",
  "interval": "0 */6 * * *",
  "status": "active",
  "snapshots": [
    {
      "checked_at": "2026-03-20T16:00:00Z",
      "status": "changed",
      "diff": "--- previous\n+++ current\n@@ -5 +5 @@\n-Pro: $99/mo\n+Pro: $119/mo"
    },
    {
      "checked_at": "2026-03-20T10:00:00Z",
      "status": "baseline"
    }
  ]
}

Trigger an immediate check:

curl -X POST https://api.webclaw.io/v1/watch/watch-abc-123/check \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Delete a monitor:

curl -X DELETE https://api.webclaw.io/v1/watch/watch-abc-123 \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Choosing the right format

| Goal | Format | Why |
| --- | --- | --- |
| Read and understand a page | markdown | Clean structure, headings, links preserved |
| Feed content to an AI model | llm | Optimized: includes title + URL header, clean link refs |
| Search or index content | text | Plain text, no formatting noise |
| Programmatic analysis | json | Full metadata, structured data, DOM statistics |

Tips

  • Use llm format when passing content to yourself or another AI — it's specifically optimized for LLM consumption with better context framing.
  • Use only_main_content: true to skip navigation, sidebars, and footers. Reduces noise significantly.
  • Use include_selectors/exclude_selectors for fine-grained control when only_main_content isn't enough.
  • Batch over individual scrapes when fetching multiple URLs — it's faster and more efficient.
  • Use map before crawl to discover the site structure first, then crawl specific sections.
  • Use extract with a JSON schema for reliable structured output (e.g., pricing tables, product specs, contact info).
  • Antibot bypass is automatic — no extra configuration needed. Works on Cloudflare, DataDome, AWS WAF, and JS-rendered SPAs.
  • Use search with scrape: true to get full page content for each search result in one call instead of searching then scraping separately.
  • Use research for complex questions that need multiple sources — it handles the search-read-synthesize loop automatically. Enable deep: true for thorough analysis.
  • Use agent-scrape for interactive pages where data is behind filters, pagination, or form submissions that a simple scrape cannot reach.
  • Use watch for ongoing monitoring — set up a cron schedule and a webhook to get notified when a page changes without polling manually.

Smart Fetch Architecture

The webclaw MCP server uses a local-first approach:

  1. Local fetch — fast, free, no API credits used (~80% of sites)
  2. Cloud API fallback — automatic when bot protection or JS rendering is detected

This means:

  • Most scrapes cost zero credits (local extraction)
  • Cloudflare, DataDome, AWS WAF sites automatically fall back to the cloud API
  • JS-rendered SPAs (React, Next.js, Vue) also fall back automatically
  • Set WEBCLAW_API_KEY to enable cloud fallback
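
The local-vs-cloud decision can be pictured as a simple heuristic. This is only an illustrative sketch — the MCP server's actual detection logic is internal, and the status codes, challenge markers, and SPA check below are assumptions chosen for the example:

```python
BLOCK_STATUSES = {403, 429, 503}
CHALLENGE_MARKERS = ("just a moment", "cf-challenge", "datadome")  # illustrative

def needs_cloud_fallback(status_code, html):
    """Heuristic sketch: fall back to the cloud API when the local fetch
    looks blocked, or when the page is an empty JS-rendered shell."""
    if status_code in BLOCK_STATUSES:
        return True
    body = html.lower()
    if any(marker in body for marker in CHALLENGE_MARKERS):
        return True
    # A tiny body whose content lives in a root mount point is typically a SPA.
    if len(body) < 2000 and ('<div id="root"' in body or '<div id="__next"' in body):
        return True
    return False
```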

vs web_fetch

| Capability | webclaw | web_fetch |
| --- | --- | --- |
| Cloudflare bypass | Automatic (cloud fallback) | Fails (403) |
| JS-rendered pages | Automatic fallback | Readability only |
| Output quality | 20-step optimization pipeline | Basic HTML parsing |
| Structured extraction | LLM-powered, schema-based | None |
| Crawling | Full site crawl with sitemap | Single page only |
| Caching | Built-in, configurable TTL | Per-session |
| Rate limiting | Managed server-side | Client responsibility |

Use web_fetch for simple, fast lookups. Use webclaw when you need reliability, quality, or advanced features.