webclaw/SKILL.md
Valerio a1b9a55048 chore: add SKILL.md to repo root for skills.sh discoverability
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:27:17 +02:00


---
name: webclaw
description: Web extraction engine with antibot bypass. Scrape, crawl, extract, summarize, search, map, diff, monitor, research, and analyze any URL — including Cloudflare-protected sites. Use when you need reliable web content, the built-in web_fetch fails, or you need structured data extraction from web pages.
homepage: https://webclaw.io
user-invocable: true
metadata:
  openclaw:
    emoji: 🦀
    requires:
      env:
        - WEBCLAW_API_KEY
    primaryEnv: WEBCLAW_API_KEY
    homepage: https://webclaw.io
    install:
      - id: npx
        kind: node
        bins:
          - webclaw-mcp
        label: npx create-webclaw
---

webclaw

High-quality web extraction with automatic antibot bypass. Beats Firecrawl on extraction quality and handles Cloudflare, DataDome, and JS-rendered pages automatically.

When to use this skill

  • Always when you need to fetch web content and want reliable results
  • When web_fetch returns empty/blocked content (403, Cloudflare challenges)
  • When you need structured data extraction (pricing tables, product info)
  • When you need to crawl an entire site or discover all URLs
  • When you need LLM-optimized content (cleaner than raw markdown)
  • When you need to summarize a page without reading the full content
  • When you need to detect content changes between visits
  • When you need brand identity analysis (colors, fonts, logos)
  • When you need web search results with optional page scraping
  • When you need deep multi-source research on a topic
  • When you need AI-guided scraping to accomplish a goal on a page
  • When you need to monitor a URL for changes over time

API base

All requests go to https://api.webclaw.io/v1/.

Authentication: Authorization: Bearer $WEBCLAW_API_KEY

Endpoints

1. Scrape — extract content from a single URL

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | URL to scrape |
| formats | string[] | ["markdown"] | Output formats: markdown, text, llm, json |
| include_selectors | string[] | [] | CSS selectors to keep (e.g. ["article", ".content"]) |
| exclude_selectors | string[] | [] | CSS selectors to remove (e.g. ["nav", "footer", ".ads"]) |
| only_main_content | bool | false | Extract only the main article/content area |
| no_cache | bool | false | Skip cache, fetch fresh |
| max_cache_age | int | server default | Max acceptable cache age in seconds |

Response:

{
  "url": "https://example.com",
  "metadata": {
    "title": "Example",
    "description": "...",
    "language": "en",
    "word_count": 1234
  },
  "markdown": "# Page Title\n\nContent here...",
  "cache": { "status": "miss" }
}

Format options:

  • markdown — clean markdown, best for general use
  • text — plain text without formatting
  • llm — optimized for LLM consumption: includes page title, URL, and cleaned content with link references. Best for feeding to AI models.
  • json — full extraction result with all metadata

When antibot bypass activates (automatic, no extra config):

{
  "antibot": {
    "bypass": true,
    "elapsed_ms": 3200
  }
}
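
The same call is easy to make from Python with only the standard library. A minimal sketch — the helper names are illustrative, not part of the API; the endpoint, fields, and defaults come from the table above:

```python
import json
import urllib.request

API = "https://api.webclaw.io/v1"

def scrape_payload(url, formats=None, only_main_content=False, no_cache=False):
    """Build a /scrape request body, applying the documented defaults."""
    return {
        "url": url,
        "formats": formats or ["markdown"],
        "only_main_content": only_main_content,
        "no_cache": no_cache,
    }

def scrape(url, api_key, **opts):
    """POST to /scrape and return the parsed JSON response (needs a valid key)."""
    req = urllib.request.Request(
        f"{API}/scrape",
        data=json.dumps(scrape_payload(url, **opts)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```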

2. Crawl — scrape an entire website

Starts an async job. Poll for results.

Start crawl:

curl -X POST https://api.webclaw.io/v1/crawl \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "max_depth": 3,
    "max_pages": 50,
    "use_sitemap": true
  }'

Response: { "job_id": "abc-123", "status": "running" }

Poll status:

curl https://api.webclaw.io/v1/crawl/abc-123 \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Response when complete:

{
  "job_id": "abc-123",
  "status": "completed",
  "total": 47,
  "completed": 45,
  "errors": 2,
  "pages": [
    {
      "url": "https://docs.example.com/intro",
      "markdown": "# Introduction\n...",
      "metadata": { "title": "Intro", "word_count": 500 }
    }
  ]
}

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | Starting URL |
| max_depth | int | 3 | How many links deep to follow |
| max_pages | int | 100 | Maximum pages to crawl |
| use_sitemap | bool | false | Seed URLs from sitemap.xml |
| formats | string[] | ["markdown"] | Output formats per page |
| include_selectors | string[] | [] | CSS selectors to keep |
| exclude_selectors | string[] | [] | CSS selectors to remove |
| only_main_content | bool | false | Main content only |
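
The poll loop above is usually wrapped with a capped backoff so short crawls return quickly and long ones don't hammer the API. A sketch, assuming only the job endpoint and status values shown above (the backoff schedule and helper names are this sketch's own choices):

```python
import json
import time
import urllib.request

API = "https://api.webclaw.io/v1"

def poll_delays(base=2.0, cap=30.0):
    """Yield capped exponential backoff delays: 2, 4, 8, 16, 30, 30, ..."""
    d = base
    while True:
        yield min(d, cap)
        d *= 2

def wait_for_crawl(job_id, api_key, timeout=600):
    """Poll GET /crawl/{job_id} until the status leaves "running"."""
    deadline = time.monotonic() + timeout
    for delay in poll_delays():
        req = urllib.request.Request(
            f"{API}/crawl/{job_id}",
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(req) as resp:
            job = json.load(resp)
        if job["status"] != "running":
            return job
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"crawl {job_id} still running after {timeout}s")
        time.sleep(delay)
```

The same pattern works for the async research endpoint — only the path changes.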

3. Map — discover all URLs on a site

Fast URL discovery without full content extraction.

curl -X POST https://api.webclaw.io/v1/map \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Response:

{
  "url": "https://example.com",
  "count": 142,
  "urls": [
    "https://example.com/about",
    "https://example.com/pricing",
    "https://example.com/docs/intro"
  ]
}
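
A common pattern is to map first, filter the URL list client-side, and then crawl or batch-scrape only the section you care about. A sketch against the response shape above (the docs prefix is just an example):

```python
def urls_under(map_response, prefix):
    """Keep only discovered URLs under a given path prefix, deduplicated and sorted."""
    return sorted({u for u in map_response["urls"] if u.startswith(prefix)})

# Using the sample /map response shown above:
resp = {
    "url": "https://example.com",
    "count": 3,
    "urls": [
        "https://example.com/about",
        "https://example.com/pricing",
        "https://example.com/docs/intro",
    ],
}
docs_urls = urls_under(resp, "https://example.com/docs/")
# docs_urls can now seed a targeted /batch or /crawl request.
```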

4. Batch — scrape multiple URLs in parallel

curl -X POST https://api.webclaw.io/v1/batch \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://a.com",
      "https://b.com",
      "https://c.com"
    ],
    "formats": ["markdown"],
    "concurrency": 5
  }'

Response:

{
  "total": 3,
  "completed": 2,
  "errors": 1,
  "results": [
    { "url": "https://a.com", "markdown": "...", "metadata": {} },
    { "url": "https://b.com", "markdown": "...", "metadata": {} },
    { "url": "https://c.com", "error": "timeout" }
  ]
}
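
Per-URL failures are reported inline rather than failing the whole batch, so callers typically partition the results and retry the failures. A sketch (helper name is illustrative):

```python
def split_batch(batch_response):
    """Separate successful results from failed ones (failures carry an "error" key)."""
    ok = [r for r in batch_response["results"] if "error" not in r]
    failed = [r for r in batch_response["results"] if "error" in r]
    return ok, failed

resp = {
    "results": [
        {"url": "https://a.com", "markdown": "...", "metadata": {}},
        {"url": "https://b.com", "markdown": "...", "metadata": {}},
        {"url": "https://c.com", "error": "timeout"},
    ],
}
ok, failed = split_batch(resp)
retry_urls = [r["url"] for r in failed]  # candidates for a follow-up /batch call
```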

5. Extract — LLM-powered structured extraction

Pull structured data from any page using a JSON schema or plain-text prompt.

With JSON schema:

curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "schema": {
      "type": "object",
      "properties": {
        "plans": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "price": { "type": "string" },
              "features": { "type": "array", "items": { "type": "string" } }
            }
          }
        }
      }
    }
  }'

With prompt:

curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "prompt": "Extract all pricing tiers with names, monthly prices, and key features"
  }'

Response:

{
  "url": "https://example.com/pricing",
  "data": {
    "plans": [
      { "name": "Starter", "price": "$49/mo", "features": ["10k pages", "Email support"] },
      { "name": "Pro", "price": "$99/mo", "features": ["100k pages", "Priority support", "API access"] }
    ]
  }
}
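
Because the data field mirrors your schema, downstream code can consume it without any HTML parsing. A sketch that tabulates the sample response above (function name is illustrative):

```python
def plans_to_rows(extract_response):
    """Flatten extracted pricing plans into (name, price, feature_count) tuples."""
    return [
        (p["name"], p["price"], len(p.get("features", [])))
        for p in extract_response["data"]["plans"]
    ]

# The sample /extract response shown above:
resp = {
    "url": "https://example.com/pricing",
    "data": {
        "plans": [
            {"name": "Starter", "price": "$49/mo",
             "features": ["10k pages", "Email support"]},
            {"name": "Pro", "price": "$99/mo",
             "features": ["100k pages", "Priority support", "API access"]},
        ]
    },
}
rows = plans_to_rows(resp)
```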

6. Summarize — get a quick summary of any page

curl -X POST https://api.webclaw.io/v1/summarize \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/long-article",
    "max_sentences": 3
  }'

Response:

{
  "url": "https://example.com/long-article",
  "summary": "The article discusses... Key findings include... The author concludes that..."
}

7. Diff — detect content changes

Compare current page content against a previous snapshot.

curl -X POST https://api.webclaw.io/v1/diff \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "previous": {
      "markdown": "# Old content...",
      "metadata": { "title": "Old Title" }
    }
  }'

Response:

{
  "url": "https://example.com",
  "status": "changed",
  "diff": "--- previous\n+++ current\n@@ -1 +1 @@\n-# Old content\n+# New content",
  "metadata_changes": [
    { "field": "title", "old": "Old Title", "new": "New Title" }
  ]
}
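
The previous snapshot is just the markdown and metadata from an earlier scrape, so a caller persists those two fields between runs and sends them back. A minimal sketch (helper names are illustrative, not part of the API):

```python
def previous_from_scrape(scrape_result):
    """Build the "previous" field for /diff from an earlier /scrape response."""
    return {
        "markdown": scrape_result["markdown"],
        "metadata": scrape_result["metadata"],
    }

def title_changed(diff_response):
    """True if the diff response reports a change to the title field."""
    return any(c["field"] == "title"
               for c in diff_response.get("metadata_changes", []))

# Typical flow: scrape -> save previous_from_scrape(...) -> later, POST /diff
# with {"url": ..., "previous": saved} -> inspect status / metadata_changes.
```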

8. Brand — extract brand identity

Analyze a website's visual identity: colors, fonts, logo.

curl -X POST https://api.webclaw.io/v1/brand \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Response:

{
  "url": "https://example.com",
  "brand": {
    "colors": [
      { "hex": "#FF6B35", "usage": "primary" },
      { "hex": "#1A1A2E", "usage": "background" }
    ],
    "fonts": ["Inter", "JetBrains Mono"],
    "logo_url": "https://example.com/logo.svg",
    "favicon_url": "https://example.com/favicon.ico"
  }
}

9. Search — web search with optional scraping

Search the web and optionally scrape each result page.

curl -X POST https://api.webclaw.io/v1/search \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "best rust web frameworks 2026",
    "num_results": 5,
    "scrape": true,
    "formats": ["markdown"]
  }'

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| query | string | required | Search query |
| num_results | int | 10 | Number of search results to return |
| scrape | bool | false | Also scrape each result page for full content |
| formats | string[] | ["markdown"] | Output formats when scrape is true |
| country | string | none | Country code for localized results (e.g. "us", "de") |
| lang | string | none | Language code for results (e.g. "en", "fr") |

Response:

{
  "query": "best rust web frameworks 2026",
  "results": [
    {
      "title": "Top Rust Web Frameworks in 2026",
      "url": "https://blog.example.com/rust-frameworks",
      "snippet": "A comprehensive comparison of Axum, Actix, and Rocket...",
      "position": 1,
      "markdown": "# Top Rust Web Frameworks\n\n..."
    },
    {
      "title": "Choosing a Rust Backend Framework",
      "url": "https://dev.to/rust-backends",
      "snippet": "When starting a new Rust web project...",
      "position": 2,
      "markdown": "# Choosing a Rust Backend\n\n..."
    }
  ]
}

The markdown field on each result is present only when scrape: true. When scrape is false, you get titles, URLs, snippets, and positions only.

10. Research — deep multi-source research

Starts an async research job that searches, scrapes, and synthesizes information across multiple sources. Poll for results.

Start research:

curl -X POST https://api.webclaw.io/v1/research \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
    "max_iterations": 5,
    "max_sources": 10,
    "topic": "security",
    "deep": true
  }'

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| query | string | required | Research question or topic |
| max_iterations | int | server default | Maximum research iterations (search-read-analyze cycles) |
| max_sources | int | server default | Maximum number of sources to consult |
| topic | string | none | Topic hint to guide search strategy (e.g. "security", "finance", "engineering") |
| deep | bool | false | Enable deep research mode for more thorough analysis (costs 10 credits instead of 1) |

Response: { "id": "res-abc-123", "status": "running" }

Poll results:

curl https://api.webclaw.io/v1/research/res-abc-123 \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Response when complete:

{
  "id": "res-abc-123",
  "status": "completed",
  "query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
  "report": "# Cloudflare Turnstile Analysis\n\n## Overview\nCloudflare Turnstile is a CAPTCHA replacement that...\n\n## How It Works\n...\n\n## Known Bypass Methods\n...",
  "sources": [
    { "url": "https://developers.cloudflare.com/turnstile/", "title": "Turnstile Documentation" },
    { "url": "https://blog.cloudflare.com/turnstile-ga/", "title": "Turnstile GA Announcement" }
  ],
  "findings": [
    "Turnstile uses browser environment signals and proof-of-work challenges",
    "Managed mode auto-selects challenge difficulty based on visitor risk score",
    "Known bypass approaches include instrumented browser automation"
  ],
  "iterations": 5,
  "elapsed_ms": 34200
}

Status values: running, completed, failed

11. Agent Scrape — AI-guided scraping

Use an AI agent to navigate and interact with a page to accomplish a specific goal. The agent can click, scroll, fill forms, and extract data across multiple steps.

curl -X POST https://api.webclaw.io/v1/agent-scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products",
    "goal": "Find the cheapest laptop with at least 16GB RAM and extract its full specs",
    "max_steps": 10
  }'

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | Starting URL |
| goal | string | required | What the agent should accomplish |
| max_steps | int | server default | Maximum number of actions the agent can take |

Response:

{
  "url": "https://example.com/products",
  "result": "The cheapest laptop with 16GB+ RAM is the ThinkPad E14 Gen 6 at $649. Specs: AMD Ryzen 5 7535U, 16GB DDR4, 512GB SSD, 14\" FHD IPS display, 57Wh battery.",
  "steps": [
    { "action": "navigate", "detail": "Loaded products page" },
    { "action": "click", "detail": "Clicked 'Laptops' category filter" },
    { "action": "click", "detail": "Applied '16GB+' RAM filter" },
    { "action": "click", "detail": "Sorted by price: low to high" },
    { "action": "extract", "detail": "Extracted specs from first matching product" }
  ]
}

12. Watch — monitor a URL for changes

Create persistent monitors that check a URL on a schedule and notify via webhook when content changes.

Create a monitor:

curl -X POST https://api.webclaw.io/v1/watch \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "interval": "0 */6 * * *",
    "webhook_url": "https://hooks.example.com/pricing-changed",
    "formats": ["markdown"]
  }'

Request fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | URL to monitor |
| interval | string | required | Check frequency as cron expression or seconds (e.g. "0 */6 * * *" or "3600") |
| webhook_url | string | none | URL to POST when changes are detected |
| formats | string[] | ["markdown"] | Output formats for snapshots |
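
Since interval accepts either form, a client can validate before creating the monitor. A sketch — the rules here (all-digit string means seconds, otherwise expect a 5-field cron expression) are this sketch's assumptions, not documented API validation:

```python
def normalize_interval(value):
    """Return the interval string for /watch: plain digits are treated as
    seconds; anything else is passed through as a cron expression."""
    s = str(value).strip()
    if s.isdigit():
        if int(s) <= 0:
            raise ValueError("interval in seconds must be positive")
        return s
    if len(s.split()) != 5:
        raise ValueError(f"not a 5-field cron expression: {s!r}")
    return s
```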

Response:

{
  "id": "watch-abc-123",
  "url": "https://example.com/pricing",
  "interval": "0 */6 * * *",
  "webhook_url": "https://hooks.example.com/pricing-changed",
  "formats": ["markdown"],
  "created_at": "2026-03-20T10:00:00Z",
  "last_check": null,
  "status": "active"
}

List all monitors:

curl https://api.webclaw.io/v1/watch \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Response:

{
  "monitors": [
    {
      "id": "watch-abc-123",
      "url": "https://example.com/pricing",
      "interval": "0 */6 * * *",
      "status": "active",
      "last_check": "2026-03-20T16:00:00Z",
      "checks": 4
    }
  ]
}

Get a monitor with snapshots:

curl https://api.webclaw.io/v1/watch/watch-abc-123 \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Response:

{
  "id": "watch-abc-123",
  "url": "https://example.com/pricing",
  "interval": "0 */6 * * *",
  "status": "active",
  "snapshots": [
    {
      "checked_at": "2026-03-20T16:00:00Z",
      "status": "changed",
      "diff": "--- previous\n+++ current\n@@ -5 +5 @@\n-Pro: $99/mo\n+Pro: $119/mo"
    },
    {
      "checked_at": "2026-03-20T10:00:00Z",
      "status": "baseline"
    }
  ]
}

Trigger an immediate check:

curl -X POST https://api.webclaw.io/v1/watch/watch-abc-123/check \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Delete a monitor:

curl -X DELETE https://api.webclaw.io/v1/watch/watch-abc-123 \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Choosing the right format

| Goal | Format | Why |
| --- | --- | --- |
| Read and understand a page | markdown | Clean structure, headings, links preserved |
| Feed content to an AI model | llm | Optimized: includes title + URL header, clean link refs |
| Search or index content | text | Plain text, no formatting noise |
| Programmatic analysis | json | Full metadata, structured data, DOM statistics |

Tips

  • Use llm format when passing content to yourself or another AI — it's specifically optimized for LLM consumption with better context framing.
  • Use only_main_content: true to skip navigation, sidebars, and footers. Reduces noise significantly.
  • Use include_selectors/exclude_selectors for fine-grained control when only_main_content isn't enough.
  • Batch over individual scrapes when fetching multiple URLs — it's faster and more efficient.
  • Use map before crawl to discover the site structure first, then crawl specific sections.
  • Use extract with a JSON schema for reliable structured output (e.g., pricing tables, product specs, contact info).
  • Antibot bypass is automatic — no extra configuration needed. Works on Cloudflare, DataDome, AWS WAF, and JS-rendered SPAs.
  • Use search with scrape: true to get full page content for each search result in one call instead of searching then scraping separately.
  • Use research for complex questions that need multiple sources — it handles the search-read-synthesize loop automatically. Enable deep: true for thorough analysis.
  • Use agent-scrape for interactive pages where data is behind filters, pagination, or form submissions that a simple scrape cannot reach.
  • Use watch for ongoing monitoring — set up a cron schedule and a webhook to get notified when a page changes without polling manually.

Smart Fetch Architecture

The webclaw MCP server uses a local-first approach:

  1. Local fetch — fast, free, no API credits used (~80% of sites)
  2. Cloud API fallback — automatic when bot protection or JS rendering is detected

This means:

  • Most scrapes cost zero credits (local extraction)
  • Cloudflare, DataDome, AWS WAF sites automatically fall back to the cloud API
  • JS-rendered SPAs (React, Next.js, Vue) also fall back automatically
  • Set WEBCLAW_API_KEY to enable cloud fallback
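
The local-vs-cloud decision can be pictured as a simple heuristic. This is only an illustrative sketch — the MCP server's actual detection logic is internal, and the status codes, challenge markers, and SPA check below are assumptions chosen for the example:

```python
BLOCK_STATUSES = {403, 429, 503}
CHALLENGE_MARKERS = ("just a moment", "cf-challenge", "datadome")  # illustrative

def needs_cloud_fallback(status_code, html):
    """Heuristic sketch: fall back to the cloud API when the local fetch
    looks blocked, or when the page is an empty JS-rendered shell."""
    if status_code in BLOCK_STATUSES:
        return True
    body = html.lower()
    if any(marker in body for marker in CHALLENGE_MARKERS):
        return True
    # A tiny body whose content lives in a root mount point is typically a SPA.
    if len(body) < 2000 and ('<div id="root"' in body or '<div id="__next"' in body):
        return True
    return False
```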

vs web_fetch

| Capability | webclaw | web_fetch |
| --- | --- | --- |
| Cloudflare bypass | Automatic (cloud fallback) | Fails (403) |
| JS-rendered pages | Automatic fallback | Readability only |
| Output quality | 20-step optimization pipeline | Basic HTML parsing |
| Structured extraction | LLM-powered, schema-based | None |
| Crawling | Full site crawl with sitemap | Single page only |
| Caching | Built-in, configurable TTL | Per-session |
| Rate limiting | Managed server-side | Client responsibility |

Use web_fetch for simple, fast lookups. Use webclaw when you need reliability, quality, or advanced features.