add SKILL.md for Claude Code skill integration

Adds the webclaw skill definition for Claude Code / Smithery. Located at skill/SKILL.md with proper frontmatter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-25 00:06:21 +02:00 · 2026-03-27 17:58:01 +01:00 · 2026-03-27 17:58:01 +01:00 · 2f6255fe6f
commit 2f6255fe6f
parent a6be233df9
1 changed files with 634 additions and 0 deletions
--- a/skill/SKILL.md
+++ b/skill/SKILL.md
@ -0,0 +1,634 @@
+---
+name: webclaw
+description: Web extraction engine with antibot bypass. Scrape, crawl, extract, summarize, search, map, diff, monitor, research, and analyze any URL — including Cloudflare-protected sites. Use when you need reliable web content, the built-in web_fetch fails, or you need structured data extraction from web pages.
+homepage: https://webclaw.io
+user-invocable: true
+metadata: {"openclaw":{"emoji":"🦀","requires":{"env":["WEBCLAW_API_KEY"]},"primaryEnv":"WEBCLAW_API_KEY","homepage":"https://webclaw.io","install":[{"id":"npx","kind":"node","bins":["webclaw-mcp"],"label":"npx create-webclaw"}]}}
+---
+
+# webclaw
+
+High-quality web extraction with automatic antibot bypass. Beats Firecrawl on extraction quality and handles Cloudflare, DataDome, and JS-rendered pages automatically.
+
+## When to use this skill
+
+- **Always** when you need to fetch web content and want reliable results
+- When `web_fetch` returns empty/blocked content (403, Cloudflare challenges)
+- When you need structured data extraction (pricing tables, product info)
+- When you need to crawl an entire site or discover all URLs
+- When you need LLM-optimized content (cleaner than raw markdown)
+- When you need to summarize a page without reading the full content
+- When you need to detect content changes between visits
+- When you need brand identity analysis (colors, fonts, logos)
+- When you need web search results with optional page scraping
+- When you need deep multi-source research on a topic
+- When you need AI-guided scraping to accomplish a goal on a page
+- When you need to monitor a URL for changes over time
+
+## API base
+
+All requests go to `https://api.webclaw.io/v1/`.
+
+Authentication: `Authorization: Bearer $WEBCLAW_API_KEY`
+
+## Endpoints
+
+### 1. Scrape — extract content from a single URL
+
+```bash
+curl -X POST https://api.webclaw.io/v1/scrape \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com",
+    "formats": ["markdown"],
+    "only_main_content": true
+  }'
+```
+
+**Request fields:**
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `url` | string | required | URL to scrape |
+| `formats` | string[] | `["markdown"]` | Output formats: `markdown`, `text`, `llm`, `json` |
+| `include_selectors` | string[] | `[]` | CSS selectors to keep (e.g. `["article", ".content"]`) |
+| `exclude_selectors` | string[] | `[]` | CSS selectors to remove (e.g. `["nav", "footer", ".ads"]`) |
+| `only_main_content` | bool | `false` | Extract only the main article/content area |
+| `no_cache` | bool | `false` | Skip cache, fetch fresh |
+| `max_cache_age` | int | server default | Max acceptable cache age in seconds |
+
+**Response:**
+
+```json
+{
+  "url": "https://example.com",
+  "metadata": {
+    "title": "Example",
+    "description": "...",
+    "language": "en",
+    "word_count": 1234
+  },
+  "markdown": "# Page Title\n\nContent here...",
+  "cache": { "status": "miss" }
+}
+```
+
+**Format options:**
+- `markdown` — clean markdown, best for general use
+- `text` — plain text without formatting
+- `llm` — optimized for LLM consumption: includes page title, URL, and cleaned content with link references. Best for feeding to AI models.
+- `json` — full extraction result with all metadata
+
+**When antibot bypass activates** (automatic, no extra config):
+```json
+{
+  "antibot": {
+    "bypass": true,
+    "elapsed_ms": 3200
+  }
+}
+```
+
+### 2. Crawl — scrape an entire website
+
+Starts an async job. Poll for results.
+
+**Start crawl:**
+```bash
+curl -X POST https://api.webclaw.io/v1/crawl \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://docs.example.com",
+    "max_depth": 3,
+    "max_pages": 50,
+    "use_sitemap": true
+  }'
+```
+
+Response: `{ "job_id": "abc-123", "status": "running" }`
+
+**Poll status:**
+```bash
+curl https://api.webclaw.io/v1/crawl/abc-123 \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY"
+```
+
+Response when complete:
+```json
+{
+  "job_id": "abc-123",
+  "status": "completed",
+  "total": 47,
+  "completed": 45,
+  "errors": 2,
+  "pages": [
+    {
+      "url": "https://docs.example.com/intro",
+      "markdown": "# Introduction\n...",
+      "metadata": { "title": "Intro", "word_count": 500 }
+    }
+  ]
+}
+```
+
+**Request fields:**
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `url` | string | required | Starting URL |
+| `max_depth` | int | `3` | How many links deep to follow |
+| `max_pages` | int | `100` | Maximum pages to crawl |
+| `use_sitemap` | bool | `false` | Seed URLs from sitemap.xml |
+| `formats` | string[] | `["markdown"]` | Output formats per page |
+| `include_selectors` | string[] | `[]` | CSS selectors to keep |
+| `exclude_selectors` | string[] | `[]` | CSS selectors to remove |
+| `only_main_content` | bool | `false` | Main content only |
+
+### 3. Map — discover all URLs on a site
+
+Fast URL discovery without full content extraction.
+
+```bash
+curl -X POST https://api.webclaw.io/v1/map \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"url": "https://example.com"}'
+```
+
+Response:
+```json
+{
+  "url": "https://example.com",
+  "count": 142,
+  "urls": [
+    "https://example.com/about",
+    "https://example.com/pricing",
+    "https://example.com/docs/intro"
+  ]
+}
+```
+
+### 4. Batch — scrape multiple URLs in parallel
+
+```bash
+curl -X POST https://api.webclaw.io/v1/batch \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "urls": [
+      "https://a.com",
+      "https://b.com",
+      "https://c.com"
+    ],
+    "formats": ["markdown"],
+    "concurrency": 5
+  }'
+```
+
+Response:
+```json
+{
+  "total": 3,
+  "completed": 3,
+  "errors": 0,
+  "results": [
+    { "url": "https://a.com", "markdown": "...", "metadata": {} },
+    { "url": "https://b.com", "markdown": "...", "metadata": {} },
+    { "url": "https://c.com", "error": "timeout" }
+  ]
+}
+```
+
+### 5. Extract — LLM-powered structured extraction
+
+Pull structured data from any page using a JSON schema or plain-text prompt.
+
+**With JSON schema:**
+```bash
+curl -X POST https://api.webclaw.io/v1/extract \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com/pricing",
+    "schema": {
+      "type": "object",
+      "properties": {
+        "plans": {
+          "type": "array",
+          "items": {
+            "type": "object",
+            "properties": {
+              "name": { "type": "string" },
+              "price": { "type": "string" },
+              "features": { "type": "array", "items": { "type": "string" } }
+            }
+          }
+        }
+      }
+    }
+  }'
+```
+
+**With prompt:**
+```bash
+curl -X POST https://api.webclaw.io/v1/extract \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com/pricing",
+    "prompt": "Extract all pricing tiers with names, monthly prices, and key features"
+  }'
+```
+
+Response:
+```json
+{
+  "url": "https://example.com/pricing",
+  "data": {
+    "plans": [
+      { "name": "Starter", "price": "$49/mo", "features": ["10k pages", "Email support"] },
+      { "name": "Pro", "price": "$99/mo", "features": ["100k pages", "Priority support", "API access"] }
+    ]
+  }
+}
+```
+
+### 6. Summarize — get a quick summary of any page
+
+```bash
+curl -X POST https://api.webclaw.io/v1/summarize \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com/long-article",
+    "max_sentences": 3
+  }'
+```
+
+Response:
+```json
+{
+  "url": "https://example.com/long-article",
+  "summary": "The article discusses... Key findings include... The author concludes that..."
+}
+```
+
+### 7. Diff — detect content changes
+
+Compare current page content against a previous snapshot.
+
+```bash
+curl -X POST https://api.webclaw.io/v1/diff \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com",
+    "previous": {
+      "markdown": "# Old content...",
+      "metadata": { "title": "Old Title" }
+    }
+  }'
+```
+
+Response:
+```json
+{
+  "url": "https://example.com",
+  "status": "changed",
+  "diff": "--- previous\n+++ current\n@@ -1 +1 @@\n-# Old content\n+# New content",
+  "metadata_changes": [
+    { "field": "title", "old": "Old Title", "new": "New Title" }
+  ]
+}
+```
+
+### 8. Brand — extract brand identity
+
+Analyze a website's visual identity: colors, fonts, logo.
+
+```bash
+curl -X POST https://api.webclaw.io/v1/brand \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"url": "https://example.com"}'
+```
+
+Response:
+```json
+{
+  "url": "https://example.com",
+  "brand": {
+    "colors": [
+      { "hex": "#FF6B35", "usage": "primary" },
+      { "hex": "#1A1A2E", "usage": "background" }
+    ],
+    "fonts": ["Inter", "JetBrains Mono"],
+    "logo_url": "https://example.com/logo.svg",
+    "favicon_url": "https://example.com/favicon.ico"
+  }
+}
+```
+
+### 9. Search — web search with optional scraping
+
+Search the web and optionally scrape each result page.
+
+```bash
+curl -X POST https://api.webclaw.io/v1/search \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "query": "best rust web frameworks 2026",
+    "num_results": 5,
+    "scrape": true,
+    "formats": ["markdown"]
+  }'
+```
+
+**Request fields:**
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `query` | string | required | Search query |
+| `num_results` | int | `10` | Number of search results to return |
+| `scrape` | bool | `false` | Also scrape each result page for full content |
+| `formats` | string[] | `["markdown"]` | Output formats when `scrape` is true |
+| `country` | string | none | Country code for localized results (e.g. `"us"`, `"de"`) |
+| `lang` | string | none | Language code for results (e.g. `"en"`, `"fr"`) |
+
+**Response:**
+
+```json
+{
+  "query": "best rust web frameworks 2026",
+  "results": [
+    {
+      "title": "Top Rust Web Frameworks in 2026",
+      "url": "https://blog.example.com/rust-frameworks",
+      "snippet": "A comprehensive comparison of Axum, Actix, and Rocket...",
+      "position": 1,
+      "markdown": "# Top Rust Web Frameworks\n\n..."
+    },
+    {
+      "title": "Choosing a Rust Backend Framework",
+      "url": "https://dev.to/rust-backends",
+      "snippet": "When starting a new Rust web project...",
+      "position": 2,
+      "markdown": "# Choosing a Rust Backend\n\n..."
+    }
+  ]
+}
+```
+
+The `markdown` field on each result is only present when `scrape: true`. Without it, you get titles, URLs, snippets, and positions only.
+
+### 10. Research — deep multi-source research
+
+Starts an async research job that searches, scrapes, and synthesizes information across multiple sources. Poll for results.
+
+**Start research:**
+```bash
+curl -X POST https://api.webclaw.io/v1/research \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
+    "max_iterations": 5,
+    "max_sources": 10,
+    "topic": "security",
+    "deep": true
+  }'
+```
+
+**Request fields:**
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `query` | string | required | Research question or topic |
+| `max_iterations` | int | server default | Maximum research iterations (search-read-analyze cycles) |
+| `max_sources` | int | server default | Maximum number of sources to consult |
+| `topic` | string | none | Topic hint to guide search strategy (e.g. `"security"`, `"finance"`, `"engineering"`) |
+| `deep` | bool | `false` | Enable deep research mode for more thorough analysis (costs 10 credits instead of 1) |
+
+Response: `{ "id": "res-abc-123", "status": "running" }`
+
+**Poll results:**
+```bash
+curl https://api.webclaw.io/v1/research/res-abc-123 \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY"
+```
+
+Response when complete:
+```json
+{
+  "id": "res-abc-123",
+  "status": "completed",
+  "query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
+  "report": "# Cloudflare Turnstile Analysis\n\n## Overview\nCloudflare Turnstile is a CAPTCHA replacement that...\n\n## How It Works\n...\n\n## Known Bypass Methods\n...",
+  "sources": [
+    { "url": "https://developers.cloudflare.com/turnstile/", "title": "Turnstile Documentation" },
+    { "url": "https://blog.cloudflare.com/turnstile-ga/", "title": "Turnstile GA Announcement" }
+  ],
+  "findings": [
+    "Turnstile uses browser environment signals and proof-of-work challenges",
+    "Managed mode auto-selects challenge difficulty based on visitor risk score",
+    "Known bypass approaches include instrumented browser automation"
+  ],
+  "iterations": 5,
+  "elapsed_ms": 34200
+}
+```
+
+**Status values:** `running`, `completed`, `failed`
+
+### 11. Agent Scrape — AI-guided scraping
+
+Use an AI agent to navigate and interact with a page to accomplish a specific goal. The agent can click, scroll, fill forms, and extract data across multiple steps.
+
+```bash
+curl -X POST https://api.webclaw.io/v1/agent-scrape \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com/products",
+    "goal": "Find the cheapest laptop with at least 16GB RAM and extract its full specs",
+    "max_steps": 10
+  }'
+```
+
+**Request fields:**
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `url` | string | required | Starting URL |
+| `goal` | string | required | What the agent should accomplish |
+| `max_steps` | int | server default | Maximum number of actions the agent can take |
+
+**Response:**
+
+```json
+{
+  "url": "https://example.com/products",
+  "result": "The cheapest laptop with 16GB+ RAM is the ThinkPad E14 Gen 6 at $649. Specs: AMD Ryzen 5 7535U, 16GB DDR4, 512GB SSD, 14\" FHD IPS display, 57Wh battery.",
+  "steps": [
+    { "action": "navigate", "detail": "Loaded products page" },
+    { "action": "click", "detail": "Clicked 'Laptops' category filter" },
+    { "action": "click", "detail": "Applied '16GB+' RAM filter" },
+    { "action": "click", "detail": "Sorted by price: low to high" },
+    { "action": "extract", "detail": "Extracted specs from first matching product" }
+  ]
+}
+```
+
+### 12. Watch — monitor a URL for changes
+
+Create persistent monitors that check a URL on a schedule and notify via webhook when content changes.
+
+**Create a monitor:**
+```bash
+curl -X POST https://api.webclaw.io/v1/watch \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com/pricing",
+    "interval": "0 */6 * * *",
+    "webhook_url": "https://hooks.example.com/pricing-changed",
+    "formats": ["markdown"]
+  }'
+```
+
+**Request fields:**
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `url` | string | required | URL to monitor |
+| `interval` | string | required | Check frequency as cron expression or seconds (e.g. `"0 */6 * * *"` or `"3600"`) |
+| `webhook_url` | string | none | URL to POST when changes are detected |
+| `formats` | string[] | `["markdown"]` | Output formats for snapshots |
+
+Response:
+```json
+{
+  "id": "watch-abc-123",
+  "url": "https://example.com/pricing",
+  "interval": "0 */6 * * *",
+  "webhook_url": "https://hooks.example.com/pricing-changed",
+  "formats": ["markdown"],
+  "created_at": "2026-03-20T10:00:00Z",
+  "last_check": null,
+  "status": "active"
+}
+```
+
+**List all monitors:**
+```bash
+curl https://api.webclaw.io/v1/watch \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY"
+```
+
+Response:
+```json
+{
+  "monitors": [
+    {
+      "id": "watch-abc-123",
+      "url": "https://example.com/pricing",
+      "interval": "0 */6 * * *",
+      "status": "active",
+      "last_check": "2026-03-20T16:00:00Z",
+      "checks": 4
+    }
+  ]
+}
+```
+
+**Get a monitor with snapshots:**
+```bash
+curl https://api.webclaw.io/v1/watch/watch-abc-123 \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY"
+```
+
+Response:
+```json
+{
+  "id": "watch-abc-123",
+  "url": "https://example.com/pricing",
+  "interval": "0 */6 * * *",
+  "status": "active",
+  "snapshots": [
+    {
+      "checked_at": "2026-03-20T16:00:00Z",
+      "status": "changed",
+      "diff": "--- previous\n+++ current\n@@ -5 +5 @@\n-Pro: $99/mo\n+Pro: $119/mo"
+    },
+    {
+      "checked_at": "2026-03-20T10:00:00Z",
+      "status": "baseline"
+    }
+  ]
+}
+```
+
+**Trigger an immediate check:**
+```bash
+curl -X POST https://api.webclaw.io/v1/watch/watch-abc-123/check \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY"
+```
+
+**Delete a monitor:**
+```bash
+curl -X DELETE https://api.webclaw.io/v1/watch/watch-abc-123 \
+  -H "Authorization: Bearer $WEBCLAW_API_KEY"
+```
+
+## Choosing the right format
+
+| Goal | Format | Why |
+|------|--------|-----|
+| Read and understand a page | `markdown` | Clean structure, headings, links preserved |
+| Feed content to an AI model | `llm` | Optimized: includes title + URL header, clean link refs |
+| Search or index content | `text` | Plain text, no formatting noise |
+| Programmatic analysis | `json` | Full metadata, structured data, DOM statistics |
+
+## Tips
+
+- **Use `llm` format** when passing content to yourself or another AI — it's specifically optimized for LLM consumption with better context framing.
+- **Use `only_main_content: true`** to skip navigation, sidebars, and footers. Reduces noise significantly.
+- **Use `include_selectors`/`exclude_selectors`** for fine-grained control when `only_main_content` isn't enough.
+- **Batch over individual scrapes** when fetching multiple URLs — it's faster and more efficient.
+- **Use `map` before `crawl`** to discover the site structure first, then crawl specific sections.
+- **Use `extract` with a JSON schema** for reliable structured output (e.g., pricing tables, product specs, contact info).
+- **Antibot bypass is automatic** — no extra configuration needed. Works on Cloudflare, DataDome, AWS WAF, and JS-rendered SPAs.
+- **Use `search` with `scrape: true`** to get full page content for each search result in one call instead of searching then scraping separately.
+- **Use `research` for complex questions** that need multiple sources — it handles the search-read-synthesize loop automatically. Enable `deep: true` for thorough analysis.
+- **Use `agent-scrape` for interactive pages** where data is behind filters, pagination, or form submissions that a simple scrape cannot reach.
+- **Use `watch` for ongoing monitoring** — set up a cron schedule and a webhook to get notified when a page changes without polling manually.
+
+## Smart Fetch Architecture
+
+The webclaw MCP server uses a **local-first** approach:
+
+1. **Local fetch** — fast, free, no API credits used (~80% of sites)
+2. **Cloud API fallback** — automatic when bot protection or JS rendering is detected
+
+This means:
+- Most scrapes cost zero credits (local extraction)
+- Cloudflare, DataDome, AWS WAF sites automatically fall back to the cloud API
+- JS-rendered SPAs (React, Next.js, Vue) also fall back automatically
+- Set `WEBCLAW_API_KEY` to enable cloud fallback
+
+## vs web_fetch
+
+| | webclaw | web_fetch |
+|---|---------|-----------|
+| Cloudflare bypass | Automatic (cloud fallback) | Fails (403) |
+| JS-rendered pages | Automatic fallback | Readability only |
+| Output quality | 20-step optimization pipeline | Basic HTML parsing |
+| Structured extraction | LLM-powered, schema-based | None |
+| Crawling | Full site crawl with sitemap | Single page only |
+| Caching | Built-in, configurable TTL | Per-session |
+| Rate limiting | Managed server-side | Client responsibility |
+
+Use `web_fetch` for simple, fast lookups. Use webclaw when you need reliability, quality, or advanced features.