mirror of https://github.com/0xMassi/webclaw.git synced 2026-04-25 00:06:21 +02:00

Valerio ea9c783bc5 fix: v0.1.1 — MCP identity, timeouts, exit codes, URL validation

Critical:
- MCP server identifies as "webclaw-mcp" instead of "rmcp"
- Research tool poll loop capped at 200 iterations (~10 min)

CLI:
- Non-zero exit codes on errors
- Text format strips markdown table syntax

MCP server:
- URL validation on all tools
- 60s cloud API timeout, 30s local fetch timeout
- Diff cloud fallback computes actual diff
- Batch capped at 100 URLs, crawl at 500 pages
- Graceful startup failure instead of panic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-24 17:25:05 +01:00

2.9 KiB

Raw Blame History

Changelog

All notable changes to webclaw are documented here. Format follows Keep a Changelog.

[0.1.1] — 2026-03-24

Fixed

MCP server now identifies as webclaw-mcp instead of rmcp in the MCP handshake
Research tool polling caps at 200 iterations (~10 min) instead of looping forever
CLI returns non-zero exit codes on errors (invalid format, fetch failures, missing LLM)
Text format output strips markdown table syntax (| --- | pipes)
All MCP tools validate URLs before network calls with clear error messages
Cloud API HTTP client has 60s timeout instead of no timeout
Local fetch calls timeout after 30s to prevent hanging on slow servers
Diff cloud fallback computes actual diff instead of returning raw scrape JSON
FetchClient startup failure logs and exits gracefully instead of panicking

Added

Upper bounds: batch capped at 100 URLs, crawl capped at 500 pages

[0.1.0] — 2026-03-18

First public release. Full-featured web content extraction toolkit for LLMs.

Core Extraction

Readability-style content scoring with text density, semantic tags, and link density penalties
Exact CSS class token noise filtering with body-force fallback for SPAs
HTML → markdown conversion with URL resolution, image alt text, srcset optimization
9-step LLM text optimization pipeline (67% token reduction vs raw HTML)
JSON data island extraction (React, Next.js, Contentful CMS)
YouTube transcript extraction (title, channel, views, duration, description)
Lazy-loaded image detection (data-src, data-lazy-src, data-original)
Brand identity extraction (name, colors, fonts, logos, OG image)
Content change tracking / diff engine
CSS selector filtering (include/exclude)

Fetching & Crawling

TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
BFS same-origin crawler with configurable depth, concurrency, and delay
Sitemap.xml and robots.txt discovery
Batch multi-URL concurrent extraction
Per-request proxy rotation from pool file
Reddit JSON API and LinkedIn post extractors

LLM Integration

Provider chain: Ollama (local-first) → OpenAI → Anthropic
JSON schema extraction (structured data from pages)
Natural language prompt extraction
Page summarization with configurable sentence count

PDF

PDF text extraction via pdf-extract
Auto-detection by Content-Type header

MCP Server

8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
stdio transport for Claude Desktop, Claude Code, and any MCP client
Smart Fetch: local extraction first, cloud API fallback

CLI

4 output formats: markdown, JSON, plain text, LLM-optimized
CSS selector filtering, crawling, sitemap discovery
Brand extraction, content diffing, LLM features
Browser profile selection, proxy support, stdin/file input

Infrastructure

Docker multi-stage build with Ollama sidecar
Deploy script for Hetzner VPS

2.9 KiB Raw Blame History