Mirror of https://github.com/0xMassi/webclaw.git, synced 2026-04-25 00:06:21 +02:00.

Commit c99ec684fa: Initial release: webclaw v0.1.0 — web content extraction for LLMs

CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io

79 changed files with 24074 additions and 0 deletions
CHANGELOG.md | 53 lines | new file

# Changelog

All notable changes to webclaw are documented here.

Format follows [Keep a Changelog](https://keepachangelog.com/).

## [0.1.0] — 2026-03-18

First public release. Full-featured web content extraction toolkit for LLMs.

### Core Extraction

- Readability-style content scoring with text density, semantic tags, and link-density penalties
- Noise filtering by exact CSS class-token matching, with a body-force fallback for SPAs
- HTML → markdown conversion with URL resolution, image alt text, and srcset optimization
- 9-step LLM text-optimization pipeline (67% token reduction vs. raw HTML)
- JSON data-island extraction (React, Next.js, Contentful CMS)
- YouTube transcript and metadata extraction (title, channel, views, duration, description)
- Lazy-loaded image detection (data-src, data-lazy-src, data-original)
- Brand identity extraction (name, colors, fonts, logos, OG image)
- Content change tracking / diff engine
- CSS selector filtering (include/exclude)

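As a sketch of the link-density penalty idea above — the weights and thresholds here are invented for illustration, not webclaw's actual tuning:

```rust
// Illustrative only: a toy version of readability-style link-density
// scoring. Real scorers also weigh semantic tags and text density.

/// Fraction of a block's text that sits inside links.
fn link_density(total_chars: usize, linked_chars: usize) -> f64 {
    if total_chars == 0 {
        return 1.0; // empty blocks score as pure noise
    }
    linked_chars as f64 / total_chars as f64
}

/// Score a block: reward raw text length, penalize link-heavy blocks
/// (navigation bars, "related posts" widgets).
fn score_block(total_chars: usize, linked_chars: usize) -> f64 {
    let density = link_density(total_chars, linked_chars);
    total_chars as f64 * (1.0 - density)
}

fn main() {
    // An article paragraph: long, few linked characters.
    let article = score_block(800, 40);
    // A nav bar: short, almost entirely links.
    let nav = score_block(120, 110);
    assert!(article > nav);
    println!("article: {article:.0}, nav: {nav:.0}");
}
```

The penalty makes boilerplate self-defeating: the more of a block that is links, the closer its score falls to zero regardless of length.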
### Fetching & Crawling

- TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
- BFS same-origin crawler with configurable depth, concurrency, and delay
- Sitemap.xml and robots.txt discovery
- Concurrent batch extraction across multiple URLs
- Per-request proxy rotation from a pool file
- Reddit JSON API and LinkedIn post extractors

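The BFS traversal with a depth limit can be sketched as follows; `fetch_links` is a stand-in closure (real HTTP, concurrency, delays, and robots.txt handling are omitted), and the hard-coded origin prefix is an assumption for the example:

```rust
// Sketch of a BFS same-origin crawl with a depth limit, runnable offline.
use std::collections::{HashMap, HashSet, VecDeque};

fn bfs_crawl(
    start: &str,
    max_depth: usize,
    fetch_links: impl Fn(&str) -> Vec<String>,
) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut visited = Vec::new();

    seen.insert(start.to_string());
    queue.push_back((start.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        visited.push(url.clone());
        if depth >= max_depth {
            continue; // stop expanding past the depth limit
        }
        for next in fetch_links(&url) {
            // Same-origin check: keep only links under the start host.
            if next.starts_with("https://example.com/") && seen.insert(next.clone()) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    visited
}

fn main() {
    // Toy site graph standing in for real HTTP fetches.
    let site: HashMap<&str, Vec<&str>> = HashMap::from([
        ("https://example.com/", vec!["https://example.com/a", "https://other.com/x"]),
        ("https://example.com/a", vec!["https://example.com/b"]),
        ("https://example.com/b", vec![]),
    ]);
    let pages = bfs_crawl("https://example.com/", 1, |url| {
        site.get(url)
            .map(|v| v.iter().map(|s| s.to_string()).collect())
            .unwrap_or_default()
    });
    // Depth 1 reaches the root and /a but not /b; the off-origin link is skipped.
    assert_eq!(pages, vec!["https://example.com/", "https://example.com/a"]);
    println!("crawled: {pages:?}");
}
```

The `seen` set deduplicates before enqueueing, so each page is fetched at most once even when many pages link to it.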
### LLM Integration

- Provider chain: Ollama (local-first) → OpenAI → Anthropic
- JSON schema extraction (structured data from pages)
- Natural-language prompt extraction
- Page summarization with configurable sentence count

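The local-first fallback chain amounts to "try each backend in order, return the first success." A minimal sketch — the provider names mirror the changelog, but the closures are stubs, not real API clients:

```rust
// Sketch of a local-first provider chain with fallback on error.

fn first_available(
    providers: Vec<(&str, Box<dyn Fn(&str) -> Result<String, String>>)>,
    prompt: &str,
) -> Result<(String, String), String> {
    for (name, call) in providers {
        match call(prompt) {
            Ok(answer) => return Ok((name.to_string(), answer)),
            Err(_) => continue, // fall through to the next provider
        }
    }
    Err("no provider available".to_string())
}

fn main() {
    let chain: Vec<(&str, Box<dyn Fn(&str) -> Result<String, String>>)> = vec![
        // Ollama is "down" in this toy run, so the chain falls back to OpenAI.
        ("ollama", Box::new(|_: &str| Err("connection refused".to_string()))),
        ("openai", Box::new(|p: &str| Ok(format!("summary of: {p}")))),
        ("anthropic", Box::new(|p: &str| Ok(format!("summary of: {p}")))),
    ];
    let (provider, answer) = first_available(chain, "hello").unwrap();
    assert_eq!(provider, "openai");
    println!("{provider}: {answer}");
}
```

Ordering the chain local-first means no tokens leave the machine unless the local model is unreachable.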
### PDF

- PDF text extraction via pdf-extract
- Auto-detection via the Content-Type header

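Content-Type auto-detection can be sketched like this; checking the header first and falling back to the `%PDF-` magic bytes is an assumption of this example, not necessarily webclaw's exact order:

```rust
// Sketch of PDF auto-detection: trust the Content-Type header, then
// fall back to the standard `%PDF-` magic bytes at the start of the body.

fn looks_like_pdf(content_type: Option<&str>, body: &[u8]) -> bool {
    if let Some(ct) = content_type {
        // The header may carry parameters, e.g. "application/pdf; name=x.pdf".
        if ct.trim().to_ascii_lowercase().starts_with("application/pdf") {
            return true;
        }
    }
    body.starts_with(b"%PDF-")
}

fn main() {
    assert!(looks_like_pdf(Some("application/pdf"), b""));
    assert!(looks_like_pdf(None, b"%PDF-1.7 ..."));
    assert!(!looks_like_pdf(Some("text/html"), b"<html>"));
    println!("pdf detection ok");
}
```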
### MCP Server

- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
- stdio transport for Claude Desktop, Claude Code, and any MCP client
- Smart Fetch: local extraction first, with cloud API fallback

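Over the stdio transport, MCP clients speak JSON-RPC 2.0, one message per line, and invoke server tools via the `tools/call` method. A sketch of building such a request — the `scrape` tool name comes from the list above, but the exact argument schema is an assumption:

```rust
// Sketch of an MCP `tools/call` request as sent over the stdio transport.

fn tool_call_request(id: u64, tool: &str, url: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/call","params":{{"name":"{tool}","arguments":{{"url":"{url}"}}}}}}"#
    )
}

fn main() {
    let msg = tool_call_request(1, "scrape", "https://example.com");
    assert!(msg.contains("\"method\":\"tools/call\""));
    // A real server would read this line from stdin and write the JSON-RPC
    // response (the tool's extracted content) back to stdout.
    println!("{msg}");
}
```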
### CLI

- 4 output formats: markdown, JSON, plain text, LLM-optimized
- CSS selector filtering, crawling, sitemap discovery
- Brand extraction, content diffing, LLM features
- Browser profile selection, proxy support, stdin/file input

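The stdin/file input handling typically follows the common CLI convention where the argument `-` means "read from stdin"; whether webclaw uses `-` exactly this way is an assumption, but the pattern itself is standard:

```rust
// Sketch of stdin-or-file input resolution for a CLI.
use std::io::Read;

fn read_input(arg: &str) -> std::io::Result<String> {
    if arg == "-" {
        // Consume piped input, e.g. `curl … | app -`.
        let mut buf = String::new();
        std::io::stdin().read_to_string(&mut buf)?;
        Ok(buf)
    } else {
        // Otherwise treat the argument as a file path.
        std::fs::read_to_string(arg)
    }
}

fn main() {
    match read_input("Cargo.toml") {
        Ok(text) => println!("read {} bytes", text.len()),
        Err(e) => eprintln!("read failed: {e}"),
    }
}
```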
### Infrastructure

- Docker multi-stage build with Ollama sidecar
- Deploy script for Hetzner VPS

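A multi-stage build for a Rust binary follows a well-known shape; this is an illustrative sketch only — the image tags, binary path, and the `OLLAMA_HOST` sidecar wiring are assumptions, not the repository's actual Dockerfile:

```dockerfile
# Stage 1: compile in a full Rust toolchain image.
FROM rust:1.85 AS builder
WORKDIR /src
COPY . .
RUN cargo build --release

# Stage 2: ship only the binary on a slim base, keeping the image small.
FROM debian:bookworm-slim
COPY --from=builder /src/target/release/webclaw /usr/local/bin/webclaw
# The Ollama sidecar runs as a separate container; point the binary at it.
ENV OLLAMA_HOST=http://ollama:11434
ENTRYPOINT ["webclaw"]
```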