mirror of https://github.com/0xMassi/webclaw.git synced 2026-04-25 00:06:21 +02:00

Valerio c99ec684fa Initial release: webclaw v0.1.0 — web content extraction for LLMs

CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io

2026-03-23 18:31:11 +01:00

2.1 KiB

Raw Blame History

Changelog

All notable changes to webclaw are documented here. Format follows Keep a Changelog.

[0.1.0] — 2026-03-18

First public release. Full-featured web content extraction toolkit for LLMs.

Core Extraction

Readability-style content scoring with text density, semantic tags, and link density penalties
Exact CSS class token noise filtering with body-force fallback for SPAs
HTML → markdown conversion with URL resolution, image alt text, srcset optimization
9-step LLM text optimization pipeline (67% token reduction vs raw HTML)
JSON data island extraction (React, Next.js, Contentful CMS)
YouTube transcript extraction (title, channel, views, duration, description)
Lazy-loaded image detection (data-src, data-lazy-src, data-original)
Brand identity extraction (name, colors, fonts, logos, OG image)
Content change tracking / diff engine
CSS selector filtering (include/exclude)

Fetching & Crawling

TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
BFS same-origin crawler with configurable depth, concurrency, and delay
Sitemap.xml and robots.txt discovery
Batch multi-URL concurrent extraction
Per-request proxy rotation from pool file
Reddit JSON API and LinkedIn post extractors

LLM Integration

Provider chain: Ollama (local-first) → OpenAI → Anthropic
JSON schema extraction (structured data from pages)
Natural language prompt extraction
Page summarization with configurable sentence count

PDF

PDF text extraction via pdf-extract
Auto-detection by Content-Type header

MCP Server

8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
stdio transport for Claude Desktop, Claude Code, and any MCP client
Smart Fetch: local extraction first, cloud API fallback

CLI

4 output formats: markdown, JSON, plain text, LLM-optimized
CSS selector filtering, crawling, sitemap discovery
Brand extraction, content diffing, LLM features
Browser profile selection, proxy support, stdin/file input

Infrastructure

Docker multi-stage build with Ollama sidecar
Deploy script for Hetzner VPS

2.1 KiB Raw Blame History