Mirror of https://github.com/0xMassi/webclaw.git, synced 2026-04-25 00:06:21 +02:00.

Commit c99ec684fa: Initial release: webclaw v0.1.0 — web content extraction for LLMs

CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io

79 changed files with 24074 additions and 0 deletions
CHANGELOG.md | 53 lines | new file

# Changelog

All notable changes to webclaw are documented here.

Format follows [Keep a Changelog](https://keepachangelog.com/).

## [0.1.0] — 2026-03-18

First public release. Full-featured web content extraction toolkit for LLMs.

### Core Extraction

- Readability-style content scoring with text density, semantic tags, and link-density penalties
- Noise filtering by exact CSS class-token matching, with a body-force fallback for SPAs
- HTML → markdown conversion with URL resolution, image alt text, and srcset optimization
- 9-step LLM text-optimization pipeline (67% token reduction vs. raw HTML)
- JSON data-island extraction (React, Next.js, Contentful CMS)
- YouTube transcript and metadata extraction (title, channel, views, duration, description)
- Lazy-loaded image detection (data-src, data-lazy-src, data-original)
- Brand identity extraction (name, colors, fonts, logos, OG image)
- Content change tracking / diff engine
- CSS selector filtering (include/exclude)

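As a sketch of the link-density penalty idea above — the weights and thresholds here are invented for illustration, not webclaw's actual tuning:

```rust
// Illustrative only: a toy version of readability-style link-density
// scoring. Real scorers also weigh semantic tags and text density.

/// Fraction of a block's text that sits inside links.
fn link_density(total_chars: usize, linked_chars: usize) -> f64 {
    if total_chars == 0 {
        return 1.0; // empty blocks score as pure noise
    }
    linked_chars as f64 / total_chars as f64
}

/// Score a block: reward raw text length, penalize link-heavy blocks
/// (navigation bars, "related posts" widgets).
fn score_block(total_chars: usize, linked_chars: usize) -> f64 {
    let density = link_density(total_chars, linked_chars);
    total_chars as f64 * (1.0 - density)
}

fn main() {
    // An article paragraph: long, few linked characters.
    let article = score_block(800, 40);
    // A nav bar: short, almost entirely links.
    let nav = score_block(120, 110);
    assert!(article > nav);
    println!("article: {article:.0}, nav: {nav:.0}");
}
```

The penalty makes boilerplate self-defeating: the more of a block that is links, the closer its score falls to zero regardless of length.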
### Fetching & Crawling

- TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
- BFS same-origin crawler with configurable depth, concurrency, and delay
- Sitemap.xml and robots.txt discovery
- Concurrent batch extraction across multiple URLs
- Per-request proxy rotation from a pool file
- Reddit JSON API and LinkedIn post extractors

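The BFS traversal with a depth limit can be sketched as follows; `fetch_links` is a stand-in closure (real HTTP, concurrency, delays, and robots.txt handling are omitted), and the hard-coded origin prefix is an assumption for the example:

```rust
// Sketch of a BFS same-origin crawl with a depth limit, runnable offline.
use std::collections::{HashMap, HashSet, VecDeque};

fn bfs_crawl(
    start: &str,
    max_depth: usize,
    fetch_links: impl Fn(&str) -> Vec<String>,
) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut visited = Vec::new();

    seen.insert(start.to_string());
    queue.push_back((start.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        visited.push(url.clone());
        if depth >= max_depth {
            continue; // stop expanding past the depth limit
        }
        for next in fetch_links(&url) {
            // Same-origin check: keep only links under the start host.
            if next.starts_with("https://example.com/") && seen.insert(next.clone()) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    visited
}

fn main() {
    // Toy site graph standing in for real HTTP fetches.
    let site: HashMap<&str, Vec<&str>> = HashMap::from([
        ("https://example.com/", vec!["https://example.com/a", "https://other.com/x"]),
        ("https://example.com/a", vec!["https://example.com/b"]),
        ("https://example.com/b", vec![]),
    ]);
    let pages = bfs_crawl("https://example.com/", 1, |url| {
        site.get(url)
            .map(|v| v.iter().map(|s| s.to_string()).collect())
            .unwrap_or_default()
    });
    // Depth 1 reaches the root and /a but not /b; the off-origin link is skipped.
    assert_eq!(pages, vec!["https://example.com/", "https://example.com/a"]);
    println!("crawled: {pages:?}");
}
```

The `seen` set deduplicates before enqueueing, so each page is fetched at most once even when many pages link to it.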
### LLM Integration

- Provider chain: Ollama (local-first) → OpenAI → Anthropic
- JSON schema extraction (structured data from pages)
- Natural-language prompt extraction
- Page summarization with configurable sentence count

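The local-first fallback chain amounts to "try each backend in order, return the first success." A minimal sketch — the provider names mirror the changelog, but the closures are stubs, not real API clients:

```rust
// Sketch of a local-first provider chain with fallback on error.

fn first_available(
    providers: Vec<(&str, Box<dyn Fn(&str) -> Result<String, String>>)>,
    prompt: &str,
) -> Result<(String, String), String> {
    for (name, call) in providers {
        match call(prompt) {
            Ok(answer) => return Ok((name.to_string(), answer)),
            Err(_) => continue, // fall through to the next provider
        }
    }
    Err("no provider available".to_string())
}

fn main() {
    let chain: Vec<(&str, Box<dyn Fn(&str) -> Result<String, String>>)> = vec![
        // Ollama is "down" in this toy run, so the chain falls back to OpenAI.
        ("ollama", Box::new(|_: &str| Err("connection refused".to_string()))),
        ("openai", Box::new(|p: &str| Ok(format!("summary of: {p}")))),
        ("anthropic", Box::new(|p: &str| Ok(format!("summary of: {p}")))),
    ];
    let (provider, answer) = first_available(chain, "hello").unwrap();
    assert_eq!(provider, "openai");
    println!("{provider}: {answer}");
}
```

Ordering the chain local-first means no tokens leave the machine unless the local model is unreachable.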
### PDF

- PDF text extraction via pdf-extract
- Auto-detection via the Content-Type header

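Content-Type auto-detection can be sketched like this; checking the header first and falling back to the `%PDF-` magic bytes is an assumption of this example, not necessarily webclaw's exact order:

```rust
// Sketch of PDF auto-detection: trust the Content-Type header, then
// fall back to the standard `%PDF-` magic bytes at the start of the body.

fn looks_like_pdf(content_type: Option<&str>, body: &[u8]) -> bool {
    if let Some(ct) = content_type {
        // The header may carry parameters, e.g. "application/pdf; name=x.pdf".
        if ct.trim().to_ascii_lowercase().starts_with("application/pdf") {
            return true;
        }
    }
    body.starts_with(b"%PDF-")
}

fn main() {
    assert!(looks_like_pdf(Some("application/pdf"), b""));
    assert!(looks_like_pdf(None, b"%PDF-1.7 ..."));
    assert!(!looks_like_pdf(Some("text/html"), b"<html>"));
    println!("pdf detection ok");
}
```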
### MCP Server

- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
- stdio transport for Claude Desktop, Claude Code, and any MCP client
- Smart Fetch: local extraction first, with cloud API fallback

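Over the stdio transport, MCP clients speak JSON-RPC 2.0, one message per line, and invoke server tools via the `tools/call` method. A sketch of building such a request — the `scrape` tool name comes from the list above, but the exact argument schema is an assumption:

```rust
// Sketch of an MCP `tools/call` request as sent over the stdio transport.

fn tool_call_request(id: u64, tool: &str, url: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/call","params":{{"name":"{tool}","arguments":{{"url":"{url}"}}}}}}"#
    )
}

fn main() {
    let msg = tool_call_request(1, "scrape", "https://example.com");
    assert!(msg.contains("\"method\":\"tools/call\""));
    // A real server would read this line from stdin and write the JSON-RPC
    // response (the tool's extracted content) back to stdout.
    println!("{msg}");
}
```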
### CLI

- 4 output formats: markdown, JSON, plain text, LLM-optimized
- CSS selector filtering, crawling, sitemap discovery
- Brand extraction, content diffing, LLM features
- Browser profile selection, proxy support, stdin/file input

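The stdin/file input handling typically follows the common CLI convention where the argument `-` means "read from stdin"; whether webclaw uses `-` exactly this way is an assumption, but the pattern itself is standard:

```rust
// Sketch of stdin-or-file input resolution for a CLI.
use std::io::Read;

fn read_input(arg: &str) -> std::io::Result<String> {
    if arg == "-" {
        // Consume piped input, e.g. `curl … | app -`.
        let mut buf = String::new();
        std::io::stdin().read_to_string(&mut buf)?;
        Ok(buf)
    } else {
        // Otherwise treat the argument as a file path.
        std::fs::read_to_string(arg)
    }
}

fn main() {
    match read_input("Cargo.toml") {
        Ok(text) => println!("read {} bytes", text.len()),
        Err(e) => eprintln!("read failed: {e}"),
    }
}
```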
### Infrastructure

- Docker multi-stage build with Ollama sidecar
- Deploy script for Hetzner VPS

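A multi-stage build for a Rust binary follows a well-known shape; this is an illustrative sketch only — the image tags, binary path, and the `OLLAMA_HOST` sidecar wiring are assumptions, not the repository's actual Dockerfile:

```dockerfile
# Stage 1: compile in a full Rust toolchain image.
FROM rust:1.85 AS builder
WORKDIR /src
COPY . .
RUN cargo build --release

# Stage 2: ship only the binary on a slim base, keeping the image small.
FROM debian:bookworm-slim
COPY --from=builder /src/target/release/webclaw /usr/local/bin/webclaw
# The Ollama sidecar runs as a separate container; point the binary at it.
ENV OLLAMA_HOST=http://ollama:11434
ENTRYPOINT ["webclaw"]
```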