2026-03-23 18:31:11 +01:00
|
|
|
# Changelog
|
|
|
|
|
|
|
|
|
|
All notable changes to webclaw are documented here.
|
|
|
|
|
Format follows [Keep a Changelog](https://keepachangelog.com/).
|
|
|
|
|
|
2026-03-24 18:44:52 +01:00
|
|
|
## [0.1.2] — 2026-03-24
|
|
|
|
|
|
|
|
|
|
### Fixed
|
|
|
|
|
- Reddit scraping: use plain HTTP client for `.json` endpoint (TLS fingerprinting was getting blocked)
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
2026-03-24 17:25:05 +01:00
|
|
|
## [0.1.1] — 2026-03-24
|
|
|
|
|
|
|
|
|
|
### Fixed
|
|
|
|
|
- MCP server now identifies as `webclaw-mcp` instead of `rmcp` in the MCP handshake
|
|
|
|
|
- Research tool polling caps at 200 iterations (~10 min) instead of looping forever
|
|
|
|
|
- CLI returns non-zero exit codes on errors (invalid format, fetch failures, missing LLM)
|
|
|
|
|
- Text format output strips markdown table syntax (`| --- |` pipes)
|
|
|
|
|
- All MCP tools validate URLs before network calls with clear error messages
|
|
|
|
|
- Cloud API HTTP client has 60s timeout instead of no timeout
|
|
|
|
|
- Local fetch calls timeout after 30s to prevent hanging on slow servers
|
|
|
|
|
- Diff cloud fallback computes actual diff instead of returning raw scrape JSON
|
|
|
|
|
- FetchClient startup failure logs and exits gracefully instead of panicking
|
|
|
|
|
|
|
|
|
|
### Added
|
|
|
|
|
- Upper bounds: batch capped at 100 URLs, crawl capped at 500 pages
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
2026-03-23 18:31:11 +01:00
|
|
|
## [0.1.0] — 2026-03-18
|
|
|
|
|
|
|
|
|
|
First public release. Full-featured web content extraction toolkit for LLMs.
|
|
|
|
|
|
|
|
|
|
### Core Extraction
|
|
|
|
|
- Readability-style content scoring with text density, semantic tags, and link density penalties
|
|
|
|
|
- Exact CSS class token noise filtering with body-force fallback for SPAs
|
|
|
|
|
- HTML → markdown conversion with URL resolution, image alt text, srcset optimization
|
|
|
|
|
- 9-step LLM text optimization pipeline (67% token reduction vs raw HTML)
|
|
|
|
|
- JSON data island extraction (React, Next.js, Contentful CMS)
|
|
|
|
|
- YouTube transcript extraction (title, channel, views, duration, description)
|
|
|
|
|
- Lazy-loaded image detection (data-src, data-lazy-src, data-original)
|
|
|
|
|
- Brand identity extraction (name, colors, fonts, logos, OG image)
|
|
|
|
|
- Content change tracking / diff engine
|
|
|
|
|
- CSS selector filtering (include/exclude)
|
|
|
|
|
|
|
|
|
|
### Fetching & Crawling
|
|
|
|
|
- TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
|
|
|
|
|
- BFS same-origin crawler with configurable depth, concurrency, and delay
|
|
|
|
|
- Sitemap.xml and robots.txt discovery
|
|
|
|
|
- Batch multi-URL concurrent extraction
|
|
|
|
|
- Per-request proxy rotation from pool file
|
|
|
|
|
- Reddit JSON API and LinkedIn post extractors
|
|
|
|
|
|
|
|
|
|
### LLM Integration
|
|
|
|
|
- Provider chain: Ollama (local-first) → OpenAI → Anthropic
|
|
|
|
|
- JSON schema extraction (structured data from pages)
|
|
|
|
|
- Natural language prompt extraction
|
|
|
|
|
- Page summarization with configurable sentence count
|
|
|
|
|
|
|
|
|
|
### PDF
|
|
|
|
|
- PDF text extraction via pdf-extract
|
|
|
|
|
- Auto-detection by Content-Type header
|
|
|
|
|
|
|
|
|
|
### MCP Server
|
|
|
|
|
- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
|
|
|
|
|
- stdio transport for Claude Desktop, Claude Code, and any MCP client
|
|
|
|
|
- Smart Fetch: local extraction first, cloud API fallback
|
|
|
|
|
|
|
|
|
|
### CLI
|
|
|
|
|
- 4 output formats: markdown, JSON, plain text, LLM-optimized
|
|
|
|
|
- CSS selector filtering, crawling, sitemap discovery
|
|
|
|
|
- Brand extraction, content diffing, LLM features
|
|
|
|
|
- Browser profile selection, proxy support, stdin/file input
|
|
|
|
|
|
|
|
|
|
### Infrastructure
|
|
|
|
|
- Docker multi-stage build with Ollama sidecar
|
|
|
|
|
- Deploy script for Hetzner VPS
|