
# Changelog

All notable changes to webclaw are documented here.
Format follows [Keep a Changelog](https://keepachangelog.com/).

## [0.1.3] — 2026-03-25

### Added
- Crawl streaming: real-time progress on stderr as pages complete (`[2/50] OK https://... (234ms, 1523 words)`)
- Crawl resume/cancel: `--crawl-state <path>` saves progress on Ctrl+C and resumes from where it left off
- MCP server proxy support via `WEBCLAW_PROXY` and `WEBCLAW_PROXY_FILE` env vars

### Changed
- Crawl results now expose visited set and remaining frontier for accurate state persistence
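A minimal sketch of how a visited set and a remaining frontier could be persisted and reloaded for resume. The JSON layout and the helper names `save_state` and `load_state` are assumptions for illustration, not webclaw's actual state file format.

```python
import json
import os
import tempfile
from collections import deque


def save_state(path, visited, frontier):
    """Write the visited set and remaining frontier to a JSON file (hypothetical layout)."""
    state = {"visited": sorted(visited), "frontier": list(frontier)}
    # Write to a temp file in the same directory, then rename, so an
    # interrupted save never leaves a half-written state file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def load_state(path):
    """Return (visited, frontier); both are empty when no state file exists yet."""
    if not os.path.exists(path):
        return set(), deque()
    with open(path, encoding="utf-8") as fh:
        state = json.load(fh)
    return set(state["visited"]), deque(state["frontier"])
```

Storing both collections matters: the frontier alone would lose track of pages already fetched, and the visited set alone would lose the URLs still queued.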

---

## [0.1.2] — 2026-03-25

### Changed
- Default TLS profile switched from Chrome145/Win to Safari26/Mac (highest pass rate across CF-protected sites)
- Plain client fallback: when an impersonated TLS request gets a connection error or a 403, webclaw automatically retries without impersonation (fixes ycombinator.com, producthunt.com, and similar sites)
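The fallback rule above can be sketched as follows. The two fetchers are passed in as plain callables, and `FetchError` and `fetch_with_fallback` are hypothetical names used only for this illustration; the real client code may be structured differently.

```python
class FetchError(Exception):
    """Raised by a fetcher when the connection itself fails (hypothetical)."""


def fetch_with_fallback(url, impersonated_fetch, plain_fetch):
    """Try the impersonated client first; retry once with the plain client
    on a connection error or an HTTP 403."""
    try:
        status, body = impersonated_fetch(url)
    except FetchError:
        # Connection-level failure: retry without TLS impersonation.
        return plain_fetch(url)
    if status == 403:
        # The server rejected the impersonated fingerprint: retry plain.
        return plain_fetch(url)
    return status, body
```

Any other status (including other 4xx or 5xx codes) is returned as-is in this sketch, since the entry only names connection errors and 403 as fallback triggers.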

### Fixed
- Reddit scraping: use plain HTTP client for `.json` endpoint (TLS fingerprinting was getting blocked)

### Added
- YouTube transcript extraction infrastructure in webclaw-core (caption track parsing, timed text XML parser); not yet wired up, pending the cloud API launch

---

## [0.1.1] — 2026-03-24

### Fixed
- MCP server now identifies as `webclaw-mcp` instead of `rmcp` in the MCP handshake
- Research tool polling caps at 200 iterations (~10 min) instead of looping forever
- CLI returns non-zero exit codes on errors (invalid format, fetch failures, missing LLM)
- Text format output strips markdown table syntax (`| --- |` pipes)
- All MCP tools validate URLs before network calls with clear error messages
- Cloud API HTTP client has a 60s timeout instead of no timeout
- Local fetch calls time out after 30s to prevent hanging on slow servers
- Diff cloud fallback computes actual diff instead of returning raw scrape JSON
- FetchClient startup failure logs and exits gracefully instead of panicking
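For the polling cap listed above, a bounded loop might look like the sketch below. The 3-second interval is inferred from 200 iterations taking roughly 10 minutes and is an assumption, as are the function and parameter names.

```python
import time


def poll_until_done(check, max_iterations=200, interval_s=3.0, sleep=time.sleep):
    """Call check() until it returns a non-None result, at most max_iterations times.

    With the assumed defaults, the loop sleeps for 200 * 3s = 600s (about
    10 minutes) in total before giving up.
    """
    for _ in range(max_iterations):
        result = check()
        if result is not None:
            return result
        sleep(interval_s)
    raise TimeoutError(f"no result after {max_iterations} polls")
```

Taking `sleep` as a parameter keeps the loop testable without actually waiting.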

### Added
- Upper bounds: batch capped at 100 URLs, crawl capped at 500 pages

---

## [0.1.0] — 2026-03-18

First public release. Full-featured web content extraction toolkit for LLMs.

### Core Extraction
- Readability-style content scoring with text density, semantic tags, and link density penalties
- Exact CSS class token noise filtering with body-force fallback for SPAs
- HTML → markdown conversion with URL resolution, image alt text, srcset optimization
- 9-step LLM text optimization pipeline (67% token reduction vs raw HTML)
- JSON data island extraction (React, Next.js, Contentful CMS)
- YouTube metadata extraction (title, channel, views, duration, description)
- Lazy-loaded image detection (data-src, data-lazy-src, data-original)
- Brand identity extraction (name, colors, fonts, logos, OG image)
- Content change tracking / diff engine
- CSS selector filtering (include/exclude)

### Fetching & Crawling
- TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
- BFS same-origin crawler with configurable depth, concurrency, and delay
- Sitemap.xml and robots.txt discovery
- Batch multi-URL concurrent extraction
- Per-request proxy rotation from pool file
- Reddit JSON API and LinkedIn post extractors
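As an illustration of the BFS same-origin crawl listed above, here is a minimal sketch over an injected `get_links` function. All names are hypothetical, concurrency and delay are left out, and origin comparison is simplified to scheme plus network location with no default-port normalization.

```python
from collections import deque
from urllib.parse import urlsplit


def same_origin(a, b):
    """Compare scheme and network location (host plus optional port)."""
    ua, ub = urlsplit(a), urlsplit(b)
    return (ua.scheme, ua.netloc) == (ub.scheme, ub.netloc)


def bfs_crawl(start, get_links, max_depth=2, max_pages=500):
    """Return URLs in breadth-first order, staying on the start URL's origin."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # pages at the depth limit are fetched but not expanded
        for link in get_links(url):
            if link not in seen and same_origin(start, link):
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

Marking a URL as seen when it is enqueued, rather than when it is fetched, keeps the same link from being queued twice by different pages.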

### LLM Integration
- Provider chain: Ollama (local-first) → OpenAI → Anthropic
- JSON schema extraction (structured data from pages)
- Natural language prompt extraction
- Page summarization with configurable sentence count
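The provider chain above can be pictured as an ordered fallback. This sketch assumes each provider is a callable that raises on failure; the function name and the choice to fall through on any exception are assumptions, since the changelog does not say what triggers the next provider.

```python
def complete_with_chain(prompt, providers):
    """Try each (name, call) pair in order and return the first success.

    `providers` is an ordered list such as
    [("ollama", ...), ("openai", ...), ("anthropic", ...)].
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # assumption: any failure moves to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the result makes it visible which link in the chain actually answered.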

### PDF
- PDF text extraction via pdf-extract
- Auto-detection by Content-Type header

### MCP Server
- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
- stdio transport for Claude Desktop, Claude Code, and any MCP client
- Smart Fetch: local extraction first, cloud API fallback

### CLI
- 4 output formats: markdown, JSON, plain text, LLM-optimized
- CSS selector filtering, crawling, sitemap discovery
- Brand extraction, content diffing, LLM features
- Browser profile selection, proxy support, stdin/file input

### Infrastructure
- Docker multi-stage build with Ollama sidecar
- Deploy script for Hetzner VPS