# Changelog

All notable changes to webclaw are documented here.
Format follows [Keep a Changelog](https://keepachangelog.com/).

## [0.2.0] — 2026-03-26

### Added
- **DOCX extraction**: auto-detected by Content-Type or URL extension, outputs markdown with headings
- **XLSX/XLS extraction**: spreadsheets converted to markdown tables, multi-sheet support via calamine
- **CSV extraction**: parsed with quoted field handling, output as markdown table
- **HTML output format**: `-f html` returns sanitized HTML from the extracted content
- **Multi-URL watch**: `--watch` now works with `--urls-file` to monitor multiple URLs in parallel
- **Batch + LLM extraction**: `--extract-prompt` and `--extract-json` now work with multiple URLs
- **Scheduled batch watch**: watch multiple URLs with aggregate change reports and per-URL diffs
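
As an illustration of the CSV entry above, the following is a minimal sketch of turning CSV text with quoted fields into a markdown table. It uses Python's standard `csv` module rather than webclaw's own parser, and `csv_to_markdown` is a hypothetical helper name, not part of webclaw.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row):
        # Escape pipes and flatten newlines so cell content cannot break the table,
        # then pad short rows to the full column count.
        cells = [cell.replace("|", "\\|").replace("\n", " ") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [render(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [render(row) for row in body]
    return "\n".join(lines)


# The quoted comma and the doubled quotes are handled by the csv module.
sample = 'name,notes\n"Smith, Jane","said ""hi"""\n'
print(csv_to_markdown(sample))
```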

---

## [0.1.7] — 2026-03-26

### Fixed
- `--only-main-content`, `--include`, and `--exclude` now work in batch mode (#3)

---

## [0.1.6] — 2026-03-26

### Added
- `--watch`: monitor a URL for changes at a configurable interval with diff output
- `--watch-interval`: seconds between checks (default: 300)
- `--on-change`: run a command when changes are detected (diff JSON piped to stdin)
- `--webhook`: POST JSON notifications on crawl/batch complete and watch changes. Auto-formats for Discord and Slack webhooks
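
The watch behaviour described above can be pictured roughly as a poll-and-diff loop. This is only a sketch under stated assumptions: `watch`, its `fetch` callback, and the `checks` limit are hypothetical, and `difflib` stands in for webclaw's own diff engine, whose JSON output format is not shown in this changelog.

```python
import difflib
import time
from typing import Callable, Iterator


def watch(fetch: Callable[[], str], interval: float, checks: int,
          sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Poll `fetch` and yield a unified diff whenever the content changes."""
    previous = fetch()
    for _ in range(checks):
        sleep(interval)
        current = fetch()
        if current != previous:
            diff = difflib.unified_diff(
                previous.splitlines(), current.splitlines(),
                fromfile="before", tofile="after", lineterm="")
            yield "\n".join(diff)
            previous = current


# Simulated page that changes on the third fetch; sleep is stubbed out.
snapshots = iter(["v1", "v1", "v2"])
changes = list(watch(lambda: next(snapshots), interval=300, checks=2,
                     sleep=lambda _: None))
print(changes[0])
```

Injecting `sleep` keeps the loop testable without waiting for the 300-second default interval.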

---

## [0.1.5] — 2026-03-26

### Added
- `--output-dir`: save each page to a separate file instead of stdout. Works with single URL, crawl, and batch modes
- CSV input with custom filenames: `url,filename` format in `--urls-file`
- Root URLs use `hostname/index.ext` to avoid collisions in batch mode
- Subdirectories created automatically from URL path structure
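
One way to picture the filename rules above is the sketch below. It is an assumption-laden illustration, not webclaw's actual mapping: `output_path` is a hypothetical helper, and details such as replacing the URL's own extension are guesses beyond what the entries state.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def output_path(url: str, ext: str = "md") -> PurePosixPath:
    """Map a URL to a relative file path (hypothetical scheme, for illustration)."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p and p not in (".", "..")]
    if not parts:
        # Root URL: namespace by hostname so two sites cannot both claim index.ext.
        return PurePosixPath(parsed.hostname or "unknown") / f"index.{ext}"
    *dirs, leaf = parts
    # Assumption: swap any existing extension for the output format's extension.
    return PurePosixPath(*dirs) / f"{PurePosixPath(leaf).stem}.{ext}"


print(output_path("https://example.com/"))
print(output_path("https://example.com/docs/guide/intro.html"))
```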

---

## [0.1.4] — 2026-03-26

### Added
- QuickJS integration for extracting data from inline JavaScript (extracted word count: NYTimes 1,552 → 4,162, +168%; Wired 1,459 → 9,937, +580%)
- Executes inline `<script>` tags in a sandboxed runtime to capture `window.__*` data blobs
- Parses Next.js RSC flight data (`self.__next_f`) for App Router sites
- Smart text filtering rejects CSS, base64, file paths, and code — only keeps readable prose
- Feature-gated with `quickjs` feature flag (enabled by default, disable for WASM builds)

---

## [0.1.3] — 2026-03-25

### Added
- Crawl streaming: real-time progress on stderr as pages complete (`[2/50] OK https://... (234ms, 1523 words)`)
- Crawl resume/cancel: `--crawl-state <path>` saves progress on Ctrl+C and resumes from where it left off
- MCP server proxy support via `WEBCLAW_PROXY` and `WEBCLAW_PROXY_FILE` env vars

### Changed
- Crawl results now expose visited set and remaining frontier for accurate state persistence
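
To show why persisting both the visited set and the remaining frontier is enough to resume a crawl, here is a small sketch over an in-memory link graph. The `crawl` function, the JSON layout, and the use of `max_pages` to simulate an interruption are all assumptions for illustration; the format of webclaw's `--crawl-state` file is not described in this changelog.

```python
import json
import os
import tempfile
from collections import deque


def crawl(start, links, state_path, max_pages):
    """BFS over `links`, checkpointing visited set and frontier to `state_path`."""
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
        visited, frontier = set(state["visited"]), deque(state["frontier"])
    else:
        visited, frontier = set(), deque([start])

    processed = []
    while frontier and len(processed) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        processed.append(url)
        frontier.extend(n for n in links.get(url, []) if n not in visited)

    # Persist both halves of the state so a later run can pick up exactly here.
    with open(state_path, "w") as f:
        json.dump({"visited": sorted(visited), "frontier": list(frontier)}, f)
    return processed


site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c"], "/c": []}
path = os.path.join(tempfile.mkdtemp(), "crawl-state.json")
first = crawl("/", site, path, max_pages=2)    # stands in for an interrupted run
second = crawl("/", site, path, max_pages=10)  # resumes from the saved frontier
print(first, second)
```

Without the frontier, the second run would know which pages to skip but not which discovered pages were still waiting to be fetched.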

---

## [0.1.2] — 2026-03-25

### Changed
- Default TLS profile switched from Chrome145/Win to Safari26/Mac (highest pass rate across CF-protected sites)
- Plain client fallback: when the impersonated TLS client gets a connection error or a 403, the request is automatically retried without impersonation (fixes ycombinator.com, producthunt.com, and similar sites)
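
The fallback rule above amounts to a single retry with a different client. The sketch below expresses that with two injected callables; `fetch_with_fallback` and the `(status, body)` response shape are hypothetical and do not reflect webclaw's internal client types.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (status code, body), a simplified stand-in


def fetch_with_fallback(url: str,
                        impersonated: Callable[[str], Response],
                        plain: Callable[[str], Response]) -> Response:
    """Try the impersonating client first; retry once with the plain client."""
    try:
        status, body = impersonated(url)
    except ConnectionError:
        return plain(url)
    if status == 403:
        return plain(url)
    return status, body


def blocked(url):
    return 403, "forbidden"


def dropped(url):
    raise ConnectionError("reset by peer")


def plain(url):
    return 200, "ok"


print(fetch_with_fallback("https://example.com", blocked, plain))
print(fetch_with_fallback("https://example.com", dropped, plain))
```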

### Fixed
- Reddit scraping: use plain HTTP client for `.json` endpoint (TLS fingerprinting was getting blocked)

### Added
- YouTube transcript extraction infrastructure in webclaw-core (caption track parsing, timed text XML parser); not yet wired up, to be enabled when the cloud API launches

---

## [0.1.1] — 2026-03-24

### Fixed
- MCP server now identifies as `webclaw-mcp` instead of `rmcp` in the MCP handshake
- Research tool polling caps at 200 iterations (~10 min) instead of looping forever
- CLI returns non-zero exit codes on errors (invalid format, fetch failures, missing LLM)
- Text format output strips markdown table syntax (`| --- |` pipes)
- All MCP tools validate URLs before network calls with clear error messages
- Cloud API HTTP client has a 60s timeout instead of no timeout
- Local fetch calls time out after 30s to prevent hanging on slow servers
- Diff cloud fallback computes actual diff instead of returning raw scrape JSON
- FetchClient startup failure logs and exits gracefully instead of panicking
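
The polling cap in the fixes above can be sketched as a bounded loop. The 3-second interval here is an inference from "200 iterations (~10 min)" (200 × 3 s = 600 s), not a documented value, and `poll_until_done` is a hypothetical name.

```python
import time
from typing import Callable, Optional


def poll_until_done(check: Callable[[], Optional[str]],
                    max_iterations: int = 200,
                    interval: float = 3.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call `check` until it returns a result, giving up after `max_iterations`."""
    for _ in range(max_iterations):
        result = check()
        if result is not None:
            return result
        sleep(interval)
    raise TimeoutError(f"no result after {max_iterations} polls")


# A job that finishes on the third poll; sleep is stubbed so the demo is instant.
answers = iter([None, None, "report ready"])
print(poll_until_done(lambda: next(answers), sleep=lambda _: None))
```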

### Added
- Upper bounds: batch capped at 100 URLs, crawl capped at 500 pages

---

## [0.1.0] — 2026-03-18

First public release. Full-featured web content extraction toolkit for LLMs.

### Core Extraction
- Readability-style content scoring with text density, semantic tags, and link density penalties
- Exact CSS class token noise filtering with body-force fallback for SPAs
- HTML → markdown conversion with URL resolution, image alt text, srcset optimization
- 9-step LLM text optimization pipeline (67% token reduction vs raw HTML)
- JSON data island extraction (React, Next.js, Contentful CMS)
- YouTube video metadata extraction (title, channel, views, duration, description)
- Lazy-loaded image detection (data-src, data-lazy-src, data-original)
- Brand identity extraction (name, colors, fonts, logos, OG image)
- Content change tracking / diff engine
- CSS selector filtering (include/exclude)

### Fetching & Crawling
- TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
- BFS same-origin crawler with configurable depth, concurrency, and delay
- Sitemap.xml and robots.txt discovery
- Batch multi-URL concurrent extraction
- Per-request proxy rotation from pool file
- Reddit JSON API and LinkedIn post extractors

### LLM Integration
- Provider chain: Ollama (local-first) → OpenAI → Anthropic
- JSON schema extraction (structured data from pages)
- Natural language prompt extraction
- Page summarization with configurable sentence count
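
The provider chain above can be read as "try each provider in order and return the first success". A minimal sketch of that reading follows; `run_chain` and the stub providers are hypothetical, and no real Ollama, OpenAI, or Anthropic client calls are made.

```python
from typing import Callable, List, Tuple

Provider = Tuple[str, Callable[[str], str]]


def run_chain(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Return (provider name, answer) from the first provider that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def ollama_stub(prompt):
    # Simulates a local server that is not running.
    raise ConnectionError("connection refused")


def openai_stub(prompt):
    return f"summary of {prompt!r}"


chain = [("ollama", ollama_stub), ("openai", openai_stub)]
print(run_chain("page text", chain))
```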

### PDF
- PDF text extraction via pdf-extract
- Auto-detection by Content-Type header

### MCP Server
- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
- stdio transport for Claude Desktop, Claude Code, and any MCP client
- Smart Fetch: local extraction first, cloud API fallback

### CLI
- 4 output formats: markdown, JSON, plain text, LLM-optimized
- CSS selector filtering, crawling, sitemap discovery
- Brand extraction, content diffing, LLM features
- Browser profile selection, proxy support, stdin/file input

### Infrastructure
- Docker multi-stage build with Ollama sidecar
- Deploy script for Hetzner VPS