webclaw/CHANGELOG.md
Valerio ea14848772 feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM
Document extraction:
- DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml)
- XLSX/XLS: markdown tables with multi-sheet support (via calamine)
- CSV: quoted field handling, markdown table output
- All auto-detected by Content-Type header or URL extension

New features:
- -f html output format (sanitized HTML)
- Multi-URL watch: --urls-file + --watch monitors all URLs in parallel
- Batch + LLM: --extract-prompt/--extract-json works with multiple URLs
- Mixed batch: HTML pages + DOCX + XLSX + CSV in one command

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:28:23 +01:00

Changelog

All notable changes to webclaw are documented here. Format follows Keep a Changelog.

[0.2.0] — 2026-03-26

Added

  • DOCX extraction: auto-detected by Content-Type or URL extension, outputs markdown with headings
  • XLSX/XLS extraction: spreadsheets converted to markdown tables, multi-sheet support via calamine
  • CSV extraction: parsed with quoted field handling, output as markdown table
  • HTML output format: -f html returns sanitized HTML from the extracted content
  • Multi-URL watch: --watch now works with --urls-file to monitor multiple URLs in parallel
  • Batch + LLM extraction: --extract-prompt and --extract-json now work with multiple URLs
  • Scheduled batch watch: watch multiple URLs with aggregate change reports and per-URL diffs
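
As an illustration of the CSV entry above, here is a minimal sketch of turning CSV text with quoted fields into a markdown table. It is written in Python for brevity and is not webclaw's implementation; the function name `csv_to_markdown` and the choice to treat the first row as the header are assumptions.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a markdown table, treating the first row as the header."""
    # csv.reader handles quoted fields, embedded commas, and doubled quotes.
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes and flatten newlines so cell content cannot break the table.
        cells = [c.replace("|", "\\|").replace("\n", " ") for c in row]
        # Pad short rows so every line has the same number of columns.
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

A field such as `"Doe, Jane"` stays in a single cell because the comma sits inside quotes.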

[0.1.7] — 2026-03-26

Fixed

  • --only-main-content, --include, and --exclude now work in batch mode (#3)

[0.1.6] — 2026-03-26

Added

  • --watch: monitor a URL for changes at a configurable interval with diff output
  • --watch-interval: seconds between checks (default: 300)
  • --on-change: run a command when changes are detected (diff JSON piped to stdin)
  • --webhook: POST JSON notifications on crawl/batch complete and watch changes. Auto-formats for Discord and Slack webhooks
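
The watch behaviour described above can be sketched roughly as a polling loop. This Python sketch is an assumption-laden illustration, not webclaw's code: `watch`, `fetch`, `on_change`, and `max_checks` are hypothetical names, and the diff is a plain unified text diff rather than the diff JSON that webclaw pipes to `--on-change`. Only the 300-second default interval comes from the changelog.

```python
import difflib
import time


def watch(fetch, interval=300, on_change=print, max_checks=None, sleep=time.sleep):
    """Poll fetch() and call on_change(diff_text) whenever the content changes."""
    previous = fetch()
    checks = 0
    while max_checks is None or checks < max_checks:
        sleep(interval)
        current = fetch()
        if current != previous:
            # Build a unified diff between the last snapshot and the new one.
            diff = difflib.unified_diff(
                previous.splitlines(),
                current.splitlines(),
                fromfile="previous",
                tofile="current",
                lineterm="",
            )
            on_change("\n".join(diff))
            previous = current
        checks += 1
```

Passing `sleep` and `fetch` in as parameters keeps the loop testable without real network calls or delays.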

[0.1.5] — 2026-03-26

Added

  • --output-dir: save each page to a separate file instead of stdout. Works with single URL, crawl, and batch modes
  • CSV input with custom filenames: url,filename format in --urls-file
  • Root URLs use hostname/index.ext to avoid collisions in batch mode
  • Subdirectories created automatically from URL path structure
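
One way to picture the file naming rules above is the following Python sketch. It is a guess at the mapping, not webclaw's actual logic: `output_path` is a hypothetical helper, the `md` default extension is an assumption, and so is the choice to omit the hostname for non-root URLs.

```python
from urllib.parse import urlparse


def output_path(url: str, ext: str = "md") -> str:
    """Map a URL to a relative file path under the output directory."""
    parsed = urlparse(url)
    # Drop empty and dot segments so the path cannot escape the output directory.
    segments = [s for s in parsed.path.split("/") if s and s not in (".", "..")]
    if not segments:
        # Root URL: hostname/index.ext keeps different sites from colliding.
        return f"{parsed.hostname}/index.{ext}"
    # Non-root URL: path segments become subdirectories, the last one the file name.
    return "/".join(segments) + f".{ext}"
```

With this mapping, `https://example.com/` becomes `example.com/index.md` and `https://example.com/docs/api/auth` becomes `docs/api/auth.md`.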

[0.1.4] — 2026-03-26

Added

  • QuickJS integration for extracting data from inline JavaScript (+168% more content on NYTimes, +580% on Wired)
  • Executes inline <script> tags in a sandboxed runtime to capture window.__* data blobs
  • Parses Next.js RSC flight data (self.__next_f) for App Router sites
  • Smart text filtering rejects CSS, base64, file paths, and code — only keeps readable prose
  • Feature-gated with quickjs feature flag (enabled by default, disable for WASM builds)

[0.1.3] — 2026-03-25

Added

  • Crawl streaming: real-time progress on stderr as pages complete ([2/50] OK https://... (234ms, 1523 words))
  • Crawl resume/cancel: --crawl-state <path> saves progress on Ctrl+C and resumes from where it left off
  • MCP server proxy support via WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars

Changed

  • Crawl results now expose visited set and remaining frontier for accurate state persistence
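
To show why exposing both the visited set and the remaining frontier is enough to resume a crawl, here is a small Python sketch. It is illustrative only: `crawl`, `get_links`, and the JSON state shape are assumptions, not webclaw's `--crawl-state` file format.

```python
import json
from collections import deque


def crawl(start, get_links, max_pages, state=None):
    """Breadth-first crawl that returns a JSON-serialisable state for resuming."""
    visited = set(state["visited"]) if state else set()
    frontier = deque(state["frontier"]) if state else deque([start])
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        fetched += 1
        for link in get_links(url):
            # Queue only links that are neither done nor already waiting.
            if link not in visited and link not in frontier:
                frontier.append(link)
    return {"visited": sorted(visited), "frontier": list(frontier)}


def save_state(state) -> str:
    """Serialise crawl state, e.g. when the run is interrupted."""
    return json.dumps(state)
```

Feeding the saved state back in as `state` continues from the stored frontier without refetching visited pages.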

[0.1.2] — 2026-03-25

Changed

  • Default TLS profile switched from Chrome145/Win to Safari26/Mac (highest pass rate across CF-protected sites)
  • Plain client fallback: when the impersonated TLS client gets a connection error or a 403, the request is automatically retried without impersonation (fixes ycombinator.com, producthunt.com, and similar sites)
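
The fallback rule above can be sketched as follows. This Python sketch is not webclaw's implementation: `fetch_with_fallback` and the two injected fetch callables are hypothetical, and each is assumed to return a `(status, body)` pair or raise `ConnectionError`.

```python
def fetch_with_fallback(url, impersonated_fetch, plain_fetch):
    """Try the impersonated client first; retry plainly on connection error or 403."""
    try:
        status, body = impersonated_fetch(url)
    except ConnectionError:
        # The impersonated handshake failed outright, so retry without it.
        return plain_fetch(url)
    if status == 403:
        # The server rejected the fingerprinted request, so retry without it.
        return plain_fetch(url)
    return status, body
```

Other error statuses are returned unchanged, which matches the changelog's description of retrying only on a connection error or a 403.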

Fixed

  • Reddit scraping: use plain HTTP client for .json endpoint (TLS fingerprinting was getting blocked)

Added

  • YouTube transcript extraction infrastructure in webclaw-core (caption track parsing, timed text XML parser), to be wired up when the cloud API launches

[0.1.1] — 2026-03-24

Fixed

  • MCP server now identifies as webclaw-mcp instead of rmcp in the MCP handshake
  • Research tool polling caps at 200 iterations (~10 min) instead of looping forever
  • CLI returns non-zero exit codes on errors (invalid format, fetch failures, missing LLM)
  • Text format output strips markdown table syntax (| --- | pipes)
  • All MCP tools validate URLs before network calls with clear error messages
  • Cloud API HTTP client has 60s timeout instead of no timeout
  • Local fetch calls time out after 30s to prevent hanging on slow servers
  • Diff cloud fallback computes actual diff instead of returning raw scrape JSON
  • FetchClient startup failure logs and exits gracefully instead of panicking
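
The polling cap in the list above amounts to replacing an unbounded loop with a bounded one. A minimal Python sketch, with assumptions flagged: `poll_until_done` and `check` are hypothetical names, and the 3-second interval is only inferred from "200 iterations (~10 min)", not stated in the changelog.

```python
import time


def poll_until_done(check, max_iterations=200, interval=3.0, sleep=time.sleep):
    """Call check() until it returns a non-None result or the cap is reached."""
    for _ in range(max_iterations):
        result = check()
        if result is not None:
            return result
        sleep(interval)
    # The cap was hit: fail loudly instead of looping forever.
    raise TimeoutError(f"no result after {max_iterations} polls")
```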

Added

  • Upper bounds: batch capped at 100 URLs, crawl capped at 500 pages

[0.1.0] — 2026-03-18

First public release. Full-featured web content extraction toolkit for LLMs.

Core Extraction

  • Readability-style content scoring with text density, semantic tags, and link density penalties
  • Exact CSS class token noise filtering with body-force fallback for SPAs
  • HTML → markdown conversion with URL resolution, image alt text, srcset optimization
  • 9-step LLM text optimization pipeline (67% token reduction vs raw HTML)
  • JSON data island extraction (React, Next.js, Contentful CMS)
  • YouTube transcript extraction (title, channel, views, duration, description)
  • Lazy-loaded image detection (data-src, data-lazy-src, data-original)
  • Brand identity extraction (name, colors, fonts, logos, OG image)
  • Content change tracking / diff engine
  • CSS selector filtering (include/exclude)

Fetching & Crawling

  • TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
  • BFS same-origin crawler with configurable depth, concurrency, and delay
  • Sitemap.xml and robots.txt discovery
  • Batch multi-URL concurrent extraction
  • Per-request proxy rotation from pool file
  • Reddit JSON API and LinkedIn post extractors

LLM Integration

  • Provider chain: Ollama (local-first) → OpenAI → Anthropic
  • JSON schema extraction (structured data from pages)
  • Natural language prompt extraction
  • Page summarization with configurable sentence count
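
The provider chain above is, in essence, an ordered fallback. The Python sketch below illustrates the idea under assumptions: `run_with_providers` is a hypothetical helper, each provider is modelled as a `(name, call)` pair, and any exception is treated as "try the next provider", which may be broader than what webclaw does.

```python
def run_with_providers(prompt, providers):
    """Try each (name, call) pair in order and return the first successful answer."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Listing the local provider first means a hosted provider is contacted only when the local one is unavailable.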

PDF

  • PDF text extraction via pdf-extract
  • Auto-detection by Content-Type header

MCP Server

  • 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
  • stdio transport for Claude Desktop, Claude Code, and any MCP client
  • Smart Fetch: local extraction first, cloud API fallback

CLI

  • 4 output formats: markdown, JSON, plain text, LLM-optimized
  • CSS selector filtering, crawling, sitemap discovery
  • Brand extraction, content diffing, LLM features
  • Browser profile selection, proxy support, stdin/file input

Infrastructure

  • Docker multi-stage build with Ollama sidecar
  • Deploy script for Hetzner VPS