Changelog

All notable changes to webclaw are documented here. Format follows Keep a Changelog.

[0.1.5] — 2026-03-26

Added

  • --output-dir: save each page to a separate file instead of writing to stdout. Works in single-URL, crawl, and batch modes
  • CSV input with custom filenames: --urls-file accepts rows in url,filename format
  • Root URLs use hostname/index.ext to avoid collisions in batch mode
  • Subdirectories created automatically from URL path structure
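
The naming rules above can be sketched roughly as follows. This is an illustrative Python sketch, not webclaw's actual code: `output_path` and `parse_urls_line` are hypothetical names, and the handling of non-root URLs (no hostname prefix, source extension replaced) is an assumption the entries above do not confirm.

```python
import csv
from pathlib import PurePosixPath
from urllib.parse import urlparse


def output_path(url: str, ext: str = "md") -> PurePosixPath:
    """Map a URL to a relative output path (hypothetical naming rules)."""
    parsed = urlparse(url)
    # Keep only real path segments; dropping "." and ".." means the
    # result can never point outside the output directory.
    segments = [s for s in parsed.path.split("/") if s not in ("", ".", "..")]
    if not segments:
        # Root URL: namespace by hostname so two sites in one batch
        # do not both write to the same index file.
        return PurePosixPath(parsed.hostname or "unknown") / f"index.{ext}"
    # Intermediate segments become subdirectories; the last segment
    # becomes the file stem (assumption: its own extension is replaced).
    stem = PurePosixPath(segments[-1]).stem
    return PurePosixPath(*segments[:-1]) / f"{stem}.{ext}"


def parse_urls_line(line: str):
    """Split one 'url,filename' row; the filename column is optional."""
    row = next(csv.reader([line]))
    url = row[0].strip()
    custom = row[1].strip() if len(row) > 1 and row[1].strip() else None
    return url, custom
```

Under these assumptions, `https://example.com/` maps to `example.com/index.md` and `https://example.com/docs/api/intro.html` maps to `docs/api/intro.md`.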

[0.1.4] — 2026-03-26

Added

  • QuickJS integration for extracting data from inline JavaScript (NYTimes +168%, Wired +580% more content)
  • Executes inline <script> tags in a sandboxed runtime to capture window.__* data blobs
  • Parses Next.js RSC flight data (self.__next_f) for App Router sites
  • Smart text filtering rejects CSS, base64, file paths, and code — only keeps readable prose
  • Feature-gated behind the quickjs feature flag (enabled by default; disable it for WASM builds)
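
To give a feel for the text filtering step, here is a minimal heuristic sketch in Python. The actual rules webclaw applies are not described in this changelog; `looks_like_prose`, the regular expressions, and both thresholds are assumptions chosen for illustration.

```python
import re

# A "plain word": letters with optional inner apostrophes or hyphens and
# optional trailing punctuation. Base64 blobs, file paths, and code
# tokens contain digits, slashes, or brackets, so they do not match.
_WORD = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*[.,;:!?]?$")
# A CSS declaration block such as ".nav { color: red; }".
_CSS_RULE = re.compile(r"\{[^{}]*:[^{}]*\}")


def looks_like_prose(text: str, min_words: int = 5, min_ratio: float = 0.7) -> bool:
    """Return True if most whitespace-separated tokens are plain words."""
    tokens = text.split()
    if not tokens or _CSS_RULE.search(text):
        return False
    words = sum(1 for t in tokens if _WORD.match(t))
    # Require both an absolute word count and a high share of plain words.
    return words >= min_words and words / len(tokens) >= min_ratio
```

A string captured from a `window.__*` blob would be kept only if it passes a check of this kind; the thresholds trade recall for precision and would need tuning against real pages.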

[0.1.3] — 2026-03-25

Added

  • Crawl streaming: real-time progress on stderr as pages complete ([2/50] OK https://... (234ms, 1523 words))
  • Crawl resume/cancel: --crawl-state <path> saves progress on Ctrl+C and resumes from where it left off
  • MCP server proxy support via WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars

Changed

  • Crawl results now expose visited set and remaining frontier for accurate state persistence
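
Persisting the visited set together with the remaining frontier is what makes a resume exact. A minimal sketch of that idea follows; the JSON layout and the `save_state` / `load_state` helpers are hypothetical, not the format webclaw writes to the --crawl-state path.

```python
import json
from collections import deque
from pathlib import Path


def save_state(path: Path, visited: set, frontier: deque) -> None:
    """Write crawl progress so an interrupted crawl can be resumed."""
    state = {"visited": sorted(visited), "frontier": list(frontier)}
    # Write to a temporary file first, then rename, so an interrupt
    # during the write never leaves a half-written state file behind.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)


def load_state(path: Path):
    """Return (visited, frontier); both are empty when no state exists."""
    if not path.exists():
        return set(), deque()
    state = json.loads(path.read_text(encoding="utf-8"))
    return set(state["visited"]), deque(state["frontier"])
```

On resume, the crawler would seed its queue from the loaded frontier and skip anything already in the visited set.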

[0.1.2] — 2026-03-25

Changed

  • Default TLS profile switched from Chrome145/Win to Safari26/Mac (highest pass rate across CF-protected sites)
  • Plain client fallback: when an impersonated TLS request gets a connection error or a 403, webclaw automatically retries without impersonation (fixes ycombinator.com, producthunt.com, and similar sites)
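
The fallback logic can be sketched as below. The two fetchers are passed in as plain callables so the example stays self-contained; `Response`, `Fetcher`, and `fetch_with_fallback` are hypothetical names, and Python's `ConnectionError` stands in for whatever error type the real client raises.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Response:
    status: int
    body: str


Fetcher = Callable[[str], Response]


def fetch_with_fallback(url: str, impersonated: Fetcher, plain: Fetcher) -> Response:
    """Try the impersonating client first; retry plainly on error or 403."""
    try:
        resp = impersonated(url)
    except ConnectionError:
        # The impersonated handshake was refused or reset.
        return plain(url)
    if resp.status == 403:
        # The server accepted the connection but rejected the fingerprint.
        return plain(url)
    return resp
```

Only the two conditions named in the entry trigger the retry; any other status is returned unchanged.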

Fixed

  • Reddit scraping: use plain HTTP client for .json endpoint (TLS fingerprinting was getting blocked)

Added

  • YouTube transcript extraction infrastructure in webclaw-core (caption track parsing, timed text XML parser), to be wired up when the cloud API launches

[0.1.1] — 2026-03-24

Fixed

  • MCP server now identifies as webclaw-mcp instead of rmcp in the MCP handshake
  • Research tool polling caps at 200 iterations (~10 min) instead of looping forever
  • CLI returns non-zero exit codes on errors (invalid format, fetch failures, missing LLM)
  • Text format output strips markdown table syntax (| --- | pipes)
  • All MCP tools validate URLs before network calls with clear error messages
  • Cloud API HTTP client has a 60s timeout instead of no timeout
  • Local fetch calls time out after 30s to prevent hanging on slow servers
  • Diff cloud fallback computes actual diff instead of returning raw scrape JSON
  • FetchClient startup failure logs and exits gracefully instead of panicking
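
The polling cap above amounts to a bounded loop. In the sketch below the 3-second interval is an assumption inferred from the stated figures (200 iterations × 3 s = 600 s, about 10 minutes); `poll_until_done` and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    check: Callable[[], Optional[str]],
    max_iterations: int = 200,
    interval_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call check() until it returns a result or the cap is reached."""
    for _ in range(max_iterations):
        result = check()
        if result is not None:
            return result
        sleep(interval_s)
    # Give up explicitly instead of looping forever.
    waited = max_iterations * interval_s
    raise TimeoutError(f"no result after {max_iterations} polls (~{waited:.0f}s)")
```

Taking `sleep` as a parameter keeps the loop testable without real waiting.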

Added

  • Upper bounds: batch capped at 100 URLs, crawl capped at 500 pages

[0.1.0] — 2026-03-18

First public release. Full-featured web content extraction toolkit for LLMs.

Core Extraction

  • Readability-style content scoring with text density, semantic tags, and link density penalties
  • Exact CSS class token noise filtering with body-force fallback for SPAs
  • HTML → markdown conversion with URL resolution, image alt text, srcset optimization
  • 9-step LLM text optimization pipeline (67% token reduction vs raw HTML)
  • JSON data island extraction (React, Next.js, Contentful CMS)
  • YouTube video metadata extraction (title, channel, views, duration, description)
  • Lazy-loaded image detection (data-src, data-lazy-src, data-original)
  • Brand identity extraction (name, colors, fonts, logos, OG image)
  • Content change tracking / diff engine
  • CSS selector filtering (include/exclude)
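
As one concrete illustration from this list, lazy-loaded image detection can be approximated by preferring the lazy-load attributes over src. The sketch uses Python's standard html.parser; `ImageCollector` is a hypothetical name, and the attribute priority order is an assumption.

```python
from html.parser import HTMLParser

# Attributes named in the changelog entry, checked before plain src
# because src often holds only a placeholder on lazy-loading pages.
LAZY_ATTRS = ("data-src", "data-lazy-src", "data-original")


class ImageCollector(HTMLParser):
    """Collect the most likely real image URL for each <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        for name in LAZY_ATTRS + ("src",):
            value = attr_map.get(name)
            if value:
                self.sources.append(value)
                break
```

Feeding a page through `ImageCollector().feed(html)` leaves the resolved URLs in `sources`, in document order.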

Fetching & Crawling

  • TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
  • BFS same-origin crawler with configurable depth, concurrency, and delay
  • Sitemap.xml and robots.txt discovery
  • Batch multi-URL concurrent extraction
  • Per-request proxy rotation from pool file
  • Reddit JSON API and LinkedIn post extractors
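
A breadth-first, same-origin crawl with a depth limit can be sketched as follows. Concurrency and per-request delay are left out for brevity; `bfs_crawl` and `get_links` are hypothetical names, and link extraction is supplied by the caller so the example needs no network access.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse


def same_origin(a: str, b: str) -> bool:
    """Two URLs share an origin when scheme and host:port both match."""
    pa, pb = urlparse(a), urlparse(b)
    return (pa.scheme, pa.netloc) == (pb.scheme, pb.netloc)


def bfs_crawl(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 500,
) -> List[str]:
    """Visit pages level by level, staying on the start URL's origin."""
    visited = {start}
    order = []
    frontier = deque([(start, 0)])
    while frontier and len(order) < max_pages:
        url, depth = frontier.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for href in get_links(url):
            # Resolve relative links and drop #fragments before deduplicating.
            link = urldefrag(urljoin(url, href)).url
            if link not in visited and same_origin(start, link):
                visited.add(link)
                frontier.append((link, depth + 1))
    return order
```

Marking a URL as visited when it is enqueued, rather than when it is fetched, prevents the same page from entering the queue twice.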

LLM Integration

  • Provider chain: Ollama (local-first) → OpenAI → Anthropic
  • JSON schema extraction (structured data from pages)
  • Natural language prompt extraction
  • Page summarization with configurable sentence count
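
The provider chain is a first-success-wins fallback. A minimal sketch, assuming each provider is a callable that either returns text or raises; `run_chain` is a hypothetical name and no real provider SDK is involved.

```python
from typing import Callable, List, Tuple

Provider = Tuple[str, Callable[[str], str]]


def run_chain(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, output) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Listing the local provider first means a hosted API is contacted only when the local one is unavailable.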

PDF

  • PDF text extraction via pdf-extract
  • Auto-detection by Content-Type header
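
Detection by Content-Type comes down to comparing the media type while ignoring parameters and letter case. A small sketch (`is_pdf` is a hypothetical helper, not a webclaw function):

```python
from typing import Optional


def is_pdf(content_type: Optional[str]) -> bool:
    """Return True when a Content-Type header value denotes a PDF."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=binary" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"
```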

MCP Server

  • 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
  • stdio transport for Claude Desktop, Claude Code, and any MCP client
  • Smart Fetch: local extraction first, cloud API fallback

CLI

  • 4 output formats: markdown, JSON, plain text, LLM-optimized
  • CSS selector filtering, crawling, sitemap discovery
  • Brand extraction, content diffing, LLM features
  • Browser profile selection, proxy support, stdin/file input

Infrastructure

  • Docker multi-stage build with Ollama sidecar
  • Deploy script for Hetzner VPS