webclaw/CHANGELOG.md

# Changelog

All notable changes to webclaw are documented here.
Format follows [Keep a Changelog](https://keepachangelog.com/).

## [0.3.19] — 2026-04-17

### Fixed
- **Docker image can be used as a FROM base again.** v0.3.13 switched the Docker `CMD` to `ENTRYPOINT ["webclaw"]` so that `docker run IMAGE https://example.com` would pass the URL through as expected. That change trapped a different use case: downstream Dockerfiles that `FROM ghcr.io/0xmassi/webclaw` and set their own `CMD ["./setup.sh"]` — the child's `./setup.sh` became the first arg to `webclaw`, which tried to fetch it as a URL and failed with `error sending request for uri (https://./setup.sh)`. Both `Dockerfile` and `Dockerfile.ci` now use a small `docker-entrypoint.sh` shim that forwards flags (`-*`) and URLs (`http://`, `https://`) to `webclaw`, but `exec`s anything else directly. All four use cases now work: `docker run IMAGE https://example.com`, `docker run IMAGE --help`, child-image `CMD ["./setup.sh"]`, and `docker run IMAGE bash` for debugging. Default `CMD` is `["webclaw", "--help"]`.

---

## [0.3.18] — 2026-04-16

### Fixed
- **UTF-8 char boundary panic in `webclaw-core::extractor::find_content_position` (#16).** After rejecting a match that fell inside image syntax (`![...](...)`), the scan advanced `search_from` by a single byte. If the rejected match started on a multi-byte character (Cyrillic, CJK, accented Latin, emoji), the next `markdown[search_from..]` slice landed mid-char and panicked with `byte index N is not a char boundary; it is inside 'X'`. Repro was `webclaw https://bruler.ru/about_brand -f json`. Now advances by `needle.len()` — always a valid char boundary, and faster because it skips the whole rejected match instead of re-scanning inside it. Two regression tests cover multi-byte rejected matches and all-rejected cycles in Cyrillic text.

---

## [0.3.17] — 2026-04-16

### Changed
- **`webclaw-fetch::sitemap::parse_robots_txt` now does proper directive parsing.** The previous `trimmed[..8].eq_ignore_ascii_case("sitemap:")` slice couldn't handle "Sitemap :" (space before colon) from bad generators, didn't strip inline `# ...` comments, and would have returned empty/garbage values if a directive line had no URL. Now splits on the first colon, matches any-case `sitemap` as the directive name, strips comments, and requires the value to contain `://` before accepting it. Eight new unit tests cover case variants, space-before-colon, inline comments, non-URL values, and non-sitemap directives.
- **`webclaw-fetch::crawler::is_cancelled` uses `Ordering::Acquire`** (was `Relaxed`). Technically equivalent on x86/arm64 for single-word loads, but the explicit ordering documents the synchronization intent for readers and the compiler.

### Added
- **`webclaw-mcp` caches the Firefox FetchClient lazily.** Tool calls that repeatedly request the Firefox profile without cookies used to build a fresh reqwest pool + TLS stack per call; a single `OnceLock` keeps the client alive for the life of the server. Chrome (default) and Random (by design per-call) are unaffected.

---

## [0.3.16] — 2026-04-16

### Hardened
- **Response body caps across fetch + LLM providers (P2).** Every HTTP response buffered from the network is now rejected if it exceeds a hard size cap. `webclaw-fetch::Response::from_wreq` caps HTML/doc responses at 50 MB (before the allocation pays for anything and as a belt-and-braces check after `bytes().await`); `webclaw-llm` providers (anthropic / openai / ollama) cap JSON responses at 5 MB via a shared `response_json_capped` helper. Previously an adversarial or runaway upstream could push unbounded memory into the process. Closes the DoS-via-giant-body class of bugs noted in the audit.
- **Crawler frontier cap (P2).** After each depth level the frontier is truncated to `max(max_pages × 10, 100)` entries, keeping the most recently discovered links. Dense pages (tag clouds, search results) used to push the frontier into the tens of thousands even after `max_pages` halted new fetches, keeping string allocations alive long after the crawl was effectively done.
- **Glob pattern validation (P2).** User-supplied `include_patterns` / `exclude_patterns` passed to the crawler are now rejected if they contain more than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher degrades exponentially on deeply-nested `**` against long paths; this keeps adversarial config files from weaponising it.

### Cleanup
- **Removed blanket `#![allow(dead_code)]` in `webclaw-cli/src/main.rs`.** No dead code surfaced; the suppression was obsolete.
- **`.gitignore`: replaced overbroad `*.json` with specific local-artifact patterns.** The previous rule would have swallowed `package.json` / `components.json` / `.smithery/*.json` if they were ever modified.

---

## [0.3.15] — 2026-04-16

### Fixed
- **Batch/crawl no longer panics on semaphore close (P1).** Three `permit.acquire().await.expect("semaphore closed")` call sites in `webclaw-fetch` (`client::fetch_batch`, `client::fetch_and_extract_batch_with_options`, `crawler` inner loop) now surface a typed `FetchError::Build("semaphore closed before acquire")` or a failed `PageResult` instead of panicking the spawned task. Under normal operation nothing changes; under shutdown-race or adversarial runtime state, the caller sees one failed entry in the batch instead of losing the task silently to the runtime's panic handler. Surfaced by the 2026-04-16 workspace audit.

---

## [0.3.14] — 2026-04-16

### Security
- **`--on-change` command injection closed (P0).** The `--on-change` flag on `webclaw watch` and its multi-URL variant used to pipe the whole user-supplied string through `sh -c`. Anyone (or any LLM driving the MCP surface, or any config file parsed on the user's behalf) that could influence the flag value could execute arbitrary shell. The command is now tokenized with `shlex` and executed directly via `Command::new(prog).args(args)`, so metacharacters like `;`, `&&`, `|`, `$()`, `<(...)`, and env expansion no longer fire. A `WEBCLAW_ALLOW_SHELL=1` escape hatch is available for users who genuinely need pipelines; it logs a warning on every invocation so it can't slip in silently. Surfaced by the 2026-04-16 workspace audit.

---

## [0.3.13] — 2026-04-10

### Fixed
- **Docker CMD replaced with ENTRYPOINT**: both `Dockerfile` and `Dockerfile.ci` now use `ENTRYPOINT ["webclaw"]` instead of `CMD ["webclaw"]`. CLI arguments (e.g. `docker run webclaw https://example.com`) now pass through correctly instead of being ignored.

---

## [0.3.12] — 2026-04-10

### Added
- **Crawl scope control**: new `allow_subdomains` and `allow_external_links` fields on `CrawlConfig`. By default crawls stay same-origin. Enable `allow_subdomains` to follow sibling/child subdomains (e.g. blog.example.com from example.com), or `allow_external_links` for full cross-origin crawling. Root domain extraction uses a heuristic that handles two-part TLDs (co.uk, com.au).

---

## [0.3.11] — 2026-04-10

### Added
- **Sitemap fallback paths**: discovery now tries `/sitemap_index.xml`, `/wp-sitemap.xml`, and `/sitemap/sitemap-index.xml` in addition to the standard `/sitemap.xml`. Sites using WordPress or non-standard sitemap locations are now discovered without needing external search.

---

## [0.3.10] — 2026-04-10

### Changed
- **Fetch timeout reduced from 30s to 12s**: prevents cascading slowdowns when proxies are unresponsive. Worst-case per-URL drops from ~94s to ~25s.
- **Retry attempts reduced from 3 to 2**: combined with shorter timeout, total worst-case is 12s + 1s delay + 12s = 25s instead of 30s + 1s + 30s + 3s + 30s = 94s.

---

## [0.3.9] — 2026-04-04

### Fixed
- **Layout tables rendered as sections**: tables used for page layout (containing block elements like `<p>`, `<div>`, `<hr>`) are now rendered as standalone sections instead of pipe-delimited markdown tables. Fixes Drudge Report and similar sites where all content was flattened into a single unreadable line. (by [@devnen](https://github.com/devnen) in #14)
- **Stack overflow on deeply nested HTML**: pages with 200+ DOM nesting levels (e.g., Express.co.uk live blogs) no longer overflow the stack. Two-layer fix: depth guard in markdown.rs falls back to iterator-based text collection at depth 200, and `extract_with_options()` spawns an 8 MB worker thread for safety on Windows. (by [@devnen](https://github.com/devnen) in #14)
- **Noise filter swallowing content in malformed HTML**: `<form>` tags no longer unconditionally treated as noise — ASP.NET page-wrapping forms (>500 chars) are preserved. Safety valve prevents unclosed noise containers (header/footer with >5000 chars) from absorbing entire page content. (by [@devnen](https://github.com/devnen) in #14)

### Changed
- **Bold/italic block passthrough**: `<b>`/`<strong>`/`<em>`/`<i>` tags containing block-level children (e.g., Drudge wrapping columns in `<b>`) now act as transparent containers instead of collapsing everything into inline bold/italic. (by [@devnen](https://github.com/devnen) in #14)

---

## [0.3.8] — 2026-04-03

### Fixed
- **MCP research token overflow**: research results are now saved to `~/.webclaw/research/` and the MCP tool returns file paths + findings instead of the full report. Prevents "exceeds maximum allowed tokens" errors in Claude/Cursor.
- **Research caching**: same query returns cached result instantly without spending credits.
- **Anthropic rate limit throttling**: 60s delay between LLM calls in research to stay under Tier 1 limits (50K input tokens/min).

### Added
- **`dirs` dependency** for `~/.webclaw/research/` path resolution.

---
## [0.3.7] — 2026-04-03

### Added
- **`--research` CLI flag**: run deep research via the cloud API. Prints report to stdout and saves full result (report + sources + findings) to a JSON file. Supports `--deep` for longer reports.
- **MCP extract/summarize cloud fallback**: when no local LLM is available, these tools now fall back to the cloud API instead of erroring. Set `WEBCLAW_API_KEY` for automatic fallback.
- **MCP research structured output**: the research tool now returns structured JSON (report + sources + findings + metadata) instead of raw text, so agents can reference individual findings and source URLs.

---

## [0.3.6] — 2026-04-02

### Added
- **Structured data in markdown/LLM output**: `__NEXT_DATA__`, SvelteKit, and JSON-LD data now appears as a `## Structured Data` section with a JSON code block at the end of `-f markdown` and `-f llm` output. Works with `--only-main-content` and all other flags.

### Fixed
- **Homebrew CI**: formula now updates all 4 platform checksums after Docker build completes, preventing SHA mismatch on Linux installs (#12).

---

## [0.3.5] — 2026-04-02

### Added
- **`__NEXT_DATA__` extraction**: Next.js pages now have their `pageProps` JSON extracted into `structured_data`. Contains prices, product info, page state, and other data that isn't in the visible HTML. Tested on 45 sites — 13 now return rich structured data (BBC, Forbes, Nike, Stripe, TripAdvisor, Glassdoor, NASA, etc.).

---

## [0.3.4] — 2026-04-01

### Added
- **SvelteKit data island extraction**: extracts structured JSON from `kit.start()` data arrays. Handles unquoted JS object keys by converting to valid JSON before parsing. Data appears in the `structured_data` field.

### Changed
- **License changed from MIT to AGPL-3.0**.

---

## [0.3.3] — 2026-04-01

### Changed
- **Replaced custom TLS stack with wreq**: migrated from webclaw-tls (patched rustls/h2/hyper/reqwest) to [wreq](https://github.com/0x676e67/wreq) by [@0x676e67](https://github.com/0x676e67). wreq uses BoringSSL for TLS and the [http2](https://github.com/0x676e67/http2) crate for HTTP/2 fingerprinting — both battle-tested with 60+ browser profiles.
- **Removed all `[patch.crates-io]` entries**: consumers no longer need to patch rustls, h2, hyper, hyper-util, or reqwest. Just depend on webclaw normally.
- **Browser profiles rebuilt on wreq's Emulation API**: Chrome 145, Firefox 135, Safari 18, Edge 145 with correct TLS options (cipher suites, curves, GREASE, ECH, PSK session resumption), HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.
- **Better TLS compatibility**: BoringSSL handles more server configurations than patched rustls (e.g. servers that previously returned IllegalParameter alerts).

### Removed
- webclaw-tls dependency and all 5 forked crates (webclaw-rustls, webclaw-h2, webclaw-hyper, webclaw-hyper-util, webclaw-reqwest).

### Acknowledgments
- TLS and HTTP/2 fingerprinting powered by [wreq](https://github.com/0x676e67/wreq) and [http2](https://github.com/0x676e67/http2) by [@0x676e67](https://github.com/0x676e67), who pioneered browser-grade HTTP/2 fingerprinting in Rust.

---

## [0.3.2] — 2026-03-31

### Added
- **`--cookie-file` flag**: load cookies from JSON files exported by browser extensions (EditThisCookie, Cookie-Editor). Format: `[{name, value, domain, ...}]`.
- **MCP `cookies` parameter**: the `scrape` tool now accepts a `cookies` array for authenticated scraping.
- **Combined cookies**: `--cookie` and `--cookie-file` can be used together and merge automatically.

---

## [0.3.1] — 2026-03-30

### Added
- **Cookie warmup fallback**: when a fetch returns an Akamai challenge page, automatically visits the homepage first to collect `_abck`/`bm_sz` cookies, then retries the original URL. Enables extraction of Akamai-protected subpages (e.g. fansale ticket pages) without JS rendering.

### Changed
- Fixed HTTP header wire order (accept/user-agent were in wrong positions) and added H2 PRIORITY flag in HEADERS frames.
- `FetchResult.headers` now uses `http::HeaderMap` instead of `HashMap<String, String>` — avoids per-response allocation, preserves multi-value headers.

## [0.3.0] — 2026-03-29

### Changed
- **Replaced primp with webclaw-tls**: switched to custom TLS fingerprinting stack.
- **Browser profiles**: Chrome 146 (Win/Mac), Firefox 135+, Safari 18, Edge 146 — captured from real browsers.
- **HTTP/2 fingerprinting**: SETTINGS frame ordering and pseudo-header ordering based on concepts pioneered by [@0x676e67](https://github.com/0x676e67).

### Fixed
- **HTTPS completely broken (#5)**: primp's forked rustls rejected valid certificates (UnknownIssuer on cross-signed chains like example.com). Fixed by using native OS root CAs alongside Mozilla bundle.
- **Unknown certificate extensions**: servers returning SCT in certificate entries no longer cause TLS errors.

### Added
- **Native root CA support**: uses OS trust store (macOS Keychain, Windows cert store) in addition to webpki-roots.
- **HTTP/2 fingerprinting**: SETTINGS frame ordering and pseudo-header ordering match real browsers.
- **Per-browser header ordering**: HTTP headers sent in browser-specific wire order.
- **Bandwidth tracking**: atomic byte counters shared across cloned clients.

---

## [0.2.2] — 2026-03-27

### Fixed
- **`cargo install` broken with primp 1.2.0**: added missing `reqwest` patch to `[patch.crates-io]`. primp moved to reqwest 0.13 which requires a patched fork.
- **Weekly dependency check**: CI now runs every Monday to catch primp patch drift before users hit it.

---

## [0.2.1] — 2026-03-27

### Added
- **Docker image on GHCR**: `docker run ghcr.io/0xmassi/webclaw` — auto-built on every release
- **QuickJS data island extraction**: inline `<script>` execution catches `window.__PRELOADED_STATE__`, Next.js hydration data, and other JS-embedded content

### Fixed
- Docker CI now runs as part of the release workflow (was missing, image was never published)

---

## [0.2.0] — 2026-03-26

### Added
- **DOCX extraction**: auto-detected by Content-Type or URL extension, outputs markdown with headings
- **XLSX/XLS extraction**: spreadsheets converted to markdown tables, multi-sheet support via calamine
- **CSV extraction**: parsed with quoted field handling, output as markdown table
- **HTML output format**: `-f html` returns sanitized HTML from the extracted content
- **Multi-URL watch**: `--watch` now works with `--urls-file` to monitor multiple URLs in parallel
- **Batch + LLM extraction**: `--extract-prompt` and `--extract-json` now work with multiple URLs
- **Scheduled batch watch**: watch multiple URLs with aggregate change reports and per-URL diffs

---

## [0.1.7] — 2026-03-26

### Fixed
- `--only-main-content`, `--include`, and `--exclude` now work in batch mode (#3)

---

## [0.1.6] — 2026-03-26

### Added
- `--watch`: monitor a URL for changes at a configurable interval with diff output
- `--watch-interval`: seconds between checks (default: 300)
- `--on-change`: run a command when changes are detected (diff JSON piped to stdin)
- `--webhook`: POST JSON notifications on crawl/batch complete and watch changes. Auto-formats for Discord and Slack webhooks

---

## [0.1.5] — 2026-03-26

### Added
- `--output-dir`: save each page to a separate file instead of stdout. Works with single URL, crawl, and batch modes
- CSV input with custom filenames: `url,filename` format in `--urls-file`
- Root URLs use `hostname/index.ext` to avoid collisions in batch mode
- Subdirectories created automatically from URL path structure

---

## [0.1.4] — 2026-03-26

### Added
- QuickJS integration for extracting data from inline JavaScript (NYTimes +168%, Wired +580% more content)
- Executes inline `<script>` tags in a sandboxed runtime to capture `window.__*` data blobs
- Parses Next.js RSC flight data (`self.__next_f`) for App Router sites
- Smart text filtering rejects CSS, base64, file paths, and code — only keeps readable prose
- Feature-gated with `quickjs` feature flag (enabled by default, disable for WASM builds)

---

## [0.1.3] — 2026-03-25

### Added
- Crawl streaming: real-time progress on stderr as pages complete (`[2/50] OK https://... (234ms, 1523 words)`)
- Crawl resume/cancel: `--crawl-state <path>` saves progress on Ctrl+C and resumes from where it left off
- MCP server proxy support via `WEBCLAW_PROXY` and `WEBCLAW_PROXY_FILE` env vars

### Changed
- Crawl results now expose visited set and remaining frontier for accurate state persistence

---

## [0.1.2] — 2026-03-25

### Changed
- Default TLS profile switched from Chrome145/Win to Safari26/Mac (highest pass rate across CF-protected sites)
- Plain client fallback: when impersonated TLS gets connection error or 403, automatically retries without impersonation (fixes ycombinator.com, producthunt.com, and similar sites)

### Fixed
- Reddit scraping: use plain HTTP client for `.json` endpoint (TLS fingerprinting was getting blocked)

### Added
- YouTube transcript extraction infrastructure in webclaw-core (caption track parsing, timed text XML parser) — wired up when cloud API launches

---

## [0.1.1] — 2026-03-24

### Fixed
- MCP server now identifies as `webclaw-mcp` instead of `rmcp` in the MCP handshake
- Research tool polling caps at 200 iterations (~10 min) instead of looping forever
- CLI returns non-zero exit codes on errors (invalid format, fetch failures, missing LLM)
- Text format output strips markdown table syntax (`| --- |` pipes)
- All MCP tools validate URLs before network calls with clear error messages
- Cloud API HTTP client has 60s timeout instead of no timeout
- Local fetch calls timeout after 30s to prevent hanging on slow servers
- Diff cloud fallback computes actual diff instead of returning raw scrape JSON
- FetchClient startup failure logs and exits gracefully instead of panicking

### Added
- Upper bounds: batch capped at 100 URLs, crawl capped at 500 pages

---

## [0.1.0] — 2026-03-18

First public release. Full-featured web content extraction toolkit for LLMs.

### Core Extraction
- Readability-style content scoring with text density, semantic tags, and link density penalties
- Exact CSS class token noise filtering with body-force fallback for SPAs
- HTML → markdown conversion with URL resolution, image alt text, srcset optimization
- 9-step LLM text optimization pipeline (67% token reduction vs raw HTML)
- JSON data island extraction (React, Next.js, Contentful CMS)
- YouTube transcript extraction (title, channel, views, duration, description)
- Lazy-loaded image detection (data-src, data-lazy-src, data-original)
- Brand identity extraction (name, colors, fonts, logos, OG image)
- Content change tracking / diff engine
- CSS selector filtering (include/exclude)

### Fetching & Crawling
- TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
- BFS same-origin crawler with configurable depth, concurrency, and delay
- Sitemap.xml and robots.txt discovery
- Batch multi-URL concurrent extraction
- Per-request proxy rotation from pool file
- Reddit JSON API and LinkedIn post extractors

### LLM Integration
- Provider chain: Ollama (local-first) → OpenAI → Anthropic
- JSON schema extraction (structured data from pages)
- Natural language prompt extraction
- Page summarization with configurable sentence count

### PDF
- PDF text extraction via pdf-extract
- Auto-detection by Content-Type header

### MCP Server
- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
- stdio transport for Claude Desktop, Claude Code, and any MCP client
- Smart Fetch: local extraction first, cloud API fallback

### CLI
- 4 output formats: markdown, JSON, plain text, LLM-optimized
- CSS selector filtering, crawling, sitemap discovery
- Brand extraction, content diffing, LLM features
- Browser profile selection, proxy support, stdin/file input

### Infrastructure
- Docker multi-stage build with Ollama sidecar
- Deploy script for Hetzner VPS
-												Initial release: webclaw v0.1.0 — web content extraction for LLMs

CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io

											
										
										
											2026-03-23 18:31:11 +01:00
+								# Changelog
 								All notable changes to webclaw are documented here.
 								Format follows [Keep a Changelog](https://keepachangelog.com/).
-												fix(docker): entrypoint shim so child images with custom CMD work (#28)

v0.3.13 switched ENTRYPOINT to ["webclaw"] to make `docker run IMAGE
https://example.com` work. That broke a different use case: downstream
Dockerfiles that `FROM ghcr.io/0xmassi/webclaw` and set their own
CMD ["./setup.sh"] — the child's ./setup.sh becomes arg to webclaw,
which tries to fetch it as a URL and fails:

  fetch error: request failed: error sending request for uri
  (https://./setup.sh): client error (Connect)

Both Dockerfile and Dockerfile.ci now use docker-entrypoint.sh which:
- forwards flags (-*) and URLs (http://, https://) to `webclaw`
- exec's anything else directly

Test matrix (all pass locally):
  docker run IMAGE https://example.com     → webclaw scrape ok
  docker run IMAGE --help                   → webclaw --help ok
  docker run IMAGE                          → default CMD, --help
  docker run IMAGE bash                     → bash runs
  FROM IMAGE + CMD ["./setup.sh"]           → setup.sh runs, webclaw available

Default CMD is ["webclaw", "--help"] so bare `docker run IMAGE` still
prints help.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
											
										
										
											2026-04-17 15:57:47 +02:00
+								## [0.3.19] — 2026-04-17
 								### Fixed
 								- **Docker image can be used as a FROM base again.** v0.3.13 switched the Docker `CMD` to `ENTRYPOINT ["webclaw"]` so that `docker run IMAGE https://example.com` would pass the URL through as expected. That change trapped a different use case: downstream Dockerfiles that `FROM ghcr.io/0xmassi/webclaw` and set their own `CMD ["./setup.sh"]` — the child's `./setup.sh` became the first arg to `webclaw`, which tried to fetch it as a URL and failed with `error sending request for uri (https://./setup.sh)`. Both `Dockerfile` and `Dockerfile.ci` now use a small `docker-entrypoint.sh` shim that forwards flags (`-*`) and URLs (`http://`, `https://`) to `webclaw`, but `exec`s anything else directly. All four use cases now work: `docker run IMAGE https://example.com`, `docker run IMAGE --help`, child-image `CMD ["./setup.sh"]`, and `docker run IMAGE bash` for debugging. Default `CMD` is `["webclaw", "--help"]`.
 								---
-												fix(core): UTF-8 char boundary panic in find_content_position (#16) (#24)

`search_from = abs_pos + 1` landed mid-char when a rejected match
started on a multi-byte UTF-8 character, panicking on the next
`markdown[search_from..]` slice. Advance by `needle.len()` instead —
always a valid char boundary, and skips the whole rejected match
instead of re-scanning inside it.

Repro: webclaw https://bruler.ru/about_brand -f json
Before: panic "byte index 782 is not a char boundary; it is inside 'Ч'"
After:  extracts 2.3KB of clean Cyrillic markdown with 7 sections

Two regression tests cover multi-byte rejected matches and
all-rejected cycles in Cyrillic text.

Closes #16

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
											
										
										
											2026-04-17 12:02:52 +02:00
+								## [0.3.18] — 2026-04-16
 								### Fixed
 								- **UTF-8 char boundary panic in `webclaw-core::extractor::find_content_position` (#16).** After rejecting a match that fell inside image syntax (`![...](...)`), the scan advanced `search_from` by a single byte. If the rejected match started on a multi-byte character (Cyrillic, CJK, accented Latin, emoji), the next `markdown[search_from..]` slice landed mid-char and panicked with `byte index N is not a char boundary; it is inside 'X'`. Repro was `webclaw https://bruler.ru/about_brand -f json`. Now advances by `needle.len()` — always a valid char boundary, and faster because it skips the whole rejected match instead of re-scanning inside it. Two regression tests cover multi-byte rejected matches and all-rejected cycles in Cyrillic text.
 								---
-												polish(fetch,mcp): robots parser + firefox client cache + Acquire ordering (P3) (#23)

Three P3 items from the 2026-04-16 audit. Bump to 0.3.17.

webclaw-fetch/sitemap.rs: parse_robots_txt used trimmed[..8] slice
plus eq_ignore_ascii_case for the directive test. That was fragile:
"Sitemap :" (space before colon) fell through silently, inline
"# ..." comments leaked into the URL, and a line with no URL at all
returned an empty string. Rewritten to split on the first colon,
match any-case "sitemap" as the directive name, strip comments, and
require `://` in the value. +7 unit tests cover case variants,
space-before-colon, comments, empty values, non-URL values, and
non-sitemap directives.

webclaw-fetch/crawler.rs: is_cancelled uses Ordering::Acquire
instead of Relaxed. Behaviourally equivalent on current hardware for
single-word atomic loads, but the explicit ordering documents intent
for readers + compilers.

webclaw-mcp/server.rs: add lazy OnceLock cache for the Firefox
FetchClient. Tool calls that repeatedly request the firefox profile
without cookies used to build a fresh reqwest pool + TLS stack per
call. Chrome (default) already used the long-lived field; Random is
per-call by design; cookie-bearing requests still build ad-hoc since
the cookie header is part of the client shape.

Tests: 85 webclaw-fetch (was 78, +7 new sitemap), 272 webclaw-core,
43 webclaw-llm, 11 CLI — all green. Clippy clean across workspace.

Refs: docs/AUDIT-2026-04-16.md P3 section

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
											
										
										
											2026-04-16 20:21:32 +02:00
+								## [0.3.17] — 2026-04-16
 								### Changed
 								- **`webclaw-fetch::sitemap::parse_robots_txt` now does proper directive parsing.** The previous `trimmed[..8].eq_ignore_ascii_case("sitemap:")` slice couldn't handle "Sitemap :" (space before colon) from bad generators, didn't strip inline `# ...` comments, and would have returned empty/garbage values if a directive line had no URL. Now splits on the first colon, matches any-case `sitemap` as the directive name, strips comments, and requires the value to contain `://` before accepting it. Eight new unit tests cover case variants, space-before-colon, inline comments, non-URL values, and non-sitemap directives.
 								- **`webclaw-fetch::crawler::is_cancelled` uses `Ordering::Acquire`** (was `Relaxed`). Technically equivalent on x86/arm64 for single-word loads, but the explicit ordering documents the synchronization intent for readers and the compiler.
 								### Added
 								- **`webclaw-mcp` caches the Firefox FetchClient lazily.** Tool calls that repeatedly request the Firefox profile without cookies used to build a fresh reqwest pool + TLS stack per call; a single `OnceLock` keeps the client alive for the life of the server. Chrome (default) and Random (by design per-call) are unaffected.
 								---
-												feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22)

* feat(fetch,llm): DoS hardening via response caps + glob validation (P2)

Response body caps:
- webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks
  Content-Length up front (before the allocation) and the actual
  .bytes() length after (belt-and-braces against lying upstreams).
  Previously the HTML -> markdown conversion downstream could allocate
  multiple String copies per page; a 100 MB page would OOM the process.
- webclaw-llm providers (anthropic/openai/ollama) share a new
  response_json_capped helper with a 5 MB cap. Protects against a
  malicious or runaway provider response exhausting memory.

Crawler frontier cap: after each BFS depth level the frontier is
truncated to max(max_pages * 10, 100) entries, keeping the most
recently discovered links. Dense pages (tag clouds, search results)
used to push the frontier into the tens of thousands even after
max_pages halted new fetches.

Glob pattern validation: user-supplied include_patterns /
exclude_patterns are rejected at Crawler::new if they contain more
than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher
degrades exponentially on deeply-nested `**` against long paths.

Cleanup:
- Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs;
  no warnings surfaced, the suppression was obsolete.
- core/.gitignore: replaced overbroad *.json with specific local-
  artifact patterns (previous rule would have swallowed package.json,
  components.json, .smithery/*.json).

Tests: +4 validate_glob tests. Full workspace test: 283 passed
(webclaw-core + webclaw-fetch + webclaw-llm).

Version: 0.3.15 -> 0.3.16
CHANGELOG updated.

Refs: docs/AUDIT-2026-04-16.md (P2 section)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: gitignore CLI research dumps, drop accidentally-tracked file

research-*.json output from `webclaw ... --research ...` got silently
swept into git by the relaxed *.json gitignore in the preceding commit.
The old blanket *.json rule was hiding both this legitimate scratch
file AND packages/create-webclaw/server.json (MCP registry config that
we DO want tracked).

Removes the research dump from git and adds a narrower research-*.json
ignore pattern so future CLI output doesn't get re-tracked by accident.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
											
										
										
											2026-04-16 19:44:08 +02:00
+								## [0.3.16] — 2026-04-16
 								### Hardened
 								- **Response body caps across fetch + LLM providers (P2).** Every HTTP response buffered from the network is now rejected if it exceeds a hard size cap. `webclaw-fetch::Response::from_wreq` caps HTML/doc responses at 50 MB (before the allocation pays for anything and as a belt-and-braces check after `bytes().await`); `webclaw-llm` providers (anthropic / openai / ollama) cap JSON responses at 5 MB via a shared `response_json_capped` helper. Previously an adversarial or runaway upstream could push unbounded memory into the process. Closes the DoS-via-giant-body class of bugs noted in the audit.
 								- **Crawler frontier cap (P2).** After each depth level the frontier is truncated to `max(max_pages × 10, 100)` entries, keeping the most recently discovered links. Dense pages (tag clouds, search results) used to push the frontier into the tens of thousands even after `max_pages` halted new fetches, keeping string allocations alive long after the crawl was effectively done.
 								- **Glob pattern validation (P2).** User-supplied `include_patterns` / `exclude_patterns` passed to the crawler are now rejected if they contain more than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher degrades exponentially on deeply-nested `**` against long paths; this keeps adversarial config files from weaponising it.
 								### Cleanup
 								- **Removed blanket `#![allow(dead_code)]` in `webclaw-cli/src/main.rs`.** No dead code surfaced; the suppression was obsolete.
 								- **`.gitignore`: replaced overbroad `*.json` with specific local-artifact patterns.** The previous rule would have swallowed `package.json` / `components.json` / `.smithery/*.json` if they were ever modified.
 								---
-												fix(fetch): surface semaphore-closed as typed error instead of panic (P1) (#21)

Three call sites in webclaw-fetch used .expect("semaphore closed") on
`Semaphore::acquire()`. Under normal operation they never fire, but
under a shutdown race or adversarial runtime state the spawned task
would panic and be silently dropped from the batch / crawl run — the
caller would see fewer results than URLs with no indication why.

Rewritten to match on the acquire result:
- client::fetch_batch and client::fetch_and_extract_batch_with_options
  now emit BatchResult/BatchExtractResult carrying
  FetchError::Build("semaphore closed before acquire").
- crawler's inner loop emits a failed PageResult with the same error
  string instead of panicking.

Behaviorally a no-op for the happy path. Fixes the silent-dropped-task
class of bug noted in the 2026-04-16 audit.

Version: 0.3.14 -> 0.3.15
CHANGELOG updated.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
											
										
										
											2026-04-16 19:20:26 +02:00
+								## [0.3.15] — 2026-04-16
 								### Fixed
 								- **Batch/crawl no longer panics on semaphore close (P1).** Three `permit.acquire().await.expect("semaphore closed")` call sites in `webclaw-fetch` (`client::fetch_batch`, `client::fetch_and_extract_batch_with_options`, `crawler` inner loop) now surface a typed `FetchError::Build("semaphore closed before acquire")` or a failed `PageResult` instead of panicking the spawned task. Under normal operation nothing changes; under shutdown-race or adversarial runtime state, the caller sees one failed entry in the batch instead of losing the task silently to the runtime's panic handler. Surfaced by the 2026-04-16 workspace audit.
 								---
-												fix(cli): close --on-change command injection via sh -c (P0) (#20)

* fix(cli): close --on-change command injection via sh -c (P0)

The --on-change flag on `webclaw watch` (single-URL, line 1588) and
`webclaw watch` multi-URL mode (line 1738) previously handed the entire
user-supplied string to `tokio::process::Command::new("sh").arg("-c").arg(cmd)`.
Any path that can influence that string — a malicious config file, an MCP
client driven by an LLM with prompt-injection exposure, an untrusted
environment variable substitution — gets arbitrary shell execution.

The command is now tokenized with `shlex::split` (POSIX-ish quoting rules)
and executed directly via `Command::new(prog).args(args)`. Metacharacters
like `;`, `&&`, `|`, `$()`, `<(...)`, env expansion, and globbing no longer
fire.

An explicit opt-in escape hatch is available for users who genuinely need
a shell pipeline: `WEBCLAW_ALLOW_SHELL=1` preserves the old `sh -c` path
and logs a warning on every invocation so it can't slip in silently.

Both call sites now route through a shared `spawn_on_change()` helper.

Adds `shlex = "1"` to webclaw-cli dependencies.

Version: 0.3.13 -> 0.3.14
CHANGELOG updated.

Surfaced by the 2026-04-16 workspace audit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore(brand): fix clippy 1.95 unnecessary_sort_by errors

Pre-existing sort_by calls in brand.rs became hard errors under clippy
1.95. Switch to sort_by_key with std::cmp::Reverse. Pure refactor — same
ordering, no behavior change. Bundled here so CI goes green on the P0
command-injection fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
											
										
										
											2026-04-16 18:37:02 +02:00
+								## [0.3.14] — 2026-04-16
 								### Security
 								- **`--on-change` command injection closed (P0).** The `--on-change` flag on `webclaw watch` and its multi-URL variant used to pipe the whole user-supplied string through `sh -c`. Anyone (or any LLM driving the MCP surface, or any config file parsed on the user's behalf) that could influence the flag value could execute arbitrary shell. The command is now tokenized with `shlex` and executed directly via `Command::new(prog).args(args)`, so metacharacters like `;`, `&&`, `|`, `$()`, `<(...)`, and env expansion no longer fire. A `WEBCLAW_ALLOW_SHELL=1` escape hatch is available for users who genuinely need pipelines; it logs a warning on every invocation so it can't slip in silently. Surfaced by the 2026-04-16 workspace audit.
 								---
-												fix: use ENTRYPOINT instead of CMD in Dockerfiles for proper arg passthrough

Docker CMD gets overridden by any args, while ENTRYPOINT receives them.
This fixes `docker run webclaw <url>` silently ignoring the URL argument.

Bump to 0.3.13.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-14 20:24:26 +02:00
+								## [0.3.13] — 2026-04-10
 								### Fixed
 								- **Docker CMD replaced with ENTRYPOINT**: both `Dockerfile` and `Dockerfile.ci` now use `ENTRYPOINT ["webclaw"]` instead of `CMD ["webclaw"]`. CLI arguments (e.g. `docker run webclaw https://example.com`) now pass through correctly instead of being ignored.
 								---
-												feat: add allow_subdomains and allow_external_links to CrawlConfig

Crawls are same-origin by default. Enable allow_subdomains to follow
sibling/child subdomains (blog.example.com from example.com), or
allow_external_links for full cross-origin crawling.

Root domain extraction uses a heuristic that handles two-part TLDs
(co.uk, com.au). Includes 5 unit tests for root_domain().

Bump to 0.3.12.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-14 19:33:06 +02:00
+								## [0.3.12] — 2026-04-10
 								### Added
 								- **Crawl scope control**: new `allow_subdomains` and `allow_external_links` fields on `CrawlConfig`. By default crawls stay same-origin. Enable `allow_subdomains` to follow sibling/child subdomains (e.g. blog.example.com from example.com), or `allow_external_links` for full cross-origin crawling. Root domain extraction uses a heuristic that handles two-part TLDs (co.uk, com.au).
 								---
-												feat: add fallback sitemap paths for broader discovery

Try /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml
after the standard /sitemap.xml. WordPress 5.5+ and many CMS platforms
use non-standard paths that were previously missed. Paths found via
robots.txt are deduplicated to avoid double-fetching.

Bump to 0.3.11.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-10 18:22:57 +02:00
+								## [0.3.11] — 2026-04-10
 								### Added
 								- **Sitemap fallback paths**: discovery now tries `/sitemap_index.xml`, `/wp-sitemap.xml`, and `/sitemap/sitemap-index.xml` in addition to the standard `/sitemap.xml`. Sites using WordPress or non-standard sitemap locations are now discovered without needing external search.
 								---
-												perf: reduce fetch timeout to 12s and retries to 2

Stress testing showed 33% of proxies are dead, causing 30s+ timeouts
per request with 3 retries (worst case 94s). Reducing timeout from 30s
to 12s and retries from 3 to 2 brings worst case to 25s. Combined with
disabling 509 dead proxies from the pool, this should significantly
improve response times under load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-10 17:18:57 +02:00
+								## [0.3.10] — 2026-04-10
 								### Changed
 								- **Fetch timeout reduced from 30s to 12s**: prevents cascading slowdowns when proxies are unresponsive. Worst-case per-URL drops from ~94s to ~25s.
 								- **Retry attempts reduced from 3 to 2**: combined with shorter timeout, total worst-case is 12s + 1s delay + 12s = 25s instead of 30s + 1s + 30s + 3s + 30s = 94s.
 								---
-												chore: bump to 0.3.9, fix formatting from #14

Version bump for layout table, stack overflow, and noise filter fixes
contributed by @devnen. Also fixes cargo fmt issues that caused CI lint
failure on the merge commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-04 15:24:17 +02:00
+								## [0.3.9] — 2026-04-04
 								### Fixed
 								- **Layout tables rendered as sections**: tables used for page layout (containing block elements like `<p>`, `<div>`, `<hr>`) are now rendered as standalone sections instead of pipe-delimited markdown tables. Fixes Drudge Report and similar sites where all content was flattened into a single unreadable line. (by [@devnen](https://github.com/devnen) in #14)
 								- **Stack overflow on deeply nested HTML**: pages with 200+ DOM nesting levels (e.g., Express.co.uk live blogs) no longer overflow the stack. Two-layer fix: depth guard in markdown.rs falls back to iterator-based text collection at depth 200, and `extract_with_options()` spawns an 8 MB worker thread for safety on Windows. (by [@devnen](https://github.com/devnen) in #14)
 								- **Noise filter swallowing content in malformed HTML**: `<form>` tags no longer unconditionally treated as noise — ASP.NET page-wrapping forms (>500 chars) are preserved. Safety valve prevents unclosed noise containers (header/footer with >5000 chars) from absorbing entire page content. (by [@devnen](https://github.com/devnen) in #14)
 								### Changed
 								- **Bold/italic block passthrough**: `<b>`/`<strong>`/`<em>`/`<i>` tags containing block-level children (e.g., Drudge wrapping columns in `<b>`) now act as transparent containers instead of collapsing everything into inline bold/italic. (by [@devnen](https://github.com/devnen) in #14)
 								---
-												fix: MCP research saves to file, returns compact response

Research results saved to ~/.webclaw/research/ (report.md + full.json).
MCP returns file paths + findings instead of the full report, preventing
"exceeds maximum allowed tokens" errors in Claude/Cursor.

Same query returns cached result instantly without spending credits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-03 16:05:45 +02:00
+								## [0.3.8] — 2026-04-03
 								### Fixed
 								- **MCP research token overflow**: research results are now saved to `~/.webclaw/research/` and the MCP tool returns file paths + findings instead of the full report. Prevents "exceeds maximum allowed tokens" errors in Claude/Cursor.
 								- **Research caching**: same query returns cached result instantly without spending credits.
 								- **Anthropic rate limit throttling**: 60s delay between LLM calls in research to stay under Tier 1 limits (50K input tokens/min).
 								### Added
 								- **`dirs` dependency** for `~/.webclaw/research/` path resolution.
 								---
-												feat: CLI --research flag + MCP cloud fallback + structured research output

- --research "query": deep research via cloud API, saves JSON file with
  report + sources + findings, prints report to stdout
- --deep: longer, more thorough research mode
- MCP extract/summarize: cloud fallback when no local LLM available
- MCP research: returns structured JSON instead of raw text
- Bump to v0.3.7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-03 14:04:04 +02:00
+								## [0.3.7] — 2026-04-03
 								### Added
 								- **`--research` CLI flag**: run deep research via the cloud API. Prints report to stdout and saves full result (report + sources + findings) to a JSON file. Supports `--deep` for longer reports.
 								- **MCP extract/summarize cloud fallback**: when no local LLM is available, these tools now fall back to the cloud API instead of erroring. Set `WEBCLAW_API_KEY` for automatic fallback.
 								- **MCP research structured output**: the research tool now returns structured JSON (report + sources + findings + metadata) instead of raw text, so agents can reference individual findings and source URLs.
 								---
-												feat: structured data in markdown/LLM output + v0.3.6

__NEXT_DATA__, SvelteKit, and JSON-LD now appear as a
## Structured Data section in -f markdown and -f llm output.
Works with --only-main-content and all extraction flags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-02 19:16:56 +02:00
+								## [0.3.6] — 2026-04-02
 								### Added
 								- **Structured data in markdown/LLM output**: `__NEXT_DATA__`, SvelteKit, and JSON-LD data now appears as a `## Structured Data` section with a JSON code block at the end of `-f markdown` and `-f llm` output. Works with `--only-main-content` and all other flags.
 								### Fixed
 								- **Homebrew CI**: formula now updates all 4 platform checksums after Docker build completes, preventing SHA mismatch on Linux installs (#12).
 								---
-												feat: extract __NEXT_DATA__ into structured_data

Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">.
Now extracted as structured JSON (pageProps) in the structured_data field.

Tested on 45 sites — 13 return rich structured data including prices,
product info, and page state not visible in the DOM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-02 16:04:51 +02:00
+								## [0.3.5] — 2026-04-02
 								### Added
 								- **`__NEXT_DATA__` extraction**: Next.js pages now have their `pageProps` JSON extracted into `structured_data`. Contains prices, product info, page state, and other data that isn't in the visible HTML. Tested on 45 sites — 13 now return rich structured data (BBC, Forbes, Nike, Stripe, TripAdvisor, Glassdoor, NASA, etc.).
 								---
-												feat: SvelteKit data extraction + license change to AGPL-3.0

- Extract structured JSON from SvelteKit kit.start() data arrays
- Convert JS object literals (unquoted keys) to valid JSON
- Data appears in structured_data field (machine-readable)
- License changed from MIT to AGPL-3.0
- Bump to v0.3.4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-01 20:37:56 +02:00
+								## [0.3.4] — 2026-04-01
 								### Added
 								- **SvelteKit data island extraction**: extracts structured JSON from `kit.start()` data arrays. Handles unquoted JS object keys by converting to valid JSON before parsing. Data appears in the `structured_data` field.
 								### Changed
 								- **License changed from MIT to AGPL-3.0**.
 								---
-												feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3

Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.

This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.

84% pass rate across 1000 real sites. 384 unit tests green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-01 18:04:55 +02:00
+								## [0.3.3] — 2026-04-01
 								### Changed
 								- **Replaced custom TLS stack with wreq**: migrated from webclaw-tls (patched rustls/h2/hyper/reqwest) to [wreq](https://github.com/0x676e67/wreq) by [@0x676e67](https://github.com/0x676e67). wreq uses BoringSSL for TLS and the [http2](https://github.com/0x676e67/http2) crate for HTTP/2 fingerprinting — both battle-tested with 60+ browser profiles.
 								- **Removed all `[patch.crates-io]` entries**: consumers no longer need to patch rustls, h2, hyper, hyper-util, or reqwest. Just depend on webclaw normally.
 								- **Browser profiles rebuilt on wreq's Emulation API**: Chrome 145, Firefox 135, Safari 18, Edge 145 with correct TLS options (cipher suites, curves, GREASE, ECH, PSK session resumption), HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.
 								- **Better TLS compatibility**: BoringSSL handles more server configurations than patched rustls (e.g. servers that previously returned IllegalParameter alerts).
 								### Removed
 								- webclaw-tls dependency and all 5 forked crates (webclaw-rustls, webclaw-h2, webclaw-hyper, webclaw-hyper-util, webclaw-reqwest).
 								### Acknowledgments
 								- TLS and HTTP/2 fingerprinting powered by [wreq](https://github.com/0x676e67/wreq) and [http2](https://github.com/0x676e67/http2) by [@0x676e67](https://github.com/0x676e67), who pioneered browser-grade HTTP/2 fingerprinting in Rust.
 								---
-												chore: bump to v0.3.2, update changelog

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-31 10:56:51 +02:00
+								## [0.3.2] — 2026-03-31
 								### Added
 								- **`--cookie-file` flag**: load cookies from JSON files exported by browser extensions (EditThisCookie, Cookie-Editor). Format: `[{name, value, domain, ...}]`.
 								- **MCP `cookies` parameter**: the `scrape` tool now accepts a `cookies` array for authenticated scraping.
 								- **Combined cookies**: `--cookie` and `--cookie-file` can be used together and merge automatically.
 								---
-												chore: bump v0.3.1, update CHANGELOG, fix fmt

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-30 14:11:54 +02:00
+								## [0.3.1] — 2026-03-30
 								### Added
 								- **Cookie warmup fallback**: when a fetch returns an Akamai challenge page, automatically visits the homepage first to collect `_abck`/`bm_sz` cookies, then retries the original URL. Enables extraction of Akamai-protected subpages (e.g. fansale ticket pages) without JS rendering.
 								### Changed
-												feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3

Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.

This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.

84% pass rate across 1000 real sites. 384 unit tests green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-01 18:04:55 +02:00
+								- Fixed HTTP header wire order (accept/user-agent were in wrong positions) and added H2 PRIORITY flag in HEADERS frames.
-												chore: bump v0.3.1, update CHANGELOG, fix fmt

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-30 14:11:54 +02:00
+								- `FetchResult.headers` now uses `http::HeaderMap` instead of `HashMap<String, String>` — avoids per-response allocation, preserves multi-value headers.
-												docs: add v0.3.0 changelog entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-29 16:47:04 +02:00
+								## [0.3.0] — 2026-03-29
 								### Changed
-												feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3

Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.

This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.

84% pass rate across 1000 real sites. 384 unit tests green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-01 18:04:55 +02:00
+								- **Replaced primp with webclaw-tls**: switched to custom TLS fingerprinting stack.
-												docs: add v0.3.0 changelog entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-29 16:47:04 +02:00
+								- **Browser profiles**: Chrome 146 (Win/Mac), Firefox 135+, Safari 18, Edge 146 — captured from real browsers.
-												feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3

Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.

This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.

84% pass rate across 1000 real sites. 384 unit tests green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-01 18:04:55 +02:00
+								- **HTTP/2 fingerprinting**: SETTINGS frame ordering and pseudo-header ordering based on concepts pioneered by [@0x676e67](https://github.com/0x676e67).
-												docs: add v0.3.0 changelog entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-29 16:47:04 +02:00
 								### Fixed
 								- **HTTPS completely broken (#5)**: primp's forked rustls rejected valid certificates (UnknownIssuer on cross-signed chains like example.com). Fixed by using native OS root CAs alongside Mozilla bundle.
 								- **Unknown certificate extensions**: servers returning SCT in certificate entries no longer cause TLS errors.
 								### Added
 								- **Native root CA support**: uses OS trust store (macOS Keychain, Windows cert store) in addition to webpki-roots.
 								- **HTTP/2 fingerprinting**: SETTINGS frame ordering and pseudo-header ordering match real browsers.
 								- **Per-browser header ordering**: HTTP headers sent in browser-specific wire order.
 								- **Bandwidth tracking**: atomic byte counters shared across cloned clients.
 								---
-												test: v0.2.2 pre-release check

											
										
										
											2026-03-27 18:48:15 +01:00
+								## [0.2.2] — 2026-03-27
 								### Fixed
 								- **`cargo install` broken with primp 1.2.0**: added missing `reqwest` patch to `[patch.crates-io]`. primp moved to reqwest 0.13 which requires a patched fork.
 								- **Weekly dependency check**: CI now runs every Monday to catch primp patch drift before users hit it.
 								---
-												feat: v0.2.1 — Docker image on GHCR, QuickJS data island extraction

- Docker image auto-built on every release via CI
- QuickJS sandbox executes inline <script> tags to extract JS-embedded
  content (window.__PRELOADED_STATE__, self.__next_f, etc.)
- Bumped version to 0.2.1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-27 17:18:31 +01:00
+								## [0.2.1] — 2026-03-27
 								### Added
 								- **Docker image on GHCR**: `docker run ghcr.io/0xmassi/webclaw` — auto-built on every release
 								- **QuickJS data island extraction**: inline `<script>` execution catches `window.__PRELOADED_STATE__`, Next.js hydration data, and other JS-embedded content
 								### Fixed
 								- Docker CI now runs as part of the release workflow (was missing, image was never published)
 								---
-												feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM

Document extraction:
- DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml)
- XLSX/XLS: markdown tables with multi-sheet support (via calamine)
- CSV: quoted field handling, markdown table output
- All auto-detected by Content-Type header or URL extension

New features:
- -f html output format (sanitized HTML)
- Multi-URL watch: --urls-file + --watch monitors all URLs in parallel
- Batch + LLM: --extract-prompt/--extract-json works with multiple URLs
- Mixed batch: HTML pages + DOCX + XLSX + CSV in one command

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-26 15:28:23 +01:00
+								## [0.2.0] — 2026-03-26
 								### Added
 								- **DOCX extraction**: auto-detected by Content-Type or URL extension, outputs markdown with headings
 								- **XLSX/XLS extraction**: spreadsheets converted to markdown tables, multi-sheet support via calamine
 								- **CSV extraction**: parsed with quoted field handling, output as markdown table
 								- **HTML output format**: `-f html` returns sanitized HTML from the extracted content
 								- **Multi-URL watch**: `--watch` now works with `--urls-file` to monitor multiple URLs in parallel
 								- **Batch + LLM extraction**: `--extract-prompt` and `--extract-json` now work with multiple URLs
 								- **Scheduled batch watch**: watch multiple URLs with aggregate change reports and per-URL diffs
 								---
-												fix: v0.1.7 — extraction options now work in batch mode (#3)

--only-main-content, --include, and --exclude were ignored in batch
mode because run_batch used default ExtractionOptions. Added
fetch_and_extract_batch_with_options to pass CLI options through.

Closes #3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-26 13:30:20 +01:00
+								## [0.1.7] — 2026-03-26
 								### Fixed
 								- `--only-main-content`, `--include`, and `--exclude` now work in batch mode (#3)
 								---
-												feat: v0.1.6 — watch mode, webhooks (Discord/Slack auto-format)

Watch mode:
- --watch polls a URL at --watch-interval (default 5min)
- Reports diffs to stdout when content changes
- --on-change runs a command with diff JSON on stdin
- Ctrl+C stops cleanly

Webhooks:
- --webhook POSTs JSON on crawl/batch complete and watch changes
- Auto-detects Discord and Slack URLs, formats as embeds/blocks
- Also available via WEBCLAW_WEBHOOK_URL env var
- Non-blocking, errors logged to stderr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-26 12:30:08 +01:00
+								## [0.1.6] — 2026-03-26
 								### Added
 								- `--watch`: monitor a URL for changes at a configurable interval with diff output
 								- `--watch-interval`: seconds between checks (default: 300)
 								- `--on-change`: run a command when changes are detected (diff JSON piped to stdin)
 								- `--webhook`: POST JSON notifications on crawl/batch complete and watch changes. Auto-formats for Discord and Slack webhooks
 								---
-												feat: v0.1.5 — --output-dir saves each page to a separate file

Adds --output-dir flag for CLI. Each extracted page gets its own file
with filename derived from the URL path. Works with single URL, crawl,
and batch modes. CSV input supports custom filenames (url,filename).

Root URLs use hostname/index.ext to avoid collisions in batch mode.
Subdirectories created automatically from URL path structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-26 11:02:25 +01:00
+								## [0.1.5] — 2026-03-26
 								### Added
 								- `--output-dir`: save each page to a separate file instead of stdout. Works with single URL, crawl, and batch modes
 								- CSV input with custom filenames: `url,filename` format in `--urls-file`
 								- Root URLs use `hostname/index.ext` to avoid collisions in batch mode
 								- Subdirectories created automatically from URL path structure
 								---
-												feat: v0.1.4 — QuickJS integration for inline JavaScript data extraction

Embeds QuickJS (rquickjs) to execute inline <script> tags and extract
data hidden in JavaScript variable assignments. Captures window.__*
objects like __preloadedData (NYTimes), __PRELOADED_STATE__ (Wired),
and self.__next_f (Next.js RSC flight data).

Results:
- NYTimes: 1,552 → 4,162 words (+168%)
- Wired: 1,459 → 9,937 words (+580%)
- Zero measurable performance overhead (<15ms per page)
- Feature-gated: disable with --no-default-features for WASM

Smart text filtering rejects CSS, base64, file paths, code strings.
Only readable prose is appended under "## Additional Content".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-26 10:28:16 +01:00
+								## [0.1.4] — 2026-03-26
 								### Added
 								- QuickJS integration for extracting data from inline JavaScript (NYTimes +168%, Wired +580% more content)
 								- Executes inline `<script>` tags in a sandboxed runtime to capture `window.__*` data blobs
 								- Parses Next.js RSC flight data (`self.__next_f`) for App Router sites
 								- Smart text filtering rejects CSS, base64, file paths, and code — only keeps readable prose
 								- Feature-gated with `quickjs` feature flag (enabled by default, disable for WASM builds)
 								---
-												feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support

Crawl:
- Real-time progress on stderr as pages complete
- --crawl-state saves progress on Ctrl+C, resumes from saved state
- Visited set + remaining frontier persisted for accurate resume

MCP server:
- Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars
- Falls back to proxies.txt in CWD (existing behavior)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-25 21:38:28 +01:00
+								## [0.1.3] — 2026-03-25
 								### Added
 								- Crawl streaming: real-time progress on stderr as pages complete (`[2/50] OK https://... (234ms, 1523 words)`)
 								- Crawl resume/cancel: `--crawl-state <path>` saves progress on Ctrl+C and resumes from where it left off
 								- MCP server proxy support via `WEBCLAW_PROXY` and `WEBCLAW_PROXY_FILE` env vars
 								### Changed
 								- Crawl results now expose visited set and remaining frontier for accurate state persistence
 								---
-												feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra

- Switch default profile to Safari26/Mac (best CF pass rate)
- Auto-fallback to plain client on connection error or 403
- Fixes: ycombinator.com, producthunt.com, and similar CF-strict sites
- Reddit .json endpoint uses plain client (TLS fingerprint was blocked)
- YouTube caption track extraction + timed text parser (core, not yet wired)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-25 18:50:07 +01:00
+								## [0.1.2] — 2026-03-25
 								### Changed
 								- Default TLS profile switched from Chrome145/Win to Safari26/Mac (highest pass rate across CF-protected sites)
 								- Plain client fallback: when impersonated TLS gets connection error or 403, automatically retries without impersonation (fixes ycombinator.com, producthunt.com, and similar sites)
-												chore: bump to v0.1.2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-24 18:44:52 +01:00
 								### Fixed
 								- Reddit scraping: use plain HTTP client for `.json` endpoint (TLS fingerprinting was getting blocked)
-												feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra

- Switch default profile to Safari26/Mac (best CF pass rate)
- Auto-fallback to plain client on connection error or 403
- Fixes: ycombinator.com, producthunt.com, and similar CF-strict sites
- Reddit .json endpoint uses plain client (TLS fingerprint was blocked)
- YouTube caption track extraction + timed text parser (core, not yet wired)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-25 18:50:07 +01:00
+								### Added
 								- YouTube transcript extraction infrastructure in webclaw-core (caption track parsing, timed text XML parser) — wired up when cloud API launches
-												chore: bump to v0.1.2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-24 18:44:52 +01:00
+								---
-												fix: v0.1.1 — MCP identity, timeouts, exit codes, URL validation

Critical:
- MCP server identifies as "webclaw-mcp" instead of "rmcp"
- Research tool poll loop capped at 200 iterations (~10 min)

CLI:
- Non-zero exit codes on errors
- Text format strips markdown table syntax

MCP server:
- URL validation on all tools
- 60s cloud API timeout, 30s local fetch timeout
- Diff cloud fallback computes actual diff
- Batch capped at 100 URLs, crawl at 500 pages
- Graceful startup failure instead of panic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

											
										
										
											2026-03-24 17:25:05 +01:00
+								## [0.1.1] — 2026-03-24
 								### Fixed
 								- MCP server now identifies as `webclaw-mcp` instead of `rmcp` in the MCP handshake
 								- Research tool polling caps at 200 iterations (~10 min) instead of looping forever
 								- CLI returns non-zero exit codes on errors (invalid format, fetch failures, missing LLM)
 								- Text format output strips markdown table syntax (`| --- |` pipes)
 								- All MCP tools validate URLs before network calls with clear error messages
 								- Cloud API HTTP client has 60s timeout instead of no timeout
 								- Local fetch calls timeout after 30s to prevent hanging on slow servers
 								- Diff cloud fallback computes actual diff instead of returning raw scrape JSON
 								- FetchClient startup failure logs and exits gracefully instead of panicking
 								### Added
 								- Upper bounds: batch capped at 100 URLs, crawl capped at 500 pages
 								---
-												Initial release: webclaw v0.1.0 — web content extraction for LLMs

CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io

											
										
										
											2026-03-23 18:31:11 +01:00
+								## [0.1.0] — 2026-03-18
 								First public release. Full-featured web content extraction toolkit for LLMs.
 								### Core Extraction
 								- Readability-style content scoring with text density, semantic tags, and link density penalties
 								- Exact CSS class token noise filtering with body-force fallback for SPAs
 								- HTML → markdown conversion with URL resolution, image alt text, srcset optimization
 								- 9-step LLM text optimization pipeline (67% token reduction vs raw HTML)
 								- JSON data island extraction (React, Next.js, Contentful CMS)
 								- YouTube transcript extraction (title, channel, views, duration, description)
 								- Lazy-loaded image detection (data-src, data-lazy-src, data-original)
 								- Brand identity extraction (name, colors, fonts, logos, OG image)
 								- Content change tracking / diff engine
 								- CSS selector filtering (include/exclude)
 								### Fetching & Crawling
 								- TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
 								- BFS same-origin crawler with configurable depth, concurrency, and delay
 								- Sitemap.xml and robots.txt discovery
 								- Batch multi-URL concurrent extraction
 								- Per-request proxy rotation from pool file
 								- Reddit JSON API and LinkedIn post extractors
 								### LLM Integration
 								- Provider chain: Ollama (local-first) → OpenAI → Anthropic
 								- JSON schema extraction (structured data from pages)
 								- Natural language prompt extraction
 								- Page summarization with configurable sentence count
 								### PDF
 								- PDF text extraction via pdf-extract
 								- Auto-detection by Content-Type header
 								### MCP Server
 								- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
 								- stdio transport for Claude Desktop, Claude Code, and any MCP client
 								- Smart Fetch: local extraction first, cloud API fallback
 								### CLI
 								- 4 output formats: markdown, JSON, plain text, LLM-optimized
 								- CSS selector filtering, crawling, sitemap discovery
 								- Brand extraction, content diffing, LLM features
 								- Browser profile selection, proxy support, stdin/file input
 								### Infrastructure
 								- Docker multi-stage build with Ollama sidecar
 								- Deploy script for Hetzner VPS