webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-04-25 00:06:21 +02:00

Author	SHA1	Message	Date
Valerio	1b8dfb77a6	feat: v0.1.6 — watch mode, webhooks (Discord/Slack auto-format) Watch mode: - --watch polls a URL at --watch-interval (default 5min) - Reports diffs to stdout when content changes - --on-change runs a command with diff JSON on stdin - Ctrl+C stops cleanly Webhooks: - --webhook POSTs JSON on crawl/batch complete and watch changes - Auto-detects Discord and Slack URLs, formats as embeds/blocks - Also available via WEBCLAW_WEBHOOK_URL env var - Non-blocking, errors logged to stderr Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 12:30:08 +01:00
Valerio	e5649e1824	feat: v0.1.5 — --output-dir saves each page to a separate file Adds --output-dir flag for CLI. Each extracted page gets its own file with filename derived from the URL path. Works with single URL, crawl, and batch modes. CSV input supports custom filenames (url,filename). Root URLs use hostname/index.ext to avoid collisions in batch mode. Subdirectories created automatically from URL path structure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 11:02:25 +01:00
Valerio	32c035c543	feat: v0.1.4 — QuickJS integration for inline JavaScript data extraction Embeds QuickJS (rquickjs) to execute inline <script> tags and extract data hidden in JavaScript variable assignments. Captures window.__* objects like __preloadedData (NYTimes), __PRELOADED_STATE__ (Wired), and self.__next_f (Next.js RSC flight data). Results: - NYTimes: 1,552 → 4,162 words (+168%) - Wired: 1,459 → 9,937 words (+580%) - Zero measurable performance overhead (<15ms per page) - Feature-gated: disable with --no-default-features for WASM Smart text filtering rejects CSS, base64, file paths, code strings. Only readable prose is appended under "## Additional Content". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 10:28:16 +01:00
Valerio	0c91c6d5a9	feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support Crawl: - Real-time progress on stderr as pages complete - --crawl-state saves progress on Ctrl+C, resumes from saved state - Visited set + remaining frontier persisted for accurate resume MCP server: - Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars - Falls back to proxies.txt in CWD (existing behavior) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 21:38:28 +01:00
Valerio	c90c0b6066	chore: bump to v0.1.2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 18:44:52 +01:00
Valerio	ea9c783bc5	fix: v0.1.1 — MCP identity, timeouts, exit codes, URL validation Critical: - MCP server identifies as "webclaw-mcp" instead of "rmcp" - Research tool poll loop capped at 200 iterations (~10 min) CLI: - Non-zero exit codes on errors - Text format strips markdown table syntax MCP server: - URL validation on all tools - 60s cloud API timeout, 30s local fetch timeout - Diff cloud fallback computes actual diff - Batch capped at 100 URLs, crawl at 500 pages - Graceful startup failure instead of panic Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 17:25:05 +01:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io	2026-03-23 18:31:11 +01:00

7 commits
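
The v0.1.5 commit describes filenames derived from the URL path, with root URLs stored as hostname/index.ext and subdirectories following the path structure. The sketch below shows one way such a mapping could work. It is written in Python for illustration only (webclaw itself is Rust), and the function name `output_path`, the default extension, and the handling of existing file extensions are assumptions, not webclaw's actual behavior.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def output_path(url: str, ext: str = "md") -> PurePosixPath:
    """Map a URL to a relative path inside an output directory.

    Hypothetical rules, loosely following the v0.1.5 commit message:
    - a root URL becomes <hostname>/index.<ext>
    - any other URL mirrors its path segments as subdirectories
    """
    parsed = urlparse(url)
    # Drop empty segments and "." / ".." so the result cannot
    # escape the output directory.
    segments = [s for s in parsed.path.split("/") if s not in ("", ".", "..")]
    if not segments:
        return PurePosixPath(parsed.hostname or "unknown") / f"index.{ext}"
    *dirs, leaf = segments
    # Replace an existing extension such as ".html" with the output one.
    stem = PurePosixPath(leaf).stem
    return PurePosixPath(*dirs) / f"{stem}.{ext}"


if __name__ == "__main__":
    for u in ("https://example.com/", "https://example.com/docs/intro.html"):
        print(u, "->", output_path(u))
```

Keying root URLs by hostname is what keeps two different sites from both writing to the same index file in batch mode, which matches the collision concern the commit message mentions.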
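
The v0.1.6 commit says webhook URLs for Discord and Slack are auto-detected and formatted as embeds and blocks. A minimal sketch of that idea follows, again in Python for illustration. The hostname checks, the function names `webhook_kind` and `build_payload`, and the generic fallback shape are assumptions; only the `embeds` (Discord) and `blocks` (Slack) payload fields come from those services' public webhook formats.

```python
from urllib.parse import urlparse


def webhook_kind(url: str) -> str:
    """Guess the webhook target from the URL's hostname (assumed heuristic)."""
    host = (urlparse(url).hostname or "").lower()
    if host in ("discord.com", "discordapp.com") or host.endswith(".discord.com"):
        return "discord"
    if host == "hooks.slack.com":
        return "slack"
    return "generic"


def build_payload(url: str, title: str, summary: str) -> dict:
    """Build a JSON-serializable body suited to the detected target."""
    kind = webhook_kind(url)
    if kind == "discord":
        # Discord webhooks accept a list of embed objects.
        return {"embeds": [{"title": title, "description": summary}]}
    if kind == "slack":
        # Slack incoming webhooks accept Block Kit blocks.
        text = f"*{title}*\n{summary}"
        return {"blocks": [{"type": "section",
                            "text": {"type": "mrkdwn", "text": text}}]}
    # Hypothetical plain JSON shape for any other endpoint.
    return {"title": title, "summary": summary}
```

Sending the payload is left out on purpose: the commit describes delivery as non-blocking with errors logged to stderr, and how webclaw does that is not shown in the log.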