7 commits

Author SHA1 Message Date
Valerio
1b8dfb77a6 feat: v0.1.6 — watch mode, webhooks (Discord/Slack auto-format)
Watch mode:
- --watch polls a URL at --watch-interval (default: 5 min)
- Reports diffs to stdout when content changes
- --on-change runs a command with diff JSON on stdin
- Ctrl+C stops cleanly
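
A minimal sketch of the poll-and-diff loop described above, not the project's actual implementation. The `watch` function, its parameters, and the injected `fetch`/`sleep` callables are hypothetical names chosen for illustration; only the 5-minute default interval comes from the commit message.

```python
import difflib
import time
from typing import Callable, Iterator, Optional


def watch(fetch: Callable[[], str],
          interval_s: float = 300.0,            # 5 min, matching the stated default
          max_polls: Optional[int] = None,      # hypothetical: bound the loop for testing
          sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield a unified diff each time the fetched content changes."""
    previous = fetch()
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(interval_s)
        current = fetch()
        polls += 1
        if current != previous:
            diff = difflib.unified_diff(
                previous.splitlines(), current.splitlines(),
                fromfile="previous", tofile="current", lineterm="")
            yield "\n".join(diff)
            # Compare the next poll against the newest snapshot.
            previous = current
```

Injecting `fetch` and `sleep` keeps the loop testable without network access or real delays; a caller could wrap the generator in `try/except KeyboardInterrupt` to stop cleanly on Ctrl+C.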

Webhooks:
- --webhook POSTs JSON on crawl/batch complete and watch changes
- Auto-detects Discord and Slack URLs, formats as embeds/blocks
- Also available via WEBCLAW_WEBHOOK_URL env var
- Non-blocking, errors logged to stderr
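
One way the URL-based auto-detection could work, sketched under assumptions: the host and path checks, the `detect_service` and `build_payload` names, and the generic payload fields are all illustrative rather than taken from webclaw. The Discord `embeds` and Slack `blocks` shapes follow those services' public webhook formats.

```python
from urllib.parse import urlparse


def detect_service(webhook_url: str) -> str:
    """Guess the target service from the webhook URL (assumed heuristic)."""
    parts = urlparse(webhook_url)
    host = (parts.hostname or "").lower()
    if host in ("discord.com", "discordapp.com") and parts.path.startswith("/api/webhooks/"):
        return "discord"
    if host == "hooks.slack.com":
        return "slack"
    return "generic"


def build_payload(webhook_url: str, event: str, url: str, summary: str) -> dict:
    """Build a JSON-serializable body suited to the detected service."""
    service = detect_service(webhook_url)
    if service == "discord":
        # Discord webhooks accept a list of embed objects.
        return {"embeds": [{"title": event, "url": url, "description": summary}]}
    if service == "slack":
        # Slack incoming webhooks accept fallback text plus Block Kit blocks.
        text = f"*{event}*\n<{url}>\n{summary}"
        return {
            "text": f"{event}: {url}",
            "blocks": [{"type": "section", "text": {"type": "mrkdwn", "text": text}}],
        }
    # Hypothetical plain JSON shape for any other endpoint.
    return {"event": event, "url": url, "summary": summary}
```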

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 12:30:08 +01:00
Valerio
e5649e1824 feat: v0.1.5 — --output-dir saves each page to a separate file
Adds --output-dir flag for CLI. Each extracted page gets its own file
with filename derived from the URL path. Works with single URL, crawl,
and batch modes. CSV input supports custom filenames (url,filename).

Root URLs use hostname/index.ext to avoid collisions in batch mode.
Subdirectories created automatically from URL path structure.
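
The URL-to-filename mapping might look roughly like the sketch below. Only the `hostname/index.ext` rule for root URLs and the mirroring of URL path segments as subdirectories come from the commit message; the `output_path` helper, the replacement of an existing extension, and the omission of the hostname for non-root paths are assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def output_path(url: str, ext: str = "md") -> str:
    """Derive a relative output path from an absolute URL (illustrative rules)."""
    parts = urlparse(url)
    # Drop empty and dot segments so the result stays inside the output dir.
    segments = [s for s in parts.path.split("/") if s and s not in (".", "..")]
    if not segments:
        # Root URL: hostname/index.ext avoids collisions between sites.
        return f"{parts.hostname}/index.{ext}"
    *dirs, last = segments
    # Swap any existing extension (e.g. .html) for the output format's.
    stem = PurePosixPath(last).stem or last
    return "/".join([*dirs, f"{stem}.{ext}"])
```

The caller would then join this relative path onto the `--output-dir` value and create any missing parent directories before writing.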

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 11:02:25 +01:00
Valerio
32c035c543 feat: v0.1.4 — QuickJS integration for inline JavaScript data extraction
Embeds QuickJS (rquickjs) to execute inline <script> tags and extract
data hidden in JavaScript variable assignments. Captures window.__*
objects like __preloadedData (NYTimes), __PRELOADED_STATE__ (Wired),
and self.__next_f (Next.js RSC flight data).

Results:
- NYTimes: 1,552 → 4,162 words (+168%)
- Wired: 1,459 → 9,937 words (+581%)
- Negligible performance overhead (<15 ms per page)
- Feature-gated: disable with --no-default-features for WASM

Smart text filtering rejects CSS, base64, file paths, code strings.
Only readable prose is appended under "## Additional Content".
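
A rough sketch of the kind of heuristic that could separate readable prose from CSS, base64, file paths, and code strings. The thresholds, the `CODE_CHARS` set, and the `looks_like_prose` name are assumptions for illustration; the commit does not describe the actual rules.

```python
# Characters that are common in CSS/JS source but rare in running prose.
CODE_CHARS = set("{};=<>()[]")


def looks_like_prose(text: str,
                     min_words: int = 5,
                     max_code_ratio: float = 0.03,
                     min_alpha_ratio: float = 0.8) -> bool:
    """Return True if a string extracted from JS data reads like prose."""
    s = text.strip()
    # Base64 blobs, file paths, and identifiers are single unbroken tokens.
    if len(s.split()) < min_words:
        return False
    # Code and CSS carry a high density of structural punctuation.
    code_ratio = sum(c in CODE_CHARS for c in s) / len(s)
    if code_ratio > max_code_ratio:
        return False
    # Prose is dominated by letters and spaces.
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in s) / len(s)
    return alpha_ratio >= min_alpha_ratio
```

Strings that pass a filter like this would be the ones appended under "## Additional Content".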

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 10:28:16 +01:00
Valerio
0c91c6d5a9 feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support
Crawl:
- Real-time progress on stderr as pages complete
- --crawl-state saves progress on Ctrl+C, resumes from saved state
- Visited set + remaining frontier persisted for accurate resume
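
Persisting both the visited set and the remaining frontier is what makes an accurate resume possible. A minimal sketch, assuming a JSON state file; the file layout and the `save_state`/`load_state` helpers are hypothetical, not webclaw's format.

```python
import json
from collections import deque
from pathlib import Path


def save_state(path, visited, frontier):
    """Write crawl progress to a temp file, then swap it into place."""
    state = {"visited": sorted(visited), "frontier": list(frontier)}
    tmp = Path(path).with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    # The rename means an interrupted save never leaves a half-written file.
    tmp.replace(path)


def load_state(path, seed):
    """Return (visited, frontier); start from the seed URL if no state exists."""
    p = Path(path)
    if not p.exists():
        return set(), deque([seed])
    state = json.loads(p.read_text())
    return set(state["visited"]), deque(state["frontier"])
```

On resume, the crawler would skip anything in `visited` and continue popping URLs from `frontier`, so no page is fetched twice and none is lost.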

MCP server:
- Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars
- Falls back to proxies.txt in CWD (existing behavior)
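
The lookup could be sketched as below. The env var names and the `proxies.txt` fallback come from the commit message; the precedence order, the one-proxy-per-line file format, the `#` comment handling, and the `resolve_proxies` name are assumptions.

```python
import os
from pathlib import Path


def resolve_proxies(env=None, cwd="."):
    """Return a list of proxy URLs from env vars or a proxies.txt fallback."""
    env = os.environ if env is None else env
    # Assumed precedence: a single explicit proxy wins over any file.
    if env.get("WEBCLAW_PROXY"):
        return [env["WEBCLAW_PROXY"]]
    candidates = []
    if env.get("WEBCLAW_PROXY_FILE"):
        candidates.append(Path(env["WEBCLAW_PROXY_FILE"]))
    # Existing behavior: fall back to proxies.txt in the working directory.
    candidates.append(Path(cwd) / "proxies.txt")
    for candidate in candidates:
        if candidate.is_file():
            lines = (line.strip() for line in candidate.read_text().splitlines())
            return [line for line in lines if line and not line.startswith("#")]
    return []
```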

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 21:38:28 +01:00
Valerio
c90c0b6066 chore: bump to v0.1.2
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 18:44:52 +01:00
Valerio
ea9c783bc5 fix: v0.1.1 — MCP identity, timeouts, exit codes, URL validation
Critical:
- MCP server identifies as "webclaw-mcp" instead of "rmcp"
- Research tool poll loop capped at 200 iterations (~10 min)
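
The cap works out to 200 polls at roughly 3 s each, about 600 s or 10 minutes. The 3-second interval is inferred from those two figures rather than stated in the commit. A bounded poll loop along these lines, with hypothetical names throughout:

```python
import time


def poll_until_done(check, max_iterations=200, interval_s=3.0, sleep=time.sleep):
    """Call check() until it returns a non-None result or the cap is hit."""
    for attempt in range(1, max_iterations + 1):
        result = check()
        if result is not None:
            return result
        # No need to wait after the final attempt.
        if attempt < max_iterations:
            sleep(interval_s)
    raise TimeoutError(f"no result after {max_iterations} polls")
```

Raising on exhaustion, rather than looping forever, lets the caller report a clear timeout to the client.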

CLI:
- Non-zero exit codes on errors
- Text format strips markdown table syntax

MCP server:
- URL validation on all tools
- 60s cloud API timeout, 30s local fetch timeout
- Diff cloud fallback computes actual diff
- Batch capped at 100 URLs, crawl at 500 pages
- Graceful startup failure instead of panic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 17:25:05 +01:00
Valerio
c99ec684fa Initial release: webclaw v0.1.0 — web content extraction for LLMs
CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io
2026-03-23 18:31:11 +01:00