Commit graph

4 commits

Author SHA1 Message Date
Valerio
0c91c6d5a9 feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support
Crawl:
- Real-time progress on stderr as pages complete
- --crawl-state saves progress on Ctrl+C, resumes from saved state
- Visited set + remaining frontier persisted for accurate resume

MCP server:
- Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars
- Falls back to proxies.txt in CWD (existing behavior)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 21:38:28 +01:00
Valerio
afe4d3077d feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra
- Switch default profile to Safari26/Mac (best CF pass rate)
- Auto-fallback to plain client on connection error or 403
- Fixes: ycombinator.com, producthunt.com, and similar CF-strict sites
- Reddit .json endpoint uses plain client (TLS fingerprint was blocked)
- YouTube caption track extraction + timed text parser (core, not yet wired)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 18:50:07 +01:00
Valerio
907966a983 fix: use plain client for Reddit JSON endpoint
Reddit blocks TLS-fingerprinted clients on their .json API but
accepts standard requests with a browser User-Agent. Switch to
a non-impersonated primp client for the Reddit fallback path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 18:43:47 +01:00
Valerio
c99ec684fa Initial release: webclaw v0.1.0 — web content extraction for LLMs
CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io
2026-03-23 18:31:11 +01:00