Commit graph

166 commits

Author SHA1 Message Date
Valerio
32c035c543 feat: v0.1.4 — QuickJS integration for inline JavaScript data extraction
Embeds QuickJS (rquickjs) to execute inline <script> tags and extract
data hidden in JavaScript variable assignments. Captures window.__*
objects like __preloadedData (NYTimes), __PRELOADED_STATE__ (Wired),
and self.__next_f (Next.js RSC flight data).

Results:
- NYTimes: 1,552 → 4,162 words (+168%)
- Wired: 1,459 → 9,937 words (+580%)
- Zero measurable performance overhead (<15ms per page)
- Feature-gated: disable with --no-default-features for WASM

Smart text filtering rejects CSS, base64, file paths, code strings.
Only readable prose is appended under "## Additional Content".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 10:28:16 +01:00
Valerio
0c91c6d5a9 feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support
Crawl:
- Real-time progress on stderr as pages complete
- --crawl-state saves progress on Ctrl+C, resumes from saved state
- Visited set + remaining frontier persisted for accurate resume

MCP server:
- Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars
- Falls back to proxies.txt in CWD (existing behavior)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 21:38:28 +01:00
Valerio
afe4d3077d feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra
- Switch default profile to Safari26/Mac (best CF pass rate)
- Auto-fallback to plain client on connection error or 403
- Fixes: ycombinator.com, producthunt.com, and similar CF-strict sites
- Reddit .json endpoint uses plain client (TLS fingerprint was blocked)
- YouTube caption track extraction + timed text parser (core, not yet wired)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 18:50:07 +01:00
Valerio
c90c0b6066 chore: bump to v0.1.2
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 18:44:52 +01:00
Valerio
907966a983 fix: use plain client for Reddit JSON endpoint
Reddit blocks TLS-fingerprinted clients on their .json API but
accepts standard requests with a browser User-Agent. Switch to
a non-impersonated primp client for the Reddit fallback path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 18:43:47 +01:00
Valerio
dff458d2f5 fix: collapse nested if to satisfy clippy
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 17:28:57 +01:00
Valerio
b92c0ed186 style: fix cargo fmt formatting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 17:27:15 +01:00
Valerio
ea9c783bc5 fix: v0.1.1 — MCP identity, timeouts, exit codes, URL validation
Critical:
- MCP server identifies as "webclaw-mcp" instead of "rmcp"
- Research tool poll loop capped at 200 iterations (~10 min)

CLI:
- Non-zero exit codes on errors
- Text format strips markdown table syntax

MCP server:
- URL validation on all tools
- 60s cloud API timeout, 30s local fetch timeout
- Diff cloud fallback computes actual diff
- Batch capped at 100 URLs, crawl at 500 pages
- Graceful startup failure instead of panic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 17:25:05 +01:00
Valerio
09fa3f5fc9 style: move Glama badge to MCP Server section
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:47:45 +01:00
Valerio
155a7ca579 fix: update glama.json to match schema, add Glama badge to README
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:27:12 +01:00
Valerio
e84b5b5917 feat: add demo GIF to README — web_fetch 403 vs webclaw success
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 11:07:26 +01:00
Valerio
bdcd0592dc fix: use macos-latest for release builds 2026-03-24 10:43:36 +01:00
Valerio
d2887cace8 feat: GitHub Release workflow + Homebrew tap + install badges
- .github/workflows/release.yml: builds prebuilt binaries for
  macOS (arm64, x86_64) and Linux (x86_64, aarch64) on tag push.
  Creates GitHub Release with tarballs + SHA256SUMS.
  Auto-updates Homebrew formula via bump-homebrew-formula-action.

- README: added GitHub download count + npm install count badges.
  Install section now lists: Homebrew, prebuilt binaries, cargo
  install --git, Docker, Docker Compose.

- Homebrew tap created at github.com/0xMassi/homebrew-webclaw
  with Formula/webclaw.rb (installs webclaw + webclaw-mcp binaries).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 10:41:37 +01:00
Valerio
d521a79a66 feat: add glama.json for MCP server directory listing
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 10:12:17 +01:00
Valerio
4fbc9b9d1f docs: add Discord link to README badges and community section
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 10:10:34 +01:00
Valerio
c99ec684fa Initial release: webclaw v0.1.0 — web content extraction for LLMs
CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io
2026-03-23 18:31:11 +01:00