Commit graph

40 commits

Author SHA1 Message Date
Valerio
ee3c714aa9 docs: update CONTRIBUTING.md for v0.3.0 architecture
- Replace Impit/primp references with webclaw-tls
- Add architecture diagram showing crate layout + TLS repo
- Update crate boundaries table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 10:17:26 +02:00
Valerio
e3b0d0bd74 fix: make reddit and linkedin modules public for server access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:54:35 +02:00
Valerio
7051d2193b docs: add v0.3.0 changelog entry
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:47:04 +02:00
Valerio
f275a93bec fix: clippy empty-line-after-doc-comment in browser.rs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:45:05 +02:00
Valerio
140234c139 style: cargo fmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:43:11 +02:00
Valerio
f13cb83c73 feat: replace primp with webclaw-tls, bump to v0.3.0
Replace primp dependency with our own TLS fingerprinting stack
(webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match.

- Remove primp entirely (zero references remaining)
- webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls
- Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains)
- Skip unknown certificate extensions (SCT tolerance)
- 99% bypass rate on 102 sites (was ~85% with primp)
- Fixes #5 (HTTPS broken — example.com and similar sites now work)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:40:10 +02:00
Valerio
77e93441c0 fix(ci): add QEMU for arm64 apt-get in Docker build
Plain docker build --platform linux/arm64 on amd64 runner needs QEMU
to execute RUN commands. QEMU is only needed for apt-get (seconds),
not for Rust compilation (the binaries are pre-built).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:51:51 +01:00
Valerio
8cf021a00b fix(ci): single Docker job with plain docker build + manifest
buildx creates manifest lists per-platform which can't be nested.
Use plain docker build for each arch then docker manifest create
to combine them. Single job, no matrix, no QEMU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:45:05 +01:00
Valerio
78810793cf chore: align Cargo.toml version with v0.2.3 tag 2026-03-27 20:41:02 +01:00
Valerio
ef120f6ec7 fix(ci): fix Docker binary path extraction from release tarball
Tarball extracts to webclaw-vX.Y.Z-target/ directory, not flat.
Use direct cp instead of find.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:39:14 +01:00
Valerio
48a3c45b36 ci: use pre-built binaries for Docker instead of QEMU cross-compilation
QEMU arm64 Rust builds took 60+ min and timed out in CI. Now the Docker
job downloads the pre-built release binaries and packages them directly.

- Dockerfile.ci: slim image for CI (downloads pre-built binaries)
- Dockerfile: full source build for local dev (unchanged build stage)
- Both use ubuntu:24.04 (GLIBC 2.39 matches CI build environment)
- Multi-arch manifest combines amd64 + arm64 images

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:32:50 +01:00
Valerio
dfcddd1973 ci: build multi-platform Docker images (amd64 + arm64)
Image only had linux/amd64, failing on Apple Silicon Macs with
"no matching manifest for linux/arm64/v8". Added QEMU + buildx
multi-platform support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 19:18:49 +01:00
Valerio
341f4737e1 test: v0.2.2 pre-release check 2026-03-27 18:48:15 +01:00
Valerio
8b82ad12d0 ci: add weekly primp compatibility check
Runs every Monday — updates primp to latest, tries to build.
If patches are out of sync the build fails with a clear error
pointing to primp's Cargo.toml for the new patch list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 18:45:45 +01:00
Valerio
76cb6b6cd7 fix: add reqwest to patch list, sync with primp 1.2.0
primp 1.2.0 moved to reqwest 0.13 and now patches reqwest itself
(primp-reqwest). Without this patch, cargo install gets vanilla
reqwest 0.13 which is missing the HTTP/2 impersonation methods.

Users should use: cargo install --locked --git ...

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 18:45:28 +01:00
Valerio
2f6255fe6f add SKILL.md for Claude Code skill integration
Adds the webclaw skill definition for Claude Code / Smithery.
Located at skill/SKILL.md with proper frontmatter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 17:58:01 +01:00
Valerio
a6be233df9 feat: v0.2.1 — Docker image on GHCR, QuickJS data island extraction
- Docker image auto-built on every release via CI
- QuickJS sandbox executes inline <script> tags to extract JS-embedded
  content (window.__PRELOADED_STATE__, self.__next_f, etc.)
- Bumped version to 0.2.1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 17:18:31 +01:00
Valerio
b039b99858 ci: add Docker image build to release workflow
Pushes ghcr.io/0xmassi/webclaw:latest and :vX.Y.Z on every tagged
release. Uses BuildKit cache for fast rebuilds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 16:48:44 +01:00
Valerio
81e78963d0 feat: enable quickjs for JS data island extraction
webclaw-core has QuickJS behind a feature flag for extracting data from
inline <script> tags (window.__PRELOADED_STATE__, self.__next_f, etc).
The server was using an old lockfile without the feature enabled.

Updated deps to v0.2.0 and explicitly enabled quickjs. This improves
extraction on SPAs like NYTimes, Nike, and Bloomberg where content is
embedded in JS variable assignments rather than visible DOM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 18:50:32 +01:00
Valerio
d67257e931 style: upgrade README badges to for-the-badge style, add X/Twitter
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:30:24 +01:00
Valerio
ea14848772 feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM
Document extraction:
- DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml)
- XLSX/XLS: markdown tables with multi-sheet support (via calamine)
- CSV: quoted field handling, markdown table output
- All auto-detected by Content-Type header or URL extension

New features:
- -f html output format (sanitized HTML)
- Multi-URL watch: --urls-file + --watch monitors all URLs in parallel
- Batch + LLM: --extract-prompt/--extract-json works with multiple URLs
- Mixed batch: HTML pages + DOCX + XLSX + CSV in one command

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:28:23 +01:00
Valerio
0e4128782a fix: v0.1.7 — extraction options now work in batch mode (#3)
--only-main-content, --include, and --exclude were ignored in batch
mode because run_batch used default ExtractionOptions. Added
fetch_and_extract_batch_with_options to pass CLI options through.

Closes #3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 13:30:20 +01:00
Valerio
1b8dfb77a6 feat: v0.1.6 — watch mode, webhooks (Discord/Slack auto-format)
Watch mode:
- --watch polls a URL at --watch-interval (default 5min)
- Reports diffs to stdout when content changes
- --on-change runs a command with diff JSON on stdin
- Ctrl+C stops cleanly

Webhooks:
- --webhook POSTs JSON on crawl/batch complete and watch changes
- Auto-detects Discord and Slack URLs, formats as embeds/blocks
- Also available via WEBCLAW_WEBHOOK_URL env var
- Non-blocking, errors logged to stderr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 12:30:08 +01:00
Valerio
e5649e1824 feat: v0.1.5 — --output-dir saves each page to a separate file
Adds --output-dir flag for CLI. Each extracted page gets its own file
with filename derived from the URL path. Works with single URL, crawl,
and batch modes. CSV input supports custom filenames (url,filename).

Root URLs use hostname/index.ext to avoid collisions in batch mode.
Subdirectories created automatically from URL path structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 11:02:25 +01:00
Valerio
32c035c543 feat: v0.1.4 — QuickJS integration for inline JavaScript data extraction
Embeds QuickJS (rquickjs) to execute inline <script> tags and extract
data hidden in JavaScript variable assignments. Captures window.__*
objects like __preloadedData (NYTimes), __PRELOADED_STATE__ (Wired),
and self.__next_f (Next.js RSC flight data).

Results:
- NYTimes: 1,552 → 4,162 words (+168%)
- Wired: 1,459 → 9,937 words (+580%)
- Zero measurable performance overhead (<15ms per page)
- Feature-gated: disable with --no-default-features for WASM

Smart text filtering rejects CSS, base64, file paths, code strings.
Only readable prose is appended under "## Additional Content".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 10:28:16 +01:00
Valerio
0c91c6d5a9 feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support
Crawl:
- Real-time progress on stderr as pages complete
- --crawl-state saves progress on Ctrl+C, resumes from saved state
- Visited set + remaining frontier persisted for accurate resume

MCP server:
- Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars
- Falls back to proxies.txt in CWD (existing behavior)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 21:38:28 +01:00
Valerio
afe4d3077d feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra
- Switch default profile to Safari26/Mac (best CF pass rate)
- Auto-fallback to plain client on connection error or 403
- Fixes: ycombinator.com, producthunt.com, and similar CF-strict sites
- Reddit .json endpoint uses plain client (TLS fingerprint was blocked)
- YouTube caption track extraction + timed text parser (core, not yet wired)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 18:50:07 +01:00
Valerio
c90c0b6066 chore: bump to v0.1.2
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 18:44:52 +01:00
Valerio
907966a983 fix: use plain client for Reddit JSON endpoint
Reddit blocks TLS-fingerprinted clients on their .json API but
accepts standard requests with a browser User-Agent. Switch to
a non-impersonated primp client for the Reddit fallback path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 18:43:47 +01:00
Valerio
dff458d2f5 fix: collapse nested if to satisfy clippy
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 17:28:57 +01:00
Valerio
b92c0ed186 style: fix cargo fmt formatting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 17:27:15 +01:00
Valerio
ea9c783bc5 fix: v0.1.1 — MCP identity, timeouts, exit codes, URL validation
Critical:
- MCP server identifies as "webclaw-mcp" instead of "rmcp"
- Research tool poll loop capped at 200 iterations (~10 min)

CLI:
- Non-zero exit codes on errors
- Text format strips markdown table syntax

MCP server:
- URL validation on all tools
- 60s cloud API timeout, 30s local fetch timeout
- Diff cloud fallback computes actual diff
- Batch capped at 100 URLs, crawl at 500 pages
- Graceful startup failure instead of panic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 17:25:05 +01:00
Valerio
09fa3f5fc9 style: move Glama badge to MCP Server section
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:47:45 +01:00
Valerio
155a7ca579 fix: update glama.json to match schema, add Glama badge to README
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:27:12 +01:00
Valerio
e84b5b5917 feat: add demo GIF to README — web_fetch 403 vs webclaw success
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 11:07:26 +01:00
Valerio
bdcd0592dc fix: use macos-latest for release builds 2026-03-24 10:43:36 +01:00
Valerio
d2887cace8 feat: GitHub Release workflow + Homebrew tap + install badges
- .github/workflows/release.yml: builds prebuilt binaries for
  macOS (arm64, x86_64) and Linux (x86_64, aarch64) on tag push.
  Creates GitHub Release with tarballs + SHA256SUMS.
  Auto-updates Homebrew formula via bump-homebrew-formula-action.

- README: added GitHub download count + npm install count badges.
  Install section now lists: Homebrew, prebuilt binaries, cargo
  install --git, Docker, Docker Compose.

- Homebrew tap created at github.com/0xMassi/homebrew-webclaw
  with Formula/webclaw.rb (installs webclaw + webclaw-mcp binaries).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 10:41:37 +01:00
Valerio
d521a79a66 feat: add glama.json for MCP server directory listing
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 10:12:17 +01:00
Valerio
4fbc9b9d1f docs: add Discord link to README badges and community section
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 10:10:34 +01:00
Valerio
c99ec684fa Initial release: webclaw v0.1.0 — web content extraction for LLMs
CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io
2026-03-23 18:31:11 +01:00