webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-25 03:08:06 +02:00

Author	SHA1	Message	Date
Valerio	341f4737e1	test: v0.2.2 pre-release check	2026-03-27 18:48:15 +01:00
Valerio	8b82ad12d0	ci: add weekly primp compatibility check Runs every Monday — updates primp to latest, tries to build. If patches are out of sync the build fails with a clear error pointing to primp's Cargo.toml for the new patch list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 18:45:45 +01:00
Valerio	76cb6b6cd7	fix: add reqwest to patch list, sync with primp 1.2.0 primp 1.2.0 moved to reqwest 0.13 and now patches reqwest itself (primp-reqwest). Without this patch, cargo install gets vanilla reqwest 0.13 which is missing the HTTP/2 impersonation methods. Users should use: cargo install --locked --git ... Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 18:45:28 +01:00
Valerio	2f6255fe6f	add SKILL.md for Claude Code skill integration Adds the webclaw skill definition for Claude Code / Smithery. Located at skill/SKILL.md with proper frontmatter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 17:58:01 +01:00
Valerio	a6be233df9	feat: v0.2.1 — Docker image on GHCR, QuickJS data island extraction - Docker image auto-built on every release via CI - QuickJS sandbox executes inline <script> tags to extract JS-embedded content (window.__PRELOADED_STATE__, self.__next_f, etc.) - Bumped version to 0.2.1 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 17:18:31 +01:00
Valerio	b039b99858	ci: add Docker image build to release workflow Pushes ghcr.io/0xmassi/webclaw:latest and :vX.Y.Z on every tagged release. Uses BuildKit cache for fast rebuilds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 16:48:44 +01:00
Valerio	81e78963d0	feat: enable quickjs for JS data island extraction webclaw-core has QuickJS behind a feature flag for extracting data from inline <script> tags (window.__PRELOADED_STATE__, self.__next_f, etc). The server was using an old lockfile without the feature enabled. Updated deps to v0.2.0 and explicitly enabled quickjs. This improves extraction on SPAs like NYTimes, Nike, and Bloomberg where content is embedded in JS variable assignments rather than visible DOM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 18:50:32 +01:00
Valerio	d67257e931	style: upgrade README badges to for-the-badge style, add X/Twitter Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 15:30:24 +01:00
Valerio	ea14848772	feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM Document extraction: - DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml) - XLSX/XLS: markdown tables with multi-sheet support (via calamine) - CSV: quoted field handling, markdown table output - All auto-detected by Content-Type header or URL extension New features: - -f html output format (sanitized HTML) - Multi-URL watch: --urls-file + --watch monitors all URLs in parallel - Batch + LLM: --extract-prompt/--extract-json works with multiple URLs - Mixed batch: HTML pages + DOCX + XLSX + CSV in one command Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 15:28:23 +01:00
Valerio	0e4128782a	fix: v0.1.7 — extraction options now work in batch mode (#3 ) --only-main-content, --include, and --exclude were ignored in batch mode because run_batch used default ExtractionOptions. Added fetch_and_extract_batch_with_options to pass CLI options through. Closes #3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 13:30:20 +01:00
Valerio	1b8dfb77a6	feat: v0.1.6 — watch mode, webhooks (Discord/Slack auto-format) Watch mode: - --watch polls a URL at --watch-interval (default 5min) - Reports diffs to stdout when content changes - --on-change runs a command with diff JSON on stdin - Ctrl+C stops cleanly Webhooks: - --webhook POSTs JSON on crawl/batch complete and watch changes - Auto-detects Discord and Slack URLs, formats as embeds/blocks - Also available via WEBCLAW_WEBHOOK_URL env var - Non-blocking, errors logged to stderr Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 12:30:08 +01:00
Valerio	e5649e1824	feat: v0.1.5 — --output-dir saves each page to a separate file Adds --output-dir flag for CLI. Each extracted page gets its own file with filename derived from the URL path. Works with single URL, crawl, and batch modes. CSV input supports custom filenames (url,filename). Root URLs use hostname/index.ext to avoid collisions in batch mode. Subdirectories created automatically from URL path structure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 11:02:25 +01:00
Valerio	32c035c543	feat: v0.1.4 — QuickJS integration for inline JavaScript data extraction Embeds QuickJS (rquickjs) to execute inline <script> tags and extract data hidden in JavaScript variable assignments. Captures window.__* objects like __preloadedData (NYTimes), __PRELOADED_STATE__ (Wired), and self.__next_f (Next.js RSC flight data). Results: - NYTimes: 1,552 → 4,162 words (+168%) - Wired: 1,459 → 9,937 words (+580%) - Zero measurable performance overhead (<15ms per page) - Feature-gated: disable with --no-default-features for WASM Smart text filtering rejects CSS, base64, file paths, code strings. Only readable prose is appended under "## Additional Content". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 10:28:16 +01:00
Valerio	0c91c6d5a9	feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support Crawl: - Real-time progress on stderr as pages complete - --crawl-state saves progress on Ctrl+C, resumes from saved state - Visited set + remaining frontier persisted for accurate resume MCP server: - Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars - Falls back to proxies.txt in CWD (existing behavior) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 21:38:28 +01:00
Valerio	afe4d3077d	feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra - Switch default profile to Safari26/Mac (best CF pass rate) - Auto-fallback to plain client on connection error or 403 - Fixes: ycombinator.com, producthunt.com, and similar CF-strict sites - Reddit .json endpoint uses plain client (TLS fingerprint was blocked) - YouTube caption track extraction + timed text parser (core, not yet wired) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 18:50:07 +01:00
Valerio	c90c0b6066	chore: bump to v0.1.2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 18:44:52 +01:00
Valerio	907966a983	fix: use plain client for Reddit JSON endpoint Reddit blocks TLS-fingerprinted clients on their .json API but accepts standard requests with a browser User-Agent. Switch to a non-impersonated primp client for the Reddit fallback path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 18:43:47 +01:00
Valerio	dff458d2f5	fix: collapse nested if to satisfy clippy Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 17:28:57 +01:00
Valerio	b92c0ed186	style: fix cargo fmt formatting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 17:27:15 +01:00
Valerio	ea9c783bc5	fix: v0.1.1 — MCP identity, timeouts, exit codes, URL validation Critical: - MCP server identifies as "webclaw-mcp" instead of "rmcp" - Research tool poll loop capped at 200 iterations (~10 min) CLI: - Non-zero exit codes on errors - Text format strips markdown table syntax MCP server: - URL validation on all tools - 60s cloud API timeout, 30s local fetch timeout - Diff cloud fallback computes actual diff - Batch capped at 100 URLs, crawl at 500 pages - Graceful startup failure instead of panic Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 17:25:05 +01:00
Valerio	09fa3f5fc9	style: move Glama badge to MCP Server section Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:47:45 +01:00
Valerio	155a7ca579	fix: update glama.json to match schema, add Glama badge to README Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:27:12 +01:00
Valerio	e84b5b5917	feat: add demo GIF to README — web_fetch 403 vs webclaw success Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 11:07:26 +01:00
Valerio	bdcd0592dc	fix: use macos-latest for release builds	2026-03-24 10:43:36 +01:00
Valerio	d2887cace8	feat: GitHub Release workflow + Homebrew tap + install badges - .github/workflows/release.yml: builds prebuilt binaries for macOS (arm64, x86_64) and Linux (x86_64, aarch64) on tag push. Creates GitHub Release with tarballs + SHA256SUMS. Auto-updates Homebrew formula via bump-homebrew-formula-action. - README: added GitHub download count + npm install count badges. Install section now lists: Homebrew, prebuilt binaries, cargo install --git, Docker, Docker Compose. - Homebrew tap created at github.com/0xMassi/homebrew-webclaw with Formula/webclaw.rb (installs webclaw + webclaw-mcp binaries). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 10:41:37 +01:00
Valerio	d521a79a66	feat: add glama.json for MCP server directory listing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 10:12:17 +01:00
Valerio	4fbc9b9d1f	docs: add Discord link to README badges and community section Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 10:10:34 +01:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io	2026-03-23 18:31:11 +01:00

1 2

78 commits