webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-28 03:29:38 +02:00

Author	SHA1	Message	Date
Valerio	9af55c2a2d	fix(create-webclaw): repair binary install on Windows (and all platforms) `npx create-webclaw` never used the prebuilt binary on any platform and silently fell back to `cargo install`, which fails with "'cargo' is not recognized" / "cargo: not found" unless Rust is installed. Four bugs: 1. Asset name mismatch: getAssetName() hardcoded `webclaw-mcp-<target>`, but release assets are `webclaw-<tag>-<target>` (versioned, no `mcp-` infix). The `find()` always returned undefined, so the prebuilt path was never taken — on every OS, not just Windows. Now the asset name is built from the release tag_name + a platform→target map. 2. `unzip` is absent on Windows. The `.zip` branch now uses PowerShell `Expand-Archive` (ships with Windows 10/11) and keeps `unzip` only for the non-Windows case. 3. The prebuilt failure was swallowed by a bare `catch {}`, hiding the real cause (a 403 is almost always a GitHub API rate limit). The error is now surfaced, with a rate-limit hint + GITHUB_TOKEN support on the api.github.com request (token dropped on CDN redirects). 4. (missed by the report's own suggested fix) Archives extract into a `webclaw-<tag>-<target>/` subdirectory holding three binaries, so the old `chmod(BINARY_PATH)` hit a nonexistent path. webclaw-mcp is now lifted out of that subdir to BINARY_PATH and the rest is cleaned up. BINARY_NAME/BINARY_PATH also gain the `.exe` suffix on Windows so the written MCP config points at a real file. Tested in Docker (no Windows machine available): - Linux amd64 + arm64 on Debian trixie: full flow installs the binary and it answers a real MCP initialize handshake (serverInfo webclaw-mcp 0.6.13, 12 tools). - Windows .zip path validated against the real release zip: Expand-Archive equivalent extraction, nested `.exe` resolved + lifted, PE header `MZ`. Executing the .exe needs Windows (the reporter confirmed that on Win11). 
- Bug 3: with the GitHub API blocked, the new build prints the real reason instead of "No pre-built binary found". Closes #71 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 11:58:14 +02:00
Valerio	d69c50a31d	feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22) * feat(fetch,llm): DoS hardening via response caps + glob validation (P2) Response body caps: - webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks Content-Length up front (before the allocation) and the actual .bytes() length after (belt-and-braces against lying upstreams). Previously the HTML -> markdown conversion downstream could allocate multiple String copies per page; a 100 MB page would OOM the process. - webclaw-llm providers (anthropic/openai/ollama) share a new response_json_capped helper with a 5 MB cap. Protects against a malicious or runaway provider response exhausting memory. Crawler frontier cap: after each BFS depth level the frontier is truncated to max(max_pages * 10, 100) entries, keeping the most recently discovered links. Dense pages (tag clouds, search results) used to push the frontier into the tens of thousands even after max_pages halted new fetches. Glob pattern validation: user-supplied include_patterns / exclude_patterns are rejected at Crawler::new if they contain more than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher degrades exponentially on deeply-nested `**` against long paths. Cleanup: - Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs; no warnings surfaced, the suppression was obsolete. - core/.gitignore: replaced overbroad *.json with specific local-artifact patterns (previous rule would have swallowed package.json, components.json, .smithery/*.json). Tests: +4 validate_glob tests. Full workspace test: 283 passed (webclaw-core + webclaw-fetch + webclaw-llm). Version: 0.3.15 -> 0.3.16 CHANGELOG updated. 
Refs: docs/AUDIT-2026-04-16.md (P2 section) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: gitignore CLI research dumps, drop accidentally-tracked file research-*.json output from `webclaw ... --research ...` got silently swept into git by the relaxed *.json gitignore in the preceding commit. The old blanket *.json rule was hiding both this legitimate scratch file AND packages/create-webclaw/server.json (MCP registry config that we DO want tracked). Removes the research dump from git and adds a narrower research-*.json ignore pattern so future CLI output doesn't get re-tracked by accident. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 19:44:08 +02:00
Valerio	050b2ef463	feat: add allow_subdomains and allow_external_links to CrawlConfig Crawls are same-origin by default. Enable allow_subdomains to follow sibling/child subdomains (blog.example.com from example.com), or allow_external_links for full cross-origin crawling. Root domain extraction uses a heuristic that handles two-part TLDs (co.uk, com.au). Includes 5 unit tests for root_domain(). Bump to 0.3.12. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:33:06 +02:00
Valerio	4e81c3430d	docs: update npm package license to AGPL-3.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 11:33:43 +02:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io	2026-03-23 18:31:11 +01:00
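
The limits described in commit d69c50a31d (glob validation and the crawler frontier cap) could be sketched roughly as follows. This is not the code from the repository: the function signatures, the error type, the use of a `Vec<String>` ordered by discovery time, and the assumption that the restricted wildcard is `**` are all illustrative guesses; only the numbers (4 wildcards, 1024 chars, `max(max_pages * 10, 100)`) come from the commit message.

```rust
// Limits quoted in the commit message; the constant names are hypothetical.
const MAX_GLOB_LEN: usize = 1024;
const MAX_DOUBLE_STAR: usize = 4;

/// Hypothetical validator: rejects patterns that are too long or that
/// contain too many `**` wildcards (assumed to be the costly wildcard).
fn validate_glob(pattern: &str) -> Result<(), String> {
    // Note: `len()` counts bytes, which equals chars only for ASCII patterns.
    if pattern.len() > MAX_GLOB_LEN {
        return Err(format!("pattern longer than {} chars", MAX_GLOB_LEN));
    }
    // `matches` counts non-overlapping occurrences of the substring.
    let stars = pattern.matches("**").count();
    if stars > MAX_DOUBLE_STAR {
        return Err(format!(
            "pattern has {} `**` wildcards, limit is {}",
            stars, MAX_DOUBLE_STAR
        ));
    }
    Ok(())
}

/// Hypothetical frontier cap: keeps only the most recently discovered links,
/// assuming the vector is ordered oldest-first.
fn cap_frontier(frontier: &mut Vec<String>, max_pages: usize) {
    let cap = max_pages.saturating_mul(10).max(100);
    if frontier.len() > cap {
        let excess = frontier.len() - cap;
        // Drop the oldest entries from the front of the vector.
        frontier.drain(..excess);
    }
}
```

Under these assumptions, a crawler constructor would call `validate_glob` on every include/exclude pattern before compiling it, and `cap_frontier` once per BFS depth level.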

5 commits
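
As a rough illustration of the root-domain heuristic mentioned in commit 050b2ef463, the sketch below treats a short list of two-part TLDs specially and compares hosts by their root domain. The suffix list, the `same_site` helper, and the exact signature of `root_domain` are assumptions made for the example; the commit message only states that the heuristic handles suffixes such as co.uk and com.au.

```rust
// Hypothetical, deliberately tiny suffix list; the real heuristic may differ.
const TWO_PART_TLDS: &[&str] = &["co.uk", "com.au"];

/// Returns the registrable root of a lowercase host name, e.g.
/// "blog.example.com" -> "example.com".
fn root_domain(host: &str) -> String {
    let labels: Vec<&str> = host.split('.').collect();
    let n = labels.len();
    if n <= 2 {
        // Already a bare domain (or a single label such as "localhost").
        return host.to_string();
    }
    let last_two = format!("{}.{}", labels[n - 2], labels[n - 1]);
    // Keep three labels when the host ends in a known two-part TLD.
    let keep = if TWO_PART_TLDS.contains(&last_two.as_str()) { 3 } else { 2 };
    labels[n - keep..].join(".")
}

/// Hypothetical check a crawler could use when subdomains are allowed:
/// two hosts belong together when they share a root domain.
fn same_site(a: &str, b: &str) -> bool {
    root_domain(a) == root_domain(b)
}
```

With a check like this, enabling `allow_subdomains` would let a crawl that starts at example.com follow links to blog.example.com, while links to unrelated hosts would still require `allow_external_links`.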