`npx create-webclaw` never used the prebuilt binary on any platform and
silently fell back to `cargo install`, which fails with "'cargo' is not
recognized" / "cargo: not found" unless Rust is installed. Four bugs:
1. Asset name mismatch: getAssetName() hardcoded `webclaw-mcp-<target>`,
but release assets are `webclaw-<tag>-<target>` (versioned, no `mcp-`
infix). The `find()` always returned undefined, so the prebuilt path
was never taken — on every OS, not just Windows. Now the asset name is
built from the release tag_name + a platform→target map.
2. `unzip` is absent on Windows. The `.zip` branch now uses PowerShell
`Expand-Archive` (ships with Windows 10/11) and keeps `unzip` only for
the non-Windows case.
3. The prebuilt failure was swallowed by a bare `catch {}`, hiding the
real cause (a 403 is almost always a GitHub API rate limit). The error
is now surfaced, with a rate-limit hint + GITHUB_TOKEN support on the
api.github.com request (token dropped on CDN redirects).
4. (missed by the report's own suggested fix) Archives extract into a
`webclaw-<tag>-<target>/` subdirectory holding three binaries, so the
old `chmod(BINARY_PATH)` hit a nonexistent path. webclaw-mcp is now
lifted out of that subdir to BINARY_PATH and the rest is cleaned up.
BINARY_NAME/BINARY_PATH also gain the `.exe` suffix on Windows so the
written MCP config points at a real file.
Tested in Docker (no Windows machine available):
- Linux amd64 + arm64 on Debian trixie: full flow installs the binary and
it answers a real MCP initialize handshake (serverInfo webclaw-mcp
0.6.13, 12 tools).
- Windows .zip path validated against the real release zip: extraction
equivalent to Expand-Archive, nested `.exe` resolved + lifted, PE header `MZ`.
Executing the .exe needs Windows (the reporter confirmed that on Win11).
- Bug 3: with the GitHub API blocked, the new build prints the real reason
instead of "No pre-built binary found".
Closes #71
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(fetch,llm): DoS hardening via response caps + glob validation (P2)
Response body caps:
- webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks
Content-Length up front (before the allocation) and the actual
.bytes() length after (belt-and-braces against lying upstreams).
Previously the HTML -> markdown conversion downstream could allocate
multiple String copies per page; a 100 MB page would OOM the process.
- webclaw-llm providers (anthropic/openai/ollama) share a new
response_json_capped helper with a 5 MB cap. Protects against a
malicious or runaway provider response exhausting memory.
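A minimal sketch of the two-stage size check described above, not the actual
from_wreq code. The function names, the error type (a plain String), and the
choice of 50 * 1024 * 1024 as the meaning of "50 MB" are assumptions made
for illustration.

```rust
/// Hypothetical cap; the source only says "50 MB", the exact byte count is assumed.
const MAX_BODY_BYTES: u64 = 50 * 1024 * 1024;

/// Stage 1: reject early when the declared Content-Length already exceeds the cap.
/// A missing header passes here, because the size is simply unknown at this point.
fn check_declared_len(content_length: Option<u64>, cap: u64) -> Result<(), String> {
    match content_length {
        Some(len) if len > cap => {
            Err(format!("declared body of {len} bytes exceeds cap of {cap}"))
        }
        _ => Ok(()),
    }
}

/// Stage 2: re-check the bytes actually received, in case the header was
/// absent or understated the real size.
fn check_actual_len(body: &[u8], cap: u64) -> Result<(), String> {
    let len = body.len() as u64;
    if len > cap {
        Err(format!("received body of {len} bytes exceeds cap of {cap}"))
    } else {
        Ok(())
    }
}
```

The same pair of checks would apply to the 5 MB provider cap by passing a
smaller `cap` value.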
Crawler frontier cap: after each BFS depth level the frontier is
truncated to max(max_pages * 10, 100) entries, keeping the most
recently discovered links. Dense pages (tag clouds, search results)
used to push the frontier into the tens of thousands even after
max_pages halted new fetches.
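The truncation rule can be sketched roughly as follows. This assumes the
frontier is a Vec<String> with newly discovered links pushed to the back;
the real crawler's data structure and the name cap_frontier are not taken
from the source.

```rust
/// Trim the frontier to max(max_pages * 10, 100) entries, keeping the newest.
/// Assumes the newest links sit at the back of the Vec.
fn cap_frontier(frontier: &mut Vec<String>, max_pages: usize) {
    let cap = max_pages.saturating_mul(10).max(100);
    if frontier.len() > cap {
        let excess = frontier.len() - cap;
        // Drop the oldest entries at the front; the tail (most recent) survives.
        frontier.drain(..excess);
    }
}
```

With max_pages = 5 the cap is the floor of 100 entries; with max_pages = 50
it is 500.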
Glob pattern validation: user-supplied include_patterns /
exclude_patterns are rejected at Crawler::new if they contain more
than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher
degrades exponentially on deeply-nested `**` against long paths.
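A hedged sketch of what such a check might look like. Only the limits (4
`**` wildcards, 1024 chars) come from the text above; the signature, the
String error type, and counting non-overlapping `**` occurrences are
assumptions.

```rust
const MAX_DOUBLE_STARS: usize = 4;
const MAX_PATTERN_CHARS: usize = 1024;

/// Reject patterns that are too long or contain too many `**` wildcards.
fn validate_glob(pattern: &str) -> Result<(), String> {
    let chars = pattern.chars().count();
    if chars > MAX_PATTERN_CHARS {
        return Err(format!(
            "pattern is {chars} chars, limit is {MAX_PATTERN_CHARS}"
        ));
    }
    // str::matches yields non-overlapping matches, so "***" counts once.
    let stars = pattern.matches("**").count();
    if stars > MAX_DOUBLE_STARS {
        return Err(format!(
            "pattern has {stars} `**` wildcards, limit is {MAX_DOUBLE_STARS}"
        ));
    }
    Ok(())
}
```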
Cleanup:
- Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs;
no warnings surfaced, the suppression was obsolete.
- core/.gitignore: replaced overbroad *.json with specific local-
artifact patterns (previous rule would have swallowed package.json,
components.json, .smithery/*.json).
Tests: +4 validate_glob tests. Full workspace test: 283 passed
(webclaw-core + webclaw-fetch + webclaw-llm).
Version: 0.3.15 -> 0.3.16
CHANGELOG updated.
Refs: docs/AUDIT-2026-04-16.md (P2 section)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* chore: gitignore CLI research dumps, drop accidentally-tracked file
research-*.json output from `webclaw ... --research ...` got silently
swept into git by the relaxed *.json gitignore in the preceding commit.
The old blanket *.json rule was hiding both this legitimate scratch
file AND packages/create-webclaw/server.json (MCP registry config that
we DO want tracked).
Removes the research dump from git and adds a narrower research-*.json
ignore pattern so future CLI output doesn't get re-tracked by accident.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Crawls are same-origin by default. Enable allow_subdomains to follow
sibling/child subdomains (blog.example.com from example.com), or
allow_external_links for full cross-origin crawling.
Root domain extraction uses a heuristic that handles two-part TLDs
(co.uk, com.au). Includes 5 unit tests for root_domain().
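For illustration, a heuristic of this kind could look like the sketch
below. The suffix table is a small made-up subset and the body is an
assumption, not the crate's actual root_domain() implementation; only
co.uk and com.au are named in this commit.

```rust
/// Illustrative subset of two-part suffixes; the real list is not shown here.
const TWO_PART_SUFFIXES: [&str; 4] = ["co.uk", "com.au", "co.jp", "com.br"];

/// Return the registrable root of a host: the last two labels, or the last
/// three when the host ends in a known two-part suffix.
fn root_domain(host: &str) -> String {
    let labels: Vec<&str> = host.split('.').collect();
    let n = labels.len();
    if n <= 2 {
        return host.to_string();
    }
    let last_two = format!("{}.{}", labels[n - 2], labels[n - 1]);
    let keep = if TWO_PART_SUFFIXES.contains(&last_two.as_str()) {
        3
    } else {
        2
    };
    labels[n - keep..].join(".")
}
```

Under this sketch, blog.example.com and example.com share the root
example.com, so a subdomain-aware crawl would treat them as the same site.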
Bump to 0.3.12.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>