Distribution release — the extraction engine is unchanged from 0.6.13. Ships
the broader-Linux-compatibility work: gnu binaries are now built on glibc 2.35
(so they run on Debian 12 / Ubuntu 22.04), plus new static musl binaries that
run on any Linux (Alpine, Amazon Linux 2023, RHEL 9). Also carries the
create-webclaw Windows/all-platform install fix and the WEBCLAW_API_KEY
setup-script fix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A v*-rc* tag triggered the same release pipeline as a stable tag, which
would publish a normal release, overwrite ghcr.io/0xmassi/webclaw:latest,
and repoint the Homebrew formula at the rc — shipping a prerelease to every
stable Docker/brew user. Guard all three on the SemVer prerelease hyphen:
- release: add --prerelease for hyphenated tags
- docker: still push :${tag} (testable rc image) but only move :latest for stable
- homebrew: skip entirely for prereleases
Lets us cut rc tags (e.g. to validate the new musl build job) without
touching stable users.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The gnu Linux binaries are glibc-floored (2.35 after #74), so they still
won't run on Amazon Linux 2023 / RHEL 9 (glibc 2.34), Alpine, or anything
older. Add fully static musl builds that run on ANY Linux regardless of
glibc.
Adds x86_64-unknown-linux-musl and aarch64-unknown-linux-musl to the build
matrix, built with cargo-zigbuild (zig as the C/C++ cross-compiler for
BoringSSL). Build scripts (bindgen) run as the glibc host so libclang loads,
and the linked output is fully static. A native Alpine build can't do this —
its static build scripts can't dlopen libclang.
musl assets ship ALONGSIDE the gnu ones (gnu stays default; musl is the
runs-anywhere fallback). The release job globs *.tar.gz, so the new assets
are checksummed + uploaded automatically; the docker/homebrew jobs enumerate
gnu targets explicitly and are unaffected.
Validated in Docker: cargo-zigbuild produced a fully static aarch64-musl
webclaw-mcp (ldd: not a dynamic executable) that answered an MCP handshake on
Alpine, Debian 11 (glibc 2.31), Debian 12, Amazon Linux 2023 (2.34), and
Ubuntu 24.04 — everywhere, including where the gnu builds fail.
Closes#73
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Linux release binaries were built on ubuntu-latest (now Ubuntu 24.04,
glibc 2.39), so they required GLIBC_2.38 and failed to start on older LTS
distros (Debian 12 = 2.36, Ubuntu 22.04 = 2.35) with:
libc.so.6: version `GLIBC_2.38' not found (required by webclaw-mcp)
glibc is forward- but not backward-compatible, so the build host's glibc
sets the floor. Pin both Linux targets to ubuntu-22.04 (glibc 2.35); the
aarch64 cross toolchain tracks the runner distro, so this lowers both.
Note: 2.35 still won't cover Amazon Linux 2023 / RHEL 9 (2.34) — full
coverage needs a musl static build. Tracked in #73.
Refs #73
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`npx create-webclaw` never used the prebuilt binary on any platform and
silently fell back to `cargo install`, which fails with "'cargo' is not
recognized" / "cargo: not found" unless Rust is installed. Four bugs:
1. Asset name mismatch: getAssetName() hardcoded `webclaw-mcp-<target>`,
but release assets are `webclaw-<tag>-<target>` (versioned, no `mcp-`
infix). The `find()` always returned undefined, so the prebuilt path
was never taken — on every OS, not just Windows. Now the asset name is
built from the release tag_name + a platform→target map.
2. `unzip` is absent on Windows. The `.zip` branch now uses PowerShell
`Expand-Archive` (ships with Windows 10/11) and keeps `unzip` only for
the non-Windows case.
3. The prebuilt failure was swallowed by a bare `catch {}`, hiding the
real cause (a 403 is almost always a GitHub API rate limit). The error
is now surfaced, with a rate-limit hint + GITHUB_TOKEN support on the
api.github.com request (token dropped on CDN redirects).
4. (missed by the report's own suggested fix) Archives extract into a
`webclaw-<tag>-<target>/` subdirectory holding three binaries, so the
old `chmod(BINARY_PATH)` hit a nonexistent path. webclaw-mcp is now
lifted out of that subdir to BINARY_PATH and the rest is cleaned up.
BINARY_NAME/BINARY_PATH also gain the `.exe` suffix on Windows so the
written MCP config points at a real file.
Tested in Docker (no Windows machine available):
- Linux amd64 + arm64 on Debian trixie: full flow installs the binary and
it answers a real MCP initialize handshake (serverInfo webclaw-mcp
0.6.13, 12 tools).
- Windows .zip path validated against the real release zip: Expand-Archive
equivalent extraction, nested `.exe` resolved + lifted, PE header `MZ`.
Executing the .exe needs Windows (the reporter confirmed that on Win11).
- Bug 3: with the GitHub API blocked, the new build prints the real reason
instead of "No pre-built binary found".
Closes#71
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Quantum Proxies is no longer a sponsor. Remove their Studio Partners entry
from the README and delete the banner/logo assets.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- LLM provider chain is Ollama -> OpenAI -> Gemini -> Anthropic; Gemini
was added ahead of Anthropic (Google Cloud credits preferred) but the
docs still listed Ollama -> OpenAI -> Anthropic.
- Document the top-level webclaw-fetch verticals reddit.rs / linkedin.rs
(distinct from extractors/ and webclaw-core parsers) and progress.rs.
- Bump extractor count ~28 -> ~30 and call out the shared helpers
(og.rs, github_common.rs, jsonld_product.rs, ecommerce_product.rs).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
setup.sh and deploy/hetzner.sh emitted WEBCLAW_AUTH_KEY into the server's
.env, but webclaw-server reads WEBCLAW_API_KEY (env = "WEBCLAW_API_KEY").
The generated key was silently ignored — and since hetzner.sh binds
0.0.0.0, the server refused to start at all (it rejects a public bind
without WEBCLAW_API_KEY). Fix both .env writers, plus the hetzner help
line that told users to grep the wrong name and the env.example sample.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bring core/CLAUDE.md current with the slices rescued this cycle, and fold
in earlier /init corrections that were never committed.
New capabilities documented:
- search: webclaw-fetch `search.rs` (Serper BYO-key) + the CLI `search`
subcommand + the OSS `POST /v1/search` route (gated on SERPER_API_KEY)
+ the now-local-first MCP `search` tool.
- map: webclaw-fetch `map.rs` (`discover_urls`/`MapOptions`, sitemap +
bounded crawl fallback), gzip sitemap support, and the new
`--map-pages`/`--no-map-crawl`/`--map-limit` CLI flags.
- perf: shared `extractors/og.rs` parser and the QuickJS runtime gate /
parsed-document reuse noted on `js_eval.rs`.
Corrections folded in: real browser fingerprint versions live in tls.rs
(not browser.rs), accurate module/route lists, Repo Layout section, and
removal of the now-false "search lives only in production" notes.
Bumped the stated workspace version to 0.6.13.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rescued from the stale perf/audit-fixes branch — the *perf-only* subset of
that branch's big mixed commit, ported cleanly onto current main with
byte-identical extraction output.
- markdown: hoist the `img[alt]` / `a[href]` selectors out of the per-node
noise path into `Lazy` statics (stop recompiling them per element).
- extractors: single shared `og()` / `parse_og()` module replaces the
per-field Open Graph re-scan duplicated across 7 vertical extractors
(amazon, ebay, ecommerce, etsy, substack, trustpilot, youtube). Each
vertical now does one pass. Raw-vs-unescaped behaviour preserved exactly.
- core: gate the QuickJS VM on a cheap marker check (skip it entirely when
the page has no JS-assigned data) and reuse the already-parsed document
instead of re-parsing the HTML.
- fetch: connection-pool tuning on the wreq client (connect_timeout, idle
pool, max-idle-per-host, tcp keepalive) for connection reuse.
Output-equivalence is covered by existing tests (amazon quot-entity,
trustpilot title parse, ecommerce/youtube/etsy/substack og fallbacks) — all
green. No new dependencies; no public API change.
Deliberately EXCLUDED from this slice (separate concerns bundled in the
original commit): the `#[non_exhaustive]` API-breaking changes, the LLM/PDF/
server reliability hardening (much already shipped in 0.6.8), the tooling
(cargo-deny, release profile, MSRV), and the retry-loop dedup refactor (a
code-cleanup with no runtime benefit — not worth churning client.rs for).
Original work by the prior author on perf/audit-fixes; this re-applies only
the performance subset onto main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bundle three changes landed since 0.6.11:
- feat(search): standalone web search via Serper.dev (#63)
- feat(map): layered URL discovery with bounded crawl fallback (#64)
- fix(mcp): accept boolean params sent as JSON strings (#62 / #65)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up to #58/#59, which fixed numeric params but left the booleans.
MCP clients (e.g. Claude Desktop) send `true` as the JSON string `"true"`,
which serde's default bool deserializer rejects with
`invalid type: string "true", expected a boolean`, failing the call.
Adds a `deser_opt_bool_or_str` helper (same untagged pattern as the #59
numeric helpers) that accepts a JSON boolean OR "true"/"false"
(case-insensitive, trimmed) and rejects anything else with a clear error.
Numeric-looking strings like "1" are intentionally NOT coerced to bool.
Applied to every Option<bool> tool param:
- scrape -> only_main_content
- crawl -> use_sitemap
- research -> deep
- search -> scrape (added by the standalone-search slice, #63)
16 unit tests (bool / "true"-string / absent->None / garbage->error per
field). No new dependencies.
Fixes#62.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rescued from the stale perf/audit-fixes branch and ported cleanly onto
current main (fetch + CLI only — the original commit never touched the
server/MCP map surfaces).
`--map` used to return only what a site advertises in sitemap.xml, which
is nothing for sites with no sitemap (e.g. Hacker News) or a thin one.
Now discovery is layered:
- webclaw-fetch::discover_urls() / MapOptions — sitemaps first
(authoritative, carries lastmod/priority/changefreq); when the sitemap
is thin (< min_sitemap_urls) and the fallback is enabled, run a bounded
same-origin crawl and harvest links from every fetched page plus the
unfetched frontier, deduped against the sitemap set.
- sitemap.rs: gzip (.xml.gz) support via a new decode_sitemap_body() +
FetchClient::fetch_raw() (raw bytes, no lossy UTF-8); deeper index
recursion (3->5); 4 more fallback paths.
- CLI: --map-pages / --no-map-crawl / --map-limit; crawler logs now go to
stderr so `--map -f json` stays machine-parseable.
One new dependency: flate2 (already resolved in the lockfile transitively).
Includes the commit's unit tests (map dedup/origin, gzip decode). Original
work by the prior author on perf/audit-fixes; this re-applies only the map
slice onto main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rescued from the stale perf/audit-fixes branch and ported cleanly onto
current main. OSS surfaces can now search without the hosted webclaw API
when the caller supplies their own Serper.dev key (free at serper.dev).
- webclaw-fetch::search() — calls Serper.dev directly (plain wreq client;
a JSON API needs no fingerprinting) and, with scrape=true, fetches +
extracts the top result pages concurrently (bounded) via the caller's
FetchClient. parse_serper_organic() is pure and unit-tested.
- MCP `search` tool: local-first — uses SERPER_API_KEY when set, else
falls back to the hosted webclaw API. Adds country/lang/scrape params.
- OSS REST server: POST /v1/search, gated on SERPER_API_KEY (501 when
unset, with a setup hint). Adds ApiError::NotImplemented.
- CLI: `webclaw search <query> [--serper-key|SERPER_API_KEY] [--num]
[--country] [--lang] [--scrape] [--format]`.
No new dependencies (reuses futures-util already in the tree). Original
work by the prior author on perf/audit-fixes; this re-applies only the
search slice onto main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a Google Gemini provider (Generative Language API) to the chain, ordered Ollama -> OpenAI -> Gemini -> Anthropic so Google credits are preferred with Anthropic as last-resort fallback. System->systemInstruction, assistant->model, json_mode->responseMimeType; model name validated before URL interpolation; maxOutputTokens defaults high for 2.5 thinking models. Also fixes AnthropicProvider default (retired claude-sonnet-4-20250514 -> 404); now claude-sonnet-4-6, honors ANTHROPIC_MODEL.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Declares the webclaw MCP server at the repo root (matches the README manual
config). Cursor's plugin scanner looks for .mcp.json/mcp.json at root.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Release the MCP numeric-param string-coercion fix (#58, PR #59):
crawl/batch/search/summarize numeric args now accept JSON numbers or
numeric strings, fixing clients (e.g. Claude Desktop) that send "5"
instead of 5.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reformat the string-or-number deserialize helpers and tests to satisfy
`cargo fmt --check` (style_edition 2024), which the lint CI job enforces.
Formatting only — no behavior change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MCP clients (Claude Desktop, VS Code Copilot, etc.) serialize numeric
tool arguments as JSON strings ("3" instead of 3). serde's built-in
u32/usize deserialisers reject these with:
invalid type: string "N", expected u32
Add two private coercion helpers — `deser_opt_u32_or_str` and
`deser_opt_usize_or_str` — that accept both JSON number and JSON string
representations, falling back to `str::parse` for the string form and
returning a clear custom error for non-numeric strings.
Annotate the six affected optional fields:
CrawlParams: depth (u32), max_pages (usize), concurrency (usize)
BatchParams: concurrency (usize)
SearchParams: num_results (u32)
SummarizeParams: max_sentences (usize)
Add 24 unit tests (4 per field: numeric string → value, native number
→ value, absent → None, non-numeric string → Err) verified green via
an isolated serde-only crate.
Fixes#58
The per-arch build + 'imagetools create' combine failed at the manifest
step with 'v0.6.9-arm64: not found' — buildx's default provenance/SBOM
attestations turn each per-arch tag into an index, and assembling them
races GHCR's read-after-write. Replace it with a single
'docker buildx build --platform linux/amd64,linux/arm64 --push'
(attestations off) so one manifest list is pushed atomically. Dockerfile.ci
now selects binaries by TARGETARCH. Adds a workflow_dispatch path to
re-publish an existing tag's image without rebuilding binaries or bumping
the version.
Publish the multi-arch Docker image with Buildx instead of the legacy
docker driver, whose GHCR push intermittently failed with 'unknown
blob'. The manifest list is now assembled registry-side with
`imagetools create`. This also unblocks the Homebrew formula update,
which depends on the Docker job. No library or CLI behavior changes.
- webclaw-llm: add explicit request + connect timeouts to the reqwest
client in every provider (anthropic, openai, ollama) with a shorter
timeout on the ollama health check, so a stalled provider fails fast.
- webclaw-llm: fix a panic when truncating a provider error body that
contains multibyte characters near the 500-char cut (char-safe take).
- webclaw-core: snap the endpoint-scan budget cut to a UTF-8 char
boundary so oversized scripts with non-ASCII content no longer panic.
- webclaw-core: rewrite js_literal_to_json to copy raw bytes instead of
`byte as char`, preserving multibyte UTF-8 in SvelteKit string values
rather than producing Latin-1 mojibake.
- webclaw-cli: have fire_webhook return its JoinHandle and await it at
the crawl/batch/batch-llm call sites, removing the fixed 500ms sleeps.
- webclaw-mcp: drop the up-front DNS pre-validation loop in batch that
aborted the whole request on one bad URL; the fetch layer already
applies the same SSRF guard per URL and reports per-URL errors.
- webclaw-fetch: include the port in the warmup homepage URL so hosts
on a non-default port are warmed correctly.
Adds regression tests for the UTF-8 endpoint-scan and SvelteKit cases.