Bring core/CLAUDE.md current with the slices rescued this cycle, and fold
in earlier /init corrections that were never committed.
New capabilities documented:
- search: webclaw-fetch `search.rs` (Serper BYO-key) + the CLI `search`
subcommand + the OSS `POST /v1/search` route (gated on SERPER_API_KEY)
+ the now-local-first MCP `search` tool.
- map: webclaw-fetch `map.rs` (`discover_urls`/`MapOptions`, sitemap +
bounded crawl fallback), gzip sitemap support, and the new
`--map-pages`/`--no-map-crawl`/`--map-limit` CLI flags.
- perf: shared `extractors/og.rs` parser and the QuickJS runtime gate /
parsed-document reuse noted on `js_eval.rs`.
Corrections folded in: real browser fingerprint versions live in tls.rs
(not browser.rs), accurate module/route lists, Repo Layout section, and
removal of the now-false "search lives only in production" notes.
Bumped the stated workspace version to 0.6.13.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rescued from the stale perf/audit-fixes branch — the *perf-only* subset of
that branch's big mixed commit, ported cleanly onto current main with
byte-identical extraction output.
- markdown: hoist the `img[alt]` / `a[href]` selectors out of the per-node
noise path into `Lazy` statics (stop recompiling them per element).
- extractors: single shared `og()` / `parse_og()` module replaces the
per-field Open Graph re-scan duplicated across 7 vertical extractors
(amazon, ebay, ecommerce, etsy, substack, trustpilot, youtube). Each
vertical now does one pass. Raw-vs-unescaped behaviour preserved exactly.
- core: gate the QuickJS VM on a cheap marker check (skip it entirely when
the page has no JS-assigned data) and reuse the already-parsed document
instead of re-parsing the HTML.
- fetch: connection-pool tuning on the wreq client (connect_timeout, idle
pool, max-idle-per-host, tcp keepalive) for connection reuse.
Output-equivalence is covered by existing tests (amazon quot-entity,
trustpilot title parse, ecommerce/youtube/etsy/substack og fallbacks) — all
green. No new dependencies; no public API change.
Deliberately EXCLUDED from this slice (separate concerns bundled in the
original commit): the `#[non_exhaustive]` API-breaking changes, the LLM/PDF/
server reliability hardening (much already shipped in 0.6.8), the tooling
(cargo-deny, release profile, MSRV), and the retry-loop dedup refactor (a
code-cleanup with no runtime benefit — not worth churning client.rs for).
Original work by the prior author on perf/audit-fixes; this re-applies only
the performance subset onto main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bundle three changes landed since 0.6.11:
- feat(search): standalone web search via Serper.dev (#63)
- feat(map): layered URL discovery with bounded crawl fallback (#64)
- fix(mcp): accept boolean params sent as JSON strings (#62 / #65)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up to #58/#59, which fixed numeric params but left the booleans.
MCP clients (e.g. Claude Desktop) send `true` as the JSON string `"true"`,
which serde's default bool deserializer rejects with
`invalid type: string "true", expected a boolean`, failing the call.
Adds a `deser_opt_bool_or_str` helper (same untagged pattern as the #59
numeric helpers) that accepts a JSON boolean OR "true"/"false"
(case-insensitive, trimmed) and rejects anything else with a clear error.
Numeric-looking strings like "1" are intentionally NOT coerced to bool.
Applied to every Option<bool> tool param:
- scrape -> only_main_content
- crawl -> use_sitemap
- research -> deep
- search -> scrape (added by the standalone-search slice, #63)
16 unit tests (bool / "true"-string / absent->None / garbage->error per
field). No new dependencies.
Fixes#62.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rescued from the stale perf/audit-fixes branch and ported cleanly onto
current main (fetch + CLI only — the original commit never touched the
server/MCP map surfaces).
`--map` used to return only what a site advertises in sitemap.xml, which
is nothing for sites with no sitemap (e.g. Hacker News) or a thin one.
Now discovery is layered:
- webclaw-fetch::discover_urls() / MapOptions — sitemaps first
(authoritative, carries lastmod/priority/changefreq); when the sitemap
is thin (< min_sitemap_urls) and the fallback is enabled, run a bounded
same-origin crawl and harvest links from every fetched page plus the
unfetched frontier, deduped against the sitemap set.
- sitemap.rs: gzip (.xml.gz) support via a new decode_sitemap_body() +
FetchClient::fetch_raw() (raw bytes, no lossy UTF-8); deeper index
recursion (3->5); 4 more fallback paths.
- CLI: --map-pages / --no-map-crawl / --map-limit; crawler logs now go to
stderr so `--map -f json` stays machine-parseable.
One new dependency: flate2 (already resolved in the lockfile transitively).
Includes the commit's unit tests (map dedup/origin, gzip decode). Original
work by the prior author on perf/audit-fixes; this re-applies only the map
slice onto main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rescued from the stale perf/audit-fixes branch and ported cleanly onto
current main. OSS surfaces can now search without the hosted webclaw API
when the caller supplies their own Serper.dev key (free at serper.dev).
- webclaw-fetch::search() — calls Serper.dev directly (plain wreq client;
a JSON API needs no fingerprinting) and, with scrape=true, fetches +
extracts the top result pages concurrently (bounded) via the caller's
FetchClient. parse_serper_organic() is pure and unit-tested.
- MCP `search` tool: local-first — uses SERPER_API_KEY when set, else
falls back to the hosted webclaw API. Adds country/lang/scrape params.
- OSS REST server: POST /v1/search, gated on SERPER_API_KEY (501 when
unset, with a setup hint). Adds ApiError::NotImplemented.
- CLI: `webclaw search <query> [--serper-key|SERPER_API_KEY] [--num]
[--country] [--lang] [--scrape] [--format]`.
No new dependencies (reuses futures-util already in the tree). Original
work by the prior author on perf/audit-fixes; this re-applies only the
search slice onto main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a Google Gemini provider (Generative Language API) to the chain, ordered Ollama -> OpenAI -> Gemini -> Anthropic so Google credits are preferred with Anthropic as last-resort fallback. System->systemInstruction, assistant->model, json_mode->responseMimeType; model name validated before URL interpolation; maxOutputTokens defaults high for 2.5 thinking models. Also fixes AnthropicProvider default (retired claude-sonnet-4-20250514 -> 404); now claude-sonnet-4-6, honors ANTHROPIC_MODEL.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Declares the webclaw MCP server at the repo root (matches the README manual
config). Cursor's plugin scanner looks for .mcp.json/mcp.json at root.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Release the MCP numeric-param string-coercion fix (#58, PR #59):
crawl/batch/search/summarize numeric args now accept JSON numbers or
numeric strings, fixing clients (e.g. Claude Desktop) that send "5"
instead of 5.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reformat the string-or-number deserialize helpers and tests to satisfy
`cargo fmt --check` (style_edition 2024), which the lint CI job enforces.
Formatting only — no behavior change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MCP clients (Claude Desktop, VS Code Copilot, etc.) serialize numeric
tool arguments as JSON strings ("3" instead of 3). serde's built-in
u32/usize deserialisers reject these with:
invalid type: string "N", expected u32
Add two private coercion helpers — `deser_opt_u32_or_str` and
`deser_opt_usize_or_str` — that accept both JSON number and JSON string
representations, falling back to `str::parse` for the string form and
returning a clear custom error for non-numeric strings.
Annotate the six affected optional fields:
CrawlParams: depth (u32), max_pages (usize), concurrency (usize)
BatchParams: concurrency (usize)
SearchParams: num_results (u32)
SummarizeParams: max_sentences (usize)
Add 24 unit tests (4 per field: numeric string → value, native number
→ value, absent → None, non-numeric string → Err) verified green via
an isolated serde-only crate.
Fixes#58
The per-arch build + 'imagetools create' combine failed at the manifest
step with 'v0.6.9-arm64: not found' — buildx's default provenance/SBOM
attestations turn each per-arch tag into an index, and assembling them
races GHCR's read-after-write. Replace it with a single
'docker buildx build --platform linux/amd64,linux/arm64 --push'
(attestations off) so one manifest list is pushed atomically. Dockerfile.ci
now selects binaries by TARGETARCH. Adds a workflow_dispatch path to
re-publish an existing tag's image without rebuilding binaries or bumping
the version.
Publish the multi-arch Docker image with Buildx instead of the legacy
docker driver, whose GHCR push intermittently failed with 'unknown
blob'. The manifest list is now assembled registry-side with
`imagetools create`. This also unblocks the Homebrew formula update,
which depends on the Docker job. No library or CLI behavior changes.
- webclaw-llm: add explicit request + connect timeouts to the reqwest
client in every provider (anthropic, openai, ollama) with a shorter
timeout on the ollama health check, so a stalled provider fails fast.
- webclaw-llm: fix a panic when truncating a provider error body that
contains multibyte characters near the 500-char cut (char-safe take).
- webclaw-core: snap the endpoint-scan budget cut to a UTF-8 char
boundary so oversized scripts with non-ASCII content no longer panic.
- webclaw-core: rewrite js_literal_to_json to copy raw bytes instead of
`byte as char`, preserving multibyte UTF-8 in SvelteKit string values
rather than producing Latin-1 mojibake.
- webclaw-cli: have fire_webhook return its JoinHandle and await it at
the crawl/batch/batch-llm call sites, removing the fixed 500ms sleeps.
- webclaw-mcp: drop the up-front DNS pre-validation loop in batch that
aborted the whole request on one bad URL; the fetch layer already
applies the same SSRF guard per URL and reports per-URL errors.
- webclaw-fetch: include the port in the warmup homepage URL so hosts
on a non-default port are warmed correctly.
Adds regression tests for the UTF-8 endpoint-scan and SvelteKit cases.
Ports the TLS/Response API breaks in the bump:
- certificate_compression_algorithms -> certificate_compressors with
wreq-util's BrotliCompressor/ZlibCompressor trait objects
- ExtensionType::APPLICATION_SETTINGS_NEW -> APPLICATION_SETTINGS (same
codepoint 17613)
- wreq_util::Emulation::SafariIos26.emulation() ->
Profile::SafariIos26.into_emulation(); Emulation fields are now public
so *_mut() accessors become direct field access; build() takes a Group
- Response::chunk() removed -> bytes_stream() (wreq 'stream' feature) with
the running body-size ceiling preserved; adds futures-util
Browser fingerprints verified unchanged on tls.peet.ws: Chrome JA3
43067709b025da334de1279a120f8e14, Safari iOS JA3 8d909525bd5bbb79f133d11cc05159fe.
When bash splits a URL at & or ? (a common foot-gun), webclaw
receives only the truncated prefix and silently fetches the wrong
page. Per issue #6:
1. Heuristic warning: if the URL ends with '&' or contains '?' with
no '=' after, emit a stderr warning before fetching:
# webclaw: warning: URL looks truncated (ends with '&' or '?'); did the shell split it? Quote the URL or use --url-encoded.
2. New flag --url-encoded: parallel input that asserts the user has
handled escaping. Suppresses the truncation warning since intent
is explicit.
Fetch proceeds in both cases; this is informational only. 4 new
tests in webclaw-cli. Workspace 720 -> 724.
(cherry picked from commit 4ef27fcd33)
Webclaw's default -t timeout is 30s; slow sites previously sat
silently with no feedback. Now during a fetch, every 10s of elapsed
time webclaw writes one line to stderr:
# webclaw: still fetching <URL> (Ns)
Fetches completing in under 10s emit nothing (the timer never fires).
Stdout output is untouched - pure feedback signal on stderr.
No timeout change. No new flags. Default behavior is augmented at
stderr only.
Implemented via tokio::select! between the fetch future and a
tokio::time::interval. Latency cost: a single tokio task spawn
and a 10s tick - microseconds on the fast path.
10 new tests in webclaw-fetch::progress::tests (none ignored; the
slow-future test uses a 50ms test interval to keep cargo test fast).
Workspace total 710 -> 720.
(cherry picked from commit 06f065cb08)
wreq is a release candidate with no API stability between rc.N builds
(rc.29 broke the TLS + Response API). `cargo install` and the release
workflow both ignore Cargo.lock and were re-resolving to rc.29, breaking
the build. An exact `=6.0.0-rc.28` / `=3.0.0-rc.10` pin keeps every build
path deterministic until wreq reaches a stable release.
The TLS layer moved to wreq (BoringSSL) in-process; there is no longer a
[patch.crates-io] section or a separate TLS fork. Update the architecture
tree and crate-boundary notes to match.
Reddit blocked unauthenticated `.json` access, so the previous extractor
returned block pages or timed out on every thread. Switch to parsing
old.reddit.com's server-rendered HTML, which needs no API key or JS.
Fetch layer:
- Rewrite every Reddit host to old.reddit.com before fetching; drop all
`.json` URL handling and the JSON response parser.
Extraction (webclaw-core::reddit):
- New HTML parser producing a typed post + nested comment tree.
- Comments nest structurally (.comment > .child > .sitetable > .comment);
old.reddit omits a usable depth attribute, so the tree is walked
recursively. Bodies live in .entry > form > .usertext-body > .md.
- Post metadata: title, author, subreddit, score, comment count
(data-comments-count), self-vs-link (self class / self.* domain),
flair, self-text body.
- Comment scores read the .score.unvoted title (the displayed value, not
the ±1 vote-state siblings); hidden scores are None, not 0.
- Deleted comments are kept in place so their replies aren't orphaned;
"load more comments" stubs are skipped.
Markdown output:
- Reply nesting via blockquote depth (avoids 4-space indentation turning
text and code fences into broken indented-code blocks).
- Links keep their target as [text](url); root-relative reddit links
resolve against old.reddit.com. Nested lists indent correctly.
- A recognised but unparseable /comments/ page returns no content rather
than falling through to generic extraction of Reddit chrome.
Tests: regression suite runs against real old.reddit.com fixtures
(testdata/reddit/), the ground truth that surfaced the parsing and
markdown bugs synthetic HTML had hidden. Fixtures are excluded from the
published crate.
GitHub flagged checkout@v4 / upload-artifact@v4 / download-artifact@v4
as Node.js 20 actions, force-migrated to Node 24 on 2026-06-02. Bump
all nine references to v5 ahead of the deadline. The artifact steps are
v5-compatible: upload uses a unique matrix-target name and the download
step flattens subdirectories with find afterward.