208 commits

Author SHA1 Message Date
Valerio
480d3187db docs(claude-md): document search, map, and perf; refresh stale details
Bring core/CLAUDE.md current with the slices rescued this cycle, and fold
in earlier /init corrections that were never committed.

New capabilities documented:
- search: webclaw-fetch `search.rs` (Serper BYO-key) + the CLI `search`
  subcommand + the OSS `POST /v1/search` route (gated on SERPER_API_KEY)
  + the now-local-first MCP `search` tool.
- map: webclaw-fetch `map.rs` (`discover_urls`/`MapOptions`, sitemap +
  bounded crawl fallback), gzip sitemap support, and the new
  `--map-pages`/`--no-map-crawl`/`--map-limit` CLI flags.
- perf: shared `extractors/og.rs` parser and the QuickJS runtime gate /
  parsed-document reuse noted on `js_eval.rs`.

Corrections folded in: real browser fingerprint versions live in tls.rs
(not browser.rs), accurate module/route lists, Repo Layout section, and
removal of the now-false "search lives only in production" notes.
Bumped the stated workspace version to 0.6.13.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:10:36 +02:00
Valerio
ecfb72a1a3 chore(release): bump version to 0.6.13
Ship the hot-path extraction speedups (#66): selector hoisting, shared
Open Graph parsing, QuickJS gating + parsed-document reuse, and HTTP
connection-pool tuning. Byte-identical extraction output (verified).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:52:38 +02:00
Valerio
febe56d177
Merge pull request #66 from 0xMassi/perf/hot-path-speedups
perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating)
2026-06-17 16:47:22 +02:00
Valerio
3c54bea300 perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating)
Rescued from the stale perf/audit-fixes branch — the *perf-only* subset of
that branch's big mixed commit, ported cleanly onto current main with
byte-identical extraction output.

- markdown: hoist the `img[alt]` / `a[href]` selectors out of the per-node
  noise path into `Lazy` statics (stop recompiling them per element).
- extractors: single shared `og()` / `parse_og()` module replaces the
  per-field Open Graph re-scan duplicated across 7 vertical extractors
  (amazon, ebay, ecommerce, etsy, substack, trustpilot, youtube). Each
  vertical now does one pass. Raw-vs-unescaped behaviour preserved exactly.
- core: gate the QuickJS VM on a cheap marker check (skip it entirely when
  the page has no JS-assigned data) and reuse the already-parsed document
  instead of re-parsing the HTML.
- fetch: connection-pool tuning on the wreq client (connect_timeout, idle
  pool, max-idle-per-host, tcp keepalive) for connection reuse.

Output-equivalence is covered by existing tests (amazon quot-entity,
trustpilot title parse, ecommerce/youtube/etsy/substack og fallbacks) — all
green. No new dependencies; no public API change.

Deliberately EXCLUDED from this slice (separate concerns bundled in the
original commit): the `#[non_exhaustive]` API-breaking changes, the LLM/PDF/
server reliability hardening (much already shipped in 0.6.8), the tooling
(cargo-deny, release profile, MSRV), and the retry-loop dedup refactor (a
code-cleanup with no runtime benefit — not worth churning client.rs for).

Original work by the prior author on perf/audit-fixes; this re-applies only
the performance subset onto main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:41:45 +02:00
Valerio
51d0c538f1 chore(release): bump version to 0.6.12
Bundle three changes landed since 0.6.11:
- feat(search): standalone web search via Serper.dev (#63)
- feat(map): layered URL discovery with bounded crawl fallback (#64)
- fix(mcp): accept boolean params sent as JSON strings (#62 / #65)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:10:45 +02:00
Valerio
c5dfce8ed5
Merge pull request #65 from 0xMassi/fix/mcp-bool-param-coercion
fix(mcp): accept boolean params sent as JSON strings (#62)
2026-06-17 15:41:42 +02:00
Valerio
b5d0f78bb8
Merge pull request #64 from 0xMassi/feat/map-crawl-fallback
feat(map): layered URL discovery with bounded crawl fallback
2026-06-17 15:38:43 +02:00
Valerio
884f06a5d3 fix(mcp): accept boolean params sent as JSON strings (#62)
Follow-up to #58/#59, which fixed numeric params but left the booleans.
MCP clients (e.g. Claude Desktop) send `true` as the JSON string `"true"`,
which serde's default bool deserializer rejects with
`invalid type: string "true", expected a boolean`, failing the call.

Adds a `deser_opt_bool_or_str` helper (same untagged pattern as the #59
numeric helpers) that accepts a JSON boolean OR "true"/"false"
(case-insensitive, trimmed) and rejects anything else with a clear error.
Numeric-looking strings like "1" are intentionally NOT coerced to bool.
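
The coercion rule described above can be sketched without serde. This is a minimal illustration, not the commit's `deser_opt_bool_or_str` helper: the `BoolOrStr` enum and `coerce_opt_bool` function are hypothetical names standing in for the untagged deserializer, and the error text is an assumption.

```rust
/// A tool argument as it might arrive over JSON: a real boolean or a string.
/// Hypothetical type; the commit wires its helper through serde instead.
#[derive(Debug, Clone, PartialEq)]
enum BoolOrStr {
    Bool(bool),
    Str(String),
}

/// Coerce an optional argument to `Option<bool>`.
/// Accepts a boolean or "true"/"false" (case-insensitive, trimmed);
/// anything else, including "1", is rejected.
fn coerce_opt_bool(raw: Option<BoolOrStr>) -> Result<Option<bool>, String> {
    match raw {
        None => Ok(None),
        Some(BoolOrStr::Bool(b)) => Ok(Some(b)),
        Some(BoolOrStr::Str(s)) => match s.trim().to_ascii_lowercase().as_str() {
            "true" => Ok(Some(true)),
            "false" => Ok(Some(false)),
            other => Err(format!(
                "expected a boolean or \"true\"/\"false\", got {other:?}"
            )),
        },
    }
}

fn main() {
    let from_client = Some(BoolOrStr::Str("true".to_string()));
    println!("{:?}", coerce_opt_bool(from_client));
    println!("{:?}", coerce_opt_bool(Some(BoolOrStr::Str("1".to_string()))));
}
```

Keeping "1" an error rather than treating it as `true` matches the stated intent that numeric-looking strings are not coerced.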

Applied to every Option<bool> tool param:
- scrape   -> only_main_content
- crawl    -> use_sitemap
- research -> deep
- search   -> scrape   (added by the standalone-search slice, #63)

16 unit tests (bool / "true"-string / absent->None / garbage->error per
field). No new dependencies.

Fixes #62.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 15:37:36 +02:00
Valerio
179efbcf87 feat(map): layered URL discovery with bounded crawl fallback
Rescued from the stale perf/audit-fixes branch and ported cleanly onto
current main (fetch + CLI only — the original commit never touched the
server/MCP map surfaces).

`--map` used to return only what a site advertises in sitemap.xml, which
is nothing for sites with no sitemap (e.g. Hacker News) or a thin one.
Now discovery is layered:

- webclaw-fetch::discover_urls() / MapOptions — sitemaps first
  (authoritative, carries lastmod/priority/changefreq); when the sitemap
  is thin (< min_sitemap_urls) and the fallback is enabled, run a bounded
  same-origin crawl and harvest links from every fetched page plus the
  unfetched frontier, deduped against the sitemap set.
- sitemap.rs: gzip (.xml.gz) support via a new decode_sitemap_body() +
  FetchClient::fetch_raw() (raw bytes, no lossy UTF-8); deeper index
  recursion (3->5); 4 more fallback paths.
- CLI: --map-pages / --no-map-crawl / --map-limit; crawler logs now go to
  stderr so `--map -f json` stays machine-parseable.
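
The layering described above (sitemap first, bounded crawl only when the sitemap is thin, results deduped against the sitemap set) can be sketched as follows. This is an illustration under assumptions, not the code in `map.rs`: `LayeredOptions`, `layered_discover`, and the `crawl_fallback`/`limit` fields are hypothetical names, and the crawl itself is passed in as a closure so the sketch stays self-contained.

```rust
use std::collections::HashSet;

/// Hypothetical options; only `min_sitemap_urls` is named in the commit message.
struct LayeredOptions {
    min_sitemap_urls: usize,
    crawl_fallback: bool,
    limit: usize,
}

/// Sitemap URLs come first. The crawl closure runs only when the sitemap is
/// thin and the fallback is enabled; its results are deduped against
/// everything already collected.
fn layered_discover(
    sitemap_urls: Vec<String>,
    opts: &LayeredOptions,
    crawl: impl FnOnce() -> Vec<String>,
) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut out = Vec::new();
    for url in sitemap_urls {
        if seen.insert(url.clone()) {
            out.push(url);
        }
    }
    if opts.crawl_fallback && out.len() < opts.min_sitemap_urls {
        for url in crawl() {
            if seen.insert(url.clone()) {
                out.push(url);
            }
        }
    }
    out.truncate(opts.limit);
    out
}

fn main() {
    let opts = LayeredOptions { min_sitemap_urls: 3, crawl_fallback: true, limit: 100 };
    let urls = layered_discover(vec!["https://example.com/".to_string()], &opts, || {
        vec![
            "https://example.com/".to_string(),
            "https://example.com/about".to_string(),
        ]
    });
    println!("{urls:?}");
}
```

Taking the crawl as a `FnOnce` means a site with a healthy sitemap never pays for the crawl at all.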

One new dependency: flate2 (already resolved in the lockfile transitively).
Includes the commit's unit tests (map dedup/origin, gzip decode). Original
work by the prior author on perf/audit-fixes; this re-applies only the map
slice onto main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 15:33:49 +02:00
Valerio
c3e5ef5143
Merge pull request #63 from 0xMassi/feat/standalone-search
feat(search): standalone web search via Serper.dev (bring-your-own-key)
2026-06-17 15:21:02 +02:00
Valerio
06f151c560 feat(search): standalone web search via Serper.dev (bring-your-own-key)
Rescued from the stale perf/audit-fixes branch and ported cleanly onto
current main. OSS surfaces can now search without the hosted webclaw API
when the caller supplies their own Serper.dev key (free at serper.dev).

- webclaw-fetch::search() — calls Serper.dev directly (plain wreq client;
  a JSON API needs no fingerprinting) and, with scrape=true, fetches +
  extracts the top result pages concurrently (bounded) via the caller's
  FetchClient. parse_serper_organic() is pure and unit-tested.
- MCP `search` tool: local-first — uses SERPER_API_KEY when set, else
  falls back to the hosted webclaw API. Adds country/lang/scrape params.
- OSS REST server: POST /v1/search, gated on SERPER_API_KEY (501 when
  unset, with a setup hint). Adds ApiError::NotImplemented.
- CLI: `webclaw search <query> [--serper-key|SERPER_API_KEY] [--num]
  [--country] [--lang] [--scrape] [--format]`.

No new dependencies (reuses futures-util already in the tree). Original
work by the prior author on perf/audit-fixes; this re-applies only the
search slice onto main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 15:10:58 +02:00
Valerio
0c6f323f51 chore(release): v0.6.11 — Gemini provider + Anthropic model fix 2026-06-16 16:12:11 +02:00
Valerio
d9e3d0b2bb feat(llm): add Gemini provider and fix stale Anthropic default model
Adds a Google Gemini provider (Generative Language API) to the chain,
ordered Ollama -> OpenAI -> Gemini -> Anthropic so Google credits are
preferred with Anthropic as the last-resort fallback.

- Request mapping: system -> systemInstruction, assistant -> model,
  json_mode -> responseMimeType.
- The model name is validated before URL interpolation.
- maxOutputTokens defaults high for the 2.5 thinking models.

Also fixes the AnthropicProvider default: the retired
claude-sonnet-4-20250514 returned 404; the default is now
claude-sonnet-4-6, and ANTHROPIC_MODEL is honored.
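
The role mapping and the model-name check mentioned above might look roughly like this. It is a sketch under assumptions: `Role`, `Message`, `GeminiRequest`, `to_gemini`, and `is_safe_model_name` are hypothetical names, joining several system messages with newlines is an assumption, and the allowed character set for model names is a guess rather than the provider's actual rule.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Role {
    System,
    User,
    Assistant,
}

struct Message {
    role: Role,
    text: String,
}

/// Simplified request shape: system text is carried separately
/// (systemInstruction) and chat turns use the roles "user" / "model".
#[derive(Debug, PartialEq)]
struct GeminiRequest {
    system_instruction: Option<String>,
    contents: Vec<(&'static str, String)>,
}

fn to_gemini(messages: &[Message]) -> GeminiRequest {
    let mut system = Vec::new();
    let mut contents = Vec::new();
    for m in messages {
        match m.role {
            Role::System => system.push(m.text.clone()),
            Role::User => contents.push(("user", m.text.clone())),
            Role::Assistant => contents.push(("model", m.text.clone())),
        }
    }
    GeminiRequest {
        // Assumption: multiple system messages are joined with newlines.
        system_instruction: if system.is_empty() { None } else { Some(system.join("\n")) },
        contents,
    }
}

/// Assumed allow-list: reject anything that could alter the URL path or query
/// when the model name is interpolated into the endpoint.
fn is_safe_model_name(name: &str) -> bool {
    !name.is_empty()
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '.' | '_'))
}

fn main() {
    let req = to_gemini(&[
        Message { role: Role::System, text: "Be brief.".into() },
        Message { role: Role::User, text: "Hi".into() },
        Message { role: Role::Assistant, text: "Hello".into() },
    ]);
    println!("{req:?}");
    println!("{}", is_safe_model_name("../other?key=x"));
}
```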

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 15:52:37 +02:00
Valerio
8a0768526f chore(mcp): add .mcp.json so Cursor / Open Plugins directories detect the MCP server
Declares the webclaw MCP server at the repo root (matches the README manual
config). Cursor's plugin scanner looks for .mcp.json/mcp.json at root.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 18:15:19 +02:00
Valerio
e7ec76bce9
docs(sponsors): add MangoProxy studio partner (#60) 2026-06-15 15:06:00 +02:00
Valerio
da6c6af724 chore(release): bump version to 0.6.10
Release the MCP numeric-param string-coercion fix (#58, PR #59):
crawl/batch/search/summarize numeric args now accept JSON numbers or
numeric strings, fixing clients (e.g. Claude Desktop) that send "5"
instead of 5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 11:27:04 +02:00
Valerio
243e7032d0
Merge pull request #59 from crossi-dev/fix/numeric-params-string-coercion
fix: accept numeric MCP params sent as strings (#58)
2026-06-15 11:26:05 +02:00
Valerio
24ae3a7af2 style(mcp): apply rustfmt to numeric param coercion
Reformat the string-or-number deserialize helpers and tests to satisfy
`cargo fmt --check` (style_edition 2024), which the lint CI job enforces.
Formatting only — no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 11:25:55 +02:00
Charles Rossi
b5ee838d5f fix(tools): accept numeric params as JSON strings
MCP clients (Claude Desktop, VS Code Copilot, etc.) serialize numeric
tool arguments as JSON strings ("3" instead of 3). serde's built-in
u32/usize deserialisers reject these with:

  invalid type: string "N", expected u32

Add two private coercion helpers — `deser_opt_u32_or_str` and
`deser_opt_usize_or_str` — that accept both JSON number and JSON string
representations, falling back to `str::parse` for the string form and
returning a clear custom error for non-numeric strings.
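
A serde-free sketch of the same number-or-string fallback, for illustration only. `NumOrStr` and `coerce_opt_u32` are hypothetical names, not the `deser_opt_u32_or_str` helper itself, and the error wording is an assumption.

```rust
/// A numeric tool argument as it might arrive: a JSON number or a JSON string.
/// Hypothetical type standing in for the serde-level representation.
#[derive(Debug, Clone, PartialEq)]
enum NumOrStr {
    Num(u64),
    Str(String),
}

/// Accept a native number or a numeric string; absent stays `None`.
fn coerce_opt_u32(raw: Option<NumOrStr>) -> Result<Option<u32>, String> {
    match raw {
        None => Ok(None),
        Some(NumOrStr::Num(n)) => u32::try_from(n)
            .map(Some)
            .map_err(|_| format!("{n} does not fit in u32")),
        // Fall back to `str::parse` for the string form.
        Some(NumOrStr::Str(s)) => s
            .parse::<u32>()
            .map(Some)
            .map_err(|_| format!("invalid numeric string {s:?}, expected u32")),
    }
}

fn main() {
    println!("{:?}", coerce_opt_u32(Some(NumOrStr::Str("3".to_string()))));
    println!("{:?}", coerce_opt_u32(Some(NumOrStr::Str("three".to_string()))));
}
```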

Annotate the six affected optional fields:
  CrawlParams: depth (u32), max_pages (usize), concurrency (usize)
  BatchParams: concurrency (usize)
  SearchParams: num_results (u32)
  SummarizeParams: max_sentences (usize)

Add 24 unit tests (4 per field: numeric string → value, native number
→ value, absent → None, non-numeric string → Err) verified green via
an isolated serde-only crate.

Fixes #58
2026-06-15 01:04:35 -03:00
Valerio
28cd53efcb
Merge pull request #57 from raffaelemancuso/patch-1
Add Windows binaries to README
2026-06-12 17:59:55 +02:00
Raffaele Mancuso
c133478994
Add Windows binaries to README 2026-06-12 17:56:47 +02:00
Valerio
3c726060bf docs(proxy-example): reword residential product line; refresh NodeMaven banner 2026-06-11 15:16:56 +02:00
Valerio
cb78363466 chore(sponsors): update NodeMaven banner to new branding 2026-06-11 11:50:23 +02:00
Valerio
df7336d55b
Merge pull request #56 from 0xMassi/docs/nodemaven-partner
docs: add NodeMaven studio partner to README
2026-06-10 17:46:55 +02:00
Valerio
acd3021f38 docs(readme): add NodeMaven studio partner 2026-06-10 17:46:49 +02:00
Valerio
bcc58dbadd
Merge pull request #55 from 0xMassi/fix/docker-multiarch-single-build
ci(release): single multi-platform Docker build + dispatch re-publish
2026-06-10 15:56:36 +02:00
Valerio
8015de7db5 ci(release): build the Docker image in one multi-platform pass
The per-arch build + 'imagetools create' combine failed at the manifest
step with 'v0.6.9-arm64: not found' — buildx's default provenance/SBOM
attestations turn each per-arch tag into an index, and assembling them
races GHCR's read-after-write. Replace it with a single
'docker buildx build --platform linux/amd64,linux/arm64 --push'
(attestations off) so one manifest list is pushed atomically. Dockerfile.ci
now selects binaries by TARGETARCH. Adds a workflow_dispatch path to
re-publish an existing tag's image without rebuilding binaries or bumping
the version.
2026-06-10 15:54:28 +02:00
Valerio
be64409d62
Merge pull request #54 from 0xMassi/fix/docker-multiarch-release
chore: release v0.6.9 (fix multi-arch Docker publish)
2026-06-10 15:30:46 +02:00
Valerio
2773474984 chore: release v0.6.9
Publish the multi-arch Docker image with Buildx instead of the legacy
docker driver, whose GHCR push intermittently failed with 'unknown
blob'. The manifest list is now assembled registry-side with
`imagetools create`. This also unblocks the Homebrew formula update,
which depends on the Docker job. No library or CLI behavior changes.
2026-06-10 15:30:39 +02:00
Valerio
7dfa180e86 chore: release v0.6.8 2026-06-10 14:42:05 +02:00
Valerio
598f319bf3
Merge pull request #52 from 0xMassi/audit-fixes-2026-06-09
fix: harden LLM providers, UTF-8 handling, and webhook/batch reliability
2026-06-10 14:40:29 +02:00
Valerio
fae2766db1
Merge pull request #53 from 0xMassi/docs-coldproxy
docs: add ColdProxy proxy-backed crawling walkthrough
2026-06-10 14:40:01 +02:00
Valerio
d0909a25e3 docs: add ColdProxy proxy-backed crawling walkthrough 2026-06-10 10:42:47 +02:00
Valerio
499345046c fix: harden LLM providers, UTF-8 handling, and webhook/batch reliability
- webclaw-llm: add explicit request + connect timeouts to the reqwest
  client in every provider (anthropic, openai, ollama) with a shorter
  timeout on the ollama health check, so a stalled provider fails fast.
- webclaw-llm: fix a panic when truncating a provider error body that
  contains multibyte characters near the 500-char cut (char-safe take).
- webclaw-core: snap the endpoint-scan budget cut to a UTF-8 char
  boundary so oversized scripts with non-ASCII content no longer panic.
- webclaw-core: rewrite js_literal_to_json to copy raw bytes instead of
  `byte as char`, preserving multibyte UTF-8 in SvelteKit string values
  rather than producing Latin-1 mojibake.
- webclaw-cli: have fire_webhook return its JoinHandle and await it at
  the crawl/batch/batch-llm call sites, removing the fixed 500ms sleeps.
- webclaw-mcp: drop the up-front DNS pre-validation loop in batch that
  aborted the whole request on one bad URL; the fetch layer already
  applies the same SSRF guard per URL and reports per-URL errors.
- webclaw-fetch: include the port in the warmup homepage URL so hosts
  on a non-default port are warmed correctly.
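
The two UTF-8 fixes in the list above share one idea: never cut a `&str` at an arbitrary byte offset. A minimal sketch of both approaches follows; `truncate_chars` and `snap_to_char_boundary` are hypothetical helper names, not the functions touched by the commit.

```rust
/// Char-safe truncation: keep at most `max_chars` characters.
/// Counting chars instead of bytes cannot split a multibyte character.
fn truncate_chars(s: &str, max_chars: usize) -> String {
    s.chars().take(max_chars).collect()
}

/// Snap a byte budget down to the nearest UTF-8 char boundary so that
/// `&s[..idx]` cannot panic.
fn snap_to_char_boundary(s: &str, mut idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}

fn main() {
    let body = "aé"; // 'a' is 1 byte, 'é' is 2 bytes
    // Slicing at byte 2 directly (`&body[..2]`) would panic mid-character.
    let cut = snap_to_char_boundary(body, 2);
    println!("{}", &body[..cut]);
    println!("{}", truncate_chars("héllo", 2));
}
```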

Adds regression tests for the UTF-8 endpoint-scan and SvelteKit cases.
2026-06-09 21:10:15 +02:00
Valerio
d0d7b835f2 docs(readme): update banner to new webclaw branding 2026-06-09 18:53:14 +02:00
Valerio
6519ac2a8b chore(release): v0.6.7 2026-06-09 12:38:03 +02:00
Valerio
14ded4b99e chore(deps): bump wreq 6.0.0-rc.29, wreq-util 3.0.0-rc.12
Ports the TLS/Response API breaks in the bump:
- certificate_compression_algorithms -> certificate_compressors with
  wreq-util's BrotliCompressor/ZlibCompressor trait objects
- ExtensionType::APPLICATION_SETTINGS_NEW -> APPLICATION_SETTINGS (same
  codepoint 17613)
- wreq_util::Emulation::SafariIos26.emulation() ->
  Profile::SafariIos26.into_emulation(); Emulation fields are now public
  so *_mut() accessors become direct field access; build() takes a Group
- Response::chunk() removed -> bytes_stream() (wreq 'stream' feature) with
  the running body-size ceiling preserved; adds futures-util

Browser fingerprints verified unchanged on tls.peet.ws: Chrome JA3
43067709b025da334de1279a120f8e14, Safari iOS JA3 8d909525bd5bbb79f133d11cc05159fe.
2026-06-09 12:38:03 +02:00
Valerio
72a451cfb6 chore(release): sync Cargo.lock to v0.6.6 2026-06-09 11:26:18 +02:00
Valerio
17fce81a95 chore(release): v0.6.6
Salvaged two CLI ergonomics fixes from #49:
- periodic progress line on slow fetches (stderr)
- --url-encoded flag + URL truncation warning
2026-06-09 11:24:13 +02:00
Valerio
84a0f9774d style: apply rustfmt to salvaged #49 commits 2026-06-09 11:24:13 +02:00
devnen
519dfb7864 feat(cli): URL truncation warning + --url-encoded flag
When bash splits a URL at & or ? (a common foot-gun), webclaw
receives only the truncated prefix and silently fetches the wrong
page. Per issue #6:

1. Heuristic warning: if the URL ends with '&' or contains '?' with
   no '=' after, emit a stderr warning before fetching:
     # webclaw: warning: URL looks truncated (ends with '&' or '?'); did the shell split it? Quote the URL or use --url-encoded.

2. New flag --url-encoded: parallel input that asserts the user has
   handled escaping. Suppresses the truncation warning since intent
   is explicit.
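
The heuristic in item 1 is small enough to sketch directly. `looks_truncated` is a hypothetical name and the sketch follows only the rule as worded above (ends with '&', or contains '?' with no '=' after it), so a legitimate URL such as `?flag` would also trigger it, which is consistent with the warning being informational.

```rust
/// Returns true when the URL looks like the shell cut it at '&' or '?'.
fn looks_truncated(url: &str) -> bool {
    if url.ends_with('&') {
        return true;
    }
    match url.find('?') {
        // '?' is a single byte, so `pos + 1` is a valid slice start.
        Some(pos) => !url[pos + 1..].contains('='),
        None => false,
    }
}

fn main() {
    for url in [
        "https://example.com/search?q=rust&page=2",
        "https://example.com/search?q=rust&",
        "https://example.com/search?",
    ] {
        println!("{url} -> truncated? {}", looks_truncated(url));
    }
}
```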

Fetch proceeds in both cases; this is informational only. 4 new
tests in webclaw-cli. Workspace test total 720 -> 724.

(cherry picked from commit 4ef27fcd33)
2026-06-09 11:24:13 +02:00
devnen
985a90b083 feat(fetch): periodic progress stderr line on slow fetches
Webclaw's default -t timeout is 30s; slow sites previously sat
silently with no feedback. Now during a fetch, every 10s of elapsed
time webclaw writes one line to stderr:

  # webclaw: still fetching <URL> (Ns)

Fetches completing in under 10s emit nothing (the timer never fires).
Stdout output is untouched - pure feedback signal on stderr.

No timeout change. No new flags. Default behavior is augmented at
stderr only.

Implemented via tokio::select! between the fetch future and a
tokio::time::interval. Latency cost: a single tokio task spawn
and a 10s tick - microseconds on the fast path.
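
The commit races the fetch future against a `tokio::time::interval` with `tokio::select!`. To stay dependency-free, the sketch below swaps in a different technique with the same shape: a worker thread plus `recv_timeout` on a std channel. `run_with_progress` is a hypothetical name, and the blocking closure stands in for the fetch.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::{Duration, Instant};

/// Run `work` on a worker thread and call `on_tick` every `tick` of elapsed
/// time until it finishes. Work that completes before the first tick emits
/// nothing.
fn run_with_progress<T, F, P>(work: F, tick: Duration, mut on_tick: P) -> T
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
    P: FnMut(Duration),
{
    let (tx, rx) = mpsc::channel();
    let started = Instant::now();
    thread::spawn(move || {
        // The receiver may already be gone; ignore the send error.
        let _ = tx.send(work());
    });
    loop {
        match rx.recv_timeout(tick) {
            Ok(value) => return value,
            Err(RecvTimeoutError::Timeout) => on_tick(started.elapsed()),
            Err(RecvTimeoutError::Disconnected) => {
                panic!("worker exited without producing a result")
            }
        }
    }
}

fn main() {
    let body = run_with_progress(
        || {
            thread::sleep(Duration::from_millis(250)); // stand-in for a slow fetch
            "done"
        },
        Duration::from_millis(100),
        |elapsed| eprintln!("# still working ({}ms)", elapsed.as_millis()),
    );
    println!("{body}");
}
```

As in the commit, the progress lines go to stderr only, so stdout stays clean for the actual result.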

10 new tests in webclaw-fetch::progress::tests (none ignored; the
slow-future test uses a 50ms test interval to keep cargo test fast).
Workspace total 710 -> 720.

(cherry picked from commit 06f065cb08)
2026-06-09 11:24:13 +02:00
Valerio
a1abf625a0 build(deps): pin wreq/wreq-util to exact rc versions
wreq is a release candidate with no API stability between rc.N builds
(rc.29 broke the TLS + Response API). `cargo install` and the release
workflow both ignore Cargo.lock and were re-resolving to rc.29, breaking
the build. An exact `=6.0.0-rc.28` / `=3.0.0-rc.10` pin keeps every build
path deterministic until wreq reaches a stable release.
2026-06-04 19:33:31 +02:00
Valerio
9a63c1a3ca docs(contributing): describe in-process wreq TLS, drop stale patched-deps
The TLS layer moved to wreq (BoringSSL) in-process; there is no longer a
[patch.crates-io] section or a separate TLS fork. Update the architecture
tree and crate-boundary notes to match.
2026-06-04 17:56:24 +02:00
Valerio
58d274ffe9 style(reddit): use Option::zip to satisfy clippy
CI runs clippy with `-D warnings` on a newer toolchain that flags
`manual_option_zip`; collapse the and_then/map pair into Option::zip.
2026-06-04 17:48:17 +02:00
Valerio
f6000cba52 chore(release): v0.6.5
Reddit extraction moves from the dead .json API to old.reddit.com HTML.
2026-06-04 17:36:02 +02:00
Valerio
217bfe088b feat(reddit): parse old.reddit.com HTML instead of the dead .json API
Reddit blocked unauthenticated `.json` access, so the previous extractor
returned block pages or timed out on every thread. Switch to parsing
old.reddit.com's server-rendered HTML, which needs no API key or JS.

Fetch layer:
- Rewrite every Reddit host to old.reddit.com before fetching; drop all
  `.json` URL handling and the JSON response parser.

Extraction (webclaw-core::reddit):
- New HTML parser producing a typed post + nested comment tree.
- Comments nest structurally (.comment > .child > .sitetable > .comment);
  old.reddit omits a usable depth attribute, so the tree is walked
  recursively. Bodies live in .entry > form > .usertext-body > .md.
- Post metadata: title, author, subreddit, score, comment count
  (data-comments-count), self-vs-link (self class / self.* domain),
  flair, self-text body.
- Comment scores read the .score.unvoted title (the displayed value, not
  the ±1 vote-state siblings); hidden scores are None, not 0.
- Deleted comments are kept in place so their replies aren't orphaned;
  "load more comments" stubs are skipped.

Markdown output:
- Reply nesting via blockquote depth (avoids 4-space indentation turning
  text and code fences into broken indented-code blocks).
- Links keep their target as [text](url); root-relative reddit links
  resolve against old.reddit.com. Nested lists indent correctly.
- A recognised but unparseable /comments/ page returns no content rather
  than falling through to generic extraction of Reddit chrome.
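
Reply nesting via blockquote depth can be sketched as a recursive walk over the comment tree. This is an illustration only: the `Comment` struct and `render` function are hypothetical, and the real extractor's output format (author line, spacing) is not specified in the commit message.

```rust
/// Hypothetical comment node; the real typed tree carries more fields.
struct Comment {
    author: String,
    body: String,
    replies: Vec<Comment>,
}

/// Render a comment and its replies, prefixing every line with one "> " per
/// nesting level. Because no line is indented with spaces, body text and code
/// fences are never reinterpreted as indented code blocks.
fn render(comment: &Comment, depth: usize, out: &mut String) {
    let prefix = "> ".repeat(depth);
    out.push_str(&format!("{prefix}**{}**\n", comment.author));
    for line in comment.body.lines() {
        out.push_str(&format!("{prefix}{line}\n"));
    }
    // Separator line that stays inside the current quote level.
    out.push_str(prefix.trim_end());
    out.push('\n');
    for reply in &comment.replies {
        render(reply, depth + 1, out);
    }
}

fn main() {
    let thread = Comment {
        author: "alice".into(),
        body: "hi".into(),
        replies: vec![Comment {
            author: "bob".into(),
            body: "yo\nthere".into(),
            replies: vec![],
        }],
    };
    let mut out = String::new();
    render(&thread, 0, &mut out);
    print!("{out}");
}
```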

Tests: regression suite runs against real old.reddit.com fixtures
(testdata/reddit/), the ground truth that surfaced the parsing and
markdown bugs synthetic HTML had hidden. Fixtures are excluded from the
published crate.
2026-06-04 17:36:02 +02:00
Valerio
3b7d11328e Add sponsor preview placements 2026-06-04 10:04:32 +02:00
Valerio
363e17d362 docs: add ColdProxy infrastructure partner 2026-05-31 18:35:45 +02:00
Valerio
8fe8bcb479 chore(ci): bump actions/checkout and artifact actions to v5
GitHub flagged checkout@v4 / upload-artifact@v4 / download-artifact@v4
as Node.js 20 actions that will be force-migrated to Node 24 on 2026-06-02. Bump
all nine references to v5 ahead of the deadline. The artifact steps are
v5-compatible: upload uses a unique matrix-target name and the download
step flattens subdirectories with find afterward.
2026-05-21 15:11:29 +02:00