webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-17 23:55:13 +02:00

Author	SHA1	Message	Date
Valerio	480d3187db	docs(claude-md): document search, map, and perf; refresh stale details Bring core/CLAUDE.md current with the slices rescued this cycle, and fold in earlier /init corrections that were never committed. New capabilities documented: - search: webclaw-fetch `search.rs` (Serper BYO-key) + the CLI `search` subcommand + the OSS `POST /v1/search` route (gated on SERPER_API_KEY) + the now-local-first MCP `search` tool. - map: webclaw-fetch `map.rs` (`discover_urls`/`MapOptions`, sitemap + bounded crawl fallback), gzip sitemap support, and the new `--map-pages`/`--no-map-crawl`/`--map-limit` CLI flags. - perf: shared `extractors/og.rs` parser and the QuickJS runtime gate / parsed-document reuse noted on `js_eval.rs`. Corrections folded in: real browser fingerprint versions live in tls.rs (not browser.rs), accurate module/route lists, Repo Layout section, and removal of the now-false "search lives only in production" notes. Bumped the stated workspace version to 0.6.13. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 17:10:36 +02:00
Valerio	ecfb72a1a3	chore(release): bump version to 0.6.13 Ship the hot-path extraction speedups (#66): selector hoisting, shared Open Graph parsing, QuickJS gating + parsed-document reuse, and HTTP connection-pool tuning. Byte-identical extraction output (verified). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 16:52:38 +02:00
Valerio	febe56d177	Merge pull request #66 from 0xMassi/perf/hot-path-speedups perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating)	2026-06-17 16:47:22 +02:00
Valerio	3c54bea300	perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating) Rescued from the stale perf/audit-fixes branch — the perf-only subset of that branch's big mixed commit, ported cleanly onto current main with byte-identical extraction output. - markdown: hoist the `img[alt]` / `a[href]` selectors out of the per-node noise path into `Lazy` statics (stop recompiling them per element). - extractors: single shared `og()` / `parse_og()` module replaces the per-field Open Graph re-scan duplicated across 7 vertical extractors (amazon, ebay, ecommerce, etsy, substack, trustpilot, youtube). Each vertical now does one pass. Raw-vs-unescaped behaviour preserved exactly. - core: gate the QuickJS VM on a cheap marker check (skip it entirely when the page has no JS-assigned data) and reuse the already-parsed document instead of re-parsing the HTML. - fetch: connection-pool tuning on the wreq client (connect_timeout, idle pool, max-idle-per-host, tcp keepalive) for connection reuse. Output-equivalence is covered by existing tests (amazon quot-entity, trustpilot title parse, ecommerce/youtube/etsy/substack og fallbacks) — all green. No new dependencies; no public API change. Deliberately EXCLUDED from this slice (separate concerns bundled in the original commit): the `#[non_exhaustive]` API-breaking changes, the LLM/PDF/ server reliability hardening (much already shipped in 0.6.8), the tooling (cargo-deny, release profile, MSRV), and the retry-loop dedup refactor (a code-cleanup with no runtime benefit — not worth churning client.rs for). Original work by the prior author on perf/audit-fixes; this re-applies only the performance subset onto main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 16:41:45 +02:00
Valerio	51d0c538f1	chore(release): bump version to 0.6.12 Bundle three changes landed since 0.6.11: - feat(search): standalone web search via Serper.dev (#63) - feat(map): layered URL discovery with bounded crawl fallback (#64) - fix(mcp): accept boolean params sent as JSON strings (#62 / #65) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 16:10:45 +02:00
Valerio	c5dfce8ed5	Merge pull request #65 from 0xMassi/fix/mcp-bool-param-coercion fix(mcp): accept boolean params sent as JSON strings (#62)	2026-06-17 15:41:42 +02:00
Valerio	b5d0f78bb8	Merge pull request #64 from 0xMassi/feat/map-crawl-fallback feat(map): layered URL discovery with bounded crawl fallback	2026-06-17 15:38:43 +02:00
Valerio	884f06a5d3	fix(mcp): accept boolean params sent as JSON strings (#62 ) Follow-up to #58/#59, which fixed numeric params but left the booleans. MCP clients (e.g. Claude Desktop) send `true` as the JSON string `"true"`, which serde's default bool deserializer rejects with `invalid type: string "true", expected a boolean`, failing the call. Adds a `deser_opt_bool_or_str` helper (same untagged pattern as the #59 numeric helpers) that accepts a JSON boolean OR "true"/"false" (case-insensitive, trimmed) and rejects anything else with a clear error. Numeric-looking strings like "1" are intentionally NOT coerced to bool. Applied to every Option<bool> tool param: - scrape -> only_main_content - crawl -> use_sitemap - research -> deep - search -> scrape (added by the standalone-search slice, #63) 16 unit tests (bool / "true"-string / absent->None / garbage->error per field). No new dependencies. Fixes #62. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 15:37:36 +02:00
Valerio	179efbcf87	feat(map): layered URL discovery with bounded crawl fallback Rescued from the stale perf/audit-fixes branch and ported cleanly onto current main (fetch + CLI only — the original commit never touched the server/MCP map surfaces). `--map` used to return only what a site advertises in sitemap.xml, which is nothing for sites with no sitemap (e.g. Hacker News) or a thin one. Now discovery is layered: - webclaw-fetch::discover_urls() / MapOptions — sitemaps first (authoritative, carries lastmod/priority/changefreq); when the sitemap is thin (< min_sitemap_urls) and the fallback is enabled, run a bounded same-origin crawl and harvest links from every fetched page plus the unfetched frontier, deduped against the sitemap set. - sitemap.rs: gzip (.xml.gz) support via a new decode_sitemap_body() + FetchClient::fetch_raw() (raw bytes, no lossy UTF-8); deeper index recursion (3->5); 4 more fallback paths. - CLI: --map-pages / --no-map-crawl / --map-limit; crawler logs now go to stderr so `--map -f json` stays machine-parseable. One new dependency: flate2 (already resolved in the lockfile transitively). Includes the commit's unit tests (map dedup/origin, gzip decode). Original work by the prior author on perf/audit-fixes; this re-applies only the map slice onto main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 15:33:49 +02:00
Valerio	c3e5ef5143	Merge pull request #63 from 0xMassi/feat/standalone-search feat(search): standalone web search via Serper.dev (bring-your-own-key)	2026-06-17 15:21:02 +02:00
Valerio	06f151c560	feat(search): standalone web search via Serper.dev (bring-your-own-key) Rescued from the stale perf/audit-fixes branch and ported cleanly onto current main. OSS surfaces can now search without the hosted webclaw API when the caller supplies their own Serper.dev key (free at serper.dev). - webclaw-fetch::search() — calls Serper.dev directly (plain wreq client; a JSON API needs no fingerprinting) and, with scrape=true, fetches + extracts the top result pages concurrently (bounded) via the caller's FetchClient. parse_serper_organic() is pure and unit-tested. - MCP `search` tool: local-first — uses SERPER_API_KEY when set, else falls back to the hosted webclaw API. Adds country/lang/scrape params. - OSS REST server: POST /v1/search, gated on SERPER_API_KEY (501 when unset, with a setup hint). Adds ApiError::NotImplemented. - CLI: `webclaw search <query> [--serper-key\|SERPER_API_KEY] [--num] [--country] [--lang] [--scrape] [--format]`. No new dependencies (reuses futures-util already in the tree). Original work by the prior author on perf/audit-fixes; this re-applies only the search slice onto main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 15:10:58 +02:00
Valerio	0c6f323f51	chore(release): v0.6.11 — Gemini provider + Anthropic model fix	2026-06-16 16:12:11 +02:00
Valerio	d9e3d0b2bb	feat(llm): add Gemini provider and fix stale Anthropic default model Adds a Google Gemini provider (Generative Language API) to the chain, ordered Ollama -> OpenAI -> Gemini -> Anthropic so Google credits are preferred with Anthropic as last-resort fallback. System->systemInstruction, assistant->model, json_mode->responseMimeType; model name validated before URL interpolation; maxOutputTokens defaults high for 2.5 thinking models. Also fixes AnthropicProvider default (retired claude-sonnet-4-20250514 -> 404); now claude-sonnet-4-6, honors ANTHROPIC_MODEL. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-16 15:52:37 +02:00
Valerio	8a0768526f	chore(mcp): add .mcp.json so Cursor / Open Plugins directories detect the MCP server Declares the webclaw MCP server at the repo root (matches the README manual config). Cursor's plugin scanner looks for .mcp.json/mcp.json at root. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 18:15:19 +02:00
Valerio	e7ec76bce9	docs(sponsors): add MangoProxy studio partner (#60 )	2026-06-15 15:06:00 +02:00
Valerio	da6c6af724	chore(release): bump version to 0.6.10 Release the MCP numeric-param string-coercion fix (#58, PR #59): crawl/batch/search/summarize numeric args now accept JSON numbers or numeric strings, fixing clients (e.g. Claude Desktop) that send "5" instead of 5. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 11:27:04 +02:00
Valerio	243e7032d0	Merge pull request #59 from crossi-dev/fix/numeric-params-string-coercion fix: accept numeric MCP params sent as strings (#58)	2026-06-15 11:26:05 +02:00
Valerio	24ae3a7af2	style(mcp): apply rustfmt to numeric param coercion Reformat the string-or-number deserialize helpers and tests to satisfy `cargo fmt --check` (style_edition 2024), which the lint CI job enforces. Formatting only — no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 11:25:55 +02:00
Charles Rossi	b5ee838d5f	fix(tools): accept numeric params as JSON strings MCP clients (Claude Desktop, VS Code Copilot, etc.) serialize numeric tool arguments as JSON strings ("3" instead of 3). serde's built-in u32/usize deserialisers reject these with: invalid type: string "N", expected u32 Add two private coercion helpers — `deser_opt_u32_or_str` and `deser_opt_usize_or_str` — that accept both JSON number and JSON string representations, falling back to `str::parse` for the string form and returning a clear custom error for non-numeric strings. Annotate the six affected optional fields: CrawlParams: depth (u32), max_pages (usize), concurrency (usize) BatchParams: concurrency (usize) SearchParams: num_results (u32) SummarizeParams: max_sentences (usize) Add 24 unit tests (4 per field: numeric string → value, native number → value, absent → None, non-numeric string → Err) verified green via an isolated serde-only crate. Fixes #58	2026-06-15 01:04:35 -03:00
Valerio	28cd53efcb	Merge pull request #57 from raffaelemancuso/patch-1 Add Windows binaries to README	2026-06-12 17:59:55 +02:00
Raffaele Mancuso	c133478994	Add Windows binaries to README	2026-06-12 17:56:47 +02:00
Valerio	3c726060bf	docs(proxy-example): reword residential product line; refresh NodeMaven banner	2026-06-11 15:16:56 +02:00
Valerio	cb78363466	chore(sponsors): update NodeMaven banner to new branding	2026-06-11 11:50:23 +02:00
Valerio	df7336d55b	Merge pull request #56 from 0xMassi/docs/nodemaven-partner docs: add NodeMaven studio partner to README	2026-06-10 17:46:55 +02:00
Valerio	acd3021f38	docs(readme): add NodeMaven studio partner	2026-06-10 17:46:49 +02:00
Valerio	bcc58dbadd	Merge pull request #55 from 0xMassi/fix/docker-multiarch-single-build ci(release): single multi-platform Docker build + dispatch re-publish	2026-06-10 15:56:36 +02:00
Valerio	8015de7db5	ci(release): build the Docker image in one multi-platform pass The per-arch build + 'imagetools create' combine failed at the manifest step with 'v0.6.9-arm64: not found' — buildx's default provenance/SBOM attestations turn each per-arch tag into an index, and assembling them races GHCR's read-after-write. Replace it with a single 'docker buildx build --platform linux/amd64,linux/arm64 --push' (attestations off) so one manifest list is pushed atomically. Dockerfile.ci now selects binaries by TARGETARCH. Adds a workflow_dispatch path to re-publish an existing tag's image without rebuilding binaries or bumping the version.	2026-06-10 15:54:28 +02:00
Valerio	be64409d62	Merge pull request #54 from 0xMassi/fix/docker-multiarch-release chore: release v0.6.9 (fix multi-arch Docker publish)	2026-06-10 15:30:46 +02:00
Valerio	2773474984	chore: release v0.6.9 Publish the multi-arch Docker image with Buildx instead of the legacy docker driver, whose GHCR push intermittently failed with 'unknown blob'. The manifest list is now assembled registry-side with `imagetools create`. This also unblocks the Homebrew formula update, which depends on the Docker job. No library or CLI behavior changes.	2026-06-10 15:30:39 +02:00
Valerio	7dfa180e86	chore: release v0.6.8	2026-06-10 14:42:05 +02:00
Valerio	598f319bf3	Merge pull request #52 from 0xMassi/audit-fixes-2026-06-09 fix: harden LLM providers, UTF-8 handling, and webhook/batch reliability	2026-06-10 14:40:29 +02:00
Valerio	fae2766db1	Merge pull request #53 from 0xMassi/docs-coldproxy docs: add ColdProxy proxy-backed crawling walkthrough	2026-06-10 14:40:01 +02:00
Valerio	d0909a25e3	docs: add ColdProxy proxy-backed crawling walkthrough	2026-06-10 10:42:47 +02:00
Valerio	499345046c	fix: harden LLM providers, UTF-8 handling, and webhook/batch reliability - webclaw-llm: add explicit request + connect timeouts to the reqwest client in every provider (anthropic, openai, ollama) with a shorter timeout on the ollama health check, so a stalled provider fails fast. - webclaw-llm: fix a panic when truncating a provider error body that contains multibyte characters near the 500-char cut (char-safe take). - webclaw-core: snap the endpoint-scan budget cut to a UTF-8 char boundary so oversized scripts with non-ASCII content no longer panic. - webclaw-core: rewrite js_literal_to_json to copy raw bytes instead of `byte as char`, preserving multibyte UTF-8 in SvelteKit string values rather than producing Latin-1 mojibake. - webclaw-cli: have fire_webhook return its JoinHandle and await it at the crawl/batch/batch-llm call sites, removing the fixed 500ms sleeps. - webclaw-mcp: drop the up-front DNS pre-validation loop in batch that aborted the whole request on one bad URL; the fetch layer already applies the same SSRF guard per URL and reports per-URL errors. - webclaw-fetch: include the port in the warmup homepage URL so hosts on a non-default port are warmed correctly. Adds regression tests for the UTF-8 endpoint-scan and SvelteKit cases.	2026-06-09 21:10:15 +02:00
Valerio	d0d7b835f2	docs(readme): update banner to new webclaw branding	2026-06-09 18:53:14 +02:00
Valerio	6519ac2a8b	chore(release): v0.6.7	2026-06-09 12:38:03 +02:00
Valerio	14ded4b99e	chore(deps): bump wreq 6.0.0-rc.29, wreq-util 3.0.0-rc.12 Ports the TLS/Response API breaks in the bump: - certificate_compression_algorithms -> certificate_compressors with wreq-util's BrotliCompressor/ZlibCompressor trait objects - ExtensionType::APPLICATION_SETTINGS_NEW -> APPLICATION_SETTINGS (same codepoint 17613) - wreq_util::Emulation::SafariIos26.emulation() -> Profile::SafariIos26.into_emulation(); Emulation fields are now public so *_mut() accessors become direct field access; build() takes a Group - Response::chunk() removed -> bytes_stream() (wreq 'stream' feature) with the running body-size ceiling preserved; adds futures-util Browser fingerprints verified unchanged on tls.peet.ws: Chrome JA3 43067709b025da334de1279a120f8e14, Safari iOS JA3 8d909525bd5bbb79f133d11cc05159fe.	2026-06-09 12:38:03 +02:00
Valerio	72a451cfb6	chore(release): sync Cargo.lock to v0.6.6	2026-06-09 11:26:18 +02:00
Valerio	17fce81a95	chore(release): v0.6.6 Salvaged two CLI ergonomics fixes from #49: - periodic progress line on slow fetches (stderr) - --url-encoded flag + URL truncation warning	2026-06-09 11:24:13 +02:00
Valerio	84a0f9774d	style: apply rustfmt to salvaged #49 commits	2026-06-09 11:24:13 +02:00
devnen	519dfb7864	feat(cli): URL truncation warning + --url-encoded flag When bash splits a URL at & or ? (a common foot-gun), webclaw receives only the truncated prefix and silently fetches the wrong page. Per issue #6: 1. Heuristic warning: if the URL ends with '&' or contains '?' with no '=' after, emit a stderr warning before fetching: # webclaw: warning: URL looks truncated (ends with '&' or '?'); did the shell split it? Quote the URL or use --url-encoded. 2. New flag --url-encoded: parallel input that asserts the user has handled escaping. Suppresses the truncation warning since intent is explicit. Fetch proceeds in both cases; this is informational only. 4 new tests in webclaw-cli. Workspace 720 -> 724. (cherry picked from commit `4ef27fcd33`)	2026-06-09 11:24:13 +02:00
devnen	985a90b083	feat(fetch): periodic progress stderr line on slow fetches Webclaw's default -t timeout is 30s; slow sites previously sat silently with no feedback. Now during a fetch, every 10s of elapsed time webclaw writes one line to stderr: # webclaw: still fetching <URL> (Ns) Fetches completing in under 10s emit nothing (the timer never fires). Stdout output is untouched - pure feedback signal on stderr. No timeout change. No new flags. Default behavior is augmented at stderr only. Implemented via tokio::select! between the fetch future and a tokio::time::interval. Latency cost: a single tokio task spawn and a 10s tick - microseconds on the fast path. 10 new tests in webclaw-fetch::progress::tests (none ignored; the slow-future test uses a 50ms test interval to keep cargo test fast). Workspace total 710 -> 720. (cherry picked from commit `06f065cb08`)	2026-06-09 11:24:13 +02:00
Valerio	a1abf625a0	build(deps): pin wreq/wreq-util to exact rc versions wreq is a release candidate with no API stability between rc.N builds (rc.29 broke the TLS + Response API). `cargo install` and the release workflow both ignore Cargo.lock and were re-resolving to rc.29, breaking the build. An exact `=6.0.0-rc.28` / `=3.0.0-rc.10` pin keeps every build path deterministic until wreq reaches a stable release.	2026-06-04 19:33:31 +02:00
Valerio	9a63c1a3ca	docs(contributing): describe in-process wreq TLS, drop stale patched-deps The TLS layer moved to wreq (BoringSSL) in-process; there is no longer a [patch.crates-io] section or a separate TLS fork. Update the architecture tree and crate-boundary notes to match.	2026-06-04 17:56:24 +02:00
Valerio	58d274ffe9	style(reddit): use Option::zip to satisfy clippy CI runs clippy with `-D warnings` on a newer toolchain that flags `manual_option_zip`; collapse the and_then/map pair into Option::zip.	2026-06-04 17:48:17 +02:00
Valerio	f6000cba52	chore(release): v0.6.5 Reddit extraction moves from the dead .json API to old.reddit.com HTML.	2026-06-04 17:36:02 +02:00
Valerio	217bfe088b	feat(reddit): parse old.reddit.com HTML instead of the dead .json API Reddit blocked unauthenticated `.json` access, so the previous extractor returned block pages or timed out on every thread. Switch to parsing old.reddit.com's server-rendered HTML, which needs no API key or JS. Fetch layer: - Rewrite every Reddit host to old.reddit.com before fetching; drop all `.json` URL handling and the JSON response parser. Extraction (webclaw-core::reddit): - New HTML parser producing a typed post + nested comment tree. - Comments nest structurally (.comment > .child > .sitetable > .comment); old.reddit omits a usable depth attribute, so the tree is walked recursively. Bodies live in .entry > form > .usertext-body > .md. - Post metadata: title, author, subreddit, score, comment count (data-comments-count), self-vs-link (self class / self.* domain), flair, self-text body. - Comment scores read the .score.unvoted title (the displayed value, not the ±1 vote-state siblings); hidden scores are None, not 0. - Deleted comments are kept in place so their replies aren't orphaned; "load more comments" stubs are skipped. Markdown output: - Reply nesting via blockquote depth (avoids 4-space indentation turning text and code fences into broken indented-code blocks). - Links keep their target as [text](url); root-relative reddit links resolve against old.reddit.com. Nested lists indent correctly. - A recognised but unparseable /comments/ page returns no content rather than falling through to generic extraction of Reddit chrome. Tests: regression suite runs against real old.reddit.com fixtures (testdata/reddit/), the ground truth that surfaced the parsing and markdown bugs synthetic HTML had hidden. Fixtures are excluded from the published crate.	2026-06-04 17:36:02 +02:00
Valerio	3b7d11328e	Add sponsor preview placements	2026-06-04 10:04:32 +02:00
Valerio	363e17d362	docs: add ColdProxy infrastructure partner	2026-05-31 18:35:45 +02:00
Valerio	8fe8bcb479	chore(ci): bump actions/checkout and artifact actions to v5 GitHub flagged checkout@v4 / upload-artifact@v4 / download-artifact@v4 as Node.js 20 actions, force-migrated to Node 24 on 2026-06-02. Bump all nine references to v5 ahead of the deadline. The artifact steps are v5-compatible: upload uses a unique matrix-target name and the download step flattens subdirectories with find afterward.	2026-05-21 15:11:29 +02:00

1 2 3 4 5

208 commits