webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-10 22:45:13 +02:00

Author	SHA1	Message	Date
Valerio	598f319bf3	Merge pull request #52 from 0xMassi/audit-fixes-2026-06-09 fix: harden LLM providers, UTF-8 handling, and webhook/batch reliability	2026-06-10 14:40:29 +02:00
Valerio	fae2766db1	Merge pull request #53 from 0xMassi/docs-coldproxy docs: add ColdProxy proxy-backed crawling walkthrough	2026-06-10 14:40:01 +02:00
Valerio	d0909a25e3	docs: add ColdProxy proxy-backed crawling walkthrough	2026-06-10 10:42:47 +02:00
Valerio	499345046c	fix: harden LLM providers, UTF-8 handling, and webhook/batch reliability - webclaw-llm: add explicit request + connect timeouts to the reqwest client in every provider (anthropic, openai, ollama) with a shorter timeout on the ollama health check, so a stalled provider fails fast. - webclaw-llm: fix a panic when truncating a provider error body that contains multibyte characters near the 500-char cut (char-safe take). - webclaw-core: snap the endpoint-scan budget cut to a UTF-8 char boundary so oversized scripts with non-ASCII content no longer panic. - webclaw-core: rewrite js_literal_to_json to copy raw bytes instead of `byte as char`, preserving multibyte UTF-8 in SvelteKit string values rather than producing Latin-1 mojibake. - webclaw-cli: have fire_webhook return its JoinHandle and await it at the crawl/batch/batch-llm call sites, removing the fixed 500ms sleeps. - webclaw-mcp: drop the up-front DNS pre-validation loop in batch that aborted the whole request on one bad URL; the fetch layer already applies the same SSRF guard per URL and reports per-URL errors. - webclaw-fetch: include the port in the warmup homepage URL so hosts on a non-default port are warmed correctly. Adds regression tests for the UTF-8 endpoint-scan and SvelteKit cases.	2026-06-09 21:10:15 +02:00
Valerio	d0d7b835f2	docs(readme): update banner to new webclaw branding	2026-06-09 18:53:14 +02:00
Valerio	6519ac2a8b	chore(release): v0.6.7	2026-06-09 12:38:03 +02:00
Valerio	14ded4b99e	chore(deps): bump wreq 6.0.0-rc.29, wreq-util 3.0.0-rc.12 Ports the TLS/Response API breaks in the bump: - certificate_compression_algorithms -> certificate_compressors with wreq-util's BrotliCompressor/ZlibCompressor trait objects - ExtensionType::APPLICATION_SETTINGS_NEW -> APPLICATION_SETTINGS (same codepoint 17613) - wreq_util::Emulation::SafariIos26.emulation() -> Profile::SafariIos26.into_emulation(); Emulation fields are now public so *_mut() accessors become direct field access; build() takes a Group - Response::chunk() removed -> bytes_stream() (wreq 'stream' feature) with the running body-size ceiling preserved; adds futures-util Browser fingerprints verified unchanged on tls.peet.ws: Chrome JA3 43067709b025da334de1279a120f8e14, Safari iOS JA3 8d909525bd5bbb79f133d11cc05159fe.	2026-06-09 12:38:03 +02:00
Valerio	72a451cfb6	chore(release): sync Cargo.lock to v0.6.6	2026-06-09 11:26:18 +02:00
Valerio	17fce81a95	chore(release): v0.6.6 Salvaged two CLI ergonomics fixes from #49: - periodic progress line on slow fetches (stderr) - --url-encoded flag + URL truncation warning	2026-06-09 11:24:13 +02:00
Valerio	84a0f9774d	style: apply rustfmt to salvaged #49 commits	2026-06-09 11:24:13 +02:00
devnen	519dfb7864	feat(cli): URL truncation warning + --url-encoded flag When bash splits a URL at & or ? (a common foot-gun), webclaw receives only the truncated prefix and silently fetches the wrong page. Per issue #6: 1. Heuristic warning: if the URL ends with '&' or contains '?' with no '=' after, emit a stderr warning before fetching: # webclaw: warning: URL looks truncated (ends with '&' or '?'); did the shell split it? Quote the URL or use --url-encoded. 2. New flag --url-encoded: parallel input that asserts the user has handled escaping. Suppresses the truncation warning since intent is explicit. Fetch proceeds in both cases; this is informational only. 4 new tests in webclaw-cli. Workspace 720 -> 724. (cherry picked from commit `4ef27fcd33`)	2026-06-09 11:24:13 +02:00
devnen	985a90b083	feat(fetch): periodic progress stderr line on slow fetches Webclaw's default -t timeout is 30s; slow sites previously sat silently with no feedback. Now during a fetch, every 10s of elapsed time webclaw writes one line to stderr: # webclaw: still fetching <URL> (Ns) Fetches completing in under 10s emit nothing (the timer never fires). Stdout output is untouched - pure feedback signal on stderr. No timeout change. No new flags. Default behavior is augmented at stderr only. Implemented via tokio::select! between the fetch future and a tokio::time::interval. Latency cost: a single tokio task spawn and a 10s tick - microseconds on the fast path. 10 new tests in webclaw-fetch::progress::tests (none ignored; the slow-future test uses a 50ms test interval to keep cargo test fast). Workspace total 710 -> 720. (cherry picked from commit `06f065cb08`)	2026-06-09 11:24:13 +02:00
Valerio	a1abf625a0	build(deps): pin wreq/wreq-util to exact rc versions wreq is a release candidate with no API stability between rc.N builds (rc.29 broke the TLS + Response API). `cargo install` and the release workflow both ignore Cargo.lock and were re-resolving to rc.29, breaking the build. An exact `=6.0.0-rc.28` / `=3.0.0-rc.10` pin keeps every build path deterministic until wreq reaches a stable release.	2026-06-04 19:33:31 +02:00
Valerio	9a63c1a3ca	docs(contributing): describe in-process wreq TLS, drop stale patched-deps The TLS layer moved to wreq (BoringSSL) in-process; there is no longer a [patch.crates-io] section or a separate TLS fork. Update the architecture tree and crate-boundary notes to match.	2026-06-04 17:56:24 +02:00
Valerio	58d274ffe9	style(reddit): use Option::zip to satisfy clippy CI runs clippy with `-D warnings` on a newer toolchain that flags `manual_option_zip`; collapse the and_then/map pair into Option::zip.	2026-06-04 17:48:17 +02:00
Valerio	f6000cba52	chore(release): v0.6.5 Reddit extraction moves from the dead .json API to old.reddit.com HTML.	2026-06-04 17:36:02 +02:00
Valerio	217bfe088b	feat(reddit): parse old.reddit.com HTML instead of the dead .json API Reddit blocked unauthenticated `.json` access, so the previous extractor returned block pages or timed out on every thread. Switch to parsing old.reddit.com's server-rendered HTML, which needs no API key or JS. Fetch layer: - Rewrite every Reddit host to old.reddit.com before fetching; drop all `.json` URL handling and the JSON response parser. Extraction (webclaw-core::reddit): - New HTML parser producing a typed post + nested comment tree. - Comments nest structurally (.comment > .child > .sitetable > .comment); old.reddit omits a usable depth attribute, so the tree is walked recursively. Bodies live in .entry > form > .usertext-body > .md. - Post metadata: title, author, subreddit, score, comment count (data-comments-count), self-vs-link (self class / self.* domain), flair, self-text body. - Comment scores read the .score.unvoted title (the displayed value, not the ±1 vote-state siblings); hidden scores are None, not 0. - Deleted comments are kept in place so their replies aren't orphaned; "load more comments" stubs are skipped. Markdown output: - Reply nesting via blockquote depth (avoids 4-space indentation turning text and code fences into broken indented-code blocks). - Links keep their target as [text](url); root-relative reddit links resolve against old.reddit.com. Nested lists indent correctly. - A recognised but unparseable /comments/ page returns no content rather than falling through to generic extraction of Reddit chrome. Tests: regression suite runs against real old.reddit.com fixtures (testdata/reddit/), the ground truth that surfaced the parsing and markdown bugs synthetic HTML had hidden. Fixtures are excluded from the published crate.	2026-06-04 17:36:02 +02:00
Valerio	3b7d11328e	Add sponsor preview placements	2026-06-04 10:04:32 +02:00
Valerio	363e17d362	docs: add ColdProxy infrastructure partner	2026-05-31 18:35:45 +02:00
Valerio	8fe8bcb479	chore(ci): bump actions/checkout and artifact actions to v5 GitHub flagged checkout@v4 / upload-artifact@v4 / download-artifact@v4 as Node.js 20 actions, force-migrated to Node 24 on 2026-06-02. Bump all nine references to v5 ahead of the deadline. The artifact steps are v5-compatible: upload uses a unique matrix-target name and the download step flattens subdirectories with find afterward.	2026-05-21 15:11:29 +02:00
Valerio	51260ae4e3	chore(release): record v0.6.4 version bump and changelog The v0.6.4 tag shipped the API surface discovery module but the release commit left the workspace version at 0.6.3 with no matching changelog entry. Bump [workspace.package] to 0.6.4 and add the [0.6.4] CHANGELOG section so the code matches the tag.	2026-05-21 12:58:47 +02:00
Valerio	fe567a6af1	feat(core): endpoints module for API surface extraction from HTML and JS (#47) * feat(core): endpoints module — extract API surface from HTML + JS bundles * fix(docker): source CA bundle from distroless instead of apt (fixes arm64 release build) * fix(test): serialize env-mutating CloudClient tests to stop flaky CI * feat(core): filter endpoint-extractor noise (invalid hosts, schema domains, bare paths)	2026-05-19 19:05:16 +02:00
Valerio	be8bcfebd9	fix: harden resource limits, path safety, and WASM build (#46) Security audit follow-up across the workspace: - webclaw-core: keep the crate WASM-safe. quickjs/rquickjs is now a cfg(not(wasm32)) target dependency and the extraction entry point uses a direct call on wasm instead of spawning a thread, so it builds and runs on wasm32 with or without default features. - webclaw-core: bound the structured-data scrubber recursion (depth cap) so deeply nested attacker JSON-LD / __NEXT_DATA__ cannot exhaust the stack. - webclaw-fetch: stream the response body with a running ceiling so a small highly compressed payload cannot inflate to gigabytes in memory; redact user:pass@ from proxy URLs before they reach error strings. - webclaw-cli: contain output filenames inside the chosen directory (reject .. / absolute, drop traversal path segments), run --webhook URLs through the public-URL SSRF guard, clamp --watch-interval to >=1s, and make research slug truncation char-safe. - webclaw-mcp: char-safe slug truncation (no multibyte slice panic). - setup.sh / deploy/hetzner.sh: replace eval on read input with printf -v, and mask auth key / API token in console output. - CI: enforce the wasm32 build invariant for webclaw-core. Tests added for every behavioral change. Bump to 0.6.3 + CHANGELOG.	2026-05-19 17:03:52 +02:00
Valerio	aab51bea91	docs: add workflow examples	2026-05-18 18:56:00 +02:00
Valerio	b75b768ec3	Update Quantum Proxies sponsor copy	2026-05-18 18:50:38 +02:00
Valerio	3fabdc1d02	fix: clean llm output noise Port the valid PR #43 LLM cleanup fixes onto current main without stale branch regressions. Includes comment-count link cleanup, bare numeric paragraph cleanup, pagination leftover cleanup, JSON-LD article body scrubbing, clearer CLI consent-wall warnings, and quieter parser logs by default. Thanks to @devnen for the report and patch work.	2026-05-18 18:39:33 +02:00
Valerio	5eef8358b0	docs: update sponsor partner details	2026-05-18 13:09:02 +02:00
Valerio	7dfd62ec1d	docs: add proxy-seller studio partner	2026-05-18 12:37:28 +02:00
Valerio	6d886c44f6	docs: enlarge studio partner banner	2026-05-18 12:27:11 +02:00
Valerio	8e3ad17428	docs: tighten studio partner layout	2026-05-18 12:23:19 +02:00
Valerio	7321549412	docs: add studio partner section	2026-05-18 12:17:34 +02:00
Valerio	72edb61881	Merge pull request #42 from jal-co/docs/add-community-plugins docs: add community plugins section	2026-05-16 11:24:33 +02:00
Valerio	00d86a12bc	docs: refine community plugin copy	2026-05-16 11:19:15 +02:00
Justin Levine	c8be5214f6	docs: add community plugins section with OpenClaw and Hermes integrations	2026-05-15 17:51:22 -07:00
Valerio	0ea189c5b2	fix(ci): pass repository to release cli	2026-05-12 12:28:14 +02:00
Valerio	a629534490	fix(security): prepare 0.6.1 hardening Merge the 0.6.1 security hardening release candidate after local and CI verification.	2026-05-12 12:16:42 +02:00
Valerio	fd2e75d509	chore(fetch): satisfy clippy for resolver setup	2026-05-12 12:09:18 +02:00
Valerio	e2f89941ac	chore(release): prepare 0.6.1	2026-05-12 12:06:06 +02:00
Valerio	307b4f980d	fix(extractors): harden marketplace host matching	2026-05-12 12:03:43 +02:00
Valerio	dbf9ce08a6	fix(ci): scope release workflow token permissions	2026-05-12 12:00:47 +02:00
Valerio	3bcb288d13	fix(fetch): guard challenge detection before utf8 decoding	2026-05-12 12:00:47 +02:00
Valerio	a611ae26f3	fix(security): harden local fetch surfaces	2026-05-12 12:00:25 +02:00
Valerio	af96628dc9	Revise README for clarity and updated content Updated the README to reflect changes in the project description, banner image size, and various content sections. Enhanced clarity on features and usage.	2026-05-10 22:44:57 +02:00
devnen	e8ca1417d6	Improve --format llm output quality (#37) Improve LLM-format output for modern news and documentation pages. - Filter noisy hydration and low-value page chrome structured data while preserving content-bearing Schema.org records - Fix element/text spacing without detaching punctuation on docs, forums, and reference pages - Remove common accessibility link chrome from LLM text and link labels - Bump workspace version to 0.6.0 and update the changelog Thanks to Nenad Oric (@devnen) for the original PR and contribution.	2026-05-10 15:11:12 +02:00
Valerio	7f75143954	docs: update hosted api trial copy	2026-05-06 17:16:35 +02:00
Valerio	e6a95f783d	chore: bump version to 0.5.9	2026-05-06 11:42:09 +02:00
Valerio	a3aa4bce6f	fix: support LLM provider compatibility options Closes #36	2026-05-06 11:36:53 +02:00
Valerio	86183b11e4	docs: credit Windows release contribution	2026-05-05 11:44:07 +02:00
SURYANSH MISHRA	513b0e493e	ci: add Windows release artifacts Closes #34	2026-05-05 11:38:30 +02:00
Valerio	a1242a1c1d	docs: credit README badge refresh	2026-05-05 11:18:58 +02:00

178 commits
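
Note on the UTF-8 fixes in 499345046c and be8bcfebd9: those commits describe a "char-safe take" for truncating error bodies and slugs, and snapping a byte budget to a UTF-8 char boundary. The sketch below illustrates both ideas using only the Rust standard library. The function names `truncate_chars` and `snap_to_char_boundary` are hypothetical and chosen for illustration; this is not the code from the webclaw crates, only one plausible way to avoid the slice panic the messages mention.

```rust
/// Keep at most `max_chars` characters (not bytes) of `s`.
/// Hypothetical helper; slicing by a byte count such as `&s[..500]`
/// panics when the cut lands inside a multibyte character.
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        // `byte_idx` is the start of the first character to drop,
        // so it is always a valid char boundary.
        Some((byte_idx, _)) => &s[..byte_idx],
        // Fewer than `max_chars` characters: nothing to cut.
        None => s,
    }
}

/// Longest prefix of `s` that fits in `max_bytes` and ends on a char boundary.
/// Hypothetical helper for a byte-budget cut.
fn snap_to_char_boundary(s: &str, max_bytes: usize) -> &str {
    if max_bytes >= s.len() {
        return s;
    }
    let mut end = max_bytes;
    // Walk backwards until the cut no longer splits a character.
    // Index 0 is always a boundary, so the loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

The first helper counts characters, which suits a limit stated in characters (the 500-char error-body cut); the second counts bytes, which suits a scan budget stated in bytes.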