webclaw

apunkt/webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-13 23:15:13 +02:00

Commit graph

Author	SHA1	Message	Date

webclaw

02302e7a1d

perf(core): hot-path extraction speedups + senior-grade hardening

Extraction ~22% faster on the corpus benchmark with byte-identical output:
- hoist recompiled CSS selectors in the markdown noise path
- single-pass shared og() meta parsing across vertical extractors
- output-safe QuickJS gating (skip the JS VM when no candidate data) +
  reuse the already-parsed document instead of re-parsing
- wreq connect_timeout + connection-pool tuning; dedup the retry loop
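The first bullet above (hoisting recompiled selectors) can be illustrated with a minimal sketch. This is not webclaw's actual code: `CompiledSelector`, `compile`, `noise_selector`, and `count_noise_hits` are hypothetical names, and the "selector" is a plain string stand-in rather than a real CSS selector type. The idea shown is only that the compile step moves into a `OnceLock`, so the hot path reuses one instance.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for a parsed CSS selector; not webclaw's real type.
#[derive(Debug)]
pub struct CompiledSelector {
    pub source: &'static str,
}

/// Counts how often the (pretend-expensive) compile step runs.
pub static COMPILE_CALLS: AtomicUsize = AtomicUsize::new(0);

fn compile(source: &'static str) -> CompiledSelector {
    COMPILE_CALLS.fetch_add(1, Ordering::SeqCst);
    CompiledSelector { source }
}

/// Returns the noise selector, compiling it on first use only.
pub fn noise_selector() -> &'static CompiledSelector {
    static SELECTOR: OnceLock<CompiledSelector> = OnceLock::new();
    SELECTOR.get_or_init(|| compile("nav, footer, aside"))
}

/// Hot path: may run once per document, but never recompiles the selector.
pub fn count_noise_hits(tags: &[&str]) -> usize {
    let selector = noise_selector();
    tags.iter()
        .filter(|tag| selector.source.split(", ").any(|s| s == **tag))
        .count()
}
```

Calling `count_noise_hits` any number of times leaves `COMPILE_CALLS` at 1, which is the property the hoist is meant to provide.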

Reliability + correctness:
- char-boundary-safe truncation of LLM error bodies (shared helper)
- HTTP connect/read timeouts on all LLM provider clients
- isolate pdf-extract behind catch_unwind + spawn_blocking
- OSS server: crawl inherits the shared fetch profile; ProviderChain built
  once in AppState; request TimeoutLayer
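As a rough illustration of the char-boundary-safe truncation mentioned in the list above: slicing a `&str` at an arbitrary byte offset panics when the offset falls inside a multi-byte UTF-8 character. The helper below is a sketch under assumptions; its name and signature are hypothetical and not taken from the commit.

```rust
/// Truncates `body` to at most `max_bytes` bytes without splitting a UTF-8
/// character. Hypothetical helper; the name and signature are assumptions.
pub fn truncate_on_char_boundary(body: &str, max_bytes: usize) -> &str {
    if body.len() <= max_bytes {
        return body;
    }
    let mut end = max_bytes;
    // Walk back until `end` sits on a character boundary (index 0 always is).
    while !body.is_char_boundary(end) {
        end -= 1;
    }
    &body[..end]
}
```

For example, truncating `"héllo"` to 2 bytes yields `"h"` instead of panicking, because byte 2 lies inside the two-byte `é`.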

API / safety / docs:
- #[non_exhaustive] on public enums + result structs (+ builders)
- #![forbid(unsafe_code)] on pure crates, deny on llm
- //! crate docs + doctests; scrub bypass/vendor/target specifics from
  public crate docs and comments

Tooling: [profile.release] lto/codegen-units/strip, MSRV pin, deny.toml +
cargo-deny CI, macOS test matrix. CLI main.rs split into focused modules.

2026-06-04 20:22:00 +02:00

Valerio

1352f48e05

fix(cli): close --on-change command injection via sh -c (P0) (#20)

* fix(cli): close --on-change command injection via sh -c (P0)

The --on-change flag on `webclaw watch` (single-URL, line 1588) and
`webclaw watch` multi-URL mode (line 1738) previously handed the entire
user-supplied string to `tokio::process::Command::new("sh").arg("-c").arg(cmd)`.
Any path that can influence that string — a malicious config file, an MCP
client driven by an LLM with prompt-injection exposure, an untrusted
environment variable substitution — gets arbitrary shell execution.

The command is now tokenized with `shlex::split` (POSIX-ish quoting rules)
and executed directly via `Command::new(prog).args(args)`. Metacharacters
like `;`, `&&`, `|`, `$()`, `<(...)`, env expansion, and globbing no longer
fire.

An explicit opt-in escape hatch is available for users who genuinely need
a shell pipeline: `WEBCLAW_ALLOW_SHELL=1` preserves the old `sh -c` path
and logs a warning on every invocation so it can't slip in silently.

Both call sites now route through a shared `spawn_on_change()` helper.
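The tokenize-then-exec approach described above can be sketched with the standard library alone. This is not the body of `spawn_on_change()`: `split_words` is a simplified stand-in for `shlex::split` (quotes only, no backslash escapes), and `build_on_change_command` is a hypothetical name. The sketch also omits the `WEBCLAW_ALLOW_SHELL=1` escape hatch.

```rust
use std::process::Command;

/// Simplified stand-in for `shlex::split`: splits on whitespace and honors
/// single and double quotes, with no backslash escapes. Returns `None` on an
/// unterminated quote. Illustrative only; the real crate covers more cases.
pub fn split_words(input: &str) -> Option<Vec<String>> {
    let mut words = Vec::new();
    let mut current = String::new();
    let mut in_word = false;
    let mut quote: Option<char> = None;

    for ch in input.chars() {
        match (quote, ch) {
            // Inside quotes: everything is literal until the matching quote.
            (Some(q), c) if c == q => quote = None,
            (Some(_), c) => current.push(c),
            (None, '\'' | '"') => {
                quote = Some(ch);
                in_word = true;
            }
            (None, c) if c.is_whitespace() => {
                if in_word {
                    words.push(std::mem::take(&mut current));
                    in_word = false;
                }
            }
            (None, c) => {
                current.push(c);
                in_word = true;
            }
        }
    }
    if quote.is_some() {
        return None;
    }
    if in_word {
        words.push(current);
    }
    Some(words)
}

/// Builds a command that runs the program directly, with no shell in between.
pub fn build_on_change_command(cmd: &str) -> Option<Command> {
    let mut words = split_words(cmd)?.into_iter();
    let program = words.next()?;
    let mut command = Command::new(program);
    command.args(words);
    Some(command)
}
```

With this shape, an input such as `echo hi; id` becomes the program `echo` with the literal arguments `hi;` and `id`. The `;` is passed as data and no second command runs, because no shell ever interprets the string.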

Adds `shlex = "1"` to webclaw-cli dependencies.

Version: 0.3.13 -> 0.3.14
CHANGELOG updated.

Surfaced by the 2026-04-16 workspace audit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore(brand): fix clippy 1.95 unnecessary_sort_by errors

Pre-existing sort_by calls in brand.rs became hard errors under clippy
1.95. Switch to sort_by_key with std::cmp::Reverse. Pure refactor — same
ordering, no behavior change. Bundled here so CI goes green on the P0
command-injection fix.
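The refactor described above can be sketched as a before/after pair. The `ColorCount` type and both function names are hypothetical and not taken from brand.rs; only the `sort_by` to `sort_by_key` plus `std::cmp::Reverse` pattern comes from the commit message.

```rust
use std::cmp::Reverse;

/// Hypothetical record type; not taken from brand.rs.
#[derive(Debug, Clone, PartialEq)]
pub struct ColorCount {
    pub hex: &'static str,
    pub count: usize,
}

/// Before: descending order via a comparator closure.
pub fn sort_desc_with_sort_by(items: &mut [ColorCount]) {
    items.sort_by(|a, b| b.count.cmp(&a.count));
}

/// After: same descending order, expressed as a key wrapped in `Reverse`.
pub fn sort_desc_with_sort_by_key(items: &mut [ColorCount]) {
    items.sort_by_key(|item| Reverse(item.count));
}
```

Both methods are stable sorts, so elements with equal counts keep their original relative order in either version, which is consistent with the "same ordering, no behavior change" claim.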

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-16 18:37:02 +02:00

Valerio

c99ec684fa

Initial release: webclaw v0.1.0 — web content extraction for LLMs

CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io

2026-03-23 18:31:11 +01:00

3 commits