webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-09 22:35:12 +02:00

webclaw 02302e7a1d perf(core): hot-path extraction speedups + senior-grade hardening	2026-06-04 20:22:00 +02:00

Extraction ~22% faster on the corpus benchmark with byte-identical output:
- hoist recompiled CSS selectors in the markdown noise path
- single-pass shared og() meta parsing across vertical extractors
- output-safe QuickJS gating (skip the JS VM when no candidate data) + reuse the already-parsed document instead of re-parsing
- wreq connect_timeout + connection-pool tuning; dedup the retry loop

Reliability + correctness:
- char-boundary-safe truncation of LLM error bodies (shared helper)
- HTTP connect/read timeouts on all LLM provider clients
- isolate pdf-extract behind catch_unwind + spawn_blocking
- OSS server: crawl inherits the shared fetch profile; ProviderChain built once in AppState; request TimeoutLayer

API / safety / docs:
- #[non_exhaustive] on public enums + result structs (+ builders)
- #![forbid(unsafe_code)] on pure crates, deny on llm
- //! crate docs + doctests; scrub bypass/vendor/target specifics from public crate docs and comments

Tooling: [profile.release] lto/codegen-units/strip, MSRV pin, deny.toml + cargo-deny CI, macOS test matrix.
CLI main.rs split into focused modules.
..
src	perf(core): hot-path extraction speedups + senior-grade hardening	2026-06-04 20:22:00 +02:00
testdata	feat(reddit): parse old.reddit.com HTML instead of the dead .json API	2026-06-04 17:36:02 +02:00
Cargo.toml	perf(core): hot-path extraction speedups + senior-grade hardening	2026-06-04 20:22:00 +02:00
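
The commit message above mentions "char-boundary-safe truncation of LLM error bodies (shared helper)". The helper itself is not shown on this page, so the following is only a minimal sketch of the general technique, not the repository's code. The function name `truncate_at_char_boundary` and its signature are hypothetical. The idea is that slicing a `&str` at an arbitrary byte offset panics if the offset falls inside a multi-byte UTF-8 character, so the cut point is walked back to the nearest valid boundary first.

```rust
/// Hypothetical helper (name and signature are assumptions, not taken from
/// the webclaw source): returns the longest prefix of `s` that is at most
/// `max_bytes` bytes long and ends on a UTF-8 character boundary.
pub fn truncate_at_char_boundary(s: &str, max_bytes: usize) -> &str {
    // Nothing to do if the string already fits.
    if s.len() <= max_bytes {
        return s;
    }
    // Walk back from the byte limit until we land on a boundary.
    // Index 0 is always a boundary, so the loop terminates.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

A caller would typically apply something like this to an error response body before logging it or embedding it in an error value, so that an oversized or non-ASCII body cannot trigger a slicing panic.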