webclaw/crates/webclaw-pdf/Cargo.toml at perf/audit-fixes - apunkt/webclaw

apunkt/webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-15 23:35:14 +02:00

webclaw 02302e7a1d perf(core): hot-path extraction speedups + senior-grade hardening

Extraction ~22% faster on the corpus benchmark with byte-identical output:
- hoist recompiled CSS selectors in the markdown noise path
- single-pass shared og() meta parsing across vertical extractors
- output-safe QuickJS gating (skip the JS VM when no candidate data) +
  reuse the already-parsed document instead of re-parsing
- wreq connect_timeout + connection-pool tuning; dedup the retry loop

Reliability + correctness:
- char-boundary-safe truncation of LLM error bodies (shared helper)
- HTTP connect/read timeouts on all LLM provider clients
- isolate pdf-extract behind catch_unwind + spawn_blocking
- OSS server: crawl inherits the shared fetch profile; ProviderChain built
  once in AppState; request TimeoutLayer
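
A minimal sketch of what "char-boundary-safe truncation" can look like, using only the standard library. The function name `truncate_on_char_boundary` and the byte-based limit are assumptions for illustration; the commit's actual shared helper is not shown here and may differ.

```rust
/// Truncate `s` to at most `max_bytes` bytes without splitting a UTF-8 character.
/// Hypothetical helper: name and signature are illustrative, not taken from webclaw.
fn truncate_on_char_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Walk back from the byte limit until we land on a character boundary.
    // Index 0 is always a boundary, so the loop terminates.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

Slicing with a plain `&s[..max_bytes]` panics when the limit falls inside a multi-byte character, which is easy to hit with error bodies that contain non-ASCII text; backing off to the previous boundary avoids that at the cost of returning slightly fewer bytes.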

API / safety / docs:
- #[non_exhaustive] on public enums + result structs (+ builders)
- #![forbid(unsafe_code)] on pure crates, deny on llm
- //! crate docs + doctests; scrub bypass/vendor/target specifics from
  public crate docs and comments

Tooling: [profile.release] lto/codegen-units/strip, MSRV pin, deny.toml +
cargo-deny CI, macOS test matrix. CLI main.rs split into focused modules.

2026-06-04 20:22:00 +02:00

15 lines

310 B

TOML


[package]
name = "webclaw-pdf"
description = "PDF text extraction for webclaw"
version.workspace = true
edition.workspace = true
rust-version.workspace = true
license.workspace = true

[lints]
workspace = true

[dependencies]
pdf-extract = "0.7"
thiserror = { workspace = true }
tracing = { workspace = true }
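
The commit message mentions isolating pdf-extract behind `catch_unwind` + `spawn_blocking`. The sketch below illustrates only the `catch_unwind` half with the standard library. `fragile_parse`, `extract_isolated`, and `ExtractError` are hypothetical stand-ins, not the crate's real API, and the real code presumably calls into pdf-extract rather than this toy parser.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical error type for this sketch (the real crate depends on `thiserror`).
#[derive(Debug, PartialEq)]
enum ExtractError {
    Panicked(String),
}

/// Hypothetical stand-in for a third-party parser that may panic on bad input.
fn fragile_parse(bytes: &[u8]) -> String {
    if !bytes.starts_with(b"%PDF") {
        panic!("not a PDF header");
    }
    String::from_utf8_lossy(bytes).into_owned()
}

/// Run the parser and convert a panic into an ordinary error value.
/// Note: this only works when panics unwind (not with `panic = "abort"`),
/// and the default panic hook still prints the message to stderr.
fn extract_isolated(bytes: &[u8]) -> Result<String, ExtractError> {
    panic::catch_unwind(AssertUnwindSafe(|| fragile_parse(bytes))).map_err(|payload| {
        // Panic payloads are usually `&'static str` or `String`.
        let msg = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string());
        ExtractError::Panicked(msg)
    })
}
```

In an async server, a wrapper like this would typically be run inside `tokio::task::spawn_blocking` so that CPU-bound parsing does not stall the executor; that part is omitted here to keep the sketch free of external dependencies.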