webclaw/crates/webclaw-core/src
webclaw 02302e7a1d perf(core): hot-path extraction speedups + senior-grade hardening
Extraction ~22% faster on the corpus benchmark with byte-identical output:
- hoist recompiled CSS selectors in the markdown noise path
- single-pass shared og() meta parsing across vertical extractors
- output-safe QuickJS gating (skip the JS VM when no candidate data) +
  reuse the already-parsed document instead of re-parsing
- wreq connect_timeout + connection-pool tuning; dedup the retry loop

Reliability + correctness:
- char-boundary-safe truncation of LLM error bodies (shared helper)
- HTTP connect/read timeouts on all LLM provider clients
- isolate pdf-extract behind catch_unwind + spawn_blocking
- OSS server: crawl inherits the shared fetch profile; ProviderChain built
  once in AppState; request TimeoutLayer

API / safety / docs:
- #[non_exhaustive] on public enums + result structs (+ builders)
- #![forbid(unsafe_code)] on pure crates, deny on llm
- //! crate docs + doctests; scrub bypass/vendor/target specifics from
  public crate docs and comments

Tooling: [profile.release] lto/codegen-units/strip, MSRV pin, deny.toml +
cargo-deny CI, macOS test matrix. CLI main.rs split into focused modules.
2026-06-04 20:22:00 +02:00
..
llm perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
brand.rs fix: improve brand extraction signals 2026-05-04 21:25:07 +02:00
data_island.rs feat: SvelteKit data extraction + license change to AGPL-3.0 2026-04-01 20:37:56 +02:00
diff.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
domain.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
endpoints.rs feat(core): endpoints module for API surface extraction from HTML and JS (#47) 2026-05-19 19:05:16 +02:00
error.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
extractor.rs style: cargo fmt 2026-04-17 12:03:22 +02:00
js_eval.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
lib.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
markdown.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
metadata.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
noise.rs chore: bump to 0.3.9, fix formatting from #14 2026-04-04 15:24:17 +02:00
reddit.rs style(reddit): use Option::zip to satisfy clippy 2026-06-04 17:48:17 +02:00
structured_data.rs fix: handle raw newlines in JSON-LD strings 2026-04-16 11:40:25 +02:00
types.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
youtube.rs feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra 2026-03-25 18:50:07 +01:00