webclaw/crates/webclaw-fetch/src/extractors
webclaw 02302e7a1d perf(core): hot-path extraction speedups + senior-grade hardening
Extraction ~22% faster on the corpus benchmark with byte-identical output:
- hoist recompiled CSS selectors in the markdown noise path
- single-pass shared og() meta parsing across vertical extractors
- output-safe QuickJS gating (skip the JS VM when no candidate data) +
  reuse the already-parsed document instead of re-parsing
- wreq connect_timeout + connection-pool tuning; dedup the retry loop

Reliability + correctness:
- char-boundary-safe truncation of LLM error bodies (shared helper)
- HTTP connect/read timeouts on all LLM provider clients
- isolate pdf-extract behind catch_unwind + spawn_blocking
- OSS server: crawl inherits the shared fetch profile; ProviderChain built
  once in AppState; request TimeoutLayer

API / safety / docs:
- #[non_exhaustive] on public enums + result structs (+ builders)
- #![forbid(unsafe_code)] on pure crates, deny on llm
- //! crate docs + doctests; scrub bypass/vendor/target specifics from
  public crate docs and comments

Tooling: [profile.release] lto/codegen-units/strip, MSRV pin, deny.toml +
cargo-deny CI, macOS test matrix. CLI main.rs split into focused modules.
2026-06-04 20:22:00 +02:00
..
amazon_product.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
arxiv.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
crates_io.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
dev_to.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
docker_hub.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
ebay_listing.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
ecommerce_product.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
etsy_listing.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
github_issue.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
github_pr.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
github_release.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
github_repo.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
hackernews.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
huggingface_dataset.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
huggingface_model.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
instagram_post.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
instagram_profile.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
linkedin_post.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
mod.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
npm.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
og.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
pypi.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
reddit.rs feat(reddit): parse old.reddit.com HTML instead of the dead .json API 2026-06-04 17:36:02 +02:00
shopify_collection.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
shopify_product.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
stackoverflow.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
substack_post.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
trustpilot_reviews.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00
woocommerce_product.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
youtube_video.rs perf(core): hot-path extraction speedups + senior-grade hardening 2026-06-04 20:22:00 +02:00