webclaw/crates/webclaw-fetch/src
Valerio 3c54bea300 perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating)
Rescued from the stale perf/audit-fixes branch — the *perf-only* subset of
that branch's big mixed commit, ported cleanly onto current main with
byte-identical extraction output.

- markdown: hoist the `img[alt]` / `a[href]` selectors out of the per-node
  noise path into `Lazy` statics (stop recompiling them per element).
- extractors: single shared `og()` / `parse_og()` module replaces the
  per-field Open Graph re-scan duplicated across 7 vertical extractors
  (amazon, ebay, ecommerce, etsy, substack, trustpilot, youtube). Each
  vertical now does one pass. Raw-vs-unescaped behaviour preserved exactly.
- core: gate the QuickJS VM on a cheap marker check (skip it entirely when
  the page has no JS-assigned data) and reuse the already-parsed document
  instead of re-parsing the HTML.
- fetch: connection-pool tuning on the wreq client (connect_timeout, idle
  pool, max-idle-per-host, tcp keepalive) for connection reuse.

Output-equivalence is covered by existing tests (amazon quot-entity,
trustpilot title parse, ecommerce/youtube/etsy/substack og fallbacks) — all
green. No new dependencies; no public API change.

Deliberately EXCLUDED from this slice (separate concerns bundled in the
original commit): the `#[non_exhaustive]` API-breaking changes, the LLM/PDF/
server reliability hardening (much already shipped in 0.6.8), the tooling
(cargo-deny, release profile, MSRV), and the retry-loop dedup refactor (a
code-cleanup with no runtime benefit — not worth churning client.rs for).

Original work by the prior author on perf/audit-fixes; this re-applies only
the performance subset onto main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:41:45 +02:00
..
extractors perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating) 2026-06-17 16:41:45 +02:00
browser.rs Bump to 0.5.4: SafariIos profile + Chrome fingerprint alignment + locale helper 2026-04-23 12:58:24 +02:00
client.rs feat(map): layered URL discovery with bounded crawl fallback 2026-06-17 15:33:49 +02:00
cloud.rs feat(core): endpoints module for API surface extraction from HTML and JS (#47) 2026-05-19 19:05:16 +02:00
crawler.rs feat(map): layered URL discovery with bounded crawl fallback 2026-06-17 15:33:49 +02:00
document.rs feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3 2026-04-01 18:04:55 +02:00
error.rs feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3 2026-04-01 18:04:55 +02:00
fetcher.rs feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend 2026-04-22 21:17:50 +02:00
lib.rs feat(map): layered URL discovery with bounded crawl fallback 2026-06-17 15:33:49 +02:00
linkedin.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
locale.rs Bump to 0.5.4: SafariIos profile + Chrome fingerprint alignment + locale helper 2026-04-23 12:58:24 +02:00
map.rs feat(map): layered URL discovery with bounded crawl fallback 2026-06-17 15:33:49 +02:00
progress.rs style: apply rustfmt to salvaged #49 commits 2026-06-09 11:24:13 +02:00
proxy.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
reddit.rs feat(reddit): parse old.reddit.com HTML instead of the dead .json API 2026-06-04 17:36:02 +02:00
search.rs feat(search): standalone web search via Serper.dev (bring-your-own-key) 2026-06-17 15:10:58 +02:00
sitemap.rs feat(map): layered URL discovery with bounded crawl fallback 2026-06-17 15:33:49 +02:00
tls.rs perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating) 2026-06-17 16:41:45 +02:00
url_security.rs fix(security): harden local fetch surfaces 2026-05-12 12:00:25 +02:00