webclaw/crates/webclaw-core/src
Valerio 3c54bea300 perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating)
Rescued from the stale perf/audit-fixes branch — the *perf-only* subset of
that branch's big mixed commit, ported cleanly onto current main with
byte-identical extraction output.

- markdown: hoist the `img[alt]` / `a[href]` selectors out of the per-node
  noise path into `Lazy` statics (stop recompiling them per element).
- extractors: single shared `og()` / `parse_og()` module replaces the
  per-field Open Graph re-scan duplicated across 7 vertical extractors
  (amazon, ebay, ecommerce, etsy, substack, trustpilot, youtube). Each
  vertical now does one pass. Raw-vs-unescaped behaviour preserved exactly.
- core: gate the QuickJS VM on a cheap marker check (skip it entirely when
  the page has no JS-assigned data) and reuse the already-parsed document
  instead of re-parsing the HTML.
- fetch: connection-pool tuning on the wreq client (connect_timeout, idle
  pool, max-idle-per-host, tcp keepalive) for connection reuse.
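
The selector hoist and the QuickJS gate above follow the same idea: do the
expensive step at most once, or not at all. A minimal sketch of both, using only
the standard library. `CompiledSelector`, `compile`, `JS_DATA_MARKERS` and
`needs_js_eval` are hypothetical names for illustration, the marker strings are
assumptions (the commit does not list them), and `std::sync::OnceLock` stands in
for the `Lazy` statics mentioned above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Hypothetical stand-in for a compiled CSS selector.
struct CompiledSelector {
    pattern: &'static str,
}

impl CompiledSelector {
    // Toy matcher: only understands the `tag[attr]` shape used below.
    fn matches(&self, node: &str) -> bool {
        let Some((tag, rest)) = self.pattern.split_once('[') else {
            return false;
        };
        let attr = rest.trim_end_matches(']');
        node.starts_with(&format!("<{tag}")) && node.contains(&format!("{attr}="))
    }
}

// Counts compilations so the effect of hoisting is observable.
static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);

fn compile(pattern: &'static str) -> CompiledSelector {
    COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
    CompiledSelector { pattern }
}

fn compile_count() -> usize {
    COMPILE_COUNT.load(Ordering::SeqCst)
}

// Hoisted selector: compiled on first use, then shared by every later call.
fn img_alt_selector() -> &'static CompiledSelector {
    static SEL: OnceLock<CompiledSelector> = OnceLock::new();
    SEL.get_or_init(|| compile("img[alt]"))
}

// Per-node path: fetches the shared selector instead of recompiling it.
fn count_images_with_alt(nodes: &[&str]) -> usize {
    nodes
        .iter()
        .filter(|node| img_alt_selector().matches(node))
        .count()
}

// Assumed marker strings; the real list is not given in the commit message.
const JS_DATA_MARKERS: [&str; 2] = ["window.__", "self.__next_f"];

// Cheap substring gate: only pages containing a marker would reach the JS VM.
fn needs_js_eval(html: &str) -> bool {
    JS_DATA_MARKERS.iter().any(|marker| html.contains(marker))
}
```

However many nodes are scanned, `compile_count()` stays at 1 after the first
call, and a page for which `needs_js_eval` returns false never pays the VM
startup cost.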

Output-equivalence is covered by existing tests (amazon quot-entity,
trustpilot title parse, ecommerce/youtube/etsy/substack og fallbacks) — all
green. No new dependencies; no public API change.
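
For illustration, the single-pass Open Graph idea and the kind of raw-value check
the tests above rely on could look roughly like this. It is a sketch, not the
crate's actual `og()` / `parse_og()`: the signatures, the `unescape_basic` helper
and the naive tag scanner (double-quoted attributes only, no `>` inside values)
are all assumptions.

```rust
use std::collections::HashMap;

// Return the value of `name="..."` inside one tag (double-quoted attributes only).
fn attr<'a>(tag: &'a str, name: &str) -> Option<&'a str> {
    let needle = format!("{name}=\"");
    let start = tag.find(&needle)? + needle.len();
    let end = tag[start..].find('"')? + start;
    Some(&tag[start..end])
}

// One pass over the document: collect every `og:*` meta tag into a map.
// Values are stored raw (entities untouched); the first occurrence wins.
fn parse_og(html: &str) -> HashMap<String, String> {
    let mut out = HashMap::new();
    let mut rest = html;
    while let Some(pos) = rest.find("<meta") {
        let after = &rest[pos..];
        let Some(end) = after.find('>') else { break };
        let tag = &after[..end];
        if let (Some(prop), Some(content)) = (attr(tag, "property"), attr(tag, "content")) {
            if let Some(key) = prop.strip_prefix("og:") {
                out.entry(key.to_string())
                    .or_insert_with(|| content.to_string());
            }
        }
        rest = &after[end + 1..];
    }
    out
}

// Field lookup against the already-built map; no rescan of the HTML.
fn og<'a>(map: &'a HashMap<String, String>, key: &str) -> Option<&'a str> {
    map.get(key).map(String::as_str)
}

// Hypothetical helper for callers that want unescaped text.
// `&amp;` is handled last so `&amp;quot;` becomes `&quot;`, not `"`.
fn unescape_basic(s: &str) -> String {
    s.replace("&quot;", "\"")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
}
```

A vertical extractor would call `parse_og` once and then read `og(&map, "title")`,
`og(&map, "image")` and so on from the map, deciding per field whether to apply
`unescape_basic`. Keeping the raw value in the map is one way to leave the
raw-vs-unescaped choice with each caller.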

Deliberately EXCLUDED from this slice (separate concerns bundled in the
original commit): the `#[non_exhaustive]` API-breaking changes, the LLM/PDF/
server reliability hardening (much already shipped in 0.6.8), the tooling
(cargo-deny, release profile, MSRV), and the retry-loop dedup refactor (a
code-cleanup with no runtime benefit — not worth churning client.rs for).

Original work by the prior author on perf/audit-fixes; this re-applies only
the performance subset onto main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:41:45 +02:00
llm fix: harden resource limits, path safety, and WASM build (#46) 2026-05-19 17:03:52 +02:00
brand.rs fix: improve brand extraction signals 2026-05-04 21:25:07 +02:00
data_island.rs feat: SvelteKit data extraction + license change to AGPL-3.0 2026-04-01 20:37:56 +02:00
diff.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
domain.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
endpoints.rs fix: harden LLM providers, UTF-8 handling, and webhook/batch reliability 2026-06-09 21:10:15 +02:00
error.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
extractor.rs style: cargo fmt 2026-04-17 12:03:22 +02:00
js_eval.rs perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating) 2026-06-17 16:41:45 +02:00
lib.rs perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating) 2026-06-17 16:41:45 +02:00
markdown.rs perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating) 2026-06-17 16:41:45 +02:00
metadata.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
noise.rs chore: bump to 0.3.9, fix formatting from #14 2026-04-04 15:24:17 +02:00
reddit.rs style(reddit): use Option::zip to satisfy clippy 2026-06-04 17:48:17 +02:00
structured_data.rs fix: harden LLM providers, UTF-8 handling, and webhook/batch reliability 2026-06-09 21:10:15 +02:00
types.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
youtube.rs feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra 2026-03-25 18:50:07 +01:00