webclaw/crates
Valerio 3c54bea300 perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating)
Rescued from the stale perf/audit-fixes branch — the *perf-only* subset of
that branch's big mixed commit, ported cleanly onto current main with
byte-identical extraction output.

- markdown: hoist the `img[alt]` / `a[href]` selectors out of the per-node
  noise path into `Lazy` statics (stop recompiling them per element).
- extractors: single shared `og()` / `parse_og()` module replaces the
  per-field Open Graph re-scan duplicated across 7 vertical extractors
  (amazon, ebay, ecommerce, etsy, substack, trustpilot, youtube). Each
  vertical now does one pass. Raw-vs-unescaped behaviour preserved exactly.
- core: gate the QuickJS VM on a cheap marker check (skip it entirely when
  the page has no JS-assigned data) and reuse the already-parsed document
  instead of re-parsing the HTML.
- fetch: connection-pool tuning on the wreq client (connect_timeout, idle
  pool, max-idle-per-host, tcp keepalive) for connection reuse.

Output-equivalence is covered by existing tests (amazon quot-entity,
trustpilot title parse, ecommerce/youtube/etsy/substack og fallbacks) — all
green. No new dependencies; no public API change.

Deliberately EXCLUDED from this slice (separate concerns bundled in the
original commit): the `#[non_exhaustive]` API-breaking changes, the LLM/PDF/
server reliability hardening (much already shipped in 0.6.8), the tooling
(cargo-deny, release profile, MSRV), and the retry-loop dedup refactor (a
code-cleanup with no runtime benefit — not worth churning client.rs for).

Original work by the prior author on perf/audit-fixes; this re-applies only
the performance subset onto main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:41:45 +02:00
..
webclaw-cli feat(map): layered URL discovery with bounded crawl fallback 2026-06-17 15:33:49 +02:00
webclaw-core perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating) 2026-06-17 16:41:45 +02:00
webclaw-fetch perf: hot-path extraction speedups (selector hoist, shared og, QuickJS gating) 2026-06-17 16:41:45 +02:00
webclaw-llm feat(llm): add Gemini provider and fix stale Anthropic default model 2026-06-16 15:52:37 +02:00
webclaw-mcp fix(mcp): accept boolean params sent as JSON strings (#62) 2026-06-17 15:37:36 +02:00
webclaw-pdf Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
webclaw-server feat(search): standalone web search via Serper.dev (bring-your-own-key) 2026-06-17 15:10:58 +02:00