omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-24 02:38:06 +02:00

Claude a1e9f32ee1 pq-l2: bench quality fixes — pre-alloc output, warmup, black_box	2026-05-15 01:24:54 +00:00
lance-autoresearch	pq-l2: bench quality fixes — pre-alloc output, warmup, black_box	2026-05-15 01:24:54 +00:00

pq-l2: bench quality fixes — pre-alloc output, warmup, black_box

Three related fixes from the code-review pass that make the per-query
timing measure kernel work and only kernel work:

1. distance_table API now takes `&mut [f32]` output buffer
   - Old: `fn distance_table(&self, query: &[f32]) -> Vec<f32>` — every
     call allocated a fresh Vec inside the timed region. An agent that
     reduced allocator pressure (e.g., via interior-mutability hacks with
     RefCell + thread-local scratch) would have shown up as a "kernel win"
     when it was actually just dodging the allocator.
   - New: `fn distance_table(&self, query: &[f32], out: &mut [f32])`.
     run_experiment pre-allocates one buffer per workload and reuses it
     across queries. Same for the criterion bench (one scratch buffer per
     bench_function closure). Timing now reflects only the kernel work.
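A minimal sketch of the caller-owned-buffer pattern described in item 1. The `Codebook` struct, its field names, the centroid layout, and the `run_queries` helper are assumptions made for illustration; only the `distance_table(&self, query: &[f32], out: &mut [f32])` signature comes from this commit.

```rust
/// Hypothetical product-quantization codebook; field names are assumptions.
pub struct Codebook {
    pub num_subspaces: usize, // M
    pub num_centroids: usize, // K
    pub sub_dim: usize,       // dim / M
    /// Centroids laid out as [M][K][sub_dim], row-major (assumed layout).
    pub centroids: Vec<f32>,
}

impl Codebook {
    /// Fills `out` (length M * K) with squared L2 distances from each query
    /// sub-vector to each centroid. Nothing is allocated here.
    pub fn distance_table(&self, query: &[f32], out: &mut [f32]) {
        assert_eq!(query.len(), self.num_subspaces * self.sub_dim);
        assert_eq!(out.len(), self.num_subspaces * self.num_centroids);
        for m in 0..self.num_subspaces {
            let q = &query[m * self.sub_dim..(m + 1) * self.sub_dim];
            for k in 0..self.num_centroids {
                let start = (m * self.num_centroids + k) * self.sub_dim;
                let c = &self.centroids[start..start + self.sub_dim];
                out[m * self.num_centroids + k] = q
                    .iter()
                    .zip(c)
                    .map(|(a, b)| (a - b) * (a - b))
                    .sum::<f32>();
            }
        }
    }
}

/// Caller side (hypothetical helper): one buffer per workload, reused
/// across queries. Returns a checksum so the work has an observable result.
pub fn run_queries(cb: &Codebook, queries: &[Vec<f32>]) -> f32 {
    let mut table = vec![0.0f32; cb.num_subspaces * cb.num_centroids];
    let mut checksum = 0.0f32;
    for q in queries {
        cb.distance_table(q, &mut table); // overwrites in place, no realloc
        checksum += table[0];
    }
    checksum
}
```

With this shape the only allocation sits outside the per-query loop, so a timer wrapped around `distance_table` would not see allocator cost.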

2. Warmup query per workload
   - The first query of each (shape × distribution) combo paid cold-cache
     cost on the codes array (1.9 MB for the (768,96,256) shape, exceeds
     L2 on many laptops) and on the codebook (786 KB at that shape). With
     SPEED_NUM_QUERIES=32 that's a ~3% first-query bias on the geomean.
   - run_experiment now does one untimed distance_table + probe_top_k call
     per workload before the timing loop. Black-boxed so it can't be DCE'd.
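The warmup step in item 2 could look roughly like the sketch below. `time_workload` and its closure parameter are hypothetical stand-ins for run_experiment and the distance_table + probe_top_k pair; the actual harness may be structured differently.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `kernel` once untimed on the first query, then times every query.
/// `kernel` stands in for distance_table + probe_top_k (an assumption).
pub fn time_workload<Q, R>(
    queries: &[Q],
    mut kernel: impl FnMut(&Q) -> R,
) -> Vec<Duration> {
    // Warmup: touches the same data as the timed queries so the first timed
    // sample does not pay the cold-cache cost. black_box keeps the call
    // alive even though its result is discarded.
    if let Some(first) = queries.first() {
        black_box(kernel(black_box(first)));
    }

    let mut samples = Vec::with_capacity(queries.len());
    for q in queries {
        let start = Instant::now();
        let result = kernel(black_box(q));
        black_box(result); // keep the result observably live
        samples.push(start.elapsed());
    }
    samples
}
```

Note that the warmup reuses the first query, so that query runs twice: once untimed and once timed.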

3. std::hint::black_box on probe_top_k result in the trial loop
   - The criterion bench already did this; the trial harness (which is the
     load-bearing measurement) did not. Under LTO + opt-level=3, since the
     binary was the only consumer of `_hits`, the optimizer could in
     principle DCE the heap maintenance work. black_box makes the result
     observably live.
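To illustrate item 3, here is a sketch of a trial loop that passes the probe result through `std::hint::black_box`. The `probe_top_k` signature and body below are assumptions (a sorted-insert top-k over table lookups, not the crate's heap-based kernel), and `trial_loop` is a hypothetical name; `m` is assumed to be non-zero.

```rust
use std::hint::black_box;

/// Hypothetical stand-in for probe_top_k: sums table lookups for each code
/// row and keeps the k smallest distances, sorted ascending.
pub fn probe_top_k(
    codes: &[u8],
    m: usize,
    num_centroids: usize,
    table: &[f32],
    k: usize,
) -> Vec<(u32, f32)> {
    let mut best: Vec<(u32, f32)> = Vec::with_capacity(k + 1);
    for (row, code) in codes.chunks_exact(m).enumerate() {
        let dist: f32 = code
            .iter()
            .enumerate()
            .map(|(sub, &c)| table[sub * num_centroids + c as usize])
            .sum();
        // Insert at the sorted position, then drop the worst entry if the
        // list grew past k.
        let pos = best.partition_point(|&(_, d)| d <= dist);
        if pos < k {
            best.insert(pos, (row as u32, dist));
            best.truncate(k);
        }
    }
    best
}

/// Trial loop: the result goes through black_box so the optimizer has to
/// treat the top-k bookkeeping as used. Returns the number of queries run.
pub fn trial_loop(
    codes: &[u8],
    m: usize,
    num_centroids: usize,
    tables: &[Vec<f32>],
    k: usize,
) -> usize {
    let mut queries_run = 0;
    for table in tables {
        let hits = probe_top_k(black_box(codes), m, num_centroids, black_box(table), k);
        black_box(hits);
        queries_run += 1;
    }
    queries_run
}
```

Without the `black_box(hits)` call, nothing in this loop reads `hits`, which is the situation the commit describes for `_hits` in the old harness.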

Doc updates:
- crates/pq-l2/program.md: API contract reflects the new signature; the
  obsolete "avoid the Vec alloc in distance_table" prior is replaced with
  a note about reducing probe_top_k's Vec<(u32, f32)> allocation
  (a single small alloc per query, which becomes a real concern once the
  kernel is SIMD-vectorized).
- docs/targets/pq-l2.md: API description updated.

Verified:
- cargo build / clippy / test: clean
- baseline trial: correctness pass, exit 0, ~40s wall-clock
- baseline numbers are now higher, i.e. slower, than before (geomean
  1.35M vs prior 880k; (768,96,256) 5.2M vs prior 4.3M) because the
  prior numbers were artificially low: allocator pressure improvements
  masqueraded as kernel improvements, and LTO could in principle DCE
  heap maintenance. The new numbers measure actual kernel work.
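For reference, a geomean like the one quoted above can be computed as the exponential of the mean of logs. This is a sketch under the assumption that the harness aggregates strictly positive per-workload timings this way; the function name is hypothetical and the commit does not show the actual aggregation code.

```rust
/// Geometric mean via the mean of natural logs.
/// Assumes `samples` is non-empty and every sample is > 0.
pub fn geomean(samples: &[f64]) -> f64 {
    assert!(!samples.is_empty(), "geomean of an empty slice is undefined");
    let log_sum: f64 = samples.iter().map(|s| s.ln()).sum();
    (log_sum / samples.len() as f64).exp()
}
```

Because each sample contributes equally in log space, one inflated sample out of 32 carries about a 1/32 weight, which is where a figure of roughly 3% comes from.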

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
