omnigraph/research
Claude a1e9f32ee1
pq-l2: bench quality fixes — pre-alloc output, warmup, black_box
Three related fixes from the code-review pass that make the per-query
timing measure kernel work and only kernel work:

1. distance_table API now takes `&mut [f32]` output buffer
   - Old: `fn distance_table(&self, query: &[f32]) -> Vec<f32>`. Every
     call allocated a fresh Vec inside the timed region, so an agent
     change that only reduced allocator pressure (e.g., via
     interior-mutability hacks with RefCell + thread-local scratch) would
     have registered as a "kernel win" when it was just dodging the
     allocator.
   - New: `fn distance_table(&self, query: &[f32], out: &mut [f32])`.
     run_experiment pre-allocates one buffer per workload and reuses it
     across queries. Same for the criterion bench (one scratch buffer per
     bench_function closure). Timing now reflects only the kernel work.
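
A minimal sketch of the caller-provided-buffer pattern described in item 1. Only the `distance_table(&self, query, out)` signature comes from this commit; the `Codebook` struct, its fields, `new`, `table_len`, and `tables_checksum` are hypothetical stand-ins, and the scalar loop is an illustrative squared-L2 table, not the crate's kernel.

```rust
/// Hypothetical stand-in for a PQ codebook: `m` subquantizers with `k`
/// centroids each, `dsub` dimensions per centroid, stored contiguously.
pub struct Codebook {
    m: usize,
    k: usize,
    dsub: usize,
    centroids: Vec<f32>, // len == m * k * dsub
}

impl Codebook {
    pub fn new(m: usize, k: usize, dsub: usize, centroids: Vec<f32>) -> Self {
        assert_eq!(centroids.len(), m * k * dsub);
        Self { m, k, dsub, centroids }
    }

    /// Number of f32 slots the caller must provide in `out`.
    pub fn table_len(&self) -> usize {
        self.m * self.k
    }

    /// Writes the squared L2 distance from every query subvector to every
    /// centroid of its subquantizer into `out`. No allocation happens here.
    pub fn distance_table(&self, query: &[f32], out: &mut [f32]) {
        assert_eq!(query.len(), self.m * self.dsub);
        assert_eq!(out.len(), self.table_len());
        for (sub, q) in query.chunks_exact(self.dsub).enumerate() {
            for c in 0..self.k {
                let start = (sub * self.k + c) * self.dsub;
                let cent = &self.centroids[start..start + self.dsub];
                out[sub * self.k + c] =
                    q.iter().zip(cent).map(|(a, b)| (a - b) * (a - b)).sum();
            }
        }
    }
}

/// Caller side: one scratch buffer per workload, reused for every query.
pub fn tables_checksum(cb: &Codebook, queries: &[Vec<f32>]) -> f32 {
    let mut table = vec![0.0f32; cb.table_len()]; // allocated once, up front
    let mut acc = 0.0f32;
    for q in queries {
        cb.distance_table(q, &mut table); // overwrites the same buffer
        acc += table.iter().sum::<f32>();
    }
    acc
}
```

Because the buffer is fully overwritten on each call, it does not need to be cleared between queries in this sketch.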

2. Warmup query per workload
   - The first query of each (shape × distribution) combo paid cold-cache
     cost on the codes array (1.9 MB for the (768,96,256) shape, which
     exceeds L2 on many laptops) and on the codebook (786 KB at that
     shape). With SPEED_NUM_QUERIES=32 that's a ~3% first-query bias on
     the geomean.
   - run_experiment now does one untimed distance_table + probe_top_k call
     per workload before the timing loop. Black-boxed so it can't be DCE'd.
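
The warmup step in item 2 could look roughly like the sketch below. `time_with_warmup` and its closure parameter are hypothetical names for illustration; the real run_experiment calls distance_table and probe_top_k directly. The closure stands in for "one full query" so the example stays self-contained.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `query_fn` once untimed on the first query to warm caches, then
/// times every query (including the first one again) individually.
pub fn time_with_warmup<Q, R, F>(queries: &[Q], mut query_fn: F) -> Vec<Duration>
where
    F: FnMut(&Q) -> R,
{
    // Untimed warmup. black_box keeps the result observably live so the
    // optimizer cannot treat the call as dead code.
    if let Some(first) = queries.first() {
        black_box(query_fn(black_box(first)));
    }

    queries
        .iter()
        .map(|q| {
            let start = Instant::now();
            let result = query_fn(black_box(q));
            let elapsed = start.elapsed();
            black_box(result);
            elapsed
        })
        .collect()
}
```

With N queries the closure runs N + 1 times, but only N durations are recorded, so the cold-cache pass never enters the statistics.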

3. std::hint::black_box on probe_top_k result in the trial loop
   - The criterion bench already did this; the trial harness (which is the
     load-bearing measurement) did not. Under LTO + opt-level=3, since the
     binary was the only consumer of `_hits`, the optimizer could in
     principle DCE the heap maintenance work. black_box makes the result
     observably live.
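
As a rough illustration of item 3, the sketch below passes the top-k result through `std::hint::black_box` inside a trial loop. The `probe_top_k` signature, the `Hit` type, and `time_trials` are assumptions made for the example; only the `Vec<(u32, f32)>` return type and the use of heap maintenance are taken from this commit message.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Heap entry ordered by distance (then id), so the max-heap root is the
/// current worst candidate.
struct Hit {
    id: u32,
    dist: f32,
}
impl PartialEq for Hit {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Hit {}
impl PartialOrd for Hit {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Hit {
    fn cmp(&self, other: &Self) -> Ordering {
        self.dist.total_cmp(&other.dist).then(self.id.cmp(&other.id))
    }
}

/// Hypothetical scan: sums table entries selected by each vector's `m`
/// codes and keeps the `top_k` smallest distances, nearest first.
pub fn probe_top_k(table: &[f32], codes: &[u8], m: usize, ksub: usize, top_k: usize) -> Vec<(u32, f32)> {
    let mut heap: BinaryHeap<Hit> = BinaryHeap::with_capacity(top_k + 1);
    for (id, code) in codes.chunks_exact(m).enumerate() {
        let dist: f32 = code
            .iter()
            .enumerate()
            .map(|(sub, &c)| table[sub * ksub + c as usize])
            .sum();
        heap.push(Hit { id: id as u32, dist });
        if heap.len() > top_k {
            heap.pop(); // evict the current worst candidate
        }
    }
    heap.into_sorted_vec().into_iter().map(|h| (h.id, h.dist)).collect()
}

/// Trial loop: black_box on inputs and on the result keeps the heap work
/// observable to the optimizer even though nothing else reads the hits.
pub fn time_trials(table: &[f32], codes: &[u8], m: usize, ksub: usize, top_k: usize, trials: usize) -> Vec<Duration> {
    (0..trials)
        .map(|_| {
            let start = Instant::now();
            let hits = probe_top_k(black_box(table), black_box(codes), m, ksub, top_k);
            black_box(hits);
            start.elapsed()
        })
        .collect()
}
```

`black_box` is documented as a best-effort hint rather than a guarantee, which matches the "in principle" framing above: it removes the obvious route to dead-code elimination without proving that the elimination was happening.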

Doc updates:
- crates/pq-l2/program.md: API contract reflects the new signature; the
  obsolete "avoid the Vec alloc in distance_table" prior is replaced with
  a note about reducing probe_top_k's Vec<(u32, f32)> allocation
  (single small alloc per query, real concern once the kernel SIMDs).
- docs/targets/pq-l2.md: API description updated.

Verified:
- cargo build / clippy / test: clean
- baseline trial: correctness pass, exit 0, ~40s wall-clock
- baseline numbers are now higher (slower) than before (geomean 1.35M vs
  prior 880k; (768,96,256) 5.2M vs prior 4.3M) because the prior numbers
  were artificially low: allocator pressure improvements masqueraded as
  kernel improvements, and LTO could in principle DCE heap maintenance.
  The new numbers measure actual kernel work.

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
2026-05-15 01:24:54 +00:00