Three related fixes from the code-review pass that make the per-query
timing measure kernel work and only kernel work:
1. distance_table API now takes a `&mut [f32]` output buffer
- Old: `fn distance_table(&self, query: &[f32]) -> Vec<f32>`. Every
call allocated a fresh Vec inside the timed region, so an agent that
reduced allocator pressure (e.g., via interior-mutability hacks with
RefCell + thread-local scratch) would have shown up as a "kernel win"
when it was actually just dodging the allocator.
- New: `fn distance_table(&self, query: &[f32], out: &mut [f32])`.
run_experiment pre-allocates one buffer per workload and reuses it
across queries. Same for the criterion bench (one scratch buffer per
bench_function closure). Timing now reflects only the kernel work.
2. Warmup query per workload
- The first query of each (shape × distribution) combo paid cold-cache
cost on the codes array (1.9 MB for the (768,96,256) shape, which exceeds
L2 on many laptops) and on the codebook (786 KB at that shape). With
SPEED_NUM_QUERIES=32 the cold query is 1 of 32 samples, a ~3%
first-query bias on the geomean.
- run_experiment now does one untimed distance_table + probe_top_k call
per workload before the timing loop. The result is black-boxed so the
optimizer can't dead-code-eliminate (DCE) the warmup.
3. std::hint::black_box on probe_top_k result in the trial loop
- The criterion bench already did this; the trial harness (which is the
load-bearing measurement) did not. Under LTO + opt-level=3, since the
binary was the only consumer of `_hits`, the optimizer could in
principle DCE the heap maintenance work. black_box makes the result
observably live.
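
The three fixes above can be sketched together as follows. This is a minimal
illustration, not the actual harness: the `PqIndex` struct, its field names and
layout, the `probe_top_k` signature, and the `time_workload` helper are all
assumptions made for the example, and the top-k step uses a plain sort rather
than the heap the real kernel maintains.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the PQ index; fields and layout are assumptions.
struct PqIndex {
    m: usize,           // number of subspaces
    k: usize,           // centroids per subspace
    dsub: usize,        // dimensions per subspace
    codebook: Vec<f32>, // m * k * dsub floats, subspace-major
    codes: Vec<u8>,     // n * m bytes, one centroid id per subspace
}

impl PqIndex {
    /// Writes squared L2 distances from each query sub-vector to each
    /// centroid into the caller-provided `out` (length m * k). No allocation.
    fn distance_table(&self, query: &[f32], out: &mut [f32]) {
        assert_eq!(query.len(), self.m * self.dsub);
        assert_eq!(out.len(), self.m * self.k);
        for s in 0..self.m {
            let q = &query[s * self.dsub..(s + 1) * self.dsub];
            for c in 0..self.k {
                let base = (s * self.k + c) * self.dsub;
                let cent = &self.codebook[base..base + self.dsub];
                out[s * self.k + c] = q
                    .iter()
                    .zip(cent)
                    .map(|(a, b)| (a - b) * (a - b))
                    .sum::<f32>();
            }
        }
    }

    /// Assumed signature. Sums table lookups per encoded vector and keeps
    /// the `top_k` smallest (sort-based here for brevity, not a heap).
    fn probe_top_k(&self, table: &[f32], top_k: usize) -> Vec<(u32, f32)> {
        let mut hits: Vec<(u32, f32)> = self
            .codes
            .chunks_exact(self.m)
            .enumerate()
            .map(|(i, code)| {
                let d: f32 = code
                    .iter()
                    .enumerate()
                    .map(|(s, &c)| table[s * self.k + c as usize])
                    .sum();
                (i as u32, d)
            })
            .collect();
        hits.sort_by(|a, b| a.1.total_cmp(&b.1));
        hits.truncate(top_k);
        hits
    }
}

/// Hypothetical per-workload timing loop showing all three fixes.
fn time_workload(index: &PqIndex, queries: &[Vec<f32>], top_k: usize) -> Vec<Duration> {
    // Fix 1: one scratch buffer per workload, reused across queries.
    let mut table = vec![0.0f32; index.m * index.k];

    // Fix 2: one untimed warmup query; black_box keeps it from being removed.
    if let Some(q) = queries.first() {
        index.distance_table(q, &mut table);
        black_box(index.probe_top_k(&table, top_k));
    }

    let mut timings = Vec::with_capacity(queries.len());
    for q in queries {
        let start = Instant::now();
        index.distance_table(black_box(q.as_slice()), &mut table);
        // Fix 3: make the result observably live inside the timed region.
        black_box(index.probe_top_k(&table, top_k));
        timings.push(start.elapsed());
    }
    timings
}
```

In this sketch the only allocation left in the timed region is the
`Vec<(u32, f32)>` returned by `probe_top_k`, which matches the remaining
concern noted in the doc updates.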
Doc updates:
- crates/pq-l2/program.md: API contract reflects the new signature; the
obsolete "avoid the Vec alloc in distance_table" prior is replaced with
a note about reducing probe_top_k's Vec<(u32, f32)> allocation
(single small alloc per query, real concern once the kernel SIMDs).
- docs/targets/pq-l2.md: API description updated.
Verified:
- cargo build / clippy / test: clean
- baseline trial: correctness pass, exit 0, ~40s wall-clock
- baseline numbers are now higher, i.e. slower, than before (geomean
1.35M vs prior 880k; (768,96,256) 5.2M vs prior 4.3M) because the prior
numbers were artificially low: allocator pressure improvements
masqueraded as kernel improvements, and LTO could in principle DCE heap
maintenance. The new numbers measure actual kernel work.
https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5