omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-24 02:38:06 +02:00

Claude a1e9f32ee1 pq-l2: bench quality fixes — pre-alloc output, warmup, black_box

Three related fixes from the code-review pass that make the per-query timing measure kernel work and only kernel work:

1. distance_table API now takes `&mut [f32]` output buffer
   - Old: `fn distance_table(&self, query: &[f32]) -> Vec<f32>`. Every call allocated a fresh Vec inside the timed region. An agent that reduced allocator pressure (e.g., via interior-mutability hacks with RefCell + thread-local scratch) would have shown up as a "kernel win" when it was actually just dodging the allocator.
   - New: `fn distance_table(&self, query: &[f32], out: &mut [f32])`. run_experiment pre-allocates one buffer per workload and reuses it across queries. Same for the criterion bench (one scratch buffer per bench_function closure). Timing now reflects only the kernel work.

2. Warmup query per workload
   - The first query of each (shape × distribution) combo paid cold-cache cost on the codes array (1.9 MB for the (768,96,256) shape, exceeds L2 on many laptops) and on the codebook (786 KB at that shape). With SPEED_NUM_QUERIES=32 that's a ~3% first-query bias on the geomean.
   - run_experiment now does one untimed distance_table + probe_top_k call per workload before the timing loop. Black-boxed so it can't be DCE'd.

3. std::hint::black_box on probe_top_k result in the trial loop
   - The criterion bench already did this; the trial harness (which is the load-bearing measurement) did not. Under LTO + opt-level=3, since the binary was the only consumer of `_hits`, the optimizer could in principle DCE the heap maintenance work. black_box makes the result observably live.

Doc updates:
- crates/pq-l2/program.md: API contract reflects the new signature; the obsolete "avoid the Vec alloc in distance_table" prior is replaced with a note about reducing probe_top_k's Vec<(u32, f32)> allocation (single small alloc per query, real concern once the kernel SIMDs).
- docs/targets/pq-l2.md: API description updated.

Verified:
- cargo build / clippy / test: clean
- baseline trial: correctness pass, exit 0, ~40s wall-clock
- baseline numbers are now slower than before (geomean 1.35M vs prior 880k; (768,96,256) 5.2M vs prior 4.3M) because the prior numbers were artificially low: allocator pressure improvements masqueraded as kernel improvements, and LTO could in principle DCE heap maintenance. The new numbers measure actual kernel work.

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5

2026-05-15 01:24:54 +00:00
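The three fixes described in this commit can be illustrated with a minimal Rust sketch. It is not the crate's actual code: the `Codebook` type, its field layout, the `time_workload` function, and the parameters of `probe_top_k` are hypothetical stand-ins. Only the `distance_table(&self, query: &[f32], out: &mut [f32])` signature and the `Vec<(u32, f32)>` return type are taken from the commit message. The sketch also uses a full sort instead of the heap the message mentions, to keep it short.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical PQ codebook: `m` subspaces, `k` centroids per subspace,
/// `dsub` dimensions per subspace. Centroids are laid out as [m][k][dsub].
pub struct Codebook {
    m: usize,
    k: usize,
    dsub: usize,
    centroids: Vec<f32>,
}

impl Codebook {
    pub fn new(m: usize, k: usize, dsub: usize, centroids: Vec<f32>) -> Self {
        assert_eq!(centroids.len(), m * k * dsub);
        Codebook { m, k, dsub, centroids }
    }

    /// Length of the scratch buffer that `distance_table` fills.
    pub fn table_len(&self) -> usize {
        self.m * self.k
    }

    /// Writes squared-L2 distances from each query sub-vector to every
    /// centroid into the caller-owned `out`; no allocation happens here.
    pub fn distance_table(&self, query: &[f32], out: &mut [f32]) {
        assert_eq!(query.len(), self.m * self.dsub);
        assert_eq!(out.len(), self.table_len());
        for sub in 0..self.m {
            let q = &query[sub * self.dsub..(sub + 1) * self.dsub];
            for c in 0..self.k {
                let base = (sub * self.k + c) * self.dsub;
                let cent = &self.centroids[base..base + self.dsub];
                out[sub * self.k + c] =
                    q.iter().zip(cent).map(|(a, b)| (a - b) * (a - b)).sum();
            }
        }
    }

    /// Sums table lookups for each encoded vector (`m` code bytes per vector,
    /// each byte assumed to be < k) and returns the `top_k` closest.
    /// Simplification: a full sort instead of a bounded heap.
    pub fn probe_top_k(&self, table: &[f32], codes: &[u8], top_k: usize) -> Vec<(u32, f32)> {
        let mut hits: Vec<(u32, f32)> = codes
            .chunks_exact(self.m)
            .enumerate()
            .map(|(i, code)| {
                let d: f32 = code
                    .iter()
                    .enumerate()
                    .map(|(sub, &c)| table[sub * self.k + c as usize])
                    .sum();
                (i as u32, d)
            })
            .collect();
        hits.sort_by(|a, b| a.1.total_cmp(&b.1));
        hits.truncate(top_k);
        hits
    }
}

/// Hypothetical per-workload timing loop; returns nanoseconds per query.
pub fn time_workload(cb: &Codebook, codes: &[u8], queries: &[Vec<f32>], top_k: usize) -> Vec<u128> {
    // Fix 1: one output buffer per workload, reused across all queries.
    let mut table = vec![0.0f32; cb.table_len()];

    // Fix 2: one untimed warmup query to pull codes and codebook into cache.
    if let Some(q) = queries.first() {
        cb.distance_table(q, &mut table);
        black_box(cb.probe_top_k(&table, codes, top_k));
    }

    let mut nanos = Vec::with_capacity(queries.len());
    for q in queries {
        let start = Instant::now();
        cb.distance_table(q, &mut table);
        let hits = cb.probe_top_k(&table, codes, top_k);
        // Fix 3: keep the result observably live so it cannot be optimized away.
        black_box(hits);
        nanos.push(start.elapsed().as_nanos());
    }
    nanos
}
```

In this sketch the only allocation left inside the timed region is the `Vec<(u32, f32)>` returned by `probe_top_k`, which matches the remaining concern noted in the doc updates.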
targets	pq-l2: bench quality fixes — pre-alloc output, warmup, black_box	2026-05-15 01:24:54 +00:00
adding-a-target.md	research: fix lance-autoresearch correctness bugs surfaced by code review	2026-05-15 00:55:57 +00:00
design.md	research: fix lance-autoresearch correctness bugs surfaced by code review	2026-05-15 00:55:57 +00:00