The original harness used recall@K vs. SIFT1M as the correctness oracle, which
gave the agent an incentive to overfit to one data distribution: a kernel that
hits recall@10 on SIFT-shaped clusters could regress on other distributions and
still pass the gate. This commit replaces both halves of the oracle.
Correctness phase (was: recall@K floor):
- Exact-up-to-epsilon (max_abs_err <= 1e-4) match against an immutable scalar
reference kernel, on a 5-distribution input battery (Gaussian, uniform,
sparse, large-dynamic-range, mostly-zero) crossed with all evaluated PQ
shapes. Top-K compared with tie-tolerant equivalence (TOPK_DIST_TOL=1e-4).
Lossy techniques (LUT u8/u16 quantization, etc.) fail this gate by
construction.
Speed phase (was: geomean ns over one synthetic dataset):
- Geomean ns/query measured across 3 PQ shapes x 3 data distributions:
(128, 16, 256) - SIFT-like
(256, 16, 256) - sub_vector_dim=16
(768, 96, 256) - BERT-like
crossed with clustered / uniform / sparse data. Fixed seed across trials
for reproducibility; per-combo timings reported alongside the global
geomean / worst / best so a kernel that wins on one combo and regresses
on another fails the worst-case guard.
Kernel API (was: const-DIM scalar functions):
- Generic over (dim, num_sub_vectors, num_centroids) via PqShape.
- PqKernel::new(shape, codebook) lets the agent pre-process the codebook
once (transpose, cache c.c, pack LUT, etc.) and amortize across queries.
Build cost is excluded from per-query timing - the bench measures
distance_table + probe_top_k only.
Other consequences:
- SIFT1M loader (src/fixture.rs), prepare_fixtures.sh, and the
cache-directory plumbing are all deleted - the harness is now fully
self-contained, no external download.
- src/inputs.rs replaces src/fixture.rs; deterministic per-trial
test-data + workload generation, no frozen artifacts.
- Cargo.toml gains an empty [workspace] block so cargo doesn't walk up to
the omnigraph parent workspace from inside research/.
Verified end-to-end:
- cargo build --release: clean
- cargo clippy --release --all-targets -- -D warnings: clean
- cargo run --release --bin run_experiment: correctness pass, geomean
1.22M ns, worst 4.82M ns ((768,96,256), sparse), best 596k ns, exit 0,
total wall-clock ~39s
- smoke test: kernel returning 0 distance -> correctness fail with
diagnostic, exit 2
- cargo test --release --lib: 2/2 unit tests pass
(correctness_battery_is_deterministic, speed_workloads_match_shapes)
https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
Lance PQ L2 kernel research — agent instructions
You are an autonomous research assistant. Your job is to improve src/kernels.rs
so that cargo run --release --bin run_experiment reports a lower
geomean_ns_per_query while:
- The correctness phase passes: your kernel's distance values must match the scalar reference within MAX_ABS_ERR = 1e-4, and the top-K must be tie-tolerant equivalent on every input the bench generates.
- worst_ns_per_query does not regress more than 5% against the last-kept kernel: if you win on one (shape × distribution) and lose significantly on another, the change isn't a generalizable improvement.
This bench is intentionally dataset-independent: there is no fixed dataset. The correctness oracle is mathematical equivalence to the scalar reference, checked across multiple PQ shapes and synthetic input distributions (Gaussian / uniform / sparse / large-dynamic-range / mostly-zero). The speed oracle is the geomean across multiple shapes × distributions, with worst-case guarded. A win that depends on a specific data distribution or PQ shape will fail to clear the bar by construction.
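As a rough illustration of the correctness oracle described above, the two comparisons could look like the sketch below. The function names are hypothetical, the constants only mirror the values quoted in this file, and the top-K rule shown is one possible reading of "tie-tolerant equivalent"; the actual check in the bench may differ.

```rust
// Assumed values, mirroring MAX_ABS_ERR and TOPK_DIST_TOL as named in the text.
const MAX_ABS_ERR: f32 = 1e-4;
const TOPK_DIST_TOL: f32 = 1e-4;

/// Largest element-wise absolute difference; a NaN anywhere counts as a failure.
fn max_abs_err(got: &[f32], want: &[f32]) -> f32 {
    assert_eq!(got.len(), want.len(), "tables must have equal length");
    let mut worst = 0.0_f32;
    for (g, w) in got.iter().zip(want) {
        let err = (g - w).abs();
        if err.is_nan() {
            return f32::INFINITY;
        }
        worst = worst.max(err);
    }
    worst
}

/// One possible reading of "tie-tolerant": both lists are sorted by ascending
/// distance, and the distance at every rank agrees within `tol`. Ids are not
/// compared, so vectors whose distances tie may appear in either order.
fn topk_tie_tolerant_eq(got: &[(u32, f32)], want: &[(u32, f32)], tol: f32) -> bool {
    got.len() == want.len()
        && got.iter().zip(want).all(|(g, w)| (g.1 - w.1).abs() <= tol)
}
```

A distance table would pass when max_abs_err(got, want) <= MAX_ABS_ERR. A stricter top-K check could additionally look up each returned id's reference distance, at the cost of needing the full distance array.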
Read this file end-to-end before doing anything else. Then run setup, then the loop.
Setup (do once at the start of every session)
- Read these files, in this order:
  - README.md
  - program.md (this file)
  - src/lib.rs
  - src/kernels.rs (the only file you may edit)
  - src/reference.rs
  - src/inputs.rs
  - src/bin/run_experiment.rs
- Ensure results.tsv exists. If not, create it with this header line:
    commit timestamp correctness geomean_ns worst_ns worst_combo best_ns best_combo peak_mem_mb total_seconds keep description
- Run the baseline trial:
    cargo run --release --bin run_experiment > run.log 2>&1
  Confirm correctness: pass. Parse run.log and append a row to results.tsv with keep=baseline and description="seeded scalar PQ-L2 baseline". This is your reference number.
- Commit the baseline row with a one-line message like baseline: <numbers>.
What you CAN do
- Modify src/kernels.rs freely. You may:
  - Pre-process the codebook in PqKernel::new (transpose layouts, cache c·c for the FMA trick, pack the codebook for register-resident lookup, etc.). This cost is paid once per dataset and amortized across queries: the bench measures per-query, not per-(build + query).
  - Reorder loops, switch internal data layouts, drop down to std::arch intrinsics under #[cfg(target_arch = ...)] gates. Always keep a portable scalar fallback so the kernel compiles everywhere.
  - Use unsafe if needed; document the invariants you're relying on.
  - Mark hot functions #[inline]; add private helpers freely.
  - Add #[cfg(test)] mod tests { ... } inside src/kernels.rs if you want in-file property checks.
What you CANNOT do
- Do not modify src/lib.rs (PqShape and the tolerance constants are shared with the immutable scaffolding).
- Do not modify src/bin/run_experiment.rs, src/reference.rs, src/inputs.rs, benches/pq_l2.rs, or Cargo.toml.
- Do not add new crate dependencies.
- Do not alter the public API of kernels::PqKernel:
  - PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self
  - PqKernel::shape(&self) -> &PqShape
  - PqKernel::distance_table(&self, query: &[f32]) -> Vec<f32>
  - PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>
- Do not introduce lossy techniques (LUT u8/u16 quantization, asymmetric-distance approximation, etc.); the correctness phase asserts exact-up-to-ε match against the scalar reference. If you want to explore a lossy track, surface that in a separate kernel and propose a track extension.
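For orientation, a plain scalar kernel that satisfies the four signatures listed above might look roughly like this. It is a sketch, not the contents of src/reference.rs or src/kernels.rs: the PqShape fields are assumed from the (dim, num_sub_vectors, num_centroids) triple, the codebook is assumed to be a flat [m][k][d] array, codes are assumed to be laid out one row per vector, and the top-K uses a full sort instead of a heap for brevity.

```rust
/// Hypothetical stand-in for the crate's PqShape; the field names are assumed.
#[derive(Clone, Copy, Debug)]
pub struct PqShape {
    pub dim: usize,
    pub num_sub_vectors: usize,
    pub num_centroids: usize,
}

pub struct PqKernel {
    shape: PqShape,
    sub_dim: usize,
    codebook: Vec<f32>, // assumed flat [m][k][d] layout
}

impl PqKernel {
    pub fn new(shape: PqShape, codebook: &[f32]) -> Self {
        let sub_dim = shape.dim / shape.num_sub_vectors;
        assert_eq!(
            codebook.len(),
            shape.num_sub_vectors * shape.num_centroids * sub_dim
        );
        // Any one-time pre-processing of the codebook would happen here.
        Self { shape, sub_dim, codebook: codebook.to_vec() }
    }

    pub fn shape(&self) -> &PqShape {
        &self.shape
    }

    /// Builds a flat [m][k] table of squared L2 distances from each query
    /// sub-vector to every centroid of the matching sub-quantizer.
    pub fn distance_table(&self, query: &[f32]) -> Vec<f32> {
        let (m_n, k_n, d_n) =
            (self.shape.num_sub_vectors, self.shape.num_centroids, self.sub_dim);
        let mut table = Vec::with_capacity(m_n * k_n);
        for m in 0..m_n {
            let q = &query[m * d_n..(m + 1) * d_n];
            for k in 0..k_n {
                let base = (m * k_n + k) * d_n;
                let c = &self.codebook[base..base + d_n];
                let dist: f32 = q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum();
                table.push(dist);
            }
        }
        table
    }

    /// Sums table entries per encoded vector and returns the k smallest,
    /// ascending by distance (ties broken by id). Sort-based for clarity.
    pub fn probe_top_k(
        &self,
        table: &[f32],
        codes: &[u8],
        num_vectors: usize,
        k: usize,
    ) -> Vec<(u32, f32)> {
        let (m_n, k_n) = (self.shape.num_sub_vectors, self.shape.num_centroids);
        let mut all: Vec<(u32, f32)> = (0..num_vectors)
            .map(|i| {
                let off = i * m_n;
                let acc = (0..m_n)
                    .map(|m| table[m * k_n + codes[off + m] as usize])
                    .sum::<f32>();
                (i as u32, acc)
            })
            .collect();
        all.sort_by(|a, b| a.1.total_cmp(&b.1).then(a.0.cmp(&b.0)));
        all.truncate(k);
        all
    }
}
```

Everything behind these signatures (stored layout, scratch buffers, SIMD paths) is free to change; only the signatures and the numerical results are pinned.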
The metric
Minimize geomean_ns_per_query (geometric mean of per-query wall-clock across
all timed queries, all shapes, all distributions) subject to:
- Correctness phase: pass (exit-2 otherwise).
- worst_ns_per_query ≤ 1.05 × the last-kept kernel's worst.
- total_seconds ≤ 600.
- Build is clean: cargo build --release succeeds, cargo clippy --release --all-targets -- -D warnings reports zero issues.
Ties break toward simpler code. If two kernels report the same speed within
~3% noise, prefer fewer lines / less unsafe.
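The metric and its constraints can be written down as a small predicate. The sketch below is illustrative only: the Trial struct and should_keep are hypothetical names, the build/clippy check is assumed to have been done separately, and the noise band is left to the caller.

```rust
/// Geometric mean via the mean of logarithms, which avoids overflowing a
/// long running product of nanosecond timings.
fn geomean(samples: &[f64]) -> f64 {
    assert!(!samples.is_empty(), "need at least one sample");
    let log_sum: f64 = samples.iter().map(|x| x.ln()).sum();
    (log_sum / samples.len() as f64).exp()
}

/// Hypothetical summary of one trial, using the quantities named in the text.
struct Trial {
    correctness_pass: bool,
    geomean_ns: f64,
    worst_ns: f64,
    total_seconds: f64,
}

/// Applies the stated constraints: correctness passes, the run fits in
/// 600 seconds, the geomean improves, and the worst case regresses by at
/// most 5% relative to the last-kept kernel.
fn should_keep(candidate: &Trial, last_kept: &Trial) -> bool {
    candidate.correctness_pass
        && candidate.total_seconds <= 600.0
        && candidate.geomean_ns < last_kept.geomean_ns
        && candidate.worst_ns <= 1.05 * last_kept.worst_ns
}
```

The geometric mean rewards proportional improvements equally across combos, which is why a separate worst-case guard is needed to catch a single combo that regresses badly.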
Lance-PQ-specific priors (lossless directions)
These directions are known to pay off without compromising arithmetic accuracy. Pick one hypothesis at a time; implement; measure; decide.
- Codebook layout. The reference layout is [m][k][d]. For a fixed query, iterating over centroids stays in cache, but the inner loop over d is short. Transposing to [m][d][k] makes k contiguous, so you can SIMD-load 8 centroids at a time along k, broadcast q[d] across the lanes, and accumulate (query - centroid)² while stepping over d. Do the transpose in PqKernel::new once.
- Cache c·c. The diff–square–sum is (q - c)·(q - c) = q·q - 2 q·c + c·c. Hoist q·q per sub-vector, precompute c·c once at codebook-load time. Inner loop becomes one FMA (-2 q·c + c·c). Watch the sign / accumulator ordering so the rounding stays within tolerance.
- Probe layout. The probe is dominated by acc += table[m][codes[off+m]], executed num_sub_vectors times per vector. Transposing codes to [m][i] (one row per sub-quantizer, contiguous over base index) lets you process 32 or more vectors per inner iteration with vpgatherdq-style loads.
- Top-K integration. push() does a branch + heap sift on every code. At 50k probes per query × 9 (shape × dist) combos that's the second-biggest cost after the gather. Block the probe (e.g., 512 codes at a time), find the local top-K with a branchless pass, then merge into the global heap.
- Prefetch. A _mm_prefetch(codes.as_ptr().add(off + 64), _MM_HINT_T0) ahead of the gather is usually pure win at 50k+ scale where codes don't all fit in L2.
- FMA chains for table build. The diff–square–sum maps cleanly to FMA on AVX2/NEON. Even without intrinsics, structuring the inner loop so rustc emits FMA helps (for example, by using f32::mul_add with the fma target feature enabled; rustc does not fuse a separate multiply and add on its own).
- Avoid the Vec allocation in the hot path. distance_table allocates a fresh Vec<f32> per call. Returning a fixed-capacity buffer is a public-API change you can't make, but you can reuse a thread-local scratch buffer internally if it speeds the build.
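To make the c·c direction in the list above concrete, here is a minimal scalar sketch of the expansion (q - c)·(q - c) = q·q - 2 q·c + c·c next to the direct diff-square-sum. The helper names are hypothetical, and a real kernel would precompute c·c for every centroid in PqKernel::new and q·q once per query sub-vector.

```rust
/// Squared L2 distance by the direct diff-square-sum.
fn l2_direct(q: &[f32], c: &[f32]) -> f32 {
    q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum()
}

/// Plain dot product; used for q·q, c·c and q·c.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Squared L2 distance via q·q - 2 q·c + c·c, with q·q and c·c passed in
/// precomputed so the per-centroid work is one dot product plus one
/// fused multiply-add.
fn l2_expanded(q: &[f32], c: &[f32], qq: f32, cc: f32) -> f32 {
    let qc = dot(q, c);
    // mul_add computes (-2.0 * qc) + (qq + cc) with a single rounding.
    (-2.0_f32).mul_add(qc, qq + cc)
}
```

The two forms agree mathematically but not bit for bit. When q and c are nearly equal and large in magnitude, the expanded form subtracts two large, nearly equal numbers and can lose digits, so the large-dynamic-range inputs in the correctness battery are the ones most likely to push it past the tolerance.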
The loop
Once setup is done, repeat indefinitely:
- Observe state. Read the last ~5 rows of results.tsv. Note which ideas have been tried, what won, what regressed. Form a hypothesis with one sentence stating the change and the predicted effect on speed and correctness.
- Edit src/kernels.rs. Keep the diff focused on the one hypothesis.
- Build and lint.
    cargo build --release
    cargo clippy --release --all-targets -- -D warnings
  If either fails, fix and try again; do not commit broken state.
- Run the trial.
    cargo run --release --bin run_experiment > run.log 2>&1
- Parse the result. Extract correctness, geomean_ns_per_query, worst_ns_per_query (with combo), peak_mem_mb, total_seconds. Compute deltas vs. baseline.
- Decide keep or revert.
  - Keep iff: correctness: pass, geomean strictly better than the last-kept row (allow ~1% noise band), and worst_ns_per_query ≤ 1.05 × last-kept's worst.
  - Revert otherwise: git restore src/kernels.rs (or commit and git revert if you want the revert in history). Note what failed.
- Log. Append one row to results.tsv:
    <short_sha> <iso8601> <correctness> <geomean_ns> <worst_ns> <worst_combo> <best_ns> <best_combo> <peak_mem> <elapsed> <keep|revert> <one-line description>
- Commit. One-line message describing the change and the headline number, e.g.
    transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)
Hygiene
- Always commit src/kernels.rs changes; never commit results.tsv or run.log (they're gitignored).
- If a change fails to build, do not commit. Iterate until it builds, or revert cleanly.
- If two consecutive ideas regress, take a beat: re-read the last ~10 rows of results.tsv and update your mental model before proposing the next.
- Per-trial cap: 10 minutes. If cargo run is still going after 10 min, kill it and mark the trial as timeout.
Never stop
Keep going until interrupted. Each loop iteration is one hypothesis, one edit, one measurement, one commit. No multi-step plans across iterations.