omnigraph/research/lance-autoresearch/program.md
Claude 272b70bfb4
research: redesign lance-autoresearch oracle to be dataset-independent
Original harness used recall@K vs. SIFT1M as the correctness oracle, which gives
the agent incentive to overfit to one data distribution: a kernel that hits
recall@10 on SIFT-shaped clusters could regress on other distributions and
still pass the gate. This commit replaces both halves of the oracle.

Correctness phase (was: recall@K floor):
  - Bit-equivalent (max_abs_err <= 1e-4) match against an immutable scalar
    reference kernel, on a 5-distribution input battery (Gaussian, uniform,
    sparse, large-dynamic-range, mostly-zero) crossed with all evaluated PQ
    shapes. Top-K compared with tie-tolerant equivalence (TOPK_DIST_TOL=1e-4).
    Lossy techniques (LUT u8/u16 quantization, etc.) fail this gate by
    construction.

Speed phase (was: geomean ns over one synthetic dataset):
  - Geomean ns/query measured across 3 PQ shapes x 3 data distributions:
      (128, 16, 256) - SIFT-like
      (256, 16, 256) - sub_vector_dim=16
      (768, 96, 256) - BERT-like
    crossed with clustered / uniform / sparse data. Fixed seed across trials
    for reproducibility; per-combo timings reported alongside the global
    geomean / worst / best so a kernel that wins on one combo and regresses
    on another fails the worst-case guard.

Kernel API (was: const-DIM scalar functions):
  - Generic over (dim, num_sub_vectors, num_centroids) via PqShape.
  - PqKernel::new(shape, codebook) lets the agent pre-process the codebook
    once (transpose, cache c.c, pack LUT, etc.) and amortize across queries.
    Build cost is excluded from per-query timing - the bench measures
    distance_table + probe_top_k only.

Other consequences:
  - SIFT1M loader (src/fixture.rs), prepare_fixtures.sh, and the
    cache-directory plumbing all delete - the harness is now fully
    self-contained, no external download.
  - src/inputs.rs replaces src/fixture.rs; deterministic per-trial
    test-data + workload generation, no frozen artifacts.
  - Cargo.toml gains an empty [workspace] block so cargo doesn't walk up to
    the omnigraph parent workspace from inside research/.

Verified end-to-end:
  - cargo build --release: clean
  - cargo clippy --release --all-targets -- -D warnings: clean
  - cargo run --release --bin run_experiment: correctness pass, geomean
    1.22M ns, worst 4.82M ns ((768,96,256), sparse), best 596k ns, exit 0,
    total wall-clock ~39s
  - smoke test: kernel returning 0 distance -> correctness fail with
    diagnostic, exit 2
  - cargo test --release --lib: 2/2 unit tests pass
    (correctness_battery_is_deterministic, speed_workloads_match_shapes)

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
2026-05-14 23:03:45 +00:00

8.4 KiB
Raw Blame History

Lance PQ L2 kernel research — agent instructions

You are an autonomous research assistant. Your job is to improve src/kernels.rs so that cargo run --release --bin run_experiment reports a lower geomean_ns_per_query while:

  1. The correctness phase passes — your kernel's distance values must match the scalar reference within MAX_ABS_ERR = 1e-4, and the top-K must be tie-tolerant equivalent on every input the bench generates.
  2. The worst_ns_per_query does not regress more than 5% against the last-kept kernel — if you win on one (shape × distribution) and lose significantly on another, the change isn't a generalizable improvement.

This bench is intentionally dataset-independent: there is no fixed dataset. The correctness oracle is mathematical equivalence to the scalar reference, checked across multiple PQ shapes and synthetic input distributions (Gaussian / uniform / sparse / large-dynamic-range / mostly-zero). The speed oracle is the geomean across multiple shapes × distributions, with worst-case guarded. A win that depends on a specific data distribution or PQ shape will fail to clear the bar by construction.

Read this file end-to-end before doing anything else. Then run setup, then the loop.

Setup (do once at the start of every session)

  1. Read these files, in this order:
    • README.md
    • program.md (this file)
    • src/lib.rs
    • src/kernels.rs (the only file you may edit)
    • src/reference.rs
    • src/inputs.rs
    • src/bin/run_experiment.rs
  2. Ensure results.tsv exists. If not, create it with this header line:
    commit	timestamp	correctness	geomean_ns	worst_ns	worst_combo	best_ns	best_combo	peak_mem_mb	total_seconds	keep	description
    
  3. Run the baseline trial: cargo run --release --bin run_experiment > run.log 2>&1. Confirm correctness: pass. Parse run.log and append a row to results.tsv with keep=baseline and description="seeded scalar PQ-L2 baseline". This is your reference number.
  4. Commit the baseline row with a one-line message like baseline: <numbers>.

What you CAN do

  • Modify src/kernels.rs freely. You may:
    • Pre-process the codebook in PqKernel::new (transpose layouts, cache c·c for the FMA trick, pack the codebook for register-resident lookup, etc.). This cost is paid once per dataset and amortized across queries — the bench measures per-query, not per-(build + query).
    • Reorder loops, switch internal data layouts, drop down to std::arch intrinsics under #[cfg(target_arch = ...)] gates. Always keep a portable scalar fallback so the kernel compiles everywhere.
    • Use unsafe if needed; document the invariants you're relying on.
    • Mark hot functions #[inline]; add private helpers freely.
    • Add #[cfg(test)] mod tests { ... } inside src/kernels.rs if you want in-file property checks.

What you CANNOT do

  • Do not modify src/lib.rs (PqShape and the tolerance constants are shared with the immutable scaffolding).
  • Do not modify src/bin/run_experiment.rs, src/reference.rs, src/inputs.rs, benches/pq_l2.rs, or Cargo.toml.
  • Do not add new crate dependencies.
  • Do not alter the public API of kernels::PqKernel:
    • PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self
    • PqKernel::shape(&self) -> &PqShape
    • PqKernel::distance_table(&self, query: &[f32]) -> Vec<f32>
    • PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>
  • Do not introduce lossy techniques (LUT u8/u16 quantization, asymmetric- distance approximation, etc.) — the correctness phase asserts exact-up-to-ε match against the scalar reference. If you want to explore a lossy track, surface that in a separate kernel and propose a track extension.

The metric

Minimize geomean_ns_per_query (geometric mean of per-query wall-clock across all timed queries, all shapes, all distributions) subject to:

  1. Correctness phase: pass (exit-2 otherwise).
  2. worst_ns_per_query ≤ 1.05 × the last-kept kernel's worst.
  3. total_seconds ≤ 600.
  4. Build is clean: cargo build --release succeeds, cargo clippy --release --all-targets -- -D warnings reports zero issues.

Ties break toward simpler code. If two kernels report the same speed within ~3% noise, prefer fewer lines / less unsafe.

Lance-PQ-specific priors (lossless directions)

These directions are known to pay off without compromising arithmetic accuracy. Pick one hypothesis at a time; implement; measure; decide.

  • Codebook layout. The reference layout is [m][k][d]. For a fixed query, iterating over centroids stays in cache, but the inner loop over d is short. Transposing to [m][d][k] lets you SIMD-load 8 (query - centroid) lanes across d and broadcast over k. Do the transpose in PqKernel::new once.
  • Cache c·c. The diffsquaresum is (q - c)·(q - c) = q·q - 2qc + c·c. Hoist q·q per sub-vector, precompute c·c once at codebook-load time. Inner loop becomes one FMA (-2qc + cc). Watch the sign / accumulator ordering so the rounding stays within tolerance.
  • Probe layout. The probe is dominated by acc += table[m][codes[off+m]] × num_sub_vectors. Transposing codes to [m][i] (one row per sub-quantizer, contiguous over base index) lets you process up to 32+ vectors per inner iteration with vpgatherdq-style loads.
  • Top-K integration. push() does a branch + heap sift on every code. At 50k probes per query × 9 (shape × dist) combos that's the second-biggest cost after the gather. Block the probe (e.g., 512 codes at a time), find the local top-K with a branchless pass, then merge into the global heap.
  • Prefetch. A _mm_prefetch(codes.as_ptr().add(off + 64), _MM_HINT_T0) ahead of the gather is usually pure win at 50k+ scale where codes don't all fit in L2.
  • FMA chains for table build. The diffsquaresum maps cleanly to FMA on AVX2/NEON. Even without intrinsics, structuring the inner loop so rustc emits FMA helps.
  • Avoid the Vec allocation in the hot path. distance_table allocates a fresh Vec<f32> per call. Returning a fixed-capacity buffer is a public-API change you can't make — but you can reuse a thread-local scratch buffer internally if it speeds the build.

The loop

Once setup is done, repeat indefinitely:

  1. Observe state. Read the last ~5 rows of results.tsv. Note which ideas have been tried, what won, what regressed. Form a hypothesis with one sentence stating the change and the predicted effect on speed and correctness.
  2. Edit src/kernels.rs. Keep the diff focused on the one hypothesis.
  3. Build and lint.
    cargo build --release
    cargo clippy --release --all-targets -- -D warnings
    
    If either fails, fix and try again — do not commit broken state.
  4. Run the trial.
    cargo run --release --bin run_experiment > run.log 2>&1
    
  5. Parse the result. Extract correctness, geomean_ns_per_query, worst_ns_per_query (with combo), peak_mem_mb, total_seconds. Compute deltas vs. baseline.
  6. Decide keep or revert.
    • Keep iff: correctness: pass, geomean strictly better than the last-kept row (allow ~1% noise band), and worst_ns_per_query ≤ 1.05 × last-kept's worst.
    • Revert otherwise: git restore src/kernels.rs (or commit and git revert if you want the revert in history). Note what failed.
  7. Log. Append one row to results.tsv:
    <short_sha>	<iso8601>	<correctness>	<geomean_ns>	<worst_ns>	<worst_combo>	<best_ns>	<best_combo>	<peak_mem>	<elapsed>	<keep|revert>	<one-line description>
    
  8. Commit. One-line message describing the change and the headline number, e.g. transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%).

Hygiene

  • Always commit src/kernels.rs changes; never commit results.tsv or run.log (they're gitignored).
  • If a change fails to build, do not commit. Iterate until it builds, or revert cleanly.
  • If two consecutive ideas regress, take a beat: re-read the last ~10 rows of results.tsv and update your mental model before proposing the next.
  • Per-trial cap: 10 minutes. If cargo run is still going after 10 min, kill it and mark the trial as timeout.

Never stop

Keep going until interrupted. Each loop iteration is one hypothesis, one edit, one measurement, one commit. No multi-step plans across iterations.