omnigraph/research/lance-autoresearch/program.md
Claude ed376af7d8
research: lance-autoresearch — PQ L2 kernel autoresearch harness
Stand up a standalone Rust project under research/lance-autoresearch/ for
LLM-driven optimization of Lance's PQ L2 distance kernels, following Karpathy's
three-file autoresearch contract:

  - src/kernels.rs (mutable, the agent's playground): scalar baseline PQ L2
    distance + top-K matching Lance 4.x's algorithm shape (16 sub-vectors,
    256 centroids, 8-bit codes, 128-d f32).
  - src/{fixture,reference,bin/run_experiment}.rs (immutable): SIFT1M loader
    (fvecs/ivecs + frozen codebook) with deterministic synthetic fallback,
    brute-force ground truth, fixed-format result block with recall@10 floor
    + time-budget exits.
  - program.md (human-iterated): the skill the agent reads each session —
    setup, what it can / cannot edit, the metric, Lance-PQ-specific priors,
    the keep/revert loop.

Smoke tests pass: baseline build clean, recall@10 = 0.66 on synthetic above
the 0.50 floor (exit 0), broken kernel triggers floor failure (exit 2),
clippy -D warnings clean. Excludes research/ from omnigraph workspace so
the nested project doesn't enter omnigraph's cargo build graph.

Licensed dual MIT / Apache-2.0 to keep the upstream-PR path to lance-format/lance
clean.

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
2026-05-14 22:38:39 +00:00

7.7 KiB
Raw Blame History

Lance PQ L2 kernel research — agent instructions

You are an autonomous research assistant. Your job is to improve the kernel(s) in src/kernels.rs so that cargo run --release --bin run_experiment reports a lower geomean_ns_per_query while keeping recall_at_10 within 0.005 of the seeded baseline (and never below the hard floor 0.50).

Read this file end-to-end before doing anything else. Then run setup, then the loop.

Setup (do once at the start of every session)

  1. Read these files, in this order:
    • README.md
    • program.md (this file)
    • src/lib.rs
    • src/kernels.rs (the only file you may edit)
    • src/bin/run_experiment.rs
    • src/fixture.rs
  2. Confirm fixtures are present. SIFT1M lives under ~/.cache/lance-autoresearch/. If it's missing, the bench will fall back to a deterministic synthetic dataset — that's fine for the loop; mention it in your log. If you want SIFT1M, run bash scripts/prepare_fixtures.sh (one-time, ~510 min, ~250 MB download).
  3. Ensure results.tsv exists. If not, create it with this header line:
    commit	timestamp	source	num_base	recall_at_10	geomean_ns_per_query	peak_mem_mb	total_seconds	keep	description
    
  4. Run the baseline trial: cargo run --release --bin run_experiment > run.log 2>&1. Parse run.log and append a row to results.tsv with keep=baseline, description="seeded scalar PQ-L2 baseline". This is your reference number.
  5. Commit the baseline row with a one-line message like baseline: <numbers>.

What you CAN do

  • Modify src/kernels.rs freely. You may:
    • Reorder loops, change iteration order over codes or sub-vectors.
    • Switch to SIMD via std::arch (x86_64::_mm256_*, aarch64::neon::*), behind #[cfg(target_arch = "...")] gates. Always keep a portable scalar fallback so the kernel compiles everywhere.
    • Reshape internal data: transpose the codebook, pack the distance LUT into u8/u16 for pshufb-style lookup, group codes for SIMD gather.
    • Use unsafe if needed; document the invariants you're relying on.
    • Mark hot functions #[inline] or split them; add private helpers freely.
  • Add #[cfg(test)] mod tests { ... } inside src/kernels.rs if you want property checks against the scalar path.

What you CANNOT do

  • Do not modify src/lib.rs (changes DIM / NUM_SUB_VECTORS / NUM_CENTROIDS / TOP_K — these pin the fixture geometry).
  • Do not modify src/bin/run_experiment.rs, src/reference.rs, src/fixture.rs, benches/pq_l2.rs, scripts/prepare_fixtures.sh, or Cargo.toml.
  • Do not add new crate dependencies (the bench's external surface is intentionally minimal — only anyhow, plus criterion as a dev-dep).
  • Do not delete or alter the public API of kernels.rs:
    • pub type DistanceTable
    • pub fn compute_distance_table_l2(query: &[f32], codebook: &[f32]) -> DistanceTable
    • pub fn probe_pq_l2_top_k(table: &DistanceTable, codes: &[u8], num_vectors: usize, out: &mut TopKHeap)
    • pub struct TopKHeap with new() / push / into_sorted

The metric

Minimize geomean_ns_per_query (geometric mean of per-query wall-clock from the benched queries, rounded to a u64 ns) subject to:

  1. recall_at_10 >= baseline_recall_at_10 - 0.005
  2. recall_at_10 >= 0.50 (hard floor; below this the bench exits non-zero)
  3. total_seconds <= 600
  4. Build is clean: cargo build --release succeeds, cargo clippy --release -- -D warnings reports zero issues. (Run cargo clippy --release before each commit.)

Ties break toward simpler code. If two kernels report the same speed within noise (~3%), prefer the one with fewer lines or less unsafe.

Lance-PQ-specific priors

These are the directions known to pay off on this kernel shape. Don't pursue all of them at once — pick one hypothesis, implement, measure, decide.

  • Codebook layout for the table-build step. The reference layout is [m][k][d]. For a fixed query, iterating over centroids stays in cache, but the inner loop over d is short (8 floats). An [m][d][k] transpose can let you SIMD-load 8 (query - centroid) lanes across d and broadcast over k.
  • LUT packing for the probe step. The probe is dominated by acc += table[m][codes[off+m]] × 16. Two well-known tricks:
    • Pack each table[m] row into 256 × f16 or 256 × u8 (quantized post-build) to fit the LUT in cache and enable vpgatherdq / pshufb.
    • Reorder code storage to [m][i] (transpose codes by sub-quantizer) so each m step is a contiguous gather over up to 32 vectors at once.
  • Top-K integration. push() does a branch + heap sift on every code; for a 1M-row probe this is the second-biggest cost after the gather. Consider:
    • Skip the heap entirely when the running acc is already > current_max (early termination, but only if your accumulator order makes that cheap).
    • Block the probe (e.g., 1024 codes at a time), find the local top-K with a branchless scan, then merge into the global heap.
  • Prefetch. A _mm_prefetch(codes.as_ptr().add(off + 64), _MM_HINT_T0) ahead of the gather is usually pure win at 1M scale where codes don't all fit in L2.
  • FMA in the table build. The diffsquaresum sequence is (q - c)·(q - c) per element — that's (q*q) - 2qc + c*c. You can hoist q*q once per sub-vector and precompute c*c once at codebook-load time (if you cache it as a side table), reducing the inner loop to one FMA. But: caching c*c requires a one-time setup step, which has to live in kernels.rs since you cannot touch the fixture; either lazy-init via OnceLock<Vec<f32>> or rebuild every call (probably not worth it).

The loop

Once setup is done, repeat indefinitely:

  1. Observe state. Read the last ~5 rows of results.tsv. Note which ideas have been tried, what won, what regressed. Form a hypothesis with one sentence stating the change and the predicted effect on speed and recall.
  2. Edit src/kernels.rs. Keep the diff focused on the one hypothesis.
  3. Build and lint. Run:
    cargo build --release
    cargo clippy --release --all-targets -- -D warnings
    
    If either fails, fix and try again — do not commit broken state.
  4. Run the trial.
    cargo run --release --bin run_experiment > run.log 2>&1
    
  5. Parse the result. Extract recall_at_10, geomean_ns_per_query, peak_mem_mb, total_seconds from run.log. Compute the deltas vs. baseline.
  6. Decide keep or revert.
    • Keep iff: recall within tolerance, speed strictly better than the last-kept row (allow ~1% noise band), and total time within budget.
    • Revert otherwise: git restore src/kernels.rs (or commit and git revert if you want the revert in history). Note what failed.
  7. Log. Append one row to results.tsv:
    <short_sha>	<iso8601>	<source>	<num_base>	<recall>	<geomean_ns>	<peak_mem>	<elapsed>	<keep|revert>	<one-line description>
    
  8. Commit. Use a one-line message describing the change and the headline number, e.g. transpose codebook; 184k → 142k ns/query (recall 0.94).

Hygiene

  • Always commit src/kernels.rs changes; never commit results.tsv or run.log (they're gitignored).
  • If a change fails to build, do not commit. Iterate until it builds, or revert cleanly.
  • If two consecutive ideas regress, take a beat: re-read the last ~10 rows of results.tsv and update your mental model before proposing the next.
  • Per-trial cap: 10 minutes. If cargo run is still going after 10 min, kill it and mark the trial as timeout.

Never stop

Keep going until interrupted. Each loop iteration is one hypothesis, one edit, one measurement, one commit. No multi-step plans across iterations.