omnigraph/research/lance-autoresearch/program.md
Claude ed376af7d8
research: lance-autoresearch — PQ L2 kernel autoresearch harness
Stand up a standalone Rust project under research/lance-autoresearch/ for
LLM-driven optimization of Lance's PQ L2 distance kernels, following Karpathy's
three-file autoresearch contract:

  - src/kernels.rs (mutable, the agent's playground): scalar baseline PQ L2
    distance + top-K matching Lance 4.x's algorithm shape (16 sub-vectors,
    256 centroids, 8-bit codes, 128-d f32).
  - src/{fixture,reference,bin/run_experiment}.rs (immutable): SIFT1M loader
    (fvecs/ivecs + frozen codebook) with deterministic synthetic fallback,
    brute-force ground truth, fixed-format result block with recall@10 floor
    + time-budget exits.
  - program.md (human-iterated): the skill the agent reads each session —
    setup, what it can / cannot edit, the metric, Lance-PQ-specific priors,
    the keep/revert loop.

Smoke tests pass: baseline build clean, recall@10 = 0.66 on synthetic above
the 0.50 floor (exit 0), broken kernel triggers floor failure (exit 2),
clippy -D warnings clean. Excludes research/ from omnigraph workspace so
the nested project doesn't enter omnigraph's cargo build graph.

Licensed dual MIT / Apache-2.0 to keep the upstream-PR path to lance-format/lance
clean.

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
2026-05-14 22:38:39 +00:00


# Lance PQ L2 kernel research — agent instructions
You are an autonomous research assistant. Your job is to improve the kernel(s) in
`src/kernels.rs` so that `cargo run --release --bin run_experiment` reports a
**lower `geomean_ns_per_query`** while keeping **`recall_at_10` within 0.005 of
the seeded baseline** (and never below the hard floor 0.50).
Read this file end-to-end before doing anything else. Then run setup, then the loop.
## Setup (do once at the start of every session)
1. Read these files, in this order:
   - `README.md`
   - `program.md` (this file)
   - `src/lib.rs`
   - `src/kernels.rs` *(the only file you may edit)*
   - `src/bin/run_experiment.rs`
   - `src/fixture.rs`
2. Confirm fixtures are present. SIFT1M lives under `~/.cache/lance-autoresearch/`.
   If it's missing, the bench will fall back to a deterministic synthetic dataset.
   That's fine for the loop; mention it in your log. If you want SIFT1M, run
   `bash scripts/prepare_fixtures.sh` (one-time, ~5-10 min, ~250 MB download).
3. Ensure `results.tsv` exists. If not, create it with this tab-separated header line:
```
commit timestamp source num_base recall_at_10 geomean_ns_per_query peak_mem_mb total_seconds keep description
```
4. Run the baseline trial: `cargo run --release --bin run_experiment > run.log 2>&1`.
Parse `run.log` and append a row to `results.tsv` with `keep=baseline`,
`description="seeded scalar PQ-L2 baseline"`. This is your reference number.
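5. Commit the baseline with a one-line message like `baseline: <numbers>`.
   `results.tsv` and `run.log` are gitignored, so the numbers live in the commit
   message (use `git commit --allow-empty` if nothing else changed).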
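
The parse-and-append part of step 4 could be sketched roughly as follows. The `key: value` layout of the result block is an assumption made for illustration; check `src/bin/run_experiment.rs` for the real format. `parse_result_block` and `tsv_row` are hypothetical helper names, and only the column order is taken from the header above.

```rust
use std::collections::HashMap;

/// Collect `key: value` lines from a log into a map.
/// ASSUMPTION: the result block prints one `key: value` pair per line.
fn parse_result_block(log: &str) -> HashMap<String, String> {
    log.lines()
        .filter_map(|line| line.split_once(':'))
        .map(|(k, v)| (k.trim().to_string(), v.trim().to_string()))
        .collect()
}

/// Build one tab-separated row in the column order of the `results.tsv` header.
/// Returns `None` if any expected metric is missing from the parsed log.
fn tsv_row(
    commit: &str,
    timestamp: &str,
    keep: &str,
    description: &str,
    fields: &HashMap<String, String>,
) -> Option<String> {
    let get = |k: &str| fields.get(k).map(String::as_str);
    Some(
        [
            commit,
            timestamp,
            get("source")?,
            get("num_base")?,
            get("recall_at_10")?,
            get("geomean_ns_per_query")?,
            get("peak_mem_mb")?,
            get("total_seconds")?,
            keep,
            description,
        ]
        .join("\t"),
    )
}
```

Returning `None` on a missing field makes a failed or truncated run visible instead of silently logging an incomplete row.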
## What you CAN do
- Modify **`src/kernels.rs`** freely. You may:
  - Reorder loops, change iteration order over codes or sub-vectors.
  - Switch to SIMD via `std::arch` (`x86_64::_mm256_*`, the `aarch64::v*` NEON
    intrinsics), behind `#[cfg(target_arch = "...")]` gates. Always keep a
    portable scalar fallback so the kernel compiles everywhere.
  - Reshape internal data: transpose the codebook, pack the distance LUT into
    `u8`/`u16` for `pshufb`-style lookup, group codes for SIMD gather.
  - Use `unsafe` if needed; document the invariants you're relying on.
  - Mark hot functions `#[inline]` or split them; add private helpers freely.
  - Add `#[cfg(test)] mod tests { ... }` inside `src/kernels.rs` if you want
    property checks against the scalar path.
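
A minimal sketch of the gating pattern described above, assuming an 8-float sub-vector (128 dims / 16 sub-vectors). The names `l2_sq_scalar`, `l2_sq_avx2`, and `l2_sq_sub` are hypothetical and not part of the project's API, and the AVX2 path is illustrative rather than tuned.

```rust
/// Portable scalar path: squared L2 distance between two equal-length slices.
fn l2_sq_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// AVX2 path for one 8-float sub-vector.
///
/// # Safety
/// The caller must ensure AVX2 is available on the running CPU.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn l2_sq_avx2(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    use std::arch::x86_64::*;
    // SAFETY: both arrays hold exactly 8 f32s, so the unaligned 256-bit
    // loads and the store below stay in bounds.
    unsafe {
        let d = _mm256_sub_ps(_mm256_loadu_ps(a.as_ptr()), _mm256_loadu_ps(b.as_ptr()));
        let sq = _mm256_mul_ps(d, d);
        // Spill the 8 lanes and add them up (simple, not the fastest reduction).
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), sq);
        lanes.iter().sum()
    }
}

/// Dispatcher: use the SIMD path when compiled for x86_64 and the CPU
/// supports it, otherwise fall back to the scalar path.
pub fn l2_sq_sub(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 presence was just checked at runtime.
            return unsafe { l2_sq_avx2(a, b) };
        }
    }
    l2_sq_scalar(a, b)
}
```

Keeping the scalar function callable on every target also gives the `#[cfg(test)]` property checks something to compare against.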
## What you CANNOT do
- Do **not** modify `src/lib.rs`. It defines `DIM` / `NUM_SUB_VECTORS` / `NUM_CENTROIDS` /
  `TOP_K`, which pin the fixture geometry.
- Do **not** modify `src/bin/run_experiment.rs`, `src/reference.rs`, `src/fixture.rs`,
  `benches/pq_l2.rs`, `scripts/prepare_fixtures.sh`, or `Cargo.toml`.
- Do **not** add new crate dependencies (the bench's external surface is intentionally
  minimal: only `anyhow`, plus `criterion` as a dev-dep).
- Do **not** delete or alter the public API of `kernels.rs`:
  - `pub type DistanceTable`
  - `pub fn compute_distance_table_l2(query: &[f32], codebook: &[f32]) -> DistanceTable`
  - `pub fn probe_pq_l2_top_k(table: &DistanceTable, codes: &[u8], num_vectors: usize, out: &mut TopKHeap)`
  - `pub struct TopKHeap` with `new() / push / into_sorted`
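
For orientation, a `new() / push / into_sorted` surface like the one listed above could be backed by a bounded max-heap. The sketch below is not the project's `TopKHeap`: the names `BoundedTopK` and `Candidate`, the `(dist, id)` arguments of `push`, and the return type of `into_sorted` are all assumptions, and `TOP_K = 10` is inferred from the `recall_at_10` metric.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

const TOP_K: usize = 10; // assumed from recall@10

struct Candidate {
    dist: f32,
    id: u32,
}

// Total order on (dist, id) so f32 distances can live in a BinaryHeap.
impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        self.dist.total_cmp(&other.dist).then(self.id.cmp(&other.id))
    }
}
impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Candidate {}

/// Max-heap holding at most TOP_K candidates; the root is the current worst.
pub struct BoundedTopK {
    heap: BinaryHeap<Candidate>,
}

impl Default for BoundedTopK {
    fn default() -> Self {
        Self::new()
    }
}

impl BoundedTopK {
    pub fn new() -> Self {
        Self { heap: BinaryHeap::with_capacity(TOP_K) }
    }

    pub fn push(&mut self, dist: f32, id: u32) {
        if self.heap.len() < TOP_K {
            self.heap.push(Candidate { dist, id });
        } else if let Some(mut worst) = self.heap.peek_mut() {
            // Cheap rejection: most candidates fail this compare and never sift.
            if dist < worst.dist {
                *worst = Candidate { dist, id }; // re-sifted when `worst` drops
            }
        }
    }

    /// Results in ascending distance order.
    pub fn into_sorted(self) -> Vec<(f32, u32)> {
        self.heap.into_sorted_vec().into_iter().map(|c| (c.dist, c.id)).collect()
    }
}
```

Whatever the real internals look like, the signatures listed above must stay unchanged; only the body behind them is fair game.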
## The metric
Minimize `geomean_ns_per_query` (geometric mean of per-query wall-clock from the
benched queries, rounded to a u64 ns) subject to:
1. `recall_at_10 >= baseline_recall_at_10 - 0.005`
2. `recall_at_10 >= 0.50` (hard floor; below this the bench exits non-zero)
3. `total_seconds <= 600`
4. Build is clean: `cargo build --release` succeeds and
`cargo clippy --release --all-targets -- -D warnings` reports zero issues.
(Run both before each commit.)
Ties break toward simpler code. If two kernels report the same speed within
noise (~3%), prefer the one with fewer lines or less `unsafe`.
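
To make the headline metric concrete, here is one way a geometric mean of per-query timings can be computed and rounded to whole nanoseconds. This is a sketch, not the harness's code: `geomean_ns` is a hypothetical name, and the log-space formulation is an assumption about how one would avoid overflow.

```rust
/// Geometric mean of per-query timings, rounded to whole nanoseconds.
/// Computed as exp(mean(ln(x))) so the running product cannot overflow.
fn geomean_ns(samples_ns: &[u64]) -> u64 {
    assert!(!samples_ns.is_empty(), "need at least one sample");
    let mean_ln = samples_ns
        .iter()
        // Clamp to 1 ns so a zero reading does not produce ln(0) = -inf.
        .map(|&ns| (ns.max(1) as f64).ln())
        .sum::<f64>()
        / samples_ns.len() as f64;
    mean_ln.exp().round() as u64
}
```

Unlike an arithmetic mean, the geometric mean is not dominated by a few slow outlier queries, so a kernel change has to speed up the typical query to move it.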
## Lance-PQ-specific priors
These are the directions known to pay off on this kernel shape. Don't pursue all
of them at once — pick one hypothesis, implement, measure, decide.
- **Codebook layout for the table-build step.** The reference layout is
  `[m][k][d]`. For a fixed query, iterating over centroids stays in cache, but
  the inner loop over `d` is short (8 floats). An `[m][d][k]` transpose makes
  `k` contiguous, so you can SIMD-load one `d` component of 8 centroids,
  broadcast the matching query element, and accumulate 8 squared
  `(query - centroid)` sums in parallel with no horizontal reduction.
- **LUT packing for the probe step.** The probe is dominated by `acc +=
  table[m][codes[off+m]]` × 16. Two well-known tricks:
  - Pack each `table[m]` row into 256 × `f16` or 256 × `u8` (quantized post-build)
    to fit the LUT in cache and enable `vpgatherdd` / `pshufb`. Quantizing the
    LUT is lossy, so re-check recall.
  - Reorder code storage to `[m][i]` (transpose codes by sub-quantizer) so each
    `m` step reads the codes of up to 32 vectors contiguously.
- **Top-K integration.** `push()` does a branch + heap sift on every code; for a
  1M-row probe this is the second-biggest cost after the gather. Consider:
  - Skip the heap entirely when the running `acc` is already `> current_max`,
    the worst distance held by a full heap (early termination, but only if your
    accumulator order makes that cheap).
  - Block the probe (e.g., 1024 codes at a time), find the local top-K with a
    branchless scan, then merge into the global heap.
- **Prefetch.** A
  `_mm_prefetch::<_MM_HINT_T0>(codes.as_ptr().wrapping_add(off + 64).cast())`
  ahead of the gather is usually a pure win at 1M scale, where the codes don't
  all fit in L2.
- **FMA in the table build.** The diff-square-sum sequence is
  `(q - c)·(q - c)` per element, which expands to `q*q - 2*q*c + c*c`. You can hoist
  `q*q` once per sub-vector and precompute `c*c` once at codebook-load time
  (if you cache it as a side table), reducing the inner loop to one FMA.
  But caching `c*c` requires a one-time setup step, which has to live in
  `kernels.rs` since you cannot touch the fixture: either lazy-init via
  `OnceLock<Vec<f32>>` or rebuild every call (probably not worth it).
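
The norm expansion from the last bullet can be sketched as below, next to the direct form it replaces. This assumes a flat `[m][k][d]` codebook of `f32` and a flat `Vec<f32>` table; the real `DistanceTable` type is not shown in this file, so the return type, the constants, and the function names `table_direct`, `centroid_norms`, and `table_expanded` are all illustrative.

```rust
const NUM_SUB_VECTORS: usize = 16;
const NUM_CENTROIDS: usize = 256;
const SUB_DIM: usize = 8; // 128 dims / 16 sub-vectors

/// Direct form: table[m][k] = sum_d (q - c)^2, codebook laid out [m][k][d].
fn table_direct(query: &[f32], codebook: &[f32]) -> Vec<f32> {
    let mut table = vec![0.0f32; NUM_SUB_VECTORS * NUM_CENTROIDS];
    for m in 0..NUM_SUB_VECTORS {
        let q = &query[m * SUB_DIM..(m + 1) * SUB_DIM];
        for k in 0..NUM_CENTROIDS {
            let idx = m * NUM_CENTROIDS + k;
            let c = &codebook[idx * SUB_DIM..(idx + 1) * SUB_DIM];
            table[idx] = q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum::<f32>();
        }
    }
    table
}

/// One-time side table: norms[m][k] = sum_d c^2.
fn centroid_norms(codebook: &[f32]) -> Vec<f32> {
    codebook
        .chunks_exact(SUB_DIM)
        .map(|c| c.iter().map(|x| x * x).sum::<f32>())
        .collect()
}

/// Expanded form: ||q||^2 - 2 q.c + ||c||^2, one mul_add per element.
fn table_expanded(query: &[f32], codebook: &[f32], norms: &[f32]) -> Vec<f32> {
    let mut table = vec![0.0f32; NUM_SUB_VECTORS * NUM_CENTROIDS];
    for m in 0..NUM_SUB_VECTORS {
        let q = &query[m * SUB_DIM..(m + 1) * SUB_DIM];
        let qq: f32 = q.iter().map(|x| x * x).sum(); // hoisted per sub-vector
        for k in 0..NUM_CENTROIDS {
            let idx = m * NUM_CENTROIDS + k;
            let c = &codebook[idx * SUB_DIM..(idx + 1) * SUB_DIM];
            let dot = q.iter().zip(c).fold(0.0f32, |acc, (a, b)| a.mul_add(*b, acc));
            // Cancellation can leave a tiny negative value; clamp it to zero.
            table[idx] = (qq - 2.0 * dot + norms[idx]).max(0.0);
        }
    }
    table
}
```

The expanded form is only equal to the direct form up to rounding, so a property check with a tolerance (and a recall re-check) is a reasonable guard before keeping it. Note also that `mul_add` only compiles to a hardware FMA when the target supports it; otherwise it can be slower than a separate multiply and add.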
## The loop
Once setup is done, repeat indefinitely:
1. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas
have been tried, what won, what regressed. Form a hypothesis with one
sentence stating the change and the predicted effect on speed and recall.
2. **Edit `src/kernels.rs`.** Keep the diff focused on the one hypothesis.
3. **Build and lint.** Run:
```
cargo build --release
cargo clippy --release --all-targets -- -D warnings
```
If either fails, fix and try again — do not commit broken state.
4. **Run the trial.**
```
cargo run --release --bin run_experiment > run.log 2>&1
```
5. **Parse the result.** Extract `recall_at_10`, `geomean_ns_per_query`,
`peak_mem_mb`, `total_seconds` from `run.log`. Compute the deltas vs. baseline.
6. **Decide keep or revert.**
   - **Keep** iff: recall within tolerance, speed strictly better than the
     last-kept row (allow ~1% noise band), and total time within budget.
   - **Revert** otherwise: `git restore src/kernels.rs` (or commit and `git
     revert` if you want the revert in history). Note what failed.
7. **Log.** Append one row to `results.tsv`:
```
<short_sha> <iso8601> <source> <num_base> <recall> <geomean_ns> <peak_mem> <elapsed> <keep|revert> <one-line description>
```
8. **Commit.** Use a one-line message describing the change and the headline
number, e.g. `transpose codebook; 184k → 142k ns/query (recall 0.94)`.
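
The keep/revert rule in step 6 can be written down as a small predicate. This is a sketch under one reading of the "~1% noise band", namely that a trial must beat the last-kept row by more than 1% to count as a win; `Trial`, `Decision`, and `decide` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    Keep,
    Revert,
}

struct Trial {
    recall_at_10: f64,
    geomean_ns_per_query: u64,
    total_seconds: f64,
}

fn decide(baseline_recall: f64, last_kept_ns: u64, t: &Trial) -> Decision {
    // Recall: within 0.005 of the seeded baseline and never below the hard floor.
    let recall_ok = t.recall_at_10 >= baseline_recall - 0.005 && t.recall_at_10 >= 0.50;
    // Speed: ASSUMED reading of the noise band, i.e. more than 1% faster.
    let faster = (t.geomean_ns_per_query as f64) < (last_kept_ns as f64) * 0.99;
    // Budget: the whole trial must finish within 600 seconds.
    let in_budget = t.total_seconds <= 600.0;
    if recall_ok && faster && in_budget {
        Decision::Keep
    } else {
        Decision::Revert
    }
}
```

Comparing against the last-kept row rather than the original baseline means each kept change has to improve on the current best, not merely on the starting point.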
## Hygiene
- Always commit `src/kernels.rs` changes; never commit `results.tsv` or `run.log`
(they're gitignored).
- If a change fails to build, do not commit. Iterate until it builds, or revert
cleanly.
- If two consecutive ideas regress, take a beat: re-read the last ~10 rows of
`results.tsv` and update your mental model before proposing the next.
- Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min, kill it
and mark the trial as `timeout`.
## Never stop
Keep going until interrupted. Each loop iteration is one hypothesis, one edit,
one measurement, one commit. No multi-step plans across iterations.