research: restructure lance-autoresearch as multi-target workspace

The original lance-autoresearch was one Cargo crate optimizing one Lance
kernel (PQ L2 distance). With 9+ candidate targets enumerated in the research
note, a single-crate shape doesn't scale: per-target deps will collide, the
agent's edits to one target's kernels.rs would conflict with another's lib
path, and build/test isolation is lost. Restructure into a Cargo workspace.

Layout:

  research/lance-autoresearch/
  ├── Cargo.toml          (workspace root)
  ├── README.md           (target table, contract overview, repo layout)
  ├── HARNESS.md          (universal loop contract every target inherits)
  ├── crates/
  │   ├── harness-common/ (shared: SplitMix64, geomean, peak RSS,
  │   │                    MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS)
  │   └── pq-l2/          (the landed target; was the previous single crate)
  └── docs/
      ├── design.md           (rationale for workspace shape, no Target trait)
      ├── adding-a-target.md  (step-by-step workflow for new targets)
      └── targets/pq-l2.md    (per-target capsule)

Decisions documented in docs/design.md:

- Workspace, not single crate: per-target Cargo.toml so deps don't collide;
  per-target src tree so agent edits don't conflict; per-target build/test
  isolation for faster agent iteration.
- harness-common as a plumbing-only crate (PRNG, geomean, peak RSS, tolerance
  constants, time budget). Intentionally NO Target trait - decode kernel
  signatures and distance kernel signatures differ enough that a unifying
  trait would either bloat or require erased boxing. Each target is its own
  natural shape.
- Per-target program.md + shared HARNESS.md: the loop contract is universal,
  the priors and API spec are per-target. Two files instead of one because
  copy-pasting the universal loop into every program.md would drift.

pq-l2 refactor:
- src/* moved into crates/pq-l2/src/* via git mv (preserves history)
- crate renamed lance-autoresearch -> pq-l2
- SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL,
  TIME_BUDGET_SECS now imported from harness-common (drops ~70 lines of
  duplication that would have been copy-pasted into every new target)
- program.md trimmed: setup/loop/hygiene moved to HARNESS.md; only the
  PQ-L2-specific API contract and SIMD priors remain
- Cargo.toml depends on harness-common via path; workspace.dependencies
  pins criterion uniformly across targets

The candidate targets from the research note (A1 cosine/dot/hamming, A2
IVF partition select, A3 FTS BM25, A4 bitpack decode, A5 dictionary decode,
A6 FSST decode, A7 take/gather, A8 predicate eval, A9 posting list intersect,
A10 top-K merge) are listed in README.md's target table as "candidate"; each
gets a docs/targets/<name>.md capsule when it's spun up. docs/adding-a-target.md
documents the cp -r + edit-Cargo.toml + rewrite-three-files workflow.

Verified end-to-end:
- cargo build --release: clean, both crates compile
- cargo clippy --release --workspace --all-targets -- -D warnings: clean
- cargo test --release --workspace: 6/6 pass (4 harness-common + 2 pq-l2)
- cargo run --release --bin run_experiment -p pq-l2: correctness pass,
  geomean ~880k ns, exit 0, ~30s wall-clock
- omnigraph parent workspace unchanged (research/ excluded as before)

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
Claude 2026-05-15 00:15:02 +00:00
parent 92ce8f1e7f
commit 0d72cc69fb
21 changed files with 1012 additions and 366 deletions

@@ -1,32 +1,14 @@
# Empty `[workspace]` section so cargo treats this directory as its own
# workspace root and does NOT walk up to the parent omnigraph workspace.
# Without this, cargo from inside `research/lance-autoresearch/` will try to
# resolve omnigraph's dependencies even though we're excluded as a member.
[workspace]
resolver = "2"
members = [
"crates/harness-common",
"crates/pq-l2",
]
[package]
name = "lance-autoresearch"
version = "0.1.0"
edition = "2024"
license = "MIT OR Apache-2.0"
description = "Autoresearch-style harness for evolving Lance PQ L2 distance kernels via LLM agents."
publish = false
[lib]
path = "src/lib.rs"
[[bin]]
name = "run_experiment"
path = "src/bin/run_experiment.rs"
[[bench]]
name = "pq_l2"
harness = false
[dependencies]
# Each per-target crate sets its own deps. Shared deps below pin versions
# uniformly across targets so the workspace lockfile stays clean.
[workspace.dependencies]
anyhow = "1"
[dev-dependencies]
criterion = { version = "0.5", default-features = false, features = ["plotters", "cargo_bench_support"] }
[profile.release]

@@ -0,0 +1,137 @@
# HARNESS — shared loop contract for every lance-autoresearch target
This document is the universal part of every target's agent instructions. Each
target's `program.md` is a thin layer of *target-specific priors and API spec*
on top of the conventions below. The agent reads `HARNESS.md` and the target's
`program.md` at the start of every session.
## What this harness is
A single agent (you) edits one file in one target crate to optimize a Lance
kernel. Per trial, you build, run a binary that exercises the kernel against
diverse inputs, parse a fixed-format output block, and decide keep-or-revert.
This is a Karpathy-style autoresearch loop. It assumes:
- Per-trial eval is **seconds-scale**. Long enough to measure, short enough to
iterate hundreds of times in a session.
- The kernel has a **deterministic correctness oracle**: a scalar reference
implementation whose output every candidate kernel is compared against.
- The optimization target is **dataset-independent**: the harness generates
diverse inputs each trial, so wins generalize across distributions and
shapes by construction.
Targets that don't fit these constraints (index-build parameter tuning,
plan-patching, anything where eval is minutes-to-hours) belong in the
BauplanLabs tournament-loop shape, not this harness. See `docs/design.md` for
the boundary.
## What's editable, per target
| Path | Mutability | Why |
|---|---|---|
| `crates/<target>/src/kernels.rs` | **mutable** | Your playground. The whole point. |
| `crates/<target>/src/reference.rs` | immutable | The oracle. Touching it makes wins meaningless. |
| `crates/<target>/src/inputs.rs` | immutable | The fixture generator. Touching it makes timings incomparable across trials. |
| `crates/<target>/src/lib.rs` | immutable | Shared types pinned by the bench (`PqShape` etc.). |
| `crates/<target>/src/bin/run_experiment.rs` | immutable | The trial harness. |
| `crates/<target>/benches/*.rs` | immutable | Criterion bench, optional read-only reference. |
| `crates/<target>/Cargo.toml` | immutable | Adding deps changes the optimization target. |
| `crates/<target>/program.md` | human-iterated between runs | Not edited by you in-loop; the human refines it. |
| `crates/<target>/results.tsv` | append-only | Your audit log. Gitignored. |
| `crates/harness-common/**` | immutable | Workspace-shared infrastructure. |
| `HARNESS.md` (this file) | immutable | Workspace-shared loop contract. |
You may add `#[cfg(test)] mod tests { ... }` inside `kernels.rs` for in-file
property checks. You may NOT add new crate dependencies. You may NOT use
tricks that are sound only under undocumented assumptions (e.g., `unsafe` code
that relies on a fixture invariant that holds today but isn't documented).
## The metric
Every target's `run_experiment` binary prints a fixed-format output block ending
with these universal fields:
- `correctness:` is `pass` or `fail`. Set by comparing your kernel against the
scalar reference on every input the bench generates.
- `geomean_ns_per_*:` — geometric mean of per-operation wall-clock across all
timed operations.
- `worst_ns_per_*:` — slowest combo's geomean.
- `peak_mem_mb:` — process RSS high-water-mark.
- `total_seconds:` — trial wall-clock.
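As a rough illustration of the parsing step, the sketch below reads `key: value` lines from a trial log. The helper names `parse_fields` and `numeric` are hypothetical, and the sample uses the `geomean_ns_per_query` form of the field names; each target's real output block may differ.

```rust
use std::collections::HashMap;

/// Collect top-level `key: value` lines from a trial log into a map.
/// Indented lines (assumed to be per-combo breakdown rows) are skipped.
fn parse_fields(log: &str) -> HashMap<String, String> {
    let mut fields = HashMap::new();
    for line in log.lines() {
        if line.starts_with(char::is_whitespace) {
            continue;
        }
        if let Some((key, value)) = line.split_once(':') {
            fields.insert(key.trim().to_string(), value.trim().to_string());
        }
    }
    fields
}

/// Read the leading numeric token of a field, ignoring trailing annotations
/// such as the combo name that may follow a worst-case timing.
fn numeric(fields: &HashMap<String, String>, key: &str) -> Option<f64> {
    fields.get(key)?.split_whitespace().next()?.parse().ok()
}

fn main() {
    let log = "correctness: pass\n\
               geomean_ns_per_query: 18234\n\
               worst_ns_per_query: 24515 ((768,96,256), sparse)\n\
               peak_mem_mb: 28.4\n\
               total_seconds: 12.3\n";
    let fields = parse_fields(log);
    println!("correctness = {:?}", fields.get("correctness"));
    println!("geomean     = {:?}", numeric(&fields, "geomean_ns_per_query"));
    println!("worst       = {:?}", numeric(&fields, "worst_ns_per_query"));
}
```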
A kernel is **kept** iff:
1. `correctness: pass` (any failure → `std::process::exit(2)`).
2. `geomean_ns_per_*` strictly better than the previous best-kept kernel
(allow ~1% noise band).
3. `worst_ns_per_*` ≤ 1.05 × the previous best-kept kernel's worst.
4. `total_seconds` ≤ 600 (the per-trial cap; exceed it → `std::process::exit(3)`).
5. Build clean: `cargo build --release` and
`cargo clippy --release --all-targets -- -D warnings` both succeed.
Ties break toward simpler code: same speed within ~3% noise → fewer lines /
less `unsafe` wins.
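The keep criteria can be sketched as a single predicate. This is an illustration only: the `Trial` struct and `keep` function are hypothetical names, the build/clippy criterion is not modeled, and treating the noise band as "the geomean must improve by more than the band" is one possible reading of the rule, not something the contract spells out.

```rust
/// Universal fields of one trial, already parsed (hypothetical struct).
#[derive(Clone, Copy, Debug)]
struct Trial {
    correct: bool,
    geomean_ns: f64,
    worst_ns: f64,
    total_seconds: f64,
}

const WORST_GUARD: f64 = 1.05; // worst-case may regress by at most 5%
const TIME_BUDGET_SECS: f64 = 600.0; // per-trial cap

/// Decide whether `candidate` replaces `best`.
/// `noise_band` is a fraction (0.01 = 1%); this reading of the band is an
/// assumption made for the sketch.
fn keep(candidate: &Trial, best: &Trial, noise_band: f64) -> bool {
    candidate.correct
        && candidate.geomean_ns < best.geomean_ns * (1.0 - noise_band)
        && candidate.worst_ns <= WORST_GUARD * best.worst_ns
        && candidate.total_seconds <= TIME_BUDGET_SECS
}

fn main() {
    let best = Trial { correct: true, geomean_ns: 18234.0, worst_ns: 24515.0, total_seconds: 12.3 };
    let faster = Trial { geomean_ns: 14100.0, worst_ns: 22554.0, ..best };
    let worse_tail = Trial { geomean_ns: 14100.0, worst_ns: 26000.0, ..best };
    println!("faster kept: {}", keep(&faster, &best, 0.01));
    println!("worse tail kept: {}", keep(&worse_tail, &best, 0.01));
}
```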
## The loop
After reading `HARNESS.md` and the target's `program.md`:
1. **Setup (once per session).** Confirm `results.tsv` exists; if not, create
it with a per-target header (the target's `program.md` defines the columns).
Run the baseline trial:
```
cargo run --release --bin run_experiment -p <target> > run.log 2>&1
```
Append a row tagged `keep=baseline` and commit it.
2. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas
have been tried, what won, what regressed. Form one hypothesis with one
sentence stating the change and the predicted effect on speed and
correctness.
3. **Edit `kernels.rs`.** Keep the diff focused on the one hypothesis.
4. **Build and lint.**
```
cargo build --release
cargo clippy --release --all-targets -- -D warnings
```
If either fails, fix and retry. Do not commit broken state.
5. **Run the trial.**
```
cargo run --release --bin run_experiment -p <target> > run.log 2>&1
```
6. **Parse and decide.** Extract the universal fields plus any per-target
fields. Compute deltas vs. the last-kept row. Apply the keep criteria above.
7. **Log.** Append one row to `results.tsv` matching the target's header.
8. **Commit.** One-line message describing the change and the headline number,
e.g. `transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)`.
9. **Hygiene.**
- Always commit `kernels.rs` changes; never commit `results.tsv` or
`run.log` (gitignored).
- If a change fails to build, do not commit. Iterate or revert cleanly.
- If two consecutive ideas regress, take a beat: re-read the last ~10 rows
and update your mental model before proposing the next.
- Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min,
kill it and mark the trial as `timeout`.
## Never stop
Keep going until interrupted. Each loop iteration is one hypothesis, one edit,
one measurement, one commit. No multi-step plans across iterations.
## Working across multiple targets
Work on **one target per session**, even when several targets are in flight.
Don't edit `kernels.rs` in two crates between commits: the agent's mental model
is shared, but the keep-decision is per-target. Pick a target, do a session
there, commit, switch.
The human is responsible for selecting which target to work on next. Don't
proactively switch targets unless the user asks.

@@ -1,112 +1,143 @@
# lance-autoresearch
An autoresearch-style harness for evolving [Lance](https://github.com/lance-format/lance)
PQ L2 distance kernels via LLM coding agents (Claude Code, Codex, Cursor).
Modeled on Andrej Karpathy's
A multi-target workspace for evolving [Lance](https://github.com/lance-format/lance)
hot-path kernels via LLM coding agents (Claude Code, Codex, Cursor),
in the style of Andrej Karpathy's
[`nanochat-research`](https://x.com/karpathy/status/1855651423497650238)
three-file contract:
single-agent autoresearch loop.
- **Immutable bench**: `src/bin/run_experiment.rs` + `src/inputs.rs` +
`src/reference.rs`. The agent cannot touch these.
- **Mutable kernel**: `src/kernels.rs`. The agent's playground. Starts as a
scalar baseline matching Lance's PQ L2 algorithm shape; the agent's job is to
beat it.
- **Human-iterated program**: `program.md`. The "skill" the agent reads at
the start of every session. The human refines it between runs.
Each target is an independent Rust crate under `crates/`:
| Target | Status | Lance source area | What's optimized |
|---|---|---|---|
| [`crates/pq-l2`](crates/pq-l2) | landed | `lance-linalg::distance::l2`, PQ probe | PQ L2 distance: build LUT, probe codes, top-K |
| `crates/pq-cosine` | candidate (A1) | `lance-linalg::distance::cosine` | PQ cosine distance |
| `crates/pq-dot` | candidate (A1) | `lance-linalg::distance::dot` | PQ dot-product distance |
| `crates/ivf-partition` | candidate (A2) | `lance-index::vector::ivf` partition select | IVF partition selection (centroid scan) |
| `crates/fts-bm25` | candidate (A3) | `lance-index::scalar::inverted` BM25 | FTS BM25 scoring inner loop |
| `crates/bitpack` | candidate (A4) | `lance-encoding::encodings::bitpack` | Bitpack integer decode |
| `crates/dictionary` | candidate (A5) | `lance-encoding::encodings::dictionary` | Dictionary decode |
| `crates/fsst` | candidate (A6) | `lance-encoding::encodings::fsst` | FSST string decode |
| `crates/take` | candidate (A7) | `lance-core::utils::take` | Take / gather kernel |
| `crates/predicate` | candidate (A8) | `lance-datafusion` filter eval | Predicate evaluation kernels |
| `crates/posting-intersect` | candidate (A9) | `lance-index::scalar::inverted` | Posting list intersection (FTS AND) |
| `crates/topk-merge` | candidate (A10) | scan-merge | Top-K k-way merge |
The candidate targets are documented in [`docs/targets/`](docs/targets/) and can
be added by following [`docs/adding-a-target.md`](docs/adding-a-target.md). The
single landed target (`pq-l2`) proves the harness shape; the candidates wait
for an agent to spin them up.
## The contract every target follows
Karpathy's three-file shape, applied per target:
| File (per target crate) | Mutability | Edited by |
|---|---|---|
| `src/kernels.rs` | **mutable** | the agent |
| `src/reference.rs`, `src/inputs.rs`, `src/lib.rs`, `src/bin/run_experiment.rs`, `benches/*.rs` | immutable | — |
| `program.md` | human-iterated | the human, between runs |
| `results.tsv` | append-only | the agent, per trial (gitignored) |
The shared utilities — deterministic PRNG, geomean, peak-RSS readback,
tolerance constants, time-budget — live in [`crates/harness-common`](crates/harness-common/src/lib.rs)
and are consumed by every target. There is intentionally **no `Target` trait**:
decode-kernel signatures and distance-kernel signatures are different enough
that a unifying trait would either bloat or require erased boxing. Each target
is its own natural shape; the shared crate is plumbing only.
The shared loop conventions every target's `program.md` inherits live in
[`HARNESS.md`](HARNESS.md). Per-target priors and API specifics live in each
target's own `program.md`.
## Dataset-independent by design
Every other ANN benchmark you've seen is "compete on this fixed dataset"
(SIFT1M, GIST1M, DEEP1B). That conflates two things: *kernel correctness*
(the math) and *kernel speed under one specific data distribution*. An LLM
agent given recall@K as the oracle has incentive to overfit to the dataset's
quirks.
(SIFT1M, GIST1M, DEEP1B). That conflates two things: *kernel correctness* (the
math) and *kernel speed under one specific data distribution*. An LLM agent
given recall@K as the oracle has incentive to overfit to the dataset's quirks.
We split them:
We split them, every target:
- **Correctness** = bit-equivalent (`max_abs_err ≤ 1e-4`) match to a scalar
reference kernel, on diverse generated inputs (Gaussian, uniform, sparse,
large-dynamic-range, mostly-zero) × multiple PQ shapes. This is mathematical
equivalence; there's no dataset to overfit. Lossy techniques fail this gate.
- **Speed** = geomean ns/query across multiple PQ shapes ×
multiple data distributions. A kernel that wins on one distribution and
regresses on another fails the worst-case guard.
- **Correctness** = bit-equivalent (`max_abs_err ≤ 1e-4` for floats; bitwise for
integer/byte kernels) match to a scalar reference, on diverse generated
inputs. Mathematical equivalence; no dataset to overfit. Lossy techniques fail
this gate.
- **Speed** = geomean ns/operation across multiple shape × distribution
combinations, with worst-case guard. A kernel that wins on one distribution
and regresses on another fails to keep.
By construction, an "improvement" generalizes across distributions and shapes.
There is no `wget sift.tar.gz` step; the harness is fully self-contained.
There is no `wget sift.tar.gz` step; every target is fully self-contained.
## Why a separate repo
## Why a separate repo (and a workspace, not a single crate)
OmniGraph (the graph engine that motivated this) pins Lance at a released
version and consumes its kernels via the public crate API. Improvements live one
layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps the
optimization target pure (only the kernel changes), keeps the license clean for
upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and
keeps the agent's working set tiny.
version and consumes its kernels via the public crate API. Improvements live
one layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps
the optimization target pure (only the kernel changes), keeps the license clean
for upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and
keeps each agent's working set tiny.
**Workspace not single-crate** because per-target deps differ — FSST decode
will want a different dependency set than PQ kernels — and the agent's edits
to one target's `kernels.rs` must not collide with another's lib path. Each
target is buildable, testable, and runnable in isolation: `cd crates/<target>
&& cargo run --release --bin run_experiment`.
## Quick start
```bash
cargo run --release --bin run_experiment
# Run the landed PQ L2 target's baseline.
cargo run --release --bin run_experiment -p pq-l2
# Or run with Claude Code / Codex:
# Open the repo in your agent of choice and prompt:
# Hi, have a look at program.md and let's kick off a new experiment.
# Or with Claude Code / Codex, working on one target:
cd crates/pq-l2
# Open in your agent of choice and prompt:
# Hi, have a look at program.md and let's kick off a new experiment.
# Add a new target (see docs/adding-a-target.md):
cp -r crates/pq-l2 crates/pq-cosine
# ... edit Cargo.toml name, kernels.rs / reference.rs / inputs.rs / program.md
```
## File ownership
| File | Mutability | Edited by |
|---|---|---|
| `src/kernels.rs` | **mutable** | the agent |
| `src/bin/run_experiment.rs` | immutable | — |
| `src/reference.rs` | immutable | — |
| `src/inputs.rs` | immutable | — |
| `src/lib.rs` | immutable (shared types) | — |
| `benches/pq_l2.rs` | immutable | — |
| `program.md` | human-iterated | the human, between runs |
| `results.tsv` | append-only | the agent, per trial (gitignored) |
## The metric
`run_experiment` runs two phases per trial: a correctness check and a
multi-shape × multi-distribution speed measurement. Output looks like:
## Repo layout
```
correctness: pass
---
correctness: pass
shapes_tested: (128,16,256) (256,16,256) (768,96,256)
distributions_tested: clustered uniform sparse
geomean_ns_per_query: 18234
worst_ns_per_query: 24515 ((768,96,256), sparse)
best_ns_per_query: 12876 ((128,16,256), clustered)
per_combo_geomean_ns:
(128,16,256) clustered -> 12876 ns
(128,16,256) uniform -> 13441 ns
...
peak_mem_mb: 28.4
total_seconds: 12.3
lance-autoresearch/
├── Cargo.toml # workspace root
├── README.md # you are here
├── HARNESS.md # shared loop contract every target inherits
├── LICENSE-MIT, LICENSE-APACHE # dual-licensed (Apache compat for Lance PRs)
├── crates/
│ ├── harness-common/ # shared: SplitMix64, geomean, peak RSS, tolerance, time budget
│ │ └── src/{lib,prng,stats,sysinfo,tolerance}.rs
│ └── pq-l2/ # landed target
│ ├── Cargo.toml
│ ├── program.md # this target's agent skill
│ ├── src/
│ │ ├── lib.rs # PqShape + module wiring (immutable)
│ │ ├── kernels.rs # MUTABLE — agent's playground
│ │ ├── reference.rs # IMMUTABLE — scalar reference, oracle helpers
│ │ ├── inputs.rs # IMMUTABLE — diverse test-data generators
│ │ └── bin/run_experiment.rs # IMMUTABLE — per-trial entry point
│ └── benches/pq_l2.rs # criterion benchmark (immutable)
└── docs/
├── design.md # rationale for the workspace shape
├── adding-a-target.md # workflow for spinning up a new target
└── targets/
└── pq-l2.md # capsule: upstream Lance pointers, oracle, status
```
A kernel is "kept" iff:
- Correctness phase passes (mathematical equivalence to scalar reference)
- `geomean_ns_per_query` strictly better than the previous best-kept kernel
- `worst_ns_per_query` ≤ 1.05 × the previous best-kept kernel's worst
- `total_seconds` ≤ 600
See `program.md` for the full loop spec.
## Upstream contribution path
When a commit clears the keep bar by a meaningful margin (≥10% geomean
speedup with worst-case guard intact), the human reviews the diff, ports the
technique against [`lance-format/lance`](https://github.com/lance-format/lance)
HEAD, runs Lance's own test suite, and opens a PR. Because `src/kernels.rs` is
dual MIT/Apache-2.0 licensed and algorithmically modeled on Lance's existing
path, the upstream PR inherits Apache-2.0 cleanly.
When a commit on any target clears the keep bar by a meaningful margin
(≥10% geomean speedup with worst-case guard intact), the human reviews the
diff, ports the technique against
[`lance-format/lance`](https://github.com/lance-format/lance) HEAD, runs
Lance's own test suite, and opens a PR. Because the workspace is dual
MIT/Apache-2.0 licensed and each target's kernel is algorithmically modeled on
Lance's existing path, the upstream PR inherits Apache-2.0 cleanly.
## License

@@ -0,0 +1,10 @@
[package]
name = "harness-common"
version = "0.1.0"
edition = "2024"
license = "MIT OR Apache-2.0"
description = "Shared utilities for lance-autoresearch per-target harnesses (PRNG, geomean, peak RSS, tolerance constants, time budget)."
publish = false
[lib]
path = "src/lib.rs"

@@ -0,0 +1,36 @@
//! Shared utilities for lance-autoresearch per-target harnesses.
//!
//! Each target crate (`pq-l2`, future `pq-cosine`, `bitpack-decode`, etc.)
//! defines its own `kernels.rs` (mutable, the agent's playground), `reference.rs`
//! (immutable scalar reference), `inputs.rs` (immutable test-data generators),
//! and `bin/run_experiment.rs` (immutable per-trial entry point). They all need
//! the same handful of building blocks: a deterministic PRNG, a geomean
//! aggregator, peak-RSS readback, tolerance constants for the bit-exact oracle,
//! and a single shared time-budget constant. That's everything in this crate.
//!
//! What is **not** here, and intentionally not abstracted:
//!
//! - A `Target` trait. Decode kernels (`bitpack`, `dictionary`, `FSST`) have
//! very different signatures than distance kernels (`PqKernel::probe_top_k`),
//! and forcing them into one trait shape would either bloat the trait or
//! require erased boxing. Keep each target's API natural to its kernel.
//!
//! - Output-format orchestration. Each target's `run_experiment.rs` prints its
//! own fixed-format result block — different targets report different
//! per-combo dimensions (PQ shapes vs bit widths vs distribution kinds vs ...).
//! Sharing the format would make the per-target binaries less readable and
//! gain very little — `println!` is cheap.
pub mod prng;
pub mod stats;
pub mod sysinfo;
pub mod tolerance;
pub use prng::SplitMix64;
pub use stats::geomean;
pub use sysinfo::peak_rss_mb;
pub use tolerance::{MAX_ABS_ERR, TOPK_DIST_TOL};
/// Per-trial wall-clock cap. Targets should `std::process::exit(3)` if exceeded
/// so the agent's loop logs the trial as a timeout instead of a measurement.
pub const TIME_BUDGET_SECS: u64 = 600;
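A minimal sketch of how a target's binary might map a trial outcome to the exit codes named in the contract (0 on success, 2 on correctness failure, 3 on timeout). The `exit_code` helper and the precedence of a correctness failure over a timeout are assumptions for illustration, not part of the shared crate.

```rust
use std::time::{Duration, Instant};

const TIME_BUDGET_SECS: u64 = 600;

const EXIT_OK: i32 = 0;
const EXIT_CORRECTNESS_FAIL: i32 = 2;
const EXIT_TIMEOUT: i32 = 3;

/// Map a trial outcome to a process exit code (hypothetical helper).
/// Assumption: a correctness failure is reported even if the trial also
/// ran over budget.
fn exit_code(correct: bool, elapsed: Duration) -> i32 {
    if !correct {
        EXIT_CORRECTNESS_FAIL
    } else if elapsed > Duration::from_secs(TIME_BUDGET_SECS) {
        EXIT_TIMEOUT
    } else {
        EXIT_OK
    }
}

fn main() {
    let start = Instant::now();
    // ... the correctness and speed phases would run here ...
    let code = exit_code(true, start.elapsed());
    // A real binary would finish with `std::process::exit(code)`.
    println!("exit code: {code}");
}
```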

@@ -0,0 +1,52 @@
//! Deterministic SplitMix64 PRNG. Same seed produces the same sequence on
//! every machine; no platform-specific RNG / no `rand` crate. Reproducibility
//! across trials is the whole point.
pub struct SplitMix64 {
state: u64,
}
impl SplitMix64 {
pub fn new(seed: u64) -> Self {
Self { state: seed }
}
pub fn next_u64(&mut self) -> u64 {
self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
let mut z = self.state;
z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
z ^ (z >> 31)
}
/// Uniform in `[0, 1)` with 24 bits of mantissa precision.
pub fn next_f32(&mut self) -> f32 {
let bits = (self.next_u64() >> 40) as u32;
bits as f32 / ((1u32 << 24) as f32)
}
/// Standard normal via Box-Muller. Cheap and sufficient for fixture
/// generation; not cryptographically anything.
pub fn next_normal(&mut self) -> f32 {
let mut u1 = self.next_f32();
if u1 < 1e-7 {
u1 = 1e-7;
}
let u2 = self.next_f32();
(-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos()
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn deterministic_across_calls() {
let mut a = SplitMix64::new(0x1234_5678);
let mut b = SplitMix64::new(0x1234_5678);
for _ in 0..1000 {
assert_eq!(a.next_u64(), b.next_u64());
}
}
}
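To show why a seeded generator matters for fixtures, here is a small sketch that rebuilds the same input vector from the same seed. The `SplitMix64` body is copied from the file above so the example stands alone; `uniform_vec` is a hypothetical helper, not something the crate exports.

```rust
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform in `[0, 1)`: the top 24 bits divided by 2^24.
    fn next_f32(&mut self) -> f32 {
        let bits = (self.next_u64() >> 40) as u32;
        bits as f32 / ((1u32 << 24) as f32)
    }
}

/// Hypothetical fixture helper: `len` uniform samples from a fixed seed.
fn uniform_vec(seed: u64, len: usize) -> Vec<f32> {
    let mut rng = SplitMix64::new(seed);
    (0..len).map(|_| rng.next_f32()).collect()
}

fn main() {
    let a = uniform_vec(0x5EED, 8);
    let b = uniform_vec(0x5EED, 8);
    // Same seed, same fixture: timings stay comparable across trials.
    assert_eq!(a, b);
    println!("{a:?}");
}
```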

@@ -0,0 +1,36 @@
//! Geometric mean of u64 timings. Robust to outliers; the right aggregator for
//! latency distributions because halving one query and doubling another cancels.
pub fn geomean(xs: &[u64]) -> u64 {
if xs.is_empty() {
return 0;
}
let mut sum_ln = 0.0f64;
for &x in xs {
sum_ln += (x.max(1) as f64).ln();
}
(sum_ln / xs.len() as f64).exp() as u64
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn empty_yields_zero() {
assert_eq!(geomean(&[]), 0);
}
#[test]
fn single_value_round_trips() {
assert_eq!(geomean(&[100]), 100);
}
#[test]
fn geomean_is_below_arithmetic_mean() {
let xs = [1, 10, 100, 1000];
let g = geomean(&xs);
let am: u64 = xs.iter().sum::<u64>() / xs.len() as u64;
assert!(g < am);
}
}
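The "halving one query and doubling another cancels" property can be checked numerically. The sketch below uses a hypothetical `f64` variant, `geomean_f64`, so the result can be compared within a tolerance; it is not the crate's `u64` function, which truncates its result.

```rust
/// Geometric mean over f64 samples (illustrative variant, not the crate's API).
fn geomean_f64(xs: &[f64]) -> f64 {
    if xs.is_empty() {
        return 0.0;
    }
    let sum_ln: f64 = xs.iter().map(|x| x.ln()).sum();
    (sum_ln / xs.len() as f64).exp()
}

fn arithmetic_mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn main() {
    let base = [100.0, 400.0];
    // Halve the first timing and double the second.
    let shifted = [50.0, 800.0];

    // The geometric mean is unchanged; the arithmetic mean is dominated by
    // the slower sample and moves substantially.
    println!("geomean:    {:.3} -> {:.3}", geomean_f64(&base), geomean_f64(&shifted));
    println!("arithmetic: {:.3} -> {:.3}", arithmetic_mean(&base), arithmetic_mean(&shifted));
}
```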

@@ -0,0 +1,24 @@
//! Peak resident-set-size readback (Linux only; non-Linux returns 0).
#[cfg(target_os = "linux")]
pub fn peak_rss_mb() -> f64 {
let Ok(s) = std::fs::read_to_string("/proc/self/status") else {
return 0.0;
};
for line in s.lines() {
// `VmHWM` is the peak resident set size; `VmPeak` is the peak virtual size.
if let Some(rest) = line.strip_prefix("VmHWM:") {
let kb: f64 = rest
.split_whitespace()
.next()
.and_then(|t| t.parse().ok())
.unwrap_or(0.0);
return kb / 1024.0;
}
}
0.0
}
#[cfg(not(target_os = "linux"))]
pub fn peak_rss_mb() -> f64 {
0.0
}

@@ -0,0 +1,15 @@
//! Default tolerance constants for bit-exact correctness oracles.
//!
//! These suit float-arithmetic kernels (PQ distance, BM25 scoring, vector
//! normalization) where SIMD-accumulator reordering is legal but real bugs
//! shift values by orders of magnitude. Targets that operate on integer or
//! byte-exact data (bitpack decode, dictionary decode, FSST decode) should
//! assert strict bitwise equality and not use these constants.
/// Maximum permitted absolute element error between agent kernel output and
/// scalar reference output, for float kernels.
pub const MAX_ABS_ERR: f32 = 1e-4;
/// Maximum permitted distance error when comparing top-K results between
/// agent kernel and scalar reference, for float kernels.
pub const TOPK_DIST_TOL: f32 = 1e-4;
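A sketch of how such a tolerance separates legal accumulator reordering from a real bug. The `max_abs_err` signature here is an assumption (the target's own helper of that name may differ), and `sum_four_lanes` merely imitates a 4-lane SIMD accumulation in scalar code.

```rust
const MAX_ABS_ERR: f32 = 1e-4;

/// Largest absolute element-wise difference (assumed signature).
fn max_abs_err(got: &[f32], want: &[f32]) -> f32 {
    assert_eq!(got.len(), want.len());
    got.iter()
        .zip(want)
        .map(|(g, w)| (g - w).abs())
        .fold(0.0, f32::max)
}

/// Left-to-right sum, as a scalar reference would compute it.
fn sum_sequential(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Sum with four independent accumulators, imitating SIMD lane order.
fn sum_four_lanes(xs: &[f32]) -> f32 {
    let mut lanes = [0.0f32; 4];
    for chunk in xs.chunks(4) {
        for (lane, x) in lanes.iter_mut().zip(chunk) {
            *lane += x;
        }
    }
    lanes.iter().sum()
}

fn main() {
    let xs: Vec<f32> = (0..64).map(|i| i as f32 * 0.001).collect();
    let err = max_abs_err(&[sum_four_lanes(&xs)], &[sum_sequential(&xs)]);
    // Reordering changes rounding slightly but stays inside the tolerance.
    println!("reorder error = {err:e}, within tolerance: {}", err <= MAX_ABS_ERR);
}
```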

@@ -0,0 +1,24 @@
[package]
name = "pq-l2"
version = "0.1.0"
edition = "2024"
license = "MIT OR Apache-2.0"
description = "Autoresearch target: Lance PQ L2 distance kernel optimization."
publish = false
[lib]
path = "src/lib.rs"
[[bin]]
name = "run_experiment"
path = "src/bin/run_experiment.rs"
[[bench]]
name = "pq_l2"
harness = false
[dependencies]
harness-common = { path = "../harness-common" }
[dev-dependencies]
criterion = { workspace = true }

@@ -7,8 +7,8 @@ use std::hint::black_box;
use criterion::{Criterion, criterion_group, criterion_main};
use lance_autoresearch::inputs::{SHAPES, SPEED_TOP_K, speed_workloads};
use lance_autoresearch::kernels::PqKernel;
use pq_l2::inputs::{SHAPES, SPEED_TOP_K, speed_workloads};
use pq_l2::kernels::PqKernel;
fn bench_pq_l2(c: &mut Criterion) {
let workloads = speed_workloads(0xBE3C_C0DE_F1AC_BABE);

@@ -0,0 +1,98 @@
# Target: PQ L2 — agent instructions
This is the per-target overlay on top of [`../../HARNESS.md`](../../HARNESS.md).
Read **HARNESS.md first** for the universal loop contract (what's editable,
the metric, the loop, hygiene, never stop). This file adds the PQ-L2-specific
API spec and priors.
## Setup (once per session)
1. Read in this order:
- `../../HARNESS.md`
- `../../README.md`
- `program.md` (this file)
- `src/lib.rs`
- `src/kernels.rs` *(the only file you may edit)*
- `src/reference.rs`
- `src/inputs.rs`
- `src/bin/run_experiment.rs`
2. Ensure `results.tsv` exists. If not, create it with this header:
```
commit timestamp correctness geomean_ns worst_ns worst_combo best_ns best_combo peak_mem_mb total_seconds keep description
```
3. Baseline trial:
```
cargo run --release --bin run_experiment > run.log 2>&1
```
Append a row tagged `keep=baseline`, commit it.
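For illustration, a row matching the header above could be formatted as below before being appended to `results.tsv`. This assumes the columns are tab-separated; `tsv_line` is a hypothetical helper and the sample values are placeholders, not measured results.

```rust
/// Column names from the `results.tsv` header, in order.
const HEADER: [&str; 12] = [
    "commit", "timestamp", "correctness", "geomean_ns", "worst_ns", "worst_combo",
    "best_ns", "best_combo", "peak_mem_mb", "total_seconds", "keep", "description",
];

/// Join fields with tabs, replacing embedded tabs and newlines with spaces so
/// a free-text description cannot break the row (hypothetical helper).
fn tsv_line(fields: &[&str]) -> String {
    let cleaned: Vec<String> = fields
        .iter()
        .map(|f| f.replace(&['\t', '\n'][..], " "))
        .collect();
    cleaned.join("\t")
}

fn main() {
    // Placeholder values for illustration only.
    let row = [
        "abc1234", "2026-05-15T00:15:02Z", "pass", "18234", "24515",
        "(768,96,256) sparse", "12876", "(128,16,256) clustered",
        "28.4", "12.3", "baseline", "scalar baseline",
    ];
    assert_eq!(row.len(), HEADER.len());
    println!("{}", tsv_line(&HEADER));
    println!("{}", tsv_line(&row));
}
```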
## Public API contract (must remain stable)
The bench imports these from `crate::kernels`. You may NOT change their
signatures. You MAY add private helpers, internal data layouts, `unsafe`
blocks, `std::arch` intrinsics under `#[cfg(target_arch = ...)]` gates,
pre-computed state inside `PqKernel`, etc.
```rust
pub struct PqKernel { /* agent's private fields */ }
impl PqKernel {
pub fn new(shape: PqShape, codebook: &[f32]) -> Self;
pub fn shape(&self) -> &PqShape;
pub fn distance_table(&self, query: &[f32]) -> Vec<f32>;
pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>;
}
```
Pre-processing in `new` is free — the bench measures `distance_table +
probe_top_k` per query, not per (build + query). Codebook transposes,
cached `c·c`, packed LUTs, etc., should live in `new`.
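To make the probe side of the contract concrete, here is a minimal scalar sketch written as a free function. The layouts are assumptions: `table` is taken as row-major `[sub_vector][centroid]`, `codes` as `[vector][sub_vector]`, and results are returned in ascending distance order with ties broken by id. It also uses a full sort rather than a heap, purely for brevity; the real baseline in `kernels.rs` may differ on all of these points.

```rust
/// Scalar PQ probe: sum one table entry per sub-vector, then take the k
/// smallest distances. Layouts and result order are assumptions (see above).
fn probe_top_k_scalar(
    table: &[f32],
    codes: &[u8],
    num_vectors: usize,
    num_sub_vectors: usize,
    num_centroids: usize,
    k: usize,
) -> Vec<(u32, f32)> {
    let mut scored: Vec<(u32, f32)> = (0..num_vectors)
        .map(|i| {
            let off = i * num_sub_vectors;
            let mut acc = 0.0f32;
            for m in 0..num_sub_vectors {
                // One lookup per sub-quantizer: table[m][codes[off + m]].
                acc += table[m * num_centroids + codes[off + m] as usize];
            }
            (i as u32, acc)
        })
        .collect();
    // Full sort for clarity; a bounded heap avoids sorting every candidate.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1).then(a.0.cmp(&b.0)));
    scored.truncate(k);
    scored
}

fn main() {
    // Two sub-vectors, four centroids each.
    let table = [0.0f32, 1.0, 2.0, 3.0, 10.0, 20.0, 30.0, 40.0];
    // Three encoded vectors, two codes each.
    let codes = [0u8, 0, 3, 1, 1, 0];
    println!("{:?}", probe_top_k_scalar(&table, &codes, 3, 2, 4, 2));
}
```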
## What you can / cannot do
(See HARNESS.md for the universal table; this is the PQ-L2 specific
addition.)
- **Cannot** change `PqShape` or the constants in `lib.rs`. They define
the optimization target.
- **Cannot** introduce lossy techniques (LUT u8/u16 quantization, asymmetric
approximation, anything that drops bits relative to the scalar reference).
The correctness phase asserts `max_abs_err ≤ 1e-4` against the scalar
reference; lossy techniques fail this gate. If you want to explore a lossy
track, propose it to the human as a separate kernel surface.
- **Can** mark hot functions `#[inline]`, split them, add private helpers.
- **Can** add `#[cfg(test)] mod tests { ... }` inside `kernels.rs` for in-file
property checks against the scalar path.
## Lance-PQ-specific priors
These are the directions that pay off on this kernel shape without
compromising arithmetic accuracy. Pick one hypothesis per trial; don't try
to combine multiple ideas at once.
- **Codebook layout transpose.** The reference layout is `[m][k][d]`.
Transposing to `[m][d][k]` lets you SIMD-load 8 `(query - centroid)` lanes
across `d` and broadcast over `k`. Do the transpose in `PqKernel::new` once.
- **Cache `c·c` per centroid.** The diffsquaresum is
`(q - c)·(q - c) = q·q - 2q·c + c·c`. Hoist `q·q` per sub-vector,
precompute `c·c` once at `new()` time, store next to the codebook. Inner
loop becomes one FMA. Watch sign / accumulator ordering so rounding stays
within `MAX_ABS_ERR`.
- **Probe-side code transpose.** Probe is dominated by
`acc += table[m][codes[off+m]]` × `num_sub_vectors`. Transposing codes to
`[m][i]` (one row per sub-quantizer, contiguous over base index) lets you
process 32+ vectors per inner iteration with `vpgatherdq`-style loads.
- **Top-K block-then-merge.** `push()` does a branch + heap sift on every
code. At 20k probes per query × 9 (shape × dist) combos that's the
second-biggest cost after the gather. Block the probe (e.g., 512 codes at
a time), find the local top-K with a branchless pass, then merge into the
global heap.
- **Prefetch.** `_mm_prefetch(codes.as_ptr().add(off + 64), _MM_HINT_T0)`
ahead of the gather is usually pure win at 20k+ scale.
- **FMA chains for table build.** The diffsquaresum maps cleanly to FMA
on AVX2/NEON. Even without intrinsics, structuring the inner loop so
`rustc` emits FMA helps.
- **Avoid the `Vec` allocation in the hot path.** `distance_table` allocates
a fresh `Vec<f32>` per call. The public API is fixed (returns `Vec<f32>`),
but you can reuse a thread-local scratch buffer internally and copy to a
`Vec` at the boundary if it speeds the build.
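The cached-`c·c` prior rests on the expansion of the squared distance, which a few lines can check against the direct form. The function names are hypothetical, and the inputs are chosen so both forms are exact in `f32`; with large-magnitude inputs the expanded form can lose precision to cancellation, which is why the rounding caveat in that bullet matters.

```rust
/// Direct squared L2 distance: sum of (q - c)^2.
fn l2_direct(q: &[f32], c: &[f32]) -> f32 {
    q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum()
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Expanded form q·q - 2 q·c + c·c, with c·c supplied from a cache.
fn l2_expanded(q: &[f32], c: &[f32], c_dot_c: f32) -> f32 {
    dot(q, q) - 2.0 * dot(q, c) + c_dot_c
}

fn main() {
    let q = [1.0f32, 2.0, 3.0, 4.0];
    let c = [0.5f32, 1.5, 2.5, 3.5];
    // In a kernel this would be computed once, at construction time.
    let c_dot_c = dot(&c, &c);
    println!("direct   = {}", l2_direct(&q, &c));
    println!("expanded = {}", l2_expanded(&q, &c, c_dot_c));
}
```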


@ -35,18 +35,18 @@
use std::time::Instant;
-use lance_autoresearch::inputs::{
+use harness_common::{MAX_ABS_ERR, TIME_BUDGET_SECS, TOPK_DIST_TOL, geomean, peak_rss_mb};
+use pq_l2::inputs::{
     DISTRIBUTIONS, DataDistribution, SHAPES, SpeedWorkload, correctness_battery, speed_workloads,
 };
-use lance_autoresearch::kernels::PqKernel;
-use lance_autoresearch::reference::{ScalarReference, max_abs_err, topk_consistent};
-use lance_autoresearch::{MAX_ABS_ERR, PqShape, TOPK_DIST_TOL};
+use pq_l2::kernels::PqKernel;
+use pq_l2::reference::{ScalarReference, max_abs_err, topk_consistent};
+use pq_l2::PqShape;
 // Any constants; the only requirement is that they're pinned across trials so
 // the inputs and the timings are reproducible.
 const CORRECTNESS_SEED: u64 = 0xC0FF_EEC0_DEBE_EFFE;
 const SPEED_SEED: u64 = 0x5EED_F1AC_BABE_FACE;
-const TIME_BUDGET_SECS: u64 = 600;
fn main() {
let start = Instant::now();
@ -210,17 +210,6 @@ fn run_speed(workloads: &[SpeedWorkload]) -> SpeedReport {
}
}
-fn geomean(xs: &[u64]) -> u64 {
-    if xs.is_empty() {
-        return 0;
-    }
-    let mut sum_ln = 0.0f64;
-    for &x in xs {
-        sum_ln += (x.max(1) as f64).ln();
-    }
-    (sum_ln / xs.len() as f64).exp() as u64
-}
fn format_shape(s: &PqShape) -> String {
format!("({},{},{})", s.dim, s.num_sub_vectors, s.num_centroids)
}
@ -233,26 +222,3 @@ fn format_dist(d: &DataDistribution) -> String {
}
.to_string()
}
-#[cfg(target_os = "linux")]
-fn peak_rss_mb() -> f64 {
-    let Ok(s) = std::fs::read_to_string("/proc/self/status") else {
-        return 0.0;
-    };
-    for line in s.lines() {
-        if let Some(rest) = line.strip_prefix("VmPeak:") {
-            let kb: f64 = rest
-                .split_whitespace()
-                .next()
-                .and_then(|t| t.parse().ok())
-                .unwrap_or(0.0);
-            return kb / 1024.0;
-        }
-    }
-    0.0
-}
-#[cfg(not(target_os = "linux"))]
-fn peak_rss_mb() -> f64 {
-    0.0
-}


@ -16,6 +16,7 @@
//! the codebook is shape-appropriate, not random.
use crate::PqShape;
+use harness_common::SplitMix64;
/// PQ shapes the bench evaluates. The agent's kernel must produce correct
/// output and competitive speed on every one.
@ -295,36 +296,6 @@ fn encode(shape: PqShape, n: usize, base: &[f32], codebook: &[f32]) -> Vec<u8> {
out
}
-/// SplitMix64 — small, deterministic; bit-for-bit reproducible across machines.
-struct SplitMix64 {
-    state: u64,
-}
-impl SplitMix64 {
-    fn new(seed: u64) -> Self {
-        Self { state: seed }
-    }
-    fn next_u64(&mut self) -> u64 {
-        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
-        let mut z = self.state;
-        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
-        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
-        z ^ (z >> 31)
-    }
-    fn next_f32(&mut self) -> f32 {
-        let bits = (self.next_u64() >> 40) as u32;
-        bits as f32 / ((1u32 << 24) as f32)
-    }
-    fn next_normal(&mut self) -> f32 {
-        let mut u1 = self.next_f32();
-        if u1 < 1e-7 {
-            u1 = 1e-7;
-        }
-        let u2 = self.next_f32();
-        (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos()
-    }
-}
fn shape_hash(s: PqShape) -> u64 {
(s.dim as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
^ (s.num_sub_vectors as u64).wrapping_mul(0xBF58_476D_1CE4_E5B9)


@ -1,17 +1,20 @@
-//! Lance autoresearch harness — public API for the bench binary, benchmarks, and tests.
+//! Autoresearch target: Lance PQ L2 distance kernel optimization.
 //!
-//! Contract (Karpathy-style three files):
+//! Karpathy-style three-file contract:
 //!
 //! - `kernels` — the AGENT'S PLAYGROUND. Modify freely.
 //! - `reference` — IMMUTABLE. Scalar reference kernel. Defines the math.
 //! - `inputs` — IMMUTABLE. Diverse test-data + workload generators,
 //! deterministic per fixed seed, varied across the input battery.
 //!
-//! The optimization target is dataset-independent: the agent's kernel must match
-//! the scalar reference within `MAX_ABS_ERR` on every input the bench generates,
-//! and minimize geomean ns/query across multiple PQ shapes and data
-//! distributions. There is no fixed dataset; an "improvement" by construction
-//! generalizes across distributions and shapes.
+//! The optimization target is dataset-independent: the agent's kernel must
+//! match the scalar reference within `harness_common::MAX_ABS_ERR` on every
+//! input the bench generates, and minimize geomean ns/query across multiple
+//! PQ shapes and data distributions. There is no fixed dataset.
+//!
+//! Shared utilities (deterministic PRNG, geomean, peak RSS, tolerance
+//! constants, time budget) come from the `harness-common` workspace crate.
+//! See `../HARNESS.md` for the harness conventions every target follows.
pub mod inputs;
pub mod kernels;
@ -45,12 +48,3 @@ impl PqShape {
self.num_sub_vectors * self.num_centroids * self.sub_vector_dim()
}
}
-/// Tolerance for the agent kernel's distance values vs. the scalar reference.
-/// Loose enough to permit legal SIMD-accumulator reordering; tight enough to
-/// catch real arithmetic bugs.
-pub const MAX_ABS_ERR: f32 = 1e-4;
-/// Tolerance for top-K *distances* (id sets are compared with tie-tolerance —
-/// see `reference::topk_consistent`).
-pub const TOPK_DIST_TOL: f32 = 1e-4;


@ -0,0 +1,192 @@
# Adding a new target
Walk through this when spinning up a new optimization target (A1 cosine, A4
bitpack, etc.). It's a `cp -r` plus surgical edits — no architectural
decisions to make per target if the kernel fits the autoresearch shape.
If your target's per-trial eval is more than ~30 seconds, or the correctness
oracle can't be a deterministic comparison against a scalar reference, this
harness is the wrong fit — see [`design.md`](design.md) "When to revisit"
for the boundary.
## Steps
### 1. Pick a template target
Pick the closest existing target. For now there's just `pq-l2`, but as more land:
- Distance / scoring kernels that take a query and return per-row scores →
template off `pq-l2`.
- Decode kernels that take encoded bytes and return an Arrow array →
template off `bitpack` once it lands.
- Scan / merge kernels → template off `topk-merge` once it lands.
```bash
cp -r crates/pq-l2 crates/<my-target>
```
### 2. Rewrite `Cargo.toml`
```toml
[package]
name = "<my-target>"
# version, edition, license, publish stay the same
```
Add the target to the workspace `members` in the root `Cargo.toml`:
```toml
[workspace]
members = [
"crates/harness-common",
"crates/pq-l2",
"crates/<my-target>", # add this
]
```
### 3. Rewrite `src/lib.rs`
Define the target's `Shape` type (analogue of `PqShape`) and any other types
shared among `kernels.rs`, `reference.rs`, and `inputs.rs`. Document
which fields are pinned by the harness vs. agent-tunable.
This file is **immutable** to the agent. The shape parameters define the
optimization target — changing them changes what's being optimized.
### 4. Rewrite `src/reference.rs`
Implement the scalar reference kernel — the math, in plain Rust, no SIMD,
no cleverness. This is what the agent's kernel is compared against. Mirror
the public API of `kernels.rs` exactly.
For float kernels, also export `max_abs_err(a, b)` and `topk_consistent(...)`
(or analogues) — the comparison helpers the bench uses to assert
near-bit-exact equivalence with `harness_common::MAX_ABS_ERR` /
`TOPK_DIST_TOL`.
For integer / byte kernels, the comparison is simpler — `assert_eq!` on the
returned Arrow array. No tolerance constants needed.
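
For float targets, a comparison helper along these lines may be enough. This is a minimal sketch: the signature is an assumption based on the `max_abs_err(a, b)` mention above, not a copy of the helper in `pq-l2`'s `reference.rs`.

```rust
/// Largest element-wise absolute difference between two slices.
/// Returns +inf on a length mismatch or if any difference is NaN, so a
/// broken kernel can never slip under the tolerance.
fn max_abs_err(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return f32::INFINITY;
    }
    let mut worst = 0.0f32;
    for (x, y) in a.iter().zip(b) {
        let diff = (x - y).abs();
        // `f32::max` ignores NaN operands, so NaN must be handled explicitly.
        if diff.is_nan() {
            return f32::INFINITY;
        }
        worst = worst.max(diff);
    }
    worst
}
```

The bench would then assert something like `max_abs_err(&agent, &reference) <= MAX_ABS_ERR` for each correctness case.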
### 5. Rewrite `src/inputs.rs`
Two surfaces:
- `correctness_battery(seed) -> Vec<CorrectnessCase>` — diverse shape ×
distribution combinations, sized small enough that the correctness phase
finishes in seconds. The point is breadth, not realism.
- `speed_workloads(seed) -> Vec<SpeedWorkload>` — larger shape × distribution
combinations sized for stable timings. Aim for total trial wall-clock
≤ 60s: short trials keep the agent's iteration latency low, and correctness is already covered by the battery above.
Use `harness_common::SplitMix64` for determinism. Same seed → same battery
across trials.
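
The determinism requirement can be sketched as follows. `SplitMix64` here is a local copy of the generator that this commit moves into `harness-common`, so the block compiles on its own; `uniform_vectors` is a hypothetical generator invented for the example.

```rust
/// Local copy of the SplitMix64 generator, for a self-contained example.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform in [0, 1) from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        let bits = (self.next_u64() >> 40) as u32;
        bits as f32 / ((1u32 << 24) as f32)
    }
}

/// Hypothetical generator: `n` values uniform in [-1, 1), fully
/// determined by `seed`.
fn uniform_vectors(seed: u64, n: usize) -> Vec<f32> {
    let mut rng = SplitMix64::new(seed);
    (0..n).map(|_| rng.next_f32() * 2.0 - 1.0).collect()
}
```

Because every value derives from the seed alone, calling the generator twice with the same seed yields identical data, which is what makes timings comparable from one trial to the next.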
### 6. Rewrite `src/kernels.rs` (the agent's playground)
Implement a clean scalar baseline matching the algorithm shape of the Lance
upstream code. The header comment must:
- Cite the upstream Lance source (`lance-format/lance` rev / file path) the
algorithm is modeled on.
- Document the public API the bench calls — these are the surfaces the agent
may NOT change.
- List "what you can do" / "what you cannot do" rules specific to this
target.
The starting kernel must be correct (passes the correctness phase against
`reference.rs`) and lint-clean. The agent's job is to make it faster.
### 7. Rewrite `src/bin/run_experiment.rs`
Two phases:
- **Correctness phase:** for each `CorrectnessCase`, run agent kernel +
reference, compare. Any mismatch → print `correctness: fail`, diagnostic
line, exit 2.
- **Speed phase:** for each `SpeedWorkload`, run agent kernel and time per
query / per row / per byte. Aggregate geomean / worst / best across all
combos. Print fixed-format result block.
Universal output fields (every target) are listed in `HARNESS.md` "The
metric." Add per-target fields above them as needed (e.g., `bit_widths_tested`
for bitpack).
Use:
- `harness_common::geomean` for the aggregator
- `harness_common::peak_rss_mb` for memory readback
- `harness_common::TIME_BUDGET_SECS` for the time-budget check
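
The two phases could be wired together roughly as below. This is a sketch under stated assumptions: `Trial`, `run_trial`, and the closure-based shape are hypothetical, and `geomean` is a local copy of the helper that now lives in `harness-common`.

```rust
/// Outcome of one trial (hypothetical type). A real binary would print
/// the result block and exit 2 on `Fail`.
#[derive(Debug, PartialEq)]
enum Trial {
    Pass { geomean_ns: u64 },
    Fail { case: usize },
}

/// Local copy of the geomean helper so the example is self-contained.
fn geomean(xs: &[u64]) -> u64 {
    if xs.is_empty() {
        return 0;
    }
    let sum_ln: f64 = xs.iter().map(|&x| (x.max(1) as f64).ln()).sum();
    (sum_ln / xs.len() as f64).exp() as u64
}

/// Phase 1 compares agent vs. reference on every case and stops at the
/// first mismatch; phase 2 times the agent kernel only.
fn run_trial<C>(
    cases: &[C],
    agent: impl Fn(&C) -> Vec<f32>,
    reference: impl Fn(&C) -> Vec<f32>,
    tol: f32,
) -> Trial {
    for (i, case) in cases.iter().enumerate() {
        let (got, want) = (agent(case), reference(case));
        // A NaN difference compares false, so it counts as a mismatch.
        let ok = got.len() == want.len()
            && got.iter().zip(&want).all(|(g, w)| (g - w).abs() <= tol);
        if !ok {
            return Trial::Fail { case: i };
        }
    }
    let mut ns = Vec::with_capacity(cases.len());
    for case in cases {
        let start = std::time::Instant::now();
        // black_box keeps the optimizer from discarding the call.
        std::hint::black_box(agent(case));
        ns.push(start.elapsed().as_nanos() as u64);
    }
    Trial::Pass { geomean_ns: geomean(&ns) }
}
```

Running correctness first means a fast but wrong kernel never produces a timing that could be mistaken for a win.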
### 8. (Optional) Rewrite `benches/<my-target>.rs`
Criterion benchmark with the same kernel calls as `run_experiment` but
under criterion's statistical-sampling harness. Optional — the per-trial
binary is the agent's primary measurement; criterion is for the human's
deeper investigation.
### 9. Write `program.md`
Per-target agent skill, layered on top of `HARNESS.md`. Sections:
- **Setup** — which files to read at session start (always include
`../../HARNESS.md`).
- **Public API contract** — the exact functions / structs the agent must
keep stable.
- **Target-specific priors** — known SIMD techniques for this kernel shape,
algorithmic transformations worth trying, common pitfalls. This is the
highest-leverage content; spend time on it.
- **`results.tsv` header** — the per-target column set.
### 10. Write the per-target capsule in `docs/targets/<my-target>.md`
A short doc covering:
- What's optimized (one sentence)
- Upstream Lance source pointers (rev, file paths, function names)
- Oracle definition (bit-exact / `max_abs_err`)
- Speed workload shape (what shapes × distributions span)
- Status (candidate / landed / has-results)
### 11. Verify end-to-end
```bash
cargo build --release -p <my-target>
cargo clippy --release -p <my-target> --all-targets -- -D warnings
cargo run --release --bin run_experiment -p <my-target>
```
The baseline trial must:
- Print `correctness: pass`
- Exit 0
- Finish within ~60s
- Print a sensible `geomean_ns_per_*` baseline number
Smoke-test the gate: deliberately break `kernels.rs` (e.g., return constant
zero), confirm the trial exits 2 with `correctness: fail`. Restore.
### 12. Add the target row to the top-level `README.md`
In the targets table at the top of the README, add a row for the new target,
or, if it is already listed as `candidate`, change its status to `landed`.
### 13. Commit
One commit for the target's scaffolding. Don't bundle multiple targets in
one commit — each target's history should be independently revertible.
## Common gotchas
- **Forgetting the empty `[workspace]` block** at the root means cargo walks
up to the omnigraph parent workspace. Already handled; just don't remove it.
- **Per-target `Cargo.toml` referencing the wrong `harness-common` path.**
Use `harness-common = { path = "../harness-common" }`.
- **Picking a `SHAPES` set that's too small.** Three shapes is the floor;
with one shape an agent could specialize and pass, with two there's not
enough variety. Ensure the shapes span at least one "outlier" (e.g., for
PQ, one shape with `sub_vector_dim != 8`).
- **Correctness battery too narrow.** Five distributions is the floor: at
minimum Gaussian / uniform / sparse / large-dynamic-range / mostly-zero (or
the integer analogue: uniform / clustered / skewed / few-distinct /
monotonic).
- **Trial time too long.** If the speed phase exceeds ~60s, the agent's
iteration rate drops too low to be useful. Reduce workload sizes; the speed
metric is per-operation, not per-workload, so absolute size doesn't change
the comparison.


@ -0,0 +1,152 @@
# Design — why the workspace is shaped this way
This document records the rationale for the multi-target workspace shape so
future contributors don't relitigate the early decisions.
## The thing we're building
A multi-target harness for LLM-driven optimization of Lance hot-path kernels.
"Multi-target" because Lance has many such kernels — distance kernels in
`lance-linalg`, decoders in `lance-encoding`, scan/merge kernels — and the
right harness shape is identical across them: bit-exact correctness oracle,
geomean-across-distributions speed metric, single-agent autoresearch loop.
The original [research note](../../docs/research/llm-evolutionary-sampling.md)
enumerates ten such candidates (A1–A10) clustered by Lance crate. The first
landed (`pq-l2`) proves the harness shape; the rest follow the same template.
## Decision: workspace, not single crate
A single crate exposing multiple binaries (`run_experiment_pq_l2`,
`run_experiment_bitpack`, ...) was the obvious-looking alternative. Rejected
for three reasons:
1. **Per-target deps differ.** FSST decode wants different deps than PQ
kernels (a string-compression library vs. just `f32` math). A single
`Cargo.toml` would either bundle every target's deps into every build or
require fine-grained features. Workspaces give per-target `Cargo.toml`
for free.
2. **Edit isolation.** The agent edits one target's `kernels.rs` at a time.
In a single crate, `kernels.rs` files would collide on path or have to live
in target-specific submodules with target-specific naming. Per-target
crates put `src/kernels.rs` at the natural location every time and let the
agent navigate one tree per session.
3. **Build / test isolation.** `cargo build -p pq-l2` builds only what's
needed for the PQ L2 target; `cargo test -p pq-l2` runs only its tests.
The agent's iteration loop is faster because it doesn't pay for unrelated
targets' compile time.
The downside — workspace boilerplate, per-target `Cargo.toml`, the empty
`[workspace]` block at the workspace root that prevents cargo from walking up
to the parent omnigraph workspace — is a one-time cost. Per-target overhead
of adding a new target is one `cp -r` plus path edits.
## Decision: shared `harness-common` crate, no `Target` trait
A `Target` trait was the other obvious-looking alternative: express the
common loop generically, plug in target-specific types. Rejected because:
1. **Kernel signatures vary too much for a single trait shape.** PQ
`probe_top_k` returns `Vec<(u32, f32)>`. Bitpack decode returns an
`IntArray`. FSST decode returns `Vec<u8>`. Predicate evaluation returns a
`BooleanArray`. A unifying trait would need erased boxing or a wide
associated-type surface, both of which obscure the actual hot path the
agent is editing.
2. **The orchestration that *is* shared is small.** A deterministic PRNG
(~30 lines), a geomean (~10 lines), peak RSS readback (~20 lines), and a
few constants (two tolerances plus the time budget). Total ~70 lines of
shared code. Building a trait abstraction over 70 lines costs more than it saves.
3. **The output format isn't worth sharing.** Each target's
`run_experiment.rs` prints a fixed-format result block; the *fields*
differ per target (PQ shapes vs bit widths vs distribution kinds). A
shared formatter would be either trivial wrapping of `println!` (no
value) or a complicated builder API (negative value).
`harness-common` therefore exposes plumbing only: `SplitMix64`, `geomean`,
`peak_rss_mb`, `MAX_ABS_ERR`, `TOPK_DIST_TOL`, `TIME_BUDGET_SECS`. Each
target consumes what it needs. The shared loop contract is documented in
`HARNESS.md`, not encoded in code.
## Decision: per-target `program.md` + shared `HARNESS.md`
The agent reads two files at session start:
- `HARNESS.md` (workspace-level) — universal: the loop, the metric, the
edit-permission table, hygiene rules.
- `crates/<target>/program.md` (per-target) — specific: the kernel API the
agent must keep stable, target-specific priors (which SIMD intrinsics tend
to win on this kernel shape), the `results.tsv` column header.
The shape mirrors how Karpathy's `nanochat-research` `program.md` works,
factored across the dimension that varies (per target) vs. doesn't (the loop
itself). Two files instead of one because copy-pasting the universal loop
into every `program.md` makes them drift.
## Decision: dataset-independent oracle for every target
The first iteration of the harness used recall@K vs. SIFT1M as the
correctness oracle. We replaced it with bit-exact (or near-bit-exact for
floats) match against a scalar reference because:
1. The agent had incentive to overfit lossy approximations to the dataset's
cluster structure, even though we didn't ask for that.
2. SIFT1M is 250 MB and a hassle to download; the harness benefited from
being self-contained.
3. Mathematical equivalence is a strictly stronger contract than recall
preservation: if the kernel is bit-equivalent to the scalar reference,
recall is automatically identical because the distance values are the
same. There's nothing recall@K catches that bit-exactness doesn't.
This decision generalizes to every target. Decode kernels get strict bitwise
equality (no float arithmetic involved). Distance and BM25 kernels get
`max_abs_err ≤ 1e-4` (loose enough for SIMD-accumulator reordering, tight
enough for real bugs). Targets that genuinely require lossy techniques to
get headroom — there might be some; LUT u8 quantization in PQ is one — go
in a separate "lossy track" with a recall-based oracle on diverse datasets,
not the bit-exact track.
## Decision: per-target speed measurement spans multiple shapes × distributions
A single dataset would let an agent overfit to that dataset's distribution.
Each target's `inputs.rs` therefore generates speed workloads across:
- Multiple **shapes** of the kernel's domain (PQ: `(dim, num_sub_vectors,
num_centroids)`; bitpack: bit width; etc.). Captures how the kernel
performs at different sizes Lance users actually encounter.
- Multiple **data distributions** (Gaussian / uniform / sparse for floats;
uniform / skewed / clustered for integers; etc.). Captures whether the
kernel's win is data-distribution-conditional.
The keep gate uses geomean across all (shape × distribution) combos with a
worst-case guard: a kernel that wins on one combo and regresses ≥5% on
another fails to keep, even if the geomean improves. This forces wins to
generalize.
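
One reading of the keep gate described above, as a sketch: `should_keep` and its per-combo slices are hypothetical, and the 5% threshold is taken from the paragraph above rather than from any shared constant.

```rust
/// Geometric mean of ns/query values (caller guarantees non-empty input).
fn geomean(xs: &[u64]) -> f64 {
    let sum_ln: f64 = xs.iter().map(|&x| (x.max(1) as f64).ln()).sum();
    (sum_ln / xs.len() as f64).exp()
}

/// Hypothetical keep gate. `last` and `new` hold ns/query for each
/// (shape × distribution) combo, in the same order.
fn should_keep(last: &[u64], new: &[u64]) -> bool {
    if last.is_empty() || last.len() != new.len() {
        return false;
    }
    // Worst-case guard: no single combo may regress by 5% or more.
    let regressed = last
        .iter()
        .zip(new)
        .any(|(&l, &n)| (n as f64) >= 1.05 * (l as f64));
    // Headline metric: the geomean must strictly improve.
    !regressed && geomean(new) < geomean(last)
}
```

With `last = [100, 100, 100]`, a trial reporting `[10, 100, 106]` is rejected despite its much lower geomean, because the third combo regressed by 6%.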
## What's deliberately not abstracted
- **Output format.** Each target prints its own field block. See above.
- **`TopKHeap` and other small data structures.** When two targets need a
`TopKHeap`, the second one copies the first's. Three copies of a 30-line
struct is cheaper than one trait-erased indirection.
- **Test data shapes.** Each target's `inputs.rs` knows its own kernel's
fixture shape. Sharing would require a generic `Fixture<Kernel>` trait,
which would either be too narrow (forces every kernel into a `query +
workload` shape) or too wide (gives up the type safety that makes the
bench's correctness check obvious).
## When to revisit
If the workspace grows past ~6 active targets and we notice we're
copy-pasting more than ~50 lines of `run_experiment.rs` boilerplate per new
target, consider extracting a shared `RunExperiment` helper that takes
closures for the correctness and speed phases. Don't pre-extract — wait
until the duplication is real and visible.
If we add a target that genuinely doesn't fit the autoresearch loop (eval
crosses ~30s; tournament sampling becomes the right control loop), it
belongs in a separate workspace, not this one. The boundary line is the
loop shape, not the target type.


@ -0,0 +1,98 @@
# Target: `pq-l2`
PQ L2 distance kernel for f32 dense vectors — the asymmetric-distance compute
that runs on every `IvfPq` / `IvfHnswPq` ANN query in Lance.
## Status
**Landed.** Baseline scalar kernel committed; the agent's job is to find
generalizable speedups against it.
## What's optimized
Two functions in `crates/pq-l2/src/kernels.rs`:
- `PqKernel::distance_table(query)` — builds the asymmetric distance table
(`[num_sub_vectors][num_centroids]`) for one query against the codebook.
Cost: `num_sub_vectors × num_centroids × sub_vector_dim` MAC ops per query.
- `PqKernel::probe_top_k(table, codes, num_vectors, k)` — probes
`num_vectors` PQ-encoded vectors, accumulates per-vector distance via
`num_sub_vectors` table lookups, returns top-K. Cost:
`num_vectors × num_sub_vectors` lookups + heap maintenance per query.
This is the dominant cost at typical scales.
`PqKernel::new(shape, codebook)` is also editable — the agent may pre-process
the codebook (transpose layout, cache `c·c` for the FMA trick, pack the LUT)
and amortize over queries; build cost is excluded from per-query timing.
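
For orientation, the two costs described above correspond to loops of roughly this shape. This is a plain scalar sketch written as free functions with explicit `m` / `k` / `d` parameters (an assumption for the example; the real surface is the `PqKernel` methods), and it sorts instead of maintaining a heap.

```rust
/// Table build: `codebook` is `[m][k][d]` row-major, `query` holds `m * d`
/// floats. Performs m × k × d multiply-accumulates.
fn distance_table(m: usize, k: usize, d: usize, codebook: &[f32], query: &[f32]) -> Vec<f32> {
    let mut table = vec![0.0f32; m * k];
    for sub in 0..m {
        let q = &query[sub * d..(sub + 1) * d];
        for cen in 0..k {
            let start = (sub * k + cen) * d;
            let c = &codebook[start..start + d];
            table[sub * k + cen] = q
                .iter()
                .zip(c)
                .map(|(a, b)| (a - b) * (a - b))
                .sum::<f32>();
        }
    }
    table
}

/// Probe: `m` table lookups per encoded vector, then keep the `top_k`
/// smallest distances. `codes` is `[num_vectors][m]` row-major.
fn probe_top_k(
    table: &[f32],
    codes: &[u8],
    m: usize,
    k: usize,
    num_vectors: usize,
    top_k: usize,
) -> Vec<(u32, f32)> {
    let mut scored: Vec<(u32, f32)> = (0..num_vectors)
        .map(|i| {
            let acc: f32 = (0..m)
                .map(|sub| table[sub * k + codes[i * m + sub] as usize])
                .sum();
            (i as u32, acc)
        })
        .collect();
    // Stable sort: ties keep the lower id first.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(top_k);
    scored
}
```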
## Upstream Lance source
Algorithmically modeled on `lance-linalg::distance::l2` plus the PQ
asymmetric-distance compute in `lance::index::vector::pq`. Specifically the
f32 dense path; the byte / fixed-point variants are out of scope for this
target.
When porting a winning kernel upstream:
- File: `lance-linalg/src/distance/l2.rs` and the L2-specific path in
`lance/src/index/vector/pq.rs`.
- License: Apache-2.0 (matches our dual MIT/Apache-2.0 → upstream takes
the Apache half).
## Oracle
**Float-accumulator-tolerance match against scalar reference.** Per
`harness_common::MAX_ABS_ERR = 1e-4`:
- Distance table values must match the scalar reference within `1e-4` per
element. Loose enough for legal SIMD-accumulator reordering, tight enough
to catch real arithmetic bugs.
- Top-K results compared with `harness_common::TOPK_DIST_TOL = 1e-4` plus
tie-tolerant id substitution (any permutation within a tied-distance band
is accepted).
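
A tie-tolerant top-K check might look like the sketch below. It is one possible interpretation, not the logic of `reference::topk_consistent`: the name `topk_within_band` and the choice to compare against the full scalar distance list are assumptions.

```rust
/// `truth[id]` is the scalar-reference distance for every base vector;
/// `got` is the agent's top-K as (id, distance), nearest first.
fn topk_within_band(truth: &[f32], got: &[(u32, f32)], k: usize, tol: f32) -> bool {
    let mut sorted = truth.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let want = k.min(sorted.len());
    if got.len() != want {
        return false;
    }
    // Ids must be distinct.
    let mut ids: Vec<u32> = got.iter().map(|g| g.0).collect();
    ids.sort_unstable();
    ids.dedup();
    if ids.len() != want {
        return false;
    }
    got.iter().enumerate().all(|(rank, &(id, dist))| {
        let Some(&true_dist) = truth.get(id as usize) else {
            return false;
        };
        // The reported distance must match that id's true distance, and the
        // id must sit within `tol` of the true rank-th distance. Ids inside
        // a tied band are therefore interchangeable.
        (dist - true_dist).abs() <= tol && (true_dist - sorted[rank]).abs() <= tol
    })
}
```

With `truth = [5.0, 0.0, 4.0, 4.0]` and `k = 2`, both `[(1, 0.0), (2, 4.0)]` and `[(1, 0.0), (3, 4.0)]` pass, since ids 2 and 3 are tied.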
The correctness phase asserts both on every input combination — five input
distributions × three PQ shapes = 15 cases per trial.
## Speed workload
Three shapes:
- `(128, 16, 256)` — SIFT-like; sub_vector_dim = 8
- `(256, 16, 256)` — sub_vector_dim = 16
- `(768, 96, 256)` — BERT-base-like; large codebook
Three data distributions:
- `Clustered` — 32 cluster centers, low intra-cluster noise
- `Uniform` — uniform on [-1, 1]
- `Sparse` — 90% zeros + 10% Gaussian
Per (shape × distribution): 20,000 base vectors PQ-encoded, 32 queries
timed. Total trial wall-clock: ~30–60s on a developer laptop.
## Output fields
```
correctness: pass | fail
shapes_tested: (128,16,256) (256,16,256) (768,96,256)
distributions_tested: clustered uniform sparse
geomean_ns_per_query: <u64>
worst_ns_per_query: <u64> (<shape>, <dist>)
best_ns_per_query: <u64> (<shape>, <dist>)
per_combo_geomean_ns:
(...)
peak_mem_mb: <f64>
total_seconds: <f64>
```
## Known headroom (priors for the agent)
See `crates/pq-l2/program.md` "Lance-PQ-specific priors" for the canonical
list. Highlights:
- Codebook layout transpose (`[m][k][d]` → `[m][d][k]`) for SIMD-broadcast
table build.
- Cache `c·c` per centroid in `new()` so the inner loop is `q·q - 2qc + c·c`
(one FMA chain).
- Probe-side code transpose so the inner loop processes 32+ vectors per
iteration via gather.
- Top-K block-then-merge instead of per-vector heap insert.
- Prefetch on `codes[i+64]` ahead of gather.
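
To make the probe-side code transpose concrete, here is a scalar sketch. Both function names are hypothetical, and a real kernel would replace the inner loop with gather-style SIMD loads.

```rust
/// Transpose PQ codes from `[i][m]` (one row per vector) to `[m][i]`
/// (one row per sub-quantizer, contiguous over the base index).
fn transpose_codes(codes: &[u8], num_vectors: usize, m: usize) -> Vec<u8> {
    assert_eq!(codes.len(), num_vectors * m);
    let mut out = vec![0u8; codes.len()];
    for i in 0..num_vectors {
        for sub in 0..m {
            out[sub * num_vectors + i] = codes[i * m + sub];
        }
    }
    out
}

/// Accumulate distances one sub-quantizer at a time over transposed codes.
/// `table` is `[m][k]` row-major.
fn accumulate_transposed(
    table: &[f32],
    codes_t: &[u8],
    num_vectors: usize,
    m: usize,
    k: usize,
) -> Vec<f32> {
    let mut acc = vec![0.0f32; num_vectors];
    for sub in 0..m {
        let row = &table[sub * k..(sub + 1) * k];
        let col = &codes_t[sub * num_vectors..(sub + 1) * num_vectors];
        // Each pass touches one table row and a contiguous run of codes.
        for (a, &code) in acc.iter_mut().zip(col) {
            *a += row[code as usize];
        }
    }
    acc
}
```

Each vector's terms are still added in sub-quantizer order, the same order a per-vector loop over sub-quantizers would use; only the memory access pattern changes.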

View file

@ -1,172 +0,0 @@
# Lance PQ L2 kernel research — agent instructions
You are an autonomous research assistant. Your job is to improve `src/kernels.rs`
so that `cargo run --release --bin run_experiment` reports a **lower
`geomean_ns_per_query`** while:
1. The **correctness phase passes** — your kernel's distance values must match the
scalar reference within `MAX_ABS_ERR = 1e-4`, and the top-K must be
tie-tolerant equivalent on every input the bench generates.
2. The `worst_ns_per_query` does **not regress more than 5%** against the
last-kept kernel — if you win on one (shape × distribution) and lose
significantly on another, the change isn't a generalizable improvement.
This bench is intentionally **dataset-independent**: there is no fixed dataset.
The correctness oracle is mathematical equivalence to the scalar reference,
checked across multiple PQ shapes and synthetic input distributions
(Gaussian / uniform / sparse / large-dynamic-range / mostly-zero). The speed
oracle is the geomean across multiple shapes × distributions, with worst-case
guarded. A win that depends on a specific data distribution or PQ shape will
fail to clear the bar by construction.
Read this file end-to-end before doing anything else. Then run setup, then the loop.
## Setup (do once at the start of every session)
1. Read these files, in this order:
- `README.md`
- `program.md` (this file)
- `src/lib.rs`
- `src/kernels.rs` *(the only file you may edit)*
- `src/reference.rs`
- `src/inputs.rs`
- `src/bin/run_experiment.rs`
2. Ensure `results.tsv` exists. If not, create it with this header line:
```
commit timestamp correctness geomean_ns worst_ns worst_combo best_ns best_combo peak_mem_mb total_seconds keep description
```
3. Run the baseline trial: `cargo run --release --bin run_experiment > run.log 2>&1`.
Confirm `correctness: pass`. Parse `run.log` and append a row to `results.tsv`
with `keep=baseline` and `description="seeded scalar PQ-L2 baseline"`. This
is your reference number.
4. Commit the baseline row with a one-line message like `baseline: <numbers>`.
## What you CAN do
- Modify **`src/kernels.rs`** freely. You may:
- Pre-process the codebook in `PqKernel::new` (transpose layouts, cache
`c·c` for the FMA trick, pack the codebook for register-resident lookup,
etc.). This cost is paid once per dataset and amortized across queries —
the bench measures per-query, not per-(build + query).
- Reorder loops, switch internal data layouts, drop down to `std::arch`
intrinsics under `#[cfg(target_arch = ...)]` gates. **Always keep a
portable scalar fallback** so the kernel compiles everywhere.
- Use `unsafe` if needed; document the invariants you're relying on.
- Mark hot functions `#[inline]`; add private helpers freely.
- Add `#[cfg(test)] mod tests { ... }` inside `src/kernels.rs` if you want
in-file property checks.
## What you CANNOT do
- Do **not** modify `src/lib.rs` (`PqShape` and the tolerance constants are
shared with the immutable scaffolding).
- Do **not** modify `src/bin/run_experiment.rs`, `src/reference.rs`,
`src/inputs.rs`, `benches/pq_l2.rs`, or `Cargo.toml`.
- Do **not** add new crate dependencies.
- Do **not** alter the public API of `kernels::PqKernel`:
- `PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self`
- `PqKernel::shape(&self) -> &PqShape`
- `PqKernel::distance_table(&self, query: &[f32]) -> Vec<f32>`
- `PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>`
- Do **not** introduce lossy techniques (LUT u8/u16 quantization, asymmetric-
distance approximation, etc.) — the correctness phase asserts exact-up-to-ε
match against the scalar reference. If you want to explore a lossy track,
surface that in a separate kernel and propose a track extension.
## The metric
Minimize `geomean_ns_per_query` (geometric mean of per-query wall-clock across
all timed queries, all shapes, all distributions) subject to:
1. Correctness phase: **pass** (exit-2 otherwise).
2. `worst_ns_per_query` ≤ 1.05 × the last-kept kernel's worst.
3. `total_seconds` ≤ 600.
4. Build is clean: `cargo build --release` succeeds, `cargo clippy --release
--all-targets -- -D warnings` reports zero issues.
Ties break toward simpler code. If two kernels report the same speed within
~3% noise, prefer fewer lines / less `unsafe`.
## Lance-PQ-specific priors (lossless directions)
These directions are known to pay off without compromising arithmetic accuracy.
Pick one hypothesis at a time; implement; measure; decide.
- **Codebook layout.** The reference layout is `[m][k][d]`. For a fixed query,
iterating over centroids stays in cache, but the inner loop over `d` is
short. Transposing to `[m][d][k]` lets you SIMD-load 8 `(query - centroid)`
lanes across `d` and broadcast over `k`. Do the transpose in `PqKernel::new`
once.
- **Cache `c·c`.** The diffsquaresum is `(q - c)·(q - c) = q·q - 2qc + c·c`.
Hoist `q·q` per sub-vector, precompute `c·c` once at codebook-load time.
Inner loop becomes one FMA (`-2qc + cc`). Watch the sign / accumulator
ordering so the rounding stays within tolerance.
- **Probe layout.** The probe is dominated by `acc += table[m][codes[off+m]]`
× `num_sub_vectors`. Transposing codes to `[m][i]` (one row per sub-quantizer,
contiguous over base index) lets you process up to 32+ vectors per inner
iteration with `vpgatherdq`-style loads.
- **Top-K integration.** `push()` does a branch + heap sift on every code.
At 50k probes per query × 9 (shape × dist) combos that's the second-biggest
cost after the gather. Block the probe (e.g., 512 codes at a time), find the
local top-K with a branchless pass, then merge into the global heap.
- **Prefetch.** A `_mm_prefetch(codes.as_ptr().add(off + 64), _MM_HINT_T0)`
ahead of the gather is usually pure win at 50k+ scale where codes don't all
fit in L2.
- **FMA chains for table build.** The diffsquaresum maps cleanly to FMA on
AVX2/NEON. Even without intrinsics, structuring the inner loop so `rustc`
emits FMA helps.
- **Avoid the `Vec` allocation in the hot path.** `distance_table` allocates a
fresh `Vec<f32>` per call. Returning a fixed-capacity buffer is a public-API
change you can't make — but you can reuse a thread-local scratch buffer
internally if it speeds the build.
## The loop
Once setup is done, repeat indefinitely:
1. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas
have been tried, what won, what regressed. Form a hypothesis with one
sentence stating the change and the predicted effect on speed and
correctness.
2. **Edit `src/kernels.rs`.** Keep the diff focused on the one hypothesis.
3. **Build and lint.**
```
cargo build --release
cargo clippy --release --all-targets -- -D warnings
```
If either fails, fix and try again — do not commit broken state.
4. **Run the trial.**
```
cargo run --release --bin run_experiment > run.log 2>&1
```
5. **Parse the result.** Extract `correctness`, `geomean_ns_per_query`,
`worst_ns_per_query` (with combo), `peak_mem_mb`, `total_seconds`. Compute
deltas vs. baseline.
6. **Decide keep or revert.**
- **Keep** iff: `correctness: pass`, geomean strictly better than the
last-kept row (allow ~1% noise band), and `worst_ns_per_query` ≤ 1.05 ×
last-kept's worst.
- **Revert** otherwise: `git restore src/kernels.rs` (or commit and
`git revert` if you want the revert in history). Note what failed.
7. **Log.** Append one row to `results.tsv`:
```
<short_sha> <iso8601> <correctness> <geomean_ns> <worst_ns> <worst_combo> <best_ns> <best_combo> <peak_mem> <elapsed> <keep|revert> <one-line description>
```
8. **Commit.** One-line message describing the change and the headline number,
e.g. `transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)`.
## Hygiene
- Always commit `src/kernels.rs` changes; never commit `results.tsv` or
`run.log` (they're gitignored).
- If a change fails to build, do not commit. Iterate until it builds, or
revert cleanly.
- If two consecutive ideas regress, take a beat: re-read the last ~10 rows of
`results.tsv` and update your mental model before proposing the next.
- Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min, kill it
and mark the trial as `timeout`.
## Never stop
Keep going until interrupted. Each loop iteration is one hypothesis, one edit,
one measurement, one commit. No multi-step plans across iterations.