research: restructure lance-autoresearch as multi-target workspace

The original lance-autoresearch was one Cargo crate optimizing one Lance
kernel (PQ L2 distance). With 9+ candidate targets enumerated in the research
note, a single-crate shape doesn't scale: per-target deps will collide, the
agent's edits to one target's kernels.rs would conflict with another's lib
path, and build/test isolation is lost. Restructure into a Cargo workspace.

Layout:

  research/lance-autoresearch/
  ├── Cargo.toml          (workspace root)
  ├── README.md           (target table, contract overview, repo layout)
  ├── HARNESS.md          (universal loop contract every target inherits)
  ├── crates/
  │   ├── harness-common/ (shared: SplitMix64, geomean, peak RSS,
  │   │                    MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS)
  │   └── pq-l2/          (the landed target; was the previous single crate)
  └── docs/
      ├── design.md           (rationale for workspace shape, no Target trait)
      ├── adding-a-target.md  (step-by-step workflow for new targets)
      └── targets/pq-l2.md    (per-target capsule)

Decisions documented in docs/design.md:

- Workspace, not single crate: per-target Cargo.toml so deps don't collide;
  per-target src tree so agent edits don't conflict; per-target build/test
  isolation for faster agent iteration.
- harness-common as a plumbing-only crate (PRNG, geomean, peak RSS, tolerance
  constants, time budget). Intentionally NO Target trait - decode kernel
  signatures and distance kernel signatures differ enough that a unifying
  trait would either bloat or require erased boxing. Each target is its own
  natural shape.
- Per-target program.md + shared HARNESS.md: the loop contract is universal,
  the priors and API spec are per-target. Two files instead of one because
  copy-pasting the universal loop into every program.md would drift.

pq-l2 refactor:
- src/* moved into crates/pq-l2/src/* via git mv (preserves history)
- crate renamed lance-autoresearch -> pq-l2
- SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL,
  TIME_BUDGET_SECS now imported from harness-common (drops ~70 lines of
  duplication that would have been copy-pasted into every new target)
- program.md trimmed: setup/loop/hygiene moved to HARNESS.md; only the
  PQ-L2-specific API contract and SIMD priors remain
- Cargo.toml depends on harness-common via path; workspace.dependencies
  pins criterion uniformly across targets

The 10 candidate targets from the research note (A1 cosine/dot/hamming, A2
IVF partition select, A3 FTS BM25, A4 bitpack decode, A5 dictionary decode,
A6 FSST decode, A7 take/gather, A8 predicate eval, A9 posting list intersect,
A10 top-K merge) are listed in README.md's target table as "candidate"; each
gets a docs/targets/<name>.md capsule when it's spun up. docs/adding-a-target.md
documents the cp -r + edit-Cargo.toml + rewrite-three-files workflow.

Verified end-to-end:
- cargo build --release: clean, both crates compile
- cargo clippy --release --workspace --all-targets -- -D warnings: clean
- cargo test --release --workspace: 6/6 pass (4 harness-common + 2 pq-l2)
- cargo run --release --bin run_experiment -p pq-l2: correctness pass,
  geomean ~880k ns, exit 0, ~30s wall-clock
- omnigraph parent workspace unchanged (research/ excluded as before)

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
Claude 2026-05-15 00:15:02 +00:00
parent 92ce8f1e7f
commit 0d72cc69fb
21 changed files with 1012 additions and 366 deletions


@ -1,32 +1,14 @@
-# Empty `[workspace]` section so cargo treats this directory as its own
-# workspace root and does NOT walk up to the parent omnigraph workspace.
-# Without this, cargo from inside `research/lance-autoresearch/` will try to
-# resolve omnigraph's dependencies even though we're excluded as a member.
 [workspace]
+resolver = "2"
+members = [
+"crates/harness-common",
+"crates/pq-l2",
+]
-[package]
-name = "lance-autoresearch"
-version = "0.1.0"
-edition = "2024"
-license = "MIT OR Apache-2.0"
-description = "Autoresearch-style harness for evolving Lance PQ L2 distance kernels via LLM agents."
-publish = false
-[lib]
-path = "src/lib.rs"
-[[bin]]
-name = "run_experiment"
-path = "src/bin/run_experiment.rs"
-[[bench]]
-name = "pq_l2"
-harness = false
-[dependencies]
+# Each per-target crate sets its own deps. Shared deps below pin versions
+# uniformly across targets so the workspace lockfile stays clean.
+[workspace.dependencies]
 anyhow = "1"
-[dev-dependencies]
 criterion = { version = "0.5", default-features = false, features = ["plotters", "cargo_bench_support"] }
 [profile.release]


@ -0,0 +1,137 @@
# HARNESS — shared loop contract for every lance-autoresearch target
This document is the universal part of every target's agent instructions. Each
target's `program.md` is a thin layer of *target-specific priors and API spec*
on top of the conventions below. The agent reads `HARNESS.md` and the target's
`program.md` at the start of every session.
## What this harness is
A single agent (you) edits one file in one target crate to optimize a Lance
kernel. Per trial, you build, run a binary that exercises the kernel against
diverse inputs, parse a fixed-format output block, and decide keep-or-revert.
This is a Karpathy-style autoresearch loop. It assumes:
- Per-trial eval is **seconds-scale**. Long enough to measure, short enough to
iterate hundreds of times in a session.
- The kernel has a **deterministic correctness oracle**: a scalar reference
implementation whose output the kernel must reproduce.
- The optimization target is **dataset-independent**: the harness generates
diverse inputs each trial, so wins generalize across distributions and
shapes by construction.
Targets that don't fit these constraints (index-build parameter tuning,
plan-patching, anything where eval is minutes-to-hours) belong in the
BauplanLabs tournament-loop shape, not this harness. See `docs/design.md` for
the boundary.
## What's editable, per target
| Path | Mutability | Why |
|---|---|---|
| `crates/<target>/src/kernels.rs` | **mutable** | Your playground. The whole point. |
| `crates/<target>/src/reference.rs` | immutable | The oracle. Touching it makes wins meaningless. |
| `crates/<target>/src/inputs.rs` | immutable | The fixture generator. Touching it makes timings incomparable across trials. |
| `crates/<target>/src/lib.rs` | immutable | Shared types pinned by the bench (`PqShape` etc.). |
| `crates/<target>/src/bin/run_experiment.rs` | immutable | The trial harness. |
| `crates/<target>/benches/*.rs` | immutable | Criterion bench, optional read-only reference. |
| `crates/<target>/Cargo.toml` | immutable | Adding deps changes the optimization target. |
| `crates/<target>/program.md` | human-iterated between runs | Not edited by you in-loop; the human refines it. |
| `crates/<target>/results.tsv` | append-only | Your audit log. Gitignored. |
| `crates/harness-common/**` | immutable | Workspace-shared infrastructure. |
| `HARNESS.md` (this file) | immutable | Workspace-shared loop contract. |
You may add `#[cfg(test)] mod tests { ... }` inside `kernels.rs` for in-file
property checks. You may NOT add new crate dependencies. You may NOT use
`unsafe` tricks that are sound only under undocumented assumptions (e.g.,
relying on a fixture invariant that holds today but isn't documented).
## The metric
Every target's `run_experiment` binary prints a fixed-format output block ending
with these universal fields:
- `correctness:` is `pass` or `fail`, set by comparing your kernel against the
scalar reference on every input the bench generates.
- `geomean_ns_per_*:` — geometric mean of per-operation wall-clock across all
timed operations.
- `worst_ns_per_*:` — slowest combo's geomean.
- `peak_mem_mb:` — process RSS high-water-mark.
- `total_seconds:` — trial wall-clock.
A kernel is **kept** iff:
1. `correctness: pass` (any failure → `std::process::exit(2)`).
2. `geomean_ns_per_*` strictly better than the previous best-kept kernel
(allow ~1% noise band).
3. `worst_ns_per_*` ≤ 1.05 × the previous best-kept kernel's worst.
4. `total_seconds` ≤ 600 (the per-trial cap; exceed it → `std::process::exit(3)`).
5. Build clean: `cargo build --release` and
`cargo clippy --release --all-targets -- -D warnings` both succeed.
Ties break toward simpler code: same speed within ~3% noise → fewer lines /
less `unsafe` wins.
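The keep criteria above can be sketched as a single predicate. This is a minimal illustration, not part of the harness: the `Trial` struct and `keep` function are hypothetical names, and reading "allow ~1% noise band" as "must beat the previous best by more than 1%" is an assumption.

```rust
/// Universal fields of one trial. Hypothetical struct; the field names
/// mirror the output block but are not defined anywhere in the harness.
#[derive(Debug, Clone, Copy)]
struct Trial {
    correctness_pass: bool,
    build_clean: bool,
    geomean_ns: u64,
    worst_ns: u64,
    total_seconds: f64,
}

const NOISE_BAND: f64 = 0.01; // assumed reading of the "~1% noise band"
const WORST_GUARD: f64 = 1.05; // criterion 3
const TIME_BUDGET_SECS: f64 = 600.0; // criterion 4

/// True when `candidate` satisfies all five keep criteria relative to `best`.
fn keep(candidate: &Trial, best: &Trial) -> bool {
    candidate.correctness_pass
        && candidate.build_clean
        && (candidate.geomean_ns as f64) < (best.geomean_ns as f64) * (1.0 - NOISE_BAND)
        && (candidate.worst_ns as f64) <= (best.worst_ns as f64) * WORST_GUARD
        && candidate.total_seconds <= TIME_BUDGET_SECS
}

fn main() {
    let best = Trial {
        correctness_pass: true,
        build_clean: true,
        geomean_ns: 18_234,
        worst_ns: 24_515,
        total_seconds: 12.3,
    };
    // Clearly faster on geomean, worst case also improved.
    let faster = Trial { geomean_ns: 14_100, worst_ns: 22_500, ..best };
    // Inside the noise band: not a real win.
    let noisy = Trial { geomean_ns: 18_200, ..best };
    // Faster on average but the worst combo regressed by more than 5%.
    let lopsided = Trial { geomean_ns: 14_100, worst_ns: 26_000, ..best };

    println!("faster:   {}", keep(&faster, &best));
    println!("noisy:    {}", keep(&noisy, &best));
    println!("lopsided: {}", keep(&lopsided, &best));
}
```

The tie-break toward simpler code is deliberately left out of the predicate, since it needs a judgment about the diff rather than about the numbers.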
## The loop
After reading `HARNESS.md` and the target's `program.md`:
1. **Setup (once per session).** Confirm `results.tsv` exists; if not, create
it with a per-target header (the target's `program.md` defines the columns).
Run the baseline trial:
```
cargo run --release --bin run_experiment -p <target> > run.log 2>&1
```
Append a row tagged `keep=baseline` and commit it.
2. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas
have been tried, what won, what regressed. Form one hypothesis with one
sentence stating the change and the predicted effect on speed and
correctness.
3. **Edit `kernels.rs`.** Keep the diff focused on the one hypothesis.
4. **Build and lint.**
```
cargo build --release
cargo clippy --release --all-targets -- -D warnings
```
If either fails, fix and retry. Do not commit broken state.
5. **Run the trial.**
```
cargo run --release --bin run_experiment -p <target> > run.log 2>&1
```
6. **Parse and decide.** Extract the universal fields plus any per-target
fields. Compute deltas vs. the last-kept row. Apply the keep criteria above.
7. **Log.** Append one row to `results.tsv` matching the target's header.
8. **Commit.** One-line message describing the change and the headline number,
e.g. `transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)`.
9. **Hygiene.**
- Always commit `kernels.rs` changes; never commit `results.tsv` or
`run.log` (gitignored).
- If a change fails to build, do not commit. Iterate or revert cleanly.
- If two consecutive ideas regress, take a beat: re-read the last ~10 rows
and update your mental model before proposing the next.
- Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min,
kill it and mark the trial as `timeout`.
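Step 6 ("Parse and decide") could be approached along these lines. This is a sketch under assumptions: the helper names `parse_block` and `number_with_prefix` are hypothetical, and the sample log only borrows the `key: value` field shapes described in this document; a real `run.log` may contain additional lines.

```rust
use std::collections::HashMap;

/// Collect every `key: value` line of a run log into a map. Lines without a
/// colon, or whose text before the first colon contains whitespace, are skipped.
fn parse_block(log: &str) -> HashMap<&str, &str> {
    log.lines()
        .filter_map(|line| line.split_once(':'))
        .map(|(key, value)| (key.trim(), value.trim()))
        .filter(|(key, _)| !key.is_empty() && !key.contains(char::is_whitespace))
        .collect()
}

/// Numeric value of the first field whose name starts with `prefix`. Only the
/// leading token is parsed, so trailing annotations are ignored.
fn number_with_prefix(fields: &HashMap<&str, &str>, prefix: &str) -> Option<f64> {
    fields
        .iter()
        .find(|(key, _)| key.starts_with(prefix))
        .and_then(|(_, value)| value.split_whitespace().next())
        .and_then(|token| token.parse().ok())
}

fn main() {
    // Illustrative log text, not captured from a real run.
    let log = "\
correctness: pass
geomean_ns_per_query: 18234
worst_ns_per_query: 24515 ((768,96,256), sparse)
(128,16,256) clustered -> 12876 ns
peak_mem_mb: 28.4
total_seconds: 12.3
";
    let fields = parse_block(log);
    let correct = fields.get("correctness") == Some(&"pass");
    let geomean = number_with_prefix(&fields, "geomean_ns_per_");
    let worst = number_with_prefix(&fields, "worst_ns_per_");
    let seconds = number_with_prefix(&fields, "total_seconds");
    println!("correct={correct} geomean={geomean:?} worst={worst:?} seconds={seconds:?}");
}
```

Matching on the `geomean_ns_per_` and `worst_ns_per_` prefixes keeps the same parser usable across targets whose per-operation suffix differs.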
## Never stop
Keep going until interrupted. Each loop iteration is one hypothesis, one edit,
one measurement, one commit. No multi-step plans across iterations.
## Working across multiple targets
Work on **one target per session**, even when the overall effort spans several
targets. Don't edit `kernels.rs` in two crates between commits: the agent's
mental model is shared, but the keep-decision is per-target. Pick a target, do a
session there, commit, switch.
The human is responsible for selecting which target to work on next. Don't
proactively switch targets unless the user asks.


@ -1,112 +1,143 @@
 # lance-autoresearch
-An autoresearch-style harness for evolving [Lance](https://github.com/lance-format/lance)
-PQ L2 distance kernels via LLM coding agents (Claude Code, Codex, Cursor).
-Modeled on Andrej Karpathy's
+A multi-target workspace for evolving [Lance](https://github.com/lance-format/lance)
+hot-path kernels via LLM coding agents (Claude Code, Codex, Cursor),
+in the style of Andrej Karpathy's
 [`nanochat-research`](https://x.com/karpathy/status/1855651423497650238)
-three-file contract:
-- **Immutable bench** `src/bin/run_experiment.rs` + `src/inputs.rs` +
-`src/reference.rs`. The agent cannot touch these.
-- **Mutable kernel** `src/kernels.rs`. The agent's playground. Starts as a
-scalar baseline matching Lance's PQ L2 algorithm shape; the agent's job is to
-beat it.
-- **Human-iterated program** `program.md`. The "skill" the agent reads at
-the start of every session. The human refines it between runs.
+single-agent autoresearch loop.
+Each target is an independent Rust crate under `crates/`:
+| Target | Status | Lance source area | What's optimized |
+|---|---|---|---|
+| [`crates/pq-l2`](crates/pq-l2) | landed | `lance-linalg::distance::l2`, PQ probe | PQ L2 distance: build LUT, probe codes, top-K |
+| `crates/pq-cosine` | candidate (A1) | `lance-linalg::distance::cosine` | PQ cosine distance |
+| `crates/pq-dot` | candidate (A1) | `lance-linalg::distance::dot` | PQ dot-product distance |
+| `crates/ivf-partition` | candidate (A2) | `lance-index::vector::ivf` partition select | IVF partition selection (centroid scan) |
+| `crates/fts-bm25` | candidate (A3) | `lance-index::scalar::inverted` BM25 | FTS BM25 scoring inner loop |
+| `crates/bitpack` | candidate (A4) | `lance-encoding::encodings::bitpack` | Bitpack integer decode |
+| `crates/dictionary` | candidate (A5) | `lance-encoding::encodings::dictionary` | Dictionary decode |
+| `crates/fsst` | candidate (A6) | `lance-encoding::encodings::fsst` | FSST string decode |
+| `crates/take` | candidate (A7) | `lance-core::utils::take` | Take / gather kernel |
+| `crates/predicate` | candidate (A8) | `lance-datafusion` filter eval | Predicate evaluation kernels |
+| `crates/posting-intersect` | candidate (A9) | `lance-index::scalar::inverted` | Posting list intersection (FTS AND) |
+| `crates/topk-merge` | candidate (A10) | scan-merge | Top-K k-way merge |
+The candidate targets are documented in [`docs/targets/`](docs/targets/) and can
+be added by following [`docs/adding-a-target.md`](docs/adding-a-target.md). The
+single landed target (`pq-l2`) proves the harness shape; the candidates wait
+for an agent to spin them up.
+## The contract every target follows
+Karpathy's three-file shape, applied per target:
+| File (per target crate) | Mutability | Edited by |
+|---|---|---|
+| `src/kernels.rs` | **mutable** | the agent |
+| `src/reference.rs`, `src/inputs.rs`, `src/lib.rs`, `src/bin/run_experiment.rs`, `benches/*.rs` | immutable | — |
+| `program.md` | human-iterated | the human, between runs |
+| `results.tsv` | append-only | the agent, per trial (gitignored) |
+The shared utilities — deterministic PRNG, geomean, peak-RSS readback,
+tolerance constants, time-budget — live in [`crates/harness-common`](crates/harness-common/src/lib.rs)
+and are consumed by every target. There is intentionally **no `Target` trait**:
+decode-kernel signatures and distance-kernel signatures are different enough
+that a unifying trait would either bloat or require erased boxing. Each target
+is its own natural shape; the shared crate is plumbing only.
+The shared loop conventions every target's `program.md` inherits live in
+[`HARNESS.md`](HARNESS.md). Per-target priors and API specifics live in each
+target's own `program.md`.
 ## Dataset-independent by design
 Every other ANN benchmark you've seen is "compete on this fixed dataset"
-(SIFT1M, GIST1M, DEEP1B). That conflates two things: *kernel correctness*
-(the math) and *kernel speed under one specific data distribution*. An LLM
-agent given recall@K as the oracle has incentive to overfit to the dataset's
-quirks.
-We split them:
-- **Correctness** = bit-equivalent (`max_abs_err ≤ 1e-4`) match to a scalar
-reference kernel, on diverse generated inputs (Gaussian, uniform, sparse,
-large-dynamic-range, mostly-zero) × multiple PQ shapes. This is mathematical
-equivalence; there's no dataset to overfit. Lossy techniques fail this gate.
-- **Speed** = geomean ns/query across multiple PQ shapes ×
-multiple data distributions. A kernel that wins on one distribution and
-regresses on another fails the worst-case guard.
+(SIFT1M, GIST1M, DEEP1B). That conflates two things: *kernel correctness* (the
+math) and *kernel speed under one specific data distribution*. An LLM agent
+given recall@K as the oracle has incentive to overfit to the dataset's quirks.
+We split them, every target:
+- **Correctness** = bit-equivalent (`max_abs_err ≤ 1e-4` for floats; bitwise for
+integer/byte kernels) match to a scalar reference, on diverse generated
+inputs. Mathematical equivalence; no dataset to overfit. Lossy techniques fail
+this gate.
+- **Speed** = geomean ns/operation across multiple shape × distribution
+combinations, with worst-case guard. A kernel that wins on one distribution
+and regresses on another fails to keep.
 By construction, an "improvement" generalizes across distributions and shapes.
-There is no `wget sift.tar.gz` step; the harness is fully self-contained.
+There is no `wget sift.tar.gz` step; every target is fully self-contained.
-## Why a separate repo
+## Why a separate repo (and a workspace, not a single crate)
 OmniGraph (the graph engine that motivated this) pins Lance at a released
-version and consumes its kernels via the public crate API. Improvements live one
-layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps the
-optimization target pure (only the kernel changes), keeps the license clean for
-upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and
-keeps the agent's working set tiny.
+version and consumes its kernels via the public crate API. Improvements live
+one layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps
+the optimization target pure (only the kernel changes), keeps the license clean
+for upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and
+keeps each agent's working set tiny.
+**Workspace not single-crate** because per-target deps differ — FSST decode
+will want a different dependency set than PQ kernels — and the agent's edits
+to one target's `kernels.rs` must not collide with another's lib path. Each
+target is buildable, testable, and runnable in isolation: `cd crates/<target>
+&& cargo run --release --bin run_experiment`.
 ## Quick start
 ```bash
-cargo run --release --bin run_experiment
-# Or run with Claude Code / Codex:
-# Open the repo in your agent of choice and prompt:
-# Hi, have a look at program.md and let's kick off a new experiment.
+# Run the landed PQ L2 target's baseline.
+cargo run --release --bin run_experiment -p pq-l2
+# Or with Claude Code / Codex, working on one target:
+cd crates/pq-l2
+# Open in your agent of choice and prompt:
+# Hi, have a look at program.md and let's kick off a new experiment.
+# Add a new target (see docs/adding-a-target.md):
+cp -r crates/pq-l2 crates/pq-cosine
+# ... edit Cargo.toml name, kernels.rs / reference.rs / inputs.rs / program.md
 ```
-## File ownership
-| File | Mutability | Edited by |
-|---|---|---|
-| `src/kernels.rs` | **mutable** | the agent |
-| `src/bin/run_experiment.rs` | immutable | — |
-| `src/reference.rs` | immutable | — |
-| `src/inputs.rs` | immutable | — |
-| `src/lib.rs` | immutable (shared types) | — |
-| `benches/pq_l2.rs` | immutable | — |
-| `program.md` | human-iterated | the human, between runs |
-| `results.tsv` | append-only | the agent, per trial (gitignored) |
-## The metric
-`run_experiment` runs two phases per trial: a correctness check and a
-multi-shape × multi-distribution speed measurement. Output looks like:
+## Repo layout
 ```
-correctness: pass
----
-correctness: pass
-shapes_tested: (128,16,256) (256,16,256) (768,96,256)
-distributions_tested: clustered uniform sparse
-geomean_ns_per_query: 18234
-worst_ns_per_query: 24515 ((768,96,256), sparse)
-best_ns_per_query: 12876 ((128,16,256), clustered)
-per_combo_geomean_ns:
-(128,16,256) clustered -> 12876 ns
-(128,16,256) uniform -> 13441 ns
-...
-peak_mem_mb: 28.4
-total_seconds: 12.3
+lance-autoresearch/
+├── Cargo.toml  # workspace root
+├── README.md  # you are here
+├── HARNESS.md  # shared loop contract every target inherits
+├── LICENSE-MIT, LICENSE-APACHE  # dual-licensed (Apache compat for Lance PRs)
+├── crates/
+│   ├── harness-common/  # shared: SplitMix64, geomean, peak RSS, tolerance, time budget
+│   │   └── src/{lib,prng,stats,sysinfo,tolerance}.rs
+│   └── pq-l2/  # landed target
+│       ├── Cargo.toml
+│       ├── program.md  # this target's agent skill
+│       ├── src/
+│       │   ├── lib.rs  # PqShape + module wiring (immutable)
+│       │   ├── kernels.rs  # MUTABLE — agent's playground
+│       │   ├── reference.rs  # IMMUTABLE — scalar reference, oracle helpers
+│       │   ├── inputs.rs  # IMMUTABLE — diverse test-data generators
+│       │   └── bin/run_experiment.rs  # IMMUTABLE — per-trial entry point
+│       └── benches/pq_l2.rs  # criterion benchmark (immutable)
+└── docs/
+    ├── design.md  # rationale for the workspace shape
+    ├── adding-a-target.md  # workflow for spinning up a new target
+    └── targets/
+        └── pq-l2.md  # capsule: upstream Lance pointers, oracle, status
 ```
-A kernel is "kept" iff:
-- Correctness phase passes (mathematical equivalence to scalar reference)
-- `geomean_ns_per_query` strictly better than the previous best-kept kernel
-- `worst_ns_per_query` ≤ 1.05 × the previous best-kept kernel's worst
-- `total_seconds` ≤ 600
-See `program.md` for the full loop spec.
 ## Upstream contribution path
-When a commit clears the keep bar by a meaningful margin (≥10% geomean
-speedup with worst-case guard intact), the human reviews the diff, ports the
-technique against [`lance-format/lance`](https://github.com/lance-format/lance)
-HEAD, runs Lance's own test suite, and opens a PR. Because `src/kernels.rs` is
-dual MIT/Apache-2.0 licensed and algorithmically modeled on Lance's existing
-path, the upstream PR inherits Apache-2.0 cleanly.
+When a commit on any target clears the keep bar by a meaningful margin
+(≥10% geomean speedup with worst-case guard intact), the human reviews the
+diff, ports the technique against
+[`lance-format/lance`](https://github.com/lance-format/lance) HEAD, runs
+Lance's own test suite, and opens a PR. Because the workspace is dual
+MIT/Apache-2.0 licensed and each target's kernel is algorithmically modeled on
+Lance's existing path, the upstream PR inherits Apache-2.0 cleanly.
 ## License


@ -0,0 +1,10 @@
[package]
name = "harness-common"
version = "0.1.0"
edition = "2024"
license = "MIT OR Apache-2.0"
description = "Shared utilities for lance-autoresearch per-target harnesses (PRNG, geomean, peak RSS, tolerance constants, time budget)."
publish = false
[lib]
path = "src/lib.rs"


@ -0,0 +1,36 @@
//! Shared utilities for lance-autoresearch per-target harnesses.
//!
//! Each target crate (`pq-l2`, future `pq-cosine`, `bitpack-decode`, etc.)
//! defines its own `kernels.rs` (mutable, the agent's playground), `reference.rs`
//! (immutable scalar reference), `inputs.rs` (immutable test-data generators),
//! and `bin/run_experiment.rs` (immutable per-trial entry point). They all need
//! the same handful of building blocks: a deterministic PRNG, a geomean
//! aggregator, peak-RSS readback, tolerance constants for the bit-exact oracle,
//! and a single shared time-budget constant. That's everything in this crate.
//!
//! What is **not** here, and intentionally not abstracted:
//!
//! - A `Target` trait. Decode kernels (`bitpack`, `dictionary`, `FSST`) have
//! very different signatures than distance kernels (`PqKernel::probe_top_k`),
//! and forcing them into one trait shape would either bloat the trait or
//! require erased boxing. Keep each target's API natural to its kernel.
//!
//! - Output-format orchestration. Each target's `run_experiment.rs` prints its
//! own fixed-format result block — different targets report different
//! per-combo dimensions (PQ shapes vs bit widths vs distribution kinds vs ...).
//! Sharing the format would make the per-target binaries less readable and
//! gain very little — `println!` is cheap.
pub mod prng;
pub mod stats;
pub mod sysinfo;
pub mod tolerance;
pub use prng::SplitMix64;
pub use stats::geomean;
pub use sysinfo::peak_rss_mb;
pub use tolerance::{MAX_ABS_ERR, TOPK_DIST_TOL};
/// Per-trial wall-clock cap. Targets should `std::process::exit(3)` if exceeded
/// so the agent's loop logs the trial as a timeout instead of a measurement.
pub const TIME_BUDGET_SECS: u64 = 600;


@ -0,0 +1,52 @@
//! Deterministic SplitMix64 PRNG. Same seed produces the same sequence on
//! every machine; no platform-specific RNG / no `rand` crate. Reproducibility
//! across trials is the whole point.
pub struct SplitMix64 {
state: u64,
}
impl SplitMix64 {
pub fn new(seed: u64) -> Self {
Self { state: seed }
}
pub fn next_u64(&mut self) -> u64 {
self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
let mut z = self.state;
z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
z ^ (z >> 31)
}
/// Uniform in `[0, 1)` with 24 bits of mantissa precision.
pub fn next_f32(&mut self) -> f32 {
let bits = (self.next_u64() >> 40) as u32;
bits as f32 / ((1u32 << 24) as f32)
}
/// Standard normal via Box-Muller. Cheap and sufficient for fixture
/// generation; not cryptographically anything.
pub fn next_normal(&mut self) -> f32 {
let mut u1 = self.next_f32();
if u1 < 1e-7 {
u1 = 1e-7;
}
let u2 = self.next_f32();
(-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos()
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn deterministic_across_calls() {
let mut a = SplitMix64::new(0x1234_5678);
let mut b = SplitMix64::new(0x1234_5678);
for _ in 0..1000 {
assert_eq!(a.next_u64(), b.next_u64());
}
}
}


@ -0,0 +1,36 @@
//! Geometric mean of u64 timings. Robust to outliers; the right aggregator for
//! latency distributions because halving one query and doubling another cancels.
pub fn geomean(xs: &[u64]) -> u64 {
if xs.is_empty() {
return 0;
}
let mut sum_ln = 0.0f64;
for &x in xs {
sum_ln += (x.max(1) as f64).ln();
}
(sum_ln / xs.len() as f64).exp() as u64
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn empty_yields_zero() {
assert_eq!(geomean(&[]), 0);
}
#[test]
fn single_value_round_trips() {
assert_eq!(geomean(&[100]), 100);
}
#[test]
fn geomean_is_below_arithmetic_mean() {
let xs = [1, 10, 100, 1000];
let g = geomean(&xs);
let am: u64 = xs.iter().sum::<u64>() / xs.len() as u64;
assert!(g < am);
}
}


@ -0,0 +1,24 @@
//! Peak resident-set-size readback (Linux only; non-Linux returns 0).
#[cfg(target_os = "linux")]
pub fn peak_rss_mb() -> f64 {
let Ok(s) = std::fs::read_to_string("/proc/self/status") else {
return 0.0;
};
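for line in s.lines() {
// `VmHWM` is the peak resident set size ("high-water mark"); `VmPeak` would
// report peak virtual memory size instead.
if let Some(rest) = line.strip_prefix("VmHWM:") {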
let kb: f64 = rest
.split_whitespace()
.next()
.and_then(|t| t.parse().ok())
.unwrap_or(0.0);
return kb / 1024.0;
}
}
0.0
}
#[cfg(not(target_os = "linux"))]
pub fn peak_rss_mb() -> f64 {
0.0
}


@ -0,0 +1,15 @@
//! Default tolerance constants for bit-exact correctness oracles.
//!
//! These suit float-arithmetic kernels (PQ distance, BM25 scoring, vector
//! normalization) where SIMD-accumulator reordering is legal but real bugs
//! shift values by orders of magnitude. Targets that operate on integer or
//! byte-exact data (bitpack decode, dictionary decode, FSST decode) should
//! assert strict bitwise equality and not use these constants.
/// Maximum permitted absolute element error between agent kernel output and
/// scalar reference output, for float kernels.
pub const MAX_ABS_ERR: f32 = 1e-4;
/// Maximum permitted distance error when comparing top-K results between
/// agent kernel and scalar reference, for float kernels.
pub const TOPK_DIST_TOL: f32 = 1e-4;


@ -0,0 +1,24 @@
[package]
name = "pq-l2"
version = "0.1.0"
edition = "2024"
license = "MIT OR Apache-2.0"
description = "Autoresearch target: Lance PQ L2 distance kernel optimization."
publish = false
[lib]
path = "src/lib.rs"
[[bin]]
name = "run_experiment"
path = "src/bin/run_experiment.rs"
[[bench]]
name = "pq_l2"
harness = false
[dependencies]
harness-common = { path = "../harness-common" }
[dev-dependencies]
criterion = { workspace = true }


@ -7,8 +7,8 @@ use std::hint::black_box;
 use criterion::{Criterion, criterion_group, criterion_main};
-use lance_autoresearch::inputs::{SHAPES, SPEED_TOP_K, speed_workloads};
-use lance_autoresearch::kernels::PqKernel;
+use pq_l2::inputs::{SHAPES, SPEED_TOP_K, speed_workloads};
+use pq_l2::kernels::PqKernel;
 fn bench_pq_l2(c: &mut Criterion) {
 let workloads = speed_workloads(0xBE3C_C0DE_F1AC_BABE);


@ -0,0 +1,98 @@
# Target: PQ L2 — agent instructions
This is the per-target overlay on top of [`../../HARNESS.md`](../../HARNESS.md).
Read **HARNESS.md first** for the universal loop contract (what's editable,
the metric, the loop, hygiene, never stop). This file adds the PQ-L2-specific
API spec and priors.
## Setup (once per session)
1. Read in this order:
- `../../HARNESS.md`
- `../../README.md`
- `program.md` (this file)
- `src/lib.rs`
- `src/kernels.rs` *(the only file you may edit)*
- `src/reference.rs`
- `src/inputs.rs`
- `src/bin/run_experiment.rs`
2. Ensure `results.tsv` exists. If not, create it with this header:
```
commit timestamp correctness geomean_ns worst_ns worst_combo best_ns best_combo peak_mem_mb total_seconds keep description
```
3. Baseline trial:
```
cargo run --release --bin run_experiment > run.log 2>&1
```
Append a row tagged `keep=baseline`, commit it.
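Appending a row that matches the header above might look like the following. This is only a sketch: the `tsv_row` helper is hypothetical, the commit hash and timestamp format are placeholders, and the numeric values are illustrative rather than measured.

```rust
/// Column order taken from the `results.tsv` header in the setup steps.
const COLUMNS: [&str; 12] = [
    "commit", "timestamp", "correctness", "geomean_ns", "worst_ns", "worst_combo",
    "best_ns", "best_combo", "peak_mem_mb", "total_seconds", "keep", "description",
];

/// Join one trial's values into a tab-separated row. Tabs and newlines inside
/// a value are replaced with spaces so the row stays on a single line.
fn tsv_row(values: &[&str; 12]) -> String {
    values
        .iter()
        .map(|v| v.replace(['\t', '\n'], " "))
        .collect::<Vec<_>>()
        .join("\t")
}

fn main() {
    println!("{}", COLUMNS.join("\t"));
    let row = tsv_row(&[
        "abc1234",              // placeholder short commit hash
        "2026-05-15T00:15:02Z", // placeholder timestamp format
        "pass",
        "18234",
        "24515",
        "(768,96,256)/sparse",
        "12876",
        "(128,16,256)/clustered",
        "28.4",
        "12.3",
        "baseline",
        "scalar baseline",
    ]);
    println!("{row}");
}
```

Fixing the column count in the array type means a row with a missing or extra field fails to compile instead of silently misaligning the log.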
## Public API contract (must remain stable)
The bench imports these from `crate::kernels`. You may NOT change their
signatures. You MAY add private helpers, internal data layouts, `unsafe`
blocks, `std::arch` intrinsics under `#[cfg(target_arch = ...)]` gates,
pre-computed state inside `PqKernel`, etc.
```rust
pub struct PqKernel { /* agent's private fields */ }
impl PqKernel {
pub fn new(shape: PqShape, codebook: &[f32]) -> Self;
pub fn shape(&self) -> &PqShape;
pub fn distance_table(&self, query: &[f32]) -> Vec<f32>;
pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>;
}
```
Pre-processing in `new` is free — the bench measures `distance_table +
probe_top_k` per query, not per (build + query). Codebook transposes,
cached `c·c`, packed LUTs, etc., should live in `new`.
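To make the contract concrete, here is a minimal scalar sketch that satisfies the four signatures. It is not the baseline shipped in `kernels.rs`. The `PqShape` field names are assumptions (the real type lives in `lib.rs` and is not shown here), the codebook is assumed to use the `[m][k][d]` layout, and the codes are assumed row-major per vector.

```rust
/// Hypothetical stand-in for the real `PqShape`; field names are assumed.
#[derive(Clone, Copy)]
pub struct PqShape {
    pub dim: usize,             // full vector dimension
    pub num_sub_vectors: usize, // m
    pub num_centroids: usize,   // k, at most 256 because codes are u8
}

pub struct PqKernel {
    shape: PqShape,
    codebook: Vec<f32>, // [m][k][d_sub], copied once in `new`
}

impl PqKernel {
    pub fn new(shape: PqShape, codebook: &[f32]) -> Self {
        assert_eq!(codebook.len(), shape.num_centroids * shape.dim);
        Self { shape, codebook: codebook.to_vec() }
    }

    pub fn shape(&self) -> &PqShape {
        &self.shape
    }

    /// Squared L2 distance from each query sub-vector to each centroid,
    /// laid out as `table[sub * k + centroid]`.
    pub fn distance_table(&self, query: &[f32]) -> Vec<f32> {
        let PqShape { dim, num_sub_vectors: m, num_centroids: k } = self.shape;
        let d_sub = dim / m;
        let mut table = Vec::with_capacity(m * k);
        for sub in 0..m {
            let q = &query[sub * d_sub..(sub + 1) * d_sub];
            for c in 0..k {
                let start = (sub * k + c) * d_sub;
                let centroid = &self.codebook[start..start + d_sub];
                table.push(q.iter().zip(centroid).map(|(a, b)| (a - b) * (a - b)).sum());
            }
        }
        table
    }

    /// Sum table entries per encoded vector, then return the `k` nearest
    /// as `(index, distance)` sorted by ascending distance.
    pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)> {
        let m = self.shape.num_sub_vectors;
        let kc = self.shape.num_centroids;
        let mut all: Vec<(u32, f32)> = (0..num_vectors)
            .map(|i| {
                let dist: f32 = (0..m)
                    .map(|sub| table[sub * kc + codes[i * m + sub] as usize])
                    .sum();
                (i as u32, dist)
            })
            .collect();
        all.sort_by(|a, b| a.1.total_cmp(&b.1));
        all.truncate(k);
        all
    }
}

fn main() {
    let shape = PqShape { dim: 4, num_sub_vectors: 2, num_centroids: 2 };
    // Sub-quantizer 0: (0,0), (1,1). Sub-quantizer 1: (0,0), (2,2).
    let kernel = PqKernel::new(shape, &[0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 2.0, 2.0]);
    let table = kernel.distance_table(&[1.0, 1.0, 2.0, 2.0]);
    let top = kernel.probe_top_k(&table, &[0, 0, 1, 1, 1, 0], 3, 2);
    println!("table={table:?} top={top:?}");
}
```

The full sort in `probe_top_k` is the simplest correct choice for a sketch; the priors in this file describe the heap-based and blocked alternatives that matter at 20k+ vectors.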
## What you can / cannot do
(See HARNESS.md for the universal table; this is the PQ-L2 specific
addition.)
- **Cannot** change `PqShape` or the constants in `lib.rs`. They define
the optimization target.
- **Cannot** introduce lossy techniques (LUT u8/u16 quantization, asymmetric
approximation, anything that drops bits relative to the scalar reference).
The correctness phase asserts `max_abs_err ≤ 1e-4` against the scalar
reference; lossy techniques fail this gate. If you want to explore a lossy
track, propose it to the human as a separate kernel surface.
- **Can** mark hot functions `#[inline]`, split them, add private helpers.
- **Can** add `#[cfg(test)] mod tests { ... }` inside `kernels.rs` for in-file
property checks against the scalar path.
## Lance-PQ-specific priors
These are the directions that pay off on this kernel shape without
compromising arithmetic accuracy. Pick one hypothesis per trial; don't try
to combine multiple ideas at once.
- **Codebook layout transpose.** The reference layout is `[m][k][d]`.
Transposing to `[m][d][k]` lets you SIMD-load 8 `(query - centroid)` lanes
across `d` and broadcast over `k`. Do the transpose in `PqKernel::new` once.
- **Cache `c·c` per centroid.** The squared-difference sum expands as
`(q - c)·(q - c) = q·q - 2(q·c) + c·c`. Hoist `q·q` per sub-vector,
precompute `c·c` once at `new()` time, store next to the codebook. Inner
loop becomes one FMA per element. Watch sign / accumulator ordering so
rounding stays within `MAX_ABS_ERR`.
- **Probe-side code transpose.** Probe is dominated by
`acc += table[m][codes[off+m]]` × `num_sub_vectors`. Transposing codes to
`[m][i]` (one row per sub-quantizer, contiguous over base index) lets you
process 32+ vectors per inner iteration with `vgatherdps`-style loads.
- **Top-K block-then-merge.** `push()` does a branch + heap sift on every
code. At 20k probes per query × 9 (shape × dist) combos that's the
second-biggest cost after the gather. Block the probe (e.g., 512 codes at
a time), find the local top-K with a branchless pass, then merge into the
global heap.
- **Prefetch.** `_mm_prefetch(codes.as_ptr().add(off + 64), _MM_HINT_T0)`
ahead of the gather is usually pure win at 20k+ scale.
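- **FMA chains for table build.** The squared-difference sum maps cleanly to
FMA on AVX2/NEON. Even without intrinsics, structuring the inner loop so
`rustc` emits FMA helps. Note that `rustc` does not fuse `a * b + c` on its
own: use `f32::mul_add` and build with the `fma` target feature enabled.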
- **Avoid the `Vec` allocation in the hot path.** `distance_table` allocates
a fresh `Vec<f32>` per call. The public API is fixed (returns `Vec<f32>`),
but you can reuse a thread-local scratch buffer internally and copy to a
`Vec` at the boundary if it speeds the build.
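
The `c·c` caching prior above can be sketched as follows. This is a minimal illustration, assuming a flat `[m][k][d]` codebook; the helper names (`table_naive`, `precompute_cc`, `table_expanded`) are hypothetical and not part of the `PqKernel` API.

```rust
/// Naive table build: sum over d of (q - c)^2, codebook laid out [m][k][d].
fn table_naive(q: &[f32], codebook: &[f32], m: usize, k: usize, d: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * k];
    for mi in 0..m {
        let qs = &q[mi * d..][..d];
        for ki in 0..k {
            let c = &codebook[(mi * k + ki) * d..][..d];
            out[mi * k + ki] = qs.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum::<f32>();
        }
    }
    out
}

/// Done once at build time: c·c for every centroid (hypothetical helper).
fn precompute_cc(codebook: &[f32], d: usize) -> Vec<f32> {
    codebook
        .chunks_exact(d)
        .map(|c| c.iter().map(|x| x * x).sum::<f32>())
        .collect()
}

/// Per query: q·q is hoisted per sub-vector, so the inner loop is a dot product.
fn table_expanded(q: &[f32], codebook: &[f32], cc: &[f32], m: usize, k: usize, d: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * k];
    for mi in 0..m {
        let qs = &q[mi * d..][..d];
        let qq = qs.iter().map(|x| x * x).sum::<f32>();
        for ki in 0..k {
            let c = &codebook[(mi * k + ki) * d..][..d];
            let qc = qs.iter().zip(c).map(|(a, b)| a * b).sum::<f32>();
            out[mi * k + ki] = qq - 2.0 * qc + cc[mi * k + ki];
        }
    }
    out
}
```

The two forms agree exactly on small integer-valued inputs, but on real data the expanded form rounds differently, so a trial should rely on the correctness phase rather than on this identity alone.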

@@ -35,18 +35,18 @@
 use std::time::Instant;
-use lance_autoresearch::inputs::{
+use harness_common::{MAX_ABS_ERR, TIME_BUDGET_SECS, TOPK_DIST_TOL, geomean, peak_rss_mb};
+use pq_l2::inputs::{
     DISTRIBUTIONS, DataDistribution, SHAPES, SpeedWorkload, correctness_battery, speed_workloads,
 };
-use lance_autoresearch::kernels::PqKernel;
-use lance_autoresearch::reference::{ScalarReference, max_abs_err, topk_consistent};
-use lance_autoresearch::{MAX_ABS_ERR, PqShape, TOPK_DIST_TOL};
+use pq_l2::kernels::PqKernel;
+use pq_l2::reference::{ScalarReference, max_abs_err, topk_consistent};
+use pq_l2::PqShape;
 // Any constants; the only requirement is that they're pinned across trials so
 // the inputs and the timings are reproducible.
 const CORRECTNESS_SEED: u64 = 0xC0FF_EEC0_DEBE_EFFE;
 const SPEED_SEED: u64 = 0x5EED_F1AC_BABE_FACE;
-const TIME_BUDGET_SECS: u64 = 600;
 fn main() {
     let start = Instant::now();
@@ -210,17 +210,6 @@ fn run_speed(workloads: &[SpeedWorkload]) -> SpeedReport {
     }
 }
-fn geomean(xs: &[u64]) -> u64 {
-    if xs.is_empty() {
-        return 0;
-    }
-    let mut sum_ln = 0.0f64;
-    for &x in xs {
-        sum_ln += (x.max(1) as f64).ln();
-    }
-    (sum_ln / xs.len() as f64).exp() as u64
-}
 fn format_shape(s: &PqShape) -> String {
     format!("({},{},{})", s.dim, s.num_sub_vectors, s.num_centroids)
 }
@@ -233,26 +222,3 @@ fn format_dist(d: &DataDistribution) -> String {
     }
     .to_string()
 }
-#[cfg(target_os = "linux")]
-fn peak_rss_mb() -> f64 {
-    let Ok(s) = std::fs::read_to_string("/proc/self/status") else {
-        return 0.0;
-    };
-    for line in s.lines() {
-        if let Some(rest) = line.strip_prefix("VmPeak:") {
-            let kb: f64 = rest
-                .split_whitespace()
-                .next()
-                .and_then(|t| t.parse().ok())
-                .unwrap_or(0.0);
-            return kb / 1024.0;
-        }
-    }
-    0.0
-}
-#[cfg(not(target_os = "linux"))]
-fn peak_rss_mb() -> f64 {
-    0.0
-}

@@ -16,6 +16,7 @@
 //! the codebook is shape-appropriate, not random.
 use crate::PqShape;
+use harness_common::SplitMix64;
 /// PQ shapes the bench evaluates. The agent's kernel must produce correct
 /// output and competitive speed on every one.
@@ -295,36 +296,6 @@ fn encode(shape: PqShape, n: usize, base: &[f32], codebook: &[f32]) -> Vec<u8> {
     out
 }
-/// SplitMix64: small, deterministic; bit-for-bit reproducible across machines.
-struct SplitMix64 {
-    state: u64,
-}
-impl SplitMix64 {
-    fn new(seed: u64) -> Self {
-        Self { state: seed }
-    }
-    fn next_u64(&mut self) -> u64 {
-        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
-        let mut z = self.state;
-        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
-        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
-        z ^ (z >> 31)
-    }
-    fn next_f32(&mut self) -> f32 {
-        let bits = (self.next_u64() >> 40) as u32;
-        bits as f32 / ((1u32 << 24) as f32)
-    }
-    fn next_normal(&mut self) -> f32 {
-        let mut u1 = self.next_f32();
-        if u1 < 1e-7 {
-            u1 = 1e-7;
-        }
-        let u2 = self.next_f32();
-        (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos()
-    }
-}
 fn shape_hash(s: PqShape) -> u64 {
     (s.dim as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
         ^ (s.num_sub_vectors as u64).wrapping_mul(0xBF58_476D_1CE4_E5B9)

@@ -1,17 +1,20 @@
-//! Lance autoresearch harness: public API for the bench binary, benchmarks, and tests.
+//! Autoresearch target: Lance PQ L2 distance kernel optimization.
 //!
-//! Contract (Karpathy-style three files):
+//! Karpathy-style three-file contract:
 //!
 //! - `kernels`: the AGENT'S PLAYGROUND. Modify freely.
 //! - `reference`: IMMUTABLE. Scalar reference kernel. Defines the math.
 //! - `inputs`: IMMUTABLE. Diverse test-data + workload generators,
 //!   deterministic per fixed seed, varied across the input battery.
 //!
-//! The optimization target is dataset-independent: the agent's kernel must match
-//! the scalar reference within `MAX_ABS_ERR` on every input the bench generates,
-//! and minimize geomean ns/query across multiple PQ shapes and data
-//! distributions. There is no fixed dataset; an "improvement" by construction
-//! generalizes across distributions and shapes.
+//! The optimization target is dataset-independent: the agent's kernel must
+//! match the scalar reference within `harness_common::MAX_ABS_ERR` on every
+//! input the bench generates, and minimize geomean ns/query across multiple
+//! PQ shapes and data distributions. There is no fixed dataset.
+//!
+//! Shared utilities (deterministic PRNG, geomean, peak RSS, tolerance
+//! constants, time budget) come from the `harness-common` workspace crate.
+//! See `../HARNESS.md` for the harness conventions every target follows.
 pub mod inputs;
 pub mod kernels;
@@ -45,12 +48,3 @@ impl PqShape {
         self.num_sub_vectors * self.num_centroids * self.sub_vector_dim()
     }
 }
-/// Tolerance for the agent kernel's distance values vs. the scalar reference.
-/// Loose enough to permit legal SIMD-accumulator reordering; tight enough to
-/// catch real arithmetic bugs.
-pub const MAX_ABS_ERR: f32 = 1e-4;
-/// Tolerance for top-K *distances* (id sets are compared with tie-tolerance;
-/// see `reference::topk_consistent`).
-pub const TOPK_DIST_TOL: f32 = 1e-4;

@@ -0,0 +1,192 @@
# Adding a new target
Walk through this when spinning up a new optimization target (A1 cosine, A4
bitpack, etc.). It's a `cp -r` plus surgical edits — no architectural
decisions to make per target if the kernel fits the autoresearch shape.
If your target's per-trial eval is more than ~30 seconds, or the correctness
oracle can't be a deterministic comparison against a scalar reference, this
harness is the wrong fit — see [`design.md`](design.md) "When to revisit"
for the boundary.
## Steps
### 1. Pick a template target
Choose the closest existing target. For now there is only `pq-l2`, but as more land:
- Distance / scoring kernels that take a query and return per-row scores →
template off `pq-l2`.
- Decode kernels that take encoded bytes and return an Arrow array →
template off `bitpack` once it lands.
- Scan / merge kernels → template off `topk-merge` once it lands.
```bash
cp -r crates/pq-l2 crates/<my-target>
```
### 2. Rewrite `Cargo.toml`
```toml
[package]
name = "<my-target>"
# version, edition, license, publish stay the same
```
Add the target to the workspace `members` in the root `Cargo.toml`:
```toml
[workspace]
members = [
"crates/harness-common",
"crates/pq-l2",
"crates/<my-target>", # add this
]
```
### 3. Rewrite `src/lib.rs`
Define the target's `Shape` type (analogue of `PqShape`) and any other types
shared among `kernels.rs`, `reference.rs`, and `inputs.rs`. Document
which fields are pinned by the harness vs. agent-tunable.
This file is **immutable** to the agent. The shape parameters define the
optimization target — changing them changes what's being optimized.
### 4. Rewrite `src/reference.rs`
Implement the scalar reference kernel — the math, in plain Rust, no SIMD,
no cleverness. This is what the agent's kernel is compared against. Mirror
the public API of `kernels.rs` exactly.
For float kernels, also export `max_abs_err(a, b)` and `topk_consistent(...)`
(or analogues) — the comparison helpers the bench uses to assert
near-bit-exact equivalence with `harness_common::MAX_ABS_ERR` /
`TOPK_DIST_TOL`.
For integer / byte kernels, the comparison is simpler — `assert_eq!` on the
returned Arrow array. No tolerance constants needed.
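
For the float case, a comparison helper might look like the sketch below. The signature is an assumption (the real `reference::max_abs_err` is not shown here); treating a length mismatch or NaN as infinite error is a choice made for this illustration so that a `<= tolerance` gate fails closed.

```rust
/// Largest element-wise absolute difference between two slices.
/// Returns infinity on a length mismatch or any NaN, so `<= tol` fails.
fn max_abs_err(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return f32::INFINITY;
    }
    let mut worst = 0.0f32;
    for (x, y) in a.iter().zip(b) {
        let e = (x - y).abs();
        if e.is_nan() {
            return f32::INFINITY;
        }
        worst = worst.max(e);
    }
    worst
}
```

The bench would then assert `max_abs_err(&got, &want) <= MAX_ABS_ERR` for every correctness case.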
### 5. Rewrite `src/inputs.rs`
Two surfaces:
- `correctness_battery(seed) -> Vec<CorrectnessCase>` — diverse shape ×
distribution combinations, sized small enough that the correctness phase
finishes in seconds. The point is breadth, not realism.
- `speed_workloads(seed) -> Vec<SpeedWorkload>` — larger shape × distribution
combinations sized for stable timings. Aim for total trial wall-clock
≤ 60s, because trial latency bounds how many hypotheses the agent can test per session.
Use `harness_common::SplitMix64` for determinism. Same seed → same battery
across trials.
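
A seeded generator could be sketched as below. The `SplitMix64` here is a local copy of the stepping logic so the example is self-contained; a real target would import `harness_common::SplitMix64` instead. `uniform_case` is a hypothetical name, not an existing harness function.

```rust
/// Local stand-in for `harness_common::SplitMix64` (same stepping constants).
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform in [0, 1), built from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        ((self.next_u64() >> 40) as u32) as f32 / (1u32 << 24) as f32
    }
}

/// Hypothetical generator: `n` values uniform on [-1, 1), fully determined by `seed`.
fn uniform_case(seed: u64, n: usize) -> Vec<f32> {
    let mut rng = SplitMix64::new(seed);
    (0..n).map(|_| rng.next_f32() * 2.0 - 1.0).collect()
}
```

Because the generator has no hidden state, two trials with the same seed see bit-identical inputs, which is what makes timings comparable across trials.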
### 6. Rewrite `src/kernels.rs` (the agent's playground)
Implement a clean scalar baseline matching the algorithm shape of the Lance
upstream code. The header comment must:
- Cite the upstream Lance source (`lance-format/lance` rev / file path) the
algorithm is modeled on.
- Document the public API the bench calls — these are the surfaces the agent
may NOT change.
- List "what you can do" / "what you cannot do" rules specific to this
target.
The starting kernel must be correct (passes the correctness phase against
`reference.rs`) and lint-clean. The agent's job is to make it faster.
### 7. Rewrite `src/bin/run_experiment.rs`
Two phases:
- **Correctness phase:** for each `CorrectnessCase`, run agent kernel +
reference, compare. Any mismatch → print `correctness: fail`, diagnostic
line, exit 2.
- **Speed phase:** for each `SpeedWorkload`, run agent kernel and time per
query / per row / per byte. Aggregate geomean / worst / best across all
combos. Print fixed-format result block.
Universal output fields (every target) are listed in `HARNESS.md` "The
metric." Add per-target fields above them as needed (e.g., `bit_widths_tested`
for bitpack).
Use:
- `harness_common::geomean` for the aggregator
- `harness_common::peak_rss_mb` for memory readback
- `harness_common::TIME_BUDGET_SECS` for the time-budget check
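
The two-phase flow can be sketched as a pure function that returns the exit code and report text. All names here (`Trial`, `run_trial`) are hypothetical, and the report shows only a subset of the fields; a real binary would take its inputs from `inputs.rs` and use the `harness_common` helpers listed above.

```rust
/// Outcome of one trial; `exit_code` follows the convention in the text
/// (0 on success, 2 on a correctness failure).
struct Trial {
    exit_code: i32,
    report: String,
}

/// Hypothetical driver: `cases` are (agent output, reference output) pairs,
/// `timings_ns` are per-query timings gathered by the speed phase.
fn run_trial(cases: &[(Vec<f32>, Vec<f32>)], timings_ns: &[u64], tol: f32) -> Trial {
    // Correctness phase: the first mismatch aborts the trial.
    for (i, (got, want)) in cases.iter().enumerate() {
        let ok = got.len() == want.len()
            && got.iter().zip(want).all(|(g, w)| (g - w).abs() <= tol);
        if !ok {
            return Trial {
                exit_code: 2,
                report: format!("correctness: fail\ncase {i} exceeded tolerance {tol}"),
            };
        }
    }
    // Speed phase: aggregate the timings into the fixed-format block.
    let worst = timings_ns.iter().copied().max().unwrap_or(0);
    let best = timings_ns.iter().copied().min().unwrap_or(0);
    Trial {
        exit_code: 0,
        report: format!(
            "correctness: pass\nworst_ns_per_query: {worst}\nbest_ns_per_query: {best}"
        ),
    }
}
```

In `main`, the binary would print `report` and then call `std::process::exit(trial.exit_code)`.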
### 8. (Optional) Rewrite `benches/<my-target>.rs`
Criterion benchmark with the same kernel calls as `run_experiment` but
under criterion's statistical-sampling harness. Optional — the per-trial
binary is the agent's primary measurement; criterion is for the human's
deeper investigation.
### 9. Write `program.md`
Per-target agent skill, layered on top of `HARNESS.md`. Sections:
- **Setup** — which files to read at session start (always include
`../../HARNESS.md`).
- **Public API contract** — the exact functions / structs the agent must
keep stable.
- **Target-specific priors** — known SIMD techniques for this kernel shape,
algorithmic transformations worth trying, common pitfalls. This is the
highest-leverage content; spend time on it.
- **`results.tsv` header** — the per-target column set.
### 10. Write the per-target capsule in `docs/targets/<my-target>.md`
A short doc covering:
- What's optimized (one sentence)
- Upstream Lance source pointers (rev, file paths, function names)
- Oracle definition (bit-exact / `max_abs_err`)
- Speed workload shape (what shapes × distributions span)
- Status (candidate / landed / has-results)
### 11. Verify end-to-end
```bash
cargo build --release -p <my-target>
cargo clippy --release -p <my-target> --all-targets -- -D warnings
cargo run --release --bin run_experiment -p <my-target>
```
The baseline trial must:
- Print `correctness: pass`
- Exit 0
- Finish within ~60s
- Report a plausible `geomean_ns_per_*` baseline number
Smoke-test the gate: deliberately break `kernels.rs` (e.g., return constant
zero), confirm the trial exits 2 with `correctness: fail`. Restore.
### 12. Update the target's row in the top-level `README.md`
In the targets table at the top of the README, change the new target's row
from `candidate` to `landed`, adding the row first if the target is not yet listed.
### 13. Commit
One commit for the target's scaffolding. Don't bundle multiple targets in
one commit — each target's history should be independently revertible.
## Common gotchas
- **Removing the `[workspace]` block** from the root `Cargo.toml` makes cargo
walk up to the parent omnigraph workspace. It is already in place; just don't remove it.
- **Per-target `Cargo.toml` referencing the wrong `harness-common` path.**
Use `harness-common = { path = "../harness-common" }`.
- **Picking a `SHAPES` set that's too small.** Three shapes is the floor;
with one shape an agent could specialize and pass, with two there's not
enough variety. Ensure the shapes span at least one "outlier" (e.g., for
PQ, one shape with `sub_vector_dim != 8`).
- **Correctness battery too narrow.** Five distributions is the floor: at
minimum Gaussian / uniform / sparse / large-dynamic-range / mostly-zero (or
the integer analogue: uniform / clustered / skewed / few-distinct /
monotonic).
- **Trial time too long.** If the speed phase exceeds ~60s, agent iteration
rate drops below useful. Reduce workload sizes; the speed metric is
per-operation, not per-workload, so absolute size doesn't change the
comparison.

@@ -0,0 +1,152 @@
# Design — why the workspace is shaped this way
This document records the rationale for the multi-target workspace shape so
future contributors don't relitigate the early decisions.
## The thing we're building
A multi-target harness for LLM-driven optimization of Lance hot-path kernels.
"Multi-target" because Lance has many such kernels — distance kernels in
`lance-linalg`, decoders in `lance-encoding`, scan/merge kernels — and the
right harness shape is identical across them: bit-exact correctness oracle,
geomean-across-distributions speed metric, single-agent autoresearch loop.
The original [research note](../../docs/research/llm-evolutionary-sampling.md)
enumerates ten such candidates (A1-A10) clustered by Lance crate. The first
landed (`pq-l2`) proves the harness shape; the rest follow the same template.
## Decision: workspace, not single crate
A single crate exposing multiple binaries (`run_experiment_pq_l2`,
`run_experiment_bitpack`, ...) was the obvious-looking alternative. Rejected
for three reasons:
1. **Per-target deps differ.** FSST decode wants different deps than PQ
kernels (a string-compression library vs. just `f32` math). A single
`Cargo.toml` would either bundle every target's deps into every build or
require fine-grained features. Workspaces give per-target `Cargo.toml`
for free.
2. **Edit isolation.** The agent edits one target's `kernels.rs` at a time.
In a single crate, `kernels.rs` files would collide on path or have to live
in target-specific submodules with target-specific naming. Per-target
crates put `src/kernels.rs` at the natural location every time and let the
agent navigate one tree per session.
3. **Build / test isolation.** `cargo build -p pq-l2` builds only what's
needed for the PQ L2 target; `cargo test -p pq-l2` runs only its tests.
The agent's iteration loop is faster because it doesn't pay for unrelated
targets' compile time.
The downside — workspace boilerplate, per-target `Cargo.toml`, the empty
`[workspace]` block at the workspace root that prevents cargo from walking up
to the parent omnigraph workspace — is a one-time cost. Per-target overhead
of adding a new target is one `cp -r` plus path edits.
## Decision: shared `harness-common` crate, no `Target` trait
A `Target` trait was the obvious-looking other alternative — express the
common loop generically, plug in target-specific types. Rejected because:
1. **Kernel signatures vary too much for a single trait shape.** PQ
`probe_top_k` returns `Vec<(u32, f32)>`. Bitpack decode returns an
`IntArray`. FSST decode returns `Vec<u8>`. Predicate evaluation returns a
`BooleanArray`. A unifying trait would need erased boxing or a wide
associated-type surface, both of which obscure the actual hot path the
agent is editing.
2. **The orchestration that *is* shared is small.** A deterministic PRNG
(~30 lines), a geomean (~10 lines), peak RSS readback (~20 lines), and three
constants (two tolerances and a time budget). Total ~70 lines of shared code.
Building a trait abstraction over 70 lines costs more than it saves.
3. **The output format isn't worth sharing.** Each target's
`run_experiment.rs` prints a fixed-format result block; the *fields*
differ per target (PQ shapes vs bit widths vs distribution kinds). A
shared formatter would be either trivial wrapping of `println!` (no
value) or a complicated builder API (negative value).
`harness-common` therefore exposes plumbing only: `SplitMix64`, `geomean`,
`peak_rss_mb`, `MAX_ABS_ERR`, `TOPK_DIST_TOL`, `TIME_BUDGET_SECS`. Each
target consumes what it needs. The shared loop contract is documented in
`HARNESS.md`, not encoded in code.
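
As an illustration of how small this plumbing is, a log-space geomean fits in a few lines. This sketch follows the helper that was removed from `run_experiment.rs`, except that it rounds to nearest instead of truncating; that difference and the name `geomean_ns` are choices made for this example, not the `harness_common` implementation.

```rust
/// Geometric mean computed in log space, so large products cannot overflow.
/// Zero inputs are clamped to 1; an empty slice yields 0.
fn geomean_ns(xs: &[u64]) -> u64 {
    if xs.is_empty() {
        return 0;
    }
    let sum_ln: f64 = xs.iter().map(|&x| (x.max(1) as f64).ln()).sum();
    (sum_ln / xs.len() as f64).exp().round() as u64
}
```

For example, `geomean_ns(&[10, 1000])` is 100, whereas the arithmetic mean would be 505; the geomean keeps one slow combo from dominating the headline number.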
## Decision: per-target `program.md` + shared `HARNESS.md`
The agent reads two files at session start:
- `HARNESS.md` (workspace-level) — universal: the loop, the metric, the
edit-permission table, hygiene rules.
- `crates/<target>/program.md` (per-target) — specific: the kernel API the
agent must keep stable, target-specific priors (which SIMD intrinsics tend
to win on this kernel shape), the `results.tsv` column header.
The shape mirrors how Karpathy's `nanochat-research` `program.md` works,
factored across the dimension that varies (per target) vs. doesn't (the loop
itself). Two files instead of one because copy-pasting the universal loop
into every `program.md` makes them drift.
## Decision: dataset-independent oracle for every target
The first iteration of the harness used recall@K vs. SIFT1M as the
correctness oracle. We replaced it with bit-exact (or near-bit-exact for
floats) match against a scalar reference because:
1. The agent had incentive to overfit lossy approximations to the dataset's
cluster structure, even though we didn't ask for that.
2. SIFT1M is 250 MB and a hassle to download; the harness benefited from
being self-contained.
3. Mathematical equivalence is a strictly stronger contract than recall
preservation: if the kernel is bit-equivalent to the scalar reference,
recall is automatically identical because the distance values are the
same. There's nothing recall@K catches that bit-exactness doesn't.
This decision generalizes to every target. Decode kernels get strict bitwise
equality (no float arithmetic involved). Distance and BM25 kernels get
`max_abs_err ≤ 1e-4` (loose enough for SIMD-accumulator reordering, tight
enough for real bugs). Targets that genuinely require lossy techniques to
get headroom — there might be some; LUT u8 quantization in PQ is one — go
in a separate "lossy track" with a recall-based oracle on diverse datasets,
not the bit-exact track.
## Decision: per-target speed measurement spans multiple shapes × distributions
A single dataset would let an agent overfit to that dataset's distribution.
Each target's `inputs.rs` therefore generates speed workloads across:
- Multiple **shapes** of the kernel's domain (PQ: `(dim, num_sub_vectors,
num_centroids)`; bitpack: bit width; etc.). Captures how the kernel
performs at different sizes Lance users actually encounter.
- Multiple **data distributions** (Gaussian / uniform / sparse for floats;
uniform / skewed / clustered for integers; etc.). Captures whether the
kernel's win is data-distribution-conditional.
The keep gate uses geomean across all (shape × distribution) combos with a
worst-case guard: a kernel that wins on one combo but pushes
`worst_ns_per_query` more than 5% above the last-kept kernel's worst is not
kept, even if the geomean improves. This forces wins to generalize.
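
A minimal sketch of that gate, assuming the guard is applied to the trial's overall `worst_ns_per_query` and ignoring the noise band. `Speed` and `keep` are hypothetical names for this illustration.

```rust
/// Summary of one trial's speed phase.
#[derive(Clone, Copy)]
struct Speed {
    geomean_ns: u64,
    worst_ns: u64,
}

/// Keep only if the geomean improves and the worst case regresses by no more
/// than `max_worst_regress` (for example 0.05 for a 5% guard).
fn keep(candidate: Speed, last_kept: Speed, max_worst_regress: f64) -> bool {
    let geomean_better = candidate.geomean_ns < last_kept.geomean_ns;
    let worst_ok =
        candidate.worst_ns as f64 <= last_kept.worst_ns as f64 * (1.0 + max_worst_regress);
    geomean_better && worst_ok
}
```

A candidate with a 10% better geomean but a 10% worse worst case is rejected, which is the behavior the guard is meant to enforce.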
## What's deliberately not abstracted
- **Output format.** Each target prints its own field block. See above.
- **`TopKHeap` and other small data structures.** When two targets need a
`TopKHeap`, the second one copies the first's. Three copies of a 30-line
struct is cheaper than one trait-erased indirection.
- **Test data shapes.** Each target's `inputs.rs` knows its own kernel's
fixture shape. Sharing would require a generic `Fixture<Kernel>` trait,
which would either be too narrow (forces every kernel into a `query +
workload` shape) or too wide (gives up the type safety that makes the
bench's correctness check obvious).
## When to revisit
If the workspace grows past ~6 active targets and we notice we're
copy-pasting more than ~50 lines of `run_experiment.rs` boilerplate per new
target, consider extracting a shared `RunExperiment` helper that takes
closures for the correctness and speed phases. Don't pre-extract — wait
until the duplication is real and visible.
If we add a target that genuinely doesn't fit the autoresearch loop (eval
crosses ~30s; tournament sampling becomes the right control loop), it
belongs in a separate workspace, not this one. The boundary line is the
loop shape, not the target type.

@@ -0,0 +1,98 @@
# Target: `pq-l2`
PQ L2 distance kernel for f32 dense vectors — the asymmetric-distance compute
that runs on every `IvfPq` / `IvfHnswPq` ANN query in Lance.
## Status
**Landed.** Baseline scalar kernel committed; the agent's job is to find
generalizable speedups against it.
## What's optimized
Two functions in `crates/pq-l2/src/kernels.rs`:
- `PqKernel::distance_table(query)` — builds the asymmetric distance table
(`[num_sub_vectors][num_centroids]`) for one query against the codebook.
Cost: `num_sub_vectors × num_centroids × sub_vector_dim` MAC ops per query.
- `PqKernel::probe_top_k(table, codes, num_vectors, k)` — probes
`num_vectors` PQ-encoded vectors, accumulates per-vector distance via
`num_sub_vectors` table lookups, returns top-K. Cost:
`num_vectors × num_sub_vectors` lookups + heap maintenance per query.
This is the dominant cost at typical scales.
`PqKernel::new(shape, codebook)` is also editable — the agent may pre-process
the codebook (transpose layout, cache `c·c` for the FMA trick, pack the LUT)
and amortize over queries; build cost is excluded from per-query timing.
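
One such build-time step, the codebook layout transpose, might be sketched as below. `transpose_codebook` is a hypothetical helper and assumes a flat `[m][k][d]` input; it is not taken from `kernels.rs`.

```rust
/// Transpose a flat codebook from [m][k][d] to [m][d][k] so that, for a fixed
/// sub-quantizer and dimension, the values for all k centroids are contiguous.
fn transpose_codebook(codebook: &[f32], m: usize, k: usize, d: usize) -> Vec<f32> {
    assert_eq!(codebook.len(), m * k * d);
    let mut out = vec![0.0f32; codebook.len()];
    for mi in 0..m {
        for ki in 0..k {
            for di in 0..d {
                out[(mi * d + di) * k + ki] = codebook[(mi * k + ki) * d + di];
            }
        }
    }
    out
}
```

Since this runs once in `new`, its cost does not show up in the per-query timing.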
## Upstream Lance source
Algorithmically modeled on `lance-linalg::distance::l2` plus the PQ
asymmetric-distance compute in `lance::index::vector::pq`. Specifically the
f32 dense path; the byte / fixed-point variants are out of scope for this
target.
When porting a winning kernel upstream:
- File: `lance-linalg/src/distance/l2.rs` and the L2-specific path in
`lance/src/index/vector/pq.rs`.
- License: Apache-2.0 (matches our dual MIT/Apache-2.0 → upstream takes
the Apache half).
## Oracle
**Float-accumulator-tolerance match against scalar reference.** Per
`harness_common::MAX_ABS_ERR = 1e-4`:
- Distance table values must match the scalar reference within `1e-4` per
element. Loose enough for legal SIMD-accumulator reordering, tight enough
to catch real arithmetic bugs.
- Top-K results compared with `harness_common::TOPK_DIST_TOL = 1e-4` plus
tie-tolerant id substitution (any permutation within a tied-distance band
is accepted).
The correctness phase asserts both on every input combination — five input
distributions × three PQ shapes = 15 cases per trial.
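
One possible reading of the tie-tolerant top-K check is sketched below. This is an assumption for illustration only: the actual rules live in `reference::topk_consistent`, which is not shown here, and may differ. The sketch requires rank-by-rank distance agreement and accepts an id missing from the reference list only when it sits in the tied band at the reference's k-th distance.

```rust
/// Hypothetical tie-tolerant comparison of two top-K lists of (id, distance),
/// both sorted by ascending distance.
fn topk_consistent_sketch(got: &[(u32, f32)], want: &[(u32, f32)], tol: f32) -> bool {
    if got.len() != want.len() {
        return false;
    }
    // Distances must agree rank by rank within the tolerance.
    let dists_ok = got.iter().zip(want).all(|(g, w)| (g.1 - w.1).abs() <= tol);
    // An id absent from the reference is acceptable only inside the tied band
    // at the reference's largest (k-th) distance.
    let boundary = want.last().map(|w| w.1).unwrap_or(0.0);
    let ids_ok = got
        .iter()
        .all(|g| want.iter().any(|w| w.0 == g.0) || (g.1 - boundary).abs() <= tol);
    dists_ok && ids_ok
}
```

Under this reading, swapping two ids that share the same distance passes, while returning an id whose distance differs from the reference at that rank fails.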
## Speed workload
Three shapes:
- `(128, 16, 256)` — SIFT-like; sub_vector_dim = 8
- `(256, 16, 256)` — sub_vector_dim = 16
- `(768, 96, 256)` — BERT-base-like; large codebook
Three data distributions:
- `Clustered` — 32 cluster centers, low intra-cluster noise
- `Uniform` — uniform on [-1, 1]
- `Sparse` — 90% zeros + 10% Gaussian
Per (shape × distribution): 20,000 base vectors PQ-encoded, 32 queries
timed. Total trial wall-clock: ~30-60s on a developer laptop.
## Output fields
```
correctness: pass | fail
shapes_tested: (128,16,256) (256,16,256) (768,96,256)
distributions_tested: clustered uniform sparse
geomean_ns_per_query: <u64>
worst_ns_per_query: <u64> (<shape>, <dist>)
best_ns_per_query: <u64> (<shape>, <dist>)
per_combo_geomean_ns:
(...)
peak_mem_mb: <f64>
total_seconds: <f64>
```
## Known headroom (priors for the agent)
See `crates/pq-l2/program.md` "Lance-PQ-specific priors" for the canonical
list. Highlights:
- Codebook layout transpose (`[m][k][d]` → `[m][d][k]`) for SIMD-broadcast
table build.
- Cache `c·c` per centroid in `new()` so the inner loop is `q·q - 2qc + c·c`
(one FMA chain).
- Probe-side code transpose so the inner loop processes 32+ vectors per
iteration via gather.
- Top-K block-then-merge instead of per-vector heap insert.
- Prefetch on `codes[i+64]` ahead of gather.
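
The probe-side code transpose can be illustrated in scalar form. This is a sketch under assumptions: codes start in `[i][m]` order, the table is flat `[m][k]`, every code is less than `k`, and all function names are hypothetical. The column-major loop is the shape a gather-based SIMD version would start from; no intrinsics are used here.

```rust
/// Row-major probe: codes laid out [i][m], one table lookup per sub-quantizer.
fn probe_row_major(table: &[f32], codes: &[u8], n: usize, m: usize, k: usize) -> Vec<f32> {
    (0..n)
        .map(|i| {
            (0..m)
                .map(|mi| table[mi * k + codes[i * m + mi] as usize])
                .sum::<f32>()
        })
        .collect()
}

/// Transpose codes from [i][m] to [m][i] (done once, outside the hot path).
fn transpose_codes(codes: &[u8], n: usize, m: usize) -> Vec<u8> {
    let mut out = vec![0u8; codes.len()];
    for i in 0..n {
        for mi in 0..m {
            out[mi * n + i] = codes[i * m + mi];
        }
    }
    out
}

/// Column-major probe: the inner loop walks contiguous codes for one
/// sub-quantizer and updates a run of accumulators.
fn probe_col_major(table: &[f32], codes_t: &[u8], n: usize, m: usize, k: usize) -> Vec<f32> {
    let mut acc = vec![0.0f32; n];
    for mi in 0..m {
        let row = &table[mi * k..][..k];
        let col = &codes_t[mi * n..][..n];
        for (a, &c) in acc.iter_mut().zip(col) {
            *a += row[c as usize];
        }
    }
    acc
}
```

Both probes add the per-sub-quantizer terms in the same order for each vector, so the transposed layout changes memory access, not the arithmetic.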

@@ -1,172 +0,0 @@
# Lance PQ L2 kernel research — agent instructions
You are an autonomous research assistant. Your job is to improve `src/kernels.rs`
so that `cargo run --release --bin run_experiment` reports a **lower
`geomean_ns_per_query`** while:
1. The **correctness phase passes** — your kernel's distance values must match the
scalar reference within `MAX_ABS_ERR = 1e-4`, and the top-K must be
tie-tolerant equivalent on every input the bench generates.
2. The `worst_ns_per_query` does **not regress more than 5%** against the
last-kept kernel — if you win on one (shape × distribution) and lose
significantly on another, the change isn't a generalizable improvement.
This bench is intentionally **dataset-independent**: there is no fixed dataset.
The correctness oracle is mathematical equivalence to the scalar reference,
checked across multiple PQ shapes and synthetic input distributions
(Gaussian / uniform / sparse / large-dynamic-range / mostly-zero). The speed
oracle is the geomean across multiple shapes × distributions, with worst-case
guarded. A win that depends on a specific data distribution or PQ shape will
fail to clear the bar by construction.
Read this file end-to-end before doing anything else. Then run setup, then the loop.
## Setup (do once at the start of every session)
1. Read these files, in this order:
- `README.md`
- `program.md` (this file)
- `src/lib.rs`
- `src/kernels.rs` *(the only file you may edit)*
- `src/reference.rs`
- `src/inputs.rs`
- `src/bin/run_experiment.rs`
2. Ensure `results.tsv` exists. If not, create it with this header line:
```
commit timestamp correctness geomean_ns worst_ns worst_combo best_ns best_combo peak_mem_mb total_seconds keep description
```
3. Run the baseline trial: `cargo run --release --bin run_experiment > run.log 2>&1`.
Confirm `correctness: pass`. Parse `run.log` and append a row to `results.tsv`
with `keep=baseline` and `description="seeded scalar PQ-L2 baseline"`. This
is your reference number.
4. Commit the baseline row with a one-line message like `baseline: <numbers>`.
## What you CAN do
- Modify **`src/kernels.rs`** freely. You may:
- Pre-process the codebook in `PqKernel::new` (transpose layouts, cache
`c·c` for the FMA trick, pack the codebook for register-resident lookup,
etc.). This cost is paid once per dataset and amortized across queries —
the bench measures per-query, not per-(build + query).
- Reorder loops, switch internal data layouts, drop down to `std::arch`
intrinsics under `#[cfg(target_arch = ...)]` gates. **Always keep a
portable scalar fallback** so the kernel compiles everywhere.
- Use `unsafe` if needed; document the invariants you're relying on.
- Mark hot functions `#[inline]`; add private helpers freely.
- Add `#[cfg(test)] mod tests { ... }` inside `src/kernels.rs` if you want
in-file property checks.
## What you CANNOT do
- Do **not** modify `src/lib.rs` (`PqShape` and the tolerance constants are
shared with the immutable scaffolding).
- Do **not** modify `src/bin/run_experiment.rs`, `src/reference.rs`,
`src/inputs.rs`, `benches/pq_l2.rs`, or `Cargo.toml`.
- Do **not** add new crate dependencies.
- Do **not** alter the public API of `kernels::PqKernel`:
- `PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self`
- `PqKernel::shape(&self) -> &PqShape`
- `PqKernel::distance_table(&self, query: &[f32]) -> Vec<f32>`
- `PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>`
- Do **not** introduce lossy techniques (LUT u8/u16 quantization, asymmetric-
distance approximation, etc.) — the correctness phase asserts exact-up-to-ε
match against the scalar reference. If you want to explore a lossy track,
surface that in a separate kernel and propose a track extension.
## The metric
Minimize `geomean_ns_per_query` (geometric mean of per-query wall-clock across
all timed queries, all shapes, all distributions) subject to:
1. Correctness phase: **pass** (exit-2 otherwise).
2. `worst_ns_per_query` ≤ 1.05 × the last-kept kernel's worst.
3. `total_seconds` ≤ 600.
4. Build is clean: `cargo build --release` succeeds, `cargo clippy --release
--all-targets -- -D warnings` reports zero issues.
Ties break toward simpler code. If two kernels report the same speed within
~3% noise, prefer fewer lines / less `unsafe`.
## Lance-PQ-specific priors (lossless directions)
These directions are known to pay off without compromising arithmetic accuracy.
Pick one hypothesis at a time; implement; measure; decide.
- **Codebook layout.** The reference layout is `[m][k][d]`. For a fixed query,
iterating over centroids stays in cache, but the inner loop over `d` is
short. Transposing to `[m][d][k]` lets you SIMD-load 8 `(query - centroid)`
lanes across `d` and broadcast over `k`. Do the transpose in `PqKernel::new`
once.
- **Cache `c·c`.** The diffsquaresum is `(q - c)·(q - c) = q·q - 2qc + c·c`.
Hoist `q·q` per sub-vector, precompute `c·c` once at codebook-load time.
Inner loop becomes one FMA (`-2qc + cc`). Watch the sign / accumulator
ordering so the rounding stays within tolerance.
- **Probe layout.** The probe is dominated by `acc += table[m][codes[off+m]]`
× `num_sub_vectors`. Transposing codes to `[m][i]` (one row per sub-quantizer,
contiguous over base index) lets you process up to 32+ vectors per inner
iteration with `vpgatherdq`-style loads.
- **Top-K integration.** `push()` does a branch + heap sift on every code.
At 50k probes per query × 9 (shape × dist) combos that's the second-biggest
cost after the gather. Block the probe (e.g., 512 codes at a time), find the
local top-K with a branchless pass, then merge into the global heap.
- **Prefetch.** A `_mm_prefetch(codes.as_ptr().add(off + 64), _MM_HINT_T0)`
ahead of the gather is usually pure win at 50k+ scale where codes don't all
fit in L2.
- **FMA chains for table build.** The diffsquaresum maps cleanly to FMA on
AVX2/NEON. Even without intrinsics, structuring the inner loop so `rustc`
emits FMA helps.
- **Avoid the `Vec` allocation in the hot path.** `distance_table` allocates a
fresh `Vec<f32>` per call. Returning a fixed-capacity buffer is a public-API
change you can't make — but you can reuse a thread-local scratch buffer
internally if it speeds the build.
## The loop
Once setup is done, repeat indefinitely:
1. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas
have been tried, what won, what regressed. Form a hypothesis with one
sentence stating the change and the predicted effect on speed and
correctness.
2. **Edit `src/kernels.rs`.** Keep the diff focused on the one hypothesis.
3. **Build and lint.**
```
cargo build --release
cargo clippy --release --all-targets -- -D warnings
```
If either fails, fix and try again — do not commit broken state.
4. **Run the trial.**
```
cargo run --release --bin run_experiment > run.log 2>&1
```
5. **Parse the result.** Extract `correctness`, `geomean_ns_per_query`,
`worst_ns_per_query` (with combo), `peak_mem_mb`, `total_seconds`. Compute
deltas vs. baseline.
6. **Decide keep or revert.**
- **Keep** iff: `correctness: pass`, geomean strictly better than the
last-kept row (allow ~1% noise band), and `worst_ns_per_query` ≤ 1.05 ×
last-kept's worst.
- **Revert** otherwise: `git restore src/kernels.rs` (or commit and
`git revert` if you want the revert in history). Note what failed.
7. **Log.** Append one row to `results.tsv`:
```
<short_sha> <iso8601> <correctness> <geomean_ns> <worst_ns> <worst_combo> <best_ns> <best_combo> <peak_mem> <elapsed> <keep|revert> <one-line description>
```
8. **Commit.** One-line message describing the change and the headline number,
e.g. `transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)`.
## Hygiene
- Always commit `src/kernels.rs` changes; never commit `results.tsv` or
`run.log` (they're gitignored).
- If a change fails to build, do not commit. Iterate until it builds, or
revert cleanly.
- If two consecutive ideas regress, take a beat: re-read the last ~10 rows of
`results.tsv` and update your mental model before proposing the next.
- Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min, kill it
and mark the trial as `timeout`.
## Never stop
Keep going until interrupted. Each loop iteration is one hypothesis, one edit,
one measurement, one commit. No multi-step plans across iterations.