research: restructure lance-autoresearch as multi-target workspace

The original lance-autoresearch was one Cargo crate optimizing one Lance
kernel (PQ L2 distance). With 9+ candidate targets enumerated in the
research note, a single-crate shape doesn't scale: per-target deps will
collide, the agent's edits to one target's kernels.rs would conflict with
another's lib path, and build/test isolation is lost. Restructure into a
Cargo workspace.

Layout:

  research/lance-autoresearch/
  ├── Cargo.toml             (workspace root)
  ├── README.md              (target table, contract overview, repo layout)
  ├── HARNESS.md             (universal loop contract every target inherits)
  ├── crates/
  │   ├── harness-common/    (shared: SplitMix64, geomean, peak RSS,
  │   │                       MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS)
  │   └── pq-l2/             (the landed target; was the previous single crate)
  └── docs/
      ├── design.md          (rationale for workspace shape, no Target trait)
      ├── adding-a-target.md (step-by-step workflow for new targets)
      └── targets/pq-l2.md   (per-target capsule)

Decisions documented in docs/design.md:

- Workspace, not single crate: per-target Cargo.toml so deps don't
  collide; per-target src tree so agent edits don't conflict; per-target
  build/test isolation for faster agent iteration.
- harness-common as a plumbing-only crate (PRNG, geomean, peak RSS,
  tolerance constants, time budget). Intentionally NO Target trait -
  decode kernel signatures and distance kernel signatures differ enough
  that a unifying trait would either bloat or require erased boxing. Each
  target is its own natural shape.
- Per-target program.md + shared HARNESS.md: the loop contract is
  universal, the priors and API spec are per-target. Two files instead of
  one because copy-pasting the universal loop into every program.md would
  drift.
pq-l2 refactor:

- src/* moved into crates/pq-l2/src/* via git mv (preserves history)
- crate renamed lance-autoresearch -> pq-l2
- SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL,
  TIME_BUDGET_SECS now imported from harness-common (drops ~70 lines of
  duplication that would have been copy-pasted into every new target)
- program.md trimmed: setup/loop/hygiene moved to HARNESS.md; only the
  PQ-L2-specific API contract and SIMD priors remain
- Cargo.toml depends on harness-common via path; workspace.dependencies
  pins criterion uniformly across targets

The candidate targets from the research note (A1 cosine/dot/hamming, A2 IVF
partition select, A3 FTS BM25, A4 bitpack decode, A5 dictionary decode, A6
FSST decode, A7 take/gather, A8 predicate eval, A9 posting list intersect,
A10 top-K merge) are listed in README.md's target table as "candidate";
each gets a docs/targets/<name>.md capsule when it's spun up.
docs/adding-a-target.md documents the cp -r + edit-Cargo.toml +
rewrite-three-files workflow.

Verified end-to-end:

- cargo build --release: clean, both crates compile
- cargo clippy --release --workspace --all-targets -- -D warnings: clean
- cargo test --release --workspace: 6/6 pass (4 harness-common + 2 pq-l2)
- cargo run --release --bin run_experiment -p pq-l2: correctness pass,
  geomean ~880k ns, exit 0, ~30s wall-clock
- omnigraph parent workspace unchanged (research/ excluded as before)

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
2026-06-09 01:35:18 +02:00 · 2026-05-15 00:15:02 +00:00 · 2026-05-15 00:15:02 +00:00 · 0d72cc69fb
commit 0d72cc69fb
parent 92ce8f1e7f
21 changed files with 1012 additions and 366 deletions
--- a/research/lance-autoresearch/Cargo.toml
+++ b/research/lance-autoresearch/Cargo.toml
@@ -1,32 +1,14 @@
-# Empty `[workspace]` section so cargo treats this directory as its own
-# workspace root and does NOT walk up to the parent omnigraph workspace.
-# Without this, cargo from inside `research/lance-autoresearch/` will try to
-# resolve omnigraph's dependencies even though we're excluded as a member.
 [workspace]
+resolver = "2"
+members = [
+    "crates/harness-common",
+    "crates/pq-l2",
+]

-[package]
-name = "lance-autoresearch"
-version = "0.1.0"
-edition = "2024"
-license = "MIT OR Apache-2.0"
-description = "Autoresearch-style harness for evolving Lance PQ L2 distance kernels via LLM agents."
-publish = false
-
-[lib]
-path = "src/lib.rs"
-
-[[bin]]
-name = "run_experiment"
-path = "src/bin/run_experiment.rs"
-
-[[bench]]
-name = "pq_l2"
-harness = false
-
-[dependencies]
+# Each per-target crate sets its own deps. Shared deps below pin versions
+# uniformly across targets so the workspace lockfile stays clean.
+[workspace.dependencies]
 anyhow = "1"
-
-[dev-dependencies]
 criterion = { version = "0.5", default-features = false, features = ["plotters", "cargo_bench_support"] }

 [profile.release]
--- a/research/lance-autoresearch/HARNESS.md
+++ b/research/lance-autoresearch/HARNESS.md
@@ -0,0 +1,137 @@
+# HARNESS — shared loop contract for every lance-autoresearch target
+
+This document is the universal part of every target's agent instructions. Each
+target's `program.md` is a thin layer of *target-specific priors and API spec*
+on top of the conventions below. The agent reads `HARNESS.md` and the target's
+`program.md` at the start of every session.
+
+## What this harness is
+
+A single agent (you) edits one file in one target crate to optimize a Lance
+kernel. Per trial, you build, run a binary that exercises the kernel against
+diverse inputs, parse a fixed-format output block, and decide keep-or-revert.
+
+This is a Karpathy-style autoresearch loop. It assumes:
+
+- Per-trial eval is **seconds-scale**. Long enough to measure, short enough to
+  iterate hundreds of times in a session.
+- The kernel has a **deterministic correctness oracle**: a scalar reference
+  whose output every candidate kernel is compared against.
+- The optimization target is **dataset-independent**: the harness generates
+  diverse inputs each trial, so wins generalize across distributions and
+  shapes by construction.
+
+Targets that don't fit these constraints (index-build parameter tuning,
+plan-patching, anything where eval is minutes-to-hours) belong in the
+BauplanLabs tournament-loop shape, not this harness. See `docs/design.md` for
+the boundary.
+
+## What's editable, per target
+
+| Path | Mutability | Why |
+|---|---|---|
+| `crates/<target>/src/kernels.rs` | **mutable** | Your playground. The whole point. |
+| `crates/<target>/src/reference.rs` | immutable | The oracle. Touching it makes wins meaningless. |
+| `crates/<target>/src/inputs.rs` | immutable | The fixture generator. Touching it makes timings incomparable across trials. |
+| `crates/<target>/src/lib.rs` | immutable | Shared types pinned by the bench (`PqShape` etc.). |
+| `crates/<target>/src/bin/run_experiment.rs` | immutable | The trial harness. |
+| `crates/<target>/benches/*.rs` | immutable | Criterion bench, optional read-only reference. |
+| `crates/<target>/Cargo.toml` | immutable | Adding deps changes the optimization target. |
+| `crates/<target>/program.md` | human-iterated between runs | Not edited by you in-loop; the human refines it. |
+| `crates/<target>/results.tsv` | append-only | Your audit log. Gitignored. |
+| `crates/harness-common/**` | immutable | Workspace-shared infrastructure. |
+| `HARNESS.md` (this file) | immutable | Workspace-shared loop contract. |
+
+You may add `#[cfg(test)] mod tests { ... }` inside `kernels.rs` for in-file
+property checks. You may NOT add new crate dependencies. You may NOT use
+tricks that are sound only while an undocumented assumption holds (e.g.,
+relying on a fixture invariant that holds today but isn't documented).
+
+## The metric
+
+Every target's `run_experiment` binary prints a fixed-format output block ending
+with these universal fields:
+
+- `correctness:` — `pass` or `fail`. Set by comparing your kernel against the
+  scalar reference on every input the bench generates.
+- `geomean_ns_per_*:` — geometric mean of per-operation wall-clock across all
+  timed operations.
+- `worst_ns_per_*:` — slowest combo's geomean.
+- `peak_mem_mb:` — process RSS high-water-mark.
+- `total_seconds:` — trial wall-clock.
+
+A kernel is **kept** iff:
+
+1. `correctness: pass` (any failure → `std::process::exit(2)`).
+2. `geomean_ns_per_*` strictly better than the previous best-kept kernel
+   (allow ~1% noise band).
+3. `worst_ns_per_*` ≤ 1.05 × the previous best-kept kernel's worst.
+4. `total_seconds` ≤ 600 (the per-trial cap; exceed it → `std::process::exit(3)`).
+5. Build clean: `cargo build --release` and
+   `cargo clippy --release --all-targets -- -D warnings` both succeed.
+
+Ties break toward simpler code: same speed within ~3% noise → fewer lines /
+less `unsafe` wins.
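
The keep criteria above can be sketched as a single predicate. This is a minimal illustration, not the harness's code: the `Trial` struct, its field names, and `should_keep` are hypothetical, and the ~1% noise band is read here as "must beat the previous geomean by more than 1%", which is only one possible interpretation of criterion 2. Criterion 5 (clean build and clippy) is assumed to be checked before this function is called.

```rust
/// Hypothetical summary of one trial's universal fields (names are illustrative).
#[derive(Debug, Clone, Copy)]
pub struct Trial {
    pub correctness_pass: bool,
    pub geomean_ns: u64,
    pub worst_ns: u64,
    pub total_seconds: f64,
}

/// Assumed reading of the noise band: a candidate must beat the previous
/// best geomean by more than 1% to count as strictly better.
const NOISE_BAND: f64 = 0.01;
/// Worst-case guard from criterion 3.
const WORST_CASE_SLACK: f64 = 1.05;
/// Per-trial cap from criterion 4.
const TIME_BUDGET_SECS: f64 = 600.0;

/// Returns true when `candidate` satisfies criteria 1-4 relative to `best`.
/// Criterion 5 (clean build and clippy) is assumed to be checked beforehand.
pub fn should_keep(candidate: &Trial, best: &Trial) -> bool {
    let faster = (candidate.geomean_ns as f64) < (best.geomean_ns as f64) * (1.0 - NOISE_BAND);
    let worst_ok = (candidate.worst_ns as f64) <= (best.worst_ns as f64) * WORST_CASE_SLACK;
    candidate.correctness_pass
        && faster
        && worst_ok
        && candidate.total_seconds <= TIME_BUDGET_SECS
}
```

Keeping the thresholds as named constants makes it easy to see which number belongs to which criterion when the decision is audited against `results.tsv`.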
+
+## The loop
+
+After reading `HARNESS.md` and the target's `program.md`:
+
+1. **Setup (once per session).** Confirm `results.tsv` exists; if not, create
+   it with a per-target header (the target's `program.md` defines the columns).
+   Run the baseline trial:
+   ```
+   cargo run --release --bin run_experiment -p <target> > run.log 2>&1
+   ```
+   Append a row tagged `keep=baseline` and commit it.
+
+2. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas
+   have been tried, what won, what regressed. Form one hypothesis with one
+   sentence stating the change and the predicted effect on speed and
+   correctness.
+
+3. **Edit `kernels.rs`.** Keep the diff focused on the one hypothesis.
+
+4. **Build and lint.**
+   ```
+   cargo build --release
+   cargo clippy --release --all-targets -- -D warnings
+   ```
+   If either fails, fix and retry. Do not commit broken state.
+
+5. **Run the trial.**
+   ```
+   cargo run --release --bin run_experiment -p <target> > run.log 2>&1
+   ```
+
+6. **Parse and decide.** Extract the universal fields plus any per-target
+   fields. Compute deltas vs. the last-kept row. Apply the keep criteria above.
+
+7. **Log.** Append one row to `results.tsv` matching the target's header.
+
+8. **Commit.** One-line message describing the change and the headline number,
+   e.g. `transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)`.
+
+9. **Hygiene.**
+   - Always commit `kernels.rs` changes; never commit `results.tsv` or
+     `run.log` (gitignored).
+   - If a change fails to build, do not commit. Iterate or revert cleanly.
+   - If two consecutive ideas regress, take a beat: re-read the last ~10 rows
+     and update your mental model before proposing the next.
+   - Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min,
+     kill it and mark the trial as `timeout`.
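
Step 6 of the loop ("parse and decide") amounts to reading `key: value` lines out of `run.log`. The sketch below assumes the layout of the example output block from the pre-workspace README (top-level `name: value` lines, with indented per-combo detail underneath); `parse_fields` and `leading_u64` are hypothetical helper names, and real targets may print additional fields.

```rust
use std::collections::HashMap;

/// Collects top-level `key: value` lines from a run log.
/// Indented lines (per-combo detail) and keys with no value are skipped.
pub fn parse_fields(log: &str) -> HashMap<String, String> {
    let mut fields = HashMap::new();
    for line in log.lines() {
        if line.starts_with(char::is_whitespace) {
            continue;
        }
        if let Some((key, value)) = line.split_once(':') {
            let value = value.trim();
            if !value.is_empty() {
                fields.insert(key.trim().to_string(), value.to_string());
            }
        }
    }
    fields
}

/// Parses the leading integer of a value such as `24515 ((768,96,256), sparse)`.
pub fn leading_u64(value: &str) -> Option<u64> {
    value.split_whitespace().next()?.parse().ok()
}
```

Returning `Option` from `leading_u64` lets the caller treat a missing or malformed field as a failed trial instead of panicking mid-session.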
+
+## Never stop
+
+Keep going until interrupted. Each loop iteration is one hypothesis, one edit,
+one measurement, one commit. No multi-step plans across iterations.
+
+## Working across multiple targets
+
+Work on **one target per session**, even when several targets are in play.
+Don't edit `kernels.rs` in two crates between commits: your mental model is
+shared, but the keep-decision is per-target. Pick a target, do a session
+there, commit, switch.
+
+The human is responsible for selecting which target to work on next. Don't
+proactively switch targets unless the human asks.
--- a/research/lance-autoresearch/README.md
+++ b/research/lance-autoresearch/README.md
@@ -1,112 +1,143 @@
 # lance-autoresearch

-An autoresearch-style harness for evolving [Lance](https://github.com/lance-format/lance)
-PQ L2 distance kernels via LLM coding agents (Claude Code, Codex, Cursor).
-
-Modeled on Andrej Karpathy's
+A multi-target workspace for evolving [Lance](https://github.com/lance-format/lance)
+hot-path kernels via LLM coding agents (Claude Code, Codex, Cursor),
+in the style of Andrej Karpathy's
 [`nanochat-research`](https://x.com/karpathy/status/1855651423497650238)
-three-file contract:
+single-agent autoresearch loop.

- **Immutable bench** — `src/bin/run_experiment.rs` + `src/inputs.rs` +
-  `src/reference.rs`. The agent cannot touch these.
- **Mutable kernel** — `src/kernels.rs`. The agent's playground. Starts as a
-  scalar baseline matching Lance's PQ L2 algorithm shape; the agent's job is to
-  beat it.
- **Human-iterated program** — `program.md`. The "skill" the agent reads at
-  the start of every session. The human refines it between runs.
+Each target is an independent Rust crate under `crates/`:
+
+| Target | Status | Lance source area | What's optimized |
+|---|---|---|---|
+| [`crates/pq-l2`](crates/pq-l2) | landed | `lance-linalg::distance::l2`, PQ probe | PQ L2 distance: build LUT, probe codes, top-K |
+| `crates/pq-cosine`     | candidate (A1) | `lance-linalg::distance::cosine` | PQ cosine distance |
+| `crates/pq-dot`        | candidate (A1) | `lance-linalg::distance::dot` | PQ dot-product distance |
+| `crates/ivf-partition` | candidate (A2) | `lance-index::vector::ivf` partition select | IVF partition selection (centroid scan) |
+| `crates/fts-bm25`      | candidate (A3) | `lance-index::scalar::inverted` BM25 | FTS BM25 scoring inner loop |
+| `crates/bitpack`       | candidate (A4) | `lance-encoding::encodings::bitpack` | Bitpack integer decode |
+| `crates/dictionary`    | candidate (A5) | `lance-encoding::encodings::dictionary` | Dictionary decode |
+| `crates/fsst`          | candidate (A6) | `lance-encoding::encodings::fsst` | FSST string decode |
+| `crates/take`          | candidate (A7) | `lance-core::utils::take` | Take / gather kernel |
+| `crates/predicate`     | candidate (A8) | `lance-datafusion` filter eval | Predicate evaluation kernels |
+| `crates/posting-intersect` | candidate (A9) | `lance-index::scalar::inverted` | Posting list intersection (FTS AND) |
+| `crates/topk-merge`    | candidate (A10) | scan-merge | Top-K k-way merge |
+
+Each candidate target gets a capsule under [`docs/targets/`](docs/targets/)
+when it is spun up by following [`docs/adding-a-target.md`](docs/adding-a-target.md).
+The single landed target (`pq-l2`) proves the harness shape; the candidates
+wait for an agent to spin them up.
+
+## The contract every target follows
+
+Karpathy's three-file shape, applied per target:
+
+| File (per target crate) | Mutability | Edited by |
+|---|---|---|
+| `src/kernels.rs` | **mutable** | the agent |
+| `src/reference.rs`, `src/inputs.rs`, `src/lib.rs`, `src/bin/run_experiment.rs`, `benches/*.rs` | immutable | — |
+| `program.md` | human-iterated | the human, between runs |
+| `results.tsv` | append-only | the agent, per trial (gitignored) |
+
+The shared utilities — deterministic PRNG, geomean, peak-RSS readback,
+tolerance constants, time-budget — live in [`crates/harness-common`](crates/harness-common/src/lib.rs)
+and are consumed by every target. There is intentionally **no `Target` trait**:
+decode-kernel signatures and distance-kernel signatures are different enough
+that a unifying trait would either bloat or require erased boxing. Each target
+is its own natural shape; the shared crate is plumbing only.
+
+The shared loop conventions every target's `program.md` inherits live in
+[`HARNESS.md`](HARNESS.md). Per-target priors and API specifics live in each
+target's own `program.md`.

 ## Dataset-independent by design

 Every other ANN benchmark you've seen is "compete on this fixed dataset"
-(SIFT1M, GIST1M, DEEP1B). That conflates two things: *kernel correctness*
-(the math) and *kernel speed under one specific data distribution*. An LLM
-agent given recall@K as the oracle has incentive to overfit to the dataset's
-quirks.
+(SIFT1M, GIST1M, DEEP1B). That conflates two things: *kernel correctness* (the
+math) and *kernel speed under one specific data distribution*. An LLM agent
+given recall@K as the oracle has incentive to overfit to the dataset's quirks.

-We split them:
+We split them, every target:

- **Correctness** = bit-equivalent (`max_abs_err ≤ 1e-4`) match to a scalar
-  reference kernel, on diverse generated inputs (Gaussian, uniform, sparse,
-  large-dynamic-range, mostly-zero) × multiple PQ shapes. This is mathematical
-  equivalence; there's no dataset to overfit. Lossy techniques fail this gate.
- **Speed** = geomean ns/query across multiple PQ shapes ×
-  multiple data distributions. A kernel that wins on one distribution and
-  regresses on another fails the worst-case guard.
+- **Correctness** = match to a scalar reference on diverse generated inputs:
+  within tolerance (`max_abs_err ≤ 1e-4`) for float kernels, bitwise for
+  integer/byte kernels. This is mathematical equivalence; there is no dataset
+  to overfit. Lossy techniques fail this gate.
+- **Speed** = geomean ns/operation across multiple shape × distribution
+  combinations, with a worst-case guard. A kernel that wins on one
+  distribution and regresses on another is not kept.

 By construction, an "improvement" generalizes across distributions and shapes.
-There is no `wget sift.tar.gz` step; the harness is fully self-contained.
+There is no `wget sift.tar.gz` step; every target is fully self-contained.
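
A minimal sketch of the two correctness gates described above, assuming kernel and reference outputs are available as flat slices. The function names are hypothetical; only the `1e-4` bound and the float-versus-bitwise split come from the text. The float gate is written to fail closed on length mismatches and NaN differences.

```rust
/// Tolerance for float kernels, mirroring the `1e-4` bound quoted above.
pub const MAX_ABS_ERR: f32 = 1e-4;

/// Largest absolute element-wise difference between two outputs.
/// Returns `f32::INFINITY` on a length mismatch or a NaN difference,
/// so the gate fails closed.
pub fn max_abs_err(candidate: &[f32], reference: &[f32]) -> f32 {
    if candidate.len() != reference.len() {
        return f32::INFINITY;
    }
    let mut worst = 0.0f32;
    for (&c, &r) in candidate.iter().zip(reference) {
        let err = (c - r).abs();
        if err.is_nan() {
            return f32::INFINITY;
        }
        worst = worst.max(err);
    }
    worst
}

/// Float gate: within tolerance of the scalar reference.
pub fn float_gate(candidate: &[f32], reference: &[f32]) -> bool {
    max_abs_err(candidate, reference) <= MAX_ABS_ERR
}

/// Integer/byte gate: strict bitwise equality.
pub fn exact_gate(candidate: &[u8], reference: &[u8]) -> bool {
    candidate == reference
}
```

The explicit NaN check matters because `f32::max` ignores NaN operands, so a NaN in the kernel output would otherwise slip through as zero error.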

-## Why a separate repo
+## Why a separate repo (and a workspace, not a single crate)

 OmniGraph (the graph engine that motivated this) pins Lance at a released
-version and consumes its kernels via the public crate API. Improvements live one
-layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps the
-optimization target pure (only the kernel changes), keeps the license clean for
-upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and
-keeps the agent's working set tiny.
+version and consumes its kernels via the public crate API. Improvements live
+one layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps
+the optimization target pure (only the kernel changes), keeps the license clean
+for upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and
+keeps each agent's working set tiny.
+
+**Workspace not single-crate** because per-target deps differ — FSST decode
+will want a different dependency set than PQ kernels — and the agent's edits
+to one target's `kernels.rs` must not collide with another's lib path. Each
+target is buildable, testable, and runnable in isolation: `cd crates/<target>
+&& cargo run --release --bin run_experiment`.

 ## Quick start

 ```bash
-cargo run --release --bin run_experiment
+# Run the landed PQ L2 target's baseline.
+cargo run --release --bin run_experiment -p pq-l2

-# Or run with Claude Code / Codex:
-#    Open the repo in your agent of choice and prompt:
-#       Hi, have a look at program.md and let's kick off a new experiment.
+# Or with Claude Code / Codex, working on one target:
+cd crates/pq-l2
+# Open in your agent of choice and prompt:
+#   Hi, have a look at program.md and let's kick off a new experiment.
+
+# Add a new target (see docs/adding-a-target.md):
+cp -r crates/pq-l2 crates/pq-cosine
+# ... edit Cargo.toml name, kernels.rs / reference.rs / inputs.rs / program.md
 ```

-## File ownership
-
-| File | Mutability | Edited by |
-|---|---|---|
-| `src/kernels.rs` | **mutable** | the agent |
-| `src/bin/run_experiment.rs` | immutable | — |
-| `src/reference.rs` | immutable | — |
-| `src/inputs.rs` | immutable | — |
-| `src/lib.rs` | immutable (shared types) | — |
-| `benches/pq_l2.rs` | immutable | — |
-| `program.md` | human-iterated | the human, between runs |
-| `results.tsv` | append-only | the agent, per trial (gitignored) |
-
-## The metric
-
-`run_experiment` runs two phases per trial: a correctness check and a
-multi-shape × multi-distribution speed measurement. Output looks like:
+## Repo layout

 ```
-correctness:           pass
---
-correctness:           pass
-shapes_tested:         (128,16,256) (256,16,256) (768,96,256)
-distributions_tested:  clustered uniform sparse
-geomean_ns_per_query:  18234
-worst_ns_per_query:    24515 ((768,96,256), sparse)
-best_ns_per_query:     12876 ((128,16,256), clustered)
-per_combo_geomean_ns:
-  (128,16,256) clustered  -> 12876 ns
-  (128,16,256) uniform    -> 13441 ns
-  ...
-peak_mem_mb:           28.4
-total_seconds:         12.3
+lance-autoresearch/
+├── Cargo.toml                         # workspace root
+├── README.md                          # you are here
+├── HARNESS.md                         # shared loop contract every target inherits
+├── LICENSE-MIT, LICENSE-APACHE        # dual-licensed (Apache compat for Lance PRs)
+├── crates/
+│   ├── harness-common/                # shared: SplitMix64, geomean, peak RSS, tolerance, time budget
+│   │   └── src/{lib,prng,stats,sysinfo,tolerance}.rs
+│   └── pq-l2/                         # landed target
+│       ├── Cargo.toml
+│       ├── program.md                 # this target's agent skill
+│       ├── src/
+│       │   ├── lib.rs                 # PqShape + module wiring (immutable)
+│       │   ├── kernels.rs             # MUTABLE — agent's playground
+│       │   ├── reference.rs           # IMMUTABLE — scalar reference, oracle helpers
+│       │   ├── inputs.rs              # IMMUTABLE — diverse test-data generators
+│       │   └── bin/run_experiment.rs  # IMMUTABLE — per-trial entry point
+│       └── benches/pq_l2.rs           # criterion benchmark (immutable)
+└── docs/
+    ├── design.md                      # rationale for the workspace shape
+    ├── adding-a-target.md             # workflow for spinning up a new target
+    └── targets/
+        └── pq-l2.md                   # capsule: upstream Lance pointers, oracle, status
 ```

-A kernel is "kept" iff:
-
- Correctness phase passes (mathematical equivalence to scalar reference)
- `geomean_ns_per_query` strictly better than the previous best-kept kernel
- `worst_ns_per_query` ≤ 1.05 × the previous best-kept kernel's worst
- `total_seconds` ≤ 600
-
-See `program.md` for the full loop spec.
-
 ## Upstream contribution path

-When a commit clears the keep bar by a meaningful margin (≥10% geomean
-speedup with worst-case guard intact), the human reviews the diff, ports the
-technique against [`lance-format/lance`](https://github.com/lance-format/lance)
-HEAD, runs Lance's own test suite, and opens a PR. Because `src/kernels.rs` is
-dual MIT/Apache-2.0 licensed and algorithmically modeled on Lance's existing
-path, the upstream PR inherits Apache-2.0 cleanly.
+When a commit on any target clears the keep bar by a meaningful margin
+(≥10% geomean speedup with worst-case guard intact), the human reviews the
+diff, ports the technique against
+[`lance-format/lance`](https://github.com/lance-format/lance) HEAD, runs
+Lance's own test suite, and opens a PR. Because the workspace is dual
+MIT/Apache-2.0 licensed and each target's kernel is algorithmically modeled on
+Lance's existing path, the upstream PR inherits Apache-2.0 cleanly.

 ## License

--- a/research/lance-autoresearch/crates/harness-common/Cargo.toml
+++ b/research/lance-autoresearch/crates/harness-common/Cargo.toml
@@ -0,0 +1,10 @@
+[package]
+name = "harness-common"
+version = "0.1.0"
+edition = "2024"
+license = "MIT OR Apache-2.0"
+description = "Shared utilities for lance-autoresearch per-target harnesses (PRNG, geomean, peak RSS, tolerance constants, time budget)."
+publish = false
+
+[lib]
+path = "src/lib.rs"
--- a/research/lance-autoresearch/crates/harness-common/src/lib.rs
+++ b/research/lance-autoresearch/crates/harness-common/src/lib.rs
@@ -0,0 +1,36 @@
+//! Shared utilities for lance-autoresearch per-target harnesses.
+//!
+//! Each target crate (`pq-l2`, future `pq-cosine`, `bitpack`, etc.)
+//! defines its own `kernels.rs` (mutable, the agent's playground), `reference.rs`
+//! (immutable scalar reference), `inputs.rs` (immutable test-data generators),
+//! and `bin/run_experiment.rs` (immutable per-trial entry point). They all need
+//! the same handful of building blocks: a deterministic PRNG, a geomean
+//! aggregator, peak-RSS readback, tolerance constants for the bit-exact oracle,
+//! and a single shared time-budget constant. That's everything in this crate.
+//!
+//! What is **not** here, and intentionally not abstracted:
+//!
+//! - A `Target` trait. Decode kernels (`bitpack`, `dictionary`, `FSST`) have
+//!   very different signatures than distance kernels (`PqKernel::probe_top_k`),
+//!   and forcing them into one trait shape would either bloat the trait or
+//!   require erased boxing. Keep each target's API natural to its kernel.
+//!
+//! - Output-format orchestration. Each target's `run_experiment.rs` prints its
+//!   own fixed-format result block — different targets report different
+//!   per-combo dimensions (PQ shapes vs bit widths vs distribution kinds vs ...).
+//!   Sharing the format would make the per-target binaries less readable and
+//!   gain very little — `println!` is cheap.
+
+pub mod prng;
+pub mod stats;
+pub mod sysinfo;
+pub mod tolerance;
+
+pub use prng::SplitMix64;
+pub use stats::geomean;
+pub use sysinfo::peak_rss_mb;
+pub use tolerance::{MAX_ABS_ERR, TOPK_DIST_TOL};
+
+/// Per-trial wall-clock cap. Targets should `std::process::exit(3)` if exceeded
+/// so the agent's loop logs the trial as a timeout instead of a measurement.
+pub const TIME_BUDGET_SECS: u64 = 600;
--- a/research/lance-autoresearch/crates/harness-common/src/prng.rs
+++ b/research/lance-autoresearch/crates/harness-common/src/prng.rs
@@ -0,0 +1,52 @@
+//! Deterministic SplitMix64 PRNG. Same seed produces the same sequence on
+//! every machine; no platform-specific RNG / no `rand` crate. Reproducibility
+//! across trials is the whole point.
+
+pub struct SplitMix64 {
+    state: u64,
+}
+
+impl SplitMix64 {
+    pub fn new(seed: u64) -> Self {
+        Self { state: seed }
+    }
+
+    pub fn next_u64(&mut self) -> u64 {
+        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
+        let mut z = self.state;
+        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
+        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
+        z ^ (z >> 31)
+    }
+
+    /// Uniform in `[0, 1)` with 24 bits of mantissa precision.
+    pub fn next_f32(&mut self) -> f32 {
+        let bits = (self.next_u64() >> 40) as u32;
+        bits as f32 / ((1u32 << 24) as f32)
+    }
+
+    /// Standard normal via Box–Muller. Cheap and sufficient for fixture
+    /// generation; not cryptographically anything.
+    pub fn next_normal(&mut self) -> f32 {
+        let mut u1 = self.next_f32();
+        if u1 < 1e-7 {
+            u1 = 1e-7;
+        }
+        let u2 = self.next_f32();
+        (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos()
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn deterministic_across_calls() {
+        let mut a = SplitMix64::new(0x1234_5678);
+        let mut b = SplitMix64::new(0x1234_5678);
+        for _ in 0..1000 {
+            assert_eq!(a.next_u64(), b.next_u64());
+        }
+    }
+}
--- a/research/lance-autoresearch/crates/harness-common/src/stats.rs
+++ b/research/lance-autoresearch/crates/harness-common/src/stats.rs
@@ -0,0 +1,36 @@
+//! Geometric mean of u64 timings. Robust to outliers; the right aggregator for
+//! latency distributions because halving one query and doubling another cancels.
+
+pub fn geomean(xs: &[u64]) -> u64 {
+    if xs.is_empty() {
+        return 0;
+    }
+    let mut sum_ln = 0.0f64;
+    for &x in xs {
+        sum_ln += (x.max(1) as f64).ln();
+    }
+    // Round before the cast: `exp(ln(x))` can land just below `x`, and `as u64` truncates.
+    (sum_ln / xs.len() as f64).exp().round() as u64
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn empty_yields_zero() {
+        assert_eq!(geomean(&[]), 0);
+    }
+
+    #[test]
+    fn single_value_round_trips() {
+        assert_eq!(geomean(&[100]), 100);
+    }
+
+    #[test]
+    fn geomean_is_below_arithmetic_mean() {
+        let xs = [1, 10, 100, 1000];
+        let g = geomean(&xs);
+        let am: u64 = xs.iter().sum::<u64>() / xs.len() as u64;
+        assert!(g < am);
+    }
+}
--- a/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs
+++ b/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs
@@ -0,0 +1,24 @@
+//! Peak resident-set-size readback (Linux only; non-Linux returns 0).
+
+#[cfg(target_os = "linux")]
+pub fn peak_rss_mb() -> f64 {
+    let Ok(s) = std::fs::read_to_string("/proc/self/status") else {
+        return 0.0;
+    };
+    for line in s.lines() {
+        // `VmHWM` is the peak resident set size; `VmPeak` is peak virtual memory.
+        if let Some(rest) = line.strip_prefix("VmHWM:") {
+            let kb: f64 = rest
+                .split_whitespace()
+                .next()
+                .and_then(|t| t.parse().ok())
+                .unwrap_or(0.0);
+            return kb / 1024.0;
+        }
+    }
+    0.0
+}
+
+#[cfg(not(target_os = "linux"))]
+pub fn peak_rss_mb() -> f64 {
+    0.0
+}
--- a/research/lance-autoresearch/crates/harness-common/src/tolerance.rs
+++ b/research/lance-autoresearch/crates/harness-common/src/tolerance.rs
@@ -0,0 +1,15 @@
+//! Default tolerance constants for bit-exact correctness oracles.
+//!
+//! These suit float-arithmetic kernels (PQ distance, BM25 scoring, vector
+//! normalization) where SIMD-accumulator reordering is legal but real bugs
+//! shift values by orders of magnitude. Targets that operate on integer or
+//! byte-exact data (bitpack decode, dictionary decode, FSST decode) should
+//! assert strict bitwise equality and not use these constants.
+
+/// Maximum permitted absolute element error between agent kernel output and
+/// scalar reference output, for float kernels.
+pub const MAX_ABS_ERR: f32 = 1e-4;
+
+/// Maximum permitted distance error when comparing top-K results between
+/// agent kernel and scalar reference, for float kernels.
+pub const TOPK_DIST_TOL: f32 = 1e-4;
--- a/research/lance-autoresearch/crates/pq-l2/Cargo.toml
+++ b/research/lance-autoresearch/crates/pq-l2/Cargo.toml
@@ -0,0 +1,24 @@
+[package]
+name = "pq-l2"
+version = "0.1.0"
+edition = "2024"
+license = "MIT OR Apache-2.0"
+description = "Autoresearch target: Lance PQ L2 distance kernel optimization."
+publish = false
+
+[lib]
+path = "src/lib.rs"
+
+[[bin]]
+name = "run_experiment"
+path = "src/bin/run_experiment.rs"
+
+[[bench]]
+name = "pq_l2"
+harness = false
+
+[dependencies]
+harness-common = { path = "../harness-common" }
+
+[dev-dependencies]
+criterion = { workspace = true }
--- a/research/lance-autoresearch/crates/pq-l2/benches/pq_l2.rs
+++ b/research/lance-autoresearch/crates/pq-l2/benches/pq_l2.rs
@@ -7,8 +7,8 @@ use std::hint::black_box;

 use criterion::{Criterion, criterion_group, criterion_main};

-use lance_autoresearch::inputs::{SHAPES, SPEED_TOP_K, speed_workloads};
-use lance_autoresearch::kernels::PqKernel;
+use pq_l2::inputs::{SHAPES, SPEED_TOP_K, speed_workloads};
+use pq_l2::kernels::PqKernel;

 fn bench_pq_l2(c: &mut Criterion) {
    let workloads = speed_workloads(0xBE3C_C0DE_F1AC_BABE);
--- a/research/lance-autoresearch/crates/pq-l2/program.md
+++ b/research/lance-autoresearch/crates/pq-l2/program.md
@@ -0,0 +1,98 @@
+# Target: PQ L2 — agent instructions
+
+This is the per-target overlay on top of [`../../HARNESS.md`](../../HARNESS.md).
+Read **HARNESS.md first** for the universal loop contract (what's editable,
+the metric, the loop, hygiene, never stop). This file adds the PQ-L2-specific
+API spec and priors.
+
+## Setup (once per session)
+
+1. Read in this order:
+   - `../../HARNESS.md`
+   - `../../README.md`
+   - `program.md` (this file)
+   - `src/lib.rs`
+   - `src/kernels.rs` *(the only file you may edit)*
+   - `src/reference.rs`
+   - `src/inputs.rs`
+   - `src/bin/run_experiment.rs`
+2. Ensure `results.tsv` exists. If not, create it with this header:
+   ```
+   commit	timestamp	correctness	geomean_ns	worst_ns	worst_combo	best_ns	best_combo	peak_mem_mb	total_seconds	keep	description
+   ```
+3. Baseline trial:
+   ```
+   cargo run --release --bin run_experiment > run.log 2>&1
+   ```
+   Append a row tagged `keep=baseline`, commit it.
+
+## Public API contract (must remain stable)
+
+The bench imports these from `crate::kernels`. You may NOT change their
+signatures. You MAY add private helpers, internal data layouts, `unsafe`
+blocks, `std::arch` intrinsics under `#[cfg(target_arch = ...)]` gates,
+pre-computed state inside `PqKernel`, etc.
+
+```rust
+pub struct PqKernel { /* agent's private fields */ }
+
+impl PqKernel {
+    pub fn new(shape: PqShape, codebook: &[f32]) -> Self;
+    pub fn shape(&self) -> &PqShape;
+    pub fn distance_table(&self, query: &[f32]) -> Vec<f32>;
+    pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>;
+}
+```
+
+Pre-processing in `new` is free — the bench measures `distance_table +
+probe_top_k` per query, not per (build + query). Codebook transposes,
+cached `c·c`, packed LUTs, etc., should live in `new`.
+
+## What you can / cannot do
+
+(See HARNESS.md for the universal table; this is the PQ-L2 specific
+addition.)
+
+- **Cannot** change `PqShape` or the constants in `lib.rs`. They define
+  the optimization target.
+- **Cannot** introduce lossy techniques (LUT u8/u16 quantization, asymmetric
+  approximation, anything that drops bits relative to the scalar reference).
+  The correctness phase asserts `max_abs_err ≤ 1e-4` against the scalar
+  reference; lossy techniques fail this gate. If you want to explore a lossy
+  track, propose it to the human as a separate kernel surface.
+- **Can** mark hot functions `#[inline]`, split them, add private helpers.
+- **Can** add `#[cfg(test)] mod tests { ... }` inside `kernels.rs` for in-file
+  property checks against the scalar path.
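+
+As a rough illustration of such an in-file check, the sketch below compares a
+reordered accumulation against a plain scalar loop. `dist_scalar` and
+`dist_unrolled` are hypothetical stand-ins, not functions from `kernels.rs`;
+the `1e-4` bound mirrors `MAX_ABS_ERR`.
+
+```rust
+/// Hypothetical scalar path: sum of squared differences, in order.
+fn dist_scalar(q: &[f32], c: &[f32]) -> f32 {
+    q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum()
+}
+
+/// Hypothetical optimized path: four independent accumulators, which is a
+/// legal reordering of the same sum. Assumes `q.len() == c.len()`.
+fn dist_unrolled(q: &[f32], c: &[f32]) -> f32 {
+    let mut acc = [0.0f32; 4];
+    let mut qi = q.chunks_exact(4);
+    let mut ci = c.chunks_exact(4);
+    for (qc, cc) in (&mut qi).zip(&mut ci) {
+        for lane in 0..4 {
+            let d = qc[lane] - cc[lane];
+            acc[lane] += d * d;
+        }
+    }
+    // Handle the elements that did not fill a chunk of four.
+    let mut tail = 0.0f32;
+    for (a, b) in qi.remainder().iter().zip(ci.remainder()) {
+        tail += (a - b) * (a - b);
+    }
+    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn unrolled_matches_scalar() {
+        // Lengths chosen to exercise both the chunked loop and the tail.
+        for len in [1usize, 7, 8, 16, 33] {
+            let q: Vec<f32> = (0..len).map(|i| (i as f32 * 0.37).sin()).collect();
+            let c: Vec<f32> = (0..len).map(|i| (i as f32 * 0.11).cos()).collect();
+            let err = (dist_scalar(&q, &c) - dist_unrolled(&q, &c)).abs();
+            assert!(err <= 1e-4, "len={len} err={err}");
+        }
+    }
+}
+```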
+
+## Lance-PQ-specific priors
+
+These are the directions that pay off on this kernel shape without
+compromising arithmetic accuracy. Pick one hypothesis per trial; don't try
+to combine multiple ideas at once.
+
+- **Codebook layout transpose.** The reference layout is `[m][k][d]`.
+  Transposing to `[m][d][k]` lets you SIMD-load 8 `(query - centroid)` lanes
+  across `d` and broadcast over `k`. Do the transpose in `PqKernel::new` once.
+- **Cache `c·c` per centroid.** The diff–square–sum expands to
+  `(q - c)·(q - c) = q·q - 2(q·c) + c·c`. Hoist `q·q` per sub-vector,
+  precompute `c·c` once at `new()` time, and store it next to the codebook.
+  The inner loop then reduces to one FMA per element (accumulating `q·c`).
+  Watch sign / accumulator ordering so rounding stays within `MAX_ABS_ERR`.
+- **Probe-side code transpose.** Probe is dominated by
+  `acc += table[m][codes[off+m]]` × `num_sub_vectors`. Transposing codes to
+  `[m][i]` (one row per sub-quantizer, contiguous over base index) lets you
+  process 32+ vectors per inner iteration with `vgatherdps`-style loads
+  (the table entries are `f32`).
+- **Top-K block-then-merge.** `push()` does a branch + heap sift on every
+  code. At 20k probes per query × 9 (shape × dist) combos that's the
+  second-biggest cost after the gather. Block the probe (e.g., 512 codes at
+  a time), find the local top-K with a branchless pass, then merge into the
+  global heap.
+- **Prefetch.** `_mm_prefetch::<_MM_HINT_T0>(codes.as_ptr().add(off + 64).cast())`
+  ahead of the gather is usually a pure win at 20k+ scale.
+- **FMA chains for table build.** The diff–square–sum maps cleanly to FMA
+  on AVX2+FMA / NEON. Even without intrinsics, `f32::mul_add` in the inner
+  loop lets `rustc` emit FMA when the target feature is enabled; plain
+  `a * b + c` is not fused automatically.
+- **Avoid the `Vec` allocation in the hot path.** `distance_table` allocates
+  a fresh `Vec<f32>` per call. The public API is fixed (returns `Vec<f32>`),
+  but you can reuse a thread-local scratch buffer internally and copy to a
+  `Vec` at the boundary if that speeds up the table build.
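+
+A minimal sketch of the `c·c` expansion from the priors above, assuming a
+single sub-vector and a single centroid. The function names are hypothetical;
+a real kernel would compute `cc` once inside `PqKernel::new` and `qq` once
+per query sub-vector.
+
+```rust
+/// Dot product of two equal-length slices.
+fn dot(a: &[f32], b: &[f32]) -> f32 {
+    a.iter().zip(b).map(|(x, y)| x * y).sum()
+}
+
+/// Direct form: sum of squared differences.
+fn l2_direct(q: &[f32], c: &[f32]) -> f32 {
+    q.iter().zip(c).map(|(x, y)| (x - y) * (x - y)).sum()
+}
+
+/// Expanded form: q·q - 2(q·c) + c·c. `qq` is hoisted per query sub-vector
+/// and `cc` is assumed to be precomputed once per centroid.
+fn l2_expanded(q: &[f32], c: &[f32], qq: f32, cc: f32) -> f32 {
+    let mut qc = 0.0f32;
+    for (x, y) in q.iter().zip(c) {
+        qc = x.mul_add(*y, qc); // qc += x * y as one fused operation
+    }
+    qq - 2.0 * qc + cc
+}
+```
+
+The expanded form can lose accuracy through cancellation when `q` and `c` are
+nearly equal and large in magnitude, so it still has to clear the
+`MAX_ABS_ERR` check like any other change.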
--- a/research/lance-autoresearch/crates/pq-l2/src/bin/run_experiment.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/bin/run_experiment.rs
@@ -35,18 +35,18 @@

 use std::time::Instant;

-use lance_autoresearch::inputs::{
+use harness_common::{MAX_ABS_ERR, TIME_BUDGET_SECS, TOPK_DIST_TOL, geomean, peak_rss_mb};
+use pq_l2::inputs::{
    DISTRIBUTIONS, DataDistribution, SHAPES, SpeedWorkload, correctness_battery, speed_workloads,
 };
-use lance_autoresearch::kernels::PqKernel;
-use lance_autoresearch::reference::{ScalarReference, max_abs_err, topk_consistent};
-use lance_autoresearch::{MAX_ABS_ERR, PqShape, TOPK_DIST_TOL};
+use pq_l2::kernels::PqKernel;
+use pq_l2::reference::{ScalarReference, max_abs_err, topk_consistent};
+use pq_l2::PqShape;

 // Any constants; the only requirement is that they're pinned across trials so
 // the inputs and the timings are reproducible.
 const CORRECTNESS_SEED: u64 = 0xC0FF_EEC0_DEBE_EFFE;
 const SPEED_SEED: u64 = 0x5EED_F1AC_BABE_FACE;
-const TIME_BUDGET_SECS: u64 = 600;

 fn main() {
    let start = Instant::now();
@@ -210,17 +210,6 @@ fn run_speed(workloads: &[SpeedWorkload]) -> SpeedReport {
    }
 }

-fn geomean(xs: &[u64]) -> u64 {
-    if xs.is_empty() {
-        return 0;
-    }
-    let mut sum_ln = 0.0f64;
-    for &x in xs {
-        sum_ln += (x.max(1) as f64).ln();
-    }
-    (sum_ln / xs.len() as f64).exp() as u64
-}
-
 fn format_shape(s: &PqShape) -> String {
    format!("({},{},{})", s.dim, s.num_sub_vectors, s.num_centroids)
 }
@@ -233,26 +222,3 @@ fn format_dist(d: &DataDistribution) -> String {
    }
    .to_string()
 }
-
-#[cfg(target_os = "linux")]
-fn peak_rss_mb() -> f64 {
-    let Ok(s) = std::fs::read_to_string("/proc/self/status") else {
-        return 0.0;
-    };
-    for line in s.lines() {
-        if let Some(rest) = line.strip_prefix("VmPeak:") {
-            let kb: f64 = rest
-                .split_whitespace()
-                .next()
-                .and_then(|t| t.parse().ok())
-                .unwrap_or(0.0);
-            return kb / 1024.0;
-        }
-    }
-    0.0
-}
-
-#[cfg(not(target_os = "linux"))]
-fn peak_rss_mb() -> f64 {
-    0.0
-}
--- a/research/lance-autoresearch/crates/pq-l2/src/inputs.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/inputs.rs
@@ -16,6 +16,7 @@
 //! the codebook is shape-appropriate, not random.

 use crate::PqShape;
+use harness_common::SplitMix64;

 /// PQ shapes the bench evaluates. The agent's kernel must produce correct
 /// output and competitive speed on every one.
@@ -295,36 +296,6 @@ fn encode(shape: PqShape, n: usize, base: &[f32], codebook: &[f32]) -> Vec<u8> {
    out
 }

-/// SplitMix64 — small, deterministic; bit-for-bit reproducible across machines.
-struct SplitMix64 {
-    state: u64,
-}
-
-impl SplitMix64 {
-    fn new(seed: u64) -> Self {
-        Self { state: seed }
-    }
-    fn next_u64(&mut self) -> u64 {
-        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
-        let mut z = self.state;
-        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
-        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
-        z ^ (z >> 31)
-    }
-    fn next_f32(&mut self) -> f32 {
-        let bits = (self.next_u64() >> 40) as u32;
-        bits as f32 / ((1u32 << 24) as f32)
-    }
-    fn next_normal(&mut self) -> f32 {
-        let mut u1 = self.next_f32();
-        if u1 < 1e-7 {
-            u1 = 1e-7;
-        }
-        let u2 = self.next_f32();
-        (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos()
-    }
-}
-
 fn shape_hash(s: PqShape) -> u64 {
    (s.dim as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
        ^ (s.num_sub_vectors as u64).wrapping_mul(0xBF58_476D_1CE4_E5B9)
--- a/research/lance-autoresearch/crates/pq-l2/src/kernels.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/kernels.rs
--- a/research/lance-autoresearch/crates/pq-l2/src/lib.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/lib.rs
@@ -1,17 +1,20 @@
-//! Lance autoresearch harness — public API for the bench binary, benchmarks, and tests.
+//! Autoresearch target: Lance PQ L2 distance kernel optimization.
 //!
-//! Contract (Karpathy-style three files):
+//! Karpathy-style three-file contract:
 //!
 //! - `kernels` — the AGENT'S PLAYGROUND. Modify freely.
 //! - `reference` — IMMUTABLE. Scalar reference kernel. Defines the math.
 //! - `inputs` — IMMUTABLE. Diverse test-data + workload generators,
 //!   deterministic per fixed seed, varied across the input battery.
 //!
-//! The optimization target is dataset-independent: the agent's kernel must match
-//! the scalar reference within `MAX_ABS_ERR` on every input the bench generates,
-//! and minimize geomean ns/query across multiple PQ shapes and data
-//! distributions. There is no fixed dataset; an "improvement" by construction
-//! generalizes across distributions and shapes.
+//! The optimization target is dataset-independent: the agent's kernel must
+//! match the scalar reference within `harness_common::MAX_ABS_ERR` on every
+//! input the bench generates, and minimize geomean ns/query across multiple
+//! PQ shapes and data distributions. There is no fixed dataset.
+//!
+//! Shared utilities (deterministic PRNG, geomean, peak RSS, tolerance
+//! constants, time budget) come from the `harness-common` workspace crate.
+//! See `../../HARNESS.md` (relative to the crate root) for the harness
+//! conventions every target follows.

 pub mod inputs;
 pub mod kernels;
@@ -45,12 +48,3 @@ impl PqShape {
        self.num_sub_vectors * self.num_centroids * self.sub_vector_dim()
    }
 }
-
-/// Tolerance for the agent kernel's distance values vs. the scalar reference.
-/// Loose enough to permit legal SIMD-accumulator reordering; tight enough to
-/// catch real arithmetic bugs.
-pub const MAX_ABS_ERR: f32 = 1e-4;
-
-/// Tolerance for top-K *distances* (id sets are compared with tie-tolerance —
-/// see `reference::topk_consistent`).
-pub const TOPK_DIST_TOL: f32 = 1e-4;
--- a/research/lance-autoresearch/crates/pq-l2/src/reference.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/reference.rs
--- a/research/lance-autoresearch/docs/adding-a-target.md
+++ b/research/lance-autoresearch/docs/adding-a-target.md
@@ -0,0 +1,192 @@
+# Adding a new target
+
+Walk through this when spinning up a new optimization target (A1 cosine, A4
+bitpack, etc.). It's a `cp -r` plus surgical edits — no architectural
+decisions to make per target if the kernel fits the autoresearch shape.
+
+If your target's per-trial eval is more than ~30 seconds, or the correctness
+oracle can't be a deterministic comparison against a scalar reference, this
+harness is the wrong fit — see [`design.md`](design.md) "When to revisit"
+for the boundary.
+
+## Steps
+
+### 1. Pick a template target
+
+Pick the closest existing target. For now there's just `pq-l2`, but as more land:
+- Distance / scoring kernels that take a query and return per-row scores →
+  template off `pq-l2`.
+- Decode kernels that take encoded bytes and return an Arrow array →
+  template off `bitpack` once it lands.
+- Scan / merge kernels → template off `topk-merge` once it lands.
+
+```bash
+cp -r crates/pq-l2 crates/<my-target>
+```
+
+### 2. Rewrite `Cargo.toml`
+
+```toml
+[package]
+name = "<my-target>"
+# version, edition, license, publish stay the same
+```
+
+Add the target to the workspace `members` in the root `Cargo.toml`:
+
+```toml
+[workspace]
+members = [
+    "crates/harness-common",
+    "crates/pq-l2",
+    "crates/<my-target>",   # add this
+]
+```
+
+### 3. Rewrite `src/lib.rs`
+
+Define the target's `Shape` type (analogue of `PqShape`) and any other types
+shared among `kernels.rs`, `reference.rs`, and `inputs.rs`. Document
+which fields are pinned by the harness vs. agent-tunable.
+
+This file is **immutable** to the agent. The shape parameters define the
+optimization target — changing them changes what's being optimized.
+
+### 4. Rewrite `src/reference.rs`
+
+Implement the scalar reference kernel — the math, in plain Rust, no SIMD,
+no cleverness. This is what the agent's kernel is compared against. Mirror
+the public API of `kernels.rs` exactly.
+
+For float kernels, also export `max_abs_err(a, b)` and `topk_consistent(...)`
+(or analogues) — the comparison helpers the bench uses to assert
+near-bit-exact equivalence with `harness_common::MAX_ABS_ERR` /
+`TOPK_DIST_TOL`.
+
+For integer / byte kernels, the comparison is simpler — `assert_eq!` on the
+returned Arrow array. No tolerance constants needed.
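+
+For the float case, a comparison helper could look roughly like the sketch
+below. The signature `(a: &[f32], b: &[f32]) -> f32` is an assumption, since
+the real `reference::max_abs_err` is not shown here. Treating a NaN
+difference as infinite is also an illustrative choice, made so that a NaN can
+never slip under the tolerance.
+
+```rust
+/// Largest absolute element-wise difference between two equal-length slices.
+/// Returns `f32::INFINITY` if any difference is NaN, so the caller's
+/// `<= MAX_ABS_ERR` check fails instead of silently passing.
+fn max_abs_err(a: &[f32], b: &[f32]) -> f32 {
+    assert_eq!(a.len(), b.len(), "length mismatch");
+    let mut worst = 0.0f32;
+    for (x, y) in a.iter().zip(b) {
+        let d = (x - y).abs();
+        if d.is_nan() {
+            return f32::INFINITY;
+        }
+        worst = worst.max(d);
+    }
+    worst
+}
+```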
+
+### 5. Rewrite `src/inputs.rs`
+
+Two surfaces:
+
+- `correctness_battery(seed) -> Vec<CorrectnessCase>` — diverse shape ×
+  distribution combinations, sized small enough that the correctness phase
+  finishes in seconds. The point is breadth, not realism.
+- `speed_workloads(seed) -> Vec<SpeedWorkload>` — larger shape × distribution
+  combinations sized for stable timings. Aim for total trial wall-clock
+  ≤ 60s: agent iteration latency matters more than workload size, and
+  breadth is already covered by the correctness battery.
+
+Use `harness_common::SplitMix64` for determinism. Same seed → same battery
+across trials.
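+
+To make the determinism claim concrete, here is a small sketch. The generator
+body is copied from the `SplitMix64` this commit moves out of `inputs.rs`, so
+the block is self-contained; a real target would import
+`harness_common::SplitMix64` instead. `uniform_samples` is a hypothetical
+helper.
+
+```rust
+/// Local copy for a self-contained example.
+struct SplitMix64 {
+    state: u64,
+}
+
+impl SplitMix64 {
+    fn new(seed: u64) -> Self {
+        Self { state: seed }
+    }
+
+    fn next_u64(&mut self) -> u64 {
+        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
+        let mut z = self.state;
+        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
+        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
+        z ^ (z >> 31)
+    }
+
+    /// Uniform in [0, 1), built from the top 24 bits.
+    fn next_f32(&mut self) -> f32 {
+        let bits = (self.next_u64() >> 40) as u32;
+        bits as f32 / ((1u32 << 24) as f32)
+    }
+}
+
+/// Hypothetical helper: `n` uniform samples from a fixed seed.
+fn uniform_samples(seed: u64, n: usize) -> Vec<f32> {
+    let mut rng = SplitMix64::new(seed);
+    (0..n).map(|_| rng.next_f32()).collect()
+}
+```
+
+Calling `uniform_samples` twice with the same seed yields identical vectors,
+which is what keeps the correctness battery stable across trials.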
+
+### 6. Rewrite `src/kernels.rs` (the agent's playground)
+
+Implement a clean scalar baseline matching the algorithm shape of the Lance
+upstream code. The header comment must:
+
+- Cite the upstream Lance source (`lance-format/lance` rev / file path) the
+  algorithm is modeled on.
+- Document the public API the bench calls — these are the surfaces the agent
+  may NOT change.
+- List "what you can do" / "what you cannot do" rules specific to this
+  target.
+
+The starting kernel must be correct (passes the correctness phase against
+`reference.rs`) and lint-clean. The agent's job is to make it faster.
+
+### 7. Rewrite `src/bin/run_experiment.rs`
+
+Two phases:
+
+- **Correctness phase:** for each `CorrectnessCase`, run agent kernel +
+  reference, compare. Any mismatch → print `correctness: fail`, diagnostic
+  line, exit 2.
+- **Speed phase:** for each `SpeedWorkload`, run agent kernel and time per
+  query / per row / per byte. Aggregate geomean / worst / best across all
+  combos. Print fixed-format result block.
+
+Universal output fields (every target) are listed in `HARNESS.md` "The
+metric." Add per-target fields above them as needed (e.g., `bit_widths_tested`
+for bitpack).
+
+Use:
+- `harness_common::geomean` for the aggregator
+- `harness_common::peak_rss_mb` for memory readback
+- `harness_common::TIME_BUDGET_SECS` for the time-budget check
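+
+A rough sketch of the aggregation step in the speed phase. The `geomean` body
+mirrors the function this commit moves into `harness-common`; `Summary` and
+`summarize` are hypothetical names used only for illustration.
+
+```rust
+/// Geometric mean of nanosecond timings, computed in log space.
+/// Zero samples are clamped to 1 ns; the result truncates toward zero.
+fn geomean(xs: &[u64]) -> u64 {
+    if xs.is_empty() {
+        return 0;
+    }
+    let mut sum_ln = 0.0f64;
+    for &x in xs {
+        sum_ln += (x.max(1) as f64).ln();
+    }
+    (sum_ln / xs.len() as f64).exp() as u64
+}
+
+/// Hypothetical summary of per-combo timings.
+struct Summary {
+    geomean_ns: u64,
+    worst_ns: u64,
+    best_ns: u64,
+}
+
+/// Aggregate one timing per (shape x distribution) combo.
+fn summarize(per_combo_ns: &[u64]) -> Summary {
+    Summary {
+        geomean_ns: geomean(per_combo_ns),
+        worst_ns: per_combo_ns.iter().copied().max().unwrap_or(0),
+        best_ns: per_combo_ns.iter().copied().min().unwrap_or(0),
+    }
+}
+```
+
+Because the final cast truncates, the reported geomean can land one
+nanosecond below the mathematically exact value.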
+
+### 8. (Optional) Rewrite `benches/<my-target>.rs`
+
+Criterion benchmark with the same kernel calls as `run_experiment` but
+under criterion's statistical-sampling harness. Optional — the per-trial
+binary is the agent's primary measurement; criterion is for the human's
+deeper investigation.
+
+### 9. Write `program.md`
+
+Per-target agent skill, layered on top of `HARNESS.md`. Sections:
+
+- **Setup** — which files to read at session start (always include
+  `../../HARNESS.md`).
+- **Public API contract** — the exact functions / structs the agent must
+  keep stable.
+- **Target-specific priors** — known SIMD techniques for this kernel shape,
+  algorithmic transformations worth trying, common pitfalls. This is the
+  highest-leverage content; spend time on it.
+- **`results.tsv` header** — the per-target column set.
+
+### 10. Write the per-target capsule in `docs/targets/<my-target>.md`
+
+A short doc covering:
+
+- What's optimized (one sentence)
+- Upstream Lance source pointers (rev, file paths, function names)
+- Oracle definition (bit-exact / `max_abs_err`)
+- Speed workload shape (which shapes × distributions it spans)
+- Status (candidate / landed / has-results)
+
+### 11. Verify end-to-end
+
+```bash
+cargo build --release -p <my-target>
+cargo clippy --release -p <my-target> --all-targets -- -D warnings
+cargo run --release --bin run_experiment -p <my-target>
+```
+
+The baseline trial must:
+- Print `correctness: pass`
+- Exit 0
+- Finish within ~60s
+- Report a sensible `geomean_ns_per_*` baseline number
+
+Smoke-test the gate: deliberately break `kernels.rs` (e.g., return constant
+zero), confirm the trial exits 2 with `correctness: fail`. Restore.
+
+### 12. Update the target row in the top-level `README.md`
+
+In the targets table at the top of the README, change the new target's row
+from `candidate` to `landed`.
+
+### 13. Commit
+
+One commit for the target's scaffolding. Don't bundle multiple targets in
+one commit — each target's history should be independently revertible.
+
+## Common gotchas
+
+- **Removing the `[workspace]` block** from the root `Cargo.toml` makes cargo
+  walk up to the omnigraph parent workspace. It is already in place; just
+  don't remove it.
+- **Per-target `Cargo.toml` referencing the wrong `harness-common` path.**
+  Use `harness-common = { path = "../harness-common" }`.
+- **Picking a `SHAPES` set that's too small.** Three shapes is the floor;
+  with one shape an agent could specialize and pass, with two there's not
+  enough variety. Ensure the shapes span at least one "outlier" (e.g., for
+  PQ, one shape with `sub_vector_dim != 8`).
+- **Correctness battery too narrow.** Five distributions is the floor: at
+  minimum Gaussian / uniform / sparse / large-dynamic-range / mostly-zero (or
+  the integer analogue: uniform / clustered / skewed / few-distinct /
+  monotonic).
+- **Trial time too long.** If the speed phase exceeds ~60s, agent iteration
+  rate drops below useful. Reduce workload sizes; the speed metric is
+  per-operation, not per-workload, so absolute size doesn't change the
+  comparison.
--- a/research/lance-autoresearch/docs/design.md
+++ b/research/lance-autoresearch/docs/design.md
@@ -0,0 +1,152 @@
+# Design — why the workspace is shaped this way
+
+This document records the rationale for the multi-target workspace shape so
+future contributors don't relitigate the early decisions.
+
+## The thing we're building
+
+A multi-target harness for LLM-driven optimization of Lance hot-path kernels.
+"Multi-target" because Lance has many such kernels — distance kernels in
+`lance-linalg`, decoders in `lance-encoding`, scan/merge kernels — and the
+right harness shape is identical across them: bit-exact correctness oracle,
+geomean-across-distributions speed metric, single-agent autoresearch loop.
+
+The original [research note](../../docs/research/llm-evolutionary-sampling.md)
+enumerates ten such candidates (A1–A10) clustered by Lance crate. The first
+landed (`pq-l2`) proves the harness shape; the rest follow the same template.
+
+## Decision: workspace, not single crate
+
+A single crate exposing multiple binaries (`run_experiment_pq_l2`,
+`run_experiment_bitpack`, ...) was the obvious-looking alternative. Rejected
+for three reasons:
+
+1. **Per-target deps differ.** FSST decode wants different deps than PQ
+   kernels (a string-compression library vs. just `f32` math). A single
+   `Cargo.toml` would either bundle every target's deps into every build or
+   require fine-grained features. Workspaces give per-target `Cargo.toml`
+   for free.
+
+2. **Edit isolation.** The agent edits one target's `kernels.rs` at a time.
+   In a single crate, `kernels.rs` files would collide on path or have to live
+   in target-specific submodules with target-specific naming. Per-target
+   crates put `src/kernels.rs` at the natural location every time and let the
+   agent navigate one tree per session.
+
+3. **Build / test isolation.** `cargo build -p pq-l2` builds only what's
+   needed for the PQ L2 target; `cargo test -p pq-l2` runs only its tests.
+   The agent's iteration loop is faster because it doesn't pay for unrelated
+   targets' compile time.
+
+The downside (workspace boilerplate, per-target `Cargo.toml`, and the
+`[workspace]` block at the workspace root that prevents cargo from walking up
+to the parent omnigraph workspace) is a one-time cost. The overhead of
+adding a new target is one `cp -r` plus path edits.
+
+## Decision: shared `harness-common` crate, no `Target` trait
+
+A `Target` trait was the other obvious-looking alternative: express the
+common loop generically, plug in target-specific types. Rejected because:
+
+1. **Kernel signatures vary too much for a single trait shape.** PQ
+   `probe_top_k` returns `Vec<(u32, f32)>`. Bitpack decode returns an
+   `IntArray`. FSST decode returns `Vec<u8>`. Predicate evaluation returns a
+   `BooleanArray`. A unifying trait would need erased boxing or a wide
+   associated-type surface, both of which obscure the actual hot path the
+   agent is editing.
+
+2. **The orchestration that *is* shared is small.** A deterministic PRNG
+   (~30 lines), a geomean (~10 lines), peak RSS readback (~20 lines), and
+   three constants (two tolerances plus the time budget). Total ~70 lines of
+   shared code. Building a trait abstraction over 70 lines costs more than
+   it saves.
+
+3. **The output format isn't worth sharing.** Each target's
+   `run_experiment.rs` prints a fixed-format result block; the *fields*
+   differ per target (PQ shapes vs bit widths vs distribution kinds). A
+   shared formatter would be either trivial wrapping of `println!` (no
+   value) or a complicated builder API (negative value).
+
+`harness-common` therefore exposes plumbing only: `SplitMix64`, `geomean`,
+`peak_rss_mb`, `MAX_ABS_ERR`, `TOPK_DIST_TOL`, `TIME_BUDGET_SECS`. Each
+target consumes what it needs. The shared loop contract is documented in
+`HARNESS.md`, not encoded in code.
+
+## Decision: per-target `program.md` + shared `HARNESS.md`
+
+The agent reads two files at session start:
+
+- `HARNESS.md` (workspace-level) — universal: the loop, the metric, the
+  edit-permission table, hygiene rules.
+- `crates/<target>/program.md` (per-target) — specific: the kernel API the
+  agent must keep stable, target-specific priors (which SIMD intrinsics tend
+  to win on this kernel shape), the `results.tsv` column header.
+
+The shape mirrors how Karpathy's `nanochat-research` `program.md` works,
+factored across the dimension that varies (per target) vs. doesn't (the loop
+itself). Two files instead of one because copy-pasting the universal loop
+into every `program.md` makes them drift.
+
+## Decision: dataset-independent oracle for every target
+
+The first iteration of the harness used recall@K vs. SIFT1M as the
+correctness oracle. We replaced it with bit-exact (or near-bit-exact for
+floats) match against a scalar reference because:
+
+1. The agent had incentive to overfit lossy approximations to the dataset's
+   cluster structure, even though we didn't ask for that.
+2. SIFT1M is 250 MB and a hassle to download; the harness benefited from
+   being self-contained.
+3. Mathematical equivalence is a strictly stronger contract than recall
+   preservation: if the kernel is bit-equivalent to the scalar reference,
+   recall is automatically identical because the distance values are the
+   same. There's nothing recall@K catches that bit-exactness doesn't.
+
+This decision generalizes to every target. Decode kernels get strict bitwise
+equality (no float arithmetic involved). Distance and BM25 kernels get
+`max_abs_err ≤ 1e-4` (loose enough for SIMD-accumulator reordering, tight
+enough for real bugs). Targets that genuinely require lossy techniques to
+get headroom — there might be some; LUT u8 quantization in PQ is one — go
+in a separate "lossy track" with a recall-based oracle on diverse datasets,
+not the bit-exact track.
+
+## Decision: per-target speed measurement spans multiple shapes × distributions
+
+A single dataset would let an agent overfit to that dataset's distribution.
+Each target's `inputs.rs` therefore generates speed workloads across:
+
+- Multiple **shapes** of the kernel's domain (PQ: `(dim, num_sub_vectors,
+  num_centroids)`; bitpack: bit width; etc.). Captures how the kernel
+  performs at different sizes Lance users actually encounter.
+- Multiple **data distributions** (Gaussian / uniform / sparse for floats;
+  uniform / skewed / clustered for integers; etc.). Captures whether the
+  kernel's win is data-distribution-conditional.
+
+The keep gate uses geomean across all (shape × distribution) combos with a
+worst-case guard: a kernel that wins on one combo and regresses more than 5%
+on another fails to keep, even if the geomean improves. This forces wins to
+generalize.
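+
+One way to express that gate, as a sketch only. It assumes the guard is
+applied to `worst_ns_per_query` against the last-kept trial with a 1.05
+factor, as the earlier single-crate `program.md` worded it. `Trial` and
+`should_keep` are hypothetical names.
+
+```rust
+/// Hypothetical per-trial summary; field names are illustrative.
+struct Trial {
+    correctness_pass: bool,
+    geomean_ns: u64,
+    worst_ns: u64,
+}
+
+/// Keep only if correctness passes, the geomean strictly improves, and the
+/// worst combo is at most 1.05x the last-kept worst. Integer math avoids
+/// float rounding at the boundary.
+fn should_keep(candidate: &Trial, last_kept: &Trial) -> bool {
+    let worst_ok =
+        (candidate.worst_ns as u128) * 100 <= (last_kept.worst_ns as u128) * 105;
+    candidate.correctness_pass && candidate.geomean_ns < last_kept.geomean_ns && worst_ok
+}
+```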
+
+## What's deliberately not abstracted
+
+- **Output format.** Each target prints its own field block. See above.
+- **`TopKHeap` and other small data structures.** When two targets need a
+  `TopKHeap`, the second one copies the first's. Three copies of a 30-line
+  struct is cheaper than one trait-erased indirection.
+- **Test data shapes.** Each target's `inputs.rs` knows its own kernel's
+  fixture shape. Sharing would require a generic `Fixture<Kernel>` trait,
+  which would either be too narrow (forces every kernel into a `query +
+  workload` shape) or too wide (gives up the type safety that makes the
+  bench's correctness check obvious).
+
+## When to revisit
+
+If the workspace grows past ~6 active targets and we notice we're
+copy-pasting more than ~50 lines of `run_experiment.rs` boilerplate per new
+target, consider extracting a shared `RunExperiment` helper that takes
+closures for the correctness and speed phases. Don't pre-extract — wait
+until the duplication is real and visible.
+
+If we add a target that genuinely doesn't fit the autoresearch loop (eval
+crosses ~30s; tournament sampling becomes the right control loop), it
+belongs in a separate workspace, not this one. The boundary line is the
+loop shape, not the target type.
--- a/research/lance-autoresearch/docs/targets/pq-l2.md
+++ b/research/lance-autoresearch/docs/targets/pq-l2.md
@@ -0,0 +1,98 @@
+# Target: `pq-l2`
+
+PQ L2 distance kernel for f32 dense vectors — the asymmetric-distance compute
+that runs on every `IvfPq` / `IvfHnswPq` ANN query in Lance.
+
+## Status
+
+**Landed.** Baseline scalar kernel committed; the agent's job is to find
+generalizable speedups against it.
+
+## What's optimized
+
+Two functions in `crates/pq-l2/src/kernels.rs`:
+
+- `PqKernel::distance_table(query)` — builds the asymmetric distance table
+  (`[num_sub_vectors][num_centroids]`) for one query against the codebook.
+  Cost: `num_sub_vectors × num_centroids × sub_vector_dim` MAC ops per query.
+- `PqKernel::probe_top_k(table, codes, num_vectors, k)` — probes
+  `num_vectors` PQ-encoded vectors, accumulates per-vector distance via
+  `num_sub_vectors` table lookups, returns top-K. Cost:
+  `num_vectors × num_sub_vectors` lookups + heap maintenance per query.
+  This is the dominant cost at typical scales.
+
+`PqKernel::new(shape, codebook)` is also editable — the agent may pre-process
+the codebook (transpose layout, cache `c·c` for the FMA trick, pack the LUT)
+and amortize over queries; build cost is excluded from per-query timing.
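+
+For orientation, a scalar sketch of the two steps described above. It assumes
+a row-major `[m][k][d]` codebook, a `[m][k]` table, and codes stored as `m`
+bytes per vector; the free-function signatures are hypothetical and are not
+the `PqKernel` API.
+
+```rust
+/// Build the `[m][k]` table of squared L2 distances between each query
+/// sub-vector and each centroid. `m` sub-quantizers, `k` centroids each,
+/// `d` dimensions per sub-vector.
+fn distance_table(query: &[f32], codebook: &[f32], m: usize, k: usize, d: usize) -> Vec<f32> {
+    assert_eq!(query.len(), m * d);
+    assert_eq!(codebook.len(), m * k * d);
+    let mut table = vec![0.0f32; m * k];
+    for sub in 0..m {
+        let q = &query[sub * d..(sub + 1) * d];
+        for cen in 0..k {
+            let base = (sub * k + cen) * d;
+            let c = &codebook[base..base + d];
+            table[sub * k + cen] =
+                q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum::<f32>();
+        }
+    }
+    table
+}
+
+/// Distance of encoded vector `i`: one table lookup per sub-quantizer.
+fn probe_distance(table: &[f32], codes: &[u8], m: usize, k: usize, i: usize) -> f32 {
+    let off = i * m;
+    (0..m)
+        .map(|sub| table[sub * k + codes[off + sub] as usize])
+        .sum::<f32>()
+}
+```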
+
+## Upstream Lance source
+
+Algorithmically modeled on `lance-linalg::distance::l2` plus the PQ
+asymmetric-distance compute in `lance::index::vector::pq`. Specifically the
+f32 dense path; the byte / fixed-point variants are out of scope for this
+target.
+
+When porting a winning kernel upstream:
+- File: `lance-linalg/src/distance/l2.rs` and the L2-specific path in
+  `lance/src/index/vector/pq.rs`.
+- License: Apache-2.0 (matches our dual MIT/Apache-2.0 → upstream takes
+  the Apache half).
+
+## Oracle
+
+**Float-accumulator-tolerance match against scalar reference.** Per
+`harness_common::MAX_ABS_ERR = 1e-4`:
+
+- Distance table values must match the scalar reference within `1e-4` per
+  element. Loose enough for legal SIMD-accumulator reordering, tight enough
+  to catch real arithmetic bugs.
+- Top-K results compared with `harness_common::TOPK_DIST_TOL = 1e-4` plus
+  tie-tolerant id substitution (any permutation within a tied-distance band
+  is accepted).
+
+The correctness phase asserts both on every input combination — five input
+distributions × three PQ shapes = 15 cases per trial.
+
+## Speed workload
+
+Three shapes:
+- `(128, 16, 256)` — SIFT-like; sub_vector_dim = 8
+- `(256, 16, 256)` — sub_vector_dim = 16
+- `(768, 96, 256)` — BERT-base-like; large codebook
+
+Three data distributions:
+- `Clustered` — 32 cluster centers, low intra-cluster noise
+- `Uniform` — uniform on [-1, 1]
+- `Sparse` — 90% zeros + 10% Gaussian
+
+Per (shape × distribution): 20,000 base vectors PQ-encoded, 32 queries
+timed. Total trial wall-clock: ~30–60s on a developer laptop.
+
+## Output fields
+
+```
+correctness:           pass | fail
+shapes_tested:         (128,16,256) (256,16,256) (768,96,256)
+distributions_tested:  clustered uniform sparse
+geomean_ns_per_query:  <u64>
+worst_ns_per_query:    <u64> (<shape>, <dist>)
+best_ns_per_query:     <u64> (<shape>, <dist>)
+per_combo_geomean_ns:
+  (...)
+peak_mem_mb:           <f64>
+total_seconds:         <f64>
+```
+
+## Known headroom (priors for the agent)
+
+See `crates/pq-l2/program.md` "Lance-PQ-specific priors" for the canonical
+list. Highlights:
+
+- Codebook layout transpose (`[m][k][d]` → `[m][d][k]`) for SIMD-broadcast
+  table build.
+- Cache `c·c` per centroid in `new()` so the inner loop is `q·q − 2qc + c·c`
+  (one FMA chain).
+- Probe-side code transpose so the inner loop processes 32+ vectors per
+  iteration via gather.
+- Top-K block-then-merge instead of per-vector heap insert.
+- Prefetch on `codes[i+64]` ahead of gather.
--- a/research/lance-autoresearch/program.md
+++ b/research/lance-autoresearch/program.md
@@ -1,172 +0,0 @@
-# Lance PQ L2 kernel research — agent instructions
-
-You are an autonomous research assistant. Your job is to improve `src/kernels.rs`
-so that `cargo run --release --bin run_experiment` reports a **lower
-`geomean_ns_per_query`** while:
-
-1. The **correctness phase passes** — your kernel's distance values must match the
-   scalar reference within `MAX_ABS_ERR = 1e-4`, and the top-K must be
-   tie-tolerant equivalent on every input the bench generates.
-2. The `worst_ns_per_query` does **not regress more than 5%** against the
-   last-kept kernel — if you win on one (shape × distribution) and lose
-   significantly on another, the change isn't a generalizable improvement.
-
-This bench is intentionally **dataset-independent**: there is no fixed dataset.
-The correctness oracle is mathematical equivalence to the scalar reference,
-checked across multiple PQ shapes and synthetic input distributions
-(Gaussian / uniform / sparse / large-dynamic-range / mostly-zero). The speed
-oracle is the geomean across multiple shapes × distributions, with worst-case
-guarded. A win that depends on a specific data distribution or PQ shape will
-fail to clear the bar by construction.
-
-Read this file end-to-end before doing anything else. Then run setup, then the loop.
-
-## Setup (do once at the start of every session)
-
-1. Read these files, in this order:
-   - `README.md`
-   - `program.md` (this file)
-   - `src/lib.rs`
-   - `src/kernels.rs` *(the only file you may edit)*
-   - `src/reference.rs`
-   - `src/inputs.rs`
-   - `src/bin/run_experiment.rs`
-2. Ensure `results.tsv` exists. If not, create it with this header line:
-   ```
-   commit	timestamp	correctness	geomean_ns	worst_ns	worst_combo	best_ns	best_combo	peak_mem_mb	total_seconds	keep	description
-   ```
-3. Run the baseline trial: `cargo run --release --bin run_experiment > run.log 2>&1`.
-   Confirm `correctness: pass`. Parse `run.log` and append a row to `results.tsv`
-   with `keep=baseline` and `description="seeded scalar PQ-L2 baseline"`. This
-   is your reference number.
-4. Commit the baseline row with a one-line message like `baseline: <numbers>`.
-
-## What you CAN do
-
-- Modify **`src/kernels.rs`** freely. You may:
-  - Pre-process the codebook in `PqKernel::new` (transpose layouts, cache
-    `c·c` for the FMA trick, pack the codebook for register-resident lookup,
-    etc.). This cost is paid once per dataset and amortized across queries —
-    the bench measures per-query, not per-(build + query).
-  - Reorder loops, switch internal data layouts, drop down to `std::arch`
-    intrinsics under `#[cfg(target_arch = ...)]` gates. **Always keep a
-    portable scalar fallback** so the kernel compiles everywhere.
-  - Use `unsafe` if needed; document the invariants you're relying on.
-  - Mark hot functions `#[inline]`; add private helpers freely.
-  - Add `#[cfg(test)] mod tests { ... }` inside `src/kernels.rs` if you want
-    in-file property checks.
-
-## What you CANNOT do
-
-- Do **not** modify `src/lib.rs` (`PqShape` and the tolerance constants are
-  shared with the immutable scaffolding).
-- Do **not** modify `src/bin/run_experiment.rs`, `src/reference.rs`,
-  `src/inputs.rs`, `benches/pq_l2.rs`, or `Cargo.toml`.
-- Do **not** add new crate dependencies.
-- Do **not** alter the public API of `kernels::PqKernel`:
-  - `PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self`
-  - `PqKernel::shape(&self) -> &PqShape`
-  - `PqKernel::distance_table(&self, query: &[f32]) -> Vec<f32>`
-  - `PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>`
-- Do **not** introduce lossy techniques (LUT u8/u16 quantization, asymmetric-
-  distance approximation, etc.) — the correctness phase asserts exact-up-to-ε
-  match against the scalar reference. If you want to explore a lossy track,
-  surface that in a separate kernel and propose a track extension.
-
-## The metric
-
-Minimize `geomean_ns_per_query` (geometric mean of per-query wall-clock across
-all timed queries, all shapes, all distributions) subject to:
-
-1. Correctness phase: **pass** (exit-2 otherwise).
-2. `worst_ns_per_query` ≤ 1.05 × the last-kept kernel's worst.
-3. `total_seconds` ≤ 600.
-4. Build is clean: `cargo build --release` succeeds, `cargo clippy --release
-   --all-targets -- -D warnings` reports zero issues.
-
-Ties break toward simpler code. If two kernels report the same speed within
-~3% noise, prefer fewer lines / less `unsafe`.
-
-## Lance-PQ-specific priors (lossless directions)
-
-These directions are known to pay off without compromising arithmetic accuracy.
-Pick one hypothesis at a time; implement; measure; decide.
-
- **Codebook layout.** The reference layout is `[m][k][d]`. For a fixed query,
-  iterating over centroids stays in cache, but the inner loop over `d` is
-  short. Transposing to `[m][d][k]` makes `k` contiguous, so you can SIMD-load
-  8 centroids' components across `k` and broadcast the matching query
-  component for each `d`. Do the transpose once in `PqKernel::new`.
- **Cache `c·c`.** The diff–square–sum is `(q - c)·(q - c) = q·q - 2(q·c) + c·c`.
-  Hoist `q·q` per sub-vector, precompute `c·c` once at codebook-load time.
-  The inner loop over `d` then accumulates only the dot product `q·c`, and
-  each centroid finishes with one FMA (`-2(q·c) + c·c`). Watch the sign and
-  accumulator ordering so the rounding stays within tolerance.
- **Probe layout.** The probe is dominated by `acc += table[m][codes[off+m]]`
-  × `num_sub_vectors`. Transposing codes to `[m][i]` (one row per sub-quantizer,
-  contiguous over base index) lets you process up to 32+ vectors per inner
-  iteration with `vpgatherdps`-style gathers (the table entries are `f32`).
- **Top-K integration.** `push()` does a branch + heap sift on every code.
-  At 50k probes per query × 9 (shape × dist) combos that's the second-biggest
-  cost after the gather. Block the probe (e.g., 512 codes at a time), find the
-  local top-K with a branchless pass, then merge into the global heap.
- **Prefetch.** A `_mm_prefetch(codes.as_ptr().add(off + 64), _MM_HINT_T0)`
-  ahead of the gather is usually a pure win at 50k+ scale, where the codes
-  don't all fit in L2.
- **FMA chains for table build.** The diff–square–sum maps cleanly to FMA on
-  AVX2/NEON. Even without intrinsics, structuring the inner loop so `rustc`
-  emits FMA helps.
- **Avoid the `Vec` allocation in the hot path.** `distance_table` allocates a
-  fresh `Vec<f32>` per call. Returning a fixed-capacity buffer is a public-API
-  change you can't make — but you can reuse a thread-local scratch buffer
-  internally if it speeds the build.
-
-## The loop
-
-Once setup is done, repeat indefinitely:
-
-1. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas
-   have been tried, what won, what regressed. Form a hypothesis with one
-   sentence stating the change and the predicted effect on speed and
-   correctness.
-2. **Edit `src/kernels.rs`.** Keep the diff focused on the one hypothesis.
-3. **Build and lint.**
-   ```
-   cargo build --release
-   cargo clippy --release --all-targets -- -D warnings
-   ```
-   If either fails, fix and try again — do not commit broken state.
-4. **Run the trial.**
-   ```
-   cargo run --release --bin run_experiment > run.log 2>&1
-   ```
-5. **Parse the result.** Extract `correctness`, `geomean_ns_per_query`,
-   `worst_ns_per_query` (with combo), `peak_mem_mb`, `total_seconds`. Compute
-   deltas vs. baseline.
-6. **Decide keep or revert.**
-   - **Keep** iff: `correctness: pass`, geomean better than the last-kept
-     row by more than the ~1% noise band, and `worst_ns_per_query` ≤ 1.05 ×
-     last-kept's worst.
-   - **Revert** otherwise: `git restore src/kernels.rs` (or commit and
-     `git revert` if you want the revert in history). Note what failed.
-7. **Log.** Append one row to `results.tsv`:
-   ```
-   <short_sha>	<iso8601>	<correctness>	<geomean_ns>	<worst_ns>	<worst_combo>	<best_ns>	<best_combo>	<peak_mem>	<elapsed>	<keep|revert>	<one-line description>
-   ```
-8. **Commit.** One-line message describing the change and the headline number,
-   e.g. `transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)`.
-
-## Hygiene
-
- Always commit `src/kernels.rs` changes; never commit `results.tsv` or
-  `run.log` (they're gitignored).
- If a change fails to build, do not commit. Iterate until it builds, or
-  revert cleanly.
- If two consecutive ideas regress, take a beat: re-read the last ~10 rows of
-  `results.tsv` and update your mental model before proposing the next.
- Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min, kill it
-  and mark the trial as `timeout`.
-
-## Never stop
-
-Keep going until interrupted. Each loop iteration is one hypothesis, one edit,
-one measurement, one commit. No multi-step plans across iterations.