research: restructure lance-autoresearch as multi-target workspace

The original lance-autoresearch was one Cargo crate optimizing one Lance
kernel (PQ L2 distance). With 9+ candidate targets enumerated in the
research note, a single-crate shape doesn't scale: per-target deps will
collide, the agent's edits to one target's kernels.rs would conflict with
another's lib path, and build/test isolation is lost. Restructure into a
Cargo workspace.

Layout:

  research/lance-autoresearch/
  ├── Cargo.toml             (workspace root)
  ├── README.md              (target table, contract overview, repo layout)
  ├── HARNESS.md             (universal loop contract every target inherits)
  ├── crates/
  │   ├── harness-common/    (shared: SplitMix64, geomean, peak RSS,
  │   │                       MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS)
  │   └── pq-l2/             (the landed target; was the previous single crate)
  └── docs/
      ├── design.md          (rationale for workspace shape, no Target trait)
      ├── adding-a-target.md (step-by-step workflow for new targets)
      └── targets/pq-l2.md   (per-target capsule)

Decisions documented in docs/design.md:
- Workspace, not single crate: per-target Cargo.toml so deps don't
  collide; per-target src tree so agent edits don't conflict; per-target
  build/test isolation for faster agent iteration.
- harness-common as a plumbing-only crate (PRNG, geomean, peak RSS,
  tolerance constants, time budget). Intentionally NO Target trait -
  decode kernel signatures and distance kernel signatures differ enough
  that a unifying trait would either bloat or require erased boxing. Each
  target is its own natural shape.
- Per-target program.md + shared HARNESS.md: the loop contract is
  universal, the priors and API spec are per-target. Two files instead of
  one because copy-pasting the universal loop into every program.md would
  drift.
pq-l2 refactor:
- src/* moved into crates/pq-l2/src/* via git mv (preserves history)
- crate renamed lance-autoresearch -> pq-l2
- SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL,
  TIME_BUDGET_SECS now imported from harness-common (drops ~70 lines of
  duplication that would have been copy-pasted into every new target)
- program.md trimmed: setup/loop/hygiene moved to HARNESS.md; only the
  PQ-L2-specific API contract and SIMD priors remain
- Cargo.toml depends on harness-common via path; workspace.dependencies
  pins criterion uniformly across targets

The 10 candidate targets from the research note (A1 cosine/dot/hamming,
A2 IVF partition select, A3 FTS BM25, A4 bitpack decode, A5 dictionary
decode, A6 FSST decode, A7 take/gather, A8 predicate eval, A9 posting list
intersect, A10 top-K merge) are listed in README.md's target table as
"candidate"; each gets a docs/targets/<name>.md capsule when it's spun up.
docs/adding-a-target.md documents the cp -r + edit-Cargo.toml +
rewrite-three-files workflow.

Verified end-to-end:
- cargo build --release: clean, both crates compile
- cargo clippy --release --workspace --all-targets -- -D warnings: clean
- cargo test --release --workspace: 6/6 pass (4 harness-common + 2 pq-l2)
- cargo run --release --bin run_experiment -p pq-l2: correctness pass,
  geomean ~880k ns, exit 0, ~30s wall-clock
- omnigraph parent workspace unchanged (research/ excluded as before)

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
2026-07-03 02:51:04 +02:00 · 2026-05-15 00:15:02 +00:00 · 2026-05-15 00:15:02 +00:00 · 0d72cc69fb
commit 0d72cc69fb
parent 92ce8f1e7f
21 changed files with 1012 additions and 366 deletions
--- a/research/lance-autoresearch/Cargo.toml
+++ b/research/lance-autoresearch/Cargo.toml
@ -1,32 +1,14 @@
 # Empty `[workspace]` section so cargo treats this directory as its own
 # workspace root and does NOT walk up to the parent omnigraph workspace.
 # Without this, cargo from inside `research/lance-autoresearch/` will try to
 # resolve omnigraph's dependencies even though we're excluded as a member.
 [workspace]
 resolver = "2"
 members = [
    "crates/harness-common",
    "crates/pq-l2",
 ]
-[package]
-name = "lance-autoresearch"
-version = "0.1.0"
+# Each per-target crate sets its own deps. Shared deps below pin versions
+# uniformly across targets so the workspace lockfile stays clean.
+[workspace.dependencies]
 edition = "2024"
 license = "MIT OR Apache-2.0"
 description = "Autoresearch-style harness for evolving Lance PQ L2 distance kernels via LLM agents."
 publish = false
 [lib]
 path = "src/lib.rs"
 [[bin]]
 name = "run_experiment"
 path = "src/bin/run_experiment.rs"
 [[bench]]
 name = "pq_l2"
 harness = false
 [dependencies]
 anyhow = "1"
 [dev-dependencies]
 criterion = { version = "0.5", default-features = false, features = ["plotters", "cargo_bench_support"] }
 [profile.release]
--- a/research/lance-autoresearch/HARNESS.md
+++ b/research/lance-autoresearch/HARNESS.md
@ -0,0 +1,137 @@
 # HARNESS — shared loop contract for every lance-autoresearch target
 This document is the universal part of every target's agent instructions. Each
 target's `program.md` is a thin layer of *target-specific priors and API spec*
 on top of the conventions below. The agent reads `HARNESS.md` and the target's
 `program.md` at the start of every session.
 ## What this harness is
 A single agent (you) edits one file in one target crate to optimize a Lance
 kernel. Per trial, you build, run a binary that exercises the kernel against
 diverse inputs, parse a fixed-format output block, and decide keep-or-revert.
 This is a Karpathy-style autoresearch loop. It assumes:
 - Per-trial eval is **seconds-scale**. Long enough to measure, short enough to
  iterate hundreds of times in a session.
 - The kernel has a **deterministic correctness oracle**: a scalar reference
  whose output your kernel is compared against.
 - The optimization target is **dataset-independent**: the harness generates
  diverse inputs each trial, so wins generalize across distributions and
  shapes by construction.
 Targets that don't fit these constraints (index-build parameter tuning,
 plan-patching, anything where eval is minutes-to-hours) belong in the
 BauplanLabs tournament-loop shape, not this harness. See `docs/design.md` for
 the boundary.
 ## What's editable, per target
 | Path | Mutability | Why |
 |---|---|---|
 | `crates/<target>/src/kernels.rs` | **mutable** | Your playground. The whole point. |
 | `crates/<target>/src/reference.rs` | immutable | The oracle. Touching it makes wins meaningless. |
 | `crates/<target>/src/inputs.rs` | immutable | The fixture generator. Touching it makes timings incomparable across trials. |
 | `crates/<target>/src/lib.rs` | immutable | Shared types pinned by the bench (`PqShape` etc.). |
 | `crates/<target>/src/bin/run_experiment.rs` | immutable | The trial harness. |
 | `crates/<target>/benches/*.rs` | immutable | Criterion bench, optional read-only reference. |
 | `crates/<target>/Cargo.toml` | immutable | Adding deps changes the optimization target. |
 | `crates/<target>/program.md` | human-iterated between runs | Not edited by you in-loop; the human refines it. |
 | `crates/<target>/results.tsv` | append-only | Your audit log. Gitignored. |
 | `crates/harness-common/**` | immutable | Workspace-shared infrastructure. |
 | `HARNESS.md` (this file) | immutable | Workspace-shared loop contract. |
 You may add `#[cfg(test)] mod tests { ... }` inside `kernels.rs` for in-file
 property checks. You may NOT add new crate dependencies. You may NOT use
 tricks that are sound only while an undocumented assumption holds (e.g., a
 fixture invariant that holds today but isn't documented).
 ## The metric
 Every target's `run_experiment` binary prints a fixed-format output block ending
 with these universal fields:
 - `correctness:` — `pass` or `fail`. Set by comparing your kernel against the
  scalar reference on every input the bench generates.
 - `geomean_ns_per_*:` — geometric mean of per-operation wall-clock across all
  timed operations.
 - `worst_ns_per_*:` — slowest combo's geomean.
 - `peak_mem_mb:` — process RSS high-water-mark.
 - `total_seconds:` — trial wall-clock.
 A kernel is **kept** iff:
 1. `correctness: pass` (any failure → `std::process::exit(2)`).
 2. `geomean_ns_per_*` strictly better than the previous best-kept kernel
   (allow ~1% noise band).
 3. `worst_ns_per_*` ≤ 1.05 × the previous best-kept kernel's worst.
 4. `total_seconds` ≤ 600 (the per-trial cap; exceed it → `std::process::exit(3)`).
 5. Build clean: `cargo build --release` and
   `cargo clippy --release --all-targets -- -D warnings` both succeed.
 Ties break toward simpler code: same speed within ~3% noise → fewer lines /
 less `unsafe` wins.
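
The keep criteria above can be sketched as a single predicate. This is a minimal illustration, not part of the harness: the `Trial` struct, its field names, and `should_keep` are hypothetical, and the "~1% noise band" is read here as "the candidate must beat the best-kept geomean by more than 1%", which is an assumption. The simpler-code tie-break is left to the reviewer and is not modeled.

```rust
/// Hypothetical summary of one trial's universal fields (names are illustrative).
#[derive(Debug, Clone, Copy)]
pub struct Trial {
    pub correctness_pass: bool,
    pub geomean_ns: u64,
    pub worst_ns: u64,
    pub total_seconds: f64,
    /// True when both `cargo build --release` and the clippy run succeeded.
    pub build_clean: bool,
}

/// Assumed reading of the noise band: require more than a 1% geomean improvement.
pub const NOISE_BAND: f64 = 0.01;
/// Worst-case guard from the criteria: at most 1.05x the best-kept worst.
pub const WORST_GUARD: f64 = 1.05;
/// Per-trial wall-clock cap in seconds.
pub const TIME_CAP_SECS: f64 = 600.0;

/// Returns true when `candidate` satisfies all five keep criteria against `best`.
pub fn should_keep(candidate: &Trial, best: &Trial) -> bool {
    let geomean_limit = best.geomean_ns as f64 * (1.0 - NOISE_BAND);
    let worst_limit = best.worst_ns as f64 * WORST_GUARD;
    candidate.correctness_pass
        && candidate.build_clean
        && candidate.total_seconds <= TIME_CAP_SECS
        && (candidate.geomean_ns as f64) < geomean_limit
        && (candidate.worst_ns as f64) <= worst_limit
}
```

With a best-kept row of 18234 ns geomean and 24515 ns worst, a candidate at 14100 ns geomean and 22000 ns worst would be kept under this reading, while one at 18200 ns geomean would fall inside the noise band and be reverted.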
 ## The loop
 After reading `HARNESS.md` and the target's `program.md`:
 1. **Setup (once per session).** Confirm `results.tsv` exists; if not, create
   it with a per-target header (the target's `program.md` defines the columns).
   Run the baseline trial:
   ```
   cargo run --release --bin run_experiment -p <target> > run.log 2>&1
   ```
   Append a row tagged `keep=baseline` and commit it.
 2. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas
   have been tried, what won, what regressed. Form one hypothesis with one
   sentence stating the change and the predicted effect on speed and
   correctness.
 3. **Edit `kernels.rs`.** Keep the diff focused on the one hypothesis.
 4. **Build and lint.**
   ```
   cargo build --release
   cargo clippy --release --all-targets -- -D warnings
   ```
   If either fails, fix and retry. Do not commit broken state.
 5. **Run the trial.**
   ```
   cargo run --release --bin run_experiment -p <target> > run.log 2>&1
   ```
 6. **Parse and decide.** Extract the universal fields plus any per-target
   fields. Compute deltas vs. the last-kept row. Apply the keep criteria above.
 7. **Log.** Append one row to `results.tsv` matching the target's header.
 8. **Commit.** One-line message describing the change and the headline number,
   e.g. `transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)`.
 9. **Hygiene.**
   - Always commit `kernels.rs` changes; never commit `results.tsv` or
     `run.log` (gitignored).
   - If a change fails to build, do not commit. Iterate or revert cleanly.
   - If two consecutive ideas regress, take a beat: re-read the last ~10 rows
     and update your mental model before proposing the next.
   - Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min,
     kill it and mark the trial as `timeout`.
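
Step 6 ("Parse and decide") can be sketched as follows. This is a rough illustration that assumes the output block consists of `key: value` lines, as in the pq-l2 sample output; the function names `parse_fields` and `first_number` are hypothetical and not part of any target crate.

```rust
use std::collections::HashMap;

/// Collects `key: value` lines from a run log into a map.
/// Lines without a colon, or whose label contains whitespace, are skipped.
pub fn parse_fields(log: &str) -> HashMap<String, String> {
    let mut fields = HashMap::new();
    for line in log.lines() {
        if let Some((key, value)) = line.split_once(':') {
            let key = key.trim();
            if key.is_empty() || key.contains(char::is_whitespace) {
                continue;
            }
            fields.insert(key.to_string(), value.trim().to_string());
        }
    }
    fields
}

/// Parses the first whitespace-separated token of a field as a number.
/// Returns `None` if the field is missing, empty, or not numeric.
pub fn first_number(fields: &HashMap<String, String>, key: &str) -> Option<f64> {
    fields.get(key)?.split_whitespace().next()?.parse().ok()
}
```

Taking only the first token means a line such as `worst_ns_per_query: 24515 ((768,96,256), sparse)` yields `24515.0`, while the trailing combo label stays available in the raw string for the `worst_combo` column.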
 ## Never stop
 Keep going until interrupted. Each loop iteration is one hypothesis, one edit,
 one measurement, one commit. No multi-step plans across iterations.
 ## Working across multiple targets
 Work on **one target per session**, even when the broader effort spans several
 targets. Don't edit `kernels.rs` in two crates between commits: your mental
 model is shared across targets, but the keep-decision is per-target. Pick a
 target, do a session there, commit, switch.
 The human is responsible for selecting which target to work on next. Don't
 proactively switch targets unless the user asks.
--- a/research/lance-autoresearch/README.md
+++ b/research/lance-autoresearch/README.md
@ -1,112 +1,143 @@
 # lance-autoresearch
-An autoresearch-style harness for evolving [Lance](https://github.com/lance-format/lance)
-PQ L2 distance kernels via LLM coding agents (Claude Code, Codex, Cursor).
-
+A multi-target workspace for evolving [Lance](https://github.com/lance-format/lance)
+hot-path kernels via LLM coding agents (Claude Code, Codex, Cursor),
+in the style of Andrej Karpathy's
 Modeled on Andrej Karpathy's
 [`nanochat-research`](https://x.com/karpathy/status/1855651423497650238)
-three-file contract:
- **Immutable bench** — `src/bin/run_experiment.rs` + `src/inputs.rs` +
-  `src/reference.rs`. The agent cannot touch these.
- **Mutable kernel** — `src/kernels.rs`. The agent's playground. Starts as a
-  scalar baseline matching Lance's PQ L2 algorithm shape; the agent's job is to
-  beat it.
- **Human-iterated program** — `program.md`. The "skill" the agent reads at
-  the start of every session. The human refines it between runs.
+single-agent autoresearch loop.
+Each target is an independent Rust crate under `crates/`:
+
+| Target | Status | Lance source area | What's optimized |
+|---|---|---|---|
+| [`crates/pq-l2`](crates/pq-l2) | landed | `lance-linalg::distance::l2`, PQ probe | PQ L2 distance: build LUT, probe codes, top-K |
+| `crates/pq-cosine`     | candidate (A1) | `lance-linalg::distance::cosine` | PQ cosine distance |
+| `crates/pq-dot`        | candidate (A1) | `lance-linalg::distance::dot` | PQ dot-product distance |
 | `crates/ivf-partition` | candidate (A2) | `lance-index::vector::ivf` partition select | IVF partition selection (centroid scan) |
 | `crates/fts-bm25`      | candidate (A3) | `lance-index::scalar::inverted` BM25 | FTS BM25 scoring inner loop |
 | `crates/bitpack`       | candidate (A4) | `lance-encoding::encodings::bitpack` | Bitpack integer decode |
 | `crates/dictionary`    | candidate (A5) | `lance-encoding::encodings::dictionary` | Dictionary decode |
 | `crates/fsst`          | candidate (A6) | `lance-encoding::encodings::fsst` | FSST string decode |
 | `crates/take`          | candidate (A7) | `lance-core::utils::take` | Take / gather kernel |
 | `crates/predicate`     | candidate (A8) | `lance-datafusion` filter eval | Predicate evaluation kernels |
 | `crates/posting-intersect` | candidate (A9) | `lance-index::scalar::inverted` | Posting list intersection (FTS AND) |
 | `crates/topk-merge`    | candidate (A10) | scan-merge | Top-K k-way merge |
 Each candidate target gets a capsule in [`docs/targets/`](docs/targets/) when it
 is spun up by following [`docs/adding-a-target.md`](docs/adding-a-target.md). The
 single landed target (`pq-l2`) proves the harness shape; the candidates wait
 for an agent to spin them up.
 ## The contract every target follows
 Karpathy's three-file shape, applied per target:
 | File (per target crate) | Mutability | Edited by |
 |---|---|---|
 | `src/kernels.rs` | **mutable** | the agent |
 | `src/reference.rs`, `src/inputs.rs`, `src/lib.rs`, `src/bin/run_experiment.rs`, `benches/*.rs` | immutable | — |
 | `program.md` | human-iterated | the human, between runs |
 | `results.tsv` | append-only | the agent, per trial (gitignored) |
 The shared utilities — deterministic PRNG, geomean, peak-RSS readback,
 tolerance constants, time-budget — live in [`crates/harness-common`](crates/harness-common/src/lib.rs)
 and are consumed by every target. There is intentionally **no `Target` trait**:
 decode-kernel signatures and distance-kernel signatures are different enough
 that a unifying trait would either bloat or require erased boxing. Each target
 is its own natural shape; the shared crate is plumbing only.
 The shared loop conventions every target's `program.md` inherits live in
 [`HARNESS.md`](HARNESS.md). Per-target priors and API specifics live in each
 target's own `program.md`.
 ## Dataset-independent by design
 Every other ANN benchmark you've seen is "compete on this fixed dataset"
-(SIFT1M, GIST1M, DEEP1B). That conflates two things: *kernel correctness*
-(the math) and *kernel speed under one specific data distribution*. An LLM
-agent given recall@K as the oracle has incentive to overfit to the dataset's
+(SIFT1M, GIST1M, DEEP1B). That conflates two things: *kernel correctness* (the
+math) and *kernel speed under one specific data distribution*. An LLM agent
+given recall@K as the oracle has incentive to overfit to the dataset's quirks.
 quirks.
-We split them:
+We split them, every target:
- **Correctness** = bit-equivalent (`max_abs_err ≤ 1e-4`) match to a scalar
-  reference kernel, on diverse generated inputs (Gaussian, uniform, sparse,
-  large-dynamic-range, mostly-zero) × multiple PQ shapes. This is mathematical
-  equivalence; there's no dataset to overfit. Lossy techniques fail this gate.
- **Speed** = geomean ns/query across multiple PQ shapes ×
-  multiple data distributions. A kernel that wins on one distribution and
-  regresses on another fails the worst-case guard.
+- **Correctness** = bit-equivalent (`max_abs_err ≤ 1e-4` for floats; bitwise for
+  integer/byte kernels) match to a scalar reference, on diverse generated
+  inputs. Mathematical equivalence; no dataset to overfit. Lossy techniques fail
+  this gate.
+- **Speed** = geomean ns/operation across multiple shape × distribution
+  combinations, with worst-case guard. A kernel that wins on one distribution
+  and regresses on another fails to keep.
 By construction, an "improvement" generalizes across distributions and shapes.
-There is no `wget sift.tar.gz` step; the harness is fully self-contained.
+There is no `wget sift.tar.gz` step; every target is fully self-contained.
-## Why a separate repo
+## Why a separate repo (and a workspace, not a single crate)
 OmniGraph (the graph engine that motivated this) pins Lance at a released
-version and consumes its kernels via the public crate API. Improvements live one
-layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps the
-optimization target pure (only the kernel changes), keeps the license clean for
-upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and
-keeps the agent's working set tiny.
+version and consumes its kernels via the public crate API. Improvements live
+one layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps
+the optimization target pure (only the kernel changes), keeps the license clean
+for upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and
+keeps each agent's working set tiny.
 **Workspace not single-crate** because per-target deps differ — FSST decode
 will want a different dependency set than PQ kernels — and the agent's edits
 to one target's `kernels.rs` must not collide with another's lib path. Each
 target is buildable, testable, and runnable in isolation: `cd crates/<target>
 && cargo run --release --bin run_experiment`.
 ## Quick start
 ```bash
-cargo run --release --bin run_experiment
+# Run the landed PQ L2 target's baseline.
 cargo run --release --bin run_experiment -p pq-l2
-# Or run with Claude Code / Codex:
-#    Open the repo in your agent of choice and prompt:
-#       Hi, have a look at program.md and let's kick off a new experiment.
+# Or with Claude Code / Codex, working on one target:
+cd crates/pq-l2
+# Open in your agent of choice and prompt:
 #   Hi, have a look at program.md and let's kick off a new experiment.
 # Add a new target (see docs/adding-a-target.md):
 cp -r crates/pq-l2 crates/pq-cosine
 # ... edit Cargo.toml name, kernels.rs / reference.rs / inputs.rs / program.md
 ```
-## File ownership
+## Repo layout
 | File | Mutability | Edited by |
 |---|---|---|
 | `src/kernels.rs` | **mutable** | the agent |
 | `src/bin/run_experiment.rs` | immutable | — |
 | `src/reference.rs` | immutable | — |
 | `src/inputs.rs` | immutable | — |
 | `src/lib.rs` | immutable (shared types) | — |
 | `benches/pq_l2.rs` | immutable | — |
 | `program.md` | human-iterated | the human, between runs |
 | `results.tsv` | append-only | the agent, per trial (gitignored) |
 ## The metric
 `run_experiment` runs two phases per trial: a correctness check and a
 multi-shape × multi-distribution speed measurement. Output looks like:
 ```
-correctness:           pass
---
-correctness:           pass
-shapes_tested:         (128,16,256) (256,16,256) (768,96,256)
-distributions_tested:  clustered uniform sparse
-geomean_ns_per_query:  18234
-worst_ns_per_query:    24515 ((768,96,256), sparse)
-best_ns_per_query:     12876 ((128,16,256), clustered)
-per_combo_geomean_ns:
-  (128,16,256) clustered  -> 12876 ns
-  (128,16,256) uniform    -> 13441 ns
-  ...
-peak_mem_mb:           28.4
-total_seconds:         12.3
+lance-autoresearch/
+├── Cargo.toml                         # workspace root
+├── README.md                          # you are here
+├── HARNESS.md                         # shared loop contract every target inherits
+├── LICENSE-MIT, LICENSE-APACHE        # dual-licensed (Apache compat for Lance PRs)
+├── crates/
+│   ├── harness-common/                # shared: SplitMix64, geomean, peak RSS, tolerance, time budget
+│   │   └── src/{lib,prng,stats,sysinfo,tolerance}.rs
+│   └── pq-l2/                         # landed target
+│       ├── Cargo.toml
+│       ├── program.md                 # this target's agent skill
+│       ├── src/
+│       │   ├── lib.rs                 # PqShape + module wiring (immutable)
+│       │   ├── kernels.rs             # MUTABLE — agent's playground
 │       │   ├── reference.rs           # IMMUTABLE — scalar reference, oracle helpers
 │       │   ├── inputs.rs              # IMMUTABLE — diverse test-data generators
 │       │   └── bin/run_experiment.rs  # IMMUTABLE — per-trial entry point
 │       └── benches/pq_l2.rs           # criterion benchmark (immutable)
 └── docs/
    ├── design.md                      # rationale for the workspace shape
    ├── adding-a-target.md             # workflow for spinning up a new target
    └── targets/
        └── pq-l2.md                   # capsule: upstream Lance pointers, oracle, status
 ```
 A kernel is "kept" iff:
 - Correctness phase passes (mathematical equivalence to scalar reference)
 - `geomean_ns_per_query` strictly better than the previous best-kept kernel
 - `worst_ns_per_query` ≤ 1.05 × the previous best-kept kernel's worst
 - `total_seconds` ≤ 600
 See `program.md` for the full loop spec.
 ## Upstream contribution path
-When a commit clears the keep bar by a meaningful margin (≥10% geomean
-speedup with worst-case guard intact), the human reviews the diff, ports the
-technique against [`lance-format/lance`](https://github.com/lance-format/lance)
-HEAD, runs Lance's own test suite, and opens a PR. Because `src/kernels.rs` is
-dual MIT/Apache-2.0 licensed and algorithmically modeled on Lance's existing
-path, the upstream PR inherits Apache-2.0 cleanly.
+When a commit on any target clears the keep bar by a meaningful margin
+(≥10% geomean speedup with worst-case guard intact), the human reviews the
+diff, ports the technique against
+[`lance-format/lance`](https://github.com/lance-format/lance) HEAD, runs
+Lance's own test suite, and opens a PR. Because the workspace is dual
+MIT/Apache-2.0 licensed and each target's kernel is algorithmically modeled on
 Lance's existing path, the upstream PR inherits Apache-2.0 cleanly.
 ## License
--- a/research/lance-autoresearch/crates/harness-common/Cargo.toml
+++ b/research/lance-autoresearch/crates/harness-common/Cargo.toml
@ -0,0 +1,10 @@
 [package]
 name = "harness-common"
 version = "0.1.0"
 edition = "2024"
 license = "MIT OR Apache-2.0"
 description = "Shared utilities for lance-autoresearch per-target harnesses (PRNG, geomean, peak RSS, tolerance constants, time budget)."
 publish = false
 [lib]
 path = "src/lib.rs"
--- a/research/lance-autoresearch/crates/harness-common/src/lib.rs
+++ b/research/lance-autoresearch/crates/harness-common/src/lib.rs
@ -0,0 +1,36 @@
 //! Shared utilities for lance-autoresearch per-target harnesses.
 //!
 //! Each target crate (`pq-l2`, future `pq-cosine`, `bitpack-decode`, etc.)
 //! defines its own `kernels.rs` (mutable, the agent's playground), `reference.rs`
 //! (immutable scalar reference), `inputs.rs` (immutable test-data generators),
 //! and `bin/run_experiment.rs` (immutable per-trial entry point). They all need
 //! the same handful of building blocks: a deterministic PRNG, a geomean
 //! aggregator, peak-RSS readback, tolerance constants for the bit-exact oracle,
 //! and a single shared time-budget constant. That's everything in this crate.
 //!
 //! What is **not** here, and intentionally not abstracted:
 //!
 //! - A `Target` trait. Decode kernels (`bitpack`, `dictionary`, `FSST`) have
 //!   very different signatures than distance kernels (`PqKernel::probe_top_k`),
 //!   and forcing them into one trait shape would either bloat the trait or
 //!   require erased boxing. Keep each target's API natural to its kernel.
 //!
 //! - Output-format orchestration. Each target's `run_experiment.rs` prints its
 //!   own fixed-format result block — different targets report different
 //!   per-combo dimensions (PQ shapes vs bit widths vs distribution kinds vs ...).
 //!   Sharing the format would make the per-target binaries less readable and
 //!   gain very little — `println!` is cheap.
 pub mod prng;
 pub mod stats;
 pub mod sysinfo;
 pub mod tolerance;
 pub use prng::SplitMix64;
 pub use stats::geomean;
 pub use sysinfo::peak_rss_mb;
 pub use tolerance::{MAX_ABS_ERR, TOPK_DIST_TOL};
 /// Per-trial wall-clock cap. Targets should `std::process::exit(3)` if exceeded
 /// so the agent's loop logs the trial as a timeout instead of a measurement.
 pub const TIME_BUDGET_SECS: u64 = 600;
--- a/research/lance-autoresearch/crates/harness-common/src/prng.rs
+++ b/research/lance-autoresearch/crates/harness-common/src/prng.rs
@ -0,0 +1,52 @@
 //! Deterministic SplitMix64 PRNG. Same seed produces the same sequence on
 //! every machine; no platform-specific RNG / no `rand` crate. Reproducibility
 //! across trials is the whole point.
 pub struct SplitMix64 {
    state: u64,
 }
 impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }
    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
    /// Uniform in `[0, 1)` with 24 bits of mantissa precision.
    pub fn next_f32(&mut self) -> f32 {
        let bits = (self.next_u64() >> 40) as u32;
        bits as f32 / ((1u32 << 24) as f32)
    }
    /// Standard normal via Box–Muller. Cheap and sufficient for fixture
    /// generation; not cryptographically anything.
    pub fn next_normal(&mut self) -> f32 {
        let mut u1 = self.next_f32();
        if u1 < 1e-7 {
            u1 = 1e-7;
        }
        let u2 = self.next_f32();
        (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos()
    }
 }
 #[cfg(test)]
 mod tests {
    use super::*;
    #[test]
    fn deterministic_across_calls() {
        let mut a = SplitMix64::new(0x1234_5678);
        let mut b = SplitMix64::new(0x1234_5678);
        for _ in 0..1000 {
            assert_eq!(a.next_u64(), b.next_u64());
        }
    }
 }
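
A short sketch of how a target's `inputs.rs` might lean on this PRNG for reproducible fixtures. The generator is restated from the diff above so the example compiles on its own; `uniform_fixture` is a hypothetical helper and not part of `harness-common`.

```rust
/// SplitMix64, restated from `prng.rs` so this sketch is self-contained.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform in `[0, 1)`: the top 24 bits scaled by 2^-24.
    pub fn next_f32(&mut self) -> f32 {
        let bits = (self.next_u64() >> 40) as u32;
        bits as f32 / ((1u32 << 24) as f32)
    }
}

/// Hypothetical fixture helper: `len` uniform values derived only from `seed`.
pub fn uniform_fixture(seed: u64, len: usize) -> Vec<f32> {
    let mut rng = SplitMix64::new(seed);
    (0..len).map(|_| rng.next_f32()).collect()
}
```

Because the fixture depends only on the seed, two trials that call `uniform_fixture` with the same seed time their kernels on identical data, which is what keeps rows in `results.tsv` comparable.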
--- a/research/lance-autoresearch/crates/harness-common/src/stats.rs
+++ b/research/lance-autoresearch/crates/harness-common/src/stats.rs
@ -0,0 +1,36 @@
 //! Geometric mean of u64 timings. Robust to outliers; the right aggregator for
 //! latency distributions because halving one query and doubling another cancels.
 pub fn geomean(xs: &[u64]) -> u64 {
    if xs.is_empty() {
        return 0;
    }
    let mut sum_ln = 0.0f64;
    for &x in xs {
        sum_ln += (x.max(1) as f64).ln();
    }
    (sum_ln / xs.len() as f64).exp() as u64
 }
 #[cfg(test)]
 mod tests {
    use super::*;
    #[test]
    fn empty_yields_zero() {
        assert_eq!(geomean(&[]), 0);
    }
    #[test]
    fn single_value_round_trips() {
        assert_eq!(geomean(&[100]), 100);
    }
    #[test]
    fn geomean_is_below_arithmetic_mean() {
        let xs = [1, 10, 100, 1000];
        let g = geomean(&xs);
        let am: u64 = xs.iter().sum::<u64>() / xs.len() as u64;
        assert!(g < am);
    }
 }
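
The claim that "halving one query and doubling another cancels" can be checked with a small worked example. The sketch below uses a hypothetical `geomean_f64` that returns `f64` (unlike the crate's `u64`-returning `geomean`) so the comparison is not affected by truncation; `arithmetic_mean` is included only for contrast.

```rust
/// Geometric mean as `f64`; zero timings are clamped to 1 as in `stats.rs`.
pub fn geomean_f64(xs: &[u64]) -> f64 {
    if xs.is_empty() {
        return 0.0;
    }
    let sum_ln: f64 = xs.iter().map(|&x| (x.max(1) as f64).ln()).sum();
    (sum_ln / xs.len() as f64).exp()
}

/// Plain arithmetic mean, for comparison with the geometric mean.
pub fn arithmetic_mean(xs: &[u64]) -> f64 {
    if xs.is_empty() {
        return 0.0;
    }
    xs.iter().sum::<u64>() as f64 / xs.len() as f64
}
```

For timings `[100, 400]`, halving the first and doubling the second gives `[50, 800]`. The geometric mean stays at 200 in both cases because the product of the timings is unchanged, while the arithmetic mean moves from 250 to 425 and would report a regression.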
--- a/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs
+++ b/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs
@ -0,0 +1,24 @@
 //! Peak resident-set-size readback (Linux only; non-Linux returns 0).
 #[cfg(target_os = "linux")]
 pub fn peak_rss_mb() -> f64 {
    let Ok(s) = std::fs::read_to_string("/proc/self/status") else {
        return 0.0;
    };
    for line in s.lines() {
         // `VmHWM` is the peak resident set size; `VmPeak` is peak virtual memory.
         if let Some(rest) = line.strip_prefix("VmHWM:") {
            let kb: f64 = rest
                .split_whitespace()
                .next()
                .and_then(|t| t.parse().ok())
                .unwrap_or(0.0);
            return kb / 1024.0;
        }
    }
    0.0
 }
 #[cfg(not(target_os = "linux"))]
 pub fn peak_rss_mb() -> f64 {
    0.0
 }
--- a/research/lance-autoresearch/crates/harness-common/src/tolerance.rs
+++ b/research/lance-autoresearch/crates/harness-common/src/tolerance.rs
@ -0,0 +1,15 @@
 //! Default tolerance constants for bit-exact correctness oracles.
 //!
 //! These suit float-arithmetic kernels (PQ distance, BM25 scoring, vector
 //! normalization) where SIMD-accumulator reordering is legal but real bugs
 //! shift values by orders of magnitude. Targets that operate on integer or
 //! byte-exact data (bitpack decode, dictionary decode, FSST decode) should
 //! assert strict bitwise equality and not use these constants.
 /// Maximum permitted absolute element error between agent kernel output and
 /// scalar reference output, for float kernels.
 pub const MAX_ABS_ERR: f32 = 1e-4;
 /// Maximum permitted distance error when comparing top-K results between
 /// agent kernel and scalar reference, for float kernels.
 pub const TOPK_DIST_TOL: f32 = 1e-4;
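
A minimal sketch of how a target might apply these constants, under the assumption that a float oracle compares element-wise absolute error and a byte oracle compares exactly. The helper names (`max_abs_err`, `float_outputs_match`, `bytes_match`) are hypothetical; the constant is restated so the example stands alone.

```rust
/// Restated from `tolerance.rs` so the sketch is self-contained.
pub const MAX_ABS_ERR: f32 = 1e-4;

/// Largest absolute element-wise difference; a NaN difference counts as infinite
/// so that a kernel emitting NaN can never pass the gate.
pub fn max_abs_err(got: &[f32], want: &[f32]) -> f32 {
    assert_eq!(got.len(), want.len(), "output length mismatch");
    let mut worst = 0.0f32;
    for (g, w) in got.iter().zip(want) {
        let diff = (g - w).abs();
        if diff.is_nan() {
            return f32::INFINITY;
        }
        worst = worst.max(diff);
    }
    worst
}

/// Float kernels: pass when every element is within `MAX_ABS_ERR`.
pub fn float_outputs_match(got: &[f32], want: &[f32]) -> bool {
    max_abs_err(got, want) <= MAX_ABS_ERR
}

/// Integer or byte kernels: strict equality, no tolerance.
pub fn bytes_match(got: &[u8], want: &[u8]) -> bool {
    got == want
}
```

The explicit NaN check matters because `f32::max` ignores NaN operands, so a fold over differences would otherwise drop exactly the values that most need to fail.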
--- a/research/lance-autoresearch/crates/pq-l2/Cargo.toml
+++ b/research/lance-autoresearch/crates/pq-l2/Cargo.toml
@ -0,0 +1,24 @@
 [package]
 name = "pq-l2"
 version = "0.1.0"
 edition = "2024"
 license = "MIT OR Apache-2.0"
 description = "Autoresearch target: Lance PQ L2 distance kernel optimization."
 publish = false
 [lib]
 path = "src/lib.rs"
 [[bin]]
 name = "run_experiment"
 path = "src/bin/run_experiment.rs"
 [[bench]]
 name = "pq_l2"
 harness = false
 [dependencies]
 harness-common = { path = "../harness-common" }
 [dev-dependencies]
 criterion = { workspace = true }
--- a/research/lance-autoresearch/crates/pq-l2/benches/pq_l2.rs
+++ b/research/lance-autoresearch/crates/pq-l2/benches/pq_l2.rs
@ -7,8 +7,8 @@ use std::hint::black_box;
 use criterion::{Criterion, criterion_group, criterion_main};
-use lance_autoresearch::inputs::{SHAPES, SPEED_TOP_K, speed_workloads};
+use pq_l2::inputs::{SHAPES, SPEED_TOP_K, speed_workloads};
-use lance_autoresearch::kernels::PqKernel;
+use pq_l2::kernels::PqKernel;
 fn bench_pq_l2(c: &mut Criterion) {
    let workloads = speed_workloads(0xBE3C_C0DE_F1AC_BABE);
--- a/research/lance-autoresearch/crates/pq-l2/program.md
+++ b/research/lance-autoresearch/crates/pq-l2/program.md
@ -0,0 +1,98 @@
 # Target: PQ L2 — agent instructions
 This is the per-target overlay on top of [`../../HARNESS.md`](../../HARNESS.md).
 Read **HARNESS.md first** for the universal loop contract (what's editable,
 the metric, the loop, hygiene, never stop). This file adds the PQ-L2-specific
 API spec and priors.
 ## Setup (once per session)
 1. Read in this order:
   - `../../HARNESS.md`
   - `../../README.md`
   - `program.md` (this file)
   - `src/lib.rs`
   - `src/kernels.rs` *(the only file you may edit)*
   - `src/reference.rs`
   - `src/inputs.rs`
   - `src/bin/run_experiment.rs`
 2. Ensure `results.tsv` exists. If not, create it with this header:
   ```
   commit	timestamp	correctness	geomean_ns	worst_ns	worst_combo	best_ns	best_combo	peak_mem_mb	total_seconds	keep	description
   ```
 3. Baseline trial:
   ```
   cargo run --release --bin run_experiment > run.log 2>&1
   ```
   Append a row tagged `keep=baseline`, commit it.
 ## Public API contract (must remain stable)
 The bench imports these from `crate::kernels`. You may NOT change their
 signatures. You MAY add private helpers, internal data layouts, `unsafe`
 blocks, `std::arch` intrinsics under `#[cfg(target_arch = ...)]` gates,
 pre-computed state inside `PqKernel`, etc.
 ```rust
 pub struct PqKernel { /* agent's private fields */ }
 impl PqKernel {
    pub fn new(shape: PqShape, codebook: &[f32]) -> Self;
    pub fn shape(&self) -> &PqShape;
    pub fn distance_table(&self, query: &[f32]) -> Vec<f32>;
    pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>;
 }
 ```
 Pre-processing in `new` is free — the bench measures `distance_table +
 probe_top_k` per query, not per (build + query). Codebook transposes,
 cached `c·c`, packed LUTs, etc., should live in `new`.
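As a minimal sketch of the kind of one-time work that can live in `new`, the helper below transposes a flat codebook from `[m][k][d]` to `[m][d][k]`. The name `transpose_codebook` and its parameters are hypothetical and not part of the `pq-l2` API; the flat layouts are assumptions based on the layout names used in this file.

```rust
/// Hypothetical helper (not part of the pq-l2 API): transpose a flat codebook
/// from `[m][k][d]` (centroid-major) to `[m][d][k]` (dimension-major).
///
/// `m` = number of sub-vectors, `k` = centroids per sub-vector,
/// `d` = dimensions per sub-vector.
pub fn transpose_codebook(codebook: &[f32], m: usize, k: usize, d: usize) -> Vec<f32> {
    assert_eq!(codebook.len(), m * k * d, "codebook length must be m * k * d");
    let mut out = vec![0.0f32; codebook.len()];
    for mi in 0..m {
        let base = mi * k * d;
        for ki in 0..k {
            for di in 0..d {
                // Source index is [mi][ki][di]; destination index is [mi][di][ki].
                out[base + di * k + ki] = codebook[base + ki * d + di];
            }
        }
    }
    out
}
```

Because the bench times only the per-query calls, a transform like this adds no measured cost as long as it runs once inside `PqKernel::new`.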
 ## What you can / cannot do
 (See HARNESS.md for the universal table; these are the PQ-L2-specific
 additions.)
 - **Cannot** change `PqShape` or the constants in `lib.rs`. They define
  the optimization target.
 - **Cannot** introduce lossy techniques (LUT u8/u16 quantization, asymmetric
  approximation, anything that drops bits relative to the scalar reference).
  The correctness phase asserts `max_abs_err ≤ 1e-4` against the scalar
  reference; lossy techniques fail this gate. If you want to explore a lossy
  track, propose it to the human as a separate kernel surface.
 - **Can** mark hot functions `#[inline]`, split them, add private helpers.
 - **Can** add `#[cfg(test)] mod tests { ... }` inside `kernels.rs` for in-file
  property checks against the scalar path.
 ## Lance-PQ-specific priors
 These are the directions that pay off on this kernel shape without
 compromising arithmetic accuracy. Pick one hypothesis per trial; don't try
 to combine multiple ideas at once.
 - **Codebook layout transpose.** The reference layout is `[m][k][d]`.
  Transposing to `[m][d][k]` makes `k` the contiguous axis, so you can
  SIMD-load 8 centroids' values for one dimension `d` and broadcast the
  matching query element across the lanes. Do the transpose once in
  `PqKernel::new`.
 - **Cache `c·c` per centroid.** The diff–square–sum is
  `(q - c)·(q - c) = q·q - 2 q·c + c·c`. Hoist `q·q` per sub-vector,
  precompute `c·c` once at `new()` time, and store it next to the codebook.
  The per-centroid combine then becomes one FMA (`-2 q·c + c·c`). Watch sign /
  accumulator ordering so rounding stays within `MAX_ABS_ERR`.
 - **Probe-side code transpose.** Probe is dominated by
  `acc += table[m][codes[off+m]]` × `num_sub_vectors`. Transposing codes to
  `[m][i]` (one row per sub-quantizer, contiguous over base index) lets you
  process 32+ vectors per inner iteration with `vgatherdps`-style loads
  (the table entries are `f32`).
 - **Top-K block-then-merge.** `push()` does a branch + heap sift on every
  code. At 20k probes per query × 9 (shape × dist) combos that's the
  second-biggest cost after the gather. Block the probe (e.g., 512 codes at
  a time), find the local top-K with a branchless pass, then merge into the
  global heap.
 - **Prefetch.** `_mm_prefetch(codes.as_ptr().add(off + 64), _MM_HINT_T0)`
  ahead of the gather is usually a pure win at 20k+ scale.
 - **FMA chains for table build.** The diff–square–sum maps cleanly to FMA
  on AVX2/NEON. Even without intrinsics, structuring the inner loop so
  `rustc` emits FMA helps.
 - **Avoid the `Vec` allocation in the hot path.** `distance_table` allocates
  a fresh `Vec<f32>` per call. The public API is fixed (returns `Vec<f32>`),
  but you can reuse a thread-local scratch buffer internally and copy to a
  `Vec` at the boundary if it speeds the build.
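The `c·c` prior above can be checked in isolation. The sketch below compares the direct diff-square-sum with the expanded form `q·q - 2 q·c + c·c`; the function names are hypothetical and the code is scalar on purpose, so it only illustrates the identity and not a tuned kernel.

```rust
/// Direct form: sum over j of (q[j] - c[j])^2.
pub fn l2_direct(q: &[f32], c: &[f32]) -> f32 {
    q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum()
}

/// Plain dot product.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Expanded form: q·q - 2 q·c + c·c.
///
/// `qq` would be hoisted once per query sub-vector and `cc` precomputed once
/// per centroid (assumption: both are cached by the caller).
pub fn l2_expanded(q: &[f32], c: &[f32], qq: f32, cc: f32) -> f32 {
    let qc = dot(q, c);
    // mul_add computes (-2.0 * qc) + cc with a single rounding step.
    (-2.0f32).mul_add(qc, cc) + qq
}
```

The two forms are algebraically equal but round differently. When `q` and `c` are large and nearly identical, the expanded form subtracts similar magnitudes and can lose precision, which is why the result should be checked against `MAX_ABS_ERR` on every distribution.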
--- a/research/lance-autoresearch/crates/pq-l2/src/bin/run_experiment.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/bin/run_experiment.rs
@ -35,18 +35,18 @@
 use std::time::Instant;
-use lance_autoresearch::inputs::{
+use harness_common::{MAX_ABS_ERR, TIME_BUDGET_SECS, TOPK_DIST_TOL, geomean, peak_rss_mb};
+use pq_l2::inputs::{
    DISTRIBUTIONS, DataDistribution, SHAPES, SpeedWorkload, correctness_battery, speed_workloads,
 };
-use lance_autoresearch::kernels::PqKernel;
+use pq_l2::kernels::PqKernel;
-use lance_autoresearch::reference::{ScalarReference, max_abs_err, topk_consistent};
+use pq_l2::reference::{ScalarReference, max_abs_err, topk_consistent};
-use lance_autoresearch::{MAX_ABS_ERR, PqShape, TOPK_DIST_TOL};
+use pq_l2::PqShape;
 // Any constants; the only requirement is that they're pinned across trials so
 // the inputs and the timings are reproducible.
 const CORRECTNESS_SEED: u64 = 0xC0FF_EEC0_DEBE_EFFE;
 const SPEED_SEED: u64 = 0x5EED_F1AC_BABE_FACE;
-const TIME_BUDGET_SECS: u64 = 600;
 fn main() {
    let start = Instant::now();
@ -210,17 +210,6 @@ fn run_speed(workloads: &[SpeedWorkload]) -> SpeedReport {
    }
 }
-fn geomean(xs: &[u64]) -> u64 {
-    if xs.is_empty() {
-        return 0;
-    }
-    let mut sum_ln = 0.0f64;
-    for &x in xs {
-        sum_ln += (x.max(1) as f64).ln();
-    }
-    (sum_ln / xs.len() as f64).exp() as u64
-}
 fn format_shape(s: &PqShape) -> String {
    format!("({},{},{})", s.dim, s.num_sub_vectors, s.num_centroids)
 }
@ -233,26 +222,3 @@ fn format_dist(d: &DataDistribution) -> String {
    }
    .to_string()
 }
-#[cfg(target_os = "linux")]
-fn peak_rss_mb() -> f64 {
-    let Ok(s) = std::fs::read_to_string("/proc/self/status") else {
-        return 0.0;
-    };
-    for line in s.lines() {
-        if let Some(rest) = line.strip_prefix("VmPeak:") {
-            let kb: f64 = rest
-                .split_whitespace()
-                .next()
-                .and_then(|t| t.parse().ok())
-                .unwrap_or(0.0);
-            return kb / 1024.0;
-        }
-    }
-    0.0
-}
-#[cfg(not(target_os = "linux"))]
-fn peak_rss_mb() -> f64 {
-    0.0
-}
--- a/research/lance-autoresearch/crates/pq-l2/src/inputs.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/inputs.rs
@ -16,6 +16,7 @@
 //! the codebook is shape-appropriate, not random.
 use crate::PqShape;
+use harness_common::SplitMix64;
 /// PQ shapes the bench evaluates. The agent's kernel must produce correct
 /// output and competitive speed on every one.
@ -295,36 +296,6 @@ fn encode(shape: PqShape, n: usize, base: &[f32], codebook: &[f32]) -> Vec<u8> {
    out
 }
-/// SplitMix64 — small, deterministic; bit-for-bit reproducible across machines.
-struct SplitMix64 {
-    state: u64,
-}
-impl SplitMix64 {
-    fn new(seed: u64) -> Self {
-        Self { state: seed }
-    }
-    fn next_u64(&mut self) -> u64 {
-        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
-        let mut z = self.state;
-        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
-        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
-        z ^ (z >> 31)
-    }
-    fn next_f32(&mut self) -> f32 {
-        let bits = (self.next_u64() >> 40) as u32;
-        bits as f32 / ((1u32 << 24) as f32)
-    }
-    fn next_normal(&mut self) -> f32 {
-        let mut u1 = self.next_f32();
-        if u1 < 1e-7 {
-            u1 = 1e-7;
-        }
-        let u2 = self.next_f32();
-        (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos()
-    }
-}
 fn shape_hash(s: PqShape) -> u64 {
    (s.dim as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
        ^ (s.num_sub_vectors as u64).wrapping_mul(0xBF58_476D_1CE4_E5B9)
--- a/research/lance-autoresearch/crates/pq-l2/src/kernels.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/kernels.rs
--- a/research/lance-autoresearch/crates/pq-l2/src/lib.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/lib.rs
@ -1,17 +1,20 @@
-//! Lance autoresearch harness — public API for the bench binary, benchmarks, and tests.
+//! Autoresearch target: Lance PQ L2 distance kernel optimization.
 //!
-//! Contract (Karpathy-style three files):
+//! Karpathy-style three-file contract:
 //!
 //! - `kernels` — the AGENT'S PLAYGROUND. Modify freely.
 //! - `reference` — IMMUTABLE. Scalar reference kernel. Defines the math.
 //! - `inputs` — IMMUTABLE. Diverse test-data + workload generators,
 //!   deterministic per fixed seed, varied across the input battery.
 //!
-//! The optimization target is dataset-independent: the agent's kernel must match
+//! The optimization target is dataset-independent: the agent's kernel must
-//! the scalar reference within `MAX_ABS_ERR` on every input the bench generates,
+//! match the scalar reference within `harness_common::MAX_ABS_ERR` on every
-//! and minimize geomean ns/query across multiple PQ shapes and data
+//! input the bench generates, and minimize geomean ns/query across multiple
-//! distributions. There is no fixed dataset; an "improvement" by construction
+//! PQ shapes and data distributions. There is no fixed dataset.
-//! generalizes across distributions and shapes.
+//!
+//! Shared utilities (deterministic PRNG, geomean, peak RSS, tolerance
+//! constants, time budget) come from the `harness-common` workspace crate.
+//! See `../../HARNESS.md` for the harness conventions every target follows.
 pub mod inputs;
 pub mod kernels;
@ -45,12 +48,3 @@ impl PqShape {
        self.num_sub_vectors * self.num_centroids * self.sub_vector_dim()
    }
 }
-/// Tolerance for the agent kernel's distance values vs. the scalar reference.
-/// Loose enough to permit legal SIMD-accumulator reordering; tight enough to
-/// catch real arithmetic bugs.
-pub const MAX_ABS_ERR: f32 = 1e-4;
-/// Tolerance for top-K *distances* (id sets are compared with tie-tolerance —
-/// see `reference::topk_consistent`).
-pub const TOPK_DIST_TOL: f32 = 1e-4;
--- a/research/lance-autoresearch/crates/pq-l2/src/reference.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/reference.rs
--- a/research/lance-autoresearch/docs/adding-a-target.md
+++ b/research/lance-autoresearch/docs/adding-a-target.md
@ -0,0 +1,192 @@
 # Adding a new target
 Walk through this when spinning up a new optimization target (A1 cosine, A4
 bitpack, etc.). It's a `cp -r` plus surgical edits — no architectural
 decisions to make per target if the kernel fits the autoresearch shape.
 If your target's per-trial eval is more than ~30 seconds, or the correctness
 oracle can't be a deterministic comparison against a scalar reference, this
 harness is the wrong fit — see [`design.md`](design.md) "When to revisit"
 for the boundary.
 ## Steps
 ### 1. Pick a template target
 Start from the closest existing target. For now there's just `pq-l2`, but as more land:
 - Distance / scoring kernels that take a query and return per-row scores →
  template off `pq-l2`.
 - Decode kernels that take encoded bytes and return an Arrow array →
  template off `bitpack` once it lands.
 - Scan / merge kernels → template off `topk-merge` once it lands.
 ```bash
 cp -r crates/pq-l2 crates/<my-target>
 ```
 ### 2. Rewrite `Cargo.toml`
 ```toml
 [package]
 name = "<my-target>"
 # version, edition, license, publish stay the same
 ```
 Add the target to the workspace `members` in the root `Cargo.toml`:
 ```toml
 [workspace]
 members = [
    "crates/harness-common",
    "crates/pq-l2",
    "crates/<my-target>",   # add this
 ]
 ```
 ### 3. Rewrite `src/lib.rs`
 Define the target's `Shape` type (analogue of `PqShape`) and any other types
 shared among `kernels.rs`, `reference.rs`, and `inputs.rs`. Document
 which fields are pinned by the harness vs. agent-tunable.
 This file is **immutable** to the agent. The shape parameters define the
 optimization target — changing them changes what's being optimized.
 ### 4. Rewrite `src/reference.rs`
 Implement the scalar reference kernel — the math, in plain Rust, no SIMD,
 no cleverness. This is what the agent's kernel is compared against. Mirror
 the public API of `kernels.rs` exactly.
 For float kernels, also export `max_abs_err(a, b)` and `topk_consistent(...)`
 (or analogues) — the comparison helpers the bench uses to assert
 near-bit-exact equivalence with `harness_common::MAX_ABS_ERR` /
 `TOPK_DIST_TOL`.
 For integer / byte kernels, the comparison is simpler — `assert_eq!` on the
 returned Arrow array. No tolerance constants needed.
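For float kernels, a comparison helper along these lines is enough. This is a sketch only: the name `max_abs_err_sketch` is hypothetical, and the real `max_abs_err` in `pq-l2`'s `reference.rs` may have a different signature or NaN policy.

```rust
/// Sketch of a float comparison helper: returns the largest element-wise
/// |a[i] - b[i]|. Assumption: both slices describe the same logical array.
pub fn max_abs_err_sketch(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "slices must have equal length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        // f32::max skips NaN operands, so map a NaN difference to infinity
        // to make it fail any finite tolerance.
        .map(|e| if e.is_nan() { f32::INFINITY } else { e })
        .fold(0.0f32, f32::max)
}
```

The bench would then assert that the returned value is at most `harness_common::MAX_ABS_ERR` for every correctness case.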
 ### 5. Rewrite `src/inputs.rs`
 Two surfaces:
 - `correctness_battery(seed) -> Vec<CorrectnessCase>` — diverse shape ×
  distribution combinations, sized small enough that the correctness phase
  finishes in seconds. The point is breadth, not realism.
 - `speed_workloads(seed) -> Vec<SpeedWorkload>` — larger shape × distribution
  combinations sized for stable timings. Aim for total trial wall-clock
  ≤ 60s: every trial pays this cost, so it bounds the agent's iteration rate.
 Use `harness_common::SplitMix64` for determinism. Same seed → same battery
 across trials.
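The determinism requirement can be illustrated with the generator itself. The `SplitMix64` below is a local copy of the code this commit moves into `harness-common`, repeated only so the sketch compiles on its own; `uniform_values` is a hypothetical generator, not an existing function in `inputs.rs`.

```rust
/// Local copy of the SplitMix64 generator that this commit moves into
/// `harness-common`, repeated here only so the sketch is self-contained.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform in [0, 1), built from the top 24 bits of the next output.
    pub fn next_f32(&mut self) -> f32 {
        let bits = (self.next_u64() >> 40) as u32;
        bits as f32 / ((1u32 << 24) as f32)
    }
}

/// Hypothetical input generator: `n` values uniform on [-1, 1).
pub fn uniform_values(seed: u64, n: usize) -> Vec<f32> {
    let mut rng = SplitMix64::new(seed);
    (0..n).map(|_| rng.next_f32() * 2.0 - 1.0).collect()
}
```

Calling `uniform_values` twice with the same seed yields identical vectors, which is the property the battery relies on to keep trials comparable.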
 ### 6. Rewrite `src/kernels.rs` (the agent's playground)
 Implement a clean scalar baseline matching the algorithm shape of the Lance
 upstream code. The header comment must:
 - Cite the upstream Lance source (`lance-format/lance` rev / file path) the
  algorithm is modeled on.
 - Document the public API the bench calls — these are the surfaces the agent
  may NOT change.
 - List "what you can do" / "what you cannot do" rules specific to this
  target.
 The starting kernel must be correct (passes the correctness phase against
 `reference.rs`) and lint-clean. The agent's job is to make it faster.
 ### 7. Rewrite `src/bin/run_experiment.rs`
 Two phases:
 - **Correctness phase:** for each `CorrectnessCase`, run agent kernel +
  reference, compare. Any mismatch → print `correctness: fail`, diagnostic
  line, exit 2.
 - **Speed phase:** for each `SpeedWorkload`, run agent kernel and time per
  query / per row / per byte. Aggregate geomean / worst / best across all
  combos. Print fixed-format result block.
 Universal output fields (every target) are listed in `HARNESS.md` "The
 metric." Add per-target fields above them as needed (e.g., `bit_widths_tested`
 for bitpack).
 Use:
 - `harness_common::geomean` for the aggregator
 - `harness_common::peak_rss_mb` for memory readback
 - `harness_common::TIME_BUDGET_SECS` for the time-budget check
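The two phases can be sketched as two small functions. `geomean` is a local copy of the helper this commit moves into `harness-common`; `correctness_exit_code` and `summarize_speed` are hypothetical names used for illustration, and the real `run_experiment.rs` is organized differently.

```rust
/// Local copy of the `geomean` that this commit moves into `harness-common`,
/// repeated here only so the sketch is self-contained.
pub fn geomean(xs: &[u64]) -> u64 {
    if xs.is_empty() {
        return 0;
    }
    let mut sum_ln = 0.0f64;
    for &x in xs {
        sum_ln += (x.max(1) as f64).ln();
    }
    (sum_ln / xs.len() as f64).exp() as u64
}

/// Hypothetical correctness gate: 0 if every case's max abs error is within
/// `tol`, otherwise 2 (the exit code this document uses for a failed gate).
pub fn correctness_exit_code(case_errors: &[f32], tol: f32) -> i32 {
    // `e <= tol` is false for NaN, so a NaN error also fails the gate.
    if case_errors.iter().all(|&e| e <= tol) {
        0
    } else {
        2
    }
}

/// Hypothetical speed summary: (geomean, worst, best) in ns over the
/// per-combo timings.
pub fn summarize_speed(per_combo_ns: &[u64]) -> (u64, u64, u64) {
    let worst = per_combo_ns.iter().copied().max().unwrap_or(0);
    let best = per_combo_ns.iter().copied().min().unwrap_or(0);
    (geomean(per_combo_ns), worst, best)
}
```

In a real binary the non-zero code would be passed to `std::process::exit` after printing `correctness: fail` and the diagnostic line, and the speed phase would run only when the gate returns 0.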
 ### 8. (Optional) Rewrite `benches/<my-target>.rs`
 Criterion benchmark with the same kernel calls as `run_experiment` but
 under criterion's statistical-sampling harness. Optional — the per-trial
 binary is the agent's primary measurement; criterion is for the human's
 deeper investigation.
 ### 9. Write `program.md`
 Per-target agent skill, layered on top of `HARNESS.md`. Sections:
 - **Setup** — which files to read at session start (always include
  `../../HARNESS.md`).
 - **Public API contract** — the exact functions / structs the agent must
  keep stable.
 - **Target-specific priors** — known SIMD techniques for this kernel shape,
  algorithmic transformations worth trying, common pitfalls. This is the
  highest-leverage content; spend time on it.
 - **`results.tsv` header** — the per-target column set.
 ### 10. Write the per-target capsule in `docs/targets/<my-target>.md`
 A short doc covering:
 - What's optimized (one sentence)
 - Upstream Lance source pointers (rev, file paths, function names)
 - Oracle definition (bit-exact / `max_abs_err`)
 - Speed workload shape (which shapes × distributions it spans)
 - Status (candidate / landed / has-results)
 ### 11. Verify end-to-end
 ```bash
 cargo build --release -p <my-target>
 cargo clippy --release -p <my-target> --all-targets -- -D warnings
 cargo run --release --bin run_experiment -p <my-target>
 ```
 The baseline trial must:
 - Print `correctness: pass`
 - Exit 0
 - Finish within ~60s
 - Report a sensible `geomean_ns_per_*` baseline number
 Smoke-test the gate: deliberately break `kernels.rs` (e.g., return constant
 zero), confirm the trial exits 2 with `correctness: fail`. Restore.
 ### 12. Update the target row in the top-level `README.md`
 In the targets table at the top of the README, change the new target's row
 from `candidate` to `landed`.
 ### 13. Commit
 One commit for the target's scaffolding. Don't bundle multiple targets in
 one commit — each target's history should be independently revertible.
 ## Common gotchas
 - **Removing the `[workspace]` table** from the root `Cargo.toml` makes cargo
  walk up to the omnigraph parent workspace. It is already in place; just
  don't remove it.
 - **Per-target `Cargo.toml` referencing the wrong `harness-common` path.**
  Use `harness-common = { path = "../harness-common" }`.
 - **Picking a `SHAPES` set that's too small.** Three shapes is the floor;
  with one shape an agent could specialize and pass, with two there's not
  enough variety. Ensure the shapes span at least one "outlier" (e.g., for
  PQ, one shape with `sub_vector_dim != 8`).
 - **Correctness battery too narrow.** Five distributions is the floor: at
  minimum Gaussian / uniform / sparse / large-dynamic-range / mostly-zero (or
  the integer analogue: uniform / clustered / skewed / few-distinct /
  monotonic).
 - **Trial time too long.** If the speed phase exceeds ~60s, the agent's
  iteration rate drops too low to be useful. Reduce workload sizes; the speed
  metric is per-operation, not per-workload, so absolute size doesn't change
  the comparison.
--- a/research/lance-autoresearch/docs/design.md
+++ b/research/lance-autoresearch/docs/design.md
@ -0,0 +1,152 @@
 # Design — why the workspace is shaped this way
 This document records the rationale for the multi-target workspace shape so
 future contributors don't relitigate the early decisions.
 ## The thing we're building
 A multi-target harness for LLM-driven optimization of Lance hot-path kernels.
 "Multi-target" because Lance has many such kernels — distance kernels in
 `lance-linalg`, decoders in `lance-encoding`, scan/merge kernels — and the
 right harness shape is identical across them: bit-exact correctness oracle,
 geomean-across-distributions speed metric, single-agent autoresearch loop.
 The original [research note](../../docs/research/llm-evolutionary-sampling.md)
 enumerates ten such candidates (A1–A10) clustered by Lance crate. The first
 landed (`pq-l2`) proves the harness shape; the rest follow the same template.
 ## Decision: workspace, not single crate
 A single crate exposing multiple binaries (`run_experiment_pq_l2`,
 `run_experiment_bitpack`, ...) was the obvious-looking alternative. Rejected
 for three reasons:
 1. **Per-target deps differ.** FSST decode wants different deps than PQ
   kernels (a string-compression library vs. just `f32` math). A single
   `Cargo.toml` would either bundle every target's deps into every build or
   require fine-grained features. Workspaces give per-target `Cargo.toml`
   for free.
 2. **Edit isolation.** The agent edits one target's `kernels.rs` at a time.
   In a single crate, `kernels.rs` files would collide on path or have to live
   in target-specific submodules with target-specific naming. Per-target
   crates put `src/kernels.rs` at the natural location every time and let the
   agent navigate one tree per session.
 3. **Build / test isolation.** `cargo build -p pq-l2` builds only what's
   needed for the PQ L2 target; `cargo test -p pq-l2` runs only its tests.
   The agent's iteration loop is faster because it doesn't pay for unrelated
   targets' compile time.
 The downside is a one-time cost: workspace boilerplate, a per-target
 `Cargo.toml`, and the `[workspace]` table at the workspace root that keeps
 cargo from walking up to the parent omnigraph workspace. The overhead of
 adding a new target is one `cp -r` plus path edits.
 ## Decision: shared `harness-common` crate, no `Target` trait
 A `Target` trait was the other obvious-looking alternative: express the
 common loop generically and plug in target-specific types. Rejected because:
 1. **Kernel signatures vary too much for a single trait shape.** PQ
   `probe_top_k` returns `Vec<(u32, f32)>`. Bitpack decode returns an
   `IntArray`. FSST decode returns `Vec<u8>`. Predicate evaluation returns a
   `BooleanArray`. A unifying trait would need erased boxing or a wide
   associated-type surface, both of which obscure the actual hot path the
   agent is editing.
 2. **The orchestration that *is* shared is small.** A deterministic PRNG
   (~30 lines), a geomean (~10 lines), peak RSS readback (~20 lines), two
   tolerance constants, and a time budget. Total ~70 lines of shared code.
   Building a trait abstraction over 70 lines costs more than it saves.
 3. **The output format isn't worth sharing.** Each target's
   `run_experiment.rs` prints a fixed-format result block; the *fields*
   differ per target (PQ shapes vs bit widths vs distribution kinds). A
   shared formatter would be either trivial wrapping of `println!` (no
   value) or a complicated builder API (negative value).
 `harness-common` therefore exposes plumbing only: `SplitMix64`, `geomean`,
 `peak_rss_mb`, `MAX_ABS_ERR`, `TOPK_DIST_TOL`, `TIME_BUDGET_SECS`. Each
 target consumes what it needs. The shared loop contract is documented in
 `HARNESS.md`, not encoded in code.
 ## Decision: per-target `program.md` + shared `HARNESS.md`
 The agent reads two files at session start:
 - `HARNESS.md` (workspace-level) — universal: the loop, the metric, the
  edit-permission table, hygiene rules.
 - `crates/<target>/program.md` (per-target) — specific: the kernel API the
  agent must keep stable, target-specific priors (which SIMD intrinsics tend
  to win on this kernel shape), the `results.tsv` column header.
 The shape mirrors how Karpathy's `nanochat-research` `program.md` works,
 factored across the dimension that varies (per target) vs. doesn't (the loop
 itself). Two files instead of one because copy-pasting the universal loop
 into every `program.md` makes them drift.
 ## Decision: dataset-independent oracle for every target
 The first iteration of the harness used recall@K vs. SIFT1M as the
 correctness oracle. We replaced it with bit-exact (or near-bit-exact for
 floats) match against a scalar reference because:
 1. The agent had incentive to overfit lossy approximations to the dataset's
   cluster structure, even though we didn't ask for that.
 2. SIFT1M is 250 MB and a hassle to download; the harness benefited from
   being self-contained.
 3. Mathematical equivalence is a strictly stronger contract than recall
   preservation: if the kernel is bit-equivalent to the scalar reference,
   recall is automatically identical because the distance values are the
   same. There's nothing recall@K catches that bit-exactness doesn't.
 This decision generalizes to every target. Decode kernels get strict bitwise
 equality (no float arithmetic involved). Distance and BM25 kernels get
 `max_abs_err ≤ 1e-4` (loose enough for SIMD-accumulator reordering, tight
 enough for real bugs). Targets that genuinely require lossy techniques to
 get headroom — there might be some; LUT u8 quantization in PQ is one — go
 in a separate "lossy track" with a recall-based oracle on diverse datasets,
 not the bit-exact track.
 ## Decision: per-target speed measurement spans multiple shapes × distributions
 A single dataset would let an agent overfit to that dataset's distribution.
 Each target's `inputs.rs` therefore generates speed workloads across:
 - Multiple **shapes** of the kernel's domain (PQ: `(dim, num_sub_vectors,
  num_centroids)`; bitpack: bit width; etc.). Captures how the kernel
  performs at different sizes Lance users actually encounter.
 - Multiple **data distributions** (Gaussian / uniform / sparse for floats;
  uniform / skewed / clustered for integers; etc.). Captures whether the
  kernel's win is data-distribution-conditional.
 The keep gate uses geomean across all (shape × distribution) combos with a
 worst-case guard: a kernel that wins on one combo and regresses by more than
 5% on another fails to keep, even if the geomean improves. This forces wins
 to generalize.
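One way to read the keep gate is as a predicate over two trial summaries. The sketch below assumes the guard is applied to the worst per-combo time with a 5% allowance; `TrialSummary` and `should_keep` are hypothetical names, and the authoritative rule is whatever `HARNESS.md` states.

```rust
/// Hypothetical summary of one trial: geomean and worst per-query time in ns.
pub struct TrialSummary {
    pub geomean_ns: u64,
    pub worst_ns: u64,
}

/// Sketch of the keep gate: the geomean must improve and the worst case may
/// be at most 1.05x the last-kept kernel's worst case.
pub fn should_keep(last_kept: &TrialSummary, candidate: &TrialSummary) -> bool {
    let geomean_improved = candidate.geomean_ns < last_kept.geomean_ns;
    // Integer form of worst_new <= 1.05 * worst_old; u128 avoids overflow.
    let worst_ok = (candidate.worst_ns as u128) * 100 <= (last_kept.worst_ns as u128) * 105;
    geomean_improved && worst_ok
}
```

Integer arithmetic keeps the 5% boundary exact, so a candidate sitting precisely on the limit is judged the same way on every machine.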
 ## What's deliberately not abstracted
 - **Output format.** Each target prints its own field block. See above.
 - **`TopKHeap` and other small data structures.** When two targets need a
  `TopKHeap`, the second one copies the first's. Three copies of a 30-line
  struct is cheaper than one trait-erased indirection.
 - **Test data shapes.** Each target's `inputs.rs` knows its own kernel's
  fixture shape. Sharing would require a generic `Fixture<Kernel>` trait,
  which would either be too narrow (forces every kernel into a `query +
  workload` shape) or too wide (gives up the type safety that makes the
  bench's correctness check obvious).
 ## When to revisit
 If the workspace grows past ~6 active targets and we notice we're
 copy-pasting more than ~50 lines of `run_experiment.rs` boilerplate per new
 target, consider extracting a shared `RunExperiment` helper that takes
 closures for the correctness and speed phases. Don't pre-extract — wait
 until the duplication is real and visible.
 If we add a target that genuinely doesn't fit the autoresearch loop (eval
 crosses ~30s; tournament sampling becomes the right control loop), it
 belongs in a separate workspace, not this one. The boundary line is the
 loop shape, not the target type.
--- a/research/lance-autoresearch/docs/targets/pq-l2.md
+++ b/research/lance-autoresearch/docs/targets/pq-l2.md
@ -0,0 +1,98 @@
 # Target: `pq-l2`
 PQ L2 distance kernel for f32 dense vectors — the asymmetric-distance compute
 that runs on every `IvfPq` / `IvfHnswPq` ANN query in Lance.
 ## Status
 **Landed.** Baseline scalar kernel committed; the agent's job is to find
 generalizable speedups against it.
 ## What's optimized
 Two functions in `crates/pq-l2/src/kernels.rs`:
 - `PqKernel::distance_table(query)` — builds the asymmetric distance table
  (`[num_sub_vectors][num_centroids]`) for one query against the codebook.
  Cost: `num_sub_vectors × num_centroids × sub_vector_dim` MAC ops per query.
 - `PqKernel::probe_top_k(table, codes, num_vectors, k)` — probes
  `num_vectors` PQ-encoded vectors, accumulates per-vector distance via
  `num_sub_vectors` table lookups, returns top-K. Cost:
  `num_vectors × num_sub_vectors` lookups + heap maintenance per query.
  This is the dominant cost at typical scales.
 `PqKernel::new(shape, codebook)` is also editable — the agent may pre-process
 the codebook (transpose layout, cache `c·c` for the FMA trick, pack the LUT)
 and amortize over queries; build cost is excluded from per-query timing.
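For orientation, the two functions can be sketched in plain scalar Rust. This is an illustration under stated assumptions, not the crate's code: the free-function names are hypothetical, the codebook is assumed flat `[m][k][d]`, the codes flat `[num_vectors][m]`, and the result is assumed to be sorted by ascending distance.

```rust
/// Scalar sketch of the distance-table build. `m` sub-vectors, `k` centroids
/// each, `d` dimensions per sub-vector. Returns a flat `[m][k]` table.
pub fn distance_table_scalar(
    codebook: &[f32],
    query: &[f32],
    m: usize,
    k: usize,
    d: usize,
) -> Vec<f32> {
    assert_eq!(codebook.len(), m * k * d);
    assert_eq!(query.len(), m * d);
    let mut table = Vec::with_capacity(m * k);
    for mi in 0..m {
        let q = &query[mi * d..(mi + 1) * d];
        for ki in 0..k {
            let start = (mi * k + ki) * d;
            let c = &codebook[start..start + d];
            let dist: f32 = q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum();
            table.push(dist);
        }
    }
    table
}

/// Scalar sketch of the probe: one table lookup per sub-vector per encoded
/// vector, then the `top_k` smallest distances.
pub fn probe_top_k_scalar(
    table: &[f32],
    codes: &[u8],
    num_vectors: usize,
    m: usize,
    k: usize,
    top_k: usize,
) -> Vec<(u32, f32)> {
    assert_eq!(table.len(), m * k);
    assert_eq!(codes.len(), num_vectors * m);
    let mut scored: Vec<(u32, f32)> = (0..num_vectors)
        .map(|i| {
            let row = &codes[i * m..(i + 1) * m];
            let mut dist = 0.0f32;
            for (mi, &code) in row.iter().enumerate() {
                dist += table[mi * k + code as usize];
            }
            (i as u32, dist)
        })
        .collect();
    // A full sort keeps the sketch short; a bounded heap avoids sorting all rows.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1).then(a.0.cmp(&b.0)));
    scored.truncate(top_k);
    scored
}
```

The sketch makes the cost split visible: the table build touches `m × k × d` codebook values once per query, while the probe performs `num_vectors × m` lookups.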
 ## Upstream Lance source
 Algorithmically modeled on `lance-linalg::distance::l2` plus the PQ
 asymmetric-distance compute in `lance::index::vector::pq`. Specifically the
 f32 dense path; the byte / fixed-point variants are out of scope for this
 target.
 When porting a winning kernel upstream:
 - File: `lance-linalg/src/distance/l2.rs` and the L2-specific path in
  `lance/src/index/vector/pq.rs`.
 - License: Apache-2.0 (matches our dual MIT/Apache-2.0 → upstream takes
  the Apache half).
 ## Oracle
 **Float-accumulator-tolerance match against scalar reference.** Per
 `harness_common::MAX_ABS_ERR = 1e-4`:
 - Distance table values must match the scalar reference within `1e-4` per
  element. Loose enough for legal SIMD-accumulator reordering, tight enough
  to catch real arithmetic bugs.
 - Top-K results compared with `harness_common::TOPK_DIST_TOL = 1e-4` plus
  tie-tolerant id substitution (any permutation within a tied-distance band
  is accepted).
 The correctness phase asserts both on every input combination — five input
 distributions × three PQ shapes = 15 cases per trial.
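A tie-tolerant top-K comparison could look like the sketch below. It is not the crate's `topk_consistent`: the name, the signature, and the extra `ref_dists` argument (the reference distance of every base vector, indexed by id) are assumptions made for illustration.

```rust
/// Sketch of a tie-tolerant top-K check.
///
/// `reference` and `agent` are (id, distance) lists in rank order.
/// `ref_dists[id]` is the reference distance of vector `id` (assumption: the
/// caller can supply it). An agent id may differ from the reference id at a
/// rank only if its reference distance lies within `tol` of that rank.
pub fn topk_consistent_sketch(
    reference: &[(u32, f32)],
    agent: &[(u32, f32)],
    ref_dists: &[f32],
    tol: f32,
) -> bool {
    if reference.len() != agent.len() {
        return false;
    }
    // Reject results that report the same id twice.
    let mut ids: Vec<u32> = agent.iter().map(|p| p.0).collect();
    ids.sort_unstable();
    ids.dedup();
    if ids.len() != agent.len() {
        return false;
    }
    reference.iter().zip(agent).all(|(r, a)| {
        let Some(&true_dist) = ref_dists.get(a.0 as usize) else {
            return false; // id out of range
        };
        (a.1 - r.1).abs() <= tol && (true_dist - r.1).abs() <= tol
    })
}
```

With this shape, swapping two ids whose reference distances tie passes the check, while substituting an id from outside the tied band fails it.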
 ## Speed workload
 Three shapes:
 - `(128, 16, 256)` — SIFT-like; sub_vector_dim = 8
 - `(256, 16, 256)` — sub_vector_dim = 16
 - `(768, 96, 256)` — BERT-base-like; large codebook
 Three data distributions:
 - `Clustered` — 32 cluster centers, low intra-cluster noise
 - `Uniform` — uniform on [-1, 1]
 - `Sparse` — 90% zeros + 10% Gaussian
 Per (shape × distribution): 20,000 base vectors PQ-encoded, 32 queries
 timed. Total trial wall-clock: ~30–60s on a developer laptop.
 ## Output fields
 ```
 correctness:           pass | fail
 shapes_tested:         (128,16,256) (256,16,256) (768,96,256)
 distributions_tested:  clustered uniform sparse
 geomean_ns_per_query:  <u64>
 worst_ns_per_query:    <u64> (<shape>, <dist>)
 best_ns_per_query:     <u64> (<shape>, <dist>)
 per_combo_geomean_ns:
  (...)
 peak_mem_mb:           <f64>
 total_seconds:         <f64>
 ```
 ## Known headroom (priors for the agent)
 See `crates/pq-l2/program.md` "Lance-PQ-specific priors" for the canonical
 list. Highlights:
 - Codebook layout transpose (`[m][k][d]` → `[m][d][k]`) for SIMD-broadcast
  table build.
 - Cache `c·c` per centroid in `new()` so the inner loop is `q·q − 2qc + c·c`
  (one FMA chain).
 - Probe-side code transpose so the inner loop processes 32+ vectors per
  iteration via gather.
 - Top-K block-then-merge instead of per-vector heap insert.
 - Prefetch on `codes[i+64]` ahead of gather.
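The probe-side code transpose can be sketched without intrinsics. Both function names are hypothetical; the sketch assumes codes arrive as flat `[num_vectors][m]` and the table as flat `[m][k]`, and it shows only the loop structure that makes the inner loop contiguous.

```rust
/// Transpose codes from `[i][m]` (one row per vector) to `[m][i]` (one row
/// per sub-quantizer, contiguous over the base-vector index).
pub fn transpose_codes(codes: &[u8], num_vectors: usize, m: usize) -> Vec<u8> {
    assert_eq!(codes.len(), num_vectors * m);
    let mut out = vec![0u8; codes.len()];
    for i in 0..num_vectors {
        for mi in 0..m {
            out[mi * num_vectors + i] = codes[i * m + mi];
        }
    }
    out
}

/// Accumulate per-vector distances with the sub-quantizer as the outer loop.
/// The inner loop reads one table row and one contiguous run of codes, which
/// is the access pattern a gather-based SIMD loop would use.
pub fn accumulate_transposed(
    table: &[f32],
    codes_t: &[u8],
    num_vectors: usize,
    m: usize,
    k: usize,
) -> Vec<f32> {
    assert_eq!(table.len(), m * k);
    assert_eq!(codes_t.len(), num_vectors * m);
    let mut acc = vec![0.0f32; num_vectors];
    for mi in 0..m {
        let row = &table[mi * k..(mi + 1) * k];
        let col = &codes_t[mi * num_vectors..(mi + 1) * num_vectors];
        for (a, &code) in acc.iter_mut().zip(col) {
            *a += row[code as usize];
        }
    }
    acc
}
```

Each vector still adds its table entries in ascending sub-quantizer order, so the rounding should match a row-major scalar loop that accumulates in the same order.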
--- a/research/lance-autoresearch/program.md
+++ b/research/lance-autoresearch/program.md
@ -1,172 +0,0 @@
 # Lance PQ L2 kernel research — agent instructions
 You are an autonomous research assistant. Your job is to improve `src/kernels.rs`
 so that `cargo run --release --bin run_experiment` reports a **lower
 `geomean_ns_per_query`** while:
 1. The **correctness phase passes** — your kernel's distance values must match the
   scalar reference within `MAX_ABS_ERR = 1e-4`, and the top-K must be
   tie-tolerant equivalent on every input the bench generates.
 2. The `worst_ns_per_query` does **not regress more than 5%** against the
   last-kept kernel — if you win on one (shape × distribution) and lose
   significantly on another, the change isn't a generalizable improvement.
 This bench is intentionally **dataset-independent**: there is no fixed dataset.
 The correctness oracle is mathematical equivalence to the scalar reference,
 checked across multiple PQ shapes and synthetic input distributions
 (Gaussian / uniform / sparse / large-dynamic-range / mostly-zero). The speed
 oracle is the geomean across multiple shapes × distributions, with worst-case
 guarded. A win that depends on a specific data distribution or PQ shape will
 fail to clear the bar by construction.
 Read this file end-to-end before doing anything else. Then run setup, then the loop.
 ## Setup (do once at the start of every session)
 1. Read these files, in this order:
   - `README.md`
   - `program.md` (this file)
   - `src/lib.rs`
   - `src/kernels.rs` *(the only file you may edit)*
   - `src/reference.rs`
   - `src/inputs.rs`
   - `src/bin/run_experiment.rs`
 2. Ensure `results.tsv` exists. If not, create it with this header line:
   ```
   commit	timestamp	correctness	geomean_ns	worst_ns	worst_combo	best_ns	best_combo	peak_mem_mb	total_seconds	keep	description
   ```
 3. Run the baseline trial: `cargo run --release --bin run_experiment > run.log 2>&1`.
   Confirm `correctness: pass`. Parse `run.log` and append a row to `results.tsv`
   with `keep=baseline` and `description="seeded scalar PQ-L2 baseline"`. This
   is your reference number.
 4. Commit the baseline row with a one-line message like `baseline: <numbers>`.
 ## What you CAN do
 - Modify **`src/kernels.rs`** freely. You may:
  - Pre-process the codebook in `PqKernel::new` (transpose layouts, cache
    `c·c` for the FMA trick, pack the codebook for register-resident lookup,
    etc.). This cost is paid once per dataset and amortized across queries —
    the bench measures per-query, not per-(build + query).
  - Reorder loops, switch internal data layouts, drop down to `std::arch`
    intrinsics under `#[cfg(target_arch = ...)]` gates. **Always keep a
    portable scalar fallback** so the kernel compiles everywhere.
  - Use `unsafe` if needed; document the invariants you're relying on.
  - Mark hot functions `#[inline]`; add private helpers freely.
  - Add `#[cfg(test)] mod tests { ... }` inside `src/kernels.rs` if you want
    in-file property checks.
 ## What you CANNOT do
 - Do **not** modify `src/lib.rs` (`PqShape` and the tolerance constants are
  shared with the immutable scaffolding).
 - Do **not** modify `src/bin/run_experiment.rs`, `src/reference.rs`,
  `src/inputs.rs`, `benches/pq_l2.rs`, or `Cargo.toml`.
 - Do **not** add new crate dependencies.
 - Do **not** alter the public API of `kernels::PqKernel`:
  - `PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self`
  - `PqKernel::shape(&self) -> &PqShape`
  - `PqKernel::distance_table(&self, query: &[f32]) -> Vec<f32>`
  - `PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>`
 - Do **not** introduce lossy techniques (LUT u8/u16 quantization, asymmetric-
  distance approximation, etc.) — the correctness phase asserts exact-up-to-ε
  match against the scalar reference. If you want to explore a lossy track,
  surface that in a separate kernel and propose a track extension.
 ## The metric
 Minimize `geomean_ns_per_query` (geometric mean of per-query wall-clock across
 all timed queries, all shapes, all distributions) subject to:
 1. Correctness phase: **pass** (exit-2 otherwise).
 2. `worst_ns_per_query` ≤ 1.05 × the last-kept kernel's worst.
 3. `total_seconds` ≤ 600.
 4. Build is clean: `cargo build --release` succeeds, `cargo clippy --release
   --all-targets -- -D warnings` reports zero issues.
 Ties break toward simpler code. If two kernels report the same speed within
 ~3% noise, prefer fewer lines / less `unsafe`.
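
For reference, a geometric mean over per-query timings can be computed in log space, as in the sketch below. `geomean_sketch` is a hypothetical name used for illustration; it is not the harness's own implementation, and it assumes every sample is strictly positive.

```rust
/// Geometric mean of strictly positive samples, computed in log space so the
/// product of many nanosecond timings cannot overflow.
pub fn geomean_sketch(samples_ns: &[f64]) -> f64 {
    assert!(!samples_ns.is_empty(), "need at least one sample");
    assert!(samples_ns.iter().all(|&x| x > 0.0), "samples must be positive");
    let log_sum: f64 = samples_ns.iter().map(|x| x.ln()).sum();
    (log_sum / samples_ns.len() as f64).exp()
}
```

Because the geometric mean averages ratios, a 10% speedup on a fast combo moves the metric as much as a 10% speedup on a slow one, which is why the separate `worst_ns_per_query` guard exists.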
 ## Lance-PQ-specific priors (lossless directions)
 These directions are known to pay off without compromising arithmetic accuracy.
 Pick one hypothesis at a time; implement; measure; decide.
 - **Codebook layout.** The reference layout is `[m][k][d]`. For a fixed query,
  iterating over centroids stays in cache, but the inner loop over `d` is
  short. Transposing to `[m][d][k]` makes `k` the contiguous axis, so you can
  SIMD-load 8 centroids' values for one `d` at a time, broadcast the matching
  query element across the lanes, and accumulate 8 `(query - centroid)²` sums
  in parallel. Do the transpose once in `PqKernel::new`.
 - **Cache `c·c`.** The diff–square–sum expands as
  `(q - c)·(q - c) = q·q - 2(q·c) + c·c`. Hoist `q·q` per sub-vector and
  precompute `c·c` once at codebook-load time. The inner loop becomes one FMA
  (`-2(q·c) + c·c`). The expanded form subtracts nearly equal terms when `q`
  is close to `c`, so watch the sign and accumulator ordering to keep the
  rounding within tolerance.
 - **Probe layout.** The probe is dominated by `acc += table[m][codes[off+m]]`,
  executed `num_sub_vectors` times per base vector. Transposing codes to
  `[m][i]` (one row per sub-quantizer, contiguous over base index) lets you
  process 32 or more vectors per inner iteration with `vgatherdps`-style
  loads.
 - **Top-K integration.** `push()` does a branch + heap sift on every code.
  At 50k probes per query × 9 (shape × dist) combos that's the second-biggest
  cost after the gather. Block the probe (e.g., 512 codes at a time), find the
  local top-K with a branchless pass, then merge into the global heap.
 - **Prefetch.** Issuing
  `_mm_prefetch::<_MM_HINT_T0>(codes.as_ptr().add(off + 64).cast())` ahead of
  the gather is usually a win at 50k+ scale, where the codes don't all fit in
  L2.
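 - **FMA chains for table build.** The diff–square–sum maps cleanly to FMA on
  AVX2/NEON. Even without intrinsics, writing the accumulation with
  `f32::mul_add` lets `rustc` emit FMA where the target supports it; a separate
  multiply and add is not fused automatically. Where hardware FMA is not
  enabled, `mul_add` falls back to a slow software routine, so gate it.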
 - **Avoid the `Vec` allocation in the hot path.** `distance_table` allocates a
  fresh `Vec<f32>` per call. Returning a fixed-capacity buffer is a public-API
  change you can't make, but you can reuse a thread-local scratch buffer
  internally if it speeds up the table build.
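
As one worked instance of these priors, the `c·c` caching idea can be sketched in safe Rust. The type `NormCachedCodebook` and the function `distance_table_direct` are hypothetical and stand apart from `PqKernel`; the sketch assumes the flattened `[m][k][d]` reference layout and a query of length `m * d`.

```rust
/// Hypothetical helper (not the harness's `PqKernel`): caches `c·c` per centroid.
pub struct NormCachedCodebook {
    m: usize,
    k: usize,
    d: usize,
    codebook: Vec<f32>, // flattened [m][k][d]
    cc: Vec<f32>,       // cc[mi * k + ki] = c·c for that centroid
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

impl NormCachedCodebook {
    /// Pays the `c·c` cost once, at codebook-load time.
    pub fn new(m: usize, k: usize, d: usize, codebook: &[f32]) -> Self {
        assert_eq!(codebook.len(), m * k * d);
        let cc = codebook.chunks_exact(d).map(|c| dot(c, c)).collect();
        Self { m, k, d, codebook: codebook.to_vec(), cc }
    }

    /// Builds the table via q·q - 2(q·c) + c·c, with q·q hoisted per sub-vector.
    pub fn distance_table(&self, query: &[f32]) -> Vec<f32> {
        assert_eq!(query.len(), self.m * self.d);
        let mut table: Vec<f32> = Vec::with_capacity(self.m * self.k);
        for (mi, q) in query.chunks_exact(self.d).enumerate() {
            let qq = dot(q, q);
            for ki in 0..self.k {
                let idx = mi * self.k + ki;
                let c = &self.codebook[idx * self.d..(idx + 1) * self.d];
                table.push(qq - 2.0 * dot(q, c) + self.cc[idx]);
            }
        }
        table
    }
}

/// Direct diff-square-sum form, used here only as a comparison point.
pub fn distance_table_direct(k: usize, d: usize, codebook: &[f32], query: &[f32]) -> Vec<f32> {
    let mut table: Vec<f32> = Vec::new();
    for (mi, q) in query.chunks_exact(d).enumerate() {
        for ki in 0..k {
            let idx = mi * k + ki;
            let c = &codebook[idx * d..(idx + 1) * d];
            table.push(q.iter().zip(c).map(|(x, y)| (x - y) * (x - y)).sum::<f32>());
        }
    }
    table
}
```

The two forms round differently, so a change like this should be compared against the direct form on the harness inputs before it is kept.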
 ## The loop
 Once setup is done, repeat indefinitely:
 1. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas
   have been tried, what won, what regressed. Form a hypothesis with one
   sentence stating the change and the predicted effect on speed and
   correctness.
 2. **Edit `src/kernels.rs`.** Keep the diff focused on the one hypothesis.
 3. **Build and lint.**
   ```
   cargo build --release
   cargo clippy --release --all-targets -- -D warnings
   ```
   If either fails, fix and try again — do not commit broken state.
 4. **Run the trial.**
   ```
   cargo run --release --bin run_experiment > run.log 2>&1
   ```
 5. **Parse the result.** Extract `correctness`, `geomean_ns_per_query`,
   `worst_ns_per_query` (with combo), `peak_mem_mb`, `total_seconds`. Compute
   deltas vs. baseline.
 6. **Decide keep or revert.**
   - **Keep** iff: `correctness: pass`, geomean strictly better than the
     last-kept row (allow ~1% noise band), and `worst_ns_per_query` ≤ 1.05 ×
     last-kept's worst.
   - **Revert** otherwise: `git restore src/kernels.rs` (or commit and
     `git revert` if you want the revert in history). Note what failed.
 7. **Log.** Append one row to `results.tsv`:
   ```
   <short_sha>	<iso8601>	<correctness>	<geomean_ns>	<worst_ns>	<worst_combo>	<best_ns>	<best_combo>	<peak_mem>	<elapsed>	<keep|revert>	<one-line description>
   ```
 8. **Commit.** One-line message describing the change and the headline number,
   e.g. `transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)`.
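
The keep-or-revert rule in step 6 can be written down as a small predicate. The sketch below is illustrative only: `Trial`, `should_keep`, `NOISE_BAND`, and `WORST_SLACK` are hypothetical names, and it assumes the "~1% noise band" means the geomean must improve by more than 1% to count as strictly better.

```rust
/// Headline numbers from one trial (field names mirror the logged keys).
#[derive(Debug, Clone, Copy)]
pub struct Trial {
    pub correctness_pass: bool,
    pub geomean_ns_per_query: f64,
    pub worst_ns_per_query: f64,
}

/// ASSUMPTION: improvements smaller than 1% are treated as noise.
pub const NOISE_BAND: f64 = 0.01;
/// The worst-case combo may regress by at most 5% relative to the last-kept row.
pub const WORST_SLACK: f64 = 1.05;

/// Returns true when the new trial should be kept over the last-kept one.
pub fn should_keep(new: &Trial, last_kept: &Trial) -> bool {
    new.correctness_pass
        && new.geomean_ns_per_query < last_kept.geomean_ns_per_query * (1.0 - NOISE_BAND)
        && new.worst_ns_per_query <= WORST_SLACK * last_kept.worst_ns_per_query
}
```

A trial that fails correctness is rejected regardless of speed, which matches the ordering of the checks in step 6.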
 ## Hygiene
 - Always commit `src/kernels.rs` changes; never commit `results.tsv` or
  `run.log` (they're gitignored).
 - If a change fails to build, do not commit. Iterate until it builds, or
  revert cleanly.
 - If two consecutive ideas regress, take a beat: re-read the last ~10 rows of
  `results.tsv` and update your mental model before proposing the next.
 - Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min, kill it
  and mark the trial as `timeout`.
 ## Never stop
 Keep going until interrupted. Each loop iteration is one hypothesis, one edit,
 one measurement, one commit. No multi-step plans across iterations.