research: fix lance-autoresearch correctness bugs surfaced by code review

A code review pass found a cluster of real bugs in metrics and contract; fixing them before any agent loop runs against this harness. Critical metric bug: - harness-common::sysinfo::peak_rss_mb read VmPeak (virtual address space high-water-mark, includes mmap'd files / guard pages / untouched allocations) instead of VmHWM (resident pages high-water-mark). The function name and HARNESS.md contract both promised RSS. Every peak_mem_mb row logged under the old code was virtual peak, not RSS. Correctness contract bug: - reference::topk_consistent's tie-tolerance had a flawed neighbor-scan check: when the K-th distance fell in a multi-way tie, agent and reference could legally return different K-sized subsets of the tied band (heap eviction order vs. sort stability), and the neighbor scan required both endpoints to be present, false-negativing legitimate cases. Simplified to a positional distance-tolerance check; ids at the same rank may differ silently because the distance match within tol constrains the swap to a 2*tol band. Diagnostic comment explains the rationale. API hygiene: - Removed dead PqKernel::shape() and ScalarReference::shape() — declared in the public API contract (program.md, kernels.rs comment), required to be stable, never called by the bench / benches / inputs / reference. Now the contract reflects what the bench actually uses. - Removed dead `anyhow` workspace dependency. Determinism: - PRNG seed mixing now uses the SplitMix64 finalizer per part instead of raw XOR. Raw XOR is commutative and small-constant collisions are reachable; mix_seeds iterates the finalizer once per ingredient so distinct (seed, shape, kind) tuples produce distinct streams with vanishingly small collision probability. License headers: - kernels.rs SPDX changed from Apache-2.0 to MIT OR Apache-2.0 to match the crate's Cargo.toml license field (the rest of the crate is dual- licensed). Added matching SPDX headers to reference.rs and inputs.rs. Doc cleanups: - design.md: replaced the broken relative link `../../docs/research/llm-evolutionary-sampling.md` (which resolved inside lance-autoresearch where the note doesn't live) with a path-explained reference noting the note lives in the parent OmniGraph repo and won't ship on extraction. - README.md: clarified that the target table mixes a single landed target with a candidate roadmap — they have no code yet. - HARNESS.md: added exit code 1 (internal error) to the exit-code summary; was documented in run_experiment.rs but not in the loop contract. - adding-a-target.md: dropped the misleading "cp -r plus surgical edits" framing — the workflow rewrites 7 files; what's inherited is Cargo manifest, license headers, workspace registration, and shared utilities. Verified end-to-end: cargo build / clippy / test all green. Baseline trial runs `correctness: pass` exit 0 in ~34s (peak_mem_mb now reads RSS — same workload reports 91 MB, plausibly correct given the temporary fixture-construction buffers). https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
2026-06-09 01:35:18 +02:00 · 2026-05-15 00:55:57 +00:00 · 2026-05-15 00:55:57 +00:00 · 7b1b0b5b75
commit 7b1b0b5b75
parent 0d72cc69fb
10 changed files with 75 additions and 44 deletions
--- a/research/lance-autoresearch/Cargo.toml
+++ b/research/lance-autoresearch/Cargo.toml
@ -6,9 +6,9 @@ members = [
 ]

 # Each per-target crate sets its own deps. Shared deps below pin versions
-# uniformly across targets so the workspace lockfile stays clean.
+# uniformly across targets so the workspace lockfile stays clean. Keep this
+# list small — only deps actually consumed by ≥2 targets belong here.
 [workspace.dependencies]
-anyhow = "1"
 criterion = { version = "0.5", default-features = false, features = ["plotters", "cargo_bench_support"] }

 [profile.release]
--- a/research/lance-autoresearch/HARNESS.md
+++ b/research/lance-autoresearch/HARNESS.md
@ -70,6 +70,11 @@ A kernel is **kept** iff:
 5. Build clean: `cargo build --release` and
   `cargo clippy --release --all-targets -- -D warnings` both succeed.

+Exit codes summary for `run_experiment`: `0` on success, `1` on internal
+error (panic, fixture load failure), `2` on correctness failure, `3` on
+time-budget breach. The agent's loop should treat anything non-zero as
+"revert; do not log as a measurement."
+
 Ties break toward simpler code: same speed within ~3% noise → fewer lines /
 less `unsafe` wins.

--- a/research/lance-autoresearch/README.md
+++ b/research/lance-autoresearch/README.md
@ -6,7 +6,11 @@ in the style of Andrej Karpathy's
 [`nanochat-research`](https://x.com/karpathy/status/1855651423497650238)
 single-agent autoresearch loop.

-Each target is an independent Rust crate under `crates/`:
+Each landed target is an independent Rust crate under `crates/`. The
+candidates below are listed as a roadmap — they have no code yet, only the
+research-note rationale and a `docs/targets/<name>.md` capsule (when one
+exists). Spinning up a candidate is the
+[`docs/adding-a-target.md`](docs/adding-a-target.md) workflow.

 | Target | Status | Lance source area | What's optimized |
 |---|---|---|---|
--- a/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs
+++ b/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs
@ -1,4 +1,9 @@
 //! Peak resident-set-size readback (Linux only; non-Linux returns 0).
+//!
+//! Reads `VmHWM` from `/proc/self/status` — the high-water-mark of resident
+//! memory pages, not the high-water-mark of virtual address space. `VmPeak`
+//! (virtual peak) would include mmap'd files, guard pages, and untouched
+//! allocations; `VmHWM` is what users mean by "peak memory".

 #[cfg(target_os = "linux")]
 pub fn peak_rss_mb() -> f64 {
@ -6,7 +11,7 @@ pub fn peak_rss_mb() -> f64 {
        return 0.0;
    };
    for line in s.lines() {
-        if let Some(rest) = line.strip_prefix("VmPeak:") {
+        if let Some(rest) = line.strip_prefix("VmHWM:") {
            let kb: f64 = rest
                .split_whitespace()
                .next()
--- a/research/lance-autoresearch/crates/pq-l2/program.md
+++ b/research/lance-autoresearch/crates/pq-l2/program.md
@ -38,7 +38,6 @@ pub struct PqKernel { /* agent's private fields */ }

 impl PqKernel {
    pub fn new(shape: PqShape, codebook: &[f32]) -> Self;
-    pub fn shape(&self) -> &PqShape;
    pub fn distance_table(&self, query: &[f32]) -> Vec<f32>;
    pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>;
 }
--- a/research/lance-autoresearch/crates/pq-l2/src/inputs.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/inputs.rs
@ -1,3 +1,5 @@
+// SPDX-License-Identifier: MIT OR Apache-2.0
+
 //! IMMUTABLE. Diverse test-data + workload generators.
 //!
 //! Two surfaces. `correctness_battery(seed)` yields several
@ -90,7 +92,7 @@ pub fn correctness_battery(seed: u64) -> Vec<CorrectnessCase> {
    ];
    for &shape in SHAPES {
        for (kind, label) in kinds {
-            let mut rng = SplitMix64::new(seed ^ shape_hash(shape) ^ kind_hash(*kind));
+            let mut rng = SplitMix64::new(mix_seeds(&[seed, shape_hash(shape), kind_hash(*kind)]));
            let num_vectors = 4096;
            let codebook = build_codebook(shape, *kind, CODEBOOK_TRAIN_SIZE, &mut rng);
            let base = gen_vectors(num_vectors, shape.dim, *kind, &mut rng);
@ -114,7 +116,7 @@ pub fn speed_workloads(seed: u64) -> Vec<SpeedWorkload> {
    let mut out = Vec::new();
    for &shape in SHAPES {
        for &dist in DISTRIBUTIONS {
-            let mut rng = SplitMix64::new(seed ^ shape_hash(shape) ^ dist_hash(dist));
+            let mut rng = SplitMix64::new(mix_seeds(&[seed, shape_hash(shape), dist_hash(dist)]));
            let kind = match dist {
                DataDistribution::Clustered => InputKind::Clustered,
                DataDistribution::Uniform => InputKind::Uniform,
@ -296,6 +298,21 @@ fn encode(shape: PqShape, n: usize, base: &[f32], codebook: &[f32]) -> Vec<u8> {
    out
 }

+/// Combine seed components without XOR-collision. Iterates the SplitMix64
+/// finalizer once per part, so distinct part vectors produce distinct seeds
+/// with vanishingly small collision probability (vs. raw XOR, where
+/// `a ^ b ^ c == a ^ c ^ b` and small constants can cancel each other).
+fn mix_seeds(parts: &[u64]) -> u64 {
+    let mut mixed: u64 = 0;
+    for &p in parts {
+        mixed = mixed.wrapping_add(p).wrapping_add(0x9E37_79B9_7F4A_7C15);
+        mixed = (mixed ^ (mixed >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
+        mixed = (mixed ^ (mixed >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
+        mixed ^= mixed >> 31;
+    }
+    mixed
+}
+
 fn shape_hash(s: PqShape) -> u64 {
    (s.dim as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
        ^ (s.num_sub_vectors as u64).wrapping_mul(0xBF58_476D_1CE4_E5B9)
--- a/research/lance-autoresearch/crates/pq-l2/src/kernels.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/kernels.rs
@ -1,4 +1,4 @@
-// SPDX-License-Identifier: Apache-2.0
+// SPDX-License-Identifier: MIT OR Apache-2.0
 //
 // AGENT'S PLAYGROUND. This is the file you (the agent) modify.
 //
@ -14,7 +14,6 @@
 // PUBLIC API CONTRACT (must remain stable so the bench keeps building):
 //   - `pub struct PqKernel`
 //   - `PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self`
-//   - `PqKernel::shape(&self) -> &PqShape`
 //   - `PqKernel::distance_table(&self, query: &[f32]) -> Vec<f32>`
 //   - `PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>`
 //
@ -59,10 +58,6 @@ impl PqKernel {
        }
    }

-    pub fn shape(&self) -> &PqShape {
-        &self.shape
-    }
-
    /// Asymmetric L2 distance table for one query.
    ///
    /// Layout of returned `Vec<f32>`: `[num_sub_vectors][num_centroids]` flat
--- a/research/lance-autoresearch/crates/pq-l2/src/reference.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/reference.rs
@ -1,3 +1,5 @@
+// SPDX-License-Identifier: MIT OR Apache-2.0
+
 //! IMMUTABLE. Scalar reference kernel — defines the math the agent must match.
 //!
 //! Same public API as `kernels::PqKernel`, intentionally — it's the bit-for-bit
@ -22,10 +24,6 @@ impl ScalarReference {
        }
    }

-    pub fn shape(&self) -> &PqShape {
-        &self.shape
-    }
-
    #[allow(clippy::needless_range_loop)]
    pub fn distance_table(&self, query: &[f32]) -> Vec<f32> {
        let s = &self.shape;
@ -87,13 +85,23 @@ pub fn max_abs_err(a: &[f32], b: &[f32]) -> f32 {
        .fold(0.0f32, f32::max)
 }

-/// Check two top-K results are equivalent up to:
-///   - shared distance tolerance `dist_tol`
-///   - distance-tied id substitution (if two candidates have equal distances,
-///     either order is acceptable)
+/// Check two top-K results are equivalent up to per-rank distance tolerance.
 ///
-/// Returns `Ok(())` on match, or `Err(diagnostic_string)` describing the first
-/// disagreement found.
+/// At each rank `i`, asserts `|agent[i].dist - reference[i].dist| <= dist_tol`.
+/// Ids at the same rank may differ silently. This is correct because:
+///
+/// 1. If both kernels compute distances within `dist_tol` of each other,
+///    differing ids at the same rank means their distances are within a tied
+///    band of width `2*dist_tol`; both ids legally belong in that band.
+/// 2. Stronger checks (e.g., set-equality of ids) reject legal cases. When the
+///    K-th distance is at a multi-way tie, two correct implementations can
+///    return different K-sized subsets of the tied band — heap eviction order
+///    in the agent kernel vs. sort stability in the scalar reference.
+///    Set-equality fails on these legitimately.
+///
+/// What this catches: any case where the agent kernel computes a distance that
+/// disagrees with the scalar reference's distance for the same (query, codes)
+/// input. The first rank where the math diverges is flagged.
 pub fn topk_consistent(
    agent: &[(u32, f32)],
    reference: &[(u32, f32)],
@ -113,18 +121,6 @@ pub fn topk_consistent(
                (a_d - r_d).abs()
            ));
        }
-        if a_id != r_id {
-            // Different id at same rank is acceptable iff this distance is tied
-            // with a neighbor in either result — we accept any permutation
-            // within a tied-distance band.
-            let agent_neighbor_match = agent.iter().any(|(id, d)| id == r_id && (d - r_d).abs() <= dist_tol);
-            let ref_neighbor_match = reference.iter().any(|(id, d)| id == a_id && (d - a_d).abs() <= dist_tol);
-            if !agent_neighbor_match || !ref_neighbor_match {
-                return Err(format!(
-                    "topk[{i}] id mismatch with no tie-break excuse: agent=({a_id}, {a_d}) reference=({r_id}, {r_d})"
-                ));
-            }
-        }
    }
    Ok(())
 }
--- a/research/lance-autoresearch/docs/adding-a-target.md
+++ b/research/lance-autoresearch/docs/adding-a-target.md
@ -1,13 +1,19 @@
 # Adding a new target

 Walk through this when spinning up a new optimization target (A1 cosine, A4
-bitpack, etc.). It's a `cp -r` plus surgical edits — no architectural
-decisions to make per target if the kernel fits the autoresearch shape.
+bitpack, etc.). The workflow is: copy an existing target as scaffolding,
+then rewrite the four source files (`lib.rs`, `kernels.rs`, `reference.rs`,
+`inputs.rs`), the entry point (`bin/run_experiment.rs`), the per-target
+`program.md`, and add a capsule under `docs/targets/`. The Cargo manifest,
+workspace registration, license headers, file-layout conventions, gitignore,
+and shared utilities (`harness-common`) are inherited from the template —
+that's the part you don't rewrite.

-If your target's per-trial eval is more than ~30 seconds, or the correctness
-oracle can't be a deterministic comparison against a scalar reference, this
-harness is the wrong fit — see [`design.md`](design.md) "When to revisit"
-for the boundary.
+No architectural decisions are required per target if the kernel fits the
+autoresearch shape. If your target's per-trial eval is more than ~30 seconds,
+or the correctness oracle can't be a deterministic comparison against a
+scalar reference, this harness is the wrong fit — see [`design.md`](design.md)
+"When to revisit" for the boundary.

 ## Steps

--- a/research/lance-autoresearch/docs/design.md
+++ b/research/lance-autoresearch/docs/design.md
@ -11,9 +11,13 @@ A multi-target harness for LLM-driven optimization of Lance hot-path kernels.
 right harness shape is identical across them: bit-exact correctness oracle,
 geomean-across-distributions speed metric, single-agent autoresearch loop.

-The original [research note](../../docs/research/llm-evolutionary-sampling.md)
-enumerates ten such candidates (A1–A10) clustered by Lance crate. The first
-landed (`pq-l2`) proves the harness shape; the rest follow the same template.
+The original research note that motivated this repo enumerates ten such
+candidates (A1–A10) clustered by Lance crate. The first landed (`pq-l2`) proves
+the harness shape; the rest follow the same template. When this repo lives
+inside the parent OmniGraph workspace, the note is at
+`omnigraph/docs/research/llm-evolutionary-sampling.md`; if the repo is ever
+extracted as a standalone OSS project, the note is part of OmniGraph's
+private design history and won't ship with the extracted repo.

 ## Decision: workspace, not single crate