From 7b1b0b5b75ee59c3351ca92bbc91592d79189e5f Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Fri, 15 May 2026 00:55:57 +0000
Subject: [PATCH] research: fix lance-autoresearch correctness bugs surfaced by
 code review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A code review pass found a cluster of real bugs in metrics and contract;
fixing them before any agent loop runs against this harness.

Critical metric bug:
- harness-common::sysinfo::peak_rss_mb read VmPeak (virtual address space
  high-water-mark, includes mmap'd files / guard pages / untouched
  allocations) instead of VmHWM (resident pages high-water-mark). The
  function name and HARNESS.md contract both promised RSS. Every
  peak_mem_mb row logged under the old code was virtual peak, not RSS.

Correctness contract bug:
- reference::topk_consistent's tie-tolerance had a flawed neighbor-scan
  check: when the K-th distance fell in a multi-way tie, agent and
  reference could legally return different K-sized subsets of the tied
  band (heap eviction order vs. sort stability), and the neighbor scan
  required both endpoints to be present, false-negativing legitimate
  cases. Simplified to a positional distance-tolerance check; ids at the
  same rank may differ silently because the distance match within tol
  constrains the swap to a 2*tol band. Diagnostic comment explains the
  rationale.

API hygiene:
- Removed dead PqKernel::shape() and ScalarReference::shape() — declared
  in the public API contract (program.md, kernels.rs comment), required
  to be stable, never called by the bench / benches / inputs / reference.
  Now the contract reflects what the bench actually uses.
- Removed dead `anyhow` workspace dependency.

Determinism:
- PRNG seed mixing now uses the SplitMix64 finalizer per part instead of
  raw XOR. Raw XOR is commutative and small-constant collisions are
  reachable; mix_seeds iterates the finalizer once per ingredient so
  distinct (seed, shape, kind) tuples produce distinct streams with
  vanishingly small collision probability.

License headers:
- kernels.rs SPDX changed from Apache-2.0 to MIT OR Apache-2.0 to match
  the crate's Cargo.toml license field (the rest of the crate is dual-
  licensed). Added matching SPDX headers to reference.rs and inputs.rs.

Doc cleanups:
- design.md: replaced the broken relative link
  `../../docs/research/llm-evolutionary-sampling.md` (which resolved inside
  lance-autoresearch where the note doesn't live) with a path-explained
  reference noting the note lives in the parent OmniGraph repo and won't
  ship on extraction.
- README.md: clarified that the target table mixes a single landed target
  with a candidate roadmap — they have no code yet.
- HARNESS.md: added exit code 1 (internal error) to the exit-code summary;
  was documented in run_experiment.rs but not in the loop contract.
- adding-a-target.md: dropped the misleading "cp -r plus surgical edits"
  framing — the workflow rewrites 7 files; what's inherited is Cargo
  manifest, license headers, workspace registration, and shared utilities.

Verified end-to-end: cargo build / clippy / test all green. Baseline
trial runs `correctness: pass` exit 0 in ~34s (peak_mem_mb now reads
RSS — same workload reports 91 MB, plausibly correct given the temporary
fixture-construction buffers).

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
---
 research/lance-autoresearch/Cargo.toml        |  4 +-
 research/lance-autoresearch/HARNESS.md        |  5 +++
 research/lance-autoresearch/README.md         |  6 ++-
 .../crates/harness-common/src/sysinfo.rs      |  7 +++-
 .../crates/pq-l2/program.md                   |  1 -
 .../crates/pq-l2/src/inputs.rs                | 21 +++++++++-
 .../crates/pq-l2/src/kernels.rs               |  7 +---
 .../crates/pq-l2/src/reference.rs             | 40 +++++++++----------
 .../docs/adding-a-target.md                   | 18 ++++++---
 research/lance-autoresearch/docs/design.md    | 10 +++--
 10 files changed, 75 insertions(+), 44 deletions(-)

diff --git a/research/lance-autoresearch/Cargo.toml b/research/lance-autoresearch/Cargo.toml
index a56f436..bfd6342 100644
--- a/research/lance-autoresearch/Cargo.toml
+++ b/research/lance-autoresearch/Cargo.toml
@@ -6,9 +6,9 @@ members = [
 ]
 
 # Each per-target crate sets its own deps. Shared deps below pin versions
-# uniformly across targets so the workspace lockfile stays clean.
+# uniformly across targets so the workspace lockfile stays clean. Keep this
+# list small — only deps actually consumed by ≥2 targets belong here.
 [workspace.dependencies]
-anyhow = "1"
 criterion = { version = "0.5", default-features = false, features = ["plotters", "cargo_bench_support"] }
 
 [profile.release]
diff --git a/research/lance-autoresearch/HARNESS.md b/research/lance-autoresearch/HARNESS.md
index 25defc7..6a6dfc8 100644
--- a/research/lance-autoresearch/HARNESS.md
+++ b/research/lance-autoresearch/HARNESS.md
@@ -70,6 +70,11 @@ A kernel is **kept** iff:
 5. Build clean: `cargo build --release` and
    `cargo clippy --release --all-targets -- -D warnings` both succeed.
 
+Exit codes summary for `run_experiment`: `0` on success, `1` on internal
+error (panic, fixture load failure), `2` on correctness failure, `3` on
+time-budget breach. The agent's loop should treat anything non-zero as
+"revert; do not log as a measurement."
+
 Ties break toward simpler code: same speed within ~3% noise → fewer lines /
 less `unsafe` wins.
 
diff --git a/research/lance-autoresearch/README.md b/research/lance-autoresearch/README.md
index 7f52b77..88687ca 100644
--- a/research/lance-autoresearch/README.md
+++ b/research/lance-autoresearch/README.md
@@ -6,7 +6,11 @@ in the style of Andrej Karpathy's
 [`nanochat-research`](https://x.com/karpathy/status/1855651423497650238)
 single-agent autoresearch loop.
 
-Each target is an independent Rust crate under `crates/`:
+Each landed target is an independent Rust crate under `crates/`. The
+candidates below are listed as a roadmap — they have no code yet, only the
+research-note rationale and a `docs/targets/<name>.md` capsule (when one
+exists). Spinning up a candidate is the
+[`docs/adding-a-target.md`](docs/adding-a-target.md) workflow.
 
 | Target | Status | Lance source area | What's optimized |
 |---|---|---|---|
diff --git a/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs b/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs
index d389ff4..44af29d 100644
--- a/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs
+++ b/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs
@@ -1,4 +1,9 @@
 //! Peak resident-set-size readback (Linux only; non-Linux returns 0).
+//!
+//! Reads `VmHWM` from `/proc/self/status` — the high-water-mark of resident
+//! memory pages, not the high-water-mark of virtual address space. `VmPeak`
+//! (virtual peak) would include mmap'd files, guard pages, and untouched
+//! allocations; `VmHWM` is what users mean by "peak memory".
 
 #[cfg(target_os = "linux")]
 pub fn peak_rss_mb() -> f64 {
@@ -6,7 +11,7 @@ pub fn peak_rss_mb() -> f64 {
         return 0.0;
     };
     for line in s.lines() {
-        if let Some(rest) = line.strip_prefix("VmPeak:") {
+        if let Some(rest) = line.strip_prefix("VmHWM:") {
             let kb: f64 = rest
                 .split_whitespace()
                 .next()
diff --git a/research/lance-autoresearch/crates/pq-l2/program.md b/research/lance-autoresearch/crates/pq-l2/program.md
index 7e778fb..1b8b9ec 100644
--- a/research/lance-autoresearch/crates/pq-l2/program.md
+++ b/research/lance-autoresearch/crates/pq-l2/program.md
@@ -38,7 +38,6 @@ pub struct PqKernel { /* agent's private fields */ }
 
 impl PqKernel {
     pub fn new(shape: PqShape, codebook: &[f32]) -> Self;
-    pub fn shape(&self) -> &PqShape;
     pub fn distance_table(&self, query: &[f32]) -> Vec<f32>;
     pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>;
 }
diff --git a/research/lance-autoresearch/crates/pq-l2/src/inputs.rs b/research/lance-autoresearch/crates/pq-l2/src/inputs.rs
index 78917e5..29df4ef 100644
--- a/research/lance-autoresearch/crates/pq-l2/src/inputs.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/inputs.rs
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: MIT OR Apache-2.0
+
 //! IMMUTABLE. Diverse test-data + workload generators.
 //!
 //! Two surfaces. `correctness_battery(seed)` yields several
@@ -90,7 +92,7 @@ pub fn correctness_battery(seed: u64) -> Vec<CorrectnessCase> {
     ];
     for &shape in SHAPES {
         for (kind, label) in kinds {
-            let mut rng = SplitMix64::new(seed ^ shape_hash(shape) ^ kind_hash(*kind));
+            let mut rng = SplitMix64::new(mix_seeds(&[seed, shape_hash(shape), kind_hash(*kind)]));
             let num_vectors = 4096;
             let codebook = build_codebook(shape, *kind, CODEBOOK_TRAIN_SIZE, &mut rng);
             let base = gen_vectors(num_vectors, shape.dim, *kind, &mut rng);
@@ -114,7 +116,7 @@ pub fn speed_workloads(seed: u64) -> Vec<SpeedWorkload> {
     let mut out = Vec::new();
     for &shape in SHAPES {
         for &dist in DISTRIBUTIONS {
-            let mut rng = SplitMix64::new(seed ^ shape_hash(shape) ^ dist_hash(dist));
+            let mut rng = SplitMix64::new(mix_seeds(&[seed, shape_hash(shape), dist_hash(dist)]));
             let kind = match dist {
                 DataDistribution::Clustered => InputKind::Clustered,
                 DataDistribution::Uniform => InputKind::Uniform,
@@ -296,6 +298,21 @@ fn encode(shape: PqShape, n: usize, base: &[f32], codebook: &[f32]) -> Vec<u8> {
     out
 }
 
+/// Combine seed components without XOR-collision. Iterates the SplitMix64
+/// finalizer once per part, so distinct part vectors produce distinct seeds
+/// with vanishingly small collision probability (vs. raw XOR, where
+/// `a ^ b ^ c == a ^ c ^ b` and small constants can cancel each other).
+fn mix_seeds(parts: &[u64]) -> u64 {
+    let mut mixed: u64 = 0;
+    for &p in parts {
+        mixed = mixed.wrapping_add(p).wrapping_add(0x9E37_79B9_7F4A_7C15);
+        mixed = (mixed ^ (mixed >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
+        mixed = (mixed ^ (mixed >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
+        mixed ^= mixed >> 31;
+    }
+    mixed
+}
+
 fn shape_hash(s: PqShape) -> u64 {
     (s.dim as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
         ^ (s.num_sub_vectors as u64).wrapping_mul(0xBF58_476D_1CE4_E5B9)
diff --git a/research/lance-autoresearch/crates/pq-l2/src/kernels.rs b/research/lance-autoresearch/crates/pq-l2/src/kernels.rs
index c4c1a70..3e42e2a 100644
--- a/research/lance-autoresearch/crates/pq-l2/src/kernels.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/kernels.rs
@@ -1,4 +1,4 @@
-// SPDX-License-Identifier: Apache-2.0
+// SPDX-License-Identifier: MIT OR Apache-2.0
 //
 // AGENT'S PLAYGROUND. This is the file you (the agent) modify.
 //
@@ -14,7 +14,6 @@
 // PUBLIC API CONTRACT (must remain stable so the bench keeps building):
 //   - `pub struct PqKernel`
 //   - `PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self`
-//   - `PqKernel::shape(&self) -> &PqShape`
 //   - `PqKernel::distance_table(&self, query: &[f32]) -> Vec<f32>`
 //   - `PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>`
 //
@@ -59,10 +58,6 @@ impl PqKernel {
         }
     }
 
-    pub fn shape(&self) -> &PqShape {
-        &self.shape
-    }
-
     /// Asymmetric L2 distance table for one query.
     ///
     /// Layout of returned `Vec<f32>`: `[num_sub_vectors][num_centroids]` flat
diff --git a/research/lance-autoresearch/crates/pq-l2/src/reference.rs b/research/lance-autoresearch/crates/pq-l2/src/reference.rs
index 72a7c54..57cfd84 100644
--- a/research/lance-autoresearch/crates/pq-l2/src/reference.rs
+++ b/research/lance-autoresearch/crates/pq-l2/src/reference.rs
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: MIT OR Apache-2.0
+
 //! IMMUTABLE. Scalar reference kernel — defines the math the agent must match.
 //!
 //! Same public API as `kernels::PqKernel`, intentionally — it's the bit-for-bit
@@ -22,10 +24,6 @@ impl ScalarReference {
         }
     }
 
-    pub fn shape(&self) -> &PqShape {
-        &self.shape
-    }
-
     #[allow(clippy::needless_range_loop)]
     pub fn distance_table(&self, query: &[f32]) -> Vec<f32> {
         let s = &self.shape;
@@ -87,13 +85,23 @@ pub fn max_abs_err(a: &[f32], b: &[f32]) -> f32 {
         .fold(0.0f32, f32::max)
 }
 
-/// Check two top-K results are equivalent up to:
-///   - shared distance tolerance `dist_tol`
-///   - distance-tied id substitution (if two candidates have equal distances,
-///     either order is acceptable)
+/// Check two top-K results are equivalent up to per-rank distance tolerance.
 ///
-/// Returns `Ok(())` on match, or `Err(diagnostic_string)` describing the first
-/// disagreement found.
+/// At each rank `i`, asserts `|agent[i].dist - reference[i].dist| <= dist_tol`.
+/// Ids at the same rank may differ silently. This is correct because:
+///
+/// 1. If both kernels compute distances within `dist_tol` of each other,
+///    differing ids at the same rank means their distances are within a tied
+///    band of width `2*dist_tol`; both ids legally belong in that band.
+/// 2. Stronger checks (e.g., set-equality of ids) reject legal cases. When the
+///    K-th distance is at a multi-way tie, two correct implementations can
+///    return different K-sized subsets of the tied band — heap eviction order
+///    in the agent kernel vs. sort stability in the scalar reference.
+///    Set-equality fails on these legitimately.
+///
+/// What this catches: any case where the agent kernel computes a distance that
+/// disagrees with the scalar reference's distance for the same (query, codes)
+/// input. The first rank where the math diverges is flagged.
 pub fn topk_consistent(
     agent: &[(u32, f32)],
     reference: &[(u32, f32)],
@@ -113,18 +121,6 @@ pub fn topk_consistent(
                 (a_d - r_d).abs()
             ));
         }
-        if a_id != r_id {
-            // Different id at same rank is acceptable iff this distance is tied
-            // with a neighbor in either result — we accept any permutation
-            // within a tied-distance band.
-            let agent_neighbor_match = agent.iter().any(|(id, d)| id == r_id && (d - r_d).abs() <= dist_tol);
-            let ref_neighbor_match = reference.iter().any(|(id, d)| id == a_id && (d - a_d).abs() <= dist_tol);
-            if !agent_neighbor_match || !ref_neighbor_match {
-                return Err(format!(
-                    "topk[{i}] id mismatch with no tie-break excuse: agent=({a_id}, {a_d}) reference=({r_id}, {r_d})"
-                ));
-            }
-        }
     }
     Ok(())
 }
diff --git a/research/lance-autoresearch/docs/adding-a-target.md b/research/lance-autoresearch/docs/adding-a-target.md
index 4e3cecc..2cead94 100644
--- a/research/lance-autoresearch/docs/adding-a-target.md
+++ b/research/lance-autoresearch/docs/adding-a-target.md
@@ -1,13 +1,19 @@
 # Adding a new target
 
 Walk through this when spinning up a new optimization target (A1 cosine, A4
-bitpack, etc.). It's a `cp -r` plus surgical edits — no architectural
-decisions to make per target if the kernel fits the autoresearch shape.
+bitpack, etc.). The workflow is: copy an existing target as scaffolding,
+then rewrite the four source files (`lib.rs`, `kernels.rs`, `reference.rs`,
+`inputs.rs`), the entry point (`bin/run_experiment.rs`), the per-target
+`program.md`, and add a capsule under `docs/targets/`. The Cargo manifest,
+workspace registration, license headers, file-layout conventions, gitignore,
+and shared utilities (`harness-common`) are inherited from the template —
+that's the part you don't rewrite.
 
-If your target's per-trial eval is more than ~30 seconds, or the correctness
-oracle can't be a deterministic comparison against a scalar reference, this
-harness is the wrong fit — see [`design.md`](design.md) "When to revisit"
-for the boundary.
+No architectural decisions are required per target if the kernel fits the
+autoresearch shape. If your target's per-trial eval is more than ~30 seconds,
+or the correctness oracle can't be a deterministic comparison against a
+scalar reference, this harness is the wrong fit — see [`design.md`](design.md)
+"When to revisit" for the boundary.
 
 ## Steps
 
diff --git a/research/lance-autoresearch/docs/design.md b/research/lance-autoresearch/docs/design.md
index 8dc2087..bc7b2be 100644
--- a/research/lance-autoresearch/docs/design.md
+++ b/research/lance-autoresearch/docs/design.md
@@ -11,9 +11,13 @@ A multi-target harness for LLM-driven optimization of Lance hot-path kernels.
 right harness shape is identical across them: bit-exact correctness oracle,
 geomean-across-distributions speed metric, single-agent autoresearch loop.
 
-The original [research note](../../docs/research/llm-evolutionary-sampling.md)
-enumerates ten such candidates (A1–A10) clustered by Lance crate. The first
-landed (`pq-l2`) proves the harness shape; the rest follow the same template.
+The original research note that motivated this repo enumerates ten such
+candidates (A1–A10) clustered by Lance crate. The first landed (`pq-l2`) proves
+the harness shape; the rest follow the same template. When this repo lives
+inside the parent OmniGraph workspace, the note is at
+`omnigraph/docs/research/llm-evolutionary-sampling.md`; if the repo is ever
+extracted as a standalone OSS project, the note is part of OmniGraph's
+private design history and won't ship with the extracted repo.
 
 ## Decision: workspace, not single crate