research: fix lance-autoresearch correctness bugs surfaced by code review

A code review pass found a cluster of real bugs in metrics and contract;
fixing them before any agent loop runs against this harness.

Critical metric bug:
- harness-common::sysinfo::peak_rss_mb read VmPeak (virtual address space
  high-water-mark, includes mmap'd files / guard pages / untouched
  allocations) instead of VmHWM (resident pages high-water-mark). The
  function name and HARNESS.md contract both promised RSS. Every
  peak_mem_mb row logged under the old code was virtual peak, not RSS.

Correctness contract bug:
- reference::topk_consistent's tie-tolerance had a flawed neighbor-scan
  check: when the K-th distance fell in a multi-way tie, agent and
  reference could legally return different K-sized subsets of the tied
  band (heap eviction order vs. sort stability), and the neighbor scan
  required both endpoints to be present, false-negativing legitimate
  cases. Simplified to a positional distance-tolerance check; ids at the
  same rank may differ silently because the distance match within tol
  constrains the swap to a 2*tol band. Diagnostic comment explains the
  rationale.

API hygiene:
- Removed dead PqKernel::shape() and ScalarReference::shape() — declared
  in the public API contract (program.md, kernels.rs comment), required
  to be stable, never called by the bench / benches / inputs / reference.
  Now the contract reflects what the bench actually uses.
- Removed dead `anyhow` workspace dependency.

Determinism:
- PRNG seed mixing now uses the SplitMix64 finalizer per part instead of
  raw XOR. Raw XOR is commutative and small-constant collisions are
  reachable; mix_seeds iterates the finalizer once per ingredient so
  distinct (seed, shape, kind) tuples produce distinct streams with
  vanishingly small collision probability.

License headers:
- kernels.rs SPDX changed from Apache-2.0 to MIT OR Apache-2.0 to match
  the crate's Cargo.toml license field (the rest of the crate is dual-
  licensed). Added matching SPDX headers to reference.rs and inputs.rs.

Doc cleanups:
- design.md: replaced the broken relative link
  `../../docs/research/llm-evolutionary-sampling.md` (which resolved inside
  lance-autoresearch where the note doesn't live) with a path-explained
  reference noting the note lives in the parent OmniGraph repo and won't
  ship on extraction.
- README.md: clarified that the target table mixes a single landed target
  with a candidate roadmap — they have no code yet.
- HARNESS.md: added exit code 1 (internal error) to the exit-code summary;
  was documented in run_experiment.rs but not in the loop contract.
- adding-a-target.md: dropped the misleading "cp -r plus surgical edits"
  framing — the workflow rewrites 7 files; what's inherited is Cargo
  manifest, license headers, workspace registration, and shared utilities.

Verified end-to-end: cargo build / clippy / test all green. Baseline
trial runs `correctness: pass` exit 0 in ~34s (peak_mem_mb now reads
RSS — same workload reports 91 MB, plausibly correct given the temporary
fixture-construction buffers).

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
This commit is contained in:
Claude 2026-05-15 00:55:57 +00:00
parent 0d72cc69fb
commit 7b1b0b5b75
No known key found for this signature in database
10 changed files with 75 additions and 44 deletions

View file

@ -6,9 +6,9 @@ members = [
]
# Each per-target crate sets its own deps. Shared deps below pin versions
# uniformly across targets so the workspace lockfile stays clean.
# uniformly across targets so the workspace lockfile stays clean. Keep this
# list small — only deps actually consumed by ≥2 targets belong here.
[workspace.dependencies]
anyhow = "1"
criterion = { version = "0.5", default-features = false, features = ["plotters", "cargo_bench_support"] }
[profile.release]

View file

@ -70,6 +70,11 @@ A kernel is **kept** iff:
5. Build clean: `cargo build --release` and
`cargo clippy --release --all-targets -- -D warnings` both succeed.
Exit codes summary for `run_experiment`: `0` on success, `1` on internal
error (panic, fixture load failure), `2` on correctness failure, `3` on
time-budget breach. The agent's loop should treat anything non-zero as
"revert; do not log as a measurement."
Ties break toward simpler code: same speed within ~3% noise → fewer lines /
less `unsafe` wins.

View file

@ -6,7 +6,11 @@ in the style of Andrej Karpathy's
[`nanochat-research`](https://x.com/karpathy/status/1855651423497650238)
single-agent autoresearch loop.
Each target is an independent Rust crate under `crates/`:
Each landed target is an independent Rust crate under `crates/`. The
candidates below are listed as a roadmap — they have no code yet, only the
research-note rationale and a `docs/targets/<name>.md` capsule (when one
exists). Spinning up a candidate is the
[`docs/adding-a-target.md`](docs/adding-a-target.md) workflow.
| Target | Status | Lance source area | What's optimized |
|---|---|---|---|

View file

@ -1,4 +1,9 @@
//! Peak resident-set-size readback (Linux only; non-Linux returns 0).
//!
//! Reads `VmHWM` from `/proc/self/status` — the high-water-mark of resident
//! memory pages, not the high-water-mark of virtual address space. `VmPeak`
//! (virtual peak) would include mmap'd files, guard pages, and untouched
//! allocations; `VmHWM` is what users mean by "peak memory".
#[cfg(target_os = "linux")]
pub fn peak_rss_mb() -> f64 {
@ -6,7 +11,7 @@ pub fn peak_rss_mb() -> f64 {
return 0.0;
};
for line in s.lines() {
if let Some(rest) = line.strip_prefix("VmPeak:") {
if let Some(rest) = line.strip_prefix("VmHWM:") {
let kb: f64 = rest
.split_whitespace()
.next()

View file

@ -38,7 +38,6 @@ pub struct PqKernel { /* agent's private fields */ }
impl PqKernel {
pub fn new(shape: PqShape, codebook: &[f32]) -> Self;
pub fn shape(&self) -> &PqShape;
pub fn distance_table(&self, query: &[f32]) -> Vec<f32>;
pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>;
}

View file

@ -1,3 +1,5 @@
// SPDX-License-Identifier: MIT OR Apache-2.0
//! IMMUTABLE. Diverse test-data + workload generators.
//!
//! Two surfaces. `correctness_battery(seed)` yields several
@ -90,7 +92,7 @@ pub fn correctness_battery(seed: u64) -> Vec<CorrectnessCase> {
];
for &shape in SHAPES {
for (kind, label) in kinds {
let mut rng = SplitMix64::new(seed ^ shape_hash(shape) ^ kind_hash(*kind));
let mut rng = SplitMix64::new(mix_seeds(&[seed, shape_hash(shape), kind_hash(*kind)]));
let num_vectors = 4096;
let codebook = build_codebook(shape, *kind, CODEBOOK_TRAIN_SIZE, &mut rng);
let base = gen_vectors(num_vectors, shape.dim, *kind, &mut rng);
@ -114,7 +116,7 @@ pub fn speed_workloads(seed: u64) -> Vec<SpeedWorkload> {
let mut out = Vec::new();
for &shape in SHAPES {
for &dist in DISTRIBUTIONS {
let mut rng = SplitMix64::new(seed ^ shape_hash(shape) ^ dist_hash(dist));
let mut rng = SplitMix64::new(mix_seeds(&[seed, shape_hash(shape), dist_hash(dist)]));
let kind = match dist {
DataDistribution::Clustered => InputKind::Clustered,
DataDistribution::Uniform => InputKind::Uniform,
@ -296,6 +298,21 @@ fn encode(shape: PqShape, n: usize, base: &[f32], codebook: &[f32]) -> Vec<u8> {
out
}
/// Combine seed components without XOR-collision. Iterates the SplitMix64
/// finalizer once per part, so distinct part vectors produce distinct seeds
/// with vanishingly small collision probability (vs. raw XOR, where
/// `a ^ b ^ c == a ^ c ^ b` and small constants can cancel each other).
fn mix_seeds(parts: &[u64]) -> u64 {
let mut mixed: u64 = 0;
for &p in parts {
mixed = mixed.wrapping_add(p).wrapping_add(0x9E37_79B9_7F4A_7C15);
mixed = (mixed ^ (mixed >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
mixed = (mixed ^ (mixed >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
mixed ^= mixed >> 31;
}
mixed
}
fn shape_hash(s: PqShape) -> u64 {
(s.dim as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
^ (s.num_sub_vectors as u64).wrapping_mul(0xBF58_476D_1CE4_E5B9)

View file

@ -1,4 +1,4 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-License-Identifier: MIT OR Apache-2.0
//
// AGENT'S PLAYGROUND. This is the file you (the agent) modify.
//
@ -14,7 +14,6 @@
// PUBLIC API CONTRACT (must remain stable so the bench keeps building):
// - `pub struct PqKernel`
// - `PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self`
// - `PqKernel::shape(&self) -> &PqShape`
// - `PqKernel::distance_table(&self, query: &[f32]) -> Vec<f32>`
// - `PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>`
//
@ -59,10 +58,6 @@ impl PqKernel {
}
}
pub fn shape(&self) -> &PqShape {
&self.shape
}
/// Asymmetric L2 distance table for one query.
///
/// Layout of returned `Vec<f32>`: `[num_sub_vectors][num_centroids]` flat

View file

@ -1,3 +1,5 @@
// SPDX-License-Identifier: MIT OR Apache-2.0
//! IMMUTABLE. Scalar reference kernel — defines the math the agent must match.
//!
//! Same public API as `kernels::PqKernel`, intentionally — it's the bit-for-bit
@ -22,10 +24,6 @@ impl ScalarReference {
}
}
pub fn shape(&self) -> &PqShape {
&self.shape
}
#[allow(clippy::needless_range_loop)]
pub fn distance_table(&self, query: &[f32]) -> Vec<f32> {
let s = &self.shape;
@ -87,13 +85,23 @@ pub fn max_abs_err(a: &[f32], b: &[f32]) -> f32 {
.fold(0.0f32, f32::max)
}
/// Check two top-K results are equivalent up to:
/// - shared distance tolerance `dist_tol`
/// - distance-tied id substitution (if two candidates have equal distances,
/// either order is acceptable)
/// Check two top-K results are equivalent up to per-rank distance tolerance.
///
/// Returns `Ok(())` on match, or `Err(diagnostic_string)` describing the first
/// disagreement found.
/// At each rank `i`, asserts `|agent[i].dist - reference[i].dist| <= dist_tol`.
/// Ids at the same rank may differ silently. This is correct because:
///
/// 1. If both kernels compute distances within `dist_tol` of each other,
/// differing ids at the same rank means their distances are within a tied
/// band of width `2*dist_tol`; both ids legally belong in that band.
/// 2. Stronger checks (e.g., set-equality of ids) reject legal cases. When the
/// K-th distance is at a multi-way tie, two correct implementations can
/// return different K-sized subsets of the tied band — heap eviction order
/// in the agent kernel vs. sort stability in the scalar reference.
/// Set-equality fails on these legitimately.
///
/// What this catches: any case where the agent kernel computes a distance that
/// disagrees with the scalar reference's distance for the same (query, codes)
/// input. The first rank where the math diverges is flagged.
pub fn topk_consistent(
agent: &[(u32, f32)],
reference: &[(u32, f32)],
@ -113,18 +121,6 @@ pub fn topk_consistent(
(a_d - r_d).abs()
));
}
if a_id != r_id {
// Different id at same rank is acceptable iff this distance is tied
// with a neighbor in either result — we accept any permutation
// within a tied-distance band.
let agent_neighbor_match = agent.iter().any(|(id, d)| id == r_id && (d - r_d).abs() <= dist_tol);
let ref_neighbor_match = reference.iter().any(|(id, d)| id == a_id && (d - a_d).abs() <= dist_tol);
if !agent_neighbor_match || !ref_neighbor_match {
return Err(format!(
"topk[{i}] id mismatch with no tie-break excuse: agent=({a_id}, {a_d}) reference=({r_id}, {r_d})"
));
}
}
}
Ok(())
}

View file

@ -1,13 +1,19 @@
# Adding a new target
Walk through this when spinning up a new optimization target (A1 cosine, A4
bitpack, etc.). It's a `cp -r` plus surgical edits — no architectural
decisions to make per target if the kernel fits the autoresearch shape.
bitpack, etc.). The workflow is: copy an existing target as scaffolding,
then rewrite the four source files (`lib.rs`, `kernels.rs`, `reference.rs`,
`inputs.rs`), the entry point (`bin/run_experiment.rs`), the per-target
`program.md`, and add a capsule under `docs/targets/`. The Cargo manifest,
workspace registration, license headers, file-layout conventions, gitignore,
and shared utilities (`harness-common`) are inherited from the template —
that's the part you don't rewrite.
If your target's per-trial eval is more than ~30 seconds, or the correctness
oracle can't be a deterministic comparison against a scalar reference, this
harness is the wrong fit — see [`design.md`](design.md) "When to revisit"
for the boundary.
No architectural decisions are required per target if the kernel fits the
autoresearch shape. If your target's per-trial eval is more than ~30 seconds,
or the correctness oracle can't be a deterministic comparison against a
scalar reference, this harness is the wrong fit — see [`design.md`](design.md)
"When to revisit" for the boundary.
## Steps

View file

@ -11,9 +11,13 @@ A multi-target harness for LLM-driven optimization of Lance hot-path kernels.
right harness shape is identical across them: bit-exact correctness oracle,
geomean-across-distributions speed metric, single-agent autoresearch loop.
The original [research note](../../docs/research/llm-evolutionary-sampling.md)
enumerates ten such candidates (A1A10) clustered by Lance crate. The first
landed (`pq-l2`) proves the harness shape; the rest follow the same template.
The original research note that motivated this repo enumerates ten such
candidates (A1A10) clustered by Lance crate. The first landed (`pq-l2`) proves
the harness shape; the rest follow the same template. When this repo lives
inside the parent OmniGraph workspace, the note is at
`omnigraph/docs/research/llm-evolutionary-sampling.md`; if the repo is ever
extracted as a standalone OSS project, the note is part of OmniGraph's
private design history and won't ship with the extracted repo.
## Decision: workspace, not single crate