From 7b1b0b5b75ee59c3351ca92bbc91592d79189e5f Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 15 May 2026 00:55:57 +0000 Subject: [PATCH] research: fix lance-autoresearch correctness bugs surfaced by code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A code review pass found a cluster of real bugs in metrics and contract; fixing them before any agent loop runs against this harness. Critical metric bug: - harness-common::sysinfo::peak_rss_mb read VmPeak (virtual address space high-water-mark, includes mmap'd files / guard pages / untouched allocations) instead of VmHWM (resident pages high-water-mark). The function name and HARNESS.md contract both promised RSS. Every peak_mem_mb row logged under the old code was virtual peak, not RSS. Correctness contract bug: - reference::topk_consistent's tie-tolerance had a flawed neighbor-scan check: when the K-th distance fell in a multi-way tie, agent and reference could legally return different K-sized subsets of the tied band (heap eviction order vs. sort stability), and the neighbor scan required both endpoints to be present, false-negativing legitimate cases. Simplified to a positional distance-tolerance check; ids at the same rank may differ silently because the distance match within tol constrains the swap to a 2*tol band. Diagnostic comment explains the rationale. API hygiene: - Removed dead PqKernel::shape() and ScalarReference::shape() — declared in the public API contract (program.md, kernels.rs comment), required to be stable, never called by the bench / benches / inputs / reference. Now the contract reflects what the bench actually uses. - Removed dead `anyhow` workspace dependency. Determinism: - PRNG seed mixing now uses the SplitMix64 finalizer per part instead of raw XOR. Raw XOR is commutative and small-constant collisions are reachable; mix_seeds iterates the finalizer once per ingredient so distinct (seed, shape, kind) tuples produce distinct streams with vanishingly small collision probability. License headers: - kernels.rs SPDX changed from Apache-2.0 to MIT OR Apache-2.0 to match the crate's Cargo.toml license field (the rest of the crate is dual- licensed). Added matching SPDX headers to reference.rs and inputs.rs. Doc cleanups: - design.md: replaced the broken relative link `../../docs/research/llm-evolutionary-sampling.md` (which resolved inside lance-autoresearch where the note doesn't live) with a path-explained reference noting the note lives in the parent OmniGraph repo and won't ship on extraction. - README.md: clarified that the target table mixes a single landed target with a candidate roadmap — they have no code yet. - HARNESS.md: added exit code 1 (internal error) to the exit-code summary; was documented in run_experiment.rs but not in the loop contract. - adding-a-target.md: dropped the misleading "cp -r plus surgical edits" framing — the workflow rewrites 7 files; what's inherited is Cargo manifest, license headers, workspace registration, and shared utilities. Verified end-to-end: cargo build / clippy / test all green. Baseline trial runs `correctness: pass` exit 0 in ~34s (peak_mem_mb now reads RSS — same workload reports 91 MB, plausibly correct given the temporary fixture-construction buffers). https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5 --- research/lance-autoresearch/Cargo.toml | 4 +- research/lance-autoresearch/HARNESS.md | 5 +++ research/lance-autoresearch/README.md | 6 ++- .../crates/harness-common/src/sysinfo.rs | 7 +++- .../crates/pq-l2/program.md | 1 - .../crates/pq-l2/src/inputs.rs | 21 +++++++++- .../crates/pq-l2/src/kernels.rs | 7 +--- .../crates/pq-l2/src/reference.rs | 40 +++++++++---------- .../docs/adding-a-target.md | 18 ++++++--- research/lance-autoresearch/docs/design.md | 10 +++-- 10 files changed, 75 insertions(+), 44 deletions(-) diff --git a/research/lance-autoresearch/Cargo.toml b/research/lance-autoresearch/Cargo.toml index a56f436..bfd6342 100644 --- a/research/lance-autoresearch/Cargo.toml +++ b/research/lance-autoresearch/Cargo.toml @@ -6,9 +6,9 @@ members = [ ] # Each per-target crate sets its own deps. Shared deps below pin versions -# uniformly across targets so the workspace lockfile stays clean. +# uniformly across targets so the workspace lockfile stays clean. Keep this +# list small — only deps actually consumed by ≥2 targets belong here. [workspace.dependencies] -anyhow = "1" criterion = { version = "0.5", default-features = false, features = ["plotters", "cargo_bench_support"] } [profile.release] diff --git a/research/lance-autoresearch/HARNESS.md b/research/lance-autoresearch/HARNESS.md index 25defc7..6a6dfc8 100644 --- a/research/lance-autoresearch/HARNESS.md +++ b/research/lance-autoresearch/HARNESS.md @@ -70,6 +70,11 @@ A kernel is **kept** iff: 5. Build clean: `cargo build --release` and `cargo clippy --release --all-targets -- -D warnings` both succeed. +Exit codes summary for `run_experiment`: `0` on success, `1` on internal +error (panic, fixture load failure), `2` on correctness failure, `3` on +time-budget breach. The agent's loop should treat anything non-zero as +"revert; do not log as a measurement." + Ties break toward simpler code: same speed within ~3% noise → fewer lines / less `unsafe` wins. diff --git a/research/lance-autoresearch/README.md b/research/lance-autoresearch/README.md index 7f52b77..88687ca 100644 --- a/research/lance-autoresearch/README.md +++ b/research/lance-autoresearch/README.md @@ -6,7 +6,11 @@ in the style of Andrej Karpathy's [`nanochat-research`](https://x.com/karpathy/status/1855651423497650238) single-agent autoresearch loop. -Each target is an independent Rust crate under `crates/`: +Each landed target is an independent Rust crate under `crates/`. The +candidates below are listed as a roadmap — they have no code yet, only the +research-note rationale and a `docs/targets/.md` capsule (when one +exists). Spinning up a candidate is the +[`docs/adding-a-target.md`](docs/adding-a-target.md) workflow. | Target | Status | Lance source area | What's optimized | |---|---|---|---| diff --git a/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs b/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs index d389ff4..44af29d 100644 --- a/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs +++ b/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs @@ -1,4 +1,9 @@ //! Peak resident-set-size readback (Linux only; non-Linux returns 0). +//! +//! Reads `VmHWM` from `/proc/self/status` — the high-water-mark of resident +//! memory pages, not the high-water-mark of virtual address space. `VmPeak` +//! (virtual peak) would include mmap'd files, guard pages, and untouched +//! allocations; `VmHWM` is what users mean by "peak memory". #[cfg(target_os = "linux")] pub fn peak_rss_mb() -> f64 { @@ -6,7 +11,7 @@ pub fn peak_rss_mb() -> f64 { return 0.0; }; for line in s.lines() { - if let Some(rest) = line.strip_prefix("VmPeak:") { + if let Some(rest) = line.strip_prefix("VmHWM:") { let kb: f64 = rest .split_whitespace() .next() diff --git a/research/lance-autoresearch/crates/pq-l2/program.md b/research/lance-autoresearch/crates/pq-l2/program.md index 7e778fb..1b8b9ec 100644 --- a/research/lance-autoresearch/crates/pq-l2/program.md +++ b/research/lance-autoresearch/crates/pq-l2/program.md @@ -38,7 +38,6 @@ pub struct PqKernel { /* agent's private fields */ } impl PqKernel { pub fn new(shape: PqShape, codebook: &[f32]) -> Self; - pub fn shape(&self) -> &PqShape; pub fn distance_table(&self, query: &[f32]) -> Vec; pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>; } diff --git a/research/lance-autoresearch/crates/pq-l2/src/inputs.rs b/research/lance-autoresearch/crates/pq-l2/src/inputs.rs index 78917e5..29df4ef 100644 --- a/research/lance-autoresearch/crates/pq-l2/src/inputs.rs +++ b/research/lance-autoresearch/crates/pq-l2/src/inputs.rs @@ -1,3 +1,5 @@ +// SPDX-License-Identifier: MIT OR Apache-2.0 + //! IMMUTABLE. Diverse test-data + workload generators. //! //! Two surfaces. `correctness_battery(seed)` yields several @@ -90,7 +92,7 @@ pub fn correctness_battery(seed: u64) -> Vec { ]; for &shape in SHAPES { for (kind, label) in kinds { - let mut rng = SplitMix64::new(seed ^ shape_hash(shape) ^ kind_hash(*kind)); + let mut rng = SplitMix64::new(mix_seeds(&[seed, shape_hash(shape), kind_hash(*kind)])); let num_vectors = 4096; let codebook = build_codebook(shape, *kind, CODEBOOK_TRAIN_SIZE, &mut rng); let base = gen_vectors(num_vectors, shape.dim, *kind, &mut rng); @@ -114,7 +116,7 @@ pub fn speed_workloads(seed: u64) -> Vec { let mut out = Vec::new(); for &shape in SHAPES { for &dist in DISTRIBUTIONS { - let mut rng = SplitMix64::new(seed ^ shape_hash(shape) ^ dist_hash(dist)); + let mut rng = SplitMix64::new(mix_seeds(&[seed, shape_hash(shape), dist_hash(dist)])); let kind = match dist { DataDistribution::Clustered => InputKind::Clustered, DataDistribution::Uniform => InputKind::Uniform, @@ -296,6 +298,21 @@ fn encode(shape: PqShape, n: usize, base: &[f32], codebook: &[f32]) -> Vec { out } +/// Combine seed components without XOR-collision. Iterates the SplitMix64 +/// finalizer once per part, so distinct part vectors produce distinct seeds +/// with vanishingly small collision probability (vs. raw XOR, where +/// `a ^ b ^ c == a ^ c ^ b` and small constants can cancel each other). +fn mix_seeds(parts: &[u64]) -> u64 { + let mut mixed: u64 = 0; + for &p in parts { + mixed = mixed.wrapping_add(p).wrapping_add(0x9E37_79B9_7F4A_7C15); + mixed = (mixed ^ (mixed >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9); + mixed = (mixed ^ (mixed >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB); + mixed ^= mixed >> 31; + } + mixed +} + fn shape_hash(s: PqShape) -> u64 { (s.dim as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15) ^ (s.num_sub_vectors as u64).wrapping_mul(0xBF58_476D_1CE4_E5B9) diff --git a/research/lance-autoresearch/crates/pq-l2/src/kernels.rs b/research/lance-autoresearch/crates/pq-l2/src/kernels.rs index c4c1a70..3e42e2a 100644 --- a/research/lance-autoresearch/crates/pq-l2/src/kernels.rs +++ b/research/lance-autoresearch/crates/pq-l2/src/kernels.rs @@ -1,4 +1,4 @@ -// SPDX-License-Identifier: Apache-2.0 +// SPDX-License-Identifier: MIT OR Apache-2.0 // // AGENT'S PLAYGROUND. This is the file you (the agent) modify. // @@ -14,7 +14,6 @@ // PUBLIC API CONTRACT (must remain stable so the bench keeps building): // - `pub struct PqKernel` // - `PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self` -// - `PqKernel::shape(&self) -> &PqShape` // - `PqKernel::distance_table(&self, query: &[f32]) -> Vec` // - `PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>` // @@ -59,10 +58,6 @@ impl PqKernel { } } - pub fn shape(&self) -> &PqShape { - &self.shape - } - /// Asymmetric L2 distance table for one query. /// /// Layout of returned `Vec`: `[num_sub_vectors][num_centroids]` flat diff --git a/research/lance-autoresearch/crates/pq-l2/src/reference.rs b/research/lance-autoresearch/crates/pq-l2/src/reference.rs index 72a7c54..57cfd84 100644 --- a/research/lance-autoresearch/crates/pq-l2/src/reference.rs +++ b/research/lance-autoresearch/crates/pq-l2/src/reference.rs @@ -1,3 +1,5 @@ +// SPDX-License-Identifier: MIT OR Apache-2.0 + //! IMMUTABLE. Scalar reference kernel — defines the math the agent must match. //! //! Same public API as `kernels::PqKernel`, intentionally — it's the bit-for-bit @@ -22,10 +24,6 @@ impl ScalarReference { } } - pub fn shape(&self) -> &PqShape { - &self.shape - } - #[allow(clippy::needless_range_loop)] pub fn distance_table(&self, query: &[f32]) -> Vec { let s = &self.shape; @@ -87,13 +85,23 @@ pub fn max_abs_err(a: &[f32], b: &[f32]) -> f32 { .fold(0.0f32, f32::max) } -/// Check two top-K results are equivalent up to: -/// - shared distance tolerance `dist_tol` -/// - distance-tied id substitution (if two candidates have equal distances, -/// either order is acceptable) +/// Check two top-K results are equivalent up to per-rank distance tolerance. /// -/// Returns `Ok(())` on match, or `Err(diagnostic_string)` describing the first -/// disagreement found. +/// At each rank `i`, asserts `|agent[i].dist - reference[i].dist| <= dist_tol`. +/// Ids at the same rank may differ silently. This is correct because: +/// +/// 1. If both kernels compute distances within `dist_tol` of each other, +/// differing ids at the same rank means their distances are within a tied +/// band of width `2*dist_tol`; both ids legally belong in that band. +/// 2. Stronger checks (e.g., set-equality of ids) reject legal cases. When the +/// K-th distance is at a multi-way tie, two correct implementations can +/// return different K-sized subsets of the tied band — heap eviction order +/// in the agent kernel vs. sort stability in the scalar reference. +/// Set-equality fails on these legitimately. +/// +/// What this catches: any case where the agent kernel computes a distance that +/// disagrees with the scalar reference's distance for the same (query, codes) +/// input. The first rank where the math diverges is flagged. pub fn topk_consistent( agent: &[(u32, f32)], reference: &[(u32, f32)], @@ -113,18 +121,6 @@ pub fn topk_consistent( (a_d - r_d).abs() )); } - if a_id != r_id { - // Different id at same rank is acceptable iff this distance is tied - // with a neighbor in either result — we accept any permutation - // within a tied-distance band. - let agent_neighbor_match = agent.iter().any(|(id, d)| id == r_id && (d - r_d).abs() <= dist_tol); - let ref_neighbor_match = reference.iter().any(|(id, d)| id == a_id && (d - a_d).abs() <= dist_tol); - if !agent_neighbor_match || !ref_neighbor_match { - return Err(format!( - "topk[{i}] id mismatch with no tie-break excuse: agent=({a_id}, {a_d}) reference=({r_id}, {r_d})" - )); - } - } } Ok(()) } diff --git a/research/lance-autoresearch/docs/adding-a-target.md b/research/lance-autoresearch/docs/adding-a-target.md index 4e3cecc..2cead94 100644 --- a/research/lance-autoresearch/docs/adding-a-target.md +++ b/research/lance-autoresearch/docs/adding-a-target.md @@ -1,13 +1,19 @@ # Adding a new target Walk through this when spinning up a new optimization target (A1 cosine, A4 -bitpack, etc.). It's a `cp -r` plus surgical edits — no architectural -decisions to make per target if the kernel fits the autoresearch shape. +bitpack, etc.). The workflow is: copy an existing target as scaffolding, +then rewrite the four source files (`lib.rs`, `kernels.rs`, `reference.rs`, +`inputs.rs`), the entry point (`bin/run_experiment.rs`), the per-target +`program.md`, and add a capsule under `docs/targets/`. The Cargo manifest, +workspace registration, license headers, file-layout conventions, gitignore, +and shared utilities (`harness-common`) are inherited from the template — +that's the part you don't rewrite. -If your target's per-trial eval is more than ~30 seconds, or the correctness -oracle can't be a deterministic comparison against a scalar reference, this -harness is the wrong fit — see [`design.md`](design.md) "When to revisit" -for the boundary. +No architectural decisions are required per target if the kernel fits the +autoresearch shape. If your target's per-trial eval is more than ~30 seconds, +or the correctness oracle can't be a deterministic comparison against a +scalar reference, this harness is the wrong fit — see [`design.md`](design.md) +"When to revisit" for the boundary. ## Steps diff --git a/research/lance-autoresearch/docs/design.md b/research/lance-autoresearch/docs/design.md index 8dc2087..bc7b2be 100644 --- a/research/lance-autoresearch/docs/design.md +++ b/research/lance-autoresearch/docs/design.md @@ -11,9 +11,13 @@ A multi-target harness for LLM-driven optimization of Lance hot-path kernels. right harness shape is identical across them: bit-exact correctness oracle, geomean-across-distributions speed metric, single-agent autoresearch loop. -The original [research note](../../docs/research/llm-evolutionary-sampling.md) -enumerates ten such candidates (A1–A10) clustered by Lance crate. The first -landed (`pq-l2`) proves the harness shape; the rest follow the same template. +The original research note that motivated this repo enumerates ten such +candidates (A1–A10) clustered by Lance crate. The first landed (`pq-l2`) proves +the harness shape; the rest follow the same template. When this repo lives +inside the parent OmniGraph workspace, the note is at +`omnigraph/docs/research/llm-evolutionary-sampling.md`; if the repo is ever +extracted as a standalone OSS project, the note is part of OmniGraph's +private design history and won't ship with the extracted repo. ## Decision: workspace, not single crate