mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-09 01:35:18 +02:00
research: fix lance-autoresearch correctness bugs surfaced by code review
A code review pass found a cluster of real bugs in metrics and contract; fixing them before any agent loop runs against this harness. Critical metric bug: - harness-common::sysinfo::peak_rss_mb read VmPeak (virtual address space high-water-mark, includes mmap'd files / guard pages / untouched allocations) instead of VmHWM (resident pages high-water-mark). The function name and HARNESS.md contract both promised RSS. Every peak_mem_mb row logged under the old code was virtual peak, not RSS. Correctness contract bug: - reference::topk_consistent's tie-tolerance had a flawed neighbor-scan check: when the K-th distance fell in a multi-way tie, agent and reference could legally return different K-sized subsets of the tied band (heap eviction order vs. sort stability), and the neighbor scan required both endpoints to be present, false-negativing legitimate cases. Simplified to a positional distance-tolerance check; ids at the same rank may differ silently because the distance match within tol constrains the swap to a 2*tol band. Diagnostic comment explains the rationale. API hygiene: - Removed dead PqKernel::shape() and ScalarReference::shape() — declared in the public API contract (program.md, kernels.rs comment), required to be stable, never called by the bench / benches / inputs / reference. Now the contract reflects what the bench actually uses. - Removed dead `anyhow` workspace dependency. Determinism: - PRNG seed mixing now uses the SplitMix64 finalizer per part instead of raw XOR. Raw XOR is commutative and small-constant collisions are reachable; mix_seeds iterates the finalizer once per ingredient so distinct (seed, shape, kind) tuples produce distinct streams with vanishingly small collision probability. License headers: - kernels.rs SPDX changed from Apache-2.0 to MIT OR Apache-2.0 to match the crate's Cargo.toml license field (the rest of the crate is dual- licensed). Added matching SPDX headers to reference.rs and inputs.rs. Doc cleanups: - design.md: replaced the broken relative link `../../docs/research/llm-evolutionary-sampling.md` (which resolved inside lance-autoresearch where the note doesn't live) with a path-explained reference noting the note lives in the parent OmniGraph repo and won't ship on extraction. - README.md: clarified that the target table mixes a single landed target with a candidate roadmap — they have no code yet. - HARNESS.md: added exit code 1 (internal error) to the exit-code summary; was documented in run_experiment.rs but not in the loop contract. - adding-a-target.md: dropped the misleading "cp -r plus surgical edits" framing — the workflow rewrites 7 files; what's inherited is Cargo manifest, license headers, workspace registration, and shared utilities. Verified end-to-end: cargo build / clippy / test all green. Baseline trial runs `correctness: pass` exit 0 in ~34s (peak_mem_mb now reads RSS — same workload reports 91 MB, plausibly correct given the temporary fixture-construction buffers). https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
This commit is contained in:
parent
0d72cc69fb
commit
7b1b0b5b75
10 changed files with 75 additions and 44 deletions
|
|
@ -6,9 +6,9 @@ members = [
|
|||
]
|
||||
|
||||
# Each per-target crate sets its own deps. Shared deps below pin versions
|
||||
# uniformly across targets so the workspace lockfile stays clean.
|
||||
# uniformly across targets so the workspace lockfile stays clean. Keep this
|
||||
# list small — only deps actually consumed by ≥2 targets belong here.
|
||||
[workspace.dependencies]
|
||||
anyhow = "1"
|
||||
criterion = { version = "0.5", default-features = false, features = ["plotters", "cargo_bench_support"] }
|
||||
|
||||
[profile.release]
|
||||
|
|
|
|||
|
|
@ -70,6 +70,11 @@ A kernel is **kept** iff:
|
|||
5. Build clean: `cargo build --release` and
|
||||
`cargo clippy --release --all-targets -- -D warnings` both succeed.
|
||||
|
||||
Exit codes summary for `run_experiment`: `0` on success, `1` on internal
|
||||
error (panic, fixture load failure), `2` on correctness failure, `3` on
|
||||
time-budget breach. The agent's loop should treat anything non-zero as
|
||||
"revert; do not log as a measurement."
|
||||
|
||||
Ties break toward simpler code: same speed within ~3% noise → fewer lines /
|
||||
less `unsafe` wins.
|
||||
|
||||
|
|
|
|||
|
|
@ -6,7 +6,11 @@ in the style of Andrej Karpathy's
|
|||
[`nanochat-research`](https://x.com/karpathy/status/1855651423497650238)
|
||||
single-agent autoresearch loop.
|
||||
|
||||
Each target is an independent Rust crate under `crates/`:
|
||||
Each landed target is an independent Rust crate under `crates/`. The
|
||||
candidates below are listed as a roadmap — they have no code yet, only the
|
||||
research-note rationale and a `docs/targets/<name>.md` capsule (when one
|
||||
exists). Spinning up a candidate is the
|
||||
[`docs/adding-a-target.md`](docs/adding-a-target.md) workflow.
|
||||
|
||||
| Target | Status | Lance source area | What's optimized |
|
||||
|---|---|---|---|
|
||||
|
|
|
|||
|
|
@ -1,4 +1,9 @@
|
|||
//! Peak resident-set-size readback (Linux only; non-Linux returns 0).
|
||||
//!
|
||||
//! Reads `VmHWM` from `/proc/self/status` — the high-water-mark of resident
|
||||
//! memory pages, not the high-water-mark of virtual address space. `VmPeak`
|
||||
//! (virtual peak) would include mmap'd files, guard pages, and untouched
|
||||
//! allocations; `VmHWM` is what users mean by "peak memory".
|
||||
|
||||
#[cfg(target_os = "linux")]
|
||||
pub fn peak_rss_mb() -> f64 {
|
||||
|
|
@ -6,7 +11,7 @@ pub fn peak_rss_mb() -> f64 {
|
|||
return 0.0;
|
||||
};
|
||||
for line in s.lines() {
|
||||
if let Some(rest) = line.strip_prefix("VmPeak:") {
|
||||
if let Some(rest) = line.strip_prefix("VmHWM:") {
|
||||
let kb: f64 = rest
|
||||
.split_whitespace()
|
||||
.next()
|
||||
|
|
|
|||
|
|
@ -38,7 +38,6 @@ pub struct PqKernel { /* agent's private fields */ }
|
|||
|
||||
impl PqKernel {
|
||||
pub fn new(shape: PqShape, codebook: &[f32]) -> Self;
|
||||
pub fn shape(&self) -> &PqShape;
|
||||
pub fn distance_table(&self, query: &[f32]) -> Vec<f32>;
|
||||
pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>;
|
||||
}
|
||||
|
|
|
|||
|
|
@ -1,3 +1,5 @@
|
|||
// SPDX-License-Identifier: MIT OR Apache-2.0
|
||||
|
||||
//! IMMUTABLE. Diverse test-data + workload generators.
|
||||
//!
|
||||
//! Two surfaces. `correctness_battery(seed)` yields several
|
||||
|
|
@ -90,7 +92,7 @@ pub fn correctness_battery(seed: u64) -> Vec<CorrectnessCase> {
|
|||
];
|
||||
for &shape in SHAPES {
|
||||
for (kind, label) in kinds {
|
||||
let mut rng = SplitMix64::new(seed ^ shape_hash(shape) ^ kind_hash(*kind));
|
||||
let mut rng = SplitMix64::new(mix_seeds(&[seed, shape_hash(shape), kind_hash(*kind)]));
|
||||
let num_vectors = 4096;
|
||||
let codebook = build_codebook(shape, *kind, CODEBOOK_TRAIN_SIZE, &mut rng);
|
||||
let base = gen_vectors(num_vectors, shape.dim, *kind, &mut rng);
|
||||
|
|
@ -114,7 +116,7 @@ pub fn speed_workloads(seed: u64) -> Vec<SpeedWorkload> {
|
|||
let mut out = Vec::new();
|
||||
for &shape in SHAPES {
|
||||
for &dist in DISTRIBUTIONS {
|
||||
let mut rng = SplitMix64::new(seed ^ shape_hash(shape) ^ dist_hash(dist));
|
||||
let mut rng = SplitMix64::new(mix_seeds(&[seed, shape_hash(shape), dist_hash(dist)]));
|
||||
let kind = match dist {
|
||||
DataDistribution::Clustered => InputKind::Clustered,
|
||||
DataDistribution::Uniform => InputKind::Uniform,
|
||||
|
|
@ -296,6 +298,21 @@ fn encode(shape: PqShape, n: usize, base: &[f32], codebook: &[f32]) -> Vec<u8> {
|
|||
out
|
||||
}
|
||||
|
||||
/// Combine seed components without XOR-collision. Iterates the SplitMix64
|
||||
/// finalizer once per part, so distinct part vectors produce distinct seeds
|
||||
/// with vanishingly small collision probability (vs. raw XOR, where
|
||||
/// `a ^ b ^ c == a ^ c ^ b` and small constants can cancel each other).
|
||||
fn mix_seeds(parts: &[u64]) -> u64 {
|
||||
let mut mixed: u64 = 0;
|
||||
for &p in parts {
|
||||
mixed = mixed.wrapping_add(p).wrapping_add(0x9E37_79B9_7F4A_7C15);
|
||||
mixed = (mixed ^ (mixed >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
|
||||
mixed = (mixed ^ (mixed >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
|
||||
mixed ^= mixed >> 31;
|
||||
}
|
||||
mixed
|
||||
}
|
||||
|
||||
fn shape_hash(s: PqShape) -> u64 {
|
||||
(s.dim as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
|
||||
^ (s.num_sub_vectors as u64).wrapping_mul(0xBF58_476D_1CE4_E5B9)
|
||||
|
|
|
|||
|
|
@ -1,4 +1,4 @@
|
|||
// SPDX-License-Identifier: Apache-2.0
|
||||
// SPDX-License-Identifier: MIT OR Apache-2.0
|
||||
//
|
||||
// AGENT'S PLAYGROUND. This is the file you (the agent) modify.
|
||||
//
|
||||
|
|
@ -14,7 +14,6 @@
|
|||
// PUBLIC API CONTRACT (must remain stable so the bench keeps building):
|
||||
// - `pub struct PqKernel`
|
||||
// - `PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self`
|
||||
// - `PqKernel::shape(&self) -> &PqShape`
|
||||
// - `PqKernel::distance_table(&self, query: &[f32]) -> Vec<f32>`
|
||||
// - `PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>`
|
||||
//
|
||||
|
|
@ -59,10 +58,6 @@ impl PqKernel {
|
|||
}
|
||||
}
|
||||
|
||||
pub fn shape(&self) -> &PqShape {
|
||||
&self.shape
|
||||
}
|
||||
|
||||
/// Asymmetric L2 distance table for one query.
|
||||
///
|
||||
/// Layout of returned `Vec<f32>`: `[num_sub_vectors][num_centroids]` flat
|
||||
|
|
|
|||
|
|
@ -1,3 +1,5 @@
|
|||
// SPDX-License-Identifier: MIT OR Apache-2.0
|
||||
|
||||
//! IMMUTABLE. Scalar reference kernel — defines the math the agent must match.
|
||||
//!
|
||||
//! Same public API as `kernels::PqKernel`, intentionally — it's the bit-for-bit
|
||||
|
|
@ -22,10 +24,6 @@ impl ScalarReference {
|
|||
}
|
||||
}
|
||||
|
||||
pub fn shape(&self) -> &PqShape {
|
||||
&self.shape
|
||||
}
|
||||
|
||||
#[allow(clippy::needless_range_loop)]
|
||||
pub fn distance_table(&self, query: &[f32]) -> Vec<f32> {
|
||||
let s = &self.shape;
|
||||
|
|
@ -87,13 +85,23 @@ pub fn max_abs_err(a: &[f32], b: &[f32]) -> f32 {
|
|||
.fold(0.0f32, f32::max)
|
||||
}
|
||||
|
||||
/// Check two top-K results are equivalent up to:
|
||||
/// - shared distance tolerance `dist_tol`
|
||||
/// - distance-tied id substitution (if two candidates have equal distances,
|
||||
/// either order is acceptable)
|
||||
/// Check two top-K results are equivalent up to per-rank distance tolerance.
|
||||
///
|
||||
/// Returns `Ok(())` on match, or `Err(diagnostic_string)` describing the first
|
||||
/// disagreement found.
|
||||
/// At each rank `i`, asserts `|agent[i].dist - reference[i].dist| <= dist_tol`.
|
||||
/// Ids at the same rank may differ silently. This is correct because:
|
||||
///
|
||||
/// 1. If both kernels compute distances within `dist_tol` of each other,
|
||||
/// differing ids at the same rank means their distances are within a tied
|
||||
/// band of width `2*dist_tol`; both ids legally belong in that band.
|
||||
/// 2. Stronger checks (e.g., set-equality of ids) reject legal cases. When the
|
||||
/// K-th distance is at a multi-way tie, two correct implementations can
|
||||
/// return different K-sized subsets of the tied band — heap eviction order
|
||||
/// in the agent kernel vs. sort stability in the scalar reference.
|
||||
/// Set-equality fails on these legitimately.
|
||||
///
|
||||
/// What this catches: any case where the agent kernel computes a distance that
|
||||
/// disagrees with the scalar reference's distance for the same (query, codes)
|
||||
/// input. The first rank where the math diverges is flagged.
|
||||
pub fn topk_consistent(
|
||||
agent: &[(u32, f32)],
|
||||
reference: &[(u32, f32)],
|
||||
|
|
@ -113,18 +121,6 @@ pub fn topk_consistent(
|
|||
(a_d - r_d).abs()
|
||||
));
|
||||
}
|
||||
if a_id != r_id {
|
||||
// Different id at same rank is acceptable iff this distance is tied
|
||||
// with a neighbor in either result — we accept any permutation
|
||||
// within a tied-distance band.
|
||||
let agent_neighbor_match = agent.iter().any(|(id, d)| id == r_id && (d - r_d).abs() <= dist_tol);
|
||||
let ref_neighbor_match = reference.iter().any(|(id, d)| id == a_id && (d - a_d).abs() <= dist_tol);
|
||||
if !agent_neighbor_match || !ref_neighbor_match {
|
||||
return Err(format!(
|
||||
"topk[{i}] id mismatch with no tie-break excuse: agent=({a_id}, {a_d}) reference=({r_id}, {r_d})"
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
|
|
|||
|
|
@ -1,13 +1,19 @@
|
|||
# Adding a new target
|
||||
|
||||
Walk through this when spinning up a new optimization target (A1 cosine, A4
|
||||
bitpack, etc.). It's a `cp -r` plus surgical edits — no architectural
|
||||
decisions to make per target if the kernel fits the autoresearch shape.
|
||||
bitpack, etc.). The workflow is: copy an existing target as scaffolding,
|
||||
then rewrite the four source files (`lib.rs`, `kernels.rs`, `reference.rs`,
|
||||
`inputs.rs`), the entry point (`bin/run_experiment.rs`), the per-target
|
||||
`program.md`, and add a capsule under `docs/targets/`. The Cargo manifest,
|
||||
workspace registration, license headers, file-layout conventions, gitignore,
|
||||
and shared utilities (`harness-common`) are inherited from the template —
|
||||
that's the part you don't rewrite.
|
||||
|
||||
If your target's per-trial eval is more than ~30 seconds, or the correctness
|
||||
oracle can't be a deterministic comparison against a scalar reference, this
|
||||
harness is the wrong fit — see [`design.md`](design.md) "When to revisit"
|
||||
for the boundary.
|
||||
No architectural decisions are required per target if the kernel fits the
|
||||
autoresearch shape. If your target's per-trial eval is more than ~30 seconds,
|
||||
or the correctness oracle can't be a deterministic comparison against a
|
||||
scalar reference, this harness is the wrong fit — see [`design.md`](design.md)
|
||||
"When to revisit" for the boundary.
|
||||
|
||||
## Steps
|
||||
|
||||
|
|
|
|||
|
|
@ -11,9 +11,13 @@ A multi-target harness for LLM-driven optimization of Lance hot-path kernels.
|
|||
right harness shape is identical across them: bit-exact correctness oracle,
|
||||
geomean-across-distributions speed metric, single-agent autoresearch loop.
|
||||
|
||||
The original [research note](../../docs/research/llm-evolutionary-sampling.md)
|
||||
enumerates ten such candidates (A1–A10) clustered by Lance crate. The first
|
||||
landed (`pq-l2`) proves the harness shape; the rest follow the same template.
|
||||
The original research note that motivated this repo enumerates ten such
|
||||
candidates (A1–A10) clustered by Lance crate. The first landed (`pq-l2`) proves
|
||||
the harness shape; the rest follow the same template. When this repo lives
|
||||
inside the parent OmniGraph workspace, the note is at
|
||||
`omnigraph/docs/research/llm-evolutionary-sampling.md`; if the repo is ever
|
||||
extracted as a standalone OSS project, the note is part of OmniGraph's
|
||||
private design history and won't ship with the extracted repo.
|
||||
|
||||
## Decision: workspace, not single crate
|
||||
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue