mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-27 02:39:38 +02:00

research: fix lance-autoresearch correctness bugs surfaced by code review

A code review pass found a cluster of real bugs in metrics and contract;
fixing them before any agent loop runs against this harness.

Critical metric bug:
- harness-common::sysinfo::peak_rss_mb read VmPeak (virtual address space
  high-water-mark, includes mmap'd files / guard pages / untouched
  allocations) instead of VmHWM (resident pages high-water-mark). The
  function name and HARNESS.md contract both promised RSS. Every
  peak_mem_mb row logged under the old code was virtual peak, not RSS.

Correctness contract bug:
- reference::topk_consistent's tie-tolerance had a flawed neighbor-scan
  check: when the K-th distance fell in a multi-way tie, agent and
  reference could legally return different K-sized subsets of the tied
  band (heap eviction order vs. sort stability), and the neighbor scan
  required both endpoints to be present, false-negativing legitimate
  cases. Simplified to a positional distance-tolerance check; ids at the
  same rank may differ silently because the distance match within tol
  constrains the swap to a 2*tol band. Diagnostic comment explains the
  rationale.

API hygiene:
- Removed dead PqKernel::shape() and ScalarReference::shape() — declared
  in the public API contract (program.md, kernels.rs comment), required
  to be stable, never called by the bench / benches / inputs / reference.
  Now the contract reflects what the bench actually uses.
- Removed dead `anyhow` workspace dependency.

Determinism:
- PRNG seed mixing now uses the SplitMix64 finalizer per part instead of
  raw XOR. Raw XOR is commutative and small-constant collisions are
  reachable; mix_seeds iterates the finalizer once per ingredient so
  distinct (seed, shape, kind) tuples produce distinct streams with
  vanishingly small collision probability.

License headers:
- kernels.rs SPDX changed from Apache-2.0 to MIT OR Apache-2.0 to match
  the crate's Cargo.toml license field (the rest of the crate is dual-
  licensed). Added matching SPDX headers to reference.rs and inputs.rs.

Doc cleanups:
- design.md: replaced the broken relative link
  `../../docs/research/llm-evolutionary-sampling.md` (which resolved inside
  lance-autoresearch where the note doesn't live) with a path-explained
  reference noting the note lives in the parent OmniGraph repo and won't
  ship on extraction.
- README.md: clarified that the target table mixes a single landed target
  with a candidate roadmap; the roadmap candidates have no code yet.
- HARNESS.md: added exit code 1 (internal error) to the exit-code summary;
  was documented in run_experiment.rs but not in the loop contract.
- adding-a-target.md: dropped the misleading "cp -r plus surgical edits"
  framing — the workflow rewrites 7 files; what's inherited is Cargo
  manifest, license headers, workspace registration, and shared utilities.

Verified end-to-end: cargo build / clippy / test all green. Baseline
trial runs `correctness: pass` exit 0 in ~34s (peak_mem_mb now reads
RSS — same workload reports 91 MB, plausibly correct given the temporary
fixture-construction buffers).

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5

2026-05-15 00:55:57 +00:00

7.8 KiB


Design — why the workspace is shaped this way

This document records the rationale for the multi-target workspace shape so future contributors don't relitigate the early decisions.

The thing we're building

A multi-target harness for LLM-driven optimization of Lance hot-path kernels. "Multi-target" because Lance has many such kernels — distance kernels in lance-linalg, decoders in lance-encoding, scan/merge kernels — and the right harness shape is identical across them: bit-exact correctness oracle, geomean-across-distributions speed metric, single-agent autoresearch loop.

The original research note that motivated this repo enumerates ten such candidates (A1–A10) clustered by Lance crate. The first landed target (pq-l2) proves the harness shape; the rest follow the same template. When this repo lives inside the parent OmniGraph workspace, the note is at omnigraph/docs/research/llm-evolutionary-sampling.md. If the repo is ever extracted as a standalone OSS project, the note stays behind: it is part of OmniGraph's private design history and won't ship with the extracted repo.

Decision: workspace, not single crate

A single crate exposing multiple binaries (run_experiment_pq_l2, run_experiment_bitpack, ...) was the obvious-looking alternative. Rejected for three reasons:

1. Per-target deps differ. FSST decode wants different deps than PQ kernels (a string-compression library vs. just f32 math). A single Cargo.toml would either bundle every target's deps into every build or require fine-grained features. Workspaces give per-target Cargo.toml for free.
2. Edit isolation. The agent edits one target's kernels.rs at a time. In a single crate, kernels.rs files would collide on path or have to live in target-specific submodules with target-specific naming. Per-target crates put src/kernels.rs at the natural location every time and let the agent navigate one tree per session.
3. Build / test isolation. cargo build -p pq-l2 builds only what's needed for the PQ L2 target; cargo test -p pq-l2 runs only its tests. The agent's iteration loop is faster because it doesn't pay for unrelated targets' compile time.

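The downside is a one-time cost: workspace boilerplate, per-target Cargo.toml, and the empty [workspace] block at the workspace root that prevents cargo from walking up to the parent omnigraph workspace. Adding a new target starts from a copy of an existing target crate, but it is more than a cp -r plus path edits: the workflow rewrites seven files, and what carries over is the Cargo manifest, license headers, workspace registration, and shared utilities.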

Decision: shared `harness-common` crate, no `Target` trait

A Target trait was the other obvious-looking alternative: express the common loop generically and plug in target-specific types. Rejected because:

- Kernel signatures vary too much for a single trait shape. PQ probe_top_k returns Vec<(u32, f32)>. Bitpack decode returns an IntArray. FSST decode returns Vec<u8>. Predicate evaluation returns a BooleanArray. A unifying trait would need erased boxing or a wide associated-type surface, both of which obscure the actual hot path the agent is editing.
- The orchestration that is shared is small. A deterministic PRNG (~30 lines), a geomean (~10 lines), peak RSS readback (~20 lines), four tolerance constants. Total ~70 lines of shared code. Building a trait abstraction over 70 lines costs more than it saves.
- The output format isn't worth sharing. Each target's run_experiment.rs prints a fixed-format result block; the fields differ per target (PQ shapes vs bit widths vs distribution kinds). A shared formatter would be either trivial wrapping of println! (no value) or a complicated builder API (negative value).

harness-common therefore exposes plumbing only: SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS. Each target consumes what it needs. The shared loop contract is documented in HARNESS.md, not encoded in code.
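
To make the size of that plumbing concrete, here is a minimal sketch of what such helpers could look like. It is not the crate's actual source: the names SplitMix64, mix_seeds, geomean, and peak_rss_mb come from this repo's docs and commit history, while `splitmix64_finalize`, `next_f32`, `parse_vm_hwm_mb`, and every signature below are assumptions made for illustration. The RSS readback is Linux-only and reads VmHWM (resident high-water mark), not VmPeak.

```rust
/// SplitMix64 finalizer: a bijective 64-bit mixing function.
/// (Helper name is hypothetical.)
fn splitmix64_finalize(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Deterministic PRNG: same seed, same stream.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        splitmix64_finalize(self.state)
    }

    /// Uniform f32 in [0, 1), built from the top 24 bits.
    pub fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u32 << 24) as f32
    }
}

/// Fold seed ingredients (e.g. seed, shape, kind) into one seed.
/// The finalizer runs once per part, so the result is order-sensitive,
/// unlike a raw XOR of the parts.
pub fn mix_seeds(parts: &[u64]) -> u64 {
    parts.iter().fold(0u64, |acc, &p| {
        splitmix64_finalize(acc.wrapping_add(0x9E37_79B9_7F4A_7C15) ^ p)
    })
}

/// Geometric mean of strictly positive values, computed in log space.
pub fn geomean(xs: &[f64]) -> f64 {
    assert!(!xs.is_empty(), "geomean of an empty slice is undefined");
    let log_sum: f64 = xs.iter().map(|x| x.ln()).sum();
    (log_sum / xs.len() as f64).exp()
}

/// Extract VmHWM from the text of /proc/<pid>/status and convert the
/// kernel's kB figure to MB (divide by 1024). Hypothetical helper.
pub fn parse_vm_hwm_mb(status: &str) -> Option<f64> {
    let line = status.lines().find(|l| l.starts_with("VmHWM:"))?;
    let kb: f64 = line.split_whitespace().nth(1)?.parse().ok()?;
    Some(kb / 1024.0)
}

/// Peak resident set size of the current process; None off Linux.
pub fn peak_rss_mb() -> Option<f64> {
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_hwm_mb(&status)
}
```

A target would typically derive one seed per workload with something like `mix_seeds(&[base_seed, shape_id, kind_id])` and feed it to `SplitMix64::new`, so that each (seed, shape, kind) tuple gets its own stream.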

Decision: per-target `program.md` + shared `HARNESS.md`

The agent reads two files at session start:

- HARNESS.md (workspace-level), universal: the loop, the metric, the edit-permission table, hygiene rules.
- crates/<target>/program.md (per-target), specific: the kernel API the agent must keep stable, target-specific priors (which SIMD intrinsics tend to win on this kernel shape), the results.tsv column header.

The shape mirrors how Karpathy's nanochat-research program.md works, factored across the dimension that varies (per target) vs. doesn't (the loop itself). Two files instead of one because copy-pasting the universal loop into every program.md makes them drift.

Decision: dataset-independent oracle for every target

The first iteration of the harness used recall@K vs. SIFT1M as the correctness oracle. We replaced it with bit-exact (or near-bit-exact for floats) match against a scalar reference because:

- The agent had incentive to overfit lossy approximations to the dataset's cluster structure, even though we didn't ask for that.
- SIFT1M is 250 MB and a hassle to download; the harness benefited from being self-contained.
- Mathematical equivalence is a strictly stronger contract than recall preservation: if the kernel is bit-equivalent to the scalar reference, recall is automatically identical because the distance values are the same. There's nothing recall@K catches that bit-exactness doesn't.

This decision generalizes to every target. Decode kernels get strict bitwise equality (no float arithmetic involved). Distance and BM25 kernels get max_abs_err ≤ 1e-4 (loose enough for SIMD-accumulator reordering, tight enough for real bugs). Targets that genuinely require lossy techniques to get headroom — there might be some; LUT u8 quantization in PQ is one — go in a separate "lossy track" with a recall-based oracle on diverse datasets, not the bit-exact track.
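
The three oracle flavors described above can be sketched as follows. This is an illustration under stated assumptions, not the code in reference.rs: the 1e-4 value for MAX_ABS_ERR is from this document, but the value shown for TOPK_DIST_TOL is a placeholder (its real value is not stated here), and the function names other than topk_consistent are hypothetical. The top-K check is positional: it compares distances rank by rank and lets ids differ, which is what tolerates ties at the K-th distance.

```rust
/// From the text above: loose enough for SIMD-accumulator reordering.
pub const MAX_ABS_ERR: f32 = 1e-4;
/// Placeholder value; the real constant lives in harness-common.
pub const TOPK_DIST_TOL: f32 = 1e-4;

/// Decode kernels: strict bitwise equality with the scalar reference.
pub fn bitwise_equal(got: &[u8], want: &[u8]) -> bool {
    got == want
}

/// Largest absolute elementwise difference, or None when the lengths
/// differ or any difference is NaN (so NaN can never pass).
pub fn max_abs_err(got: &[f32], want: &[f32]) -> Option<f32> {
    if got.len() != want.len() {
        return None;
    }
    let mut worst = 0.0f32;
    for (g, w) in got.iter().zip(want) {
        let d = (g - w).abs();
        if d.is_nan() {
            return None;
        }
        worst = worst.max(d);
    }
    Some(worst)
}

/// Distance-style kernels: pass when max_abs_err <= MAX_ABS_ERR.
pub fn floats_match(got: &[f32], want: &[f32]) -> bool {
    matches!(max_abs_err(got, want), Some(e) if e <= MAX_ABS_ERR)
}

/// Positional top-K check over (id, distance) pairs sorted by distance.
/// Ids at the same rank may differ as long as the distances agree
/// within `tol`; a NaN distance fails the comparison.
pub fn topk_consistent(got: &[(u32, f32)], want: &[(u32, f32)], tol: f32) -> bool {
    got.len() == want.len()
        && got.iter().zip(want).all(|(g, w)| (g.1 - w.1).abs() <= tol)
}
```

For example, `[(7, 0.5), (3, 0.9)]` and `[(7, 0.5), (4, 0.9)]` are consistent under this check: ids 3 and 4 tie at distance 0.9, and either is a legitimate second result.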

Decision: per-target speed measurement spans multiple shapes × distributions

A single dataset would let an agent overfit to that dataset's distribution. Each target's inputs.rs therefore generates speed workloads across:

- Multiple shapes of the kernel's domain (PQ: (dim, num_sub_vectors, num_centroids); bitpack: bit width; etc.). Captures how the kernel performs at different sizes Lance users actually encounter.
- Multiple data distributions (Gaussian / uniform / sparse for floats; uniform / skewed / clustered for integers; etc.). Captures whether the kernel's win is data-distribution-conditional.

The keep gate uses geomean across all (shape × distribution) combos with a worst-case guard: a kernel that wins on one combo and regresses ≥5% on another fails to keep, even if the geomean improves. This forces wins to generalize.
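
A minimal sketch of such a keep gate, under assumptions: the `ComboTiming` type, the `keep` function, and the constant name are hypothetical, and "regresses ≥5%" is read here as "candidate time is at least 1.05× the baseline time on that combo". The real gate may define the regression or the geomean threshold differently.

```rust
/// Timing for one (shape × distribution) combo, in any consistent unit.
pub struct ComboTiming {
    pub baseline: f64,
    pub candidate: f64,
}

/// Worst-case guard: a regression of 5% or more on any combo blocks the keep.
pub const WORST_CASE_REGRESSION: f64 = 0.05;

/// Keep the candidate only if the geomean speedup across all combos is
/// above 1.0 AND no single combo regresses by the guard threshold.
pub fn keep(timings: &[ComboTiming]) -> bool {
    if timings.is_empty() {
        return false;
    }
    // Speedup per combo: baseline / candidate, so > 1.0 means faster.
    let log_mean = timings
        .iter()
        .map(|t| (t.baseline / t.candidate).ln())
        .sum::<f64>()
        / timings.len() as f64;
    let geomean_speedup = log_mean.exp();

    let no_bad_regression = timings
        .iter()
        .all(|t| t.candidate < t.baseline * (1.0 + WORST_CASE_REGRESSION));

    geomean_speedup > 1.0 && no_bad_regression
}
```

With this reading, a kernel that halves the time on one combo but is 6% slower on another is rejected even though its geomean speedup is well above 1.0.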

What's deliberately not abstracted

- Output format. Each target prints its own field block. See above.
- TopKHeap and other small data structures. When two targets need a TopKHeap, the second one copies the first's. Three copies of a 30-line struct is cheaper than one trait-erased indirection.
- Test data shapes. Each target's inputs.rs knows its own kernel's fixture shape. Sharing would require a generic Fixture<Kernel> trait, which would either be too narrow (forces every kernel into a query + workload shape) or too wide (gives up the type safety that makes the bench's correctness check obvious).
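
To give a sense of how small a copyable TopKHeap is, here is one possible shape. It is a sketch, not the struct in the pq-l2 crate: the `Entry` type, the method names, and the tie-break on id are all assumptions. It keeps the K smallest distances in a bounded max-heap, so the current worst candidate sits at the top and is the one evicted.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// (distance, id) pair with a total order: distance first, then id.
#[derive(Clone, Copy)]
struct Entry {
    dist: f32,
    id: u32,
}

impl Ord for Entry {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives f32 a total order, so NaN cannot poison the heap.
        self.dist.total_cmp(&other.dist).then(self.id.cmp(&other.id))
    }
}
impl PartialOrd for Entry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Entry {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Entry {}

/// Bounded max-heap holding the K smallest distances seen so far.
pub struct TopKHeap {
    k: usize,
    heap: BinaryHeap<Entry>,
}

impl TopKHeap {
    pub fn new(k: usize) -> Self {
        Self { k, heap: BinaryHeap::with_capacity(k) }
    }

    pub fn push(&mut self, id: u32, dist: f32) {
        if self.k == 0 {
            return;
        }
        let e = Entry { dist, id };
        if self.heap.len() < self.k {
            self.heap.push(e);
        } else if let Some(mut worst) = self.heap.peek_mut() {
            // Replace the current worst only if the newcomer is better;
            // the heap re-sifts when `worst` goes out of scope.
            if e < *worst {
                *worst = e;
            }
        }
    }

    /// Results in ascending distance order as (id, distance) pairs.
    pub fn into_sorted(self) -> Vec<(u32, f32)> {
        self.heap
            .into_sorted_vec()
            .into_iter()
            .map(|e| (e.id, e.dist))
            .collect()
    }
}
```

Which tied entry survives eviction depends on details like the id tie-break above, which is one reason a top-K oracle that compares distances by rank is more robust than one that compares ids.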

When to revisit

If the workspace grows past ~6 active targets and we notice we're copy-pasting more than ~50 lines of run_experiment.rs boilerplate per new target, consider extracting a shared RunExperiment helper that takes closures for the correctness and speed phases. Don't pre-extract — wait until the duplication is real and visible.
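
If that point is reached, the helper could be as small as the following sketch. Everything here is hypothetical (the `Trial` enum, the `run_experiment` function, and the closure signatures); it only illustrates the "closures for the correctness and speed phases" shape, with the speed phase skipped when correctness fails.

```rust
/// Hypothetical outcome of one trial; not the harness's real result block.
#[derive(Debug, PartialEq)]
pub enum Trial {
    CorrectnessFailed(String),
    Passed { metric: f64 },
}

/// Run the correctness phase first; only measure speed if it passes.
pub fn run_experiment<C, S>(correctness: C, speed: S) -> Trial
where
    C: FnOnce() -> Result<(), String>,
    S: FnOnce() -> f64,
{
    match correctness() {
        Err(why) => Trial::CorrectnessFailed(why),
        Ok(()) => Trial::Passed { metric: speed() },
    }
}
```

Each target would keep its own result printing and pass target-specific closures, so the helper shares control flow without forcing a common kernel type.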

If we add a target that genuinely doesn't fit the autoresearch loop (eval crosses ~30s; tournament sampling becomes the right control loop), it belongs in a separate workspace, not this one. The boundary line is the loop shape, not the target type.
