A code review pass found a cluster of real bugs in metrics and contract; fixing them before any agent loop runs against this harness.

Critical metric bug:
- harness-common::sysinfo::peak_rss_mb read VmPeak (virtual address space high-water-mark, includes mmap'd files / guard pages / untouched allocations) instead of VmHWM (resident pages high-water-mark). The function name and HARNESS.md contract both promised RSS. Every peak_mem_mb row logged under the old code was virtual peak, not RSS.

Correctness contract bug:
- reference::topk_consistent's tie-tolerance had a flawed neighbor-scan check: when the K-th distance fell in a multi-way tie, agent and reference could legally return different K-sized subsets of the tied band (heap eviction order vs. sort stability), and the neighbor scan required both endpoints to be present, false-negativing legitimate cases. Simplified to a positional distance-tolerance check; ids at the same rank may differ silently because the distance match within tol constrains the swap to a 2*tol band. Diagnostic comment explains the rationale.

API hygiene:
- Removed dead PqKernel::shape() and ScalarReference::shape(). Both were declared in the public API contract (program.md, kernels.rs comment) and required to be stable, but never called by the bench / benches / inputs / reference. Now the contract reflects what the bench actually uses.
- Removed dead `anyhow` workspace dependency.

Determinism:
- PRNG seed mixing now uses the SplitMix64 finalizer per part instead of raw XOR. Raw XOR is commutative and small-constant collisions are reachable; mix_seeds iterates the finalizer once per ingredient so distinct (seed, shape, kind) tuples produce distinct streams with vanishingly small collision probability.

License headers:
- kernels.rs SPDX changed from Apache-2.0 to MIT OR Apache-2.0 to match the crate's Cargo.toml license field (the rest of the crate is dual-licensed). Added matching SPDX headers to reference.rs and inputs.rs.

Doc cleanups:
- design.md: replaced the broken relative link `../../docs/research/llm-evolutionary-sampling.md` (which resolved inside lance-autoresearch where the note doesn't live) with a path-explained reference noting the note lives in the parent OmniGraph repo and won't ship on extraction.
- README.md: clarified that the target table mixes a single landed target with a candidate roadmap; the roadmap entries have no code yet.
- HARNESS.md: added exit code 1 (internal error) to the exit-code summary; was documented in run_experiment.rs but not in the loop contract.
- adding-a-target.md: dropped the misleading "cp -r plus surgical edits" framing. The workflow rewrites 7 files; what's inherited is Cargo manifest, license headers, workspace registration, and shared utilities.

Verified end-to-end: cargo build / clippy / test all green. Baseline trial runs `correctness: pass` exit 0 in ~34s (peak_mem_mb now reads RSS; same workload reports 91 MB, plausibly correct given the temporary fixture-construction buffers).

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
Design — why the workspace is shaped this way
This document records the rationale for the multi-target workspace shape so future contributors don't relitigate the early decisions.
The thing we're building
A multi-target harness for LLM-driven optimization of Lance hot-path kernels.
"Multi-target" because Lance has many such kernels — distance kernels in
lance-linalg, decoders in lance-encoding, scan/merge kernels — and the
right harness shape is identical across them: bit-exact correctness oracle,
geomean-across-distributions speed metric, single-agent autoresearch loop.
The original research note that motivated this repo enumerates ten such
candidates (A1–A10) clustered by Lance crate. The first landed (pq-l2) proves
the harness shape; the rest follow the same template. When this repo lives
inside the parent OmniGraph workspace, the note is at
omnigraph/docs/research/llm-evolutionary-sampling.md; if the repo is ever
extracted as a standalone OSS project, the note is part of OmniGraph's
private design history and won't ship with the extracted repo.
Decision: workspace, not single crate
A single crate exposing multiple binaries (run_experiment_pq_l2,
run_experiment_bitpack, ...) was the obvious-looking alternative. Rejected
for three reasons:
- Per-target deps differ. FSST decode wants different deps than PQ kernels (a string-compression library vs. just f32 math). A single Cargo.toml would either bundle every target's deps into every build or require fine-grained features. Workspaces give per-target Cargo.toml for free.
- Edit isolation. The agent edits one target's kernels.rs at a time. In a single crate, kernels.rs files would collide on path or have to live in target-specific submodules with target-specific naming. Per-target crates put src/kernels.rs at the natural location every time and let the agent navigate one tree per session.
- Build / test isolation. cargo build -p pq-l2 builds only what's needed for the PQ L2 target; cargo test -p pq-l2 runs only its tests. The agent's iteration loop is faster because it doesn't pay for unrelated targets' compile time.
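The downside (workspace boilerplate, per-target Cargo.toml, the empty
[workspace] block at the workspace root that prevents cargo from walking up
to the parent omnigraph workspace) is a one-time cost. Adding a new target
starts from a cp -r of an existing one, but the copy only carries over the
Cargo manifest, license headers, workspace registration, and shared
utilities; the target-specific files are rewritten (see adding-a-target.md).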
Decision: shared harness-common crate, no Target trait
A Target trait was the obvious-looking other alternative — express the
common loop generically, plug in target-specific types. Rejected because:
- Kernel signatures vary too much for a single trait shape. PQ probe_top_k returns Vec<(u32, f32)>. Bitpack decode returns an IntArray. FSST decode returns Vec<u8>. Predicate evaluation returns a BooleanArray. A unifying trait would need erased boxing or a wide associated-type surface, both of which obscure the actual hot path the agent is editing.
- The orchestration that is shared is small. A deterministic PRNG (~30 lines), a geomean (~10 lines), peak RSS readback (~20 lines), four tolerance constants. Total ~70 lines of shared code. Building a trait abstraction over 70 lines costs more than it saves.
- The output format isn't worth sharing. Each target's run_experiment.rs prints a fixed-format result block; the fields differ per target (PQ shapes vs bit widths vs distribution kinds). A shared formatter would be either trivial wrapping of println! (no value) or a complicated builder API (negative value).
harness-common therefore exposes plumbing only: SplitMix64, geomean,
peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS. Each
target consumes what it needs. The shared loop contract is documented in
HARNESS.md, not encoded in code.
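To give a sense of how small this plumbing is, here is a minimal sketch of two of the pieces: a log-space geomean and a peak-RSS readback that parses VmHWM (not VmPeak) from /proc/self/status. The function bodies, the Option return types, and the helper name parse_vm_hwm_mb are assumptions made for illustration; the real harness-common code may differ.

```rust
/// Geometric mean of strictly positive samples, computed in log space.
/// Returns None for an empty slice or any non-positive / non-finite sample.
pub fn geomean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() || samples.iter().any(|&x| !(x > 0.0 && x.is_finite())) {
        return None;
    }
    let log_sum: f64 = samples.iter().map(|x| x.ln()).sum();
    Some((log_sum / samples.len() as f64).exp())
}

/// Extracts VmHWM (resident high-water-mark) in MB from the text of
/// /proc/<pid>/status. Hypothetical helper, split out so it can be
/// tested without touching the filesystem.
pub fn parse_vm_hwm_mb(status: &str) -> Option<f64> {
    status.lines().find_map(|line| {
        // Lines look like "VmHWM:\t   93184 kB".
        let rest = line.strip_prefix("VmHWM:")?;
        let kb: u64 = rest.split_whitespace().next()?.parse().ok()?;
        Some(kb as f64 / 1024.0)
    })
}

/// Linux-only readback; returns None if the file is missing or unparsable.
pub fn peak_rss_mb() -> Option<f64> {
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_hwm_mb(&status)
}
```

Matching on the exact "VmHWM:" prefix is what keeps the virtual-memory peak from being reported as RSS.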
Decision: per-target program.md + shared HARNESS.md
The agent reads two files at session start:
- HARNESS.md (workspace-level). Universal: the loop, the metric, the edit-permission table, hygiene rules.
- crates/<target>/program.md (per-target). Specific: the kernel API the agent must keep stable, target-specific priors (which SIMD intrinsics tend to win on this kernel shape), the results.tsv column header.
The shape mirrors how Karpathy's nanochat-research program.md works,
factored across the dimension that varies (per target) vs. doesn't (the loop
itself). Two files instead of one because copy-pasting the universal loop
into every program.md makes them drift.
Decision: dataset-independent oracle for every target
The first iteration of the harness used recall@K vs. SIFT1M as the correctness oracle. We replaced it with bit-exact (or near-bit-exact for floats) match against a scalar reference because:
- The agent had incentive to overfit lossy approximations to the dataset's cluster structure, even though we didn't ask for that.
- SIFT1M is 250 MB and a hassle to download; the harness benefited from being self-contained.
- Mathematical equivalence is a strictly stronger contract than recall preservation: if the kernel is bit-equivalent to the scalar reference, recall is automatically identical because the distance values are the same. There's nothing recall@K catches that bit-exactness doesn't.
This decision generalizes to every target. Decode kernels get strict bitwise
equality (no float arithmetic involved). Distance and BM25 kernels get
max_abs_err ≤ 1e-4 (loose enough for SIMD-accumulator reordering, tight
enough for real bugs). Targets that genuinely require lossy techniques to
get headroom — there might be some; LUT u8 quantization in PQ is one — go
in a separate "lossy track" with a recall-based oracle on diverse datasets,
not the bit-exact track.
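The two float-side checks can be sketched as follows. This is an illustration under assumptions, not the harness's actual reference.rs: the function names and signatures are hypothetical, the 1e-4 value comes from the text above, and the top-k tolerance is passed in because the value of TOPK_DIST_TOL is not stated here. The top-k check compares distances rank by rank and deliberately ignores ids, so two results that pick different members of a tied band still match.

```rust
/// Tolerance for float kernels, taken from the text above.
pub const MAX_ABS_ERR: f32 = 1e-4;

/// True when both slices have the same length and every element pair
/// differs by at most `tol`. A NaN difference compares false, so a
/// kernel that emits NaN fails the check.
pub fn within_max_abs_err(got: &[f32], want: &[f32], tol: f32) -> bool {
    got.len() == want.len()
        && got.iter().zip(want).all(|(g, w)| (g - w).abs() <= tol)
}

/// Positional distance-tolerance check for top-k results given as
/// (id, distance) pairs sorted by distance. Ids at the same rank may
/// differ; only the distances at each rank must agree within `tol`.
pub fn topk_consistent(agent: &[(u32, f32)], reference: &[(u32, f32)], tol: f32) -> bool {
    agent.len() == reference.len()
        && agent
            .iter()
            .zip(reference)
            .all(|(a, r)| (a.1 - r.1).abs() <= tol)
}
```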
Decision: per-target speed measurement spans multiple shapes × distributions
A single dataset would let an agent overfit to that dataset's distribution.
Each target's inputs.rs therefore generates speed workloads across:
- Multiple shapes of the kernel's domain (PQ: (dim, num_sub_vectors, num_centroids); bitpack: bit width; etc.). Captures how the kernel performs at different sizes Lance users actually encounter.
- Multiple data distributions (Gaussian / uniform / sparse for floats; uniform / skewed / clustered for integers; etc.). Captures whether the kernel's win is data-distribution-conditional.
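Each (shape, distribution) combo needs its own reproducible random stream. The sketch below shows one way to derive a per-combo seed by running the SplitMix64 finalizer once per ingredient, which is order-sensitive where a plain XOR of the parts is not. The exact folding step, and encoding shape and distribution as u64 indices, are assumptions for illustration; the real mix_seeds may combine its parts differently.

```rust
/// SplitMix64 increment (the 64-bit golden-ratio constant).
const GOLDEN_GAMMA: u64 = 0x9E37_79B9_7F4A_7C15;

/// The SplitMix64 output mixer: a bijection on u64 with strong avalanche.
fn splitmix64_finalize(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Folds the ingredients into one seed, applying the finalizer once per
/// part. The fold shape here is an assumption, not the harness's code.
pub fn mix_seeds(parts: &[u64]) -> u64 {
    parts.iter().fold(0u64, |acc, &p| {
        splitmix64_finalize((acc ^ p).wrapping_add(GOLDEN_GAMMA))
    })
}

/// Minimal SplitMix64 generator seeded from the mixed value.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(GOLDEN_GAMMA);
        splitmix64_finalize(self.state)
    }
}
```

A workload generator would then call something like SplitMix64::new(mix_seeds(&[base_seed, shape_index, kind_index])), where the three ingredient names are hypothetical.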
The keep gate uses geomean across all (shape × distribution) combos with a worst-case guard: a kernel that wins on one combo and regresses ≥5% on another fails to keep, even if the geomean improves. This forces wins to generalize.
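A minimal sketch of such a gate, assuming the per-combo metric is a lower-is-better time and that baseline and candidate are measured on the same combos in the same order. The function name keep, the constant name, and the ratio formulation are hypothetical; only the geomean rule and the 5% guard come from the text.

```rust
/// A combo whose time ratio (candidate / baseline) reaches this value
/// counts as a regression of 5% or more.
pub const MAX_REGRESSION: f64 = 1.05;

/// Returns true when the candidate improves the geomean across all
/// combos and no single combo regresses by 5% or more.
pub fn keep(baseline_ns: &[f64], candidate_ns: &[f64]) -> bool {
    if baseline_ns.is_empty() || baseline_ns.len() != candidate_ns.len() {
        return false;
    }
    // Per-combo slowdown ratio: below 1.0 is a win, above 1.0 a regression.
    let ratios: Vec<f64> = candidate_ns
        .iter()
        .zip(baseline_ns)
        .map(|(c, b)| c / b)
        .collect();
    if ratios.iter().any(|r| !(r.is_finite() && *r > 0.0)) {
        return false;
    }
    let geomean = (ratios.iter().map(|r| r.ln()).sum::<f64>() / ratios.len() as f64).exp();
    let worst = ratios.iter().cloned().fold(0.0f64, f64::max);
    geomean < 1.0 && worst < MAX_REGRESSION
}
```

With a baseline of [100, 100] ns, a candidate of [50, 106] ns is rejected despite the large geomean win, because the second combo regressed by 6%.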
What's deliberately not abstracted
- Output format. Each target prints its own field block. See above.
- TopKHeap and other small data structures. When two targets need a TopKHeap, the second one copies the first's. Three copies of a 30-line struct is cheaper than one trait-erased indirection.
- Test data shapes. Each target's inputs.rs knows its own kernel's fixture shape. Sharing would require a generic Fixture<Kernel> trait, which would either be too narrow (forces every kernel into a query + workload shape) or too wide (gives up the type safety that makes the bench's correctness check obvious).
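For scale, a bounded top-k heap of the kind described above fits in roughly thirty lines. The sketch below is one possible shape, not the pq-l2 implementation: the Entry type, the id tie-break, and the method names are assumptions.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// (distance, id) pair with a total order so it can live in a BinaryHeap.
#[derive(Clone, Copy, Debug)]
struct Entry {
    dist: f32,
    id: u32,
}

impl Ord for Entry {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives f32 a total order; ties break on id.
        self.dist.total_cmp(&other.dist).then(self.id.cmp(&other.id))
    }
}
impl PartialOrd for Entry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Entry {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Entry {}

/// Keeps the k smallest distances seen so far. The max-heap root is the
/// worst entry currently kept, so eviction is a peek plus a pop.
pub struct TopKHeap {
    k: usize,
    heap: BinaryHeap<Entry>,
}

impl TopKHeap {
    pub fn new(k: usize) -> Self {
        Self { k, heap: BinaryHeap::with_capacity(k) }
    }

    pub fn push(&mut self, id: u32, dist: f32) {
        let entry = Entry { dist, id };
        if self.heap.len() < self.k {
            self.heap.push(entry);
        } else if self.heap.peek().map_or(false, |worst| entry < *worst) {
            self.heap.pop();
            self.heap.push(entry);
        }
    }

    /// Results in ascending distance order.
    pub fn into_sorted_vec(self) -> Vec<(u32, f32)> {
        self.heap
            .into_sorted_vec()
            .into_iter()
            .map(|e| (e.id, e.dist))
            .collect()
    }
}
```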
When to revisit
If the workspace grows past ~6 active targets and we notice we're
copy-pasting more than ~50 lines of run_experiment.rs boilerplate per new
target, consider extracting a shared RunExperiment helper that takes
closures for the correctness and speed phases. Don't pre-extract — wait
until the duplication is real and visible.
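If that point is reached, the helper could be as thin as the sketch below. Everything in it is hypothetical: the name run_experiment, the String error type, and the assumption that the speed phase yields a single f64 score and only runs after correctness passes.

```rust
/// Runs the correctness phase, then the speed phase, and returns the
/// speed score. Speed is never measured for a kernel that fails the
/// correctness oracle.
pub fn run_experiment<C, S>(correctness: C, speed: S) -> Result<f64, String>
where
    C: FnOnce() -> Result<(), String>,
    S: FnOnce() -> f64,
{
    correctness()?;
    Ok(speed())
}
```

Each target would pass closures over its own fixture and kernel types, so the helper stays free of the trait surface rejected earlier.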
If we add a target that genuinely doesn't fit the autoresearch loop (eval crosses ~30s; tournament sampling becomes the right control loop), it belongs in a separate workspace, not this one. The boundary line is the loop shape, not the target type.