A code review pass found a cluster of real bugs in metrics and contract; fixing them before any agent loop runs against this harness. Critical metric bug: - harness-common::sysinfo::peak_rss_mb read VmPeak (virtual address space high-water-mark, includes mmap'd files / guard pages / untouched allocations) instead of VmHWM (resident pages high-water-mark). The function name and HARNESS.md contract both promised RSS. Every peak_mem_mb row logged under the old code was virtual peak, not RSS. Correctness contract bug: - reference::topk_consistent's tie-tolerance had a flawed neighbor-scan check: when the K-th distance fell in a multi-way tie, agent and reference could legally return different K-sized subsets of the tied band (heap eviction order vs. sort stability), and the neighbor scan required both endpoints to be present, false-negativing legitimate cases. Simplified to a positional distance-tolerance check; ids at the same rank may differ silently because the distance match within tol constrains the swap to a 2*tol band. Diagnostic comment explains the rationale. API hygiene: - Removed dead PqKernel::shape() and ScalarReference::shape() — declared in the public API contract (program.md, kernels.rs comment), required to be stable, never called by the bench / benches / inputs / reference. Now the contract reflects what the bench actually uses. - Removed dead `anyhow` workspace dependency. Determinism: - PRNG seed mixing now uses the SplitMix64 finalizer per part instead of raw XOR. Raw XOR is commutative and small-constant collisions are reachable; mix_seeds iterates the finalizer once per ingredient so distinct (seed, shape, kind) tuples produce distinct streams with vanishingly small collision probability. License headers: - kernels.rs SPDX changed from Apache-2.0 to MIT OR Apache-2.0 to match the crate's Cargo.toml license field (the rest of the crate is dual- licensed). Added matching SPDX headers to reference.rs and inputs.rs. Doc cleanups: - design.md: replaced the broken relative link `../../docs/research/llm-evolutionary-sampling.md` (which resolved inside lance-autoresearch where the note doesn't live) with a path-explained reference noting the note lives in the parent OmniGraph repo and won't ship on extraction. - README.md: clarified that the target table mixes a single landed target with a candidate roadmap — they have no code yet. - HARNESS.md: added exit code 1 (internal error) to the exit-code summary; was documented in run_experiment.rs but not in the loop contract. - adding-a-target.md: dropped the misleading "cp -r plus surgical edits" framing — the workflow rewrites 7 files; what's inherited is Cargo manifest, license headers, workspace registration, and shared utilities. Verified end-to-end: cargo build / clippy / test all green. Baseline trial runs `correctness: pass` exit 0 in ~34s (peak_mem_mb now reads RSS — same workload reports 91 MB, plausibly correct given the temporary fixture-construction buffers). https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
7.4 KiB
Adding a new target
Walk through this when spinning up a new optimization target (A1 cosine, A4
bitpack, etc.). The workflow is: copy an existing target as scaffolding,
then rewrite the four source files (lib.rs, kernels.rs, reference.rs,
inputs.rs), the entry point (bin/run_experiment.rs), the per-target
program.md, and add a capsule under docs/targets/. The Cargo manifest,
workspace registration, license headers, file-layout conventions, gitignore,
and shared utilities (harness-common) are inherited from the template —
that's the part you don't rewrite.
No architectural decisions are required per target if the kernel fits the
autoresearch shape. If your target's per-trial eval is more than ~30 seconds,
or the correctness oracle can't be a deterministic comparison against a
scalar reference, this harness is the wrong fit — see design.md
"When to revisit" for the boundary.
Steps
1. Pick a template target
The closest existing target. For now there's just pq-l2, but as more land:
- Distance / scoring kernels that take a query and return per-row scores →
template off
pq-l2. - Decode kernels that take encoded bytes and return an Arrow array →
template off
bitpackonce it lands. - Scan / merge kernels → template off
topk-mergeonce it lands.
cp -r crates/pq-l2 crates/<my-target>
2. Rewrite Cargo.toml
[package]
name = "<my-target>"
# version, edition, license, publish stay the same
Add the target to the workspace members in the root Cargo.toml:
[workspace]
members = [
"crates/harness-common",
"crates/pq-l2",
"crates/<my-target>", # add this
]
3. Rewrite src/lib.rs
Define the target's Shape type (analogue of PqShape) and any other types
shared between kernels.rs and reference.rs and inputs.rs. Document
which fields are pinned by the harness vs. agent-tunable.
This file is immutable to the agent. The shape parameters define the optimization target — changing them changes what's being optimized.
4. Rewrite src/reference.rs
Implement the scalar reference kernel — the math, in plain Rust, no SIMD,
no cleverness. This is what the agent's kernel is compared against. Mirror
the public API of kernels.rs exactly.
For float kernels, also export max_abs_err(a, b) and topk_consistent(...)
(or analogues) — the comparison helpers the bench uses to assert
near-bit-exact equivalence with harness_common::MAX_ABS_ERR /
TOPK_DIST_TOL.
For integer / byte kernels, the comparison is simpler — assert_eq! on the
returned Arrow array. No tolerance constants needed.
5. Rewrite src/inputs.rs
Two surfaces:
correctness_battery(seed) -> Vec<CorrectnessCase>— diverse shape × distribution combinations, sized small enough that the correctness phase finishes in seconds. The point is breadth, not realism.speed_workloads(seed) -> Vec<SpeedWorkload>— larger shape × distribution combinations sized for stable timings. Aim for total trial wall-clock ≤ 60s; the agent's iteration latency dominates correctness elsewhere.
Use harness_common::SplitMix64 for determinism. Same seed → same battery
across trials.
6. Rewrite src/kernels.rs (the agent's playground)
Implement a clean scalar baseline matching the algorithm shape of the Lance upstream code. The header comment must:
- Cite the upstream Lance source (
lance-format/lancerev / file path) the algorithm is modeled on. - Document the public API the bench calls — these are the surfaces the agent may NOT change.
- List "what you can do" / "what you cannot do" rules specific to this target.
The starting kernel must be correct (passes the correctness phase against
reference.rs) and lint-clean. The agent's job is to make it faster.
7. Rewrite src/bin/run_experiment.rs
Two phases:
- Correctness phase: for each
CorrectnessCase, run agent kernel + reference, compare. Any mismatch → printcorrectness: fail, diagnostic line, exit 2. - Speed phase: for each
SpeedWorkload, run agent kernel and time per query / per row / per byte. Aggregate geomean / worst / best across all combos. Print fixed-format result block.
Universal output fields (every target) are listed in HARNESS.md "The
metric." Add per-target fields above them as needed (e.g., bit_widths_tested
for bitpack).
Use:
harness_common::geomeanfor the aggregatorharness_common::peak_rss_mbfor memory readbackharness_common::TIME_BUDGET_SECSfor the time-budget check
8. (Optional) Rewrite benches/<my-target>.rs
Criterion benchmark with the same kernel calls as run_experiment but
under criterion's statistical-sampling harness. Optional — the per-trial
binary is the agent's primary measurement; criterion is for the human's
deeper investigation.
9. Write program.md
Per-target agent skill, layered on top of HARNESS.md. Sections:
- Setup — which files to read at session start (always include
../../HARNESS.md). - Public API contract — the exact functions / structs the agent must keep stable.
- Target-specific priors — known SIMD techniques for this kernel shape, algorithmic transformations worth trying, common pitfalls. This is the highest-leverage content; spend time on it.
results.tsvheader — the per-target column set.
10. Write the per-target capsule in docs/targets/<my-target>.md
A short doc covering:
- What's optimized (one sentence)
- Upstream Lance source pointers (rev, file paths, function names)
- Oracle definition (bit-exact /
max_abs_err) - Speed workload shape (what shapes × distributions span)
- Status (candidate / landed / has-results)
11. Verify end-to-end
cargo build --release -p <my-target>
cargo clippy --release -p <my-target> --all-targets -- -D warnings
cargo run --release --bin run_experiment -p <my-target>
The baseline trial must:
- Print
correctness: pass - Exit 0
- Finish within ~60s
- Reference a sensible
geomean_ns_per_*baseline number
Smoke-test the gate: deliberately break kernels.rs (e.g., return constant
zero), confirm the trial exits 2 with correctness: fail. Restore.
12. Add the target row to the top-level README.md
In the targets table at the top of the README, change the new target's row
from candidate to landed.
13. Commit
One commit for the target's scaffolding. Don't bundle multiple targets in one commit — each target's history should be independently revertible.
Common gotchas
- Forgetting the empty
[workspace]block at the root means cargo walks up to the omnigraph parent workspace. Already handled; just don't remove it. - Per-target
Cargo.tomlreferencing the wrongharness-commonpath. Useharness-common = { path = "../harness-common" }. - Picking a
SHAPESset that's too small. Three shapes is the floor; with one shape an agent could specialize and pass, with two there's not enough variety. Ensure the shapes span at least one "outlier" (e.g., for PQ, one shape withsub_vector_dim != 8). - Correctness battery too narrow. Five distributions is the floor: at minimum Gaussian / uniform / sparse / large-dynamic-range / mostly-zero (or the integer analogue: uniform / clustered / skewed / few-distinct / monotonic).
- Trial time too long. If the speed phase exceeds ~60s, agent iteration rate drops below useful. Reduce workload sizes; the speed metric is per-operation, not per-workload, so absolute size doesn't change the comparison.