omnigraph/research/lance-autoresearch/docs/adding-a-target.md
Claude 7b1b0b5b75
research: fix lance-autoresearch correctness bugs surfaced by code review
A code review pass found a cluster of real bugs in metrics and contract;
fixing them before any agent loop runs against this harness.

Critical metric bug:
- harness-common::sysinfo::peak_rss_mb read VmPeak (virtual address space
  high-water-mark, includes mmap'd files / guard pages / untouched
  allocations) instead of VmHWM (resident pages high-water-mark). The
  function name and HARNESS.md contract both promised RSS. Every
  peak_mem_mb row logged under the old code was virtual peak, not RSS.

Correctness contract bug:
- reference::topk_consistent's tie-tolerance had a flawed neighbor-scan
  check: when the K-th distance fell in a multi-way tie, agent and
  reference could legally return different K-sized subsets of the tied
  band (heap eviction order vs. sort stability), and the neighbor scan
  required both endpoints to be present, false-negativing legitimate
  cases. Simplified to a positional distance-tolerance check; ids at the
  same rank may differ silently because the distance match within tol
  constrains the swap to a 2*tol band. Diagnostic comment explains the
  rationale.

API hygiene:
- Removed dead PqKernel::shape() and ScalarReference::shape() — declared
  in the public API contract (program.md, kernels.rs comment), required
  to be stable, never called by the bench / benches / inputs / reference.
  Now the contract reflects what the bench actually uses.
- Removed dead `anyhow` workspace dependency.

Determinism:
- PRNG seed mixing now uses the SplitMix64 finalizer per part instead of
  raw XOR. Raw XOR is commutative and small-constant collisions are
  reachable; mix_seeds iterates the finalizer once per ingredient so
  distinct (seed, shape, kind) tuples produce distinct streams with
  vanishingly small collision probability.

License headers:
- kernels.rs SPDX changed from Apache-2.0 to MIT OR Apache-2.0 to match
  the crate's Cargo.toml license field (the rest of the crate is dual-
  licensed). Added matching SPDX headers to reference.rs and inputs.rs.

Doc cleanups:
- design.md: replaced the broken relative link
  `../../docs/research/llm-evolutionary-sampling.md` (which resolved inside
  lance-autoresearch where the note doesn't live) with a path-explained
  reference noting the note lives in the parent OmniGraph repo and won't
  ship on extraction.
- README.md: clarified that the target table mixes a single landed target
  with a candidate roadmap — they have no code yet.
- HARNESS.md: added exit code 1 (internal error) to the exit-code summary;
  was documented in run_experiment.rs but not in the loop contract.
- adding-a-target.md: dropped the misleading "cp -r plus surgical edits"
  framing — the workflow rewrites 7 files; what's inherited is Cargo
  manifest, license headers, workspace registration, and shared utilities.

Verified end-to-end: cargo build / clippy / test all green. Baseline
trial runs `correctness: pass` exit 0 in ~34s (peak_mem_mb now reads
RSS — same workload reports 91 MB, plausibly correct given the temporary
fixture-construction buffers).

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
2026-05-15 00:55:57 +00:00

7.4 KiB
Raw Blame History

Adding a new target

Walk through this when spinning up a new optimization target (A1 cosine, A4 bitpack, etc.). The workflow is: copy an existing target as scaffolding, then rewrite the four source files (lib.rs, kernels.rs, reference.rs, inputs.rs), the entry point (bin/run_experiment.rs), the per-target program.md, and add a capsule under docs/targets/. The Cargo manifest, workspace registration, license headers, file-layout conventions, gitignore, and shared utilities (harness-common) are inherited from the template — that's the part you don't rewrite.

No architectural decisions are required per target if the kernel fits the autoresearch shape. If your target's per-trial eval is more than ~30 seconds, or the correctness oracle can't be a deterministic comparison against a scalar reference, this harness is the wrong fit — see design.md "When to revisit" for the boundary.

Steps

1. Pick a template target

The closest existing target. For now there's just pq-l2, but as more land:

  • Distance / scoring kernels that take a query and return per-row scores → template off pq-l2.
  • Decode kernels that take encoded bytes and return an Arrow array → template off bitpack once it lands.
  • Scan / merge kernels → template off topk-merge once it lands.
cp -r crates/pq-l2 crates/<my-target>

2. Rewrite Cargo.toml

[package]
name = "<my-target>"
# version, edition, license, publish stay the same

Add the target to the workspace members in the root Cargo.toml:

[workspace]
members = [
    "crates/harness-common",
    "crates/pq-l2",
    "crates/<my-target>",   # add this
]

3. Rewrite src/lib.rs

Define the target's Shape type (analogue of PqShape) and any other types shared between kernels.rs and reference.rs and inputs.rs. Document which fields are pinned by the harness vs. agent-tunable.

This file is immutable to the agent. The shape parameters define the optimization target — changing them changes what's being optimized.

4. Rewrite src/reference.rs

Implement the scalar reference kernel — the math, in plain Rust, no SIMD, no cleverness. This is what the agent's kernel is compared against. Mirror the public API of kernels.rs exactly.

For float kernels, also export max_abs_err(a, b) and topk_consistent(...) (or analogues) — the comparison helpers the bench uses to assert near-bit-exact equivalence with harness_common::MAX_ABS_ERR / TOPK_DIST_TOL.

For integer / byte kernels, the comparison is simpler — assert_eq! on the returned Arrow array. No tolerance constants needed.

5. Rewrite src/inputs.rs

Two surfaces:

  • correctness_battery(seed) -> Vec<CorrectnessCase> — diverse shape × distribution combinations, sized small enough that the correctness phase finishes in seconds. The point is breadth, not realism.
  • speed_workloads(seed) -> Vec<SpeedWorkload> — larger shape × distribution combinations sized for stable timings. Aim for total trial wall-clock ≤ 60s; the agent's iteration latency dominates correctness elsewhere.

Use harness_common::SplitMix64 for determinism. Same seed → same battery across trials.

6. Rewrite src/kernels.rs (the agent's playground)

Implement a clean scalar baseline matching the algorithm shape of the Lance upstream code. The header comment must:

  • Cite the upstream Lance source (lance-format/lance rev / file path) the algorithm is modeled on.
  • Document the public API the bench calls — these are the surfaces the agent may NOT change.
  • List "what you can do" / "what you cannot do" rules specific to this target.

The starting kernel must be correct (passes the correctness phase against reference.rs) and lint-clean. The agent's job is to make it faster.

7. Rewrite src/bin/run_experiment.rs

Two phases:

  • Correctness phase: for each CorrectnessCase, run agent kernel + reference, compare. Any mismatch → print correctness: fail, diagnostic line, exit 2.
  • Speed phase: for each SpeedWorkload, run agent kernel and time per query / per row / per byte. Aggregate geomean / worst / best across all combos. Print fixed-format result block.

Universal output fields (every target) are listed in HARNESS.md "The metric." Add per-target fields above them as needed (e.g., bit_widths_tested for bitpack).

Use:

  • harness_common::geomean for the aggregator
  • harness_common::peak_rss_mb for memory readback
  • harness_common::TIME_BUDGET_SECS for the time-budget check

8. (Optional) Rewrite benches/<my-target>.rs

Criterion benchmark with the same kernel calls as run_experiment but under criterion's statistical-sampling harness. Optional — the per-trial binary is the agent's primary measurement; criterion is for the human's deeper investigation.

9. Write program.md

Per-target agent skill, layered on top of HARNESS.md. Sections:

  • Setup — which files to read at session start (always include ../../HARNESS.md).
  • Public API contract — the exact functions / structs the agent must keep stable.
  • Target-specific priors — known SIMD techniques for this kernel shape, algorithmic transformations worth trying, common pitfalls. This is the highest-leverage content; spend time on it.
  • results.tsv header — the per-target column set.

10. Write the per-target capsule in docs/targets/<my-target>.md

A short doc covering:

  • What's optimized (one sentence)
  • Upstream Lance source pointers (rev, file paths, function names)
  • Oracle definition (bit-exact / max_abs_err)
  • Speed workload shape (what shapes × distributions span)
  • Status (candidate / landed / has-results)

11. Verify end-to-end

cargo build --release -p <my-target>
cargo clippy --release -p <my-target> --all-targets -- -D warnings
cargo run --release --bin run_experiment -p <my-target>

The baseline trial must:

  • Print correctness: pass
  • Exit 0
  • Finish within ~60s
  • Reference a sensible geomean_ns_per_* baseline number

Smoke-test the gate: deliberately break kernels.rs (e.g., return constant zero), confirm the trial exits 2 with correctness: fail. Restore.

12. Add the target row to the top-level README.md

In the targets table at the top of the README, change the new target's row from candidate to landed.

13. Commit

One commit for the target's scaffolding. Don't bundle multiple targets in one commit — each target's history should be independently revertible.

Common gotchas

  • Forgetting the empty [workspace] block at the root means cargo walks up to the omnigraph parent workspace. Already handled; just don't remove it.
  • Per-target Cargo.toml referencing the wrong harness-common path. Use harness-common = { path = "../harness-common" }.
  • Picking a SHAPES set that's too small. Three shapes is the floor; with one shape an agent could specialize and pass, with two there's not enough variety. Ensure the shapes span at least one "outlier" (e.g., for PQ, one shape with sub_vector_dim != 8).
  • Correctness battery too narrow. Five distributions is the floor: at minimum Gaussian / uniform / sparse / large-dynamic-range / mostly-zero (or the integer analogue: uniform / clustered / skewed / few-distinct / monotonic).
  • Trial time too long. If the speed phase exceeds ~60s, agent iteration rate drops below useful. Reduce workload sizes; the speed metric is per-operation, not per-workload, so absolute size doesn't change the comparison.