omnigraph/research/lance-autoresearch/docs/adding-a-target.md
Claude 0d72cc69fb
research: restructure lance-autoresearch as multi-target workspace
The original lance-autoresearch was one Cargo crate optimizing one Lance
kernel (PQ L2 distance). With 9+ candidate targets enumerated in the research
note, a single-crate shape doesn't scale: per-target deps will collide, the
agent's edits to one target's kernels.rs would conflict with another's lib
path, and build/test isolation is lost. Restructure into a Cargo workspace.

Layout:

  research/lance-autoresearch/
  ├── Cargo.toml          (workspace root)
  ├── README.md           (target table, contract overview, repo layout)
  ├── HARNESS.md          (universal loop contract every target inherits)
  ├── crates/
  │   ├── harness-common/ (shared: SplitMix64, geomean, peak RSS,
  │   │                    MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS)
  │   └── pq-l2/          (the landed target; was the previous single crate)
  └── docs/
      ├── design.md           (rationale for workspace shape, no Target trait)
      ├── adding-a-target.md  (step-by-step workflow for new targets)
      └── targets/pq-l2.md    (per-target capsule)

Decisions documented in docs/design.md:

- Workspace, not single crate: per-target Cargo.toml so deps don't collide;
  per-target src tree so agent edits don't conflict; per-target build/test
  isolation for faster agent iteration.
- harness-common as a plumbing-only crate (PRNG, geomean, peak RSS, tolerance
  constants, time budget). Intentionally NO Target trait - decode kernel
  signatures and distance kernel signatures differ enough that a unifying
  trait would either bloat or require erased boxing. Each target is its own
  natural shape.
- Per-target program.md + shared HARNESS.md: the loop contract is universal,
  the priors and API spec are per-target. Two files instead of one because
  copy-pasting the universal loop into every program.md would drift.

pq-l2 refactor:
- src/* moved into crates/pq-l2/src/* via git mv (preserves history)
- crate renamed lance-autoresearch -> pq-l2
- SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL,
  TIME_BUDGET_SECS now imported from harness-common (drops ~70 lines of
  duplication that would have been copy-pasted into every new target)
- program.md trimmed: setup/loop/hygiene moved to HARNESS.md; only the
  PQ-L2-specific API contract and SIMD priors remain
- Cargo.toml depends on harness-common via path; workspace.dependencies
  pins criterion uniformly across targets

The 9 candidate targets from the research note (A1 cosine/dot/hamming, A2
IVF partition select, A3 FTS BM25, A4 bitpack decode, A5 dictionary decode,
A6 FSST decode, A7 take/gather, A8 predicate eval, A9 posting list intersect,
A10 top-K merge) are listed in README.md's target table as "candidate"; each
gets a docs/targets/<name>.md capsule when it's spun up. docs/adding-a-target.md
documents the cp -r + edit-Cargo.toml + rewrite-three-files workflow.

Verified end-to-end:
- cargo build --release: clean, both crates compile
- cargo clippy --release --workspace --all-targets -- -D warnings: clean
- cargo test --release --workspace: 6/6 pass (4 harness-common + 2 pq-l2)
- cargo run --release --bin run_experiment -p pq-l2: correctness pass,
  geomean ~880k ns, exit 0, ~30s wall-clock
- omnigraph parent workspace unchanged (research/ excluded as before)

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
2026-05-15 00:15:02 +00:00

7 KiB
Raw Blame History

Adding a new target

Walk through this when spinning up a new optimization target (A1 cosine, A4 bitpack, etc.). It's a cp -r plus surgical edits — no architectural decisions to make per target if the kernel fits the autoresearch shape.

If your target's per-trial eval is more than ~30 seconds, or the correctness oracle can't be a deterministic comparison against a scalar reference, this harness is the wrong fit — see design.md "When to revisit" for the boundary.

Steps

1. Pick a template target

The closest existing target. For now there's just pq-l2, but as more land:

  • Distance / scoring kernels that take a query and return per-row scores → template off pq-l2.
  • Decode kernels that take encoded bytes and return an Arrow array → template off bitpack once it lands.
  • Scan / merge kernels → template off topk-merge once it lands.
cp -r crates/pq-l2 crates/<my-target>

2. Rewrite Cargo.toml

[package]
name = "<my-target>"
# version, edition, license, publish stay the same

Add the target to the workspace members in the root Cargo.toml:

[workspace]
members = [
    "crates/harness-common",
    "crates/pq-l2",
    "crates/<my-target>",   # add this
]

3. Rewrite src/lib.rs

Define the target's Shape type (analogue of PqShape) and any other types shared between kernels.rs and reference.rs and inputs.rs. Document which fields are pinned by the harness vs. agent-tunable.

This file is immutable to the agent. The shape parameters define the optimization target — changing them changes what's being optimized.

4. Rewrite src/reference.rs

Implement the scalar reference kernel — the math, in plain Rust, no SIMD, no cleverness. This is what the agent's kernel is compared against. Mirror the public API of kernels.rs exactly.

For float kernels, also export max_abs_err(a, b) and topk_consistent(...) (or analogues) — the comparison helpers the bench uses to assert near-bit-exact equivalence with harness_common::MAX_ABS_ERR / TOPK_DIST_TOL.

For integer / byte kernels, the comparison is simpler — assert_eq! on the returned Arrow array. No tolerance constants needed.

5. Rewrite src/inputs.rs

Two surfaces:

  • correctness_battery(seed) -> Vec<CorrectnessCase> — diverse shape × distribution combinations, sized small enough that the correctness phase finishes in seconds. The point is breadth, not realism.
  • speed_workloads(seed) -> Vec<SpeedWorkload> — larger shape × distribution combinations sized for stable timings. Aim for total trial wall-clock ≤ 60s; the agent's iteration latency dominates correctness elsewhere.

Use harness_common::SplitMix64 for determinism. Same seed → same battery across trials.

6. Rewrite src/kernels.rs (the agent's playground)

Implement a clean scalar baseline matching the algorithm shape of the Lance upstream code. The header comment must:

  • Cite the upstream Lance source (lance-format/lance rev / file path) the algorithm is modeled on.
  • Document the public API the bench calls — these are the surfaces the agent may NOT change.
  • List "what you can do" / "what you cannot do" rules specific to this target.

The starting kernel must be correct (passes the correctness phase against reference.rs) and lint-clean. The agent's job is to make it faster.

7. Rewrite src/bin/run_experiment.rs

Two phases:

  • Correctness phase: for each CorrectnessCase, run agent kernel + reference, compare. Any mismatch → print correctness: fail, diagnostic line, exit 2.
  • Speed phase: for each SpeedWorkload, run agent kernel and time per query / per row / per byte. Aggregate geomean / worst / best across all combos. Print fixed-format result block.

Universal output fields (every target) are listed in HARNESS.md "The metric." Add per-target fields above them as needed (e.g., bit_widths_tested for bitpack).

Use:

  • harness_common::geomean for the aggregator
  • harness_common::peak_rss_mb for memory readback
  • harness_common::TIME_BUDGET_SECS for the time-budget check

8. (Optional) Rewrite benches/<my-target>.rs

Criterion benchmark with the same kernel calls as run_experiment but under criterion's statistical-sampling harness. Optional — the per-trial binary is the agent's primary measurement; criterion is for the human's deeper investigation.

9. Write program.md

Per-target agent skill, layered on top of HARNESS.md. Sections:

  • Setup — which files to read at session start (always include ../../HARNESS.md).
  • Public API contract — the exact functions / structs the agent must keep stable.
  • Target-specific priors — known SIMD techniques for this kernel shape, algorithmic transformations worth trying, common pitfalls. This is the highest-leverage content; spend time on it.
  • results.tsv header — the per-target column set.

10. Write the per-target capsule in docs/targets/<my-target>.md

A short doc covering:

  • What's optimized (one sentence)
  • Upstream Lance source pointers (rev, file paths, function names)
  • Oracle definition (bit-exact / max_abs_err)
  • Speed workload shape (what shapes × distributions span)
  • Status (candidate / landed / has-results)

11. Verify end-to-end

cargo build --release -p <my-target>
cargo clippy --release -p <my-target> --all-targets -- -D warnings
cargo run --release --bin run_experiment -p <my-target>

The baseline trial must:

  • Print correctness: pass
  • Exit 0
  • Finish within ~60s
  • Reference a sensible geomean_ns_per_* baseline number

Smoke-test the gate: deliberately break kernels.rs (e.g., return constant zero), confirm the trial exits 2 with correctness: fail. Restore.

12. Add the target row to the top-level README.md

In the targets table at the top of the README, change the new target's row from candidate to landed.

13. Commit

One commit for the target's scaffolding. Don't bundle multiple targets in one commit — each target's history should be independently revertible.

Common gotchas

  • Forgetting the empty [workspace] block at the root means cargo walks up to the omnigraph parent workspace. Already handled; just don't remove it.
  • Per-target Cargo.toml referencing the wrong harness-common path. Use harness-common = { path = "../harness-common" }.
  • Picking a SHAPES set that's too small. Three shapes is the floor; with one shape an agent could specialize and pass, with two there's not enough variety. Ensure the shapes span at least one "outlier" (e.g., for PQ, one shape with sub_vector_dim != 8).
  • Correctness battery too narrow. Five distributions is the floor: at minimum Gaussian / uniform / sparse / large-dynamic-range / mostly-zero (or the integer analogue: uniform / clustered / skewed / few-distinct / monotonic).
  • Trial time too long. If the speed phase exceeds ~60s, agent iteration rate drops below useful. Reduce workload sizes; the speed metric is per-operation, not per-workload, so absolute size doesn't change the comparison.