omnigraph/research/lance-autoresearch/docs/adding-a-target.md

# Adding a new target

Walk through this when spinning up a new optimization target (A1 cosine, A4
bitpack, etc.). It's a `cp -r` plus surgical edits — no architectural
decisions to make per target if the kernel fits the autoresearch shape.

If your target's per-trial eval is more than ~30 seconds, or the correctness
oracle can't be a deterministic comparison against a scalar reference, this
harness is the wrong fit — see [`design.md`](design.md) "When to revisit"
for the boundary.

## Steps

### 1. Pick a template target

Pick the closest existing target. For now there's just `pq-l2`, but as more land:
- Distance / scoring kernels that take a query and return per-row scores →
  template off `pq-l2`.
- Decode kernels that take encoded bytes and return an Arrow array →
  template off `bitpack` once it lands.
- Scan / merge kernels → template off `topk-merge` once it lands.

```bash
cp -r crates/pq-l2 crates/<my-target>
```

### 2. Rewrite `Cargo.toml`

```toml
[package]
name = "<my-target>"
# version, edition, license, publish stay the same
```

Add the target to the workspace `members` in the root `Cargo.toml`:

```toml
[workspace]
members = [
    "crates/harness-common",
    "crates/pq-l2",
    "crates/<my-target>",   # add this
]
```

### 3. Rewrite `src/lib.rs`

Define the target's `Shape` type (analogue of `PqShape`) and any other types
shared among `kernels.rs`, `reference.rs`, and `inputs.rs`. Document
which fields are pinned by the harness vs. agent-tunable.

This file is **immutable** to the agent. The shape parameters define the
optimization target — changing them changes what's being optimized.
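
As an illustration, a shape type for a hypothetical bitpack-decode target might look like the sketch below. `BitpackShape`, its fields, and `packed_len_bytes` are invented for this example and are not names from the repository; only the role (an analogue of `PqShape`) comes from the text above.

```rust
/// Shape of one bitpack-decode problem instance (hypothetical example).
///
/// Every field here is pinned by the harness; none is agent-tunable.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BitpackShape {
    /// Number of values to decode.
    pub num_values: usize,
    /// Bits per packed value.
    pub bit_width: u8,
}

impl BitpackShape {
    /// Bytes needed to hold `num_values` values at `bit_width` bits each,
    /// rounded up to a whole byte.
    pub fn packed_len_bytes(&self) -> usize {
        (self.num_values * self.bit_width as usize + 7) / 8
    }
}
```

Stating "pinned" or "agent-tunable" in each field's doc comment keeps the contract next to the data it governs.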

### 4. Rewrite `src/reference.rs`

Implement the scalar reference kernel — the math, in plain Rust, no SIMD,
no cleverness. This is what the agent's kernel is compared against. Mirror
the public API of `kernels.rs` exactly.

For float kernels, also export `max_abs_err(a, b)` and `topk_consistent(...)`
(or analogues) — the comparison helpers the bench uses to assert
near-bit-exact equivalence within the `harness_common::MAX_ABS_ERR` /
`TOPK_DIST_TOL` tolerances.

For integer / byte kernels, the comparison is simpler — `assert_eq!` on the
returned Arrow array. No tolerance constants needed.
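
A minimal sketch of the float comparison helper, assuming `max_abs_err` takes two `f32` slices and returns the largest element-wise difference. The real signature and the value of `harness_common::MAX_ABS_ERR` are not shown in this document, so the tolerance is passed as a parameter, and `within_tolerance` is a hypothetical helper. `topk_consistent` is left out because its semantics are not described here.

```rust
/// Largest absolute element-wise difference between two equal-length slices.
/// Returns +infinity if any difference is NaN, so a NaN can never slip
/// through a `<= tolerance` check.
pub fn max_abs_err(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "length mismatch");
    let mut worst = 0.0f32;
    for (x, y) in a.iter().zip(b) {
        let d = (x - y).abs();
        if d.is_nan() {
            return f32::INFINITY;
        }
        worst = worst.max(d);
    }
    worst
}

/// Hypothetical wrapper: true when the agent output matches the reference
/// within `tol` (the bench would pass its own tolerance constant).
pub fn within_tolerance(agent: &[f32], reference: &[f32], tol: f32) -> bool {
    max_abs_err(agent, reference) <= tol
}
```

The explicit NaN branch matters because `f32::max` ignores a NaN operand, which would otherwise let a NaN output pass unnoticed.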

### 5. Rewrite `src/inputs.rs`

Two surfaces:

- `correctness_battery(seed) -> Vec<CorrectnessCase>` — diverse shape ×
  distribution combinations, sized small enough that the correctness phase
  finishes in seconds. The point is breadth, not realism.
- `speed_workloads(seed) -> Vec<SpeedWorkload>` — larger shape × distribution
  combinations sized for stable timings. Aim for total trial wall-clock
  ≤ 60s; past that, the agent's iteration rate drops too low to be useful.

Use `harness_common::SplitMix64` for determinism. Same seed → same battery
across trials.
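
The determinism requirement can be sketched as follows. `LocalSplitMix64` is a local stand-in that implements the standard SplitMix64 step; the actual API of `harness_common::SplitMix64` is not shown here and may differ. The `CorrectnessCase` fields and the two distributions are likewise assumptions for illustration.

```rust
/// Local stand-in for `harness_common::SplitMix64` (standard SplitMix64 step).
pub struct LocalSplitMix64 {
    state: u64,
}

impl LocalSplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f32 in [0, 1), built from the top 24 bits of the next output.
    pub fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u32 << 24) as f32
    }
}

/// Hypothetical correctness case: a label plus generated input data.
pub struct CorrectnessCase {
    pub name: &'static str,
    pub data: Vec<f32>,
}

/// All randomness flows from `seed`, so the battery is identical across trials.
pub fn correctness_battery(seed: u64) -> Vec<CorrectnessCase> {
    let mut rng = LocalSplitMix64::new(seed);
    let uniform: Vec<f32> = (0..64).map(|_| rng.next_f32()).collect();
    // Mostly-zero: roughly one value in eight is non-zero.
    let mostly_zero: Vec<f32> = (0..64)
        .map(|_| if rng.next_u64() % 8 == 0 { rng.next_f32() } else { 0.0 })
        .collect();
    vec![
        CorrectnessCase { name: "uniform", data: uniform },
        CorrectnessCase { name: "mostly_zero", data: mostly_zero },
    ]
}
```

Threading one generator through every case, rather than seeding from the clock or thread-local state, is what makes a failing case reproducible from the seed alone.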

### 6. Rewrite `src/kernels.rs` (the agent's playground)

Implement a clean scalar baseline matching the algorithm shape of the Lance
upstream code. The header comment must:

- Cite the upstream Lance source (`lance-format/lance` rev / file path) the
  algorithm is modeled on.
- Document the public API the bench calls — these are the surfaces the agent
  may NOT change.
- List "what you can do" / "what you cannot do" rules specific to this
  target.

The starting kernel must be correct (passes the correctness phase against
`reference.rs`) and lint-clean. The agent's job is to make it faster.

### 7. Rewrite `src/bin/run_experiment.rs`

Two phases:

- **Correctness phase:** for each `CorrectnessCase`, run agent kernel +
  reference, compare. Any mismatch → print `correctness: fail`, diagnostic
  line, exit 2.
- **Speed phase:** for each `SpeedWorkload`, run agent kernel and time per
  query / per row / per byte. Aggregate geomean / worst / best across all
  combos. Print fixed-format result block.

Universal output fields (every target) are listed in `HARNESS.md` "The
metric." Add per-target fields above them as needed (e.g., `bit_widths_tested`
for bitpack).

Use:
- `harness_common::geomean` for the aggregator
- `harness_common::peak_rss_mb` for memory readback
- `harness_common::TIME_BUDGET_SECS` for the time-budget check
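
The two-phase flow might be sketched like this. It is not the repository's `run_experiment.rs`: `geomean` is a local stand-in for `harness_common::geomean` (signature assumed), `run_trial` and the `u32` kernel signature are hypothetical, the printed field names only follow the `geomean_ns_per_*` pattern, and each workload is timed once where a real harness would likely repeat and check the time budget.

```rust
use std::time::Instant;

/// Geometric mean via the mean of logs. Local stand-in for
/// `harness_common::geomean`, whose exact signature is assumed.
pub fn geomean(xs: &[f64]) -> f64 {
    assert!(!xs.is_empty(), "geomean needs at least one sample");
    (xs.iter().map(|x| x.ln()).sum::<f64>() / xs.len() as f64).exp()
}

/// Runs both phases and returns the process exit code (0 = ok, 2 = mismatch).
/// `agent` and `reference` are hypothetical kernels over `u32` rows.
pub fn run_trial<A, R>(agent: A, reference: R, cases: &[Vec<u32>], workloads: &[Vec<u32>]) -> i32
where
    A: Fn(&[u32]) -> Vec<u32>,
    R: Fn(&[u32]) -> Vec<u32>,
{
    // Correctness phase: any mismatch ends the trial before timing starts.
    for (i, case) in cases.iter().enumerate() {
        if agent(case.as_slice()) != reference(case.as_slice()) {
            println!("correctness: fail");
            println!("mismatch in correctness case {i}");
            return 2;
        }
    }
    println!("correctness: pass");

    // Speed phase: time each workload and normalize to nanoseconds per row.
    let mut ns_per_row = Vec::with_capacity(workloads.len());
    for w in workloads {
        let start = Instant::now();
        let out = agent(w.as_slice());
        let elapsed_ns = start.elapsed().as_nanos().max(1) as f64;
        std::hint::black_box(out); // keep the optimizer from discarding the call
        ns_per_row.push(elapsed_ns / w.len().max(1) as f64);
    }
    if !ns_per_row.is_empty() {
        let worst = ns_per_row.iter().cloned().fold(f64::MIN, f64::max);
        let best = ns_per_row.iter().cloned().fold(f64::MAX, f64::min);
        println!("geomean_ns_per_row: {:.1}", geomean(&ns_per_row));
        println!("worst_ns_per_row:   {:.1}", worst);
        println!("best_ns_per_row:    {:.1}", best);
    }
    0
}
```

A `main` would pass the returned code to `std::process::exit`, so the caller can tell a correctness failure (2) from a clean run (0) without parsing stdout.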

### 8. (Optional) Rewrite `benches/<my-target>.rs`

Criterion benchmark with the same kernel calls as `run_experiment` but
under criterion's statistical-sampling harness. Optional — the per-trial
binary is the agent's primary measurement; criterion is for the human's
deeper investigation.

### 9. Write `program.md`

Per-target agent skill, layered on top of `HARNESS.md`. Sections:

- **Setup** — which files to read at session start (always include
  `../../HARNESS.md`).
- **Public API contract** — the exact functions / structs the agent must
  keep stable.
- **Target-specific priors** — known SIMD techniques for this kernel shape,
  algorithmic transformations worth trying, common pitfalls. This is the
  highest-leverage content; spend time on it.
- **`results.tsv` header** — the per-target column set.

### 10. Write the per-target capsule in `docs/targets/<my-target>.md`

A short doc covering:

- What's optimized (one sentence)
- Upstream Lance source pointers (rev, file paths, function names)
- Oracle definition (bit-exact / `max_abs_err`)
- Speed workload shape (what shapes × distributions span)
- Status (candidate / landed / has-results)

### 11. Verify end-to-end

```bash
cargo build --release -p <my-target>
cargo clippy --release -p <my-target> --all-targets -- -D warnings
cargo run --release --bin run_experiment -p <my-target>
```

The baseline trial must:
- Print `correctness: pass`
- Exit 0
- Finish within ~60s
- Report a sensible `geomean_ns_per_*` baseline number

Smoke-test the gate: deliberately break `kernels.rs` (e.g., return constant
zero), confirm the trial exits 2 with `correctness: fail`. Restore.

### 12. Update the target row in the top-level `README.md`

In the targets table at the top of the README, change the new target's row
from `candidate` to `landed`.

### 13. Commit

One commit for the target's scaffolding. Don't bundle multiple targets in
one commit — each target's history should be independently revertible.

## Common gotchas

- **Removing the `[workspace]` block** from the root `Cargo.toml` makes cargo walk
  up to the omnigraph parent workspace. It's already in place; just don't remove it.
- **Per-target `Cargo.toml` referencing the wrong `harness-common` path.**
  Use `harness-common = { path = "../harness-common" }`.
- **Picking a `SHAPES` set that's too small.** Three shapes is the floor;
  with one shape an agent could specialize and pass, with two there's not
  enough variety. Ensure the shapes span at least one "outlier" (e.g., for
  PQ, one shape with `sub_vector_dim != 8`).
- **Correctness battery too narrow.** Five distributions is the floor: at
  minimum Gaussian / uniform / sparse / large-dynamic-range / mostly-zero (or
  the integer analogue: uniform / clustered / skewed / few-distinct /
  monotonic).
- **Trial time too long.** If the speed phase exceeds ~60s, agent iteration
  rate drops below useful. Reduce workload sizes; the speed metric is
  per-operation, not per-workload, so absolute size doesn't change the
  comparison.
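
The shape and distribution floors above could be enforced by a small guard, sketched here under assumptions: `CaseLabel` and `check_battery_floor` are hypothetical names, and applying both floors to one list of cases is a simplification.

```rust
use std::collections::HashSet;

/// Hypothetical label identifying which shape and distribution a case covers.
pub struct CaseLabel {
    pub shape: &'static str,
    pub distribution: &'static str,
}

/// Checks the floors named above: at least 3 distinct shapes and at least
/// 5 distinct distributions.
pub fn check_battery_floor(cases: &[CaseLabel]) -> Result<(), String> {
    let shapes: HashSet<&str> = cases.iter().map(|c| c.shape).collect();
    let dists: HashSet<&str> = cases.iter().map(|c| c.distribution).collect();
    if shapes.len() < 3 {
        return Err(format!("only {} distinct shapes; need at least 3", shapes.len()));
    }
    if dists.len() < 5 {
        return Err(format!("only {} distinct distributions; need at least 5", dists.len()));
    }
    Ok(())
}
```

Calling a check like this from a unit test in `inputs.rs` would catch a battery that was trimmed too far while chasing a shorter trial time.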

research: restructure lance-autoresearch as multi-target workspace

The original lance-autoresearch was one Cargo crate optimizing one Lance
kernel (PQ L2 distance). With 9+ candidate targets enumerated in the research
note, a single-crate shape doesn't scale: per-target deps will collide, the
agent's edits to one target's kernels.rs would conflict with another's lib
path, and build/test isolation is lost. Restructure into a Cargo workspace.

Layout:

  research/lance-autoresearch/
  ├── Cargo.toml          (workspace root)
  ├── README.md           (target table, contract overview, repo layout)
  ├── HARNESS.md          (universal loop contract every target inherits)
  ├── crates/
  │   ├── harness-common/ (shared: SplitMix64, geomean, peak RSS,
  │   │                    MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS)
  │   └── pq-l2/          (the landed target; was the previous single crate)
  └── docs/
      ├── design.md           (rationale for workspace shape, no Target trait)
      ├── adding-a-target.md  (step-by-step workflow for new targets)
      └── targets/pq-l2.md    (per-target capsule)

Decisions documented in docs/design.md:

- Workspace, not single crate: per-target Cargo.toml so deps don't collide;
  per-target src tree so agent edits don't conflict; per-target build/test
  isolation for faster agent iteration.
- harness-common as a plumbing-only crate (PRNG, geomean, peak RSS, tolerance
  constants, time budget). Intentionally NO Target trait - decode kernel
  signatures and distance kernel signatures differ enough that a unifying
  trait would either bloat or require erased boxing. Each target is its own
  natural shape.
- Per-target program.md + shared HARNESS.md: the loop contract is universal,
  the priors and API spec are per-target. Two files instead of one because
  copy-pasting the universal loop into every program.md would drift.

pq-l2 refactor:
- src/* moved into crates/pq-l2/src/* via git mv (preserves history)
- crate renamed lance-autoresearch -> pq-l2
- SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL,
  TIME_BUDGET_SECS now imported from harness-common (drops ~70 lines of
  duplication that would have been copy-pasted into every new target)
- program.md trimmed: setup/loop/hygiene moved to HARNESS.md; only the
  PQ-L2-specific API contract and SIMD priors remain
- Cargo.toml depends on harness-common via path; workspace.dependencies
  pins criterion uniformly across targets

The 10 candidate targets from the research note (A1 cosine/dot/hamming, A2
IVF partition select, A3 FTS BM25, A4 bitpack decode, A5 dictionary decode,
A6 FSST decode, A7 take/gather, A8 predicate eval, A9 posting list intersect,
A10 top-K merge) are listed in README.md's target table as "candidate"; each
gets a docs/targets/<name>.md capsule when it's spun up. docs/adding-a-target.md
documents the cp -r + edit-Cargo.toml + rewrite-three-files workflow.

Verified end-to-end:
- cargo build --release: clean, both crates compile
- cargo clippy --release --workspace --all-targets -- -D warnings: clean
- cargo test --release --workspace: 6/6 pass (4 harness-common + 2 pq-l2)
- cargo run --release --bin run_experiment -p pq-l2: correctness pass,
  geomean ~880k ns, exit 0, ~30s wall-clock
- omnigraph parent workspace unchanged (research/ excluded as before)

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5

											
										
										
2026-05-15 00:15:02 +00:00