mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-27 02:39:38 +02:00
A code review pass found a cluster of real bugs in metrics and contract; fixing them before any agent loop runs against this harness. Critical metric bug: - harness-common::sysinfo::peak_rss_mb read VmPeak (virtual address space high-water-mark, includes mmap'd files / guard pages / untouched allocations) instead of VmHWM (resident pages high-water-mark). The function name and HARNESS.md contract both promised RSS. Every peak_mem_mb row logged under the old code was virtual peak, not RSS. Correctness contract bug: - reference::topk_consistent's tie-tolerance had a flawed neighbor-scan check: when the K-th distance fell in a multi-way tie, agent and reference could legally return different K-sized subsets of the tied band (heap eviction order vs. sort stability), and the neighbor scan required both endpoints to be present, false-negativing legitimate cases. Simplified to a positional distance-tolerance check; ids at the same rank may differ silently because the distance match within tol constrains the swap to a 2*tol band. Diagnostic comment explains the rationale. API hygiene: - Removed dead PqKernel::shape() and ScalarReference::shape() — declared in the public API contract (program.md, kernels.rs comment), required to be stable, never called by the bench / benches / inputs / reference. Now the contract reflects what the bench actually uses. - Removed dead `anyhow` workspace dependency. Determinism: - PRNG seed mixing now uses the SplitMix64 finalizer per part instead of raw XOR. Raw XOR is commutative and small-constant collisions are reachable; mix_seeds iterates the finalizer once per ingredient so distinct (seed, shape, kind) tuples produce distinct streams with vanishingly small collision probability. License headers: - kernels.rs SPDX changed from Apache-2.0 to MIT OR Apache-2.0 to match the crate's Cargo.toml license field (the rest of the crate is dual- licensed). Added matching SPDX headers to reference.rs and inputs.rs. Doc cleanups: - design.md: replaced the broken relative link `../../docs/research/llm-evolutionary-sampling.md` (which resolved inside lance-autoresearch where the note doesn't live) with a path-explained reference noting the note lives in the parent OmniGraph repo and won't ship on extraction. - README.md: clarified that the target table mixes a single landed target with a candidate roadmap — they have no code yet. - HARNESS.md: added exit code 1 (internal error) to the exit-code summary; was documented in run_experiment.rs but not in the loop contract. - adding-a-target.md: dropped the misleading "cp -r plus surgical edits" framing — the workflow rewrites 7 files; what's inherited is Cargo manifest, license headers, workspace registration, and shared utilities. Verified end-to-end: cargo build / clippy / test all green. Baseline trial runs `correctness: pass` exit 0 in ~34s (peak_mem_mb now reads RSS — same workload reports 91 MB, plausibly correct given the temporary fixture-construction buffers). https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
198 lines
7.4 KiB
Markdown
198 lines
7.4 KiB
Markdown
# Adding a new target
|
||
|
||
Walk through this when spinning up a new optimization target (A1 cosine, A4
|
||
bitpack, etc.). The workflow is: copy an existing target as scaffolding,
|
||
then rewrite the four source files (`lib.rs`, `kernels.rs`, `reference.rs`,
|
||
`inputs.rs`), the entry point (`bin/run_experiment.rs`), the per-target
|
||
`program.md`, and add a capsule under `docs/targets/`. The Cargo manifest,
|
||
workspace registration, license headers, file-layout conventions, gitignore,
|
||
and shared utilities (`harness-common`) are inherited from the template —
|
||
that's the part you don't rewrite.
|
||
|
||
No architectural decisions are required per target if the kernel fits the
|
||
autoresearch shape. If your target's per-trial eval is more than ~30 seconds,
|
||
or the correctness oracle can't be a deterministic comparison against a
|
||
scalar reference, this harness is the wrong fit — see [`design.md`](design.md)
|
||
"When to revisit" for the boundary.
|
||
|
||
## Steps
|
||
|
||
### 1. Pick a template target
|
||
|
||
The closest existing target. For now there's just `pq-l2`, but as more land:
|
||
- Distance / scoring kernels that take a query and return per-row scores →
|
||
template off `pq-l2`.
|
||
- Decode kernels that take encoded bytes and return an Arrow array →
|
||
template off `bitpack` once it lands.
|
||
- Scan / merge kernels → template off `topk-merge` once it lands.
|
||
|
||
```bash
|
||
cp -r crates/pq-l2 crates/<my-target>
|
||
```
|
||
|
||
### 2. Rewrite `Cargo.toml`
|
||
|
||
```toml
|
||
[package]
|
||
name = "<my-target>"
|
||
# version, edition, license, publish stay the same
|
||
```
|
||
|
||
Add the target to the workspace `members` in the root `Cargo.toml`:
|
||
|
||
```toml
|
||
[workspace]
|
||
members = [
|
||
"crates/harness-common",
|
||
"crates/pq-l2",
|
||
"crates/<my-target>", # add this
|
||
]
|
||
```
|
||
|
||
### 3. Rewrite `src/lib.rs`
|
||
|
||
Define the target's `Shape` type (analogue of `PqShape`) and any other types
|
||
shared between `kernels.rs` and `reference.rs` and `inputs.rs`. Document
|
||
which fields are pinned by the harness vs. agent-tunable.
|
||
|
||
This file is **immutable** to the agent. The shape parameters define the
|
||
optimization target — changing them changes what's being optimized.
|
||
|
||
### 4. Rewrite `src/reference.rs`
|
||
|
||
Implement the scalar reference kernel — the math, in plain Rust, no SIMD,
|
||
no cleverness. This is what the agent's kernel is compared against. Mirror
|
||
the public API of `kernels.rs` exactly.
|
||
|
||
For float kernels, also export `max_abs_err(a, b)` and `topk_consistent(...)`
|
||
(or analogues) — the comparison helpers the bench uses to assert
|
||
near-bit-exact equivalence with `harness_common::MAX_ABS_ERR` /
|
||
`TOPK_DIST_TOL`.
|
||
|
||
For integer / byte kernels, the comparison is simpler — `assert_eq!` on the
|
||
returned Arrow array. No tolerance constants needed.
|
||
|
||
### 5. Rewrite `src/inputs.rs`
|
||
|
||
Two surfaces:
|
||
|
||
- `correctness_battery(seed) -> Vec<CorrectnessCase>` — diverse shape ×
|
||
distribution combinations, sized small enough that the correctness phase
|
||
finishes in seconds. The point is breadth, not realism.
|
||
- `speed_workloads(seed) -> Vec<SpeedWorkload>` — larger shape × distribution
|
||
combinations sized for stable timings. Aim for total trial wall-clock
|
||
≤ 60s; the agent's iteration latency dominates correctness elsewhere.
|
||
|
||
Use `harness_common::SplitMix64` for determinism. Same seed → same battery
|
||
across trials.
|
||
|
||
### 6. Rewrite `src/kernels.rs` (the agent's playground)
|
||
|
||
Implement a clean scalar baseline matching the algorithm shape of the Lance
|
||
upstream code. The header comment must:
|
||
|
||
- Cite the upstream Lance source (`lance-format/lance` rev / file path) the
|
||
algorithm is modeled on.
|
||
- Document the public API the bench calls — these are the surfaces the agent
|
||
may NOT change.
|
||
- List "what you can do" / "what you cannot do" rules specific to this
|
||
target.
|
||
|
||
The starting kernel must be correct (passes the correctness phase against
|
||
`reference.rs`) and lint-clean. The agent's job is to make it faster.
|
||
|
||
### 7. Rewrite `src/bin/run_experiment.rs`
|
||
|
||
Two phases:
|
||
|
||
- **Correctness phase:** for each `CorrectnessCase`, run agent kernel +
|
||
reference, compare. Any mismatch → print `correctness: fail`, diagnostic
|
||
line, exit 2.
|
||
- **Speed phase:** for each `SpeedWorkload`, run agent kernel and time per
|
||
query / per row / per byte. Aggregate geomean / worst / best across all
|
||
combos. Print fixed-format result block.
|
||
|
||
Universal output fields (every target) are listed in `HARNESS.md` "The
|
||
metric." Add per-target fields above them as needed (e.g., `bit_widths_tested`
|
||
for bitpack).
|
||
|
||
Use:
|
||
- `harness_common::geomean` for the aggregator
|
||
- `harness_common::peak_rss_mb` for memory readback
|
||
- `harness_common::TIME_BUDGET_SECS` for the time-budget check
|
||
|
||
### 8. (Optional) Rewrite `benches/<my-target>.rs`
|
||
|
||
Criterion benchmark with the same kernel calls as `run_experiment` but
|
||
under criterion's statistical-sampling harness. Optional — the per-trial
|
||
binary is the agent's primary measurement; criterion is for the human's
|
||
deeper investigation.
|
||
|
||
### 9. Write `program.md`
|
||
|
||
Per-target agent skill, layered on top of `HARNESS.md`. Sections:
|
||
|
||
- **Setup** — which files to read at session start (always include
|
||
`../../HARNESS.md`).
|
||
- **Public API contract** — the exact functions / structs the agent must
|
||
keep stable.
|
||
- **Target-specific priors** — known SIMD techniques for this kernel shape,
|
||
algorithmic transformations worth trying, common pitfalls. This is the
|
||
highest-leverage content; spend time on it.
|
||
- **`results.tsv` header** — the per-target column set.
|
||
|
||
### 10. Write the per-target capsule in `docs/targets/<my-target>.md`
|
||
|
||
A short doc covering:
|
||
|
||
- What's optimized (one sentence)
|
||
- Upstream Lance source pointers (rev, file paths, function names)
|
||
- Oracle definition (bit-exact / `max_abs_err`)
|
||
- Speed workload shape (what shapes × distributions span)
|
||
- Status (candidate / landed / has-results)
|
||
|
||
### 11. Verify end-to-end
|
||
|
||
```bash
|
||
cargo build --release -p <my-target>
|
||
cargo clippy --release -p <my-target> --all-targets -- -D warnings
|
||
cargo run --release --bin run_experiment -p <my-target>
|
||
```
|
||
|
||
The baseline trial must:
|
||
- Print `correctness: pass`
|
||
- Exit 0
|
||
- Finish within ~60s
|
||
- Reference a sensible `geomean_ns_per_*` baseline number
|
||
|
||
Smoke-test the gate: deliberately break `kernels.rs` (e.g., return constant
|
||
zero), confirm the trial exits 2 with `correctness: fail`. Restore.
|
||
|
||
### 12. Add the target row to the top-level `README.md`
|
||
|
||
In the targets table at the top of the README, change the new target's row
|
||
from `candidate` to `landed`.
|
||
|
||
### 13. Commit
|
||
|
||
One commit for the target's scaffolding. Don't bundle multiple targets in
|
||
one commit — each target's history should be independently revertible.
|
||
|
||
## Common gotchas
|
||
|
||
- **Forgetting the empty `[workspace]` block** at the root means cargo walks
|
||
up to the omnigraph parent workspace. Already handled; just don't remove it.
|
||
- **Per-target `Cargo.toml` referencing the wrong `harness-common` path.**
|
||
Use `harness-common = { path = "../harness-common" }`.
|
||
- **Picking a `SHAPES` set that's too small.** Three shapes is the floor;
|
||
with one shape an agent could specialize and pass, with two there's not
|
||
enough variety. Ensure the shapes span at least one "outlier" (e.g., for
|
||
PQ, one shape with `sub_vector_dim != 8`).
|
||
- **Correctness battery too narrow.** Five distributions is the floor: at
|
||
minimum Gaussian / uniform / sparse / large-dynamic-range / mostly-zero (or
|
||
the integer analogue: uniform / clustered / skewed / few-distinct /
|
||
monotonic).
|
||
- **Trial time too long.** If the speed phase exceeds ~60s, agent iteration
|
||
rate drops below useful. Reduce workload sizes; the speed metric is
|
||
per-operation, not per-workload, so absolute size doesn't change the
|
||
comparison.
|