mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-24 02:38:06 +02:00

research: restructure lance-autoresearch as multi-target workspace

The original lance-autoresearch was one Cargo crate optimizing one Lance
kernel (PQ L2 distance). With 9+ candidate targets enumerated in the research
note, a single-crate shape doesn't scale: per-target deps will collide, the
agent's edits to one target's kernels.rs would conflict with another's lib
path, and build/test isolation is lost. Restructure into a Cargo workspace.

Layout:

  research/lance-autoresearch/
  ├── Cargo.toml          (workspace root)
  ├── README.md           (target table, contract overview, repo layout)
  ├── HARNESS.md          (universal loop contract every target inherits)
  ├── crates/
  │   ├── harness-common/ (shared: SplitMix64, geomean, peak RSS,
  │   │                    MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS)
  │   └── pq-l2/          (the landed target; was the previous single crate)
  └── docs/
      ├── design.md           (rationale for workspace shape, no Target trait)
      ├── adding-a-target.md  (step-by-step workflow for new targets)
      └── targets/pq-l2.md    (per-target capsule)

Decisions documented in docs/design.md:

- Workspace, not single crate: per-target Cargo.toml so deps don't collide;
  per-target src tree so agent edits don't conflict; per-target build/test
  isolation for faster agent iteration.
- harness-common as a plumbing-only crate (PRNG, geomean, peak RSS, tolerance
  constants, time budget). Intentionally NO Target trait - decode kernel
  signatures and distance kernel signatures differ enough that a unifying
  trait would either bloat or require erased boxing. Each target is its own
  natural shape.
- Per-target program.md + shared HARNESS.md: the loop contract is universal,
  the priors and API spec are per-target. Two files instead of one because
  copy-pasting the universal loop into every program.md would drift.

pq-l2 refactor:
- src/* moved into crates/pq-l2/src/* via git mv (preserves history)
- crate renamed lance-autoresearch -> pq-l2
- SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL,
  TIME_BUDGET_SECS now imported from harness-common (drops ~70 lines of
  duplication that would have been copy-pasted into every new target)
- program.md trimmed: setup/loop/hygiene moved to HARNESS.md; only the
  PQ-L2-specific API contract and SIMD priors remain
- Cargo.toml depends on harness-common via path; workspace.dependencies
  pins criterion uniformly across targets

The 10 candidate targets from the research note (A1 cosine/dot/hamming, A2
IVF partition select, A3 FTS BM25, A4 bitpack decode, A5 dictionary decode,
A6 FSST decode, A7 take/gather, A8 predicate eval, A9 posting list intersect,
A10 top-K merge) are listed in README.md's target table as "candidate"; each
gets a docs/targets/<name>.md capsule when it's spun up. docs/adding-a-target.md
documents the cp -r + edit-Cargo.toml + rewrite-three-files workflow.

Verified end-to-end:
- cargo build --release: clean, both crates compile
- cargo clippy --release --workspace --all-targets -- -D warnings: clean
- cargo test --release --workspace: 6/6 pass (4 harness-common + 2 pq-l2)
- cargo run --release --bin run_experiment -p pq-l2: correctness pass,
  geomean ~880k ns, exit 0, ~30s wall-clock
- omnigraph parent workspace unchanged (research/ excluded as before)

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5

2026-05-15 00:15:02 +00:00

7.6 KiB


Design — why the workspace is shaped this way

This document records the rationale for the multi-target workspace shape so future contributors don't relitigate the early decisions.

The thing we're building

A multi-target harness for LLM-driven optimization of Lance hot-path kernels. "Multi-target" because Lance has many such kernels — distance kernels in lance-linalg, decoders in lance-encoding, scan/merge kernels — and the right harness shape is identical across them: bit-exact correctness oracle, geomean-across-distributions speed metric, single-agent autoresearch loop.

The original research note enumerates ten such candidates (A1–A10) clustered by Lance crate. The first landed (pq-l2) proves the harness shape; the rest follow the same template.

Decision: workspace, not single crate

A single crate exposing multiple binaries (run_experiment_pq_l2, run_experiment_bitpack, ...) was the obvious-looking alternative. Rejected for three reasons:

1. Per-target deps differ. FSST decode wants different deps than PQ kernels (a string-compression library vs. just f32 math). A single Cargo.toml would either bundle every target's deps into every build or require fine-grained features. Workspaces give per-target Cargo.toml for free.
2. Edit isolation. The agent edits one target's kernels.rs at a time. In a single crate, kernels.rs files would collide on path or have to live in target-specific submodules with target-specific naming. Per-target crates put src/kernels.rs at the natural location every time and let the agent navigate one tree per session.
3. Build / test isolation. cargo build -p pq-l2 builds only what's needed for the PQ L2 target; cargo test -p pq-l2 runs only its tests. The agent's iteration loop is faster because it doesn't pay for unrelated targets' compile time.

The downside — workspace boilerplate, per-target Cargo.toml, the empty [workspace] block at the workspace root that prevents cargo from walking up to the parent omnigraph workspace — is a one-time cost. Per-target overhead of adding a new target is one cp -r plus path edits.

Decision: shared `harness-common` crate, no `Target` trait

A Target trait was the other obvious-looking alternative: express the common loop generically, plug in target-specific types. Rejected because:

1. Kernel signatures vary too much for a single trait shape. PQ probe_top_k returns Vec<(u32, f32)>. Bitpack decode returns an IntArray. FSST decode returns Vec<u8>. Predicate evaluation returns a BooleanArray. A unifying trait would need erased boxing or a wide associated-type surface, both of which obscure the actual hot path the agent is editing.
2. The orchestration that is shared is small. A deterministic PRNG (~30 lines), a geomean (~10 lines), peak RSS readback (~20 lines), three constants (two tolerances and the time budget). Total ~70 lines of shared code. Building a trait abstraction over 70 lines costs more than it saves.
3. The output format isn't worth sharing. Each target's run_experiment.rs prints a fixed-format result block; the fields differ per target (PQ shapes vs bit widths vs distribution kinds). A shared formatter would be either trivial wrapping of println! (no value) or a complicated builder API (negative value).

harness-common therefore exposes plumbing only: SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS. Each target consumes what it needs. The shared loop contract is documented in HARNESS.md, not encoded in code.
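
As an illustration of how small this plumbing is, here is a minimal sketch of two of the shared items. The names SplitMix64 and geomean come from the list above; the method names, signatures, and the next_f32 helper are assumptions made for this example, not the crate's actual API. The mixing constants are the standard published SplitMix64 ones.

```rust
/// SplitMix64: a tiny deterministic PRNG, so every run sees the same inputs.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    /// Advance the state and return the next 64-bit output.
    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Hypothetical helper: uniform f32 in [0, 1) built from the top 24 bits.
    pub fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u32 << 24) as f32
    }
}

/// Geometric mean of strictly positive samples, computed in log space
/// so that large timing values do not overflow the running product.
pub fn geomean(xs: &[f64]) -> f64 {
    assert!(!xs.is_empty(), "geomean of an empty slice is undefined");
    let log_sum: f64 = xs.iter().map(|x| x.ln()).sum();
    (log_sum / xs.len() as f64).exp()
}
```

Two generators constructed with the same seed produce the same sequence, which is what lets correctness fixtures and speed workloads be regenerated instead of downloaded.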

Decision: per-target `program.md` + shared `HARNESS.md`

The agent reads two files at session start:

- HARNESS.md (workspace-level), universal: the loop, the metric, the edit-permission table, hygiene rules.
- crates/<target>/program.md (per-target), specific: the kernel API the agent must keep stable, target-specific priors (which SIMD intrinsics tend to win on this kernel shape), the results.tsv column header.

The shape mirrors how Karpathy's nanochat-research program.md works, split along what varies (the per-target contract) versus what doesn't (the loop itself). Two files instead of one because copy-pasting the universal loop into every program.md would let the copies drift.

Decision: dataset-independent oracle for every target

The first iteration of the harness used recall@K vs. SIFT1M as the correctness oracle. We replaced it with bit-exact (or near-bit-exact for floats) match against a scalar reference because:

- The agent had incentive to overfit lossy approximations to the dataset's cluster structure, even though we didn't ask for that.
- SIFT1M is 250 MB and a hassle to download; the harness benefited from being self-contained.
- Mathematical equivalence is a strictly stronger contract than recall preservation: if the kernel is bit-equivalent to the scalar reference, recall is automatically identical because the distance values are the same. There's nothing recall@K catches that bit-exactness doesn't.

This decision generalizes to every target. Decode kernels get strict bitwise equality (no float arithmetic involved). Distance and BM25 kernels get max_abs_err ≤ 1e-4 (loose enough for SIMD-accumulator reordering, tight enough for real bugs). Targets that genuinely require lossy techniques to get headroom — there might be some; LUT u8 quantization in PQ is one — go in a separate "lossy track" with a recall-based oracle on diverse datasets, not the bit-exact track.
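
The two oracle flavors can be sketched as follows. MAX_ABS_ERR and its 1e-4 value come from the text above; the function names and the four-accumulator kernel are assumptions made for this example, with the latter standing in for a SIMD kernel that reorders float additions.

```rust
/// Tolerance for float kernels, as stated in the text above.
const MAX_ABS_ERR: f32 = 1e-4;

/// Scalar reference: squared L2 distance, accumulated strictly in order.
fn l2_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Stand-in for an optimized kernel: four independent accumulators,
/// which changes the summation order the way SIMD lanes would.
fn l2_four_lanes(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    for (i, (x, y)) in a.iter().zip(b).enumerate() {
        let d = x - y;
        acc[i % 4] += d * d;
    }
    acc[0] + acc[1] + acc[2] + acc[3]
}

/// Float oracle: largest absolute difference between kernel and reference.
fn max_abs_err(got: &[f32], want: &[f32]) -> f32 {
    assert_eq!(got.len(), want.len());
    got.iter()
        .zip(want)
        .map(|(g, w)| (g - w).abs())
        .fold(0.0, f32::max)
}

/// Decode oracle: strict bitwise equality, no tolerance.
fn decode_matches(got: &[u8], want: &[u8]) -> bool {
    got == want
}
```

A float kernel passes when max_abs_err(&got, &want) <= MAX_ABS_ERR; a decode kernel passes only when decode_matches returns true.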

Decision: per-target speed measurement spans multiple shapes × distributions

A single dataset would let an agent overfit to that dataset's distribution. Each target's inputs.rs therefore generates speed workloads across:

- Multiple shapes of the kernel's domain (PQ: (dim, num_sub_vectors, num_centroids); bitpack: bit width; etc.). Captures how the kernel performs at different sizes Lance users actually encounter.
- Multiple data distributions (Gaussian / uniform / sparse for floats; uniform / skewed / clustered for integers; etc.). Captures whether the kernel's win is data-distribution-conditional.
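
A minimal sketch of how such a workload grid might be enumerated for a PQ-style target. The field names dim, num_sub_vectors, and num_centroids come from the text; the type names PqShape and FloatDist and the function workload_grid are hypothetical, chosen for this example only.

```rust
/// One point in the kernel's shape domain (field names from the text above).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct PqShape {
    pub dim: usize,
    pub num_sub_vectors: usize,
    pub num_centroids: usize,
}

/// Float data distributions named in the text above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FloatDist {
    Gaussian,
    Uniform,
    Sparse,
}

/// Cartesian product of shapes and distributions: one speed workload per combo.
pub fn workload_grid(shapes: &[PqShape], dists: &[FloatDist]) -> Vec<(PqShape, FloatDist)> {
    shapes
        .iter()
        .flat_map(|&s| dists.iter().map(move |&d| (s, d)))
        .collect()
}
```

With two shapes and three distributions the grid has six entries, and each entry gets its own timing.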

The keep gate uses geomean across all (shape × distribution) combos with a worst-case guard: a kernel that wins on one combo and regresses ≥5% on another fails to keep, even if the geomean improves. This forces wins to generalize.
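
The gate described above can be sketched as a single predicate. The 5% threshold comes from the text; the names should_keep and MAX_REGRESSION are assumptions for this example, and timings are assumed to be positive nanosecond values where lower is better.

```rust
/// Geometric mean of strictly positive samples, computed in log space.
fn geomean(xs: &[f64]) -> f64 {
    assert!(!xs.is_empty());
    let log_sum: f64 = xs.iter().map(|x| x.ln()).sum();
    (log_sum / xs.len() as f64).exp()
}

/// Worst-case guard: a candidate/baseline ratio of 1.05 or more on any
/// combo (a regression of 5% or more) rejects the change.
const MAX_REGRESSION: f64 = 1.05;

/// `baseline_ns[i]` and `candidate_ns[i]` time the same (shape, distribution) combo.
fn should_keep(baseline_ns: &[f64], candidate_ns: &[f64]) -> bool {
    assert_eq!(baseline_ns.len(), candidate_ns.len());
    // The geomean across all combos must improve...
    let improves = geomean(candidate_ns) < geomean(baseline_ns);
    // ...and no single combo may regress past the guard.
    let no_bad_regression = baseline_ns
        .iter()
        .zip(candidate_ns)
        .all(|(b, c)| c / b < MAX_REGRESSION);
    improves && no_bad_regression
}
```

For a baseline of [100, 100, 100] ns, a candidate of [10, 100, 106] ns is rejected despite a much better geomean, because the third combo regresses by 6%.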

What's deliberately not abstracted

- Output format. Each target prints its own field block. See above.
- TopKHeap and other small data structures. When two targets need a TopKHeap, the second one copies the first's. Three copies of a 30-line struct is cheaper than one trait-erased indirection.
- Test data shapes. Each target's inputs.rs knows its own kernel's fixture shape. Sharing would require a generic Fixture<Kernel> trait, which would either be too narrow (forces every kernel into a query + workload shape) or too wide (gives up the type safety that makes the bench's correctness check obvious).
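
To give a sense of the size of struct being copied, here is one possible TopKHeap that keeps the k smallest distances. The name TopKHeap and the Vec<(u32, f32)> result shape come from this document; the fields, methods, and tie-breaking rule are assumptions for this sketch, not the pq-l2 implementation.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Heap entry ordered by distance, then id, so the max-heap root is the
/// worst candidate currently kept.
#[derive(Clone, Copy)]
struct Entry {
    dist: f32,
    id: u32,
}

impl Ord for Entry {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives f32 a total order, so NaN cannot break the heap.
        self.dist.total_cmp(&other.dist).then(self.id.cmp(&other.id))
    }
}
impl PartialOrd for Entry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Entry {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Entry {}

/// Bounded max-heap that retains the k entries with the smallest distance.
pub struct TopKHeap {
    k: usize,
    heap: BinaryHeap<Entry>,
}

impl TopKHeap {
    pub fn new(k: usize) -> Self {
        Self { k, heap: BinaryHeap::with_capacity(k) }
    }

    pub fn push(&mut self, id: u32, dist: f32) {
        let e = Entry { dist, id };
        if self.heap.len() < self.k {
            self.heap.push(e);
        } else if let Some(mut worst) = self.heap.peek_mut() {
            // Replace the current worst only if the new entry beats it.
            if e < *worst {
                *worst = e;
            }
        }
    }

    /// Results sorted nearest first.
    pub fn into_sorted(self) -> Vec<(u32, f32)> {
        self.heap
            .into_sorted_vec()
            .into_iter()
            .map(|e| (e.id, e.dist))
            .collect()
    }
}
```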

When to revisit

If the workspace grows past ~6 active targets and we notice we're copy-pasting more than ~50 lines of run_experiment.rs boilerplate per new target, consider extracting a shared RunExperiment helper that takes closures for the correctness and speed phases. Don't pre-extract — wait until the duplication is real and visible.
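
If that point is reached, the helper might look roughly like the following. This is a hypothetical sketch only: the function name, the closure signatures, and the use of string workload labels are all assumptions, and nothing like it exists in harness-common today.

```rust
use std::time::Instant;

/// Hypothetical shared driver: run the correctness phase once, then time the
/// speed phase once per workload label. Returns per-workload timings in ns.
pub fn run_experiment<C, S>(
    workloads: &[&str],
    mut correctness: C,
    mut speed: S,
) -> Result<Vec<f64>, String>
where
    C: FnMut() -> Result<(), String>,
    S: FnMut(&str),
{
    // A correctness failure aborts before any timing is taken.
    correctness()?;

    let mut timings_ns = Vec::with_capacity(workloads.len());
    for &name in workloads {
        let start = Instant::now();
        speed(name);
        timings_ns.push(start.elapsed().as_nanos() as f64);
    }
    Ok(timings_ns)
}
```

Each target would pass its own closures, so the kernel types stay concrete inside the target crate and no trait over kernel signatures is needed.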

If we add a target that genuinely doesn't fit the autoresearch loop (eval crosses ~30s; tournament sampling becomes the right control loop), it belongs in a separate workspace, not this one. The boundary line is the loop shape, not the target type.
