The original lance-autoresearch was one Cargo crate optimizing one Lance
kernel (PQ L2 distance). With 9+ candidate targets enumerated in the research
note, a single-crate shape doesn't scale: per-target deps will collide, the
agent's edits to one target's kernels.rs would conflict with another's lib
path, and build/test isolation is lost. Restructure into a Cargo workspace.
Layout:
research/lance-autoresearch/
├── Cargo.toml              (workspace root)
├── README.md               (target table, contract overview, repo layout)
├── HARNESS.md              (universal loop contract every target inherits)
├── crates/
│   ├── harness-common/     (shared: SplitMix64, geomean, peak RSS,
│   │                        MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS)
│   └── pq-l2/              (the landed target; was the previous single crate)
└── docs/
    ├── design.md           (rationale for workspace shape, no Target trait)
    ├── adding-a-target.md  (step-by-step workflow for new targets)
    └── targets/pq-l2.md    (per-target capsule)
Decisions documented in docs/design.md:
- Workspace, not single crate: per-target Cargo.toml so deps don't collide;
per-target src tree so agent edits don't conflict; per-target build/test
isolation for faster agent iteration.
- harness-common as a plumbing-only crate (PRNG, geomean, peak RSS, tolerance
constants, time budget). Intentionally NO Target trait - decode kernel
signatures and distance kernel signatures differ enough that a unifying
trait would either bloat or require erased boxing. Each target is its own
natural shape.
- Per-target program.md + shared HARNESS.md: the loop contract is universal,
the priors and API spec are per-target. Two files instead of one because
copy-pasting the universal loop into every program.md would drift.
pq-l2 refactor:
- src/* moved into crates/pq-l2/src/* via git mv (preserves history)
- crate renamed lance-autoresearch -> pq-l2
- SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL,
TIME_BUDGET_SECS now imported from harness-common (drops ~70 lines of
duplication that would have been copy-pasted into every new target)
- program.md trimmed: setup/loop/hygiene moved to HARNESS.md; only the
PQ-L2-specific API contract and SIMD priors remain
- Cargo.toml depends on harness-common via path; workspace.dependencies
pins criterion uniformly across targets
The 10 candidate targets from the research note (A1 cosine/dot/hamming, A2
IVF partition select, A3 FTS BM25, A4 bitpack decode, A5 dictionary decode,
A6 FSST decode, A7 take/gather, A8 predicate eval, A9 posting list intersect,
A10 top-K merge) are listed in README.md's target table as "candidate"; each
gets a docs/targets/<name>.md capsule when it's spun up. docs/adding-a-target.md
documents the cp -r + edit-Cargo.toml + rewrite-three-files workflow.
Verified end-to-end:
- cargo build --release: clean, both crates compile
- cargo clippy --release --workspace --all-targets -- -D warnings: clean
- cargo test --release --workspace: 6/6 pass (4 harness-common + 2 pq-l2)
- cargo run --release --bin run_experiment -p pq-l2: correctness pass,
geomean ~880k ns, exit 0, ~30s wall-clock
- omnigraph parent workspace unchanged (research/ excluded as before)
https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
Design — why the workspace is shaped this way
This document records the rationale for the multi-target workspace shape so future contributors don't relitigate the early decisions.
The thing we're building
A multi-target harness for LLM-driven optimization of Lance hot-path kernels.
"Multi-target" because Lance has many such kernels — distance kernels in
lance-linalg, decoders in lance-encoding, scan/merge kernels — and the
right harness shape is identical across them: bit-exact correctness oracle,
geomean-across-distributions speed metric, single-agent autoresearch loop.
The original research note
enumerates ten such candidates (A1–A10) clustered by Lance crate. The first
landed (pq-l2) proves the harness shape; the rest follow the same template.
Decision: workspace, not single crate
A single crate exposing multiple binaries (run_experiment_pq_l2,
run_experiment_bitpack, ...) was the obvious-looking alternative. Rejected
for three reasons:
- Per-target deps differ. FSST decode wants different deps than PQ kernels (a string-compression library vs. just f32 math). A single Cargo.toml would either bundle every target's deps into every build or require fine-grained features. Workspaces give per-target Cargo.toml for free.
- Edit isolation. The agent edits one target's kernels.rs at a time. In a single crate, kernels.rs files would collide on path or have to live in target-specific submodules with target-specific naming. Per-target crates put src/kernels.rs at the natural location every time and let the agent navigate one tree per session.
- Build / test isolation. cargo build -p pq-l2 builds only what's needed for the PQ L2 target; cargo test -p pq-l2 runs only its tests. The agent's iteration loop is faster because it doesn't pay for unrelated targets' compile time.
The downside — workspace boilerplate, per-target Cargo.toml, the empty
[workspace] block at the workspace root that prevents cargo from walking up
to the parent omnigraph workspace — is a one-time cost. Per-target overhead
of adding a new target is one cp -r plus path edits.
Decision: shared harness-common crate, no Target trait
A Target trait was the other obvious-looking alternative: express the
common loop generically, plug in target-specific types. Rejected because:
- Kernel signatures vary too much for a single trait shape. PQ probe_top_k returns Vec<(u32, f32)>. Bitpack decode returns an IntArray. FSST decode returns Vec<u8>. Predicate evaluation returns a BooleanArray. A unifying trait would need erased boxing or a wide associated-type surface, both of which obscure the actual hot path the agent is editing.
- The orchestration that is shared is small. A deterministic PRNG (~30 lines), a geomean (~10 lines), peak RSS readback (~20 lines), and a few tolerance and time-budget constants. Total ~70 lines of shared code. Building a trait abstraction over 70 lines costs more than it saves.
- The output format isn't worth sharing. Each target's run_experiment.rs prints a fixed-format result block; the fields differ per target (PQ shapes vs bit widths vs distribution kinds). A shared formatter would be either trivial wrapping of println! (no value) or a complicated builder API (negative value).
harness-common therefore exposes plumbing only: SplitMix64, geomean,
peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS. Each
target consumes what it needs. The shared loop contract is documented in
HARNESS.md, not encoded in code.
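To make the size of that plumbing concrete, here is a minimal sketch of two of the shared pieces. The names SplitMix64 and geomean come from the list above; the method names, signatures, and the Option return are assumptions for illustration and may not match what harness-common actually exposes. The mixing constants are the standard published SplitMix64 ones.

```rust
/// Deterministic PRNG (SplitMix64). Struct name is from the text;
/// the constructor and method names are assumptions for this sketch.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    /// Next 64-bit output, using the standard SplitMix64 constants.
    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f32 in [0, 1), built from the top 24 bits of the output.
    pub fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u64 << 24) as f32
    }
}

/// Geometric mean of strictly positive samples, computed in log space
/// to avoid overflow. Returns None for empty input or any sample <= 0.
pub fn geomean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() || samples.iter().any(|&x| !(x > 0.0)) {
        return None;
    }
    let log_sum: f64 = samples.iter().map(|x| x.ln()).sum();
    Some((log_sum / samples.len() as f64).exp())
}
```

A target would seed the generator once per fixture so that every run sees identical inputs, then feed its per-combo timings to geomean.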
Decision: per-target program.md + shared HARNESS.md
The agent reads two files at session start:
- HARNESS.md (workspace-level): universal. The loop, the metric, the edit-permission table, hygiene rules.
- crates/<target>/program.md (per-target): specific. The kernel API the agent must keep stable, target-specific priors (which SIMD intrinsics tend to win on this kernel shape), the results.tsv column header.
The shape mirrors how Karpathy's nanochat-research program.md works,
factored across the dimension that varies (per target) vs. doesn't (the loop
itself). Two files instead of one because copy-pasting the universal loop
into every program.md makes them drift.
Decision: dataset-independent oracle for every target
The first iteration of the harness used recall@K vs. SIFT1M as the correctness oracle. We replaced it with bit-exact (or near-bit-exact for floats) match against a scalar reference because:
- The agent had incentive to overfit lossy approximations to the dataset's cluster structure, even though we didn't ask for that.
- SIFT1M is 250 MB and a hassle to download; the harness benefited from being self-contained.
- Mathematical equivalence is a strictly stronger contract than recall preservation: if the kernel is bit-equivalent to the scalar reference, recall is automatically identical because the distance values are the same. There's nothing recall@K catches that bit-exactness doesn't.
This decision generalizes to every target. Decode kernels get strict bitwise
equality (no float arithmetic involved). Distance and BM25 kernels get
max_abs_err ≤ 1e-4 (loose enough for SIMD-accumulator reordering, tight
enough for real bugs). Targets that genuinely require lossy techniques to
get headroom — there might be some; LUT u8 quantization in PQ is one — go
in a separate "lossy track" with a recall-based oracle on diverse datasets,
not the bit-exact track.
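The two oracle flavours described above can be sketched as follows. This is an illustration under assumptions, not the harness's actual code: the function names are hypothetical, squared L2 is used as a stand-in distance kernel, and l2_unrolled only imitates the accumulator reordering a real SIMD kernel would introduce. MAX_ABS_ERR uses the 1e-4 bound quoted in the text; its type here is assumed.

```rust
/// Tolerance quoted in the text (max_abs_err <= 1e-4). Type is an assumption.
pub const MAX_ABS_ERR: f32 = 1e-4;

/// Scalar reference: squared L2 distance, accumulated strictly left to right.
pub fn l2_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Stand-in for an optimized kernel: four independent accumulators, which
/// reorders the float additions the way a SIMD kernel typically does.
pub fn l2_unrolled(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 4];
    let chunks = a.len() / 4 * 4;
    for i in (0..chunks).step_by(4) {
        for lane in 0..4 {
            let d = a[i + lane] - b[i + lane];
            acc[lane] += d * d;
        }
    }
    // Handle the elements that do not fill a full group of four.
    let mut tail = 0.0f32;
    for i in chunks..a.len() {
        let d = a[i] - b[i];
        tail += d * d;
    }
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}

/// Float oracle: largest absolute difference between candidate and reference.
pub fn max_abs_err(candidate: &[f32], reference: &[f32]) -> f32 {
    assert_eq!(candidate.len(), reference.len());
    candidate
        .iter()
        .zip(reference)
        .map(|(c, r)| (c - r).abs())
        .fold(0.0f32, f32::max)
}

/// Decode oracle: strict equality of the output bytes, no tolerance.
pub fn bit_exact(candidate: &[u8], reference: &[u8]) -> bool {
    candidate == reference
}
```

A distance target would pass when max_abs_err(candidate, reference) <= MAX_ABS_ERR across all fixtures; a decode target would pass only when bit_exact holds.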
Decision: per-target speed measurement spans multiple shapes × distributions
A single dataset would let an agent overfit to that dataset's distribution.
Each target's inputs.rs therefore generates speed workloads across:
- Multiple shapes of the kernel's domain (PQ: (dim, num_sub_vectors, num_centroids); bitpack: bit width; etc.). Captures how the kernel performs at different sizes Lance users actually encounter.
- Multiple data distributions (Gaussian / uniform / sparse for floats; uniform / skewed / clustered for integers; etc.). Captures whether the kernel's win is data-distribution-conditional.
The keep gate uses geomean across all (shape × distribution) combos with a worst-case guard: a kernel that wins on one combo and regresses ≥5% on another fails to keep, even if the geomean improves. This forces wins to generalize.
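A minimal sketch of such a keep gate, assuming timings in nanoseconds where lower is better and that "regresses ≥5%" means the candidate's time is at least 1.05x the baseline on that combo. The function name, the constant name, and the slice-based interface are hypothetical.

```rust
/// Worst-case guard from the text: a 5% or larger regression on any single
/// combo blocks the keep. Constant name is an assumption.
pub const MAX_REGRESSION: f64 = 0.05;

/// Decide whether a candidate kernel is kept.
/// `baseline_ns[i]` and `candidate_ns[i]` are timings for the same
/// (shape x distribution) combo; lower is better.
pub fn keep(baseline_ns: &[f64], candidate_ns: &[f64]) -> bool {
    assert_eq!(baseline_ns.len(), candidate_ns.len());
    assert!(!baseline_ns.is_empty());

    // Per-combo time ratio: < 1.0 is a speedup, > 1.0 is a regression.
    let ratios: Vec<f64> = candidate_ns
        .iter()
        .zip(baseline_ns)
        .map(|(c, b)| c / b)
        .collect();

    // The geomean of the ratios equals the ratio of the two geomeans.
    let log_mean = ratios.iter().map(|r| r.ln()).sum::<f64>() / ratios.len() as f64;
    let geomean_improves = log_mean.exp() < 1.0;

    // Worst-case guard: the single slowest combo must stay under the limit.
    let worst = ratios.iter().copied().fold(f64::MIN, f64::max);
    let no_bad_regression = worst < 1.0 + MAX_REGRESSION;

    geomean_improves && no_bad_regression
}
```

With a baseline of 100 ns on three combos, a candidate at 50, 50, and 106 ns has a much better geomean but is still rejected because one combo regressed by 6%.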
What's deliberately not abstracted
- Output format. Each target prints its own field block. See above.
- TopKHeap and other small data structures. When two targets need a TopKHeap, the second one copies the first's. Three copies of a 30-line struct is cheaper than one trait-erased indirection.
- Test data shapes. Each target's inputs.rs knows its own kernel's fixture shape. Sharing would require a generic Fixture<Kernel> trait, which would either be too narrow (forces every kernel into a query + workload shape) or too wide (gives up the type safety that makes the bench's correctness check obvious).
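For a sense of scale on the TopKHeap point, here is one way such a struct could look: a bounded max-heap that keeps the k smallest distances. Only the name TopKHeap comes from the text; the Hit type, the methods, and the tie-breaking by id are assumptions for this sketch, and the copy in pq-l2 may differ.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Heap entry ordered by distance, so the max-heap root is the worst kept hit.
#[derive(Debug, Clone, Copy)]
struct Hit {
    dist: f32,
    id: u32,
}

impl Ord for Hit {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives f32 a total order (NaN included); ties break on id.
        self.dist.total_cmp(&other.dist).then(self.id.cmp(&other.id))
    }
}
impl PartialOrd for Hit {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Hit {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Hit {}

/// Keeps the k hits with the smallest distance seen so far.
pub struct TopKHeap {
    k: usize,
    heap: BinaryHeap<Hit>,
}

impl TopKHeap {
    pub fn new(k: usize) -> Self {
        Self { k, heap: BinaryHeap::with_capacity(k) }
    }

    pub fn push(&mut self, id: u32, dist: f32) {
        let hit = Hit { dist, id };
        if self.heap.len() < self.k {
            self.heap.push(hit);
        } else if let Some(mut worst) = self.heap.peek_mut() {
            // Replace the current worst only if the new hit is strictly better.
            if hit < *worst {
                *worst = hit;
            }
        }
    }

    /// Results in ascending distance order, as (id, distance) pairs.
    pub fn into_sorted(self) -> Vec<(u32, f32)> {
        self.heap
            .into_sorted_vec()
            .into_iter()
            .map(|h| (h.id, h.dist))
            .collect()
    }
}
```

The return type mirrors the Vec<(u32, f32)> that the text gives for PQ probe_top_k; another target copying this struct would be free to change it.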
When to revisit
If the workspace grows past ~6 active targets and we notice we're
copy-pasting more than ~50 lines of run_experiment.rs boilerplate per new
target, consider extracting a shared RunExperiment helper that takes
closures for the correctness and speed phases. Don't pre-extract — wait
until the duplication is real and visible.
If we add a target that genuinely doesn't fit the autoresearch loop (eval crosses ~30s; tournament sampling becomes the right control loop), it belongs in a separate workspace, not this one. The boundary line is the loop shape, not the target type.