A code review pass found a cluster of real bugs in metrics and contract;
fixing them before any agent loop runs against this harness.
Critical metric bug:
- harness-common::sysinfo::peak_rss_mb read VmPeak (virtual address space
high-water-mark, includes mmap'd files / guard pages / untouched
allocations) instead of VmHWM (resident pages high-water-mark). The
function name and HARNESS.md contract both promised RSS. Every
peak_mem_mb row logged under the old code was virtual peak, not RSS.
Correctness contract bug:
- reference::topk_consistent's tie-tolerance had a flawed neighbor-scan
check: when the K-th distance fell in a multi-way tie, agent and
reference could legally return different K-sized subsets of the tied
band (heap eviction order vs. sort stability), and the neighbor scan
required both endpoints to be present, false-negativing legitimate
cases. Simplified to a positional distance-tolerance check; ids at the
same rank may differ silently because the distance match within tol
constrains the swap to a 2*tol band. Diagnostic comment explains the
rationale.
API hygiene:
- Removed dead PqKernel::shape() and ScalarReference::shape() — declared
in the public API contract (program.md, kernels.rs comment), required
to be stable, never called by the bench / benches / inputs / reference.
Now the contract reflects what the bench actually uses.
- Removed dead `anyhow` workspace dependency.
Determinism:
- PRNG seed mixing now uses the SplitMix64 finalizer per part instead of
raw XOR. Raw XOR is commutative and small-constant collisions are
reachable; mix_seeds iterates the finalizer once per ingredient so
distinct (seed, shape, kind) tuples produce distinct streams with
vanishingly small collision probability.
License headers:
- kernels.rs SPDX changed from Apache-2.0 to MIT OR Apache-2.0 to match
the crate's Cargo.toml license field (the rest of the crate is dual-
licensed). Added matching SPDX headers to reference.rs and inputs.rs.
Doc cleanups:
- design.md: replaced the broken relative link
`../../docs/research/llm-evolutionary-sampling.md` (which resolved inside
lance-autoresearch where the note doesn't live) with a path-explained
reference noting the note lives in the parent OmniGraph repo and won't
ship on extraction.
- README.md: clarified that the target table mixes a single landed target
with a candidate roadmap — they have no code yet.
- HARNESS.md: added exit code 1 (internal error) to the exit-code summary;
was documented in run_experiment.rs but not in the loop contract.
- adding-a-target.md: dropped the misleading "cp -r plus surgical edits"
framing — the workflow rewrites 7 files; what's inherited is Cargo
manifest, license headers, workspace registration, and shared utilities.
Verified end-to-end: cargo build / clippy / test all green. Baseline
trial runs `correctness: pass` exit 0 in ~34s (peak_mem_mb now reads
RSS — same workload reports 91 MB, plausibly correct given the temporary
fixture-construction buffers).
https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
The original lance-autoresearch was one Cargo crate optimizing one Lance
kernel (PQ L2 distance). With 9+ candidate targets enumerated in the research
note, a single-crate shape doesn't scale: per-target deps will collide, the
agent's edits to one target's kernels.rs would conflict with another's lib
path, and build/test isolation is lost. Restructure into a Cargo workspace.
Layout:
research/lance-autoresearch/
├── Cargo.toml (workspace root)
├── README.md (target table, contract overview, repo layout)
├── HARNESS.md (universal loop contract every target inherits)
├── crates/
│ ├── harness-common/ (shared: SplitMix64, geomean, peak RSS,
│ │ MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS)
│ └── pq-l2/ (the landed target; was the previous single crate)
└── docs/
├── design.md (rationale for workspace shape, no Target trait)
├── adding-a-target.md (step-by-step workflow for new targets)
└── targets/pq-l2.md (per-target capsule)
Decisions documented in docs/design.md:
- Workspace, not single crate: per-target Cargo.toml so deps don't collide;
per-target src tree so agent edits don't conflict; per-target build/test
isolation for faster agent iteration.
- harness-common as a plumbing-only crate (PRNG, geomean, peak RSS, tolerance
constants, time budget). Intentionally NO Target trait - decode kernel
signatures and distance kernel signatures differ enough that a unifying
trait would either bloat or require erased boxing. Each target is its own
natural shape.
- Per-target program.md + shared HARNESS.md: the loop contract is universal,
the priors and API spec are per-target. Two files instead of one because
copy-pasting the universal loop into every program.md would drift.
pq-l2 refactor:
- src/* moved into crates/pq-l2/src/* via git mv (preserves history)
- crate renamed lance-autoresearch -> pq-l2
- SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL,
TIME_BUDGET_SECS now imported from harness-common (drops ~70 lines of
duplication that would have been copy-pasted into every new target)
- program.md trimmed: setup/loop/hygiene moved to HARNESS.md; only the
PQ-L2-specific API contract and SIMD priors remain
- Cargo.toml depends on harness-common via path; workspace.dependencies
pins criterion uniformly across targets
The 9 candidate targets from the research note (A1 cosine/dot/hamming, A2
IVF partition select, A3 FTS BM25, A4 bitpack decode, A5 dictionary decode,
A6 FSST decode, A7 take/gather, A8 predicate eval, A9 posting list intersect,
A10 top-K merge) are listed in README.md's target table as "candidate"; each
gets a docs/targets/<name>.md capsule when it's spun up. docs/adding-a-target.md
documents the cp -r + edit-Cargo.toml + rewrite-three-files workflow.
Verified end-to-end:
- cargo build --release: clean, both crates compile
- cargo clippy --release --workspace --all-targets -- -D warnings: clean
- cargo test --release --workspace: 6/6 pass (4 harness-common + 2 pq-l2)
- cargo run --release --bin run_experiment -p pq-l2: correctness pass,
geomean ~880k ns, exit 0, ~30s wall-clock
- omnigraph parent workspace unchanged (research/ excluded as before)
https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
Original harness used recall@K vs. SIFT1M as the correctness oracle, which gives
the agent incentive to overfit to one data distribution: a kernel that hits
recall@10 on SIFT-shaped clusters could regress on other distributions and
still pass the gate. This commit replaces both halves of the oracle.
Correctness phase (was: recall@K floor):
- Bit-equivalent (max_abs_err <= 1e-4) match against an immutable scalar
reference kernel, on a 5-distribution input battery (Gaussian, uniform,
sparse, large-dynamic-range, mostly-zero) crossed with all evaluated PQ
shapes. Top-K compared with tie-tolerant equivalence (TOPK_DIST_TOL=1e-4).
Lossy techniques (LUT u8/u16 quantization, etc.) fail this gate by
construction.
Speed phase (was: geomean ns over one synthetic dataset):
- Geomean ns/query measured across 3 PQ shapes x 3 data distributions:
(128, 16, 256) - SIFT-like
(256, 16, 256) - sub_vector_dim=16
(768, 96, 256) - BERT-like
crossed with clustered / uniform / sparse data. Fixed seed across trials
for reproducibility; per-combo timings reported alongside the global
geomean / worst / best so a kernel that wins on one combo and regresses
on another fails the worst-case guard.
Kernel API (was: const-DIM scalar functions):
- Generic over (dim, num_sub_vectors, num_centroids) via PqShape.
- PqKernel::new(shape, codebook) lets the agent pre-process the codebook
once (transpose, cache c.c, pack LUT, etc.) and amortize across queries.
Build cost is excluded from per-query timing - the bench measures
distance_table + probe_top_k only.
Other consequences:
- SIFT1M loader (src/fixture.rs), prepare_fixtures.sh, and the
cache-directory plumbing all delete - the harness is now fully
self-contained, no external download.
- src/inputs.rs replaces src/fixture.rs; deterministic per-trial
test-data + workload generation, no frozen artifacts.
- Cargo.toml gains an empty [workspace] block so cargo doesn't walk up to
the omnigraph parent workspace from inside research/.
Verified end-to-end:
- cargo build --release: clean
- cargo clippy --release --all-targets -- -D warnings: clean
- cargo run --release --bin run_experiment: correctness pass, geomean
1.22M ns, worst 4.82M ns ((768,96,256), sparse), best 596k ns, exit 0,
total wall-clock ~39s
- smoke test: kernel returning 0 distance -> correctness fail with
diagnostic, exit 2
- cargo test --release --lib: 2/2 unit tests pass
(correctness_battery_is_deterministic, speed_workloads_match_shapes)
https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5