mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-21 02:28:07 +02:00

Claude 272b70bfb4 research: redesign lance-autoresearch oracle to be dataset-independent

Original harness used recall@K vs. SIFT1M as the correctness oracle, which gives the agent incentive to overfit to one data distribution: a kernel that hits recall@10 on SIFT-shaped clusters could regress on other distributions and still pass the gate. This commit replaces both halves of the oracle.

Correctness phase (was: recall@K floor):
- Bit-equivalent (max_abs_err <= 1e-4) match against an immutable scalar reference kernel, on a 5-distribution input battery (Gaussian, uniform, sparse, large-dynamic-range, mostly-zero) crossed with all evaluated PQ shapes. Top-K compared with tie-tolerant equivalence (TOPK_DIST_TOL=1e-4). Lossy techniques (LUT u8/u16 quantization, etc.) fail this gate by construction.

Speed phase (was: geomean ns over one synthetic dataset):
- Geomean ns/query measured across 3 PQ shapes x 3 data distributions:
    (128, 16, 256) - SIFT-like
    (256, 16, 256) - sub_vector_dim=16
    (768, 96, 256) - BERT-like
  crossed with clustered / uniform / sparse data. Fixed seed across trials for reproducibility; per-combo timings reported alongside the global geomean / worst / best so a kernel that wins on one combo and regresses on another fails the worst-case guard.

Kernel API (was: const-DIM scalar functions):
- Generic over (dim, num_sub_vectors, num_centroids) via PqShape.
- PqKernel::new(shape, codebook) lets the agent pre-process the codebook once (transpose, cache c.c, pack LUT, etc.) and amortize across queries. Build cost is excluded from per-query timing - the bench measures distance_table + probe_top_k only.

Other consequences:
- SIFT1M loader (src/fixture.rs), prepare_fixtures.sh, and the cache-directory plumbing all delete - the harness is now fully self-contained, no external download.
- src/inputs.rs replaces src/fixture.rs; deterministic per-trial test-data + workload generation, no frozen artifacts.
- Cargo.toml gains an empty [workspace] block so cargo doesn't walk up to the omnigraph parent workspace from inside research/.

Verified end-to-end:
- cargo build --release: clean
- cargo clippy --release --all-targets -- -D warnings: clean
- cargo run --release --bin run_experiment: correctness pass, geomean 1.22M ns, worst 4.82M ns ((768,96,256), sparse), best 596k ns, exit 0, total wall-clock ~39s
- smoke test: kernel returning 0 distance -> correctness fail with diagnostic, exit 2
- cargo test --release --lib: 2/2 unit tests pass (correctness_battery_is_deterministic, speed_workloads_match_shapes)

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5

2026-05-14 23:03:45 +00:00
benches	research: redesign lance-autoresearch oracle to be dataset-independent	2026-05-14 23:03:45 +00:00
src	research: redesign lance-autoresearch oracle to be dataset-independent	2026-05-14 23:03:45 +00:00
.gitignore	research: lance-autoresearch — PQ L2 kernel autoresearch harness	2026-05-14 22:38:39 +00:00
Cargo.toml	research: redesign lance-autoresearch oracle to be dataset-independent	2026-05-14 23:03:45 +00:00
LICENSE-APACHE	research: lance-autoresearch — PQ L2 kernel autoresearch harness	2026-05-14 22:38:39 +00:00
LICENSE-MIT	research: lance-autoresearch — PQ L2 kernel autoresearch harness	2026-05-14 22:38:39 +00:00
program.md	research: redesign lance-autoresearch oracle to be dataset-independent	2026-05-14 23:03:45 +00:00
README.md	research: redesign lance-autoresearch oracle to be dataset-independent	2026-05-14 23:03:45 +00:00
rust-toolchain.toml	research: lance-autoresearch — PQ L2 kernel autoresearch harness	2026-05-14 22:38:39 +00:00

README.md

lance-autoresearch

An autoresearch-style harness for evolving Lance PQ L2 distance kernels via LLM coding agents (Claude Code, Codex, Cursor).

Modeled on Andrej Karpathy's nanochat-research three-file contract:

Immutable bench — src/bin/run_experiment.rs + src/inputs.rs + src/reference.rs. The agent cannot touch these.
Mutable kernel — src/kernels.rs. The agent's playground. Starts as a scalar baseline matching Lance's PQ L2 algorithm shape; the agent's job is to beat it.
Human-iterated program — program.md. The "skill" the agent reads at the start of every session. The human refines it between runs.

Dataset-independent by design

Most ANN benchmarks take the form "compete on this fixed dataset" (SIFT1M, GIST1M, DEEP1B). That conflates two things: kernel correctness (the math) and kernel speed under one specific data distribution. An LLM agent given recall@K as the oracle has an incentive to overfit to the dataset's quirks.

We split them:

Correctness = agreement within a tight numerical tolerance (max_abs_err ≤ 1e-4) with a scalar reference kernel, on diverse generated inputs (Gaussian, uniform, sparse, large-dynamic-range, mostly-zero) × multiple PQ shapes. This checks mathematical equivalence rather than retrieval quality, so there's no dataset to overfit. Lossy techniques fail this gate.
Speed = geomean ns/query across multiple PQ shapes × multiple data distributions. A kernel that wins on one distribution and regresses on another fails the worst-case guard.

By construction, an "improvement" generalizes across distributions and shapes. There is no wget sift.tar.gz step; the harness is fully self-contained.
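
To make the correctness gate concrete, here is a minimal sketch of a max_abs_err check against a scalar squared-L2 distance table. It is not the harness's actual code: the `Shape` struct, the function names, and the flattened `[sub_vector][centroid][sub_dim]` codebook layout are assumptions for illustration. Only the 1e-4 tolerance comes from the text above.

```rust
/// Hypothetical shape type for illustration; the harness's own type may differ.
#[derive(Clone, Copy)]
pub struct Shape {
    pub dim: usize,
    pub num_sub_vectors: usize,
    pub num_centroids: usize,
}

/// Tolerance quoted in the README.
pub const MAX_ABS_ERR: f32 = 1e-4;

/// Scalar reference: squared-L2 distance from each query sub-vector to every
/// centroid of the matching sub-codebook.
/// Assumed layout: codebook is [sub_vector][centroid][sub_dim], flattened.
pub fn reference_distance_table(shape: Shape, codebook: &[f32], query: &[f32]) -> Vec<f32> {
    let sub_dim = shape.dim / shape.num_sub_vectors;
    assert_eq!(query.len(), shape.dim);
    assert_eq!(
        codebook.len(),
        shape.num_sub_vectors * shape.num_centroids * sub_dim
    );
    let mut table = Vec::with_capacity(shape.num_sub_vectors * shape.num_centroids);
    for m in 0..shape.num_sub_vectors {
        let q = &query[m * sub_dim..(m + 1) * sub_dim];
        for c in 0..shape.num_centroids {
            let start = (m * shape.num_centroids + c) * sub_dim;
            let centroid = &codebook[start..start + sub_dim];
            let d: f32 = q.iter().zip(centroid).map(|(a, b)| (a - b) * (a - b)).sum();
            table.push(d);
        }
    }
    table
}

/// Largest element-wise absolute difference; any NaN or infinite difference
/// is reported as infinity so it can never slip under the tolerance.
pub fn max_abs_err(candidate: &[f32], reference: &[f32]) -> f32 {
    assert_eq!(candidate.len(), reference.len(), "length mismatch");
    let mut worst = 0.0f32;
    for (a, b) in candidate.iter().zip(reference) {
        let err = (a - b).abs();
        if !err.is_finite() {
            return f32::INFINITY;
        }
        worst = worst.max(err);
    }
    worst
}

/// The gate: a candidate table passes only if it stays within the tolerance.
pub fn passes_correctness(candidate: &[f32], reference: &[f32]) -> bool {
    max_abs_err(candidate, reference) <= MAX_ABS_ERR
}
```

A candidate kernel would produce its own table for the same `(shape, codebook, query)` and be compared with `passes_correctness`. A quantized lookup table typically drifts past 1e-4 on at least one entry, which is why lossy techniques fail the gate.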

Why a separate repo

OmniGraph (the graph engine that motivated this) pins Lance at a released version and consumes its kernels via the public crate API. Improvements live one layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps the optimization target pure (only the kernel changes), keeps the license clean for upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and keeps the agent's working set tiny.

Quick start

cargo run --release --bin run_experiment

# Or run with Claude Code / Codex:
#    Open the repo in your agent of choice and prompt:
#       Hi, have a look at program.md and let's kick off a new experiment.

File ownership

File	Mutability	Edited by
`src/kernels.rs`	mutable	the agent
`src/bin/run_experiment.rs`	immutable	—
`src/reference.rs`	immutable	—
`src/inputs.rs`	immutable	—
`src/lib.rs`	immutable (shared types)	—
`benches/pq_l2.rs`	immutable	—
`program.md`	human-iterated	the human, between runs
`results.tsv`	append-only	the agent, per trial (gitignored)

The metric

run_experiment runs two phases per trial: a correctness check and a multi-shape × multi-distribution speed measurement. Output looks like:

correctness:           pass
---
correctness:           pass
shapes_tested:         (128,16,256) (256,16,256) (768,96,256)
distributions_tested:  clustered uniform sparse
geomean_ns_per_query:  18234
worst_ns_per_query:    24515 ((768,96,256), sparse)
best_ns_per_query:     12876 ((128,16,256), clustered)
per_combo_geomean_ns:
  (128,16,256) clustered  -> 12876 ns
  (128,16,256) uniform    -> 13441 ns
  ...
peak_mem_mb:           28.4
total_seconds:         12.3
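
The summary lines above (geomean, worst, best) can be derived from the per-combo timings. The following is a rough sketch of that aggregation, not the harness's implementation; `ComboTiming`, `Summary`, and `summarize` are hypothetical names.

```rust
/// One measurement per (shape, distribution) combo, in ns/query.
/// Hypothetical type for illustration.
pub struct ComboTiming {
    pub label: String,
    pub ns_per_query: f64,
}

/// Aggregate figures matching the summary lines of the sample output.
pub struct Summary {
    pub geomean_ns: f64,
    pub worst_ns: f64,
    pub best_ns: f64,
}

/// Returns None when there are no timings or any timing is not a positive,
/// finite number, since the geometric mean is undefined in those cases.
pub fn summarize(timings: &[ComboTiming]) -> Option<Summary> {
    if timings.is_empty()
        || timings
            .iter()
            .any(|t| !(t.ns_per_query.is_finite() && t.ns_per_query > 0.0))
    {
        return None;
    }
    // Geometric mean via the mean of logs, which avoids overflow from
    // multiplying many large nanosecond values together.
    let log_sum: f64 = timings.iter().map(|t| t.ns_per_query.ln()).sum();
    let geomean_ns = (log_sum / timings.len() as f64).exp();
    let worst_ns = timings
        .iter()
        .map(|t| t.ns_per_query)
        .fold(f64::NEG_INFINITY, f64::max);
    let best_ns = timings
        .iter()
        .map(|t| t.ns_per_query)
        .fold(f64::INFINITY, f64::min);
    Some(Summary { geomean_ns, worst_ns, best_ns })
}
```

The geometric mean weights each combo by its ratio rather than its absolute time, so a slow shape such as (768,96,256) does not dominate the headline number. The separate worst-case figure is what catches a regression confined to one combo.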

A kernel is "kept" iff:

Correctness phase passes (mathematical equivalence to scalar reference)
geomean_ns_per_query strictly better than the previous best-kept kernel
worst_ns_per_query ≤ 1.05 × the previous best-kept kernel's worst
total_seconds ≤ 600
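
The four conditions above can be read as a single predicate. A minimal sketch, assuming a hypothetical `TrialResult` struct; the thresholds (1.05 and 600) are the ones listed, and program.md remains the authoritative spec.

```rust
/// Outcome of one run_experiment trial. Hypothetical type for illustration;
/// field names mirror the output keys shown in the README.
#[derive(Clone, Copy)]
pub struct TrialResult {
    pub correctness_pass: bool,
    pub geomean_ns: f64,
    pub worst_ns: f64,
    pub total_seconds: f64,
}

/// Worst-case guard: the candidate's worst combo may be at most 5% slower.
pub const WORST_CASE_SLACK: f64 = 1.05;
/// Wall-clock budget per trial, in seconds.
pub const MAX_TOTAL_SECONDS: f64 = 600.0;

/// A candidate is kept only if every condition holds.
pub fn is_kept(candidate: &TrialResult, best_kept: &TrialResult) -> bool {
    candidate.correctness_pass
        // Strictly better: a tie on geomean is not an improvement.
        && candidate.geomean_ns < best_kept.geomean_ns
        && candidate.worst_ns <= WORST_CASE_SLACK * best_kept.worst_ns
        && candidate.total_seconds <= MAX_TOTAL_SECONDS
}
```

With the sample numbers above as the previous best (geomean 18234 ns, worst 24515 ns), a candidate at geomean 17000 ns is kept only if its worst combo stays at or below 1.05 × 24515 ≈ 25741 ns.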

See program.md for the full loop spec.

Upstream contribution path

When a commit clears the keep bar by a meaningful margin (≥10% geomean speedup with worst-case guard intact), the human reviews the diff, ports the technique against lance-format/lance HEAD, runs Lance's own test suite, and opens a PR. Because src/kernels.rs is dual MIT/Apache-2.0 licensed and algorithmically modeled on Lance's existing path, the upstream PR inherits Apache-2.0 cleanly.

License

Dual-licensed under either of:

MIT license (LICENSE-MIT)
Apache License, Version 2.0 (LICENSE-APACHE)

at your option.
