omnigraph/research/lance-autoresearch
Claude 272b70bfb4
research: redesign lance-autoresearch oracle to be dataset-independent
Original harness used recall@K vs. SIFT1M as the correctness oracle, which gives
the agent incentive to overfit to one data distribution: a kernel that hits
recall@10 on SIFT-shaped clusters could regress on other distributions and
still pass the gate. This commit replaces both halves of the oracle.

Correctness phase (was: recall@K floor):
  - Near-exact (max_abs_err <= 1e-4) match against an immutable scalar
    reference kernel, on a 5-distribution input battery (Gaussian, uniform,
    sparse, large-dynamic-range, mostly-zero) crossed with all evaluated PQ
    shapes. Top-K compared with tie-tolerant equivalence (TOPK_DIST_TOL=1e-4).
    Lossy techniques (LUT u8/u16 quantization, etc.) fail this gate by
    construction.

Speed phase (was: geomean ns over one synthetic dataset):
  - Geomean ns/query measured across 3 PQ shapes x 3 data distributions:
      (128, 16, 256) - SIFT-like
      (256, 16, 256) - sub_vector_dim=16
      (768, 96, 256) - BERT-like
    crossed with clustered / uniform / sparse data. Fixed seed across trials
    for reproducibility; per-combo timings reported alongside the global
    geomean / worst / best so a kernel that wins on one combo and regresses
    on another fails the worst-case guard.

Kernel API (was: const-DIM scalar functions):
  - Generic over (dim, num_sub_vectors, num_centroids) via PqShape.
  - PqKernel::new(shape, codebook) lets the agent pre-process the codebook
    once (transpose, cache c.c, pack LUT, etc.) and amortize across queries.
    Build cost is excluded from per-query timing - the bench measures
    distance_table + probe_top_k only.

Other consequences:
  - SIFT1M loader (src/fixture.rs), prepare_fixtures.sh, and the
    cache-directory plumbing are all deleted - the harness is now fully
    self-contained, with no external download.
  - src/inputs.rs replaces src/fixture.rs; deterministic per-trial
    test-data + workload generation, no frozen artifacts.
  - Cargo.toml gains an empty [workspace] block so cargo doesn't walk up to
    the omnigraph parent workspace from inside research/.

Verified end-to-end:
  - cargo build --release: clean
  - cargo clippy --release --all-targets -- -D warnings: clean
  - cargo run --release --bin run_experiment: correctness pass, geomean
    1.22M ns, worst 4.82M ns ((768,96,256), sparse), best 596k ns, exit 0,
    total wall-clock ~39s
  - smoke test: kernel returning 0 distance -> correctness fail with
    diagnostic, exit 2
  - cargo test --release --lib: 2/2 unit tests pass
    (correctness_battery_is_deterministic, speed_workloads_match_shapes)

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
2026-05-14 23:03:45 +00:00
benches research: redesign lance-autoresearch oracle to be dataset-independent 2026-05-14 23:03:45 +00:00
src research: redesign lance-autoresearch oracle to be dataset-independent 2026-05-14 23:03:45 +00:00
.gitignore research: lance-autoresearch — PQ L2 kernel autoresearch harness 2026-05-14 22:38:39 +00:00
Cargo.toml research: redesign lance-autoresearch oracle to be dataset-independent 2026-05-14 23:03:45 +00:00
LICENSE-APACHE research: lance-autoresearch — PQ L2 kernel autoresearch harness 2026-05-14 22:38:39 +00:00
LICENSE-MIT research: lance-autoresearch — PQ L2 kernel autoresearch harness 2026-05-14 22:38:39 +00:00
program.md research: redesign lance-autoresearch oracle to be dataset-independent 2026-05-14 23:03:45 +00:00
README.md research: redesign lance-autoresearch oracle to be dataset-independent 2026-05-14 23:03:45 +00:00
rust-toolchain.toml research: lance-autoresearch — PQ L2 kernel autoresearch harness 2026-05-14 22:38:39 +00:00

lance-autoresearch

An autoresearch-style harness for evolving Lance PQ L2 distance kernels via LLM coding agents (Claude Code, Codex, Cursor).

Modeled on Andrej Karpathy's nanochat-research three-file contract:

  • Immutable bench: src/bin/run_experiment.rs + src/inputs.rs + src/reference.rs. The agent cannot touch these.
  • Mutable kernel: src/kernels.rs. The agent's playground. Starts as a scalar baseline matching Lance's PQ L2 algorithm shape; the agent's job is to beat it.
  • Human-iterated program: program.md. The "skill" the agent reads at the start of every session. The human refines it between runs.
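
To make the "scalar baseline" concrete, here is a minimal sketch of what a PQ L2 kernel of this shape could look like. The names PqShape, PqKernel::new, distance_table, and probe_top_k come from the commit message, but every signature, the codebook layout, and the u8 code width below are assumptions for illustration, not the contents of src/kernels.rs.

```rust
/// Hypothetical shape descriptor: (dim, num_sub_vectors, num_centroids).
#[derive(Clone, Copy, Debug)]
pub struct PqShape {
    pub dim: usize,
    pub num_sub_vectors: usize,
    pub num_centroids: usize,
}

impl PqShape {
    /// Number of dimensions covered by each sub-vector.
    pub fn sub_dim(&self) -> usize {
        self.dim / self.num_sub_vectors
    }
}

/// Scalar baseline sketch.
/// Assumed codebook layout: [sub_vector][centroid][sub_dim], row-major.
pub struct PqKernel {
    shape: PqShape,
    codebook: Vec<f32>,
}

impl PqKernel {
    /// One-time setup; an optimized kernel could transpose or repack here.
    pub fn new(shape: PqShape, codebook: Vec<f32>) -> Self {
        assert_eq!(shape.dim % shape.num_sub_vectors, 0);
        assert_eq!(codebook.len(), shape.num_centroids * shape.dim);
        Self { shape, codebook }
    }

    /// table[m * num_centroids + c] is the squared L2 distance between
    /// query sub-vector m and centroid c of sub-space m.
    pub fn distance_table(&self, query: &[f32]) -> Vec<f32> {
        let s = self.shape;
        let sd = s.sub_dim();
        assert_eq!(query.len(), s.dim);
        let mut table = vec![0.0f32; s.num_sub_vectors * s.num_centroids];
        for m in 0..s.num_sub_vectors {
            let q = &query[m * sd..(m + 1) * sd];
            for c in 0..s.num_centroids {
                let base = (m * s.num_centroids + c) * sd;
                let centroid = &self.codebook[base..base + sd];
                table[m * s.num_centroids + c] = q
                    .iter()
                    .zip(centroid)
                    .map(|(a, b)| (a - b) * (a - b))
                    .sum::<f32>();
            }
        }
        table
    }

    /// `codes` holds one row of num_sub_vectors u8 codes per encoded vector.
    /// Returns the k smallest (distance, row index) pairs, ascending.
    pub fn probe_top_k(&self, table: &[f32], codes: &[u8], k: usize) -> Vec<(f32, usize)> {
        let s = self.shape;
        let mut scored: Vec<(f32, usize)> = codes
            .chunks_exact(s.num_sub_vectors)
            .enumerate()
            .map(|(row, code)| {
                let d = code
                    .iter()
                    .enumerate()
                    .map(|(m, &c)| table[m * s.num_centroids + c as usize])
                    .sum::<f32>();
                (d, row)
            })
            .collect();
        // Sort by distance, breaking ties by row index for determinism.
        scored.sort_by(|a, b| a.0.total_cmp(&b.0).then(a.1.cmp(&b.1)));
        scored.truncate(k);
        scored
    }
}
```

In this sketch the constructor does nothing beyond validation; the point of giving the kernel a constructor is that a faster variant can do its codebook preprocessing there once and reuse it for every query.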

Dataset-independent by design

Most ANN benchmarks you've seen are "compete on this fixed dataset" (SIFT1M, GIST1M, DEEP1B). That conflates two things: kernel correctness (the math) and kernel speed under one specific data distribution. An LLM agent given recall@K on a single dataset as the oracle has an incentive to overfit to that dataset's quirks.

We split them:

  • Correctness = match to a scalar reference kernel within a tight tolerance (max_abs_err ≤ 1e-4), on diverse generated inputs (Gaussian, uniform, sparse, large-dynamic-range, mostly-zero) × multiple PQ shapes. This is mathematical equivalence up to floating-point rounding; there's no dataset to overfit. Lossy techniques fail this gate.
  • Speed = geomean ns/query across multiple PQ shapes × multiple data distributions. A kernel that wins on one distribution and regresses on another fails the worst-case guard.
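
The correctness gate can be sketched as two small checks. This is an illustration under assumptions, not the code in src/reference.rs: the function names are hypothetical, TOPK_DIST_TOL is the constant named in the commit message, and comparing top-K rank by rank on distance only is one plausible reading of "tie-tolerant".

```rust
/// Tolerance for distance-table entries (value taken from the README).
pub const MAX_ABS_ERR: f32 = 1e-4;
/// Tolerance for top-K distances (name taken from the commit message).
pub const TOPK_DIST_TOL: f32 = 1e-4;

/// Largest absolute difference between two tables.
/// Length mismatches and NaNs are reported as infinite error.
pub fn max_abs_err(reference: &[f32], candidate: &[f32]) -> f32 {
    if reference.len() != candidate.len() {
        return f32::INFINITY;
    }
    reference
        .iter()
        .zip(candidate)
        .map(|(r, c)| {
            let e = (r - c).abs();
            if e.is_nan() { f32::INFINITY } else { e }
        })
        .fold(0.0, f32::max)
}

/// Hypothetical gate: the candidate table passes if no entry is off by
/// more than MAX_ABS_ERR.
pub fn tables_match(reference: &[f32], candidate: &[f32]) -> bool {
    max_abs_err(reference, candidate) <= MAX_ABS_ERR
}

/// Tie-tolerant top-K comparison (assumed semantics): two results are
/// equivalent if the distance at every rank agrees within TOPK_DIST_TOL.
/// Row ids are ignored, because equal distances may be ordered either way.
pub fn top_k_equivalent(reference: &[(f32, usize)], candidate: &[(f32, usize)]) -> bool {
    reference.len() == candidate.len()
        && reference
            .iter()
            .zip(candidate)
            .all(|(r, c)| (r.0 - c.0).abs() <= TOPK_DIST_TOL)
}
```

A u8-quantized lookup table typically introduces errors far larger than 1e-4 on entries of ordinary magnitude, which is why such techniques cannot pass a check of this kind.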

By construction, an "improvement" generalizes across distributions and shapes. There is no wget sift.tar.gz step; the harness is fully self-contained.
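
Being self-contained means inputs are generated from a seed rather than downloaded. A minimal std-only sketch of that idea follows; the SplitMix64 generator, the Distribution enum, the 10% sparsity rate, and the generate function are all assumptions chosen for illustration and are not taken from src/inputs.rs.

```rust
/// SplitMix64: a tiny deterministic PRNG that needs no external crates.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f32 in [0, 1), built from the top 24 bits.
    pub fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u64 << 24) as f32
    }
}

/// Two illustrative distributions; the real battery has more.
#[derive(Clone, Copy)]
pub enum Distribution {
    Uniform,
    Sparse,
}

/// Same (dist, seed, len) always yields the same vector.
pub fn generate(dist: Distribution, seed: u64, len: usize) -> Vec<f32> {
    let mut rng = SplitMix64::new(seed);
    (0..len)
        .map(|_| match dist {
            // Uniform in [-1, 1).
            Distribution::Uniform => rng.next_f32() * 2.0 - 1.0,
            // Roughly 10% of entries are non-zero (assumed rate).
            Distribution::Sparse => {
                let keep = rng.next_f32() < 0.1;
                let v = rng.next_f32() * 2.0 - 1.0;
                if keep { v } else { 0.0 }
            }
        })
        .collect()
}
```

Because the generator is seeded, two runs of the same trial see identical inputs, so timing differences come from the kernel and not from the data.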

Why a separate repo

OmniGraph (the graph engine that motivated this) pins Lance at a released version and consumes its kernels via the public crate API. Improvements live one layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps the optimization target pure (only the kernel changes), keeps the license clean for upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and keeps the agent's working set tiny.

Quick start

cargo run --release --bin run_experiment

# Or run with Claude Code / Codex:
#    Open the repo in your agent of choice and prompt:
#       Hi, have a look at program.md and let's kick off a new experiment.

File ownership

| File | Mutability | Edited by |
| --- | --- | --- |
| src/kernels.rs | mutable | the agent |
| src/bin/run_experiment.rs | immutable | |
| src/reference.rs | immutable | |
| src/inputs.rs | immutable | |
| src/lib.rs | immutable (shared types) | |
| benches/pq_l2.rs | immutable | |
| program.md | human-iterated | the human, between runs |
| results.tsv | append-only | the agent, per trial (gitignored) |

The metric

run_experiment runs two phases per trial: a correctness check and a multi-shape × multi-distribution speed measurement. Output looks like:

correctness:           pass
---
correctness:           pass
shapes_tested:         (128,16,256) (256,16,256) (768,96,256)
distributions_tested:  clustered uniform sparse
geomean_ns_per_query:  18234
worst_ns_per_query:    24515 ((768,96,256), sparse)
best_ns_per_query:     12876 ((128,16,256), clustered)
per_combo_geomean_ns:
  (128,16,256) clustered  -> 12876 ns
  (128,16,256) uniform    -> 13441 ns
  ...
peak_mem_mb:           28.4
total_seconds:         12.3
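
The summary lines above can be derived from the per-combo timings roughly as follows. This is a sketch with hypothetical names (Summary, summarize), not the code in run_experiment. The geometric mean is computed in log space, and the worst combo is tracked separately because a geomean alone can hide a regression on a single combo.

```rust
/// Hypothetical summary of one speed phase.
pub struct Summary {
    pub geomean_ns: f64,
    pub worst: (String, f64),
    pub best: (String, f64),
}

/// Summarize (label, ns/query) pairs. Returns None for empty input or
/// for any timing that is not a positive finite number.
pub fn summarize(per_combo: &[(&str, f64)]) -> Option<Summary> {
    if per_combo.is_empty()
        || per_combo.iter().any(|(_, ns)| !ns.is_finite() || *ns <= 0.0)
    {
        return None;
    }
    // Geometric mean = exp(mean of natural logs).
    let mean_ln =
        per_combo.iter().map(|(_, ns)| ns.ln()).sum::<f64>() / per_combo.len() as f64;
    let worst = per_combo.iter().max_by(|a, b| a.1.total_cmp(&b.1))?;
    let best = per_combo.iter().min_by(|a, b| a.1.total_cmp(&b.1))?;
    Some(Summary {
        geomean_ns: mean_ln.exp(),
        worst: (worst.0.to_string(), worst.1),
        best: (best.0.to_string(), best.1),
    })
}
```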

A kernel is "kept" iff:

  • Correctness phase passes (mathematical equivalence to scalar reference)
  • geomean_ns_per_query is strictly lower than the previous best-kept kernel's
  • worst_ns_per_query ≤ 1.05 × the previous best-kept kernel's worst
  • total_seconds ≤ 600

See program.md for the full loop spec.
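
As a quick illustration, the four keep conditions reduce to a single predicate. The Trial and Best structs and the should_keep function are hypothetical names for this sketch; the thresholds (strictly lower geomean, 1.05 × worst-case guard, 600 s budget) are the ones listed above.

```rust
/// Hypothetical record of one trial's results.
pub struct Trial {
    pub correctness_pass: bool,
    pub geomean_ns: f64,
    pub worst_ns: f64,
    pub total_seconds: f64,
}

/// Hypothetical record of the previous best-kept kernel.
pub struct Best {
    pub geomean_ns: f64,
    pub worst_ns: f64,
}

/// A trial is kept only if all four conditions hold.
pub fn should_keep(trial: &Trial, best: &Best) -> bool {
    trial.correctness_pass
        && trial.geomean_ns < best.geomean_ns // strictly faster on average
        && trial.worst_ns <= 1.05 * best.worst_ns // worst-case guard
        && trial.total_seconds <= 600.0 // wall-clock budget
}
```

With the sample output above as the previous best (geomean 18234 ns, worst 24515 ns), a trial at geomean 17000 ns is kept only if its worst combo stays at or below 25740.75 ns.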

Upstream contribution path

When a commit clears the keep bar by a meaningful margin (≥10% geomean speedup with worst-case guard intact), the human reviews the diff, ports the technique against lance-format/lance HEAD, runs Lance's own test suite, and opens a PR. Because src/kernels.rs is dual MIT/Apache-2.0 licensed and algorithmically modeled on Lance's existing path, the upstream PR inherits Apache-2.0 cleanly.

License

Dual-licensed under either of:

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.