omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-18 02:24:27 +02:00

Claude 272b70bfb4 research: redesign lance-autoresearch oracle to be dataset-independent	2026-05-14 23:03:45 +00:00
lance-autoresearch	research: redesign lance-autoresearch oracle to be dataset-independent	2026-05-14 23:03:45 +00:00

research: redesign lance-autoresearch oracle to be dataset-independent

Original harness used recall@K vs. SIFT1M as the correctness oracle, which gives
the agent incentive to overfit to one data distribution: a kernel that hits
recall@10 on SIFT-shaped clusters could regress on other distributions and
still pass the gate. This commit replaces both halves of the oracle.

Correctness phase (was: recall@K floor):
  - Tolerance-bounded (max_abs_err <= 1e-4) match against an immutable scalar
    reference kernel, on a 5-distribution input battery (Gaussian, uniform,
    sparse, large-dynamic-range, mostly-zero) crossed with all evaluated PQ
    shapes. Top-K compared with tie-tolerant equivalence (TOPK_DIST_TOL=1e-4).
    Lossy techniques (LUT u8/u16 quantization, etc.) fail this gate by
    construction.
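
A minimal sketch of how the two correctness checks could look, assuming the
distance tables are flat f32 slices and top-K results are lists of row ids.
The function names and signatures here are hypothetical and not taken from the
harness; only the 1e-4 tolerance comes from the message above.

```rust
/// Tolerance quoted in the commit message (TOPK_DIST_TOL / max_abs_err bound).
const TOL: f32 = 1e-4;

/// Largest absolute element-wise difference between two equally sized tables.
/// A NaN anywhere is reported as infinity so it can never slip under the gate.
fn max_abs_err(reference: &[f32], candidate: &[f32]) -> f32 {
    assert_eq!(reference.len(), candidate.len(), "length mismatch");
    let mut worst = 0.0_f32;
    for (r, c) in reference.iter().zip(candidate) {
        let err = (r - c).abs();
        if err.is_nan() {
            return f32::INFINITY;
        }
        worst = worst.max(err);
    }
    worst
}

/// Tie-tolerant top-K equivalence (hypothetical formulation): at every rank,
/// the candidate's row may differ from the reference's row only if the two
/// rows have reference distances within `tol` of each other.
fn topk_equivalent(
    ref_dists_all: &[f32], // reference distance for every row
    ref_topk: &[usize],    // row ids chosen by the reference kernel
    cand_topk: &[usize],   // row ids chosen by the candidate kernel
    tol: f32,
) -> bool {
    if ref_topk.len() != cand_topk.len() {
        return false;
    }
    // Reject duplicated ids in the candidate result.
    let mut seen = cand_topk.to_vec();
    seen.sort_unstable();
    seen.dedup();
    if seen.len() != cand_topk.len() {
        return false;
    }
    ref_topk.iter().zip(cand_topk).all(|(&r, &c)| {
        match (ref_dists_all.get(r), ref_dists_all.get(c)) {
            (Some(dr), Some(dc)) => (dr - dc).abs() <= tol,
            _ => false, // out-of-range id
        }
    })
}
```

Under this formulation a lossy kernel fails as soon as its quantization error
exceeds the tolerance on any one table entry, regardless of the input
distribution.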

Speed phase (was: geomean ns over one synthetic dataset):
  - Geomean ns/query measured across 3 PQ shapes x 3 data distributions:
      (128, 16, 256) - SIFT-like
      (256, 16, 256) - sub_vector_dim=16
      (768, 96, 256) - BERT-like
    crossed with clustered / uniform / sparse data. Fixed seed across trials
    for reproducibility; per-combo timings reported alongside the global
    geomean / worst / best so a kernel that wins on one combo and regresses
    on another fails the worst-case guard.
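
As an illustration of the aggregation described above, the sketch below
computes the geomean / worst / best over per-combo timings and applies one
possible worst-case guard. The `Summary` type, `passes_guard`, and the
`worst_slack` parameter are assumptions made for this example; the message
does not state the exact guard rule.

```rust
/// Geometric mean of strictly positive, finite timings, computed in log
/// space so the product cannot overflow. Returns None for invalid input.
fn geomean(ns: &[f64]) -> Option<f64> {
    if ns.is_empty() || ns.iter().any(|t| !t.is_finite() || *t <= 0.0) {
        return None;
    }
    let log_sum: f64 = ns.iter().map(|t| t.ln()).sum();
    Some((log_sum / ns.len() as f64).exp())
}

/// Aggregate view over all (PQ shape, data distribution) combos.
struct Summary {
    geomean: f64,
    worst: f64,
    best: f64,
}

fn summarize(ns: &[f64]) -> Option<Summary> {
    let geomean = geomean(ns)?;
    let worst = ns.iter().copied().fold(f64::MIN, f64::max);
    let best = ns.iter().copied().fold(f64::MAX, f64::min);
    Some(Summary { geomean, worst, best })
}

/// Hypothetical guard: the candidate must improve the geomean and must not
/// let its slowest combo exceed the baseline's slowest combo times a slack.
fn passes_guard(baseline: &Summary, candidate: &Summary, worst_slack: f64) -> bool {
    candidate.geomean < baseline.geomean
        && candidate.worst <= baseline.worst * worst_slack
}
```

A kernel that speeds up eight combos but doubles the time of the slowest one
can still lower the geomean, which is why the worst value is checked
separately.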

Kernel API (was: const-DIM scalar functions):
  - Generic over (dim, num_sub_vectors, num_centroids) via PqShape.
  - PqKernel::new(shape, codebook) lets the agent pre-process the codebook
    once (transpose, cache c.c, pack LUT, etc.) and amortize across queries.
    Build cost is excluded from per-query timing - the bench measures
    distance_table + probe_top_k only.
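
A rough scalar sketch of what such an API might look like. The names PqShape,
PqKernel::new, distance_table, and probe_top_k come from the message above;
every signature, the codebook layout, the u8 code width, and the squared-L2
metric are assumptions made for illustration, not the harness's actual code.

```rust
/// (dim, num_sub_vectors, num_centroids), e.g. (128, 16, 256).
#[derive(Clone, Copy, Debug)]
pub struct PqShape {
    pub dim: usize,
    pub num_sub_vectors: usize,
    pub num_centroids: usize,
}

impl PqShape {
    pub fn sub_vector_dim(&self) -> usize {
        self.dim / self.num_sub_vectors
    }
}

pub struct PqKernel {
    shape: PqShape,
    /// Assumed layout: [num_sub_vectors][num_centroids][sub_vector_dim], flat.
    codebook: Vec<f32>,
}

impl PqKernel {
    /// One-time setup. An optimized kernel could transpose or pack the
    /// codebook here; this sketch stores it unchanged.
    pub fn new(shape: PqShape, codebook: Vec<f32>) -> Self {
        assert!(shape.dim > 0 && shape.num_sub_vectors > 0);
        assert_eq!(shape.dim % shape.num_sub_vectors, 0);
        assert!(shape.num_centroids <= 256, "codes are assumed to fit in u8");
        assert_eq!(codebook.len(), shape.num_centroids * shape.dim);
        Self { shape, codebook }
    }

    /// Squared-L2 distance from each query sub-vector to each centroid,
    /// laid out as [num_sub_vectors][num_centroids].
    pub fn distance_table(&self, query: &[f32]) -> Vec<f32> {
        let s = self.shape;
        let sd = s.sub_vector_dim();
        assert_eq!(query.len(), s.dim);
        let mut table = Vec::with_capacity(s.num_sub_vectors * s.num_centroids);
        for (m, q_sub) in query.chunks_exact(sd).enumerate() {
            let base = m * s.num_centroids * sd;
            for c in 0..s.num_centroids {
                let cen = &self.codebook[base + c * sd..base + (c + 1) * sd];
                let d: f32 = q_sub.iter().zip(cen).map(|(q, x)| (q - x) * (q - x)).sum();
                table.push(d);
            }
        }
        table
    }

    /// Sums table entries per encoded row (codes are row-major, one u8 per
    /// sub-vector, each assumed < num_centroids) and returns the k nearest
    /// rows as (row, distance), ties broken by row id.
    pub fn probe_top_k(&self, table: &[f32], codes: &[u8], k: usize) -> Vec<(usize, f32)> {
        let s = self.shape;
        let mut scored: Vec<(usize, f32)> = codes
            .chunks_exact(s.num_sub_vectors)
            .enumerate()
            .map(|(row, code)| {
                let d: f32 = code
                    .iter()
                    .enumerate()
                    .map(|(m, &c)| table[m * s.num_centroids + c as usize])
                    .sum();
                (row, d)
            })
            .collect();
        scored.sort_by(|a, b| a.1.total_cmp(&b.1).then(a.0.cmp(&b.0)));
        scored.truncate(k);
        scored
    }
}
```

Keeping the constructor separate from the two per-query methods is what lets
the build cost stay outside the timed region.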

Other consequences:
  - SIFT1M loader (src/fixture.rs), prepare_fixtures.sh, and the
    cache-directory plumbing are all deleted - the harness is now fully
    self-contained, no external download.
  - src/inputs.rs replaces src/fixture.rs; deterministic per-trial
    test-data + workload generation, no frozen artifacts.
  - Cargo.toml gains an empty [workspace] block so cargo doesn't walk up to
    the omnigraph parent workspace from inside research/.

Verified end-to-end:
  - cargo build --release: clean
  - cargo clippy --release --all-targets -- -D warnings: clean
  - cargo run --release --bin run_experiment: correctness pass, geomean
    1.22M ns, worst 4.82M ns ((768,96,256), sparse), best 596k ns, exit 0,
    total wall-clock ~39s
  - smoke test: kernel returning 0 distance -> correctness fail with
    diagnostic, exit 2
  - cargo test --release --lib: 2/2 unit tests pass
    (correctness_battery_is_deterministic, speed_workloads_match_shapes)

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
