docs/research: record that the first landed harness differs from the proposed shape

The note proposed surface 1 (index-build tuning) with a recall@K oracle and a BauplanLabs evolutionary tournament as the "smallest experiment that would produce signal." What landed at research/lance-autoresearch/ has a different shape: PQ kernel optimization with a bit-exact correctness oracle and a Karpathy single-agent autoresearch loop.

Add a "First implementation landed" section that records the divergence and the reasoning (seconds-scale eval favors the autoresearch shape; kernel work has a more direct upstream PR path; the bit-exact oracle removes the dataset-overfitting incentive). Bump the note to revision 3.

scripts/check-agents-md.sh still passes (30/30 links).

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
2026-06-09 01:35:18 +02:00 · 2026-05-14 23:32:57 +00:00 · 2026-05-14 23:32:57 +00:00 · 1d3cca1e76
commit 1d3cca1e76
parent 272b70bfb4
1 changed files with 13 additions and 1 deletions
--- a/docs/research/llm-evolutionary-sampling.md
+++ b/docs/research/llm-evolutionary-sampling.md
@@ -1,7 +1,7 @@
 # LLM Evolutionary Sampling — applicability to Lance directly

 **Type:** Research note (exploratory, not a committed plan)
-**Status:** Draft for discussion — revision 2
+**Status:** Draft for discussion — revision 3 (first harness landed; see "First implementation landed" below)
 **Date:** 2026-05-14
 **Author:** assigned via `claude/llm-evolutionary-sampling-research-UgKX8`

@@ -170,6 +170,18 @@ If the winning tuple beats Lance defaults by ≥1.3× in latency at equal recall

 **Out of scope for the first experiment:** Surface 6 (plan-patching) entirely. Surface 4 (compaction) because the per-trial cost is too high to learn fast. OmniGraph integration — make it a generic Lance tool first; an OmniGraph wrapper is a one-day port if the tool works.

+## First implementation landed: PQ kernel autoresearch harness
+
+The first harness committed against this research is **not** surface 1 above. It targets a related-but-different surface and adopts a different control loop. The harness lives at [`research/lance-autoresearch/`](../../research/lance-autoresearch/) (its own README + `program.md` document the contract).
+
+**Target shifted from index-build tuning (surface 1) to PQ kernel optimization.** The kernels in `lance-linalg`'s `distance/pq` module — `compute_distance_table_l2` + `probe_pq_l2_top_k` — sit one layer below the parameter surface and are exercised on every ANN query. They're self-contained Rust (no DataFusion plumbing), the per-trial eval is seconds not minutes, and a winning kernel ports directly upstream as a `lance-format/lance` PR. Index-parameter tuning remains a valid surface; it's just a slower iteration loop and a longer upstream path.
+
+**Control loop shifted from BauplanLabs evolutionary sampling (`bol_evol`, `n_samples_per_step`, tournament selection) to Karpathy's single-agent autoresearch contract.** With seconds-scale evaluation, parallel-sample tournaments don't pay for themselves; a single agent editing one file in a tight loop is more sample-efficient. The paper's tournament shape is the right answer when the eval is minutes-to-hours (its TPC-DS regime); when eval is seconds, the autoresearch shape wins. If we move to a slower surface later (index-build tuning, compaction tuning), the BauplanLabs control loop becomes the right choice again.
+
+**Oracle shifted from recall@K vs. SIFT1M to bit-exact equivalence + multi-distribution speed.** The doc's surface 1 oracle is `recall@K ≥ floor, minimize p95_latency` on one fixed dataset. That conflates "kernel is mathematically correct" with "kernel preserves recall on this distribution"; it also gives the agent incentive to overfit lossy approximations to SIFT-shaped clusters. The harness instead requires `max_abs_err ≤ 1e-4` against a scalar reference kernel on a 5-distribution input battery × 3 PQ shapes (correctness phase), then measures geomean ns/query across 3 shapes × 3 distributions with a worst-case guard (speed phase). Any "improvement" generalizes across distributions and PQ shapes by construction. There is no fixed dataset; the harness is fully self-contained.
+
+**What remains to validate against the paper's findings.** Whether autoresearch-shape LLM-driven kernel work actually produces meaningful Lance-upstreamable speedups is the open question; the harness exists to answer it empirically. If the answer is "yes, ≥10% geomean speedup with worst-case guard intact," the obvious next step is to spin the loop on surface 1 (index parameter tuning) with the BauplanLabs control shape, where the per-trial cost justifies parallel sampling. If the answer is "no meaningful win after a hundred trials," that's also a publishable conclusion — autoresearch-shape kernel optimization may already be at substrate-defaults optimum.
+
 ## Footnote: OmniGraph-IR as an alternative target

 The previous revision of this note focused on patching OmniGraph's own `QueryIR` (`crates/omnigraph-compiler/src/ir/mod.rs:9`) — multi-hop `Expand` ordering and direction, hybrid retrieval (`rrf`) leg tuning, filter pushdown shape. That surface is real and the [§IX](../invariants.md) deny-list already calls out the gap ("cost-blind plan choice — lowering-order execution is not a planner").
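As a reading aid for the oracle paragraph in the added section, the two phases can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the code in `research/lance-autoresearch/`: the function names `max_abs_err`, `geomean`, and `accept` are hypothetical, and so is the `max_regression` parameter, since the diff mentions a worst-case guard without giving its threshold. Only the `1e-4` tolerance and the geomean metric come from the note itself.

```rust
/// Correctness phase (sketch): largest absolute element-wise difference between
/// a candidate kernel's output and the scalar reference output.
/// A NaN difference is reported as +inf so it can never pass a `<= 1e-4` check.
fn max_abs_err(candidate: &[f32], reference: &[f32]) -> f32 {
    assert_eq!(candidate.len(), reference.len(), "outputs must have equal length");
    candidate
        .iter()
        .zip(reference)
        .map(|(c, r)| {
            let e = (c - r).abs();
            if e.is_nan() { f32::INFINITY } else { e }
        })
        .fold(0.0_f32, f32::max)
}

/// Speed phase (sketch): geometric mean of strictly positive per-case timings
/// (for example ns/query for each shape x distribution cell).
fn geomean(xs: &[f64]) -> f64 {
    assert!(!xs.is_empty(), "need at least one timing");
    let log_sum: f64 = xs.iter().map(|x| x.ln()).sum();
    (log_sum / xs.len() as f64).exp()
}

/// Hypothetical acceptance rule: the candidate must improve the geomean and
/// must not slow any single case down by more than `max_regression`
/// (a ratio such as 1.05; this threshold is an assumption, not from the note).
fn accept(baseline_ns: &[f64], candidate_ns: &[f64], max_regression: f64) -> bool {
    assert_eq!(baseline_ns.len(), candidate_ns.len());
    let worst_ratio = candidate_ns
        .iter()
        .zip(baseline_ns)
        .map(|(c, b)| c / b)
        .fold(0.0_f64, f64::max);
    geomean(candidate_ns) < geomean(baseline_ns) && worst_ratio <= max_regression
}
```

Under this sketch a trial would first require `max_abs_err(...) <= 1e-4` on every correctness case, and only then compare timings with `accept`. Using a geomean keeps one very fast cell from dominating the score, while the per-case ratio check plays the role of the worst-case guard.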