From 0de7fb3057b6f0037cc7f94ea7cea28d90a75603 Mon Sep 17 00:00:00 2001
From: Claude
Date: Thu, 14 May 2026 21:38:12 +0000
Subject: [PATCH] research: reframe LLM evolutionary sampling note around Lance directly
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

User clarified the target: optimize Lance directly rather than
OmniGraph's IR layer. Rewrites the note with Lance as the primary
target.

Key reframe: Lance is parameter-heavy (not just plan-shape-heavy). The
biggest wins come from configuration tuples (IvfPq num_partitions /
num_sub_vectors / quantizer choice, nprobes / refine_factor / prefilter,
batch_size / io_buffer_size / thread pools, AIMD throttle, scalar-index
choice per column, compaction policy). None of these need a Lance fork —
Lance accepts them as config and emits the metrics. That makes
parameter-search a no-fork, substrate-respecting application of the
BauplanLabs JSON-Patch-on-DAG mechanic (patches over config objects
instead of plan trees).

The plan-patching angle (LanceTableProvider → DataFusion ExecutionPlan,
HashJoinExec swap, multi-join reorder) is parked as the long-term play
behind an upstream-contribution step: serializing/round-tripping
ExecutionPlan as JSON is the prerequisite Bauplan added in their fork,
and the right move is to contribute it upstream rather than maintain a
fork.

Ranks six surfaces by value/difficulty, proposes a smallest experiment
on surface 1 (workload-conditioned IvfPq tuning on SIFT1M or
LAION-sample with recall@10 / p95-latency fitness, bol_evol with
n_steps=3, n_samples=4), and treats OmniGraph-IR work as a complementary
footnote since it composes cleanly with a Lance-tuner output.
---
 docs/research/llm-evolutionary-sampling.md | 213 ++++++++++++---------
 1 file changed, 122 insertions(+), 91 deletions(-)

diff --git a/docs/research/llm-evolutionary-sampling.md b/docs/research/llm-evolutionary-sampling.md
index a036c3c..bba40db 100644
--- a/docs/research/llm-evolutionary-sampling.md
+++ b/docs/research/llm-evolutionary-sampling.md
@@ -1,170 +1,201 @@
-# LLM Evolutionary Sampling — applicability to OmniGraph
+# LLM Evolutionary Sampling — applicability to Lance directly
 **Type:** Research note (exploratory, not a committed plan)
-**Status:** Draft for discussion
+**Status:** Draft for discussion — revision 2
 **Date:** 2026-05-14
 **Author:** assigned via `claude/llm-evolutionary-sampling-research-UgKX8`
 ## TL;DR
-Erol et al. (BauplanLabs / Stanford / TogetherAI / Bauplan, arXiv [2602.10387](https://arxiv.org/abs/2602.10387)) ship **DBPlanBench**, a harness that takes a DataFusion physical plan, serializes it to JSON, asks an LLM to propose RFC 6902 JSON-Patch edits (mostly hash-join build/probe-side swaps and multi-join reorderings), benchmarks each candidate end-to-end on Modal cloud sandboxes, and runs a small evolutionary loop (`n_steps`, `n_samples_per_step`, `top_k_patches`). They report up to **4.78× speedups** on TPC-DS queries and demonstrate that patches found on small scale factors transfer to larger ones via scan-signature matching.
+Erol et al. (BauplanLabs / Stanford / TogetherAI / Bauplan, arXiv [2602.10387](https://arxiv.org/abs/2602.10387)) ship **DBPlanBench**: take a DataFusion physical plan, serialize it to JSON, have an LLM propose RFC 6902 JSON-Patch edits (hash-join build/probe swaps, multi-join reorders), benchmark each candidate end-to-end, run a small evolutionary loop (`n_steps`, `n_samples_per_step`, `top_k_patches`). Up to **4.78× speedups** on TPC-DS, with patches found at small scale factors transferring to larger ones via scan-signature matching.
-The *direct* port — fork DataFusion, expose plans, prompt an LLM — is a poor fit for us. OmniGraph uses DataFusion only as a narrow `MemTable` utility in one function (`table_store::scan_pending_batches`, `crates/omnigraph/src/table_store.rs:1605`); we **own the IR** and the executor walks it directly (`crates/omnigraph/src/exec/query.rs:348`). Patching DataFusion would fight invariant [§I.1](../invariants.md) (substrate respect) and would make us maintain a fork. +This note targets **Lance directly** rather than OmniGraph's IR. Lance is a better-shaped target for two reasons: -The *adapted* form is interesting. The natural surface is **OmniGraph's own `QueryIR` and the lowering between `.gq` AST and engine execution** (`crates/omnigraph-compiler/src/ir/mod.rs:9`). The current lowering is deterministic by source order and has no cost model — the deny-list in [§IX](../invariants.md) calls this out explicitly ("cost-blind plan choice — lowering-order execution is not a planner"). Evolutionary plan search lets us close that gap by *measuring* good plans instead of building a hand-rolled cost model first. The mechanic also pairs naturally with the aspirational invariants [§V.18](../invariants.md) (estimate-vs-actual logging) and [§VII.41–42](../invariants.md) (SIP, factorize multi-hop): every evolutionary trial is a real measurement, and the search corpus surfaces the cases where SIP or factorization actually wins. +1. **Lance is parameter-heavy, not just plan-shape-heavy.** The biggest performance wins in Lance come from *configuration tuples* — index build parameters (`num_partitions`, `num_sub_vectors`, quantizer choice), scan-time knobs (`nprobes`, `refine_factor`, `batch_size`, `io_buffer_size`, `prefilter` vs postfilter), fragment layout, compaction policy. 
The Lance performance guide [openly admits](https://lance.org/guide/performance/) the defaults are "balanced" rather than tuned per workload, and the AIMD throttle starts at 2000 req/s with a 5000 cap — generic defaults that any specific deployment should re-tune. The BauplanLabs JSON-Patch-on-DAG mechanic transfers, but the substrate to patch is a config object, not a `HashJoinExec`. +2. **Cross-scale transfer matters more here than in the paper.** The paper's headline is "tune at SF=3, apply at SF=10." Lance has this problem *intrinsically* — you train an `IvfPq` index on a sample, you scan with parameters chosen on a development dataset, and the per-deployment differences (vector dimensionality, partition count, query selectivity) dominate any plan-shape effect. Cross-scale transfer of well-tuned config tuples is exactly what production users need. -This note (a) summarizes what the paper does, (b) maps it onto our architecture honestly, (c) lists concrete application surfaces with cost/value, and (d) proposes a smallest experiment that would move us. +The direct DataFusion-plan-patching angle still exists for Lance — `LanceTableProvider` lets DataFusion run SQL over Lance, and that produces an `ExecutionPlan` that could be patched the same way — but it depends on upstream features (JSON round-trip of `ExecutionPlan`) that Bauplan added in their fork. **Contributing those upstream to Lance + DataFusion** is a more durable bet than maintaining a fork. The parameter-search angle, by contrast, needs **no fork at all** — Lance already accepts these as config and produces measurable execution metrics. -## What the paper does +This revision leads with the parameter-search angle as the primary target, treats the upstream-contribution plan-patching angle as the long-term play, and demotes the OmniGraph-IR angle (which is real but less novel) to a closing footnote. -The BauplanLabs system is an *offline plan optimizer with an online benchmarker*. 
Components: +## What the paper does (compact recap) -1. **A patched DataFusion fork** (`datafusion_patched/` in their repo) that (a) emits a physical plan in a compact JSON form with node IDs as keys and edges as `input` / `left` / `right` references, and (b) accepts a patched plan and executes it. The "compact serialized representation" the paper refers to is this proto-derived JSON dialect plus a `succinct_table_info` blob carrying per-table cardinalities. +DBPlanBench has four pieces (`src/sampling/` in the upstream repo): -2. **An LLM sampler** (`src/sampling/gpt_plan_optimizer.py`) that takes the current plan + `succinct_table_info` + the SQL string and asks GPT-5 (default) to produce a JSON Patch array. The system prompt (`src/sampling/sql_optimization_prompts.py`) is highly specific: it walks the model through (i) cardinality estimation by *semantic reasoning over column names and predicates*, (ii) join-side swap rules (`left` should be the smaller input), (iii) multi-join reorder rules, and (iv) the projection-index recalculation needed after a swap. The model returns operations like `{"op": "replace", "path": "/6/hashJoin/on/0/left", "value": ...}`. +1. A **patched DataFusion fork** (`datafusion_patched/`) that serializes physical plans to JSON with node IDs as keys and `input` / `left` / `right` edges, plus a `succinct_table_info` blob with per-table cardinalities — and accepts a patched JSON plan and executes it. +2. An **LLM sampler** (`gpt_plan_optimizer.py`) that prompts GPT-5 with the plan, table info, and query. The system prompt (`sql_optimization_prompts.py`) walks the model through (i) cardinality estimation by semantic reasoning over column names and predicates, (ii) build-side selection rules (smaller input on left), (iii) multi-join reorder rules, (iv) projection-index recalculation after a swap. The model returns RFC 6902 patches. +3. 
A **Modal-sandboxed evaluator** that applies the patch with `jsonpatch`, runs the patched plan `n_runs` times, validates result-set equality vs. the base plan, and reports `execution_time.min`. +4. An **evolutionary loop** (`orchestrator.py`) with three strategies: `bol_evol` (keep best, mutate from there), `pst_evol` (broader exploration, take all last-step plans as bases), `best_of` (single-step, no evolution). -3. **An evaluator on Modal** that applies the patch with `jsonpatch`, hands the patched plan back to the patched DataFusion engine, runs it `n_runs` times against TPC-DS / TPC-H at the configured scale factor, validates result-set equality vs. the base plan, and reports `execution_time.min`. +The cross-scale transfer (`plan_scaler.py`, 740 lines) is a separate machine: it walks the new SF's plan, matches scan signatures against the old SF's plan, remaps node IDs, and reapplies the patches. This is the practical-value lever. -4. **An evolutionary loop** (`src/sampling/orchestrator.py`) with three strategies: - - `bol_evol` ("best-of-last evolutionary"): keep the best plan from the previous step, ask the LLM for `n_samples_per_step` further edits. - - `pst_evol` ("post-evaluation evolutionary"): broader exploration, takes all last-step plans as bases. - - `best_of`: single-step, no evolution (an N-of-1 LLM ablation). - Selection is by `optimization_metric` (default `execution_time.min`). +The fitness function is end-to-end wall-clock. There is no internal model of the optimizer. -5. **Cross-scale transfer** (`src/sampling/plan_scaler.py`, 740 lines): takes a patch found at SF=3 and rewrites it for SF=6/10/etc. by matching scan signatures and remapping internal node IDs. This is the headline reason the system is useful in practice — search is cheap on small data, payoff is on large data. +## Why Lance is the right target shape -The fitness function is **end-to-end wall-clock**. 
There is no model of the optimizer or the executor; the LLM is steered by a hand-written system prompt encoding standard relational rules. +Lance's tunable surface, drawn from the [performance guide](https://lance.org/guide/performance/), [read/write guide](https://lance.org/guide/read_and_write/), [index pages](https://lance.org/format/table/index/), and [DataFusion integration](https://lance.org/integrations/datafusion/): -## Mapping onto OmniGraph +**Vector index build (`IvfPq` / `IvfHnswSq` / `IvfHnswPq` / `RaBitQ`).** `num_partitions`, `num_sub_vectors`, `nbits`, quantizer choice (PQ vs SQ vs RQ), `sample_rate` (default 256), `metric_type` (L2 / cosine / dot), HNSW-specific `ef_construction` and `m`. Storage and recall trade off heavily across these; the perf guide lays out the math (e.g., `num_partitions * sample_rate * dimension * sizeof(data_type)` is the IVF training RAM, which is non-trivial — 768 MiB at 1024 partitions × 768-d float32). -Three structural facts shape the answer. +**Scan-time vector search.** `nprobes`, `refine_factor`, pre-filter vs post-filter (pre is cheaper when predicates are selective; post is cheaper when they're not — currently a static decision). -**Fact 1: We are not a DataFusion consumer at the query level.** `omnigraph-compiler` lowers `.gq` to `QueryIR { pipeline: Vec, ... }` (`crates/omnigraph-compiler/src/ir/mod.rs:9`) where `IROp` is `NodeScan | Expand | Filter | AntiJoin`. `exec::query::execute_query` walks the pipeline as a hand-rolled streaming interpreter and produces Arrow `RecordBatch`es directly (`crates/omnigraph/src/exec/query.rs:348`). DataFusion is touched only inside `scan_pending_batches` (`crates/omnigraph/src/table_store.rs:1612`) to apply SQL-style filters to in-memory pending batches for read-your-writes. We do not build `LogicalPlan` or `ExecutionPlan` trees anywhere. 
Therefore: **the surface the paper targets — DataFusion physical plans — does not exist in our hot path.** Forking DataFusion to add it would be the wrong direction. +**Scalar index choice per column.** BTree (range queries), Bitmap (equality, small-range, many-bitmap-overhead), Bloom-filter (membership, no range), Label-list (list columns), Zone-map (page-pruning), R-Tree (spatial), Ngram (LIKE), FTS (text). The right choice depends on column cardinality, value distribution, and the query workload — not on the schema. Today an operator picks one per column at index-build time; a workload-aware advisor is a clean LLM job. -**Fact 2: We have no planner.** The deny-list in [docs/invariants.md §IX](../invariants.md) lists "cost-blind plan choice — lowering-order execution is not a planner" as an explicit anti-pattern; the absence of a planner is acknowledged. Today, multi-hop traversal order, join-side selection, and ordering of `nearest()` / `bm25()` / `rrf()` retrievers are all determined by lexical order of the `.gq` query and lowering convention, not by any model of cost. This means there is **no existing decision surface to plug an LLM into**; we would be introducing one. That is a feature, not a bug: it means the IR is small enough that JSON-Patch on IR ops is a viable representation today, before the IR has accreted dozens of operator kinds. +**Scan parameters.** `batch_size` (default 8192 rows; recommended ~1MB-per-batch for scalar, smaller for high-dim vectors), `io_buffer_size` (default 2GB), `LANCE_IO_THREADS` (8 local / 64 cloud), `LANCE_CPU_THREADS` (cores), `index_cache_size_bytes` (default 6 GiB), AIMD throttle (initial 2000, max 5000, decrease 0.5, additive 300, burst 100). Every one of these has a deployment-specific optimum. -**Fact 3: Lance and DataFusion are substrates, not our property.** Per [§I.1–3](../invariants.md) we do not rebuild what the substrate owns. 
The paper's approach to evolutionary search is *substrate-local*: they own the patched DataFusion and edit its physical plan. We don't, and we shouldn't. The right surface for us is *above* the substrate, at the IR / lowering layer where we already have authority. That maps cleanly to the paper's mechanic — JSON Patch on a serialized DAG of operators — even though the operators are ours, not DataFusion's. +**Write parameters.** `max_rows_per_file`, `max_rows_per_group`, `max_bytes_per_file`, `data_storage_version` (v2 has different page sizes), `enable_v2_manifest_paths`, `enable_stable_row_ids` (perf doc notes this is "experimental" for indices). -The composite picture: the paper's *philosophy* (use an LLM with a real benchmarker as a search loop over plan variants, replacing or supplementing a cost model) is portable. The paper's *target* (DataFusion physical plans) is not. Where they patch `HashJoinExec` build/probe sides, we would patch `IROp::Expand` direction and order; where they tune `nprobes` on a SQL hint, we would tune Lance scan parameters at lowering time. +**Compaction.** `target_rows_per_fragment` (default 1Mi), `materialize_deletions`, `materialize_deletions_threshold`, `num_threads`, `defer_index_remap` (Fragment Reuse Index — decouples compaction from index rebuilds, huge for continuous-ingest tables but adds an index-load-time cost). Frequency and timing of compaction. -## Concrete application surfaces in OmniGraph +**Plan-patching surface (via DataFusion).** `LanceTableProvider` registers a Lance dataset as a DataFusion table; DataFusion's standard `ExecutionPlan` covers joins, aggregates, sorts, while Lance contributes a custom `LanceScanExec`-style node with pushdown for column selection and simple filters. The Bauplan-style edit space (`HashJoinExec` build/probe swap, multi-join reorder) lives here. -Listed roughly by value-to-difficulty ratio, best first. +The key structural observation: **the first six surfaces are configuration. 
The seventh is a plan.** Bauplan's contribution is for the seventh; for Lance, the first six are higher-leverage and don't need a fork. -### 1. Multi-hop `Expand` ordering and direction +## Application surfaces in Lance (ranked by value/difficulty) -**Surface.** A `.gq` query of the form `MATCH (a:A)-[r1:R1]->(b:B)-[r2:R2]->(c:C) WHERE … RETURN …` lowers today to `[NodeScan(a), Expand(a→b via R1), Expand(b→c via R2), Filter(…), …]` in source order (`crates/omnigraph-compiler/src/ir/lower.rs:11`). Two knobs that change runtime dramatically: +### 1. Workload-conditioned vector index build (`IvfPq` / `IvfHnsw*`) -- **Hop order.** For a query that ends with a heavy filter on `c`, starting from `c` and expanding backward via CSC is usually faster than starting from `a` and expanding forward via CSR — because the filter prunes the seed set before traversal blows up. The IR already has `Direction` per `Expand`; the CSR/CSC indexes are built per edge type (`docs/indexes.md`); the topology to walk either direction is in place. The current lowering does not consider this. -- **Build-side for adjacency join.** `execute_expand` (`crates/omnigraph/src/exec/query.rs:770`) deduplicates destination IDs and passes them as a SQL `IN`-list to Lance for hydration. This is the [§IX](../invariants.md) "ad-hoc IN-list filtering when SIP fits" anti-pattern — the engine knows it. Evolutionary sampling could *demonstrate* the SIP win on a representative corpus before we commit code to it. +**Surface.** Per `vector` column, the choice of quantizer (PQ / SQ / RQ) and its parameters drives storage size by ~10× and recall by ~10 percentage points. Defaults are deliberately conservative. The decision is a tuple, not a tree. -**LLM patch shape.** A small IR-Patch dialect: `{"op": "reverse", "path": "/pipeline/1/direction"}`, `{"op": "swap", "from": "/pipeline/1", "path": "/pipeline/2"}`, `{"op": "hint", "path": "/pipeline/1", "value": {"hydration_strategy": "sip"}}`. 
The system prompt would carry per-edge-type cardinality (we already have `__manifest` row counts) and per-type fanout statistics if we expose them. +**LLM patch shape.** JSON Patch over a `VectorIndexConfig` object: +```json +[ + {"op": "replace", "path": "/quantizer", "value": "IvfHnswSq"}, + {"op": "replace", "path": "/num_partitions", "value": 4096}, + {"op": "replace", "path": "/num_sub_vectors", "value": 96}, + {"op": "replace", "path": "/sample_rate", "value": 128}, + {"op": "replace", "path": "/ef_construction", "value": 200}, + {"op": "replace", "path": "/m", "value": 32} +] +``` -**Fitness.** Wall-clock on representative `.gq` corpus + result-set equality (canonicalize `ORDER BY ... LIMIT` by sorting on the ordering columns before hash). +**Prompt seeding.** Pass the column schema, vector dimensionality, dataset row count, sample query workload (top-k values), and the recall/latency target. The LLM has good priors here (PQ for storage-bound, HNSW for low-latency, RQ for streaming-friendly recall). -**Why this is the best target.** It is the surface the paper is closest to (join reorder, build-side swap), the underlying mechanics (CSR/CSC, direction) already exist, and the search is bounded — `pipeline.len()!` permutations is small for realistic queries. +**Fitness.** Two-objective: `recall@K` against a labeled query set, and `p95_latency`. Combine via a deployment-specific weighting (or Pareto frontier). -### 2. Hybrid retrieval ordering and `k` tuning (`rrf` with `nearest` + `bm25`) +**Cross-scale transfer.** Build at 1% sample, apply at full. Validate by re-measuring on full at the chosen tuple. -**Surface.** `IRExpr::Rrf { primary, secondary, k }` is one of our headline features (`crates/omnigraph-compiler/src/ir/mod.rs:122`). Today the engine runs both retrievers and fuses; the order, per-leg `k`, and any pre-filter pushdown into each leg are not adaptively chosen. 
Search-mode detection happens by scanning the `ORDER BY` list (`crates/omnigraph/src/exec/query.rs:111`). +**Why this is the best target.** It is the surface Lance defaults explicitly under-tune. The decision is per-deployment, not per-query, so the harness can amortize cost. And the LLM's semantic reasoning (column name → vector type → likely quantizer) is on familiar ground. -**LLM patch shape.** Tunables per retriever leg: `nearest.nprobes`, `nearest.refine_factor`, `bm25.top_k`, and `rrf.k`. Plus the structural choice of *which* leg to run first and whether to use the first leg's results as a pre-filter to the second. +### 2. Per-query scan tuning (`nprobes`, `refine_factor`, pre/post-filter) -**Fitness.** Same wall-clock + result-set equality. The result-set equality check has to be careful here: top-K vector / BM25 ordering is sensitive to index parameters; the right oracle is *the user's chosen ranking metric* (recall@K on a labeled set, or rank-correlation with the unpruned plan), not bit-identical results. This is more delicate than the join case. +**Surface.** Even with a fixed vector index, the right `nprobes` and `refine_factor` depend on the predicate selectivity. A highly-selective metadata predicate ("status = 'active'" eliminating 95% of rows) flips the pre-vs-post-filter trade-off; today this is a per-query knob, picked statically. -**Why this is the next best target.** Hybrid retrieval is exactly the workload OmniGraph sells as a differentiator. Any non-trivial tuning surface we can show speedup on is high-leverage. Lance's vector index already has the dials; we just don't expose them per-query yet. +**LLM patch shape.** JSON Patch on a `QueryConfig`: +```json +[ + {"op": "replace", "path": "/nprobes", "value": 32}, + {"op": "replace", "path": "/refine_factor", "value": 10}, + {"op": "replace", "path": "/prefilter", "value": true} +] +``` -### 3. 
Filter pushdown shape (Lance SQL string construction) +**Fitness.** Recall@K on a labeled set + latency. **The result-set check matters here:** lowering `nprobes` lowers recall, so bit-identity is the wrong oracle — use rank correlation or labeled recall. -**Surface.** `build_lance_filter` translates IR filter trees into Lance SQL strings (`crates/omnigraph/src/table_store.rs:1159`). The translation today is structural — it doesn't consider how Lance's BTREE / inverted indexes will pick up the resulting expression. Two filters that are semantically equivalent (`x > 5 AND y = 'a'` vs `y = 'a' AND x > 5`) can hit different index paths. +**Cross-scale transfer.** Tune on a slice; apply globally. -**LLM patch shape.** Edits over the filter tree: reordering AND-clauses, factoring out a clause that's a BTREE prefix match, choosing between `IN (...)` and a join with a literal table. +**Why this is the second-best target.** It is per-query, so search costs amortize less, but it's where Lance users actually see knobs they don't know how to set. -**Fitness.** Wall-clock; the result-set check is straightforward (filters are deterministic). +### 3. Scalar-index recommender across a workload -**Why this is interesting but lower priority.** Lance's own scanner does some of this; the gap is narrower. But it's also the safest target — the search space is small, the validation is bit-identical, and the LLM is on familiar SQL ground (the paper's strength). +**Surface.** Given a representative SQL workload over a Lance dataset, choose which columns get indexes and which kind (BTree / Bitmap / Bloom / Zone-map / Label-list / Ngram). Lance lets you build one of each per column; the wrong choice costs index storage and build time. The Lance perf guide is explicit that "Queries against large ranges are currently extremely slow [on bitmap]" — index choice is non-obvious. -### 4. 
Vector index build parameters (offline, not per-query) +**LLM patch shape.** JSON Patch over a `Vec<{column, index_type, params}>` describing the full index set for a dataset. -**Surface.** `ensure_indices` (`crates/omnigraph/src/table_store.rs:1349`) builds BTREE / FTS / vector indexes with default parameters. Lance's `IvfPqIndexParams` has `num_partitions`, `num_sub_vectors`, `metric_type`, etc.; we use defaults today. +**Fitness.** Geomean query latency across the workload, with a soft budget on total index size. -**LLM patch shape.** Offline-only: per-vector-column index parameters. Search runs against a held-out query workload. +**Why this is interesting.** Index advising is a classic DBA problem; the LLM's column-name-semantic reasoning + workload-pattern detection is exactly what a human DBA does, slowly. This is the surface where the BauplanLabs prompting style (semantic cardinality estimation) transfers most directly. -**Fitness.** Average query latency across the workload, traded against index size. +### 4. Compaction & fragment policy -**Why this is interesting separately.** It's offline, the loop is slow, and the win is per-deployment rather than per-query. The paper's cross-scale transfer idea is directly applicable here: parameters tuned on a small scale factor often transfer to a larger one. +**Surface.** `target_rows_per_fragment`, FRI on/off (`defer_index_remap`), compaction frequency, materialize-deletions threshold. The right values depend on ingest rate, read pattern, and whether the table has indices. The perf guide notes compaction conflicts with index builds and that FRI was added specifically to decouple them — a deployment-specific knob no default handles well. -### 5. Per-table compaction / cleanup policy +**LLM patch shape.** Configuration tuple per table or per table archetype (high-ingest fact table vs. slow-changing dimension). -**Surface.** `omnigraph optimize` and `omnigraph cleanup` (`docs/maintenance.md`) take global flags today. 
Per-table policy — small-row-count tables should compact aggressively, vector-index-bearing tables care about fragment alignment — is a per-deployment decision. +**Fitness.** A composite — read-after-compact latency, write throughput, storage size over a synthetic week. -**LLM patch shape.** Per-table-type tuple: `(target_fragment_size, compaction_trigger, version_retention)`. +**Why this is a slower loop but high-value.** The benchmark runs over a *trajectory* (ingest then read), not a single query. Each candidate evaluation is minutes-to-hours. But the win is per-deployment and persists for the life of the schema. -**Fitness.** A composite of read-latency-after-compact and storage-size-over-time. +### 5. AIMD throttle and thread-pool tuning per object store -**Why this is the weakest fit.** The decision rate is slow (hours/days), the LLM-in-the-loop is unjustified; a static heuristic or a small learned model would be cheaper. Listing for completeness. +**Surface.** `lance_aimd_initial_rate`, `lance_aimd_max_rate`, `lance_aimd_decrease_factor`, `lance_aimd_additive_increment`, `lance_aimd_burst_capacity`, `LANCE_IO_THREADS`, `LANCE_CPU_THREADS`, `io_buffer_size`, `batch_size`. The perf guide gives a target "S3 gets to 5000 req/s in ~10 seconds" — meaning these defaults are S3-shaped. RustFS, MinIO, GCS, R2 all behave differently. -### 6. (Not recommended) Forking DataFusion +**LLM patch shape.** Tuple of throttle + thread + buffer settings, conditioned on the object store type. -Mentioned only to be explicit: we could fork DataFusion as the paper does. We should not. We touch DataFusion in one function and the paper's contribution is largely *because* of that fork. Reproducing it would commit us to maintaining a fork against an active upstream — and the marginal value is zero until we actually use DataFusion's planner, which we don't. +**Fitness.** Scan throughput, latency at p50/p95/p99, error rate under load. 
+ +**Why this is narrow but valuable.** It's per-environment, the search space is small, and the LLM's priors on object-store behavior are decent. + +### 6. Plan-patching on LanceTableProvider + DataFusion (upstream contribution path) + +**Surface.** `LanceTableProvider` registers a Lance dataset as a DataFusion table; queries hit DataFusion's planner and produce an `ExecutionPlan` tree that includes a Lance-scan node plus standard DataFusion operators (joins, aggregates, sorts). The Bauplan technique fits here directly — same `HashJoinExec` swap, same multi-join reorder, plus Lance-specific patches like "pull this filter down into the scan as a Lance `prefilter`." + +**Why this is the long-term play, not the short-term.** The Bauplan technique needs a way to serialize `ExecutionPlan` to JSON and accept a patched one. That feature does not exist in upstream DataFusion; Bauplan added it in their fork. **The right move is to contribute that upstream** — it's independently useful (plan portability, RPC-shipped plans, observability) — and then layer evolutionary sampling on top. Forking Lance (or DataFusion via Lance) to ship this internally is the wrong investment; the maintenance burden against an active upstream is high, and the value is exactly the same as the open-source version. + +Until that lands upstream, this surface is parked. + +### 7. Note on `merge_insert` strategy + +Lance's `merge_insert` has a small DAG of `WhenMatched` / `WhenNotMatched` decisions. The structural variation is small (4–6 shapes) and the right choice is usually obvious from the user's intent (upsert, insert-if-not-exists, replace-portion). LLM-evo doesn't add value here vs. a static rule. ## Risks and open questions -**Hyrum's Law and shipped variance.** [§IX](../invariants.md) deny-list and [§VI.28](../invariants.md) require determinism: "Plan choice is deterministic given identical statistics." 
Evolutionary sampling *during search* is nondeterministic by design; *during serving* we must not expose that variance. The discipline is: search offline, freeze the winning plan as a cache keyed on canonicalized query shape + statistics-bucket, and serve from the cache. Same plan for same inputs. +**Lance fork vs. external harness.** Surfaces 1–5 need **no fork** — Lance's API already accepts these as parameters and emits the metrics. The harness is "build dataset with config X, run workload, measure, repeat." Surface 6 (plan-patching) needs upstream features; until they land, parked. -**Semantic equivalence beyond bit-identity.** The paper validates result-set equality. We have queries where this is the right oracle (analytic queries with deterministic `ORDER BY`) and queries where it is not (top-K hybrid retrieval, where parameter changes shift ranking slightly but the user metric is recall@K, not bit-identity). We need a two-tier validator: bit-identical for deterministic ops, semantic for retrieval ops, with the retrieval oracle declared by query shape. +**Result-set equality is the wrong oracle for retrieval surfaces.** For vector / FTS / hybrid search, parameter changes shift ranking. Use recall@K against a labeled set, or rank correlation against an exhaustive baseline. The BauplanLabs validator was bit-identical because they targeted analytic queries. -**Query corpus.** TPC-DS / TPC-H gave the paper a fixed, well-known target. We do not have a published `.gq` benchmark suite. The honest answer is that the first deliverable of any LLM-evo-sampling project on OmniGraph is *the corpus itself* — a representative set of `.gq` queries against a representative dataset, with provenance. This is a real bootstrap problem; without a corpus, "we got 4× on TPC-DS" doesn't translate. +**Workload corpus is the bootstrap problem.** Surfaces 1, 2, 3, 4 are *workload-conditioned* — what's optimal for one query pattern is wrong for another. 
The first deliverable of any project here is a representative query workload (or a generator), with provenance. TPC-H / TPC-DS / SIFT / DEEP10M cover the analytic and vector cases; graph workloads are scarcer. -**Compute cost.** The paper runs `n_samples=5 × n_steps=2 × n_runs=5 = 50` benchmark runs per query plus LLM calls, all on Modal sandboxes. For us, "Modal sandbox" maps to a containerized OmniGraph harness with a known fixture; the per-trial cost should be similar or lower (the engine is lighter), but the LLM bill is real and the wall-clock for a meaningful corpus is days, not hours. +**Determinism at serving time.** Search introduces variance; serving must not. Discipline: search offline, freeze the chosen tuple as part of the deployment's configuration, version it. Same tuple → same plan → same answer. This is the same Hyrum's-Law point that applied to the OmniGraph framing. -**Invariant alignment.** The mechanic upholds several aspirational invariants in a clean way: +**Compute cost.** The paper uses Modal cloud sandboxes; per-trial costs are real. For Lance, a typical surface-1 search (vector index params) is on the order of dozens of trials × index-build-time, where each build is minutes. Surface-4 (compaction) is hours per trial. Budget realistically; cross-scale transfer is what makes this affordable. -- [§V.18](../invariants.md) ("estimate-vs-actual logging on every estimator") — every evolutionary trial *is* the actual. The search output is a corpus of (plan, observed-cost) tuples that bootstraps a real cost model. -- [§V.19](../invariants.md) (observable state) — search results, frozen plans, and their statistics-buckets are auditable. -- [§VII.41–42](../invariants.md) (SIP, factorize) — the search will tell us *which* queries benefit from SIP or factorization before we commit to a uniform rule. -- [§VI.28](../invariants.md) (determinism) — *upheld at serving time* if we cache + freeze. 
+**Substrate respect (§I.1).** Surfaces 1–5 do not violate substrate respect: we're driving Lance from the outside, with no fork. Surface 6 requires the upstream change first; until then, do not introduce a fork. -The mechanic does **not** violate the deny-list: we are not building a parallel storage or transaction layer, not bypassing the substrate, not introducing acks before durability, and not relaxing isolation. The substrate-respect line ([§I.1](../invariants.md)) is the one to watch: keep the search above the substrate, not inside a fork of it. +**Upstreaming the harness itself.** The natural home for this is **a `lance-tuner` crate** (or similar) contributed to the Lance project. It's a generic LLM-driven workload-conditioned configuration tuner; OmniGraph is one consumer. Shipping it outside the Lance project is fine, but the project's value compounds if it lands in the Lance ecosystem where users find it. -**Schema and `mutate` queries.** The paper's domain is read queries. Our `mutate_as` queries route through a different path (`MutationStaging` accumulator + `stage_*` / `commit_staged`, see [docs/runs.md](../runs.md) and [docs/transactions.md](../transactions.md)). Mutation plans should be out of scope for any first experiment — atomicity-critical paths are the wrong place to introduce LLM-proposed structural rewrites. +## Smallest experiment that would produce signal -## Smallest experiment that would move us +Pick **surface 1 (workload-conditioned vector index build)** for the first cut. It has the highest known gap between defaults and per-workload optima, the LLM has strong priors on it, and its validation oracle (recall@K) is well defined. -The point of this note is to enable a decision, not commit to a project. The minimum experiment that produces signal: +1. **Pick a public dataset:** SIFT1M (128-d, 1M vectors) or LAION-400M-sample (768-d, ~1M vectors). Both have published recall benchmarks to use as a sanity check. +2.
**Define the workload:** a fixed set of 1000 query vectors plus their ground-truth top-100 neighbors (precomputed via brute force, cached). +3. **Define the patch dialect:** JSON Patch over `{quantizer, num_partitions, num_sub_vectors, nbits, sample_rate, metric_type, ef_construction?, m?}` with type-aware validation (e.g., `ef_construction` only valid for HNSW variants). +4. **Define fitness:** a constrained objective over `(recall@10, p95_latency)` rather than a weighted sum. Treat `recall@10 >= 0.95` as a hard floor and minimize `p95_latency` among the candidates that clear it. +5. **Implement `bol_evol`:** `n_steps=3, n_samples_per_step=4, top_k=1`. For each sampled candidate in a step: build the index, run all 1000 queries, measure recall and latency, report. Each step takes minutes to hours. +6. **Compare to baselines:** Lance defaults, the published best for the dataset, and a random-search baseline of equal compute budget. +7. **Measure cross-scale transfer:** take the winning tuple at 100k vectors, apply at 1M and 10M, see if the win persists. -1. **Pick one surface: multi-hop `Expand` ordering and direction.** Smallest patch dialect, clearest invariant (bit-identical results), surface we already know is suboptimal. +If the winning tuple beats Lance defaults by ≥1.3× in latency at equal recall on the test dataset, surface 2 (scan tuning) is the next experiment. If the gain is ≤1.1×, the conclusion is "Lance defaults are close to per-workload optimum for the tested workloads," which is itself publishable and sunsets the project cheaply. -2. **Build a `.gq` corpus of ~30 queries** against the existing test fixtures (`crates/omnigraph/tests/fixtures/`). Mix: 2-hop and 3-hop traversals, with and without anti-join, with and without leaf filters. Document provenance. +**Out of scope for the first experiment:** Surface 6 (plan-patching) entirely. Surface 4 (compaction) because the per-trial cost is too high to learn fast. OmniGraph integration: make it a generic Lance tool first; an OmniGraph wrapper is a one-day port if the tool works. -3.
**Add an `--explain ir` flag** to `omnigraph read` (or a `dump_ir` test helper) that serializes `QueryIR` to JSON. This is independently useful (the deny-list calls out "plans are explainable", [§V.22](../invariants.md)) and is the substrate the LLM edits. +## Footnote: OmniGraph-IR as an alternative target -4. **Wrap the existing engine in a benchmark harness** using `tempfile::tempdir()` (the pattern already in `tests/helpers/mod.rs`) and the `criterion` story (currently absent — see [docs/testing.md](../testing.md): "no `benches/` directories"). Per-trial cost is engine-init + run; `n_runs=5` should be sufficient. +The previous revision of this note focused on patching OmniGraph's own `QueryIR` (`crates/omnigraph-compiler/src/ir/mod.rs:9`) — multi-hop `Expand` ordering and direction, hybrid retrieval (`rrf`) leg tuning, filter pushdown shape. That surface is real and the [§IX](../invariants.md) deny-list already calls out the gap ("cost-blind plan choice — lowering-order execution is not a planner"). -5. **Implement a single LLM strategy: `bol_evol` with `n_steps=2, n_samples_per_step=5, top_k=1`.** Same as the paper's quick example. Use the same JSON-Patch primitive, restricted to a permutation + direction subset of operations. +Now that the framing is Lance-direct, the OmniGraph-IR angle is **secondary, not abandoned**: -6. **Measure:** geomean speedup, fraction of queries with ≥10% improvement, search cost in $ and wall-clock, transferability of winning patches to the same query shape on a different fixture. +- The two are complementary: a Lance-tuner output (e.g., a chosen `IvfPq` configuration) is consumed by OmniGraph at `ensure_indices` time anyway. Tuning Lance below us is the highest-leverage layer. +- The OmniGraph-IR surface remains the right answer for *plan-shape* decisions inside OmniGraph (`Expand` direction, hop order, hybrid-retrieval leg ordering) because those decisions live above Lance and Lance can't reason about them. 
+- Plan-shape and parameter-tuning can compose: pick the right hop order *and* the right `nprobes` for the resulting vector scan. -If the geomean is ≥1.3× on a corpus we believe in, the next surface (hybrid retrieval) is justified. If it's ≤1.1×, we have learned something specific (probably: our IR is small enough that lexical order is already close to optimal) and the project sunsets cheaply. - -What this experiment intentionally does *not* do: it does not introduce a runtime planner, does not change any `mutate` path, does not fork DataFusion, does not touch the manifest writer or recovery sweep. It is additive search over a serialized read-IR with offline freezing. +If a Lance-tuner project is built first and works, the OmniGraph-IR project can reuse most of the harness (corpus, LLM driver, evolutionary loop), swapping the patch dialect and the engine target. The reverse — building OmniGraph-IR first and porting to Lance — is also possible but less leveraged, because Lance's parameter surface generalizes beyond OmniGraph. ## References - Paper: Erol, Hao, Bianchi, Greco, Tagliabue, Zou. *Making Databases Faster with LLM Evolutionary Sampling*. arXiv [2602.10387](https://arxiv.org/abs/2602.10387). - Repo: [BauplanLabs/Making-Databases-Faster-with-LLM-Evolutionary-Sampling](https://github.com/BauplanLabs/Making-Databases-Faster-with-LLM-Evolutionary-Sampling). -- Key files in the upstream repo (read in preparing this note): - - `src/sampling/sql_optimization_prompts.py` — system prompt (cardinality-by-semantics, join-side rules, projection-index recalculation). - - `src/sampling/gpt_plan_optimizer.py` — LiteLLM driver with `n_samples` parallel calls. - - `src/sampling/sample_plans.py` — `SamplingStrategy` (`bol_evol`, `pst_evol`, `best_of`), upstream-patch chain reconstruction. 
+- Key upstream files (read in preparing this note): + - `src/sampling/sql_optimization_prompts.py` — system prompt: semantic cardinality estimation, join-side rules, projection-index recalculation. + - `src/sampling/gpt_plan_optimizer.py` — LiteLLM driver, parallel sampling. + - `src/sampling/sample_plans.py` — `SamplingStrategy` (`bol_evol`, `pst_evol`, `best_of`), patch-chain reconstruction. - `src/sampling/orchestrator.py` — multi-step loop, resume semantics. - - `src/sampling/plan_scaler.py` — cross-scale transfer via scan signatures. -- OmniGraph internals referenced: - - `crates/omnigraph-compiler/src/ir/mod.rs:9` — `QueryIR` / `IROp`. - - `crates/omnigraph-compiler/src/ir/lower.rs:11` — `lower_query`, source-order lowering. - - `crates/omnigraph/src/exec/query.rs:348` — `execute_query`, hand-rolled pipeline interpreter. - - `crates/omnigraph/src/exec/query.rs:770` — `execute_expand`, the IN-list hydration path. - - `crates/omnigraph/src/table_store.rs:1159` — `build_lance_filter`, IR-filter → Lance SQL. - - `crates/omnigraph/src/table_store.rs:1349` — `ensure_indices`, index build parameters. - - `crates/omnigraph/src/table_store.rs:1612` — `scan_pending_batches`, the only DataFusion `MemTable` site. -- Invariants engaged: [§I.1–3](../invariants.md) substrate respect, [§V.18–22](../invariants.md) honesty / observability / explainability, [§VI.28](../invariants.md) determinism, [§VII.41–42](../invariants.md) SIP / factorize, [§IX](../invariants.md) deny-list ("cost-blind plan choice", "ad-hoc IN-list filtering when SIP fits", "shipping observable behavior as if it weren't part of the contract"). + - `src/sampling/plan_scaler.py` — scan-signature-based cross-scale transfer. +- Lance documentation referenced: + - [Performance guide](https://lance.org/guide/performance/) — thread pools, memory model, AIMD throttle, per-index-type characteristics, Fragment Reuse Index. 
+ - [Read and write](https://lance.org/guide/read_and_write/) — `WriteParams`, `compact_files`, `merge_insert`. + - [DataFusion integration](https://lance.org/integrations/datafusion/) — `LanceTableProvider`, UDFs, JSON functions. + - [Vector index spec](https://lance.org/format/table/index/vector/) — IVF/PQ/HNSW parameters. + - [Distributed indexing](https://lance.org/guide/distributed_indexing/) — segment-level index APIs (`build_index_metadata_from_segments` is `pub(crate)`; see [docs/lance.md](../lance.md) audit stanza). +- Invariants engaged: [§I.1–3](../invariants.md) substrate respect (the parameter-search angle deliberately stays outside Lance; the plan-patching angle is parked behind an upstream contribution), [§V.18](../invariants.md) estimate-vs-actual logging (every trial is the actual), [§VI.28](../invariants.md) determinism (search offline, freeze tuples for serving), [§VII.41–42](../invariants.md) SIP / factorize (orthogonal: applies if OmniGraph-IR work is later picked up), [§IX](../invariants.md) deny-list (none violated by the parameter-search path).
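
**Illustrative sketch (fitness and search loop).** Steps 4 and 5 of the smallest experiment describe a hard recall floor with latency minimization, driven by a small multi-step loop. The Python below is one possible reading of that, not the upstream `bol_evol` implementation and not a Lance API: `Trial`, `is_better`, `evolve`, `toy_propose`, and `toy_evaluate` are hypothetical names, the proposer is a fixed-factor stand-in for the LLM, and the evaluator is a synthetic cost model rather than a real index build. It assumes `top_k=1`, i.e. a single surviving parent per step.

```python
from dataclasses import dataclass
from typing import Callable, List

RECALL_FLOOR = 0.95  # hard floor on recall@10, taken from step 4


@dataclass
class Trial:
    """One measured candidate: the config tuple and its observed metrics."""
    config: dict
    recall_at_10: float
    p95_latency_ms: float


def is_better(candidate: Trial, incumbent: Trial) -> bool:
    """Constrained comparison: feasibility first, then lower p95 latency."""
    cand_ok = candidate.recall_at_10 >= RECALL_FLOOR
    inc_ok = incumbent.recall_at_10 >= RECALL_FLOOR
    if cand_ok != inc_ok:
        return cand_ok  # a feasible trial always beats an infeasible one
    if cand_ok:
        return candidate.p95_latency_ms < incumbent.p95_latency_ms
    # Neither clears the floor: prefer whichever is closer to feasibility.
    return candidate.recall_at_10 > incumbent.recall_at_10


def evolve(
    seed_config: dict,
    propose: Callable[[dict, int], List[dict]],
    evaluate: Callable[[dict], Trial],
    n_steps: int = 3,
    n_samples_per_step: int = 4,
) -> Trial:
    """Keep a single parent (top_k=1); each step samples children from it."""
    best = evaluate(seed_config)
    for _ in range(n_steps):
        for config in propose(best.config, n_samples_per_step):
            trial = evaluate(config)
            if is_better(trial, best):
                best = trial
    return best


def toy_propose(parent: dict, n: int) -> List[dict]:
    """Hypothetical stand-in for the LLM: rescale num_partitions."""
    factors = [0.5, 0.75, 1.5, 2.0][:n]
    return [
        {**parent, "num_partitions": max(1, int(parent["num_partitions"] * f))}
        for f in factors
    ]


def toy_evaluate(config: dict) -> Trial:
    """Synthetic model (NOT a Lance measurement): more partitions scan less
    data per query, which lowers latency but also lowers recall."""
    p = config["num_partitions"]
    return Trial(config, recall_at_10=1.0 - p / 10000, p95_latency_ms=2000 / p)


if __name__ == "__main__":
    winner = evolve({"num_partitions": 256}, toy_propose, toy_evaluate)
    print(winner)
```

In a real harness, `evaluate` would build the index with the candidate tuple and replay the fixed 1000-query workload, and `propose` would ask the LLM for JSON-Patch edits over the config object; the comparison rule would stay the same.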