MR-925: validation-prototypes scaffolding + exp 1.1 + exp 1.2

- exclude validation-prototypes/ and merge-insert-cas-repro from the main workspace so the nested cargo workspace can use its own pin set
- add validation-prototypes/{factorized-batches,custom-lance-index}/ scratch crates (never merged to main; long-lived branch only)
- exp 1.1: factorized batches through DataFusion ops. Writeup at .context/experiments/factorized-batches.md (5 cells × 8 ops; all scalar-keyed ops accept List<UInt64> input, UNNEST via CROSS JOIN fails in DF 52.5)
- exp 1.2: custom Lance index plugin from outside lance. Writeup at .context/experiments/custom-lance-index.md (5 probes; transaction surface is open, SCALAR_INDEX_PLUGIN_REGISTRY is closed → hard blocker for MR-737 §5.4; recommends upstream path or external-index path)
2026-06-09 01:35:18 +02:00 · 2026-05-12 16:49:33 +00:00 · 2026-05-12 16:49:33 +00:00 · 02c4b45c85
commit 02c4b45c85
parent c9c7c0672e
12 changed files with 8033 additions and 0 deletions
--- a/.context/experiments/custom-lance-index.md
+++ b/.context/experiments/custom-lance-index.md
@@ -0,0 +1,238 @@
+# Experiment 1.2 — Custom Lance index plugin from outside the lance crate
+
+**Ticket:** MR-925 §1.2 (validates MR-737 §5.4, §5.5).
+**Prototype:** `validation-prototypes/custom-lance-index/` (long-lived branch).
+**Substrate pin:** Lance 4.0.1 (resolved by cargo from the `4.0.0` version requirement). Lance 4.0.1 internally pulls roaring 0.11 and prost-types 0.14; the workspace deps were raised to match.
+**Date:** 2026-05-12.
+
+---
+
+## Hypothesis
+
+A graph engine running on top of Lance can ship a custom index type
+(e.g. a neighbor-set adjacency index) from a third-party crate, by:
+
+  1. constructing an `IndexMetadata` row with a custom `index_details: Any`,
+  2. committing it via the transaction API (`Operation::CreateIndex`),
+  3. having Lance round-trip it through the manifest unchanged, and
+  4. having the Lance scanner dispatch filter pushdown to our plugin.
+
+§5.4 of MR-737 currently leaves (4) as an open question — this experiment
+turns the answer into evidence.
+
+## Method
+
+`custom-lance-index/` builds a tiny Lance dataset (`(key: UInt64, payload:
+Utf8)`, 1000 rows in fragment 0), then runs five probes against the public
+surface of `lance = 4.0.1`:
+
+| Probe | What is exercised |
+|-------|-------------------|
+| **P1** Construct + commit | Build an `IndexMetadata` with a custom `index_details.type_url = "omnigraph.v0.NeighborIndexDetails"` and commit it with `Dataset::commit(..., Operation::CreateIndex { new_indices, removed_indices }, ...)`. |
+| **P2** Load round-trip    | Reopen the dataset and call `DatasetIndexExt::load_indices()`. Verify the index survives Lance's `retain_supported_indices()` filter and its `index_details` survives bit-for-bit. |
+| **P3** Append coverage    | Call `Dataset::append(...)`, then re-load indices. Verify the `fragment_bitmap` is *not* auto-updated to cover the new fragment — i.e. coverage is the plugin's responsibility, not Lance's. |
+| **P4** Scan filter        | Run a `Dataset::scan().filter("key = 42")` and observe whether Lance attempts to open our plugin. With the plugin registry closed (see below), expect a full-scan fallback rather than an opt-in dispatch. |
+| **P5** Compact (Rewrite)  | Call `compact_files(...)` and observe whether the index survives the Rewrite operation and whether the `fragment_bitmap` is remapped. |
+
+Output (release-mode run, single execution):
+
+```
+--------------------------------------- custom-lance-index compatibility matrix ----------------------------------------
+probe                            outcome        notes
+------------------------------------------------------------------------------------------------------------------------
+P1 construct+commit              OK             Operation::CreateIndex accepted custom type_url; commit v2
+P2 load_indices (round-trip)     OK             type_url='omnigraph.v0.NeighborIndexDetails' fragment_bitmap.len=1 survives retain_supported_indices
+P3 append-row coverage           STALE_AS_EXPECTED fragment_bitmap=[0] (expected [0]); new fragments not auto-covered
+P4 scan with filter on indexed col FULL_SCAN_FALLBACK rows=1 (expected 1); SCALAR_INDEX_PLUGIN_REGISTRY refuses unknown type_url so scanner falls back to full scan
+P5 compact_files (Rewrite)       STALE_BITMAP   before=1 indices; after=1 indices; rewritten files=0; new fragments=[0, 1]; idx.fragment_bitmap=[0]
+```
+
+## Findings
+
+### F1. The transaction surface is open. ✅
+
+`Dataset::commit(uri, Operation::CreateIndex { new_indices: vec![idx],
+removed_indices: vec![] }, ...)` is a fully public API. `IndexMetadata` is
+a `pub struct` in `lance-table::format` with **every field public**,
+including `index_details: Option<Arc<prost_types::Any>>`, `fragment_bitmap:
+Option<RoaringBitmap>`, `index_version: i32`, `fields: Vec<i32>`. We can
+construct it with any `type_url` and `value: Vec<u8>` we want.
+
+### F2. The retention filter does not block unknown type_urls. ✅
+
+`lance/src/index.rs::retain_supported_indices` defends against version
+skew, not against unknown plugins. Its core check is:
+
+```rust
+let max_supported_version = idx
+    .index_details
+    .as_ref()
+    .map(|details| {
+        IndexDetails(details.clone())
+            .index_version()
+            // If we don't know how to read the index, it isn't supported
+            .unwrap_or(i32::MAX as u32)
+    })
+    .unwrap_or_default();
+let is_valid = idx.index_version <= max_supported_version as i32;
+```
+
+When `index_details.type_url` is unknown to the static
+`SCALAR_INDEX_PLUGIN_REGISTRY`, `index_version()` returns `Err`, the
+`.unwrap_or(i32::MAX as u32)` fallback applies, and the index is retained
+because every `index_version` is `<= i32::MAX`. Our P2 outcome confirms
+this. The inline comment ("If we don't know how to read the index, it
+isn't supported") is misleading: the actual behavior is that unknown
+indices are *kept* in the manifest. That is good for our purposes (we want
+our custom index to round-trip cleanly), but worth filing upstream as a
+comment/behavior fix.
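
The retention rule in F2 can be modeled without Lance at all. The sketch below is a std-only stand-in, not Lance's code: `known_index_version` and `is_retained` are hypothetical names, and both the single recognized `type_url` suffix and its version value are assumptions chosen for illustration. It mirrors only the `unwrap_or(i32::MAX as u32)` fallback quoted above.

```rust
/// Hypothetical stand-in for `IndexDetails::index_version()`: a known
/// type_url reports a max supported version, an unknown one returns Err.
fn known_index_version(type_url: &str) -> Result<u32, String> {
    if type_url.ends_with("BTreeIndexDetails") {
        Ok(0) // assumed value, for illustration only
    } else {
        Err(format!("no plugin registered for {type_url}"))
    }
}

/// Mirrors the quoted check: an unknown type_url falls back to i32::MAX,
/// so every index_version passes and the index is kept.
fn is_retained(type_url: Option<&str>, index_version: i32) -> bool {
    let max_supported: u32 = type_url
        .map(|url| known_index_version(url).unwrap_or(i32::MAX as u32))
        .unwrap_or_default();
    index_version <= max_supported as i32
}
```

Under these assumptions, `is_retained(Some("omnigraph.v0.NeighborIndexDetails"), 0)` is true, while a recognized type with a too-new `index_version` is filtered out. That is the version-skew defense the finding describes.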
+
+### F3. The plugin registry is closed. ❌ **HARD BLOCKER for §5.4.**
+
+`lance/src/index/scalar.rs:223` (4.0.1):
+
+```rust
+// TODO: Allow users to register their own plugins
+static SCALAR_INDEX_PLUGIN_REGISTRY: LazyLock<Arc<IndexPluginRegistry>> =
+    LazyLock::new(IndexPluginRegistry::with_default_plugins);
+```
+
+- The static is **module-private** (no `pub`).
+- `IndexPluginRegistry::with_default_plugins` is the only constructor used,
+  and its initialization registers a fixed set of types (BTree, Bitmap,
+  LabelList, Inverted, NGram, ZoneMap, BloomFilter, RTree, and the vector
+  family).
+- There is no `register_plugin` or `extend_registry` API exposed by the
+  `lance` crate.
+- `IndexType` is itself a closed enum (lance-index/src/lib.rs:106) with no
+  `Custom` variant; `Index::index_type(&self)` must return one of the
+  built-in values.
+
+Consequence: **Lance 4.0.1 cannot dispatch its scanner to a third-party
+index plugin**. The downstream functions that gate scan-time index use —
+`open_scalar_index`, `infer_scalar_index_details`, `IndexDetails::supports_fts`,
+`IndexDetails::is_vector` — all consult `SCALAR_INDEX_PLUGIN_REGISTRY` or
+hard-coded `type_url` suffix checks. Even if we masquerade as
+`type_url.ends_with("BTreeIndexDetails")`, the scanner will then assume
+our index is a real BTreeIndex and try to open BTree-format files in the
+index directory, which we don't have.
+
+### F4. The engine owns fragment_bitmap maintenance. ⚠️
+
+P3 confirms: when we append a new fragment, Lance does **not** update the
+custom index's `fragment_bitmap` (and would not even know how — the plugin
+contract for "rebuild on append" lives inside the plugin registry, which
+is closed to us). Any custom-index reconciler we ship has to:
+
+  - re-read `load_indices()` after every commit,
+  - compute the diff between `fragment_bitmap` and the current fragment set,
+  - emit `Operation::CreateIndex { new_indices: vec![updated], removed_indices: vec![old] }`
+    to re-publish the index with the updated bitmap.
+
+This is *consistent with* the §5.5 reconciler pattern in MR-737, so it's
+not a blocker — but the writeup of §5.5 should explicitly say "the
+reconciler also owns fragment coverage diffs, not just file content".
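
The diff step of the reconciler described in F4 could look roughly like the following. This is a minimal std-only sketch: `BTreeSet<u32>` stands in for `RoaringBitmap`, and `CoverageDiff`, `coverage_diff`, and `needs_republish` are hypothetical names, not Lance or engine APIs.

```rust
use std::collections::BTreeSet;

/// Coverage diff between an index's recorded fragment set and the
/// dataset's current fragments. `BTreeSet<u32>` stands in for RoaringBitmap.
#[derive(Debug, PartialEq)]
struct CoverageDiff {
    /// Fragments present in the dataset but missing from the bitmap.
    uncovered: BTreeSet<u32>,
    /// Fragments in the bitmap that no longer exist in the dataset.
    dead: BTreeSet<u32>,
}

fn coverage_diff(bitmap: &BTreeSet<u32>, current: &BTreeSet<u32>) -> CoverageDiff {
    CoverageDiff {
        uncovered: current.difference(bitmap).copied().collect(),
        dead: bitmap.difference(current).copied().collect(),
    }
}

/// True when the reconciler has to re-publish the IndexMetadata.
fn needs_republish(diff: &CoverageDiff) -> bool {
    !diff.uncovered.is_empty() || !diff.dead.is_empty()
}
```

With the P3 state (bitmap `[0]`, fragments `[0, 1]`) the diff reports fragment 1 as uncovered and nothing as dead, so a re-publish is due. The `dead` set is the case F5 is concerned with after a real Rewrite.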
+
+### F5. Compaction does not move our index. ⚠️
+
+P5: with default `CompactionOptions`, two small fragments of 1000 + 500
+rows did not trigger a Rewrite (`files_added: 0`). This is not a
+custom-index issue — it's the default heuristic. The signal we need is:
+**if a Rewrite had happened, would `Operation::Rewrite { groups, rewritten_indices,
+frag_reuse_index }` have remapped our index?** Looking at the conflict
+resolver (lance/src/io/commit/conflict_resolver.rs:495 onward), the answer
+is no — `rewritten_indices: Vec<RewrittenIndex>` is constructed only for
+indices whose plugin returns a remapper. Unknown-type indices fall through
+without remapping. So:
+
+- **After a real compaction, our custom index will have a stale
+  `fragment_bitmap`** pointing at fragment IDs that may have been
+  rewritten into new IDs.
+- **Stable row IDs** (when `enable_stable_row_ids=true` on the dataset)
+  would survive — but our `fragment_bitmap` would not.
+
+We need to re-run with a more aggressive `CompactionOptions` to capture
+the exact post-Rewrite bitmap drift; that's a 1-hour follow-up. The
+qualitative answer is settled: **compaction without an index reconciler
+will leave our custom index pointing at dead fragments.**
+
+## Per-operation compatibility matrix (the table §1.2 asks for)
+
+| Lance operation       | Custom index behavior with the public-API approach           | Engine reconciler responsibility |
+|-----------------------|--------------------------------------------------------------|----------------------------------|
+| `Append`              | IndexMetadata retained, `fragment_bitmap` STALE.             | Detect new fragments; re-publish IndexMetadata with updated bitmap. |
+| `Update` (vertical)   | Same as Append — new fragments arrive; old bitmap stale.     | Same as Append, plus invalidate index entries for moved rows. |
+| `Delete`              | IndexMetadata retained; new deletion files don't touch bitmap. | Index need not change unless the plugin caches row→key mappings. |
+| `Rewrite` (compact)   | IndexMetadata retained but `fragment_bitmap` points at dead fragments; no remap. | Reconciler must rebuild bitmap (or use stable row IDs and remap externally). |
+| `Merge` (column add)  | IndexMetadata retained; index files unaffected since indexed columns unchanged. | None for column-add. For column-rewrite, full rebuild. |
+| `Project` (column drop)| IndexMetadata retained but `fields: Vec<i32>` may now point at a dropped column. | Reconciler must DROP the IndexMetadata when its column is removed. |
+
+The "engine reconciler responsibility" column is *additional* work over
+what a fully-registered Lance plugin would get for free, because we can't
+register.
+
+## Decision impact on MR-737 §5.4
+
+**§5.4's current premise (build custom index plugins from outside the
+lance crate) is NOT achievable on Lance 4.0.1 as written.** Three viable
+paths forward:
+
+1. **Vendored fork of lance-index**: fork lance-index, expose
+   `SCALAR_INDEX_PLUGIN_REGISTRY` plus a `register_plugin` API, and pin
+   to the fork. The maintenance burden is equivalent to running our own
+   substrate. docs/invariants.md disallows "Hand-rolling something Lance
+   already does", but here Lance does NOT yet do this: the honest framing
+   is that Lance's *interface* for it doesn't exist yet.
+
+2. **Upstream contribution** — implement the "Allow users to register their
+   own plugins" TODO and contribute it back. Requires upstream review +
+   release cycle; Lance is in pre-1.0 (4.x) and the protobuf surface for
+   `index_details` is already pluggable, so the interface delta is small.
+   This is the **recommended path**; the next §11 update to MR-737 should
+   call out "depends on Lance issue: scalar-index-plugin-registry pluggability".
+
+3. **Run our custom index entirely outside Lance** — store our index in a
+   separate Lance dataset (or a sidecar key-value store) keyed by the
+   primary table's stable row IDs. Lance round-trips an empty IndexMetadata
+   row (or none) for visibility; query-time pushdown is done by the
+   engine's planner via a manually-injected `PrefilterExec` that consults
+   our external index and produces a row-ID `BatchSelection`. This is the
+   pattern lance-graph appears to use for its neighbor index (TBC in
+   experiment 3.3); it bypasses Lance's index-dispatch entirely.
+
+§5.4 should be rewritten to **pick path (2) or path (3) explicitly**, not
+both. The current MR-737 wording implies that out-of-tree plugin
+registration is available on stock Lance; this experiment shows it is not.
+
+§5.5 (reconciler pattern) is unaffected by this finding — but it must
+expand to explicitly own `fragment_bitmap` recomputation across all
+mutating operations, since with path (2) or path (3) we are the only
+party that knows the index's row coverage.
+
+## Caveats
+
+- **Default `CompactionOptions` did not trigger a Rewrite.** P5 is a
+  qualitative answer from source-code reading; we need a re-run with
+  `CompactionOptions { target_rows_per_fragment: 100, ..default }` (or
+  enough small fragments to force one) to capture the exact bitmap drift.
+  Follow-up: 1 hour.
+- **Stable row IDs not exercised.** The dataset was created without
+  `enable_stable_row_ids=true`. Experiment 1.7 covers this surface.
+- **No write/read of actual index data.** This experiment is about the
+  *metadata* round-trip, not about a working index over `key`. A real
+  prototype would write a BTreeMap<u64, RowAddr> to a sidecar file under
+  `<uri>/_indices/<uuid>/` and read it back at scan time via a manual
+  prefilter. F3 says we already can't dispatch via Lance, so building the
+  data round-trip is a path (2)/(3) decision deferred to Phase 0.
+
+## Follow-ups (tracked, not done in this experiment)
+
+- File upstream Lance issue: "Document or change behavior of
+  `retain_supported_indices` for unknown `type_url`s — comment claims
+  drop, code retains."
+- File upstream Lance issue: "Make `SCALAR_INDEX_PLUGIN_REGISTRY` pluggable
+  (`register_plugin` API)." This is a blocker for `lance-graph` and other
+  graph layers.
+- Re-run P5 with aggressive `CompactionOptions` and an `enable_stable_row_ids`
+  dataset to capture bitmap drift quantitatively (1 hr).
+- Compare the lance-graph repo's actual approach to extending Lance —
+  cover in experiment 3.3.
--- a/.context/experiments/factorized-batches.md
+++ b/.context/experiments/factorized-batches.md
@@ -0,0 +1,229 @@
+# Experiment 1.1 — Factorized batches through DataFusion ops
+
+**Ticket:** MR-925 §1.1 (validates MR-737 §5.2 / Open Q2).
+**Prototype:** `validation-prototypes/factorized-batches/` (branch
+`devin/mr-925-pre-phase-0-validation-experiment-code-dive-agenda-to-de`).
+**Substrate pin:** DataFusion 52.5 + Arrow 57.3 (matches engine workspace).
+**Date:** 2026-05-12.
+
+---
+
+## Hypothesis
+
+DataFusion's `HashJoinExec`, `AggregateExec`, `FilterExec`, `SortExec`, and
+`ProjectionExec` either (a) handle a `List<UInt64>` neighbor-set column
+correctly with acceptable performance, or (b) require explicit `Flatten`
+before them. MR-737 §5.2 currently assumes mostly (b); this experiment maps
+the actual frontier so the §5.2 rule list lands on validated ground.
+
+## Method
+
+`factorized-batches/` builds an in-memory `RecordBatch` with schema
+`(src_id: UInt64, payload: Utf8, weight: Float64, _neighbors: List<UInt64>)`
+plus a flat-row baseline of `(src_id, payload, weight, dst: UInt64)`
+produced by exploding `_neighbors` to one row per `(src, dst)` pair.
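
The relationship between the two input shapes can be sketched in plain Rust. This is an illustration only, not the prototype's code: `FactorizedRow` and `explode` are hypothetical names, and the `payload` and `weight` columns are omitted (in the flat baseline they would be repeated on every pair).

```rust
/// One factorized row: a source vertex with its whole neighbor set
/// (the `_neighbors: List<UInt64>` column in the writeup).
struct FactorizedRow {
    src_id: u64,
    neighbors: Vec<u64>,
}

/// Flat baseline: one `(src, dst)` pair per neighbor. A source with an
/// empty neighbor set contributes no flat rows.
fn explode(rows: &[FactorizedRow]) -> Vec<(u64, u64)> {
    rows.iter()
        .flat_map(|r| r.neighbors.iter().map(move |&dst| (r.src_id, dst)))
        .collect()
}
```

The flat row count equals the total edge count, so at a uniform fanout of `f` the flat input has `f` times as many rows as the factorized one.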
+
+For each cell `{n_src = 10_000} × {fanout ∈ uniform{1, 10, 100, 1000},
+skewed(target=10, heavy=2%)}` we run eight pipelines on each input shape via
+`SessionContext::sql`:
+
+| Op probe            | SQL                                                                |
+|---------------------|--------------------------------------------------------------------|
+| `filter`            | `SELECT * FROM t WHERE src_id < 5000`                              |
+| `project`           | `SELECT src_id, _neighbors FROM t`                                 |
+| `sort`              | `SELECT src_id, _neighbors FROM t ORDER BY src_id DESC LIMIT 1000` |
+| `aggregate_scalar`  | `SELECT substr(payload,1,4) AS b, count(*) FROM t GROUP BY 1`      |
+| `aggregate_on_list` | `SELECT _neighbors, count(*) FROM t GROUP BY _neighbors`           |
+| `join_scalar`       | `SELECT a.src_id, a._neighbors FROM t a JOIN t b ON a.src_id = b.src_id LIMIT 100` |
+| `join_on_list`      | `SELECT count(*) FROM t a JOIN t b ON a._neighbors = b._neighbors` |
+| `unnest_flatten`    | `SELECT src_id, n.* FROM t CROSS JOIN UNNEST(_neighbors) AS n(dst)` |
+
+Measurements: `accepts_list_input` (planning + execution complete), wall-clock
+ms, output row count, output bytes (sum of `get_array_memory_size` over all
+emitted batches). Memory is exercised but not directly capped — the goal is
+go/no-go and order-of-magnitude calibration, not a tight benchmark.
+
+Run with `cargo run --release -p factorized-batches` (release profile —
+LTO-thin, opt-level 3). Sample output captured at
+`validation-prototypes/factorized-batches/sample-output.txt`.
+
+## Results (n_src = 10 000, runs single-threaded on the bench VM)
+
+### Acceptance + speedup matrix (factorized vs flat baseline)
+
+| op                   | fanout=1     | fanout=10                | fanout=100                | fanout=1000                  | skew=10/0.02 |
+|----------------------|--------------|--------------------------|---------------------------|------------------------------|--------------|
+| `filter`             | OK (0.32×)   | OK (0.72×)               | OK (1.95×)                | OK (0.48×)                   | OK (1.11×)   |
+| `project`            | OK (0.81×)   | OK (1.03×)               | OK (1.26×)                | OK (1.43×)                   | OK (0.88×)   |
+| `sort` (TopK 1000)   | OK (0.94×)   | OK (**7.18×**)           | OK (**70.18×**)           | OK (**336.28×**)             | OK (10.05×)  |
+| `aggregate_scalar`   | OK (0.71×)   | OK (2.77×)               | OK (**16.47×**)           | OK (**140.36×**)             | OK (2.32×)   |
+| `aggregate_on_list`  | OK (—)       | OK (—)                   | OK (—)                    | OK (—) — 1.6 s @ 10M edges   | OK (—)       |
+| `join_scalar` (LIMIT 100) | OK (0.83×) | OK (3.57×)            | OK (**4.15×**)            | OK (**33.88×**)              | OK (2.65×)   |
+| `join_on_list`       | OK (—)       | OK (—)                   | OK (—) — 26 ms            | OK (—) — 659 ms              | OK (—)       |
+| `unnest_flatten`     | **FAILS**    | **FAILS**                | **FAILS**                 | **FAILS**                    | **FAILS**    |
+
+`OK` means the physical plan compiled and the stream drained without error.
+Speedup = `time_flat / time_factorized`; > 1 means factorized is faster. `(—)`
+means no flat-row analogue: GROUP BY / JOIN on a List value is semantically
+*different* from the flat-row equivalent (it groups / joins on full
+neighbor-set equality).
+
+### EXPLAIN plans
+
+`aggregate_scalar` (factorized input):
+
+```
+SortPreservingMergeExec: [bucket@0 ASC NULLS LAST]
+  SortExec: expr=[bucket@0 ASC NULLS LAST], preserve_partitioning=[true]
+    ProjectionExec: ...
+      AggregateExec: mode=FinalPartitioned, gby=[substr(...)@0], aggr=[count(...)]
+        RepartitionExec: partitioning=Hash([substr(...)@0], 2)
+          AggregateExec: mode=Partial, gby=[substr(payload@0,1,4)], aggr=[count(...)]
+            DataSourceExec: partitions=1
+```
+
+The `_neighbors` column is correctly pruned from the scan projection
+(`projection=[payload]`). When the group key is scalar, the List column never
+hits the aggregator at all — it's column-pruned away.
+
+`join_scalar` (factorized input):
+
+```
+ProjectionExec: expr=[src_id@1 as src_id, _neighbors@2 as _neighbors]
+  GlobalLimitExec: skip=0, fetch=100
+    HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(src_id@0, src_id@0)]
+      DataSourceExec: partitions=1
+      DataSourceExec: partitions=1
+```
+
+The List column rides through as a passthrough projection — it never enters
+the hash table. `HashJoinExec` hashes only the join key (`src_id`).
+
+`aggregate_on_list` (factorized input):
+
+```
+ProjectionExec: expr=[_neighbors@0, count(Int64(1))@1 as n]
+  AggregateExec: mode=FinalPartitioned, gby=[_neighbors@0 as _neighbors], aggr=[count(...)]
+    RepartitionExec: partitioning=Hash([_neighbors@0], 2)
+      AggregateExec: mode=Partial, gby=[_neighbors@0 as _neighbors], aggr=[count(...)]
+        DataSourceExec: partitions=1
+```
+
+This is the headline surprise: **DataFusion's `AggregateExec` is happy to use
+a `List<UInt64>` column as a hash-grouping key**, and the partitioner is
+happy to hash-repartition by it. Cost scales with total edge count, not
+distinct-list-count: 12 ms @ 100K edges, 113 ms @ 1M edges, 1.6 s @ 10M edges
+(roughly linear in edge volume). Semantically this groups by full
+neighbor-set equality — useful for "find all sources with the same neighbor
+set" but **not** the same as "GROUP BY exploded neighbor".
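
The semantic gap between the two groupings can be shown with a small std-only sketch. The function names are hypothetical, and `Vec<u64>` equality (order-sensitive) is used as a stand-in for list-value equality; whether DataFusion treats differently ordered lists as equal is not established by this experiment.

```rust
use std::collections::HashMap;

/// GROUP BY on the list value: keys are whole neighbor sets.
fn group_by_full_list(rows: &[(u64, Vec<u64>)]) -> HashMap<Vec<u64>, usize> {
    let mut counts = HashMap::new();
    for (_, neighbors) in rows {
        *counts.entry(neighbors.clone()).or_insert(0) += 1;
    }
    counts
}

/// GROUP BY on the exploded neighbor: keys are individual elements.
fn group_by_element(rows: &[(u64, Vec<u64>)]) -> HashMap<u64, usize> {
    let mut counts = HashMap::new();
    for (_, neighbors) in rows {
        for &dst in neighbors {
            *counts.entry(dst).or_insert(0) += 1;
        }
    }
    counts
}
```

For three sources with neighbor sets `[10, 20]`, `[10, 20]`, and `[10]`, the first function yields two groups (one per distinct set), while the second yields one group per neighbor ID with counts 3 and 2.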
+
+`sort` (factorized input):
+
+```
+SortExec: TopK(fetch=1000), expr=[src_id@0 DESC]
+  DataSourceExec: partitions=1
+```
+
+The List column rides through the TopK fetch with no penalty.
+
+`unnest_flatten` (`SELECT src_id, n.* FROM t CROSS JOIN UNNEST(_neighbors) AS n(dst)`):
+
+```
+execute: This feature is not implemented:
+  Physical plan does not support logical expression
+  OuterReferenceColumn(Field { name: "_neighbors", data_type: List(UInt64) },
+                       Column { table: "t", name: "_neighbors" })
+```
+
+`CROSS JOIN UNNEST(<correlated column>)` is the cleanest SQL syntax for
+exploding a List, but DataFusion 52.5 hits the unimplemented-physical-lowering
+branch for the correlated reference. The failure surface is *physical* — the
+logical plan compiles, the physical plan refuses to construct.
+
+### Per-op recommendation
+
+| Op                          | DataFusion 52.5 behavior                                              | Recommendation                                  |
+|-----------------------------|------------------------------------------------------------------------|-------------------------------------------------|
+| `FilterExec` (scalar pred)  | Passthrough for List columns; 0.32–1.95× vs flat with no consistent trend. | `KEEP_FACTORIZED`; no `Flatten` needed.      |
+| `ProjectionExec`            | Passthrough; within 0.81–1.43× of flat.                                | `KEEP_FACTORIZED`.                              |
+| `SortExec` (scalar key)     | List passes through; **at fanout ≥ 10, factorized is 7–336× faster**.   | `KEEP_FACTORIZED`. Stronger than §5.2 expected. |
+| `AggregateExec` (scalar key)| List column-pruned at the scan; **2.7–140× faster at fanout ≥ 10**.    | `KEEP_FACTORIZED`. §5.2 should call this out.   |
+| `AggregateExec` (list key)  | Works; groups by full-list equality.                                   | `MULTIPLICITY_AWARE_FUTURE`. Semantically distinct from `GROUP BY exploded`. |
+| `HashJoinExec` (scalar key) | List rides through; 2.6–34× faster than flat at fanout ≥ 10 (0.83× at fanout=1). | `KEEP_FACTORIZED`. §5.2 should call this out.   |
+| `HashJoinExec` (list key)   | Works; semantics = match on full-list equality.                        | `MULTIPLICITY_AWARE_FUTURE`. Rare workload, but available. |
+| `UNNEST` flatten            | Fails at physical lowering for correlated `CROSS JOIN UNNEST(col)`.    | `FLATTEN_BEFORE` must use the SELECT-clause `UNNEST(col)` form, the DataFrame `unnest_columns` API, or a custom `FlattenExec`. **Do not rely on `CROSS JOIN UNNEST` in IR.** |
+
+## Decision impact on MR-737 §5.2 / Open Q2
+
+§5.2 currently reads as "factorize-local, flatten before DataFusion ops" with
+the expectation that most ops need flattening. **The data flips this for
+scalar-keyed ops**:
+
+1. **`Sort`, `Aggregate (scalar key)`, `HashJoin (scalar key)`, `Filter`,
+   `Project` all KEEP factorized** at every cell tested. Speedup over the
+   flat baseline is *monotonically increasing with fanout* for the
+   memory-shape-sensitive ops (Sort up to 336×, AggregateExec up to 140×,
+   HashJoinExec up to 34×). The List column is either column-pruned (when
+   not referenced) or passthrough-projected (when referenced).
+
+2. **`Aggregate` / `Join` on a list-typed key works**, but the semantics are
+   "match on full-list equality", not "match on any exploded element". This
+   is genuinely useful (neighbor-set deduplication, signature joins) but
+   needs its own §5.2 sub-section so callers don't reach for it expecting
+   element-wise semantics. Recommendation: `MULTIPLICITY_AWARE_FUTURE`.
+
+3. **`Flatten` via `CROSS JOIN UNNEST(col)` is broken in DF 52.5**. This is
+   the syntax §5.2 most naturally reaches for ("emit a Flatten by wrapping
+   in `CROSS JOIN UNNEST`"). The fix has three live paths:
+   - SELECT-clause `UNNEST(_neighbors)` (not yet exercised here — TODO
+     extend the probe — but the prior art in `datafusion/src/sql/expr.rs`
+     suggests this form is implemented).
+   - DataFrame API `unnest_columns(&["_neighbors"])`.
+   - A custom `FlattenExec` physical operator (which we'll already need
+     for the custom-operator experiment 1.3).
+
+   The §5.2 rule should be reworded to **"insert `Flatten` via the
+   DataFrame `unnest_columns` API or our own `FlattenExec`; do NOT lower to
+   `CROSS JOIN UNNEST` in IR"**.
+
+4. **`Expand`-shaped workloads (the dominant case for graph traversal)**
+   benefit dramatically from factorization on scalar-keyed pipelines, which
+   matches the §0 hop-1 spike result (MR-376 measured 72× on local FS for
+   a related shape; here we see >70× on sort at fanout=100 and >140× on
+   aggregate at fanout=1000). §5.2 should harden its claim from "factorized
+   helps" to "factorized is the default; flatten is the exception".
+
+5. **Open Q2 ("does the factorized-IR pay off for DataFusion ops?") is
+   resolved YES.** §10's open-question bullet for Q2 can flip to RESOLVED
+   with this writeup as evidence.
+
+No fundamental seam mismatch was uncovered, so §5.11 (substrate decision)
+does NOT need to be re-opened.
+
+## Caveats / what this experiment did NOT measure
+
+- **Memory pool ceiling**: probes ran with the default unbounded pool. The
+  table reports `out_bytes` per emitted batch but not peak in-aggregator
+  state. Re-running with `TrackConsumersPool` is a follow-up if §5.7 cost
+  model needs tighter calibration numbers.
+- **Parallelism**: cells ran with the default DF partition count (2 in this
+  environment). Cliff behavior at higher partition counts isn't probed.
+- **Spill behavior**: dataset sizes top out at ~10M edges (1 GB-ish in flat
+  shape). No on-disk spill triggered.
+- **Vector / FTS columns**: only `List<UInt64>` exercised. Other list
+  payloads (e.g. `List<Float32>` vectors) may have different hash / compare
+  costs.
+- **SELECT-clause UNNEST**: only the `CROSS JOIN UNNEST` form was probed.
+  Need a follow-up cell to confirm `SELECT UNNEST(_neighbors) FROM t` and
+  `df.unnest_columns(&["_neighbors"])` both work.
+
+## Follow-ups
+
+- Add a `SELECT UNNEST(...)` and a DataFrame `unnest_columns(...)` cell so
+  the writeup pins down at least one *working* Flatten path. (Cheap; ~30 min.)
+- File a DataFusion issue for `CROSS JOIN UNNEST(<correlated column>)`
+  hitting "Physical plan does not support logical expression
+  OuterReferenceColumn". Probably already tracked — search first.
+- Extend probe to `List<Float32>` (vector-shape) and `List<List<UInt64>>`
+  (nested neighbor sets, e.g. multi-hop staging) before Phase 0 lowers
+  Vector ANN results into the factorized IR.
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -6,6 +6,13 @@ members = [
    "crates/omnigraph-cli",
    "crates/omnigraph-server",
 ]
+exclude = [
+    # MR-925 / MR-737 pre-Phase-0 validation prototypes — nested cargo
+    # workspace; never merged to main.
+    "validation-prototypes",
+    # Existing scratch crate kept out of the main workspace.
+    ".context/scratch/merge-insert-cas-repro",
+]
 default-members = [
    "crates/omnigraph",
    "crates/omnigraph-cli",
--- a/validation-prototypes/Cargo.lock
+++ b/validation-prototypes/Cargo.lock
--- a/validation-prototypes/Cargo.toml
+++ b/validation-prototypes/Cargo.toml
@@ -0,0 +1,69 @@
+[workspace]
+resolver = "2"
+members = [
+    "factorized-batches",
+    "custom-lance-index",
+    # Additional crates added as each experiment is set up:
+    # "custom-operator",          # 1.3
+    # "sip-format-bench",         # 1.4
+    # "bitmap-pushdown",          # 1.5
+    # "txn-branches-cost",        # 1.6
+    # "stable-rowid-index",       # 1.7
+]
+
+# Pre-Phase-0 validation prototypes for MR-925 / MR-737.
+# These are THROWAWAY crates that produce go/no-go signals or calibration
+# numbers. Do not merge to main. The findings live in `.context/experiments/`.
+
+[workspace.dependencies]
+# Pin to the omnigraph workspace versions so the experiments exercise the
+# same substrate behavior the engine will see in Phase 0.
+arrow-array = "57"
+arrow-ipc = "57"
+arrow-schema = "57"
+arrow-select = "57"
+arrow-cast = { version = "57", features = ["prettyprint"] }
+arrow-ord = "57"
+arrow = "57"
+
+datafusion = { version = "52", default-features = false }
+datafusion-physical-plan = "52"
+datafusion-physical-expr = "52"
+datafusion-execution = "52"
+datafusion-common = "52"
+datafusion-expr = "52"
+datafusion-functions-aggregate = "52"
+datafusion-physical-optimizer = "52"
+
+lance = { version = "4.0.0", default-features = false, features = ["aws"] }
+lance-datafusion = "4.0.0"
+lance-file = "4.0.0"
+lance-index = "4.0.0"
+lance-table = "4.0.0"
+lance-core = "4.0.0"
+
+tokio = { version = "1", features = ["rt-multi-thread", "macros", "time"] }
+futures = "0.3"
+async-trait = "0.1"
+tempfile = "3"
+anyhow = "1"
+rand = "0.8"
+roaring = "0.11"
+croaring = "2"
+prost = "0.14"
+prost-types = "0.14"
+uuid = { version = "1", features = ["v4"] }
+tracing = "0.1"
+tracing-subscriber = { version = "0.3", features = ["env-filter", "fmt"] }
+serde_json = "1"
+
+[profile.dev]
+debug = 0
+
+[profile.dev.package."*"]
+opt-level = 2
+
+[profile.release]
+opt-level = 3
+lto = "thin"
+codegen-units = 16
--- a/validation-prototypes/custom-lance-index/Cargo.toml
+++ b/validation-prototypes/custom-lance-index/Cargo.toml
@@ -0,0 +1,30 @@
+[package]
+name = "custom-lance-index"
+version = "0.0.0"
+edition = "2024"
+publish = false
+
+# Experiment 1.2 (MR-925) — custom Lance index plugin from outside the lance crate.
+# Validates MR-737 §5.4, §5.5.
+
+[dependencies]
+arrow = { workspace = true }
+arrow-array = { workspace = true }
+arrow-schema = { workspace = true }
+lance = { workspace = true }
+lance-table = { workspace = true }
+lance-index = { workspace = true }
+lance-core = { workspace = true }
+tokio = { workspace = true }
+futures = { workspace = true }
+anyhow = { workspace = true }
+prost = { workspace = true }
+prost-types = { workspace = true }
+roaring = { workspace = true }
+tempfile = { workspace = true }
+serde_json = { workspace = true }
+uuid = { workspace = true }
+
+[[bin]]
+name = "custom-lance-index"
+path = "src/main.rs"
--- a/validation-prototypes/custom-lance-index/src/main.rs
+++ b/validation-prototypes/custom-lance-index/src/main.rs
@@ -0,0 +1,355 @@
+//! MR-925 Experiment 1.2 — custom Lance index plugin from outside the lance crate.
+//!
+//! Goal: probe what a third-party crate (us) can and *cannot* do when shipping
+//! a "custom index" against the public Lance 4.0.0 surface. Produces a
+//! compatibility matrix the writeup at `.context/experiments/custom-lance-index.md`
+//! consumes.
+//!
+//! Probes:
+//!
+//!   P1. Construct an `IndexMetadata` with a non-standard `index_details`
+//!       protobuf and commit it via `Operation::CreateIndex`.
+//!   P2. On the dataset returned by the commit, check whether `load_indices()`
+//!       returns our row or filters it out.
+//!   P3. Append fragments; observe whether the index's `fragment_bitmap` is
+//!       updated automatically (it should not be — that's the engine's job).
+//!   P4. Run a `Scanner` with a filter; observe whether Lance attempts to open
+//!       our index. We expect failure: `SCALAR_INDEX_PLUGIN_REGISTRY` is a
+//!       `pub(crate)` static with no setter as of 4.0.0
+//!       (lance/src/index/scalar.rs:223 carries the TODO).
+//!   P5. Run `compact_files` (Rewrite). Observe whether our `IndexMetadata`
+//!       survives the rewrite or is dropped.
+
+use std::sync::Arc;
+
+use anyhow::{Context, Result};
+use arrow_array::builder::{StringBuilder, UInt64Builder};
+use arrow_array::{RecordBatch, RecordBatchIterator};
+use arrow_schema::{DataType, Field, Schema};
+use lance::Dataset;
+use lance::dataset::optimize::{CompactionOptions, compact_files};
+use lance::dataset::transaction::Operation;
+use lance::dataset::WriteParams;
+use lance::session::Session;
+use lance_index::DatasetIndexExt;
+use lance_table::format::IndexMetadata;
+use roaring::RoaringBitmap;
+use tempfile::TempDir;
+use uuid::Uuid;
+
+use prost_types::Any as ProstAny;
+
+const TYPE_URL: &str = "omnigraph.v0.NeighborIndexDetails";
+
+fn make_schema() -> Arc<Schema> {
+    Arc::new(Schema::new(vec![
+        Field::new("key", DataType::UInt64, false),
+        Field::new("payload", DataType::Utf8, false),
+    ]))
+}
+
+fn build_batch(n: u64, key_base: u64) -> RecordBatch {
+    let schema = make_schema();
+    let mut keys = UInt64Builder::with_capacity(n as usize);
+    let mut payloads = StringBuilder::new();
+    for i in 0..n {
+        keys.append_value(key_base + i);
+        payloads.append_value(format!("p_{:06}", key_base + i));
+    }
+    RecordBatch::try_new(
+        schema,
+        vec![Arc::new(keys.finish()), Arc::new(payloads.finish())],
+    )
+    .expect("build batch")
+}
+
+async fn write_initial(uri: &str) -> Result<Dataset> {
+    let schema = make_schema();
+    let batches = vec![Ok(build_batch(1000, 0))];
+    let reader = RecordBatchIterator::new(batches.into_iter(), schema.clone());
+    Dataset::write(reader, uri, Some(WriteParams::default()))
+        .await
+        .context("initial write")
+}
+
+async fn append_more(ds: &mut Dataset) -> Result<()> {
+    let schema = make_schema();
+    let batches = vec![Ok(build_batch(500, 10_000))];
+    let reader = RecordBatchIterator::new(batches.into_iter(), schema.clone());
+    ds.append(reader, None).await.context("append")?;
+    Ok(())
+}
+
+/// Construct our custom-index metadata. The bytes payload is a placeholder
+/// for what a real index plugin would carry, e.g. a serialized
+/// BTreeMap<u64, u64> (key → row_addr). We don't read it back here; we just
+/// want to prove that Lance round-trips it through the manifest unchanged.
+fn make_index_metadata(uuid: Uuid, frag_ids: &[u64], dataset_version: u64) -> IndexMetadata {
+    let payload_bytes: Vec<u8> = b"omnigraph::neighbor_index v0 (1000 entries)".to_vec();
+    let any = ProstAny {
+        type_url: TYPE_URL.to_string(),
+        value: payload_bytes,
+    };
+
+    let mut bitmap = RoaringBitmap::new();
+    for f in frag_ids {
+        bitmap.insert(*f as u32);
+    }
+
+    IndexMetadata {
+        uuid,
+        fields: vec![0], // field id of "key" (0 on this freshly written dataset)
+        name: "neighbor_idx".to_string(),
+        dataset_version,
+        fragment_bitmap: Some(bitmap),
+        index_details: Some(Arc::new(any)),
+        index_version: 0,
+        created_at: None,
+        base_id: None,
+        files: None,
+    }
+}
+
+async fn commit_index(ds: &Dataset, idx: IndexMetadata) -> Result<Dataset> {
+    let op = Operation::CreateIndex {
+        new_indices: vec![idx],
+        removed_indices: vec![],
+    };
+    let new = Dataset::commit(
+        ds.uri(),
+        op,
+        Some(ds.manifest().version),
+        None,
+        None,
+        Arc::new(Session::default()),
+        false,
+    )
+    .await
+    .context("commit CreateIndex")?;
+    Ok(new)
+}
+
+#[derive(Default)]
+struct Matrix {
+    rows: Vec<Row>,
+}
+
+struct Row {
+    probe: &'static str,
+    outcome: String,
+    notes: String,
+}
+
+impl Matrix {
+    fn add(&mut self, probe: &'static str, outcome: impl Into<String>, notes: impl Into<String>) {
+        self.rows.push(Row {
+            probe,
+            outcome: outcome.into(),
+            notes: notes.into(),
+        });
+    }
+
+    fn print(&self) {
+        println!("\n{:-^120}", " custom-lance-index compatibility matrix ");
+        println!("{:<36} {:<24} {}", "probe", "outcome", "notes");
+        println!("{:-<120}", "");
+        for r in &self.rows {
+            println!("{:<36} {:<24} {}", r.probe, r.outcome, r.notes);
+        }
+    }
+}
+
+#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
+async fn main() -> Result<()> {
+    let tmp = TempDir::new().context("tmpdir")?;
+    let uri = format!("file://{}", tmp.path().join("ds").display());
+    println!("dataset uri: {uri}");
+
+    let mut matrix = Matrix::default();
+
+    // P1: build a dataset, then construct + commit our custom index.
+    let ds = write_initial(&uri).await?;
+    let frag_ids: Vec<u64> = ds
+        .get_fragments()
+        .iter()
+        .map(|f| f.id() as u64)
+        .collect();
+    println!("initial fragments: {frag_ids:?}");
+
+    let our_uuid = Uuid::new_v4();
+    let idx = make_index_metadata(our_uuid, &frag_ids, ds.manifest().version);
+    let mut ds = match commit_index(&ds, idx).await {
+        Ok(d) => {
+            matrix.add(
+                "P1 construct+commit",
+                "OK",
+                format!(
+                    "Operation::CreateIndex accepted custom type_url '{TYPE_URL}'; commit v{}",
+                    d.manifest().version
+                ),
+            );
+            d
+        }
+        Err(e) => {
+            matrix.add("P1 construct+commit", "FAIL", format!("{e:#}"));
+            matrix.print();
+            return Ok(());
+        }
+    };
+
+    // P2: load indices.
+    let indices = ds.load_indices().await.context("load_indices")?;
+    let ours: Vec<&IndexMetadata> = indices
+        .iter()
+        .filter(|i| i.uuid == our_uuid)
+        .collect();
+    if ours.len() == 1 {
+        let our_idx = ours[0];
+        let detail_url = our_idx
+            .index_details
+            .as_ref()
+            .map(|a| a.type_url.clone())
+            .unwrap_or_default();
+        let frag_count = our_idx
+            .fragment_bitmap
+            .as_ref()
+            .map(|b| b.len())
+            .unwrap_or(0);
+        matrix.add(
+            "P2 load_indices (round-trip)",
+            "OK",
+            format!(
+                "type_url='{detail_url}' fragment_bitmap.len={frag_count} survives retain_supported_indices"
+            ),
+        );
+    } else {
+        matrix.add(
+            "P2 load_indices (round-trip)",
+            "FAIL",
+            format!(
+                "expected 1 row matching uuid {our_uuid}, found {} (retain_supported_indices likely dropped it)",
+                ours.len()
+            ),
+        );
+    }
+
+    // P3: append more rows; the index's fragment_bitmap should NOT
+    // auto-update — that's the plugin's job. Verify the dataset still
+    // reports the same (stale) bitmap.
+    append_more(&mut ds).await?;
+    let indices_after_append = ds.load_indices().await?;
+    let ours_after_append: Vec<&IndexMetadata> = indices_after_append
+        .iter()
+        .filter(|i| i.uuid == our_uuid)
+        .collect();
+    if let Some(idx) = ours_after_append.first() {
+        let frags_now: Vec<u32> = idx
+            .fragment_bitmap
+            .as_ref()
+            .map(|b| b.iter().collect())
+            .unwrap_or_default();
+        matrix.add(
+            "P3 append-row coverage",
+            if frags_now.len() == frag_ids.len() {
+                "STALE_AS_EXPECTED"
+            } else {
+                "UNEXPECTED_AUTO_UPDATE"
+            },
+            format!(
+                "fragment_bitmap={frags_now:?} (expected {frag_ids:?}); new fragments not auto-covered"
+            ),
+        );
+    } else {
+        matrix.add("P3 append-row coverage", "DROPPED", "index disappeared after append");
+    }
+
+    // P4: scan with a predicate on `key` and observe whether Lance tries to
+    // open our index. The plugin registry is closed, so no builtin plugin
+    // matches our type_url and the scanner has no way to dispatch to us.
+    // This probe only checks that the scan still returns the right row and
+    // does NOT panic or error on our metadata being present; the full-scan
+    // fallback is inferred from that, not observed directly.
+    let mut scanner = ds.scan();
+    scanner
+        .filter("key = 42")
+        .context("filter")?
+        .project(&["key"])
+        .context("project")?;
+    let stream = scanner.try_into_stream().await.context("scan stream")?;
+    let batches: Vec<_> = futures::stream::TryStreamExt::try_collect(stream)
+        .await
+        .context("scan collect")?;
+    let scanned_rows: usize = batches.iter().map(|b| b.num_rows()).sum();
+    matrix.add(
+        "P4 scan with filter on indexed col",
+        if scanned_rows == 1 { "FULL_SCAN_FALLBACK" } else { "UNEXPECTED" },
+        format!(
+            "rows={scanned_rows} (expected 1); SCALAR_INDEX_PLUGIN_REGISTRY refuses unknown type_url '{TYPE_URL}' so scanner falls back to full scan"
+        ),
+    );
+
+    // P5: run compact_files (Rewrite). Observe whether our IndexMetadata
+    // survives the rewrite. The Operation::Rewrite path remaps row addresses
+    // for *recognized* indices (BTreeMap of `rewritten_indices`) — our index
+    // is not recognized, so we expect Lance to either (a) leave the
+    // IndexMetadata in place with stale fragment_bitmap, or (b) drop it.
+    let pre_compact_indices = ds.load_indices().await?.len();
+    let metrics = compact_files(&mut ds, CompactionOptions::default(), None)
+        .await
+        .context("compact_files")?;
+    let post_compact_indices = ds.load_indices().await?;
+    let ours_after_compact: Vec<&IndexMetadata> = post_compact_indices
+        .iter()
+        .filter(|i| i.uuid == our_uuid)
+        .collect();
+
+    let frags_after: Vec<u64> = ds
+        .get_fragments()
+        .iter()
+        .map(|f| f.id() as u64)
+        .collect();
+
+    if let Some(idx) = ours_after_compact.first() {
+        let bitmap: Vec<u32> = idx
+            .fragment_bitmap
+            .as_ref()
+            .map(|b| b.iter().collect())
+            .unwrap_or_default();
+        let outcome = if frags_after.iter().all(|f| bitmap.contains(&(*f as u32))) {
+            "REMAPPED"
+        } else if bitmap.is_empty() {
+            "EMPTIED"
+        } else {
+            "STALE_BITMAP"
+        };
+        matrix.add(
+            "P5 compact_files (Rewrite)",
+            outcome,
+            format!(
+                "before={pre_compact_indices} indices; after={} indices; rewritten files={}; new fragments={frags_after:?}; idx.fragment_bitmap={bitmap:?}",
+                post_compact_indices.len(),
+                metrics.files_added
+            ),
+        );
+    } else {
+        matrix.add(
+            "P5 compact_files (Rewrite)",
+            "DROPPED",
+            format!(
+                "index dropped during compaction; before={pre_compact_indices} indices, after={} indices; files_added={}",
+                post_compact_indices.len(),
+                metrics.files_added
+            ),
+        );
+    }
+
+    matrix.print();
+
+    // Final commentary printed for the writeup.
+    println!("\n[note] Lance 4.0.0 has a private static `SCALAR_INDEX_PLUGIN_REGISTRY` (see");
+    println!("       lance/src/index/scalar.rs:223). The `// TODO: Allow users to register their own plugins`");
+    println!("       comment confirms this surface is not yet pluggable. We can write");
+    println!("       custom IndexMetadata, but the Lance scanner cannot dispatch to a custom plugin.");
+
+    Ok(())
+}
--- a/validation-prototypes/factorized-batches/Cargo.toml
+++ b/validation-prototypes/factorized-batches/Cargo.toml
@ -0,0 +1,34 @@
+[package]
+name = "factorized-batches"
+version = "0.0.0"
+edition = "2024"
+publish = false
+
+# Experiment 1.1 (MR-925) — factorized batches through DataFusion ops.
+# Validates MR-737 §5.2 / Open Q2.
+
+[dependencies]
+arrow = { workspace = true }
+arrow-array = { workspace = true }
+arrow-schema = { workspace = true }
+arrow-cast = { workspace = true }
+datafusion = { workspace = true, features = [
+    "sql",
+    "nested_expressions",
+    "unicode_expressions",
+    "string_expressions",
+    "math_expressions",
+    "regex_expressions",
+    "datetime_expressions",
+] }
+datafusion-common = { workspace = true }
+datafusion-expr = { workspace = true }
+datafusion-physical-plan = { workspace = true }
+tokio = { workspace = true }
+futures = { workspace = true }
+anyhow = { workspace = true }
+rand = { workspace = true }
+
+[[bin]]
+name = "factorized-batches"
+path = "src/main.rs"
--- a/validation-prototypes/factorized-batches/sample-output.txt
+++ b/validation-prototypes/factorized-batches/sample-output.txt
@ -0,0 +1,113 @@
+[cell] n_src=10000 fanout=u=1 edges=10000
+
+
+[cell] n_src=10000 fanout=u=10 edges=100000
+
+
+[cell] n_src=10000 fanout=u=100 edges=1000000
+
+
+[cell] n_src=10000 fanout=u=1000 edges=10000000
+
+
+[cell] n_src=10000 fanout=s=10/0.02 edges=118141
+
+-------------------------------------------------------- factorized-batches results --------------------------------------------------------
+op                      n_src         fanout     f_ok     f_rows  f_time_ms       x_ok     x_rows  x_time_ms      speedup recommendation
+--------------------------------------------------------------------------------------------------------------------------------------------
+filter                  10000            u=1        Y       5000       2.31          Y       5000       0.75        0.32x KEEP_FACTORIZED
+project                 10000            u=1        Y      10000       0.21          Y      10000       0.17        0.81x KEEP_FACTORIZED
+sort                    10000            u=1        Y       1000       2.14          Y       1000       2.02        0.94x KEEP_FACTORIZED
+aggregate_scalar        10000            u=1        Y          1       2.04          Y          1       1.45        0.71x KEEP_FACTORIZED
+aggregate_on_list       10000            u=1        Y       6353       2.64          -          -          -            - KEEP_FACTORIZED
+join_scalar             10000            u=1        Y        100       1.27          Y        100       1.06        0.83x KEEP_FACTORIZED
+join_on_list            10000            u=1        Y          1       1.88          -          -          -            - KEEP_FACTORIZED
+unnest_flatten          10000            u=1        N          0       0.53          -          -          -            - FLATTEN_BEFORE
+    factorized error: execute: This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn(Field { name: "_neighbors", data_type: List(Field { data_type: UInt64 }) }, Column { relation: Some(Bare { table: "t" }), name: "_neighbors" })
+filter                  10000           u=10        Y       5000       1.16          Y      50000       0.84        0.72x KEEP_FACTORIZED
+project                 10000           u=10        Y      10000       0.26          Y     100000       0.27        1.03x KEEP_FACTORIZED
+sort                    10000           u=10        Y       1000       2.72          Y       1000      19.53        7.18x KEEP_FACTORIZED
+aggregate_scalar        10000           u=10        Y          1       1.46          Y          1       4.04        2.77x KEEP_FACTORIZED
+aggregate_on_list       10000           u=10        Y      10000      12.37          -          -          -            - KEEP_FACTORIZED
+join_scalar             10000           u=10        Y        100       1.17          Y        100       4.16        3.57x KEEP_FACTORIZED
+join_on_list            10000           u=10        Y          1       3.84          -          -          -            - KEEP_FACTORIZED
+unnest_flatten          10000           u=10        N          0       0.45          -          -          -            - FLATTEN_BEFORE
+    factorized error: execute: This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn(Field { name: "_neighbors", data_type: List(Field { data_type: UInt64 }) }, Column { relation: Some(Bare { table: "t" }), name: "_neighbors" })
+filter                  10000          u=100        Y       5000       1.40          Y     500000       2.73        1.95x KEEP_FACTORIZED
+project                 10000          u=100        Y      10000       0.20          Y    1000000       0.25        1.26x KEEP_FACTORIZED
+sort                    10000          u=100        Y       1000       2.58          Y       1000     180.72       70.18x KEEP_FACTORIZED
+aggregate_scalar        10000          u=100        Y          1       1.74          Y          1      28.69       16.47x KEEP_FACTORIZED
+aggregate_on_list       10000          u=100        Y      10000     113.60          -          -          -            - KEEP_FACTORIZED
+join_scalar             10000          u=100        Y        100       4.32          Y        100      17.92        4.15x KEEP_FACTORIZED
+join_on_list            10000          u=100        Y          1      26.24          -          -          -            - KEEP_FACTORIZED
+unnest_flatten          10000          u=100        N          0       0.64          -          -          -            - FLATTEN_BEFORE
+    factorized error: execute: This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn(Field { name: "_neighbors", data_type: List(Field { data_type: UInt64 }) }, Column { relation: Some(Bare { table: "t" }), name: "_neighbors" })
+filter                  10000         u=1000        Y       5000      46.29          Y    5000000      22.12        0.48x KEEP_FACTORIZED
+project                 10000         u=1000        Y      10000       0.31          Y   10000000       0.44        1.43x KEEP_FACTORIZED
+sort                    10000         u=1000        Y       1000       4.75          Y       1000    1597.33      336.28x KEEP_FACTORIZED
+aggregate_scalar        10000         u=1000        Y          1       2.01          Y          1     282.68      140.36x KEEP_FACTORIZED
+aggregate_on_list       10000         u=1000        Y      10000    1624.65          -          -          -            - KEEP_FACTORIZED
+join_scalar             10000         u=1000        Y        100       5.79          Y        100     196.15       33.88x KEEP_FACTORIZED
+join_on_list            10000         u=1000        Y          1     659.47          -          -          -            - KEEP_FACTORIZED
+unnest_flatten          10000         u=1000        N          0       0.62          -          -          -            - FLATTEN_BEFORE
+    factorized error: execute: This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn(Field { name: "_neighbors", data_type: List(Field { data_type: UInt64 }) }, Column { relation: Some(Bare { table: "t" }), name: "_neighbors" })
+filter                  10000      s=10/0.02        Y       5000       0.91          Y      68142       1.02        1.11x KEEP_FACTORIZED
+project                 10000      s=10/0.02        Y      10000       0.21          Y     118141       0.19        0.88x KEEP_FACTORIZED
+sort                    10000      s=10/0.02        Y       1000       2.23          Y       1000      22.38       10.05x KEEP_FACTORIZED
+aggregate_scalar        10000      s=10/0.02        Y          1       1.93          Y          1       4.47        2.32x KEEP_FACTORIZED
+aggregate_on_list       10000      s=10/0.02        Y      10000      10.21          -          -          -            - KEEP_FACTORIZED
+join_scalar             10000      s=10/0.02        Y        100       1.46          Y        100       3.87        2.65x KEEP_FACTORIZED
+join_on_list            10000      s=10/0.02        Y          1       4.98          -          -          -            - KEEP_FACTORIZED
+unnest_flatten          10000      s=10/0.02        N          0       0.43          -          -          -            - FLATTEN_BEFORE
+    factorized error: execute: This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn(Field { name: "_neighbors", data_type: List(Field { data_type: UInt64 }) }, Column { relation: Some(Bare { table: "t" }), name: "_neighbors" })
+
+[explain] aggregate_scalar (factorized input):
+logical_plan Sort: bucket ASC NULLS LAST
+  Projection: substr(t.payload,Int64(1),Int64(4)) AS bucket, count(Int64(1)) AS count(*) AS n
+    Aggregate: groupBy=[[substr(t.payload, Int64(1), Int64(4))]], aggr=[[count(Int64(1))]]
+      TableScan: t projection=[payload] 
+physical_plan SortPreservingMergeExec: [bucket@0 ASC NULLS LAST]
+  SortExec: expr=[bucket@0 ASC NULLS LAST], preserve_partitioning=[true]
+    ProjectionExec: expr=[substr(t.payload,Int64(1),Int64(4))@0 as bucket, count(Int64(1))@1 as n]
+      AggregateExec: mode=FinalPartitioned, gby=[substr(t.payload,Int64(1),Int64(4))@0 as substr(t.payload,Int64(1),Int64(4))], aggr=[count(Int64(1))]
+        RepartitionExec: partitioning=Hash([substr(t.payload,Int64(1),Int64(4))@0], 2), input_partitions=1
+          AggregateExec: mode=Partial, gby=[substr(payload@0, 1, 4) as substr(t.payload,Int64(1),Int64(4))], aggr=[count(Int64(1))]
+            DataSourceExec: partitions=1, partition_sizes=[1]
+ 
+
+
+[explain] join_scalar (factorized input):
+logical_plan Projection: a.src_id, a._neighbors
+  Limit: skip=0, fetch=100
+    Inner Join: a.src_id = b.src_id
+      SubqueryAlias: a
+        TableScan: t projection=[src_id, _neighbors]
+      SubqueryAlias: b
+        TableScan: t projection=[src_id] 
+physical_plan ProjectionExec: expr=[src_id@1 as src_id, _neighbors@2 as _neighbors]
+  GlobalLimitExec: skip=0, fetch=100
+    HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(src_id@0, src_id@0)]
+      DataSourceExec: partitions=1, partition_sizes=[1]
+      DataSourceExec: partitions=1, partition_sizes=[1]
+ 
+
+
+[explain] aggregate_on_list (factorized input):
+logical_plan Projection: t._neighbors, count(Int64(1)) AS count(*) AS n
+  Aggregate: groupBy=[[t._neighbors]], aggr=[[count(Int64(1))]]
+    TableScan: t projection=[_neighbors] 
+physical_plan ProjectionExec: expr=[_neighbors@0 as _neighbors, count(Int64(1))@1 as n]
+  AggregateExec: mode=FinalPartitioned, gby=[_neighbors@0 as _neighbors], aggr=[count(Int64(1))]
+    RepartitionExec: partitioning=Hash([_neighbors@0], 2), input_partitions=1
+      AggregateExec: mode=Partial, gby=[_neighbors@0 as _neighbors], aggr=[count(Int64(1))]
+        DataSourceExec: partitions=1, partition_sizes=[1]
+ 
+
+
+[explain] sort (factorized input):
+logical_plan Sort: t.src_id DESC NULLS FIRST, fetch=1000
+  TableScan: t projection=[src_id, _neighbors] 
+physical_plan SortExec: TopK(fetch=1000), expr=[src_id@0 DESC], preserve_partitioning=[false]
+  DataSourceExec: partitions=1, partition_sizes=[1]
+
+Exit code: 0
--- a/validation-prototypes/factorized-batches/src/data.rs
+++ b/validation-prototypes/factorized-batches/src/data.rs
@ -0,0 +1,145 @@
+//! Synthetic data generation for the factorized-batches experiment.
+//!
+//! Two shapes are produced:
+//!   * `factorized`: one row per `src_id`, `_neighbors: List<UInt64>` carrying
+//!     the neighbor set for that source.
+//!   * `flat`:       one row per `(src_id, neighbor)` pair (exploded baseline).
+
+use std::sync::Arc;
+
+use arrow_array::builder::{ListBuilder, UInt64Builder};
+use arrow_array::{Float64Array, RecordBatch, StringArray, UInt64Array};
+use arrow_schema::{DataType, Field, Schema};
+use rand::SeedableRng;
+use rand::rngs::StdRng;
+use rand::Rng;
+
+/// Distribution of neighbor-list lengths per source row.
+#[derive(Clone, Copy, Debug)]
+pub enum FanoutShape {
+    /// Every src_id has exactly `target` neighbors.
+    Uniform { target: usize },
+    /// Skewed: the first `heavy_fraction` of rows get 10× `target` neighbors; the rest get `target` ± 2.
+    Skewed { target: usize, heavy_fraction: f64 },
+}
+
+#[derive(Clone, Debug)]
+pub struct DataParams {
+    pub n_src: usize,
+    pub fanout: FanoutShape,
+    pub seed: u64,
+}
+
+/// Returns `(factorized_batch, flat_batch)` with the same logical content.
+///
+/// Schema:
+///   factorized: src_id: UInt64, payload: Utf8, weight: Float64,
+///               _neighbors: List<UInt64 not null> not null
+///   flat:       src_id: UInt64, payload: Utf8, weight: Float64, dst: UInt64
+pub fn build(params: &DataParams) -> (RecordBatch, RecordBatch) {
+    let mut rng = StdRng::seed_from_u64(params.seed);
+
+    // factorized columns
+    let mut src_ids = UInt64Array::builder(params.n_src);
+    let mut payloads: Vec<String> = Vec::with_capacity(params.n_src);
+    let mut weights: Vec<f64> = Vec::with_capacity(params.n_src);
+    let mut list_builder = ListBuilder::new(UInt64Builder::new())
+        .with_field(Field::new("item", DataType::UInt64, false));
+
+    // flat columns
+    let mut flat_src: Vec<u64> = Vec::new();
+    let mut flat_payload: Vec<String> = Vec::new();
+    let mut flat_weight: Vec<f64> = Vec::new();
+    let mut flat_dst: Vec<u64> = Vec::new();
+
+    let len_for = |i: usize, rng: &mut StdRng| -> usize {
+        match params.fanout {
+            FanoutShape::Uniform { target } => target,
+            FanoutShape::Skewed { target, heavy_fraction } => {
+                if (i as f64) / (params.n_src as f64) < heavy_fraction {
+                    target.saturating_mul(10)
+                } else {
+                    let jitter: i64 = rng.gen_range(-2..=2);
+                    ((target as i64 + jitter).max(0)) as usize
+                }
+            }
+        }
+    };
+
+    for i in 0..params.n_src {
+        let src = i as u64;
+        let payload = format!("p_{:06}", i);
+        let weight = rng.r#gen::<f64>();
+
+        src_ids.append_value(src);
+        payloads.push(payload.clone());
+        weights.push(weight);
+
+        let n_neighbors = len_for(i, &mut rng);
+        for _ in 0..n_neighbors {
+            let dst: u64 = rng.gen_range(0..(params.n_src as u64).max(1));
+            list_builder.values().append_value(dst);
+
+            flat_src.push(src);
+            flat_payload.push(payload.clone());
+            flat_weight.push(weight);
+            flat_dst.push(dst);
+        }
+        list_builder.append(true);
+    }
+
+    let neighbors_field = Field::new(
+        "_neighbors",
+        DataType::List(Arc::new(Field::new("item", DataType::UInt64, false))),
+        false,
+    );
+    let factorized_schema = Arc::new(Schema::new(vec![
+        Field::new("src_id", DataType::UInt64, false),
+        Field::new("payload", DataType::Utf8, false),
+        Field::new("weight", DataType::Float64, false),
+        neighbors_field,
+    ]));
+
+    let factorized = RecordBatch::try_new(
+        factorized_schema,
+        vec![
+            Arc::new(src_ids.finish()),
+            Arc::new(StringArray::from(payloads)),
+            Arc::new(Float64Array::from(weights)),
+            Arc::new(list_builder.finish()),
+        ],
+    )
+    .expect("factorized record batch");
+
+    let flat_schema = Arc::new(Schema::new(vec![
+        Field::new("src_id", DataType::UInt64, false),
+        Field::new("payload", DataType::Utf8, false),
+        Field::new("weight", DataType::Float64, false),
+        Field::new("dst", DataType::UInt64, false),
+    ]));
+    let flat = RecordBatch::try_new(
+        flat_schema,
+        vec![
+            Arc::new(UInt64Array::from(flat_src)),
+            Arc::new(StringArray::from(flat_payload)),
+            Arc::new(Float64Array::from(flat_weight)),
+            Arc::new(UInt64Array::from(flat_dst)),
+        ],
+    )
+    .expect("flat record batch");
+
+    (factorized, flat)
+}
+
+/// Total number of (src, dst) edges encoded in a factorized batch.
+pub fn factorized_edge_count(batch: &RecordBatch) -> usize {
+    let list = batch
+        .column_by_name("_neighbors")
+        .expect("_neighbors column")
+        .as_any()
+        .downcast_ref::<arrow_array::ListArray>()
+        .expect("ListArray");
+    let offsets = list.value_offsets();
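+    // Offsets are absolute into the child array, so subtract the first one (non-zero for sliced arrays).
+    (offsets.last().copied().unwrap_or(0) - offsets.first().copied().unwrap_or(0)) as usize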
+}
--- a/validation-prototypes/factorized-batches/src/main.rs
+++ b/validation-prototypes/factorized-batches/src/main.rs
@ -0,0 +1,301 @@
+mod data;
+mod ops;
+
+use anyhow::Result;
+use arrow_array::RecordBatch;
+
+use crate::data::{DataParams, FanoutShape, build, factorized_edge_count};
+use crate::ops::{
+    OpResult, aggregate_on_list_sql_factorized, aggregate_sql_factorized, aggregate_sql_flat,
+    explain_factorized, filter_sql, join_on_list_sql_factorized, join_sql_factorized,
+    join_sql_flat, probe_unnest_flatten, project_sql_factorized, project_sql_flat, run_sql,
+    sort_sql_factorized, sort_sql_flat,
+};
+
+/// One row in the final per-op recommendation matrix.
+#[derive(Debug, Clone)]
+struct OpRow {
+    op_name: &'static str,
+    n_src: usize,
+    fanout: String,
+    factorized: OpResult,
+    flat: Option<OpResult>,
+}
+
+fn print_table(rows: &[OpRow]) {
+    println!("{:-^140}", " factorized-batches results ");
+    println!(
+        "{:<22} {:>6} {:>14} {:>8} {:>10} {:>10} {:>10} {:>10} {:>10} {:>12} {}",
+        "op", "n_src", "fanout", "f_ok", "f_rows", "f_time_ms", "x_ok", "x_rows", "x_time_ms",
+        "speedup", "recommendation"
+    );
+    println!("{:-<140}", "");
+    for r in rows {
+        let f_ok = if r.factorized.accepts { "Y" } else { "N" };
+        let f_time = format!("{:.2}", r.factorized.time_ms);
+        let (x_ok, x_rows, x_time, speedup) = match &r.flat {
+            Some(flat) => {
+                let ok = if flat.accepts { "Y" } else { "N" };
+                let speedup = if flat.accepts && r.factorized.accepts && flat.time_ms > 0.0 {
+                    format!("{:.2}x", flat.time_ms / r.factorized.time_ms.max(1e-3))
+                } else {
+                    "-".to_string()
+                };
+                (
+                    ok.to_string(),
+                    flat.out_rows.to_string(),
+                    format!("{:.2}", flat.time_ms),
+                    speedup,
+                )
+            }
+            None => ("-".into(), "-".into(), "-".into(), "-".into()),
+        };
+        let rec = recommendation(r);
+        println!(
+            "{:<22} {:>6} {:>14} {:>8} {:>10} {:>10} {:>10} {:>10} {:>10} {:>12} {}",
+            r.op_name, r.n_src, r.fanout, f_ok, r.factorized.out_rows, f_time,
+            x_ok, x_rows, x_time, speedup, rec
+        );
+        if let Some(err) = &r.factorized.error {
+            println!("    factorized error: {err}");
+        }
+        if let Some(flat) = &r.flat {
+            if let Some(err) = &flat.error {
+                println!("    flat error:       {err}");
+            }
+        }
+    }
+}
+
+/// Map (factorized accepts, flat baseline, row counts) -> {KEEP_FACTORIZED, FLATTEN_BEFORE, MULTIPLICITY_AWARE_FUTURE}.
+fn recommendation(row: &OpRow) -> &'static str {
+    if !row.factorized.accepts {
+        return "FLATTEN_BEFORE";
+    }
+    match (&row.flat, row.factorized.out_rows) {
+        (Some(flat), f_rows) if flat.accepts => {
+            // If factorized emits a superset of rows-of-interest with no
+            // multiplicity loss, KEEP. If it changes semantics, demand
+            // multiplicity awareness.
+            if row.op_name == "aggregate_on_list" || row.op_name == "join_on_list" {
+                // Semantically different from a flat baseline.
+                "MULTIPLICITY_AWARE_FUTURE"
+            } else if f_rows <= flat.out_rows {
+                "KEEP_FACTORIZED"
+            } else {
+                "FLATTEN_BEFORE"
+            }
+        }
+        _ => "KEEP_FACTORIZED",
+    }
+}
+
+async fn run_one_op(
+    op_name: &'static str,
+    factorized: RecordBatch,
+    flat_for_op: Option<RecordBatch>,
+    factorized_sql: &str,
+    flat_sql: Option<&str>,
+    params: &DataParams,
+    fanout_label: String,
+) -> OpRow {
+    let f = run_sql(op_name, "factorized", factorized, "t", factorized_sql).await;
+    let x = match (flat_for_op, flat_sql) {
+        (Some(b), Some(sql)) => Some(run_sql(op_name, "flat", b, "t", sql).await),
+        _ => None,
+    };
+    OpRow {
+        op_name,
+        n_src: params.n_src,
+        fanout: fanout_label,
+        factorized: f,
+        flat: x,
+    }
+}
+
+#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
+async fn main() -> Result<()> {
+    // Cells from the ticket: 10K source rows × {1, 10, 100, 1000} neighbors,
+    // plus a skewed cell.
+    let cells: Vec<DataParams> = vec![
+        DataParams {
+            n_src: 10_000,
+            fanout: FanoutShape::Uniform { target: 1 },
+            seed: 7,
+        },
+        DataParams {
+            n_src: 10_000,
+            fanout: FanoutShape::Uniform { target: 10 },
+            seed: 7,
+        },
+        DataParams {
+            n_src: 10_000,
+            fanout: FanoutShape::Uniform { target: 100 },
+            seed: 7,
+        },
+        DataParams {
+            n_src: 10_000,
+            fanout: FanoutShape::Uniform { target: 1000 },
+            seed: 7,
+        },
+        DataParams {
+            n_src: 10_000,
+            fanout: FanoutShape::Skewed {
+                target: 10,
+                heavy_fraction: 0.02,
+            },
+            seed: 7,
+        },
+    ];
+
+    let mut rows: Vec<OpRow> = Vec::new();
+    for params in &cells {
+        let (factorized, flat) = build(params);
+        let edges = factorized_edge_count(&factorized);
+        let label = match params.fanout {
+            FanoutShape::Uniform { target } => format!("u={target}"),
+            FanoutShape::Skewed {
+                target,
+                heavy_fraction,
+            } => format!("s={target}/{heavy_fraction}"),
+        };
+        println!(
+            "\n[cell] n_src={} fanout={} edges={}\n",
+            params.n_src, label, edges
+        );
+
+        rows.push(
+            run_one_op(
+                "filter",
+                factorized.clone(),
+                Some(flat.clone()),
+                filter_sql(),
+                Some(filter_sql()),
+                params,
+                label.clone(),
+            )
+            .await,
+        );
+        rows.push(
+            run_one_op(
+                "project",
+                factorized.clone(),
+                Some(flat.clone()),
+                project_sql_factorized(),
+                Some(project_sql_flat()),
+                params,
+                label.clone(),
+            )
+            .await,
+        );
+        rows.push(
+            run_one_op(
+                "sort",
+                factorized.clone(),
+                Some(flat.clone()),
+                sort_sql_factorized(),
+                Some(sort_sql_flat()),
+                params,
+                label.clone(),
+            )
+            .await,
+        );
+        rows.push(
+            run_one_op(
+                "aggregate_scalar",
+                factorized.clone(),
+                Some(flat.clone()),
+                aggregate_sql_factorized(),
+                Some(aggregate_sql_flat()),
+                params,
+                label.clone(),
+            )
+            .await,
+        );
+        rows.push(
+            run_one_op(
+                "aggregate_on_list",
+                factorized.clone(),
+                None,
+                aggregate_on_list_sql_factorized(),
+                None,
+                params,
+                label.clone(),
+            )
+            .await,
+        );
+        rows.push(
+            run_one_op(
+                "join_scalar",
+                factorized.clone(),
+                Some(flat.clone()),
+                join_sql_factorized(),
+                Some(join_sql_flat()),
+                params,
+                label.clone(),
+            )
+            .await,
+        );
+        rows.push(
+            run_one_op(
+                "join_on_list",
+                factorized.clone(),
+                None,
+                join_on_list_sql_factorized(),
+                None,
+                params,
+                label.clone(),
+            )
+            .await,
+        );
+
+        // Calibrate the cost of an explicit `Flatten` (UNNEST) on the
+        // factorized batch alone. This is the "flatten cost" column the
+        // writeup needs.
+        let unnest = probe_unnest_flatten(factorized.clone(), "t").await;
+        rows.push(OpRow {
+            op_name: "unnest_flatten",
+            n_src: params.n_src,
+            fanout: label.clone(),
+            factorized: unnest,
+            flat: None,
+        });
+    }
+
+    print_table(&rows);
+
+    // Capture one EXPLAIN per representative op to anchor the writeup.
+    let probe_params = DataParams {
+        n_src: 1000,
+        fanout: FanoutShape::Uniform { target: 10 },
+        seed: 1,
+    };
+    let (factorized, _) = build(&probe_params);
+    println!("\n[explain] aggregate_scalar (factorized input):");
+    println!(
+        "{}",
+        explain_factorized(factorized.clone(), "t", aggregate_sql_factorized())
+            .await
+            .unwrap_or_else(|e| format!("<explain failed: {e:#}>"))
+    );
+    println!("\n[explain] join_scalar (factorized input):");
+    println!(
+        "{}",
+        explain_factorized(factorized.clone(), "t", join_sql_factorized())
+            .await
+            .unwrap_or_else(|e| format!("<explain failed: {e:#}>"))
+    );
+    println!("\n[explain] aggregate_on_list (factorized input):");
+    println!(
+        "{}",
+        explain_factorized(factorized.clone(), "t", aggregate_on_list_sql_factorized())
+            .await
+            .unwrap_or_else(|e| format!("<explain failed: {e:#}>"))
+    );
+    println!("\n[explain] sort (factorized input):");
+    println!(
+        "{}",
+        explain_factorized(factorized, "t", sort_sql_factorized())
+            .await
+            .unwrap_or_else(|e| format!("<explain failed: {e:#}>"))
+    );
+
+    Ok(())
+}
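The speedup column and the recommendation mapping in `main.rs` can be exercised without DataFusion. Below is a minimal std-only sketch, assuming a simplified stand-in type: `Probe`, `recommend`, and `speedup` are hypothetical names for illustration, not part of the prototype. The sketch classifies the list-keyed ops before consulting the flat baseline, because `main` runs those ops with no flat variant; that ordering is an assumption of the sketch.

```rust
/// Hypothetical, simplified stand-in for one probe outcome
/// (the prototype's `OpResult` carries more fields).
#[derive(Clone, Copy)]
struct Probe {
    accepts: bool,
    out_rows: usize,
    time_ms: f64,
}

/// Sketch of the recommendation mapping.
fn recommend(op_name: &str, factorized: Probe, flat: Option<Probe>) -> &'static str {
    if !factorized.accepts {
        return "FLATTEN_BEFORE";
    }
    // List-keyed ops have no flat baseline, so classify them first.
    if matches!(op_name, "aggregate_on_list" | "join_on_list") {
        return "MULTIPLICITY_AWARE_FUTURE";
    }
    match flat {
        // Factorized emitted more rows than the flat baseline: flatten first.
        Some(f) if f.accepts && factorized.out_rows > f.out_rows => "FLATTEN_BEFORE",
        _ => "KEEP_FACTORIZED",
    }
}

/// Sketch of the speedup cell: flat time over factorized time, with the
/// denominator clamped to 1e-3 ms to avoid division by (near) zero.
fn speedup(factorized: Probe, flat: Probe) -> String {
    if flat.accepts && factorized.accepts && flat.time_ms > 0.0 {
        format!("{:.2}x", flat.time_ms / factorized.time_ms.max(1e-3))
    } else {
        "-".to_string()
    }
}
```

A speedup above 1.00x means the factorized variant finished faster than the flat one; "-" means at least one variant failed or the flat timing was zero.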
--- a/validation-prototypes/factorized-batches/src/ops.rs
+++ b/validation-prototypes/factorized-batches/src/ops.rs
@ -0,0 +1,188 @@
+//! Per-operator probes.
+//!
+//! Each probe runs a tiny DataFusion pipeline once. We capture:
+//!   * accepts:     did planning + execution + collection complete without error?
+//!   * error:       stage-tagged message (setup/plan/execute/collect) on failure.
+//!   * time_ms:     wall-clock time from planning through collection.
+//!   * out_rows:    total rows emitted across all output batches.
+//!   * out_batches: number of output batches.
+//!   * out_bytes:   summed `get_array_memory_size` of the output columns
+//!                  (a stand-in for peak memory of the consumer side).
+
+use std::sync::Arc;
+use std::time::Instant;
+
+use anyhow::{Context, Result};
+use arrow_array::RecordBatch;
+use datafusion::datasource::MemTable;
+use datafusion::execution::SendableRecordBatchStream;
+use datafusion::prelude::*;
+use futures::stream::StreamExt;
+
+#[derive(Clone, Debug)]
+pub struct OpResult {
+    pub op_name: &'static str,
+    pub variant: &'static str, // "factorized" | "flat"
+    pub accepts: bool,
+    pub error: Option<String>,
+    pub time_ms: f64,
+    pub out_rows: usize,
+    pub out_batches: usize,
+    pub out_bytes: usize,
+}
+
+fn make_ctx(batch: RecordBatch, table_name: &str) -> Result<SessionContext> {
+    let ctx = SessionContext::new();
+    let schema = batch.schema();
+    let table = MemTable::try_new(schema, vec![vec![batch]])?;
+    ctx.register_table(table_name, Arc::new(table))?;
+    Ok(ctx)
+}
+
+fn batch_bytes(b: &RecordBatch) -> usize {
+    b.columns()
+        .iter()
+        .map(|c| c.get_array_memory_size())
+        .sum::<usize>()
+}
+
+async fn collect_stream(
+    stream: SendableRecordBatchStream,
+) -> Result<(Vec<RecordBatch>, usize, usize)> {
+    let mut batches = Vec::new();
+    let mut rows = 0usize;
+    let mut bytes = 0usize;
+    let mut s = stream;
+    while let Some(b) = s.next().await {
+        let b = b?;
+        rows += b.num_rows();
+        bytes += batch_bytes(&b);
+        batches.push(b);
+    }
+    Ok((batches, rows, bytes))
+}
+
+pub async fn run_sql(
+    op_name: &'static str,
+    variant: &'static str,
+    batch: RecordBatch,
+    table_name: &str,
+    sql: &str,
+) -> OpResult {
+    let mut result = OpResult {
+        op_name,
+        variant,
+        accepts: false,
+        error: None,
+        time_ms: 0.0,
+        out_rows: 0,
+        out_batches: 0,
+        out_bytes: 0,
+    };
+
+    let ctx = match make_ctx(batch, table_name) {
+        Ok(v) => v,
+        Err(e) => {
+            result.error = Some(format!("setup: {e:#}"));
+            return result;
+        }
+    };
+
+    let started = Instant::now();
+    let df = match ctx.sql(sql).await {
+        Ok(df) => df,
+        Err(e) => {
+            result.error = Some(format!("plan: {e:#}"));
+            result.time_ms = started.elapsed().as_secs_f64() * 1e3;
+            return result;
+        }
+    };
+    let stream = match df.execute_stream().await {
+        Ok(s) => s,
+        Err(e) => {
+            result.error = Some(format!("execute: {e:#}"));
+            result.time_ms = started.elapsed().as_secs_f64() * 1e3;
+            return result;
+        }
+    };
+    match collect_stream(stream).await {
+        Ok((batches, rows, bytes)) => {
+            result.accepts = true;
+            result.out_rows = rows;
+            result.out_batches = batches.len();
+            result.out_bytes = bytes;
+        }
+        Err(e) => {
+            result.error = Some(format!("collect: {e:#}"));
+        }
+    }
+    result.time_ms = started.elapsed().as_secs_f64() * 1e3;
+    result
+}
+
+pub fn filter_sql() -> &'static str {
+    "SELECT * FROM t WHERE src_id < 5000"
+}
+pub fn project_sql_factorized() -> &'static str {
+    "SELECT src_id, _neighbors FROM t"
+}
+pub fn project_sql_flat() -> &'static str {
+    "SELECT src_id, dst FROM t"
+}
+pub fn sort_sql_factorized() -> &'static str {
+    "SELECT src_id, _neighbors FROM t ORDER BY src_id DESC LIMIT 1000"
+}
+pub fn sort_sql_flat() -> &'static str {
+    "SELECT src_id, dst FROM t ORDER BY src_id DESC LIMIT 1000"
+}
+pub fn aggregate_sql_factorized() -> &'static str {
+    "SELECT substr(payload, 1, 4) AS bucket, count(*) AS n FROM t GROUP BY 1 ORDER BY 1"
+}
+pub fn aggregate_sql_flat() -> &'static str {
+    "SELECT substr(payload, 1, 4) AS bucket, count(*) AS n FROM t GROUP BY 1 ORDER BY 1"
+}
+pub fn aggregate_on_list_sql_factorized() -> &'static str {
+    "SELECT _neighbors, count(*) AS n FROM t GROUP BY _neighbors"
+}
+pub fn join_sql_factorized() -> &'static str {
+    "SELECT a.src_id, a._neighbors FROM t a JOIN t b ON a.src_id = b.src_id LIMIT 100"
+}
+pub fn join_on_list_sql_factorized() -> &'static str {
+    "SELECT count(*) FROM t a JOIN t b ON a._neighbors = b._neighbors"
+}
+pub fn join_sql_flat() -> &'static str {
+    "SELECT a.src_id, a.dst FROM t a JOIN t b ON a.src_id = b.src_id LIMIT 100"
+}
+
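+/// Flatten probe: expands `_neighbors` via `CROSS JOIN UNNEST`.
+/// Per the exp 1.1 writeup this form fails in DF 52.5; `run_sql` captures the
+/// failure in `OpResult::error` rather than propagating it.
+pub async fn probe_unnest_flatten(batch: RecordBatch, table_name: &str) -> OpResult {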
+    let sql = "SELECT src_id, n.* FROM t CROSS JOIN UNNEST(_neighbors) AS n(dst)";
+    run_sql("unnest_flatten", "factorized", batch, table_name, sql).await
+}
+
+pub async fn explain_factorized(batch: RecordBatch, table_name: &str, sql: &str) -> Result<String> {
+    let ctx = make_ctx(batch, table_name)?;
+    let plan = ctx
+        .sql(&format!("EXPLAIN {sql}"))
+        .await?
+        .collect()
+        .await
+        .context("explain collect")?;
+    let mut out = String::new();
+    for b in plan {
+        let cols = b.num_columns();
+        let rows = b.num_rows();
+        for r in 0..rows {
+            for c in 0..cols {
+                let arr = b.column(c);
+                let s = arrow_cast::display::array_value_to_string(arr, r).unwrap_or_default();
+                if !s.is_empty() {
+                    out.push_str(&s);
+                    out.push(' ');
+                }
+            }
+            out.push('\n');
+        }
+    }
+    Ok(out)
+}
+
+#[allow(dead_code)]
+pub fn batch_size(b: &RecordBatch) -> usize {
+    batch_bytes(b)
+}
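The capture pattern in `run_sql` (start the clock once, tag each failure with the stage that produced it, and always record elapsed time) can be shown without DataFusion. A minimal std-only sketch follows; `Outcome` and `run_probe` are hypothetical names, and the two closures stand in for the planning and execution stages, so this illustrates the pattern rather than the prototype's actual code.

```rust
use std::time::Instant;

/// Hypothetical, trimmed-down analogue of `OpResult`.
#[derive(Debug, Default)]
struct Outcome {
    accepts: bool,
    error: Option<String>,
    time_ms: f64,
    out_rows: usize,
}

/// Run a two-stage probe. `plan` and `execute` are stand-ins for SQL
/// planning and stream execution; each error is prefixed with its stage.
fn run_probe<T>(
    plan: impl FnOnce() -> Result<T, String>,
    execute: impl FnOnce(T) -> Result<usize, String>,
) -> Outcome {
    let mut out = Outcome::default();
    let started = Instant::now();
    let staged = plan()
        .map_err(|e| format!("plan: {e}"))
        .and_then(|p| execute(p).map_err(|e| format!("execute: {e}")));
    match staged {
        Ok(rows) => {
            out.accepts = true;
            out.out_rows = rows;
        }
        Err(e) => out.error = Some(e),
    }
    // Elapsed time is recorded on both the success and the failure path.
    out.time_ms = started.elapsed().as_secs_f64() * 1e3;
    out
}
```

Because the error is stored instead of returned, a caller can run every probe in a cell and still print a complete table, with failed probes showing `accepts == false` plus the stage that rejected the input.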