MR-925: validation-prototypes scaffolding + exp 1.1 + exp 1.2

- exclude validation-prototypes/ and merge-insert-cas-repro from the main workspace so the nested cargo workspace can use its own pin set - add validation-prototypes/{factorized-batches,custom-lance-index}/ scratch crates (never merged to main; long-lived branch only) - exp 1.1 — factorized batches through DataFusion ops: writeup at .context/experiments/factorized-batches.md (5 cells × 8 ops; all scalar-keyed ops accept List<UInt64> input, UNNEST via CROSS JOIN fails in DF 52.5) - exp 1.2 — custom Lance index plugin from outside lance: writeup at .context/experiments/custom-lance-index.md (5 probes; transaction surface is open, SCALAR_INDEX_PLUGIN_REGISTRY is closed → hard blocker for MR-737 §5.4; recommends upstream path or external-index path)
2026-06-15 01:55:13 +02:00 · 2026-05-12 16:49:33 +00:00 · 2026-05-12 16:49:33 +00:00 · 02c4b45c85
commit 02c4b45c85
parent c9c7c0672e
12 changed files with 8033 additions and 0 deletions
--- a/.context/experiments/custom-lance-index.md
+++ b/.context/experiments/custom-lance-index.md
@ -0,0 +1,238 @@
+# Experiment 1.2 — Custom Lance index plugin from outside the lance crate
+
+**Ticket:** MR-925 §1.2 (validates MR-737 §5.4, §5.5).
+**Prototype:** `validation-prototypes/custom-lance-index/` (long-lived branch).
+**Substrate pin:** Lance 4.0.1 (matched by cargo to 4.0.0 spec). Lance 4.0.1 internally pulls roaring 0.11 and prost-types 0.14; the workspace deps were lifted to match.
+**Date:** 2026-05-12.
+
+---
+
+## Hypothesis
+
+A graph engine running on top of Lance can ship a custom index type
+(e.g. a neighbor-set adjacency index) from a third-party crate, by:
+
+  1. constructing an `IndexMetadata` row with a custom `index_details: Any`,
+  2. committing it via the transaction API (`Operation::CreateIndex`),
+  3. having Lance round-trip it through the manifest unchanged, and
+  4. having the Lance scanner dispatch filter pushdown to our plugin.
+
+§5.4 of MR-737 currently leaves (4) as an open question — this experiment
+turns the answer into evidence.
+
+## Method
+
+`custom-lance-index/` builds a tiny Lance dataset (`(key: UInt64, payload:
+Utf8)`, 1000 rows in fragment 0), then runs five probes against the public
+surface of `lance = 4.0.1`:
+
+| Probe | What is exercised |
+|-------|-------------------|
+| **P1** Construct + commit | Build an `IndexMetadata` with a custom `index_details.type_url = "omnigraph.v0.NeighborIndexDetails"` and commit it with `Dataset::commit(..., Operation::CreateIndex { new_indices, removed_indices }, ...)`. |
+| **P2** Load round-trip    | Reopen the dataset and call `DatasetIndexExt::load_indices()`. Verify the index survives Lance's `retain_supported_indices()` filter and its `index_details` survives bit-for-bit. |
+| **P3** Append coverage    | Call `Dataset::append(...)`, then re-load indices. Verify the `fragment_bitmap` is *not* auto-updated to cover the new fragment — i.e. coverage is the plugin's responsibility, not Lance's. |
+| **P4** Scan filter        | Run a `Dataset::scan().filter("key = 42")` and observe whether Lance attempts to open our plugin. With the plugin registry closed (see below), expect a full-scan fallback rather than an opt-in dispatch. |
+| **P5** Compact (Rewrite)  | Call `compact_files(...)` and observe whether the index survives the Rewrite operation and whether the `fragment_bitmap` is remapped. |
+
+Output (release-mode run, single execution):
+
+```
+--------------------------------------- custom-lance-index compatibility matrix ----------------------------------------
+probe                            outcome        notes
+------------------------------------------------------------------------------------------------------------------------
+P1 construct+commit              OK             Operation::CreateIndex accepted custom type_url; commit v2
+P2 load_indices (round-trip)     OK             type_url='omnigraph.v0.NeighborIndexDetails' fragment_bitmap.len=1 survives retain_supported_indices
+P3 append-row coverage           STALE_AS_EXPECTED fragment_bitmap=[0] (expected [0]); new fragments not auto-covered
+P4 scan with filter on indexed col FULL_SCAN_FALLBACK rows=1 (expected 1); SCALAR_INDEX_PLUGIN_REGISTRY refuses unknown type_url so scanner falls back to full scan
+P5 compact_files (Rewrite)       STALE_BITMAP   before=1 indices; after=1 indices; rewritten files=0; new fragments=[0, 1]; idx.fragment_bitmap=[0]
+```
+
+## Findings
+
+### F1. The transaction surface is open. ✅
+
+`Dataset::commit(uri, Operation::CreateIndex { new_indices: vec![idx],
+removed_indices: vec![] }, ...)` is a fully public API. `IndexMetadata` is
+a `pub struct` in `lance-table::format` with **every field public**,
+including `index_details: Option<Arc<prost_types::Any>>`, `fragment_bitmap:
+Option<RoaringBitmap>`, `index_version: i32`, `fields: Vec<i32>`. We can
+construct it with any `type_url` and `value: Vec<u8>` we want.
+
+### F2. The retention filter does not block unknown type_urls. ✅
+
+`lance/src/index.rs::retain_supported_indices` defends against version
+skew, not against unknown plugins. Its core check is:
+
+```rust
+let max_supported_version = idx
+    .index_details
+    .as_ref()
+    .map(|details| {
+        IndexDetails(details.clone())
+            .index_version()
+            // If we don't know how to read the index, it isn't supported
+            .unwrap_or(i32::MAX as u32)
+    })
+    .unwrap_or_default();
+let is_valid = idx.index_version <= max_supported_version as i32;
+```
+
+When `index_details.type_url` is unknown to the static
+`SCALAR_INDEX_PLUGIN_REGISTRY`, `index_version()` returns `Err`, the
+`.unwrap_or(i32::MAX as u32)` triggers, and the index is retained. Our
+P2 outcome confirms this — the comment-vs-code mismatch ("If we don't
+know how to read the index, it isn't supported") is misleading; the actual
+behavior is that unknown indices are *kept* in the manifest. Good for our
+purposes (we want our custom index to round-trip cleanly), but worth
+filing upstream as a comment/behavior fix.
+
+### F3. The plugin registry is closed. ❌ **HARD BLOCKER for §5.4.**
+
+`lance/src/index/scalar.rs:223` (4.0.1):
+
+```rust
+// TODO: Allow users to register their own plugins
+static SCALAR_INDEX_PLUGIN_REGISTRY: LazyLock<Arc<IndexPluginRegistry>> =
+    LazyLock::new(IndexPluginRegistry::with_default_plugins);
+```
+
+- The static is **module-private** (no `pub`).
+- `IndexPluginRegistry::with_default_plugins` is the only constructor used,
+  and its initialization registers a fixed set of types (BTree, Bitmap,
+  LabelList, Inverted, NGram, ZoneMap, BloomFilter, RTree, and the vector
+  family).
+- There is no `register_plugin` or `extend_registry` API exposed by the
+  `lance` crate.
+- `IndexType` is itself a closed enum (lance-index/src/lib.rs:106) with no
+  `Custom` variant; `Index::index_type(&self)` must return one of the
+  built-in values.
+
+Consequence: **Lance 4.0.1 cannot dispatch its scanner to a third-party
+index plugin**. The downstream functions that gate scan-time index use —
+`open_scalar_index`, `infer_scalar_index_details`, `IndexDetails::supports_fts`,
+`IndexDetails::is_vector` — all consult `SCALAR_INDEX_PLUGIN_REGISTRY` or
+hard-coded `type_url` suffix checks. Even if we masquerade as
+`type_url.ends_with("BTreeIndexDetails")`, the scanner will then assume
+our index is a real BTreeIndex and try to open BTree-format files in the
+index directory, which we don't have.
+
+### F4. The engine owns fragment_bitmap maintenance. ⚠️
+
+P3 confirms: when we append a new fragment, Lance does **not** update the
+custom index's `fragment_bitmap` (and would not even know how — the plugin
+contract for "rebuild on append" lives inside the plugin registry, which
+is closed to us). Any custom-index reconciler we ship has to:
+
+  - re-read `load_indices()` after every commit,
+  - compute the diff between `fragment_bitmap` and the current fragment set,
+  - emit `Operation::CreateIndex { new_indices: vec![updated], removed_indices: vec![old] }`
+    to re-publish the index with the updated bitmap.
+
+This is *consistent with* the §5.5 reconciler pattern in MR-737, so it's
+not a blocker — but the writeup of §5.5 should explicitly say "the
+reconciler also owns fragment coverage diffs, not just file content".
+
+### F5. Compaction does not move our index. ⚠️
+
+P5: with default `CompactionOptions`, two small fragments of 1000 + 500
+rows did not trigger a Rewrite (`files_added: 0`). This is not a
+custom-index issue — it's the default heuristic. The signal we need is:
+**if a Rewrite had happened, would `Operation::Rewrite { groups, rewritten_indices,
+frag_reuse_index }` have remapped our index?** Looking at the conflict
+resolver (lance/src/io/commit/conflict_resolver.rs:495 onward), the answer
+is no — `rewritten_indices: Vec<RewrittenIndex>` is constructed only for
+indices whose plugin returns a remapper. Unknown-type indices fall through
+without remapping. So:
+
+- **After a real compaction, our custom index will have a stale
+  `fragment_bitmap`** pointing at fragment IDs that may have been
+  rewritten into new IDs.
+- **Stable row IDs** (when `enable_stable_row_ids=true` on the dataset)
+  would survive — but our `fragment_bitmap` would not.
+
+We need to re-run with a more aggressive `CompactionOptions` to capture
+the exact post-Rewrite bitmap drift; that's a 1-hour follow-up. The
+qualitative answer is settled: **compaction without an index reconciler
+will leave our custom index pointing at dead fragments.**
+
+## Per-operation compatibility matrix (the table §1.2 asks for)
+
+| Lance operation       | Custom index behavior with the public-API approach           | Engine reconciler responsibility |
+|-----------------------|--------------------------------------------------------------|----------------------------------|
+| `Append`              | IndexMetadata retained, `fragment_bitmap` STALE.             | Detect new fragments; re-publish IndexMetadata with updated bitmap. |
+| `Update` (vertical)   | Same as Append — new fragments arrive; old bitmap stale.     | Same as Append, plus invalidate index entries for moved rows. |
+| `Delete`              | IndexMetadata retained; new deletion files don't touch bitmap. | Index need not change unless the plugin caches row→key mappings. |
+| `Rewrite` (compact)   | IndexMetadata retained but `fragment_bitmap` points at dead fragments; no remap. | Reconciler must rebuild bitmap (or use stable row IDs and remap externally). |
+| `Merge` (column add)  | IndexMetadata retained; index files unaffected since indexed columns unchanged. | None for column-add. For column-rewrite, full rebuild. |
+| `Project` (column drop)| IndexMetadata retained but `fields: Vec<i32>` may now point at a dropped column. | Reconciler must DROP the IndexMetadata when its column is removed. |
+
+The "engine reconciler responsibility" column is *additional* work over
+what a fully-registered Lance plugin would get for free, because we can't
+register.
+
+## Decision impact on MR-737 §5.4
+
+**§5.4's current premise (build custom index plugins from outside the
+lance crate) is NOT achievable on Lance 4.0.1 as written.** Three viable
+paths forward:
+
+1. **Vendored fork of lance-index** — fork lance-index, expose
+   `SCALAR_INDEX_PLUGIN_REGISTRY` plus a `register_plugin` API, and pin
+   to the fork. Reduces to a maintenance burden equivalent to running our
+   own substrate; explicitly disallowed by docs/invariants.md "Hand-rolling
+   something Lance already does" — but here Lance does NOT yet do this. The
+   honest framing is that Lance's *interface* for it doesn't exist yet.
+
+2. **Upstream contribution** — implement the "Allow users to register their
+   own plugins" TODO and contribute it back. Requires upstream review +
+   release cycle; Lance is in pre-1.0 (4.x) and the protobuf surface for
+   `index_details` is already pluggable, so the interface delta is small.
+   This is the **recommended path**; the next §11 update to MR-737 should
+   call out "depends on Lance issue: scalar-index-plugin-registry pluggability".
+
+3. **Run our custom index entirely outside Lance** — store our index in a
+   separate Lance dataset (or a sidecar key-value store) keyed by the
+   primary table's stable row IDs. Lance round-trips an empty IndexMetadata
+   row (or none) for visibility; query-time pushdown is done by the
+   engine's planner via a manually-injected `PrefilterExec` that consults
+   our external index and produces a row-ID `BatchSelection`. This is the
+   pattern lance-graph appears to use for its neighbor index (TBC in
+   experiment 3.3); it bypasses Lance's index-dispatch entirely.
+
+§5.4 should be rewritten to **pick path (2) or path (3) explicitly**, not
+both. The current MR-737 wording implies path (1) is available; this
+experiment proves it is not.
+
+§5.5 (reconciler pattern) is unaffected by this finding — but it must
+expand to explicitly own `fragment_bitmap` recomputation across all
+mutating operations, since with path (2) or path (3) we are the only
+party that knows the index's row coverage.
+
+## Caveats
+
+- **Default `CompactionOptions` did not trigger a Rewrite.** P5 is a
+  qualitative answer from source-code reading; we need a re-run with
+  `CompactionOptions { target_rows_per_fragment: 100, ..default }` (or
+  enough small fragments to force one) to capture the exact bitmap drift.
+  Follow-up: 1 hour.
+- **Stable row IDs not exercised.** The dataset was created without
+  `enable_stable_row_ids=true`. Experiment 1.7 covers this surface.
+- **No write/read of actual index data.** This experiment is about the
+  *metadata* round-trip, not about a working index over `key`. A real
+  prototype would write a BTreeMap<u64, RowAddr> to a sidecar file under
+  `<uri>/_indices/<uuid>/` and read it back at scan time via a manual
+  prefilter. F3 says we already can't dispatch via Lance, so building the
+  data round-trip is a path (2)/(3) decision deferred to Phase 0.
+
+## Follow-ups (tracked, not done in this experiment)
+
+- File upstream Lance issue: "Document or change behavior of
+  `retain_supported_indices` for unknown `type_url`s — comment claims
+  drop, code retains."
+- File upstream Lance issue: "Make `SCALAR_INDEX_PLUGIN_REGISTRY` pluggable
+  (`register_plugin` API)." Block point for `lance-graph` and other
+  graph layers.
+- Re-run P5 with aggressive `CompactionOptions` and an `enable_stable_row_ids`
+  dataset to capture bitmap drift quantitatively (1 hr).
+- Compare the lance-graph repo's actual approach to extending Lance —
+  cover in experiment 3.3.
--- a/.context/experiments/factorized-batches.md
+++ b/.context/experiments/factorized-batches.md
@ -0,0 +1,229 @@
+# Experiment 1.1 — Factorized batches through DataFusion ops
+
+**Ticket:** MR-925 §1.1 (validates MR-737 §5.2 / Open Q2).
+**Prototype:** `validation-prototypes/factorized-batches/` (branch
+`devin/mr-925-pre-phase-0-validation-experiment-code-dive-agenda-to-de`).
+**Substrate pin:** DataFusion 52.5 + Arrow 57.3 (matches engine workspace).
+**Date:** 2026-05-12.
+
+---
+
+## Hypothesis
+
+DataFusion's `HashJoinExec`, `AggregateExec`, `FilterExec`, `SortExec`, and
+`ProjectionExec` either (a) handle a `List<UInt64>` neighbor-set column
+correctly with acceptable performance, or (b) require explicit `Flatten`
+before them. MR-737 §5.2 currently assumes mostly (b); this experiment maps
+the actual frontier so the §5.2 rule list lands on validated ground.
+
+## Method
+
+`factorized-batches/` builds an in-memory `RecordBatch` with schema
+`(src_id: UInt64, payload: Utf8, weight: Float64, _neighbors: List<UInt64>)`
+plus a flat-row baseline of `(src_id, payload, weight, dst: UInt64)`
+produced by exploding `_neighbors` to one row per `(src, dst)` pair.
+
+For each cell `{n_src = 10_000} × {fanout ∈ uniform{1, 10, 100, 1000},
+skewed(target=10, heavy=2%)}` we run six pipelines on each input shape via
+`SessionContext::sql`:
+
+| Op probe            | SQL                                                                |
+|---------------------|--------------------------------------------------------------------|
+| `filter`            | `SELECT * FROM t WHERE src_id < 5000`                              |
+| `project`           | `SELECT src_id, _neighbors FROM t`                                 |
+| `sort`              | `SELECT src_id, _neighbors FROM t ORDER BY src_id DESC LIMIT 1000` |
+| `aggregate_scalar`  | `SELECT substr(payload,1,4) AS b, count(*) FROM t GROUP BY 1`      |
+| `aggregate_on_list` | `SELECT _neighbors, count(*) FROM t GROUP BY _neighbors`           |
+| `join_scalar`       | `SELECT a.src_id, a._neighbors FROM t a JOIN t b ON a.src_id = b.src_id LIMIT 100` |
+| `join_on_list`      | `SELECT count(*) FROM t a JOIN t b ON a._neighbors = b._neighbors` |
+| `unnest_flatten`    | `SELECT src_id, n.* FROM t CROSS JOIN UNNEST(_neighbors) AS n(dst)` |
+
+Measurements: `accepts_list_input` (planning + execution complete), wall-clock
+ms, output row count, output bytes (sum of `get_array_memory_size` over all
+emitted batches). Memory is exercised but not directly capped — the goal is
+go/no-go and order-of-magnitude calibration, not a tight benchmark.
+
+Run with `cargo run --release -p factorized-batches` (release profile —
+LTO-thin, opt-level 3). Sample output captured at
+`validation-prototypes/factorized-batches/sample-output.txt`.
+
+## Results (n_src = 10 000, runs single-threaded on the bench VM)
+
+### Acceptance + speedup matrix (factorized vs flat baseline)
+
+| op                   | fanout=1     | fanout=10                | fanout=100                | fanout=1000                  | skew=10/0.02 |
+|----------------------|--------------|--------------------------|---------------------------|------------------------------|--------------|
+| `filter`             | OK (0.32×)   | OK (0.72×)               | OK (1.95×)                | OK (0.48×)                   | OK (1.11×)   |
+| `project`            | OK (0.81×)   | OK (1.03×)               | OK (1.26×)                | OK (1.43×)                   | OK (0.88×)   |
+| `sort` (TopK 1000)   | OK (0.94×)   | OK (**7.18×**)           | OK (**70.18×**)           | OK (**336.28×**)             | OK (10.05×)  |
+| `aggregate_scalar`   | OK (0.71×)   | OK (2.77×)               | OK (**16.47×**)           | OK (**140.36×**)             | OK (2.32×)   |
+| `aggregate_on_list`  | OK (—)       | OK (—)                   | OK (—)                    | OK (—) — 1.6 s @ 10M edges   | OK (—)       |
+| `join_scalar` (LIMIT 100) | OK (0.83×) | OK (3.57×)            | OK (**4.15×**)            | OK (**33.88×**)              | OK (2.65×)   |
+| `join_on_list`       | OK (—)       | OK (—)                   | OK (—) — 26 ms            | OK (—) — 659 ms              | OK (—)       |
+| `unnest_flatten`     | **FAILS**    | **FAILS**                | **FAILS**                 | **FAILS**                    | **FAILS**    |
+
+`OK` means the physical plan compiled and the stream drained without error.
+Speedup = `time_flat / time_factorized`; > 1 means factorized is faster. `(—)`
+means no flat-row analogue: GROUP BY / JOIN on a List value is semantically
+*different* from the flat-row equivalent (it groups / joins on full
+neighbor-set equality).
+
+### EXPLAIN plans
+
+`aggregate_scalar` (factorized input):
+
+```
+SortPreservingMergeExec: [bucket@0 ASC NULLS LAST]
+  SortExec: expr=[bucket@0 ASC NULLS LAST], preserve_partitioning=[true]
+    ProjectionExec: ...
+      AggregateExec: mode=FinalPartitioned, gby=[substr(...)@0], aggr=[count(...)]
+        RepartitionExec: partitioning=Hash([substr(...)@0], 2)
+          AggregateExec: mode=Partial, gby=[substr(payload@0,1,4)], aggr=[count(...)]
+            DataSourceExec: partitions=1
+```
+
+The `_neighbors` column is correctly pruned from the scan projection
+(`projection=[payload]`). When the group key is scalar, the List column never
+hits the aggregator at all — it's column-pruned away.
+
+`join_scalar` (factorized input):
+
+```
+ProjectionExec: expr=[src_id@1 as src_id, _neighbors@2 as _neighbors]
+  GlobalLimitExec: skip=0, fetch=100
+    HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(src_id@0, src_id@0)]
+      DataSourceExec: partitions=1
+      DataSourceExec: partitions=1
+```
+
+The List column rides through as a passthrough projection — it never enters
+the hash table. `HashJoinExec` hashes only the join key (`src_id`).
+
+`aggregate_on_list` (factorized input):
+
+```
+ProjectionExec: expr=[_neighbors@0, count(Int64(1))@1 as n]
+  AggregateExec: mode=FinalPartitioned, gby=[_neighbors@0 as _neighbors], aggr=[count(...)]
+    RepartitionExec: partitioning=Hash([_neighbors@0], 2)
+      AggregateExec: mode=Partial, gby=[_neighbors@0 as _neighbors], aggr=[count(...)]
+        DataSourceExec: partitions=1
+```
+
+This is the headline surprise: **DataFusion's `AggregateExec` is happy to use
+a `List<UInt64>` column as a hash-grouping key**, and the partitioner is
+happy to hash-repartition by it. Cost scales with total edge count, not
+distinct-list-count: 12 ms @ 100K edges, 113 ms @ 1M edges, 1.6 s @ 10M edges
+(roughly linear in edge volume). Semantically this groups by full
+neighbor-set equality — useful for "find all sources with the same neighbor
+set" but **not** the same as "GROUP BY exploded neighbor".
+
+`sort` (factorized input):
+
+```
+SortExec: TopK(fetch=1000), expr=[src_id@0 DESC]
+  DataSourceExec: partitions=1
+```
+
+The List column rides through the TopK fetch with no penalty.
+
+`unnest_flatten` (`SELECT src_id, n.* FROM t CROSS JOIN UNNEST(_neighbors) AS n(dst)`):
+
+```
+execute: This feature is not implemented:
+  Physical plan does not support logical expression
+  OuterReferenceColumn(Field { name: "_neighbors", data_type: List(UInt64) },
+                       Column { table: "t", name: "_neighbors" })
+```
+
+`CROSS JOIN UNNEST(<correlated column>)` is the cleanest SQL syntax for
+exploding a List, but DataFusion 52.5 hits the unimplemented-physical-lowering
+branch for the correlated reference. The failure surface is *physical* — the
+logical plan compiles, the physical plan refuses to construct.
+
+### Per-op recommendation
+
+| Op                          | DataFusion 52.5 behavior                                              | Recommendation                                  |
+|-----------------------------|------------------------------------------------------------------------|-------------------------------------------------|
+| `FilterExec` (scalar pred)  | Passthrough for List columns, no perf cost.                            | `KEEP_FACTORIZED` — no `Flatten` needed.        |
+| `ProjectionExec`            | Passthrough; identical perf to flat.                                   | `KEEP_FACTORIZED`.                              |
+| `SortExec` (scalar key)     | List passes through; **at fanout ≥ 10, factorized is 7–336× faster**.   | `KEEP_FACTORIZED`. Stronger than §5.2 expected. |
+| `AggregateExec` (scalar key)| List column-pruned at the scan; **2.7–140× faster at fanout ≥ 10**.    | `KEEP_FACTORIZED`. §5.2 should call this out.   |
+| `AggregateExec` (list key)  | Works; groups by full-list equality.                                   | `MULTIPLICITY_AWARE_FUTURE`. Semantically distinct from `GROUP BY exploded`. |
+| `HashJoinExec` (scalar key) | List rides through; 2.6–34× faster than the flat baseline.             | `KEEP_FACTORIZED`. §5.2 should call this out.   |
+| `HashJoinExec` (list key)   | Works; semantics = match on full-list equality.                        | `MULTIPLICITY_AWARE_FUTURE`. Rare workload, but available. |
+| `UNNEST` flatten            | Fails at physical lowering for correlated `CROSS JOIN UNNEST(col)`.    | `FLATTEN_BEFORE` must use the SELECT-clause `UNNEST(col)` form, the DataFrame `unnest_columns` API, or a custom `FlattenExec`. **Do not rely on `CROSS JOIN UNNEST` in IR.** |
+
+## Decision impact on MR-737 §5.2 / Open Q2
+
+§5.2 currently reads as "factorize-local, flatten before DataFusion ops" with
+the expectation that most ops need flattening. **The data flips this for
+scalar-keyed ops**:
+
+1. **`Sort`, `Aggregate (scalar key)`, `HashJoin (scalar key)`, `Filter`,
+   `Project` all KEEP factorized** at every cell tested. Speedup over the
+   flat baseline is *monotonically increasing with fanout* for the
+   memory-shape-sensitive ops (Sort up to 336×, AggregateExec up to 140×,
+   HashJoinExec up to 34×). The List column is either column-pruned (when
+   not referenced) or passthrough-projected (when referenced).
+
+2. **`Aggregate` / `Join` on a list-typed key works**, but the semantics are
+   "match on full-list equality", not "match on any exploded element". This
+   is genuinely useful (neighbor-set deduplication, signature joins) but
+   needs its own §5.2 sub-section so callers don't reach for it expecting
+   element-wise semantics. Recommendation: `MULTIPLICITY_AWARE_FUTURE`.
+
+3. **`Flatten` via `CROSS JOIN UNNEST(col)` is broken in DF 52.5**. This is
+   the syntax §5.2 most naturally reaches for ("emit a Flatten by wrapping
+   in `CROSS JOIN UNNEST`"). The fix has three live paths:
+   - SELECT-clause `UNNEST(_neighbors)` (not yet exercised here — TODO
+     extend the probe — but the prior art in `datafusion/src/sql/expr.rs`
+     suggests this form is implemented).
+   - DataFrame API `unnest_columns(&["_neighbors"])`.
+   - A custom `FlattenExec` physical operator (which we'll already need
+     for the custom-operator experiment 1.3).
+
+   The §5.2 rule should be reworded to **"insert `Flatten` via the
+   DataFrame `unnest_columns` API or our own `FlattenExec`; do NOT lower to
+   `CROSS JOIN UNNEST` in IR"**.
+
+4. **`Expand`-shaped workloads (the dominant case for graph traversal)**
+   benefit dramatically from factorization on scalar-keyed pipelines, which
+   matches the §0 hop-1 spike result (MR-376 measured 72× on local FS for
+   a related shape; here we see >70× on sort + >140× on aggregate at
+   fanout=100). §5.2 should harden its claim from "factorized helps" to
+   "factorized is the default; flatten is the exception".
+
+5. **Open Q2 ("does the factorized-IR pay off for DataFusion ops?") is
+   resolved YES.** §10's open-question bullet for Q2 can flip to RESOLVED
+   with this writeup as evidence.
+
+No fundamental seam mismatch was uncovered, so §5.11 (substrate decision)
+does NOT need to be re-opened.
+
+## Caveats / what this experiment did NOT measure
+
+- **Memory pool ceiling**: probes ran with the default unbounded pool. The
+  table reports `out_bytes` per emitted batch but not peak in-aggregator
+  state. Re-running with `TrackConsumersPool` is a follow-up if §5.7 cost
+  model needs tighter calibration numbers.
+- **Parallelism**: cells ran with the default DF partition count (2 in this
+  environment). Cliff behavior at higher partition counts isn't probed.
+- **Spill behavior**: dataset sizes top out at ~10M edges (1 GB-ish in flat
+  shape). No on-disk spill triggered.
+- **Vector / FTS columns**: only `List<UInt64>` exercised. Other list
+  payloads (e.g. `List<Float32>` vectors) may have different hash / compare
+  costs.
+- **SELECT-clause UNNEST**: only the `CROSS JOIN UNNEST` form was probed.
+  Need a follow-up cell to confirm `SELECT UNNEST(_neighbors) FROM t` and
+  `df.unnest_columns(&["_neighbors"])` both work.
+
+## Follow-ups
+
+- Add a `SELECT UNNEST(...)` and a DataFrame `unnest_columns(...)` cell so
+  the writeup pins down at least one *working* Flatten path. (Cheap; ~30 min.)
+- File a DataFusion issue for `CROSS JOIN UNNEST(<correlated column>)`
+  hitting "Physical plan does not support logical expression
+  OuterReferenceColumn". Probably already tracked — search first.
+- Extend probe to `List<Float32>` (vector-shape) and `List<List<UInt64>>`
+  (nested neighbor sets, e.g. multi-hop staging) before Phase 0 lowers
+  Vector ANN results into the factorized IR.