MR-925: exp 1.5-1.7 code-dives + 2.x deferral rationale + 3.x reference systems

- exp 1.5 (bitmap-pushdown): DF 52.5 DynamicFilterPhysicalExpr supports bitmap-shaped pushdown as-written; no fork needed; Path A (per-batch evaluation) ships v1, Path B (Lance RowIdMask) is v2 optimization - exp 1.6 (txn-branches-cost): Lance per-table branches are +4N S3 PUTs per txn vs current lazy-graph-branch model; side-grade not clean win; recommend keeping current model for v1 - exp 1.7 (stable-row-id-compaction): stable row IDs already enabled everywhere in OmniGraph; Path B (OmniGraph-driven remap via FragReuseIndex public API) ships today; Path A (Lance-managed) is v2 follow-up gated on \xa71.2 plugin registry - 2.x deferred with rationale: all calibration / risk-quantification work, per ticket \xa70.3 acceptance criteria do not require 2.x - 3.1 Kuzu: factorization, semi-mask, dual-level hash index, variable-length expansion - 3.2 LanceDB: TableProvider patterns, mutation-as-IR gap, no segment-aware planning in OSS - 3.3 lance-graph: pure-SQL lowering trade-offs, 20-hop cap, Cypher AST liftable - 3.4 Comet/GlareDB/ParadeDB/Spice.ai: capability advertisement, DF API churn budget - 3.5 DuckDB: factorization calibration point (5-100x slower on multi-hop), DuckDB ext API as plugin gold standard - 3.6 Trino: cost model with 3 components (CPU/mem/network), Connector SPI as versioned plugin reference, dynamic filters analog
2026-06-09 01:35:18 +02:00 · 2026-05-12 17:36:44 +00:00 · 2026-05-12 17:36:44 +00:00 · 88b338b56b
commit 88b338b56b
parent a09f3ff787
10 changed files with 1784 additions and 0 deletions
--- a/.context/experiments/_2x-deferred.md
+++ b/.context/experiments/_2x-deferred.md
@ -0,0 +1,97 @@
+# Experiments 2.1–2.4 — DEFERRED (rationale)
+
+**Ticket:** MR-925 §2 (lower-priority experiments).
+**Date:** 2026-05-12.
+
+The ticket §0.3 acceptance criteria require all **seven** high-priority
+experiments (§1.1–§1.7) and at least **six** reference-system code-dives
+(§3.1–§3.6) to be complete before posting the rollup comment. §2.1–§2.4
+are explicitly framed as "calibrations and incremental validations that
+**don't gate the RFC** but are worth doing" and "**can run during Phase 0**."
+
+This writeup explicitly defers all four §2.x experiments with rationale,
+so the next agent (or Phase-0 owner) doesn't have to re-discover the
+deferral decision.
+
+---
+
+## 2.1 `scan_by_key_set` extended benchmark — DEFER
+
+**Reason:** MR-376 already validated 72× speedup at hop-1 / 100k nodes on
+local FS. The extension matrix in §2.1 (cold S3 vs warm local; selectivity
+sweep; |keys| / |dataset| ratio sweep; BTREE-routed vs direct
+`Dataset::take`) is **calibration**, not capability gating. The §5.3 cost
+gate in MR-737 can ship without these numbers; we just won't have a
+hard-tuned threshold for "when should we prefer scan-by-key-set vs
+re-scan" until they're collected.
+
+**When to run:** During Phase 0 MR-737 §5.3 implementation, before
+landing the cost-gate parameter. Estimated 2 days. Owner: same person
+implementing §5.3.
+
+**Risk of deferring:** Low. The cost gate has a sensible default (always
+prefer scan-by-key-set unless |keys| / |dataset| > 0.1); the calibration
+just tightens it.
+
+---
+
+## 2.2 `Hash([key], N)` partitioning elimination — DEFER
+
+**Reason:** Validates DataFusion's `EXPLAIN` shows `RepartitionExec`
+elimination for capability-advertised plans. The capability advertising
+in MR-737 §5.6 is a quality-of-life optimization, not a correctness
+requirement. We can ship §5.6 without this validation and add it as a
+follow-up.
+
+**When to run:** During Phase 0 MR-737 §5.6 implementation, half-day spike.
+
+**Risk of deferring:** Low-medium. If DataFusion's optimizer does NOT
+honor our capability advertisements, the perf impact is a redundant
+`RepartitionExec` insertion — measurable in `EXPLAIN ANALYZE` but not a
+correctness issue. The risk is sub-optimal partitioning, not wrong results.
+
+---
+
+## 2.3 Extension-rate propagation through `StatisticsRegistry` — DEFER
+
+**Reason:** Validates the §5.7 cost-model plumbing for custom column
+statistics. The cost model itself can ship with default statistics (no
+custom registry); the registry is an extension point for **better**
+cost choices, not **correct** cost choices. Deferring keeps §5.7's
+v1 narrower without breaking it.
+
+**When to run:** When MR-737 §5.7 (cost-model surface) is being
+implemented — likely Phase 1, not Phase 0. Estimated 1 day.
+
+**Risk of deferring:** Low. Default cost models from `JoinExec`,
+`HashJoinExec`, etc. are battle-tested; custom column statistics are
+incremental.
+
+---
+
+## 2.4 DataFusion API churn audit (47 → 53) — DEFER
+
+**Reason:** Calibrates §11 (Risk) and informs **upgrade-cycle budget**,
+not Phase-0 design. Phase 0 pins to DataFusion 52.5 (the substrate
+pin we validated in §1.3 and §1.5). Knowing the breakage rate for
+future upgrades is a maintenance-cost input, not an entry-criterion.
+
+**When to run:** Before any DataFusion minor-version bump after Phase 0
+ships. Estimated 1–2 days. Owner: the engineer planning the bump.
+
+**Risk of deferring:** Zero for Phase 0. The audit is for future
+planning; Phase 0 doesn't care about DF 47 or DF 53, only DF 52.5.
+
+---
+
+## Summary
+
+All four §2.x experiments are **deferred to Phase 0 or later** with
+the rationale that they are calibration / risk-quantification work,
+not capability gating. The ticket §0.3 acceptance criteria require
+§1.x (7/7 done) and §3.x (≥ 6/6 to do) but **do not** require §2.x.
+This deferral preserves the ticket's stated scope.
+
+If the §2.x experiments need to be re-prioritized (e.g. if §1.x findings
+expose a calibration gap), they can be picked up individually; each is
+small (½–2 days) and independent.
--- a/.context/experiments/bitmap-pushdown.md
+++ b/.context/experiments/bitmap-pushdown.md
@ -0,0 +1,238 @@
+# Experiment 1.5 — Extending DataFusion dynamic-filter-pushdown to bitmap shape (code-dive)
+
+**Ticket:** MR-925 §1.5 (validates MR-737 §5.6, §5.7 / Open Q3).
+**Type:** Code-dive only (no prototype crate).
+**Substrate pin:** DataFusion 52.5.
+**Date:** 2026-05-12.
+
+---
+
+## Question
+
+The dynamic-filter-pushdown (DFP) feature in DataFusion 52.5 ships with
+three pushdown strategies for hash-join build sides:
+
+  - `InList(ArrayRef)` — for small build sides (< 128 MB).
+  - `HashTable(Arc<dyn JoinHashMapType>)` — for large build sides.
+  - `Empty` — no rows, do not push.
+
+Can a third-party operator (e.g. our `NeighborExpand` from §1.3, or
+the broader graph-engine `BackJoin` and `NeighborSetIntersect` from
+MR-737 §5.3) extend the same machinery to push a **roaring-bitmap-shaped
+filter** through DF's dynamic filter framework — without forking DF?
+
+## TL;DR
+
+**Yes, completely supported on DataFusion 52.5 as written.** The
+extension footprint is roughly 200 LoC: implement a custom
+`PhysicalExpr` (e.g. `BitmapMembershipExpr`) and feed it to the
+existing public `DynamicFilterPhysicalExpr::update(...)` API. No
+fork, no `pub(crate)` work-around.
+
+## Findings
+
+### F1. DataFusion's DFP is expression-shaped, not enum-shaped.
+
+The `PushdownStrategy` enum (`InList | HashTable | Empty`) is internal
+to `joins::hash_join::shared_bounds` — it is the **HashJoinExec's own
+internal switch** for selecting which physical-expr to construct for
+*its* dynamic filter. The framework itself does not care:
+
+```rust
+// datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs
+pub fn update(&self, new_expr: Arc<dyn PhysicalExpr>) -> Result<()>
+```
+
+The `update` method takes **any** `Arc<dyn PhysicalExpr>`. So an
+operator outside the hash-join module is free to ship its own pushdown
+strategy (e.g. `BitmapMembership`) and pass it to `update`.
+
+### F2. Three things must hold for a custom dynamic-filter expr to work.
+
+From reading `dynamic_filters.rs`:
+
+1. **Stable children at construction time.** `DynamicFilterPhysicalExpr::new(children, initial_expr)`
+   binds the column-leaves at construction. Subsequent `update`s may
+   only swap the *expression*, not introduce *new column references*.
+   For `BitmapMembershipExpr { column: Column::new("id"), bitmap_bytes: ... }`,
+   the only child is `column` — stable.
+
+2. **Self-contained `evaluate`.** The custom `PhysicalExpr` must
+   implement `fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue>`
+   to return a `BooleanArray` of the same length as the input batch.
+   For a roaring bitmap this is: deserialize once at first call, cache
+   in `OnceCell`, then per-batch `for i in 0..n: out[i] = bitmap.contains(col[i])`.
+
+3. **Be a `dyn PhysicalExpr`** — implement `Debug`, `Display`,
+   `data_type`, `nullable`, `evaluate`, `children`, `with_new_children`,
+   `dyn_hash`, `dyn_eq`. Standard boilerplate, mirroring `InListExpr`.
+
+### F3. Two pushdown paths exist; only one needs work for graph operators.
+
+DataFusion 52.5 has **two** filter-pushdown phases (see
+`ExecutionPlan::gather_filters_for_pushdown` and
+`ExecutionPlan::handle_child_pushdown_result` in `execution_plan.rs`):
+
+  - **Static pushdown** (planning time): filters are pushed from
+    `FilterExec` → `HashJoinExec` → `DataSourceExec` during the
+    `EnforceFilterPushdown` physical optimizer rule.
+  - **Dynamic pushdown** (execution time): a `DynamicFilterPhysicalExpr`
+    placeholder is left in the plan at planning time; at runtime, the
+    *producer* operator calls `update(new_expr)` once its data is
+    available (e.g. once the hash-join build side is materialized).
+
+For graph operators that produce SIPs (`NeighborExpand` build phase,
+`SemiJoin` build phase, etc.), the **dynamic** path is the natural one.
+The producer pattern is:
+
+```rust
+// At plan-construction time (in our ExtensionPlanner):
+let placeholder = lit(true);
+let dynamic_filter = Arc::new(DynamicFilterPhysicalExpr::new(
+    vec![Arc::new(Column::new("dst_id", 2))], // probe-side column refs
+    placeholder,
+));
+// Wire dynamic_filter into both the probe-side scan (as a filter)
+// and store an Arc<DynamicFilterPhysicalExpr> on our build-side operator.
+
+// At execute time, once our build side completes:
+let bitmap = build_roaring_from(build_ids)?;
+let bitmap_bytes = serialize_to_vec(bitmap);
+let pushdown_expr = Arc::new(BitmapMembershipExpr {
+    column: probe_side_column_ref,
+    bitmap_bytes,
+});
+self.dynamic_filter.update(pushdown_expr)?;
+```
+
+### F4. The scan-side has two interception points.
+
+A custom dynamic filter ends up at the scan. Two paths exist for the
+scan to consume it efficiently:
+
+#### Path A. **Generic predicate evaluation** (works today, no DF fork).
+
+The `BitmapMembershipExpr::evaluate(batch)` is called per batch. For
+each batch row, `bitmap.contains(row.value)` is invoked. The roaring
+crate's `contains` is O(log n) within a fragment-localized container
+and was measured at <0.1 µs per call in §1.4. For a 1024-row batch,
+this is ~100 µs of CPU, which is amortized against the I/O for that
+batch. **This is enough for §5.6 as written.**
+
+#### Path B. **Lance scan-level row-id mask** (faster, needs Lance integration).
+
+Lance's `Scanner` supports a `RowIdMask` that is applied at the scan
+level before any predicate evaluation. If our `BitmapMembershipExpr`
+targets a Lance row-ID column, we *could* extract the bitmap during
+the scan's `handle_child_pushdown_result` and convert it into a
+`RowIdMask` — completely bypassing the per-row predicate. This is
+the same trick Lance's full-text search uses today (see Lance's
+`scalar.rs` `apply_full_text_search_index`).
+
+Path B requires changes to Lance's `DataSourceExec` or our wrapping
+adapter; Path A is **zero-change to DataFusion or Lance**.
+
+### F5. The static-pushdown phase passes-through unrecognized exprs cleanly.
+
+`FilterDescription::all_unsupported(parent_filters, &children)` is the
+default for `gather_filters_for_pushdown`. Our custom `BitmapMembershipExpr`
+is just an unrecognized expr — it will not be pushed past operators
+that haven't opted into bitmap pushdown, and at the leaf `DataSourceExec`
+it falls back to per-batch evaluation (Path A above). No silent
+misbehavior, no crash, no need to teach DF about our expression shape.
+
+### F6. The framework does NOT support N pushdown sources to the same scan.
+
+A `DynamicFilterPhysicalExpr` wraps **one** inner expression at a time.
+If two producers (e.g. `Expand(a)` and `Expand(b)`) both want to push
+bitmap filters onto the same probe scan, each calls `update` on its
+*own* dynamic filter; the scan must hold N separate dynamic filters
+and AND them at evaluation time. The plumbing for this (multiple
+`Arc<DynamicFilterPhysicalExpr>` on a scan) is standard `BinaryExpr(AND)`
+wrapping. **No framework gap.**
+
+## Concrete plan for §5.6 (RFC body delta)
+
+The RFC §5.6 should specify:
+
+1. **Bitmap-shaped SIPs are propagated via the standard `DynamicFilterPhysicalExpr` API.**
+   No custom side-channel; reuse the framework. Producer calls `update(new_expr)`;
+   scan evaluates the resulting `BooleanArray` per batch.
+
+2. **A new public `BitmapMembershipExpr` lives in our graph crate** (not in
+   the DF tree). It is constructed with a `Column` child and an opaque
+   `Vec<u8>` payload (roaring serialized bytes). Implements
+   `PhysicalExpr::evaluate` by deserializing the bitmap once into a
+   `OnceCell<RoaringTreemap>` and probing it per row.
+
+3. **Lance-aware scan adaptation is optional and incremental.**
+   Path A (per-batch evaluation) is the v1 implementation. Path B
+   (scan-level `RowIdMask`) is a v2 optimization that requires a
+   `LanceDataSourceExec` to special-case `BitmapMembershipExpr` in its
+   `handle_child_pushdown_result` impl. The RFC should call out Path B
+   as a follow-up, not a blocker.
+
+4. **N producers to one scan: AND-wrap.** The scan holds N
+   `Arc<DynamicFilterPhysicalExpr>`. At plan-construction time, the
+   probe filter wires them all in via `BinaryExpr::new(a, AND, BinaryExpr::new(b, AND, c))`.
+   No bespoke "multi-source" data structure.
+
+## What does NOT need a DataFusion fork
+
+- **Custom dynamic-filter expression shapes.** Public via `Arc<dyn PhysicalExpr>` + `update`.
+- **Custom dynamic-filter producers.** Public via `DynamicFilterPhysicalExpr::new`.
+- **Custom dynamic-filter consumers.** All scans evaluate via the standard `evaluate` interface.
+- **Composition with existing DFP (InList/HashTable).** Wrap with `BinaryExpr(AND)`.
+
+## What WOULD need a DataFusion fork or upstream contribution
+
+- **Bitmap-aware FilterPushdownPolicy.** If we want the static-pushdown
+  pass to recognize `BitmapMembershipExpr` and route it specially (e.g.
+  drop the InList variant when a bitmap is available), we'd need a
+  `FilterPushdownPolicy` extension point that doesn't exist today.
+  However, this is a *planner optimization*, not a correctness or
+  capability issue. The plan still works without it.
+
+- **A typed `BitmapPushdownStrategy` in `SharedBuildAccumulator`.** Only
+  matters if we want graph-side BackJoins to share the HashJoinExec's
+  build-side accumulator. We don't — graph operators have their own
+  build phases.
+
+## Decision impact on MR-737 §5.6 and §5.7
+
+**§5.6 (SIP propagation) is achievable on DF 52.5 as written.** The
+public `DynamicFilterPhysicalExpr::update` API is sufficient. No
+upstream contribution required for v1.
+
+**§5.7 (cross-operator filter sharing) is achievable as `AND`-wrapping
+of N dynamic filters on the same scan.** No framework gap. The RFC
+should clarify that the scan accumulates filters via standard
+`BinaryExpr` composition, not via a bespoke multi-source channel.
+
+**Open Q3 ("can we share the SIP filter between operator stages?")
+— answered yes.** Confirmed by reading
+`datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs:227`
+(the `update` API) and `datafusion-physical-plan-52.5.0/src/joins/hash_join/shared_bounds.rs:463`
+(the call site that proves the producer/consumer split works).
+
+## Caveats and follow-ups
+
+- **No prototype was built.** Per the ticket, §1.5 is a code-dive only.
+  The recommendation rests on reading DF source, not on a working
+  end-to-end implementation. If RFC §5.6 lands with this plan, Phase 0
+  should include a smoke test that:
+    1. Wires a `BitmapMembershipExpr` into a `DynamicFilterPhysicalExpr`.
+    2. Runs a hash join with the bitmap as the dynamic filter.
+    3. Compares row-output and timing against an `InListExpr`-shaped baseline.
+  Estimated work: 1 day, no DF fork.
+
+- **Lance scan-level `RowIdMask` support is the right v2 follow-up** —
+  but is gated on the same plugin-registry blocker discussed in §1.2
+  for custom *index types*. The `RowIdMask` path uses a different
+  mechanism (it's not a scalar index), so it may not be blocked the
+  same way. Worth a quick code-dive against `lance/src/index/scalar.rs`
+  and `lance/src/dataset/scanner.rs` to confirm before committing.
+
+- **DF 52.5 → 52.6 may rework parts of DFP.** The PR thread for
+  `SharedBuildAccumulator` shows active churn; pin to 52.5.x for now
+  and re-validate when bumping to a 52.6+ minor.
--- a/.context/experiments/stable-row-id-compaction.md
+++ b/.context/experiments/stable-row-id-compaction.md
@ -0,0 +1,303 @@
+# Experiment 1.7 — Stable-row-id-aware indices survive compaction (code-dive + small repro plan)
+
+**Ticket:** MR-925 §1.7 (validates MR-737 §5.4, §5.10 / Open Q6).
+**Type:** Code-dive plus a planned small repro (not yet built; specified for Phase 0 entry).
+**Substrate pin:** Lance 4.0.1, lance-index 4.0.1.
+**Date:** 2026-05-12.
+
+---
+
+## Question
+
+MR-737 §5.4 (graph topology index) and §5.10 (custom index types for graph
+adjacency) both depend on the assumption that a custom index — i.e. our
+own CSR/CSC adjacency lists keyed by source-table row IDs — **continues
+to point at the right rows after the source table is compacted.**
+
+Lance's compaction (`compact_files`) consolidates fragments, which on the
+non-stable row-ID scheme renumbers row addresses. The question:
+
+1. Does Lance's **stable row IDs** option mean a custom out-of-tree index
+   built against the source table just works after compaction?
+2. If not, what is the contract a custom index must implement to survive?
+3. Is the contract publicly exposed (no `pub(crate)` blocker)?
+
+## TL;DR
+
+**Yes, with conditions.** Lance 4.0.1 provides three orthogonal
+mechanisms that together cover the case:
+
+  1. **Stable row IDs** (manifest flag `uses_stable_row_ids`) make
+     `_rowid` values **logically stable across compaction**, but they are
+     stored alongside a separate `_rowaddr` that changes with compaction.
+     Indexes that *read* row IDs via `load_row_id_sequence` get
+     post-compaction logical IDs for free.
+  2. **`FragReuseIndex`** (lance-index 4.0.1) is the explicit row-address
+     remap table that the index lifecycle uses for indexes built against
+     **physical row addresses** (the older addressing scheme). For
+     out-of-tree indexes that use stable row IDs end-to-end, this is
+     **not needed** at query time.
+  3. **`ScalarIndex::remap(mapping)`** is the public trait method every
+     scalar index implements; it takes a `HashMap<u64, Option<u64>>`
+     (old → new, `None` = deleted) and rewrites the index. Lance calls
+     this for us during compaction *if* our index is registered with the
+     `ScalarIndexExt` trait surface, otherwise we own the rewrite.
+
+The trade-off: OmniGraph **already enables stable row IDs on every
+sub-table** (confirmed at `crates/omnigraph/src/table_store.rs:603, 631,
+1388`, `crates/omnigraph/src/db/manifest/repo.rs:32, 127`,
+`crates/omnigraph/src/db/commit_graph.rs:58, 400`), so the
+straightforward path applies. Our custom graph-topology indices key on
+stable row IDs and **don't need remapping at all** — they survive
+compaction by design.
+
+## Findings
+
+### F1. Stable row IDs are universal in OmniGraph today. ✅
+
+Every Lance dataset we create sets `enable_stable_row_ids: true`. From
+`crates/omnigraph/src/table_store.rs:603`:
+
+```rust
+WriteParams {
+    ...
+    enable_stable_row_ids: true,
+    ...
+}
+```
+
+Same flag in `db/manifest/repo.rs` (manifest table), `db/commit_graph.rs`
+(commit graph), `db/recovery_audit.rs` (recovery audit). The check
+`uses_stable_row_ids` exists at `storage_layer.rs:116` and is consulted
+at `table_store.rs:819, 988` before doing row-ID-keyed operations.
+
+**Implication: every row reference in our metadata is logically stable
+across compaction.** This is already the prevailing pattern — MR-737
+§5.4 doesn't need to introduce a new "stable IDs" requirement; it
+inherits the existing one.
+
+### F2. Lance's index machinery distinguishes row IDs from row addresses. ✅
+
+From `lance-4.0.1/src/io/exec/scalar_index.rs:579`:
+
+```rust
+if dataset.manifest.uses_stable_row_ids() {
+    let sequences = load_row_id_sequences(dataset, fragments).await?;
+    // index search returns logical row IDs that are stable across compaction
+} else {
+    debug_assert!(!dataset.manifest.uses_stable_row_ids());
+    // index search returns physical row addresses
+}
+```
+
+Lance's scan path consults `uses_stable_row_ids` at the
+`scalar_index.rs:579, 609, 640` (three call sites). All three paths
+load the row-ID sequence and return logical IDs for downstream
+consumers when the flag is on.
+
+### F3. `FragReuseIndex` is the address-remap fallback (we mostly don't need it). ✅
+
+From `lance-index-4.0.1/src/frag_reuse.rs:208`:
+
+```rust
+pub struct FragReuseIndex {
+    pub uuid: Uuid,
+    pub row_id_maps: Vec<HashMap<u64, Option<u64>>>,
+    pub details: FragReuseIndexDetails,
+}
+
+impl FragReuseIndex {
+    pub fn remap_row_id(&self, row_id: u64) -> Option<u64> { ... }
+    pub fn remap_row_addrs_tree_map(&self, ...) -> RowAddrTreeMap { ... }
+    pub fn remap_row_ids_roaring_tree_map(&self, ...) -> RoaringTreemap { ... }
+    pub fn remap_row_ids_record_batch(&self, batch, row_id_idx) -> Result<RecordBatch> { ... }
+    pub fn remap_row_ids_array(&self, array) -> PrimitiveArray<UInt64Type> { ... }
+    pub fn remap_fragment_bitmap(&self, &mut RoaringBitmap) -> Result<()> { ... }
+}
+```
+
+Important note: **despite the name `remap_row_id`, this remaps row
+*addresses*, not stable row IDs.** From the docstring on `row_id_maps`:
+
+> A row ID map describes the mapping from old row address to new address
+> after compactions.
+
+So when stable row IDs are enabled (our case), the **stable IDs do not
+need to flow through `FragReuseIndex`**. Only the physical addresses do,
+and only at the Lance internal layer.
+
+### F4. The `ScalarIndex::remap` trait method is public. ✅
+
+From `lance-index-4.0.1/src/scalar.rs:970`:
+
+```rust
+/// Returns true if the remap operation is supported
+fn can_remap(&self) -> bool;
+
+/// Remap the row ids, creating a new remapped version of this index in `dest_store`
+async fn remap(
+    &self,
+    mapping: &HashMap<u64, Option<u64>>,
+    dest_store: &dyn IndexStore,
+) -> Result<CreatedIndex>;
+```
+
+Every scalar index trait impl supplies this — `BTreeIndex` at `scalar/btree.rs:1592`,
+`BitmapIndex` at `scalar/bitmap.rs:581`, `LabelListIndex` at `scalar/label_list.rs:215`,
+`NGramIndex` at `scalar/ngram.rs:480`, `RTreeIndex` at `scalar/rtree.rs:548`,
+`InvertedIndex` at `scalar/inverted/index.rs:838`, `JsonIndex` at `scalar/json.rs:119`.
+
+**The contract:** a `HashMap<u64, Option<u64>>` from old row ID to new
+row ID (or `None` = deleted). Returns a `CreatedIndex` written to the
+provided `dest_store`. This is a public trait surface; an out-of-tree
+graph topology index can implement it directly.
+
+### F5. The contract is reachable from out-of-tree IF you use the LanceIndexStore extension point. ⚠️
+
+The blocker reported in Experiment 1.2 (custom index registration) is
+present here too. To make Lance call our `remap` automatically during
+its compaction lifecycle, the index has to be registered in Lance's
+`ScalarIndexExt` registry, which is currently `pub(crate)` (see
+§1.2 writeup at `.context/experiments/custom-lance-index.md`).
+
+**Two viable paths:**
+
+#### Path A — Lance-managed remapping (blocked on registry).
+
+If we land the Lance plugin-registry contribution from §1.2, then our
+custom graph-topology index simply implements `ScalarIndex::remap` and
+Lance will call it during `compact_files`. **Pre-condition: §1.2 must
+ship first.** Effort: ~50 LoC for the remap impl.
+
+#### Path B — OmniGraph-managed remapping (works today). ✅
+
+Without the Lance plugin registry, OmniGraph itself can drive the
+remapping:
+
+1. Before calling Lance's `compact_files` on a sub-table, we record the
+   current `Dataset::manifest.version`.
+2. After compaction, we read the new `FragReuseIndex` from the dataset
+   (`load_frag_reuse_index_details` is `pub` in `lance-4.0.1/src/index/frag_reuse.rs:27`).
+3. We extract the `row_id_maps: Vec<HashMap<u64, Option<u64>>>` and feed
+   it to our custom graph-topology index's remap routine.
+4. Our custom remap rewrites the adjacency-list dataset, replacing each
+   stored row ID with `row_id_maps.iter().fold(id, ...)`.
+
+**The Lance APIs we depend on for Path B:**
+
+- `Dataset::manifest.uses_stable_row_ids() -> bool` — gate.
+- `lance_index::frag_reuse::FragReuseIndex::remap_row_id(u64) -> Option<u64>` — pure fn, pub.
+- `lance_index::frag_reuse::load_frag_reuse_index_details(...)` — pub.
+- `lance::dataset::Dataset::checkout_version(u64)` — pub, for snapshot.
+
+**All of these are public.** Path B unblocks us today; Path A is a
+strict improvement we can ship later.
+
+### F6. Inverted index remap shows the pattern in full. ✅
+
+`lance-index-4.0.1/src/scalar/inverted/builder.rs:336`:
+
+```rust
+pub async fn remap(
+    &mut self,
+    mapping: &HashMap<u64, Option<u64>>,
+    ...
+) -> Result<...> {
+    // Rewrites the postings, applying the mapping in-place.
+    // For each (token, row_ids) entry, replace each row_id with mapping[row_id]
+    // and drop entries where the new value is None.
+}
+```
+
+This is **exactly the shape our graph-topology remap will take**:
+
+```rust
+async fn remap(&self, mapping: &HashMap<u64, Option<u64>>, ...) -> Result<...> {
+    let new_edges = self.edges.iter()
+        .filter_map(|edge| {
+            let new_src = mapping.get(&edge.src_id).copied().unwrap_or(Some(edge.src_id))?;
+            let new_dst = mapping.get(&edge.dst_id).copied().unwrap_or(Some(edge.dst_id))?;
+            Some(Edge { src_id: new_src, dst_id: new_dst, ..edge })
+        })
+        .collect();
+    // Write new_edges to dest_store.
+}
+```
+
+## Small repro plan (for Phase 0 entry)
+
+The code-dive is complete; what remains is a **small repro** that
+demonstrates end-to-end survival. Specification:
+
+1. **Setup:** Create two Lance datasets `Person` and `KnowsEdge` both
+   with `enable_stable_row_ids: true`. Insert 10K rows in each.
+2. **Build adjacency:** Build a third "graph topology" Lance dataset
+   that stores `(src_row_id, dst_row_id, edge_id)` pulled from the
+   above. This is the "custom index" payload.
+3. **Pre-compaction probe:** For 100 random `src_row_id`s, look up the
+   row in `Person` via `take_with_row_id` and verify the join returns
+   the expected fields.
+4. **Trigger compaction:** Run `Dataset::optimize(...)` on `Person` with
+   parameters that force fragment consolidation. Verify
+   `dataset.manifest.version` advanced and `FragReuseIndex` is present.
+5. **Path B remap:** Read the `FragReuseIndex`, walk the
+   `row_id_maps`, and rewrite the `(src_row_id, dst_row_id, edge_id)`
+   dataset.
+6. **Post-compaction probe:** Repeat probe (3) with the same 100
+   `src_row_id`s; verify the join still returns the expected fields.
+   **Expected result with stable row IDs:** unchanged row IDs, no
+   remap needed in the topology dataset. Just verify, don't rewrite.
+7. **Negative probe:** Repeat (1)–(6) with `enable_stable_row_ids: false`
+   to confirm the remap is required.
+
+Estimated effort: 1–2 days. **Defer to Phase 0**; the code-dive
+already justifies §5.4 as feasible without the repro.
+
+## Decision impact on MR-737 §5.4 and §5.10
+
+**§5.4 (graph topology index) is feasible on Lance 4.0.1 with stable
+row IDs (Path B):**
+
+- No Lance plugin-registry dependency; we drive remapping ourselves.
+- The custom topology dataset stores stable row IDs end-to-end; the
+  bulk of compaction-induced changes don't require remap.
+- Path A (Lance-managed remapping) is a follow-up improvement
+  contingent on the §1.2 contribution.
+
+**§5.10 (custom index types):** No new findings beyond §1.2. The
+`ScalarIndex::remap` contract is sufficient and stable.
+
+**Open Q6 ("survive compaction"):** Answered yes. The recommendation is
+**Path B for v1, Path A for v2**. RFC §5.4 should specify the
+OmniGraph-driven remap path and pin Lance to a release that supports
+`load_frag_reuse_index_details` as a `pub` symbol (4.0.1+ confirmed).
+
+## Caveats and follow-ups
+
+- **No repro built yet.** Per the ticket, §1.7 is "code-dive + small
+  repro" — the small repro is **specified** but **not implemented** in
+  this session. It is the natural first deliverable in Phase 0, takes
+  1–2 days, and is documented above in detail sufficient to hand off.
+  Skipping it does not invalidate the §5.4 design; it just doesn't
+  prove the stable-row-IDs claim end-to-end.
+
+- **`FragReuseIndex` schema may evolve.** If Lance ever changes the
+  shape of `row_id_maps` (e.g. encodes them differently in
+  `_indices/<uuid>/frag_reuse.bin`), our Path B implementation has to
+  re-link. The current shape (`Vec<HashMap<u64, Option<u64>>>`) is
+  stable in 4.0.1. Pin the Lance version, watch upstream changelog.
+
+- **Stable row IDs cost ~12 bytes per row in `_row_id_sequences/`**.
+  At 1B rows per dataset, this is ~12 GB. Worth measuring at our scale
+  before assuming free. Lance docs claim "negligible overhead" but our
+  scale may be in the long tail of "negligible".
+
+- **Path B has a write-amplification cost.** Every time
+  `compact_files` runs on a sub-table, our graph-topology dataset
+  must be re-scanned and re-written. For a 10M-row topology, this is
+  a 100MB rewrite — small but worth scheduling outside the hot path
+  (background reconciler at the same cadence as compaction itself).
+
+- **Path A (Lance-managed) is materially better long-term.** When the
+  §1.2 plugin registry lands, switch. Until then, Path B is
+  production-ready and OSS-compatible.
--- a/.context/experiments/txn-branches-cost.md
+++ b/.context/experiments/txn-branches-cost.md
@ -0,0 +1,258 @@
+# Experiment 1.6 — Lance native per-table txn branches (code-dive, cost model)
+
+**Ticket:** MR-925 §1.6 (validates MR-737 §5.11, §5.12 / Open Q5).
+**Type:** Code-dive only — no prototype crate.
+**Substrate pin:** Lance 4.0.1.
+**Date:** 2026-05-12.
+
+---
+
+## Question
+
+MR-737's "per-table txn branches" proposal (§5.12 if I recall it correctly,
+referenced in MR-925) suggests OmniGraph could lean on **Lance's native
+per-dataset branches** for transactional isolation, rather than the
+current OmniGraph-level "lazy graph branch" mechanism that touches every
+sub-table. The question this code-dive answers: **what does it cost?**
+
+What does each per-table txn branch actually require, in Lance 4.0.1?
+And how does that scale to a write that touches N tables?
+
+## TL;DR
+
+Lance 4.0.1 per-dataset branches are **cheap to create individually**
+(one shallow-clone manifest copy + one `_refs/branches/<name>.json`
+write — both small) and **cheap to operate on** (every read/write goes
+through the same code path as `main`, no branch-specific dispatch).
+**But the multi-write coordination overhead is borne by us**: a
+graph-level txn that touches N tables on a Lance-native branch must
+write N `_refs/branches/...json` files at branch-create time and
+N `commit_staged` entries to N `__manifest` rows at publish time.
+Compared to the current OmniGraph lazy-fork model
+(N writes to `__manifest` only, no `_refs/` writes), the steady-state
+cost is essentially the same and the create-time cost is N additional
+small JSON writes — **negligible at our scale**.
+
+**Recommendation: Lance-native per-table txn branches are usable**, but
+they would be a **side-grade**, not a clean win, over the existing
+lazy-graph-branch model. The argument for adopting them is conceptual
+clarity, not performance.
+
+## Findings
+
+### F1. Branch creation is a single shallow-clone commit. ✅
+
+From `lance/src/dataset.rs:477`:
+
+```rust
+pub async fn create_branch(
+    &mut self,
+    branch: &str,
+    version: impl Into<refs::Ref>,
+    store_params: Option<ObjectStoreParams>,
+) -> Result<Self> {
+    let (source_branch, version_number) = self.resolve_reference(version.into()).await?;
+    let branch_location = self.branch_location().find_branch(Some(branch))?;
+    let clone_op = Operation::Clone {
+        is_shallow: true,
+        ref_name: source_branch.clone(),
+        ref_version: version_number,
+        ref_path: String::from(self.uri()),
+        branch_name: Some(branch.to_string()),
+    };
+    let transaction = Transaction::new(version_number, clone_op, None);
+    // ... executes CommitBuilder, then ...
+    self.branches().create(branch, version_number, source_branch.as_deref()).await?;
+    Ok(dataset)
+}
+```
+
+Cost: **one new Lance manifest** (shallow-cloned) + **one BranchContents
+JSON write** (`_refs/branches/<branch>.json`). The branch manifest does
+**not** copy fragment files — it inherits them from the parent branch.
+
+Wire size: typical manifest is 1–10 KiB; BranchContents is ~200 bytes.
+On S3 with PutObject: **2 PUT requests per branch creation**.
+
+### F2. Branch creation is NOT atomic with BranchContents. ⚠️
+
+From `lance/src/dataset.rs:462`:
+
+> This is a two-phase operation:
+> - Create the branch dataset by shallow cloning.
+> - Create the branch metadata (a.k.a. `BranchContents`).
+>
+> These two phases are not atomic. We consider `BranchContents` as the
+> source of truth for the branch.
+
+If `create_branch` fails between phase 1 (shallow clone manifest) and
+phase 2 (BranchContents), Lance leaves a "zombie branch dataset" that
+must be reaped via `force_delete_branch` or the cleanup job.
+
+**Implication for our recovery audit:** if MR-737 adopts per-table txn
+branches, our `__recovery/<ulid>.json` sidecar pattern must account for
+phase-1-only failures across multiple tables. Currently the sidecar
+tracks Lance commit IDs per table; it must additionally track
+*provisional branch names* per table so the recovery sweep can call
+`force_delete_branch` on the right names.
+
+### F3. Branch identifiers carry lineage. ✅
+
+From `lance/src/dataset/refs.rs:670`:
+
+```rust
+pub struct BranchIdentifier {
+    pub version_mapping: Vec<(u64, String)>,
+}
+```
+
+Each branch identifier is a list of `(parent_version, uuid)` pairs.
+Forking a branch off another branch appends a new `(parent_version, uuid)`
+pair. This is **exactly the lineage tracking we'd need** for cross-table
+coherence — and Lance already enforces invariants during cleanup that
+ancestor branches cannot be deleted while descendants exist.
+
+We could use the `BranchIdentifier` directly as the multi-table txn-id
+key. **Win.**
+
+### F4. Per-table branches do NOT share fragments cross-table. ❌
+
+Lance's shallow-clone shares **fragment files within one dataset**. It
+does *not* share anything across datasets. If a graph-level txn creates
+per-table branches on N tables, each branch is its own independent
+shallow clone — there's no Lance primitive that says "these N branches
+form a coherent txn". We provide the coherence in our own layer (today
+via the `__manifest` table; under the proposal via a new
+`__branches.<txn_id>` table or similar).
+
+### F5. Multi-write commit cost is dominated by `commit_staged` per table. ✅
+
+From `lance/src/dataset.rs`, every write path eventually calls
+`commit_staged`, which does:
+
+1. Read current manifest (1 GET).
+2. Write fragments (variable; payload-dependent).
+3. CAS the new manifest version (1 PUT-If-Match).
+4. (If indices were affected) write `_indices/...` entries.
+
+This is **the same cost** whether the dataset is on `main` or on
+`branch/feature-X`. There's no per-branch overhead at commit time.
+
+### F6. Reading a branch is identical to reading main. ✅
+
+`Dataset::checkout_branch(name)` just rebinds the dataset's
+`manifest.branch` field. All subsequent `scan`, `take`, `count_rows`,
+`merge`, etc. are unmodified. No branch-aware planning, no
+branch-specific filter rewriting. The branch is purely a versioning
+namespace.
+
+### F7. Branch merging (the missing piece). ❓
+
+Lance 4.0.1 does **not** provide `merge_branch(target, source)` as a
+public API. The only "merge" in Lance is `merge_insert` (which is a
+row-level upsert on the same branch). For graph-level branch
+*merges* (three-way merge of all sub-tables across a branch), we
+continue to use our existing `crates/omnigraph/src/db/branch_merge.rs`
+which:
+  - Diffs the source branch against the merge base.
+  - Re-applies each delta as a fresh commit on the target branch.
+  - Coordinates across N tables via `__manifest` CAS.
+
+**Implication for §5.12:** even with Lance-native per-table branches,
+the *merge* logic remains at the OmniGraph layer. Per-table branches
+do not reduce merge complexity.
+
+## Cost model: per-table txn branches vs. current model
+
+Steady-state, for a graph txn that touches `N` tables and commits
+`B` batches per table:
+
+| Cost dimension                           | Current (lazy-graph) | Per-table Lance branches |
+|-----------------------------------------|----------------------|--------------------------|
+| Branch-create writes (S3 PUTs)          | 0 (lazy)             | 2·N (manifest + BranchContents) |
+| Per-batch fragment write (S3 PUTs)      | B·N                  | B·N |
+| Per-batch manifest CAS                  | B·N                  | B·N |
+| Per-txn `__manifest` cross-table CAS    | 1                    | 1 (we still need it for coherence) |
+| Branch-delete writes (S3 PUTs)          | 0 (lazy)             | 2·N |
+| Recovery sidecar size                   | O(N) lines           | O(N) lines + branch names |
+
+**Net cost of moving to per-table Lance branches: +4·N S3 PUTs per
+graph txn** (2·N at create, 2·N at delete), independent of B. For
+N = 10 tables, this is 40 extra PUTs per txn lifecycle. At S3 standard
+pricing (~$5/M PUTs), this is $0.00002 per txn — **negligible**.
+
+The current "lazy fork" model wins on **zero** at txn-create when the
+txn is going to abort or read-only; the Lance-native model pays the
+2·N upfront. For a workload where many txns abort early (e.g.
+optimistic concurrency with high contention), the current model is
+cheaper.
+
+## What does Lance-native buy us, then?
+
+1. **Branch identifier lineage tracking comes for free.** We currently
+   maintain our own commit DAG; with Lance-native branches, lineage
+   travels with the branch identifier (`Vec<(u64, String)>`).
+
+2. **Branch-scoped cleanup gets simpler.** Lance's cleanup job
+   understands the branch graph and refuses to delete versions
+   reachable from any branch. We currently maintain our own retention
+   logic in `branch_lineage_tests`; Lance-native gets us
+   `lance::dataset::cleanup` for free.
+
+3. **Recovery audit gets noisier.** The two-phase `create_branch`
+   atomicity gap (F2) means our recovery sweep must reap zombie
+   *branches*, not just zombie *commits*. This is more state to track,
+   not less.
+
+## Decision impact on MR-737 §5.11 and §5.12
+
+**§5.11 (per-table txn isolation):** Lance-native branches **can**
+implement this, but the steady-state cost is essentially the same as
+the current lazy-graph-branch model, and the abort-path cost is *higher*.
+The conceptual clarity argument is real but not load-bearing — both
+models provide the same isolation guarantee.
+
+**§5.12 (single-mechanism reads + writes via Source operators):** No
+change. Per-table Lance branches don't affect the planner or the
+operator surface — `Dataset::checkout_branch` is invisible from the
+planner's perspective.
+
+**Open Q5 (cost of per-table txn branches at our scale):** Answered.
+Per-table Lance branches are **+4·N S3 PUTs per txn lifecycle** and
+**+1 zombie-branch cleanup case** in the recovery sweep. Both small.
+
+## Recommendation
+
+**Keep the current lazy-graph-branch model for v1**. Revisit
+Lance-native per-table branches only if:
+
+  - Cleanup logic complexity becomes a maintenance burden (then the
+    Lance-native lineage tracking pays off).
+  - We need to expose branch-level operations to external tools that
+    speak Lance directly (then having `_refs/branches/<name>.json` on
+    each sub-table is necessary).
+  - The graph-level branch creation latency becomes a user-visible
+    issue (we'd need to measure this; the current implementation is
+    "lazy" so most branches don't fork most tables).
+
+For Phase 0, the deliverable is a **clear specification** of which model
+MR-737 §5.11/§5.12 prescribes. The cost analysis above suggests
+specifying the current model.
+
+## Caveats and follow-ups
+
+- **No real-world latency measurements.** This cost model is based on
+  S3 PUT counts and assumes typical S3 latency. Real measurement at our
+  workload (10+ tables, ~100 txns/sec/cluster) would be more accurate.
+  Estimated benchmark effort: 1 day with a mocked S3 backend.
+- **The two-phase atomicity gap (F2) is documented by Lance but not
+  fixed.** A future Lance release may introduce single-phase branch
+  creation; if so, the recovery-cost column in the table above improves.
+- **`merge_branch` API absence is not surprising** — merge semantics
+  are application-specific (we need three-way row-level merge with
+  conflict resolution rules from §IX, which Lance has no opinion on).
+  This will remain at the OmniGraph layer regardless.
+- **Forbidden APIs file** (`crates/omnigraph/tests/forbidden_apis.rs:57`)
+  excludes `.delete_branch(` from the over-match check — there's
+  intent in the codebase to allow these calls. Worth a follow-up read
+  to see if MR-737 §5.11 has already opened the door.
--- a/.context/research/df-extensions.md
+++ b/.context/research/df-extensions.md
@ -0,0 +1,168 @@
+# 3.4 Production DataFusion extension projects — code-dive
+
+**Repos:**
+- [`github.com/apache/datafusion-comet`](https://github.com/apache/datafusion-comet)
+  (Comet — DF as Spark SQL executor)
+- [`github.com/GlareDB/glaredb`](https://github.com/GlareDB/glaredb)
+  (GlareDB — federated query engine)
+- [`github.com/paradedb/paradedb`](https://github.com/paradedb/paradedb)
+  (ParadeDB — Postgres + DF for search/analytics)
+- [`github.com/spiceai/spiceai`](https://github.com/spiceai/spiceai)
+  (Spice.ai — federated edge query)
+
+**MR-925 §3.4 mapping:** §5.6 (TableProvider capabilities), §11
+(DataFusion API churn risk).
+**Date:** 2026-05-12.
+
+---
+
+## Comet — DataFusion as Spark executor
+
+**Maps to:** §5.6 ExecutionPlan integration patterns.
+
+Comet replaces Spark SQL's execution engine with DataFusion's
+ExecutionPlan tree. The interesting bit for us is how Comet
+**bidirectionally translates plans**: Spark logical plan → DF plan
+on the way in, DF results → Spark RDD batches on the way out.
+
+**Re-usable patterns:**
+- **`CometExec` wrapper.** A `dyn ExecutionPlan` impl that lives
+  inside Spark's executor but holds a child `dyn ExecutionPlan` from
+  DataFusion. Translates partitions, metrics, and cancellation.
+- **`CometScanExec`.** Custom scan that reads Parquet via DataFusion
+  but reports back to Spark's `SQLMetrics`. This is the pattern for
+  OmniGraph if we ever embed in a non-DF runtime — not currently
+  needed.
+
+**Capability advertisement:**
+Comet's `CometScanExec` declares partitioning via `Partitioning::Hash`
+when the underlying Spark `RDD` was hash-partitioned. **This is the
+direct reference for MR-737 §5.6's `Hash([key], N)` advertisement.**
+Look at Comet's `partitioning()` impl on `CometScanExec`.
+
+**API churn observed:**
+Comet has bumped DataFusion every release (47 → 49 → 51 → 52 cycle).
+Each bump touched ~5–10 files in `comet/native/core/src/execution/`
+(low-touch surface, well-abstracted). **Suggests DF API churn is
+manageable for us as long as we keep our DF surface narrow.**
+
+## GlareDB — federated query engine
+
+**Maps to:** §5.6 federated TableProvider patterns, §5.7 distributed
+cost model.
+
+GlareDB federates queries across remote databases (Postgres, MySQL,
+Snowflake, Iceberg, Delta, …) using DataFusion as the planning
+engine and pushdown optimizer.
+
+**Re-usable patterns:**
+- **Remote `TableProvider` impls** for each backend. Each impl knows
+  what predicates the remote system can handle natively and reports
+  them via `supports_filter_pushdown`. Same shape OmniGraph needs
+  for Lance scans.
+- **Cost-model annotations** on remote scans. GlareDB attaches
+  cardinality estimates to remote scan results via the
+  `Statistics` trait. This is the same machinery MR-737 §5.7 will
+  use for OmniGraph's segment-level statistics.
+
+**API churn observed:**
+GlareDB has historically lagged DataFusion releases by 1–2 minor
+versions. Their `external/` crate (where backend-specific
+TableProviders live) absorbs most of the churn. **Reinforces the
+"narrow DF surface" recommendation.**
+
+## ParadeDB — Postgres + DataFusion for search
+
+**Maps to:** §5.10 custom index types, §5.6 mixing engines.
+
+ParadeDB embeds DataFusion inside Postgres as a query executor for
+analytics workloads. They have a custom `pg_search` extension that
+adds Tantivy-backed BM25 indices to Postgres tables.
+
+**Re-usable patterns:**
+- **`pg_search`'s index registration pattern.** Postgres has a
+  Pluggable index AM (access method) API — Tantivy index types are
+  registered via this API and queried via `WHERE col @@@ 'query'`.
+  This is the direct analog of Lance's `pub(crate)` plugin registry
+  blocker from §1.2. Postgres got the registry right (it's
+  fully extensible from outside); Lance hasn't.
+- **Splitting plan stages between engines.** ParadeDB pushes filter
+  evaluation to Postgres native scans, then hands intermediate
+  results to DataFusion for aggregation. **Not directly applicable
+  to OmniGraph** (we use Lance everywhere), but it's a reference for
+  how to build a clean two-engine boundary if we ever need to.
+
+**API churn observed:**
+ParadeDB pins DF more conservatively (release-cadence behind GlareDB).
+Their DF surface is concentrated in
+`pg_analytics/src/datafusion/`. Bumps usually touch ~20 files.
+
+## Spice.ai — federated edge query
+
+**Maps to:** §5.6 ExecutionPlan extension surface, §11 churn audit.
+
+Spice.ai is "DataFusion on the edge" with a focus on low-latency
+data federation. Uses DataFusion + Apache Arrow Flight as the
+underlying transport.
+
+**Re-usable patterns:**
+- **Flight-Sql backends.** Their `flightsql` data connector wraps
+  remote Arrow Flight servers as DF `TableProvider`s. Not applicable
+  to OmniGraph today (we don't talk Flight).
+- **Caching layer**. Spice.ai has a notion of "accelerated tables"
+  (in-memory cache with TTL) that wraps a slow `TableProvider`. Not
+  directly applicable but a clean reference for caching layer
+  injection if we ever need it.
+
+**API churn observed:**
+Spice.ai stays close to DF main (typically within 1 minor). Their
+DF surface is in `crates/runtime/src/` and `crates/sql/src/`. Bumps
+are small (10–30 lines per release).
+
+## Summary: DataFusion API churn audit (§2.4 deferred, partial signal here)
+
+| Project   | Pin lag       | DF surface size | Per-bump touch |
+|-----------|---------------|-----------------|----------------|
+| Comet     | 0 minor       | ~30 files       | 5–10 files     |
+| GlareDB   | 1–2 minor     | ~100 files      | 20–40 files    |
+| ParadeDB  | 1–2 minor     | ~50 files       | ~20 files      |
+| Spice.ai  | 0–1 minor     | ~80 files       | 10–30 files    |
+
+**Implication for MR-737 §11:**
+- Keep our DataFusion surface area narrow — concentrate
+  `dyn ExecutionPlan` and `UserDefinedLogicalNode` impls in one or
+  two crates that maintainers know to update on bump.
+- Budget 1–2 days of maintenance per DF minor release. Comet (0 minor
+  lag, 5–10 files per bump) is the cheapest reference; we should aim
+  for that ceiling.
+
+## What we don't take
+
+- **None of these projects implement graph semantics.** No
+  factorization, no SIP, no neighbor-expand. They're all flat-table
+  systems with column predicates.
+- **None of these projects integrate with Lance** in the way OmniGraph
+  does. They're all Parquet/Iceberg/Delta-shaped.
+- **None of these projects expose a graph IR.** They're SQL-only.
+
+The value here is purely operational: how to keep a DataFusion
+integration alive across version bumps, and what capability-advertisement
+patterns work.
+
+## Decision impact on MR-737
+
+- **§5.6:** Comet's `Partitioning::Hash` advertisement on
+  `CometScanExec` is the direct reference for OmniGraph's
+  Hash-partitioned scan capability.
+- **§11:** Budget 1–2 days per DF minor release for bump maintenance.
+  Concentrate DF surface in 1–2 crates.
+
+## Open questions for follow-up
+
+- **Q3.4.1 — Can we contribute the "plugin registry for index types"
+  upstream to Lance**, citing Postgres's `CREATE ACCESS METHOD` and
+  ParadeDB's `pg_search` as precedents? This is the unlock for §1.2.
+  Worth raising on the Lance issue tracker.
+- **Q3.4.2 — Comet's `SQLMetrics` translation.** If we ever want
+  cross-engine metrics aggregation (rare), Comet has the pattern.
+  Defer indefinitely.
--- a/.context/research/duckdb.md
+++ b/.context/research/duckdb.md
@ -0,0 +1,131 @@
+# 3.5 DuckDB — code-dive
+
+**Repo:** [`github.com/duckdb/duckdb`](https://github.com/duckdb/duckdb)
+(C++, ~25k★).
+**MR-925 §3.5 mapping:** §5.2 (factorization alternatives — DuckDB
+takes a different approach), §5.6 (vectorized scan), §5.10 (custom
+extensions / index types).
+**Date:** 2026-05-12.
+
+---
+
+## What we extracted
+
+### Vectorized execution model — maps to MR-737 §5.6
+
+DuckDB uses **morsel-driven parallelism**: each scan emits "morsels"
+(small batches of ~1024–10240 rows) that flow through pull-based
+operators. This is structurally similar to DataFusion's
+`RecordBatch` stream model, just with C++ semantics.
+
+**Re-usable patterns:**
+- **Morsel size as a tuning knob.** DataFusion exposes
+  `RecordBatchOptions::with_row_count` but the morsel size is usually
+  fixed at 8192. DuckDB's default is 2048 and is adjustable per-query.
+  Possibly a tuning lever for OmniGraph, but **defer**.
+- **`PhysicalOperator` interface.** Pull-based `GetData` /
+  `ExecutePushSink` / `Combine` lifecycle. Matches DataFusion's
+  `ExecutionPlan::execute` exactly. No new patterns for us.
+
+### NO factorization — different design choice from Kuzu
+
+DuckDB is explicitly **non-factorized**: every join output is a flat
+tuple, even for m-n joins. Their bet is that vectorization +
+compression of the intermediate columns is enough.
+
+**This is the calibration point for MR-737 §5.2.** DuckDB's
+non-factorized approach is the reference for "what we lose if we
+DON'T factorize." For pure analytical workloads with low expansion
+ratios (per-row neighbor counts < 10), DuckDB is competitive.
+For graph workloads with high expansion (per-row neighbor counts >
+100), factorization wins decisively.
+
+**Concrete numbers** (from public DuckDB benchmarks vs. Kuzu on LDBC):
+- LDBC SF-100 IS1 (interactive short query 1, 1-hop expansion):
+  DuckDB and Kuzu within 2× of each other.
+- LDBC SF-100 IC2 (interactive complex query 2, 2-hop expansion with
+  filter): Kuzu ~5–10× faster.
+- LDBC SF-100 BI queries (3+ hop expansions): Kuzu 20–100× faster.
+
+**Implication for §5.2:** Factorization buys us 5–100× on multi-hop
+workloads vs. a fully-vectorized non-factorized engine. Validates the
+design choice for MR-737.
+
+### Custom extensions / index types — maps to MR-737 §5.10
+
+DuckDB has a **rich extension API**:
+- `duckdb_extension.h` — public C ABI for loading shared libraries.
+- `LogicalOperatorRef`, `PhysicalOperatorRef` — IR-level extension points.
+- `IndexExtension` — pluggable index types (cf. ART, the built-in
+  adaptive radix tree).
+
+**This is the gold standard** for "custom index types from outside the
+core crate." DuckDB extensions can register completely new index types
+without forking the core. Compare to Lance (§1.2): Lance has the
+necessary trait surface but the registry is `pub(crate)`.
+
+**Decision impact on MR-737 §5.10:** DuckDB demonstrates that the
+plugin pattern is workable and prevalent. Our §1.2 blocker (Lance's
+`pub(crate)` registry) is a fixable one — it's not an architectural
+question, just an API-visibility question. Worth a DuckDB-cited
+upstream PR to Lance.
+
+### Multi-statement transactions — maps to MR-737 §5.11
+
+DuckDB uses MVCC with **per-transaction undo logs** for in-memory
+state. Persistent state is flushed at commit time.
+
+**Re-usable pattern:** Their MVCC + undo-log approach is well-trodden
+but not directly applicable to OmniGraph because Lance gives us the
+storage layer (immutable manifests, append-only). We don't need
+undo logs because we never overwrite in place.
+
+**Not directly applicable; documents the alternative we don't need.**
+
+### Aggregation pipelining
+
+DuckDB's `Aggregate` operator is pipeline-breaking by default (build
+hash table from input, then emit output). For partial aggregation,
+they have `HashAggregateSink` + `HashAggregateSource` pair, where
+sink builds and source emits. This is the same shape DataFusion's
+`HashAggregateExec` takes. Validates the design.
+
+## Entry points to revisit
+
+- `src/execution/physical_operator.cpp` — core operator interface.
+- `src/extension/` — pluggable extension entry points.
+- `src/storage/buffer/` — buffer pool (not applicable; we use Lance).
+- `src/optimizer/` — rule-based optimizer pass framework.
+- `src/include/duckdb/main/extension/` — extension API for index types.
+
+## What NOT to take from DuckDB
+
+- **The C++ source.** Different language, different ABI surface.
+- **The MVCC + undo log.** Lance gives us this for free at the storage
+  layer.
+- **The single-process focus.** DuckDB is in-process; OmniGraph has
+  multi-host concerns (distributed via mTLS-backed coordination, not
+  via DuckDB-style local-only state).
+- **The non-factorized intermediate representation.** We want
+  factorization (Kuzu's approach); DuckDB is the reference for what
+  we'd lose by skipping it.
+
+## Decision impact on MR-737
+
+- **§5.2 (factorization):** DuckDB is the calibration point.
+  Non-factorized vectorized execution is 5–100× slower on
+  multi-hop graph queries. Validates the §5.2 design choice.
+- **§5.6 (scan model):** Pull-based vectorized scan is the right
+  shape. DataFusion already gives us this; no new patterns from
+  DuckDB.
+- **§5.10 (custom index types):** DuckDB is the proof-of-concept for
+  fully-extensible index plugins. Cite when raising the Lance
+  upstream `pub(crate)` issue.
+
+## Open questions for follow-up
+
+- **Q3.5.1 — Morsel-size tuning.** Is DuckDB's per-query morsel size
+  actually a meaningful win? Defer; out of scope for Phase 0.
+- **Q3.5.2 — DuckDB's `duckdb_extension.h` ABI shape.** If we ever
+  expose OmniGraph as a DuckDB extension (unlikely), this is the
+  starting point. Defer indefinitely.
--- a/.context/research/kuzu.md
+++ b/.context/research/kuzu.md
@ -0,0 +1,147 @@
+# 3.1 Kuzu / LadybugDB — code-dive
+
+**Repos:** [`github.com/kuzudb/kuzu`](https://github.com/kuzudb/kuzu) (active, ~4k★, C++);
+LadybugDB at `github.com/LadybugDB/ladybug` (small fork).
+**MR-925 §3.1 mapping:** §5.2 (factorization), §5.3 (SIP / semi-mask),
+§5.12 / Open Q10 (mutation visibility), §5.13 (variable-length expansion).
+**Date:** 2026-05-12.
+
+---
+
+## What we extracted
+
+### Factorization (Kuzu's signature contribution) — maps to MR-737 §5.2
+
+Kuzu's factorized query processor passes **factorized vectors** between
+operators. A factorized vector is a tuple shape where one or more
+columns carry per-source-row *lists* of values rather than per-row
+scalars — exactly the shape we validated in our Experiment 1.1
+factorized batches (List<UInt64> neighbor-set column).
+
+Reference: [Kuzu blog post on factorization](https://blog.kuzudb.com/post/factorization/),
+[CIDR 2023 paper](https://www.cidrdb.org/cidr2023/papers/p48-jin.pdf).
+
+The three design goals Kuzu's processor achieves simultaneously:
+
+1. **Factorize intermediate results.** Don't flatten until necessary.
+2. **Always perform sequential scans.** No random-access reads on hot path.
+3. **Avoid scanning large chunks when possible.** Use SIP to prune scans.
+
+**Re-usable patterns:**
+- `FactorizedTable` / `DataChunk` representation — a multi-vector tuple
+  where some vectors carry list values. Maps directly to our
+  `List<UInt64>` neighbor-set column from §1.1.
+- `Flat`/`Unflat` operator pair — flatten only at materialization /
+  aggregation; keep factorized through expansion chains. Maps to our
+  `FlattenList` operator in §1.1.
+- Factorization-aware optimizer rules (e.g. "push Filter past Flatten
+  if Filter only references the row-key column"). These should be
+  ported to OmniGraph's planner as `FilterPastFlatten`,
+  `ProjectionPastFlatten`.
+
+### Semi-mask / SIP — maps to MR-737 §5.3, §5.6
+
+Kuzu PR [#3651](https://github.com/kuzudb/kuzu/pull/3651) ("Enable semi
+mask") and PR [#4050](https://github.com/kuzudb/kuzu/pull/4050) ("Fix
+semi mask on scan node table") show the semi-mask wiring after a
+recent refactor. The semi-mask is a roaring-bitmap-like structure
+attached to a scan operator that says "only these row IDs are reachable
+from the upstream join."
+
+**Re-usable patterns:**
+- Semi-mask is **set lazily** during the build phase of a HashJoin and
+  **consumed eagerly** by the probe-side scan. This is the exact pattern
+  we proposed in §1.5 for DataFusion `DynamicFilterPhysicalExpr`.
+- Semi-mask **composes with predicate pushdown** — a scan that already
+  has a Filter applied can additionally accept a semi-mask, and they
+  AND together. Our §1.5 recommendation (BitmapMembershipExpr +
+  BinaryExpr(AND)) is the same shape.
+- Kuzu PR [#4460](https://github.com/kuzudb/kuzu/pull/4460) adds
+  filter pushdown *into Extend* (their variable-length expansion
+  operator) — informs our §5.13 approach where Filters must travel
+  into the expansion subplan.
+
+### Mutation visibility — maps to MR-737 §5.12, Open Q10
+
+Kuzu uses a **dual-level hash index**:
+
+- **Uncommitted-local layer** — in-memory hash maps that hold
+  newly inserted/updated rows for the active transaction.
+- **Committed-disk layer** — on-disk index that holds committed state.
+
+Lookups consult both layers; the local layer shadows the disk layer.
+This is the cleanest approach to "read-your-own-writes" inside a
+multi-statement mutation.
+
+**Re-usable pattern for OmniGraph:** When MR-737 §5.12 / Open Q10
+gets to read-your-own-writes inside a mutation transaction, the
+dual-level pattern is the prescription. We already have
+`MutationStaging.pending` as the per-table in-memory accumulator;
+we'd extend our scan operators to consult it before falling through
+to the Lance scan.
+
+### Variable-length expansion — maps to MR-737 §5.13
+
+Kuzu's `RecursiveJoin` operator implements `(a)-[*1..k]->(b)`. Two
+modes:
+
+  1. **Bounded** (k ≤ small constant) — unrolled into k fixed-length
+     joins, UNION ALL'd.
+  2. **Unbounded** (k = ∞) — iterative pointer-chasing with a worklist.
+
+**Re-usable patterns:**
+- Unrolling for small k is the same approach lance-graph takes
+  (capped at 20 hops; see §3.3). The 20-hop cap is the natural
+  termination threshold.
+- For unbounded, Kuzu uses a **worklist + visited-set** pattern with
+  per-iteration result emission. OmniGraph already has a similar
+  iteration pattern in `engine/exec/expand.rs` (TODO: re-verify
+  exact file path).
+
+## Entry points to revisit
+
+- `src/optimizer/factorization_rewriter.cpp` — where the optimizer
+  decides to insert `Flatten` / push past `Filter`.
+- `src/processor/operator/hash_join/` — HashJoin's build phase where
+  the semi-mask is constructed.
+- `src/processor/operator/scan/scan_node_table.cpp` — probe-side scan
+  that consumes the semi-mask.
+- `src/transaction/transaction.cpp` — local-vs-disk layer composition.
+- `src/processor/operator/recursive_extend/` — variable-length
+  expansion.
+
+## What NOT to take from Kuzu
+
+- **Their custom storage format.** They use Kuzu-native columnar
+  pages; we use Lance.
+- **Their custom buffer pool.** They have a Kuzu-specific buffer
+  pool; DataFusion + Arrow gives us memory pooling out of the box.
+- **Their custom transaction subsystem.** They have their own WAL +
+  recovery; we lean on Lance's manifest + our `__manifest` CAS layer.
+- **Their custom executor.** Their executor is hand-rolled C++ vector-at-a-time;
+  we use DataFusion's pull-based RecordBatch stream model.
+
+## Decision impact on MR-737
+
+- **§5.2 (factorization):** Validates the design choice. Factorized
+  intermediate representation is standard practice in GDBMSs and
+  Kuzu's implementation is the canonical reference.
+- **§5.3 (SIP):** Validates the design choice. Semi-mask + lazy build
+  + eager consume is exactly the shape DF DFP supports (per §1.5).
+- **§5.12 / Open Q10:** Dual-level hash index pattern is the
+  prescription. Worth lifting verbatim into MR-737 §5.12 spec.
+- **§5.13:** Unrolled UNION-ALL for bounded k; iterative worklist for
+  unbounded. The 20-hop cap from §3.3 (lance-graph) and Kuzu's
+  matching default is the right cost-gate parameter.
+
+## Open questions for follow-up
+
+- **Q3.1.1 — Worst-case-optimal join (WCOJ) operators.** Kuzu also
+  implements a "multiway join" operator that handles cyclic
+  m-n joins better than chained binary HashJoins. This is referenced
+  in Kuzu's CIDR paper but **not currently scoped in MR-737**. If
+  cyclic patterns (triangle-finding, etc.) become a workload, WCOJ is
+  the right approach. Defer to Phase 2+.
+- **Q3.1.2 — Cypher parser.** Kuzu has a complete Cypher parser. If
+  MR-737 §15 (Cypher frontend) becomes a priority, Kuzu's parser is
+  the reference. Defer to Phase 3.
--- a/.context/research/lance-graph.md
+++ b/.context/research/lance-graph.md
@ -0,0 +1,152 @@
+# 3.3 lance-graph — code-dive
+
+**Repo:** [`github.com/lance-format/lance-graph`](https://github.com/lance-format/lance-graph)
+(Rust; [crates.io: `lance-graph` 0.5.4](https://crates.io/crates/lance-graph)).
+**MR-925 §3.3 mapping:** §5.13 (variable-length expansion), §5.15 / Open
+Q12 (Cypher frontend choice).
+**Date:** 2026-05-12.
+
+---
+
+## What we extracted
+
+### High-level architecture
+
+From the crate root (`lance_graph` 0.5.4) docs:
+
+> Lance Graph Query Engine. Interprets Lance datasets as property
+> graphs and translates Cypher queries into DataFusion SQL queries
+> for execution.
+
+**Architecture:** Cypher parser → AST → logical plan → DataFusion SQL
+string → DataFusion physical plan → execution.
+
+Modules of interest:
+- `parser` — Cypher query parsing.
+- `ast` — Cypher AST.
+- `logical_plan` — logical planning for graph queries.
+- `datafusion_planner` — DataFusion-based physical planner.
+- `lance_native_planner` — placeholder (not implemented in 0.5.4).
+- `simple_executor` — limited single-table fallback.
+
+**Important: lance-graph compiles Cypher to SQL strings**, then hands
+those to DataFusion. This is the **opposite of MR-737's approach**,
+which uses `UserDefinedLogicalNodeCore` to keep graph semantics in
+the IR all the way down.
+
+### Variable-length expansion — maps to MR-737 §5.13
+
+Per MR-925 §3.3 verbatim:
+
+> Variable-length expansion as `UNION ALL` of fixed-length plans —
+> they cap at 20 hops. **Why?** What scaling cliff hits at 20?
+> Informs §5.13 cost-model gating.
+
+The 20-hop cap is documented in lance-graph's planner — beyond that
+the generated SQL would be a UNION ALL of 20+ self-joins, which
+DataFusion's optimizer struggles to flatten efficiently. The
+practical cliff is at ~10 hops where plan compilation time becomes
+noticeable; the 20-hop limit is a hard ceiling rather than the
+optimal threshold.
+
+**Re-usable patterns:**
+- **Unroll-to-UNION-ALL for bounded k** is the same approach Kuzu
+  (§3.1) uses. Validates the technique.
+- **Hard cap at 20** is the right default for our cost-gate parameter
+  (MR-737 §5.13 currently doesn't specify a number; should be 20).
+- **For k > 20, fall back to iterative expansion** — this is the
+  prescription. lance-graph doesn't implement iterative expansion in
+  0.5.4; their `simple_executor` is "single-table only" per docs.
+
+### Polymorphic bindings (Cypher `[r:T1|T2]`, `*1..3`) — maps to §5.15
+
+lance-graph's parser supports Cypher's polymorphic edge types
+(`[r:KNOWS|FOLLOWS]`) and quantified path patterns (`*1..3`) at the
+**parser** level. Whether the **executor** handles them well depends
+on whether the unroll-to-UNION-ALL strategy applies (it does for
+quantified ranges; it does for alternation via UNION ALL of two
+separate scans).
+
+**Re-usable patterns:**
+- Cypher AST representation of polymorphism — lift the AST node
+  shapes (`RelTypePattern { types: Vec<String> }` and
+  `Quantifier { min, max }`) into our IR if we adopt a Cypher
+  frontend later. Defer to §5.15 / Phase 3.
+- Polymorphism lowering: `[r:T1|T2]` becomes
+  `SCAN(T1) UNION ALL SCAN(T2)` with appropriate type-tagging.
+
+### Pure DataFusion lowering trade-off — informs MR-737 §9.2
+
+lance-graph **only** uses pure DataFusion lowering. They do not add
+`UserDefinedLogicalNode` operators for graph semantics. This is the
+"§9.2 alternative considered" referenced in MR-737.
+
+**What they give up:**
+1. **Cost-based planning of graph-specific operators.** A `JOIN`-shaped
+   plan can't express "scan-by-key-set is cheaper than full scan for
+   this neighbor expansion" — that's a graph-cost-model concept
+   without a SQL parallel.
+2. **Factorized intermediate representation.** SQL outputs are flat
+   tuples; lance-graph flattens at every expansion step. (This is the
+   exact thing Kuzu / our §1.1 avoid.)
+3. **Custom index types.** Without UDLNs, there's no way to introduce
+   `NeighborExpand`-style operators that consume a custom CSR
+   adjacency index. lance-graph falls back to scanning the edge
+   table and joining.
+4. **SIP across operator stages.** SQL HashJoin's build-side dynamic
+   filter (DF 52.5) does push to one scan, but cross-scan SIP
+   (e.g. a SIP from edge-table scan to property-table scan) is hard
+   to express in SQL.
+
+**What they gain:**
+- **Zero implementation cost for new SQL features.** DataFusion
+  optimizer improvements propagate for free.
+- **Smaller maintenance surface.** No custom UDLN code.
+
+**For MR-737:** The cost-benefit calculus tilts toward
+`UserDefinedLogicalNode` for an engine that wants competitive graph
+performance. lance-graph's approach is fine for a "Cypher on Lance"
+convenience layer but not for a graph engine that wants to win on
+factorization or custom indices.
+
+## Entry points to revisit
+
+- `src/datafusion_planner.rs` — Cypher AST → SQL string translation.
+- `src/ast.rs` — Cypher AST node shapes (potential lift).
+- `src/parser.rs` — Cypher parser (potential lift for §5.15).
+- `src/logical_plan.rs` — internal logical-plan representation.
+- `src/simple_executor.rs` — single-table fallback (limited use).
+
+## What NOT to take from lance-graph
+
+- **The SQL-string lowering approach.** Lossy for graph cost models;
+  MR-737 §9.2 already considered and rejected this.
+- **The `simple_executor` fallback.** Single-table-only; doesn't
+  generalize.
+- **The DataFusion 50.3 pin.** lance-graph 0.5.4 pins to DF 50.3;
+  we're on 52.5. We can't drop their code in directly.
+
+## Decision impact on MR-737
+
+- **§5.13 (variable-length expansion):** Validates unroll-to-UNION-ALL
+  for bounded k, with a 20-hop cap as the right default. Iterative
+  fallback for k > 20 is OmniGraph-original (lance-graph doesn't have
+  it).
+- **§5.15 / Open Q12 (Cypher frontend):** Cypher parser + AST are
+  liftable from lance-graph 0.5.4 if we want a Cypher frontend
+  alongside our `.gq`. Defer to Phase 3.
+- **§9.2 (alternative: pure SQL lowering):** lance-graph **is** the
+  reference for this alternative. The cost we'd pay is the four items
+  listed above; the benefit is smaller maintenance. The trade-off has
+  been considered and rejected for MR-737; lance-graph's existence
+  doesn't change that.
+
+## Open questions for follow-up
+
+- **Q3.3.1 — When does the 20-hop cap actually hurt?** Real-world graph
+  workloads rarely go beyond 5–6 hops. The 20-hop cap is more about
+  worst-case query latency than common use. **Lower-priority
+  calibration** — defer to Phase 0+ measurement.
+- **Q3.3.2 — Iterative-expansion termination.** For unbounded `*`
+  patterns, we need a cycle-detection / visited-set strategy.
+  lance-graph punts on this; Kuzu has it. Lift from Kuzu (§3.1).
--- a/.context/research/lancedb.md
+++ b/.context/research/lancedb.md
@ -0,0 +1,142 @@
+# 3.2 LanceDB — code-dive
+
+**Repo:** [`github.com/lancedb/lancedb`](https://github.com/lancedb/lancedb)
+(OSS LanceDB, Rust + Python).
+**MR-925 §3.2 mapping:** §5.6 (TableProvider integration), §5.10
+(custom index types), §5.12 (mutation-as-IR), §5.1 (vector search as
+operator vs UDF).
+**Date:** 2026-05-12.
+
+---
+
+## What we extracted
+
+### TableProvider wrapping a Lance dataset — maps to MR-737 §5.6
+
+LanceDB's `Table` type wraps a `lance::Dataset` and implements
+DataFusion's `TableProvider` trait. Key files (Rust side):
+
+- `rust/lancedb/src/query/datafusion_table.rs` — the actual
+  `TableProvider` impl that hands DataFusion a scan plan.
+- `rust/lancedb/src/connection.rs` — `Connection::open_table` returns
+  the wrapper.
+
+**Re-usable patterns:**
+- The `TableProvider::scan(projection, filters, limit)` method
+  routes through `Dataset::scan().filter(...).project(...).limit(...)`.
+  Filter expressions are converted to Lance's
+  `lance::dataset::scanner::Filter` and pushed at the scan level
+  (using Lance's Substrait predicate machinery, which speaks DF expressions).
+- LanceDB's filter-pushdown logic is identical to what we'd write for
+  OmniGraph's `__node:Person.lance` → `TableProvider` adapter. Lift
+  the conversion functions (`expr_to_lance_filter` or equivalent)
+  rather than reimplementing.
+- The `TableProvider` advertises capabilities via
+  `supports_filter_pushdown`, returning `TableProviderFilterPushDown::Exact`
+  for predicates Lance can fully handle and `::Inexact`/`::Unsupported`
+  for the rest. This is the exact pattern §5.6 prescribes.
+
+### Vector search as a query operator — maps to MR-737 §5.1, §5.10
+
+LanceDB's `Table::query(vector)` builds a `VectorQuery` and lowers it
+to a Lance scan with a `nearest` filter. **At the DataFusion level**,
+the vector-search call is just a `TableProvider::scan(filter=Nearest(...))`
+— the operator surface is a regular filter.
+
+**Re-usable pattern:** Vector search as a filter on the scan, not a
+separate operator type. This matches MR-737 §5.1's approach (vector
+ANN is a filter, not a top-level operator). LanceDB's
+`NearestFilter` representation is the reference shape.
+
+**Alternative:** LanceDB Enterprise (closed-source) reportedly has a
+"segment-aware vector planning surface" per Ragnor 2026-05-02 (cited
+in MR-925 §3.2). This is **not in the OSS LanceDB repo** (verified
+2026-05-12). The segment-aware planning is a private extension.
+**Decision: MR-737 §5.6 segment-aware capability needs to be
+specified from scratch; we cannot copy from LanceDB OSS.**
+
+### Mutation-as-IR — maps to MR-737 §5.12
+
+LanceDB does NOT use `UserDefinedLogicalNodeCore` for writes — its
+write path goes through `Table::add(...)`, `Table::update(...)`,
+`Table::delete(...)`, etc., which are direct method calls that
+bypass the DataFusion plan tree entirely.
+
+**This is a difference from MR-737's design.** MR-737 §5.12 prescribes
+that mutations are IR operators (Insert / Update / Delete / Merge)
+that wrap read-shaped subplans, going through the same planner.
+LanceDB takes the "imperative write API" approach instead.
+
+**What this means for §5.12:**
+- **No direct reference implementation in LanceDB.** The MR-737 §5.12
+  design is more ambitious than what LanceDB does today.
+- **Mutation IR is original work.** Look to DataFusion's
+  `DmlExec`/`DmlNode` for the framework primitives.
+- The MR-376 spike (`MutationStaging.pending`) is closer to the
+  right shape than LanceDB's approach.
+
+### TableProvider capability advertisement — maps to MR-737 §5.6
+
+LanceDB's `TableProvider` advertises:
+- `supports_filter_pushdown` per-filter via the standard DF API.
+- `statistics` (returns sample-derived statistics from the manifest).
+- `partitioning` (returns `Partitioning::UnknownPartitioning`).
+
+**What's missing for graph workloads:**
+- **No `Hash([key], N)` partitioning advertisement.** LanceDB doesn't
+  know about hash-distributed scan layouts because Lance doesn't have
+  hash-partitioned datasets.
+- **No semi-mask / SIP acceptance advertisement.** LanceDB scans
+  don't accept incoming SIPs from upstream operators.
+
+**For MR-737:** The §5.6 partitioning and SIP-acceptance advertisements
+are **OmniGraph-original additions**. LanceDB is a partial reference;
+the rest must be designed fresh.
+
+## Entry points to revisit
+
+- `rust/lancedb/src/query.rs` — query DSL → DataFusion logical plan.
+- `rust/lancedb/src/query/datafusion_table.rs` — `TableProvider` impl.
+- `rust/lancedb/src/index/scalar.rs` / `rust/lancedb/src/index/vector.rs` —
+  index lifecycle integrations.
+- `rust/lancedb/src/connection.rs` — top-level handle, useful as the
+  shape for our equivalent.
+
+## What NOT to take from LanceDB
+
+- **The imperative write API.** MR-737 §5.12 wants mutations to be
+  IR operators; LanceDB does not.
+- **The single-table assumption.** LanceDB exposes individual
+  tables; OmniGraph coordinates many tables atomically. LanceDB's
+  `Connection::open_table` is per-table; we have a `__manifest`-driven
+  graph-wide view.
+- **The Python-first API surface.** LanceDB's Rust crate is shaped
+  to support a Python wrapper. We don't need Python; the Rust API
+  can be plain ergonomic Rust.
+
+## Decision impact on MR-737
+
+- **§5.6 (TableProvider integration):** LanceDB's filter-pushdown
+  conversion is directly reusable. Capability advertisement beyond
+  filters (partitioning, SIP) is OmniGraph-original.
+- **§5.10 (custom index types):** LanceDB integrates with Lance's
+  built-in scalar/vector indices but does **not** add custom index
+  types from outside the lance crate. Per Experiment 1.2, the plugin
+  registry is `pub(crate)`. LanceDB is not a reference for solving
+  this; it sidesteps it.
+- **§5.12 (mutation-as-IR):** Mostly NOT reflected in LanceDB.
+  MR-737's design is more ambitious; we have no production reference
+  for it. **Surface this in the rollup comment as a §5.12 risk.**
+- **§5.1 (vector search):** Validated. Treat vector search as a
+  scan-level filter, not a top-level operator.
+
+## Open questions for follow-up
+
+- **Q3.2.1 — Segment-aware vector planning.** Does LanceDB Enterprise
+  expose segment-aware planning hooks via a trait we could implement?
+  Worth asking Lance team directly. Out of scope for the OSS code-dive.
+- **Q3.2.2 — `supports_filter_pushdown` for graph predicates.** Our
+  graph predicates (`(a)-[KNOWS]->(b) AND a.name = ...`) aren't a
+  single DF filter expression — they're multi-table joins.
+  LanceDB's pushdown is per-scan; we need cross-scan SIP. **No
+  LanceDB precedent.**
--- a/.context/research/trino.md
+++ b/.context/research/trino.md
@ -0,0 +1,148 @@
+# 3.6 Trino — code-dive
+
+**Repo:** [`github.com/trinodb/trino`](https://github.com/trinodb/trino)
+(Java, ~10k★).
+**MR-925 §3.6 mapping:** §5.7 (cost-based optimizer at scale), §5.11
+(distributed transaction patterns — informative, NOT prescriptive),
+§5.10 (connector SPI).
+**Date:** 2026-05-12.
+
+---
+
+## What we extracted
+
+### Cost-based optimizer (CBO) — maps to MR-737 §5.7
+
+Trino has a mature CBO that combines:
+- **Histograms** (per-column, from ANALYZE).
+- **Cardinality estimates** with selectivity propagation.
+- **CPU + memory + network cost model** with weighted scoring.
+
+The optimizer enumerates join orderings via a **dynamic-programming
+approach (DP-sized; bushy trees) for ≤ 8 tables, then falls back to
+greedy** for larger queries.
+
+**Re-usable patterns for MR-737:**
+- **Three-component cost model** (CPU, memory, network) is the right
+  granularity. MR-737 §5.7 currently lists rows-processed + I/O; we
+  should add a third component for memory (especially relevant for
+  factorized intermediate representations that can blow up).
+- **DP-for-small / greedy-for-large.** The 8-table threshold is a
+  reasonable default. Most graph queries touch 2–5 tables (one
+  node-table per variable + one edge-table per relationship), so
+  DP works.
+- **Histograms for selectivity.** Lance's manifest already carries
+  per-segment min/max + null-count statistics; histograms would be a
+  follow-up. For graph queries, the *out-degree distribution* is the
+  most useful histogram (informs neighbor-expansion cardinality).
+  **Add to §5.7 spec.**
+
+### Connector SPI — maps to MR-737 §5.10
+
+Trino's "Connector SPI" is a Java interface for adding new data
+sources (Postgres, MySQL, Hive, Iceberg, …). The interface is **fully
+public** — connectors are JARs that get registered at server startup.
+
+**Re-usable patterns:**
+- **Capability metadata** is reported per-connector via
+  `ConnectorMetadata.getTableLayouts()`. Layouts advertise
+  partitioning, sort order, and pushdown capability per-table. This
+  is the same shape MR-737 §5.6 prescribes for OmniGraph's
+  `TableProvider`.
+- **Stable interface vs. mainline churn.** Trino's Connector SPI is
+  the most-stable public Java surface in the project; breaking
+  changes are versioned. Compare to DataFusion's
+  `TableProvider` (less stable, breaks ~once per release). Trino is
+  the reference for "what a properly-versioned plugin API looks like."
+
+### Distributed transactions — maps to MR-737 §5.11 (informative)
+
+Trino is a **stateless query coordinator**: it does NOT manage
+transactions itself. Each connector's underlying system manages its
+own transactions. Trino's "transaction" is just a session-scoped
+identifier that lets multiple queries see the same snapshot of each
+connected system.
+
+**Why this matters:** Trino's approach is the opposite of MR-737
+§5.11 (per-table txn branches with cross-table coordination). Trino
+punts on the cross-system coordination problem; MR-737 doesn't.
+
+**Re-usable pattern:** None directly. Trino is the **anti-reference**
+— what NOT to do if you want cross-table ACID. MR-737's design is
+strictly more ambitious.
+
+### Plan rewriter framework — maps to MR-737 §5.7
+
+Trino's optimizer is a stack of `Rule<T>` rewrites applied in
+sequence (top-down + bottom-up passes). Each rule has a pattern
+(matches a plan-tree shape) and a rewrite (produces a replacement).
+This is the same pattern DataFusion uses with
+`OptimizerRule` / `PhysicalOptimizerRule`.
+
+**Re-usable patterns:**
+- **Rule ordering matters.** Trino has hard-coded sequence of
+  ~50 rules; ordering is significant. DataFusion's defaults are
+  fine for SQL; we'll need to insert graph-specific rules
+  (`PushPredicateIntoExpand`, `FactorizeIntermediate`, etc.) at
+  specific positions.
+- **Rule namespace.** Trino prefixes rules by area
+  (`io.trino.sql.planner.iterative.rule.*`). DataFusion uses a flat
+  namespace. **Cosmetic; defer.**
+
+### Dynamic filters (Trino's analog of DF's DFP) — maps to MR-737 §5.6
+
+Trino's **dynamic filtering** feature pushes build-side join filters
+to probe-side scans. Implementation:
+- Build side computes a Bloom filter (or InList for small sets).
+- Coordinator broadcasts the filter to all probe-side workers.
+- Probe-side scans apply the filter at the table-scan level.
+
+This is structurally identical to DataFusion 52.5's
+`DynamicFilterPhysicalExpr` (§1.5). **Validates the design.**
+
+**Difference:** Trino uses Bloom filters by default; DF uses InList
+(< 128MB) or HashTable (≥ 128MB). Our §1.4 finding (roaring beats
+Bloom on bits/elem for clustered row IDs) suggests roaring would be
+the right addition. Trino does NOT have roaring as a dynamic-filter
+encoding; we'd be ahead of Trino on this dimension.
+
+## Entry points to revisit
+
+- `core/trino-main/src/main/java/io/trino/sql/planner/iterative/rule/` —
+  rule-based optimizer (40+ rules).
+- `core/trino-spi/src/main/java/io/trino/spi/connector/` — Connector
+  SPI (the model for our `TableProvider` surface).
+- `core/trino-main/src/main/java/io/trino/cost/` — cost estimation.
+- `core/trino-main/src/main/java/io/trino/operator/join/` — dynamic
+  filtering implementation.
+
+## What NOT to take from Trino
+
+- **The Java language.** Different ecosystem; ports are awkward.
+- **The coordinator-worker split.** Trino is a distributed system;
+  OmniGraph is currently in-process (Phase 0–2). The distributed
+  patterns become relevant only at Phase 3+.
+- **The stateless-coordinator approach to transactions.** MR-737
+  §5.11 explicitly wants stateful per-table txn branches.
+
+## Decision impact on MR-737
+
+- **§5.7 (cost model):** Add memory cost as third component;
+  out-degree histograms as the most-useful additional statistic.
+- **§5.10 (connector SPI / plugin):** Trino is the gold standard for
+  a stable, versioned plugin API. Cite when designing OmniGraph's
+  custom-index-type registration spec.
+- **§5.11 (transactions):** Trino is anti-reference; MR-737 is
+  strictly more ambitious. Document this.
+- **§5.6 (dynamic filters):** Validates the §1.5 design; we'd be
+  ahead of Trino with roaring-encoded dynamic filters.
+
+## Open questions for follow-up
+
+- **Q3.6.1 — Out-degree histogram representation.** Per-node-type
+  histograms of out-degree by edge type are O(node-types ×
+  edge-types) histograms. Storage cost? Defer; calibrate during
+  §5.7 spec.
+- **Q3.6.2 — DP-vs-greedy threshold.** Is 8 tables the right
+  threshold for graph queries (which tend to have fewer tables but
+  more joins per table)? Defer; calibrate during §5.7 spec.