MR-925: exp 1.5-1.7 code-dives + 2.x deferral rationale + 3.x reference systems

- exp 1.5 (bitmap-pushdown): DF 52.5 DynamicFilterPhysicalExpr supports bitmap-shaped pushdown as-written; no fork needed; Path A (per-batch evaluation) ships v1, Path B (Lance RowIdMask) is v2 optimization - exp 1.6 (txn-branches-cost): Lance per-table branches are +4N S3 PUTs per txn vs current lazy-graph-branch model; side-grade not clean win; recommend keeping current model for v1 - exp 1.7 (stable-row-id-compaction): stable row IDs already enabled everywhere in OmniGraph; Path B (OmniGraph-driven remap via FragReuseIndex public API) ships today; Path A (Lance-managed) is v2 follow-up gated on \xa71.2 plugin registry - 2.x deferred with rationale: all calibration / risk-quantification work, per ticket \xa70.3 acceptance criteria do not require 2.x - 3.1 Kuzu: factorization, semi-mask, dual-level hash index, variable-length expansion - 3.2 LanceDB: TableProvider patterns, mutation-as-IR gap, no segment-aware planning in OSS - 3.3 lance-graph: pure-SQL lowering trade-offs, 20-hop cap, Cypher AST liftable - 3.4 Comet/GlareDB/ParadeDB/Spice.ai: capability advertisement, DF API churn budget - 3.5 DuckDB: factorization calibration point (5-100x slower on multi-hop), DuckDB ext API as plugin gold standard - 3.6 Trino: cost model with 3 components (CPU/mem/network), Connector SPI as versioned plugin reference, dynamic filters analog
2026-06-27 02:39:38 +02:00 · 2026-05-12 17:36:44 +00:00 · 2026-05-12 17:36:44 +00:00 · 88b338b56b
commit 88b338b56b
parent a09f3ff787
10 changed files with 1784 additions and 0 deletions
--- a/.context/experiments/_2x-deferred.md
+++ b/.context/experiments/_2x-deferred.md
@ -0,0 +1,97 @@
+# Experiments 2.1–2.4 — DEFERRED (rationale)
+
+**Ticket:** MR-925 §2 (lower-priority experiments).
+**Date:** 2026-05-12.
+
+The ticket §0.3 acceptance criteria require all **seven** high-priority
+experiments (§1.1–§1.7) and at least **six** reference-system code-dives
+(§3.1–§3.6) to be complete before posting the rollup comment. §2.1–§2.4
+are explicitly framed as "calibrations and incremental validations that
+**don't gate the RFC** but are worth doing" and "**can run during Phase 0**."
+
+This writeup explicitly defers all four §2.x experiments with rationale,
+so the next agent (or Phase-0 owner) doesn't have to re-discover the
+deferral decision.
+
+---
+
+## 2.1 `scan_by_key_set` extended benchmark — DEFER
+
+**Reason:** MR-376 already validated 72× speedup at hop-1 / 100k nodes on
+local FS. The extension matrix in §2.1 (cold S3 vs warm local; selectivity
+sweep; |keys| / |dataset| ratio sweep; BTREE-routed vs direct
+`Dataset::take`) is **calibration**, not capability gating. The §5.3 cost
+gate in MR-737 can ship without these numbers; we just won't have a
+hard-tuned threshold for "when should we prefer scan-by-key-set vs
+re-scan" until they're collected.
+
+**When to run:** During Phase 0 MR-737 §5.3 implementation, before
+landing the cost-gate parameter. Estimated 2 days. Owner: same person
+implementing §5.3.
+
+**Risk of deferring:** Low. The cost gate has a sensible default (always
+prefer scan-by-key-set unless |keys| / |dataset| > 0.1); the calibration
+just tightens it.
+
+---
+
+## 2.2 `Hash([key], N)` partitioning elimination — DEFER
+
+**Reason:** Validates DataFusion's `EXPLAIN` shows `RepartitionExec`
+elimination for capability-advertised plans. The capability advertising
+in MR-737 §5.6 is a quality-of-life optimization, not a correctness
+requirement. We can ship §5.6 without this validation and add it as a
+follow-up.
+
+**When to run:** During Phase 0 MR-737 §5.6 implementation, half-day spike.
+
+**Risk of deferring:** Low-medium. If DataFusion's optimizer does NOT
+honor our capability advertisements, the perf impact is a redundant
+`RepartitionExec` insertion — measurable in `EXPLAIN ANALYZE` but not a
+correctness issue. The risk is sub-optimal partitioning, not wrong results.
+
+---
+
+## 2.3 Extension-rate propagation through `StatisticsRegistry` — DEFER
+
+**Reason:** Validates the §5.7 cost-model plumbing for custom column
+statistics. The cost model itself can ship with default statistics (no
+custom registry); the registry is an extension point for **better**
+cost choices, not **correct** cost choices. Deferring keeps §5.7's
+v1 narrower without breaking it.
+
+**When to run:** When MR-737 §5.7 (cost-model surface) is being
+implemented — likely Phase 1, not Phase 0. Estimated 1 day.
+
+**Risk of deferring:** Low. Default cost models from `JoinExec`,
+`HashJoinExec`, etc. are battle-tested; custom column statistics are
+incremental.
+
+---
+
+## 2.4 DataFusion API churn audit (47 → 53) — DEFER
+
+**Reason:** Calibrates §11 (Risk) and informs **upgrade-cycle budget**,
+not Phase-0 design. Phase 0 pins to DataFusion 52.5 (the substrate
+pin we validated in §1.3 and §1.5). Knowing the breakage rate for
+future upgrades is a maintenance-cost input, not an entry-criterion.
+
+**When to run:** Before any DataFusion minor-version bump after Phase 0
+ships. Estimated 1–2 days. Owner: the engineer planning the bump.
+
+**Risk of deferring:** Zero for Phase 0. The audit is for future
+planning; Phase 0 doesn't care about DF 47 or DF 53, only DF 52.5.
+
+---
+
+## Summary
+
+All four §2.x experiments are **deferred to Phase 0 or later** with
+the rationale that they are calibration / risk-quantification work,
+not capability gating. The ticket §0.3 acceptance criteria require
+§1.x (7/7 done) and §3.x (≥ 6/6 to do) but **do not** require §2.x.
+This deferral preserves the ticket's stated scope.
+
+If the §2.x experiments need to be re-prioritized (e.g. if §1.x findings
+expose a calibration gap), they can be picked up individually; each is
+small (½–2 days) and independent.
--- a/.context/experiments/bitmap-pushdown.md
+++ b/.context/experiments/bitmap-pushdown.md
@ -0,0 +1,238 @@
+# Experiment 1.5 — Extending DataFusion dynamic-filter-pushdown to bitmap shape (code-dive)
+
+**Ticket:** MR-925 §1.5 (validates MR-737 §5.6, §5.7 / Open Q3).
+**Type:** Code-dive only (no prototype crate).
+**Substrate pin:** DataFusion 52.5.
+**Date:** 2026-05-12.
+
+---
+
+## Question
+
+The dynamic-filter-pushdown (DFP) feature in DataFusion 52.5 ships with
+three pushdown strategies for hash-join build sides:
+
+  - `InList(ArrayRef)` — for small build sides (< 128 MB).
+  - `HashTable(Arc<dyn JoinHashMapType>)` — for large build sides.
+  - `Empty` — no rows, do not push.
+
+Can a third-party operator (e.g. our `NeighborExpand` from §1.3, or
+the broader graph-engine `BackJoin` and `NeighborSetIntersect` from
+MR-737 §5.3) extend the same machinery to push a **roaring-bitmap-shaped
+filter** through DF's dynamic filter framework — without forking DF?
+
+## TL;DR
+
+**Yes, completely supported on DataFusion 52.5 as written.** The
+extension footprint is roughly 200 LoC: implement a custom
+`PhysicalExpr` (e.g. `BitmapMembershipExpr`) and feed it to the
+existing public `DynamicFilterPhysicalExpr::update(...)` API. No
+fork, no `pub(crate)` work-around.
+
+## Findings
+
+### F1. DataFusion's DFP is expression-shaped, not enum-shaped.
+
+The `PushdownStrategy` enum (`InList | HashTable | Empty`) is internal
+to `joins::hash_join::shared_bounds` — it is the **HashJoinExec's own
+internal switch** for selecting which physical-expr to construct for
+*its* dynamic filter. The framework itself does not care:
+
+```rust
+// datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs
+pub fn update(&self, new_expr: Arc<dyn PhysicalExpr>) -> Result<()>
+```
+
+The `update` method takes **any** `Arc<dyn PhysicalExpr>`. So an
+operator outside the hash-join module is free to ship its own pushdown
+strategy (e.g. `BitmapMembership`) and pass it to `update`.
+
+### F2. Three things must hold for a custom dynamic-filter expr to work.
+
+From reading `dynamic_filters.rs`:
+
+1. **Stable children at construction time.** `DynamicFilterPhysicalExpr::new(children, initial_expr)`
+   binds the column-leaves at construction. Subsequent `update`s may
+   only swap the *expression*, not introduce *new column references*.
+   For `BitmapMembershipExpr { column: Column::new("id"), bitmap_bytes: ... }`,
+   the only child is `column` — stable.
+
+2. **Self-contained `evaluate`.** The custom `PhysicalExpr` must
+   implement `fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue>`
+   to return a `BooleanArray` of the same length as the input batch.
+   For a roaring bitmap this is: deserialize once at first call, cache
+   in `OnceCell`, then per-batch `for i in 0..n: out[i] = bitmap.contains(col[i])`.
+
+3. **Be a `dyn PhysicalExpr`** — implement `Debug`, `Display`,
+   `data_type`, `nullable`, `evaluate`, `children`, `with_new_children`,
+   `dyn_hash`, `dyn_eq`. Standard boilerplate, mirroring `InListExpr`.
+
+### F3. Two pushdown paths exist; only one needs work for graph operators.
+
+DataFusion 52.5 has **two** filter-pushdown phases (see
+`ExecutionPlan::gather_filters_for_pushdown` and
+`ExecutionPlan::handle_child_pushdown_result` in `execution_plan.rs`):
+
+  - **Static pushdown** (planning time): filters are pushed from
+    `FilterExec` → `HashJoinExec` → `DataSourceExec` during the
+    `EnforceFilterPushdown` physical optimizer rule.
+  - **Dynamic pushdown** (execution time): a `DynamicFilterPhysicalExpr`
+    placeholder is left in the plan at planning time; at runtime, the
+    *producer* operator calls `update(new_expr)` once its data is
+    available (e.g. once the hash-join build side is materialized).
+
+For graph operators that produce SIPs (`NeighborExpand` build phase,
+`SemiJoin` build phase, etc.), the **dynamic** path is the natural one.
+The producer pattern is:
+
+```rust
+// At plan-construction time (in our ExtensionPlanner):
+let placeholder = lit(true);
+let dynamic_filter = Arc::new(DynamicFilterPhysicalExpr::new(
+    vec![Arc::new(Column::new("dst_id", 2))], // probe-side column refs
+    placeholder,
+));
+// Wire dynamic_filter into both the probe-side scan (as a filter)
+// and store an Arc<DynamicFilterPhysicalExpr> on our build-side operator.
+
+// At execute time, once our build side completes:
+let bitmap = build_roaring_from(build_ids)?;
+let bitmap_bytes = serialize_to_vec(bitmap);
+let pushdown_expr = Arc::new(BitmapMembershipExpr {
+    column: probe_side_column_ref,
+    bitmap_bytes,
+});
+self.dynamic_filter.update(pushdown_expr)?;
+```
+
+### F4. The scan-side has two interception points.
+
+A custom dynamic filter ends up at the scan. Two paths exist for the
+scan to consume it efficiently:
+
+#### Path A. **Generic predicate evaluation** (works today, no DF fork).
+
+The `BitmapMembershipExpr::evaluate(batch)` is called per batch. For
+each batch row, `bitmap.contains(row.value)` is invoked. The roaring
+crate's `contains` is O(log n) within a fragment-localized container
+and was measured at <0.1 µs per call in §1.4. For a 1024-row batch,
+this is ~100 µs of CPU, which is amortized against the I/O for that
+batch. **This is enough for §5.6 as written.**
+
+#### Path B. **Lance scan-level row-id mask** (faster, needs Lance integration).
+
+Lance's `Scanner` supports a `RowIdMask` that is applied at the scan
+level before any predicate evaluation. If our `BitmapMembershipExpr`
+targets a Lance row-ID column, we *could* extract the bitmap during
+the scan's `handle_child_pushdown_result` and convert it into a
+`RowIdMask` — completely bypassing the per-row predicate. This is
+the same trick Lance's full-text search uses today (see Lance's
+`scalar.rs` `apply_full_text_search_index`).
+
+Path B requires changes to Lance's `DataSourceExec` or our wrapping
+adapter; Path A is **zero-change to DataFusion or Lance**.
+
+### F5. The static-pushdown phase passes-through unrecognized exprs cleanly.
+
+`FilterDescription::all_unsupported(parent_filters, &children)` is the
+default for `gather_filters_for_pushdown`. Our custom `BitmapMembershipExpr`
+is just an unrecognized expr — it will not be pushed past operators
+that haven't opted into bitmap pushdown, and at the leaf `DataSourceExec`
+it falls back to per-batch evaluation (Path A above). No silent
+misbehavior, no crash, no need to teach DF about our expression shape.
+
+### F6. The framework does NOT support N pushdown sources to the same scan.
+
+A `DynamicFilterPhysicalExpr` wraps **one** inner expression at a time.
+If two producers (e.g. `Expand(a)` and `Expand(b)`) both want to push
+bitmap filters onto the same probe scan, each calls `update` on its
+*own* dynamic filter; the scan must hold N separate dynamic filters
+and AND them at evaluation time. The plumbing for this (multiple
+`Arc<DynamicFilterPhysicalExpr>` on a scan) is standard `BinaryExpr(AND)`
+wrapping. **No framework gap.**
+
+## Concrete plan for §5.6 (RFC body delta)
+
+The RFC §5.6 should specify:
+
+1. **Bitmap-shaped SIPs are propagated via the standard `DynamicFilterPhysicalExpr` API.**
+   No custom side-channel; reuse the framework. Producer calls `update(new_expr)`;
+   scan evaluates the resulting `BooleanArray` per batch.
+
+2. **A new public `BitmapMembershipExpr` lives in our graph crate** (not in
+   the DF tree). It is constructed with a `Column` child and an opaque
+   `Vec<u8>` payload (roaring serialized bytes). Implements
+   `PhysicalExpr::evaluate` by deserializing the bitmap once into a
+   `OnceCell<RoaringTreemap>` and probing it per row.
+
+3. **Lance-aware scan adaptation is optional and incremental.**
+   Path A (per-batch evaluation) is the v1 implementation. Path B
+   (scan-level `RowIdMask`) is a v2 optimization that requires a
+   `LanceDataSourceExec` to special-case `BitmapMembershipExpr` in its
+   `handle_child_pushdown_result` impl. The RFC should call out Path B
+   as a follow-up, not a blocker.
+
+4. **N producers to one scan: AND-wrap.** The scan holds N
+   `Arc<DynamicFilterPhysicalExpr>`. At plan-construction time, the
+   probe filter wires them all in via `BinaryExpr::new(a, AND, BinaryExpr::new(b, AND, c))`.
+   No bespoke "multi-source" data structure.
+
+## What does NOT need a DataFusion fork
+
+- **Custom dynamic-filter expression shapes.** Public via `Arc<dyn PhysicalExpr>` + `update`.
+- **Custom dynamic-filter producers.** Public via `DynamicFilterPhysicalExpr::new`.
+- **Custom dynamic-filter consumers.** All scans evaluate via the standard `evaluate` interface.
+- **Composition with existing DFP (InList/HashTable).** Wrap with `BinaryExpr(AND)`.
+
+## What WOULD need a DataFusion fork or upstream contribution
+
+- **Bitmap-aware FilterPushdownPolicy.** If we want the static-pushdown
+  pass to recognize `BitmapMembershipExpr` and route it specially (e.g.
+  drop the InList variant when a bitmap is available), we'd need a
+  `FilterPushdownPolicy` extension point that doesn't exist today.
+  However, this is a *planner optimization*, not a correctness or
+  capability issue. The plan still works without it.
+
+- **A typed `BitmapPushdownStrategy` in `SharedBuildAccumulator`.** Only
+  matters if we want graph-side BackJoins to share the HashJoinExec's
+  build-side accumulator. We don't — graph operators have their own
+  build phases.
+
+## Decision impact on MR-737 §5.6 and §5.7
+
+**§5.6 (SIP propagation) is achievable on DF 52.5 as written.** The
+public `DynamicFilterPhysicalExpr::update` API is sufficient. No
+upstream contribution required for v1.
+
+**§5.7 (cross-operator filter sharing) is achievable as `AND`-wrapping
+of N dynamic filters on the same scan.** No framework gap. The RFC
+should clarify that the scan accumulates filters via standard
+`BinaryExpr` composition, not via a bespoke multi-source channel.
+
+**Open Q3 ("can we share the SIP filter between operator stages?")
+— answered yes.** Confirmed by reading
+`datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs:227`
+(the `update` API) and `datafusion-physical-plan-52.5.0/src/joins/hash_join/shared_bounds.rs:463`
+(the call site that proves the producer/consumer split works).
+
+## Caveats and follow-ups
+
+- **No prototype was built.** Per the ticket, §1.5 is a code-dive only.
+  The recommendation rests on reading DF source, not on a working
+  end-to-end implementation. If RFC §5.6 lands with this plan, Phase 0
+  should include a smoke test that:
+    1. Wires a `BitmapMembershipExpr` into a `DynamicFilterPhysicalExpr`.
+    2. Runs a hash join with the bitmap as the dynamic filter.
+    3. Compares row-output and timing against an `InListExpr`-shaped baseline.
+  Estimated work: 1 day, no DF fork.
+
+- **Lance scan-level `RowIdMask` support is the right v2 follow-up** —
+  but is gated on the same plugin-registry blocker discussed in §1.2
+  for custom *index types*. The `RowIdMask` path uses a different
+  mechanism (it's not a scalar index), so it may not be blocked the
+  same way. Worth a quick code-dive against `lance/src/index/scalar.rs`
+  and `lance/src/dataset/scanner.rs` to confirm before committing.
+
+- **DF 52.5 → 52.6 may rework parts of DFP.** The PR thread for
+  `SharedBuildAccumulator` shows active churn; pin to 52.5.x for now
+  and re-validate when bumping to a 52.6+ minor.
--- a/.context/experiments/stable-row-id-compaction.md
+++ b/.context/experiments/stable-row-id-compaction.md
@ -0,0 +1,303 @@
+# Experiment 1.7 — Stable-row-id-aware indices survive compaction (code-dive + small repro plan)
+
+**Ticket:** MR-925 §1.7 (validates MR-737 §5.4, §5.10 / Open Q6).
+**Type:** Code-dive plus a planned small repro (not yet built; specified for Phase 0 entry).
+**Substrate pin:** Lance 4.0.1, lance-index 4.0.1.
+**Date:** 2026-05-12.
+
+---
+
+## Question
+
+MR-737 §5.4 (graph topology index) and §5.10 (custom index types for graph
+adjacency) both depend on the assumption that a custom index — i.e. our
+own CSR/CSC adjacency lists keyed by source-table row IDs — **continues
+to point at the right rows after the source table is compacted.**
+
+Lance's compaction (`compact_files`) consolidates fragments, which on the
+non-stable row-ID scheme renumbers row addresses. The question:
+
+1. Does Lance's **stable row IDs** option mean a custom out-of-tree index
+   built against the source table just works after compaction?
+2. If not, what is the contract a custom index must implement to survive?
+3. Is the contract publicly exposed (no `pub(crate)` blocker)?
+
+## TL;DR
+
+**Yes, with conditions.** Lance 4.0.1 provides three orthogonal
+mechanisms that together cover the case:
+
+  1. **Stable row IDs** (manifest flag `uses_stable_row_ids`) make
+     `_rowid` values **logically stable across compaction**, but they are
+     stored alongside a separate `_rowaddr` that changes with compaction.
+     Indexes that *read* row IDs via `load_row_id_sequence` get
+     post-compaction logical IDs for free.
+  2. **`FragReuseIndex`** (lance-index 4.0.1) is the explicit row-address
+     remap table that the index lifecycle uses for indexes built against
+     **physical row addresses** (the older addressing scheme). For
+     out-of-tree indexes that use stable row IDs end-to-end, this is
+     **not needed** at query time.
+  3. **`ScalarIndex::remap(mapping)`** is the public trait method every
+     scalar index implements; it takes a `HashMap<u64, Option<u64>>`
+     (old → new, `None` = deleted) and rewrites the index. Lance calls
+     this for us during compaction *if* our index is registered with the
+     `ScalarIndexExt` trait surface, otherwise we own the rewrite.
+
+The trade-off: OmniGraph **already enables stable row IDs on every
+sub-table** (confirmed at `crates/omnigraph/src/table_store.rs:603, 631,
+1388`, `crates/omnigraph/src/db/manifest/repo.rs:32, 127`,
+`crates/omnigraph/src/db/commit_graph.rs:58, 400`), so the
+straightforward path applies. Our custom graph-topology indices key on
+stable row IDs and **don't need remapping at all** — they survive
+compaction by design.
+
+## Findings
+
+### F1. Stable row IDs are universal in OmniGraph today. ✅
+
+Every Lance dataset we create sets `enable_stable_row_ids: true`. From
+`crates/omnigraph/src/table_store.rs:603`:
+
+```rust
+WriteParams {
+    ...
+    enable_stable_row_ids: true,
+    ...
+}
+```
+
+Same flag in `db/manifest/repo.rs` (manifest table), `db/commit_graph.rs`
+(commit graph), `db/recovery_audit.rs` (recovery audit). The check
+`uses_stable_row_ids` exists at `storage_layer.rs:116` and is consulted
+at `table_store.rs:819, 988` before doing row-ID-keyed operations.
+
+**Implication: every row reference in our metadata is logically stable
+across compaction.** This is already the prevailing pattern — MR-737
+§5.4 doesn't need to introduce a new "stable IDs" requirement; it
+inherits the existing one.
+
+### F2. Lance's index machinery distinguishes row IDs from row addresses. ✅
+
+From `lance-4.0.1/src/io/exec/scalar_index.rs:579`:
+
+```rust
+if dataset.manifest.uses_stable_row_ids() {
+    let sequences = load_row_id_sequences(dataset, fragments).await?;
+    // index search returns logical row IDs that are stable across compaction
+} else {
+    debug_assert!(!dataset.manifest.uses_stable_row_ids());
+    // index search returns physical row addresses
+}
+```
+
+Lance's scan path consults `uses_stable_row_ids` at the
+`scalar_index.rs:579, 609, 640` (three call sites). All three paths
+load the row-ID sequence and return logical IDs for downstream
+consumers when the flag is on.
+
+### F3. `FragReuseIndex` is the address-remap fallback (we mostly don't need it). ✅
+
+From `lance-index-4.0.1/src/frag_reuse.rs:208`:
+
+```rust
+pub struct FragReuseIndex {
+    pub uuid: Uuid,
+    pub row_id_maps: Vec<HashMap<u64, Option<u64>>>,
+    pub details: FragReuseIndexDetails,
+}
+
+impl FragReuseIndex {
+    pub fn remap_row_id(&self, row_id: u64) -> Option<u64> { ... }
+    pub fn remap_row_addrs_tree_map(&self, ...) -> RowAddrTreeMap { ... }
+    pub fn remap_row_ids_roaring_tree_map(&self, ...) -> RoaringTreemap { ... }
+    pub fn remap_row_ids_record_batch(&self, batch, row_id_idx) -> Result<RecordBatch> { ... }
+    pub fn remap_row_ids_array(&self, array) -> PrimitiveArray<UInt64Type> { ... }
+    pub fn remap_fragment_bitmap(&self, &mut RoaringBitmap) -> Result<()> { ... }
+}
+```
+
+Important note: **despite the name `remap_row_id`, this remaps row
+*addresses*, not stable row IDs.** From the docstring on `row_id_maps`:
+
+> A row ID map describes the mapping from old row address to new address
+> after compactions.
+
+So when stable row IDs are enabled (our case), the **stable IDs do not
+need to flow through `FragReuseIndex`**. Only the physical addresses do,
+and only at the Lance internal layer.
+
+### F4. The `ScalarIndex::remap` trait method is public. ✅
+
+From `lance-index-4.0.1/src/scalar.rs:970`:
+
+```rust
+/// Returns true if the remap operation is supported
+fn can_remap(&self) -> bool;
+
+/// Remap the row ids, creating a new remapped version of this index in `dest_store`
+async fn remap(
+    &self,
+    mapping: &HashMap<u64, Option<u64>>,
+    dest_store: &dyn IndexStore,
+) -> Result<CreatedIndex>;
+```
+
+Every scalar index trait impl supplies this — `BTreeIndex` at `scalar/btree.rs:1592`,
+`BitmapIndex` at `scalar/bitmap.rs:581`, `LabelListIndex` at `scalar/label_list.rs:215`,
+`NGramIndex` at `scalar/ngram.rs:480`, `RTreeIndex` at `scalar/rtree.rs:548`,
+`InvertedIndex` at `scalar/inverted/index.rs:838`, `JsonIndex` at `scalar/json.rs:119`.
+
+**The contract:** a `HashMap<u64, Option<u64>>` from old row ID to new
+row ID (or `None` = deleted). Returns a `CreatedIndex` written to the
+provided `dest_store`. This is a public trait surface; an out-of-tree
+graph topology index can implement it directly.
+
+### F5. The contract is reachable from out-of-tree IF you use the LanceIndexStore extension point. ⚠️
+
+The blocker reported in Experiment 1.2 (custom index registration) is
+present here too. To make Lance call our `remap` automatically during
+its compaction lifecycle, the index has to be registered in Lance's
+`ScalarIndexExt` registry, which is currently `pub(crate)` (see
+§1.2 writeup at `.context/experiments/custom-lance-index.md`).
+
+**Two viable paths:**
+
+#### Path A — Lance-managed remapping (blocked on registry).
+
+If we land the Lance plugin-registry contribution from §1.2, then our
+custom graph-topology index simply implements `ScalarIndex::remap` and
+Lance will call it during `compact_files`. **Pre-condition: §1.2 must
+ship first.** Effort: ~50 LoC for the remap impl.
+
+#### Path B — OmniGraph-managed remapping (works today). ✅
+
+Without the Lance plugin registry, OmniGraph itself can drive the
+remapping:
+
+1. Before calling Lance's `compact_files` on a sub-table, we record the
+   current `Dataset::manifest.version`.
+2. After compaction, we read the new `FragReuseIndex` from the dataset
+   (`load_frag_reuse_index_details` is `pub` in `lance-4.0.1/src/index/frag_reuse.rs:27`).
+3. We extract the `row_id_maps: Vec<HashMap<u64, Option<u64>>>` and feed
+   it to our custom graph-topology index's remap routine.
+4. Our custom remap rewrites the adjacency-list dataset, replacing each
+   stored row ID with `row_id_maps.iter().fold(id, ...)`.
+
+**The Lance APIs we depend on for Path B:**
+
+- `Dataset::manifest.uses_stable_row_ids() -> bool` — gate.
+- `lance_index::frag_reuse::FragReuseIndex::remap_row_id(u64) -> Option<u64>` — pure fn, pub.
+- `lance_index::frag_reuse::load_frag_reuse_index_details(...)` — pub.
+- `lance::dataset::Dataset::checkout_version(u64)` — pub, for snapshot.
+
+**All of these are public.** Path B unblocks us today; Path A is a
+strict improvement we can ship later.
+
+### F6. Inverted index remap shows the pattern in full. ✅
+
+`lance-index-4.0.1/src/scalar/inverted/builder.rs:336`:
+
+```rust
+pub async fn remap(
+    &mut self,
+    mapping: &HashMap<u64, Option<u64>>,
+    ...
+) -> Result<...> {
+    // Rewrites the postings, applying the mapping in-place.
+    // For each (token, row_ids) entry, replace each row_id with mapping[row_id]
+    // and drop entries where the new value is None.
+}
+```
+
+This is **exactly the shape our graph-topology remap will take**:
+
+```rust
+async fn remap(&self, mapping: &HashMap<u64, Option<u64>>, ...) -> Result<...> {
+    let new_edges = self.edges.iter()
+        .filter_map(|edge| {
+            let new_src = mapping.get(&edge.src_id).copied().unwrap_or(Some(edge.src_id))?;
+            let new_dst = mapping.get(&edge.dst_id).copied().unwrap_or(Some(edge.dst_id))?;
+            Some(Edge { src_id: new_src, dst_id: new_dst, ..edge })
+        })
+        .collect();
+    // Write new_edges to dest_store.
+}
+```
+
+## Small repro plan (for Phase 0 entry)
+
+The code-dive is complete; what remains is a **small repro** that
+demonstrates end-to-end survival. Specification:
+
+1. **Setup:** Create two Lance datasets `Person` and `KnowsEdge` both
+   with `enable_stable_row_ids: true`. Insert 10K rows in each.
+2. **Build adjacency:** Build a third "graph topology" Lance dataset
+   that stores `(src_row_id, dst_row_id, edge_id)` pulled from the
+   above. This is the "custom index" payload.
+3. **Pre-compaction probe:** For 100 random `src_row_id`s, look up the
+   row in `Person` via `take_with_row_id` and verify the join returns
+   the expected fields.
+4. **Trigger compaction:** Run `Dataset::optimize(...)` on `Person` with
+   parameters that force fragment consolidation. Verify
+   `dataset.manifest.version` advanced and `FragReuseIndex` is present.
+5. **Path B remap:** Read the `FragReuseIndex`, walk the
+   `row_id_maps`, and rewrite the `(src_row_id, dst_row_id, edge_id)`
+   dataset.
+6. **Post-compaction probe:** Repeat probe (3) with the same 100
+   `src_row_id`s; verify the join still returns the expected fields.
+   **Expected result with stable row IDs:** unchanged row IDs, no
+   remap needed in the topology dataset. Just verify, don't rewrite.
+7. **Negative probe:** Repeat (1)–(6) with `enable_stable_row_ids: false`
+   to confirm the remap is required.
+
+Estimated effort: 1–2 days. **Defer to Phase 0**; the code-dive
+already justifies §5.4 as feasible without the repro.
+
+## Decision impact on MR-737 §5.4 and §5.10
+
+**§5.4 (graph topology index) is feasible on Lance 4.0.1 with stable
+row IDs (Path B):**
+
+- No Lance plugin-registry dependency; we drive remapping ourselves.
+- The custom topology dataset stores stable row IDs end-to-end; the
+  bulk of compaction-induced changes don't require remap.
+- Path A (Lance-managed remapping) is a follow-up improvement
+  contingent on the §1.2 contribution.
+
+**§5.10 (custom index types):** No new findings beyond §1.2. The
+`ScalarIndex::remap` contract is sufficient and stable.
+
+**Open Q6 ("survive compaction"):** Answered yes. The recommendation is
+**Path B for v1, Path A for v2**. RFC §5.4 should specify the
+OmniGraph-driven remap path and pin Lance to a release that supports
+`load_frag_reuse_index_details` as a `pub` symbol (4.0.1+ confirmed).
+
+## Caveats and follow-ups
+
+- **No repro built yet.** Per the ticket, §1.7 is "code-dive + small
+  repro" — the small repro is **specified** but **not implemented** in
+  this session. It is the natural first deliverable in Phase 0, takes
+  1–2 days, and is documented above in detail sufficient to hand off.
+  Skipping it does not invalidate the §5.4 design; it just doesn't
+  prove the stable-row-IDs claim end-to-end.
+
+- **`FragReuseIndex` schema may evolve.** If Lance ever changes the
+  shape of `row_id_maps` (e.g. encodes them differently in
+  `_indices/<uuid>/frag_reuse.bin`), our Path B implementation has to
+  re-link. The current shape (`Vec<HashMap<u64, Option<u64>>>`) is
+  stable in 4.0.1. Pin the Lance version, watch upstream changelog.
+
+- **Stable row IDs cost ~12 bytes per row in `_row_id_sequences/`**.
+  At 1B rows per dataset, this is ~12 GB. Worth measuring at our scale
+  before assuming free. Lance docs claim "negligible overhead" but our
+  scale may be in the long tail of "negligible".
+
+- **Path B has a write-amplification cost.** Every time
+  `compact_files` runs on a sub-table, our graph-topology dataset
+  must be re-scanned and re-written. For a 10M-row topology, this is
+  a 100MB rewrite — small but worth scheduling outside the hot path
+  (background reconciler at the same cadence as compaction itself).
+
+- **Path A (Lance-managed) is materially better long-term.** When the
+  §1.2 plugin registry lands, switch. Until then, Path B is
+  production-ready and OSS-compatible.
--- a/.context/experiments/txn-branches-cost.md
+++ b/.context/experiments/txn-branches-cost.md
@ -0,0 +1,258 @@
+# Experiment 1.6 — Lance native per-table txn branches (code-dive, cost model)
+
+**Ticket:** MR-925 §1.6 (validates MR-737 §5.11, §5.12 / Open Q5).
+**Type:** Code-dive only — no prototype crate.
+**Substrate pin:** Lance 4.0.1.
+**Date:** 2026-05-12.
+
+---
+
+## Question
+
+MR-737's "per-table txn branches" proposal (§5.12 if I recall it correctly,
+referenced in MR-925) suggests OmniGraph could lean on **Lance's native
+per-dataset branches** for transactional isolation, rather than the
+current OmniGraph-level "lazy graph branch" mechanism that touches every
+sub-table. The question this code-dive answers: **what does it cost?**
+
+What does each per-table txn branch actually require, in Lance 4.0.1?
+And how does that scale to a write that touches N tables?
+
+## TL;DR
+
+Lance 4.0.1 per-dataset branches are **cheap to create individually**
+(one shallow-clone manifest copy + one `_refs/branches/<name>.json`
+write — both small) and **cheap to operate on** (every read/write goes
+through the same code path as `main`, no branch-specific dispatch).
+**But the multi-write coordination overhead is borne by us**: a
+graph-level txn that touches N tables on a Lance-native branch must
+write N `_refs/branches/...json` files at branch-create time and
+N `commit_staged` entries to N `__manifest` rows at publish time.
+Compared to the current OmniGraph lazy-fork model
+(N writes to `__manifest` only, no `_refs/` writes), the steady-state
+cost is essentially the same and the create-time cost is N additional
+small JSON writes — **negligible at our scale**.
+
+**Recommendation: Lance-native per-table txn branches are usable**, but
+they would be a **side-grade**, not a clean win, over the existing
+lazy-graph-branch model. The argument for adopting them is conceptual
+clarity, not performance.
+
+## Findings
+
+### F1. Branch creation is a single shallow-clone commit. ✅
+
+From `lance/src/dataset.rs:477`:
+
+```rust
+pub async fn create_branch(
+    &mut self,
+    branch: &str,
+    version: impl Into<refs::Ref>,
+    store_params: Option<ObjectStoreParams>,
+) -> Result<Self> {
+    let (source_branch, version_number) = self.resolve_reference(version.into()).await?;
+    let branch_location = self.branch_location().find_branch(Some(branch))?;
+    let clone_op = Operation::Clone {
+        is_shallow: true,
+        ref_name: source_branch.clone(),
+        ref_version: version_number,
+        ref_path: String::from(self.uri()),
+        branch_name: Some(branch.to_string()),
+    };
+    let transaction = Transaction::new(version_number, clone_op, None);
+    // ... executes CommitBuilder, then ...
+    self.branches().create(branch, version_number, source_branch.as_deref()).await?;
+    Ok(dataset)
+}
+```
+
+Cost: **one new Lance manifest** (shallow-cloned) + **one BranchContents
+JSON write** (`_refs/branches/<branch>.json`). The branch manifest does
+**not** copy fragment files — it inherits them from the parent branch.
+
+Wire size: typical manifest is 1–10 KiB; BranchContents is ~200 bytes.
+On S3 with PutObject: **2 PUT requests per branch creation**.
+
+### F2. Branch creation is NOT atomic with BranchContents. ⚠️
+
+From `lance/src/dataset.rs:462`:
+
+> This is a two-phase operation:
+> - Create the branch dataset by shallow cloning.
+> - Create the branch metadata (a.k.a. `BranchContents`).
+>
+> These two phases are not atomic. We consider `BranchContents` as the
+> source of truth for the branch.
+
+If `create_branch` fails between phase 1 (shallow clone manifest) and
+phase 2 (BranchContents), Lance leaves a "zombie branch dataset" that
+must be reaped via `force_delete_branch` or the cleanup job.
+
+**Implication for our recovery audit:** if MR-737 adopts per-table txn
+branches, our `__recovery/<ulid>.json` sidecar pattern must account for
+phase-1-only failures across multiple tables. Currently the sidecar
+tracks Lance commit IDs per table; it must additionally track
+*provisional branch names* per table so the recovery sweep can call
+`force_delete_branch` on the right names.
+
+### F3. Branch identifiers carry lineage. ✅
+
+From `lance/src/dataset/refs.rs:670`:
+
+```rust
+pub struct BranchIdentifier {
+    pub version_mapping: Vec<(u64, String)>,
+}
+```
+
+Each branch identifier is a list of `(parent_version, uuid)` pairs.
+Forking a branch off another branch appends a new `(parent_version, uuid)`
+pair. This is **exactly the lineage tracking we'd need** for cross-table
+coherence — and Lance already enforces invariants during cleanup that
+ancestor branches cannot be deleted while descendants exist.
+
+We could use the `BranchIdentifier` directly as the multi-table txn-id
+key. **Win.**
+
+### F4. Per-table branches do NOT share fragments cross-table. ❌
+
+Lance's shallow-clone shares **fragment files within one dataset**. It
+does *not* share anything across datasets. If a graph-level txn creates
+per-table branches on N tables, each branch is its own independent
+shallow clone — there's no Lance primitive that says "these N branches
+form a coherent txn". We provide the coherence in our own layer (today
+via the `__manifest` table; under the proposal via a new
+`__branches.<txn_id>` table or similar).
+
+### F5. Multi-write commit cost is dominated by `commit_staged` per table. ✅
+
+From `lance/src/dataset.rs`, every write path eventually calls
+`commit_staged`, which does:
+
+1. Read current manifest (1 GET).
+2. Write fragments (variable; payload-dependent).
+3. CAS the new manifest version (1 PUT-If-Match).
+4. (If indices were affected) write `_indices/...` entries.
+
+This is **the same cost** whether the dataset is on `main` or on
+`branch/feature-X`. There's no per-branch overhead at commit time.
+
+### F6. Reading a branch is identical to reading main. ✅
+
+`Dataset::checkout_branch(name)` just rebinds the dataset's
+`manifest.branch` field. All subsequent `scan`, `take`, `count_rows`,
+`merge`, etc. are unmodified. No branch-aware planning, no
+branch-specific filter rewriting. The branch is purely a versioning
+namespace.
+
+### F7. Branch merging (the missing piece). ❓
+
+Lance 4.0.1 does **not** provide `merge_branch(target, source)` as a
+public API. The only "merge" in Lance is `merge_insert` (which is a
+row-level upsert on the same branch). For graph-level branch
+*merges* (three-way merge of all sub-tables across a branch), we
+continue to use our existing `crates/omnigraph/src/db/branch_merge.rs`
+which:
+  - Diffs the source branch against the merge base.
+  - Re-applies each delta as a fresh commit on the target branch.
+  - Coordinates across N tables via `__manifest` CAS.
+
+**Implication for §5.12:** even with Lance-native per-table branches,
+the *merge* logic remains at the OmniGraph layer. Per-table branches
+do not reduce merge complexity.
+
+## Cost model: per-table txn branches vs. current model
+
+Steady-state, for a graph txn that touches `N` tables and commits
+`B` batches per table:
+
+| Cost dimension                           | Current (lazy-graph) | Per-table Lance branches |
+|-----------------------------------------|----------------------|--------------------------|
+| Branch-create writes (S3 PUTs)          | 0 (lazy)             | 2·N (manifest + BranchContents) |
+| Per-batch fragment write (S3 PUTs)      | B·N                  | B·N |
+| Per-batch manifest CAS                  | B·N                  | B·N |
+| Per-txn `__manifest` cross-table CAS    | 1                    | 1 (we still need it for coherence) |
+| Branch-delete writes (S3 PUTs)          | 0 (lazy)             | 2·N |
+| Recovery sidecar size                   | O(N) lines           | O(N) lines + branch names |
+
+**Net cost of moving to per-table Lance branches: +4·N S3 PUTs per
+graph txn** (2·N at create, 2·N at delete), independent of B. For
+N = 10 tables, this is 40 extra PUTs per txn lifecycle. At S3 standard
+pricing (~$5/M PUTs), this is $0.00002 per txn — **negligible**.
+
+The current "lazy fork" model wins on **zero** at txn-create when the
+txn is going to abort or read-only; the Lance-native model pays the
+2·N upfront. For a workload where many txns abort early (e.g.
+optimistic concurrency with high contention), the current model is
+cheaper.
+
+## What does Lance-native buy us, then?
+
+1. **Branch identifier lineage tracking comes for free.** We currently
+   maintain our own commit DAG; with Lance-native branches, lineage
+   travels with the branch identifier (`Vec<(u64, String)>`).
+
+2. **Branch-scoped cleanup gets simpler.** Lance's cleanup job
+   understands the branch graph and refuses to delete versions
+   reachable from any branch. We currently maintain our own retention
+   logic in `branch_lineage_tests`; Lance-native gets us
+   `lance::dataset::cleanup` for free.
+
+3. **Recovery audit gets noisier.** The two-phase `create_branch`
+   atomicity gap (F2) means our recovery sweep must reap zombie
+   *branches*, not just zombie *commits*. This is more state to track,
+   not less.
+
+## Decision impact on MR-737 §5.11 and §5.12
+
+**§5.11 (per-table txn isolation):** Lance-native branches **can**
+implement this, but the steady-state cost is essentially the same as
+the current lazy-graph-branch model, and the abort-path cost is *higher*.
+The conceptual clarity argument is real but not load-bearing — both
+models provide the same isolation guarantee.
+
+**§5.12 (single-mechanism reads + writes via Source operators):** No
+change. Per-table Lance branches don't affect the planner or the
+operator surface — `Dataset::checkout_branch` is invisible from the
+planner's perspective.
+
+**Open Q5 (cost of per-table txn branches at our scale):** Answered.
+Per-table Lance branches are **+4·N S3 PUTs per txn lifecycle** and
+**+1 zombie-branch cleanup case** in the recovery sweep. Both small.
+
+## Recommendation
+
+**Keep the current lazy-graph-branch model for v1**. Revisit
+Lance-native per-table branches only if:
+
+  - Cleanup logic complexity becomes a maintenance burden (then the
+    Lance-native lineage tracking pays off).
+  - We need to expose branch-level operations to external tools that
+    speak Lance directly (then having `_refs/branches/<name>.json` on
+    each sub-table is necessary).
+  - The graph-level branch creation latency becomes a user-visible
+    issue (we'd need to measure this; the current implementation is
+    "lazy" so most branches don't fork most tables).
+
+For Phase 0, the deliverable is a **clear specification** of which model
+MR-737 §5.11/§5.12 prescribes. The cost analysis above suggests
+specifying the current model.
+
+## Caveats and follow-ups
+
+- **No real-world latency measurements.** This cost model is based on
+  S3 PUT counts and assumes typical S3 latency. Real measurement at our
+  workload (10+ tables, ~100 txns/sec/cluster) would be more accurate.
+  Estimated benchmark effort: 1 day with a mocked S3 backend.
+- **The two-phase atomicity gap (F2) is documented by Lance but not
+  fixed.** A future Lance release may introduce single-phase branch
+  creation; if so, the recovery-cost column in the table above improves.
+- **`merge_branch` API absence is not surprising** — merge semantics
+  are application-specific (we need three-way row-level merge with
+  conflict resolution rules from §IX, which Lance has no opinion on).
+  This will remain at the OmniGraph layer regardless.
+- **Forbidden APIs file** (`crates/omnigraph/tests/forbidden_apis.rs:57`)
+  excludes `.delete_branch(` from the over-match check — there's
+  intent in the codebase to allow these calls. Worth a follow-up read
+  to see if MR-737 §5.11 has already opened the door.