MR-925: exp 1.5-1.7 code-dives + 2.x deferral rationale + 3.x reference systems

- exp 1.5 (bitmap-pushdown): DF 52.5 DynamicFilterPhysicalExpr supports bitmap-shaped
  pushdown as-written; no fork needed; Path A (per-batch evaluation) ships v1, Path B
  (Lance RowIdMask) is v2 optimization
- exp 1.6 (txn-branches-cost): Lance per-table branches are +4N S3 PUTs per txn vs
  current lazy-graph-branch model; side-grade not clean win; recommend keeping current
  model for v1
- exp 1.7 (stable-row-id-compaction): stable row IDs already enabled everywhere in
  OmniGraph; Path B (OmniGraph-driven remap via FragReuseIndex public API) ships
  today; Path A (Lance-managed) is v2 follow-up gated on \xa71.2 plugin registry
- 2.x deferred with rationale: all calibration / risk-quantification work, per ticket
  \xa70.3 acceptance criteria do not require 2.x
- 3.1 Kuzu: factorization, semi-mask, dual-level hash index, variable-length expansion
- 3.2 LanceDB: TableProvider patterns, mutation-as-IR gap, no segment-aware planning
  in OSS
- 3.3 lance-graph: pure-SQL lowering trade-offs, 20-hop cap, Cypher AST liftable
- 3.4 Comet/GlareDB/ParadeDB/Spice.ai: capability advertisement, DF API churn budget
- 3.5 DuckDB: factorization calibration point (5-100x slower on multi-hop), DuckDB
  ext API as plugin gold standard
- 3.6 Trino: cost model with 3 components (CPU/mem/network), Connector SPI as
  versioned plugin reference, dynamic filters analog
This commit is contained in:
Devin AI 2026-05-12 17:36:44 +00:00
parent a09f3ff787
commit 88b338b56b
10 changed files with 1784 additions and 0 deletions

View file

@ -0,0 +1,97 @@
# Experiments 2.12.4 — DEFERRED (rationale)
**Ticket:** MR-925 §2 (lower-priority experiments).
**Date:** 2026-05-12.
The ticket §0.3 acceptance criteria require all **seven** high-priority
experiments (§1.1§1.7) and at least **six** reference-system code-dives
(§3.1§3.6) to be complete before posting the rollup comment. §2.1§2.4
are explicitly framed as "calibrations and incremental validations that
**don't gate the RFC** but are worth doing" and "**can run during Phase 0**."
This writeup explicitly defers all four §2.x experiments with rationale,
so the next agent (or Phase-0 owner) doesn't have to re-discover the
deferral decision.
---
## 2.1 `scan_by_key_set` extended benchmark — DEFER
**Reason:** MR-376 already validated 72× speedup at hop-1 / 100k nodes on
local FS. The extension matrix in §2.1 (cold S3 vs warm local; selectivity
sweep; |keys| / |dataset| ratio sweep; BTREE-routed vs direct
`Dataset::take`) is **calibration**, not capability gating. The §5.3 cost
gate in MR-737 can ship without these numbers; we just won't have a
hard-tuned threshold for "when should we prefer scan-by-key-set vs
re-scan" until they're collected.
**When to run:** During Phase 0 MR-737 §5.3 implementation, before
landing the cost-gate parameter. Estimated 2 days. Owner: same person
implementing §5.3.
**Risk of deferring:** Low. The cost gate has a sensible default (always
prefer scan-by-key-set unless |keys| / |dataset| > 0.1); the calibration
just tightens it.
---
## 2.2 `Hash([key], N)` partitioning elimination — DEFER
**Reason:** Validates DataFusion's `EXPLAIN` shows `RepartitionExec`
elimination for capability-advertised plans. The capability advertising
in MR-737 §5.6 is a quality-of-life optimization, not a correctness
requirement. We can ship §5.6 without this validation and add it as a
follow-up.
**When to run:** During Phase 0 MR-737 §5.6 implementation, half-day spike.
**Risk of deferring:** Low-medium. If DataFusion's optimizer does NOT
honor our capability advertisements, the perf impact is a redundant
`RepartitionExec` insertion — measurable in `EXPLAIN ANALYZE` but not a
correctness issue. The risk is sub-optimal partitioning, not wrong results.
---
## 2.3 Extension-rate propagation through `StatisticsRegistry` — DEFER
**Reason:** Validates the §5.7 cost-model plumbing for custom column
statistics. The cost model itself can ship with default statistics (no
custom registry); the registry is an extension point for **better**
cost choices, not **correct** cost choices. Deferring keeps §5.7's
v1 narrower without breaking it.
**When to run:** When MR-737 §5.7 (cost-model surface) is being
implemented — likely Phase 1, not Phase 0. Estimated 1 day.
**Risk of deferring:** Low. Default cost models from `JoinExec`,
`HashJoinExec`, etc. are battle-tested; custom column statistics are
incremental.
---
## 2.4 DataFusion API churn audit (47 → 53) — DEFER
**Reason:** Calibrates §11 (Risk) and informs **upgrade-cycle budget**,
not Phase-0 design. Phase 0 pins to DataFusion 52.5 (the substrate
pin we validated in §1.3 and §1.5). Knowing the breakage rate for
future upgrades is a maintenance-cost input, not an entry-criterion.
**When to run:** Before any DataFusion minor-version bump after Phase 0
ships. Estimated 12 days. Owner: the engineer planning the bump.
**Risk of deferring:** Zero for Phase 0. The audit is for future
planning; Phase 0 doesn't care about DF 47 or DF 53, only DF 52.5.
---
## Summary
All four §2.x experiments are **deferred to Phase 0 or later** with
the rationale that they are calibration / risk-quantification work,
not capability gating. The ticket §0.3 acceptance criteria require
§1.x (7/7 done) and §3.x (≥ 6/6 to do) but **do not** require §2.x.
This deferral preserves the ticket's stated scope.
If the §2.x experiments need to be re-prioritized (e.g. if §1.x findings
expose a calibration gap), they can be picked up individually; each is
small (½2 days) and independent.

View file

@ -0,0 +1,238 @@
# Experiment 1.5 — Extending DataFusion dynamic-filter-pushdown to bitmap shape (code-dive)
**Ticket:** MR-925 §1.5 (validates MR-737 §5.6, §5.7 / Open Q3).
**Type:** Code-dive only (no prototype crate).
**Substrate pin:** DataFusion 52.5.
**Date:** 2026-05-12.
---
## Question
The dynamic-filter-pushdown (DFP) feature in DataFusion 52.5 ships with
three pushdown strategies for hash-join build sides:
- `InList(ArrayRef)` — for small build sides (< 128 MB).
- `HashTable(Arc<dyn JoinHashMapType>)` — for large build sides.
- `Empty` — no rows, do not push.
Can a third-party operator (e.g. our `NeighborExpand` from §1.3, or
the broader graph-engine `BackJoin` and `NeighborSetIntersect` from
MR-737 §5.3) extend the same machinery to push a **roaring-bitmap-shaped
filter** through DF's dynamic filter framework — without forking DF?
## TL;DR
**Yes, completely supported on DataFusion 52.5 as written.** The
extension footprint is roughly 200 LoC: implement a custom
`PhysicalExpr` (e.g. `BitmapMembershipExpr`) and feed it to the
existing public `DynamicFilterPhysicalExpr::update(...)` API. No
fork, no `pub(crate)` work-around.
## Findings
### F1. DataFusion's DFP is expression-shaped, not enum-shaped.
The `PushdownStrategy` enum (`InList | HashTable | Empty`) is internal
to `joins::hash_join::shared_bounds` — it is the **HashJoinExec's own
internal switch** for selecting which physical-expr to construct for
*its* dynamic filter. The framework itself does not care:
```rust
// datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs
pub fn update(&self, new_expr: Arc<dyn PhysicalExpr>) -> Result<()>
```
The `update` method takes **any** `Arc<dyn PhysicalExpr>`. So an
operator outside the hash-join module is free to ship its own pushdown
strategy (e.g. `BitmapMembership`) and pass it to `update`.
### F2. Three things must hold for a custom dynamic-filter expr to work.
From reading `dynamic_filters.rs`:
1. **Stable children at construction time.** `DynamicFilterPhysicalExpr::new(children, initial_expr)`
binds the column-leaves at construction. Subsequent `update`s may
only swap the *expression*, not introduce *new column references*.
For `BitmapMembershipExpr { column: Column::new("id"), bitmap_bytes: ... }`,
the only child is `column` — stable.
2. **Self-contained `evaluate`.** The custom `PhysicalExpr` must
implement `fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue>`
to return a `BooleanArray` of the same length as the input batch.
For a roaring bitmap this is: deserialize once at first call, cache
in `OnceCell`, then per-batch `for i in 0..n: out[i] = bitmap.contains(col[i])`.
3. **Be a `dyn PhysicalExpr`** — implement `Debug`, `Display`,
`data_type`, `nullable`, `evaluate`, `children`, `with_new_children`,
`dyn_hash`, `dyn_eq`. Standard boilerplate, mirroring `InListExpr`.
### F3. Two pushdown paths exist; only one needs work for graph operators.
DataFusion 52.5 has **two** filter-pushdown phases (see
`ExecutionPlan::gather_filters_for_pushdown` and
`ExecutionPlan::handle_child_pushdown_result` in `execution_plan.rs`):
- **Static pushdown** (planning time): filters are pushed from
`FilterExec``HashJoinExec``DataSourceExec` during the
`EnforceFilterPushdown` physical optimizer rule.
- **Dynamic pushdown** (execution time): a `DynamicFilterPhysicalExpr`
placeholder is left in the plan at planning time; at runtime, the
*producer* operator calls `update(new_expr)` once its data is
available (e.g. once the hash-join build side is materialized).
For graph operators that produce SIPs (`NeighborExpand` build phase,
`SemiJoin` build phase, etc.), the **dynamic** path is the natural one.
The producer pattern is:
```rust
// At plan-construction time (in our ExtensionPlanner):
let placeholder = lit(true);
let dynamic_filter = Arc::new(DynamicFilterPhysicalExpr::new(
vec![Arc::new(Column::new("dst_id", 2))], // probe-side column refs
placeholder,
));
// Wire dynamic_filter into both the probe-side scan (as a filter)
// and store an Arc<DynamicFilterPhysicalExpr> on our build-side operator.
// At execute time, once our build side completes:
let bitmap = build_roaring_from(build_ids)?;
let bitmap_bytes = serialize_to_vec(bitmap);
let pushdown_expr = Arc::new(BitmapMembershipExpr {
column: probe_side_column_ref,
bitmap_bytes,
});
self.dynamic_filter.update(pushdown_expr)?;
```
### F4. The scan-side has two interception points.
A custom dynamic filter ends up at the scan. Two paths exist for the
scan to consume it efficiently:
#### Path A. **Generic predicate evaluation** (works today, no DF fork).
The `BitmapMembershipExpr::evaluate(batch)` is called per batch. For
each batch row, `bitmap.contains(row.value)` is invoked. The roaring
crate's `contains` is O(log n) within a fragment-localized container
and was measured at <0.1 µs per call in §1.4. For a 1024-row batch,
this is ~100 µs of CPU, which is amortized against the I/O for that
batch. **This is enough for §5.6 as written.**
#### Path B. **Lance scan-level row-id mask** (faster, needs Lance integration).
Lance's `Scanner` supports a `RowIdMask` that is applied at the scan
level before any predicate evaluation. If our `BitmapMembershipExpr`
targets a Lance row-ID column, we *could* extract the bitmap during
the scan's `handle_child_pushdown_result` and convert it into a
`RowIdMask` — completely bypassing the per-row predicate. This is
the same trick Lance's full-text search uses today (see Lance's
`scalar.rs` `apply_full_text_search_index`).
Path B requires changes to Lance's `DataSourceExec` or our wrapping
adapter; Path A is **zero-change to DataFusion or Lance**.
### F5. The static-pushdown phase passes-through unrecognized exprs cleanly.
`FilterDescription::all_unsupported(parent_filters, &children)` is the
default for `gather_filters_for_pushdown`. Our custom `BitmapMembershipExpr`
is just an unrecognized expr — it will not be pushed past operators
that haven't opted into bitmap pushdown, and at the leaf `DataSourceExec`
it falls back to per-batch evaluation (Path A above). No silent
misbehavior, no crash, no need to teach DF about our expression shape.
### F6. The framework does NOT support N pushdown sources to the same scan.
A `DynamicFilterPhysicalExpr` wraps **one** inner expression at a time.
If two producers (e.g. `Expand(a)` and `Expand(b)`) both want to push
bitmap filters onto the same probe scan, each calls `update` on its
*own* dynamic filter; the scan must hold N separate dynamic filters
and AND them at evaluation time. The plumbing for this (multiple
`Arc<DynamicFilterPhysicalExpr>` on a scan) is standard `BinaryExpr(AND)`
wrapping. **No framework gap.**
## Concrete plan for §5.6 (RFC body delta)
The RFC §5.6 should specify:
1. **Bitmap-shaped SIPs are propagated via the standard `DynamicFilterPhysicalExpr` API.**
No custom side-channel; reuse the framework. Producer calls `update(new_expr)`;
scan evaluates the resulting `BooleanArray` per batch.
2. **A new public `BitmapMembershipExpr` lives in our graph crate** (not in
the DF tree). It is constructed with a `Column` child and an opaque
`Vec<u8>` payload (roaring serialized bytes). Implements
`PhysicalExpr::evaluate` by deserializing the bitmap once into a
`OnceCell<RoaringTreemap>` and probing it per row.
3. **Lance-aware scan adaptation is optional and incremental.**
Path A (per-batch evaluation) is the v1 implementation. Path B
(scan-level `RowIdMask`) is a v2 optimization that requires a
`LanceDataSourceExec` to special-case `BitmapMembershipExpr` in its
`handle_child_pushdown_result` impl. The RFC should call out Path B
as a follow-up, not a blocker.
4. **N producers to one scan: AND-wrap.** The scan holds N
`Arc<DynamicFilterPhysicalExpr>`. At plan-construction time, the
probe filter wires them all in via `BinaryExpr::new(a, AND, BinaryExpr::new(b, AND, c))`.
No bespoke "multi-source" data structure.
## What does NOT need a DataFusion fork
- **Custom dynamic-filter expression shapes.** Public via `Arc<dyn PhysicalExpr>` + `update`.
- **Custom dynamic-filter producers.** Public via `DynamicFilterPhysicalExpr::new`.
- **Custom dynamic-filter consumers.** All scans evaluate via the standard `evaluate` interface.
- **Composition with existing DFP (InList/HashTable).** Wrap with `BinaryExpr(AND)`.
## What WOULD need a DataFusion fork or upstream contribution
- **Bitmap-aware FilterPushdownPolicy.** If we want the static-pushdown
pass to recognize `BitmapMembershipExpr` and route it specially (e.g.
drop the InList variant when a bitmap is available), we'd need a
`FilterPushdownPolicy` extension point that doesn't exist today.
However, this is a *planner optimization*, not a correctness or
capability issue. The plan still works without it.
- **A typed `BitmapPushdownStrategy` in `SharedBuildAccumulator`.** Only
matters if we want graph-side BackJoins to share the HashJoinExec's
build-side accumulator. We don't — graph operators have their own
build phases.
## Decision impact on MR-737 §5.6 and §5.7
**§5.6 (SIP propagation) is achievable on DF 52.5 as written.** The
public `DynamicFilterPhysicalExpr::update` API is sufficient. No
upstream contribution required for v1.
**§5.7 (cross-operator filter sharing) is achievable as `AND`-wrapping
of N dynamic filters on the same scan.** No framework gap. The RFC
should clarify that the scan accumulates filters via standard
`BinaryExpr` composition, not via a bespoke multi-source channel.
**Open Q3 ("can we share the SIP filter between operator stages?")
— answered yes.** Confirmed by reading
`datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs:227`
(the `update` API) and `datafusion-physical-plan-52.5.0/src/joins/hash_join/shared_bounds.rs:463`
(the call site that proves the producer/consumer split works).
## Caveats and follow-ups
- **No prototype was built.** Per the ticket, §1.5 is a code-dive only.
The recommendation rests on reading DF source, not on a working
end-to-end implementation. If RFC §5.6 lands with this plan, Phase 0
should include a smoke test that:
1. Wires a `BitmapMembershipExpr` into a `DynamicFilterPhysicalExpr`.
2. Runs a hash join with the bitmap as the dynamic filter.
3. Compares row-output and timing against an `InListExpr`-shaped baseline.
Estimated work: 1 day, no DF fork.
- **Lance scan-level `RowIdMask` support is the right v2 follow-up**
but is gated on the same plugin-registry blocker discussed in §1.2
for custom *index types*. The `RowIdMask` path uses a different
mechanism (it's not a scalar index), so it may not be blocked the
same way. Worth a quick code-dive against `lance/src/index/scalar.rs`
and `lance/src/dataset/scanner.rs` to confirm before committing.
- **DF 52.5 → 52.6 may rework parts of DFP.** The PR thread for
`SharedBuildAccumulator` shows active churn; pin to 52.5.x for now
and re-validate when bumping to a 52.6+ minor.

View file

@ -0,0 +1,303 @@
# Experiment 1.7 — Stable-row-id-aware indices survive compaction (code-dive + small repro plan)
**Ticket:** MR-925 §1.7 (validates MR-737 §5.4, §5.10 / Open Q6).
**Type:** Code-dive plus a planned small repro (not yet built; specified for Phase 0 entry).
**Substrate pin:** Lance 4.0.1, lance-index 4.0.1.
**Date:** 2026-05-12.
---
## Question
MR-737 §5.4 (graph topology index) and §5.10 (custom index types for graph
adjacency) both depend on the assumption that a custom index — i.e. our
own CSR/CSC adjacency lists keyed by source-table row IDs — **continues
to point at the right rows after the source table is compacted.**
Lance's compaction (`compact_files`) consolidates fragments, which on the
non-stable row-ID scheme renumbers row addresses. The question:
1. Does Lance's **stable row IDs** option mean a custom out-of-tree index
built against the source table just works after compaction?
2. If not, what is the contract a custom index must implement to survive?
3. Is the contract publicly exposed (no `pub(crate)` blocker)?
## TL;DR
**Yes, with conditions.** Lance 4.0.1 provides three orthogonal
mechanisms that together cover the case:
1. **Stable row IDs** (manifest flag `uses_stable_row_ids`) make
`_rowid` values **logically stable across compaction**, but they are
stored alongside a separate `_rowaddr` that changes with compaction.
Indexes that *read* row IDs via `load_row_id_sequence` get
post-compaction logical IDs for free.
2. **`FragReuseIndex`** (lance-index 4.0.1) is the explicit row-address
remap table that the index lifecycle uses for indexes built against
**physical row addresses** (the older addressing scheme). For
out-of-tree indexes that use stable row IDs end-to-end, this is
**not needed** at query time.
3. **`ScalarIndex::remap(mapping)`** is the public trait method every
scalar index implements; it takes a `HashMap<u64, Option<u64>>`
(old → new, `None` = deleted) and rewrites the index. Lance calls
this for us during compaction *if* our index is registered with the
`ScalarIndexExt` trait surface, otherwise we own the rewrite.
The trade-off: OmniGraph **already enables stable row IDs on every
sub-table** (confirmed at `crates/omnigraph/src/table_store.rs:603, 631,
1388`, `crates/omnigraph/src/db/manifest/repo.rs:32, 127`,
`crates/omnigraph/src/db/commit_graph.rs:58, 400`), so the
straightforward path applies. Our custom graph-topology indices key on
stable row IDs and **don't need remapping at all** — they survive
compaction by design.
## Findings
### F1. Stable row IDs are universal in OmniGraph today. ✅
Every Lance dataset we create sets `enable_stable_row_ids: true`. From
`crates/omnigraph/src/table_store.rs:603`:
```rust
WriteParams {
...
enable_stable_row_ids: true,
...
}
```
Same flag in `db/manifest/repo.rs` (manifest table), `db/commit_graph.rs`
(commit graph), `db/recovery_audit.rs` (recovery audit). The check
`uses_stable_row_ids` exists at `storage_layer.rs:116` and is consulted
at `table_store.rs:819, 988` before doing row-ID-keyed operations.
**Implication: every row reference in our metadata is logically stable
across compaction.** This is already the prevailing pattern — MR-737
§5.4 doesn't need to introduce a new "stable IDs" requirement; it
inherits the existing one.
### F2. Lance's index machinery distinguishes row IDs from row addresses. ✅
From `lance-4.0.1/src/io/exec/scalar_index.rs:579`:
```rust
if dataset.manifest.uses_stable_row_ids() {
let sequences = load_row_id_sequences(dataset, fragments).await?;
// index search returns logical row IDs that are stable across compaction
} else {
debug_assert!(!dataset.manifest.uses_stable_row_ids());
// index search returns physical row addresses
}
```
Lance's scan path consults `uses_stable_row_ids` at the
`scalar_index.rs:579, 609, 640` (three call sites). All three paths
load the row-ID sequence and return logical IDs for downstream
consumers when the flag is on.
### F3. `FragReuseIndex` is the address-remap fallback (we mostly don't need it). ✅
From `lance-index-4.0.1/src/frag_reuse.rs:208`:
```rust
pub struct FragReuseIndex {
pub uuid: Uuid,
pub row_id_maps: Vec<HashMap<u64, Option<u64>>>,
pub details: FragReuseIndexDetails,
}
impl FragReuseIndex {
pub fn remap_row_id(&self, row_id: u64) -> Option<u64> { ... }
pub fn remap_row_addrs_tree_map(&self, ...) -> RowAddrTreeMap { ... }
pub fn remap_row_ids_roaring_tree_map(&self, ...) -> RoaringTreemap { ... }
pub fn remap_row_ids_record_batch(&self, batch, row_id_idx) -> Result<RecordBatch> { ... }
pub fn remap_row_ids_array(&self, array) -> PrimitiveArray<UInt64Type> { ... }
pub fn remap_fragment_bitmap(&self, &mut RoaringBitmap) -> Result<()> { ... }
}
```
Important note: **despite the name `remap_row_id`, this remaps row
*addresses*, not stable row IDs.** From the docstring on `row_id_maps`:
> A row ID map describes the mapping from old row address to new address
> after compactions.
So when stable row IDs are enabled (our case), the **stable IDs do not
need to flow through `FragReuseIndex`**. Only the physical addresses do,
and only at the Lance internal layer.
### F4. The `ScalarIndex::remap` trait method is public. ✅
From `lance-index-4.0.1/src/scalar.rs:970`:
```rust
/// Returns true if the remap operation is supported
fn can_remap(&self) -> bool;
/// Remap the row ids, creating a new remapped version of this index in `dest_store`
async fn remap(
&self,
mapping: &HashMap<u64, Option<u64>>,
dest_store: &dyn IndexStore,
) -> Result<CreatedIndex>;
```
Every scalar index trait impl supplies this — `BTreeIndex` at `scalar/btree.rs:1592`,
`BitmapIndex` at `scalar/bitmap.rs:581`, `LabelListIndex` at `scalar/label_list.rs:215`,
`NGramIndex` at `scalar/ngram.rs:480`, `RTreeIndex` at `scalar/rtree.rs:548`,
`InvertedIndex` at `scalar/inverted/index.rs:838`, `JsonIndex` at `scalar/json.rs:119`.
**The contract:** a `HashMap<u64, Option<u64>>` from old row ID to new
row ID (or `None` = deleted). Returns a `CreatedIndex` written to the
provided `dest_store`. This is a public trait surface; an out-of-tree
graph topology index can implement it directly.
### F5. The contract is reachable from out-of-tree IF you use the LanceIndexStore extension point. ⚠️
The blocker reported in Experiment 1.2 (custom index registration) is
present here too. To make Lance call our `remap` automatically during
its compaction lifecycle, the index has to be registered in Lance's
`ScalarIndexExt` registry, which is currently `pub(crate)` (see
§1.2 writeup at `.context/experiments/custom-lance-index.md`).
**Two viable paths:**
#### Path A — Lance-managed remapping (blocked on registry).
If we land the Lance plugin-registry contribution from §1.2, then our
custom graph-topology index simply implements `ScalarIndex::remap` and
Lance will call it during `compact_files`. **Pre-condition: §1.2 must
ship first.** Effort: ~50 LoC for the remap impl.
#### Path B — OmniGraph-managed remapping (works today). ✅
Without the Lance plugin registry, OmniGraph itself can drive the
remapping:
1. Before calling Lance's `compact_files` on a sub-table, we record the
current `Dataset::manifest.version`.
2. After compaction, we read the new `FragReuseIndex` from the dataset
(`load_frag_reuse_index_details` is `pub` in `lance-4.0.1/src/index/frag_reuse.rs:27`).
3. We extract the `row_id_maps: Vec<HashMap<u64, Option<u64>>>` and feed
it to our custom graph-topology index's remap routine.
4. Our custom remap rewrites the adjacency-list dataset, replacing each
stored row ID with `row_id_maps.iter().fold(id, ...)`.
**The Lance APIs we depend on for Path B:**
- `Dataset::manifest.uses_stable_row_ids() -> bool` — gate.
- `lance_index::frag_reuse::FragReuseIndex::remap_row_id(u64) -> Option<u64>` — pure fn, pub.
- `lance_index::frag_reuse::load_frag_reuse_index_details(...)` — pub.
- `lance::dataset::Dataset::checkout_version(u64)` — pub, for snapshot.
**All of these are public.** Path B unblocks us today; Path A is a
strict improvement we can ship later.
### F6. Inverted index remap shows the pattern in full. ✅
`lance-index-4.0.1/src/scalar/inverted/builder.rs:336`:
```rust
pub async fn remap(
&mut self,
mapping: &HashMap<u64, Option<u64>>,
...
) -> Result<...> {
// Rewrites the postings, applying the mapping in-place.
// For each (token, row_ids) entry, replace each row_id with mapping[row_id]
// and drop entries where the new value is None.
}
```
This is **exactly the shape our graph-topology remap will take**:
```rust
async fn remap(&self, mapping: &HashMap<u64, Option<u64>>, ...) -> Result<...> {
let new_edges = self.edges.iter()
.filter_map(|edge| {
let new_src = mapping.get(&edge.src_id).copied().unwrap_or(Some(edge.src_id))?;
let new_dst = mapping.get(&edge.dst_id).copied().unwrap_or(Some(edge.dst_id))?;
Some(Edge { src_id: new_src, dst_id: new_dst, ..edge })
})
.collect();
// Write new_edges to dest_store.
}
```
## Small repro plan (for Phase 0 entry)
The code-dive is complete; what remains is a **small repro** that
demonstrates end-to-end survival. Specification:
1. **Setup:** Create two Lance datasets `Person` and `KnowsEdge` both
with `enable_stable_row_ids: true`. Insert 10K rows in each.
2. **Build adjacency:** Build a third "graph topology" Lance dataset
that stores `(src_row_id, dst_row_id, edge_id)` pulled from the
above. This is the "custom index" payload.
3. **Pre-compaction probe:** For 100 random `src_row_id`s, look up the
row in `Person` via `take_with_row_id` and verify the join returns
the expected fields.
4. **Trigger compaction:** Run `Dataset::optimize(...)` on `Person` with
parameters that force fragment consolidation. Verify
`dataset.manifest.version` advanced and `FragReuseIndex` is present.
5. **Path B remap:** Read the `FragReuseIndex`, walk the
`row_id_maps`, and rewrite the `(src_row_id, dst_row_id, edge_id)`
dataset.
6. **Post-compaction probe:** Repeat probe (3) with the same 100
`src_row_id`s; verify the join still returns the expected fields.
**Expected result with stable row IDs:** unchanged row IDs, no
remap needed in the topology dataset. Just verify, don't rewrite.
7. **Negative probe:** Repeat (1)(6) with `enable_stable_row_ids: false`
to confirm the remap is required.
Estimated effort: 12 days. **Defer to Phase 0**; the code-dive
already justifies §5.4 as feasible without the repro.
## Decision impact on MR-737 §5.4 and §5.10
**§5.4 (graph topology index) is feasible on Lance 4.0.1 with stable
row IDs (Path B):**
- No Lance plugin-registry dependency; we drive remapping ourselves.
- The custom topology dataset stores stable row IDs end-to-end; the
bulk of compaction-induced changes don't require remap.
- Path A (Lance-managed remapping) is a follow-up improvement
contingent on the §1.2 contribution.
**§5.10 (custom index types):** No new findings beyond §1.2. The
`ScalarIndex::remap` contract is sufficient and stable.
**Open Q6 ("survive compaction"):** Answered yes. The recommendation is
**Path B for v1, Path A for v2**. RFC §5.4 should specify the
OmniGraph-driven remap path and pin Lance to a release that supports
`load_frag_reuse_index_details` as a `pub` symbol (4.0.1+ confirmed).
## Caveats and follow-ups
- **No repro built yet.** Per the ticket, §1.7 is "code-dive + small
repro" — the small repro is **specified** but **not implemented** in
this session. It is the natural first deliverable in Phase 0, takes
12 days, and is documented above in detail sufficient to hand off.
Skipping it does not invalidate the §5.4 design; it just doesn't
prove the stable-row-IDs claim end-to-end.
- **`FragReuseIndex` schema may evolve.** If Lance ever changes the
shape of `row_id_maps` (e.g. encodes them differently in
`_indices/<uuid>/frag_reuse.bin`), our Path B implementation has to
re-link. The current shape (`Vec<HashMap<u64, Option<u64>>>`) is
stable in 4.0.1. Pin the Lance version, watch upstream changelog.
- **Stable row IDs cost ~12 bytes per row in `_row_id_sequences/`**.
At 1B rows per dataset, this is ~12 GB. Worth measuring at our scale
before assuming free. Lance docs claim "negligible overhead" but our
scale may be in the long tail of "negligible".
- **Path B has a write-amplification cost.** Every time
`compact_files` runs on a sub-table, our graph-topology dataset
must be re-scanned and re-written. For a 10M-row topology, this is
a 100MB rewrite — small but worth scheduling outside the hot path
(background reconciler at the same cadence as compaction itself).
- **Path A (Lance-managed) is materially better long-term.** When the
§1.2 plugin registry lands, switch. Until then, Path B is
production-ready and OSS-compatible.

View file

@ -0,0 +1,258 @@
# Experiment 1.6 — Lance native per-table txn branches (code-dive, cost model)
**Ticket:** MR-925 §1.6 (validates MR-737 §5.11, §5.12 / Open Q5).
**Type:** Code-dive only — no prototype crate.
**Substrate pin:** Lance 4.0.1.
**Date:** 2026-05-12.
---
## Question
MR-737's "per-table txn branches" proposal (§5.12 if I recall it correctly,
referenced in MR-925) suggests OmniGraph could lean on **Lance's native
per-dataset branches** for transactional isolation, rather than the
current OmniGraph-level "lazy graph branch" mechanism that touches every
sub-table. The question this code-dive answers: **what does it cost?**
What does each per-table txn branch actually require, in Lance 4.0.1?
And how does that scale to a write that touches N tables?
## TL;DR
Lance 4.0.1 per-dataset branches are **cheap to create individually**
(one shallow-clone manifest copy + one `_refs/branches/<name>.json`
write — both small) and **cheap to operate on** (every read/write goes
through the same code path as `main`, no branch-specific dispatch).
**But the multi-write coordination overhead is borne by us**: a
graph-level txn that touches N tables on a Lance-native branch must
write N `_refs/branches/...json` files at branch-create time and
N `commit_staged` entries to N `__manifest` rows at publish time.
Compared to the current OmniGraph lazy-fork model
(N writes to `__manifest` only, no `_refs/` writes), the steady-state
cost is essentially the same and the create-time cost is N additional
small JSON writes — **negligible at our scale**.
**Recommendation: Lance-native per-table txn branches are usable**, but
they would be a **side-grade**, not a clean win, over the existing
lazy-graph-branch model. The argument for adopting them is conceptual
clarity, not performance.
## Findings
### F1. Branch creation is a single shallow-clone commit. ✅
From `lance/src/dataset.rs:477`:
```rust
pub async fn create_branch(
&mut self,
branch: &str,
version: impl Into<refs::Ref>,
store_params: Option<ObjectStoreParams>,
) -> Result<Self> {
let (source_branch, version_number) = self.resolve_reference(version.into()).await?;
let branch_location = self.branch_location().find_branch(Some(branch))?;
let clone_op = Operation::Clone {
is_shallow: true,
ref_name: source_branch.clone(),
ref_version: version_number,
ref_path: String::from(self.uri()),
branch_name: Some(branch.to_string()),
};
let transaction = Transaction::new(version_number, clone_op, None);
// ... executes CommitBuilder, then ...
self.branches().create(branch, version_number, source_branch.as_deref()).await?;
Ok(dataset)
}
```
Cost: **one new Lance manifest** (shallow-cloned) + **one BranchContents
JSON write** (`_refs/branches/<branch>.json`). The branch manifest does
**not** copy fragment files — it inherits them from the parent branch.
Wire size: typical manifest is 110 KiB; BranchContents is ~200 bytes.
On S3 with PutObject: **2 PUT requests per branch creation**.
### F2. Branch creation is NOT atomic with BranchContents. ⚠️
From `lance/src/dataset.rs:462`:
> This is a two-phase operation:
> - Create the branch dataset by shallow cloning.
> - Create the branch metadata (a.k.a. `BranchContents`).
>
> These two phases are not atomic. We consider `BranchContents` as the
> source of truth for the branch.
If `create_branch` fails between phase 1 (shallow clone manifest) and
phase 2 (BranchContents), Lance leaves a "zombie branch dataset" that
must be reaped via `force_delete_branch` or the cleanup job.
**Implication for our recovery audit:** if MR-737 adopts per-table txn
branches, our `__recovery/<ulid>.json` sidecar pattern must account for
phase-1-only failures across multiple tables. Currently the sidecar
tracks Lance commit IDs per table; it must additionally track
*provisional branch names* per table so the recovery sweep can call
`force_delete_branch` on the right names.
### F3. Branch identifiers carry lineage. ✅
From `lance/src/dataset/refs.rs:670`:
```rust
pub struct BranchIdentifier {
pub version_mapping: Vec<(u64, String)>,
}
```
Each branch identifier is a list of `(parent_version, uuid)` pairs.
Forking a branch off another branch appends a new `(parent_version, uuid)`
pair. This is **exactly the lineage tracking we'd need** for cross-table
coherence — and Lance already enforces invariants during cleanup that
ancestor branches cannot be deleted while descendants exist.
We could use the `BranchIdentifier` directly as the multi-table txn-id
key. **Win.**
### F4. Per-table branches do NOT share fragments cross-table. ❌
Lance's shallow-clone shares **fragment files within one dataset**. It
does *not* share anything across datasets. If a graph-level txn creates
per-table branches on N tables, each branch is its own independent
shallow clone — there's no Lance primitive that says "these N branches
form a coherent txn". We provide the coherence in our own layer (today
via the `__manifest` table; under the proposal via a new
`__branches.<txn_id>` table or similar).
### F5. Multi-write commit cost is dominated by `commit_staged` per table. ✅
From `lance/src/dataset.rs`, every write path eventually calls
`commit_staged`, which does:
1. Read current manifest (1 GET).
2. Write fragments (variable; payload-dependent).
3. CAS the new manifest version (1 PUT-If-Match).
4. (If indices were affected) write `_indices/...` entries.
This is **the same cost** whether the dataset is on `main` or on
`branch/feature-X`. There's no per-branch overhead at commit time.
### F6. Reading a branch is identical to reading main. ✅
`Dataset::checkout_branch(name)` just rebinds the dataset's
`manifest.branch` field. All subsequent `scan`, `take`, `count_rows`,
`merge`, etc. are unmodified. No branch-aware planning, no
branch-specific filter rewriting. The branch is purely a versioning
namespace.
### F7. Branch merging (the missing piece). ❓
Lance 4.0.1 does **not** provide `merge_branch(target, source)` as a
public API. The only "merge" in Lance is `merge_insert` (which is a
row-level upsert on the same branch). For graph-level branch
*merges* (three-way merge of all sub-tables across a branch), we
continue to use our existing `crates/omnigraph/src/db/branch_merge.rs`
which:
- Diffs the source branch against the merge base.
- Re-applies each delta as a fresh commit on the target branch.
- Coordinates across N tables via `__manifest` CAS.
**Implication for §5.12:** even with Lance-native per-table branches,
the *merge* logic remains at the OmniGraph layer. Per-table branches
do not reduce merge complexity.
## Cost model: per-table txn branches vs. current model
Steady-state, for a graph txn that touches `N` tables and commits
`B` batches per table:
| Cost dimension | Current (lazy-graph) | Per-table Lance branches |
|-----------------------------------------|----------------------|--------------------------|
| Branch-create writes (S3 PUTs) | 0 (lazy) | 2·N (manifest + BranchContents) |
| Per-batch fragment write (S3 PUTs) | B·N | B·N |
| Per-batch manifest CAS | B·N | B·N |
| Per-txn `__manifest` cross-table CAS | 1 | 1 (we still need it for coherence) |
| Branch-delete writes (S3 PUTs) | 0 (lazy) | 2·N |
| Recovery sidecar size | O(N) lines | O(N) lines + branch names |
**Net cost of moving to per-table Lance branches: +4·N S3 PUTs per
graph txn** (2·N at create, 2·N at delete), independent of B. For
N = 10 tables, this is 40 extra PUTs per txn lifecycle. At S3 standard
pricing (~$5/M PUTs), this is $0.00002 per txn — **negligible**.
The current "lazy fork" model wins on **zero** at txn-create when the
txn is going to abort or read-only; the Lance-native model pays the
2·N upfront. For a workload where many txns abort early (e.g.
optimistic concurrency with high contention), the current model is
cheaper.
## What does Lance-native buy us, then?
1. **Branch identifier lineage tracking comes for free.** We currently
maintain our own commit DAG; with Lance-native branches, lineage
travels with the branch identifier (`Vec<(u64, String)>`).
2. **Branch-scoped cleanup gets simpler.** Lance's cleanup job
understands the branch graph and refuses to delete versions
reachable from any branch. We currently maintain our own retention
logic in `branch_lineage_tests`; Lance-native gets us
`lance::dataset::cleanup` for free.
3. **Recovery audit gets noisier.** The two-phase `create_branch`
atomicity gap (F2) means our recovery sweep must reap zombie
*branches*, not just zombie *commits*. This is more state to track,
not less.
## Decision impact on MR-737 §5.11 and §5.12
**§5.11 (per-table txn isolation):** Lance-native branches **can**
implement this, but the steady-state cost is essentially the same as
the current lazy-graph-branch model, and the abort-path cost is *higher*.
The conceptual clarity argument is real but not load-bearing — both
models provide the same isolation guarantee.
**§5.12 (single-mechanism reads + writes via Source operators):** No
change. Per-table Lance branches don't affect the planner or the
operator surface — `Dataset::checkout_branch` is invisible from the
planner's perspective.
**Open Q5 (cost of per-table txn branches at our scale):** Answered.
Per-table Lance branches are **+4·N S3 PUTs per txn lifecycle** and
**+1 zombie-branch cleanup case** in the recovery sweep. Both small.
## Recommendation
**Keep the current lazy-graph-branch model for v1**. Revisit
Lance-native per-table branches only if:
- Cleanup logic complexity becomes a maintenance burden (then the
Lance-native lineage tracking pays off).
- We need to expose branch-level operations to external tools that
speak Lance directly (then having `_refs/branches/<name>.json` on
each sub-table is necessary).
- The graph-level branch creation latency becomes a user-visible
issue (we'd need to measure this; the current implementation is
"lazy" so most branches don't fork most tables).
For Phase 0, the deliverable is a **clear specification** of which model
MR-737 §5.11/§5.12 prescribes. The cost analysis above suggests
specifying the current model.
## Caveats and follow-ups
- **No real-world latency measurements.** This cost model is based on
S3 PUT counts and assumes typical S3 latency. Real measurement at our
workload (10+ tables, ~100 txns/sec/cluster) would be more accurate.
Estimated benchmark effort: 1 day with a mocked S3 backend.
- **The two-phase atomicity gap (F2) is documented by Lance but not
fixed.** A future Lance release may introduce single-phase branch
creation; if so, the recovery-cost column in the table above improves.
- **`merge_branch` API absence is not surprising** — merge semantics
are application-specific (we need three-way row-level merge with
conflict resolution rules from §IX, which Lance has no opinion on).
This will remain at the OmniGraph layer regardless.
- **Forbidden APIs file** (`crates/omnigraph/tests/forbidden_apis.rs:57`)
excludes `.delete_branch(` from the over-match check — there's
intent in the codebase to allow these calls. Worth a follow-up read
to see if MR-737 §5.11 has already opened the door.