mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-27 02:39:38 +02:00
MR-925: exp 1.5-1.7 code-dives + 2.x deferral rationale + 3.x reference systems
- exp 1.5 (bitmap-pushdown): DF 52.5 DynamicFilterPhysicalExpr supports bitmap-shaped pushdown as-written; no fork needed; Path A (per-batch evaluation) ships v1, Path B (Lance RowIdMask) is v2 optimization - exp 1.6 (txn-branches-cost): Lance per-table branches are +4N S3 PUTs per txn vs current lazy-graph-branch model; side-grade not clean win; recommend keeping current model for v1 - exp 1.7 (stable-row-id-compaction): stable row IDs already enabled everywhere in OmniGraph; Path B (OmniGraph-driven remap via FragReuseIndex public API) ships today; Path A (Lance-managed) is v2 follow-up gated on \xa71.2 plugin registry - 2.x deferred with rationale: all calibration / risk-quantification work, per ticket \xa70.3 acceptance criteria do not require 2.x - 3.1 Kuzu: factorization, semi-mask, dual-level hash index, variable-length expansion - 3.2 LanceDB: TableProvider patterns, mutation-as-IR gap, no segment-aware planning in OSS - 3.3 lance-graph: pure-SQL lowering trade-offs, 20-hop cap, Cypher AST liftable - 3.4 Comet/GlareDB/ParadeDB/Spice.ai: capability advertisement, DF API churn budget - 3.5 DuckDB: factorization calibration point (5-100x slower on multi-hop), DuckDB ext API as plugin gold standard - 3.6 Trino: cost model with 3 components (CPU/mem/network), Connector SPI as versioned plugin reference, dynamic filters analog
This commit is contained in:
parent
a09f3ff787
commit
88b338b56b
10 changed files with 1784 additions and 0 deletions
97
.context/experiments/_2x-deferred.md
Normal file
97
.context/experiments/_2x-deferred.md
Normal file
|
|
@ -0,0 +1,97 @@
|
|||
# Experiments 2.1–2.4 — DEFERRED (rationale)
|
||||
|
||||
**Ticket:** MR-925 §2 (lower-priority experiments).
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
The ticket §0.3 acceptance criteria require all **seven** high-priority
|
||||
experiments (§1.1–§1.7) and at least **six** reference-system code-dives
|
||||
(§3.1–§3.6) to be complete before posting the rollup comment. §2.1–§2.4
|
||||
are explicitly framed as "calibrations and incremental validations that
|
||||
**don't gate the RFC** but are worth doing" and "**can run during Phase 0**."
|
||||
|
||||
This writeup explicitly defers all four §2.x experiments with rationale,
|
||||
so the next agent (or Phase-0 owner) doesn't have to re-discover the
|
||||
deferral decision.
|
||||
|
||||
---
|
||||
|
||||
## 2.1 `scan_by_key_set` extended benchmark — DEFER
|
||||
|
||||
**Reason:** MR-376 already validated 72× speedup at hop-1 / 100k nodes on
|
||||
local FS. The extension matrix in §2.1 (cold S3 vs warm local; selectivity
|
||||
sweep; |keys| / |dataset| ratio sweep; BTREE-routed vs direct
|
||||
`Dataset::take`) is **calibration**, not capability gating. The §5.3 cost
|
||||
gate in MR-737 can ship without these numbers; we just won't have a
|
||||
hard-tuned threshold for "when should we prefer scan-by-key-set vs
|
||||
re-scan" until they're collected.
|
||||
|
||||
**When to run:** During Phase 0 MR-737 §5.3 implementation, before
|
||||
landing the cost-gate parameter. Estimated 2 days. Owner: same person
|
||||
implementing §5.3.
|
||||
|
||||
**Risk of deferring:** Low. The cost gate has a sensible default (always
|
||||
prefer scan-by-key-set unless |keys| / |dataset| > 0.1); the calibration
|
||||
just tightens it.
|
||||
|
||||
---
|
||||
|
||||
## 2.2 `Hash([key], N)` partitioning elimination — DEFER
|
||||
|
||||
**Reason:** Validates DataFusion's `EXPLAIN` shows `RepartitionExec`
|
||||
elimination for capability-advertised plans. The capability advertising
|
||||
in MR-737 §5.6 is a quality-of-life optimization, not a correctness
|
||||
requirement. We can ship §5.6 without this validation and add it as a
|
||||
follow-up.
|
||||
|
||||
**When to run:** During Phase 0 MR-737 §5.6 implementation, half-day spike.
|
||||
|
||||
**Risk of deferring:** Low-medium. If DataFusion's optimizer does NOT
|
||||
honor our capability advertisements, the perf impact is a redundant
|
||||
`RepartitionExec` insertion — measurable in `EXPLAIN ANALYZE` but not a
|
||||
correctness issue. The risk is sub-optimal partitioning, not wrong results.
|
||||
|
||||
---
|
||||
|
||||
## 2.3 Extension-rate propagation through `StatisticsRegistry` — DEFER
|
||||
|
||||
**Reason:** Validates the §5.7 cost-model plumbing for custom column
|
||||
statistics. The cost model itself can ship with default statistics (no
|
||||
custom registry); the registry is an extension point for **better**
|
||||
cost choices, not **correct** cost choices. Deferring keeps §5.7's
|
||||
v1 narrower without breaking it.
|
||||
|
||||
**When to run:** When MR-737 §5.7 (cost-model surface) is being
|
||||
implemented — likely Phase 1, not Phase 0. Estimated 1 day.
|
||||
|
||||
**Risk of deferring:** Low. Default cost models from `JoinExec`,
|
||||
`HashJoinExec`, etc. are battle-tested; custom column statistics are
|
||||
incremental.
|
||||
|
||||
---
|
||||
|
||||
## 2.4 DataFusion API churn audit (47 → 53) — DEFER
|
||||
|
||||
**Reason:** Calibrates §11 (Risk) and informs **upgrade-cycle budget**,
|
||||
not Phase-0 design. Phase 0 pins to DataFusion 52.5 (the substrate
|
||||
pin we validated in §1.3 and §1.5). Knowing the breakage rate for
|
||||
future upgrades is a maintenance-cost input, not an entry-criterion.
|
||||
|
||||
**When to run:** Before any DataFusion minor-version bump after Phase 0
|
||||
ships. Estimated 1–2 days. Owner: the engineer planning the bump.
|
||||
|
||||
**Risk of deferring:** Zero for Phase 0. The audit is for future
|
||||
planning; Phase 0 doesn't care about DF 47 or DF 53, only DF 52.5.
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
All four §2.x experiments are **deferred to Phase 0 or later** with
|
||||
the rationale that they are calibration / risk-quantification work,
|
||||
not capability gating. The ticket §0.3 acceptance criteria require
|
||||
§1.x (7/7 done) and §3.x (≥ 6/6 to do) but **do not** require §2.x.
|
||||
This deferral preserves the ticket's stated scope.
|
||||
|
||||
If the §2.x experiments need to be re-prioritized (e.g. if §1.x findings
|
||||
expose a calibration gap), they can be picked up individually; each is
|
||||
small (½–2 days) and independent.
|
||||
238
.context/experiments/bitmap-pushdown.md
Normal file
238
.context/experiments/bitmap-pushdown.md
Normal file
|
|
@ -0,0 +1,238 @@
|
|||
# Experiment 1.5 — Extending DataFusion dynamic-filter-pushdown to bitmap shape (code-dive)
|
||||
|
||||
**Ticket:** MR-925 §1.5 (validates MR-737 §5.6, §5.7 / Open Q3).
|
||||
**Type:** Code-dive only (no prototype crate).
|
||||
**Substrate pin:** DataFusion 52.5.
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
---
|
||||
|
||||
## Question
|
||||
|
||||
The dynamic-filter-pushdown (DFP) feature in DataFusion 52.5 ships with
|
||||
three pushdown strategies for hash-join build sides:
|
||||
|
||||
- `InList(ArrayRef)` — for small build sides (< 128 MB).
|
||||
- `HashTable(Arc<dyn JoinHashMapType>)` — for large build sides.
|
||||
- `Empty` — no rows, do not push.
|
||||
|
||||
Can a third-party operator (e.g. our `NeighborExpand` from §1.3, or
|
||||
the broader graph-engine `BackJoin` and `NeighborSetIntersect` from
|
||||
MR-737 §5.3) extend the same machinery to push a **roaring-bitmap-shaped
|
||||
filter** through DF's dynamic filter framework — without forking DF?
|
||||
|
||||
## TL;DR
|
||||
|
||||
**Yes, completely supported on DataFusion 52.5 as written.** The
|
||||
extension footprint is roughly 200 LoC: implement a custom
|
||||
`PhysicalExpr` (e.g. `BitmapMembershipExpr`) and feed it to the
|
||||
existing public `DynamicFilterPhysicalExpr::update(...)` API. No
|
||||
fork, no `pub(crate)` work-around.
|
||||
|
||||
## Findings
|
||||
|
||||
### F1. DataFusion's DFP is expression-shaped, not enum-shaped.
|
||||
|
||||
The `PushdownStrategy` enum (`InList | HashTable | Empty`) is internal
|
||||
to `joins::hash_join::shared_bounds` — it is the **HashJoinExec's own
|
||||
internal switch** for selecting which physical-expr to construct for
|
||||
*its* dynamic filter. The framework itself does not care:
|
||||
|
||||
```rust
|
||||
// datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs
|
||||
pub fn update(&self, new_expr: Arc<dyn PhysicalExpr>) -> Result<()>
|
||||
```
|
||||
|
||||
The `update` method takes **any** `Arc<dyn PhysicalExpr>`. So an
|
||||
operator outside the hash-join module is free to ship its own pushdown
|
||||
strategy (e.g. `BitmapMembership`) and pass it to `update`.
|
||||
|
||||
### F2. Three things must hold for a custom dynamic-filter expr to work.
|
||||
|
||||
From reading `dynamic_filters.rs`:
|
||||
|
||||
1. **Stable children at construction time.** `DynamicFilterPhysicalExpr::new(children, initial_expr)`
|
||||
binds the column-leaves at construction. Subsequent `update`s may
|
||||
only swap the *expression*, not introduce *new column references*.
|
||||
For `BitmapMembershipExpr { column: Column::new("id"), bitmap_bytes: ... }`,
|
||||
the only child is `column` — stable.
|
||||
|
||||
2. **Self-contained `evaluate`.** The custom `PhysicalExpr` must
|
||||
implement `fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue>`
|
||||
to return a `BooleanArray` of the same length as the input batch.
|
||||
For a roaring bitmap this is: deserialize once at first call, cache
|
||||
in `OnceCell`, then per-batch `for i in 0..n: out[i] = bitmap.contains(col[i])`.
|
||||
|
||||
3. **Be a `dyn PhysicalExpr`** — implement `Debug`, `Display`,
|
||||
`data_type`, `nullable`, `evaluate`, `children`, `with_new_children`,
|
||||
`dyn_hash`, `dyn_eq`. Standard boilerplate, mirroring `InListExpr`.
|
||||
|
||||
### F3. Two pushdown paths exist; only one needs work for graph operators.
|
||||
|
||||
DataFusion 52.5 has **two** filter-pushdown phases (see
|
||||
`ExecutionPlan::gather_filters_for_pushdown` and
|
||||
`ExecutionPlan::handle_child_pushdown_result` in `execution_plan.rs`):
|
||||
|
||||
- **Static pushdown** (planning time): filters are pushed from
|
||||
`FilterExec` → `HashJoinExec` → `DataSourceExec` during the
|
||||
`EnforceFilterPushdown` physical optimizer rule.
|
||||
- **Dynamic pushdown** (execution time): a `DynamicFilterPhysicalExpr`
|
||||
placeholder is left in the plan at planning time; at runtime, the
|
||||
*producer* operator calls `update(new_expr)` once its data is
|
||||
available (e.g. once the hash-join build side is materialized).
|
||||
|
||||
For graph operators that produce SIPs (`NeighborExpand` build phase,
|
||||
`SemiJoin` build phase, etc.), the **dynamic** path is the natural one.
|
||||
The producer pattern is:
|
||||
|
||||
```rust
|
||||
// At plan-construction time (in our ExtensionPlanner):
|
||||
let placeholder = lit(true);
|
||||
let dynamic_filter = Arc::new(DynamicFilterPhysicalExpr::new(
|
||||
vec![Arc::new(Column::new("dst_id", 2))], // probe-side column refs
|
||||
placeholder,
|
||||
));
|
||||
// Wire dynamic_filter into both the probe-side scan (as a filter)
|
||||
// and store an Arc<DynamicFilterPhysicalExpr> on our build-side operator.
|
||||
|
||||
// At execute time, once our build side completes:
|
||||
let bitmap = build_roaring_from(build_ids)?;
|
||||
let bitmap_bytes = serialize_to_vec(bitmap);
|
||||
let pushdown_expr = Arc::new(BitmapMembershipExpr {
|
||||
column: probe_side_column_ref,
|
||||
bitmap_bytes,
|
||||
});
|
||||
self.dynamic_filter.update(pushdown_expr)?;
|
||||
```
|
||||
|
||||
### F4. The scan-side has two interception points.
|
||||
|
||||
A custom dynamic filter ends up at the scan. Two paths exist for the
|
||||
scan to consume it efficiently:
|
||||
|
||||
#### Path A. **Generic predicate evaluation** (works today, no DF fork).
|
||||
|
||||
The `BitmapMembershipExpr::evaluate(batch)` is called per batch. For
|
||||
each batch row, `bitmap.contains(row.value)` is invoked. The roaring
|
||||
crate's `contains` is O(log n) within a fragment-localized container
|
||||
and was measured at <0.1 µs per call in §1.4. For a 1024-row batch,
|
||||
this is ~100 µs of CPU, which is amortized against the I/O for that
|
||||
batch. **This is enough for §5.6 as written.**
|
||||
|
||||
#### Path B. **Lance scan-level row-id mask** (faster, needs Lance integration).
|
||||
|
||||
Lance's `Scanner` supports a `RowIdMask` that is applied at the scan
|
||||
level before any predicate evaluation. If our `BitmapMembershipExpr`
|
||||
targets a Lance row-ID column, we *could* extract the bitmap during
|
||||
the scan's `handle_child_pushdown_result` and convert it into a
|
||||
`RowIdMask` — completely bypassing the per-row predicate. This is
|
||||
the same trick Lance's full-text search uses today (see Lance's
|
||||
`scalar.rs` `apply_full_text_search_index`).
|
||||
|
||||
Path B requires changes to Lance's `DataSourceExec` or our wrapping
|
||||
adapter; Path A is **zero-change to DataFusion or Lance**.
|
||||
|
||||
### F5. The static-pushdown phase passes-through unrecognized exprs cleanly.
|
||||
|
||||
`FilterDescription::all_unsupported(parent_filters, &children)` is the
|
||||
default for `gather_filters_for_pushdown`. Our custom `BitmapMembershipExpr`
|
||||
is just an unrecognized expr — it will not be pushed past operators
|
||||
that haven't opted into bitmap pushdown, and at the leaf `DataSourceExec`
|
||||
it falls back to per-batch evaluation (Path A above). No silent
|
||||
misbehavior, no crash, no need to teach DF about our expression shape.
|
||||
|
||||
### F6. The framework does NOT support N pushdown sources to the same scan.
|
||||
|
||||
A `DynamicFilterPhysicalExpr` wraps **one** inner expression at a time.
|
||||
If two producers (e.g. `Expand(a)` and `Expand(b)`) both want to push
|
||||
bitmap filters onto the same probe scan, each calls `update` on its
|
||||
*own* dynamic filter; the scan must hold N separate dynamic filters
|
||||
and AND them at evaluation time. The plumbing for this (multiple
|
||||
`Arc<DynamicFilterPhysicalExpr>` on a scan) is standard `BinaryExpr(AND)`
|
||||
wrapping. **No framework gap.**
|
||||
|
||||
## Concrete plan for §5.6 (RFC body delta)
|
||||
|
||||
The RFC §5.6 should specify:
|
||||
|
||||
1. **Bitmap-shaped SIPs are propagated via the standard `DynamicFilterPhysicalExpr` API.**
|
||||
No custom side-channel; reuse the framework. Producer calls `update(new_expr)`;
|
||||
scan evaluates the resulting `BooleanArray` per batch.
|
||||
|
||||
2. **A new public `BitmapMembershipExpr` lives in our graph crate** (not in
|
||||
the DF tree). It is constructed with a `Column` child and an opaque
|
||||
`Vec<u8>` payload (roaring serialized bytes). Implements
|
||||
`PhysicalExpr::evaluate` by deserializing the bitmap once into a
|
||||
`OnceCell<RoaringTreemap>` and probing it per row.
|
||||
|
||||
3. **Lance-aware scan adaptation is optional and incremental.**
|
||||
Path A (per-batch evaluation) is the v1 implementation. Path B
|
||||
(scan-level `RowIdMask`) is a v2 optimization that requires a
|
||||
`LanceDataSourceExec` to special-case `BitmapMembershipExpr` in its
|
||||
`handle_child_pushdown_result` impl. The RFC should call out Path B
|
||||
as a follow-up, not a blocker.
|
||||
|
||||
4. **N producers to one scan: AND-wrap.** The scan holds N
|
||||
`Arc<DynamicFilterPhysicalExpr>`. At plan-construction time, the
|
||||
probe filter wires them all in via `BinaryExpr::new(a, AND, BinaryExpr::new(b, AND, c))`.
|
||||
No bespoke "multi-source" data structure.
|
||||
|
||||
## What does NOT need a DataFusion fork
|
||||
|
||||
- **Custom dynamic-filter expression shapes.** Public via `Arc<dyn PhysicalExpr>` + `update`.
|
||||
- **Custom dynamic-filter producers.** Public via `DynamicFilterPhysicalExpr::new`.
|
||||
- **Custom dynamic-filter consumers.** All scans evaluate via the standard `evaluate` interface.
|
||||
- **Composition with existing DFP (InList/HashTable).** Wrap with `BinaryExpr(AND)`.
|
||||
|
||||
## What WOULD need a DataFusion fork or upstream contribution
|
||||
|
||||
- **Bitmap-aware FilterPushdownPolicy.** If we want the static-pushdown
|
||||
pass to recognize `BitmapMembershipExpr` and route it specially (e.g.
|
||||
drop the InList variant when a bitmap is available), we'd need a
|
||||
`FilterPushdownPolicy` extension point that doesn't exist today.
|
||||
However, this is a *planner optimization*, not a correctness or
|
||||
capability issue. The plan still works without it.
|
||||
|
||||
- **A typed `BitmapPushdownStrategy` in `SharedBuildAccumulator`.** Only
|
||||
matters if we want graph-side BackJoins to share the HashJoinExec's
|
||||
build-side accumulator. We don't — graph operators have their own
|
||||
build phases.
|
||||
|
||||
## Decision impact on MR-737 §5.6 and §5.7
|
||||
|
||||
**§5.6 (SIP propagation) is achievable on DF 52.5 as written.** The
|
||||
public `DynamicFilterPhysicalExpr::update` API is sufficient. No
|
||||
upstream contribution required for v1.
|
||||
|
||||
**§5.7 (cross-operator filter sharing) is achievable as `AND`-wrapping
|
||||
of N dynamic filters on the same scan.** No framework gap. The RFC
|
||||
should clarify that the scan accumulates filters via standard
|
||||
`BinaryExpr` composition, not via a bespoke multi-source channel.
|
||||
|
||||
**Open Q3 ("can we share the SIP filter between operator stages?")
|
||||
— answered yes.** Confirmed by reading
|
||||
`datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs:227`
|
||||
(the `update` API) and `datafusion-physical-plan-52.5.0/src/joins/hash_join/shared_bounds.rs:463`
|
||||
(the call site that proves the producer/consumer split works).
|
||||
|
||||
## Caveats and follow-ups
|
||||
|
||||
- **No prototype was built.** Per the ticket, §1.5 is a code-dive only.
|
||||
The recommendation rests on reading DF source, not on a working
|
||||
end-to-end implementation. If RFC §5.6 lands with this plan, Phase 0
|
||||
should include a smoke test that:
|
||||
1. Wires a `BitmapMembershipExpr` into a `DynamicFilterPhysicalExpr`.
|
||||
2. Runs a hash join with the bitmap as the dynamic filter.
|
||||
3. Compares row-output and timing against an `InListExpr`-shaped baseline.
|
||||
Estimated work: 1 day, no DF fork.
|
||||
|
||||
- **Lance scan-level `RowIdMask` support is the right v2 follow-up** —
|
||||
but is gated on the same plugin-registry blocker discussed in §1.2
|
||||
for custom *index types*. The `RowIdMask` path uses a different
|
||||
mechanism (it's not a scalar index), so it may not be blocked the
|
||||
same way. Worth a quick code-dive against `lance/src/index/scalar.rs`
|
||||
and `lance/src/dataset/scanner.rs` to confirm before committing.
|
||||
|
||||
- **DF 52.5 → 52.6 may rework parts of DFP.** The PR thread for
|
||||
`SharedBuildAccumulator` shows active churn; pin to 52.5.x for now
|
||||
and re-validate when bumping to a 52.6+ minor.
|
||||
303
.context/experiments/stable-row-id-compaction.md
Normal file
303
.context/experiments/stable-row-id-compaction.md
Normal file
|
|
@ -0,0 +1,303 @@
|
|||
# Experiment 1.7 — Stable-row-id-aware indices survive compaction (code-dive + small repro plan)
|
||||
|
||||
**Ticket:** MR-925 §1.7 (validates MR-737 §5.4, §5.10 / Open Q6).
|
||||
**Type:** Code-dive plus a planned small repro (not yet built; specified for Phase 0 entry).
|
||||
**Substrate pin:** Lance 4.0.1, lance-index 4.0.1.
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
---
|
||||
|
||||
## Question
|
||||
|
||||
MR-737 §5.4 (graph topology index) and §5.10 (custom index types for graph
|
||||
adjacency) both depend on the assumption that a custom index — i.e. our
|
||||
own CSR/CSC adjacency lists keyed by source-table row IDs — **continues
|
||||
to point at the right rows after the source table is compacted.**
|
||||
|
||||
Lance's compaction (`compact_files`) consolidates fragments, which on the
|
||||
non-stable row-ID scheme renumbers row addresses. The question:
|
||||
|
||||
1. Does Lance's **stable row IDs** option mean a custom out-of-tree index
|
||||
built against the source table just works after compaction?
|
||||
2. If not, what is the contract a custom index must implement to survive?
|
||||
3. Is the contract publicly exposed (no `pub(crate)` blocker)?
|
||||
|
||||
## TL;DR
|
||||
|
||||
**Yes, with conditions.** Lance 4.0.1 provides three orthogonal
|
||||
mechanisms that together cover the case:
|
||||
|
||||
1. **Stable row IDs** (manifest flag `uses_stable_row_ids`) make
|
||||
`_rowid` values **logically stable across compaction**, but they are
|
||||
stored alongside a separate `_rowaddr` that changes with compaction.
|
||||
Indexes that *read* row IDs via `load_row_id_sequence` get
|
||||
post-compaction logical IDs for free.
|
||||
2. **`FragReuseIndex`** (lance-index 4.0.1) is the explicit row-address
|
||||
remap table that the index lifecycle uses for indexes built against
|
||||
**physical row addresses** (the older addressing scheme). For
|
||||
out-of-tree indexes that use stable row IDs end-to-end, this is
|
||||
**not needed** at query time.
|
||||
3. **`ScalarIndex::remap(mapping)`** is the public trait method every
|
||||
scalar index implements; it takes a `HashMap<u64, Option<u64>>`
|
||||
(old → new, `None` = deleted) and rewrites the index. Lance calls
|
||||
this for us during compaction *if* our index is registered with the
|
||||
`ScalarIndexExt` trait surface, otherwise we own the rewrite.
|
||||
|
||||
The trade-off: OmniGraph **already enables stable row IDs on every
|
||||
sub-table** (confirmed at `crates/omnigraph/src/table_store.rs:603, 631,
|
||||
1388`, `crates/omnigraph/src/db/manifest/repo.rs:32, 127`,
|
||||
`crates/omnigraph/src/db/commit_graph.rs:58, 400`), so the
|
||||
straightforward path applies. Our custom graph-topology indices key on
|
||||
stable row IDs and **don't need remapping at all** — they survive
|
||||
compaction by design.
|
||||
|
||||
## Findings
|
||||
|
||||
### F1. Stable row IDs are universal in OmniGraph today. ✅
|
||||
|
||||
Every Lance dataset we create sets `enable_stable_row_ids: true`. From
|
||||
`crates/omnigraph/src/table_store.rs:603`:
|
||||
|
||||
```rust
|
||||
WriteParams {
|
||||
...
|
||||
enable_stable_row_ids: true,
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
Same flag in `db/manifest/repo.rs` (manifest table), `db/commit_graph.rs`
|
||||
(commit graph), `db/recovery_audit.rs` (recovery audit). The check
|
||||
`uses_stable_row_ids` exists at `storage_layer.rs:116` and is consulted
|
||||
at `table_store.rs:819, 988` before doing row-ID-keyed operations.
|
||||
|
||||
**Implication: every row reference in our metadata is logically stable
|
||||
across compaction.** This is already the prevailing pattern — MR-737
|
||||
§5.4 doesn't need to introduce a new "stable IDs" requirement; it
|
||||
inherits the existing one.
|
||||
|
||||
### F2. Lance's index machinery distinguishes row IDs from row addresses. ✅
|
||||
|
||||
From `lance-4.0.1/src/io/exec/scalar_index.rs:579`:
|
||||
|
||||
```rust
|
||||
if dataset.manifest.uses_stable_row_ids() {
|
||||
let sequences = load_row_id_sequences(dataset, fragments).await?;
|
||||
// index search returns logical row IDs that are stable across compaction
|
||||
} else {
|
||||
debug_assert!(!dataset.manifest.uses_stable_row_ids());
|
||||
// index search returns physical row addresses
|
||||
}
|
||||
```
|
||||
|
||||
Lance's scan path consults `uses_stable_row_ids` at the
|
||||
`scalar_index.rs:579, 609, 640` (three call sites). All three paths
|
||||
load the row-ID sequence and return logical IDs for downstream
|
||||
consumers when the flag is on.
|
||||
|
||||
### F3. `FragReuseIndex` is the address-remap fallback (we mostly don't need it). ✅
|
||||
|
||||
From `lance-index-4.0.1/src/frag_reuse.rs:208`:
|
||||
|
||||
```rust
|
||||
pub struct FragReuseIndex {
|
||||
pub uuid: Uuid,
|
||||
pub row_id_maps: Vec<HashMap<u64, Option<u64>>>,
|
||||
pub details: FragReuseIndexDetails,
|
||||
}
|
||||
|
||||
impl FragReuseIndex {
|
||||
pub fn remap_row_id(&self, row_id: u64) -> Option<u64> { ... }
|
||||
pub fn remap_row_addrs_tree_map(&self, ...) -> RowAddrTreeMap { ... }
|
||||
pub fn remap_row_ids_roaring_tree_map(&self, ...) -> RoaringTreemap { ... }
|
||||
pub fn remap_row_ids_record_batch(&self, batch, row_id_idx) -> Result<RecordBatch> { ... }
|
||||
pub fn remap_row_ids_array(&self, array) -> PrimitiveArray<UInt64Type> { ... }
|
||||
pub fn remap_fragment_bitmap(&self, &mut RoaringBitmap) -> Result<()> { ... }
|
||||
}
|
||||
```
|
||||
|
||||
Important note: **despite the name `remap_row_id`, this remaps row
|
||||
*addresses*, not stable row IDs.** From the docstring on `row_id_maps`:
|
||||
|
||||
> A row ID map describes the mapping from old row address to new address
|
||||
> after compactions.
|
||||
|
||||
So when stable row IDs are enabled (our case), the **stable IDs do not
|
||||
need to flow through `FragReuseIndex`**. Only the physical addresses do,
|
||||
and only at the Lance internal layer.
|
||||
|
||||
### F4. The `ScalarIndex::remap` trait method is public. ✅
|
||||
|
||||
From `lance-index-4.0.1/src/scalar.rs:970`:
|
||||
|
||||
```rust
|
||||
/// Returns true if the remap operation is supported
|
||||
fn can_remap(&self) -> bool;
|
||||
|
||||
/// Remap the row ids, creating a new remapped version of this index in `dest_store`
|
||||
async fn remap(
|
||||
&self,
|
||||
mapping: &HashMap<u64, Option<u64>>,
|
||||
dest_store: &dyn IndexStore,
|
||||
) -> Result<CreatedIndex>;
|
||||
```
|
||||
|
||||
Every scalar index trait impl supplies this — `BTreeIndex` at `scalar/btree.rs:1592`,
|
||||
`BitmapIndex` at `scalar/bitmap.rs:581`, `LabelListIndex` at `scalar/label_list.rs:215`,
|
||||
`NGramIndex` at `scalar/ngram.rs:480`, `RTreeIndex` at `scalar/rtree.rs:548`,
|
||||
`InvertedIndex` at `scalar/inverted/index.rs:838`, `JsonIndex` at `scalar/json.rs:119`.
|
||||
|
||||
**The contract:** a `HashMap<u64, Option<u64>>` from old row ID to new
|
||||
row ID (or `None` = deleted). Returns a `CreatedIndex` written to the
|
||||
provided `dest_store`. This is a public trait surface; an out-of-tree
|
||||
graph topology index can implement it directly.
|
||||
|
||||
### F5. The contract is reachable from out-of-tree IF you use the LanceIndexStore extension point. ⚠️
|
||||
|
||||
The blocker reported in Experiment 1.2 (custom index registration) is
|
||||
present here too. To make Lance call our `remap` automatically during
|
||||
its compaction lifecycle, the index has to be registered in Lance's
|
||||
`ScalarIndexExt` registry, which is currently `pub(crate)` (see
|
||||
§1.2 writeup at `.context/experiments/custom-lance-index.md`).
|
||||
|
||||
**Two viable paths:**
|
||||
|
||||
#### Path A — Lance-managed remapping (blocked on registry).
|
||||
|
||||
If we land the Lance plugin-registry contribution from §1.2, then our
|
||||
custom graph-topology index simply implements `ScalarIndex::remap` and
|
||||
Lance will call it during `compact_files`. **Pre-condition: §1.2 must
|
||||
ship first.** Effort: ~50 LoC for the remap impl.
|
||||
|
||||
#### Path B — OmniGraph-managed remapping (works today). ✅
|
||||
|
||||
Without the Lance plugin registry, OmniGraph itself can drive the
|
||||
remapping:
|
||||
|
||||
1. Before calling Lance's `compact_files` on a sub-table, we record the
|
||||
current `Dataset::manifest.version`.
|
||||
2. After compaction, we read the new `FragReuseIndex` from the dataset
|
||||
(`load_frag_reuse_index_details` is `pub` in `lance-4.0.1/src/index/frag_reuse.rs:27`).
|
||||
3. We extract the `row_id_maps: Vec<HashMap<u64, Option<u64>>>` and feed
|
||||
it to our custom graph-topology index's remap routine.
|
||||
4. Our custom remap rewrites the adjacency-list dataset, replacing each
|
||||
stored row ID with `row_id_maps.iter().fold(id, ...)`.
|
||||
|
||||
**The Lance APIs we depend on for Path B:**
|
||||
|
||||
- `Dataset::manifest.uses_stable_row_ids() -> bool` — gate.
|
||||
- `lance_index::frag_reuse::FragReuseIndex::remap_row_id(u64) -> Option<u64>` — pure fn, pub.
|
||||
- `lance_index::frag_reuse::load_frag_reuse_index_details(...)` — pub.
|
||||
- `lance::dataset::Dataset::checkout_version(u64)` — pub, for snapshot.
|
||||
|
||||
**All of these are public.** Path B unblocks us today; Path A is a
|
||||
strict improvement we can ship later.
|
||||
|
||||
### F6. Inverted index remap shows the pattern in full. ✅
|
||||
|
||||
`lance-index-4.0.1/src/scalar/inverted/builder.rs:336`:
|
||||
|
||||
```rust
|
||||
pub async fn remap(
|
||||
&mut self,
|
||||
mapping: &HashMap<u64, Option<u64>>,
|
||||
...
|
||||
) -> Result<...> {
|
||||
// Rewrites the postings, applying the mapping in-place.
|
||||
// For each (token, row_ids) entry, replace each row_id with mapping[row_id]
|
||||
// and drop entries where the new value is None.
|
||||
}
|
||||
```
|
||||
|
||||
This is **exactly the shape our graph-topology remap will take**:
|
||||
|
||||
```rust
|
||||
async fn remap(&self, mapping: &HashMap<u64, Option<u64>>, ...) -> Result<...> {
|
||||
let new_edges = self.edges.iter()
|
||||
.filter_map(|edge| {
|
||||
let new_src = mapping.get(&edge.src_id).copied().unwrap_or(Some(edge.src_id))?;
|
||||
let new_dst = mapping.get(&edge.dst_id).copied().unwrap_or(Some(edge.dst_id))?;
|
||||
Some(Edge { src_id: new_src, dst_id: new_dst, ..edge })
|
||||
})
|
||||
.collect();
|
||||
// Write new_edges to dest_store.
|
||||
}
|
||||
```
|
||||
|
||||
## Small repro plan (for Phase 0 entry)
|
||||
|
||||
The code-dive is complete; what remains is a **small repro** that
|
||||
demonstrates end-to-end survival. Specification:
|
||||
|
||||
1. **Setup:** Create two Lance datasets `Person` and `KnowsEdge` both
|
||||
with `enable_stable_row_ids: true`. Insert 10K rows in each.
|
||||
2. **Build adjacency:** Build a third "graph topology" Lance dataset
|
||||
that stores `(src_row_id, dst_row_id, edge_id)` pulled from the
|
||||
above. This is the "custom index" payload.
|
||||
3. **Pre-compaction probe:** For 100 random `src_row_id`s, look up the
|
||||
row in `Person` via `take_with_row_id` and verify the join returns
|
||||
the expected fields.
|
||||
4. **Trigger compaction:** Run `Dataset::optimize(...)` on `Person` with
|
||||
parameters that force fragment consolidation. Verify
|
||||
`dataset.manifest.version` advanced and `FragReuseIndex` is present.
|
||||
5. **Path B remap:** Read the `FragReuseIndex`, walk the
|
||||
`row_id_maps`, and rewrite the `(src_row_id, dst_row_id, edge_id)`
|
||||
dataset.
|
||||
6. **Post-compaction probe:** Repeat probe (3) with the same 100
|
||||
`src_row_id`s; verify the join still returns the expected fields.
|
||||
**Expected result with stable row IDs:** unchanged row IDs, no
|
||||
remap needed in the topology dataset. Just verify, don't rewrite.
|
||||
7. **Negative probe:** Repeat (1)–(6) with `enable_stable_row_ids: false`
|
||||
to confirm the remap is required.
|
||||
|
||||
Estimated effort: 1–2 days. **Defer to Phase 0**; the code-dive
|
||||
already justifies §5.4 as feasible without the repro.
|
||||
|
||||
## Decision impact on MR-737 §5.4 and §5.10
|
||||
|
||||
**§5.4 (graph topology index) is feasible on Lance 4.0.1 with stable
|
||||
row IDs (Path B):**
|
||||
|
||||
- No Lance plugin-registry dependency; we drive remapping ourselves.
|
||||
- The custom topology dataset stores stable row IDs end-to-end; the
|
||||
bulk of compaction-induced changes don't require remap.
|
||||
- Path A (Lance-managed remapping) is a follow-up improvement
|
||||
contingent on the §1.2 contribution.
|
||||
|
||||
**§5.10 (custom index types):** No new findings beyond §1.2. The
|
||||
`ScalarIndex::remap` contract is sufficient and stable.
|
||||
|
||||
**Open Q6 ("survive compaction"):** Answered yes. The recommendation is
|
||||
**Path B for v1, Path A for v2**. RFC §5.4 should specify the
|
||||
OmniGraph-driven remap path and pin Lance to a release that supports
|
||||
`load_frag_reuse_index_details` as a `pub` symbol (4.0.1+ confirmed).
|
||||
|
||||
## Caveats and follow-ups
|
||||
|
||||
- **No repro built yet.** Per the ticket, §1.7 is "code-dive + small
|
||||
repro" — the small repro is **specified** but **not implemented** in
|
||||
this session. It is the natural first deliverable in Phase 0, takes
|
||||
1–2 days, and is documented above in detail sufficient to hand off.
|
||||
Skipping it does not invalidate the §5.4 design; it just doesn't
|
||||
prove the stable-row-IDs claim end-to-end.
|
||||
|
||||
- **`FragReuseIndex` schema may evolve.** If Lance ever changes the
|
||||
shape of `row_id_maps` (e.g. encodes them differently in
|
||||
`_indices/<uuid>/frag_reuse.bin`), our Path B implementation has to
|
||||
re-link. The current shape (`Vec<HashMap<u64, Option<u64>>>`) is
|
||||
stable in 4.0.1. Pin the Lance version, watch upstream changelog.
|
||||
|
||||
- **Stable row IDs cost ~12 bytes per row in `_row_id_sequences/`**.
|
||||
At 1B rows per dataset, this is ~12 GB. Worth measuring at our scale
|
||||
before assuming free. Lance docs claim "negligible overhead" but our
|
||||
scale may be in the long tail of "negligible".
|
||||
|
||||
- **Path B has a write-amplification cost.** Every time
|
||||
`compact_files` runs on a sub-table, our graph-topology dataset
|
||||
must be re-scanned and re-written. For a 10M-row topology, this is
|
||||
a 100MB rewrite — small but worth scheduling outside the hot path
|
||||
(background reconciler at the same cadence as compaction itself).
|
||||
|
||||
- **Path A (Lance-managed) is materially better long-term.** When the
|
||||
§1.2 plugin registry lands, switch. Until then, Path B is
|
||||
production-ready and OSS-compatible.
|
||||
258
.context/experiments/txn-branches-cost.md
Normal file
258
.context/experiments/txn-branches-cost.md
Normal file
|
|
@ -0,0 +1,258 @@
|
|||
# Experiment 1.6 — Lance native per-table txn branches (code-dive, cost model)
|
||||
|
||||
**Ticket:** MR-925 §1.6 (validates MR-737 §5.11, §5.12 / Open Q5).
|
||||
**Type:** Code-dive only — no prototype crate.
|
||||
**Substrate pin:** Lance 4.0.1.
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
---
|
||||
|
||||
## Question
|
||||
|
||||
MR-737's "per-table txn branches" proposal (§5.12 if I recall it correctly,
|
||||
referenced in MR-925) suggests OmniGraph could lean on **Lance's native
|
||||
per-dataset branches** for transactional isolation, rather than the
|
||||
current OmniGraph-level "lazy graph branch" mechanism that touches every
|
||||
sub-table. The question this code-dive answers: **what does it cost?**
|
||||
|
||||
What does each per-table txn branch actually require, in Lance 4.0.1?
|
||||
And how does that scale to a write that touches N tables?
|
||||
|
||||
## TL;DR
|
||||
|
||||
Lance 4.0.1 per-dataset branches are **cheap to create individually**
|
||||
(one shallow-clone manifest copy + one `_refs/branches/<name>.json`
|
||||
write — both small) and **cheap to operate on** (every read/write goes
|
||||
through the same code path as `main`, no branch-specific dispatch).
|
||||
**But the multi-write coordination overhead is borne by us**: a
|
||||
graph-level txn that touches N tables on a Lance-native branch must
|
||||
write N `_refs/branches/...json` files at branch-create time and
|
||||
N `commit_staged` entries to N `__manifest` rows at publish time.
|
||||
Compared to the current OmniGraph lazy-fork model
|
||||
(N writes to `__manifest` only, no `_refs/` writes), the steady-state
|
||||
cost is essentially the same and the create-time cost is N additional
|
||||
small JSON writes — **negligible at our scale**.
|
||||
|
||||
**Recommendation: Lance-native per-table txn branches are usable**, but
|
||||
they would be a **side-grade**, not a clean win, over the existing
|
||||
lazy-graph-branch model. The argument for adopting them is conceptual
|
||||
clarity, not performance.
|
||||
|
||||
## Findings
|
||||
|
||||
### F1. Branch creation is a single shallow-clone commit. ✅
|
||||
|
||||
From `lance/src/dataset.rs:477`:
|
||||
|
||||
```rust
|
||||
pub async fn create_branch(
|
||||
&mut self,
|
||||
branch: &str,
|
||||
version: impl Into<refs::Ref>,
|
||||
store_params: Option<ObjectStoreParams>,
|
||||
) -> Result<Self> {
|
||||
let (source_branch, version_number) = self.resolve_reference(version.into()).await?;
|
||||
let branch_location = self.branch_location().find_branch(Some(branch))?;
|
||||
let clone_op = Operation::Clone {
|
||||
is_shallow: true,
|
||||
ref_name: source_branch.clone(),
|
||||
ref_version: version_number,
|
||||
ref_path: String::from(self.uri()),
|
||||
branch_name: Some(branch.to_string()),
|
||||
};
|
||||
let transaction = Transaction::new(version_number, clone_op, None);
|
||||
// ... executes CommitBuilder, then ...
|
||||
self.branches().create(branch, version_number, source_branch.as_deref()).await?;
|
||||
Ok(dataset)
|
||||
}
|
||||
```
|
||||
|
||||
Cost: **one new Lance manifest** (shallow-cloned) + **one BranchContents
|
||||
JSON write** (`_refs/branches/<branch>.json`). The branch manifest does
|
||||
**not** copy fragment files — it inherits them from the parent branch.
|
||||
|
||||
Wire size: typical manifest is 1–10 KiB; BranchContents is ~200 bytes.
|
||||
On S3 with PutObject: **2 PUT requests per branch creation**.
|
||||
|
||||
### F2. Branch creation is NOT atomic with BranchContents. ⚠️
|
||||
|
||||
From `lance/src/dataset.rs:462`:
|
||||
|
||||
> This is a two-phase operation:
|
||||
> - Create the branch dataset by shallow cloning.
|
||||
> - Create the branch metadata (a.k.a. `BranchContents`).
|
||||
>
|
||||
> These two phases are not atomic. We consider `BranchContents` as the
|
||||
> source of truth for the branch.
|
||||
|
||||
If `create_branch` fails between phase 1 (shallow clone manifest) and
|
||||
phase 2 (BranchContents), Lance leaves a "zombie branch dataset" that
|
||||
must be reaped via `force_delete_branch` or the cleanup job.
|
||||
|
||||
**Implication for our recovery audit:** if MR-737 adopts per-table txn
|
||||
branches, our `__recovery/<ulid>.json` sidecar pattern must account for
|
||||
phase-1-only failures across multiple tables. Currently the sidecar
|
||||
tracks Lance commit IDs per table; it must additionally track
|
||||
*provisional branch names* per table so the recovery sweep can call
|
||||
`force_delete_branch` on the right names.
|
||||
|
||||
### F3. Branch identifiers carry lineage. ✅
|
||||
|
||||
From `lance/src/dataset/refs.rs:670`:
|
||||
|
||||
```rust
|
||||
pub struct BranchIdentifier {
|
||||
pub version_mapping: Vec<(u64, String)>,
|
||||
}
|
||||
```
|
||||
|
||||
Each branch identifier is a list of `(parent_version, uuid)` pairs.
|
||||
Forking a branch off another branch appends a new `(parent_version, uuid)`
|
||||
pair. This is **exactly the lineage tracking we'd need** for cross-table
|
||||
coherence — and Lance already enforces invariants during cleanup that
|
||||
ancestor branches cannot be deleted while descendants exist.
|
||||
|
||||
We could use the `BranchIdentifier` directly as the multi-table txn-id
|
||||
key. **Win.**
|
||||
|
||||
### F4. Per-table branches do NOT share fragments cross-table. ❌
|
||||
|
||||
Lance's shallow-clone shares **fragment files within one dataset**. It
|
||||
does *not* share anything across datasets. If a graph-level txn creates
|
||||
per-table branches on N tables, each branch is its own independent
|
||||
shallow clone — there's no Lance primitive that says "these N branches
|
||||
form a coherent txn". We provide the coherence in our own layer (today
|
||||
via the `__manifest` table; under the proposal via a new
|
||||
`__branches.<txn_id>` table or similar).
|
||||
|
||||
### F5. Multi-write commit cost is dominated by `commit_staged` per table. ✅
|
||||
|
||||
From `lance/src/dataset.rs`, every write path eventually calls
|
||||
`commit_staged`, which does:
|
||||
|
||||
1. Read current manifest (1 GET).
|
||||
2. Write fragments (variable; payload-dependent).
|
||||
3. CAS the new manifest version (1 PUT-If-Match).
|
||||
4. (If indices were affected) write `_indices/...` entries.
|
||||
|
||||
This is **the same cost** whether the dataset is on `main` or on
|
||||
`branch/feature-X`. There's no per-branch overhead at commit time.
|
||||
|
||||
### F6. Reading a branch is identical to reading main. ✅
|
||||
|
||||
`Dataset::checkout_branch(name)` just rebinds the dataset's
|
||||
`manifest.branch` field. All subsequent `scan`, `take`, `count_rows`,
|
||||
`merge`, etc. are unmodified. No branch-aware planning, no
|
||||
branch-specific filter rewriting. The branch is purely a versioning
|
||||
namespace.
|
||||
|
||||
### F7. Branch merging (the missing piece). ❓
|
||||
|
||||
Lance 4.0.1 does **not** provide `merge_branch(target, source)` as a
|
||||
public API. The only "merge" in Lance is `merge_insert` (which is a
|
||||
row-level upsert on the same branch). For graph-level branch
|
||||
*merges* (three-way merge of all sub-tables across a branch), we
|
||||
continue to use our existing `crates/omnigraph/src/db/branch_merge.rs`
|
||||
which:
|
||||
- Diffs the source branch against the merge base.
|
||||
- Re-applies each delta as a fresh commit on the target branch.
|
||||
- Coordinates across N tables via `__manifest` CAS.
|
||||
|
||||
**Implication for §5.12:** even with Lance-native per-table branches,
|
||||
the *merge* logic remains at the OmniGraph layer. Per-table branches
|
||||
do not reduce merge complexity.
|
||||
|
||||
## Cost model: per-table txn branches vs. current model
|
||||
|
||||
Steady-state, for a graph txn that touches `N` tables and commits
|
||||
`B` batches per table:
|
||||
|
||||
| Cost dimension | Current (lazy-graph) | Per-table Lance branches |
|
||||
|-----------------------------------------|----------------------|--------------------------|
|
||||
| Branch-create writes (S3 PUTs) | 0 (lazy) | 2·N (manifest + BranchContents) |
|
||||
| Per-batch fragment write (S3 PUTs) | B·N | B·N |
|
||||
| Per-batch manifest CAS | B·N | B·N |
|
||||
| Per-txn `__manifest` cross-table CAS | 1 | 1 (we still need it for coherence) |
|
||||
| Branch-delete writes (S3 PUTs) | 0 (lazy) | 2·N |
|
||||
| Recovery sidecar size | O(N) lines | O(N) lines + branch names |
|
||||
|
||||
**Net cost of moving to per-table Lance branches: +4·N S3 PUTs per
|
||||
graph txn** (2·N at create, 2·N at delete), independent of B. For
|
||||
N = 10 tables, this is 40 extra PUTs per txn lifecycle. At S3 standard
|
||||
pricing (~$5/M PUTs), this is $0.00002 per txn — **negligible**.
|
||||
|
||||
The current "lazy fork" model wins on **zero** at txn-create when the
|
||||
txn is going to abort or read-only; the Lance-native model pays the
|
||||
2·N upfront. For a workload where many txns abort early (e.g.
|
||||
optimistic concurrency with high contention), the current model is
|
||||
cheaper.
|
||||
|
||||
## What does Lance-native buy us, then?
|
||||
|
||||
1. **Branch identifier lineage tracking comes for free.** We currently
|
||||
maintain our own commit DAG; with Lance-native branches, lineage
|
||||
travels with the branch identifier (`Vec<(u64, String)>`).
|
||||
|
||||
2. **Branch-scoped cleanup gets simpler.** Lance's cleanup job
|
||||
understands the branch graph and refuses to delete versions
|
||||
reachable from any branch. We currently maintain our own retention
|
||||
logic in `branch_lineage_tests`; Lance-native gets us
|
||||
`lance::dataset::cleanup` for free.
|
||||
|
||||
3. **Recovery audit gets noisier.** The two-phase `create_branch`
|
||||
atomicity gap (F2) means our recovery sweep must reap zombie
|
||||
*branches*, not just zombie *commits*. This is more state to track,
|
||||
not less.
|
||||
|
||||
## Decision impact on MR-737 §5.11 and §5.12
|
||||
|
||||
**§5.11 (per-table txn isolation):** Lance-native branches **can**
|
||||
implement this, but the steady-state cost is essentially the same as
|
||||
the current lazy-graph-branch model, and the abort-path cost is *higher*.
|
||||
The conceptual clarity argument is real but not load-bearing — both
|
||||
models provide the same isolation guarantee.
|
||||
|
||||
**§5.12 (single-mechanism reads + writes via Source operators):** No
|
||||
change. Per-table Lance branches don't affect the planner or the
|
||||
operator surface — `Dataset::checkout_branch` is invisible from the
|
||||
planner's perspective.
|
||||
|
||||
**Open Q5 (cost of per-table txn branches at our scale):** Answered.
|
||||
Per-table Lance branches are **+4·N S3 PUTs per txn lifecycle** and
|
||||
**+1 zombie-branch cleanup case** in the recovery sweep. Both small.
|
||||
|
||||
## Recommendation
|
||||
|
||||
**Keep the current lazy-graph-branch model for v1**. Revisit
|
||||
Lance-native per-table branches only if:
|
||||
|
||||
- Cleanup logic complexity becomes a maintenance burden (then the
|
||||
Lance-native lineage tracking pays off).
|
||||
- We need to expose branch-level operations to external tools that
|
||||
speak Lance directly (then having `_refs/branches/<name>.json` on
|
||||
each sub-table is necessary).
|
||||
- The graph-level branch creation latency becomes a user-visible
|
||||
issue (we'd need to measure this; the current implementation is
|
||||
"lazy" so most branches don't fork most tables).
|
||||
|
||||
For Phase 0, the deliverable is a **clear specification** of which model
|
||||
MR-737 §5.11/§5.12 prescribes. The cost analysis above suggests
|
||||
specifying the current model.
|
||||
|
||||
## Caveats and follow-ups
|
||||
|
||||
- **No real-world latency measurements.** This cost model is based on
|
||||
S3 PUT counts and assumes typical S3 latency. Real measurement at our
|
||||
workload (10+ tables, ~100 txns/sec/cluster) would be more accurate.
|
||||
Estimated benchmark effort: 1 day with a mocked S3 backend.
|
||||
- **The two-phase atomicity gap (F2) is documented by Lance but not
|
||||
fixed.** A future Lance release may introduce single-phase branch
|
||||
creation; if so, the recovery-cost column in the table above improves.
|
||||
- **`merge_branch` API absence is not surprising** — merge semantics
|
||||
are application-specific (we need three-way row-level merge with
|
||||
conflict resolution rules from §IX, which Lance has no opinion on).
|
||||
This will remain at the OmniGraph layer regardless.
|
||||
- **Forbidden APIs file** (`crates/omnigraph/tests/forbidden_apis.rs:57`)
|
||||
excludes `.delete_branch(` from the over-match check — there's
|
||||
intent in the codebase to allow these calls. Worth a follow-up read
|
||||
to see if MR-737 §5.11 has already opened the door.
|
||||
Loading…
Add table
Add a link
Reference in a new issue