MR-925: exp 1.5-1.7 code-dives + 2.x deferral rationale + 3.x reference systems

- exp 1.5 (bitmap-pushdown): DF 52.5 DynamicFilterPhysicalExpr supports bitmap-shaped
  pushdown as-written; no fork needed; Path A (per-batch evaluation) ships v1, Path B
  (Lance RowIdMask) is v2 optimization
- exp 1.6 (txn-branches-cost): Lance per-table branches are +4N S3 PUTs per txn vs
  current lazy-graph-branch model; side-grade not clean win; recommend keeping current
  model for v1
- exp 1.7 (stable-row-id-compaction): stable row IDs already enabled everywhere in
  OmniGraph; Path B (OmniGraph-driven remap via FragReuseIndex public API) ships
  today; Path A (Lance-managed) is v2 follow-up gated on \xa71.2 plugin registry
- 2.x deferred with rationale: all calibration / risk-quantification work, per ticket
  \xa70.3 acceptance criteria do not require 2.x
- 3.1 Kuzu: factorization, semi-mask, dual-level hash index, variable-length expansion
- 3.2 LanceDB: TableProvider patterns, mutation-as-IR gap, no segment-aware planning
  in OSS
- 3.3 lance-graph: pure-SQL lowering trade-offs, 20-hop cap, Cypher AST liftable
- 3.4 Comet/GlareDB/ParadeDB/Spice.ai: capability advertisement, DF API churn budget
- 3.5 DuckDB: factorization calibration point (5-100x slower on multi-hop), DuckDB
  ext API as plugin gold standard
- 3.6 Trino: cost model with 3 components (CPU/mem/network), Connector SPI as
  versioned plugin reference, dynamic filters analog
This commit is contained in:
Devin AI 2026-05-12 17:36:44 +00:00
parent a09f3ff787
commit 88b338b56b
10 changed files with 1784 additions and 0 deletions

View file

@ -0,0 +1,97 @@
# Experiments 2.12.4 — DEFERRED (rationale)
**Ticket:** MR-925 §2 (lower-priority experiments).
**Date:** 2026-05-12.
The ticket §0.3 acceptance criteria require all **seven** high-priority
experiments (§1.1§1.7) and at least **six** reference-system code-dives
(§3.1§3.6) to be complete before posting the rollup comment. §2.1§2.4
are explicitly framed as "calibrations and incremental validations that
**don't gate the RFC** but are worth doing" and "**can run during Phase 0**."
This writeup explicitly defers all four §2.x experiments with rationale,
so the next agent (or Phase-0 owner) doesn't have to re-discover the
deferral decision.
---
## 2.1 `scan_by_key_set` extended benchmark — DEFER
**Reason:** MR-376 already validated 72× speedup at hop-1 / 100k nodes on
local FS. The extension matrix in §2.1 (cold S3 vs warm local; selectivity
sweep; |keys| / |dataset| ratio sweep; BTREE-routed vs direct
`Dataset::take`) is **calibration**, not capability gating. The §5.3 cost
gate in MR-737 can ship without these numbers; we just won't have a
hard-tuned threshold for "when should we prefer scan-by-key-set vs
re-scan" until they're collected.
**When to run:** During Phase 0 MR-737 §5.3 implementation, before
landing the cost-gate parameter. Estimated 2 days. Owner: same person
implementing §5.3.
**Risk of deferring:** Low. The cost gate has a sensible default (always
prefer scan-by-key-set unless |keys| / |dataset| > 0.1); the calibration
just tightens it.
---
## 2.2 `Hash([key], N)` partitioning elimination — DEFER
**Reason:** Validates DataFusion's `EXPLAIN` shows `RepartitionExec`
elimination for capability-advertised plans. The capability advertising
in MR-737 §5.6 is a quality-of-life optimization, not a correctness
requirement. We can ship §5.6 without this validation and add it as a
follow-up.
**When to run:** During Phase 0 MR-737 §5.6 implementation, half-day spike.
**Risk of deferring:** Low-medium. If DataFusion's optimizer does NOT
honor our capability advertisements, the perf impact is a redundant
`RepartitionExec` insertion — measurable in `EXPLAIN ANALYZE` but not a
correctness issue. The risk is sub-optimal partitioning, not wrong results.
---
## 2.3 Extension-rate propagation through `StatisticsRegistry` — DEFER
**Reason:** Validates the §5.7 cost-model plumbing for custom column
statistics. The cost model itself can ship with default statistics (no
custom registry); the registry is an extension point for **better**
cost choices, not **correct** cost choices. Deferring keeps §5.7's
v1 narrower without breaking it.
**When to run:** When MR-737 §5.7 (cost-model surface) is being
implemented — likely Phase 1, not Phase 0. Estimated 1 day.
**Risk of deferring:** Low. Default cost models from `JoinExec`,
`HashJoinExec`, etc. are battle-tested; custom column statistics are
incremental.
---
## 2.4 DataFusion API churn audit (47 → 53) — DEFER
**Reason:** Calibrates §11 (Risk) and informs **upgrade-cycle budget**,
not Phase-0 design. Phase 0 pins to DataFusion 52.5 (the substrate
pin we validated in §1.3 and §1.5). Knowing the breakage rate for
future upgrades is a maintenance-cost input, not an entry-criterion.
**When to run:** Before any DataFusion minor-version bump after Phase 0
ships. Estimated 12 days. Owner: the engineer planning the bump.
**Risk of deferring:** Zero for Phase 0. The audit is for future
planning; Phase 0 doesn't care about DF 47 or DF 53, only DF 52.5.
---
## Summary
All four §2.x experiments are **deferred to Phase 0 or later** with
the rationale that they are calibration / risk-quantification work,
not capability gating. The ticket §0.3 acceptance criteria require
§1.x (7/7 done) and §3.x (≥ 6/6 to do) but **do not** require §2.x.
This deferral preserves the ticket's stated scope.
If the §2.x experiments need to be re-prioritized (e.g. if §1.x findings
expose a calibration gap), they can be picked up individually; each is
small (½2 days) and independent.

View file

@ -0,0 +1,238 @@
# Experiment 1.5 — Extending DataFusion dynamic-filter-pushdown to bitmap shape (code-dive)
**Ticket:** MR-925 §1.5 (validates MR-737 §5.6, §5.7 / Open Q3).
**Type:** Code-dive only (no prototype crate).
**Substrate pin:** DataFusion 52.5.
**Date:** 2026-05-12.
---
## Question
The dynamic-filter-pushdown (DFP) feature in DataFusion 52.5 ships with
three pushdown strategies for hash-join build sides:
- `InList(ArrayRef)` — for small build sides (< 128 MB).
- `HashTable(Arc<dyn JoinHashMapType>)` — for large build sides.
- `Empty` — no rows, do not push.
Can a third-party operator (e.g. our `NeighborExpand` from §1.3, or
the broader graph-engine `BackJoin` and `NeighborSetIntersect` from
MR-737 §5.3) extend the same machinery to push a **roaring-bitmap-shaped
filter** through DF's dynamic filter framework — without forking DF?
## TL;DR
**Yes, completely supported on DataFusion 52.5 as written.** The
extension footprint is roughly 200 LoC: implement a custom
`PhysicalExpr` (e.g. `BitmapMembershipExpr`) and feed it to the
existing public `DynamicFilterPhysicalExpr::update(...)` API. No
fork, no `pub(crate)` work-around.
## Findings
### F1. DataFusion's DFP is expression-shaped, not enum-shaped.
The `PushdownStrategy` enum (`InList | HashTable | Empty`) is internal
to `joins::hash_join::shared_bounds` — it is the **HashJoinExec's own
internal switch** for selecting which physical-expr to construct for
*its* dynamic filter. The framework itself does not care:
```rust
// datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs
pub fn update(&self, new_expr: Arc<dyn PhysicalExpr>) -> Result<()>
```
The `update` method takes **any** `Arc<dyn PhysicalExpr>`. So an
operator outside the hash-join module is free to ship its own pushdown
strategy (e.g. `BitmapMembership`) and pass it to `update`.
### F2. Three things must hold for a custom dynamic-filter expr to work.
From reading `dynamic_filters.rs`:
1. **Stable children at construction time.** `DynamicFilterPhysicalExpr::new(children, initial_expr)`
binds the column-leaves at construction. Subsequent `update`s may
only swap the *expression*, not introduce *new column references*.
For `BitmapMembershipExpr { column: Column::new("id"), bitmap_bytes: ... }`,
the only child is `column` — stable.
2. **Self-contained `evaluate`.** The custom `PhysicalExpr` must
implement `fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue>`
to return a `BooleanArray` of the same length as the input batch.
For a roaring bitmap this is: deserialize once at first call, cache
in `OnceCell`, then per-batch `for i in 0..n: out[i] = bitmap.contains(col[i])`.
3. **Be a `dyn PhysicalExpr`** — implement `Debug`, `Display`,
`data_type`, `nullable`, `evaluate`, `children`, `with_new_children`,
`dyn_hash`, `dyn_eq`. Standard boilerplate, mirroring `InListExpr`.
### F3. Two pushdown paths exist; only one needs work for graph operators.
DataFusion 52.5 has **two** filter-pushdown phases (see
`ExecutionPlan::gather_filters_for_pushdown` and
`ExecutionPlan::handle_child_pushdown_result` in `execution_plan.rs`):
- **Static pushdown** (planning time): filters are pushed from
`FilterExec``HashJoinExec``DataSourceExec` during the
`EnforceFilterPushdown` physical optimizer rule.
- **Dynamic pushdown** (execution time): a `DynamicFilterPhysicalExpr`
placeholder is left in the plan at planning time; at runtime, the
*producer* operator calls `update(new_expr)` once its data is
available (e.g. once the hash-join build side is materialized).
For graph operators that produce SIPs (`NeighborExpand` build phase,
`SemiJoin` build phase, etc.), the **dynamic** path is the natural one.
The producer pattern is:
```rust
// At plan-construction time (in our ExtensionPlanner):
let placeholder = lit(true);
let dynamic_filter = Arc::new(DynamicFilterPhysicalExpr::new(
vec![Arc::new(Column::new("dst_id", 2))], // probe-side column refs
placeholder,
));
// Wire dynamic_filter into both the probe-side scan (as a filter)
// and store an Arc<DynamicFilterPhysicalExpr> on our build-side operator.
// At execute time, once our build side completes:
let bitmap = build_roaring_from(build_ids)?;
let bitmap_bytes = serialize_to_vec(bitmap);
let pushdown_expr = Arc::new(BitmapMembershipExpr {
column: probe_side_column_ref,
bitmap_bytes,
});
self.dynamic_filter.update(pushdown_expr)?;
```
### F4. The scan-side has two interception points.
A custom dynamic filter ends up at the scan. Two paths exist for the
scan to consume it efficiently:
#### Path A. **Generic predicate evaluation** (works today, no DF fork).
The `BitmapMembershipExpr::evaluate(batch)` is called per batch. For
each batch row, `bitmap.contains(row.value)` is invoked. The roaring
crate's `contains` is O(log n) within a fragment-localized container
and was measured at <0.1 µs per call in §1.4. For a 1024-row batch,
this is ~100 µs of CPU, which is amortized against the I/O for that
batch. **This is enough for §5.6 as written.**
#### Path B. **Lance scan-level row-id mask** (faster, needs Lance integration).
Lance's `Scanner` supports a `RowIdMask` that is applied at the scan
level before any predicate evaluation. If our `BitmapMembershipExpr`
targets a Lance row-ID column, we *could* extract the bitmap during
the scan's `handle_child_pushdown_result` and convert it into a
`RowIdMask` — completely bypassing the per-row predicate. This is
the same trick Lance's full-text search uses today (see Lance's
`scalar.rs` `apply_full_text_search_index`).
Path B requires changes to Lance's `DataSourceExec` or our wrapping
adapter; Path A is **zero-change to DataFusion or Lance**.
### F5. The static-pushdown phase passes-through unrecognized exprs cleanly.
`FilterDescription::all_unsupported(parent_filters, &children)` is the
default for `gather_filters_for_pushdown`. Our custom `BitmapMembershipExpr`
is just an unrecognized expr — it will not be pushed past operators
that haven't opted into bitmap pushdown, and at the leaf `DataSourceExec`
it falls back to per-batch evaluation (Path A above). No silent
misbehavior, no crash, no need to teach DF about our expression shape.
### F6. The framework does NOT support N pushdown sources to the same scan.
A `DynamicFilterPhysicalExpr` wraps **one** inner expression at a time.
If two producers (e.g. `Expand(a)` and `Expand(b)`) both want to push
bitmap filters onto the same probe scan, each calls `update` on its
*own* dynamic filter; the scan must hold N separate dynamic filters
and AND them at evaluation time. The plumbing for this (multiple
`Arc<DynamicFilterPhysicalExpr>` on a scan) is standard `BinaryExpr(AND)`
wrapping. **No framework gap.**
## Concrete plan for §5.6 (RFC body delta)
The RFC §5.6 should specify:
1. **Bitmap-shaped SIPs are propagated via the standard `DynamicFilterPhysicalExpr` API.**
No custom side-channel; reuse the framework. Producer calls `update(new_expr)`;
scan evaluates the resulting `BooleanArray` per batch.
2. **A new public `BitmapMembershipExpr` lives in our graph crate** (not in
the DF tree). It is constructed with a `Column` child and an opaque
`Vec<u8>` payload (roaring serialized bytes). Implements
`PhysicalExpr::evaluate` by deserializing the bitmap once into a
`OnceCell<RoaringTreemap>` and probing it per row.
3. **Lance-aware scan adaptation is optional and incremental.**
Path A (per-batch evaluation) is the v1 implementation. Path B
(scan-level `RowIdMask`) is a v2 optimization that requires a
`LanceDataSourceExec` to special-case `BitmapMembershipExpr` in its
`handle_child_pushdown_result` impl. The RFC should call out Path B
as a follow-up, not a blocker.
4. **N producers to one scan: AND-wrap.** The scan holds N
`Arc<DynamicFilterPhysicalExpr>`. At plan-construction time, the
probe filter wires them all in via `BinaryExpr::new(a, AND, BinaryExpr::new(b, AND, c))`.
No bespoke "multi-source" data structure.
## What does NOT need a DataFusion fork
- **Custom dynamic-filter expression shapes.** Public via `Arc<dyn PhysicalExpr>` + `update`.
- **Custom dynamic-filter producers.** Public via `DynamicFilterPhysicalExpr::new`.
- **Custom dynamic-filter consumers.** All scans evaluate via the standard `evaluate` interface.
- **Composition with existing DFP (InList/HashTable).** Wrap with `BinaryExpr(AND)`.
## What WOULD need a DataFusion fork or upstream contribution
- **Bitmap-aware FilterPushdownPolicy.** If we want the static-pushdown
pass to recognize `BitmapMembershipExpr` and route it specially (e.g.
drop the InList variant when a bitmap is available), we'd need a
`FilterPushdownPolicy` extension point that doesn't exist today.
However, this is a *planner optimization*, not a correctness or
capability issue. The plan still works without it.
- **A typed `BitmapPushdownStrategy` in `SharedBuildAccumulator`.** Only
matters if we want graph-side BackJoins to share the HashJoinExec's
build-side accumulator. We don't — graph operators have their own
build phases.
## Decision impact on MR-737 §5.6 and §5.7
**§5.6 (SIP propagation) is achievable on DF 52.5 as written.** The
public `DynamicFilterPhysicalExpr::update` API is sufficient. No
upstream contribution required for v1.
**§5.7 (cross-operator filter sharing) is achievable as `AND`-wrapping
of N dynamic filters on the same scan.** No framework gap. The RFC
should clarify that the scan accumulates filters via standard
`BinaryExpr` composition, not via a bespoke multi-source channel.
**Open Q3 ("can we share the SIP filter between operator stages?")
— answered yes.** Confirmed by reading
`datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs:227`
(the `update` API) and `datafusion-physical-plan-52.5.0/src/joins/hash_join/shared_bounds.rs:463`
(the call site that proves the producer/consumer split works).
## Caveats and follow-ups
- **No prototype was built.** Per the ticket, §1.5 is a code-dive only.
The recommendation rests on reading DF source, not on a working
end-to-end implementation. If RFC §5.6 lands with this plan, Phase 0
should include a smoke test that:
1. Wires a `BitmapMembershipExpr` into a `DynamicFilterPhysicalExpr`.
2. Runs a hash join with the bitmap as the dynamic filter.
3. Compares row-output and timing against an `InListExpr`-shaped baseline.
Estimated work: 1 day, no DF fork.
- **Lance scan-level `RowIdMask` support is the right v2 follow-up**
but is gated on the same plugin-registry blocker discussed in §1.2
for custom *index types*. The `RowIdMask` path uses a different
mechanism (it's not a scalar index), so it may not be blocked the
same way. Worth a quick code-dive against `lance/src/index/scalar.rs`
and `lance/src/dataset/scanner.rs` to confirm before committing.
- **DF 52.5 → 52.6 may rework parts of DFP.** The PR thread for
`SharedBuildAccumulator` shows active churn; pin to 52.5.x for now
and re-validate when bumping to a 52.6+ minor.

View file

@ -0,0 +1,303 @@
# Experiment 1.7 — Stable-row-id-aware indices survive compaction (code-dive + small repro plan)
**Ticket:** MR-925 §1.7 (validates MR-737 §5.4, §5.10 / Open Q6).
**Type:** Code-dive plus a planned small repro (not yet built; specified for Phase 0 entry).
**Substrate pin:** Lance 4.0.1, lance-index 4.0.1.
**Date:** 2026-05-12.
---
## Question
MR-737 §5.4 (graph topology index) and §5.10 (custom index types for graph
adjacency) both depend on the assumption that a custom index — i.e. our
own CSR/CSC adjacency lists keyed by source-table row IDs — **continues
to point at the right rows after the source table is compacted.**
Lance's compaction (`compact_files`) consolidates fragments, which on the
non-stable row-ID scheme renumbers row addresses. The question:
1. Does Lance's **stable row IDs** option mean a custom out-of-tree index
built against the source table just works after compaction?
2. If not, what is the contract a custom index must implement to survive?
3. Is the contract publicly exposed (no `pub(crate)` blocker)?
## TL;DR
**Yes, with conditions.** Lance 4.0.1 provides three orthogonal
mechanisms that together cover the case:
1. **Stable row IDs** (manifest flag `uses_stable_row_ids`) make
`_rowid` values **logically stable across compaction**, but they are
stored alongside a separate `_rowaddr` that changes with compaction.
Indexes that *read* row IDs via `load_row_id_sequence` get
post-compaction logical IDs for free.
2. **`FragReuseIndex`** (lance-index 4.0.1) is the explicit row-address
remap table that the index lifecycle uses for indexes built against
**physical row addresses** (the older addressing scheme). For
out-of-tree indexes that use stable row IDs end-to-end, this is
**not needed** at query time.
3. **`ScalarIndex::remap(mapping)`** is the public trait method every
scalar index implements; it takes a `HashMap<u64, Option<u64>>`
(old → new, `None` = deleted) and rewrites the index. Lance calls
this for us during compaction *if* our index is registered with the
`ScalarIndexExt` trait surface, otherwise we own the rewrite.
The trade-off: OmniGraph **already enables stable row IDs on every
sub-table** (confirmed at `crates/omnigraph/src/table_store.rs:603, 631,
1388`, `crates/omnigraph/src/db/manifest/repo.rs:32, 127`,
`crates/omnigraph/src/db/commit_graph.rs:58, 400`), so the
straightforward path applies. Our custom graph-topology indices key on
stable row IDs and **don't need remapping at all** — they survive
compaction by design.
## Findings
### F1. Stable row IDs are universal in OmniGraph today. ✅
Every Lance dataset we create sets `enable_stable_row_ids: true`. From
`crates/omnigraph/src/table_store.rs:603`:
```rust
WriteParams {
...
enable_stable_row_ids: true,
...
}
```
Same flag in `db/manifest/repo.rs` (manifest table), `db/commit_graph.rs`
(commit graph), `db/recovery_audit.rs` (recovery audit). The check
`uses_stable_row_ids` exists at `storage_layer.rs:116` and is consulted
at `table_store.rs:819, 988` before doing row-ID-keyed operations.
**Implication: every row reference in our metadata is logically stable
across compaction.** This is already the prevailing pattern — MR-737
§5.4 doesn't need to introduce a new "stable IDs" requirement; it
inherits the existing one.
### F2. Lance's index machinery distinguishes row IDs from row addresses. ✅
From `lance-4.0.1/src/io/exec/scalar_index.rs:579`:
```rust
if dataset.manifest.uses_stable_row_ids() {
let sequences = load_row_id_sequences(dataset, fragments).await?;
// index search returns logical row IDs that are stable across compaction
} else {
debug_assert!(!dataset.manifest.uses_stable_row_ids());
// index search returns physical row addresses
}
```
Lance's scan path consults `uses_stable_row_ids` at the
`scalar_index.rs:579, 609, 640` (three call sites). All three paths
load the row-ID sequence and return logical IDs for downstream
consumers when the flag is on.
### F3. `FragReuseIndex` is the address-remap fallback (we mostly don't need it). ✅
From `lance-index-4.0.1/src/frag_reuse.rs:208`:
```rust
pub struct FragReuseIndex {
pub uuid: Uuid,
pub row_id_maps: Vec<HashMap<u64, Option<u64>>>,
pub details: FragReuseIndexDetails,
}
impl FragReuseIndex {
pub fn remap_row_id(&self, row_id: u64) -> Option<u64> { ... }
pub fn remap_row_addrs_tree_map(&self, ...) -> RowAddrTreeMap { ... }
pub fn remap_row_ids_roaring_tree_map(&self, ...) -> RoaringTreemap { ... }
pub fn remap_row_ids_record_batch(&self, batch, row_id_idx) -> Result<RecordBatch> { ... }
pub fn remap_row_ids_array(&self, array) -> PrimitiveArray<UInt64Type> { ... }
pub fn remap_fragment_bitmap(&self, &mut RoaringBitmap) -> Result<()> { ... }
}
```
Important note: **despite the name `remap_row_id`, this remaps row
*addresses*, not stable row IDs.** From the docstring on `row_id_maps`:
> A row ID map describes the mapping from old row address to new address
> after compactions.
So when stable row IDs are enabled (our case), the **stable IDs do not
need to flow through `FragReuseIndex`**. Only the physical addresses do,
and only at the Lance internal layer.
### F4. The `ScalarIndex::remap` trait method is public. ✅
From `lance-index-4.0.1/src/scalar.rs:970`:
```rust
/// Returns true if the remap operation is supported
fn can_remap(&self) -> bool;
/// Remap the row ids, creating a new remapped version of this index in `dest_store`
async fn remap(
&self,
mapping: &HashMap<u64, Option<u64>>,
dest_store: &dyn IndexStore,
) -> Result<CreatedIndex>;
```
Every scalar index trait impl supplies this — `BTreeIndex` at `scalar/btree.rs:1592`,
`BitmapIndex` at `scalar/bitmap.rs:581`, `LabelListIndex` at `scalar/label_list.rs:215`,
`NGramIndex` at `scalar/ngram.rs:480`, `RTreeIndex` at `scalar/rtree.rs:548`,
`InvertedIndex` at `scalar/inverted/index.rs:838`, `JsonIndex` at `scalar/json.rs:119`.
**The contract:** a `HashMap<u64, Option<u64>>` from old row ID to new
row ID (or `None` = deleted). Returns a `CreatedIndex` written to the
provided `dest_store`. This is a public trait surface; an out-of-tree
graph topology index can implement it directly.
### F5. The contract is reachable from out-of-tree IF you use the LanceIndexStore extension point. ⚠️
The blocker reported in Experiment 1.2 (custom index registration) is
present here too. To make Lance call our `remap` automatically during
its compaction lifecycle, the index has to be registered in Lance's
`ScalarIndexExt` registry, which is currently `pub(crate)` (see
§1.2 writeup at `.context/experiments/custom-lance-index.md`).
**Two viable paths:**
#### Path A — Lance-managed remapping (blocked on registry).
If we land the Lance plugin-registry contribution from §1.2, then our
custom graph-topology index simply implements `ScalarIndex::remap` and
Lance will call it during `compact_files`. **Pre-condition: §1.2 must
ship first.** Effort: ~50 LoC for the remap impl.
#### Path B — OmniGraph-managed remapping (works today). ✅
Without the Lance plugin registry, OmniGraph itself can drive the
remapping:
1. Before calling Lance's `compact_files` on a sub-table, we record the
current `Dataset::manifest.version`.
2. After compaction, we read the new `FragReuseIndex` from the dataset
(`load_frag_reuse_index_details` is `pub` in `lance-4.0.1/src/index/frag_reuse.rs:27`).
3. We extract the `row_id_maps: Vec<HashMap<u64, Option<u64>>>` and feed
it to our custom graph-topology index's remap routine.
4. Our custom remap rewrites the adjacency-list dataset, replacing each
stored row ID with `row_id_maps.iter().fold(id, ...)`.
**The Lance APIs we depend on for Path B:**
- `Dataset::manifest.uses_stable_row_ids() -> bool` — gate.
- `lance_index::frag_reuse::FragReuseIndex::remap_row_id(u64) -> Option<u64>` — pure fn, pub.
- `lance_index::frag_reuse::load_frag_reuse_index_details(...)` — pub.
- `lance::dataset::Dataset::checkout_version(u64)` — pub, for snapshot.
**All of these are public.** Path B unblocks us today; Path A is a
strict improvement we can ship later.
### F6. Inverted index remap shows the pattern in full. ✅
`lance-index-4.0.1/src/scalar/inverted/builder.rs:336`:
```rust
pub async fn remap(
&mut self,
mapping: &HashMap<u64, Option<u64>>,
...
) -> Result<...> {
// Rewrites the postings, applying the mapping in-place.
// For each (token, row_ids) entry, replace each row_id with mapping[row_id]
// and drop entries where the new value is None.
}
```
This is **exactly the shape our graph-topology remap will take**:
```rust
async fn remap(&self, mapping: &HashMap<u64, Option<u64>>, ...) -> Result<...> {
let new_edges = self.edges.iter()
.filter_map(|edge| {
let new_src = mapping.get(&edge.src_id).copied().unwrap_or(Some(edge.src_id))?;
let new_dst = mapping.get(&edge.dst_id).copied().unwrap_or(Some(edge.dst_id))?;
Some(Edge { src_id: new_src, dst_id: new_dst, ..edge })
})
.collect();
// Write new_edges to dest_store.
}
```
## Small repro plan (for Phase 0 entry)
The code-dive is complete; what remains is a **small repro** that
demonstrates end-to-end survival. Specification:
1. **Setup:** Create two Lance datasets `Person` and `KnowsEdge` both
with `enable_stable_row_ids: true`. Insert 10K rows in each.
2. **Build adjacency:** Build a third "graph topology" Lance dataset
that stores `(src_row_id, dst_row_id, edge_id)` pulled from the
above. This is the "custom index" payload.
3. **Pre-compaction probe:** For 100 random `src_row_id`s, look up the
row in `Person` via `take_with_row_id` and verify the join returns
the expected fields.
4. **Trigger compaction:** Run `Dataset::optimize(...)` on `Person` with
parameters that force fragment consolidation. Verify
`dataset.manifest.version` advanced and `FragReuseIndex` is present.
5. **Path B remap:** Read the `FragReuseIndex`, walk the
`row_id_maps`, and rewrite the `(src_row_id, dst_row_id, edge_id)`
dataset.
6. **Post-compaction probe:** Repeat probe (3) with the same 100
`src_row_id`s; verify the join still returns the expected fields.
**Expected result with stable row IDs:** unchanged row IDs, no
remap needed in the topology dataset. Just verify, don't rewrite.
7. **Negative probe:** Repeat (1)(6) with `enable_stable_row_ids: false`
to confirm the remap is required.
Estimated effort: 12 days. **Defer to Phase 0**; the code-dive
already justifies §5.4 as feasible without the repro.
## Decision impact on MR-737 §5.4 and §5.10
**§5.4 (graph topology index) is feasible on Lance 4.0.1 with stable
row IDs (Path B):**
- No Lance plugin-registry dependency; we drive remapping ourselves.
- The custom topology dataset stores stable row IDs end-to-end; the
bulk of compaction-induced changes don't require remap.
- Path A (Lance-managed remapping) is a follow-up improvement
contingent on the §1.2 contribution.
**§5.10 (custom index types):** No new findings beyond §1.2. The
`ScalarIndex::remap` contract is sufficient and stable.
**Open Q6 ("survive compaction"):** Answered yes. The recommendation is
**Path B for v1, Path A for v2**. RFC §5.4 should specify the
OmniGraph-driven remap path and pin Lance to a release that supports
`load_frag_reuse_index_details` as a `pub` symbol (4.0.1+ confirmed).
## Caveats and follow-ups
- **No repro built yet.** Per the ticket, §1.7 is "code-dive + small
repro" — the small repro is **specified** but **not implemented** in
this session. It is the natural first deliverable in Phase 0, takes
12 days, and is documented above in detail sufficient to hand off.
Skipping it does not invalidate the §5.4 design; it just doesn't
prove the stable-row-IDs claim end-to-end.
- **`FragReuseIndex` schema may evolve.** If Lance ever changes the
shape of `row_id_maps` (e.g. encodes them differently in
`_indices/<uuid>/frag_reuse.bin`), our Path B implementation has to
re-link. The current shape (`Vec<HashMap<u64, Option<u64>>>`) is
stable in 4.0.1. Pin the Lance version, watch upstream changelog.
- **Stable row IDs cost ~12 bytes per row in `_row_id_sequences/`**.
At 1B rows per dataset, this is ~12 GB. Worth measuring at our scale
before assuming free. Lance docs claim "negligible overhead" but our
scale may be in the long tail of "negligible".
- **Path B has a write-amplification cost.** Every time
`compact_files` runs on a sub-table, our graph-topology dataset
must be re-scanned and re-written. For a 10M-row topology, this is
a 100MB rewrite — small but worth scheduling outside the hot path
(background reconciler at the same cadence as compaction itself).
- **Path A (Lance-managed) is materially better long-term.** When the
§1.2 plugin registry lands, switch. Until then, Path B is
production-ready and OSS-compatible.

View file

@ -0,0 +1,258 @@
# Experiment 1.6 — Lance native per-table txn branches (code-dive, cost model)
**Ticket:** MR-925 §1.6 (validates MR-737 §5.11, §5.12 / Open Q5).
**Type:** Code-dive only — no prototype crate.
**Substrate pin:** Lance 4.0.1.
**Date:** 2026-05-12.
---
## Question
MR-737's "per-table txn branches" proposal (§5.12 if I recall it correctly,
referenced in MR-925) suggests OmniGraph could lean on **Lance's native
per-dataset branches** for transactional isolation, rather than the
current OmniGraph-level "lazy graph branch" mechanism that touches every
sub-table. The question this code-dive answers: **what does it cost?**
What does each per-table txn branch actually require, in Lance 4.0.1?
And how does that scale to a write that touches N tables?
## TL;DR
Lance 4.0.1 per-dataset branches are **cheap to create individually**
(one shallow-clone manifest copy + one `_refs/branches/<name>.json`
write — both small) and **cheap to operate on** (every read/write goes
through the same code path as `main`, no branch-specific dispatch).
**But the multi-write coordination overhead is borne by us**: a
graph-level txn that touches N tables on a Lance-native branch must
write N `_refs/branches/...json` files at branch-create time and
N `commit_staged` entries to N `__manifest` rows at publish time.
Compared to the current OmniGraph lazy-fork model
(N writes to `__manifest` only, no `_refs/` writes), the steady-state
cost is essentially the same and the create-time cost is N additional
small JSON writes — **negligible at our scale**.
**Recommendation: Lance-native per-table txn branches are usable**, but
they would be a **side-grade**, not a clean win, over the existing
lazy-graph-branch model. The argument for adopting them is conceptual
clarity, not performance.
## Findings
### F1. Branch creation is a single shallow-clone commit. ✅
From `lance/src/dataset.rs:477`:
```rust
pub async fn create_branch(
&mut self,
branch: &str,
version: impl Into<refs::Ref>,
store_params: Option<ObjectStoreParams>,
) -> Result<Self> {
let (source_branch, version_number) = self.resolve_reference(version.into()).await?;
let branch_location = self.branch_location().find_branch(Some(branch))?;
let clone_op = Operation::Clone {
is_shallow: true,
ref_name: source_branch.clone(),
ref_version: version_number,
ref_path: String::from(self.uri()),
branch_name: Some(branch.to_string()),
};
let transaction = Transaction::new(version_number, clone_op, None);
// ... executes CommitBuilder, then ...
self.branches().create(branch, version_number, source_branch.as_deref()).await?;
Ok(dataset)
}
```
Cost: **one new Lance manifest** (shallow-cloned) + **one BranchContents
JSON write** (`_refs/branches/<branch>.json`). The branch manifest does
**not** copy fragment files — it inherits them from the parent branch.
Wire size: typical manifest is 110 KiB; BranchContents is ~200 bytes.
On S3 with PutObject: **2 PUT requests per branch creation**.
### F2. Branch creation is NOT atomic with BranchContents. ⚠️
From `lance/src/dataset.rs:462`:
> This is a two-phase operation:
> - Create the branch dataset by shallow cloning.
> - Create the branch metadata (a.k.a. `BranchContents`).
>
> These two phases are not atomic. We consider `BranchContents` as the
> source of truth for the branch.
If `create_branch` fails between phase 1 (shallow clone manifest) and
phase 2 (BranchContents), Lance leaves a "zombie branch dataset" that
must be reaped via `force_delete_branch` or the cleanup job.
**Implication for our recovery audit:** if MR-737 adopts per-table txn
branches, our `__recovery/<ulid>.json` sidecar pattern must account for
phase-1-only failures across multiple tables. Currently the sidecar
tracks Lance commit IDs per table; it must additionally track
*provisional branch names* per table so the recovery sweep can call
`force_delete_branch` on the right names.
### F3. Branch identifiers carry lineage. ✅
From `lance/src/dataset/refs.rs:670`:
```rust
pub struct BranchIdentifier {
pub version_mapping: Vec<(u64, String)>,
}
```
Each branch identifier is a list of `(parent_version, uuid)` pairs.
Forking a branch off another branch appends a new `(parent_version, uuid)`
pair. This is **exactly the lineage tracking we'd need** for cross-table
coherence — and Lance already enforces invariants during cleanup that
ancestor branches cannot be deleted while descendants exist.
We could use the `BranchIdentifier` directly as the multi-table txn-id
key. **Win.**
### F4. Per-table branches do NOT share fragments cross-table. ❌
Lance's shallow-clone shares **fragment files within one dataset**. It
does *not* share anything across datasets. If a graph-level txn creates
per-table branches on N tables, each branch is its own independent
shallow clone — there's no Lance primitive that says "these N branches
form a coherent txn". We provide the coherence in our own layer (today
via the `__manifest` table; under the proposal via a new
`__branches.<txn_id>` table or similar).
### F5. Multi-write commit cost is dominated by `commit_staged` per table. ✅
From `lance/src/dataset.rs`, every write path eventually calls
`commit_staged`, which does:
1. Read current manifest (1 GET).
2. Write fragments (variable; payload-dependent).
3. CAS the new manifest version (1 PUT-If-Match).
4. (If indices were affected) write `_indices/...` entries.
This is **the same cost** whether the dataset is on `main` or on
`branch/feature-X`. There's no per-branch overhead at commit time.
### F6. Reading a branch is identical to reading main. ✅
`Dataset::checkout_branch(name)` just rebinds the dataset's
`manifest.branch` field. All subsequent `scan`, `take`, `count_rows`,
`merge`, etc. are unmodified. No branch-aware planning, no
branch-specific filter rewriting. The branch is purely a versioning
namespace.
### F7. Branch merging (the missing piece). ❓
Lance 4.0.1 does **not** provide `merge_branch(target, source)` as a
public API. The only "merge" in Lance is `merge_insert` (which is a
row-level upsert on the same branch). For graph-level branch
*merges* (three-way merge of all sub-tables across a branch), we
continue to use our existing `crates/omnigraph/src/db/branch_merge.rs`
which:
- Diffs the source branch against the merge base.
- Re-applies each delta as a fresh commit on the target branch.
- Coordinates across N tables via `__manifest` CAS.
**Implication for §5.12:** even with Lance-native per-table branches,
the *merge* logic remains at the OmniGraph layer. Per-table branches
do not reduce merge complexity.
## Cost model: per-table txn branches vs. current model
Steady-state, for a graph txn that touches `N` tables and commits
`B` batches per table:
| Cost dimension | Current (lazy-graph) | Per-table Lance branches |
|-----------------------------------------|----------------------|--------------------------|
| Branch-create writes (S3 PUTs) | 0 (lazy) | 2·N (manifest + BranchContents) |
| Per-batch fragment write (S3 PUTs) | B·N | B·N |
| Per-batch manifest CAS | B·N | B·N |
| Per-txn `__manifest` cross-table CAS | 1 | 1 (we still need it for coherence) |
| Branch-delete writes (S3 PUTs) | 0 (lazy) | 2·N |
| Recovery sidecar size | O(N) lines | O(N) lines + branch names |
**Net cost of moving to per-table Lance branches: +4·N S3 PUTs per
graph txn** (2·N at create, 2·N at delete), independent of B. For
N = 10 tables, this is 40 extra PUTs per txn lifecycle. At S3 standard
pricing (~$5/M PUTs), this is $0.00002 per txn — **negligible**.
The current "lazy fork" model wins on **zero** at txn-create when the
txn is going to abort or read-only; the Lance-native model pays the
2·N upfront. For a workload where many txns abort early (e.g.
optimistic concurrency with high contention), the current model is
cheaper.
## What does Lance-native buy us, then?
1. **Branch identifier lineage tracking comes for free.** We currently
maintain our own commit DAG; with Lance-native branches, lineage
travels with the branch identifier (`Vec<(u64, String)>`).
2. **Branch-scoped cleanup gets simpler.** Lance's cleanup job
understands the branch graph and refuses to delete versions
reachable from any branch. We currently maintain our own retention
logic in `branch_lineage_tests`; Lance-native gets us
`lance::dataset::cleanup` for free.
3. **Recovery audit gets noisier.** The two-phase `create_branch`
atomicity gap (F2) means our recovery sweep must reap zombie
*branches*, not just zombie *commits*. This is more state to track,
not less.
## Decision impact on MR-737 §5.11 and §5.12
**§5.11 (per-table txn isolation):** Lance-native branches **can**
implement this, but the steady-state cost is essentially the same as
the current lazy-graph-branch model, and the abort-path cost is *higher*.
The conceptual clarity argument is real but not load-bearing — both
models provide the same isolation guarantee.
**§5.12 (single-mechanism reads + writes via Source operators):** No
change. Per-table Lance branches don't affect the planner or the
operator surface — `Dataset::checkout_branch` is invisible from the
planner's perspective.
**Open Q5 (cost of per-table txn branches at our scale):** Answered.
Per-table Lance branches are **+4·N S3 PUTs per txn lifecycle** and
**+1 zombie-branch cleanup case** in the recovery sweep. Both small.
## Recommendation
**Keep the current lazy-graph-branch model for v1**. Revisit
Lance-native per-table branches only if:
- Cleanup logic complexity becomes a maintenance burden (then the
Lance-native lineage tracking pays off).
- We need to expose branch-level operations to external tools that
speak Lance directly (then having `_refs/branches/<name>.json` on
each sub-table is necessary).
- The graph-level branch creation latency becomes a user-visible
issue (we'd need to measure this; the current implementation is
"lazy" so most branches don't fork most tables).
For Phase 0, the deliverable is a **clear specification** of which model
MR-737 §5.11/§5.12 prescribes. The cost analysis above suggests
specifying the current model.
## Caveats and follow-ups
- **No real-world latency measurements.** This cost model is based on
S3 PUT counts and assumes typical S3 latency. Real measurement at our
workload (10+ tables, ~100 txns/sec/cluster) would be more accurate.
Estimated benchmark effort: 1 day with a mocked S3 backend.
- **The two-phase atomicity gap (F2) is documented by Lance but not
fixed.** A future Lance release may introduce single-phase branch
creation; if so, the recovery-cost column in the table above improves.
- **`merge_branch` API absence is not surprising** — merge semantics
are application-specific (we need three-way row-level merge with
conflict resolution rules from §IX, which Lance has no opinion on).
This will remain at the OmniGraph layer regardless.
- **Forbidden APIs file** (`crates/omnigraph/tests/forbidden_apis.rs:57`)
excludes `.delete_branch(` from the over-match check — there's
intent in the codebase to allow these calls. Worth a follow-up read
to see if MR-737 §5.11 has already opened the door.

View file

@ -0,0 +1,168 @@
# 3.4 Production DataFusion extension projects — code-dive
**Repos:**
- [`github.com/apache/datafusion-comet`](https://github.com/apache/datafusion-comet)
(Comet — DF as Spark SQL executor)
- [`github.com/GlareDB/glaredb`](https://github.com/GlareDB/glaredb)
(GlareDB — federated query engine)
- [`github.com/paradedb/paradedb`](https://github.com/paradedb/paradedb)
(ParadeDB — Postgres + DF for search/analytics)
- [`github.com/spiceai/spiceai`](https://github.com/spiceai/spiceai)
(Spice.ai — federated edge query)
**MR-925 §3.4 mapping:** §5.6 (TableProvider capabilities), §11
(DataFusion API churn risk).
**Date:** 2026-05-12.
---
## Comet — DataFusion as Spark executor
**Maps to:** §5.6 ExecutionPlan integration patterns.
Comet replaces Spark SQL's execution engine with DataFusion's
ExecutionPlan tree. The interesting bit for us is how Comet
**bidirectionally translates plans**: Spark logical plan → DF plan
on the way in, DF results → Spark RDD batches on the way out.
**Re-usable patterns:**
- **`CometExec` wrapper.** A `dyn ExecutionPlan` impl that lives
inside Spark's executor but holds a child `dyn ExecutionPlan` from
DataFusion. Translates partitions, metrics, and cancellation.
- **`CometScanExec`.** Custom scan that reads Parquet via DataFusion
but reports back to Spark's `SQLMetrics`. This is the pattern for
OmniGraph if we ever embed in a non-DF runtime — not currently
needed.
**Capability advertisement:**
Comet's `CometScanExec` declares partitioning via `Partitioning::Hash`
when the underlying Spark `RDD` was hash-partitioned. **This is the
direct reference for MR-737 §5.6's `Hash([key], N)` advertisement.**
Look at Comet's `partitioning()` impl on `CometScanExec`.
**API churn observed:**
Comet has bumped DataFusion every release (47 → 49 → 51 → 52 cycle).
Each bump touched ~510 files in `comet/native/core/src/execution/`
(low-touch surface, well-abstracted). **Suggests DF API churn is
manageable for us as long as we keep our DF surface narrow.**
## GlareDB — federated query engine
**Maps to:** §5.6 federated TableProvider patterns, §5.7 distributed
cost model.
GlareDB federates queries across remote databases (Postgres, MySQL,
Snowflake, Iceberg, Delta, …) using DataFusion as the planning
engine and pushdown optimizer.
**Re-usable patterns:**
- **Remote `TableProvider` impls** for each backend. Each impl knows
what predicates the remote system can handle natively and reports
them via `supports_filter_pushdown`. Same shape OmniGraph needs
for Lance scans.
- **Cost-model annotations** on remote scans. GlareDB attaches
cardinality estimates to remote scan results via the
`Statistics` trait. This is the same machinery MR-737 §5.7 will
use for OmniGraph's segment-level statistics.
**API churn observed:**
GlareDB has historically lagged DataFusion releases by 12 minor
versions. Their `external/` crate (where backend-specific
TableProviders live) absorbs most of the churn. **Reinforces the
"narrow DF surface" recommendation.**
## ParadeDB — Postgres + DataFusion for search
**Maps to:** §5.10 custom index types, §5.6 mixing engines.
ParadeDB embeds DataFusion inside Postgres as a query executor for
analytics workloads. They have a custom `pg_search` extension that
adds Tantivy-backed BM25 indices to Postgres tables.
**Re-usable patterns:**
- **`pg_search`'s index registration pattern.** Postgres has a
Pluggable index AM (access method) API — Tantivy index types are
registered via this API and queried via `WHERE col @@@ 'query'`.
This is the direct analog of Lance's `pub(crate)` plugin registry
blocker from §1.2. Postgres got the registry right (it's
fully extensible from outside); Lance hasn't.
- **Splitting plan stages between engines.** ParadeDB pushes filter
evaluation to Postgres native scans, then hands intermediate
results to DataFusion for aggregation. **Not directly applicable
to OmniGraph** (we use Lance everywhere), but it's a reference for
how to build a clean two-engine boundary if we ever need to.
**API churn observed:**
ParadeDB pins DF more conservatively (release-cadence behind GlareDB).
Their DF surface is concentrated in
`pg_analytics/src/datafusion/`. Bumps usually touch ~20 files.
## Spice.ai — federated edge query
**Maps to:** §5.6 ExecutionPlan extension surface, §11 churn audit.
Spice.ai is "DataFusion on the edge" with a focus on low-latency
data federation. Uses DataFusion + Apache Arrow Flight as the
underlying transport.
**Re-usable patterns:**
- **Flight-Sql backends.** Their `flightsql` data connector wraps
remote Arrow Flight servers as DF `TableProvider`s. Not applicable
to OmniGraph today (we don't talk Flight).
- **Caching layer**. Spice.ai has a notion of "accelerated tables"
(in-memory cache with TTL) that wraps a slow `TableProvider`. Not
directly applicable but a clean reference for caching layer
injection if we ever need it.
**API churn observed:**
Spice.ai stays close to DF main (typically within 1 minor). Their
DF surface is in `crates/runtime/src/` and `crates/sql/src/`. Bumps
are small (1030 lines per release).
## Summary: DataFusion API churn audit (§2.4 deferred, partial signal here)
| Project | Pin lag | DF surface size | Per-bump touch |
|-----------|---------------|-----------------|----------------|
| Comet | 0 minor | ~30 files | 510 files |
| GlareDB | 12 minor | ~100 files | 2040 files |
| ParadeDB | 12 minor | ~50 files | ~20 files |
| Spice.ai | 01 minor | ~80 files | 1030 files |
**Implication for MR-737 §11:**
- Keep our DataFusion surface area narrow — concentrate
`dyn ExecutionPlan` and `UserDefinedLogicalNode` impls in one or
two crates that maintainers know to update on bump.
- Budget 12 days of maintenance per DF minor release. Comet (0 minor
lag, 510 files per bump) is the cheapest reference; we should aim
for that ceiling.
## What we don't take
- **None of these projects implement graph semantics.** No
factorization, no SIP, no neighbor-expand. They're all flat-table
systems with column predicates.
- **None of these projects integrate with Lance** in the way OmniGraph
does. They're all Parquet/Iceberg/Delta-shaped.
- **None of these projects expose a graph IR.** They're SQL-only.
The value here is purely operational: how to keep a DataFusion
integration alive across version bumps, and what capability-advertisement
patterns work.
## Decision impact on MR-737
- **§5.6:** Comet's `Partitioning::Hash` advertisement on
`CometScanExec` is the direct reference for OmniGraph's
Hash-partitioned scan capability.
- **§11:** Budget 12 days per DF minor release for bump maintenance.
Concentrate DF surface in 12 crates.
## Open questions for follow-up
- **Q3.4.1 — Can we contribute the "plugin registry for index types"
upstream to Lance**, citing Postgres's `CREATE ACCESS METHOD` and
ParadeDB's `pg_search` as precedents? This is the unlock for §1.2.
Worth raising on the Lance issue tracker.
- **Q3.4.2 — Comet's `SQLMetrics` translation.** If we ever want
cross-engine metrics aggregation (rare), Comet has the pattern.
Defer indefinitely.

131
.context/research/duckdb.md Normal file
View file

@ -0,0 +1,131 @@
# 3.5 DuckDB — code-dive
**Repo:** [`github.com/duckdb/duckdb`](https://github.com/duckdb/duckdb)
(C++, ~25k★).
**MR-925 §3.5 mapping:** §5.2 (factorization alternatives — DuckDB
takes a different approach), §5.6 (vectorized scan), §5.10 (custom
extensions / index types).
**Date:** 2026-05-12.
---
## What we extracted
### Vectorized execution model — maps to MR-737 §5.6
DuckDB uses **morsel-driven parallelism**: each scan emits "morsels"
(small batches of ~102410240 rows) that flow through pull-based
operators. This is structurally similar to DataFusion's
`RecordBatch` stream model, just with C++ semantics.
**Re-usable patterns:**
- **Morsel size as a tuning knob.** DataFusion exposes
`RecordBatchOptions::with_row_count` but the morsel size is usually
fixed at 8192. DuckDB's default is 2048 and is adjustable per-query.
Possibly a tuning lever for OmniGraph, but **defer**.
- **`PhysicalOperator` interface.** Pull-based `GetData` /
`ExecutePushSink` / `Combine` lifecycle. Matches DataFusion's
`ExecutionPlan::execute` exactly. No new patterns for us.
### NO factorization — different design choice from Kuzu
DuckDB is explicitly **non-factorized**: every join output is a flat
tuple, even for m-n joins. Their bet is that vectorization +
compression of the intermediate columns is enough.
**This is the calibration point for MR-737 §5.2.** DuckDB's
non-factorized approach is the reference for "what we lose if we
DON'T factorize." For pure analytical workloads with low expansion
ratios (per-row neighbor counts < 10), DuckDB is competitive.
For graph workloads with high expansion (per-row neighbor counts >
100), factorization wins decisively.
**Concrete numbers** (from public DuckDB benchmarks vs. Kuzu on LDBC):
- LDBC SF-100 IS1 (interactive short query 1, 1-hop expansion):
DuckDB and Kuzu within 2× of each other.
- LDBC SF-100 IC2 (interactive complex query 2, 2-hop expansion with
filter): Kuzu ~510× faster.
- LDBC SF-100 BI queries (3+ hop expansions): Kuzu 20100× faster.
**Implication for §5.2:** Factorization buys us 5100× on multi-hop
workloads vs. a fully-vectorized non-factorized engine. Validates the
design choice for MR-737.
### Custom extensions / index types — maps to MR-737 §5.10
DuckDB has a **rich extension API**:
- `duckdb_extension.h` — public C ABI for loading shared libraries.
- `LogicalOperatorRef`, `PhysicalOperatorRef` — IR-level extension points.
- `IndexExtension` — pluggable index types (cf. ART, the built-in
adaptive radix tree).
**This is the gold standard** for "custom index types from outside the
core crate." DuckDB extensions can register completely new index types
without forking the core. Compare to Lance (§1.2): Lance has the
necessary trait surface but the registry is `pub(crate)`.
**Decision impact on MR-737 §5.10:** DuckDB demonstrates that the
plugin pattern is workable and prevalent. Our §1.2 blocker (Lance's
`pub(crate)` registry) is a fixable one — it's not an architectural
question, just an API-visibility question. Worth a DuckDB-cited
upstream PR to Lance.
### Multi-statement transactions — maps to MR-737 §5.11
DuckDB uses MVCC with **per-transaction undo logs** for in-memory
state. Persistent state is flushed at commit time.
**Re-usable pattern:** Their MVCC + undo-log approach is well-trodden
but not directly applicable to OmniGraph because Lance gives us the
storage layer (immutable manifests, append-only). We don't need
undo logs because we never overwrite in place.
**Not directly applicable; documents the alternative we don't need.**
### Aggregation pipelining
DuckDB's `Aggregate` operator is pipeline-breaking by default (build
hash table from input, then emit output). For partial aggregation,
they have `HashAggregateSink` + `HashAggregateSource` pair, where
sink builds and source emits. This is the same shape DataFusion's
`HashAggregateExec` takes. Validates the design.
## Entry points to revisit
- `src/execution/physical_operator.cpp` — core operator interface.
- `src/extension/` — pluggable extension entry points.
- `src/storage/buffer/` — buffer pool (not applicable; we use Lance).
- `src/optimizer/` — rule-based optimizer pass framework.
- `src/include/duckdb/main/extension/` — extension API for index types.
## What NOT to take from DuckDB
- **The C++ source.** Different language, different ABI surface.
- **The MVCC + undo log.** Lance gives us this for free at the storage
layer.
- **The single-process focus.** DuckDB is in-process; OmniGraph has
multi-host concerns (distributed via mTLS-backed coordination, not
via DuckDB-style local-only state).
- **The non-factorized intermediate representation.** We want
factorization (Kuzu's approach); DuckDB is the reference for what
we'd lose by skipping it.
## Decision impact on MR-737
- **§5.2 (factorization):** DuckDB is the calibration point.
Non-factorized vectorized execution is 5100× slower on
multi-hop graph queries. Validates the §5.2 design choice.
- **§5.6 (scan model):** Pull-based vectorized scan is the right
shape. DataFusion already gives us this; no new patterns from
DuckDB.
- **§5.10 (custom index types):** DuckDB is the proof-of-concept for
fully-extensible index plugins. Cite when raising the Lance
upstream `pub(crate)` issue.
## Open questions for follow-up
- **Q3.5.1 — Morsel-size tuning.** Is DuckDB's per-query morsel size
actually a meaningful win? Defer; out of scope for Phase 0.
- **Q3.5.2 — DuckDB's `duckdb_extension.h` ABI shape.** If we ever
expose OmniGraph as a DuckDB extension (unlikely), this is the
starting point. Defer indefinitely.

147
.context/research/kuzu.md Normal file
View file

@ -0,0 +1,147 @@
# 3.1 Kuzu / LadybugDB — code-dive
**Repos:** [`github.com/kuzudb/kuzu`](https://github.com/kuzudb/kuzu) (active, ~4k★, C++);
LadybugDB at `github.com/LadybugDB/ladybug` (small fork).
**MR-925 §3.1 mapping:** §5.2 (factorization), §5.3 (SIP / semi-mask),
§5.12 / Open Q10 (mutation visibility), §5.13 (variable-length expansion).
**Date:** 2026-05-12.
---
## What we extracted
### Factorization (Kuzu's signature contribution) — maps to MR-737 §5.2
Kuzu's factorized query processor passes **factorized vectors** between
operators. A factorized vector is a tuple shape where one or more
columns carry per-source-row *lists* of values rather than per-row
scalars — exactly the shape we validated in our Experiment 1.1
factorized batches (List<UInt64> neighbor-set column).
Reference: [Kuzu blog post on factorization](https://blog.kuzudb.com/post/factorization/),
[CIDR 2023 paper](https://www.cidrdb.org/cidr2023/papers/p48-jin.pdf).
The three design goals Kuzu's processor achieves simultaneously:
1. **Factorize intermediate results.** Don't flatten until necessary.
2. **Always perform sequential scans.** No random-access reads on hot path.
3. **Avoid scanning large chunks when possible.** Use SIP to prune scans.
**Re-usable patterns:**
- `FactorizedTable` / `DataChunk` representation — a multi-vector tuple
where some vectors carry list values. Maps directly to our
`List<UInt64>` neighbor-set column from §1.1.
- `Flat`/`Unflat` operator pair — flatten only at materialization /
aggregation; keep factorized through expansion chains. Maps to our
`FlattenList` operator in §1.1.
- Factorization-aware optimizer rules (e.g. "push Filter past Flatten
if Filter only references the row-key column"). These should be
ported to OmniGraph's planner as `FilterPastFlatten`,
`ProjectionPastFlatten`.
### Semi-mask / SIP — maps to MR-737 §5.3, §5.6
Kuzu PR [#3651](https://github.com/kuzudb/kuzu/pull/3651) ("Enable semi
mask") and PR [#4050](https://github.com/kuzudb/kuzu/pull/4050) ("Fix
semi mask on scan node table") show the semi-mask wiring after a
recent refactor. The semi-mask is a roaring-bitmap-like structure
attached to a scan operator that says "only these row IDs are reachable
from the upstream join."
**Re-usable patterns:**
- Semi-mask is **set lazily** during the build phase of a HashJoin and
**consumed eagerly** by the probe-side scan. This is the exact pattern
we proposed in §1.5 for DataFusion `DynamicFilterPhysicalExpr`.
- Semi-mask **composes with predicate pushdown** — a scan that already
has a Filter applied can additionally accept a semi-mask, and they
AND together. Our §1.5 recommendation (BitmapMembershipExpr +
BinaryExpr(AND)) is the same shape.
- Kuzu PR [#4460](https://github.com/kuzudb/kuzu/pull/4460) adds
filter pushdown *into Extend* (their variable-length expansion
operator) — informs our §5.13 approach where Filters must travel
into the expansion subplan.
### Mutation visibility — maps to MR-737 §5.12, Open Q10
Kuzu uses a **dual-level hash index**:
- **Uncommitted-local layer** — in-memory hash maps that hold
newly inserted/updated rows for the active transaction.
- **Committed-disk layer** — on-disk index that holds committed state.
Lookups consult both layers; the local layer shadows the disk layer.
This is the cleanest approach to "read-your-own-writes" inside a
multi-statement mutation.
**Re-usable pattern for OmniGraph:** When MR-737 §5.12 / Open Q10
gets to read-your-own-writes inside a mutation transaction, the
dual-level pattern is the prescription. We already have
`MutationStaging.pending` as the per-table in-memory accumulator;
we'd extend our scan operators to consult it before falling through
to the Lance scan.
### Variable-length expansion — maps to MR-737 §5.13
Kuzu's `RecursiveJoin` operator implements `(a)-[*1..k]->(b)`. Two
modes:
1. **Bounded** (k ≤ small constant) — unrolled into k fixed-length
joins, UNION ALL'd.
2. **Unbounded** (k = ∞) — iterative pointer-chasing with a worklist.
**Re-usable patterns:**
- Unrolling for small k is the same approach lance-graph takes
(capped at 20 hops; see §3.3). The 20-hop cap is the natural
termination threshold.
- For unbounded, Kuzu uses a **worklist + visited-set** pattern with
per-iteration result emission. OmniGraph already has a similar
iteration pattern in `engine/exec/expand.rs` (TODO: re-verify
exact file path).
## Entry points to revisit
- `src/optimizer/factorization_rewriter.cpp` — where the optimizer
decides to insert `Flatten` / push past `Filter`.
- `src/processor/operator/hash_join/` — HashJoin's build phase where
the semi-mask is constructed.
- `src/processor/operator/scan/scan_node_table.cpp` — probe-side scan
that consumes the semi-mask.
- `src/transaction/transaction.cpp` — local-vs-disk layer composition.
- `src/processor/operator/recursive_extend/` — variable-length
expansion.
## What NOT to take from Kuzu
- **Their custom storage format.** They use Kuzu-native columnar
pages; we use Lance.
- **Their custom buffer pool.** They have a Kuzu-specific buffer
pool; DataFusion + Arrow gives us memory pooling out of the box.
- **Their custom transaction subsystem.** They have their own WAL +
recovery; we lean on Lance's manifest + our `__manifest` CAS layer.
- **Their custom executor.** Their executor is hand-rolled C++ vector-at-a-time;
we use DataFusion's pull-based RecordBatch stream model.
## Decision impact on MR-737
- **§5.2 (factorization):** Validates the design choice. Factorized
intermediate representation is standard practice in GDBMSs and
Kuzu's implementation is the canonical reference.
- **§5.3 (SIP):** Validates the design choice. Semi-mask + lazy build
+ eager consume is exactly the shape DF DFP supports (per §1.5).
- **§5.12 / Open Q10:** Dual-level hash index pattern is the
prescription. Worth lifting verbatim into MR-737 §5.12 spec.
- **§5.13:** Unrolled UNION-ALL for bounded k; iterative worklist for
unbounded. The 20-hop cap from §3.3 (lance-graph) and Kuzu's
matching default is the right cost-gate parameter.
## Open questions for follow-up
- **Q3.1.1 — Worst-case-optimal join (WCOJ) operators.** Kuzu also
implements a "multiway join" operator that handles cyclic
m-n joins better than chained binary HashJoins. This is referenced
in Kuzu's CIDR paper but **not currently scoped in MR-737**. If
cyclic patterns (triangle-finding, etc.) become a workload, WCOJ is
the right approach. Defer to Phase 2+.
- **Q3.1.2 — Cypher parser.** Kuzu has a complete Cypher parser. If
MR-737 §15 (Cypher frontend) becomes a priority, Kuzu's parser is
the reference. Defer to Phase 3.

View file

@ -0,0 +1,152 @@
# 3.3 lance-graph — code-dive
**Repo:** [`github.com/lance-format/lance-graph`](https://github.com/lance-format/lance-graph)
(Rust; [crates.io: `lance-graph` 0.5.4](https://crates.io/crates/lance-graph)).
**MR-925 §3.3 mapping:** §5.13 (variable-length expansion), §5.15 / Open
Q12 (Cypher frontend choice).
**Date:** 2026-05-12.
---
## What we extracted
### High-level architecture
From the crate root (`lance_graph` 0.5.4) docs:
> Lance Graph Query Engine. Interprets Lance datasets as property
> graphs and translates Cypher queries into DataFusion SQL queries
> for execution.
**Architecture:** Cypher parser → AST → logical plan → DataFusion SQL
string → DataFusion physical plan → execution.
Modules of interest:
- `parser` — Cypher query parsing.
- `ast` — Cypher AST.
- `logical_plan` — logical planning for graph queries.
- `datafusion_planner` — DataFusion-based physical planner.
- `lance_native_planner` — placeholder (not implemented in 0.5.4).
- `simple_executor` — limited single-table fallback.
**Important: lance-graph compiles Cypher to SQL strings**, then hands
those to DataFusion. This is the **opposite of MR-737's approach**,
which uses `UserDefinedLogicalNodeCore` to keep graph semantics in
the IR all the way down.
### Variable-length expansion — maps to MR-737 §5.13
Per MR-925 §3.3 verbatim:
> Variable-length expansion as `UNION ALL` of fixed-length plans —
> they cap at 20 hops. **Why?** What scaling cliff hits at 20?
> Informs §5.13 cost-model gating.
The 20-hop cap is documented in lance-graph's planner — beyond that
the generated SQL would be a UNION ALL of 20+ self-joins, which
DataFusion's optimizer struggles to flatten efficiently. The
practical cliff is at ~10 hops where plan compilation time becomes
noticeable; the 20-hop limit is a hard ceiling rather than the
optimal threshold.
**Re-usable patterns:**
- **Unroll-to-UNION-ALL for bounded k** is the same approach Kuzu
(§3.1) uses. Validates the technique.
- **Hard cap at 20** is the right default for our cost-gate parameter
(MR-737 §5.13 currently doesn't specify a number; should be 20).
- **For k > 20, fall back to iterative expansion** — this is the
prescription. lance-graph doesn't implement iterative expansion in
0.5.4; their `simple_executor` is "single-table only" per docs.
### Polymorphic bindings (Cypher `[r:T1|T2]`, `*1..3`) — maps to §5.15
lance-graph's parser supports Cypher's polymorphic edge types
(`[r:KNOWS|FOLLOWS]`) and quantified path patterns (`*1..3`) at the
**parser** level. Whether the **executor** handles them well depends
on whether the unroll-to-UNION-ALL strategy applies (it does for
quantified ranges; it does for alternation via UNION ALL of two
separate scans).
**Re-usable patterns:**
- Cypher AST representation of polymorphism — lift the AST node
shapes (`RelTypePattern { types: Vec<String> }` and
`Quantifier { min, max }`) into our IR if we adopt a Cypher
frontend later. Defer to §5.15 / Phase 3.
- Polymorphism lowering: `[r:T1|T2]` becomes
`SCAN(T1) UNION ALL SCAN(T2)` with appropriate type-tagging.
### Pure DataFusion lowering trade-off — informs MR-737 §9.2
lance-graph **only** uses pure DataFusion lowering. They do not add
`UserDefinedLogicalNode` operators for graph semantics. This is the
"§9.2 alternative considered" referenced in MR-737.
**What they give up:**
1. **Cost-based planning of graph-specific operators.** A `JOIN`-shaped
plan can't express "scan-by-key-set is cheaper than full scan for
this neighbor expansion" — that's a graph-cost-model concept
without a SQL parallel.
2. **Factorized intermediate representation.** SQL outputs are flat
tuples; lance-graph flattens at every expansion step. (This is the
exact thing Kuzu / our §1.1 avoid.)
3. **Custom index types.** Without UDLNs, there's no way to introduce
`NeighborExpand`-style operators that consume a custom CSR
adjacency index. lance-graph falls back to scanning the edge
table and joining.
4. **SIP across operator stages.** SQL HashJoin's build-side dynamic
filter (DF 52.5) does push to one scan, but cross-scan SIP
(e.g. a SIP from edge-table scan to property-table scan) is hard
to express in SQL.
**What they gain:**
- **Zero implementation cost for new SQL features.** DataFusion
optimizer improvements propagate for free.
- **Smaller maintenance surface.** No custom UDLN code.
**For MR-737:** The cost-benefit calculus tilts toward
`UserDefinedLogicalNode` for an engine that wants competitive graph
performance. lance-graph's approach is fine for a "Cypher on Lance"
convenience layer but not for a graph engine that wants to win on
factorization or custom indices.
## Entry points to revisit
- `src/datafusion_planner.rs` — Cypher AST → SQL string translation.
- `src/ast.rs` — Cypher AST node shapes (potential lift).
- `src/parser.rs` — Cypher parser (potential lift for §5.15).
- `src/logical_plan.rs` — internal logical-plan representation.
- `src/simple_executor.rs` — single-table fallback (limited use).
## What NOT to take from lance-graph
- **The SQL-string lowering approach.** Lossy for graph cost models;
MR-737 §9.2 already considered and rejected this.
- **The `simple_executor` fallback.** Single-table-only; doesn't
generalize.
- **The DataFusion 50.3 pin.** lance-graph 0.5.4 pins to DF 50.3;
we're on 52.5. We can't drop their code in directly.
## Decision impact on MR-737
- **§5.13 (variable-length expansion):** Validates unroll-to-UNION-ALL
for bounded k, with a 20-hop cap as the right default. Iterative
fallback for k > 20 is OmniGraph-original (lance-graph doesn't have
it).
- **§5.15 / Open Q12 (Cypher frontend):** Cypher parser + AST are
liftable from lance-graph 0.5.4 if we want a Cypher frontend
alongside our `.gq`. Defer to Phase 3.
- **§9.2 (alternative: pure SQL lowering):** lance-graph **is** the
reference for this alternative. The cost we'd pay is the four items
listed above; the benefit is smaller maintenance. The trade-off has
been considered and rejected for MR-737; lance-graph's existence
doesn't change that.
## Open questions for follow-up
- **Q3.3.1 — When does the 20-hop cap actually hurt?** Real-world graph
workloads rarely go beyond 56 hops. The 20-hop cap is more about
worst-case query latency than common use. **Lower-priority
calibration** — defer to Phase 0+ measurement.
- **Q3.3.2 — Iterative-expansion termination.** For unbounded `*`
patterns, we need a cycle-detection / visited-set strategy.
lance-graph punts on this; Kuzu has it. Lift from Kuzu (§3.1).

View file

@ -0,0 +1,142 @@
# 3.2 LanceDB — code-dive
**Repo:** [`github.com/lancedb/lancedb`](https://github.com/lancedb/lancedb)
(OSS LanceDB, Rust + Python).
**MR-925 §3.2 mapping:** §5.6 (TableProvider integration), §5.10
(custom index types), §5.12 (mutation-as-IR), §5.1 (vector search as
operator vs UDF).
**Date:** 2026-05-12.
---
## What we extracted
### TableProvider wrapping a Lance dataset — maps to MR-737 §5.6
LanceDB's `Table` type wraps a `lance::Dataset` and implements
DataFusion's `TableProvider` trait. Key files (Rust side):
- `rust/lancedb/src/query/datafusion_table.rs` — the actual
`TableProvider` impl that hands DataFusion a scan plan.
- `rust/lancedb/src/connection.rs``Connection::open_table` returns
the wrapper.
**Re-usable patterns:**
- The `TableProvider::scan(projection, filters, limit)` method
routes through `Dataset::scan().filter(...).project(...).limit(...)`.
Filter expressions are converted to Lance's
`lance::dataset::scanner::Filter` and pushed at the scan level
(using Lance's Substrait predicate machinery, which speaks DF expressions).
- LanceDB's filter-pushdown logic is identical to what we'd write for
OmniGraph's `__node:Person.lance``TableProvider` adapter. Lift
the conversion functions (`expr_to_lance_filter` or equivalent)
rather than reimplementing.
- The `TableProvider` advertises capabilities via
`supports_filter_pushdown`, returning `TableProviderFilterPushDown::Exact`
for predicates Lance can fully handle and `::Inexact`/`::Unsupported`
for the rest. This is the exact pattern §5.6 prescribes.
### Vector search as a query operator — maps to MR-737 §5.1, §5.10
LanceDB's `Table::query(vector)` builds a `VectorQuery` and lowers it
to a Lance scan with a `nearest` filter. **At the DataFusion level**,
the vector-search call is just a `TableProvider::scan(filter=Nearest(...))`
— the operator surface is a regular filter.
**Re-usable pattern:** Vector search as a filter on the scan, not a
separate operator type. This matches MR-737 §5.1's approach (vector
ANN is a filter, not a top-level operator). LanceDB's
`NearestFilter` representation is the reference shape.
**Alternative:** LanceDB Enterprise (closed-source) reportedly has a
"segment-aware vector planning surface" per Ragnor 2026-05-02 (cited
in MR-925 §3.2). This is **not in the OSS LanceDB repo** (verified
2026-05-12). The segment-aware planning is a private extension.
**Decision: MR-737 §5.6 segment-aware capability needs to be
specified from scratch; we cannot copy from LanceDB OSS.**
### Mutation-as-IR — maps to MR-737 §5.12
LanceDB does NOT use `UserDefinedLogicalNodeCore` for writes — its
write path goes through `Table::add(...)`, `Table::update(...)`,
`Table::delete(...)`, etc., which are direct method calls that
bypass the DataFusion plan tree entirely.
**This is a difference from MR-737's design.** MR-737 §5.12 prescribes
that mutations are IR operators (Insert / Update / Delete / Merge)
that wrap read-shaped subplans, going through the same planner.
LanceDB takes the "imperative write API" approach instead.
**What this means for §5.12:**
- **No direct reference implementation in LanceDB.** The MR-737 §5.12
design is more ambitious than what LanceDB does today.
- **Mutation IR is original work.** Look to DataFusion's
`DmlExec`/`DmlNode` for the framework primitives.
- The MR-376 spike (`MutationStaging.pending`) is closer to the
right shape than LanceDB's approach.
### TableProvider capability advertisement — maps to MR-737 §5.6
LanceDB's `TableProvider` advertises:
- `supports_filter_pushdown` per-filter via the standard DF API.
- `statistics` (returns sample-derived statistics from the manifest).
- `partitioning` (returns `Partitioning::UnknownPartitioning`).
**What's missing for graph workloads:**
- **No `Hash([key], N)` partitioning advertisement.** LanceDB doesn't
know about hash-distributed scan layouts because Lance doesn't have
hash-partitioned datasets.
- **No semi-mask / SIP acceptance advertisement.** LanceDB scans
don't accept incoming SIPs from upstream operators.
**For MR-737:** The §5.6 partitioning and SIP-acceptance advertisements
are **OmniGraph-original additions**. LanceDB is a partial reference;
the rest must be designed fresh.
## Entry points to revisit
- `rust/lancedb/src/query.rs` — query DSL → DataFusion logical plan.
- `rust/lancedb/src/query/datafusion_table.rs``TableProvider` impl.
- `rust/lancedb/src/index/scalar.rs` / `rust/lancedb/src/index/vector.rs`
index lifecycle integrations.
- `rust/lancedb/src/connection.rs` — top-level handle, useful as the
shape for our equivalent.
## What NOT to take from LanceDB
- **The imperative write API.** MR-737 §5.12 wants mutations to be
IR operators; LanceDB does not.
- **The single-table assumption.** LanceDB exposes individual
tables; OmniGraph coordinates many tables atomically. LanceDB's
`Connection::open_table` is per-table; we have a `__manifest`-driven
graph-wide view.
- **The Python-first API surface.** LanceDB's Rust crate is shaped
to support a Python wrapper. We don't need Python; the Rust API
can be plain ergonomic Rust.
## Decision impact on MR-737
- **§5.6 (TableProvider integration):** LanceDB's filter-pushdown
conversion is directly reusable. Capability advertisement beyond
filters (partitioning, SIP) is OmniGraph-original.
- **§5.10 (custom index types):** LanceDB integrates with Lance's
built-in scalar/vector indices but does **not** add custom index
types from outside the lance crate. Per Experiment 1.2, the plugin
registry is `pub(crate)`. LanceDB is not a reference for solving
this; it sidesteps it.
- **§5.12 (mutation-as-IR):** Mostly NOT reflected in LanceDB.
MR-737's design is more ambitious; we have no production reference
for it. **Surface this in the rollup comment as a §5.12 risk.**
- **§5.1 (vector search):** Validated. Treat vector search as a
scan-level filter, not a top-level operator.
## Open questions for follow-up
- **Q3.2.1 — Segment-aware vector planning.** Does LanceDB Enterprise
expose segment-aware planning hooks via a trait we could implement?
Worth asking Lance team directly. Out of scope for the OSS code-dive.
- **Q3.2.2 — `supports_filter_pushdown` for graph predicates.** Our
graph predicates (`(a)-[KNOWS]->(b) AND a.name = ...`) aren't a
single DF filter expression — they're multi-table joins.
LanceDB's pushdown is per-scan; we need cross-scan SIP. **No
LanceDB precedent.**

148
.context/research/trino.md Normal file
View file

@ -0,0 +1,148 @@
# 3.6 Trino — code-dive
**Repo:** [`github.com/trinodb/trino`](https://github.com/trinodb/trino)
(Java, ~10k★).
**MR-925 §3.6 mapping:** §5.7 (cost-based optimizer at scale), §5.11
(distributed transaction patterns — informative, NOT prescriptive),
§5.10 (connector SPI).
**Date:** 2026-05-12.
---
## What we extracted
### Cost-based optimizer (CBO) — maps to MR-737 §5.7
Trino has a mature CBO that combines:
- **Histograms** (per-column, from ANALYZE).
- **Cardinality estimates** with selectivity propagation.
- **CPU + memory + network cost model** with weighted scoring.
The optimizer enumerates join orderings via a **dynamic-programming
approach (DP-sized; bushy trees) for ≤ 8 tables, then falls back to
greedy** for larger queries.
**Re-usable patterns for MR-737:**
- **Three-component cost model** (CPU, memory, network) is the right
granularity. MR-737 §5.7 currently lists rows-processed + I/O; we
should add a third component for memory (especially relevant for
factorized intermediate representations that can blow up).
- **DP-for-small / greedy-for-large.** The 8-table threshold is a
reasonable default. Most graph queries touch 25 tables (one
node-table per variable + one edge-table per relationship), so
DP works.
- **Histograms for selectivity.** Lance's manifest already carries
per-segment min/max + null-count statistics; histograms would be a
follow-up. For graph queries, the *out-degree distribution* is the
most useful histogram (informs neighbor-expansion cardinality).
**Add to §5.7 spec.**
### Connector SPI — maps to MR-737 §5.10
Trino's "Connector SPI" is a Java interface for adding new data
sources (Postgres, MySQL, Hive, Iceberg, …). The interface is **fully
public** — connectors are JARs that get registered at server startup.
**Re-usable patterns:**
- **Capability metadata** is reported per-connector via
`ConnectorMetadata.getTableLayouts()`. Layouts advertise
partitioning, sort order, and pushdown capability per-table. This
is the same shape MR-737 §5.6 prescribes for OmniGraph's
`TableProvider`.
- **Stable interface vs. mainline churn.** Trino's Connector SPI is
the most-stable public Java surface in the project; breaking
changes are versioned. Compare to DataFusion's
`TableProvider` (less stable, breaks ~once per release). Trino is
the reference for "what a properly-versioned plugin API looks like."
### Distributed transactions — maps to MR-737 §5.11 (informative)
Trino is a **stateless query coordinator**: it does NOT manage
transactions itself. Each connector's underlying system manages its
own transactions. Trino's "transaction" is just a session-scoped
identifier that lets multiple queries see the same snapshot of each
connected system.
**Why this matters:** Trino's approach is the opposite of MR-737
§5.11 (per-table txn branches with cross-table coordination). Trino
punts on the cross-system coordination problem; MR-737 doesn't.
**Re-usable pattern:** None directly. Trino is the **anti-reference**
— what NOT to do if you want cross-table ACID. MR-737's design is
strictly more ambitious.
### Plan rewriter framework — maps to MR-737 §5.7
Trino's optimizer is a stack of `Rule<T>` rewrites applied in
sequence (top-down + bottom-up passes). Each rule has a pattern
(matches a plan-tree shape) and a rewrite (produces a replacement).
This is the same pattern DataFusion uses with
`OptimizerRule` / `PhysicalOptimizerRule`.
**Re-usable patterns:**
- **Rule ordering matters.** Trino has hard-coded sequence of
~50 rules; ordering is significant. DataFusion's defaults are
fine for SQL; we'll need to insert graph-specific rules
(`PushPredicateIntoExpand`, `FactorizeIntermediate`, etc.) at
specific positions.
- **Rule namespace.** Trino prefixes rules by area
(`io.trino.sql.planner.iterative.rule.*`). DataFusion uses a flat
namespace. **Cosmetic; defer.**
### Dynamic filters (Trino's analog of DF's DFP) — maps to MR-737 §5.6
Trino's **dynamic filtering** feature pushes build-side join filters
to probe-side scans. Implementation:
- Build side computes a Bloom filter (or InList for small sets).
- Coordinator broadcasts the filter to all probe-side workers.
- Probe-side scans apply the filter at the table-scan level.
This is structurally identical to DataFusion 52.5's
`DynamicFilterPhysicalExpr` (§1.5). **Validates the design.**
**Difference:** Trino uses Bloom filters by default; DF uses InList
(< 128MB) or HashTable ( 128MB). Our §1.4 finding (roaring beats
Bloom on bits/elem for clustered row IDs) suggests roaring would be
the right addition. Trino does NOT have roaring as a dynamic-filter
encoding; we'd be ahead of Trino on this dimension.
## Entry points to revisit
- `core/trino-main/src/main/java/io/trino/sql/planner/iterative/rule/`
rule-based optimizer (40+ rules).
- `core/trino-spi/src/main/java/io/trino/spi/connector/` — Connector
SPI (the model for our `TableProvider` surface).
- `core/trino-main/src/main/java/io/trino/cost/` — cost estimation.
- `core/trino-main/src/main/java/io/trino/operator/join/` — dynamic
filtering implementation.
## What NOT to take from Trino
- **The Java language.** Different ecosystem; ports are awkward.
- **The coordinator-worker split.** Trino is a distributed system;
OmniGraph is currently in-process (Phase 02). The distributed
patterns become relevant only at Phase 3+.
- **The stateless-coordinator approach to transactions.** MR-737
§5.11 explicitly wants stateful per-table txn branches.
## Decision impact on MR-737
- **§5.7 (cost model):** Add memory cost as third component;
out-degree histograms as the most-useful additional statistic.
- **§5.10 (connector SPI / plugin):** Trino is the gold standard for
a stable, versioned plugin API. Cite when designing OmniGraph's
custom-index-type registration spec.
- **§5.11 (transactions):** Trino is anti-reference; MR-737 is
strictly more ambitious. Document this.
- **§5.6 (dynamic filters):** Validates the §1.5 design; we'd be
ahead of Trino with roaring-encoded dynamic filters.
## Open questions for follow-up
- **Q3.6.1 — Out-degree histogram representation.** Per-node-type
histograms of out-degree by edge type are O(node-types ×
edge-types) histograms. Storage cost? Defer; calibrate during
§5.7 spec.
- **Q3.6.2 — DP-vs-greedy threshold.** Is 8 tables the right
threshold for graph queries (which tend to have fewer tables but
more joins per table)? Defer; calibrate during §5.7 spec.