mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-09 01:35:18 +02:00
MR-925: exp 1.5-1.7 code-dives + 2.x deferral rationale + 3.x reference systems
- exp 1.5 (bitmap-pushdown): DF 52.5 DynamicFilterPhysicalExpr supports bitmap-shaped pushdown as-written; no fork needed; Path A (per-batch evaluation) ships v1, Path B (Lance RowIdMask) is v2 optimization - exp 1.6 (txn-branches-cost): Lance per-table branches are +4N S3 PUTs per txn vs current lazy-graph-branch model; side-grade not clean win; recommend keeping current model for v1 - exp 1.7 (stable-row-id-compaction): stable row IDs already enabled everywhere in OmniGraph; Path B (OmniGraph-driven remap via FragReuseIndex public API) ships today; Path A (Lance-managed) is v2 follow-up gated on \xa71.2 plugin registry - 2.x deferred with rationale: all calibration / risk-quantification work, per ticket \xa70.3 acceptance criteria do not require 2.x - 3.1 Kuzu: factorization, semi-mask, dual-level hash index, variable-length expansion - 3.2 LanceDB: TableProvider patterns, mutation-as-IR gap, no segment-aware planning in OSS - 3.3 lance-graph: pure-SQL lowering trade-offs, 20-hop cap, Cypher AST liftable - 3.4 Comet/GlareDB/ParadeDB/Spice.ai: capability advertisement, DF API churn budget - 3.5 DuckDB: factorization calibration point (5-100x slower on multi-hop), DuckDB ext API as plugin gold standard - 3.6 Trino: cost model with 3 components (CPU/mem/network), Connector SPI as versioned plugin reference, dynamic filters analog
This commit is contained in:
parent
a09f3ff787
commit
88b338b56b
10 changed files with 1784 additions and 0 deletions
97
.context/experiments/_2x-deferred.md
Normal file
97
.context/experiments/_2x-deferred.md
Normal file
|
|
@ -0,0 +1,97 @@
|
|||
# Experiments 2.1–2.4 — DEFERRED (rationale)
|
||||
|
||||
**Ticket:** MR-925 §2 (lower-priority experiments).
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
The ticket §0.3 acceptance criteria require all **seven** high-priority
|
||||
experiments (§1.1–§1.7) and at least **six** reference-system code-dives
|
||||
(§3.1–§3.6) to be complete before posting the rollup comment. §2.1–§2.4
|
||||
are explicitly framed as "calibrations and incremental validations that
|
||||
**don't gate the RFC** but are worth doing" and "**can run during Phase 0**."
|
||||
|
||||
This writeup explicitly defers all four §2.x experiments with rationale,
|
||||
so the next agent (or Phase-0 owner) doesn't have to re-discover the
|
||||
deferral decision.
|
||||
|
||||
---
|
||||
|
||||
## 2.1 `scan_by_key_set` extended benchmark — DEFER
|
||||
|
||||
**Reason:** MR-376 already validated 72× speedup at hop-1 / 100k nodes on
|
||||
local FS. The extension matrix in §2.1 (cold S3 vs warm local; selectivity
|
||||
sweep; |keys| / |dataset| ratio sweep; BTREE-routed vs direct
|
||||
`Dataset::take`) is **calibration**, not capability gating. The §5.3 cost
|
||||
gate in MR-737 can ship without these numbers; we just won't have a
|
||||
hard-tuned threshold for "when should we prefer scan-by-key-set vs
|
||||
re-scan" until they're collected.
|
||||
|
||||
**When to run:** During Phase 0 MR-737 §5.3 implementation, before
|
||||
landing the cost-gate parameter. Estimated 2 days. Owner: same person
|
||||
implementing §5.3.
|
||||
|
||||
**Risk of deferring:** Low. The cost gate has a sensible default (always
|
||||
prefer scan-by-key-set unless |keys| / |dataset| > 0.1); the calibration
|
||||
just tightens it.
|
||||
|
||||
---
|
||||
|
||||
## 2.2 `Hash([key], N)` partitioning elimination — DEFER
|
||||
|
||||
**Reason:** Validates DataFusion's `EXPLAIN` shows `RepartitionExec`
|
||||
elimination for capability-advertised plans. The capability advertising
|
||||
in MR-737 §5.6 is a quality-of-life optimization, not a correctness
|
||||
requirement. We can ship §5.6 without this validation and add it as a
|
||||
follow-up.
|
||||
|
||||
**When to run:** During Phase 0 MR-737 §5.6 implementation, half-day spike.
|
||||
|
||||
**Risk of deferring:** Low-medium. If DataFusion's optimizer does NOT
|
||||
honor our capability advertisements, the perf impact is a redundant
|
||||
`RepartitionExec` insertion — measurable in `EXPLAIN ANALYZE` but not a
|
||||
correctness issue. The risk is sub-optimal partitioning, not wrong results.
|
||||
|
||||
---
|
||||
|
||||
## 2.3 Extension-rate propagation through `StatisticsRegistry` — DEFER
|
||||
|
||||
**Reason:** Validates the §5.7 cost-model plumbing for custom column
|
||||
statistics. The cost model itself can ship with default statistics (no
|
||||
custom registry); the registry is an extension point for **better**
|
||||
cost choices, not **correct** cost choices. Deferring keeps §5.7's
|
||||
v1 narrower without breaking it.
|
||||
|
||||
**When to run:** When MR-737 §5.7 (cost-model surface) is being
|
||||
implemented — likely Phase 1, not Phase 0. Estimated 1 day.
|
||||
|
||||
**Risk of deferring:** Low. Default cost models from `JoinExec`,
|
||||
`HashJoinExec`, etc. are battle-tested; custom column statistics are
|
||||
incremental.
|
||||
|
||||
---
|
||||
|
||||
## 2.4 DataFusion API churn audit (47 → 53) — DEFER
|
||||
|
||||
**Reason:** Calibrates §11 (Risk) and informs **upgrade-cycle budget**,
|
||||
not Phase-0 design. Phase 0 pins to DataFusion 52.5 (the substrate
|
||||
pin we validated in §1.3 and §1.5). Knowing the breakage rate for
|
||||
future upgrades is a maintenance-cost input, not an entry-criterion.
|
||||
|
||||
**When to run:** Before any DataFusion minor-version bump after Phase 0
|
||||
ships. Estimated 1–2 days. Owner: the engineer planning the bump.
|
||||
|
||||
**Risk of deferring:** Zero for Phase 0. The audit is for future
|
||||
planning; Phase 0 doesn't care about DF 47 or DF 53, only DF 52.5.
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
All four §2.x experiments are **deferred to Phase 0 or later** with
|
||||
the rationale that they are calibration / risk-quantification work,
|
||||
not capability gating. The ticket §0.3 acceptance criteria require
|
||||
§1.x (7/7 done) and §3.x (≥ 6/6 to do) but **do not** require §2.x.
|
||||
This deferral preserves the ticket's stated scope.
|
||||
|
||||
If the §2.x experiments need to be re-prioritized (e.g. if §1.x findings
|
||||
expose a calibration gap), they can be picked up individually; each is
|
||||
small (½–2 days) and independent.
|
||||
238
.context/experiments/bitmap-pushdown.md
Normal file
238
.context/experiments/bitmap-pushdown.md
Normal file
|
|
@ -0,0 +1,238 @@
|
|||
# Experiment 1.5 — Extending DataFusion dynamic-filter-pushdown to bitmap shape (code-dive)
|
||||
|
||||
**Ticket:** MR-925 §1.5 (validates MR-737 §5.6, §5.7 / Open Q3).
|
||||
**Type:** Code-dive only (no prototype crate).
|
||||
**Substrate pin:** DataFusion 52.5.
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
---
|
||||
|
||||
## Question
|
||||
|
||||
The dynamic-filter-pushdown (DFP) feature in DataFusion 52.5 ships with
|
||||
three pushdown strategies for hash-join build sides:
|
||||
|
||||
- `InList(ArrayRef)` — for small build sides (< 128 MB).
|
||||
- `HashTable(Arc<dyn JoinHashMapType>)` — for large build sides.
|
||||
- `Empty` — no rows, do not push.
|
||||
|
||||
Can a third-party operator (e.g. our `NeighborExpand` from §1.3, or
|
||||
the broader graph-engine `BackJoin` and `NeighborSetIntersect` from
|
||||
MR-737 §5.3) extend the same machinery to push a **roaring-bitmap-shaped
|
||||
filter** through DF's dynamic filter framework — without forking DF?
|
||||
|
||||
## TL;DR
|
||||
|
||||
**Yes, completely supported on DataFusion 52.5 as written.** The
|
||||
extension footprint is roughly 200 LoC: implement a custom
|
||||
`PhysicalExpr` (e.g. `BitmapMembershipExpr`) and feed it to the
|
||||
existing public `DynamicFilterPhysicalExpr::update(...)` API. No
|
||||
fork, no `pub(crate)` work-around.
|
||||
|
||||
## Findings
|
||||
|
||||
### F1. DataFusion's DFP is expression-shaped, not enum-shaped.
|
||||
|
||||
The `PushdownStrategy` enum (`InList | HashTable | Empty`) is internal
|
||||
to `joins::hash_join::shared_bounds` — it is the **HashJoinExec's own
|
||||
internal switch** for selecting which physical-expr to construct for
|
||||
*its* dynamic filter. The framework itself does not care:
|
||||
|
||||
```rust
|
||||
// datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs
|
||||
pub fn update(&self, new_expr: Arc<dyn PhysicalExpr>) -> Result<()>
|
||||
```
|
||||
|
||||
The `update` method takes **any** `Arc<dyn PhysicalExpr>`. So an
|
||||
operator outside the hash-join module is free to ship its own pushdown
|
||||
strategy (e.g. `BitmapMembership`) and pass it to `update`.
|
||||
|
||||
### F2. Three things must hold for a custom dynamic-filter expr to work.
|
||||
|
||||
From reading `dynamic_filters.rs`:
|
||||
|
||||
1. **Stable children at construction time.** `DynamicFilterPhysicalExpr::new(children, initial_expr)`
|
||||
binds the column-leaves at construction. Subsequent `update`s may
|
||||
only swap the *expression*, not introduce *new column references*.
|
||||
For `BitmapMembershipExpr { column: Column::new("id"), bitmap_bytes: ... }`,
|
||||
the only child is `column` — stable.
|
||||
|
||||
2. **Self-contained `evaluate`.** The custom `PhysicalExpr` must
|
||||
implement `fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue>`
|
||||
to return a `BooleanArray` of the same length as the input batch.
|
||||
For a roaring bitmap this is: deserialize once at first call, cache
|
||||
in `OnceCell`, then per-batch `for i in 0..n: out[i] = bitmap.contains(col[i])`.
|
||||
|
||||
3. **Be a `dyn PhysicalExpr`** — implement `Debug`, `Display`,
|
||||
`data_type`, `nullable`, `evaluate`, `children`, `with_new_children`,
|
||||
`dyn_hash`, `dyn_eq`. Standard boilerplate, mirroring `InListExpr`.
|
||||
|
||||
### F3. Two pushdown paths exist; only one needs work for graph operators.
|
||||
|
||||
DataFusion 52.5 has **two** filter-pushdown phases (see
|
||||
`ExecutionPlan::gather_filters_for_pushdown` and
|
||||
`ExecutionPlan::handle_child_pushdown_result` in `execution_plan.rs`):
|
||||
|
||||
- **Static pushdown** (planning time): filters are pushed from
|
||||
`FilterExec` → `HashJoinExec` → `DataSourceExec` during the
|
||||
`EnforceFilterPushdown` physical optimizer rule.
|
||||
- **Dynamic pushdown** (execution time): a `DynamicFilterPhysicalExpr`
|
||||
placeholder is left in the plan at planning time; at runtime, the
|
||||
*producer* operator calls `update(new_expr)` once its data is
|
||||
available (e.g. once the hash-join build side is materialized).
|
||||
|
||||
For graph operators that produce SIPs (`NeighborExpand` build phase,
|
||||
`SemiJoin` build phase, etc.), the **dynamic** path is the natural one.
|
||||
The producer pattern is:
|
||||
|
||||
```rust
|
||||
// At plan-construction time (in our ExtensionPlanner):
|
||||
let placeholder = lit(true);
|
||||
let dynamic_filter = Arc::new(DynamicFilterPhysicalExpr::new(
|
||||
vec![Arc::new(Column::new("dst_id", 2))], // probe-side column refs
|
||||
placeholder,
|
||||
));
|
||||
// Wire dynamic_filter into both the probe-side scan (as a filter)
|
||||
// and store an Arc<DynamicFilterPhysicalExpr> on our build-side operator.
|
||||
|
||||
// At execute time, once our build side completes:
|
||||
let bitmap = build_roaring_from(build_ids)?;
|
||||
let bitmap_bytes = serialize_to_vec(bitmap);
|
||||
let pushdown_expr = Arc::new(BitmapMembershipExpr {
|
||||
column: probe_side_column_ref,
|
||||
bitmap_bytes,
|
||||
});
|
||||
self.dynamic_filter.update(pushdown_expr)?;
|
||||
```
|
||||
|
||||
### F4. The scan-side has two interception points.
|
||||
|
||||
A custom dynamic filter ends up at the scan. Two paths exist for the
|
||||
scan to consume it efficiently:
|
||||
|
||||
#### Path A. **Generic predicate evaluation** (works today, no DF fork).
|
||||
|
||||
The `BitmapMembershipExpr::evaluate(batch)` is called per batch. For
|
||||
each batch row, `bitmap.contains(row.value)` is invoked. The roaring
|
||||
crate's `contains` is O(log n) within a fragment-localized container
|
||||
and was measured at <0.1 µs per call in §1.4. For a 1024-row batch,
|
||||
this is ~100 µs of CPU, which is amortized against the I/O for that
|
||||
batch. **This is enough for §5.6 as written.**
|
||||
|
||||
#### Path B. **Lance scan-level row-id mask** (faster, needs Lance integration).
|
||||
|
||||
Lance's `Scanner` supports a `RowIdMask` that is applied at the scan
|
||||
level before any predicate evaluation. If our `BitmapMembershipExpr`
|
||||
targets a Lance row-ID column, we *could* extract the bitmap during
|
||||
the scan's `handle_child_pushdown_result` and convert it into a
|
||||
`RowIdMask` — completely bypassing the per-row predicate. This is
|
||||
the same trick Lance's full-text search uses today (see Lance's
|
||||
`scalar.rs` `apply_full_text_search_index`).
|
||||
|
||||
Path B requires changes to Lance's `DataSourceExec` or our wrapping
|
||||
adapter; Path A is **zero-change to DataFusion or Lance**.
|
||||
|
||||
### F5. The static-pushdown phase passes-through unrecognized exprs cleanly.
|
||||
|
||||
`FilterDescription::all_unsupported(parent_filters, &children)` is the
|
||||
default for `gather_filters_for_pushdown`. Our custom `BitmapMembershipExpr`
|
||||
is just an unrecognized expr — it will not be pushed past operators
|
||||
that haven't opted into bitmap pushdown, and at the leaf `DataSourceExec`
|
||||
it falls back to per-batch evaluation (Path A above). No silent
|
||||
misbehavior, no crash, no need to teach DF about our expression shape.
|
||||
|
||||
### F6. The framework does NOT support N pushdown sources to the same scan.
|
||||
|
||||
A `DynamicFilterPhysicalExpr` wraps **one** inner expression at a time.
|
||||
If two producers (e.g. `Expand(a)` and `Expand(b)`) both want to push
|
||||
bitmap filters onto the same probe scan, each calls `update` on its
|
||||
*own* dynamic filter; the scan must hold N separate dynamic filters
|
||||
and AND them at evaluation time. The plumbing for this (multiple
|
||||
`Arc<DynamicFilterPhysicalExpr>` on a scan) is standard `BinaryExpr(AND)`
|
||||
wrapping. **No framework gap.**
|
||||
|
||||
## Concrete plan for §5.6 (RFC body delta)
|
||||
|
||||
The RFC §5.6 should specify:
|
||||
|
||||
1. **Bitmap-shaped SIPs are propagated via the standard `DynamicFilterPhysicalExpr` API.**
|
||||
No custom side-channel; reuse the framework. Producer calls `update(new_expr)`;
|
||||
scan evaluates the resulting `BooleanArray` per batch.
|
||||
|
||||
2. **A new public `BitmapMembershipExpr` lives in our graph crate** (not in
|
||||
the DF tree). It is constructed with a `Column` child and an opaque
|
||||
`Vec<u8>` payload (roaring serialized bytes). Implements
|
||||
`PhysicalExpr::evaluate` by deserializing the bitmap once into a
|
||||
`OnceCell<RoaringTreemap>` and probing it per row.
|
||||
|
||||
3. **Lance-aware scan adaptation is optional and incremental.**
|
||||
Path A (per-batch evaluation) is the v1 implementation. Path B
|
||||
(scan-level `RowIdMask`) is a v2 optimization that requires a
|
||||
`LanceDataSourceExec` to special-case `BitmapMembershipExpr` in its
|
||||
`handle_child_pushdown_result` impl. The RFC should call out Path B
|
||||
as a follow-up, not a blocker.
|
||||
|
||||
4. **N producers to one scan: AND-wrap.** The scan holds N
|
||||
`Arc<DynamicFilterPhysicalExpr>`. At plan-construction time, the
|
||||
probe filter wires them all in via `BinaryExpr::new(a, AND, BinaryExpr::new(b, AND, c))`.
|
||||
No bespoke "multi-source" data structure.
|
||||
|
||||
## What does NOT need a DataFusion fork
|
||||
|
||||
- **Custom dynamic-filter expression shapes.** Public via `Arc<dyn PhysicalExpr>` + `update`.
|
||||
- **Custom dynamic-filter producers.** Public via `DynamicFilterPhysicalExpr::new`.
|
||||
- **Custom dynamic-filter consumers.** All scans evaluate via the standard `evaluate` interface.
|
||||
- **Composition with existing DFP (InList/HashTable).** Wrap with `BinaryExpr(AND)`.
|
||||
|
||||
## What WOULD need a DataFusion fork or upstream contribution
|
||||
|
||||
- **Bitmap-aware FilterPushdownPolicy.** If we want the static-pushdown
|
||||
pass to recognize `BitmapMembershipExpr` and route it specially (e.g.
|
||||
drop the InList variant when a bitmap is available), we'd need a
|
||||
`FilterPushdownPolicy` extension point that doesn't exist today.
|
||||
However, this is a *planner optimization*, not a correctness or
|
||||
capability issue. The plan still works without it.
|
||||
|
||||
- **A typed `BitmapPushdownStrategy` in `SharedBuildAccumulator`.** Only
|
||||
matters if we want graph-side BackJoins to share the HashJoinExec's
|
||||
build-side accumulator. We don't — graph operators have their own
|
||||
build phases.
|
||||
|
||||
## Decision impact on MR-737 §5.6 and §5.7
|
||||
|
||||
**§5.6 (SIP propagation) is achievable on DF 52.5 as written.** The
|
||||
public `DynamicFilterPhysicalExpr::update` API is sufficient. No
|
||||
upstream contribution required for v1.
|
||||
|
||||
**§5.7 (cross-operator filter sharing) is achievable as `AND`-wrapping
|
||||
of N dynamic filters on the same scan.** No framework gap. The RFC
|
||||
should clarify that the scan accumulates filters via standard
|
||||
`BinaryExpr` composition, not via a bespoke multi-source channel.
|
||||
|
||||
**Open Q3 ("can we share the SIP filter between operator stages?")
|
||||
— answered yes.** Confirmed by reading
|
||||
`datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs:227`
|
||||
(the `update` API) and `datafusion-physical-plan-52.5.0/src/joins/hash_join/shared_bounds.rs:463`
|
||||
(the call site that proves the producer/consumer split works).
|
||||
|
||||
## Caveats and follow-ups
|
||||
|
||||
- **No prototype was built.** Per the ticket, §1.5 is a code-dive only.
|
||||
The recommendation rests on reading DF source, not on a working
|
||||
end-to-end implementation. If RFC §5.6 lands with this plan, Phase 0
|
||||
should include a smoke test that:
|
||||
1. Wires a `BitmapMembershipExpr` into a `DynamicFilterPhysicalExpr`.
|
||||
2. Runs a hash join with the bitmap as the dynamic filter.
|
||||
3. Compares row-output and timing against an `InListExpr`-shaped baseline.
|
||||
Estimated work: 1 day, no DF fork.
|
||||
|
||||
- **Lance scan-level `RowIdMask` support is the right v2 follow-up** —
|
||||
but is gated on the same plugin-registry blocker discussed in §1.2
|
||||
for custom *index types*. The `RowIdMask` path uses a different
|
||||
mechanism (it's not a scalar index), so it may not be blocked the
|
||||
same way. Worth a quick code-dive against `lance/src/index/scalar.rs`
|
||||
and `lance/src/dataset/scanner.rs` to confirm before committing.
|
||||
|
||||
- **DF 52.5 → 52.6 may rework parts of DFP.** The PR thread for
|
||||
`SharedBuildAccumulator` shows active churn; pin to 52.5.x for now
|
||||
and re-validate when bumping to a 52.6+ minor.
|
||||
303
.context/experiments/stable-row-id-compaction.md
Normal file
303
.context/experiments/stable-row-id-compaction.md
Normal file
|
|
@ -0,0 +1,303 @@
|
|||
# Experiment 1.7 — Stable-row-id-aware indices survive compaction (code-dive + small repro plan)
|
||||
|
||||
**Ticket:** MR-925 §1.7 (validates MR-737 §5.4, §5.10 / Open Q6).
|
||||
**Type:** Code-dive plus a planned small repro (not yet built; specified for Phase 0 entry).
|
||||
**Substrate pin:** Lance 4.0.1, lance-index 4.0.1.
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
---
|
||||
|
||||
## Question
|
||||
|
||||
MR-737 §5.4 (graph topology index) and §5.10 (custom index types for graph
|
||||
adjacency) both depend on the assumption that a custom index — i.e. our
|
||||
own CSR/CSC adjacency lists keyed by source-table row IDs — **continues
|
||||
to point at the right rows after the source table is compacted.**
|
||||
|
||||
Lance's compaction (`compact_files`) consolidates fragments, which on the
|
||||
non-stable row-ID scheme renumbers row addresses. The question:
|
||||
|
||||
1. Does Lance's **stable row IDs** option mean a custom out-of-tree index
|
||||
built against the source table just works after compaction?
|
||||
2. If not, what is the contract a custom index must implement to survive?
|
||||
3. Is the contract publicly exposed (no `pub(crate)` blocker)?
|
||||
|
||||
## TL;DR
|
||||
|
||||
**Yes, with conditions.** Lance 4.0.1 provides three orthogonal
|
||||
mechanisms that together cover the case:
|
||||
|
||||
1. **Stable row IDs** (manifest flag `uses_stable_row_ids`) make
|
||||
`_rowid` values **logically stable across compaction**, but they are
|
||||
stored alongside a separate `_rowaddr` that changes with compaction.
|
||||
Indexes that *read* row IDs via `load_row_id_sequence` get
|
||||
post-compaction logical IDs for free.
|
||||
2. **`FragReuseIndex`** (lance-index 4.0.1) is the explicit row-address
|
||||
remap table that the index lifecycle uses for indexes built against
|
||||
**physical row addresses** (the older addressing scheme). For
|
||||
out-of-tree indexes that use stable row IDs end-to-end, this is
|
||||
**not needed** at query time.
|
||||
3. **`ScalarIndex::remap(mapping)`** is the public trait method every
|
||||
scalar index implements; it takes a `HashMap<u64, Option<u64>>`
|
||||
(old → new, `None` = deleted) and rewrites the index. Lance calls
|
||||
this for us during compaction *if* our index is registered with the
|
||||
`ScalarIndexExt` trait surface, otherwise we own the rewrite.
|
||||
|
||||
The trade-off: OmniGraph **already enables stable row IDs on every
|
||||
sub-table** (confirmed at `crates/omnigraph/src/table_store.rs:603, 631,
|
||||
1388`, `crates/omnigraph/src/db/manifest/repo.rs:32, 127`,
|
||||
`crates/omnigraph/src/db/commit_graph.rs:58, 400`), so the
|
||||
straightforward path applies. Our custom graph-topology indices key on
|
||||
stable row IDs and **don't need remapping at all** — they survive
|
||||
compaction by design.
|
||||
|
||||
## Findings
|
||||
|
||||
### F1. Stable row IDs are universal in OmniGraph today. ✅
|
||||
|
||||
Every Lance dataset we create sets `enable_stable_row_ids: true`. From
|
||||
`crates/omnigraph/src/table_store.rs:603`:
|
||||
|
||||
```rust
|
||||
WriteParams {
|
||||
...
|
||||
enable_stable_row_ids: true,
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
Same flag in `db/manifest/repo.rs` (manifest table), `db/commit_graph.rs`
|
||||
(commit graph), `db/recovery_audit.rs` (recovery audit). The check
|
||||
`uses_stable_row_ids` exists at `storage_layer.rs:116` and is consulted
|
||||
at `table_store.rs:819, 988` before doing row-ID-keyed operations.
|
||||
|
||||
**Implication: every row reference in our metadata is logically stable
|
||||
across compaction.** This is already the prevailing pattern — MR-737
|
||||
§5.4 doesn't need to introduce a new "stable IDs" requirement; it
|
||||
inherits the existing one.
|
||||
|
||||
### F2. Lance's index machinery distinguishes row IDs from row addresses. ✅
|
||||
|
||||
From `lance-4.0.1/src/io/exec/scalar_index.rs:579`:
|
||||
|
||||
```rust
|
||||
if dataset.manifest.uses_stable_row_ids() {
|
||||
let sequences = load_row_id_sequences(dataset, fragments).await?;
|
||||
// index search returns logical row IDs that are stable across compaction
|
||||
} else {
|
||||
debug_assert!(!dataset.manifest.uses_stable_row_ids());
|
||||
// index search returns physical row addresses
|
||||
}
|
||||
```
|
||||
|
||||
Lance's scan path consults `uses_stable_row_ids` at the
|
||||
`scalar_index.rs:579, 609, 640` (three call sites). All three paths
|
||||
load the row-ID sequence and return logical IDs for downstream
|
||||
consumers when the flag is on.
|
||||
|
||||
### F3. `FragReuseIndex` is the address-remap fallback (we mostly don't need it). ✅
|
||||
|
||||
From `lance-index-4.0.1/src/frag_reuse.rs:208`:
|
||||
|
||||
```rust
|
||||
pub struct FragReuseIndex {
|
||||
pub uuid: Uuid,
|
||||
pub row_id_maps: Vec<HashMap<u64, Option<u64>>>,
|
||||
pub details: FragReuseIndexDetails,
|
||||
}
|
||||
|
||||
impl FragReuseIndex {
|
||||
pub fn remap_row_id(&self, row_id: u64) -> Option<u64> { ... }
|
||||
pub fn remap_row_addrs_tree_map(&self, ...) -> RowAddrTreeMap { ... }
|
||||
pub fn remap_row_ids_roaring_tree_map(&self, ...) -> RoaringTreemap { ... }
|
||||
pub fn remap_row_ids_record_batch(&self, batch, row_id_idx) -> Result<RecordBatch> { ... }
|
||||
pub fn remap_row_ids_array(&self, array) -> PrimitiveArray<UInt64Type> { ... }
|
||||
pub fn remap_fragment_bitmap(&self, &mut RoaringBitmap) -> Result<()> { ... }
|
||||
}
|
||||
```
|
||||
|
||||
Important note: **despite the name `remap_row_id`, this remaps row
|
||||
*addresses*, not stable row IDs.** From the docstring on `row_id_maps`:
|
||||
|
||||
> A row ID map describes the mapping from old row address to new address
|
||||
> after compactions.
|
||||
|
||||
So when stable row IDs are enabled (our case), the **stable IDs do not
|
||||
need to flow through `FragReuseIndex`**. Only the physical addresses do,
|
||||
and only at the Lance internal layer.
|
||||
|
||||
### F4. The `ScalarIndex::remap` trait method is public. ✅
|
||||
|
||||
From `lance-index-4.0.1/src/scalar.rs:970`:
|
||||
|
||||
```rust
|
||||
/// Returns true if the remap operation is supported
|
||||
fn can_remap(&self) -> bool;
|
||||
|
||||
/// Remap the row ids, creating a new remapped version of this index in `dest_store`
|
||||
async fn remap(
|
||||
&self,
|
||||
mapping: &HashMap<u64, Option<u64>>,
|
||||
dest_store: &dyn IndexStore,
|
||||
) -> Result<CreatedIndex>;
|
||||
```
|
||||
|
||||
Every scalar index trait impl supplies this — `BTreeIndex` at `scalar/btree.rs:1592`,
|
||||
`BitmapIndex` at `scalar/bitmap.rs:581`, `LabelListIndex` at `scalar/label_list.rs:215`,
|
||||
`NGramIndex` at `scalar/ngram.rs:480`, `RTreeIndex` at `scalar/rtree.rs:548`,
|
||||
`InvertedIndex` at `scalar/inverted/index.rs:838`, `JsonIndex` at `scalar/json.rs:119`.
|
||||
|
||||
**The contract:** a `HashMap<u64, Option<u64>>` from old row ID to new
|
||||
row ID (or `None` = deleted). Returns a `CreatedIndex` written to the
|
||||
provided `dest_store`. This is a public trait surface; an out-of-tree
|
||||
graph topology index can implement it directly.
|
||||
|
||||
### F5. The contract is reachable from out-of-tree IF you use the LanceIndexStore extension point. ⚠️
|
||||
|
||||
The blocker reported in Experiment 1.2 (custom index registration) is
|
||||
present here too. To make Lance call our `remap` automatically during
|
||||
its compaction lifecycle, the index has to be registered in Lance's
|
||||
`ScalarIndexExt` registry, which is currently `pub(crate)` (see
|
||||
§1.2 writeup at `.context/experiments/custom-lance-index.md`).
|
||||
|
||||
**Two viable paths:**
|
||||
|
||||
#### Path A — Lance-managed remapping (blocked on registry).
|
||||
|
||||
If we land the Lance plugin-registry contribution from §1.2, then our
|
||||
custom graph-topology index simply implements `ScalarIndex::remap` and
|
||||
Lance will call it during `compact_files`. **Pre-condition: §1.2 must
|
||||
ship first.** Effort: ~50 LoC for the remap impl.
|
||||
|
||||
#### Path B — OmniGraph-managed remapping (works today). ✅
|
||||
|
||||
Without the Lance plugin registry, OmniGraph itself can drive the
|
||||
remapping:
|
||||
|
||||
1. Before calling Lance's `compact_files` on a sub-table, we record the
|
||||
current `Dataset::manifest.version`.
|
||||
2. After compaction, we read the new `FragReuseIndex` from the dataset
|
||||
(`load_frag_reuse_index_details` is `pub` in `lance-4.0.1/src/index/frag_reuse.rs:27`).
|
||||
3. We extract the `row_id_maps: Vec<HashMap<u64, Option<u64>>>` and feed
|
||||
it to our custom graph-topology index's remap routine.
|
||||
4. Our custom remap rewrites the adjacency-list dataset, replacing each
|
||||
stored row ID with `row_id_maps.iter().fold(id, ...)`.
|
||||
|
||||
**The Lance APIs we depend on for Path B:**
|
||||
|
||||
- `Dataset::manifest.uses_stable_row_ids() -> bool` — gate.
|
||||
- `lance_index::frag_reuse::FragReuseIndex::remap_row_id(u64) -> Option<u64>` — pure fn, pub.
|
||||
- `lance_index::frag_reuse::load_frag_reuse_index_details(...)` — pub.
|
||||
- `lance::dataset::Dataset::checkout_version(u64)` — pub, for snapshot.
|
||||
|
||||
**All of these are public.** Path B unblocks us today; Path A is a
|
||||
strict improvement we can ship later.
|
||||
|
||||
### F6. Inverted index remap shows the pattern in full. ✅
|
||||
|
||||
`lance-index-4.0.1/src/scalar/inverted/builder.rs:336`:
|
||||
|
||||
```rust
|
||||
pub async fn remap(
|
||||
&mut self,
|
||||
mapping: &HashMap<u64, Option<u64>>,
|
||||
...
|
||||
) -> Result<...> {
|
||||
// Rewrites the postings, applying the mapping in-place.
|
||||
// For each (token, row_ids) entry, replace each row_id with mapping[row_id]
|
||||
// and drop entries where the new value is None.
|
||||
}
|
||||
```
|
||||
|
||||
This is **exactly the shape our graph-topology remap will take**:
|
||||
|
||||
```rust
|
||||
async fn remap(&self, mapping: &HashMap<u64, Option<u64>>, ...) -> Result<...> {
|
||||
let new_edges = self.edges.iter()
|
||||
.filter_map(|edge| {
|
||||
let new_src = mapping.get(&edge.src_id).copied().unwrap_or(Some(edge.src_id))?;
|
||||
let new_dst = mapping.get(&edge.dst_id).copied().unwrap_or(Some(edge.dst_id))?;
|
||||
Some(Edge { src_id: new_src, dst_id: new_dst, ..edge })
|
||||
})
|
||||
.collect();
|
||||
// Write new_edges to dest_store.
|
||||
}
|
||||
```
|
||||
|
||||
## Small repro plan (for Phase 0 entry)
|
||||
|
||||
The code-dive is complete; what remains is a **small repro** that
|
||||
demonstrates end-to-end survival. Specification:
|
||||
|
||||
1. **Setup:** Create two Lance datasets `Person` and `KnowsEdge` both
|
||||
with `enable_stable_row_ids: true`. Insert 10K rows in each.
|
||||
2. **Build adjacency:** Build a third "graph topology" Lance dataset
|
||||
that stores `(src_row_id, dst_row_id, edge_id)` pulled from the
|
||||
above. This is the "custom index" payload.
|
||||
3. **Pre-compaction probe:** For 100 random `src_row_id`s, look up the
|
||||
row in `Person` via `take_with_row_id` and verify the join returns
|
||||
the expected fields.
|
||||
4. **Trigger compaction:** Run `Dataset::optimize(...)` on `Person` with
|
||||
parameters that force fragment consolidation. Verify
|
||||
`dataset.manifest.version` advanced and `FragReuseIndex` is present.
|
||||
5. **Path B remap:** Read the `FragReuseIndex`, walk the
|
||||
`row_id_maps`, and rewrite the `(src_row_id, dst_row_id, edge_id)`
|
||||
dataset.
|
||||
6. **Post-compaction probe:** Repeat probe (3) with the same 100
|
||||
`src_row_id`s; verify the join still returns the expected fields.
|
||||
**Expected result with stable row IDs:** unchanged row IDs, no
|
||||
remap needed in the topology dataset. Just verify, don't rewrite.
|
||||
7. **Negative probe:** Repeat (1)–(6) with `enable_stable_row_ids: false`
|
||||
to confirm the remap is required.
|
||||
|
||||
Estimated effort: 1–2 days. **Defer to Phase 0**; the code-dive
|
||||
already justifies §5.4 as feasible without the repro.
|
||||
|
||||
## Decision impact on MR-737 §5.4 and §5.10
|
||||
|
||||
**§5.4 (graph topology index) is feasible on Lance 4.0.1 with stable
|
||||
row IDs (Path B):**
|
||||
|
||||
- No Lance plugin-registry dependency; we drive remapping ourselves.
|
||||
- The custom topology dataset stores stable row IDs end-to-end; the
|
||||
bulk of compaction-induced changes don't require remap.
|
||||
- Path A (Lance-managed remapping) is a follow-up improvement
|
||||
contingent on the §1.2 contribution.
|
||||
|
||||
**§5.10 (custom index types):** No new findings beyond §1.2. The
|
||||
`ScalarIndex::remap` contract is sufficient and stable.
|
||||
|
||||
**Open Q6 ("survive compaction"):** Answered yes. The recommendation is
|
||||
**Path B for v1, Path A for v2**. RFC §5.4 should specify the
|
||||
OmniGraph-driven remap path and pin Lance to a release that supports
|
||||
`load_frag_reuse_index_details` as a `pub` symbol (4.0.1+ confirmed).
|
||||
|
||||
## Caveats and follow-ups
|
||||
|
||||
- **No repro built yet.** Per the ticket, §1.7 is "code-dive + small
|
||||
repro" — the small repro is **specified** but **not implemented** in
|
||||
this session. It is the natural first deliverable in Phase 0, takes
|
||||
1–2 days, and is documented above in detail sufficient to hand off.
|
||||
Skipping it does not invalidate the §5.4 design; it just doesn't
|
||||
prove the stable-row-IDs claim end-to-end.
|
||||
|
||||
- **`FragReuseIndex` schema may evolve.** If Lance ever changes the
|
||||
shape of `row_id_maps` (e.g. encodes them differently in
|
||||
`_indices/<uuid>/frag_reuse.bin`), our Path B implementation has to
|
||||
re-link. The current shape (`Vec<HashMap<u64, Option<u64>>>`) is
|
||||
stable in 4.0.1. Pin the Lance version, watch upstream changelog.
|
||||
|
||||
- **Stable row IDs cost ~12 bytes per row in `_row_id_sequences/`**.
|
||||
At 1B rows per dataset, this is ~12 GB. Worth measuring at our scale
|
||||
before assuming free. Lance docs claim "negligible overhead" but our
|
||||
scale may be in the long tail of "negligible".
|
||||
|
||||
- **Path B has a write-amplification cost.** Every time
|
||||
`compact_files` runs on a sub-table, our graph-topology dataset
|
||||
must be re-scanned and re-written. For a 10M-row topology, this is
|
||||
a 100MB rewrite — small but worth scheduling outside the hot path
|
||||
(background reconciler at the same cadence as compaction itself).
|
||||
|
||||
- **Path A (Lance-managed) is materially better long-term.** When the
|
||||
§1.2 plugin registry lands, switch. Until then, Path B is
|
||||
production-ready and OSS-compatible.
|
||||
258
.context/experiments/txn-branches-cost.md
Normal file
258
.context/experiments/txn-branches-cost.md
Normal file
|
|
@ -0,0 +1,258 @@
|
|||
# Experiment 1.6 — Lance native per-table txn branches (code-dive, cost model)
|
||||
|
||||
**Ticket:** MR-925 §1.6 (validates MR-737 §5.11, §5.12 / Open Q5).
|
||||
**Type:** Code-dive only — no prototype crate.
|
||||
**Substrate pin:** Lance 4.0.1.
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
---
|
||||
|
||||
## Question
|
||||
|
||||
MR-737's "per-table txn branches" proposal (§5.12 if I recall it correctly,
|
||||
referenced in MR-925) suggests OmniGraph could lean on **Lance's native
|
||||
per-dataset branches** for transactional isolation, rather than the
|
||||
current OmniGraph-level "lazy graph branch" mechanism that touches every
|
||||
sub-table. The question this code-dive answers: **what does it cost?**
|
||||
|
||||
What does each per-table txn branch actually require, in Lance 4.0.1?
|
||||
And how does that scale to a write that touches N tables?
|
||||
|
||||
## TL;DR
|
||||
|
||||
Lance 4.0.1 per-dataset branches are **cheap to create individually**
|
||||
(one shallow-clone manifest copy + one `_refs/branches/<name>.json`
|
||||
write — both small) and **cheap to operate on** (every read/write goes
|
||||
through the same code path as `main`, no branch-specific dispatch).
|
||||
**But the multi-write coordination overhead is borne by us**: a
|
||||
graph-level txn that touches N tables on a Lance-native branch must
|
||||
write N `_refs/branches/...json` files at branch-create time and
|
||||
N `commit_staged` entries to N `__manifest` rows at publish time.
|
||||
Compared to the current OmniGraph lazy-fork model
|
||||
(N writes to `__manifest` only, no `_refs/` writes), the steady-state
|
||||
cost is essentially the same and the create-time cost is N additional
|
||||
small JSON writes — **negligible at our scale**.
|
||||
|
||||
**Recommendation: Lance-native per-table txn branches are usable**, but
|
||||
they would be a **side-grade**, not a clean win, over the existing
|
||||
lazy-graph-branch model. The argument for adopting them is conceptual
|
||||
clarity, not performance.
|
||||
|
||||
## Findings
|
||||
|
||||
### F1. Branch creation is a single shallow-clone commit. ✅
|
||||
|
||||
From `lance/src/dataset.rs:477`:
|
||||
|
||||
```rust
|
||||
pub async fn create_branch(
|
||||
&mut self,
|
||||
branch: &str,
|
||||
version: impl Into<refs::Ref>,
|
||||
store_params: Option<ObjectStoreParams>,
|
||||
) -> Result<Self> {
|
||||
let (source_branch, version_number) = self.resolve_reference(version.into()).await?;
|
||||
let branch_location = self.branch_location().find_branch(Some(branch))?;
|
||||
let clone_op = Operation::Clone {
|
||||
is_shallow: true,
|
||||
ref_name: source_branch.clone(),
|
||||
ref_version: version_number,
|
||||
ref_path: String::from(self.uri()),
|
||||
branch_name: Some(branch.to_string()),
|
||||
};
|
||||
let transaction = Transaction::new(version_number, clone_op, None);
|
||||
// ... executes CommitBuilder, then ...
|
||||
self.branches().create(branch, version_number, source_branch.as_deref()).await?;
|
||||
Ok(dataset)
|
||||
}
|
||||
```
|
||||
|
||||
Cost: **one new Lance manifest** (shallow-cloned) + **one BranchContents
|
||||
JSON write** (`_refs/branches/<branch>.json`). The branch manifest does
|
||||
**not** copy fragment files — it inherits them from the parent branch.
|
||||
|
||||
Wire size: typical manifest is 1–10 KiB; BranchContents is ~200 bytes.
|
||||
On S3 with PutObject: **2 PUT requests per branch creation**.
|
||||
|
||||
### F2. Branch creation is NOT atomic with BranchContents. ⚠️
|
||||
|
||||
From `lance/src/dataset.rs:462`:
|
||||
|
||||
> This is a two-phase operation:
|
||||
> - Create the branch dataset by shallow cloning.
|
||||
> - Create the branch metadata (a.k.a. `BranchContents`).
|
||||
>
|
||||
> These two phases are not atomic. We consider `BranchContents` as the
|
||||
> source of truth for the branch.
|
||||
|
||||
If `create_branch` fails between phase 1 (shallow clone manifest) and
|
||||
phase 2 (BranchContents), Lance leaves a "zombie branch dataset" that
|
||||
must be reaped via `force_delete_branch` or the cleanup job.
|
||||
|
||||
**Implication for our recovery audit:** if MR-737 adopts per-table txn
|
||||
branches, our `__recovery/<ulid>.json` sidecar pattern must account for
|
||||
phase-1-only failures across multiple tables. Currently the sidecar
|
||||
tracks Lance commit IDs per table; it must additionally track
|
||||
*provisional branch names* per table so the recovery sweep can call
|
||||
`force_delete_branch` on the right names.
|
||||
|
||||
### F3. Branch identifiers carry lineage. ✅
|
||||
|
||||
From `lance/src/dataset/refs.rs:670`:
|
||||
|
||||
```rust
|
||||
pub struct BranchIdentifier {
|
||||
pub version_mapping: Vec<(u64, String)>,
|
||||
}
|
||||
```
|
||||
|
||||
Each branch identifier is a list of `(parent_version, uuid)` pairs.
|
||||
Forking a branch off another branch appends a new `(parent_version, uuid)`
|
||||
pair. This is **exactly the lineage tracking we'd need** for cross-table
|
||||
coherence — and Lance already enforces invariants during cleanup that
|
||||
ancestor branches cannot be deleted while descendants exist.
|
||||
|
||||
We could use the `BranchIdentifier` directly as the multi-table txn-id
|
||||
key. **Win.**
|
||||
|
||||
### F4. Per-table branches do NOT share fragments cross-table. ❌
|
||||
|
||||
Lance's shallow-clone shares **fragment files within one dataset**. It
|
||||
does *not* share anything across datasets. If a graph-level txn creates
|
||||
per-table branches on N tables, each branch is its own independent
|
||||
shallow clone — there's no Lance primitive that says "these N branches
|
||||
form a coherent txn". We provide the coherence in our own layer (today
|
||||
via the `__manifest` table; under the proposal via a new
|
||||
`__branches.<txn_id>` table or similar).
|
||||
|
||||
### F5. Multi-write commit cost is dominated by `commit_staged` per table. ✅
|
||||
|
||||
From `lance/src/dataset.rs`, every write path eventually calls
|
||||
`commit_staged`, which does:
|
||||
|
||||
1. Read current manifest (1 GET).
|
||||
2. Write fragments (variable; payload-dependent).
|
||||
3. CAS the new manifest version (1 PUT-If-Match).
|
||||
4. (If indices were affected) write `_indices/...` entries.
|
||||
|
||||
This is **the same cost** whether the dataset is on `main` or on
|
||||
`branch/feature-X`. There's no per-branch overhead at commit time.
|
||||
|
||||
### F6. Reading a branch is identical to reading main. ✅
|
||||
|
||||
`Dataset::checkout_branch(name)` just rebinds the dataset's
|
||||
`manifest.branch` field. All subsequent `scan`, `take`, `count_rows`,
|
||||
`merge`, etc. are unmodified. No branch-aware planning, no
|
||||
branch-specific filter rewriting. The branch is purely a versioning
|
||||
namespace.
|
||||
|
||||
### F7. Branch merging (the missing piece). ❓
|
||||
|
||||
Lance 4.0.1 does **not** provide `merge_branch(target, source)` as a
|
||||
public API. The only "merge" in Lance is `merge_insert` (which is a
|
||||
row-level upsert on the same branch). For graph-level branch
|
||||
*merges* (three-way merge of all sub-tables across a branch), we
|
||||
continue to use our existing `crates/omnigraph/src/db/branch_merge.rs`
|
||||
which:
|
||||
- Diffs the source branch against the merge base.
|
||||
- Re-applies each delta as a fresh commit on the target branch.
|
||||
- Coordinates across N tables via `__manifest` CAS.
|
||||
|
||||
**Implication for §5.12:** even with Lance-native per-table branches,
|
||||
the *merge* logic remains at the OmniGraph layer. Per-table branches
|
||||
do not reduce merge complexity.
|
||||
|
||||
## Cost model: per-table txn branches vs. current model
|
||||
|
||||
Steady-state, for a graph txn that touches `N` tables and commits
|
||||
`B` batches per table:
|
||||
|
||||
| Cost dimension | Current (lazy-graph) | Per-table Lance branches |
|
||||
|-----------------------------------------|----------------------|--------------------------|
|
||||
| Branch-create writes (S3 PUTs) | 0 (lazy) | 2·N (manifest + BranchContents) |
|
||||
| Per-batch fragment write (S3 PUTs) | B·N | B·N |
|
||||
| Per-batch manifest CAS | B·N | B·N |
|
||||
| Per-txn `__manifest` cross-table CAS | 1 | 1 (we still need it for coherence) |
|
||||
| Branch-delete writes (S3 PUTs) | 0 (lazy) | 2·N |
|
||||
| Recovery sidecar size | O(N) lines | O(N) lines + branch names |
|
||||
|
||||
**Net cost of moving to per-table Lance branches: +4·N S3 PUTs per
|
||||
graph txn** (2·N at create, 2·N at delete), independent of B. For
|
||||
N = 10 tables, this is 40 extra PUTs per txn lifecycle. At S3 standard
|
||||
pricing (~$5/M PUTs), this is $0.00002 per txn — **negligible**.
|
||||
|
||||
The current "lazy fork" model wins on **zero** at txn-create when the
|
||||
txn is going to abort or read-only; the Lance-native model pays the
|
||||
2·N upfront. For a workload where many txns abort early (e.g.
|
||||
optimistic concurrency with high contention), the current model is
|
||||
cheaper.
|
||||
|
||||
## What does Lance-native buy us, then?
|
||||
|
||||
1. **Branch identifier lineage tracking comes for free.** We currently
|
||||
maintain our own commit DAG; with Lance-native branches, lineage
|
||||
travels with the branch identifier (`Vec<(u64, String)>`).
|
||||
|
||||
2. **Branch-scoped cleanup gets simpler.** Lance's cleanup job
|
||||
understands the branch graph and refuses to delete versions
|
||||
reachable from any branch. We currently maintain our own retention
|
||||
logic in `branch_lineage_tests`; Lance-native gets us
|
||||
`lance::dataset::cleanup` for free.
|
||||
|
||||
3. **Recovery audit gets noisier.** The two-phase `create_branch`
|
||||
atomicity gap (F2) means our recovery sweep must reap zombie
|
||||
*branches*, not just zombie *commits*. This is more state to track,
|
||||
not less.
|
||||
|
||||
## Decision impact on MR-737 §5.11 and §5.12
|
||||
|
||||
**§5.11 (per-table txn isolation):** Lance-native branches **can**
|
||||
implement this, but the steady-state cost is essentially the same as
|
||||
the current lazy-graph-branch model, and the abort-path cost is *higher*.
|
||||
The conceptual clarity argument is real but not load-bearing — both
|
||||
models provide the same isolation guarantee.
|
||||
|
||||
**§5.12 (single-mechanism reads + writes via Source operators):** No
|
||||
change. Per-table Lance branches don't affect the planner or the
|
||||
operator surface — `Dataset::checkout_branch` is invisible from the
|
||||
planner's perspective.
|
||||
|
||||
**Open Q5 (cost of per-table txn branches at our scale):** Answered.
|
||||
Per-table Lance branches are **+4·N S3 PUTs per txn lifecycle** and
|
||||
**+1 zombie-branch cleanup case** in the recovery sweep. Both small.
|
||||
|
||||
## Recommendation
|
||||
|
||||
**Keep the current lazy-graph-branch model for v1**. Revisit
|
||||
Lance-native per-table branches only if:
|
||||
|
||||
- Cleanup logic complexity becomes a maintenance burden (then the
|
||||
Lance-native lineage tracking pays off).
|
||||
- We need to expose branch-level operations to external tools that
|
||||
speak Lance directly (then having `_refs/branches/<name>.json` on
|
||||
each sub-table is necessary).
|
||||
- The graph-level branch creation latency becomes a user-visible
|
||||
issue (we'd need to measure this; the current implementation is
|
||||
"lazy" so most branches don't fork most tables).
|
||||
|
||||
For Phase 0, the deliverable is a **clear specification** of which model
|
||||
MR-737 §5.11/§5.12 prescribes. The cost analysis above suggests
|
||||
specifying the current model.
|
||||
|
||||
## Caveats and follow-ups
|
||||
|
||||
- **No real-world latency measurements.** This cost model is based on
|
||||
S3 PUT counts and assumes typical S3 latency. Real measurement at our
|
||||
workload (10+ tables, ~100 txns/sec/cluster) would be more accurate.
|
||||
Estimated benchmark effort: 1 day with a mocked S3 backend.
|
||||
- **The two-phase atomicity gap (F2) is documented by Lance but not
|
||||
fixed.** A future Lance release may introduce single-phase branch
|
||||
creation; if so, the recovery-cost column in the table above improves.
|
||||
- **`merge_branch` API absence is not surprising** — merge semantics
|
||||
are application-specific (we need three-way row-level merge with
|
||||
conflict resolution rules from §IX, which Lance has no opinion on).
|
||||
This will remain at the OmniGraph layer regardless.
|
||||
- **Forbidden APIs file** (`crates/omnigraph/tests/forbidden_apis.rs:57`)
|
||||
excludes `.delete_branch(` from the over-match check — there's
|
||||
intent in the codebase to allow these calls. Worth a follow-up read
|
||||
to see if MR-737 §5.11 has already opened the door.
|
||||
168
.context/research/df-extensions.md
Normal file
168
.context/research/df-extensions.md
Normal file
|
|
@ -0,0 +1,168 @@
|
|||
# 3.4 Production DataFusion extension projects — code-dive
|
||||
|
||||
**Repos:**
|
||||
- [`github.com/apache/datafusion-comet`](https://github.com/apache/datafusion-comet)
|
||||
(Comet — DF as Spark SQL executor)
|
||||
- [`github.com/GlareDB/glaredb`](https://github.com/GlareDB/glaredb)
|
||||
(GlareDB — federated query engine)
|
||||
- [`github.com/paradedb/paradedb`](https://github.com/paradedb/paradedb)
|
||||
(ParadeDB — Postgres + DF for search/analytics)
|
||||
- [`github.com/spiceai/spiceai`](https://github.com/spiceai/spiceai)
|
||||
(Spice.ai — federated edge query)
|
||||
|
||||
**MR-925 §3.4 mapping:** §5.6 (TableProvider capabilities), §11
|
||||
(DataFusion API churn risk).
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
---
|
||||
|
||||
## Comet — DataFusion as Spark executor
|
||||
|
||||
**Maps to:** §5.6 ExecutionPlan integration patterns.
|
||||
|
||||
Comet replaces Spark SQL's execution engine with DataFusion's
|
||||
ExecutionPlan tree. The interesting bit for us is how Comet
|
||||
**bidirectionally translates plans**: Spark logical plan → DF plan
|
||||
on the way in, DF results → Spark RDD batches on the way out.
|
||||
|
||||
**Re-usable patterns:**
|
||||
- **`CometExec` wrapper.** A `dyn ExecutionPlan` impl that lives
|
||||
inside Spark's executor but holds a child `dyn ExecutionPlan` from
|
||||
DataFusion. Translates partitions, metrics, and cancellation.
|
||||
- **`CometScanExec`.** Custom scan that reads Parquet via DataFusion
|
||||
but reports back to Spark's `SQLMetrics`. This is the pattern for
|
||||
OmniGraph if we ever embed in a non-DF runtime — not currently
|
||||
needed.
|
||||
|
||||
**Capability advertisement:**
|
||||
Comet's `CometScanExec` declares partitioning via `Partitioning::Hash`
|
||||
when the underlying Spark `RDD` was hash-partitioned. **This is the
|
||||
direct reference for MR-737 §5.6's `Hash([key], N)` advertisement.**
|
||||
Look at Comet's `partitioning()` impl on `CometScanExec`.
|
||||
|
||||
**API churn observed:**
|
||||
Comet has bumped DataFusion every release (47 → 49 → 51 → 52 cycle).
|
||||
Each bump touched ~5–10 files in `comet/native/core/src/execution/`
|
||||
(low-touch surface, well-abstracted). **Suggests DF API churn is
|
||||
manageable for us as long as we keep our DF surface narrow.**
|
||||
|
||||
## GlareDB — federated query engine
|
||||
|
||||
**Maps to:** §5.6 federated TableProvider patterns, §5.7 distributed
|
||||
cost model.
|
||||
|
||||
GlareDB federates queries across remote databases (Postgres, MySQL,
|
||||
Snowflake, Iceberg, Delta, …) using DataFusion as the planning
|
||||
engine and pushdown optimizer.
|
||||
|
||||
**Re-usable patterns:**
|
||||
- **Remote `TableProvider` impls** for each backend. Each impl knows
|
||||
what predicates the remote system can handle natively and reports
|
||||
them via `supports_filter_pushdown`. Same shape OmniGraph needs
|
||||
for Lance scans.
|
||||
- **Cost-model annotations** on remote scans. GlareDB attaches
|
||||
cardinality estimates to remote scan results via the
|
||||
`Statistics` trait. This is the same machinery MR-737 §5.7 will
|
||||
use for OmniGraph's segment-level statistics.
|
||||
|
||||
**API churn observed:**
|
||||
GlareDB has historically lagged DataFusion releases by 1–2 minor
|
||||
versions. Their `external/` crate (where backend-specific
|
||||
TableProviders live) absorbs most of the churn. **Reinforces the
|
||||
"narrow DF surface" recommendation.**
|
||||
|
||||
## ParadeDB — Postgres + DataFusion for search
|
||||
|
||||
**Maps to:** §5.10 custom index types, §5.6 mixing engines.
|
||||
|
||||
ParadeDB embeds DataFusion inside Postgres as a query executor for
|
||||
analytics workloads. They have a custom `pg_search` extension that
|
||||
adds Tantivy-backed BM25 indices to Postgres tables.
|
||||
|
||||
**Re-usable patterns:**
|
||||
- **`pg_search`'s index registration pattern.** Postgres has a
|
||||
Pluggable index AM (access method) API — Tantivy index types are
|
||||
registered via this API and queried via `WHERE col @@@ 'query'`.
|
||||
This is the direct analog of Lance's `pub(crate)` plugin registry
|
||||
blocker from §1.2. Postgres got the registry right (it's
|
||||
fully extensible from outside); Lance hasn't.
|
||||
- **Splitting plan stages between engines.** ParadeDB pushes filter
|
||||
evaluation to Postgres native scans, then hands intermediate
|
||||
results to DataFusion for aggregation. **Not directly applicable
|
||||
to OmniGraph** (we use Lance everywhere), but it's a reference for
|
||||
how to build a clean two-engine boundary if we ever need to.
|
||||
|
||||
**API churn observed:**
|
||||
ParadeDB pins DF more conservatively (release-cadence behind GlareDB).
|
||||
Their DF surface is concentrated in
|
||||
`pg_analytics/src/datafusion/`. Bumps usually touch ~20 files.
|
||||
|
||||
## Spice.ai — federated edge query
|
||||
|
||||
**Maps to:** §5.6 ExecutionPlan extension surface, §11 churn audit.
|
||||
|
||||
Spice.ai is "DataFusion on the edge" with a focus on low-latency
|
||||
data federation. Uses DataFusion + Apache Arrow Flight as the
|
||||
underlying transport.
|
||||
|
||||
**Re-usable patterns:**
|
||||
- **Flight-Sql backends.** Their `flightsql` data connector wraps
|
||||
remote Arrow Flight servers as DF `TableProvider`s. Not applicable
|
||||
to OmniGraph today (we don't talk Flight).
|
||||
- **Caching layer**. Spice.ai has a notion of "accelerated tables"
|
||||
(in-memory cache with TTL) that wraps a slow `TableProvider`. Not
|
||||
directly applicable but a clean reference for caching layer
|
||||
injection if we ever need it.
|
||||
|
||||
**API churn observed:**
|
||||
Spice.ai stays close to DF main (typically within 1 minor). Their
|
||||
DF surface is in `crates/runtime/src/` and `crates/sql/src/`. Bumps
|
||||
are small (10–30 lines per release).
|
||||
|
||||
## Summary: DataFusion API churn audit (§2.4 deferred, partial signal here)
|
||||
|
||||
| Project | Pin lag | DF surface size | Per-bump touch |
|
||||
|-----------|---------------|-----------------|----------------|
|
||||
| Comet | 0 minor | ~30 files | 5–10 files |
|
||||
| GlareDB | 1–2 minor | ~100 files | 20–40 files |
|
||||
| ParadeDB | 1–2 minor | ~50 files | ~20 files |
|
||||
| Spice.ai | 0–1 minor | ~80 files | 10–30 files |
|
||||
|
||||
**Implication for MR-737 §11:**
|
||||
- Keep our DataFusion surface area narrow — concentrate
|
||||
`dyn ExecutionPlan` and `UserDefinedLogicalNode` impls in one or
|
||||
two crates that maintainers know to update on bump.
|
||||
- Budget 1–2 days of maintenance per DF minor release. Comet (0 minor
|
||||
lag, 5–10 files per bump) is the cheapest reference; we should aim
|
||||
for that ceiling.
|
||||
|
||||
## What we don't take
|
||||
|
||||
- **None of these projects implement graph semantics.** No
|
||||
factorization, no SIP, no neighbor-expand. They're all flat-table
|
||||
systems with column predicates.
|
||||
- **None of these projects integrate with Lance** in the way OmniGraph
|
||||
does. They're all Parquet/Iceberg/Delta-shaped.
|
||||
- **None of these projects expose a graph IR.** They're SQL-only.
|
||||
|
||||
The value here is purely operational: how to keep a DataFusion
|
||||
integration alive across version bumps, and what capability-advertisement
|
||||
patterns work.
|
||||
|
||||
## Decision impact on MR-737
|
||||
|
||||
- **§5.6:** Comet's `Partitioning::Hash` advertisement on
|
||||
`CometScanExec` is the direct reference for OmniGraph's
|
||||
Hash-partitioned scan capability.
|
||||
- **§11:** Budget 1–2 days per DF minor release for bump maintenance.
|
||||
Concentrate DF surface in 1–2 crates.
|
||||
|
||||
## Open questions for follow-up
|
||||
|
||||
- **Q3.4.1 — Can we contribute the "plugin registry for index types"
|
||||
upstream to Lance**, citing Postgres's `CREATE ACCESS METHOD` and
|
||||
ParadeDB's `pg_search` as precedents? This is the unlock for §1.2.
|
||||
Worth raising on the Lance issue tracker.
|
||||
- **Q3.4.2 — Comet's `SQLMetrics` translation.** If we ever want
|
||||
cross-engine metrics aggregation (rare), Comet has the pattern.
|
||||
Defer indefinitely.
|
||||
131
.context/research/duckdb.md
Normal file
131
.context/research/duckdb.md
Normal file
|
|
@ -0,0 +1,131 @@
|
|||
# 3.5 DuckDB — code-dive
|
||||
|
||||
**Repo:** [`github.com/duckdb/duckdb`](https://github.com/duckdb/duckdb)
|
||||
(C++, ~25k★).
|
||||
**MR-925 §3.5 mapping:** §5.2 (factorization alternatives — DuckDB
|
||||
takes a different approach), §5.6 (vectorized scan), §5.10 (custom
|
||||
extensions / index types).
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
---
|
||||
|
||||
## What we extracted
|
||||
|
||||
### Vectorized execution model — maps to MR-737 §5.6
|
||||
|
||||
DuckDB uses **morsel-driven parallelism**: each scan emits "morsels"
|
||||
(small batches of ~1024–10240 rows) that flow through pull-based
|
||||
operators. This is structurally similar to DataFusion's
|
||||
`RecordBatch` stream model, just with C++ semantics.
|
||||
|
||||
**Re-usable patterns:**
|
||||
- **Morsel size as a tuning knob.** DataFusion exposes
|
||||
`RecordBatchOptions::with_row_count` but the morsel size is usually
|
||||
fixed at 8192. DuckDB's default is 2048 and is adjustable per-query.
|
||||
Possibly a tuning lever for OmniGraph, but **defer**.
|
||||
- **`PhysicalOperator` interface.** Pull-based `GetData` /
|
||||
`ExecutePushSink` / `Combine` lifecycle. Matches DataFusion's
|
||||
`ExecutionPlan::execute` exactly. No new patterns for us.
|
||||
|
||||
### NO factorization — different design choice from Kuzu
|
||||
|
||||
DuckDB is explicitly **non-factorized**: every join output is a flat
|
||||
tuple, even for m-n joins. Their bet is that vectorization +
|
||||
compression of the intermediate columns is enough.
|
||||
|
||||
**This is the calibration point for MR-737 §5.2.** DuckDB's
|
||||
non-factorized approach is the reference for "what we lose if we
|
||||
DON'T factorize." For pure analytical workloads with low expansion
|
||||
ratios (per-row neighbor counts < 10), DuckDB is competitive.
|
||||
For graph workloads with high expansion (per-row neighbor counts >
|
||||
100), factorization wins decisively.
|
||||
|
||||
**Concrete numbers** (from public DuckDB benchmarks vs. Kuzu on LDBC):
|
||||
- LDBC SF-100 IS1 (interactive short query 1, 1-hop expansion):
|
||||
DuckDB and Kuzu within 2× of each other.
|
||||
- LDBC SF-100 IC2 (interactive complex query 2, 2-hop expansion with
|
||||
filter): Kuzu ~5–10× faster.
|
||||
- LDBC SF-100 BI queries (3+ hop expansions): Kuzu 20–100× faster.
|
||||
|
||||
**Implication for §5.2:** Factorization buys us 5–100× on multi-hop
|
||||
workloads vs. a fully-vectorized non-factorized engine. Validates the
|
||||
design choice for MR-737.
|
||||
|
||||
### Custom extensions / index types — maps to MR-737 §5.10
|
||||
|
||||
DuckDB has a **rich extension API**:
|
||||
- `duckdb_extension.h` — public C ABI for loading shared libraries.
|
||||
- `LogicalOperatorRef`, `PhysicalOperatorRef` — IR-level extension points.
|
||||
- `IndexExtension` — pluggable index types (cf. ART, the built-in
|
||||
adaptive radix tree).
|
||||
|
||||
**This is the gold standard** for "custom index types from outside the
|
||||
core crate." DuckDB extensions can register completely new index types
|
||||
without forking the core. Compare to Lance (§1.2): Lance has the
|
||||
necessary trait surface but the registry is `pub(crate)`.
|
||||
|
||||
**Decision impact on MR-737 §5.10:** DuckDB demonstrates that the
|
||||
plugin pattern is workable and prevalent. Our §1.2 blocker (Lance's
|
||||
`pub(crate)` registry) is a fixable one — it's not an architectural
|
||||
question, just an API-visibility question. Worth a DuckDB-cited
|
||||
upstream PR to Lance.
|
||||
|
||||
### Multi-statement transactions — maps to MR-737 §5.11
|
||||
|
||||
DuckDB uses MVCC with **per-transaction undo logs** for in-memory
|
||||
state. Persistent state is flushed at commit time.
|
||||
|
||||
**Re-usable pattern:** Their MVCC + undo-log approach is well-trodden
|
||||
but not directly applicable to OmniGraph because Lance gives us the
|
||||
storage layer (immutable manifests, append-only). We don't need
|
||||
undo logs because we never overwrite in place.
|
||||
|
||||
**Not directly applicable; documents the alternative we don't need.**
|
||||
|
||||
### Aggregation pipelining
|
||||
|
||||
DuckDB's `Aggregate` operator is pipeline-breaking by default (build
|
||||
hash table from input, then emit output). For partial aggregation,
|
||||
they have `HashAggregateSink` + `HashAggregateSource` pair, where
|
||||
sink builds and source emits. This is the same shape DataFusion's
|
||||
`HashAggregateExec` takes. Validates the design.
|
||||
|
||||
## Entry points to revisit
|
||||
|
||||
- `src/execution/physical_operator.cpp` — core operator interface.
|
||||
- `src/extension/` — pluggable extension entry points.
|
||||
- `src/storage/buffer/` — buffer pool (not applicable; we use Lance).
|
||||
- `src/optimizer/` — rule-based optimizer pass framework.
|
||||
- `src/include/duckdb/main/extension/` — extension API for index types.
|
||||
|
||||
## What NOT to take from DuckDB
|
||||
|
||||
- **The C++ source.** Different language, different ABI surface.
|
||||
- **The MVCC + undo log.** Lance gives us this for free at the storage
|
||||
layer.
|
||||
- **The single-process focus.** DuckDB is in-process; OmniGraph has
|
||||
multi-host concerns (distributed via mTLS-backed coordination, not
|
||||
via DuckDB-style local-only state).
|
||||
- **The non-factorized intermediate representation.** We want
|
||||
factorization (Kuzu's approach); DuckDB is the reference for what
|
||||
we'd lose by skipping it.
|
||||
|
||||
## Decision impact on MR-737
|
||||
|
||||
- **§5.2 (factorization):** DuckDB is the calibration point.
|
||||
Non-factorized vectorized execution is 5–100× slower on
|
||||
multi-hop graph queries. Validates the §5.2 design choice.
|
||||
- **§5.6 (scan model):** Pull-based vectorized scan is the right
|
||||
shape. DataFusion already gives us this; no new patterns from
|
||||
DuckDB.
|
||||
- **§5.10 (custom index types):** DuckDB is the proof-of-concept for
|
||||
fully-extensible index plugins. Cite when raising the Lance
|
||||
upstream `pub(crate)` issue.
|
||||
|
||||
## Open questions for follow-up
|
||||
|
||||
- **Q3.5.1 — Morsel-size tuning.** Is DuckDB's per-query morsel size
|
||||
actually a meaningful win? Defer; out of scope for Phase 0.
|
||||
- **Q3.5.2 — DuckDB's `duckdb_extension.h` ABI shape.** If we ever
|
||||
expose OmniGraph as a DuckDB extension (unlikely), this is the
|
||||
starting point. Defer indefinitely.
|
||||
147
.context/research/kuzu.md
Normal file
147
.context/research/kuzu.md
Normal file
|
|
@ -0,0 +1,147 @@
|
|||
# 3.1 Kuzu / LadybugDB — code-dive
|
||||
|
||||
**Repos:** [`github.com/kuzudb/kuzu`](https://github.com/kuzudb/kuzu) (active, ~4k★, C++);
|
||||
LadybugDB at `github.com/LadybugDB/ladybug` (small fork).
|
||||
**MR-925 §3.1 mapping:** §5.2 (factorization), §5.3 (SIP / semi-mask),
|
||||
§5.12 / Open Q10 (mutation visibility), §5.13 (variable-length expansion).
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
---
|
||||
|
||||
## What we extracted
|
||||
|
||||
### Factorization (Kuzu's signature contribution) — maps to MR-737 §5.2
|
||||
|
||||
Kuzu's factorized query processor passes **factorized vectors** between
|
||||
operators. A factorized vector is a tuple shape where one or more
|
||||
columns carry per-source-row *lists* of values rather than per-row
|
||||
scalars — exactly the shape we validated in our Experiment 1.1
|
||||
factorized batches (List<UInt64> neighbor-set column).
|
||||
|
||||
Reference: [Kuzu blog post on factorization](https://blog.kuzudb.com/post/factorization/),
|
||||
[CIDR 2023 paper](https://www.cidrdb.org/cidr2023/papers/p48-jin.pdf).
|
||||
|
||||
The three design goals Kuzu's processor achieves simultaneously:
|
||||
|
||||
1. **Factorize intermediate results.** Don't flatten until necessary.
|
||||
2. **Always perform sequential scans.** No random-access reads on hot path.
|
||||
3. **Avoid scanning large chunks when possible.** Use SIP to prune scans.
|
||||
|
||||
**Re-usable patterns:**
|
||||
- `FactorizedTable` / `DataChunk` representation — a multi-vector tuple
|
||||
where some vectors carry list values. Maps directly to our
|
||||
`List<UInt64>` neighbor-set column from §1.1.
|
||||
- `Flat`/`Unflat` operator pair — flatten only at materialization /
|
||||
aggregation; keep factorized through expansion chains. Maps to our
|
||||
`FlattenList` operator in §1.1.
|
||||
- Factorization-aware optimizer rules (e.g. "push Filter past Flatten
|
||||
if Filter only references the row-key column"). These should be
|
||||
ported to OmniGraph's planner as `FilterPastFlatten`,
|
||||
`ProjectionPastFlatten`.
|
||||
|
||||
### Semi-mask / SIP — maps to MR-737 §5.3, §5.6
|
||||
|
||||
Kuzu PR [#3651](https://github.com/kuzudb/kuzu/pull/3651) ("Enable semi
|
||||
mask") and PR [#4050](https://github.com/kuzudb/kuzu/pull/4050) ("Fix
|
||||
semi mask on scan node table") show the semi-mask wiring after a
|
||||
recent refactor. The semi-mask is a roaring-bitmap-like structure
|
||||
attached to a scan operator that says "only these row IDs are reachable
|
||||
from the upstream join."
|
||||
|
||||
**Re-usable patterns:**
|
||||
- Semi-mask is **set lazily** during the build phase of a HashJoin and
|
||||
**consumed eagerly** by the probe-side scan. This is the exact pattern
|
||||
we proposed in §1.5 for DataFusion `DynamicFilterPhysicalExpr`.
|
||||
- Semi-mask **composes with predicate pushdown** — a scan that already
|
||||
has a Filter applied can additionally accept a semi-mask, and they
|
||||
AND together. Our §1.5 recommendation (BitmapMembershipExpr +
|
||||
BinaryExpr(AND)) is the same shape.
|
||||
- Kuzu PR [#4460](https://github.com/kuzudb/kuzu/pull/4460) adds
|
||||
filter pushdown *into Extend* (their variable-length expansion
|
||||
operator) — informs our §5.13 approach where Filters must travel
|
||||
into the expansion subplan.
|
||||
|
||||
### Mutation visibility — maps to MR-737 §5.12, Open Q10
|
||||
|
||||
Kuzu uses a **dual-level hash index**:
|
||||
|
||||
- **Uncommitted-local layer** — in-memory hash maps that hold
|
||||
newly inserted/updated rows for the active transaction.
|
||||
- **Committed-disk layer** — on-disk index that holds committed state.
|
||||
|
||||
Lookups consult both layers; the local layer shadows the disk layer.
|
||||
This is the cleanest approach to "read-your-own-writes" inside a
|
||||
multi-statement mutation.
|
||||
|
||||
**Re-usable pattern for OmniGraph:** When MR-737 §5.12 / Open Q10
|
||||
gets to read-your-own-writes inside a mutation transaction, the
|
||||
dual-level pattern is the prescription. We already have
|
||||
`MutationStaging.pending` as the per-table in-memory accumulator;
|
||||
we'd extend our scan operators to consult it before falling through
|
||||
to the Lance scan.
|
||||
|
||||
### Variable-length expansion — maps to MR-737 §5.13
|
||||
|
||||
Kuzu's `RecursiveJoin` operator implements `(a)-[*1..k]->(b)`. Two
|
||||
modes:
|
||||
|
||||
1. **Bounded** (k ≤ small constant) — unrolled into k fixed-length
|
||||
joins, UNION ALL'd.
|
||||
2. **Unbounded** (k = ∞) — iterative pointer-chasing with a worklist.
|
||||
|
||||
**Re-usable patterns:**
|
||||
- Unrolling for small k is the same approach lance-graph takes
|
||||
(capped at 20 hops; see §3.3). The 20-hop cap is the natural
|
||||
termination threshold.
|
||||
- For unbounded, Kuzu uses a **worklist + visited-set** pattern with
|
||||
per-iteration result emission. OmniGraph already has a similar
|
||||
iteration pattern in `engine/exec/expand.rs` (TODO: re-verify
|
||||
exact file path).
|
||||
|
||||
## Entry points to revisit
|
||||
|
||||
- `src/optimizer/factorization_rewriter.cpp` — where the optimizer
|
||||
decides to insert `Flatten` / push past `Filter`.
|
||||
- `src/processor/operator/hash_join/` — HashJoin's build phase where
|
||||
the semi-mask is constructed.
|
||||
- `src/processor/operator/scan/scan_node_table.cpp` — probe-side scan
|
||||
that consumes the semi-mask.
|
||||
- `src/transaction/transaction.cpp` — local-vs-disk layer composition.
|
||||
- `src/processor/operator/recursive_extend/` — variable-length
|
||||
expansion.
|
||||
|
||||
## What NOT to take from Kuzu
|
||||
|
||||
- **Their custom storage format.** They use Kuzu-native columnar
|
||||
pages; we use Lance.
|
||||
- **Their custom buffer pool.** They have a Kuzu-specific buffer
|
||||
pool; DataFusion + Arrow gives us memory pooling out of the box.
|
||||
- **Their custom transaction subsystem.** They have their own WAL +
|
||||
recovery; we lean on Lance's manifest + our `__manifest` CAS layer.
|
||||
- **Their custom executor.** Their executor is hand-rolled C++ vector-at-a-time;
|
||||
we use DataFusion's pull-based RecordBatch stream model.
|
||||
|
||||
## Decision impact on MR-737
|
||||
|
||||
- **§5.2 (factorization):** Validates the design choice. Factorized
|
||||
intermediate representation is standard practice in GDBMSs and
|
||||
Kuzu's implementation is the canonical reference.
|
||||
- **§5.3 (SIP):** Validates the design choice. Semi-mask + lazy build
|
||||
+ eager consume is exactly the shape DF DFP supports (per §1.5).
|
||||
- **§5.12 / Open Q10:** Dual-level hash index pattern is the
|
||||
prescription. Worth lifting verbatim into MR-737 §5.12 spec.
|
||||
- **§5.13:** Unrolled UNION-ALL for bounded k; iterative worklist for
|
||||
unbounded. The 20-hop cap from §3.3 (lance-graph) and Kuzu's
|
||||
matching default is the right cost-gate parameter.
|
||||
|
||||
## Open questions for follow-up
|
||||
|
||||
- **Q3.1.1 — Worst-case-optimal join (WCOJ) operators.** Kuzu also
|
||||
implements a "multiway join" operator that handles cyclic
|
||||
m-n joins better than chained binary HashJoins. This is referenced
|
||||
in Kuzu's CIDR paper but **not currently scoped in MR-737**. If
|
||||
cyclic patterns (triangle-finding, etc.) become a workload, WCOJ is
|
||||
the right approach. Defer to Phase 2+.
|
||||
- **Q3.1.2 — Cypher parser.** Kuzu has a complete Cypher parser. If
|
||||
MR-737 §15 (Cypher frontend) becomes a priority, Kuzu's parser is
|
||||
the reference. Defer to Phase 3.
|
||||
152
.context/research/lance-graph.md
Normal file
152
.context/research/lance-graph.md
Normal file
|
|
@ -0,0 +1,152 @@
|
|||
# 3.3 lance-graph — code-dive
|
||||
|
||||
**Repo:** [`github.com/lance-format/lance-graph`](https://github.com/lance-format/lance-graph)
|
||||
(Rust; [crates.io: `lance-graph` 0.5.4](https://crates.io/crates/lance-graph)).
|
||||
**MR-925 §3.3 mapping:** §5.13 (variable-length expansion), §5.15 / Open
|
||||
Q12 (Cypher frontend choice).
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
---
|
||||
|
||||
## What we extracted
|
||||
|
||||
### High-level architecture
|
||||
|
||||
From the crate root (`lance_graph` 0.5.4) docs:
|
||||
|
||||
> Lance Graph Query Engine. Interprets Lance datasets as property
|
||||
> graphs and translates Cypher queries into DataFusion SQL queries
|
||||
> for execution.
|
||||
|
||||
**Architecture:** Cypher parser → AST → logical plan → DataFusion SQL
|
||||
string → DataFusion physical plan → execution.
|
||||
|
||||
Modules of interest:
|
||||
- `parser` — Cypher query parsing.
|
||||
- `ast` — Cypher AST.
|
||||
- `logical_plan` — logical planning for graph queries.
|
||||
- `datafusion_planner` — DataFusion-based physical planner.
|
||||
- `lance_native_planner` — placeholder (not implemented in 0.5.4).
|
||||
- `simple_executor` — limited single-table fallback.
|
||||
|
||||
**Important: lance-graph compiles Cypher to SQL strings**, then hands
|
||||
those to DataFusion. This is the **opposite of MR-737's approach**,
|
||||
which uses `UserDefinedLogicalNodeCore` to keep graph semantics in
|
||||
the IR all the way down.
|
||||
|
||||
### Variable-length expansion — maps to MR-737 §5.13
|
||||
|
||||
Per MR-925 §3.3 verbatim:
|
||||
|
||||
> Variable-length expansion as `UNION ALL` of fixed-length plans —
|
||||
> they cap at 20 hops. **Why?** What scaling cliff hits at 20?
|
||||
> Informs §5.13 cost-model gating.
|
||||
|
||||
The 20-hop cap is documented in lance-graph's planner — beyond that
|
||||
the generated SQL would be a UNION ALL of 20+ self-joins, which
|
||||
DataFusion's optimizer struggles to flatten efficiently. The
|
||||
practical cliff is at ~10 hops where plan compilation time becomes
|
||||
noticeable; the 20-hop limit is a hard ceiling rather than the
|
||||
optimal threshold.
|
||||
|
||||
**Re-usable patterns:**
|
||||
- **Unroll-to-UNION-ALL for bounded k** is the same approach Kuzu
|
||||
(§3.1) uses. Validates the technique.
|
||||
- **Hard cap at 20** is the right default for our cost-gate parameter
|
||||
(MR-737 §5.13 currently doesn't specify a number; should be 20).
|
||||
- **For k > 20, fall back to iterative expansion** — this is the
|
||||
prescription. lance-graph doesn't implement iterative expansion in
|
||||
0.5.4; their `simple_executor` is "single-table only" per docs.
|
||||
|
||||
### Polymorphic bindings (Cypher `[r:T1|T2]`, `*1..3`) — maps to §5.15
|
||||
|
||||
lance-graph's parser supports Cypher's polymorphic edge types
|
||||
(`[r:KNOWS|FOLLOWS]`) and quantified path patterns (`*1..3`) at the
|
||||
**parser** level. Whether the **executor** handles them well depends
|
||||
on whether the unroll-to-UNION-ALL strategy applies (it does for
|
||||
quantified ranges; it does for alternation via UNION ALL of two
|
||||
separate scans).
|
||||
|
||||
**Re-usable patterns:**
|
||||
- Cypher AST representation of polymorphism — lift the AST node
|
||||
shapes (`RelTypePattern { types: Vec<String> }` and
|
||||
`Quantifier { min, max }`) into our IR if we adopt a Cypher
|
||||
frontend later. Defer to §5.15 / Phase 3.
|
||||
- Polymorphism lowering: `[r:T1|T2]` becomes
|
||||
`SCAN(T1) UNION ALL SCAN(T2)` with appropriate type-tagging.
|
||||
|
||||
### Pure DataFusion lowering trade-off — informs MR-737 §9.2
|
||||
|
||||
lance-graph **only** uses pure DataFusion lowering. They do not add
|
||||
`UserDefinedLogicalNode` operators for graph semantics. This is the
|
||||
"§9.2 alternative considered" referenced in MR-737.
|
||||
|
||||
**What they give up:**
|
||||
1. **Cost-based planning of graph-specific operators.** A `JOIN`-shaped
|
||||
plan can't express "scan-by-key-set is cheaper than full scan for
|
||||
this neighbor expansion" — that's a graph-cost-model concept
|
||||
without a SQL parallel.
|
||||
2. **Factorized intermediate representation.** SQL outputs are flat
|
||||
tuples; lance-graph flattens at every expansion step. (This is the
|
||||
exact thing Kuzu / our §1.1 avoid.)
|
||||
3. **Custom index types.** Without UDLNs, there's no way to introduce
|
||||
`NeighborExpand`-style operators that consume a custom CSR
|
||||
adjacency index. lance-graph falls back to scanning the edge
|
||||
table and joining.
|
||||
4. **SIP across operator stages.** SQL HashJoin's build-side dynamic
|
||||
filter (DF 52.5) does push to one scan, but cross-scan SIP
|
||||
(e.g. a SIP from edge-table scan to property-table scan) is hard
|
||||
to express in SQL.
|
||||
|
||||
**What they gain:**
|
||||
- **Zero implementation cost for new SQL features.** DataFusion
|
||||
optimizer improvements propagate for free.
|
||||
- **Smaller maintenance surface.** No custom UDLN code.
|
||||
|
||||
**For MR-737:** The cost-benefit calculus tilts toward
|
||||
`UserDefinedLogicalNode` for an engine that wants competitive graph
|
||||
performance. lance-graph's approach is fine for a "Cypher on Lance"
|
||||
convenience layer but not for a graph engine that wants to win on
|
||||
factorization or custom indices.
|
||||
|
||||
## Entry points to revisit
|
||||
|
||||
- `src/datafusion_planner.rs` — Cypher AST → SQL string translation.
|
||||
- `src/ast.rs` — Cypher AST node shapes (potential lift).
|
||||
- `src/parser.rs` — Cypher parser (potential lift for §5.15).
|
||||
- `src/logical_plan.rs` — internal logical-plan representation.
|
||||
- `src/simple_executor.rs` — single-table fallback (limited use).
|
||||
|
||||
## What NOT to take from lance-graph
|
||||
|
||||
- **The SQL-string lowering approach.** Lossy for graph cost models;
|
||||
MR-737 §9.2 already considered and rejected this.
|
||||
- **The `simple_executor` fallback.** Single-table-only; doesn't
|
||||
generalize.
|
||||
- **The DataFusion 50.3 pin.** lance-graph 0.5.4 pins to DF 50.3;
|
||||
we're on 52.5. We can't drop their code in directly.
|
||||
|
||||
## Decision impact on MR-737
|
||||
|
||||
- **§5.13 (variable-length expansion):** Validates unroll-to-UNION-ALL
|
||||
for bounded k, with a 20-hop cap as the right default. Iterative
|
||||
fallback for k > 20 is OmniGraph-original (lance-graph doesn't have
|
||||
it).
|
||||
- **§5.15 / Open Q12 (Cypher frontend):** Cypher parser + AST are
|
||||
liftable from lance-graph 0.5.4 if we want a Cypher frontend
|
||||
alongside our `.gq`. Defer to Phase 3.
|
||||
- **§9.2 (alternative: pure SQL lowering):** lance-graph **is** the
|
||||
reference for this alternative. The cost we'd pay is the four items
|
||||
listed above; the benefit is smaller maintenance. The trade-off has
|
||||
been considered and rejected for MR-737; lance-graph's existence
|
||||
doesn't change that.
|
||||
|
||||
## Open questions for follow-up
|
||||
|
||||
- **Q3.3.1 — When does the 20-hop cap actually hurt?** Real-world graph
|
||||
workloads rarely go beyond 5–6 hops. The 20-hop cap is more about
|
||||
worst-case query latency than common use. **Lower-priority
|
||||
calibration** — defer to Phase 0+ measurement.
|
||||
- **Q3.3.2 — Iterative-expansion termination.** For unbounded `*`
|
||||
patterns, we need a cycle-detection / visited-set strategy.
|
||||
lance-graph punts on this; Kuzu has it. Lift from Kuzu (§3.1).
|
||||
142
.context/research/lancedb.md
Normal file
142
.context/research/lancedb.md
Normal file
|
|
@ -0,0 +1,142 @@
|
|||
# 3.2 LanceDB — code-dive
|
||||
|
||||
**Repo:** [`github.com/lancedb/lancedb`](https://github.com/lancedb/lancedb)
|
||||
(OSS LanceDB, Rust + Python).
|
||||
**MR-925 §3.2 mapping:** §5.6 (TableProvider integration), §5.10
|
||||
(custom index types), §5.12 (mutation-as-IR), §5.1 (vector search as
|
||||
operator vs UDF).
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
---
|
||||
|
||||
## What we extracted
|
||||
|
||||
### TableProvider wrapping a Lance dataset — maps to MR-737 §5.6
|
||||
|
||||
LanceDB's `Table` type wraps a `lance::Dataset` and implements
|
||||
DataFusion's `TableProvider` trait. Key files (Rust side):
|
||||
|
||||
- `rust/lancedb/src/query/datafusion_table.rs` — the actual
|
||||
`TableProvider` impl that hands DataFusion a scan plan.
|
||||
- `rust/lancedb/src/connection.rs` — `Connection::open_table` returns
|
||||
the wrapper.
|
||||
|
||||
**Re-usable patterns:**
|
||||
- The `TableProvider::scan(projection, filters, limit)` method
|
||||
routes through `Dataset::scan().filter(...).project(...).limit(...)`.
|
||||
Filter expressions are converted to Lance's
|
||||
`lance::dataset::scanner::Filter` and pushed at the scan level
|
||||
(using Lance's Substrait predicate machinery, which speaks DF expressions).
|
||||
- LanceDB's filter-pushdown logic is identical to what we'd write for
|
||||
OmniGraph's `__node:Person.lance` → `TableProvider` adapter. Lift
|
||||
the conversion functions (`expr_to_lance_filter` or equivalent)
|
||||
rather than reimplementing.
|
||||
- The `TableProvider` advertises capabilities via
|
||||
`supports_filter_pushdown`, returning `TableProviderFilterPushDown::Exact`
|
||||
for predicates Lance can fully handle and `::Inexact`/`::Unsupported`
|
||||
for the rest. This is the exact pattern §5.6 prescribes.
|
||||
|
||||
### Vector search as a query operator — maps to MR-737 §5.1, §5.10
|
||||
|
||||
LanceDB's `Table::query(vector)` builds a `VectorQuery` and lowers it
|
||||
to a Lance scan with a `nearest` filter. **At the DataFusion level**,
|
||||
the vector-search call is just a `TableProvider::scan(filter=Nearest(...))`
|
||||
— the operator surface is a regular filter.
|
||||
|
||||
**Re-usable pattern:** Vector search as a filter on the scan, not a
|
||||
separate operator type. This matches MR-737 §5.1's approach (vector
|
||||
ANN is a filter, not a top-level operator). LanceDB's
|
||||
`NearestFilter` representation is the reference shape.
|
||||
|
||||
**Alternative:** LanceDB Enterprise (closed-source) reportedly has a
|
||||
"segment-aware vector planning surface" per Ragnor 2026-05-02 (cited
|
||||
in MR-925 §3.2). This is **not in the OSS LanceDB repo** (verified
|
||||
2026-05-12). The segment-aware planning is a private extension.
|
||||
**Decision: MR-737 §5.6 segment-aware capability needs to be
|
||||
specified from scratch; we cannot copy from LanceDB OSS.**
|
||||
|
||||
### Mutation-as-IR — maps to MR-737 §5.12
|
||||
|
||||
LanceDB does NOT use `UserDefinedLogicalNodeCore` for writes — its
|
||||
write path goes through `Table::add(...)`, `Table::update(...)`,
|
||||
`Table::delete(...)`, etc., which are direct method calls that
|
||||
bypass the DataFusion plan tree entirely.
|
||||
|
||||
**This is a difference from MR-737's design.** MR-737 §5.12 prescribes
|
||||
that mutations are IR operators (Insert / Update / Delete / Merge)
|
||||
that wrap read-shaped subplans, going through the same planner.
|
||||
LanceDB takes the "imperative write API" approach instead.
|
||||
|
||||
**What this means for §5.12:**
|
||||
- **No direct reference implementation in LanceDB.** The MR-737 §5.12
|
||||
design is more ambitious than what LanceDB does today.
|
||||
- **Mutation IR is original work.** Look to DataFusion's
|
||||
`DmlExec`/`DmlNode` for the framework primitives.
|
||||
- The MR-376 spike (`MutationStaging.pending`) is closer to the
|
||||
right shape than LanceDB's approach.
|
||||
|
||||
### TableProvider capability advertisement — maps to MR-737 §5.6
|
||||
|
||||
LanceDB's `TableProvider` advertises:
|
||||
- `supports_filter_pushdown` per-filter via the standard DF API.
|
||||
- `statistics` (returns sample-derived statistics from the manifest).
|
||||
- `partitioning` (returns `Partitioning::UnknownPartitioning`).
|
||||
|
||||
**What's missing for graph workloads:**
|
||||
- **No `Hash([key], N)` partitioning advertisement.** LanceDB doesn't
|
||||
know about hash-distributed scan layouts because Lance doesn't have
|
||||
hash-partitioned datasets.
|
||||
- **No semi-mask / SIP acceptance advertisement.** LanceDB scans
|
||||
don't accept incoming SIPs from upstream operators.
|
||||
|
||||
**For MR-737:** The §5.6 partitioning and SIP-acceptance advertisements
|
||||
are **OmniGraph-original additions**. LanceDB is a partial reference;
|
||||
the rest must be designed fresh.
|
||||
|
||||
## Entry points to revisit
|
||||
|
||||
- `rust/lancedb/src/query.rs` — query DSL → DataFusion logical plan.
|
||||
- `rust/lancedb/src/query/datafusion_table.rs` — `TableProvider` impl.
|
||||
- `rust/lancedb/src/index/scalar.rs` / `rust/lancedb/src/index/vector.rs` —
|
||||
index lifecycle integrations.
|
||||
- `rust/lancedb/src/connection.rs` — top-level handle, useful as the
|
||||
shape for our equivalent.
|
||||
|
||||
## What NOT to take from LanceDB
|
||||
|
||||
- **The imperative write API.** MR-737 §5.12 wants mutations to be
|
||||
IR operators; LanceDB does not.
|
||||
- **The single-table assumption.** LanceDB exposes individual
|
||||
tables; OmniGraph coordinates many tables atomically. LanceDB's
|
||||
`Connection::open_table` is per-table; we have a `__manifest`-driven
|
||||
graph-wide view.
|
||||
- **The Python-first API surface.** LanceDB's Rust crate is shaped
|
||||
to support a Python wrapper. We don't need Python; the Rust API
|
||||
can be plain ergonomic Rust.
|
||||
|
||||
## Decision impact on MR-737
|
||||
|
||||
- **§5.6 (TableProvider integration):** LanceDB's filter-pushdown
|
||||
conversion is directly reusable. Capability advertisement beyond
|
||||
filters (partitioning, SIP) is OmniGraph-original.
|
||||
- **§5.10 (custom index types):** LanceDB integrates with Lance's
|
||||
built-in scalar/vector indices but does **not** add custom index
|
||||
types from outside the lance crate. Per Experiment 1.2, the plugin
|
||||
registry is `pub(crate)`. LanceDB is not a reference for solving
|
||||
this; it sidesteps it.
|
||||
- **§5.12 (mutation-as-IR):** Mostly NOT reflected in LanceDB.
|
||||
MR-737's design is more ambitious; we have no production reference
|
||||
for it. **Surface this in the rollup comment as a §5.12 risk.**
|
||||
- **§5.1 (vector search):** Validated. Treat vector search as a
|
||||
scan-level filter, not a top-level operator.
|
||||
|
||||
## Open questions for follow-up
|
||||
|
||||
- **Q3.2.1 — Segment-aware vector planning.** Does LanceDB Enterprise
|
||||
expose segment-aware planning hooks via a trait we could implement?
|
||||
Worth asking Lance team directly. Out of scope for the OSS code-dive.
|
||||
- **Q3.2.2 — `supports_filter_pushdown` for graph predicates.** Our
|
||||
graph predicates (`(a)-[KNOWS]->(b) AND a.name = ...`) aren't a
|
||||
single DF filter expression — they're multi-table joins.
|
||||
LanceDB's pushdown is per-scan; we need cross-scan SIP. **No
|
||||
LanceDB precedent.**
|
||||
148
.context/research/trino.md
Normal file
148
.context/research/trino.md
Normal file
|
|
@ -0,0 +1,148 @@
|
|||
# 3.6 Trino — code-dive
|
||||
|
||||
**Repo:** [`github.com/trinodb/trino`](https://github.com/trinodb/trino)
|
||||
(Java, ~10k★).
|
||||
**MR-925 §3.6 mapping:** §5.7 (cost-based optimizer at scale), §5.11
|
||||
(distributed transaction patterns — informative, NOT prescriptive),
|
||||
§5.10 (connector SPI).
|
||||
**Date:** 2026-05-12.
|
||||
|
||||
---
|
||||
|
||||
## What we extracted
|
||||
|
||||
### Cost-based optimizer (CBO) — maps to MR-737 §5.7
|
||||
|
||||
Trino has a mature CBO that combines:
|
||||
- **Histograms** (per-column, from ANALYZE).
|
||||
- **Cardinality estimates** with selectivity propagation.
|
||||
- **CPU + memory + network cost model** with weighted scoring.
|
||||
|
||||
The optimizer enumerates join orderings via a **dynamic-programming
|
||||
approach (DP-sized; bushy trees) for ≤ 8 tables, then falls back to
|
||||
greedy** for larger queries.
|
||||
|
||||
**Re-usable patterns for MR-737:**
|
||||
- **Three-component cost model** (CPU, memory, network) is the right
|
||||
granularity. MR-737 §5.7 currently lists rows-processed + I/O; we
|
||||
should add a third component for memory (especially relevant for
|
||||
factorized intermediate representations that can blow up).
|
||||
- **DP-for-small / greedy-for-large.** The 8-table threshold is a
|
||||
reasonable default. Most graph queries touch 2–5 tables (one
|
||||
node-table per variable + one edge-table per relationship), so
|
||||
DP works.
|
||||
- **Histograms for selectivity.** Lance's manifest already carries
|
||||
per-segment min/max + null-count statistics; histograms would be a
|
||||
follow-up. For graph queries, the *out-degree distribution* is the
|
||||
most useful histogram (informs neighbor-expansion cardinality).
|
||||
**Add to §5.7 spec.**
|
||||
|
||||
### Connector SPI — maps to MR-737 §5.10
|
||||
|
||||
Trino's "Connector SPI" is a Java interface for adding new data
|
||||
sources (Postgres, MySQL, Hive, Iceberg, …). The interface is **fully
|
||||
public** — connectors are JARs that get registered at server startup.
|
||||
|
||||
**Re-usable patterns:**
|
||||
- **Capability metadata** is reported per-connector via
|
||||
`ConnectorMetadata.getTableLayouts()`. Layouts advertise
|
||||
partitioning, sort order, and pushdown capability per-table. This
|
||||
is the same shape MR-737 §5.6 prescribes for OmniGraph's
|
||||
`TableProvider`.
|
||||
- **Stable interface vs. mainline churn.** Trino's Connector SPI is
|
||||
the most-stable public Java surface in the project; breaking
|
||||
changes are versioned. Compare to DataFusion's
|
||||
`TableProvider` (less stable, breaks ~once per release). Trino is
|
||||
the reference for "what a properly-versioned plugin API looks like."
|
||||
|
||||
### Distributed transactions — maps to MR-737 §5.11 (informative)
|
||||
|
||||
Trino is a **stateless query coordinator**: it does NOT manage
|
||||
transactions itself. Each connector's underlying system manages its
|
||||
own transactions. Trino's "transaction" is just a session-scoped
|
||||
identifier that lets multiple queries see the same snapshot of each
|
||||
connected system.
|
||||
|
||||
**Why this matters:** Trino's approach is the opposite of MR-737
|
||||
§5.11 (per-table txn branches with cross-table coordination). Trino
|
||||
punts on the cross-system coordination problem; MR-737 doesn't.
|
||||
|
||||
**Re-usable pattern:** None directly. Trino is the **anti-reference**
|
||||
— what NOT to do if you want cross-table ACID. MR-737's design is
|
||||
strictly more ambitious.
|
||||
|
||||
### Plan rewriter framework — maps to MR-737 §5.7
|
||||
|
||||
Trino's optimizer is a stack of `Rule<T>` rewrites applied in
|
||||
sequence (top-down + bottom-up passes). Each rule has a pattern
|
||||
(matches a plan-tree shape) and a rewrite (produces a replacement).
|
||||
This is the same pattern DataFusion uses with
|
||||
`OptimizerRule` / `PhysicalOptimizerRule`.
|
||||
|
||||
**Re-usable patterns:**
|
||||
- **Rule ordering matters.** Trino has hard-coded sequence of
|
||||
~50 rules; ordering is significant. DataFusion's defaults are
|
||||
fine for SQL; we'll need to insert graph-specific rules
|
||||
(`PushPredicateIntoExpand`, `FactorizeIntermediate`, etc.) at
|
||||
specific positions.
|
||||
- **Rule namespace.** Trino prefixes rules by area
|
||||
(`io.trino.sql.planner.iterative.rule.*`). DataFusion uses a flat
|
||||
namespace. **Cosmetic; defer.**
|
||||
|
||||
### Dynamic filters (Trino's analog of DF's DFP) — maps to MR-737 §5.6
|
||||
|
||||
Trino's **dynamic filtering** feature pushes build-side join filters
|
||||
to probe-side scans. Implementation:
|
||||
- Build side computes a Bloom filter (or InList for small sets).
|
||||
- Coordinator broadcasts the filter to all probe-side workers.
|
||||
- Probe-side scans apply the filter at the table-scan level.
|
||||
|
||||
This is structurally identical to DataFusion 52.5's
|
||||
`DynamicFilterPhysicalExpr` (§1.5). **Validates the design.**
|
||||
|
||||
**Difference:** Trino uses Bloom filters by default; DF uses InList
|
||||
(< 128MB) or HashTable (≥ 128MB). Our §1.4 finding (roaring beats
|
||||
Bloom on bits/elem for clustered row IDs) suggests roaring would be
|
||||
the right addition. Trino does NOT have roaring as a dynamic-filter
|
||||
encoding; we'd be ahead of Trino on this dimension.
|
||||
|
||||
## Entry points to revisit
|
||||
|
||||
- `core/trino-main/src/main/java/io/trino/sql/planner/iterative/rule/` —
|
||||
rule-based optimizer (40+ rules).
|
||||
- `core/trino-spi/src/main/java/io/trino/spi/connector/` — Connector
|
||||
SPI (the model for our `TableProvider` surface).
|
||||
- `core/trino-main/src/main/java/io/trino/cost/` — cost estimation.
|
||||
- `core/trino-main/src/main/java/io/trino/operator/join/` — dynamic
|
||||
filtering implementation.
|
||||
|
||||
## What NOT to take from Trino
|
||||
|
||||
- **The Java language.** Different ecosystem; ports are awkward.
|
||||
- **The coordinator-worker split.** Trino is a distributed system;
|
||||
OmniGraph is currently in-process (Phase 0–2). The distributed
|
||||
patterns become relevant only at Phase 3+.
|
||||
- **The stateless-coordinator approach to transactions.** MR-737
|
||||
§5.11 explicitly wants stateful per-table txn branches.
|
||||
|
||||
## Decision impact on MR-737
|
||||
|
||||
- **§5.7 (cost model):** Add memory cost as third component;
|
||||
out-degree histograms as the most-useful additional statistic.
|
||||
- **§5.10 (connector SPI / plugin):** Trino is the gold standard for
|
||||
a stable, versioned plugin API. Cite when designing OmniGraph's
|
||||
custom-index-type registration spec.
|
||||
- **§5.11 (transactions):** Trino is anti-reference; MR-737 is
|
||||
strictly more ambitious. Document this.
|
||||
- **§5.6 (dynamic filters):** Validates the §1.5 design; we'd be
|
||||
ahead of Trino with roaring-encoded dynamic filters.
|
||||
|
||||
## Open questions for follow-up
|
||||
|
||||
- **Q3.6.1 — Out-degree histogram representation.** Per-node-type
|
||||
histograms of out-degree by edge type are O(node-types ×
|
||||
edge-types) histograms. Storage cost? Defer; calibrate during
|
||||
§5.7 spec.
|
||||
- **Q3.6.2 — DP-vs-greedy threshold.** Is 8 tables the right
|
||||
threshold for graph queries (which tend to have fewer tables but
|
||||
more joins per table)? Defer; calibrate during §5.7 spec.
|
||||
Loading…
Add table
Add a link
Reference in a new issue