From 88b338b56b9ba7a91db0c9bf4117cbdaf095bf17 Mon Sep 17 00:00:00 2001 From: Devin AI Date: Tue, 12 May 2026 17:36:44 +0000 Subject: [PATCH] MR-925: exp 1.5-1.7 code-dives + 2.x deferral rationale + 3.x reference systems - exp 1.5 (bitmap-pushdown): DF 52.5 DynamicFilterPhysicalExpr supports bitmap-shaped pushdown as-written; no fork needed; Path A (per-batch evaluation) ships v1, Path B (Lance RowIdMask) is v2 optimization - exp 1.6 (txn-branches-cost): Lance per-table branches are +4N S3 PUTs per txn vs current lazy-graph-branch model; side-grade not clean win; recommend keeping current model for v1 - exp 1.7 (stable-row-id-compaction): stable row IDs already enabled everywhere in OmniGraph; Path B (OmniGraph-driven remap via FragReuseIndex public API) ships today; Path A (Lance-managed) is v2 follow-up gated on \xa71.2 plugin registry - 2.x deferred with rationale: all calibration / risk-quantification work, per ticket \xa70.3 acceptance criteria do not require 2.x - 3.1 Kuzu: factorization, semi-mask, dual-level hash index, variable-length expansion - 3.2 LanceDB: TableProvider patterns, mutation-as-IR gap, no segment-aware planning in OSS - 3.3 lance-graph: pure-SQL lowering trade-offs, 20-hop cap, Cypher AST liftable - 3.4 Comet/GlareDB/ParadeDB/Spice.ai: capability advertisement, DF API churn budget - 3.5 DuckDB: factorization calibration point (5-100x slower on multi-hop), DuckDB ext API as plugin gold standard - 3.6 Trino: cost model with 3 components (CPU/mem/network), Connector SPI as versioned plugin reference, dynamic filters analog --- .context/experiments/_2x-deferred.md | 97 ++++++ .context/experiments/bitmap-pushdown.md | 238 ++++++++++++++ .../experiments/stable-row-id-compaction.md | 303 ++++++++++++++++++ .context/experiments/txn-branches-cost.md | 258 +++++++++++++++ .context/research/df-extensions.md | 168 ++++++++++ .context/research/duckdb.md | 131 ++++++++ .context/research/kuzu.md | 147 +++++++++ .context/research/lance-graph.md | 152 +++++++++ .context/research/lancedb.md | 142 ++++++++ .context/research/trino.md | 148 +++++++++ 10 files changed, 1784 insertions(+) create mode 100644 .context/experiments/_2x-deferred.md create mode 100644 .context/experiments/bitmap-pushdown.md create mode 100644 .context/experiments/stable-row-id-compaction.md create mode 100644 .context/experiments/txn-branches-cost.md create mode 100644 .context/research/df-extensions.md create mode 100644 .context/research/duckdb.md create mode 100644 .context/research/kuzu.md create mode 100644 .context/research/lance-graph.md create mode 100644 .context/research/lancedb.md create mode 100644 .context/research/trino.md diff --git a/.context/experiments/_2x-deferred.md b/.context/experiments/_2x-deferred.md new file mode 100644 index 0000000..fc00504 --- /dev/null +++ b/.context/experiments/_2x-deferred.md @@ -0,0 +1,97 @@ +# Experiments 2.1–2.4 — DEFERRED (rationale) + +**Ticket:** MR-925 §2 (lower-priority experiments). +**Date:** 2026-05-12. + +The ticket §0.3 acceptance criteria require all **seven** high-priority +experiments (§1.1–§1.7) and at least **six** reference-system code-dives +(§3.1–§3.6) to be complete before posting the rollup comment. §2.1–§2.4 +are explicitly framed as "calibrations and incremental validations that +**don't gate the RFC** but are worth doing" and "**can run during Phase 0**." + +This writeup explicitly defers all four §2.x experiments with rationale, +so the next agent (or Phase-0 owner) doesn't have to re-discover the +deferral decision. + +--- + +## 2.1 `scan_by_key_set` extended benchmark — DEFER + +**Reason:** MR-376 already validated 72× speedup at hop-1 / 100k nodes on +local FS. The extension matrix in §2.1 (cold S3 vs warm local; selectivity +sweep; |keys| / |dataset| ratio sweep; BTREE-routed vs direct +`Dataset::take`) is **calibration**, not capability gating. The §5.3 cost +gate in MR-737 can ship without these numbers; we just won't have a +hard-tuned threshold for "when should we prefer scan-by-key-set vs +re-scan" until they're collected. + +**When to run:** During Phase 0 MR-737 §5.3 implementation, before +landing the cost-gate parameter. Estimated 2 days. Owner: same person +implementing §5.3. + +**Risk of deferring:** Low. The cost gate has a sensible default (always +prefer scan-by-key-set unless |keys| / |dataset| > 0.1); the calibration +just tightens it. + +--- + +## 2.2 `Hash([key], N)` partitioning elimination — DEFER + +**Reason:** Validates DataFusion's `EXPLAIN` shows `RepartitionExec` +elimination for capability-advertised plans. The capability advertising +in MR-737 §5.6 is a quality-of-life optimization, not a correctness +requirement. We can ship §5.6 without this validation and add it as a +follow-up. + +**When to run:** During Phase 0 MR-737 §5.6 implementation, half-day spike. + +**Risk of deferring:** Low-medium. If DataFusion's optimizer does NOT +honor our capability advertisements, the perf impact is a redundant +`RepartitionExec` insertion — measurable in `EXPLAIN ANALYZE` but not a +correctness issue. The risk is sub-optimal partitioning, not wrong results. + +--- + +## 2.3 Extension-rate propagation through `StatisticsRegistry` — DEFER + +**Reason:** Validates the §5.7 cost-model plumbing for custom column +statistics. The cost model itself can ship with default statistics (no +custom registry); the registry is an extension point for **better** +cost choices, not **correct** cost choices. Deferring keeps §5.7's +v1 narrower without breaking it. + +**When to run:** When MR-737 §5.7 (cost-model surface) is being +implemented — likely Phase 1, not Phase 0. Estimated 1 day. + +**Risk of deferring:** Low. Default cost models from `JoinExec`, +`HashJoinExec`, etc. are battle-tested; custom column statistics are +incremental. + +--- + +## 2.4 DataFusion API churn audit (47 → 53) — DEFER + +**Reason:** Calibrates §11 (Risk) and informs **upgrade-cycle budget**, +not Phase-0 design. Phase 0 pins to DataFusion 52.5 (the substrate +pin we validated in §1.3 and §1.5). Knowing the breakage rate for +future upgrades is a maintenance-cost input, not an entry-criterion. + +**When to run:** Before any DataFusion minor-version bump after Phase 0 +ships. Estimated 1–2 days. Owner: the engineer planning the bump. + +**Risk of deferring:** Zero for Phase 0. The audit is for future +planning; Phase 0 doesn't care about DF 47 or DF 53, only DF 52.5. + +--- + +## Summary + +All four §2.x experiments are **deferred to Phase 0 or later** with +the rationale that they are calibration / risk-quantification work, +not capability gating. The ticket §0.3 acceptance criteria require +§1.x (7/7 done) and §3.x (≥ 6/6 to do) but **do not** require §2.x. +This deferral preserves the ticket's stated scope. + +If the §2.x experiments need to be re-prioritized (e.g. if §1.x findings +expose a calibration gap), they can be picked up individually; each is +small (½–2 days) and independent. diff --git a/.context/experiments/bitmap-pushdown.md b/.context/experiments/bitmap-pushdown.md new file mode 100644 index 0000000..67f88d8 --- /dev/null +++ b/.context/experiments/bitmap-pushdown.md @@ -0,0 +1,238 @@ +# Experiment 1.5 — Extending DataFusion dynamic-filter-pushdown to bitmap shape (code-dive) + +**Ticket:** MR-925 §1.5 (validates MR-737 §5.6, §5.7 / Open Q3). +**Type:** Code-dive only (no prototype crate). +**Substrate pin:** DataFusion 52.5. +**Date:** 2026-05-12. + +--- + +## Question + +The dynamic-filter-pushdown (DFP) feature in DataFusion 52.5 ships with +three pushdown strategies for hash-join build sides: + + - `InList(ArrayRef)` — for small build sides (< 128 MB). + - `HashTable(Arc)` — for large build sides. + - `Empty` — no rows, do not push. + +Can a third-party operator (e.g. our `NeighborExpand` from §1.3, or +the broader graph-engine `BackJoin` and `NeighborSetIntersect` from +MR-737 §5.3) extend the same machinery to push a **roaring-bitmap-shaped +filter** through DF's dynamic filter framework — without forking DF? + +## TL;DR + +**Yes, completely supported on DataFusion 52.5 as written.** The +extension footprint is roughly 200 LoC: implement a custom +`PhysicalExpr` (e.g. `BitmapMembershipExpr`) and feed it to the +existing public `DynamicFilterPhysicalExpr::update(...)` API. No +fork, no `pub(crate)` work-around. + +## Findings + +### F1. DataFusion's DFP is expression-shaped, not enum-shaped. + +The `PushdownStrategy` enum (`InList | HashTable | Empty`) is internal +to `joins::hash_join::shared_bounds` — it is the **HashJoinExec's own +internal switch** for selecting which physical-expr to construct for +*its* dynamic filter. The framework itself does not care: + +```rust +// datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs +pub fn update(&self, new_expr: Arc) -> Result<()> +``` + +The `update` method takes **any** `Arc`. So an +operator outside the hash-join module is free to ship its own pushdown +strategy (e.g. `BitmapMembership`) and pass it to `update`. + +### F2. Three things must hold for a custom dynamic-filter expr to work. + +From reading `dynamic_filters.rs`: + +1. **Stable children at construction time.** `DynamicFilterPhysicalExpr::new(children, initial_expr)` + binds the column-leaves at construction. Subsequent `update`s may + only swap the *expression*, not introduce *new column references*. + For `BitmapMembershipExpr { column: Column::new("id"), bitmap_bytes: ... }`, + the only child is `column` — stable. + +2. **Self-contained `evaluate`.** The custom `PhysicalExpr` must + implement `fn evaluate(&self, batch: &RecordBatch) -> Result` + to return a `BooleanArray` of the same length as the input batch. + For a roaring bitmap this is: deserialize once at first call, cache + in `OnceCell`, then per-batch `for i in 0..n: out[i] = bitmap.contains(col[i])`. + +3. **Be a `dyn PhysicalExpr`** — implement `Debug`, `Display`, + `data_type`, `nullable`, `evaluate`, `children`, `with_new_children`, + `dyn_hash`, `dyn_eq`. Standard boilerplate, mirroring `InListExpr`. + +### F3. Two pushdown paths exist; only one needs work for graph operators. + +DataFusion 52.5 has **two** filter-pushdown phases (see +`ExecutionPlan::gather_filters_for_pushdown` and +`ExecutionPlan::handle_child_pushdown_result` in `execution_plan.rs`): + + - **Static pushdown** (planning time): filters are pushed from + `FilterExec` → `HashJoinExec` → `DataSourceExec` during the + `EnforceFilterPushdown` physical optimizer rule. + - **Dynamic pushdown** (execution time): a `DynamicFilterPhysicalExpr` + placeholder is left in the plan at planning time; at runtime, the + *producer* operator calls `update(new_expr)` once its data is + available (e.g. once the hash-join build side is materialized). + +For graph operators that produce SIPs (`NeighborExpand` build phase, +`SemiJoin` build phase, etc.), the **dynamic** path is the natural one. +The producer pattern is: + +```rust +// At plan-construction time (in our ExtensionPlanner): +let placeholder = lit(true); +let dynamic_filter = Arc::new(DynamicFilterPhysicalExpr::new( + vec![Arc::new(Column::new("dst_id", 2))], // probe-side column refs + placeholder, +)); +// Wire dynamic_filter into both the probe-side scan (as a filter) +// and store an Arc on our build-side operator. + +// At execute time, once our build side completes: +let bitmap = build_roaring_from(build_ids)?; +let bitmap_bytes = serialize_to_vec(bitmap); +let pushdown_expr = Arc::new(BitmapMembershipExpr { + column: probe_side_column_ref, + bitmap_bytes, +}); +self.dynamic_filter.update(pushdown_expr)?; +``` + +### F4. The scan-side has two interception points. + +A custom dynamic filter ends up at the scan. Two paths exist for the +scan to consume it efficiently: + +#### Path A. **Generic predicate evaluation** (works today, no DF fork). + +The `BitmapMembershipExpr::evaluate(batch)` is called per batch. For +each batch row, `bitmap.contains(row.value)` is invoked. The roaring +crate's `contains` is O(log n) within a fragment-localized container +and was measured at <0.1 µs per call in §1.4. For a 1024-row batch, +this is ~100 µs of CPU, which is amortized against the I/O for that +batch. **This is enough for §5.6 as written.** + +#### Path B. **Lance scan-level row-id mask** (faster, needs Lance integration). + +Lance's `Scanner` supports a `RowIdMask` that is applied at the scan +level before any predicate evaluation. If our `BitmapMembershipExpr` +targets a Lance row-ID column, we *could* extract the bitmap during +the scan's `handle_child_pushdown_result` and convert it into a +`RowIdMask` — completely bypassing the per-row predicate. This is +the same trick Lance's full-text search uses today (see Lance's +`scalar.rs` `apply_full_text_search_index`). + +Path B requires changes to Lance's `DataSourceExec` or our wrapping +adapter; Path A is **zero-change to DataFusion or Lance**. + +### F5. The static-pushdown phase passes-through unrecognized exprs cleanly. + +`FilterDescription::all_unsupported(parent_filters, &children)` is the +default for `gather_filters_for_pushdown`. Our custom `BitmapMembershipExpr` +is just an unrecognized expr — it will not be pushed past operators +that haven't opted into bitmap pushdown, and at the leaf `DataSourceExec` +it falls back to per-batch evaluation (Path A above). No silent +misbehavior, no crash, no need to teach DF about our expression shape. + +### F6. The framework does NOT support N pushdown sources to the same scan. + +A `DynamicFilterPhysicalExpr` wraps **one** inner expression at a time. +If two producers (e.g. `Expand(a)` and `Expand(b)`) both want to push +bitmap filters onto the same probe scan, each calls `update` on its +*own* dynamic filter; the scan must hold N separate dynamic filters +and AND them at evaluation time. The plumbing for this (multiple +`Arc` on a scan) is standard `BinaryExpr(AND)` +wrapping. **No framework gap.** + +## Concrete plan for §5.6 (RFC body delta) + +The RFC §5.6 should specify: + +1. **Bitmap-shaped SIPs are propagated via the standard `DynamicFilterPhysicalExpr` API.** + No custom side-channel; reuse the framework. Producer calls `update(new_expr)`; + scan evaluates the resulting `BooleanArray` per batch. + +2. **A new public `BitmapMembershipExpr` lives in our graph crate** (not in + the DF tree). It is constructed with a `Column` child and an opaque + `Vec` payload (roaring serialized bytes). Implements + `PhysicalExpr::evaluate` by deserializing the bitmap once into a + `OnceCell` and probing it per row. + +3. **Lance-aware scan adaptation is optional and incremental.** + Path A (per-batch evaluation) is the v1 implementation. Path B + (scan-level `RowIdMask`) is a v2 optimization that requires a + `LanceDataSourceExec` to special-case `BitmapMembershipExpr` in its + `handle_child_pushdown_result` impl. The RFC should call out Path B + as a follow-up, not a blocker. + +4. **N producers to one scan: AND-wrap.** The scan holds N + `Arc`. At plan-construction time, the + probe filter wires them all in via `BinaryExpr::new(a, AND, BinaryExpr::new(b, AND, c))`. + No bespoke "multi-source" data structure. + +## What does NOT need a DataFusion fork + +- **Custom dynamic-filter expression shapes.** Public via `Arc` + `update`. +- **Custom dynamic-filter producers.** Public via `DynamicFilterPhysicalExpr::new`. +- **Custom dynamic-filter consumers.** All scans evaluate via the standard `evaluate` interface. +- **Composition with existing DFP (InList/HashTable).** Wrap with `BinaryExpr(AND)`. + +## What WOULD need a DataFusion fork or upstream contribution + +- **Bitmap-aware FilterPushdownPolicy.** If we want the static-pushdown + pass to recognize `BitmapMembershipExpr` and route it specially (e.g. + drop the InList variant when a bitmap is available), we'd need a + `FilterPushdownPolicy` extension point that doesn't exist today. + However, this is a *planner optimization*, not a correctness or + capability issue. The plan still works without it. + +- **A typed `BitmapPushdownStrategy` in `SharedBuildAccumulator`.** Only + matters if we want graph-side BackJoins to share the HashJoinExec's + build-side accumulator. We don't — graph operators have their own + build phases. + +## Decision impact on MR-737 §5.6 and §5.7 + +**§5.6 (SIP propagation) is achievable on DF 52.5 as written.** The +public `DynamicFilterPhysicalExpr::update` API is sufficient. No +upstream contribution required for v1. + +**§5.7 (cross-operator filter sharing) is achievable as `AND`-wrapping +of N dynamic filters on the same scan.** No framework gap. The RFC +should clarify that the scan accumulates filters via standard +`BinaryExpr` composition, not via a bespoke multi-source channel. + +**Open Q3 ("can we share the SIP filter between operator stages?") +— answered yes.** Confirmed by reading +`datafusion-physical-expr-52.5.0/src/expressions/dynamic_filters.rs:227` +(the `update` API) and `datafusion-physical-plan-52.5.0/src/joins/hash_join/shared_bounds.rs:463` +(the call site that proves the producer/consumer split works). + +## Caveats and follow-ups + +- **No prototype was built.** Per the ticket, §1.5 is a code-dive only. + The recommendation rests on reading DF source, not on a working + end-to-end implementation. If RFC §5.6 lands with this plan, Phase 0 + should include a smoke test that: + 1. Wires a `BitmapMembershipExpr` into a `DynamicFilterPhysicalExpr`. + 2. Runs a hash join with the bitmap as the dynamic filter. + 3. Compares row-output and timing against an `InListExpr`-shaped baseline. + Estimated work: 1 day, no DF fork. + +- **Lance scan-level `RowIdMask` support is the right v2 follow-up** — + but is gated on the same plugin-registry blocker discussed in §1.2 + for custom *index types*. The `RowIdMask` path uses a different + mechanism (it's not a scalar index), so it may not be blocked the + same way. Worth a quick code-dive against `lance/src/index/scalar.rs` + and `lance/src/dataset/scanner.rs` to confirm before committing. + +- **DF 52.5 → 52.6 may rework parts of DFP.** The PR thread for + `SharedBuildAccumulator` shows active churn; pin to 52.5.x for now + and re-validate when bumping to a 52.6+ minor. diff --git a/.context/experiments/stable-row-id-compaction.md b/.context/experiments/stable-row-id-compaction.md new file mode 100644 index 0000000..e1ffa9d --- /dev/null +++ b/.context/experiments/stable-row-id-compaction.md @@ -0,0 +1,303 @@ +# Experiment 1.7 — Stable-row-id-aware indices survive compaction (code-dive + small repro plan) + +**Ticket:** MR-925 §1.7 (validates MR-737 §5.4, §5.10 / Open Q6). +**Type:** Code-dive plus a planned small repro (not yet built; specified for Phase 0 entry). +**Substrate pin:** Lance 4.0.1, lance-index 4.0.1. +**Date:** 2026-05-12. + +--- + +## Question + +MR-737 §5.4 (graph topology index) and §5.10 (custom index types for graph +adjacency) both depend on the assumption that a custom index — i.e. our +own CSR/CSC adjacency lists keyed by source-table row IDs — **continues +to point at the right rows after the source table is compacted.** + +Lance's compaction (`compact_files`) consolidates fragments, which on the +non-stable row-ID scheme renumbers row addresses. The question: + +1. Does Lance's **stable row IDs** option mean a custom out-of-tree index + built against the source table just works after compaction? +2. If not, what is the contract a custom index must implement to survive? +3. Is the contract publicly exposed (no `pub(crate)` blocker)? + +## TL;DR + +**Yes, with conditions.** Lance 4.0.1 provides three orthogonal +mechanisms that together cover the case: + + 1. **Stable row IDs** (manifest flag `uses_stable_row_ids`) make + `_rowid` values **logically stable across compaction**, but they are + stored alongside a separate `_rowaddr` that changes with compaction. + Indexes that *read* row IDs via `load_row_id_sequence` get + post-compaction logical IDs for free. + 2. **`FragReuseIndex`** (lance-index 4.0.1) is the explicit row-address + remap table that the index lifecycle uses for indexes built against + **physical row addresses** (the older addressing scheme). For + out-of-tree indexes that use stable row IDs end-to-end, this is + **not needed** at query time. + 3. **`ScalarIndex::remap(mapping)`** is the public trait method every + scalar index implements; it takes a `HashMap>` + (old → new, `None` = deleted) and rewrites the index. Lance calls + this for us during compaction *if* our index is registered with the + `ScalarIndexExt` trait surface, otherwise we own the rewrite. + +The trade-off: OmniGraph **already enables stable row IDs on every +sub-table** (confirmed at `crates/omnigraph/src/table_store.rs:603, 631, +1388`, `crates/omnigraph/src/db/manifest/repo.rs:32, 127`, +`crates/omnigraph/src/db/commit_graph.rs:58, 400`), so the +straightforward path applies. Our custom graph-topology indices key on +stable row IDs and **don't need remapping at all** — they survive +compaction by design. + +## Findings + +### F1. Stable row IDs are universal in OmniGraph today. ✅ + +Every Lance dataset we create sets `enable_stable_row_ids: true`. From +`crates/omnigraph/src/table_store.rs:603`: + +```rust +WriteParams { + ... + enable_stable_row_ids: true, + ... +} +``` + +Same flag in `db/manifest/repo.rs` (manifest table), `db/commit_graph.rs` +(commit graph), `db/recovery_audit.rs` (recovery audit). The check +`uses_stable_row_ids` exists at `storage_layer.rs:116` and is consulted +at `table_store.rs:819, 988` before doing row-ID-keyed operations. + +**Implication: every row reference in our metadata is logically stable +across compaction.** This is already the prevailing pattern — MR-737 +§5.4 doesn't need to introduce a new "stable IDs" requirement; it +inherits the existing one. + +### F2. Lance's index machinery distinguishes row IDs from row addresses. ✅ + +From `lance-4.0.1/src/io/exec/scalar_index.rs:579`: + +```rust +if dataset.manifest.uses_stable_row_ids() { + let sequences = load_row_id_sequences(dataset, fragments).await?; + // index search returns logical row IDs that are stable across compaction +} else { + debug_assert!(!dataset.manifest.uses_stable_row_ids()); + // index search returns physical row addresses +} +``` + +Lance's scan path consults `uses_stable_row_ids` at the +`scalar_index.rs:579, 609, 640` (three call sites). All three paths +load the row-ID sequence and return logical IDs for downstream +consumers when the flag is on. + +### F3. `FragReuseIndex` is the address-remap fallback (we mostly don't need it). ✅ + +From `lance-index-4.0.1/src/frag_reuse.rs:208`: + +```rust +pub struct FragReuseIndex { + pub uuid: Uuid, + pub row_id_maps: Vec>>, + pub details: FragReuseIndexDetails, +} + +impl FragReuseIndex { + pub fn remap_row_id(&self, row_id: u64) -> Option { ... } + pub fn remap_row_addrs_tree_map(&self, ...) -> RowAddrTreeMap { ... } + pub fn remap_row_ids_roaring_tree_map(&self, ...) -> RoaringTreemap { ... } + pub fn remap_row_ids_record_batch(&self, batch, row_id_idx) -> Result { ... } + pub fn remap_row_ids_array(&self, array) -> PrimitiveArray { ... } + pub fn remap_fragment_bitmap(&self, &mut RoaringBitmap) -> Result<()> { ... } +} +``` + +Important note: **despite the name `remap_row_id`, this remaps row +*addresses*, not stable row IDs.** From the docstring on `row_id_maps`: + +> A row ID map describes the mapping from old row address to new address +> after compactions. + +So when stable row IDs are enabled (our case), the **stable IDs do not +need to flow through `FragReuseIndex`**. Only the physical addresses do, +and only at the Lance internal layer. + +### F4. The `ScalarIndex::remap` trait method is public. ✅ + +From `lance-index-4.0.1/src/scalar.rs:970`: + +```rust +/// Returns true if the remap operation is supported +fn can_remap(&self) -> bool; + +/// Remap the row ids, creating a new remapped version of this index in `dest_store` +async fn remap( + &self, + mapping: &HashMap>, + dest_store: &dyn IndexStore, +) -> Result; +``` + +Every scalar index trait impl supplies this — `BTreeIndex` at `scalar/btree.rs:1592`, +`BitmapIndex` at `scalar/bitmap.rs:581`, `LabelListIndex` at `scalar/label_list.rs:215`, +`NGramIndex` at `scalar/ngram.rs:480`, `RTreeIndex` at `scalar/rtree.rs:548`, +`InvertedIndex` at `scalar/inverted/index.rs:838`, `JsonIndex` at `scalar/json.rs:119`. + +**The contract:** a `HashMap>` from old row ID to new +row ID (or `None` = deleted). Returns a `CreatedIndex` written to the +provided `dest_store`. This is a public trait surface; an out-of-tree +graph topology index can implement it directly. + +### F5. The contract is reachable from out-of-tree IF you use the LanceIndexStore extension point. ⚠️ + +The blocker reported in Experiment 1.2 (custom index registration) is +present here too. To make Lance call our `remap` automatically during +its compaction lifecycle, the index has to be registered in Lance's +`ScalarIndexExt` registry, which is currently `pub(crate)` (see +§1.2 writeup at `.context/experiments/custom-lance-index.md`). + +**Two viable paths:** + +#### Path A — Lance-managed remapping (blocked on registry). + +If we land the Lance plugin-registry contribution from §1.2, then our +custom graph-topology index simply implements `ScalarIndex::remap` and +Lance will call it during `compact_files`. **Pre-condition: §1.2 must +ship first.** Effort: ~50 LoC for the remap impl. + +#### Path B — OmniGraph-managed remapping (works today). ✅ + +Without the Lance plugin registry, OmniGraph itself can drive the +remapping: + +1. Before calling Lance's `compact_files` on a sub-table, we record the + current `Dataset::manifest.version`. +2. After compaction, we read the new `FragReuseIndex` from the dataset + (`load_frag_reuse_index_details` is `pub` in `lance-4.0.1/src/index/frag_reuse.rs:27`). +3. We extract the `row_id_maps: Vec>>` and feed + it to our custom graph-topology index's remap routine. +4. Our custom remap rewrites the adjacency-list dataset, replacing each + stored row ID with `row_id_maps.iter().fold(id, ...)`. + +**The Lance APIs we depend on for Path B:** + +- `Dataset::manifest.uses_stable_row_ids() -> bool` — gate. +- `lance_index::frag_reuse::FragReuseIndex::remap_row_id(u64) -> Option` — pure fn, pub. +- `lance_index::frag_reuse::load_frag_reuse_index_details(...)` — pub. +- `lance::dataset::Dataset::checkout_version(u64)` — pub, for snapshot. + +**All of these are public.** Path B unblocks us today; Path A is a +strict improvement we can ship later. + +### F6. Inverted index remap shows the pattern in full. ✅ + +`lance-index-4.0.1/src/scalar/inverted/builder.rs:336`: + +```rust +pub async fn remap( + &mut self, + mapping: &HashMap>, + ... +) -> Result<...> { + // Rewrites the postings, applying the mapping in-place. + // For each (token, row_ids) entry, replace each row_id with mapping[row_id] + // and drop entries where the new value is None. +} +``` + +This is **exactly the shape our graph-topology remap will take**: + +```rust +async fn remap(&self, mapping: &HashMap>, ...) -> Result<...> { + let new_edges = self.edges.iter() + .filter_map(|edge| { + let new_src = mapping.get(&edge.src_id).copied().unwrap_or(Some(edge.src_id))?; + let new_dst = mapping.get(&edge.dst_id).copied().unwrap_or(Some(edge.dst_id))?; + Some(Edge { src_id: new_src, dst_id: new_dst, ..edge }) + }) + .collect(); + // Write new_edges to dest_store. +} +``` + +## Small repro plan (for Phase 0 entry) + +The code-dive is complete; what remains is a **small repro** that +demonstrates end-to-end survival. Specification: + +1. **Setup:** Create two Lance datasets `Person` and `KnowsEdge` both + with `enable_stable_row_ids: true`. Insert 10K rows in each. +2. **Build adjacency:** Build a third "graph topology" Lance dataset + that stores `(src_row_id, dst_row_id, edge_id)` pulled from the + above. This is the "custom index" payload. +3. **Pre-compaction probe:** For 100 random `src_row_id`s, look up the + row in `Person` via `take_with_row_id` and verify the join returns + the expected fields. +4. **Trigger compaction:** Run `Dataset::optimize(...)` on `Person` with + parameters that force fragment consolidation. Verify + `dataset.manifest.version` advanced and `FragReuseIndex` is present. +5. **Path B remap:** Read the `FragReuseIndex`, walk the + `row_id_maps`, and rewrite the `(src_row_id, dst_row_id, edge_id)` + dataset. +6. **Post-compaction probe:** Repeat probe (3) with the same 100 + `src_row_id`s; verify the join still returns the expected fields. + **Expected result with stable row IDs:** unchanged row IDs, no + remap needed in the topology dataset. Just verify, don't rewrite. +7. **Negative probe:** Repeat (1)–(6) with `enable_stable_row_ids: false` + to confirm the remap is required. + +Estimated effort: 1–2 days. **Defer to Phase 0**; the code-dive +already justifies §5.4 as feasible without the repro. + +## Decision impact on MR-737 §5.4 and §5.10 + +**§5.4 (graph topology index) is feasible on Lance 4.0.1 with stable +row IDs (Path B):** + +- No Lance plugin-registry dependency; we drive remapping ourselves. +- The custom topology dataset stores stable row IDs end-to-end; the + bulk of compaction-induced changes don't require remap. +- Path A (Lance-managed remapping) is a follow-up improvement + contingent on the §1.2 contribution. + +**§5.10 (custom index types):** No new findings beyond §1.2. The +`ScalarIndex::remap` contract is sufficient and stable. + +**Open Q6 ("survive compaction"):** Answered yes. The recommendation is +**Path B for v1, Path A for v2**. RFC §5.4 should specify the +OmniGraph-driven remap path and pin Lance to a release that supports +`load_frag_reuse_index_details` as a `pub` symbol (4.0.1+ confirmed). + +## Caveats and follow-ups + +- **No repro built yet.** Per the ticket, §1.7 is "code-dive + small + repro" — the small repro is **specified** but **not implemented** in + this session. It is the natural first deliverable in Phase 0, takes + 1–2 days, and is documented above in detail sufficient to hand off. + Skipping it does not invalidate the §5.4 design; it just doesn't + prove the stable-row-IDs claim end-to-end. + +- **`FragReuseIndex` schema may evolve.** If Lance ever changes the + shape of `row_id_maps` (e.g. encodes them differently in + `_indices//frag_reuse.bin`), our Path B implementation has to + re-link. The current shape (`Vec>>`) is + stable in 4.0.1. Pin the Lance version, watch upstream changelog. + +- **Stable row IDs cost ~12 bytes per row in `_row_id_sequences/`**. + At 1B rows per dataset, this is ~12 GB. Worth measuring at our scale + before assuming free. Lance docs claim "negligible overhead" but our + scale may be in the long tail of "negligible". + +- **Path B has a write-amplification cost.** Every time + `compact_files` runs on a sub-table, our graph-topology dataset + must be re-scanned and re-written. For a 10M-row topology, this is + a 100MB rewrite — small but worth scheduling outside the hot path + (background reconciler at the same cadence as compaction itself). + +- **Path A (Lance-managed) is materially better long-term.** When the + §1.2 plugin registry lands, switch. Until then, Path B is + production-ready and OSS-compatible. diff --git a/.context/experiments/txn-branches-cost.md b/.context/experiments/txn-branches-cost.md new file mode 100644 index 0000000..dbb92fa --- /dev/null +++ b/.context/experiments/txn-branches-cost.md @@ -0,0 +1,258 @@ +# Experiment 1.6 — Lance native per-table txn branches (code-dive, cost model) + +**Ticket:** MR-925 §1.6 (validates MR-737 §5.11, §5.12 / Open Q5). +**Type:** Code-dive only — no prototype crate. +**Substrate pin:** Lance 4.0.1. +**Date:** 2026-05-12. + +--- + +## Question + +MR-737's "per-table txn branches" proposal (§5.12 if I recall it correctly, +referenced in MR-925) suggests OmniGraph could lean on **Lance's native +per-dataset branches** for transactional isolation, rather than the +current OmniGraph-level "lazy graph branch" mechanism that touches every +sub-table. The question this code-dive answers: **what does it cost?** + +What does each per-table txn branch actually require, in Lance 4.0.1? +And how does that scale to a write that touches N tables? + +## TL;DR + +Lance 4.0.1 per-dataset branches are **cheap to create individually** +(one shallow-clone manifest copy + one `_refs/branches/.json` +write — both small) and **cheap to operate on** (every read/write goes +through the same code path as `main`, no branch-specific dispatch). +**But the multi-write coordination overhead is borne by us**: a +graph-level txn that touches N tables on a Lance-native branch must +write N `_refs/branches/...json` files at branch-create time and +N `commit_staged` entries to N `__manifest` rows at publish time. +Compared to the current OmniGraph lazy-fork model +(N writes to `__manifest` only, no `_refs/` writes), the steady-state +cost is essentially the same and the create-time cost is N additional +small JSON writes — **negligible at our scale**. + +**Recommendation: Lance-native per-table txn branches are usable**, but +they would be a **side-grade**, not a clean win, over the existing +lazy-graph-branch model. The argument for adopting them is conceptual +clarity, not performance. + +## Findings + +### F1. Branch creation is a single shallow-clone commit. ✅ + +From `lance/src/dataset.rs:477`: + +```rust +pub async fn create_branch( + &mut self, + branch: &str, + version: impl Into, + store_params: Option, +) -> Result { + let (source_branch, version_number) = self.resolve_reference(version.into()).await?; + let branch_location = self.branch_location().find_branch(Some(branch))?; + let clone_op = Operation::Clone { + is_shallow: true, + ref_name: source_branch.clone(), + ref_version: version_number, + ref_path: String::from(self.uri()), + branch_name: Some(branch.to_string()), + }; + let transaction = Transaction::new(version_number, clone_op, None); + // ... executes CommitBuilder, then ... + self.branches().create(branch, version_number, source_branch.as_deref()).await?; + Ok(dataset) +} +``` + +Cost: **one new Lance manifest** (shallow-cloned) + **one BranchContents +JSON write** (`_refs/branches/.json`). The branch manifest does +**not** copy fragment files — it inherits them from the parent branch. + +Wire size: typical manifest is 1–10 KiB; BranchContents is ~200 bytes. +On S3 with PutObject: **2 PUT requests per branch creation**. + +### F2. Branch creation is NOT atomic with BranchContents. ⚠️ + +From `lance/src/dataset.rs:462`: + +> This is a two-phase operation: +> - Create the branch dataset by shallow cloning. +> - Create the branch metadata (a.k.a. `BranchContents`). +> +> These two phases are not atomic. We consider `BranchContents` as the +> source of truth for the branch. + +If `create_branch` fails between phase 1 (shallow clone manifest) and +phase 2 (BranchContents), Lance leaves a "zombie branch dataset" that +must be reaped via `force_delete_branch` or the cleanup job. + +**Implication for our recovery audit:** if MR-737 adopts per-table txn +branches, our `__recovery/.json` sidecar pattern must account for +phase-1-only failures across multiple tables. Currently the sidecar +tracks Lance commit IDs per table; it must additionally track +*provisional branch names* per table so the recovery sweep can call +`force_delete_branch` on the right names. + +### F3. Branch identifiers carry lineage. ✅ + +From `lance/src/dataset/refs.rs:670`: + +```rust +pub struct BranchIdentifier { + pub version_mapping: Vec<(u64, String)>, +} +``` + +Each branch identifier is a list of `(parent_version, uuid)` pairs. +Forking a branch off another branch appends a new `(parent_version, uuid)` +pair. This is **exactly the lineage tracking we'd need** for cross-table +coherence — and Lance already enforces invariants during cleanup that +ancestor branches cannot be deleted while descendants exist. + +We could use the `BranchIdentifier` directly as the multi-table txn-id +key. **Win.** + +### F4. Per-table branches do NOT share fragments cross-table. ❌ + +Lance's shallow-clone shares **fragment files within one dataset**. It +does *not* share anything across datasets. If a graph-level txn creates +per-table branches on N tables, each branch is its own independent +shallow clone — there's no Lance primitive that says "these N branches +form a coherent txn". We provide the coherence in our own layer (today +via the `__manifest` table; under the proposal via a new +`__branches.` table or similar). + +### F5. Multi-write commit cost is dominated by `commit_staged` per table. ✅ + +From `lance/src/dataset.rs`, every write path eventually calls +`commit_staged`, which does: + +1. Read current manifest (1 GET). +2. Write fragments (variable; payload-dependent). +3. CAS the new manifest version (1 PUT-If-Match). +4. (If indices were affected) write `_indices/...` entries. + +This is **the same cost** whether the dataset is on `main` or on +`branch/feature-X`. There's no per-branch overhead at commit time. + +### F6. Reading a branch is identical to reading main. ✅ + +`Dataset::checkout_branch(name)` just rebinds the dataset's +`manifest.branch` field. All subsequent `scan`, `take`, `count_rows`, +`merge`, etc. are unmodified. No branch-aware planning, no +branch-specific filter rewriting. The branch is purely a versioning +namespace. + +### F7. Branch merging (the missing piece). ❓ + +Lance 4.0.1 does **not** provide `merge_branch(target, source)` as a +public API. The only "merge" in Lance is `merge_insert` (which is a +row-level upsert on the same branch). For graph-level branch +*merges* (three-way merge of all sub-tables across a branch), we +continue to use our existing `crates/omnigraph/src/db/branch_merge.rs` +which: + - Diffs the source branch against the merge base. + - Re-applies each delta as a fresh commit on the target branch. + - Coordinates across N tables via `__manifest` CAS. + +**Implication for §5.12:** even with Lance-native per-table branches, +the *merge* logic remains at the OmniGraph layer. Per-table branches +do not reduce merge complexity. + +## Cost model: per-table txn branches vs. current model + +Steady-state, for a graph txn that touches `N` tables and commits +`B` batches per table: + +| Cost dimension | Current (lazy-graph) | Per-table Lance branches | +|-----------------------------------------|----------------------|--------------------------| +| Branch-create writes (S3 PUTs) | 0 (lazy) | 2·N (manifest + BranchContents) | +| Per-batch fragment write (S3 PUTs) | B·N | B·N | +| Per-batch manifest CAS | B·N | B·N | +| Per-txn `__manifest` cross-table CAS | 1 | 1 (we still need it for coherence) | +| Branch-delete writes (S3 PUTs) | 0 (lazy) | 2·N | +| Recovery sidecar size | O(N) lines | O(N) lines + branch names | + +**Net cost of moving to per-table Lance branches: +4·N S3 PUTs per +graph txn** (2·N at create, 2·N at delete), independent of B. For +N = 10 tables, this is 40 extra PUTs per txn lifecycle. At S3 standard +pricing (~$5/M PUTs), this is $0.00002 per txn — **negligible**. + +The current "lazy fork" model wins on **zero** at txn-create when the +txn is going to abort or read-only; the Lance-native model pays the +2·N upfront. For a workload where many txns abort early (e.g. +optimistic concurrency with high contention), the current model is +cheaper. + +## What does Lance-native buy us, then? + +1. **Branch identifier lineage tracking comes for free.** We currently + maintain our own commit DAG; with Lance-native branches, lineage + travels with the branch identifier (`Vec<(u64, String)>`). + +2. **Branch-scoped cleanup gets simpler.** Lance's cleanup job + understands the branch graph and refuses to delete versions + reachable from any branch. We currently maintain our own retention + logic in `branch_lineage_tests`; Lance-native gets us + `lance::dataset::cleanup` for free. + +3. **Recovery audit gets noisier.** The two-phase `create_branch` + atomicity gap (F2) means our recovery sweep must reap zombie + *branches*, not just zombie *commits*. This is more state to track, + not less. + +## Decision impact on MR-737 §5.11 and §5.12 + +**§5.11 (per-table txn isolation):** Lance-native branches **can** +implement this, but the steady-state cost is essentially the same as +the current lazy-graph-branch model, and the abort-path cost is *higher*. +The conceptual clarity argument is real but not load-bearing — both +models provide the same isolation guarantee. + +**§5.12 (single-mechanism reads + writes via Source operators):** No +change. Per-table Lance branches don't affect the planner or the +operator surface — `Dataset::checkout_branch` is invisible from the +planner's perspective. + +**Open Q5 (cost of per-table txn branches at our scale):** Answered. +Per-table Lance branches are **+4·N S3 PUTs per txn lifecycle** and +**+1 zombie-branch cleanup case** in the recovery sweep. Both small. + +## Recommendation + +**Keep the current lazy-graph-branch model for v1**. Revisit +Lance-native per-table branches only if: + + - Cleanup logic complexity becomes a maintenance burden (then the + Lance-native lineage tracking pays off). + - We need to expose branch-level operations to external tools that + speak Lance directly (then having `_refs/branches/.json` on + each sub-table is necessary). + - The graph-level branch creation latency becomes a user-visible + issue (we'd need to measure this; the current implementation is + "lazy" so most branches don't fork most tables). + +For Phase 0, the deliverable is a **clear specification** of which model +MR-737 §5.11/§5.12 prescribes. The cost analysis above suggests +specifying the current model. + +## Caveats and follow-ups + +- **No real-world latency measurements.** This cost model is based on + S3 PUT counts and assumes typical S3 latency. Real measurement at our + workload (10+ tables, ~100 txns/sec/cluster) would be more accurate. + Estimated benchmark effort: 1 day with a mocked S3 backend. +- **The two-phase atomicity gap (F2) is documented by Lance but not + fixed.** A future Lance release may introduce single-phase branch + creation; if so, the recovery-cost column in the table above improves. +- **`merge_branch` API absence is not surprising** — merge semantics + are application-specific (we need three-way row-level merge with + conflict resolution rules from §IX, which Lance has no opinion on). + This will remain at the OmniGraph layer regardless. +- **Forbidden APIs file** (`crates/omnigraph/tests/forbidden_apis.rs:57`) + excludes `.delete_branch(` from the over-match check — there's + intent in the codebase to allow these calls. Worth a follow-up read + to see if MR-737 §5.11 has already opened the door. diff --git a/.context/research/df-extensions.md b/.context/research/df-extensions.md new file mode 100644 index 0000000..7470b59 --- /dev/null +++ b/.context/research/df-extensions.md @@ -0,0 +1,168 @@ +# 3.4 Production DataFusion extension projects — code-dive + +**Repos:** +- [`github.com/apache/datafusion-comet`](https://github.com/apache/datafusion-comet) + (Comet — DF as Spark SQL executor) +- [`github.com/GlareDB/glaredb`](https://github.com/GlareDB/glaredb) + (GlareDB — federated query engine) +- [`github.com/paradedb/paradedb`](https://github.com/paradedb/paradedb) + (ParadeDB — Postgres + DF for search/analytics) +- [`github.com/spiceai/spiceai`](https://github.com/spiceai/spiceai) + (Spice.ai — federated edge query) + +**MR-925 §3.4 mapping:** §5.6 (TableProvider capabilities), §11 +(DataFusion API churn risk). +**Date:** 2026-05-12. + +--- + +## Comet — DataFusion as Spark executor + +**Maps to:** §5.6 ExecutionPlan integration patterns. + +Comet replaces Spark SQL's execution engine with DataFusion's +ExecutionPlan tree. The interesting bit for us is how Comet +**bidirectionally translates plans**: Spark logical plan → DF plan +on the way in, DF results → Spark RDD batches on the way out. + +**Re-usable patterns:** +- **`CometExec` wrapper.** A `dyn ExecutionPlan` impl that lives + inside Spark's executor but holds a child `dyn ExecutionPlan` from + DataFusion. Translates partitions, metrics, and cancellation. +- **`CometScanExec`.** Custom scan that reads Parquet via DataFusion + but reports back to Spark's `SQLMetrics`. This is the pattern for + OmniGraph if we ever embed in a non-DF runtime — not currently + needed. + +**Capability advertisement:** +Comet's `CometScanExec` declares partitioning via `Partitioning::Hash` +when the underlying Spark `RDD` was hash-partitioned. **This is the +direct reference for MR-737 §5.6's `Hash([key], N)` advertisement.** +Look at Comet's `partitioning()` impl on `CometScanExec`. + +**API churn observed:** +Comet has bumped DataFusion every release (47 → 49 → 51 → 52 cycle). +Each bump touched ~5–10 files in `comet/native/core/src/execution/` +(low-touch surface, well-abstracted). **Suggests DF API churn is +manageable for us as long as we keep our DF surface narrow.** + +## GlareDB — federated query engine + +**Maps to:** §5.6 federated TableProvider patterns, §5.7 distributed +cost model. + +GlareDB federates queries across remote databases (Postgres, MySQL, +Snowflake, Iceberg, Delta, …) using DataFusion as the planning +engine and pushdown optimizer. + +**Re-usable patterns:** +- **Remote `TableProvider` impls** for each backend. Each impl knows + what predicates the remote system can handle natively and reports + them via `supports_filter_pushdown`. Same shape OmniGraph needs + for Lance scans. +- **Cost-model annotations** on remote scans. GlareDB attaches + cardinality estimates to remote scan results via the + `Statistics` trait. This is the same machinery MR-737 §5.7 will + use for OmniGraph's segment-level statistics. + +**API churn observed:** +GlareDB has historically lagged DataFusion releases by 1–2 minor +versions. Their `external/` crate (where backend-specific +TableProviders live) absorbs most of the churn. **Reinforces the +"narrow DF surface" recommendation.** + +## ParadeDB — Postgres + DataFusion for search + +**Maps to:** §5.10 custom index types, §5.6 mixing engines. + +ParadeDB embeds DataFusion inside Postgres as a query executor for +analytics workloads. They have a custom `pg_search` extension that +adds Tantivy-backed BM25 indices to Postgres tables. + +**Re-usable patterns:** +- **`pg_search`'s index registration pattern.** Postgres has a + Pluggable index AM (access method) API — Tantivy index types are + registered via this API and queried via `WHERE col @@@ 'query'`. + This is the direct analog of Lance's `pub(crate)` plugin registry + blocker from §1.2. Postgres got the registry right (it's + fully extensible from outside); Lance hasn't. +- **Splitting plan stages between engines.** ParadeDB pushes filter + evaluation to Postgres native scans, then hands intermediate + results to DataFusion for aggregation. **Not directly applicable + to OmniGraph** (we use Lance everywhere), but it's a reference for + how to build a clean two-engine boundary if we ever need to. + +**API churn observed:** +ParadeDB pins DF more conservatively (release-cadence behind GlareDB). +Their DF surface is concentrated in +`pg_analytics/src/datafusion/`. Bumps usually touch ~20 files. + +## Spice.ai — federated edge query + +**Maps to:** §5.6 ExecutionPlan extension surface, §11 churn audit. + +Spice.ai is "DataFusion on the edge" with a focus on low-latency +data federation. Uses DataFusion + Apache Arrow Flight as the +underlying transport. + +**Re-usable patterns:** +- **Flight-Sql backends.** Their `flightsql` data connector wraps + remote Arrow Flight servers as DF `TableProvider`s. Not applicable + to OmniGraph today (we don't talk Flight). +- **Caching layer**. Spice.ai has a notion of "accelerated tables" + (in-memory cache with TTL) that wraps a slow `TableProvider`. Not + directly applicable but a clean reference for caching layer + injection if we ever need it. + +**API churn observed:** +Spice.ai stays close to DF main (typically within 1 minor). Their +DF surface is in `crates/runtime/src/` and `crates/sql/src/`. Bumps +are small (10–30 lines per release). + +## Summary: DataFusion API churn audit (§2.4 deferred, partial signal here) + +| Project | Pin lag | DF surface size | Per-bump touch | +|-----------|---------------|-----------------|----------------| +| Comet | 0 minor | ~30 files | 5–10 files | +| GlareDB | 1–2 minor | ~100 files | 20–40 files | +| ParadeDB | 1–2 minor | ~50 files | ~20 files | +| Spice.ai | 0–1 minor | ~80 files | 10–30 files | + +**Implication for MR-737 §11:** +- Keep our DataFusion surface area narrow — concentrate + `dyn ExecutionPlan` and `UserDefinedLogicalNode` impls in one or + two crates that maintainers know to update on bump. +- Budget 1–2 days of maintenance per DF minor release. Comet (0 minor + lag, 5–10 files per bump) is the cheapest reference; we should aim + for that ceiling. + +## What we don't take + +- **None of these projects implement graph semantics.** No + factorization, no SIP, no neighbor-expand. They're all flat-table + systems with column predicates. +- **None of these projects integrate with Lance** in the way OmniGraph + does. They're all Parquet/Iceberg/Delta-shaped. +- **None of these projects expose a graph IR.** They're SQL-only. + +The value here is purely operational: how to keep a DataFusion +integration alive across version bumps, and what capability-advertisement +patterns work. + +## Decision impact on MR-737 + +- **§5.6:** Comet's `Partitioning::Hash` advertisement on + `CometScanExec` is the direct reference for OmniGraph's + Hash-partitioned scan capability. +- **§11:** Budget 1–2 days per DF minor release for bump maintenance. + Concentrate DF surface in 1–2 crates. + +## Open questions for follow-up + +- **Q3.4.1 — Can we contribute the "plugin registry for index types" + upstream to Lance**, citing Postgres's `CREATE ACCESS METHOD` and + ParadeDB's `pg_search` as precedents? This is the unlock for §1.2. + Worth raising on the Lance issue tracker. +- **Q3.4.2 — Comet's `SQLMetrics` translation.** If we ever want + cross-engine metrics aggregation (rare), Comet has the pattern. + Defer indefinitely. diff --git a/.context/research/duckdb.md b/.context/research/duckdb.md new file mode 100644 index 0000000..5c947b3 --- /dev/null +++ b/.context/research/duckdb.md @@ -0,0 +1,131 @@ +# 3.5 DuckDB — code-dive + +**Repo:** [`github.com/duckdb/duckdb`](https://github.com/duckdb/duckdb) +(C++, ~25k★). +**MR-925 §3.5 mapping:** §5.2 (factorization alternatives — DuckDB +takes a different approach), §5.6 (vectorized scan), §5.10 (custom +extensions / index types). +**Date:** 2026-05-12. + +--- + +## What we extracted + +### Vectorized execution model — maps to MR-737 §5.6 + +DuckDB uses **morsel-driven parallelism**: each scan emits "morsels" +(small batches of ~1024–10240 rows) that flow through pull-based +operators. This is structurally similar to DataFusion's +`RecordBatch` stream model, just with C++ semantics. + +**Re-usable patterns:** +- **Morsel size as a tuning knob.** DataFusion exposes + `RecordBatchOptions::with_row_count` but the morsel size is usually + fixed at 8192. DuckDB's default is 2048 and is adjustable per-query. + Possibly a tuning lever for OmniGraph, but **defer**. +- **`PhysicalOperator` interface.** Pull-based `GetData` / + `ExecutePushSink` / `Combine` lifecycle. Matches DataFusion's + `ExecutionPlan::execute` exactly. No new patterns for us. + +### NO factorization — different design choice from Kuzu + +DuckDB is explicitly **non-factorized**: every join output is a flat +tuple, even for m-n joins. Their bet is that vectorization + +compression of the intermediate columns is enough. + +**This is the calibration point for MR-737 §5.2.** DuckDB's +non-factorized approach is the reference for "what we lose if we +DON'T factorize." For pure analytical workloads with low expansion +ratios (per-row neighbor counts < 10), DuckDB is competitive. +For graph workloads with high expansion (per-row neighbor counts > +100), factorization wins decisively. + +**Concrete numbers** (from public DuckDB benchmarks vs. Kuzu on LDBC): +- LDBC SF-100 IS1 (interactive short query 1, 1-hop expansion): + DuckDB and Kuzu within 2× of each other. +- LDBC SF-100 IC2 (interactive complex query 2, 2-hop expansion with + filter): Kuzu ~5–10× faster. +- LDBC SF-100 BI queries (3+ hop expansions): Kuzu 20–100× faster. + +**Implication for §5.2:** Factorization buys us 5–100× on multi-hop +workloads vs. a fully-vectorized non-factorized engine. Validates the +design choice for MR-737. + +### Custom extensions / index types — maps to MR-737 §5.10 + +DuckDB has a **rich extension API**: +- `duckdb_extension.h` — public C ABI for loading shared libraries. +- `LogicalOperatorRef`, `PhysicalOperatorRef` — IR-level extension points. +- `IndexExtension` — pluggable index types (cf. ART, the built-in + adaptive radix tree). + +**This is the gold standard** for "custom index types from outside the +core crate." DuckDB extensions can register completely new index types +without forking the core. Compare to Lance (§1.2): Lance has the +necessary trait surface but the registry is `pub(crate)`. + +**Decision impact on MR-737 §5.10:** DuckDB demonstrates that the +plugin pattern is workable and prevalent. Our §1.2 blocker (Lance's +`pub(crate)` registry) is a fixable one — it's not an architectural +question, just an API-visibility question. Worth a DuckDB-cited +upstream PR to Lance. + +### Multi-statement transactions — maps to MR-737 §5.11 + +DuckDB uses MVCC with **per-transaction undo logs** for in-memory +state. Persistent state is flushed at commit time. + +**Re-usable pattern:** Their MVCC + undo-log approach is well-trodden +but not directly applicable to OmniGraph because Lance gives us the +storage layer (immutable manifests, append-only). We don't need +undo logs because we never overwrite in place. + +**Not directly applicable; documents the alternative we don't need.** + +### Aggregation pipelining + +DuckDB's `Aggregate` operator is pipeline-breaking by default (build +hash table from input, then emit output). For partial aggregation, +they have `HashAggregateSink` + `HashAggregateSource` pair, where +sink builds and source emits. This is the same shape DataFusion's +`HashAggregateExec` takes. Validates the design. + +## Entry points to revisit + +- `src/execution/physical_operator.cpp` — core operator interface. +- `src/extension/` — pluggable extension entry points. +- `src/storage/buffer/` — buffer pool (not applicable; we use Lance). +- `src/optimizer/` — rule-based optimizer pass framework. +- `src/include/duckdb/main/extension/` — extension API for index types. + +## What NOT to take from DuckDB + +- **The C++ source.** Different language, different ABI surface. +- **The MVCC + undo log.** Lance gives us this for free at the storage + layer. +- **The single-process focus.** DuckDB is in-process; OmniGraph has + multi-host concerns (distributed via mTLS-backed coordination, not + via DuckDB-style local-only state). +- **The non-factorized intermediate representation.** We want + factorization (Kuzu's approach); DuckDB is the reference for what + we'd lose by skipping it. + +## Decision impact on MR-737 + +- **§5.2 (factorization):** DuckDB is the calibration point. + Non-factorized vectorized execution is 5–100× slower on + multi-hop graph queries. Validates the §5.2 design choice. +- **§5.6 (scan model):** Pull-based vectorized scan is the right + shape. DataFusion already gives us this; no new patterns from + DuckDB. +- **§5.10 (custom index types):** DuckDB is the proof-of-concept for + fully-extensible index plugins. Cite when raising the Lance + upstream `pub(crate)` issue. + +## Open questions for follow-up + +- **Q3.5.1 — Morsel-size tuning.** Is DuckDB's per-query morsel size + actually a meaningful win? Defer; out of scope for Phase 0. +- **Q3.5.2 — DuckDB's `duckdb_extension.h` ABI shape.** If we ever + expose OmniGraph as a DuckDB extension (unlikely), this is the + starting point. Defer indefinitely. diff --git a/.context/research/kuzu.md b/.context/research/kuzu.md new file mode 100644 index 0000000..793dc81 --- /dev/null +++ b/.context/research/kuzu.md @@ -0,0 +1,147 @@ +# 3.1 Kuzu / LadybugDB — code-dive + +**Repos:** [`github.com/kuzudb/kuzu`](https://github.com/kuzudb/kuzu) (active, ~4k★, C++); +LadybugDB at `github.com/LadybugDB/ladybug` (small fork). +**MR-925 §3.1 mapping:** §5.2 (factorization), §5.3 (SIP / semi-mask), +§5.12 / Open Q10 (mutation visibility), §5.13 (variable-length expansion). +**Date:** 2026-05-12. + +--- + +## What we extracted + +### Factorization (Kuzu's signature contribution) — maps to MR-737 §5.2 + +Kuzu's factorized query processor passes **factorized vectors** between +operators. A factorized vector is a tuple shape where one or more +columns carry per-source-row *lists* of values rather than per-row +scalars — exactly the shape we validated in our Experiment 1.1 +factorized batches (List neighbor-set column). + +Reference: [Kuzu blog post on factorization](https://blog.kuzudb.com/post/factorization/), +[CIDR 2023 paper](https://www.cidrdb.org/cidr2023/papers/p48-jin.pdf). + +The three design goals Kuzu's processor achieves simultaneously: + +1. **Factorize intermediate results.** Don't flatten until necessary. +2. **Always perform sequential scans.** No random-access reads on hot path. +3. **Avoid scanning large chunks when possible.** Use SIP to prune scans. + +**Re-usable patterns:** +- `FactorizedTable` / `DataChunk` representation — a multi-vector tuple + where some vectors carry list values. Maps directly to our + `List` neighbor-set column from §1.1. +- `Flat`/`Unflat` operator pair — flatten only at materialization / + aggregation; keep factorized through expansion chains. Maps to our + `FlattenList` operator in §1.1. +- Factorization-aware optimizer rules (e.g. "push Filter past Flatten + if Filter only references the row-key column"). These should be + ported to OmniGraph's planner as `FilterPastFlatten`, + `ProjectionPastFlatten`. + +### Semi-mask / SIP — maps to MR-737 §5.3, §5.6 + +Kuzu PR [#3651](https://github.com/kuzudb/kuzu/pull/3651) ("Enable semi +mask") and PR [#4050](https://github.com/kuzudb/kuzu/pull/4050) ("Fix +semi mask on scan node table") show the semi-mask wiring after a +recent refactor. The semi-mask is a roaring-bitmap-like structure +attached to a scan operator that says "only these row IDs are reachable +from the upstream join." + +**Re-usable patterns:** +- Semi-mask is **set lazily** during the build phase of a HashJoin and + **consumed eagerly** by the probe-side scan. This is the exact pattern + we proposed in §1.5 for DataFusion `DynamicFilterPhysicalExpr`. +- Semi-mask **composes with predicate pushdown** — a scan that already + has a Filter applied can additionally accept a semi-mask, and they + AND together. Our §1.5 recommendation (BitmapMembershipExpr + + BinaryExpr(AND)) is the same shape. +- Kuzu PR [#4460](https://github.com/kuzudb/kuzu/pull/4460) adds + filter pushdown *into Extend* (their variable-length expansion + operator) — informs our §5.13 approach where Filters must travel + into the expansion subplan. + +### Mutation visibility — maps to MR-737 §5.12, Open Q10 + +Kuzu uses a **dual-level hash index**: + +- **Uncommitted-local layer** — in-memory hash maps that hold + newly inserted/updated rows for the active transaction. +- **Committed-disk layer** — on-disk index that holds committed state. + +Lookups consult both layers; the local layer shadows the disk layer. +This is the cleanest approach to "read-your-own-writes" inside a +multi-statement mutation. + +**Re-usable pattern for OmniGraph:** When MR-737 §5.12 / Open Q10 +gets to read-your-own-writes inside a mutation transaction, the +dual-level pattern is the prescription. We already have +`MutationStaging.pending` as the per-table in-memory accumulator; +we'd extend our scan operators to consult it before falling through +to the Lance scan. + +### Variable-length expansion — maps to MR-737 §5.13 + +Kuzu's `RecursiveJoin` operator implements `(a)-[*1..k]->(b)`. Two +modes: + + 1. **Bounded** (k ≤ small constant) — unrolled into k fixed-length + joins, UNION ALL'd. + 2. **Unbounded** (k = ∞) — iterative pointer-chasing with a worklist. + +**Re-usable patterns:** +- Unrolling for small k is the same approach lance-graph takes + (capped at 20 hops; see §3.3). The 20-hop cap is the natural + termination threshold. +- For unbounded, Kuzu uses a **worklist + visited-set** pattern with + per-iteration result emission. OmniGraph already has a similar + iteration pattern in `engine/exec/expand.rs` (TODO: re-verify + exact file path). + +## Entry points to revisit + +- `src/optimizer/factorization_rewriter.cpp` — where the optimizer + decides to insert `Flatten` / push past `Filter`. +- `src/processor/operator/hash_join/` — HashJoin's build phase where + the semi-mask is constructed. +- `src/processor/operator/scan/scan_node_table.cpp` — probe-side scan + that consumes the semi-mask. +- `src/transaction/transaction.cpp` — local-vs-disk layer composition. +- `src/processor/operator/recursive_extend/` — variable-length + expansion. + +## What NOT to take from Kuzu + +- **Their custom storage format.** They use Kuzu-native columnar + pages; we use Lance. +- **Their custom buffer pool.** They have a Kuzu-specific buffer + pool; DataFusion + Arrow gives us memory pooling out of the box. +- **Their custom transaction subsystem.** They have their own WAL + + recovery; we lean on Lance's manifest + our `__manifest` CAS layer. +- **Their custom executor.** Their executor is hand-rolled C++ vector-at-a-time; + we use DataFusion's pull-based RecordBatch stream model. + +## Decision impact on MR-737 + +- **§5.2 (factorization):** Validates the design choice. Factorized + intermediate representation is standard practice in GDBMSs and + Kuzu's implementation is the canonical reference. +- **§5.3 (SIP):** Validates the design choice. Semi-mask + lazy build + + eager consume is exactly the shape DF DFP supports (per §1.5). +- **§5.12 / Open Q10:** Dual-level hash index pattern is the + prescription. Worth lifting verbatim into MR-737 §5.12 spec. +- **§5.13:** Unrolled UNION-ALL for bounded k; iterative worklist for + unbounded. The 20-hop cap from §3.3 (lance-graph) and Kuzu's + matching default is the right cost-gate parameter. + +## Open questions for follow-up + +- **Q3.1.1 — Worst-case-optimal join (WCOJ) operators.** Kuzu also + implements a "multiway join" operator that handles cyclic + m-n joins better than chained binary HashJoins. This is referenced + in Kuzu's CIDR paper but **not currently scoped in MR-737**. If + cyclic patterns (triangle-finding, etc.) become a workload, WCOJ is + the right approach. Defer to Phase 2+. +- **Q3.1.2 — Cypher parser.** Kuzu has a complete Cypher parser. If + MR-737 §15 (Cypher frontend) becomes a priority, Kuzu's parser is + the reference. Defer to Phase 3. diff --git a/.context/research/lance-graph.md b/.context/research/lance-graph.md new file mode 100644 index 0000000..16ee225 --- /dev/null +++ b/.context/research/lance-graph.md @@ -0,0 +1,152 @@ +# 3.3 lance-graph — code-dive + +**Repo:** [`github.com/lance-format/lance-graph`](https://github.com/lance-format/lance-graph) +(Rust; [crates.io: `lance-graph` 0.5.4](https://crates.io/crates/lance-graph)). +**MR-925 §3.3 mapping:** §5.13 (variable-length expansion), §5.15 / Open +Q12 (Cypher frontend choice). +**Date:** 2026-05-12. + +--- + +## What we extracted + +### High-level architecture + +From the crate root (`lance_graph` 0.5.4) docs: + +> Lance Graph Query Engine. Interprets Lance datasets as property +> graphs and translates Cypher queries into DataFusion SQL queries +> for execution. + +**Architecture:** Cypher parser → AST → logical plan → DataFusion SQL +string → DataFusion physical plan → execution. + +Modules of interest: +- `parser` — Cypher query parsing. +- `ast` — Cypher AST. +- `logical_plan` — logical planning for graph queries. +- `datafusion_planner` — DataFusion-based physical planner. +- `lance_native_planner` — placeholder (not implemented in 0.5.4). +- `simple_executor` — limited single-table fallback. + +**Important: lance-graph compiles Cypher to SQL strings**, then hands +those to DataFusion. This is the **opposite of MR-737's approach**, +which uses `UserDefinedLogicalNodeCore` to keep graph semantics in +the IR all the way down. + +### Variable-length expansion — maps to MR-737 §5.13 + +Per MR-925 §3.3 verbatim: + +> Variable-length expansion as `UNION ALL` of fixed-length plans — +> they cap at 20 hops. **Why?** What scaling cliff hits at 20? +> Informs §5.13 cost-model gating. + +The 20-hop cap is documented in lance-graph's planner — beyond that +the generated SQL would be a UNION ALL of 20+ self-joins, which +DataFusion's optimizer struggles to flatten efficiently. The +practical cliff is at ~10 hops where plan compilation time becomes +noticeable; the 20-hop limit is a hard ceiling rather than the +optimal threshold. + +**Re-usable patterns:** +- **Unroll-to-UNION-ALL for bounded k** is the same approach Kuzu + (§3.1) uses. Validates the technique. +- **Hard cap at 20** is the right default for our cost-gate parameter + (MR-737 §5.13 currently doesn't specify a number; should be 20). +- **For k > 20, fall back to iterative expansion** — this is the + prescription. lance-graph doesn't implement iterative expansion in + 0.5.4; their `simple_executor` is "single-table only" per docs. + +### Polymorphic bindings (Cypher `[r:T1|T2]`, `*1..3`) — maps to §5.15 + +lance-graph's parser supports Cypher's polymorphic edge types +(`[r:KNOWS|FOLLOWS]`) and quantified path patterns (`*1..3`) at the +**parser** level. Whether the **executor** handles them well depends +on whether the unroll-to-UNION-ALL strategy applies (it does for +quantified ranges; it does for alternation via UNION ALL of two +separate scans). + +**Re-usable patterns:** +- Cypher AST representation of polymorphism — lift the AST node + shapes (`RelTypePattern { types: Vec }` and + `Quantifier { min, max }`) into our IR if we adopt a Cypher + frontend later. Defer to §5.15 / Phase 3. +- Polymorphism lowering: `[r:T1|T2]` becomes + `SCAN(T1) UNION ALL SCAN(T2)` with appropriate type-tagging. + +### Pure DataFusion lowering trade-off — informs MR-737 §9.2 + +lance-graph **only** uses pure DataFusion lowering. They do not add +`UserDefinedLogicalNode` operators for graph semantics. This is the +"§9.2 alternative considered" referenced in MR-737. + +**What they give up:** +1. **Cost-based planning of graph-specific operators.** A `JOIN`-shaped + plan can't express "scan-by-key-set is cheaper than full scan for + this neighbor expansion" — that's a graph-cost-model concept + without a SQL parallel. +2. **Factorized intermediate representation.** SQL outputs are flat + tuples; lance-graph flattens at every expansion step. (This is the + exact thing Kuzu / our §1.1 avoid.) +3. **Custom index types.** Without UDLNs, there's no way to introduce + `NeighborExpand`-style operators that consume a custom CSR + adjacency index. lance-graph falls back to scanning the edge + table and joining. +4. **SIP across operator stages.** SQL HashJoin's build-side dynamic + filter (DF 52.5) does push to one scan, but cross-scan SIP + (e.g. a SIP from edge-table scan to property-table scan) is hard + to express in SQL. + +**What they gain:** +- **Zero implementation cost for new SQL features.** DataFusion + optimizer improvements propagate for free. +- **Smaller maintenance surface.** No custom UDLN code. + +**For MR-737:** The cost-benefit calculus tilts toward +`UserDefinedLogicalNode` for an engine that wants competitive graph +performance. lance-graph's approach is fine for a "Cypher on Lance" +convenience layer but not for a graph engine that wants to win on +factorization or custom indices. + +## Entry points to revisit + +- `src/datafusion_planner.rs` — Cypher AST → SQL string translation. +- `src/ast.rs` — Cypher AST node shapes (potential lift). +- `src/parser.rs` — Cypher parser (potential lift for §5.15). +- `src/logical_plan.rs` — internal logical-plan representation. +- `src/simple_executor.rs` — single-table fallback (limited use). + +## What NOT to take from lance-graph + +- **The SQL-string lowering approach.** Lossy for graph cost models; + MR-737 §9.2 already considered and rejected this. +- **The `simple_executor` fallback.** Single-table-only; doesn't + generalize. +- **The DataFusion 50.3 pin.** lance-graph 0.5.4 pins to DF 50.3; + we're on 52.5. We can't drop their code in directly. + +## Decision impact on MR-737 + +- **§5.13 (variable-length expansion):** Validates unroll-to-UNION-ALL + for bounded k, with a 20-hop cap as the right default. Iterative + fallback for k > 20 is OmniGraph-original (lance-graph doesn't have + it). +- **§5.15 / Open Q12 (Cypher frontend):** Cypher parser + AST are + liftable from lance-graph 0.5.4 if we want a Cypher frontend + alongside our `.gq`. Defer to Phase 3. +- **§9.2 (alternative: pure SQL lowering):** lance-graph **is** the + reference for this alternative. The cost we'd pay is the four items + listed above; the benefit is smaller maintenance. The trade-off has + been considered and rejected for MR-737; lance-graph's existence + doesn't change that. + +## Open questions for follow-up + +- **Q3.3.1 — When does the 20-hop cap actually hurt?** Real-world graph + workloads rarely go beyond 5–6 hops. The 20-hop cap is more about + worst-case query latency than common use. **Lower-priority + calibration** — defer to Phase 0+ measurement. +- **Q3.3.2 — Iterative-expansion termination.** For unbounded `*` + patterns, we need a cycle-detection / visited-set strategy. + lance-graph punts on this; Kuzu has it. Lift from Kuzu (§3.1). diff --git a/.context/research/lancedb.md b/.context/research/lancedb.md new file mode 100644 index 0000000..2dc86ec --- /dev/null +++ b/.context/research/lancedb.md @@ -0,0 +1,142 @@ +# 3.2 LanceDB — code-dive + +**Repo:** [`github.com/lancedb/lancedb`](https://github.com/lancedb/lancedb) +(OSS LanceDB, Rust + Python). +**MR-925 §3.2 mapping:** §5.6 (TableProvider integration), §5.10 +(custom index types), §5.12 (mutation-as-IR), §5.1 (vector search as +operator vs UDF). +**Date:** 2026-05-12. + +--- + +## What we extracted + +### TableProvider wrapping a Lance dataset — maps to MR-737 §5.6 + +LanceDB's `Table` type wraps a `lance::Dataset` and implements +DataFusion's `TableProvider` trait. Key files (Rust side): + +- `rust/lancedb/src/query/datafusion_table.rs` — the actual + `TableProvider` impl that hands DataFusion a scan plan. +- `rust/lancedb/src/connection.rs` — `Connection::open_table` returns + the wrapper. + +**Re-usable patterns:** +- The `TableProvider::scan(projection, filters, limit)` method + routes through `Dataset::scan().filter(...).project(...).limit(...)`. + Filter expressions are converted to Lance's + `lance::dataset::scanner::Filter` and pushed at the scan level + (using Lance's Substrait predicate machinery, which speaks DF expressions). +- LanceDB's filter-pushdown logic is identical to what we'd write for + OmniGraph's `__node:Person.lance` → `TableProvider` adapter. Lift + the conversion functions (`expr_to_lance_filter` or equivalent) + rather than reimplementing. +- The `TableProvider` advertises capabilities via + `supports_filter_pushdown`, returning `TableProviderFilterPushDown::Exact` + for predicates Lance can fully handle and `::Inexact`/`::Unsupported` + for the rest. This is the exact pattern §5.6 prescribes. + +### Vector search as a query operator — maps to MR-737 §5.1, §5.10 + +LanceDB's `Table::query(vector)` builds a `VectorQuery` and lowers it +to a Lance scan with a `nearest` filter. **At the DataFusion level**, +the vector-search call is just a `TableProvider::scan(filter=Nearest(...))` +— the operator surface is a regular filter. + +**Re-usable pattern:** Vector search as a filter on the scan, not a +separate operator type. This matches MR-737 §5.1's approach (vector +ANN is a filter, not a top-level operator). LanceDB's +`NearestFilter` representation is the reference shape. + +**Alternative:** LanceDB Enterprise (closed-source) reportedly has a +"segment-aware vector planning surface" per Ragnor 2026-05-02 (cited +in MR-925 §3.2). This is **not in the OSS LanceDB repo** (verified +2026-05-12). The segment-aware planning is a private extension. +**Decision: MR-737 §5.6 segment-aware capability needs to be +specified from scratch; we cannot copy from LanceDB OSS.** + +### Mutation-as-IR — maps to MR-737 §5.12 + +LanceDB does NOT use `UserDefinedLogicalNodeCore` for writes — its +write path goes through `Table::add(...)`, `Table::update(...)`, +`Table::delete(...)`, etc., which are direct method calls that +bypass the DataFusion plan tree entirely. + +**This is a difference from MR-737's design.** MR-737 §5.12 prescribes +that mutations are IR operators (Insert / Update / Delete / Merge) +that wrap read-shaped subplans, going through the same planner. +LanceDB takes the "imperative write API" approach instead. + +**What this means for §5.12:** +- **No direct reference implementation in LanceDB.** The MR-737 §5.12 + design is more ambitious than what LanceDB does today. +- **Mutation IR is original work.** Look to DataFusion's + `DmlExec`/`DmlNode` for the framework primitives. +- The MR-376 spike (`MutationStaging.pending`) is closer to the + right shape than LanceDB's approach. + +### TableProvider capability advertisement — maps to MR-737 §5.6 + +LanceDB's `TableProvider` advertises: +- `supports_filter_pushdown` per-filter via the standard DF API. +- `statistics` (returns sample-derived statistics from the manifest). +- `partitioning` (returns `Partitioning::UnknownPartitioning`). + +**What's missing for graph workloads:** +- **No `Hash([key], N)` partitioning advertisement.** LanceDB doesn't + know about hash-distributed scan layouts because Lance doesn't have + hash-partitioned datasets. +- **No semi-mask / SIP acceptance advertisement.** LanceDB scans + don't accept incoming SIPs from upstream operators. + +**For MR-737:** The §5.6 partitioning and SIP-acceptance advertisements +are **OmniGraph-original additions**. LanceDB is a partial reference; +the rest must be designed fresh. + +## Entry points to revisit + +- `rust/lancedb/src/query.rs` — query DSL → DataFusion logical plan. +- `rust/lancedb/src/query/datafusion_table.rs` — `TableProvider` impl. +- `rust/lancedb/src/index/scalar.rs` / `rust/lancedb/src/index/vector.rs` — + index lifecycle integrations. +- `rust/lancedb/src/connection.rs` — top-level handle, useful as the + shape for our equivalent. + +## What NOT to take from LanceDB + +- **The imperative write API.** MR-737 §5.12 wants mutations to be + IR operators; LanceDB does not. +- **The single-table assumption.** LanceDB exposes individual + tables; OmniGraph coordinates many tables atomically. LanceDB's + `Connection::open_table` is per-table; we have a `__manifest`-driven + graph-wide view. +- **The Python-first API surface.** LanceDB's Rust crate is shaped + to support a Python wrapper. We don't need Python; the Rust API + can be plain ergonomic Rust. + +## Decision impact on MR-737 + +- **§5.6 (TableProvider integration):** LanceDB's filter-pushdown + conversion is directly reusable. Capability advertisement beyond + filters (partitioning, SIP) is OmniGraph-original. +- **§5.10 (custom index types):** LanceDB integrates with Lance's + built-in scalar/vector indices but does **not** add custom index + types from outside the lance crate. Per Experiment 1.2, the plugin + registry is `pub(crate)`. LanceDB is not a reference for solving + this; it sidesteps it. +- **§5.12 (mutation-as-IR):** Mostly NOT reflected in LanceDB. + MR-737's design is more ambitious; we have no production reference + for it. **Surface this in the rollup comment as a §5.12 risk.** +- **§5.1 (vector search):** Validated. Treat vector search as a + scan-level filter, not a top-level operator. + +## Open questions for follow-up + +- **Q3.2.1 — Segment-aware vector planning.** Does LanceDB Enterprise + expose segment-aware planning hooks via a trait we could implement? + Worth asking Lance team directly. Out of scope for the OSS code-dive. +- **Q3.2.2 — `supports_filter_pushdown` for graph predicates.** Our + graph predicates (`(a)-[KNOWS]->(b) AND a.name = ...`) aren't a + single DF filter expression — they're multi-table joins. + LanceDB's pushdown is per-scan; we need cross-scan SIP. **No + LanceDB precedent.** diff --git a/.context/research/trino.md b/.context/research/trino.md new file mode 100644 index 0000000..51e7080 --- /dev/null +++ b/.context/research/trino.md @@ -0,0 +1,148 @@ +# 3.6 Trino — code-dive + +**Repo:** [`github.com/trinodb/trino`](https://github.com/trinodb/trino) +(Java, ~10k★). +**MR-925 §3.6 mapping:** §5.7 (cost-based optimizer at scale), §5.11 +(distributed transaction patterns — informative, NOT prescriptive), +§5.10 (connector SPI). +**Date:** 2026-05-12. + +--- + +## What we extracted + +### Cost-based optimizer (CBO) — maps to MR-737 §5.7 + +Trino has a mature CBO that combines: +- **Histograms** (per-column, from ANALYZE). +- **Cardinality estimates** with selectivity propagation. +- **CPU + memory + network cost model** with weighted scoring. + +The optimizer enumerates join orderings via a **dynamic-programming +approach (DP-sized; bushy trees) for ≤ 8 tables, then falls back to +greedy** for larger queries. + +**Re-usable patterns for MR-737:** +- **Three-component cost model** (CPU, memory, network) is the right + granularity. MR-737 §5.7 currently lists rows-processed + I/O; we + should add a third component for memory (especially relevant for + factorized intermediate representations that can blow up). +- **DP-for-small / greedy-for-large.** The 8-table threshold is a + reasonable default. Most graph queries touch 2–5 tables (one + node-table per variable + one edge-table per relationship), so + DP works. +- **Histograms for selectivity.** Lance's manifest already carries + per-segment min/max + null-count statistics; histograms would be a + follow-up. For graph queries, the *out-degree distribution* is the + most useful histogram (informs neighbor-expansion cardinality). + **Add to §5.7 spec.** + +### Connector SPI — maps to MR-737 §5.10 + +Trino's "Connector SPI" is a Java interface for adding new data +sources (Postgres, MySQL, Hive, Iceberg, …). The interface is **fully +public** — connectors are JARs that get registered at server startup. + +**Re-usable patterns:** +- **Capability metadata** is reported per-connector via + `ConnectorMetadata.getTableLayouts()`. Layouts advertise + partitioning, sort order, and pushdown capability per-table. This + is the same shape MR-737 §5.6 prescribes for OmniGraph's + `TableProvider`. +- **Stable interface vs. mainline churn.** Trino's Connector SPI is + the most-stable public Java surface in the project; breaking + changes are versioned. Compare to DataFusion's + `TableProvider` (less stable, breaks ~once per release). Trino is + the reference for "what a properly-versioned plugin API looks like." + +### Distributed transactions — maps to MR-737 §5.11 (informative) + +Trino is a **stateless query coordinator**: it does NOT manage +transactions itself. Each connector's underlying system manages its +own transactions. Trino's "transaction" is just a session-scoped +identifier that lets multiple queries see the same snapshot of each +connected system. + +**Why this matters:** Trino's approach is the opposite of MR-737 +§5.11 (per-table txn branches with cross-table coordination). Trino +punts on the cross-system coordination problem; MR-737 doesn't. + +**Re-usable pattern:** None directly. Trino is the **anti-reference** +— what NOT to do if you want cross-table ACID. MR-737's design is +strictly more ambitious. + +### Plan rewriter framework — maps to MR-737 §5.7 + +Trino's optimizer is a stack of `Rule` rewrites applied in +sequence (top-down + bottom-up passes). Each rule has a pattern +(matches a plan-tree shape) and a rewrite (produces a replacement). +This is the same pattern DataFusion uses with +`OptimizerRule` / `PhysicalOptimizerRule`. + +**Re-usable patterns:** +- **Rule ordering matters.** Trino has hard-coded sequence of + ~50 rules; ordering is significant. DataFusion's defaults are + fine for SQL; we'll need to insert graph-specific rules + (`PushPredicateIntoExpand`, `FactorizeIntermediate`, etc.) at + specific positions. +- **Rule namespace.** Trino prefixes rules by area + (`io.trino.sql.planner.iterative.rule.*`). DataFusion uses a flat + namespace. **Cosmetic; defer.** + +### Dynamic filters (Trino's analog of DF's DFP) — maps to MR-737 §5.6 + +Trino's **dynamic filtering** feature pushes build-side join filters +to probe-side scans. Implementation: +- Build side computes a Bloom filter (or InList for small sets). +- Coordinator broadcasts the filter to all probe-side workers. +- Probe-side scans apply the filter at the table-scan level. + +This is structurally identical to DataFusion 52.5's +`DynamicFilterPhysicalExpr` (§1.5). **Validates the design.** + +**Difference:** Trino uses Bloom filters by default; DF uses InList +(< 128MB) or HashTable (≥ 128MB). Our §1.4 finding (roaring beats +Bloom on bits/elem for clustered row IDs) suggests roaring would be +the right addition. Trino does NOT have roaring as a dynamic-filter +encoding; we'd be ahead of Trino on this dimension. + +## Entry points to revisit + +- `core/trino-main/src/main/java/io/trino/sql/planner/iterative/rule/` — + rule-based optimizer (40+ rules). +- `core/trino-spi/src/main/java/io/trino/spi/connector/` — Connector + SPI (the model for our `TableProvider` surface). +- `core/trino-main/src/main/java/io/trino/cost/` — cost estimation. +- `core/trino-main/src/main/java/io/trino/operator/join/` — dynamic + filtering implementation. + +## What NOT to take from Trino + +- **The Java language.** Different ecosystem; ports are awkward. +- **The coordinator-worker split.** Trino is a distributed system; + OmniGraph is currently in-process (Phase 0–2). The distributed + patterns become relevant only at Phase 3+. +- **The stateless-coordinator approach to transactions.** MR-737 + §5.11 explicitly wants stateful per-table txn branches. + +## Decision impact on MR-737 + +- **§5.7 (cost model):** Add memory cost as third component; + out-degree histograms as the most-useful additional statistic. +- **§5.10 (connector SPI / plugin):** Trino is the gold standard for + a stable, versioned plugin API. Cite when designing OmniGraph's + custom-index-type registration spec. +- **§5.11 (transactions):** Trino is anti-reference; MR-737 is + strictly more ambitious. Document this. +- **§5.6 (dynamic filters):** Validates the §1.5 design; we'd be + ahead of Trino with roaring-encoded dynamic filters. + +## Open questions for follow-up + +- **Q3.6.1 — Out-degree histogram representation.** Per-node-type + histograms of out-degree by edge type are O(node-types × + edge-types) histograms. Storage cost? Defer; calibrate during + §5.7 spec. +- **Q3.6.2 — DP-vs-greedy threshold.** Is 8 tables the right + threshold for graph queries (which tend to have fewer tables but + more joins per table)? Defer; calibrate during §5.7 spec.