diff --git a/docs/dev/datafusion-future-improvements.md b/docs/dev/datafusion-future-improvements.md new file mode 100644 index 0000000..d8d79c8 --- /dev/null +++ b/docs/dev/datafusion-future-improvements.md @@ -0,0 +1,119 @@ +# DataFusion: current state + future improvements + +**Audience:** contributors thinking about query-execution performance, +predicate pushdown, or planner work. +**Companion:** [lance.md](lance.md) for upstream Lance pages and version +state; [invariants.md](invariants.md) for the relevant deny-list items; +[execution.md](execution.md) for query-execution context. + +This file tracks DataFusion-related improvements that have landed in +OmniGraph, the passive wins active by virtue of the current DF version, +and the structural changes still on the table. Update it whenever we ship +a DF-related change or a DF upstream release changes the picture. + +**Current pin:** DataFusion 53.1.0 (workspace dep `datafusion = "53"`, +default-features = false, features = `["nested_expressions"]`). Pulled +in transitively by Lance 6.0.1; our direct touchpoints are narrow. + +## Direct touchpoints in our code + +We have only two direct DataFusion consumers; everything else is +transitive through Lance. + +| Site | Role | State | +|---|---|---| +| `crates/omnigraph/src/exec/query.rs::build_lance_filter_expr` (and helpers `ir_filter_to_expr`, `ir_expr_to_expr`, `literal_to_expr`) | Lower typed IR filters to a DataFusion `Expr` and apply via `Scanner::filter_expr` | **Structured** (PR #113) | +| `crates/omnigraph/src/table_store.rs::scan_pending_batches` | Run SQL against an in-memory `MemTable` registered in a fresh `SessionContext` to filter the in-flight `MutationStaging.pending` batches | String SQL — small enough that the migration cost outweighs the benefit; out of scope | + +We have **no custom `impl ExecutionPlan`**, no exhaustive `match` on +`ScalarValue`, no direct `tantivy`/`tokenizer` imports. Three classes of +DF 53 breaking changes therefore do not reach us: +`Arc` wrapping (#19893), removed `ExecutionPlan::statistics()` +(#20319), the `ScalarValue::RunEndEncoded` variant (#19895). + +## Shipped + +| | Where | What it bought us | +|---|---|---| +| DataFusion 52 → 53 bump | PR #111 | Substrate baseline. Every passive DF 53 optimizer/perf win activated automatically. | +| `nested_expressions` feature enabled | PR #113 | Made `datafusion::functions_nested::expr_fn::array_has` (and the rest of the nested-type expr-fns) reachable from our code. | +| `execute_node_scan` → `Scanner::filter_expr(Expr)` | PR #113 | Killed string-flattened pushdown on the bulk of the read path. **`CompOp::Contains` now pushes down** (via `array_has`) — previously returned `None` from `ir_filter_to_sql` and fell through to in-memory post-scan filtering. DF 53 optimizer rules now act on our predicates instead of being short-circuited by the string SQL detour. | + +## Passive wins active on DF 53 + +These activated automatically when PR #111 landed. They apply to any +predicate / plan that reaches DataFusion (now including our +`execute_node_scan` filters via the structured Expr path): + +| DF PR | Win | Where it bites us | +|---|---|---| +| [#20528](https://github.com/apache/datafusion/pull/20528) | Vectorized `IN`-list eq kernel | `id IN (…)` predicates in cascade-delete (`exec/merge.rs:1016`) and the structured Expr path | +| [#20111](https://github.com/apache/datafusion/pull/20111) | `PhysicalExprSimplifier` constant-folds before exec | All predicates handed to Lance via `Scanner::filter_expr` | +| [#20097](https://github.com/apache/datafusion/pull/20097) | `CASE WHEN x THEN y ELSE NULL` shortcut | Any generated CASE expressions in our predicates | +| [#20228](https://github.com/apache/datafusion/pull/20228) | Push limit into hash join | Anti-join (`not { … }`) lowered to `JoinType::LeftAnti` with a query-level `LIMIT N` | +| [#19918](https://github.com/apache/datafusion/pull/19918) | `HashJoinExec::try_new` `null_aware` flag | Correct `NOT IN` semantics when our anti-join involves nullable columns | +| [#19625](https://github.com/apache/datafusion/pull/19625) | Optimize `Nullstate` / accumulators | `count` / `sum` / `avg` / `min` / `max` aggregations | +| [#19910](https://github.com/apache/datafusion/pull/19910) | Misc hash / hash-aggregation perf | Same — hash-aggregation hot paths | + +## Still on the table + +Ranked by leverage. Update when one ships. + +### Tier 1 — structural, unblocked today + +| Item | Effort | Notes | +|---|---|---| +| **`hydrate_nodes` (Expand-time pushdown) → `Expr`** | Medium (~2 days) | The Expand pipeline at `exec/query.rs:771-796` still serializes through `hydrate_nodes`'s `extra_filter_sql: Option<&str>` parameter. Migrating it pushes structured pushdown into `TableStorage::scan_stream(filter: Option<&str>)` → `Option`, which cascades through 6+ call sites (`scan_stream_with`, `count_rows`, `count_rows_with_staged`). Largest remaining tech-debt slice on the structured-Expr refactor. | + +### Tier 2 — gated on Lance v7 + +| Item | Trigger | Notes | +|---|---|---| +| **Mutation delete predicate → `Expr`** via `DeleteBuilder::execute_uncommitted` (Lance [#6658](https://github.com/lance-format/lance/issues/6658)) | Lance v7.x bump | Issue closed 2026-05-14, but the public API first ships in `v7.0.0-beta.10`, not v6.x. Couples with **MR-A** (delete two-phase migration — tracked at [issue #112](https://github.com/ModernRelay/omnigraph/issues/112)). The DF Expr move at this site is half the work; the rest is retiring the parse-time D₂ rule and extending recovery sidecar coverage. | +| **`DeleteBuilder::from_expr(...)`** (Lance #6343, v5.0) | Same | The structured Expr variant of the inline delete path. Useful only while the inline `delete_where` residual still exists; supplanted by the staged form above once MR-A lands. | + +### Tier 3 — future-shape (require owning more of the planner) + +| Item | DF PR | Notes | +|---|---|---| +| Extension planner for `TableScan` | [#20548](https://github.com/apache/datafusion/pull/20548) | Would let us plug a custom planner converting `IROp::NodeScan` directly into a DF logical plan, bypassing the Lance string-SQL detour entirely. Big change. Only worth it if we own more of the physical plan. | +| `ExpressionPlacement` enum | [#20065](https://github.com/apache/datafusion/pull/20065) | Optimizer-hint enum letting the planner decide whether an expression evaluates in scan / filter / projection. Relevant only if we own optimizer rules. We don't. | +| `ExtractLeafExpressions` optimizer rule for `get_field` pushdown | [#20117](https://github.com/apache/datafusion/pull/20117) | Applies automatically when we use struct projections in pushdown. We don't generate them today. | +| `feat: support Set Comparison Subquery` | [#19109](https://github.com/apache/datafusion/pull/19109) | New subquery shape. Not relevant today — we don't lower to subqueries. | +| `OuterReferenceColumn` non-adjacent outer relations | [#19930](https://github.com/apache/datafusion/pull/19930) | Grandparent-scope subqueries. Future-shape unlock. | +| `AggregateMode::PartialReduce` (tree-reduce aggregation) | [#20019](https://github.com/apache/datafusion/pull/20019) | Aggregation perf opt for very wide partitions. Could opt in; modest gain. | +| `Pushdown filters through UnionExec` | [#20145](https://github.com/apache/datafusion/pull/20145) | Applies automatically when planning multi-fragment unions. Future-shape for graph traversal. | + +### Tier 4 — won't reach us without major changes + +| Item | DF PR | Why it doesn't bite | +|---|---|---| +| `Wrap immutable plan parts into Arc` | #19893 | We have no custom `impl ExecutionPlan`. | +| `Cache PlanProperties, fast-path with_new_children` | #19792 | Same. | +| `Remove the statistics() api` | #20319 | Same. | +| `feat: support limited deletion` | #20137 | DF-native LIMIT-on-DELETE. Lance owns delete in our stack; orthogonal. | + +## Upstream cadence + +We don't choose our DataFusion version directly — Lance does. Lance 6.0.1 +pins DF 53. Lance 7.0.0-rc.1 (2026-05-21) is on DF 53. Lance 7.x or 8.x +may pick up DF 54 / 55; when that happens, refresh this doc with a new +"Passive wins" row and a fresh upgrade audit. + +DataFusion 54.0.0 has shipped (per the upstream upgrade-guide index). +Anything in 54 that would actively bite us when Lance picks it up is +worth surfacing here as a heads-up; right now there's no urgency. + +## Maintenance + +- Bump the **Current pin** stanza on every Lance bump (Lance dictates + the DF version). +- When a Tier 1 / Tier 2 item ships, move it to **Shipped** with a PR + link and a one-line description of what it bought. +- When DF ships a new major, add a row to **Passive wins active on DF + N** with a one-line note. Track Tier 3 / Tier 4 items only if a + follow-up MR mentions them or invalidates the categorization. +- Don't track every passive perf improvement — only the ones that + measurably touch our query/exec/mutation paths. +- This doc is a snapshot; for in-depth Lance-side context see + [lance.md](lance.md) and its periodic alignment audits. diff --git a/docs/dev/index.md b/docs/dev/index.md index 504c277..26339b8 100644 --- a/docs/dev/index.md +++ b/docs/dev/index.md @@ -23,6 +23,7 @@ constraints. User-facing behavior should still be documented through | On-disk layout, manifest schema, URI behavior | [storage.md](../user/storage.md) | | Direct-publish writes, D2, staged writes, recovery sidecars | [runs.md](runs.md) | | Query execution, mutation execution, loader flow | [execution.md](execution.md) | +| DataFusion: current state, passive wins, future improvements | [datafusion-future-improvements.md](datafusion-future-improvements.md) | | Index lifecycle and graph topology indexes | [indexes.md](../user/indexes.md) | | Branch and commit internals | [branches-commits.md](../user/branches-commits.md) | | Three-way merge implementation and conflicts | [merge.md](merge.md) |