mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-09 01:35:18 +02:00
docs: add datafusion-future-improvements.md to dev docs
Captures the post-PR #111 (Lance 4→6) + PR #113 (structured Expr pushdown) DataFusion state in one place, so future maintainers don't have to re-derive what's done, what's free, and what's still on the table from chat history. Structure: - Direct touchpoints (only 2 — narrow surface) - Shipped: PR-by-PR delta of what's landed - Passive wins active on DF 53 (PR-linked, with where-it-bites-us notes) - Still on the table, ranked by tier: - T1: structural, unblocked today (hydrate_nodes Expr pushdown) - T2: gated on Lance v7 (delete Expr via MR-A / issue #112) - T3: future-shape unlocks (extension planner, expression placement, etc.) - T4: won't reach us without major changes (custom ExecutionPlan territory) - Upstream cadence note (Lance dictates the DF version) - Maintenance section Linked from docs/dev/index.md so the check-agents-md CI guard passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
cb80fa40f1
commit
8393fd7946
2 changed files with 120 additions and 0 deletions
119
docs/dev/datafusion-future-improvements.md
Normal file
119
docs/dev/datafusion-future-improvements.md
Normal file
|
|
@ -0,0 +1,119 @@
|
|||
# DataFusion: current state + future improvements
|
||||
|
||||
**Audience:** contributors thinking about query-execution performance,
|
||||
predicate pushdown, or planner work.
|
||||
**Companion:** [lance.md](lance.md) for upstream Lance pages and version
|
||||
state; [invariants.md](invariants.md) for the relevant deny-list items;
|
||||
[execution.md](execution.md) for query-execution context.
|
||||
|
||||
This file tracks DataFusion-related improvements that have landed in
|
||||
OmniGraph, the passive wins active by virtue of the current DF version,
|
||||
and the structural changes still on the table. Update it whenever we ship
|
||||
a DF-related change or a DF upstream release changes the picture.
|
||||
|
||||
**Current pin:** DataFusion 53.1.0 (workspace dep `datafusion = "53"`,
|
||||
default-features = false, features = `["nested_expressions"]`). Pulled
|
||||
in transitively by Lance 6.0.1; our direct touchpoints are narrow.
|
||||
|
||||
## Direct touchpoints in our code
|
||||
|
||||
We have only two direct DataFusion consumers; everything else is
|
||||
transitive through Lance.
|
||||
|
||||
| Site | Role | State |
|
||||
|---|---|---|
|
||||
| `crates/omnigraph/src/exec/query.rs::build_lance_filter_expr` (and helpers `ir_filter_to_expr`, `ir_expr_to_expr`, `literal_to_expr`) | Lower typed IR filters to a DataFusion `Expr` and apply via `Scanner::filter_expr` | **Structured** (PR #113) |
|
||||
| `crates/omnigraph/src/table_store.rs::scan_pending_batches` | Run SQL against an in-memory `MemTable` registered in a fresh `SessionContext` to filter the in-flight `MutationStaging.pending` batches | String SQL — small enough that the migration cost outweighs the benefit; out of scope |
|
||||
|
||||
We have **no custom `impl ExecutionPlan`**, no exhaustive `match` on
|
||||
`ScalarValue`, no direct `tantivy`/`tokenizer` imports. Three classes of
|
||||
DF 53 breaking changes therefore do not reach us:
|
||||
`Arc<PlanProperties>` wrapping (#19893), removed `ExecutionPlan::statistics()`
|
||||
(#20319), the `ScalarValue::RunEndEncoded` variant (#19895).
|
||||
|
||||
## Shipped
|
||||
|
||||
| | Where | What it bought us |
|
||||
|---|---|---|
|
||||
| DataFusion 52 → 53 bump | PR #111 | Substrate baseline. Every passive DF 53 optimizer/perf win activated automatically. |
|
||||
| `nested_expressions` feature enabled | PR #113 | Made `datafusion::functions_nested::expr_fn::array_has` (and the rest of the nested-type expr-fns) reachable from our code. |
|
||||
| `execute_node_scan` → `Scanner::filter_expr(Expr)` | PR #113 | Killed string-flattened pushdown on the bulk of the read path. **`CompOp::Contains` now pushes down** (via `array_has`) — previously returned `None` from `ir_filter_to_sql` and fell through to in-memory post-scan filtering. DF 53 optimizer rules now act on our predicates instead of being short-circuited by the string SQL detour. |
|
||||
|
||||
## Passive wins active on DF 53
|
||||
|
||||
These activated automatically when PR #111 landed. They apply to any
|
||||
predicate / plan that reaches DataFusion (now including our
|
||||
`execute_node_scan` filters via the structured Expr path):
|
||||
|
||||
| DF PR | Win | Where it bites us |
|
||||
|---|---|---|
|
||||
| [#20528](https://github.com/apache/datafusion/pull/20528) | Vectorized `IN`-list eq kernel | `id IN (…)` predicates in cascade-delete (`exec/merge.rs:1016`) and the structured Expr path |
|
||||
| [#20111](https://github.com/apache/datafusion/pull/20111) | `PhysicalExprSimplifier` constant-folds before exec | All predicates handed to Lance via `Scanner::filter_expr` |
|
||||
| [#20097](https://github.com/apache/datafusion/pull/20097) | `CASE WHEN x THEN y ELSE NULL` shortcut | Any generated CASE expressions in our predicates |
|
||||
| [#20228](https://github.com/apache/datafusion/pull/20228) | Push limit into hash join | Anti-join (`not { … }`) lowered to `JoinType::LeftAnti` with a query-level `LIMIT N` |
|
||||
| [#19918](https://github.com/apache/datafusion/pull/19918) | `HashJoinExec::try_new` `null_aware` flag | Correct `NOT IN` semantics when our anti-join involves nullable columns |
|
||||
| [#19625](https://github.com/apache/datafusion/pull/19625) | Optimize `Nullstate` / accumulators | `count` / `sum` / `avg` / `min` / `max` aggregations |
|
||||
| [#19910](https://github.com/apache/datafusion/pull/19910) | Misc hash / hash-aggregation perf | Same — hash-aggregation hot paths |
|
||||
|
||||
## Still on the table
|
||||
|
||||
Ranked by leverage. Update when one ships.
|
||||
|
||||
### Tier 1 — structural, unblocked today
|
||||
|
||||
| Item | Effort | Notes |
|
||||
|---|---|---|
|
||||
| **`hydrate_nodes` (Expand-time pushdown) → `Expr`** | Medium (~2 days) | The Expand pipeline at `exec/query.rs:771-796` still serializes through `hydrate_nodes`'s `extra_filter_sql: Option<&str>` parameter. Migrating it pushes structured pushdown into `TableStorage::scan_stream(filter: Option<&str>)` → `Option<Expr>`, which cascades through 6+ call sites (`scan_stream_with`, `count_rows`, `count_rows_with_staged`). Largest remaining tech-debt slice on the structured-Expr refactor. |
|
||||
|
||||
### Tier 2 — gated on Lance v7
|
||||
|
||||
| Item | Trigger | Notes |
|
||||
|---|---|---|
|
||||
| **Mutation delete predicate → `Expr`** via `DeleteBuilder::execute_uncommitted` (Lance [#6658](https://github.com/lance-format/lance/issues/6658)) | Lance v7.x bump | Issue closed 2026-05-14, but the public API first ships in `v7.0.0-beta.10`, not v6.x. Couples with **MR-A** (delete two-phase migration — tracked at [issue #112](https://github.com/ModernRelay/omnigraph/issues/112)). The DF Expr move at this site is half the work; the rest is retiring the parse-time D₂ rule and extending recovery sidecar coverage. |
|
||||
| **`DeleteBuilder::from_expr(...)`** (Lance #6343, v5.0) | Same | The structured Expr variant of the inline delete path. Useful only while the inline `delete_where` residual still exists; supplanted by the staged form above once MR-A lands. |
|
||||
|
||||
### Tier 3 — future-shape (require owning more of the planner)
|
||||
|
||||
| Item | DF PR | Notes |
|
||||
|---|---|---|
|
||||
| Extension planner for `TableScan` | [#20548](https://github.com/apache/datafusion/pull/20548) | Would let us plug a custom planner converting `IROp::NodeScan` directly into a DF logical plan, bypassing the Lance string-SQL detour entirely. Big change. Only worth it if we own more of the physical plan. |
|
||||
| `ExpressionPlacement` enum | [#20065](https://github.com/apache/datafusion/pull/20065) | Optimizer-hint enum letting the planner decide whether an expression evaluates in scan / filter / projection. Relevant only if we own optimizer rules. We don't. |
|
||||
| `ExtractLeafExpressions` optimizer rule for `get_field` pushdown | [#20117](https://github.com/apache/datafusion/pull/20117) | Applies automatically when we use struct projections in pushdown. We don't generate them today. |
|
||||
| `feat: support Set Comparison Subquery` | [#19109](https://github.com/apache/datafusion/pull/19109) | New subquery shape. Not relevant today — we don't lower to subqueries. |
|
||||
| `OuterReferenceColumn` non-adjacent outer relations | [#19930](https://github.com/apache/datafusion/pull/19930) | Grandparent-scope subqueries. Future-shape unlock. |
|
||||
| `AggregateMode::PartialReduce` (tree-reduce aggregation) | [#20019](https://github.com/apache/datafusion/pull/20019) | Aggregation perf opt for very wide partitions. Could opt in; modest gain. |
|
||||
| `Pushdown filters through UnionExec` | [#20145](https://github.com/apache/datafusion/pull/20145) | Applies automatically when planning multi-fragment unions. Future-shape for graph traversal. |
|
||||
|
||||
### Tier 4 — won't reach us without major changes
|
||||
|
||||
| Item | DF PR | Why it doesn't bite |
|
||||
|---|---|---|
|
||||
| `Wrap immutable plan parts into Arc` | #19893 | We have no custom `impl ExecutionPlan`. |
|
||||
| `Cache PlanProperties, fast-path with_new_children` | #19792 | Same. |
|
||||
| `Remove the statistics() api` | #20319 | Same. |
|
||||
| `feat: support limited deletion` | #20137 | DF-native LIMIT-on-DELETE. Lance owns delete in our stack; orthogonal. |
|
||||
|
||||
## Upstream cadence
|
||||
|
||||
We don't choose our DataFusion version directly — Lance does. Lance 6.0.1
|
||||
pins DF 53. Lance 7.0.0-rc.1 (2026-05-21) is on DF 53. Lance 7.x or 8.x
|
||||
may pick up DF 54 / 55; when that happens, refresh this doc with a new
|
||||
"Passive wins" row and a fresh upgrade audit.
|
||||
|
||||
DataFusion 54.0.0 has shipped (per the upstream upgrade-guide index).
|
||||
Anything in 54 that would actively bite us when Lance picks it up is
|
||||
worth surfacing here as a heads-up; right now there's no urgency.
|
||||
|
||||
## Maintenance
|
||||
|
||||
- Bump the **Current pin** stanza on every Lance bump (Lance dictates
|
||||
the DF version).
|
||||
- When a Tier 1 / Tier 2 item ships, move it to **Shipped** with a PR
|
||||
link and a one-line description of what it bought.
|
||||
- When DF ships a new major, add a row to **Passive wins active on DF
|
||||
N** with a one-line note. Track Tier 3 / Tier 4 items only if a
|
||||
follow-up MR mentions them or invalidates the categorization.
|
||||
- Don't track every passive perf improvement — only the ones that
|
||||
measurably touch our query/exec/mutation paths.
|
||||
- This doc is a snapshot; for in-depth Lance-side context see
|
||||
[lance.md](lance.md) and its periodic alignment audits.
|
||||
|
|
@ -23,6 +23,7 @@ constraints. User-facing behavior should still be documented through
|
|||
| On-disk layout, manifest schema, URI behavior | [storage.md](../user/storage.md) |
|
||||
| Direct-publish writes, D2, staged writes, recovery sidecars | [runs.md](runs.md) |
|
||||
| Query execution, mutation execution, loader flow | [execution.md](execution.md) |
|
||||
| DataFusion: current state, passive wins, future improvements | [datafusion-future-improvements.md](datafusion-future-improvements.md) |
|
||||
| Index lifecycle and graph topology indexes | [indexes.md](../user/indexes.md) |
|
||||
| Branch and commit internals | [branches-commits.md](../user/branches-commits.md) |
|
||||
| Three-way merge implementation and conflicts | [merge.md](merge.md) |
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue