omnigraph/docs/dev/datafusion-future-improvements.md
aaltshuler e57087636d docs(datafusion): reflect Lance 7.0.0 stable (still DF ^53)
Lance 7.0.0 shipped stable 2026-05-28 and still pins datafusion = "^53"
/ arrow = "^58" (verified against the published 7.0.0 dependency
manifest), so the pending 6.0.1 -> 7.0.0 bump is not a DataFusion bump:
the "Passive wins" table is unchanged.

- Current-pin stanza: note 7.0.0 is available upstream and holds DF ^53.
- Tier 2: the delete-Expr item's upstream gate (execute_uncommitted,
  lance#6658) is now satisfied (in 7.0.0 stable); reframe the trigger as
  our own 6->7 bump rather than waiting on a Lance release.
- Upstream cadence: correct the pre-release speculation — 7.0.0 stayed on
  DF 53; a DF 54/55 jump is deferred to a later Lance.
- Drop the brittle exec/query.rs:771-796 line range (drifted; hydrate_nodes
  is at 863 on main) in favor of the stable function name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 19:36:22 +03:00

9.4 KiB

DataFusion: current state + future improvements

Audience: contributors thinking about query-execution performance, predicate pushdown, or planner work. Companion: lance.md for upstream Lance pages and version state; invariants.md for the relevant deny-list items; execution.md for query-execution context.

This file tracks DataFusion-related improvements that have landed in OmniGraph, the passive wins active by virtue of the current DF version, and the structural changes still on the table. Update it whenever we ship a DF-related change or a DF upstream release changes the picture.

Current pin: DataFusion 53.1.0 (workspace dep datafusion = "53", default-features = false, features = ["nested_expressions"]). Pulled in transitively by Lance 6.0.1; our direct touchpoints are narrow. Lance 7.0.0 (stable, 2026-05-28) is now available upstream and still pins datafusion = "^53" / arrow = "^58" — so the pending 6.0.1 → 7.0.0 bump is not a DataFusion bump and leaves everything under "Passive wins" below unchanged. See Upstream cadence.

Direct touchpoints in our code

We have only two direct DataFusion consumers; everything else is transitive through Lance.

Site Role State
crates/omnigraph/src/exec/query.rs::build_lance_filter_expr (and helpers ir_filter_to_expr, ir_expr_to_expr, literal_to_expr) Lower typed IR filters to a DataFusion Expr and apply via Scanner::filter_expr Structured (PR #113)
crates/omnigraph/src/table_store.rs::scan_pending_batches Run SQL against an in-memory MemTable registered in a fresh SessionContext to filter the in-flight MutationStaging.pending batches String SQL — small enough that the migration cost outweighs the benefit; out of scope

We have no custom impl ExecutionPlan, no exhaustive match on ScalarValue, no direct tantivy/tokenizer imports. Three classes of DF 53 breaking changes therefore do not reach us: Arc<PlanProperties> wrapping (#19893), removed ExecutionPlan::statistics() (#20319), the ScalarValue::RunEndEncoded variant (#19895).

Shipped

Where What it bought us
DataFusion 52 → 53 bump PR #111 Substrate baseline. Every passive DF 53 optimizer/perf win activated automatically.
nested_expressions feature enabled PR #113 Made datafusion::functions_nested::expr_fn::array_has (and the rest of the nested-type expr-fns) reachable from our code.
execute_node_scanScanner::filter_expr(Expr) PR #113 Killed string-flattened pushdown on the bulk of the read path. CompOp::Contains now pushes down (via array_has) — previously returned None from ir_filter_to_sql and fell through to in-memory post-scan filtering. DF 53 optimizer rules now act on our predicates instead of being short-circuited by the string SQL detour.

Passive wins active on DF 53

These activated automatically when PR #111 landed. They apply to any predicate / plan that reaches DataFusion (now including our execute_node_scan filters via the structured Expr path):

DF PR Win Where it bites us
#20528 Vectorized IN-list eq kernel id IN (…) predicates in cascade-delete (exec/merge.rs:1016) and the structured Expr path
#20111 PhysicalExprSimplifier constant-folds before exec All predicates handed to Lance via Scanner::filter_expr
#20097 CASE WHEN x THEN y ELSE NULL shortcut Any generated CASE expressions in our predicates
#20228 Push limit into hash join Anti-join (not { … }) lowered to JoinType::LeftAnti with a query-level LIMIT N
#19918 HashJoinExec::try_new null_aware flag Correct NOT IN semantics when our anti-join involves nullable columns
#19625 Optimize Nullstate / accumulators count / sum / avg / min / max aggregations
#19910 Misc hash / hash-aggregation perf Same — hash-aggregation hot paths

Still on the table

Ranked by leverage. Update when one ships.

Tier 1 — structural, unblocked today

Item Effort Notes
hydrate_nodes (Expand-time pushdown) → Expr Medium (~2 days) The Expand pipeline (exec/query.rs::hydrate_nodes) still serializes through its extra_filter_sql: Option<&str> parameter. Migrating it pushes structured pushdown into TableStorage::scan_stream(filter: Option<&str>)Option<Expr>, which cascades through 6+ call sites (scan_stream_with, count_rows, count_rows_with_staged). Largest remaining tech-debt slice on the structured-Expr refactor.

Tier 2 — upstream-unblocked; gated on our Lance 6→7 bump

Item Trigger Notes
Mutation delete predicate → Expr via DeleteBuilder::execute_uncommitted (Lance #6658) Our 6.0.1 → 7.0.0 bump Upstream gate now satisfied: the API shipped in v7.0.0-beta.10 and is in Lance 7.0.0 stable (2026-05-28). The only remaining gate is the repo's own Lance bump (still pinned 6.0.1). Couples with MR-A (delete two-phase migration — tracked at issue #112). The DF Expr move at this site is half the work; the rest is retiring the parse-time D₂ rule and extending recovery sidecar coverage.
DeleteBuilder::from_expr(...) (Lance #6343, v5.0) Same The structured Expr variant of the inline delete path. Useful only while the inline delete_where residual still exists; supplanted by the staged form above once MR-A lands.

Tier 3 — future-shape (require owning more of the planner)

Item DF PR Notes
Extension planner for TableScan #20548 Would let us plug a custom planner converting IROp::NodeScan directly into a DF logical plan, bypassing the Lance string-SQL detour entirely. Big change. Only worth it if we own more of the physical plan.
ExpressionPlacement enum #20065 Optimizer-hint enum letting the planner decide whether an expression evaluates in scan / filter / projection. Relevant only if we own optimizer rules. We don't.
ExtractLeafExpressions optimizer rule for get_field pushdown #20117 Applies automatically when we use struct projections in pushdown. We don't generate them today.
feat: support Set Comparison Subquery #19109 New subquery shape. Not relevant today — we don't lower to subqueries.
OuterReferenceColumn non-adjacent outer relations #19930 Grandparent-scope subqueries. Future-shape unlock.
AggregateMode::PartialReduce (tree-reduce aggregation) #20019 Aggregation perf opt for very wide partitions. Could opt in; modest gain.
Pushdown filters through UnionExec #20145 Applies automatically when planning multi-fragment unions. Future-shape for graph traversal.

Tier 4 — won't reach us without major changes

Item DF PR Why it doesn't bite
Wrap immutable plan parts into Arc #19893 We have no custom impl ExecutionPlan.
Cache PlanProperties, fast-path with_new_children #19792 Same.
Remove the statistics() api #20319 Same.
feat: support limited deletion #20137 DF-native LIMIT-on-DELETE. Lance owns delete in our stack; orthogonal.

Upstream cadence

We don't choose our DataFusion version directly — Lance does. Lance 6.0.1 pins DF ^53. Lance 7.0.0 (stable, 2026-05-28) also pins datafusion = "^53" / arrow = "^58" — confirmed against the published 7.0.0 dependency manifest. So the 6.0.1 → 7.0.0 bump carries DataFusion forward unchanged: nothing under "Passive wins" moves, and the only DF-doc delta from that bump is the Tier 2 delete-Expr item un-gating (above). A DF 54 / 55 jump will arrive with a later Lance (8.x or beyond); when it does, refresh this doc with a new "Passive wins" row and a fresh upgrade audit.

DataFusion 54.0.0 has shipped upstream (per the upgrade-guide index) but is not in our stack — Lance has not picked it up as of 7.0.0. Treat anything in 54 as a heads-up only, and verify Lance's DF pin before acting; right now there's no urgency.

Maintenance

  • Bump the Current pin stanza on every Lance bump (Lance dictates the DF version).
  • When a Tier 1 / Tier 2 item ships, move it to Shipped with a PR link and a one-line description of what it bought.
  • When DF ships a new major, add a row to Passive wins active on DF N with a one-line note. Track Tier 3 / Tier 4 items only if a follow-up MR mentions them or invalidates the categorization.
  • Don't track every passive perf improvement — only the ones that measurably touch our query/exec/mutation paths.
  • This doc is a snapshot; for in-depth Lance-side context see lance.md and its periodic alignment audits.