mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-09 01:35:18 +02:00
MR-925: validation-prototypes scaffolding + exp 1.1 + exp 1.2
- exclude validation-prototypes/ and merge-insert-cas-repro from the main
workspace so the nested cargo workspace can use its own pin set
- add validation-prototypes/{factorized-batches,custom-lance-index}/
scratch crates (never merged to main; long-lived branch only)
- exp 1.1 — factorized batches through DataFusion ops: writeup at
.context/experiments/factorized-batches.md (5 cells × 8 ops; all
scalar-keyed ops accept List<UInt64> input, UNNEST via CROSS JOIN
fails in DF 52.5)
- exp 1.2 — custom Lance index plugin from outside lance: writeup at
.context/experiments/custom-lance-index.md (5 probes; transaction
surface is open, SCALAR_INDEX_PLUGIN_REGISTRY is closed → hard
blocker for MR-737 §5.4; recommends upstream path or external-index
path)
This commit is contained in:
parent
c9c7c0672e
commit
02c4b45c85
12 changed files with 8033 additions and 0 deletions
238
.context/experiments/custom-lance-index.md
Normal file
@ -0,0 +1,238 @@
# Experiment 1.2 — Custom Lance index plugin from outside the lance crate

**Ticket:** MR-925 §1.2 (validates MR-737 §5.4, §5.5).
**Prototype:** `validation-prototypes/custom-lance-index/` (long-lived branch).
**Substrate pin:** Lance 4.0.1 (matched by cargo to 4.0.0 spec). Lance 4.0.1 internally pulls roaring 0.11 and prost-types 0.14; the workspace deps were lifted to match.
**Date:** 2026-05-12.

---

## Hypothesis

A graph engine running on top of Lance can ship a custom index type
(e.g. a neighbor-set adjacency index) from a third-party crate, by:

1. constructing an `IndexMetadata` row with a custom `index_details: Any`,
2. committing it via the transaction API (`Operation::CreateIndex`),
3. having Lance round-trip it through the manifest unchanged, and
4. having the Lance scanner dispatch filter pushdown to our plugin.

§5.4 of MR-737 currently leaves (4) as an open question — this experiment
turns the answer into evidence.

## Method

`custom-lance-index/` builds a tiny Lance dataset (`(key: UInt64, payload:
Utf8)`, 1000 rows in fragment 0), then runs five probes against the public
surface of `lance = 4.0.1`:

| Probe | What is exercised |
|-------|-------------------|
| **P1** Construct + commit | Build an `IndexMetadata` with a custom `index_details.type_url = "omnigraph.v0.NeighborIndexDetails"` and commit it with `Dataset::commit(..., Operation::CreateIndex { new_indices, removed_indices }, ...)`. |
| **P2** Load round-trip | Reopen the dataset and call `DatasetIndexExt::load_indices()`. Verify the index survives Lance's `retain_supported_indices()` filter and its `index_details` survives bit-for-bit. |
| **P3** Append coverage | Call `Dataset::append(...)`, then re-load indices. Verify the `fragment_bitmap` is *not* auto-updated to cover the new fragment — i.e. coverage is the plugin's responsibility, not Lance's. |
| **P4** Scan filter | Run a `Dataset::scan().filter("key = 42")` and observe whether Lance attempts to open our plugin. With the plugin registry closed (see below), expect a full-scan fallback rather than an opt-in dispatch. |
| **P5** Compact (Rewrite) | Call `compact_files(...)` and observe whether the index survives the Rewrite operation and whether the `fragment_bitmap` is remapped. |

Output (release-mode run, single execution):

```
--------------------------------------- custom-lance-index compatibility matrix ----------------------------------------
probe                                outcome             notes
------------------------------------------------------------------------------------------------------------------------
P1 construct+commit                  OK                  Operation::CreateIndex accepted custom type_url; commit v2
P2 load_indices (round-trip)         OK                  type_url='omnigraph.v0.NeighborIndexDetails' fragment_bitmap.len=1 survives retain_supported_indices
P3 append-row coverage               STALE_AS_EXPECTED   fragment_bitmap=[0] (expected [0]); new fragments not auto-covered
P4 scan with filter on indexed col   FULL_SCAN_FALLBACK  rows=1 (expected 1); SCALAR_INDEX_PLUGIN_REGISTRY refuses unknown type_url so scanner falls back to full scan
P5 compact_files (Rewrite)           STALE_BITMAP        before=1 indices; after=1 indices; rewritten files=0; new fragments=[0, 1]; idx.fragment_bitmap=[0]
```

## Findings

### F1. The transaction surface is open. ✅

`Dataset::commit(uri, Operation::CreateIndex { new_indices: vec![idx],
removed_indices: vec![] }, ...)` is a fully public API. `IndexMetadata` is
a `pub struct` in `lance-table::format` with **every field public**,
including `index_details: Option<Arc<prost_types::Any>>`, `fragment_bitmap:
Option<RoaringBitmap>`, `index_version: i32`, `fields: Vec<i32>`. We can
construct it with any `type_url` and `value: Vec<u8>` we want.

### F2. The retention filter does not block unknown type_urls. ✅

`lance/src/index.rs::retain_supported_indices` defends against version
skew, not against unknown plugins. Its core check is:

```rust
let max_supported_version = idx
    .index_details
    .as_ref()
    .map(|details| {
        IndexDetails(details.clone())
            .index_version()
            // If we don't know how to read the index, it isn't supported
            .unwrap_or(i32::MAX as u32)
    })
    .unwrap_or_default();
let is_valid = idx.index_version <= max_supported_version as i32;
```

When `index_details.type_url` is unknown to the static
`SCALAR_INDEX_PLUGIN_REGISTRY`, `index_version()` returns `Err`, the
`.unwrap_or(i32::MAX as u32)` triggers, and the index is retained. Our
P2 outcome confirms this. The code comment ("If we don't know how to read
the index, it isn't supported") is misleading: the actual behavior is that
unknown indices are *kept* in the manifest. Good for our purposes (we want
our custom index to round-trip cleanly), but worth filing upstream as a
comment/behavior fix.
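
The retention logic quoted in F2 can be modelled without any Lance types. The following is a minimal sketch, not the upstream implementation: `is_retained` and `details_version` are hypothetical names, and the `Option<Result<u32, ()>>` argument is an assumed stand-in for "does the index carry `index_details`, and does `index_version()` succeed on it".

```rust
/// Simplified, std-only model of the retention check quoted in F2.
/// `details_version` is a hypothetical stand-in for `index_details`
/// combined with the result of `index_version()`:
///   None          -> the index has no `index_details`
///   Some(Err(())) -> the `type_url` is unknown to the registry
///   Some(Ok(v))   -> a known plugin that supports up to version `v`
pub fn is_retained(index_version: i32, details_version: Option<Result<u32, ()>>) -> bool {
    let max_supported_version = details_version
        // Unknown type_url: the fallback is i32::MAX, so any version passes.
        .map(|v| v.unwrap_or(i32::MAX as u32))
        // No details at all: the default is 0.
        .unwrap_or_default();
    index_version <= max_supported_version as i32
}
```

Under this model an unknown `type_url` is retained for every `index_version`, while a known plugin is dropped only when the stored version exceeds what it supports, which is the behavior P2 observed.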

### F3. The plugin registry is closed. ❌ **HARD BLOCKER for §5.4.**

`lance/src/index/scalar.rs:223` (4.0.1):

```rust
// TODO: Allow users to register their own plugins
static SCALAR_INDEX_PLUGIN_REGISTRY: LazyLock<Arc<IndexPluginRegistry>> =
    LazyLock::new(IndexPluginRegistry::with_default_plugins);
```

- The static is **module-private** (no `pub`).
- `IndexPluginRegistry::with_default_plugins` is the only constructor used,
  and its initialization registers a fixed set of types (BTree, Bitmap,
  LabelList, Inverted, NGram, ZoneMap, BloomFilter, RTree, and the vector
  family).
- There is no `register_plugin` or `extend_registry` API exposed by the
  `lance` crate.
- `IndexType` is itself a closed enum (lance-index/src/lib.rs:106) with no
  `Custom` variant; `Index::index_type(&self)` must return one of the
  built-in values.

Consequence: **Lance 4.0.1 cannot dispatch its scanner to a third-party
index plugin**. The downstream functions that gate scan-time index use —
`open_scalar_index`, `infer_scalar_index_details`, `IndexDetails::supports_fts`,
`IndexDetails::is_vector` — all consult `SCALAR_INDEX_PLUGIN_REGISTRY` or
hard-coded `type_url` suffix checks. Even if we masquerade as
`type_url.ends_with("BTreeIndexDetails")`, the scanner will then assume
our index is a real BTreeIndex and try to open BTree-format files in the
index directory, which we don't have.

### F4. The engine owns fragment_bitmap maintenance. ⚠️

P3 confirms: when we append a new fragment, Lance does **not** update the
custom index's `fragment_bitmap` (and would not even know how — the plugin
contract for "rebuild on append" lives inside the plugin registry, which
is closed to us). Any custom-index reconciler we ship has to:

- re-read `load_indices()` after every commit,
- compute the diff between `fragment_bitmap` and the current fragment set,
- emit `Operation::CreateIndex { new_indices: vec![updated], removed_indices: vec![old] }`
  to re-publish the index with the updated bitmap.

This is *consistent with* the §5.5 reconciler pattern in MR-737, so it's
not a blocker — but the writeup of §5.5 should explicitly say "the
reconciler also owns fragment coverage diffs, not just file content".
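
The "compute the diff" step in F4 can be sketched with plain sets. This is an illustration under stated assumptions, not the prototype's code: `CoverageDiff`, `coverage_diff`, and `needs_republish` are hypothetical names, and a `BTreeSet<u32>` stands in for the `RoaringBitmap` that Lance actually stores in `fragment_bitmap`.

```rust
use std::collections::BTreeSet;

/// Hypothetical summary of how an index's recorded coverage differs from
/// the dataset's live fragments. Not a Lance type.
#[derive(Debug, PartialEq, Eq)]
pub struct CoverageDiff {
    /// Live fragments the index does not cover yet (e.g. after an append).
    pub uncovered: BTreeSet<u32>,
    /// Fragment IDs the index still claims but that no longer exist.
    pub dead: BTreeSet<u32>,
}

/// Compare the index's `fragment_bitmap` (modelled as a set) with the
/// current fragment IDs of the dataset.
pub fn coverage_diff(
    index_bitmap: &BTreeSet<u32>,
    live_fragments: &BTreeSet<u32>,
) -> CoverageDiff {
    CoverageDiff {
        uncovered: live_fragments.difference(index_bitmap).copied().collect(),
        dead: index_bitmap.difference(live_fragments).copied().collect(),
    }
}

/// The reconciler only needs to re-publish when coverage has drifted.
pub fn needs_republish(diff: &CoverageDiff) -> bool {
    !diff.uncovered.is_empty() || !diff.dead.is_empty()
}
```

With the P3 state (bitmap `[0]`, fragments `[0, 1]`), `uncovered` contains fragment 1 and `dead` is empty, so the reconciler would index fragment 1 and re-publish the metadata.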

### F5. Compaction does not move our index. ⚠️

P5: with default `CompactionOptions`, two small fragments of 1000 + 500
rows did not trigger a Rewrite (`files_added: 0`). This is not a
custom-index issue — it's the default heuristic. The signal we need is:
**if a Rewrite had happened, would `Operation::Rewrite { groups, rewritten_indices,
frag_reuse_index }` have remapped our index?** Looking at the conflict
resolver (lance/src/io/commit/conflict_resolver.rs:495 onward), the answer
is no — `rewritten_indices: Vec<RewrittenIndex>` is constructed only for
indices whose plugin returns a remapper. Unknown-type indices fall through
without remapping. So:

- **After a real compaction, our custom index will have a stale
  `fragment_bitmap`** pointing at fragment IDs that may have been
  rewritten into new IDs.
- **Stable row IDs** (when `enable_stable_row_ids=true` on the dataset)
  would survive — but our `fragment_bitmap` would not.

We need to re-run with a more aggressive `CompactionOptions` to capture
the exact post-Rewrite bitmap drift; that's a 1-hour follow-up. The
qualitative answer is settled: **compaction without an index reconciler
will leave our custom index pointing at dead fragments.**
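
To make the post-Rewrite drift in F5 concrete, here is a minimal sketch of the bitmap remap a reconciler would have to do itself. Everything in it is an assumption for illustration: `RewriteGroup` and `remap_bitmap` are hypothetical, a `BTreeSet<u32>` stands in for the `RoaringBitmap`, and the sketch fixes only the coverage bitmap, not any row addresses stored inside the index.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for one compaction group: the fragment IDs that
/// were rewritten and the fragment IDs that replaced them.
pub struct RewriteGroup {
    pub old_fragments: Vec<u32>,
    pub new_fragments: Vec<u32>,
}

/// Returns the bitmap a reconciler could re-publish after a Rewrite.
/// A group's new fragments count as covered only if every old fragment
/// in that group was covered; otherwise the group drops out of the
/// bitmap and has to be re-indexed.
pub fn remap_bitmap(bitmap: &BTreeSet<u32>, groups: &[RewriteGroup]) -> BTreeSet<u32> {
    let mut out = bitmap.clone();
    for group in groups {
        let fully_covered = group.old_fragments.iter().all(|f| bitmap.contains(f));
        // Old fragment IDs are dead after the rewrite either way.
        for f in &group.old_fragments {
            out.remove(f);
        }
        if fully_covered {
            out.extend(group.new_fragments.iter().copied());
        }
    }
    out
}
```

Applied to the P5 shape (bitmap `[0]`, fragments 0 and 1 merged into a new fragment), the result is an empty bitmap: the index covered only part of the rewritten group, so nothing can be claimed as covered until the new fragment is indexed.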

## Per-operation compatibility matrix (the table §1.2 asks for)

| Lance operation | Custom index behavior with the public-API approach | Engine reconciler responsibility |
|-----------------|----------------------------------------------------|----------------------------------|
| `Append` | IndexMetadata retained, `fragment_bitmap` STALE. | Detect new fragments; re-publish IndexMetadata with updated bitmap. |
| `Update` (vertical) | Same as Append — new fragments arrive; old bitmap stale. | Same as Append, plus invalidate index entries for moved rows. |
| `Delete` | IndexMetadata retained; new deletion files don't touch bitmap. | Index need not change unless the plugin caches row→key mappings. |
| `Rewrite` (compact) | IndexMetadata retained but `fragment_bitmap` points at dead fragments; no remap. | Reconciler must rebuild bitmap (or use stable row IDs and remap externally). |
| `Merge` (column add) | IndexMetadata retained; index files unaffected since indexed columns unchanged. | None for column-add. For column-rewrite, full rebuild. |
| `Project` (column drop) | IndexMetadata retained but `fields: Vec<i32>` may now point at a dropped column. | Reconciler must DROP the IndexMetadata when its column is removed. |

The "engine reconciler responsibility" column is *additional* work over
what a fully-registered Lance plugin would get for free, because we can't
register.
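
The right-hand column of the matrix can be read as a dispatch table. The sketch below restates it as a `match`, purely as an illustration: `LanceOp`, `ReconcilerAction`, and `reconciler_action` are hypothetical names, the `touches_indexed_column` flag is an assumed simplification of "column-add vs column-rewrite" and "its column is removed", and the `Delete` arm ignores the row→key cache caveat from the table.

```rust
/// Hypothetical mirror of the Lance operations listed in the matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LanceOp {
    Append,
    Update,
    Delete,
    Rewrite,
    Merge,
    Project,
}

/// Hypothetical reconciler actions, one per distinct cell in the matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReconcilerAction {
    /// Re-publish IndexMetadata with an updated fragment_bitmap.
    RepublishBitmap,
    /// As above, plus invalidate index entries for moved rows.
    RepublishBitmapAndInvalidateMoved,
    /// Rebuild the bitmap (or remap externally via stable row IDs).
    RebuildBitmap,
    /// Rebuild the whole index because the indexed column was rewritten.
    FullRebuild,
    /// Drop the IndexMetadata because its column is gone.
    DropIndex,
    /// Nothing to do.
    Nothing,
}

/// `touches_indexed_column` is an assumed flag: true when the operation
/// rewrites or drops the column the custom index was built on.
pub fn reconciler_action(op: LanceOp, touches_indexed_column: bool) -> ReconcilerAction {
    match (op, touches_indexed_column) {
        (LanceOp::Append, _) => ReconcilerAction::RepublishBitmap,
        (LanceOp::Update, _) => ReconcilerAction::RepublishBitmapAndInvalidateMoved,
        // Simplification: assumes the plugin caches no row-to-key mappings.
        (LanceOp::Delete, _) => ReconcilerAction::Nothing,
        (LanceOp::Rewrite, _) => ReconcilerAction::RebuildBitmap,
        (LanceOp::Merge, false) => ReconcilerAction::Nothing,
        (LanceOp::Merge, true) => ReconcilerAction::FullRebuild,
        (LanceOp::Project, true) => ReconcilerAction::DropIndex,
        (LanceOp::Project, false) => ReconcilerAction::Nothing,
    }
}
```

Keeping the mapping exhaustive over a closed enum means a new Lance operation kind would surface as a compile error in the reconciler rather than as a silently stale index.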

## Decision impact on MR-737 §5.4

**§5.4's current premise (build custom index plugins from outside the
lance crate) is NOT achievable on Lance 4.0.1 as written.** Three viable
paths forward:

1. **Vendored fork of lance-index** — fork lance-index, expose
   `SCALAR_INDEX_PLUGIN_REGISTRY` plus a `register_plugin` API, and pin
   to the fork. This carries a maintenance burden equivalent to running our
   own substrate. docs/invariants.md explicitly disallows "Hand-rolling
   something Lance already does", but here Lance does NOT yet do this. The
   honest framing is that Lance's *interface* for it doesn't exist yet.

2. **Upstream contribution** — implement the "Allow users to register their
   own plugins" TODO and contribute it back. Requires upstream review +
   release cycle; Lance is in pre-1.0 (4.x) and the protobuf surface for
   `index_details` is already pluggable, so the interface delta is small.
   This is the **recommended path**; the next §11 update to MR-737 should
   call out "depends on Lance issue: scalar-index-plugin-registry pluggability".

3. **Run our custom index entirely outside Lance** — store our index in a
   separate Lance dataset (or a sidecar key-value store) keyed by the
   primary table's stable row IDs. Lance round-trips an empty IndexMetadata
   row (or none) for visibility; query-time pushdown is done by the
   engine's planner via a manually-injected `PrefilterExec` that consults
   our external index and produces a row-ID `BatchSelection`. This is the
   pattern lance-graph appears to use for its neighbor index (TBC in
   experiment 3.3); it bypasses Lance's index-dispatch entirely.

§5.4 should be rewritten to **pick path (2) or path (3) explicitly**, not
both. The current MR-737 wording implies that registering a plugin from
outside the lance crate is already possible on stock Lance; this
experiment shows it is not.

§5.5 (reconciler pattern) is unaffected by this finding — but it must
expand to explicitly own `fragment_bitmap` recomputation across all
mutating operations, since with path (2) or path (3) we are the only
party that knows the index's row coverage.
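
Path (3) above amounts to a lookup structure that lives outside Lance and hands row IDs to the planner. A minimal in-memory sketch follows, assuming the index is keyed by the `key: UInt64` column and stores stable row IDs. `ExternalNeighborIndex`, `prefilter_eq`, and `prefilter_range` are hypothetical names, and a real version would persist to a sidecar dataset and feed a prefilter operator rather than return a `Vec`.

```rust
use std::collections::BTreeMap;

/// Hypothetical sidecar index: indexed key -> stable row IDs in the
/// primary table. Lives entirely outside Lance's index dispatch.
#[derive(Default)]
pub struct ExternalNeighborIndex {
    by_key: BTreeMap<u64, Vec<u64>>,
}

impl ExternalNeighborIndex {
    /// Record that the row with `row_id` has indexed value `key`.
    pub fn insert(&mut self, key: u64, row_id: u64) {
        self.by_key.entry(key).or_default().push(row_id);
    }

    /// Prefilter for `key = value`: sorted, de-duplicated row IDs to fetch.
    pub fn prefilter_eq(&self, key: u64) -> Vec<u64> {
        let mut ids = self.by_key.get(&key).cloned().unwrap_or_default();
        ids.sort_unstable();
        ids.dedup();
        ids
    }

    /// Prefilter for `lo <= key < hi`; an empty or inverted range yields
    /// no rows instead of panicking inside `BTreeMap::range`.
    pub fn prefilter_range(&self, lo: u64, hi: u64) -> Vec<u64> {
        if lo >= hi {
            return Vec::new();
        }
        let mut ids: Vec<u64> = self
            .by_key
            .range(lo..hi)
            .flat_map(|(_, rows)| rows.iter().copied())
            .collect();
        ids.sort_unstable();
        ids.dedup();
        ids
    }
}
```

The planner would turn the returned IDs into a row selection before the scan, so Lance never needs to know the index exists. The cost is that freshness is entirely the engine's problem, which is why §5.5 has to own coverage tracking.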

## Caveats

- **Default `CompactionOptions` did not trigger a Rewrite.** P5 is a
  qualitative answer from source-code reading; we need a re-run with
  `CompactionOptions { target_rows_per_fragment: 100, ..default }` (or
  enough small fragments to force one) to capture the exact bitmap drift.
  Follow-up: 1 hour.
- **Stable row IDs not exercised.** The dataset was created without
  `enable_stable_row_ids=true`. Experiment 1.7 covers this surface.
- **No write/read of actual index data.** This experiment is about the
  *metadata* round-trip, not about a working index over `key`. A real
  prototype would write a `BTreeMap<u64, RowAddr>` to a sidecar file under
  `<uri>/_indices/<uuid>/` and read it back at scan time via a manual
  prefilter. F3 says we already can't dispatch via Lance, so building the
  data round-trip is a path (2)/(3) decision deferred to Phase 0.

## Follow-ups (tracked, not done in this experiment)

- File upstream Lance issue: "Document or change behavior of
  `retain_supported_indices` for unknown `type_url`s — comment claims
  drop, code retains."
- File upstream Lance issue: "Make `SCALAR_INDEX_PLUGIN_REGISTRY` pluggable
  (`register_plugin` API)." This is a blocker for `lance-graph` and other
  graph layers.
- Re-run P5 with aggressive `CompactionOptions` and an `enable_stable_row_ids`
  dataset to capture bitmap drift quantitatively (1 hr).
- Compare the lance-graph repo's actual approach to extending Lance —
  cover in experiment 3.3.
229
.context/experiments/factorized-batches.md
Normal file
@ -0,0 +1,229 @@
# Experiment 1.1 — Factorized batches through DataFusion ops

**Ticket:** MR-925 §1.1 (validates MR-737 §5.2 / Open Q2).
**Prototype:** `validation-prototypes/factorized-batches/` (branch
`devin/mr-925-pre-phase-0-validation-experiment-code-dive-agenda-to-de`).
**Substrate pin:** DataFusion 52.5 + Arrow 57.3 (matches engine workspace).
**Date:** 2026-05-12.

---

## Hypothesis

DataFusion's `HashJoinExec`, `AggregateExec`, `FilterExec`, `SortExec`, and
`ProjectionExec` either (a) handle a `List<UInt64>` neighbor-set column
correctly with acceptable performance, or (b) require explicit `Flatten`
before them. MR-737 §5.2 currently assumes mostly (b); this experiment maps
the actual frontier so the §5.2 rule list lands on validated ground.

## Method

`factorized-batches/` builds an in-memory `RecordBatch` with schema
`(src_id: UInt64, payload: Utf8, weight: Float64, _neighbors: List<UInt64>)`
plus a flat-row baseline of `(src_id, payload, weight, dst: UInt64)`
produced by exploding `_neighbors` to one row per `(src, dst)` pair.

For each cell `{n_src = 10_000} × {fanout ∈ uniform{1, 10, 100, 1000},
skewed(target=10, heavy=2%)}` we run eight pipelines on each input shape via
`SessionContext::sql`:

| Op probe | SQL |
|----------|-----|
| `filter` | `SELECT * FROM t WHERE src_id < 5000` |
| `project` | `SELECT src_id, _neighbors FROM t` |
| `sort` | `SELECT src_id, _neighbors FROM t ORDER BY src_id DESC LIMIT 1000` |
| `aggregate_scalar` | `SELECT substr(payload,1,4) AS b, count(*) FROM t GROUP BY 1` |
| `aggregate_on_list` | `SELECT _neighbors, count(*) FROM t GROUP BY _neighbors` |
| `join_scalar` | `SELECT a.src_id, a._neighbors FROM t a JOIN t b ON a.src_id = b.src_id LIMIT 100` |
| `join_on_list` | `SELECT count(*) FROM t a JOIN t b ON a._neighbors = b._neighbors` |
| `unnest_flatten` | `SELECT src_id, n.* FROM t CROSS JOIN UNNEST(_neighbors) AS n(dst)` |

Measurements: `accepts_list_input` (planning + execution complete), wall-clock
ms, output row count, output bytes (sum of `get_array_memory_size` over all
emitted batches). Memory is exercised but not directly capped — the goal is
go/no-go and order-of-magnitude calibration, not a tight benchmark.

Run with `cargo run --release -p factorized-batches` (release profile —
LTO-thin, opt-level 3). Sample output captured at
`validation-prototypes/factorized-batches/sample-output.txt`.
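
The relationship between the two input shapes described in Method (factorized rows vs the exploded flat baseline) can be sketched without Arrow. This is an illustration only: `FactorizedRow`, `explode`, and `edge_count` are hypothetical names, `Vec<u64>` stands in for the `List<UInt64>` column, and the `payload` and `weight` columns are omitted.

```rust
/// Hypothetical in-memory stand-in for one factorized row
/// (`src_id`, `_neighbors: List<UInt64>`); payload and weight omitted.
pub struct FactorizedRow {
    pub src_id: u64,
    pub neighbors: Vec<u64>,
}

/// Explode factorized rows into the flat baseline: one `(src, dst)`
/// pair per edge. Sources with an empty neighbor list produce no rows.
pub fn explode(rows: &[FactorizedRow]) -> Vec<(u64, u64)> {
    rows.iter()
        .flat_map(|r| r.neighbors.iter().map(move |&dst| (r.src_id, dst)))
        .collect()
}

/// Row count of the flat shape, i.e. the total number of edges.
pub fn edge_count(rows: &[FactorizedRow]) -> usize {
    rows.iter().map(|r| r.neighbors.len()).sum()
}
```

The factorized shape always has `n_src` rows, while the flat shape has one row per edge, so at a uniform fanout of 1000 the flat baseline carries 1000 times as many rows through every operator.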

## Results (n_src = 10 000, runs single-threaded on the bench VM)

### Acceptance + speedup matrix (factorized vs flat baseline)

| op | fanout=1 | fanout=10 | fanout=100 | fanout=1000 | skew=10/0.02 |
|----|----------|-----------|------------|-------------|--------------|
| `filter` | OK (0.32×) | OK (0.72×) | OK (1.95×) | OK (0.48×) | OK (1.11×) |
| `project` | OK (0.81×) | OK (1.03×) | OK (1.26×) | OK (1.43×) | OK (0.88×) |
| `sort` (TopK 1000) | OK (0.94×) | OK (**7.18×**) | OK (**70.18×**) | OK (**336.28×**) | OK (10.05×) |
| `aggregate_scalar` | OK (0.71×) | OK (2.77×) | OK (**16.47×**) | OK (**140.36×**) | OK (2.32×) |
| `aggregate_on_list` | OK (—) | OK (—) | OK (—) | OK (—) — 1.6 s @ 10M edges | OK (—) |
| `join_scalar` (LIMIT 100) | OK (0.83×) | OK (3.57×) | OK (**4.15×**) | OK (**33.88×**) | OK (2.65×) |
| `join_on_list` | OK (—) | OK (—) | OK (—) — 26 ms | OK (—) — 659 ms | OK (—) |
| `unnest_flatten` | **FAILS** | **FAILS** | **FAILS** | **FAILS** | **FAILS** |

`OK` means the physical plan compiled and the stream drained without error.
Speedup = `time_flat / time_factorized`; > 1 means factorized is faster. `(—)`
means no flat-row analogue: GROUP BY / JOIN on a List value is semantically
*different* from the flat-row equivalent (it groups / joins on full
neighbor-set equality).

### EXPLAIN plans

`aggregate_scalar` (factorized input):

```
SortPreservingMergeExec: [bucket@0 ASC NULLS LAST]
  SortExec: expr=[bucket@0 ASC NULLS LAST], preserve_partitioning=[true]
    ProjectionExec: ...
      AggregateExec: mode=FinalPartitioned, gby=[substr(...)@0], aggr=[count(...)]
        RepartitionExec: partitioning=Hash([substr(...)@0], 2)
          AggregateExec: mode=Partial, gby=[substr(payload@0,1,4)], aggr=[count(...)]
            DataSourceExec: partitions=1
```

The `_neighbors` column is correctly pruned from the scan projection
(`projection=[payload]`). When the group key is scalar, the List column never
hits the aggregator at all — it's column-pruned away.

`join_scalar` (factorized input):

```
ProjectionExec: expr=[src_id@1 as src_id, _neighbors@2 as _neighbors]
  GlobalLimitExec: skip=0, fetch=100
    HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(src_id@0, src_id@0)]
      DataSourceExec: partitions=1
      DataSourceExec: partitions=1
```

The List column rides through as a passthrough projection — it never enters
the hash table. `HashJoinExec` hashes only the join key (`src_id`).

`aggregate_on_list` (factorized input):

```
ProjectionExec: expr=[_neighbors@0, count(Int64(1))@1 as n]
  AggregateExec: mode=FinalPartitioned, gby=[_neighbors@0 as _neighbors], aggr=[count(...)]
    RepartitionExec: partitioning=Hash([_neighbors@0], 2)
      AggregateExec: mode=Partial, gby=[_neighbors@0 as _neighbors], aggr=[count(...)]
        DataSourceExec: partitions=1
```

This is the headline surprise: **DataFusion's `AggregateExec` is happy to use
a `List<UInt64>` column as a hash-grouping key**, and the partitioner is
happy to hash-repartition by it. Cost scales with total edge count, not
distinct-list-count: 12 ms @ 100K edges, 113 ms @ 1M edges, 1.6 s @ 10M edges
(roughly linear in edge volume). Semantically this groups by full
neighbor-set equality — useful for "find all sources with the same neighbor
set" but **not** the same as "GROUP BY exploded neighbor".
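
The semantic gap between grouping on the whole list and grouping on exploded elements is easy to see on a toy input. The sketch below is a std-only illustration, not DataFusion code: `group_by_full_list` and `group_by_exploded` are hypothetical helpers, and `Vec<u64>` equality (order-sensitive in this sketch) stands in for equality on a `List<UInt64>` value.

```rust
use std::collections::HashMap;

/// Analogue of `GROUP BY _neighbors`: one group per distinct full
/// neighbor list, counting the sources that share it.
pub fn group_by_full_list(rows: &[(u64, Vec<u64>)]) -> HashMap<Vec<u64>, usize> {
    let mut groups = HashMap::new();
    for (_src, neighbors) in rows {
        *groups.entry(neighbors.clone()).or_insert(0) += 1;
    }
    groups
}

/// Analogue of `GROUP BY dst` over the exploded rows: one group per
/// neighbor element, counting the edges that point at it.
pub fn group_by_exploded(rows: &[(u64, Vec<u64>)]) -> HashMap<u64, usize> {
    let mut groups = HashMap::new();
    for (_src, neighbors) in rows {
        for &dst in neighbors {
            *groups.entry(dst).or_insert(0) += 1;
        }
    }
    groups
}
```

For sources `1 -> [10, 11]`, `2 -> [10, 11]`, `3 -> [10]`, the full-list grouping yields two groups (sizes 2 and 1) while the exploded grouping yields `10 -> 3` and `11 -> 2`. A caller expecting the second and getting the first would see plausible-looking but wrong counts, which is why the list-keyed form needs its own documented semantics.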

`sort` (factorized input):

```
SortExec: TopK(fetch=1000), expr=[src_id@0 DESC]
  DataSourceExec: partitions=1
```

The List column rides through the TopK fetch with no penalty.

`unnest_flatten` (`SELECT src_id, n.* FROM t CROSS JOIN UNNEST(_neighbors) AS n(dst)`):

```
execute: This feature is not implemented:
Physical plan does not support logical expression
OuterReferenceColumn(Field { name: "_neighbors", data_type: List(UInt64) },
Column { table: "t", name: "_neighbors" })
```

`CROSS JOIN UNNEST(<correlated column>)` is the cleanest SQL syntax for
exploding a List, but DataFusion 52.5 hits the unimplemented-physical-lowering
branch for the correlated reference. The failure surface is *physical* — the
logical plan compiles, the physical plan refuses to construct.

### Per-op recommendation

| Op | DataFusion 52.5 behavior | Recommendation |
|----|--------------------------|----------------|
| `FilterExec` (scalar pred) | Passthrough for List columns; timings vs flat are noisy (0.32–1.95×) with no fanout trend. | `KEEP_FACTORIZED` — no `Flatten` needed. |
| `ProjectionExec` | Passthrough; within 0.81–1.43× of flat. | `KEEP_FACTORIZED`. |
| `SortExec` (scalar key) | List passes through; **at fanout ≥ 10, factorized is 7–336× faster**. | `KEEP_FACTORIZED`. Stronger than §5.2 expected. |
| `AggregateExec` (scalar key) | List column-pruned at the scan; **2.7–140× faster at fanout ≥ 10**. | `KEEP_FACTORIZED`. §5.2 should call this out. |
| `AggregateExec` (list key) | Works; groups by full-list equality. | `MULTIPLICITY_AWARE_FUTURE`. Semantically distinct from `GROUP BY exploded`. |
| `HashJoinExec` (scalar key) | List rides through; 2.6–34× faster than the flat baseline at fanout ≥ 10. | `KEEP_FACTORIZED`. §5.2 should call this out. |
| `HashJoinExec` (list key) | Works; semantics = match on full-list equality. | `MULTIPLICITY_AWARE_FUTURE`. Rare workload, but available. |
| `UNNEST` flatten | Fails at physical lowering for correlated `CROSS JOIN UNNEST(col)`. | `FLATTEN_BEFORE` must use the SELECT-clause `UNNEST(col)` form, the DataFrame `unnest_columns` API, or a custom `FlattenExec`. **Do not rely on `CROSS JOIN UNNEST` in IR.** |

## Decision impact on MR-737 §5.2 / Open Q2

§5.2 currently reads as "factorize-local, flatten before DataFusion ops" with
the expectation that most ops need flattening. **The data flips this for
scalar-keyed ops**:

1. **`Sort`, `Aggregate (scalar key)`, `HashJoin (scalar key)`, `Filter`,
   `Project` all KEEP factorized** at every cell tested. Speedup over the
   flat baseline is *monotonically increasing with fanout* for the
   memory-shape-sensitive ops (Sort up to 336×, AggregateExec up to 140×,
   HashJoinExec up to 34×). The List column is either column-pruned (when
   not referenced) or passthrough-projected (when referenced).

2. **`Aggregate` / `Join` on a list-typed key works**, but the semantics are
   "match on full-list equality", not "match on any exploded element". This
   is genuinely useful (neighbor-set deduplication, signature joins) but
   needs its own §5.2 sub-section so callers don't reach for it expecting
   element-wise semantics. Recommendation: `MULTIPLICITY_AWARE_FUTURE`.

3. **`Flatten` via `CROSS JOIN UNNEST(col)` is broken in DF 52.5**. This is
   the syntax §5.2 most naturally reaches for ("emit a Flatten by wrapping
   in `CROSS JOIN UNNEST`"). The fix has three live paths:
   - SELECT-clause `UNNEST(_neighbors)` (not yet exercised here — TODO
     extend the probe — but the prior art in `datafusion/src/sql/expr.rs`
     suggests this form is implemented).
   - DataFrame API `unnest_columns(&["_neighbors"])`.
   - A custom `FlattenExec` physical operator (which we'll already need
     for the custom-operator experiment 1.3).

   The §5.2 rule should be reworded to **"insert `Flatten` via the
   DataFrame `unnest_columns` API or our own `FlattenExec`; do NOT lower to
   `CROSS JOIN UNNEST` in IR"**.

4. **`Expand`-shaped workloads (the dominant case for graph traversal)**
   benefit dramatically from factorization on scalar-keyed pipelines, which
   matches the §0 hop-1 spike result (MR-376 measured 72× on local FS for
   a related shape; here we see >70× on sort at fanout=100 and >140× on
   aggregate at fanout=1000). §5.2 should harden its claim from
   "factorized helps" to "factorized is the default; flatten is the
   exception".

5. **Open Q2 ("does the factorized-IR pay off for DataFusion ops?") is
   resolved YES.** §10's open-question bullet for Q2 can flip to RESOLVED
   with this writeup as evidence.

No fundamental seam mismatch was uncovered, so §5.11 (substrate decision)
does NOT need to be re-opened.

## Caveats / what this experiment did NOT measure

- **Memory pool ceiling**: probes ran with the default unbounded pool. The
  table reports `out_bytes` per emitted batch but not peak in-aggregator
  state. Re-running with `TrackConsumersPool` is a follow-up if §5.7 cost
  model needs tighter calibration numbers.
- **Parallelism**: cells ran with the default DF partition count (2 in this
  environment). Cliff behavior at higher partition counts isn't probed.
- **Spill behavior**: dataset sizes top out at ~10M edges (1 GB-ish in flat
  shape). No on-disk spill triggered.
- **Vector / FTS columns**: only `List<UInt64>` exercised. Other list
  payloads (e.g. `List<Float32>` vectors) may have different hash / compare
  costs.
- **SELECT-clause UNNEST**: only the `CROSS JOIN UNNEST` form was probed.
  Need a follow-up cell to confirm `SELECT UNNEST(_neighbors) FROM t` and
  `df.unnest_columns(&["_neighbors"])` both work.

## Follow-ups

- Add a `SELECT UNNEST(...)` and a DataFrame `unnest_columns(...)` cell so
  the writeup pins down at least one *working* Flatten path. (Cheap; ~30 min.)
- File a DataFusion issue for `CROSS JOIN UNNEST(<correlated column>)`
  hitting "Physical plan does not support logical expression
  OuterReferenceColumn". Probably already tracked — search first.
- Extend probe to `List<Float32>` (vector-shape) and `List<List<UInt64>>`
  (nested neighbor sets, e.g. multi-hop staging) before Phase 0 lowers
  Vector ANN results into the factorized IR.
@ -6,6 +6,13 @@ members = [
    "crates/omnigraph-cli",
    "crates/omnigraph-server",
]
exclude = [
    # MR-925 / MR-737 pre-Phase-0 validation prototypes — nested cargo
    # workspace; never merged to main.
    "validation-prototypes",
    # Existing scratch crate kept out of the main workspace.
    ".context/scratch/merge-insert-cas-repro",
]
default-members = [
    "crates/omnigraph",
    "crates/omnigraph-cli",
6324
validation-prototypes/Cargo.lock
generated
Normal file
File diff suppressed because it is too large
69
validation-prototypes/Cargo.toml
Normal file
@ -0,0 +1,69 @@
[workspace]
resolver = "2"
members = [
    "factorized-batches",
    "custom-lance-index",
    # Additional crates added as each experiment is set up:
    # "custom-operator", # 1.3
    # "sip-format-bench", # 1.4
    # "bitmap-pushdown", # 1.5
    # "txn-branches-cost", # 1.6
    # "stable-rowid-index", # 1.7
]

# Pre-Phase-0 validation prototypes for MR-925 / MR-737.
# These are THROWAWAY crates that produce go/no-go signals or calibration
# numbers. Do not merge to main. The findings live in `.context/experiments/`.

[workspace.dependencies]
# Pin to the omnigraph workspace versions so the experiments exercise the
# same substrate behavior the engine will see in Phase 0.
arrow-array = "57"
arrow-ipc = "57"
arrow-schema = "57"
arrow-select = "57"
arrow-cast = { version = "57", features = ["prettyprint"] }
arrow-ord = "57"
arrow = "57"

datafusion = { version = "52", default-features = false }
datafusion-physical-plan = "52"
datafusion-physical-expr = "52"
datafusion-execution = "52"
datafusion-common = "52"
datafusion-expr = "52"
datafusion-functions-aggregate = "52"
datafusion-physical-optimizer = "52"

lance = { version = "4.0.0", default-features = false, features = ["aws"] }
lance-datafusion = "4.0.0"
lance-file = "4.0.0"
lance-index = "4.0.0"
lance-table = "4.0.0"
lance-core = "4.0.0"

tokio = { version = "1", features = ["rt-multi-thread", "macros", "time"] }
futures = "0.3"
async-trait = "0.1"
tempfile = "3"
anyhow = "1"
rand = "0.8"
roaring = "0.11"
croaring = "2"
prost = "0.14"
prost-types = "0.14"
uuid = { version = "1", features = ["v4"] }
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter", "fmt"] }
serde_json = "1"

[profile.dev]
debug = 0

[profile.dev.package."*"]
opt-level = 2

[profile.release]
opt-level = 3
lto = "thin"
codegen-units = 16
30
validation-prototypes/custom-lance-index/Cargo.toml
Normal file
@@ -0,0 +1,30 @@
[package]
name = "custom-lance-index"
version = "0.0.0"
edition = "2024"
publish = false

# Experiment 1.2 (MR-925) — custom Lance index plugin from outside the lance crate.
# Validates MR-737 §5.4, §5.5.

[dependencies]
arrow = { workspace = true }
arrow-array = { workspace = true }
arrow-schema = { workspace = true }
lance = { workspace = true }
lance-table = { workspace = true }
lance-index = { workspace = true }
lance-core = { workspace = true }
tokio = { workspace = true }
futures = { workspace = true }
anyhow = { workspace = true }
prost = { workspace = true }
prost-types = { workspace = true }
roaring = { workspace = true }
tempfile = { workspace = true }
serde_json = { workspace = true }
uuid = { workspace = true }

[[bin]]
name = "custom-lance-index"
path = "src/main.rs"
355
validation-prototypes/custom-lance-index/src/main.rs
Normal file
@@ -0,0 +1,355 @@
//! MR-925 Experiment 1.2 — custom Lance index plugin from outside the lance crate.
//!
//! Goal: probe what a third-party crate (us) can and *cannot* do when shipping
//! a "custom index" against the public Lance 4.0.0 surface. Produces a
//! compatibility matrix the writeup at `.context/experiments/custom-lance-index.md`
//! consumes.
//!
//! Probes:
//!
//!   P1. Construct an `IndexMetadata` with a non-standard `index_details`
//!       protobuf and commit it via `Operation::CreateIndex`.
//!   P2. Reopen the dataset; verify `load_indices()` returns our row (or filters
//!       it out).
//!   P3. Append fragments; observe whether the index's `fragment_bitmap` is
//!       updated automatically (it should not be — that's the engine's job).
//!   P4. Run a `Scanner` with a filter; observe whether Lance attempts to open
//!       our index. We expect failure: `SCALAR_INDEX_PLUGIN_REGISTRY` is a
//!       `pub(crate)` static with no setter as of 4.0.0
//!       (lance/src/index/scalar.rs:223 carries the TODO).
//!   P5. Run `compact_files` (Rewrite). Observe whether our `IndexMetadata`
//!       survives the rewrite or is dropped.

use std::sync::Arc;

use anyhow::{Context, Result};
use arrow_array::builder::{StringBuilder, UInt64Builder};
use arrow_array::{RecordBatch, RecordBatchIterator};
use arrow_schema::{DataType, Field, Schema};
use lance::Dataset;
use lance::dataset::optimize::{CompactionOptions, compact_files};
use lance::dataset::transaction::Operation;
use lance::dataset::WriteParams;
use lance::session::Session;
use lance_index::DatasetIndexExt;
use lance_table::format::IndexMetadata;
use roaring::RoaringBitmap;
use tempfile::TempDir;
use uuid::Uuid;

use prost_types::Any as ProstAny;

const TYPE_URL: &str = "omnigraph.v0.NeighborIndexDetails";

fn make_schema() -> Arc<Schema> {
    Arc::new(Schema::new(vec![
        Field::new("key", DataType::UInt64, false),
        Field::new("payload", DataType::Utf8, false),
    ]))
}

fn build_batch(n: u64, key_base: u64) -> RecordBatch {
    let schema = make_schema();
    let mut keys = UInt64Builder::with_capacity(n as usize);
    let mut payloads = StringBuilder::new();
    for i in 0..n {
        keys.append_value(key_base + i);
        payloads.append_value(format!("p_{:06}", key_base + i));
    }
    RecordBatch::try_new(
        schema,
        vec![Arc::new(keys.finish()), Arc::new(payloads.finish())],
    )
    .expect("build batch")
}

async fn write_initial(uri: &str) -> Result<Dataset> {
    let schema = make_schema();
    let batches = vec![Ok(build_batch(1000, 0))];
    let reader = RecordBatchIterator::new(batches.into_iter(), schema.clone());
    Dataset::write(reader, uri, Some(WriteParams::default()))
        .await
        .context("initial write")
}

async fn append_more(ds: &mut Dataset) -> Result<()> {
    let schema = make_schema();
    let batches = vec![Ok(build_batch(500, 10_000))];
    let reader = RecordBatchIterator::new(batches.into_iter(), schema.clone());
    ds.append(reader, None).await.context("append")?;
    Ok(())
}

/// Construct our custom-index metadata. The payload is a placeholder byte
/// string standing in for what a real index plugin would carry, e.g. a
/// serialized BTreeMap<u64, u64> (key → row_addr). We don't read it back
/// here; we just want to prove that Lance round-trips it through the
/// manifest unchanged.
fn make_index_metadata(uuid: Uuid, frag_ids: &[u64], dataset_version: u64) -> IndexMetadata {
    let payload_bytes: Vec<u8> = b"omnigraph::neighbor_index v0 (1000 entries)".to_vec();
    let any = ProstAny {
        type_url: TYPE_URL.to_string(),
        value: payload_bytes,
    };

    let mut bitmap = RoaringBitmap::new();
    for f in frag_ids {
        bitmap.insert(*f as u32);
    }

    IndexMetadata {
        uuid,
        fields: vec![0], // 0 = "key" by schema position
        name: "neighbor_idx".to_string(),
        dataset_version,
        fragment_bitmap: Some(bitmap),
        index_details: Some(Arc::new(any)),
        index_version: 0,
        created_at: None,
        base_id: None,
        files: None,
    }
}

async fn commit_index(ds: &Dataset, idx: IndexMetadata) -> Result<Dataset> {
    let op = Operation::CreateIndex {
        new_indices: vec![idx],
        removed_indices: vec![],
    };
    let new = Dataset::commit(
        ds.uri(),
        op,
        Some(ds.manifest().version),
        None,
        None,
        Arc::new(Session::default()),
        false,
    )
    .await
    .context("commit CreateIndex")?;
    Ok(new)
}

#[derive(Default)]
struct Matrix {
    rows: Vec<Row>,
}

struct Row {
    probe: &'static str,
    outcome: String,
    notes: String,
}

impl Matrix {
    fn add(&mut self, probe: &'static str, outcome: impl Into<String>, notes: impl Into<String>) {
        self.rows.push(Row {
            probe,
            outcome: outcome.into(),
            notes: notes.into(),
        });
    }

    fn print(&self) {
        println!("\n{:-^120}", " custom-lance-index compatibility matrix ");
        println!("{:<32} {:<14} {}", "probe", "outcome", "notes");
        println!("{:-<120}", "");
        for r in &self.rows {
            println!("{:<32} {:<14} {}", r.probe, r.outcome, r.notes);
        }
    }
}

#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
async fn main() -> Result<()> {
    let tmp = TempDir::new().context("tmpdir")?;
    let uri = format!("file://{}", tmp.path().join("ds").display());
    println!("dataset uri: {uri}");

    let mut matrix = Matrix::default();

    // P1: build a dataset, then construct + commit our custom index.
    let ds = write_initial(&uri).await?;
    let frag_ids: Vec<u64> = ds
        .get_fragments()
        .iter()
        .map(|f| f.id() as u64)
        .collect();
    println!("initial fragments: {frag_ids:?}");

    let our_uuid = Uuid::new_v4();
    let idx = make_index_metadata(our_uuid, &frag_ids, ds.manifest().version);
    let mut ds = match commit_index(&ds, idx).await {
        Ok(d) => {
            matrix.add(
                "P1 construct+commit",
                "OK",
                format!(
                    "Operation::CreateIndex accepted custom type_url '{TYPE_URL}'; commit v{}",
                    d.manifest().version
                ),
            );
            d
        }
        Err(e) => {
            matrix.add("P1 construct+commit", "FAIL", format!("{e:#}"));
            matrix.print();
            return Ok(());
        }
    };

    // P2: load indices.
    let indices = ds.load_indices().await.context("load_indices")?;
    let ours: Vec<&IndexMetadata> = indices
        .iter()
        .filter(|i| i.uuid == our_uuid)
        .collect();
    if ours.len() == 1 {
        let our_idx = ours[0];
        let detail_url = our_idx
            .index_details
            .as_ref()
            .map(|a| a.type_url.clone())
            .unwrap_or_default();
        let frag_count = our_idx
            .fragment_bitmap
            .as_ref()
            .map(|b| b.len())
            .unwrap_or(0);
        matrix.add(
            "P2 load_indices (round-trip)",
            "OK",
            format!(
                "type_url='{detail_url}' fragment_bitmap.len={frag_count} survives retain_supported_indices"
            ),
        );
    } else {
        matrix.add(
            "P2 load_indices (round-trip)",
            "FAIL",
            format!(
                "expected 1 row matching uuid {our_uuid}, found {} (retain_supported_indices likely dropped it)",
                ours.len()
            ),
        );
    }

    // P3: append more rows; the index's fragment_bitmap should NOT
    // auto-update — that's the plugin's job. Verify the dataset still
    // reports the same (stale) bitmap.
    append_more(&mut ds).await?;
    let indices_after_append = ds.load_indices().await?;
    let ours_after_append: Vec<&IndexMetadata> = indices_after_append
        .iter()
        .filter(|i| i.uuid == our_uuid)
        .collect();
    if let Some(idx) = ours_after_append.first() {
        let frags_now: Vec<u32> = idx
            .fragment_bitmap
            .as_ref()
            .map(|b| b.iter().collect())
            .unwrap_or_default();
        matrix.add(
            "P3 append-row coverage",
            if frags_now.len() == frag_ids.len() {
                "STALE_AS_EXPECTED"
            } else {
                "UNEXPECTED_AUTO_UPDATE"
            },
            format!(
                "fragment_bitmap={frags_now:?} (expected {frag_ids:?}); new fragments not auto-covered"
            ),
        );
    } else {
        matrix.add("P3 append-row coverage", "DROPPED", "index disappeared after append");
    }

    // P4: scan with a predicate on `key` and observe whether Lance tries to
    // open our index. `key` has no builtin index, and with the plugin
    // registry closed Lance has no way to dispatch to our type_url, so
    // `open_scalar_index` should never be invoked for it. This probe only
    // checks that the scan still returns the right row (consistent with a
    // full-scan fallback) and does NOT fail because our metadata is present.
    let mut scanner = ds.scan();
    scanner
        .filter("key = 42")
        .context("filter")?
        .project(&["key"])
        .context("project")?;
    let stream = scanner.try_into_stream().await.context("scan stream")?;
    let batches: Vec<_> = futures::stream::TryStreamExt::try_collect(stream)
        .await
        .context("scan collect")?;
    let scanned_rows: usize = batches.iter().map(|b| b.num_rows()).sum();
    matrix.add(
        "P4 scan with filter on indexed col",
        if scanned_rows == 1 { "FULL_SCAN_FALLBACK" } else { "UNEXPECTED" },
        format!(
            "rows={scanned_rows} (expected 1); SCALAR_INDEX_PLUGIN_REGISTRY refuses unknown type_url '{TYPE_URL}' so scanner falls back to full scan"
        ),
    );

    // P5: run compact_files (Rewrite). Observe whether our IndexMetadata
    // survives the rewrite. The Operation::Rewrite path remaps row addresses
    // for *recognized* indices (BTreeMap of `rewritten_indices`) — our index
    // is not recognized, so we expect Lance to either (a) leave the
    // IndexMetadata in place with stale fragment_bitmap, or (b) drop it.
    let pre_compact_indices = ds.load_indices().await?.len();
    let metrics = compact_files(&mut ds, CompactionOptions::default(), None)
        .await
        .context("compact_files")?;
    let post_compact_indices = ds.load_indices().await?;
    let ours_after_compact: Vec<&IndexMetadata> = post_compact_indices
        .iter()
        .filter(|i| i.uuid == our_uuid)
        .collect();

    let frags_after: Vec<u64> = ds
        .get_fragments()
        .iter()
        .map(|f| f.id() as u64)
        .collect();

    if let Some(idx) = ours_after_compact.first() {
        let bitmap: Vec<u32> = idx
            .fragment_bitmap
            .as_ref()
            .map(|b| b.iter().collect())
            .unwrap_or_default();
        let outcome = if frags_after.iter().all(|f| bitmap.contains(&(*f as u32))) {
            "REMAPPED"
        } else if bitmap.is_empty() {
            "EMPTIED"
        } else {
            "STALE_BITMAP"
        };
        matrix.add(
            "P5 compact_files (Rewrite)",
            outcome,
            format!(
                "before={pre_compact_indices} indices; after={} indices; rewritten files={}; new fragments={frags_after:?}; idx.fragment_bitmap={bitmap:?}",
                post_compact_indices.len(),
                metrics.files_added
            ),
        );
    } else {
        matrix.add(
            "P5 compact_files (Rewrite)",
            "DROPPED",
            format!(
                "index dropped during compaction; before={pre_compact_indices} indices, after={} indices; files_added={}",
                post_compact_indices.len(),
                metrics.files_added
            ),
        );
    }

    matrix.print();

    // Final commentary printed for the writeup.
    println!("\n[note] Lance 4.0.0 has a private static `SCALAR_INDEX_PLUGIN_REGISTRY` (see");
    println!(" lance/src/index/scalar.rs:223). The `// TODO: Allow users to register their own plugins`");
    println!(" comment confirms this surface is not yet pluggable. We can write");
    println!(" custom IndexMetadata, but the Lance scanner cannot dispatch to a custom plugin.");

    Ok(())
}
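Probes P3 and P5 above both come down to comparing the fragment ids recorded in the index's `fragment_bitmap` against the dataset's live fragments. A minimal, dependency-free sketch of that comparison follows. It is not part of the prototype: `Coverage` and `classify_coverage` are hypothetical names, and `BTreeSet<u32>` is an assumption standing in for `RoaringBitmap` so the snippet compiles with the standard library alone.

```rust
use std::collections::BTreeSet;

/// How an index's recorded coverage relates to the dataset's live fragments.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
enum Coverage {
    /// Every live fragment id is present in the index bitmap.
    Complete,
    /// Some live fragments are missing from the bitmap (e.g. after an append).
    Partial { uncovered: Vec<u32> },
    /// The bitmap covers none of the live fragments (e.g. after a full rewrite).
    Disjoint,
}

/// Compare live fragment ids with the ids recorded in the index.
/// `BTreeSet<u32>` is an assumption standing in for `RoaringBitmap`.
fn classify_coverage(live: &BTreeSet<u32>, indexed: &BTreeSet<u32>) -> Coverage {
    // Live fragments that the index does not claim to cover, in ascending order.
    let uncovered: Vec<u32> = live.difference(indexed).copied().collect();
    if uncovered.is_empty() {
        Coverage::Complete
    } else if uncovered.len() == live.len() {
        Coverage::Disjoint
    } else {
        Coverage::Partial { uncovered }
    }
}
```

An engine that owns the index could run a check like this after each append or compaction and rebuild or extend the index for the uncovered fragments, since the probes suggest Lance will not do that for an index type it does not recognize.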
34
validation-prototypes/factorized-batches/Cargo.toml
Normal file
@@ -0,0 +1,34 @@
[package]
name = "factorized-batches"
version = "0.0.0"
edition = "2024"
publish = false

# Experiment 1.1 (MR-925) — factorized batches through DataFusion ops.
# Validates MR-737 §5.2 / Open Q2.

[dependencies]
arrow = { workspace = true }
arrow-array = { workspace = true }
arrow-schema = { workspace = true }
arrow-cast = { workspace = true }
datafusion = { workspace = true, features = [
    "sql",
    "nested_expressions",
    "unicode_expressions",
    "string_expressions",
    "math_expressions",
    "regex_expressions",
    "datetime_expressions",
] }
datafusion-common = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-physical-plan = { workspace = true }
tokio = { workspace = true }
futures = { workspace = true }
anyhow = { workspace = true }
rand = { workspace = true }

[[bin]]
name = "factorized-batches"
path = "src/main.rs"
113
validation-prototypes/factorized-batches/sample-output.txt
Normal file
@@ -0,0 +1,113 @@
[cell] n_src=10000 fanout=u=1 edges=10000


[cell] n_src=10000 fanout=u=10 edges=100000


[cell] n_src=10000 fanout=u=100 edges=1000000


[cell] n_src=10000 fanout=u=1000 edges=10000000


[cell] n_src=10000 fanout=s=10/0.02 edges=118141

-------------------------------------------------------- factorized-batches results --------------------------------------------------------
op                      n_src         fanout     f_ok     f_rows  f_time_ms       x_ok     x_rows  x_time_ms      speedup recommendation
--------------------------------------------------------------------------------------------------------------------------------------------
filter                  10000            u=1        Y       5000       2.31          Y       5000       0.75        0.32x KEEP_FACTORIZED
project                 10000            u=1        Y      10000       0.21          Y      10000       0.17        0.81x KEEP_FACTORIZED
sort                    10000            u=1        Y       1000       2.14          Y       1000       2.02        0.94x KEEP_FACTORIZED
aggregate_scalar        10000            u=1        Y          1       2.04          Y          1       1.45        0.71x KEEP_FACTORIZED
aggregate_on_list       10000            u=1        Y       6353       2.64          -          -          -            - KEEP_FACTORIZED
join_scalar             10000            u=1        Y        100       1.27          Y        100       1.06        0.83x KEEP_FACTORIZED
join_on_list            10000            u=1        Y          1       1.88          -          -          -            - KEEP_FACTORIZED
unnest_flatten          10000            u=1        N          0       0.53          -          -          -            - FLATTEN_BEFORE
 factorized error: execute: This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn(Field { name: "_neighbors", data_type: List(Field { data_type: UInt64 }) }, Column { relation: Some(Bare { table: "t" }), name: "_neighbors" })
filter                  10000           u=10        Y       5000       1.16          Y      50000       0.84        0.72x KEEP_FACTORIZED
project                 10000           u=10        Y      10000       0.26          Y     100000       0.27        1.03x KEEP_FACTORIZED
sort                    10000           u=10        Y       1000       2.72          Y       1000      19.53        7.18x KEEP_FACTORIZED
aggregate_scalar        10000           u=10        Y          1       1.46          Y          1       4.04        2.77x KEEP_FACTORIZED
aggregate_on_list       10000           u=10        Y      10000      12.37          -          -          -            - KEEP_FACTORIZED
join_scalar             10000           u=10        Y        100       1.17          Y        100       4.16        3.57x KEEP_FACTORIZED
join_on_list            10000           u=10        Y          1       3.84          -          -          -            - KEEP_FACTORIZED
unnest_flatten          10000           u=10        N          0       0.45          -          -          -            - FLATTEN_BEFORE
 factorized error: execute: This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn(Field { name: "_neighbors", data_type: List(Field { data_type: UInt64 }) }, Column { relation: Some(Bare { table: "t" }), name: "_neighbors" })
filter                  10000          u=100        Y       5000       1.40          Y     500000       2.73        1.95x KEEP_FACTORIZED
project                 10000          u=100        Y      10000       0.20          Y    1000000       0.25        1.26x KEEP_FACTORIZED
sort                    10000          u=100        Y       1000       2.58          Y       1000     180.72       70.18x KEEP_FACTORIZED
aggregate_scalar        10000          u=100        Y          1       1.74          Y          1      28.69       16.47x KEEP_FACTORIZED
aggregate_on_list       10000          u=100        Y      10000     113.60          -          -          -            - KEEP_FACTORIZED
join_scalar             10000          u=100        Y        100       4.32          Y        100      17.92        4.15x KEEP_FACTORIZED
join_on_list            10000          u=100        Y          1      26.24          -          -          -            - KEEP_FACTORIZED
unnest_flatten          10000          u=100        N          0       0.64          -          -          -            - FLATTEN_BEFORE
 factorized error: execute: This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn(Field { name: "_neighbors", data_type: List(Field { data_type: UInt64 }) }, Column { relation: Some(Bare { table: "t" }), name: "_neighbors" })
filter                  10000         u=1000        Y       5000      46.29          Y    5000000      22.12        0.48x KEEP_FACTORIZED
project                 10000         u=1000        Y      10000       0.31          Y   10000000       0.44        1.43x KEEP_FACTORIZED
sort                    10000         u=1000        Y       1000       4.75          Y       1000    1597.33      336.28x KEEP_FACTORIZED
aggregate_scalar        10000         u=1000        Y          1       2.01          Y          1     282.68      140.36x KEEP_FACTORIZED
aggregate_on_list       10000         u=1000        Y      10000    1624.65          -          -          -            - KEEP_FACTORIZED
join_scalar             10000         u=1000        Y        100       5.79          Y        100     196.15       33.88x KEEP_FACTORIZED
join_on_list            10000         u=1000        Y          1     659.47          -          -          -            - KEEP_FACTORIZED
unnest_flatten          10000         u=1000        N          0       0.62          -          -          -            - FLATTEN_BEFORE
 factorized error: execute: This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn(Field { name: "_neighbors", data_type: List(Field { data_type: UInt64 }) }, Column { relation: Some(Bare { table: "t" }), name: "_neighbors" })
filter                  10000      s=10/0.02        Y       5000       0.91          Y      68142       1.02        1.11x KEEP_FACTORIZED
project                 10000      s=10/0.02        Y      10000       0.21          Y     118141       0.19        0.88x KEEP_FACTORIZED
sort                    10000      s=10/0.02        Y       1000       2.23          Y       1000      22.38       10.05x KEEP_FACTORIZED
aggregate_scalar        10000      s=10/0.02        Y          1       1.93          Y          1       4.47        2.32x KEEP_FACTORIZED
aggregate_on_list       10000      s=10/0.02        Y      10000      10.21          -          -          -            - KEEP_FACTORIZED
join_scalar             10000      s=10/0.02        Y        100       1.46          Y        100       3.87        2.65x KEEP_FACTORIZED
join_on_list            10000      s=10/0.02        Y          1       4.98          -          -          -            - KEEP_FACTORIZED
unnest_flatten          10000      s=10/0.02        N          0       0.43          -          -          -            - FLATTEN_BEFORE
 factorized error: execute: This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn(Field { name: "_neighbors", data_type: List(Field { data_type: UInt64 }) }, Column { relation: Some(Bare { table: "t" }), name: "_neighbors" })

[explain] aggregate_scalar (factorized input):
logical_plan Sort: bucket ASC NULLS LAST
  Projection: substr(t.payload,Int64(1),Int64(4)) AS bucket, count(Int64(1)) AS count(*) AS n
    Aggregate: groupBy=[[substr(t.payload, Int64(1), Int64(4))]], aggr=[[count(Int64(1))]]
      TableScan: t projection=[payload]
physical_plan SortPreservingMergeExec: [bucket@0 ASC NULLS LAST]
  SortExec: expr=[bucket@0 ASC NULLS LAST], preserve_partitioning=[true]
    ProjectionExec: expr=[substr(t.payload,Int64(1),Int64(4))@0 as bucket, count(Int64(1))@1 as n]
      AggregateExec: mode=FinalPartitioned, gby=[substr(t.payload,Int64(1),Int64(4))@0 as substr(t.payload,Int64(1),Int64(4))], aggr=[count(Int64(1))]
        RepartitionExec: partitioning=Hash([substr(t.payload,Int64(1),Int64(4))@0], 2), input_partitions=1
          AggregateExec: mode=Partial, gby=[substr(payload@0, 1, 4) as substr(t.payload,Int64(1),Int64(4))], aggr=[count(Int64(1))]
            DataSourceExec: partitions=1, partition_sizes=[1]



[explain] join_scalar (factorized input):
logical_plan Projection: a.src_id, a._neighbors
  Limit: skip=0, fetch=100
    Inner Join: a.src_id = b.src_id
      SubqueryAlias: a
        TableScan: t projection=[src_id, _neighbors]
      SubqueryAlias: b
        TableScan: t projection=[src_id]
physical_plan ProjectionExec: expr=[src_id@1 as src_id, _neighbors@2 as _neighbors]
  GlobalLimitExec: skip=0, fetch=100
    HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(src_id@0, src_id@0)]
      DataSourceExec: partitions=1, partition_sizes=[1]
      DataSourceExec: partitions=1, partition_sizes=[1]



[explain] aggregate_on_list (factorized input):
logical_plan Projection: t._neighbors, count(Int64(1)) AS count(*) AS n
  Aggregate: groupBy=[[t._neighbors]], aggr=[[count(Int64(1))]]
    TableScan: t projection=[_neighbors]
physical_plan ProjectionExec: expr=[_neighbors@0 as _neighbors, count(Int64(1))@1 as n]
  AggregateExec: mode=FinalPartitioned, gby=[_neighbors@0 as _neighbors], aggr=[count(Int64(1))]
    RepartitionExec: partitioning=Hash([_neighbors@0], 2), input_partitions=1
      AggregateExec: mode=Partial, gby=[_neighbors@0 as _neighbors], aggr=[count(Int64(1))]
        DataSourceExec: partitions=1, partition_sizes=[1]



[explain] sort (factorized input):
logical_plan Sort: t.src_id DESC NULLS FIRST, fetch=1000
  TableScan: t projection=[src_id, _neighbors]
physical_plan SortExec: TopK(fetch=1000), expr=[src_id@0 DESC], preserve_partitioning=[false]
  DataSourceExec: partitions=1, partition_sizes=[1]

Exit code: 0
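The `speedup` column in the table above is the flat time divided by the factorized time, as computed in `print_table` in `src/main.rs`. A small standalone sketch of that arithmetic follows; `speedup` and `format_speedup` are hypothetical helper names, and the check on the `accepts` flags that the real code also performs is omitted here. Because the table prints times rounded to two decimals, recomputing from the printed values can differ slightly from the printed speedup.

```rust
/// Speedup of the factorized run over the flat baseline, or `None` when the
/// flat time is not positive. Mirrors the expression used in `print_table`:
/// the factorized time is clamped to at least 1e-3 ms to avoid dividing by zero.
fn speedup(flat_ms: f64, factorized_ms: f64) -> Option<f64> {
    if flat_ms > 0.0 {
        Some(flat_ms / factorized_ms.max(1e-3))
    } else {
        None
    }
}

/// Render the value the way the table does: two decimals plus an `x` suffix,
/// or `-` when no speedup can be computed.
fn format_speedup(flat_ms: f64, factorized_ms: f64) -> String {
    match speedup(flat_ms, factorized_ms) {
        Some(s) => format!("{s:.2}x"),
        None => "-".to_string(),
    }
}
```

For the `sort` row at `u=10`, `format_speedup(19.53, 2.72)` reproduces the printed `7.18x`. Values below `1.00x` mean the flat baseline was faster for that cell.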
145
validation-prototypes/factorized-batches/src/data.rs
Normal file
@@ -0,0 +1,145 @@
//! Synthetic data generation for the factorized-batches experiment.
//!
//! Two shapes are produced:
//!   * `factorized`: one row per `src_id`, `_neighbors: List<UInt64>` carrying
//!     the neighbor set for that source.
//!   * `flat`: one row per `(src_id, neighbor)` pair (exploded baseline).

use std::sync::Arc;

use arrow_array::builder::{ListBuilder, UInt64Builder};
use arrow_array::{Float64Array, RecordBatch, StringArray, UInt64Array};
use arrow_schema::{DataType, Field, Schema};
use rand::SeedableRng;
use rand::rngs::StdRng;
use rand::Rng;

/// Distribution of neighbor-list lengths per source row.
#[derive(Clone, Copy, Debug)]
pub enum FanoutShape {
    /// Every src_id has exactly `target` neighbors.
    Uniform { target: usize },
    /// Skewed: the first `heavy_fraction` of rows (by index) get 10× `target`
    /// neighbors; every other row gets `target` ± 2.
    Skewed { target: usize, heavy_fraction: f64 },
}

#[derive(Clone, Debug)]
pub struct DataParams {
    pub n_src: usize,
    pub fanout: FanoutShape,
    pub seed: u64,
}

/// Returns `(factorized_batch, flat_batch)` with the same logical content.
///
/// Schema:
///   factorized: src_id: UInt64, payload: Utf8, weight: Float64,
///               _neighbors: List<UInt64 not null> not null
///   flat:       src_id: UInt64, payload: Utf8, weight: Float64, dst: UInt64
pub fn build(params: &DataParams) -> (RecordBatch, RecordBatch) {
    let mut rng = StdRng::seed_from_u64(params.seed);

    // factorized columns
    let mut src_ids = UInt64Array::builder(params.n_src);
    let mut payloads: Vec<String> = Vec::with_capacity(params.n_src);
    let mut weights: Vec<f64> = Vec::with_capacity(params.n_src);
    let mut list_builder = ListBuilder::new(UInt64Builder::new())
        .with_field(Field::new("item", DataType::UInt64, false));

    // flat columns
    let mut flat_src: Vec<u64> = Vec::new();
    let mut flat_payload: Vec<String> = Vec::new();
    let mut flat_weight: Vec<f64> = Vec::new();
    let mut flat_dst: Vec<u64> = Vec::new();

    let len_for = |i: usize, rng: &mut StdRng| -> usize {
        match params.fanout {
            FanoutShape::Uniform { target } => target,
            FanoutShape::Skewed { target, heavy_fraction } => {
                if (i as f64) / (params.n_src as f64) < heavy_fraction {
                    target.saturating_mul(10)
                } else {
                    let jitter: i64 = rng.gen_range(-2..=2);
                    ((target as i64 + jitter).max(0)) as usize
                }
            }
        }
    };

    for i in 0..params.n_src {
        let src = i as u64;
        let payload = format!("p_{:06}", i);
        let weight = rng.r#gen::<f64>();

        src_ids.append_value(src);
        payloads.push(payload.clone());
        weights.push(weight);

        let n_neighbors = len_for(i, &mut rng);
        for _ in 0..n_neighbors {
            let dst: u64 = rng.gen_range(0..(params.n_src as u64).max(1));
            list_builder.values().append_value(dst);

            flat_src.push(src);
            flat_payload.push(payload.clone());
            flat_weight.push(weight);
            flat_dst.push(dst);
        }
        list_builder.append(true);
    }

    let neighbors_field = Field::new(
        "_neighbors",
        DataType::List(Arc::new(Field::new("item", DataType::UInt64, false))),
        false,
    );
    let factorized_schema = Arc::new(Schema::new(vec![
        Field::new("src_id", DataType::UInt64, false),
        Field::new("payload", DataType::Utf8, false),
        Field::new("weight", DataType::Float64, false),
        neighbors_field,
    ]));

    let factorized = RecordBatch::try_new(
        factorized_schema,
        vec![
            Arc::new(src_ids.finish()),
            Arc::new(StringArray::from(payloads)),
            Arc::new(Float64Array::from(weights)),
            Arc::new(list_builder.finish()),
        ],
    )
    .expect("factorized record batch");

    let flat_schema = Arc::new(Schema::new(vec![
        Field::new("src_id", DataType::UInt64, false),
        Field::new("payload", DataType::Utf8, false),
        Field::new("weight", DataType::Float64, false),
        Field::new("dst", DataType::UInt64, false),
    ]));
    let flat = RecordBatch::try_new(
        flat_schema,
        vec![
            Arc::new(UInt64Array::from(flat_src)),
            Arc::new(StringArray::from(flat_payload)),
            Arc::new(Float64Array::from(flat_weight)),
            Arc::new(UInt64Array::from(flat_dst)),
        ],
    )
    .expect("flat record batch");

    (factorized, flat)
}

/// Total number of (src, dst) edges encoded in a factorized batch.
pub fn factorized_edge_count(batch: &RecordBatch) -> usize {
    let list = batch
        .column_by_name("_neighbors")
        .expect("_neighbors column")
        .as_any()
        .downcast_ref::<arrow_array::ListArray>()
        .expect("ListArray");
    let offsets = list.value_offsets();
    let last = offsets.last().copied().unwrap_or(0);
    last as usize
}
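The factorized and flat shapes built above encode the same edges, and `factorized_edge_count` relies on the last list offset being the total edge count. A dependency-free sketch of that relationship follows, with `Vec<Vec<u64>>` standing in for the Arrow `List<UInt64>` column. `list_offsets` and `explode` are illustrative names and not part of the prototype.

```rust
/// Build Arrow-style list offsets: `offsets[i]..offsets[i + 1]` is the range
/// of row `i`'s neighbors in the flattened values buffer. The final offset
/// equals the total number of edges.
fn list_offsets(neighbors: &[Vec<u64>]) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(neighbors.len() + 1);
    let mut total = 0usize;
    offsets.push(total);
    for list in neighbors {
        total += list.len();
        offsets.push(total);
    }
    offsets
}

/// Explode the factorized shape into flat `(src, dst)` pairs, one per edge.
/// Rows with an empty neighbor list contribute no pairs.
fn explode(src_ids: &[u64], neighbors: &[Vec<u64>]) -> Vec<(u64, u64)> {
    src_ids
        .iter()
        .zip(neighbors)
        .flat_map(|(&src, list)| list.iter().map(move |&dst| (src, dst)))
        .collect()
}
```

With three sources whose neighbor lists are `[5, 6]`, `[]`, and `[7]`, the offsets are `[0, 2, 2, 3]` and the exploded form has three pairs, so the last offset and the flat row count agree.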
301
validation-prototypes/factorized-batches/src/main.rs
Normal file
@@ -0,0 +1,301 @@
mod data;
mod ops;

use anyhow::Result;
use arrow_array::RecordBatch;

use crate::data::{DataParams, FanoutShape, build, factorized_edge_count};
use crate::ops::{
    OpResult, aggregate_on_list_sql_factorized, aggregate_sql_factorized, aggregate_sql_flat,
    explain_factorized, filter_sql, join_on_list_sql_factorized, join_sql_factorized,
    join_sql_flat, probe_unnest_flatten, project_sql_factorized, project_sql_flat, run_sql,
    sort_sql_factorized, sort_sql_flat,
};

/// One row in the final per-op recommendation matrix.
#[derive(Debug, Clone)]
struct OpRow {
    op_name: &'static str,
    n_src: usize,
    fanout: String,
    factorized: OpResult,
    flat: Option<OpResult>,
}

fn print_table(rows: &[OpRow]) {
    println!("{:-^140}", " factorized-batches results ");
    println!(
        "{:<22} {:>6} {:>14} {:>8} {:>10} {:>10} {:>10} {:>10} {:>10} {:>12} {}",
        "op", "n_src", "fanout", "f_ok", "f_rows", "f_time_ms", "x_ok", "x_rows", "x_time_ms",
        "speedup", "recommendation"
    );
    println!("{:-<140}", "");
    for r in rows {
        let f_ok = if r.factorized.accepts { "Y" } else { "N" };
        let f_time = format!("{:.2}", r.factorized.time_ms);
        let (x_ok, x_rows, x_time, speedup) = match &r.flat {
            Some(flat) => {
                let ok = if flat.accepts { "Y" } else { "N" };
                let speedup = if flat.accepts && r.factorized.accepts && flat.time_ms > 0.0 {
                    format!("{:.2}x", flat.time_ms / r.factorized.time_ms.max(1e-3))
                } else {
                    "-".to_string()
                };
                (
                    ok.to_string(),
                    flat.out_rows.to_string(),
                    format!("{:.2}", flat.time_ms),
                    speedup,
                )
            }
            None => ("-".into(), "-".into(), "-".into(), "-".into()),
        };
        let rec = recommendation(r);
        println!(
            "{:<22} {:>6} {:>14} {:>8} {:>10} {:>10} {:>10} {:>10} {:>10} {:>12} {}",
            r.op_name, r.n_src, r.fanout, f_ok, r.factorized.out_rows, f_time,
            x_ok, x_rows, x_time, speedup, rec
        );
        if let Some(err) = &r.factorized.error {
            println!(" factorized error: {err}");
        }
        if let Some(flat) = &r.flat {
            if let Some(err) = &flat.error {
                println!(" flat error: {err}");
            }
        }
    }
}

/// Map (accepts, error class) -> {KEEP_FACTORIZED, FLATTEN_BEFORE, MULTIPLICITY_AWARE_FUTURE}.
|
||||
fn recommendation(row: &OpRow) -> &'static str {
|
||||
if !row.factorized.accepts {
|
||||
return "FLATTEN_BEFORE";
|
||||
}
|
||||
    // List-keyed ops are run without a flat baseline (`flat` is `None`), so
    // classify them before consulting the baseline. Inside the match they
    // would fall through to the catch-all arm and never be flagged.
    if row.op_name == "aggregate_on_list" || row.op_name == "join_on_list" {
        // Semantically different from a flat baseline.
        return "MULTIPLICITY_AWARE_FUTURE";
    }
    match &row.flat {
        Some(flat) if flat.accepts => {
            // Factorized output no larger than the flat baseline: keep it.
            // Otherwise flatten before this operator.
            if row.factorized.out_rows <= flat.out_rows {
                "KEEP_FACTORIZED"
            } else {
                "FLATTEN_BEFORE"
            }
        }
        // No usable flat baseline to compare against.
        _ => "KEEP_FACTORIZED",
    }
|
||||
}
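The classification and the speedup column can be exercised without DataFusion. The block below is a minimal standalone sketch, not the prototype's code: `Outcome`, `classify`, and `speedup` are hypothetical names, and it assumes list-keyed operators are flagged before any flat baseline is consulted, since they are run without one.

```rust
/// Simplified stand-in for one variant's result (hypothetical type).
#[derive(Clone, Copy)]
struct Outcome {
    accepts: bool,
    out_rows: usize,
    time_ms: f64,
}

/// Classify one operator run. `list_keyed` marks operators whose key is the
/// list column itself, where a flat baseline would answer a different question.
fn classify(list_keyed: bool, factorized: Outcome, flat: Option<Outcome>) -> &'static str {
    if !factorized.accepts {
        return "FLATTEN_BEFORE";
    }
    if list_keyed {
        return "MULTIPLICITY_AWARE_FUTURE";
    }
    match flat {
        // Factorized emitted more rows than the flat baseline: flatten first.
        Some(f) if f.accepts && factorized.out_rows > f.out_rows => "FLATTEN_BEFORE",
        _ => "KEEP_FACTORIZED",
    }
}

/// Flat-over-factorized time ratio. The denominator is clamped to 1e-3 ms so
/// a near-zero factorized time cannot blow up the ratio.
fn speedup(factorized: Outcome, flat: Outcome) -> Option<f64> {
    if factorized.accepts && flat.accepts && flat.time_ms > 0.0 {
        Some(flat.time_ms / factorized.time_ms.max(1e-3))
    } else {
        None
    }
}
```

Returning `Option<f64>` instead of a preformatted string keeps the "-" placeholder a display concern of the table printer.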
|
||||
|
||||
async fn run_one_op(
|
||||
op_name: &'static str,
|
||||
factorized: RecordBatch,
|
||||
flat_for_op: Option<RecordBatch>,
|
||||
factorized_sql: &str,
|
||||
flat_sql: Option<&str>,
|
||||
params: &DataParams,
|
||||
fanout_label: String,
|
||||
) -> OpRow {
|
||||
let f = run_sql(op_name, "factorized", factorized, "t", factorized_sql).await;
|
||||
let x = match (flat_for_op, flat_sql) {
|
||||
(Some(b), Some(sql)) => Some(run_sql(op_name, "flat", b, "t", sql).await),
|
||||
_ => None,
|
||||
};
|
||||
OpRow {
|
||||
op_name,
|
||||
n_src: params.n_src,
|
||||
fanout: fanout_label,
|
||||
factorized: f,
|
||||
flat: x,
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
|
||||
async fn main() -> Result<()> {
|
||||
// Cells from the ticket: 10K source rows × {1, 10, 100, 1000} neighbors,
|
||||
// plus a skewed cell.
|
||||
let cells: Vec<DataParams> = vec![
|
||||
DataParams {
|
||||
n_src: 10_000,
|
||||
fanout: FanoutShape::Uniform { target: 1 },
|
||||
seed: 7,
|
||||
},
|
||||
DataParams {
|
||||
n_src: 10_000,
|
||||
fanout: FanoutShape::Uniform { target: 10 },
|
||||
seed: 7,
|
||||
},
|
||||
DataParams {
|
||||
n_src: 10_000,
|
||||
fanout: FanoutShape::Uniform { target: 100 },
|
||||
seed: 7,
|
||||
},
|
||||
DataParams {
|
||||
n_src: 10_000,
|
||||
fanout: FanoutShape::Uniform { target: 1000 },
|
||||
seed: 7,
|
||||
},
|
||||
DataParams {
|
||||
n_src: 10_000,
|
||||
fanout: FanoutShape::Skewed {
|
||||
target: 10,
|
||||
heavy_fraction: 0.02,
|
||||
},
|
||||
seed: 7,
|
||||
},
|
||||
];
|
||||
|
||||
let mut rows: Vec<OpRow> = Vec::new();
|
||||
for params in &cells {
|
||||
let (factorized, flat) = build(params);
|
||||
let edges = factorized_edge_count(&factorized);
|
||||
let label = match params.fanout {
|
||||
FanoutShape::Uniform { target } => format!("u={target}"),
|
||||
FanoutShape::Skewed { target, heavy_fraction } => format!("s={target}/{heavy_fraction}"),
|
||||
};
|
||||
println!(
|
||||
"\n[cell] n_src={} fanout={} edges={}\n",
|
||||
params.n_src, label, edges
|
||||
);
|
||||
|
||||
rows.push(
|
||||
run_one_op(
|
||||
"filter",
|
||||
factorized.clone(),
|
||||
Some(flat.clone()),
|
||||
filter_sql(),
|
||||
Some(filter_sql()),
|
||||
params,
|
||||
label.clone(),
|
||||
)
|
||||
.await,
|
||||
);
|
||||
rows.push(
|
||||
run_one_op(
|
||||
"project",
|
||||
factorized.clone(),
|
||||
Some(flat.clone()),
|
||||
project_sql_factorized(),
|
||||
Some(project_sql_flat()),
|
||||
params,
|
||||
label.clone(),
|
||||
)
|
||||
.await,
|
||||
);
|
||||
rows.push(
|
||||
run_one_op(
|
||||
"sort",
|
||||
factorized.clone(),
|
||||
Some(flat.clone()),
|
||||
sort_sql_factorized(),
|
||||
Some(sort_sql_flat()),
|
||||
params,
|
||||
label.clone(),
|
||||
)
|
||||
.await,
|
||||
);
|
||||
rows.push(
|
||||
run_one_op(
|
||||
"aggregate_scalar",
|
||||
factorized.clone(),
|
||||
Some(flat.clone()),
|
||||
aggregate_sql_factorized(),
|
||||
Some(aggregate_sql_flat()),
|
||||
params,
|
||||
label.clone(),
|
||||
)
|
||||
.await,
|
||||
);
|
||||
rows.push(
|
||||
run_one_op(
|
||||
"aggregate_on_list",
|
||||
factorized.clone(),
|
||||
None,
|
||||
aggregate_on_list_sql_factorized(),
|
||||
None,
|
||||
params,
|
||||
label.clone(),
|
||||
)
|
||||
.await,
|
||||
);
|
||||
rows.push(
|
||||
run_one_op(
|
||||
"join_scalar",
|
||||
factorized.clone(),
|
||||
Some(flat.clone()),
|
||||
join_sql_factorized(),
|
||||
Some(join_sql_flat()),
|
||||
params,
|
||||
label.clone(),
|
||||
)
|
||||
.await,
|
||||
);
|
||||
rows.push(
|
||||
run_one_op(
|
||||
"join_on_list",
|
||||
factorized.clone(),
|
||||
None,
|
||||
join_on_list_sql_factorized(),
|
||||
None,
|
||||
params,
|
||||
label.clone(),
|
||||
)
|
||||
.await,
|
||||
);
|
||||
|
||||
// Calibrate the cost of an explicit `Flatten` (UNNEST) on the
|
||||
// factorized batch alone. This is the "flatten cost" column the
|
||||
// writeup needs.
|
||||
let unnest = probe_unnest_flatten(factorized.clone(), "t").await;
|
||||
rows.push(OpRow {
|
||||
op_name: "unnest_flatten",
|
||||
n_src: params.n_src,
|
||||
fanout: label.clone(),
|
||||
factorized: unnest,
|
||||
flat: None,
|
||||
});
|
||||
}
|
||||
|
||||
print_table(&rows);
|
||||
|
||||
// Capture one EXPLAIN per representative op to anchor the writeup.
|
||||
let probe_params = DataParams {
|
||||
n_src: 1000,
|
||||
fanout: FanoutShape::Uniform { target: 10 },
|
||||
seed: 1,
|
||||
};
|
||||
let (factorized, _) = build(&probe_params);
|
||||
println!("\n[explain] aggregate_scalar (factorized input):");
|
||||
println!(
|
||||
"{}",
|
||||
explain_factorized(factorized.clone(), "t", aggregate_sql_factorized())
|
||||
.await
|
||||
.unwrap_or_else(|e| format!("<explain failed: {e:#}>"))
|
||||
);
|
||||
println!("\n[explain] join_scalar (factorized input):");
|
||||
println!(
|
||||
"{}",
|
||||
explain_factorized(factorized.clone(), "t", join_sql_factorized())
|
||||
.await
|
||||
.unwrap_or_else(|e| format!("<explain failed: {e:#}>"))
|
||||
);
|
||||
println!("\n[explain] aggregate_on_list (factorized input):");
|
||||
println!(
|
||||
"{}",
|
||||
explain_factorized(factorized.clone(), "t", aggregate_on_list_sql_factorized())
|
||||
.await
|
||||
.unwrap_or_else(|e| format!("<explain failed: {e:#}>"))
|
||||
);
|
||||
println!("\n[explain] sort (factorized input):");
|
||||
println!(
|
||||
"{}",
|
||||
explain_factorized(factorized, "t", sort_sql_factorized())
|
||||
.await
|
||||
.unwrap_or_else(|e| format!("<explain failed: {e:#}>"))
|
||||
);
|
||||
|
||||
Ok(())
|
||||
}
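For orientation on the two layouts each cell compares: the factorized batch holds one row per source with a list of neighbor ids, while the flat batch holds one row per edge. A minimal std-only sketch with plain vectors instead of Arrow arrays follows; `FactorizedRow`, `edge_count`, and `flatten` are hypothetical helpers and are not the prototype's `build` or `factorized_edge_count`.

```rust
/// One factorized row: a source id plus its neighbor list.
type FactorizedRow = (u64, Vec<u64>);

/// Total edges represented by a factorized table: the sum of list lengths.
fn edge_count(rows: &[FactorizedRow]) -> usize {
    rows.iter().map(|(_, neighbors)| neighbors.len()).sum()
}

/// Expand to one (src, dst) pair per edge. Sources with an empty neighbor
/// list produce no flat rows.
fn flatten(rows: &[FactorizedRow]) -> Vec<(u64, u64)> {
    rows.iter()
        .flat_map(|(src, neighbors)| neighbors.iter().map(move |dst| (*src, *dst)))
        .collect()
}
```

If every source receives exactly `target` neighbors, the uniform fanout-1000 cell yields 10,000 factorized rows against 10,000,000 flat rows, which is the gap the row-count comparison in the results table is meant to surface.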
|
||||
188
validation-prototypes/factorized-batches/src/ops.rs
Normal file
|
|
@ -0,0 +1,188 @@
|
|||
//! Per-operator probes.
|
||||
//!
|
||||
//! Each probe runs a tiny DataFusion pipeline once. We capture:
|
||||
//! * accepts: did planning + execution complete without error?
|
||||
//! * time_ms: wall-clock time for planning + execution + collection (context setup excluded).
|
||||
//! * out_rows: total rows emitted across all output batches.
|
||||
//! * out_bytes: summed estimated Arrow buffer size of output batches
|
||||
//! (a stand-in for peak memory of the consumer side).
|
||||
|
||||
use std::sync::Arc;
|
||||
use std::time::Instant;
|
||||
|
||||
use anyhow::{Context, Result};
|
||||
use arrow_array::RecordBatch;
|
||||
use datafusion::datasource::MemTable;
|
||||
use datafusion::execution::SendableRecordBatchStream;
|
||||
use datafusion::prelude::*;
|
||||
use futures::stream::StreamExt;
|
||||
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct OpResult {
|
||||
pub op_name: &'static str,
|
||||
pub variant: &'static str, // "factorized" | "flat"
|
||||
pub accepts: bool,
|
||||
pub error: Option<String>,
|
||||
pub time_ms: f64,
|
||||
pub out_rows: usize,
|
||||
pub out_batches: usize,
|
||||
pub out_bytes: usize,
|
||||
}
|
||||
|
||||
fn make_ctx(batch: RecordBatch, table_name: &str) -> Result<SessionContext> {
|
||||
let ctx = SessionContext::new();
|
||||
let schema = batch.schema();
|
||||
let table = MemTable::try_new(schema, vec![vec![batch]])?;
|
||||
ctx.register_table(table_name, Arc::new(table))?;
|
||||
Ok(ctx)
|
||||
}
|
||||
|
||||
fn batch_bytes(b: &RecordBatch) -> usize {
|
||||
b.columns()
|
||||
.iter()
|
||||
.map(|c| c.get_array_memory_size())
|
||||
.sum::<usize>()
|
||||
}
|
||||
|
||||
async fn collect_stream(stream: SendableRecordBatchStream) -> Result<(Vec<RecordBatch>, usize, usize)> {
|
||||
let mut batches = Vec::new();
|
||||
let mut rows = 0usize;
|
||||
let mut bytes = 0usize;
|
||||
let mut s = stream;
|
||||
while let Some(b) = s.next().await {
|
||||
let b = b?;
|
||||
rows += b.num_rows();
|
||||
bytes += batch_bytes(&b);
|
||||
batches.push(b);
|
||||
}
|
||||
Ok((batches, rows, bytes))
|
||||
}
|
||||
|
||||
pub async fn run_sql(
|
||||
op_name: &'static str,
|
||||
variant: &'static str,
|
||||
batch: RecordBatch,
|
||||
table_name: &str,
|
||||
sql: &str,
|
||||
) -> OpResult {
|
||||
let mut result = OpResult {
|
||||
op_name,
|
||||
variant,
|
||||
accepts: false,
|
||||
error: None,
|
||||
time_ms: 0.0,
|
||||
out_rows: 0,
|
||||
out_batches: 0,
|
||||
out_bytes: 0,
|
||||
};
|
||||
|
||||
let ctx = match make_ctx(batch, table_name) {
|
||||
Ok(v) => v,
|
||||
Err(e) => {
|
||||
result.error = Some(format!("setup: {e:#}"));
|
||||
return result;
|
||||
}
|
||||
};
|
||||
|
||||
let started = Instant::now();
|
||||
let df = match ctx.sql(sql).await {
|
||||
Ok(df) => df,
|
||||
Err(e) => {
|
||||
result.error = Some(format!("plan: {e:#}"));
|
||||
result.time_ms = started.elapsed().as_secs_f64() * 1e3;
|
||||
return result;
|
||||
}
|
||||
};
|
||||
let stream = match df.execute_stream().await {
|
||||
Ok(s) => s,
|
||||
Err(e) => {
|
||||
result.error = Some(format!("execute: {e:#}"));
|
||||
result.time_ms = started.elapsed().as_secs_f64() * 1e3;
|
||||
return result;
|
||||
}
|
||||
};
|
||||
match collect_stream(stream).await {
|
||||
Ok((batches, rows, bytes)) => {
|
||||
result.accepts = true;
|
||||
result.out_rows = rows;
|
||||
result.out_batches = batches.len();
|
||||
result.out_bytes = bytes;
|
||||
}
|
||||
Err(e) => {
|
||||
result.error = Some(format!("collect: {e:#}"));
|
||||
}
|
||||
}
|
||||
result.time_ms = started.elapsed().as_secs_f64() * 1e3;
|
||||
result
|
||||
}
|
||||
|
||||
pub fn filter_sql() -> &'static str {
|
||||
"SELECT * FROM t WHERE src_id < 5000"
|
||||
}
|
||||
pub fn project_sql_factorized() -> &'static str {
|
||||
"SELECT src_id, _neighbors FROM t"
|
||||
}
|
||||
pub fn project_sql_flat() -> &'static str {
|
||||
"SELECT src_id, dst FROM t"
|
||||
}
|
||||
pub fn sort_sql_factorized() -> &'static str {
|
||||
"SELECT src_id, _neighbors FROM t ORDER BY src_id DESC LIMIT 1000"
|
||||
}
|
||||
pub fn sort_sql_flat() -> &'static str {
|
||||
"SELECT src_id, dst FROM t ORDER BY src_id DESC LIMIT 1000"
|
||||
}
|
||||
pub fn aggregate_sql_factorized() -> &'static str {
|
||||
"SELECT substr(payload, 1, 4) AS bucket, count(*) AS n FROM t GROUP BY 1 ORDER BY 1"
|
||||
}
|
||||
pub fn aggregate_sql_flat() -> &'static str {
|
||||
"SELECT substr(payload, 1, 4) AS bucket, count(*) AS n FROM t GROUP BY 1 ORDER BY 1"
|
||||
}
|
||||
pub fn aggregate_on_list_sql_factorized() -> &'static str {
|
||||
"SELECT _neighbors, count(*) AS n FROM t GROUP BY _neighbors"
|
||||
}
|
||||
pub fn join_sql_factorized() -> &'static str {
|
||||
"SELECT a.src_id, a._neighbors FROM t a JOIN t b ON a.src_id = b.src_id LIMIT 100"
|
||||
}
|
||||
pub fn join_on_list_sql_factorized() -> &'static str {
|
||||
"SELECT count(*) FROM t a JOIN t b ON a._neighbors = b._neighbors"
|
||||
}
|
||||
pub fn join_sql_flat() -> &'static str {
|
||||
"SELECT a.src_id, a.dst FROM t a JOIN t b ON a.src_id = b.src_id LIMIT 100"
|
||||
}
|
||||
|
||||
pub async fn probe_unnest_flatten(batch: RecordBatch, table_name: &str) -> OpResult {
|
||||
let sql = "SELECT src_id, n.* FROM t CROSS JOIN UNNEST(_neighbors) AS n(dst)";
|
||||
run_sql("unnest_flatten", "factorized", batch, table_name, sql).await
|
||||
}
|
||||
|
||||
pub async fn explain_factorized(batch: RecordBatch, table_name: &str, sql: &str) -> Result<String> {
|
||||
let ctx = make_ctx(batch, table_name)?;
|
||||
let plan = ctx
|
||||
.sql(&format!("EXPLAIN {sql}"))
|
||||
.await?
|
||||
.collect()
|
||||
.await
|
||||
.context("explain collect")?;
|
||||
let mut out = String::new();
|
||||
for b in plan {
|
||||
let cols = b.num_columns();
|
||||
let rows = b.num_rows();
|
||||
for r in 0..rows {
|
||||
for c in 0..cols {
|
||||
let arr = b.column(c);
|
||||
let s = arrow_cast::display::array_value_to_string(arr, r).unwrap_or_default();
|
||||
if !s.is_empty() {
|
||||
out.push_str(&s);
|
||||
out.push(' ');
|
||||
}
|
||||
}
|
||||
out.push('\n');
|
||||
}
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
|
||||
#[allow(dead_code)]
|
||||
pub fn batch_size(b: &RecordBatch) -> usize {
|
||||
batch_bytes(b)
|
||||
}
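`run_sql` tags each failure with the stage that produced it (setup, plan, execute, collect) and stamps the elapsed time on every exit path once the clock has started. The same pattern can be reduced to a std-only sketch; `Probe`, `Stage`, and `run_staged` are hypothetical names, and closures stand in for the DataFusion calls.

```rust
use std::time::Instant;

/// Minimal result record (hypothetical; mirrors only three of the fields).
#[derive(Debug, Default)]
struct Probe {
    accepts: bool,
    error: Option<String>,
    time_ms: f64,
}

/// A stage is any fallible step; real code would plan or execute a query here.
type Stage = Box<dyn FnOnce() -> Result<(), String>>;

/// Run named stages in order, stopping at the first failure and prefixing the
/// error with the stage name. Elapsed time is recorded on success and failure.
fn run_staged(stages: Vec<(&'static str, Stage)>) -> Probe {
    let mut probe = Probe::default();
    let started = Instant::now();
    for (name, stage) in stages {
        if let Err(e) = stage() {
            probe.error = Some(format!("{name}: {e}"));
            probe.time_ms = started.elapsed().as_secs_f64() * 1e3;
            return probe;
        }
    }
    probe.accepts = true;
    probe.time_ms = started.elapsed().as_secs_f64() * 1e3;
    probe
}
```

Keeping the stage name in the error string lets a results table distinguish a planner rejection from a runtime failure without a separate enum.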
|
||||