MR-854: post-rebase doc fixes (Lance 6.0.1, MR-A framing, into_dataset note)

Reviewer feedback on the rebased PR:

* docs/dev/writes.md residuals matrix: drop demoted methods from the trait-surface table (now `pub(crate)`); keep only the two genuine trait-surface residuals (`delete_where`, `create_vector_index`); reframe under MR-A (Lance v7.x bump) per docs/dev/lance.md.

* tests/forbidden_apis.rs: update transitional allow-list header to (a) drop the truncate_table mislabel (truncate_table is a Lance Dataset method, not a TableStore method — overwrite_batch's internal call), (b) reframe trait-surface residuals under MR-A / Lance #6666.

* crates/omnigraph/src/storage_layer.rs::SnapshotHandle::{into_arc, into_dataset}: add single-ref invariant doc — both consume Arc via try_unwrap-or-clone; sibling SnapshotHandle clones across an await point force a deep Dataset clone.

* Replace lance-4.0.0 version refs with lance-6.0.1 in active source/test/dev-doc comments (storage_layer.rs, table_store.rs, table_ops.rs, schema_apply.rs, merge.rs, recovery.rs, staged_writes.rs, consistency.rs, docs/dev/execution.md, docs/user/query-language.md). Historical refs in docs/releases/v0.4.1.md and the canonical "Lance 4.0.0 → 6.0.1 migration" line in docs/dev/lance.md left intact.

No engine code changes.
This commit is contained in:
Ragnor Comerford 2026-06-06 21:44:47 +00:00
parent ef9e735452
commit 9954327960
12 changed files with 57 additions and 40 deletions

View file

@ -25,7 +25,7 @@
//! version. Pinned by
//! `tests/staged_writes.rs::lance_restore_appends_one_commit_with_checked_out_content`.
//! - `Dataset::restore` "wins" against concurrent Append/Update/Delete/
//! CreateIndex/Merge — see `check_restore_txn` at lance-4.0.0
//! CreateIndex/Merge — see `check_restore_txn` at lance-6.0.1
//! `src/io/commit/conflict_resolver.rs:986`. The hazard is documented
//! by `tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning`.
//! This module sidesteps the hazard by running recovery only at

View file

@ -567,7 +567,7 @@ where
let dataset_uri = db.storage().dataset_uri(&entry.table_path);
// Route through stage_overwrite + commit_staged for non-empty
// batches. Lance's `InsertBuilder::execute_uncommitted`
// errors on empty data (lance-4.0.0 `src/dataset/write/insert.rs:144`),
// errors on empty data (lance-6.0.1 `src/dataset/write/insert.rs:144`),
// so the empty-rewrite case stays on `overwrite_dataset` (which
// accepts empty input). The empty case is rare in schema_apply
// — it only fires when the source table itself was already empty

View file

@ -622,7 +622,7 @@ pub(super) async fn build_indices_on_dataset_for_catalog(
}
} else if matches!(prop_type.scalar, ScalarType::Vector(_)) && !prop_type.list {
if !db.storage().has_vector_index(ds, prop_name).await? {
// Inline-commit residual: lance-4.0.0 does not
// Inline-commit residual: lance-6.0.1 does not
// expose `build_index_metadata_from_segments` as
// `pub`, so vector indices cannot be staged from
// outside the lance crate. Document at the call

View file

@ -1004,7 +1004,7 @@ async fn publish_rewritten_merge_table(
// Phase 2: delete removed rows via deletion vectors.
//
// INLINE-COMMIT RESIDUAL: lance-4.0.0 does not expose a public
// INLINE-COMMIT RESIDUAL: lance-6.0.1 does not expose a public
// two-phase delete API (DeleteJob is `pub(crate)` —
// lance-format/lance#6658 is open with no PRs). We deliberately do
// NOT introduce a `stage_delete` wrapper that would secretly

View file

@ -105,8 +105,18 @@ impl SnapshotHandle {
&self.inner
}
/// Take ownership of the inner `Arc<Dataset>`. Used when committing
/// staged writes (the call needs to consume the snapshot).
/// Take ownership of the inner `Arc<Dataset>`. Used by the
/// `TableStorage` impl when an op needs to mutate the dataset in
/// place (commit a staged write, append, overwrite, …).
///
/// Performance note: callers consume the returned `Arc` via
/// `Arc::try_unwrap(...).unwrap_or_else(|arc| (*arc).clone())`. The
/// fast path (no clone) only fires when the snapshot is single-ref
/// — i.e. the caller dropped every other `SnapshotHandle` clone
/// before calling. Holding parallel clones (e.g. across an `await`
/// point or stashed in a struct) forces a deep `Dataset` clone on
/// every mutating op. Engine callers should pass `SnapshotHandle`
/// by value into the mutating method, not keep a side copy.
pub(crate) fn into_arc(self) -> Arc<Dataset> {
self.inner
}
@ -118,6 +128,10 @@ impl SnapshotHandle {
/// that the `TableStorage` trait does not (and should not)
/// surface. Engine code that participates in the staged-write
/// invariant must stay on the trait methods.
///
/// Single-ref invariant: same fast-path/clone behavior as
/// `into_arc` — see that method's doc. Drop sibling
/// `SnapshotHandle` clones before calling.
pub(crate) fn into_dataset(self) -> Dataset {
Arc::try_unwrap(self.inner).unwrap_or_else(|arc| (*arc).clone())
}
@ -361,7 +375,7 @@ pub trait TableStorage: sealed::Sealed + Send + Sync + Debug {
//
// * `delete_where` — Lance #6658 (two-phase delete).
// * `create_*_index` — `build_index_metadata_from_segments` is
// `pub(crate)` for vector indices in lance-4.0.0; scalar indices
// `pub(crate)` for vector indices in lance-6.0.1; scalar indices
// migrate to staged in MR-793 Phase 2.
// * `append_batch`, `merge_insert_batches`, `overwrite_batch` —
// legacy paths that will be demoted to `pub(crate)` in MR-793

View file

@ -738,7 +738,7 @@ impl TableStore {
// merge_insert produce a spurious "Ambiguous merge inserts:
// multiple source rows match the same target row on (id = ...)"
// error. Lance's `processed_row_ids: Mutex<HashSet<u64>>`
// (lance-4.0.0 `src/dataset/write/merge_insert.rs:2099`)
// (lance-6.0.1 `src/dataset/write/merge_insert.rs:2099`)
// double-processes the same source/target match against
// datasets previously rewritten by merge_insert, and the default
// `SourceDedupeBehavior::Fail` errors on the second insertion.
@ -917,7 +917,7 @@ impl TableStore {
}
};
// Assign real fragment IDs. Lance's `InsertBuilder::execute_uncommitted`
// returns fragments with `id = 0` ("Temporary ID" — see lance-4.0.0
// returns fragments with `id = 0` ("Temporary ID" — see lance-6.0.1
// `dataset/write.rs:1044/1712`); the real assignment happens during
// commit via `Transaction::fragments_with_ids`. Because we expose
// these fragments to `scan_with_staged` *before* commit, two staged
@ -1085,7 +1085,7 @@ impl TableStore {
"stage_overwrite called with empty batch".to_string(),
));
}
// `enable_stable_row_ids: true` is defensive — empirically Lance 4.0.0
// `enable_stable_row_ids: true` is defensive — empirically Lance 6.0.1
// preserves the source dataset's flag through `Operation::Overwrite`
// when WriteParams omits it (pinned by
// `stage_overwrite_preserves_stable_row_ids` in tests/staged_writes.rs),
@ -1126,7 +1126,7 @@ impl TableStore {
// 2) For stable-row-id datasets, assign row_id_meta starting
// at 0 (Overwrite is a fresh-start) so `scan_with_staged`
// doesn't hit the "Missing row id meta" panic in
// lance-4.0.0 dataset/rowids.rs:22.
// lance-6.0.1 dataset/rowids.rs:22.
assign_fragment_ids(&mut new_fragments, 1);
if ds.manifest.uses_stable_row_ids() {
assign_row_id_meta(&mut new_fragments, 0)?;
@ -1150,7 +1150,7 @@ impl TableStore {
/// `IndexMetadata`; we manually wrap it in `Operation::CreateIndex
/// { new_indices, removed_indices }` via the public `TransactionBuilder`,
/// replicating the simple (non-segment-commit-path) branch of Lance's
/// `CreateIndexBuilder::execute` (lance-4.0.0 `src/index/create.rs:502-512`).
/// `CreateIndexBuilder::execute` (lance-6.0.1 `src/index/create.rs:502-512`).
///
/// `removed_indices` mirrors `execute()` lines 466-476: when the
/// build replaces an existing same-named index, those entries are
@ -1159,7 +1159,7 @@ impl TableStore {
/// MR-793 Phase 2: scalar index types (BTree, Inverted) are
/// stage-able. Vector indices are NOT (segment-commit-path requires
/// `build_index_metadata_from_segments` which is `pub(crate)` in
/// lance-4.0.0); see `create_vector_index` and Appendix A.3.
/// lance-6.0.1); see `create_vector_index` and Appendix A.3.
pub async fn stage_create_btree_index(
&self,
ds: &Dataset,
@ -1254,7 +1254,7 @@ impl TableStore {
/// committed fragments carry; Lance's optimizer drops them from the
/// filtered scan even when their data would match. Staged-fragment
/// rows are silently absent from the result. `scanner.use_stats(false)`
/// does not fix this in lance 4.0.0. Callers needing correct filtered
/// does not fix this in lance 6.0.1. Callers needing correct filtered
/// reads against staged data should use a different strategy — the
/// engine's `MutationStaging` accumulator unions in-memory pending
/// batches with the committed scan via DataFusion `MemTable` (see
@ -1600,7 +1600,7 @@ fn prior_stages_fragment_count(prior_stages: &[StagedWrite]) -> u64 {
}
/// Assign sequential fragment IDs starting at `start_id`. Mirrors Lance's
/// commit-time `Transaction::fragments_with_ids` (lance-4.0.0
/// commit-time `Transaction::fragments_with_ids` (lance-6.0.1
/// `dataset/transaction.rs:1456`) — fragments produced by
/// `InsertBuilder::execute_uncommitted` start with `id = 0` as a temporary
/// placeholder; we renumber here so they don't collide with committed
@ -1631,7 +1631,7 @@ fn prior_stages_row_count(prior_stages: &[StagedWrite]) -> Result<u64> {
/// Assign sequential row IDs to fragments that lack them, starting from
/// `start_row_id`. Mirrors the relevant arm of Lance's
/// `Transaction::assign_row_ids` (lance-4.0.0 `dataset/transaction.rs:2682`)
/// `Transaction::assign_row_ids` (lance-6.0.1 `dataset/transaction.rs:2682`)
/// for the `row_id_meta = None` case — fragments produced by
/// `InsertBuilder::execute_uncommitted` against a stable-row-id dataset.
///

View file

@ -126,7 +126,7 @@ async fn load_merge_upserts_existing_and_inserts_new() {
/// source batch had one row per key.
///
/// Triggered by Lance's `processed_row_ids: Mutex<HashSet<u64>>`
/// (lance-4.0.0 `src/dataset/write/merge_insert.rs:2099`) double-
/// (lance-6.0.1 `src/dataset/write/merge_insert.rs:2099`) double-
/// processing the same source/target match against datasets previously
/// rewritten by merge_insert. Worked around by opting
/// `MergeInsertBuilder` into `SourceDedupeBehavior::FirstSeen` in

View file

@ -35,15 +35,18 @@
//! reaches the storage layer through `db.storage()` (returns
//! `&dyn TableStorage`). The inherent inline-commit methods on
//! `TableStore` (`append_batch`, `merge_insert_batch{,es}`,
//! `overwrite_batch`, `create_{btree,inverted}_index`, `truncate_table`)
//! are now `pub(crate)`, so the only direct users are
//! `table_store.rs` itself (which IS the storage layer) and the bulk
//! loader's `LoadMode::{Append, Overwrite, Merge}` concurrent
//! fast-paths in `loader::write_batch_to_dataset` (the loader uses the
//! trait surface for the staged-write path and falls back to the
//! demoted inherent methods only for the concurrent fast-path, which
//! has no two-phase shape in Lance 4.0.0). The file-level allow-list
//! below matches that boundary.
//! `overwrite_batch`, `create_{btree,inverted}_index`) are now
//! `pub(crate)`, so the only direct users are `table_store.rs` itself
//! (which IS the storage layer) and the bulk loader's
//! `LoadMode::{Append, Overwrite, Merge}` concurrent fast-paths in
//! `loader::write_batch_to_dataset` (the loader uses the trait surface
//! for the staged-write path and falls back to the demoted inherent
//! methods only for the concurrent fast-path, which has no two-phase
//! shape in Lance v6.0.1). The remaining trait-surface residuals
//! (`delete_where`, `create_vector_index`) are gated on the Lance v7.x
//! bump (MR-A) and Lance #6666 respectively — see
//! `docs/dev/lance.md` for the canonical tracking. The file-level
//! allow-list below matches that boundary.
use std::path::{Path, PathBuf};

View file

@ -367,7 +367,7 @@ async fn stage_merge_insert_then_commit_persists_merged_view() {
/// `write_fragments_internal` lack per-column statistics. The result
/// contains only matching committed rows; matching staged rows are
/// silently absent. `scanner.use_stats(false)` does not bypass this in
/// lance 4.0.0.
/// lance 6.0.1.
///
/// This test pins the actual behavior so a future change either
/// preserves it (and updates the doc) or fixes it (and rewrites this
@ -715,7 +715,7 @@ async fn stage_create_inverted_index_does_not_advance_head_until_commit() {
);
}
/// Pin the inline-commit behavior of `delete_where`. Lance 4.0.0 does
/// Pin the inline-commit behavior of `delete_where`. Lance 6.0.1 does
/// NOT expose a public `DeleteJob::execute_uncommitted`
/// (`pub(crate)` — see lance-format/lance#6658). The trait deliberately
/// does NOT introduce a `stage_delete` wrapper that would secretly
@ -755,9 +755,9 @@ async fn delete_where_advances_head_inline_documents_residual() {
}
/// Companion to `delete_where_*`: pin the inline-commit behavior of
/// `create_vector_index`. Lance 4.0.0 vector indices take the
/// `create_vector_index`. Lance 6.0.1 vector indices take the
/// "segment commit path" which calls `build_index_metadata_from_segments`
/// (`pub(crate)` in lance-4.0.0 `src/index.rs:111`). Until upstream
/// (`pub(crate)` in lance-6.0.1 `src/index.rs:111`). Until upstream
/// exposes that helper (companion ticket to lance-format/lance#6658),
/// the trait surface deliberately does NOT include
/// `stage_create_vector_index` — same rationale as `stage_delete`'s
@ -820,7 +820,7 @@ async fn create_vector_index_advances_head_inline_documents_residual() {
/// The Lance source confirms this — `restore()` (no args) takes the
/// currently-checked-out version's content and applies it via
/// `apply_commit` against the latest manifest, advancing HEAD by one.
/// See lance-4.0.0 `src/dataset.rs:1106` and the transaction-spec
/// See lance-6.0.1 `src/dataset.rs:1106` and the transaction-spec
/// example at https://lance.org/format/table/transaction/.
///
/// If the lance bump (4.0.0 → 4.x) ever changes this delta or the call
@ -887,7 +887,7 @@ async fn lance_restore_appends_one_commit_with_checked_out_content() {
/// and any future continuous-recovery reconciler's queue-acquisition
/// requirement.
///
/// `Dataset::restore`'s `check_restore_txn` (lance-4.0.0
/// `Dataset::restore`'s `check_restore_txn` (lance-6.0.1
/// `src/io/commit/conflict_resolver.rs:986`) returns `Ok(())` against
/// almost every other op (Append, Update, Delete, CreateIndex, Merge, …),
/// so a Restore commits successfully even with concurrent commits in

View file

@ -84,7 +84,7 @@ Resolves expression values to literals, converts to typed Arrow arrays (`literal
- `insert` (no `@key`, edges) → accumulate into `MutationStaging.pending` (Append mode); finalize calls `stage_append` once per touched table.
- `insert` (`@key` node) → accumulate into `pending` (Merge mode); finalize calls `stage_merge_insert` once per touched table.
- `update` → scan committed via Lance + pending via DataFusion `MemTable` (read-your-writes), apply assignments, accumulate into `pending` (Merge mode).
- `delete` → still inline-commits via `delete_where` (Lance 4.0.0 has no public two-phase delete); recorded into `MutationStaging.inline_committed`.
- `delete` → still inline-commits via `delete_where` (Lance v6.0.1 has no public two-phase delete; `DeleteBuilder::execute_uncommitted` first ships in v7.0.0-beta.10 — tracked as MR-A in [docs/dev/lance.md](lance.md)); recorded into `MutationStaging.inline_committed`.
**D₂ parse-time rule.** A single mutation query is either insert/update-only or delete-only. Mixed → reject before any I/O. The check fires in `enforce_no_mixed_destructive_constructive(&ir)` inside `execute_named_mutation`.

View file

@ -107,21 +107,21 @@ described in `.context/mr-793-design.md` §15 (deferred to MR-795).
MR-793's acceptance criterion §1 ("`TableStore` public API has no method that performs a manifest commit as a side effect of writing") is met **per-method** by enumerating every inline-commit method that remains on the trait surface, naming why it cannot yet be removed, and keeping the residual comment at every call site:
| Method on `TableStore` | Inline-commit reason | Closes when |
| Method on `TableStorage` | Inline-commit reason | Closes when |
|---|---|---|
| `delete_where` (trait) | `DeleteJob` is `pub(crate)` in lance-4.0.0 — no public two-phase delete API | [lance-format/lance#6658](https://github.com/lance-format/lance/issues/6658) lands and `stage_delete` joins the trait |
| `create_vector_index` (trait) | Vector indices take Lance's "segment commit path"; the helper `build_index_metadata_from_segments` is `pub(crate)` | [lance-format/lance#6666](https://github.com/lance-format/lance/issues/6666) lands and `stage_create_vector_index` joins the trait |
| `delete_where` | `DeleteBuilder::execute_uncommitted` is not in Lance v6.0.1 (closed upstream as [#6658](https://github.com/lance-format/lance/issues/6658) but first ships in `v7.0.0-beta.10`); see [docs/dev/lance.md](lance.md) | MR-A: Lance v7.x bump migrates `delete_where` to staged, retires the parse-time D₂ mutation rule, and extends recovery sidecar coverage |
| `create_vector_index` | Vector indices take Lance's "segment commit path"; `build_index_metadata_from_segments` is `pub(crate)` (Lance [#6666](https://github.com/lance-format/lance/issues/6666) still open) | Lance #6666 lands and `stage_create_vector_index` joins the trait |
MR-854 (Phase 1b + Phase 9) closed the remaining residuals on the engine surface: every `db.table_store.X(...)` call site was converted to `db.storage().X(...)` (trait dispatch through `&dyn TableStorage`), and the inherent inline-commit methods on `TableStore` (`append_batch`, `merge_insert_batch`, `merge_insert_batches`, `overwrite_batch`, `create_btree_index`, `create_inverted_index`, `truncate_table`) were demoted from `pub` to `pub(crate)`. They survive only as the bulk loader's `LoadMode::{Append, Overwrite, Merge}` concurrent fast-paths (see "`LoadMode::Overwrite` residual" below) and as internal helpers for the staged primitives — no engine call site outside `table_store.rs` and `loader::write_batch_to_dataset` reaches them.
MR-854 (Phase 1b + Phase 9) closed the remaining residuals on the engine surface: every `db.table_store.X(...)` call site was converted to `db.storage().X(...)` (trait dispatch through `&dyn TableStorage`), and the legacy inline-commit inherent methods on `TableStore` (`append_batch`, `merge_insert_batch`, `merge_insert_batches`, `overwrite_batch`, `create_btree_index`, `create_inverted_index`) were demoted from `pub` to `pub(crate)`. They survive only as the bulk loader's `LoadMode::{Append, Overwrite, Merge}` concurrent fast-paths (see "`LoadMode::Overwrite` residual" below) and as internal helpers for the staged primitives — no engine call site outside `table_store.rs` and `loader::write_batch_to_dataset` reaches them.
After **lance#6658 + lance#6666 ship**, the trait surface exposes only staged-write primitives + `commit_staged`. Until then this matrix names the two remaining residuals explicitly, every call site carries a one-line residual comment, and no engine code outside `table_store.rs` is permitted to reach the inline-commit Lance APIs (enforced by the `tests/forbidden_apis.rs` guard).
After **MR-A (Lance v7 bump) + Lance #6666 ship**, the trait surface exposes only staged-write primitives + `commit_staged`. Until then this matrix names the two remaining residuals explicitly, every call site carries a one-line residual comment, and no engine code outside `table_store.rs` is permitted to reach the inline-commit Lance APIs (enforced by the `tests/forbidden_apis.rs` guard).
### `LoadMode::Overwrite` residual
The bulk loader's Append and Merge modes use the staged-write path
described above. `LoadMode::Overwrite` keeps the legacy inline-commit
path: truncate-then-append doesn't fit the staged shape cleanly in
Lance 4.0.0, and overwrite has no in-flight read-your-writes
Lance v6.0.1, and overwrite has no in-flight read-your-writes
requirement (the prior data is being wiped). A mid-overwrite failure
can leave Lance HEAD on a partially-truncated table; the next overwrite
will replace it. Operator-driven (rare in agent workloads); document

View file

@ -70,7 +70,7 @@ A single mutation query must be **either insert/update-only or delete-only**. Mi
> `mutation '<name>' on the same query mixes inserts/updates and deletes; split into separate mutations: (1) inserts and updates, then (2) deletes. This restriction lifts when Lance exposes a two-phase delete API (tracked: MR-793 / Lance-upstream).`
Reason: under the staged-write rewire (MR-794), inserts and updates accumulate in memory and commit at end-of-query, while deletes still inline-commit (Lance 4.0.0 has no public two-phase delete). Mixing creates ordering hazards (same-row insert→delete becomes a no-op because the staged insert isn't visible to delete; cascading deletes of just-inserted edges break referential integrity by silent design). Until Lance exposes `DeleteJob::execute_uncommitted`, the parse-time rejection keeps both paths atomic and correct. See [docs/dev/writes.md](../dev/writes.md) and [docs/dev/invariants.md](../dev/invariants.md).
Reason: under the staged-write rewire (MR-794), inserts and updates accumulate in memory and commit at end-of-query, while deletes still inline-commit (Lance v6.0.1 has no public two-phase delete). Mixing creates ordering hazards (same-row insert→delete becomes a no-op because the staged insert isn't visible to delete; cascading deletes of just-inserted edges break referential integrity by silent design). Until the MR-A Lance v7 bump migrates `delete_where` to staged (`DeleteBuilder::execute_uncommitted` first ships in `v7.0.0-beta.10`), the parse-time rejection keeps both paths atomic and correct. See [docs/dev/writes.md](../dev/writes.md), [docs/dev/lance.md](../dev/lance.md), and [docs/dev/invariants.md](../dev/invariants.md).
## IR (Intermediate Representation)