fix(engine): close the 2 Lance 7.0.0 alignment failures (immutable PK + native namespace) (#236)

* fix(engine): make the v1→v2 manifest migration idempotent under Lance 7's immutable unenforced primary key

Lance 7 (dataset/transaction.rs) makes the unenforced primary key immutable
once set: any write touching the reserved `lance-schema:unenforced-primary-key`
field metadata after the PK is set errors "cannot be changed once set" — even
re-applying the same value. `migrate_v1_to_v2` previously relied on the old Lance
6 idempotency (re-applying the annotation was a no-op-ish bump), which it needs
for crash-recovery: a v1 graph that crashes after the field-set but before the
stamp bump re-enters the migration with the PK already present.

Under Lance 7 that re-entry now errors, so a real pre-v0.4.0 graph crashing in
that window could never complete its migration. Guard the field-set with
`schema().unenforced_primary_key().is_empty()` so a genuine first-set still runs
but a re-set is skipped — restoring crash-idempotency by construction. (Fresh
graphs bake the PK into manifest_schema() at init and never run this migration.)

The existing test_publish_migrates_pre_stamp_manifest_to_current_version is the
regression guard: red under Lance 7 before this change, green after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(engine): realign the native-namespace surface guard to Lance 7 (TableNotFound)

`test_directory_namespace_direct_publish_cannot_replace_native_omnigraph_write_path`
pokes Lance's NATIVE DirectoryNamespace (not omnigraph's production write path,
which is the manifest merge_insert publisher) to document that it cannot replace
omnigraph's authority.

Lance 7's DirectoryNamespace routes list/describe/create_table_version through
`check_table_status`, which now reports an omnigraph-manifest-tracked table as
absent — so all three return TableNotFound for `node:Person` (observed). The
native namespace is now fully decoupled from omnigraph's manifest: it cannot
enumerate, inspect, or publish over omnigraph's tables. This strengthens the
guard's thesis. Realigned the assertions to the v7 behavior and kept the
authority check (omnigraph's refresh ignores the direct append; row_count stays
0). Test-only; no production impact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(lance): document the 2 runtime behavior changes in the 7.0.0 alignment stanza

The #229 stanza verified a clean engine *build* but not the test suite, and
claimed "no Lance API surface omnigraph uses changed." Two runtime behaviors did,
caught only by the full test suite:

- the unenforced primary key is immutable once set in v7 (transaction.rs) — broke
  the v1→v2 manifest migration's crash-idempotency; fixed by an is-set guard;
- the native DirectoryNamespace returns TableNotFound for omnigraph
  manifest-tracked tables (dir.rs) — test-only; the surface guard was realigned.

Corrects the over-broad "no surface changed" claim, adds both findings, and notes
the lesson: a clean build is not a clean alignment — run cargo test --workspace
before declaring a Lance bump done.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Andrew Altshuler 2026-06-15 02:37:38 +03:00 committed by GitHub
parent 67baf615d9
commit ceb37dd4cb
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 82 additions and 47 deletions

View file

@ -113,20 +113,27 @@ pub(super) async fn migrate_internal_schema(dataset: &mut Dataset) -> Result<()>
/// so the merge-insert conflict resolver enforces row-level CAS at commit
/// time, then bump the stamp.
///
/// Both steps are idempotent under retry: re-applying the field annotation
/// at its current value is a no-op-ish bump in Lance, and the stamp is a
/// simple key-value write. A crash between the two leaves the field set
/// without a stamp; the next open re-runs this fn and only the stamp lands.
/// Idempotent under crash-retry by construction. Lance 7 makes the unenforced
/// primary key **immutable once set**: any write that touches the reserved
/// `lance-schema:unenforced-primary-key` field metadata after the PK is set
/// errors ("cannot be changed once set", `lance::dataset::transaction`), even
/// re-applying the same value. A crash between the field-set and the stamp
/// bump leaves the field set without a stamp, so the next open re-enters here
/// with the PK already present — we must therefore set it only when absent.
/// (Fresh graphs bake the PK into `manifest_schema()` at init and never run
/// this migration; only genuine pre-v0.4.0 graphs do.)
async fn migrate_v1_to_v2(dataset: &mut Dataset) -> Result<()> {
dataset
.update_field_metadata()
.update(
"object_id",
[(OBJECT_ID_PK_KEY.to_string(), "true".to_string())],
)
.map_err(|e| OmniError::Lance(e.to_string()))?
.await
.map_err(|e| OmniError::Lance(e.to_string()))?;
if dataset.schema().unenforced_primary_key().is_empty() {
dataset
.update_field_metadata()
.update(
"object_id",
[(OBJECT_ID_PK_KEY.to_string(), "true".to_string())],
)
.map_err(|e| OmniError::Lance(e.to_string()))?
.await
.map_err(|e| OmniError::Lance(e.to_string()))?;
}
set_stamp(dataset, 2).await
}

View file

@ -336,40 +336,66 @@ async fn test_directory_namespace_direct_publish_cannot_replace_native_omnigraph
.await
.unwrap();
let versions = namespace
.list_table_versions(ListTableVersionsRequest {
id: Some(vec!["node:Person".to_string()]),
descending: Some(true),
..Default::default()
})
.await
.unwrap();
assert_eq!(
versions.versions[0].version as u64,
person_entry.table_version
// Lance 7: the native `DirectoryNamespace` no longer recognizes omnigraph's
// manifest-tracked tables. `check_table_status` reports the table absent
// (`lance-namespace-impls` dir.rs), so list / describe / create_table_version
// all return `TableNotFound`. The native path is fully decoupled from
// omnigraph's manifest authority: it cannot enumerate, inspect, or publish
// over omnigraph's tables. (Pre-v7 the native namespace could still *see* the
// published version but never replace the write path; v7 only widens the gap.)
let assert_table_not_found = |what: &str, dbg: String| {
assert!(
dbg.contains("TableNotFound") && dbg.contains("node:Person"),
"{what}: expected TableNotFound for node:Person, got: {dbg}"
);
};
assert_table_not_found(
"list_table_versions",
format!(
"{:?}",
namespace
.list_table_versions(ListTableVersionsRequest {
id: Some(vec!["node:Person".to_string()]),
descending: Some(true),
..Default::default()
})
.await
.unwrap_err()
),
);
assert_table_not_found(
"describe_table_version",
format!(
"{:?}",
namespace
.describe_table_version(DescribeTableVersionRequest {
id: Some(vec!["node:Person".to_string()]),
version: Some(person_version as i64),
..Default::default()
})
.await
.unwrap_err()
),
);
assert_table_not_found(
"create_table_version",
format!(
"{:?}",
namespace
.create_table_version(version_metadata.to_create_table_version_request(
"node:Person",
person_version,
1,
None,
))
.await
.unwrap_err()
),
);
let err = namespace
.describe_table_version(DescribeTableVersionRequest {
id: Some(vec!["node:Person".to_string()]),
version: Some(person_version as i64),
..Default::default()
})
.await
.unwrap_err();
assert!(err.to_string().contains("not found"));
let err = namespace
.create_table_version(version_metadata.to_create_table_version_request(
"node:Person",
person_version,
1,
None,
))
.await
.unwrap_err();
assert!(err.to_string().contains("already exists"));
// omnigraph's manifest stays authoritative: refresh ignores the direct
// `person_ds.append` above (it was never manifest-published), so the row
// count stays 0 and the version is unchanged.
mc.refresh().await.unwrap();
assert_eq!(
mc.snapshot().entry("node:Person").unwrap().table_version,

View file

@ -156,7 +156,7 @@ If a future need pulls one of these into scope, add a row to the matching domain
When Lance ships a major release that changes any of the above (file format bump, new index type, transaction semantics change, new branching primitive), refresh this index in the same change as the omnigraph upgrade. Stale Lance pointers are worse than no pointers.
### Last alignment audit: 2026-06-14 (Lance 7.0.0 upstream; omnigraph pinned at 7.0.0)
### Last alignment audit: 2026-06-15 (Lance 7.0.0 upstream; omnigraph pinned at 7.0.0)
Migration from Lance 6.0.1 → 7.0.0 landed in this cycle. **Arrow stayed 58, DataFusion stayed 53** (no change) — the only transitive bump is `object_store` 0.12.5 → 0.13.2. 141 upstream commits reviewed (6.0.1 → 7.0.0); no fixes lost (the 6.0.x release-branch backports are all forward-ported into 7.0.0). Behavior-affecting findings:
@ -166,8 +166,10 @@ Migration from Lance 6.0.1 → 7.0.0 landed in this cycle. **Arrow stayed 58, Da
- **BTREE range-query bound inclusiveness fixed** (PR #6796, issue #6792): `x <= hi AND x > lo` returned the wrong boundary row on 6.0.1. omnigraph today builds BTREE only on string `@key` columns (`id`/`src`/`dst`) and queries them by equality/IN, not range, so its *current* query patterns almost certainly never hit this bug — but the corrected boundary semantics are a contract we rely on the moment a BTREE-range path appears (BTREE-on-properties via the index-type tickets, or a range-on-key query). Pinned by `lance_surface_guards.rs::btree_range_query_boundary_is_correct` (reproduces #6792's 5-row + BTREE shape).
- **`WriteParams::auto_cleanup` default flipped from on (every-20-commits) to `None`** (PR #6755). On 6.0.1 the on-by-default hook could GC versions the `__manifest` pins for snapshots/time-travel. omnigraph owns cleanup explicitly (`optimize.rs::cleanup_all_tables`). Two parts to the fix, because `auto_cleanup` is **create-time config only and has no effect on existing datasets** (Lance `write.rs` docs): (1) `auto_cleanup: None` at all 11 `WriteParams` sites so *new* datasets store no cleanup config; (2) — the load-bearing half — `skip_auto_cleanup: true` on every commit path, because graphs created **before** the bump still carry the on-config in their datasets, and Lance's hook fires off the *dataset's stored* config at commit time (`io/commit.rs`: `if !commit_config.skip_auto_cleanup`). So the staged commit path (`commit_staged``CommitBuilder::with_skip_auto_cleanup(true)`), the `__manifest` publisher (`MergeInsertBuilder::skip_auto_cleanup(true)`), and the direct `WriteParams` paths all skip the hook. Without this, an upgraded graph would still auto-cleanup and delete `__manifest`-pinned versions. Pinned by `lance_surface_guards.rs::skip_auto_cleanup_suppresses_version_gc` (negative control + with-skip survival).
- **Lance #6658 SHIPPED in 7.0.0** (`DeleteBuilder::execute_uncommitted`, exposed via PR #6781) → MR-A (migrate `delete_where` to the staged two-phase API, retire the parse-time D2 rule) is now **unblocked**, tracked separately (dev-graph `iss-950`). The bump itself keeps `delete_where` inline; the `_compile_delete_result_field_shape` guard is left untouched until MR-A.
- **The unenforced primary key is now immutable once set** (`lance::dataset::transaction`, ~L24722480: `if !primary_key_before.is_empty() && (writes_primary_key || primary_key_after != primary_key_before) → "the unenforced primary key is a reserved key and cannot be changed once set"`). omnigraph marks `__manifest.object_id` as the unenforced PK (`lance-schema:unenforced-primary-key`) for merge-insert row-level CAS — baked into `manifest_schema()` at init, and added by the `migrate_v1_to_v2` internal-schema migration for pre-v0.4.0 graphs. The migration relied on Lance 6's idempotent re-apply for crash-recovery (a crash after the field-set but before the stamp bump re-enters the migration with the PK already present); under v7 that re-apply errors, so a real v1 graph could never finish migrating. Fixed by guarding the set with `schema().unenforced_primary_key().is_empty()` (`db/manifest/migrations.rs`). Regression: `db::manifest::tests::test_publish_migrates_pre_stamp_manifest_to_current_version` (was red under v7).
- **Native `DirectoryNamespace` no longer recognizes omnigraph's manifest-tracked tables** (`lance-namespace-impls` dir.rs ~L1310): `list/describe/create_table_version` route through `check_table_status`, which reports an omnigraph table absent → `TableNotFound`. omnigraph production never uses Lance's native namespace (its publisher writes `__manifest` directly via merge_insert; its own `namespace.rs` impls are custom), so this is test-only — the `test_directory_namespace_direct_publish_cannot_replace_native_omnigraph_write_path` surface guard was realigned to the v7 behavior (it now asserts the native namespace is fully decoupled, which only strengthens the guard's thesis).
- **Still NOT fixed in 7.0.0:** vector-index two-phase (Lance #6666 open) — `create_vector_index` inline residual retained; blob-column compaction — `compact_files_still_fails_on_blob_columns` guard still red on a fix, `optimize` still skips blob tables behind `LANCE_SUPPORTS_BLOB_COMPACTION`.
- **No Lance-API surface omnigraph uses changed 6.0.1 → 7.0.0** (verified by a clean engine build; the only compile break was object_store). `CleanupPolicy`, `WriteParams` (apart from the `auto_cleanup` default), `CompactionOptions`, the namespace models (resolved via `lance-namespace-reqwest-client` 0.7.7, unchanged across the bump), `Operation`, `ManifestLocation`, and `MergeInsertBuilder` shapes are all stable.
- **No Lance API surface omnigraph uses changed at *compile* time** (the only compile break was object_store) — but **two runtime behaviors did** (the unenforced-PK immutability and the native-namespace `TableNotFound`, above), each caught by the full engine test suite rather than the build. `CleanupPolicy`, `WriteParams` (apart from the `auto_cleanup` default), `CompactionOptions`, the namespace models (resolved via `lance-namespace-reqwest-client` 0.7.7, unchanged across the bump), `Operation`, `ManifestLocation`, and `MergeInsertBuilder` shapes are all stable. Lesson: a clean build is not a clean alignment — run `cargo test --workspace` before declaring a Lance bump done.
Bump this date stanza on the next alignment pass.