fix(engine): close the 2 Lance 7.0.0 alignment failures (immutable PK + native namespace) (#236)

* fix(engine): make the v1→v2 manifest migration idempotent under Lance 7's immutable unenforced primary key

Lance 7 (dataset/transaction.rs) makes the unenforced primary key immutable
once set: any write touching the reserved `lance-schema:unenforced-primary-key`
field metadata after the PK is set errors "cannot be changed once set" — even
re-applying the same value. `migrate_v1_to_v2` previously relied on the old Lance
6 idempotency (re-applying the annotation was a no-op-ish bump), which it needs
for crash-recovery: a v1 graph that crashes after the field-set but before the
stamp bump re-enters the migration with the PK already present.

Under Lance 7 that re-entry now errors, so a real pre-v0.4.0 graph crashing in
that window could never complete its migration. Guard the field-set with
`schema().unenforced_primary_key().is_empty()` so a genuine first-set still runs
but a re-set is skipped — restoring crash-idempotency by construction. (Fresh
graphs bake the PK into manifest_schema() at init and never run this migration.)

The existing test_publish_migrates_pre_stamp_manifest_to_current_version is the
regression guard: red under Lance 7 before this change, green after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(engine): realign the native-namespace surface guard to Lance 7 (TableNotFound)

`test_directory_namespace_direct_publish_cannot_replace_native_omnigraph_write_path`
pokes Lance's NATIVE DirectoryNamespace (not omnigraph's production write path,
which is the manifest merge_insert publisher) to document that it cannot replace
omnigraph's authority.

Lance 7's DirectoryNamespace routes list/describe/create_table_version through
`check_table_status`, which now reports an omnigraph-manifest-tracked table as
absent — so all three return TableNotFound for `node:Person` (observed). The
native namespace is now fully decoupled from omnigraph's manifest: it cannot
enumerate, inspect, or publish over omnigraph's tables. This strengthens the
guard's thesis. Realigned the assertions to the v7 behavior and kept the
authority check (omnigraph's refresh ignores the direct append; row_count stays
0). Test-only; no production impact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(lance): document the 2 runtime behavior changes in the 7.0.0 alignment stanza

The #229 stanza verified a clean engine *build* but not the test suite, and
claimed "no Lance API surface omnigraph uses changed." Two runtime behaviors did,
caught only by the full test suite:

- the unenforced primary key is immutable once set in v7 (transaction.rs) — broke
  the v1→v2 manifest migration's crash-idempotency; fixed by an is-set guard;
- the native DirectoryNamespace returns TableNotFound for omnigraph
  manifest-tracked tables (dir.rs) — test-only; the surface guard was realigned.

Corrects the over-broad "no surface changed" claim, adds both findings, and notes
the lesson: a clean build is not a clean alignment — run cargo test --workspace
before declaring a Lance bump done.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Andrew Altshuler 2026-06-15 02:37:38 +03:00 committed by GitHub
parent 67baf615d9
commit ceb37dd4cb
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 82 additions and 47 deletions

View file

@ -156,7 +156,7 @@ If a future need pulls one of these into scope, add a row to the matching domain
When Lance ships a major release that changes any of the above (file format bump, new index type, transaction semantics change, new branching primitive), refresh this index in the same change as the omnigraph upgrade. Stale Lance pointers are worse than no pointers.
### Last alignment audit: 2026-06-14 (Lance 7.0.0 upstream; omnigraph pinned at 7.0.0)
### Last alignment audit: 2026-06-15 (Lance 7.0.0 upstream; omnigraph pinned at 7.0.0)
Migration from Lance 6.0.1 → 7.0.0 landed in this cycle. **Arrow stayed 58, DataFusion stayed 53** (no change) — the only transitive bump is `object_store` 0.12.5 → 0.13.2. 141 upstream commits reviewed (6.0.1 → 7.0.0); no fixes lost (the 6.0.x release-branch backports are all forward-ported into 7.0.0). Behavior-affecting findings:
@ -166,8 +166,10 @@ Migration from Lance 6.0.1 → 7.0.0 landed in this cycle. **Arrow stayed 58, Da
- **BTREE range-query bound inclusiveness fixed** (PR #6796, issue #6792): `x <= hi AND x > lo` returned the wrong boundary row on 6.0.1. omnigraph today builds BTREE only on string `@key` columns (`id`/`src`/`dst`) and queries them by equality/IN, not range, so its *current* query patterns almost certainly never hit this bug — but the corrected boundary semantics are a contract we rely on the moment a BTREE-range path appears (BTREE-on-properties via the index-type tickets, or a range-on-key query). Pinned by `lance_surface_guards.rs::btree_range_query_boundary_is_correct` (reproduces #6792's 5-row + BTREE shape).
- **`WriteParams::auto_cleanup` default flipped from on (every-20-commits) to `None`** (PR #6755). On 6.0.1 the on-by-default hook could GC versions the `__manifest` pins for snapshots/time-travel. omnigraph owns cleanup explicitly (`optimize.rs::cleanup_all_tables`). Two parts to the fix, because `auto_cleanup` is **create-time config only and has no effect on existing datasets** (Lance `write.rs` docs): (1) `auto_cleanup: None` at all 11 `WriteParams` sites so *new* datasets store no cleanup config; (2) — the load-bearing half — `skip_auto_cleanup: true` on every commit path, because graphs created **before** the bump still carry the on-config in their datasets, and Lance's hook fires off the *dataset's stored* config at commit time (`io/commit.rs`: `if !commit_config.skip_auto_cleanup`). So the staged commit path (`commit_staged``CommitBuilder::with_skip_auto_cleanup(true)`), the `__manifest` publisher (`MergeInsertBuilder::skip_auto_cleanup(true)`), and the direct `WriteParams` paths all skip the hook. Without this, an upgraded graph would still auto-cleanup and delete `__manifest`-pinned versions. Pinned by `lance_surface_guards.rs::skip_auto_cleanup_suppresses_version_gc` (negative control + with-skip survival).
- **Lance #6658 SHIPPED in 7.0.0** (`DeleteBuilder::execute_uncommitted`, exposed via PR #6781) → MR-A (migrate `delete_where` to the staged two-phase API, retire the parse-time D2 rule) is now **unblocked**, tracked separately (dev-graph `iss-950`). The bump itself keeps `delete_where` inline; the `_compile_delete_result_field_shape` guard is left untouched until MR-A.
- **The unenforced primary key is now immutable once set** (`lance::dataset::transaction`, ~L24722480: `if !primary_key_before.is_empty() && (writes_primary_key || primary_key_after != primary_key_before) → "the unenforced primary key is a reserved key and cannot be changed once set"`). omnigraph marks `__manifest.object_id` as the unenforced PK (`lance-schema:unenforced-primary-key`) for merge-insert row-level CAS — baked into `manifest_schema()` at init, and added by the `migrate_v1_to_v2` internal-schema migration for pre-v0.4.0 graphs. The migration relied on Lance 6's idempotent re-apply for crash-recovery (a crash after the field-set but before the stamp bump re-enters the migration with the PK already present); under v7 that re-apply errors, so a real v1 graph could never finish migrating. Fixed by guarding the set with `schema().unenforced_primary_key().is_empty()` (`db/manifest/migrations.rs`). Regression: `db::manifest::tests::test_publish_migrates_pre_stamp_manifest_to_current_version` (was red under v7).
- **Native `DirectoryNamespace` no longer recognizes omnigraph's manifest-tracked tables** (`lance-namespace-impls` dir.rs ~L1310): `list/describe/create_table_version` route through `check_table_status`, which reports an omnigraph table absent → `TableNotFound`. omnigraph production never uses Lance's native namespace (its publisher writes `__manifest` directly via merge_insert; its own `namespace.rs` impls are custom), so this is test-only — the `test_directory_namespace_direct_publish_cannot_replace_native_omnigraph_write_path` surface guard was realigned to the v7 behavior (it now asserts the native namespace is fully decoupled, which only strengthens the guard's thesis).
- **Still NOT fixed in 7.0.0:** vector-index two-phase (Lance #6666 open) — `create_vector_index` inline residual retained; blob-column compaction — `compact_files_still_fails_on_blob_columns` guard still red on a fix, `optimize` still skips blob tables behind `LANCE_SUPPORTS_BLOB_COMPACTION`.
- **No Lance-API surface omnigraph uses changed 6.0.1 → 7.0.0** (verified by a clean engine build; the only compile break was object_store). `CleanupPolicy`, `WriteParams` (apart from the `auto_cleanup` default), `CompactionOptions`, the namespace models (resolved via `lance-namespace-reqwest-client` 0.7.7, unchanged across the bump), `Operation`, `ManifestLocation`, and `MergeInsertBuilder` shapes are all stable.
- **No Lance API surface omnigraph uses changed at *compile* time** (the only compile break was object_store) — but **two runtime behaviors did** (the unenforced-PK immutability and the native-namespace `TableNotFound`, above), each caught by the full engine test suite rather than the build. `CleanupPolicy`, `WriteParams` (apart from the `auto_cleanup` default), `CompactionOptions`, the namespace models (resolved via `lance-namespace-reqwest-client` 0.7.7, unchanged across the bump), `Operation`, `ManifestLocation`, and `MergeInsertBuilder` shapes are all stable. Lesson: a clean build is not a clean alignment — run `cargo test --workspace` before declaring a Lance bump done.
Bump this date stanza on the next alignment pass.