omnigraph/docs/dev/lance.md

183 lines
13 KiB
Markdown
Raw Normal View History

# Lance Docs Index (for OmniGraph agents)
OmniGraph sits on top of Lance. Many problems — index lifecycle, branching, transactions, fragments, compaction, vector/FTS internals — are answered upstream in Lance's docs, not in this codebase.
This file is the curated entry point. **When you hit a Lance-shaped problem, find the matching topic below and fetch the listed URL(s) before guessing.** Don't grep our codebase for behavior that is documented authoritatively in Lance.
Base URL: `https://lance.org`. **Fetch the FULL page content, not summaries** — use `curl -sL <url> | pandoc -f html -t markdown` or paste the rendered page text manually. Tools that summarize pages (like Claude's `WebFetch`) routinely drop load-bearing details — defaults, `pub(crate)` blockers, sub-specs hidden behind navigation hubs. **Never act on a summarized fetch alone.** Keep this index curated to relevant material — the upstream sitemap has hundreds of URLs (notably the Namespace REST API model surface, Spark/Trino/Databricks integrations) that we don't use.
> **Substrate boundary check.** Before fetching, recall [docs/dev/invariants.md](invariants.md): if Lance already does the thing, we don't reimplement it. The most common reason to read these docs is to confirm a substrate behavior, not to learn what to clone.
## Quick-start (read these once per project)
| Read when | URL |
|---|---|
| Onboarding to Lance — concepts in 10 min | https://lance.org/quickstart/ |
| Onboarding to vector search | https://lance.org/quickstart/vector-search/ |
| Onboarding to full-text search | https://lance.org/quickstart/full-text-search/ |
| Onboarding to versioning / time travel | https://lance.org/quickstart/versioning/ |
| Lance's own AGENTS.md (its agent guide) | https://lance.org/format/AGENTS/ |
## By problem domain
### Storage format & file layout
Touching `db/manifest`, fragment lifecycle, dataset reconstruction, or anything that reads/writes raw Lance state.
| Topic | URL |
|---|---|
| Lance file format overview | https://lance.org/format/ |
| File-level format spec | https://lance.org/format/file/ |
| File encoding | https://lance.org/format/file/encoding/ |
| File-level versioning | https://lance.org/format/file/versioning/ |
| Table layout (fragments, manifest) | https://lance.org/format/table/layout/ |
| Table schema metadata | https://lance.org/format/table/schema/ |
| Table-level versioning | https://lance.org/format/table/versioning/ |
| Transactions (commit semantics, conflict types) | https://lance.org/format/table/transaction/ |
| MemWAL (durability story) | https://lance.org/format/table/mem_wal/ |
| Row-ID lineage (stable row IDs) | https://lance.org/format/table/row_id_lineage/ |
| Branches & tags (Lance native) | https://lance.org/format/table/branch_tag/ |
### Branching / tags / time travel
Touching graph-level branches, snapshots, run isolation, the commit graph.
| Topic | URL |
|---|---|
| Branch & tag format | https://lance.org/format/table/branch_tag/ |
| Tags & branches operational guide | https://lance.org/guide/tags_and_branches/ |
| Versioning quick-start | https://lance.org/quickstart/versioning/ |
| Table-level versioning spec | https://lance.org/format/table/versioning/ |
### Indexes
Adding/changing index types, fixing coverage, debugging FTS or vector recall, designing the reconciler.
| Topic | URL |
|---|---|
| Index spec overview | https://lance.org/format/table/index/ |
| BTREE scalar index | https://lance.org/format/table/index/scalar/btree/ |
| Bitmap scalar index | https://lance.org/format/table/index/scalar/bitmap/ |
| Bloom-filter scalar index | https://lance.org/format/table/index/scalar/bloom_filter/ |
| Label-list scalar index | https://lance.org/format/table/index/scalar/label_list/ |
| Zone-map scalar index | https://lance.org/format/table/index/scalar/zonemap/ |
| R-Tree scalar index (spatial) | https://lance.org/format/table/index/scalar/rtree/ |
| Full-text search (FTS) index | https://lance.org/format/table/index/scalar/fts/ |
| N-gram scalar index | https://lance.org/format/table/index/scalar/ngram/ |
| Vector index | https://lance.org/format/table/index/vector/ |
| Fragment-reuse system index | https://lance.org/format/table/index/system/frag_reuse/ |
| MemWAL system index | https://lance.org/format/table/index/system/mem_wal/ |
| HNSW Rust example | https://lance.org/examples/rust/hnsw/ |
| Distributed indexing | https://lance.org/guide/distributed_indexing/ |
| Tokenizer (FTS, n-gram) | https://lance.org/guide/tokenizer/ |
### Reads & writes
Touching the bulk loader, mutation execution, `merge_insert`, `WriteMode` selection.
| Topic | URL |
|---|---|
| Read-and-write guide | https://lance.org/guide/read_and_write/ |
| Distributed write | https://lance.org/guide/distributed_write/ |
| Rust example: write & read a dataset | https://lance.org/examples/rust/write_read_dataset/ |
### Schema evolution
Touching `apply_schema`, the migration planner, additive evolution.
| Topic | URL |
|---|---|
| Data-evolution guide | https://lance.org/guide/data_evolution/ |
| Migration guide | https://lance.org/guide/migration/ |
### Object store / S3
Touching `storage.rs`, S3-compatible backends (RustFS, MinIO), env vars.
| Topic | URL |
|---|---|
| Object-store guide | https://lance.org/guide/object_store/ |
### Data types
Touching schema-language scalar mappings, blob columns, JSON, list columns.
| Topic | URL |
|---|---|
| Data types overview | https://lance.org/guide/data_types/ |
| Arrays / list types | https://lance.org/guide/arrays/ |
| Blobs (LargeBinary) | https://lance.org/guide/blob/ |
| JSON | https://lance.org/guide/json/ |
### Performance & tuning
Optimizing scans, fragment counts, cache behavior, memory pool sizing.
| Topic | URL |
|---|---|
| Performance guide | https://lance.org/guide/performance/ |
### Compaction & cleanup
Touching `omnigraph optimize` / `cleanup`, the underlying `compact_files` / `cleanup_old_versions`.
| Topic | URL |
|---|---|
| Read-and-write guide (covers `compact_files`, `cleanup_old_versions`) | https://lance.org/guide/read_and_write/ |
| Performance (compaction tradeoffs) | https://lance.org/guide/performance/ |
| Fragment-reuse index | https://lance.org/format/table/index/system/frag_reuse/ |
### DataFusion integration
The runtime substrate that may carry our query execution. See [docs/dev/invariants.md](invariants.md): we don't rebuild relational machinery.
| Topic | URL |
|---|---|
| DataFusion integration | https://lance.org/integrations/datafusion/ |
### SDK reference
Looking up a specific Rust API (signature, return type, error variant).
| Topic | URL |
|---|---|
| SDK docs landing | https://lance.org/sdk_docs/ |
## What's not in this index (and why)
- **Namespace REST API model surface** (`/format/namespace/client/operations/models/...`) — hundreds of REST schema docs for the Lance Namespace catalog API. Omnigraph does not run a Lance Namespace server, so these are not reachable from our problem space.
- **Spark / Trino / Databricks / Dataproc / Hive / Glue / Polaris / Iceberg / Unity / OneLake / Gravitino integrations** — not part of OmniGraph's deployment surface.
- **Python / TF / PyTorch / Hugging Face / Ray integrations** — OmniGraph is Rust-only; Python notebooks aren't relevant.
- **Community / governance / release / voting / PMC pages** — meta, not technical.
If a future need pulls one of these into scope, add a row to the matching domain section above and link it from `AGENTS.md`'s topic index.
## Maintenance
When Lance ships a major release that changes any of the above (file format bump, new index type, transaction semantics change, new branching primitive), refresh this index in the same change as the omnigraph upgrade. Stale Lance pointers are worse than no pointers.
chore(lance): bump 4.0.0 → 6.0.1 (DataFusion 52→53, Arrow 57→58) (#111) * tests: add lance_surface_guards pre-flight pins for the v6 bump Land 8 named guards in a new test file that pin Lance API surfaces OmniGraph relies on. Each guard turns a silent-break risk (variant rename, struct restructure, async-flip) into a red CI bar instead of runtime drift. Guards (mapped to the silent-break inventory from the v6 migration plan): Runtime (#[tokio::test]): 1. lance_error_too_much_write_contention_variant_exists — pins the variant referenced by db/manifest/publisher.rs::map_lance_publish_error. 2. manifest_location_field_shape — pins .path/.size/.e_tag/.naming_scheme types and ManifestLocation accessor returning &Self (the access pattern at db/manifest/metadata.rs:84-88). 6. write_params_default_does_not_set_storage_version — confirms our explicit V2_2 pin remains load-bearing (blob v2 requirement). Compile-only async fns (#[allow(...)] + unimplemented!() placeholders; never run, but cargo build --tests enforces the API shape): 3. checkout_version + restore chain — pins the recovery rollback hammer at db/manifest/recovery.rs:505-522. 4. DatasetBuilder::from_namespace().with_branch().with_version().load() — pins the namespace builder chain at db/manifest/namespace.rs:162-174. 5. MergeInsertBuilder fluent chain — pins the manifest CAS at db/manifest/publisher.rs:370-391, including the return shape (Arc<Dataset>, MergeStats). 7. compact_files(&mut ds, CompactionOptions, None) — pins db/omnigraph/optimize.rs:107. 8. DeleteResult { new_dataset, num_deleted_rows } — pins the inline delete result shape (MR-A will repurpose this guard to the staged two-phase variant once Lance #6658 migration lands). This is commit 1 of the chore/lance-6.0.1 migration. Cargo bump follows in commit 2 (will trigger the guards under v6 if any surface drifted). Per the migration plan at ~/.claude/plans/shimmering-percolating-duckling.md (written this session). Two guards from the plan deferred to follow-up: - manifest_cas_returns_row_level_contention_variant (full publisher race integration test — needs harness scaffolding) - table_version_metadata_byte_compatible_with_v4 (TableVersionMetadata is pub(crate); requires test reach extension). Verified on v4: cargo test -p omnigraph-engine --test lance_surface_guards passes 3/3 runtime tests; cargo build -p omnigraph-engine --tests compiles all 5 compile-only guards clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(deps): bump Lance 4.0.0 → 6.0.1, DataFusion 52 → 53, Arrow 57 → 58 The Cargo bump itself. Source is intentionally untouched — this commit will not compile. The compile errors are the work-list for subsequent commits on this branch. Lance updates: lance + 7 sub-crates 4.0.0 → 6.0.1. Transitive churn: + lance-tokenizer v6.0.1 (vendored tokenizer per Lance PR #6512) + object_store 0.13.x (Lance 6 brings it transitively; our explicit pin stays at 0.12.5 for now — revisit in stages if diamond bites) - tantivy* crates (replaced by lance-tokenizer) Compile error landscape on this commit (11 errors): • 1× E0432: `lance_index::DatasetIndexExt` import (Lance PR #6280 moved it to lance::index). Sites: table_store.rs:20, db/manifest.rs:37 (the second site was missed by the pre-flight inventory). • 8× E0599: `create_index_builder` / `load_indices` missing on `lance::Dataset` — all downstream of the DatasetIndexExt move. Once the import is corrected on table_store.rs and db/manifest.rs, these resolve automatically. • 2× E0063: missing field `is_only_declared` in `DescribeTableResponse` initializer at db/manifest/namespace.rs:221, 364. New Lance namespace field per the v5 namespace restructure (PR #6186). Surface guards (lance_surface_guards.rs, commit d571fa8) all still compile + the 3 runtime ones pass on v6 — none of the silent-break surfaces drifted. That's the load-bearing observation: the publisher CAS chain, ManifestLocation field shape, checkout_version/restore, DatasetBuilder fluent chain, MergeInsertBuilder return shape, WriteParams::default, compact_files signature, and DeleteResult fields are all v6-stable. Next commits address the 11 errors per the migration plan stages 3-8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * imports: move DatasetIndexExt to lance::index (Lance PR #6280) Lance 5.0 (PR #6280) moved `DatasetIndexExt` out of `lance-index` into `lance::index`. `is_system_index` and `IndexType` stayed in `lance-index`. Mechanical update of 6 import sites: crates/omnigraph/src/table_store.rs:20 — split into two `use` lines crates/omnigraph-server/tests/server.rs:10 — was traits::DatasetIndexExt crates/omnigraph/tests/search.rs:6 crates/omnigraph/tests/branching.rs:7 crates/omnigraph/tests/failpoints.rs:467 crates/omnigraph-cli/tests/cli.rs:3 — was traits::DatasetIndexExt All 9 E0599 cascading errors on .create_index_builder / .load_indices resolve once the trait is back in scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * namespace: add is_only_declared field to DescribeTableResponse Lance namespace 6.0.0 added `is_only_declared: Option<bool>` to `DescribeTableResponse` (lance-namespace-reqwest-client 0.7+ via the v5.0 namespace API restructure, Lance PR #6186). Set to `Some(false)` because every table BranchManifestNamespace returns from describe_table is materialized — the manifest snapshot only includes entries for tables we've already opened via Dataset::open. Two sites in db/manifest/namespace.rs (BranchManifestNamespace + StagedTableNamespace impls of LanceNamespace::describe_table). Closes the last two compile errors from the v6 bump in the engine lib. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cargo: add lance to omnigraph-cli + omnigraph-server dev-deps Stage 3 moved DatasetIndexExt imports from `lance-index` to `lance::index` in the cli and server test crates. Both crates only had `lance-index` in their dev-dependencies; add `lance` alongside so the new path resolves. This is the last compile-error fix from the v6 bump — `cargo build --workspace --tests` is now green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: refresh Lance alignment audit for v6.0.1; bump surveyed version Per CLAUDE.md maintenance rule 2 (same-PR docs): - docs/dev/lance.md: replace the v4.0.1 alignment audit stanza with the v6.0.1 audit. Captures every v5/v6 finding from this PR (the DatasetIndexExt move, DescribeTableResponse.is_only_declared, MergeInsertBuilder return shape, ManifestLocation field shape, LanceFileVersion::default flip, file-reader async, tokenizer vendor, Lance #6658/#6666/#6877 status). Cross-references each guard in tests/lance_surface_guards.rs. - AGENTS.md: bump "Storage substrate: Lance 4.x" → "Lance 6.x". Note: surveyed crate version stays at 0.4.2 — substrate version bumps are independent of OmniGraph's release version. - crates/omnigraph/src/storage_layer.rs: update the trait module-level doc-comment to reflect that Lance #6658 closed 2026-05-14 and delete_where two-phase migration is MR-A (the next follow-up). #6666 stays open; create_vector_index inline residual stays. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tests: silence clippy::diverging_sub_expression on compile-only guards The five `_compile_*` async fns in lance_surface_guards.rs use `let ds: Dataset = unimplemented!()` as a placeholder so type inference can chase the method chain we want to pin, without ever running the function. Clippy's `diverging_sub_expression` lint flags this pattern because the RHS diverges; that's the entire point. Added to the per-fn `#[allow(...)]` list, alongside dead_code / unreachable_code / unused_variables / unused_mut already there. No behavior change. cargo test -p omnigraph-engine --test lance_surface_guards still 3/3 green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: correct #6658 status — closed but API ships in Lance v7.x, not v6.0.1 The audit stanza in docs/dev/lance.md and the storage_layer.rs trait doc-comment both implied the public DeleteBuilder::execute_uncommitted API shipped with Lance 6.0.1. It did not. Issue #6658 closed 2026-05-14, but binary search across the release stream confirms: v6.0.1 ❌ no pub async fn execute_uncommitted on DeleteBuilder v6.1.0-rc.1 ❌ v7.0.0-beta.5 ❌ v7.0.0-beta.10 ✅ first appearance v7.0.0-rc.1 ✅ So MR-A (delete two-phase migration) is gated on the Lance v7.x bump, not on this PR. v7.0.0-rc.1 dropped 2026-05-21; GA likely within a week. No behavior change. Doc-only correction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(lib): bump recursion_limit to 256 — Lance 6 trait depth on Linux Lance 6's heavier trait surface around futures/streams in storage_layer.rs's staged-write API pushes the rustc trait-resolution recursion limit past the default 128 on Linux builds. CI on PR #111 surfaced this in both `Test Workspace` and `Test omnigraph-server --features aws`: error: queries overflow the depth limit! = help: consider increasing the recursion limit by adding a `#![recursion_limit = "256"]` attribute to your crate (`omnigraph`) = note: query depth increased by 130 when computing layout of `{async block@crates/omnigraph/src/storage_layer.rs:697:5: 697:10}` (The async block is `stage_create_btree_index`'s body — its return type is several layers of `impl Future<Output=Result<StagedHandle>>` deep on top of Lance's own builder return types.) Local macOS builds happened to short-circuit before tripping the limit, which is why this didn't surface during the v6 bump sequence. The fix rustc itself suggests is one line at the crate root. No behavior change. Revisit if a future Lance bump stops needing it. Verified: `cargo build --locked -p omnigraph-server --features aws` compiles clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:42:29 +01:00
### Last alignment audit: 2026-05-22 (Lance 6.0.1 upstream; omnigraph pinned at 6.0.1)
Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53, Arrow 57 → 58, lance-tokenizer 6.0.1 added, tantivy* removed). Direct 4 → 6 jump; v5.x was not used as an intermediate (rationale in `~/.claude/plans/shimmering-percolating-duckling.md`). Behavior-affecting findings:
- **DatasetIndexExt moved** from `lance-index` to `lance::index` (Lance PR #6280, v5.0). Six import sites updated. `lance-index::IndexType` and `lance-index::is_system_index` stayed in `lance-index`. `omnigraph-cli` and `omnigraph-server` gained `lance = { workspace = true }` in their dev-dependencies.
- **`DescribeTableResponse` gained `is_only_declared: Option<bool>`** (lance-namespace 6.0+, v5.0 PR #6186). Set to `Some(false)` in both `BranchManifestNamespace::describe_table` and `StagedTableNamespace::describe_table` — every table we return is physically materialized via `Dataset::open`, never "declared-only."
- **`MergeInsertBuilder` execute_reader return shape preserved** `(Arc<Dataset>, MergeStats)`; the publisher CAS chain at `db/manifest/publisher.rs:370-391` works unchanged. Pinned by `tests/lance_surface_guards.rs::_compile_merge_insert_builder_method_chain`.
- **`LanceError::TooMuchWriteContention` variant retained** in v6.0.1 (no rename). The typed publisher translation at `db/manifest/publisher.rs:417-430` continues to apply. Pinned by `lance_surface_guards.rs::lance_error_too_much_write_contention_variant_exists`.
- **`ManifestLocation` field shape stable**: `.path: object_store::path::Path`, `.size: Option<u64>`, `.e_tag: Option<String>`, `.naming_scheme: ManifestNamingScheme`. Pinned by `lance_surface_guards.rs::manifest_location_field_shape`.
- **`LanceFileVersion::default()` flipped V2_0 → V2_1** (v5.0). No effect — every `data_storage_version` callsite explicitly pins `Some(LanceFileVersion::V2_2)` (load-bearing for blob v2: `Blob v2 requires file version >= 2.2` enforced in `lance/src/dataset/write.rs:748`).
- **`Dataset::checkout_version(N).await?.restore().await?`**: `restore()` takes `&mut self` and returns `Result<()>` (mutates in place, does not consume + return a new dataset). The recovery rollback hammer at `db/manifest/recovery.rs:505-522` continues to work. Pinned by `lance_surface_guards.rs::_compile_checkout_version_then_restore_signature`.
- **`DatasetBuilder::from_namespace(...).with_branch(...).with_version(...).load()`** surface preserved (the namespace builder chain at `db/manifest/namespace.rs:162-174`). Pinned by `lance_surface_guards.rs::_compile_dataset_builder_from_namespace_signature`.
- **`compact_files(&mut ds, CompactionOptions::default(), None)`** signature stable. `CompactionOptions` still does not expose `data_storage_version`; `compact_files` builds its own `WriteParams { ..Default::default() }`. Note: `LanceFileVersion::default()` is now V2_1 in v6, so optimize-rewritten fragments come out at V2_1 by default (was V2_0 in v4). Existing explicit V2_2 pins on creates/appends still apply.
- **`Dataset::delete(predicate)` returns `DeleteResult { new_dataset: Arc<Dataset>, num_deleted_rows: u64 }`** — unchanged shape. Pinned by `lance_surface_guards.rs::_compile_delete_result_field_shape`. MR-A will repurpose this guard to the staged two-phase variant once `DeleteBuilder::execute_uncommitted` migration lands.
- **File reader read methods now async** (Lance PR #6710, v6.0). No effect — omnigraph reaches Lance exclusively through `Dataset::scan` and the staged-write API.
- **Tokenizer vendored as `lance-tokenizer`** (Lance PR #6512, v6.0). No effect — no direct tokenizer imports.
- **Lance #6658 closed** (2026-05-14) but `DeleteBuilder::execute_uncommitted` did **not** ship in v6.0.1 — binary search across the release stream shows it first appears in `v7.0.0-beta.10` (the closing commits landed on main but didn't backport to the 6.x line). Tracked as MR-A: migrate `delete_where` to staged, retire the parse-time D2 mutation rule, extend recovery sidecar coverage. **Gated on the Lance v7.x bump**, not this PR. v7.0.0-rc.1 dropped 2026-05-21.
- **Lance #6666 still open** (`build_index_metadata_from_segments` public): vector-index two-phase blocked; inline `create_vector_index` residual retained.
- **Lance #6877 still open** (`MergeInsertBuilder` dup-rowid): PR #109's `SourceDedupeBehavior::FirstSeen` + `check_batch_unique_by_keys` precondition stay load-bearing.
fix(branch): make branch delete correct under partial failure (#137) * test(lance): pin force_delete_branch surface guard Pin the Lance 6.0.1 force_delete_branch behavior the branch-delete single-authority redesign relies on: plain delete_branch errors on a missing ref, force_delete_branch removes an existing forked branch, and the local-store quirk where force_delete on a fully-absent branch still errors (worked around by the upcoming TableStore::force_delete_branch). Re-pin the docs/dev/lance.md alignment stanza (9 guards; 4 runtime). * feat(storage): add force branch-delete to TableStore + CommitGraph Add TableStore::force_delete_branch and CommitGraph::force_delete_branch (idempotent: tolerate an already-absent branch via Lance RefNotFound / NotFound), plus CommitGraph::list_branches for the cleanup reconciler to diff against the manifest authority. RefConflict (referencing descendants) is still surfaced. Unused until the branch-delete rewire. * test(maintenance): red — cleanup reconciles orphaned branch forks Forge a Lance branch on the Person table that the manifest never references (a zombie fork from an incomplete prior delete) and assert cleanup reclaims it while leaving main intact. Fails today: cleanup does not yet reconcile orphaned forks. Goes green with the next commit. * fix(maintenance): reconcile orphaned branch forks in cleanup Add reconcile_orphaned_branches: force_delete_branch every per-table and commit-graph Lance branch absent from the manifest branch set (the authority), children-before-parents. Folded into cleanup_all_tables, runs before version GC. Idempotent and authority-derived; no-ops once nothing is orphaned, and would harmlessly find nothing if a future Lance atomic multi-dataset branch op prevented orphans. Adds TableStore::list_branches and exposes graph_commits_uri(pub crate). Turns the maintenance red test green. * test(failpoints): red — branch_delete partial failure converges Add the branch_delete.before_table_cleanup failpoint hook (inert without the feature) and a regression test: a cleanup-step failure after the manifest authority flip must leave branch_delete returning Ok, the branch gone, the orphan stranded, then reclaimed by cleanup, and the name reusable. Fails today: cleanup_deleted_branch_tables propagates the error as a hard failure. Goes green with the next commit. * fix(branch): best-effort fork reclaim after the manifest flip Make branch_delete treat per-table forks and the commit-graph branch as derived state reclaimed best-effort with force_delete_branch after the manifest authority flip. A reclaim failure (transient error, or the branch_delete.before_table_cleanup failpoint) is logged via tracing::warn and swallowed: the branch is already gone and the cleanup reconciler converges the orphan. cleanup_deleted_branch_tables no longer returns an error or blocks the call. Turns the partial-failure recovery test green. * test(failpoints): red — recreate over orphaned fork is actionable After a partial-failure delete leaves a fork orphaned, recreating the branch name and writing to the previously-forked table before cleanup runs currently surfaces the opaque ExpectedVersionMismatch ("stale view ... expected manifest table version N"). Assert instead a clear error pointing the user at cleanup. Goes green with the next commit. * fix(branch): actionable orphan-collision error in fork_branch_from_state When a fork's create_branch collides with an existing target ref, reuse it only if its head matches source_version (a legitimate concurrent first-write). A version mismatch means a zombie fork from an incomplete prior delete: return a manifest_conflict pointing the user at `omnigraph cleanup`, instead of the opaque ExpectedVersionMismatch. Turns the recreate-over-orphan red test green. * docs(invariants): single-authority branch-lifecycle + Lance forward-compat Record branch delete in the Current Truth Matrix: manifest is the single authority flipped atomically first, per-table forks + commit-graph branch are derived state reclaimed best-effort with the cleanup reconciler as backstop, and reusing a name whose reclaim failed surfaces an actionable error. Note the reconciler is authority-derived and degrades to a no-op under a future Lance atomic multi-dataset branch op, the same shape as invariant 7. * test(failpoints): red — cleanup isolates a single-table failure Add the cleanup.table_gc failpoint hook (inert without the feature) and an error: Option<String> field on TableCleanupStats (mechanical, always None for now). Regression test: a one-shot version-GC failure for one table must not abort the whole cleanup — assert cleanup still succeeds, surfaces the failure per-table in stats, and the independent reconcile pass still reclaimed an orphan. Fails today: the version-GC collect aborts on the first table error. Goes green with the next commit. * fix(maintenance): fault-isolate cleanup per table Make the cleanup sweep do as much as it can and converge on re-run instead of aborting wholesale on one table's transient error (invariant 13). The version-GC loop now records a per-table failure on its stats row (error: Some) and logs it rather than collecting into a Result that aborts; reconcile_orphaned_branches isolates per-table and commit-graph failures into BranchReconcileStats.failures. The CLI reports any failed tables and tells the user to rerun cleanup. Addresses the Devin review finding. Turns the single-table-failure test green. * test(failpoints): red — branch_create heals commit-graph zombie + is atomic Add the branch_delete.before_commit_graph_reclaim failpoint hook and two regression tests: (a) recreating a name whose delete left a commit-graph zombie must succeed (today it dies on Lance's internal Clone error), and (b) branch_create must roll back the manifest branch when the derived commit-graph branch fails (today it leaves the manifest branch created while returning Err). Both fail now; green with the next commit. The existing branch_create_failpoint_triggers test still passes. * fix(branch): make branch_create atomic + heal commit-graph zombie branch_create now flips the manifest authority first, then creates the derived commit-graph branch in create_commit_graph_branch, force-dropping any orphaned commit-graph ref left by an incomplete prior delete (the manifest branch is fresh, so a same-named commit-graph branch is provably a zombie). If commit-graph creation fails, the manifest branch is rolled back so the name never half-exists. Addresses the Codex review finding. Turns the two branch_create red tests green; existing tests unaffected. * test(failpoints): red — fork collision misclassifies live concurrent fork Add the fork.before_classify failpoint hook and a concurrency test: when a concurrent first-write legitimately wins the fork race, the loser must get a retryable refresh-and-retry, not the misleading run-cleanup orphan error. Today the version-comparison misclassifies the live fork as an orphan (the Cursor finding). Goes green with the next commit. * fix(branch): manifest-arbitrated fork-collision classification Classify a fork collision by the manifest authority instead of comparing Lance branch versions. Before forking, open_owned_dataset_for_branch_write re-reads the live manifest: if the table is already forked on the active branch, a concurrent first-write won and the loser gets a retryable refresh-and-retry (not a misleading orphan error). fork_branch_from_state no longer guesses from versions — a create collision past that check is an orphan, so it returns the actionable cleanup error. Addresses the Cursor finding; turns the live-concurrent-fork test green, zombie path unchanged. * test(failpoints): close branch-lifecycle test gaps Three coverage additions for the branch-delete work (behavior already correct; these lock it in and catch regressions): - cleanup_isolates_reconcile_failure: inject a force-delete failure into the reconcile loop (new cleanup.reconcile_fork hook) and assert the sweep continues + converges on re-run. Directly covers the reconcile loop the Devin finding was about (previously only version-GC was). - cleanup_reclaims_orphaned_commit_graph_branch: forge a commit-graph orphan via the delete reclaim failpoint and assert cleanup's reconcile_commit_graph_orphans drops it (previously untested). - fork_collision_with_live_concurrent_fork_is_retryable: replace the fixed 300ms sleep with a deterministic readiness signal (cfg_callback + compare_exchange atomics) so the two-writer ordering can't flake. Full failpoints suite 31/0.
2026-06-01 13:28:38 +02:00
- **`Dataset::force_delete_branch`** (`branches().delete(name, force=true)`, dataset.rs:524) tolerates a missing branch-*contents* ref (vs plain `delete_branch`'s `RefNotFound`), but on the local store still errors `NotFound` if the branch `tree/` directory is fully absent (`remove_dir_all`'s NotFound is not caught for Lance's native error variant, refs.rs:526-549). Both variants still refuse a branch with referencing descendants (`RefConflict`). `TableStore::force_delete_branch` wraps this to be fully idempotent (tolerates already-absent). The single-authority branch-delete redesign uses it for orphan reclamation (eager best-effort reclaim + cleanup reconciler). Pinned by `lance_surface_guards.rs::force_delete_branch_semantics`. Branch delete is "flip the ref atomically, then `remove_dir_all(tree/{branch})`"; branch-exclusive data lives under `tree/{branch}/` so a drop reclaims it immediately without touching `main`.
chore(lance): bump 4.0.0 → 6.0.1 (DataFusion 52→53, Arrow 57→58) (#111) * tests: add lance_surface_guards pre-flight pins for the v6 bump Land 8 named guards in a new test file that pin Lance API surfaces OmniGraph relies on. Each guard turns a silent-break risk (variant rename, struct restructure, async-flip) into a red CI bar instead of runtime drift. Guards (mapped to the silent-break inventory from the v6 migration plan): Runtime (#[tokio::test]): 1. lance_error_too_much_write_contention_variant_exists — pins the variant referenced by db/manifest/publisher.rs::map_lance_publish_error. 2. manifest_location_field_shape — pins .path/.size/.e_tag/.naming_scheme types and ManifestLocation accessor returning &Self (the access pattern at db/manifest/metadata.rs:84-88). 6. write_params_default_does_not_set_storage_version — confirms our explicit V2_2 pin remains load-bearing (blob v2 requirement). Compile-only async fns (#[allow(...)] + unimplemented!() placeholders; never run, but cargo build --tests enforces the API shape): 3. checkout_version + restore chain — pins the recovery rollback hammer at db/manifest/recovery.rs:505-522. 4. DatasetBuilder::from_namespace().with_branch().with_version().load() — pins the namespace builder chain at db/manifest/namespace.rs:162-174. 5. MergeInsertBuilder fluent chain — pins the manifest CAS at db/manifest/publisher.rs:370-391, including the return shape (Arc<Dataset>, MergeStats). 7. compact_files(&mut ds, CompactionOptions, None) — pins db/omnigraph/optimize.rs:107. 8. DeleteResult { new_dataset, num_deleted_rows } — pins the inline delete result shape (MR-A will repurpose this guard to the staged two-phase variant once Lance #6658 migration lands). This is commit 1 of the chore/lance-6.0.1 migration. Cargo bump follows in commit 2 (will trigger the guards under v6 if any surface drifted). Per the migration plan at ~/.claude/plans/shimmering-percolating-duckling.md (written this session). Two guards from the plan deferred to follow-up: - manifest_cas_returns_row_level_contention_variant (full publisher race integration test — needs harness scaffolding) - table_version_metadata_byte_compatible_with_v4 (TableVersionMetadata is pub(crate); requires test reach extension). Verified on v4: cargo test -p omnigraph-engine --test lance_surface_guards passes 3/3 runtime tests; cargo build -p omnigraph-engine --tests compiles all 5 compile-only guards clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(deps): bump Lance 4.0.0 → 6.0.1, DataFusion 52 → 53, Arrow 57 → 58 The Cargo bump itself. Source is intentionally untouched — this commit will not compile. The compile errors are the work-list for subsequent commits on this branch. Lance updates: lance + 7 sub-crates 4.0.0 → 6.0.1. Transitive churn: + lance-tokenizer v6.0.1 (vendored tokenizer per Lance PR #6512) + object_store 0.13.x (Lance 6 brings it transitively; our explicit pin stays at 0.12.5 for now — revisit in stages if diamond bites) - tantivy* crates (replaced by lance-tokenizer) Compile error landscape on this commit (11 errors): • 1× E0432: `lance_index::DatasetIndexExt` import (Lance PR #6280 moved it to lance::index). Sites: table_store.rs:20, db/manifest.rs:37 (the second site was missed by the pre-flight inventory). • 8× E0599: `create_index_builder` / `load_indices` missing on `lance::Dataset` — all downstream of the DatasetIndexExt move. Once the import is corrected on table_store.rs and db/manifest.rs, these resolve automatically. • 2× E0063: missing field `is_only_declared` in `DescribeTableResponse` initializer at db/manifest/namespace.rs:221, 364. New Lance namespace field per the v5 namespace restructure (PR #6186). Surface guards (lance_surface_guards.rs, commit d571fa8) all still compile + the 3 runtime ones pass on v6 — none of the silent-break surfaces drifted. That's the load-bearing observation: the publisher CAS chain, ManifestLocation field shape, checkout_version/restore, DatasetBuilder fluent chain, MergeInsertBuilder return shape, WriteParams::default, compact_files signature, and DeleteResult fields are all v6-stable. Next commits address the 11 errors per the migration plan stages 3-8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * imports: move DatasetIndexExt to lance::index (Lance PR #6280) Lance 5.0 (PR #6280) moved `DatasetIndexExt` out of `lance-index` into `lance::index`. `is_system_index` and `IndexType` stayed in `lance-index`. Mechanical update of 6 import sites: crates/omnigraph/src/table_store.rs:20 — split into two `use` lines crates/omnigraph-server/tests/server.rs:10 — was traits::DatasetIndexExt crates/omnigraph/tests/search.rs:6 crates/omnigraph/tests/branching.rs:7 crates/omnigraph/tests/failpoints.rs:467 crates/omnigraph-cli/tests/cli.rs:3 — was traits::DatasetIndexExt All 9 E0599 cascading errors on .create_index_builder / .load_indices resolve once the trait is back in scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * namespace: add is_only_declared field to DescribeTableResponse Lance namespace 6.0.0 added `is_only_declared: Option<bool>` to `DescribeTableResponse` (lance-namespace-reqwest-client 0.7+ via the v5.0 namespace API restructure, Lance PR #6186). Set to `Some(false)` because every table BranchManifestNamespace returns from describe_table is materialized — the manifest snapshot only includes entries for tables we've already opened via Dataset::open. Two sites in db/manifest/namespace.rs (BranchManifestNamespace + StagedTableNamespace impls of LanceNamespace::describe_table). Closes the last two compile errors from the v6 bump in the engine lib. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cargo: add lance to omnigraph-cli + omnigraph-server dev-deps Stage 3 moved DatasetIndexExt imports from `lance-index` to `lance::index` in the cli and server test crates. Both crates only had `lance-index` in their dev-dependencies; add `lance` alongside so the new path resolves. This is the last compile-error fix from the v6 bump — `cargo build --workspace --tests` is now green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: refresh Lance alignment audit for v6.0.1; bump surveyed version Per CLAUDE.md maintenance rule 2 (same-PR docs): - docs/dev/lance.md: replace the v4.0.1 alignment audit stanza with the v6.0.1 audit. Captures every v5/v6 finding from this PR (the DatasetIndexExt move, DescribeTableResponse.is_only_declared, MergeInsertBuilder return shape, ManifestLocation field shape, LanceFileVersion::default flip, file-reader async, tokenizer vendor, Lance #6658/#6666/#6877 status). Cross-references each guard in tests/lance_surface_guards.rs. - AGENTS.md: bump "Storage substrate: Lance 4.x" → "Lance 6.x". Note: surveyed crate version stays at 0.4.2 — substrate version bumps are independent of OmniGraph's release version. - crates/omnigraph/src/storage_layer.rs: update the trait module-level doc-comment to reflect that Lance #6658 closed 2026-05-14 and delete_where two-phase migration is MR-A (the next follow-up). #6666 stays open; create_vector_index inline residual stays. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tests: silence clippy::diverging_sub_expression on compile-only guards The five `_compile_*` async fns in lance_surface_guards.rs use `let ds: Dataset = unimplemented!()` as a placeholder so type inference can chase the method chain we want to pin, without ever running the function. Clippy's `diverging_sub_expression` lint flags this pattern because the RHS diverges; that's the entire point. Added to the per-fn `#[allow(...)]` list, alongside dead_code / unreachable_code / unused_variables / unused_mut already there. No behavior change. cargo test -p omnigraph-engine --test lance_surface_guards still 3/3 green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: correct #6658 status — closed but API ships in Lance v7.x, not v6.0.1 The audit stanza in docs/dev/lance.md and the storage_layer.rs trait doc-comment both implied the public DeleteBuilder::execute_uncommitted API shipped with Lance 6.0.1. It did not. Issue #6658 closed 2026-05-14, but binary search across the release stream confirms: v6.0.1 ❌ no pub async fn execute_uncommitted on DeleteBuilder v6.1.0-rc.1 ❌ v7.0.0-beta.5 ❌ v7.0.0-beta.10 ✅ first appearance v7.0.0-rc.1 ✅ So MR-A (delete two-phase migration) is gated on the Lance v7.x bump, not on this PR. v7.0.0-rc.1 dropped 2026-05-21; GA likely within a week. No behavior change. Doc-only correction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(lib): bump recursion_limit to 256 — Lance 6 trait depth on Linux Lance 6's heavier trait surface around futures/streams in storage_layer.rs's staged-write API pushes the rustc trait-resolution recursion limit past the default 128 on Linux builds. CI on PR #111 surfaced this in both `Test Workspace` and `Test omnigraph-server --features aws`: error: queries overflow the depth limit! = help: consider increasing the recursion limit by adding a `#![recursion_limit = "256"]` attribute to your crate (`omnigraph`) = note: query depth increased by 130 when computing layout of `{async block@crates/omnigraph/src/storage_layer.rs:697:5: 697:10}` (The async block is `stage_create_btree_index`'s body — its return type is several layers of `impl Future<Output=Result<StagedHandle>>` deep on top of Lance's own builder return types.) Local macOS builds happened to short-circuit before tripping the limit, which is why this didn't surface during the v6 bump sequence. The fix rustc itself suggests is one line at the crate root. No behavior change. Revisit if a future Lance bump stops needing it. Verified: `cargo build --locked -p omnigraph-server --features aws` compiles clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:42:29 +01:00
fix(branch): make branch delete correct under partial failure (#137) * test(lance): pin force_delete_branch surface guard Pin the Lance 6.0.1 force_delete_branch behavior the branch-delete single-authority redesign relies on: plain delete_branch errors on a missing ref, force_delete_branch removes an existing forked branch, and the local-store quirk where force_delete on a fully-absent branch still errors (worked around by the upcoming TableStore::force_delete_branch). Re-pin the docs/dev/lance.md alignment stanza (9 guards; 4 runtime). * feat(storage): add force branch-delete to TableStore + CommitGraph Add TableStore::force_delete_branch and CommitGraph::force_delete_branch (idempotent: tolerate an already-absent branch via Lance RefNotFound / NotFound), plus CommitGraph::list_branches for the cleanup reconciler to diff against the manifest authority. RefConflict (referencing descendants) is still surfaced. Unused until the branch-delete rewire. * test(maintenance): red — cleanup reconciles orphaned branch forks Forge a Lance branch on the Person table that the manifest never references (a zombie fork from an incomplete prior delete) and assert cleanup reclaims it while leaving main intact. Fails today: cleanup does not yet reconcile orphaned forks. Goes green with the next commit. * fix(maintenance): reconcile orphaned branch forks in cleanup Add reconcile_orphaned_branches: force_delete_branch every per-table and commit-graph Lance branch absent from the manifest branch set (the authority), children-before-parents. Folded into cleanup_all_tables, runs before version GC. Idempotent and authority-derived; no-ops once nothing is orphaned, and would harmlessly find nothing if a future Lance atomic multi-dataset branch op prevented orphans. Adds TableStore::list_branches and exposes graph_commits_uri(pub crate). Turns the maintenance red test green. * test(failpoints): red — branch_delete partial failure converges Add the branch_delete.before_table_cleanup failpoint hook (inert without the feature) and a regression test: a cleanup-step failure after the manifest authority flip must leave branch_delete returning Ok, the branch gone, the orphan stranded, then reclaimed by cleanup, and the name reusable. Fails today: cleanup_deleted_branch_tables propagates the error as a hard failure. Goes green with the next commit. * fix(branch): best-effort fork reclaim after the manifest flip Make branch_delete treat per-table forks and the commit-graph branch as derived state reclaimed best-effort with force_delete_branch after the manifest authority flip. A reclaim failure (transient error, or the branch_delete.before_table_cleanup failpoint) is logged via tracing::warn and swallowed: the branch is already gone and the cleanup reconciler converges the orphan. cleanup_deleted_branch_tables no longer returns an error or blocks the call. Turns the partial-failure recovery test green. * test(failpoints): red — recreate over orphaned fork is actionable After a partial-failure delete leaves a fork orphaned, recreating the branch name and writing to the previously-forked table before cleanup runs currently surfaces the opaque ExpectedVersionMismatch ("stale view ... expected manifest table version N"). Assert instead a clear error pointing the user at cleanup. Goes green with the next commit. * fix(branch): actionable orphan-collision error in fork_branch_from_state When a fork's create_branch collides with an existing target ref, reuse it only if its head matches source_version (a legitimate concurrent first-write). A version mismatch means a zombie fork from an incomplete prior delete: return a manifest_conflict pointing the user at `omnigraph cleanup`, instead of the opaque ExpectedVersionMismatch. Turns the recreate-over-orphan red test green. * docs(invariants): single-authority branch-lifecycle + Lance forward-compat Record branch delete in the Current Truth Matrix: manifest is the single authority flipped atomically first, per-table forks + commit-graph branch are derived state reclaimed best-effort with the cleanup reconciler as backstop, and reusing a name whose reclaim failed surfaces an actionable error. Note the reconciler is authority-derived and degrades to a no-op under a future Lance atomic multi-dataset branch op, the same shape as invariant 7. * test(failpoints): red — cleanup isolates a single-table failure Add the cleanup.table_gc failpoint hook (inert without the feature) and an error: Option<String> field on TableCleanupStats (mechanical, always None for now). Regression test: a one-shot version-GC failure for one table must not abort the whole cleanup — assert cleanup still succeeds, surfaces the failure per-table in stats, and the independent reconcile pass still reclaimed an orphan. Fails today: the version-GC collect aborts on the first table error. Goes green with the next commit. * fix(maintenance): fault-isolate cleanup per table Make the cleanup sweep do as much as it can and converge on re-run instead of aborting wholesale on one table's transient error (invariant 13). The version-GC loop now records a per-table failure on its stats row (error: Some) and logs it rather than collecting into a Result that aborts; reconcile_orphaned_branches isolates per-table and commit-graph failures into BranchReconcileStats.failures. The CLI reports any failed tables and tells the user to rerun cleanup. Addresses the Devin review finding. Turns the single-table-failure test green. * test(failpoints): red — branch_create heals commit-graph zombie + is atomic Add the branch_delete.before_commit_graph_reclaim failpoint hook and two regression tests: (a) recreating a name whose delete left a commit-graph zombie must succeed (today it dies on Lance's internal Clone error), and (b) branch_create must roll back the manifest branch when the derived commit-graph branch fails (today it leaves the manifest branch created while returning Err). Both fail now; green with the next commit. The existing branch_create_failpoint_triggers test still passes. * fix(branch): make branch_create atomic + heal commit-graph zombie branch_create now flips the manifest authority first, then creates the derived commit-graph branch in create_commit_graph_branch, force-dropping any orphaned commit-graph ref left by an incomplete prior delete (the manifest branch is fresh, so a same-named commit-graph branch is provably a zombie). If commit-graph creation fails, the manifest branch is rolled back so the name never half-exists. Addresses the Codex review finding. Turns the two branch_create red tests green; existing tests unaffected. * test(failpoints): red — fork collision misclassifies live concurrent fork Add the fork.before_classify failpoint hook and a concurrency test: when a concurrent first-write legitimately wins the fork race, the loser must get a retryable refresh-and-retry, not the misleading run-cleanup orphan error. Today the version-comparison misclassifies the live fork as an orphan (the Cursor finding). Goes green with the next commit. * fix(branch): manifest-arbitrated fork-collision classification Classify a fork collision by the manifest authority instead of comparing Lance branch versions. Before forking, open_owned_dataset_for_branch_write re-reads the live manifest: if the table is already forked on the active branch, a concurrent first-write won and the loser gets a retryable refresh-and-retry (not a misleading orphan error). fork_branch_from_state no longer guesses from versions — a create collision past that check is an orphan, so it returns the actionable cleanup error. Addresses the Cursor finding; turns the live-concurrent-fork test green, zombie path unchanged. * test(failpoints): close branch-lifecycle test gaps Three coverage additions for the branch-delete work (behavior already correct; these lock it in and catch regressions): - cleanup_isolates_reconcile_failure: inject a force-delete failure into the reconcile loop (new cleanup.reconcile_fork hook) and assert the sweep continues + converges on re-run. Directly covers the reconcile loop the Devin finding was about (previously only version-GC was). - cleanup_reclaims_orphaned_commit_graph_branch: forge a commit-graph orphan via the delete reclaim failpoint and assert cleanup's reconcile_commit_graph_orphans drops it (previously untested). - fork_collision_with_live_concurrent_fork_is_retryable: replace the fixed 300ms sleep with a deterministic readiness signal (cfg_callback + compare_exchange atomics) so the two-writer ordering can't flake. Full failpoints suite 31/0.
2026-06-01 13:28:38 +02:00
Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (9 named guards; 4 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension).
Bump this date stanza on the next alignment pass.