omnigraph/docs/dev/lance.md
Ragnor Comerford 353c0c876a
fix(branch): make branch delete correct under partial failure (#137)
* test(lance): pin force_delete_branch surface guard

Pin the Lance 6.0.1 force_delete_branch behavior the branch-delete
single-authority redesign relies on: plain delete_branch errors on a
missing ref, force_delete_branch removes an existing forked branch, and
the local-store quirk where force_delete on a fully-absent branch still
errors (worked around by the upcoming TableStore::force_delete_branch).

Re-pin the docs/dev/lance.md alignment stanza (9 guards; 4 runtime).

* feat(storage): add force branch-delete to TableStore + CommitGraph

Add TableStore::force_delete_branch and CommitGraph::force_delete_branch
(idempotent: tolerate an already-absent branch via Lance RefNotFound /
NotFound), plus CommitGraph::list_branches for the cleanup reconciler to
diff against the manifest authority. RefConflict (referencing
descendants) is still surfaced. Unused until the branch-delete rewire.

* test(maintenance): red — cleanup reconciles orphaned branch forks

Forge a Lance branch on the Person table that the manifest never
references (a zombie fork from an incomplete prior delete) and assert
cleanup reclaims it while leaving main intact. Fails today: cleanup does
not yet reconcile orphaned forks. Goes green with the next commit.

* fix(maintenance): reconcile orphaned branch forks in cleanup

Add reconcile_orphaned_branches: force_delete_branch every per-table and
commit-graph Lance branch absent from the manifest branch set (the
authority), children-before-parents. Folded into cleanup_all_tables,
runs before version GC. Idempotent and authority-derived; no-ops once
nothing is orphaned, and would harmlessly find nothing if a future Lance
atomic multi-dataset branch op prevented orphans. Adds TableStore::list_branches
and exposes graph_commits_uri(pub crate). Turns the maintenance red test green.

* test(failpoints): red — branch_delete partial failure converges

Add the branch_delete.before_table_cleanup failpoint hook (inert without
the feature) and a regression test: a cleanup-step failure after the
manifest authority flip must leave branch_delete returning Ok, the branch
gone, the orphan stranded, then reclaimed by cleanup, and the name
reusable. Fails today: cleanup_deleted_branch_tables propagates the error
as a hard failure. Goes green with the next commit.

* fix(branch): best-effort fork reclaim after the manifest flip

Make branch_delete treat per-table forks and the commit-graph branch as
derived state reclaimed best-effort with force_delete_branch after the
manifest authority flip. A reclaim failure (transient error, or the
branch_delete.before_table_cleanup failpoint) is logged via tracing::warn
and swallowed: the branch is already gone and the cleanup reconciler
converges the orphan. cleanup_deleted_branch_tables no longer returns an
error or blocks the call. Turns the partial-failure recovery test green.

* test(failpoints): red — recreate over orphaned fork is actionable

After a partial-failure delete leaves a fork orphaned, recreating the
branch name and writing to the previously-forked table before cleanup
runs currently surfaces the opaque ExpectedVersionMismatch ("stale view
... expected manifest table version N"). Assert instead a clear error
pointing the user at cleanup. Goes green with the next commit.

* fix(branch): actionable orphan-collision error in fork_branch_from_state

When a fork's create_branch collides with an existing target ref, reuse
it only if its head matches source_version (a legitimate concurrent
first-write). A version mismatch means a zombie fork from an incomplete
prior delete: return a manifest_conflict pointing the user at
`omnigraph cleanup`, instead of the opaque ExpectedVersionMismatch.
Turns the recreate-over-orphan red test green.

* docs(invariants): single-authority branch-lifecycle + Lance forward-compat

Record branch delete in the Current Truth Matrix: manifest is the single
authority flipped atomically first, per-table forks + commit-graph branch
are derived state reclaimed best-effort with the cleanup reconciler as
backstop, and reusing a name whose reclaim failed surfaces an actionable
error. Note the reconciler is authority-derived and degrades to a no-op
under a future Lance atomic multi-dataset branch op, the same shape as
invariant 7.

* test(failpoints): red — cleanup isolates a single-table failure

Add the cleanup.table_gc failpoint hook (inert without the feature) and
an error: Option<String> field on TableCleanupStats (mechanical, always
None for now). Regression test: a one-shot version-GC failure for one
table must not abort the whole cleanup — assert cleanup still succeeds,
surfaces the failure per-table in stats, and the independent reconcile
pass still reclaimed an orphan. Fails today: the version-GC collect
aborts on the first table error. Goes green with the next commit.

* fix(maintenance): fault-isolate cleanup per table

Make the cleanup sweep do as much as it can and converge on re-run
instead of aborting wholesale on one table's transient error
(invariant 13). The version-GC loop now records a per-table failure on
its stats row (error: Some) and logs it rather than collecting into a
Result that aborts; reconcile_orphaned_branches isolates per-table and
commit-graph failures into BranchReconcileStats.failures. The CLI reports
any failed tables and tells the user to rerun cleanup. Addresses the
Devin review finding. Turns the single-table-failure test green.

* test(failpoints): red — branch_create heals commit-graph zombie + is atomic

Add the branch_delete.before_commit_graph_reclaim failpoint hook and two
regression tests: (a) recreating a name whose delete left a commit-graph
zombie must succeed (today it dies on Lance's internal Clone error), and
(b) branch_create must roll back the manifest branch when the derived
commit-graph branch fails (today it leaves the manifest branch created
while returning Err). Both fail now; green with the next commit. The
existing branch_create_failpoint_triggers test still passes.

* fix(branch): make branch_create atomic + heal commit-graph zombie

branch_create now flips the manifest authority first, then creates the
derived commit-graph branch in create_commit_graph_branch, force-dropping
any orphaned commit-graph ref left by an incomplete prior delete (the
manifest branch is fresh, so a same-named commit-graph branch is provably
a zombie). If commit-graph creation fails, the manifest branch is rolled
back so the name never half-exists. Addresses the Codex review finding.
Turns the two branch_create red tests green; existing tests unaffected.

* test(failpoints): red — fork collision misclassifies live concurrent fork

Add the fork.before_classify failpoint hook and a concurrency test: when
a concurrent first-write legitimately wins the fork race, the loser must
get a retryable refresh-and-retry, not the misleading run-cleanup orphan
error. Today the version-comparison misclassifies the live fork as an
orphan (the Cursor finding). Goes green with the next commit.

* fix(branch): manifest-arbitrated fork-collision classification

Classify a fork collision by the manifest authority instead of comparing
Lance branch versions. Before forking, open_owned_dataset_for_branch_write
re-reads the live manifest: if the table is already forked on the active
branch, a concurrent first-write won and the loser gets a retryable
refresh-and-retry (not a misleading orphan error). fork_branch_from_state
no longer guesses from versions — a create collision past that check is
an orphan, so it returns the actionable cleanup error. Addresses the
Cursor finding; turns the live-concurrent-fork test green, zombie path
unchanged.

* test(failpoints): close branch-lifecycle test gaps

Three coverage additions for the branch-delete work (behavior already
correct; these lock it in and catch regressions):

- cleanup_isolates_reconcile_failure: inject a force-delete failure into
  the reconcile loop (new cleanup.reconcile_fork hook) and assert the
  sweep continues + converges on re-run. Directly covers the reconcile
  loop the Devin finding was about (previously only version-GC was).
- cleanup_reclaims_orphaned_commit_graph_branch: forge a commit-graph
  orphan via the delete reclaim failpoint and assert cleanup's
  reconcile_commit_graph_orphans drops it (previously untested).
- fork_collision_with_live_concurrent_fork_is_retryable: replace the
  fixed 300ms sleep with a deterministic readiness signal (cfg_callback +
  compare_exchange atomics) so the two-writer ordering can't flake.

Full failpoints suite 31/0.
2026-06-01 13:28:38 +02:00


Lance Docs Index (for OmniGraph agents)

OmniGraph sits on top of Lance. Many problems — index lifecycle, branching, transactions, fragments, compaction, vector/FTS internals — are answered upstream in Lance's docs, not in this codebase.

This file is the curated entry point. When you hit a Lance-shaped problem, find the matching topic below and fetch the listed URL(s) before guessing. Don't grep our codebase for behavior that is documented authoritatively in Lance.

Base URL: https://lance.org. Fetch the FULL page content, not summaries — use curl -sL <url> | pandoc -f html -t markdown or paste the rendered page text manually. Tools that summarize pages (like Claude's WebFetch) routinely drop load-bearing details — defaults, pub(crate) blockers, sub-specs hidden behind navigation hubs. Never act on a summarized fetch alone. Keep this index curated to relevant material — the upstream sitemap has hundreds of URLs (notably the Namespace REST API model surface, Spark/Trino/Databricks integrations) that we don't use.

Substrate boundary check. Before fetching, recall docs/dev/invariants.md: if Lance already does the thing, we don't reimplement it. The most common reason to read these docs is to confirm a substrate behavior, not to learn what to clone.

Quick-start (read these once per project)

| Read when | URL |
|---|---|
| Onboarding to Lance — concepts in 10 min | https://lance.org/quickstart/ |
| Onboarding to vector search | https://lance.org/quickstart/vector-search/ |
| Onboarding to full-text search | https://lance.org/quickstart/full-text-search/ |
| Onboarding to versioning / time travel | https://lance.org/quickstart/versioning/ |
| Lance's own AGENTS.md (its agent guide) | https://lance.org/format/AGENTS/ |

By problem domain

Storage format & file layout

Touching db/manifest, fragment lifecycle, dataset reconstruction, or anything that reads/writes raw Lance state.

| Topic | URL |
|---|---|
| Lance file format overview | https://lance.org/format/ |
| File-level format spec | https://lance.org/format/file/ |
| File encoding | https://lance.org/format/file/encoding/ |
| File-level versioning | https://lance.org/format/file/versioning/ |
| Table layout (fragments, manifest) | https://lance.org/format/table/layout/ |
| Table schema metadata | https://lance.org/format/table/schema/ |
| Table-level versioning | https://lance.org/format/table/versioning/ |
| Transactions (commit semantics, conflict types) | https://lance.org/format/table/transaction/ |
| MemWAL (durability story) | https://lance.org/format/table/mem_wal/ |
| Row-ID lineage (stable row IDs) | https://lance.org/format/table/row_id_lineage/ |
| Branches & tags (Lance native) | https://lance.org/format/table/branch_tag/ |

Branching / tags / time travel

Touching graph-level branches, snapshots, run isolation, the commit graph.

| Topic | URL |
|---|---|
| Branch & tag format | https://lance.org/format/table/branch_tag/ |
| Tags & branches operational guide | https://lance.org/guide/tags_and_branches/ |
| Versioning quick-start | https://lance.org/quickstart/versioning/ |
| Table-level versioning spec | https://lance.org/format/table/versioning/ |

Indexes

Adding/changing index types, fixing coverage, debugging FTS or vector recall, designing the reconciler.

| Topic | URL |
|---|---|
| Index spec overview | https://lance.org/format/table/index/ |
| BTREE scalar index | https://lance.org/format/table/index/scalar/btree/ |
| Bitmap scalar index | https://lance.org/format/table/index/scalar/bitmap/ |
| Bloom-filter scalar index | https://lance.org/format/table/index/scalar/bloom_filter/ |
| Label-list scalar index | https://lance.org/format/table/index/scalar/label_list/ |
| Zone-map scalar index | https://lance.org/format/table/index/scalar/zonemap/ |
| R-Tree scalar index (spatial) | https://lance.org/format/table/index/scalar/rtree/ |
| Full-text search (FTS) index | https://lance.org/format/table/index/scalar/fts/ |
| N-gram scalar index | https://lance.org/format/table/index/scalar/ngram/ |
| Vector index | https://lance.org/format/table/index/vector/ |
| Fragment-reuse system index | https://lance.org/format/table/index/system/frag_reuse/ |
| MemWAL system index | https://lance.org/format/table/index/system/mem_wal/ |
| HNSW Rust example | https://lance.org/examples/rust/hnsw/ |
| Distributed indexing | https://lance.org/guide/distributed_indexing/ |
| Tokenizer (FTS, n-gram) | https://lance.org/guide/tokenizer/ |

Reads & writes

Touching the bulk loader, mutation execution, merge_insert, WriteMode selection.

| Topic | URL |
|---|---|
| Read-and-write guide | https://lance.org/guide/read_and_write/ |
| Distributed write | https://lance.org/guide/distributed_write/ |
| Rust example: write & read a dataset | https://lance.org/examples/rust/write_read_dataset/ |

Schema evolution

Touching apply_schema, the migration planner, additive evolution.

| Topic | URL |
|---|---|
| Data-evolution guide | https://lance.org/guide/data_evolution/ |
| Migration guide | https://lance.org/guide/migration/ |

Object store / S3

Touching storage.rs, S3-compatible backends (RustFS, MinIO), env vars.

| Topic | URL |
|---|---|
| Object-store guide | https://lance.org/guide/object_store/ |

Data types

Touching schema-language scalar mappings, blob columns, JSON, list columns.

| Topic | URL |
|---|---|
| Data types overview | https://lance.org/guide/data_types/ |
| Arrays / list types | https://lance.org/guide/arrays/ |
| Blobs (LargeBinary) | https://lance.org/guide/blob/ |
| JSON | https://lance.org/guide/json/ |

Performance & tuning

Optimizing scans, fragment counts, cache behavior, memory pool sizing.

| Topic | URL |
|---|---|
| Performance guide | https://lance.org/guide/performance/ |

Compaction & cleanup

Touching omnigraph optimize / cleanup, the underlying compact_files / cleanup_old_versions.

| Topic | URL |
|---|---|
| Read-and-write guide (covers compact_files, cleanup_old_versions) | https://lance.org/guide/read_and_write/ |
| Performance (compaction tradeoffs) | https://lance.org/guide/performance/ |
| Fragment-reuse index | https://lance.org/format/table/index/system/frag_reuse/ |

DataFusion integration

The runtime substrate that may carry our query execution. See docs/dev/invariants.md: we don't rebuild relational machinery.

| Topic | URL |
|---|---|
| DataFusion integration | https://lance.org/integrations/datafusion/ |

SDK reference

Looking up a specific Rust API (signature, return type, error variant).

| Topic | URL |
|---|---|
| SDK docs landing | https://lance.org/sdk_docs/ |

What's not in this index (and why)

  • Namespace REST API model surface (/format/namespace/client/operations/models/...): hundreds of REST schema docs for the Lance Namespace catalog API. OmniGraph does not run a Lance Namespace server, so these are not reachable from our problem space.
  • Spark / Trino / Databricks / Dataproc / Hive / Glue / Polaris / Iceberg / Unity / OneLake / Gravitino integrations — not part of OmniGraph's deployment surface.
  • Python / TF / PyTorch / Hugging Face / Ray integrations — OmniGraph is Rust-only; Python notebooks aren't relevant.
  • Community / governance / release / voting / PMC pages — meta, not technical.

If a future need pulls one of these into scope, add a row to the matching domain section above and link it from AGENTS.md's topic index.

Maintenance

When Lance ships a major release that changes any of the above (file format bump, new index type, transaction semantics change, new branching primitive), refresh this index in the same change as the omnigraph upgrade. Stale Lance pointers are worse than no pointers.

Last alignment audit: 2026-05-22 (Lance 6.0.1 upstream; omnigraph pinned at 6.0.1)

Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53, Arrow 57 → 58, lance-tokenizer 6.0.1 added, tantivy* removed). Direct 4 → 6 jump; v5.x was not used as an intermediate (rationale in ~/.claude/plans/shimmering-percolating-duckling.md). Behavior-affecting findings:

  • DatasetIndexExt moved from lance-index to lance::index (Lance PR #6280, v5.0). Six import sites updated. lance-index::IndexType and lance-index::is_system_index stayed in lance-index. omnigraph-cli and omnigraph-server gained lance = { workspace = true } in their dev-dependencies.
  • DescribeTableResponse gained is_only_declared: Option<bool> (lance-namespace 6.0+, v5.0 PR #6186). Set to Some(false) in both BranchManifestNamespace::describe_table and StagedTableNamespace::describe_table — every table we return is physically materialized via Dataset::open, never "declared-only."
  • MergeInsertBuilder execute_reader return shape preserved (Arc<Dataset>, MergeStats); the publisher CAS chain at db/manifest/publisher.rs:370-391 works unchanged. Pinned by tests/lance_surface_guards.rs::_compile_merge_insert_builder_method_chain.
  • LanceError::TooMuchWriteContention variant retained in v6.0.1 (no rename). The typed publisher translation at db/manifest/publisher.rs:417-430 continues to apply. Pinned by lance_surface_guards.rs::lance_error_too_much_write_contention_variant_exists.
  • ManifestLocation field shape stable: .path: object_store::path::Path, .size: Option<u64>, .e_tag: Option<String>, .naming_scheme: ManifestNamingScheme. Pinned by lance_surface_guards.rs::manifest_location_field_shape.
  • LanceFileVersion::default() flipped V2_0 → V2_1 (v5.0). No effect — every data_storage_version callsite explicitly pins Some(LanceFileVersion::V2_2) (load-bearing for blob v2: Blob v2 requires file version >= 2.2 enforced in lance/src/dataset/write.rs:748).
  • Dataset::checkout_version(N).await?.restore().await?: restore() takes &mut self and returns Result<()> (mutates in place, does not consume + return a new dataset). The recovery rollback hammer at db/manifest/recovery.rs:505-522 continues to work. Pinned by lance_surface_guards.rs::_compile_checkout_version_then_restore_signature.
  • DatasetBuilder::from_namespace(...).with_branch(...).with_version(...).load() surface preserved (the namespace builder chain at db/manifest/namespace.rs:162-174). Pinned by lance_surface_guards.rs::_compile_dataset_builder_from_namespace_signature.
  • compact_files(&mut ds, CompactionOptions::default(), None) signature stable. CompactionOptions still does not expose data_storage_version; compact_files builds its own WriteParams { ..Default::default() }. Note: LanceFileVersion::default() is now V2_1 in v6, so optimize-rewritten fragments come out at V2_1 by default (was V2_0 in v4). Existing explicit V2_2 pins on creates/appends still apply.
  • Dataset::delete(predicate) returns DeleteResult { new_dataset: Arc<Dataset>, num_deleted_rows: u64 } — unchanged shape. Pinned by lance_surface_guards.rs::_compile_delete_result_field_shape. MR-A will repurpose this guard to the staged two-phase variant once DeleteBuilder::execute_uncommitted migration lands.
  • File reader read methods now async (Lance PR #6710, v6.0). No effect — omnigraph reaches Lance exclusively through Dataset::scan and the staged-write API.
  • Tokenizer vendored as lance-tokenizer (Lance PR #6512, v6.0). No effect — no direct tokenizer imports.
  • Lance #6658 closed (2026-05-14) but DeleteBuilder::execute_uncommitted did not ship in v6.0.1 — binary search across the release stream shows it first appears in v7.0.0-beta.10 (the closing commits landed on main but didn't backport to the 6.x line). Tracked as MR-A: migrate delete_where to staged, retire the parse-time D2 mutation rule, extend recovery sidecar coverage. Gated on the Lance v7.x bump, not this PR. v7.0.0-rc.1 dropped 2026-05-21.
  • Lance #6666 still open (build_index_metadata_from_segments public): vector-index two-phase blocked; inline create_vector_index residual retained.
  • Lance #6877 still open (MergeInsertBuilder dup-rowid): PR #109's SourceDedupeBehavior::FirstSeen + check_batch_unique_by_keys precondition stay load-bearing.
  • Dataset::force_delete_branch (branches().delete(name, force=true), dataset.rs:524) tolerates a missing branch-contents ref, where plain delete_branch returns RefNotFound. On the local store it still errors with NotFound if the branch tree/ directory is fully absent, because remove_dir_all's NotFound is not caught for Lance's native error variant (refs.rs:526-549). Both variants still refuse a branch with referencing descendants (RefConflict). TableStore::force_delete_branch wraps the call to be fully idempotent (it tolerates an already-absent branch). The single-authority branch-delete redesign uses it for orphan reclamation (eager best-effort reclaim plus the cleanup reconciler). Pinned by lance_surface_guards.rs::force_delete_branch_semantics. Branch delete is "flip the ref atomically, then remove_dir_all(tree/{branch})"; branch-exclusive data lives under tree/{branch}/, so a drop reclaims it immediately without touching main.
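
The idempotent-wrapper behavior described in the last bullet can be sketched roughly as follows. This is a minimal, self-contained model and not the actual TableStore code: BranchError, ToyStore, raw_force_delete, and tolerate_absent are hypothetical stand-ins that only mimic the error variants named above (RefNotFound, NotFound, RefConflict). The real Lance types and signatures may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the error variants named above (not Lance's real enum).
#[derive(Debug, PartialEq)]
pub enum BranchError {
    RefNotFound(String),
    NotFound(String),
    RefConflict(String),
}

/// Toy in-memory branch store: branch name -> optional parent branch.
pub struct ToyStore {
    branches: BTreeMap<String, Option<String>>,
}

impl ToyStore {
    pub fn new() -> Self {
        ToyStore { branches: BTreeMap::new() }
    }

    pub fn add(&mut self, name: &str, parent: Option<&str>) {
        self.branches.insert(name.to_string(), parent.map(str::to_string));
    }

    pub fn contains(&self, name: &str) -> bool {
        self.branches.contains_key(name)
    }

    /// Models the raw force delete as described: a fully absent branch errors
    /// with NotFound, and a branch with referencing descendants is refused.
    fn raw_force_delete(&mut self, name: &str) -> Result<(), BranchError> {
        if !self.branches.contains_key(name) {
            return Err(BranchError::NotFound(name.to_string()));
        }
        let has_descendants = self
            .branches
            .values()
            .any(|parent| parent.as_deref() == Some(name));
        if has_descendants {
            return Err(BranchError::RefConflict(name.to_string()));
        }
        self.branches.remove(name);
        Ok(())
    }

    /// Idempotent wrapper: "already absent" counts as success.
    pub fn force_delete_branch(&mut self, name: &str) -> Result<(), BranchError> {
        tolerate_absent(self.raw_force_delete(name))
    }
}

/// Maps the two "branch is already gone" variants to Ok and passes
/// everything else (including RefConflict) through unchanged.
pub fn tolerate_absent(result: Result<(), BranchError>) -> Result<(), BranchError> {
    match result {
        Err(BranchError::RefNotFound(_)) | Err(BranchError::NotFound(_)) => Ok(()),
        other => other,
    }
}
```

In this model, calling force_delete_branch twice on the same name returns Ok both times, while a branch that still has descendants keeps surfacing RefConflict until its children are removed first.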

Surface guards added: crates/omnigraph/tests/lance_surface_guards.rs (9 named guards; 4 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (manifest_cas_returns_row_level_contention_variant needs full publisher-race harness; table_version_metadata_byte_compatible_with_v4 needs pub(crate) reach extension).
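
To illustrate the two guard kinds, here is a minimal sketch of a compile-only guard next to a runtime guard. Upstream and its restore method are hypothetical, synchronous, std-only stand-ins rather than Lance's API, and the function names are invented for the example. The real guards in lance_surface_guards.rs are not reproduced here.

```rust
/// Hypothetical stand-in for an upstream type whose API shape we depend on.
pub struct Upstream {
    pub restores: u32,
}

impl Upstream {
    /// Stand-in for a method that mutates in place and returns Result<()>.
    pub fn restore(&mut self) -> Result<(), String> {
        self.restores += 1;
        Ok(())
    }
}

/// Compile-only guard: its body does no real work. Coercing the method to an
/// explicit fn-pointer type stops compiling if the receiver or return type
/// of `restore` ever changes, so the build itself is the check.
#[allow(dead_code)]
pub fn _compile_restore_signature() {
    let _pinned: fn(&mut Upstream) -> Result<(), String> = Upstream::restore;
}

/// Runtime guard: exercises the behavior and reports whether it still holds.
pub fn runtime_restore_mutates_in_place() -> bool {
    let mut ds = Upstream { restores: 0 };
    let ok = ds.restore().is_ok();
    ok && ds.restores == 1
}
```

The compile-only form costs nothing at test time and catches signature drift on the first build after a bump; the runtime form is needed when the signature could stay the same while the behavior changes.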

Bump this date stanza on the next alignment pass.