omnigraph/docs/dev/invariants.md
Ragnor Comerford 6a2dfa7325
fix: self-heal manifest-unreferenced branch forks (stop wedged branches) (#231)
* chore: correct stale global-lock comments

The global Arc<RwLock<Omnigraph>> that once serialized every server write was
removed — the server holds the engine as a lockless Arc<Omnigraph> and write
methods are &self, so the per-(table_key, branch) write queues are now the
actual write-serialization mechanism (in-process only).

Correct comments that still claimed the global lock is 'still in place' /
'today', or framed the queues as MR-686 scaffolding: write_queue.rs module doc,
exec/merge.rs, db/omnigraph/schema_apply.rs, db/manifest/recovery.rs, and the
bench_concurrent_http.rs example (which also wrongly stated mutate_as is
&mut self). workload.rs is left as-is — its 'previous global RwLock' wording is
accurate history.

* test: regression for self-healing a manifest-unreferenced fork

An interrupted first-write fork (create_branch succeeded, the manifest publish
did not) leaves a fully-formed Lance branch ref the manifest never references.
The branch stays a valid manifest branch, so cleanup's reconciler never
reclaims it, and today the next write to that table wedges with 'incomplete
prior delete; run cleanup'.

Forge that exact residue (a live 'feature' branch + a directly-created
'feature' ref on the Person table the manifest doesn't reference) and assert
the next load AND mutate self-heal. Deterministic and local — no S3 or timing,
since the forge IS the post-crash state. Adds a shared node_table_uri helper.

This commit is RED: it reproduces the bug and fails against the unfixed engine
with the predicted symptom. The fix follows in the next commit.

* fix: self-heal manifest-unreferenced branch forks

The first write to a table on a branch lazily forks it via Lance create_branch,
a durable two-phase op that advances Lance state BEFORE the atomic manifest
publish. If the writer dies or its request future is cancelled between the fork
and the publish, the branch ref is fully formed but the manifest never
references it. The next write re-enters the fork path, create_branch collides,
and the engine wedged with 'orphaned table state ... incomplete prior delete;
run cleanup' — which cleanup could not even fix, because the branch is still a
live manifest branch. This hit load, mutate, ingest, and the merge fork path
(one shared engine chokepoint), so a routine deploy restart or client
disconnect could wedge a branch.

Fix: treat the per-table fork ref as derived state of the manifest.
fork_branch_from_state returns a typed ForkOutcome instead of a human
'incomplete prior delete' error; on RefAlreadyExists the db layer reclaims the
manifest-unreferenced fork (force_delete_branch + re-fork, exactly once) and
proceeds.
A live committed fork is still routed to a retryable conflict before the fork
path, so concurrent first-writes stay correct.

Reclaim is only safe if no in-process writer can be mid-fork, so the write
entry points (load, mutate) acquire the per-(table, branch) write queues for
all touched tables up front — before the fork, held through the publish — when
forking a non-main branch. commit_all accepts these pre-held guards instead of
re-acquiring (the queue is non-re-entrant). The merge fork path already holds
the queue and self-heals through the shared wrapper. Cross-process in-flight
forks remain the documented one-winner-CAS gap.

Mechanical prep folded in: mutation IR lowering is hoisted so the touched-table
set is known before execution; commit_all gains the held_guards parameter.

Flips recreate_over_orphaned_fork_before_cleanup_is_actionable to assert
self-heal; fork_collision_with_live_concurrent_fork_is_retryable still holds.
Docs: writes.md cancelled-future note, invariants.md cross-process known gap.

* fix(cleanup): reconcile per-table manifest-unreferenced forks

reconcile_orphaned_branches keyed orphans on the branch NAME (absent from the
manifest), so it only reclaimed forks from a fully-deleted branch. A fork left
on a still-live branch by an interrupted first-write was never reclaimed — the
backstop the handoff expected cleanup to provide did not cover that case.

Broaden it to a per-table authority test: a Lance branch B on table T is an
orphan iff B is not a live manifest branch (delete-leftover) OR the manifest's
branch-B snapshot does not place T on B (interrupted first-write). Per-branch
snapshots are resolved once and cached across tables. Legitimately-forked
tables, main, and internal/system branches are never reclaimed; children are
dropped before parents to avoid Lance's referenced-parent RefConflict. The
commit-graph half stays whole-branch (per-table doesn't apply there).

This is the guaranteed-convergence backstop to the write-path self-heal: it
reclaims any fork the write path never revisits, and is what Lance's own
create_branch docstring asks embedders to provide for zombie/orphan refs.

* fix: reclaim self-validates against fresh manifest authority

The fork reclaim force-deletes a Lance branch ref, gated on the caller's proof
that the manifest does not place the table on the branch. But the first-write
path obtains that proof via snapshot_for_branch, which returns the coordinator's
CACHED snapshot when the handle is bound to the branch (an embedded handle on
the branch, or branch_merge's target swap). If that snapshot is stale and a
concurrent writer already published a legitimate fork, the reclaim would
force-delete it and re-fork from source, stranding the manifest at a version the
recreated ref no longer has.

Make the destructive primitive own its safety precondition: re-derive it from a
FRESH manifest read (fresh_snapshot_for_branch, which bypasses the cache)
immediately before force-deleting. If fresh authority shows the table is on the
branch, refuse with a retryable conflict instead of destroying a valid fork.
Correct for any caller regardless of snapshot staleness. Also stop branching on
Lance's exact RefConflict prose (loosened match; typed-variant is the durable
follow-up). Addresses PR review (Codex P1, Greptile P2).

* fix: cover delete-cascade edges in up-front fork-queue acquisition

A node delete cascades to every edge table touching that node
(execute_delete_node), forking those edge tables during execution. But
touched_table_keys
derived the up-front fork-queue set from the IR ops alone (just node:Type), so a
branch delete that forks node + cascade edges held only the node queue —
commit_all then saw cascade-edge keys it had no guard for.

The touched set is a pure function of (IR ops + catalog), so compute the
COMPLETE set: op types plus, for delete-node ops, the cascade edges derived the
same way the executor derives them (from_type/to_type match). Pre-computed now
equals actual by construction.

Also promote commit_all's held-guard coverage check out of debug_assert into an
all-builds check that fails the write with a typed manifest_internal error: a
load-bearing serialization invariant must fail loudly+safely in release, not
silently proceed unguarded if a future execution path ever touches a table
outside the pre-computed set.

Adds branch_cascade_delete_forks_node_and_edges_under_held_queues, which drives
the cascade path on a branch (the gap the existing insert/load tests missed).
Addresses PR review (Cursor medium, Greptile P2).

* fix(cleanup): serialize fork reclaim against in-process live writers

The broadened per-table reconciler force_delete'd an orphan candidate on a LIVE
branch without holding the per-(table, branch) write queue. An in-process
first-write fork in its fork->publish window holds that queue and has not yet
advanced the manifest, so it looks exactly like an origin-2 orphan — concurrent
cleanup could delete the ref the writer still holds and is about to publish.
(The old branch-name-based reconciler did not have this race: a deleted branch
cannot have a live first-write.)

Bring the reconciler under the same invariant the write-path reclaim already
obeys: never force_delete a fork ref without holding the (table, branch) write
queue AND confirming, under it, from a fresh read, that the ref is still
manifest-unreferenced. Acquire one key at a time (no lock-order inversion vs
multi-table acquire_many writers); if the writer published meanwhile, the fresh
re-check sees the table on the branch and skips. Cross-process writers remain
the documented one-winner-CAS gap. Addresses PR review (Cursor high).

* fix: classify create_branch failure by ref existence, not by failure

fork_branch_from_state mapped ANY create_branch failure to RefAlreadyExists,
routing transient I/O / version / Lance-internal errors into the destructive
reclaim path and masking the real error as a retryable conflict.

Branch on the actual fact instead: on create_branch failure, check whether the
ref exists (list_branches). Only a genuinely pre-existing ref — a fully-formed
manifest-unreferenced fork — is a reclaim candidate; any other failure
propagates with fidelity. We deliberately do NOT force-delete on a not-found-ref
failure: it is indistinguishable from a transient error on a fresh create, and
force-deleting there is the overreach the fresh-authority guard already removed.
A phase-1-only Lance zombie (rarer; create_branch interrupted mid its two
internal phases) surfaces as the propagated error for manual reclaim.
Addresses PR review (Cursor medium).

* fix(cleanup): skip (not delete) on a transient re-check error for a live branch

The reconcile pre-delete re-check treated ANY fresh_snapshot error as 'still an
orphan' and proceeded to force_delete. A transient manifest read failure on a
LIVE branch could therefore destroy a fork the manifest still considers
legitimate — inconsistent with the write-path reclaim (aborts on the same error)
and the candidate scan (skips on snapshot failure).

Distinguish the two origins under the queue: a branch absent from the manifest
authority (origin 1) is a confirmed orphan and is deleted without a fresh read
(no live writer can hold a deleted branch's queue); a LIVE branch (origin 2)
gets the fresh re-check and, on a transient read error, is SKIPPED — never
destroyed on ambiguity — converging on a later cleanup. Same don't-destroy-on-
ambiguous-error principle as the create_branch failure classification.
Addresses PR review (Cursor medium).

* fix(cleanup): unify fork-ref reclaim on fresh authority under the queue

Consolidates the reconcile/reclaim hardening from PR review (the earlier per-site
commits were collapsed when reconciling with the main merge). Both destructive
fork-ref sites — the write-path reclaim and the cleanup reconciler — now share
one classifier, classify_fork_ref -> ForkRefStatus { Legitimate, Orphan,
Indeterminate }, evaluated from FRESH manifest authority under the held
(table, branch) write queue. A fork ref is destroyed ONLY on a confirmed Orphan;
a Legitimate (concurrent writer published a real fork) or Indeterminate
(transient read) status is never destroyed — the write path maps it to a
retryable conflict, cleanup maps it to skip. This closes, by construction:

- reclaim trusting a possibly-cached caller proof (Codex P1);
- reconcile racing an in-process live fork without the queue (Cursor);
- delete-on-transient-error in the re-check (Cursor/Greptile);
- origin-1 trusting a stale live_branches capture for a created-since branch
  (Cursor/Greptile P1).

Having one classifier removes the duplication that let the two sites drift.
ForkOutcome is made pub to match the sealed trait method returning it. Verified
green on Lance 7.0.0 (full engine suite + 48/48 failpoints).

* test(cleanup): pin classify_fork_ref decision (Legitimate / Orphan / ghost)

Both fork-ref reclaim sites (write-path reclaim + cleanup reconciler) route
their destroy/skip decision through classify_fork_ref, but it had no direct
test — reverting the fresh-authority logic was not test-detectable. Add a
deterministic in-source unit test that forges each state and asserts the status:
a manifest-placed fork -> Legitimate (never destroyed); a ref the manifest does
not place on the branch -> Orphan; a ref for a branch absent from the manifest
-> Orphan (ghost reclaim preserved). This makes the core fresh-authority
decision behind every reclaim fix revert-detectable in one place.

(The Indeterminate arm — transient read on a live branch -> skip — needs an
injected read failure and is left to the failpoints suite; the cross-process
cleanup-vs-writer and cached-snapshot reclaim races are the documented
one-winner-CAS gap, not reachable same-process bugs, so they are not faked here.)

* test(cleanup): pin the Indeterminate (transient re-check) reclaim arm

Closes the last untested classify_fork_ref arm. Adds a 'classify.fresh_read'
failpoint (no-op without the failpoints feature) that simulates a transient
failure of the fresh-authority read, and a failpoints test driving it through
cleanup: a genuine origin-2 orphan on a LIVE branch whose fresh re-check fails
classifies as Indeterminate, so the reconciler SKIPS it (never destroys on an
inconclusive read) and reclaims it on the next run once the read succeeds.

This makes the don't-destroy-on-ambiguity rule revert-detectable end-to-end.
The only paths now left untested are the cross-process cleanup-vs-writer and
reclaim-vs-publish races — the documented one-winner-CAS gap (cleanup is
&mut self / CLI-only, so no reachable same-process race), not faked here.

* test(server): avoid stale schema apply route handle

* fix(cleanup): report indeterminate fork authority clearly
2026-06-15 22:17:25 +02:00

Architectural Invariants

Type: standing review checklist
Status: living document
Audience: anyone proposing, reviewing, or implementing an OmniGraph change

This file is intentionally short. It records the rules that should be in working memory for every non-trivial change. Detailed mechanics live in the area docs linked below.

Use it this way:

  • Review the change against Hard Invariants and the Deny-list.
  • If code and docs disagree, either fix the code or add/update a Known Gap.
  • Keep implementation ledgers, roadmap detail, and historical MR notes in the per-area docs. This file is the filter, not the encyclopedia.

Governing principle: logical contract over physical state

The hard invariants below are instances of one rule. Keep it in view whenever a change touches the boundary between what the graph means and how it is physically stored.

Logical state is the contract. Physical state — index coverage, fragment layout, compaction versions, staged writes — is derived, rebuildable, and may be produced asynchronously. A physical operation must never fail a logical one. Preconditions are checked against logical state; physical reconciliation is idempotent and may lag or retry. Genuine logical conflicts still fail loudly: the licence to lag covers physical convergence, not correctness.

Invariants that instantiate it: 2 (manifest-atomic visibility) and 5 (recovery is part of the commit protocol) — a partially-written physical layer never changes what a graph commit means; 7 (indexes are derived state) — a query is correct under partial index coverage, and expensive index work converges from manifest state instead of gating the write path; 13 (failures bounded and observable) — the licence to lag is not a licence to drop, so a physical step that cannot make progress is surfaced, not swallowed. Deny-list items that enforce it: synchronous inline vector/FTS index rebuilds on the commit path; state that drifts from Lance or the manifest when it can be derived; job queues for manifest-derivable state where a reconciler fits.

The failure shape it rules out: a legitimate background operation on the physical layer (compaction, an index build, an interrupted staged write) is allowed to break a logical operation (a query's correctness, a migration's success, a branch's writability). The smell to watch for is a logical operation whose precondition is a physical fact — a cached file version, an index's existence, a fragment count. Make the precondition logical and let a reconciler converge the physical state.
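
As a rough illustration of a logical operation that does not depend on a physical fact, the sketch below answers a lookup correctly whether an index covers none, some, or all fragments. It is a minimal in-memory model, not engine code: `Row`, `PartialIndex`, and `find_by_city` are hypothetical names introduced only for this example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Illustrative row: the fragment it lives in, its id, and one indexed value.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub fragment: u32,
    pub id: u64,
    pub city: String,
}

/// Hypothetical partial index: it only knows about rows in `covered` fragments.
pub struct PartialIndex {
    pub covered: BTreeSet<u32>,
    pub by_city: BTreeMap<String, Vec<u64>>,
}

impl PartialIndex {
    /// Build an index over the listed fragments only (physical, derived state).
    pub fn build(rows: &[Row], covered: &[u32]) -> Self {
        let covered: BTreeSet<u32> = covered.iter().copied().collect();
        let mut by_city: BTreeMap<String, Vec<u64>> = BTreeMap::new();
        for r in rows.iter().filter(|r| covered.contains(&r.fragment)) {
            by_city.entry(r.city.clone()).or_default().push(r.id);
        }
        Self { covered, by_city }
    }
}

/// Index hits for covered fragments plus a plain scan of uncovered fragments.
/// The result does not depend on how much of the data the index covers.
pub fn find_by_city(rows: &[Row], index: &PartialIndex, city: &str) -> Vec<u64> {
    let mut ids: Vec<u64> = index.by_city.get(city).cloned().unwrap_or_default();
    ids.extend(
        rows.iter()
            .filter(|r| !index.covered.contains(&r.fragment) && r.city == city)
            .map(|r| r.id),
    );
    ids.sort_unstable();
    ids
}
```

Index coverage here only changes how much work the scan does; the answer is the same before, during, and after an index build.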

Hard Invariants

  1. Respect the substrate. Lance owns columnar storage, per-dataset versioning, fragments, branches, compaction, cleanup, and index primitives. DataFusion should own relational execution where it fits. Do not add custom WALs, transaction managers, buffer pools, page formats, or local clones of substrate behavior. Read lance.md before guessing.

  2. Graph visibility is manifest-atomic. Lance commits are per dataset. OmniGraph's graph-level atomicity comes from publishing one manifest update for the whole graph, guarded by expected table versions and sidecar recovery. No write path may make a subset of touched node/edge tables visible as a graph commit.

  3. A query reads one snapshot. Query execution captures a manifest snapshot for its lifetime. Do not re-read branch head mid-query to discover newer table versions.

  4. Mutations publish at one boundary. A mutate_as or load operation accumulates constructive writes, commits each touched table at the end, then publishes one manifest update. Do not commit per statement. Delete-only queries are the documented inline residual; the parse-time D2 rule prevents mixing deletes with insert/update until Lance exposes two-phase delete. Read writes.md and execution.md.

  5. Recovery is part of the commit protocol. Writers that can advance Lance HEAD before manifest publish must write __recovery/{ulid}.json sidecars. Omnigraph::open in read-write mode runs the all-or-nothing sweep; the write entry points (load_as, mutate_as, apply_schema_as, branch_merge_as) and refresh run roll-forward-only recovery in-process, so a long-lived process converges on its next write rather than at restart. Do not add a new writer kind without sidecar coverage or an explicit proof that no Lance HEAD can move before manifest publish.

  6. Strong consistency is the default. Reads are snapshot-isolated, writes are durable before acknowledgement, and branch reads observe the current committed graph state. Any eventual-consistency mode must be explicit, read-only, auditable, and non-default.

  7. Indexes are derived state. Reads must see the correct result for the branch they read even when index coverage is partial. Expensive index work should converge from manifest state instead of extending the critical write path. Scalar staged index builds and vector inline residuals are documented in writes.md and indexes.md.

  8. Schema identity survives renames. Accepted schema identity must remain stable across type and property renames. Rename support belongs in migration planning, not in "drop and recreate" behavior. See the known gap below.

  9. Schema/data integrity failures are loud. Type errors, required-field misses, invalid edge endpoints, cardinality violations, and unsupported mixed mutation modes fail before a graph commit is published. The system must not invent placeholder nodes or silently weaken integrity.

  10. Query semantics are first-class IR concepts. Search modes, mutations, polymorphism, traversal, retrieval scores, imports, and policy predicates belong in typed AST/IR/planner structures. Do not smuggle semantics through strings, side tables, global state, or transport-specific flags.

  11. Transport/auth stay at the boundary. Kernel crates should not depend on HTTP, OpenAPI, bearer-token parsing, or future transport protocols. The server resolves bearer tokens to actors; clients cannot set actor identity directly.

  12. Bearer-token plaintext is not retained. Server startup hashes bearer tokens, authentication uses constant-time comparison, and request handling carries only the resolved actor identity and hash-derived match state.

  13. Operational failures are bounded and observable. Timeout, memory, OOM, partial result, recovery, and conflict paths must fail loudly or degrade in a documented way. If a metric affects plan choice or operator behavior, it must be exposed through the relevant trait or observability surface.

  14. Tests match the boundary being changed. Prefer extending the existing test that owns the area. Planner changes need planner-level coverage, storage changes need storage/recovery coverage, and end-to-end tests are not a substitute for missing lower-level assertions. Read testing.md before adding tests.
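
Invariant 2 can be pictured with a small model of a publish guarded by expected table versions. This is a sketch under stated assumptions: `Manifest`, `TableUpdate`, `PublishError`, and `publish` are hypothetical stand-ins, the manifest is a plain in-memory map, and a missing table is treated as version 0 for simplicity. The real publisher, its CAS primitive, and the sidecar protocol are described in writes.md.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory manifest: table key -> pinned table version.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Manifest {
    pub tables: BTreeMap<String, u64>,
}

/// One touched table in a graph commit (illustrative shape only).
pub struct TableUpdate {
    pub table_key: String,
    pub expected_version: u64,
    pub new_version: u64,
}

#[derive(Debug, PartialEq)]
pub enum PublishError {
    /// Another writer advanced this table after the caller read it.
    VersionConflict { table_key: String, expected: u64, actual: u64 },
}

/// Check every expected version first, then apply all updates together.
/// On any conflict the manifest is left untouched, so no subset of the
/// touched tables ever becomes visible as a graph commit.
pub fn publish(manifest: &mut Manifest, updates: &[TableUpdate]) -> Result<(), PublishError> {
    for u in updates {
        // Assumption for this sketch: an absent table counts as version 0.
        let actual = manifest.tables.get(&u.table_key).copied().unwrap_or(0);
        if actual != u.expected_version {
            return Err(PublishError::VersionConflict {
                table_key: u.table_key.clone(),
                expected: u.expected_version,
                actual,
            });
        }
    }
    for u in updates {
        manifest.tables.insert(u.table_key.clone(), u.new_version);
    }
    Ok(())
}
```

The two-pass shape (validate everything, then apply everything) is what keeps a failed publish from exposing a partial graph state in this model.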

Current Truth Matrix

| Area | Current state | Source |
|---|---|---|
| Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | writes.md, architecture.md |
| Constructive mutations | In-memory MutationStaging, one end-of-query table commit per touched table, then one manifest publish | writes.md, execution.md |
| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | query-language.md, writes.md |
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (force_delete_branch) with the cleanup reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before cleanup surfaces an actionable error | branches-commits.md, maintenance.md |
| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | schema-language.md, execution.md |
| Unique constraints | Intra-batch and write-path checks exist; intake and branch-merge derive the composite key through one shared function (loader::composite_unique_key, a separator-free Vec<String> tuple) and fail loudly on an un-keyable column type rather than silently exempting it; full cross-version uniqueness against already-committed rows is still a gap | schema-language.md |
| Storage trait | TableStorage (via db.storage()) is staged-only; the inline-commit residuals (delete_where, create_vector_index) are split onto a separate sealed InlineCommitResidual trait reached via db.storage_inline_residual() (MR-854), so §1 holds by construction; capability/stat surfaces are roadmap | writes.md, architecture.md |
| Index lifecycle | @index/@key declares intent; the physical index is derived state and never fails a logical op. schema apply builds no indexes (records intent only; index-only changes touch no table data). load/mutate build inline through one chokepoint (build_indices_on_dataset_for_catalog, type-dispatched by node_prop_index_kind: enum + orderable scalar → BTREE, free-text String → FTS, Vector → vector) that fault-isolates an untrainable Vector column into a pending index instead of aborting. optimize/ensure_indices is the reconciler: it creates declared-but-missing indexes and folds appended/rewritten fragments into existing ones (optimize_indices), reporting still-pending columns. Explicit maintenance call, not yet a background loop | indexes.md, maintenance.md |
| Traversal IDs | Runtime still builds TypeIndex; Lance stable row-id based graph IDs are roadmap | architecture.md, query-language.md |
| Auth | Bearer token hashing and server-side actor resolution are implemented at the HTTP boundary | server.md, policy.md |
| Tests | Tempdir-backed Lance tests are the current substrate; the storage adapter has an in-memory backend for adapter-level contract tests, but Lance datasets bypass it | testing.md |

The branch-delete reconciler is authority-derived: it reclaims orphaned forks today and degrades to a no-op if Lance ships an atomic multi-dataset branch operation, so the design composes with that future rather than blocking it. This is the same shape as invariant 7 (indexes are derived state); prefer it over a recovery-sidecar-style approach for any new multi-dataset metadata operation, since the sidecar would be scaffolding to remove once the substrate closes the gap.
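
The authority-derived shape can be sketched as a pure function from manifest authority plus the physical refs to the set of refs to reclaim. Everything below is a hypothetical model: `Authority`, `ForkRef`, `orphaned_refs`, and `reconcile` are names invented for the example, and the real reconciler additionally takes the write queue and re-checks fresh manifest state before deleting anything.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical manifest authority: live branch -> table keys placed on it.
pub type Authority = BTreeMap<String, BTreeSet<String>>;

/// A physical per-table branch ref, as (table_key, branch).
pub type ForkRef = (String, String);

/// Refs the manifest does not account for. A ref is an orphan when its branch
/// is not live, or when the live branch does not place that table on it.
/// `main` is never a reclaim candidate.
pub fn orphaned_refs(authority: &Authority, refs: &BTreeSet<ForkRef>) -> Vec<ForkRef> {
    refs.iter()
        .filter(|(table, branch)| {
            branch.as_str() != "main"
                && !authority
                    .get(branch)
                    .map_or(false, |tables| tables.contains(table))
        })
        .cloned()
        .collect()
}

/// One reconcile pass: drop every orphan and report how many were reclaimed.
/// Running it again with the same authority reclaims nothing.
pub fn reconcile(authority: &Authority, refs: &mut BTreeSet<ForkRef>) -> usize {
    let orphans = orphaned_refs(authority, refs);
    for o in &orphans {
        refs.remove(o);
    }
    orphans.len()
}
```

Because the orphan set is recomputed from authority on every pass, there is no separate ledger to keep in sync, and a pass that finds nothing to do is simply a no-op.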

Known Gaps

Do not hide these behind invariant wording. Either move them forward or keep them explicit.

  • Rename-stable schema identity: the invariant is that accepted IDs survive renames. The current compiler still derives type IDs from kind:name; this must be fixed before relying on renamed IDs across accepted schemas.
  • Storage abstraction: TableStorage is present, sealed, and canonical for staged writes. MR-854 sealed it: db.storage() exposes only staged primitives + reads, and the inline-commit residuals are split onto a separate sealed InlineCommitResidual trait reached via db.storage_inline_residual(), so a new writer cannot couple a write with a HEAD advance through the default surface. The dead legacy methods (append_batch on the trait, merge_insert_batch{,es}, create_{btree,inverted}_index) were removed. The remaining residuals are delete_where and create_vector_index. The Lance 6.0.1 → 7.0.0 bump landed, so the staged two-phase delete API (DeleteBuilder::execute_uncommitted, Lance #6658) is now available and MR-A is unblocked, but the migration itself is still pending, so delete_where stays inline for now. create_vector_index remains gated on Lance #6666 (still open). See lance.md and writes.md. New write paths should use the staged shape unless a documented Lance blocker applies.
  • Deletes and vector indexes: delete_where and vector index creation still advance Lance HEAD inline. The public delete two-phase API now exists (Lance #6658 shipped in 7.0.0), so the delete residual is unblocked pending the MR-A migration; vector index creation is still blocked (Lance #6666 open). Keep D2 and recovery coverage in place until those residuals are removed.
  • Blob-column compaction: Lance compact_files mis-decodes blob-v2 columns under its forced BlobHandling::AllBinary read ("more fields in the schema than provided column indices"), so optimize skips any table with a Blob property — reporting SkipReason::BlobColumnsUnsupportedByLance (loud, not a silent drop) behind the LANCE_SUPPORTS_BLOB_COMPACTION gate. Reads and writes are unaffected; only space/fragment reclamation on blob tables is deferred. Remove the skip when the upstream Lance fix lands — the lance_surface_guards.rs::compact_files_still_fails_on_blob_columns guard turns red on that bump to force it.
  • Recovery is serialized against live writers in-process only: the write-entry heal (and refresh) serialize against a live writer's sidecar lifetime via the per-(table, branch) write queues plus the schema-apply serialization key — all in-process primitives. A recovery pass in one process cannot serialize against a live writer in another (the open-time sweep has the same exposure, and always has): it may roll a live foreign writer's sidecar forward, which degrades to publisher-CAS contention for data writes but can race the schema-staging promotion for a foreign live schema apply. Multi-process writers on one graph are already documented one-winner-CAS territory; closing this fully needs a cross-process serialization primitive (e.g. lease-based use of the schema-apply lock branch) — design it before promoting multi-process write topologies.
  • Fork reclaim is in-process-safe only: the first write to a table on a branch forks it (a Lance create_branch that advances state before the manifest publish). An interrupted fork (crash, or a cancelled request future) leaves a manifest-unreferenced branch ref. The next write self-heals it — reclaim_orphaned_fork_and_refork (force_delete_branch + re-fork) — but reclaim is only safe because the writer holds the per-(table, branch) write queue from before the fork through the publish AND re-checks the live manifest under it, so no in-process writer can be mid-fork. A reclaim cannot serialize against a foreign-process in-flight fork: it may force-delete a peer's just-created ref, which makes that peer's commit fail and retry — the same one-winner-CAS exposure as above, not corruption. The reclaim never fires unless in-process-queue + manifest authority both prove the ref is manifest-unreferenced. cleanup's per-table reconciler (reconcile_orphaned_branches) is the guaranteed backstop for any fork the write path never revisits. Both degrade to a no-op if Lance ships an atomic multi-dataset branch op.
  • Local write_text_if_match is not a cross-process CAS: object-store backends use a true conditional put (ETag If-Match; the in-memory test backend too), but upstream object_store leaves PutMode::Update unimplemented for LocalFileSystem, so the local path emulates CAS with a content-token compare followed by an atomic replace — a check-then-act gap plus content-token ABA. Every current caller goes through the cluster lock protocol first, which makes this safe. A lock-free caller would get S3-correct but local-racy behavior — the same divergence shape as the acknowledged-before-visible bug this branch fixed. Close it (local CAS primitive, or a trait-level lock requirement) before admitting any lock-free if_match caller.
  • Manifest→commit-graph publish atomicity: a graph commit advances __manifest (the visibility authority) and then appends _graph_commits as two separate writes (commit_updates_with_actor_with_expected, failpoint graph_publish.before_commit_append). A crash between them leaves the manifest at version N with no commit-graph row for N. Live reads and durability are unaffected — the live version resolves via the manifest (GraphCoordinator::version()), not the commit-graph head — and the open-time recovery sweep does NOT repair it (lance_head == manifest_pinned classifies NoMovement; a recovery sidecar would not change this). Impact is bounded to commit history: commit list misses N, time-travel by commit id to N fails, and merge-base loses a node (a likely-benign off-by-one re-merge). This affects every publish, not a specific maintenance command. Eventual fix: make the commit graph reconcilable from the manifest (or the two writes atomic) — not a recovery-sidecar concern.
  • Planner capability/stat surfaces: cost-aware planning, complete capability advertisement, and explain-with-cost are roadmap. Do not describe them as implemented.
  • Traversal execution: current multi-hop execution still uses TypeIndex, ad-hoc ID filtering, and eager materialization in places. Stable row IDs, SIP, and factorization are target patterns, not current fact.
  • Retrieval ranks: hybrid search works, but rank/score are not yet carried everywhere as ordinary columns through the plan.
  • Policy pushdown and Source: Cedar enforcement is at the HTTP boundary today, and imports are still loader-shaped. Planner predicates and a unified Source operator are roadmap.
  • Resource bounds: some operations still lack enforced per-query memory or time budgets. New long-running work should add explicit bounds rather than widening the gap.
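
For the fork-reclaim gap above, the destroy-or-not decision can be sketched as a three-way classification of a fresh manifest read. The names `classify_fork_ref` and `ForkRefStatus` come from the commit history of this change; the signature, the `ReadError`, `Site`, and `Action` types, and the `decide` helper are assumptions made for this example and do not reflect the engine's actual API. The sketch also leaves out the write queue that must be held while classifying.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ForkRefStatus {
    Legitimate,
    Orphan,
    Indeterminate,
}

/// Hypothetical marker for a transient failure of the fresh manifest read.
pub struct ReadError;

/// Which destructive site is asking (illustrative).
#[derive(Clone, Copy)]
pub enum Site {
    WritePath,
    Cleanup,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Reclaim,
    RetryableConflict,
    Skip,
}

/// `fresh` models a fresh (uncached) manifest read for one branch:
/// `Ok(None)` = branch absent, `Ok(Some(tables))` = tables placed on it,
/// `Err(_)` = the read itself failed.
pub fn classify_fork_ref(
    fresh: Result<Option<&BTreeSet<String>>, ReadError>,
    table_key: &str,
) -> ForkRefStatus {
    match fresh {
        Err(_) => ForkRefStatus::Indeterminate,
        Ok(None) => ForkRefStatus::Orphan,
        Ok(Some(tables)) if tables.contains(table_key) => ForkRefStatus::Legitimate,
        Ok(Some(_)) => ForkRefStatus::Orphan,
    }
}

/// Only a confirmed orphan is destroyed. Anything else is a retryable
/// conflict on the write path and a skip in cleanup.
pub fn decide(site: Site, status: ForkRefStatus) -> Action {
    match (status, site) {
        (ForkRefStatus::Orphan, _) => Action::Reclaim,
        (_, Site::WritePath) => Action::RetryableConflict,
        (_, Site::Cleanup) => Action::Skip,
    }
}
```

Routing both sites through one classifier means an ambiguous read can never be interpreted as permission to delete by one site and not the other.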

Deny-list

If a proposal fits one of these, the burden is on the proposer to prove why the case is exceptional.

  • Custom WAL, transaction manager, buffer pool, page format, or storage engine.
  • Per-table graph publishing outside the manifest publisher.
  • Re-reading current branch head during a query instead of using the captured snapshot.
  • New write paths that can advance Lance HEAD before manifest publish without a recovery sidecar.
  • Cross-query BEGIN/COMMIT transactions in the OSS engine. Use branches and merges for multi-query workflows.
  • Acknowledging writes before durable Lance and manifest persistence.
  • Silent fallback to eventual consistency, partial results, or dropped rows.
  • State that drifts from Lance or the manifest when it can be derived.
  • Job queues for manifest-derivable state where a reconciler is the right shape.
  • Synchronous inline vector/FTS index rebuilds on the query commit path, except for documented Lance API residuals.
  • Side-channels for query semantics: hidden globals, magic strings, transport flags, or out-of-band metadata.
  • Cost-blind plan choice when statistics are available or required.
  • Hidden statistics for behavior that affects planning or operator choice.
  • Hash-map iteration order in result ordering, plan choice, or migration output.
  • String-flattened SQL/filter generation when a structured pushdown API is available.
  • Eager multi-hop cross-product materialization when factorization fits.
  • Ad-hoc IN-list filtering where SIP or another structured selectivity path fits.
  • Discarding retrieval score/rank before fusion or projection decisions.
  • Auto-creating placeholder nodes for orphan edges.
  • Raw filesystem I/O for cluster-stored state (ledger, lock, sidecars, approvals, catalog) outside the cluster crate's storage module — every stored byte goes through the engine StorageAdapter so file:// and s3:// stay one code path.
  • Wire-protocol-specific code in compiler or engine crates.
  • Cloud-only correctness fixes or forks of the OSS engine for correctness.
  • Mutating immutable substrate state in place, including Lance fragments or index segments.
  • Shipping observable behavior as if it were not part of the contract. Output ordering, error text, timestamp precision, defaults, and latency profiles all become dependencies once exposed.
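
As one concrete case from the list, the hash-map iteration-order item is usually avoided by sorting keys (or using an ordered map) before anything observable is produced. The helper below is a hypothetical example, not engine code; `ordered_steps` and the step names are made up for illustration.

```rust
use std::collections::HashMap;

/// Render entries in key order so the output does not depend on the
/// HashMap's internal iteration order.
pub fn ordered_steps(steps: &HashMap<String, String>) -> Vec<String> {
    let mut keys: Vec<&String> = steps.keys().collect();
    keys.sort();
    keys.into_iter()
        .map(|k| format!("{}: {}", k, steps[k]))
        .collect()
}
```

Two maps with the same contents then render identically regardless of insertion order, which is the property result ordering, plan choice, and migration output need.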

Review Checklist

Use this as yes/no/NA for any non-trivial design or PR:

  • Does it respect Lance/DataFusion instead of rebuilding them?
  • Does it preserve manifest-atomic graph visibility?
  • Does every query keep one snapshot for its lifetime?
  • Do mutations publish once at the commit boundary?
  • Can every Lance-HEAD-before-manifest gap recover all-or-nothing?
  • Are schema and edge integrity checks strict by default?
  • Are query semantics represented in AST/IR/planner structures?
  • Are transport, auth, and policy boundaries preserved?
  • Are failures bounded, typed, and observable?
  • Are result ordering and plan choices deterministic within a snapshot?
  • Are stats/capabilities exposed when behavior depends on them?
  • Are existing known gaps left no worse and documented if touched?
  • Does the test live at the same boundary as the change?
  • Does the change avoid every deny-list pattern, or justify the exception?

Maintenance Policy

Update this file when an invariant changes, a known gap opens or closes, or a new review anti-pattern deserves deny-list treatment. Prefer stable headings over numbered sections so other docs can link here without churn.

Removing or relaxing a hard invariant requires the same review process as code. Adding a known gap is acceptable when it makes reality explicit; leaving stale claims is not.