feat(engine): retire commit-graph tables (#311)

* docs(dev): write-latency roadmap (validated cost model + layered fix)

Records the validated 6-LIST warm-write cost model, the two root causes
(un-GC'd _versions/; re-resolving latest by listing), and the layered fix
(GC + capture-once reuse), plus how commit-graph-table retirement feeds in.
Linked from docs/dev/index.md next to the RFC-013 docs.

* feat(engine)!: strand storage versioning — one internal-schema version, no in-place migration

Set MIN_SUPPORTED == CURRENT == 4: this binary reads exactly one `__manifest`
internal-schema version and refuses any older graph on open with a
rebuild-via-export/import message, instead of migrating it in place. Storage
format changes become a deliberate cutover, not a permanently-carried in-place
migration — the pre-release "complexity must be earned" contract.

Delete the entire in-place migration apparatus and everything that existed only
to support it: the `migrate_vN` arms + dispatcher + stamp-bump helpers + the
schema-version-floor tripwire; `migrate_on_open` (both open modes now refuse);
the legacy `_graph_commits.lance` readers + the v3 test fixtures + migration
tests + `migration.v3_to_v4.*` failpoints + the two surface guards that pinned
Lance variants only the migration matched on; and `state::merge_lineage_rows`.
Keep `read_stamp` / `stamp_current_version` / `set_stamp` /
`refuse_if_stamp_unsupported` — the seam a future one-shot converter plugs into.

`load_commit_cache_for_branch` now reads the `__manifest` projection
unconditionally (sub-v4 graphs are refused at open). Adds
`sub_current_graph_is_refused_on_open_with_rebuild_hint`.

The commit-graph TABLES are still created/used as branch-ref ledgers — their
retirement (CommitGraph -> pure `__manifest` projection) is the next commit.

BREAKING CHANGE: a graph created by omnigraph <= 0.7.2 (internal schema v3) is
refused on open. Rebuild it: `omnigraph export` with the old release, then
`omnigraph init` + `omnigraph load` with this one. Data, vectors, and blobs are
preserved; commit history and branches are not.

* feat(engine)!: retire `_graph_commits.lance` / `_graph_commit_actors.lance` — CommitGraph is a pure `__manifest` projection

Since RFC-013 Phase 7, graph lineage lives in `__manifest` (`graph_commit` /
`graph_head` rows) and branch authority is `__manifest` (branch create forks it
first). The two commit-graph datasets were vestigial: `_graph_commit_actors.lance`
was never written or read; `_graph_commits.lance` carried zero commit rows and
only mirrored the manifest's branch refs (a deny-list "parallel copy"). Retire
both.

- `CommitGraph` collapses to a pure projection: drops its Lance dataset handles
  (`dataset`/`actor_dataset`) and all branch methods; `open`/`open_at_branch`/
  `refresh`/`init` open NO dataset, building the cache from
  `ManifestCoordinator::read_graph_lineage_at`. Removes ~1.4s of cold-open
  dataset opens.
- `graph_coordinator`: `commit_graph` is now non-`Option` (always a valid
  projection). `branch_create`/`branch_delete` go through `ManifestCoordinator`
  only — a single atomic op, replacing the two-step manifest-fork +
  commit-graph-fork + rollback. Deleted `create_commit_graph_branch`,
  `reclaim_commit_graph_branch`, `ensure_commit_graph_initialized`, and every
  `storage.exists(_graph_commits.lance)` gate.
- `optimize`: dropped `reconcile_commit_graph_orphans` and the two tables from
  the internal-table compaction set (now `__manifest` only).
- `instrumentation`: `INTERNAL_TABLE_DIRS` no longer lists the two tables.
- Fresh graphs create neither table; `lineage_projection.rs` now asserts both
  `.lance` dirs are absent. Deleted the obsolete commit-graph-branch-race
  failpoint tests + their failpoint names, and updated the `maintenance`
  optimize tests (one internal table, not three).

Review-pass fixes folded in:
- Removed two stale `omnigraph.rs` in-source tests the prior run missed (a
  disk-full link failure masked them): one asserting `open` probes
  `_graph_commits.lance` (the exists-gate this commit removes) — it was masked
  earlier by a disk-full link failure.
- Corrected src comments referencing deleted code (`migrate_v3_to_v4`,
  `append_commit`/`append_merge_commit`, the three-internal-table list,
  the `_graph_commits` reconcile owner) in publisher/recovery/optimize/recovery_audit.
- Narrowed `set_stamp_for_test` to `cfg(test)` (its only caller is the refusal
  test) — removes a dead-code warning in the failpoints build.

Branch create/delete atomicity improves (single atomic `__manifest` op). No
behavior change for reads or branches.

Follow-up (separate commit): the now-always-0 `IoCounts::commit_graph_reads` test
counter + its `IOTracker`, threaded through ~11 cost-test files.

* feat: surface the internal-schema (storage-format) version to operators

After stranding storage versioning (a sub-v4 graph is refused on open), operators
could only discover the storage-format version by hitting a refusal. Surface it:

- `omnigraph version` prints an `internal-schema <N>` line (the binary's CURRENT
  storage-format version).
- `omnigraph snapshot` includes `internal_schema_version` — the GRAPH's per-branch
  on-disk stamp, read via the new `Omnigraph::internal_schema_version_of`.
- `GET /healthz` includes `internal_schema_version` (server-scoped: the binary's
  CURRENT, alongside `version`/`source_version`).

Wire: re-expose `INTERNAL_MANIFEST_SCHEMA_VERSION` as `pub` on `db::manifest`;
add `internal_schema_version: u32` to `SnapshotOutput` + `HealthOutput`;
`snapshot_payload` takes the per-graph version (the `Snapshot` does not carry it),
threaded through the embedded CLI + server snapshot callers. `openapi.json`
regenerated (two added int32 properties). Extends the existing healthz / snapshot /
version tests.

* docs(engine): gate internal-schema version at the graph level; record the per-branch read gap

PR reviewers flagged that the open path validates only main's internal-schema stamp, so a branch read could decode a branch stamped outside this binary's range. The stamp is a graph-wide storage-format property (the upgrade path is a whole-graph export/import), so with one binary version every branch is always CURRENT; divergence needs concurrent multi-version writers, an unsupported topology already in one-winner-CAS territory. Gating per-branch would add a second __manifest open per non-main branch read to defend a state we do not support, unearned complexity that regresses the warm-read budget.

Keep the graph-level gate, document it at the code site (refuse_if_internal_schema_unsupported), and record the read-only residual hole as a known gap in invariants.md to close only when multi-version write topologies become supported. Also clarify the sub-floor rebuild message to say "export with the older omnigraph binary that created it."

No behavior change: HEAD already gated at the graph level.

* test(cost): remove the dead commit_graph_reads IO counter

Phase B retired _graph_commits.lance / _graph_commit_actors.lance, so no commit-graph dataset is opened and the commit_graph IOTracker term is structurally always 0. Remove IoCounts::commit_graph_reads, its total_reads() term, the commit_graph IOTracker in OpProbes, and the now-dead commit_graph_wrapper field on QueryIoProbes (it had no accessor — nothing ever attached it). Drop the 7 trivially-true assert_eq!(commit_graph_reads, 0) checks in warm_read_cost.rs and the debug-print refs in write_cost{,_s3}.rs.

Lineage and actor rows now live in __manifest (RFC-013 Phase 7), so the internal_table_scans_are_flat_in_history gate folds into the single manifest_reads flat-assertion — the manifest scan already covers them. Harness-only; no production runtime impact.

* docs: align with the commit-graph retirement + strand storage versioning

Update the always-loaded and user-facing docs to match the landed state: graph lineage lives in __manifest, the _graph_commits.lance / _graph_commit_actors.lance tables are retired, and storage is strict-single-version (no in-place migration — a sub-CURRENT graph is refused with an export/import rebuild).

Fixed stale claims in invariants.md (the migration/atomicity known-gap entry, the Truth Matrix branch-delete row, the read-path/optimize internal-table scope), lance.md (the migrate_v1_to_v2 PK bullet now reflects init-time set; removed the two deleted v3->v4 migration surface guards), testing.md (dropped the deleted migration failpoint tests; manifest-only internal-table term), writes.md (rewrote the Migration-code section to the strand model), storage.md / maintenance.md / constants.md (retired tables out of the layout, internal-table compaction scope, and the constants cheat-sheet), and AGENTS.md. Marked the retirement DONE in the RFC-013 handoff/roadmap and banner-noted the historical RFC analysis.

Added docs/user/operations/upgrade.md (the export/import rebuild recipe) and docs/dev/versioning.md (the four-axis compatibility policy: release lockstep / wire additive / storage strict-single-version / Lance pinned), cross-linked from the audience indexes and the AGENTS.md topic map, and rewrote the in-progress v0.8.0 release note for the strand model + version surfacing. check-agents-md.sh passes (65 links, 62 docs).

* test(manifest): cover the v3-refusal→export/import rebuild cycle and branch stamp inheritance

Two coverage additions from PR review (P1):

(a) sub_current_graph_is_refused_then_rebuilt_via_export_import — the full operator narrative in one flow: load → export → a sub-CURRENT graph (stamp rewound below CURRENT) is refused with the export nudge → fresh init + load(export) → data present and the rebuilt graph opens. The refusal is stamp-only (read before any data), so a stamp-rewound graph is a faithful stand-in for a real older-release graph without a second binary; vector/blob fidelity stays covered by tests/export.rs.

(b) branch_inherits_main_internal_schema_stamp — proves a branch cannot diverge from main's stamp under single-binary operation (create_branch forks main's __manifest, the publisher does not re-stamp), which is why the graph-level (main-only) stamp gate is sufficient for supported inputs. A divergent branch stamp needs concurrent multi-version writers, the unsupported topology recorded as a known gap.
This commit is contained in:
Ragnor Comerford 2026-06-28 16:49:49 +02:00 committed by GitHub
parent 0dcdcf5a9d
commit 7779b72446
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
53 changed files with 903 additions and 3324 deletions

View file

@ -1,5 +1,12 @@
# Handoff: finishing RFC-013 (write-path latency + correctness)
> **Status update (strand-and-retire work):** the `_graph_commits.lance` /
> `_graph_commit_actors.lance` datasets are **retired** (Phase B — lineage lives in
> `__manifest`; the `CommitGraph` is a pure projection). References to those tables
> below are historical: `optimize` now compacts **`__manifest` only** among the
> internal tables, and the per-write `_graph_commits` scan term is gone. See
> [versioning.md](versioning.md) and [invariants.md](invariants.md).
**Status:** living handoff. **Source of truth is [`rfc-013-write-path-latency.md`](rfc-013-write-path-latency.md)**
this doc is the *current-state map + the decisions/validation from the latest work cycle
+ the concrete next actions*. When they disagree, the RFC wins (and fix this doc).
@ -142,9 +149,11 @@ a 0-IO cache hit). So, apples-to-apples (both ground truth), per-write `__manife
is genuinely cross-branch with no index today).
- **PR4 / RFC #7264** — Lance native branch-aware `BatchCreateTableVersions`; manifest read → O(1),
per-write fragment append gone; retires most of PR1/PR2. Upstream-blocked.
4. **Low-leverage:** retire the vestigial `_graph_commits`/`_graph_commit_actors` datasets (zero rows
post-#299, only branch-ref carriers); a bitmap index on `__manifest` (no builder exists; `use_index(false)`
means it can't serve the CAS join anyway — a `graph_head:<branch>` point-lookup is the better variant).
4. **Low-leverage:** ~~retire the vestigial `_graph_commits`/`_graph_commit_actors` datasets~~ **DONE**
(Phase B of the strand-and-retire work — the tables are gone, branch authority is `__manifest`, the
`CommitGraph` is a pure `__manifest` projection). Still open: a bitmap index on `__manifest` (no builder
exists; `use_index(false)` means it can't serve the CAS join anyway — a `graph_head:<branch>` point-lookup
is the better variant).
### A.6 Critical files (Thread B)

View file

@ -12,6 +12,7 @@ constraints. User-facing behavior should still be documented through
| Need | Read |
|---|---|
| Architectural rules, known gaps, deny-list | [invariants.md](invariants.md) |
| Versioning & compatibility policy (release / wire / storage / Lance) | [versioning.md](versioning.md) |
| Upstream Lance source-of-truth index | [lance.md](lance.md) |
| Existing test coverage and test placement | [testing.md](testing.md) |
@ -94,6 +95,7 @@ Working documents for in-flight feature work. Removed when the work lands.
| Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
| Write-path latency — capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) |
| RFC-013 handoff — current-state map, latest validation, and concrete next actions for finishing write-path latency and correctness work | [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) |
| Write-latency roadmap — validated cost model (the 6-LIST warm-write trace), the two root causes (un-GC'd `_versions/`; re-resolving latest by listing), and the layered fix (GC + capture-once reuse); how commit-graph-table retirement feeds in | [write-latency-roadmap.md](write-latency-roadmap.md) |
## Boundary

View file

@ -155,7 +155,7 @@ converge the physical state.
| Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | [writes.md](writes.md), [architecture.md](architecture.md) |
| Constructive mutations | In-memory `MutationStaging`, one end-of-query table commit per touched table, then one manifest publish | [writes.md](writes.md), [execution.md](execution.md) |
| Deletes | Staged like inserts/updates (`stage_delete` via Lance 7.0 `DeleteBuilder::execute_uncommitted`, MR-A) — no inline HEAD advance; mixed insert/update/delete in one query rejected by D2 as a deliberate boundary (constructive XOR destructive per query; compose via separate mutations or a branch) | [query-language.md](../user/queries/index.md), [writes.md](writes.md) |
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branching/index.md), [maintenance.md](../user/operations/maintenance.md) |
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branching/index.md), [maintenance.md](../user/operations/maintenance.md) |
| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema/index.md), [execution.md](execution.md) |
| Unique constraints | Intra-batch and write-path checks exist; intake and branch-merge derive the composite key through one shared function (`loader::composite_unique_key`, a separator-free `Vec<String>` tuple) and fail loudly on an un-keyable column type rather than silently exempting it; full cross-version uniqueness against already-committed rows is still a gap | [schema-language.md](../user/schema/index.md) |
| Storage trait | `TableStorage` (via `db.storage()`) is staged-only; the sole inline-commit residual (`create_vector_index`) is split onto a separate sealed `InlineCommitResidual` trait reached via `db.storage_inline_residual()` (MR-854), so §1 holds by construction; capability/stat surfaces are roadmap | [writes.md](writes.md), [architecture.md](architecture.md) |
@ -257,43 +257,46 @@ them explicit.
acknowledged-before-visible bug this branch fixed. Close it (local CAS
primitive, or a trait-level lock requirement) before admitting any
lock-free `if_match` caller.
- **Internal-schema stamp is validated at the graph (main) level only:**
`Omnigraph::open` reads the `omnigraph:internal_schema_version` stamp on
**main** and refuses an out-of-range graph (newer than CURRENT → "upgrade
omnigraph"; below MIN_SUPPORTED → "rebuild via export/import"). Branch reads
then trust that one check. This is correct by the storage-format contract: the
stamp is a graph-wide property (the upgrade path is a whole-graph
export/import), so with one binary version every branch is always CURRENT —
init stamps main, `create_branch` forks the stamp, the publisher writes rows
without re-stamping. A branch stamped out of range while main stays in range is
only reachable with concurrent **multi-version writers**, an unsupported
topology already in one-winner-CAS territory: writes to such a branch are
refused per-branch by the publisher (it reads its target's stamp), and a newer
binary advancing main is refused at open. The residual hole is read-only —
reading or merging a branch a newer binary advanced while main stayed old would
decode it with this binary's logic instead of refusing. Close it (a per-branch
read gate folded into the branch-manifest open the read already does, zero
extra I/O) only when multi-version write topologies are promoted to supported;
a separate stamp round-trip per branch read is the wrong shape (it regresses the
warm-read cost budget to defend an unsupported state).
- **Manifest→commit-graph publish atomicity — CLOSED (RFC-013 Phase 7):** graph
lineage now lives ONLY in `__manifest`, as `graph_commit` + `graph_head:<branch>`
lineage lives ONLY in `__manifest`, as `graph_commit` + `graph_head:<branch>`
rows written in the SAME `MergeInsertBuilder` commit as the table-version rows
(`commit_changes_with_lineage``GraphNamespacePublisher::publish` with a
`LineageIntent`). There is no second write to fail between — a graph commit and
its lineage land at one manifest version atomically, so a crash after the publish
leaves no gap. The commit-graph cache is a derived projection of those manifest
rows; nothing writes `_graph_commits.lance` (it persists only to carry branch
refs). The prior two-write gap (manifest at N with no `_graph_commits` row for N)
is gone by construction. A graph created before Phase 7 (internal schema v3)
carries its lineage only in `_graph_commits.lance`; the `migrate_v3_to_v4`
internal-schema step (`db/manifest/migrations.rs`) backfills it into `__manifest`
per-branch on the first read-write open (idempotent, crash-safe, data-preserving),
and a read-only open of an un-migrated v3 graph sources the DAG from
`_graph_commits.lance` via a stamp-gated transitional fallback so reads stay
correct until the first write migrates it. An old binary refuses a v4-stamped
graph (read-write and read-only) with the standard upgrade error. The migration
is **loud on failure and concurrent-runner idempotent**: the legacy-open read
(`read_legacy_commit_cache`) treats only a genuine not-found as "no legacy data"
and propagates any other open error (so a transient/corrupt open can never stamp
v4 over an empty backfill — orphaning lineage permanently), and the backfill
converges all-or-nothing when two runners open the same legacy graph at once — a
bounded re-open retry on the `graph_head:<branch>` row-level CAS plus an
idempotent terminal stamp bump (both runners write the same value, so a concurrent
`UpdateConfig`/`IncompatibleTransaction` loss re-opens and no-ops if the stamp
already landed). The branch read path (`load_commit_cache_for_branch`) also
refuses an out-of-range branch stamp (`> CURRENT` or `< MIN_SUPPORTED`;
defense-in-depth; not a live hole because migrations run main-first, so main
refuses first). The migration chain is **floor-bounded**:
`MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (migrations.rs; 1 today, a pure no-op) is
the oldest stamp this binary opens, enforced symmetrically with the ceiling by the
single `refuse_if_stamp_unsupported` guard at all three stamp-read sites
(write-path migrate, read-only open, branch lineage-read). Raising MIN sheds the
now-dead `migrate_vN_…` arms and (at MIN ≥ 4) the `commit_graph_legacy_v3` legacy
readers; a compile-time tripwire (`LOWEST_REGISTERED_MIGRATION_SOURCE`) fails the
build if the floor and the lowest registered arm drift. Retirement runbook lives on
the `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` doc-comment.
leaves no gap. The in-memory commit graph is a pure projection of those rows. The
`_graph_commits.lance` / `_graph_commit_actors.lance` tables are **retired**: a
fresh graph creates neither, branch authority is `__manifest` only, and nothing
reads or writes them. The prior two-write gap (manifest at N with no
`_graph_commits` row for N) is gone by construction.
- **Storage is strict-single-version (the strand model):** this binary reads
exactly ONE internal-schema version (`MIN_SUPPORTED == CURRENT`), so there is no
in-place migration. A graph stamped below CURRENT is refused on open with a
rebuild-via-export/import message (`refuse_if_stamp_unsupported`), not silently
upgraded; a graph stamped above CURRENT is refused with an "upgrade omnigraph"
message. The `migrate_v*` dispatcher, the `_graph_commits.lance` legacy-read
fallback, and the migration floor-bounding machinery were all deleted with the
retirement — the stamp + `refuse_if_stamp_unsupported` floor is the only seam a
future migration would re-introduce. See `docs/dev/versioning.md` (the
compatibility policy) and `docs/user/operations/upgrade.md` (the rebuild recipe).
- **Planner capability/stat surfaces:** cost-aware planning, complete
capability advertisement, and explain-with-cost are roadmap. Do not describe
them as implemented.
@ -310,7 +313,8 @@ them explicit.
widening the gap.
- **Read-path re-derivation (largely closed by the query-latency work):**
snapshot resolution used to re-open a fresh coordinator per read (a full
`__manifest` re-scan plus two commit-graph scans), open each table through the
`__manifest` re-scan plus the then-separate commit-graph-table scans, since
retired), open each table through the
namespace (two more `__manifest` scans per table), validate the schema twice,
and share no Lance `Session`. That was an O(commits) cost that never warmed up.
Fix 1 (warm coordinator reuse behind a `latest_version_id` probe), Fix 2 (open
@ -324,10 +328,11 @@ them explicit.
the manifest e_tag is carried into synthetic snapshot ids when available, and
a detected same-branch manifest refresh clears read caches as the fallback for
e_tag-less table locations/topology. Remaining: `optimize` now compacts the
internal metadata tables (`__manifest`, `_graph_commits`) too (RFC-013 step 2),
so a *periodically-optimized* graph keeps the probe/refresh/per-write scan flat
in history; but they are not yet brought into `cleanup` (version GC), so the
`_versions/` chain still grows until an explicit cleanup (the cleanup half is
internal metadata table (`__manifest`, which carries the lineage rows) too
(RFC-013 step 2), so a *periodically-optimized* graph keeps the
probe/refresh/per-write scan flat in history; but it is not yet brought into
`cleanup` (version GC), so the `_versions/` chain still grows until an explicit
cleanup (the cleanup half is
deferred — it needs the Q8 cleanup-resurrection watermark first). The commit
graph IS now reconcilable from the manifest (RFC-013 Phase 7 — it is a pure
projection of the `graph_commit`/`graph_head` rows); the traversal id-map is

View file

@ -166,11 +166,17 @@ Migration from Lance 6.0.1 → 7.0.0 landed in this cycle. **Arrow stayed 58, Da
- **BTREE range-query bound inclusiveness fixed** (PR #6796, issue #6792): `x <= hi AND x > lo` returned the wrong boundary row on 6.0.1. omnigraph today builds BTREE only on string `@key` columns (`id`/`src`/`dst`) and queries them by equality/IN, not range, so its *current* query patterns almost certainly never hit this bug — but the corrected boundary semantics are a contract we rely on the moment a BTREE-range path appears (BTREE-on-properties via the index-type tickets, or a range-on-key query). Pinned by `lance_surface_guards.rs::btree_range_query_boundary_is_correct` (reproduces #6792's 5-row + BTREE shape).
- **`WriteParams::auto_cleanup` default flipped from on (every-20-commits) to `None`** (PR #6755). On 6.0.1 the on-by-default hook could GC versions the `__manifest` pins for snapshots/time-travel. omnigraph owns cleanup explicitly (`optimize.rs::cleanup_all_tables`). Two parts to the fix, because `auto_cleanup` is **create-time config only and has no effect on existing datasets** (Lance `write.rs` docs): (1) `auto_cleanup: None` at all 11 `WriteParams` sites so *new* datasets store no cleanup config; (2) — the load-bearing half — `skip_auto_cleanup: true` on every commit path, because graphs created **before** the bump still carry the on-config in their datasets, and Lance's hook fires off the *dataset's stored* config at commit time (`io/commit.rs`: `if !commit_config.skip_auto_cleanup`). So the staged commit path (`commit_staged``CommitBuilder::with_skip_auto_cleanup(true)`), the `__manifest` publisher (`MergeInsertBuilder::skip_auto_cleanup(true)`), and the direct `WriteParams` paths all skip the hook. Without this, an upgraded graph would still auto-cleanup and delete `__manifest`-pinned versions. Pinned by `lance_surface_guards.rs::skip_auto_cleanup_suppresses_version_gc` (negative control + with-skip survival).
- **Lance #6658 SHIPPED in 7.0.0** (`DeleteBuilder::execute_uncommitted`, exposed via PR #6781) → MR-A (migrate `delete` to the staged two-phase API) **has since landed** (dev-graph `iss-950`): `delete_where` is retired, deletes stage via `TableStorage::stage_delete`, and the guard was flipped to `_compile_uncommitted_delete_field_shape` (pins `execute_uncommitted` / `UncommittedDelete`). `StagedWrite` must carry `UncommittedDelete.affected_rows` through `commit_staged` so Lance's row-level rebase metadata is preserved. The parse-time D2 rule is retained as a deliberate boundary (constructive XOR destructive per query), not as scaffolding awaiting further work.
- **The unenforced primary key is now immutable once set** (`lance::dataset::transaction`, ~L24722480: `if !primary_key_before.is_empty() && (writes_primary_key || primary_key_after != primary_key_before) → "the unenforced primary key is a reserved key and cannot be changed once set"`). omnigraph marks `__manifest.object_id` as the unenforced PK (`lance-schema:unenforced-primary-key`) for merge-insert row-level CAS — baked into `manifest_schema()` at init, and added by the `migrate_v1_to_v2` internal-schema migration for pre-v0.4.0 graphs. The migration relied on Lance 6's idempotent re-apply for crash-recovery (a crash after the field-set but before the stamp bump re-enters the migration with the PK already present); under v7 that re-apply errors, so a real v1 graph could never finish migrating. Fixed by guarding the set on the manifest's unenforced-PK field (`db/manifest/migrations.rs::migrate_v1_to_v2`): `["object_id"]` → no-op, `[]` → set, any other PK field → loud refusal (the wrong CAS key, unchangeable under v7). Pinned by `lance_surface_guards.rs::unenforced_primary_key_is_immutable_once_set` (red if Lance relaxes immutability); regression: `db::manifest::tests::test_publish_migrates_pre_stamp_manifest_to_current_version` (was red under v7).
- **The unenforced primary key is now immutable once set** (`lance::dataset::transaction`, ~L24722480: `if !primary_key_before.is_empty() && (writes_primary_key || primary_key_after != primary_key_before) → "the unenforced primary key is a reserved key and cannot be changed once set"`). omnigraph marks `__manifest.object_id` as the unenforced PK (`lance-schema:unenforced-primary-key`) for merge-insert row-level CAS — baked into the manifest schema at init (`db/manifest/state.rs`). With the strand model there is no in-place migration, so the PK is only ever set at init: a graph that predates the annotation is refused on open (`refuse_if_stamp_unsupported`) and rebuilt via export/import, never re-keyed — which is also what Lance's immutability rule would require, since the wrong PK could not be changed once set. Pinned by `lance_surface_guards.rs::unenforced_primary_key_is_immutable_once_set` (red if Lance relaxes immutability).
- **Native `DirectoryNamespace` no longer recognizes omnigraph's manifest-tracked tables** (`lance-namespace-impls` dir.rs ~L1310): `list/describe/create_table_version` route through `check_table_status`, which reports an omnigraph table absent → `TableNotFound`. The decoupling is *contingent on omnigraph's legacy boolean PK key*, not an unconditional v7 property: v7's namespace eagerly adds the new `lance-schema:unenforced-primary-key:position` key to any `__manifest` lacking it; that write hits the immutable-PK rule above (the boolean key already set the PK), so `ensure_manifest_table_up_to_date` errors and the namespace silently falls back to directory listing. omnigraph keeps the boolean key deliberately — Lance honors it permanently (maps to PK position 0), and one uniform on-disk format beats a new-vs-old split (existing graphs can't be re-keyed to the position key under that same immutability rule). omnigraph production never uses Lance's native namespace (its publisher writes `__manifest` directly via merge_insert; its own `namespace.rs` impls are custom), so this is test-only — the `test_directory_namespace_direct_publish_cannot_replace_native_omnigraph_write_path` surface guard was realigned to the v7 behavior (it now asserts the native namespace is fully decoupled, which only strengthens the guard's thesis).
- **Still NOT fixed in 7.0.0:** vector-index two-phase (Lance #6666 open) — `create_vector_index` inline residual retained; blob-column compaction — `compact_files_still_fails_on_blob_columns` guard still red on a fix, `optimize` still skips blob tables behind `LANCE_SUPPORTS_BLOB_COMPACTION`.
- **No Lance API surface omnigraph uses changed at *compile* time** (the only compile break was object_store) — but **two runtime behaviors did** (the unenforced-PK immutability and the native-namespace `TableNotFound`, above), each caught by the full engine test suite rather than the build. `CleanupPolicy`, `WriteParams` (apart from the `auto_cleanup` default), `CompactionOptions`, the namespace models (resolved via `lance-namespace-reqwest-client` 0.7.7, unchanged across the bump), `Operation`, `ManifestLocation`, and `MergeInsertBuilder` shapes are all stable. Lesson: a clean build is not a clean alignment — run `cargo test --workspace` before declaring a Lance bump done.
- **Two surface guards added by the v3→v4 migration-robustness follow-up** (not a Lance bump, but they pin Lance error surfaces the migration now classifies on): `dataset_open_missing_returns_not_found_variant` (a missing `Dataset::open` returns `DatasetNotFound`/`NotFound` — the legacy-open read in `db/commit_graph.rs::read_legacy_commit_cache` treats only those as "no legacy data" and propagates everything else) and `lance_error_incompatible_transaction_variant_exists` (a concurrent `UpdateConfig` stamp-bump loses with `IncompatibleTransaction``db/manifest/migrations.rs::commit_v4_stamp_idempotently` matches it to retry the benign same-value race). Re-run on a Lance bump like the others.
- **The v3→v4 migration-robustness surface guards were removed with the strand.**
An earlier cycle added `dataset_open_missing_returns_not_found_variant` and
`lance_error_incompatible_transaction_variant_exists` to pin Lance error surfaces
the `migrate_v3_to_v4` backfill classified on. The strand retirement deleted that
migration (storage is now strict-single-version — see [invariants.md](invariants.md)),
so those guards and the legacy-read/stamp-bump code they pinned are gone. No
current omnigraph code path classifies on those Lance variants.
Bump this date stanza on the next alignment pass.

View file

@ -1,5 +1,13 @@
# RFC-013: Write-path latency — capture-once `WriteTxn`, manifest-authoritative publish, bounded history, and a measured cost contract
> **Status update (storage-versioning strand-and-retire work):** the
> `_graph_commits.lance` / `_graph_commit_actors.lance` datasets are now **retired**
> (Phase B — lineage lives in `__manifest` as `graph_commit` / `graph_head` rows;
> the `CommitGraph` is a pure `__manifest` projection). References to those tables
> below are historical: the remaining `optimize` / `cleanup` internal-table scope is
> **`__manifest`-only**, and the per-write `_graph_commits` scan term is gone. See
> [invariants.md](invariants.md) and [versioning.md](versioning.md).
**Status:** Proposed
**Author(s):** write-path latency investigation (handoff + multi-agent validation)
**Date:** 2026-06-19

View file

@ -26,7 +26,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) — reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
| `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
| `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — `internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest` scans flat in commit-history depth, lineage rows included`internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
| `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives; reads per-op deltas via lance's `incremental_stats()`, the upstream per-request idiom from `rust/lance/src/dataset/tests/dataset_io.rs`), `cost_harness`/`GraphIoMeter` (installs ONE `__manifest` `IOTracker` for a whole test body so the graph opens **under** it and `manifest_reads` is **ground truth** — every read regardless of handle age, the warm-coordinator freshness probe included — closing the blind spot where a per-op tracker installed at measure time cannot see a long-lived handle's reads; outside `cost_harness`, `measure` falls back to fresh per-op tracking, so `write_cost_s3.rs` is unaffected), `last_manifest_reads()` (the manifest read log for `assert_io_eq!`-style failure diagnostics), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary |
| `lifecycle.rs` | Graph lifecycle, schema state |
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
@ -46,7 +46,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
| `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly — no HTTP — proving embedded and CLI callers hit the same gate as the server, with action × scope shapes matching `authorize_request` |
| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice; the index reconciler (iss-848): `index_build_tolerates_null_vector_rows` (an untrainable Vector column defers instead of aborting the build, sibling indexes still build) and `optimize_materializes_index_declared_but_unbuilt` (optimize creates a declared-but-deferred index) |
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). Also the v3→v4 migration fault-injection test (`transient_legacy_open_failure_aborts_migration_without_stamping_v4`, `migration.v3_to_v4.legacy_open` failpoint): a transient legacy-open failure aborts the migration loudly and leaves it retryable (stamp stays v3, no partial backfill), never stamping v4 over an empty backfill. Also the v4 stamp-bump exhaustion regression (`v4_stamp_exhaustion_returns_retryable_contention`, `migration.v4_stamp.force_incompatible` failpoint): the stamp retry loop surfaces a retryable `RowLevelCasContention` on exhaustion, not a stringified `Lance`. And the convergence-idempotent roll-forward regression (`open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one sidecar at the `recovery.before_roll_forward_publish` rendezvous; the CAS loser must converge, not fail the open — iss-schema-apply-reopen-recovery-race). |
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). And the convergence-idempotent roll-forward regression (`open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one sidecar at the `recovery.before_roll_forward_publish` rendezvous; the CAS loser must converge, not fail the open — iss-schema-apply-reopen-recovery-race). |
| `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path |
| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). |
@ -139,7 +139,7 @@ Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and deg
- **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query."
- **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost.
- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest`/`_graph_commits` fragment scans) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears.
- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest` fragment scans, lineage rows included) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears.
- **Count on the handle that does the reads, not just the one a measured op opens.** Lance's IO-counted tests attach the `IOTracker` to the (warm, cached) dataset and read `incremental_stats()` per request — the tracker MUST be on the handle performing the reads, or warm-handle reads escape. A per-op tracker installed at measure time cannot see reads on a long-lived handle opened earlier (the warm coordinator's `__manifest` handle, reused across writes), so such reads were silently undercounted. Wrap a depth-swept body in `cost_harness` so the manifest tracker is installed before the graph opens and `manifest_reads` is **ground truth** (handle-age-irrelevant). The `version_probes` counter is the freshness-probe *call* count; ground truth additionally reveals that a write's probe does ~3 object-store RPCs (a read's probe is a 0-IO cache hit). `manifest_reads_capture_warm_probe` is the guard that this stays true.
- This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history).

78
docs/dev/versioning.md Normal file
View file

@ -0,0 +1,78 @@
# Versioning & compatibility policy
**Audience:** engine / storage / release maintainers
**Status:** living document
Omnigraph has four independent version axes. They have different compatibility
contracts because they fail in different ways and at different costs. Conflating
them (for example, treating a storage-format change like a wire change) is how you
either ship an unsafe silent-misread or carry migration code you do not need.
| Axis | Policy | Mechanism |
|---|---|---|
| **Release (semver)** | All published crates move in lockstep. | Maintenance-contract rule 4 in [AGENTS.md](../../AGENTS.md): a release bump updates every crate manifest, `Cargo.lock`, `openapi.json`, and the surveyed version line together. |
| **CLI ↔ server wire** | Additive and rolling-safe; **no version gate**. New fields are optional; old clients ignore unknown fields and omit new ones. | Additive JSON DTOs in `omnigraph-api-types`; the OpenAPI-drift test (`crates/omnigraph-server/tests/openapi.rs`) catches an unintended wire change. |
| **Storage (internal manifest schema)** | **Strict single version**; upgrade is a cutover via export/import, never an in-place migration. | A stamp (`omnigraph:internal_schema_version`) in `__manifest`'s schema metadata + `refuse_if_stamp_unsupported`, with `MIN_SUPPORTED == CURRENT`. |
| **Lance on-disk format** | Pinned to one Lance version; bumped deliberately with the engine. | `data_storage_version: V2_2` at every write site + the surface guards in [lance.md](lance.md), re-run on every Lance bump. |
## Why storage is strict-single-version (the strand model)
The internal-schema stamp gates the on-disk shape of `__manifest`. The contract is:
**this binary reads exactly one internal-schema version.** `Omnigraph::open` (both
read-write and read-only) reads main's stamp before any data and refuses anything
it cannot serve:
- a stamp **below** CURRENT → refused with a rebuild-via-export/import message (see
[the upgrade guide](../user/operations/upgrade.md));
- a stamp **above** CURRENT → refused with "upgrade omnigraph", so an old binary
cannot silently misread a newer format.
There is no in-place migration dispatcher. The single source file
`db/manifest/migrations.rs` holds only the version constant, the stamp read/write,
and `refuse_if_stamp_unsupported`.
This is a liability decision, not a limitation we have not gotten around to. In-place
migration code is permanent surface: every future format change has to write,
test, and keep working a `vN → vN+1` step, plus the legacy readers and crash-recovery
paths each step needs, for a storage format that is still pre-release and changing.
The strand model trades that ongoing cost for a one-time operator action (export +
import) when a format changes. Per "engineering is programming integrated over time"
(see [AGENTS.md](../../AGENTS.md)), the lower-liability option is to **not** carry
the machinery until a concrete graph demands it.
The stamp + `refuse_if_stamp_unsupported` floor is exactly the seam a future in-place
migration would re-introduce: re-add a dispatcher and lower `MIN_SUPPORTED` below
CURRENT for the versions it can actually walk forward. Until then that machinery is
deliberately absent.
### Gating altitude
The stamp is validated at the **graph (main) level**: `Omnigraph::open` checks main
once, and branch reads trust it. The stamp is a graph-wide storage-format property
(the upgrade path is a whole-graph export/import), so with one binary version every
branch is always CURRENT — init stamps main, `create_branch` forks the stamp, and the
publisher writes rows without re-stamping. A branch stamped out of range while main
stays in range is only reachable with concurrent multi-version writers, an
unsupported topology; the residual is recorded as a known gap in
[invariants.md](invariants.md).
## Why the wire is additive-rolling-safe instead
The CLI↔server boundary is the opposite case: clients and servers are deployed
independently and a hard gate there would force lockstep redeploys for every field
addition. So that axis is additive — old and new coexist — and the OpenAPI-drift test
is the guard that a change stayed additive rather than breaking the shape.
## When you change each axis
- **Storage format**: bump `INTERNAL_MANIFEST_SCHEMA_VERSION`, keep
`MIN_SUPPORTED == CURRENT` (unless you are re-introducing migration), update the
stamp history on the constant's doc-comment, and add a release note pointing at
the upgrade guide. The change is breaking by construction — pre-bump graphs are
refused.
- **Wire**: keep it additive; regenerate `openapi.json`
(`OMNIGRAPH_UPDATE_OPENAPI=1`); do not add a version gate.
- **Lance**: follow the Lance-bump checklist in [lance.md](lance.md) — re-run the
surface guards first, then `cargo test --workspace` (a clean build is not a clean
alignment).
- **Release**: lockstep per the maintenance contract.

View file

@ -0,0 +1,79 @@
# Write-latency roadmap (validated cost model + layered fix)
**Type:** planning roadmap (forward-looking; not yet implemented)
**Companion to:** [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) (design) and [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) (current state + next actions)
**Status:** validated against Lance 7.0.0 source + a measured S3/RustFS trace of the `a7d4cba5` (#307) binary; **re-validated unchanged against HEAD `0dcdcf5a` (#308)** — see "Currency" below
This doc records the **validated** root-cause analysis of single-row write latency and the layered, correct-by-design fix, so the analysis is not lost between sessions. It also records how the separate commit-graph-table **retirement** work (see the strand-and-rebuild plan) feeds into it.
## Currency (re-validated at HEAD `0dcdcf5a`, #308)
#308 ("Stage the delete path; retire the inline-delete residual", MR-A) is a **pure delete-path refactor**: it moved `delete_where` off the inline `Dataset::delete` HEAD-advance onto Lance 7.0's two-phase `stage_delete` (`DeleteBuilder::execute_uncommitted`) + `commit_staged`. The **insert path is untouched** — every step of the cost model below re-traced identically (the `stage_merge_insert` + `commit_staged` open/commit count is unchanged; WriteTxn "collapse #1/#4", the probe-gated `occ_snapshot_for_branch`, and the publisher chain all survive). Two consequences worth recording: (1) deletes are now staged like inserts, so this cost model — and the layered fix — **now apply uniformly to deletes too** (before #308 a delete was an inline-commit residual with a different shape); (2) the sole remaining inline-commit residual is `create_vector_index`, and D2 (constructive XOR destructive per query) is now a **permanent by-design boundary, not a gap**. Lance stays pinned at 7.0.0, so Root-cause-A's `commit.rs` / `cleanup.rs` references are unchanged.
## Problem
A single keyed-node insert on `main` against the production graph (Cloudflare Container + R2, 343 MB, never cleaned) measured **~7.2s median**. The cost is not the writes — it is read/list amplification, dominated by repeated "resolve latest version" listings of `_versions/` prefixes.
## Validated cost model of one warm write
A warm server insert of one keyed node on `main` performs **6 `_versions/` LISTs + 1 full `__manifest` scan + 1 recovery LIST + ~5 writes**, every operation reconciled to code and matching the measured served trace exactly (3 `__manifest/_versions/` lists, 3 `node:SyncState/_versions/` lists, ~132 `__manifest` reads, 1 recovery list, 8 PUT + 1 POST):
| # | Operation | Cost | Source |
|---|---|---|---|
| 1 | recovery sidecar list | 1 `__recovery/` LIST | `recovery.rs` heal |
| 2 | schema-contract validate | 3 text GETs (not manifest) | `schema_state.rs` |
| 3 | branch resolve (warm) | 0 I/O (in-memory coord) | `omnigraph.rs` `resolved_branch_target` |
| 4 | per-table mutation open | **0** — WriteTxn returns `None` (RFC-013 collapse #1) | `table_ops.rs` |
| 5 | staging open (`reopen_for_mutation`) | **1 `node:SyncState/_versions/` LIST** (opens at *latest*) | `table_store.rs` `open_dataset_head_for_write` |
| 6 | OCC re-capture probe | **1 `__manifest/_versions/` LIST** (`latest_version_id`) | `manifest.rs` `probe_latest_version` |
| 7 | data-table drift probe | **1 `node:SyncState/_versions/` LIST** | `staging.rs` |
| 8 | sidecar PUT | 1 PUT | `staging.rs` `write_sidecar` |
| 9 | `commit_staged` (data commit) | **1 `node:SyncState/_versions/` LIST** (Lance commit) + PUTs | `table_store.rs` |
| 10 | index-build reopen | **0** — reuses handle (RFC-013 collapse #4) | `table_ops.rs` |
| 11 | `load_publish_state` | **1 `__manifest/_versions/` LIST + 1 full scan (~132 reads)** | `publisher.rs` |
| 12 | publish merge-insert commit | **1 `__manifest/_versions/` LIST** (Lance commit) + 1 POST | `publisher.rs` `merge_rows` |
| 13 | sidecar DELETE | 1 DELETE | `mutation.rs` |
The WriteTxn (RFC-013 step 3b) and #307 already eliminated several opens (steps 4, 10) and the redundant publish scans. What remains is structural.
## Two independent root causes
**Root cause A — per-LIST cost grows with history (un-GC'd `_versions/`).** On S3/R2, Lance resolves "latest version" via `resolve_version_from_listing` (`lance-table-7.0.0/src/io/commit.rs:552`). Even though V2 reverse-sorted naming puts the latest first, it reads **up to 1000 entries as a lexical-order sanity check** (`commit.rs:584`, `valid_manifests.take(999)`). So each LIST pulls a full ~1000-key page (~205 KB, ~500 ms on R2). `cleanup` covers only catalog data tables — **the internal tables are never version-GC'd** (`optimize.rs` `all_table_keys(&db.catalog())`), so `__manifest/_versions/` grows one entry per commit forever. The version hint that would skip the list is **unavailable on S3/R2** — Lance gates it to non-lexical stores (`commit.rs:327`, `uses_version_hint = enabled && !list_is_lexically_ordered`); `LANCE_USE_VERSION_HINT` can only *disable* it. NOTE: `optimize` already *compacts* the internal tables (RFC-013 step 2), which bounds the `__manifest` data **scan** (step 11), but compaction adds a version — only `cleanup` shrinks the `_versions/` **list**.
**Root cause B — the write re-resolves "latest" 6× by listing instead of reusing a known version.** Each boundary (table open, OCC probe, table commit, manifest open, manifest commit) independently asks "what's latest?" via a list. Opening at a *known* version is list-free: `DatasetBuilder::from_uri(loc).with_version(N)` reconstructs the exact V2 filename and does one GET — the read path already uses this (the held-handle cache). The write path deliberately opens HEAD and never consults the pinned pointer, even though the `__manifest` `table_version` row already stores `manifest_path` + `version` + `e_tag` (`TableVersionMetadata`).
## The layered fix (each layer correct on its own)
Target end state: **a warm single-writer insert costs O(keep-N + writes), independent of total history, with the publish CAS as the sole correctness authority.** Ordered by impact-per-risk.
### Layer 1 — version-GC the internal table(s) (attacks Root cause A; biggest win)
Bring `__manifest` into `cleanup` with a small keep-window (e.g. keep 2050), shrinking every LIST from ~1000 keys (~205 KB, ~500 ms) to ~2050 keys (~5 KB, ~40 ms) — a ~10× cut to the dominant term, reusing the `cleanup_all_tables` loop. Lance's `cleanup_old_versions` supports this directly (`CleanupPolicy { before_version, before_timestamp }`, always preserves latest, honors tags — `lance-7.0.0/src/dataset/cleanup.rs`).
- **Correctness:** a *manual, quiesced, sidecar-refusing* cleanup is correct **without** the Q8 watermark (the resurrection race only exists under concurrent live writers). Time-travel below the window is lost — the same tradeoff data-table cleanup already makes.
- **Immediate operational fix:** a one-shot quiesced `__manifest` cleanup on the production graph takes ~7s → ~1s today, zero code risk.
- **Connection to the retirement work:** after the commit-graph-table retirement, `_graph_commits`/`_graph_commit_actors` no longer exist, so Layer 1's GC scope is **just `__manifest`** (originally three internal tables). Retiring those tables also removes their 2 cold-open LISTs entirely — an orthogonal cold-start win.
### Layer 2 — Q8 durable monotonic watermark (makes Layer 1 safe under live writers)
Lance creates versions with a bare `PutMode::Create` and no monotonic guard, so a stalled writer can resurrect a GC'd version = a silent lost write on R2/S3. To run GC automatically while serving writes, add a durable boundary watermark (a Lance boundary **tag**, which `cleanup_old_versions` protects): GC advances the floor before deleting; every writer/open rejects `version ≤ boundary`. Invariant-level, touches the open path; correctly deferred in RFC-013 (§ step 2b / Q8). See [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) and [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) for the full design.
### Layer 3 — open data tables at the pinned manifest version (attacks Root cause B)
On the **non-strict** insert/merge path, route `reopen_for_mutation` through the version-pinned opener (`open_table_dataset(location, pinned_version, session)`) instead of `Dataset::open` at HEAD. The `__manifest` row already pins the exact version + e_tag, and the non-strict path already skips the strict version check — so opening at the pinned version is information-equivalent and **list-free** (removes step 5, folds the drift probe). The publisher CAS + `latest_version_id` drift guard remain the concurrency fences.
### Layer 4 — optimistic warm publish (attacks Root cause B)
The publisher loop runs cold `load_publish_state` (open + full scan) on every attempt, then arbitrates via the merge-insert row-level CAS — and the CAS is the stated authority. Thread the warm coordinator's already-open `__manifest` handle + in-memory `known_state` (folded after every commit, RFC-013 PR2 #1b) into **attempt 0**; fall through to today's cold `load_publish_state` on attempts 1..N **only when the CAS actually conflicts**. A current warm view (single-writer server, the common case) commits with zero manifest open/scan. Scope to non-strict ops; strict ops (update/delete/schema) keep the cold fresh-read drift check.
## Why this is the optimal shape (not a symptomatic patch)
- It is the natural extension of the WriteTxn "capture-once" architecture (RFC-013 step 3b) that already eliminated the per-table mutation opens — Layers 34 finish the job for the publish + commit opens.
- It separates **correctness (the CAS) from optimization (every read/probe)**, matching invariant 15: the optimistic path is probe-free; the cold path runs only under genuine contention. Write cost becomes O(1) with no contention, O(contention) only under real concurrency.
- It introduces **no parallel copy of version state** (deny-listed) — Lance + the manifest stay the single source of truth; we stop *re-deriving* "latest" from a cold list on the hot path (the "cold re-derivation" anti-pattern, invariant 15).
- Explicitly **not** pursued: the version hint (unavailable on S3) or an external manifest store (a drift-prone parallel pointer). GC + capture-once reuse is the substrate-respecting path.
## Sequencing
0. **Prereq (separate plan): retire `_graph_commits.lance` + `_graph_commit_actors.lance`. — DONE.** Phase B of the strand-and-retire storage-versioning work landed this: the tables are gone, Layer 1's GC scope is `__manifest`-only, and the 2 cold-open LISTs are removed.
1. **Operational now:** one-shot quiesced `__manifest` cleanup on the production graph (~7s → ~1s, no code).
2. **Layer 1:** wire `__manifest` into the `cleanup` command (manual/quiesced + sidecar-refuse) — the 80/20.
3. **Layers 3 + 4:** open-at-pinned-version + optimistic warm publish — remove the redundant round-trips; bring a warm write to ~2 cheap Lance-commit lists + writes.
4. **Layer 2:** Q8 watermark — only when GC must run continuously under live writers; heaviest, correctly deferred.
Layers 1+3+4 are independently shippable, each correctness-preserving, composing to the scalable end state. Each lands with `write_cost_s3.rs` extensions asserting the per-write LIST count drops and stays flat across commit-history depth.

View file

@ -355,36 +355,32 @@ actual }`. The HTTP server maps this to **409 Conflict** with body
`__manifest`, written in the publish CAS (RFC-013 Phase 7; previously
`_graph_commits.lance`). Audit history is queried via `omnigraph commit list`.
## Migration code
## Storage versioning (no in-place migration)
`db/manifest/migrations.rs` is the single place on-disk `__manifest` shape is
reconciled with what the binary expects, stepping the
`omnigraph:internal_schema_version` stamp forward one `match`-arm at a time. It
runs in `Omnigraph::open(ReadWrite)` (via `manifest::migrate_on_open`, before the
coordinator reads branch state) and again on the publisher's write path, so each
branch migrates on its first write; every step is idempotent under crash-retry
(work first, stamp bump last).
`db/manifest/migrations.rs` is the single place the on-disk `__manifest` shape is
reconciled with what the binary expects. Storage is **strict-single-version** (the
strand model): this binary reads exactly ONE internal-schema version
(`MIN_SUPPORTED == CURRENT == 4`), so there is no in-place migration.
- **v2→v3** (MR-770): a one-time sweep that deletes legacy `__run__*` staging
branches off `__manifest`. Deleting the inert `_graph_runs.lance` /
`_graph_run_actors.lance` dataset *bytes* is still deferred — it needs a
`StorageAdapter::delete_prefix` primitive — but those bytes are invisible to
graph-level state.
- **v3→v4** (RFC-013 Phase 7, `migrate_v3_to_v4`): backfills the graph lineage
from `_graph_commits.lance` into `__manifest` as `graph_commit` / `graph_head`
rows. A graph created before Phase 7 has its lineage only in
`_graph_commits.lance`; the new binary reads lineage from the `__manifest`
projection, so without this backfill it would see an empty commit DAG. The
backfill is per-branch (each branch migrates on its first write), idempotent
(keyed on `object_id`; a fast-path guard skips when `__manifest` already
carries `graph_commit` rows), and writes exactly one `graph_head:<branch>` row
for the actual head. `_graph_commits.lance` is left in place as the branch-ref
carrier — no commit row is written to it again. While a graph is below v4, a
**read-only** open (which never writes, so never migrates) sources the commit
DAG from `_graph_commits.lance` via the stamp-gated transitional fallback in
`CommitGraph::open*`, so reads see correct history before the first write
migrates the graph. An old binary opening a v4-stamped graph is refused with an
"upgrade omnigraph" error in both read-write and read-only modes.
- **Graph creation** stamps `omnigraph:internal_schema_version` at CURRENT, so a
fresh graph always opens.
- **`Omnigraph::open`** (both read-write and read-only) reads main's stamp before
the coordinator reads any branch state and calls `refuse_if_stamp_unsupported`:
a stamp *below* CURRENT is refused with a rebuild-via-export/import message; a
stamp *above* CURRENT is refused with "upgrade omnigraph". The publisher
re-checks the stamp on its write path against the branch it targets, with no
object-store writes, so the check is safe under a read-only open.
- The stamp + `refuse_if_stamp_unsupported` floor is the only seam a future
in-place migration would re-introduce (re-add a dispatcher and lower
`MIN_SUPPORTED`). Until a concrete graph demands it, that machinery is
deliberately absent — see [versioning.md](versioning.md) (the compatibility
policy) and [the upgrade guide](../user/operations/upgrade.md) (the rebuild
recipe).
The stamp history (v1 PK-less, v2 unenforced-PK, v3 `__run__*` sweep, v4 lineage
in `__manifest` with the commit-graph tables retired) is recorded on the
`INTERNAL_MANIFEST_SCHEMA_VERSION` doc-comment; only v4 is served. An earlier-stamped
graph is rebuilt via export/import, not migrated in place.
## Mid-query partial failure: closed by MR-794

View file

@ -2,60 +2,72 @@
> Draft release notes for the next minor. The version line in `AGENTS.md` and the
> crate manifests are bumped when this release is cut — these notes track the
> user-visible delta as the RFC-013 work lands.
> user-visible delta as the work lands.
This release moves the graph commit lineage into `__manifest` (RFC-013 Phase 7)
and ships a **one-time on-disk migration** for existing graphs. It is the first
release with an internal-schema change since v0.4.0, so it has an upgrade-order
requirement — read the upgrade notes before rolling it out.
This release moves the graph commit lineage into `__manifest` (RFC-013 Phase 7),
retires the two legacy commit-graph datasets, and surfaces the storage-format
version to operators. It is the first release with an internal-schema change since
v0.4.0, and storage is **strict-single-version**: there is no in-place migration —
a graph from an older release is rebuilt via export/import. Read the upgrade notes
before rolling it out.
## Graph lineage now lives in `__manifest` (internal schema v4)
The graph commit DAG (commits, parents, merge parents, per-branch heads, and the
authoring actor) is now stored in `__manifest` as `graph_commit` / `graph_head`
rows, written in the **same commit (CAS)** as the table-version rows of a graph
publish. Previously the lineage lived in a separate `_graph_commits.lance`
dataset written after the manifest commit, leaving a narrow window where a crash
could land a manifest version with no matching lineage row. Folding the lineage
into the publish closes that gap by construction: a graph commit and its lineage
now land atomically at one manifest version. The in-memory commit graph is a
projection of those manifest rows; `_graph_commits.lance` is retained only as a
carrier for Lance branch refs and no longer receives commit rows.
publish. Previously the lineage lived in a separate `_graph_commits.lance` dataset
written after the manifest commit, leaving a narrow window where a crash could land
a manifest version with no matching lineage row. Folding the lineage into the
publish closes that gap by construction: a graph commit and its lineage now land
atomically at one manifest version. The in-memory commit graph is a pure projection
of those manifest rows.
This bumps the `__manifest` internal schema stamp from **v3 to v4**.
This bumps the `__manifest` internal schema stamp to **v4**.
## Existing graphs migrate seamlessly on first write
## The `_graph_commits.lance` / `_graph_commit_actors.lance` tables are retired
A graph created by an earlier binary (internal schema v3) keeps its lineage in
`_graph_commits.lance` with none in `__manifest`. On the **first read-write
open**, Omnigraph backfills that lineage into `__manifest` (the `migrate_v3_to_v4`
internal-schema step) and bumps the stamp to v4. The migration:
With lineage in `__manifest` and branch authority already on the manifest, the two
legacy commit-graph datasets are no longer created, read, or written. A graph this
release creates has neither. This removes two cold-open directory listings per
graph open and simplifies maintenance (`optimize` compacts only `__manifest` among
the internal tables now).
- is **per-branch** — each branch backfills on its first write;
- is **idempotent and crash-safe** — the stamp bump is the last step, and the
backfill is keyed on the commit id, so a crash mid-migration re-runs harmlessly
on the next open;
- **preserves all data** — every commit, parent, merge parent, actor, and head is
carried over; commit ids are stable, so existing references still resolve.
## Strict-single-version storage: rebuild via export/import, no in-place migration
No data is lost and no operator action is required beyond upgrading the binary.
Internal schema v4 is a hard version gate, enforced in **both** directions on every
open (read-write and read-only):
Before its first write migrates the graph, a **read-only** open of a v3 graph
(e.g. `omnigraph commit list`, NDJSON export) still reads correct history via a
transitional fallback that sources the commit DAG from `_graph_commits.lance`
read-only opens never write, so they never migrate, but they never show an empty
history either.
- A graph stamped **below v4** (created by an older release whose storage format
this binary does not read) is refused with a **rebuild-via-export/import**
message. There is no in-place upgrade: export the graph with the older binary,
then `omnigraph init` + `omnigraph load` with this one. Data, vectors, and blobs
are preserved; commit history and branches are not carried over. See the
[upgrade guide](../user/operations/upgrade.md).
- A graph stamped **above v4** (created by a newer release) is refused with an
**`upgrade omnigraph before opening this graph`** error, so an old binary cannot
silently misread a newer format.
## Breaking: upgrade writer binaries first
This replaces the speculative one-time on-disk migration that earlier drafts of
this release described. The rationale (lower long-term liability than carrying
in-place migration code for a pre-release format) is in
[docs/dev/versioning.md](../dev/versioning.md).
Internal schema v4 is a hard version gate. Once a graph has been opened for write
by a v0.8.0 binary, its `__manifest` is stamped v4, and an **older binary will
refuse to open it** — read-write *and* read-only — with an
`upgrade omnigraph before opening this graph` error rather than silently
misreading the new lineage. This is the standard forward-version protection
(same shape as the v1→v2 / v2→v3 steps), now enforced on the read-only path too.
## See the storage-format version
**Upgrade order:** upgrade every writer (and reader) binary that touches a graph
to v0.8.0 before, or together with, the first write under the new version. A
mixed fleet where an old binary still writes the same graph is unsupported, as
with any internal-schema bump.
Operators can now read the internal-schema version directly instead of discovering
it through a refusal:
- `omnigraph version` prints an `internal-schema <N>` line (the version this binary
serves).
- `omnigraph snapshot` reports the opened graph's on-disk `internal_schema_version`.
- The server `GET /healthz` response includes `internal_schema_version` (the
binary's served version).
## Upgrade order
Upgrade every binary that touches a graph to v0.8.0 together. A mixed fleet where an
older binary still writes a graph another has stamped v4 is unsupported, as with any
internal-schema bump. To move a pre-v4 graph forward, follow the
[upgrade guide](../user/operations/upgrade.md): export with the old binary, then
`init` + `load` with v0.8.0.

View file

@ -20,9 +20,8 @@ OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordin
- **Layout**:
- `nodes/{fnv1a64-hex(type_name)}` — one Lance dataset per node type
- `edges/{fnv1a64-hex(edge_type_name)}` — one Lance dataset per edge type
- `__manifest/` — the catalog of all sub-tables and their published versions, **and** the graph commit lineage (RFC-013 Phase 7)
- `_graph_commits.lance` / `_graph_commit_actors.lance` — legacy / branch-ref carriers. Since RFC-013 Phase 7 the graph lineage lives in `__manifest` (`graph_commit` / `graph_head` rows, written in the publish CAS); `_graph_commits.lance` no longer receives commit rows, but is retained to carry the Lance branch refs that `create_branch` / `list_branches` / the `cleanup` orphan reconciler operate on. A graph created before Phase 7 (internal schema v3) keeps its lineage here until its first read-write open, which migrates it into `__manifest` via `migrate_v3_to_v4`.
- (legacy `_graph_runs.lance` / `_graph_run_actors.lance` from pre-v0.4.0 graphs are inert; the run state machine was removed. The internal schema migration sweeps stale `__run__*` branches on first write-open; the inert dataset bytes themselves remain until a prefix-delete storage primitive lands)
- `__manifest/` — the catalog of all sub-tables and their published versions, **and** the graph commit lineage (RFC-013 Phase 7: `graph_commit` / `graph_head` rows). Graph-level branches are Lance branches on these datasets.
- `_graph_commit_recoveries.lance` — the crash-recovery audit log (one row per recovery action; see below). The former `_graph_commits.lance` / `_graph_commit_actors.lance` lineage tables are **retired**: lineage lives in `__manifest`, so a graph this binary creates has neither.
- **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`):
- `object_type``table | table_version | table_tombstone | graph_commit | graph_head`
- `table_key``node:<TypeName> | edge:<EdgeName>` (empty for `graph_commit` / `graph_head` lineage rows)
@ -35,18 +34,22 @@ OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordin
### Internal schema versioning
The on-disk shape of `__manifest` is reconciled with the binary via a single version stamp held in the manifest dataset's schema-level metadata.
The on-disk shape of `__manifest` is reconciled with the binary via a single version stamp (`omnigraph:internal_schema_version`) held in the manifest dataset's schema-level metadata. Storage is **strict-single-version** (the strand model): this binary reads exactly ONE internal-schema version, and there is no in-place migration.
- **Graph creation** stamps the current version, so newly initialized graphs never need migration.
- **The open-for-write path** migrates the on-disk stamp before reading state. When the stamp matches the binary, this is a single metadata read with no writes; otherwise the migration walks steps forward (1→2, 2→3, …) until the stamp matches, then proceeds with the publish. Reads stay side-effect-free.
- **Forward-version protection**: a stamp *higher* than the binary's known version triggers a clear "upgrade omnigraph first" error. An old binary cannot clobber a newer schema by silently treating "unknown stamp" as "missing stamp".
- **Idempotency**: each migration step is safe to re-run. A crash between two metadata updates inside a single step leaves the partial state; the next open re-runs the step and the second update lands.
- **Graph creation** stamps the current version, so newly initialized graphs always open.
- **Both open paths** (read-write and read-only) read main's stamp before reading any data and refuse a graph the binary cannot serve:
- a stamp *below* CURRENT — a graph from an older release whose storage format this binary does not read — is refused with a **rebuild-via-export/import** message (there is no in-place upgrade; see the [upgrade guide](../operations/upgrade.md)).
- a stamp *above* CURRENT — a graph written by a newer release — is refused with an **"upgrade omnigraph first"** message, so an old binary cannot misread a newer format.
- The stamp is read with no object-store writes, so the check is safe under a read-only open. Operators can see a graph's stamp with `omnigraph snapshot` and the binary's served version with `omnigraph version` (the `internal-schema` line).
| Stamp | Shape change |
The stamp values below are historical; this binary serves only the current one (`v4`). An earlier-stamped graph is rebuilt via export/import, not migrated in place.
| Stamp | Shape |
|---|---|
| v1 (implicit, pre-stamp) | `__manifest.object_id` had no PK annotation; no row-level CAS protection. |
| v2 | `__manifest.object_id` carries an unenforced-primary-key annotation; row-level CAS engaged. |
| v3 | One-time sweep of legacy `__run__*` staging branches (pre-v0.4.0 Run state machine, removed) off `__manifest`. Runs at read-write open and on publish. |
| v3 | Legacy `__run__*` staging branches (pre-v0.4.0 Run state machine) swept off `__manifest`. |
| v4 | Graph lineage folded into `__manifest` as `graph_commit` / `graph_head` rows (RFC-013 Phase 7); the `_graph_commits.lance` / `_graph_commit_actors.lance` tables retired. **The only version this binary serves.** |
## On-disk layout
@ -62,7 +65,7 @@ flowchart TB
manifest["__manifest/<br/>L2 catalog of sub-tables"]:::l2
nodes["nodes/{fnv1a64-hex}/<br/>one dataset per node type"]:::l2
edges["edges/{fnv1a64-hex}/<br/>one dataset per edge type"]:::l2
cgraph["_graph_commits.lance/<br/>_graph_commit_actors.lance/<br/>_graph_commit_recoveries.lance/"]:::l2
cgraph["_graph_commit_recoveries.lance/<br/>crash-recovery audit log"]:::l2
recovery["__recovery/{ulid}.json<br/>recovery sidecars (transient)"]:::l2
refs["_refs/branches/{name}.json<br/>graph-level branches"]:::l2
@ -91,7 +94,7 @@ flowchart TB
- **Graph root** is one directory (or S3 prefix). Everything below is part of one OmniGraph graph.
- **`__manifest/`** is a Lance dataset whose rows describe which sub-table version is published at which graph-branch. Reading a snapshot starts here.
- **`nodes/`** and **`edges/`** are sibling directories holding one Lance dataset per declared type. Names are `fnv1a64-hex` of the type name to keep paths fixed-length and case-safe.
- **`_graph_commits.lance`** is an L2 dataset retained only as a branch-ref carrier (and, on a pre-Phase-7 graph, the migration source). Since RFC-013 Phase 7 the graph commit DAG lives in `__manifest` as `graph_commit` / `graph_head` rows written in the publish CAS — `_graph_commits.lance` and its paired `_graph_commit_actors.lance` no longer receive commit rows. A graph created before Phase 7 (internal schema v3) backfills its lineage into `__manifest` on its first read-write open (`migrate_v3_to_v4`). (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; the internal schema migration sweeps their stale `__run__*` branches, and the dataset bytes are reclaimed once a prefix-delete primitive lands.)
- The graph commit DAG lives in **`__manifest`** as `graph_commit` / `graph_head` rows written in the publish CAS (RFC-013 Phase 7). The former `_graph_commits.lance` / `_graph_commit_actors.lance` lineage tables are retired — a graph this binary creates has neither.
- **`_graph_commit_recoveries.lance`** — one row per crash-recovery action. Joined by `graph_commit_id` to the graph commit lineage (the `graph_commit` rows in `__manifest` since RFC-013 Phase 7); the linked commit carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join.
- **`__recovery/{ulid}.json`** — transient sidecar files written by a writer before it advances the underlying dataset, deleted once the matching manifest publish succeeds. A sidecar persisting after process exit means the writer crashed mid-commit; the next read-write open processes it. Steady-state directory is empty.
- **`_refs/branches/{name}.json`** is graph-level branch metadata — pointers from a branch name to the manifest version it heads.

View file

@ -46,6 +46,7 @@ start with install, then follow the section that matches your task.
| Deploy the binary or container | [deployment.md](deployment.md) |
| Use HTTP endpoints | [operations/server.md](operations/server.md) |
| Compact, repair, and clean old versions | [operations/maintenance.md](operations/maintenance.md) |
| Upgrade across a storage-format change (export/import) | [operations/upgrade.md](operations/upgrade.md) |
| Configure Cedar authorization | [operations/policy.md](operations/policy.md) |
| Track actors and audit behavior | [operations/audit.md](operations/audit.md) |
| Interpret errors and output formats | [operations/errors.md](operations/errors.md) |

View file

@ -6,7 +6,7 @@
- Compacts every node + edge table on `main`, then reindexes them, then **publishes the resulting version to the `__manifest`** so the manifest's recorded version tracks the compacted-and-reindexed state. Reads pin the manifest version, so without this publish the work would be invisible to readers *and* would break the version precondition of the next schema apply / strict update/delete ("stale view … refresh and retry"). The publish advances the graph version (a system-attributed commit) only for tables that actually changed.
- Rewrites small fragments into fewer large ones; old fragments remain reachable via older versions until `cleanup` runs.
- **Also compacts the internal system tables** `__manifest`, `_graph_commits`, and `_graph_commit_actors` (RFC-013 step 2), which accumulate one fragment per commit (the actor table only on the authenticated write path, where every commit carries an actor) and otherwise make every write's metadata scan grow with history. These take a simpler path than data tables: they are not `__manifest`-tracked (readers open them at their latest version), so compaction just advances their version in place — **no manifest publish and no recovery sidecar**. (The sidecar-free property is not because it is one commit — `compact_files` can emit a `ReserveFragments` commit before the `Rewrite`, and the auto-cleanup strip below is a further commit — but because every one of those commits is content-preserving and the table is read at its latest version, so a crash at any point leaves it readable and content-identical and the next `optimize` re-plans.) They appear in the returned stats under `table_key` `"__manifest"` / `"_graph_commits"` / `"_graph_commit_actors"` (the latter two only when present). They are **not yet covered by `cleanup`**, so their version chain still grows until the cleanup half lands (it requires a cleanup-resurrection safeguard first); run `optimize` on a cadence to keep per-write metadata scans flat.
- **Also compacts the internal `__manifest` table** (RFC-013 step 2), which accumulates one fragment per commit — it now carries the graph lineage and actor rows inline (RFC-013 Phase 7: `graph_commit` / `graph_head` rows), so on the authenticated write path every commit's actor lands here too — and otherwise makes every write's metadata scan grow with history. (The `_graph_commits.lance` / `_graph_commit_actors.lance` tables are retired, so there is no separate lineage table to compact.) It takes a simpler path than data tables: `__manifest` is read at its latest version, so compaction just advances its version in place — **no manifest publish and no recovery sidecar**. (The sidecar-free property is not because it is one commit — `compact_files` can emit a `ReserveFragments` commit before the `Rewrite`, and the auto-cleanup strip below is a further commit — but because every one of those commits is content-preserving and the table is read at its latest version, so a crash at any point leaves it readable and content-identical and the next `optimize` re-plans.) It appears in the returned stats under `table_key` `"__manifest"`. It is **not yet covered by `cleanup`**, so its version chain still grows until the cleanup half lands (it requires a cleanup-resurrection safeguard first); run `optimize` on a cadence to keep per-write metadata scans flat.
- **`optimize` is non-destructive by construction — it never garbage-collects versions, on any table (data or internal).** Compaction rewrites fragments and advances the version; old versions stay reachable until you run `cleanup`. This holds even for a graph created by an older binary that stored an on-by-default Lance `auto_cleanup` hook: `compact_files` / `optimize_indices` commit with the hook enabled and expose no skip override, so before compacting **any** table `optimize` strips its stale `lance.auto_cleanup.*` config first, so Lance's commit-time GC hook cannot fire and silently prune `__manifest`-pinned versions. (Graphs created by current binaries store no such config; the strip is the upgrade-path safety net.) The internal-table path additionally tolerates a concurrent live writer: it runs a **bounded** rebase-and-retry, so transient contention does not fail the operator's `optimize` or the live write — but sustained contention past the retry budget surfaces a loud conflict error rather than looping forever (bounded and observable, not a silent give-up). The data-table path holds the per-table write queue while it compacts, so it does not contend with mutations on that table in the first place.
- **Reindex (index coverage maintenance).** A scalar/FTS/vector index only covers the fragments it was built over. Rows appended after the index was built (e.g. by `load --mode merge`, whose commit does not rebuild an already-existing index) are scanned unindexed, and compaction itself rewrites fragments out of an index's coverage. `optimize` runs Lance's incremental `optimize_indices` after compaction to fold those fragments back in (a delta merge, not a full retrain), restoring full coverage so equality/range/traversal predicates stay index-accelerated. This is why a table with **no compaction work but stale index coverage still commits** a new version under `optimize`. Run `optimize` on a cadence at least as frequent as your freshness window so recently-loaded rows do not linger in the unindexed flat-scan tail.
- **Create declared-but-missing indexes (the index reconciler).** `@index`/`@key` declares intent; `schema apply` records it but builds nothing, and `load`/`mutate` defer a column that cannot be built yet (a `Vector` column with no trainable vectors). `optimize` materializes any such declared-but-unbuilt index over the compacted layout — so it is the convergence path for an `@index` added after data exists, or a vector index whose embeddings arrived via a later `embed`. A column still not buildable (no vectors yet) is reported on the table's stat as `pending_indexes` (visible in `--json`), not treated as a failure; the next `optimize` retries. So `optimize` is the single operator-facing index reconciler: it compacts, restores coverage, **and** builds declared-but-missing indexes.

View file

@ -0,0 +1,84 @@
# Upgrading across a storage-format change (export / import)
Omnigraph storage is **strict-single-version**: a binary reads exactly one
internal-schema (storage-format) version. There is no in-place migration. When a
release changes the internal schema, a graph created by an older release is
**refused on open** with a message that points here, and you move it forward by
rebuilding it: export with the old binary, then `init` + `load` with the new one.
This is a deliberate pre-release design choice. The rationale (lower long-term
liability than carrying in-place migration code for a format that is still
changing) is in [docs/dev/versioning.md](../../dev/versioning.md).
## How you know you need this
Opening a graph whose stamp is below the binary's version fails with:
```
__manifest is stamped at internal schema vN, but this omnigraph reads only vM.
This graph was created by an older omnigraph release; rebuild it: run `omnigraph
export` with the older omnigraph binary that created it, then `omnigraph init` +
`omnigraph load` with this one. (Data, vectors, and blobs are preserved; commit
history and branches are not.)
```
You can also check versions before you hit a refusal:
- `omnigraph version` — the binary's served version (the `internal-schema <N>` line).
- `omnigraph snapshot <graph>` — the graph's on-disk `internal_schema_version`.
If the graph's stamp is **higher** than the binary's, the binary is too old —
upgrade omnigraph rather than rebuilding the graph.
## What is preserved (and what is not)
| Preserved | Not preserved |
|---|---|
| All node and edge rows | Commit history (the graph DAG starts fresh) |
| Vector columns (embeddings round-trip verbatim) | Branches (export is a single-branch snapshot) |
| Blob columns | Snapshot/time-travel history of the old graph |
| The schema (re-applied at `init`) | |
The rebuilt graph is a faithful copy of the exported branch's **current state**.
If you need history or multiple branches carried forward, there is no supported
path today — export each branch you care about separately.
## The recipe
Use the **old** binary for the export steps and the **new** binary for init/load.
Keep them as separate executables (for example a downloaded release archive) so you
can run both.
```bash
# 1. With the OLD binary — capture the schema and the data.
old-omnigraph schema show s3://bucket/graph.omni > schema.pg
old-omnigraph export s3://bucket/graph.omni > graph.jsonl
# 2. With the NEW binary — create a fresh graph and load the data.
omnigraph init --schema schema.pg s3://bucket/graph-v2.omni
omnigraph load --mode overwrite --data graph.jsonl s3://bucket/graph-v2.omni
# 3. With the NEW binary — verify.
omnigraph snapshot s3://bucket/graph-v2.omni # internal_schema_version is current
omnigraph version # confirms the binary's served version
```
`omnigraph export` writes a full JSONL snapshot (one row per node/edge, all
columns including vectors and blobs) of the chosen branch (default `main`; pass
`--branch` for another) to stdout. `omnigraph load --mode overwrite` replaces the
target graph's contents with that snapshot.
Once you have verified the rebuilt graph, retire the old one. If you rebuilt
in place (same URI), export to a side location first and only overwrite after the
new graph verifies.
## Notes
- **Upgrade the whole fleet together.** A mixed fleet where an old binary still
writes a graph a newer binary has stamped is unsupported, as with any
internal-schema bump.
- **Embeddings are not recomputed.** Export carries the stored vectors verbatim, so
a load does not re-run the embedding pipeline. If you changed the embedding model,
re-embed after loading.
- **Server deployments**: take the graph out of the serving set, rebuild it offline
with the CLI, then point the cluster at the rebuilt graph (`cluster apply`).

View file

@ -3,9 +3,9 @@
| Name | Value | Area |
|---|---|---|
| `MANIFEST_DIR` | `__manifest` | manifest layout |
| Commit graph dir | `_graph_commits.lance` | branch-ref carrier + pre-v4 lineage source (lineage lives in `__manifest` since RFC-013 Phase 7) |
| Run registry dir (legacy, removed) | `_graph_runs.lance` | inert post-v0.4.0; bytes remain until a prefix-delete primitive lands |
| Run branch prefix (legacy, removed) | `__run__` | swept off `__manifest` by the internal schema migration; no longer a reserved name |
| Commit graph dirs (retired) | `_graph_commits.lance` / `_graph_commit_actors.lance` | retired in Phase B; lineage lives in `__manifest` (`graph_commit` / `graph_head` rows) since RFC-013 Phase 7. A graph this binary creates has neither. |
| Recovery audit dir | `_graph_commit_recoveries.lance` | one row per crash-recovery action (`omnigraph commit list --filter actor=omnigraph:recovery`) |
| Run branch prefix (legacy, removed) | `__run__` | pre-v0.4.0 Run state machine; no longer a reserved name. A graph still carrying `__run__*` branches is sub-v4 and refused on open (rebuild via export/import). |
| Schema apply lock | `__schema_apply_lock__` | schema apply |
| Manifest publisher retry budget | `PUBLISHER_RETRY_BUDGET = 5` | manifest publish |
| Internal manifest schema version | `INTERNAL_MANIFEST_SCHEMA_VERSION = 4` | manifest migrations (v4 = graph lineage in `__manifest`, RFC-013 Phase 7) |