feat(engine): retire commit-graph tables (#311)

* docs(dev): write-latency roadmap (validated cost model + layered fix)

Records the validated 6-LIST warm-write cost model, the two root causes
(un-GC'd _versions/; re-resolving latest by listing), and the layered fix
(GC + capture-once reuse), plus how commit-graph-table retirement feeds in.
Linked from docs/dev/index.md next to the RFC-013 docs.

* feat(engine)!: strand storage versioning — one internal-schema version, no in-place migration

Set MIN_SUPPORTED == CURRENT == 4: this binary reads exactly one `__manifest`
internal-schema version and refuses any older graph on open with a
rebuild-via-export/import message, instead of migrating it in place. Storage
format changes become a deliberate cutover, not a permanently-carried in-place
migration — the pre-release "complexity must be earned" contract.

Delete the entire in-place migration apparatus and everything that existed only
to support it: the `migrate_vN` arms + dispatcher + stamp-bump helpers + the
schema-version-floor tripwire; `migrate_on_open` (both open modes now refuse);
the legacy `_graph_commits.lance` readers + the v3 test fixtures + migration
tests + `migration.v3_to_v4.*` failpoints + the two surface guards that pinned
Lance variants only the migration matched on; and `state::merge_lineage_rows`.
Keep `read_stamp` / `stamp_current_version` / `set_stamp` /
`refuse_if_stamp_unsupported` — the seam a future one-shot converter plugs into.

`load_commit_cache_for_branch` now reads the `__manifest` projection
unconditionally (sub-v4 graphs are refused at open). Adds
`sub_current_graph_is_refused_on_open_with_rebuild_hint`.

The commit-graph TABLES are still created/used as branch-ref ledgers — their
retirement (CommitGraph -> pure `__manifest` projection) is the next commit.

BREAKING CHANGE: a graph created by omnigraph <= 0.7.2 (internal schema v3) is
refused on open. Rebuild it: `omnigraph export` with the old release, then
`omnigraph init` + `omnigraph load` with this one. Data, vectors, and blobs are
preserved; commit history and branches are not.

* feat(engine)!: retire `_graph_commits.lance` / `_graph_commit_actors.lance` — CommitGraph is a pure `__manifest` projection

Since RFC-013 Phase 7, graph lineage lives in `__manifest` (`graph_commit` /
`graph_head` rows) and branch authority is `__manifest` (branch create forks it
first). The two commit-graph datasets were vestigial: `_graph_commit_actors.lance`
was never written or read; `_graph_commits.lance` carried zero commit rows and
only mirrored the manifest's branch refs (a deny-list "parallel copy"). Retire
both.

- `CommitGraph` collapses to a pure projection: drops its Lance dataset handles
  (`dataset`/`actor_dataset`) and all branch methods; `open`/`open_at_branch`/
  `refresh`/`init` open NO dataset, building the cache from
  `ManifestCoordinator::read_graph_lineage_at`. Removes ~1.4s of cold-open
  dataset opens.
- `graph_coordinator`: `commit_graph` is now non-`Option` (always a valid
  projection). `branch_create`/`branch_delete` go through `ManifestCoordinator`
  only — a single atomic op, replacing the two-step manifest-fork +
  commit-graph-fork + rollback. Deleted `create_commit_graph_branch`,
  `reclaim_commit_graph_branch`, `ensure_commit_graph_initialized`, and every
  `storage.exists(_graph_commits.lance)` gate.
- `optimize`: dropped `reconcile_commit_graph_orphans` and the two tables from
  the internal-table compaction set (now `__manifest` only).
- `instrumentation`: `INTERNAL_TABLE_DIRS` no longer lists the two tables.
- Fresh graphs create neither table; `lineage_projection.rs` now asserts both
  `.lance` dirs are absent. Deleted the obsolete commit-graph-branch-race
  failpoint tests + their failpoint names, and updated the `maintenance`
  optimize tests (one internal table, not three).

Review-pass fixes folded in:
- Removed two stale `omnigraph.rs` in-source tests the prior run missed (a
  disk-full link failure masked them): one asserting `open` probes
  `_graph_commits.lance` (the exists-gate this commit removes) — it was masked
  earlier by a disk-full link failure.
- Corrected src comments referencing deleted code (`migrate_v3_to_v4`,
  `append_commit`/`append_merge_commit`, the three-internal-table list,
  the `_graph_commits` reconcile owner) in publisher/recovery/optimize/recovery_audit.
- Narrowed `set_stamp_for_test` to `cfg(test)` (its only caller is the refusal
  test) — removes a dead-code warning in the failpoints build.

Branch create/delete atomicity improves (single atomic `__manifest` op). No
behavior change for reads or branches.

Follow-up (separate commit): the now-always-0 `IoCounts::commit_graph_reads` test
counter + its `IOTracker`, threaded through ~11 cost-test files.

* feat: surface the internal-schema (storage-format) version to operators

After stranding storage versioning (a sub-v4 graph is refused on open), operators
could only discover the storage-format version by hitting a refusal. Surface it:

- `omnigraph version` prints an `internal-schema <N>` line (the binary's CURRENT
  storage-format version).
- `omnigraph snapshot` includes `internal_schema_version` — the GRAPH's per-branch
  on-disk stamp, read via the new `Omnigraph::internal_schema_version_of`.
- `GET /healthz` includes `internal_schema_version` (server-scoped: the binary's
  CURRENT, alongside `version`/`source_version`).

Wire: re-expose `INTERNAL_MANIFEST_SCHEMA_VERSION` as `pub` on `db::manifest`;
add `internal_schema_version: u32` to `SnapshotOutput` + `HealthOutput`;
`snapshot_payload` takes the per-graph version (the `Snapshot` does not carry it),
threaded through the embedded CLI + server snapshot callers. `openapi.json`
regenerated (two added int32 properties). Extends the existing healthz / snapshot /
version tests.

* docs(engine): gate internal-schema version at the graph level; record the per-branch read gap

PR reviewers flagged that the open path validates only main's internal-schema stamp, so a branch read could decode a branch stamped outside this binary's range. The stamp is a graph-wide storage-format property (the upgrade path is a whole-graph export/import), so with one binary version every branch is always CURRENT; divergence needs concurrent multi-version writers, an unsupported topology already in one-winner-CAS territory. Gating per-branch would add a second __manifest open per non-main branch read to defend a state we do not support, unearned complexity that regresses the warm-read budget.

Keep the graph-level gate, document it at the code site (refuse_if_internal_schema_unsupported), and record the read-only residual hole as a known gap in invariants.md to close only when multi-version write topologies become supported. Also clarify the sub-floor rebuild message to say "export with the older omnigraph binary that created it."

No behavior change: HEAD already gated at the graph level.

* test(cost): remove the dead commit_graph_reads IO counter

Phase B retired _graph_commits.lance / _graph_commit_actors.lance, so no commit-graph dataset is opened and the commit_graph IOTracker term is structurally always 0. Remove IoCounts::commit_graph_reads, its total_reads() term, the commit_graph IOTracker in OpProbes, and the now-dead commit_graph_wrapper field on QueryIoProbes (it had no accessor — nothing ever attached it). Drop the 7 trivially-true assert_eq!(commit_graph_reads, 0) checks in warm_read_cost.rs and the debug-print refs in write_cost{,_s3}.rs.

Lineage and actor rows now live in __manifest (RFC-013 Phase 7), so the internal_table_scans_are_flat_in_history gate folds into the single manifest_reads flat-assertion — the manifest scan already covers them. Harness-only; no production runtime impact.

* docs: align with the commit-graph retirement + strand storage versioning

Update the always-loaded and user-facing docs to match the landed state: graph lineage lives in __manifest, the _graph_commits.lance / _graph_commit_actors.lance tables are retired, and storage is strict-single-version (no in-place migration — a sub-CURRENT graph is refused with an export/import rebuild).

Fixed stale claims in invariants.md (the migration/atomicity known-gap entry, the Truth Matrix branch-delete row, the read-path/optimize internal-table scope), lance.md (the migrate_v1_to_v2 PK bullet now reflects init-time set; removed the two deleted v3->v4 migration surface guards), testing.md (dropped the deleted migration failpoint tests; manifest-only internal-table term), writes.md (rewrote the Migration-code section to the strand model), storage.md / maintenance.md / constants.md (retired tables out of the layout, internal-table compaction scope, and the constants cheat-sheet), and AGENTS.md. Marked the retirement DONE in the RFC-013 handoff/roadmap and banner-noted the historical RFC analysis.

Added docs/user/operations/upgrade.md (the export/import rebuild recipe) and docs/dev/versioning.md (the four-axis compatibility policy: release lockstep / wire additive / storage strict-single-version / Lance pinned), cross-linked from the audience indexes and the AGENTS.md topic map, and rewrote the in-progress v0.8.0 release note for the strand model + version surfacing. check-agents-md.sh passes (65 links, 62 docs).

* test(manifest): cover the v3-refusal→export/import rebuild cycle and branch stamp inheritance

Two coverage additions from PR review (P1):

(a) sub_current_graph_is_refused_then_rebuilt_via_export_import — the full operator narrative in one flow: load → export → a sub-CURRENT graph (stamp rewound below CURRENT) is refused with the export nudge → fresh init + load(export) → data present and the rebuilt graph opens. The refusal is stamp-only (read before any data), so a stamp-rewound graph is a faithful stand-in for a real older-release graph without a second binary; vector/blob fidelity stays covered by tests/export.rs.

(b) branch_inherits_main_internal_schema_stamp — proves a branch cannot diverge from main's stamp under single-binary operation (create_branch forks main's __manifest, the publisher does not re-stamp), which is why the graph-level (main-only) stamp gate is sufficient for supported inputs. A divergent branch stamp needs concurrent multi-version writers, the unsupported topology recorded as a known gap.
This commit is contained in:
Ragnor Comerford 2026-06-28 16:49:49 +02:00 committed by GitHub
parent 0dcdcf5a9d
commit 7779b72446
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
53 changed files with 903 additions and 3324 deletions

View file

@ -1,5 +1,12 @@
# Handoff: finishing RFC-013 (write-path latency + correctness)
> **Status update (strand-and-retire work):** the `_graph_commits.lance` /
> `_graph_commit_actors.lance` datasets are **retired** (Phase B — lineage lives in
> `__manifest`; the `CommitGraph` is a pure projection). References to those tables
> below are historical: `optimize` now compacts **`__manifest` only** among the
> internal tables, and the per-write `_graph_commits` scan term is gone. See
> [versioning.md](versioning.md) and [invariants.md](invariants.md).
**Status:** living handoff. **Source of truth is [`rfc-013-write-path-latency.md`](rfc-013-write-path-latency.md)**
this doc is the *current-state map + the decisions/validation from the latest work cycle
+ the concrete next actions*. When they disagree, the RFC wins (and fix this doc).
@ -142,9 +149,11 @@ a 0-IO cache hit). So, apples-to-apples (both ground truth), per-write `__manife
is genuinely cross-branch with no index today).
- **PR4 / RFC #7264** — Lance native branch-aware `BatchCreateTableVersions`; manifest read → O(1),
per-write fragment append gone; retires most of PR1/PR2. Upstream-blocked.
4. **Low-leverage:** retire the vestigial `_graph_commits`/`_graph_commit_actors` datasets (zero rows
post-#299, only branch-ref carriers); a bitmap index on `__manifest` (no builder exists; `use_index(false)`
means it can't serve the CAS join anyway — a `graph_head:<branch>` point-lookup is the better variant).
4. **Low-leverage:** ~~retire the vestigial `_graph_commits`/`_graph_commit_actors` datasets~~ **DONE**
(Phase B of the strand-and-retire work — the tables are gone, branch authority is `__manifest`, the
`CommitGraph` is a pure `__manifest` projection). Still open: a bitmap index on `__manifest` (no builder
exists; `use_index(false)` means it can't serve the CAS join anyway — a `graph_head:<branch>` point-lookup
is the better variant).
### A.6 Critical files (Thread B)

View file

@ -12,6 +12,7 @@ constraints. User-facing behavior should still be documented through
| Need | Read |
|---|---|
| Architectural rules, known gaps, deny-list | [invariants.md](invariants.md) |
| Versioning & compatibility policy (release / wire / storage / Lance) | [versioning.md](versioning.md) |
| Upstream Lance source-of-truth index | [lance.md](lance.md) |
| Existing test coverage and test placement | [testing.md](testing.md) |
@ -94,6 +95,7 @@ Working documents for in-flight feature work. Removed when the work lands.
| Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
| Write-path latency — capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) |
| RFC-013 handoff — current-state map, latest validation, and concrete next actions for finishing write-path latency and correctness work | [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) |
| Write-latency roadmap — validated cost model (the 6-LIST warm-write trace), the two root causes (un-GC'd `_versions/`; re-resolving latest by listing), and the layered fix (GC + capture-once reuse); how commit-graph-table retirement feeds in | [write-latency-roadmap.md](write-latency-roadmap.md) |
## Boundary

View file

@ -155,7 +155,7 @@ converge the physical state.
| Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | [writes.md](writes.md), [architecture.md](architecture.md) |
| Constructive mutations | In-memory `MutationStaging`, one end-of-query table commit per touched table, then one manifest publish | [writes.md](writes.md), [execution.md](execution.md) |
| Deletes | Staged like inserts/updates (`stage_delete` via Lance 7.0 `DeleteBuilder::execute_uncommitted`, MR-A) — no inline HEAD advance; mixed insert/update/delete in one query rejected by D2 as a deliberate boundary (constructive XOR destructive per query; compose via separate mutations or a branch) | [query-language.md](../user/queries/index.md), [writes.md](writes.md) |
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branching/index.md), [maintenance.md](../user/operations/maintenance.md) |
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branching/index.md), [maintenance.md](../user/operations/maintenance.md) |
| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema/index.md), [execution.md](execution.md) |
| Unique constraints | Intra-batch and write-path checks exist; intake and branch-merge derive the composite key through one shared function (`loader::composite_unique_key`, a separator-free `Vec<String>` tuple) and fail loudly on an un-keyable column type rather than silently exempting it; full cross-version uniqueness against already-committed rows is still a gap | [schema-language.md](../user/schema/index.md) |
| Storage trait | `TableStorage` (via `db.storage()`) is staged-only; the sole inline-commit residual (`create_vector_index`) is split onto a separate sealed `InlineCommitResidual` trait reached via `db.storage_inline_residual()` (MR-854), so §1 holds by construction; capability/stat surfaces are roadmap | [writes.md](writes.md), [architecture.md](architecture.md) |
@ -257,43 +257,46 @@ them explicit.
acknowledged-before-visible bug this branch fixed. Close it (local CAS
primitive, or a trait-level lock requirement) before admitting any
lock-free `if_match` caller.
- **Internal-schema stamp is validated at the graph (main) level only:**
`Omnigraph::open` reads the `omnigraph:internal_schema_version` stamp on
**main** and refuses an out-of-range graph (newer than CURRENT → "upgrade
omnigraph"; below MIN_SUPPORTED → "rebuild via export/import"). Branch reads
then trust that one check. This is correct by the storage-format contract: the
stamp is a graph-wide property (the upgrade path is a whole-graph
export/import), so with one binary version every branch is always CURRENT —
init stamps main, `create_branch` forks the stamp, the publisher writes rows
without re-stamping. A branch stamped out of range while main stays in range is
only reachable with concurrent **multi-version writers**, an unsupported
topology already in one-winner-CAS territory: writes to such a branch are
refused per-branch by the publisher (it reads its target's stamp), and a newer
binary advancing main is refused at open. The residual hole is read-only —
reading or merging a branch a newer binary advanced while main stayed old would
decode it with this binary's logic instead of refusing. Close it (a per-branch
read gate folded into the branch-manifest open the read already does, zero
extra I/O) only when multi-version write topologies are promoted to supported;
a separate stamp round-trip per branch read is the wrong shape (it regresses the
warm-read cost budget to defend an unsupported state).
- **Manifest→commit-graph publish atomicity — CLOSED (RFC-013 Phase 7):** graph
lineage now lives ONLY in `__manifest`, as `graph_commit` + `graph_head:<branch>`
lineage lives ONLY in `__manifest`, as `graph_commit` + `graph_head:<branch>`
rows written in the SAME `MergeInsertBuilder` commit as the table-version rows
(`commit_changes_with_lineage``GraphNamespacePublisher::publish` with a
`LineageIntent`). There is no second write to fail between — a graph commit and
its lineage land at one manifest version atomically, so a crash after the publish
leaves no gap. The commit-graph cache is a derived projection of those manifest
rows; nothing writes `_graph_commits.lance` (it persists only to carry branch
refs). The prior two-write gap (manifest at N with no `_graph_commits` row for N)
is gone by construction. A graph created before Phase 7 (internal schema v3)
carries its lineage only in `_graph_commits.lance`; the `migrate_v3_to_v4`
internal-schema step (`db/manifest/migrations.rs`) backfills it into `__manifest`
per-branch on the first read-write open (idempotent, crash-safe, data-preserving),
and a read-only open of an un-migrated v3 graph sources the DAG from
`_graph_commits.lance` via a stamp-gated transitional fallback so reads stay
correct until the first write migrates it. An old binary refuses a v4-stamped
graph (read-write and read-only) with the standard upgrade error. The migration
is **loud on failure and concurrent-runner idempotent**: the legacy-open read
(`read_legacy_commit_cache`) treats only a genuine not-found as "no legacy data"
and propagates any other open error (so a transient/corrupt open can never stamp
v4 over an empty backfill — orphaning lineage permanently), and the backfill
converges all-or-nothing when two runners open the same legacy graph at once — a
bounded re-open retry on the `graph_head:<branch>` row-level CAS plus an
idempotent terminal stamp bump (both runners write the same value, so a concurrent
`UpdateConfig`/`IncompatibleTransaction` loss re-opens and no-ops if the stamp
already landed). The branch read path (`load_commit_cache_for_branch`) also
refuses an out-of-range branch stamp (`> CURRENT` or `< MIN_SUPPORTED`;
defense-in-depth; not a live hole because migrations run main-first, so main
refuses first). The migration chain is **floor-bounded**:
`MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (migrations.rs; 1 today, a pure no-op) is
the oldest stamp this binary opens, enforced symmetrically with the ceiling by the
single `refuse_if_stamp_unsupported` guard at all three stamp-read sites
(write-path migrate, read-only open, branch lineage-read). Raising MIN sheds the
now-dead `migrate_vN_…` arms and (at MIN ≥ 4) the `commit_graph_legacy_v3` legacy
readers; a compile-time tripwire (`LOWEST_REGISTERED_MIGRATION_SOURCE`) fails the
build if the floor and the lowest registered arm drift. Retirement runbook lives on
the `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` doc-comment.
leaves no gap. The in-memory commit graph is a pure projection of those rows. The
`_graph_commits.lance` / `_graph_commit_actors.lance` tables are **retired**: a
fresh graph creates neither, branch authority is `__manifest` only, and nothing
reads or writes them. The prior two-write gap (manifest at N with no
`_graph_commits` row for N) is gone by construction.
- **Storage is strict-single-version (the strand model):** this binary reads
exactly ONE internal-schema version (`MIN_SUPPORTED == CURRENT`), so there is no
in-place migration. A graph stamped below CURRENT is refused on open with a
rebuild-via-export/import message (`refuse_if_stamp_unsupported`), not silently
upgraded; a graph stamped above CURRENT is refused with an "upgrade omnigraph"
message. The `migrate_v*` dispatcher, the `_graph_commits.lance` legacy-read
fallback, and the migration floor-bounding machinery were all deleted with the
retirement — the stamp + `refuse_if_stamp_unsupported` floor is the only seam a
future migration would re-introduce. See `docs/dev/versioning.md` (the
compatibility policy) and `docs/user/operations/upgrade.md` (the rebuild recipe).
- **Planner capability/stat surfaces:** cost-aware planning, complete
capability advertisement, and explain-with-cost are roadmap. Do not describe
them as implemented.
@ -310,7 +313,8 @@ them explicit.
widening the gap.
- **Read-path re-derivation (largely closed by the query-latency work):**
snapshot resolution used to re-open a fresh coordinator per read (a full
`__manifest` re-scan plus two commit-graph scans), open each table through the
`__manifest` re-scan plus the then-separate commit-graph-table scans, since
retired), open each table through the
namespace (two more `__manifest` scans per table), validate the schema twice,
and share no Lance `Session`. That was an O(commits) cost that never warmed up.
Fix 1 (warm coordinator reuse behind a `latest_version_id` probe), Fix 2 (open
@ -324,10 +328,11 @@ them explicit.
the manifest e_tag is carried into synthetic snapshot ids when available, and
a detected same-branch manifest refresh clears read caches as the fallback for
e_tag-less table locations/topology. Remaining: `optimize` now compacts the
internal metadata tables (`__manifest`, `_graph_commits`) too (RFC-013 step 2),
so a *periodically-optimized* graph keeps the probe/refresh/per-write scan flat
in history; but they are not yet brought into `cleanup` (version GC), so the
`_versions/` chain still grows until an explicit cleanup (the cleanup half is
internal metadata table (`__manifest`, which carries the lineage rows) too
(RFC-013 step 2), so a *periodically-optimized* graph keeps the
probe/refresh/per-write scan flat in history; but it is not yet brought into
`cleanup` (version GC), so the `_versions/` chain still grows until an explicit
cleanup (the cleanup half is
deferred — it needs the Q8 cleanup-resurrection watermark first). The commit
graph IS now reconcilable from the manifest (RFC-013 Phase 7 — it is a pure
projection of the `graph_commit`/`graph_head` rows); the traversal id-map is

View file

@ -166,11 +166,17 @@ Migration from Lance 6.0.1 → 7.0.0 landed in this cycle. **Arrow stayed 58, Da
- **BTREE range-query bound inclusiveness fixed** (PR #6796, issue #6792): `x <= hi AND x > lo` returned the wrong boundary row on 6.0.1. omnigraph today builds BTREE only on string `@key` columns (`id`/`src`/`dst`) and queries them by equality/IN, not range, so its *current* query patterns almost certainly never hit this bug — but the corrected boundary semantics are a contract we rely on the moment a BTREE-range path appears (BTREE-on-properties via the index-type tickets, or a range-on-key query). Pinned by `lance_surface_guards.rs::btree_range_query_boundary_is_correct` (reproduces #6792's 5-row + BTREE shape).
- **`WriteParams::auto_cleanup` default flipped from on (every-20-commits) to `None`** (PR #6755). On 6.0.1 the on-by-default hook could GC versions the `__manifest` pins for snapshots/time-travel. omnigraph owns cleanup explicitly (`optimize.rs::cleanup_all_tables`). Two parts to the fix, because `auto_cleanup` is **create-time config only and has no effect on existing datasets** (Lance `write.rs` docs): (1) `auto_cleanup: None` at all 11 `WriteParams` sites so *new* datasets store no cleanup config; (2) — the load-bearing half — `skip_auto_cleanup: true` on every commit path, because graphs created **before** the bump still carry the on-config in their datasets, and Lance's hook fires off the *dataset's stored* config at commit time (`io/commit.rs`: `if !commit_config.skip_auto_cleanup`). So the staged commit path (`commit_staged``CommitBuilder::with_skip_auto_cleanup(true)`), the `__manifest` publisher (`MergeInsertBuilder::skip_auto_cleanup(true)`), and the direct `WriteParams` paths all skip the hook. Without this, an upgraded graph would still auto-cleanup and delete `__manifest`-pinned versions. Pinned by `lance_surface_guards.rs::skip_auto_cleanup_suppresses_version_gc` (negative control + with-skip survival).
- **Lance #6658 SHIPPED in 7.0.0** (`DeleteBuilder::execute_uncommitted`, exposed via PR #6781) → MR-A (migrate `delete` to the staged two-phase API) **has since landed** (dev-graph `iss-950`): `delete_where` is retired, deletes stage via `TableStorage::stage_delete`, and the guard was flipped to `_compile_uncommitted_delete_field_shape` (pins `execute_uncommitted` / `UncommittedDelete`). `StagedWrite` must carry `UncommittedDelete.affected_rows` through `commit_staged` so Lance's row-level rebase metadata is preserved. The parse-time D2 rule is retained as a deliberate boundary (constructive XOR destructive per query), not as scaffolding awaiting further work.
- **The unenforced primary key is now immutable once set** (`lance::dataset::transaction`, ~L24722480: `if !primary_key_before.is_empty() && (writes_primary_key || primary_key_after != primary_key_before) → "the unenforced primary key is a reserved key and cannot be changed once set"`). omnigraph marks `__manifest.object_id` as the unenforced PK (`lance-schema:unenforced-primary-key`) for merge-insert row-level CAS — baked into `manifest_schema()` at init, and added by the `migrate_v1_to_v2` internal-schema migration for pre-v0.4.0 graphs. The migration relied on Lance 6's idempotent re-apply for crash-recovery (a crash after the field-set but before the stamp bump re-enters the migration with the PK already present); under v7 that re-apply errors, so a real v1 graph could never finish migrating. Fixed by guarding the set on the manifest's unenforced-PK field (`db/manifest/migrations.rs::migrate_v1_to_v2`): `["object_id"]` → no-op, `[]` → set, any other PK field → loud refusal (the wrong CAS key, unchangeable under v7). Pinned by `lance_surface_guards.rs::unenforced_primary_key_is_immutable_once_set` (red if Lance relaxes immutability); regression: `db::manifest::tests::test_publish_migrates_pre_stamp_manifest_to_current_version` (was red under v7).
- **The unenforced primary key is now immutable once set** (`lance::dataset::transaction`, ~L24722480: `if !primary_key_before.is_empty() && (writes_primary_key || primary_key_after != primary_key_before) → "the unenforced primary key is a reserved key and cannot be changed once set"`). omnigraph marks `__manifest.object_id` as the unenforced PK (`lance-schema:unenforced-primary-key`) for merge-insert row-level CAS — baked into the manifest schema at init (`db/manifest/state.rs`). With the strand model there is no in-place migration, so the PK is only ever set at init: a graph that predates the annotation is refused on open (`refuse_if_stamp_unsupported`) and rebuilt via export/import, never re-keyed — which is also what Lance's immutability rule would require, since the wrong PK could not be changed once set. Pinned by `lance_surface_guards.rs::unenforced_primary_key_is_immutable_once_set` (red if Lance relaxes immutability).
- **Native `DirectoryNamespace` no longer recognizes omnigraph's manifest-tracked tables** (`lance-namespace-impls` dir.rs ~L1310): `list/describe/create_table_version` route through `check_table_status`, which reports an omnigraph table absent → `TableNotFound`. The decoupling is *contingent on omnigraph's legacy boolean PK key*, not an unconditional v7 property: v7's namespace eagerly adds the new `lance-schema:unenforced-primary-key:position` key to any `__manifest` lacking it; that write hits the immutable-PK rule above (the boolean key already set the PK), so `ensure_manifest_table_up_to_date` errors and the namespace silently falls back to directory listing. omnigraph keeps the boolean key deliberately — Lance honors it permanently (maps to PK position 0), and one uniform on-disk format beats a new-vs-old split (existing graphs can't be re-keyed to the position key under that same immutability rule). omnigraph production never uses Lance's native namespace (its publisher writes `__manifest` directly via merge_insert; its own `namespace.rs` impls are custom), so this is test-only — the `test_directory_namespace_direct_publish_cannot_replace_native_omnigraph_write_path` surface guard was realigned to the v7 behavior (it now asserts the native namespace is fully decoupled, which only strengthens the guard's thesis).
- **Still NOT fixed in 7.0.0:** vector-index two-phase (Lance #6666 open) — `create_vector_index` inline residual retained; blob-column compaction — `compact_files_still_fails_on_blob_columns` guard still red on a fix, `optimize` still skips blob tables behind `LANCE_SUPPORTS_BLOB_COMPACTION`.
- **No Lance API surface omnigraph uses changed at *compile* time** (the only compile break was object_store) — but **two runtime behaviors did** (the unenforced-PK immutability and the native-namespace `TableNotFound`, above), each caught by the full engine test suite rather than the build. `CleanupPolicy`, `WriteParams` (apart from the `auto_cleanup` default), `CompactionOptions`, the namespace models (resolved via `lance-namespace-reqwest-client` 0.7.7, unchanged across the bump), `Operation`, `ManifestLocation`, and `MergeInsertBuilder` shapes are all stable. Lesson: a clean build is not a clean alignment — run `cargo test --workspace` before declaring a Lance bump done.
- **Two surface guards added by the v3→v4 migration-robustness follow-up** (not a Lance bump, but they pin Lance error surfaces the migration now classifies on): `dataset_open_missing_returns_not_found_variant` (a missing `Dataset::open` returns `DatasetNotFound`/`NotFound` — the legacy-open read in `db/commit_graph.rs::read_legacy_commit_cache` treats only those as "no legacy data" and propagates everything else) and `lance_error_incompatible_transaction_variant_exists` (a concurrent `UpdateConfig` stamp-bump loses with `IncompatibleTransaction``db/manifest/migrations.rs::commit_v4_stamp_idempotently` matches it to retry the benign same-value race). Re-run on a Lance bump like the others.
- **The v3→v4 migration-robustness surface guards were removed with the strand.**
An earlier cycle added `dataset_open_missing_returns_not_found_variant` and
`lance_error_incompatible_transaction_variant_exists` to pin Lance error surfaces
the `migrate_v3_to_v4` backfill classified on. The strand retirement deleted that
migration (storage is now strict-single-version — see [invariants.md](invariants.md)),
so those guards and the legacy-read/stamp-bump code they pinned are gone. No
current omnigraph code path classifies on those Lance variants.
Bump this date stanza on the next alignment pass.

View file

@ -1,5 +1,13 @@
# RFC-013: Write-path latency — capture-once `WriteTxn`, manifest-authoritative publish, bounded history, and a measured cost contract
> **Status update (storage-versioning strand-and-retire work):** the
> `_graph_commits.lance` / `_graph_commit_actors.lance` datasets are now **retired**
> (Phase B — lineage lives in `__manifest` as `graph_commit` / `graph_head` rows;
> the `CommitGraph` is a pure `__manifest` projection). References to those tables
> below are historical: the remaining `optimize` / `cleanup` internal-table scope is
> **`__manifest`-only**, and the per-write `_graph_commits` scan term is gone. See
> [invariants.md](invariants.md) and [versioning.md](versioning.md).
**Status:** Proposed
**Author(s):** write-path latency investigation (handoff + multi-agent validation)
**Date:** 2026-06-19

View file

@ -26,7 +26,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) — reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
| `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
| `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — `internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest` scans flat in commit-history depth, lineage rows included`internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
| `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives; reads per-op deltas via lance's `incremental_stats()`, the upstream per-request idiom from `rust/lance/src/dataset/tests/dataset_io.rs`), `cost_harness`/`GraphIoMeter` (installs ONE `__manifest` `IOTracker` for a whole test body so the graph opens **under** it and `manifest_reads` is **ground truth** — every read regardless of handle age, the warm-coordinator freshness probe included — closing the blind spot where a per-op tracker installed at measure time cannot see a long-lived handle's reads; outside `cost_harness`, `measure` falls back to fresh per-op tracking, so `write_cost_s3.rs` is unaffected), `last_manifest_reads()` (the manifest read log for `assert_io_eq!`-style failure diagnostics), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary |
| `lifecycle.rs` | Graph lifecycle, schema state |
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
@ -46,7 +46,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
| `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly — no HTTP — proving embedded and CLI callers hit the same gate as the server, with action × scope shapes matching `authorize_request` |
| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice; the index reconciler (iss-848): `index_build_tolerates_null_vector_rows` (an untrainable Vector column defers instead of aborting the build, sibling indexes still build) and `optimize_materializes_index_declared_but_unbuilt` (optimize creates a declared-but-deferred index) |
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). Also the v3→v4 migration fault-injection test (`transient_legacy_open_failure_aborts_migration_without_stamping_v4`, `migration.v3_to_v4.legacy_open` failpoint): a transient legacy-open failure aborts the migration loudly and leaves it retryable (stamp stays v3, no partial backfill), never stamping v4 over an empty backfill. Also the v4 stamp-bump exhaustion regression (`v4_stamp_exhaustion_returns_retryable_contention`, `migration.v4_stamp.force_incompatible` failpoint): the stamp retry loop surfaces a retryable `RowLevelCasContention` on exhaustion, not a stringified `Lance`. And the convergence-idempotent roll-forward regression (`open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one sidecar at the `recovery.before_roll_forward_publish` rendezvous; the CAS loser must converge, not fail the open — iss-schema-apply-reopen-recovery-race). |
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). And the convergence-idempotent roll-forward regression (`open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one sidecar at the `recovery.before_roll_forward_publish` rendezvous; the CAS loser must converge, not fail the open — iss-schema-apply-reopen-recovery-race). |
| `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path |
| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). |
@ -139,7 +139,7 @@ Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and deg
- **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query."
- **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost.
- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest`/`_graph_commits` fragment scans) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears.
- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest` fragment scans, lineage rows included) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears.
- **Count on the handle that does the reads, not just the one a measured op opens.** Lance's IO-counted tests attach the `IOTracker` to the (warm, cached) dataset and read `incremental_stats()` per request — the tracker MUST be on the handle performing the reads, or warm-handle reads escape. A per-op tracker installed at measure time cannot see reads on a long-lived handle opened earlier (the warm coordinator's `__manifest` handle, reused across writes), so such reads were silently undercounted. Wrap a depth-swept body in `cost_harness` so the manifest tracker is installed before the graph opens and `manifest_reads` is **ground truth** (handle-age-irrelevant). The `version_probes` counter is the freshness-probe *call* count; ground truth additionally reveals that a write's probe does ~3 object-store RPCs (a read's probe is a 0-IO cache hit). `manifest_reads_capture_warm_probe` is the guard that this stays true.
- This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history).

78
docs/dev/versioning.md Normal file
View file

@ -0,0 +1,78 @@
# Versioning & compatibility policy
**Audience:** engine / storage / release maintainers
**Status:** living document
Omnigraph has four independent version axes. They have different compatibility
contracts because they fail in different ways and at different costs. Conflating
them (for example, treating a storage-format change like a wire change) is how you
either ship an unsafe silent-misread or carry migration code you do not need.
| Axis | Policy | Mechanism |
|---|---|---|
| **Release (semver)** | All published crates move in lockstep. | Maintenance-contract rule 4 in [AGENTS.md](../../AGENTS.md): a release bump updates every crate manifest, `Cargo.lock`, `openapi.json`, and the surveyed version line together. |
| **CLI ↔ server wire** | Additive and rolling-safe; **no version gate**. New fields are optional; old clients ignore unknown fields and omit new ones. | Additive JSON DTOs in `omnigraph-api-types`; the OpenAPI-drift test (`crates/omnigraph-server/tests/openapi.rs`) catches an unintended wire change. |
| **Storage (internal manifest schema)** | **Strict single version**; upgrade is a cutover via export/import, never an in-place migration. | A stamp (`omnigraph:internal_schema_version`) in `__manifest`'s schema metadata + `refuse_if_stamp_unsupported`, with `MIN_SUPPORTED == CURRENT`. |
| **Lance on-disk format** | Pinned to one Lance version; bumped deliberately with the engine. | `data_storage_version: V2_2` at every write site + the surface guards in [lance.md](lance.md), re-run on every Lance bump. |
## Why storage is strict-single-version (the strand model)
The internal-schema stamp gates the on-disk shape of `__manifest`. The contract is:
**this binary reads exactly one internal-schema version.** `Omnigraph::open` (both
read-write and read-only) reads main's stamp before any data and refuses anything
it cannot serve:
- a stamp **below** CURRENT → refused with a rebuild-via-export/import message (see
[the upgrade guide](../user/operations/upgrade.md));
- a stamp **above** CURRENT → refused with "upgrade omnigraph", so an old binary
cannot silently misread a newer format.
There is no in-place migration dispatcher. The single source file
`db/manifest/migrations.rs` holds only the version constant, the stamp read/write,
and `refuse_if_stamp_unsupported`.
This is a liability decision, not a limitation we have not gotten around to. In-place
migration code is permanent surface: every future format change has to write,
test, and keep working a `vN → vN+1` step, plus the legacy readers and crash-recovery
paths each step needs, for a storage format that is still pre-release and changing.
The strand model trades that ongoing cost for a one-time operator action (export +
import) when a format changes. Per "engineering is programming integrated over time"
(see [AGENTS.md](../../AGENTS.md)), the lower-liability option is to **not** carry
the machinery until a concrete graph demands it.
The stamp + `refuse_if_stamp_unsupported` floor is exactly the seam a future in-place
migration would re-introduce: re-add a dispatcher and lower `MIN_SUPPORTED` below
CURRENT for the versions it can actually walk forward. Until then that machinery is
deliberately absent.
### Gating altitude
The stamp is validated at the **graph (main) level**: `Omnigraph::open` checks main
once, and branch reads trust it. The stamp is a graph-wide storage-format property
(the upgrade path is a whole-graph export/import), so with one binary version every
branch is always CURRENT — init stamps main, `create_branch` forks the stamp, and the
publisher writes rows without re-stamping. A branch stamped out of range while main
stays in range is only reachable with concurrent multi-version writers, an
unsupported topology; the residual is recorded as a known gap in
[invariants.md](invariants.md).
## Why the wire is additive-rolling-safe instead
The CLI↔server boundary is the opposite case: clients and servers are deployed
independently and a hard gate there would force lockstep redeploys for every field
addition. So that axis is additive — old and new coexist — and the OpenAPI-drift test
is the guard that a change stayed additive rather than breaking the shape.
## When you change each axis
- **Storage format**: bump `INTERNAL_MANIFEST_SCHEMA_VERSION`, keep
`MIN_SUPPORTED == CURRENT` (unless you are re-introducing migration), update the
stamp history on the constant's doc-comment, and add a release note pointing at
the upgrade guide. The change is breaking by construction — pre-bump graphs are
refused.
- **Wire**: keep it additive; regenerate `openapi.json`
(`OMNIGRAPH_UPDATE_OPENAPI=1`); do not add a version gate.
- **Lance**: follow the Lance-bump checklist in [lance.md](lance.md) — re-run the
surface guards first, then `cargo test --workspace` (a clean build is not a clean
alignment).
- **Release**: lockstep per the maintenance contract.

View file

@ -0,0 +1,79 @@
# Write-latency roadmap (validated cost model + layered fix)
**Type:** planning roadmap (forward-looking; not yet implemented)
**Companion to:** [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) (design) and [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) (current state + next actions)
**Status:** validated against Lance 7.0.0 source + a measured S3/RustFS trace of the `a7d4cba5` (#307) binary; **re-validated unchanged against HEAD `0dcdcf5a` (#308)** — see "Currency" below
This doc records the **validated** root-cause analysis of single-row write latency and the layered, correct-by-design fix, so the analysis is not lost between sessions. It also records how the separate commit-graph-table **retirement** work (see the strand-and-rebuild plan) feeds into it.
## Currency (re-validated at HEAD `0dcdcf5a`, #308)
#308 ("Stage the delete path; retire the inline-delete residual", MR-A) is a **pure delete-path refactor**: it moved `delete_where` off the inline `Dataset::delete` HEAD-advance onto Lance 7.0's two-phase `stage_delete` (`DeleteBuilder::execute_uncommitted`) + `commit_staged`. The **insert path is untouched** — every step of the cost model below re-traced identically (the `stage_merge_insert` + `commit_staged` open/commit count is unchanged; WriteTxn "collapse #1/#4", the probe-gated `occ_snapshot_for_branch`, and the publisher chain all survive). Two consequences worth recording: (1) deletes are now staged like inserts, so this cost model — and the layered fix — **now apply uniformly to deletes too** (before #308 a delete was an inline-commit residual with a different shape); (2) the sole remaining inline-commit residual is `create_vector_index`, and D2 (constructive XOR destructive per query) is now a **permanent by-design boundary, not a gap**. Lance stays pinned at 7.0.0, so Root-cause-A's `commit.rs` / `cleanup.rs` references are unchanged.
## Problem
A single keyed-node insert on `main` against the production graph (Cloudflare Container + R2, 343 MB, never cleaned) measured **~7.2s median**. The cost is not the writes — it is read/list amplification, dominated by repeated "resolve latest version" listings of `_versions/` prefixes.
## Validated cost model of one warm write
A warm server insert of one keyed node on `main` performs **6 `_versions/` LISTs + 1 full `__manifest` scan + 1 recovery LIST + ~5 writes**, every operation reconciled to code and matching the measured served trace exactly (3 `__manifest/_versions/` lists, 3 `node:SyncState/_versions/` lists, ~132 `__manifest` reads, 1 recovery list, 8 PUT + 1 POST):
| # | Operation | Cost | Source |
|---|---|---|---|
| 1 | recovery sidecar list | 1 `__recovery/` LIST | `recovery.rs` heal |
| 2 | schema-contract validate | 3 text GETs (not manifest) | `schema_state.rs` |
| 3 | branch resolve (warm) | 0 I/O (in-memory coord) | `omnigraph.rs` `resolved_branch_target` |
| 4 | per-table mutation open | **0** — WriteTxn returns `None` (RFC-013 collapse #1) | `table_ops.rs` |
| 5 | staging open (`reopen_for_mutation`) | **1 `node:SyncState/_versions/` LIST** (opens at *latest*) | `table_store.rs` `open_dataset_head_for_write` |
| 6 | OCC re-capture probe | **1 `__manifest/_versions/` LIST** (`latest_version_id`) | `manifest.rs` `probe_latest_version` |
| 7 | data-table drift probe | **1 `node:SyncState/_versions/` LIST** | `staging.rs` |
| 8 | sidecar PUT | 1 PUT | `staging.rs` `write_sidecar` |
| 9 | `commit_staged` (data commit) | **1 `node:SyncState/_versions/` LIST** (Lance commit) + PUTs | `table_store.rs` |
| 10 | index-build reopen | **0** — reuses handle (RFC-013 collapse #4) | `table_ops.rs` |
| 11 | `load_publish_state` | **1 `__manifest/_versions/` LIST + 1 full scan (~132 reads)** | `publisher.rs` |
| 12 | publish merge-insert commit | **1 `__manifest/_versions/` LIST** (Lance commit) + 1 POST | `publisher.rs` `merge_rows` |
| 13 | sidecar DELETE | 1 DELETE | `mutation.rs` |
The WriteTxn (RFC-013 step 3b) and #307 already eliminated several opens (steps 4, 10) and the redundant publish scans. What remains is structural.
## Two independent root causes
**Root cause A — per-LIST cost grows with history (un-GC'd `_versions/`).** On S3/R2, Lance resolves "latest version" via `resolve_version_from_listing` (`lance-table-7.0.0/src/io/commit.rs:552`). Even though V2 reverse-sorted naming puts the latest first, it reads **up to 1000 entries as a lexical-order sanity check** (`commit.rs:584`, `valid_manifests.take(999)`). So each LIST pulls a full ~1000-key page (~205 KB, ~500 ms on R2). `cleanup` covers only catalog data tables — **the internal tables are never version-GC'd** (`optimize.rs` `all_table_keys(&db.catalog())`), so `__manifest/_versions/` grows one entry per commit forever. The version hint that would skip the list is **unavailable on S3/R2** — Lance gates it to non-lexical stores (`commit.rs:327`, `uses_version_hint = enabled && !list_is_lexically_ordered`); `LANCE_USE_VERSION_HINT` can only *disable* it. NOTE: `optimize` already *compacts* the internal tables (RFC-013 step 2), which bounds the `__manifest` data **scan** (step 11), but compaction adds a version — only `cleanup` shrinks the `_versions/` **list**.
**Root cause B — the write re-resolves "latest" 6× by listing instead of reusing a known version.** Each boundary (table open, OCC probe, table commit, manifest open, manifest commit) independently asks "what's latest?" via a list. Opening at a *known* version is list-free: `DatasetBuilder::from_uri(loc).with_version(N)` reconstructs the exact V2 filename and does one GET — the read path already uses this (the held-handle cache). The write path deliberately opens HEAD and never consults the pinned pointer, even though the `__manifest` `table_version` row already stores `manifest_path` + `version` + `e_tag` (`TableVersionMetadata`).
## The layered fix (each layer correct on its own)
Target end state: **a warm single-writer insert costs O(keep-N + writes), independent of total history, with the publish CAS as the sole correctness authority.** Ordered by impact-per-risk.
### Layer 1 — version-GC the internal table(s) (attacks Root cause A; biggest win)
Bring `__manifest` into `cleanup` with a small keep-window (e.g. keep 2050), shrinking every LIST from ~1000 keys (~205 KB, ~500 ms) to ~2050 keys (~5 KB, ~40 ms) — a ~10× cut to the dominant term, reusing the `cleanup_all_tables` loop. Lance's `cleanup_old_versions` supports this directly (`CleanupPolicy { before_version, before_timestamp }`, always preserves latest, honors tags — `lance-7.0.0/src/dataset/cleanup.rs`).
- **Correctness:** a *manual, quiesced, sidecar-refusing* cleanup is correct **without** the Q8 watermark (the resurrection race only exists under concurrent live writers). Time-travel below the window is lost — the same tradeoff data-table cleanup already makes.
- **Immediate operational fix:** a one-shot quiesced `__manifest` cleanup on the production graph takes ~7s → ~1s today, zero code risk.
- **Connection to the retirement work:** after the commit-graph-table retirement, `_graph_commits`/`_graph_commit_actors` no longer exist, so Layer 1's GC scope is **just `__manifest`** (originally three internal tables). Retiring those tables also removes their 2 cold-open LISTs entirely — an orthogonal cold-start win.
### Layer 2 — Q8 durable monotonic watermark (makes Layer 1 safe under live writers)
Lance creates versions with a bare `PutMode::Create` and no monotonic guard, so a stalled writer can resurrect a GC'd version = a silent lost write on R2/S3. To run GC automatically while serving writes, add a durable boundary watermark (a Lance boundary **tag**, which `cleanup_old_versions` protects): GC advances the floor before deleting; every writer/open rejects `version ≤ boundary`. Invariant-level, touches the open path; correctly deferred in RFC-013 (§ step 2b / Q8). See [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) and [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) for the full design.
### Layer 3 — open data tables at the pinned manifest version (attacks Root cause B)
On the **non-strict** insert/merge path, route `reopen_for_mutation` through the version-pinned opener (`open_table_dataset(location, pinned_version, session)`) instead of `Dataset::open` at HEAD. The `__manifest` row already pins the exact version + e_tag, and the non-strict path already skips the strict version check — so opening at the pinned version is information-equivalent and **list-free** (removes step 5, folds the drift probe). The publisher CAS + `latest_version_id` drift guard remain the concurrency fences.
### Layer 4 — optimistic warm publish (attacks Root cause B)
The publisher loop runs cold `load_publish_state` (open + full scan) on every attempt, then arbitrates via the merge-insert row-level CAS — and the CAS is the stated authority. Thread the warm coordinator's already-open `__manifest` handle + in-memory `known_state` (folded after every commit, RFC-013 PR2 #1b) into **attempt 0**; fall through to today's cold `load_publish_state` on attempts 1..N **only when the CAS actually conflicts**. A current warm view (single-writer server, the common case) commits with zero manifest open/scan. Scope to non-strict ops; strict ops (update/delete/schema) keep the cold fresh-read drift check.
## Why this is the optimal shape (not a symptomatic patch)
- It is the natural extension of the WriteTxn "capture-once" architecture (RFC-013 step 3b) that already eliminated the per-table mutation opens — Layers 34 finish the job for the publish + commit opens.
- It separates **correctness (the CAS) from optimization (every read/probe)**, matching invariant 15: the optimistic path is probe-free; the cold path runs only under genuine contention. Write cost becomes O(1) with no contention, O(contention) only under real concurrency.
- It introduces **no parallel copy of version state** (deny-listed) — Lance + the manifest stay the single source of truth; we stop *re-deriving* "latest" from a cold list on the hot path (the "cold re-derivation" anti-pattern, invariant 15).
- Explicitly **not** pursued: the version hint (unavailable on S3) or an external manifest store (a drift-prone parallel pointer). GC + capture-once reuse is the substrate-respecting path.
## Sequencing
0. **Prereq (separate plan): retire `_graph_commits.lance` + `_graph_commit_actors.lance`. — DONE.** Phase B of the strand-and-retire storage-versioning work landed this: the tables are gone, Layer 1's GC scope is `__manifest`-only, and the 2 cold-open LISTs are removed.
1. **Operational now:** one-shot quiesced `__manifest` cleanup on the production graph (~7s → ~1s, no code).
2. **Layer 1:** wire `__manifest` into the `cleanup` command (manual/quiesced + sidecar-refuse) — the 80/20.
3. **Layers 3 + 4:** open-at-pinned-version + optimistic warm publish — remove the redundant round-trips; bring a warm write to ~2 cheap Lance-commit lists + writes.
4. **Layer 2:** Q8 watermark — only when GC must run continuously under live writers; heaviest, correctly deferred.
Layers 1+3+4 are independently shippable, each correctness-preserving, composing to the scalable end state. Each lands with `write_cost_s3.rs` extensions asserting the per-write LIST count drops and stays flat across commit-history depth.

View file

@ -355,36 +355,32 @@ actual }`. The HTTP server maps this to **409 Conflict** with body
`__manifest`, written in the publish CAS (RFC-013 Phase 7; previously
`_graph_commits.lance`). Audit history is queried via `omnigraph commit list`.
## Migration code
## Storage versioning (no in-place migration)
`db/manifest/migrations.rs` is the single place on-disk `__manifest` shape is
reconciled with what the binary expects, stepping the
`omnigraph:internal_schema_version` stamp forward one `match`-arm at a time. It
runs in `Omnigraph::open(ReadWrite)` (via `manifest::migrate_on_open`, before the
coordinator reads branch state) and again on the publisher's write path, so each
branch migrates on its first write; every step is idempotent under crash-retry
(work first, stamp bump last).
`db/manifest/migrations.rs` is the single place the on-disk `__manifest` shape is
reconciled with what the binary expects. Storage is **strict-single-version** (the
strand model): this binary reads exactly ONE internal-schema version
(`MIN_SUPPORTED == CURRENT == 4`), so there is no in-place migration.
- **v2→v3** (MR-770): a one-time sweep that deletes legacy `__run__*` staging
branches off `__manifest`. Deleting the inert `_graph_runs.lance` /
`_graph_run_actors.lance` dataset *bytes* is still deferred — it needs a
`StorageAdapter::delete_prefix` primitive — but those bytes are invisible to
graph-level state.
- **v3→v4** (RFC-013 Phase 7, `migrate_v3_to_v4`): backfills the graph lineage
from `_graph_commits.lance` into `__manifest` as `graph_commit` / `graph_head`
rows. A graph created before Phase 7 has its lineage only in
`_graph_commits.lance`; the new binary reads lineage from the `__manifest`
projection, so without this backfill it would see an empty commit DAG. The
backfill is per-branch (each branch migrates on its first write), idempotent
(keyed on `object_id`; a fast-path guard skips when `__manifest` already
carries `graph_commit` rows), and writes exactly one `graph_head:<branch>` row
for the actual head. `_graph_commits.lance` is left in place as the branch-ref
carrier — no commit row is written to it again. While a graph is below v4, a
**read-only** open (which never writes, so never migrates) sources the commit
DAG from `_graph_commits.lance` via the stamp-gated transitional fallback in
`CommitGraph::open*`, so reads see correct history before the first write
migrates the graph. An old binary opening a v4-stamped graph is refused with an
"upgrade omnigraph" error in both read-write and read-only modes.
- **Graph creation** stamps `omnigraph:internal_schema_version` at CURRENT, so a
fresh graph always opens.
- **`Omnigraph::open`** (both read-write and read-only) reads main's stamp before
the coordinator reads any branch state and calls `refuse_if_stamp_unsupported`:
a stamp *below* CURRENT is refused with a rebuild-via-export/import message; a
stamp *above* CURRENT is refused with "upgrade omnigraph". The publisher
re-checks the stamp on its write path against the branch it targets, with no
object-store writes, so the check is safe under a read-only open.
- The stamp + `refuse_if_stamp_unsupported` floor is the only seam a future
in-place migration would re-introduce (re-add a dispatcher and lower
`MIN_SUPPORTED`). Until a concrete graph demands it, that machinery is
deliberately absent — see [versioning.md](versioning.md) (the compatibility
policy) and [the upgrade guide](../user/operations/upgrade.md) (the rebuild
recipe).
The stamp history (v1 PK-less, v2 unenforced-PK, v3 `__run__*` sweep, v4 lineage
in `__manifest` with the commit-graph tables retired) is recorded on the
`INTERNAL_MANIFEST_SCHEMA_VERSION` doc-comment; only v4 is served. An earlier-stamped
graph is rebuilt via export/import, not migrated in place.
## Mid-query partial failure: closed by MR-794