feat(engine): retire commit-graph tables (#311)

* docs(dev): write-latency roadmap (validated cost model + layered fix)

Records the validated 6-LIST warm-write cost model, the two root causes (un-GC'd _versions/; re-resolving latest by listing), and the layered fix (GC + capture-once reuse), plus how commit-graph-table retirement feeds in. Linked from docs/dev/index.md next to the RFC-013 docs.

* feat(engine)!: strand storage versioning — one internal-schema version, no in-place migration

Set MIN_SUPPORTED == CURRENT == 4: this binary reads exactly one `__manifest` internal-schema version and refuses any older graph on open with a rebuild-via-export/import message, instead of migrating it in place. Storage format changes become a deliberate cutover, not a permanently-carried in-place migration — the pre-release "complexity must be earned" contract.

Delete the entire in-place migration apparatus and everything that existed only to support it: the `migrate_vN` arms + dispatcher + stamp-bump helpers + the schema-version-floor tripwire; `migrate_on_open` (both open modes now refuse); the legacy `_graph_commits.lance` readers + the v3 test fixtures + migration tests + `migration.v3_to_v4.*` failpoints + the two surface guards that pinned Lance variants only the migration matched on; and `state::merge_lineage_rows`.

Keep `read_stamp` / `stamp_current_version` / `set_stamp` / `refuse_if_stamp_unsupported` — the seam a future one-shot converter plugs into. `load_commit_cache_for_branch` now reads the `__manifest` projection unconditionally (sub-v4 graphs are refused at open). Adds `sub_current_graph_is_refused_on_open_with_rebuild_hint`. The commit-graph TABLES are still created/used as branch-ref ledgers — their retirement (CommitGraph -> pure `__manifest` projection) is the next commit.

BREAKING CHANGE: a graph created by omnigraph <= 0.7.2 (internal schema v3) is refused on open. Rebuild it: `omnigraph export` with the old release, then `omnigraph init` + `omnigraph load` with this one.
Data, vectors, and blobs are preserved; commit history and branches are not.

* feat(engine)!: retire `_graph_commits.lance` / `_graph_commit_actors.lance` — CommitGraph is a pure `__manifest` projection

Since RFC-013 Phase 7, graph lineage lives in `__manifest` (`graph_commit` / `graph_head` rows) and branch authority is `__manifest` (branch create forks it first). The two commit-graph datasets were vestigial: `_graph_commit_actors.lance` was never written or read; `_graph_commits.lance` carried zero commit rows and only mirrored the manifest's branch refs (a deny-list "parallel copy"). Retire both.

- `CommitGraph` collapses to a pure projection: drops its Lance dataset handles (`dataset`/`actor_dataset`) and all branch methods; `open`/`open_at_branch`/`refresh`/`init` open NO dataset, building the cache from `ManifestCoordinator::read_graph_lineage_at`. Removes ~1.4s of cold-open dataset opens.
- `graph_coordinator`: `commit_graph` is now non-`Option` (always a valid projection). `branch_create`/`branch_delete` go through `ManifestCoordinator` only — a single atomic op, replacing the two-step manifest-fork + commit-graph-fork + rollback. Deleted `create_commit_graph_branch`, `reclaim_commit_graph_branch`, `ensure_commit_graph_initialized`, and every `storage.exists(_graph_commits.lance)` gate.
- `optimize`: dropped `reconcile_commit_graph_orphans` and the two tables from the internal-table compaction set (now `__manifest` only).
- `instrumentation`: `INTERNAL_TABLE_DIRS` no longer lists the two tables.
- Fresh graphs create neither table; `lineage_projection.rs` now asserts both `.lance` dirs are absent. Deleted the obsolete commit-graph-branch-race failpoint tests + their failpoint names, and updated the `maintenance` optimize tests (one internal table, not three).

Review-pass fixes folded in:

- Removed two stale `omnigraph.rs` in-source tests the prior run missed (a disk-full link failure masked them): one asserting `open` probes `_graph_commits.lance` (the exists-gate this commit removes).
- Corrected src comments referencing deleted code (`migrate_v3_to_v4`, `append_commit`/`append_merge_commit`, the three-internal-table list, the `_graph_commits` reconcile owner) in publisher/recovery/optimize/recovery_audit.
- Narrowed `set_stamp_for_test` to `cfg(test)` (its only caller is the refusal test) — removes a dead-code warning in the failpoints build.

Branch create/delete atomicity improves (single atomic `__manifest` op). No behavior change for reads or branches. Follow-up (separate commit): the now-always-0 `IoCounts::commit_graph_reads` test counter + its `IOTracker`, threaded through ~11 cost-test files.

* feat: surface the internal-schema (storage-format) version to operators

After stranding storage versioning (a sub-v4 graph is refused on open), operators could only discover the storage-format version by hitting a refusal. Surface it:

- `omnigraph version` prints an `internal-schema <N>` line (the binary's CURRENT storage-format version).
- `omnigraph snapshot` includes `internal_schema_version` — the GRAPH's per-branch on-disk stamp, read via the new `Omnigraph::internal_schema_version_of`.
- `GET /healthz` includes `internal_schema_version` (server-scoped: the binary's CURRENT, alongside `version`/`source_version`).

Wire: re-expose `INTERNAL_MANIFEST_SCHEMA_VERSION` as `pub` on `db::manifest`; add `internal_schema_version: u32` to `SnapshotOutput` + `HealthOutput`; `snapshot_payload` takes the per-graph version (the `Snapshot` does not carry it), threaded through the embedded CLI + server snapshot callers. `openapi.json` regenerated (two added int32 properties). Extends the existing healthz / snapshot / version tests.

* docs(engine): gate internal-schema version at the graph level; record the per-branch read gap

PR reviewers flagged that the open path validates only main's internal-schema stamp, so a branch read could decode a branch stamped outside this binary's range. The stamp is a graph-wide storage-format property (the upgrade path is a whole-graph export/import), so with one binary version every branch is always CURRENT; divergence needs concurrent multi-version writers, an unsupported topology already in one-winner-CAS territory. Gating per-branch would add a second __manifest open per non-main branch read to defend a state we do not support, unearned complexity that regresses the warm-read budget.

Keep the graph-level gate, document it at the code site (refuse_if_internal_schema_unsupported), and record the read-only residual hole as a known gap in invariants.md to close only when multi-version write topologies become supported. Also clarify the sub-floor rebuild message to say "export with the older omnigraph binary that created it." No behavior change: HEAD already gated at the graph level.

* test(cost): remove the dead commit_graph_reads IO counter

Phase B retired _graph_commits.lance / _graph_commit_actors.lance, so no commit-graph dataset is opened and the commit_graph IOTracker term is structurally always 0. Remove IoCounts::commit_graph_reads, its total_reads() term, the commit_graph IOTracker in OpProbes, and the now-dead commit_graph_wrapper field on QueryIoProbes (it had no accessor — nothing ever attached it). Drop the 7 trivially-true assert_eq!(commit_graph_reads, 0) checks in warm_read_cost.rs and the debug-print refs in write_cost{,_s3}.rs.

Lineage and actor rows now live in __manifest (RFC-013 Phase 7), so the internal_table_scans_are_flat_in_history gate folds into the single manifest_reads flat-assertion — the manifest scan already covers them. Harness-only; no production runtime impact.

* docs: align with the commit-graph retirement + strand storage versioning

Update the always-loaded and user-facing docs to match the landed state: graph lineage lives in __manifest, the _graph_commits.lance / _graph_commit_actors.lance tables are retired, and storage is strict-single-version (no in-place migration — a sub-CURRENT graph is refused with an export/import rebuild).

Fixed stale claims in invariants.md (the migration/atomicity known-gap entry, the Truth Matrix branch-delete row, the read-path/optimize internal-table scope), lance.md (the migrate_v1_to_v2 PK bullet now reflects init-time set; removed the two deleted v3->v4 migration surface guards), testing.md (dropped the deleted migration failpoint tests; manifest-only internal-table term), writes.md (rewrote the Migration-code section to the strand model), storage.md / maintenance.md / constants.md (retired tables out of the layout, internal-table compaction scope, and the constants cheat-sheet), and AGENTS.md.

Marked the retirement DONE in the RFC-013 handoff/roadmap and banner-noted the historical RFC analysis. Added docs/user/operations/upgrade.md (the export/import rebuild recipe) and docs/dev/versioning.md (the four-axis compatibility policy: release lockstep / wire additive / storage strict-single-version / Lance pinned), cross-linked from the audience indexes and the AGENTS.md topic map, and rewrote the in-progress v0.8.0 release note for the strand model + version surfacing. check-agents-md.sh passes (65 links, 62 docs).

* test(manifest): cover the v3-refusal→export/import rebuild cycle and branch stamp inheritance

Two coverage additions from PR review (P1):

(a) sub_current_graph_is_refused_then_rebuilt_via_export_import — the full operator narrative in one flow: load → export → a sub-CURRENT graph (stamp rewound below CURRENT) is refused with the export nudge → fresh init + load(export) → data present and the rebuilt graph opens.
The refusal is stamp-only (read before any data), so a stamp-rewound graph is a faithful stand-in for a real older-release graph without a second binary; vector/blob fidelity stays covered by tests/export.rs.

(b) branch_inherits_main_internal_schema_stamp — proves a branch cannot diverge from main's stamp under single-binary operation (create_branch forks main's __manifest, the publisher does not re-stamp), which is why the graph-level (main-only) stamp gate is sufficient for supported inputs. A divergent branch stamp needs concurrent multi-version writers, the unsupported topology recorded as a known gap.
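
Illustration (not part of the commit): a minimal sketch of the strict-single-version gate the message describes, where a stamp below `MIN_SUPPORTED` is refused with a rebuild hint instead of being migrated. `check_stamp` and `StampError` are hypothetical names chosen for this sketch, not the engine's API; only the constant values (`MIN_SUPPORTED == CURRENT == 4`) come from the message above. Refusing a stamp newer than `CURRENT` is an assumption.

```rust
// Sketch only: hypothetical names, not the engine's real types.
// The constant values are taken from the commit message.
const MIN_SUPPORTED: u32 = 4;
const CURRENT: u32 = 4;

#[derive(Debug, PartialEq, Eq)]
pub enum StampError {
    /// The graph predates this binary: no in-place migration, rebuild instead.
    TooOld { found: u32 },
    /// Assumption: a stamp newer than CURRENT is refused as well.
    TooNew { found: u32 },
}

impl std::fmt::Display for StampError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            StampError::TooOld { found } => write!(
                f,
                "internal schema v{} is older than v{}; export with the older \
                 omnigraph binary that created it, then init + load with this one",
                found, MIN_SUPPORTED
            ),
            StampError::TooNew { found } => write!(
                f,
                "internal schema v{} is newer than v{}; use a newer binary",
                found, CURRENT
            ),
        }
    }
}

/// Accepts a stamp only inside [MIN_SUPPORTED, CURRENT]; with both set to 4
/// that is exactly one version. Nothing is migrated or rewritten here.
pub fn check_stamp(found: u32) -> Result<(), StampError> {
    if found < MIN_SUPPORTED {
        Err(StampError::TooOld { found })
    } else if found > CURRENT {
        Err(StampError::TooNew { found })
    } else {
        Ok(())
    }
}
```

Because the check reads only the stamp, a test can rewind the stamp on an otherwise current graph and exercise the refusal path without a second binary, which is the approach item (a) takes.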
2026-06-30 02:49:39 +02:00 · 2026-06-28 16:49:49 +02:00
commit 7779b72446
parent 0dcdcf5a9d
53 changed files with 903 additions and 3324 deletions
--- a/docs/dev/testing.md
+++ b/docs/dev/testing.md
@@ -26,7 +26,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
 | `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) — reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
 | `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
 | `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
-| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — `internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
+| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest` scans flat in commit-history depth, lineage rows included — `internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
 | `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives; reads per-op deltas via lance's `incremental_stats()`, the upstream per-request idiom from `rust/lance/src/dataset/tests/dataset_io.rs`), `cost_harness`/`GraphIoMeter` (installs ONE `__manifest` `IOTracker` for a whole test body so the graph opens **under** it and `manifest_reads` is **ground truth** — every read regardless of handle age, the warm-coordinator freshness probe included — closing the blind spot where a per-op tracker installed at measure time cannot see a long-lived handle's reads; outside `cost_harness`, `measure` falls back to fresh per-op tracking, so `write_cost_s3.rs` is unaffected), `last_manifest_reads()` (the manifest read log for `assert_io_eq!`-style failure diagnostics), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary |
 | `lifecycle.rs` | Graph lifecycle, schema state |
 | `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
@@ -46,7 +46,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
 | `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
 | `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly — no HTTP — proving embedded and CLI callers hit the same gate as the server, with action × scope shapes matching `authorize_request` |
 | `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice; the index reconciler (iss-848): `index_build_tolerates_null_vector_rows` (an untrainable Vector column defers instead of aborting the build, sibling indexes still build) and `optimize_materializes_index_declared_but_unbuilt` (optimize creates a declared-but-deferred index) |
-| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). Also the v3→v4 migration fault-injection test (`transient_legacy_open_failure_aborts_migration_without_stamping_v4`, `migration.v3_to_v4.legacy_open` failpoint): a transient legacy-open failure aborts the migration loudly and leaves it retryable (stamp stays v3, no partial backfill), never stamping v4 over an empty backfill. Also the v4 stamp-bump exhaustion regression (`v4_stamp_exhaustion_returns_retryable_contention`, `migration.v4_stamp.force_incompatible` failpoint): the stamp retry loop surfaces a retryable `RowLevelCasContention` on exhaustion, not a stringified `Lance`. And the convergence-idempotent roll-forward regression (`open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one sidecar at the `recovery.before_roll_forward_publish` rendezvous; the CAS loser must converge, not fail the open — iss-schema-apply-reopen-recovery-race). |
+| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). And the convergence-idempotent roll-forward regression (`open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one sidecar at the `recovery.before_roll_forward_publish` rendezvous; the CAS loser must converge, not fail the open — iss-schema-apply-reopen-recovery-race). |
 | `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path |
 | `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). |

@@ -139,7 +139,7 @@ Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and deg

 - **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query."
 - **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost.
-- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest`/`_graph_commits` fragment scans) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears.
+- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest` fragment scans, lineage rows included) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears.
 - **Count on the handle that does the reads, not just the one a measured op opens.** Lance's IO-counted tests attach the `IOTracker` to the (warm, cached) dataset and read `incremental_stats()` per request — the tracker MUST be on the handle performing the reads, or warm-handle reads escape. A per-op tracker installed at measure time cannot see reads on a long-lived handle opened earlier (the warm coordinator's `__manifest` handle, reused across writes), so such reads were silently undercounted. Wrap a depth-swept body in `cost_harness` so the manifest tracker is installed before the graph opens and `manifest_reads` is **ground truth** (handle-age-irrelevant). The `version_probes` counter is the freshness-probe *call* count; ground truth additionally reveals that a write's probe does ~3 object-store RPCs (a read's probe is a 0-IO cache hit). `manifest_reads_capture_warm_probe` is the guard that this stays true.
 - This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history).
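
Illustration (not part of the diff): a minimal sketch of a "flat in history" assertion, the idea behind the "Assert a cost budget" and "Test at history depth" bullets above. The parameter order mirrors the `assert_flat(curve, select, slack, what)` call shape quoted in the table, but `Sample`, `check_flat`, and the body are assumptions written for this example, not the `helpers::cost` implementation.

```rust
// Sketch only: hypothetical types, not the real harness.

/// One measurement: commit-history depth and the read count seen at that depth.
#[derive(Debug, Clone, Copy)]
pub struct Sample {
    pub depth: usize,
    pub manifest_reads: u64,
}

/// Treats the first sample as the baseline and fails if any sample's selected
/// counter exceeds that baseline by more than `slack`.
pub fn check_flat(
    curve: &[Sample],
    select: impl Fn(&Sample) -> u64,
    slack: u64,
    what: &str,
) -> Result<(), String> {
    let Some(first) = curve.first() else {
        return Ok(()); // an empty curve has nothing to violate
    };
    let baseline = select(first);
    for sample in curve {
        let value = select(sample);
        if value > baseline + slack {
            return Err(format!(
                "{} grew with history: {} at depth {} vs {} at depth {} (slack {})",
                what, value, sample.depth, baseline, first.depth, slack
            ));
        }
    }
    Ok(())
}
```

A caller would sweep a fixture over several commit depths, collect one `Sample` per depth, and pass a selector such as `|s| s.manifest_reads`. The small `slack` absorbs constant jitter while still failing on an O(commits) term.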