omnigraph/docs/maintenance.md
Ragnor Comerford 815ff743f5
recovery: refresh-time roll-forward closes the in-process residual + invariants helper
Bundle of three correctness fixes plus a shared invariants helper that
existing tests now use.

1. SchemaApply atomicity: close the residual gap where a sidecar exists
   but staging files don't (e.g., Phase B failure BEFORE
   `_schema.pg.staging` write). `recover_schema_state_files` now returns
   a `SchemaStateRecovery` discriminator (`Noop` /
   `CleanedStaging` / `CompletedStagingRename { schema_apply_sidecar }`);
   the token threads through `recover_manifest_drift` →
   `process_sidecar`. SchemaApply sidecars are eligible for roll-forward
   ONLY when the staging rename completed in the same recovery pass.
   Full mode rolls back; RollForwardOnly defers. Without this, recovery
   would publish the manifest pin against new-schema data while
   `_schema.pg` stayed old (real corruption). New failpoint
   `schema_apply.before_staging_write` + new test
   `schema_apply_without_schema_staging_rolls_back_on_next_open` pin
   the gating.

2. Rollback target correction. Rollback now restores Lance HEAD to the
   current manifest pin (`state.manifest_pinned`) instead of the
   sidecar's `expected_version`. For UnexpectedAtP1/UnexpectedMultistep
   classifications these can differ; the old code could regress Lance
   HEAD past the manifest pin, re-introducing drift in the OTHER
   direction. The new behavior establishes `Lance HEAD == manifest pin`
   post-rollback — the canonical drift-free invariant. Param renamed
   from `expected_version` → `target_version` to match. Audit
   `to_version` records the actual restore target.

   This is a latent-behavior change. Any external consumer that compared
   `audit.to_version` against `sidecar.expected_version` for non-trivial
   classifications now sees the manifest pin instead.

3. Audit commit-graph unification. `record_audit` now opens the
   per-branch commit graph for ANY sidecar with `sidecar.branch.is_some()`
   — not just BranchMerge. Plain Mutation/Load/EnsureIndices commits on a
   feature branch now correctly land on that branch's commit graph,
   instead of main's. Closes the class of bug analogous to D2 but for
   non-merge writers.

   Pre-existing repos with non-main commits already on main's commit
   graph stay where they are; future recoveries write to the per-branch
   ref. Mixed-version compatibility is asymmetric but safe (old binaries
   ignore per-branch refs they don't know about; new binaries read both).

4. Recovery invariants helper + branch-axis cells. New
   `tests/helpers/recovery.rs` (~505 LOC) exports
   `assert_post_recovery_invariants(repo, op_id, RecoveryExpectation)`
   plus a `TableExpectation` builder. Six existing recovery tests
   refactored to call it; per-test bespoke assertions replaced. Two new
   branch-axis cells added in `tests/failpoints.rs`:
     - `recovery_rolls_forward_load_on_feature_branch`
     - `recovery_rolls_forward_ensure_indices_on_feature_branch`
   The loader gains a `mutation.post_finalize_pre_publisher` failpoint
   hook (gated on the `failpoints` feature; zero-cost in release) so the
   load test can pin the same Phase B → Phase C boundary the mutation
   path uses.

Misc:
   - `Omnigraph::refresh` extracts `reload_schema_if_source_changed`:
     early-return when schema source unchanged (saves IR parse + catalog
     rebuild on the steady-state refresh path).
   - New test injection point
     `failpoint_publish_table_head_without_index_rebuild_for_test`
     under `#[cfg(feature = "failpoints")]`.

Tests: 31 recovery + failpoint integration tests pass (14 + 17, up from
14 + 16). Full workspace sweep with `--features failpoints` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:04:48 +02:00


Maintenance: Optimize & Cleanup

Source: db/omnigraph/optimize.rs.

optimize_all_tables(db) — non-destructive

  • Lance compact_files() on every node + edge table on main.
  • Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests.
  • Bounded by OMNIGRAPH_MAINTENANCE_CONCURRENCY (default 8).
  • Returns [TableOptimizeStats { table_key, fragments_removed, fragments_added, committed }].
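As a rough illustration of the concurrency bound and the returned stats, the sketch below parses the environment variable and summarises a result list. It is not the code in optimize.rs: `parse_concurrency`, `maintenance_concurrency`, and `net_fragments_reclaimed` are hypothetical helpers, the field types of `TableOptimizeStats` are guesses, and the fallback to 8 on an unset, unparsable, or zero value is an assumption.

```rust
use std::env;

/// Default cap stated in the docs above.
const DEFAULT_MAINTENANCE_CONCURRENCY: usize = 8;

/// Hypothetical helper: interpret a raw OMNIGRAPH_MAINTENANCE_CONCURRENCY value.
/// Assumption: unset, unparsable, or zero values fall back to the default.
fn parse_concurrency(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_MAINTENANCE_CONCURRENCY)
}

/// Hypothetical helper: read the cap from the process environment.
fn maintenance_concurrency() -> usize {
    parse_concurrency(env::var("OMNIGRAPH_MAINTENANCE_CONCURRENCY").ok().as_deref())
}

/// Mirrors the field names listed above; the field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct TableOptimizeStats {
    table_key: String,
    fragments_removed: usize,
    fragments_added: usize,
    committed: bool,
}

/// Net reduction in fragment count, counting only tables whose compaction committed.
fn net_fragments_reclaimed(stats: &[TableOptimizeStats]) -> i64 {
    stats
        .iter()
        .filter(|s| s.committed)
        .map(|s| s.fragments_removed as i64 - s.fragments_added as i64)
        .sum()
}
```

A caller could log `net_fragments_reclaimed` after a run to judge whether compaction did useful work; uncommitted tables are skipped because their rewrite never became visible.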

cleanup_all_tables(db, options) — destructive

  • Lance cleanup_old_versions() per table.
  • Removes manifests (and their unique fragments) older than the retention policy.
  • CleanupPolicyOptions { keep_versions: Option<u32>, older_than: Option<Duration> } — at least one is required.
  • Returns [TableCleanupStats { table_key, bytes_removed, old_versions_removed }].
  • CLI guards with --confirm; without it, prints a preview line.
  • Recovery floor: --keep < 3 may garbage-collect Lance versions that the open-time recovery sweep needs as a rollback target (the sweep restores to the branch's manifest-pinned table version, which is HEAD-1 in the typical Phase B → Phase C drift case). Default --keep 10 is safe.
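The two policy rules above (at least one option is required, and a low keep count endangers recovery) can be sketched as a pre-flight check. This is illustrative only: `check_policy`, `PolicyCheck`, and `RECOVERY_FLOOR_KEEP` are hypothetical names, and it assumes the CLI's `--keep` maps onto `keep_versions`.

```rust
use std::time::Duration;

/// Same field names as the options struct described above.
#[derive(Debug, Clone, Default)]
struct CleanupPolicyOptions {
    keep_versions: Option<u32>,
    older_than: Option<Duration>,
}

/// Assumed threshold, taken from the "--keep < 3" note above.
const RECOVERY_FLOOR_KEEP: u32 = 3;

/// Hypothetical outcome of a pre-flight policy check.
#[derive(Debug, PartialEq)]
enum PolicyCheck {
    /// The policy is well-formed and keeps enough versions for recovery.
    Ok,
    /// The policy is well-formed but may remove a rollback target.
    BelowRecoveryFloor { keep: u32 },
}

/// Reject an empty policy; flag a keep count below the recovery floor.
fn check_policy(opts: &CleanupPolicyOptions) -> Result<PolicyCheck, String> {
    if opts.keep_versions.is_none() && opts.older_than.is_none() {
        return Err("at least one of keep_versions or older_than is required".to_string());
    }
    match opts.keep_versions {
        Some(keep) if keep < RECOVERY_FLOOR_KEEP => Ok(PolicyCheck::BelowRecoveryFloor { keep }),
        _ => Ok(PolicyCheck::Ok),
    }
}
```

Returning a warning variant rather than an error leaves the decision with the operator, which matches the "may garbage-collect" wording above; a stricter tool could turn `BelowRecoveryFloor` into a hard failure.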

Tombstones

Tombstones are logical delete markers for sub-tables, stored in __manifest. A tombstone keyed by tombstone_object_id(table_key, version) excludes that sub-table version from snapshot reconstruction.
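To make the exclusion rule concrete, here is a minimal sketch of filtering versions against a tombstone set. The id encoding is invented for illustration (the real format of `tombstone_object_id` is not given here), and `visible_versions` is a hypothetical helper, not part of the manifest code.

```rust
use std::collections::HashSet;

/// Illustrative stand-in for the real function; the id format is an assumption.
fn tombstone_object_id(table_key: &str, version: u64) -> String {
    format!("tombstone:{}@v{}", table_key, version)
}

/// Hypothetical helper: keep only the versions of `table_key` that have no
/// matching tombstone, preserving the input order.
fn visible_versions(table_key: &str, versions: &[u64], tombstones: &HashSet<String>) -> Vec<u64> {
    versions
        .iter()
        .copied()
        .filter(|&v| !tombstones.contains(&tombstone_object_id(table_key, v)))
        .collect()
}
```

Because the marker is keyed on both table and version, tombstoning one version of a sub-table leaves its other versions, and every other table, untouched.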

Internal schema migrations (db/manifest/migrations.rs)

Changes to the on-disk __manifest shape are reconciled automatically on the first write under a new binary. INTERNAL_MANIFEST_SCHEMA_VERSION declares the shape the binary expects; the stamp omnigraph:internal_schema_version (Lance schema-level metadata) records the shape actually on disk. The publisher's open-for-write path calls migrate_internal_schema before reading state; reads are side-effect-free, so a read never triggers a migration. No operator action is required for in-place upgrades. See storage.md → Internal schema versioning for the full mechanism.

A binary that opens a manifest stamped at a version higher than it knows about refuses to publish, with a clear "upgrade omnigraph first" error, so old binaries cannot clobber a newer schema.
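The open-for-write decision described above reduces to a three-way comparison between the on-disk stamp and the version the binary supports. The sketch below shows that comparison only; `WriteGate` and `gate_open_for_write` are hypothetical names, the constant's value is illustrative, and the real migrate_internal_schema may be structured differently.

```rust
use std::cmp::Ordering;

/// Illustrative value only; the real constant lives in db/manifest/migrations.rs.
const INTERNAL_MANIFEST_SCHEMA_VERSION: u32 = 2;

/// Hypothetical result of comparing the on-disk stamp with the binary's version.
#[derive(Debug, PartialEq)]
enum WriteGate {
    /// Stamp matches: nothing to do before publishing.
    UpToDate,
    /// Stamp is older: migrate in place before reading state.
    Migrate { from: u32, to: u32 },
    /// Stamp is newer: refuse to publish ("upgrade omnigraph first").
    Refuse { on_disk: u32, supported: u32 },
}

/// Decide what the open-for-write path should do for a given stamp.
fn gate_open_for_write(on_disk: u32, supported: u32) -> WriteGate {
    match on_disk.cmp(&supported) {
        Ordering::Equal => WriteGate::UpToDate,
        Ordering::Less => WriteGate::Migrate { from: on_disk, to: supported },
        Ordering::Greater => WriteGate::Refuse { on_disk, supported },
    }
}
```

Keeping the decision in a pure function makes the refusal path easy to test without touching storage; only the `Migrate` branch would ever write.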