omnigraph/docs/maintenance.md
Ragnor Comerford 05e52f2ee0
recovery: rename composite test, strip ticket references, address review
Three bundled changes:

1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and
   the test function). OmniGraph is consumed by both humans and agents
   - naming the test after one audience misframes the library.

2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and
   review-round labels from source, tests, and docs added by this
   branch. Internal traceability belongs in commit messages and PR
   descriptions, not in checked-in artifacts. Upstream
   lance-format/lance issue refs and pre-existing MR-XXX refs in docs
   not touched by this branch are left alone.

3. Two outstanding review findings addressed:
   - `needs_index_work_node` / `needs_index_work_edge`: propagate
     `count_rows` errors instead of `unwrap_or(0)`. Silently treating
     transient I/O failures as "0 rows" risked skipping a table from
     the recovery sidecar pin set that was actually about to be
     modified.
   - `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`:
     strengthen the assertion to fail when sidecar B classifies under
     a stale snapshot. The new assertion checks post-recovery Lance
     HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar
     deleted + audit rows present" pair passed in both the bug and
     fix paths because both delete the sidecar and write an audit
     row; the differentiator is the post-recovery HEAD. Strengthening
     the assertion exposed an additional nuance: in this overlapping-
     sidecar scenario sidecar B's audit kind is RolledBack (no-op)
     rather than RolledForward, since sidecar A's roll-forward
     publishes Lance HEAD as the new manifest pin (absorbing B's
     work). The docstring now explains why this is correct given
     current `roll_forward_all` semantics.

All workspace tests pass with --features failpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 13:56:36 +02:00

Maintenance: Optimize & Cleanup

Source: db/omnigraph/optimize.rs.

optimize_all_tables(db) — non-destructive

  • Runs Lance compact_files() on every node and edge table on main.
  • Rewrites small fragments into fewer, larger ones; the old fragments remain reachable via older manifests.
  • Concurrency is bounded by OMNIGRAPH_MAINTENANCE_CONCURRENCY (default 8).
  • Returns [TableOptimizeStats { table_key, fragments_removed, fragments_added, committed }].
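
As a rough illustration of the concurrency bound above, the sketch below resolves a limit from OMNIGRAPH_MAINTENANCE_CONCURRENCY with a default of 8. The function names are hypothetical, and the fallback to the default on missing, unparsable, or zero values is an assumption of this sketch, not confirmed behavior of optimize.rs.

```rust
use std::env;

/// Default taken from the list above.
const DEFAULT_MAINTENANCE_CONCURRENCY: usize = 8;

/// Hypothetical helper: turn a raw env value into a concurrency limit.
/// Assumption: missing, unparsable, or zero values fall back to the default.
fn parse_concurrency(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_MAINTENANCE_CONCURRENCY)
}

/// Hypothetical helper: read the limit from the process environment.
fn maintenance_concurrency() -> usize {
    parse_concurrency(env::var("OMNIGRAPH_MAINTENANCE_CONCURRENCY").ok().as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback rule easy to test without mutating process state.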

cleanup_all_tables(db, options) — destructive

  • Runs Lance cleanup_old_versions() on each table.
  • Removes manifests (and their unique fragments) older than the retention policy.
  • CleanupPolicyOptions { keep_versions: Option<u32>, older_than: Option<Duration> }; at least one of the two fields is required.
  • Returns [TableCleanupStats { table_key, bytes_removed, old_versions_removed }].
  • The CLI guards the operation with --confirm; without it, the command prints a preview line.
  • Recovery floor: --keep values below 3 may garbage-collect Lance versions that the open-time recovery sweep needs as a rollback target. The sweep restores to the manifest-pinned expected_version, which is HEAD-1 in the typical Phase B → Phase C drift case. The default, --keep 10, is safe.
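
The two policy rules above (at least one field set, and a keep count of at least 3 to stay above the recovery floor) can be sketched as a small pre-flight check. The struct fields mirror the ones listed above; `check_policy`, `PolicyCheck`, and `RECOVERY_FLOOR_KEEP` are hypothetical names introduced for illustration and are not part of the documented API.

```rust
use std::time::Duration;

/// Fields as listed above; the derives are an assumption of this sketch.
#[derive(Debug, Clone, Default)]
struct CleanupPolicyOptions {
    keep_versions: Option<u32>,
    older_than: Option<Duration>,
}

/// Hypothetical constant: smallest keep count that stays above the recovery floor.
const RECOVERY_FLOOR_KEEP: u32 = 3;

/// Hypothetical outcome of the pre-flight check.
#[derive(Debug, PartialEq)]
enum PolicyCheck {
    Ok,
    /// keep_versions is low enough that a rollback target may be collected.
    BelowRecoveryFloor,
}

/// Rejects an empty policy and flags keep counts under the recovery floor.
fn check_policy(opts: &CleanupPolicyOptions) -> Result<PolicyCheck, String> {
    if opts.keep_versions.is_none() && opts.older_than.is_none() {
        return Err("at least one of keep_versions or older_than is required".to_string());
    }
    match opts.keep_versions {
        Some(keep) if keep < RECOVERY_FLOOR_KEEP => Ok(PolicyCheck::BelowRecoveryFloor),
        _ => Ok(PolicyCheck::Ok),
    }
}
```

This sketch only inspects keep_versions for the floor, because the text above states the floor in terms of --keep; how an older_than-only policy interacts with recovery is not covered here.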

Tombstones

Tombstones are logical delete markers for sub-tables, stored in __manifest. A tombstone keyed by tombstone_object_id(table_key, version) excludes that sub-table version from snapshot reconstruction.
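
To make the exclusion rule concrete, here is a minimal sketch of filtering sub-table versions against a tombstone set. Everything in it is an assumption for illustration: the id string format, the parameter types of `tombstone_object_id`, the `visible_versions` helper, and the example table keys. The real encoding in __manifest is not described here.

```rust
use std::collections::HashSet;

/// Hypothetical id encoding; the actual format used in __manifest may differ.
fn tombstone_object_id(table_key: &str, version: u64) -> String {
    format!("tombstone:{}@{}", table_key, version)
}

/// Hypothetical helper: keep only the sub-table versions that have no tombstone.
fn visible_versions(
    entries: &[(&str, u64)],
    tombstones: &HashSet<String>,
) -> Vec<(String, u64)> {
    entries
        .iter()
        .filter(|(key, version)| !tombstones.contains(&tombstone_object_id(key, *version)))
        .map(|(key, version)| (key.to_string(), *version))
        .collect()
}
```

With a tombstone recorded for one (table_key, version) pair, that pair drops out of the result while other versions of the same table remain visible.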

Internal schema migrations (db/manifest/migrations.rs)

Changes to the on-disk __manifest shape are reconciled automatically on the first write under a new binary. INTERNAL_MANIFEST_SCHEMA_VERSION declares the shape the binary expects; the stamp omnigraph:internal_schema_version (Lance schema-level metadata) records the shape actually on disk. The publisher's open-for-write path calls migrate_internal_schema before reading state, while reads are side-effect-free. No operator action is required for in-place upgrades. See storage.md → Internal schema versioning for the full mechanism.

A binary that opens a manifest stamped at a higher version than it knows about refuses to publish and returns a clear "upgrade omnigraph first" error, so old binaries cannot clobber a newer schema.
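
The version gate described in this section can be sketched as a three-way comparison between the on-disk stamp and the version the binary expects. This is a simplified model, not the code in migrations.rs: `plan_open_for_write` and `OpenForWrite` are hypothetical names, the constant's value of 2 is illustrative only, and the error is a plain string.

```rust
use std::cmp::Ordering;

/// Shape this binary expects. The value 2 is illustrative, not the real version.
const INTERNAL_MANIFEST_SCHEMA_VERSION: u32 = 2;

/// Hypothetical decision made on the open-for-write path.
#[derive(Debug, PartialEq)]
enum OpenForWrite {
    /// On-disk stamp already matches the binary.
    UpToDate,
    /// On-disk stamp is older; a migration should run before state is read.
    Migrate { from: u32, to: u32 },
}

/// Compares the on-disk stamp with the binary's expected version.
fn plan_open_for_write(on_disk: u32) -> Result<OpenForWrite, String> {
    match on_disk.cmp(&INTERNAL_MANIFEST_SCHEMA_VERSION) {
        Ordering::Equal => Ok(OpenForWrite::UpToDate),
        Ordering::Less => Ok(OpenForWrite::Migrate {
            from: on_disk,
            to: INTERNAL_MANIFEST_SCHEMA_VERSION,
        }),
        // A newer stamp means an older binary: refuse rather than clobber.
        Ordering::Greater => Err(format!(
            "manifest schema v{} is newer than this binary (v{}); upgrade omnigraph first",
            on_disk, INTERNAL_MANIFEST_SCHEMA_VERSION
        )),
    }
}
```

Because the decision is made only on the write path, a read-only open in this model never triggers a migration, which matches the side-effect-free reads described above.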