Maintenance: Optimize, Repair & Cleanup
Addressing. optimize, repair, and cleanup are direct (storage-native) CLI commands: they run with direct storage access against either a positional file:// or s3:// URI, or --cluster <dir|s3://…> --graph <id>. The latter resolves the graph's storage URI from the served cluster state, so you needn't know the <storage>/graphs/<id>.omni layout. They never run through a server, and they reject --server or a remote (http(s)://) URI with a declared error. There are no server routes for them by design; to maintain a server-backed graph, run them out-of-band against the graph's storage URI. See the Command capabilities section of cli-reference.md.
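The addressing rules above can be sketched as a small validator. This is a hedged illustration, not the CLI's actual code: addressing_mode, AddressingError, and the error strings are hypothetical names, and the sketch assumes that a positional URI takes precedence when both forms are given and that only explicit file:// or s3:// schemes are accepted.

```python
from urllib.parse import urlparse

DIRECT_SCHEMES = {"file", "s3"}      # storage-native targets named in the text
REMOTE_SCHEMES = {"http", "https"}   # server URIs, always rejected


class AddressingError(ValueError):
    """Hypothetical stand-in for the CLI's declared addressing error."""


def addressing_mode(uri=None, cluster=None, graph=None, server=None):
    """Return "uri" or "cluster" for a valid maintenance target, else raise."""
    if server is not None:
        raise AddressingError("maintenance commands do not accept --server")
    if uri is not None:
        scheme = urlparse(uri).scheme
        if scheme in REMOTE_SCHEMES:
            raise AddressingError("remote URI rejected: use the graph's storage URI")
        if scheme not in DIRECT_SCHEMES:
            # Assumption: bare paths and other schemes are not accepted here.
            raise AddressingError(f"unsupported scheme: {scheme!r}")
        return "uri"
    if cluster is not None and graph is not None:
        return "cluster"
    raise AddressingError("need a storage URI, or --cluster together with --graph")
```

For example, addressing_mode(uri="s3://bucket/g.omni") returns "uri", while passing an https:// URI or any server value raises AddressingError.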
optimize — non-destructive
- Compacts every node + edge table on main, then reindexes them, then publishes the resulting version to the __manifest so the manifest's recorded version tracks the compacted-and-reindexed state. Reads pin the manifest version, so without this publish the work would be invisible to readers and would break the version precondition of the next schema apply / strict update/delete ("stale view … refresh and retry"). The publish advances the graph version (a system-attributed commit) only for tables that actually changed.
- Rewrites small fragments into fewer large ones; old fragments remain reachable via older versions until cleanup runs.
- Reindex (index coverage maintenance). A scalar/FTS/vector index only covers the fragments it was built over. Rows appended after the index was built (e.g. by load --mode merge, whose commit does not rebuild an already-existing index) are scanned unindexed, and compaction itself rewrites fragments out of an index's coverage. optimize runs Lance's incremental optimize_indices after compaction to fold those fragments back in (a delta merge, not a full retrain), restoring full coverage so equality/range/traversal predicates stay index-accelerated. This is why a table with no compaction work but stale index coverage still commits a new version under optimize. Run optimize on a cadence at least as frequent as your freshness window so recently-loaded rows do not linger in the unindexed flat-scan tail.
- Each table's compact→reindex→publish sequence serializes with concurrent mutations on the same table. A crash mid-operation is recovered automatically on the next open (both compaction and reindex are content-preserving, so roll-forward is always safe).
- Requires a recovered graph. optimize refuses (errors) when a pending crash-recovery operation is present, because operating on an unrecovered graph could publish a partial write that recovery would roll back. Reopen the graph to run recovery, then re-run optimize.
- Uncovered drift is skipped, not interpreted. If a table's underlying version is ahead of the version recorded in __manifest and no crash-recovery record covers that movement, optimize reports skipped: DriftNeedsRepair with the manifest/head versions and leaves the table untouched. Run omnigraph repair to classify and explicitly publish that drift.
- Bounded by OMNIGRAPH_MAINTENANCE_CONCURRENCY (default 8).
- Returns per-table stats: table_key, fragments_removed, fragments_added, committed, skipped, manifest_version, lance_head_version.
- Blob tables are skipped. A table that declares any Blob property is not compacted: it is reported with skipped: BlobColumnsUnsupportedByLance (and logged) instead of compacted, and the rest of the sweep proceeds normally. Reads and writes are unaffected; only compaction is. Consequence: fragment count and deleted-row space on blob tables are not reclaimed; query results are never affected. A skipped blob table is also not reindexed in the same sweep (the skip happens before the reindex step), so its index coverage on appended rows is not refreshed by optimize today.
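The refusal and skip rules in the list above can be modeled as a per-table pre-check. This is a minimal sketch under stated assumptions, not the tool's implementation: TableState, skip_reason, and plan_optimize are hypothetical names, the table keys in the usage note are invented, and the order in which the blob and drift checks run is assumed. Only the two skip labels and the field names manifest_version and lance_head_version come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableState:
    table_key: str
    manifest_version: int        # version recorded in __manifest
    lance_head_version: int      # table's underlying head version
    has_blob_column: bool = False


def skip_reason(table: TableState) -> Optional[str]:
    """Return the skip label for a table, or None if it would be processed."""
    if table.has_blob_column:
        return "BlobColumnsUnsupportedByLance"
    # With no pending recovery record, a head ahead of the manifest is uncovered drift.
    if table.lance_head_version > table.manifest_version:
        return "DriftNeedsRepair"
    return None


def plan_optimize(tables, pending_recovery: bool):
    """Refuse the whole run on an unrecovered graph, else map table -> skip label."""
    if pending_recovery:
        raise RuntimeError("pending crash-recovery operation: reopen the graph first")
    return {t.table_key: skip_reason(t) for t in tables}
```

Given three tables where one declares a Blob property and one has a head two versions ahead of its manifest pin, plan_optimize returns None for the healthy table and the matching skip label for each of the other two.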
repair — explicit
- Handles uncovered manifest/head drift: a table's underlying version is ahead of the manifest pin and no crash-recovery record explains the movement.
- Preview by default. omnigraph repair --json <uri> reports each table's classification, action, manifest/head versions, underlying operation names, and any classification error. --confirm publishes only verified maintenance drift; if any suspicious or unverifiable table is refused, the CLI prints the per-table output and exits non-zero. --force --confirm also publishes suspicious or unverifiable drift after operator review.
- Classifies drift by reading the table's transaction history from manifest_version + 1 through the current head. Only fragment-reservation and rewrite (compaction) operations are verified maintenance. Semantic operations such as append, delete, update, or merge, and tables with missing transaction history, are not auto-healed.
- Publishes repair by advancing __manifest to the existing head; it does not rewrite data. If the publish succeeds, normal reads and strict writes use the repaired version. If it fails, no new data-side partial state was created.
- Requires a clean recovery state. A pending crash-recovery operation still belongs to automatic recovery, not manual repair.
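The classification and flag behavior described above can be sketched as two small functions. This is an illustration under assumptions, not the CLI's code: the operation-name strings, the classification and action labels, and the exit code value 1 are hypothetical (the text says only that the exit is non-zero). The sketch represents a missing transaction as None and assumes --force without --confirm still previews.

```python
# Assumed names for fragment-reservation and rewrite (compaction) operations.
MAINTENANCE_OPS = {"ReserveFragments", "Rewrite"}


def classify(ops):
    """Classify drift from the operations in manifest_version + 1 .. head.

    ops holds one operation name per version; None marks missing history.
    """
    if not ops or any(op is None for op in ops):
        return "unverifiable"
    if all(op in MAINTENANCE_OPS for op in ops):
        return "verified_maintenance"
    return "suspicious"  # append, delete, update, merge, ...


def action(classification, confirm=False, force=False):
    """Map a classification and the CLI flags to preview / publish / refuse."""
    if not confirm:
        return "preview"
    if classification == "verified_maintenance" or force:
        return "publish"
    return "refuse"


def exit_code(actions):
    """Non-zero when any table was refused (the value 1 is an assumption)."""
    return 1 if "refuse" in actions else 0
```

A table whose drift is ["Rewrite", "ReserveFragments"] is published under --confirm alone, while one containing "Append" is refused unless --force is added.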
cleanup — destructive
- Garbage-collects old versions per table.
- Removes versions (and their unique fragments) older than the retention policy.
- Policy options keep_versions and older_than; at least one is required.
- Returns per-table stats: table_key, bytes_removed, old_versions_removed, error.
- Fault-isolated per table. A single table's transient failure (version GC or orphan reclaim) is recorded on that table's stats row (with an error) and logged, and never aborts the healthy tables. cleanup is the convergence backstop, so it does as much as it can and converges on re-run. The CLI reports any failed tables; rerun cleanup to retry them.
- CLI guards with --confirm; without it, prints a preview line.
- Non-local consent (RFC-011 D9). Against a non-local target (an s3:// store/cluster), cleanup additionally requires consent on top of --confirm: pass --yes, or answer the prompt on a TTY. A non-interactive run (no TTY, or --json) without --yes refuses rather than destroying. A local (file://) target needs only --confirm. The same --yes gate applies to overwrite load and branch delete; every maintenance run echoes its resolved target to stderr (suppress with --quiet).
- Recovery floor: --keep < 3 may garbage-collect versions that crash recovery needs as a rollback target. Default --keep 10 is safe.
- Orphaned-branch reconciliation: before the version GC, cleanup reclaims any per-table or commit-graph branch absent from the manifest branch list. These orphans arise when a branch_delete flips the manifest authority but a downstream best-effort reclaim does not complete (see branches-commits.md). The reconciler is idempotent (it no-ops once nothing is orphaned), runs regardless of the keep_versions/older_than values (those gate version GC only), and never reclaims main or system-branch forks. Reclaimed forks are logged.
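The --confirm and non-local consent rules above reduce to a small decision table. The following is a hedged sketch, not the CLI's implementation: cleanup_gate, is_local, below_recovery_floor, and the returned labels are hypothetical, and the sketch assumes the target has already been resolved to a URI with an explicit scheme.

```python
from urllib.parse import urlparse


def is_local(target: str) -> bool:
    """Treat only file:// targets as local (assumes a resolved, schemed URI)."""
    return urlparse(target).scheme == "file"


def cleanup_gate(target, confirm=False, yes=False, is_tty=False, json_output=False):
    """Return "preview", "proceed", "prompt", or "refuse" for a cleanup run."""
    if not confirm:
        return "preview"   # without --confirm, only a preview line is printed
    if is_local(target) or yes:
        return "proceed"   # local needs only --confirm; --yes is explicit consent
    if is_tty and not json_output:
        return "prompt"    # interactive run: ask before destroying
    return "refuse"        # no TTY or --json, and no --yes


def below_recovery_floor(keep: int) -> bool:
    """True when --keep is low enough to endanger crash-recovery rollback targets."""
    return keep < 3
```

For instance, cleanup_gate("s3://bucket/g.omni", confirm=True, json_output=True) yields "refuse", and adding yes=True turns it into "proceed".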
Tombstones
Logical sub-table delete markers in __manifest that exclude a sub-table version from snapshot reconstruction.
Internal schema migrations
Version evolutions of the on-disk __manifest shape are reconciled automatically on the first write under a new binary. An on-disk stamp records the shape; the binary migrates it forward before reading state, and reads are side-effect-free. No operator action is required for in-place upgrades. See storage.md → Internal schema versioning for the full mechanism.
A binary opening a manifest stamped at a version higher than it knows about refuses to publish with a clear "upgrade omnigraph first" error — old binaries cannot clobber a newer schema.
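The stamp comparison described in this section can be illustrated as follows. This is a sketch under assumptions: it models the on-disk stamp and the binary's known shape as plain integers, and on_write and its return labels are hypothetical. Only the "upgrade omnigraph first" wording comes from the text.

```python
def on_write(on_disk_stamp: int, binary_version: int) -> str:
    """Decide what a write does given the manifest stamp and the binary's version."""
    if on_disk_stamp > binary_version:
        # Old binary, newer manifest shape: refuse rather than clobber it.
        raise RuntimeError("upgrade omnigraph first")
    if on_disk_stamp < binary_version:
        return "migrate"   # first write under a new binary migrates forward
    return "ok"            # shapes match, nothing to reconcile
```

With these assumed integers, on_write(2, 3) returns "migrate", on_write(3, 3) returns "ok", and on_write(4, 3) raises.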