fix: optimize publishes compaction; recovery roll-back converges manifest (#141)

* test(optimize): cover manifest publish + HEAD-drift reconcile

Red against the pre-fix optimize, which ran compact_files without
publishing the compacted version to __manifest:

- maintenance: optimize must publish so the manifest table_version
  tracks the compacted Lance HEAD and a later schema apply succeeds;
  and must reconcile a pre-existing manifest-behind-HEAD drift (forged
  via raw Lance compaction) so strict writes commit again.
- end_to_end + composite_flow: post-optimize query / strict update /
  reopen in the full lifecycle (the canonical flow previously omitted
  post-optimize writes as a documented "known limitation").
- failpoints: a crash between compaction and the manifest publish rolls
  forward on next open.

* fix(optimize): publish compaction to manifest and reconcile HEAD drift

optimize ran Lance compact_files without publishing the new version to
__manifest, so the manifest table_version lagged the Lance HEAD: reads
stayed pinned to the pre-compaction version, and the next schema apply or
strict update/delete failed its HEAD-vs-manifest precondition with
"stale view ... refresh and retry" (open-time recovery rollback inflated
the gap on retry).

optimize now publishes each compacted table's version under the
per-(table, main) write queue, guarded by a manifest CAS and a
SidecarKind::Optimize recovery sidecar (loose-match; roll-forward is safe
because compaction is content-preserving). When a table has nothing left
to compact but its Lance HEAD is already ahead of the manifest pin
(pre-fix drift, or a recovery restore commit), optimize reconciles the
manifest forward to HEAD (metadata-only, no sidecar). Caches and the
CSR/CSC graph index are invalidated after a publish.

Docs updated (maintenance, storage, branches-commits, writes, testing).

* test(recovery): rollback convergence + optimize-defer regressions

Red against the current code, landed before the fix:
- recovery: after the open-time sweep rolls a sidecar back, the manifest
  must track Lance HEAD (no residual drift) so a follow-up schema apply
  succeeds — the original "+1 per retry" loop. Today roll-back restores
  without publishing, so the manifest lags HEAD and the apply fails its
  HEAD-vs-manifest precondition.
- maintenance: optimize must refuse while a recovery sidecar is pending —
  operating on an unrecovered graph could publish a partial write the
  sweep would roll back.

Also removes optimize_reconciles_preexisting_manifest_head_drift: the
ad-hoc drift reconcile it covered is replaced by recovery-side convergence.

* fix(recovery): converge manifest on roll-back; optimize defers on pending recovery

Root of PR #141's review findings and the original "+1 per retry" loop:
a Lance HEAD ahead of the manifest was ambiguous (benign content-preserving
drift vs. a partial write a sidecar will roll back), and optimize's reconcile
guessed it benign. Close the class instead of guessing:

- Recovery roll-back now PUBLISHES the restored version (via a
  push_table_update_at_head helper shared with roll-forward), so the manifest
  tracks the Lance HEAD after recovery — symmetric with roll-forward. This
  fixes the +1 loop (after one roll-back the retry's HEAD-vs-manifest
  precondition passes) and removes the only remaining source of orphaned
  drift. The audit still records the logical rolled-back-to version; the
  manifest is published at the restore commit (identical content).
- optimize drops the ad-hoc drift reconcile and instead REFUSES when a
  __recovery sidecar is pending, so it only ever operates on a recovered
  graph (manifest == HEAD); its compaction publish can no longer commit a
  partial write. With the reconcile gone, the blob-skip-vs-reconcile gap is
  moot.

Updates the rollback recovery-test helper (manifest == HEAD after roll-back),
the failpoints assertions, and the user/dev docs.

* test(recovery): fix rollback assertion for manifest convergence

The roll-back-publishes change makes the manifest version advance after a
SchemaApply roll-back (to the old-schema content), so the
schema_apply_without_schema_staging_rolls_back_on_next_open assertion must
be `version > pre`, not `version == pre`. This update was dropped during
the commit churn and surfaced as a CI Test Workspace failure; the
old-schema-preserved intent stays covered by count_rows + _schema.pg + the
RolledBack convergence invariant.
This commit is contained in:
Ragnor Comerford 2026-06-08 01:50:12 +02:00 committed by GitHub
parent 4a66d6e071
commit e62d9166fb
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
15 changed files with 816 additions and 137 deletions

View file

@ -58,6 +58,6 @@ Internal or legacy branch refs:
## L2 — Recovery audit trail
The four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) protect their multi-table commits with a sidecar at `__recovery/{ulid}.json` written before Phase B and deleted after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`: classify per-table state, decide all-or-nothing per sidecar, roll forward / back, record an audit row.
The five migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`, `optimize_all_tables`) protect their multi-table commits with a sidecar at `__recovery/{ulid}.json` written before Phase B and deleted after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`: classify per-table state, decide all-or-nothing per sidecar, roll forward / back, record an audit row.
Audit rows live in `_graph_commit_recoveries.lance` (sibling to `_graph_commits.lance`) and reference the commit graph by `graph_commit_id`. The linked recovery commit is identified by that same `graph_commit_id`, and `actor_id="omnigraph:recovery"` is stored in `_graph_commit_actors.lance` (joined by `graph_commit_id`) — `_graph_commits.lance` itself does not carry the `actor_id` column. To find recoveries for a specific original actor: `omnigraph commit list --filter actor=omnigraph:recovery`, then join to `_graph_commit_recoveries.lance` by `graph_commit_id` to read `recovery_for_actor`. Schema: see `crates/omnigraph/src/db/recovery_audit.rs`.

View file

@ -4,8 +4,10 @@
## `optimize_all_tables(db)` — non-destructive
- Lance `compact_files()` on every node + edge table on `main`.
- Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests.
- Lance `compact_files()` on every node + edge table on `main`, then **publishes the compacted version to the `__manifest`** so the manifest's `table_version` tracks the compacted Lance HEAD. Reads pin the manifest version, so without this publish compaction would be invisible to readers *and* would break the HEAD-vs-manifest precondition of the next schema apply / strict update/delete ("stale view … refresh and retry"). The publish advances the graph version (a system-attributed commit) only for tables that actually compacted.
- Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests until `cleanup` runs.
- Each table's compact→publish runs under its per-`(table, main)` write queue (serializing with concurrent mutations — compaction is a Lance `Rewrite` op that retryable-conflicts with a concurrent merge/update/delete on overlapping fragments). The Lance-HEAD-before-manifest-publish gap is covered by a `SidecarKind::Optimize` recovery sidecar (loose-match): a crash in that window rolls the compacted version forward on the next `Omnigraph::open` (compaction is content-preserving, so roll-forward is always safe).
- **Requires a recovered graph.** `optimize` refuses (errors) when an unresolved recovery sidecar is present under `__recovery` — operating on an unrecovered graph could publish a partial write the open-time recovery sweep would roll back. Reopen the graph to run the recovery sweep, then re-run `optimize`. (Recovery roll-back now publishes its restored version, so a recovered graph always satisfies `manifest == Lance HEAD` going in; there is no leftover drift for `optimize` to interpret.)
- Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8).
- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed, skipped }]`.
- **Blob tables are skipped.** A table that declares any `Blob` property is not compacted: it is reported with `skipped: Some(BlobColumnsUnsupportedByLance)` (and logged via `tracing::warn`) instead of compacted, and the rest of the sweep proceeds normally. The current Lance `compact_files` mis-decodes blob-v2 columns under its forced `BlobHandling::AllBinary` read; **reads and writes are unaffected** — only compaction is. This is gated by `LANCE_SUPPORTS_BLOB_COMPACTION` (`db/omnigraph/optimize.rs`) and removed when the upstream Lance fix lands (see [docs/dev/lance.md](../dev/lance.md)). Consequence: fragment count and deleted-row space on blob tables are not reclaimed until then; query results are never affected.

View file

@ -94,7 +94,7 @@ flowchart TB
- **`nodes/`** and **`edges/`** are sibling directories holding one Lance dataset per declared type. Names are `fnv1a64-hex` of the type name to keep paths fixed-length and case-safe.
- **`_graph_commits.lance`** is an L2 dataset that records the graph-level commit DAG, with a paired `_graph_commit_actors.lance` for the actor map. (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; the v2→v3 migration sweeps their stale `__run__*` branches, and the dataset bytes are reclaimed once `delete_prefix` lands.)
- **`_graph_commit_recoveries.lance`** — one row per recovery sweep action. Joined to `_graph_commits.lance` by `graph_commit_id`; the linked commit row carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join. See `crates/omnigraph/src/db/recovery_audit.rs`.
- **`__recovery/{ulid}.json`** — transient sidecar files written by the four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) before Phase B begins, deleted after Phase C succeeds. A sidecar persisting after process exit means the writer crashed in the Phase B → Phase C window; the next `Omnigraph::open` recovery sweep processes it. Steady-state directory is empty. See `crates/omnigraph/src/db/manifest/recovery.rs`.
- **`__recovery/{ulid}.json`** — transient sidecar files written by the five migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`, `optimize_all_tables`) before Phase B begins, deleted after Phase C succeeds. A sidecar persisting after process exit means the writer crashed in the Phase B → Phase C window; the next `Omnigraph::open` recovery sweep processes it. Steady-state directory is empty. See `crates/omnigraph/src/db/manifest/recovery.rs`.
- **`_refs/branches/{name}.json`** is graph-level branch metadata — pointers from a branch name to the manifest version it heads.
- **Inside each Lance dataset** (orange): the standard Lance directory layout. `_versions/{n}.manifest` records every commit; `data/` holds the actual Arrow fragments; `_indices/{uuid}/` holds index segments with their own `fragment_bitmap` for partial coverage; `_refs/` holds Lance-native per-dataset branches and tags.