mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-12 01:45:14 +02:00
fix: optimize publishes compaction; recovery roll-back converges manifest (#141)
* test(optimize): cover manifest publish + HEAD-drift reconcile Red against the pre-fix optimize, which ran compact_files without publishing the compacted version to __manifest: - maintenance: optimize must publish so the manifest table_version tracks the compacted Lance HEAD and a later schema apply succeeds; and must reconcile a pre-existing manifest-behind-HEAD drift (forged via raw Lance compaction) so strict writes commit again. - end_to_end + composite_flow: post-optimize query / strict update / reopen in the full lifecycle (the canonical flow previously omitted post-optimize writes as a documented "known limitation"). - failpoints: a crash between compaction and the manifest publish rolls forward on next open. * fix(optimize): publish compaction to manifest and reconcile HEAD drift optimize ran Lance compact_files without publishing the new version to __manifest, so the manifest table_version lagged the Lance HEAD: reads stayed pinned to the pre-compaction version, and the next schema apply or strict update/delete failed its HEAD-vs-manifest precondition with "stale view ... refresh and retry" (open-time recovery rollback inflated the gap on retry). optimize now publishes each compacted table's version under the per-(table, main) write queue, guarded by a manifest CAS and a SidecarKind::Optimize recovery sidecar (loose-match; roll-forward is safe because compaction is content-preserving). When a table has nothing left to compact but its Lance HEAD is already ahead of the manifest pin (pre-fix drift, or a recovery restore commit), optimize reconciles the manifest forward to HEAD (metadata-only, no sidecar). Caches and the CSR/CSC graph index are invalidated after a publish. Docs updated (maintenance, storage, branches-commits, writes, testing). * test(recovery): rollback convergence + optimize-defer regressions Red against the current code, landed before the fix: - recovery: after the open-time sweep rolls a sidecar back, the manifest must track Lance HEAD (no residual drift) so a follow-up schema apply succeeds — the original "+1 per retry" loop. Today roll-back restores without publishing, so the manifest lags HEAD and the apply fails its HEAD-vs-manifest precondition. - maintenance: optimize must refuse while a recovery sidecar is pending — operating on an unrecovered graph could publish a partial write the sweep would roll back. Also removes optimize_reconciles_preexisting_manifest_head_drift: the ad-hoc drift reconcile it covered is replaced by recovery-side convergence. * fix(recovery): converge manifest on roll-back; optimize defers on pending recovery Root of PR #141's review findings and the original "+1 per retry" loop: a Lance HEAD ahead of the manifest was ambiguous (benign content-preserving drift vs. a partial write a sidecar will roll back), and optimize's reconcile guessed it benign. Close the class instead of guessing: - Recovery roll-back now PUBLISHES the restored version (via a push_table_update_at_head helper shared with roll-forward), so the manifest tracks the Lance HEAD after recovery — symmetric with roll-forward. This fixes the +1 loop (after one roll-back the retry's HEAD-vs-manifest precondition passes) and removes the only remaining source of orphaned drift. The audit still records the logical rolled-back-to version; the manifest is published at the restore commit (identical content). - optimize drops the ad-hoc drift reconcile and instead REFUSES when a __recovery sidecar is pending, so it only ever operates on a recovered graph (manifest == HEAD); its compaction publish can no longer commit a partial write. With the reconcile gone, the blob-skip-vs-reconcile gap is moot. Updates the rollback recovery-test helper (manifest == HEAD after roll-back), the failpoints assertions, and the user/dev docs. * test(recovery): fix rollback assertion for manifest convergence The roll-back-publishes change makes the manifest version advance after a SchemaApply roll-back (to the old-schema content), so the schema_apply_without_schema_staging_rolls_back_on_next_open assertion must be `version > pre`, not `version == pre`. This update was dropped during the commit churn and surfaced as a CI Test Workspace failure; the old-schema-preserved intent stays covered by count_rows + _schema.pg + the RolledBack convergence invariant.
This commit is contained in:
parent
4a66d6e071
commit
e62d9166fb
15 changed files with 816 additions and 137 deletions
|
|
@ -34,10 +34,10 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
|
|||
| `s3_storage.rs` | S3-backed graph (skipped unless `OMNIGRAPH_S3_TEST_BUCKET` is set) |
|
||||
| `lance_version_columns.rs` | Per-row `_row_last_updated_at_version` behavior |
|
||||
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
|
||||
| `maintenance.rs` | `optimize` (compaction) + `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation |
|
||||
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the four per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`). |
|
||||
| `maintenance.rs` | `optimize` (compaction) + `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes the compacted version so the manifest tracks the Lance HEAD and a subsequent schema apply succeeds (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), and reconciles a pre-existing manifest-behind-HEAD drift forged via raw Lance compaction (`optimize_reconciles_preexisting_manifest_head_drift`) |
|
||||
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`). |
|
||||
| `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path |
|
||||
| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories). |
|
||||
| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). |
|
||||
|
||||
## Fixtures
|
||||
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue