mirror of https://github.com/ModernRelay/omnigraph.git, synced 2026-06-21 02:28:07 +02:00
recovery: refresh-time roll-forward closes the in-process residual
Adds RecoveryMode { Full, RollForwardOnly } and wires Omnigraph::refresh
to invoke roll-forward-only recovery. This closes the documented
"long-running server between Phase B failure and process restart"
residual without requiring a restart, for the common case (mutation /
load finalize → publisher failure).
Why roll-forward only and not full sweep:
* Roll-forward is safe under concurrency (publisher uses row-level
CAS).
* Roll-back uses Dataset::restore, which "wins" against concurrent
Append/Update/Delete/CreateIndex/Merge per check_restore_txn —
silently orphaning the concurrent writer's commit (pinned by
tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning).
Sidecars that classify as RollBack-eligible are LEFT ON DISK for the
next ReadWrite open, where no concurrent writers exist and full
restore is safe.
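
The orphaning hazard described above can be illustrated with a toy model. This is a minimal sketch, not Lance's API: `ToyDataset`, `append`, `restore`, and `rows_at` are hypothetical names, and the model assumes only what this message states, namely that a restore commits a new HEAD equal to the restored snapshot and is accepted even when another writer has committed in between.

```rust
/// Toy versioned dataset (hypothetical, not Lance): every version is a
/// full snapshot of row ids, and version 0 is the empty dataset.
pub struct ToyDataset {
    versions: Vec<Vec<u32>>,
}

impl ToyDataset {
    pub fn new() -> Self {
        Self { versions: vec![Vec::new()] }
    }

    /// Index of the current HEAD version.
    pub fn head_version(&self) -> usize {
        self.versions.len() - 1
    }

    /// Rows visible at a given version.
    pub fn rows_at(&self, version: usize) -> &[u32] {
        &self.versions[version]
    }

    /// Commit a new version: HEAD's rows plus `rows`.
    pub fn append(&mut self, rows: &[u32]) -> usize {
        let mut next = self.rows_at(self.head_version()).to_vec();
        next.extend_from_slice(rows);
        self.versions.push(next);
        self.head_version()
    }

    /// Commit a new version whose contents equal an older snapshot.
    /// Assumption: like the behavior described in this message, the
    /// restore "wins" regardless of commits made since `version`.
    pub fn restore(&mut self, version: usize) -> usize {
        let snapshot = self.versions[version].clone();
        self.versions.push(snapshot);
        self.head_version()
    }
}
```

In this model, if a recovery pass decides to restore to a known-good version and a concurrent writer appends before the restore commits, the appended rows remain in history but are absent from the new HEAD. That is the reason the rollback path waits for an open with no concurrent writers.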
Implementation:
* recovery.rs: RecoveryMode enum; recover_manifest_drift takes mode;
process_sidecar branches on mode for Abort and RollBack — both
defer to next ReadWrite open under RollForwardOnly. RollForward
behavior unchanged.
* omnigraph.rs: Omnigraph::refresh promoted to pub; calls
recover_manifest_drift in RollForwardOnly mode after coordinator
refresh. Steady-state cost: one list_dir of __recovery (early
return on empty). Adds refresh_coordinator_only — pub(crate) —
for engine-internal callers that hold an in-flight sidecar (the
schema_apply lease-check + lock-release paths). Without this split,
refresh would race the in-flight sidecar.
* schema_apply.rs: switch all 6 internal db.refresh() call sites to
refresh_coordinator_only().
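
The split described in the implementation notes can be sketched as follows. This is an illustrative model under stated assumptions, not the code in recovery.rs or omnigraph.rs: `RecoveryMode`, `refresh`, and `refresh_coordinator_only` are names from this message, while `ToyDb`, `Class`, the counters, and the in-memory `sidecars` vector (standing in for the `__recovery` directory) are hypothetical.

```rust
/// Which actions a recovery sweep may take (variant names from this message).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecoveryMode {
    Full,
    RollForwardOnly,
}

/// Hypothetical sidecar classification; the real classifier is not modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Class {
    RollForward,
    RollBack,
    Abort,
}

/// Toy stand-in for the database handle.
#[derive(Debug, Default)]
pub struct ToyDb {
    pub coordinator_refreshes: u32,
    /// Stand-in for the sidecar files under `__recovery`.
    pub sidecars: Vec<Class>,
    pub rolled_forward: u32,
    pub rolled_back: u32,
    pub aborted: u32,
}

impl ToyDb {
    /// Coordinator refresh only: never touches sidecars, so a caller that
    /// holds an in-flight sidecar can use it without racing recovery.
    pub fn refresh_coordinator_only(&mut self) {
        self.coordinator_refreshes += 1;
    }

    /// Public refresh: coordinator refresh, then roll-forward-only recovery.
    pub fn refresh(&mut self) {
        self.refresh_coordinator_only();
        self.recover(RecoveryMode::RollForwardOnly);
    }

    /// Process every sidecar according to `mode`.
    pub fn recover(&mut self, mode: RecoveryMode) {
        if self.sidecars.is_empty() {
            return; // steady state: nothing to recover
        }
        let mut deferred = Vec::new();
        for class in std::mem::take(&mut self.sidecars) {
            match (mode, class) {
                // Roll-forward behaves the same in both modes.
                (_, Class::RollForward) => self.rolled_forward += 1,
                // Under RollForwardOnly, RollBack and Abort stay on disk.
                (RecoveryMode::RollForwardOnly, _) => deferred.push(class),
                (RecoveryMode::Full, Class::RollBack) => self.rolled_back += 1,
                (RecoveryMode::Full, Class::Abort) => self.aborted += 1,
            }
        }
        self.sidecars = deferred;
    }
}
```

With one sidecar of each class present, `refresh()` in this model rolls one forward and leaves the other two in place; a later `recover(RecoveryMode::Full)`, standing in for the next ReadWrite open, drains them.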
Tests:
* refresh_runs_roll_forward_recovery_in_process — trigger
mutation.post_finalize_pre_publisher; without restart, call
db.refresh(); assert sidecar deleted, drifted row visible,
subsequent mutation succeeds.
* refresh_defers_rollback_eligible_sidecar_to_next_open — synthesize
a Mutation sidecar with bogus expected (UnexpectedAtP1 → RollBack);
refresh leaves it on disk and Lance HEAD unchanged; drop and reopen
runs the full sweep which advances HEAD via restore.
Docs:
* docs/runs.md "Long-running servers" caveat updated to describe the
refresh-time roll-forward path and the rollback-defer behavior.
* docs/invariants.md §VI.23 status line updated to reflect in-process
closure of the common case.
Workspace tests pass with --features failpoints; no regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 8c6506f5cd
commit aaa031e834
7 changed files with 361 additions and 29 deletions
19
docs/runs.md
@@ -193,12 +193,19 @@ Triggers for the residual: transient Lance write errors during finalize
 (object-store retry budget exhaustion, disk full); persistent publisher
 contention exceeding `PUBLISHER_RETRY_BUDGET = 5` retries.
 
-**Long-running servers**: between Phase B failure and the next
-`Omnigraph::open` (typically a server restart), subsequent writers on
-the affected tables surface
-`ManifestConflictDetails::ExpectedVersionMismatch`. Continuous
-in-process recovery (no restart required) is the goal of a future
-background reconciler.
+**Long-running servers**: `Omnigraph::refresh` runs roll-forward-only
+recovery in-process — the common Phase B → Phase C residual closes
+without a restart. The next mutation on the same handle (after refresh)
+no longer surfaces `ExpectedVersionMismatch` for the failed table.
+Sidecars that would require a `Dataset::restore` (mixed / unexpected
+state) are deferred to the next `OpenMode::ReadWrite` open: restore is
+unsafe under concurrency because Lance's `check_restore_txn` accepts
+the restore against in-flight Append/Update/Delete commits and
+silently orphans them (pinned by
+`tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning`).
+Continuous in-process recovery for the rollback path is the goal of a
+future background reconciler with per-(table, branch) writer-queue
+acquisition.
 
 The publisher-CAS contract is unchanged: a *concurrent writer* that
 advances any of our touched tables between snapshot capture and