mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-09 01:35:18 +02:00
Use refresh_coordinator_only to avoid racing branch_merge's sidecar
The previous fix used `self.refresh()` to sync the restored coordinator's cache after the swap-restore window. `refresh` runs the `RollForwardOnly` recovery sweep — which, on the merge Err path with a phase-B failure (sidecar written, per-table HEAD advanced, manifest publish skipped), would observe the merge's own in-flight sidecar and close it here. That violates the contract documented on `Omnigraph::refresh`: > Engine-internal callers that already hold an in-flight sidecar > (e.g. `schema_apply` mid-write) MUST use `refresh_coordinator_only` > to avoid the recovery sweep racing their own sidecar. The post-restore step's purpose is to sync the coord cache with disk, not to run recovery, so `refresh_coordinator_only` is the right primitive on both paths. CI surfaced this via `branch_merge_phase_b_failure_recovered_on_next_open` in `crates/omnigraph/tests/failpoints.rs`, which asserts the sidecar persists after the failpoint fires. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
This commit is contained in:
parent
e91d5615c6
commit
b7353e1dc7
1 changed files with 22 additions and 10 deletions
|
|
@ -1109,23 +1109,35 @@ impl Omnigraph {
|
|||
.await;
|
||||
self.restore_coordinator(previous).await;
|
||||
|
||||
// Refresh the restored coordinator on both Ok and Err paths.
|
||||
// During the swap window above, `self.coordinator` was a freshly
|
||||
// opened coord for the merge target; any concurrent writer on
|
||||
// that target (e.g. a `/change` on `main` racing a
|
||||
// `merge into=main`) publishes against the swapped coord and
|
||||
// never touches the original. Without this refresh, the
|
||||
// restored coord's cached manifest snapshot would diverge from
|
||||
// disk and seed a stale `expected_versions` into the next op's
|
||||
// publisher CAS fence — a non-retryable
|
||||
// Sync the restored coordinator's cached manifest snapshot with
|
||||
// disk on both Ok and Err paths. During the swap window above,
|
||||
// `self.coordinator` was a freshly opened coord for the merge
|
||||
// target; any concurrent writer on that target (e.g. a `/change`
|
||||
// on `main` racing a `merge into=main`) publishes against the
|
||||
// swapped coord and never touches the original. Without this
|
||||
// sync, the restored coord's cached manifest snapshot would
|
||||
// diverge from disk and seed a stale `expected_versions` into
|
||||
// the next op's publisher CAS fence — a non-retryable
|
||||
// `ExpectedVersionMismatch` for a user with no concurrent
|
||||
// writer of their own. Pinned by
|
||||
// `concurrent_merge_clean_409_does_not_poison_next_change_on_target`
|
||||
// in `crates/omnigraph-server/tests/server.rs` and by the
|
||||
// `[d:merge×change:into-target]` cell of
|
||||
// `concurrent_branch_ops_morphological_matrix`.
|
||||
//
|
||||
// Use `refresh_coordinator_only` rather than `refresh` so the
|
||||
// recovery sweep doesn't race the merge's own in-flight
|
||||
// sidecar: when the merge body returns Err between Phase B
|
||||
// (per-table `commit_staged` + sidecar write) and Phase C
|
||||
// (manifest publish + sidecar delete), the sidecar is still on
|
||||
// disk. `refresh`'s `RollForwardOnly` sweep would observe it
|
||||
// and close it here — masking the failure from the next
|
||||
// `Omnigraph::open` sweep and from the audit row that the open
|
||||
// sweep emits. Pinned by
|
||||
// `branch_merge_phase_b_failure_recovered_on_next_open` in
|
||||
// `crates/omnigraph/tests/failpoints.rs`.
|
||||
if previous_branch == target_branch {
|
||||
self.refresh().await?;
|
||||
self.refresh_coordinator_only().await?;
|
||||
}
|
||||
|
||||
merge_result
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue