From b7353e1dc70ca0f3f9823e70ea1eebd216c0a6b2 Mon Sep 17 00:00:00 2001 From: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Date: Mon, 11 May 2026 21:09:44 +0000 Subject: [PATCH] Use refresh_coordinator_only to avoid racing branch_merge's sidecar MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The previous fix used `self.refresh()` to sync the restored coordinator's cache after the swap-restore window. `refresh` runs the `RollForwardOnly` recovery sweep — which, on the merge Err path with a phase-B failure (sidecar written, per-table HEAD advanced, manifest publish skipped), would observe the merge's own in-flight sidecar and close it here. That violates the contract documented on `Omnigraph::refresh`: > Engine-internal callers that already hold an in-flight sidecar > (e.g. `schema_apply` mid-write) MUST use `refresh_coordinator_only` > to avoid the recovery sweep racing their own sidecar. The post-restore step's purpose is to sync the coord cache with disk, not to run recovery, so `refresh_coordinator_only` is the right primitive on both paths. CI surfaced this via `branch_merge_phase_b_failure_recovered_on_next_open` in `crates/omnigraph/tests/failpoints.rs`, which asserts the sidecar persists after the failpoint fires. Co-Authored-By: Ragnor Comerford --- crates/omnigraph/src/exec/merge.rs | 32 ++++++++++++++++++++---------- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/crates/omnigraph/src/exec/merge.rs b/crates/omnigraph/src/exec/merge.rs index d17afdf..9581fcb 100644 --- a/crates/omnigraph/src/exec/merge.rs +++ b/crates/omnigraph/src/exec/merge.rs @@ -1109,23 +1109,35 @@ impl Omnigraph { .await; self.restore_coordinator(previous).await; - // Refresh the restored coordinator on both Ok and Err paths. - // During the swap window above, `self.coordinator` was a freshly - // opened coord for the merge target; any concurrent writer on - // that target (e.g. a `/change` on `main` racing a - // `merge into=main`) publishes against the swapped coord and - // never touches the original. Without this refresh, the - // restored coord's cached manifest snapshot would diverge from - // disk and seed a stale `expected_versions` into the next op's - // publisher CAS fence — a non-retryable + // Sync the restored coordinator's cached manifest snapshot with + // disk on both Ok and Err paths. During the swap window above, + // `self.coordinator` was a freshly opened coord for the merge + // target; any concurrent writer on that target (e.g. a `/change` + // on `main` racing a `merge into=main`) publishes against the + // swapped coord and never touches the original. Without this + // sync, the restored coord's cached manifest snapshot would + // diverge from disk and seed a stale `expected_versions` into + // the next op's publisher CAS fence — a non-retryable // `ExpectedVersionMismatch` for a user with no concurrent // writer of their own. Pinned by // `concurrent_merge_clean_409_does_not_poison_next_change_on_target` // in `crates/omnigraph-server/tests/server.rs` and by the // `[d:merge×change:into-target]` cell of // `concurrent_branch_ops_morphological_matrix`. + // + // Use `refresh_coordinator_only` rather than `refresh` so the + // recovery sweep doesn't race the merge's own in-flight + // sidecar: when the merge body returns Err between Phase B + // (per-table `commit_staged` + sidecar write) and Phase C + // (manifest publish + sidecar delete), the sidecar is still on + // disk. `refresh`'s `RollForwardOnly` sweep would observe it + // and close it here — masking the failure from the next + // `Omnigraph::open` sweep and from the audit row that the open + // sweep emits. Pinned by + // `branch_merge_phase_b_failure_recovered_on_next_open` in + // `crates/omnigraph/tests/failpoints.rs`. if previous_branch == target_branch { - self.refresh().await?; + self.refresh_coordinator_only().await?; } merge_result