Use refresh_coordinator_only to avoid racing branch_merge's sidecar

The previous fix used `self.refresh()` to sync the restored coordinator's cache after the swap-restore window. `refresh` runs the `RollForwardOnly` recovery sweep — which, on the merge Err path with a phase-B failure (sidecar written, per-table HEAD advanced, manifest publish skipped), would observe the merge's own in-flight sidecar and close it here. That violates the contract documented on `Omnigraph::refresh`: > Engine-internal callers that already hold an in-flight sidecar > (e.g. `schema_apply` mid-write) MUST use `refresh_coordinator_only` > to avoid the recovery sweep racing their own sidecar. The post-restore step's purpose is to sync the coord cache with disk, not to run recovery, so `refresh_coordinator_only` is the right primitive on both paths. CI surfaced this via `branch_merge_phase_b_failure_recovered_on_next_open` in `crates/omnigraph/tests/failpoints.rs`, which asserts the sidecar persists after the failpoint fires. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
2026-06-09 01:35:18 +02:00 · 2026-05-11 21:09:44 +00:00 · 2026-05-11 21:09:44 +00:00 · b7353e1dc7
commit b7353e1dc7
parent e91d5615c6
1 changed files with 22 additions and 10 deletions
--- a/crates/omnigraph/src/exec/merge.rs
+++ b/crates/omnigraph/src/exec/merge.rs
@ -1109,23 +1109,35 @@ impl Omnigraph {
            .await;
        self.restore_coordinator(previous).await;

-        // Refresh the restored coordinator on both Ok and Err paths.
-        // During the swap window above, `self.coordinator` was a freshly
-        // opened coord for the merge target; any concurrent writer on
-        // that target (e.g. a `/change` on `main` racing a
-        // `merge into=main`) publishes against the swapped coord and
-        // never touches the original. Without this refresh, the
-        // restored coord's cached manifest snapshot would diverge from
-        // disk and seed a stale `expected_versions` into the next op's
-        // publisher CAS fence — a non-retryable
+        // Sync the restored coordinator's cached manifest snapshot with
+        // disk on both Ok and Err paths. During the swap window above,
+        // `self.coordinator` was a freshly opened coord for the merge
+        // target; any concurrent writer on that target (e.g. a `/change`
+        // on `main` racing a `merge into=main`) publishes against the
+        // swapped coord and never touches the original. Without this
+        // sync, the restored coord's cached manifest snapshot would
+        // diverge from disk and seed a stale `expected_versions` into
+        // the next op's publisher CAS fence — a non-retryable
        // `ExpectedVersionMismatch` for a user with no concurrent
        // writer of their own. Pinned by
        // `concurrent_merge_clean_409_does_not_poison_next_change_on_target`
        // in `crates/omnigraph-server/tests/server.rs` and by the
        // `[d:merge×change:into-target]` cell of
        // `concurrent_branch_ops_morphological_matrix`.
+        //
+        // Use `refresh_coordinator_only` rather than `refresh` so the
+        // recovery sweep doesn't race the merge's own in-flight
+        // sidecar: when the merge body returns Err between Phase B
+        // (per-table `commit_staged` + sidecar write) and Phase C
+        // (manifest publish + sidecar delete), the sidecar is still on
+        // disk. `refresh`'s `RollForwardOnly` sweep would observe it
+        // and close it here — masking the failure from the next
+        // `Omnigraph::open` sweep and from the audit row that the open
+        // sweep emits. Pinned by
+        // `branch_merge_phase_b_failure_recovered_on_next_open` in
+        // `crates/omnigraph/tests/failpoints.rs`.
        if previous_branch == target_branch {
-            self.refresh().await?;
+            self.refresh_coordinator_only().await?;
        }

        merge_result