engine: drop swap-restore in branch_create_from; operate on local coord

Closes the swap-restore race in `branch_create_from_impl` by simply not
touching `self.coordinator` at all. Open the source-branch coordinator
locally, call `branch_create` on it, drop it. The new branch is
durable on disk via the manifest write that `GraphCoordinator::branch_create`
issues on its own commit graph; subsequent reads of any coord will see
it after their normal manifest refresh.

Pre-fix: `branch_create_from_impl` ran swap → operate → restore as
three separate `coordinator.write().await` acquisitions. Under `&self`
concurrency, two callers with distinct source branches could interleave
their swaps, leaving each caller's "operate" step running against the
other's swapped coordinator and forking the new branch off the wrong
HEAD. Pinned by `concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator`
(previous commit) which deterministically reproduced the race with
8/8 forks landing on the wrong parent.

Why correct by design (AGENTS.md rule 9): closing the bug class
"non-atomic three-step coordinator manipulation under &self callers"
by removing the manipulation entirely. There's no scratch-space race
to lose because there's no scratch space.

Note: `branch_merge_impl` at `crates/omnigraph/src/exec/merge.rs:1085-1100`
keeps the same swap-restore pattern. Its inner `branch_merge_on_current_target`
calls `self.snapshot()` and `self.ensure_commit_graph_initialized()` which
acquire the coord lock independently, so the simple "operate on local
coord" refactor doesn't compose without a deeper interface change. The
per-(table, branch) writer queue inside the merge body
(`crates/omnigraph/src/exec/merge.rs:1224`) bounds the damage in
practice; a deterministic regression for concurrent merges is tracked
under Block 3.1 of the plan.

`swap_coordinator_for_branch` and `restore_coordinator` remain
crate-internal for now (still used by `branch_merge_impl`); a follow-up
can remove them if the merge path is similarly refactored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Ragnor Comerford 2026-05-08 16:48:17 +02:00
parent 3b33e9ac56
commit 4ffbf6ec61
No known key found for this signature in database

View file

@ -878,10 +878,24 @@ impl Omnigraph {
ensure_public_branch_ref(name, "branch_create_from")?;
}
let branch = normalize_branch_name(&branch_name)?;
let previous = self.swap_coordinator_for_branch(branch.as_deref()).await?;
let result = self.coordinator.write().await.branch_create(name).await;
self.restore_coordinator(previous).await;
result
// Operate on a freshly-opened source coordinator that's owned locally
// — never touch `self.coordinator`. The pre-fix implementation used
// `swap_coordinator_for_branch` + operate + `restore_coordinator` as
// three separate `coordinator.write().await` acquisitions; under
// `&self` concurrency, a second `branch_create_from` could swap
// self.coordinator between this caller's swap and operate steps,
// making the operate run against the wrong source branch and
// forking off the wrong HEAD. Pinned by
// `concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator`
// in `crates/omnigraph-server/tests/server.rs`.
//
// `branch_create` mutates only the local coord's commit-graph cache;
// the manifest write is durable on disk regardless of which
// coord-handle issued it. Discarding `source_coord` after the call
// is the right shape — the new branch is reachable from any
// subsequent open of any coord.
let mut source_coord = self.open_coordinator_for_branch(branch.as_deref()).await?;
source_coord.branch_create(name).await
}
pub async fn branch_list(&self) -> Result<Vec<String>> {