staging: re-capture per-branch snapshot under queue (fixes cross-branch fail)

The bdd6440 commit re-captured expected_versions from `db.snapshot()` (bound-branch view). That broke any mutation on a non-bound branch: when the engine handle is bound to main but the mutation targets feature, the bound-branch snapshot returns main's pin for each table, not feature's. The publisher commits to feature, reads feature's manifest entry, sees a different version → 409 even though no concurrent writer existed. Reproduced by `branch_merge_conflict_response_includes_structured_conflicts` which mutates main then mutates feature on the same Omnigraph handle — the second mutation failed with "expected V6, current V5". Switch the re-capture to `db.snapshot_for_branch(branch).await` so the per-branch entries are resolved correctly. This is one fresh manifest read per mutation (the same I/O PR 1b had pre-Step-D), but it is now required for cross-branch correctness — Step D's "in-memory under single-coordinator invariant" rationale was only sound for single-branch workloads. Single-table same-branch mutations could still skip this read (queue exclusivity makes the publisher CAS a no-op), but the conditional adds complexity for marginal gain. Left as a follow-up perf optimization tracked in `.context/bench-results/comparison.md`. Bench numbers updated: - single-actor 1x1: 15.2 ops/s vs baseline 12.3 (+24%) - disjoint 8x8: 7.12 ops/s vs baseline 6.24 (+14%) - same-key 8x1: 77% errors via the strict ensure_expected_version check upstream of commit_all; same-key concurrent-write fix is a separate follow-up. All 102 lib + 39 server + 24 runs + 30 branching + 20 traversal + 9 validators tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-09 01:35:18 +02:00 · 2026-05-07 19:44:14 +02:00 · 2026-05-07 19:44:14 +02:00 · b93a130b40
commit b93a130b40
parent bdd6440c83
1 changed files with 19 additions and 38 deletions
--- a/crates/omnigraph/src/exec/staging.rs
+++ b/crates/omnigraph/src/exec/staging.rs
@ -440,41 +440,29 @@ impl StagedMutation {
        // expected_versions was captured during stage_all (Phase A,
        // BEFORE acquire_many). If a cross-tenant writer published our
        // table between Phase A and queue acquisition, those captured
-        // pins are stale. We re-read the in-memory snapshot under the
+        // pins are stale. We re-read the per-branch snapshot under the
        // queue and refresh expected_versions; the publisher's CAS
        // becomes a correct no-op for queued tables.
        //
-        // Why in-memory is safe: under MR-686's single-process scope
-        // all tenants share one `Arc<Omnigraph>` -> one coordinator;
-        // publishes update the shared coordinator BEFORE releasing
-        // queue guards (see `commit_all` -> caller's publisher ->
-        // caller drops guards). So any tenant T2 acquiring queue
-        // keys *after* tenant T1 sees a fresh in-memory view of T1's
-        // commits. Multi-coordinator deployments (§VI.27 aspirational)
-        // would require a fresh manifest read here; that trade-off is
-        // documented in §VI's "Explicit non-commitments" subsection.
+        // Why per-branch (and not the bound-branch `db.snapshot()`):
+        // when the caller mutates a branch other than the engine's
+        // bound branch (e.g., feature-branch ingest from a server
+        // handle bound to main), `db.snapshot()` returns the bound
+        // branch's view of each table — which is the wrong pin for
+        // the publisher's CAS on a different branch. Using
+        // `snapshot_for_branch(branch)` resolves the per-branch
+        // entries correctly. The cost is one fresh manifest read per
+        // mutation; PR 1b's regression came from this same read, but
+        // that read is now strictly necessary for cross-branch
+        // correctness. Single-table same-branch mutations could still
+        // skip this read (queue exclusivity makes the publisher CAS a
+        // no-op), but the conditional adds complexity for marginal
+        // gain — left as a follow-up perf optimization.
        //
-        // For mutations whose semantics depend on read-then-write
-        // ordering against committed state (the §VI.36 SERIALIZABLE
-        // opt-in is the future seam), the bench's simple
-        // append/upsert pattern doesn't tickle that: Lance rebases
-        // a stage_append/stage_merge_insert onto the new committed
-        // version at commit_staged time and the new rows land alongside
-        // whatever the pre-queue writer added. That is correct SI
-        // semantics. Predicate-locked SERIALIZABLE writes will need
-        // an additional revalidation step here.
-        //
-        // Cost: one in-memory snapshot read (no I/O) + a single update
-        // per touched table to `expected_versions`. Replaces PR 1b's
-        // fresh `snapshot_for_branch(branch)` per mutation, closing
-        // the -17%/-30% PR 1b regression.
-        //
-        // SAFETY: relies on (1) the per-(table, branch) WriteQueueManager
-        // using exclusive `tokio::sync::Mutex<()>` (not `RwLock`), and
-        // (2) the single-coordinator invariant (one Omnigraph engine
-        // per process). Migrating either premise reintroduces the
-        // pre-queue drift class.
-        let snapshot = db.snapshot().await;
+        // Multi-coordinator deployments (§VI.27 aspirational) get
+        // genuine cross-process drift detection from this read for
+        // free.
+        let snapshot = db.snapshot_for_branch(branch).await?;
        for entry in staged.iter_mut() {
            let current = snapshot
                .entry(&entry.table_key)
@ -489,12 +477,6 @@ impl StagedMutation {
            expected_versions.insert(entry.table_key.clone(), current);
        }
        for (table_key, _update) in inline_committed.iter() {
-            // Inline-committed tables (delete-only path) had Lance HEAD
-            // advanced inside `delete_where` already. The post-commit
-            // pin is what landed in the manifest after the inline
-            // commit; refresh `expected_versions` to whatever the
-            // shared coordinator currently shows for this table so the
-            // publisher's CAS is internally consistent.
            if let Some(current) = snapshot.entry(table_key).map(|e| e.table_version) {
                expected_versions.insert(table_key.clone(), current);
            } else {
@ -504,7 +486,6 @@ impl StagedMutation {
                )));
            }
        }
-        let _ = branch;

        // Sidecar protocol: build the per-table pin list and write the
        // sidecar BEFORE any Lance commit_staged runs, so a crash