fix(engine): optimize survives a cross-process write race on the same table (#297)

* test(engine): cross-process optimize-vs-write race — RED Two regression tests for the prod bug: a direct `optimize` process racing a served write on the same table fails, because the in-process write queue does not serialize across processes and the data-table optimize path has no retry. - optimize_survives_concurrent_insert_advancing_manifest: a concurrent insert advances the manifest while optimize is paused between compact and publish; optimize's equality-CAS publish then fails "expected X but current Y". - optimize_survives_concurrent_delete_before_compaction: a concurrent delete commits before optimize compacts; Lance rebases the compaction past it cleanly, so optimize again fails the publish CAS (the genuine Lance Rewrite-vs-Rewrite overlap is rarer and shares the internal path's retry). Both fail today with ExpectedVersionMismatch. Adds the `optimize.before_compact` failpoint seam + a wait_for_sidecar helper; serializes the optimize failpoint tests (shared failpoint name). The fix lands next. * fix(engine): optimize survives a cross-process write race on the same table The data-table optimize path trusted the in-process write queue and skipped a retry, so a CLI `optimize` racing a served write (separate processes = separate queues) failed: either the Lance Rewrite lost ("preempted by concurrent Update") or the manifest publish lost the strict equality CAS ("expected X but current Y"). Unify both compaction paths on the internal path's reopen+replan shape, with a two-level retry that matches the two failure points: - Outer loop (reopen+replan): a genuine Lance Rewrite-vs-Update/Delete same- fragment conflict means our compaction did not commit — reopen at the new HEAD and re-plan. Lance rebases the common disjoint case (a concurrent insert/delete on other fragments) for free, so this fires only on real overlap. - Inner loop (Phase C, monotonic publish): the manifest advanced between our compaction and our publish. The compaction is already committed at Lance HEAD N, so we must NOT reopen (that trips the HEAD>manifest drift guard on our own work). Re-read the current manifest version C: if C >= N the manifest already includes our compaction (versions are linear) — no-op; else fast-forward to N. Monotonic, not the strict equality CAS that manufactured the conflict. The Phase-A sidecar is written once and reused across reopen attempts (every Phase-B commit is content-preserving, so recovery rolls the observed HEAD forward or safely rolls the compaction back). The in-process queue is kept — it is now an in-process contention reducer, not the cross-process correctness guard. Shares the COMPACTION_RETRY_BUDGET constant + is_retryable_lance_conflict with the internal path; adds is_retryable_manifest_conflict for the publish loop. No writer_epoch. Turns the prior commit's two race tests green. * docs(rfc-013): two-op-class principle + the found+fixed optimize-vs-write race §6.6 records the maintenance vs logical op-class distinction (maintenance commutes → Lance rebase + reopen/replan + monotonic manifest fast-forward, no writer_epoch; logical → strict cross-process OCC + epoch) and the prod optimize-vs-served-write race that motivated it, now landed. Adds the matching mechanic row to §4.2. * fix(engine): retry must not misclassify optimize's own HEAD drift Review catch on the cross-process optimize fix: the outer retry loop re-ran the `lance_head > manifest` drift guard every iteration. After a partial Phase-B commit (the auto_cleanup strip or compaction commits, then a later op hits a retryable conflict), the reopened attempt saw HEAD ahead of the manifest — from OUR own sidecar-covered work, not an external writer — and deleted the sidecar + returned `skipped_for_drift`, stranding uncovered drift that then needs `repair`. Track `head_advanced` (did one of our Phase-B ops already commit). The drift guard now fires only when `!head_advanced` (genuine pre-existing external drift); once we have advanced HEAD, a reopened HEAD>manifest is our work that the monotonic publish fast-forwards. The no-op early-return likewise publishes prior committed work instead of dropping it when `head_advanced`. Regression test `optimize_retry_does_not_misclassify_own_head_drift` injects one retryable reindex conflict after the compaction commits (new `optimize.inject_ reindex_conflict` seam); red→green verified by negative control (reverting the gate reproduces `skipped_for_drift: Some(DriftNeedsRepair)`). Also de-flake `optimize_survives_concurrent_insert_advancing_manifest`: pause at `before_compact` (not post-compact) so the concurrent insert lands while HEAD== manifest — otherwise it could race optimize's committed-but-unpublished compaction and hit the write-path "HEAD ahead of manifest" guard. * fix(engine): optimize publish converges on retry-budget exhaustion Review catch (greptile): the monotonic Phase-C publish loop returned an error on its final iteration's retryable manifest conflict, even though that conflict can itself mean a concurrent writer published a version that already includes our (content- preserving) compaction — i.e. the postcondition ("the manifest reflects our compaction") is already met. Recovery covered it (no data loss), but the operator saw a spurious error and had to re-run. Restructure the loop to re-read `current` on every retryable conflict and, on budget exhaustion, do a final `current >= state.version` convergence check before surfacing the error — the §6.6 "postcondition is the state, not winning the CAS" principle. Factor the repeated current-version read into `current_manifest_version`.
2026-06-24 02:38:06 +02:00 · 2026-06-22 13:05:28 +02:00 · 2026-06-22 13:05:28 +02:00 · 6d4606a830
commit 6d4606a830
parent 5cfae9acc1
3 changed files with 603 additions and 188 deletions
--- a/docs/dev/rfc-013-write-path-latency.md
+++ b/docs/dev/rfc-013-write-path-latency.md
@ -579,6 +579,7 @@ The cost contract becomes part of `publish`'s documented API:
 | Epoch fence | Monotonic `writer_epoch` in `__manifest`, CAS-claimed at writer init, checked on every publish. Fences a whole zombie *writer* deterministically (no TTL); closes the multi-process exposure and the Lance-MTT TTL-lease gap. | SlateDB `FenceableTransactionalObject` **[U]** |
 | Branch create | Lance `Clone` instead of the per-table fork loop (O(tables)→O(1) sequential). | `iss-691` **[G]** |
 | Branch delete | Run the per-other-branch safety check and the per-table reclaim loops concurrently (`buffer_unordered`); read branch sets from in-memory coordinator state. | **[S]** |
+| Maintenance-class commit (compaction) | Commutative/content-preserving ops do NOT use the logical class's strict OCC: Lance rebases the disjoint case, the app reopens+replans on a real overlap, and the manifest publish is a **monotonic fast-forward** (advance or no-op, never equality CAS) — no `writer_epoch`. The two-op-class rule + the found+fixed optimize-vs-write race: §6.6. | §6.6 **[M]**, **LANDED** |

 ---

@ -823,6 +824,59 @@ This is the standard WAL-replay + leader-lease shape (confirmed against SlateDB'
 finding promotes #6/#7 from "nice correctness work" to the load-bearing guard that
 gates multi-writer topologies — and it is the motivating case for them.**

+## 6.6 The two op classes — and a found+fixed maintenance race (LANDED)
+
+§6.5 is about the **logical** write class. A prod run surfaced the same
+process-boundary flaw in the **maintenance** class: a direct CLI `optimize` racing
+a served write on the same table **failed** — either the Lance `Rewrite` lost
+("preempted by concurrent Update") or the manifest publish lost the strict equality
+CAS ("expected 14 but current 15"). Same root cause as §6.5 (the in-process write
+queue does not serialize across processes), but the right fix is the **opposite** of
+the logical class, because the two classes commute differently:
+
+| class | examples | commutes? | correct commit model |
+|---|---|---|---|
+| **maintenance** | compaction (`Rewrite`), `optimize_indices` | **yes** (content-preserving) | Lance native rebase + app reopen/replan on real overlap + **monotonic manifest fast-forward** — no epoch, no read-set |
+| **logical mutation** | load / mutate / merge / delete | **no** (lost-update, write-skew) | strict cross-process OCC: read-set + write-set CAS under the `writer_epoch` fence (§6.5, #7) |
+
+Applying strict OCC + equality-CAS uniformly is the mistake: **too strong for
+maintenance** (it manufactures a false conflict against a commutative op — this
+bug) and **too weak for logical writes cross-process** (§6.5). The maintenance fix
+needs **no `writer_epoch`** — that is the tell that it is a different class.
+
+**Validated against Lance 7.0.0 source + reproduced [M].** Lance has no compaction
+re-plan retry (the semantic `RetryableCommitConflict` escapes `commit_transaction`'s
+loop at `io/commit.rs:979`; only the OCC manifest race is retried), so the
+application must reopen + re-plan — exactly what the internal-table path already
+did. Notably, Lance **rebases the common case for free**: a concurrent
+insert/update/delete on *other* fragments is disjoint, so the data-table compaction
+commits cleanly and the conflict surfaces only as the manifest fast-forward
+(the genuine `Rewrite`-vs-`Rewrite` overlap is the rarer many-fragment /
+concurrent-compaction case).
+
+**Fix (LANDED).** Both compaction paths now share one reopen+replan shape with a
+two-level retry: an outer loop reopens+replans on a real Lance overlap conflict; an
+inner Phase-C loop makes the manifest publish a **monotonic fast-forward**
+(advance to the compacted version `N`, or no-op when the manifest already moved to
+`≥ N` — being linear, it descends from and includes the compaction), never the
+equality CAS. The `Optimize` recovery sidecar is written once and reused across
+attempts (every commit is content-preserving). The in-process queue is kept as an
+in-process contention reducer, not the cross-process guard. No `writer_epoch`.
+(`db/omnigraph/optimize.rs`; regression tests in `tests/failpoints.rs`:
+`optimize_survives_concurrent_insert_advancing_manifest`,
+`optimize_survives_concurrent_delete_before_compaction`.)
+
+**Relationship to step 5 (the unification).** This is the first correct *instance* of
+the maintenance-class commit model, not a parallel band-aid: when step 5 collapses the
+writers into the single `publish(txn, plan)` authority, it **relocates** this — a
+`TableAction::Rewrite` carries the monotonic-fast-forward + reopen/replan commit model
+into the unified publisher — rather than reinventing it. What step 5 deletes is the
+*location* (the hand-rolled loop in `optimize_one_table`), not the *semantics*; the two
+regression tests above are the contract that must stay green across that refactor. It
+also makes step 5 easier, not harder: it already unified the two compaction paths onto
+one retry shape and drew the op-class line (logical writers keep equality CAS; only
+compaction is monotonic), so there is one fewer pattern for the unification to absorb.
+
 ---

 ## 7. Invariants & deny-list check