fix(recovery): converge roll-forward on concurrent manifest advance (#296)

* refactor(storage): gate test-only TableStore::append_batch behind cfg(test)

The inherent append_batch is used only by in-source recovery test setup, but the non-test lib build (cfg(test) off) cannot see those callers and emitted a dead_code warning. Gating the method #[cfg(test)] silences the false positive and enforces its own doc contract ("no new engine call sites") by construction: engine code physically cannot call a cfg(test) method.

* test(failpoints): harden fault-injection harness + reproduce roll-forward CAS race

Hardens the test infrastructure around the process-global `fail` registry, and adds a deterministic red repro for the open-time recovery sweep's roll-forward CAS race (iss-schema-apply-reopen-recovery-race). The fix lands in the next commit; this commit is intentionally red (rule 12: red→green visible in log).

Harness:
- One `ScopedFailPoint` (engine) gaining `with_callback`; the cluster duplicate is removed and cluster tests reuse the engine type via `omnigraph/failpoints`.
- `#[serial]` on every failpoint test (the registry is process-global, so shared names interfere under parallelism); `serial_test` added to cluster dev-deps.
- `helpers::failpoint::Rendezvous` (park-first / wait-until-reached / release) replaces fixed-`sleep` cross-thread coordination; the three concurrent tests now rendezvous deterministically. The reached flag doubles as a fired-assert.
- Compile-checked `failpoints::names` catalog (engine + cluster); every call site references a const, and `failpoint_names_guard.rs` enforces "no string literal names" by source-walk, so a typo is a build error, not a silent no-fire.

Red repro:
- New `recovery.before_roll_forward_publish` failpoint at the sweep's classify -> publish-CAS window (the only injection point there).
- `open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one pending sidecar; the sweep parked at the failpoint loses its publish CAS to the other and fails the open with `ExpectedVersionMismatch`. FAILS at this commit by design.

* fix(recovery): converge roll-forward when the manifest advances concurrently

The open-time recovery sweep classified a pending sidecar as RolledPastExpected, then published a manifest CAS at the sidecar's pinned expected_version. Under a concurrent writer that advanced the manifest past expected during the classify -> publish window, the CAS failed with ExpectedVersionMismatch and `?`-propagated, failing the whole Omnigraph::open (iss-schema-apply-reopen-recovery-race).

A roll-forward's postcondition is "the manifest reflects the sidecar's committed Lance state", not "this sweep won the CAS" (invariants 7 & 15). On an ExpectedVersionMismatch, re-read the live manifest and check whether the sidecar's intent is already satisfied (every pinned table at a version >= the one we observed and tried to publish; added tables registered; tombstones gone; sound under the heal-first invariant, documented at the check). If satisfied, this is convergence: record the RolledForward audit + delete the sidecar idempotently. If only partway, defer to the next pass. Either way the open no longer fails. Other errors still propagate; a genuine logical conflict resurfaces via the classifier's InvariantViolation.

Turns the red repro from the previous commit green. The roll-BACK twin (iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and still needs a cross-process lease; the known-gap is updated accordingly.

* Address PR review: harden failpoint name guard + dedupe converge audit

Two issues surfaced in PR review of the failpoint hardening + recovery fix:

1. Name guard had a line-split blind spot. It scanned per line, so a call wrapped across lines (`park_first(\n "name",\n)`) put the literal on a different line than the call prefix and bypassed the "no string-literal failpoint names" check, and one such literal (`mutation.delete_node_pre_primary_delete`) had slipped through. Make the guard whitespace/newline-tolerant (skip past the open paren to the first argument token) so wrapping can't hide a literal, and convert the bypassed site to the `names::` const.

2. Convergence path could append a duplicate recovery audit. When a roll-forward publish loses its CAS but the manifest already reached the sidecar's goal, `converge_or_defer_roll_forward` recorded a RolledForward audit unconditionally. Under the heal-first invariant, whoever advanced the manifest already healed this sidecar (audit + delete), so a second row landed in `_graph_commit_recoveries` for one recovery event. Gate the audit+delete on the sidecar still being present: absent => the winner completed it, return success with no duplicate row. The convergence regression test now asserts exactly one audit row.

* docs(dev): remove the schema-apply recovery-flake handoff (fixed by this PR)

The handoff was a transient investigation note for `iss-schema-apply-reopen-recovery-race`, which this PR fixes (the converge helper + the red→green regression). Its rationale now lives durably in the dev-graph issue, the PR/commit history, and invariants.md, so the handoff is obsolete. Drop the doc, its dev-index row, and the dangling reference from the RFC-013 handoff; the doc cross-link check stays green.

* fix(recovery): include added-table registrations in the converge audit

The CAS-loss convergence audit built outcomes only from `sidecar.tables`, omitting the `additional_registrations` that the normal `roll_forward_all` audit includes. For a SchemaApply sidecar with added types, a converge-path audit row would be incomplete versus the normal roll-forward path for the same recovery kind. Mirror the roll-forward outcome construction (append a registration outcome per added table) so both paths emit the same audit shape.
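
The convergence rule described in the fix commits above can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the code in `recovery.rs`: `SidecarIntent`, `LiveManifest`, `Outcome`, and `converge_or_defer` are hypothetical names invented for illustration, and the real helper (`converge_or_defer_roll_forward`) operates on the engine's own sidecar and manifest types.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified view of what a pending sidecar wants published.
struct SidecarIntent {
    /// Table key -> version the sweep observed and tried to publish.
    pinned: BTreeMap<String, u64>,
    /// Tables the operation registers for the first time.
    added: BTreeSet<String>,
    /// Tables the operation removes.
    tombstoned: BTreeSet<String>,
}

/// Hypothetical snapshot of the live manifest: table key -> current version.
type LiveManifest = BTreeMap<String, u64>;

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    /// Goal reached and the sidecar is still present: record the audit, delete it.
    ConvergedAuditAndDelete,
    /// Goal reached and the sidecar is gone: the CAS winner already healed it,
    /// so nothing is recorded (no duplicate audit row).
    ConvergedAlreadyHealed,
    /// Manifest is only partway: leave the sidecar for the next pass.
    Deferred,
}

/// True when the live manifest already reflects everything the sidecar intended.
fn intent_satisfied(intent: &SidecarIntent, live: &LiveManifest) -> bool {
    let pins_ok = intent
        .pinned
        .iter()
        .all(|(table, want)| live.get(table).map_or(false, |have| have >= want));
    let added_ok = intent.added.iter().all(|t| live.contains_key(t));
    let tombstones_ok = intent.tombstoned.iter().all(|t| !live.contains_key(t));
    pins_ok && added_ok && tombstones_ok
}

/// Called only after the publish CAS failed with a version mismatch.
/// `live` must be a fresh re-read of the manifest, not the stale view.
fn converge_or_defer(
    intent: &SidecarIntent,
    live: &LiveManifest,
    sidecar_present: bool,
) -> Outcome {
    if !intent_satisfied(intent, live) {
        Outcome::Deferred
    } else if sidecar_present {
        Outcome::ConvergedAuditAndDelete
    } else {
        Outcome::ConvergedAlreadyHealed
    }
}
```

None of the three outcomes is an error, which matches the stated postcondition: the open succeeds whether this sweep or a concurrent writer brought the manifest to the sidecar's goal.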
2026-06-27 02:39:38 +02:00 · 2026-06-24 19:03:51 +02:00 · 2026-06-24 19:03:51 +02:00 · 4a5277b9c0
commit 4a5277b9c0
parent 7d3a52d674
25 changed files with 826 additions and 476 deletions
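
Item 1 of the PR-review commit in the message above describes making the name guard tolerant of calls wrapped across lines. A minimal sketch of that scanning idea follows, assuming a plain text scan; `literal_name_call_sites` is a hypothetical function, not the contents of `failpoint_names_guard.rs`, and it ignores comments, raw strings, and identifiers that merely end with a listed prefix.

```rust
/// Returns the byte offsets of calls whose first argument is a string literal.
/// Each entry in `call_prefixes` must include the open paren, e.g. "park_first(".
fn literal_name_call_sites(source: &str, call_prefixes: &[&str]) -> Vec<usize> {
    let mut hits = Vec::new();
    for &prefix in call_prefixes {
        if prefix.is_empty() {
            continue; // an empty prefix would match everywhere and never advance
        }
        let mut from = 0;
        while let Some(rel) = source[from..].find(prefix) {
            let call_start = from + rel;
            let after_paren = call_start + prefix.len();
            // Skip spaces, tabs, and newlines after the open paren so a call
            // wrapped across lines cannot hide the literal from the check.
            let first_arg = source[after_paren..].trim_start();
            if first_arg.starts_with('"') {
                hits.push(call_start);
            }
            from = after_paren;
        }
    }
    hits.sort_unstable();
    hits
}
```

Because the scan works on the whole file text instead of line by line, the wrapped form `park_first(\n "name",\n)` is flagged the same way as the single-line form.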
--- a/docs/dev/handoff-rfc-013-write-path.md
+++ b/docs/dev/handoff-rfc-013-write-path.md
@@ -354,10 +354,9 @@ open; delete #6658 shipped). Track, don't build yet.
 - **#254** — logical-class fix (schema-apply vs optimize false-fail). Same op-class family;
  both are de-risking inputs for Design A's per-class commit models.
 - **#296** — recovery roll-forward converges on concurrent manifest advance. This is the fix
-  for the flaky `iss-schema-apply-reopen-recovery-race` (the handoff in
-  `handoff-schema-apply-recovery-flake.md`). It touches `recovery.rs` and is *aligned* with
-  #297's "postcondition is the state, not winning the CAS" principle — reconcile the monotonic
-  publish with #296's converge helper if #296 lands first.
+  for the flaky `iss-schema-apply-reopen-recovery-race`. It touches `recovery.rs` and is
+  *aligned* with #297's "postcondition is the state, not winning the CAS" principle — reconcile
+  the monotonic publish with #296's converge helper if #296 lands first.
 - **#295** — the step-3b RFC doc (apply §4's three corrections to it).

 ---
--- a/docs/dev/handoff-schema-apply-recovery-flake.md
+++ b/docs/dev/handoff-schema-apply-recovery-flake.md
@@ -1,216 +0,0 @@
-# Handoff: flaky schema-apply → reopen recovery race
-
-**Type:** bug investigation handoff (not yet fixed)
-**Status:** root-caused to a layer + hypothesis; exact mechanism and fix NOT yet validated
-**Severity:** medium — flaky CI; a real (rare) schema-apply-then-reopen failure under load
-**Scope:** pre-existing on `main`; **independent of** RFC-013 step 2 (internal-table
-compaction, PR #291) and step 3a (#288) — those paths never touch schema apply or
-the recovery sweep, and the full `--workspace` gate passes clean on a re-run.
-
-> Do **not** "fix" this by changing the test to use a single handle. That was
-> empirically shown to *reduce but not eliminate* the flake (see Experiments), so it
-> would mask a real product race. This is a correct-by-design fix in the engine, not
-> a test edit.
-
----
-
-## 1. Symptom
-
-The test
-`crates/omnigraph-server/tests/schema_routes.rs::schema_apply_route_hard_drops_property_with_allow_data_loss`
-intermittently fails. The HTTP schema apply **succeeds** (`applied == true`); the
-*subsequent* `Omnigraph::open(graph)` (which the test does to verify the catalog)
-panics on `.unwrap()` with:
-
-```
-OmniError::Manifest(Conflict,
-  "stale view of node:Person: expected manifest version 5 but current is 7",
-  ExpectedVersionMismatch { expected: 5, actual: 7 })
-```
-
-The values (5, 7) vary; the shape is always "recovery roll-forward expected version
-N, manifest is at M > N." It is raised from the **open-time recovery sweep**, i.e.
-inside `Omnigraph::open`, not from the apply itself.
-
----
-
-## 2. Reproduction
-
-- **Needs sibling-test parallelism (CPU contention).** Running the target test
-  *alone* is rock-solid (0/20 failures). The flake only appears when other tests in
-  the same binary run concurrently and perturb the timing inside the apply→reopen
-  sequence.
-- Fast repro loop (≈13–40% per run):
-  ```bash
-  cargo test -p omnigraph-server --test schema_routes --no-run
-  for i in $(seq 1 15); do
-    cargo test -p omnigraph-server --test schema_routes 2>&1 \
-      | grep -q "schema_apply_route_hard_drops_property_with_allow_data_loss ... FAILED" \
-      && echo "iter $i FAIL"
-  done
-  ```
-- It originally surfaced in a full `cargo test --workspace` run (max parallelism).
-- Each test uses its own `tempfile::tempdir()`, so this is **not** cross-test shared
-  state — it's a timing race inside one test's own graph.
-
----
-
-## 3. Experiments run (the discriminating evidence)
-
-Each variant was stress-run under the full `schema_routes` suite (parallel siblings):
-
-| Variant | Flake rate |
-|---|---|
-| Target test in isolation (no sibling parallelism) | **0/20** |
-| **Control** — as written (server handle + out-of-band `Omnigraph::open` load + reopen) | 6/15 ≈ 40% |
-| Drop the live server handle (`drop(app)`) before the reopen | 4/15 ≈ 27% |
-| Remove the out-of-band separate-handle load | 2/15 ≈ 13% |
-| Remove the load **and** drop the server handle (≈ single-handle) | 8/20 ≈ 40% |
-
-**Interpretation:**
-- It is **concurrency-triggered**, not a topology bug: 0% isolated, flaky under
-  parallel load.
-- **No single factor eliminates it.** Removing the out-of-band load roughly halves
-  the rate (it amplifies the race) but leaves a ~13% base. Dropping the live server
-  handle does not clearly help. So the "single-handle test" patch is a **band-aid**,
-  not the fix.
-- The residual base rate with the out-of-band load removed means there is a real
-  race in the **schema-apply → reopen → recovery** path itself.
-
-Caveat on the experiments: `drop(app)` may not synchronously tear down the server's
-engine handle (it can be held by an `Arc`/spawned task), so the "single-handle"
-rows are not airtight. This is one of the things to validate (§6).
-
----
-
-## 4. Root-cause hypothesis (NOT yet proven)
-
-The failing path is the **open-time recovery sweep's roll-forward** raising
-`ExpectedVersionMismatch` from the publisher's `check_expected_table_versions`.
-
-The hard-drop schema apply (`allow_data_loss=true` → `DropMode::Hard`) is a
-**multi-step migration**: it performs several Lance commits + `__manifest` publishes,
-advancing `node:Person`'s manifest version across multiple versions (e.g. 5 → … → 7).
-To be crash-safe across the Lance-HEAD-before-manifest-publish gap, schema apply
-writes a **recovery sidecar** (`__recovery/{ulid}.json`) pinning per-table
-`expected_version` / `post_commit_pin` before its Phase B.
-
-Hypothesis: under CPU contention, the timing of (a) the migration's multi-version
-advancement, (b) the sidecar's Phase-D deletion, and (c) a later/overlapping
-`Omnigraph::open` recovery sweep interleaves such that the recovery roll-forward
-reads a sidecar whose pinned `expected` is **stale relative to a manifest that
-legitimately advanced several versions**, and **re-publishes at the stale `expected`
-instead of recognizing the migration already completed** → `expected 5, actual 7`.
-
-In other words: the recovery classifier / roll-forward likely does not correctly
-handle a table whose manifest is **already past `post_commit_pin`** by more than one
-step (multi-step migration), or a sidecar whose operation has already fully
-committed. The single-step assumption baked into the Optimize-style pin
-(`post_commit_pin = expected_version + 1`) may not generalize to multi-commit schema
-migrations.
-
----
-
-## 5. Likely solution (correct-by-design, surgical)
-
-Make the **open-time recovery classifier idempotent against a manifest that advanced
-past the sidecar's pin**:
-
-- If the table's current manifest/Lance version is already `>= post_commit_pin`
-  (operation completed, possibly across multiple versions), classify it as
-  *already-rolled-forward / completed* (the `RolledPastExpected` family) and **delete
-  the sidecar without republishing** — never attempt a publish at the stale
-  `expected`.
-- Ensure the schema-apply sidecar records a pin that the classifier can interpret for
-  a **multi-step** migration (a range / "completed at or beyond" semantics), not a
-  strict single-step `expected + 1`.
-
-This also hardens *real* crash recovery for multi-step schema apply (not just the
-test), and is small + local to `recovery.rs` (+ possibly the schema-apply sidecar
-write in `schema_apply.rs`). It does **not** rearchitect recovery.
-
-Per repo rule 12 (test-first for bug fixes): land a **deterministic** repro first —
-ideally a failpoint that forces the interleaving (pause after the migration's commits
-but before sidecar delete, then run an open) so the red→green is reliable, not a
-stress-loop probability. See the `failpoints.rs` pattern + the schema-apply failpoints
-already in the tree.
-
----
-
-## 6. What MUST be validated before fixing
-
-1. **Which sidecar is being rolled forward?** Confirm it is the *schema-apply*
-   sidecar (vs the out-of-band `load`'s sidecar, vs another writer). Instrument /
-   log the sidecar `operation_id`, `kind`, and `SidecarTablePin` at the point the
-   recovery sweep raises the error.
-2. **The exact classifier path.** Trace which `TableClassification` arm the failing
-   table hits (`recovery.rs::classify_table`, ~L600) and which roll-forward call
-   raises `ExpectedVersionMismatch` (`heal_pending_sidecars_roll_forward` ~L761,
-   `roll_forward_all` ~L1215, `restore`+publish ~L1275). Confirm it is the
-   multi-step-advanced / already-completed case being mishandled.
-3. **Is `post_commit_pin = expected + 1` the bug?** Verify the hard-drop migration
-   advances `node:Person` by **>1** version, and that the sidecar pins a single-step
-   `+1`, so the classifier can't recognize completion at +2.
-4. **Engine-level reproduction (no server).** Build a deterministic engine-level
-   repro: persistent handle applies a multi-step hard-drop, then a fresh
-   `Omnigraph::open` — ideally with a failpoint forcing the interleave — to confirm
-   the bug is in the engine recovery path and not server-specific (runtime, handle
-   lifecycle). The current evidence is server-test-only.
-5. **Is the out-of-band load *necessary or only amplifying*?** Confirm the ~13% base
-   rate (load removed) is the same root cause, not a second distinct race. If the
-   load is required, the bug is specifically about a second writer's version
-   advancement; if not, it's purely intra-apply.
-6. **`drop(app)` cleanliness.** Verify whether the server's engine handle is truly
-   gone after `drop(app)` (it may be `Arc`-held). If not, the "single-handle"
-   experiments don't isolate the live-handle factor and should be redone with a
-   genuinely single-handle setup.
-
----
-
-## 7. Relationship to Lance MTT
-
-This bug lives in the **recovery-sidecar roll-forward**, which exists only to bridge
-the Lance-HEAD-before-manifest-publish gap in omnigraph's faked multi-table
-atomicity. `invariants.md` already calls recovery sidecars "scaffolding to remove
-once the substrate closes the gap." Lance **MTT** (native atomic multi-table commits,
-RFC §8 / lance#7264) closes that gap → retires the sidecar → **eliminates this bug
-class.**
-
-Implications:
-- **Don't wait for MTT** — it is the "strategic exit, not a current dependency,"
-  uncertain and far off, and this bug is live now.
-- **Don't over-invest** — keep the fix surgical (classifier idempotency), because the
-  whole sidecar layer is MTT-disposable. A surgical fix retires cleanly with the
-  layer; a recovery rearchitecture would be throwaway.
-
----
-
-## 8. Key pointers
-
-- Failing test: `crates/omnigraph-server/tests/schema_routes.rs`
-  → `schema_apply_route_hard_drops_property_with_allow_data_loss` (~L777,
-  `#[tokio::test(flavor = "multi_thread")]`).
-- Error type: `OmniError::Manifest` / `ManifestConflictDetails::ExpectedVersionMismatch`
-  (`crates/omnigraph/src/error.rs`); raised by `check_expected_table_versions`
-  (`crates/omnigraph/src/db/manifest/publisher.rs`, ~L356).
-- Recovery sweep + classifier: `crates/omnigraph/src/db/manifest/recovery.rs`
-  — `TableClassification` (~L335), `classify_table` (~L600), roll-forward
-  (`heal_pending_sidecars_roll_forward` ~L761, `roll_forward_all` ~L1215, restore +
-  publish ~L1275).
-- Schema-apply sidecar write: `crates/omnigraph/src/db/omnigraph/schema_apply.rs`
-  (the `SidecarKind` schema-apply pins; `db.coordinator.write().refresh()` ~L692).
-- Open entry point that runs the sweep: `Omnigraph::open` (read-write mode) →
-  `db/manifest/recovery.rs` sweep.
-- Repro: §2 above. Stress under `schema_routes` suite parallelism; 0% isolated.
-
----
-
-## 9. Suggested next steps
-
-1. Add tracing at the recovery roll-forward error site (sidecar kind/id, pins,
-   observed vs expected) and capture a failing run (§6.1, §6.2).
-2. Reproduce deterministically at the engine level with a failpoint (§6.4) — this is
-   the red test (rule 12).
-3. Implement the classifier-idempotency fix (§5) in a separate commit; confirm
-   red→green and that the stress loop goes to 0 failures over ≥50 iterations.
-4. Keep it a standalone PR (not bundled with RFC-013 follow-ons).
--- a/docs/dev/index.md
+++ b/docs/dev/index.md
@@ -94,7 +94,6 @@ Working documents for in-flight feature work. Removed when the work lands.
 | Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
 | Write-path latency — capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) |
 | RFC-013 handoff — current-state map, latest validation, and concrete next actions for finishing write-path latency and correctness work | [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) |
-| Schema-apply recovery flake handoff — investigation notes and validation plan for the intermittent schema-apply reopen race | [handoff-schema-apply-recovery-flake.md](handoff-schema-apply-recovery-flake.md) |

 ## Boundary

--- a/docs/dev/invariants.md
+++ b/docs/dev/invariants.md
@@ -211,10 +211,21 @@ them explicit.
  sweep has the same exposure, and always has): it may roll a live foreign
  writer's sidecar forward, which degrades to publisher-CAS contention for
  data writes but can race the schema-staging promotion for a foreign live
-  schema apply. Multi-process writers on one graph are already documented
-  one-winner-CAS territory; closing this fully needs a cross-process
-  serialization primitive (e.g. lease-based use of the schema-apply lock
-  branch) — design it before promoting multi-process write topologies.
+  schema apply. The roll-**forward** CAS contention is now
+  convergence-idempotent: when the publish loses the CAS to a concurrent
+  writer that already reached the sidecar's goal, the sweep treats it as
+  convergence (record the `RolledForward` audit + delete) rather than a fatal
+  `ExpectedVersionMismatch`, and defers when the manifest is only partway
+  (`converge_or_defer_roll_forward` in `db/manifest/recovery.rs`;
+  iss-schema-apply-reopen-recovery-race). So a concurrent advance no longer
+  fails the open. The schema-staging promotion race and the destructive
+  roll-**back** path (Lance `Restore` "trumps" a concurrent commit, so it
+  cannot be made idempotent — iss-recovery-sweep-live-writer-rollback) still
+  need the cross-process primitive. Multi-process writers on one graph are
+  already documented one-winner-CAS territory; closing this fully needs a
+  cross-process serialization primitive (e.g. lease-based use of the
+  schema-apply lock branch) — design it before promoting multi-process write
+  topologies.
 - **Fork reclaim is in-process-safe only:** the first write to a table on a
  branch forks it (a Lance `create_branch` that advances state before the
  manifest publish). An interrupted fork (crash, or a cancelled request
--- a/docs/dev/testing.md
+++ b/docs/dev/testing.md
@@ -46,7 +46,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
 | `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
 | `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly — no HTTP — proving embedded and CLI callers hit the same gate as the server, with action × scope shapes matching `authorize_request` |
 | `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice; the index reconciler (iss-848): `index_build_tolerates_null_vector_rows` (an untrainable Vector column defers instead of aborting the build, sibling indexes still build) and `optimize_materializes_index_declared_but_unbuilt` (optimize creates a declared-but-deferred index) |
-| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). |
+| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`) and the convergence-idempotent roll-forward regression (`open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one sidecar at the `recovery.before_roll_forward_publish` rendezvous; the CAS loser must converge, not fail the open — iss-schema-apply-reopen-recovery-race). |
 | `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path |
 | `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). |

@@ -64,10 +64,12 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav

 ## Failpoints (fault injection)

-- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` (in `crates/omnigraph/Cargo.toml` **and** `crates/omnigraph-cluster/Cargo.toml`; the cluster feature does not enable the engine's).
-- Wrappers: `crates/omnigraph/src/failpoints.rs` and `crates/omnigraph-cluster/src/failpoints.rs` expose `maybe_fail("name")` and `ScopedFailPoint` for tests.
-- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, cluster apply's payload→state-write window, etc.).
-- Activated tests: `crates/omnigraph/tests/failpoints.rs` and `crates/omnigraph-cluster/tests/failpoints.rs` (crash-mid-apply + state CAS race via `fail::cfg_callback`; integration binaries, never in-source — the fail registry is process-global). Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints` / `cargo test -p omnigraph-cluster --features failpoints --test failpoints`.
+- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` in `crates/omnigraph/Cargo.toml`; the cluster's `failpoints` feature additionally enables `omnigraph/failpoints` (`crates/omnigraph-cluster/Cargo.toml`), so the shared test guard is available to cluster tests.
+- Wrappers: `crates/omnigraph/src/failpoints.rs` and `crates/omnigraph-cluster/src/failpoints.rs` each expose `maybe_fail("name")` (per-crate error type). The test-side config guard `ScopedFailPoint` (`new` for action strings, `with_callback` for callbacks; RAII `Drop` removes the point) lives **once** in the engine and is reused by both test binaries.
+- **Names are compile-checked.** Every failpoint name is a `pub const` in `omnigraph::failpoints::names` (engine) / `omnigraph_cluster::failpoints::names` (cluster). Call sites and tests reference the constant, never a bare literal — a typo is a compile error, not a silently-never-firing point. Add a new failpoint by adding its const first.
+- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, the recovery sweep's classify→roll-forward-publish window, cluster apply's payload→state-write window, etc.).
+- **Serialize and rendezvous, never sleep.** The `fail` registry is process-global, so every failpoint test carries `#[serial]` (`serial_test`). For concurrent tests, use `helpers::failpoint::Rendezvous` (`tests/helpers/failpoint.rs`): `park_first(name)` parks the first thread to hit the point until `release()`, and `wait_until_reached().await` blocks on that condition (it doubles as a fired-assertion). Do not coordinate threads with fixed `sleep`s.
+- Activated tests: `crates/omnigraph/tests/failpoints.rs` and `crates/omnigraph-cluster/tests/failpoints.rs` (integration binaries, never in-source — the fail registry is process-global). Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints` / `cargo test -p omnigraph-cluster --features failpoints --test failpoints`.

 ## RustFS / S3 integration