fix(recovery): converge roll-forward on concurrent manifest advance (#296)

* refactor(storage): gate test-only TableStore::append_batch behind cfg(test)

The inherent append_batch is used only by in-source recovery test setup, but
the non-test lib build (cfg(test) off) cannot see those callers and emitted a
dead_code warning. Gating the method #[cfg(test)] silences the false positive
and enforces its own doc contract ("no new engine call sites") by construction
— engine code physically cannot call a cfg(test) method.

* test(failpoints): harden fault-injection harness + reproduce roll-forward CAS race

Hardens the test infrastructure around the process-global `fail` registry, and
adds a deterministic red repro for the open-time recovery sweep's roll-forward
CAS race (iss-schema-apply-reopen-recovery-race). The fix lands in the next
commit — this commit is intentionally red (rule 12: red→green visible in log).

Harness:
- One `ScopedFailPoint` (engine) gaining `with_callback`; the cluster duplicate
  is removed and cluster tests reuse the engine type via `omnigraph/failpoints`.
- `#[serial]` on every failpoint test (the registry is process-global, so shared
  names interfere under parallelism); `serial_test` added to cluster dev-deps.
- `helpers::failpoint::Rendezvous` (park-first / wait-until-reached / release)
  replaces fixed-`sleep` cross-thread coordination; the three concurrent tests
  now rendezvous deterministically. The reached flag doubles as a fired-assert.
- Compile-checked `failpoints::names` catalog (engine + cluster); every call
  site references a const, and `failpoint_names_guard.rs` enforces "no string
  literal names" by source-walk, so a typo is a build error not a silent no-fire.

Red repro:
- New `recovery.before_roll_forward_publish` failpoint at the sweep's
  classify -> publish-CAS window (the only injection point there).
- `open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two
  concurrent open-sweeps race one pending sidecar; the sweep parked at the
  failpoint loses its publish CAS to the other and fails the open with
  `ExpectedVersionMismatch`. FAILS at this commit by design.

* fix(recovery): converge roll-forward when the manifest advances concurrently

The open-time recovery sweep classified a pending sidecar as RolledPastExpected,
then published a manifest CAS at the sidecar's pinned expected_version. Under a
concurrent writer that advanced the manifest past expected during the
classify -> publish window, the CAS failed with ExpectedVersionMismatch and
`?`-propagated, failing the whole Omnigraph::open
(iss-schema-apply-reopen-recovery-race).

A roll-forward's postcondition is "the manifest reflects the sidecar's committed
Lance state", not "this sweep won the CAS" (invariants 7 & 15). On an
ExpectedVersionMismatch, re-read the live manifest and check whether the
sidecar's intent is already satisfied (every pinned table at a version >= the
one we observed and tried to publish; added tables registered; tombstones gone
— sound under the heal-first invariant, documented at the check). If satisfied,
this is convergence: record the RolledForward audit + delete the sidecar
idempotently. If only partway, defer to the next pass. Either way the open no
longer fails. Other errors still propagate; a genuine logical conflict
resurfaces via the classifier's InvariantViolation.
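
The satisfied-or-partway decision described above can be sketched as a pure function over the sidecar's intent and a fresh read of the live manifest. This is a simplified illustration, not the code in `recovery.rs`: `SidecarIntent`, `CasLossOutcome`, and the map-based manifest view are hypothetical, and the real check also relies on the heal-first invariant mentioned above.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified view of what a pending sidecar wants to be true.
pub struct SidecarIntent {
    /// Table key -> version this sweep observed and tried to publish.
    pub pinned: BTreeMap<String, u64>,
    /// Tables the operation adds; they must be registered.
    pub added: BTreeSet<String>,
    /// Tables the operation tombstones; they must be gone.
    pub tombstoned: BTreeSet<String>,
}

/// What to do after the publish CAS fails with a version mismatch.
#[derive(Debug, PartialEq, Eq)]
pub enum CasLossOutcome {
    /// The live manifest already reflects the intent: audit + delete.
    Converged,
    /// Only partway there: leave the sidecar for the next pass.
    Deferred,
}

/// `live` is an assumed view of the re-read manifest: table key -> version.
pub fn converge_or_defer(
    intent: &SidecarIntent,
    live: &BTreeMap<String, u64>,
) -> CasLossOutcome {
    // Every pinned table must be at or past the version we tried to publish.
    let pins_ok = intent
        .pinned
        .iter()
        .all(|(table, want)| live.get(table).map_or(false, |have| have >= want));
    // Every added table must be registered.
    let added_ok = intent.added.iter().all(|t| live.contains_key(t));
    // Every tombstoned table must be absent.
    let tombstones_ok = intent.tombstoned.iter().all(|t| !live.contains_key(t));

    if pins_ok && added_ok && tombstones_ok {
        CasLossOutcome::Converged
    } else {
        CasLossOutcome::Deferred
    }
}
```

Neither outcome is an error, which is the point of the fix: losing the CAS stops being a reason to fail the open.
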

Turns the red repro from the previous commit green. The roll-BACK twin
(iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and
still needs a cross-process lease — the known-gap is updated accordingly.

* Address PR review: harden failpoint name guard + dedupe converge audit

Two issues surfaced in PR review of the failpoint hardening + recovery fix:

1. Name guard had a line-split blind spot. It scanned per line, so a call
   wrapped across lines (`park_first(\n    "name",\n)`) put the literal on a
   different line than the call prefix and bypassed the "no string-literal
   failpoint names" check — and one such literal
   (`mutation.delete_node_pre_primary_delete`) had slipped through. Make the
   guard whitespace/newline-tolerant (skip past the open paren to the first
   argument token) so wrapping can't hide a literal, and convert the bypassed
   site to the `names::` const.

2. Convergence path could append a duplicate recovery audit. When a
   roll-forward publish loses its CAS but the manifest already reached the
   sidecar's goal, `converge_or_defer_roll_forward` recorded a RolledForward
   audit unconditionally. Under the heal-first invariant, whoever advanced the
   manifest already healed this sidecar (audit + delete), so a second row
   landed in `_graph_commit_recoveries` for one recovery event. Gate the
   audit+delete on the sidecar still being present: absent => the winner
   completed it, return success with no duplicate row. The convergence
   regression test now asserts exactly one audit row.
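
The whitespace/newline-tolerant scan from item 1 can be sketched as follows. This is an illustrative approximation of the idea (skip past the open paren to the first argument token), not the contents of `failpoint_names_guard.rs`: the function name and return type are assumptions, and the sketch does not handle comments or raw strings.

```rust
/// Returns the byte offset of every `call_prefix` occurrence whose first
/// argument token is a string literal. `call_prefix` is expected to include
/// the opening parenthesis, e.g. "park_first(".
pub fn literal_name_offenders(source: &str, call_prefix: &str) -> Vec<usize> {
    let mut offenders = Vec::new();
    if call_prefix.is_empty() {
        return offenders; // nothing to search for; also avoids a stuck loop
    }
    let mut from = 0;
    while let Some(rel) = source[from..].find(call_prefix) {
        let call_at = from + rel;
        let after_paren = call_at + call_prefix.len();
        // Skip spaces and newlines so a wrapped call cannot hide a literal.
        let first_token = source[after_paren..].trim_start().chars().next();
        if first_token == Some('"') {
            offenders.push(call_at);
        }
        from = after_paren;
    }
    offenders
}
```

Scanning the whole file as one string, rather than line by line, is what closes the blind spot: the literal is found relative to the call, not relative to the line it happens to sit on.
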

* docs(dev): remove the schema-apply recovery-flake handoff (fixed by this PR)

The handoff was a transient investigation note for
`iss-schema-apply-reopen-recovery-race`, which this PR fixes (the converge
helper + the red→green regression). Its rationale now lives durably in the
dev-graph issue, the PR/commit history, and invariants.md, so the handoff is
obsolete. Drop the doc, its dev-index row, and the dangling reference from the
RFC-013 handoff; the doc cross-link check stays green.

* fix(recovery): include added-table registrations in the converge audit

The CAS-loss convergence audit built outcomes only from `sidecar.tables`,
omitting the `additional_registrations` that the normal `roll_forward_all`
audit includes. For a SchemaApply sidecar with added types, a converge-path
audit row would be incomplete versus the normal roll-forward path for the same
recovery kind. Mirror the roll-forward outcome construction (append a
registration outcome per added table) so both paths emit the same audit shape.
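
One way to keep the two paths from drifting apart again is to build the audit outcomes in a single helper that both call. The sketch below is hypothetical: `Sidecar`, `Outcome`, and `audit_outcomes` are illustrative names, and only the `tables` and `additional_registrations` field names come from the description above.

```rust
/// Hypothetical audit outcome shape.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    TablePublished { table: String, version: u64 },
    TableRegistered { table: String },
}

/// Hypothetical, simplified sidecar.
pub struct Sidecar {
    /// (table key, published version) for each pinned table.
    pub tables: Vec<(String, u64)>,
    /// Tables added by the operation (e.g. new types in a schema apply).
    pub additional_registrations: Vec<String>,
}

/// Builds the audit outcomes: one per pinned table, then one registration
/// outcome per added table, so every caller emits the same shape.
pub fn audit_outcomes(sidecar: &Sidecar) -> Vec<Outcome> {
    let mut outcomes: Vec<Outcome> = sidecar
        .tables
        .iter()
        .map(|(table, version)| Outcome::TablePublished {
            table: table.clone(),
            version: *version,
        })
        .collect();
    outcomes.extend(
        sidecar
            .additional_registrations
            .iter()
            .map(|table| Outcome::TableRegistered { table: table.clone() }),
    );
    outcomes
}
```
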
Ragnor Comerford 2026-06-24 19:03:51 +02:00 committed by GitHub
parent 7d3a52d674
commit 4a5277b9c0
25 changed files with 826 additions and 476 deletions


@@ -354,10 +354,9 @@ open; delete #6658 shipped). Track, don't build yet.
- **#254** — logical-class fix (schema-apply vs optimize false-fail). Same op-class family;
both are de-risking inputs for Design A's per-class commit models.
- **#296** — recovery roll-forward converges on concurrent manifest advance. This is the fix
for the flaky `iss-schema-apply-reopen-recovery-race` (the handoff in
`handoff-schema-apply-recovery-flake.md`). It touches `recovery.rs` and is *aligned* with
#297's "postcondition is the state, not winning the CAS" principle — reconcile the monotonic
publish with #296's converge helper if #296 lands first.
for the flaky `iss-schema-apply-reopen-recovery-race`. It touches `recovery.rs` and is
*aligned* with #297's "postcondition is the state, not winning the CAS" principle — reconcile
the monotonic publish with #296's converge helper if #296 lands first.
- **#295** — the step-3b RFC doc (apply §4's three corrections to it).
---


@@ -1,216 +0,0 @@
# Handoff: flaky schema-apply → reopen recovery race
**Type:** bug investigation handoff (not yet fixed)
**Status:** root-caused to a layer + hypothesis; exact mechanism and fix NOT yet validated
**Severity:** medium — flaky CI; a real (rare) schema-apply-then-reopen failure under load
**Scope:** pre-existing on `main`; **independent of** RFC-013 step 2 (internal-table
compaction, PR #291) and step 3a (#288) — those paths never touch schema apply or
the recovery sweep, and the full `--workspace` gate passes clean on a re-run.
> Do **not** "fix" this by changing the test to use a single handle. That was
> empirically shown to *reduce but not eliminate* the flake (see Experiments), so it
> would mask a real product race. This is a correct-by-design fix in the engine, not
> a test edit.
---
## 1. Symptom
The test
`crates/omnigraph-server/tests/schema_routes.rs::schema_apply_route_hard_drops_property_with_allow_data_loss`
intermittently fails. The HTTP schema apply **succeeds** (`applied == true`); the
*subsequent* `Omnigraph::open(graph)` (which the test does to verify the catalog)
panics on `.unwrap()` with:
```
OmniError::Manifest(Conflict,
"stale view of node:Person: expected manifest version 5 but current is 7",
ExpectedVersionMismatch { expected: 5, actual: 7 })
```
The values (5, 7) vary; the shape is always "recovery roll-forward expected version
N, manifest is at M > N." It is raised from the **open-time recovery sweep**, i.e.
inside `Omnigraph::open`, not from the apply itself.
---
## 2. Reproduction
- **Needs sibling-test parallelism (CPU contention).** Running the target test
*alone* is rock-solid (0/20 failures). The flake only appears when other tests in
the same binary run concurrently and perturb the timing inside the apply→reopen
sequence.
- Fast repro loop (≈13-40% per run):
```bash
cargo test -p omnigraph-server --test schema_routes --no-run
for i in $(seq 1 15); do
cargo test -p omnigraph-server --test schema_routes 2>&1 \
| grep -q "schema_apply_route_hard_drops_property_with_allow_data_loss ... FAILED" \
&& echo "iter $i FAIL"
done
```
- It originally surfaced in a full `cargo test --workspace` run (max parallelism).
- Each test uses its own `tempfile::tempdir()`, so this is **not** cross-test shared
state — it's a timing race inside one test's own graph.
---
## 3. Experiments run (the discriminating evidence)
Each variant was stress-run under the full `schema_routes` suite (parallel siblings):
| Variant | Flake rate |
|---|---|
| Target test in isolation (no sibling parallelism) | **0/20** |
| **Control** — as written (server handle + out-of-band `Omnigraph::open` load + reopen) | 6/15 ≈ 40% |
| Drop the live server handle (`drop(app)`) before the reopen | 4/15 ≈ 27% |
| Remove the out-of-band separate-handle load | 2/15 ≈ 13% |
| Remove the load **and** drop the server handle (≈ single-handle) | 8/20 ≈ 40% |
**Interpretation:**
- It is **concurrency-triggered**, not a topology bug: 0% isolated, flaky under
parallel load.
- **No single factor eliminates it.** Removing the out-of-band load cuts the rate
from ~40% to ~13% (the load amplifies the race) but leaves a ~13% base. Dropping the live server
handle does not clearly help. So the "single-handle test" patch is a **band-aid**,
not the fix.
- The residual base rate with the out-of-band load removed means there is a real
race in the **schema-apply → reopen → recovery** path itself.
Caveat on the experiments: `drop(app)` may not synchronously tear down the server's
engine handle (it can be held by an `Arc`/spawned task), so the "single-handle"
rows are not airtight. This is one of the things to validate (§6).
---
## 4. Root-cause hypothesis (NOT yet proven)
The failing path is the **open-time recovery sweep's roll-forward** raising
`ExpectedVersionMismatch` from the publisher's `check_expected_table_versions`.
The hard-drop schema apply (`allow_data_loss=true` → `DropMode::Hard`) is a
**multi-step migration**: it performs several Lance commits + `__manifest` publishes,
advancing `node:Person`'s manifest version across multiple versions (e.g. 5 → … → 7).
To be crash-safe across the Lance-HEAD-before-manifest-publish gap, schema apply
writes a **recovery sidecar** (`__recovery/{ulid}.json`) pinning per-table
`expected_version` / `post_commit_pin` before its Phase B.
Hypothesis: under CPU contention, the timing of (a) the migration's multi-version
advancement, (b) the sidecar's Phase-D deletion, and (c) a later/overlapping
`Omnigraph::open` recovery sweep interleaves such that the recovery roll-forward
reads a sidecar whose pinned `expected` is **stale relative to a manifest that
legitimately advanced several versions**, and **re-publishes at the stale `expected`
instead of recognizing the migration already completed** → `expected 5, actual 7`.
In other words: the recovery classifier / roll-forward likely does not correctly
handle a table whose manifest is **already past `post_commit_pin`** by more than one
step (multi-step migration), or a sidecar whose operation has already fully
committed. The single-step assumption baked into the Optimize-style pin
(`post_commit_pin = expected_version + 1`) may not generalize to multi-commit schema
migrations.
---
## 5. Likely solution (correct-by-design, surgical)
Make the **open-time recovery classifier idempotent against a manifest that advanced
past the sidecar's pin**:
- If the table's current manifest/Lance version is already `>= post_commit_pin`
(operation completed, possibly across multiple versions), classify it as
*already-rolled-forward / completed* (the `RolledPastExpected` family) and **delete
the sidecar without republishing** — never attempt a publish at the stale
`expected`.
- Ensure the schema-apply sidecar records a pin that the classifier can interpret for
a **multi-step** migration (a range / "completed at or beyond" semantics), not a
strict single-step `expected + 1`.
This also hardens *real* crash recovery for multi-step schema apply (not just the
test), and is small + local to `recovery.rs` (+ possibly the schema-apply sidecar
write in `schema_apply.rs`). It does **not** rearchitect recovery.
Per repo rule 12 (test-first for bug fixes): land a **deterministic** repro first —
ideally a failpoint that forces the interleaving (pause after the migration's commits
but before sidecar delete, then run an open) so the red→green is reliable, not a
stress-loop probability. See the `failpoints.rs` pattern + the schema-apply failpoints
already in the tree.
---
## 6. What MUST be validated before fixing
1. **Which sidecar is being rolled forward?** Confirm it is the *schema-apply*
sidecar (vs the out-of-band `load`'s sidecar, vs another writer). Instrument /
log the sidecar `operation_id`, `kind`, and `SidecarTablePin` at the point the
recovery sweep raises the error.
2. **The exact classifier path.** Trace which `TableClassification` arm the failing
table hits (`recovery.rs::classify_table`, ~L600) and which roll-forward call
raises `ExpectedVersionMismatch` (`heal_pending_sidecars_roll_forward` ~L761,
`roll_forward_all` ~L1215, `restore`+publish ~L1275). Confirm it is the
multi-step-advanced / already-completed case being mishandled.
3. **Is `post_commit_pin = expected + 1` the bug?** Verify the hard-drop migration
advances `node:Person` by **>1** version, and that the sidecar pins a single-step
`+1`, so the classifier can't recognize completion at +2.
4. **Engine-level reproduction (no server).** Build a deterministic engine-level
repro: persistent handle applies a multi-step hard-drop, then a fresh
`Omnigraph::open` — ideally with a failpoint forcing the interleave — to confirm
the bug is in the engine recovery path and not server-specific (runtime, handle
lifecycle). The current evidence is server-test-only.
5. **Is the out-of-band load *necessary or only amplifying*?** Confirm the ~13% base
rate (load removed) is the same root cause, not a second distinct race. If the
load is required, the bug is specifically about a second writer's version
advancement; if not, it's purely intra-apply.
6. **`drop(app)` cleanliness.** Verify whether the server's engine handle is truly
gone after `drop(app)` (it may be `Arc`-held). If not, the "single-handle"
experiments don't isolate the live-handle factor and should be redone with a
genuinely single-handle setup.
---
## 7. Relationship to Lance MTT
This bug lives in the **recovery-sidecar roll-forward**, which exists only to bridge
the Lance-HEAD-before-manifest-publish gap in omnigraph's faked multi-table
atomicity. `invariants.md` already calls recovery sidecars "scaffolding to remove
once the substrate closes the gap." Lance **MTT** (native atomic multi-table commits,
RFC §8 / lance#7264) closes that gap → retires the sidecar → **eliminates this bug
class.**
Implications:
- **Don't wait for MTT** — it is the "strategic exit, not a current dependency,"
uncertain and far off, and this bug is live now.
- **Don't over-invest** — keep the fix surgical (classifier idempotency), because the
whole sidecar layer is MTT-disposable. A surgical fix retires cleanly with the
layer; a recovery rearchitecture would be throwaway.
---
## 8. Key pointers
- Failing test: `crates/omnigraph-server/tests/schema_routes.rs`
`schema_apply_route_hard_drops_property_with_allow_data_loss` (~L777,
`#[tokio::test(flavor = "multi_thread")]`).
- Error type: `OmniError::Manifest` / `ManifestConflictDetails::ExpectedVersionMismatch`
(`crates/omnigraph/src/error.rs`); raised by `check_expected_table_versions`
(`crates/omnigraph/src/db/manifest/publisher.rs`, ~L356).
- Recovery sweep + classifier: `crates/omnigraph/src/db/manifest/recovery.rs`
`TableClassification` (~L335), `classify_table` (~L600), roll-forward
(`heal_pending_sidecars_roll_forward` ~L761, `roll_forward_all` ~L1215, restore +
publish ~L1275).
- Schema-apply sidecar write: `crates/omnigraph/src/db/omnigraph/schema_apply.rs`
(the `SidecarKind` schema-apply pins; `db.coordinator.write().refresh()` ~L692).
- Open entry point that runs the sweep: `Omnigraph::open` (read-write mode) →
`db/manifest/recovery.rs` sweep.
- Repro: §2 above. Stress under `schema_routes` suite parallelism; 0% isolated.
---
## 9. Suggested next steps
1. Add tracing at the recovery roll-forward error site (sidecar kind/id, pins,
observed vs expected) and capture a failing run (§6.1, §6.2).
2. Reproduce deterministically at the engine level with a failpoint (§6.4) — this is
the red test (rule 12).
3. Implement the classifier-idempotency fix (§5) in a separate commit; confirm
red→green and that the stress loop goes to 0 failures over ≥50 iterations.
4. Keep it a standalone PR (not bundled with RFC-013 follow-ons).


@@ -94,7 +94,6 @@ Working documents for in-flight feature work. Removed when the work lands.
| Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
| Write-path latency — capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) |
| RFC-013 handoff — current-state map, latest validation, and concrete next actions for finishing write-path latency and correctness work | [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) |
| Schema-apply recovery flake handoff — investigation notes and validation plan for the intermittent schema-apply reopen race | [handoff-schema-apply-recovery-flake.md](handoff-schema-apply-recovery-flake.md) |
## Boundary


@@ -211,10 +211,21 @@ them explicit.
sweep has the same exposure, and always has): it may roll a live foreign
writer's sidecar forward, which degrades to publisher-CAS contention for
data writes but can race the schema-staging promotion for a foreign live
schema apply. Multi-process writers on one graph are already documented
one-winner-CAS territory; closing this fully needs a cross-process
serialization primitive (e.g. lease-based use of the schema-apply lock
branch) — design it before promoting multi-process write topologies.
schema apply. The roll-**forward** CAS contention is now
convergence-idempotent: when the publish loses the CAS to a concurrent
writer that already reached the sidecar's goal, the sweep treats it as
convergence (record the `RolledForward` audit + delete) rather than a fatal
`ExpectedVersionMismatch`, and defers when the manifest is only partway
(`converge_or_defer_roll_forward` in `db/manifest/recovery.rs`;
iss-schema-apply-reopen-recovery-race). So a concurrent advance no longer
fails the open. The schema-staging promotion race and the destructive
roll-**back** path (Lance `Restore` "trumps" a concurrent commit, so it
cannot be made idempotent — iss-recovery-sweep-live-writer-rollback) still
need the cross-process primitive. Multi-process writers on one graph are
already documented one-winner-CAS territory; closing this fully needs a
cross-process serialization primitive (e.g. lease-based use of the
schema-apply lock branch) — design it before promoting multi-process write
topologies.
- **Fork reclaim is in-process-safe only:** the first write to a table on a
branch forks it (a Lance `create_branch` that advances state before the
manifest publish). An interrupted fork (crash, or a cancelled request


@@ -46,7 +46,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
| `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly — no HTTP — proving embedded and CLI callers hit the same gate as the server, with action × scope shapes matching `authorize_request` |
| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice; the index reconciler (iss-848): `index_build_tolerates_null_vector_rows` (an untrainable Vector column defers instead of aborting the build, sibling indexes still build) and `optimize_materializes_index_declared_but_unbuilt` (optimize creates a declared-but-deferred index) |
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). |
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`) and the convergence-idempotent roll-forward regression (`open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one sidecar at the `recovery.before_roll_forward_publish` rendezvous; the CAS loser must converge, not fail the open — iss-schema-apply-reopen-recovery-race). |
| `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path |
| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). |
@@ -64,10 +64,12 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
## Failpoints (fault injection)
- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` (in `crates/omnigraph/Cargo.toml` **and** `crates/omnigraph-cluster/Cargo.toml`; the cluster feature does not enable the engine's).
- Wrappers: `crates/omnigraph/src/failpoints.rs` and `crates/omnigraph-cluster/src/failpoints.rs` expose `maybe_fail("name")` and `ScopedFailPoint` for tests.
- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, cluster apply's payload→state-write window, etc.).
- Activated tests: `crates/omnigraph/tests/failpoints.rs` and `crates/omnigraph-cluster/tests/failpoints.rs` (crash-mid-apply + state CAS race via `fail::cfg_callback`; integration binaries, never in-source — the fail registry is process-global). Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints` / `cargo test -p omnigraph-cluster --features failpoints --test failpoints`.
- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` in `crates/omnigraph/Cargo.toml`; the cluster's `failpoints` feature additionally enables `omnigraph/failpoints` (`crates/omnigraph-cluster/Cargo.toml`), so the shared test guard is available to cluster tests.
- Wrappers: `crates/omnigraph/src/failpoints.rs` and `crates/omnigraph-cluster/src/failpoints.rs` each expose `maybe_fail("name")` (per-crate error type). The test-side config guard `ScopedFailPoint` (`new` for action strings, `with_callback` for callbacks; RAII `Drop` removes the point) lives **once** in the engine and is reused by both test binaries.
- **Names are compile-checked.** Every failpoint name is a `pub const` in `omnigraph::failpoints::names` (engine) / `omnigraph_cluster::failpoints::names` (cluster). Call sites and tests reference the constant, never a bare literal — a typo is a compile error, not a silently-never-firing point. Add a new failpoint by adding its const first.
- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, the recovery sweep's classify→roll-forward-publish window, cluster apply's payload→state-write window, etc.).
- **Serialize and rendezvous, never sleep.** The `fail` registry is process-global, so every failpoint test carries `#[serial]` (`serial_test`). For concurrent tests, use `helpers::failpoint::Rendezvous` (`tests/helpers/failpoint.rs`): `park_first(name)` parks the first thread to hit the point until `release()`, and `wait_until_reached().await` blocks on that condition (it doubles as a fired-assertion). Do not coordinate threads with fixed `sleep`s.
- Activated tests: `crates/omnigraph/tests/failpoints.rs` and `crates/omnigraph-cluster/tests/failpoints.rs` (integration binaries, never in-source — the fail registry is process-global). Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints` / `cargo test -p omnigraph-cluster --features failpoints --test failpoints`.
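
The two conventions above (names as constants, and a guard whose `Drop` removes the point) can be illustrated without the `fail` crate. The sketch below uses a toy process-global registry as a stand-in; `REGISTRY`, `configured_action`, and the constant's identifier are assumptions for illustration, and only the failpoint name string and the `ScopedFailPoint::new` idea come from the text above.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Toy process-global registry standing in for the `fail` crate's registry.
static REGISTRY: Mutex<BTreeMap<&'static str, String>> = Mutex::new(BTreeMap::new());

/// Name catalog: referencing a missing constant is a compile error,
/// unlike a mistyped string literal, which would silently never fire.
pub mod names {
    // Hypothetical identifier; the string is the failpoint named in this PR.
    pub const RECOVERY_BEFORE_ROLL_FORWARD_PUBLISH: &str =
        "recovery.before_roll_forward_publish";
}

/// RAII guard: configures a point on creation, removes it on drop.
pub struct ScopedFailPoint {
    name: &'static str,
}

impl ScopedFailPoint {
    pub fn new(name: &'static str, action: &str) -> Self {
        REGISTRY.lock().unwrap().insert(name, action.to_string());
        Self { name }
    }
}

impl Drop for ScopedFailPoint {
    fn drop(&mut self) {
        // Runs even when the test body panics, so a failed test does not
        // leave the point armed for the next one.
        REGISTRY.lock().unwrap().remove(self.name);
    }
}

/// Looks up the action currently configured for a point, if any.
pub fn configured_action(name: &str) -> Option<String> {
    REGISTRY.lock().unwrap().get(name).cloned()
}
```

Because the registry is shared by the whole process, two tests arming the same name in parallel would still interfere, which is why the guard is paired with `#[serial]` rather than replacing it.
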
## RustFS / S3 integration