test(cluster): failpoint tests for crash-mid-apply and state CAS race

Add the apply-side coverage that the implementation spec's hard gate
requires before Phase 4 graph-moving apply:

- crash after the payload phase: state.json byte-identical, blobs inert on
  disk, lock released, no phantom statuses, nothing acknowledged; a plain
  re-run repairs via skip-if-exists blob reuse.
- CAS race: a cfg_callback rewrites state.json at the exact read->write
  window (the state.lock:false concurrent-writer scenario); apply surfaces
  state_cas_mismatch, acknowledges nothing, reports the persisted status
  snapshot, leaves the concurrent writer's state on disk; a re-run converges.
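
A minimal, std-only sketch of the compare-and-swap shape the second case
exercises. This is not the omnigraph-cluster code: `cas_write_state`,
`ApplyError` and the `in_window` hook are hypothetical names for
illustration, with the hook standing in for the callback the test arms in
the read->write window. Only the `state_cas_mismatch` outcome and the
"concurrent writer's state stays on disk" behaviour come from this commit.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical error type; only the mismatch outcome is named in the commit.
#[derive(Debug)]
pub enum ApplyError {
    StateCasMismatch,
    Io(io::Error),
}

impl From<io::Error> for ApplyError {
    fn from(e: io::Error) -> Self {
        ApplyError::Io(e)
    }
}

/// Read the state file, run `in_window` (the injection point a failpoint
/// callback would occupy), then write `next` only if the file still holds
/// the bytes that were read at the start.
///
/// Assumption: a byte comparison is used as the CAS token here. The check
/// and the write are not atomic in this sketch; it only illustrates the
/// window the test targets.
pub fn cas_write_state(
    path: &Path,
    next: &[u8],
    in_window: impl FnOnce(&Path) -> io::Result<()>,
) -> Result<(), ApplyError> {
    let expected = fs::read(path)?;
    in_window(path)?;
    let current = fs::read(path)?;
    if current != expected {
        // Leave the concurrent writer's bytes untouched and report the race.
        return Err(ApplyError::StateCasMismatch);
    }
    fs::write(path, next)?;
    Ok(())
}
```

A test in this style would pass a closure that overwrites the file as
`in_window`, expect `StateCasMismatch` with the overwritten bytes still on
disk, then call again with a no-op closure and expect the write to land.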

CI's failpoints step now runs both the engine and cluster suites.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
aaltshuler 2026-06-10 02:14:06 +03:00
parent 21b531605f
commit 211b37e6de
2 changed files with 106 additions and 4 deletions


@@ -173,15 +173,18 @@ jobs:
           OMNIGRAPH_UPDATE_OPENAPI: ${{ (github.event_name == 'pull_request' && github.event.pull_request.head.repo.full_name == github.repository) && '1' || '' }}
         run: cargo test --workspace --locked
-      - name: Run failpoints feature test
+      - name: Run failpoints feature tests
         if: needs.classify_changes.outputs.run_full_ci == 'true'
         # Run after the workspace test so the build cache is warm —
         # enabling --features failpoints is just an incremental rebuild
-        # of omnigraph-engine + the small `fail` crate, not the full
+        # of the target crate + the small `fail` crate, not the full
         # dep tree (lance, datafusion). A separate job with its own
         # cache key would be a fresh ~20min build on first run; this
-        # is ~30s on a warm cache.
-        run: cargo test --locked -p omnigraph-engine --features failpoints --test failpoints
+        # is ~30s on a warm cache. The cluster feature does not enable
+        # omnigraph/failpoints, so each line rebuilds only its crate.
+        run: |
+          cargo test --locked -p omnigraph-engine --features failpoints --test failpoints
+          cargo test --locked -p omnigraph-cluster --features failpoints --test failpoints
       - name: Commit regenerated openapi.json to PR branch
         if: |