* fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap
A `cluster apply` carrying a schema change against a graph that has
non-main branches, or an unsupported "needs backfill" migration, armed a
recovery sidecar *before* calling the engine, then left it behind when the
engine rejected the apply pre-movement. The server refuses to boot while
any sidecar is pending, and re-running apply re-armed a fresh sidecar — an
unescapable crash loop. None of the engine rejections are bugs; the trap
is in the apply/serve choreography.
Three coordinated changes:
1. Preview before arming the sidecar. `cluster apply` now runs
`preview_schema_apply_with_options` before `write_recovery_sidecar`, so
parser/planner rejections (non-main branches, unsupported plan) fail
loudly without leaving recovery work behind. The post-preview engine
error path now deletes the sidecar when the live schema still matches
the recorded digest (nothing moved), and keeps it only on real
mid-movement failure — both branches covered by new engine-failpoint
tests (cluster failpoints now enable omnigraph/failpoints).
2. Per-graph quarantine at serve time instead of whole-cluster refusal.
A graph-attributed pending sidecar, an unopenable graph root, a query
parse failure, or an unresolvable embedding provider now quarantines
just that graph (logged loudly at every boot layer) while healthy
graphs serve; `/graphs` lists only ready graphs and quarantined routes
404. Cluster-global problems (missing/unreadable state, malformed or
unattributable sidecars, shared-catalog or cluster-policy errors, zero
healthy graphs) stay fail-fast. `--require-all-graphs` /
OMNIGRAPH_REQUIRE_ALL_GRAPHS=1 restores all-or-nothing boot.
3. Backfill embedding-provider profile metadata on apply. Mirrors the
existing policy-binding backfill: a pre-5A ledger missing
`embedding_profile` is now detected as a metadata-only change and
backfilled by a no-op apply, instead of bricking serve with
`embedding_provider_profile_missing` forever.
Tests: trap (no sidecar after a rejected apply), both digest-cleanup
branches, per-graph quarantine (cluster + server), embedding backfill.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs: resilient cluster boot + recovery-sidecar trap fix
Amend RFC-005 D4 readiness posture (cluster-global fail-fast vs graph-local
quarantine; deviation #5 for --require-all-graphs), add the v0.7.0 release
note, and update the user cluster/server/deployment docs and the
OMNIGRAPH_REQUIRE_ALL_GRAPHS env var.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(cluster): surface sidecar-cleanup failures; document severity promotion
Address Greptile review on PR #284:
- The pre-movement sidecar cleanup fast-path discarded `delete_object`'s
result, so a transient delete failure left the graph quarantined with no
signal. Add `try_delete_object` (Result-returning) and emit a
`recovery_sidecar_cleanup_failed` warning diagnostic on failure; the
fire-and-forget `delete_object` now delegates to it.
- Document why the serve-time loop promotes every `list_recovery_sidecars`
diagnostic to a cluster-fatal error (the listing only emits genuine
read/parse/version failures, as warnings, whose blast radius serving
cannot prove) and note the promote-by-code path if that ever changes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
s3_cluster.rs runs the full control-plane lifecycle against a real
bucket (CI: containerized RustFS; locally the RustFS binary): import →
lock released (pins the drop-time release regression caught on the first
live smoke) → apply (graph roots + catalog on the bucket, nothing local)
→ serving snapshots from both the config dir and the bare URI → schema
evolution → approved delete (prefix removal) → empty-cluster refusal.
The server suite gains the config-free boot test: --cluster s3://… with
zero local files serves a stored query over HTTP.
CI: the rustfs job runs both suites; the classify filter covers the
cluster store/serve modules and the new test files. The server smoke
drops its name filter — every test in the s3 target is bucket-gated, and
a filter matching nothing passes vacuously (which silently ran zero
tests for a while).
Docs: deployment.md gains the Bucket-no-volume shape as the preferred
cloud deployment; cluster.md/server.md document --cluster <uri>;
testing.md maps the new suite.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Crash before the removal: root intact, approval file unconsumed, sidecar
survives, no ack; the next run retires the stale intent (row 8) and the
still-approved delete completes in the same run.
- Crash after the removal, before the state CAS: root gone, ledger
byte-identical, the sidecar carries the approval id; the next run's sweep
rolls the tombstone forward, consumes the approval, audits the recovery,
and converges (row 7b).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Crash before the engine call: sidecar (carrying the --as actor) survives,
live schema and ledger untouched, no ack; the next run's sweep retires the
stale intent and the same run applies and converges.
- Crash after the engine call, before the state CAS: the manifest moved with
the post-op pin in the sidecar, state.json byte-identical; the next run's
sweep rolls the ledger forward with a schema_apply audit entry and the run
converges.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Crash before the init (row 1): sidecar survives, nothing moved, no ack;
the next run's sweep removes the intent and the same run creates and
converges.
- Crash after the init, before the state CAS (row 4): the graph exists with
the post-init manifest pin in the sidecar, state.json byte-identical; the
next run's sweep rolls the ledger forward with a recovery_records audit
entry and the run converges.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mechanical conversion ahead of Stage 4A graph create (which calls the async
Omnigraph::init from inside apply): the fn signature, the CLI dispatch arm,
and every test caller (#[test] -> #[tokio::test]). Zero behavior change; all
60 lib tests and 3 failpoint tests green before and after.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ScopedFailPoint::with_callback gives cfg_callback the same Drop-based cleanup
as cfg actions; a panic while the point is active no longer leaks the callback
into the process-global registry where it would fire under later tests
(greptile review, PR #167).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The apply-side coverage the implementation spec's hard gate requires before
Phase 4 graph-moving apply:
- crash after the payload phase: state.json byte-identical, blobs inert on
disk, lock released, no phantom statuses, nothing acknowledged; a plain
re-run repairs via skip-if-exists blob reuse.
- CAS race: a cfg_callback rewrites state.json at the exact read->write
window (the state.lock:false concurrent-writer scenario); apply surfaces
state_cas_mismatch, acknowledges nothing, reports the persisted status
snapshot, leaves the concurrent writer's state on disk; a re-run converges.
CI's failpoints step now runs both the engine and cluster suites.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Optional failpoints feature (dep:fail + fail/failpoints, deliberately NOT
enabling omnigraph/failpoints), a maybe_fail/ScopedFailPoint module returning
Diagnostic-typed injected errors, and two call sites in apply_config_dir:
cluster_apply.after_payload_phase (the crash point: blobs on disk, state
untouched) and cluster_apply.before_state_write (routes through the
persisted-statuses revert contract; a cfg_callback here can mutate state.json
to make the CAS check fail organically). Feature off compiles to Ok(()) —
zero behavior change. Tests live in a separate integration binary because the
fail registry is process-global. Also refresh the crate description (stale
'read-only' since Stage 3A).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>