* fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap A `cluster apply` carrying a schema change against a graph that has non-main branches, or an unsupported "needs backfill" migration, armed a recovery sidecar *before* calling the engine, then left it behind when the engine rejected the apply pre-movement. The server refuses to boot while any sidecar is pending, and re-running apply re-armed a fresh sidecar — an unescapable crash loop. None of the engine rejections are bugs; the trap is in the apply/serve choreography. Three coordinated changes: 1. Preview before arming the sidecar. `cluster apply` now runs `preview_schema_apply_with_options` before `write_recovery_sidecar`, so parser/planner rejections (non-main branches, unsupported plan) fail loudly without leaving recovery work behind. The post-preview engine error path now deletes the sidecar when the live schema still matches the recorded digest (nothing moved), and keeps it only on real mid-movement failure — both branches covered by new engine-failpoint tests (cluster failpoints now enable omnigraph/failpoints). 2. Per-graph quarantine at serve time instead of whole-cluster refusal. A graph-attributed pending sidecar, an unopenable graph root, a query parse failure, or an unresolvable embedding provider now quarantines just that graph (logged loudly at every boot layer) while healthy graphs serve; `/graphs` lists only ready graphs and quarantined routes 404. Cluster-global problems (missing/unreadable state, malformed or unattributable sidecars, shared-catalog or cluster-policy errors, zero healthy graphs) stay fail-fast. `--require-all-graphs` / OMNIGRAPH_REQUIRE_ALL_GRAPHS=1 restores all-or-nothing boot. 3. Backfill embedding-provider profile metadata on apply. Mirrors the existing policy-binding backfill: a pre-5A ledger missing `embedding_profile` is now detected as a metadata-only change and backfilled by a no-op apply, instead of bricking serve with `embedding_provider_profile_missing` forever. Tests: trap (no sidecar after a rejected apply), both digest-cleanup branches, per-graph quarantine (cluster + server), embedding backfill. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: resilient cluster boot + recovery-sidecar trap fix Amend RFC-005 D4 readiness posture (cluster-global fail-fast vs graph-local quarantine; deviation #5 for --require-all-graphs), add the v0.7.0 release note, and update the user cluster/server/deployment docs and the OMNIGRAPH_REQUIRE_ALL_GRAPHS env var. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(cluster): surface sidecar-cleanup failures; document severity promotion Address Greptile review on PR #284: - The pre-movement sidecar cleanup fast-path discarded `delete_object`'s result, so a transient delete failure left the graph quarantined with no signal. Add `try_delete_object` (Result-returning) and emit a `recovery_sidecar_cleanup_failed` warning diagnostic on failure; the fire-and-forget `delete_object` now delegates to it. - Document why the serve-time loop promotes every `list_recovery_sidecars` diagnostic to a cluster-fatal error (the listing only emits genuine read/parse/version failures, as warnings, whose blast radius serving cannot prove) and note the promote-by-code path if that ever changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
17 KiB
RFC: Server Boots from Cluster State — Phase 5 of the Cluster Control Plane
Status: Landed (5A policy bindings #175; 5B/5C the --cluster boot mode — one PR)
Implementation deviations: (1) cluster mode reuses ServerConfigMode::Multi (a new settings source, not a new enum variant; config_path carries the cluster dir). (2) Stored queries load via QueryRegistry::from_specs from verified blob content, not blob paths. (3) More than one policy bundle binding a single scope is a boot error (the serving pipeline holds one bundle per graph + one server-level; stacking is a later slice). (4) GET /graphs keeps its closed-by-default contract — without a cluster-bound bundle there is no server-level Cedar engine, so enumeration refuses. (5) Graph-attributed startup failures quarantine that graph by default; operators can restore all-or-nothing boot with --require-all-graphs / OMNIGRAPH_REQUIRE_ALL_GRAPHS=1.
Date: 2026-06-10
Builds on: Phase 4 complete (rfc-004-cluster-graph-schema-apply.md, Landed): cluster apply converges graphs, schemas, stored queries, and policies into the cluster catalog. Normative context: cluster-config-specs.md (the migration model's "window 2"), cluster-axioms.md (axiom 15), cluster-config-implementation-spec.md (Phase 5 rollout, Compatibility Stance #7–#9, exit criterion 7).
Target release: unversioned (phased — see Sequencing).
Summary
Give omnigraph-server a second boot source: omnigraph-server --cluster <dir> reads its graph set, stored queries, and Cedar policies from the cluster catalog — state.json's applied revision plus the content-addressed blobs under __cluster/resources/ — instead of omnigraph.yaml. This is the moment "applied" finally means "serving": the standing caveat in every cluster doc since Stage 3A ("the server still boots from omnigraph.yaml") retires for deployments that flip the switch.
Three commitments:
- An exclusive mode switch, never a merge (axiom 15, Compatibility Stance #7).
--cluster <dir>is mutually exclusive with the positional URI,--target, and--config. In cluster mode,omnigraph.yamlis not read at all — not for graphs, not for queries, not for policies. There is no precedence, no key-level aliasing, no fallback read. A deployment serves from one source. - The server serves the applied revision, not the desired config. What's live is what
cluster applyconverged: graph roots recorded in state, query/policy content at the applied digests from the content-addressed catalog. Un-applied config drift never leaks into serving — the serving surface and the ledger cannot disagree (axiom 5 extended to the data path). - The state ledger becomes serving-sufficient. Today one fact needed to serve is missing from state: a policy's
applies_tobindings live only incluster.yaml. A prerequisite slice (5A) records binding metadata into the applied revision at apply time, so a booting server reads state + blobs and nothing else. Without this, "boot from state" would silently become "boot from state and config" — the merged read axiom 15 forbids.
Motivation
Phase 4 closed the convergence loop but left it inert: an operator can declare, plan, approve, and apply an entire deployment, and the running server ignores all of it. The Sarah/Bob test still fails at the last step — Sarah's applied change is visible in cluster status but Bob's clients hit a server still wired to a hand-maintained omnigraph.yaml. Phase 5 makes the catalog the serving source, which is also the precondition for Phase 6 (policy-owned query exposure must filter a catalog the server actually reads).
Non-Goals
- Runtime reconciliation / hot reload. Cluster-mode boot is static, exactly like today's boot: the server reads the applied revision once at startup; picking up a newer applied state means restarting the process. The registry's runtime-mutation seam (the test-only
insert()+ mutateMutexinregistry.rs) stays future-proofing for a later watch-and-reload slice, not this RFC. - Policy-owned query exposure (Phase 6) — but this RFC defines the bridge it sunsets (§D5).
- Remote cluster roots.
--cluster <dir>is a local directory in this phase, same as theclusterCLI commands; S3-hosted cluster roots arrive with external state backends. - Retiring
omnigraph.yamlserver boot. It remains a fully supported mode indefinitely (Compatibility Stance #8: the file's job shrinks; the server-role keys become inert only for deployments that switch). - New management endpoints (
/cluster/statusetc.) — noted as future work; this RFC changes the boot source, not the HTTP surface (beyond OpenAPI regen if anything shifts).
Background (verified against main)
- Server boot today (
omnigraph-server/src/main.rs,lib.rs:891-1029):load_server_settingsapplies a four-rule mode inference (positional URI /--target/server.graph→ Single;--config+graphs:→ Multi), buildsServerConfigMode::{Single,Multi}with per-graphGraphStartupConfig {graph_id, uri, policy_file, queries}, loadsQueryRegistryfrom.gqfiles at settings time (identity-checked), type-checks queries at engine open (validate_and_attach), loads Cedar viaPolicyEngine::load_graph/load_server, installs it withwith_policy, and assemblesGraphRegistry::from_handles(startup-only; lock-freeArcSwapreads). Bind address and bearer tokens come from flags/env, not from graph config. No reload machinery exists. - The catalog today (
omnigraph-cluster):state.jsonrecordsapplied_revision.resources(address → digest) forgraph.*,schema.*,query.<graph>.<name>,policy.<name>, plus statuses, observations (incl. tombstones), approval and recovery records. Query/policy content lives content-addressed at__cluster/resources/query/<graph>/<name>/<digest>.gqandpolicy/<name>/<digest>.yaml. Graph roots are derived:<dir>/graphs/<id>.omni. - The gap: state records a policy's digest only;
applies_to(cluster vs graph refs) lives incluster.yaml. Queries are fine — their graph binding is encoded in the address itself.
Design
D1. The mode switch
New server flag: omnigraph-server --cluster <dir> (the directory containing cluster.yaml, __cluster/, and graphs/). Mutually exclusive — a hard startup error, not a precedence rule — with the positional URI, --target, and --config. --bind, --unauthenticated, and the bearer-token env vars keep working identically: listen address and credentials are process-operational facts, not cluster facts (they differ per replica/host and never belonged to the shared catalog; if a serve: section ever joins cluster.yaml, that's a separate proposal).
Mode inference gains rule 0: --cluster <dir> → Cluster mode, which is always multi-graph routing (/graphs/{graph_id}/...), even for a single declared graph. No flat-route legacy surface in cluster mode — it's a new mode with no compatibility debt to carry.
D2. What the server reads (the applied revision, and only it)
load_server_settings grows a cluster branch that reads, in order:
__cluster/state.json— missing state is a boot error ("runcluster import+cluster applyfirst"). Invalid or unattributable recovery sidecars under__cluster/recoveries/are also a boot error: a server must not start if it cannot prove the blast radius. Valid graph-attributed sidecars quarantine that graph by default and are logged ascluster_recovery_pending;--require-all-graphspromotes them back to a boot error.- Graph set = state's
graph.<id>resources (tombstoned graphs are absent by construction). Each graph's URI is the derived root<dir>/graphs/<id>.omni. A recorded graph whose root does not open quarantines that graph by default;--require-all-graphsrestores the original fail-fast posture. - Stored queries = state's
query.<graph>.<name>entries, content loaded from the catalog blob at the recorded digest. Blob-missing or digest-mismatched is a boot error (the catalog verification semantics from Stage 3B, applied at boot). Queries type-check at engine open exactly as today (validate_and_attach— unchanged). - Policies = state's
policy.<name>entries, content from catalog blobs, bindings from the applied metadata of D3: bundles bound toclusterload as the server-level Cedar engine (PolicyEngine::load_server); bundles bound to graphs load per-graph (PolicyEngine::load_graph) and install viawith_policy— the existing two-gate structure, unchanged. cluster.yamlis parsed only to validate that the directory is a cluster root (and for nothing else — explicitly not for resource content; a divergence between desired config and applied state is served as applied, visible viacluster plan).
Everything downstream of settings construction — GraphStartupConfig, parallel engine opens, GraphRegistry::from_handles, routing middleware, auth, workload admission, OpenAPI — is reused as-is. Cluster mode is a new source for the same boot pipeline, not a new pipeline.
D3. Prerequisite: serving metadata in the applied revision (slice 5A)
State's StateResource records only a digest. To make the ledger serving-sufficient, cluster apply (and the sweep's roll-forwards) additionally record binding metadata for policy resources at apply time:
"applied_revision": {
"resources": {
"policy.base_rbac": {
"digest": "<sha256>",
"applies_to": ["cluster", "graph.knowledge"]
}
}
}
- Additive and optional (
#[serde(default)]) — existing state files parse unchanged; a policy entry withoutapplies_to(applied before 5A) is a boot error in cluster mode with the remedy "re-runcluster apply" (one apply rewrites the metadata; the digest needn't change — the metadata write is part of the state mutation, not the blob). applies_tois normalized to typed addresses (cluster|graph.<id>) at apply time, mirroring the validator's normalization.- Queries need no equivalent: the address (
query.<graph>.<name>) already carries the binding, and the registry key/symbol invariant is enforced at apply (validate) time. - This is deliberately applied metadata, not config mirroring: if
cluster.yamlchanges a binding, the server keeps serving the old binding untilcluster applyconverges it — the same contract as every other resource.
D4. Readiness and failure posture
Cluster-global failures are fail-fast, matching the server's existing stance (bad policy YAML refuses boot). Graph-local failures quarantine the affected graph by default so a single bad graph cannot crash-loop an otherwise healthy cluster. Operators who prefer the original all-or-nothing contract pass --require-all-graphs or set OMNIGRAPH_REQUIRE_ALL_GRAPHS=1, which promotes every graph-local quarantine/open/settings failure to a boot error.
| Condition | Behavior |
|---|---|
state.json missing / unparseable / unsupported version |
boot error |
| invalid/unreadable/unattributable recovery sidecars | boot error (run any state-mutating cluster command to sweep or inspect) |
| valid graph-attributed recovery sidecars | quarantine that graph; strict mode boot error |
| recorded graph root missing or unopenable | quarantine that graph; strict mode boot error |
| query/policy blob missing or digest-mismatched | boot error (run cluster refresh + apply to self-heal, then restart) |
policy entry without applies_to metadata |
boot error ("re-run cluster apply", D3) |
| stored query fails parse/type-check against the live schema | quarantine that graph; strict mode boot error |
| embedding provider configuration for one graph cannot resolve | quarantine that graph; strict mode boot error |
| every applied graph is quarantined or fails startup | boot error (cluster_no_healthy_graphs) |
| state lock held | not an error — boot takes no lock; it reads a point-in-time snapshot of an immutable-once-written state file (the CAS discipline means a concurrent apply produces a new file atomically; the server reads whichever was current at open) |
D5. The mcp.expose bridge in cluster mode
The cluster query registry has no expose flag by design (axiom 14: exposure is a policy decision — Phase 6). Until Phase 6 ships, cluster-mode servers list all stored queries in GET /queries. This is the documented bridge: cluster mode = everything exposed; omnigraph.yaml mode = mcp.expose honored as today. Its named sunset is Phase 6's policy-filtered catalog (Compatibility Stance #9). Invocation remains gated by the existing coarse invoke_query Cedar action in both modes.
D6. Migration path (exit criterion 7)
For an operator running multi-graph from omnigraph.yaml:
- Author
cluster.yamldeclaring the same graphs/queries/policies; place existing graph roots under<dir>/graphs/<id>.omni(or start fresh). cluster import(observes live graphs) →cluster plan→cluster apply(publishes queries/policies into the catalog; with 5A, records policy bindings).- Restart the server with
--cluster <dir>instead of--config omnigraph.yaml. omnigraph.yaml'sgraphs:/serve:/queries:/policy:keys are now inert for this deployment; the file remains the CLI's per-operator config.
Rollback is the same switch in reverse — nothing in cluster mode mutates omnigraph.yaml or the graphs in a way the yaml mode can't serve.
D7. Invariants and axioms check
- Axiom 15 / Stance #7: exclusive flag, hard mutual-exclusion error, zero
omnigraph.yamlreads in cluster mode — no fact has two readers. - Axiom 5: the server serves deployed reality (applied digests), never desired intent; D3 keeps the ledger the single serving source.
- Axiom 12: boot reads without the lock but relies on the atomic-replace write discipline; it never writes state.
- Axiom 14 / Stance #9: the expose-all bridge is named, scoped to cluster mode, and carries its Phase 6 sunset.
- Loud failures (deny-list): every degraded condition is either a typed cluster-global boot error with a remedy or an explicit graph quarantine logged at startup; no silent fallback to the yaml.
--require-all-graphsis the opt-in all-or-nothing mode for operators who treat any degraded graph as fatal. - Respect the boundaries:
omnigraph-clusterstays free of HTTP; the server reads the catalog through a small read-only loader (either apubread surface onomnigraph-clusteror a thin module in the server consuming the documented file formats — implementation picks the one that keepsomnigraph-clusterdependency-light; the state/blob formats are already a documented contract).
Sequencing
| Slice | Scope | Gate |
|---|---|---|
| 5A: serving metadata in state | applies_to recorded on policy resources at apply + sweep roll-forward; additive state schema; status/plan surfacing |
In-crate tests: metadata written/rolled-forward; old state parses; re-apply backfills |
5B: --cluster boot mode |
Flag + mode inference rule 0; catalog loader (state → GraphStartupConfigs + registries + policy engines); readiness table; OpenAPI regen if surface shifts |
Server tests: boot from a converged fixture dir, serve /graphs/{id}/query + stored queries + Cedar gates; D4 cluster-global rows refuse boot; graph-local rows quarantine by default and refuse under --require-all-graphs; e2e: cluster apply then serve — "applied means serving" |
| 5C: docs + caveat retirement | cluster-config.md mode-switch section; server.md/deployment.md; retire the "not serving" caveats for cluster-mode deployments; migration guide (D6) |
check-agents-md.sh; doc accuracy review |
Exit-criteria coverage
Answers implementation-spec exit criterion 7 (server startup + migration path) in full; touches 1 (state schema gains policy binding metadata — additive). Criteria 8 (per-query policy) and 9 (pipelines — descoped to a separate project) remain.
Open Questions
- Loader home:
pubread-only API onomnigraph-cluster(server gains the dependency) vs a server-side reader of the documented formats. Leaningomnigraph-clusterAPI — one parser for the state schema beats two drifting ones; the crate stays HTTP-free either way. - Boot-time blob re-hash: D4 requires digest verification at boot; for large catalogs a stat-only fast path with full hashes behind a flag may matter later. Start with full verification (catalogs are small).
GET /graphsenrichment: cluster mode could expose applied digests/revision in the enumeration — deferred until a consumer exists.- Watch-and-reload: the natural follow-up once cluster mode exists; the registry's mutation seam is ready, but reload semantics (drain? cutover?) deserve their own design.
References
- rfc-004-cluster-graph-schema-apply.md — the convergence machinery this serves
- cluster-config-specs.md §Migration model — window 2 is this RFC
- cluster-axioms.md — axioms 5, 12, 14, 15
- cluster-config-implementation-spec.md — Phase 5 rollout, Compatibility Stance #7–#9, blast-radius rows for the server registry
crates/omnigraph-server/src/lib.rs(load_server_settings,ServerConfigMode,GraphRegistry) — the boot pipeline this extends without forking