fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284)

* fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap

A `cluster apply` carrying a schema change against a graph that has
non-main branches, or an unsupported "needs backfill" migration, armed a
recovery sidecar *before* calling the engine, then left it behind when the
engine rejected the apply pre-movement. The server refuses to boot while
any sidecar is pending, and re-running apply re-armed a fresh sidecar — an
unescapable crash loop. None of the engine rejections are bugs; the trap
is in the apply/serve choreography.

Three coordinated changes:

1. Preview before arming the sidecar. `cluster apply` now runs
   `preview_schema_apply_with_options` before `write_recovery_sidecar`, so
   parser/planner rejections (non-main branches, unsupported plan) fail
   loudly without leaving recovery work behind. The post-preview engine
   error path now deletes the sidecar when the live schema still matches
   the recorded digest (nothing moved), and keeps it only on real
   mid-movement failure — both branches covered by new engine-failpoint
   tests (cluster failpoints now enable omnigraph/failpoints).

2. Per-graph quarantine at serve time instead of whole-cluster refusal.
   A graph-attributed pending sidecar, an unopenable graph root, a query
   parse failure, or an unresolvable embedding provider now quarantines
   just that graph (logged loudly at every boot layer) while healthy
   graphs serve; `/graphs` lists only ready graphs and quarantined routes
   404. Cluster-global problems (missing/unreadable state, malformed or
   unattributable sidecars, shared-catalog or cluster-policy errors, zero
   healthy graphs) stay fail-fast. `--require-all-graphs` /
   OMNIGRAPH_REQUIRE_ALL_GRAPHS=1 restores all-or-nothing boot.

3. Backfill embedding-provider profile metadata on apply. Mirrors the
   existing policy-binding backfill: a pre-5A ledger missing
   `embedding_profile` is now detected as a metadata-only change and
   backfilled by a no-op apply, instead of bricking serve with
   `embedding_provider_profile_missing` forever.

Tests: trap (no sidecar after a rejected apply), both digest-cleanup
branches, per-graph quarantine (cluster + server), embedding backfill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: resilient cluster boot + recovery-sidecar trap fix

Amend RFC-005 D4 readiness posture (cluster-global fail-fast vs graph-local
quarantine; deviation #5 for --require-all-graphs), add the v0.7.0 release
note, and update the user cluster/server/deployment docs and the
OMNIGRAPH_REQUIRE_ALL_GRAPHS env var.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(cluster): surface sidecar-cleanup failures; document severity promotion

Address Greptile review on PR #284:

- The pre-movement sidecar cleanup fast-path discarded `delete_object`'s
  result, so a transient delete failure left the graph quarantined with no
  signal. Add `try_delete_object` (Result-returning) and emit a
  `recovery_sidecar_cleanup_failed` warning diagnostic on failure; the
  fire-and-forget `delete_object` now delegates to it.
- Document why the serve-time loop promotes every `list_recovery_sidecars`
  diagnostic to a cluster-fatal error (the listing only emits genuine
  read/parse/version failures, as warnings, whose blast radius serving
  cannot prove) and note the promote-by-code path if that ever changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Andrew Altshuler 2026-06-19 03:34:15 +03:00 committed by GitHub
parent 7168ee0ed0
commit 7fd23c54a3
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
21 changed files with 1043 additions and 203 deletions

View file

@ -1,7 +1,7 @@
# RFC: Server Boots from Cluster State — Phase 5 of the Cluster Control Plane
**Status:** Landed (5A policy bindings #175; 5B/5C the `--cluster` boot mode — one PR)
**Implementation deviations:** (1) cluster mode reuses `ServerConfigMode::Multi` (a new settings *source*, not a new enum variant; `config_path` carries the cluster dir). (2) Stored queries load via `QueryRegistry::from_specs` from verified blob *content*, not blob paths. (3) More than one policy bundle binding a single scope is a boot error (the serving pipeline holds one bundle per graph + one server-level; stacking is a later slice). (4) `GET /graphs` keeps its closed-by-default contract — without a cluster-bound bundle there is no server-level Cedar engine, so enumeration refuses.
**Implementation deviations:** (1) cluster mode reuses `ServerConfigMode::Multi` (a new settings *source*, not a new enum variant; `config_path` carries the cluster dir). (2) Stored queries load via `QueryRegistry::from_specs` from verified blob *content*, not blob paths. (3) More than one policy bundle binding a single scope is a boot error (the serving pipeline holds one bundle per graph + one server-level; stacking is a later slice). (4) `GET /graphs` keeps its closed-by-default contract — without a cluster-bound bundle there is no server-level Cedar engine, so enumeration refuses. (5) Graph-attributed startup failures quarantine that graph by default; operators can restore all-or-nothing boot with `--require-all-graphs` / `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`.
**Date:** 2026-06-10
**Builds on:** Phase 4 complete ([rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md), Landed): `cluster apply` converges graphs, schemas, stored queries, and policies into the cluster catalog. Normative context: [cluster-config-specs.md](cluster-config-specs.md) (the migration model's "window 2"), [cluster-axioms.md](cluster-axioms.md) (axiom 15), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) (Phase 5 rollout, Compatibility Stance #7#9, exit criterion 7).
**Target release:** unversioned (phased — see Sequencing).
@ -46,8 +46,8 @@ Mode inference gains rule 0: `--cluster <dir>` → **Cluster mode**, which is al
`load_server_settings` grows a cluster branch that reads, in order:
1. `__cluster/state.json`**missing state is a boot error** ("run `cluster import` + `cluster apply` first"). Pending recovery sidecars under `__cluster/recoveries/` are also a boot error (`cluster_recovery_pending`): a server must not start serving a ledger that a sweep is about to rewrite.
2. **Graph set** = state's `graph.<id>` resources (tombstoned graphs are absent by construction). Each graph's URI is the derived root `<dir>/graphs/<id>.omni`. A recorded graph whose root does not open is a boot error — same fail-fast posture as today's bad URI.
1. `__cluster/state.json`**missing state is a boot error** ("run `cluster import` + `cluster apply` first"). Invalid or unattributable recovery sidecars under `__cluster/recoveries/` are also a boot error: a server must not start if it cannot prove the blast radius. Valid graph-attributed sidecars quarantine that graph by default and are logged as `cluster_recovery_pending`; `--require-all-graphs` promotes them back to a boot error.
2. **Graph set** = state's `graph.<id>` resources (tombstoned graphs are absent by construction). Each graph's URI is the derived root `<dir>/graphs/<id>.omni`. A recorded graph whose root does not open quarantines that graph by default; `--require-all-graphs` restores the original fail-fast posture.
3. **Stored queries** = state's `query.<graph>.<name>` entries, content loaded from the catalog blob at the recorded digest. Blob-missing or digest-mismatched is a boot error (the catalog verification semantics from Stage 3B, applied at boot). Queries type-check at engine open exactly as today (`validate_and_attach` — unchanged).
4. **Policies** = state's `policy.<name>` entries, content from catalog blobs, bindings from the applied metadata of D3: bundles bound to `cluster` load as the server-level Cedar engine (`PolicyEngine::load_server`); bundles bound to graphs load per-graph (`PolicyEngine::load_graph`) and install via `with_policy` — the existing two-gate structure, unchanged.
5. `cluster.yaml` is parsed **only** to validate that the directory is a cluster root (and for nothing else — explicitly not for resource content; a divergence between desired config and applied state is *served as applied*, visible via `cluster plan`).
@ -76,16 +76,19 @@ State's `StateResource` records only a digest. To make the ledger serving-suffic
### D4. Readiness and failure posture
Boot is fail-fast, matching the server's existing stance (bad policy YAML refuses boot):
Cluster-global failures are fail-fast, matching the server's existing stance (bad policy YAML refuses boot). Graph-local failures quarantine the affected graph by default so a single bad graph cannot crash-loop an otherwise healthy cluster. Operators who prefer the original all-or-nothing contract pass `--require-all-graphs` or set `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`, which promotes every graph-local quarantine/open/settings failure to a boot error.
| Condition | Behavior |
|---|---|
| `state.json` missing / unparseable / unsupported version | boot error |
| pending recovery sidecars | boot error (run any state-mutating cluster command to sweep) |
| recorded graph root missing or unopenable | boot error |
| invalid/unreadable/unattributable recovery sidecars | boot error (run any state-mutating cluster command to sweep or inspect) |
| valid graph-attributed recovery sidecars | quarantine that graph; strict mode boot error |
| recorded graph root missing or unopenable | quarantine that graph; strict mode boot error |
| query/policy blob missing or digest-mismatched | boot error (run `cluster refresh` + `apply` to self-heal, then restart) |
| policy entry without `applies_to` metadata | boot error ("re-run cluster apply", D3) |
| stored query fails type-check against the live schema | boot error (existing `validate_and_attach` behavior) |
| stored query fails parse/type-check against the live schema | quarantine that graph; strict mode boot error |
| embedding provider configuration for one graph cannot resolve | quarantine that graph; strict mode boot error |
| every applied graph is quarantined or fails startup | boot error (`cluster_no_healthy_graphs`) |
| state lock held | **not** an error — boot takes no lock; it reads a point-in-time snapshot of an immutable-once-written state file (the CAS discipline means a concurrent apply produces a *new* file atomically; the server reads whichever was current at open) |
### D5. The `mcp.expose` bridge in cluster mode
@ -109,7 +112,7 @@ Rollback is the same switch in reverse — nothing in cluster mode mutates `omni
- *Axiom 5*: the server serves deployed reality (applied digests), never desired intent; D3 keeps the ledger the single serving source.
- *Axiom 12*: boot reads without the lock but relies on the atomic-replace write discipline; it never writes state.
- *Axiom 14 / Stance #9*: the expose-all bridge is named, scoped to cluster mode, and carries its Phase 6 sunset.
- *Loud failures (deny-list)*: every degraded condition is a typed boot error with a remedy; no partial serving, no silent fallback to the yaml.
- *Loud failures (deny-list)*: every degraded condition is either a typed cluster-global boot error with a remedy or an explicit graph quarantine logged at startup; no silent fallback to the yaml. `--require-all-graphs` is the opt-in all-or-nothing mode for operators who treat any degraded graph as fatal.
- *Respect the boundaries*: `omnigraph-cluster` stays free of HTTP; the server reads the catalog through a small read-only loader (either a `pub` read surface on `omnigraph-cluster` or a thin module in the server consuming the documented file formats — implementation picks the one that keeps `omnigraph-cluster` dependency-light; the state/blob formats are already a documented contract).
## Sequencing
@ -117,7 +120,7 @@ Rollback is the same switch in reverse — nothing in cluster mode mutates `omni
| Slice | Scope | Gate |
|---|---|---|
| **5A: serving metadata in state** | `applies_to` recorded on policy resources at apply + sweep roll-forward; additive state schema; `status`/plan surfacing | In-crate tests: metadata written/rolled-forward; old state parses; re-apply backfills |
| **5B: `--cluster` boot mode** | Flag + mode inference rule 0; catalog loader (state → `GraphStartupConfig`s + registries + policy engines); readiness table; OpenAPI regen if surface shifts | Server tests: boot from a converged fixture dir, serve `/graphs/{id}/query` + stored queries + Cedar gates; every D4 row refuses boot; e2e: `cluster apply` then serve — "applied means serving" |
| **5B: `--cluster` boot mode** | Flag + mode inference rule 0; catalog loader (state → `GraphStartupConfig`s + registries + policy engines); readiness table; OpenAPI regen if surface shifts | Server tests: boot from a converged fixture dir, serve `/graphs/{id}/query` + stored queries + Cedar gates; D4 cluster-global rows refuse boot; graph-local rows quarantine by default and refuse under `--require-all-graphs`; e2e: `cluster apply` then serve — "applied means serving" |
| **5C: docs + caveat retirement** | `cluster-config.md` mode-switch section; `server.md`/`deployment.md`; retire the "not serving" caveats for cluster-mode deployments; migration guide (D6) | `check-agents-md.sh`; doc accuracy review |
## Exit-criteria coverage

View file

@ -36,6 +36,12 @@ get faster and self-healing, and text embedding becomes provider-independent.
single-graph flat-route mode, positional-`<URI>` boot, and `omnigraph.yaml`
`graphs:`-map boot are gone — add or remove graphs with `cluster apply` and
restart.
- **Resilient cluster boot with strict opt-out.** Graph-attributed startup
failures now quarantine that graph and let healthy graphs serve; `/graphs`
lists only ready graphs, and quarantined graph routes return 404. Cluster-
global failures still refuse boot, and `--require-all-graphs` (or
`OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`) restores fail-fast all-or-nothing startup
for operators who prefer any degraded graph to abort the process.
- **One storage substrate + recovery liveness.** The cluster storage backend and
the engine both go through one `StorageAdapter` (versioned read, conditional
replace/CAS, prefix delete), exercised by a storage fault-injection matrix.

View file

@ -231,9 +231,11 @@ Policy entries additionally record their applied `applies_to` bindings as
normalized typed refs — the state ledger is serving-sufficient for the
future server-boot stage. A change to `applies_to` alone (the policy file
digest unchanged) appears in the plan as an Update marked `binding_change`
(human output: `[bindings]`), applies like any catalog change, and counts
toward convergence; ledgers written before this field existed are backfilled
by the next apply.
(human output: `[bindings]`), and as `metadata_change: policy_bindings` in
structured output. Embedding provider entries similarly carry their resolved
profile in the ledger; pre-profile ledgers are backfilled by an Update with
`metadata_change: embedding_profile`. These metadata-only updates apply like
catalog changes and count toward convergence.
Each plan change carries a `disposition` field — an honest preview of what
`cluster apply` will do with it in this stage: `applied` (executes), `derived`
@ -322,7 +324,9 @@ cluster apply until the approval-artifact stage. Unsupported migrations
(e.g. changing a property's type), engine lock contention, or graphs with
user branches fail loudly as `schema_apply_failed` with the engine's message;
dependent changes are demoted to `blocked` and graph-moving work stops for
the run.
the run. These pre-movement failures are checked before the cluster schema
recovery sidecar is created, so they do not leave stale recovery files behind
or brick later server boot.
`cluster plan` previews schema updates with the engine's real migration plan:
each schema change carries a `migration` field (`supported` + typed steps),
@ -402,20 +406,29 @@ drift is visible. Routing is always multi-graph (`/graphs/{id}/...`). Bearer
tokens and the bind address stay process-level (flags/env) — they are
per-replica facts, not cluster facts.
Boot is fail-fast: missing or unreadable state, pending recovery sidecars,
missing/tampered catalog blobs, policy entries without binding metadata
(pre-binding ledgers — re-run `cluster apply`), an empty graph set, more than
one policy bundle binding a single scope (split or merge bundles; stacked
scopes are a later stage), unopenable graph roots, and stored queries that no
longer type-check all refuse startup with a remedy. A held state lock is
*not* an error — boot reads the atomically-replaced state file without
Boot is fail-fast for cluster-global readiness failures: missing or
unreadable state, invalid/unattributable recovery sidecars,
missing/tampered shared catalog blobs, policy entries without binding
metadata (pre-binding ledgers — re-run `cluster apply`), an empty graph set,
more than one policy bundle binding a single scope (split or merge bundles;
stacked scopes are a later stage), cluster policy problems, or zero healthy
graphs. Valid graph-attributed recovery sidecars, unopenable graph roots, and
stored queries that no longer type-check quarantine that graph instead; the
server logs startup diagnostics, skips the graph's queries and graph-only
policy bindings, and serves any remaining healthy graphs. A held state lock
is *not* an error — boot reads the atomically-replaced state file without
locking.
Use `omnigraph-server --require-all-graphs` (or
`OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`) when degraded serving is not acceptable; it
promotes every graph-local quarantine or startup failure back to a boot error.
Serving is static per process: the server reads the applied revision once at
startup, so picking up newly applied state means restarting it. Stored
queries are all listed in `GET /queries` in cluster mode (the cluster
registry has no expose flag; exposure becomes a policy decision in a later
phase).
startup, so picking up newly applied state means restarting it. `GET /graphs`
lists only ready/served graphs; quarantined graphs are omitted and their
routes return 404. Stored queries are all listed in `GET /queries` in cluster
mode (the cluster registry has no expose flag; exposure becomes a policy
decision in a later phase).
## Status

View file

@ -221,7 +221,8 @@ applied revision is not safely servable. Each refusal names its remedy:
| Boot error | Meaning | Remedy |
|---|---|---|
| `cluster_state_missing` | no ledger | `cluster import`, then `apply` |
| `cluster_recovery_pending` | interrupted operation awaiting sweep | run `cluster apply` (or any state-mutating command), restart |
| `cluster_recovery_pending` | graph was quarantined because an interrupted operation awaits sweep | run `cluster apply` (or any state-mutating command), restart |
| `cluster_no_healthy_graphs` | every applied graph is quarantined or failed startup | sweep/fix the graph-specific failures, then restart |
| `catalog_payload_missing` / `…_digest_mismatch` | catalog blob lost or tampered | `cluster refresh`, then `apply`, restart |
| `policy_bindings_missing` | ledger predates binding metadata | re-run `cluster apply` (backfills), restart |
| `cluster_empty` | applied revision has no graphs | apply a cluster with ≥1 graph |
@ -231,6 +232,13 @@ A held *state lock* is deliberately **not** a boot error — the server reads
the atomically-replaced ledger without locking, so serving never contends
with an in-flight apply.
When at least one graph is healthy, graph-attributed recovery sidecars and
graph-local startup failures do not block the whole server. The affected
graph is skipped, its graph-only policy bindings and queries are omitted,
and `/graphs` lists only the ready graphs. Pass
`omnigraph-server --require-all-graphs` or set
`OMNIGRAPH_REQUIRE_ALL_GRAPHS=1` to make any such quarantine fail startup.
## 6. Deployment patterns
- **Replicas**: any number of `--cluster` servers can serve the same config

View file

@ -208,6 +208,7 @@ When no positional args are given, the image entrypoint
|---|---|
| `OMNIGRAPH_CLUSTER` | Cluster boot source — a config directory or a storage-root URI, forwarded as `--cluster`. The only boot source. |
| `OMNIGRAPH_BIND` | Listen address (default `0.0.0.0:8080`). |
| `OMNIGRAPH_REQUIRE_ALL_GRAPHS` | When truthy, forwarded as `--require-all-graphs`: any graph-local quarantine or startup failure aborts cluster boot instead of serving the healthy subset. |
Per-graph and server-level Cedar policy come from the cluster's applied
revision (authored in `cluster.yaml` and published with `cluster apply`),

View file

@ -15,11 +15,24 @@ omnigraph-server --cluster <dir | s3://…> --bind 0.0.0.0:8080
startup configs (id, URI, optional per-graph policy, stored-query
registry) plus an optional server-level policy, then opens every
configured graph in parallel at startup (bounded concurrency = 4,
fail-fast on the first open error). Routing is always multi-graph —
quarantining graph-specific open failures). Routing is always multi-graph —
requests to bare flat protected paths (`/read`, `/snapshot`, …) return
404; the served surface is `/graphs/{graph_id}/...`. See
[cluster-config.md](../clusters/config.md#serving-from-the-cluster-the-mode-switch)
for what is read and the fail-fast readiness rules.
for what is read and the readiness rules.
Readiness is fail-fast for cluster-global problems: missing or unreadable
state, invalid/unattributable recovery sidecars, unreadable shared catalog
payloads, cluster policy errors, or zero healthy graphs. Graph-attributed
pending recovery sidecars and graph-specific startup failures quarantine
that graph instead; the server logs startup diagnostics and serves the
remaining healthy graphs. `GET /graphs` enumerates ready/served graphs only,
so quarantined graphs are absent and their routes return 404.
Operators who want the original all-or-nothing boot contract can pass
`--require-all-graphs` or set `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`. In that mode,
any graph quarantine, graph-open failure, stored-query startup failure, or
embedding-provider resolution failure aborts startup.
A scheme-qualified argument (`s3://…`) reads the ledger straight from the
storage root, with no local config directory. `--bind`,
@ -27,7 +40,7 @@ storage root, with no local config directory. `--bind`,
### Stored-query validation at startup
If a graph declares a `queries:` registry (see [cli-reference](../cli/reference.md)), the server **loads and type-checks every stored query against that graph's live schema at startup** and **refuses to boot** if any query references a type or property the schema lacks — the same fail-loud posture as a malformed policy file, so schema drift surfaces at the deploy boundary rather than at invocation. Two MCP-exposed queries claiming the same tool name is likewise a boot error. Non-blocking advisories (e.g. an MCP-exposed query with a vector parameter an agent cannot supply) are logged. Validate offline before deploying with `omnigraph queries validate`. Discover the exposed queries as a typed tool catalog with `GET /queries`, and invoke one over HTTP with `POST /queries/{name}` (both below).
If a graph declares a `queries:` registry (see [cli-reference](../cli/reference.md)), the server **loads and type-checks every stored query against that graph's live schema at startup**. Query parse/type failures quarantine that graph; if no graph remains healthy, startup refuses. Two MCP-exposed queries claiming the same tool name are likewise graph-local startup failures. Non-blocking advisories (e.g. an MCP-exposed query with a vector parameter an agent cannot supply) are logged. Validate offline before deploying with `omnigraph queries validate`. Discover the exposed queries as a typed tool catalog with `GET /queries`, and invoke one over HTTP with `POST /queries/{name}` (both below).
## Endpoint inventory
@ -61,7 +74,7 @@ Server-level management endpoints:
| Method | Path | Auth | Action |
|---|---|---|---|
| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list registered graphs |
| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list ready/served graphs |
### Stored-query catalog (`GET /queries`)