fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284)

* fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap A `cluster apply` carrying a schema change against a graph that has non-main branches, or an unsupported "needs backfill" migration, armed a recovery sidecar *before* calling the engine, then left it behind when the engine rejected the apply pre-movement. The server refuses to boot while any sidecar is pending, and re-running apply re-armed a fresh sidecar — an unescapable crash loop. None of the engine rejections are bugs; the trap is in the apply/serve choreography. Three coordinated changes: 1. Preview before arming the sidecar. `cluster apply` now runs `preview_schema_apply_with_options` before `write_recovery_sidecar`, so parser/planner rejections (non-main branches, unsupported plan) fail loudly without leaving recovery work behind. The post-preview engine error path now deletes the sidecar when the live schema still matches the recorded digest (nothing moved), and keeps it only on real mid-movement failure — both branches covered by new engine-failpoint tests (cluster failpoints now enable omnigraph/failpoints). 2. Per-graph quarantine at serve time instead of whole-cluster refusal. A graph-attributed pending sidecar, an unopenable graph root, a query parse failure, or an unresolvable embedding provider now quarantines just that graph (logged loudly at every boot layer) while healthy graphs serve; `/graphs` lists only ready graphs and quarantined routes 404. Cluster-global problems (missing/unreadable state, malformed or unattributable sidecars, shared-catalog or cluster-policy errors, zero healthy graphs) stay fail-fast. `--require-all-graphs` / OMNIGRAPH_REQUIRE_ALL_GRAPHS=1 restores all-or-nothing boot. 3. Backfill embedding-provider profile metadata on apply. Mirrors the existing policy-binding backfill: a pre-5A ledger missing `embedding_profile` is now detected as a metadata-only change and backfilled by a no-op apply, instead of bricking serve with `embedding_provider_profile_missing` forever. Tests: trap (no sidecar after a rejected apply), both digest-cleanup branches, per-graph quarantine (cluster + server), embedding backfill. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: resilient cluster boot + recovery-sidecar trap fix Amend RFC-005 D4 readiness posture (cluster-global fail-fast vs graph-local quarantine; deviation #5 for --require-all-graphs), add the v0.7.0 release note, and update the user cluster/server/deployment docs and the OMNIGRAPH_REQUIRE_ALL_GRAPHS env var. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(cluster): surface sidecar-cleanup failures; document severity promotion Address Greptile review on PR #284: - The pre-movement sidecar cleanup fast-path discarded `delete_object`'s result, so a transient delete failure left the graph quarantined with no signal. Add `try_delete_object` (Result-returning) and emit a `recovery_sidecar_cleanup_failed` warning diagnostic on failure; the fire-and-forget `delete_object` now delegates to it. - Document why the serve-time loop promotes every `list_recovery_sidecars` diagnostic to a cluster-fatal error (the listing only emits genuine read/parse/version failures, as warnings, whose blast radius serving cannot prove) and note the promote-by-code path if that ever changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 02:38:06 +02:00 · 2026-06-19 03:34:15 +03:00 · 2026-06-19 03:34:15 +03:00 · 7fd23c54a3
commit 7fd23c54a3
parent 7168ee0ed0
21 changed files with 1043 additions and 203 deletions
--- a/docs/user/operations/server.md
+++ b/docs/user/operations/server.md
@ -15,11 +15,24 @@ omnigraph-server --cluster <dir | s3://…> --bind 0.0.0.0:8080
 startup configs (id, URI, optional per-graph policy, stored-query
 registry) plus an optional server-level policy, then opens every
 configured graph in parallel at startup (bounded concurrency = 4,
-fail-fast on the first open error). Routing is always multi-graph —
+quarantining graph-specific open failures). Routing is always multi-graph —
 requests to bare flat protected paths (`/read`, `/snapshot`, …) return
 404; the served surface is `/graphs/{graph_id}/...`. See
 [cluster-config.md](../clusters/config.md#serving-from-the-cluster-the-mode-switch)
-for what is read and the fail-fast readiness rules.
+for what is read and the readiness rules.
+
+Readiness is fail-fast for cluster-global problems: missing or unreadable
+state, invalid/unattributable recovery sidecars, unreadable shared catalog
+payloads, cluster policy errors, or zero healthy graphs. Graph-attributed
+pending recovery sidecars and graph-specific startup failures quarantine
+that graph instead; the server logs startup diagnostics and serves the
+remaining healthy graphs. `GET /graphs` enumerates ready/served graphs only,
+so quarantined graphs are absent and their routes return 404.
+
+Operators who want the original all-or-nothing boot contract can pass
+`--require-all-graphs` or set `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`. In that mode,
+any graph quarantine, graph-open failure, stored-query startup failure, or
+embedding-provider resolution failure aborts startup.

 A scheme-qualified argument (`s3://…`) reads the ledger straight from the
 storage root, with no local config directory. `--bind`,
@ -27,7 +40,7 @@ storage root, with no local config directory. `--bind`,

 ### Stored-query validation at startup

-If a graph declares a `queries:` registry (see [cli-reference](../cli/reference.md)), the server **loads and type-checks every stored query against that graph's live schema at startup** and **refuses to boot** if any query references a type or property the schema lacks — the same fail-loud posture as a malformed policy file, so schema drift surfaces at the deploy boundary rather than at invocation. Two MCP-exposed queries claiming the same tool name is likewise a boot error. Non-blocking advisories (e.g. an MCP-exposed query with a vector parameter an agent cannot supply) are logged. Validate offline before deploying with `omnigraph queries validate`. Discover the exposed queries as a typed tool catalog with `GET /queries`, and invoke one over HTTP with `POST /queries/{name}` (both below).
+If a graph declares a `queries:` registry (see [cli-reference](../cli/reference.md)), the server **loads and type-checks every stored query against that graph's live schema at startup**. Query parse/type failures quarantine that graph; if no graph remains healthy, startup refuses. Two MCP-exposed queries claiming the same tool name are likewise graph-local startup failures. Non-blocking advisories (e.g. an MCP-exposed query with a vector parameter an agent cannot supply) are logged. Validate offline before deploying with `omnigraph queries validate`. Discover the exposed queries as a typed tool catalog with `GET /queries`, and invoke one over HTTP with `POST /queries/{name}` (both below).

 ## Endpoint inventory

@ -61,7 +74,7 @@ Server-level management endpoints:

 | Method | Path | Auth | Action |
 |---|---|---|---|
-| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list registered graphs |
+| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list ready/served graphs |

 ### Stored-query catalog (`GET /queries`)