mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-21 02:28:07 +02:00
fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284)
* fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap

A `cluster apply` carrying a schema change against a graph that has non-main branches, or an unsupported "needs backfill" migration, armed a recovery sidecar *before* calling the engine, then left it behind when the engine rejected the apply pre-movement. The server refuses to boot while any sidecar is pending, and re-running apply re-armed a fresh sidecar — an unescapable crash loop.

None of the engine rejections are bugs; the trap is in the apply/serve choreography. Three coordinated changes:

1. Preview before arming the sidecar. `cluster apply` now runs `preview_schema_apply_with_options` before `write_recovery_sidecar`, so parser/planner rejections (non-main branches, unsupported plan) fail loudly without leaving recovery work behind. The post-preview engine error path now deletes the sidecar when the live schema still matches the recorded digest (nothing moved), and keeps it only on real mid-movement failure — both branches covered by new engine-failpoint tests (cluster failpoints now enable omnigraph/failpoints).

2. Per-graph quarantine at serve time instead of whole-cluster refusal. A graph-attributed pending sidecar, an unopenable graph root, a query parse failure, or an unresolvable embedding provider now quarantines just that graph (logged loudly at every boot layer) while healthy graphs serve; `/graphs` lists only ready graphs and quarantined routes 404. Cluster-global problems (missing/unreadable state, malformed or unattributable sidecars, shared-catalog or cluster-policy errors, zero healthy graphs) stay fail-fast. `--require-all-graphs` / OMNIGRAPH_REQUIRE_ALL_GRAPHS=1 restores all-or-nothing boot.

3. Backfill embedding-provider profile metadata on apply. Mirrors the existing policy-binding backfill: a pre-5A ledger missing `embedding_profile` is now detected as a metadata-only change and backfilled by a no-op apply, instead of bricking serve with `embedding_provider_profile_missing` forever.

Tests: trap (no sidecar after a rejected apply), both digest-cleanup branches, per-graph quarantine (cluster + server), embedding backfill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: resilient cluster boot + recovery-sidecar trap fix

Amend RFC-005 D4 readiness posture (cluster-global fail-fast vs graph-local quarantine; deviation #5 for --require-all-graphs), add the v0.7.0 release note, and update the user cluster/server/deployment docs and the OMNIGRAPH_REQUIRE_ALL_GRAPHS env var.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(cluster): surface sidecar-cleanup failures; document severity promotion

Address Greptile review on PR #284:

- The pre-movement sidecar cleanup fast-path discarded `delete_object`'s result, so a transient delete failure left the graph quarantined with no signal. Add `try_delete_object` (Result-returning) and emit a `recovery_sidecar_cleanup_failed` warning diagnostic on failure; the fire-and-forget `delete_object` now delegates to it.
- Document why the serve-time loop promotes every `list_recovery_sidecars` diagnostic to a cluster-fatal error (the listing only emits genuine read/parse/version failures, as warnings, whose blast radius serving cannot prove) and note the promote-by-code path if that ever changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
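The apply choreography described in this commit message (preview first, arm the sidecar second, clean up when nothing moved, and surface a failed cleanup as a warning) can be sketched roughly as follows. This is an illustrative model only, not the repository's code: the `Engine` trait, `SidecarStore`, `ApplyReport`, `FakeEngine`, and `apply_schema` are hypothetical names, storage is an in-memory set, and clearing the sidecar after a successful apply is an assumption the source does not state.

```rust
use std::collections::HashSet;

/// Hypothetical engine error; the real engine's error type is not shown in the source.
#[derive(Debug, Clone, PartialEq)]
struct EngineError(String);

/// Hypothetical stand-in for the schema engine.
trait Engine {
    /// Plans the change without moving anything.
    fn preview(&self) -> Result<(), EngineError>;
    /// Executes the change; may fail before or after data has moved.
    fn apply(&mut self) -> Result<(), EngineError>;
    /// Digest of the schema that is live right now.
    fn live_schema_digest(&self) -> String;
}

/// In-memory stand-in for recovery-sidecar storage, keyed by graph id.
#[derive(Default)]
struct SidecarStore {
    pending: HashSet<String>,
    fail_deletes: bool, // simulates a transient storage failure
}

impl SidecarStore {
    fn arm(&mut self, graph: &str) {
        self.pending.insert(graph.to_string());
    }

    /// Result-returning delete, so callers can surface a failed cleanup.
    fn try_delete(&mut self, graph: &str) -> Result<(), String> {
        if self.fail_deletes {
            return Err(format!("could not delete sidecar for {graph}"));
        }
        self.pending.remove(graph);
        Ok(())
    }
}

struct ApplyReport {
    result: Result<(), EngineError>,
    warnings: Vec<String>,
}

fn apply_schema(engine: &mut dyn Engine, store: &mut SidecarStore, graph: &str) -> ApplyReport {
    let mut warnings = Vec::new();
    // 1. Preview before arming: a rejection here leaves no sidecar behind.
    if let Err(e) = engine.preview() {
        return ApplyReport { result: Err(e), warnings };
    }
    // 2. Record the live digest, then arm the sidecar.
    let recorded = engine.live_schema_digest();
    store.arm(graph);
    // 3. Apply, then check whether the live schema moved.
    let result = engine.apply();
    let nothing_moved = engine.live_schema_digest() == recorded;
    // Keep the sidecar only on a real mid-movement failure.
    // Assumption: a successful apply also clears its own sidecar.
    if (result.is_ok() || nothing_moved) && store.try_delete(graph).is_err() {
        warnings.push("recovery_sidecar_cleanup_failed".to_string());
    }
    ApplyReport { result, warnings }
}

/// Test double that fails in a chosen phase.
#[derive(Clone, Copy)]
enum Outcome {
    Succeeds,
    FailsBeforeMove,
    FailsAfterMove,
}

struct FakeEngine {
    digest: String,
    reject_preview: bool,
    outcome: Outcome,
}

impl Engine for FakeEngine {
    fn preview(&self) -> Result<(), EngineError> {
        if self.reject_preview {
            Err(EngineError("unsupported plan".into()))
        } else {
            Ok(())
        }
    }

    fn apply(&mut self) -> Result<(), EngineError> {
        match self.outcome {
            Outcome::Succeeds => {
                self.digest = "v2".into();
                Ok(())
            }
            Outcome::FailsBeforeMove => Err(EngineError("lock contention".into())),
            Outcome::FailsAfterMove => {
                self.digest = "partial".into();
                Err(EngineError("interrupted".into()))
            }
        }
    }

    fn live_schema_digest(&self) -> String {
        self.digest.clone()
    }
}
```

In this model, a rejected preview and a pre-movement engine failure both leave `pending` empty, a mid-movement failure keeps the sidecar for a later sweep, and a failed delete shows up as a `recovery_sidecar_cleanup_failed` warning rather than being silently dropped.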
parent 7168ee0ed0
commit 7fd23c54a3
21 changed files with 1043 additions and 203 deletions
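The serve-time posture described in the commit message (cluster-global problems fail fast, graph-local problems quarantine only that graph, and `--require-all-graphs` restores all-or-nothing boot) can be modelled as a small decision function. This is a sketch under stated assumptions, not the server's implementation: `GraphStatus`, `BootDecision`, and `decide_boot` are hypothetical names, and the order of the checks and the error code reported under `--require-all-graphs` are guesses. The string codes `cluster_empty` and `cluster_no_healthy_graphs` are taken from the boot-error table in the documentation diff.

```rust
/// Per-graph startup outcome (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum GraphStatus {
    Ready,
    /// Carries a reason code such as "cluster_recovery_pending".
    Quarantined(String),
}

/// What the server does after evaluating readiness (hypothetical type).
#[derive(Debug, PartialEq)]
enum BootDecision {
    /// Serve the ready graphs; quarantined ones are omitted from `/graphs`.
    Serve { ready: Vec<String>, quarantined: Vec<String> },
    /// Refuse to boot, naming an error code.
    Refuse(String),
}

fn decide_boot(
    cluster_global_error: Option<&str>,
    graphs: &[(&str, GraphStatus)],
    require_all_graphs: bool,
) -> BootDecision {
    // Cluster-global problems always fail fast.
    if let Some(code) = cluster_global_error {
        return BootDecision::Refuse(code.to_string());
    }
    if graphs.is_empty() {
        return BootDecision::Refuse("cluster_empty".to_string());
    }

    let mut ready = Vec::new();
    let mut quarantined = Vec::new();
    for (id, status) in graphs {
        match status {
            GraphStatus::Ready => ready.push(id.to_string()),
            GraphStatus::Quarantined(reason) => {
                quarantined.push((id.to_string(), reason.clone()))
            }
        }
    }

    // Assumption: in all-or-nothing mode the first quarantine reason
    // becomes the boot error.
    if require_all_graphs {
        if let Some((_, reason)) = quarantined.first() {
            return BootDecision::Refuse(reason.clone());
        }
    }
    if ready.is_empty() {
        return BootDecision::Refuse("cluster_no_healthy_graphs".to_string());
    }
    BootDecision::Serve {
        ready,
        quarantined: quarantined.into_iter().map(|(id, _)| id).collect(),
    }
}
```

With one ready graph and one graph quarantined by a pending sidecar, this model serves the ready graph by default and refuses to boot when `require_all_graphs` is set.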
@@ -231,9 +231,11 @@ Policy entries additionally record their applied `applies_to` bindings as
 normalized typed refs — the state ledger is serving-sufficient for the
 future server-boot stage. A change to `applies_to` alone (the policy file
 digest unchanged) appears in the plan as an Update marked `binding_change`
-(human output: `[bindings]`), applies like any catalog change, and counts
-toward convergence; ledgers written before this field existed are backfilled
-by the next apply.
+(human output: `[bindings]`), and as `metadata_change: policy_bindings` in
+structured output. Embedding provider entries similarly carry their resolved
+profile in the ledger; pre-profile ledgers are backfilled by an Update with
+`metadata_change: embedding_profile`. These metadata-only updates apply like
+catalog changes and count toward convergence.

 Each plan change carries a `disposition` field — an honest preview of what
 `cluster apply` will do with it in this stage: `applied` (executes), `derived`
@@ -322,7 +324,9 @@ cluster apply until the approval-artifact stage. Unsupported migrations
 (e.g. changing a property's type), engine lock contention, or graphs with
 user branches fail loudly as `schema_apply_failed` with the engine's message;
 dependent changes are demoted to `blocked` and graph-moving work stops for
-the run.
+the run. These pre-movement failures are checked before the cluster schema
+recovery sidecar is created, so they do not leave stale recovery files behind
+or brick later server boot.

 `cluster plan` previews schema updates with the engine's real migration plan:
 each schema change carries a `migration` field (`supported` + typed steps),
@@ -402,20 +406,29 @@ drift is visible. Routing is always multi-graph (`/graphs/{id}/...`). Bearer
 tokens and the bind address stay process-level (flags/env) — they are
 per-replica facts, not cluster facts.

-Boot is fail-fast: missing or unreadable state, pending recovery sidecars,
-missing/tampered catalog blobs, policy entries without binding metadata
-(pre-binding ledgers — re-run `cluster apply`), an empty graph set, more than
-one policy bundle binding a single scope (split or merge bundles; stacked
-scopes are a later stage), unopenable graph roots, and stored queries that no
-longer type-check all refuse startup with a remedy. A held state lock is
-*not* an error — boot reads the atomically-replaced state file without
+Boot is fail-fast for cluster-global readiness failures: missing or
+unreadable state, invalid/unattributable recovery sidecars,
+missing/tampered shared catalog blobs, policy entries without binding
+metadata (pre-binding ledgers — re-run `cluster apply`), an empty graph set,
+more than one policy bundle binding a single scope (split or merge bundles;
+stacked scopes are a later stage), cluster policy problems, or zero healthy
+graphs. Valid graph-attributed recovery sidecars, unopenable graph roots, and
+stored queries that no longer type-check quarantine that graph instead; the
+server logs startup diagnostics, skips the graph's queries and graph-only
+policy bindings, and serves any remaining healthy graphs. A held state lock
+is *not* an error — boot reads the atomically-replaced state file without
 locking.

+Use `omnigraph-server --require-all-graphs` (or
+`OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`) when degraded serving is not acceptable; it
+promotes every graph-local quarantine or startup failure back to a boot error.
+
 Serving is static per process: the server reads the applied revision once at
-startup, so picking up newly applied state means restarting it. Stored
-queries are all listed in `GET /queries` in cluster mode (the cluster
-registry has no expose flag; exposure becomes a policy decision in a later
-phase).
+startup, so picking up newly applied state means restarting it. `GET /graphs`
+lists only ready/served graphs; quarantined graphs are omitted and their
+routes return 404. Stored queries are all listed in `GET /queries` in cluster
+mode (the cluster registry has no expose flag; exposure becomes a policy
+decision in a later phase).

 ## Status

@@ -221,7 +221,8 @@ applied revision is not safely servable. Each refusal names its remedy:
 | Boot error | Meaning | Remedy |
 |---|---|---|
 | `cluster_state_missing` | no ledger | `cluster import`, then `apply` |
-| `cluster_recovery_pending` | interrupted operation awaiting sweep | run `cluster apply` (or any state-mutating command), restart |
+| `cluster_recovery_pending` | graph was quarantined because an interrupted operation awaits sweep | run `cluster apply` (or any state-mutating command), restart |
+| `cluster_no_healthy_graphs` | every applied graph is quarantined or failed startup | sweep/fix the graph-specific failures, then restart |
 | `catalog_payload_missing` / `…_digest_mismatch` | catalog blob lost or tampered | `cluster refresh`, then `apply`, restart |
 | `policy_bindings_missing` | ledger predates binding metadata | re-run `cluster apply` (backfills), restart |
 | `cluster_empty` | applied revision has no graphs | apply a cluster with ≥1 graph |
@@ -231,6 +232,13 @@ A held *state lock* is deliberately **not** a boot error — the server reads
 the atomically-replaced ledger without locking, so serving never contends
 with an in-flight apply.

+When at least one graph is healthy, graph-attributed recovery sidecars and
+graph-local startup failures do not block the whole server. The affected
+graph is skipped, its graph-only policy bindings and queries are omitted,
+and `/graphs` lists only the ready graphs. Pass
+`omnigraph-server --require-all-graphs` or set
+`OMNIGRAPH_REQUIRE_ALL_GRAPHS=1` to make any such quarantine fail startup.
+
 ## 6. Deployment patterns

 - **Replicas**: any number of `--cluster` servers can serve the same config

@@ -208,6 +208,7 @@ When no positional args are given, the image entrypoint
 |---|---|
 | `OMNIGRAPH_CLUSTER` | Cluster boot source — a config directory or a storage-root URI, forwarded as `--cluster`. The only boot source. |
 | `OMNIGRAPH_BIND` | Listen address (default `0.0.0.0:8080`). |
+| `OMNIGRAPH_REQUIRE_ALL_GRAPHS` | When truthy, forwarded as `--require-all-graphs`: any graph-local quarantine or startup failure aborts cluster boot instead of serving the healthy subset. |

 Per-graph and server-level Cedar policy come from the cluster's applied
 revision (authored in `cluster.yaml` and published with `cluster apply`),

@@ -15,11 +15,24 @@ omnigraph-server --cluster <dir | s3://…> --bind 0.0.0.0:8080
 startup configs (id, URI, optional per-graph policy, stored-query
 registry) plus an optional server-level policy, then opens every
 configured graph in parallel at startup (bounded concurrency = 4,
-fail-fast on the first open error). Routing is always multi-graph —
+quarantining graph-specific open failures). Routing is always multi-graph —
 requests to bare flat protected paths (`/read`, `/snapshot`, …) return
 404; the served surface is `/graphs/{graph_id}/...`. See
 [cluster-config.md](../clusters/config.md#serving-from-the-cluster-the-mode-switch)
-for what is read and the fail-fast readiness rules.
+for what is read and the readiness rules.
+
+Readiness is fail-fast for cluster-global problems: missing or unreadable
+state, invalid/unattributable recovery sidecars, unreadable shared catalog
+payloads, cluster policy errors, or zero healthy graphs. Graph-attributed
+pending recovery sidecars and graph-specific startup failures quarantine
+that graph instead; the server logs startup diagnostics and serves the
+remaining healthy graphs. `GET /graphs` enumerates ready/served graphs only,
+so quarantined graphs are absent and their routes return 404.
+
+Operators who want the original all-or-nothing boot contract can pass
+`--require-all-graphs` or set `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`. In that mode,
+any graph quarantine, graph-open failure, stored-query startup failure, or
+embedding-provider resolution failure aborts startup.

 A scheme-qualified argument (`s3://…`) reads the ledger straight from the
 storage root, with no local config directory. `--bind`,
@@ -27,7 +40,7 @@ storage root, with no local config directory. `--bind`,

 ### Stored-query validation at startup

-If a graph declares a `queries:` registry (see [cli-reference](../cli/reference.md)), the server **loads and type-checks every stored query against that graph's live schema at startup** and **refuses to boot** if any query references a type or property the schema lacks — the same fail-loud posture as a malformed policy file, so schema drift surfaces at the deploy boundary rather than at invocation. Two MCP-exposed queries claiming the same tool name is likewise a boot error. Non-blocking advisories (e.g. an MCP-exposed query with a vector parameter an agent cannot supply) are logged. Validate offline before deploying with `omnigraph queries validate`. Discover the exposed queries as a typed tool catalog with `GET /queries`, and invoke one over HTTP with `POST /queries/{name}` (both below).
+If a graph declares a `queries:` registry (see [cli-reference](../cli/reference.md)), the server **loads and type-checks every stored query against that graph's live schema at startup**. Query parse/type failures quarantine that graph; if no graph remains healthy, startup refuses. Two MCP-exposed queries claiming the same tool name are likewise graph-local startup failures. Non-blocking advisories (e.g. an MCP-exposed query with a vector parameter an agent cannot supply) are logged. Validate offline before deploying with `omnigraph queries validate`. Discover the exposed queries as a typed tool catalog with `GET /queries`, and invoke one over HTTP with `POST /queries/{name}` (both below).

 ## Endpoint inventory

@@ -61,7 +74,7 @@ Server-level management endpoints:

 | Method | Path | Auth | Action |
 |---|---|---|---|
-| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list registered graphs |
+| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list ready/served graphs |

 ### Stored-query catalog (`GET /queries`)
