# RFC: Cluster Graph & Schema Apply — Phase 4 of the Cluster Control Plane **Status:** Landed (4A #170, 4B #171, 4C — all shipped) **Implementation deviations:** (1) D3 row 8 retires the stale delete sidecar and lets the still-approved delete re-propose and retry, instead of a pending-block — prefix removal is idempotent, so the retry is the repair. (2) The approver/actor flag is the CLI's existing global `--as`, not a dedicated `--actor`/`--by`. (3) Consumed approval artifacts are rewritten with `consumed_at` rather than moved into state — the file and the ledger record both survive independently (axiom 11). **Date:** 2026-06-10 **Builds on:** cluster Stages 1–3B (shipped: validate/plan/status/refresh/import/force-unlock, config-only `cluster apply` with content-addressed catalog publish, catalog payload verification, failpoint-proven crash/CAS recovery for the apply protocol). Normative context: [cluster-config-specs.md](cluster-config-specs.md), [cluster-axioms.md](cluster-axioms.md), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md). **Target release:** unversioned (phased — see Sequencing); no cluster functionality is in a tagged release yet. ## Summary Extend `cluster apply` from config-only resources (stored queries, policy bundles) to **graph-moving resources**: graph create, cluster-driven schema apply, and graph delete. This is the nested-publish territory the implementation spec flags as its highest-risk decision: a graph's Lance manifest can move (via the engine's own atomic publish) *before* the cluster's JSON state CAS lands, and a crash in that window must never be silent, never acknowledged as success, and never repaired by guessing. Three design commitments make the phase tractable: 1. **Cluster recovery is roll-forward-only.** The engine's recovery sidecars (`__recovery/{ulid}.json`, the open-time sweep in `db/manifest/recovery.rs`) already make every graph-level operation atomic *within the graph* — a schema apply either fully published or fully recovers at the next open. The cluster therefore never rolls a graph back. Cluster sidecars exist to **classify and record**: after a crash, the sweep observes the live graph, decides "moved / didn't move / moved unexpectedly," and either rolls the *cluster state* forward to match observable reality (axiom 5) or surfaces a loud pending-repair condition. The cluster holds no second transaction log and no rollback hammer — that would duplicate substrate behavior the engine already owns (invariant: respect the substrate). 2. **Irreversible operations require a digest-bound approval artifact.** Graph delete and `allow_data_loss` schema applies consume an explicit `__cluster/approvals/{ulid}.json` record bound to the exact change digests, written by a new `cluster approve` command and retired into the state ledger's `approval_records` (the durable audit reference of axiom 11). 3. **The operator identity becomes explicit.** `cluster apply` gains an actor, threaded to the engine's `apply_schema_as` so Cedar enforcement and commit attribution work unchanged. The cluster control plane adds no policy engine of its own (transport/auth stay at the boundary). ## Motivation After Stage 3B, the control plane converges everything *except* the resources that define the data plane. The Sarah/Bob test is half-passed: Bob can see that Sarah changed a schema (`plan` shows the deferred change; `refresh` shows drift), but the system cannot act on it — Sarah still applies schemas with the per-graph tool and the cluster ledger trails reality. Graph creation is worse: a new graph in `cluster.yaml` blocks every dependent query and policy with `dependency_missing` until someone runs `omnigraph init` by hand at exactly the derived path. Phase 4 closes the loop: desired config in, converged deployment out, for the full resource vocabulary of Stage 1. The implementation spec's hard gate for this phase — failpoint recovery tests proving the movement-before-state-publish gap — was deliberately front-loaded: Stage 3B shipped the failpoint infrastructure and the apply-side crash/CAS tests. What remains is the design this RFC supplies: the sidecar schema, the recovery decision matrix, the approval artifact, and the ordering rules. ## Non-Goals - **Server boot from cluster state** (Phase 5) — applied graphs/schemas still serve nothing; the server boots from `omnigraph.yaml` until the explicit per-deployment mode switch (axiom 15). - **Policy-owned query exposure / `mcp.expose` retirement** (Phase 6). - **Pipelines, embeddings, UI, aliases, bindings, providers, `env_file`** (Phase 7 and reserved fields). - **External or Lance-backed state backends**; the local JSON backend + lock/CAS remains the substrate. - **A cluster manifest publisher.** Deferred, per the spec: it becomes interesting only if the sidecar + repair path proves too weak for the accepted safety contract. Nothing in this design forecloses it. - **Multi-graph atomic apply groups.** Cross-graph convergence remains statusful-partial per resource; one graph's failure never pretends to fence another's success. - **Graph rename.** Stable-identity-across-rename is an open known gap at the schema level already; graph rename compounds it and is explicitly out of scope (see Open Questions). ## Background What Phase 4 builds on (all shipped): - **The engine's recovery discipline.** Writers that can advance Lance HEAD before manifest publish write `__recovery/{ulid}.json` sidecars carrying per-table pins (`expected_version`, `post_commit_pin`); `Omnigraph::open` in read-write mode classifies every pinned table (`NoMovement` / `RolledPastExpected` / `UnexpectedAtP1` / `UnexpectedMultistep` / `InvariantViolation`) and decides all-or-nothing: roll forward via one manifest publish, or roll back via `Dataset::restore`, recording an audit row attributed to `omnigraph:recovery`. The cluster inherits the *vocabulary* of this design but not its mechanics — see the roll-forward-only argument below. - **The engine's schema-apply surface.** `apply_schema_as(desired_source, SchemaApplyOptions { allow_data_loss }, actor)` returns `SchemaApplyResult { supported, applied, manifest_version, steps }`; `preview_schema_apply_with_options` returns the migration plan plus desired catalog without applying; the `__schema_apply_lock__` branch serializes schema applies graph-wide and refuses to run while user branches exist. Policy enforcement (`enforce(SchemaApply, TargetBranch("main"), actor)`) happens before the lock. - **Graph init.** `Omnigraph::init(uri, schema_source)` with a strict preflight (errors if schema artifacts exist) and an atomic `_schema.pg` claim. A documented gap: a failed init does not clean up Lance datasets or `__manifest/` it already created. - **No engine graph-delete primitive.** Deleting a graph today means removing its object-store prefix. This RFC works with that fact rather than waiting on a primitive. - **Cluster state and observations.** `state.json` (locked, CAS-checked, atomically replaced) already records per-resource digests, statuses, `observations["graph."]` with `manifest_version` and live schema digest, plus empty `approval_records` / `recovery_records` placeholders reserved for this phase. - **Stage 3A/3B apply mechanics.** Dispositions (`applied`/`derived`/`deferred`/`blocked`), content-addressed catalog publish before the state CAS, persisted-statuses contract on write failure, idempotent re-apply, payload verification with the drift + self-heal loop, and failpoints `cluster_apply.after_payload_phase` / `cluster_apply.before_state_write`. ## Design ### D1. Resource semantics: which dispositions change The Stage 3A classifier gains executable rows. Everything else (catalog resources, `derived` composites, blocked dependents) is unchanged: | Change | Stage 3A disposition | Phase 4 disposition | |---|---|---| | `graph.` Create | Deferred | **Applied** (4A): `Omnigraph::init` at the derived root | | `schema.` Create | Deferred | **Applied with the graph create** (the init carries the schema) | | `schema.` Update | Deferred | **Applied** (4B): `apply_schema_as` against the live graph | | `graph.` / `schema.` Delete | Deferred | **Applied behind approval** (4C): prefix removal | | `query.*`/`policy.*` blocked on the above | Blocked | Unblocked in the same apply once the dependency lands (ordering, D5) | Graph roots remain **derived**: `ClusterRoot/graphs/.omni` (high-risk decision #2 dispositioned: external graph roots are a separate, explicit future feature, not this phase). ### D2. Cluster recovery sidecar (exit criterion 2, first half) Written under the state lock **before** any engine call that can move or create a graph manifest; deleted only **after** the cluster state CAS that records the outcome lands. ```json { "schema_version": 1, "operation_id": "", "started_at": "", "actor": "", "kind": "graph_create | schema_apply | graph_delete", "graph_id": "", "graph_uri": "", "observed_manifest_version": 7, "expected_manifest_version": null, "desired_schema_digest": "", "state_cas_base": "sha256:" } ``` Path: `__cluster/recoveries/{operation_id}.json`, atomic write (temp + rename, the `write_state` discipline). Notes: - `observed_manifest_version` is the live graph's main-branch manifest version read at sidecar-write time (`null` for `graph_create` — no graph yet). This is the fencing value: apply refuses to proceed if it differs from the version recorded in `observations["graph."]` at plan time *and* re-observed under the lock (the same recompute-under-lock posture as Stage 3A's diff). - `expected_manifest_version` starts `null` and is **rewritten into the sidecar immediately after the engine call returns** with `SchemaApplyResult.manifest_version` (or the post-init observation). A crash before that rewrite leaves `null`, which the sweep treats as "engine call outcome unknown — classify by observation only." For `graph_delete` the field is **always `null`** — prefix removal produces no new manifest version, so there is no rewrite step for that kind; delete sidecars are classified purely by root presence + state tombstone (D3 rows 7/7b/8). - `state_cas_base` is **recorded for audit and diagnostics only — the sweep decision logic never consults it.** The sweep re-reads `state.json` under the lock and performs ordinary CAS-checked writes, so an independent state mutation between sidecar write and sweep is handled by the CAS like any other concurrent write, not by this field. Its value is forensic: a recovery audit entry can show which state revision the interrupted operation departed from. - One sidecar per graph-moving resource operation. Apply processes graph-moving operations strictly sequentially (D5), so at most one sidecar is pending per apply run *per graph*, and the sweep processes sidecars in ULID order. ### D3. Recovery decision matrix — roll-forward-only (exit criterion 2, second half) **Why no rollback.** The engine's sidecars already guarantee that a schema apply is atomic within the graph: by the time any cluster-visible manifest version moved, the engine either fully published or will recover all-or-nothing at its next read-write open. A cluster-level rollback would mean un-publishing a successfully published graph commit — rewriting substrate history the cluster does not own, duplicating the engine's transaction discipline (deny-list: custom transaction manager; state that drifts from what it can be derived from). The cluster's job after a crash is therefore *epistemic*, not transactional: observe what the graph actually is, and converge the ledger to it or refuse loudly. **Sweep trigger.** The sweep runs at the start of every state-mutating cluster command (`apply`, `refresh`, `import`), under the state lock, before the command's own work — mirroring the engine's open-time sweep gating (read-only `status`/`plan`/`validate` report pending sidecars as a warning, `cluster_recovery_pending`, but do not act). | # | Sidecar kind | Observation | Decision | |---|---|---|---| | 1 | any | Graph at `observed_manifest_version` (nothing moved) | Engine call never landed. Delete sidecar; the command's own plan/apply re-proposes the change. | | 2 | `graph_create` / `schema_apply` | Graph at `expected_manifest_version`; state already records the outcome | Crash fell between state CAS and sidecar delete. Delete sidecar; done. | | 3 | `schema_apply` | Graph at `expected_manifest_version` (or, when `expected` is `null`, live schema digest == `desired_schema_digest`); state stale | **Roll the cluster state forward**: record the live schema digest, recompute the graph composite, set statuses `applied`, append a `recovery_records` entry (audit), CAS-write, delete sidecar. | | 4 | `graph_create` | Graph opens read-only and its schema digest == `desired_schema_digest`; state stale | Same roll-forward as #3 (the create completed). | | 5 | `graph_create` | Root exists but the graph does not open (the engine's partial-init gap) | Status `error`, condition `graph_create_incomplete`, message: remove the root and re-run apply. **No auto-delete** — reconciler-initiated deletion is the same data-loss class as human deletion (high-risk decision #7). Sidecar kept until the operator acts and a sweep observes a clean state. | | 6 | any | Graph at any other version (out-of-band movement during the crash window) | Status `drifted`, condition `actual_applied_state_pending`; sidecar kept; the command refuses graph-moving work for that graph until `cluster refresh` re-observes and the operator re-plans. No success is acknowledged for the interrupted operation. | | 7 | `graph_delete` | Root absent; state already tombstoned | The delete kind's analog of row 2 (no manifest exists to version-check): crash fell between state CAS and sidecar delete. Delete sidecar; done. | | 7b | `graph_delete` | Root absent; state stale | Roll forward: tombstone the graph subtree out of state (D6), record audit, delete sidecar. Idempotent — re-entry after a crash mid-row lands in row 7. | | 8 | `graph_delete` | Root present (delete crashed mid-prefix-removal or never started) | If the approval artifact is still attached (D4), the delete is re-proposed by plan and re-runnable; status `drifted`, condition `graph_delete_incomplete`. Partial prefix removal leaves an unopenable graph — same operator message as #5. | Rows 3, 4 and 7b are the only mutations the sweep performs, and each is an ordinary CAS-checked state write under the lock — the sweep introduces no new write machinery. ### D4. Approval artifacts (exit criteria 1-partial and 6-partial; axioms 8 and 11) The irreversible tier — graph delete, `allow_data_loss` schema apply (hard drops) — requires a recorded human decision that survives any reconstruction of state. `plan` already emits `approvals_required`; Phase 4 adds the consumption side. **Artifact** (`__cluster/approvals/{approval_id}.json`, written by the new command, never by apply): ```json { "schema_version": 1, "approval_id": "", "resource": "graph.scratch", "operation": "delete", "reason": "", "bound_config_digest": "", "bound_before_digest": "", "bound_after_digest": "", "approved_by": "", "created_at": "" } ``` **Flow.** `cluster approve --config --by ` re-runs the plan under the lock, locates the pending gated change for that address, prints it, and writes the artifact bound to the exact digests. `cluster apply` executes a gated change only when a pending artifact matches **all** bound digests — a stale approval (config moved since) matches nothing, is reported (`approval_stale` warning), and the change stays `blocked` with condition `approval_required`. On successful execution the artifact file is moved into `state.approval_records[approval_id]` in the same state CAS that records the outcome (the state references the audit fact; losing state does not lose the approval, which is also why `import` preserves `approval_records` it finds — see D7). `allow_data_loss` is **never** a CLI flag on `cluster apply`; destructive promotion is expressed only through an approval artifact for the specific schema change. The default schema apply path runs with `allow_data_loss: false` (soft drops), which the spec's tier table classes as a recoverable definition rewrite — plan warning, no artifact. ### D5. Actor, ordering, and apply groups (exit criterion 4) **Actor.** `cluster apply --actor ` / `cluster approve --by `, with `OMNIGRAPH_CLUSTER_ACTOR` as the env fallback. The actor is threaded to `apply_schema_as` (so engine-side Cedar enforcement fires wherever a policy checker is installed and graph commits are attributed), recorded in sidecars, approvals, and `recovery_records`. The cluster adds no policy engine: graph-moving operations inherit the engine's gate; catalog-only operations remain ungated as today. When no actor is supplied and the target graph has no policy checker, behavior is unchanged from Stage 3A (`None` actor, as the engine's no-actor variants do); when a checker is installed the engine's existing "actor required" error surfaces as a typed diagnostic (`actor_required`). **Ordering.** Deterministic, dependency-shaped, within one apply run: 1. graph creates (with their schemas) — ULID-stable order by graph id 2. schema applies — sequential, one graph at a time (each holds that graph's `__schema_apply_lock__`; the cluster state lock already serializes cluster-side) 3. catalog writes (queries/policies) — the Stage 3A path, unchanged 4. deletes last (catalog deletes, then approved graph deletes) Each graph-moving operation is its own apply group: sidecar → engine call → sidecar update → continue. The **state CAS stays single and final** (one write at the end recording every outcome), preserving Stage 3A's protocol; sidecars cover the widened gap between individual engine calls and that final CAS. A failure mid-sequence stops graph-moving work, reports per-resource statuses for everything already done (loud partials), and leaves sidecars for the sweep. Cross-graph atomicity is explicitly not promised. **Failpoints.** Each engine-call boundary gets a failpoint (`cluster_apply.before_graph_create`, `cluster_apply.after_graph_create`, `cluster_apply.before_schema_apply`, `cluster_apply.after_schema_apply`, `cluster_apply.before_graph_delete`) so every row of the D3 matrix is testable with the Stage 3B harness. ### D6. Graph delete (4C) With no engine primitive, delete is cluster-orchestrated prefix removal: verify the approval artifact → sidecar (`kind: graph_delete`, current manifest version recorded) → recursively remove `ClusterRoot/graphs/.omni` → state CAS that tombstones the graph subtree (graph, schema, and its queries removed from `applied_revision.resources` and `resource_statuses`; observation replaced by a tombstone record `{deleted_at, approval_id}`) → delete sidecar. Catalog blobs of the graph's queries stay (GC remains a later stage, consistent with Stage 3A deletes). The engine gap (no atomic prefix delete; partial removal leaves an unopenable root) is handled by D3 row 8, and this RFC registers a desire for an engine-level `destroy_graph` primitive as future work, not a dependency. ### D7. Plan and import integration - **Plan** gains real data impact for schema updates: where Stage 3A showed only a digest diff, Phase 4 calls `preview_schema_apply_with_options` against the live graph (read-only) and embeds the migration steps + drop warnings in the change record — the "data-aware provider peek" from the high-level spec, bounded to graphs the plan already observes. Failure to preview (graph unreachable) degrades to the digest diff with a warning, never blocks planning. - **Import/refresh** already observe live graphs; Phase 4 makes `import` preserve `approval_records` and pending `recoveries/` it finds (state reconstruction must not orphan audit facts or pending repairs). ### D8. Invariants and axioms check - *Respect the substrate / no custom transaction manager*: cluster never rolls back graphs; engine sidecars own intra-graph atomicity (D3). - *Axiom 5 (state = deployed reality)*: recovery converges the ledger to observation, never observation to ledger. - *Axiom 8 (reversibility gates apply, including drift correction)*: approval artifacts for the irreversible tier; sweep never auto-deletes (D3 rows 5/8). - *Axiom 9 (plan-time integrity)*: ordering is planner-derived from existing dependency edges; no runtime discovery. - *Axiom 11 (approvals in a durable ledger)*: artifacts are files first, state-referenced after consumption; reconstructable state never re-derives who approved. - *Axiom 12 (locked state)*: every new write path (sidecars, approvals consumption, sweep) runs under the existing state lock. - *Axiom 15 (single owner / mode switch)*: nothing here reads from or writes to `omnigraph.yaml`; applied graphs still serve nothing until Phase 5. - *Loud partials (deny-list)*: every crash window lands in a typed status/condition; no path acknowledges unverified success. ## Migration / Compatibility Additive. Stage 3A/3B behavior is unchanged for catalog-only configs; existing state files gain no required fields (`approval_records`/`recovery_records` already exist, empty). New CLI surface: `cluster approve`, `--actor` on `cluster apply`. A deployment that never declares schema changes or graph creates sees identical behavior to Stage 3B. The honored-or-rejected posture continues: no new `cluster.yaml` fields are introduced by this phase (graph roots stay derived). ## Sequencing | Stage | Scope | Gate | |---|---|---| | **4A graph create** | `Omnigraph::init` at derived roots; create-intent sidecar; D3 rows 1/2/4/5; dependents unblock in-run | Failpoint tests for crash-before/after-init; e2e: declare graph → apply → import-less convergence | | **4B schema apply** | Full sidecar lifecycle; roll-forward sweep (D3 rows 3/6); actor threading; plan data-impact preview; soft-drop default | Failpoint tests per matrix row; e2e: schema evolution fully cluster-driven (replaces the Stage 3A defer→manual→refresh loop) | | **4C graph delete** | `cluster approve` + artifact consumption; prefix removal; tombstones; D3 rows 7/7b/8 | Failpoint tests incl. partial-removal; e2e: gated delete refused without artifact, executed with it, stale artifact rejected | Each stage is a separate PR with boundary-matched tests (the Stage 1–3B discipline). 4A ships first because it moves no existing manifest; 4B is the heart; 4C last because it is the only irreversible-tier executor and consumes the approval machinery 4B's hard-drop path also needs. ## Exit-criteria coverage (implementation spec) | # | Criterion | This RFC | |---|---|---| | 1 | State/status/approval/recovery schemas + paths | **Approval + recovery schemas: answered** (D2, D4). State/status: unchanged from shipped Stage 2A/3A. | | 2 | Sidecar schema + recovery decision matrix | **Answered** (D2, D3) | | 3 | State backend interface / lock+CAS | Unchanged (local JSON backend, shipped) — out of scope | | 4 | Apply group syntax + dependency ordering | **Answered** (D5): per-resource groups, fixed kind-ordering; no user-declared group syntax this phase | | 5 | Plan JSON schema incl. blast radius + approvals | **Extended** (D7 preview embedding); base schema shipped | | 6 | Bootstrap authority + first-actor | **Partial** (D5 actor threading); cluster bootstrap authority remains open (below) | | 7 | Server startup migration | Phase 5 — deferred | | 8 | Per-query policy / `mcp.expose` bridge | Phase 6 — deferred | | 9 | Pipeline runtime | Phase 7 — deferred | ## Open Questions 1. **Bootstrap authority.** The first apply against a fresh cluster has no policy engine to consult and no actor registry; today the answer is "whoever holds the object store wins." The durable story (out-of-band privileged bootstrap actor, per the high-level spec §open-questions) is unresolved and blocks nothing in this phase, since graph-level Cedar still gates wherever installed. 2. **Approval expiry.** Artifacts are digest-bound, so config drift invalidates them naturally; is wall-clock expiry also wanted (operator hygiene), or does digest binding suffice? 3. **Sweep on read-only commands.** This RFC has `status`/`plan` only *warn* about pending sidecars. If operator feedback shows the warn-but-don't-repair posture causes confusion, promoting `plan` to run the sweep (it already takes the lock) is a compatible change. 4. **Graph rename.** Deliberately out of scope; interacts with the rename-stable-identity known gap in [invariants.md](invariants.md). A rename today is delete + create — i.e., gated, lossy, and honest about it. 5. **Engine `destroy_graph` primitive.** 4C's prefix removal is correct but unatomic; if the engine grows a graph-destroy primitive with its own recovery, D6 collapses onto it (the cluster code is shaped to delegate). ## References - [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) — phases, exit criteria, high-risk decisions, approval tiers - [cluster-axioms.md](cluster-axioms.md) — axioms 5, 8, 9, 11, 12, 15 - [cluster-config-specs.md](cluster-config-specs.md) — the data-aware provider peek; state/ledger model - `crates/omnigraph/src/db/manifest/recovery.rs` — the engine sidecar + classifier this design mirrors in vocabulary and deliberately does not duplicate in mechanics - [writes.md](writes.md), [invariants.md](invariants.md) — engine recovery protocol and the deny-list this design is checked against