docs(cluster): RFC-004 — graph & schema apply design (Phase 4) (#168)

* docs(cluster): RFC-004 — graph & schema apply design (Phase 4) The design the implementation spec's exit criteria require before graph-moving cluster apply ships. Core positions: - Cluster recovery is roll-forward-only: the engine's own sidecars make every graph-level operation atomic within the graph, so the cluster never rolls a graph back — its sidecars (__cluster/recoveries/{ulid}.json) classify and record, converging the ledger to observable reality (axiom 5) or surfacing a loud pending-repair condition. Eight-row decision matrix, every row testable with the Stage 3B failpoint harness. - Irreversible operations (graph delete, allow_data_loss schema apply) consume digest-bound approval artifacts written by a new cluster approve command and retired into state.approval_records (axiom 11). A stale approval can never authorize a different change. - cluster apply gains an actor, threaded to apply_schema_as so engine Cedar enforcement and commit attribution work unchanged; the cluster adds no policy engine of its own. - Deterministic ordering (creates -> schema applies -> catalog -> deletes), per-resource apply groups, cross-graph atomicity explicitly not promised. - Staged 4A graph create / 4B schema apply / 4C graph delete, each gated on per-matrix-row failpoint tests. Answers exit criteria 2 and 4 fully, 1/5/6 partially; 3/7/8/9 deferred to their phases (coverage table in the RFC). Linked from the dev index and the implementation spec's Phase 4 section. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs(cluster): RFC-004 review fixes — graph_delete sweep rows, state_cas_base contract Two greptile findings: (1) D3 row 2 could not be evaluated for graph_delete (no manifest to version-check after prefix removal) and 'root absent, state already tombstoned' fell into the stale row — split into rows 7 (delete's analog of row 2) and 7b (the roll-forward), with expected_manifest_version documented as always null for the delete kind. (2) state_cas_base is now explicitly audit/diagnostics-only — the sweep never consults it; independent state mutations are handled by the ordinary CAS like any concurrent write. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 01:45:14 +02:00 · 2026-06-10 04:34:14 +03:00 · 2026-06-10 04:34:14 +03:00 · 58c66a54a2
commit 58c66a54a2
parent effb9cc068
3 changed files with 215 additions and 0 deletions
--- a/docs/dev/cluster-config-implementation-spec.md
+++ b/docs/dev/cluster-config-implementation-spec.md
@ -588,6 +588,10 @@ replacement would make every invariant harder to audit. -->

 ### Phase 4: Graph And Schema Apply

+Detailed design: [rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md)
+(cluster sidecar schema, roll-forward-only recovery matrix, approval artifacts,
+actor threading, 4A/4B/4C staging).
+
 - Add graph create/delete as cluster resources.
 - Make schema apply cluster-aware, with sidecar coverage for graph manifest
  movements before JSON state publish.
--- a/docs/dev/index.md
+++ b/docs/dev/index.md
@ -74,6 +74,7 @@ Working documents for in-flight feature work. Removed when the work lands.
 | Config & CLI architecture — layered config, client targeting, file naming (MR-973 / MR-974 / MR-981) | [rfc-002-config-cli-architecture.md](rfc-002-config-cli-architecture.md) |
 | MCP server surface — full tool parity, stored queries, modular auth (MR-969 / MR-956 / MR-974) | [rfc-003-mcp-server-surface.md](rfc-003-mcp-server-surface.md) |
 | Future cluster control plane — declarative as-code config, JSON state ledger, reconciler | [cluster-config-specs.md](cluster-config-specs.md), [cluster-axioms.md](cluster-axioms.md), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) |
+| Cluster graph & schema apply — Phase 4 sidecars, roll-forward recovery, approval artifacts | [rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md) |

 ## Boundary

--- a/docs/dev/rfc-004-cluster-graph-schema-apply.md
+++ b/docs/dev/rfc-004-cluster-graph-schema-apply.md
@ -0,0 +1,210 @@
+# RFC: Cluster Graph & Schema Apply — Phase 4 of the Cluster Control Plane
+
+**Status:** Proposed
+**Date:** 2026-06-10
+**Builds on:** cluster Stages 1–3B (shipped: validate/plan/status/refresh/import/force-unlock, config-only `cluster apply` with content-addressed catalog publish, catalog payload verification, failpoint-proven crash/CAS recovery for the apply protocol). Normative context: [cluster-config-specs.md](cluster-config-specs.md), [cluster-axioms.md](cluster-axioms.md), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md).
+**Target release:** unversioned (phased — see Sequencing); no cluster functionality is in a tagged release yet.
+
+## Summary
+
+Extend `cluster apply` from config-only resources (stored queries, policy bundles) to **graph-moving resources**: graph create, cluster-driven schema apply, and graph delete. This is the nested-publish territory the implementation spec flags as its highest-risk decision: a graph's Lance manifest can move (via the engine's own atomic publish) *before* the cluster's JSON state CAS lands, and a crash in that window must never be silent, never acknowledged as success, and never repaired by guessing.
+
+Three design commitments make the phase tractable:
+
+1. **Cluster recovery is roll-forward-only.** The engine's recovery sidecars (`__recovery/{ulid}.json`, the open-time sweep in `db/manifest/recovery.rs`) already make every graph-level operation atomic *within the graph* — a schema apply either fully published or fully recovers at the next open. The cluster therefore never rolls a graph back. Cluster sidecars exist to **classify and record**: after a crash, the sweep observes the live graph, decides "moved / didn't move / moved unexpectedly," and either rolls the *cluster state* forward to match observable reality (axiom 5) or surfaces a loud pending-repair condition. The cluster holds no second transaction log and no rollback hammer — that would duplicate substrate behavior the engine already owns (invariant: respect the substrate).
+2. **Irreversible operations require a digest-bound approval artifact.** Graph delete and `allow_data_loss` schema applies consume an explicit `__cluster/approvals/{ulid}.json` record bound to the exact change digests, written by a new `cluster approve` command and retired into the state ledger's `approval_records` (the durable audit reference of axiom 11).
+3. **The operator identity becomes explicit.** `cluster apply` gains an actor, threaded to the engine's `apply_schema_as` so Cedar enforcement and commit attribution work unchanged. The cluster control plane adds no policy engine of its own (transport/auth stay at the boundary).
+
+## Motivation
+
+After Stage 3B, the control plane converges everything *except* the resources that define the data plane. The Sarah/Bob test is half-passed: Bob can see that Sarah changed a schema (`plan` shows the deferred change; `refresh` shows drift), but the system cannot act on it — Sarah still applies schemas with the per-graph tool and the cluster ledger trails reality. Graph creation is worse: a new graph in `cluster.yaml` blocks every dependent query and policy with `dependency_missing` until someone runs `omnigraph init` by hand at exactly the derived path. Phase 4 closes the loop: desired config in, converged deployment out, for the full resource vocabulary of Stage 1.
+
+The implementation spec's hard gate for this phase — failpoint recovery tests proving the movement-before-state-publish gap — was deliberately front-loaded: Stage 3B shipped the failpoint infrastructure and the apply-side crash/CAS tests. What remains is the design this RFC supplies: the sidecar schema, the recovery decision matrix, the approval artifact, and the ordering rules.
+
+## Non-Goals
+
+- **Server boot from cluster state** (Phase 5) — applied graphs/schemas still serve nothing; the server boots from `omnigraph.yaml` until the explicit per-deployment mode switch (axiom 15).
+- **Policy-owned query exposure / `mcp.expose` retirement** (Phase 6).
+- **Pipelines, embeddings, UI, aliases, bindings, providers, `env_file`** (Phase 7 and reserved fields).
+- **External or Lance-backed state backends**; the local JSON backend + lock/CAS remains the substrate.
+- **A cluster manifest publisher.** Deferred, per the spec: it becomes interesting only if the sidecar + repair path proves too weak for the accepted safety contract. Nothing in this design forecloses it.
+- **Multi-graph atomic apply groups.** Cross-graph convergence remains statusful-partial per resource; one graph's failure never pretends to fence another's success.
+- **Graph rename.** Stable-identity-across-rename is an open known gap at the schema level already; graph rename compounds it and is explicitly out of scope (see Open Questions).
+
+## Background
+
+What Phase 4 builds on (all shipped):
+
+- **The engine's recovery discipline.** Writers that can advance Lance HEAD before manifest publish write `__recovery/{ulid}.json` sidecars carrying per-table pins (`expected_version`, `post_commit_pin`); `Omnigraph::open` in read-write mode classifies every pinned table (`NoMovement` / `RolledPastExpected` / `UnexpectedAtP1` / `UnexpectedMultistep` / `InvariantViolation`) and decides all-or-nothing: roll forward via one manifest publish, or roll back via `Dataset::restore`, recording an audit row attributed to `omnigraph:recovery`. The cluster inherits the *vocabulary* of this design but not its mechanics — see the roll-forward-only argument below.
+- **The engine's schema-apply surface.** `apply_schema_as(desired_source, SchemaApplyOptions { allow_data_loss }, actor)` returns `SchemaApplyResult { supported, applied, manifest_version, steps }`; `preview_schema_apply_with_options` returns the migration plan plus desired catalog without applying; the `__schema_apply_lock__` branch serializes schema applies graph-wide and refuses to run while user branches exist. Policy enforcement (`enforce(SchemaApply, TargetBranch("main"), actor)`) happens before the lock.
+- **Graph init.** `Omnigraph::init(uri, schema_source)` with a strict preflight (errors if schema artifacts exist) and an atomic `_schema.pg` claim. A documented gap: a failed init does not clean up Lance datasets or `__manifest/` it already created.
+- **No engine graph-delete primitive.** Deleting a graph today means removing its object-store prefix. This RFC works with that fact rather than waiting on a primitive.
+- **Cluster state and observations.** `state.json` (locked, CAS-checked, atomically replaced) already records per-resource digests, statuses, `observations["graph.<id>"]` with `manifest_version` and live schema digest, plus empty `approval_records` / `recovery_records` placeholders reserved for this phase.
+- **Stage 3A/3B apply mechanics.** Dispositions (`applied`/`derived`/`deferred`/`blocked`), content-addressed catalog publish before the state CAS, persisted-statuses contract on write failure, idempotent re-apply, payload verification with the drift + self-heal loop, and failpoints `cluster_apply.after_payload_phase` / `cluster_apply.before_state_write`.
+
+## Design
+
+### D1. Resource semantics: which dispositions change
+
+The Stage 3A classifier gains executable rows. Everything else (catalog resources, `derived` composites, blocked dependents) is unchanged:
+
+| Change | Stage 3A disposition | Phase 4 disposition |
+|---|---|---|
+| `graph.<id>` Create | Deferred | **Applied** (4A): `Omnigraph::init` at the derived root |
+| `schema.<id>` Create | Deferred | **Applied with the graph create** (the init carries the schema) |
+| `schema.<id>` Update | Deferred | **Applied** (4B): `apply_schema_as` against the live graph |
+| `graph.<id>` / `schema.<id>` Delete | Deferred | **Applied behind approval** (4C): prefix removal |
+| `query.*`/`policy.*` blocked on the above | Blocked | Unblocked in the same apply once the dependency lands (ordering, D5) |
+
+Graph roots remain **derived**: `ClusterRoot/graphs/<id>.omni` (high-risk decision #2 dispositioned: external graph roots are a separate, explicit future feature, not this phase).
+
+### D2. Cluster recovery sidecar (exit criterion 2, first half)
+
+Written under the state lock **before** any engine call that can move or create a graph manifest; deleted only **after** the cluster state CAS that records the outcome lands.
+
+```json
+{
+  "schema_version": 1,
+  "operation_id": "<ulid>",
+  "started_at": "<rfc3339>",
+  "actor": "<id or null>",
+  "kind": "graph_create | schema_apply | graph_delete",
+  "graph_id": "<id>",
+  "graph_uri": "<derived root>",
+  "observed_manifest_version": 7,
+  "expected_manifest_version": null,
+  "desired_schema_digest": "<sha256 of the schema source being applied>",
+  "state_cas_base": "sha256:<state.json digest at sidecar write>"
+}
+```
+
+Path: `__cluster/recoveries/{operation_id}.json`, atomic write (temp + rename, the `write_state` discipline). Notes:
+
+- `observed_manifest_version` is the live graph's main-branch manifest version read at sidecar-write time (`null` for `graph_create` — no graph yet). This is the fencing value: apply refuses to proceed if it differs from the version recorded in `observations["graph.<id>"]` at plan time *and* re-observed under the lock (the same recompute-under-lock posture as Stage 3A's diff).
+- `expected_manifest_version` starts `null` and is **rewritten into the sidecar immediately after the engine call returns** with `SchemaApplyResult.manifest_version` (or the post-init observation). A crash before that rewrite leaves `null`, which the sweep treats as "engine call outcome unknown — classify by observation only." For `graph_delete` the field is **always `null`** — prefix removal produces no new manifest version, so there is no rewrite step for that kind; delete sidecars are classified purely by root presence + state tombstone (D3 rows 7/7b/8).
+- `state_cas_base` is **recorded for audit and diagnostics only — the sweep decision logic never consults it.** The sweep re-reads `state.json` under the lock and performs ordinary CAS-checked writes, so an independent state mutation between sidecar write and sweep is handled by the CAS like any other concurrent write, not by this field. Its value is forensic: a recovery audit entry can show which state revision the interrupted operation departed from.
+- One sidecar per graph-moving resource operation. Apply processes graph-moving operations strictly sequentially (D5), so at most one sidecar is pending per apply run *per graph*, and the sweep processes sidecars in ULID order.
+
+### D3. Recovery decision matrix — roll-forward-only (exit criterion 2, second half)
+
+**Why no rollback.** The engine's sidecars already guarantee that a schema apply is atomic within the graph: by the time any cluster-visible manifest version moved, the engine either fully published or will recover all-or-nothing at its next read-write open. A cluster-level rollback would mean un-publishing a successfully published graph commit — rewriting substrate history the cluster does not own, duplicating the engine's transaction discipline (deny-list: custom transaction manager; state that drifts from what it can be derived from). The cluster's job after a crash is therefore *epistemic*, not transactional: observe what the graph actually is, and converge the ledger to it or refuse loudly.
+
+**Sweep trigger.** The sweep runs at the start of every state-mutating cluster command (`apply`, `refresh`, `import`), under the state lock, before the command's own work — mirroring the engine's open-time sweep gating (read-only `status`/`plan`/`validate` report pending sidecars as a warning, `cluster_recovery_pending`, but do not act).
+
+| # | Sidecar kind | Observation | Decision |
+|---|---|---|---|
+| 1 | any | Graph at `observed_manifest_version` (nothing moved) | Engine call never landed. Delete sidecar; the command's own plan/apply re-proposes the change. |
+| 2 | `graph_create` / `schema_apply` | Graph at `expected_manifest_version`; state already records the outcome | Crash fell between state CAS and sidecar delete. Delete sidecar; done. |
+| 3 | `schema_apply` | Graph at `expected_manifest_version` (or, when `expected` is `null`, live schema digest == `desired_schema_digest`); state stale | **Roll the cluster state forward**: record the live schema digest, recompute the graph composite, set statuses `applied`, append a `recovery_records` entry (audit), CAS-write, delete sidecar. |
+| 4 | `graph_create` | Graph opens read-only and its schema digest == `desired_schema_digest`; state stale | Same roll-forward as #3 (the create completed). |
+| 5 | `graph_create` | Root exists but the graph does not open (the engine's partial-init gap) | Status `error`, condition `graph_create_incomplete`, message: remove the root and re-run apply. **No auto-delete** — reconciler-initiated deletion is the same data-loss class as human deletion (high-risk decision #7). Sidecar kept until the operator acts and a sweep observes a clean state. |
+| 6 | any | Graph at any other version (out-of-band movement during the crash window) | Status `drifted`, condition `actual_applied_state_pending`; sidecar kept; the command refuses graph-moving work for that graph until `cluster refresh` re-observes and the operator re-plans. No success is acknowledged for the interrupted operation. |
+| 7 | `graph_delete` | Root absent; state already tombstoned | The delete kind's analog of row 2 (no manifest exists to version-check): crash fell between state CAS and sidecar delete. Delete sidecar; done. |
+| 7b | `graph_delete` | Root absent; state stale | Roll forward: tombstone the graph subtree out of state (D6), record audit, delete sidecar. Idempotent — re-entry after a crash mid-row lands in row 7. |
+| 8 | `graph_delete` | Root present (delete crashed mid-prefix-removal or never started) | If the approval artifact is still attached (D4), the delete is re-proposed by plan and re-runnable; status `drifted`, condition `graph_delete_incomplete`. Partial prefix removal leaves an unopenable graph — same operator message as #5. |
+
+Rows 3, 4 and 7b are the only mutations the sweep performs, and each is an ordinary CAS-checked state write under the lock — the sweep introduces no new write machinery.
+
+### D4. Approval artifacts (exit criteria 1-partial and 6-partial; axioms 8 and 11)
+
+The irreversible tier — graph delete, `allow_data_loss` schema apply (hard drops) — requires a recorded human decision that survives any reconstruction of state. `plan` already emits `approvals_required`; Phase 4 adds the consumption side.
+
+**Artifact** (`__cluster/approvals/{approval_id}.json`, written by the new command, never by apply):
+
+```json
+{
+  "schema_version": 1,
+  "approval_id": "<ulid>",
+  "resource": "graph.scratch",
+  "operation": "delete",
+  "reason": "<the approvals_required reason from plan>",
+  "bound_config_digest": "<desired config digest the plan was computed from>",
+  "bound_before_digest": "<state digest of the resource, or null>",
+  "bound_after_digest": "<desired digest, or null for delete>",
+  "approved_by": "<operator id, required>",
+  "created_at": "<rfc3339>"
+}
+```
+
+**Flow.** `cluster approve <resource-address> --config <dir> --by <operator>` re-runs the plan under the lock, locates the pending gated change for that address, prints it, and writes the artifact bound to the exact digests. `cluster apply` executes a gated change only when a pending artifact matches **all** bound digests — a stale approval (config moved since) matches nothing, is reported (`approval_stale` warning), and the change stays `blocked` with condition `approval_required`. On successful execution the artifact file is moved into `state.approval_records[approval_id]` in the same state CAS that records the outcome (the state references the audit fact; losing state does not lose the approval, which is also why `import` preserves `approval_records` it finds — see D7).
+
+`allow_data_loss` is **never** a CLI flag on `cluster apply`; destructive promotion is expressed only through an approval artifact for the specific schema change. The default schema apply path runs with `allow_data_loss: false` (soft drops), which the spec's tier table classes as a recoverable definition rewrite — plan warning, no artifact.
+
+### D5. Actor, ordering, and apply groups (exit criterion 4)
+
+**Actor.** `cluster apply --actor <id>` / `cluster approve --by <id>`, with `OMNIGRAPH_CLUSTER_ACTOR` as the env fallback. The actor is threaded to `apply_schema_as` (so engine-side Cedar enforcement fires wherever a policy checker is installed and graph commits are attributed), recorded in sidecars, approvals, and `recovery_records`. The cluster adds no policy engine: graph-moving operations inherit the engine's gate; catalog-only operations remain ungated as today. When no actor is supplied and the target graph has no policy checker, behavior is unchanged from Stage 3A (`None` actor, as the engine's no-actor variants do); when a checker is installed the engine's existing "actor required" error surfaces as a typed diagnostic (`actor_required`).
+
+**Ordering.** Deterministic, dependency-shaped, within one apply run:
+
+1. graph creates (with their schemas) — ULID-stable order by graph id
+2. schema applies — sequential, one graph at a time (each holds that graph's `__schema_apply_lock__`; the cluster state lock already serializes cluster-side)
+3. catalog writes (queries/policies) — the Stage 3A path, unchanged
+4. deletes last (catalog deletes, then approved graph deletes)
+
+Each graph-moving operation is its own apply group: sidecar → engine call → sidecar update → continue. The **state CAS stays single and final** (one write at the end recording every outcome), preserving Stage 3A's protocol; sidecars cover the widened gap between individual engine calls and that final CAS. A failure mid-sequence stops graph-moving work, reports per-resource statuses for everything already done (loud partials), and leaves sidecars for the sweep. Cross-graph atomicity is explicitly not promised.
+
+**Failpoints.** Each engine-call boundary gets a failpoint (`cluster_apply.before_graph_create`, `cluster_apply.after_graph_create`, `cluster_apply.before_schema_apply`, `cluster_apply.after_schema_apply`, `cluster_apply.before_graph_delete`) so every row of the D3 matrix is testable with the Stage 3B harness.
+
+### D6. Graph delete (4C)
+
+With no engine primitive, delete is cluster-orchestrated prefix removal: verify the approval artifact → sidecar (`kind: graph_delete`, current manifest version recorded) → recursively remove `ClusterRoot/graphs/<id>.omni` → state CAS that tombstones the graph subtree (graph, schema, and its queries removed from `applied_revision.resources` and `resource_statuses`; observation replaced by a tombstone record `{deleted_at, approval_id}`) → delete sidecar. Catalog blobs of the graph's queries stay (GC remains a later stage, consistent with Stage 3A deletes). The engine gap (no atomic prefix delete; partial removal leaves an unopenable root) is handled by D3 row 8, and this RFC registers a desire for an engine-level `destroy_graph` primitive as future work, not a dependency.
+
+### D7. Plan and import integration
+
+- **Plan** gains real data impact for schema updates: where Stage 3A showed only a digest diff, Phase 4 calls `preview_schema_apply_with_options` against the live graph (read-only) and embeds the migration steps + drop warnings in the change record — the "data-aware provider peek" from the high-level spec, bounded to graphs the plan already observes. Failure to preview (graph unreachable) degrades to the digest diff with a warning, never blocks planning.
+- **Import/refresh** already observe live graphs; Phase 4 makes `import` preserve `approval_records` and pending `recoveries/` it finds (state reconstruction must not orphan audit facts or pending repairs).
+
+### D8. Invariants and axioms check
+
+- *Respect the substrate / no custom transaction manager*: cluster never rolls back graphs; engine sidecars own intra-graph atomicity (D3).
+- *Axiom 5 (state = deployed reality)*: recovery converges the ledger to observation, never observation to ledger.
+- *Axiom 8 (reversibility gates apply, including drift correction)*: approval artifacts for the irreversible tier; sweep never auto-deletes (D3 rows 5/8).
+- *Axiom 9 (plan-time integrity)*: ordering is planner-derived from existing dependency edges; no runtime discovery.
+- *Axiom 11 (approvals in a durable ledger)*: artifacts are files first, state-referenced after consumption; reconstructable state never re-derives who approved.
+- *Axiom 12 (locked state)*: every new write path (sidecars, approvals consumption, sweep) runs under the existing state lock.
+- *Axiom 15 (single owner / mode switch)*: nothing here reads from or writes to `omnigraph.yaml`; applied graphs still serve nothing until Phase 5.
+- *Loud partials (deny-list)*: every crash window lands in a typed status/condition; no path acknowledges unverified success.
+
+## Migration / Compatibility
+
+Additive. Stage 3A/3B behavior is unchanged for catalog-only configs; existing state files gain no required fields (`approval_records`/`recovery_records` already exist, empty). New CLI surface: `cluster approve`, `--actor` on `cluster apply`. A deployment that never declares schema changes or graph creates sees identical behavior to Stage 3B. The honored-or-rejected posture continues: no new `cluster.yaml` fields are introduced by this phase (graph roots stay derived).
+
+## Sequencing
+
+| Stage | Scope | Gate |
+|---|---|---|
+| **4A graph create** | `Omnigraph::init` at derived roots; create-intent sidecar; D3 rows 1/2/4/5; dependents unblock in-run | Failpoint tests for crash-before/after-init; e2e: declare graph → apply → import-less convergence |
+| **4B schema apply** | Full sidecar lifecycle; roll-forward sweep (D3 rows 3/6); actor threading; plan data-impact preview; soft-drop default | Failpoint tests per matrix row; e2e: schema evolution fully cluster-driven (replaces the Stage 3A defer→manual→refresh loop) |
+| **4C graph delete** | `cluster approve` + artifact consumption; prefix removal; tombstones; D3 rows 7/7b/8 | Failpoint tests incl. partial-removal; e2e: gated delete refused without artifact, executed with it, stale artifact rejected |
+
+Each stage is a separate PR with boundary-matched tests (the Stage 1–3B discipline). 4A ships first because it moves no existing manifest; 4B is the heart; 4C last because it is the only irreversible-tier executor and consumes the approval machinery 4B's hard-drop path also needs.
+
+## Exit-criteria coverage (implementation spec)
+
+| # | Criterion | This RFC |
+|---|---|---|
+| 1 | State/status/approval/recovery schemas + paths | **Approval + recovery schemas: answered** (D2, D4). State/status: unchanged from shipped Stage 2A/3A. |
+| 2 | Sidecar schema + recovery decision matrix | **Answered** (D2, D3) |
+| 3 | State backend interface / lock+CAS | Unchanged (local JSON backend, shipped) — out of scope |
+| 4 | Apply group syntax + dependency ordering | **Answered** (D5): per-resource groups, fixed kind-ordering; no user-declared group syntax this phase |
+| 5 | Plan JSON schema incl. blast radius + approvals | **Extended** (D7 preview embedding); base schema shipped |
+| 6 | Bootstrap authority + first-actor | **Partial** (D5 actor threading); cluster bootstrap authority remains open (below) |
+| 7 | Server startup migration | Phase 5 — deferred |
+| 8 | Per-query policy / `mcp.expose` bridge | Phase 6 — deferred |
+| 9 | Pipeline runtime | Phase 7 — deferred |
+
+## Open Questions
+
+1. **Bootstrap authority.** The first apply against a fresh cluster has no policy engine to consult and no actor registry; today the answer is "whoever holds the object store wins." The durable story (out-of-band privileged bootstrap actor, per the high-level spec §open-questions) is unresolved and blocks nothing in this phase, since graph-level Cedar still gates wherever installed.
+2. **Approval expiry.** Artifacts are digest-bound, so config drift invalidates them naturally; is wall-clock expiry also wanted (operator hygiene), or does digest binding suffice?
+3. **Sweep on read-only commands.** This RFC has `status`/`plan` only *warn* about pending sidecars. If operator feedback shows the warn-but-don't-repair posture causes confusion, promoting `plan` to run the sweep (it already takes the lock) is a compatible change.
+4. **Graph rename.** Deliberately out of scope; interacts with the rename-stable-identity known gap in [invariants.md](invariants.md). A rename today is delete + create — i.e., gated, lossy, and honest about it.
+5. **Engine `destroy_graph` primitive.** 4C's prefix removal is correct but unatomic; if the engine grows a graph-destroy primitive with its own recovery, D6 collapses onto it (the cluster code is shaped to delegate).
+
+## References
+
+- [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) — phases, exit criteria, high-risk decisions, approval tiers
+- [cluster-axioms.md](cluster-axioms.md) — axioms 5, 8, 9, 11, 12, 15
+- [cluster-config-specs.md](cluster-config-specs.md) — the data-aware provider peek; state/ledger model
+- `crates/omnigraph/src/db/manifest/recovery.rs` — the engine sidecar + classifier this design mirrors in vocabulary and deliberately does not duplicate in mechanics
+- [writes.md](writes.md), [invariants.md](invariants.md) — engine recovery protocol and the deny-list this design is checked against