* docs(cluster): RFC-004 — graph & schema apply design (Phase 4)
The design the implementation spec's exit criteria require before
graph-moving cluster apply ships. Core positions:
- Cluster recovery is roll-forward-only: the engine's own sidecars make every
graph-level operation atomic within the graph, so the cluster never rolls a
graph back — its sidecars (__cluster/recoveries/{ulid}.json) classify and
record, converging the ledger to observable reality (axiom 5) or surfacing
a loud pending-repair condition. Eight-row decision matrix, every row
testable with the Stage 3B failpoint harness.
- Irreversible operations (graph delete, allow_data_loss schema apply)
consume digest-bound approval artifacts written by a new cluster approve
command and retired into state.approval_records (axiom 11). A stale
approval can never authorize a different change.
- cluster apply gains an actor, threaded to apply_schema_as so engine Cedar
enforcement and commit attribution work unchanged; the cluster adds no
policy engine of its own.
- Deterministic ordering (creates -> schema applies -> catalog -> deletes),
per-resource apply groups, cross-graph atomicity explicitly not promised.
- Staged 4A graph create / 4B schema apply / 4C graph delete, each gated on
per-matrix-row failpoint tests.
Answers exit criteria 2 and 4 fully, 1/5/6 partially; 3/7/8/9 deferred to
their phases (coverage table in the RFC). Linked from the dev index and the
implementation spec's Phase 4 section.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(cluster): RFC-004 review fixes — graph_delete sweep rows, state_cas_base contract
Two greptile findings: (1) D3 row 2 could not be evaluated for graph_delete
(no manifest to version-check after prefix removal) and 'root absent, state
already tombstoned' fell into the stale row — split into rows 7 (delete's
analog of row 2) and 7b (the roll-forward), with expected_manifest_version
documented as always null for the delete kind. (2) state_cas_base is now
explicitly audit/diagnostics-only — the sweep never consults it; independent
state mutations are handled by the ordinary CAS like any concurrent write.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
25 KiB
RFC: Cluster Graph & Schema Apply — Phase 4 of the Cluster Control Plane
Status: Proposed
Date: 2026-06-10
Builds on: cluster Stages 1–3B (shipped: validate/plan/status/refresh/import/force-unlock, config-only cluster apply with content-addressed catalog publish, catalog payload verification, failpoint-proven crash/CAS recovery for the apply protocol). Normative context: cluster-config-specs.md, cluster-axioms.md, cluster-config-implementation-spec.md.
Target release: unversioned (phased — see Sequencing); no cluster functionality is in a tagged release yet.
Summary
Extend cluster apply from config-only resources (stored queries, policy bundles) to graph-moving resources: graph create, cluster-driven schema apply, and graph delete. This is the nested-publish territory the implementation spec flags as its highest-risk decision: a graph's Lance manifest can move (via the engine's own atomic publish) before the cluster's JSON state CAS lands, and a crash in that window must never be silent, never acknowledged as success, and never repaired by guessing.
Three design commitments make the phase tractable:
- Cluster recovery is roll-forward-only. The engine's recovery sidecars (
__recovery/{ulid}.json, the open-time sweep indb/manifest/recovery.rs) already make every graph-level operation atomic within the graph — a schema apply either fully published or fully recovers at the next open. The cluster therefore never rolls a graph back. Cluster sidecars exist to classify and record: after a crash, the sweep observes the live graph, decides "moved / didn't move / moved unexpectedly," and either rolls the cluster state forward to match observable reality (axiom 5) or surfaces a loud pending-repair condition. The cluster holds no second transaction log and no rollback hammer — that would duplicate substrate behavior the engine already owns (invariant: respect the substrate). - Irreversible operations require a digest-bound approval artifact. Graph delete and
allow_data_lossschema applies consume an explicit__cluster/approvals/{ulid}.jsonrecord bound to the exact change digests, written by a newcluster approvecommand and retired into the state ledger'sapproval_records(the durable audit reference of axiom 11). - The operator identity becomes explicit.
cluster applygains an actor, threaded to the engine'sapply_schema_asso Cedar enforcement and commit attribution work unchanged. The cluster control plane adds no policy engine of its own (transport/auth stay at the boundary).
Motivation
After Stage 3B, the control plane converges everything except the resources that define the data plane. The Sarah/Bob test is half-passed: Bob can see that Sarah changed a schema (plan shows the deferred change; refresh shows drift), but the system cannot act on it — Sarah still applies schemas with the per-graph tool and the cluster ledger trails reality. Graph creation is worse: a new graph in cluster.yaml blocks every dependent query and policy with dependency_missing until someone runs omnigraph init by hand at exactly the derived path. Phase 4 closes the loop: desired config in, converged deployment out, for the full resource vocabulary of Stage 1.
The implementation spec's hard gate for this phase — failpoint recovery tests proving the movement-before-state-publish gap — was deliberately front-loaded: Stage 3B shipped the failpoint infrastructure and the apply-side crash/CAS tests. What remains is the design this RFC supplies: the sidecar schema, the recovery decision matrix, the approval artifact, and the ordering rules.
Non-Goals
- Server boot from cluster state (Phase 5) — applied graphs/schemas still serve nothing; the server boots from
omnigraph.yamluntil the explicit per-deployment mode switch (axiom 15). - Policy-owned query exposure /
mcp.exposeretirement (Phase 6). - Pipelines, embeddings, UI, aliases, bindings, providers,
env_file(Phase 7 and reserved fields). - External or Lance-backed state backends; the local JSON backend + lock/CAS remains the substrate.
- A cluster manifest publisher. Deferred, per the spec: it becomes interesting only if the sidecar + repair path proves too weak for the accepted safety contract. Nothing in this design forecloses it.
- Multi-graph atomic apply groups. Cross-graph convergence remains statusful-partial per resource; one graph's failure never pretends to fence another's success.
- Graph rename. Stable-identity-across-rename is an open known gap at the schema level already; graph rename compounds it and is explicitly out of scope (see Open Questions).
Background
What Phase 4 builds on (all shipped):
- The engine's recovery discipline. Writers that can advance Lance HEAD before manifest publish write
__recovery/{ulid}.jsonsidecars carrying per-table pins (expected_version,post_commit_pin);Omnigraph::openin read-write mode classifies every pinned table (NoMovement/RolledPastExpected/UnexpectedAtP1/UnexpectedMultistep/InvariantViolation) and decides all-or-nothing: roll forward via one manifest publish, or roll back viaDataset::restore, recording an audit row attributed toomnigraph:recovery. The cluster inherits the vocabulary of this design but not its mechanics — see the roll-forward-only argument below. - The engine's schema-apply surface.
apply_schema_as(desired_source, SchemaApplyOptions { allow_data_loss }, actor)returnsSchemaApplyResult { supported, applied, manifest_version, steps };preview_schema_apply_with_optionsreturns the migration plan plus desired catalog without applying; the__schema_apply_lock__branch serializes schema applies graph-wide and refuses to run while user branches exist. Policy enforcement (enforce(SchemaApply, TargetBranch("main"), actor)) happens before the lock. - Graph init.
Omnigraph::init(uri, schema_source)with a strict preflight (errors if schema artifacts exist) and an atomic_schema.pgclaim. A documented gap: a failed init does not clean up Lance datasets or__manifest/it already created. - No engine graph-delete primitive. Deleting a graph today means removing its object-store prefix. This RFC works with that fact rather than waiting on a primitive.
- Cluster state and observations.
state.json(locked, CAS-checked, atomically replaced) already records per-resource digests, statuses,observations["graph.<id>"]withmanifest_versionand live schema digest, plus emptyapproval_records/recovery_recordsplaceholders reserved for this phase. - Stage 3A/3B apply mechanics. Dispositions (
applied/derived/deferred/blocked), content-addressed catalog publish before the state CAS, persisted-statuses contract on write failure, idempotent re-apply, payload verification with the drift + self-heal loop, and failpointscluster_apply.after_payload_phase/cluster_apply.before_state_write.
Design
D1. Resource semantics: which dispositions change
The Stage 3A classifier gains executable rows. Everything else (catalog resources, derived composites, blocked dependents) is unchanged:
| Change | Stage 3A disposition | Phase 4 disposition |
|---|---|---|
graph.<id> Create |
Deferred | Applied (4A): Omnigraph::init at the derived root |
schema.<id> Create |
Deferred | Applied with the graph create (the init carries the schema) |
schema.<id> Update |
Deferred | Applied (4B): apply_schema_as against the live graph |
graph.<id> / schema.<id> Delete |
Deferred | Applied behind approval (4C): prefix removal |
query.*/policy.* blocked on the above |
Blocked | Unblocked in the same apply once the dependency lands (ordering, D5) |
Graph roots remain derived: ClusterRoot/graphs/<id>.omni (high-risk decision #2 dispositioned: external graph roots are a separate, explicit future feature, not this phase).
D2. Cluster recovery sidecar (exit criterion 2, first half)
Written under the state lock before any engine call that can move or create a graph manifest; deleted only after the cluster state CAS that records the outcome lands.
{
"schema_version": 1,
"operation_id": "<ulid>",
"started_at": "<rfc3339>",
"actor": "<id or null>",
"kind": "graph_create | schema_apply | graph_delete",
"graph_id": "<id>",
"graph_uri": "<derived root>",
"observed_manifest_version": 7,
"expected_manifest_version": null,
"desired_schema_digest": "<sha256 of the schema source being applied>",
"state_cas_base": "sha256:<state.json digest at sidecar write>"
}
Path: __cluster/recoveries/{operation_id}.json, atomic write (temp + rename, the write_state discipline). Notes:
observed_manifest_versionis the live graph's main-branch manifest version read at sidecar-write time (nullforgraph_create— no graph yet). This is the fencing value: apply refuses to proceed if it differs from the version recorded inobservations["graph.<id>"]at plan time and re-observed under the lock (the same recompute-under-lock posture as Stage 3A's diff).expected_manifest_versionstartsnulland is rewritten into the sidecar immediately after the engine call returns withSchemaApplyResult.manifest_version(or the post-init observation). A crash before that rewrite leavesnull, which the sweep treats as "engine call outcome unknown — classify by observation only." Forgraph_deletethe field is alwaysnull— prefix removal produces no new manifest version, so there is no rewrite step for that kind; delete sidecars are classified purely by root presence + state tombstone (D3 rows 7/7b/8).state_cas_baseis recorded for audit and diagnostics only — the sweep decision logic never consults it. The sweep re-readsstate.jsonunder the lock and performs ordinary CAS-checked writes, so an independent state mutation between sidecar write and sweep is handled by the CAS like any other concurrent write, not by this field. Its value is forensic: a recovery audit entry can show which state revision the interrupted operation departed from.- One sidecar per graph-moving resource operation. Apply processes graph-moving operations strictly sequentially (D5), so at most one sidecar is pending per apply run per graph, and the sweep processes sidecars in ULID order.
D3. Recovery decision matrix — roll-forward-only (exit criterion 2, second half)
Why no rollback. The engine's sidecars already guarantee that a schema apply is atomic within the graph: by the time any cluster-visible manifest version moved, the engine either fully published or will recover all-or-nothing at its next read-write open. A cluster-level rollback would mean un-publishing a successfully published graph commit — rewriting substrate history the cluster does not own, duplicating the engine's transaction discipline (deny-list: custom transaction manager; state that drifts from what it can be derived from). The cluster's job after a crash is therefore epistemic, not transactional: observe what the graph actually is, and converge the ledger to it or refuse loudly.
Sweep trigger. The sweep runs at the start of every state-mutating cluster command (apply, refresh, import), under the state lock, before the command's own work — mirroring the engine's open-time sweep gating (read-only status/plan/validate report pending sidecars as a warning, cluster_recovery_pending, but do not act).
| # | Sidecar kind | Observation | Decision |
|---|---|---|---|
| 1 | any | Graph at observed_manifest_version (nothing moved) |
Engine call never landed. Delete sidecar; the command's own plan/apply re-proposes the change. |
| 2 | graph_create / schema_apply |
Graph at expected_manifest_version; state already records the outcome |
Crash fell between state CAS and sidecar delete. Delete sidecar; done. |
| 3 | schema_apply |
Graph at expected_manifest_version (or, when expected is null, live schema digest == desired_schema_digest); state stale |
Roll the cluster state forward: record the live schema digest, recompute the graph composite, set statuses applied, append a recovery_records entry (audit), CAS-write, delete sidecar. |
| 4 | graph_create |
Graph opens read-only and its schema digest == desired_schema_digest; state stale |
Same roll-forward as #3 (the create completed). |
| 5 | graph_create |
Root exists but the graph does not open (the engine's partial-init gap) | Status error, condition graph_create_incomplete, message: remove the root and re-run apply. No auto-delete — reconciler-initiated deletion is the same data-loss class as human deletion (high-risk decision #7). Sidecar kept until the operator acts and a sweep observes a clean state. |
| 6 | any | Graph at any other version (out-of-band movement during the crash window) | Status drifted, condition actual_applied_state_pending; sidecar kept; the command refuses graph-moving work for that graph until cluster refresh re-observes and the operator re-plans. No success is acknowledged for the interrupted operation. |
| 7 | graph_delete |
Root absent; state already tombstoned | The delete kind's analog of row 2 (no manifest exists to version-check): crash fell between state CAS and sidecar delete. Delete sidecar; done. |
| 7b | graph_delete |
Root absent; state stale | Roll forward: tombstone the graph subtree out of state (D6), record audit, delete sidecar. Idempotent — re-entry after a crash mid-row lands in row 7. |
| 8 | graph_delete |
Root present (delete crashed mid-prefix-removal or never started) | If the approval artifact is still attached (D4), the delete is re-proposed by plan and re-runnable; status drifted, condition graph_delete_incomplete. Partial prefix removal leaves an unopenable graph — same operator message as #5. |
Rows 3, 4 and 7b are the only mutations the sweep performs, and each is an ordinary CAS-checked state write under the lock — the sweep introduces no new write machinery.
D4. Approval artifacts (exit criteria 1-partial and 6-partial; axioms 8 and 11)
The irreversible tier — graph delete, allow_data_loss schema apply (hard drops) — requires a recorded human decision that survives any reconstruction of state. plan already emits approvals_required; Phase 4 adds the consumption side.
Artifact (__cluster/approvals/{approval_id}.json, written by the new command, never by apply):
{
"schema_version": 1,
"approval_id": "<ulid>",
"resource": "graph.scratch",
"operation": "delete",
"reason": "<the approvals_required reason from plan>",
"bound_config_digest": "<desired config digest the plan was computed from>",
"bound_before_digest": "<state digest of the resource, or null>",
"bound_after_digest": "<desired digest, or null for delete>",
"approved_by": "<operator id, required>",
"created_at": "<rfc3339>"
}
Flow. cluster approve <resource-address> --config <dir> --by <operator> re-runs the plan under the lock, locates the pending gated change for that address, prints it, and writes the artifact bound to the exact digests. cluster apply executes a gated change only when a pending artifact matches all bound digests — a stale approval (config moved since) matches nothing, is reported (approval_stale warning), and the change stays blocked with condition approval_required. On successful execution the artifact file is moved into state.approval_records[approval_id] in the same state CAS that records the outcome (the state references the audit fact; losing state does not lose the approval, which is also why import preserves approval_records it finds — see D7).
allow_data_loss is never a CLI flag on cluster apply; destructive promotion is expressed only through an approval artifact for the specific schema change. The default schema apply path runs with allow_data_loss: false (soft drops), which the spec's tier table classes as a recoverable definition rewrite — plan warning, no artifact.
D5. Actor, ordering, and apply groups (exit criterion 4)
Actor. cluster apply --actor <id> / cluster approve --by <id>, with OMNIGRAPH_CLUSTER_ACTOR as the env fallback. The actor is threaded to apply_schema_as (so engine-side Cedar enforcement fires wherever a policy checker is installed and graph commits are attributed), recorded in sidecars, approvals, and recovery_records. The cluster adds no policy engine: graph-moving operations inherit the engine's gate; catalog-only operations remain ungated as today. When no actor is supplied and the target graph has no policy checker, behavior is unchanged from Stage 3A (None actor, as the engine's no-actor variants do); when a checker is installed the engine's existing "actor required" error surfaces as a typed diagnostic (actor_required).
Ordering. Deterministic, dependency-shaped, within one apply run:
- graph creates (with their schemas) — ULID-stable order by graph id
- schema applies — sequential, one graph at a time (each holds that graph's
__schema_apply_lock__; the cluster state lock already serializes cluster-side) - catalog writes (queries/policies) — the Stage 3A path, unchanged
- deletes last (catalog deletes, then approved graph deletes)
Each graph-moving operation is its own apply group: sidecar → engine call → sidecar update → continue. The state CAS stays single and final (one write at the end recording every outcome), preserving Stage 3A's protocol; sidecars cover the widened gap between individual engine calls and that final CAS. A failure mid-sequence stops graph-moving work, reports per-resource statuses for everything already done (loud partials), and leaves sidecars for the sweep. Cross-graph atomicity is explicitly not promised.
Failpoints. Each engine-call boundary gets a failpoint (cluster_apply.before_graph_create, cluster_apply.after_graph_create, cluster_apply.before_schema_apply, cluster_apply.after_schema_apply, cluster_apply.before_graph_delete) so every row of the D3 matrix is testable with the Stage 3B harness.
D6. Graph delete (4C)
With no engine primitive, delete is cluster-orchestrated prefix removal: verify the approval artifact → sidecar (kind: graph_delete, current manifest version recorded) → recursively remove ClusterRoot/graphs/<id>.omni → state CAS that tombstones the graph subtree (graph, schema, and its queries removed from applied_revision.resources and resource_statuses; observation replaced by a tombstone record {deleted_at, approval_id}) → delete sidecar. Catalog blobs of the graph's queries stay (GC remains a later stage, consistent with Stage 3A deletes). The engine gap (no atomic prefix delete; partial removal leaves an unopenable root) is handled by D3 row 8, and this RFC registers a desire for an engine-level destroy_graph primitive as future work, not a dependency.
D7. Plan and import integration
- Plan gains real data impact for schema updates: where Stage 3A showed only a digest diff, Phase 4 calls
preview_schema_apply_with_optionsagainst the live graph (read-only) and embeds the migration steps + drop warnings in the change record — the "data-aware provider peek" from the high-level spec, bounded to graphs the plan already observes. Failure to preview (graph unreachable) degrades to the digest diff with a warning, never blocks planning. - Import/refresh already observe live graphs; Phase 4 makes
importpreserveapproval_recordsand pendingrecoveries/it finds (state reconstruction must not orphan audit facts or pending repairs).
D8. Invariants and axioms check
- Respect the substrate / no custom transaction manager: cluster never rolls back graphs; engine sidecars own intra-graph atomicity (D3).
- Axiom 5 (state = deployed reality): recovery converges the ledger to observation, never observation to ledger.
- Axiom 8 (reversibility gates apply, including drift correction): approval artifacts for the irreversible tier; sweep never auto-deletes (D3 rows 5/8).
- Axiom 9 (plan-time integrity): ordering is planner-derived from existing dependency edges; no runtime discovery.
- Axiom 11 (approvals in a durable ledger): artifacts are files first, state-referenced after consumption; reconstructable state never re-derives who approved.
- Axiom 12 (locked state): every new write path (sidecars, approvals consumption, sweep) runs under the existing state lock.
- Axiom 15 (single owner / mode switch): nothing here reads from or writes to
omnigraph.yaml; applied graphs still serve nothing until Phase 5. - Loud partials (deny-list): every crash window lands in a typed status/condition; no path acknowledges unverified success.
Migration / Compatibility
Additive. Stage 3A/3B behavior is unchanged for catalog-only configs; existing state files gain no required fields (approval_records/recovery_records already exist, empty). New CLI surface: cluster approve, --actor on cluster apply. A deployment that never declares schema changes or graph creates sees identical behavior to Stage 3B. The honored-or-rejected posture continues: no new cluster.yaml fields are introduced by this phase (graph roots stay derived).
Sequencing
| Stage | Scope | Gate |
|---|---|---|
| 4A graph create | Omnigraph::init at derived roots; create-intent sidecar; D3 rows 1/2/4/5; dependents unblock in-run |
Failpoint tests for crash-before/after-init; e2e: declare graph → apply → import-less convergence |
| 4B schema apply | Full sidecar lifecycle; roll-forward sweep (D3 rows 3/6); actor threading; plan data-impact preview; soft-drop default | Failpoint tests per matrix row; e2e: schema evolution fully cluster-driven (replaces the Stage 3A defer→manual→refresh loop) |
| 4C graph delete | cluster approve + artifact consumption; prefix removal; tombstones; D3 rows 7/7b/8 |
Failpoint tests incl. partial-removal; e2e: gated delete refused without artifact, executed with it, stale artifact rejected |
Each stage is a separate PR with boundary-matched tests (the Stage 1–3B discipline). 4A ships first because it moves no existing manifest; 4B is the heart; 4C last because it is the only irreversible-tier executor and consumes the approval machinery 4B's hard-drop path also needs.
Exit-criteria coverage (implementation spec)
| # | Criterion | This RFC |
|---|---|---|
| 1 | State/status/approval/recovery schemas + paths | Approval + recovery schemas: answered (D2, D4). State/status: unchanged from shipped Stage 2A/3A. |
| 2 | Sidecar schema + recovery decision matrix | Answered (D2, D3) |
| 3 | State backend interface / lock+CAS | Unchanged (local JSON backend, shipped) — out of scope |
| 4 | Apply group syntax + dependency ordering | Answered (D5): per-resource groups, fixed kind-ordering; no user-declared group syntax this phase |
| 5 | Plan JSON schema incl. blast radius + approvals | Extended (D7 preview embedding); base schema shipped |
| 6 | Bootstrap authority + first-actor | Partial (D5 actor threading); cluster bootstrap authority remains open (below) |
| 7 | Server startup migration | Phase 5 — deferred |
| 8 | Per-query policy / mcp.expose bridge |
Phase 6 — deferred |
| 9 | Pipeline runtime | Phase 7 — deferred |
Open Questions
- Bootstrap authority. The first apply against a fresh cluster has no policy engine to consult and no actor registry; today the answer is "whoever holds the object store wins." The durable story (out-of-band privileged bootstrap actor, per the high-level spec §open-questions) is unresolved and blocks nothing in this phase, since graph-level Cedar still gates wherever installed.
- Approval expiry. Artifacts are digest-bound, so config drift invalidates them naturally; is wall-clock expiry also wanted (operator hygiene), or does digest binding suffice?
- Sweep on read-only commands. This RFC has
status/planonly warn about pending sidecars. If operator feedback shows the warn-but-don't-repair posture causes confusion, promotingplanto run the sweep (it already takes the lock) is a compatible change. - Graph rename. Deliberately out of scope; interacts with the rename-stable-identity known gap in invariants.md. A rename today is delete + create — i.e., gated, lossy, and honest about it.
- Engine
destroy_graphprimitive. 4C's prefix removal is correct but unatomic; if the engine grows a graph-destroy primitive with its own recovery, D6 collapses onto it (the cluster code is shaped to delegate).
References
- cluster-config-implementation-spec.md — phases, exit criteria, high-risk decisions, approval tiers
- cluster-axioms.md — axioms 5, 8, 9, 11, 12, 15
- cluster-config-specs.md — the data-aware provider peek; state/ledger model
crates/omnigraph/src/db/manifest/recovery.rs— the engine sidecar + classifier this design mirrors in vocabulary and deliberately does not duplicate in mechanics- writes.md, invariants.md — engine recovery protocol and the deny-list this design is checked against