omnigraph/docs/dev/rfc-004-cluster-graph-schema-apply.md
Andrew Altshuler 58c66a54a2
docs(cluster): RFC-004 — graph & schema apply design (Phase 4) (#168)
* docs(cluster): RFC-004 — graph & schema apply design (Phase 4)

The design the implementation spec's exit criteria require before
graph-moving cluster apply ships. Core positions:

- Cluster recovery is roll-forward-only: the engine's own sidecars make every
  graph-level operation atomic within the graph, so the cluster never rolls a
  graph back — its sidecars (__cluster/recoveries/{ulid}.json) classify and
  record, converging the ledger to observable reality (axiom 5) or surfacing
  a loud pending-repair condition. Eight-row decision matrix, every row
  testable with the Stage 3B failpoint harness.
- Irreversible operations (graph delete, allow_data_loss schema apply)
  consume digest-bound approval artifacts written by a new cluster approve
  command and retired into state.approval_records (axiom 11). A stale
  approval can never authorize a different change.
- cluster apply gains an actor, threaded to apply_schema_as so engine Cedar
  enforcement and commit attribution work unchanged; the cluster adds no
  policy engine of its own.
- Deterministic ordering (creates -> schema applies -> catalog -> deletes),
  per-resource apply groups, cross-graph atomicity explicitly not promised.
- Staged 4A graph create / 4B schema apply / 4C graph delete, each gated on
  per-matrix-row failpoint tests.

Answers exit criteria 2 and 4 fully, 1/5/6 partially; 3/7/8/9 deferred to
their phases (coverage table in the RFC). Linked from the dev index and the
implementation spec's Phase 4 section.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs(cluster): RFC-004 review fixes — graph_delete sweep rows, state_cas_base contract

Two greptile findings: (1) D3 row 2 could not be evaluated for graph_delete
(no manifest to version-check after prefix removal) and 'root absent, state
already tombstoned' fell into the stale row — split into rows 7 (delete's
analog of row 2) and 7b (the roll-forward), with expected_manifest_version
documented as always null for the delete kind. (2) state_cas_base is now
explicitly audit/diagnostics-only — the sweep never consults it; independent
state mutations are handled by the ordinary CAS like any concurrent write.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:34:14 +03:00

25 KiB
Raw Blame History

RFC: Cluster Graph & Schema Apply — Phase 4 of the Cluster Control Plane

Status: Proposed Date: 2026-06-10 Builds on: cluster Stages 13B (shipped: validate/plan/status/refresh/import/force-unlock, config-only cluster apply with content-addressed catalog publish, catalog payload verification, failpoint-proven crash/CAS recovery for the apply protocol). Normative context: cluster-config-specs.md, cluster-axioms.md, cluster-config-implementation-spec.md. Target release: unversioned (phased — see Sequencing); no cluster functionality is in a tagged release yet.

Summary

Extend cluster apply from config-only resources (stored queries, policy bundles) to graph-moving resources: graph create, cluster-driven schema apply, and graph delete. This is the nested-publish territory the implementation spec flags as its highest-risk decision: a graph's Lance manifest can move (via the engine's own atomic publish) before the cluster's JSON state CAS lands, and a crash in that window must never be silent, never acknowledged as success, and never repaired by guessing.

Three design commitments make the phase tractable:

  1. Cluster recovery is roll-forward-only. The engine's recovery sidecars (__recovery/{ulid}.json, the open-time sweep in db/manifest/recovery.rs) already make every graph-level operation atomic within the graph — a schema apply either fully published or fully recovers at the next open. The cluster therefore never rolls a graph back. Cluster sidecars exist to classify and record: after a crash, the sweep observes the live graph, decides "moved / didn't move / moved unexpectedly," and either rolls the cluster state forward to match observable reality (axiom 5) or surfaces a loud pending-repair condition. The cluster holds no second transaction log and no rollback hammer — that would duplicate substrate behavior the engine already owns (invariant: respect the substrate).
  2. Irreversible operations require a digest-bound approval artifact. Graph delete and allow_data_loss schema applies consume an explicit __cluster/approvals/{ulid}.json record bound to the exact change digests, written by a new cluster approve command and retired into the state ledger's approval_records (the durable audit reference of axiom 11).
  3. The operator identity becomes explicit. cluster apply gains an actor, threaded to the engine's apply_schema_as so Cedar enforcement and commit attribution work unchanged. The cluster control plane adds no policy engine of its own (transport/auth stay at the boundary).

Motivation

After Stage 3B, the control plane converges everything except the resources that define the data plane. The Sarah/Bob test is half-passed: Bob can see that Sarah changed a schema (plan shows the deferred change; refresh shows drift), but the system cannot act on it — Sarah still applies schemas with the per-graph tool and the cluster ledger trails reality. Graph creation is worse: a new graph in cluster.yaml blocks every dependent query and policy with dependency_missing until someone runs omnigraph init by hand at exactly the derived path. Phase 4 closes the loop: desired config in, converged deployment out, for the full resource vocabulary of Stage 1.

The implementation spec's hard gate for this phase — failpoint recovery tests proving the movement-before-state-publish gap — was deliberately front-loaded: Stage 3B shipped the failpoint infrastructure and the apply-side crash/CAS tests. What remains is the design this RFC supplies: the sidecar schema, the recovery decision matrix, the approval artifact, and the ordering rules.

Non-Goals

  • Server boot from cluster state (Phase 5) — applied graphs/schemas still serve nothing; the server boots from omnigraph.yaml until the explicit per-deployment mode switch (axiom 15).
  • Policy-owned query exposure / mcp.expose retirement (Phase 6).
  • Pipelines, embeddings, UI, aliases, bindings, providers, env_file (Phase 7 and reserved fields).
  • External or Lance-backed state backends; the local JSON backend + lock/CAS remains the substrate.
  • A cluster manifest publisher. Deferred, per the spec: it becomes interesting only if the sidecar + repair path proves too weak for the accepted safety contract. Nothing in this design forecloses it.
  • Multi-graph atomic apply groups. Cross-graph convergence remains statusful-partial per resource; one graph's failure never pretends to fence another's success.
  • Graph rename. Stable-identity-across-rename is an open known gap at the schema level already; graph rename compounds it and is explicitly out of scope (see Open Questions).

Background

What Phase 4 builds on (all shipped):

  • The engine's recovery discipline. Writers that can advance Lance HEAD before manifest publish write __recovery/{ulid}.json sidecars carrying per-table pins (expected_version, post_commit_pin); Omnigraph::open in read-write mode classifies every pinned table (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation) and decides all-or-nothing: roll forward via one manifest publish, or roll back via Dataset::restore, recording an audit row attributed to omnigraph:recovery. The cluster inherits the vocabulary of this design but not its mechanics — see the roll-forward-only argument below.
  • The engine's schema-apply surface. apply_schema_as(desired_source, SchemaApplyOptions { allow_data_loss }, actor) returns SchemaApplyResult { supported, applied, manifest_version, steps }; preview_schema_apply_with_options returns the migration plan plus desired catalog without applying; the __schema_apply_lock__ branch serializes schema applies graph-wide and refuses to run while user branches exist. Policy enforcement (enforce(SchemaApply, TargetBranch("main"), actor)) happens before the lock.
  • Graph init. Omnigraph::init(uri, schema_source) with a strict preflight (errors if schema artifacts exist) and an atomic _schema.pg claim. A documented gap: a failed init does not clean up Lance datasets or __manifest/ it already created.
  • No engine graph-delete primitive. Deleting a graph today means removing its object-store prefix. This RFC works with that fact rather than waiting on a primitive.
  • Cluster state and observations. state.json (locked, CAS-checked, atomically replaced) already records per-resource digests, statuses, observations["graph.<id>"] with manifest_version and live schema digest, plus empty approval_records / recovery_records placeholders reserved for this phase.
  • Stage 3A/3B apply mechanics. Dispositions (applied/derived/deferred/blocked), content-addressed catalog publish before the state CAS, persisted-statuses contract on write failure, idempotent re-apply, payload verification with the drift + self-heal loop, and failpoints cluster_apply.after_payload_phase / cluster_apply.before_state_write.

Design

D1. Resource semantics: which dispositions change

The Stage 3A classifier gains executable rows. Everything else (catalog resources, derived composites, blocked dependents) is unchanged:

Change Stage 3A disposition Phase 4 disposition
graph.<id> Create Deferred Applied (4A): Omnigraph::init at the derived root
schema.<id> Create Deferred Applied with the graph create (the init carries the schema)
schema.<id> Update Deferred Applied (4B): apply_schema_as against the live graph
graph.<id> / schema.<id> Delete Deferred Applied behind approval (4C): prefix removal
query.*/policy.* blocked on the above Blocked Unblocked in the same apply once the dependency lands (ordering, D5)

Graph roots remain derived: ClusterRoot/graphs/<id>.omni (high-risk decision #2 dispositioned: external graph roots are a separate, explicit future feature, not this phase).

D2. Cluster recovery sidecar (exit criterion 2, first half)

Written under the state lock before any engine call that can move or create a graph manifest; deleted only after the cluster state CAS that records the outcome lands.

{
  "schema_version": 1,
  "operation_id": "<ulid>",
  "started_at": "<rfc3339>",
  "actor": "<id or null>",
  "kind": "graph_create | schema_apply | graph_delete",
  "graph_id": "<id>",
  "graph_uri": "<derived root>",
  "observed_manifest_version": 7,
  "expected_manifest_version": null,
  "desired_schema_digest": "<sha256 of the schema source being applied>",
  "state_cas_base": "sha256:<state.json digest at sidecar write>"
}

Path: __cluster/recoveries/{operation_id}.json, atomic write (temp + rename, the write_state discipline). Notes:

  • observed_manifest_version is the live graph's main-branch manifest version read at sidecar-write time (null for graph_create — no graph yet). This is the fencing value: apply refuses to proceed if it differs from the version recorded in observations["graph.<id>"] at plan time and re-observed under the lock (the same recompute-under-lock posture as Stage 3A's diff).
  • expected_manifest_version starts null and is rewritten into the sidecar immediately after the engine call returns with SchemaApplyResult.manifest_version (or the post-init observation). A crash before that rewrite leaves null, which the sweep treats as "engine call outcome unknown — classify by observation only." For graph_delete the field is always null — prefix removal produces no new manifest version, so there is no rewrite step for that kind; delete sidecars are classified purely by root presence + state tombstone (D3 rows 7/7b/8).
  • state_cas_base is recorded for audit and diagnostics only — the sweep decision logic never consults it. The sweep re-reads state.json under the lock and performs ordinary CAS-checked writes, so an independent state mutation between sidecar write and sweep is handled by the CAS like any other concurrent write, not by this field. Its value is forensic: a recovery audit entry can show which state revision the interrupted operation departed from.
  • One sidecar per graph-moving resource operation. Apply processes graph-moving operations strictly sequentially (D5), so at most one sidecar is pending per apply run per graph, and the sweep processes sidecars in ULID order.

D3. Recovery decision matrix — roll-forward-only (exit criterion 2, second half)

Why no rollback. The engine's sidecars already guarantee that a schema apply is atomic within the graph: by the time any cluster-visible manifest version moved, the engine either fully published or will recover all-or-nothing at its next read-write open. A cluster-level rollback would mean un-publishing a successfully published graph commit — rewriting substrate history the cluster does not own, duplicating the engine's transaction discipline (deny-list: custom transaction manager; state that drifts from what it can be derived from). The cluster's job after a crash is therefore epistemic, not transactional: observe what the graph actually is, and converge the ledger to it or refuse loudly.

Sweep trigger. The sweep runs at the start of every state-mutating cluster command (apply, refresh, import), under the state lock, before the command's own work — mirroring the engine's open-time sweep gating (read-only status/plan/validate report pending sidecars as a warning, cluster_recovery_pending, but do not act).

# Sidecar kind Observation Decision
1 any Graph at observed_manifest_version (nothing moved) Engine call never landed. Delete sidecar; the command's own plan/apply re-proposes the change.
2 graph_create / schema_apply Graph at expected_manifest_version; state already records the outcome Crash fell between state CAS and sidecar delete. Delete sidecar; done.
3 schema_apply Graph at expected_manifest_version (or, when expected is null, live schema digest == desired_schema_digest); state stale Roll the cluster state forward: record the live schema digest, recompute the graph composite, set statuses applied, append a recovery_records entry (audit), CAS-write, delete sidecar.
4 graph_create Graph opens read-only and its schema digest == desired_schema_digest; state stale Same roll-forward as #3 (the create completed).
5 graph_create Root exists but the graph does not open (the engine's partial-init gap) Status error, condition graph_create_incomplete, message: remove the root and re-run apply. No auto-delete — reconciler-initiated deletion is the same data-loss class as human deletion (high-risk decision #7). Sidecar kept until the operator acts and a sweep observes a clean state.
6 any Graph at any other version (out-of-band movement during the crash window) Status drifted, condition actual_applied_state_pending; sidecar kept; the command refuses graph-moving work for that graph until cluster refresh re-observes and the operator re-plans. No success is acknowledged for the interrupted operation.
7 graph_delete Root absent; state already tombstoned The delete kind's analog of row 2 (no manifest exists to version-check): crash fell between state CAS and sidecar delete. Delete sidecar; done.
7b graph_delete Root absent; state stale Roll forward: tombstone the graph subtree out of state (D6), record audit, delete sidecar. Idempotent — re-entry after a crash mid-row lands in row 7.
8 graph_delete Root present (delete crashed mid-prefix-removal or never started) If the approval artifact is still attached (D4), the delete is re-proposed by plan and re-runnable; status drifted, condition graph_delete_incomplete. Partial prefix removal leaves an unopenable graph — same operator message as #5.

Rows 3, 4 and 7b are the only mutations the sweep performs, and each is an ordinary CAS-checked state write under the lock — the sweep introduces no new write machinery.

D4. Approval artifacts (exit criteria 1-partial and 6-partial; axioms 8 and 11)

The irreversible tier — graph delete, allow_data_loss schema apply (hard drops) — requires a recorded human decision that survives any reconstruction of state. plan already emits approvals_required; Phase 4 adds the consumption side.

Artifact (__cluster/approvals/{approval_id}.json, written by the new command, never by apply):

{
  "schema_version": 1,
  "approval_id": "<ulid>",
  "resource": "graph.scratch",
  "operation": "delete",
  "reason": "<the approvals_required reason from plan>",
  "bound_config_digest": "<desired config digest the plan was computed from>",
  "bound_before_digest": "<state digest of the resource, or null>",
  "bound_after_digest": "<desired digest, or null for delete>",
  "approved_by": "<operator id, required>",
  "created_at": "<rfc3339>"
}

Flow. cluster approve <resource-address> --config <dir> --by <operator> re-runs the plan under the lock, locates the pending gated change for that address, prints it, and writes the artifact bound to the exact digests. cluster apply executes a gated change only when a pending artifact matches all bound digests — a stale approval (config moved since) matches nothing, is reported (approval_stale warning), and the change stays blocked with condition approval_required. On successful execution the artifact file is moved into state.approval_records[approval_id] in the same state CAS that records the outcome (the state references the audit fact; losing state does not lose the approval, which is also why import preserves approval_records it finds — see D7).

allow_data_loss is never a CLI flag on cluster apply; destructive promotion is expressed only through an approval artifact for the specific schema change. The default schema apply path runs with allow_data_loss: false (soft drops), which the spec's tier table classes as a recoverable definition rewrite — plan warning, no artifact.

D5. Actor, ordering, and apply groups (exit criterion 4)

Actor. cluster apply --actor <id> / cluster approve --by <id>, with OMNIGRAPH_CLUSTER_ACTOR as the env fallback. The actor is threaded to apply_schema_as (so engine-side Cedar enforcement fires wherever a policy checker is installed and graph commits are attributed), recorded in sidecars, approvals, and recovery_records. The cluster adds no policy engine: graph-moving operations inherit the engine's gate; catalog-only operations remain ungated as today. When no actor is supplied and the target graph has no policy checker, behavior is unchanged from Stage 3A (None actor, as the engine's no-actor variants do); when a checker is installed the engine's existing "actor required" error surfaces as a typed diagnostic (actor_required).

Ordering. Deterministic, dependency-shaped, within one apply run:

  1. graph creates (with their schemas) — ULID-stable order by graph id
  2. schema applies — sequential, one graph at a time (each holds that graph's __schema_apply_lock__; the cluster state lock already serializes cluster-side)
  3. catalog writes (queries/policies) — the Stage 3A path, unchanged
  4. deletes last (catalog deletes, then approved graph deletes)

Each graph-moving operation is its own apply group: sidecar → engine call → sidecar update → continue. The state CAS stays single and final (one write at the end recording every outcome), preserving Stage 3A's protocol; sidecars cover the widened gap between individual engine calls and that final CAS. A failure mid-sequence stops graph-moving work, reports per-resource statuses for everything already done (loud partials), and leaves sidecars for the sweep. Cross-graph atomicity is explicitly not promised.

Failpoints. Each engine-call boundary gets a failpoint (cluster_apply.before_graph_create, cluster_apply.after_graph_create, cluster_apply.before_schema_apply, cluster_apply.after_schema_apply, cluster_apply.before_graph_delete) so every row of the D3 matrix is testable with the Stage 3B harness.

D6. Graph delete (4C)

With no engine primitive, delete is cluster-orchestrated prefix removal: verify the approval artifact → sidecar (kind: graph_delete, current manifest version recorded) → recursively remove ClusterRoot/graphs/<id>.omni → state CAS that tombstones the graph subtree (graph, schema, and its queries removed from applied_revision.resources and resource_statuses; observation replaced by a tombstone record {deleted_at, approval_id}) → delete sidecar. Catalog blobs of the graph's queries stay (GC remains a later stage, consistent with Stage 3A deletes). The engine gap (no atomic prefix delete; partial removal leaves an unopenable root) is handled by D3 row 8, and this RFC registers a desire for an engine-level destroy_graph primitive as future work, not a dependency.

D7. Plan and import integration

  • Plan gains real data impact for schema updates: where Stage 3A showed only a digest diff, Phase 4 calls preview_schema_apply_with_options against the live graph (read-only) and embeds the migration steps + drop warnings in the change record — the "data-aware provider peek" from the high-level spec, bounded to graphs the plan already observes. Failure to preview (graph unreachable) degrades to the digest diff with a warning, never blocks planning.
  • Import/refresh already observe live graphs; Phase 4 makes import preserve approval_records and pending recoveries/ it finds (state reconstruction must not orphan audit facts or pending repairs).

D8. Invariants and axioms check

  • Respect the substrate / no custom transaction manager: cluster never rolls back graphs; engine sidecars own intra-graph atomicity (D3).
  • Axiom 5 (state = deployed reality): recovery converges the ledger to observation, never observation to ledger.
  • Axiom 8 (reversibility gates apply, including drift correction): approval artifacts for the irreversible tier; sweep never auto-deletes (D3 rows 5/8).
  • Axiom 9 (plan-time integrity): ordering is planner-derived from existing dependency edges; no runtime discovery.
  • Axiom 11 (approvals in a durable ledger): artifacts are files first, state-referenced after consumption; reconstructable state never re-derives who approved.
  • Axiom 12 (locked state): every new write path (sidecars, approvals consumption, sweep) runs under the existing state lock.
  • Axiom 15 (single owner / mode switch): nothing here reads from or writes to omnigraph.yaml; applied graphs still serve nothing until Phase 5.
  • Loud partials (deny-list): every crash window lands in a typed status/condition; no path acknowledges unverified success.

Migration / Compatibility

Additive. Stage 3A/3B behavior is unchanged for catalog-only configs; existing state files gain no required fields (approval_records/recovery_records already exist, empty). New CLI surface: cluster approve, --actor on cluster apply. A deployment that never declares schema changes or graph creates sees identical behavior to Stage 3B. The honored-or-rejected posture continues: no new cluster.yaml fields are introduced by this phase (graph roots stay derived).

Sequencing

Stage Scope Gate
4A graph create Omnigraph::init at derived roots; create-intent sidecar; D3 rows 1/2/4/5; dependents unblock in-run Failpoint tests for crash-before/after-init; e2e: declare graph → apply → import-less convergence
4B schema apply Full sidecar lifecycle; roll-forward sweep (D3 rows 3/6); actor threading; plan data-impact preview; soft-drop default Failpoint tests per matrix row; e2e: schema evolution fully cluster-driven (replaces the Stage 3A defer→manual→refresh loop)
4C graph delete cluster approve + artifact consumption; prefix removal; tombstones; D3 rows 7/7b/8 Failpoint tests incl. partial-removal; e2e: gated delete refused without artifact, executed with it, stale artifact rejected

Each stage is a separate PR with boundary-matched tests (the Stage 13B discipline). 4A ships first because it moves no existing manifest; 4B is the heart; 4C last because it is the only irreversible-tier executor and consumes the approval machinery 4B's hard-drop path also needs.

Exit-criteria coverage (implementation spec)

# Criterion This RFC
1 State/status/approval/recovery schemas + paths Approval + recovery schemas: answered (D2, D4). State/status: unchanged from shipped Stage 2A/3A.
2 Sidecar schema + recovery decision matrix Answered (D2, D3)
3 State backend interface / lock+CAS Unchanged (local JSON backend, shipped) — out of scope
4 Apply group syntax + dependency ordering Answered (D5): per-resource groups, fixed kind-ordering; no user-declared group syntax this phase
5 Plan JSON schema incl. blast radius + approvals Extended (D7 preview embedding); base schema shipped
6 Bootstrap authority + first-actor Partial (D5 actor threading); cluster bootstrap authority remains open (below)
7 Server startup migration Phase 5 — deferred
8 Per-query policy / mcp.expose bridge Phase 6 — deferred
9 Pipeline runtime Phase 7 — deferred

Open Questions

  1. Bootstrap authority. The first apply against a fresh cluster has no policy engine to consult and no actor registry; today the answer is "whoever holds the object store wins." The durable story (out-of-band privileged bootstrap actor, per the high-level spec §open-questions) is unresolved and blocks nothing in this phase, since graph-level Cedar still gates wherever installed.
  2. Approval expiry. Artifacts are digest-bound, so config drift invalidates them naturally; is wall-clock expiry also wanted (operator hygiene), or does digest binding suffice?
  3. Sweep on read-only commands. This RFC has status/plan only warn about pending sidecars. If operator feedback shows the warn-but-don't-repair posture causes confusion, promoting plan to run the sweep (it already takes the lock) is a compatible change.
  4. Graph rename. Deliberately out of scope; interacts with the rename-stable-identity known gap in invariants.md. A rename today is delete + create — i.e., gated, lossy, and honest about it.
  5. Engine destroy_graph primitive. 4C's prefix removal is correct but unatomic; if the engine grows a graph-destroy primitive with its own recovery, D6 collapses onto it (the cluster code is shaped to delegate).

References

  • cluster-config-implementation-spec.md — phases, exit criteria, high-risk decisions, approval tiers
  • cluster-axioms.md — axioms 5, 8, 9, 11, 12, 15
  • cluster-config-specs.md — the data-aware provider peek; state/ledger model
  • crates/omnigraph/src/db/manifest/recovery.rs — the engine sidecar + classifier this design mirrors in vocabulary and deliberately does not duplicate in mechanics
  • writes.md, invariants.md — engine recovery protocol and the deny-list this design is checked against