docs(cluster): RFC-004 — graph & schema apply design (Phase 4) (#168)

* docs(cluster): RFC-004 — graph & schema apply design (Phase 4)

Adds the design that the implementation spec's exit criteria require before
graph-moving cluster apply ships. Core positions:

- Cluster recovery is roll-forward-only: the engine's own sidecars make every
  graph-level operation atomic within the graph, so the cluster never rolls a
  graph back — its sidecars (__cluster/recoveries/{ulid}.json) classify and
  record, converging the ledger to observable reality (axiom 5) or surfacing
  a loud pending-repair condition. Eight-row decision matrix, every row
  testable with the Stage 3B failpoint harness.
- Irreversible operations (graph delete, allow_data_loss schema apply)
  consume digest-bound approval artifacts written by a new cluster approve
  command and retired into state.approval_records (axiom 11). A stale
  approval can never authorize a different change.
- cluster apply gains an actor, threaded to apply_schema_as so engine Cedar
  enforcement and commit attribution work unchanged; the cluster adds no
  policy engine of its own.
- Deterministic ordering (creates -> schema applies -> catalog -> deletes),
  per-resource apply groups, cross-graph atomicity explicitly not promised.
- Staged 4A graph create / 4B schema apply / 4C graph delete, each gated on
  per-matrix-row failpoint tests.
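
The ordering position above can be sketched as a stable sort over planned changes. This is a minimal illustration, not the RFC's implementation: the `Change` type, the kind names other than `graph_delete`, and the tie-break on resource name are assumptions made for the example.

```python
from dataclasses import dataclass

# Phase order from the commit message: creates -> schema applies -> catalog -> deletes.
# Kind names other than "graph_delete" are hypothetical labels for this sketch.
PHASE_ORDER = {"graph_create": 0, "schema_apply": 1, "catalog": 2, "graph_delete": 3}


@dataclass(frozen=True)
class Change:
    kind: str      # one of PHASE_ORDER's keys
    resource: str  # hypothetical resource identifier


def plan_order(changes):
    """Sort by phase, then by resource name (an assumed tie-break),
    so the same set of changes always applies in the same order."""
    return sorted(changes, key=lambda c: (PHASE_ORDER[c.kind], c.resource))
```

Because the sort key depends only on the change itself, the input order of the plan does not affect the apply order.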

Answers exit criteria 2 and 4 fully, 1/5/6 partially; 3/7/8/9 deferred to
their phases (coverage table in the RFC). Linked from the dev index and the
implementation spec's Phase 4 section.
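
The digest-bound approval position listed above can be illustrated with a small sketch. The field names, the canonical-JSON encoding, and the choice of SHA-256 are assumptions for illustration only; the RFC defines the actual artifact format.

```python
import hashlib
import json


def change_digest(change: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the planned change (assumed encoding)."""
    canonical = json.dumps(change, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def approve(change: dict, approver: str) -> dict:
    """Build a hypothetical approval artifact bound to exactly one change."""
    return {"approver": approver, "digest": change_digest(change)}


def authorizes(approval: dict, change: dict) -> bool:
    """An approval authorizes a change only if the digests match."""
    return approval["digest"] == change_digest(change)
```

Under these assumptions, an approval recorded for one change fails the check for any change that differs in any field, which is the property the "stale approval" sentence describes.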

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs(cluster): RFC-004 review fixes — graph_delete sweep rows, state_cas_base contract

Two Greptile findings: (1) D3 row 2 could not be evaluated for graph_delete
(no manifest to version-check after prefix removal) and 'root absent, state
already tombstoned' fell into the stale row — split into rows 7 (delete's
analog of row 2) and 7b (the roll-forward), with expected_manifest_version
documented as always null for the delete kind. (2) state_cas_base is now
explicitly audit/diagnostics-only — the sweep never consults it; independent
state mutations are handled by the ordinary CAS like any concurrent write.
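
The state_cas_base contract above can be sketched with an in-memory ledger. The `Ledger` class, the `sweep_write` helper, and the integer version counter are hypothetical stand-ins; only the rule that the sweep never consults state_cas_base comes from the text.

```python
class StaleState(Exception):
    """Raised when a compare-and-swap loses to a concurrent write."""


class Ledger:
    """Hypothetical in-memory stand-in for the cluster state ledger."""

    def __init__(self):
        self.version = 0
        self.state = {}

    def read(self):
        return self.version, dict(self.state)

    def compare_and_swap(self, expected_version, new_state):
        if expected_version != self.version:
            raise StaleState(f"expected {expected_version}, found {self.version}")
        self.state = dict(new_state)
        self.version += 1
        return self.version


def sweep_write(ledger, sidecar, mutate):
    """Apply a recovery write using the ordinary CAS.

    sidecar["state_cas_base"] is deliberately not read: it is kept for
    audit and diagnostics only, so the sweep rereads the current version.
    """
    version, state = ledger.read()
    return ledger.compare_and_swap(version, mutate(state))
```

In this sketch an independent write that lands after the sidecar was recorded does not block the sweep, and a writer holding an old version still fails the CAS like any other concurrent write.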

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Andrew Altshuler 2026-06-10 04:34:14 +03:00 committed by GitHub
parent effb9cc068
commit 58c66a54a2
3 changed files with 215 additions and 0 deletions


@@ -74,6 +74,7 @@ Working documents for in-flight feature work. Removed when the work lands.
| Config & CLI architecture — layered config, client targeting, file naming (MR-973 / MR-974 / MR-981) | [rfc-002-config-cli-architecture.md](rfc-002-config-cli-architecture.md) |
| MCP server surface — full tool parity, stored queries, modular auth (MR-969 / MR-956 / MR-974) | [rfc-003-mcp-server-surface.md](rfc-003-mcp-server-surface.md) |
| Future cluster control plane — declarative as-code config, JSON state ledger, reconciler | [cluster-config-specs.md](cluster-config-specs.md), [cluster-axioms.md](cluster-axioms.md), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) |
| Cluster graph & schema apply — Phase 4 sidecars, roll-forward recovery, approval artifacts | [rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md) |
## Boundary