# Cluster Config Implementation Spec And Blast Radius
**Status:** Draft / implementation planning
**Type:** Downstream design spec
**Date:** 2026-06-08
**Relationship:** companion to [cluster-config-specs.md](cluster-config-specs.md)
and [cluster-axioms.md](cluster-axioms.md). The high-level spec explains why
the cluster control plane should exist; this file names what must change
downstream and how large the blast radius is.
<!-- Spec note: this file exists so the user-facing cluster spec can stay
readable. Keep implementation inventories, rollout phases, and test ownership
here instead of expanding the narrative spec into an encyclopedia. -->
## Executive Summary
Overall blast radius: **very high**.
This is not a small extension to `omnigraph.yaml`. The target design creates a
new shared cluster desired-state document, a locked state ledger, a cluster
manifest publisher, and a reconciler that coordinates resources above a single
graph. The existing config system remains useful, but its role changes:
- `omnigraph.yaml` / global config remains the per-operator and startup bridge.
- `cluster.yaml` becomes shared desired state for a deployment.
- The cluster state ledger becomes the authoritative record of applied reality.
- Server/runtime surfaces eventually read from the cluster catalog instead of
only from process-start config.

Safe rollout requires an additive path. Do not replace the current config,
server, or policy behavior in one step.
## Current Surfaces Surveyed
| Surface | Current behavior | Why it matters |
|---|---|---|
| `omnigraph-config::OmnigraphConfig` | Layered global/state/project config for CLI and server startup; strict `version: 1`; named maps replace wholesale | A cluster spec needs different ownership and merge semantics; do not stretch this type until it becomes ambiguous |
| `omnigraph-server::load_server_settings` | Opens either one selected graph or every configured embedded graph in multi mode | Cluster config changes startup, registry identity, and eventually runtime reconcile |
| `GraphRegistry` | Holds open graph handles; production registry is startup-only today; runtime insert is test-only | Cluster apply wants graph add/remove/reload as real control-plane operations |
| `omnigraph-queries::QueryRegistry` | Loads `.gq` files from `queries:` and honors `mcp.expose` for catalog listing | Target cluster config removes exposure from the registry and moves list/invoke to policy |
| `omnigraph-policy::PolicyAction` | Per-graph actions plus server-scoped `graph_list`; `invoke_query` is graph-scoped and coarse | Cluster plan/apply and per-query exposure need new policy scope without breaking coarse rules |
| Engine graph manifest | Graph-level atomic visibility via `__manifest`, expected table versions, and recovery sidecars | Cluster apply needs a higher-level publisher; Lance still commits per dataset |
| Schema apply | Existing plan/apply/lock shape for one graph; soft/hard drops already modeled | This is the prototype resource reconciler, but cluster apply cannot call it blindly and then claim cluster atomicity |
| Public docs/tests | Config, policy, server, and query behavior are already documented and tested | Every behavior change below has user docs and test fallout |
## Compatibility Stance
<!-- Spec note: keep `cluster.yaml` separate from `omnigraph.yaml` because the
current file is deliberately layered and partly per-operator. Collapsing shared
cluster intent into it would blur the source-of-truth split the high-level spec
is trying to create. -->
1. `cluster.yaml` is a new target-state file, not `omnigraph.yaml` v2.
2. Existing `omnigraph.yaml` keeps working for CLI, server boot, aliases,
graph locators, bearer-token env lookup, and the current stored-query
registry.
3. Initial cluster commands are explicit: `omnigraph cluster validate`,
| State backend config | Lock implementation, import/refresh protocol, apply acknowledgements | Backend missing CAS/lock; state CAS failure after graph/resource movement |
| Config parsing | Add strict `cluster.yaml` parser, path/env-ref resolver, resource fingerprints, no layered merge | High | Separate from `OmnigraphConfig`; existing config tests still need backcompat coverage |
| CLI | Add `cluster validate/plan/apply/status/refresh/import`, plan rendering, approval flags, actor threading | High | Must not change existing command selection or `omnigraph use` behavior |
| State backend | Add JSON state document, status/approval/recovery records, lock/CAS, and import/refresh repair | High | Must not silently succeed after state CAS failure |
| Optional cluster publisher | Add a cluster manifest plus table-backed state/status store only if stronger all-or-nothing apply is required | Very high | Touches core atomicity and recovery invariants |
| Recovery | Add cluster sidecars and failpoint coverage for graph-move-before-state-publish gaps | Very high | Any missed sidecar is a correctness bug |
| Graph lifecycle | First-class graph resource create/delete/rename or stable-id story | High | Current server add/remove is intentionally not exposed |
| Schema apply integration | Make schema apply cluster-aware or wrap it with cluster recovery | High | Existing schema apply cannot be treated as cluster atomic by assertion |
| Query registry | Remove target-state exposure flag, add dependency metadata, keep `mcp.expose` bridge | Medium/high | Catalog behavior is observable public API |
| Policy | Add cluster plan/apply/admin actions and per-query list/invoke scope | High | Needs docs, tests, Cedar schema migration, and compatibility with coarse `invoke_query` |
| Server registry | Boot from cluster state, eventually reload/reconcile graph handles, expose statuses | High | Affects routing, OpenAPI, auth, and workload admission |
| API types/OpenAPI | Plan/status/apply DTOs if HTTP management endpoints ship | Medium/high | OpenAPI drift must be regenerated |
| UI specs | New renderer/spec validator/binding checker | High | New product surface, not currently implemented |
| Pipelines | New scheduler/runtime/connector/mapping/idempotency/run ledger | Very high | Second data-plane seam; large product and correctness surface |
| Embeddings | Cluster-level defaults, env refs, model/dimension validation, index interaction | Medium | Existing embedding code is mostly offline/client-side |
| Docs | User docs for cluster config, policy, server, CLI; dev docs for invariants/testing | High | Public contract changes |
| Tests | New cluster suites plus extensions to config/server/policy/recovery/schema/query tests | High | Needs boundary-matched coverage |
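
The "must not silently succeed after state CAS failure" requirement in the State backend row can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `InMemoryState`, `CasConflict`, and `compare_and_swap` are hypothetical names, and the in-memory `Mutex` stands in for whatever lock/CAS primitive the real backend provides.

```rust
use std::sync::Mutex;

/// Typed conflict returned when the ledger moved since the caller read it.
/// Hypothetical type; the spec only requires that the failure is typed.
#[derive(Debug, PartialEq, Eq)]
pub struct CasConflict {
    pub expected: u64,
    pub actual: u64,
}

/// Hypothetical in-memory stand-in for the cluster state backend.
/// The tuple holds (revision, JSON state document).
pub struct InMemoryState {
    inner: Mutex<(u64, String)>,
}

impl InMemoryState {
    pub fn new(doc: &str) -> Self {
        Self { inner: Mutex::new((0, doc.to_string())) }
    }

    /// Return the current revision and document.
    pub fn read(&self) -> (u64, String) {
        self.inner.lock().unwrap().clone()
    }

    /// Publish `doc` only if the stored revision still equals `expected`.
    /// On mismatch nothing is written and the caller gets a typed error,
    /// so an apply can never report success after losing the race.
    pub fn compare_and_swap(&self, expected: u64, doc: &str) -> Result<u64, CasConflict> {
        let mut guard = self.inner.lock().unwrap();
        if guard.0 != expected {
            return Err(CasConflict { expected, actual: guard.0 });
        }
        guard.0 += 1;
        guard.1 = doc.to_string();
        Ok(guard.0)
    }
}
```

Two appliers that both read revision 0 and then call `compare_and_swap(0, ...)` illustrate the intended outcome: the first call returns the new revision, and the second returns a `CasConflict` and leaves the document untouched.
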
## Reversibility And Approval Tiers
| Tier | Examples | Gate |
|---|---|---|
| Display-only | Dashboard layout, non-breaking alias addition | No approval beyond policy |
| Catalog behavior | Add query, hide/list query via policy, add policy grant | Policy check; no data-loss approval |
| Compatibility narrowing | Future validated enum narrowing, query param removal, policy removal that revokes access | Explicit compatibility warning; may require human approval by policy |
| Recoverable definition rewrite | Soft schema drop, graph schema rename, index rebuild | Plan warning; no data-loss approval unless policy requires |
| Irreversible data loss | Graph delete, hard schema drop, cleanup-triggered prior-version reclamation, overwriting pipeline target | Human approval artifact recorded in audit ledger |

Future enum narrowing belongs in `CompatibilityNarrowing` unless the migration
also drops/coerces data or triggers cleanup. That distinction matters for plan
wording and for policy predicates.
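
The tiers and the enum-narrowing rule above can be sketched as a small Rust model. Only `CompatibilityNarrowing` is named in this spec; the other variant names, the `Gate` enum, and both helper functions are hypothetical. Two further assumptions are labeled in the code: that a narrowing which drops/coerces data or triggers cleanup escalates to the irreversible tier, and that a plan's overall tier is the most severe tier among its changes.

```rust
/// Approval tiers from the table, declared from least to most severe so the
/// derived `Ord` can be used to pick the most severe tier in a plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ApprovalTier {
    DisplayOnly,
    CatalogBehavior,
    CompatibilityNarrowing,
    RecoverableDefinitionRewrite,
    IrreversibleDataLoss,
}

/// Coarse summary of the "Gate" column (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Gate {
    /// No approval beyond the ordinary policy check.
    PolicyOnly,
    /// Plan shows a warning; policy may additionally require human approval.
    WarnPolicyMayRequireApproval,
    /// Human approval artifact must be recorded in the audit ledger.
    HumanApprovalArtifact,
}

impl ApprovalTier {
    pub fn gate(self) -> Gate {
        match self {
            ApprovalTier::DisplayOnly | ApprovalTier::CatalogBehavior => Gate::PolicyOnly,
            ApprovalTier::CompatibilityNarrowing
            | ApprovalTier::RecoverableDefinitionRewrite => Gate::WarnPolicyMayRequireApproval,
            ApprovalTier::IrreversibleDataLoss => Gate::HumanApprovalArtifact,
        }
    }
}

/// Classify an enum-narrowing migration.
/// ASSUMPTION: dropping/coercing data or triggering cleanup escalates to
/// `IrreversibleDataLoss`; the spec only says it leaves `CompatibilityNarrowing`.
pub fn classify_enum_narrowing(drops_or_coerces_data: bool, triggers_cleanup: bool) -> ApprovalTier {
    if drops_or_coerces_data || triggers_cleanup {
        ApprovalTier::IrreversibleDataLoss
    } else {
        ApprovalTier::CompatibilityNarrowing
    }
}

/// ASSUMPTION: a plan is gated by its most severe change.
/// Returns `None` for an empty plan.
pub fn plan_tier(changes: &[ApprovalTier]) -> Option<ApprovalTier> {
    changes.iter().copied().max()
}
```

Keeping the tier as a closed enum means plan wording and policy predicates can match on the same value instead of re-deriving severity from free-form change descriptions.
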
## Rollout Phases
<!-- Spec note: the only safe path is staged. The cluster control plane crosses
config, engine, server, policy, and data-plane-adjacent surfaces; a big-bang
replacement would make every invariant harder to audit. -->
### Phase 0: Documentation And Parser Skeleton
- Add cluster spec types and strict parser behind an unused feature/module.
- Implement `cluster validate --config <folder>` with no state backend.
| Plan goldens | Given config + imported/fake state, plan JSON contains stable resource digests, dependency edges, state observations, proposed changes, blast radius, and approval gates in deterministic order |
| Fresh-plan apply | Changing config digest, state revision, resource digest, or observed graph manifest version after planning makes `cluster apply` reject and require re-plan/refresh |
| State lock / CAS | Concurrent applies against the same backend cannot both succeed; loser gets a typed lock/CAS conflict |
| Recovery / partial apply | Fail after graph/resource movement but before cluster state publish; assert recovery or status surfaces `ActualAppliedStatePending`/sidecar state and never returns success |
| Server/runtime phase | Before cluster state drives routing or registry reload, tests are hermetic: no real home dir, no real global config, no real credentials, no ignored remote tests |
| Pipeline phase | Fan-out run records per-target status, commit ids, retryability, and idempotency keys; no aggregate success unless every target succeeded |
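
The fresh-plan apply row above can be sketched as a comparison between what the plan was computed against and what apply observes just before acting. `PlanBasis`, `StalePlan`, and `check_fresh` are hypothetical names, and the sketch assumes a single resource digest and a single graph manifest version for brevity; a real plan would presumably carry one entry per resource and per graph.

```rust
/// Inputs a plan was computed against (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PlanBasis {
    pub config_digest: String,
    pub state_revision: u64,
    pub resource_digest: String,
    pub graph_manifest_version: u64,
}

/// Typed reason why apply must reject and ask for re-plan/refresh.
#[derive(Debug, PartialEq, Eq)]
pub enum StalePlan {
    ConfigDigestChanged,
    StateRevisionChanged,
    ResourceDigestChanged,
    GraphManifestVersionChanged,
}

/// Compare the plan's basis with what apply observes right now.
/// Any difference rejects the apply; the first mismatch found is reported.
pub fn check_fresh(planned: &PlanBasis, observed: &PlanBasis) -> Result<(), StalePlan> {
    if planned.config_digest != observed.config_digest {
        return Err(StalePlan::ConfigDigestChanged);
    }
    if planned.state_revision != observed.state_revision {
        return Err(StalePlan::StateRevisionChanged);
    }
    if planned.resource_digest != observed.resource_digest {
        return Err(StalePlan::ResourceDigestChanged);
    }
    if planned.graph_manifest_version != observed.graph_manifest_version {
        return Err(StalePlan::GraphManifestVersionChanged);
    }
    Ok(())
}
```

A test for this row would build one `PlanBasis`, mutate each field in turn on a clone, and assert that `check_fresh` returns the matching `StalePlan` variant rather than `Ok(())`.
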

Hard gates:
- Do not ship `cluster apply` until `cluster validate` and read-only
`cluster plan` have hermetic tests.
- Do not ship graph/schema-moving apply until failpoint recovery tests prove the
Phase B -> state publish gap is covered.

For docs-only changes, `scripts/check-agents-md.sh` is enough. For
implementation phases, run the boundary tests above before widening to
`cargo test --workspace --locked`.
## User-Visible Documentation Fallout
The following public docs must change when the corresponding phase ships:
| Phase | User docs |
|---|---|
| Parser/validate | New `docs/user/cluster-config.md`; CLI reference for `cluster validate` |