Cluster Config Implementation Spec And Blast Radius
Status: Draft / implementation planning
Type: Downstream design spec
Date: 2026-06-08
Relationship: companion to `cluster-config-specs.md` and `cluster-axioms.md`. The high-level spec explains why the cluster control plane should exist; this file names what must change downstream and how large the blast radius is.
Executive Summary
Overall blast radius: very high.
This is not a small extension to omnigraph.yaml. The target design creates a
new shared cluster desired-state document, a locked state ledger, a cluster
manifest publisher, and a reconciler that coordinates resources above a single
graph. The existing config system remains useful, but its role changes:
- `omnigraph.yaml` / global config remains the per-operator and startup bridge.
- `cluster.yaml` becomes shared desired state for a deployment.
- The cluster state ledger becomes the authoritative record of applied reality.
- Server/runtime surfaces eventually read from the cluster catalog instead of only from process-start config.
Safe rollout requires an additive path. Do not replace the current config, server, or policy behavior in one step.
Current Surfaces Surveyed
| Surface | Current behavior | Why it matters |
|---|---|---|
| `omnigraph-config::OmnigraphConfig` | Layered global/state/project config for CLI and server startup; strict `version: 1`; named maps replace wholesale | A cluster spec needs different ownership and merge semantics; do not stretch this type until it becomes ambiguous |
| `omnigraph-server::load_server_settings` | Opens either one selected graph or every configured embedded graph in multi mode | Cluster config changes startup, registry identity, and eventually runtime reconcile |
| `GraphRegistry` | Holds open graph handles; production registry is startup-only today; runtime insert is test-only | Cluster apply wants graph add/remove/reload as real control-plane operations |
| `omnigraph-queries::QueryRegistry` | Loads `.gq` files from `queries:` and honors `mcp.expose` for catalog listing | Target cluster config removes exposure from the registry and moves list/invoke to policy |
| `omnigraph-policy::PolicyAction` | Per-graph actions plus server-scoped `graph_list`; `invoke_query` is graph-scoped and coarse | Cluster plan/apply and per-query exposure need new policy scope without breaking coarse rules |
| Engine graph manifest | Graph-level atomic visibility via `__manifest`, expected table versions, and recovery sidecars | Cluster apply needs a higher-level publisher; Lance still commits per dataset |
| Schema apply | Existing plan/apply/lock shape for one graph; soft/hard drops already modeled | This is the prototype resource reconciler, but cluster apply cannot call it blindly and then claim cluster atomicity |
| Public docs/tests | Config, policy, server, and query behavior are already documented and tested | Every behavior change below has user docs and test fallout |
Compatibility Stance
- `cluster.yaml` is a new target-state file, not `omnigraph.yaml` v2.
- Existing `omnigraph.yaml` keeps working for CLI, server boot, aliases, graph locators, bearer-token env lookup, and the current stored-query registry.
- Initial cluster commands are explicit: `omnigraph cluster validate`, `omnigraph cluster plan`, `omnigraph cluster apply`, `omnigraph cluster status`, `omnigraph cluster refresh`, and `omnigraph cluster import`.
- Cluster config is one shared folder, resolved from the command's cluster root or explicit path. It is not merged from global + project + active context layers.
- The per-operator connection layer selects the cluster root and actor identity. It is not committed into `cluster.yaml`.
- `mcp.expose` remains supported in current `omnigraph.yaml` until the per-query policy replacement ships.
Terraform-Aligned Schema Validation
Every field in target-state cluster.yaml must be honored or rejected:
- If a field is part of the declared resource schema, it must affect validation, plan, apply, state, or status.
- If a field is misspelled, placed under the wrong resource kind, or reserved for a future phase, `cluster validate` / `cluster plan` must fail with a typed diagnostic.
- Compatibility warnings are allowed only in an explicit migration window for old schema versions. They are not allowed in the target schema.
- Free-form extension areas must be named as such, for example `labels`, `metadata`, `vars`, or `provider_options`; accidental unknown keys are never treated as extension data.
Examples:
```yaml
graphs:
  knowledge:
    schema: ./knowledge.pg
    lables: { team: platform }  # invalid: typo, use `labels`
pipelines:
  github_sync:
    # The env ref is quoted because `{` and `}` are flow indicators inside a flow mapping.
    source: { kind: github, token: "${GITHUB_TOKEN}" }
    into:
      - { graph: engineering, map: ./github.map.yaml }
    retry_magic: true  # invalid unless `retry_magic` is in schema
```

```yaml
graphs:
  knowledge:
    schema: ./knowledge.pg
    labels: { team: platform }  # valid free-form metadata bucket
    provider_options:
      lance:
        compaction_window: daily  # valid only if this extension is declared
```
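The "honored or rejected" rule can be illustrated with a minimal sketch. It is not the project's parser: the `DECLARED_KEYS` table, the `Diagnostic` type, and the diagnostic codes are hypothetical names chosen for illustration, and the config is assumed to be already parsed into nested dictionaries.

```python
from dataclasses import dataclass

# Hypothetical declared schema: the keys each resource kind accepts.
# Free-form buckets such as `labels` are declared by name; their contents are not inspected.
DECLARED_KEYS = {
    "graphs": {"schema", "storage", "path", "labels", "provider_options"},
    "pipelines": {"source", "into"},
}


@dataclass(frozen=True)
class Diagnostic:
    code: str     # typed diagnostic code (hypothetical naming)
    path: str     # dotted path to the offending key
    message: str


def validate_resources(config: dict) -> list[Diagnostic]:
    """Return one diagnostic per key that the declared schema does not name."""
    diagnostics = []
    for kind, resources in sorted(config.items()):
        allowed = DECLARED_KEYS.get(kind)
        if allowed is None:
            diagnostics.append(
                Diagnostic("unknown_kind", kind, f"unknown resource kind '{kind}'")
            )
            continue
        for name, body in sorted(resources.items()):
            for key in sorted(body):
                if key not in allowed:
                    path = f"{kind}.{name}.{key}"
                    diagnostics.append(
                        Diagnostic("unknown_field", path, f"'{key}' is not in the {kind} schema")
                    )
    return diagnostics
```

Applied to the invalid example above, this sketch reports `graphs.knowledge.lables` and `pipelines.github_sync.retry_magic`, while the `labels` bucket passes regardless of its contents.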
Typed Resource And Provider Addresses
A locator is a typed address to another declared thing. Internally — in plan and
state — every reference is a typed address (axiom 9). At the config surface a
field may accept bare shorthand when its schema fixes the referent kind (a
policy applies_to: list is graph refs; a pipeline into.graph is a graph id) —
the parser normalizes it to the typed address before planning. A value whose
kind is ambiguous or wrong (a source: that could be a connector type, an
instance, or a provider) has no safe normalization and must be a typed
provider.* address or an explicit inline block.
Target address forms:
```text
graph.<graph_id>
schema.<graph_id>
query.<graph_id>.<query_name>
policy.<policy_name>
ui.dashboard.<dashboard_name>
pipeline.<pipeline_name>
provider.storage.<provider_name>
provider.source.<provider_name>
provider.embedding.<provider_name>
```
Bad shape — the value's kind is ambiguous or wrong, not merely bare:
```yaml
pipelines:
  github_sync:
    source: github  # AMBIGUOUS kind: connector type, instance, or provider?
                    # → provider.source.<name> or inline { kind: github, ... }
policies:
  base_rbac:
    applies_to: [query.knowledge.find_experts]  # WRONG kind: a query address in a graph-ref field
```

OK shorthand (kind fixed by the field → normalized):

```yaml
policies:
  base_rbac:
    applies_to: [knowledge, engineering]  # bare names in a graph-ref field → graph.knowledge, graph.engineering
```
Target shape:
```yaml
providers:
  storage:
    prod_graphs:
      kind: s3
      bucket: company
      prefix: prod
  source:
    github_org:
      kind: github
      token: ${GITHUB_TOKEN}
graphs:
  knowledge:
    storage: provider.storage.prod_graphs
    path: graphs/knowledge.omni
    schema: ./knowledge.pg
  engineering:
    storage: provider.storage.prod_graphs
    path: graphs/engineering.omni
    schema: ./engineering.pg
policies:
  base_rbac:
    file: ./base_rbac.policy.yaml
    applies_to:
      - graph.knowledge
      - graph.engineering
pipelines:
  github_sync:
    source: provider.source.github_org
    into:
      - { graph: graph.engineering, map: ./github_to_engineering.map.yaml }
      - { graph: graph.knowledge, map: ./github_to_people.map.yaml }
```
Validation rules:
- A field that expects a graph address accepts `graph.<id>`, not `query.<graph>.<name>` or an arbitrary string.
- A field that expects a query address accepts `query.<graph>.<name>`, and the planner validates both the graph and the query symbol.
- A field that expects a source provider accepts `provider.source.<name>`, not `provider.storage.<name>`.
- A field that expects storage accepts `provider.storage.<name>` or an explicit storage block, not a server URL or source connector.
- A field whose schema fixes the kind accepts bare shorthand (e.g. `knowledge` in a graph-ref field) and normalizes it to the typed address; a kind-ambiguous or wrong-kind value is rejected with a typed diagnostic.
- Plan and state always store the normalized typed address, regardless of whether the surface used shorthand.
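The validation rules above can be sketched as a single normalization function. This is an illustrative sketch only: `normalize_ref`, `AddressError`, and the `allow_shorthand` flag are hypothetical names, and the sketch covers only single-segment names such as `graph.<id>` and `provider.source.<name>`.

```python
class AddressError(ValueError):
    """Hypothetical typed diagnostic for a wrong-kind or ambiguous reference."""


def normalize_ref(value: str, expected_kind: str, allow_shorthand: bool = False) -> str:
    """Normalize a surface value to the typed address `<expected_kind>.<name>`.

    `allow_shorthand` is True only for fields whose schema fixes the referent
    kind (for example a graph-ref field), so a bare name is unambiguous there.
    """
    prefix = expected_kind + "."
    if value.startswith(prefix):
        name = value[len(prefix):]
    elif "." not in value and allow_shorthand:
        name = value  # bare shorthand in a kind-fixed field
    else:
        raise AddressError(f"expected {prefix}<name>, got '{value}'")
    if not name or "." in name:
        raise AddressError(f"'{value}' is not a valid {expected_kind} address")
    return prefix + name
```

Under these assumptions, `normalize_ref("knowledge", "graph", allow_shorthand=True)` yields `graph.knowledge`, while a bare `github` in a `provider.source` field is rejected because that field does not permit shorthand.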
Target Components
Preferred split:
| Component | Responsibility | Depends on |
|---|---|---|
| `omnigraph-cluster` crate | Cluster spec types, path resolution, resource graph, plan model, state backend traits, apply orchestration | `omnigraph-config` only for shared simple config types if needed; avoid server deps |
| `omnigraph` engine additions | Graph lifecycle primitives, schema-apply integration, recovery hooks for graph moves during cluster apply; optional future cluster manifest publisher if JSON state is not enough | Lance, existing graph manifest/recovery |
| `omnigraph-cli` | `cluster *` commands, plan rendering, approval collection, state lock UX | `omnigraph-cluster`, engine |
| `omnigraph-server` | Optional boot from cluster state, registry reload, status endpoints, policy-filtered query catalog | `omnigraph-cluster`, engine, policy |
| `omnigraph-policy` | Cluster/server actions, per-query list/invoke scope, approval policy predicates | none above server |
| `omnigraph-queries` | Registry without exposure side-channel; dependency metadata for downstream validation | compiler/config |
| `omnigraph-api-types` | New status/plan/apply response types if cluster HTTP endpoints ship | serde only |
If the first implementation avoids a new crate, keep the same boundary in modules. The important constraint is that cluster spec parsing must not drag HTTP/server code into compiler or engine crates.
Resource Model
Resource identity is stable and typed:
```text
ClusterRoot
ResourceKey = <kind>/<scope>/<name>
ResourceAddress = <kind>.<name> | <kind>.<graph_id>.<name>
ProviderAddress = provider.<kind>.<name>

graph/cluster/knowledge
schema/graph:knowledge/main
query/graph:knowledge/find_experts
policy/cluster/base_rbac
ui/cluster/dashboard.overview
pipeline/cluster/github_sync
alias/cluster/experts
embedding/cluster/default
```
Resource records carry:
| Field | Meaning |
|---|---|
| `kind` | Graph, Schema, Query, PolicyBundle, UiSpec, Binding, Alias, EmbeddingConfig, Pipeline |
| `scope` | Cluster or graph id |
| `name` | Stable resource name inside scope |
| `fingerprint` | Content hash of the normalized spec and all referenced files |
| `dependencies` | Resource keys this resource references |
| `observed` | Applied graph manifest version, policy digest, query digest, schedule id, etc. |
| `status` | Pending, Planned, Applying, Applied, Drifted, Blocked, Error |
| `conditions` | Typed details such as `ActualAppliedStatePending`, `NeedsApproval`, `DependencyMissing`, `PartialPipelineRun` |
The planner builds a dependency graph from these records and uses it for both validation and blast-radius reporting.
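One way to picture blast-radius reporting is a reverse traversal of the dependency edges. The sketch below assumes the `dependencies` field has already been collected into a plain mapping from resource key to referenced keys; the function name `blast_radius` is hypothetical.

```python
from collections import defaultdict, deque


def blast_radius(dependencies: dict[str, list[str]], changed: str) -> list[str]:
    """Return every resource that transitively depends on `changed`, sorted."""
    # Invert the edges: for each resource, who references it?
    dependents = defaultdict(set)
    for resource, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(resource)

    # Breadth-first walk over the inverted edges.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for downstream in dependents[current]:
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    seen.discard(changed)  # a cycle must not report the changed resource itself
    return sorted(seen)
```

The sorted result keeps the report deterministic, which matters if the same list is embedded in plan output.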
Terraform-Style Validate / Plan / Apply
The cluster workflow deliberately mirrors Terraform's safe sequence:
```text
cluster validate   # parse + schema-check desired config, no state mutation
cluster plan       # diff desired config against state, with optional refresh
cluster apply      # apply an accepted fresh plan and update state
cluster status     # read state-backed deployed reality
cluster refresh    # repair/import observations from actual cluster state
```
Implementation rollout follows the same safety posture: ship parser/validate first, then read-only plan, then state backend and lock, then apply.
The plan is a structured artifact, not just terminal text. It must include:
| Plan field | Why it exists |
|---|---|
| `desired_revision` | Git commit / config digest being evaluated |
| `resource_digests` | Exact digest of every schema, query, policy, UI, pipeline, and map file |
| `dependencies` | Edges such as query -> graph/schema, dashboard -> query, pipeline -> source provider + graph |
| `state_observations` | Applied revision, resource fingerprints, graph manifest versions, status conditions, and drift |
| `changes` | Create/update/delete/replace/refresh-only operations |
| `blast_radius` | Downstream resources to revalidate or affected behavior to surface |
| `approvals_required` | Irreversible/data-loss or compatibility-narrowing gates |
cluster apply must reject a stale plan when state, resource digests, or
observed graph versions no longer match the plan base. The operator or agent
must re-plan or explicitly refresh first.
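The stale-plan rule can be sketched as a comparison between the plan's recorded base and what apply re-reads. The `PlanBase` type and its field names are hypothetical; the real plan and state schemas are still open questions for the RFC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanBase:
    state_revision: int
    resource_digests: dict   # resource key -> content digest
    graph_versions: dict     # graph id -> observed manifest version


def stale_reasons(plan_base: PlanBase, current: PlanBase) -> list[str]:
    """Return the names of the base fields that no longer match; empty means fresh."""
    reasons = []
    if plan_base.state_revision != current.state_revision:
        reasons.append("state_revision")
    if plan_base.resource_digests != current.resource_digests:
        reasons.append("resource_digests")
    if plan_base.graph_versions != current.graph_versions:
        reasons.append("graph_versions")
    return reasons
```

In this sketch, apply would proceed only when the returned list is empty and would otherwise report the reasons and ask for a re-plan or refresh.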
Cluster Storage Layout
Target Phase-1 cluster-root layout:
```text
<cluster-root>/
  __cluster/
    state.json
    lock.json
    status/
      <resource-address>.json
    approvals/
      <ulid>.json
    recoveries/
      <ulid>.json
    recovery/
      <ulid>.json
    resources/
      query/<graph>/<name>/<digest>.gq
      policy/<name>/<digest>.yaml
      ui/<name>/<digest>.dashboard.yaml
      pipeline/<name>/<digest>.pipeline.yaml
  graphs/
    <graph_id>.omni/
```
The exact filenames can change, but the shape cannot:
- There is one cluster-control namespace under the cluster root.
- Graph data remains in ordinary OmniGraph graph roots.
- State is a locked/CAS-updated JSON document, not a Lance dataset.
- Status, approval, and recovery ledgers are append-only or per-resource JSON records until table semantics are proven necessary.
- Resource payloads are content-addressed by digest so apply can be idempotent.
- Cluster state is not inferred from the operator's working tree.
- A Lance-backed control-plane store is a future backend option only if row-level queryability/history or tighter publish fencing justifies it.
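To show why content addressing makes apply idempotent, here is a small sketch. It assumes SHA-256 as the digest (the spec does not name a hash) and uses an in-memory dictionary in place of the object store; `payload_path` and `write_payload` are hypothetical helpers.

```python
import hashlib


def payload_path(kind: str, name: str, content: bytes, extension: str) -> str:
    """Derive a payload path from the content digest (SHA-256 is an assumption)."""
    digest = hashlib.sha256(content).hexdigest()
    return f"resources/{kind}/{name}/{digest}.{extension}"


def write_payload(store: dict, path: str, content: bytes) -> bool:
    """Write once; a repeated apply of identical content is a no-op."""
    if path in store:
        return False
    store[path] = content
    return True
```

Because the path is a function of the bytes, retrying an interrupted apply rewrites nothing, and an edited payload lands at a new path instead of overwriting the old one.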
State Backend Protocol
Cluster-Hosted JSON State
When state.backend: cluster, the baseline backend stores JSON documents under
<cluster-root>/__cluster/ and protects state.json with object-store lock/CAS.
It is cluster-hosted, but it is still a separate state write from graph Lance
manifest movement.
Apply protocol:
1. Acquire the cluster state lock.
2. Read current `state.json` and backend CAS token / object generation.
3. Validate plan base still matches state.
4. Write a cluster recovery sidecar before any graph manifest or non-idempotent resource can move.
5. Write content-addressed resource payloads and perform any required graph manifest movements.
6. CAS-update `state.json` with the new applied revision, resource fingerprints, observed graph versions, status references, and approval / recovery references.
7. If step 6 fails after actual resources moved, do not acknowledge success. Surface `ActualAppliedStatePending` and require `refresh` / `import` repair.
8. Delete the sidecar and release the lock only after the state outcome is recorded.
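The apply protocol above can be walked through with an in-memory stand-in for the state backend. Everything here is a hypothetical sketch: `InMemoryStateStore`, `CasConflict`, and `apply_plan` are illustrative names, the lock from step 1 is omitted, and keeping the sidecar after a failed CAS is an assumption about how repair would find it.

```python
class CasConflict(Exception):
    """Raised when the expected generation no longer matches the stored one."""


class InMemoryStateStore:
    def __init__(self):
        self.state = {"revision": 0}
        self.generation = 0  # stands in for an object-store CAS token

    def read(self):
        return dict(self.state), self.generation

    def cas_write(self, new_state, expected_generation):
        if expected_generation != self.generation:
            raise CasConflict("state changed since it was read")
        self.state = dict(new_state)
        self.generation += 1


def apply_plan(store, plan_base_revision, new_revision, move_resources, sidecars):
    state, token = store.read()                            # step 2
    if state["revision"] != plan_base_revision:            # step 3
        return "StalePlan"
    sidecars[new_revision] = {"base": plan_base_revision}  # step 4
    move_resources()                                       # step 5
    try:
        store.cas_write({"revision": new_revision}, token)  # step 6
    except CasConflict:
        # Step 7: resources moved but state did not; never report success.
        return "ActualAppliedStatePending"
    del sidecars[new_revision]                             # step 8
    return "Applied"
```

The sidecar is written before anything moves and is removed only on the success path, so a crash or CAS failure between steps 5 and 6 leaves evidence behind for `refresh` / `import`.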
External State
When state.backend points outside the cluster root, the same JSON state shape
lives in an external store. It is locked and CAS-updated, but it is not atomic
with Lance or OmniGraph manifests.
Apply protocol:
1. Acquire the external state lock.
2. Read state and CAS token.
3. Validate plan base still matches state.
4. Write a cluster recovery sidecar.
5. Perform the cluster resource changes.
6. CAS-update external state with the new applied revision, statuses, and the observed graph manifest / resource versions it records.
7. If step 6 fails, do not acknowledge success. Surface `ActualAppliedStatePending` and require `refresh` / `import` repair.
8. Release the external lock only after the state outcome is recorded.
This mode can be strongly coordinated, but it must never be documented as one atomic commit across both stores.
Future Lance-Backed State
A Lance-backed state/status/approval/recovery store is deliberately not the baseline. It becomes attractive only if JSON files become a real liability: large status sets need structured filtering, approval/recovery history needs table scans, or cluster apply needs a manifest publisher that can fence state and graph-version pins together. Until then, Lance datasets add bootstrapping, schema migration, and control-plane recovery surface without enough benefit.
Cluster Manifest Publisher
The cluster publisher is a possible later layer above today's graph publisher.
It does not replace Lance or the per-graph __manifest table, and it is not
required for Phase-1 JSON state / read-only plan.
Required semantics:
| Requirement | Detail |
|---|---|
| Expected-version CAS | Every resource in an apply group supplies its expected current version/fingerprint |
| Resource changes | Register/update/tombstone resource payloads and graph version pins |
| Graph-head fencing | If a graph schema/lifecycle operation moves a graph manifest, the cluster manifest records the exact graph manifest version |
| Sidecar coverage | Any graph or cluster resource that can move before cluster publish must be recoverable all-or-nothing |
| Deterministic publish order | Sidecars and apply groups process in stable order |
| Loud partials | If a group cannot be rolled back or forward in-process, status records the condition before more apply work proceeds |
The risky case is nested publish:
```text
schema apply moves graph:knowledge manifest
cluster apply has not yet published query/policy/state records
process crashes
```
That is not safe unless the cluster sidecar records enough information to roll the graph movement forward into the cluster manifest or roll it back using the same recovery discipline as current graph recovery.
Plan Model
Plan output is a durable, replay-checked proposal, not just pretty text:
```text
Plan {
  plan_id,
  desired_revision,
  base_state_revision,
  base_state_cas,
  changes[],
  apply_groups[],
  approvals_required[],
  blast_radius,
  diagnostics[]
}
```
Each change records:
| Field | Meaning |
|---|---|
| `resource` | Stable ResourceKey |
| `operation` | Create, Update, Delete, Replace, RefreshOnly |
| `reversibility` | Reversible, Recoverable, CompatibilityNarrowing, IrreversibleDataLoss |
| `effect` | ConfigOnly, Catalog, GraphDefinition, GraphDataRewrite, DataPlaneSchedule |
| `downstream` | Resources that must be revalidated or will observe changed behavior |
| `approval` | None, HumanRequired, PolicyRequired, AlreadySatisfied |
apply must re-read state and reject stale plans unless an explicit
--refresh / --replan path recomputes the plan.
Downstream Dependency Rules
These are the concrete "what requires downstream" rules.
| Changed resource | Must revalidate / recompute downstream | Blocking failures |
|---|---|---|
| Graph create/delete/rename | Policies, queries, aliases, dashboards, pipelines, bindings, server registry, state graph set | Dangling graph references; duplicate URI; invalid GraphId; graph delete without irreversible approval |
| Schema | Stored queries, pipeline maps, UI bindings/query outputs, embedding/index config, data-impact preview, policy predicates once row/type pushdown exists | Unsupported migration; query breakage; missing target type/property; hard drop without approval |
| Stored query | Aliases, UI bindings, policy list/invoke grants, MCP/tool catalog compatibility, typed params | Query file parse/type errors; registry key != query <name>; removed query still referenced |
| Policy bundle | Query catalog visibility, graph/server action authorization, approval gates, bootstrap permissions | Invalid Cedar/YAML; server-scoped action in graph policy; per-query list/invoke gap unhandled |
| UI/dashboard | Query bindings, graph refs, output field expectations, policy visibility for referenced queries | Binding to missing graph/query/param/output |
| Alias | CLI command resolution, graph/query refs, shared-vs-personal boundary | Dangling graph/query; mutation alias pointing at read-only context |
| Embedding config | Schema `@embed` columns, model dimension, index rebuild/reconcile, env refs | Dimension mismatch; missing env ref; unsupported model/provider |
| Pipeline definition | Target graph schemas, mapping files, env refs, scheduler/runtime state, per-target run ledger | Missing target graph/type/property; overwrite mode without approval; source secret missing |
| Binding | Referenced source/surface pair, dependency order, visibility policy | Missing source or target; incompatible params |
| State backend config | Lock implementation, import/refresh protocol, apply acknowledgements | Backend missing CAS/lock; state CAS failure after graph/resource movement |
Blast Radius Matrix
| Area | Required downstream change | Blast radius | Notes |
|---|---|---|---|
| Config parsing | Add strict `cluster.yaml` parser, path/env-ref resolver, resource fingerprints, no layered merge | High | Separate from `OmnigraphConfig`; existing config tests still need backcompat coverage |
| CLI | Add `cluster validate/plan/apply/status/refresh/import`, plan rendering, approval flags, actor threading | High | Must not change existing command selection or `omnigraph use` behavior |
| State backend | Add JSON state document, status/approval/recovery records, lock/CAS, and import/refresh repair | High | Must not silently succeed after state CAS failure |
| Optional cluster publisher | Add a cluster manifest plus table-backed state/status store only if stronger all-or-nothing apply is required | Very high | Touches core atomicity and recovery invariants |
| Recovery | Add cluster sidecars and failpoint coverage for graph-move-before-state-publish gaps | Very high | Any missed sidecar is a correctness bug |
| Graph lifecycle | First-class graph resource create/delete/rename or stable-id story | High | Current server add/remove is intentionally not exposed |
| Schema apply integration | Make schema apply cluster-aware or wrap it with cluster recovery | High | Existing schema apply cannot be treated as cluster atomic by assertion |
| Query registry | Remove target-state exposure flag, add dependency metadata, keep `mcp.expose` bridge | Medium/high | Catalog behavior is observable public API |
| Policy | Add cluster plan/apply/admin actions and per-query list/invoke scope | High | Needs docs, tests, Cedar schema migration, and compatibility with coarse invoke_query |
| Server registry | Boot from cluster state, eventually reload/reconcile graph handles, expose statuses | High | Affects routing, OpenAPI, auth, and workload admission |
| API types/OpenAPI | Plan/status/apply DTOs if HTTP management endpoints ship | Medium/high | OpenAPI drift must be regenerated |
| UI specs | New renderer/spec validator/binding checker | High | New product surface, not currently implemented |
| Pipelines | New scheduler/runtime/connector/mapping/idempotency/run ledger | Very high | Second data-plane seam; large product and correctness surface |
| Embeddings | Cluster-level defaults, env refs, model/dimension validation, index interaction | Medium | Existing embedding code is mostly offline/client-side |
| Docs | User docs for cluster config, policy, server, CLI; dev docs for invariants/testing | High | Public contract changes |
| Tests | New cluster suites plus extensions to config/server/policy/recovery/schema/query tests | High | Needs boundary-matched coverage |
Reversibility And Approval Tiers
| Tier | Examples | Gate |
|---|---|---|
| Display-only | Dashboard layout, non-breaking alias addition | No approval beyond policy |
| Catalog behavior | Add query, hide/list query via policy, add policy grant | Policy check; no data-loss approval |
| Compatibility narrowing | Future validated enum narrowing, query param removal, policy removal that revokes access | Explicit compatibility warning; may require human approval by policy |
| Recoverable definition rewrite | Soft schema drop, graph schema rename, index rebuild | Plan warning; no data-loss approval unless policy requires |
| Irreversible data loss | Graph delete, hard schema drop, cleanup-triggered prior-version reclamation, overwriting pipeline target | Human approval artifact recorded in audit ledger |
Future enum narrowing belongs in CompatibilityNarrowing unless the migration
also drops/coerces data or triggers cleanup. That distinction matters for plan
wording and for policy predicates.
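A possible reading of the tiers as an approval decision is sketched below. The mapping is an assumption, not a settled contract: in particular, returning `PolicyRequired` when a policy demands sign-off on a recoverable or compatibility-narrowing change is one interpretation of the `approval` values in the plan model.

```python
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "Reversible"
    RECOVERABLE = "Recoverable"
    COMPATIBILITY_NARROWING = "CompatibilityNarrowing"
    IRREVERSIBLE_DATA_LOSS = "IrreversibleDataLoss"


def required_approval(reversibility, policy_requires_approval=False, approval_recorded=False):
    """Map a change's reversibility tier to an approval value (hypothetical mapping)."""
    needs_human = reversibility is Reversibility.IRREVERSIBLE_DATA_LOSS
    needs_policy = policy_requires_approval and reversibility in (
        Reversibility.RECOVERABLE,
        Reversibility.COMPATIBILITY_NARROWING,
    )
    if not (needs_human or needs_policy):
        return "None"
    if approval_recorded:
        return "AlreadySatisfied"
    return "HumanRequired" if needs_human else "PolicyRequired"
```

Under this reading, enum narrowing stays at `None` or `PolicyRequired`, and only a migration that also drops or coerces data would be reclassified into the tier that always demands a human approval artifact.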
Rollout Phases
Phase 0: Documentation And Parser Skeleton
- Add cluster spec types and strict parser behind an unused feature/module.
- Implement `cluster validate --config <folder>` with no state backend.
- Validate file paths, env refs, duplicate resource keys, and dependency graph.
- No behavior change to `omnigraph.yaml`, server boot, or query exposure.
Phase 1: Read-Only Planning
- Add `cluster plan` against a mock/imported state snapshot.
- Produce plan JSON and human output.
- Reuse existing schema migration planner for schema resources.
- Validate stored queries against desired schema.
- Compute downstream dependencies and blast radius.
- Still no apply.
Phase 2: State Backend And Lock
- Add `state.backend: cluster` JSON storage and lock/CAS.
- Add external backend trait only if lock + CAS semantics are explicit.
- Add `cluster status`, `refresh`, and `import`.
- Persist `AppliedRevision`, `ResourceStatus`, and audit references in JSON.
Phase 3: Config-Only Apply
- Apply query, policy, UI, alias, embedding, and pipeline definition resources that do not move graph manifests.
- Publish by writing content-addressed resource payloads and CAS-updating `state.json`.
- Keep server boot from `omnigraph.yaml`; cluster state is inspectable but not yet serving traffic.
Phase 4: Graph And Schema Apply
- Add graph create/delete as cluster resources.
- Make schema apply cluster-aware, with sidecar coverage for graph manifest movements before JSON state publish.
- Gate irreversible data-loss operations with approval artifacts.
- Consider a cluster manifest publisher only if the JSON sidecar + repair path is not strong enough for the accepted safety contract.
Phase 5: Server Reads Cluster Catalog
- Allow server startup from cluster state.
- Add status and catalog endpoints as needed.
- Keep the current `omnigraph.yaml` startup path as compatibility mode.
- Regenerate OpenAPI for any HTTP surface.
Phase 6: Policy-Owned Query Exposure
- Add per-query policy scope for list/invoke.
- Filter `GET /queries` by actor.
- Keep coarse `invoke_query` as a broad allow rule for compatibility until docs and migrations say it can be narrowed.
- Deprecate and later remove `mcp.expose` from target-state cluster config.
Phase 7: Pipeline Runtime
- Add scheduler/worker/runtime.
- Add source connector contracts, mapping validation, idempotency keys, per-target run status, and retry behavior.
- Treat fan-out execution as data-plane writes unless explicitly staged through branch/merge.
Test Ownership
Tests must prove the Terraform-style workflow, not just individual parsers. The minimum behavior contract:
validate catches bad config
plan is deterministic and complete
apply only applies a fresh accepted plan
state changes are locked and durable
drift and partial convergence are visible, not silent
| Change | Existing coverage to extend | New coverage likely needed |
|---|---|---|
| Cluster parser | `omnigraph-config` inline config tests for strictness/path resolution | `omnigraph-cluster` parser/dependency tests |
| Plan dependency graph | Schema planner tests, query registry tests | Golden plan JSON for cross-resource downstream impacts |
| State lock/backend | Existing schema apply lock tests as model | JSON state CAS/lock race tests |
| Optional cluster manifest publisher | `crates/omnigraph/src/db/manifest/tests.rs` | Cluster publisher CAS, expected-version, deterministic order tests if that backend ships |
| Cluster recovery | `recovery.rs`, `failpoints.rs` | Phase B -> state publish failpoints, external state CAS failure tests |
| Schema cluster apply | `schema_apply.rs`, failpoints schema apply cases | Nested graph/cluster recovery tests |
| Query exposure policy | `omnigraph-policy` `invoke_query` tests, server query catalog tests | Per-query list/invoke allow/deny and no-probing tests |
| Server cluster boot | `omnigraph-server/tests/server.rs`, `openapi.rs` | Boot from cluster state, registry reload/status tests |
| CLI cluster commands | `omnigraph-cli/tests/cli.rs`, `system_local.rs` | `cluster validate/plan/apply/status` system tests |
| Pipelines | None today | New runtime/mapping/idempotency/run-ledger suites |
Workflow-specific tests:
| Workflow area | Required assertions |
|---|---|
| Parser / validate | Unknown fields, wrong-kind typed addresses, missing providers, inline secret values, dangling graph/query/pipeline refs, and future-phase fields fail with typed diagnostics |
| Plan goldens | Given config + imported/fake state, plan JSON contains stable resource digests, dependency edges, state observations, proposed changes, blast radius, and approval gates in deterministic order |
| Fresh-plan apply | Changing config digest, state revision, resource digest, or observed graph manifest version after planning makes cluster apply reject and require re-plan/refresh |
| State lock / CAS | Concurrent applies against the same backend cannot both succeed; loser gets a typed lock/CAS conflict |
| Recovery / partial apply | Fail after graph/resource movement but before cluster state publish; assert recovery or status surfaces ActualAppliedStatePending/sidecar state and never returns success |
| Server/runtime phase | Before cluster state drives routing or registry reload, tests are hermetic: no real home dir, no real global config, no real credentials, no ignored remote tests |
| Pipeline phase | Fan-out run records per-target status, commit ids, retryability, and idempotency keys; no aggregate success unless every target succeeded |
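The "deterministic order" requirement for plan goldens can be illustrated with a small rendering sketch. The `render_plan` helper and the plan dictionary shape are hypothetical; the point is only that the same logical plan serializes to the same bytes regardless of discovery order.

```python
import json


def render_plan(plan: dict) -> str:
    """Serialize a plan so that equal plans always produce identical JSON text."""
    normalized = dict(plan)
    # Changes are ordered by resource key, then operation, not by discovery order.
    normalized["changes"] = sorted(
        plan["changes"], key=lambda change: (change["resource"], change["operation"])
    )
    # sort_keys fixes the order of object members at every nesting level.
    return json.dumps(normalized, sort_keys=True, indent=2)
```

A golden test can then compare the rendered text byte-for-byte against a checked-in file.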
Hard gates:
- Do not ship `cluster apply` until `cluster validate` and read-only `cluster plan` have hermetic tests.
- Do not ship graph/schema-moving apply until failpoint recovery tests prove the Phase B -> state publish gap is covered.
For docs-only changes, scripts/check-agents-md.sh is enough. For
implementation phases, run the boundary tests above before widening to
cargo test --workspace --locked.
User-Visible Documentation Fallout
The following public docs must change when the corresponding phase ships:
| Phase | User docs |
|---|---|
| Parser/validate | New docs/user/cluster-config.md; CLI reference for cluster validate |
| Plan/apply | CLI reference, transactions, policy, errors |
| State backend | Storage, deployment, constants, maintenance |
| Server cluster boot | Server, deployment, OpenAPI |
| Policy query exposure | Policy, server, query language / stored-query registry docs |
| Pipelines | New pipeline user guide, deployment, audit, errors |
| Embeddings config | Embeddings, indexes |
Do not ship a user-visible command, flag, env var, endpoint, or config key without updating the corresponding user doc in the same PR.
Known High-Risk Design Decisions
- Cluster root identity. Decide whether `metadata.name` is a label or identity. Prefer root-derived stable identity plus display name to avoid a rename breaking resource identity.
- Graph storage derivation. The high-level sample omits graph storage. Implementation should derive graph roots under `ClusterRoot/graphs/<id>.omni` by default and treat external graph roots as a separate, explicit feature.
- Nested apply. Schema apply and graph lifecycle cannot move a graph manifest outside cluster sidecar coverage.
- External state. Must expose pending repair instead of returning success when graph/resource movement succeeds and external state CAS fails.
- Per-query policy. Catalog filtering must avoid probing leaks: callers without list/invoke permission should not distinguish hidden from missing.
- Pipeline fan-out. Do not promise atomic multi-graph ingestion unless the runtime uses a real branch/merge or equivalent protocol for every target.
- Drift correction. Reconciler-initiated deletes are the same data-loss class as human-requested deletes.
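The probing-leak concern in the per-query policy item can be made concrete with a short sketch. It assumes grants are modeled as a set of `(actor, query_name)` pairs, which is a simplification of whatever the policy engine actually evaluates; `lookup_query`, `list_queries`, and `QueryNotFound` are hypothetical names.

```python
class QueryNotFound(Exception):
    """Single error for both hidden and missing queries."""


def lookup_query(catalog: dict, grants: set, actor: str, name: str) -> str:
    """Return the query body only if it exists and the actor may see it."""
    body = catalog.get(name)
    if body is None or (actor, name) not in grants:
        # Same exception type and message in both cases, so callers cannot probe.
        raise QueryNotFound(f"query '{name}' not found")
    return body


def list_queries(catalog: dict, grants: set, actor: str) -> list[str]:
    """List only the queries the actor is granted, in stable order."""
    return sorted(name for name in catalog if (actor, name) in grants)
```

A no-probing test would assert that the error for a hidden query is indistinguishable from the error for a query that does not exist.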
Exit Criteria For A Real RFC
Before implementation begins beyond parser/validate, the RFC must answer:
- Exact JSON state/status/approval/recovery schemas and object-store paths.
- Exact sidecar JSON schema and recovery decision matrix.
- State backend interface and supported lock/CAS implementations.
- Cluster apply group syntax and dependency ordering rules.
- Plan JSON schema, including blast-radius and approval fields.
- Bootstrap authority and first-actor story.
- Server startup and migration path from `omnigraph.yaml`.
- Per-query policy schema and compatibility bridge for `mcp.expose`.
- Pipeline runtime owner, status schema, and idempotency contract.