mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-09 01:35:18 +02:00
706 lines
35 KiB
Markdown
706 lines
35 KiB
Markdown
|
|
# Cluster Config Implementation Spec And Blast Radius
|
||
|
|
|
||
|
|
**Status:** Draft / implementation planning
|
||
|
|
**Type:** Downstream design spec
|
||
|
|
**Date:** 2026-06-08
|
||
|
|
**Relationship:** companion to [cluster-config-specs.md](cluster-config-specs.md)
|
||
|
|
and [cluster-axioms.md](cluster-axioms.md). The high-level spec explains why
|
||
|
|
the cluster control plane should exist; this file names what must change
|
||
|
|
downstream and how large the blast radius is.
|
||
|
|
|
||
|
|
<!-- Spec note: this file exists so the user-facing cluster spec can stay
|
||
|
|
readable. Keep implementation inventories, rollout phases, and test ownership
|
||
|
|
here instead of expanding the narrative spec into an encyclopedia. -->
|
||
|
|
|
||
|
|
## Executive Summary
|
||
|
|
|
||
|
|
Overall blast radius: **very high**.
|
||
|
|
|
||
|
|
This is not a small extension to `omnigraph.yaml`. The target design creates a
|
||
|
|
new shared cluster desired-state document, a locked state ledger, a cluster
|
||
|
|
manifest publisher, and a reconciler that coordinates resources above a single
|
||
|
|
graph. The existing config system remains useful, but its role changes:
|
||
|
|
|
||
|
|
- `omnigraph.yaml` / global config remains the per-operator and startup bridge.
|
||
|
|
- `cluster.yaml` becomes shared desired state for a deployment.
|
||
|
|
- The cluster state ledger becomes the authoritative record of applied reality.
|
||
|
|
- Server/runtime surfaces eventually read from the cluster catalog instead of
|
||
|
|
only from process-start config.
|
||
|
|
|
||
|
|
Safe rollout requires an additive path. Do not replace the current config,
|
||
|
|
server, or policy behavior in one step.
|
||
|
|
|
||
|
|
## Current Surfaces Surveyed
|
||
|
|
|
||
|
|
| Surface | Current behavior | Why it matters |
|
||
|
|
|---|---|---|
|
||
|
|
| `omnigraph-config::OmnigraphConfig` | Layered global/state/project config for CLI and server startup; strict `version: 1`; named maps replace wholesale | A cluster spec needs different ownership and merge semantics; do not stretch this type until it becomes ambiguous |
|
||
|
|
| `omnigraph-server::load_server_settings` | Opens either one selected graph or every configured embedded graph in multi mode | Cluster config changes startup, registry identity, and eventually runtime reconcile |
|
||
|
|
| `GraphRegistry` | Holds open graph handles; production registry is startup-only today; runtime insert is test-only | Cluster apply wants graph add/remove/reload as real control-plane operations |
|
||
|
|
| `omnigraph-queries::QueryRegistry` | Loads `.gq` files from `queries:` and honors `mcp.expose` for catalog listing | Target cluster config removes exposure from the registry and moves list/invoke to policy |
|
||
|
|
| `omnigraph-policy::PolicyAction` | Per-graph actions plus server-scoped `graph_list`; `invoke_query` is graph-scoped and coarse | Cluster plan/apply and per-query exposure need new policy scope without breaking coarse rules |
|
||
|
|
| Engine graph manifest | Graph-level atomic visibility via `__manifest`, expected table versions, and recovery sidecars | Cluster apply needs a higher-level publisher; Lance still commits per dataset |
|
||
|
|
| Schema apply | Existing plan/apply/lock shape for one graph; soft/hard drops already modeled | This is the prototype resource reconciler, but cluster apply cannot call it blindly and then claim cluster atomicity |
|
||
|
|
| Public docs/tests | Config, policy, server, and query behavior are already documented and tested | Every behavior change below has user docs and test fallout |
|
||
|
|
|
||
|
|
## Compatibility Stance
|
||
|
|
|
||
|
|
<!-- Spec note: keep `cluster.yaml` separate from `omnigraph.yaml` because the
|
||
|
|
current file is deliberately layered and partly per-operator. Collapsing shared
|
||
|
|
cluster intent into it would blur the source-of-truth split the high-level spec
|
||
|
|
is trying to create. -->
|
||
|
|
|
||
|
|
1. `cluster.yaml` is a new target-state file, not `omnigraph.yaml` v2.
|
||
|
|
2. Existing `omnigraph.yaml` keeps working for CLI, server boot, aliases,
|
||
|
|
graph locators, bearer-token env lookup, and the current stored-query
|
||
|
|
registry.
|
||
|
|
3. Initial cluster commands are explicit: `omnigraph cluster validate`,
|
||
|
|
`omnigraph cluster plan`, `omnigraph cluster apply`, `omnigraph cluster
|
||
|
|
status`, `omnigraph cluster refresh`, and `omnigraph cluster import`.
|
||
|
|
4. Cluster config is one shared folder, resolved from the command's cluster
|
||
|
|
root or explicit path. It is not merged from global + project + active
|
||
|
|
context layers.
|
||
|
|
5. The per-operator connection layer selects the cluster root and actor
|
||
|
|
identity. It is not committed into `cluster.yaml`.
|
||
|
|
6. `mcp.expose` remains supported in current `omnigraph.yaml` until the
|
||
|
|
per-query policy replacement ships.
|
||
|
|
|
||
|
|
## Terraform-Aligned Schema Validation
|
||
|
|
|
||
|
|
<!-- Spec note: Terraform is strict for resource/provider/module configuration,
|
||
|
|
but looser for variable-value inputs such as `.tfvars` and `TF_VAR_*`. For
|
||
|
|
cluster desired state we borrow the strict resource-schema posture because
|
||
|
|
`cluster.yaml` is shared intent, not an operator-local variable bag. -->
|
||
|
|
|
||
|
|
Every field in target-state `cluster.yaml` must be **honored or rejected**:
|
||
|
|
|
||
|
|
- If a field is part of the declared resource schema, it must affect
|
||
|
|
validation, plan, apply, state, or status.
|
||
|
|
- If a field is misspelled, placed under the wrong resource kind, or reserved
|
||
|
|
for a future phase, `cluster validate` / `cluster plan` must fail with a
|
||
|
|
typed diagnostic.
|
||
|
|
- Compatibility warnings are allowed only in an explicit migration window for
|
||
|
|
old schema versions. They are not allowed in the target schema.
|
||
|
|
- Free-form extension areas must be named as such, for example `labels`,
|
||
|
|
`metadata`, `vars`, or `provider_options`; accidental unknown keys are never
|
||
|
|
treated as extension data.
|
||
|
|
|
||
|
|
Examples:
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
graphs:
|
||
|
|
knowledge:
|
||
|
|
schema: ./knowledge.pg
|
||
|
|
lables: { team: platform } # invalid: typo, use `labels`
|
||
|
|
|
||
|
|
pipelines:
|
||
|
|
github_sync:
|
||
|
|
source: { kind: github, token: ${GITHUB_TOKEN} }
|
||
|
|
into:
|
||
|
|
- { graph: engineering, map: ./github.map.yaml }
|
||
|
|
retry_magic: true # invalid unless `retry_magic` is in schema
|
||
|
|
```
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
graphs:
|
||
|
|
knowledge:
|
||
|
|
schema: ./knowledge.pg
|
||
|
|
labels: { team: platform } # valid free-form metadata bucket
|
||
|
|
provider_options:
|
||
|
|
lance:
|
||
|
|
compaction_window: daily # valid only if this extension is declared
|
||
|
|
```
|
||
|
|
|
||
|
|
## Typed Resource And Provider Addresses
|
||
|
|
|
||
|
|
<!-- Spec note: this is the Terraform-aligned version of "typed locators".
|
||
|
|
The target cluster spec should not ask later code to guess whether a string is a
|
||
|
|
graph name, query name, server endpoint, storage URI, source connector, or
|
||
|
|
credential reference. References carry their kind. -->
|
||
|
|
|
||
|
|
<!-- Fix (2026-06-08): resolved the "shorthand may exist" (here) vs "bare strings
|
||
|
|
are bad shape" (below) contradiction. The rule is now explicit: bare names ARE
|
||
|
|
valid shorthand in a field whose schema fixes the referent kind (normalized to a
|
||
|
|
typed address); "bad shape" means a value whose KIND is ambiguous or WRONG, not
|
||
|
|
merely bare. This also makes the high-level spec's bare examples (policy
|
||
|
|
`graphs:`/`applies_to:` lists, pipeline `into.graph`, dashboard `graphs:`) valid. -->
|
||
|
|
A locator is a typed address to another declared thing. **Internally — in plan and
|
||
|
|
state — every reference is a typed address** (axiom 9). At the config *surface* a
|
||
|
|
field may accept **bare shorthand when its schema fixes the referent kind** (a
|
||
|
|
policy `applies_to:` list is graph refs; a pipeline `into.graph` is a graph id) —
|
||
|
|
the parser normalizes it to the typed address before planning. A value whose
|
||
|
|
*kind* is ambiguous or wrong (a `source:` that could be a connector type, an
|
||
|
|
instance, or a provider) has no safe normalization and must be a typed
|
||
|
|
`provider.*` address or an explicit inline block.
|
||
|
|
|
||
|
|
Target address forms:
|
||
|
|
|
||
|
|
```text
|
||
|
|
graph.<graph_id>
|
||
|
|
schema.<graph_id>
|
||
|
|
query.<graph_id>.<query_name>
|
||
|
|
policy.<policy_name>
|
||
|
|
ui.dashboard.<dashboard_name>
|
||
|
|
pipeline.<pipeline_name>
|
||
|
|
provider.storage.<provider_name>
|
||
|
|
provider.source.<provider_name>
|
||
|
|
provider.embedding.<provider_name>
|
||
|
|
```
|
||
|
|
|
||
|
|
Bad shape — the value's **kind is ambiguous or wrong**, not merely bare:
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
pipelines:
|
||
|
|
github_sync:
|
||
|
|
source: github # AMBIGUOUS kind: connector type, instance, or provider?
|
||
|
|
# → provider.source.<name> or inline { kind: github, ... }
|
||
|
|
policies:
|
||
|
|
base_rbac:
|
||
|
|
applies_to: [query.knowledge.find_experts] # WRONG kind: a query address in a graph-ref field
|
||
|
|
```
|
||
|
|
|
||
|
|
OK shorthand (kind fixed by the field → normalized):
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
policies:
|
||
|
|
base_rbac:
|
||
|
|
applies_to: [knowledge, engineering] # bare names in a graph-ref field → graph.knowledge, graph.engineering
|
||
|
|
```
|
||
|
|
|
||
|
|
Target shape:
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
providers:
|
||
|
|
storage:
|
||
|
|
prod_graphs:
|
||
|
|
kind: s3
|
||
|
|
bucket: company
|
||
|
|
prefix: prod
|
||
|
|
source:
|
||
|
|
github_org:
|
||
|
|
kind: github
|
||
|
|
token: ${GITHUB_TOKEN}
|
||
|
|
|
||
|
|
graphs:
|
||
|
|
knowledge:
|
||
|
|
storage: provider.storage.prod_graphs
|
||
|
|
path: graphs/knowledge.omni
|
||
|
|
schema: ./knowledge.pg
|
||
|
|
engineering:
|
||
|
|
storage: provider.storage.prod_graphs
|
||
|
|
path: graphs/engineering.omni
|
||
|
|
schema: ./engineering.pg
|
||
|
|
|
||
|
|
policies:
|
||
|
|
base_rbac:
|
||
|
|
file: ./base_rbac.policy.yaml
|
||
|
|
applies_to:
|
||
|
|
- graph.knowledge
|
||
|
|
- graph.engineering
|
||
|
|
|
||
|
|
pipelines:
|
||
|
|
github_sync:
|
||
|
|
source: provider.source.github_org
|
||
|
|
into:
|
||
|
|
- { graph: graph.engineering, map: ./github_to_engineering.map.yaml }
|
||
|
|
- { graph: graph.knowledge, map: ./github_to_people.map.yaml }
|
||
|
|
```
|
||
|
|
|
||
|
|
<!-- Fix (2026-06-08): this example shows the EXPLICIT/external graph-storage case
|
||
|
|
(`storage:` + `path:`). It is not the default — per "Known High-Risk Design
|
||
|
|
Decisions" §2 and the cluster storage layout, graph roots derive to
|
||
|
|
`ClusterRoot/graphs/<id>.omni` by default; an external storage provider is the
|
||
|
|
opt-in. The pipeline `into.graph` here is typed (`graph.engineering`); the bare
|
||
|
|
`{ graph: engineering, ... }` shorthand is equally valid (normalized). -->
|
||
|
|
|
||
|
|
Validation rules:
|
||
|
|
|
||
|
|
- A field that expects a graph address accepts `graph.<id>`, not
|
||
|
|
`query.<graph>.<name>` or an arbitrary string.
|
||
|
|
- A field that expects a query address accepts `query.<graph>.<name>`, and the
|
||
|
|
planner validates both the graph and the query symbol.
|
||
|
|
- A field that expects a source provider accepts `provider.source.<name>`, not
|
||
|
|
`provider.storage.<name>`.
|
||
|
|
- A field that expects storage accepts `provider.storage.<name>` or an explicit
|
||
|
|
storage block, not a server URL or source connector.
|
||
|
|
<!-- Fix (2026-06-08): shorthand is a present rule, not "future syntax" — it is how
|
||
|
|
the high-level spec's bare examples are valid. -->
|
||
|
|
- A field whose schema **fixes the kind** accepts bare shorthand (e.g. `knowledge`
|
||
|
|
in a graph-ref field) and normalizes it to the typed address; a kind-ambiguous
|
||
|
|
or wrong-kind value is rejected with a typed diagnostic.
|
||
|
|
- Plan and state always store the **normalized typed address**, regardless of
|
||
|
|
whether the surface used shorthand.
|
||
|
|
|
||
|
|
## Target Components
|
||
|
|
|
||
|
|
Preferred split:
|
||
|
|
|
||
|
|
| Component | Responsibility | Depends on |
|
||
|
|
|---|---|---|
|
||
|
|
| `omnigraph-cluster` crate | Cluster spec types, path resolution, resource graph, plan model, state backend traits, apply orchestration | `omnigraph-config` only for shared simple config types if needed; avoid server deps |
|
||
|
|
| `omnigraph` engine additions | Graph lifecycle primitives, schema-apply integration, recovery hooks for graph moves during cluster apply; optional future cluster manifest publisher if JSON state is not enough | Lance, existing graph manifest/recovery |
|
||
|
|
| `omnigraph-cli` | `cluster *` commands, plan rendering, approval collection, state lock UX | `omnigraph-cluster`, engine |
|
||
|
|
| `omnigraph-server` | Optional boot from cluster state, registry reload, status endpoints, policy-filtered query catalog | `omnigraph-cluster`, engine, policy |
|
||
|
|
| `omnigraph-policy` | Cluster/server actions, per-query list/invoke scope, approval policy predicates | none above server |
|
||
|
|
| `omnigraph-queries` | Registry without exposure side-channel; dependency metadata for downstream validation | compiler/config |
|
||
|
|
| `omnigraph-api-types` | New status/plan/apply response types if cluster HTTP endpoints ship | serde only |
|
||
|
|
|
||
|
|
If the first implementation avoids a new crate, keep the same boundary in
|
||
|
|
modules. The important constraint is that cluster spec parsing must not drag
|
||
|
|
HTTP/server code into compiler or engine crates.
|
||
|
|
|
||
|
|
## Resource Model
|
||
|
|
|
||
|
|
Resource identity is stable and typed:
|
||
|
|
|
||
|
|
```text
|
||
|
|
ClusterRoot
|
||
|
|
ResourceKey = <kind>/<scope>/<name>
|
||
|
|
ResourceAddress = <kind>.<name> | <kind>.<graph_id>.<name>
|
||
|
|
ProviderAddress = provider.<kind>.<name>
|
||
|
|
|
||
|
|
graph/cluster/knowledge
|
||
|
|
schema/graph:knowledge/main
|
||
|
|
query/graph:knowledge/find_experts
|
||
|
|
policy/cluster/base_rbac
|
||
|
|
ui/cluster/dashboard.overview
|
||
|
|
pipeline/cluster/github_sync
|
||
|
|
alias/cluster/experts
|
||
|
|
embedding/cluster/default
|
||
|
|
```
|
||
|
|
|
||
|
|
<!-- Fix (2026-06-08): resource key uses `dashboard.overview` (dot) to match the
|
||
|
|
address form `ui.dashboard.<dashboard_name>` — was `dashboard:overview`. `dashboard`
|
||
|
|
is the only ui sub-kind today. -->
|
||
|
|
|
||
|
|
Resource records carry:
|
||
|
|
|
||
|
|
| Field | Meaning |
|
||
|
|
|---|---|
|
||
|
|
| `kind` | Graph, Schema, Query, PolicyBundle, UiSpec, Binding, Alias, EmbeddingConfig, Pipeline |
|
||
|
|
| `scope` | Cluster or graph id |
|
||
|
|
| `name` | Stable resource name inside scope |
|
||
|
|
| `fingerprint` | Content hash of the normalized spec and all referenced files |
|
||
|
|
| `dependencies` | Resource keys this resource references |
|
||
|
|
| `observed` | Applied graph manifest version, policy digest, query digest, schedule id, etc. |
|
||
|
|
| `status` | `Pending`, `Planned`, `Applying`, `Applied`, `Drifted`, `Blocked`, `Error` |
|
||
|
|
| `conditions` | Typed details such as `ActualAppliedStatePending`, `NeedsApproval`, `DependencyMissing`, `PartialPipelineRun` |
|
||
|
|
|
||
|
|
The planner builds a dependency graph from these records and uses it for both
|
||
|
|
validation and blast-radius reporting.
|
||
|
|
|
||
|
|
## Terraform-Style Validate / Plan / Apply
|
||
|
|
|
||
|
|
The cluster workflow deliberately mirrors Terraform's safe sequence:
|
||
|
|
|
||
|
|
```text
|
||
|
|
cluster validate # parse + schema-check desired config, no state mutation
|
||
|
|
cluster plan # diff desired config against state, with optional refresh
|
||
|
|
cluster apply # apply an accepted fresh plan and update state
|
||
|
|
cluster status # read state-backed deployed reality
|
||
|
|
cluster refresh # repair/import observations from actual cluster state
|
||
|
|
```
|
||
|
|
|
||
|
|
Implementation rollout follows the same safety posture: ship parser/validate
|
||
|
|
first, then read-only plan, then state backend and lock, then apply.
|
||
|
|
|
||
|
|
The plan is a structured artifact, not just terminal text. It must include:
|
||
|
|
|
||
|
|
| Plan field | Why it exists |
|
||
|
|
|---|---|
|
||
|
|
| `desired_revision` | Git commit / config digest being evaluated |
|
||
|
|
| `resource_digests` | Exact digest of every schema, query, policy, UI, pipeline, and map file |
|
||
|
|
| `dependencies` | Edges such as query -> graph/schema, dashboard -> query, pipeline -> source provider + graph |
|
||
|
|
| `state_observations` | Applied revision, resource fingerprints, graph manifest versions, status conditions, and drift |
|
||
|
|
| `changes` | Create/update/delete/replace/refresh-only operations |
|
||
|
|
| `blast_radius` | Downstream resources to revalidate or affected behavior to surface |
|
||
|
|
| `approvals_required` | Irreversible/data-loss or compatibility-narrowing gates |
|
||
|
|
|
||
|
|
`cluster apply` must reject a stale plan when state, resource digests, or
|
||
|
|
observed graph versions no longer match the plan base. The operator or agent
|
||
|
|
must re-plan or explicitly refresh first.
|
||
|
|
|
||
|
|
## Cluster Storage Layout
|
||
|
|
|
||
|
|
Target Phase-1 cluster-root layout:
|
||
|
|
|
||
|
|
```text
|
||
|
|
<cluster-root>/
|
||
|
|
__cluster/
|
||
|
|
state.json
|
||
|
|
lock.json
|
||
|
|
status/
|
||
|
|
<resource-address>.json
|
||
|
|
approvals/
|
||
|
|
<ulid>.json
|
||
|
|
recoveries/
|
||
|
|
<ulid>.json
|
||
|
|
recovery/
|
||
|
|
<ulid>.json
|
||
|
|
resources/
|
||
|
|
query/<graph>/<name>/<digest>.gq
|
||
|
|
policy/<name>/<digest>.yaml
|
||
|
|
ui/<name>/<digest>.dashboard.yaml
|
||
|
|
pipeline/<name>/<digest>.pipeline.yaml
|
||
|
|
graphs/
|
||
|
|
<graph_id>.omni/
|
||
|
|
```
|
||
|
|
|
||
|
|
<!-- Spec note: JSON is the baseline because it matches Terraform state, is
|
||
|
|
easy to inspect/repair, and avoids bootstrapping Lance datasets before the
|
||
|
|
control-plane semantics are proven. -->
|
||
|
|
The exact filenames can change, but the shape cannot:
|
||
|
|
|
||
|
|
- There is one cluster-control namespace under the cluster root.
|
||
|
|
- Graph data remains in ordinary OmniGraph graph roots.
|
||
|
|
- State is a locked/CAS-updated JSON document, not a Lance dataset.
|
||
|
|
- Status, approval, and recovery ledgers are append-only or per-resource JSON
|
||
|
|
records until table semantics are proven necessary.
|
||
|
|
- Resource payloads are content-addressed by digest so apply can be idempotent.
|
||
|
|
- Cluster state is not inferred from the operator's working tree.
|
||
|
|
- A Lance-backed control-plane store is a future backend option only if
|
||
|
|
row-level queryability/history or tighter publish fencing justifies it.
|
||
|
|
|
||
|
|
## State Backend Protocol
|
||
|
|
|
||
|
|
### Cluster-Hosted JSON State
|
||
|
|
|
||
|
|
When `state.backend: cluster`, the baseline backend stores JSON documents under
|
||
|
|
`<cluster-root>/__cluster/` and protects `state.json` with object-store lock/CAS.
|
||
|
|
It is cluster-hosted, but it is still a separate state write from graph Lance
|
||
|
|
manifest movement.
|
||
|
|
|
||
|
|
Apply protocol:
|
||
|
|
|
||
|
|
1. Acquire the cluster state lock.
|
||
|
|
2. Read current `state.json` and backend CAS token / object generation.
|
||
|
|
3. Validate plan base still matches state.
|
||
|
|
4. Write a cluster recovery sidecar before any graph manifest or non-idempotent
|
||
|
|
resource can move.
|
||
|
|
5. Write content-addressed resource payloads and perform any required graph
|
||
|
|
manifest movements.
|
||
|
|
6. CAS-update `state.json` with the new applied revision, resource
|
||
|
|
fingerprints, observed graph versions, status references, and approval /
|
||
|
|
recovery references.
|
||
|
|
7. If step 6 fails after actual resources moved, do not acknowledge success.
|
||
|
|
Surface `ActualAppliedStatePending` and require `refresh` / `import` repair.
|
||
|
|
8. Delete the sidecar and release the lock only after the state outcome is
|
||
|
|
recorded.
|
||
|
|
|
||
|
|
### External State
|
||
|
|
|
||
|
|
<!-- Spec note: external state is a separate commit domain. The protocol below
|
||
|
|
prevents an apply from returning success after the cluster moved but the state
|
||
|
|
ledger failed to record that movement. -->
|
||
|
|
|
||
|
|
When `state.backend` points outside the cluster root, the same JSON state shape
|
||
|
|
lives in an external store. It is locked and CAS-updated, but it is not atomic
|
||
|
|
with Lance or OmniGraph manifests.
|
||
|
|
|
||
|
|
Apply protocol:
|
||
|
|
|
||
|
|
1. Acquire the external state lock.
|
||
|
|
2. Read state and CAS token.
|
||
|
|
3. Validate plan base still matches state.
|
||
|
|
4. Write a cluster recovery sidecar.
|
||
|
|
5. Perform the cluster resource changes.
|
||
|
|
6. CAS-update external state with the new applied revision, statuses, and the
|
||
|
|
observed graph manifest / resource versions it records.
|
||
|
|
7. If step 6 fails, do not acknowledge success. Surface
|
||
|
|
`ActualAppliedStatePending` and require `refresh` / `import` repair.
|
||
|
|
8. Release the external lock only after the state outcome is recorded.
|
||
|
|
|
||
|
|
This mode can be strongly coordinated, but it must never be documented as one
|
||
|
|
atomic commit across both stores.
|
||
|
|
|
||
|
|
### Future Lance-Backed State
|
||
|
|
|
||
|
|
A Lance-backed state/status/approval/recovery store is deliberately not the
|
||
|
|
baseline. It becomes attractive only if JSON files become a real liability:
|
||
|
|
large status sets need structured filtering, approval/recovery history needs
|
||
|
|
table scans, or cluster apply needs a manifest publisher that can fence state
|
||
|
|
and graph-version pins together. Until then, Lance datasets add bootstrapping,
|
||
|
|
schema migration, and control-plane recovery surface without enough benefit.
|
||
|
|
|
||
|
|
## Cluster Manifest Publisher
|
||
|
|
|
||
|
|
The cluster publisher is a possible later layer above today's graph publisher.
|
||
|
|
It does not replace Lance or the per-graph `__manifest` table, and it is not
|
||
|
|
required for Phase-1 JSON state / read-only plan.
|
||
|
|
|
||
|
|
Required semantics:
|
||
|
|
|
||
|
|
| Requirement | Detail |
|
||
|
|
|---|---|
|
||
|
|
| Expected-version CAS | Every resource in an apply group supplies its expected current version/fingerprint |
|
||
|
|
| Resource changes | Register/update/tombstone resource payloads and graph version pins |
|
||
|
|
| Graph-head fencing | If a graph schema/lifecycle operation moves a graph manifest, the cluster manifest records the exact graph manifest version |
|
||
|
|
| Sidecar coverage | Any graph or cluster resource that can move before cluster publish must be recoverable all-or-nothing |
|
||
|
|
| Deterministic publish order | Sidecars and apply groups process in stable order |
|
||
|
|
| Loud partials | If a group cannot be rolled back or forward in-process, status records the condition before more apply work proceeds |
|
||
|
|
|
||
|
|
The risky case is nested publish:
|
||
|
|
|
||
|
|
```text
|
||
|
|
schema apply moves graph:knowledge manifest
|
||
|
|
cluster apply has not yet published query/policy/state records
|
||
|
|
process crashes
|
||
|
|
```
|
||
|
|
|
||
|
|
That is not safe unless the cluster sidecar records enough information to roll
|
||
|
|
the graph movement forward into the cluster manifest or roll it back using the
|
||
|
|
same recovery discipline as current graph recovery.
|
||
|
|
|
||
|
|
## Plan Model
|
||
|
|
|
||
|
|
Plan output is a durable, replay-checked proposal, not just pretty text:
|
||
|
|
|
||
|
|
```text
|
||
|
|
Plan {
|
||
|
|
plan_id,
|
||
|
|
desired_revision,
|
||
|
|
base_state_revision,
|
||
|
|
base_state_cas,
|
||
|
|
changes[],
|
||
|
|
apply_groups[],
|
||
|
|
approvals_required[],
|
||
|
|
blast_radius,
|
||
|
|
diagnostics[]
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
Each change records:
|
||
|
|
|
||
|
|
| Field | Meaning |
|
||
|
|
|---|---|
|
||
|
|
| `resource` | Stable `ResourceKey` |
|
||
|
|
| `operation` | Create, Update, Delete, Replace, RefreshOnly |
|
||
|
|
| `reversibility` | Reversible, Recoverable, CompatibilityNarrowing, IrreversibleDataLoss |
|
||
|
|
| `effect` | ConfigOnly, Catalog, GraphDefinition, GraphDataRewrite, DataPlaneSchedule |
|
||
|
|
| `downstream` | Resources that must be revalidated or will observe changed behavior |
|
||
|
|
| `approval` | None, HumanRequired, PolicyRequired, AlreadySatisfied |
|
||
|
|
|
||
|
|
`apply` must re-read state and reject stale plans unless an explicit
|
||
|
|
`--refresh` / `--replan` path recomputes the plan.
|
||
|
|
|
||
|
|
## Downstream Dependency Rules
|
||
|
|
|
||
|
|
These are the concrete "what requires downstream" rules.
|
||
|
|
|
||
|
|
| Changed resource | Must revalidate / recompute downstream | Blocking failures |
|
||
|
|
|---|---|---|
|
||
|
|
| Graph create/delete/rename | Policies, queries, aliases, dashboards, pipelines, bindings, server registry, state graph set | Dangling graph references; duplicate URI; invalid `GraphId`; graph delete without irreversible approval |
|
||
|
|
| Schema | Stored queries, pipeline maps, UI bindings/query outputs, embedding/index config, data-impact preview, policy predicates once row/type pushdown exists | Unsupported migration; query breakage; missing target type/property; hard drop without approval |
|
||
|
|
| Stored query | Aliases, UI bindings, policy list/invoke grants, MCP/tool catalog compatibility, typed params | Query file parse/type errors; registry key != `query <name>`; removed query still referenced |
|
||
|
|
| Policy bundle | Query catalog visibility, graph/server action authorization, approval gates, bootstrap permissions | Invalid Cedar/YAML; server-scoped action in graph policy; per-query list/invoke gap unhandled |
|
||
|
|
| UI/dashboard | Query bindings, graph refs, output field expectations, policy visibility for referenced queries | Binding to missing graph/query/param/output |
|
||
|
|
| Alias | CLI command resolution, graph/query refs, shared-vs-personal boundary | Dangling graph/query; mutation alias pointing at read-only context |
|
||
|
|
| Embedding config | Schema `@embed` columns, model dimension, index rebuild/reconcile, env refs | Dimension mismatch; missing env ref; unsupported model/provider |
|
||
|
|
| Pipeline definition | Target graph schemas, mapping files, env refs, scheduler/runtime state, per-target run ledger | Missing target graph/type/property; overwrite mode without approval; source secret missing |
|
||
|
|
| Binding | Referenced source/surface pair, dependency order, visibility policy | Missing source or target; incompatible params |
|
||
|
|
| State backend config | Lock implementation, import/refresh protocol, apply acknowledgements | Backend missing CAS/lock; state CAS failure after graph/resource movement |
|
||
|
|
|
||
|
|
## Blast Radius Matrix
|
||
|
|
|
||
|
|
| Area | Required downstream change | Blast radius | Notes |
|
||
|
|
|---|---|---|---|
|
||
|
|
| Config parsing | Add strict `cluster.yaml` parser, path/env-ref resolver, resource fingerprints, no layered merge | High | Separate from `OmnigraphConfig`; existing config tests still need backcompat coverage |
|
||
|
|
| CLI | Add `cluster validate/plan/apply/status/refresh/import`, plan rendering, approval flags, actor threading | High | Must not change existing command selection or `omnigraph use` behavior |
|
||
|
|
| State backend | Add JSON state document, status/approval/recovery records, lock/CAS, and import/refresh repair | High | Must not silently succeed after state CAS failure |
|
||
|
|
| Optional cluster publisher | Add a cluster manifest plus table-backed state/status store only if stronger all-or-nothing apply is required | Very high | Touches core atomicity and recovery invariants |
|
||
|
|
| Recovery | Add cluster sidecars and failpoint coverage for graph-move-before-state-publish gaps | Very high | Any missed sidecar is a correctness bug |
|
||
|
|
| Graph lifecycle | First-class graph resource create/delete/rename or stable-id story | High | Current server add/remove is intentionally not exposed |
|
||
|
|
| Schema apply integration | Make schema apply cluster-aware or wrap it with cluster recovery | High | Existing schema apply cannot be treated as cluster atomic by assertion |
|
||
|
|
| Query registry | Remove target-state exposure flag, add dependency metadata, keep `mcp.expose` bridge | Medium/high | Catalog behavior is observable public API |
|
||
|
|
| Policy | Add cluster plan/apply/admin actions and per-query list/invoke scope | High | Needs docs, tests, Cedar schema migration, and compatibility with coarse `invoke_query` |
|
||
|
|
| Server registry | Boot from cluster state, eventually reload/reconcile graph handles, expose statuses | High | Affects routing, OpenAPI, auth, and workload admission |
|
||
|
|
| API types/OpenAPI | Plan/status/apply DTOs if HTTP management endpoints ship | Medium/high | OpenAPI drift must be regenerated |
|
||
|
|
| UI specs | New renderer/spec validator/binding checker | High | New product surface, not currently implemented |
|
||
|
|
| Pipelines | New scheduler/runtime/connector/mapping/idempotency/run ledger | Very high | Second data-plane seam; large product and correctness surface |
|
||
|
|
| Embeddings | Cluster-level defaults, env refs, model/dimension validation, index interaction | Medium | Existing embedding code is mostly offline/client-side |
|
||
|
|
| Docs | User docs for cluster config, policy, server, CLI; dev docs for invariants/testing | High | Public contract changes |
|
||
|
|
| Tests | New cluster suites plus extensions to config/server/policy/recovery/schema/query tests | High | Needs boundary-matched coverage |
|
||
|
|
|
||
|
|
## Reversibility And Approval Tiers
|
||
|
|
|
||
|
|
| Tier | Examples | Gate |
|
||
|
|
|---|---|---|
|
||
|
|
| Display-only | Dashboard layout, non-breaking alias addition | No approval beyond policy |
|
||
|
|
| Catalog behavior | Add query, hide/list query via policy, add policy grant | Policy check; no data-loss approval |
|
||
|
|
| Compatibility narrowing | Future validated enum narrowing, query param removal, policy removal that revokes access | Explicit compatibility warning; may require human approval by policy |
|
||
|
|
| Recoverable definition rewrite | Soft schema drop, graph schema rename, index rebuild | Plan warning; no data-loss approval unless policy requires |
|
||
|
|
| Irreversible data loss | Graph delete, hard schema drop, cleanup-triggered prior-version reclamation, overwriting pipeline target | Human approval artifact recorded in audit ledger |
|
||
|
|
|
||
|
|
Future enum narrowing belongs in `CompatibilityNarrowing` unless the migration
|
||
|
|
also drops/coerces data or triggers cleanup. That distinction matters for plan
|
||
|
|
wording and for policy predicates.
|
||
|
|
|
||
|
|
## Rollout Phases
|
||
|
|
|
||
|
|
<!-- Spec note: the only safe path is staged. The cluster control plane crosses
|
||
|
|
config, engine, server, policy, and data-plane-adjacent surfaces; a big-bang
|
||
|
|
replacement would make every invariant harder to audit. -->
|
||
|
|
|
||
|
|
### Phase 0: Documentation And Parser Skeleton
|
||
|
|
|
||
|
|
- Add cluster spec types and strict parser behind an unused feature/module.
|
||
|
|
- Implement `cluster validate --config <folder>` with no state backend.
|
||
|
|
- Validate file paths, env refs, duplicate resource keys, and dependency graph.
|
||
|
|
- No behavior change to `omnigraph.yaml`, server boot, or query exposure.
|
||
|
|
|
||
|
|
### Phase 1: Read-Only Planning
|
||
|
|
|
||
|
|
- Add `cluster plan` against a mock/imported state snapshot.
|
||
|
|
- Produce plan JSON and human output.
|
||
|
|
- Reuse existing schema migration planner for schema resources.
|
||
|
|
- Validate stored queries against desired schema.
|
||
|
|
- Compute downstream dependencies and blast radius.
|
||
|
|
- Still no apply.
|
||
|
|
|
||
|
|
### Phase 2: State Backend And Lock
|
||
|
|
|
||
|
|
- Add `state.backend: cluster` JSON storage and lock/CAS.
|
||
|
|
- Add external backend trait only if lock + CAS semantics are explicit.
|
||
|
|
- Add `cluster status`, `refresh`, and `import`.
|
||
|
|
- Persist `AppliedRevision`, `ResourceStatus`, and audit references in JSON.
|
||
|
|
|
||
|
|
### Phase 3: Config-Only Apply
|
||
|
|
|
||
|
|
- Apply query, policy, UI, alias, embedding, and pipeline definition resources
|
||
|
|
that do not move graph manifests.
|
||
|
|
- Publish by writing content-addressed resource payloads and CAS-updating
|
||
|
|
`state.json`.
|
||
|
|
- Keep server boot from `omnigraph.yaml`; cluster state is inspectable but not
|
||
|
|
yet serving traffic.
|
||
|
|
|
||
|
|
### Phase 4: Graph And Schema Apply
|
||
|
|
|
||
|
|
- Add graph create/delete as cluster resources.
|
||
|
|
- Make schema apply cluster-aware, with sidecar coverage for graph manifest
|
||
|
|
movements before JSON state publish.
|
||
|
|
- Gate irreversible data-loss operations with approval artifacts.
|
||
|
|
- Consider a cluster manifest publisher only if the JSON sidecar + repair path
|
||
|
|
is not strong enough for the accepted safety contract.
|
||
|
|
|
||
|
|
### Phase 5: Server Reads Cluster Catalog
|
||
|
|
|
||
|
|
- Allow server startup from cluster state.
|
||
|
|
- Add status and catalog endpoints as needed.
|
||
|
|
- Keep the current `omnigraph.yaml` startup path as compatibility mode.
|
||
|
|
- Regenerate OpenAPI for any HTTP surface.
|
||
|
|
|
||
|
|
### Phase 6: Policy-Owned Query Exposure
|
||
|
|
|
||
|
|
- Add per-query policy scope for list/invoke.
|
||
|
|
- Filter `GET /queries` by actor.
|
||
|
|
- Keep coarse `invoke_query` as a broad allow rule for compatibility until
|
||
|
|
docs and migrations say it can be narrowed.
|
||
|
|
- Deprecate and later remove `mcp.expose` from target-state cluster config.
|
||
|
|
|
||
|
|
### Phase 7: Pipeline Runtime
|
||
|
|
|
||
|
|
- Add scheduler/worker/runtime.
|
||
|
|
- Add source connector contracts, mapping validation, idempotency keys,
|
||
|
|
per-target run status, and retry behavior.
|
||
|
|
- Treat fan-out execution as data-plane writes unless explicitly staged through
|
||
|
|
branch/merge.
|
||
|
|
|
||
|
|
## Test Ownership
|
||
|
|
|
||
|
|
Tests must prove the Terraform-style workflow, not just individual parsers.
|
||
|
|
The minimum behavior contract:
|
||
|
|
|
||
|
|
```text
|
||
|
|
validate catches bad config
|
||
|
|
plan is deterministic and complete
|
||
|
|
apply only applies a fresh accepted plan
|
||
|
|
state changes are locked and durable
|
||
|
|
drift and partial convergence are visible, not silent
|
||
|
|
```
|
||
|
|
|
||
|
|
| Change | Existing coverage to extend | New coverage likely needed |
|
||
|
|
|---|---|---|
|
||
|
|
| Cluster parser | `omnigraph-config` inline config tests for strictness/path resolution | `omnigraph-cluster` parser/dependency tests |
|
||
|
|
| Plan dependency graph | Schema planner tests, query registry tests | Golden plan JSON for cross-resource downstream impacts |
|
||
|
|
| State lock/backend | Existing schema apply lock tests as model | JSON state CAS/lock race tests |
|
||
|
|
| Optional cluster manifest publisher | `crates/omnigraph/src/db/manifest/tests.rs` | Cluster publisher CAS, expected-version, deterministic order tests if that backend ships |
|
||
|
|
| Cluster recovery | `recovery.rs`, `failpoints.rs` | Phase B -> state publish failpoints, external state CAS failure tests |
|
||
|
|
| Schema cluster apply | `schema_apply.rs`, failpoints schema apply cases | Nested graph/cluster recovery tests |
|
||
|
|
| Query exposure policy | `omnigraph-policy` invoke_query tests, server query catalog tests | Per-query list/invoke allow/deny and no-probing tests |
|
||
|
|
| Server cluster boot | `omnigraph-server/tests/server.rs`, `openapi.rs` | Boot from cluster state, registry reload/status tests |
|
||
|
|
| CLI cluster commands | `omnigraph-cli/tests/cli.rs`, `system_local.rs` | `cluster validate/plan/apply/status` system tests |
|
||
|
|
| Pipelines | None today | New runtime/mapping/idempotency/run-ledger suites |
|
||
|
|
|
||
|
|
Workflow-specific tests:
|
||
|
|
|
||
|
|
| Workflow area | Required assertions |
|
||
|
|
|---|---|
|
||
|
|
| Parser / validate | Unknown fields, wrong-kind typed addresses, missing providers, inline secret values, dangling graph/query/pipeline refs, and future-phase fields fail with typed diagnostics |
|
||
|
|
| Plan goldens | Given config + imported/fake state, plan JSON contains stable resource digests, dependency edges, state observations, proposed changes, blast radius, and approval gates in deterministic order |
|
||
|
|
| Fresh-plan apply | Changing config digest, state revision, resource digest, or observed graph manifest version after planning makes `cluster apply` reject and require re-plan/refresh |
|
||
|
|
| State lock / CAS | Concurrent applies against the same backend cannot both succeed; loser gets a typed lock/CAS conflict |
|
||
|
|
| Recovery / partial apply | Fail after graph/resource movement but before cluster state publish; assert recovery or status surfaces `ActualAppliedStatePending`/sidecar state and never returns success |
|
||
|
|
| Server/runtime phase | Before cluster state drives routing or registry reload, tests are hermetic: no real home dir, no real global config, no real credentials, no ignored remote tests |
|
||
|
|
| Pipeline phase | Fan-out run records per-target status, commit ids, retryability, and idempotency keys; no aggregate success unless every target succeeded |
|
||
|
|
|
||
|
|
Hard gates:
|
||
|
|
|
||
|
|
- Do not ship `cluster apply` until `cluster validate` and read-only
|
||
|
|
`cluster plan` have hermetic tests.
|
||
|
|
- Do not ship graph/schema-moving apply until failpoint recovery tests prove the
|
||
|
|
Phase B -> state publish gap is covered.
|
||
|
|
|
||
|
|
For docs-only changes, `scripts/check-agents-md.sh` is enough. For
|
||
|
|
implementation phases, run the boundary tests above before widening to
|
||
|
|
`cargo test --workspace --locked`.
|
||
|
|
|
||
|
|
## User-Visible Documentation Fallout
|
||
|
|
|
||
|
|
The following public docs must change when the corresponding phase ships:
|
||
|
|
|
||
|
|
| Phase | User docs |
|
||
|
|
|---|---|
|
||
|
|
| Parser/validate | New `docs/user/cluster-config.md`; CLI reference for `cluster validate` |
|
||
|
|
| Plan/apply | CLI reference, transactions, policy, errors |
|
||
|
|
| State backend | Storage, deployment, constants, maintenance |
|
||
|
|
| Server cluster boot | Server, deployment, OpenAPI |
|
||
|
|
| Policy query exposure | Policy, server, query language / stored-query registry docs |
|
||
|
|
| Pipelines | New pipeline user guide, deployment, audit, errors |
|
||
|
|
| Embeddings config | Embeddings, indexes |
|
||
|
|
|
||
|
|
Do not ship a user-visible command, flag, env var, endpoint, or config key
|
||
|
|
without updating the corresponding user doc in the same PR.
|
||
|
|
|
||
|
|
## Known High-Risk Design Decisions
|
||
|
|
|
||
|
|
1. **Cluster root identity.** Decide whether `metadata.name` is a label or
|
||
|
|
identity. Prefer root-derived stable identity plus display name to avoid a
|
||
|
|
rename breaking resource identity.
|
||
|
|
2. **Graph storage derivation.** The high-level sample omits graph storage.
|
||
|
|
Implementation should derive graph roots under `ClusterRoot/graphs/<id>.omni`
|
||
|
|
by default and treat external graph roots as a separate, explicit feature.
|
||
|
|
3. **Nested apply.** Schema apply and graph lifecycle cannot move a graph
|
||
|
|
manifest outside cluster sidecar coverage.
|
||
|
|
4. **External state.** Must expose pending repair instead of returning success
|
||
|
|
when graph/resource movement succeeds and external state CAS fails.
|
||
|
|
5. **Per-query policy.** Catalog filtering must avoid probing leaks: callers
|
||
|
|
without list/invoke permission should not distinguish hidden from missing.
|
||
|
|
6. **Pipeline fan-out.** Do not promise atomic multi-graph ingestion unless the
|
||
|
|
runtime uses a real branch/merge or equivalent protocol for every target.
|
||
|
|
7. **Drift correction.** Reconciler-initiated deletes are the same data-loss
|
||
|
|
class as human-requested deletes.
|
||
|
|
|
||
|
|
## Exit Criteria For A Real RFC
|
||
|
|
|
||
|
|
Before implementation begins beyond parser/validate, the RFC must answer:
|
||
|
|
|
||
|
|
1. Exact JSON state/status/approval/recovery schemas and object-store paths.
|
||
|
|
2. Exact sidecar JSON schema and recovery decision matrix.
|
||
|
|
3. State backend interface and supported lock/CAS implementations.
|
||
|
|
4. Cluster apply group syntax and dependency ordering rules.
|
||
|
|
5. Plan JSON schema, including blast-radius and approval fields.
|
||
|
|
6. Bootstrap authority and first-actor story.
|
||
|
|
7. Server startup and migration path from `omnigraph.yaml`.
|
||
|
|
8. Per-query policy schema and compatibility bridge for `mcp.expose`.
|
||
|
|
9. Pipeline runtime owner, status schema, and idempotency contract.
|