docs(user): restructure user docs into topic sections (Phase 1) (#223)

Move the 23 flat docs/user/*.md files into topic subdirectories so the
user guide is organized by area (schema, queries, search, branching, cli,
operations, clusters, concepts, reference) instead of a flat list. This is
a pure structural move — whole files relocated, every cross-doc link
recomputed, no prose rewrites or content splits (those follow in Phase 2).

- 19 `git mv`s (install.md, deployment.md stay top-level); history preserved
  (renames detected at 92–100% similarity).
- All intra-doc links, AGENTS.md's topic table (52 pointers), and the
  docs/dev + docs/releases back-links recomputed via relpath from each
  file's new location.
- docs/user/index.md rewritten as a sectioned nav hub.
- Fixed 5 doc-path references in Rust (comments + two user-facing server
  settings error strings) to point at the new locations.

Verified: zero broken .md links across tracked docs; check-agents-md.sh
green (with the untracked scratch docs set aside); touched crates build.

Note: the public site (omnigraph-web) imports docs/ via a flat-only script;
its import-docs.mjs needs a subdir-aware update before the next re-sync.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Andrew Altshuler 2026-06-14 13:52:14 +03:00 committed by GitHub
parent 8726ca92ec
commit d46e50dd6d
33 changed files with 126 additions and 109 deletions


@ -0,0 +1,457 @@
# Cluster Config
**Status:** Phase 5 — cluster-booted serving (`omnigraph-server --cluster`).
> New to the cluster tooling? Start with the operator how-to guide,
> [cluster.md](index.md) — this document is the reference.
Cluster config is the future control-plane configuration surface for a whole
OmniGraph deployment. In this stage, OmniGraph can validate a local
`cluster.yaml` folder, produce a deterministic read-only plan, inspect the
local JSON state ledger, explicitly refresh/import graph observations into
that ledger, manually remove a held local state lock by exact lock id, and
**apply the executable subset of the plan** — stored-query and policy-bundle
catalog writes, **graph creation** (a declared graph that does not exist yet
is initialized by apply at the derived root), **schema updates** (soft drops
only), and — behind an explicit, digest-bound **approval** — **graph
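deletion**. It does not perform data-loss schema migrations or start servers,
and applying does not change what a running server serves: a server boots from
`omnigraph.yaml` unless it is started with `--cluster`, and a `--cluster`
server picks up applied state only on restart.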
## Commands
```bash
omnigraph cluster validate --config company-brain
omnigraph cluster plan --config company-brain --json
omnigraph cluster apply --config company-brain --json
omnigraph cluster approve graph.<id> --config company-brain --as <actor>
omnigraph cluster status --config company-brain --json
omnigraph cluster refresh --config company-brain --json
omnigraph cluster import --config company-brain --json
omnigraph cluster force-unlock <LOCK_ID> --config company-brain --json
```
`--config` points at a directory, not a file. The directory must contain
`cluster.yaml`. When `--config` is omitted, the current directory is used.
## Relationship to `omnigraph.yaml`
`cluster.yaml` does not replace `omnigraph.yaml`, and the two never describe
the same fact. `omnigraph.yaml` is the permanent **per-operator** layer (CLI
defaults, the operator's identity and credential references, graph targets
for data-plane commands); `cluster.yaml` is the shared desired state of a
whole deployment, read only by the `cluster` commands via `--config`.
The exact contract:
- **Cluster commands read `omnigraph.yaml` for exactly one thing**: the
`cli.actor` default used by `apply`/`approve` when `--as` is omitted —
operator identity is a per-operator fact. With `--as` present, no config
is read at all. Nothing else (its graph set, targets, bind, queries,
policies) ever influences a cluster command; a malformed `omnigraph.yaml`
breaks only the no-flag actor lookup, loudly.
- **A `--cluster` server reads `omnigraph.yaml` for nothing** — not even the
implicit current-directory search runs (mode-inference rule 0). Boot from
cluster state XOR `omnigraph.yaml`, never a merge.
- **The other direction is ergonomics, not coupling**: a per-operator
`omnigraph.yaml` may point `graphs.<name>.uri` at a cluster's derived root
(`company-brain/graphs/knowledge.omni`) so data-plane commands can use
`--target <name>` — an ordinary local path, no special handling.
## Supported `cluster.yaml`
Stage 3A accepts only this resource subset:
```yaml
version: 1
metadata:
  name: company-brain
state:
  backend: cluster
  lock: true
graphs:
  knowledge:
    schema: knowledge.pg
    queries: queries/ # discover every `query <name>` in queries/*.gq
policies:
  base:
    file: base.policy.yaml
    applies_to: [knowledge]
```
`queries` is Terraform-shaped — the `.gq` files are the declaration. Three
forms:
```yaml
queries: queries/ # directory: top-level *.gq, sorted; every declaration registers
queries: [people.gq, extra/a.gq] # explicit files; every declaration in each
queries: # fine-grained name -> file map
  find_experts:
    file: knowledge.gq
```
Discovery is loud: an unreadable or unparseable `.gq`, or the same query name
declared in two files, fails validation (`query_parse_error`,
`duplicate_query_name`). Each discovered query is still an individually
addressed resource (`query.<graph>.<name>`) with its own plan/apply lifecycle;
the digest is the containing file's hash, so editing a multi-query file
updates all of its queries together. Paths are relative to the config
directory — the cluster is one explicit folder, so no `./` prefixes are
needed.
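
The directory form of discovery can be sketched as follows. This is a minimal illustration, not OmniGraph's parser: the regular expression for a `query <name>` declaration and the `DuplicateQueryName` exception are assumptions made for the example, and real `.gq` parsing is stricter.

```python
import re
from pathlib import Path
from typing import Dict

# Assumed declaration syntax: a line that starts with `query <name>`.
QUERY_DECL = re.compile(r"^\s*query\s+([A-Za-z_][A-Za-z0-9_]*)", re.MULTILINE)


class DuplicateQueryName(Exception):
    """Hypothetical stand-in for the `duplicate_query_name` diagnostic."""


def discover_queries(queries_dir: Path) -> Dict[str, Path]:
    """Map each declared query name to the top-level .gq file declaring it."""
    found: Dict[str, Path] = {}
    # Top-level *.gq only, visited in sorted order so the result is deterministic.
    for gq_file in sorted(queries_dir.glob("*.gq")):
        text = gq_file.read_text(encoding="utf-8")
        for name in QUERY_DECL.findall(text):
            if name in found and found[name] != gq_file:
                raise DuplicateQueryName(
                    f"{name}: declared in {found[name].name} and {gq_file.name}"
                )
            found[name] = gq_file
    return found
```

Files in subdirectories are skipped by the directory form, which is why the explicit-list form exists for paths such as `extra/a.gq`.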
`storage:` (optional) is the **storage root URI** for everything the cluster
stores — the state ledger, lock, content-addressed catalog, recovery
sidecars, approval artifacts, and the derived graph roots
(`<storage>/graphs/<id>.omni`). Absent, it defaults to the config directory
itself (the original layout, byte-compatible with pre-existing clusters).
`s3://bucket/prefix` puts the whole cluster on S3-compatible object storage:
the ledger CAS uses conditional writes (verified against AWS S3 semantics and
RustFS), the lock becomes genuinely cross-machine, and graph roots are
engine-native S3 URIs. Credentials are **never** in `cluster.yaml` — the
standard `AWS_*` environment contract applies, identical to graph storage.
Declared configuration (`cluster.yaml` and the schema/query/policy sources it
references) always stays in the working tree: config is versioned in git,
state lives in the store — the Terraform split.
`metadata.name` is a display label. `state.backend` may be omitted or set to
`cluster`; external state backends are reserved for a later stage. `state.lock`
defaults to `true`. When enabled, `cluster plan`, `cluster apply`,
`cluster refresh`, and `cluster import` briefly acquire
`<config-dir>/__cluster/lock.json`, then remove it before returning. `cluster status` never acquires the lock; it only reports
whether one is present. `cluster force-unlock` is the only lock-removal command;
it requires the exact lock id and should be run only after confirming no cluster
operation is active.
## Validation
`cluster validate` checks:
- `cluster.yaml` syntax and supported fields
- duplicate YAML keys
- schema, query, and policy file existence
- schema parsing and catalog construction
- stored-query parsing and query-name matching
- stored-query type-checking against the desired schema
- policy `applies_to` graph references
Fields reserved for later phases, such as `pipelines`, `embeddings`, `ui`,
`aliases`, and `bindings`, fail with a typed diagnostic instead of being
silently ignored.
## Planning
`cluster plan` first performs validation, then reads local JSON state from:
```text
<config-dir>/__cluster/state.json
```
If the file is missing, the state is treated as empty and every desired
resource is planned as a create. If present, the file must use this shape:
```json
{
  "version": 1,
  "state_revision": 0,
  "applied_revision": {
    "config_digest": "...",
    "resources": {
      "graph.knowledge": { "digest": "..." },
      "schema.knowledge": { "digest": "..." },
      "query.knowledge.find_experts": { "digest": "..." },
      "policy.base": {
        "digest": "...",
        "applies_to": ["cluster", "graph.knowledge"]
      }
    }
  },
  "resource_statuses": {
    "graph.knowledge": {
      "status": "applied",
      "conditions": [],
      "message": "optional status detail"
    }
  },
  "approval_records": {},
  "recovery_records": {},
  "observations": {}
}
```
`state_revision`, `resource_statuses`, `approval_records`, `recovery_records`,
and `observations` are optional so older Stage 1 state fixtures keep working.
Missing `state_revision` is treated as `0`. Resource status values are
`pending`, `planned`, `applying`, `applied`, `drifted`, `blocked`, or `error`.
Plan output compares desired resource digests against state resource digests
and reports `create`, `update`, and `delete` changes. It also reports the state
CAS (`sha256:<digest>`) and state revision. `state_observations.locked` means an
existing lock file was observed, along with its metadata (`lock_id`,
`lock_operation`, `lock_created_at`, `lock_pid`, `lock_age_seconds`); a
successful `plan` instead reports `lock_acquired: true` and an
`acquired_lock_id`, then releases the lock before returning. The command never
writes `state.json` and does not scan live graphs. Use explicit
`cluster refresh` / `cluster import` when the state ledger should be updated
from live observations. Live drift scans during plan are later-stage work.
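
The digest comparison behind `create`, `update`, and `delete` can be sketched in a few lines. This is an illustration under assumptions: both sides are modeled as plain `address -> digest` dictionaries, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def diff_resources(
    desired: Dict[str, str], applied: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (action, resource address) pairs from a per-resource digest comparison."""
    changes: List[Tuple[str, str]] = []
    for address in sorted(desired.keys() | applied.keys()):
        if address not in applied:
            changes.append(("create", address))  # declared, not yet in state
        elif address not in desired:
            changes.append(("delete", address))  # in state, no longer declared
        elif desired[address] != applied[address]:
            changes.append(("update", address))  # digests differ
    return changes
```

With an empty `applied` map, which is how a missing `state.json` is treated, every desired resource comes back as a `create`.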
Policy entries additionally record their applied `applies_to` bindings as
normalized typed refs — the state ledger is serving-sufficient for the
future server-boot stage. A change to `applies_to` alone (the policy file
digest unchanged) appears in the plan as an Update marked `binding_change`
(human output: `[bindings]`), applies like any catalog change, and counts
toward convergence; ledgers written before this field existed are backfilled
by the next apply.
Each plan change carries a `disposition` field — an honest preview of what
`cluster apply` will do with it in this stage: `applied` (executes), `derived`
(a `graph.<id>` composite-digest update that converges automatically once its
query digests land), `deferred` (graph/schema change, later phase), or
`blocked` (query/policy gated by an unapplied or missing dependency, with the
condition in `reason`).
## Apply
`cluster apply` executes the executable subset of the plan — stored-query and
policy-bundle changes, graph creates, and schema updates. There is no confirm
flag: `cluster plan` is the preview,
and apply recomputes the same diff under the state lock before executing, so a
stale preview can never be applied. Apply requires an existing `state.json`
(`state_missing` directs you to `cluster import` first).
For each applied create/update, the resource payload is written
content-addressed into the local catalog:
```text
<config-dir>/__cluster/resources/query/<graph>/<name>/<digest>.gq
<config-dir>/__cluster/resources/policy/<name>/<digest>.yaml
```
Extensions are fixed per kind regardless of the source file's name. Payloads
are written before the state update because `state.json` is the publish point:
if the final CAS-checked state write fails, no success is reported and the
digest-named blobs already written are inert — re-running apply is the repair.
Deletes remove the resource from state; their old payload blobs stay on disk
(garbage collection is a later stage). Re-running a converged apply is a no-op:
no state write, no revision change (`state_written: false`).
**Applied means serving — for deployments that opt in.** A server started
with `--cluster <dir>` boots from the applied revision (see
[Serving from the cluster](#serving-from-the-cluster-the-mode-switch)); it
picks up newly applied state on its next restart. Deployments still booting
from `omnigraph.yaml` are untouched: for them, applied means recorded in the
catalog, nothing more.
### Graph creation
A `graph.<id>` create (the graph is declared but no root exists) is executed
by apply: the graph is initialized at the derived root
```text
<config-dir>/graphs/<graph-id>.omni
```
with the declared schema, before any catalog writes, so queries and policies
that depend on the new graph apply **in the same run**. Each create is fenced
by a recovery sidecar under `__cluster/recoveries/{ulid}.json`, written before
the init and removed only after the state update lands. If apply crashes in
between, the next state-mutating command (`apply`, `refresh`, `import`) runs a
**recovery sweep** that classifies the survivor by observation: an absent root
removes the stale intent; a completed create rolls the cluster state forward
(recorded in the state's `recovery_records`); a partial root reports
`graph_create_incomplete` (status `error` — remove the root and re-run apply;
nothing is auto-deleted); unexpected graph content reports
`actual_applied_state_pending` (status `drifted` — run `cluster refresh` and
re-plan). While a kept sidecar is pending, that graph's create and its
dependents are blocked with `cluster_recovery_pending`. Read-only commands
(`status`, `plan`) warn about pending sidecars without acting on them.
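
One way to read the sweep's four outcomes for a surviving create sidecar is as a small decision table. The sketch below is illustrative only: the `RootObservation` and `SweepOutcome` types and the action labels are hypothetical, while the diagnostic codes and statuses are the ones named above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RootObservation(Enum):
    ABSENT = "absent"          # no graph root on disk
    COMPLETED = "completed"    # the create finished on the graph
    PARTIAL = "partial"        # a root exists but initialization did not finish
    UNEXPECTED = "unexpected"  # the root holds content the create did not intend


@dataclass(frozen=True)
class SweepOutcome:
    keep_sidecar: bool
    action: str
    diagnostic: Optional[str] = None
    status: Optional[str] = None


def classify_create_survivor(observed: RootObservation) -> SweepOutcome:
    """Classify a surviving create sidecar by what is observed at the graph root."""
    if observed is RootObservation.ABSENT:
        return SweepOutcome(False, "remove_stale_intent")
    if observed is RootObservation.COMPLETED:
        return SweepOutcome(False, "roll_state_forward")
    if observed is RootObservation.PARTIAL:
        return SweepOutcome(True, "report", "graph_create_incomplete", "error")
    return SweepOutcome(True, "report", "actual_applied_state_pending", "drifted")
```

Only the two kept cases leave a sidecar behind, and those are the ones that block the graph's create and its dependents with `cluster_recovery_pending`.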
**Re-creation is convergence.** If a graph root disappears out-of-band,
`refresh` records the drift and the next `plan` proposes a create — and apply
will execute it, producing an **empty** graph at the root. The data was
already lost when the root vanished; the create is visible in the plan
(disposition `applied`) before anything runs.
### Schema updates
A `schema.<id>` update (the declared schema differs from what state records)
is executed by apply via the engine's schema-apply, after graph creates and
before catalog writes — so a query change that depends on the new schema
applies in the same run. Each schema apply is sidecar-fenced like a create:
pre-operation manifest version recorded, post-operation version written back,
sidecar retired only after the state update lands; the recovery sweep
classifies survivors by schema digest (consistent ledger → retired; completed
on the graph → state rolled forward with an audit entry; anything else →
`drifted`/`actual_applied_state_pending`, kept).
Migrations run with **soft drops only** — a removed property disappears from
the current version while prior versions retain the data (reversible until
`cleanup`). Data-loss migrations (`allow_data_loss`) are not reachable from
cluster apply until the approval-artifact stage. Unsupported migrations
(e.g. changing a property's type), engine lock contention, or graphs with
user branches fail loudly as `schema_apply_failed` with the engine's message;
dependent changes are demoted to `blocked` and graph-moving work stops for
the run.
`cluster plan` previews schema updates with the engine's real migration plan:
each schema change carries a `migration` field (`supported` + typed steps),
and the human output prints the steps. If the live graph cannot be opened the
preview degrades to the digest diff with a `schema_preview_unavailable`
warning.
**Drift is converged, not just reported.** A schema changed out-of-band on
the live graph shows up as `drifted` after `refresh`, and the next plan
proposes migrating it back to the declared schema — apply executes that like
any other soft migration. Drift correction is gated by the same rules as any
change; nothing about it is hidden (the plan shows the steps, including soft
drops of out-of-band fields).
**Attribution.** `cluster apply --as <actor>` records the operator identity
in recovery sidecars and audit entries and threads it to the engine's
schema-apply (so commit attribution and Cedar enforcement — wherever a policy
checker is installed — work unchanged).
### Approvals and graph deletion
Deleting a graph is the irreversible tier: it requires a recorded human
decision. `cluster plan` lists the gate under `approvals_required` (one gate
per graph — the graph-level approval carries its schema and queries);
`cluster approve graph.<id> --as <actor>` writes a digest-bound artifact to
```text
<config-dir>/__cluster/approvals/<approval-id>.json
```
bound to the exact desired config digest and the change's state digest, so
**any config or state drift after approving invalidates the artifact**
automatically (`approval_stale` warning; it never authorizes a different
change). An unapproved delete blocks with `approval_required`.
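
The digest binding can be pictured as an equality check on two digests. The sketch below is an assumption-laden illustration: the `Approval` fields and the `gate_delete` function are hypothetical names, and reporting `approval_stale` together with `approval_required` for a stale artifact is this example's reading of the behavior described above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Approval:
    resource: str        # e.g. "graph.scratch"
    config_digest: str   # desired config digest at approval time
    state_digest: str    # the change's state digest at approval time
    consumed: bool = False


def gate_delete(
    resource: str,
    config_digest: str,
    state_digest: str,
    approvals: Iterable[Approval],
) -> Tuple[bool, List[str]]:
    """Return (authorized, diagnostics) for a planned graph delete."""
    diagnostics: List[str] = []
    for approval in approvals:
        if approval.resource != resource or approval.consumed:
            continue
        if (approval.config_digest, approval.state_digest) == (config_digest, state_digest):
            return True, []  # bound to exactly this change
        diagnostics.append("approval_stale")  # config or state moved since approval
    diagnostics.append("approval_required")
    return False, diagnostics
```

Because the check is plain equality on both digests, an approval written for one delete cannot be reused after any config or state change.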
An approved delete executes **last** in the apply run: the graph root is
removed recursively, the subtree (graph, schema, its queries) is tombstoned
out of the state ledger with a tombstone observation, and the approval is
consumed — recorded in the state's `approval_records` in the same state
update, and the artifact file rewritten with `consumed_at` (the file is never
deleted: the audit fact survives the loss of either store). A failed run
consumes nothing; the approval stays valid for the retry. Catalog blobs of
the deleted graph's queries stay on disk (GC is a later stage).
Crash recovery for deletes: a completed-but-unrecorded delete is rolled
forward by the sweep (tombstone + approval consumption + audit entry); an
incomplete delete (root still present) is retired with a
`graph_delete_incomplete` warning and simply **re-proposed** — prefix removal
is idempotent, so the still-approved retry is the repair.
Standalone schema deletes are never executed by this stage. They are
reported as `deferred` (warning `apply_unsupported_change`), and query/policy
changes that depend on them are `blocked` (warning `apply_dependency_blocked`, status
`blocked` in state). A partially-applicable plan still exits 0 with warnings;
the JSON `converged` field is the automation signal for "state now matches the
desired revision". The applied `config_digest` is only recorded when apply
fully converges. The `graph.<id>` composite digest is recomputed from state's
own schema/query digests after each apply, so applied query changes converge
without graph movement.
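
Because a partially applied run still exits 0, automation should key off `converged` rather than the exit code. A minimal sketch, assuming the JSON report is a top-level object and that the hypothetical helper receives the captured stdout of `omnigraph cluster apply --json`:

```python
import json


def ci_exit_code(apply_stdout: str) -> int:
    """Return 0 only when the apply report says state matches the desired revision."""
    report = json.loads(apply_stdout)
    # Treat a missing or non-true `converged` as "not converged".
    return 0 if report.get("converged") is True else 1
```

A pipeline step could pass the captured output through this check and fail the job whenever it returns 1, even though the CLI itself exited 0.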
## Serving from the cluster (the mode switch)
```bash
omnigraph-server --cluster company-brain --bind 0.0.0.0:8080
```
`--cluster <dir>` is an **exclusive boot source** (axiom 15): it cannot
combine with a graph URI, `--target`, or `--config`, and in this mode
`omnigraph.yaml` is never read — not for graphs, not for queries, not for
policies. The server serves the **applied revision**: graph roots recorded in
`state.json`, stored-query and policy content from the content-addressed
catalog at the applied digests (re-verified at boot), and policy bundles
wired by their applied `applies_to` bindings — `cluster`-bound bundles become
the server-level Cedar engine, graph-bound bundles attach per graph.
Un-applied config drift never leaks into serving; `cluster plan` is where
drift is visible. Routing is always multi-graph (`/graphs/{id}/...`). Bearer
tokens and the bind address stay process-level (flags/env) — they are
per-replica facts, not cluster facts.
Boot is fail-fast: missing or unreadable state, pending recovery sidecars,
missing/tampered catalog blobs, policy entries without binding metadata
(pre-binding ledgers — re-run `cluster apply`), an empty graph set, more than
one policy bundle binding a single scope (split or merge bundles; stacked
scopes are a later stage), unopenable graph roots, and stored queries that no
longer type-check all refuse startup with a remedy. A held state lock is
*not* an error — boot reads the atomically-replaced state file without
locking.
Serving is static per process: the server reads the applied revision once at
startup, so picking up newly applied state means restarting it. Stored
queries are all listed in `GET /queries` in cluster mode (the cluster
registry has no expose flag; exposure becomes a policy decision in a later
phase).
## Status
`cluster status` reads the same local JSON state ledger and prints what the
ledger says is deployed. It does not validate referenced schema/query/policy
files and does not inspect live graphs. Missing `state.json` succeeds with a
warning; invalid state JSON or an unsupported state version fails. If a lock is
present, status reports its id, operation, creation time, pid, and age.
Status also verifies the catalog payloads read-only: every query/policy digest
recorded in state is checked against its content-addressed blob under
`__cluster/resources/` (existence and full digest re-hash). A missing or
mismatched blob is reported as a warning (`catalog_payload_missing` /
`catalog_payload_mismatch`); an unreadable blob is an error
(`catalog_payload_read_error`) because an unverifiable catalog must not report
healthy. Status never writes state — persisting the `drifted` condition is
refresh's job. The check runs without the state lock, so it is a point-in-time
report.
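
The existence-plus-re-hash check can be sketched as below. This is an illustration under an explicit assumption: the recorded digest is taken to be the hex SHA-256 of the blob's bytes, which the text above does not specify. The function name is hypothetical; the diagnostic codes are the ones listed above.

```python
import hashlib
from pathlib import Path
from typing import Optional


def check_payload(blob: Path, expected_hex: str) -> Optional[str]:
    """Return None when the blob is healthy, else the diagnostic code."""
    try:
        data = blob.read_bytes()
    except FileNotFoundError:
        return "catalog_payload_missing"      # warning
    except OSError:
        return "catalog_payload_read_error"   # error: cannot be verified
    # Assumed digest scheme: hex SHA-256 over the full file contents.
    if hashlib.sha256(data).hexdigest() != expected_hex:
        return "catalog_payload_mismatch"     # warning
    return None
```

Not-found is kept separate from other IO errors because only the unreadable case is an error: an unverifiable catalog must not report healthy.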
## Refresh And Import
`cluster refresh` updates an existing `state.json` from actual observations.
`cluster import` creates the first `state.json` when the ledger is missing.
Both commands open declared graphs read-only at:
```text
<config-dir>/graphs/<graph-id>.omni
```
They observe only branch `main`, recording graph existence, manifest version,
live schema digest, desired schema digest, and schema-match status under
`observations["graph.<id>"]`. Missing graph roots are recorded as drift and
remove the graph/schema digests from state so a later `plan` proposes creates.
Invalid graph roots are recorded as errors; `refresh` persists the error
observation and exits non-zero, while `import` exits non-zero without creating
initial state.
Refresh also verifies the catalog payloads of every query/policy digest
recorded in state (the same check `cluster status` reports read-only), and
closes the loop:
- a **missing** or **digest-mismatched** blob marks the resource `drifted`
(condition `payload_missing` / `payload_mismatch`) and removes its digest
from state — so the next `cluster plan` proposes a create and the next
`cluster apply` republishes the blob (the self-heal loop, mirroring how a
missing graph root is handled);
- an **unreadable** blob (IO error other than not-found) keeps the digest,
marks the resource `error` (condition `payload_read_error`), and exits
non-zero — transient IO must not trigger a spurious republish.
Upgrade note: a state ledger written before catalog publish existed records
query/policy digests with no blobs on disk; the first refresh after upgrading
flags them all `payload_missing`, and a single `cluster apply` republishes
everything and converges.
Refresh/import do not observe query or policy resources beyond their catalog
payloads yet. Existing query and policy state digests are preserved on refresh
(unless their payload drifted, above) and are not invented on import.
## Force Unlock
`cluster force-unlock <LOCK_ID>` removes `<config-dir>/__cluster/lock.json` only
when the file exists, is valid version-1 lock JSON, and its `lock_id` exactly
matches the argument. A wrong id, missing lock, invalid lock JSON, or unsupported
lock version exits non-zero and leaves the file untouched.
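
The exact-id rule can be sketched as a guard that refuses before it removes anything. This is an illustration only: the `version` field name is an assumption (`lock_id` is the field named above), and `UnlockRefused` is a hypothetical exception standing in for the non-zero exit.

```python
import json
from pathlib import Path


class UnlockRefused(Exception):
    """Hypothetical stand-in for a non-zero exit that leaves the lock untouched."""


def force_unlock(lock_path: Path, lock_id: str) -> None:
    """Remove the lock file only when it is valid version-1 JSON with this exact id."""
    try:
        lock = json.loads(lock_path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        raise UnlockRefused("no lock present") from None
    except json.JSONDecodeError:
        raise UnlockRefused("invalid lock JSON") from None
    # Assumed layout: a JSON object with `version` and `lock_id` keys.
    if not isinstance(lock, dict) or lock.get("version") != 1:
        raise UnlockRefused("unsupported lock version")
    if lock.get("lock_id") != lock_id:
        raise UnlockRefused("lock id does not match")
    lock_path.unlink()
```

Every refusal path returns before `unlink`, so a wrong id can never remove another operation's lock.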
This is manual recovery for abandoned local locks. OmniGraph does not perform
PID-liveness checks, TTL expiry, stale-lock breaking, or automatic unlock in
Stage 2C.

docs/user/clusters/index.md Normal file

@ -0,0 +1,285 @@
# Operating an OmniGraph Cluster
This is the operator's guide to the cluster control plane: how to go from an
empty directory to a served deployment, and how to run it day to day —
evolving schemas, rotating queries and policies, healing drift, approving
destructive changes, and recovering from crashes.
It is a **how-to**. The reference for every `cluster.yaml` key, command flag,
state-file field, and diagnostic code is
[cluster-config.md](config.md); the HTTP surface is
[server.md](../operations/server.md).
## The model in one paragraph
You declare the entire deployment — graphs, schemas, stored queries, Cedar
policies — as files in one directory (`cluster.yaml` plus the `.pg`/`.gq`/
`.yaml` files it references). `cluster apply` converges reality to that
declaration and records what it did in a state ledger
(`__cluster/state.json`); `cluster plan` previews exactly what apply would
do, including real schema-migration steps. A server started with
`omnigraph-server --cluster <dir>` serves what was applied — never what is
merely written in config. Terraform users will recognize the shape: config
is desired state, the ledger is recorded state, plan is the diff, apply is
the only thing that changes the world, and irreversible changes require an
explicitly recorded approval.
## 1. Deploy a cluster from zero
Lay out a config directory:
```text
company-brain/
├── cluster.yaml
├── people.pg # schema for the "knowledge" graph
├── queries/ # stored queries — the .gq files ARE the declaration
│ └── people.gq
└── base.policy.yaml # a Cedar policy bundle
```
```yaml
# cluster.yaml
version: 1
# storage: s3://omnigraph-local/clusters/company-brain # optional: put the
# ledger, catalog, and graph data on object storage (default: this folder)
metadata:
  name: company-brain
graphs:
  knowledge:
    schema: people.pg
    queries: queries/ # every `query <name>` in queries/*.gq registers
policies:
  base:
    file: base.policy.yaml
    applies_to: [knowledge] # graph-bound; use [cluster] for server-level
```
Bring it to life:
```bash
omnigraph cluster validate --config company-brain # parse + typecheck everything
omnigraph cluster import --config company-brain # create the state ledger
omnigraph cluster plan --config company-brain # preview: what would apply do?
omnigraph cluster apply --config company-brain # converge
```
That single `apply` **creates the graph** (at the derived root
`company-brain/graphs/knowledge.omni`), applies its schema, and publishes
the query and policy into the content-addressed catalog
(`__cluster/resources/…`). The output lists every change with its
disposition; `converged: true` means there is nothing left to do — re-running
`apply` is always safe and idempotent.
Load data through the normal graph plane (the control plane manages
*definitions*, not rows):
```bash
omnigraph load --data seed.jsonl company-brain/graphs/knowledge.omni
```
Serve it:
```bash
OMNIGRAPH_SERVER_BEARER_TOKENS_JSON='{"act-reader":"s3cret"}' \
omnigraph-server --cluster company-brain --bind 0.0.0.0:8080
```
`--cluster` accepts either a **config directory** (the storage root resolves
through `cluster.yaml`'s `storage:` key) or a **storage-root URI directly**
(`--cluster s3://bucket/prefix`) — config-free serving: a serving box needs
only the URI and credentials, no checkout of the config repo. The ledger and
catalog on the bucket are the deployment artifact.
`--cluster` is an **exclusive boot source**: it cannot be combined with a
graph URI, `--target`, or `--config`, and `omnigraph.yaml` is never read in
this mode. Routing is always multi-graph:
```bash
curl -H 'authorization: Bearer s3cret' \
-X POST http://localhost:8080/graphs/knowledge/queries/find_person \
-H 'content-type: application/json' -d '{"params":{"name":"Ada"}}'
```
Bearer tokens and the bind address are deliberately *not* cluster facts —
they are per-replica, set by flag or environment
([server.md](../operations/server.md#modes) for the token sources).
## 2. The day-2 loop: edit → plan → apply → restart
Every change follows the same loop, whatever its kind:
```bash
$EDITOR company-brain/people.pg # or any .gq / policy / cluster.yaml edit
omnigraph cluster plan --config company-brain
omnigraph cluster apply --config company-brain --as andrew
# restart cluster-booted servers to pick it up
```
`--as <actor>` attributes the run: it is recorded in recovery sidecars and
audit entries and threaded into the engine's commit history. Set
`cli: { actor: <you> }` in your per-operator `omnigraph.yaml` to make it the
default when `--as` is omitted (the flag always wins; `approve` requires one
of the two).
What each change kind does:
| You edit | Plan shows | Apply does |
|---|---|---|
| a `.gq` file or `queries:` entry | `Update query.<g>.<n>` | publishes the new content-addressed blob, updates the ledger |
| a policy file | `Update policy.<n>` | same — new blob, ledger update |
| a policy's `applies_to` | `Update policy.<n> [bindings]` | records the new bindings (the file digest is unchanged; bindings are first-class changes) |
| a `.pg` schema | `Update schema.<g>` **with the real migration steps embedded** | runs the engine's schema apply on the live graph — soft drops only, sidecar-fenced |
| `graphs:` gains an entry | `Create graph.<g>` (+ schema, queries) | initializes the graph at its derived root; dependents apply in the same run |
| `graphs:` loses an entry | `Delete graph.<g>` **blocked, `approval_required`** | nothing, until approved (see §4) |
Two properties worth internalizing:
- **One apply, ordered correctly.** Creates run first, then schema
migrations, then catalog writes, then (approved) deletes — so a schema
change plus a query that uses the new field converge together in one run.
- **Soft drops only.** A removed schema property disappears from the current
version while prior versions retain the data (reversible until `cleanup`).
Data-loss migrations are not reachable from cluster apply.
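
The fixed ordering within one apply run can be illustrated with a stable sort over phases. The phase labels and the `(phase, address)` tuple shape are hypothetical names chosen for this sketch; only the order itself comes from the text above.

```python
from typing import List, Tuple

# Execution order within one apply run, earliest first.
PHASES = ["graph_create", "schema_update", "catalog_write", "graph_delete"]


def order_changes(changes: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (phase, resource address) pairs into execution order.

    `sorted` is stable, so changes within the same phase keep their plan order.
    """
    return sorted(changes, key=lambda change: PHASES.index(change[0]))
```

Putting schema updates before catalog writes is what lets a query that uses a newly added field converge in the same run as the schema change.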
Read the plan before applying when the change is non-trivial — for schema
updates it embeds the engine's actual migration plan (`add_property`,
`drop_property [soft]`, `unsupported: …`), so you see data impact before
anything runs.
## 3. Inspect: status, refresh, drift
```bash
omnigraph cluster status --config company-brain --json # ledger only, read-only
omnigraph cluster refresh --config company-brain # re-observe live graphs
```
`status` never touches the graphs; `refresh` opens them read-only and
records what it finds — manifest versions, live schema digests, catalog blob
integrity. If someone changed a graph behind the control plane's back (a
direct `omnigraph schema apply`, a tampered catalog file), refresh marks the
resource **`drifted`**.
**Drift is converged, not just reported.** After a refresh records drift,
the next `plan` proposes migrating the live graph back to the declared
schema — with the steps visible, including the soft drops of out-of-band
fields — and `apply` executes it like any other change. If the out-of-band
change is the one you want, change the *config* to match instead, and apply
converges the ledger.
## 4. Destructive changes: the approval gate
Removing a graph from `cluster.yaml` never executes silently:
```bash
omnigraph cluster apply --config company-brain
# Delete graph.scratch [Blocked: approval_required]
omnigraph cluster approve graph.scratch --config company-brain --as andrew
# cluster approve: delete graph.scratch approved by andrew (approval 01KT…)
omnigraph cluster apply --config company-brain --as andrew
# Delete graph.scratch [Applied] ← root removed, subtree tombstoned
```
The approval artifact (`__cluster/approvals/<id>.json`) is **digest-bound**:
it authorizes exactly the change you saw when you approved it. Any config or
state movement afterwards invalidates it automatically (`approval_stale`
warning) — a stale approval can never authorize a different delete. One
approval covers the graph's whole subtree (its schema and queries ride
along). Consumed artifacts are kept (rewritten with `consumed_at`) and
summarized in the ledger's `approval_records`, so the audit trail of *who
approved what* survives the loss of either store.
## 5. When things go wrong
**Crashes are designed for.** Every graph-moving operation (create, schema
apply, delete) writes a recovery sidecar before acting. If an apply dies
mid-run, the next state-mutating command sweeps the sidecars and reconciles
— rolling the ledger forward when the operation completed on the graph,
retiring stale intent when nothing moved, and flagging anything it cannot
verify. You generally fix a crashed run by **running `cluster apply`
again**.
**A held lock** (a crashed process left `__cluster/lock.json`):
```bash
omnigraph cluster status --config company-brain # shows the lock holder + id
omnigraph cluster force-unlock <LOCK_ID> --config company-brain
```
Force-unlock requires the exact lock id (from status) — there is no blind
unlock.
**A lost or corrupted state ledger**: the cluster is self-describing.
`cluster import` rebuilds `state.json` from the config plus read-only
observation of the live graphs; the next `apply` re-converges onto the same
content-addressed catalog.
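
The ledger recovery described above can be sketched as a two-step helper. It assumes `cluster import` accepts `--config` in the same way `cluster apply` does; the function name `rebuild_ledger` is hypothetical.

```bash
# Sketch only: rebuild a lost state ledger, then re-converge.
# Assumption: `cluster import` takes --config like `cluster apply`.
rebuild_ledger() {
  local config_dir="$1"
  # Step 1: rebuild state.json from the config plus read-only observation.
  omnigraph cluster import --config "$config_dir" || return 1
  # Step 2: re-converge onto the content-addressed catalog.
  omnigraph cluster apply --config "$config_dir"
}
```
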
**A server that refuses to boot** with `--cluster` is telling you the
applied revision is not safely servable. Each refusal names its remedy:
| Boot error | Meaning | Remedy |
|---|---|---|
| `cluster_state_missing` | no ledger | `cluster import`, then `apply` |
| `cluster_recovery_pending` | interrupted operation awaiting sweep | run `cluster apply` (or any state-mutating command), restart |
| `catalog_payload_missing` / `…_digest_mismatch` | catalog blob lost or tampered | `cluster refresh`, then `apply`, restart |
| `policy_bindings_missing` | ledger predates binding metadata | re-run `cluster apply` (backfills), restart |
| `cluster_empty` | applied revision has no graphs | apply a cluster with ≥1 graph |
| multiple bundles bind one scope | serving holds one policy bundle per graph + one server-level | split or merge bundles |

A held *state lock* is deliberately **not** a boot error: the server reads
the atomically-replaced ledger without locking, so serving never contends
with an in-flight apply.
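
For a runbook or a supervisor script, the boot-error table above can be encoded as a lookup. This is a sketch, not something the server ships: the function name `boot_error_remedy` is hypothetical, and the `*_digest_mismatch` glob stands in for the digest-mismatch code whose prefix the table elides.

```bash
# Sketch only: map a --cluster boot error code to the remedy from the table.
boot_error_remedy() {
  case "$1" in
    cluster_state_missing)
      echo "cluster import, then apply" ;;
    cluster_recovery_pending)
      echo "run cluster apply (or any state-mutating command), then restart" ;;
    catalog_payload_missing | *_digest_mismatch)
      echo "cluster refresh, then apply, then restart" ;;
    policy_bindings_missing)
      echo "re-run cluster apply (backfills), then restart" ;;
    cluster_empty)
      echo "apply a cluster with at least one graph" ;;
    *)
      # Not a code from the table: report it instead of guessing a remedy.
      echo "no remedy recorded for: $1" >&2
      return 1 ;;
  esac
}
```

`boot_error_remedy cluster_state_missing` prints the first row's remedy. Unrecognized codes return non-zero so a caller can escalate instead of retrying blindly.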
## 6. Deployment patterns
- **Replicas**: any number of `--cluster` servers can serve the same config
  directory; boot is read-only. Roll out a change by running `apply` once,
  then restarting each replica (serving is static per process; there is no
  hot reload yet). Container/cloud recipes (AWS ECS+EFS, Railway volumes):
[deployment.md](../deployment.md#cluster-mode-in-containers-aws-railway).
- **The directory is the deployable unit**: config, catalog, ledger,
approvals, and graph data all live under it. Back it up as a whole;
version the *config files* (not `__cluster/` or `graphs/`) in git.
- **CI-driven convergence**: `validate` and `plan --json` are read-only and
safe in pipelines; gate `apply --as ci` on plan review. Approvals are the
human step by design — keep `cluster approve` out of automation.
- **`omnigraph.yaml` still has a job**: per-operator settings — your
`cli.actor` default for `--as`, CLI defaults, credentials, and data-plane
ergonomics (point `graphs.<name>.uri` at a derived root like
`company-brain/graphs/knowledge.omni` to use `--target <name>` for
loads). It just no longer describes the deployment — a server boots from
one source or the other, never a merge of both.
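
The CI-driven pattern above might look like the following pipeline step. It is a sketch under stated assumptions: `validate` and `plan` are assumed to take `--config` the way `apply` does, and `PLAN_REVIEWED` is a hypothetical variable the pipeline sets once someone has reviewed the plan. `cluster approve` is left out on purpose.

```bash
# Sketch only: read-only checks first, apply only after plan review.
# Assumptions: validate/plan accept --config; PLAN_REVIEWED is hypothetical.
ci_converge() {
  local config_dir="$1" plan_file="${2:-plan.json}"
  # Read-only steps: safe to run on every commit.
  omnigraph cluster validate --config "$config_dir" || return 1
  omnigraph cluster plan --config "$config_dir" --json > "$plan_file" || return 1
  if [ "${PLAN_REVIEWED:-0}" != "1" ]; then
    echo "plan written to $plan_file; review it, then re-run with PLAN_REVIEWED=1" >&2
    return 0
  fi
  # The gated step: runs only after the plan has been reviewed.
  omnigraph cluster apply --config "$config_dir" --as ci
}
```

Destructive changes still stop at the approval gate during `apply`, so the human step remains outside the pipeline.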
## 7. Maintaining a cluster graph
Storage maintenance (`optimize` / `repair` / `cleanup`) is **not** a control-plane
operation — it runs out-of-band, with direct storage access, against the graph's
roots. Address a cluster graph by name instead of hand-typing its storage path:
```bash
omnigraph optimize --cluster ./company-brain --cluster-graph knowledge
omnigraph cleanup --cluster ./company-brain --cluster-graph knowledge --keep 10 --confirm
# --cluster also takes the storage-root URI directly (config-free):
omnigraph optimize --cluster s3://bucket/clusters/company-brain --cluster-graph knowledge
```
The graph's storage URI is resolved from the **served cluster state** (the same
truth a `--cluster` server boots from); a graph that hasn't been applied yet is
not resolvable. Run these from a host with storage access — there are no server
routes for them. Conversely, **`init` refuses** a cluster-managed path: graphs in
a cluster are created by `cluster apply`, not by hand.
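
To run maintenance across several graphs in one cluster, the `optimize` invocation above can be wrapped in a loop. This is a sketch: the function name is hypothetical, and the caller supplies the graph names, since no listing command is assumed here.

```bash
# Sketch only: optimize each named graph in a cluster, continuing on failure.
optimize_cluster_graphs() {
  local cluster="$1"; shift
  local graph failed=0
  for graph in "$@"; do
    # Same flags as the single-graph example; record failures and keep going.
    omnigraph optimize --cluster "$cluster" --cluster-graph "$graph" || failed=1
  done
  return "$failed"
}
```

For example, `optimize_cluster_graphs ./company-brain knowledge` optimizes one graph. The function returns non-zero if any graph failed, after attempting all of them.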
## What the control plane does not do (yet)
- **No hot reload** — applied changes serve on the next restart.
- **No data operations** — rows move through `omnigraph load / ingest /
mutate` against the graph roots, with branches and merges as usual.
- **Stored-query exposure is all-or-nothing per cluster** — every applied
query is listed and invokable (subject to Cedar `invoke_query`); per-query
exposure policy is a planned phase.
- **Pipelines (ETL)** are a separate project; the `pipelines:` key is
reserved and rejected loudly.

For the full reference (every key, flag, status, disposition, and
diagnostic), see [cluster-config.md](config.md).