omnigraph/docs/user/clusters/index.md
Andrew Altshuler 4601e5f4bf
feat!: delete the legacy OmnigraphConfig + config migrate; finish the omnigraph.yaml docs sweep (#252)
* refactor(cli): own ReadOutputFormat/TableCellLayout in the CLI

The two output-presentation enums lived in `omnigraph-server::config` and were
re-exported for the CLI, even though the server never used them. Move both
definitions into `omnigraph-cli/src/read_format.rs` (where the renderer already
lives) and drop them from the server's public re-export. This is a step toward
deleting the legacy `omnigraph-server::config` module entirely — a CLI
presentation concern has no business in the server crate.

No behavior change. The server keeps private copies in `config.rs` only for the
soon-to-be-deleted legacy `CliDefaults`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(cli)!: remove the `config migrate` command and migrate.rs

`config migrate` was the last CLI consumer of the legacy `omnigraph.yaml`
(`OmnigraphConfig` + `load_config`). With the excision complete there is no
legacy file to split, so the whole `omnigraph config` command group is removed
along with `migrate.rs`. The `OmnigraphConfig` type, `load_config`, and the
deprecation machinery are deleted next.

- Remove `Command::Config` / `ConfigCommand` from the clap surface and the
  dispatch arm; drop `mod migrate;` and the now-unused `load_config` import.
- Drop the `Command::Config` arms in `planes.rs`.
- Delete the `config_migrate_splits_legacy_config` integration test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(server)!: delete the legacy OmnigraphConfig type and load_config

With `config migrate` gone, nothing loads `omnigraph.yaml` anymore. Delete the
entire `omnigraph-server::config` module: the `OmnigraphConfig` type and its
sub-structs (`ProjectConfig`, `TargetConfig`, `CliDefaults`, `ServerDefaults`,
`AuthDefaults`, `QueryDefaults`, `AliasConfig`, `AliasCommand`, `PolicySettings`,
`QueryEntry`, `McpSettings`), `load_config`, and the RFC-008 deprecation
machinery (`OMNIGRAPH_CONFIG`, `OMNIGRAPH_NO_LEGACY_CONFIG`,
`OMNIGRAPH_SUPPRESS_YAML_DEPRECATION`, the deprecation map + warner).

- `QueryRegistry::load` (the only `OmnigraphConfig`/`QueryEntry` consumer; its
  only caller was its own test) is removed — server boot and the CLI both build
  registries via `QueryRegistry::from_specs`.
- `graph_resource_id_for_selection` (CLI-only) moves into the CLI
  (`helpers.rs`), with its unit test; the server no longer exports it.
- Drop the already-dead `format_registry_load_errors` helper (config-adjacent).

No behavior change — every deleted item was unreachable after the excision.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs: purge the legacy omnigraph.yaml surface from the docs

Finish the RFC-011 excision in the docs: the CLI no longer reads omnigraph.yaml
and the server boots cluster-only, so every doc that described the legacy file
as a live config is now wrong.

- AGENTS.md: rewrite the HTTP-server line to cluster-only boot (drop the
  single-graph/flat-route and omnigraph.yaml-boot framing); rewrite the CLI
  two-surface-config passage (drop `config migrate`, the deprecation env vars,
  and "Never extend omnigraph.yaml"); fix the topic table + capability rows.
- cli/reference.md: delete the entire "omnigraph.yaml schema (legacy combined
  file)" section and the `config migrate` row; re-home the `policy` row, the
  bearer-token chain, the actor/format/param-precedence references, and the
  `--config` mentions to the operator config + `--cluster`.
- cli/index.md: rewrite the multi-graph-server + add-graph paragraphs to
  cluster (`--cluster` + `cluster apply`); fix the policy examples to
  `--cluster`; replace the `## Config` omnigraph.yaml example with the
  operator/cluster two-surface model.
- operations/policy.md: rewrite per-graph-vs-server-level policy to the cluster
  `policies:`/`applies_to` model; re-home the actor + CLI tooling sections.
- clusters/config.md, clusters/index.md, deployment.md: server boots from the
  cluster only; per-operator facts come from ~/.omnigraph/config.yaml.
- architecture.md, testing.md: drop the stale omnigraph.yaml / deleted-test
  references.

RFCs, design specs, and prior release notes are left as historical records.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:31:29 +03:00


# Operating an OmniGraph Cluster
This is the operator's guide to the cluster control plane: how to go from an
empty directory to a served deployment, and how to run it day to day —
evolving schemas, rotating queries and policies, healing drift, approving
destructive changes, and recovering from crashes.

It is a **how-to**. The reference for every `cluster.yaml` key, command flag,
state-file field, and diagnostic code is
[cluster-config.md](config.md); the HTTP surface is
[server.md](../operations/server.md).
## The model in one paragraph
You declare the entire deployment — graphs, schemas, stored queries, Cedar
policies — as files in one directory (`cluster.yaml` plus the `.pg`/`.gq`/
`.yaml` files it references). `cluster apply` converges reality to that
declaration and records what it did in a state ledger
(`__cluster/state.json`); `cluster plan` previews exactly what apply would
do, including real schema-migration steps. A server started with
`omnigraph-server --cluster <dir>` serves what was applied — never what is
merely written in config. Terraform users will recognize the shape: config
is desired state, the ledger is recorded state, plan is the diff, apply is
the only thing that changes the world, and irreversible changes require an
explicitly recorded approval.
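
The "plan is the diff" idea can be illustrated with a toy sketch. It assumes a made-up input format (one `<resource> <digest>` pair per line) and a hypothetical `toy_plan` function; the real ledger layout and planning logic are not shown in this guide.

```bash
# Toy illustration only: diff desired state (config) against recorded state
# (ledger). The "<resource> <digest>" line format is an assumption.
toy_plan() (
  desired="$1"
  recorded="$2"
  awk '
    FILENAME == ARGV[1] { want[$1] = $2; next }   # first file: desired state
                        { have[$1] = $2 }         # second file: recorded state
    END {
      for (r in want) {
        if (!(r in have))            print "Create " r
        else if (want[r] != have[r]) print "Update " r
      }
      for (r in have) {
        if (r in want) continue
        # The guide documents the approval gate for graphs; how other
        # removals are reported is not covered here.
        if (r ~ /^graph\./) print "Delete " r " [Blocked: approval_required]"
        else                print "Delete " r
      }
    }
  ' "$desired" "$recorded" | sort
)
```

When the two inputs match, the function prints nothing, which corresponds to a converged cluster.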
## 1. Deploy a cluster from zero
Lay out a config directory:
```
company-brain/
├── cluster.yaml
├── people.pg            # schema for the "knowledge" graph
├── queries/             # stored queries: the .gq files ARE the declaration
│   └── people.gq
└── base.policy.yaml     # a Cedar policy bundle
```
```yaml
# cluster.yaml
version: 1
# storage: s3://omnigraph-local/clusters/company-brain   # optional: put the
#   ledger, catalog, and graph data on object storage (default: this folder)
metadata:
  name: company-brain
graphs:
  knowledge:
    schema: people.pg
    queries: queries/          # every `query <name>` in queries/*.gq registers
policies:
  base:
    file: base.policy.yaml
    applies_to: [knowledge]    # graph-bound; use [cluster] for server-level
```
Bring it to life:
```bash
omnigraph cluster validate --config company-brain # parse + typecheck everything
omnigraph cluster import --config company-brain # create the state ledger
omnigraph cluster plan --config company-brain # preview: what would apply do?
omnigraph cluster apply --config company-brain # converge
```
That single `apply` **creates the graph** (at the derived root
`company-brain/graphs/knowledge.omni`), applies its schema, and publishes
the query and policy into the content-addressed catalog
(`__cluster/resources/…`). The output lists every change with its
disposition; `converged: true` means there is nothing left to do — re-running
`apply` is always safe and idempotent.

Load data through the normal graph plane (the control plane manages
*definitions*, not rows):
```bash
omnigraph load --data seed.jsonl company-brain/graphs/knowledge.omni
```
Serve it:
```bash
OMNIGRAPH_SERVER_BEARER_TOKENS_JSON='{"act-reader":"s3cret"}' \
  omnigraph-server --cluster company-brain --bind 0.0.0.0:8080
```
`--cluster` accepts either a **config directory** (the storage root resolves
through `cluster.yaml`'s `storage:` key) or a **storage-root URI directly**
(`--cluster s3://bucket/prefix`) — config-free serving: a serving box needs
only the URI and credentials, no checkout of the config repo. The ledger and
catalog on the bucket are the deployment artifact.

`--cluster` is an **exclusive boot source**: it cannot be combined with a
graph URI, `--target`, or `--config`, and `omnigraph.yaml` is never read in
this mode. Routing is always multi-graph:
```bash
curl -H 'authorization: Bearer s3cret' \
  -X POST http://localhost:8080/graphs/knowledge/queries/find_person \
  -H 'content-type: application/json' -d '{"params":{"name":"Ada"}}'
```
Bearer tokens and the bind address are deliberately *not* cluster facts —
they are per-replica, set by flag or environment
([server.md](../operations/server.md#modes) for the token sources).
## 2. The day-2 loop: edit → plan → apply → restart
Every change follows the same loop, whatever its kind:
```bash
$EDITOR company-brain/people.pg # or any .gq / policy / cluster.yaml edit
omnigraph cluster plan --config company-brain
omnigraph cluster apply --config company-brain --as andrew
# restart cluster-booted servers to pick it up
```
`--as <actor>` attributes the run: it is recorded in recovery sidecars and
audit entries and threaded into the engine's commit history. Set
`operator: { actor: <you> }` in your `~/.omnigraph/config.yaml` to make it the
default when `--as` is omitted (the flag always wins; `approve` requires one
of the two).

What each change kind does:

| You edit | Plan shows | Apply does |
|---|---|---|
| a `.gq` file or `queries:` entry | `Update query.<g>.<n>` | publishes the new content-addressed blob, updates the ledger |
| a policy file | `Update policy.<n>` | same — new blob, ledger update |
| a policy's `applies_to` | `Update policy.<n> [bindings]` | records the new bindings (the file digest is unchanged; bindings are first-class changes) |
| a `.pg` schema | `Update schema.<g>` **with the real migration steps embedded** | runs the engine's schema apply on the live graph — soft drops only, sidecar-fenced |
| `graphs:` gains an entry | `Create graph.<g>` (+ schema, queries) | initializes the graph at its derived root; dependents apply in the same run |
| `graphs:` loses an entry | `Delete graph.<g>`, **blocked, `approval_required`** | nothing, until approved (see §4) |

Two properties worth internalizing:
- **One apply, ordered correctly.** Creates run first, then schema
  migrations, then catalog writes, then (approved) deletes, so a schema
  change plus a query that uses the new field converge together in one run.
- **Soft drops only.** A removed schema property disappears from the current
  version while prior versions retain the data (reversible until `cleanup`).
  Data-loss migrations are not reachable from cluster apply.

Read the plan before applying when the change is non-trivial: for schema
updates it embeds the engine's actual migration plan (`add_property`,
`drop_property [soft]`, `unsupported: …`), so you see data impact before
anything runs.
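
The ordering property can be sketched as a small filter that sorts plan lines into the four phases named above. The mapping from a plan line to a phase is an assumption made for illustration (graph creates first, `schema.*` second, queries and policies as catalog writes, deletes last), and `order_changes` is a hypothetical helper, not the engine's code.

```bash
# Sketch: read "<Verb> <resource>" lines on stdin and print them in the
# apply order described in the text. The phase mapping is an assumption.
order_changes() (
  tab="$(printf '\t')"
  awk '
    { rank = 3 }                                    # default: catalog write (query, policy)
    $2 ~ /^schema\./                  { rank = 2 }  # schema migration
    $1 == "Create" && $2 ~ /^graph\./ { rank = 1 }  # graph create
    $1 == "Delete"                    { rank = 4 }  # (approved) delete
    { printf "%d\t%d\t%s\n", rank, NR, $0 }         # NR keeps input order within a phase
  ' | sort -t "$tab" -k1,1n -k2,2n | cut -f3-
)
```

Feeding it a query update, a graph delete, and a schema update in that order prints the schema update first, then the query update, then the delete.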
## 3. Inspect: status, refresh, drift
```bash
omnigraph cluster status --config company-brain --json # ledger only, read-only
omnigraph cluster refresh --config company-brain # re-observe live graphs
```
`status` never touches the graphs; `refresh` opens them read-only and
records what it finds — manifest versions, live schema digests, catalog blob
integrity. If someone changed a graph behind the control plane's back (a
direct `omnigraph schema apply`, a tampered catalog file), refresh marks the
resource **`drifted`**.

**Drift is converged, not just reported.** After a refresh records drift,
the next `plan` proposes migrating the live graph back to the declared
schema — with the steps visible, including the soft drops of out-of-band
fields — and `apply` executes it like any other change. If the out-of-band
change is the one you want, change the *config* to match instead, and apply
converges the ledger.
## 4. Destructive changes: the approval gate
Removing a graph from `cluster.yaml` never executes silently:
```bash
omnigraph cluster apply --config company-brain
# Delete graph.scratch [Blocked: approval_required]
omnigraph cluster approve graph.scratch --config company-brain --as andrew
# cluster approve: delete graph.scratch approved by andrew (approval 01KT…)
omnigraph cluster apply --config company-brain --as andrew
# Delete graph.scratch [Applied] ← root removed, subtree tombstoned
```
The approval artifact (`__cluster/approvals/<id>.json`) is **digest-bound**:
it authorizes exactly the change you saw when you approved it. Any config or
state movement afterwards invalidates it automatically (`approval_stale`
warning) — a stale approval can never authorize a different delete. One
approval covers the graph's whole subtree (its schema and queries ride
along). Consumed artifacts are kept (rewritten with `consumed_at`) and
summarized in the ledger's `approval_records`, so the audit trail of *who
approved what* survives the loss of either store.
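
To see why a digest-bound approval cannot authorize a different change, consider this concept sketch. It is not the real artifact format: the actual inputs to the digest are not specified in this guide, so the sketch simply hashes a config file and a state file together. The function names are hypothetical, and `sha256sum` (GNU coreutils) is assumed to be available.

```bash
# Concept sketch only: bind an approval to the exact config + state it saw.
change_digest() (
  # Hash both files together; any edit to either changes the digest.
  cat "$1" "$2" | sha256sum | cut -d' ' -f1
)

record_approval() (
  # record_approval <config> <state> <approval-file>
  change_digest "$1" "$2" > "$3"
)

check_approval() (
  # check_approval <config> <state> <approval-file>
  # Prints "valid" if nothing moved since approval, else "approval_stale".
  if [ "$(change_digest "$1" "$2")" = "$(cat "$3")" ]; then
    echo valid
  else
    echo approval_stale
  fi
)
```

Editing either file after `record_approval` makes `check_approval` report `approval_stale`, mirroring the warning described above.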
## 5. When things go wrong
**Crashes are designed for.** Every graph-moving operation (create, schema
apply, delete) writes a recovery sidecar before acting. If an apply dies
mid-run, the next state-mutating command sweeps the sidecars and reconciles
— rolling the ledger forward when the operation completed on the graph,
retiring stale intent when nothing moved, and flagging anything it cannot
verify. You generally fix a crashed run by **running `cluster apply`
again**.

**A held lock** (a crashed process left `__cluster/lock.json`):
```bash
omnigraph cluster status --config company-brain # shows the lock holder + id
omnigraph cluster force-unlock <LOCK_ID> --config company-brain
```
Force-unlock requires the exact lock id (from status) — there is no blind
unlock.

**A lost or corrupted state ledger**: the cluster is self-describing.
`cluster import` rebuilds `state.json` from the config plus read-only
observation of the live graphs; the next `apply` re-converges onto the same
content-addressed catalog.

**A server that refuses to boot** with `--cluster` is telling you the
applied revision is not safely servable. Each refusal names its remedy:

| Boot error | Meaning | Remedy |
|---|---|---|
| `cluster_state_missing` | no ledger | `cluster import`, then `apply` |
| `cluster_recovery_pending` | interrupted operation awaiting sweep | run `cluster apply` (or any state-mutating command), restart |
| `catalog_payload_missing` / `…_digest_mismatch` | catalog blob lost or tampered | `cluster refresh`, then `apply`, restart |
| `policy_bindings_missing` | ledger predates binding metadata | re-run `cluster apply` (backfills), restart |
| `cluster_empty` | applied revision has no graphs | apply a cluster with ≥1 graph |
| multiple bundles bind one scope | serving holds one policy bundle per graph + one server-level | split or merge bundles |

A held *state lock* is deliberately **not** a boot error: the server reads
the atomically-replaced ledger without locking, so serving never contends
with an in-flight apply.
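
For runbooks or wrapper scripts, the table above can be encoded as a lookup. This is a minimal sketch: `boot_remedy` is a hypothetical helper that takes an error code as its argument, and how you extract that code from the server's output is left open.

```bash
# Sketch: map a boot error code from the table above to its remedy.
boot_remedy() {
  case "$1" in
    cluster_state_missing)
      echo "cluster import, then cluster apply" ;;
    cluster_recovery_pending)
      echo "cluster apply (or any state-mutating command), then restart" ;;
    catalog_payload_missing|*_digest_mismatch)
      echo "cluster refresh, then cluster apply, then restart" ;;
    policy_bindings_missing)
      echo "re-run cluster apply (backfills bindings), then restart" ;;
    cluster_empty)
      echo "apply a cluster with at least one graph" ;;
    *)
      echo "unrecognized code: consult the reference" >&2
      return 1 ;;
  esac
}
```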
## 6. Deployment patterns
- **Replicas**: any number of `--cluster` servers can serve the same config
directory; boot is read-only. Roll out a change by `apply` once, then
restarting replicas (serving is static per process — there is no hot
reload yet). Container/cloud recipes (AWS ECS+EFS, Railway volumes):
[deployment.md](../deployment.md#cluster-mode-in-containers-aws-railway).
- **The directory is the deployable unit**: config, catalog, ledger,
approvals, and graph data all live under it. Back it up as a whole;
version the *config files* (not `__cluster/` or `graphs/`) in git.
- **CI-driven convergence**: `validate` and `plan --json` are read-only and
safe in pipelines; gate `apply --as ci` on plan review. Approvals are the
human step by design — keep `cluster approve` out of automation.
- **`~/.omnigraph/config.yaml` is the per-operator config**: your
`operator.actor` default for `--as`, named servers/clusters, credentials,
profiles, and data-plane ergonomics (address a cluster graph by its derived
root like `company-brain/graphs/knowledge.omni` with `--store` for loads). The
cluster directory's `cluster.yaml` is the **sole deployment declaration** — the
server boots from the cluster only.
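
The CI-driven pattern above might look like the following sketch. `ci_converge` and the `PLAN_REVIEWED` variable are hypothetical names for this example (your pipeline's review gate will differ); the `omnigraph` invocations use only the subcommands and flags shown in this guide.

```bash
# Sketch of a CI job: read-only checks always run; apply runs only after a
# human has reviewed the saved plan. PLAN_REVIEWED is a made-up gate.
ci_converge() (
  config_dir="$1"
  plan_file="$2"
  omnigraph cluster validate --config "$config_dir" || exit 1
  omnigraph cluster plan --config "$config_dir" --json > "$plan_file" || exit 1
  if [ "${PLAN_REVIEWED:-no}" != "yes" ]; then
    echo "plan saved to $plan_file; re-run with PLAN_REVIEWED=yes after review" >&2
    exit 0
  fi
  omnigraph cluster apply --config "$config_dir" --as ci
)

# Usage (first run saves the plan, second run applies it):
#   ci_converge company-brain plan.json
#   PLAN_REVIEWED=yes ci_converge company-brain plan.json
```

`cluster approve` is deliberately absent: a blocked delete stays blocked until a person approves it.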
## 7. Maintaining a cluster graph
Storage maintenance (`optimize` / `repair` / `cleanup`) is **not** a control-plane
operation — it runs out-of-band, with direct storage access, against the graph's
roots. Address a cluster graph by name instead of hand-typing its storage path:
```bash
omnigraph optimize --cluster ./company-brain --graph knowledge
omnigraph cleanup --cluster ./company-brain --graph knowledge --keep 10 --confirm
# --cluster also takes the storage-root URI directly (config-free), and a
# `clusters:` name from ~/.omnigraph/config.yaml:
omnigraph optimize --cluster s3://bucket/clusters/company-brain --graph knowledge
```
The graph's storage URI is resolved from the **served cluster state** (the same
truth a `--cluster` server boots from); a graph that hasn't been applied yet is
not resolvable. Run these from a host with storage access — there are no server
routes for them. Conversely, **`init` refuses** a cluster-managed path: graphs in
a cluster are created by `cluster apply`, not by hand.

If the cluster has exactly **one** applied graph you can omit `--graph`; it is
used automatically. With **several**, omitting `--graph` errors and lists the
candidates (RFC-011 D7); it never picks one for you.
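
The `--graph` defaulting rule amounts to the following sketch. `select_graph` is a hypothetical helper written for illustration, not the CLI's implementation; it takes the requested graph (possibly empty) followed by the names of the applied graphs, and it does not check that an explicit request is among them.

```bash
# Sketch: resolve which graph a maintenance command targets.
# Usage: select_graph "<requested-or-empty>" <applied-graph>...
select_graph() (
  requested="$1"
  shift
  if [ -n "$requested" ]; then
    echo "$requested"              # an explicit --graph is used as given
    exit 0
  fi
  if [ "$#" -eq 1 ]; then
    echo "$1"                      # exactly one applied graph: use it
    exit 0
  fi
  # Zero or several graphs: never guess, list the candidates instead.
  echo "error: --graph is required; candidates: $*" >&2
  exit 1
)
```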
Against an **`s3://`-backed cluster** the resolved graph storage is non-local, so a
destructive `cleanup` additionally requires **`--yes`** (an interactive prompt
otherwise, refusal without a TTY) on top of `--confirm` — see [cli-reference.md](../cli/reference.md)'s
*Write diagnostics & destructive confirmation*. Every maintenance run also echoes
its resolved target to stderr (suppress with `--quiet`).
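
The extra confirmation for non-local storage reduces to a small decision table. The sketch below assumes `--confirm` was already passed and treats only `s3://` URIs as non-local (other schemes may count too); `cleanup_gate` is a hypothetical helper, not a CLI command.

```bash
# Sketch: decide what a destructive cleanup does once --confirm is given.
# Usage: cleanup_gate <storage-uri> <has_yes: yes|no> <has_tty: yes|no>
cleanup_gate() (
  storage="$1"
  has_yes="$2"
  has_tty="$3"
  case "$storage" in
    s3://*) ;;                        # non-local: the extra gate applies
    *) echo proceed; exit 0 ;;        # local storage: --confirm is enough
  esac
  if [ "$has_yes" = yes ]; then
    echo proceed                      # --yes given
  elif [ "$has_tty" = yes ]; then
    echo prompt                       # interactive confirmation
  else
    echo refuse                       # no TTY and no --yes
  fi
)
```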
## What the control plane does not do (yet)
- **No hot reload** — applied changes serve on the next restart.
- **No data operations** — rows move through `omnigraph load / ingest /
mutate` against the graph roots, with branches and merges as usual.
- **Stored-query exposure is all-or-nothing per cluster** — every applied
query is listed and invokable (subject to Cedar `invoke_query`); per-query
exposure policy is a planned phase.
- **Pipelines (ETL)** are a separate project; the `pipelines:` key is
  reserved and rejected loudly.

For the full reference (every key, flag, status, disposition, and
diagnostic), see [cluster-config.md](config.md).