# Operating an OmniGraph Cluster

This is the operator's guide to the cluster control plane: how to go from an
empty directory to a served deployment, and how to run it day to day —
evolving schemas, rotating queries and policies, healing drift, approving
destructive changes, and recovering from crashes.

It is a **how-to**. The reference for every `cluster.yaml` key, command flag,
state-file field, and diagnostic code is
[cluster-config.md](cluster-config.md); the HTTP surface is
[server.md](server.md).

## The model in one paragraph

You declare the entire deployment — graphs, schemas, stored queries, Cedar
policies — as files in one directory (`cluster.yaml` plus the `.pg`/`.gq`/
`.yaml` files it references). `cluster apply` converges reality to that
declaration and records what it did in a state ledger
(`__cluster/state.json`); `cluster plan` previews exactly what apply would
do, including real schema-migration steps. A server started with
`omnigraph-server --cluster <dir>` serves what was applied — never what is
merely written in config. Terraform users will recognize the shape: config
is desired state, the ledger is recorded state, plan is the diff, apply is
the only thing that changes the world, and irreversible changes require an
explicitly recorded approval.

## 1. Deploy a cluster from zero

Lay out a config directory:

```
company-brain/
├── cluster.yaml
├── people.pg              # schema for the "knowledge" graph
├── queries/               # stored queries — the .gq files ARE the declaration
│   └── people.gq
└── base.policy.yaml       # a Cedar policy bundle
```

```yaml
# cluster.yaml
version: 1
# storage: s3://omnigraph-local/clusters/company-brain   # optional: put the
#   ledger, catalog, and graph data on object storage (default: this folder)
metadata:
  name: company-brain
graphs:
  knowledge:
    schema: people.pg
    queries: queries/          # every `query <name>` in queries/*.gq registers
policies:
  base:
    file: base.policy.yaml
    applies_to: [knowledge]    # graph-bound; use [cluster] for server-level
```

Bring it to life:

```bash
omnigraph cluster validate --config company-brain   # parse + typecheck everything
omnigraph cluster import   --config company-brain   # create the state ledger
omnigraph cluster plan     --config company-brain   # preview: what would apply do?
omnigraph cluster apply    --config company-brain   # converge
```

That single `apply` **creates the graph** (at the derived root
`company-brain/graphs/knowledge.omni`), applies its schema, and publishes
the query and policy into the content-addressed catalog
(`__cluster/resources/…`). The output lists every change with its
disposition; `converged: true` means there is nothing left to do — re-running
`apply` is always safe and idempotent.

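If you script the first bring-up, it helps to stop at the first failing step so that `apply` never runs against a config that did not validate. A minimal sketch, assuming the four subcommands exit non-zero on failure; `bootstrap_cluster` is a hypothetical helper name, not part of the CLI:

```bash
# Sketch: first-time bring-up of a config directory, stopping at the first failure.
# `bootstrap_cluster` is a hypothetical helper; the subcommands are the ones shown above.
bootstrap_cluster() {
  config_dir=$1
  omnigraph cluster validate --config "$config_dir" &&
    omnigraph cluster import --config "$config_dir" &&
    omnigraph cluster plan --config "$config_dir" &&
    omnigraph cluster apply --config "$config_dir"
}

# Usage: bootstrap_cluster company-brain
```

The `&&` chain means a failed `validate` or `import` short-circuits everything after it, and the function's exit status is that of the step that failed.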
Load data through the normal graph plane (the control plane manages
*definitions*, not rows):

```bash
omnigraph load --data seed.jsonl company-brain/graphs/knowledge.omni
```

Serve it:

```bash
OMNIGRAPH_SERVER_BEARER_TOKENS_JSON='{"act-reader":"s3cret"}' \
  omnigraph-server --cluster company-brain --bind 0.0.0.0:8080
```

`--cluster` accepts either a **config directory** (the storage root resolves
through `cluster.yaml`'s `storage:` key) or a **storage-root URI directly**
(`--cluster s3://bucket/prefix`) — config-free serving: a serving box needs
only the URI and credentials, no checkout of the config repo. The ledger and
catalog on the bucket are the deployment artifact.

`--cluster` is an **exclusive boot source**: it cannot be combined with a
graph URI, `--target`, or `--config`, and `omnigraph.yaml` is never read in
this mode. Routing is always multi-graph:

```bash
curl -H 'authorization: Bearer s3cret' \
  -X POST http://localhost:8080/graphs/knowledge/queries/find_person \
  -H 'content-type: application/json' -d '{"params":{"name":"Ada"}}'
```

Bearer tokens and the bind address are deliberately *not* cluster facts —
they are per-replica, set by flag or environment
(see [server.md](server.md#modes) for the token sources).

## 2. The day-2 loop: edit → plan → apply → restart

Every change follows the same loop, whatever its kind:

```bash
$EDITOR company-brain/people.pg          # or any .gq / policy / cluster.yaml edit
omnigraph cluster plan  --config company-brain
omnigraph cluster apply --config company-brain --as andrew
# restart cluster-booted servers to pick it up
```

`--as <actor>` attributes the run: it is recorded in recovery sidecars and
audit entries and threaded into the engine's commit history. Set
`cli: { actor: <you> }` in your per-operator `omnigraph.yaml` to make it the
default when `--as` is omitted (the flag always wins; `approve` requires one
of the two).

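The precedence rule for the actor can be modelled in a few lines. This is only an illustration of the rule as described above, not the CLI's own code; `resolve_actor` is a hypothetical name:

```bash
# resolve_actor FLAG_VALUE CONFIG_DEFAULT
# Prints the actor a run would be attributed to: the --as value if given,
# otherwise the cli.actor default. Fails if neither is set, which is the
# case `approve` rejects.
resolve_actor() {
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  elif [ -n "$2" ]; then
    printf '%s\n' "$2"
  else
    echo "no actor: pass --as or set cli.actor" >&2
    return 1
  fi
}

resolve_actor andrew ci   # prints "andrew" (the flag wins)
resolve_actor "" ci       # prints "ci" (the config default)
```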
What each change kind does:

| You edit | Plan shows | Apply does |
|---|---|---|
| a `.gq` file or `queries:` entry | `Update query.<g>.<n>` | publishes the new content-addressed blob, updates the ledger |
| a policy file | `Update policy.<n>` | same — new blob, ledger update |
| a policy's `applies_to` | `Update policy.<n> [bindings]` | records the new bindings (the file digest is unchanged; bindings are first-class changes) |
| a `.pg` schema | `Update schema.<g>` **with the real migration steps embedded** | runs the engine's schema apply on the live graph — soft drops only, sidecar-fenced |
| `graphs:` gains an entry | `Create graph.<g>` (+ schema, queries) | initializes the graph at its derived root; dependents apply in the same run |
| `graphs:` loses an entry | `Delete graph.<g>` — **blocked, `approval_required`** | nothing, until approved (see §4) |

Two properties worth internalizing:

- **One apply, ordered correctly.** Creates run first, then schema
  migrations, then catalog writes, then (approved) deletes — so a schema
  change plus a query that uses the new field converge together in one run.
- **Soft drops only.** A removed schema property disappears from the current
  version while prior versions retain the data (reversible until `cleanup`).
  Data-loss migrations are not reachable from cluster apply.

Read the plan before applying when the change is non-trivial — for schema
updates it embeds the engine's actual migration plan (`add_property`,
`drop_property [soft]`, `unsupported: …`), so you see data impact before
anything runs.

## 3. Inspect: status, refresh, drift

```bash
omnigraph cluster status  --config company-brain --json   # ledger only, read-only
omnigraph cluster refresh --config company-brain          # re-observe live graphs
```

`status` never touches the graphs; `refresh` opens them read-only and
records what it finds — manifest versions, live schema digests, catalog blob
integrity. If someone changed a graph behind the control plane's back (a
direct `omnigraph schema apply`, a tampered catalog file), refresh marks the
resource **`drifted`**.

**Drift is converged, not just reported.** After a refresh records drift,
the next `plan` proposes migrating the live graph back to the declared
schema — with the steps visible, including the soft drops of out-of-band
fields — and `apply` executes it like any other change. If the out-of-band
change is the one you want, change the *config* to match instead, and apply
converges the ledger.

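To see what a drift correction would do before committing to it, you can chain the two steps. A small sketch; `preview_drift` is a hypothetical helper name, and it assumes `refresh` exits non-zero if it cannot observe the graphs:

```bash
# preview_drift CONFIG_DIR
# Re-observe the live graphs, then show what the next apply would do.
# Nothing is converged here: refresh records observations, plan only previews.
preview_drift() {
  omnigraph cluster refresh --config "$1" &&
    omnigraph cluster plan --config "$1"
}

# Usage: preview_drift company-brain
```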
## 4. Destructive changes: the approval gate

Removing a graph from `cluster.yaml` never executes silently:

```bash
omnigraph cluster apply --config company-brain
#   Delete graph.scratch [Blocked: approval_required]

omnigraph cluster approve graph.scratch --config company-brain --as andrew
#   cluster approve: delete graph.scratch approved by andrew (approval 01KT…)

omnigraph cluster apply --config company-brain --as andrew
#   Delete graph.scratch [Applied]      ← root removed, subtree tombstoned
```

The approval artifact (`__cluster/approvals/<id>.json`) is **digest-bound**:
it authorizes exactly the change you saw when you approved it. Any config or
state movement afterwards invalidates it automatically (`approval_stale`
warning) — a stale approval can never authorize a different delete. One
approval covers the graph's whole subtree (its schema and queries ride
along). Consumed artifacts are kept (rewritten with `consumed_at`) and
summarized in the ledger's `approval_records`, so the audit trail of *who
approved what* survives the loss of either store.

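The idea behind "digest-bound" can be shown with an ordinary hash. This is a conceptual sketch only: the strings being hashed and the `state-rev` label are invented for illustration, and the real approval artifact's fields and digest inputs are not shown here.

```bash
# Conceptual sketch: an approval stores the digest of the change as the
# approver saw it; any later movement changes the digest and the approval
# no longer matches. The hashed strings below are hypothetical.
digest() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum | cut -d' ' -f1
  else
    shasum -a 256 | cut -d' ' -f1
  fi
}

# At approval time: record the digest of the pending change.
approved=$(printf '%s' 'delete graph.scratch @ state-rev 7' | digest)

# Later: state has moved, so the same delete now hashes differently.
current=$(printf '%s' 'delete graph.scratch @ state-rev 8' | digest)

if [ "$approved" = "$current" ]; then
  echo "approval valid"
else
  echo "approval_stale"
fi
# prints "approval_stale"
```

Because the comparison is on digests rather than on resource names, an approval for one version of a delete cannot be replayed against a different one.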
## 5. When things go wrong

**Crashes are designed for.** Every graph-moving operation (create, schema
apply, delete) writes a recovery sidecar before acting. If an apply dies
mid-run, the next state-mutating command sweeps the sidecars and reconciles
— rolling the ledger forward when the operation completed on the graph,
retiring stale intent when nothing moved, and flagging anything it cannot
verify. You generally fix a crashed run by **running `cluster apply`
again**.

**A held lock** (a crashed process left `__cluster/lock.json`):

```bash
omnigraph cluster status --config company-brain        # shows the lock holder + id
omnigraph cluster force-unlock <LOCK_ID> --config company-brain
```

Force-unlock requires the exact lock id (from status) — there is no blind
unlock.

**A lost or corrupted state ledger**: the cluster is self-describing.
`cluster import` rebuilds `state.json` from the config plus read-only
observation of the live graphs; the next `apply` re-converges onto the same
content-addressed catalog.

**A server that refuses to boot** with `--cluster` is telling you the
applied revision is not safely servable. Each refusal names its remedy:

| Boot error | Meaning | Remedy |
|---|---|---|
| `cluster_state_missing` | no ledger | `cluster import`, then `apply` |
| `cluster_recovery_pending` | interrupted operation awaiting sweep | run `cluster apply` (or any state-mutating command), restart |
| `catalog_payload_missing` / `…_digest_mismatch` | catalog blob lost or tampered | `cluster refresh`, then `apply`, restart |
| `policy_bindings_missing` | ledger predates binding metadata | re-run `cluster apply` (backfills), restart |
| `cluster_empty` | applied revision has no graphs | apply a cluster with ≥1 graph |
| multiple bundles bind one scope | serving holds one policy bundle per graph + one server-level | split or merge bundles |

A held *state lock* is deliberately **not** a boot error — the server reads
the atomically-replaced ledger without locking, so serving never contends
with an in-flight apply.

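The boot-error table above lends itself to a small lookup in a runbook script. A sketch, with the codes and remedies taken from the table; `boot_remedy` is a hypothetical helper name, and the truncated digest-mismatch code is matched by suffix rather than spelled out:

```bash
# boot_remedy CODE
# Prints the remedy from the boot-error table for a given diagnostic code.
boot_remedy() {
  case $1 in
    cluster_state_missing)    echo "cluster import, then apply" ;;
    cluster_recovery_pending) echo "run cluster apply, then restart" ;;
    catalog_payload_missing | *_digest_mismatch)
                              echo "cluster refresh, then apply, then restart" ;;
    policy_bindings_missing)  echo "re-run cluster apply, then restart" ;;
    cluster_empty)            echo "apply a cluster with at least one graph" ;;
    *)                        echo "unknown code: see cluster-config.md" >&2
                              return 1 ;;
  esac
}

boot_remedy cluster_state_missing   # prints "cluster import, then apply"
```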
## 6. Deployment patterns

- **Replicas**: any number of `--cluster` servers can serve the same config
  directory; boot is read-only. Roll out a change by running `apply` once,
  then restarting replicas (serving is static per process — there is no hot
  reload yet). Container/cloud recipes (AWS ECS+EFS, Railway volumes):
  [deployment.md](deployment.md#cluster-mode-in-containers-aws-railway).
- **The directory is the deployable unit**: config, catalog, ledger,
  approvals, and graph data all live under it. Back it up as a whole;
  version the *config files* (not `__cluster/` or `graphs/`) in git.
- **CI-driven convergence**: `validate` and `plan --json` are read-only and
  safe in pipelines; gate `apply --as ci` on plan review. Approvals are the
  human step by design — keep `cluster approve` out of automation.
- **`omnigraph.yaml` still has a job**: per-operator settings — your
  `cli.actor` default for `--as`, CLI defaults, credentials, and data-plane
  ergonomics (point `graphs.<name>.uri` at a derived root like
  `company-brain/graphs/knowledge.omni` to use `--target <name>` for
  loads). It just no longer describes the deployment — a server boots from
  one source or the other, never a merge of both.

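The replica rollout from the list above (apply once, then restart each replica) might be scripted roughly as follows. How a replica is restarted depends entirely on your environment, so the restart command is passed in; `rollout`, `restart_replica`, and the replica names are all hypothetical:

```bash
# rollout CONFIG_DIR ACTOR RESTART_CMD REPLICA...
# Applies once, then restarts each replica in turn, stopping at the first failure.
# RESTART_CMD is whatever restarts one replica in your environment (an assumption;
# this guide does not prescribe one).
rollout() {
  config_dir=$1 actor=$2 restart_cmd=$3
  shift 3
  omnigraph cluster apply --config "$config_dir" --as "$actor" || return 1
  for replica in "$@"; do
    "$restart_cmd" "$replica" || return 1
  done
}

# Usage (hypothetical restart helper and replica names):
#   rollout company-brain ci restart_replica web-1 web-2
```

Applying before any restart matters because a restarted replica serves whatever revision is in the ledger at boot.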
## 7. Maintaining a cluster graph

Storage maintenance (`optimize` / `repair` / `cleanup`) is **not** a control-plane
operation — it runs out-of-band, with direct storage access, against the graph's
roots. Address a cluster graph by name instead of hand-typing its storage path:

```bash
omnigraph optimize --cluster ./company-brain --cluster-graph knowledge
omnigraph cleanup  --cluster ./company-brain --cluster-graph knowledge --keep 10 --confirm
# --cluster also takes the storage-root URI directly (config-free):
omnigraph optimize --cluster s3://bucket/clusters/company-brain --cluster-graph knowledge
```

The graph's storage URI is resolved from the **state ledger**
(`__cluster/state.json`): the graph must be in the applied revision, so a graph
that hasn't been applied yet is not resolvable and the command points you at
`cluster apply`. Resolution does not validate catalog payloads, so maintenance
(`repair` in particular) still works when unrelated catalog state is degraded.
Run these from a host with storage access; there are no server routes for them.
Conversely, **`init` refuses** a cluster-managed path: graphs in a cluster are
created by `cluster apply`, not by hand.

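The path layout behind both behaviors is `<root>/graphs/<id>.omni`. The sketch below shows the derivation and a shape-only test of the kind that can gate the `init` refusal without touching storage. It is an illustration, not the CLI's code: the function names are hypothetical, and a real check would also need to confirm that `<root>` actually holds `__cluster/state.json`.

```bash
# graph_root ROOT ID: derived storage root of a cluster graph.
graph_root() {
  printf '%s/graphs/%s.omni\n' "${1%/}" "$2"
}

# cluster_root_candidate URI: if URI has the <root>/graphs/<id>.omni shape,
# print <root>; otherwise fail. Pure string matching, no storage access.
cluster_root_candidate() {
  case $1 in
    */graphs/*.omni)
      id=${1##*/graphs/}
      case $id in */*) return 1 ;; esac   # <id> must be a single path segment
      printf '%s\n' "${1%/graphs/*}"
      ;;
    *) return 1 ;;
  esac
}

graph_root company-brain knowledge
# prints "company-brain/graphs/knowledge.omni"
cluster_root_candidate s3://bucket/clusters/company-brain/graphs/knowledge.omni
# prints "s3://bucket/clusters/company-brain"
```

A path that does not match the shape fails immediately, which is why an ordinary `init` target would incur no extra I/O under this scheme.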
## What the control plane does not do (yet)

- **No hot reload** — applied changes serve on the next restart.
- **No data operations** — rows move through `omnigraph load / ingest /
  mutate` against the graph roots, with branches and merges as usual.
- **Stored-query exposure is all-or-nothing per cluster** — every applied
  query is listed and invokable (subject to Cedar `invoke_query`); per-query
  exposure policy is a planned phase.
- **Pipelines (ETL)** are a separate project; the `pipelines:` key is
  reserved and rejected loudly.

For the full reference — every key, flag, status, disposition, and
diagnostic — see [cluster-config.md](cluster-config.md).