From 711865e6f117b0f0c1ff09eb32c60a36dc0e3f1f Mon Sep 17 00:00:00 2001
From: aaltshuler
Date: Wed, 10 Jun 2026 17:55:15 +0300
Subject: [PATCH] docs(cluster,server): the Phase 5 mode switch; retire applied-not-serving caveats
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The standing caveat ('applied means recorded in the cluster catalog —
nothing more; the server still boots from omnigraph.yaml') retires:
cluster docs gain the 'Serving from the cluster' section (exclusivity,
applied-revision serving, fail-fast readiness, restart-to-pick-up,
expose-all bridge), server.md gains mode-inference rule 0 and the
cluster-booted multi mode, deployment.md the boot-source choice, and the
CLI's apply note plus the cli-reference cluster row (stale back to
Stage 3A) now describe the full convergence surface. RFC-005 flips to
Landed with four implementation deviations recorded.

Co-Authored-By: Claude Fable 5
---
 crates/omnigraph-cli/src/main.rs        |  2 +-
 docs/dev/rfc-005-server-cluster-boot.md |  3 +-
 docs/dev/testing.md                     |  4 +--
 docs/user/cli-reference.md              |  2 +-
 docs/user/cluster-config.md             | 46 ++++++++++++++++++++++---
 docs/user/deployment.md                 |  8 +++++
 docs/user/server.md                     | 16 +++++++--
 7 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/crates/omnigraph-cli/src/main.rs b/crates/omnigraph-cli/src/main.rs
index 673adb7..da4f8e8 100644
--- a/crates/omnigraph-cli/src/main.rs
+++ b/crates/omnigraph-cli/src/main.rs
@@ -856,7 +856,7 @@ fn print_cluster_apply_human(output: &ApplyOutput) {
             " state: revision {}, converged: {}, written: {}",
             state.state_revision, output.converged, output.state_written
         );
-        println!(" note: applied = recorded in the cluster catalog; the server still boots from omnigraph.yaml");
+        println!(" note: cluster-booted servers (--cluster) serve this on their next restart; omnigraph.yaml deployments are unaffected");
     }
     print_cluster_diagnostics(&output.diagnostics);
 }
diff --git a/docs/dev/rfc-005-server-cluster-boot.md b/docs/dev/rfc-005-server-cluster-boot.md
index 81d5129..85df875 100644
--- a/docs/dev/rfc-005-server-cluster-boot.md
+++ b/docs/dev/rfc-005-server-cluster-boot.md
@@ -1,6 +1,7 @@
 # RFC: Server Boots from Cluster State — Phase 5 of the Cluster Control Plane
 
-**Status:** Proposed
+**Status:** Landed (5A policy bindings #175; 5B/5C the `--cluster` boot mode — one PR)
+**Implementation deviations:** (1) cluster mode reuses `ServerConfigMode::Multi` (a new settings *source*, not a new enum variant; `config_path` carries the cluster dir). (2) Stored queries load via `QueryRegistry::from_specs` from verified blob *content*, not blob paths. (3) More than one policy bundle binding a single scope is a boot error (the serving pipeline holds one bundle per graph + one server-level; stacking is a later slice). (4) `GET /graphs` keeps its closed-by-default contract — without a cluster-bound bundle there is no server-level Cedar engine, so enumeration refuses.
 **Date:** 2026-06-10
 **Builds on:** Phase 4 complete ([rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md), Landed): `cluster apply` converges graphs, schemas, stored queries, and policies into the cluster catalog. Normative context: [cluster-config-specs.md](cluster-config-specs.md) (the migration model's "window 2"), [cluster-axioms.md](cluster-axioms.md) (axiom 15), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) (Phase 5 rollout, Compatibility Stance #7–#9, exit criterion 7).
 **Target release:** unversioned (phased — see Sequencing).
diff --git a/docs/dev/testing.md b/docs/dev/testing.md
index a7a6cb3..eba70c9 100644
--- a/docs/dev/testing.md
+++ b/docs/dev/testing.md
@@ -8,8 +8,8 @@ This file is the always-on map of the test surface. **Consult it before every ta
 |---|---|---|
 | `omnigraph` (engine) | `crates/omnigraph/tests/` | Integration tests (21 files), fixture-driven, share `tests/helpers/mod.rs` |
 | `omnigraph-cli` | `crates/omnigraph-cli/tests/` | `cli.rs` (unit-ish; includes the `cluster_e2e_*` lifecycle compositions over the spawned binary — lost-state re-import recovery, out-of-band drift, graph-root destruction, multi-graph mixed-disposition convergence), `system_local.rs`, `system_remote.rs`, share `tests/support/mod.rs` |
-| `omnigraph-cluster` | mostly in-source `#[cfg(test)] mod tests`; `tests/failpoints.rs` (feature-gated) | Cluster config parser, local JSON state diff, state CAS/lock handling/recovery, read-only validate/plan/status plus explicit refresh/import graph observations, config-only apply (content-addressed payload publish, disposition gating, composite-digest convergence, idempotent re-apply), catalog payload verification (status read-only, refresh drift + self-heal), failpoint crash-mid-apply / CAS-race coverage, Stage 4A graph creation (create executor, recovery sidecars + sweep rows, create crash windows), Stage 4B schema apply (migration previews in plan, schema executor, schema-apply sweep classification, schema crash windows), Stage 4C gated deletes (digest-bound approvals, delete executor + tombstones, delete sweep rows, delete crash windows), and 5A policy binding metadata (applies_to in the applied revision, binding-change diffing + convergence, pre-5A backfill) |
-| `omnigraph-server` | `crates/omnigraph-server/tests/` | `server.rs` (HTTP-level), `openapi.rs` (OpenAPI drift / regeneration) |
+| `omnigraph-cluster` | mostly in-source `#[cfg(test)] mod tests`; `tests/failpoints.rs` (feature-gated) | Cluster config parser, local JSON state diff, state CAS/lock handling/recovery, read-only validate/plan/status plus explicit refresh/import graph observations, config-only apply (content-addressed payload publish, disposition gating, composite-digest convergence, idempotent re-apply), catalog payload verification (status read-only, refresh drift + self-heal), failpoint crash-mid-apply / CAS-race coverage, Stage 4A graph creation (create executor, recovery sidecars + sweep rows, create crash windows), Stage 4B schema apply (migration previews in plan, schema executor, schema-apply sweep classification, schema crash windows), Stage 4C gated deletes (digest-bound approvals, delete executor + tombstones, delete sweep rows, delete crash windows), and 5A policy binding metadata (applies_to in the applied revision, binding-change diffing + convergence, pre-5A backfill), and the 5B serving-snapshot read API (converged read, refusal rows) |
+| `omnigraph-server` | `crates/omnigraph-server/tests/` | `server.rs` (HTTP-level; incl. cluster-mode boot — converged-dir serving, policy binding wiring, boot refusals), `openapi.rs` (OpenAPI drift / regeneration) |
 | `omnigraph-compiler` | mostly in-source `#[cfg(test)] mod tests` | Parser, type-checker, IR lowering, lint |
 
 The engine's `tests/` is the principal coverage surface; most graph-shaped behavior is exercised there.
diff --git a/docs/user/cli-reference.md b/docs/user/cli-reference.md
index 9dc8a25..6d864cc 100644
--- a/docs/user/cli-reference.md
+++ b/docs/user/cli-reference.md
@@ -19,7 +19,7 @@ Top-level command families and subcommands. Graph-targeting commands accept eith
 | `commit list \| show` | inspect commit graph |
 | `schema plan \| apply \| show (alias: get)` | migrations |
 | `lint` (alias: `check`) | offline / graph-backed query validation. Replaces `query lint` / `query check`, which are kept as deprecated argv-level shims that print a one-line warning and rewrite to `omnigraph lint` |
-| `cluster validate \| plan \| apply \| approve \| status \| refresh \| import \| force-unlock` | cluster-control preview. `validate` checks a local `cluster.yaml` folder and referenced schema/query/policy files; `plan` diffs it against local JSON state at `__cluster/state.json` and annotates each change with its apply disposition; `apply` executes the config-only (stored-query/policy) subset into the content-addressed local catalog under `__cluster/resources/` — graph/schema changes are deferred loudly, and nothing applied serves traffic (the server still boots from `omnigraph.yaml`); `status` reads the state ledger; `refresh`/`import` explicitly update local JSON state from read-only graph observations; `force-unlock ` manually removes a held local state lock by exact id. No graph-manifest movement, server change, automatic stale-lock breaking, or `plan --refresh` occurs in Stage 3A |
+| `cluster validate \| plan \| apply \| approve \| status \| refresh \| import \| force-unlock` | declarative cluster control plane. `validate` checks a local `cluster.yaml` folder and referenced schema/query/policy files; `plan` diffs it against local JSON state at `__cluster/state.json`, annotates dispositions, and embeds real schema-migration previews; `apply` converges the cluster — stored-query/policy catalog writes (content-addressed under `__cluster/resources/`), graph creates, schema updates (soft drops only; `--as` records the actor), and graph deletes behind a digest-bound approval from `cluster approve --as `; what apply converges is what an `omnigraph-server --cluster ` deployment serves on its next restart (omnigraph.yaml deployments are unaffected); `status` reads the state ledger; `refresh`/`import` explicitly update local JSON state from read-only graph observations; `force-unlock ` manually removes a held local state lock by exact id |
 | `optimize` | non-destructive Lance compaction (skips tables with `Blob` columns or uncovered drift; `--json` reports `skipped`) |
 | `repair [--confirm] [--force]` | preview or explicitly publish uncovered manifest/head drift. `--confirm` heals verified maintenance drift and exits non-zero if suspicious/unverifiable drift is refused; `--force --confirm` publishes suspicious/unverifiable drift after operator review |
 | `cleanup --keep N --older-than 7d --confirm` | destructive version GC |
diff --git a/docs/user/cluster-config.md b/docs/user/cluster-config.md
index 284bfbf..5c51b1f 100644
--- a/docs/user/cluster-config.md
+++ b/docs/user/cluster-config.md
@@ -1,6 +1,6 @@
 # Cluster Config
 
-**Status:** Stage 4C — Phase 4 complete (graph create, schema apply, gated graph delete).
+**Status:** Phase 5 — cluster-booted serving (`omnigraph-server --cluster`).
 
 Cluster config is the future control-plane configuration surface for a whole
 OmniGraph deployment. In this stage, OmniGraph can validate a local
@@ -190,10 +190,12 @@ Deletes remove the resource from state; their old payload blobs stay on disk
 (garbage collection is a later stage). Re-running a converged apply is a no-op:
 no state write, no revision change (`state_written: false`).
 
-**Applied means recorded in the cluster catalog — nothing more.** The server
-still boots from `omnigraph.yaml`; no query or policy applied here serves
-traffic until the server-boot stage ships, as an explicit per-deployment mode
-switch.
+**Applied means serving — for deployments that opt in.** A server started
+with `--cluster ` boots from the applied revision (see
+[Serving from the cluster](#serving-from-the-cluster-the-mode-switch)); it
+picks up newly applied state on its next restart. Deployments still booting
+from `omnigraph.yaml` are untouched: for them, applied means recorded in the
+catalog, nothing more.
 
 ### Graph creation
 
@@ -305,6 +307,40 @@ fully converges. The `graph.` composite digest is recomputed from state's
 own schema/query digests after each apply, so applied query changes converge
 without graph movement.
 
+## Serving from the cluster (the mode switch)
+
+```bash
+omnigraph-server --cluster ./company-brain --bind 0.0.0.0:8080
+```
+
+`--cluster ` is an **exclusive boot source** (axiom 15): it cannot
+combine with a graph URI, `--target`, or `--config`, and in this mode
+`omnigraph.yaml` is never read — not for graphs, not for queries, not for
+policies. The server serves the **applied revision**: graph roots recorded in
+`state.json`, stored-query and policy content from the content-addressed
+catalog at the applied digests (re-verified at boot), and policy bundles
+wired by their applied `applies_to` bindings — `cluster`-bound bundles become
+the server-level Cedar engine, graph-bound bundles attach per graph.
+Un-applied config drift never leaks into serving; `cluster plan` is where
+drift is visible. Routing is always multi-graph (`/graphs/{id}/...`). Bearer
+tokens and the bind address stay process-level (flags/env) — they are
+per-replica facts, not cluster facts.
+
+Boot is fail-fast: missing or unreadable state, pending recovery sidecars,
+missing/tampered catalog blobs, policy entries without binding metadata
+(pre-binding ledgers — re-run `cluster apply`), an empty graph set, more than
+one policy bundle binding a single scope (split or merge bundles; stacked
+scopes are a later stage), unopenable graph roots, and stored queries that no
+longer type-check all refuse startup with a remedy. A held state lock is
+*not* an error — boot reads the atomically-replaced state file without
+locking.
+
+Serving is static per process: the server reads the applied revision once at
+startup, so picking up newly applied state means restarting it. Stored
+queries are all listed in `GET /queries` in cluster mode (the cluster
+registry has no expose flag; exposure becomes a policy decision in a later
+phase).
+
 ## Status
 
 `cluster status` reads the same local JSON state ledger and prints what the
diff --git a/docs/user/deployment.md b/docs/user/deployment.md
index 9a4466c..328784f 100644
--- a/docs/user/deployment.md
+++ b/docs/user/deployment.md
@@ -13,6 +13,14 @@ Omnigraph supports two broad deployment shapes:
 
 The server binary and container image expose the same HTTP surface.
 
+The server also has two **boot sources**: `omnigraph.yaml` (graph targets
+declared in the per-operator config) or a **cluster directory**
+(`omnigraph-server --cluster `), which serves the cluster control
+plane's applied revision — see
+[cluster-config.md](cluster-config.md#serving-from-the-cluster-the-mode-switch).
+The two are exclusive per deployment; switching is a restart with a different
+flag.
+
 ## Binary Deployment
 
 Build or install:
diff --git a/docs/user/server.md b/docs/user/server.md
index 67b5afe..60988ca 100644
--- a/docs/user/server.md
+++ b/docs/user/server.md
@@ -1,6 +1,6 @@
 # HTTP Server (`omnigraph-server`)
 
-Axum 0.8 + tokio + utoipa-generated OpenAPI. **Two modes** (v0.6.0+): single-graph (legacy) and multi-graph (MR-668). Mode is inferred from CLI args + config shape.
+Axum 0.8 + tokio + utoipa-generated OpenAPI. **Two modes** (v0.6.0+): single-graph (legacy) and multi-graph (MR-668), with **two boot sources** for multi mode: `omnigraph.yaml` or — exclusively — a cluster directory (`--cluster`, RFC-005). Mode is inferred from CLI args + config shape.
 
 ## Modes
 
@@ -14,8 +14,20 @@ Axum 0.8 + tokio + utoipa-generated OpenAPI. **Two modes** (v0.6.0+): single-gra
 
 `omnigraph-server --config omnigraph.yaml` with a non-empty `graphs:` map and **no** single-mode selector (no `server.graph`, no ``, no `--target`). The server opens every configured graph in parallel at startup (bounded concurrency = 4, fail-fast on the first open error). Routes are nested under `/graphs/{graph_id}/...`. Bare flat paths return 404 in multi mode.
 
-Mode inference (four-rule matrix):
+### Cluster-booted multi mode (Phase 5)
 
+`omnigraph-server --cluster ` boots from the cluster catalog's **applied
+revision** (`state.json` + content-addressed blobs) instead of
+`omnigraph.yaml` — an exclusive boot source: combining it with ``,
+`--target`, or `--config` is a startup error, and `omnigraph.yaml` is never
+read in this mode. Always multi-graph routing. See
+[cluster-config.md](cluster-config.md#serving-from-the-cluster-the-mode-switch)
+for what is read and the fail-fast readiness rules. `--bind`,
+`--unauthenticated`, and the bearer-token env vars work identically.
+
+Mode inference:
+
+0. CLI `--cluster ` → **multi, cluster-booted** (exclusive)
 1. CLI positional `` → single
 2. CLI `--target ` → single
 3. `server.graph` in config → single