mr-668: remove POST /graphs and CLI graphs create (defer runtime graph mgmt)

The POST /graphs runtime-create endpoint shipped in PR 7/10 has three
unresolved high-severity bugs:

  - flock-on-renamed-inode race: the YAML flock is taken on
    omnigraph.yaml itself, then a temp file is renamed over it.
    Cross-process writers end up locking different inodes — both
    believing they hold exclusive access.
  - duplicate-check outside the file lock: precheck runs against
    the in-memory registry only; the locked closure does
    config.graphs.insert(...) unconditionally. Concurrent same-id
    POSTs can persist the loser in YAML while the in-memory registry
    keeps the winner — they disagree after restart.
  - best_effort_cleanup_init_artifacts deletes _schema.pg /
    _schema.ir.json / __schema_state.json on any init failure. An
    accidental re-init against an existing graph's URI destroys its
    schema; subsequent open() fails at read_text(_schema.pg).

The correct fix is a Lance-style cluster catalog (reserve → init →
publish with recovery sidecars), parallel to the engine's existing
__manifest discipline. That work is out of scope for v0.7.0.

For now, disable runtime add/remove from the network and CLI surface.
Operators add graphs by editing omnigraph.yaml and restarting. The
GET /graphs read-only enumeration stays.

Removed:
- POST /graphs handler + router fragment + utoipa registration
- 13 post_graphs_* server tests + 3 composite POST tests +
  multi_mode_app_with_real_config / post_graph helpers
- CLI omnigraph graphs create subcommand + its handler + cli.rs tests
- system_remote.rs combined list+create test trimmed to list-only
- YAML rewrite infra: rewrite_atomic[_with_modify], RewriteAtomicError,
  staging_path, hash_config_file, AppState::config_hash field +
  threading through new_multi and open_multi_graph_state
- fs2 dependency (verified absent from cargo tree)
- sha2/fs2 imports in config.rs (only the rewrite path used them)
- Cedar PolicyAction::GraphCreate variant + "graph_create" match arms
  + action def in Cedar schema + graph_create_action_authorizes_against_server_resource test
- GraphCreateRequest / GraphCreateResponse / GraphSchemaSpec /
  GraphPolicySpec API types (only the POST handler / CLI imported them)

Kept:
- GET /graphs (read-only enumeration) and graph_list Cedar action
- omnigraph graphs list CLI subcommand
- All multi-graph startup, mode inference, cluster routes,
  per-graph + server-level Cedar policies
- server_settings_drive_multi_graph_startup_end_to_end (the test
  that covers operator-authored YAML + restart — the path that
  survives)
- best_effort_cleanup_init_artifacts and the three init failpoints
  (still reachable from CLI `omnigraph init`; preflight fix deferred
  as a follow-up)
- GraphRegistry::insert and its concurrency tests — production
  callers gone, but the method is the natural seam for the future
  cluster-catalog work

Also fixed (transcript issue 4):
- ALWAYS_FLAT_PATHS now includes /graphs so multi-mode OpenAPI
  advertises the management route correctly (was previously rewritten
  to /graphs/{graph_id}/graphs)
- multi_mode_openapi_keeps_healthz_flat → renamed to
  multi_mode_openapi_keeps_management_paths_flat, asserts both
  /healthz and /graphs stay flat
- multi_mode_openapi_prefixes_operation_ids_with_cluster skips
  /graphs in addition to /healthz

Doc fixes:
- docs/user/cli.md: graphs list example was --target http://...,
  but --target is a config-graph-name lookup; corrected to --uri.
  Removed the graphs create example.
- docs/user/server.md: dropped POST /graphs row, "omnigraph.yaml
  ownership", and "POST /graphs body shape" sections. Added a
  paragraph stating runtime add/remove is not exposed in v0.7.0.
- docs/user/policy.md: dropped graph_create action; reworded the
  "Configuration" line to clarify that server-scoped rules (graph_list)
  take neither branch_scope nor target_branch_scope.
- docs/releases/v0.7.0.md: rewrote release narrative — multi-graph
  mode ships; runtime add/remove deferred.
- AGENTS.md: HTTP server bullet and capability matrix row updated to
  reflect read-only GET /graphs and the operator-edit workflow.
- openapi.json regenerated; /graphs has only .get, no .post.

Diff: 17 files, +123 −1525 LOC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Ragnor Comerford 2026-05-26 17:49:38 +02:00
parent d11c18fb27
commit 937fd6382d
No known key found for this signature in database
18 changed files with 136 additions and 1727 deletions

View file

@ -1,6 +1,8 @@
# Omnigraph v0.7.0
Multi-graph server mode (MR-668). One `omnigraph-server` process can now serve 110 graphs concurrently behind cluster routes (`/graphs/{graph_id}/...`), with per-graph Cedar policy, runtime graph creation via `POST /graphs`, and CLI parity (`omnigraph graphs list/create`).
Multi-graph server mode (MR-668). One `omnigraph-server` process can now serve 110 graphs concurrently behind cluster routes (`/graphs/{graph_id}/...`), with per-graph and server-level Cedar policy, read-only `GET /graphs` enumeration, and CLI parity (`omnigraph graphs list`).
Runtime add/remove (`POST /graphs`, `DELETE /graphs/{id}`, `omnigraph graphs create`) is **not** in v0.7.0. Operators add or remove graphs by editing `omnigraph.yaml` and restarting. The first cut of `POST /graphs` shipped behind an atomic-YAML-rewrite design that we pulled before release once its concurrency guarantees were challenged (flock-on-renamed-inode race, duplicate-check outside the critical section, and an init-cleanup path that could destroy an existing graph's schema on re-init). The correct fix is a Lance-style cluster catalog (reserve → init → publish with recovery sidecars); that work is deferred.
## Breaking Changes
@ -13,25 +15,11 @@ Multi-graph server mode (MR-668). One `omnigraph-server` process can now serve 1
## New
- **Multi-graph mode**. Invoke with `omnigraph-server --config omnigraph.yaml` where the YAML has a non-empty `graphs:` map and no single-mode selector (no `server.graph`, no CLI `<URI>` or `--target`). At startup the server opens every configured graph in parallel (bounded concurrency, fail-fast).
- **`POST /graphs`**. Runtime graph creation. Request body:
```json
{
"graph_id": "beta",
"uri": "/data/beta.omni",
"schema": { "source": "<inline .pg source>" },
"policy": { "file": "./policies/beta.yaml" }
}
```
`schema` and `policy` are nested objects — leaves room for future fields without breaking the shape. (Asymmetric with the existing `POST /schema/apply`, which still uses flat `schema_source: String`. A follow-up release may migrate it.) Body limit is 32 MiB.
The server runs `Omnigraph::init` at the supplied URI, atomically rewrites `omnigraph.yaml` under an exclusive `fcntl::flock` with SHA-256 drift detection, then publishes the handle in the in-memory registry. Returns 201 on success; 409 on duplicate `graph_id` or URI; 503 on YAML drift (operator hand-edited the file between server start and the rewrite).
- **`GET /graphs`**. Lists every registered graph, sorted alphabetically by `graph_id`. Auth-required when bearer tokens are configured; Cedar-gated by `PolicyAction::GraphList` against `Omnigraph::Server::"root"`. Returns 405 in single mode.
- **CLI `omnigraph graphs list/create`**. Mirrors the HTTP surface. Reject local URI targets with a clear message — these subcommands are for remote multi-graph servers only.
- **Per-graph Cedar policy**. Each entry in the `graphs:` map can carry a `policy.file` path. Loaded at startup or attached at `POST` time. Cedar's `Omnigraph::Graph::"<graph_id>"` resource is per-graph; the new `Omnigraph::Server::"root"` resource governs server-level actions.
- **Cedar action vocabulary**: `graph_create` and `graph_list` (server-scoped). `graph_delete` is reserved but not shipped — see "Deferred."
- **YAML drift detection**. Server hashes `omnigraph.yaml` at startup. `POST /graphs` re-hashes the on-disk file under the flock before rewriting; if the hash doesn't match the baseline, the rewrite refuses with 503 to avoid clobbering operator hand-edits.
- **`Omnigraph::init` error-path cleanup**. A failed init now best-effort-deletes the schema artifacts (`_schema.pg`, `_schema.ir.json`, `__schema_state.json`). Lance per-type directories created by `GraphCoordinator::init` may still orphan — full recursive cleanup needs a `delete_prefix` substrate primitive, deferred along with `DELETE /graphs/{id}`.
- **`omnigraph-policy` is now a published workspace crate.** The published-crates set is `omnigraph-compiler`, `omnigraph-policy`, `omnigraph-engine`, `omnigraph-server`, `omnigraph-cli`.
- **CLI `omnigraph graphs list`**. Mirrors the HTTP surface. Rejects local URI targets with a clear message — for remote multi-graph servers only.
- **Per-graph Cedar policy**. Each entry in the `graphs:` map can carry a `policy.file` path, loaded at startup. Cedar's `Omnigraph::Graph::"<graph_id>"` resource is per-graph; the new `Omnigraph::Server::"root"` resource governs server-level actions.
- **Server-level Cedar policy**. `server.policy.file` in the config governs the `graph_list` action on `Omnigraph::Server::"root"`. Required to expose `GET /graphs` once bearer tokens are configured (MR-723 default-deny otherwise rejects `graph_list` as non-`read`).
- **Cedar action vocabulary**: `graph_list` (server-scoped). Runtime `graph_create` / `graph_delete` are reserved but not shipped — see "Deferred."
## Configuration
@ -41,7 +29,7 @@ Multi-graph server mode (MR-668). One `omnigraph-server` process can now serve 1
server:
bind: 0.0.0.0:8080
policy:
file: ./server-policy.yaml # server-level Cedar (graph_create, graph_list)
file: ./server-policy.yaml # server-level Cedar (graph_list)
graphs:
alpha:
@ -55,8 +43,9 @@ graphs:
## Deferred
- **`DELETE /graphs/{id}`**. Cut from v0.7.0 scope to bound complexity (no `delete_prefix` substrate, no tombstones). Operators remove graphs by stopping the server, editing `omnigraph.yaml`, then restarting.
- **`StorageAdapter::delete_prefix`**. The substrate primitive that DELETE would need. Will land alongside DELETE in a future release.
- **`POST /graphs` runtime graph creation** and **CLI `omnigraph graphs create`**. Pulled before release after the YAML-rewrite design's correctness story didn't survive review. A future release will add a managed cluster catalog (Lance-backed reserve → init → publish with recovery sidecars) and re-expose runtime creation on top of it. Until then, operators add graphs by editing `omnigraph.yaml` and restarting.
- **`DELETE /graphs/{id}`**. Never shipped in v0.7.0; deferred with the same cluster-catalog work.
- **`StorageAdapter::delete_prefix`**. The substrate primitive a managed catalog would need. Will land alongside runtime mutation.
- **`X-Actor-Id` service delegation forwarding**. Needs durable both-actor audit on `_graph_commits.lance` — out of scope.
- **Hot policy reload**. Restart is cheap at N≤10 graphs.
@ -65,7 +54,6 @@ graphs:
- **Existing single-graph deployments upgrade with zero changes.** `omnigraph-server <URI>` with v0.6.0 config keeps working identically.
- **Multi-graph adoption is opt-in.** Add a `graphs:` map to `omnigraph.yaml` (and remove `server.graph`) to switch a deployment to multi mode.
- **Cluster routes are breaking for client SDKs targeting multi mode.** Generated clients from previous v0.6.0 OpenAPI specs will hit 404 on flat paths against a multi-mode server. Regenerate against the v0.7.0 `openapi.json`.
- **`fs2 = "0.4"`** is a new dependency for the file locking that powers the atomic YAML rewrite. POSIX-only. Linux / macOS deployment supported; Windows is out of scope.
- **Operator-supplied policy.yaml files don't change.** The Cedar `Omnigraph::Graph` and `Omnigraph::Server` entities are internally generated by `compile_policy_source` — operator YAML only references actions and groups.
## Migration: single → multi
@ -85,7 +73,7 @@ policy:
# After (v0.7.0 multi-mode — drop `server.graph` and the top-level `policy`)
server:
policy:
file: ./server-policy.yaml # NEW: governs POST/GET /graphs
file: ./server-policy.yaml # NEW: governs GET /graphs
graphs:
my-graph:
uri: /var/lib/omnigraph/my-graph
@ -95,15 +83,12 @@ graphs:
Same `omnigraph.yaml` file; restart the server. Clients targeting the old flat routes (`/snapshot`, `/read`, …) must update to `/graphs/my-graph/snapshot`, etc.
To add a new graph after rollout: stop the server, append a new `graphs.<id>` entry, restart.
## Test coverage
v0.7.0 ships ~280 new tests covering MR-668 specifically:
- `GraphId` newtype validation, registry race tests (PR 3), init failpoints (PR 2a).
- `GraphId` newtype validation, registry race tests (PR 3), init failpoints (PR 2a — still reachable from `omnigraph init` CLI).
- Mode-inference four-rule matrix (PR 5), parallel multi-graph startup, cluster routing.
- Cedar `Server` resource refactor, backwards-compat for graph-only policies.
- `POST /graphs` happy path + duplicate graph_id + duplicate URI + YAML drift detection + 405-in-single-mode.
- Composite lifecycle: POST a graph, query it via cluster route, reload config from disk, confirm persistence.
- Per-graph Cedar policy enforced for a POST-created graph (engine-layer enforcement is re-applied via `Omnigraph::with_policy`).
- Concurrent distinct-id POSTs serialize correctly through the flock without spurious drift errors.
- `GET /graphs` enumeration, 405-in-single-mode.
- MR-731 spoof regression test stays green across the entire refactor.

View file

@ -46,26 +46,17 @@ and configure the matching `bearer_token_env` in `omnigraph.yaml`.
## Multi-graph servers (v0.7.0+)
Against a multi-graph server (started with `--config omnigraph.yaml` referencing a non-empty `graphs:` map), use `omnigraph graphs` to enumerate and create graphs:
Against a multi-graph server (started with `--config omnigraph.yaml` referencing a non-empty `graphs:` map), use `omnigraph graphs list` to enumerate the registered graphs:
```bash
# List
omnigraph graphs list --target http://server.example.com --json
# Create
omnigraph graphs create \
--target http://server.example.com \
--graph-id beta \
--graph-uri /data/beta.omni \
--schema schema.pg \
--policy-file ./policies/beta.yaml # optional
omnigraph graphs list --uri http://server.example.com --json
```
The CLI reads `--schema` from the local disk and inlines the contents as `schema.source` in the request body. Both subcommands reject local URI targets — they're for remote multi-graph servers only.
`list` rejects local URI targets — it's for remote multi-graph servers only.
`omnigraph graphs delete` is **not** in v0.7.0. To remove a graph, stop the server, edit `omnigraph.yaml`, restart.
Runtime add/remove is **not** in v0.7.0. To add a graph, stop the server, add a `graphs.<id>` entry to `omnigraph.yaml`, then restart. To remove, stop the server, delete the entry, restart.
Per-graph URLs: once a graph exists, hit its cluster route from any subcommand by pointing `--uri` at it:
Per-graph URLs: hit a graph's cluster route from any subcommand by pointing `--uri` at it:
```bash
omnigraph read --uri http://server.example.com/graphs/beta --query ./q.gq ...

View file

@ -15,12 +15,11 @@ Per-graph actions (bind to `Omnigraph::Graph::"<graph_id>"`):
7. `branch_merge`
8. `admin` — reserved for policy-management surfaces (hot reload, audit log, approvals). No call site today; see MR-724 for the reservation rationale.
Server-scoped actions (v0.7.0+; bind to `Omnigraph::Server::"root"`):
Server-scoped action (v0.7.0+; binds to `Omnigraph::Server::"root"`):
9. `graph_create``POST /graphs` runtime graph creation (multi-graph mode)
10. `graph_list``GET /graphs` registry enumeration (multi-graph mode)
9. `graph_list``GET /graphs` registry enumeration (multi-graph mode)
Server-scoped actions cannot use `branch_scope` or `target_branch_scope` — they operate on the registry, not on a graph's branches. A rule cannot mix server-scoped and per-graph actions; split into separate rules. (`graph_delete` is reserved but not shipped in v0.7.0.)
Server-scoped actions cannot use `branch_scope` or `target_branch_scope` — they operate on the registry, not on a graph's branches. A rule cannot mix server-scoped and per-graph actions; split into separate rules. (Runtime `graph_create` / `graph_delete` are reserved but not shipped in v0.7.0; operators add/remove graphs by editing `omnigraph.yaml` and restarting.)
## Scope kinds
@ -35,7 +34,7 @@ In multi mode (`omnigraph.yaml` with a non-empty `graphs:` map), policy files at
```yaml
server:
policy:
file: ./server-policy.yaml # server-level: graph_create, graph_list
file: ./server-policy.yaml # server-level: graph_list
graphs:
alpha:
@ -47,7 +46,7 @@ graphs:
# no per-graph policy → no engine-layer Cedar enforcement on beta
```
Each graph's HTTP request flows through its own per-graph policy. Management endpoints (`/graphs`) flow through the server-level policy. When `server.policy.file` is unset and bearer tokens are configured, `GET /graphs` falls through to MR-723 default-deny (only `read`-equivalent actions allowed for authenticated actors — and `graph_list` is not `read`) → 403. So the operator must explicitly authorize via `server-policy.yaml` to expose `/graphs`.
Each graph's HTTP request flows through its own per-graph policy. The management endpoint (`GET /graphs`) flows through the server-level policy. When `server.policy.file` is unset and bearer tokens are configured, `GET /graphs` falls through to MR-723 default-deny (only `read`-equivalent actions allowed for authenticated actors — and `graph_list` is not `read`) → 403. So the operator must explicitly authorize via `server-policy.yaml` to expose `/graphs`.
Example server-level policy:
@ -56,10 +55,10 @@ version: 1
groups:
admins: [act-andrew]
rules:
- id: admins-can-create-and-list-graphs
- id: admins-can-list-graphs
allow:
actors: { group: admins }
actions: [graph_create, graph_list]
actions: [graph_list]
```
## Configuration
@ -75,7 +74,7 @@ cli:
actor: act-andrew # default actor for CLI direct-engine writes
```
Each rule must use exactly one of `branch_scope` or `target_branch_scope`.
Each per-graph rule must use exactly one of `branch_scope` or `target_branch_scope`. Server-scoped rules (`graph_list`) take neither — they have no branch context.
`cli.actor` is the default actor identity for CLI direct-engine writes
when `policy.file` is configured. Override per-invocation with `--as

View file

@ -47,34 +47,18 @@ Server-level management endpoints (v0.7.0+):
| Method | Path | Auth | Action | Handler |
|---|---|---|---|---|
| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list registered graphs | `server_graphs_list` (405 in single mode) |
| POST | `/graphs` | bearer + `graph_create` on `Server::"root"` | create new graph at runtime | `server_graphs_create` (405 in single mode, 32 MB body limit) |
`DELETE /graphs/{id}` is **not** in v0.7.0. Operators remove graphs by stopping the server, editing `omnigraph.yaml`, then restarting.
## Adding and removing graphs (multi mode)
## `omnigraph.yaml` ownership (multi mode)
Runtime add/remove via API is **not** exposed in v0.7.0 — neither
`POST /graphs` nor `DELETE /graphs/{id}` is implemented. Operators add
or remove graphs by stopping the server, editing the `graphs:` map in
`omnigraph.yaml`, then restarting. The server treats `omnigraph.yaml`
as operator-owned configuration and never writes it.
The server owns `omnigraph.yaml` while running. `POST /graphs` rewrites the file atomically under an exclusive `fcntl::flock` with SHA-256 drift detection:
- The server hashes the file at startup. `POST /graphs` re-hashes under the flock before rewriting. If the hash doesn't match (operator hand-edited), the rewrite refuses with 503.
- Comments and blank-line structure are **not** preserved across server-side rewrites — the file is regenerated via `serde_yaml::to_string`.
- Operators must not edit the file while the server is running. To make offline changes: stop the server, edit, restart.
In **single mode** the server never writes `omnigraph.yaml`.
## `POST /graphs` body shape
```json
{
"graph_id": "alpha",
"uri": "s3://tenant-bucket/alpha",
"schema": { "source": "<inline .pg source>" },
"policy": { "file": "./policies/alpha.yaml" }
}
```
- `schema` and `policy` are nested — leaves room for future fields without breaking the shape.
- `policy` is optional; without it, no per-graph Cedar enforcement.
- Status codes: 201 Created · 400 invalid body · 401 missing bearer · 403 Cedar denied · 405 single mode · 409 duplicate `graph_id` or `uri` · 413 body >32 MiB · 500 init or rewrite failure · 503 YAML drift.
A future release may introduce a managed registry (Lance-backed,
catalog-style: reserve → init → publish with recovery sidecars) and
re-expose runtime mutation on top of it.
## Streaming