omnigraph/docs/releases/v0.7.0.md
Ragnor Comerford 937fd6382d
mr-668: remove POST /graphs and CLI graphs create (defer runtime graph mgmt)
The POST /graphs runtime-create endpoint shipped in PR 7/10 has three
unresolved high-severity bugs:

  - flock-on-renamed-inode race: the YAML flock is taken on
    omnigraph.yaml itself, then a temp file is renamed over it.
    Cross-process writers end up locking different inodes — both
    believing they hold exclusive access.
  - duplicate-check outside the file lock: precheck runs against
    the in-memory registry only; the locked closure does
    config.graphs.insert(...) unconditionally. Concurrent same-id
    POSTs can persist the loser in YAML while the in-memory registry
    keeps the winner — they disagree after restart.
  - best_effort_cleanup_init_artifacts deletes _schema.pg /
    _schema.ir.json / __schema_state.json on any init failure. An
    accidental re-init against an existing graph's URI destroys its
    schema; subsequent open() fails at read_text(_schema.pg).
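The first bug can be shown with a minimal Rust sketch (Unix-only, standard library only) that replays the rewrite sequence and compares inode numbers. The sketch takes no flock itself. It only shows that after the rename, anyone who opens omnigraph.yaml gets a different inode from the one the first writer still holds, and flock locks on different files never conflict. The function name and the staging file name are illustrative. This is not the removed rewrite_atomic code.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Replays the pre-fix sequence (open target, stage new contents, rename over target)
/// and returns (inode held by the early writer, inode a later opener of the same path sees).
/// Unix-only; no flock is taken here, only the inode identity that decides whether two flocks conflict.
fn inodes_across_rename(dir: &Path) -> io::Result<(u64, u64)> {
    let target = dir.join("omnigraph.yaml");
    fs::write(&target, "graphs: {}\n")?;

    // Writer A opens the config; a flock it took would attach to this inode.
    let held = File::open(&target)?;
    let held_ino = held.metadata()?.ino();

    // Writer A stages the new contents and renames the staging file over the path.
    let staging = dir.join("omnigraph.yaml.staging");
    let mut staged = File::create(&staging)?;
    staged.write_all(b"graphs:\n  alpha:\n    uri: /tmp/alpha\n")?;
    staged.sync_all()?;
    fs::rename(&staging, &target)?;

    // Writer B opens the same path and lands on the staging file's inode, so its
    // flock would not conflict with the lock A still holds on the old inode.
    let late_ino = File::open(&target)?.metadata()?.ino();

    // `held` stayed open until here, so the old inode could not be reused meanwhile.
    drop(held);
    Ok((held_ino, late_ino))
}
```

Any lock scheme that has to survive an atomic rename needs to lock something other than the renamed file, such as a separate lock file or a catalog record.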

The correct fix is a Lance-style cluster catalog (reserve → init →
publish with recovery sidecars), parallel to the engine's existing
__manifest discipline. That work is out of scope for v0.7.0.
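The sketch below illustrates the reserve → init → publish shape in memory only. It assumes a single process and uses a Mutex-guarded map where the real design would use a durable Lance-backed catalog. It is not the planned design. It only illustrates two properties the removed code lacked. First, the duplicate check and the insert happen in one critical section. Second, recovery rolls back only entries that were reserved but never published. Catalog, EntryState, and all method names are hypothetical.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::sync::Mutex;

/// Lifecycle of a hypothetical catalog entry: reserve → init → publish.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EntryState {
    Reserved,
    Initialized,
    Published,
}

#[derive(Debug, PartialEq, Eq)]
enum CatalogError {
    AlreadyExists,
    NotFound,
    WrongState(EntryState),
}

/// In-memory stand-in; a durable catalog would persist each state change
/// (e.g. in a recovery sidecar) before acting on it.
#[derive(Default)]
struct Catalog {
    entries: Mutex<HashMap<String, EntryState>>,
}

impl Catalog {
    /// The duplicate check and the insert share one lock acquisition, so
    /// concurrent reservations of the same id cannot both succeed.
    fn reserve(&self, id: &str) -> Result<(), CatalogError> {
        let mut entries = self.entries.lock().unwrap();
        match entries.entry(id.to_string()) {
            Entry::Occupied(_) => Err(CatalogError::AlreadyExists),
            Entry::Vacant(slot) => {
                slot.insert(EntryState::Reserved);
                Ok(())
            }
        }
    }

    /// Moves an entry forward one step, rejecting out-of-order transitions.
    fn transition(&self, id: &str, from: EntryState, to: EntryState) -> Result<(), CatalogError> {
        let mut entries = self.entries.lock().unwrap();
        let state = entries.get_mut(id).ok_or(CatalogError::NotFound)?;
        if *state != from {
            return Err(CatalogError::WrongState(*state));
        }
        *state = to;
        Ok(())
    }

    fn mark_initialized(&self, id: &str) -> Result<(), CatalogError> {
        self.transition(id, EntryState::Reserved, EntryState::Initialized)
    }

    fn publish(&self, id: &str) -> Result<(), CatalogError> {
        self.transition(id, EntryState::Initialized, EntryState::Published)
    }

    /// Recovery after a crash: drops entries that were reserved but never
    /// published and returns their ids (sorted). Published entries are never touched.
    fn recover(&self) -> Vec<String> {
        let mut entries = self.entries.lock().unwrap();
        let mut abandoned: Vec<String> = entries
            .iter()
            .filter(|(_, state)| **state != EntryState::Published)
            .map(|(id, _)| id.clone())
            .collect();
        for id in &abandoned {
            entries.remove(id);
        }
        abandoned.sort();
        abandoned
    }
}
```

Because `reserve` uses the map's entry API under a single guard, eight threads racing to reserve the same id produce exactly one winner.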

For now, disable runtime add/remove from the network and CLI surface.
Operators add graphs by editing omnigraph.yaml and restarting. The
GET /graphs read-only enumeration stays.

Removed:
- POST /graphs handler + router fragment + utoipa registration
- 13 post_graphs_* server tests + 3 composite POST tests +
  multi_mode_app_with_real_config / post_graph helpers
- CLI omnigraph graphs create subcommand + its handler + cli.rs tests
- system_remote.rs combined list+create test trimmed to list-only
- YAML rewrite infra: rewrite_atomic[_with_modify], RewriteAtomicError,
  staging_path, hash_config_file, AppState::config_hash field +
  threading through new_multi and open_multi_graph_state
- fs2 dependency (verified absent from cargo tree)
- sha2/fs2 imports in config.rs (only the rewrite path used them)
- Cedar PolicyAction::GraphCreate variant + "graph_create" match arms
  + action def in Cedar schema + graph_create_action_authorizes_against_server_resource test
- GraphCreateRequest / GraphCreateResponse / GraphSchemaSpec /
  GraphPolicySpec API types (only the POST handler / CLI imported them)

Kept:
- GET /graphs (read-only enumeration) and graph_list Cedar action
- omnigraph graphs list CLI subcommand
- All multi-graph startup, mode inference, cluster routes,
  per-graph + server-level Cedar policies
- server_settings_drive_multi_graph_startup_end_to_end (the test
  that covers operator-authored YAML + restart — the path that
  survives)
- best_effort_cleanup_init_artifacts and the three init failpoints
  (still reachable from CLI `omnigraph init`; preflight fix deferred
  as a follow-up)
- GraphRegistry::insert and its concurrency tests — production
  callers gone, but the method is the natural seam for the future
  cluster-catalog work
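The deferred preflight fix could look roughly like the sketch below. Before init, it refuses to proceed if any of the three schema artifacts already exists, so a later failure cleanup can only delete files that this init wrote. This is a hypothetical sketch against a local filesystem path. Graph URIs may also be object-store URIs (e.g. s3://), which it does not handle, and preflight_init is not an existing omnigraph function.

```rust
use std::path::{Path, PathBuf};

/// Schema artifacts named in the commit message: the files that
/// best_effort_cleanup_init_artifacts deletes on any init failure.
const SCHEMA_ARTIFACTS: [&str; 3] = ["_schema.pg", "_schema.ir.json", "__schema_state.json"];

/// Hypothetical preflight: refuse to init a graph root that already holds any
/// schema artifact, so a failed init can only clean up files it created itself.
/// On refusal, returns the artifacts that were already present.
fn preflight_init(graph_root: &Path) -> Result<(), Vec<PathBuf>> {
    let existing: Vec<PathBuf> = SCHEMA_ARTIFACTS
        .iter()
        .map(|name| graph_root.join(name))
        .filter(|path| path.exists())
        .collect();
    if existing.is_empty() {
        Ok(())
    } else {
        Err(existing)
    }
}
```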

Also fixed (transcript issue 4):
- ALWAYS_FLAT_PATHS now includes /graphs so multi-mode OpenAPI
  advertises the management route correctly (was previously rewritten
  to /graphs/{graph_id}/graphs)
- multi_mode_openapi_keeps_healthz_flat → renamed to
  multi_mode_openapi_keeps_management_paths_flat, asserts both
  /healthz and /graphs stay flat
- multi_mode_openapi_prefixes_operation_ids_with_cluster skips
  /graphs in addition to /healthz
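The sketch below shows the path rewrite this fix concerns. It assumes that multi-mode OpenAPI generation prefixes every path key with /graphs/{graph_id} unless the key appears in ALWAYS_FLAT_PATHS. The constant's name comes from the commit. Its type and the multi_mode_path helper are assumptions. Passing the old list (only /healthz) reproduces the bad /graphs/{graph_id}/graphs path.

```rust
/// Paths that stay flat in multi mode after the fix: /healthz and /graphs.
/// (The real constant's type is an assumption.)
const ALWAYS_FLAT_PATHS: &[&str] = &["/healthz", "/graphs"];

/// Rewrites a single-mode OpenAPI path key for the multi-mode spec: paths in
/// `always_flat` are kept, everything else moves under the cluster prefix.
fn multi_mode_path(path: &str, always_flat: &[&str]) -> String {
    if always_flat.iter().any(|flat| *flat == path) {
        path.to_string()
    } else {
        // `{{graph_id}}` emits a literal `{graph_id}` path template.
        format!("/graphs/{{graph_id}}{path}")
    }
}
```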

Doc fixes:
- docs/user/cli.md: graphs list example was --target http://...,
  but --target is a config-graph-name lookup; corrected to --uri.
  Removed the graphs create example.
- docs/user/server.md: dropped POST /graphs row, "omnigraph.yaml
  ownership", and "POST /graphs body shape" sections. Added a
  paragraph stating runtime add/remove is not exposed in v0.7.0.
- docs/user/policy.md: dropped graph_create action; reworded the
  "Configuration" line to clarify that server-scoped rules (graph_list)
  take neither branch_scope nor target_branch_scope.
- docs/releases/v0.7.0.md: rewrote release narrative — multi-graph
  mode ships; runtime add/remove deferred.
- AGENTS.md: HTTP server bullet and capability matrix row updated to
  reflect read-only GET /graphs and the operator-edit workflow.
- openapi.json regenerated; /graphs has only .get, no .post.

Diff: 17 files, +123 −1525 LOC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 17:49:38 +02:00

Omnigraph v0.7.0

Multi-graph server mode (MR-668). One omnigraph-server process can now serve 1 to 10 graphs concurrently behind cluster routes (/graphs/{graph_id}/...), with per-graph and server-level Cedar policy, read-only GET /graphs enumeration, and CLI parity (omnigraph graphs list).

Runtime add/remove (POST /graphs, DELETE /graphs/{id}, omnigraph graphs create) is not in v0.7.0. Operators add or remove graphs by editing omnigraph.yaml and restarting. The first cut of POST /graphs was built on an atomic rewrite of omnigraph.yaml. We pulled it before release after review found three correctness bugs: a flock-on-renamed-inode race, a duplicate-id check outside the critical section, and an init-cleanup path that could destroy an existing graph's schema on re-init. The correct fix is a Lance-style cluster catalog (reserve → init → publish with recovery sidecars); that work is deferred.

Breaking Changes

  • Multi-graph deployments lose flat routes. Single-graph invocation (omnigraph-server <URI>) is unchanged — same flat /snapshot, /read, /branches, etc. Multi-graph deployments serve those routes under /graphs/{graph_id}/...; bare flat paths return 404 in multi mode.
  • ServerConfig shape change (programmatic embedders only): ServerConfig { uri, policy_file } is replaced by ServerConfig { mode: ServerConfigMode }, where ServerConfigMode = Single { uri, policy_file } | Multi { graphs, config_path, server_policy_file }. Callers that use load_server_settings are unaffected; callers that construct ServerConfig directly need to wrap their fields in ServerConfigMode::Single.
  • AppState::uri() now returns Option<&str> (was &str). Returns Some in single mode, None in multi mode — per-graph URIs live on GraphHandle.uri instead.
  • AppState::new_multi is the new multi-graph constructor. Single-mode new_* / open_* constructors are unchanged.
  • AuthenticatedActor(Arc<str>) → ResolvedActor { actor_id, tenant_id, scopes, source } (programmatic embedders only). The struct shape changes, but the HTTP contract (bearer auth, MR-731 spoof defense) is unchanged. Cluster-mode call sites construct it with tenant_id: None, scopes: vec![Scope::Full], source: AuthSource::Static. The new shape is forward-compatible with Cloud mode (RFC 0003) and the OAuth provider (RFC 0004).
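For embedders that construct ServerConfig directly, the migration amounts to moving the old fields into ServerConfigMode::Single. The sketch below uses local stand-in types so it compiles on its own. The variant and field names follow these notes. The field types (String, PathBuf, BTreeMap) and the names GraphEntry, single_mode_config, and config_uri are assumptions. None of them are part of the omnigraph-server API.

```rust
use std::collections::BTreeMap;
use std::path::PathBuf;

// Stand-ins for the embedder-facing types. Variant and field names follow the
// release notes; every field *type* here is an assumption.
#[derive(Debug)]
struct GraphEntry {
    uri: String,
    policy_file: Option<PathBuf>,
}

#[derive(Debug)]
enum ServerConfigMode {
    Single { uri: String, policy_file: Option<PathBuf> },
    Multi {
        graphs: BTreeMap<String, GraphEntry>,
        config_path: PathBuf,
        server_policy_file: Option<PathBuf>,
    },
}

#[derive(Debug)]
struct ServerConfig {
    mode: ServerConfigMode,
}

/// v0.6.0: ServerConfig { uri, policy_file }.
/// v0.7.0: the same two fields move inside ServerConfigMode::Single.
fn single_mode_config(uri: &str, policy_file: Option<PathBuf>) -> ServerConfig {
    ServerConfig {
        mode: ServerConfigMode::Single { uri: uri.to_string(), policy_file },
    }
}

/// Same contract as the new AppState::uri(): Some in single mode, None in
/// multi mode (per-graph URIs live on each graph's handle instead).
fn config_uri(config: &ServerConfig) -> Option<&str> {
    match &config.mode {
        ServerConfigMode::Single { uri, .. } => Some(uri.as_str()),
        ServerConfigMode::Multi { .. } => None,
    }
}
```

Code that previously assumed a single URI now has to handle the None case explicitly. That is the point of the Option return.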

New

  • Multi-graph mode. Invoke with omnigraph-server --config omnigraph.yaml where the YAML has a non-empty graphs: map and no single-mode selector (no server.graph, no CLI <URI> or --target). At startup the server opens every configured graph in parallel (bounded concurrency, fail-fast).
  • GET /graphs. Lists every registered graph, sorted alphabetically by graph_id. Auth-required when bearer tokens are configured; Cedar-gated by PolicyAction::GraphList against Omnigraph::Server::"root". Returns 405 in single mode.
  • CLI omnigraph graphs list. Mirrors the HTTP surface (GET /graphs). Targets remote multi-graph servers only; local URI targets are rejected with a clear error message.
  • Per-graph Cedar policy. Each entry in the graphs: map can carry a policy.file path, loaded at startup. Cedar's Omnigraph::Graph::"<graph_id>" resource is per-graph; the new Omnigraph::Server::"root" resource governs server-level actions.
  • Server-level Cedar policy. server.policy.file in the config governs the graph_list action on Omnigraph::Server::"root". Required to expose GET /graphs once bearer tokens are configured (MR-723 default-deny otherwise rejects graph_list as non-read).
  • Cedar action vocabulary: graph_list (server-scoped). Runtime graph_create / graph_delete are reserved but not shipped — see "Deferred."
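The multi-mode selection condition above can be read as a small decision function. The sketch below is not the project's four-rule mode-inference matrix, which these notes do not spell out. It assumes three rules. Any single-mode selector (positional URI, --target, or server.graph) forces single mode. A non-empty graphs: map with no selector gives multi mode. The remaining case is treated as a startup error. StartupInputs and infer_mode are hypothetical.

```rust
/// Server mode chosen at startup.
#[derive(Debug, PartialEq, Eq)]
enum Mode {
    Single,
    Multi,
}

/// Hypothetical bundle of the startup inputs the notes name as mode selectors.
struct StartupInputs<'a> {
    cli_uri: Option<&'a str>,      // positional `omnigraph-server <URI>`
    cli_target: Option<&'a str>,   // `--target <graph-name>`
    server_graph: Option<&'a str>, // `server.graph` in omnigraph.yaml
    configured_graphs: usize,      // number of entries under `graphs:`
}

/// Any single-mode selector wins; otherwise a non-empty `graphs:` map selects
/// multi mode. Treating "no selector, no graphs" as an error is an assumption.
fn infer_mode(inputs: &StartupInputs) -> Result<Mode, String> {
    let has_selector =
        inputs.cli_uri.is_some() || inputs.cli_target.is_some() || inputs.server_graph.is_some();
    if has_selector {
        Ok(Mode::Single)
    } else if inputs.configured_graphs > 0 {
        Ok(Mode::Multi)
    } else {
        Err("no graph to serve: pass a URI or add a graphs: map".to_string())
    }
}
```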

Configuration

omnigraph.yaml schema additions (all optional, single-mode unaffected):

```yaml
server:
  bind: 0.0.0.0:8080
  policy:
    file: ./server-policy.yaml          # server-level Cedar (graph_list)

graphs:
  alpha:
    uri: s3://tenant-bucket/alpha
    policy:
      file: ./policies/alpha.yaml       # per-graph Cedar
  beta:
    uri: s3://tenant-bucket/beta
    # no per-graph policy → engine-layer enforcement is a no-op
```

Deferred

  • POST /graphs runtime graph creation and CLI omnigraph graphs create. Pulled before release after review found correctness bugs in the YAML-rewrite design. A future release will add a managed cluster catalog (Lance-backed reserve → init → publish with recovery sidecars) and re-expose runtime creation on top of it. Until then, operators add graphs by editing omnigraph.yaml and restarting.
  • DELETE /graphs/{id}. Never shipped in v0.7.0; deferred with the same cluster-catalog work.
  • StorageAdapter::delete_prefix. The substrate primitive a managed catalog would need. Will land alongside runtime mutation.
  • X-Actor-Id service delegation forwarding. Needs durable both-actor audit on _graph_commits.lance — out of scope.
  • Hot policy reload. Restart is cheap at N≤10 graphs.

User Impact

  • Existing single-graph deployments upgrade with zero changes. omnigraph-server <URI> with v0.6.0 config keeps working identically.
  • Multi-graph adoption is opt-in. Add a graphs: map to omnigraph.yaml (and remove server.graph) to switch a deployment to multi mode.
  • Cluster routes are breaking for client SDKs targeting multi mode. Clients generated from the v0.6.0 OpenAPI spec will get 404 on flat paths against a multi-mode server. Regenerate against the v0.7.0 openapi.json.
  • Operator-supplied policy.yaml files don't change. The Cedar Omnigraph::Graph and Omnigraph::Server entities are internally generated by compile_policy_source — operator YAML only references actions and groups.
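The sketch below shows how requests map onto the two Cedar resource kinds, plus one reading of the GET /graphs default-deny rule. Request, resource_uid, and graph_list_allowed are hypothetical helpers. They are not compile_policy_source. Treating GET /graphs as open when no bearer tokens are configured is an assumption, based on the statement that the route is "auth-required when bearer tokens are configured".

```rust
/// Hypothetical request shapes: the server-scoped listing action, or any
/// action against a single graph.
enum Request<'a> {
    GraphList,
    OnGraph { graph_id: &'a str },
}

/// Cedar resource UID for a request, using the entity names in these notes.
/// Assumes graph ids are already validated and need no string escaping.
fn resource_uid(request: &Request) -> String {
    match request {
        Request::GraphList => r#"Omnigraph::Server::"root""#.to_string(),
        Request::OnGraph { graph_id } => format!(r#"Omnigraph::Graph::"{graph_id}""#),
    }
}

/// One reading of the rules for GET /graphs: open when no bearer tokens are
/// configured; otherwise default-deny unless a loaded server policy permits
/// graph_list. `server_policy_permits` is None when server.policy.file is unset.
fn graph_list_allowed(bearer_tokens_configured: bool, server_policy_permits: Option<bool>) -> bool {
    if !bearer_tokens_configured {
        return true;
    }
    server_policy_permits.unwrap_or(false)
}
```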

Migration: single → multi

```yaml
# Before (v0.6.0 single-mode invocation)
server:
  graph: my-graph
graphs:
  my-graph:
    uri: /var/lib/omnigraph/my-graph
policy:
  file: ./policy.yaml
```

```yaml
# After (v0.7.0 multi-mode: drop `server.graph` and the top-level `policy`)
server:
  policy:
    file: ./server-policy.yaml      # NEW: governs GET /graphs
graphs:
  my-graph:
    uri: /var/lib/omnigraph/my-graph
    policy:
      file: ./policy.yaml           # MOVED: was top-level
```

Same omnigraph.yaml file; restart the server. Clients targeting the old flat routes (/snapshot, /read, …) must update to /graphs/my-graph/snapshot, etc.
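The routing changes a client sees can be summarized as a path classifier. This is a hypothetical sketch that looks only at the path, not the HTTP method, and every name in it is made up for illustration. It encodes these rules from the notes:

- Single mode keeps the flat routes.
- Multi mode serves graphs under /graphs/{graph_id}/... and returns 404 for bare flat paths.
- /healthz is always flat.
- /graphs returns 405 in single mode.

```rust
/// Where a request path lands (hypothetical; path-only, HTTP method ignored).
#[derive(Debug, PartialEq, Eq)]
enum Route<'a> {
    Healthz,
    ListGraphs,
    MethodNotAllowed,
    SingleGraph { rest: &'a str },
    Graph { graph_id: &'a str, rest: &'a str },
    NotFound,
}

fn route(path: &str, multi_mode: bool) -> Route<'_> {
    if path == "/healthz" {
        return Route::Healthz;
    }
    if path == "/graphs" {
        // GET /graphs lists graphs in multi mode and returns 405 in single mode.
        return if multi_mode { Route::ListGraphs } else { Route::MethodNotAllowed };
    }
    if !multi_mode {
        // Single mode keeps the flat routes (/snapshot, /read, /branches, ...).
        return Route::SingleGraph { rest: path };
    }
    // Multi mode: only /graphs/{graph_id}/... reaches a graph; bare flat paths are 404.
    let Some(tail) = path.strip_prefix("/graphs/") else {
        return Route::NotFound;
    };
    match tail.find('/') {
        Some(i) if i > 0 => Route::Graph { graph_id: &tail[..i], rest: &tail[i..] },
        _ => Route::NotFound,
    }
}
```

With the migration example's my-graph, a client that used to call /snapshot now calls /graphs/my-graph/snapshot.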

To add a new graph after rollout: stop the server, append a new graphs.<id> entry, restart.

Test coverage

  • GraphId newtype validation, registry race tests (PR 3), init failpoints (PR 2a — still reachable from omnigraph init CLI).
  • Mode-inference four-rule matrix (PR 5), parallel multi-graph startup, cluster routing.
  • Cedar Server resource refactor, backwards-compat for graph-only policies.
  • GET /graphs enumeration, 405-in-single-mode.
  • MR-731 spoof regression test stays green across the entire refactor.