mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-24 02:38:06 +02:00

Ragnor Comerford d11c18fb27

mr-668: composite e2e tests, race fix, v0.7.0 release (PR 9/10)

PR 9 — the final integration PR for MR-668 multi-graph server work.
Closes the v0.7.0 release.

Composite lifecycle tests (closes gaps flagged in PR 7's coverage
review):
  - `multi_graph_lifecycle_post_query_restart_persistence` — POST a
    graph, query it via cluster route, reload the config from disk
    and confirm `load_server_settings` sees the rewritten YAML.
    Validates the "restart resolves orphans" failure-mode story.
  - `per_graph_policy_enforced_on_post_created_graph` — POST a graph
    with a per-graph policy attached, then send authenticated read
    and change requests. Per-graph Cedar enforcement fires correctly
    on a POST-created graph (engine-layer policy reinstalled via
    `Omnigraph::with_policy` inside the create flow).
  - `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent
    POSTs with distinct graph_ids all return 201. Caught a real
    race in `rewrite_atomic` (see below).

Race fix — `rewrite_atomic_with_modify`:

The concurrent-POST composite test surfaced a real bug. The old
POST flow captured the baseline hash OUTSIDE the flock, then called
`rewrite_atomic(path, new_config, expected_hash)`, which acquired
the flock only inside the call. Under concurrent writers:

  - POST A: captures baseline H0, calls rewrite_atomic.
  - POST B: captures baseline H0 too (before A's update lands).
  - A: acquires flock, on-disk == H0, writes H1, releases.
  - A: updates baseline H0 → H1.
  - B: tries to acquire flock — waits.
  - B: acquires flock. On-disk is now H1. Expected (captured
       before A finished) is H0. MISMATCH → spurious Drift error.

Worse: even if the timing happens to align, B's `updated` config
was constructed from BYTES read before the flock. B writes a config
that doesn't include A's new graph — silent data loss.

The fix: new `config::rewrite_atomic_with_modify(path, baseline,
modify)` takes a closure. Inside the flock + baseline mutex:
  1. Read on-disk bytes, hash, compare to baseline.
  2. Parse on-disk YAML.
  3. Call `modify(parsed)` to produce the new config — receives
     fresh on-disk state, returns the modification.
  4. Serialize + write + fsync + rename + update baseline.

The whole read-modify-write now runs inside one critical section,
so concurrent writers serialize cleanly. The concurrent-POST test
passes with the new shape.
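
A minimal, std-only sketch of the closure-based shape described above. It is
not the code in this commit: `rewrite_with_modify`, `digest`, and
`RewriteError` are hypothetical names, `DefaultHasher` stands in for SHA-256,
the "parse" step is plain text instead of YAML, and the cross-process flock is
omitted so that only the in-process baseline mutex guards the critical section.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::{self, File};
use std::hash::{Hash, Hasher};
use std::io::{self, Write};
use std::path::Path;
use std::sync::Mutex;

/// Stand-in digest (assumption: the real code hashes with SHA-256).
pub fn digest(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

#[derive(Debug)]
pub enum RewriteError {
    /// On-disk content no longer matches the recorded baseline.
    Drift,
    Io(io::Error),
}

impl From<io::Error> for RewriteError {
    fn from(e: io::Error) -> Self {
        RewriteError::Io(e)
    }
}

/// Hypothetical read-modify-write helper. Holding the baseline mutex for the
/// whole call is what keeps the compare, modify, and write steps atomic
/// with respect to other in-process writers.
pub fn rewrite_with_modify<F>(
    path: &Path,
    baseline: &Mutex<u64>,
    modify: F,
) -> Result<(), RewriteError>
where
    F: FnOnce(String) -> String,
{
    let mut expected = baseline.lock().unwrap();

    // 1. Read on-disk bytes, hash, compare to baseline.
    let on_disk = fs::read(path)?;
    if digest(&on_disk) != *expected {
        return Err(RewriteError::Drift);
    }

    // 2-3. "Parse" (plain text here) and let the caller modify fresh state.
    let current = String::from_utf8_lossy(&on_disk).into_owned();
    let updated = modify(current);

    // 4. Write to a temp file, fsync, rename over the target, update baseline.
    let tmp = path.with_extension("tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(updated.as_bytes())?;
    file.sync_all()?;
    fs::rename(&tmp, path)?;
    *expected = digest(updated.as_bytes());
    Ok(())
}
```

Because the closure receives the state read inside the critical section, a
second writer builds its update on top of the first writer's result instead of
on a stale copy.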

The old `rewrite_atomic(path, new_config, expected_hash)` API stays
for tests that don't need the read-modify-write shape; the POST
handler switches to the new shape.

Version bump v0.6.0 → v0.7.0:
  - All 5 `crates/*/Cargo.toml` (compiler, engine, policy, cli, server)
    plus their inter-crate `path` dep version constraints.
  - `Cargo.lock` regenerated by `cargo build --workspace`.
  - `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server
    row updated to mention multi-graph + cluster routes + atomic YAML
    rewrite.
  - `openapi.json` regenerated.

Docs:
  - `docs/releases/v0.7.0.md` (new) — release notes with breaking
    changes, new features, deferred items (DELETE, `delete_prefix`,
    actor forwarding), and the single→multi migration recipe.
  - `docs/user/server.md` — substantial section additions for the
    two modes, mode inference, cluster endpoint table, management
    endpoints, `omnigraph.yaml` ownership contract, `POST /graphs`
    body shape + status codes.
  - `docs/user/cli.md` — `omnigraph graphs list/create` section,
    deferred-DELETE note.
  - `docs/user/policy.md` — server-scoped Cedar actions
    (`graph_create`, `graph_list`), per-graph vs server-level policy
    composition, example server-level policy.

Workspace test pass: 573 tests green across all crates. Zero
failures. MR-731 spoof regression still pinned and passing across
the entire 10-PR series.

This commit closes MR-668. v0.7.0 is ready for tagging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-25 21:32:49 +02:00


Omnigraph v0.7.0

Multi-graph server mode (MR-668). One omnigraph-server process can now serve 1–10 graphs concurrently behind cluster routes (/graphs/{graph_id}/...), with per-graph Cedar policy, runtime graph creation via POST /graphs, and CLI parity (omnigraph graphs list/create).

Breaking Changes

Multi-graph deployments lose flat routes. Single-graph invocation (omnigraph-server <URI>) is unchanged — same flat /snapshot, /read, /branches, etc. Multi-graph deployments serve those routes under /graphs/{graph_id}/...; bare flat paths return 404 in multi mode.
ServerConfig shape change (programmatic embedders only): ServerConfig { uri, policy_file } is replaced by ServerConfig { mode: ServerConfigMode }, where ServerConfigMode = Single { uri, policy_file } | Multi { graphs, config_path, server_policy_file }. Callers that use load_server_settings are unaffected; callers that construct ServerConfig directly need to wrap their fields in ServerConfigMode::Single.
AppState::uri() now returns Option<&str> (was &str). Returns Some in single mode, None in multi mode — per-graph URIs live on GraphHandle.uri instead.
AppState::new_multi is the new multi-graph constructor. Single-mode new_* / open_* constructors are unchanged.
AuthenticatedActor(Arc<str>) → ResolvedActor { actor_id, tenant_id, scopes, source } (programmatic embedders only). The struct shape changes, but the HTTP contract (bearer auth, MR-731 spoof defense) is unchanged. Cluster-mode call sites construct with tenant_id: None, scopes: vec![Scope::Full], source: AuthSource::Static. The extra fields are forward-compatibility groundwork for Cloud mode (RFC 0003) and the OAuth provider (RFC 0004).
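
For embedders, the ServerConfig change might look roughly like the sketch below. Only the variant and field names come from these notes; the field types, the `GraphEntry` struct, and the `single` helper are assumptions made for illustration, and the `uri` method merely mirrors the Option<&str> contract described for AppState::uri().

```rust
use std::collections::BTreeMap;
use std::path::PathBuf;

/// Per-graph entry. Name, fields, and types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct GraphEntry {
    pub uri: String,
    pub policy_file: Option<PathBuf>,
}

/// Variant and field names follow the release notes; types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum ServerConfigMode {
    Single {
        uri: String,
        policy_file: Option<PathBuf>,
    },
    Multi {
        graphs: BTreeMap<String, GraphEntry>,
        config_path: PathBuf,
        server_policy_file: Option<PathBuf>,
    },
}

#[derive(Debug, Clone, PartialEq)]
pub struct ServerConfig {
    pub mode: ServerConfigMode,
}

impl ServerConfig {
    /// Hypothetical helper: wraps the old v0.6.0 fields in `Single`.
    pub fn single(uri: impl Into<String>, policy_file: Option<PathBuf>) -> Self {
        ServerConfig {
            mode: ServerConfigMode::Single {
                uri: uri.into(),
                policy_file,
            },
        }
    }

    /// Some(uri) in single mode, None in multi mode.
    pub fn uri(&self) -> Option<&str> {
        match &self.mode {
            ServerConfigMode::Single { uri, .. } => Some(uri.as_str()),
            ServerConfigMode::Multi { .. } => None,
        }
    }
}
```

Code that previously read a plain &str URI has to handle the None case explicitly, which is where multi-mode callers would switch to the per-graph handle.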

New

Multi-graph mode. Invoke with omnigraph-server --config omnigraph.yaml where the YAML has a non-empty graphs: map and no single-mode selector (no server.graph, no CLI <URI> or --target). At startup the server opens every configured graph in parallel (bounded concurrency, fail-fast).
POST /graphs. Runtime graph creation. Request body:
```
{
  "graph_id": "beta",
  "uri": "/data/beta.omni",
  "schema": { "source": "<inline .pg source>" },
  "policy": { "file": "./policies/beta.yaml" }
}
```
schema and policy are nested objects, which leaves room for future fields without breaking the shape. This is asymmetric with the existing POST /schema/apply, which still takes a flat schema_source: String; a follow-up release may migrate it. The body limit is 32 MiB.

The server runs Omnigraph::init at the supplied URI, atomically rewrites omnigraph.yaml under an exclusive fcntl::flock with SHA-256 drift detection, then publishes the handle in the in-memory registry. Returns 201 on success; 409 on duplicate graph_id or URI; 503 on YAML drift (operator hand-edited the file between server start and the rewrite).
GET /graphs. Lists every registered graph, sorted alphabetically by graph_id. Auth-required when bearer tokens are configured; Cedar-gated by PolicyAction::GraphList against Omnigraph::Server::"root". Returns 405 in single mode.
CLI omnigraph graphs list/create. Mirrors the HTTP surface. Rejects local URI targets with a clear message, since these subcommands are for remote multi-graph servers only.
Per-graph Cedar policy. Each entry in the graphs: map can carry a policy.file path. Loaded at startup or attached at POST time. Cedar's Omnigraph::Graph::"<graph_id>" resource is per-graph; the new Omnigraph::Server::"root" resource governs server-level actions.
Cedar action vocabulary: graph_create and graph_list (server-scoped). graph_delete is reserved but not shipped — see "Deferred."
YAML drift detection. Server hashes omnigraph.yaml at startup. POST /graphs re-hashes the on-disk file under the flock before rewriting; if the hash doesn't match the baseline, the rewrite refuses with 503 to avoid clobbering operator hand-edits.
Omnigraph::init error-path cleanup. A failed init now best-effort-deletes the schema artifacts (_schema.pg, _schema.ir.json, __schema_state.json). Lance per-type directories created by GraphCoordinator::init may still orphan — full recursive cleanup needs a delete_prefix substrate primitive, deferred along with DELETE /graphs/{id}.
omnigraph-policy is now a published workspace crate. The published-crates set is omnigraph-compiler, omnigraph-policy, omnigraph-engine, omnigraph-server, omnigraph-cli.
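
The POST /graphs status codes listed above can be summarized in a small sketch. This is an illustration, not the server's handler: `CreateOutcome` and `try_register` are hypothetical names, the registry is reduced to a graph_id → URI map, the drift flag is passed in as a plain bool, and the order of the checks is an assumption.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CreateOutcome {
    Created,
    DuplicateGraphId,
    DuplicateUri,
    YamlDrift,
}

impl CreateOutcome {
    /// Status codes as documented for POST /graphs.
    pub fn status(self) -> u16 {
        match self {
            CreateOutcome::Created => 201,
            CreateOutcome::DuplicateGraphId | CreateOutcome::DuplicateUri => 409,
            CreateOutcome::YamlDrift => 503,
        }
    }
}

/// Hypothetical registration step over a graph_id -> uri map.
/// The check order (duplicates first, then drift) is an assumption.
pub fn try_register(
    registry: &mut BTreeMap<String, String>,
    graph_id: &str,
    uri: &str,
    yaml_drifted: bool,
) -> CreateOutcome {
    if registry.contains_key(graph_id) {
        return CreateOutcome::DuplicateGraphId;
    }
    if registry.values().any(|existing| existing.as_str() == uri) {
        return CreateOutcome::DuplicateUri;
    }
    if yaml_drifted {
        return CreateOutcome::YamlDrift;
    }
    registry.insert(graph_id.to_string(), uri.to_string());
    CreateOutcome::Created
}
```

A BTreeMap iterates in key order, so listing its keys also yields graph ids sorted alphabetically, which matches the ordering described for GET /graphs.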

Configuration

omnigraph.yaml schema additions (all optional, single-mode unaffected):

server:
  bind: 0.0.0.0:8080
  policy:
    file: ./server-policy.yaml          # server-level Cedar (graph_create, graph_list)

graphs:
  alpha:
    uri: s3://tenant-bucket/alpha
    policy:
      file: ./policies/alpha.yaml       # per-graph Cedar
  beta:
    uri: s3://tenant-bucket/beta
    # no per-graph policy → engine-layer enforcement is a no-op

Deferred

DELETE /graphs/{id}. Cut from v0.7.0 scope to bound complexity (no delete_prefix substrate, no tombstones). Operators remove graphs by stopping the server, editing omnigraph.yaml, then restarting.
StorageAdapter::delete_prefix. The substrate primitive that DELETE would need. Will land alongside DELETE in a future release.
X-Actor-Id service delegation forwarding. Needs durable both-actor audit on _graph_commits.lance — out of scope.
Hot policy reload. Restart is cheap at N≤10 graphs.

User Impact

Existing single-graph deployments upgrade with zero changes. omnigraph-server <URI> with v0.6.0 config keeps working identically.
Multi-graph adoption is opt-in. Add a graphs: map to omnigraph.yaml (and remove server.graph) to switch a deployment to multi mode.
Cluster routes are a breaking change for client SDKs targeting multi mode. Clients generated from the v0.6.0 OpenAPI spec will hit 404 on flat paths against a multi-mode server. Regenerate them against the v0.7.0 openapi.json.
fs2 = "0.4" is a new dependency for the file locking that powers the atomic YAML rewrite. POSIX-only. Linux / macOS deployment supported; Windows is out of scope.
Operator-supplied policy.yaml files don't change. The Cedar Omnigraph::Graph and Omnigraph::Server entities are internally generated by compile_policy_source — operator YAML only references actions and groups.
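
For hand-written clients, the flat-to-cluster route change can be handled with a small path helper. The sketch below is an assumption about how a client might do it; `ServerMode` and `route_for` are hypothetical names and not part of any Omnigraph SDK.

```rust
/// Which kind of server the client is talking to (hypothetical type).
pub enum ServerMode<'a> {
    Single,
    Multi { graph_id: &'a str },
}

/// Maps a flat route such as "/snapshot" to the path the server expects.
pub fn route_for(mode: &ServerMode<'_>, flat_path: &str) -> String {
    // Normalize so callers may pass "snapshot" or "/snapshot".
    let tail = flat_path.trim_start_matches('/');
    match mode {
        ServerMode::Single => format!("/{tail}"),
        ServerMode::Multi { graph_id } => format!("/graphs/{graph_id}/{tail}"),
    }
}
```

With this, route_for(&ServerMode::Multi { graph_id: "my-graph" }, "/snapshot") yields "/graphs/my-graph/snapshot", while single mode keeps the flat path.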

Migration: single → multi

# Before (v0.6.0 single-mode invocation)
server:
  graph: my-graph
graphs:
  my-graph:
    uri: /var/lib/omnigraph/my-graph
policy:
  file: ./policy.yaml

# After (v0.7.0 multi-mode — drop `server.graph` and the top-level `policy`)
server:
  policy:
    file: ./server-policy.yaml      # NEW: governs POST/GET /graphs
graphs:
  my-graph:
    uri: /var/lib/omnigraph/my-graph
    policy:
      file: ./policy.yaml           # MOVED: was top-level

Edit the same omnigraph.yaml file in place, then restart the server. Clients targeting the old flat routes (/snapshot, /read, …) must update to /graphs/my-graph/snapshot, etc.
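
Why dropping server.graph flips the deployment into multi mode can be sketched as follows. This encodes only the conditions stated in these notes (a non-empty graphs: map and no single-mode selector); it is not the server's actual four-rule matrix, and `Inputs`, `InferredMode`, and `infer_mode` are hypothetical names. Returning None when nothing is configured is also an assumption.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum InferredMode {
    Single,
    Multi,
}

/// The inputs the notes mention as relevant to mode selection.
pub struct Inputs<'a> {
    pub cli_uri: Option<&'a str>,      // positional <URI>
    pub cli_target: Option<&'a str>,   // --target
    pub server_graph: Option<&'a str>, // server.graph in omnigraph.yaml
    pub graph_count: usize,            // entries in the graphs: map
}

/// Any single-mode selector wins; otherwise a non-empty graphs: map
/// means multi mode. None means nothing usable was configured.
pub fn infer_mode(inputs: &Inputs<'_>) -> Option<InferredMode> {
    let has_selector = inputs.cli_uri.is_some()
        || inputs.cli_target.is_some()
        || inputs.server_graph.is_some();
    if has_selector {
        Some(InferredMode::Single)
    } else if inputs.graph_count > 0 {
        Some(InferredMode::Multi)
    } else {
        None
    }
}
```

Under these assumptions the "Before" config above (server.graph set, one graph) resolves to single mode, and the "After" config (no selector, one graph) resolves to multi mode.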

Test coverage

v0.7.0 ships ~280 new tests covering MR-668 specifically:

GraphId newtype validation, registry race tests (PR 3), init failpoints (PR 2a).
Mode-inference four-rule matrix (PR 5), parallel multi-graph startup, cluster routing.
Cedar Server resource refactor, backwards-compat for graph-only policies.
POST /graphs happy path + duplicate graph_id + duplicate URI + YAML drift detection + 405-in-single-mode.
Composite lifecycle: POST a graph, query it via cluster route, reload config from disk, confirm persistence.
Per-graph Cedar policy enforced for a POST-created graph (engine-layer enforcement is re-applied via Omnigraph::with_policy).
Concurrent distinct-id POSTs serialize correctly through the flock without spurious drift errors.
MR-731 spoof regression test stays green across the entire refactor.
