mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-12 01:45:14 +02:00

Ragnor Comerford d11c18fb27

mr-668: composite e2e tests, race fix, v0.7.0 release (PR 9/10)

PR 9 — the final integration PR for MR-668 multi-graph server work.
Closes the v0.7.0 release.

Composite lifecycle tests (closes gaps flagged in PR 7's coverage
review):
  - `multi_graph_lifecycle_post_query_restart_persistence` — POST a
    graph, query it via cluster route, reload the config from disk
    and confirm `load_server_settings` sees the rewritten YAML.
    Validates the "restart resolves orphans" failure-mode story.
  - `per_graph_policy_enforced_on_post_created_graph` — POST a graph
    with a per-graph policy attached, then send authenticated read
    and change requests. Per-graph Cedar enforcement fires correctly
    on a POST-created graph (engine-layer policy reinstalled via
    `Omnigraph::with_policy` inside the create flow).
  - `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent
    POSTs with distinct graph_ids all return 201. Caught a real
    race in `rewrite_atomic` (see below).

Race fix — `rewrite_atomic_with_modify`:

The concurrent-POST composite test surfaced a real bug. The old POST
flow captured the baseline hash OUTSIDE the flock, then called
`rewrite_atomic(path, new_config, expected_hash)`, which acquired the
flock only inside the call. Under concurrent writers:

  - POST A: captures baseline H0, calls rewrite_atomic.
  - POST B: captures baseline H0 too (before A's update lands).
  - A: acquires flock, on-disk == H0, writes H1, releases.
  - A: updates baseline H0 → H1.
  - B: tries to acquire flock — waits.
  - B: acquires flock. On-disk is now H1. Expected (captured
       before A finished) is H0. MISMATCH → spurious Drift error.

Worse: even if the timing happens to align, B's `updated` config
was constructed from BYTES read before the flock. B writes a config
that doesn't include A's new graph — silent data loss.

The fix: new `config::rewrite_atomic_with_modify(path, baseline,
modify)` takes a closure. Inside the flock + baseline mutex:
  1. Read on-disk bytes, hash, compare to baseline.
  2. Parse on-disk YAML.
  3. Call `modify(parsed)` to produce the new config — receives
     fresh on-disk state, returns the modification.
  4. Serialize + write + fsync + rename + update baseline.

Everything is read-modify-write under the same critical section.
Concurrent writers serialize cleanly. Test confirmed this is no
longer a race.
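
A minimal sketch of the read-modify-write shape described above, using
only the standard library. It is not the code in `config`: the names
`rewrite_with_modify`, `digest` and `RewriteError` are hypothetical, a
`Mutex` stands in for the flock + baseline mutex pair, `DefaultHasher`
stands in for SHA-256, and the closure receives the raw file text
rather than parsed YAML.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::{self, File};
use std::hash::{Hash, Hasher};
use std::io::Write;
use std::path::Path;
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub enum RewriteError {
    /// On-disk content no longer matches the recorded baseline.
    Drift,
    Io(String),
}

fn io_err(e: std::io::Error) -> RewriteError {
    RewriteError::Io(e.to_string())
}

/// Stand-in for SHA-256 (assumption: std has no SHA-256, so this sketch
/// uses the deterministic `DefaultHasher::new()`).
pub fn digest(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Read, check, modify and write inside ONE critical section, so the
/// closure always sees the state left by the previous writer.
pub fn rewrite_with_modify<F>(
    path: &Path,
    baseline: &Mutex<u64>,
    modify: F,
) -> Result<(), RewriteError>
where
    F: FnOnce(String) -> String,
{
    // The guard is held until the function returns; a real
    // implementation would also hold the file lock for the same span.
    let mut expected = baseline
        .lock()
        .map_err(|e| RewriteError::Io(e.to_string()))?;

    // 1. Read on-disk bytes, hash, compare to baseline.
    let on_disk = fs::read(path).map_err(io_err)?;
    if digest(&on_disk) != *expected {
        return Err(RewriteError::Drift);
    }

    // 2 + 3. Hand the fresh on-disk state to the caller's closure.
    let current = String::from_utf8(on_disk).map_err(|e| RewriteError::Io(e.to_string()))?;
    let updated = modify(current);

    // 4. Write a temp file, fsync, rename over the original, update baseline.
    let tmp = path.with_extension("tmp");
    let mut file = File::create(&tmp).map_err(io_err)?;
    file.write_all(updated.as_bytes()).map_err(io_err)?;
    file.sync_all().map_err(io_err)?;
    fs::rename(&tmp, path).map_err(io_err)?;
    *expected = digest(updated.as_bytes());
    Ok(())
}
```

Because each writer builds its update from the bytes it reads inside
the lock, four concurrent callers that each append one entry end up
with all four entries on disk, and none of them sees a stale baseline.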

The old `rewrite_atomic(path, new_config, expected_hash)` API stays
for tests that don't need the read-modify-write shape; the POST
handler switches to the new shape.

Version bump v0.6.0 → v0.7.0:
  - All 5 `crates/*/Cargo.toml` (compiler, engine, policy, cli, server)
    plus their inter-crate `path` dep version constraints.
  - `Cargo.lock` regenerated by `cargo build --workspace`.
  - `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server
    row updated to mention multi-graph + cluster routes + atomic YAML
    rewrite.
  - `openapi.json` regenerated.

Docs:
  - `docs/releases/v0.7.0.md` (new) — release notes with breaking
    changes, new features, deferred items (DELETE, `delete_prefix`,
    actor forwarding), and the single→multi migration recipe.
  - `docs/user/server.md` — substantial section additions for the
    two modes, mode inference, cluster endpoint table, management
    endpoints, `omnigraph.yaml` ownership contract, `POST /graphs`
    body shape + status codes.
  - `docs/user/cli.md` — `omnigraph graphs list/create` section,
    deferred-DELETE note.
  - `docs/user/policy.md` — server-scoped Cedar actions
    (`graph_create`, `graph_list`), per-graph vs server-level policy
    composition, example server-level policy.

Workspace test pass: 573 tests green across all crates. Zero
failures. MR-731 spoof regression still pinned and passing across
the entire 10-PR series.

This commit closes MR-668. v0.7.0 is ready for tagging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-25 21:32:49 +02:00

8.2 KiB

HTTP Server (`omnigraph-server`)

Axum 0.8 + tokio + utoipa-generated OpenAPI. Two modes (v0.7.0+): single-graph (legacy) and multi-graph (MR-668). Mode is inferred from CLI args + config shape.

Modes

Single-graph mode (legacy)

`omnigraph-server <URI>` or `omnigraph-server --target <name> --config omnigraph.yaml`. Routes are flat: `/snapshot`, `/read`, `/branches`, etc. Behavior is unchanged from v0.6.0.

Multi-graph mode (v0.7.0+)

`omnigraph-server --config omnigraph.yaml` with a non-empty `graphs:` map and no single-mode selector (no `server.graph`, no `<URI>`, no `--target`). The server opens every configured graph in parallel at startup (bounded concurrency = 4, fail-fast on the first open error). Routes are nested under `/graphs/{graph_id}/...`. Bare flat paths return 404 in multi mode.
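
The bounded, fail-fast startup open can be pictured with the sketch below. It is an illustration only: the server is built on tokio, whereas this version uses std threads, and `open_all` and its `open` callback are hypothetical names. At most `limit` workers pull URIs from a shared queue, and after the first error no worker starts another open (opens already in progress still finish).

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::thread;

/// Open every URI with at most `limit` opens in flight.
/// Returns the first error seen, or all opened handles (in completion order).
pub fn open_all<T, E, F>(uris: Vec<String>, limit: usize, open: F) -> Result<Vec<T>, E>
where
    T: Send,
    E: Send,
    F: Fn(&str) -> Result<T, E> + Sync,
{
    let queue = Mutex::new(uris.into_iter().collect::<VecDeque<_>>());
    let failed = AtomicBool::new(false);
    let opened = Mutex::new(Vec::new());
    let first_err = Mutex::new(None);

    thread::scope(|s| {
        for _ in 0..limit.max(1) {
            s.spawn(|| loop {
                // Fail-fast: stop taking work once any open has failed.
                if failed.load(Ordering::SeqCst) {
                    break;
                }
                let next = queue.lock().unwrap().pop_front();
                let Some(uri) = next else { break };
                match open(&uri) {
                    Ok(graph) => opened.lock().unwrap().push(graph),
                    Err(e) => {
                        failed.store(true, Ordering::SeqCst);
                        // Keep only the first error that was recorded.
                        first_err.lock().unwrap().get_or_insert(e);
                        break;
                    }
                }
            });
        }
    });

    match first_err.into_inner().unwrap() {
        Some(e) => Err(e),
        None => Ok(opened.into_inner().unwrap()),
    }
}
```

Calling `open_all(uris, 4, |uri| open_graph(uri))` (with `open_graph` standing for whatever actually opens a graph) mirrors the documented concurrency bound of 4.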

Mode inference (four-rule matrix):

CLI positional <URI> → single
CLI --target <name> → single
server.graph in config → single
--config + non-empty graphs: + no single-mode selector → multi
otherwise → error with migration hint
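
The matrix above can be read as a small decision function. The sketch below is one way to write it, not the server's actual code: `StartupInputs`, `Mode` and `infer_mode` are hypothetical names, and the error text is a placeholder for the real migration hint.

```rust
#[derive(Debug, PartialEq)]
pub enum Mode {
    Single,
    Multi,
}

/// Hypothetical summary of what the CLI and config provided.
#[derive(Debug, Clone, Copy)]
pub struct StartupInputs<'a> {
    pub positional_uri: Option<&'a str>, // CLI positional <URI>
    pub target: Option<&'a str>,         // CLI --target <name>
    pub server_graph: Option<&'a str>,   // server.graph in config
    pub config_given: bool,              // --config was passed
    pub graphs_len: usize,               // entries in the graphs: map
}

pub fn infer_mode(inputs: &StartupInputs<'_>) -> Result<Mode, String> {
    // Rules 1-3: any single-mode selector wins.
    if inputs.positional_uri.is_some()
        || inputs.target.is_some()
        || inputs.server_graph.is_some()
    {
        return Ok(Mode::Single);
    }
    // Rule 4: a config with a non-empty graphs: map and no selector.
    if inputs.config_given && inputs.graphs_len > 0 {
        return Ok(Mode::Multi);
    }
    // Fallback: placeholder message, the real hint is not reproduced here.
    Err("cannot infer server mode; see the migration notes".to_string())
}
```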

Endpoint inventory

Per-graph endpoints — same body shape across modes; URLs differ:

Method	Single-mode path	Multi-mode path	Auth	Action	Handler
GET	`/healthz`	`/healthz`	none	—	`server_health`
GET	`/openapi.json`	`/openapi.json`	none	—	`server_openapi` (strips security if auth disabled; in multi mode emits cluster paths with `cluster_` operation-id prefix)
GET	`/snapshot?branch=`	`/graphs/{id}/snapshot?branch=`	bearer + `read`	snapshot of branch	`server_snapshot`
POST	`/read`	`/graphs/{id}/read`	bearer + `read`	run named query	`server_read`
POST	`/export`	`/graphs/{id}/export`	bearer + `export`	NDJSON stream	`server_export`
POST	`/change`	`/graphs/{id}/change`	bearer + `change`	mutation	`server_change`
GET	`/schema`	`/graphs/{id}/schema`	bearer + `read`	get current `.pg` source	`server_schema_get`
POST	`/schema/apply`	`/graphs/{id}/schema/apply`	bearer + `schema_apply` (target=`main`)	migrate	`server_schema_apply`
POST	`/ingest`	`/graphs/{id}/ingest`	bearer + `branch_create` (if new) + `change`	bulk load	`server_ingest` (32 MB body limit)
GET	`/branches`	`/graphs/{id}/branches`	bearer + `read`	list branches	`server_branch_list`
POST	`/branches`	`/graphs/{id}/branches`	bearer + `branch_create`	create	`server_branch_create`
DELETE	`/branches/{branch}`	`/graphs/{id}/branches/{branch}`	bearer + `branch_delete`	delete	`server_branch_delete`
POST	`/branches/merge`	`/graphs/{id}/branches/merge`	bearer + `branch_merge`	merge `source → target`	`server_branch_merge`
GET	`/commits?branch=`	`/graphs/{id}/commits?branch=`	bearer + `read`	list	`server_commit_list`
GET	`/commits/{commit_id}`	`/graphs/{id}/commits/{commit_id}`	bearer + `read`	show	`server_commit_show`
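
For client code that must work against both modes, the table reduces to a prefix rule. The helper below is a hypothetical client-side sketch (`multi_mode_path` is not part of any Omnigraph crate) and assumes the graph id needs no percent-encoding.

```rust
/// Map a single-mode path to its multi-mode equivalent.
/// `/healthz` and `/openapi.json` are the same in both modes.
pub fn multi_mode_path(graph_id: &str, single_mode_path: &str) -> String {
    match single_mode_path {
        "/healthz" | "/openapi.json" => single_mode_path.to_string(),
        // Everything else is nested under /graphs/{id}; any query string
        // rides along unchanged.
        path => format!("/graphs/{graph_id}{path}"),
    }
}
```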

Server-level management endpoints (v0.7.0+):

Method	Path	Auth	Action	Handler
GET	`/graphs`	bearer + `graph_list` on `Server::"root"`	list registered graphs	`server_graphs_list` (405 in single mode)
POST	`/graphs`	bearer + `graph_create` on `Server::"root"`	create new graph at runtime	`server_graphs_create` (405 in single mode, 32 MB body limit)

DELETE /graphs/{id} is not in v0.7.0. Operators remove graphs by stopping the server, editing omnigraph.yaml, then restarting.

`omnigraph.yaml` ownership (multi mode)

The server owns omnigraph.yaml while running. POST /graphs rewrites the file atomically under an exclusive fcntl::flock with SHA-256 drift detection:

The server hashes the file at startup. POST /graphs re-hashes under the flock before rewriting. If the hash doesn't match (operator hand-edited), the rewrite refuses with 503.
Comments and blank-line structure are not preserved across server-side rewrites — the file is regenerated via serde_yaml::to_string.
Operators must not edit the file while the server is running. To make offline changes: stop the server, edit, restart.

In single mode the server never writes omnigraph.yaml.

`POST /graphs` body shape

{
  "graph_id": "alpha",
  "uri": "s3://tenant-bucket/alpha",
  "schema": { "source": "<inline .pg source>" },
  "policy": { "file": "./policies/alpha.yaml" }
}

schema and policy are nested — leaves room for future fields without breaking the shape.
policy is optional; without it, no per-graph Cedar enforcement.
Status codes: 201 Created · 400 invalid body · 401 missing bearer · 403 Cedar denied · 405 single mode · 409 duplicate graph_id or uri · 413 body >32 MiB · 500 init or rewrite failure · 503 YAML drift.
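
A client can turn the status codes listed above into a typed outcome. This is a hedged client-side sketch; `CreateGraphOutcome` and `classify_create_status` are hypothetical names, and the mapping only restates the list above.

```rust
#[derive(Debug, PartialEq)]
pub enum CreateGraphOutcome {
    Created,              // 201
    InvalidBody,          // 400
    MissingBearer,        // 401
    PolicyDenied,         // 403 (Cedar denied)
    SingleMode,           // 405 (server is not in multi mode)
    Duplicate,            // 409 (graph_id or uri already registered)
    BodyTooLarge,         // 413
    InitOrRewriteFailure, // 500
    YamlDrift,            // 503 (omnigraph.yaml changed on disk)
    Unexpected(u16),      // anything the doc does not list
}

pub fn classify_create_status(status: u16) -> CreateGraphOutcome {
    match status {
        201 => CreateGraphOutcome::Created,
        400 => CreateGraphOutcome::InvalidBody,
        401 => CreateGraphOutcome::MissingBearer,
        403 => CreateGraphOutcome::PolicyDenied,
        405 => CreateGraphOutcome::SingleMode,
        409 => CreateGraphOutcome::Duplicate,
        413 => CreateGraphOutcome::BodyTooLarge,
        500 => CreateGraphOutcome::InitOrRewriteFailure,
        503 => CreateGraphOutcome::YamlDrift,
        other => CreateGraphOutcome::Unexpected(other),
    }
}
```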

Streaming

Only /export streams (application/x-ndjson, MPSC channel + Body::from_stream). Everything else is buffered JSON.
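
The channel-backed streaming shape can be illustrated without axum. In the sketch below a producer thread pushes already-serialized rows into a bounded `std::sync::mpsc` channel and the consumer writes one line per row; the real handler uses an async channel and `Body::from_stream`, so the function name, the capacity of 8 and the use of threads are all assumptions made for illustration.

```rust
use std::io::{self, Write};
use std::sync::mpsc;
use std::thread;

/// Write `rows` to `out` as NDJSON (one JSON document per line),
/// passing them through a bounded channel so the producer cannot
/// run far ahead of the writer. Returns the number of lines written.
pub fn stream_ndjson<W: Write>(rows: Vec<String>, mut out: W) -> io::Result<usize> {
    let (tx, rx) = mpsc::sync_channel::<String>(8);

    let producer = thread::spawn(move || {
        for row in rows {
            // A send error means the receiver went away; stop producing.
            if tx.send(row).is_err() {
                break;
            }
        }
    });

    let mut written = 0;
    for row in rx {
        out.write_all(row.as_bytes())?;
        out.write_all(b"\n")?;
        written += 1;
    }
    producer.join().expect("producer thread panicked");
    Ok(written)
}
```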

Error model

Uniform ErrorOutput { error, code?, merge_conflicts[], manifest_conflict? } with code ∈ unauthorized | forbidden | bad_request | not_found | conflict | too_many_requests | internal. Merge conflicts attach structured MergeConflictOutput { table_key, row_id?, kind, message }.

manifest_conflict is set on publisher CAS rejections (HTTP 409): the caller's pre-write view of one table's manifest version was stale. ManifestConflictOutput { table_key, expected, actual } tells the client which table to refresh and retry. This is the conflict shape produced by concurrent /change or /ingest calls landing the same (table, branch) race.
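
On the client side, the refresh-and-retry advice might look like the sketch below. The types are hypothetical stand-ins for a client's decoded response: the field names follow `ManifestConflictOutput`, but the integer types of `expected` and `actual`, the `ChangeError` enum and the retry cap are assumptions.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct ManifestConflict {
    pub table_key: String,
    pub expected: u64, // assumed numeric manifest version
    pub actual: u64,
}

#[derive(Debug, PartialEq)]
pub enum ChangeError {
    ManifestConflict(ManifestConflict),
    Other(String),
}

/// Run `attempt`; on a manifest conflict call `refresh` for the stale
/// table and try again, up to `max_attempts` attempts in total.
pub fn retry_on_manifest_conflict<T>(
    max_attempts: u32,
    mut refresh: impl FnMut(&ManifestConflict),
    mut attempt: impl FnMut() -> Result<T, ChangeError>,
) -> Result<T, ChangeError> {
    let mut tries = 0;
    loop {
        tries += 1;
        match attempt() {
            Err(ChangeError::ManifestConflict(conflict)) if tries < max_attempts => {
                refresh(&conflict)
            }
            // Success, a non-conflict error, or the last allowed attempt.
            other => return other,
        }
    }
}
```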

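HTTP status codes used: 200, 400, 401, 403, 404, 409, 429, 500. POST /graphs additionally returns 201, 413, and 503, and the /graphs management endpoints return 405 in single mode.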

Per-actor admission control

Disjoint (table, branch) writes from different actors now run concurrently, guarded only by the engine's per-(table, branch) write queue. To keep one heavy actor from exhausting shared capacity (Lance I/O, manifest churn, network), the server gates mutating handlers through a WorkloadController configured per-process from environment variables:

Env var	Default	Purpose
`OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX`	16	Concurrent in-flight mutations per actor
`OMNIGRAPH_PER_ACTOR_BYTES_MAX`	4 GiB	In-flight estimated bytes per actor

When an actor exceeds its in-flight count or byte budget, the server returns HTTP 429 Too Many Requests with code: too_many_requests and a Retry-After header (seconds). The actor should back off; other actors are unaffected.

Cedar policy authorization runs before admission accounting so denied requests don't consume admission slots.

Today admission gates every mutating handler: /change, /ingest, /branches/{create,delete,merge}, and /schema/apply. Read-only endpoints (/snapshot, /read, /export, /branches GET, /commits, /schema GET) are not admission-gated.
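
The per-actor accounting described above can be sketched as a map of counters behind a mutex. This is not the `WorkloadController` implementation: `Admission`, `try_admit`, `release` and the fixed `retry_after_secs` of 1 are hypothetical, and only the two default limits come from the table.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

pub const DEFAULT_INFLIGHT_MAX: usize = 16;
pub const DEFAULT_BYTES_MAX: u64 = 4 * 1024 * 1024 * 1024; // 4 GiB

/// Maps to HTTP 429 with a Retry-After header.
#[derive(Debug, PartialEq)]
pub struct TooManyRequests {
    pub retry_after_secs: u64,
}

#[derive(Default)]
struct Usage {
    inflight: usize,
    bytes: u64,
}

pub struct Admission {
    inflight_max: usize,
    bytes_max: u64,
    usage: Mutex<HashMap<String, Usage>>,
}

impl Admission {
    pub fn new(inflight_max: usize, bytes_max: u64) -> Self {
        Self { inflight_max, bytes_max, usage: Mutex::new(HashMap::new()) }
    }

    /// Reserve one in-flight slot and `est_bytes` for `actor`, or reject.
    pub fn try_admit(&self, actor: &str, est_bytes: u64) -> Result<(), TooManyRequests> {
        let mut usage = self.usage.lock().unwrap();
        let u = usage.entry(actor.to_string()).or_default();
        if u.inflight + 1 > self.inflight_max
            || u.bytes.saturating_add(est_bytes) > self.bytes_max
        {
            return Err(TooManyRequests { retry_after_secs: 1 });
        }
        u.inflight += 1;
        u.bytes += est_bytes;
        Ok(())
    }

    /// Give the reservation back once the mutation finishes.
    pub fn release(&self, actor: &str, est_bytes: u64) {
        let mut usage = self.usage.lock().unwrap();
        if let Some(u) = usage.get_mut(actor) {
            u.inflight = u.inflight.saturating_sub(1);
            u.bytes = u.bytes.saturating_sub(est_bytes);
        }
    }
}
```

Because the counters are keyed by actor, one actor hitting its limit does not affect admission for any other actor.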

Body limits

Default: 1 MB
/ingest: 32 MB
POST /graphs: 32 MB

Auth model (`bearer + SHA-256`)

Tokens are SHA-256 hashed on startup; plaintext is never persisted in memory.
Constant-time comparison via subtle::ConstantTimeEq.
Three sources, in precedence:
1. OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET — AWS Secrets Manager (build with --features aws)
2. OMNIGRAPH_SERVER_BEARER_TOKENS_FILE or OMNIGRAPH_SERVER_BEARER_TOKENS_JSON — JSON {actor_id: token, …}
3. OMNIGRAPH_SERVER_BEARER_TOKEN — single legacy token, actor default
If no tokens configured, server runs unauthenticated (local dev) and /openapi.json strips the security scheme.
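
The source precedence and the digest comparison can be sketched as follows. Both functions are illustrative and hypothetical: `pick_token_source` takes a lookup closure instead of reading the process environment, treats FILE and JSON as one tier because their relative order is not stated above, and `digests_equal` is a hand-rolled stand-in for `subtle::ConstantTimeEq` (production code should use the crate, since a compiler may optimize a manual loop).

```rust
#[derive(Debug, PartialEq)]
pub enum TokenSource {
    AwsSecret,
    FileOrJson,
    SingleToken,
    Unauthenticated,
}

/// Pick the highest-precedence configured source.
/// `get` looks up an environment variable by name.
pub fn pick_token_source(get: impl Fn(&str) -> Option<String>) -> TokenSource {
    let is_set = |name: &str| get(name).is_some();
    if is_set("OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET") {
        TokenSource::AwsSecret
    } else if is_set("OMNIGRAPH_SERVER_BEARER_TOKENS_FILE")
        || is_set("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON")
    {
        TokenSource::FileOrJson
    } else if is_set("OMNIGRAPH_SERVER_BEARER_TOKEN") {
        TokenSource::SingleToken
    } else {
        // No tokens configured: local-dev, unauthenticated mode.
        TokenSource::Unauthenticated
    }
}

/// Compare two 32-byte digests without returning early on the first
/// mismatch: every byte pair is XOR-ed and the differences are OR-ed.
pub fn digests_equal(a: &[u8; 32], b: &[u8; 32]) -> bool {
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```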

See deployment.md for token-source operational details.

Tracing & observability

tower_http::TraceLayer::new_for_http()
Policy decisions logged at INFO level with actor, action, branch, decision, matched rule
Startup logs: token source name, graph URI, bind address
Graceful SIGINT shutdown

Not implemented (by design or "TBD")

CORS — not configured; add tower_http::cors if needed.
Rate limiting — per-actor admission control gates /change, /ingest, /branches/{create,delete,merge}, /schema/apply (see "Per-actor admission control" above). No global rate limiter is configured; add tower_http::limit if a graph-wide cap is needed.
Pagination — none (commits/branches return everything; export streams).
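Multi-tenant routing — one graph per process in single mode only; multi mode (v0.7.0+) serves every configured graph under /graphs/{graph_id}/... (see "Modes" above).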
