mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-12 01:45:14 +02:00
The POST /graphs runtime-create endpoint shipped in PR 7/10 has three
unresolved high-severity bugs:
- flock-on-renamed-inode race: the YAML flock is taken on
omnigraph.yaml itself, then a temp file is renamed over it.
Cross-process writers end up locking different inodes — both
believing they hold exclusive access.
- duplicate-check outside the file lock: precheck runs against
the in-memory registry only; the locked closure does
config.graphs.insert(...) unconditionally. Concurrent same-id
POSTs can persist the loser in YAML while the in-memory registry
keeps the winner — they disagree after restart.
- best_effort_cleanup_init_artifacts deletes _schema.pg /
_schema.ir.json / __schema_state.json on any init failure. An
accidental re-init against an existing graph's URI destroys its
schema; subsequent open() fails at read_text(_schema.pg).
The correct fix is a Lance-style cluster catalog (reserve → init →
publish with recovery sidecars), parallel to the engine's existing
__manifest discipline. That work is out of scope for v0.7.0.
For now, disable runtime add/remove from the network and CLI surface.
Operators add graphs by editing omnigraph.yaml and restarting. The
GET /graphs read-only enumeration stays.
Removed:
- POST /graphs handler + router fragment + utoipa registration
- 13 post_graphs_* server tests + 3 composite POST tests +
multi_mode_app_with_real_config / post_graph helpers
- CLI omnigraph graphs create subcommand + its handler + cli.rs tests
- system_remote.rs combined list+create test trimmed to list-only
- YAML rewrite infra: rewrite_atomic[_with_modify], RewriteAtomicError,
staging_path, hash_config_file, AppState::config_hash field +
threading through new_multi and open_multi_graph_state
- fs2 dependency (verified absent from cargo tree)
- sha2/fs2 imports in config.rs (only the rewrite path used them)
- Cedar PolicyAction::GraphCreate variant + "graph_create" match arms
+ action def in Cedar schema + graph_create_action_authorizes_against_server_resource test
- GraphCreateRequest / GraphCreateResponse / GraphSchemaSpec /
GraphPolicySpec API types (only the POST handler / CLI imported them)
Kept:
- GET /graphs (read-only enumeration) and graph_list Cedar action
- omnigraph graphs list CLI subcommand
- All multi-graph startup, mode inference, cluster routes,
per-graph + server-level Cedar policies
- server_settings_drive_multi_graph_startup_end_to_end (the test
that covers operator-authored YAML + restart — the path that
survives)
- best_effort_cleanup_init_artifacts and the three init failpoints
(still reachable from CLI `omnigraph init`; preflight fix deferred
as a follow-up)
- GraphRegistry::insert and its concurrency tests — production
callers gone, but the method is the natural seam for the future
cluster-catalog work
Also fixed (transcript issue 4):
- ALWAYS_FLAT_PATHS now includes /graphs so multi-mode OpenAPI
advertises the management route correctly (was previously rewritten
to /graphs/{graph_id}/graphs)
- multi_mode_openapi_keeps_healthz_flat → renamed to
multi_mode_openapi_keeps_management_paths_flat, asserts both
/healthz and /graphs stay flat
- multi_mode_openapi_prefixes_operation_ids_with_cluster skips
/graphs in addition to /healthz
Doc fixes:
- docs/user/cli.md: graphs list example was --target http://...,
but --target is a config-graph-name lookup; corrected to --uri.
Removed the graphs create example.
- docs/user/server.md: dropped POST /graphs row, "omnigraph.yaml
ownership", and "POST /graphs body shape" sections. Added a
paragraph stating runtime add/remove is not exposed in v0.7.0.
- docs/user/policy.md: dropped graph_create action; reworded the
"Configuration" line to clarify that server-scoped rules (graph_list)
take neither branch_scope nor target_branch_scope.
- docs/releases/v0.7.0.md: rewrote release narrative — multi-graph
mode ships; runtime add/remove deferred.
- AGENTS.md: HTTP server bullet and capability matrix row updated to
reflect read-only GET /graphs and the operator-edit workflow.
- openapi.json regenerated; /graphs has only .get, no .post.
Diff: 17 files, +123 −1525 LOC.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
139 lines
7.2 KiB
Markdown
# HTTP Server (`omnigraph-server`)

Axum 0.8 + tokio + utoipa-generated OpenAPI. **Two modes** (v0.7.0+): single-graph (legacy) and multi-graph (MR-668). Mode is inferred from CLI args + config shape.

## Modes

### Single-graph mode (legacy)

`omnigraph-server <URI>` or `omnigraph-server --target <name> --config omnigraph.yaml`. Routes are flat (`/snapshot`, `/read`, `/branches`, etc.). Behavior is unchanged from v0.6.0.

### Multi-graph mode (v0.7.0+)

`omnigraph-server --config omnigraph.yaml` with a non-empty `graphs:` map and **no** single-mode selector (no `server.graph`, no `<URI>`, no `--target`). The server opens every configured graph in parallel at startup (bounded concurrency = 4, fail-fast on the first open error). Routes are nested under `/graphs/{graph_id}/...`. Bare flat paths return 404 in multi mode.

Mode inference (four rules plus an error fallback):

1. CLI positional `<URI>` → single
2. CLI `--target <name>` → single
3. `server.graph` in config → single
4. `--config` + non-empty `graphs:` + no single-mode selector → **multi**
5. otherwise → error with migration hint

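The inference rules above can be sketched as a small pure function. This is a minimal illustration, not the server's actual code: `Mode`, `ModeInputs`, `infer_mode`, and the error text are hypothetical names introduced here.

```rust
/// Routing layout the server ends up exposing.
#[derive(Debug, PartialEq, Eq)]
pub enum Mode {
    Single,
    Multi,
}

/// Hypothetical flattened view of the inputs that drive mode inference.
#[derive(Debug, Default)]
pub struct ModeInputs<'a> {
    pub positional_uri: Option<&'a str>, // CLI `<URI>`
    pub target: Option<&'a str>,         // CLI `--target <name>`
    pub server_graph: Option<&'a str>,   // `server.graph` in the config
    pub has_config: bool,                // `--config` was passed
    pub graph_count: usize,              // entries in the `graphs:` map
}

/// Apply the four rules; anything else is the error fallback.
pub fn infer_mode(inputs: &ModeInputs<'_>) -> Result<Mode, String> {
    // Rules 1-3: any single-mode selector forces single-graph mode.
    if inputs.positional_uri.is_some()
        || inputs.target.is_some()
        || inputs.server_graph.is_some()
    {
        return Ok(Mode::Single);
    }
    // Rule 4: a config with a non-empty `graphs:` map and no selector.
    if inputs.has_config && inputs.graph_count > 0 {
        return Ok(Mode::Multi);
    }
    // Rule 5: nothing matched. The message wording is an assumption.
    Err("cannot infer server mode; pass <URI>, --target, or a config with graphs".to_string())
}
```

Because rule 4 requires the absence of every single-mode selector, the rules are mutually exclusive and the order of the checks does not change the result.
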
## Endpoint inventory

Per-graph endpoints (same body shape across modes; URLs differ):

| Method | Single-mode path | Multi-mode path | Auth | Action | Handler |
|---|---|---|---|---|---|
| GET | `/healthz` | `/healthz` | none | — | `server_health` |
| GET | `/openapi.json` | `/openapi.json` | none | — | `server_openapi` (strips security if auth disabled; in multi mode emits cluster paths with `cluster_` operation-id prefix) |
| GET | `/snapshot?branch=` | `/graphs/{id}/snapshot?branch=` | bearer + `read` | snapshot of branch | `server_snapshot` |
| POST | `/read` | `/graphs/{id}/read` | bearer + `read` | run named query | `server_read` |
| POST | `/export` | `/graphs/{id}/export` | bearer + `export` | NDJSON stream | `server_export` |
| POST | `/change` | `/graphs/{id}/change` | bearer + `change` | mutation | `server_change` |
| GET | `/schema` | `/graphs/{id}/schema` | bearer + `read` | get current `.pg` source | `server_schema_get` |
| POST | `/schema/apply` | `/graphs/{id}/schema/apply` | bearer + `schema_apply` (target=`main`) | migrate | `server_schema_apply` |
| POST | `/ingest` | `/graphs/{id}/ingest` | bearer + `branch_create` (if new) + `change` | bulk load | `server_ingest` (32 MB body limit) |
| GET | `/branches` | `/graphs/{id}/branches` | bearer + `read` | list branches | `server_branch_list` |
| POST | `/branches` | `/graphs/{id}/branches` | bearer + `branch_create` | create | `server_branch_create` |
| DELETE | `/branches/{branch}` | `/graphs/{id}/branches/{branch}` | bearer + `branch_delete` | delete | `server_branch_delete` |
| POST | `/branches/merge` | `/graphs/{id}/branches/merge` | bearer + `branch_merge` | merge `source → target` | `server_branch_merge` |
| GET | `/commits?branch=` | `/graphs/{id}/commits?branch=` | bearer + `read` | list | `server_commit_list` |
| GET | `/commits/{commit_id}` | `/graphs/{id}/commits/{commit_id}` | bearer + `read` | show | `server_commit_show` |

Server-level management endpoints (v0.7.0+):

| Method | Path | Auth | Action | Handler |
|---|---|---|---|---|
| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list registered graphs | `server_graphs_list` (405 in single mode) |

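For client authors, the single-mode to multi-mode path mapping in the tables above can be sketched as follows. This is an illustrative helper, not the server's router: `multi_mode_path` is a hypothetical name, and the contents of `ALWAYS_FLAT_PATHS` here are inferred from the tables (`/healthz`, `/openapi.json`, `/graphs` keep the same path in both modes) rather than copied from the source.

```rust
/// Paths that stay at the root in multi mode (assumed from the tables above).
const ALWAYS_FLAT_PATHS: [&str; 3] = ["/healthz", "/openapi.json", "/graphs"];

/// Map a single-mode path (optionally with a query string) to its
/// multi-mode counterpart for `graph_id`.
pub fn multi_mode_path(graph_id: &str, flat_path: &str) -> String {
    // Split off the query string so only the path part is inspected.
    let (path, query) = match flat_path.split_once('?') {
        Some((p, q)) => (p, Some(q)),
        None => (flat_path, None),
    };
    let mapped = if ALWAYS_FLAT_PATHS.contains(&path) {
        // Management and health routes are not nested under a graph.
        path.to_string()
    } else {
        format!("/graphs/{graph_id}{path}")
    };
    // Re-attach the query string unchanged.
    match query {
        Some(q) => format!("{mapped}?{q}"),
        None => mapped,
    }
}
```

The helper does no validation of `graph_id`; a real client would likely percent-encode it before building the URL.
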
## Adding and removing graphs (multi mode)

Runtime add/remove via API is **not** exposed in v0.7.0: neither `POST /graphs` nor `DELETE /graphs/{id}` is implemented. Operators add or remove graphs by stopping the server, editing the `graphs:` map in `omnigraph.yaml`, then restarting. The server treats `omnigraph.yaml` as operator-owned configuration and never writes it.

A future release may introduce a managed registry (Lance-backed, catalog-style: reserve → init → publish with recovery sidecars) and re-expose runtime mutation on top of it.

## Streaming

Only `/export` streams (`application/x-ndjson`, MPSC channel + `Body::from_stream`). Everything else is buffered JSON.

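The channel-fed NDJSON pattern can be illustrated with the standard library alone. The sketch below is a stand-in under stated assumptions: it uses `std::sync::mpsc` and a thread where the server uses an async MPSC channel and `Body::from_stream`, the `(id, name)` row shape is invented for the example, and `export_ndjson` is a hypothetical name. It collects the chunks into a `String` only so the result is easy to inspect.

```rust
use std::sync::mpsc;
use std::thread;

/// Produce one JSON object per line through a bounded channel and
/// concatenate the chunks in arrival order.
pub fn export_ndjson(rows: Vec<(u64, String)>) -> String {
    // Bounded channel: the producer blocks when the consumer falls behind.
    let (tx, rx) = mpsc::sync_channel::<String>(8);

    let producer = thread::spawn(move || {
        for (id, name) in rows {
            // Hand-rolled JSON for the sketch; `name` is not escaped here.
            let line = format!("{{\"id\":{},\"name\":\"{}\"}}\n", id, name);
            if tx.send(line).is_err() {
                break; // consumer went away, stop producing
            }
        }
        // `tx` is dropped here, which ends the consumer loop below.
    });

    let mut body = String::new();
    for chunk in rx {
        body.push_str(&chunk);
    }
    producer.join().expect("producer thread panicked");
    body
}
```

Each chunk is a complete line, so a consumer can parse rows as they arrive instead of waiting for the whole body.
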
## Error model

Uniform `ErrorOutput { error, code?, merge_conflicts[], manifest_conflict? }` with `code ∈ unauthorized | forbidden | bad_request | not_found | conflict | too_many_requests | internal`. Merge conflicts attach structured `MergeConflictOutput { table_key, row_id?, kind, message }`.

`manifest_conflict` is set on **publisher CAS rejections** (HTTP 409): the caller's pre-write view of one table's manifest version was stale. `ManifestConflictOutput { table_key, expected, actual }` tells the client which table to refresh and retry. This is the conflict shape produced when concurrent `/change` or `/ingest` calls race on the same `(table, branch)`.

HTTP status codes used: 200, 400, 401, 403, 404, 405 (`GET /graphs` in single mode), 409, 429, 500.

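A client-side view of the error model might look like the sketch below. The code names come from the text above; the code-to-status pairing is an assumption based on conventional HTTP meanings (only 409 and 429 are tied to specific codes in this document), the `u64` version type is assumed, and `table_to_refresh` is a hypothetical helper.

```rust
/// Values of `ErrorOutput.code`, as listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorCode {
    Unauthorized,
    Forbidden,
    BadRequest,
    NotFound,
    Conflict,
    TooManyRequests,
    Internal,
}

impl ErrorCode {
    /// Wire spelling of the code.
    pub fn as_str(self) -> &'static str {
        match self {
            ErrorCode::Unauthorized => "unauthorized",
            ErrorCode::Forbidden => "forbidden",
            ErrorCode::BadRequest => "bad_request",
            ErrorCode::NotFound => "not_found",
            ErrorCode::Conflict => "conflict",
            ErrorCode::TooManyRequests => "too_many_requests",
            ErrorCode::Internal => "internal",
        }
    }

    /// Assumed pairing of each code with an HTTP status.
    pub fn http_status(self) -> u16 {
        match self {
            ErrorCode::Unauthorized => 401,
            ErrorCode::Forbidden => 403,
            ErrorCode::BadRequest => 400,
            ErrorCode::NotFound => 404,
            ErrorCode::Conflict => 409,
            ErrorCode::TooManyRequests => 429,
            ErrorCode::Internal => 500,
        }
    }
}

/// Mirrors `ManifestConflictOutput { table_key, expected, actual }`.
/// The `u64` version type is an assumption.
#[derive(Debug)]
pub struct ManifestConflictOutput {
    pub table_key: String,
    pub expected: u64,
    pub actual: u64,
}

/// Hypothetical helper: on a 409 carrying a manifest conflict, return the
/// table whose manifest the client should refresh before retrying.
pub fn table_to_refresh<'a>(
    code: ErrorCode,
    conflict: Option<&'a ManifestConflictOutput>,
) -> Option<&'a str> {
    match (code, conflict) {
        (ErrorCode::Conflict, Some(c)) => Some(c.table_key.as_str()),
        _ => None,
    }
}
```

A conflict without `manifest_conflict` (for example a merge conflict) returns `None` here, since a blind retry would not help in that case.
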
## Per-actor admission control

Disjoint `(table, branch)` writes from different actors now run concurrently, guarded only by the engine's per-(table, branch) write queue. To keep one heavy actor from exhausting shared capacity (Lance I/O, manifest churn, network), the server gates mutating handlers through a `WorkloadController` configured per-process from environment variables:

| Env var | Default | Purpose |
|---|---|---|
| `OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX` | 16 | Concurrent in-flight mutations per actor |
| `OMNIGRAPH_PER_ACTOR_BYTES_MAX` | 4 GiB | In-flight estimated bytes per actor |

When an actor exceeds its in-flight count or byte budget, the server returns **HTTP 429 Too Many Requests** with `code: too_many_requests` and a `Retry-After` header (seconds). The actor should back off; other actors are unaffected.

Cedar policy authorization runs **before** admission accounting so denied requests don't consume admission slots.

Today admission gates every mutating handler: `/change`, `/ingest`, `/branches/{create,delete,merge}`, and `/schema/apply`. Read-only endpoints (`/snapshot`, `/read`, `/export`, `/branches` GET, `/commits`, `/schema` GET) are not admission-gated.

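The per-actor accounting described above can be sketched as a pair of counters per actor. This is a simplified, single-threaded illustration that reuses the `WorkloadController` name from the text. The method names, the `TooManyRequests` type, and the fixed one-second retry hint are assumptions, and the real controller reads its limits from the environment variables in the table and would need to be safe for concurrent use.

```rust
use std::collections::HashMap;

/// Rejection that a handler would turn into HTTP 429 plus `Retry-After`.
#[derive(Debug, PartialEq, Eq)]
pub struct TooManyRequests {
    pub retry_after_secs: u64,
}

#[derive(Default)]
struct ActorUsage {
    inflight: usize,
    bytes: u64,
}

pub struct WorkloadController {
    inflight_max: usize,
    bytes_max: u64,
    usage: HashMap<String, ActorUsage>,
}

impl WorkloadController {
    pub fn new(inflight_max: usize, bytes_max: u64) -> Self {
        Self { inflight_max, bytes_max, usage: HashMap::new() }
    }

    /// Documented defaults: 16 in-flight mutations, 4 GiB of estimated bytes.
    pub fn with_defaults() -> Self {
        Self::new(16, 4 * 1024 * 1024 * 1024)
    }

    /// Reserve one slot and `est_bytes` for `actor`, or reject the request.
    pub fn try_admit(&mut self, actor: &str, est_bytes: u64) -> Result<(), TooManyRequests> {
        let u = self.usage.entry(actor.to_string()).or_default();
        let over_count = u.inflight + 1 > self.inflight_max;
        let over_bytes = u.bytes.saturating_add(est_bytes) > self.bytes_max;
        if over_count || over_bytes {
            // The retry hint value is an assumption for this sketch.
            return Err(TooManyRequests { retry_after_secs: 1 });
        }
        u.inflight += 1;
        u.bytes += est_bytes;
        Ok(())
    }

    /// Give back what `try_admit` reserved once the mutation finishes.
    pub fn release(&mut self, actor: &str, est_bytes: u64) {
        if let Some(u) = self.usage.get_mut(actor) {
            u.inflight = u.inflight.saturating_sub(1);
            u.bytes = u.bytes.saturating_sub(est_bytes);
        }
    }
}
```

Budgets are tracked per actor, so one actor hitting its limit leaves other actors' admissions untouched, which matches the behavior described above.
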
## Body limits

- Default: 1 MB
- `/ingest`: 32 MB

## Auth model (`bearer + SHA-256`)

- Tokens are SHA-256 hashed on startup; the plaintext is not retained in memory.
- Constant-time comparison via `subtle::ConstantTimeEq`.
- Three sources, in order of precedence:
  1. `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET`: AWS Secrets Manager (build with `--features aws`)
  2. `OMNIGRAPH_SERVER_BEARER_TOKENS_FILE` or `OMNIGRAPH_SERVER_BEARER_TOKENS_JSON`: JSON `{actor_id: token, …}`
  3. `OMNIGRAPH_SERVER_BEARER_TOKEN`: single legacy token, actor `default`
- If no tokens are configured, the server runs unauthenticated (local dev) and `/openapi.json` strips the security scheme.

See [deployment.md](deployment.md) for token-source operational details.

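The source precedence above can be sketched as a lookup over the environment. This is an illustration only: `TokenSource` and `resolve_token_source` are hypothetical names, the environment is passed in as a map so the function stays pure, and checking `_FILE` before `_JSON` is an assumption because the text does not rank those two against each other.

```rust
use std::collections::HashMap;

/// Where bearer tokens are loaded from.
#[derive(Debug, PartialEq, Eq)]
pub enum TokenSource {
    AwsSecret,
    File,
    Json,
    SingleToken,
    /// No tokens configured: the server runs unauthenticated (local dev).
    Unauthenticated,
}

/// Pick the highest-precedence source whose variable is present in `env`.
pub fn resolve_token_source(env: &HashMap<&str, &str>) -> TokenSource {
    if env.contains_key("OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET") {
        TokenSource::AwsSecret
    } else if env.contains_key("OMNIGRAPH_SERVER_BEARER_TOKENS_FILE") {
        // Assumption: FILE wins over JSON when both are set.
        TokenSource::File
    } else if env.contains_key("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON") {
        TokenSource::Json
    } else if env.contains_key("OMNIGRAPH_SERVER_BEARER_TOKEN") {
        TokenSource::SingleToken
    } else {
        TokenSource::Unauthenticated
    }
}
```

Hashing and comparison are left out because they depend on external crates; the sketch covers only which source is selected.
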
## Tracing & observability

- `tower_http::TraceLayer::new_for_http()`
- Policy decisions logged at INFO level with actor, action, branch, decision, matched rule
- Startup logs: token source name, graph URI, bind address
- Graceful SIGINT shutdown

## Not implemented (by design or "TBD")

- CORS: not configured; add `tower_http::cors` if needed.
- Rate limiting: per-actor admission control gates `/change`, `/ingest`, `/branches/{create,delete,merge}`, `/schema/apply` (see "Per-actor admission control" above). No global rate limiter is configured; add `tower::limit` (for example `RateLimitLayer` or `ConcurrencyLimitLayer`) if a graph-wide cap is needed.
- Pagination: none (commits/branches return everything; export streams).
- Multi-tenant routing: single-graph mode serves one graph per process; multi-graph mode (v0.7.0+) routes by `/graphs/{graph_id}/...` only.