The POST /graphs runtime-create endpoint shipped in PR 7/10 has three
unresolved high-severity bugs:
- flock-on-renamed-inode race: the YAML flock is taken on
omnigraph.yaml itself, then a temp file is renamed over it.
Cross-process writers end up locking different inodes — both
believing they hold exclusive access.
- duplicate-check outside the file lock: precheck runs against
the in-memory registry only; the locked closure does
config.graphs.insert(...) unconditionally. Concurrent same-id
POSTs can persist the loser in YAML while the in-memory registry
keeps the winner — they disagree after restart.
- best_effort_cleanup_init_artifacts deletes _schema.pg /
_schema.ir.json / __schema_state.json on any init failure. An
accidental re-init against an existing graph's URI destroys its
schema; subsequent open() fails at read_text(_schema.pg).
The correct fix is a Lance-style cluster catalog (reserve → init →
publish with recovery sidecars), parallel to the engine's existing
__manifest discipline. That work is out of scope for v0.7.0.
For now, disable runtime add/remove from the network and CLI surface.
Operators add graphs by editing omnigraph.yaml and restarting. The
GET /graphs read-only enumeration stays.
Removed:
- POST /graphs handler + router fragment + utoipa registration
- 13 post_graphs_* server tests + 3 composite POST tests +
multi_mode_app_with_real_config / post_graph helpers
- CLI omnigraph graphs create subcommand + its handler + cli.rs tests
- system_remote.rs combined list+create test trimmed to list-only
- YAML rewrite infra: rewrite_atomic[_with_modify], RewriteAtomicError,
staging_path, hash_config_file, AppState::config_hash field +
threading through new_multi and open_multi_graph_state
- fs2 dependency (verified absent from cargo tree)
- sha2/fs2 imports in config.rs (only the rewrite path used them)
- Cedar PolicyAction::GraphCreate variant + "graph_create" match arms
+ action def in Cedar schema + graph_create_action_authorizes_against_server_resource test
- GraphCreateRequest / GraphCreateResponse / GraphSchemaSpec /
GraphPolicySpec API types (only the POST handler / CLI imported them)
Kept:
- GET /graphs (read-only enumeration) and graph_list Cedar action
- omnigraph graphs list CLI subcommand
- All multi-graph startup, mode inference, cluster routes,
per-graph + server-level Cedar policies
- server_settings_drive_multi_graph_startup_end_to_end (the test
that covers operator-authored YAML + restart — the path that
survives)
- best_effort_cleanup_init_artifacts and the three init failpoints
(still reachable from CLI `omnigraph init`; preflight fix deferred
as a follow-up)
- GraphRegistry::insert and its concurrency tests — production
callers gone, but the method is the natural seam for the future
cluster-catalog work
Also fixed (transcript issue 4):
- ALWAYS_FLAT_PATHS now includes /graphs so multi-mode OpenAPI
advertises the management route correctly (was previously rewritten
to /graphs/{graph_id}/graphs)
- multi_mode_openapi_keeps_healthz_flat → renamed to
multi_mode_openapi_keeps_management_paths_flat, asserts both
/healthz and /graphs stay flat
- multi_mode_openapi_prefixes_operation_ids_with_cluster skips
/graphs in addition to /healthz
Doc fixes:
- docs/user/cli.md: graphs list example was --target http://...,
but --target is a config-graph-name lookup; corrected to --uri.
Removed the graphs create example.
- docs/user/server.md: dropped POST /graphs row, "omnigraph.yaml
ownership", and "POST /graphs body shape" sections. Added a
paragraph stating runtime add/remove is not exposed in v0.7.0.
- docs/user/policy.md: dropped graph_create action; reworded the
"Configuration" line to clarify that server-scoped rules (graph_list)
take neither branch_scope nor target_branch_scope.
- docs/releases/v0.7.0.md: rewrote release narrative — multi-graph
mode ships; runtime add/remove deferred.
- AGENTS.md: HTTP server bullet and capability matrix row updated to
reflect read-only GET /graphs and the operator-edit workflow.
- openapi.json regenerated; /graphs has only .get, no .post.
Diff: 17 files, +123 −1525 LOC.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 9 — the final integration PR for MR-668 multi-graph server work.
Closes the v0.7.0 release.
Composite lifecycle tests (closes gaps flagged in PR 7's coverage
review):
- `multi_graph_lifecycle_post_query_restart_persistence` — POST a
graph, query it via cluster route, reload the config from disk
and confirm `load_server_settings` sees the rewritten YAML.
Validates the "restart resolves orphans" failure-mode story.
- `per_graph_policy_enforced_on_post_created_graph` — POST a graph
with a per-graph policy attached, then send authenticated read
and change requests. Per-graph Cedar enforcement fires correctly
on a POST-created graph (engine-layer policy reinstalled via
`Omnigraph::with_policy` inside the create flow).
- `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent
POSTs with distinct graph_ids all return 201. Caught a real
race in `rewrite_atomic` (see below).
Race fix — `rewrite_atomic_with_modify`:
The first composite test surfaced a real bug. The old
`rewrite_atomic(path, new_config, expected_hash)` captured the
baseline hash OUTSIDE the flock, then called rewrite_atomic which
re-acquired it inside. Under concurrent writers:
- POST A: captures baseline H0, calls rewrite_atomic.
- POST B: captures baseline H0 too (before A's update lands).
- A: acquires flock, on-disk == H0, writes H1, releases.
- A: updates baseline H0 → H1.
- B: tries to acquire flock — waits.
- B: acquires flock. On-disk is now H1. Expected (captured
before A finished) is H0. MISMATCH → spurious Drift error.
Worse: even if the timing happens to align, B's `updated` config
was constructed from BYTES read before the flock. B writes a config
that doesn't include A's new graph — silent data loss.
The fix: new `config::rewrite_atomic_with_modify(path, baseline,
modify)` takes a closure. Inside the flock + baseline mutex:
1. Read on-disk bytes, hash, compare to baseline.
2. Parse on-disk YAML.
3. Call `modify(parsed)` to produce the new config — receives
fresh on-disk state, returns the modification.
4. Serialize + write + fsync + rename + update baseline.
Everything is read-modify-write under the same critical section.
Concurrent writers serialize cleanly. Test confirmed this is no
longer a race.
The old `rewrite_atomic(path, new_config, expected_hash)` API stays
for tests that don't need the read-modify-write shape; the POST
handler switches to the new shape.
Version bump v0.6.0 → v0.7.0:
- All 5 `crates/*/Cargo.toml` (compiler, engine, policy, cli, server)
plus their inter-crate `path` dep version constraints.
- `Cargo.lock` regenerated by `cargo build --workspace`.
- `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server
row updated to mention multi-graph + cluster routes + atomic YAML
rewrite.
- `openapi.json` regenerated.
Docs:
- `docs/releases/v0.7.0.md` (new) — release notes with breaking
changes, new features, deferred items (DELETE, `delete_prefix`,
actor forwarding), and the single→multi migration recipe.
- `docs/user/server.md` — substantial section additions for the
two modes, mode inference, cluster endpoint table, management
endpoints, `omnigraph.yaml` ownership contract, `POST /graphs`
body shape + status codes.
- `docs/user/cli.md` — `omnigraph graphs list/create` section,
deferred-DELETE note.
- `docs/user/policy.md` — server-scoped Cedar actions
(`graph_create`, `graph_list`), per-graph vs server-level policy
composition, example server-level policy.
Workspace test pass: 573 tests green across all crates. Zero
failures. MR-731 spoof regression still pinned and passing across
the entire 10-PR series.
This commit closes MR-668. v0.7.0 is ready for tagging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>