mr-668: composite e2e tests, race fix, v0.7.0 release (PR 9/10)

PR 9 — the final integration PR for MR-668 multi-graph server work.
Closes the v0.7.0 release.

Composite lifecycle tests (closes gaps flagged in PR 7's coverage
review):
  - `multi_graph_lifecycle_post_query_restart_persistence` — POST a
    graph, query it via cluster route, reload the config from disk
    and confirm `load_server_settings` sees the rewritten YAML.
    Validates the "restart resolves orphans" failure-mode story.
  - `per_graph_policy_enforced_on_post_created_graph` — POST a graph
    with a per-graph policy attached, then send authenticated read
    and change requests. Per-graph Cedar enforcement fires correctly
    on a POST-created graph (engine-layer policy reinstalled via
    `Omnigraph::with_policy` inside the create flow).
  - `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent
    POSTs with distinct graph_ids all return 201. Caught a real
    race in `rewrite_atomic` (see below).

Race fix — `rewrite_atomic_with_modify`:

The first composite test surfaced a real bug. The old
`rewrite_atomic(path, new_config, expected_hash)` captured the
baseline hash OUTSIDE the flock, then called rewrite_atomic which
re-acquired it inside. Under concurrent writers:

  - POST A: captures baseline H0, calls rewrite_atomic.
  - POST B: captures baseline H0 too (before A's update lands).
  - A: acquires flock, on-disk == H0, writes H1, releases.
  - A: updates baseline H0 → H1.
  - B: tries to acquire flock — waits.
  - B: acquires flock. On-disk is now H1. Expected (captured
       before A finished) is H0. MISMATCH → spurious Drift error.

Worse: even if the timing happens to align, B's `updated` config
was constructed from BYTES read before the flock. B writes a config
that doesn't include A's new graph — silent data loss.

The fix: new `config::rewrite_atomic_with_modify(path, baseline,
modify)` takes a closure. Inside the flock + baseline mutex:
  1. Read on-disk bytes, hash, compare to baseline.
  2. Parse on-disk YAML.
  3. Call `modify(parsed)` to produce the new config — receives
     fresh on-disk state, returns the modification.
  4. Serialize + write + fsync + rename + update baseline.

Everything is read-modify-write under the same critical section.
Concurrent writers serialize cleanly. Test confirmed this is no
longer a race.

The old `rewrite_atomic(path, new_config, expected_hash)` API stays
for tests that don't need the read-modify-write shape; the POST
handler switches to the new shape.

Version bump v0.6.0 → v0.7.0:
  - All 5 `crates/*/Cargo.toml` (compiler, engine, policy, cli, server)
    plus their inter-crate `path` dep version constraints.
  - `Cargo.lock` regenerated by `cargo build --workspace`.
  - `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server
    row updated to mention multi-graph + cluster routes + atomic YAML
    rewrite.
  - `openapi.json` regenerated.

Docs:
  - `docs/releases/v0.7.0.md` (new) — release notes with breaking
    changes, new features, deferred items (DELETE, `delete_prefix`,
    actor forwarding), and the single→multi migration recipe.
  - `docs/user/server.md` — substantial section additions for the
    two modes, mode inference, cluster endpoint table, management
    endpoints, `omnigraph.yaml` ownership contract, `POST /graphs`
    body shape + status codes.
  - `docs/user/cli.md` — `omnigraph graphs list/create` section,
    deferred-DELETE note.
  - `docs/user/policy.md` — server-scoped Cedar actions
    (`graph_create`, `graph_list`), per-graph vs server-level policy
    composition, example server-level policy.

Workspace test pass: 573 tests green across all crates. Zero
failures. MR-731 spoof regression still pinned and passing across
the entire 10-PR series.

This commit closes MR-668. v0.7.0 is ready for tagging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Ragnor Comerford 2026-05-25 21:32:49 +02:00
parent 75514b6cfd
commit d11c18fb27
No known key found for this signature in database
15 changed files with 632 additions and 77 deletions

View file

@ -493,6 +493,77 @@ fn staging_path(path: &Path) -> PathBuf {
PathBuf::from(s)
}
/// Atomic read-modify-write of `omnigraph.yaml` (MR-668 PR 7 — race-fix
/// from PR 9). Everything happens **inside** the `fcntl::flock` and the
/// in-memory baseline mutex:
/// 1. Acquire `LOCK_EX`.
/// 2. Lock the in-memory baseline mutex.
/// 3. Read the on-disk file, hash it.
/// 4. Compare to the in-memory baseline; if mismatch → `Drift`.
/// 5. Parse the on-disk YAML, hand the parsed config to `modify`.
/// 6. Serialize the returned config, write `.tmp`, fsync, rename.
/// 7. Update the in-memory baseline to the new file's hash.
/// 8. Release flock + mutex.
///
/// The earlier `rewrite_atomic` captured the baseline OUTSIDE the
/// flock, which created a race under concurrent writers: a second
/// writer would see a stale baseline + the first writer's new on-disk
/// hash, yielding a spurious `Drift` error. The `_with_modify` shape
/// keeps the entire critical section atomic.
///
/// `modify` is a `FnOnce` so the caller can read mutable state into it
/// (e.g. a `GraphCreateRequest`) without `Sync` requirements.
pub fn rewrite_atomic_with_modify<F>(
path: &Path,
baseline: &std::sync::Mutex<[u8; 32]>,
modify: F,
) -> std::result::Result<(), RewriteAtomicError>
where
F: FnOnce(OmnigraphConfig) -> std::result::Result<OmnigraphConfig, RewriteAtomicError>,
{
let lock_file = fs::OpenOptions::new().read(true).write(true).open(path)?;
lock_file.lock_exclusive()?;
let _lock_guard = lock_file;
// Lock the in-memory baseline INSIDE the flock so concurrent writers
// serialize on both: flock for cross-process safety, mutex for
// in-process baseline updates. The mutex guard outlives the modify
// step so the baseline can't move under our feet.
let mut baseline_guard = baseline
.lock()
.expect("baseline mutex must not be poisoned");
let current_bytes = fs::read(path)?;
let mut current_hash = [0u8; 32];
current_hash.copy_from_slice(&Sha256::digest(&current_bytes));
if current_hash != *baseline_guard {
return Err(RewriteAtomicError::Drift);
}
// Parse the on-disk config (NOT a stale cached version) and hand
// to `modify`. The closure can mutate freely; the result is what
// we serialize and write.
let current_config: OmnigraphConfig = serde_yaml::from_slice(&current_bytes)?;
let new_config = modify(current_config)?;
let serialized = serde_yaml::to_string(&new_config)?;
let tmp_path = staging_path(path);
fs::write(&tmp_path, &serialized)?;
let tmp_file = fs::File::open(&tmp_path)?;
tmp_file.sync_all()?;
drop(tmp_file);
fs::rename(&tmp_path, path)?;
if let Some(parent) = path.parent() {
let dir = fs::File::open(parent)?;
dir.sync_all()?;
}
let mut new_hash = [0u8; 32];
new_hash.copy_from_slice(&Sha256::digest(serialized.as_bytes()));
*baseline_guard = new_hash;
Ok(())
}
#[cfg(test)]
mod tests {
use std::fs;

View file

@ -1428,49 +1428,33 @@ async fn server_graphs_create(
))
}
/// Load `omnigraph.yaml` from disk, add the new graph entry, write it
/// back via `config::rewrite_atomic`, and update the in-memory baseline
/// hash. Returns an `ApiError` mapped to the appropriate HTTP status
/// (503 for drift, 500 for IO/serialize failures).
/// Atomically rewrite `omnigraph.yaml` to add a new graph entry.
/// Runs inside `tokio::task::spawn_blocking` (the flock is sync).
///
/// Runs inside `tokio::task::spawn_blocking` — `fs2::flock` is sync.
/// Read-modify-write happens entirely under the flock + baseline
/// mutex via `config::rewrite_atomic_with_modify` — concurrent
/// writers serialize without spurious drift errors.
fn rewrite_yaml_with_new_graph(
config_path: &std::path::Path,
config_hash: &Arc<std::sync::Mutex<[u8; 32]>>,
graph_id: &str,
new_target: config::TargetConfig,
) -> std::result::Result<(), ApiError> {
// Re-read the config file to construct the next state.
let bytes = std::fs::read(config_path)
.map_err(|err| ApiError::internal(format!("read omnigraph.yaml: {err}")))?;
let mut updated: config::OmnigraphConfig = serde_yaml::from_slice(&bytes)
.map_err(|err| ApiError::internal(format!("parse omnigraph.yaml: {err}")))?;
updated.graphs.insert(graph_id.to_string(), new_target);
// Grab the current baseline hash for the drift check.
let expected = *config_hash
.lock()
.expect("config_hash mutex must not be poisoned");
let new_hash = config::rewrite_atomic(config_path, &updated, &expected).map_err(|err| {
match err {
config::RewriteAtomicError::Drift => ApiError {
status: StatusCode::SERVICE_UNAVAILABLE,
code: ErrorCode::Conflict,
message: err.to_string(),
merge_conflicts: Vec::new(),
manifest_conflict: None,
},
other => ApiError::internal(other.to_string()),
}
})?;
// Update the baseline so the next POST sees this as the new "no
// drift" reference. If we forgot this, every POST after the first
// would 503.
*config_hash
.lock()
.expect("config_hash mutex must not be poisoned") = new_hash;
Ok(())
let graph_id = graph_id.to_string();
config::rewrite_atomic_with_modify(config_path, config_hash, move |mut config| {
config.graphs.insert(graph_id, new_target);
Ok(config)
})
.map_err(|err| match err {
config::RewriteAtomicError::Drift => ApiError {
status: StatusCode::SERVICE_UNAVAILABLE,
code: ErrorCode::Conflict,
message: err.to_string(),
merge_conflicts: Vec::new(),
manifest_conflict: None,
},
other => ApiError::internal(other.to_string()),
})
}
async fn server_openapi(State(state): State<AppState>) -> Json<utoipa::openapi::OpenApi> {