mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-30 02:49:39 +02:00
mr-668: composite e2e tests, race fix, v0.7.0 release (PR 9/10)
PR 9 — the final integration PR for MR-668 multi-graph server work.
Closes the v0.7.0 release.
Composite lifecycle tests (closes gaps flagged in PR 7's coverage
review):
- `multi_graph_lifecycle_post_query_restart_persistence` — POST a
graph, query it via cluster route, reload the config from disk
and confirm `load_server_settings` sees the rewritten YAML.
Validates the "restart resolves orphans" failure-mode story.
- `per_graph_policy_enforced_on_post_created_graph` — POST a graph
with a per-graph policy attached, then send authenticated read
and change requests. Per-graph Cedar enforcement fires correctly
on a POST-created graph (engine-layer policy reinstalled via
`Omnigraph::with_policy` inside the create flow).
- `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent
POSTs with distinct graph_ids all return 201. Caught a real
race in `rewrite_atomic` (see below).
Race fix — `rewrite_atomic_with_modify`:
The first composite test surfaced a real bug. The old
`rewrite_atomic(path, new_config, expected_hash)` captured the
baseline hash OUTSIDE the flock, then called rewrite_atomic which
re-acquired it inside. Under concurrent writers:
- POST A: captures baseline H0, calls rewrite_atomic.
- POST B: captures baseline H0 too (before A's update lands).
- A: acquires flock, on-disk == H0, writes H1, releases.
- A: updates baseline H0 → H1.
- B: tries to acquire flock — waits.
- B: acquires flock. On-disk is now H1. Expected (captured
before A finished) is H0. MISMATCH → spurious Drift error.
Worse: even if the timing happens to align, B's `updated` config
was constructed from BYTES read before the flock. B writes a config
that doesn't include A's new graph — silent data loss.
The fix: new `config::rewrite_atomic_with_modify(path, baseline,
modify)` takes a closure. Inside the flock + baseline mutex:
1. Read on-disk bytes, hash, compare to baseline.
2. Parse on-disk YAML.
3. Call `modify(parsed)` to produce the new config — receives
fresh on-disk state, returns the modification.
4. Serialize + write + fsync + rename + update baseline.
Everything is read-modify-write under the same critical section.
Concurrent writers serialize cleanly. Test confirmed this is no
longer a race.
The old `rewrite_atomic(path, new_config, expected_hash)` API stays
for tests that don't need the read-modify-write shape; the POST
handler switches to the new shape.
Version bump v0.6.0 → v0.7.0:
- All 5 `crates/*/Cargo.toml` (compiler, engine, policy, cli, server)
plus their inter-crate `path` dep version constraints.
- `Cargo.lock` regenerated by `cargo build --workspace`.
- `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server
row updated to mention multi-graph + cluster routes + atomic YAML
rewrite.
- `openapi.json` regenerated.
Docs:
- `docs/releases/v0.7.0.md` (new) — release notes with breaking
changes, new features, deferred items (DELETE, `delete_prefix`,
actor forwarding), and the single→multi migration recipe.
- `docs/user/server.md` — substantial section additions for the
two modes, mode inference, cluster endpoint table, management
endpoints, `omnigraph.yaml` ownership contract, `POST /graphs`
body shape + status codes.
- `docs/user/cli.md` — `omnigraph graphs list/create` section,
deferred-DELETE note.
- `docs/user/policy.md` — server-scoped Cedar actions
(`graph_create`, `graph_list`), per-graph vs server-level policy
composition, example server-level policy.
Workspace test pass: 573 tests green across all crates. Zero
failures. MR-731 spoof regression still pinned and passing across
the entire 10-PR series.
This commit closes MR-668. v0.7.0 is ready for tagging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
75514b6cfd
commit
d11c18fb27
15 changed files with 632 additions and 77 deletions
|
|
@ -493,6 +493,77 @@ fn staging_path(path: &Path) -> PathBuf {
|
|||
PathBuf::from(s)
|
||||
}
|
||||
|
||||
/// Atomic read-modify-write of `omnigraph.yaml` (MR-668 PR 7 — race-fix
|
||||
/// from PR 9). Everything happens **inside** the `fcntl::flock` and the
|
||||
/// in-memory baseline mutex:
|
||||
/// 1. Acquire `LOCK_EX`.
|
||||
/// 2. Lock the in-memory baseline mutex.
|
||||
/// 3. Read the on-disk file, hash it.
|
||||
/// 4. Compare to the in-memory baseline; if mismatch → `Drift`.
|
||||
/// 5. Parse the on-disk YAML, hand the parsed config to `modify`.
|
||||
/// 6. Serialize the returned config, write `.tmp`, fsync, rename.
|
||||
/// 7. Update the in-memory baseline to the new file's hash.
|
||||
/// 8. Release flock + mutex.
|
||||
///
|
||||
/// The earlier `rewrite_atomic` captured the baseline OUTSIDE the
|
||||
/// flock, which created a race under concurrent writers: a second
|
||||
/// writer would see a stale baseline + the first writer's new on-disk
|
||||
/// hash, yielding a spurious `Drift` error. The `_with_modify` shape
|
||||
/// keeps the entire critical section atomic.
|
||||
///
|
||||
/// `modify` is a `FnOnce` so the caller can read mutable state into it
|
||||
/// (e.g. a `GraphCreateRequest`) without `Sync` requirements.
|
||||
pub fn rewrite_atomic_with_modify<F>(
|
||||
path: &Path,
|
||||
baseline: &std::sync::Mutex<[u8; 32]>,
|
||||
modify: F,
|
||||
) -> std::result::Result<(), RewriteAtomicError>
|
||||
where
|
||||
F: FnOnce(OmnigraphConfig) -> std::result::Result<OmnigraphConfig, RewriteAtomicError>,
|
||||
{
|
||||
let lock_file = fs::OpenOptions::new().read(true).write(true).open(path)?;
|
||||
lock_file.lock_exclusive()?;
|
||||
let _lock_guard = lock_file;
|
||||
|
||||
// Lock the in-memory baseline INSIDE the flock so concurrent writers
|
||||
// serialize on both: flock for cross-process safety, mutex for
|
||||
// in-process baseline updates. The mutex guard outlives the modify
|
||||
// step so the baseline can't move under our feet.
|
||||
let mut baseline_guard = baseline
|
||||
.lock()
|
||||
.expect("baseline mutex must not be poisoned");
|
||||
|
||||
let current_bytes = fs::read(path)?;
|
||||
let mut current_hash = [0u8; 32];
|
||||
current_hash.copy_from_slice(&Sha256::digest(¤t_bytes));
|
||||
if current_hash != *baseline_guard {
|
||||
return Err(RewriteAtomicError::Drift);
|
||||
}
|
||||
|
||||
// Parse the on-disk config (NOT a stale cached version) and hand
|
||||
// to `modify`. The closure can mutate freely; the result is what
|
||||
// we serialize and write.
|
||||
let current_config: OmnigraphConfig = serde_yaml::from_slice(¤t_bytes)?;
|
||||
let new_config = modify(current_config)?;
|
||||
let serialized = serde_yaml::to_string(&new_config)?;
|
||||
|
||||
let tmp_path = staging_path(path);
|
||||
fs::write(&tmp_path, &serialized)?;
|
||||
let tmp_file = fs::File::open(&tmp_path)?;
|
||||
tmp_file.sync_all()?;
|
||||
drop(tmp_file);
|
||||
fs::rename(&tmp_path, path)?;
|
||||
if let Some(parent) = path.parent() {
|
||||
let dir = fs::File::open(parent)?;
|
||||
dir.sync_all()?;
|
||||
}
|
||||
|
||||
let mut new_hash = [0u8; 32];
|
||||
new_hash.copy_from_slice(&Sha256::digest(serialized.as_bytes()));
|
||||
*baseline_guard = new_hash;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use std::fs;
|
||||
|
|
|
|||
|
|
@ -1428,49 +1428,33 @@ async fn server_graphs_create(
|
|||
))
|
||||
}
|
||||
|
||||
/// Load `omnigraph.yaml` from disk, add the new graph entry, write it
|
||||
/// back via `config::rewrite_atomic`, and update the in-memory baseline
|
||||
/// hash. Returns an `ApiError` mapped to the appropriate HTTP status
|
||||
/// (503 for drift, 500 for IO/serialize failures).
|
||||
/// Atomically rewrite `omnigraph.yaml` to add a new graph entry.
|
||||
/// Runs inside `tokio::task::spawn_blocking` (the flock is sync).
|
||||
///
|
||||
/// Runs inside `tokio::task::spawn_blocking` — `fs2::flock` is sync.
|
||||
/// Read-modify-write happens entirely under the flock + baseline
|
||||
/// mutex via `config::rewrite_atomic_with_modify` — concurrent
|
||||
/// writers serialize without spurious drift errors.
|
||||
fn rewrite_yaml_with_new_graph(
|
||||
config_path: &std::path::Path,
|
||||
config_hash: &Arc<std::sync::Mutex<[u8; 32]>>,
|
||||
graph_id: &str,
|
||||
new_target: config::TargetConfig,
|
||||
) -> std::result::Result<(), ApiError> {
|
||||
// Re-read the config file to construct the next state.
|
||||
let bytes = std::fs::read(config_path)
|
||||
.map_err(|err| ApiError::internal(format!("read omnigraph.yaml: {err}")))?;
|
||||
let mut updated: config::OmnigraphConfig = serde_yaml::from_slice(&bytes)
|
||||
.map_err(|err| ApiError::internal(format!("parse omnigraph.yaml: {err}")))?;
|
||||
updated.graphs.insert(graph_id.to_string(), new_target);
|
||||
|
||||
// Grab the current baseline hash for the drift check.
|
||||
let expected = *config_hash
|
||||
.lock()
|
||||
.expect("config_hash mutex must not be poisoned");
|
||||
let new_hash = config::rewrite_atomic(config_path, &updated, &expected).map_err(|err| {
|
||||
match err {
|
||||
config::RewriteAtomicError::Drift => ApiError {
|
||||
status: StatusCode::SERVICE_UNAVAILABLE,
|
||||
code: ErrorCode::Conflict,
|
||||
message: err.to_string(),
|
||||
merge_conflicts: Vec::new(),
|
||||
manifest_conflict: None,
|
||||
},
|
||||
other => ApiError::internal(other.to_string()),
|
||||
}
|
||||
})?;
|
||||
|
||||
// Update the baseline so the next POST sees this as the new "no
|
||||
// drift" reference. If we forgot this, every POST after the first
|
||||
// would 503.
|
||||
*config_hash
|
||||
.lock()
|
||||
.expect("config_hash mutex must not be poisoned") = new_hash;
|
||||
Ok(())
|
||||
let graph_id = graph_id.to_string();
|
||||
config::rewrite_atomic_with_modify(config_path, config_hash, move |mut config| {
|
||||
config.graphs.insert(graph_id, new_target);
|
||||
Ok(config)
|
||||
})
|
||||
.map_err(|err| match err {
|
||||
config::RewriteAtomicError::Drift => ApiError {
|
||||
status: StatusCode::SERVICE_UNAVAILABLE,
|
||||
code: ErrorCode::Conflict,
|
||||
message: err.to_string(),
|
||||
merge_conflicts: Vec::new(),
|
||||
manifest_conflict: None,
|
||||
},
|
||||
other => ApiError::internal(other.to_string()),
|
||||
})
|
||||
}
|
||||
|
||||
async fn server_openapi(State(state): State<AppState>) -> Json<utoipa::openapi::OpenApi> {
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue