omnigraph/crates/omnigraph-cluster/Cargo.toml

[package]
name = "omnigraph-cluster"
version = "0.7.2"
edition = "2024"
description = "Cluster configuration validation, planning, and config-only apply for Omnigraph."
license = "MIT"
repository = "https://github.com/ModernRelay/omnigraph"
homepage = "https://github.com/ModernRelay/omnigraph"
documentation = "https://docs.rs/omnigraph-cluster"

[features]
# Fault-injection hooks for the apply protocol (crash-mid-apply, CAS-race
# tests), including cluster/engine boundary failures.
failpoints = ["dep:fail", "fail/failpoints", "omnigraph/failpoints"]
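# Usage sketch (an assumption, not taken from this repository's docs): the
# feature is off by default, so a failpoint test run would need to opt in with
# standard Cargo flags, for example:
#   cargo test -p omnigraph-cluster --features failpoints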

[dependencies]
omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.7.2" }
omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.7.2" }
fail = { workspace = true, optional = true }
serde = { workspace = true }
serde_json = { workspace = true }
serde_yaml = { workspace = true }
sha2 = { workspace = true }
thiserror = { workspace = true }
time = { workspace = true }
# Runtime handle only — best-effort async lock release in
# StateLockGuard::drop on object-store backends (cluster commands always
# run inside the caller's tokio runtime).
tokio = { workspace = true }
ulid = { workspace = true }

[dev-dependencies]
serial_test = "3"
tempfile = { workspace = true }
tokio = { workspace = true }
feat(cluster): add read-only validate and plan 2026-06-08 20:07:39 +03:00			`[package]`
			`name = "omnigraph-cluster"`
release: v0.7.2 (#301) Patch release over v0.7.1: write-path latency reductions (#288 direct table opener, #298 schema-once + open-each-table-once) and three correctness fixes on the maintenance and recovery paths (#297 optimize survives a cross-process write race, #291 optimize compacts the internal metadata tables + non-destructive auto_cleanup strip, #296 recovery converges under a concurrent manifest advance). No breaking changes, no on-disk format change, no migration. Version coherence: all 7 crate manifests + path-dep constraints, Cargo.lock, openapi.json, and the AGENTS.md surveyed version bumped 0.7.1 -> 0.7.2. Build green under --locked; OpenAPI drift check green. 2026-06-25 09:08:12 +02:00			`version = "0.7.2"`
feat(cluster): add read-only validate and plan 2026-06-08 20:07:39 +03:00			`edition = "2024"`
feat(cluster): failpoint infrastructure mirroring the engine Optional failpoints feature (dep:fail + fail/failpoints, deliberately NOT enabling omnigraph/failpoints), a maybe_fail/ScopedFailPoint module returning Diagnostic-typed injected errors, and two call sites in apply_config_dir: cluster_apply.after_payload_phase (the crash point: blobs on disk, state untouched) and cluster_apply.before_state_write (routes through the persisted-statuses revert contract; a cfg_callback here can mutate state.json to make the CAS check fail organically). Feature off compiles to Ok(()) — zero behavior change. Tests live in a separate integration binary because the fail registry is process-global. Also refresh the crate description (stale 'read-only' since Stage 3A). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> 2026-06-10 02:12:59 +03:00			`description = "Cluster configuration validation, planning, and config-only apply for Omnigraph."`
feat(cluster): add read-only validate and plan 2026-06-08 20:07:39 +03:00			`license = "MIT"`
			`repository = "https://github.com/ModernRelay/omnigraph"`
			`homepage = "https://github.com/ModernRelay/omnigraph"`
			`documentation = "https://docs.rs/omnigraph-cluster"`

feat(cluster): failpoint infrastructure mirroring the engine Optional failpoints feature (dep:fail + fail/failpoints, deliberately NOT enabling omnigraph/failpoints), a maybe_fail/ScopedFailPoint module returning Diagnostic-typed injected errors, and two call sites in apply_config_dir: cluster_apply.after_payload_phase (the crash point: blobs on disk, state untouched) and cluster_apply.before_state_write (routes through the persisted-statuses revert contract; a cfg_callback here can mutate state.json to make the CAS check fail organically). Feature off compiles to Ok(()) — zero behavior change. Tests live in a separate integration binary because the fail registry is process-global. Also refresh the crate description (stale 'read-only' since Stage 3A). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> 2026-06-10 02:12:59 +03:00			`[features]`
			`# Fault-injection hooks for the apply protocol (crash-mid-apply, CAS-race`
fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284) * fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap A `cluster apply` carrying a schema change against a graph that has non-main branches, or an unsupported "needs backfill" migration, armed a recovery sidecar before calling the engine, then left it behind when the engine rejected the apply pre-movement. The server refuses to boot while any sidecar is pending, and re-running apply re-armed a fresh sidecar — an unescapable crash loop. None of the engine rejections are bugs; the trap is in the apply/serve choreography. Three coordinated changes: 1. Preview before arming the sidecar. `cluster apply` now runs `preview_schema_apply_with_options` before `write_recovery_sidecar`, so parser/planner rejections (non-main branches, unsupported plan) fail loudly without leaving recovery work behind. The post-preview engine error path now deletes the sidecar when the live schema still matches the recorded digest (nothing moved), and keeps it only on real mid-movement failure — both branches covered by new engine-failpoint tests (cluster failpoints now enable omnigraph/failpoints). 2. Per-graph quarantine at serve time instead of whole-cluster refusal. A graph-attributed pending sidecar, an unopenable graph root, a query parse failure, or an unresolvable embedding provider now quarantines just that graph (logged loudly at every boot layer) while healthy graphs serve; `/graphs` lists only ready graphs and quarantined routes 404. Cluster-global problems (missing/unreadable state, malformed or unattributable sidecars, shared-catalog or cluster-policy errors, zero healthy graphs) stay fail-fast. `--require-all-graphs` / OMNIGRAPH_REQUIRE_ALL_GRAPHS=1 restores all-or-nothing boot. 3. Backfill embedding-provider profile metadata on apply. 
Mirrors the existing policy-binding backfill: a pre-5A ledger missing `embedding_profile` is now detected as a metadata-only change and backfilled by a no-op apply, instead of bricking serve with `embedding_provider_profile_missing` forever. Tests: trap (no sidecar after a rejected apply), both digest-cleanup branches, per-graph quarantine (cluster + server), embedding backfill. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: resilient cluster boot + recovery-sidecar trap fix Amend RFC-005 D4 readiness posture (cluster-global fail-fast vs graph-local quarantine; deviation #5 for --require-all-graphs), add the v0.7.0 release note, and update the user cluster/server/deployment docs and the OMNIGRAPH_REQUIRE_ALL_GRAPHS env var. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(cluster): surface sidecar-cleanup failures; document severity promotion Address Greptile review on PR #284: - The pre-movement sidecar cleanup fast-path discarded `delete_object`'s result, so a transient delete failure left the graph quarantined with no signal. Add `try_delete_object` (Result-returning) and emit a `recovery_sidecar_cleanup_failed` warning diagnostic on failure; the fire-and-forget `delete_object` now delegates to it. - Document why the serve-time loop promotes every `list_recovery_sidecars` diagnostic to a cluster-fatal error (the listing only emits genuine read/parse/version failures, as warnings, whose blast radius serving cannot prove) and note the promote-by-code path if that ever changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> 2026-06-19 03:34:15 +03:00			`# tests), including cluster/engine boundary failures.`
			`failpoints = ["dep:fail", "fail/failpoints", "omnigraph/failpoints"]`
feat(cluster): failpoint infrastructure mirroring the engine Optional failpoints feature (dep:fail + fail/failpoints, deliberately NOT enabling omnigraph/failpoints), a maybe_fail/ScopedFailPoint module returning Diagnostic-typed injected errors, and two call sites in apply_config_dir: cluster_apply.after_payload_phase (the crash point: blobs on disk, state untouched) and cluster_apply.before_state_write (routes through the persisted-statuses revert contract; a cfg_callback here can mutate state.json to make the CAS check fail organically). Feature off compiles to Ok(()) — zero behavior change. Tests live in a separate integration binary because the fail registry is process-global. Also refresh the crate description (stale 'read-only' since Stage 3A). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> 2026-06-10 02:12:59 +03:00
feat(cluster): add read-only validate and plan 2026-06-08 20:07:39 +03:00			`[dependencies]`
release: v0.7.2 (#301) Patch release over v0.7.1: write-path latency reductions (#288 direct table opener, #298 schema-once + open-each-table-once) and three correctness fixes on the maintenance and recovery paths (#297 optimize survives a cross-process write race, #291 optimize compacts the internal metadata tables + non-destructive auto_cleanup strip, #296 recovery converges under a concurrent manifest advance). No breaking changes, no on-disk format change, no migration. Version coherence: all 7 crate manifests + path-dep constraints, Cargo.lock, openapi.json, and the AGENTS.md surveyed version bumped 0.7.1 -> 0.7.2. Build green under --locked; OpenAPI drift check green. 2026-06-25 09:08:12 +02:00			`omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.7.2" }`
			`omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.7.2" }`
feat(cluster): failpoint infrastructure mirroring the engine Optional failpoints feature (dep:fail + fail/failpoints, deliberately NOT enabling omnigraph/failpoints), a maybe_fail/ScopedFailPoint module returning Diagnostic-typed injected errors, and two call sites in apply_config_dir: cluster_apply.after_payload_phase (the crash point: blobs on disk, state untouched) and cluster_apply.before_state_write (routes through the persisted-statuses revert contract; a cfg_callback here can mutate state.json to make the CAS check fail organically). Feature off compiles to Ok(()) — zero behavior change. Tests live in a separate integration binary because the fail registry is process-global. Also refresh the crate description (stale 'read-only' since Stage 3A). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> 2026-06-10 02:12:59 +03:00			`fail = { workspace = true, optional = true }`
feat(cluster): add read-only validate and plan 2026-06-08 20:07:39 +03:00			`serde = { workspace = true }`
			`serde_json = { workspace = true }`
			`serde_yaml = { workspace = true }`
			`sha2 = { workspace = true }`
			`thiserror = { workspace = true }`
Add cluster JSON state ledger status 2026-06-08 21:09:23 +03:00			`time = { workspace = true }`
feat(cluster): port the storage backend to the engine StorageAdapter LocalStateBackend becomes ClusterStore: every stored byte — state ledger, lock, recovery sidecars, approval artifacts — now flows through the engine's StorageAdapter, making file:// and s3:// one code path. Behavior on the file backend is byte-compatible (layout, CAS semantics, diagnostics, lock release timing) and the entire pre-existing suite passes unchanged. Mechanics: the ledger CAS keeps its public sha256 vocabulary while the physical swap is token-conditioned (ETag If-Match on S3 via PR #186's primitives; content-token + temp/rename locally — the pre-port semantics); the lock is a create-only put (genuinely cross-machine on object stores) with deterministic drop-release locally and best-effort spawned release on S3; sidecars/approvals address by URI (SweepOutcome and the executors carry strings); sweep row-1 retirement joins the uniform deferred post-CAS cleanup. ClusterStore also gains the catalog-payload and graph-root methods that commit 2 wires in. Async ripple: status/force-unlock/serving-snapshot and the server's settings loader chain go async (CLI dispatch and ~20 test hosts follow, mechanically). tokio joins the cluster crate's runtime deps for the lock guard's handle. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> 2026-06-11 14:11:14 +03:00			`# Runtime handle only — best-effort async lock release in`
			`# StateLockGuard::drop on object-store backends (cluster commands always`
			`# run inside the caller's tokio runtime).`
			`tokio = { workspace = true }`
Add cluster JSON state ledger status 2026-06-08 21:09:23 +03:00			`ulid = { workspace = true }`
feat(cluster): add read-only validate and plan 2026-06-08 20:07:39 +03:00
			`[dev-dependencies]`
fix(recovery): converge roll-forward on concurrent manifest advance (#296) * refactor(storage): gate test-only TableStore::append_batch behind cfg(test) The inherent append_batch is used only by in-source recovery test setup, but the non-test lib build (cfg(test) off) cannot see those callers and emitted a dead_code warning. Gating the method #[cfg(test)] silences the false positive and enforces its own doc contract ("no new engine call sites") by construction — engine code physically cannot call a cfg(test) method. * test(failpoints): harden fault-injection harness + reproduce roll-forward CAS race Hardens the test infrastructure around the process-global `fail` registry, and adds a deterministic red repro for the open-time recovery sweep's roll-forward CAS race (iss-schema-apply-reopen-recovery-race). The fix lands in the next commit — this commit is intentionally red (rule 12: red→green visible in log). Harness: - One `ScopedFailPoint` (engine) gaining `with_callback`; the cluster duplicate is removed and cluster tests reuse the engine type via `omnigraph/failpoints`. - `#[serial]` on every failpoint test (the registry is process-global, so shared names interfere under parallelism); `serial_test` added to cluster dev-deps. - `helpers::failpoint::Rendezvous` (park-first / wait-until-reached / release) replaces fixed-`sleep` cross-thread coordination; the three concurrent tests now rendezvous deterministically. The reached flag doubles as a fired-assert. - Compile-checked `failpoints::names` catalog (engine + cluster); every call site references a const, and `failpoint_names_guard.rs` enforces "no string literal names" by source-walk, so a typo is a build error not a silent no-fire. Red repro: - New `recovery.before_roll_forward_publish` failpoint at the sweep's classify -> publish-CAS window (the only injection point there). 
- `open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one pending sidecar; the sweep parked at the failpoint loses its publish CAS to the other and fails the open with `ExpectedVersionMismatch`. FAILS at this commit by design. * fix(recovery): converge roll-forward when the manifest advances concurrently The open-time recovery sweep classified a pending sidecar as RolledPastExpected, then published a manifest CAS at the sidecar's pinned expected_version. Under a concurrent writer that advanced the manifest past expected during the classify -> publish window, the CAS failed with ExpectedVersionMismatch and `?`-propagated, failing the whole Omnigraph::open. iss-schema-apply-reopen-recovery-race. A roll-forward's postcondition is "the manifest reflects the sidecar's committed Lance state", not "this sweep won the CAS" (invariants 7 & 15). On an ExpectedVersionMismatch, re-read the live manifest and check whether the sidecar's intent is already satisfied (every pinned table at a version >= the one we observed and tried to publish; added tables registered; tombstones gone — sound under the heal-first invariant, documented at the check). If satisfied, this is convergence: record the RolledForward audit + delete the sidecar idempotently. If only partway, defer to the next pass. Either way the open no longer fails. Other errors still propagate; a genuine logical conflict resurfaces via the classifier's InvariantViolation. Turns the red repro from the previous commit green. The roll-BACK twin (iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and still needs a cross-process lease — the known-gap is updated accordingly. * Address PR review: harden failpoint name guard + dedupe converge audit Two issues surfaced in PR review of the failpoint hardening + recovery fix: 1. Name guard had a line-split blind spot. 
It scanned per line, so a call wrapped across lines (`park_first(\n "name",\n)`) put the literal on a different line than the call prefix and bypassed the "no string-literal failpoint names" check — and one such literal (`mutation.delete_node_pre_primary_delete`) had slipped through. Make the guard whitespace/newline-tolerant (skip past the open paren to the first argument token) so wrapping can't hide a literal, and convert the bypassed site to the `names::` const. 2. Convergence path could append a duplicate recovery audit. When a roll-forward publish loses its CAS but the manifest already reached the sidecar's goal, `converge_or_defer_roll_forward` recorded a RolledForward audit unconditionally. Under the heal-first invariant, whoever advanced the manifest already healed this sidecar (audit + delete), so a second row landed in `_graph_commit_recoveries` for one recovery event. Gate the audit+delete on the sidecar still being present: absent => the winner completed it, return success with no duplicate row. The convergence regression test now asserts exactly one audit row. * docs(dev): remove the schema-apply recovery-flake handoff (fixed by this PR) The handoff was a transient investigation note for `iss-schema-apply-reopen-recovery-race`, which this PR fixes (the converge helper + the red→green regression). Its rationale now lives durably in the dev-graph issue, the PR/commit history, and invariants.md, so the handoff is obsolete. Drop the doc, its dev-index row, and the dangling reference from the RFC-013 handoff; the doc cross-link check stays green. * fix(recovery): include added-table registrations in the converge audit The CAS-loss convergence audit built outcomes only from `sidecar.tables`, omitting the `additional_registrations` that the normal `roll_forward_all` audit includes. For a SchemaApply sidecar with added types, a converge-path audit row would be incomplete versus the normal roll-forward path for the same recovery kind. 
Mirror the roll-forward outcome construction (append a registration outcome per added table) so both paths emit the same audit shape. 2026-06-24 19:03:51 +02:00			`serial_test = "3"`
feat(cluster): add read-only validate and plan 2026-06-08 20:07:39 +03:00			`tempfile = { workspace = true }`
Implement cluster refresh and import 2026-06-08 23:18:44 +03:00			`tokio = { workspace = true }`
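
The failpoint commit message above describes a `maybe_fail` helper that compiles to `Ok(())` when the `failpoints` feature is off, with two call sites in `apply_config_dir`. The following is a minimal sketch of that pattern, not the crate's actual code. The `Diagnostic` struct, its fields, the `injected_failpoint` code string, and `apply_sketch` are hypothetical stand-ins; only the two failpoint names and the feature name come from the text above. The feature-on branch assumes the `fail` crate's `fail_point!` macro in its early-return closure form.

```rust
/// Hypothetical stand-in for the crate's diagnostic type (fields are assumed).
#[derive(Debug, PartialEq)]
pub struct Diagnostic {
    pub code: String,
    pub message: String,
}

// Failpoint names quoted from the commit message; keeping them as constants
// means a typo at a call site is a compile error rather than a silent no-fire.
pub const AFTER_PAYLOAD_PHASE: &str = "cluster_apply.after_payload_phase";
pub const BEFORE_STATE_WRITE: &str = "cluster_apply.before_state_write";

/// Feature on: consult the process-global `fail` registry. When the named
/// point is configured with a `return` action, the macro returns the closure's
/// value from this function; otherwise execution falls through to `Ok(())`.
#[cfg(feature = "failpoints")]
pub fn maybe_fail(name: &str) -> Result<(), Diagnostic> {
    fail::fail_point!(name, |_| Err(Diagnostic {
        code: "injected_failpoint".to_string(), // assumed code string
        message: format!("failpoint {name} fired"),
    }));
    Ok(())
}

/// Feature off: a no-op, so release builds see no behavior change.
#[cfg(not(feature = "failpoints"))]
#[inline]
pub fn maybe_fail(_name: &str) -> Result<(), Diagnostic> {
    Ok(())
}

/// Hypothetical apply flow showing where the two hooks would sit.
pub fn apply_sketch(log: &mut Vec<&'static str>) -> Result<(), Diagnostic> {
    log.push("payload written");
    // Crash point: blobs are on disk, state is untouched.
    maybe_fail(AFTER_PAYLOAD_PHASE)?;
    // Last hook before the state ledger is swapped.
    maybe_fail(BEFORE_STATE_WRITE)?;
    log.push("state written");
    Ok(())
}
```

Built without the feature, `apply_sketch` runs both phases and returns `Ok(())`. Because the `fail` registry is process-global, tests that configure a failpoint would need to run serially, which matches the `serial_test` dev-dependency noted above.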