[package]
name = "omnigraph-cluster"
version = "0.7.2"
edition = "2024"
description = "Cluster configuration validation, planning, and config-only apply for Omnigraph."
license = "MIT"
repository = "https://github.com/ModernRelay/omnigraph"
homepage = "https://github.com/ModernRelay/omnigraph"
documentation = "https://docs.rs/omnigraph-cluster"

[features]
# Fault-injection hooks for the apply protocol (crash-mid-apply, CAS-race
# tests), including cluster/engine boundary failures.
failpoints = ["dep:fail", "fail/failpoints", "omnigraph/failpoints"]
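# Usage sketch (an assumption, not confirmed by this manifest: tests are run
# with cargo from the workspace root). The feature is not enabled by default,
# so the fault-injection tests would need it turned on explicitly, for example:
#   cargo test -p omnigraph-cluster --features failpoints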

[dependencies]
omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.7.2" }
omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.7.2" }
fail = { workspace = true, optional = true }
serde = { workspace = true }
serde_json = { workspace = true }
serde_yaml = { workspace = true }
sha2 = { workspace = true }
thiserror = { workspace = true }
time = { workspace = true }
# Runtime handle only — best-effort async lock release in
# StateLockGuard::drop on object-store backends (cluster commands always
# run inside the caller's tokio runtime).
tokio = { workspace = true }
ulid = { workspace = true }

[dev-dependencies]
serial_test = "3"
tempfile = { workspace = true }
tokio = { workspace = true }