mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-30 02:49:39 +02:00
fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284)
* fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap

A `cluster apply` carrying a schema change against a graph that has non-main branches, or an unsupported "needs backfill" migration, armed a recovery sidecar *before* calling the engine, then left it behind when the engine rejected the apply pre-movement. The server refuses to boot while any sidecar is pending, and re-running apply re-armed a fresh sidecar: an unescapable crash loop. None of the engine rejections are bugs; the trap is in the apply/serve choreography.

Three coordinated changes:

1. Preview before arming the sidecar. `cluster apply` now runs `preview_schema_apply_with_options` before `write_recovery_sidecar`, so parser/planner rejections (non-main branches, unsupported plan) fail loudly without leaving recovery work behind. The post-preview engine error path now deletes the sidecar when the live schema still matches the recorded digest (nothing moved), and keeps it only on real mid-movement failure; both branches are covered by new engine-failpoint tests (cluster failpoints now enable omnigraph/failpoints).

2. Per-graph quarantine at serve time instead of whole-cluster refusal. A graph-attributed pending sidecar, an unopenable graph root, a query parse failure, or an unresolvable embedding provider now quarantines just that graph (logged loudly at every boot layer) while healthy graphs serve; `/graphs` lists only ready graphs and quarantined routes 404. Cluster-global problems (missing/unreadable state, malformed or unattributable sidecars, shared-catalog or cluster-policy errors, zero healthy graphs) stay fail-fast. `--require-all-graphs` / OMNIGRAPH_REQUIRE_ALL_GRAPHS=1 restores all-or-nothing boot.

3. Backfill embedding-provider profile metadata on apply. Mirrors the existing policy-binding backfill: a pre-5A ledger missing `embedding_profile` is now detected as a metadata-only change and backfilled by a no-op apply, instead of bricking serve with `embedding_provider_profile_missing` forever.

Tests: trap (no sidecar after a rejected apply), both digest-cleanup branches, per-graph quarantine (cluster + server), embedding backfill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: resilient cluster boot + recovery-sidecar trap fix

Amend RFC-005 D4 readiness posture (cluster-global fail-fast vs graph-local quarantine; deviation #5 for --require-all-graphs), add the v0.7.0 release note, and update the user cluster/server/deployment docs and the OMNIGRAPH_REQUIRE_ALL_GRAPHS env var.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(cluster): surface sidecar-cleanup failures; document severity promotion

Address Greptile review on PR #284:

- The pre-movement sidecar cleanup fast-path discarded `delete_object`'s result, so a transient delete failure left the graph quarantined with no signal. Add `try_delete_object` (Result-returning) and emit a `recovery_sidecar_cleanup_failed` warning diagnostic on failure; the fire-and-forget `delete_object` now delegates to it.
- Document why the serve-time loop promotes every `list_recovery_sidecars` diagnostic to a cluster-fatal error (the listing only emits genuine read/parse/version failures, as warnings, whose blast radius serving cannot prove) and note the promote-by-code path if that ever changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
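The ordering described in change 1 (preview, then arm, then apply, then digest-based cleanup) can be sketched as below. This is a minimal illustration, not the code from this commit: `Graph`, `preview`, and `apply` are hypothetical stand-ins, the sidecar is modelled as an in-memory `Option<String>` holding the recorded digest, and clearing the sidecar after a successful apply is an assumption.

```rust
// Hypothetical stand-ins for illustration; not the real omnigraph types.
struct Graph {
    live_digest: String,
    has_non_main_branches: bool,
    // Recorded digest of a pending recovery sidecar, if one is armed.
    sidecar: Option<String>,
}

impl Graph {
    // Planner-only check: it must not write any recovery state.
    fn preview(&self) -> Result<(), String> {
        if self.has_non_main_branches {
            return Err("schema apply rejected: graph has non-main branches".to_string());
        }
        Ok(())
    }
}

fn apply(
    graph: &mut Graph,
    new_digest: &str,
    engine: impl FnOnce(&mut Graph, &str) -> Result<(), String>,
) -> Result<(), String> {
    // 1. Preview first, so a rejection leaves no sidecar behind.
    graph.preview()?;

    // 2. Arm the sidecar with the digest the live schema has right now.
    let recorded = graph.live_digest.clone();
    graph.sidecar = Some(recorded.clone());

    // 3. Run the engine, then decide what happens to the sidecar.
    match engine(&mut *graph, new_digest) {
        Ok(()) => {
            // Assumption: a completed apply clears its own sidecar.
            graph.sidecar = None;
            Ok(())
        }
        Err(err) => {
            // Nothing moved if the live schema still matches the recorded
            // digest, so the sidecar can be deleted. Otherwise keep it for
            // recovery.
            if graph.live_digest == recorded {
                graph.sidecar = None;
            }
            Err(err)
        }
    }
}

fn main() {
    // Engine fails after the schema moved: the sidecar must survive.
    let mut g = Graph {
        live_digest: "v1".to_string(),
        has_non_main_branches: false,
        sidecar: None,
    };
    let result = apply(&mut g, "v2", |g, d| {
        g.live_digest = d.to_string();
        Err("crash after movement".to_string())
    });
    assert!(result.is_err());
    assert_eq!(g.sidecar, Some("v1".to_string()));
}
```

In this sketch the digest comparison is what separates a pre-movement rejection from a genuine mid-movement failure, which matches the two cleanup branches the commit message says the failpoint tests cover.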
parent 7168ee0ed0
commit 7fd23c54a3
21 changed files with 1043 additions and 203 deletions
@@ -18,6 +18,7 @@ pub(crate) fn diff_resources(
                disposition: None,
                reason: None,
                binding_change: false,
                metadata_change: None,
                migration: None,
            }),
            Some(before) if before != after => changes.push(PlanChange {
@@ -28,6 +29,7 @@ pub(crate) fn diff_resources(
                disposition: None,
                reason: None,
                binding_change: false,
                metadata_change: None,
                migration: None,
            }),
            Some(_) => {}
@@ -43,6 +45,7 @@ pub(crate) fn diff_resources(
                disposition: None,
                reason: None,
                binding_change: false,
                metadata_change: None,
                migration: None,
            });
        }
@@ -82,6 +85,47 @@ pub(crate) fn append_policy_binding_changes(
            disposition: None,
            reason: None,
            binding_change: true,
            metadata_change: Some(PlanMetadataChange::PolicyBindings),
            migration: None,
        });
    }
    changes.sort_by(|a, b| a.resource.cmp(&b.resource));
}

/// Metadata-only embedding provider changes: the provider digest is unchanged
/// but the applied state predates storing the profile body needed by
/// config-free serving. This mirrors policy binding backfill instead of
/// hiding a serving-time failure behind a no-op plan.
pub(crate) fn append_embedding_profile_changes(
    changes: &mut Vec<PlanChange>,
    prior_state: Option<&ClusterState>,
    desired: &DesiredCluster,
) {
    let Some(state) = prior_state else {
        return; // no state: provider Creates carry profiles already
    };
    for (address, desired_profile) in &desired.embedding_providers {
        if changes
            .iter()
            .any(|change| change.resource.as_str() == address.as_str())
        {
            continue; // content change already covers it
        }
        let Some(entry) = state.applied_revision.resources.get(address) else {
            continue; // not applied yet: the Create covers it
        };
        if entry.embedding_profile.as_ref() == Some(desired_profile) {
            continue;
        }
        changes.push(PlanChange {
            resource: address.clone(),
            operation: PlanOperation::Update,
            before_digest: Some(entry.digest.clone()),
            after_digest: Some(entry.digest.clone()),
            disposition: None,
            reason: None,
            binding_change: false,
            metadata_change: Some(PlanMetadataChange::EmbeddingProfile),
            migration: None,
        });
    }
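The backfill rule in the hunk above can be exercised in isolation with simplified stand-in types. The sketch below is an assumption-laden reduction, not the crate's API: `AppliedEntry`, `MetadataUpdate`, and `profile_backfills` are hypothetical names, profiles and addresses are plain strings, and the resource addresses in `main` are invented for the example. It keeps the same three skip conditions (content change already planned, resource not applied yet, stored profile already matches) and emits an update whose before and after digests are identical.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-ins for the ledger entry and plan change.
struct AppliedEntry {
    digest: String,
    embedding_profile: Option<String>,
}

#[derive(Debug, PartialEq)]
struct MetadataUpdate {
    resource: String,
    before_digest: String,
    after_digest: String,
}

fn profile_backfills(
    applied: &BTreeMap<String, AppliedEntry>,
    desired: &BTreeMap<String, String>,
    already_changed: &[&str],
) -> Vec<MetadataUpdate> {
    let mut updates = Vec::new();
    for (address, desired_profile) in desired {
        if already_changed.contains(&address.as_str()) {
            continue; // a content change already covers this resource
        }
        let Some(entry) = applied.get(address) else {
            continue; // not applied yet: the create carries the profile
        };
        if entry.embedding_profile.as_ref() == Some(desired_profile) {
            continue; // stored profile is already up to date
        }
        // Metadata-only change: the digest is the same on both sides.
        updates.push(MetadataUpdate {
            resource: address.clone(),
            before_digest: entry.digest.clone(),
            after_digest: entry.digest.clone(),
        });
    }
    updates
}

fn main() {
    let mut applied = BTreeMap::new();
    // An older ledger entry that predates storing the profile body.
    applied.insert(
        "embedding.legacy".to_string(),
        AppliedEntry { digest: "d1".to_string(), embedding_profile: None },
    );
    let mut desired = BTreeMap::new();
    desired.insert("embedding.legacy".to_string(), "profile-a".to_string());

    let updates = profile_backfills(&applied, &desired, &[]);
    assert_eq!(updates.len(), 1);
    assert_eq!(updates[0].before_digest, updates[0].after_digest);
}
```

Returning the updates instead of mutating a shared `Vec` is a choice made only to keep the sketch easy to test; the function in the diff appends to the caller's `changes` list.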