omnigraph/crates/omnigraph-server/src/main.rs
Andrew Altshuler 7fd23c54a3
fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284)
* fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap

A `cluster apply` carrying a schema change against a graph that has
non-main branches, or an unsupported "needs backfill" migration, armed a
recovery sidecar *before* calling the engine, then left it behind when the
engine rejected the apply pre-movement. The server refuses to boot while
any sidecar is pending, and re-running apply re-armed a fresh sidecar — an
unescapable crash loop. None of the engine rejections are bugs; the trap
is in the apply/serve choreography.

Three coordinated changes:

1. Preview before arming the sidecar. `cluster apply` now runs
   `preview_schema_apply_with_options` before `write_recovery_sidecar`, so
   parser/planner rejections (non-main branches, unsupported plan) fail
   loudly without leaving recovery work behind. The post-preview engine
   error path now deletes the sidecar when the live schema still matches
   the recorded digest (nothing moved), and keeps it only on real
   mid-movement failure — both branches covered by new engine-failpoint
   tests (cluster failpoints now enable omnigraph/failpoints).

2. Per-graph quarantine at serve time instead of whole-cluster refusal.
   A graph-attributed pending sidecar, an unopenable graph root, a query
   parse failure, or an unresolvable embedding provider now quarantines
   just that graph (logged loudly at every boot layer) while healthy
   graphs serve; `/graphs` lists only ready graphs and quarantined routes
   404. Cluster-global problems (missing/unreadable state, malformed or
   unattributable sidecars, shared-catalog or cluster-policy errors, zero
   healthy graphs) stay fail-fast. `--require-all-graphs` /
   OMNIGRAPH_REQUIRE_ALL_GRAPHS=1 restores all-or-nothing boot.

3. Backfill embedding-provider profile metadata on apply. Mirrors the
   existing policy-binding backfill: a pre-5A ledger missing
   `embedding_profile` is now detected as a metadata-only change and
   backfilled by a no-op apply, instead of bricking serve with
   `embedding_provider_profile_missing` forever.

Tests: trap (no sidecar after a rejected apply), both digest-cleanup
branches, per-graph quarantine (cluster + server), embedding backfill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: resilient cluster boot + recovery-sidecar trap fix

Amend RFC-005 D4 readiness posture (cluster-global fail-fast vs graph-local
quarantine; deviation #5 for --require-all-graphs), add the v0.7.0 release
note, and update the user cluster/server/deployment docs and the
OMNIGRAPH_REQUIRE_ALL_GRAPHS env var.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(cluster): surface sidecar-cleanup failures; document severity promotion

Address Greptile review on PR #284:

- The pre-movement sidecar cleanup fast-path discarded `delete_object`'s
  result, so a transient delete failure left the graph quarantined with no
  signal. Add `try_delete_object` (Result-returning) and emit a
  `recovery_sidecar_cleanup_failed` warning diagnostic on failure; the
  fire-and-forget `delete_object` now delegates to it.
- Document why the serve-time loop promotes every `list_recovery_sidecars`
  diagnostic to a cluster-fatal error (the listing only emits genuine
  read/parse/version failures, as warnings, whose blast radius serving
  cannot prove) and note the promote-by-code path if that ever changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 03:34:15 +03:00

46 lines
1.5 KiB
Rust

use std::path::PathBuf;
use clap::Parser;
use color_eyre::eyre::Result;
use omnigraph_server::{ServerConfig, init_tracing, load_server_settings, serve};
#[derive(Debug, Parser)]
#[command(name = "omnigraph-server")]
#[command(about = "HTTP server for the Omnigraph graph database")]
struct Cli {
/// Boot from a cluster: either a config directory (storage resolved
/// through cluster.yaml) or a storage-root URI directly
/// (s3://bucket/prefix — config-free serving from the bucket).
/// The server's only boot source (RFC-011 cluster-only).
#[arg(long)]
cluster: Option<PathBuf>,
#[arg(long)]
bind: Option<String>,
/// Run without bearer tokens and without a policy file (MR-723).
/// Required when neither is configured — otherwise the server
/// refuses to start to prevent shipping the illusion of protection.
/// Equivalent to setting `OMNIGRAPH_UNAUTHENTICATED=1`.
#[arg(long)]
unauthenticated: bool,
/// Fail startup if any applied graph is quarantined or fails to open.
/// By default, graph-local failures are logged and healthy graphs still
/// serve. Equivalent to setting `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`.
#[arg(long)]
require_all_graphs: bool,
}
#[tokio::main]
async fn main() -> Result<()> {
color_eyre::install()?;
init_tracing();
let cli = Cli::parse();
let settings: ServerConfig = load_server_settings(
cli.cluster.as_ref(),
cli.bind,
cli.unauthenticated,
cli.require_all_graphs,
)
.await?;
serve(settings).await
}