omnigraph/docs/dev
Andrew Altshuler 7fd23c54a3
fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284)
* fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap

A `cluster apply` carrying a schema change against a graph that has
non-main branches, or an unsupported "needs backfill" migration, armed a
recovery sidecar *before* calling the engine, then left it behind when the
engine rejected the apply pre-movement. The server refuses to boot while
any sidecar is pending, and re-running apply re-armed a fresh sidecar — an
unescapable crash loop. None of the engine rejections are bugs; the trap
is in the apply/serve choreography.

Three coordinated changes:

1. Preview before arming the sidecar. `cluster apply` now runs
   `preview_schema_apply_with_options` before `write_recovery_sidecar`, so
   parser/planner rejections (non-main branches, unsupported plan) fail
   loudly without leaving recovery work behind. The post-preview engine
   error path now deletes the sidecar when the live schema still matches
   the recorded digest (nothing moved), and keeps it only on real
   mid-movement failure — both branches covered by new engine-failpoint
   tests (cluster failpoints now enable omnigraph/failpoints).

2. Per-graph quarantine at serve time instead of whole-cluster refusal.
   A graph-attributed pending sidecar, an unopenable graph root, a query
   parse failure, or an unresolvable embedding provider now quarantines
   just that graph (logged loudly at every boot layer) while healthy
   graphs serve; `/graphs` lists only ready graphs and quarantined routes
   404. Cluster-global problems (missing/unreadable state, malformed or
   unattributable sidecars, shared-catalog or cluster-policy errors, zero
   healthy graphs) stay fail-fast. `--require-all-graphs` /
   OMNIGRAPH_REQUIRE_ALL_GRAPHS=1 restores all-or-nothing boot.

3. Backfill embedding-provider profile metadata on apply. Mirrors the
   existing policy-binding backfill: a pre-5A ledger missing
   `embedding_profile` is now detected as a metadata-only change and
   backfilled by a no-op apply, instead of bricking serve with
   `embedding_provider_profile_missing` forever.

Tests: trap (no sidecar after a rejected apply), both digest-cleanup
branches, per-graph quarantine (cluster + server), embedding backfill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: resilient cluster boot + recovery-sidecar trap fix

Amend RFC-005 D4 readiness posture (cluster-global fail-fast vs graph-local
quarantine; deviation #5 for --require-all-graphs), add the v0.7.0 release
note, and update the user cluster/server/deployment docs and the
OMNIGRAPH_REQUIRE_ALL_GRAPHS env var.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(cluster): surface sidecar-cleanup failures; document severity promotion

Address Greptile review on PR #284:

- The pre-movement sidecar cleanup fast-path discarded `delete_object`'s
  result, so a transient delete failure left the graph quarantined with no
  signal. Add `try_delete_object` (Result-returning) and emit a
  `recovery_sidecar_cleanup_failed` warning diagnostic on failure; the
  fire-and-forget `delete_object` now delegates to it.
- Document why the serve-time loop promotes every `list_recovery_sidecars`
  diagnostic to a cluster-fatal error (the listing only emits genuine
  read/parse/version failures, as warnings, whose blast radius serving
  cannot prove) and note the promote-by-code path if that ever changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 03:34:15 +03:00
..
architecture.md feat!: delete the legacy OmnigraphConfig + config migrate; finish the omnigraph.yaml docs sweep (#252) 2026-06-15 22:31:29 +03:00
branch-protection.md chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
ci.md chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
cluster-axioms.md docs(cluster): axiom 15 — single ownership, mode-switch migration, per-operator layer (#164) 2026-06-10 00:44:51 +03:00
cluster-config-implementation-spec.md docs(cluster): RFC-005 — server boots from cluster state (Phase 5 design) (#174) 2026-06-10 15:22:12 +03:00
cluster-config-specs.md docs(user): restructure user docs into topic sections (Phase 1) (#223) 2026-06-14 13:52:14 +03:00
execution.md fix(embedding): address PR review feedback (RFC-012 Phase 2) 2026-06-15 18:37:34 +02:00
index.md chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
invariants.md perf(engine): remove the per-query metadata re-derivation tax on warm reads (#268) 2026-06-17 13:25:20 +02:00
lance.md test(engine): pin Lance 7 immutable-PK behavior + sharpen native-namespace alignment notes (#240) 2026-06-15 11:33:25 +02:00
merge.md docs: split user and developer docs (#93) 2026-05-15 03:45:22 +03:00
rfc-001-queries-envelope-mcp.md docs(user): restructure user docs into topic sections (Phase 1) (#223) 2026-06-14 13:52:14 +03:00
rfc-002-config-cli-architecture.md docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus 2026-06-12 17:33:11 +03:00
rfc-003-mcp-server-surface.md Stored-query registry foundation + config/CLI RFC-002 (#128) 2026-06-01 22:50:31 +02:00
rfc-004-cluster-graph-schema-apply.md docs(cluster): document Stage 4C — Phase 4 complete 2026-06-10 14:44:12 +03:00
rfc-005-server-cluster-boot.md fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284) 2026-06-19 03:34:15 +03:00
rfc-007-operator-config.md docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus 2026-06-12 17:33:11 +03:00
rfc-008-deprecate-omnigraph-yaml.md docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus 2026-06-12 17:33:11 +03:00
rfc-009-unify-access-paths.md feat: canonical POST /load, deprecate /ingest (RFC-009 Phase 5) (#222) 2026-06-14 03:32:16 +03:00
rfc-010-cli-planes-restructure.md docs(rfc): RFC-010 — apply verification-comment current-state fixups (#215) 2026-06-13 22:24:09 +03:00
rfc-011-cli-refactoring.md feat(cli): add read-only profile list / profile show (RFC-011 D8) (#255) 2026-06-15 23:33:01 +03:00
rfc-012-embedding-provider-config.md Wire cluster embedding providers 2026-06-16 04:02:08 +03:00
schema-lint-v1-plan.md schema-lint chassis v1.0: DropProperty Soft + code-tagged diagnostics (MR-694) (#90) 2026-05-16 16:30:03 +03:00
testing.md perf(engine): remove the per-query metadata re-derivation tax on warm reads (#268) 2026-06-17 13:25:20 +02:00
writes.md fix(engine): stop branch-merge fast-forward OOM on embedding tables (#277) 2026-06-19 00:15:06 +02:00