2026-04-10 20:49:41 +03:00
#!/bin/sh
set -eu

SERVER_BIN="/usr/local/bin/omnigraph-server"

if [ "$#" -gt 0 ]; then
  exec "$SERVER_BIN" "$@"
fi

bind="${OMNIGRAPH_BIND:-0.0.0.0:8080}"

2026-06-10 22:35:58 +03:00
# Cluster mode first, and exclusive (the server's mode-inference rule 0):
# a deployment serves from cluster state XOR omnigraph.yaml, never a merge.
# Fail fast here with the same contract the server enforces.
if [ -n "${OMNIGRAPH_CLUSTER:-}" ]; then
  if [ -n "${OMNIGRAPH_TARGET_URI:-}" ] || [ -n "${OMNIGRAPH_CONFIG:-}" ] || [ -n "${OMNIGRAPH_TARGET:-}" ]; then
    echo "OMNIGRAPH_CLUSTER is an exclusive boot source; unset OMNIGRAPH_TARGET_URI/OMNIGRAPH_CONFIG/OMNIGRAPH_TARGET" >&2
    exit 64
  fi
fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284)

* fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap

A `cluster apply` carrying a schema change against a graph that has
non-main branches, or an unsupported "needs backfill" migration, armed a
recovery sidecar *before* calling the engine, then left it behind when the
engine rejected the apply pre-movement. The server refuses to boot while
any sidecar is pending, and re-running apply re-armed a fresh sidecar: an
unescapable crash loop. None of the engine rejections are bugs; the trap
is in the apply/serve choreography.

Three coordinated changes:

1. Preview before arming the sidecar. `cluster apply` now runs
   `preview_schema_apply_with_options` before `write_recovery_sidecar`, so
   parser/planner rejections (non-main branches, unsupported plan) fail
   loudly without leaving recovery work behind. The post-preview engine
   error path now deletes the sidecar when the live schema still matches
   the recorded digest (nothing moved), and keeps it only on real
   mid-movement failure; both branches covered by new engine-failpoint
   tests (cluster failpoints now enable omnigraph/failpoints).

2. Per-graph quarantine at serve time instead of whole-cluster refusal.
   A graph-attributed pending sidecar, an unopenable graph root, a query
   parse failure, or an unresolvable embedding provider now quarantines
   just that graph (logged loudly at every boot layer) while healthy
   graphs serve; `/graphs` lists only ready graphs and quarantined routes
   404. Cluster-global problems (missing/unreadable state, malformed or
   unattributable sidecars, shared-catalog or cluster-policy errors, zero
   healthy graphs) stay fail-fast. `--require-all-graphs` /
   OMNIGRAPH_REQUIRE_ALL_GRAPHS=1 restores all-or-nothing boot.

3. Backfill embedding-provider profile metadata on apply. Mirrors the
   existing policy-binding backfill: a pre-5A ledger missing
   `embedding_profile` is now detected as a metadata-only change and
   backfilled by a no-op apply, instead of bricking serve with
   `embedding_provider_profile_missing` forever.

Tests: trap (no sidecar after a rejected apply), both digest-cleanup
branches, per-graph quarantine (cluster + server), embedding backfill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: resilient cluster boot + recovery-sidecar trap fix

Amend RFC-005 D4 readiness posture (cluster-global fail-fast vs graph-local
quarantine; deviation #5 for --require-all-graphs), add the v0.7.0 release
note, and update the user cluster/server/deployment docs and the
OMNIGRAPH_REQUIRE_ALL_GRAPHS env var.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(cluster): surface sidecar-cleanup failures; document severity promotion

Address Greptile review on PR #284:

- The pre-movement sidecar cleanup fast-path discarded `delete_object`'s
  result, so a transient delete failure left the graph quarantined with no
  signal. Add `try_delete_object` (Result-returning) and emit a
  `recovery_sidecar_cleanup_failed` warning diagnostic on failure; the
  fire-and-forget `delete_object` now delegates to it.
- Document why the serve-time loop promotes every `list_recovery_sidecars`
  diagnostic to a cluster-fatal error (the listing only emits genuine
  read/parse/version failures, as warnings, whose blast radius serving
  cannot prove) and note the promote-by-code path if that ever changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 03:34:15 +03:00
  set -- --cluster "${OMNIGRAPH_CLUSTER}" --bind "${bind}"
  case "${OMNIGRAPH_REQUIRE_ALL_GRAPHS:-}" in
    ""|0|false|FALSE) ;;
    *) set -- "$@" --require-all-graphs ;;
  esac
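  # Illustration only (the cluster path below is a hypothetical value, not
  # one taken from this repository). Given
  #   OMNIGRAPH_CLUSTER=/data/cluster OMNIGRAPH_REQUIRE_ALL_GRAPHS=1
  # and OMNIGRAPH_BIND unset, the positional parameters at this point are
  #   --cluster /data/cluster --bind 0.0.0.0:8080 --require-all-graphs
  # The case above treats only empty, 0, false, and FALSE as "off"; any
  # other value (for example "no" or "False") also adds --require-all-graphs.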
  exec "$SERVER_BIN" "$@"
2026-06-10 22:35:58 +03:00
fi

2026-05-30 20:17:55 +01:00
# URI comes from the env var (the positional arg wins over any config
# `graphs` block in resolve_target_uri). OMNIGRAPH_CONFIG, when also set,
# is forwarded as --config purely to supply a policy file; the two
# compose. Without OMNIGRAPH_CONFIG the behavior is unchanged.
2026-04-10 20:49:41 +03:00
if [ -n "${OMNIGRAPH_TARGET_URI:-}" ]; then
2026-05-30 20:17:55 +01:00
  exec "$SERVER_BIN" "${OMNIGRAPH_TARGET_URI}" \
    ${OMNIGRAPH_CONFIG:+--config "$OMNIGRAPH_CONFIG"} \
    --bind "${bind}"
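  # Illustration only (<uri> and <path> are placeholders, not real values).
  # Given OMNIGRAPH_TARGET_URI=<uri> and OMNIGRAPH_CONFIG=<path>, with
  # OMNIGRAPH_BIND unset, the exec above runs
  #   /usr/local/bin/omnigraph-server <uri> --config <path> --bind 0.0.0.0:8080
  # When OMNIGRAPH_CONFIG is unset or empty, the ${OMNIGRAPH_CONFIG:+...}
  # expansion produces no words, so the command is
  #   /usr/local/bin/omnigraph-server <uri> --bind 0.0.0.0:8080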
2026-04-10 20:49:41 +03:00
fi

if [ -n "${OMNIGRAPH_CONFIG:-}" ]; then
  if [ -n "${OMNIGRAPH_TARGET:-}" ]; then
    exec "$SERVER_BIN" --config "${OMNIGRAPH_CONFIG}" --target "${OMNIGRAPH_TARGET}" --bind "${bind}"
  fi
  exec "$SERVER_BIN" --config "${OMNIGRAPH_CONFIG}" --bind "${bind}"
fi

cat >&2 <<'EOF'
omnigraph-server container startup requires one of:
2026-06-10 22:35:58 +03:00
  - OMNIGRAPH_CLUSTER (serve a cluster directory's applied revision;
    exclusive, cannot combine with the others)
2026-04-10 20:49:41 +03:00
  - OMNIGRAPH_TARGET_URI
  - OMNIGRAPH_CONFIG

Optional:
  - OMNIGRAPH_BIND (default: 0.0.0.0:8080)
2026-06-19 03:34:15 +03:00
  - OMNIGRAPH_REQUIRE_ALL_GRAPHS (cluster mode: fail startup unless every
    applied graph is healthy)
2026-04-10 20:49:41 +03:00
  - OMNIGRAPH_TARGET (used with OMNIGRAPH_CONFIG)
2026-05-30 20:17:55 +01:00
  - OMNIGRAPH_CONFIG (may also accompany OMNIGRAPH_TARGET_URI to add a
    policy file; the URI still comes from OMNIGRAPH_TARGET_URI)
2026-04-10 20:49:41 +03:00
EOF
exit 64