Add cluster state lock recovery

aaltshuler 2026-06-09 02:12:00 +03:00
parent cb1e7bb5ea
commit 4fffddc6b7
6 changed files with 596 additions and 52 deletions


@@ -1,13 +1,13 @@
# Cluster Config
-**Status:** Stage 2B state-observation preview.
+**Status:** Stage 2C state-lock recovery preview.
Cluster config is the future control-plane configuration surface for a whole
OmniGraph deployment. In this stage, OmniGraph can validate a local
`cluster.yaml` folder, produce a deterministic read-only plan, inspect the
local JSON state ledger, and explicitly refresh/import graph observations into
-that ledger. It does not apply desired changes, start servers, or write graph
-resources.
+that ledger. It can also manually remove a held local state lock by exact lock
+id. It does not apply desired changes, start servers, or write graph resources.
## Commands
@@ -17,6 +17,7 @@ omnigraph cluster plan --config ./company-brain --json
omnigraph cluster status --config ./company-brain --json
omnigraph cluster refresh --config ./company-brain --json
omnigraph cluster import --config ./company-brain --json
+omnigraph cluster force-unlock <LOCK_ID> --config ./company-brain --json
```
`--config` points at a directory, not a file. The directory must contain
@@ -24,7 +25,7 @@ omnigraph cluster import --config ./company-brain --json
## Supported `cluster.yaml`
-Stage 2B accepts only the read-only resource subset:
+Stage 2C accepts only the read-only resource subset:
```yaml
version: 1
@@ -53,7 +54,9 @@ policies:
defaults to `true`. When enabled, `cluster plan`, `cluster refresh`, and
`cluster import` briefly acquire `<config-dir>/__cluster/lock.json`, then remove
it before returning. `cluster status` never acquires the lock; it only reports
-whether one is present.
+whether one is present. `cluster force-unlock` is the only lock-removal command;
+it requires the exact lock id and should be run only after confirming no cluster
+operation is active.
## Validation
@@ -115,18 +118,19 @@ Missing `state_revision` is treated as `0`. Resource status values are
Plan output compares desired resource digests against state resource digests
and reports `create`, `update`, and `delete` changes. It also reports the state
-CAS (`sha256:<digest>`), state revision, and lock id used for the read. The
-command never writes `state.json` and does not scan live graphs. Use explicit
-`cluster refresh` / `cluster import` when the state ledger should be updated
-from live observations. Apply and live drift scans during plan are later-stage
-work.
+CAS (`sha256:<digest>`), state revision, and lock metadata used for the read.
+The command never writes `state.json` and does not scan live graphs. Use
+explicit `cluster refresh` / `cluster import` when the state ledger should be
+updated from live observations. Apply and live drift scans during plan are
+later-stage work.
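
The digest comparison described for `cluster plan` can be illustrated with a minimal sketch. This is not OmniGraph's implementation: the function names, the flat name-to-digest mapping, and the choice to hash the raw bytes of `state.json` for the CAS value are all assumptions made for the example.

```python
import hashlib


def state_cas(state_bytes: bytes) -> str:
    """Format a CAS token as `sha256:<digest>`.

    Assumption: the digest covers the raw bytes of state.json.
    """
    return "sha256:" + hashlib.sha256(state_bytes).hexdigest()


def plan_changes(desired: dict[str, str], state: dict[str, str]) -> list[tuple[str, str]]:
    """Compare desired digests with state digests (hypothetical data shape).

    Both arguments map a resource name to its digest. Names are visited in
    sorted order so the resulting plan is deterministic.
    """
    changes = []
    for name in sorted(desired.keys() | state.keys()):
        if name not in state:
            changes.append(("create", name))   # desired but not yet in state
        elif name not in desired:
            changes.append(("delete", name))   # in state but no longer desired
        elif desired[name] != state[name]:
            changes.append(("update", name))   # present in both, digests differ
    return changes
```

Resources whose digests match on both sides produce no entry, which is why an unchanged config yields an empty plan in this sketch.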
## Status
`cluster status` reads the same local JSON state ledger and prints what the
ledger says is deployed. It does not validate referenced schema/query/policy
files and does not inspect live graphs. Missing `state.json` succeeds with a
-warning; invalid state JSON or an unsupported state version fails.
+warning; invalid state JSON or an unsupported state version fails. If a lock is
+present, status reports its id, operation, creation time, pid, and age.
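
As a rough illustration of the lock fields that status reports, the sketch below parses a lock document and derives its age. Only `lock_id` is named in this document. The `operation`, `created_at`, and `pid` keys, the ISO 8601 timestamp format, and the `describe_lock` helper are hypothetical.

```python
import json
from datetime import datetime, timezone


def describe_lock(lock_text: str, now: datetime) -> dict:
    """Summarize a lock file for status output (field names are assumptions)."""
    lock = json.loads(lock_text)
    # Assumption: created_at is an ISO 8601 timestamp with a UTC offset.
    created = datetime.fromisoformat(lock["created_at"])
    return {
        "id": lock["lock_id"],
        "operation": lock["operation"],
        "created_at": lock["created_at"],
        "pid": lock["pid"],
        # Age is derived at read time; it is not stored in the lock file.
        "age_seconds": int((now - created).total_seconds()),
    }


example = json.dumps({
    "lock_id": "example-lock",
    "operation": "refresh",
    "created_at": "2026-06-09T00:00:00+00:00",
    "pid": 4242,
})
summary = describe_lock(example, datetime(2026, 6, 9, 0, 1, 30, tzinfo=timezone.utc))
```

Reporting an age does not mean the lock is stale; the pid and age are only evidence for an operator deciding whether a lock was abandoned.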
## Refresh And Import
@@ -148,3 +152,14 @@ initial state.
Refresh/import do not observe query or policy resources yet. Existing query and
policy state digests are preserved on refresh and are not invented on import.
+## Force Unlock
+`cluster force-unlock <LOCK_ID>` removes `<config-dir>/__cluster/lock.json` only
+when the file exists, is valid version-1 lock JSON, and its `lock_id` exactly
+matches the argument. A wrong id, missing lock, invalid lock JSON, or unsupported
+lock version exits non-zero and leaves the file untouched.
+This is manual recovery for abandoned local locks. OmniGraph does not perform
+PID-liveness checks, TTL expiry, stale-lock breaking, or automatic unlock in
+Stage 2C.
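
The force-unlock rules above can be sketched as a short function. This is an illustration under stated assumptions, not the code added in this commit: the `version` key name, the `ForceUnlockError` type, and the `force_unlock` function are hypothetical. Only the lock path and the `lock_id` field come from the text above.

```python
import json
from pathlib import Path


class ForceUnlockError(Exception):
    """Raised for every case that should exit non-zero (hypothetical type)."""


def force_unlock(config_dir: Path, lock_id: str) -> None:
    """Remove the lock file only if it is valid and its id matches exactly."""
    lock_path = config_dir / "__cluster" / "lock.json"
    try:
        raw = lock_path.read_text(encoding="utf-8")
    except FileNotFoundError:
        raise ForceUnlockError("no lock file present") from None
    try:
        lock = json.loads(raw)
    except json.JSONDecodeError:
        raise ForceUnlockError("invalid lock JSON") from None
    # Assumption: the lock document stores its format version under "version".
    if not isinstance(lock, dict) or lock.get("version") != 1:
        raise ForceUnlockError("unsupported lock version")
    if lock.get("lock_id") != lock_id:
        raise ForceUnlockError("lock id does not match")
    # Every check passed; this is the only point where the file is removed.
    lock_path.unlink()
```

Each failure path raises before `unlink()` is reached, so the file is left untouched in all four error cases. The sketch performs no PID-liveness or age check, matching the manual-recovery scope described above.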