Merge branch 'main' into ragnorc/index-best-practices-audit

2026-06-21 02:28:07 +02:00 · 2026-06-09 17:57:14 +02:00 · 2026-06-09 17:57:14 +02:00 · 5ca5c40df7
commit 5ca5c40df7
parent f96682bd52 59b64ea097
38 changed files with 5302 additions and 202 deletions
--- a/docs/user/cli-reference.md
+++ b/docs/user/cli-reference.md
@@ -2,7 +2,7 @@

 A reference for the `omnigraph` binary's command surface and `omnigraph.yaml` schema. For a quick-start guide, see [cli.md](cli.md).

-17 top-level command families, 40+ subcommands. All commands accept either a positional `URI`, `--uri`, or a `--target <name>` resolved against `omnigraph.yaml`.
+Top-level command families and subcommands. Graph-targeting commands accept a positional `URI`, `--uri`, or `--target <name>` (resolved against `omnigraph.yaml`); `cluster` commands use `--config <dir>`.

 ## Top-level commands

@@ -17,11 +17,12 @@ A reference for the `omnigraph` binary's command surface and `omnigraph.yaml` sc
 | `export` | dump to JSONL on stdout (`--type T`, `--table K` filters) |
 | `branch create \| list \| delete \| merge` | branching ops |
 | `commit list \| show` | inspect commit graph |
-| `run list \| show \| publish \| abort` | transactional run ops |
 | `schema plan \| apply \| show (alias: get)` | migrations |
 | `lint` (alias: `check`) | offline / graph-backed query validation. Replaces `query lint` / `query check`, which are kept as deprecated argv-level shims that print a one-line warning and rewrite to `omnigraph lint` |
 | `queries validate \| list` | operate on the server-side stored-query registry (the `queries:` block). `validate` type-checks every stored query against the live schema offline (opens the selected graph; exits non-zero on any breakage), catching schema drift without restarting the server; `list` prints the selected registry's query names, MCP exposure, and typed params. For per-graph registries, pass `--target <graph>` or set `cli.graph`; with no graph selection, `list` shows only top-level `queries:`. Distinct from `lint`, which validates a single `.gq` file |
-| `optimize` | non-destructive Lance compaction (skips tables with `Blob` columns; `--json` reports a `skipped` field) |
+| `cluster validate \| plan \| status` | read-only cluster-control preview. `validate` checks a local `cluster.yaml` folder and referenced schema/query/policy files; `plan` diffs it against local JSON state at `__cluster/state.json` while briefly holding `__cluster/lock.json`; `status` reads the state ledger. No apply, graph open, live drift scan, server change, or `state.json` mutation occurs in Stage 2A |
+| `optimize` | non-destructive Lance compaction (skips tables with `Blob` columns or uncovered drift; `--json` reports `skipped`) |
+| `repair [--confirm] [--force]` | preview or explicitly publish uncovered manifest/head drift. `--confirm` heals verified maintenance drift and exits non-zero if suspicious/unverifiable drift is refused; `--force --confirm` publishes suspicious/unverifiable drift after operator review |
 | `cleanup --keep N --older-than 7d --confirm` | destructive version GC |
 | `embed` | offline JSONL embedding pipeline |
 | `policy validate \| test \| explain` | Cedar tooling. Selects `cli.graph`, else `server.graph`, else top-level `policy.file` |
@@ -73,6 +74,23 @@ policy:
  file: ./policy.yaml
 ```

+## Cluster config preview
+
+```bash
+omnigraph cluster validate --config ./company-brain
+omnigraph cluster plan     --config ./company-brain --json
+omnigraph cluster status   --config ./company-brain --json
+```
+
+`--config` is a directory containing `cluster.yaml`; it defaults to `.`.
+Stage 2A accepts graphs, schemas, stored queries, and policy bundle file
+references. `cluster plan` reads local JSON state from
+`<config-dir>/__cluster/state.json`; a missing file means empty state. Plan
+acquires `__cluster/lock.json` by default and releases it before returning.
+`cluster status` reads state only and reports any existing lock. External state
+backends, apply, refresh/import, pipelines, UI specs, embeddings, aliases, and
+bindings are reserved for later stages. See [cluster-config.md](cluster-config.md).
+
 ## Output formats (`query` command, alias: `read`)

 - `json` — pretty-printed object with metadata + rows
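Reviewer sketch for the `cluster plan` locking described in the "Cluster config preview" hunk above: acquire `__cluster/lock.json`, do the read-only work, and release the lock before returning. This is a minimal illustration and not OmniGraph's implementation. The lock file contents (`lock_id`), the use of exclusive file creation, creating the `__cluster` directory on demand, and the names `plan_lock` and `lock_present` are all assumptions.

```python
import json
import os
import uuid
from contextlib import contextmanager
from pathlib import Path


def _lock_path(config_dir):
    # Path taken from the docs: <config-dir>/__cluster/lock.json
    return Path(config_dir) / "__cluster" / "lock.json"


@contextmanager
def plan_lock(config_dir):
    """Hold the lock for the duration of a plan, then remove it (hypothetical helper)."""
    path = _lock_path(config_dir)
    # Assumption: the directory is created if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    lock_id = uuid.uuid4().hex
    # O_EXCL makes creation fail with FileExistsError if another holder exists.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    with os.fdopen(fd, "w") as handle:
        json.dump({"lock_id": lock_id}, handle)  # assumed lock file shape
    try:
        yield lock_id
    finally:
        path.unlink()  # released before the caller returns


def lock_present(config_dir):
    """What a status-style reader can report without acquiring anything."""
    return _lock_path(config_dir).exists()
```

A `status`-style caller would only ever call `lock_present`, while a `plan`-style caller wraps its state read in `with plan_lock(config_dir) as lock_id:`.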
--- a/docs/user/cluster-config.md
+++ b/docs/user/cluster-config.md
@@ -0,0 +1,126 @@
+# Cluster Config
+
+**Status:** Stage 2A read-only preview.
+
+Cluster config is the future control-plane configuration surface for a whole
+OmniGraph deployment. In this stage, OmniGraph can validate a local
+`cluster.yaml` folder, produce a deterministic read-only plan, and inspect the
+local JSON state ledger. It does not apply changes, open graph roots, scan live
+cluster state, start servers, or write graph resources.
+
+## Commands
+
+```bash
+omnigraph cluster validate --config ./company-brain
+omnigraph cluster plan     --config ./company-brain --json
+omnigraph cluster status   --config ./company-brain --json
+```
+
+`--config` points at a directory, not a file. The directory must contain
+`cluster.yaml`. When omitted, it defaults to the current directory.
+
+## Supported `cluster.yaml`
+
+Stage 2A accepts only the read-only resource subset:
+
+```yaml
+version: 1
+metadata:
+  name: company-brain
+
+state:
+  backend: cluster
+  lock: true
+
+graphs:
+  knowledge:
+    schema: ./knowledge.pg
+    queries:
+      find_experts:
+        file: ./knowledge.gq
+
+policies:
+  base:
+    file: ./base.policy.yaml
+    applies_to: [knowledge]
+```
+
+`metadata.name` is a display label. `state.backend` may be omitted or set to
+`cluster`; external state backends are reserved for a later stage. `state.lock`
+defaults to `true`. When enabled, `cluster plan` briefly acquires
+`<config-dir>/__cluster/lock.json` while it reads state, then removes it before
+returning. `cluster status` never acquires the lock; it only reports whether one
+is present.
+
+## Validation
+
+`cluster validate` checks:
+
+- `cluster.yaml` syntax and supported fields
+- duplicate YAML keys
+- schema, query, and policy file existence
+- schema parsing and catalog construction
+- stored-query parsing and query-name matching
+- stored-query type-checking against the desired schema
+- policy `applies_to` graph references
+
+Fields reserved for later stages, such as `pipelines`, `embeddings`, `ui`,
+`aliases`, and `bindings`, fail with a typed diagnostic instead of being
+silently ignored.
+
+## Planning
+
+`cluster plan` first performs validation, then reads local JSON state from:
+
+```text
+<config-dir>/__cluster/state.json
+```
+
+If the file is missing, the state is treated as empty and every desired
+resource is planned as a create. If present, the file must use this shape:
+
+```json
+{
+  "version": 1,
+  "state_revision": 0,
+  "applied_revision": {
+    "config_digest": "...",
+    "resources": {
+      "graph.knowledge": { "digest": "..." },
+      "schema.knowledge": { "digest": "..." },
+      "query.knowledge.find_experts": { "digest": "..." },
+      "policy.base": { "digest": "..." }
+    }
+  },
+  "resource_statuses": {
+    "graph.knowledge": {
+      "status": "applied",
+      "conditions": [],
+      "message": "optional status detail"
+    }
+  },
+  "approval_records": {},
+  "recovery_records": {},
+  "observations": {}
+}
+```
+
+`state_revision`, `resource_statuses`, `approval_records`, `recovery_records`,
+and `observations` are optional so older Stage 1 state fixtures keep working.
+Missing `state_revision` is treated as `0`. Resource status values are
+`pending`, `planned`, `applying`, `applied`, `drifted`, `blocked`, or `error`.
+
+Plan output compares desired resource digests against state resource digests
+and reports `create`, `update`, and `delete` changes. It also reports the state
+CAS (`sha256:<digest>`) and state revision. `state_observations.locked` means an
+existing lock file was observed; a successful `plan` instead reports
+`lock_acquired: true` and an `acquired_lock_id`, then releases the lock before
+returning. The command never writes `state.json`; apply, refresh, import, and
+live drift scans are later-stage work.
+
+## Status
+
+`cluster status` reads the same local JSON state ledger and prints what the
+ledger says is deployed. It does not validate referenced schema/query/policy
+files and does not inspect live graphs. Missing `state.json` succeeds with a
+warning; invalid state JSON or an unsupported state version fails.
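Reviewer sketch for the "Planning" section of `cluster-config.md` above: how a missing `state.json`, a missing `state_revision`, and the digest comparison could fit together. It is a hedged illustration only. The function names `load_state` and `plan_changes`, the flattened return shape, and the error type are assumptions; only the file path, the JSON field names, the `0` default, and the `create`/`update`/`delete` vocabulary come from the docs.

```python
import json
from pathlib import Path

SUPPORTED_STATE_VERSION = 1  # the only version shown in the documented shape


def load_state(config_dir):
    """Read <config-dir>/__cluster/state.json; a missing file means empty state."""
    path = Path(config_dir) / "__cluster" / "state.json"
    if not path.exists():
        return {"state_revision": 0, "resources": {}}
    raw = json.loads(path.read_text())  # invalid JSON raises ValueError
    if raw.get("version") != SUPPORTED_STATE_VERSION:
        raise ValueError(f"unsupported state version: {raw.get('version')!r}")
    resources = raw["applied_revision"]["resources"]
    return {
        # Missing state_revision is treated as 0, per the docs.
        "state_revision": raw.get("state_revision", 0),
        # Flatten {"graph.knowledge": {"digest": "..."}} to key -> digest.
        "resources": {key: entry["digest"] for key, entry in resources.items()},
    }


def plan_changes(desired, state_resources):
    """Compare desired digests with state digests; unchanged keys are omitted."""
    changes = []
    for key in sorted(desired.keys() | state_resources.keys()):
        if key not in state_resources:
            changes.append((key, "create"))
        elif key not in desired:
            changes.append((key, "delete"))
        elif desired[key] != state_resources[key]:
            changes.append((key, "update"))
    return changes


if __name__ == "__main__":
    # Placeholder digests; real digests are computed by the CLI.
    desired = {"graph.knowledge": "d1", "schema.knowledge": "d2-new"}
    state = {"graph.knowledge": "d1", "schema.knowledge": "d2-old"}
    print(plan_changes(desired, state))
```

Sorting the resource keys keeps the output order stable across runs, which is one simple way to get the deterministic plan the docs describe.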
--- a/docs/user/index.md
+++ b/docs/user/index.md
@@ -13,6 +13,7 @@ of MRs, internal recovery mechanics, or contributor-only invariants.
 | Install OmniGraph | [install.md](install.md) |
 | Run the CLI locally | [cli.md](cli.md) |
 | Look up every CLI flag and config field | [cli-reference.md](cli-reference.md) |
+| Validate and plan cluster config | [cluster-config.md](cluster-config.md) |
 | Write schemas | [schema-language.md](schema-language.md) |
 | Read schema-lint diagnostic codes | [schema-lint.md](schema-lint.md) |
 | Write queries and mutations | [query-language.md](query-language.md) |
--- a/docs/user/maintenance.md
+++ b/docs/user/maintenance.md
@@ -1,17 +1,26 @@
-# Maintenance: Optimize & Cleanup
+# Maintenance: Optimize, Repair & Cleanup

-`db/omnigraph/optimize.rs`.
+`db/omnigraph/optimize.rs` and `db/omnigraph/repair.rs`.

 ## `optimize_all_tables(db)` — non-destructive

 - Lance `compact_files()` on every node + edge table on `main`, then **publishes the compacted version to the `__manifest`** so the manifest's `table_version` tracks the compacted Lance HEAD. Reads pin the manifest version, so without this publish compaction would be invisible to readers *and* would break the HEAD-vs-manifest precondition of the next schema apply / strict update/delete ("stale view … refresh and retry"). The publish advances the graph version (a system-attributed commit) only for tables that actually compacted.
 - Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests until `cleanup` runs.
 - Each table's compact→publish runs under its per-`(table, main)` write queue (serializing with concurrent mutations — compaction is a Lance `Rewrite` op that retryable-conflicts with a concurrent merge/update/delete on overlapping fragments). The Lance-HEAD-before-manifest-publish gap is covered by a `SidecarKind::Optimize` recovery sidecar (loose-match): a crash in that window rolls the compacted version forward on the next `Omnigraph::open` (compaction is content-preserving, so roll-forward is always safe).
- **Requires a recovered graph.** `optimize` refuses (errors) when an unresolved recovery sidecar is present under `__recovery` — operating on an unrecovered graph could publish a partial write the open-time recovery sweep would roll back. Reopen the graph to run the recovery sweep, then re-run `optimize`. (Recovery roll-back now publishes its restored version, so a recovered graph always satisfies `manifest == Lance HEAD` going in; there is no leftover drift for `optimize` to interpret.)
+- **Requires a recovered graph.** `optimize` refuses (errors) when an unresolved recovery sidecar is present under `__recovery` — operating on an unrecovered graph could publish a partial write the open-time recovery sweep would roll back. Reopen the graph to run the recovery sweep, then re-run `optimize`.
+- **Uncovered drift is skipped, not interpreted.** If a table's Lance HEAD is ahead of the version recorded in `__manifest` and no recovery sidecar covers that movement, `optimize` reports `skipped: Some(DriftNeedsRepair)` with the manifest/head versions and leaves the table untouched. Run `omnigraph repair` to classify and explicitly publish that drift.
 - Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8).
- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed, skipped }]`.
+- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed, skipped, manifest_version, lance_head_version }]`.
 - **Blob tables are skipped.** A table that declares any `Blob` property is not compacted: it is reported with `skipped: Some(BlobColumnsUnsupportedByLance)` (and logged via `tracing::warn`) instead of compacted, and the rest of the sweep proceeds normally. The current Lance `compact_files` mis-decodes blob-v2 columns under its forced `BlobHandling::AllBinary` read; **reads and writes are unaffected** — only compaction is. This is gated by `LANCE_SUPPORTS_BLOB_COMPACTION` (`db/omnigraph/optimize.rs`) and removed when the upstream Lance fix lands (see [docs/dev/lance.md](../dev/lance.md)). Consequence: fragment count and deleted-row space on blob tables are not reclaimed until then; query results are never affected.

+## `repair_all_tables(db, options)` — explicit
+
+- Handles **uncovered manifest/head drift**: a table's Lance HEAD is ahead of the manifest pin and no recovery sidecar records the writer intent.
+- Preview by default. `omnigraph repair --json <uri>` reports each table's `classification`, `action`, manifest/head versions, Lance operation names, and any classification error. `--confirm` publishes only verified maintenance drift; if any suspicious or unverifiable table is refused, the CLI prints the per-table output and exits non-zero. `--force --confirm` also publishes suspicious or unverifiable drift after operator review.
+- Classifies drift by reading Lance transactions from `manifest_version + 1` through `lance_head_version`. Only `ReserveFragments` and `Rewrite` count as verified maintenance. Drift that includes semantic operations such as `Append`, `Delete`, `Update`, or `Merge`, or whose transaction history is missing, is not auto-healed.
+- Publishes repair by advancing `__manifest` to the existing Lance HEAD; it does **not** rewrite Lance data. If the publish succeeds, normal reads and strict writes use the repaired version. If it fails, no new data-side partial state was created.
+- Requires a clean recovery state. Pending `__recovery` sidecars still belong to automatic sidecar recovery, not manual repair.
+
 ## `cleanup_all_tables(db, options)` — destructive

 - Lance `cleanup_old_versions()` per table.
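Reviewer sketch for the `repair_all_tables` notes added in the hunk above: the classification rule and the `--confirm` / `--force` decision expressed as plain functions. This is an illustration under stated assumptions, not the code in `db/omnigraph/repair.rs`. The enum members, the action strings (`none`, `preview`, `publish`, `refuse`), and the `ops_by_version` input shape are hypothetical; the set of verified operations and the flag semantics follow the text.

```python
from enum import Enum

# Per the docs, only these Lance operations count as verified maintenance.
VERIFIED_MAINTENANCE_OPS = {"ReserveFragments", "Rewrite"}


class Classification(Enum):
    NO_DRIFT = "no_drift"
    VERIFIED_MAINTENANCE = "verified_maintenance"
    SUSPICIOUS = "suspicious"
    UNVERIFIABLE = "unverifiable"


def classify_drift(manifest_version, lance_head_version, ops_by_version):
    """Classify drift from manifest_version + 1 through lance_head_version.

    ops_by_version maps a Lance version to its operation name; a missing
    entry stands in for missing transaction history (assumed input shape).
    """
    if lance_head_version <= manifest_version:
        return Classification.NO_DRIFT
    ops = []
    for version in range(manifest_version + 1, lance_head_version + 1):
        op = ops_by_version.get(version)
        if op is None:
            return Classification.UNVERIFIABLE
        ops.append(op)
    if all(op in VERIFIED_MAINTENANCE_OPS for op in ops):
        return Classification.VERIFIED_MAINTENANCE
    return Classification.SUSPICIOUS


def repair_action(classification, confirm=False, force=False):
    """Map a classification and the CLI flags to an action (hypothetical names)."""
    if classification is Classification.NO_DRIFT:
        return "none"
    if not confirm:
        return "preview"  # preview is the default
    if classification is Classification.VERIFIED_MAINTENANCE:
        return "publish"
    # Suspicious or unverifiable drift needs --force --confirm.
    return "publish" if force else "refuse"
```

In this sketch a single `refuse` result is what would make the CLI exit non-zero after printing the per-table output.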