Merge branch 'main' into ragnorc/omnigraph-mcp-crate

Bring the MCP feature branch up to date with main (14 commits). One conflict — compiler/parser.rs: main's `NanoError` → `CompilerError` rename vs this branch's `@mcp` / per-param `@description` parser additions; resolved by keeping the new parsing under the renamed error type. The CLI `queries list` change (#280, surfacing `@description`/`@instruction`) auto-merged with this branch's `mcp_expose`/`tool_name` columns.
2026-06-21 02:28:07 +02:00 · 2026-06-19 21:59:14 +02:00 · 2026-06-19 21:59:14 +02:00 · fbf455a250
commit fbf455a250
parent 916dc46c0e 57348cf7fa
110 changed files with 6396 additions and 2511 deletions
--- a/docs/dev/branch-protection.md
+++ b/docs/dev/branch-protection.md
@@ -8,10 +8,9 @@ This page explains what the policy says and how to change it.

 | Setting | Value | Why |
 |---|---|---|
-| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test omnigraph-server --features aws`, `CODEOWNERS matches source`, `CODEOWNERS not hand-edited` | Every PR must pass the AWS-feature build/test, AGENTS.md link integrity, and the CODEOWNERS hygiene checks. **`Test Workspace` is deliberately NOT required** — it runs only on push to `main` (post-merge), tags, and manual `workflow_dispatch`, to keep PR turnaround fast (it was the ~15min+ slow gate). It is therefore *not* listed here: a required check that never reports on PRs (the `test` job is `if: github.event_name != 'pull_request'`) would leave every PR permanently pending — the same job-never-reports trap the CODEOWNERS contexts call out below. The trade-off (a regression lands on `main` and is caught by the post-merge run, so `main` can briefly go red) and its mitigations are documented in [ci.md](ci.md). The two CODEOWNERS contexts must equal the job `name:` values in `.github/workflows/codeowners.yml` **verbatim** — a context naming a job that never reports (the old `CODEOWNERS / drift` used the job *id*, and the job was path-filtered) leaves every PR permanently pending and forces admin overrides. `strict: true` requires the branch to be up-to-date with `main` before merge. |
-| **Required approving reviews** | `1` | At least one reviewer. With a 2-person team, going higher would block all merges when one person is unavailable. |
-| **Require code-owner reviews** | `true` | The reviewer must be a code owner per `.github/CODEOWNERS`. This is what makes the codeowners chassis enforced. |
-| **Dismiss stale reviews on new commits** | `true` | A push after approval invalidates the prior review. Prevents the "approve, then sneak in unreviewed changes" pattern. |
+| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test omnigraph-server --features aws` | Every PR must pass the AWS-feature build/test and AGENTS.md link integrity. **`Test Workspace` is deliberately NOT required** — it runs only on push to `main` (post-merge), tags, and manual `workflow_dispatch`, to keep PR turnaround fast (it was the ~15min+ slow gate). It is therefore *not* listed here: a required check that never reports on PRs (the `test` job is `if: github.event_name != 'pull_request'`) would leave every PR permanently pending — the job-never-reports trap. The trade-off (a regression lands on `main` and is caught by the post-merge run, so `main` can briefly go red) and its mitigations are documented in [ci.md](ci.md). Each required context must equal a job `name:` that actually reports on PRs **verbatim** — a context naming a job that never reports leaves every PR permanently pending and forces admin overrides. `strict: true` requires the branch to be up-to-date with `main` before merge. |
+| **Required approving reviews** | `0` | No human-review gate. With a 2-person team where both maintainers own everything, requiring an approval meant every PR needed the *other* person (or an admin/bypass override) — friction with no real review value. CI checks are the gate; maintainers merge their own PRs once checks pass. Raise this to `1` if an outside-contributor flow ever needs a review gate. |
+| **Require code-owner reviews** | `false` | CODEOWNERS was removed entirely (see the git history of `.github/`); there is no code-owner review requirement. |
 | **Require linear history** | `true` | No merge commits — squash or rebase only. Matches recent practice. |
 | **Disallow force pushes** | `true` | No history rewrites on `main`. |
 | **Disallow branch deletions** | `true` | `main` cannot be deleted. |
@@ -57,7 +56,7 @@ Outputs the live policy. Compare against `.github/branch-protection.json` to det

 - **Audit trail**: `git log .github/branch-protection.json` shows every change with a reviewable diff and a merge commit.
 - **Disaster recovery**: if branch protection is accidentally removed or weakened via the UI, the JSON is the canonical recovery point.
-- **Consistency**: pairs with `.github/codeowners-roles.yml` (the CODEOWNERS source of truth). Repository policy lives in the repository.
+- **Consistency**: repository policy lives in the repository, reviewed like code.

 ## What this gates

@@ -65,11 +64,11 @@ After branch protection is applied, every PR targeting `main` must:

 1. Pass all listed status checks.
 2. Be up-to-date with `main` (rebase or merge-from-main).
-3. Have at least one approving review from a code owner for the touched paths.
-4. Have all review conversations resolved.
-5. Be squash- or rebase-merged (no merge commits).
+3. Have all review conversations resolved.
+4. Be squash- or rebase-merged (no merge commits).

-Even repository admins are subject to these rules.
+No human approval is required (`required_approving_review_count: 0`). Repository
+admins can override the gates (`enforce_admins: false`).

 ## Subsequent hardening (not in this PR)

@@ -77,7 +76,7 @@ The branch-protection policy is the foundation. Future hardening adds:

 - **Required signed commits** (`required_signatures: true`) — once maintainers enroll GPG/SSH signing.
 - **Tag protection** for `v*` tags via `repos/.../tags/protection`.
-- **Required reviewers from specific teams** for high-leverage paths (e.g., `docs/dev/invariants.md`) via CODEOWNERS tier expansion + the N-unique-approvers CI workaround.
+- **Required reviewers from specific teams** for high-leverage paths (e.g., `docs/dev/invariants.md`) via a GitHub ruleset's path-scoped required-review rule, if a review gate is ever reintroduced.
 - **More required CI checks**: `cargo deny`, `cargo audit`, `cargo fmt --check`, `cargo clippy -D warnings`, CodeQL, secret scanning, schema-lint (MR-946).

 See the hardening playbook for the full plan.
--- a/docs/dev/bug-case-fix.md
+++ b/docs/dev/bug-case-fix.md
@@ -0,0 +1,217 @@
+# Bug case study: camelCase property filters lowercased at runtime
+
+**Issue:** [#283](https://github.com/ModernRelay/omnigraph/issues/283) (mirrored
+in the dev-graph as `iss-990`)
+**Reported on:** 0.7.0 (release binary)
+**Status of code:** present on `v0.7.0`; fixed on branch `fix/iss-283-camelcase-filter` (read pushdown + pending mutation scan)
+**Severity:** correctness — a valid, lint-clean query fails at run time.
+
+## Symptom
+
+A read query that filters on a **camelCase** schema field lints and plans
+cleanly but fails when it executes:
+
+```text
+No field named reponame. Column names are case sensitive.
+```
+
+Minimal repro:
+
+```pg
+node SourceDocument {
+  repoName: String @index
+}
+```
+
+```gq
+query find($repoName: String) {
+  match { $d: SourceDocument { repoName: $repoName } }
+  return { $d.repoName }
+}
+```
+
+`omnigraph lint` passes; running the query errors. The operator workaround is to
+rename the field to all-lowercase (`repo`), which is why this looked like a
+schema-design quirk rather than an engine bug.
+
+## Root cause
+
+The filter-pushdown path builds the Lance scan predicate's column reference with
+`datafusion::prelude::col(property)`:
+
+- **Site:** `crates/omnigraph/src/exec/query.rs` — `ir_expr_to_expr`:
+  ```rust
+  IRExpr::PropAccess { property, .. } => Some(col(property)),
+  ```
+- `col(&str)` runs DataFusion's SQL **identifier normalization**
+  (`Column::from_qualified_name` → `parse_identifiers_normalized(.., false)`),
+  which **lowercases unquoted identifiers**. So `col("repoName")` resolves to a
+  column named `reponame`.
+- Lance stores columns **case-preserved** (`repoName`) and resolves them
+  case-sensitively, so the scan can't find `reponame` and errors.
+
+The IR is not at fault: the parser and lowering preserve the original case
+(`property: pm.prop_name.clone()`), which is exactly why the compiler resolves
+`repoName` and **lint passes**. The case is destroyed only at the
+engine → Lance boundary.
+
+There is a **second** boundary with the same root cause but a *different*
+parser: the pending-batch scan in `table_store.rs::scan_pending_batches` splices
+the mutation predicate string into a DataFusion `SELECT … WHERE {filter}` over a
+`MemTable`, and DataFusion's SQL parser lowercases the unquoted column the same
+way (`repoName` → `reponame`). See **Part 2** of the fix — it surfaces only on a
+*chained* mutation that re-reads the pending side, which is why a single
+update/delete on a camelCase predicate looked fine.
+
+### Why the rest of the engine is unaffected
+
+The two pushdown sites above were the offenders; the remaining paths already
+treat column names case-sensitively and handle camelCase correctly:
+
+- **Projection / return** uses the real Arrow field name (`f.name()`).
+- **In-memory filtering** (the fallback for non-pushable predicates) looks the
+  column up by the preserved property name against the batch schema.
+- **The committed Lance mutation scan** (`Scanner::filter(&str)`) preserves an
+  unquoted identifier's case, so committed-row matching on a camelCase predicate
+  already worked.
+
+So the read bug surfaces for predicates that *are* pushed down (e.g. an equality
+on a scalar camelCase column), and the mutation bug only for the pending-side
+re-scan of a chained mutation.
+
+### Why it slipped through
+
+The `ir_filter_to_expr` unit tests only use the all-lowercase field `count`, so
+no test exercised a camelCase property. Nothing in CI compared the emitted
+column name against the schema's casing.
+
+## Fix
+
+There are **two** engine→Lance boundaries that lose case, and they need
+**different** fixes because the two consumers disagree on quoting semantics.
+
+### Part 1 — read pushdown (`exec/query.rs`, `ir_expr_to_expr`)
+
+Use DataFusion's case-preserving column constructor, `ident()`, instead of
+`col()`:
+
+```rust
+IRExpr::PropAccess { property, .. } => Some(datafusion::prelude::ident(property)),
+```
+
+`ident()` builds `Expr::Column(Column::new_unqualified(property))` with no SQL
+parse and no normalization, so the case is preserved. Property references here
+are always bare column names (the variable is dropped via `..`), so there is no
+qualified-name (`a.b`) handling to lose.
+
+This is the right layer and the right shape:
+
+- It is a **no-op for the lowercase columns that work today** (`slug`, `id`,
+  `status`, …) — lowercasing those was already a no-op — so there is no
+  regression risk for the common case.
+- It makes pushdown **consistent** with projection and in-memory filtering,
+  which already use case-preserved names.
+- It also restores **index use** for camelCase columns: today such a filter
+  errors before the BTREE is even considered.
+
+### Part 2 — pending mutation scan (`table_store.rs`, `scan_pending_batches`)
+
+`update`/`delete` predicates lower through `predicate_to_sql(..)` into a single
+**SQL string** (`format!("{} {} {}", column, op, value_sql)`). That one string
+is consumed by **two** different parsers, and *they disagree on what quoting
+means*:
+
+- The **committed** side passes the string to Lance's `Scanner::filter(&str)`.
+  Lance **preserves an unquoted identifier's case** (so unquoted camelCase
+  *already works* on the committed scan) but treats a double-quoted `"col"` as a
+  **string literal** — `"repoName" = 'acme'` parses as `'repoName' = 'acme'`,
+  a constant-false predicate that silently matches **zero** committed rows.
+- The **pending** side splices the same string into a DataFusion
+  `SELECT … FROM pending WHERE {filter}` over a `MemTable`. DataFusion's SQL
+  parser **lowercases** an unquoted identifier (`repoName` → `reponame`) and
+  fails to resolve against the case-sensitive `MemTable` schema.
+
+So no single quoting choice for the column satisfies both: quoting fixes the
+pending side but breaks the committed side, and vice versa. The fix keeps the
+predicate **unquoted** (what the committed Lance scan needs) and makes the
+*pending* context case-preserving instead, by disabling SQL identifier
+normalization on its `SessionContext`:
+
+```rust
+let mut config = SessionConfig::new();
+config.options_mut().sql_parser.enable_ident_normalization = false;
+let ctx = SessionContext::new_with_config(config);
+```
+
+`predicate_to_sql` itself never lowercased anything (it copies the preserved
+property name), so its emitted string is unchanged — it gains only a comment
+recording the unquoted contract. The projection list in the same function is
+already double-quoted and is unaffected (quoted identifiers are case-preserved
+under either normalization setting).
+
+Rejected alternatives: banning/normalizing camelCase at the compiler (a real
+usability regression — camelCase fields are legitimate), lowercasing column
+names in storage (a breaking on-disk change), merely making lint *warn* (a
+band-aid that leaves the runtime broken), or **quoting the column in
+`predicate_to_sql`** (empirically breaks 7 existing lowercase-column mutation
+tests because Lance reads `"col"` as a string literal, as the committed-side bullet above explains).
+
+## Scope and caveats
+
+- **Not Windows-specific.** The original report's environment was Windows, but
+  the cause is platform-independent.
+- **The mutation path was only *partially* broken, and not where first
+  assumed.** The committed side of `scan_with_pending(..)` (Lance
+  `Scanner::filter(&str)`) and `delete`'s `delete_where(..)` / `Dataset::delete`
+  preserve an unquoted identifier's case, so a *single* `update`/`delete` on a
+  camelCase predicate already worked. Only the **pending** side — the in-memory
+  `MemTable` re-scan that a *chained* mutation hits — lowercased the column.
+  This was confirmed empirically: a single update+delete on `repoName` passes
+  unfixed; a chained update that re-reads the pending side fails with
+  `No field named reponame`. The fix is Part 2 above (disable identifier
+  normalization on the pending `SessionContext`), **not** quoting the column.
+  The eventual MR-A migration (`delete_where` → Lance 7
+  `DeleteBuilder::execute_uncommitted`, structured `Expr`) is the longer-term
+  shape but is out of scope here.
+- **Check the coercion lookup.** Adjacent to the fix, the literal-coercion step
+  (`prop_data_type(.., schema)`, which keeps the BTREE usable) also resolves the
+  column by name. Confirm it uses the preserved name; if it mishandles case a
+  camelCase filter would resolve but lose its index — a silent perf regression,
+  not a crash.
+- **Do not use `col(r#""repoName""#)` as the general read-path fix.** Quoting
+  would preserve this one name, but it routes through SQL identifier parsing and
+  changes qualified-name semantics. The IR property here is already a bare
+  column name, so `ident(property)` / `Column::new_unqualified(property)` is the
+  precise structured expression.
+- **Do not "fix" the mutation string by quoting the column.** It is tempting to
+  reuse a `quote_ident` helper symmetric with `literal_to_sql`'s value escaping,
+  but the column quote-rules differ between the two consumers of the predicate
+  string: Lance's `Scanner::filter(&str)` reads `"col"` as a *string literal*
+  (silently matching nothing), while DataFusion's `ctx.sql` reads it as a
+  case-preserved identifier. Because the committed Lance scan already preserves
+  the *unquoted* identifier's case, the column must stay unquoted and the
+  pending DataFusion context must be told not to normalize — not the reverse.
+
+## Validation (test-first)
+
+1. **Red:** add an `ir_filter_to_expr` test asserting the emitted
+   `Expr::Column` name for a camelCase property is `repoName`, not `reponame`.
+   Fails on current code.
+2. **Green:** apply the `col` → `ident` change (Part 1) and the pending-context
+   `enable_ident_normalization = false` change (Part 2).
+3. **End-to-end:** a camelCase `@index` field with
+   `match { T { camelField: $x } }` returns the row (the unit test alone can't
+   catch an engine↔Lance boundary regression).
+4. **Mutation parity:** with the same camelCase field, cover:
+   - `update T where camelField == $x set otherField = ...` updates the intended
+     row.
+   - `delete T where camelField == $x` deletes the intended row and cascades as
+     expected.
+   - A chained update that hits the pending side of `scan_with_pending` still
+     works, so both the committed Lance scan and pending DataFusion `MemTable`
+     predicate paths are case-preserving.
+5. **Index preservation:** keep or add a plan/trace assertion that the
+   camelCase `@index` equality predicate still reaches the scalar-index path.
+   A result-only test can pass while silently falling back to a full scan.
+6. Run the full engine suite (`cargo test -p omnigraph-engine`) — in particular
+   the existing BTREE index-eligibility tests, which `ident()` must not disturb.
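
Reviewer sketch for the case study in `docs/dev/bug-case-fix.md` above: the Part 2 argument (no single quoting choice satisfies both consumers of the predicate string) can be checked with a small toy model. This is only a model of the behaviors the write-up describes. `Token`, `datafusion_sql_ident`, `lance_filter_ident`, and `resolves` are hypothetical names invented for illustration; they are not DataFusion or Lance APIs, and the real parsers handle far more than a single bare identifier.

```rust
/// What a consumer decides a raw token in the predicate string means.
#[derive(Debug, PartialEq)]
enum Token {
    Column(String),
    StringLiteral(String),
}

/// Returns the text inside a pair of double quotes, if the token is quoted.
fn unquote(raw: &str) -> Option<&str> {
    raw.strip_prefix('"').and_then(|s| s.strip_suffix('"'))
}

/// Model of the pending side (DataFusion SQL, as described in the case study):
/// a quoted identifier keeps its case; an unquoted one is lowercased unless
/// identifier normalization is disabled.
fn datafusion_sql_ident(raw: &str, normalize: bool) -> Token {
    match unquote(raw) {
        Some(inner) => Token::Column(inner.to_string()),
        None if normalize => Token::Column(raw.to_lowercase()),
        None => Token::Column(raw.to_string()),
    }
}

/// Model of the committed side (Lance filter string, as described in the case
/// study): an unquoted identifier keeps its case; a double-quoted token is
/// read as a string literal, not a column.
fn lance_filter_ident(raw: &str) -> Token {
    match unquote(raw) {
        Some(inner) => Token::StringLiteral(inner.to_string()),
        None => Token::Column(raw.to_string()),
    }
}

/// Case-sensitive lookup against the stored column names.
fn resolves(token: &Token, schema: &[&str]) -> bool {
    matches!(token, Token::Column(name) if schema.contains(&name.as_str()))
}
```

Under this model, with a schema of `["repoName"]`, the unquoted `repoName` resolves on the committed side and on the pending side only when `normalize` is `false`, while the quoted `"repoName"` resolves on the pending side but becomes a string literal on the committed side. That is the same asymmetry that leads the write-up to keep the column unquoted and disable normalization on the pending context.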
--- a/docs/dev/ci.md
+++ b/docs/dev/ci.md
@@ -3,7 +3,7 @@
 `.github/workflows/`:

 - **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repository PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`).
-  - **`Test Workspace` does not run on pull requests.** The job is gated `if: github.event_name != 'pull_request'`, so the full workspace + failpoints suite runs only on push to `main` (post-merge), on `v*` tags, and on manual `workflow_dispatch`. This was a deliberate PR-latency trade-off — it was the slowest gate (~15min warm, up to the 75min cold ceiling). `RustFS S3 Integration` `needs: test`, so it is push-/dispatch-only for the same reason. The fast PR gates remain: `Classify Changes`, `Check AGENTS.md Links`, `Test omnigraph-server --features aws`, and the two CODEOWNERS checks. `Test Workspace` is correspondingly **not** in the required-check list (`.github/branch-protection.json`); see [branch-protection.md](branch-protection.md).
+  - **`Test Workspace` does not run on pull requests.** The job is gated `if: github.event_name != 'pull_request'`, so the full workspace + failpoints suite runs only on push to `main` (post-merge), on `v*` tags, and on manual `workflow_dispatch`. This was a deliberate PR-latency trade-off — it was the slowest gate (~15min warm, up to the 75min cold ceiling). `RustFS S3 Integration` `needs: test`, so it is push-/dispatch-only for the same reason. The fast PR gates remain: `Classify Changes`, `Check AGENTS.md Links`, and `Test omnigraph-server --features aws`. `Test Workspace` is correspondingly **not** in the required-check list (`.github/branch-protection.json`); see [branch-protection.md](branch-protection.md).
  - **Consequences to internalize:** (1) a regression that the suite would catch now lands on `main` and turns the post-merge run red, rather than being blocked pre-merge — `main` can briefly break, so run `cargo test --workspace --locked` locally before merging anything non-trivial, or trigger this workflow on your branch via the Actions "Run workflow" button. (2) `openapi.json` is no longer auto-regenerated on PRs (that step is inside the `test` job); for server/API changes, regenerate it locally with `OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi` and commit it, or the strict drift check fails the post-merge `main` run.
  - **Applying this policy:** removing `Test Workspace` from the JSON is inert until an admin runs `./scripts/apply-branch-protection.sh`. **Run it immediately after this change merges** — until then GitHub still requires a `Test Workspace` context that no longer reports on PRs, which leaves every open PR permanently pending (the job-never-reports trap).
 - **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest.
--- a/docs/dev/codeowners.md
+++ b/docs/dev/codeowners.md
@@ -1,58 +0,0 @@
-# Code ownership
-
-`.github/CODEOWNERS` is **generated** — not hand-edited. The source of truth is `.github/codeowners-roles.yml`, expanded by `.github/scripts/render-codeowners.py`. CI rejects drift between the two and rejects direct edits to `CODEOWNERS` that don't accompany a yml change.
-
-This setup gives every role change a reviewable PR and a permanent in-repository audit trail (`git log .github/codeowners-roles.yml`).
-
-## Who owns what
-
-The tables below are **generated** from `.github/codeowners-roles.yml` by `.github/scripts/render-codeowners.py` (the same render that produces `.github/CODEOWNERS`). They are the always-current "who owns what at this commit" view — don't edit them by hand; edit the yml and re-render.
-
-<!-- BEGIN GENERATED OWNERSHIP — edit codeowners-roles.yml + run render-codeowners.py -->
-
-**Path → owners** (GitHub applies *last match wins*; the `*` catch-all is listed first and is overridden by the specific patterns below it):
-
-| Path | Owners | Role(s) |
-|---|---|---|
-| `*` | @aaltshuler @ragnorc | engineering |
-| `crates/**` | @aaltshuler @ragnorc | engineering |
-| `docs/**` | @aaltshuler @ragnorc | docs |
-| `README.md` | @aaltshuler @ragnorc | docs |
-| `AGENTS.md` | @aaltshuler @ragnorc | docs |
-| `CLAUDE.md` | @aaltshuler @ragnorc | docs |
-| `SECURITY.md` | @aaltshuler @ragnorc | docs |
-
-**Roles**:
-
-| Role | Members | Description |
-|---|---|---|
-| `engineering` | @aaltshuler @ragnorc | All production code under crates/**. Engine, CLI, server, compiler. |
-| `docs` | @aaltshuler @ragnorc | Documentation under docs/**, plus repo-level docs (README.md, AGENTS.md, CLAUDE.md symlink, SECURITY.md). |
-
-<!-- END GENERATED OWNERSHIP -->
-
-GitHub treats multiple owners on a CODEOWNERS line as **"any one of them satisfies the review requirement"**. To require N distinct approvers on a specific path, layer a CI check on top (not currently configured).
-
-## How to change role membership or path mappings
-
-1. Edit `.github/codeowners-roles.yml`.
-2. Open a PR. **CI re-renders for you**: the `CODEOWNERS` workflow regenerates `.github/CODEOWNERS` and the ownership tables above and auto-commits them back to your PR branch on same-repository PRs — you don't have to run the script locally (though you can: `python3 .github/scripts/render-codeowners.py`, requires PyYAML).
-
-On a fork (where CI can't push back), the workflow instead fails with the diff so you can run the script and commit it yourself.
-
-CI fails the PR if:
-- a fork PR left a generated artifact out of sync, or
-- `CODEOWNERS` was edited without a corresponding yml change (the `CODEOWNERS not hand-edited` check).
-
-## How to add a new role
-
-1. Add a new entry to `roles:` in the yml with a `description` and `members` list.
-2. Reference the role from `paths:` (or `default:`).
-3. Regenerate + commit as above.
-
-## Why a generator, not direct CODEOWNERS edits?
-
-- **Audit trail**: `git log .github/codeowners-roles.yml` is the canonical record of every role change. The rendered `CODEOWNERS` is a derived artifact.
-- **Roles are first-class**: paths reference roles, not raw handles. Renaming a person or rotating a role updates one place, not every path.
-- **Future extension**: scheduled rotation (weekly on-call, quarterly leads) plugs into the same yml without changing the path mappings. Not enabled today.
-- **Consistency with the product**: omnigraph itself enforces auditable Cedar policy. The repository's code-owner policy follows the same "policy as reviewed code" pattern.
--- a/docs/dev/index.md
+++ b/docs/dev/index.md
@@ -28,7 +28,6 @@ constraints. User-facing behavior should still be documented through
 | Three-way merge implementation and conflicts | [merge.md](merge.md) |
 | Diff/change-feed implementation | [changes.md](../user/branching/changes.md) |
 | Branch protection policy | [branch-protection.md](branch-protection.md) |
-| CODEOWNERS source of truth | [codeowners.md](codeowners.md) |

 ## Language, Runtime, And Boundaries

@@ -63,6 +62,16 @@ The `docs/rfcs/` track is the **public, externally-authorable** RFC process. The
 maintainer/internal RFCs below (`rfc-00N-*.md`) are a separate, team-owned
 track; don't conflate the two.

+## Case Studies
+
+Worked write-ups of specific bugs — root cause, fix, and the reasoning that
+ruled out the tempting-but-wrong alternatives. Read these for the debugging
+pattern, not just the outcome.
+
+| Area | Read |
+|---|---|
+| camelCase property filters lowercased at runtime (#283) — two engine→Lance boundaries, two different fixes | [bug-case-fix.md](bug-case-fix.md) |
+
 ## Active Implementation Plans

 Working documents for in-flight feature work. Removed when the work lands.
--- a/docs/dev/invariants.md
+++ b/docs/dev/invariants.md
@@ -53,7 +53,13 @@ converge the physical state.
   versioning, fragments, branches, compaction, cleanup, and index primitives.
   DataFusion should own relational execution where it fits. Do not add custom
   WALs, transaction managers, buffer pools, page formats, or local clones of
-   substrate behavior. Read [lance.md](lance.md) before guessing.
+   substrate behavior. Read [lance.md](lance.md) before guessing. Respecting the
+   substrate also means *using* it idiomatically, not only refraining from
+   rebuilding it: reuse long-lived handles instead of re-opening per call,
+   resolve latest state through the substrate's cheap primitive instead of
+   re-scanning, and share its caches/session. Re-deriving per call what the
+   substrate keeps warm is a substrate violation even when no code is
+   reimplemented.

 2. **Graph visibility is manifest-atomic.** Lance commits are per dataset.
   OmniGraph's graph-level atomicity comes from publishing one manifest update
@@ -126,6 +132,18 @@ converge the physical state.
    a substitute for missing lower-level assertions. Read [testing.md](testing.md)
    before adding tests.

+15. **One source of truth, cheaply derived.** Lance and the manifest are the
+    source of truth. Everything the engine needs at runtime is a derived view of
+    them: read or projected on demand, held warm, refreshed by a cheap probe. Two
+    failure modes are forbidden. A *parallel copy* the engine maintains can drift
+    from the source, and that divergence compounds over time. *Cold
+    re-derivation* rebuilds the view from the full source on every call, so its
+    cost grows with history. Invariants 1 and 7, and the deny-list "state that
+    drifts" and "manifest-derivable reconciler" items, are instances; so is
+    bounding a read's cost to its working set rather than the commit count. This
+    is the structural face of "engineering is programming integrated over time":
+    both failure modes are liabilities that compound as the system grows.
+
 ## Current Truth Matrix

 | Area | Current state | Source |
@@ -252,6 +270,37 @@ them explicit.
 - **Resource bounds:** some operations still lack enforced per-query memory or
  time budgets. New long-running work should add explicit bounds rather than
  widening the gap.
+- **Read-path re-derivation (largely closed by the query-latency work):**
+  snapshot resolution used to re-open a fresh coordinator per read (a full
+  `__manifest` re-scan plus two commit-graph scans), open each table through the
+  namespace (two more `__manifest` scans per table), validate the schema twice,
+  and share no Lance `Session`. That was an O(commits) cost that never warmed up.
+  Fix 1 (warm coordinator reuse behind a `latest_version_id` probe), Fix 2 (open
+  tables by location+version), finding A (validate once), and Fix 3 (a held
+  `Dataset`-handle cache keyed by `(table, branch, version, e_tag when Lance
+  exposes it)` plus one shared `Session` per graph) remove that tax: a warm
+  same-branch read does one probe, one schema read, and zero opens on a repeat.
+  Non-main branch freshness compares the manifest incarnation (`version` plus
+  manifest-location e_tag when available, otherwise Lance manifest timestamp),
+  because Lance branch names can be deleted/recreated at the same version number;
+  the manifest e_tag is carried into synthetic snapshot ids when available, and
+  a detected same-branch manifest refresh clears read caches as the fallback for
+  e_tag-less table locations/topology. Remaining: the internal metadata tables
+  (`__manifest`, `_graph_commits`) are still not compacted, so the probe and
+  refresh cost still grows with fragment count on a long-lived graph (the
+  `optimize`-covers-internal-tables follow-up); the commit graph is not yet
+  reconcilable from the manifest; and the traversal id-map is still rebuilt.
+- **Commit-graph parent under concurrency:** `record_graph_commit` now refreshes
+  the commit-graph head from storage before appending, so a same-branch write
+  after an external commit no longer forks the commit DAG by parenting off a
+  stale cached head (the single-process fork, pre-existing for non-strict
+  inserts and widened to strict ops by Fix 1's `refresh_manifest_only`, is now
+  closed). Residual: two processes writing disjoint tables can still pass their
+  per-table manifest CAS and append off the same parent (a refresh-then-append
+  TOCTOU). The convergent fix is reconcile-from-manifest (parent = the commit at
+  the manifest version the publisher CAS'd against; `manifest_version` is on
+  every commit row), composing with the manifest-to-commit-graph atomicity gap;
+  it needs commit-graph append ordering or a Lance append-CAS to fully close.

 ## Deny-list

@@ -277,6 +326,10 @@ case is exceptional.
 - Cost-blind plan choice when statistics are available or required.
 - Hidden statistics for behavior that affects planning or operator choice.
 - Hash-map iteration order in result ordering, plan choice, or migration output.
+- Cold re-derivation on the hot path: rebuilding from the full source what could
+  be held warm and refreshed cheaply, so cost scales with history rather than the
+  working set (the cost face of invariant 15; "state that drifts" above is its
+  shadow-copy face).
 - String-flattened SQL/filter generation when a structured pushdown API is
  available.
 - Eager multi-hop cross-product materialization when factorization fits.
@@ -313,6 +366,8 @@ Use this as yes/no/NA for any non-trivial design or PR:
 - Are stats/capabilities exposed when behavior depends on them?
 - Are existing known gaps left no worse and documented if touched?
 - Does the test live at the same boundary as the change?
+- Is this operation's cost bounded with respect to history and scale, or does it
+  re-derive warm state from cold storage per call?
 - Does the change avoid every deny-list pattern, or justify the exception?

 ## Maintenance Policy
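
Reviewer sketch for invariant 15 in `docs/dev/invariants.md` above: "held warm, refreshed by a cheap probe" can be illustrated with a minimal version-keyed cache. Everything here is hypothetical and chosen for illustration: `Source` stands in for the source of truth, `latest_version` for the cheap probe, `derive_view` for the expensive full re-derivation, and `WarmView` for the derived view. It is not the engine's coordinator or handle cache, only the shape of the idea.

```rust
/// Hypothetical stand-in for the source of truth (Lance plus the manifest in
/// the invariant's wording).
struct Source {
    version: u64,
    rows: Vec<u64>,
    /// Counts how often the expensive derivation ran.
    full_scans: usize,
}

impl Source {
    /// Cheap probe: cost does not depend on how much history exists.
    fn latest_version(&self) -> u64 {
        self.version
    }

    /// Expensive derivation: walks the full source.
    fn derive_view(&mut self) -> u64 {
        self.full_scans += 1;
        self.rows.iter().sum()
    }

    /// A write advances the version, which is what invalidates warm views.
    fn commit(&mut self, row: u64) {
        self.rows.push(row);
        self.version += 1;
    }
}

/// Derived view held warm as `(version, value)`. It is never edited in place,
/// so it cannot drift from the source; it is only reused or rebuilt.
struct WarmView {
    cached: Option<(u64, u64)>,
}

impl WarmView {
    fn read(&mut self, source: &mut Source) -> u64 {
        let latest = source.latest_version();
        match self.cached {
            // Probe says nothing changed: serve the warm value.
            Some((version, value)) if version == latest => value,
            // Cold or stale: re-derive once and remember the version.
            _ => {
                let value = source.derive_view();
                self.cached = Some((latest, value));
                value
            }
        }
    }
}
```

In this sketch, repeated reads at an unchanged version cost one probe and no scan, and a commit triggers exactly one re-derivation on the next read. The two forbidden failure modes map onto the two ways to break it: mutating `cached` independently of the source (a parallel copy) or calling `derive_view` on every read (cold re-derivation).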
--- a/docs/dev/rfc-005-server-cluster-boot.md
+++ b/docs/dev/rfc-005-server-cluster-boot.md
@@ -1,7 +1,7 @@
 # RFC: Server Boots from Cluster State — Phase 5 of the Cluster Control Plane

 **Status:** Landed (5A policy bindings #175; 5B/5C the `--cluster` boot mode — one PR)
-**Implementation deviations:** (1) cluster mode reuses `ServerConfigMode::Multi` (a new settings *source*, not a new enum variant; `config_path` carries the cluster dir). (2) Stored queries load via `QueryRegistry::from_specs` from verified blob *content*, not blob paths. (3) More than one policy bundle binding a single scope is a boot error (the serving pipeline holds one bundle per graph + one server-level; stacking is a later slice). (4) `GET /graphs` keeps its closed-by-default contract — without a cluster-bound bundle there is no server-level Cedar engine, so enumeration refuses.
+**Implementation deviations:** (1) cluster mode reuses `ServerConfigMode::Multi` (a new settings *source*, not a new enum variant; `config_path` carries the cluster dir). (2) Stored queries load via `QueryRegistry::from_specs` from verified blob *content*, not blob paths. (3) More than one policy bundle binding a single scope is a boot error (the serving pipeline holds one bundle per graph + one server-level; stacking is a later slice). (4) `GET /graphs` keeps its closed-by-default contract — without a cluster-bound bundle there is no server-level Cedar engine, so enumeration refuses. (5) Graph-attributed startup failures quarantine that graph by default; operators can restore all-or-nothing boot with `--require-all-graphs` / `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`.
 **Date:** 2026-06-10
 **Builds on:** Phase 4 complete ([rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md), Landed): `cluster apply` converges graphs, schemas, stored queries, and policies into the cluster catalog. Normative context: [cluster-config-specs.md](cluster-config-specs.md) (the migration model's "window 2"), [cluster-axioms.md](cluster-axioms.md) (axiom 15), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) (Phase 5 rollout, Compatibility Stance #7–#9, exit criterion 7).
 **Target release:** unversioned (phased — see Sequencing).
@ -46,8 +46,8 @@ Mode inference gains rule 0: `--cluster <dir>` → **Cluster mode**, which is al

 `load_server_settings` grows a cluster branch that reads, in order:

-1. `__cluster/state.json` — **missing state is a boot error** ("run `cluster import` + `cluster apply` first"). Pending recovery sidecars under `__cluster/recoveries/` are also a boot error (`cluster_recovery_pending`): a server must not start serving a ledger that a sweep is about to rewrite.
-2. **Graph set** = state's `graph.<id>` resources (tombstoned graphs are absent by construction). Each graph's URI is the derived root `<dir>/graphs/<id>.omni`. A recorded graph whose root does not open is a boot error — same fail-fast posture as today's bad URI.
+1. `__cluster/state.json` — **missing state is a boot error** ("run `cluster import` + `cluster apply` first"). Invalid or unattributable recovery sidecars under `__cluster/recoveries/` are also a boot error: a server must not start if it cannot prove the blast radius. Valid graph-attributed sidecars quarantine that graph by default and are logged as `cluster_recovery_pending`; `--require-all-graphs` promotes them back to a boot error.
+2. **Graph set** = state's `graph.<id>` resources (tombstoned graphs are absent by construction). Each graph's URI is the derived root `<dir>/graphs/<id>.omni`. A recorded graph whose root does not open quarantines that graph by default; `--require-all-graphs` restores the original fail-fast posture.
 3. **Stored queries** = state's `query.<graph>.<name>` entries, content loaded from the catalog blob at the recorded digest. Blob-missing or digest-mismatched is a boot error (the catalog verification semantics from Stage 3B, applied at boot). Queries type-check at engine open exactly as today (`validate_and_attach` — unchanged).
 4. **Policies** = state's `policy.<name>` entries, content from catalog blobs, bindings from the applied metadata of D3: bundles bound to `cluster` load as the server-level Cedar engine (`PolicyEngine::load_server`); bundles bound to graphs load per-graph (`PolicyEngine::load_graph`) and install via `with_policy` — the existing two-gate structure, unchanged.
 5. `cluster.yaml` is parsed **only** to validate that the directory is a cluster root (and for nothing else — explicitly not for resource content; a divergence between desired config and applied state is *served as applied*, visible via `cluster plan`).
@ -76,16 +76,19 @@ State's `StateResource` records only a digest. To make the ledger serving-suffic

 ### D4. Readiness and failure posture

-Boot is fail-fast, matching the server's existing stance (bad policy YAML refuses boot):
+Cluster-global failures are fail-fast, matching the server's existing stance (bad policy YAML refuses boot). Graph-local failures quarantine the affected graph by default so a single bad graph cannot crash-loop an otherwise healthy cluster. Operators who prefer the original all-or-nothing contract pass `--require-all-graphs` or set `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`, which promotes every graph-local quarantine/open/settings failure to a boot error.

 | Condition | Behavior |
 |---|---|
 | `state.json` missing / unparseable / unsupported version | boot error |
-| pending recovery sidecars | boot error (run any state-mutating cluster command to sweep) |
-| recorded graph root missing or unopenable | boot error |
+| invalid/unreadable/unattributable recovery sidecars | boot error (run any state-mutating cluster command to sweep or inspect) |
+| valid graph-attributed recovery sidecars | quarantine that graph; boot error under `--require-all-graphs` |
+| recorded graph root missing or unopenable | quarantine that graph; boot error under `--require-all-graphs` |
 | query/policy blob missing or digest-mismatched | boot error (run `cluster refresh` + `apply` to self-heal, then restart) |
 | policy entry without `applies_to` metadata | boot error ("re-run cluster apply", D3) |
-| stored query fails type-check against the live schema | boot error (existing `validate_and_attach` behavior) |
+| stored query fails parse/type-check against the live schema | quarantine that graph; boot error under `--require-all-graphs` |
+| embedding provider configuration for one graph cannot resolve | quarantine that graph; boot error under `--require-all-graphs` |
+| every applied graph is quarantined or fails startup | boot error (`cluster_no_healthy_graphs`) |
 | state lock held | **not** an error — boot takes no lock; it reads a point-in-time snapshot of an immutable-once-written state file (the CAS discipline means a concurrent apply produces a *new* file atomically; the server reads whichever was current at open) |

 ### D5. MCP presentation (`@mcp(expose, tool_name)`) in cluster mode
@ -124,7 +127,7 @@ Rollback is the same switch in reverse — nothing in cluster mode mutates `omni
 - *Axiom 5*: the server serves deployed reality (applied digests), never desired intent; D3 keeps the ledger the single serving source.
 - *Axiom 12*: boot reads without the lock but relies on the atomic-replace write discipline; it never writes state.
 - *Axiom 14 / Stance #9*: the expose-all bridge is named, scoped to cluster mode, and carries its Phase 6 sunset.
-- *Loud failures (deny-list)*: every degraded condition is a typed boot error with a remedy; no partial serving, no silent fallback to the yaml.
+- *Loud failures (deny-list)*: every degraded condition is either a typed cluster-global boot error with a remedy or an explicit graph quarantine logged at startup; no silent fallback to the yaml. `--require-all-graphs` is the opt-in all-or-nothing mode for operators who treat any degraded graph as fatal.
 - *Respect the boundaries*: `omnigraph-cluster` stays free of HTTP; the server reads the catalog through a small read-only loader (either a `pub` read surface on `omnigraph-cluster` or a thin module in the server consuming the documented file formats — implementation picks the one that keeps `omnigraph-cluster` dependency-light; the state/blob formats are already a documented contract).

 ## Sequencing
@ -132,7 +135,7 @@ Rollback is the same switch in reverse — nothing in cluster mode mutates `omni
 | Slice | Scope | Gate |
 |---|---|---|
 | **5A: serving metadata in state** | `applies_to` recorded on policy resources at apply + sweep roll-forward; additive state schema; `status`/plan surfacing | In-crate tests: metadata written/rolled-forward; old state parses; re-apply backfills |
-| **5B: `--cluster` boot mode** | Flag + mode inference rule 0; catalog loader (state → `GraphStartupConfig`s + registries + policy engines); readiness table; OpenAPI regen if surface shifts | Server tests: boot from a converged fixture dir, serve `/graphs/{id}/query` + stored queries + Cedar gates; every D4 row refuses boot; e2e: `cluster apply` then serve — "applied means serving" |
+| **5B: `--cluster` boot mode** | Flag + mode inference rule 0; catalog loader (state → `GraphStartupConfig`s + registries + policy engines); readiness table; OpenAPI regen if surface shifts | Server tests: boot from a converged fixture dir, serve `/graphs/{id}/query` + stored queries + Cedar gates; D4 cluster-global rows refuse boot; graph-local rows quarantine by default and refuse under `--require-all-graphs`; e2e: `cluster apply` then serve — "applied means serving" |
 | **5C: docs + caveat retirement** | `cluster-config.md` mode-switch section; `server.md`/`deployment.md`; retire the "not serving" caveats for cluster-mode deployments; migration guide (D6) | `check-agents-md.sh`; doc accuracy review |

 ## Exit-criteria coverage
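
The D4 failure posture in the rfc-005 hunks above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the server's code: `StartupFailure`, `BootOutcome`, and `resolve_boot` are hypothetical names, and only the rules themselves come from the readiness table (cluster-global failures refuse boot, graph-local failures quarantine unless `--require-all-graphs` is set, and a cluster with no healthy graph refuses with `cluster_no_healthy_graphs`).

```rust
use std::collections::BTreeSet;

/// Hypothetical failure classes mirroring the two kinds of D4 table rows.
#[derive(Debug, Clone, PartialEq, Eq)]
enum StartupFailure {
    /// Cluster-global: missing state, bad blob digest, unattributable sidecar.
    ClusterGlobal(String),
    /// Graph-local: unopenable root, pending sidecar, failed type-check.
    GraphLocal { graph_id: String, reason: String },
}

/// Hypothetical boot result: either serve a subset of graphs or refuse.
#[derive(Debug, PartialEq, Eq)]
enum BootOutcome {
    Serve { healthy: BTreeSet<String>, quarantined: BTreeSet<String> },
    Refuse(String),
}

fn resolve_boot(
    applied_graphs: &[&str],
    failures: &[StartupFailure],
    require_all_graphs: bool,
) -> BootOutcome {
    let mut quarantined = BTreeSet::new();
    for failure in failures {
        match failure {
            // Cluster-global failures are always fail-fast.
            StartupFailure::ClusterGlobal(reason) => {
                return BootOutcome::Refuse(reason.clone());
            }
            StartupFailure::GraphLocal { graph_id, reason } => {
                // Strict mode promotes a graph-local failure to a boot error.
                if require_all_graphs {
                    return BootOutcome::Refuse(format!("graph {graph_id}: {reason}"));
                }
                quarantined.insert(graph_id.clone());
            }
        }
    }
    let healthy: BTreeSet<String> = applied_graphs
        .iter()
        .map(|g| g.to_string())
        .filter(|g| !quarantined.contains(g))
        .collect();
    // Assumption: an empty applied set is not treated as an error here;
    // the RFC only specifies the case where every applied graph failed.
    if !applied_graphs.is_empty() && healthy.is_empty() {
        return BootOutcome::Refuse("cluster_no_healthy_graphs".to_string());
    }
    BootOutcome::Serve { healthy, quarantined }
}
```

With two applied graphs and one graph-local failure, the default posture serves the healthy graph and quarantines the other, while `require_all_graphs = true` turns the same input into a refusal.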
--- a/docs/dev/testing.md
+++ b/docs/dev/testing.md
@ -24,8 +24,9 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
 | `merge_truth_table.rs` | Merge-pair truth table (MR-786): all 9×9 `(left_op, right_op)` cells from `{noop, addNode, removeNode, addEdge, removeEdge, setProperty, dropProperty, addLabel, removeLabel}`. Adding a new op to `OpVariant` forces a compile error in `build_case` until the new row + column are dispositioned. 36 executable cells run through real `branch_merge` with a structured oracle (`MergeOutcome` / `MergeConflictKind` + graph-state assert); 45 cells involving `dropProperty`/`addLabel`/`removeLabel` are recorded as `Unsupported` until the mutation grammar grows. |
 | `writes.rs` | Direct-publish writes: cancellation, non-strict insert/merge rebase under the per-table queue, strict stale-write conflicts, multi-statement atomicity, MR-794 staged-write rewire (D₂ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery) |
 | `staged_writes.rs` | TableStore staged-write primitives (`stage_append`, `stage_merge_insert`, `commit_staged`, `scan_with_staged`, `count_rows_with_staged`) — primitive-level only; engine code uses the in-memory `MutationStaging` accumulator instead |
-| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
+| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) — reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
 | `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
+| `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
 | `lifecycle.rs` | Graph lifecycle, schema state |
 | `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
 | `changes.rs` | `diff_between` / `diff_commits` |
@ -126,5 +127,14 @@ When you pick up any change, walk through this:
 6. **For substrate-touching changes** (Lance behavior), reach for `failpoints` or fixture-driven scenarios, not stubbed-out mocks.
 7. **For server / API changes**, confirm the OpenAPI regeneration happens in `openapi.rs` and that the diff lands in `openapi.json`.
 8. **Verify your change makes an existing test fail before it makes the new one pass.** If you can break the code without breaking a test, your coverage gap is the problem to fix first.
+9. **Bound hot-path cost at history depth.** If the change touches a read or open path, add or extend a test that asserts a *bounded* cost (e.g. a warm same-branch read performs zero `Dataset::open`, or a fixed object-op count) against a fixture with realistic *commit-history depth*, not just realistic row counts. Cost that scales with history is invisible on a shallow fixture and only bites in production. See "Cost-budget tests" below.
+
+## Cost-budget tests: bound hot-path cost at history depth
+
+Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and degrade silently in production. The engine read path historically had no cost assertion, and fixtures carry shallow commit history, so an O(commits)-per-query cost stayed green in CI and only surfaced on a long-lived graph (read snapshot resolution re-scanned the internal manifest and commit-graph tables on every query, and those tables were never compacted). Guard against the class:
+
+- **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query."
+- **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost.
+- This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history).

 When in doubt, re-read [docs/dev/invariants.md](invariants.md) — quality gates apply to every change.
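
The cost-budget idea in the testing.md hunk above can be sketched without Lance at all. The example below is an illustration under stated assumptions: `CountingStore`, `WarmReader`, and `warm_read_cost` are hypothetical stand-ins (the real tests are described as measuring at the object-store boundary with Lance `IOTracker`), and the "one op per commit" cold scan is a toy model of the O(commits) cost the section warns about.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for an object store that counts operations.
struct CountingStore {
    commits: usize,
    ops: Cell<usize>,
}

impl CountingStore {
    fn new(commits: usize) -> Self {
        CountingStore { commits, ops: Cell::new(0) }
    }

    /// Cheap version probe: one op regardless of history depth.
    fn latest_version(&self) -> usize {
        self.ops.set(self.ops.get() + 1);
        self.commits
    }

    /// Cold resolution: modeled as one op per commit in history.
    fn scan_history(&self) -> usize {
        self.ops.set(self.ops.get() + self.commits);
        self.commits
    }

    /// Return the op count so far and reset the counter.
    fn take_ops(&self) -> usize {
        self.ops.replace(0)
    }
}

/// Reader that keeps the resolved version warm and re-scans only when
/// the probe reports a different version.
struct WarmReader {
    cached_version: Option<usize>,
}

impl WarmReader {
    fn read(&mut self, store: &CountingStore) -> usize {
        let latest = store.latest_version();
        if self.cached_version != Some(latest) {
            store.scan_history();
            self.cached_version = Some(latest);
        }
        latest
    }
}

/// Ops performed by one warm read against a fixture with `commits` commits.
fn warm_read_cost(commits: usize) -> usize {
    let store = CountingStore::new(commits);
    let mut reader = WarmReader { cached_version: None };
    reader.read(&store); // cold read pays the history scan
    store.take_ops(); // discard the cold cost
    reader.read(&store); // warm read: probe only
    store.take_ops()
}
```

A test built on this shape asserts both a fixed budget and flatness across depths, for example that `warm_read_cost(10)` equals `warm_read_cost(10_000)`. A fixture with only a handful of commits could not distinguish the warm reader from one that re-scans on every call.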
--- a/docs/dev/writes.md
+++ b/docs/dev/writes.md
@ -178,6 +178,17 @@ are left at `Lance HEAD = manifest_pinned + 1`.
   post_commit_pin)` it intends to commit + the writer kind +
   actor_id.
 2. **Phase B**: writer's per-table `commit_staged` loop runs.
+   - **Phase-B confirmation (`BranchMerge` only)**: a `BranchMerge` writer
+     advances each table's HEAD by *several* commits (append → upsert →
+     delete), so a bare "HEAD moved" is ambiguous — it could be a complete
+     publish or one crashed mid-sequence. After the whole per-table loop
+     finishes, the writer re-writes the sidecar stamping each pin's
+     `confirmed_version` with the exact achieved version, then proceeds to
+     Phase C. This is the commit point of the recovery WAL: a crash *after*
+     confirmation rolls forward to those versions; a crash *during* Phase B
+     (sidecar still unconfirmed) rolls back. Other writers don't confirm —
+     their drift is derived state (index coverage, compaction) that a partial
+     roll-forward never corrupts.
 3. **Phase C**: publisher commits the manifest.
 4. **Phase D**: writer deletes the sidecar.

@ -197,7 +208,10 @@ recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`:
 - For each sidecar in `__recovery/`, compare every named table's
  Lance HEAD to the manifest pin. Classify per the all-or-nothing
  decision tree (RolledPastExpected / NoMovement / UnexpectedAtP1 /
-  UnexpectedMultistep / InvariantViolation).
+  UnexpectedMultistep / IncompletePhaseB / InvariantViolation). For a
+  `BranchMerge` sidecar, a moved HEAD with no `confirmed_version` classifies
+  as `IncompletePhaseB` (a partial multi-commit publish) and forces roll-back;
+  with a `confirmed_version`, roll-forward targets exactly that version.
 - If any table is `InvariantViolation` (Lance HEAD < manifest pinned —
  should be impossible), **abort** with a loud error and leave the
  sidecar on disk for operator review.
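
The `BranchMerge` rules in the writes.md hunks above can be sketched per table pin. This is a reduced illustration, not the code in `recovery.rs`: `TablePin`, `BranchMergeDecision`, and `classify_branch_merge_pin` are hypothetical names, the other classes (RolledPastExpected, UnexpectedAtP1, UnexpectedMultistep) are omitted, treating an unmoved HEAD as NoMovement is an assumption drawn from the class name, and the all-or-nothing combination across tables is not modeled.

```rust
/// Hypothetical per-table view of a BranchMerge recovery sidecar entry.
struct TablePin {
    manifest_pinned: u64,
    lance_head: u64,
    /// Stamped after the whole Phase B loop finishes; None before that.
    confirmed_version: Option<u64>,
}

/// Reduced decision set for a single BranchMerge pin.
#[derive(Debug, PartialEq, Eq)]
enum BranchMergeDecision {
    /// InvariantViolation: HEAD is behind the manifest pin.
    Abort,
    /// HEAD has not moved past the pin (assumed meaning of NoMovement).
    NoMovement,
    /// IncompletePhaseB: HEAD moved but the sidecar was never confirmed.
    RollBack,
    /// Confirmed publish: roll forward to exactly this version.
    RollForwardTo(u64),
}

fn classify_branch_merge_pin(pin: &TablePin) -> BranchMergeDecision {
    if pin.lance_head < pin.manifest_pinned {
        return BranchMergeDecision::Abort;
    }
    if pin.lance_head == pin.manifest_pinned {
        return BranchMergeDecision::NoMovement;
    }
    // HEAD moved: confirmation is what separates a complete multi-commit
    // publish from one that crashed mid-sequence.
    match pin.confirmed_version {
        Some(version) => BranchMergeDecision::RollForwardTo(version),
        None => BranchMergeDecision::RollBack,
    }
}
```

The sketch does not decide what happens when a confirmed version disagrees with the observed HEAD; the source does not state that case, so it is left out.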
--- a/docs/releases/v0.2.0.md
+++ b/docs/releases/v0.2.0.md
@ -1,86 +0,0 @@
-# Omnigraph v0.2.0
-
-Omnigraph v0.2.0 focuses on day-to-day operability: safer schema evolution, more capable mutation queries, better local and remote ergonomics, and a documented HTTP surface for clients and tooling.
-
-This release is especially relevant if you are running Omnigraph locally on RustFS or using the CLI and server together as a graph application backend.
-
-## Highlights
-
-### Schema planning and apply
-
-Schema changes can now move from planning to execution with first-class CLI and server support.
-
-- Added `omnigraph schema apply --schema ...` alongside `schema plan`
-- Added `POST /schema/apply` on the server
-- Added policy support for schema application through the `schema_apply` action
-- Persisted accepted schema updates as part of a supported apply flow
-
-This makes schema evolution an actual product capability instead of a plan-only diagnostic.
-
-### Safer schema apply on live repos
-
-After the initial schema-apply rollout, the apply path was hardened to avoid clobbering concurrent writes and to preserve indexes during table rewrites.
-
-- Blocks writes while schema apply is in progress
-- Verifies source heads before publishing rewritten tables
-- Rebuilds the full expected index set after rewrite operations
-- Keeps schema apply constrained to repos whose only branch is `main`
-
-The result is a much more defensible v1 schema migration path.
-
-### Multi-statement mutations
-
-Mutation queries can now contain multiple sequential statements that execute atomically within one run.
-
-Example:
-
-```gq
-query add_and_link($name: String, $age: I32, $friend: String) {
-    insert Person { name: $name, age: $age }
-    insert Knows { from: $name, to: $friend }
-}
-```
-
-This is a meaningful step toward richer write-side workflows without forcing multiple client round trips.
-
-### OpenAPI support
-
-The server now publishes an OpenAPI document at `/openapi.json`.
-
-- Added schema-backed endpoint documentation for the Omnigraph HTTP API
-- Documented request and response types for the current server surface
-- Made the published spec reflect runtime auth mode, so open local deployments are documented correctly
-
-This makes Omnigraph easier to integrate with generated clients, inspection tools, and API consumers that want a machine-readable contract.
-
-### CLI and export ergonomics
-
-Several rough edges in the CLI were fixed.
-
-- Export now streams instead of buffering the full snapshot in memory first
-- Load summaries now report actual loaded row counts
-- Alias handling no longer steals legitimate first arguments
-- `commit show` matches the documented `--uri` usage
-- Remote and local usage are more consistent for common admin flows
-
-## Additional Improvements
-
-- RustFS CI is now scoped to relevant changes instead of burning time on unrelated pull requests
-- README and install docs were tightened around public binary install behavior
-- The local RustFS bootstrap remains aligned with the rolling `edge` binary channel
-
-## Upgrade Notes
-
-- If you use local or remote schema administration, prefer `schema plan` before `schema apply`
-- `schema apply` is intentionally conservative in v1 and rejects repos with non-`main` branches
-- If policy is enabled, make sure admin actors are allowed to perform `schema_apply`
-- If you rely on published binaries, this release is the point where stable installers can pick up schema apply and the newer CLI/runtime behavior without using `edge`
-
-## Included Changes
-
-- PR #2: CLI ergonomics and streamed export output
-- PR #5: schema apply command and policy support
-- PR #7: schema apply concurrency and index-preservation hardening
-- PR #4: multi-statement mutations
-- PR #1: OpenAPI generation and auth-aware `/openapi.json`
-- PR #8: RustFS CI scoping improvements
--- a/docs/releases/v0.2.1.md
+++ b/docs/releases/v0.2.1.md
@ -1,59 +0,0 @@
-# Omnigraph v0.2.1
-
-Omnigraph v0.2.1 is a focused follow-up release on top of v0.2.0. It adds query linting, improves query execution correctness, hardens the local RustFS bootstrap flow, and cleans up project config naming.
-
-## Highlights
-
-### Query lint and check
-
-The CLI now ships a first-class query validation surface:
-
-- `omnigraph query lint`
-- `omnigraph query check`
-
-These commands validate `.gq` files against either an explicit schema file or a local/S3-backed repo schema, emit structured results, and support both human-readable and JSON output.
-
-### Query execution fixes and aggregate support
-
-This release includes several improvements in the query engine:
-
-- aggregate execution support for read queries
-- nullable query parameters now accept omission and explicit null for nullable params
-- traversal planning and join alignment are more robust for traversal-introduced bindings
-
-Together, these changes make complex read queries more dependable and easier to author.
-
-### Better local RustFS startup
-
-The local RustFS bootstrap is more resilient:
-
-- detects dirty/stale repo prefixes before blindly reinitializing
-- makes bootstrap recovery clearer for persisted local RustFS state
-- ships a more generic demo fixture instead of user-specific seed content
-
-This reduces the most common failure mode in local-first setup.
-
-### Config terminology cleanup
-
-`omnigraph.yaml` now uses graph-oriented naming:
-
-- `graphs:` instead of `targets:`
-- `cli.graph` / `server.graph` instead of `target`
-
-This removes one of the more confusing overloaded terms in the CLI/server config model.
-
-## Included Changes
-
-- PR #15: query lint and query check commands
-- PR #6: aggregate execution support
-- PR #3: nullable query parameter fixes
-- PR #16: traversal planning and join-alignment fixes
-- PR #13: local RustFS bootstrap recovery hardening
-- PR #14: generic bootstrap fixture
-- PR #17: config rename from targets to graphs
-
-## Upgrade Notes
-
-- If you maintain `.gq` files in-repo, add `omnigraph query lint` to your local validation workflow
-- Existing configs must use `graphs:` / `graph:` after this release
-- Local RustFS users should prefer the current bootstrap script from `main` or this release rather than older cached copies
--- a/docs/releases/v0.2.2.md
+++ b/docs/releases/v0.2.2.md
@ -1,29 +0,0 @@
-# Omnigraph v0.2.2
-
-Omnigraph v0.2.2 is a packaging follow-up to v0.2.1. It keeps the CLI and server surface the same, but renames the published runtime crate from `omnigraph` to `omnigraph-engine` so the full crate set can be published cleanly to crates.io.
-
-## Highlights
-
-### Published runtime crate rename
-
-The runtime package is now published as:
-
-- `omnigraph-engine`
-
-The in-code Rust library name remains `omnigraph`, so internal imports and code paths stay stable. CLI users are unaffected.
-
-### Crates.io metadata cleanup
-
-All published crates now ship repository, homepage, and documentation metadata so the crates.io pages are complete and the release pipeline no longer emits missing-package-metadata warnings.
-
-## Included Changes
-
-- rename runtime package from `omnigraph` to `omnigraph-engine`
-- bump `omnigraph-engine`, `omnigraph-compiler`, `omnigraph-server`, and `omnigraph-cli` to `0.2.2`
-- update dependent manifests and CI package references to the new runtime package name
-
-## Upgrade Notes
-
-- Rust consumers should depend on `omnigraph-engine` on crates.io
-- Code that imports the library can continue using `omnigraph` as the crate name
-- The `omnigraph` CLI binary name is unchanged
--- a/docs/releases/v0.3.0.md
+++ b/docs/releases/v0.3.0.md
@ -1,49 +0,0 @@
-# Omnigraph v0.3.0
-
-Omnigraph v0.3.0 is a feature and security release. It adds an AWS deployment path for the server, hardens bearer-token authentication, introduces a schema inspection endpoint, and ships the CodeBuild-driven image packaging pipeline.
-
-## Highlights
-
-### AWS deployment path
-
-A new `aws` Cargo feature enables an AWS-native bearer-token backend. When compiled with `--features aws` and pointed at an AWS Secrets Manager secret ARN via `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET`, the server fetches and parses bearer tokens directly from Secrets Manager at startup. The token loading path is abstracted behind a `TokenSource` trait so additional backends are easy to add.
-
-A manually-dispatched Package workflow builds two variants of the server image (default and `--features aws`) via AWS CodeBuild, tags them by source SHA in ECR, and records the digests for downstream deploy automation.
-
-### Bearer auth hardening
-
-Bearer tokens are now hashed (SHA-256) at rest inside the server and compared using constant-time equality (`subtle::ConstantTimeEq`). The authenticated actor id is resolved server-side from the hash match — requests can no longer assert their own actor id by setting a header.
-
-### Schema inspection API
-
-A new `GET /schema` endpoint and matching CLI `schema get` command return the active graph schema as JSON. A static OpenAPI spec is published at `openapi.json` and kept in sync with the server via a CI job.
-
-### Stricter run-branch hygiene
-
-Internal `__run__…` branches, used for short-lived write staging, are now filtered out of user-visible branch listings and are deleted on every terminal state transition instead of accumulating over time.
-
-## Breaking changes
-
-### Schema state is now required
-
-The server refuses to open a repo that lacks persisted schema state (`_schema.pg`, `_schema.ir.json`, `__schema_state.json`) or that has non-main public branches left over from earlier versions. Existing repos created with 0.2.x need to be reinitialized (or have their schema state written explicitly) before they can be opened with 0.3.0.
-
-## Included Changes
-
-- Add `aws` feature + `SecretsManagerTokenSource` backend
-- Extract `TokenSource` trait for bearer token loading
-- Harden bearer auth: constant-time compare, SHA-256 hashed at rest, server-authoritative actor id
-- Add manually-dispatched Package workflow for CodeBuild image builds (default + aws variants)
-- Add `GET /schema` endpoint and `schema get` CLI command
-- Ship static `openapi.json` spec with CI auto-sync
-- Filter and delete ephemeral `__run__` branches
-- Switch Dockerfile base to ECR Public (avoid Docker Hub rate limits)
-- Raise `LANCE_MEM_POOL_SIZE` default to 1 GB for stable parallel tests
-- Automate Homebrew tap updates on release tags
-- Documentation for the AWS build variant and bearer-token sources
-
-## Upgrade Notes
-
-- Repos created with 0.2.x must be reinitialized (or have their schema state generated) before they can be opened with 0.3.0
-- Deployments using AWS Secrets Manager for bearer tokens must build the server with `--features aws` and set `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` to the secret ARN
-- The default token source (env var or JSON file) continues to work unchanged
--- a/docs/releases/v0.3.1.md
+++ b/docs/releases/v0.3.1.md
@ -1,19 +0,0 @@
-# Omnigraph v0.3.1
-
-Omnigraph v0.3.1 is a performance and operability point release.
-
-## Highlights
-
-- **Parallel per-type load writes**: the bulk loader writes to each node/edge table concurrently rather than serially, materially reducing wall-clock time on multi-table loads.
-- **`omnigraph optimize` and `omnigraph cleanup` CLI commands**: previously only available via the engine API. `optimize` runs Lance `compact_files()` across every node/edge table; `cleanup` runs Lance `cleanup_old_versions()` with a `--keep`/`--older-than` policy and requires `--confirm` for the destructive form.
-- **Dst-id deduplication during edge expand hydration**: avoids redundant lookups when the same destination id appears multiple times in an `Expand` step (#45).
-
-## Included Changes
-
-- Parallel per-type load writes (#46)
-- `omnigraph optimize` / `cleanup` CLI commands and runtime APIs (#46)
-- Dedupe dst ids before hydrating nodes in `execute_expand` (#45)
-
-## Upgrade Notes
-
-No breaking changes. Existing v0.3.0 repos can be opened directly with v0.3.1.
--- a/docs/releases/v0.4.0.md
+++ b/docs/releases/v0.4.0.md
@ -1,88 +0,0 @@
-# Omnigraph v0.4.0
-
-Omnigraph v0.4.0 demotes the Run state machine to commit metadata via the
-publisher's CAS, fixing a write-cancellation hole and reducing the engine's
-surface area.
-
-## Highlights
-
-- **Direct-to-target writes**: `mutate_as` and `load` write
-  directly to the target tables and call
-  `ManifestBatchPublisher::publish` once at the end with
-  `expected_table_versions`. No more `__run__<id>` staging branches, no
-  more `RunRecord` state machine. Cross-table OCC is enforced inside the
-  publisher's row-level CAS on `__manifest`.
-- **Cancellation safety by construction**: a dropped mutation future
-  leaves no graph-level state — only orphaned Lance fragments, reclaimed
-  by `omnigraph cleanup`. The "zombie run" cascade documented in
-  `.context/zombie-run-investigation.md` is gone.
-- **Read-your-writes inside multi-statement mutations**: a `.gq` query
-  that inserts and then references a row in the same statement now sees
-  its own writes via an in-process `MutationStaging` cache, even though
-  no manifest commit happens between ops.
-- **Structured conflict surface**: concurrent writers race through the
-  publisher's CAS; the loser surfaces as
-  `ManifestConflictDetails::ExpectedVersionMismatch { table_key,
-  expected, actual }`. The HTTP server maps this to **409 Conflict** with
-  a structured `manifest_conflict` body so clients can detect-and-retry
-  without parsing the message.
-
-## Removed
-
-This is a breaking release. Pre-0.4.0 / no SLA.
-
-- `omnigraph::db::{RunRecord, RunStatus, RunId}` types and the
-  `_graph_runs.lance` / `_graph_run_actors.lance` Lance datasets.
-- Engine APIs `begin_run`, `begin_run_as`, `publish_run`,
-  `publish_run_as`, `abort_run`, `fail_run`, `terminate_run`,
-  `list_runs`, `get_run`.
-- HTTP endpoints: `GET /runs`, `GET /runs/{run_id}`, `POST
-  /runs/{run_id}/publish`, `POST /runs/{run_id}/abort`. The
-  `RunListOutput` and `RunOutput` schemas are removed from the OpenAPI
-  document.
-- CLI subcommands: `omnigraph run list`, `omnigraph run show`, `omnigraph
-  run publish`, `omnigraph run abort`. Use `omnigraph commit list`
-  reading the commit graph for audit history.
-- Cedar policy actions `run_publish` and `run_abort`. Existing
-  `policy.yaml` files referencing these actions will fail validation —
-  remove the rules; the `change` action covers the equivalent gating.
-
-## Behavior changes
-
-- `mutate_as` / `load` are now **atomic per query, single publish at the
-  end**. A failed mutation leaves the target unchanged with no
-  intermediate manifest commits.
-- The `OmniError::manifest_conflict` shape produced by concurrent
-  writers is now `ExpectedVersionMismatch` (was `MergeConflict::DivergentUpdate`
-  via the run merge path). Clients that match on the conflict body must
-  switch to inspecting `manifest_conflict.table_key/expected/actual`.
-
-## Known limitation
-
-A multi-statement mutation that writes a Lance fragment in op-N and then
-fails in op-N+1 leaves the touched table with Lance HEAD ahead of the
-manifest. The next mutation against that table fails with
-`ExpectedVersionMismatch`. Most validation runs before any Lance write,
-so single-statement mutations are unaffected; the narrow path is
-multi-statement queries with late-op failures. Tracked as a follow-up;
-see [docs/dev/writes.md](../dev/writes.md#mid-query-partial-failure-closed-by-mr-794)
-for the workaround.
-
-## Upgrade notes
-
- **Stale `__run__*` branches and `_graph_runs.lance`** in legacy v0.3.x
-  repos are *inert* — the engine no longer reads them — but they remain
-  on disk until production cleanup. This release deliberately does not touch
-  legacy bytes.
- The `is_internal_run_branch` predicate is kept as a defense-in-depth
-  guard against users naming a branch `__run__*`. It will be removed in
-  a follow-up cleanup.
- External scripts hitting `/runs/*` will now receive 404. Migrate them
-  to `/commits` for audit history; mutation status is implied by the
-  HTTP response on `/change` itself.
-
-## Included Changes
-
- Demote Run: write directly to target via publisher
- `ManifestBatchPublisher::publish` accepts per-table
-  `expected_table_versions`
--- a/docs/releases/v0.4.1.md
+++ b/docs/releases/v0.4.1.md
@ -1,142 +0,0 @@
-# Omnigraph v0.4.1
-
-Omnigraph v0.4.1 closes the multi-statement-mutation atomicity gap that
-v0.4.0 documented as a known limitation. Inserts and updates now route
-through an in-memory `MutationStaging` accumulator and commit via Lance's
-two-phase distributed-write API at end-of-query. A failed mid-query op
-no longer leaves Lance HEAD drifted on the touched table — the next
-mutation proceeds normally.
-
-## Highlights
-
- **Staged-write rewire**: `mutate_as` and `load` (Append /
-  Merge modes) accumulate insert/update batches into
-  `MutationStaging.pending` per touched table. No Lance HEAD advance
-  happens during op execution; one `stage_*` + `commit_staged` per
-  table runs at end-of-query, then `ManifestBatchPublisher::publish`
-  commits the manifest atomically. **For op-execution failures**
-  (validation errors, missing endpoints, parse-time D₂ rejection), Lance
-  HEAD on every staged table is untouched and the next mutation
-  proceeds normally. A narrowed residual remains at the
-  finalize→publisher boundary (multi-table `commit_staged` is not
-  atomic with the manifest commit) — see [docs/dev/writes.md](../dev/writes.md)
-  "Finalize → publisher residual" for details.
- **D₂ parse-time rule**: a single mutation query is either
-  insert/update-only or delete-only. Mixed → rejected with a clear
-  error directing the caller to split into two queries. Lance 4.0.0
-  has no public two-phase delete; deletes still inline-commit, and D₂
-  keeps that path safe.
- **Read-your-writes via DataFusion `MemTable`**: read sites in
-  multi-statement mutations consume `TableStore::scan_with_pending`,
-  which Lance-scans the committed snapshot at the captured
-  `expected_version` and unions with a DataFusion `MemTable` over the
-  pending batches. Replaces the previous "reopen at staged Lance
-  version" pattern.
- **Coordinator swap-restore eliminated** from `mutate_with_current_actor`.
-  Branch is threaded explicitly through the per-op execution path
-  (`execute_named_mutation`, `execute_insert`, `execute_update`,
-  `execute_delete*`, `validate_edge_insert_endpoints`,
-  `ensure_node_id_exists`). The `swap_coordinator_for_branch` /
-  `restore_coordinator` API and `CoordinatorRestoreGuard` are removed
-  from `mutation.rs`. (`merge.rs` keeps its own swap pattern; that's
-  a separate workflow.)
- **`docs/dev/invariants.md` mutation atomicity / read-your-writes status**
-  flips from `aspirational/open` to `upheld for inserts/updates`. The within-query read-your-writes
-  guarantee is now load-bearing for the publisher CAS contract.
-
-## Behavior changes
-
- A failed multi-statement mutation no longer surfaces
-  `ExpectedVersionMismatch` on the *next* mutation against the same
-  table. The next call proceeds normally — Lance HEAD on staged
-  tables is unchanged.
- Mixed insert/update + delete in one query is rejected at parse
-  time. Existing test queries that mixed both must be split.
- `MutationStaging`'s shape changed: `pending: HashMap<String, PendingTable>`
-  + `inline_committed: HashMap<String, SubTableUpdate>` replaces the
-  previous `latest: HashMap<String, StagedTable>`. This is an internal
-  type; no public API impact.
-
-## Residual / out of scope
-
- **`LoadMode::Overwrite`** keeps the legacy inline-commit path
-  (truncate-then-append doesn't fit the staged shape). A mid-overwrite
-  failure can still drift Lance HEAD on a partially-truncated table;
-  the next overwrite replaces it. Operator-driven, rare.
- **Delete-only multi-statement mutations** still inline-commit per op.
-  D₂ keeps inserts/updates from coexisting with deletes, so the
-  inline path remains atomic per op but not per query for delete-only
-  cascades. Closing this requires Lance to expose
-  `DeleteJob::execute_uncommitted`; tracked upstream with Lance.
- **`schema_apply`, `branch_merge_internal`, `ensure_indices`** still
-  use Lance's inline-commit APIs. The two-phase pattern is in
-  `mutate_as` and `load` only; hoisting it to a storage-trait invariant
-  covering all writers remains future work.
-
-## Tests added
-
- `tests/writes.rs::partial_failure_leaves_target_queryable_and_unblocks_next_mutation`
-  (replaces the old `partial_failure_observably_rolls_back_but_blocks_next_mutation_on_same_table`)
- `tests/writes.rs::mutation_rejects_mixed_insert_and_delete_at_parse_time`
- `tests/writes.rs::mixed_insert_and_update_on_same_person_coalesces_to_one_merge`
- `tests/writes.rs::multiple_appends_to_same_edge_coalesce_to_one_append`
- `tests/writes.rs::multi_statement_inserts_publish_exactly_once`
- `tests/writes.rs::load_with_bad_edge_reference_unblocks_next_load`
- `tests/writes.rs::load_with_cardinality_violation_unblocks_next_load`
-
-## Files changed
-
- `crates/omnigraph/src/exec/staging.rs` (NEW) — `MutationStaging`,
-  `PendingTable`, `PendingMode`, `StagedTablePath`,
-  `dedupe_merge_batches_by_id`.
- `crates/omnigraph/src/exec/mutation.rs` — D₂ check; per-op
-  rewires (`execute_insert`, `execute_update`, `execute_delete*`);
-  branch threading; coordinator-swap removal; helper
-  `validate_edge_cardinality_with_pending`; helper
-  `concat_match_batches_to_schema`; `apply_assignments` updated to
-  copy unassigned blob columns from full-schema scans.
- `crates/omnigraph/src/loader/mod.rs` — `load_jsonl_reader` split:
-  staged path for Append/Merge, legacy inline-commit path for
-  Overwrite. Helpers `collect_node_ids_with_pending` and
-  `validate_edge_cardinality_with_pending_loader`.
- `crates/omnigraph/src/table_store.rs` — `scan_with_pending`,
-  `count_rows_with_pending` (DataFusion `MemTable`-backed union with
-  Lance scan).
- `Cargo.toml` (workspace) + `crates/omnigraph/Cargo.toml` — added
-  `datafusion = "52"` direct dep (transitively pulled by Lance
-  already; required for `MemTable`).
- `docs/dev/writes.md` — removed "Known limitation" section; documented
-  the new accumulator + D₂ + LoadMode::Overwrite residual.
- `docs/dev/invariants.md` — mutation atomicity / read-your-writes status
-  flipped to `upheld for inserts/updates`.
- `docs/dev/architecture.md` — added "Mutation atomicity — in-memory
-  accumulator" subsection; refreshed the engine + state
-  diagrams to drop `RunRegistry` and add `MutationStaging`.
- `docs/dev/execution.md` — rewrote the mutation flow sequence diagram
-  for the staged-write path; updated the `LoadMode` table to call
-  out per-mode commit semantics; rewrote `load` vs `ingest`.
- `docs/user/query-language.md` — documented the D₂ parse-time rule.
- `docs/user/errors.md` — added the D₂ `BadRequest` rejection path.
- `docs/user/storage.md` — dropped the live `_graph_runs.lance` reference
-  from the layout diagram and prose.
- `docs/user/branches-commits.md` — moved `__run__<id>` to a legacy note;
-  removed `publish_run` from the publish-trigger list.
- `docs/user/audit.md` — current `_as` API list refreshed; legacy
-  `RunRecord.actor_id` moved to a historical note.
- `docs/user/constants.md` — marked the run registry / branch-prefix rows
-  as legacy.
- `docs/user/cli.md` — replaced the legacy `omnigraph run *` quickstart
-  block with `omnigraph commit list/show`.
- `docs/dev/testing.md` — extended the `writes.rs` row to cover the new
-  staged-write contract tests; added the `staged_writes.rs` row.
- `AGENTS.md` (CLAUDE.md symlink) — updated the atomic-per-query
-  description and the L2 capability matrix row.
-
-## Included Changes
-
- Rewire `mutate_as` and `load` via in-memory `MutationStaging` +
-  `stage_*` / `commit_staged` per touched table at end-of-query.
-  (This release builds on the storage substrate that shipped in
-  v0.4.0's PR #67: `StagedWrite`, `stage_append`,
-  `stage_merge_insert`, `commit_staged`, `scan_with_staged`,
-  `count_rows_with_staged`.)
--- a/docs/releases/v0.4.2.md
+++ b/docs/releases/v0.4.2.md
@ -1,115 +0,0 @@
-# Omnigraph v0.4.2
-
-Omnigraph v0.4.2 is a concurrency, admission-control, and release-hygiene
-release. It removes the server-global write lock, lets disjoint writers make
-progress concurrently, adds per-actor admission limits, hardens branch and
-mutation races with snapshot-isolation fences, and documents the release in
-public open-source terms.
-
-## Highlights
-
- **Unlocked server engine handle**: the HTTP server now holds the engine behind
-  a shared handle instead of a server-global write lock. Concurrent handlers can
-  call engine APIs directly while the engine serializes only the resources that
-  actually conflict.
- **Engine-owned writer queues**: same `(table, branch)` writers are serialized
-  by per-table writer queues inside the engine, while disjoint table/branch
-  writes can run concurrently. This narrows contention without relying on route
-  handlers to know storage-level ordering rules.
- **Per-actor admission control**: mutating HTTP handlers are gated by a
-  `WorkloadController` with per-actor in-flight request and estimated-byte
-  budgets. Rejections use HTTP 429 with `code: too_many_requests` and a
-  `Retry-After` header, so noisy actors back off without blocking unrelated
-  actors.
- **Admission coverage for all mutating handlers**: `/change`, `/ingest`,
-  `/schema/apply`, branch create/delete, and branch merge now flow through the
-  admission controller. Read-only endpoints are not admission-gated.
- **Op-kind-aware version checks**: mutation commit-time drift checks distinguish
-  append-like inserts from strict update/delete work. Inserts remain permissive
-  enough for safe concurrent append patterns; updates and deletes get stricter
-  stale-view rejection.
- **Read-time drift checks for strict mutations**: staged mutations compare the
-  manifest pin captured when the query opened against the manifest snapshot
-  captured under table-queue ownership. If a concurrent writer moved the table
-  after the query read, the stale writer returns a structured
-  `manifest_conflict` 409 instead of staging work computed against an old
-  snapshot.
- **Inline-delete recovery coverage**: delete-only mutations still use Lance's
-  inline delete path, but their recovery sidecar is now written before the
-  manifest-version rejection path can return. If a delete moves Lance HEAD and a
-  concurrent manifest update makes the query stale, the next read-write open can
-  roll the residual back rather than leaving a head-ahead-of-manifest table.
- **Branch-operation race hardening**: branch creation and branch merge avoid
-  coordinator swap-restore races that could expose the wrong active branch to
-  concurrent work. Concurrent branch merges are serialized by a merge mutex.
- **Branch-merge target revalidation**: merges re-check target table versions
-  after acquiring target write queues. A stale merge plan returns a structured
-  conflict instead of overwriting concurrent target-branch changes or adopting a
-  source table over newly appended target rows.
- **Schema refresh deadlock fix**: recovery refresh releases the write guard
-  before schema reload, preventing a refresh/schema-apply deadlock.
- **Lean admission API**: removed the unused global rewrite admission pool,
-  `service_unavailable` error variant, related 503 documentation, and benchmark
-  flag. The public server surface now reflects only admission behavior that is
-  wired to handlers.
- **Open-source release hygiene**: this release adds guidance for public-facing
-  documentation, release notes, and version bumps. Release docs now avoid
-  private issue tracker references and use stable public descriptions instead.
-
-## Behavior changes
-
- Disjoint mutating HTTP requests can now make progress concurrently instead of
-  queueing behind one process-wide engine write lock.
- Mutating handlers may return HTTP 429 when an actor exceeds per-actor in-flight
-  or estimated-byte budgets. Clients should respect `Retry-After` and retry
-  later.
- Concurrent update/delete and merge races now return structured
-  `manifest_conflict` 409 responses in more stale-view cases instead of relying
-  on later publisher-CAS detection or allowing a stale plan to proceed.
- Concurrent branch merge × change on the same target branch may return either
-  success or a clean 409 conflict, depending on which operation wins the queue.
- `OMNIGRAPH_GLOBAL_REWRITE_MAX` is no longer recognized. Remove it from
-  deployment manifests; use the per-actor in-flight and byte-budget admission
-  settings for the currently wired server controls.
-
-## Upgrade Notes
-
- No repository migration is required. Existing v0.4.1 repos can be opened
-  directly with v0.4.2.
- Clients should treat `manifest_conflict` 409 responses as retryable stale-view
-  conflicts. This was already the documented contract, but this release uses it
-  in more concurrent-write paths.
- Clients should handle HTTP 429 from every mutating endpoint, not only
-  `/change`. Honor the `Retry-After` header.
- Operators should remove stale references to global rewrite admission and 503
-  rewrite-pool exhaustion from local runbooks.
- If you maintain public docs or release notes, use public identifiers and
-  user-facing descriptions rather than private tracker IDs.
-
-## Tests added or strengthened
-
- Regression tests for update read-your-writes under in-process concurrency.
- HTTP tests for same-key insert snapshots, disjoint `/change` concurrency, and
-  `/ingest` admission 429 + `Retry-After`.
- Branch-operation regression tests for branch-create swap-restore races,
-  concurrent `/change` + branch-merge interleavings, branch-merge swap-restore
-  races, branch-op matrix coverage, and post-reopen consistency.
- Failpoint-backed regression coverage for inline-delete recovery sidecar
-  creation before version-mismatch rejection.
- Admission tests use injectable `WorkloadController` state instead of mutating
-  process environment.
-
-## Included Changes
-
- Shared server engine state and per-actor admission on mutating endpoints.
- Per-(table, branch) writer queues and op-kind-aware manifest drift checks.
- Strict read-time version checks for updates/deletes.
- Branch create/merge race hardening and branch-merge target snapshot
-  revalidation under queue ownership.
- Retry-after support for admission rejections and OpenAPI updates for reachable
-  429 responses.
- Actor-isolation benchmark harness updates for the current admission controller.
- Removal of the unwired global rewrite admission / 503 server surface.
- Version bump to `0.4.2` across workspace crates, `Cargo.lock`, and
-  `openapi.json`.
- Public release-note cleanup and new OSS best-practice guidance in `AGENTS.md`.
--- a/docs/releases/v0.5.0.md
+++ b/docs/releases/v0.5.0.md
@ -1,171 +0,0 @@
-# Omnigraph v0.5.0
-
-Omnigraph v0.5.0 is a substrate, security, and migration-safety release. It
-jumps the storage substrate from Lance 4 to Lance 6.0.1 (DataFusion 52 → 53,
-Arrow 57 → 58), introduces engine-wide Cedar policy enforcement on every
-authoring path, and ships a structured schema-lint v1 chassis with
-code-tagged diagnostics, soft drops, and an explicit `--allow-data-loss`
-flag for destructive migrations.
-
-## Highlights
-
- **Lance 6.0.1 substrate**: bump from Lance 4.0.0 → 6.0.1, DataFusion 52 →
-  53, Arrow 57 → 58. New optimizer rules (vectorized `IN`-list eq kernel,
-  `PhysicalExprSimplifier`, push-limit-into-hash-join, CASE-NULL shortcut)
-  reach predicates that flow through the engine. `lance-tokenizer` replaces
-  tantivy internally; FTS behavior preserved.
- **Cedar policy engine**: a new `omnigraph-policy` crate wires
-  `Omnigraph::enforce(action, scope, actor)` into every `_as` writer
-  (`mutate_as`, `load_as`, `apply_schema_as`, `branch_create_as`,
-  `branch_merge_as`, `branch_delete_as`, plus the load and change
-  variants). The HTTP server defaults to deny-all when no Cedar policy is
-  configured; a YAML policy file is required to enable writes. Actor
-  identity comes only from signed token claims — clients cannot set actor
-  identity directly.
- **Schema lint v1 chassis**: diagnostics now carry stable codes of the form
-  `OG-XXX-NNN` instead of free-form messages. `omnigraph schema plan` and
-  `apply` understand soft drops on properties and types — destructive drops
-  require the new `--allow-data-loss` flag (Hard mode) at the CLI and an
-  equivalent JSON flag over HTTP.
- **Structured filter pushdown**: query-language predicates lower to
-  DataFusion `Expr` and push down through Lance's `Scanner::filter_expr`
-  instead of being flattened to SQL strings. This unlocks `CompOp::Contains`
-  pushdown (via `array_has`), which previously fell through to in-memory
-  post-scan filtering, and lets the DataFusion 53 optimizer rules above act
-  on our predicates.
- **HTTP `allow_data_loss` parity**: the destructive-drop guard now exists
-  on both the CLI (`--allow-data-loss`) and HTTP (`allow_data_loss: true` in
-  the schema-apply request body).
- **Inline query strings on CLI and HTTP**: `omnigraph read` /
-  `omnigraph mutate` and the corresponding HTTP endpoints accept inline
-  `.gq` source, not just a file path. Easier ad-hoc queries, clearer
-  request logs.
- **Browser CORS layer**: optional CORS layer on `omnigraph-server` for
-  browser-based UIs, gated by `OMNIGRAPH_CORS_ORIGINS`.
- **Merge-insert dup-rowid fix**: Lance's `MergeInsertBuilder` could surface
-  spurious `"Ambiguous merge inserts"` errors on sequential merges against
-  rows previously rewritten by `merge_insert`. The engine now opts into
-  `SourceDedupeBehavior::FirstSeen` with a `check_batch_unique_by_keys`
-  fail-fast precondition that guarantees source-side dedup happens before
-  Lance sees the batch.
- **Branch-merge error-path recovery**: a branch merge that failed
-  mid-flight could leave the in-process coordinator pointing at a stale
-  active branch. The error path now restores the prior coordinator,
-  matching the success path's invariant.
- **Branch merge with blob columns**: external blob URIs are now
-  materialized correctly during branch merge instead of being dropped or
-  pointing at the source branch.
- **Lance API surface guards**: a new test file
-  (`crates/omnigraph/tests/lance_surface_guards.rs`) pins eight specific
-  Lance API surfaces (`LanceError::TooMuchWriteContention`,
-  `ManifestLocation` fields, `MergeInsertBuilder` return shape,
-  `WriteParams::default`, `compact_files` signature, etc.) so the next
-  Lance bump fails compile or runtime on any silent drift rather than
-  producing wrong-state recovery in production.
-
-## Behavior changes
-
- **On-disk format unchanged**: existing v0.4.2 datasets open unchanged.
-  The Lance file format pin stays at V2_2 (required by Lance's blob v2
-  feature).
- **`omnigraph-server` defaults to deny-all under `--policy`**: starting a
-  server with the policy feature enabled but no Cedar YAML policy
-  configured rejects every write. Operators must supply a policy file to
-  authorize anything.
- **Schema-lint diagnostics carry stable codes**: messages now lead with
-  `OG-XXX-NNN`. CI parsers or tooling that keyed off the v0.4.2 free-form
-  text need to switch to code-based matching.
- **Destructive schema drops require `--allow-data-loss`**: dropping a
-  property or type returns a structured diagnostic by default.
-  `omnigraph schema apply --allow-data-loss` (CLI) or
-  `{"allow_data_loss": true}` (HTTP) opts into Hard mode.
- **`HashJoinExec` null-aware semantics on anti-join**: a side effect of
-  the DataFusion 53 bump — `NOT IN` semantics under null-valued anti-join
-  columns are now correct per the SQL standard. Queries that depended on the
-  prior behavior were returning incorrect results.
-
-## Upgrade Notes
-
-### Migration
-
- No data migration. v0.4.2 repos open directly on v0.5.0.
-
-### Clients
-
- HTTP and SDK clients should switch any string-matching schema-lint
-  parsing to code-based matching against the `OG-XXX-NNN` prefix.
- Clients exercising destructive schema drops (`DropProperty`, `DropType`)
-  must add the `allow_data_loss` request field (HTTP) or
-  `--allow-data-loss` flag (CLI). Default is soft-drop-or-reject.
- Clients consuming `mutate_as` / `load_as` / `apply_schema_as` / branch
-  authoring APIs now flow through the policy enforcer. Anything bypassing
-  authorization on v0.4.2 will be rejected on v0.5.0 once a policy is
-  configured.
-
-### Operators
-
- Configure a Cedar policy YAML for production servers before enabling
-  writes; deny-all is the new default. The `omnigraph policy validate` /
-  `test` / `explain` CLI commands are unchanged.
- Bearer tokens continue to be the actor-identity source; review the
-  signed-token-claim-only invariant in `docs/dev/invariants.md` if you've
-  built custom authentication.
- If your local CI uses RustFS for S3-compatible storage testing, our CI
-  pins `rustfs/rustfs:1.0.0-beta.3` (the last known-good tag before the
-  upstream credentials-policy change). Mirror the pin or set
-  `RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true` for the new image
-  versions.
-
-## Tests added or strengthened
-
- `crates/omnigraph/tests/lance_surface_guards.rs` — 8 named guards pinning
-  Lance API surfaces against silent drift on future bumps.
- `crates/omnigraph/tests/policy_engine_chassis.rs` — engine-level policy
-  enforcement coverage; complements the existing HTTP policy tests.
- Policy chassis e2e gap-fills — branch-merge, branch-create, branch-delete
-  policy paths now have explicit end-to-end tests over HTTP and CLI.
- Merge-pair truth table — exhaustive op-variant matrix for three-way
-  merge across `noop`, `addNode`, `removeNode`, `addEdge`, `removeEdge`,
-  `setProperty`, `dropProperty`, `addLabel`, `removeLabel`; the build
-  fails to compile when a new op variant is added without dispositioning
-  every pairing.
- Merge-insert: regression for the dup-rowid bug class on the load surface
-  (`load_merge_repeated_against_overlapping_keys_succeeds`), the update
-  surface (`second_sequential_update_on_same_row_succeeds`), and the
-  upstream-Lance-gap canary
-  (`load_merge_window_2_documents_upstream_lance_gap`).
- Maintenance + destructive-migration coverage — `omnigraph optimize` /
-  `cleanup` boundary cases, plus schema-apply soft-drop and Hard-mode
-  paths.
- Stable-row-id preservation across `stage_overwrite` — pins the invariant
-  that staged overwrites carry stable row IDs through to the committed
-  fragment set.
- `CompOp::Contains` pushdown regression
-  (`ir_filter_with_list_contains_pushes_down`) — pins the new structured
-  Expr pushdown path that retired the in-memory fallback.
-
-## Included Changes
-
- Lance 4 → 6.0.1, DataFusion 52 → 53, Arrow 57 → 58 substrate upgrade.
- `omnigraph-policy` crate with engine-wide Cedar enforcement and
-  signed-token-claim-only actor identity.
- Schema-lint v1 chassis with `OG-XXX-NNN` codes, soft `DropProperty` /
-  `DropType` semantics, and `--allow-data-loss` for Hard mode.
- HTTP `allow_data_loss` request field parity with the CLI flag.
- Structured DataFusion `Expr` filter pushdown via
-  `Scanner::filter_expr`, with `CompOp::Contains` lowered through
-  `array_has`.
- Inline `.gq` source acceptance on CLI and HTTP read/mutate endpoints.
- Optional CORS layer on `omnigraph-server` for browser UIs.
- Bug fixes: merge-insert dup-rowid (FirstSeen + uniqueness precondition),
-  branch-merge coordinator restore on error, blob-column materialization
-  during branch merge.
- New Lance API surface-guard test file as the canary for future Lance
-  bumps.
- Recovery-sidecar coverage extended across the four write paths
-  (`MutationStaging::finalize`, `schema_apply`, `branch_merge`,
-  `ensure_indices`) with failpoint regression tests.
- CI: pinned `rustfs/rustfs:1.0.0-beta.3` after the upstream `:latest`
-  introduced a credentials-policy change.
- Version bump to `0.5.0` across workspace crates, `Cargo.lock`,
-  `openapi.json`, and the `AGENTS.md` surveyed version.
--- a/docs/releases/v0.6.0.md
+++ b/docs/releases/v0.6.0.md
@ -1,141 +0,0 @@
-# Omnigraph v0.6.0
-
-Three pieces of work land in this release:
-
-1. The **graph terminology rename** (`Repo` → `Graph` across the Cedar resource model, policy API, and query-lint schema source).
-2. **Multi-graph server mode** — one `omnigraph-server` process can now serve 1–10 graphs concurrently behind cluster routes (`/graphs/{graph_id}/...`), with per-graph and server-level Cedar policy, read-only `GET /graphs` enumeration, and CLI parity (`omnigraph graphs list`).
-3. **Inline + canonical-named queries and mutations.** New `POST /query` and `POST /mutate` endpoints pair with the CLI's new `-e/--query-string` flag for ad-hoc execution without a temp file. `POST /read` and `POST /change` continue serving indefinitely as deprecated aliases that carry RFC 9745 `Deprecation: true` and RFC 8288 `Link: </successor>; rel="successor-version"` response headers, plus `deprecated: true` in `openapi.json`. Same canonicalization on the CLI: `omnigraph query`, `omnigraph mutate`, and top-level `omnigraph lint` / `omnigraph check` replace `omnigraph read`, `omnigraph change`, and the nested `omnigraph query lint` / `omnigraph query check`. Every deprecated spelling remains a `visible_alias` that warns to stderr once per invocation.
-
-Runtime add/remove (`POST /graphs`, `DELETE /graphs/{id}`, `omnigraph graphs create`) is **not** in v0.6.0. Operators add or remove graphs by editing `omnigraph.yaml` and restarting. The first cut of `POST /graphs` shipped behind an atomic-YAML-rewrite design that we pulled before release once its concurrency guarantees were challenged (flock-on-renamed-inode race, duplicate-check outside the critical section, and an init-cleanup path that could destroy an existing graph's schema on re-init). The correct fix is a Lance-style cluster catalog (reserve → init → publish with recovery sidecars); that work is deferred.
-
-## Breaking Changes
-
-### Graph terminology rename
-
- Renamed the Cedar resource entity from `Omnigraph::Repo` to `Omnigraph::Graph`.
- Renamed policy API terminology from `repo_id` to `graph_id` on `PolicyCompiler::compile` (and on the new `PolicyEngine::load_graph` / `PolicyEngine::load_server` loaders described below).
- Renamed query-lint schema source JSON from `"repo"` to `"graph"` for `schema_source.kind`.
-
-### Multi-graph server mode
-
- **Multi-graph deployments lose flat routes.** Single-graph invocation (`omnigraph-server <URI>`) is unchanged — same flat `/snapshot`, `/read`, `/branches`, etc. Multi-graph deployments serve those routes under `/graphs/{graph_id}/...`; bare flat paths return 404 in multi mode.
- **`ServerConfig` shape change** (programmatic embedders only): `ServerConfig { uri, policy_file }` is replaced by `ServerConfig { mode: ServerConfigMode }`, where `ServerConfigMode = Single { uri, policy_file } | Multi { graphs, config_path, server_policy_file }`. Callers that use `load_server_settings` are unaffected; callers that construct `ServerConfig` directly need to wrap their fields in `ServerConfigMode::Single`.
- **`AppState`'s routing surface** is `AppState::routing() -> &GraphRouting`, where `GraphRouting = Single { handle } | Multi { registry, config_path }`. The previous `AppState::uri()`, `AppState::mode()`, `AppState::registry()` accessors and the `ServerMode` enum are gone — embedders read `state.routing()` and match on the arm they need. Per-graph URIs live on `handle.uri`.
- **`AppState::new_multi`** is the new multi-graph constructor. Single-mode `new_*` / `open_*` constructors are unchanged.
- **`AuthenticatedActor(Arc<str>)` → `ResolvedActor { actor_id, tenant_id, scopes, source }`** (programmatic embedders only). The struct shape changes, but the HTTP contract — bearer auth and the bearer-derived-actor-identity guarantee — is unchanged. Cluster-mode call sites construct with `tenant_id: None`, `scopes: vec![Scope::Full]`, `source: AuthSource::Static`. The new fields are forward-compat seams for future multi-tenant and OAuth deployments; they're inert in this release.
- **`PolicyEngine::load(path, graph_id)` removed** in favor of two kind-typed loaders: `PolicyEngine::load_graph(path, graph_id)` for per-graph policies and `PolicyEngine::load_server(path)` for server-level policies. Each loader rejects rules whose action `resource_kind()` doesn't match the engine kind — operators who put a `graph_list` rule in a per-graph file (or a `read` rule in a server file) now get a load-time error instead of a silently-never-matching rule.
- **`PolicyRequest::actor_id` field removed.** Actor identity is now a separate parameter on `PolicyEngine::authorize(actor_id, &request)`. The type system enforces the server-authoritative-actor invariant: actor identity is always sourced from the bearer-token match resolved at the auth boundary; handlers cannot smuggle identity through the request body.
- **`Omnigraph::init` is strict by default.** Initialization at a URI that already holds schema files now errors with `OmniError::AlreadyInitialized` instead of silently overwriting. Operators who actually want to overwrite use `InitOptions { force: true }` (CLI: `omnigraph init --force`). Closes the destructive-cleanup footgun where a failed re-init would delete an existing graph's schema files.
- **Top-level `policy.file` is rejected in multi-graph server mode.** It remains valid for single-graph / CLI-local policy. Multi-graph deployments must move graph rules to `graphs.<graph_id>.policy.file` and server-scoped `graph_list` rules to `server.policy.file`.
- **Open server startup requires explicit opt-in.** A server with no bearer tokens and no policy now refuses to start unless passed `--unauthenticated` or `OMNIGRAPH_UNAUTHENTICATED=1`.
- **Policy requires bearer tokens.** Configuring any policy file without bearer tokens now refuses startup; otherwise every protected request would 401 before Cedar could evaluate it.
- **Tokens without policy default-deny non-read actions.** Existing authenticated deployments that relied on writes or admin routes without Cedar policy must add policy rules for those actions.
- **`GET /graphs` requires `server.policy.file` in every runtime state.** Even `--unauthenticated` mode keeps server topology closed until the operator explicitly authorizes `graph_list`.
-
-### Query / mutation rename
-
- **`ChangeRequest` field rename**: `query_source` → `query`, `query_name` → `name`. Both legacy names continue to deserialize via `#[serde(alias = "...")]`, so existing clients sending the old JSON keys keep working. CLI remote calls against `/change` still emit the legacy keys verbatim through the `legacy_change_request_body` helper so a newer CLI talking to an older server keeps working byte-for-byte.
- **CLI `omnigraph query lint` / `omnigraph query check`** are now top-level — canonical name is **`omnigraph lint`**. The three deprecated invocations (`omnigraph query lint`, `omnigraph query check`, and bare `omnigraph check`) remain as argv-level shims that rewrite to `omnigraph lint` and print a one-line stderr deprecation warning. `check` is deliberately **not** a clap `visible_alias` on `lint` — two equivalent canonical names would split agent emissions between them depending on training-data drift, so the deprecation pattern (rewrite + warn) gives one unambiguous canonical name in `omnigraph --help`.
-
-## New
-
- **Multi-graph mode**. Invoke with `omnigraph-server --config omnigraph.yaml` where the YAML has a non-empty `graphs:` map and no single-mode selector (no `server.graph`, no CLI `<URI>` or `--target`). At startup the server opens every configured graph in parallel (bounded concurrency, fail-fast).
- **`GET /graphs`**. Lists every registered graph, sorted alphabetically by `graph_id`. Auth-required when bearer tokens are configured; Cedar-gated by `PolicyAction::GraphList` against `Omnigraph::Server::"root"`. Returns 405 in single mode. Server-scoped actions require an explicit `server.policy.file` in every runtime state — the management surface is closed by default even in `--unauthenticated` mode so that server topology is never exposed without operator opt-in.
- **CLI `omnigraph graphs list`**. Mirrors the HTTP surface. Rejects local URI targets with a clear message — for remote multi-graph servers only.
- **CLI `omnigraph init --force`**. Bypasses the strict-init preflight when an operator deliberately wants to recover from orphan schema files. Does NOT purge existing Lance datasets; recursive deletion needs `StorageAdapter::delete_prefix` (deferred — see below).
- **Per-graph Cedar policy**. Each entry in the `graphs:` map can carry a `policy.file` path, loaded at startup via `PolicyEngine::load_graph`. Cedar's `Omnigraph::Graph::"<graph_id>"` resource is per-graph; the new `Omnigraph::Server::"root"` resource governs server-level actions.
- **Server-level Cedar policy**. `server.policy.file` in the config governs the `graph_list` action on `Omnigraph::Server::"root"`. Required to expose `GET /graphs` in every runtime state — without a server policy the default-deny posture rejects `graph_list`, including in `--unauthenticated` mode.
- **Cedar action vocabulary**: `graph_list` (server-scoped). Runtime `graph_create` / `graph_delete` are reserved but not shipped — see "Deferred."
- **Canonical graph URI identity.** Server startup normalizes graph root URIs before registry insertion and response output, so aliases such as `/tmp/g`, `/tmp/g/`, and `file:///tmp/g` cannot register as distinct graphs that actually share one Lance root.
- **`POST /query`** and **`POST /mutate`**. Canonical inline endpoints. `/query` rejects mutations with a typed 400 (the D2 rule lives at the URL — read-only contract enforced before execution); body uses the clean `{ query, name, params, branch, snapshot }` shape. `/mutate` accepts the same shape for mutations. Both available in single mode and per-graph multi mode (`/graphs/{id}/query`, `/graphs/{id}/mutate`). Internal call sites share two helpers (`run_query`, `run_mutate`) that take decoupled args, not request bodies — the seam MR-969's future stored-query handler plugs into.
- **CLI `omnigraph query` / `omnigraph mutate`** as top-level canonical subcommands. Pairs with the new top-level **`omnigraph lint`** (bare `omnigraph check` is a deprecated shim that rewrites to it, not an alias), so query validation no longer sits under `omnigraph query`.
- **CLI `-e, --query-string <GQ>`** on both `omnigraph query` and `omnigraph mutate`. 3-way mutex with `--query <path>` and `--alias <name>` — exactly one is required. Empty string rejected. Suits ad-hoc exploration, REPL workflows, and agent tool-use without temp files.
- **Three-channel deprecation signal on `/read` and `/change`**: OpenAPI `deprecated: true` on the operation (every codegen flags the generated SDK method), RFC 9745 `Deprecation: true` response header, and RFC 8288 `Link: </query>; rel="successor-version"` (or `</mutate>`) response header. Auto-discoverable; no SDK breakage.
- **`omnigraph.yaml` `aliases.<name>.command`** now accepts `query` and `mutate` as canonical values alongside the legacy `read` and `change`. The internal `AliasCommand` enum retains the legacy variant names so serialized configs stay byte-stable.
-
-## Configuration
-
-`omnigraph.yaml` schema additions (all optional, single-mode unaffected):
-
-```yaml
-server:
-  bind: 0.0.0.0:8080
-  policy:
-    file: ./server-policy.yaml          # server-level Cedar (graph_list)
-
-graphs:
-  alpha:
-    uri: s3://tenant-bucket/alpha
-    policy:
-      file: ./policies/alpha.yaml       # per-graph Cedar
-  beta:
-    uri: s3://tenant-bucket/beta
-    # no per-graph policy → engine-layer enforcement is a no-op
-```
-
-## Deferred
-
- **`POST /graphs` runtime graph creation** and **CLI `omnigraph graphs create`**. Pulled before release after the YAML-rewrite design's correctness story didn't survive review. A future release will add a managed cluster catalog (Lance-backed reserve → init → publish with recovery sidecars) and re-expose runtime creation on top of it. Until then, operators add graphs by editing `omnigraph.yaml` and restarting.
- **`DELETE /graphs/{id}`**. Never shipped in v0.6.0; deferred with the same cluster-catalog work.
- **`StorageAdapter::delete_prefix`**. The substrate primitive a managed catalog would need. Will land alongside runtime mutation.
- **`omnigraph init --force` purging Lance state.** Today `--force` only bypasses the schema-file preflight; recursive deletion of existing Lance datasets needs `delete_prefix`.
- **`X-Actor-Id` service delegation forwarding**. Needs durable both-actor audit on `_graph_commits.lance` — out of scope.
- **Hot policy reload**. Restart is cheap at N≤10 graphs.
-
-## User Impact
-
- **No on-disk migration is required.** Existing `.omni` graphs from v0.5.0 (and earlier) open cleanly under v0.6.0 — Lance datasets, `__manifest`, `_schema.pg`, `_schema.ir.json`, `__schema_state.json`, `_graph_commits.lance`, `_graph_commit_recoveries.lance` all use unchanged formats. No conversion step.
- **Existing single-graph storage upgrades without migration.** Server deployments may need auth/policy config changes: explicitly pass `--unauthenticated` for local open mode, configure tokens when using policy, and add Cedar policy for non-read authenticated actions.
- **Multi-graph adoption is opt-in.** Add a `graphs:` map to `omnigraph.yaml` (and remove `server.graph`) to switch a deployment to multi mode.
- **Cluster routes are breaking for client SDKs targeting multi mode.** Generated clients from previous v0.5.0 OpenAPI specs will hit 404 on flat paths against a multi-mode server. Regenerate against the v0.6.0 `openapi.json`.
- **Supported YAML policy authoring is unchanged.** The Cedar `Omnigraph::Graph` and `Omnigraph::Server` entities are internally generated by `compile_policy_source` — operator YAML only references actions and groups.
- **Operators with unsupported raw Cedar policy files** should update `Omnigraph::Repo` resource references to `Omnigraph::Graph`.
- **The endpoint and CLI rename is cosmetic on the client side.** Existing callers on `/read`, `/change`, `omnigraph read`, `omnigraph change`, and `omnigraph query lint` keep working: they pick up the `Deprecation` + `Link` headers (or the stderr deprecation warning on the CLI), so SDKs and proxies can surface the successor name automatically. New integrations should target the canonical names. `ChangeRequest` field names migrate at the caller's pace: both `query_source`/`query_name` and `query`/`name` are accepted indefinitely.
-
-## Migration: single → multi
-
-```yaml
-# Before (v0.5.0 single-mode invocation)
-server:
-  graph: my-graph
-graphs:
-  my-graph:
-    uri: /var/lib/omnigraph/my-graph
-policy:
-  file: ./policy.yaml
-```
-
-```yaml
-# After (v0.6.0 multi-mode — drop `server.graph` and the top-level `policy`)
-server:
-  policy:
-    file: ./server-policy.yaml      # NEW: governs GET /graphs
-graphs:
-  my-graph:
-    uri: /var/lib/omnigraph/my-graph
-    policy:
-      file: ./policy.yaml           # MOVED: was top-level
-```
-
-Same `omnigraph.yaml` file; restart the server. Clients targeting the old flat routes (`/snapshot`, `/read`, …) must update to `/graphs/my-graph/snapshot`, etc.
-
-To add a new graph after rollout: stop the server, append a new `graphs.<id>` entry, restart.
-
-## Documentation
-
- Public docs, CLI help, examples, server docs, and test helpers now consistently use "graph" for the OmniGraph data artifact.
- GitHub/source repository terminology remains spelled out as "repository" where needed.
- New: `docs/user/cli.md` documents `omnigraph graphs list`; `docs/user/server.md` documents the multi-graph mode and the cluster route convention; `docs/user/policy.md` documents the per-graph vs server-scoped action distinction.
- New: `docs/user/server.md` documents `POST /query` / `POST /mutate` and the three-channel deprecation signal on `/read` / `/change`. `docs/user/cli.md` documents the `-e/--query-string` flag with examples. `docs/user/cli-reference.md` shows the canonical CLI verbs (`query`, `mutate`, `lint`, `check`) with legacy spellings as visible aliases.
- New: `docs/dev/rfc-001-queries-envelope-mcp.md` is the cross-cutting design doc for the inline / stored query work that started landing in this release. It sequences the v0.6.x patch series (request/response envelope hardening) and the v0.7.0 stored-query + MCP work.
-
-## Test coverage
-
- `GraphId` newtype validation, registry race tests, init failpoints (still reachable from `omnigraph init` CLI).
- Mode-inference four-rule matrix, parallel multi-graph startup, cluster routing.
- Cedar `Server` resource refactor, backwards-compat for graph-only policies, kind-alignment rejection (server actions in graph files / vice versa).
- `GET /graphs` enumeration, 405-in-single-mode, 403-in-Open-mode-without-server-policy, Cedar admin/viewer authorization.
- Cluster routes with inner path params (`/branches/{branch}`, `/commits/{commit_id}`) deserialize correctly under axum 0.8 nested routing.
- Policy-requires-tokens startup invariant enforced uniformly across single and multi mode.
- The bearer-auth-derived-actor-identity regression test (client-supplied identity headers are ignored; the server-resolved actor is the only identity Cedar sees) stays green across the entire refactor.
-</content>
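
The startup rules in the removed v0.6.0 "Breaking" bullets above (open mode needs an explicit opt-in, policy needs bearer tokens, tokens without policy default-deny non-read actions) can be read as a small decision table. The following is a minimal sketch of that reading, not the server's code: `StartupConfig`, `startup_mode`, `refuses_startup`, and `tokens_only_allows` are hypothetical names, and the sketch takes the "policy requires bearer tokens" bullet literally without modelling `server.policy.file` in open mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartupConfig:
    has_tokens: bool               # at least one bearer token is configured
    has_policy: bool               # at least one policy file is configured
    unauthenticated: bool = False  # --unauthenticated / OMNIGRAPH_UNAUTHENTICATED=1


def startup_mode(cfg: StartupConfig) -> str:
    """Return a hypothetical mode label, or raise ValueError where startup is refused."""
    if cfg.has_policy and not cfg.has_tokens:
        # Every protected request would 401 before the policy could be evaluated.
        raise ValueError("policy requires bearer tokens")
    if not cfg.has_tokens:
        # No tokens and no policy: an open server needs the explicit opt-in.
        if not cfg.unauthenticated:
            raise ValueError("open mode requires --unauthenticated")
        return "open"
    return "policy" if cfg.has_policy else "tokens-only"


def refuses_startup(cfg: StartupConfig) -> bool:
    """True when the sketch above would refuse to start."""
    try:
        startup_mode(cfg)
    except ValueError:
        return True
    return False


def tokens_only_allows(action: str) -> bool:
    # Assumption: "read" stands in for the read actions; everything else is default-denied.
    return action == "read"
```

Under these assumptions, `refuses_startup(StartupConfig(has_tokens=False, has_policy=True))` is true, which matches the "policy requires bearer tokens" bullet.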
--- a/docs/releases/v0.6.1.md
+++ b/docs/releases/v0.6.1.md
@ -1,28 +0,0 @@
-# Omnigraph v0.6.1
-
-v0.6.1 focuses on operational polish after v0.6.0: stored-query registries, safer branch cleanup, more complete release artifacts, and a Lance blob-compaction workaround.
-
-## Highlights
-
- **Stored-query registries.** `omnigraph.yaml` can declare curated `queries:` blocks per graph. Servers load and type-check them at startup, `omnigraph queries validate` checks them offline, `omnigraph queries list` shows exposed queries and typed params, `GET /queries` exposes a typed catalog, and `POST /queries/{name}` invokes a stored query without accepting ad hoc `.gq` source from the client.
- **Stored-query policy gate.** New Cedar action `invoke_query` gates the stored-query invocation surface. Stored mutations are double-gated: `invoke_query` to reach the stored query and `change` for the actual write.
- **Safer branch deletion.** `branch_delete` now treats the manifest as the authority, flips branch visibility atomically, and reclaims per-table/commit-graph forks as derived state. If best-effort reclaim is interrupted, `cleanup` reconciles orphaned forks; reusing a branch name before cleanup reports an actionable error.
- **Legacy `__run__` cleanup (MR-770).** *(Correction: this item shipped in [v0.6.2](v0.6.2.md), not v0.6.1 — the v0.6.1 notes over-claimed it. At the v0.6.1 tag the `__run__` branch-name guard and `run_registry.rs` were still present and no v2→v3 sweep migration existed.)* The guard removal and the one-time v2→v3 `__manifest` migration that sweeps stale `__run__*` staging branches on first read-write open are described in the v0.6.2 release notes.
- **Blob-safe optimize.** `omnigraph optimize` skips tables with `Blob` properties instead of failing the whole sweep on Lance's blob-v2 compaction decode bug. Skips are visible in human output, `--json` as `skipped`, `TableOptimizeStats.skipped`, and logs; non-blob tables still compact normally.
- **Deployment improvements.** The container entrypoint now composes `OMNIGRAPH_TARGET_URI` with `OMNIGRAPH_CONFIG`, so operators can keep the graph URI in env while loading policy/query config from a mounted file. The local RustFS bootstrap pins RustFS beta.3 and allows the current insecure local-dev default credentials.
- **Windows release support.** Tagged and edge releases now publish Windows x86_64 archives containing `omnigraph.exe` and `omnigraph-server.exe`, with a PowerShell installer and Windows install docs.
- **Release tooling.** Homebrew formula generation was tightened to produce audit-clean formulas.
-
-## Compatibility Notes
-
- A graph selected by name (`--target` or `server.graph`) now uses `graphs.<name>.policy` and `graphs.<name>.queries`. Top-level `policy` / `queries` blocks are only for anonymous bare-URI single-graph mode; using them with a named graph now fails loudly with migration guidance.
- `mcp.expose` defaults to `true` for stored-query registry entries. Set `mcp: { expose: false }` for service-only queries that should not appear in the catalog.
- `invoke_query` is graph-scoped, not branch-scoped. Branch/snapshot access remains enforced by the inner `read` / `change` gate.
- **Legacy `__run__` migration.** *(Correction: deferred to [v0.6.2](v0.6.2.md).)* The automatic v2→v3 `__manifest` stamp migration that sweeps stale `__run__*` branches on first read-write open ships in v0.6.2, not v0.6.1; a v0.6.1 binary does not perform it. See the v0.6.2 notes for the migration behavior and the read-only caveat.
- Blob tables are not compacted until the upstream Lance fix lands, so fragment count and deleted-row space on blob tables are not reclaimed by `optimize`. Reads, writes, and query results are unaffected; no on-disk migration is required.
- `TableOptimizeStats` is now `#[non_exhaustive]` and gains a `skipped: Option<SkipReason>` field; the new `SkipReason` enum is also `#[non_exhaustive]`. This is a source-level break only for downstream code that constructs `TableOptimizeStats` with a struct literal, which is rare because the struct is produced by `optimize` and consumed by reading its fields. Field access is unaffected, and `#[non_exhaustive]` keeps future additions non-breaking.
-
-## Docs And Cleanup
-
- Public docs were updated for stored queries, policy, server routes, deployment, Windows installation, branch deletion, maintenance, and the `runs` docs rename to `writes`.
- README copy and release documentation were refreshed; older release notes had small typo/wording fixes.
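
The double gate described in the removed v0.6.1 notes above (`invoke_query` to reach a stored query, then the inner `read` / `change` gate for the actual access) can be sketched as follows. This is an illustration under stated assumptions, not the policy engine: `required_actions` and `may_invoke` are hypothetical helpers, and the actor's permissions are reduced to a plain set of action names rather than Cedar evaluation.

```python
from typing import List, Set


def required_actions(is_mutation: bool) -> List[str]:
    """Actions an actor needs to invoke a stored query, outer gate first."""
    inner = "change" if is_mutation else "read"
    return ["invoke_query", inner]


def may_invoke(allowed: Set[str], is_mutation: bool) -> bool:
    """True only if both the outer and the inner gate pass."""
    return all(action in allowed for action in required_actions(is_mutation))
```

In this sketch an actor holding only `invoke_query` and `read` can run stored reads but not stored mutations, which is the property the notes call double-gating.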
--- a/docs/releases/v0.6.2.md
+++ b/docs/releases/v0.6.2.md
@ -1,69 +0,0 @@
-# Omnigraph v0.6.2
-
-v0.6.2 is a maintenance-safety release on top of v0.6.1. It tightens the
-`optimize` / recovery boundary, adds an explicit repair path for uncovered
-manifest/head drift, completes the legacy `__run__` branch cleanup (MR-770),
-accepts pretty-printed JSON load input, and updates the project governance and
-release automation around those fixes.
-
-## Highlights
-
- **Explicit `omnigraph repair`.** New `repair` CLI support previews uncovered
-  manifest/head drift by default and reports each table's classification,
-  action, manifest version, Lance HEAD version, Lance operations, and any
-  classification error. `--confirm` publishes verified maintenance-only drift;
-  `--force --confirm` can publish suspicious or unverifiable drift after
-  operator review.
- **Optimize skips uncovered drift.** `omnigraph optimize` now refuses to
-  interpret Lance HEAD movement that is ahead of `__manifest` without a recovery
-  sidecar. Those tables are reported as `skipped: DriftNeedsRepair` and left
-  untouched until `omnigraph repair` classifies them.
- **Optimize publishes compaction.** Successful compaction now publishes the
-  compacted Lance version back through the graph manifest and is covered by an
-  `Optimize` recovery sidecar. A crash after Lance compaction but before
-  manifest publish converges through the normal recovery sweep instead of
-  leaving hidden drift.
- **Recovery roll-back convergence.** Recovery roll-back now aligns the
  manifest-visible version after restoring a table, closing the residual gap in
  which Lance HEAD and `__manifest` could stay out of sync after recovery.
- **Legacy `__run__` branch cleanup (MR-770).** Completes the retirement of the
-  Run state machine (removed in v0.4.0). A one-time v2→v3 `__manifest`
-  internal-schema migration runs on the first read-write open and deletes any
-  stale `__run__*` staging branches left by pre-v0.4.0 graphs — they previously
-  leaked into `branch list` and counted as blocking branches at `schema apply`
-  time. The migration is idempotent, and the `is_internal_run_branch` guard
-  (and `run_registry.rs`) is retired now that `__run__*` is an ordinary branch
-  name. (The earlier v0.6.1 notes described this as shipped in v0.6.1; it
-  actually landed here in v0.6.2.)
- **Pretty-printed JSON load input.** `load` accepts multi-line JSON objects in
-  addition to one-object-per-line JSONL, so formatted fixture or export files no
-  longer need to be minified before import.
-
-## Operational Notes
-
- `repair` requires a clean recovery state. Pending `__recovery` sidecars still
-  belong to automatic open-time recovery; reopen the graph first, then run
-  repair if drift remains.
- `repair --confirm` only auto-publishes drift made of Lance maintenance
-  operations (`Rewrite` and `ReserveFragments`). Semantic operations such as
-  append, delete, update, and merge are refused unless the operator uses
-  `--force --confirm`.
- `optimize` remains non-destructive. It still skips blob-bearing tables while
-  OmniGraph is pinned to the Lance version with the blob-v2 compaction issue.
- No manual on-disk migration is required. Existing graphs open under v0.6.2.
-  Graphs already at internal manifest schema stamp v3 are unchanged; graphs
-  created before v0.4.0 that still carry the v2 stamp auto-migrate v2→v3 on the
-  first **read-write** open (the `__run__*` sweep above). The migration is
-  write-path-only, so a long-lived **read-only** deployment still lists any
-  stale `__run__*` branch until it is next opened read-write.
-
-## Docs, Governance, And CI
-
- Added issue, discussion, RFC, and pull-request templates plus governance docs
-  for the external contribution path.
- Regenerated CODEOWNERS tables and adjusted branch-protection docs so code
-  owners can bypass required PR review where repository rules allow it.
- Trimmed Windows release builds out of per-PR CI and kept Windows packaging on
-  tag releases.
- Made Homebrew audit diagnostic-only in the release workflow so a flaky audit
-  cannot block publishing an otherwise valid formula update.
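
The `repair` behaviour in the removed v0.6.2 notes above (preview by default, `--confirm` publishes maintenance-only drift, `--force --confirm` publishes the rest) can be sketched as a small decision function. This is a sketch under assumptions, not the CLI's logic: `repair_action` and its return labels are hypothetical, drift is modelled as a list of Lance operation names, `"Append"` stands in for any semantic operation, and an empty operation list is treated as unverifiable.

```python
from typing import List

# The two maintenance operations the notes name as safe to auto-publish.
MAINTENANCE_OPS = {"Rewrite", "ReserveFragments"}


def repair_action(ops: List[str], confirm: bool = False, force: bool = False) -> str:
    """Return "preview", "publish", or "refuse" for one drifted table."""
    if not confirm:
        # Without --confirm the command only reports; --force alone changes nothing here.
        return "preview"
    maintenance_only = bool(ops) and all(op in MAINTENANCE_OPS for op in ops)
    if maintenance_only or force:
        return "publish"
    # Semantic or unverifiable drift is refused until the operator adds --force.
    return "refuse"
```

For example, `repair_action(["Rewrite", "Append"], confirm=True)` yields `"refuse"` in this sketch, because one semantic operation is enough to take the drift out of the maintenance-only class.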
--- a/docs/releases/v0.7.0.md
+++ b/docs/releases/v0.7.0.md
@ -36,6 +36,12 @@ get faster and self-healing, and text embedding becomes provider-independent.
  single-graph flat-route mode, positional-`<URI>` boot, and `omnigraph.yaml`
  `graphs:`-map boot are gone — add or remove graphs with `cluster apply` and
  restart.
+- **Resilient cluster boot with strict opt-out.** Graph-attributed startup
+  failures now quarantine that graph and let healthy graphs serve; `/graphs`
+  lists only ready graphs, and quarantined graph routes return 404. Cluster-
+  global failures still refuse boot, and `--require-all-graphs` (or
+  `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`) restores fail-fast all-or-nothing startup
+  for operators who prefer any degraded graph to abort the process.
 - **One storage substrate + recovery liveness.** The cluster storage backend and
  the engine both go through one `StorageAdapter` (versioned read, conditional
  replace/CAS, prefix delete), exercised by a storage fault-injection matrix.
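
The resilient-boot bullet added in the hunk above (quarantine graphs whose startup fails, serve the healthy ones, abort on cluster-global failures or when `--require-all-graphs` is set) can be sketched as below. This is a minimal model under assumptions, not the server's boot path: `BootPlan`, `plan_boot`, `listed_graphs`, and `is_served` are hypothetical names, and each graph's startup result is reduced to an optional error string.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BootPlan:
    ready: List[str]        # graphs that opened cleanly
    quarantined: List[str]  # graphs whose own startup failed
    abort: bool             # True when the process should refuse to boot


def plan_boot(
    graph_errors: Dict[str, Optional[str]],
    cluster_error: Optional[str] = None,
    require_all: bool = False,
) -> BootPlan:
    """graph_errors maps graph id to an error message, or None on success."""
    ready = sorted(g for g, err in graph_errors.items() if err is None)
    quarantined = sorted(g for g, err in graph_errors.items() if err is not None)
    # Cluster-global failures always abort; strict mode also aborts on any quarantined graph.
    abort = cluster_error is not None or (require_all and bool(quarantined))
    return BootPlan(ready, quarantined, abort)


def listed_graphs(plan: BootPlan) -> List[str]:
    """What the graph listing would show: ready graphs only."""
    return list(plan.ready)


def is_served(plan: BootPlan, graph_id: str) -> bool:
    """False models the 404 returned for quarantined (or unknown) graph routes."""
    return graph_id in plan.ready
```

With `require_all=False`, a single failing graph leaves `abort` false and only removes that graph from the listing; with `require_all=True` the same input sets `abort` to true.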
--- a/docs/user/branching/merge.md
+++ b/docs/user/branching/merge.md
@ -22,6 +22,25 @@ A merge resolves to one of three outcomes:
  simply advances to the source.
 - **Merged** — both sides diverged; a new merge commit is created with two parents.

+## Indexes after a merge
+
+A **fast-forward** merge (the common case — the target had no conflicting
+changes, so the source's rows are adopted) does not build or rebuild indexes on
+the rows it brings into the target. Newly merged rows (and any index a table does
+not yet have) are covered the next time `optimize` runs — indexes are derived
+state, and reads stay correct in the meantime via brute-force scan over the
+not-yet-covered rows. This keeps a fast-forward merge fast (it never pays an
+inline vector/FTS rebuild on the publish path), at the cost of brute-force search
+latency on freshly merged rows until the next `optimize`.
+
+A **three-way** merge (the `Merged` outcome — both branches changed the table and
+the rows were reconciled) still rebuilds the table's indexes inline today, as part
+of the publish. So a Merged-outcome merge of an embedding-bearing table pays the
+index-build cost up front.
+
+Either way, run `omnigraph optimize` after a large merge to restore (or, for the
+fast-forward path, establish) full index coverage.
+
 ## Conflicts

 When both branches changed the same data incompatibly, the merge fails with a
--- a/docs/user/cli/index.md
+++ b/docs/user/cli/index.md
@ -14,7 +14,7 @@ omnigraph mutate insert_person --params '{"name":"Mina","age":28}'
 `omnigraph query` is the canonical read command (pairs with `POST /query`);
 `omnigraph mutate` is the canonical write command (pairs with `POST /mutate`).
 The positional argument is the **stored-query name**, invoked from the served
-catalog (RFC-011 D3) — the graph is addressed by scope (`--server` / `--profile`
+catalog — the graph is addressed by scope (`--server` / `--profile`
 / defaults), and the verb asserts the query's kind (`query` rejects a stored
 mutation, and vice-versa). The previous names `omnigraph read` and
 `omnigraph change` keep working as visible aliases — invocations emit a one-line
--- a/docs/user/cli/reference.md
+++ b/docs/user/cli/reference.md
@ -2,7 +2,7 @@

 A reference for the `omnigraph` binary's command surface and the per-operator `~/.omnigraph/config.yaml` schema. For a quick-start guide, see [cli.md](index.md).

-Top-level command families and subcommands. Graph-targeting commands accept a positional `file://`/`s3://` URI, `--server <name|url>` (an operator-defined server from `~/.omnigraph/config.yaml` by name, or a literal `http(s)://` URL, optionally with `--graph <id>` for multi-graph servers; exclusive with a positional URI), `--store <uri>` (a single graph's storage directly), or `--profile <name>` / `$OMNIGRAPH_PROFILE` (a named scope bundle; see [Scopes & profiles](#scopes--profiles-rfc-011)); `cluster` commands use `--config <dir>`, while `policy` and `queries` read a cluster's applied state via `--cluster <dir|uri>`. A remote server is addressed only with `--server` — a positional `http(s)://` URI is rejected. **`query`/`mutate` are the exception**: their positional is a stored-query *name* (RFC-011 D3), not a graph URI, so they address the graph only via `--store`/`--server`/`--profile`/defaults.
+Top-level command families and subcommands. Graph-targeting commands accept a positional `file://`/`s3://` URI, `--server <name|url>` (an operator-defined server from `~/.omnigraph/config.yaml` by name, or a literal `http(s)://` URL, optionally with `--graph <id>` for multi-graph servers; exclusive with a positional URI), `--store <uri>` (a single graph's storage directly), or `--profile <name>` / `$OMNIGRAPH_PROFILE` (a named scope bundle; see [Scopes & profiles](#scopes--profiles)); `cluster` commands use `--config <dir>`, while `policy` and `queries` read a cluster's applied state via `--cluster <dir|uri>`. A remote server is addressed only with `--server` — a positional `http(s)://` URI is rejected. **`query`/`mutate` are the exception**: their positional is a stored-query *name*, not a graph URI, so they address the graph only via `--store`/`--server`/`--profile`/defaults.

 ## Top-level commands

@ -13,19 +13,20 @@ Top-level command families and subcommands. Graph-targeting commands accept a po
 | `ingest` | deprecated alias of `load --from <base>` (defaults: `--from main --mode merge`); prints a one-line warning to stderr |
 | `query <name>` (alias: `read`) | run a read query. **Catalog lane** (default): `<name>` is a stored query invoked **by name** from the served catalog (served-only — address with `--server`/`--profile`; the verb asserts the query is a read). **Ad-hoc lane**: with `--query <path>` or `-e`/`--query-string <GQ>`, runs that source (the positional `<name>` then selects which query in it). No positional graph URI — address via `--store`/`--server`/`--profile`. `read` is the deprecated previous name (one-line stderr warning) |
 | `mutate <name>` (alias: `change`) | run a mutation query; same catalog (by-name, served-only, verb asserts mutation) / ad-hoc (`--query`/`-e`) lanes as `query`. `change` is the deprecated previous name (one-line stderr warning) |
-| `alias <name> [args]` | invoke an operator alias — a read-only personal binding (under `aliases:` in `~/.omnigraph/config.yaml`) to a stored query on a named server (RFC-011 D4; replaces the removed `--alias` flag; stored mutations are rejected before execution) |
+| `alias <name> [args]` | invoke an operator alias — a read-only personal binding (under `aliases:` in `~/.omnigraph/config.yaml`) to a stored query on a named server (replaces the removed `--alias` flag; stored mutations are rejected before execution) |
 | `snapshot` | print current snapshot (per-table version + row count) |
 | `export` | dump to JSONL on stdout (`--type T`, `--table K` filters) |
 | `branch create \| list \| delete \| merge` | branching ops |
 | `commit list \| show` | inspect commit graph |
 | `schema plan \| apply \| show (alias: get)` | migrations. `apply` refuses a cluster-managed graph (one whose storage is inside a cluster) and points at `cluster apply` — those graphs evolve through the cluster ledger, not a direct apply |
 | `lint` (alias: `check`) | offline / graph-backed query validation. Replaces `query lint` / `query check`, which are kept as deprecated argv-level shims that print a one-line warning and rewrite to `omnigraph lint` |
-| `cluster validate \| plan \| apply \| approve \| status \| refresh \| import \| force-unlock` | declarative cluster control plane. `validate` checks a local `cluster.yaml` folder and referenced schema/query/policy files; `plan` diffs it against local JSON state at `__cluster/state.json`, annotates dispositions, and embeds real schema-migration previews; `apply` converges the cluster — stored-query/policy catalog writes (content-addressed under `__cluster/resources/`), graph creates, schema updates (soft drops only; `--as` records the actor), and graph deletes behind a digest-bound approval from `cluster approve <resource> --as <actor>` (`apply`/`approve` default the actor from `~/.omnigraph/config.yaml`'s `operator.actor` when `--as` is omitted); what apply converges is what an `omnigraph-server --cluster <dir>` deployment serves on its next restart (`--cluster` is the server's only boot source — RFC-011 cluster-only); `status` reads the state ledger; `refresh`/`import` explicitly update local JSON state from read-only graph observations; `force-unlock <LOCK_ID>` manually removes a held local state lock by exact id |
+| `cluster validate \| plan \| apply \| approve \| status \| refresh \| import \| force-unlock` | declarative cluster control plane. `validate` checks a local `cluster.yaml` folder and referenced schema/query/policy files; `plan` diffs it against local JSON state at `__cluster/state.json`, annotates dispositions, and embeds real schema-migration previews; `apply` converges the cluster — stored-query/policy catalog writes (content-addressed under `__cluster/resources/`), graph creates, schema updates (soft drops only; `--as` records the actor), and graph deletes behind a digest-bound approval from `cluster approve <resource> --as <actor>` (`apply`/`approve` default the actor from `~/.omnigraph/config.yaml`'s `operator.actor` when `--as` is omitted); what apply converges is what an `omnigraph-server --cluster <dir>` deployment serves on its next restart (`--cluster` is the server's only boot source — cluster-only); `status` reads the state ledger; `refresh`/`import` explicitly update local JSON state from read-only graph observations; `force-unlock <LOCK_ID>` manually removes a held local state lock by exact id |
 | `optimize` | non-destructive Lance compaction (skips tables with `Blob` columns or uncovered drift; `--json` reports `skipped`) |
 | `repair [--confirm] [--force]` | preview or explicitly publish uncovered manifest/head drift. `--confirm` heals verified maintenance drift and exits non-zero if suspicious/unverifiable drift is refused; `--force --confirm` publishes suspicious/unverifiable drift after operator review |
 | `cleanup --keep N --older-than 7d --confirm` | destructive version GC (`--confirm` to execute; also needs `--yes` against a non-local `s3://` target — see *Write diagnostics & destructive confirmation*) |
 | `embed` | offline JSONL embedding pipeline |
 | `policy validate \| test \| explain` | Cedar tooling against a cluster's applied policies (`--cluster <dir>`; `--graph <id>` picks a graph's bundle when several apply). `test` takes `--tests <file>`; `explain` takes `--actor`/`--action`/`--branch`/`--target-branch` |
+| `queries list \| validate` | inspect a cluster's applied stored-query registry (`--cluster <dir\|uri>`; `--graph <id>` to scope one graph). `list` prints each query's kind (read/mutation), name, typed params, and `[mcp: …]` exposure; a query's `@description`/`@instruction` are shown as indented `description:` / `instruction:` lines when declared (omitted otherwise). `--json` emits `{name, mcp_expose, tool_name, mutation, params}` plus `description`/`instruction` **only when present** — matching the HTTP `GET /queries` catalog ([server.md](../operations/server.md)). `validate` type-checks the registry and exits non-zero on a broken query |
 | `profile list \| show [<name>]` | read-only inspection of `~/.omnigraph/config.yaml` profiles. `list` shows each profile's binding (server/cluster/store) + default graph and marks the `$OMNIGRAPH_PROFILE`-active one; JSON keeps `binding` and adds `scope_kind`, `target`, `valid`, and `error`; `show` resolves one profile's scope (endpoint + default graph), defaulting to the active profile, else the flat operator defaults |
 | `version` / `-v` | print `omnigraph 0.3.x` |

@ -52,7 +53,7 @@ To maintain a server-backed graph, run the `direct` verbs from a host with stora

 ## Write diagnostics & destructive confirmation

-Two global flags make writes self-documenting and guard the dangerous ones (RFC-011 Decision 9):
+Two global flags make writes self-documenting and guard the dangerous ones:

 - **Every write echoes its resolved target to stderr** — `omnigraph load → s3://acme/brain/graphs/knowledge.omni (direct, remote)` — so you catch a scope that resolved somewhere unexpected (e.g. *prod*) before it lands. Applies to `load`, `ingest`, `mutate`, `branch create|delete|merge`, `schema apply`, `optimize`, `repair`, `cleanup`. The line is stderr, so `--json` consumers reading stdout are unaffected; suppress it with **`--quiet`**.
 - **Destructive writes against a non-local scope require confirmation.** `cleanup`, overwrite `load` (`--mode overwrite`), and `branch delete` proceed freely against a local (`file://`) graph, but when the resolved target is **not local** (a served `http(s)://` graph or an `s3://` store/cluster) they require explicit consent: pass **`--yes`** to confirm, an interactive terminal is prompted, and a non-interactive run (no TTY, or `--json`) **refuses with an error** rather than silently destroying. `cleanup` still also requires its existing `--confirm` (preview→execute); `--yes` is the additional non-local consent.
@ -79,15 +80,15 @@ servers:                # operator-owned endpoints; names key the credentials
    url: https://graph.example.com     # no tokens in this file, ever
 defaults:
  output: table         # read format default, below --json/--format/alias
-  server: prod          # the everyday SERVED scope when no address is given (RFC-011)
+  server: prod          # the everyday SERVED scope when no address is given
  # store: file:///data/dev.omni   # OR a zero-flag LOCAL default (mutually
  #                                #   exclusive with `server`); the local-dev
  #                                #   counterpart of `server`
  default_graph: knowledge   # graph selected in a server/cluster scope
-clusters:               # admin-only: managed-cluster storage roots (RFC-011).
+clusters:               # admin-only: managed-cluster storage roots.
  brain:                #   the ONLY place a storage root lives in this file.
    root: s3://acme/clusters/brain
-profiles:               # named scope bundles (RFC-011); pick with --profile
+profiles:               # named scope bundles; pick with --profile
  staging: { server: staging, default_graph: knowledge }   # a served scope
  brain-admin: { cluster: brain, default_graph: knowledge } # a direct cluster scope
 ```
@ -96,7 +97,7 @@ Absent file = empty layer. Unknown keys warn and load (a file written for a
 newer CLI works on an older one). Override the config directory with
 `$OMNIGRAPH_HOME`.

-#### Scopes & profiles (RFC-011)
+#### Scopes & profiles

 A command resolves a **scope** — a server, a cluster, or a store — then selects a
 graph in it; the served-vs-direct access path is derived from the scope, not
@ -116,14 +117,15 @@ sticky "current" mode. Inspect what is defined with `omnigraph profile list` and
  `--cluster <root> --graph <id>`. A `--graph` flag overrides the profile's default.
 - A `server`-bound scope on a maintenance verb, or a `cluster`-bound scope on a
  data verb, is rejected with a message pointing at the right addressing.
-- **No graph selected (RFC-011 D7).** When a scope has no `--graph` and no
+- **No graph selected.** When a scope has no `--graph` and no
  `default_graph`, the CLI never silently picks:
  - **Cluster scope** — exactly **one** applied graph is used automatically;
    **several** errors and lists the candidates (from the served catalog).
-  - **Server scope** — a multi-graph server (any non-empty `GET /graphs`, even a
-    single entry) errors and lists the candidates: you must pass `--graph <id>`.
-    A single-graph / flat server (405 on `/graphs`), or one whose `/graphs` is
-    policy-gated or unreachable, uses its bare URL as before.
+  - **Server scope** — an `omnigraph-server` is always cluster-backed, so its
+    `GET /graphs` lists the graphs and you must pass `--graph <id>` (the CLI
+    lists the candidates if you omit it). It falls back to the bare URL only
+    when `/graphs` is unavailable: policy-gated, unreachable, or a
+    non-`omnigraph` endpoint.

 `--target`, `--cluster-graph`, and the positional-`http(s)://`→remote dispatch
 have been **removed** (`--graph` is now the one graph selector across server and
@ -158,7 +160,7 @@ aliases:

 `omnigraph alias triage 2026-06-01` invokes
 `POST <server>/graphs/spike/queries/weekly_triage` with the keyed
-credential. Aliases live in their own `alias` namespace (RFC-011 Decision 4),
+credential. Aliases live in their own `alias` namespace,
 so an alias can never shadow — or be shadowed by — a built-in verb. (The old
 `--alias <name>` flag on `query`/`mutate` was removed.)

--- a/docs/user/clusters/config.md
+++ b/docs/user/clusters/config.md
@ -231,9 +231,11 @@ Policy entries additionally record their applied `applies_to` bindings as
 normalized typed refs — the state ledger is serving-sufficient for the
 future server-boot stage. A change to `applies_to` alone (the policy file
 digest unchanged) appears in the plan as an Update marked `binding_change`
-(human output: `[bindings]`), applies like any catalog change, and counts
-toward convergence; ledgers written before this field existed are backfilled
-by the next apply.
+(human output: `[bindings]`), and as `metadata_change: policy_bindings` in
+structured output. Embedding provider entries similarly carry their resolved
+profile in the ledger; pre-profile ledgers are backfilled by an Update with
+`metadata_change: embedding_profile`. These metadata-only updates apply like
+catalog changes and count toward convergence.

 Each plan change carries a `disposition` field — an honest preview of what
 `cluster apply` will do with it in this stage: `applied` (executes), `derived`
@ -322,7 +324,9 @@ cluster apply until the approval-artifact stage. Unsupported migrations
 (e.g. changing a property's type), engine lock contention, or graphs with
 user branches fail loudly as `schema_apply_failed` with the engine's message;
 dependent changes are demoted to `blocked` and graph-moving work stops for
-the run.
+the run. These pre-movement failures are checked before the cluster schema
+recovery sidecar is created, so they do not leave stale recovery files behind
+or brick later server boot.

 `cluster plan` previews schema updates with the engine's real migration plan:
 each schema change carries a `migration` field (`supported` + typed steps),
@ -390,7 +394,7 @@ omnigraph-server --cluster company-brain --bind 0.0.0.0:8080
 ```

 `--cluster <dir>` is an **exclusive boot source** (axiom 15): it cannot
-combine with a graph URI, `--target`, or `--config`, and in this mode
+combine with a graph URI or `--config`, and in this mode
 `omnigraph.yaml` is never read — not for graphs, not for queries, not for
 policies. The server serves the **applied revision**: graph roots recorded in
 `state.json`, stored-query and policy content from the content-addressed
@ -402,20 +406,29 @@ drift is visible. Routing is always multi-graph (`/graphs/{id}/...`). Bearer
 tokens and the bind address stay process-level (flags/env) — they are
 per-replica facts, not cluster facts.

-Boot is fail-fast: missing or unreadable state, pending recovery sidecars,
-missing/tampered catalog blobs, policy entries without binding metadata
-(pre-binding ledgers — re-run `cluster apply`), an empty graph set, more than
-one policy bundle binding a single scope (split or merge bundles; stacked
-scopes are a later stage), unopenable graph roots, and stored queries that no
-longer type-check all refuse startup with a remedy. A held state lock is
-*not* an error — boot reads the atomically-replaced state file without
+Boot is fail-fast for cluster-global readiness failures: missing or
+unreadable state, invalid/unattributable recovery sidecars,
+missing/tampered shared catalog blobs, policy entries without binding
+metadata (pre-binding ledgers — re-run `cluster apply`), an empty graph set,
+more than one policy bundle binding a single scope (split or merge bundles;
+stacked scopes are a later stage), cluster policy problems, or zero healthy
+graphs. Valid graph-attributed recovery sidecars, unopenable graph roots, and
+stored queries that no longer type-check quarantine that graph instead; the
+server logs startup diagnostics, skips the graph's queries and graph-only
+policy bindings, and serves any remaining healthy graphs. A held state lock
+is *not* an error — boot reads the atomically-replaced state file without
 locking.

+Use `omnigraph-server --require-all-graphs` (or
+`OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`) when degraded serving is not acceptable; it
+promotes every graph-local quarantine or startup failure back to a boot error.
+
 Serving is static per process: the server reads the applied revision once at
-startup, so picking up newly applied state means restarting it. Stored
-queries are all listed in `GET /queries` in cluster mode (the cluster
-registry has no expose flag; exposure becomes a policy decision in a later
-phase).
+startup, so picking up newly applied state means restarting it. `GET /graphs`
+lists only ready/served graphs; quarantined graphs are omitted and their
+routes return 404. Stored queries are all listed in `GET /queries` in cluster
+mode (the cluster registry has no expose flag; exposure becomes a policy
+decision in a later phase).

 ## Status

--- a/docs/user/clusters/index.md
+++ b/docs/user/clusters/index.md
@ -91,7 +91,7 @@ only the URI and credentials, no checkout of the config repo. The ledger and
 catalog on the bucket are the deployment artifact.

 `--cluster` is an **exclusive boot source**: it cannot be combined with a
-graph URI, `--target`, or `--config`, and `omnigraph.yaml` is never read in
+graph URI or `--config`, and `omnigraph.yaml` is never read in
 this mode. Routing is always multi-graph:

 ```bash
@ -221,7 +221,8 @@ applied revision is not safely servable. Each refusal names its remedy:
 | Boot error | Meaning | Remedy |
 |---|---|---|
 | `cluster_state_missing` | no ledger | `cluster import`, then `apply` |
-| `cluster_recovery_pending` | interrupted operation awaiting sweep | run `cluster apply` (or any state-mutating command), restart |
+| `cluster_recovery_pending` | graph was quarantined because an interrupted operation awaits sweep | run `cluster apply` (or any state-mutating command), restart |
+| `cluster_no_healthy_graphs` | every applied graph is quarantined or failed startup | sweep/fix the graph-specific failures, then restart |
 | `catalog_payload_missing` / `…_digest_mismatch` | catalog blob lost or tampered | `cluster refresh`, then `apply`, restart |
 | `policy_bindings_missing` | ledger predates binding metadata | re-run `cluster apply` (backfills), restart |
 | `cluster_empty` | applied revision has no graphs | apply a cluster with ≥1 graph |
@ -231,6 +232,13 @@ A held *state lock* is deliberately **not** a boot error — the server reads
 the atomically-replaced ledger without locking, so serving never contends
 with an in-flight apply.

+When at least one graph is healthy, graph-attributed recovery sidecars and
+graph-local startup failures do not block the whole server. The affected
+graph is skipped, its graph-only policy bindings and queries are omitted,
+and `/graphs` lists only the ready graphs. Pass
+`omnigraph-server --require-all-graphs` or set
+`OMNIGRAPH_REQUIRE_ALL_GRAPHS=1` to make any such quarantine fail startup.
+
 ## 6. Deployment patterns

 - **Replicas**: any number of `--cluster` servers can serve the same config
@ -273,7 +281,7 @@ a cluster are created by `cluster apply`, not by hand.

 If the cluster has exactly **one** applied graph you can omit `--graph` — it is
 used automatically. With **several**, omitting `--graph` errors and lists the
-candidates (RFC-011 D7); it never picks one for you.
+candidates; it never picks one for you.

 Against an **`s3://`-backed cluster** the resolved graph storage is non-local, so a
 destructive `cleanup` additionally requires **`--yes`** (an interactive prompt
--- a/docs/user/deployment.md
+++ b/docs/user/deployment.md
@ -208,6 +208,7 @@ When no positional args are given, the image entrypoint
 |---|---|
 | `OMNIGRAPH_CLUSTER` | Cluster boot source — a config directory or a storage-root URI, forwarded as `--cluster`. The only boot source. |
 | `OMNIGRAPH_BIND` | Listen address (default `0.0.0.0:8080`). |
+| `OMNIGRAPH_REQUIRE_ALL_GRAPHS` | When truthy, forwarded as `--require-all-graphs`: any graph-local quarantine or startup failure aborts cluster boot instead of serving the healthy subset. |

 Per-graph and server-level Cedar policy come from the cluster's applied
 revision (authored in `cluster.yaml` and published with `cluster apply`),
--- a/docs/user/operations/errors.md
+++ b/docs/user/operations/errors.md
@ -12,7 +12,7 @@
  - **D₂ parse-time rejection**: a single mutation query that mixes inserts/updates with deletes errors out *before any I/O* with kind `BadRequest`. Message: `mutation '<name>' on the same query mixes inserts/updates and deletes; split into separate mutations: (1) inserts and updates, then (2) deletes`. See [query-language.md](../queries/index.md) for the rule.
 - `MergeConflicts(Vec<MergeConflict>)`

-Compiler-side `NanoError` covers parse / catalog / type / storage / plan / execution / arrow / lance / IO / manifest / unique-constraint, each with structured spans (`SourceSpan { start, end }`) for ariadne-style diagnostics.
+Compiler-side `CompilerError` covers parse / catalog / type / storage / plan / execution / arrow / lance / IO / manifest / unique-constraint, each with structured spans (`SourceSpan { start, end }`) for ariadne-style diagnostics. The legacy `NanoError` name remains as a deprecated compatibility alias.

 ## Result serialization (`omnigraph_compiler::result::QueryResult`)

--- a/docs/user/operations/maintenance.md
+++ b/docs/user/operations/maintenance.md
@ -35,7 +35,7 @@
  backstop, so it does as much as it can and converges on re-run. The CLI reports
  any failed tables; rerun `cleanup` to retry them.
 - CLI guards with `--confirm`; without it, prints a preview line.
-- **Non-local consent (RFC-011 D9).** Against a non-local target (an `s3://` store/cluster), `cleanup` additionally requires `--yes` on top of `--confirm`: a TTY is prompted, and a non-interactive run (no TTY, or `--json`) refuses rather than destroying. A local (`file://`) target needs only `--confirm`. The same `--yes` gate applies to overwrite `load` and `branch delete`; every maintenance run echoes its resolved target to stderr (suppress with `--quiet`).
+- **Non-local consent.** Against a non-local target (an `s3://` store/cluster), `cleanup` additionally requires `--yes` on top of `--confirm`: a TTY is prompted, and a non-interactive run (no TTY, or `--json`) refuses rather than destroying. A local (`file://`) target needs only `--confirm`. The same `--yes` gate applies to overwrite `load` and `branch delete`; every maintenance run echoes its resolved target to stderr (suppress with `--quiet`).
 - **Recovery floor:** `--keep < 3` may garbage-collect versions that crash recovery needs as a rollback target. Default `--keep 10` is safe.
 - **Orphaned-branch reconciliation:** before the version GC, cleanup reclaims any per-table or commit-graph branch absent from the manifest branch list. These orphans arise when a `branch_delete` flips the manifest authority but a downstream best-effort reclaim does not complete (see [branches-commits.md](../branching/index.md)). The reconciler is idempotent (it no-ops once nothing is orphaned), runs regardless of the `keep_versions` / `older_than` values (those gate version GC only), and never reclaims `main` or system-branch forks. Reclaimed forks are logged.

--- a/docs/user/operations/policy.md
+++ b/docs/user/operations/policy.md
@ -78,7 +78,7 @@ The default actor identity for CLI direct-engine (`--store`) writes is
 `operator.actor` in `~/.omnigraph/config.yaml`. Override per-invocation with
 `--as <ACTOR>` — `--as` wins, otherwise `operator.actor`, otherwise no actor.
 Remote HTTP writes ignore both — they resolve their actor server-side from the
-bearer token. (Direct-store access carries no Cedar policy under RFC-011; policy
+bearer token. (Direct-store access carries no Cedar policy; policy
 lives in the cluster/server.)

 ## CLI
--- a/docs/user/operations/server.md
+++ b/docs/user/operations/server.md
@ -1,6 +1,6 @@
 # HTTP Server (`omnigraph-server`)

-Axum 0.8 + tokio + utoipa-generated OpenAPI. **Cluster-only boot** (RFC-011): the server always boots from a cluster (`--cluster <dir | s3://…>`) and serves N graphs (N ≥ 1) under cluster routes. There is no longer a single-graph flat-route mode, no positional `<URI>` boot, no `--target`, and no `omnigraph.yaml`-`graphs:`-map boot. All HTTP is nested under `/graphs/{graph_id}/...`; `/healthz` and the management `/graphs` enumeration stay flat.
+Axum 0.8 + tokio + utoipa-generated OpenAPI. **Cluster-only boot**: the server always boots from a cluster (`--cluster <dir | s3://…>`) and serves N graphs (N ≥ 1) under cluster routes. There is no longer a single-graph flat-route mode, no positional `<URI>` boot, no `--target`, and no `omnigraph.yaml`-`graphs:`-map boot. All HTTP is nested under `/graphs/{graph_id}/...`; `/healthz` and the management `/graphs` enumeration stay flat.

 ## Boot

@ -15,11 +15,24 @@ omnigraph-server --cluster <dir | s3://…> --bind 0.0.0.0:8080
 startup configs (id, URI, optional per-graph policy, stored-query
 registry) plus an optional server-level policy, then opens every
 configured graph in parallel at startup (bounded concurrency = 4,
-fail-fast on the first open error). Routing is always multi-graph —
+quarantining graph-specific open failures). Routing is always multi-graph —
 requests to bare flat protected paths (`/read`, `/snapshot`, …) return
 404; the served surface is `/graphs/{graph_id}/...`. See
 [cluster-config.md](../clusters/config.md#serving-from-the-cluster-the-mode-switch)
-for what is read and the fail-fast readiness rules.
+for what is read and the readiness rules.
+
+Readiness is fail-fast for cluster-global problems: missing or unreadable
+state, invalid/unattributable recovery sidecars, unreadable shared catalog
+payloads, cluster policy errors, or zero healthy graphs. Graph-attributed
+pending recovery sidecars and graph-specific startup failures quarantine
+that graph instead; the server logs startup diagnostics and serves the
+remaining healthy graphs. `GET /graphs` enumerates ready/served graphs only,
+so quarantined graphs are absent and their routes return 404.
+
+Operators who want the original all-or-nothing boot contract can pass
+`--require-all-graphs` or set `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`. In that mode,
+any graph quarantine, graph-open failure, stored-query startup failure, or
+embedding-provider resolution failure aborts startup.

 A scheme-qualified argument (`s3://…`) reads the ledger straight from the
 storage root, with no local config directory. `--bind`,
@ -27,7 +40,7 @@ storage root, with no local config directory. `--bind`,

 ### Stored-query validation at startup

-If a graph declares a `queries:` registry (see [cli-reference](../cli/reference.md)), the server **loads and type-checks every stored query against that graph's live schema at startup** and **refuses to boot** if any query references a type or property the schema lacks — the same fail-loud posture as a malformed policy file, so schema drift surfaces at the deploy boundary rather than at invocation. Two MCP-exposed queries claiming the same tool name is likewise a boot error. Non-blocking advisories (e.g. an MCP-exposed query with a vector parameter an agent cannot supply) are logged. Validate offline before deploying with `omnigraph queries validate`. Discover the exposed queries as a typed tool catalog with `GET /queries`, and invoke one over HTTP with `POST /queries/{name}` (both below).
+If a graph declares a `queries:` registry (see [cli-reference](../cli/reference.md)), the server **loads and type-checks every stored query against that graph's live schema at startup**. Query parse/type failures quarantine that graph; if no graph remains healthy, startup refuses. Two MCP-exposed queries claiming the same tool name are likewise graph-local startup failures. Non-blocking advisories (e.g. an MCP-exposed query with a vector parameter an agent cannot supply) are logged. Validate offline before deploying with `omnigraph queries validate`. Discover the exposed queries as a typed tool catalog with `GET /queries`, and invoke one over HTTP with `POST /queries/{name}` (both below).

 ## Endpoint inventory

@ -62,7 +75,7 @@ Server-level management endpoints:

 | Method | Path | Auth | Action |
 |---|---|---|---|
-| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list registered graphs |
+| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list ready/served graphs |

 ### Stored-query catalog (`GET /queries`)

--- a/docs/user/search/embeddings.md
+++ b/docs/user/search/embeddings.md
@ -42,7 +42,7 @@ boots from the applied cluster ledger, so `cluster validate`, `plan`, and
 needs no key. Vector dimensions stay schema-driven by the target `Vector(N)`
 column.

-Direct single-graph serving, embedded callers, and the offline
+Direct (`--store`) access, embedded callers, and the offline
 `omnigraph embed` pipeline use environment configuration unless they inject an
 `EmbeddingConfig` directly.