Merge branch 'main' into andrew/datafusion-future-improvements-doc

This commit is contained in:
Andrew Altshuler 2026-06-06 19:36:35 +03:00 committed by GitHub
commit 2d6591dfb3
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
134 changed files with 16450 additions and 2643 deletions

View file

@ -10,7 +10,7 @@ Three views, increasing zoom:
2. **Layer view** — the eight-layer stack inside one OmniGraph process.
3. **Component zoom-ins** — what's inside each layer.
For runtime flows (read query, mutation), see [`docs/dev/execution.md`](execution.md). For the on-disk layout of a repo, see [`docs/user/storage.md`](../user/storage.md).
For runtime flows (read query, mutation), see [`docs/dev/execution.md`](execution.md). For the on-disk layout of a graph, see [`docs/user/storage.md`](../user/storage.md).
L1 (orange in the diagrams) is what we inherit from Lance; L2 (blue) is what OmniGraph adds. The L1/L2 framing is also called out in prose at the bottom of this doc.
@ -63,7 +63,7 @@ flowchart TB
subgraph engine[omnigraph engine]
plan[exec query and mutation]:::l2
gi[graph index CSR/CSC<br/>RuntimeCache LRU 8]:::l2
coord[coordinator<br/>ManifestRepo · CommitGraph]:::l2
coord[coordinator<br/>ManifestCoordinator · CommitGraph]:::l2
end
subgraph storage[storage trait — wraps Lance]
@ -132,7 +132,7 @@ flowchart TB
subgraph state[graph state]
coord[GraphCoordinator]:::l2
mr[ManifestRepo<br/>db/manifest.rs]:::l2
mr[ManifestCoordinator<br/>db/manifest.rs]:::l2
cg[CommitGraph<br/>_graph_commits.lance]:::l2
stg[MutationStaging<br/>per-query in-memory accumulator<br/>exec/staging.rs]:::l2
end
@ -166,7 +166,7 @@ Code paths:
- Read entry: `Omnigraph::query` at `crates/omnigraph/src/exec/query.rs:7`
- Mutation entry: `Omnigraph::mutate` at `crates/omnigraph/src/exec/mutation.rs:511`
- Manifest commit: `ManifestRepo::commit` at `crates/omnigraph/src/db/manifest.rs:280`
- Manifest commit: `ManifestCoordinator::commit` at `crates/omnigraph/src/db/manifest.rs:280`
- Graph index: `crates/omnigraph/src/graph_index/`
- Loader: `Omnigraph::ingest` at `crates/omnigraph/src/loader/mod.rs:74`
@ -207,7 +207,7 @@ contracts:
This pattern realizes read-your-writes within a multi-statement mutation
and keeps failure scope bounded for inserts/updates by construction at
the writer layer. See [docs/dev/invariants.md](invariants.md) and
[docs/dev/runs.md](runs.md) for the publisher CAS contract this builds on.
[docs/dev/writes.md](writes.md) for the publisher CAS contract this builds on.
### Storage trait — today vs. roadmap
@ -278,7 +278,7 @@ flowchart LR
eng --> wq
```
The server applies Cedar policy at the HTTP boundary today. The roadmap, called out in [docs/dev/invariants.md](invariants.md) as a known gap, is to push policy into the planner as predicates. After Cedar, mutating handlers go through `WorkloadController` (per-actor admission cap + byte budget; PR 2 / MR-686) before reaching the engine. The engine itself holds an `Arc<WriteQueueManager>` so concurrent mutations on the same `(table, branch)` serialize at the queue, while disjoint keys run in parallel — see [docs/user/server.md](../user/server.md) "Per-actor admission control" and [docs/dev/runs.md](runs.md). The CLI bypasses the HTTP layer (and admission) and calls the engine API directly.
The server applies Cedar policy at the HTTP boundary today. The roadmap, called out in [docs/dev/invariants.md](invariants.md) as a known gap, is to push policy into the planner as predicates. After Cedar, mutating handlers go through `WorkloadController` (per-actor admission cap + byte budget; PR 2 / MR-686) before reaching the engine. The engine itself holds an `Arc<WriteQueueManager>` so concurrent mutations on the same `(table, branch)` serialize at the queue, while disjoint keys run in parallel — see [docs/user/server.md](../user/server.md) "Per-actor admission control" and [docs/dev/writes.md](writes.md). The CLI bypasses the HTTP layer (and admission) and calls the engine API directly.
Code paths:

View file

@ -8,7 +8,7 @@ This page explains what the policy says and how to change it.
| Setting | Value | Why |
|---|---|---|
| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test Workspace`, `Test omnigraph-server --features aws`, `CODEOWNERS / drift`, `CODEOWNERS / noedit` | Every PR must pass workspace tests, AGENTS.md link integrity, and the CODEOWNERS hygiene checks. `strict: true` requires the branch to be up-to-date with `main` before merge. |
| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test Workspace`, `Test omnigraph-server --features aws`, `CODEOWNERS matches source`, `CODEOWNERS not hand-edited` | Every PR must pass workspace tests, AGENTS.md link integrity, and the CODEOWNERS hygiene checks. The two CODEOWNERS contexts must equal the job `name:` values in `.github/workflows/codeowners.yml` **verbatim** — a context naming a job that never reports (the old `CODEOWNERS / drift` used the job *id*, and the job was path-filtered) leaves every PR permanently pending and forces admin overrides. `strict: true` requires the branch to be up-to-date with `main` before merge. |
| **Required approving reviews** | `1` | At least one reviewer. With a 2-person team, going higher would block all merges when one person is unavailable. |
| **Require code-owner reviews** | `true` | The reviewer must be a code owner per `.github/CODEOWNERS`. This is what makes the codeowners chassis enforced. |
| **Dismiss stale reviews on new commits** | `true` | A push after approval invalidates the prior review. Prevents the "approve, then sneak in unreviewed changes" pattern. |
@ -16,12 +16,12 @@ This page explains what the policy says and how to change it.
| **Disallow force pushes** | `true` | No history rewrites on `main`. |
| **Disallow branch deletions** | `true` | `main` cannot be deleted. |
| **Required conversation resolution** | `true` | All review comment threads must be resolved before merge. |
| **Enforce on admins** | `true` | Even repo admins go through the gates. The point is no bypasses. |
| **Enforce on admins** | `false` | Admins can override the gates (`enforce_admins: false` in the JSON). This is the intended escape hatch for the 2-person team; tightening to `true` is tracked under hardening below. |
| **Required signed commits** | not yet | Not enabled. Would lock out maintainers until everyone enrolls GPG/SSH commit signing. Tracked as a follow-up. |
## How to apply
Run from the repo root:
Run from the repository root:
```bash
./scripts/apply-branch-protection.sh
@ -29,7 +29,7 @@ Run from the repo root:
The script reads `.github/branch-protection.json`, strips the human-readable `_comment` field (the GitHub API rejects unknown keys), and PUTs to `repos/ModernRelay/omnigraph/branches/main/protection`.
Requires `gh` authenticated with a token that has admin permissions on the repo.
Requires `gh` authenticated with a token that has admin permissions on the repository.
To preview without applying:
@ -57,7 +57,7 @@ Outputs the live policy. Compare against `.github/branch-protection.json` to det
- **Audit trail**: `git log .github/branch-protection.json` shows every change with a reviewable diff and a merge commit.
- **Disaster recovery**: if branch protection is accidentally removed or weakened via the UI, the JSON is the canonical recovery point.
- **Consistency**: pairs with `.github/codeowners-roles.yml` (the CODEOWNERS source of truth). Repo policy lives in the repo.
- **Consistency**: pairs with `.github/codeowners-roles.yml` (the CODEOWNERS source of truth). Repository policy lives in the repository.
## What this gates
@ -69,7 +69,7 @@ After branch protection is applied, every PR targeting `main` must:
4. Have all review conversations resolved.
5. Be squash- or rebase-merged (no merge commits).
Even repo admins are subject to these rules.
Even repository admins are subject to these rules.
## Subsequent hardening (not in this PR)

View file

@ -2,9 +2,10 @@
`.github/workflows/`:
- **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repo PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`).
- **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repository PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`).
- **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest.
- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_repo_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`.
- **release-edge.yml**: on every push to main, retags `edge`, builds Linux/macOS-Intel/macOS-arm64 archives + sha256, publishes a rolling prerelease.
- **release.yml**: on `v*` tags, builds the 3-platform matrix and updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by pushing the regenerated formula to `ModernRelay/homebrew-tap`.
- **Windows binary build job**: `cargo build --release --locked -p omnigraph-cli -p omnigraph-server` on windows-latest with smoke checks for `omnigraph.exe version`, `omnigraph-server.exe --help`, and PowerShell installer syntax.
- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_graph_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`.
- **release-edge.yml**: on every push to main, retags `edge`, builds Linux x86_64 / macOS arm64 archives and Windows x86_64 zip + sha256, publishes a rolling prerelease, then smoke-tests the Windows PowerShell installer against `edge`.
- **release.yml**: on `v*` tags, builds the Linux x86_64 / macOS arm64 archives and Windows x86_64 zip release matrix, updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by pushing the regenerated formula to `ModernRelay/homebrew-tap`, and smoke-tests the Windows PowerShell installer against the tag.
- **package.yml**: manual ECR image build; emits two image tags per commit (`<sha>`, `<sha>-aws`) via CodeBuild.

View file

@ -2,26 +2,47 @@
`.github/CODEOWNERS` is **generated** — not hand-edited. The source of truth is `.github/codeowners-roles.yml`, expanded by `.github/scripts/render-codeowners.py`. CI rejects drift between the two and rejects direct edits to `CODEOWNERS` that don't accompany a yml change.
This setup gives every role change a reviewable PR and a permanent in-repo audit trail (`git log .github/codeowners-roles.yml`).
This setup gives every role change a reviewable PR and a permanent in-repository audit trail (`git log .github/codeowners-roles.yml`).
## Current roles
## Who owns what
| Role | Members | Scope |
The tables below are **generated** from `.github/codeowners-roles.yml` by `.github/scripts/render-codeowners.py` (the same render that produces `.github/CODEOWNERS`). They are the always-current "who owns what at this commit" view — don't edit them by hand; edit the yml and re-render.
<!-- BEGIN GENERATED OWNERSHIP — edit codeowners-roles.yml + run render-codeowners.py -->
**Path → owners** (GitHub applies *last match wins*; the `*` catch-all is listed first and is overridden by the specific patterns below it):
| Path | Owners | Role(s) |
|---|---|---|
| `engineering` | `@aaltshuler` | All code under `crates/**`, repo infrastructure, default for unmapped paths |
| `docs` | `@aaltshuler`, `@ragnorc` | `docs/**`, README.md, AGENTS.md, CLAUDE.md, SECURITY.md |
| `*` | @ragnorc | engineering |
| `crates/**` | @ragnorc | engineering |
| `docs/**` | @ragnorc | docs |
| `README.md` | @ragnorc | docs |
| `AGENTS.md` | @ragnorc | docs |
| `CLAUDE.md` | @ragnorc | docs |
| `SECURITY.md` | @ragnorc | docs |
GitHub treats multiple owners in a CODEOWNERS line as **"any one of them satisfies the review requirement"**. For docs, either named member can approve. To require N distinct approvers on a specific path, layer a CI check on top (not currently configured).
**Roles**:
| Role | Members | Description |
|---|---|---|
| `engineering` | @ragnorc | All production code under crates/**. Engine, CLI, server, compiler. |
| `docs` | @ragnorc | Documentation under docs/**, plus repo-level docs (README.md, AGENTS.md, CLAUDE.md symlink, SECURITY.md). |
<!-- END GENERATED OWNERSHIP -->
GitHub treats multiple owners on a CODEOWNERS line as **"any one of them satisfies the review requirement"**. To require N distinct approvers on a specific path, layer a CI check on top (not currently configured).
## How to change role membership or path mappings
1. Edit `.github/codeowners-roles.yml`.
2. Run `python3 .github/scripts/render-codeowners.py` (requires PyYAML; `pip install pyyaml`).
3. Commit both files in the same PR.
2. Open a PR. **CI re-renders for you**: the `CODEOWNERS` workflow regenerates `.github/CODEOWNERS` and the ownership tables above and auto-commits them back to your PR branch on same-repository PRs — you don't have to run the script locally (though you can: `python3 .github/scripts/render-codeowners.py`, requires PyYAML).
On a fork (where CI can't push back), the workflow instead fails with the diff so you can run the script and commit it yourself.
CI fails the PR if:
- `CODEOWNERS` was edited without a corresponding yml change, or
- The yml was changed but the rendered `CODEOWNERS` doesn't match.
- a fork PR left a generated artifact out of sync, or
- `CODEOWNERS` was edited without a corresponding yml change (the `CODEOWNERS not hand-edited` check).
## How to add a new role
@ -34,4 +55,4 @@ CI fails the PR if:
- **Audit trail**: `git log .github/codeowners-roles.yml` is the canonical record of every role change. The rendered `CODEOWNERS` is a derived artifact.
- **Roles are first-class**: paths reference roles, not raw handles. Renaming a person or rotating a role updates one place, not every path.
- **Future extension**: scheduled rotation (weekly on-call, quarterly leads) plugs into the same yml without changing the path mappings. Not enabled today.
- **Consistency with the product**: omnigraph itself enforces auditable Cedar policy. The repo's code-owner policy follows the same "policy as reviewed code" pattern.
- **Consistency with the product**: omnigraph itself enforces auditable Cedar policy. The repository's code-owner policy follows the same "policy as reviewed code" pattern.

View file

@ -147,7 +147,7 @@ sequenceDiagram
- End-of-query Lance commit: `TableStore::stage_append`, `stage_merge_insert`, `commit_staged` at `crates/omnigraph/src/table_store.rs`
- Manifest commit primitive: `commit_updates_on_branch_with_expected` at `crates/omnigraph/src/db/omnigraph/table_ops.rs`
Atomicity guarantee for multi-statement mutations: a mid-query failure leaves Lance HEAD untouched on staged tables (no inline commit happened during op execution), so the next mutation proceeds normally with no `ExpectedVersionMismatch`. The publisher CAS at the very end either succeeds (manifest advances atomically across all touched sub-tables) or fails with a typed `ManifestConflictDetails::ExpectedVersionMismatch` (no partial publish). See [docs/dev/invariants.md](invariants.md) and [docs/dev/runs.md](runs.md).
Atomicity guarantee for multi-statement mutations: a mid-query failure leaves Lance HEAD untouched on staged tables (no inline commit happened during op execution), so the next mutation proceeds normally with no `ExpectedVersionMismatch`. The publisher CAS at the very end either succeeds (manifest advances atomically across all touched sub-tables) or fails with a typed `ManifestConflictDetails::ExpectedVersionMismatch` (no partial publish). See [docs/dev/invariants.md](invariants.md) and [docs/dev/writes.md](writes.md).
## Bulk loader (`loader/mod.rs`)

View file

@ -21,7 +21,7 @@ constraints. User-facing behavior should still be documented through
|---|---|
| System structure, L1/L2 framing, component diagrams | [architecture.md](architecture.md) |
| On-disk layout, manifest schema, URI behavior | [storage.md](../user/storage.md) |
| Direct-publish writes, D2, staged writes, recovery sidecars | [runs.md](runs.md) |
| Direct-publish writes, D2, staged writes, recovery sidecars | [writes.md](writes.md) |
| Query execution, mutation execution, loader flow | [execution.md](execution.md) |
| DataFusion: current state, passive wins, future improvements | [datafusion-future-improvements.md](datafusion-future-improvements.md) |
| Index lifecycle and graph topology indexes | [indexes.md](../user/indexes.md) |
@ -59,6 +59,9 @@ Working documents for in-flight feature work. Removed when the work lands.
| Area | Read |
|---|---|
| Schema-lint chassis v1 (MR-694) — `--allow-data-loss`, soft/hard drops | [schema-lint-v1-plan.md](schema-lint-v1-plan.md) |
| Inline + stored queries, request/response envelope, MCP (MR-656 / MR-976 / MR-969) | [rfc-001-queries-envelope-mcp.md](rfc-001-queries-envelope-mcp.md) |
| Config & CLI architecture — layered config, client targeting, file naming (MR-973 / MR-974 / MR-981) | [rfc-002-config-cli-architecture.md](rfc-002-config-cli-architecture.md) |
| MCP server surface — full tool parity, stored queries, modular auth (MR-969 / MR-956 / MR-974) | [rfc-003-mcp-server-surface.md](rfc-003-mcp-server-surface.md) |
## Boundary

View file

@ -38,7 +38,7 @@ Use it this way:
publishes one manifest update. Do not commit per statement. Delete-only
queries are the documented inline residual; the parse-time D2 rule prevents
mixing deletes with insert/update until Lance exposes two-phase delete.
Read [runs.md](runs.md) and [execution.md](execution.md).
Read [writes.md](writes.md) and [execution.md](execution.md).
5. **Recovery is part of the commit protocol.** Writers that can advance Lance
HEAD before manifest publish must write `__recovery/{ulid}.json` sidecars.
@ -56,7 +56,7 @@ Use it this way:
branch they read even when index coverage is partial. Expensive index work
should converge from manifest state instead of extending the critical write
path. Scalar staged index builds and vector inline residuals are documented
in [runs.md](runs.md) and [indexes.md](../user/indexes.md).
in [writes.md](writes.md) and [indexes.md](../user/indexes.md).
8. **Schema identity survives renames.** Accepted schema identity must remain
stable across type and property renames. Rename support belongs in migration
@ -96,17 +96,25 @@ Use it this way:
| Area | Current state | Source |
|---|---|---|
| Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | [runs.md](runs.md), [architecture.md](architecture.md) |
| Constructive mutations | In-memory `MutationStaging`, one end-of-query table commit per touched table, then one manifest publish | [runs.md](runs.md), [execution.md](execution.md) |
| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/query-language.md), [runs.md](runs.md) |
| Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | [writes.md](writes.md), [architecture.md](architecture.md) |
| Constructive mutations | In-memory `MutationStaging`, one end-of-query table commit per touched table, then one manifest publish | [writes.md](writes.md), [execution.md](execution.md) |
| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/query-language.md), [writes.md](writes.md) |
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branches-commits.md), [maintenance.md](../user/maintenance.md) |
| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema-language.md), [execution.md](execution.md) |
| Unique constraints | Intra-batch and write-path checks exist; full cross-version uniqueness is still a gap | [schema-language.md](../user/schema-language.md) |
| Storage trait | `TableStorage` exists as the sealed staged-write surface; full call-site migration and capability/stat surfaces are incomplete | [runs.md](runs.md), [architecture.md](architecture.md) |
| Storage trait | `TableStorage` exists as the sealed staged-write surface; full call-site migration and capability/stat surfaces are incomplete | [writes.md](writes.md), [architecture.md](architecture.md) |
| Index lifecycle | `ensure_indices` is explicit today; reconciler-based convergence is roadmap | [indexes.md](../user/indexes.md), [maintenance.md](../user/maintenance.md) |
| Traversal IDs | Runtime still builds `TypeIndex`; Lance stable row-id based graph IDs are roadmap | [architecture.md](architecture.md), [query-language.md](../user/query-language.md) |
| Auth | Bearer token hashing and server-side actor resolution are implemented at the HTTP boundary | [server.md](../user/server.md), [policy.md](../user/policy.md) |
| Tests | Tempdir-backed Lance tests are the current substrate; there is no `MemStorage` test backend | [testing.md](testing.md) |
The branch-delete reconciler is authority-derived: it reclaims orphaned forks
today and degrades to a no-op if Lance ships an atomic multi-dataset branch
operation, so the design composes with that future rather than blocking it. This
is the same shape as invariant 7 (indexes are derived state); prefer it over a
recovery-sidecar-style approach for any new multi-dataset metadata operation,
since the sidecar would be scaffolding to remove once the substrate closes the gap.
## Known Gaps
Do not hide these behind invariant wording. Either move them forward or keep
@ -122,6 +130,15 @@ them explicit.
- **Deletes and vector indexes:** `delete_where` and vector index creation still
advance Lance HEAD inline because the required public Lance APIs are missing.
Keep D2 and recovery coverage in place until those residuals are removed.
- **Blob-column compaction:** Lance `compact_files` mis-decodes blob-v2 columns
under its forced `BlobHandling::AllBinary` read ("more fields in the schema
than provided column indices"), so `optimize` skips any table with a `Blob`
property — reporting `SkipReason::BlobColumnsUnsupportedByLance` (loud, not a
silent drop) behind the `LANCE_SUPPORTS_BLOB_COMPACTION` gate. Reads and writes
are unaffected; only space/fragment reclamation on blob tables is deferred.
Remove the skip when the upstream Lance fix lands — the
`lance_surface_guards.rs::compact_files_still_fails_on_blob_columns` guard
turns red on that bump to force it.
- **Planner capability/stat surfaces:** cost-aware planning, complete
capability advertisement, and explain-with-cost are roadmap. Do not describe
them as implemented.

View file

@ -1,6 +1,6 @@
# Lance Docs Index (for OmniGraph agents)
OmniGraph sits on top of Lance. Many problems — index lifecycle, branching, transactions, fragments, compaction, vector/FTS internals — are answered upstream in Lance's docs, not in this repo.
OmniGraph sits on top of Lance. Many problems — index lifecycle, branching, transactions, fragments, compaction, vector/FTS internals — are answered upstream in Lance's docs, not in this codebase.
This file is the curated entry point. **When you hit a Lance-shaped problem, find the matching topic below and fetch the listed URL(s) before guessing.** Don't grep our codebase for behavior that is documented authoritatively in Lance.
@ -175,7 +175,9 @@ Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53,
- **Lance #6658 closed** (2026-05-14) but `DeleteBuilder::execute_uncommitted` did **not** ship in v6.0.1 — binary search across the release stream shows it first appears in `v7.0.0-beta.10` (the closing commits landed on main but didn't backport to the 6.x line). Tracked as MR-A: migrate `delete_where` to staged, retire the parse-time D2 mutation rule, extend recovery sidecar coverage. **Gated on the Lance v7.x bump**, not this PR. v7.0.0-rc.1 dropped 2026-05-21.
- **Lance #6666 still open** (`build_index_metadata_from_segments` public): vector-index two-phase blocked; inline `create_vector_index` residual retained.
- **Lance #6877 still open** (`MergeInsertBuilder` dup-rowid): PR #109's `SourceDedupeBehavior::FirstSeen` + `check_batch_unique_by_keys` precondition stay load-bearing.
- **`Dataset::force_delete_branch`** (`branches().delete(name, force=true)`, dataset.rs:524) tolerates a missing branch-*contents* ref (vs plain `delete_branch`'s `RefNotFound`), but on the local store still errors `NotFound` if the branch `tree/` directory is fully absent (`remove_dir_all`'s NotFound is not caught for Lance's native error variant, refs.rs:526-549). Both variants still refuse a branch with referencing descendants (`RefConflict`). `TableStore::force_delete_branch` wraps this to be fully idempotent (tolerates already-absent). The single-authority branch-delete redesign uses it for orphan reclamation (eager best-effort reclaim + cleanup reconciler). Pinned by `lance_surface_guards.rs::force_delete_branch_semantics`. Branch delete is "flip the ref atomically, then `remove_dir_all(tree/{branch})`"; branch-exclusive data lives under `tree/{branch}/` so a drop reclaims it immediately without touching `main`.
- **Lance blob-v2 `compact_files` bug** (no public issue found as of 2026-06): `compact_files` disables binary-copy for blob datasets and forces `BlobHandling::AllBinary` on the read side; the v2.1+ structural decoder then mis-counts column infos for the blob-v2 struct and fails with `Invalid user input: there were more fields in the schema than provided column indices / infos` (`lance-encoding/src/decoder.rs::ColumnInfoIter::expect_next`). This fails even a pristine uniform-V2_2 multi-fragment blob table; vector/list/scalar/ragged columns and mixed file versions all compact fine. Reads/queries use descriptor handling (`BlobHandling::default()`) and are unaffected. `optimize` skips blob-bearing tables behind `LANCE_SUPPORTS_BLOB_COMPACTION = false` (`db/omnigraph/optimize.rs`), reporting `SkipReason::BlobColumnsUnsupportedByLance`. Pinned by `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns`, which turns red when the bug is fixed → flip the gate, remove the skip branch + the `maintenance.rs::optimize_skips_blob_table_and_reports_skip` skip assertions.
Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (8 named guards; 3 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension).
Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (10 named guards; 5 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension).
Bump this date stanza on the next alignment pass.

View file

@ -0,0 +1,351 @@
# RFC: Inline + Stored Queries, Request/Response Envelope, MCP
**Status:** Proposed
**Date:** 2026-05-28
**Tickets:** MR-656 (inline `-e` + URL rename), MR-668 (multi-graph, shipped), MR-976 (Phase 1 envelope parent: MR-977 / MR-978 / MR-979 / MR-980), MR-969 (stored queries + MCP)
**Target release:** v0.6.x patch series (MR-656 + Phase 1) → v0.7.0 (MR-969 PRs 1-3)
## Summary
OmniGraph today exposes `POST /read` and `POST /change` with a weakly-contracted body (counts only on writes) and no per-query authorization. This RFC consolidates the work landing across three Linear tickets into one coherent design:
1. **MR-656**: rename `/read``/query` and `/change``/mutate`, add inline `-e` CLI flag, ship three-channel deprecation on the legacy URLs. **In flight, PR #110.**
2. **Envelope hardening** (this RFC adds it as a Phase 1 before MR-969): make today's mutation surface agent-grade with idempotency keys, preconditions, deadlines, and a structured response envelope carrying `audit_id`, `commit_id`, `snapshot_id`, and cost stats.
3. **MR-969**: add a stored-query registry, `POST /queries/{name}`, a new `InvokeQuery` Cedar action with per-query scope, inline pragmas in `.gq` (`@description`, `@returns`, `@mcp`), and MCP transport over the same routing primitive.
The bet: inline and stored queries serve different stages of the same lifecycle, run through the same engine code, and are gated by different Cedar actions. HelixDB collapsed to stored-only. Postgres has neither stored-query Cedar nor MCP. The window for an OSS, declarative, agent-grade graph query surface is open.
## Motivation
Three problems today:
- **Mutation responses are too thin.** `ChangeOutput { node_count, edge_count }` is the entire memory the API has of what just happened. No `commit_id`, no `audit_id`, no `snapshot_id`. Agents reporting results have nothing to cite. Humans can't reproduce a read.
- **No agent-safe surface.** Cedar gates `read` and `change` at the action level. A token either runs *any* query or *no* query of that kind. There is no way to express "this agent can invoke `find_user` and nothing else."
- **No discovery primitive.** Agents need a tool list. SDKs need a stable contract per operation. Both are absent.
The MR-656 rename solves the cosmetic asymmetry (`/read` was a poor pair for the future `/queries/{name}`). The envelope work and MR-969 solve the substantive gaps.
## Non-Goals
- Compiled query bundles (HelixDB's `queries.json` shape). `.gq` files are already declarative; the file *is* the artifact.
- Hot reload of the registry. Restart-only matches the multi-graph operational model from MR-668.
- Per-query rate limits in v1. Existing `WorkloadController` covers the bulk of the risk. Punt to a future ticket.
- Cross-graph tool listing in MCP. Agents loop over per-graph endpoints when they need multi-graph access. Avoid namespacing in the contract.
- Web dashboard / control-plane management of the registry. Operators edit `.gq` + `policy.yaml` and restart.
- Schema introspection through MCP. Schema is an operator concern; agents see types through declared return shapes on the queries they're allowed to invoke.
- Per-environment override files. Environment-specific differences live in `policy.yaml`, which already has per-env variants.
## Background
OmniGraph runs on Lance 6.x with a property graph layered on top: typed nodes/edges in per-type Lance datasets, atomic multi-table commits via a `__manifest` table, branchable and time-travelable through Lance versioning. The HTTP server (`omnigraph-server`) is Axum + utoipa with bearer-token auth and Cedar policy enforcement at every `_as` writer.
MR-668 shipped multi-graph mode in v0.6.0. One server process can host 1-10 graphs, with per-graph endpoints under `/graphs/{id}/...`. Cedar policy resolves against `Server::"root"` (for management actions) and `Graph::"prod"` (for per-graph actions).
MR-656 is currently in PR #110 (CONFLICTING / DIRTY against main; rebase planned). It renames the URL surface, adds inline source support, and ships three-channel deprecation (OpenAPI `deprecated: true`, RFC 9745 `Deprecation: true` header, RFC 8288 successor `Link`).
## Design
### Two paths, one engine
| Dimension | Inline (`/query`, `/mutate`) | Stored (`/queries/{name}`) |
|---|---|---|
| Source location | Request body | `queries/*.gq` on disk |
| Parse + typecheck | Per request | Once at server boot |
| Cedar action | `read` / `change` | `invoke_query` (per-name scope) |
| MCP-exposed | No (not enumerable) | Yes (when `@mcp(expose=true)`) |
| Output schema | Inferred | Declared via `@returns`, asserted at boot |
| Audit log shape | Records query hash | Records query name |
| Failure visibility | Runtime 400 | Boot-time refusal |
Both paths converge in the engine:
```
POST /query ─parse→─┐
POST /mutate ─parse→─┤
├─→ run_query / run_mutate(ast, params, branch) ─→ envelope
POST /queries/{name} ───────┤
POST /mcp/invoke ───────────┘ (MCP adapter on top of the same call)
```
The MR-656 rebase widens `run_query` / `run_mutate` to accept a parsed AST or source string. Inline parses on each call. Stored looks up the pre-parsed AST in the registry. Same execution path beyond that point.
### Cedar split (the LLM-safe wedge)
Inline and stored coexist safely because they're gated by different actions:
```yaml
# Production policy — agents locked to a curated stored-query set
- deny:
actors: { group: agents }
actions: [read, change] # blocks /query, /mutate, /read, /change
- allow:
actors: { group: agents }
actions: [invoke_query]
resource: Graph::"prod"
query_scope: { names: [find_user, list_orders, search_docs] }
```
The agent's effective surface: three stored queries by name. Cannot compose inline. Cannot enumerate schema. Cannot read arbitrary entities. A developer in the same deployment with `dev-engineers` group membership might have `[read, change, invoke_query]` allowed — full access to both paths.
Same server, same data, two completely different API surfaces depending on token. This is the posture MR-969 calls "LLM-safe API surface."
### `.gq` pragmas
Stored queries self-describe at the top of the source file:
```gq
@description("Look up a user by ID. Returns name, email, last_login.")
@returns({ name: String, email: String, last_login: DateTime? })
@mcp(expose=true)
query find_user($id: String) {
match { $u: User { id: $id } }
return { $u.name, $u.email, $u.last_login }
}
```
Three pragmas in v1:
- `@description("...")` — string surfaced in `omnigraph queries explain` and MCP tool descriptions.
- `@returns({...})` — optional output type assertion. Compiler verifies the inferred type matches; mismatch fails server startup.
- `@mcp(expose=true|false, tool_name="alt_name"?)` — controls MCP visibility. Default is `expose=false` (callable via HTTP, hidden from MCP). `tool_name` defaults to the query name.
Pragmas live in source, not in a separate YAML registry. Drop a file in `queries/`, restart, the registry picks it up. The full agent contract is reviewable in one diff.
### Request envelope ("before")
Today's request carries auth + body. The envelope adds five fields, all optional:
```http
POST /graphs/prod/queries/find_user
Authorization: Bearer <token>
Idempotency-Key: 01HXYZ... # mutations only
If-Match: 01HABC... # optimistic concurrency
X-Deadline: 2026-05-28T19:30:00Z # or X-Timeout-Ms: 5000
X-Trace-Id: 01HDEF...
Content-Type: application/json
{
"params": { "id": "u-42" },
"branch": "main",
"expect": "read_only", # scope assertion
"dry_run": false, # mutations only
"fields": ["name", "email"] # result projection
}
```
Field semantics:
| Field | Applies to | Purpose |
|---|---|---|
| `Idempotency-Key` | Mutations | Server caches `(token, key)` → response for 10 minutes. Replays return cached response with `Idempotency-Replay: true` header. Prevents double-write on retry. |
| `If-Match` | Mutations | Run only if branch HEAD matches the given commit ID. 412 Precondition Failed otherwise. Enables read-then-write without races. |
| `X-Deadline` / `X-Timeout-Ms` | All | Server respects; returns 504-typed error past the deadline. Bounds execution for context-budget-constrained callers. |
| `X-Trace-Id` | All | Caller-supplied; server echoes back. Lets agents correlate multi-call sequences. |
| `expect` | All | Caller asserts shape: `"read_only"`, `{"max_rows_scanned": 10000}`. Server validates against parsed AST or planner estimate; rejects before running. |
| `dry_run` | Mutations | Returns what *would* happen without committing. Implemented via scratch branch + diff + discard. |
| `fields` | Reads | Server returns only listed columns. Saves bandwidth + agent context window. |
All five fields are optional; today's call shape continues working.
### Response envelope ("after")
The response envelope replaces today's bare-result shape with a structured wrapper. Every endpoint (inline, stored, MCP) returns the same envelope:
```json
{
"result": { "name": "Alice", "email": "alice@..." },
"audit_id": "01HGHI...",
"snapshot_id": "01HJKL...",
"commit_id": null,
"stats": {
"rows_scanned": 1,
"ms_elapsed": 4,
"bytes_read": 128
},
"warnings": []
}
```
Response headers:
| Header | When | Purpose |
|---|---|---|
| `Idempotency-Replay: true\|false` | Mutations | Was this response served from the idempotency cache? |
| `X-Trace-Id` | All | Echo of the request's trace ID, or server-minted if absent. |
| `Deprecation: true` | `/read`, `/change` only | RFC 9745 signal from MR-656. |
| `Link: </query>; rel="successor-version"` | `/read`, `/change` only | RFC 8288 successor pointer from MR-656. |
Body envelope fields:
| Field | When | Purpose |
|---|---|---|
| `result` | All | The actual response payload. Shape determined by the query's return type. |
| `audit_id` | All | ULID for the audit log entry. Lets the caller cite exactly what ran. |
| `snapshot_id` | All | Manifest snapshot the query observed. Reproducibility — replay with `?snapshot=<id>`. |
| `commit_id` | Mutations | ULID of the new commit. Null for reads. Lets the caller cite what changed. |
| `stats` | All | `{rows_scanned, ms_elapsed, bytes_read}`. Lets agents learn what's expensive. |
| `warnings` | All | Non-fatal observations: deprecated property access, full-scan despite available index, scan exceeded soft row limit. Empty array when none. |
The envelope is the API's *memory of what happened*. Without `audit_id` + `commit_id` + `snapshot_id`, agent reports are hearsay and reads are not reproducible. With them, provenance is a first-class property of every response.
### MCP integration with multi-graph
MCP routes are per-graph, matching the rest of MR-668's hierarchy:
```
GET /graphs/{id}/mcp/tools # tool list for this graph, this token
POST /graphs/{id}/mcp/invoke # invoke a tool on this graph
```
Single-mode collapses to `/mcp/tools` and `/mcp/invoke` at the root (same shape, no `/graphs/{id}` prefix). Both modes route through identical handler code.
Tool list response:
```json
{
"tools": [
{
"name": "find_user",
"description": "Look up a user by ID.",
"inputSchema": { "id": { "type": "string", "required": true } },
"outputSchema": { "name": "string", "email": "string", "last_login": "datetime?" },
"read_only": true
}
],
"graph_id": "prod",
"snapshot_id": "01HJKL..."
}
```
The tool list is the subset of registered queries where (a) `@mcp(expose=true)` in source and (b) Cedar permits `invoke_query` for this token on this name on this graph. Computed per request — cheap because it's just iterating the registry + one Cedar evaluation per name.
**Token scoping.** Most tokens carry one graph claim. Cross-graph access requires multiple Cedar rules (one per graph) and is uncommon. Agents that genuinely operate across graphs loop over `/graphs/{id}/mcp/tools` themselves. The contract stays clean; graph renames don't break tool names.
**Discovery.** Agents are told their MCP URL at provisioning: `https://omnigraph.example.com/graphs/prod/mcp`. Token authorizes; URL identifies. Same model as every OAuth-style API.
**`/mcp/invoke` is a protocol adapter.** Unwrap MCP protocol envelope, call the same code path as `/queries/{name}`, wrap the response in MCP shape. No new execution semantics.
### CLI surface
The CLI mirrors the HTTP routes. Post-MR-656 and post-MR-969:
```bash
# Inline (MR-656)
omnigraph query -e 'query test() { ... }' # /query
omnigraph mutate -e 'query bump() { update ... }' # /mutate
# Stored (MR-969)
omnigraph queries list # GET /queries (future)
omnigraph queries explain find_user # show params + return shape + source
omnigraph queries invoke find_user --param id=u-42 # POST /queries/find_user
# Pragma + registry validation
omnigraph lint queries/find_user.gq # parses + verifies pragmas
omnigraph queries lint # validates the whole registry
```
`omnigraph queries invoke` reads bearer + URL from `omnigraph.yaml` like the other remote commands. Local invocations work the same way the existing `omnigraph query`/`mutate` do.
### Lifecycle
The promotion path from inline to stored is the load-bearing DX story:
```
1. EXPLORE omnigraph query -e 'query find_user($id: String) { ... }' --params '{"id": "u-42"}'
└─ POST /query, iterate freely
2. STABILIZE write queries/find_user.gq with @description, @returns, @mcp pragmas
└─ git diff shows the full agent contract in one file
3. AUTHORIZE add Cedar rule allowing invoke_query for the appropriate actor group
└─ scope_names: [find_user]
4. DEPLOY restart server
└─ /queries/find_user goes live
└─ /mcp/tools auto-lists it for any token with invoke_query[find_user]
5. RETIRE deny: read change for the agent group
└─ inline access closed; stored remains
└─ MR-969's "LLM-safe API surface" reached
```
Same `.gq` source through all five steps. No rewrite. No language shift. The pragmas are the only added syntax between exploration and production.
## Migration
Existing callers see no breakage:
- `POST /read` and `POST /change` keep working, now with `Deprecation: true` headers (MR-656).
- `ChangeRequest` field names `query_source` / `query_name` accepted as serde aliases (MR-656).
- `aliases:` block in `omnigraph.yaml` unchanged; both `read`/`change` and `query`/`mutate` accepted as `command:` values (MR-656).
- New envelope fields are additive; old clients ignoring them keep working.
- `Idempotency-Key`, `If-Match`, `X-Deadline` are opt-in headers; absence is the current behavior.
Callers move at their own pace. The envelope upgrades + URL rename ship in v0.6.x (small PRs). Stored queries + MCP ship in v0.7.0.
## Sequencing
**Phase 1: envelope (v0.6.x, before MR-969).** Four small PRs, ~100-200 LOC each.
1. Wrap responses in the structured envelope. Add `audit_id`, `snapshot_id`, `commit_id`, `stats`, `warnings`. Backward-compatible if we keep today's top-level fields and add new ones alongside; cleaner break if we move to nested `result.*`. Pick one and live with it.
2. Honor `Idempotency-Key` on `/mutate` (and the deprecated `/change`). Server-side cache keyed by `(token, key)`.
3. Honor `If-Match` on `/mutate`. Wire through to the publisher CAS layer.
4. Honor `X-Deadline` / `X-Timeout-Ms` on every endpoint. Return 504-typed error past deadline.
**Phase 2: MR-969 PR 1 (registry).** The stored-query registry, `/queries/{name}` route, `InvokeQuery` Cedar action with per-name scope, `.gq` pragma parsing (`@description`, `@returns`, `@mcp`), read-vs-mutate classification at registry load. Inline keeps working unchanged.
**Phase 3: MR-969 PR 2 (MCP).** `/graphs/{id}/mcp/tools` and `/graphs/{id}/mcp/invoke`. Tool schemas projected from declared return types and parameter declarations. Single-graph-scoped tokens.
**Phase 4: MR-969 PR 3 (Cedar deny-on-ad-hoc sugar).** Small Cedar-language addition so operators can lock down `/read` / `/query` while keeping `/queries/*` open. Independent of PRs 1-2.
**Phase 5: deferred.**
- Cross-graph MCP namespacing (wait for usage signal).
- Per-query rate limits (extend `WorkloadController`).
- Schema introspection as a separate Cedar action (3-line PR).
- CLI verb consolidation (`omnigraph call <name>`).
- Cache warming (HelixDB-style; not load-bearing).
## Rejected Alternatives
**Per-environment override files (`_overrides.yaml`).** Initial design had a sparse YAML file for per-env tweaks: MCP exposure, row caps, kill-switch, param locks. Rejected because every override candidate either belongs in source (`@mcp` flag), Cedar policy (per-actor visibility, per-env), or `omnigraph.yaml` (operator config). Splitting query metadata across files makes it harder to review what an agent can see. Keep source authoritative; let Cedar express the per-env differences.
**Compiled query bundle (HelixDB's `queries.json`).** HelixDB compiles their Rust-DSL queries to JSON. Rejected because `.gq` files are already declarative. The file is the artifact. Reviewers diff source, not bytecode.
**Stored-queries-only (HelixDB's posture).** Rejected because the personal-graph / dev-iteration use case dies without inline. Inline `-e` is the REPL for human exploration; stored is the contract for production agents. Both first-class.
**Cross-graph tool-name prefixing (`prod.find_user`).** Rejected because graph renames would break agent contracts. Per-graph URLs let graph identity live in the URL, not in tool names.
**Body-field graph dispatch (`{tool, graph, params}`).** Rejected because it doubles the contract surface (every tool is identified by two fields). Per-graph URLs are simpler.
**Pragmas in YAML instead of source.** Rejected because two-file definitions (source + metadata YAML) make diffs harder to review and create drift opportunities. Source is the source of truth.
**Pragmas as in-source comments (`#[mcp]` HelixDB-style).** Considered; chose `@mcp(...)` because comment-flavored pragmas conflate documentation and machine-readable metadata. The `@` prefix makes the pragma's role explicit.
## Open Questions
1. **Envelope breakage vs additive.** Phase 1.1 wraps responses in a structured envelope. Do we keep today's top-level fields *and* add new ones (additive, ugly), or move result to `result.*` (clean break, requires SDK updates)? Lean toward additive — let the new envelope coexist with the old shape until v0.7.0, then collapse.
2. **`@returns` strictness.** Should mismatched declared-vs-inferred return type be a boot-time error or a warning? Lean toward error — silent drift defeats the assertion's purpose. Operators who want flexibility omit `@returns`.
3. **MCP protocol transport.** Streamable HTTP (the new MCP standard) vs stdio (Anthropic's original). Both have Rust crates. Lean toward streamable HTTP since we're already an HTTP server.
4. **Stored mutation routing.** A `.gq` file that contains both reads and writes — does the registry reject it at load (parse-time D2 rule from MR-656), or accept and classify as "mixed"? Lean toward reject. Mixed queries are a footgun; force operators to split.
5. **`expect` field strictness.** `expect: "read_only"` against a parsed mutating query is an obvious 400. But `expect: {max_rows_scanned: 10000}` requires planner estimates that don't exist today. Either ship `expect` with only the "read_only" assertion in v1 and grow it, or wait for the planner. Lean toward shipping the partial form.
6. **CLI `queries invoke` shape.** Today's `omnigraph query` takes a file or alias. `omnigraph queries invoke find_user` takes a stored query name. Should `omnigraph query --name find_user` also work (auto-detect)? Cleaner to keep them separate verbs — the stored vs inline distinction is part of the contract.
## References
- MR-656: [Support inline query strings in CLI and HTTP server](https://linear.app/modernrelay/issue/MR-656)
- MR-668: [Multi-graph server mode](https://linear.app/modernrelay/issue/MR-668) (shipped, PR #119)
- MR-969: [Stored queries with MCP exposure and per-query Cedar authorization](https://linear.app/modernrelay/issue/MR-969)
- PR #110: [feat: inline query strings in CLI and HTTP server](https://github.com/ModernRelay/omnigraph/pull/110)
- HelixDB docs: [docs.helix-db.com/llms-full.txt](https://docs.helix-db.com/llms-full.txt) — `#[mcp]` macro, scoped API keys, stored query model
- RFC 9745 (`Deprecation` header)
- RFC 8288 (`Link` relations, `successor-version`)
- MCP spec: [modelcontextprotocol.io](https://modelcontextprotocol.io)
- [invariants.md](./invariants.md) — substrate boundaries this work respects
- [../user/server.md](../user/server.md) — current HTTP surface (post-MR-656 picks up the `/query`+`/mutate` rename and deprecation)

View file

@ -0,0 +1,590 @@
# RFC: Config & CLI Architecture — Layered Config, Client Targeting, File Naming
**Status:** Proposed
**Date:** 2026-05-30
**Tickets:** MR-668 (multi-graph server, shipped — the dependency this builds on), MR-969 (stored queries + MCP — supplies the in-repo agent tool surface), MR-973 (quickstart / onboarding), MR-974 (agent setup surface), MR-981 (agent-friendly CLI hardening)
**Target release:** v0.8.x (tentative; phased — see Rollout)
## Summary
OmniGraph today has a single config file, `omnigraph.yaml`, read both by the CLI (operating the embedded engine) and by `omnigraph-server` (hosting graphs). There is **no client-side configuration that targets a *running server*** — to talk to a deployed `omnigraph-server` you drop to `curl` or the `omnigraph-ts` client. This is the one real gap in an otherwise coherent design (storage-URI addressing, multi-graph routing, per-graph policy).
This RFC defines the config and CLI architecture that closes that gap, derived from first principles — *working backwards from what OmniGraph uniquely enables* rather than copying kubeconfig / `helix.toml`. The result:
1. A **global-first layered config** — user-global (`~/.omnigraph/`) is the **primary, self-sufficient default**; per-project (`./omnigraph.yaml`) is an *optional* override + deployment manifest. One uniform schema, both layers optional; the CLI works from any directory with **no project file** (the `kubectl`/`aws`/`gh` posture), unlike today's project-anchored behavior.
2. A single unifying noun — the **target** — that resolves a name to a concrete `(locus, graph, sub-state, credential)` tuple, where the locus is **embedded (storage URI) XOR remote (server endpoint)**.
3. A **multi-server × multi-graph** client model (OmniGraph hosts N graphs per server and there are M servers — unlike Helix's one-cluster-one-graph).
4. **Credentials by reference, keyed by server name** (the AWS/gh/kube model) — OS keychain `omnigraph:<server>` (preferred) → a `[<server>]` profile in `~/.omnigraph/credentials``OMNIGRAPH_TOKEN[_<SERVER>]` env (CI). `servers.<name>` is endpoint-only by default but may carry an explicit, secret-free `auth: { token: { env|file|command|keychain } }` source; no `credentials.yaml`; the shipped `bearer_token_env` + dotenv stay as a legacy compat path. Every committed/GitOps'd surface stays secret-free.
5. A **file-naming** decision: project and server config are **the same artifact, same name** (`omnigraph.yaml`); the only differently-named file is the user-global `config.yaml`, justified by **scope, not role**.
The design optimizes jointly for **DX** (one command surface across embedded and remote; clone-and-go) and **AX** (agent experience: one flat resolved context, secrets structurally unreachable, branch-pinned reproducible reads, and a GitOps'd capability surface).
## Reconciliation with shipped / planned CLI work
Verified **against the code**, not ticket statuses (which are unreliable — e.g. MR-581 is marked done but is stale and unbuilt). Findings and the corrections they force:
- **Noun is `graph`/`graphs`, NOT `target`/`targets`.** The config key is `graphs:` in `config.rs` and the flag is `--graph`. **This RFC uses `graphs:`/`--graph` throughout**; the unifying noun is a **`graphs:` entry** that is *embedded* (`storage:`, formerly `uri:`) XOR *remote* (`server:` + `graph_id:` defaulting to the entry key) — a typed locator (§1.1). Read any lingering `targets:`/`--target` below as `graphs:`/`--graph`.
- **`~/.omnigraph/` stands on its own merits** (Helix/aws/kube peer convention), **not** on precedent — there is **no `~/.omnigraph/` usage in the code** today. (MR-581 / MR-531 templates-into-`~/.omnigraph/` are *stale tickets, unbuilt*.)
- **Templates do not exist** in the code (no `template` command). The template mechanism is a *design question for this RFC / the init family*, not an existing foothold.
- **What actually exists in the CLI** (verified): `init, query(read), mutate(change), load, ingest, branch, schema, lint, snapshot, export, commit, policy, optimize, cleanup, graphs`. **Not built:** `serve, quickstart, template, prune, login`. `omnigraph init` exists (with `scaffold_config_if_missing`, `main.rs:1415`); the rest of the "init family" (`quickstart` MR-973, `serve` MR-970, `prune`/`init --force` MR-972/975, `mcp install`/skills MR-974, agent-mode MR-981) are **unbuilt tickets**, some stale.
- **Config still uses `aliases:`** (no `operations:` in code; MR-839 unbuilt). §6's reconciliation talks about `aliases:` as-is, noting `operations:` is a *proposed* rename.
- **`bearer_token_env` exists** (per-graph, `config.rs`); MR-971 flags a CLI-parity / server-side gap. The per-`servers.<name>` extension lands on top of that.
- **A top-level `omnigraph lint` command exists** (verified). A stored-query *registry* validator must pick a verb that doesn't read as a competing lint/check.
## Motivation
Three problems, in priority order:
- **No client→server targeting config.** The moment an operator stands up `omnigraph-server` — for bearer auth + Cedar at a network boundary + admission control + multi-graph routing — the CLI can't address it. `curl` is the fallback. There is no named, switchable, credential-carrying way to say "run this against `prod` on the team server."
- **Multi-server × multi-graph has no first-class expression.** OmniGraph genuinely runs N graphs per server across M servers. The same graph is **multi-homed**`s3://b/prod` may be `prod` on server A, `production` on server B, and opened directly by the CLI. Today's flat `graphs:` map (name→storage-URI) can't express "graph `production` on server `prod-eu`."
- **Solo-first and embedded-first are unserved by the remote story.** A solo developer with no projects should define everything in `~`. A developer iterating locally (embedded, no server) and then pointing at staging (remote) should change *one word*, not learn a second command surface.
MR-668 shipped the server side (multiple graphs per server). MR-969 ships the in-repo agent tool surface (stored queries / MCP). This RFC supplies the **client and config layer** that lets humans and agents target that surface coherently — the foundation under MR-973 / MR-974 / MR-981.
## Non-Goals
- **A control plane / dashboard for config.** Operators edit files and (for servers) restart. No runtime config-mutation API. Matches the MR-668 / MR-969 operational model.
- **Hot reload.** Restart-only for server-side config, matching MR-668 and MR-969.
- **Embedding secrets in any config file.** Credentials are by-reference; the git-ignored `auth.env_file` dotenv (or, later, the OS keychain) holds tokens. Never a committable `*.yaml`.
- **Renaming the project manifest by role.** No `omnigraph.server.yaml` / `omnigraph.client.yaml`. Role lives in sections, not filenames (see Design §3).
- **Dropping embedded mode.** Embedded-first is load-bearing for the file-naming decision; this RFC assumes it stays.
- **Cross-graph / cross-server tool listing in MCP.** Clients loop over per-graph catalogs (a MR-969 non-goal, restated).
## Background
OmniGraph runs on Lance 6.x: typed nodes/edges in per-type Lance datasets, atomic multi-table commits via a `__manifest` table, branchable and time-travelable. The CLI (`omnigraph`) operates the **embedded engine** directly against a storage URI — no HTTP client in its runtime dependencies. `omnigraph-server` (Axum) is a *separate* HTTP front-end over the same engine, with bearer auth + per-graph Cedar (MR-668). The two read the same `omnigraph.yaml` but never connect to each other.
OmniGraph **already has a credentials-by-reference mechanism**, which this RFC builds on rather than replacing: `TargetConfig.bearer_token_env` names the env var holding a graph's bearer token, and `auth.env_file` points at a git-ignored dotenv (`.env.omni`) that the CLI auto-loads into the process (`load_env_file_into_process`) with real-env-vars-win precedence; `resolve_remote_bearer_token` resolves a token via env var then dotenv named lookup. `.env.omni` is already in `.gitignore`.
The six **irreducible enablers** that drive the design (referenced as E1E6 below):
| # | Enabler | Consequence |
|---|---|---|
| E1 | A graph is a **self-contained storage URI**; the substrate (object store + manifest CAS) is the source of truth — no server required to read/write. | A graph is addressable **directly (embedded)**, not only via a server. |
| E2 | A server hosts **many graphs**; **many servers** exist. | The remote address space is **`{server} × {graph_id}`**. |
| E3 | The same graph is **multi-homed** under different per-locus names. | **Name ≠ identity.** Resolution is mandatory. |
| E4 | **Branch / commit / snapshot** are first-class addressable sub-state. | An address is *graph @ branch/snapshot*, not just graph. |
| E5 | Enforcement is **two-layered**: engine-layer Cedar (`_as` writers, works embedded) + HTTP-boundary bearer+Cedar (server only). | *How* you reach a graph determines *which* enforcement applies. |
| E6 | **Stored queries / MCP tools are a per-graph registry defined in the project config** (MR-969). | The **agent tool surface is version-controlled in the repo**. |
Competitors collapse dimensions OmniGraph keeps live: **Helix** fuses E2+E3 (one cluster = one graph); **namidb** fuses E1+E3 into the URI (`s3://b?ns=prod`) and serves one namespace per process. OmniGraph has all of E1E6 at once, so its config resolves a richer space — but the richness is *earned* by capability.
## Design
### 1. The address space and the `target` abstraction
Every OmniGraph address is a tuple:
```
(locus, graph, sub-state, credential)
locus = embedded(URI) XOR remote(server-endpoint) # E1, E2
graph = a URI (embedded) | a graph_id on a server (remote) # E3
sub-state = branch | snapshot # E4
credential = cloud-storage creds (embedded) | bearer token (remote) # E5
```
The config's only job is **name → this tuple**. Define one noun — a **target** — that resolves to either shape:
```yaml
targets:
dev: # embedded — substrate-direct (E1)
storage: s3://team-bucket/dev.omni
branch: main # sub-state (E4)
staging: # remote — resolves a server by reference (E2/E3)
server: staging # → looked up in `servers`
graph_id: prod # the graph's id on that server (defaults to the entry key)
branch: review
```
`--target staging` resolves: project `targets.staging``{server: staging, graph_id: prod, branch: review}``servers.staging``{endpoint, token-by-ref}` → final `(remote(https://…), prod, review, $TOKEN)`. Embedded targets skip the server hop and use cloud-storage credentials.
**Two concepts, not kubeconfig's three.** kube splits cluster / user / context; that 3-way split is its most-cursed UX. A target *bundles* server+graph+branch+defaults under one name; the **only** thing split out is `servers`, because endpoints+credentials are shared across many targets and are secret-bearing (different ownership and rate-of-change; see §2). Result: **2 nouns — `servers` and `targets`.** Embedded `targets` (`storage:`) subsume today's `graphs:` entries.
### 1.1 The resolved address is a typed *locator*, not a `uri` string
The shipped config models a graph as a single `uri: String`, and code branches on `is_remote_uri(uri)`. That conflates two structurally different addresses: an **embedded** graph is a *complete, self-contained* address — one storage URI = one graph, opened directly via the embedded engine; a **remote** graph is a *server endpoint + a `graph_id`* — one server hosts N graphs. A bare server URL **is not a graph**; it lacks the `graph_id`. The cost of the string model, in the code today:
- the CLI re-decides "server or file?" via `is_remote_uri` at ~16 call sites;
- `TargetConfig` (one `uri` field) **cannot express** multi-server × multi-graph or a multi-homed graph (E2/E3) — "graph `production` on server `prod-eu`" has no representation;
- the CLI **bails on remote URIs** for most operations, precisely because the string can't carry the `graph_id`;
- the `omnigraph-ts` SDK had to model `baseUrl` **+** `graphId` *separately* (rewriting `/graphs/{graphId}/…`) — it invented the structure the string lacks.
So the *resolved* address is a **typed locator**, not a string:
```rust
enum GraphLocator {
Embedded { storage: StorageUri }, // file:// , s3:// — a complete graph
Remote { server: ServerId, graph_id: GraphId }, // which server + which graph (+ bearer creds)
}
```
A `graphs:` entry resolves into this **once**; downstream code dispatches on the variant (the breadboard's `GraphConn = Embedded(engine) | Remote(http)`) instead of re-sniffing a scheme at each call site. The `uri` string becomes an *input format* for the embedded variant, never the address itself.
**YAML naming follows the locator — the *key* names the locus**, so neither the value's scheme nor a comment is load-bearing:
| Locus | Key | Value |
|---|---|---|
| Embedded | **`storage:`** (shipped `uri:` is a deprecated alias) | a storage URI (`s3://…`, `file://…`) |
| Remote | **`server:`** | a name in `servers:` (its `endpoint` + creds resolve by name, §5) |
| Remote graph id | **`graph_id:`** | the id on that server — **defaults to the entry key**; set only when the local alias differs |
An entry has `storage:` **xor** `server:` — the deserializer rejects *both* and *neither* (no silent ambiguity). This removes two prior confusions: `graphs:` (the map) vs `graph:` (the remote id), and `uri:`-might-be-a-server.
```yaml
servers:
prod-eu: { endpoint: https://og-eu.internal:8080 }
graphs:
dev: { storage: s3://team-bucket/dev.omni } # embedded
production: { server: prod-eu } # remote — graph_id = "production" (the key)
staging: { server: prod-eu, graph_id: prod } # remote — alias ≠ server's id
```
### 1.2 Invalid configs are rejected by design
The DX rule is: **a config field is either honored or rejected, never silently ignored**. The loader therefore has two phases:
1. Parse YAML into a loose/raw shape that preserves origin (`base_dir`, layer, line/path when available).
2. Convert once into a typed, role-aware resolved config. Every command receives the resolved form, not the raw YAML structs.
The typed graph shape is:
```rust
enum GraphEntry {
Embedded(EmbeddedGraphEntry),
Remote(RemoteGraphEntry),
}
struct EmbeddedGraphEntry {
storage: StorageUri,
branch: Option<BranchName>,
policy: Option<PolicyFile>,
queries: QueryRegistrySpec,
}
struct RemoteGraphEntry {
server: ServerId,
graph_id: GraphId,
branch: Option<BranchName>,
}
```
That makes these rules structural rather than advisory:
- A graph entry must specify **exactly one** locator: `storage:`/legacy `uri:` xor `server:`.
- `policy:` and `queries:` are valid only on `Embedded` graph entries, because they define the capability surface of a graph this process opens directly. A `Remote` graph entry points at a server; that server owns policy and stored-query definitions.
- `omnigraph-server` may serve only `Embedded` graph entries. A server manifest entry with `server:` is rejected: a server should not "host" a graph by proxying another server.
- A named graph uses its own graph entry. Top-level `policy:` / `queries:` are a legacy anonymous-bare-URI compatibility path only; if a named graph is selected while top-level blocks would be ignored, config validation errors with a migration hint.
- A client-defined remote graph discovers stored queries from the server (`GET /queries`) and invokes them (`POST /queries/{name}`); it does not define `queries:` locally for that remote graph.
Examples that must fail fast:
```yaml
graphs:
prod:
storage: s3://team-bucket/prod.omni
server: prod-us # invalid: storage xor server
```
```yaml
graphs:
prod:
server: prod-us
graph_id: production
policy: { file: ./policies/prod.yaml } # invalid: remote graph policy lives on the server
queries:
find_user: { file: ./queries/find_user.gq } # invalid: remote graph queries are discovered
```
`omnigraph config view --resolved --show-origin` is the user-facing debugger for this boundary: it shows the final `Embedded` or `Remote` graph and where every honored field came from. Fields that cannot be honored never make it into the resolved view; they fail validation first.
### 2. Layered config — global-first, uniform schema, project-optional
**Posture: global-first, project-optional.** OmniGraph's CLI is primarily a *client* (it operates against graphs and servers, embedded or remote), so it sits on the **global-first** side of the CLI-config axis — like `kubectl` / `aws` / `gh` / `docker`, and unlike *project-first* tools (`git` / `cargo` / `terraform`) whose primary config is per-repo. The **global user config is the primary, self-sufficient default**; the project file is an *optional* repo-scoped override (and, when present, the deployment manifest). `omnigraph query --target prod` must work from **any directory with no project file**, exactly as `kubectl get pods --context prod` works from anywhere. *(This is a deliberate flip from today, where the CLI reads `./omnigraph.yaml` and does not even walk parent dirs — i.e. today it is project-anchored.)*
**Rule: the two layers share ONE raw schema, and each is fully self-sufficient** (the git-layering mechanism — same schema at both levels; you never need a repo to have a working config). Do **not** specialize the file format by layer. Instead, run the same role-aware validation everywhere (§1.2): the global and project layers may both define graph locators, defaults, servers, and aliases, but fields that are meaningless for a resolved graph variant are rejected rather than ignored. For example, `queries:` is valid for an embedded graph this config opens directly; it is invalid on a remote graph entry because remote stored queries are server-owned and discovered.
This makes the **zero-project case the default, not an edge case**: a solo user (or an agent) defines everything needed for client work in `~/.omnigraph/config.yaml` — servers, embedded + remote graph locators, defaults, aliases, and optionally personal embedded-graph query registries — and **never creates a project file**. A team adds `./omnigraph.yaml` only when it wants repo-scoped overrides or a committed, GitOps'd deployment manifest. Global-first does **not** forbid project files; it stops *requiring* them (the kubectl model: `~/.kube/config` is sufficient and default; per-project kubeconfigs are opt-in via `KUBECONFIG`).
| Layer | Required? | Typical use | Path |
|---|---|---|---|
| Global | no | **the default** — solo/agent's entire config; shared servers+creds for teams; even a personal server's graphs/queries | `~/.omnigraph/config.yaml` |
| Project | no | **opt-in** — repo-scoped overrides + the committed deployment manifest (graphs, queries, policy) | `./omnigraph.yaml` |
**Precedence (low → high):** built-in defaults < global < project < env vars < CLI flags. With no project file it collapses to **built-in < global < env < flags** the common global-only path.
**Merge semantics — "closest layer wins, at the smallest meaningful unit"** (the field consensus: git / kubeconfig / cargo / Helm / VS Code):
- **Settings objects** (`defaults`, `auth`, `server`) → **deep-merge per field**: a project sets `defaults.graph` and *inherits* the global `defaults.output_format`. (VS Code / cargo behavior.)
- **Named-resource maps** (`servers`, `graphs` / compat `targets`, `queries`, `aliases`) → **union by key; on a collision the higher layer's entry REPLACES the lower wholesale***no field-level deep-merge within an entry*. (kubeconfig: union contexts by name.) The footgun this avoids: global `servers.prod = {endpoint, policy}`, project `servers.prod = {endpoint: other}` — deep-merge would silently retain the old fields; replace makes the project's `prod` self-contained and predictable.
- **Lists/arrays****replace, never append** (Helm convention; appending is order-sensitive and surprising).
- **Scalars** → higher layer wins.
- **Relative paths carry their origin's base_dir.** A `queries:` entry's `.gq` path, or a `policy.file`, resolves against the directory of the layer it was *defined in* — global entries under `~/.omnigraph/`, project entries under the project dir.
- **Inspectable (non-negotiable):** `omnigraph config view --resolved --show-origin` prints each final value *and which layer set it* (the `git config --show-origin` / `kubectl config view` rule). A layered config without origin-tracing is a debugging trap.
### 3. Roles, and the file-naming decision (same name for project = server)
`omnigraph.yaml` carries two *roles* that diverge in prod and collapse on a laptop:
- **Server role** (read by `omnigraph-server`): `graphs:` entries that are **embedded storage locators**, per-graph `policy.file`, **`queries:` — the stored-query/MCP registry lives here**, plus serving knobs. Remote graph locators are rejected in this role.
- **Client role** (read by the CLI/agent): `servers:`, embedded or remote `graphs:` locators, `defaults:`, `aliases:`. A remote graph locator points at server-owned capabilities; it cannot define local `policy:` or `queries:`.
**Project config and server config are the same artifact, hence the same name.** The server *serves the project*: the file that says "these graphs exist, with these stored queries and this policy" is simultaneously the project manifest and the server's deploy config. Role is distinguished by which *sections* are populated, never by filename. Readers ignore sections that are not theirs (today's file already does this with `cli:` vs `server:`).
**Why not kube's role-split.** Two coherent models exist: (A) one project file with role-sections (Helix `helix.toml` holds both `[local.dev]` and `[enterprise.production]`; compose; Cargo), and (B) deployment-manifest strictly separate from client config (kubectl — you never put a context in `deployment.yaml`). kube is the sharpest topological analog (multi-server × multi-graph, one client targeting many), so B has a real claim. The tiebreaker is **E1: OmniGraph is embedded-first.** In embedded mode the manifest's `graphs:` *is* the local target list — manifest and local-client-view are the same object, so splitting them (B) fights the grain and forces two files for local work. kube splits because it has **no** embedded mode (client always remote+global). So: take the half kube is right about — *remote* client targeting (`servers:`, endpoints, creds) is a separate concern in a separate **user-global** file (`config.yaml`, like `~/.kube/config`); reject the half it is wrong about for us — do **not** split the *project* layer by role. **The second name (`config.yaml`) is justified by scope (user-global), not role.** *(If OmniGraph ever dropped embedded mode and went pure-remote, model B's strict split would become cleanest.)*
### 4. File naming
Principles from the field: **one global dir** `~/.omnigraph/` (like `~/.aws`/`~/.kube`/`~/.helix`), with config/cache/state as **subdirectories** (separation without XDG's three-root scatter); **secrets keyed by server name in the OS keychain or a separate git-ignored profile file** (AWS/gh model, not a new `credentials.yaml`); **project-root manifest keeps the app-named file** (`Cargo.toml`, `package.json`); **`.yaml`, not `.yml`**; keep OmniGraph's established names. The genuinely *new* decisions are the **global** dir's existence and keyed-by-name resolution with an explicit `auth.token` override (MR-971); the shipped `bearer_token_env` + `auth.env_file` mechanism remains as legacy compat.
| Artifact | Path / name | Why |
|---|---|---|
| Project = server config (one artifact) | `./omnigraph.yaml` | **Keep.** Root manifest like `Cargo.toml` / `compose.yaml` / `helix.toml`. Same name for both roles because it is one file. In prod the server's deploy repo and an app repo each have their own `omnigraph.yaml` — same name, different repos. |
| Global user config | `~/.omnigraph/config.yaml` | **One dir** (`~/.omnigraph/`, like `~/.aws`/`~/.kube`/`~/.helix`). Named `config.yaml` *not* `omnigraph.yaml` — the name signals scope (and `~/.aws/config`, `~/.kube/config`, `~/.helix/config` all do this). Holds the full schema so a solo user needs nothing else. |
| Credentials | OS keychain (`omnigraph:<server>`, preferred) → `~/.omnigraph/credentials` profile file (`[<server>]`, `0600`, git-ignored). **Keyed by server name**, inside the one dir. | **Key by name, AWS/gh model**`~/.aws/credentials [profile]`, `~/.kube/config users:`, `~/.helix/credentials`. *Not* a `credentials.yaml`, and *not* a per-server hand-named env var; the secret lives under the server name (no indirection). Legacy `bearer_token_env` + `.env.omni` dotenv remain as a compat path. See §5. |
| Cache / state | `~/.omnigraph/cache/`, `~/.omnigraph/state/` | Subdirs of the one dir (like `~/.aws/sso/cache/`, `~/.kube/cache/`) — cache is `rm -rf`-safe and backup-excludable without scattering across XDG roots. |
| Cedar policy | `./policies/<env>.yaml` + `<env>.tests.yaml` | **Keep.** Referenced by `policy.file`. |
| Schema | `./*.pg` (e.g. `schema.pg`) | **Keep.** |
| Stored queries | `./queries/*.gq` | **Keep.** `.gq` sources referenced by the `queries:` registry. |
**Global dir: `~/.omnigraph/` — one place, with subdirectories.** Everything OmniGraph keeps for a user lives under a single `~/.omnigraph/` directory, matching the peer group (`~/.aws`, `~/.kube`, `~/.docker`) and the direct competitor (`~/.helix`). This is what DB/cloud-CLI users expect and the lowest-cognitive-load shape.
*Separation and "one place" are not in conflict* — the decisive realization. The peer tools get config/cache/state separation via **subdirectories inside the one dir**, not via XDG's three scattered roots: `~/.aws/sso/cache/`, `~/.kube/cache/`. So OmniGraph keeps `~/.omnigraph/config.yaml`, `~/.omnigraph/credentials`, `~/.omnigraph/cache/` (catalogs — `rm -rf`-safe, backup-excludable), `~/.omnigraph/state/` (session, logs) — getting cache hygiene **and** a single discoverable location, without the XDG scatter. An earlier draft argued XDG on a false dichotomy (it assumed single-dir ⇒ mixed); subdirs dissolve it. `~/.omnigraph/` is canonical and documented; `$XDG_CONFIG_HOME` may optionally be honored if a user has set it, but XDG is not part of the mental model.
**Env / override precedence (the `KUBECONFIG` analog):**
- `OMNIGRAPH_CONFIG=/path` — explicit config file, highest precedence.
- `OMNIGRAPH_HOME=/path` → the global dir (default `~/.omnigraph/`); `$XDG_CONFIG_HOME` optionally honored if a user has set it, but `~/.omnigraph/` is canonical.
- Cache and state are subdirs of the one dir: `~/.omnigraph/cache/` (cached remote catalogs), `~/.omnigraph/state/` (session, logs).
- Per-server token resolution: an explicit `auth: { token: {...} }` source (env/file/command/keychain) wins if set; otherwise **keyed by the server name**`OMNIGRAPH_TOKEN_<NAME>` (or `OMNIGRAPH_TOKEN` for the active server) → OS keychain `omnigraph:<name>` → the `[<name>]` profile in `~/.omnigraph/credentials`; legacy `bearer_token_env` still honored. See §5.
### 5. Credentials, connection tiers, and bind portability (12-factor)
**Credentials are by-reference everywhere, never inlined — and keyed by the *server name*, not by a hand-invented env-var name.** This is the one place the design departs from simply reusing the shipped `bearer_token_env` mechanism, because that mechanism is sub-optimal for a multi-server client: it forces the operator to invent and coordinate an env-var name per server (three steps to add a server: pick a var, name it in config, set it in the store). The peer group (AWS profiles, `gh` hosts, kubeconfig users, docker auths) instead keys the secret **by the server's name** — no indirection. OmniGraph should match that.
**Resolution for server `<name>` (no config field required):**
1. **`OMNIGRAPH_TOKEN_<NAME>`** env var (name-derived, upper-snake), else **`OMNIGRAPH_TOKEN`** for the active server — the CI/headless override (12-factor).
2. **OS keychain** entry `omnigraph:<name>` — the preferred interactive store (no plaintext on disk); written by `omnigraph login <name>`.
3. **`~/.omnigraph/credentials`** — an AWS-style profile file keyed by server name (mode `0600`, git-ignored), the fallback when no keychain:
```ini
[prod-us]
token = …
[prod-eu]
token = …
```
So a `servers.<name>` with no token field resolves by name — adding a server is one step (`omnigraph login <name>`), and "multiple servers, multiple tokens" falls out for free.
**But implicit must not be the *only* path — explicit sourcing is a first-class option** (the DX/AX lesson). Pure-convention is invisible (you must *know* `OMNIGRAPH_TOKEN_<NAME>`), can't integrate with a secrets-manager's fixed var name, and can't do dynamic/short-lived tokens. So a server may declare an explicit `auth:` block — a **method-agnostic wrapper** (today only `token:` for bearer; `mtls:`/`oidc:` are the future siblings, so the credential model never has to be re-keyed) holding a tagged token *source*. Secrets are *still* never inlined (every source is a reference):
```yaml
servers:
prod-us:
endpoint: https://og-us…
auth: { token: { env: OG_PROD_US_TOKEN } } # explicit env var — self-documenting (= legacy bearer_token_env)
prod-eu:
endpoint: https://og-eu…
auth: { token: { command: [vault, read, -field=token, secret/og] } } # dynamic / short-lived
edge:
endpoint: https://og-edge…
auth: { token: { file: /run/secrets/og-token } } # k8s/docker mounted secret
staging:
endpoint: https://og-staging… # no auth: → implicit chain (below)
```
| `auth.token:` source | when | DX/AX value |
|---|---|---|
| *(auth omitted)* | the common case | zero-config; `omnigraph login` populates keychain `omnigraph:<name>` |
| `{ env: VAR }` | secrets-manager / CI injects a fixed var | **self-documenting** — config states the source; = the legacy `bearer_token_env` |
| `{ file: PATH }` | k8s/docker secret mounted as a file | no env plumbing |
| `{ command: [...] }` | Vault, cloud IAM, `gh auth token` | **dynamic tokens** — first-class exec, the capability pure-env/keychain can't give (kube `exec` / AWS `credential_process`) |
| `{ keychain: ENTRY }` | pin a non-default keychain entry | explicit override of the name-derived default |
**Resolution per server:** if `auth.token:` is set, use that source (no fallthrough). Else the **implicit chain**: `OMNIGRAPH_TOKEN_<NAME>` (or `OMNIGRAPH_TOKEN` for the active server) → keychain `omnigraph:<name>``[<name>]` in `~/.omnigraph/credentials` (`0600`, git-ignored). `omnigraph login <server>` writes/rotates only that server's secret; per-server precedence is independent; sharing is opt-in (same env var or source). The `command` source runs locally with the operator's own privileges and is defined only in operator-owned config (never server-supplied), so it adds no remote-execution surface. The `auth:` wrapper is method-agnostic so adding mTLS/OIDC later is a new sibling key, not a breaking re-key (Hyrum's Law: the field name is a contract once shipped). There is **no `credentials.yaml`** and **no inlined secret**. *Convention for the floor, explicit for control — and explicit is legible to agents and never inlines a secret.*
**Back-compat.** The shipped per-graph `bearer_token_env` + `auth.env_file` dotenv (`resolve_remote_bearer_token`, real-env-wins) keeps working unchanged for existing single-server setups; `bearer_token_env` is just the legacy flat alias for `auth: { token: { env } }`. Resolution tries an explicit `auth.token:` (or legacy `bearer_token_env`) first, then the keyed-by-name chain — so nothing breaks, but the zero-config default is the no-boilerplate keyed-by-name path. (MR-971 — the `bearer_token_env` parity gap — is where this resolver work lands.)
**Three connection tiers** (Supabase/Prisma teach the zero-config floor):
1. **Env vars**`OMNIGRAPH_SERVER=https://…` + `OMNIGRAPH_TOKEN=…`: zero-config remote, no file (the `DATABASE_URL` floor).
2. **Global `config.yaml`** — named `servers:` + `graphs:` for multi-server setups (the AWS-profiles convenience).
3. **Project `omnigraph.yaml`** — project-pinned targets/graphs, committed.
**Keep `omnigraph.yaml` a *portable* manifest (12-factor).** Deploy-specific runtime that varies per environment — the **bind host/port**, worker counts — should be supplied by **`--bind` / `OMNIGRAPH_BIND` (flags/env)**, *not* a committed `server.bind:` baked into the manifest. A manifest that hardcodes `0.0.0.0:8080` is not portable across deploys and leaks an environment detail into a version-controlled file. The same-named `omnigraph.yaml` stays portable across deploys precisely because the volatile, per-environment knobs live in env/flags (12-factor config), while the stable, portable definition (graphs, queries, policy) lives in the file. This is the one concrete lesson taken from kube's model-B without adopting its file split: portability via env/flags, not via a second file.
### 6. Where stored queries live: defined locally, invoked remotely
A stored query splits across two axes; do not conflate them:
- **Definition** (`.gq` source + `queries:` entry) lives next to the **embedded graph entry that owns it**. For a hosted remote graph, that is the **deployment manifest** read by `omnigraph-server`; for a personal embedded graph, it may be the user's own config. It never lives on a client-side `Remote` graph entry.
- **Discovery** ("what tools exist for me?") is fetched from the **server** (Cedar-filtered `GET /queries` / MCP catalog) at connect time.
- **Invocation** is **remote** (client → server, HTTP/MCP) — or **embedded** (the CLI opens the graph directly and reads the same manifest).
For remote use, the client carries *pointers to servers*, not query definitions; it **discovers and invokes**, never defines. This is the **capability-as-code guarantee for agents**: an agent can only invoke tools the server's *committed, reviewed* config exposes — it **cannot define a new tool at runtime**. Definition is structurally outside the agent's reach.
`queries:` (graph-capability registry, Cedar-gated when served remotely, MCP-visible when exposed) and `aliases:` (client CLI shortcut) overlap — both can name `.gq`-backed operations. This RFC keeps them siblings (the MR-969 decision); the clean long-term is **one registry, two invocation surfaces** (embedded + remote), with `aliases:` subsumed. Out of scope here.
#### Reconciling `aliases:` with the role model
`aliases:` is the pre-MR-969, **client-role, embedded-only, ungated** ancestor of `queries:`. An alias bundles `command` (read/change), `query` (`.gq` path), `name` (symbol), `args` (positional param names), and `graph`/`branch`/`format` defaults; the CLI runs it embedded. The server never reads it. So:
- **Role:** `aliases:` is **client-role** (CLI behavior) → it may live in **both** the user-global `config.yaml` and the project manifest, layered. `queries:` is **graph-capability role** → it lives only on an `Embedded` graph entry, and for remote server graphs that means the server deployment manifest. *Who opens the graph determines where query definitions can live.*
- **Difference:** `aliases:` = embedded invocation, no gating, explicit `command`, bundles client defaults + positional args. `queries:` = remote (+future embedded), Cedar + `mcp.expose`, **infers** read/mutate, bundles only MCP settings.
- **Convergence:** decompose an alias — *definition* (name→.gq+symbol) → `queries:` (the superset: typed, validated, gated, multi-surface, no redundant `command`); *target/branch/format* → client invocation context (`--target`/`--branch`/`--format` or `defaults:`), not baked per-query; *positional `args`* → thin CLI sugar or dropped (agents/services use named JSON params). End-state: one `queries:` registry + the client config model subsumes `aliases:`.
- **Validation:** a file-backed alias (`query: ./foo.gq`) may target only an embedded graph. A remote graph shortcut must be explicit that it invokes a server-owned stored query, e.g. `invoke: find_user`, so the client cannot smuggle a new `.gq` definition into a remote capability surface.
- **v1:** keep `aliases:` unchanged. Footgun worth a load-time warn: an alias and a query with the same name in one manifest are different namespaces invoked differently (`--alias X` vs `POST /queries/X`).
```yaml
aliases:
local_owner:
command: query
query: ./queries/owner.gq
name: owner
graph: dev # valid only if `dev` resolves Embedded
remote_owner:
invoke: find_user
graph: prod # valid only if `prod` resolves Remote; source lives on the server
args: [name]
```
### 7. CLI surface
- `omnigraph login <server>` — interactive auth; stores the token keyed by server name in the OS keychain (`omnigraph:<server>`) or the `[<server>]` profile of `~/.omnigraph/credentials` (0600). The `gh auth login` analog.
- `omnigraph use <graph>` — set the active graph (writes the appropriate layer). The `kubectl config use-context` analog.
- `omnigraph config view [--resolved] [--show-origin] [<graph>]` — print the merged config and, with `--resolved`, the final tuple **plus the origin layer of every field** (the `git config --show-origin` / `kubectl config view` analog). Resolution is never a mystery.
- All existing verbs (`query`, `mutate`, `load`, `schema`, `branch`, …) gain `--graph <name>`; resolution decides embedded vs remote transparently.
### 7.5 Init, login, and bootstrap — three tiers (folds in the Q2 design)
Scaffolding splits into three tiers by *scope* and *fatness*, mirroring the field (supabase `init` vs `login`; HelixDB thin `init` vs fat `chef`). Most of this lives in sibling tickets; this RFC owns only the **user route**.
| Tier | Command | Scope | What it does | Model | Status |
|---|---|---|---|---|---|
| **User route** | `omnigraph login [<server>]` | user (`~/.omnigraph/`) | auth + write `~/.omnigraph/config.yaml` / `credentials`; first-run global setup | gh / supabase `login` | **this RFC** (unbuilt) |
| **Thin project init** | `omnigraph init` | project, in-place | create graph + `scaffold_config_if_missing` (`omnigraph.yaml` + minimal `.pg`/`.gq`); refuse-if-exists or `--force` | `cargo init`, `prisma init` | exists; `--force` purge = MR-975 |
| **Fat bootstrap** | `omnigraph quickstart [--template <t>] [--auto]` | project, possibly new-dir | scaffold + seed data + `serve start` + agent prompt file | HelixDB `chef`, `create-next-app` | MR-973 (unbuilt) |
**Design positions** (first-principles, since none of the fat tier is built):
- **Split `init` (project) from `login` (user)** — never one command writing to both `$HOME` and the project (the supabase line, not the dbt line). `init`=project scaffold; `login`=user credential + global config.
- **`init` is in-place + refuse-if-exists** (cargo/prisma/terraform default): don't clobber; adopt existing files; require `--force` to overwrite (and `--force` purges Lance state per MR-975).
- **Interactive for humans, `--auto`/agent-mode for automation** (npm `-y`, create-* `--CI`, MR-981 `--machine`). In `OMNIGRAPH_AGENT_MODE` any prompt → fail with a repair hint.
- **Templates are a `--template <name>` flag on the fat tier** (create-vite model), with the *content* (schema + queries + seed) coming from a template source. Mechanism is a design question (bundled-in vs `og template pull` from a repo vs `npm create-*`-style delegation) — **not** an existing foothold (MR-581 stale). Lean: a small set of bundled templates first (generic `Person→Knows`, plus promote `omnigraph-intel-bootstrap`), `--template <github>` later.
- **`init`/`quickstart` can scaffold the `graphs:` map with one or more entries**; "init with specific graphs" = the scaffolded `graphs:` block (embedded `storage:` locally; the agent/operator adds remote `server:` entries via `login` + editing).
- **Secrets-on-scaffold rule** (prisma/dbt/supabase all do this): anything that writes a token also keeps it out of VCS. `login` prefers the OS keychain (no file); the `~/.omnigraph/credentials` profile fallback is `0600` and git-ignored, and any project-local `.env`-shaped file gets a `.gitignore` entry.
### 8. Concrete shape
**Global** `~/.omnigraph/config.yaml` (per-user, secret-free):
```yaml
servers: # endpoint only — token is keyed by the server name
prod-us: { endpoint: https://og-us.internal:8080 }
prod-eu: { endpoint: https://og-eu.internal:8080 }
staging: { endpoint: https://og-staging.internal:8080 }
graphs:
personal: { storage: ~/graphs/personal.omni }
defaults:
graph: personal
aliases:
my_people:
command: query
query: ~/queries/people.gq
name: list_people
graph: personal
```
**Project client** `./omnigraph.yaml` (committed, secret-free, portable — no `server.bind`). Note the shipped noun is `graphs:` (MR-603); an entry is embedded (`storage:`) XOR remote (`server:` + `graph_id:`, §1.1):
```yaml
graphs:
dev: { storage: s3://team-bucket/dev.omni, branch: main } # embedded
staging: { server: staging, graph_id: prod, branch: review } # remote → graph `prod` on server `staging`
prod-us: { server: prod-us, graph_id: production }
prod-eu: { server: prod-eu, graph_id: production } # multi-homed: same graph, another server
defaults: { graph: dev, output_format: table }
aliases:
owner:
command: query
query: ./queries/owner.gq
name: owner
args: [name]
graph: dev
```
Select with `--graph <name>` (shipped flag, MR-603).
**Server deployment** `./omnigraph.yaml` (committed in the deploy repo, read by `omnigraph-server`). Every served graph is an embedded storage locator; server-owned policy and stored-query definitions live here:
```yaml
graphs:
production:
storage: s3://team-bucket/prod.omni
policy:
file: ./policies/prod.yaml
queries:
find_user:
file: ./queries/find_user.gq
mcp: { expose: true, tool_name: lookup_user }
server:
policy:
file: ./policies/server.yaml
```
**Credentials** are keyed by server name — `omnigraph login prod-us` writes the OS keychain entry `omnigraph:prod-us` (or a `[prod-us]` profile in `~/.omnigraph/credentials`, 0600, git-ignored); `OMNIGRAPH_TOKEN_PROD_US` overrides for CI. No token fields in any config file; no committable secrets.
## DX
1. **One command surface, two loci.** `query --graph dev` (embedded) and `--graph staging` (remote) are the same command; only resolution differs. Change one word, not a mental model.
2. **Clone-and-go.** Project config names servers+graphs; teammate runs `omnigraph login staging` once and every target resolves. The git + `gh auth login` model.
3. **Multi-server × multi-graph is the default.** Remote graph entries reference `server` by name; `servers` is a global named map; graphs are per-server. `prod-us` and `prod-eu` both serving `production` is two graph entries — Helix cannot express this.
4. **Solo-first.** Everything in `~`, no project required.
5. **Laptop-to-fleet on one schema.** Local = one `omnigraph.yaml` (both roles); prod = role-split across repos. No second format to learn.
## AX (agent experience)
1. **One flat resolved context, never a config to navigate.** target→server→endpoint→token resolves *before* the agent sees anything. The agent reasons about tools, not topology (the LLM-safe-surface principle extended to config).
2. **Secrets are structurally outside the agent's reach.** The repo it operates in has no tokens; they are in the global layer / keychain, outside its view. An agent *cannot* exfiltrate a prod token from project config because it is not there.
3. **Branch/snapshot-pinned contexts** (E4) — hand an agent a `branch: review` / `--snapshot v42` target and its reads are reproducible and cannot see uncommitted main-line state. No kubeconfig analog.
4. **The agent's capabilities are a GitOps'd artifact** (E6) — which graphs exist, which stored-query tools it may call, and which Cedar rules gate them are all in the version-controlled server config. Powers change only via a reviewed PR, deployed by restart. Infrastructure-as-code for what the AI can do.
5. **Config + policy compose.** Config = "where am I pointed + which token"; Cedar = "what may I do there." Orthogonal; no enforcement logic leaks into config.
## GitOps — three surfaces, secrets in none
| Surface | Repo | Contents | Deploy | Secrets |
|---|---|---|---|---|
| Server deployment config | infra/deploy repo | `graphs:`, policy, **`queries:` + `.gq` files** | commit → CI → **server restart** (no hot reload) | none — by-reference |
| Project client config | app repo | `graphs:` → embedded storage or remote server+graph | committed, read by CLI/agent | none |
| Global user config | **not GitOps'd** — machine-local `~` | `servers:` + creds-by-ref | `omnigraph login` writes it | refs only (like `~/.kube/config`) |
## Comparison
| Property | kubeconfig | Helix | git | compose | **OmniGraph (this RFC)** |
|---|---|---|---|---|---|
| Named remote endpoints + creds-by-ref | ✅ | ✅ | partial | partial | ✅ (global `servers`) |
| Global + project layering, uniform schema | ✗ | ✗ | ✅ | ✗ | ✅ |
| Embedded OR remote under one name | ✗ | ✗ | n/a | ✗ | ✅ (E1) |
| Multi-server × multi-graph | ✅ | ✗ | n/a | n/a | ✅ (E2) |
| Branch/snapshot in the address | ✗ | ✗ | partial | ✗ | ✅ (E4) |
| Agent tool surface in the repo | ✗ | ✗ (separate bundle) | n/a | n/a | ✅ (E6) |
| Project manifest renamed by role | — | no | — | no | **no** |
| Concept count | 3 | 1 | 2 | 1 | **2 (servers/targets)** |
## Migration / backwards compatibility
- **Additive.** Today's `omnigraph.yaml` (`graphs:`, `cli:`, `server:`, `aliases:`, `policy:`) keeps working unchanged. `graphs:` entries are equivalent to embedded `targets:` with a `storage:` (shipped `uri:` is a deprecated alias); both resolve.
- **`targets:` is new** and optional. `servers:` is new and optional. Absent → today's behavior.
- **Global `~/.omnigraph/config.yaml` is new.** Absent → only project + env + flags, exactly as now. Its addition is the **global-first posture flip**: today the CLI is project-anchored (reads `./omnigraph.yaml`, no parent walk); the global config becomes the new primary discovery path so the CLI works with no project file. Existing project-only workflows are unchanged (project still overrides global); the flip is additive — it adds a fallback layer below the project file, it does not remove the project file.
- **`graphs:``targets:` is an evolution, not a break.** Both can coexist; `targets:` is the superset (adds remote + branch pinning). A future cleanup may alias `graphs:` to embedded `targets:`.
- **`server.bind` stays supported** but documentation steers operators to `--bind` / `OMNIGRAPH_BIND` for portability; no removal.
- **Credentials: keyed-by-name is new; `bearer_token_env` is the compat path.** The primary design (keychain / `[<server>]` profile / `OMNIGRAPH_TOKEN_<SERVER>`) is new resolver work (lands on MR-971). The shipped `bearer_token_env` + `auth.env_file` dotenv (`resolve_remote_bearer_token`) is **unchanged and still honored** — existing single-server dotenv setups keep working, and the resolver honors an explicit `auth: { token: {...} }` source (env/file/command/keychain) with `bearer_token_env` as its flat legacy alias. No `credentials.yaml`.
- **Validation tightens invalid mixes, not valid legacy use.** Top-level `policy:` / `queries:` remain only for anonymous bare-URI compatibility. Named graphs use per-entry fields. Remote graph entries with local `policy:` / `queries:` and server manifests with `server:` graph locators are rejected because there is no correct way to honor those fields.
## Open questions
- **`graphs:` vs `targets:` naming churn.** Do we rename `graphs:``targets:` (with a deprecation alias) or keep `graphs:` for embedded and add `targets:` for remote? Leaning: keep both, document `targets:` as the superset.
- **Keychain integration scope.** Keychain is now the *primary* credential store (§5), so this is on the critical path, not optional: macOS Keychain first (matches operator practice) with the `0600` `[<server>]` profile file as fallback; Linux Secret Service / `pass` later. Open: which keyring crate, and the exact `OMNIGRAPH_TOKEN_<SERVER>` name-derivation (upper-snake, non-alnum → `_`).
- **Project-local `servers:`.** Allowed (e.g. a localhost dev server), merged with global. Confirm creds stay by-reference even for project-local servers (yes).
- **`aliases:``queries:` convergence.** Out of scope here; tracked separately. One registry with embedded + remote invocation surfaces is the target end state.
- **Single-file `KUBECONFIG`-style list.** Do we support `OMNIGRAPH_CONFIG` pointing at multiple files (colon-joined), or a single file only? Start single; revisit if demand appears.
## Implementation — breadboard + slices (Shape A)
Shaped via requirements + a fit check (Shape A — global-first layered config + unified `graphs:` entry + three-tier init — selected over a project-first minimal option and a Helix-clone). This section breadboards A and slices it. **Bold** = NEW.
### Places
| # | Place | What |
|---|---|---|
| P1 | Disk | `~/.omnigraph/{config.yaml, credentials, cache/, state/}` + project `omnigraph.yaml` + `.env.omni` |
| P2 | Config resolution | runs on every command: load layers → merge → resolve `--graph` |
| P3 | Command execution | embedded engine OR remote HTTP client |
| P4 | Remote `omnigraph-server` | existing HTTP surface (`/query`, `/mutate`, `/queries/{name}`) |
| P5 | Scaffold | `login` / `init` / `quickstart` |
### Affordances
| # | Place | Affordance | NEW? | Wires |
|---|---|---|---|---|
| U1 | P1 | `~/.omnigraph/config.yaml` (operator edits) | **N** | → N1 |
| U2 | P1 | project `./omnigraph.yaml` | — | → N1 |
| U3 | P1 | `~/.omnigraph/credentials` / `.env.omni` dotenv (secrets, git-ignored) | — | → N4 |
| U4 | P3 | `omnigraph <verb> --graph <name>` (any command) | — | → N14 |
| U5 | P5 | `omnigraph login [<server>]` | **N** | → N11 |
| U6 | P5 | `omnigraph init` / `quickstart [--template]` | partly | → N12 / N13 |
| U7 | P2 | `omnigraph config view --resolved --show-origin` | **N** | → N10 |
| N1 | P2 | `load_layered_config()` — global (N3) + project (cwd), serde each | **N** | → N2 |
| N2 | P2 | **merge engine** — deep-merge settings; replace named-resource entries; replace lists; **retain provenance** and raw field origins | **N⚠** | → N5, → S_merged |
| N3 | P2 | global-dir resolver — `OMNIGRAPH_HOME` else `~/.omnigraph/` | **N** | → N1 |
| N4 | P2 | `load_env_file_into_process` — dotenv, real-env-wins (existing) | — | → N9 |
| N5 | P2 | `resolve_graph(name, merged)` → typed `Embedded`/`Remote` locator; rejects invalid role/field combinations before execution | **N⚠** | → N6 |
| N6 | P3 | `GraphConn``Embedded(engine)` \| `Remote(http)` dispatch | **N⚠** | → N7, → N8 |
| N7 | P3 | embedded path — `Omnigraph::open(uri)` (existing) | — | → engine |
| N8 | P3 | **HTTP-client path** — POST `/query`/`/mutate`/`/queries/{name}` | **N⚠** | → P4, → N9 |
| N9 | P2 | `resolve_bearer_token(server)` — explicit `auth.token` source if set, else **keyed by name**: `OMNIGRAPH_TOKEN_<NAME>`/`OMNIGRAPH_TOKEN` → keychain `omnigraph:<name>``[<name>]` profile; legacy `bearer_token_env`/dotenv (MR-971) | **N⚠** | → N8 |
| N10 | P2 | `config view` handler — merged + per-field origin (needs N2 provenance) | **N** | → U7 |
| N11 | P5 | `login` handler — interactive auth → write `config.yaml` + `credentials` (0600) + `.gitignore` | **N⚠** | → S_global |
| N12 | P5 | `init` handler — `scaffold_config_if_missing` + create graph; refuse-if-exists/`--force` purge (MR-975) | partly | → S_project |
| N13 | P5 | `quickstart` handler — scaffold + `--template` + seed + `serve start` + agent prompt (MR-973; needs serve MR-970) | **N⚠** | → S_project |
| N14 | P3 | agent-mode wrapper — `--machine`/`OMNIGRAPH_AGENT_MODE`: JSON, structured errors, never-prompt, typed exit codes (MR-981) | **N⚠** | → N1 |
| S_global | P1 | `~/.omnigraph/config.yaml` + `credentials` | **N** | read by N1/N9 |
| S_project | P1 | `./omnigraph.yaml` + `.env.omni` | — | read by N1/N4 |
| S_merged | P2 | in-memory resolved config (per command, with provenance) | **N** | read by N5/N10 |
| S_cache | P1 | `~/.omnigraph/cache/` (remote catalogs) | **N** | read by N8 |
```mermaid
flowchart TB
subgraph P1["P1: Disk"]
U1["U1: ~/.omnigraph/config.yaml"]
U2["U2: ./omnigraph.yaml"]
U3["U3: credentials dotenv"]
end
subgraph P2["P2: Config resolution"]
N3["N3: global-dir (OMNIGRAPH_HOME)"]
N1["N1: load_layered_config"]
N2["N2: merge engine (+provenance)"]
N4["N4: dotenv loader"]
N5["N5: resolve_graph(--graph)"]
N9["N9: resolve_bearer_token"]
N10["N10: config view"]
end
subgraph P3["P3: Command execution"]
U4["U4: omnigraph <verb> --graph"]
N14["N14: agent-mode wrapper"]
N6["N6: GraphConn embedded|remote"]
N7["N7: embedded Omnigraph::open"]
N8["N8: HTTP-client POST"]
end
subgraph P5["P5: Scaffold"]
U5["U5: login"]; U6["U6: init/quickstart"]
N11["N11: login handler"]; N12["N12: init"]; N13["N13: quickstart"]
end
P4["P4: remote omnigraph-server"]
U1-->N1; U2-->N1; N3-->N1; N1-->N2-->N5-->N6
U3-->N4-->N9-->N8
U4-->N14-->N1
N6-->N7; N6-->N8-->P4
N2-->N10-->U7["U7: config view --resolved"]
U5-->N11; U6-->N12; U6-->N13
classDef ui fill:#ffb6c1,stroke:#d87093,color:#000
classDef n fill:#d3d3d3,stroke:#808080,color:#000
class U1,U2,U3,U4,U5,U6,U7 ui
class N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N13,N14 n
```
### Slices (vertical, each demo-able)
| # | Slice | Parts/affordances | Demo |
|---|---|---|---|
| **V1** | **Global layer + merge + `config view`** | A1A4 · N1,N2,N3,N10 · U1,U7,S_global,S_merged | Put config in `~/.omnigraph/`, run `omnigraph config view --resolved --show-origin` from any dir → merged result with per-field origin; existing embedded commands work global-first with no project file |
| **V2** | **Remote graphs + HTTP client + creds** | A5A7 · N5,N6,N8,N9 · S_cache | Define a `server:` graph entry; `omnigraph query --graph prod` hits the remote server (`curl`-free); embedded `--graph dev` still local |
| **V3** | **`omnigraph login`** | A8 · N11,U5 | `omnigraph login prod` writes `~/.omnigraph/credentials` (0600) + `.gitignore`; V2 remote query now works with no manual env |
| **V4** | **Thin-init hardening + quickstart + templates** | A9 · N12,N13,U6 (needs serve MR-970) | `omnigraph quickstart --template person-knows` scaffolds + seeds + serves; `init --force` purges (MR-975) |
| **V5** | **Agent-mode** | A10 · N14,U4 (MR-981) | `OMNIGRAPH_AGENT_MODE=1 omnigraph query …` → JSON + structured errors + typed exit codes; never-prompt |
V1 is the foundation (global-first + merge + view). V2 closes the substantive client→server gap. V3 is credential ergonomics. V4/V5 ride sibling tickets (MR-970/973/981). MR-969 (stored queries) ships independently and is reached by N8's `/queries/{name}` once V2 lands.
## Rollout
The slices above are the rollout order: **V1 (global layer + merge) → V2 (remote graphs + HTTP client) → V3 (login) → V4 (quickstart/templates, on MR-970) → V5 (agent-mode, MR-981).** V1V2 close the substantive gap (global-first config + `curl`-free server access); V3V5 are ergonomics that ride sibling tickets. Evaluate after V2 against early-adopter and agent-onboarding (MR-973 / MR-974) signal. The spikes (X1 HTTP-client, X2 merge engine, X3 resolver+provenance, X4 login) resolve before their owning slice.
## Prior art
- kubeconfig (clusters / users / contexts; `KUBECONFIG`; `kubectl config view`)
- Helix CLI v2 (`helix.toml` local+enterprise instance blocks; `~/.helix/config`; `~/.helix/credentials`)
- AWS CLI (`~/.aws/config` + `~/.aws/credentials` split; named profiles; `credential_process`)
- git (`~/.gitconfig` + `.git/config`; `--show-origin`)
- Cargo (`Cargo.toml` manifest + `~/.cargo/config.toml`)
- Supabase / Prisma (one project manifest; connection via `DATABASE_URL` env)
- 12-factor app (config that varies by deploy lives in the environment)

View file

@ -0,0 +1,270 @@
# RFC: MCP Server Surface for `omnigraph-server` — Full Tool Parity, Stored Queries, Modular Auth
**Status:** Proposed
**Date:** 2026-06-01
**Tickets:** MR-969 (stored queries + MCP exposure — the surface this completes), MR-956 (federated auth / WorkOS OAuth — the auth substrate this consumes), MR-971 (per-server credential resolver), MR-974 (agent setup surface — the installer that wires this), MR-668 (multi-graph server — shipped, the routing this builds on)
**Builds on:** [omnigraph#128](https://github.com/ModernRelay/omnigraph/pull/128) (`ragnorc/stored-queries-mcp`) — the shipped stored-query registry, `GET /queries`, `POST /queries/{name}`, and the coarse `invoke_query` gate.
**Supersedes:** the MCP-transport portion of [rfc-001-queries-envelope-mcp.md](rfc-001-queries-envelope-mcp.md) (`/mcp/tools` + `/mcp/invoke`). See [Relationship to RFC-001](#relationship-to-rfc-001).
**Target release:** v0.8.x (phased — see Rollout)
## Summary
Add a first-class **MCP (Model Context Protocol) server surface to `omnigraph-server`**, exposed over **Streamable HTTP**, that projects the server's operations as MCP tools and resources for LLM clients (Claude Code/Desktop/web, Cursor, etc.). Two populations of tools share one projection path:
1. **Built-in operational tools** — parity with the existing `@modernrelay/omnigraph-mcp` stdio package's **13 tools** (`health`, `snapshot`, `read`, `schema_get`, `branches_list`, `commits_list`, `commits_get`, `change`, `ingest`, `branches_create`, `branches_delete`, `branches_merge`, `schema_apply`) and its **2 resources** (`omnigraph://schema`, `omnigraph://branches`), plus a new server-scoped `graphs_list` tool and an `omnigraph://graphs` resource (multi-graph mode).
2. **Dynamic stored-query tools** — one MCP tool per `mcp.expose: true` entry in the `queries:` registry (MR-969 / #128), with parameters typed from the `.gq` declaration via the shipped `query_catalog_entry` / `param_descriptor` projection.
Every tool is **authorized by the server's existing Cedar policy engine**. The MCP layer never implements its own authentication: it consumes an **already-resolved `ResolvedActor`** from the server's bearer middleware (`require_bearer_auth` today; the `TokenVerifier` seam when MR-956 lands), so the **same MCP endpoint serves on-prem (static or customer-OIDC tokens) and our cloud (WorkOS OAuth) by configuration only**. Cloud OAuth is an additive layer (RFC 9728 protected-resource metadata) that slots in with zero MCP changes.
The end-state collapses two diverging tool implementations into one: the in-server MCP is the canonical, Cedar-gated, remotely-reachable surface; the stdio package becomes a thin stdio↔HTTP proxy (local on-ramp) over it.
> **Key caveat, stated up front (see §5.9 below):** the headline "a token scoped via Cedar to a *specific set* of stored queries" requires **per-query `invoke_query` scope**, which is *designed* (rfc-001) but **not yet implemented** — the shipped action is coarse (any stored query on the graph, or none). Per-actor Cedar curation works today for *built-in vs ad-hoc vs admin* tools and for *stored-vs-ad-hoc*; sub-selecting individual stored queries per actor is gated on a prerequisite (PR 0b). Until then, stored-query curation is graph-level (registry membership + `mcp.expose`).
## Relationship to RFC-001
[rfc-001-queries-envelope-mcp.md](rfc-001-queries-envelope-mcp.md) (MR-656 / MR-976 / MR-969) is the parent design for stored queries + the response envelope + MCP. This RFC is the **detailed MCP-transport design** that #128 left for a follow-up, and it **revises rfc-001 in three places where the shipped code or the MCP wire protocol diverged from rfc-001's sketch**:
1. **Transport shape.** rfc-001 sketched `GET /mcp/tools` + `POST /mcp/invoke` (a bespoke REST pair). **That is not the MCP wire protocol — real MCP clients cannot connect to it.** This RFC implements actual MCP JSON-RPC over Streamable HTTP and reuses `query_catalog_entry` as a *projection source*, not a parallel surface. (rfc-001's own Open Question already leaned toward Streamable HTTP.)
2. **Exposure config.** rfc-001 specified inline `.gq` pragmas (`@mcp(expose=…)`, default `expose=false`). **#128 shipped a different mechanism:** YAML `queries.<name>.mcp.expose` in `omnigraph.yaml`, **default `true`** (declaring a query in the manifest *is* the opt-in). This RFC builds on the shipped YAML form; the `.gq`-pragma design in rfc-001 is superseded for exposure.
3. **Schema introspection.** rfc-001 lists "Schema introspection through MCP" as a **non-goal** ("agents see types through declared return shapes"). This RFC **revises that**: the operational-parity tools include `schema_get` and `omnigraph://schema`*because the shipped stdio package already exposes both*. The non-goal is achieved by *policy*, not omission: `schema_get`/`omnigraph://schema` are Cedar-gated by `Read`, and the recommended locked-down agent policy denies `Read`, so a curated agent still never sees the schema. (rfc-001's intent is preserved; the mechanism moves from "don't build it" to "build it, gate it.")
Everything else in rfc-001 (two-paths-one-engine, per-query `invoke_query` *as the intended scope*, the response envelope, multi-graph per-graph endpoints) this RFC consumes unchanged.
> **Numbering note:** the `TokenVerifier`/WorkOS auth design is referred to in code (`crates/omnigraph-server/src/identity.rs`) as "RFC 0001," which is a *different* document from this repo's `docs/dev/rfc-001-queries-envelope-mcp.md`. To avoid the collision this RFC cites the auth substrate as **MR-956** throughout, never "RFC 0001."
## Reconciliation with shipped code (verified against `ragnorc/stored-queries-mcp` HEAD)
Verified against `crates/omnigraph-server/src/{lib.rs,api.rs}` and `crates/omnigraph-policy/src/lib.rs` at the current branch head (not the #128 PR body, and not `api.rs` alone):
- ✅ `GET /queries` returns the `mcp.expose == true` subset as `QueriesCatalogOutput { queries: [QueryCatalogEntry] }`, each with typed `ParamDescriptor`s, `tool_name`, `description`, `instruction`, and a `mutation` flag. **MCP-ready projection, but exposed as bespoke REST/JSON — not the MCP wire protocol.**
- ✅ `POST /queries/{name}` route exists (`server_invoke_query`, `lib.rs`).
- ✅ `query_catalog_entry()` / `param_descriptor()` with an exhaustive `ScalarType → ParamKind` map (a new scalar is a compile error).
- ✅ `InvokeQuery` Cedar action defined in `omnigraph-policy`.
- ✅ **`InvokeQuery` IS enforced** at `POST /queries/{name}`: `server_invoke_query` calls `authorize(PolicyAction::InvokeQuery)` and **masks a denial to a 404 identical to "unknown query"** so the catalog isn't probeable (the denial-masking the previous draft of this RFC reported as missing is shipped — it lives in `lib.rs`, not `api.rs`). The stored-mutation path is already double-gated: `InvokeQuery` outer, then `Change` inside `run_mutate`.
- ✅ **Reuse path exists:** `run_query` / `run_mutate` are already decoupled from their HTTP request bodies and take registry-supplied `(source, name, params, branch/snapshot)`. MCP `tools/call` for both stored and ad-hoc tools delegates to these — no new business logic.
- ❌ **Per-query (`invoke_query[name]`) scope is NOT implemented.** `PolicyRequest` carries only `{action, branch, target_branch}`**no query-name dimension** — and the action is documented coarse ("permits *any* stored query on the graph"). rfc-001 *designed* per-name scope; it is unbuilt. This RFC's per-query Cedar filtering (§5.4) and recommended agent policy (§5.9) depend on it → tracked as **PR 0b**.
- ❌ No MCP protocol surface (`initialize`/`tools/list`/`tools/call`, JSON-RPC, transport).
- ❌ No `TokenVerifier` trait yet — `require_bearer_auth` resolves a `ResolvedActor` inline (static-hash). The trait/`OidcJwtVerifier` are MR-956 (draft). The MCP layer's only requirement — *consume `ResolvedActor`* — is satisfiable today.
Stack (verified `Cargo.toml`): Axum + utoipa (OpenAPI) + `omnigraph-policy` (Cedar) + `futures` + `tokio`. **No MCP crate present.** `edition = "2024"`.
## Motivation
- **One curated, safe, remotely-reachable tool surface.** MR-969's thesis: hand an LLM a token Cedar-scoped to a set of tools and it sees exactly those typed tools — cannot construct ad-hoc queries it isn't permitted, cannot read the schema it isn't permitted, cannot reach other graphs. Today the only MCP is the stdio package: local-only, full surface, ungated.
- **Parity, so the in-server MCP can be the single implementation.** Operators/agents already depend on the operational tools. Supporting them server-side behind one Cedar gate lets the stdio package degrade to a proxy and removes two diverging tool sets.
- **On-prem and cloud from one endpoint.** A managed cloud (WorkOS OAuth) and an on-prem/air-gapped deploy (static or customer-OIDC tokens) must serve the same MCP without forks or MCP-specific auth.
- **Foundation for the agent on-ramp (MR-974).** `omnigraph mcp install --agent <tool>` needs a decided transport + a stable endpoint.
## Goals
- Project built-in tools + stored queries as MCP tools through **one** registry abstraction.
- `tools/list` and the callable set are **identical for argument-independent authorization**, both driven by Cedar (see §5.4 for the branch-scoped caveat).
- The MCP layer is **auth-method-agnostic**: it consumes `ResolvedActor`, never a raw token, never branches on how auth happened.
- The same endpoint works on-prem (static/OIDC) and cloud (WorkOS OAuth), switched by config; cloud OAuth is additive (RFC 9728).
- No new business logic: MCP tools delegate to the same `run_query`/`run_mutate`/branch/schema functions the HTTP routes call.
- Behaviour-neutral when unused: no MCP traffic = no change.
## Non-Goals
- **Building/hosting an OAuth authorization server.** The server is a Resource Server; WorkOS AuthKit+Connect is the AS (MR-956). The MCP endpoint validates tokens, never issues them, never holds client secrets.
- **OAuth/WorkOS implementation itself** — MR-956's work. This RFC leaves a clean RFC-9728 hook and consumes `ResolvedActor`.
- **MCP prompts, elicitation, `tools/list_changed`, resource subscriptions, server-initiated messages.** None needed → enables a stateless POST-only transport (§5.6).
- **stdio transport inside the server.** stdio stays in the TS package (now a proxy).
- **Cross-graph tool listing.** Per-graph catalogs only (MR-969 + RFC-002 non-goal).
- **Hot reload of the query registry.** Restart-only (MR-969).
## Background
`omnigraph-server` (Axum) already implements every operation this RFC exposes as an authenticated HTTP route; each authorizes via a `PolicyAction` against the Cedar policy for a server-resolved actor and calls into the engine. The existing stdio MCP package is a *client* of these routes (it owns no business logic). MR-956 will introduce a `TokenVerifier` trait (`StaticHashTokenVerifier` today inline, `OidcJwtVerifier` for OIDC/WorkOS) producing the `ResolvedActor { actor_id, tenant_id: Option, scopes: Vec<Scope>, source }` that already exists in `identity.rs` and is consumed by Cedar — token *validation* is offline (cached JWKS), so on-prem/air-gapped has no request-path dependency on the cloud.
## Design
### 5.1 One tool model: a `McpTool` trait, two populators
Both built-in and stored-query tools implement one trait so `tools/list` / `tools/call` never special-case:
```rust
trait McpTool: Send + Sync {
fn name(&self) -> &str; // MCP tool id (stable)
fn title(&self) -> Option<&str>;
fn description(&self) -> &str;
fn input_schema(&self) -> serde_json::Value; // JSON Schema (draft 2020-12)
fn annotations(&self) -> ToolAnnotations; // readOnlyHint / destructiveHint / idempotentHint
/// The Cedar request(s) this call requires, given parsed args. Used BOTH at
/// list-time (dry-run filter, default args) and call-time (enforce, real args).
fn authorization(&self, args: &ToolArgs) -> Vec<PolicyRequest>;
async fn call(&self, ctx: &GraphCtx, args: ToolArgs) -> Result<ToolOutput, ToolError>;
}
```
- **Built-ins**: ~14 static impls, each delegating to the *same* function its HTTP route calls (`run_query`, `run_mutate`, branch ops, `apply_schema_as`, …). `input_schema` authored once (or derived from each route's existing `utoipa`/`ToSchema` DTO).
- **Stored queries**: generated `McpTool` instances, one per `mcp.expose` entry; `input_schema` from `param_descriptor` (§5.3); `authorization``InvokeQuery` (coarse today; `InvokeQuery{name}` after PR 0b) then the inner `Read`/`Change`.
`ToolRegistry` for a graph = the static built-ins + the dynamic stored-query tools resolved from that graph's `GraphHandle` registry.
### 5.2 Tool catalog (parity) and Cedar mapping
Each built-in **reuses the exact `PolicyAction` its HTTP route already enforces** — verified against the handlers in `lib.rs`, not invented:
| MCP tool | Scope | Read/Mutate | Cedar action (verified from route) |
|---|---|---|---|
| `health` | server | read | none (liveness/version) |
| `graphs_list` *(new)* | server | read | `GraphList` |
| `snapshot` | graph | read | `Read` |
| `schema_get` | graph | read | `Read` |
| `branches_list` | graph | read | `Read` |
| `commits_list`, `commits_get` | graph | read | `Read` |
| `read` (ad-hoc `.gq`) / `query` *(alias)* | graph | read | `Read` |
| `change` (ad-hoc `.gq`) / `mutate` *(alias)* | graph | mutate | `Change` |
| `ingest` (NDJSON) | graph | mutate | `Change` (+ `BranchCreate` when forking a new branch) |
| `branches_create` | graph | mutate | `BranchCreate` |
| `branches_delete` | graph | mutate | `BranchDelete` |
| `branches_merge` | graph | mutate | `BranchMerge` |
| `schema_apply` (`allow_data_loss`) | graph | mutate | `SchemaApply` |
| **stored query** (`find_user`, …) | graph | inferred | `InvokeQuery` (coarse; `InvokeQuery{name}` after PR 0b) + inner `Read`/`Change` |
There is **no `Ingest` and no separate `snapshot`/`Export` action**`ingest` enforces `Change`, `snapshot` enforces `Read`. (`Export` exists but maps to the `/export` route, which this RFC does not expose as a tool.)
**Tool id parity vs. canonicalization.** The shipped stdio package uses tool ids **`read`/`change`** (and calls the deprecated `/read`,`/change` routes). The server HTTP surface canonicalized to `/query`,`/mutate` with `/read`,`/change` deprecated (MR-656). To keep existing package clients working *and* align with the server, the MCP exposes **`query`/`mutate` as canonical with `read`/`change` retained as deprecated-but-live aliases** (both dispatch to the same handler). Open Q7 asks whether to drop the aliases later.
Resources (§5.5): `omnigraph://schema`, `omnigraph://branches` (parity), plus `omnigraph://graphs` *(new)* — each gated by the same action as its list/get route (`Read`, `Read`, `GraphList`).
### 5.3 `ParamDescriptor → JSON Schema` (stored-query tools)
| `ParamKind` | JSON Schema | Notes |
|---|---|---|
| String | `{"type":"string"}` | |
| Bool | `{"type":"boolean"}` | |
| Int (i32/u32) | `{"type":"integer"}` | |
| BigInt (i64/u64) | `{"type":"string","pattern":"^-?\\d+$"}` | JSON numbers lose precision >2⁵³ → string (matches the shipped `api.rs` rationale). (Open Q1) |
| Float (f32/f64) | `{"type":"number"}` | |
| Date | `{"type":"string","format":"date"}` | |
| DateTime | `{"type":"string","format":"date-time"}` | |
| Blob | `{"type":"string","contentEncoding":"base64"}` | |
| Vector | `{"type":"array","items":{"type":"number"},"minItems":dim,"maxItems":dim}` | uses `vector_dim` |
| List | `{"type":"array","items":<item_kind schema>}` | scalar items only (grammar guarantees) |
`nullable == false` → param is in `required`. Annotations: `mutation``{readOnlyHint:false, destructiveHint:true}`; else `{readOnlyHint:true}`. `description` → tool description; `instruction` → appended to description (or `_meta`). (The shipped `check()` already warns when an `mcp.expose` query declares a `Vector` param an LLM can't supply.)
For built-in tools the schema is hand-authored from the route DTO; e.g. `query``{source: string, branch?: string, params?: object}`; `schema_apply``{schema: string, allow_data_loss?: boolean}`; `ingest``{ndjson: string, mode?: "merge"|"append"|"overwrite", branch?: string}`.
### 5.4 `tools/list` (Cedar-filtered) and `tools/call` (dispatch + masking)
- **`tools/list`**: build the `ToolRegistry`; for each tool evaluate `authorization(default_args)` against the actor's Cedar policy; **emit only tools that authorize**. Authz decisions memoized per request. Stored-query tools additionally require `mcp.expose: true`.
- **Exactness caveat (R7 is conditional):** the listed set equals the callable set **only for tools whose authorization is argument-independent** (`health`, `graphs_list`, `snapshot`, `schema_get`, `branches_list`, `commits_*`, ad-hoc `query`/`mutate`, and stored queries under the *coarse* action). For **branch-scoped tools** (`branches_create`/`merge` with `target_branch_scope`, and any branch-scoped `Read`/`Change` rule), list-time uses `default_args` (e.g. branch `main`) and cannot know the real target, so the listed set is a *best-effort approximation* of callability — a call may still be denied (or, rarely, a hidden tool would have been allowed). `tools/call` is always the authoritative gate. The contract is: **list never shows a tool the actor can't ever call; for branch-scoped tools it may show one the actor can call only on some branches.**
- **`tools/call`**: resolve `name``McpTool` (masked-404 if unknown *or* `mcp.expose:false`); parse+validate args against `input_schema`; enforce `authorization(args)` (mutations stay double-gated: `InvokeQuery` then `Change`); on success `call`. **Denial masking** lives in one place (the dispatcher): an authz denial is returned identically to "unknown tool" (§5.10), reusing the same deny≡missing principle already shipped at `POST /queries/{name}`.
### 5.5 Resources
Advertise `resources` capability (`subscribe:false, listChanged:false`). `resources/list` → the URIs the actor may read; `resources/read` → schema `.pg` text / branches JSON / (multi-graph) graphs JSON, each gated by the corresponding action (`Read`, `Read`, `GraphList`). A locked-down agent denied `Read` simply never sees `omnigraph://schema` or `omnigraph://branches` — this is how rfc-001's "agents don't introspect schema" intent is met *by policy* (§Relationship-to-RFC-001).
### 5.6 Transport: Streamable HTTP, stateless, POST-only
- **Streamable HTTP** (MCP's current standard; we're already an HTTP server). One endpoint per scope (§5.7).
- Because the server emits **no** server-initiated messages, implement the **minimal conformant** shape: client `POST`s JSON-RPC, server replies `application/json`. **No SSE channel, no `Mcp-Session-Id`, stateless** — each request authenticated independently via the bearer middleware. Honour the `MCP-Protocol-Version` header. SSE/sessions can be added later if subscriptions land.
- **JSON-RPC methods:** `initialize` (advertise `{tools:{listChanged:false}, resources:{listChanged:false, subscribe:false}}` + serverInfo/version), `notifications/initialized` (no-op ack), `ping`, `tools/list`, `tools/call`, `resources/list`, `resources/read`. `prompts/list` returns empty if probed.
- **Library decision (Open Q2):** spike `rmcp` (official Rust MCP SDK) for conformance + Streamable-HTTP/Axum on edition 2024; **fall back to a hand-rolled ~150 LOC JSON-RPC-over-POST** (only the methods above) on friction. Given the tiny surface, hand-roll is an acceptable default.
### 5.7 Endpoint routing (server- vs graph-scoped)
- **Single-graph mode:** `POST /mcp` — graph tools + server tools (`health`, `graphs_list`).
- **Multi-graph mode (MR-668):** `POST /graphs/{graph_id}/mcp` — graph-scoped tools for that graph; plus a server-level `POST /mcp` exposing only server-scoped tools (`health`, `graphs_list`). A per-graph endpoint never lists another graph's tools (isolation, tested). Mirrors the shipped `/graphs/{graph_id}/…` cluster routing. (Open Q5: confirm naming + whether server tools also appear on the per-graph endpoint.)
### 5.8 Modular / decoupled auth (the cross-cutting requirement)
**Invariant (load-bearing, satisfiable today):** the MCP handler receives an **already-resolved `ResolvedActor`** and **branches on nothing** about how the token was verified. No token parsing, no method check, no OAuth inside the MCP module. Today that actor comes from `require_bearer_auth`; when MR-956 lands it comes from a `TokenVerifier` — the MCP code is identical either way.
```
request → [auth middleware: ResolvedActor] → [MCP route] → Cedar → McpTool
```
**Server side — auth is config, not code:**
| Deployment | Verifier | MCP change |
|---|---|---|
| On-prem, static bearer | `require_bearer_auth` / `StaticHashTokenVerifier` | none |
| On-prem, customer IdP | `OidcJwtVerifier` → customer issuer (MR-956) | none |
| Our cloud | `OidcJwtVerifier` → WorkOS, `tenant_id = Some(org_id)` (MR-956) | none |
Token validation is offline (cached JWKS) — on-prem/air-gapped keeps working with no request-path cloud dependency. The MCP endpoint never terminates OAuth and never holds a client secret (Resource Server only).
**Cloud client negotiation — additive, no MCP changes:** when MR-956 lands, the server publishes RFC 9728 `/.well-known/oauth-protected-resource` and returns `WWW-Authenticate: Bearer ..., resource_metadata="..."` on 401. A compliant MCP client (Claude) then auto-negotiates: static bearer to an on-prem endpoint; on a cloud 401 it discovers the WorkOS AS and runs OAuth/PKCE itself — **same endpoint URL, zero client-side branching.** This RFC only requires that MCP routes flow through the standard 401 path so that hook can be added later without touching MCP.
**Multi-user identity pass-through (cloud):** the *caller's* token (a WorkOS JWT, audience-bound per-tenant) must reach the server so Cedar enforces per-user/per-tenant policy — never a shared service token. The MCP endpoint validates it offline and maps `org_id → tenant_id`. This is why the **remote path is the in-server HTTP MCP that Claude connects to directly** (its token flows through), not a stdio bridge impersonating a user.
**Client-side credential acquisition (CLI/SDK/proxy) — pluggable `CredentialSource`** (RFC-002 §5, MR-971), keyed by server name, so OAuth is a future *sibling key*, not a re-key:
```yaml
servers:
onprem: { endpoint: https://og.internal:8080, auth: { token: { env: OG_TOKEN } } }
edge: { endpoint: https://og-edge, auth: { token: { command: [vault, read, -field=token, secret/og] } } }
cloud: { endpoint: https://api.omnigraph.cloud, auth: { oauth: { issuer: workos } } } # future sibling
```
Implicit chain when `auth:` omitted: `OMNIGRAPH_TOKEN_<NAME>` → keychain `omnigraph:<name>``[<name>]` in `~/.omnigraph/credentials`; legacy `bearer_token_env` honoured. Secrets never inlined.
### 5.9 Safety model — Cedar is the gate, default-deny is the floor
With ad-hoc `query`/`mutate`/`schema_apply` present as tools, the **only** thing protecting an untrusted agent is the Cedar policy. Therefore:
- **Default-deny when tokens are configured** (MR-723, shipped) is the floor — an actor with no grants sees an empty tool list.
- **What works today (coarse action):** a policy can hide all ad-hoc tools and admin tools per-actor (`deny Read, Change, SchemaApply, Branch*`) while allowing stored queries (`allow InvokeQuery`). That already reproduces "can't run ad-hoc, can't read schema, can only call stored queries" — the agent sees *every* exposed stored query plus nothing else.
- **What needs PR 0b (per-query scope):** selecting *which* stored queries an actor may call (`allow InvokeQuery [find_user, list_orders]`, deny the rest). The shipped `invoke_query` is coarse (all stored queries or none). Until PR 0b adds a query-name dimension to `PolicyRequest` + the Cedar schema (rfc-001's intended design), per-actor sub-selection of stored queries is **not expressible**; curation is graph-level (which `.gq` files are registered + `mcp.expose`).
- `schema_apply`, `branches_delete`, ad-hoc `mutate` require an explicit admin-tier grant; never in a default agent policy.
- (Open Q3) Optional `mcp.allow_adhoc` server switch defaulting **off** for the ad-hoc `query`/`mutate` tools — defence-in-depth independent of Cedar, and independent of PR 0b.
### 5.10 Result shaping and error mapping
- **Success:** `tools/call` returns `content: [{type:"text", text:<json>}]` where `<json>` is the route's existing output envelope (read rows / mutation summary, i.e. `ReadOutput` / `ChangeOutput`). (Open Q4: also emit `structuredContent` + `outputSchema` — defer; text-JSON for v1.)
- **Tool execution error** (bad params after schema validation, engine error): result with `isError:true` + a text content block.
- **Authorization denial / unknown tool / `mcp.expose:false`:** a single JSON-RPC error (`-32602`, message `"unknown tool"`) — identical for all three so policy isn't probeable (same principle as the shipped `POST /queries/{name}` 404 masking).
- **Auth failure** (bad/absent bearer): HTTP 401 from the middleware *before* MCP — carries `WWW-Authenticate` (the RFC 9728 hook), never masked as a tool error. (This is exactly the path the shipped `authorize`/`authorize_request` split preserves: operational failures keep their status; only *denials* are masked.)
## Relationship to the `@modernrelay/omnigraph-mcp` stdio package
Verified surface of the package (`omnigraph-ts`, pkg version `0.3.0`, `@modelcontextprotocol/sdk@^1.29.0`, **stdio only**): **13 tools** (`health`, `snapshot`, `read`, `schema_get`, `branches_list`, `commits_list`, `commits_get`, `change`, `ingest`, `branches_create`, `branches_delete`, `branches_merge`, `schema_apply`) and **2 resources** (`omnigraph://schema`, `omnigraph://branches`). It is a thin client over the SDK → HTTP routes and **forwards the caller's bearer verbatim** (no inspection).
Once parity lands, **collapse to one implementation**: the in-server MCP is canonical (Cedar-gated, remote-capable, the path that becomes a Claude-web connector via MR-956). The stdio package degrades to a **thin stdio↔HTTP proxy** forwarding JSON-RPC (and the incoming `Authorization`) to `/mcp` — staying the local on-ramp for Claude Code/Desktop while sharing one tool set, one Cedar gate. Transition: keep the current independent stdio package on its `0.3.x`/`0.6.x` line; ship proxy mode in a later TS minor once the server endpoint is GA. (Note: the package is currently several minors behind the server — its vendored `spec/openapi.json` predates the stored-query routes — so it needs the standard re-sync regardless of MCP work.)
## Testing
- **Protocol conformance:** `initialize` handshake + advertised capabilities; `tools/list` shape; `tools/call` happy path; JSON-RPC error envelopes (`-32601` unknown method, `-32602` invalid params / unknown tool); `resources/list` + `resources/read`.
- **Cedar filtering (coarse, today):** an actor with `allow InvokeQuery` + `deny Read/Change` sees *all* exposed stored queries but **not** `query`/`mutate`/`schema_get`; `tools/call query` returns masked "unknown tool"; an admin sees the full catalog.
- **Cedar filtering (per-query, gated on PR 0b):** actor scoped to `InvokeQuery [find_user]` sees *only* `find_user`; `tools/call list_orders` masks. **This test ships with PR 0b**, not PR 1 — it cannot pass against the coarse action.
- **Parity per built-in:** each tool round-trips against the same expectations as its HTTP route (reuse route tests); `read`/`change` aliases dispatch identically to `query`/`mutate`.
- **Double-gating:** a stored mutation requires both `InvokeQuery` and `Change`; `schema_apply` requires `SchemaApply`.
- **`mcp.expose:false`:** absent from `GET /queries` and MCP `tools/list`; still service-callable by name through `POST /queries/{name}` when the actor has `invoke_query`, but not MCP-callable.
- **Schema generation:** table-driven over every `ParamKind` incl. nullable / list / vector(dim).
- **Branch-scoped list approximation:** assert the documented R7 caveat — a branch-scoped policy lists `branches_create`, and `tools/call` is the authoritative gate (a denied target still 403s/masks).
- **Multi-graph isolation:** `/graphs/a/mcp` never lists graph `b`'s tools; server `/mcp` exposes only server tools.
- **Auth decoupling:** the MCP suite is green under the current `require_bearer_auth` and under a mock OIDC `ResolvedActor` source — proving verifier-agnosticism. A 401 carries `WWW-Authenticate`.
- **OpenAPI:** the JSON-RPC endpoint is not REST — document only the envelope in utoipa (or exclude); keep `openapi.json` drift test green (`OMNIGRAPH_UPDATE_OPENAPI=1` to regenerate on intentional change).
- **Cross-repo smoke (optional):** point `@modelcontextprotocol/sdk` (TS) at the HTTP endpoint in an `omnigraph-ts` integration test.
## Rollout — phased by risk
- **PR 0a — extract the reusable invoke path (small).** The coarse `invoke_query` gate + 404 denial-masking are **already shipped** in `server_invoke_query`. Extract the read/mutate dispatch into `invoke_stored_query(handle, name, params, branch/snapshot, actor)` so MCP `tools/call` and the HTTP route share one path. No behaviour change. *(Replaces the previous draft's "PR 0 — wire the gate", which was already done.)*
- **PR 0b — per-query `invoke_query` scope (the safety prerequisite).** Add a query-name dimension to `PolicyRequest` + the Cedar schema (rfc-001's intended design), wire it at `POST /queries/{name}` and in the stored-query `McpTool::authorization`. Independently useful (the `allow InvokeQuery [find_user]` policy). **Gates the per-query Cedar-filtering test and §5.9's recommended agent policy.**
- **PR 1 — MCP transport + read-only parity + stored-query reads.** Endpoint(s), `initialize`/`tools/list`/`tools/call`/`resources/*`, the `McpTool` registry, Cedar-filtered listing, the read-only built-ins (`health`, `graphs_list`, `snapshot`, `read`/`query`, `schema_get`, `branches_list`, `commits_*`) + resources + stored-query *reads*. All auth-agnostic.
- **PR 2 — mutating parity + stored-query mutations.** `change`/`mutate`, `ingest`, `branches_create/delete/merge`, `schema_apply`, stored-query mutations + the `mcp.allow_adhoc` switch.
- **PR 3 — docs + agent on-ramp hook.** `docs/user/server.md` MCP section (incl. the recommended agent policy + the coarse-vs-per-query caveat), `openapi.json` sync, the `omnigraph mcp install` config target (MR-974), and the downstream `omnigraph-ts` re-sync/proxy follow-up.
- **Later (separate, MR-956):** RFC 9728 protected-resource metadata + WorkOS — slots in with zero MCP changes.
- **Later (TS minor):** stdio package → proxy mode.
## Migration / backwards compatibility
- **Additive.** No `queries:` and no MCP traffic → today's behaviour unchanged. New endpoints are new routes.
- **Cedar default-deny** (when tokens configured) means MCP exposes nothing until an actor is granted — safe by default.
- The stdio package keeps working unchanged; proxy mode is opt-in later.
- `openapi.json` only gains the documented MCP envelope; existing REST routes untouched.
## Open Questions
1. **BigInt/u64 as JSON string** (recommended, precision-safe) vs number.
2. **`rmcp` vs hand-rolled** JSON-RPC (spike `rmcp` on edition 2024; default to hand-roll on friction).
3. **Default-off `mcp.allow_adhoc`** for ad-hoc `query`/`mutate` (recommended) vs always-on + Cedar-only.
4. **`structuredContent` + `outputSchema`** now vs text-JSON v1 (recommend v1 text-JSON).
5. **Endpoint paths:** `/mcp` + `/graphs/{id}/mcp` — confirm naming and whether server-scoped tools also appear on the per-graph endpoint.
6. **Stateless POST-only** confirmed (no near-term server-initiated messages) — revisit only if subscriptions land.
7. **Legacy alias tools** (`read`/`change`): keep for client compat (the shipped package uses them), or drop and rely on `query`/`mutate`?
8. **PR 0b shape:** per-query scope as a Cedar *resource* (`StoredQuery::"find_user"`) vs a `query_name` *context attribute* + policy condition — affects how `allow InvokeQuery [list]` is authored.

View file

@ -20,9 +20,9 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `end_to_end.rs` | Full init → load → query/mutate flow |
| `branching.rs` | Branch create / list / delete, lazy fork |
| `merge_truth_table.rs` | Merge-pair truth table (MR-786): all 9×9 `(left_op, right_op)` cells from `{noop, addNode, removeNode, addEdge, removeEdge, setProperty, dropProperty, addLabel, removeLabel}`. Adding a new op to `OpVariant` forces a compile error in `build_case` until the new row + column are dispositioned. 36 executable cells run through real `branch_merge` with a structured oracle (`MergeOutcome` / `MergeConflictKind` + graph-state assert); 45 cells involving `dropProperty`/`addLabel`/`removeLabel` are recorded as `Unsupported` until the mutation grammar grows. |
| `runs.rs` | Direct-publish writes: cancellation, concurrent-writer CAS, multi-statement atomicity, MR-794 staged-write rewire (D₂ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery) |
| `writes.rs` | Direct-publish writes: cancellation, concurrent-writer CAS, multi-statement atomicity, MR-794 staged-write rewire (D₂ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery) |
| `staged_writes.rs` | TableStore staged-write primitives (`stage_append`, `stage_merge_insert`, `commit_staged`, `scan_with_staged`, `count_rows_with_staged`) — primitive-level only; engine code uses the in-memory `MutationStaging` accumulator instead |
| `lifecycle.rs` | Repo lifecycle, schema state |
| `lifecycle.rs` | Graph lifecycle, schema state |
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
| `changes.rs` | `diff_between` / `diff_commits` |
| `consistency.rs` | Cross-table snapshot isolation, atomic publish |
@ -31,7 +31,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `traversal.rs` | `Expand`, variable-length hops, anti-join |
| `aggregation.rs` | `count`, `sum`, `avg`, `min`, `max` |
| `export.rs` | NDJSON streaming export filters |
| `s3_storage.rs` | S3-backed repo (skipped unless `OMNIGRAPH_S3_TEST_BUCKET` is set) |
| `s3_storage.rs` | S3-backed graph (skipped unless `OMNIGRAPH_S3_TEST_BUCKET` is set) |
| `lance_version_columns.rs` | Per-row `_row_last_updated_at_version` behavior |
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
| `maintenance.rs` | `optimize` (compaction) + `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation |
@ -45,7 +45,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
## Test helpers
- **Engine**`crates/omnigraph/tests/helpers/mod.rs`: `init_and_load()` (bootstrap a temp repo + load standard fixture), `snapshot_main()`, `snapshot_branch()`, query/mutation runners, row collection and counting. Use these instead of hand-rolling.
- **Engine**`crates/omnigraph/tests/helpers/mod.rs`: `init_and_load()` (bootstrap a temp graph + load standard fixture), `snapshot_main()`, `snapshot_branch()`, query/mutation runners, row collection and counting. Use these instead of hand-rolling.
- **CLI**`crates/omnigraph-cli/tests/support/mod.rs`: `Command`-style wrapper for invoking `omnigraph`, server-process spawning, fixture resolution, output assertion helpers.
- **Server** — no shared helpers; server tests call the `Omnigraph` engine API directly and exercise endpoints over the wire.
@ -63,14 +63,14 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
CI runs three S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml``rustfs_integration` job):
- `cargo test -p omnigraph-engine --test s3_storage`
- `cargo test -p omnigraph-server --test server server_opens_s3_repo_directly_and_serves_snapshot_and_read`
- `cargo test -p omnigraph-server --test server server_opens_s3_graph_directly_and_serves_snapshot_and_read`
- `cargo test -p omnigraph-cli --test system_local local_cli_s3_end_to_end_init_load_read_flow`
Locally, set `OMNIGRAPH_S3_TEST_BUCKET` (and the usual `AWS_*` vars including `AWS_ENDPOINT_URL_S3` for non-AWS) before running. Without those, S3 tests skip gracefully.
## OpenAPI drift
`crates/omnigraph-server/tests/openapi.rs` regenerates `openapi.json` and diffs against the checked-in copy. CI auto-commits the regeneration on same-repo PRs and otherwise runs in strict-check mode (env: `OMNIGRAPH_UPDATE_OPENAPI`).
`crates/omnigraph-server/tests/openapi.rs` regenerates `openapi.json` and diffs against the checked-in copy. CI auto-commits the regeneration on same-repository PRs and otherwise runs in strict-check mode (env: `OMNIGRAPH_UPDATE_OPENAPI`).
## Examples & benches
@ -79,7 +79,7 @@ Locally, set `OMNIGRAPH_S3_TEST_BUCKET` (and the usual `AWS_*` vars including `A
## Coverage tooling — what's missing
There is **no** coverage tooling in the repo today: no `tarpaulin.toml`, no `codecov.yml`, no coverage CI step. If you want to know whether your change is covered, the answer comes from reading and running the relevant integration tests, not from a tool.
There is **no** coverage tooling in the repository today: no `tarpaulin.toml`, no `codecov.yml`, no coverage CI step. If you want to know whether your change is covered, the answer comes from reading and running the relevant integration tests, not from a tool.
If introducing coverage tooling is in scope for your task, the natural first step is `cargo-llvm-cov` wired into a separate CI job, and a per-crate threshold rather than a global one.
@ -89,7 +89,7 @@ If introducing coverage tooling is in scope for your task, the natural first ste
How to check:
1. **Map the change to an area** — use the engine integration-test table above (`branching.rs`, `runs.rs`, `search.rs`, etc.). The filename usually names the area.
1. **Map the change to an area** — use the engine integration-test table above (`branching.rs`, `writes.rs`, `search.rs`, etc.). The filename usually names the area.
2. **Open the file and skim every test fn name.** Test fn names are the index — read them all, not just the first few.
3. **Grep for the symbol or path you're changing.** `rg <FunctionName>` or `rg <enum_variant>` across all `tests/` directories surfaces existing coverage you might miss.
4. **Decide one of three outcomes**, in this order of preference:
@ -97,7 +97,7 @@ How to check:
- *Existing test covers the area but not your case***add an assertion or a fixture row to the existing test**, don't write a new function with `init_and_load()` again.
- *No existing coverage in any test file* → only then write a new test; put it in the file that owns the area, or open a new file only if the area itself is new.
Three duplicated `init_and_load() → run_query → assert_eq` blocks where one parameterized test would do is the most common form of test rot in this repo. Don't add to it.
Three duplicated `init_and_load() → run_query → assert_eq` blocks where one parameterized test would do is the most common form of test rot in this repository. Don't add to it.
## Before-every-task checklist
@ -106,7 +106,7 @@ When you pick up any change, walk through this:
1. **Find existing coverage** (per the principle above). Don't just look at the first test file by name — grep for the symbol you're touching across every crate's `tests/`.
2. **Run those tests locally before editing.** `cargo test --workspace --locked` for the broad pass; `-p <crate> --test <file>` for a focused loop. Confirm a clean baseline.
3. **Decide extend-vs-new** explicitly. If you can extend an existing test (assertion, fixture row, parameterization), do that. Only add a new test fn or new file if no existing one owns the area.
4. **Reuse the helpers.** `init_and_load()`, fixture files, the CLI `support` harness — re-use them. Don't bootstrap a fresh repo by hand if a helper exists.
4. **Reuse the helpers.** `init_and_load()`, fixture files, the CLI `support` harness — re-use them. Don't bootstrap a fresh graph by hand if a helper exists.
5. **Mind the boundary.** Per [docs/dev/invariants.md](invariants.md), test at the layer the change lives at — planner-level changes deserve planner-level tests, not just end-to-end.
6. **For substrate-touching changes** (Lance behavior), reach for `failpoints` or fixture-driven scenarios, not stubbed-out mocks.
7. **For server / API changes**, confirm the OpenAPI regeneration happens in `openapi.rs` and that the diff lands in `openapi.json`.

View file

@ -1,7 +1,10 @@
# Runs — REMOVED (MR-771)
# Direct-Publish Write Path
The Run state machine and `__run__<id>` staging branches were removed in
MR-771. `mutate_as` and `load` now write **directly to the target table**
> History: the Run state machine and `__run__<id>` staging branches were
> removed in MR-771 (shipped v0.4.0). Writes now go directly to the target
> table; this document specifies that direct-publish path.
`mutate_as` and `load` write **directly to the target table**
and call `ManifestBatchPublisher::publish` once at the end with
`expected_table_versions` (the per-table manifest versions captured before
the first write). Cross-table OCC is enforced inside the publisher; the

View file

@ -65,7 +65,7 @@ manifest. The next mutation against that table fails with
`ExpectedVersionMismatch`. Most validation runs before any Lance write,
so single-statement mutations are unaffected; the narrow path is
multi-statement queries with late-op failures. Tracked as a follow-up;
see [docs/dev/runs.md](../dev/runs.md#known-limitation-mid-query-partial-failure-on-the-same-table)
see [docs/dev/writes.md](../dev/writes.md#mid-query-partial-failure-closed-by-mr-794)
for the workaround.
## Upgrade notes

View file

@ -19,7 +19,7 @@ mutation proceeds normally.
HEAD on every staged table is untouched and the next mutation
proceeds normally. A narrowed residual remains at the
finalize→publisher boundary (multi-table `commit_staged` is not
atomic with the manifest commit) — see [docs/dev/runs.md](../dev/runs.md)
atomic with the manifest commit) — see [docs/dev/writes.md](../dev/writes.md)
"Finalize → publisher residual" for details.
- **D₂ parse-time rule**: a single mutation query is either
insert/update-only or delete-only. Mixed → rejected with a clear
@ -75,14 +75,14 @@ mutation proceeds normally.
## Tests added
- `tests/runs.rs::partial_failure_leaves_target_queryable_and_unblocks_next_mutation`
- `tests/writes.rs::partial_failure_leaves_target_queryable_and_unblocks_next_mutation`
(replaces the old `partial_failure_observably_rolls_back_but_blocks_next_mutation_on_same_table`)
- `tests/runs.rs::mutation_rejects_mixed_insert_and_delete_at_parse_time`
- `tests/runs.rs::mixed_insert_and_update_on_same_person_coalesces_to_one_merge`
- `tests/runs.rs::multiple_appends_to_same_edge_coalesce_to_one_append`
- `tests/runs.rs::multi_statement_inserts_publish_exactly_once`
- `tests/runs.rs::load_with_bad_edge_reference_unblocks_next_load`
- `tests/runs.rs::load_with_cardinality_violation_unblocks_next_load`
- `tests/writes.rs::mutation_rejects_mixed_insert_and_delete_at_parse_time`
- `tests/writes.rs::mixed_insert_and_update_on_same_person_coalesces_to_one_merge`
- `tests/writes.rs::multiple_appends_to_same_edge_coalesce_to_one_append`
- `tests/writes.rs::multi_statement_inserts_publish_exactly_once`
- `tests/writes.rs::load_with_bad_edge_reference_unblocks_next_load`
- `tests/writes.rs::load_with_cardinality_violation_unblocks_next_load`
## Files changed
@ -105,7 +105,7 @@ mutation proceeds normally.
- `Cargo.toml` (workspace) + `crates/omnigraph/Cargo.toml` — added
`datafusion = "52"` direct dep (transitively pulled by Lance
already; required for `MemTable`).
- `docs/dev/runs.md` — removed "Known limitation" section; documented
- `docs/dev/writes.md` — removed "Known limitation" section; documented
the new accumulator + D₂ + LoadMode::Overwrite residual.
- `docs/dev/invariants.md` — mutation atomicity / read-your-writes status
flipped to `upheld for inserts/updates`.
@ -127,7 +127,7 @@ mutation proceeds normally.
as legacy.
- `docs/user/cli.md` — replaced the legacy `omnigraph run *` quickstart
block with `omnigraph commit list/show`.
- `docs/dev/testing.md` — extended the `runs.rs` row to cover the new
- `docs/dev/testing.md` — extended the `writes.rs` row to cover the new
staged-write contract tests; added the `staged_writes.rs` row.
- `AGENTS.md` (CLAUDE.md symlink) — updated the atomic-per-query
description and the L2 capability matrix row.

171
docs/releases/v0.5.0.md Normal file
View file

@ -0,0 +1,171 @@
# Omnigraph v0.5.0
Omnigraph v0.5.0 is a substrate, security, and migration-safety release. It
jumps the storage substrate from Lance 4 to Lance 6.0.1 (DataFusion 52 → 53,
Arrow 57 → 58), introduces engine-wide Cedar policy enforcement on every
authoring path, and ships a structured schema-lint v1 chassis with
code-tagged diagnostics, soft drops, and an explicit `--allow-data-loss`
flag for destructive migrations.
## Highlights
- **Lance 6.0.1 substrate**: bump from Lance 4.0.0 → 6.0.1, DataFusion 52 →
53, Arrow 57 → 58. New optimizer rules (vectorized `IN`-list eq kernel,
`PhysicalExprSimplifier`, push-limit-into-hash-join, CASE-NULL shortcut)
reach predicates that flow through the engine. `lance-tokenizer` replaces
tantivy internally; FTS behavior preserved.
- **Cedar policy engine**: a new `omnigraph-policy` crate wires
`Omnigraph::enforce(action, scope, actor)` into every `_as` writer
(`mutate_as`, `load_as`, `apply_schema_as`, `branch_create_as`,
`branch_merge_as`, `branch_delete_as`, plus the load and change
variants). The HTTP server defaults to deny-all when no Cedar policy is
configured; a YAML policy file is required to enable writes. Actor
identity comes only from signed token claims — clients cannot set actor
identity directly.
- **Schema lint v1 chassis**: diagnostics now carry stable codes of the form
`OG-XXX-NNN` instead of free-form messages. `omnigraph schema plan` and
`apply` understand soft drops on properties and types — destructive drops
require the new `--allow-data-loss` flag (Hard mode) at the CLI and an
equivalent JSON flag over HTTP.
- **Structured filter pushdown**: query-language predicates lower to
DataFusion `Expr` and push down through Lance's `Scanner::filter_expr`
instead of being flattened to SQL strings. This unlocks `CompOp::Contains`
pushdown (via `array_has`), which previously fell through to in-memory
post-scan filtering, and lets the DataFusion 53 optimizer rules above act
on our predicates.
- **HTTP `allow_data_loss` parity**: the destructive-drop guard now exists
on both the CLI (`--allow-data-loss`) and HTTP (`allow_data_loss: true` in
the schema-apply request body).
- **Inline query strings on CLI and HTTP**: `omnigraph read` /
`omnigraph mutate` and the corresponding HTTP endpoints accept inline
`.gq` source, not just a file path. Easier ad-hoc queries, clearer
request logs.
- **Browser CORS layer**: optional CORS layer on `omnigraph-server` for
browser-based UIs, gated by `OMNIGRAPH_CORS_ORIGINS`.
- **Merge-insert dup-rowid fix**: Lance's `MergeInsertBuilder` could surface
spurious `"Ambiguous merge inserts"` errors on sequential merges against
rows previously rewritten by `merge_insert`. The engine now opts into
`SourceDedupeBehavior::FirstSeen` with a `check_batch_unique_by_keys`
fail-fast precondition that guarantees source-side dedup happens before
Lance sees the batch.
- **Branch-merge error-path recovery**: a branch merge that failed
mid-flight could leave the in-process coordinator pointing at a stale
active branch. The error path now restores the prior coordinator,
matching the success path's invariant.
- **Branch merge with blob columns**: external blob URIs are now
materialized correctly during branch merge instead of being dropped or
pointing at the source branch.
- **Lance API surface guards**: a new test file
(`crates/omnigraph/tests/lance_surface_guards.rs`) pins eight specific
Lance API surfaces (`LanceError::TooMuchWriteContention`,
`ManifestLocation` fields, `MergeInsertBuilder` return shape,
`WriteParams::default`, `compact_files` signature, etc.) so the next
Lance bump fails compile or runtime on any silent drift rather than
producing wrong-state recovery in production.
## Behavior changes
- **On-disk format unchanged**: existing v0.4.2 datasets open unchanged.
The Lance file format pin stays at V2_2 (required by Lance's blob v2
feature).
- **`omnigraph-server` defaults to deny-all under `--policy`**: starting a
server with the policy feature enabled but no Cedar YAML policy
configured rejects every write. Operators must supply a policy file to
authorize anything.
- **Schema-lint diagnostics carry stable codes**: messages now lead with
`OG-XXX-NNN`. CI parsers or tooling that keyed off the v0.4.2 free-form
text need to switch to code-based matching.
- **Destructive schema drops require `--allow-data-loss`**: dropping a
property or type returns a structured diagnostic by default.
`omnigraph schema apply --allow-data-loss` (CLI) or
`{"allow_data_loss": true}` (HTTP) opts into Hard mode.
- **`HashJoinExec` null-aware semantics on anti-join**: a side effect of
the DataFusion 53 bump — `NOT IN` semantics under null-valued anti-join
columns are now correct per SQL standard. Queries that depended on the
prior behavior would have been incorrect.
## Upgrade Notes
### Migration
- No data migration. v0.4.2 repos open directly on v0.5.0.
### Clients
- HTTP and SDK clients should switch any string-matching schema-lint
parsing to code-based matching against the `OG-XXX-NNN` prefix.
- Clients exercising destructive schema drops (`DropProperty`, `DropType`)
must add the `allow_data_loss` request field (HTTP) or
`--allow-data-loss` flag (CLI). Default is soft-drop-or-reject.
- Clients consuming `mutate_as` / `load_as` / `apply_schema_as` / branch
authoring APIs now flow through the policy enforcer. Anything bypassing
authorization on v0.4.2 will be rejected on v0.5.0 once a policy is
configured.
### Operators
- Configure a Cedar policy YAML for production servers before enabling
writes; deny-all is the new default. The `omnigraph policy validate` /
`test` / `explain` CLI commands are unchanged.
- Bearer tokens continue to be the actor-identity source; review the
signed-token-claim-only invariant in `docs/dev/invariants.md` if you've
built custom authentication.
- If your local CI uses RustFS for S3-compatible storage testing, our CI
pins `rustfs/rustfs:1.0.0-beta.3` (the last known-good tag before the
upstream credentials-policy change). Mirror the pin or set
`RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true` for the new image
versions.
## Tests added or strengthened
- `crates/omnigraph/tests/lance_surface_guards.rs` — 8 named guards pinning
Lance API surfaces against silent drift on future bumps.
- `crates/omnigraph/tests/policy_engine_chassis.rs` — engine-level policy
enforcement coverage; complements the existing HTTP policy tests.
- Policy chassis e2e gap-fills — branch-merge, branch-create, branch-delete
policy paths now have explicit end-to-end tests over HTTP and CLI.
- Merge-pair truth table — exhaustive op-variant matrix for three-way
merge across `noop`, `addNode`, `removeNode`, `addEdge`, `removeEdge`,
`setProperty`, `dropProperty`, `addLabel`, `removeLabel`; the build
fails to compile when a new op variant is added without dispositioning
every pairing.
- Merge-insert: regression for the dup-rowid bug class on the load surface
(`load_merge_repeated_against_overlapping_keys_succeeds`), the update
surface (`second_sequential_update_on_same_row_succeeds`), and the
upstream-Lance-gap canary
(`load_merge_window_2_documents_upstream_lance_gap`).
- Maintenance + destructive-migration coverage — `omnigraph optimize` /
`cleanup` boundary cases, plus schema-apply soft-drop and Hard-mode
paths.
- Stable-row-id preservation across `stage_overwrite` — pins the invariant
that staged overwrites carry stable row IDs through to the committed
fragment set.
- `CompOp::Contains` pushdown regression
(`ir_filter_with_list_contains_pushes_down`) — pins the new structured
Expr pushdown path that retired the in-memory fallback.
## Included Changes
- Lance 4 → 6.0.1, DataFusion 52 → 53, Arrow 57 → 58 substrate upgrade.
- `omnigraph-policy` crate with engine-wide Cedar enforcement and
signed-token-claim-only actor identity.
- Schema-lint v1 chassis with `OG-XXX-NNN` codes, soft `DropProperty` /
`DropType` semantics, and `--allow-data-loss` for Hard mode.
- HTTP `allow_data_loss` request field parity with the CLI flag.
- Structured DataFusion `Expr` filter pushdown via
`Scanner::filter_expr`, with `CompOp::Contains` lowered through
`array_has`.
- Inline `.gq` source acceptance on CLI and HTTP read/mutate endpoints.
- Optional CORS layer on `omnigraph-server` for browser UIs.
- Bug fixes: merge-insert dup-rowid (FirstSeen + uniqueness precondition),
branch-merge coordinator restore on error, blob-column materialization
during branch merge.
- New Lance API surface-guard test file as the canary for future Lance
bumps.
- Recovery-sidecar coverage extended across the four write paths
(`MutationStaging::finalize`, `schema_apply`, `branch_merge`,
`ensure_indices`) with failpoint regression tests.
- CI: pinned `rustfs/rustfs:1.0.0-beta.3` after the upstream `:latest`
introduced a credentials-policy change.
- Version bump to `0.5.0` across workspace crates, `Cargo.lock`,
`openapi.json`, and the `AGENTS.md` surveyed version.

141
docs/releases/v0.6.0.md Normal file
View file

@ -0,0 +1,141 @@
# Omnigraph v0.6.0
Three pieces of work land in this release:
1. The **graph terminology rename** (renamed `Repo``Graph` across the Cedar resource model, policy API, and query-lint schema source).
2. **Multi-graph server mode** — one `omnigraph-server` process can now serve 110 graphs concurrently behind cluster routes (`/graphs/{graph_id}/...`), with per-graph and server-level Cedar policy, read-only `GET /graphs` enumeration, and CLI parity (`omnigraph graphs list`).
3. **Inline + canonical-named queries and mutations.** New `POST /query` and `POST /mutate` endpoints pair with the CLI's new `-e/--query-string` flag for ad-hoc execution without a temp file. `POST /read` and `POST /change` continue serving indefinitely as deprecated aliases that carry RFC 9745 `Deprecation: true` and RFC 8288 `Link: </successor>; rel="successor-version"` response headers, plus `deprecated: true` in `openapi.json`. Same canonicalization on the CLI: `omnigraph query`, `omnigraph mutate`, and top-level `omnigraph lint` / `omnigraph check` replace `omnigraph read`, `omnigraph change`, and the nested `omnigraph query lint` / `omnigraph query check`. Every deprecated spelling remains a `visible_alias` that warns to stderr once per invocation.
Runtime add/remove (`POST /graphs`, `DELETE /graphs/{id}`, `omnigraph graphs create`) is **not** in v0.6.0. Operators add or remove graphs by editing `omnigraph.yaml` and restarting. The first cut of `POST /graphs` shipped behind an atomic-YAML-rewrite design that we pulled before release once its concurrency guarantees were challenged (flock-on-renamed-inode race, duplicate-check outside the critical section, and an init-cleanup path that could destroy an existing graph's schema on re-init). The correct fix is a Lance-style cluster catalog (reserve → init → publish with recovery sidecars); that work is deferred.
## Breaking Changes
### Graph terminology rename
- Renamed the Cedar resource entity from `Omnigraph::Repo` to `Omnigraph::Graph`.
- Renamed policy API terminology from `repo_id` to `graph_id` on `PolicyCompiler::compile` (and on the new `PolicyEngine::load_graph` / `PolicyEngine::load_server` loaders described below).
- Renamed query-lint schema source JSON from `"repo"` to `"graph"` for `schema_source.kind`.
### Multi-graph server mode
- **Multi-graph deployments lose flat routes.** Single-graph invocation (`omnigraph-server <URI>`) is unchanged — same flat `/snapshot`, `/read`, `/branches`, etc. Multi-graph deployments serve those routes under `/graphs/{graph_id}/...`; bare flat paths return 404 in multi mode.
- **`ServerConfig` shape change** (programmatic embedders only): `ServerConfig { uri, policy_file }` is replaced by `ServerConfig { mode: ServerConfigMode }`, where `ServerConfigMode = Single { uri, policy_file } | Multi { graphs, config_path, server_policy_file }`. Callers that use `load_server_settings` are unaffected; callers that construct `ServerConfig` directly need to wrap their fields in `ServerConfigMode::Single`.
- **`AppState`'s routing surface** is `AppState::routing() -> &GraphRouting`, where `GraphRouting = Single { handle } | Multi { registry, config_path }`. The previous `AppState::uri()`, `AppState::mode()`, `AppState::registry()` accessors and the `ServerMode` enum are gone — embedders read `state.routing()` and match on the arm they need. Per-graph URIs live on `handle.uri`.
- **`AppState::new_multi`** is the new multi-graph constructor. Single-mode `new_*` / `open_*` constructors are unchanged.
- **`AuthenticatedActor(Arc<str>)``ResolvedActor { actor_id, tenant_id, scopes, source }`** (programmatic embedders only). The struct shape changes, but the HTTP contract — bearer auth and the bearer-derived-actor-identity guarantee — is unchanged. Cluster-mode call sites construct with `tenant_id: None`, `scopes: vec![Scope::Full]`, `source: AuthSource::Static`. The new fields are forward-compat seams for future multi-tenant and OAuth deployments; they're inert in this release.
- **`PolicyEngine::load(path, graph_id)` removed** in favor of two kind-typed loaders: `PolicyEngine::load_graph(path, graph_id)` for per-graph policies and `PolicyEngine::load_server(path)` for server-level policies. Each loader rejects rules whose action `resource_kind()` doesn't match the engine kind — operators who put a `graph_list` rule in a per-graph file (or a `read` rule in a server file) now get a load-time error instead of a silently-never-matching rule.
- **`PolicyRequest::actor_id` field removed.** Actor identity is now a separate parameter on `PolicyEngine::authorize(actor_id, &request)`. The type system enforces the server-authoritative-actor invariant: actor identity is always sourced from the bearer-token match resolved at the auth boundary; handlers cannot smuggle identity through the request body.
- **`Omnigraph::init` is strict by default.** Initialization at a URI that already holds schema files now errors with `OmniError::AlreadyInitialized` instead of silently overwriting. Operators who actually want to overwrite use `InitOptions { force: true }` (CLI: `omnigraph init --force`). Closes the destructive-cleanup footgun where a failed re-init would delete an existing graph's schema files.
- **Top-level `policy.file` is rejected in multi-graph server mode.** It remains valid for single-graph / CLI-local policy. Multi-graph deployments must move graph rules to `graphs.<graph_id>.policy.file` and server-scoped `graph_list` rules to `server.policy.file`.
- **Open server startup requires explicit opt-in.** A server with no bearer tokens and no policy now refuses to start unless passed `--unauthenticated` or `OMNIGRAPH_UNAUTHENTICATED=1`.
- **Policy requires bearer tokens.** Configuring any policy file without bearer tokens now refuses startup; otherwise every protected request would 401 before Cedar could evaluate it.
- **Tokens without policy default-deny non-read actions.** Existing authenticated deployments that relied on writes or admin routes without Cedar policy must add policy rules for those actions.
- **`GET /graphs` requires `server.policy.file` in every runtime state.** Even `--unauthenticated` mode keeps server topology closed until the operator explicitly authorizes `graph_list`.
### Query / mutation rename
- **`ChangeRequest` field rename**: `query_source``query`, `query_name``name`. Both legacy names continue to deserialize via `#[serde(alias = "...")]`, so existing clients sending the old JSON keys keep working. CLI remote calls against `/change` still emit the legacy keys verbatim through the `legacy_change_request_body` helper so a newer CLI talking to an older server keeps working byte-for-byte.
- **CLI `omnigraph query lint` / `omnigraph query check`** are now top-level — canonical name is **`omnigraph lint`**. The three deprecated invocations (`omnigraph query lint`, `omnigraph query check`, and bare `omnigraph check`) remain as argv-level shims that rewrite to `omnigraph lint` and print a one-line stderr deprecation warning. `check` is deliberately **not** a clap `visible_alias` on `lint` — two equivalent canonical names would split agent emissions between them depending on training-data drift, so the deprecation pattern (rewrite + warn) gives one unambiguous canonical name in `omnigraph --help`.
## New
- **Multi-graph mode**. Invoke with `omnigraph-server --config omnigraph.yaml` where the YAML has a non-empty `graphs:` map and no single-mode selector (no `server.graph`, no CLI `<URI>` or `--target`). At startup the server opens every configured graph in parallel (bounded concurrency, fail-fast).
- **`GET /graphs`**. Lists every registered graph, sorted alphabetically by `graph_id`. Auth-required when bearer tokens are configured; Cedar-gated by `PolicyAction::GraphList` against `Omnigraph::Server::"root"`. Returns 405 in single mode. Server-scoped actions require an explicit `server.policy.file` in every runtime state — the management surface is closed by default even in `--unauthenticated` mode so that server topology is never exposed without operator opt-in.
- **CLI `omnigraph graphs list`**. Mirrors the HTTP surface. Rejects local URI targets with a clear message — for remote multi-graph servers only.
- **CLI `omnigraph init --force`**. Bypasses the strict-init preflight when an operator deliberately wants to recover from orphan schema files. Does NOT purge existing Lance datasets; recursive deletion needs `StorageAdapter::delete_prefix` (deferred — see below).
- **Per-graph Cedar policy**. Each entry in the `graphs:` map can carry a `policy.file` path, loaded at startup via `PolicyEngine::load_graph`. Cedar's `Omnigraph::Graph::"<graph_id>"` resource is per-graph; the new `Omnigraph::Server::"root"` resource governs server-level actions.
- **Server-level Cedar policy**. `server.policy.file` in the config governs the `graph_list` action on `Omnigraph::Server::"root"`. Required to expose `GET /graphs` in every runtime state — without a server policy the default-deny posture rejects `graph_list`, including in `--unauthenticated` mode.
- **Cedar action vocabulary**: `graph_list` (server-scoped). Runtime `graph_create` / `graph_delete` are reserved but not shipped — see "Deferred."
- **Canonical graph URI identity.** Server startup normalizes graph root URIs before registry insertion and response output, so aliases such as `/tmp/g`, `/tmp/g/`, and `file:///tmp/g` cannot register as distinct graphs that actually share one Lance root.
- **`POST /query`** and **`POST /mutate`**. Canonical inline endpoints. `/query` rejects mutations with a typed 400 (the D2 rule lives at the URL — read-only contract enforced before execution); body uses the clean `{ query, name, params, branch, snapshot }` shape. `/mutate` accepts the same shape for mutations. Both available in single mode and per-graph multi mode (`/graphs/{id}/query`, `/graphs/{id}/mutate`). Internal call sites share two helpers (`run_query`, `run_mutate`) that take decoupled args, not request bodies — the seam MR-969's future stored-query handler plugs into.
- **CLI `omnigraph query` / `omnigraph mutate`** as top-level canonical subcommands. Pairs with new top-level **`omnigraph lint` (alias `check`)** so query validation no longer sits under `omnigraph query`.
- **CLI `-e, --query-string <GQ>`** on both `omnigraph query` and `omnigraph mutate`. 3-way mutex with `--query <path>` and `--alias <name>` — exactly one is required. Empty string rejected. Suits ad-hoc exploration, REPL workflows, and agent tool-use without temp files.
- **Three-channel deprecation signal on `/read` and `/change`**: OpenAPI `deprecated: true` on the operation (every codegen flags the generated SDK method), RFC 9745 `Deprecation: true` response header, and RFC 8288 `Link: </query>; rel="successor-version"` (or `</mutate>`) response header. Auto-discoverable; no SDK breakage.
- **`omnigraph.yaml` `aliases.<name>.command`** now accepts `query` and `mutate` as canonical values alongside the legacy `read` and `change`. The internal `AliasCommand` enum retains the legacy variant names so serialized configs stay byte-stable.
## Configuration
`omnigraph.yaml` schema additions (all optional, single-mode unaffected):
```yaml
server:
bind: 0.0.0.0:8080
policy:
file: ./server-policy.yaml # server-level Cedar (graph_list)
graphs:
alpha:
uri: s3://tenant-bucket/alpha
policy:
file: ./policies/alpha.yaml # per-graph Cedar
beta:
uri: s3://tenant-bucket/beta
# no per-graph policy → engine-layer enforcement is a no-op
```
## Deferred
- **`POST /graphs` runtime graph creation** and **CLI `omnigraph graphs create`**. Pulled before release after the YAML-rewrite design's correctness story didn't survive review. A future release will add a managed cluster catalog (Lance-backed reserve → init → publish with recovery sidecars) and re-expose runtime creation on top of it. Until then, operators add graphs by editing `omnigraph.yaml` and restarting.
- **`DELETE /graphs/{id}`**. Never shipped in v0.6.0; deferred with the same cluster-catalog work.
- **`StorageAdapter::delete_prefix`**. The substrate primitive a managed catalog would need. Will land alongside runtime mutation.
- **`omnigraph init --force` purging Lance state.** Today `--force` only bypasses the schema-file preflight; recursive deletion of existing Lance datasets needs `delete_prefix`.
- **`X-Actor-Id` service delegation forwarding**. Needs durable both-actor audit on `_graph_commits.lance` — out of scope.
- **Hot policy reload**. Restart is cheap at N≤10 graphs.
## User Impact
- **No on-disk migration is required.** Existing `.omni` graphs from v0.5.0 (and earlier) open cleanly under v0.6.0 — Lance datasets, `__manifest`, `_schema.pg`, `_schema.ir.json`, `__schema_state.json`, `_graph_commits.lance`, `_graph_commit_recoveries.lance` all use unchanged formats. No conversion step.
- **Existing single-graph storage upgrades without migration.** Server deployments may need auth/policy config changes: explicitly pass `--unauthenticated` for local open mode, configure tokens when using policy, and add Cedar policy for non-read authenticated actions.
- **Multi-graph adoption is opt-in.** Add a `graphs:` map to `omnigraph.yaml` (and remove `server.graph`) to switch a deployment to multi mode.
- **Cluster routes are breaking for client SDKs targeting multi mode.** Generated clients from previous v0.5.0 OpenAPI specs will hit 404 on flat paths against a multi-mode server. Regenerate against the v0.6.0 `openapi.json`.
- **Supported YAML policy authoring is unchanged.** The Cedar `Omnigraph::Graph` and `Omnigraph::Server` entities are internally generated by `compile_policy_source` — operator YAML only references actions and groups.
- **Operators with unsupported raw Cedar policy files** should update `Omnigraph::Repo` resource references to `Omnigraph::Graph`.
- **Endpoint and CLI rename is cosmetic on the client side.** Existing callers on `/read`, `/change`, `omnigraph read`, `omnigraph change`, and `omnigraph query lint` keep working — they pick up the `Deprecation` + `Link` headers (or stderr deprecation warning on the CLI) so SDKs and proxies can surface the successor name automatically. New integrations should target the canonical names. ChangeRequest field names migrate at the caller's pace — both `query_source`/`query_name` and `query`/`name` accepted indefinitely.
## Migration: single → multi
```yaml
# Before (v0.5.0 single-mode invocation)
server:
graph: my-graph
graphs:
my-graph:
uri: /var/lib/omnigraph/my-graph
policy:
file: ./policy.yaml
```
```yaml
# After (v0.6.0 multi-mode — drop `server.graph` and the top-level `policy`)
server:
policy:
file: ./server-policy.yaml # NEW: governs GET /graphs
graphs:
my-graph:
uri: /var/lib/omnigraph/my-graph
policy:
file: ./policy.yaml # MOVED: was top-level
```
Same `omnigraph.yaml` file; restart the server. Clients targeting the old flat routes (`/snapshot`, `/read`, …) must update to `/graphs/my-graph/snapshot`, etc.
To add a new graph after rollout: stop the server, append a new `graphs.<id>` entry, restart.
## Documentation
- Public docs, CLI help, examples, server docs, and test helpers now consistently use "graph" for the OmniGraph data artifact.
- GitHub/source repository terminology remains spelled out as "repository" where needed.
- New: `docs/user/cli.md` documents `omnigraph graphs list`; `docs/user/server.md` documents the multi-graph mode and the cluster route convention; `docs/user/policy.md` documents the per-graph vs server-scoped action distinction.
- New: `docs/user/server.md` documents `POST /query` / `POST /mutate` and the three-channel deprecation signal on `/read` / `/change`. `docs/user/cli.md` documents the `-e/--query-string` flag with examples. `docs/user/cli-reference.md` shows the canonical CLI verbs (`query`, `mutate`, `lint`, `check`) with legacy spellings as visible aliases.
- New: `docs/dev/rfc-001-queries-envelope-mcp.md` is the cross-cutting design doc for the inline / stored query work that started landing in this release. It sequences the v0.6.x patch series (request/response envelope hardening) and the v0.7.0 stored-query + MCP work.
## Test coverage
- `GraphId` newtype validation, registry race tests, init failpoints (still reachable from `omnigraph init` CLI).
- Mode-inference four-rule matrix, parallel multi-graph startup, cluster routing.
- Cedar `Server` resource refactor, backwards-compat for graph-only policies, kind-alignment rejection (server actions in graph files / vice versa).
- `GET /graphs` enumeration, 405-in-single-mode, 403-in-Open-mode-without-server-policy, Cedar admin/viewer authorization.
- Cluster routes with inner path params (`/branches/{branch}`, `/commits/{commit_id}`) deserialize correctly under axum 0.8 nested routing.
- Policy-requires-tokens startup invariant enforced uniformly across single and multi mode.
- The bearer-auth-derived-actor-identity regression test (client-supplied identity headers are ignored; the server-resolved actor is the only identity Cedar sees) stays green across the entire refactor.
</content>

26
docs/releases/v0.6.1.md Normal file
View file

@ -0,0 +1,26 @@
# Omnigraph v0.6.1
v0.6.1 focuses on operational polish after v0.6.0: stored-query registries, safer branch cleanup, more complete release artifacts, and a Lance blob-compaction workaround.
## Highlights
- **Stored-query registries.** `omnigraph.yaml` can declare curated `queries:` blocks per graph. Servers load and type-check them at startup, `omnigraph queries validate` checks them offline, `omnigraph queries list` shows exposed queries and typed params, `GET /queries` exposes a typed catalog, and `POST /queries/{name}` invokes a stored query without accepting ad hoc `.gq` source from the client.
- **Stored-query policy gate.** New Cedar action `invoke_query` gates the stored-query invocation surface. Stored mutations are double-gated: `invoke_query` to reach the stored query and `change` for the actual write.
- **Safer branch deletion.** `branch_delete` now treats the manifest as the authority, flips branch visibility atomically, and reclaims per-table/commit-graph forks as derived state. If best-effort reclaim is interrupted, `cleanup` reconciles orphaned forks; reusing a branch name before cleanup reports an actionable error.
- **Blob-safe optimize.** `omnigraph optimize` skips tables with `Blob` properties instead of failing the whole sweep on Lance's blob-v2 compaction decode bug. Skips are visible in human output, `--json` as `skipped`, `TableOptimizeStats.skipped`, and logs; non-blob tables still compact normally.
- **Deployment improvements.** The container entrypoint now composes `OMNIGRAPH_TARGET_URI` with `OMNIGRAPH_CONFIG`, so operators can keep the graph URI in env while loading policy/query config from a mounted file. The local RustFS bootstrap pins RustFS beta.3 and allows the current insecure local-dev default credentials.
- **Windows release support.** Tagged and edge releases now publish Windows x86_64 archives containing `omnigraph.exe` and `omnigraph-server.exe`, with a PowerShell installer and Windows install docs.
- **Release tooling.** Homebrew formula generation was tightened to produce audit-clean formulas.
## Compatibility Notes
- A graph selected by name (`--target` or `server.graph`) now uses `graphs.<name>.policy` and `graphs.<name>.queries`. Top-level `policy` / `queries` blocks are only for anonymous bare-URI single-graph mode; using them with a named graph now fails loudly with migration guidance.
- `mcp.expose` defaults to `true` for stored-query registry entries. Set `mcp: { expose: false }` for service-only queries that should not appear in the catalog.
- `invoke_query` is graph-scoped, not branch-scoped. Branch/snapshot access remains enforced by the inner `read` / `change` gate.
- Blob tables are not compacted until the upstream Lance fix lands, so fragment count and deleted-row space on blob tables are not reclaimed by `optimize`. Reads, writes, and query results are unaffected; no on-disk migration is required.
- `TableOptimizeStats` is now `#[non_exhaustive]` and gains a `skipped: Option<SkipReason>` field (so does the new `SkipReason` enum). This is a source-level change only for downstream code that built this returned result struct by literal — rare, since it is produced by `optimize` and consumed by reading its fields; field access is unaffected, and `#[non_exhaustive]` keeps future additions non-breaking.
## Docs And Cleanup
- Public docs were updated for stored queries, policy, server routes, deployment, Windows installation, branch deletion, maintenance, and the `runs` docs rename to `writes`.
- README copy and release documentation were refreshed; older release notes had small typo/wording fixes.

View file

@ -4,4 +4,4 @@
- `_as` variants of every write API let callers override the actor: `mutate_as`, `ingest_as`, `branch_merge_as`, `apply_schema_as`, etc.
- Actor IDs are persisted on `GraphCommit.actor_id` with split storage in `_graph_commit_actors.lance` (the commit graph is split into `_graph_commits.lance` for the linkage and `_graph_commit_actors.lance` for the actor map).
- HTTP server uses the bearer-token actor automatically; CLI uses the local user / explicit env (no implicit actor).
- Pre-v0.4.0 repos also stored actor IDs on `RunRecord.actor_id` in `_graph_runs.lance` / `_graph_run_actors.lance`. The Run state machine was removed in MR-771; those files are inert post-v0.4.0 and reclaimed by MR-770's production sweep.
- Pre-v0.4.0 graphs also stored actor IDs on `RunRecord.actor_id` in `_graph_runs.lance` / `_graph_run_actors.lance`. The Run state machine was removed in MR-771; those files are inert post-v0.4.0 and reclaimed by MR-770's production sweep.

View file

@ -8,10 +8,10 @@ Lance supports branching at the dataset level: a branch is a named lineage of ve
OmniGraph builds *graph branches* on top by branching every sub-table coherently:
- `branch_create(name)` / `branch_create_from(target, name)` — disallowed name `main`; fails if branch exists; ensures the schema-apply lock is idle.
- `branch_create(name)` / `branch_create_from(target, name)` — disallowed name `main`; fails if branch exists; ensures the schema-apply lock is idle. Atomic and authority-first like `branch_delete`: it flips the `__manifest` branch (authority), then creates the derived commit-graph branch, force-dropping any orphaned commit-graph ref left by an incomplete prior delete (the manifest branch is fresh, so a same-named commit-graph branch is provably a zombie). If commit-graph creation fails, the manifest branch is rolled back so the name never half-exists.
- `branch_list()` — returns public branches, **filters internal** `__run__…` and `__schema_apply_lock__` prefixes.
- `branch_delete(name)` — refuses if there are descendants or active runs on the branch; cleans up owned per-branch fragments.
- **Lazy forking**: a branch only forks a sub-table when that sub-table is first mutated on it. Pure-read branches share fragments with their source.
- `branch_delete(name)` — refuses if there are descendants or active runs on the branch. The manifest is the single authority for branch existence: deletion flips the `__manifest` branch ref first (one atomic op), after which the branch is gone from every snapshot. The owned per-table forks and the commit-graph branch are derived state, reclaimed best-effort with `force_delete_branch` after the flip. A failure during that reclaim (transient object-store error) does not fail the call or block the authority flip; the leftover forks are unreachable orphans that the [`cleanup`](maintenance.md) reconciler converges. One consequence: if a delete's best-effort reclaim fails, reusing that branch name before the next `cleanup` surfaces a clear error pointing at `cleanup` (the stale fork would otherwise collide on first write).
- **Lazy forking**: a branch only forks a sub-table when that sub-table is first mutated on it. Pure-read branches share fragments with their source. A fork collision is classified by the manifest authority, not by Lance branch versions: if the live manifest already records the fork on the active branch, a concurrent first-write won and the caller gets a retryable "refresh and retry"; if the manifest does not, a physical branch there is an orphan and the caller is pointed at `cleanup`.
- `sync_branch(branch)` — re-binds the in-memory handle to the latest head of the branch.
## L2 — Commit graph (`db/commit_graph.rs`)

View file

@ -8,22 +8,23 @@ A reference for the `omnigraph` binary's command surface and `omnigraph.yaml` sc
| Command | Purpose |
|---|---|
| `init` | `--schema <pg>` → initialize a repo (also scaffolds `omnigraph.yaml` if missing) |
| `init` | `--schema <pg>` → initialize a graph (also scaffolds `omnigraph.yaml` if missing) |
| `load` | bulk load a branch (`--mode overwrite\|append\|merge`) |
| `ingest` | branch-creating transactional load (`--from <base>`) |
| `read` | run named query (params via `--params`, `--params-file`, or alias args) |
| `change` | run mutation query |
| `query` (alias: `read`) | run named read query; source via `--query <path>`, `-e`/`--query-string <GQ>`, or `--alias <name>` (exactly one). `read` is the deprecated previous name and prints a one-line warning to stderr |
| `mutate` (alias: `change`) | run mutation query; same `--query` / `-e` / `--alias` mutual-exclusion as `query`. `change` is the deprecated previous name and prints a one-line warning to stderr |
| `snapshot` | print current snapshot (per-table version + row count) |
| `export` | dump to JSONL on stdout (`--type T`, `--table K` filters) |
| `branch create \| list \| delete \| merge` | branching ops |
| `commit list \| show` | inspect commit graph |
| `run list \| show \| publish \| abort` | transactional run ops |
| `schema plan \| apply \| show (alias: get)` | migrations |
| `query lint \| check` | offline / repo-backed validation |
| `optimize` | non-destructive Lance compaction |
| `lint` (alias: `check`) | offline / graph-backed query validation. Replaces `query lint` / `query check`, which are kept as deprecated argv-level shims that print a one-line warning and rewrite to `omnigraph lint` |
| `queries validate \| list` | operate on the server-side stored-query registry (the `queries:` block). `validate` type-checks every stored query against the live schema offline (opens the selected graph; exits non-zero on any breakage), catching schema drift without restarting the server; `list` prints the selected registry's query names, MCP exposure, and typed params. For per-graph registries, pass `--target <graph>` or set `cli.graph`; with no graph selection, `list` shows only top-level `queries:`. Distinct from `lint`, which validates a single `.gq` file |
| `optimize` | non-destructive Lance compaction (skips tables with `Blob` columns; `--json` reports a `skipped` field) |
| `cleanup --keep N --older-than 7d --confirm` | destructive version GC |
| `embed` | offline JSONL embedding pipeline |
| `policy validate \| test \| explain` | Cedar tooling |
| `policy validate \| test \| explain` | Cedar tooling. Selects `cli.graph`, else `server.graph`, else top-level `policy.file` |
| `version` / `-v` | print `omnigraph 0.3.x` |
## `omnigraph.yaml` schema
@ -34,6 +35,13 @@ graphs:
<name>:
uri: <local|s3://|http(s)://>
bearer_token_env: <ENV_NAME>
queries: # per-graph stored-query registry (server-role; multi-graph mode)
<query-name>: # key MUST equal the `query <name>` symbol inside the .gq
file: <path-to-.gq> # relative to this config's directory
mcp:
expose: true # default true: listed in the MCP catalog (GET /queries); set false to hide (still HTTP-callable)
tool_name: <name> # optional MCP tool-name override (defaults to <query-name>;
# must be unique across exposed queries)
server:
graph: <name>
bind: <ip:port>
@ -49,18 +57,23 @@ auth:
env_file: ./.env.omni
aliases:
<alias>:
command: read|change
# accepted values: `read` / `query` (read alias), `change` / `mutate`
# (write alias). `query` and `mutate` are recommended; `read` and
# `change` remain accepted forever for back-compat.
command: read|change|query|mutate
query: <path-to-.gq>
name: <query-name>
args: [<positional-name>, …]
graph: <name>
branch: <name>
format: <output-format>
queries: # top-level registry — applies only to a bare-URI (anonymous) graph; a graph served by name uses its `graphs.<id>.queries`. Mirrors top-level `policy`.
<query-name>: { file: <path-to-.gq> } # mcp.expose defaults to true
policy:
file: ./policy.yaml
```
## Output formats (read command)
## Output formats (`query` command, alias: `read`)
- `json` — pretty-printed object with metadata + rows
- `jsonl` — one metadata line then one JSON object per row

View file

@ -1,40 +1,65 @@
# CLI Guide
## Core Repo Flow
## Core Graph Flow
```bash
omnigraph init --schema ./schema.pg ./repo.omni
omnigraph load --data ./data.jsonl --mode overwrite ./repo.omni
omnigraph snapshot ./repo.omni --branch main --json
omnigraph read --uri ./repo.omni --query ./queries.gq --name get_person --params '{"name":"Alice"}'
omnigraph change --uri ./repo.omni --query ./queries.gq --name insert_person --params '{"name":"Mina","age":28}'
omnigraph init --schema ./schema.pg ./graph.omni
omnigraph load --data ./data.jsonl --mode overwrite ./graph.omni
omnigraph snapshot ./graph.omni --branch main --json
omnigraph query --uri ./graph.omni --query ./queries.gq --name get_person --params '{"name":"Alice"}'
omnigraph mutate --uri ./graph.omni --query ./queries.gq --name insert_person --params '{"name":"Mina","age":28}'
```
`omnigraph query` is the canonical read command (pairs with `POST /query`);
`omnigraph mutate` is the canonical write command (pairs with `POST /mutate`).
The previous names `omnigraph read` and `omnigraph change` keep working as
visible aliases — invocations emit a one-line deprecation warning to stderr
and otherwise behave identically. See [Deprecated names](#deprecated-names)
for the migration table.
For ad-hoc reads and mutations (REPLs, AI agents, one-off scripts), pass the
GQ source inline with `-e` / `--query-string` instead of a file path:
```bash
omnigraph query --uri ./graph.omni \
-e 'query find($name: String) { match { $p: Person { name: $name } } return { $p.name, $p.age } }' \
--params '{"name":"Alice"}'
omnigraph mutate --uri ./graph.omni \
-e 'query add($name: String, $age: I32) { insert Person { name: $name, age: $age } }' \
--params '{"name":"Inline","age":42}'
```
`-e` is mutually exclusive with `--query <path>` and `--alias <name>`; exactly
one of the three must be provided. The inline source travels through the same
parser, lint, params binding, and commit machinery as a file-based query —
only the source loader changes.
## Branching And Reviewable Data Flows
```bash
omnigraph branch create --uri ./repo.omni --from main feature-x
omnigraph branch list --uri ./repo.omni
omnigraph branch merge --uri ./repo.omni feature-x --into main
omnigraph branch create --uri ./graph.omni --from main feature-x
omnigraph branch list --uri ./graph.omni
omnigraph branch merge --uri ./graph.omni feature-x --into main
omnigraph ingest --data ./batch.jsonl --branch review/import-2026-04-09 ./repo.omni
omnigraph export ./repo.omni --branch main --type Person > people.jsonl
omnigraph commit list ./repo.omni --branch main --json
omnigraph commit show --uri ./repo.omni <commit-id> --json
omnigraph ingest --data ./batch.jsonl --branch review/import-2026-04-09 ./graph.omni
omnigraph export ./graph.omni --branch main --type Person > people.jsonl
omnigraph commit list ./graph.omni --branch main --json
omnigraph commit show --uri ./graph.omni <commit-id> --json
```
## Remote Server Mode
Serve a repo:
Serve a graph:
```bash
omnigraph-server ./repo.omni --bind 127.0.0.1:8080
omnigraph-server ./graph.omni --bind 127.0.0.1:8080
```
Read through the HTTP API:
```bash
omnigraph read \
omnigraph query \
--target http://127.0.0.1:8080 \
--query ./queries.gq \
--name get_person \
@ -44,26 +69,47 @@ omnigraph read \
If the server requires auth, set `OMNIGRAPH_SERVER_BEARER_TOKEN` on the server
and configure the matching `bearer_token_env` in `omnigraph.yaml`.
## Multi-graph servers (v0.6.0+)
Against a multi-graph server (started with `--config omnigraph.yaml` referencing a non-empty `graphs:` map), use `omnigraph graphs list` to enumerate the registered graphs. The server must configure bearer tokens and `server.policy.file` with a rule that allows `graph_list`; `/graphs` is closed by default even when the server runs with `--unauthenticated`.
```bash
OMNIGRAPH_BEARER_TOKEN=admin-token \
omnigraph graphs list --uri http://server.example.com --json
```
For config-driven clients, set the remote graph's `bearer_token_env` to an environment variable containing a token whose actor is authorized by `server.policy.file`.
`list` rejects local URI targets — it's for remote multi-graph servers only.
Runtime add/remove is **not** in v0.6.0. To add a graph, stop the server, add a `graphs.<id>` entry to `omnigraph.yaml`, then restart. To remove, stop the server, delete the entry, restart.
Per-graph URLs: hit a graph's cluster route from any subcommand by pointing `--uri` at it:
```bash
omnigraph read --uri http://server.example.com/graphs/beta --query ./q.gq ...
```
## Runs, Policy, And Diagnostics
```bash
omnigraph query lint --query ./queries.gq --schema ./schema.pg --json
omnigraph query check --query ./queries.gq ./repo.omni --json
omnigraph lint --query ./queries.gq --schema ./schema.pg --json
omnigraph check --query ./queries.gq ./graph.omni --json
omnigraph schema plan --schema ./next.pg ./repo.omni --json
omnigraph schema apply --schema ./next.pg ./repo.omni --json
omnigraph schema plan --schema ./next.pg ./graph.omni --json
omnigraph schema apply --schema ./next.pg ./graph.omni --json
omnigraph policy validate --config ./omnigraph.yaml
omnigraph policy test --config ./omnigraph.yaml
omnigraph policy explain --config ./omnigraph.yaml --actor act-alice --action read --branch main
omnigraph commit list ./repo.omni --json
omnigraph commit show --uri ./repo.omni <commit-id> --json
omnigraph commit list ./graph.omni --json
omnigraph commit show --uri ./graph.omni <commit-id> --json
```
(The legacy `omnigraph run list/show/publish/abort` subcommands were removed in MR-771; mutations and loads publish atomically and the commit graph (`omnigraph commit list`) is the audit surface.)
`query lint` and `query check` are the same command surface. In v1, repo-backed
lint uses local or `s3://` repo URIs; HTTP targets are only supported when you
`query lint` and `query check` are the same command surface. In v1, graph-backed
lint uses local or `s3://` graph URIs; HTTP targets are only supported when you
also pass `--schema`.
## Config
@ -98,3 +144,21 @@ The config file can also define:
When policy is enabled, `schema apply` is authorized through the
`schema_apply` action and is typically limited to admins on protected `main`.
## Deprecated names
The CLI was renamed to align with the HTTP server's canonical endpoint
names (`POST /query`, `POST /mutate`) and the `query` keyword in the GQ
language. The previous spellings keep working forever; invocations emit a
one-line warning to stderr and otherwise behave identically.
| Old (deprecated) | New (canonical) | Migration |
|--------------------------|---------------------|----------------------------------------------------------|
| `omnigraph read` | `omnigraph query` | Same flags and behavior. `read` is a visible clap alias. |
| `omnigraph change` | `omnigraph mutate` | Same flags and behavior. `change` is a visible clap alias. |
| `omnigraph query lint` | `omnigraph lint` | Same flags. The argv-level shim rewrites `query lint` to `lint`. |
| `omnigraph query check` | `omnigraph check` | `check` is a visible alias of `omnigraph lint`. |
The `command:` field in `aliases.<name>` in `omnigraph.yaml` accepts both
`read` / `change` (legacy) and `query` / `mutate` (canonical); the two
spellings are interchangeable on the wire via serde aliases.

View file

@ -11,6 +11,7 @@
| Internal manifest schema version | `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` | `db/manifest/migrations.rs` |
| Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | `exec/merge.rs` |
| Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | `db/omnigraph/optimize.rs` |
| Lance blob compaction support | `LANCE_SUPPORTS_BLOB_COMPACTION = false` | `db/omnigraph/optimize.rs` |
| Graph index cache size | `8` (LRU) | `runtime_cache.rs` |
| Default body limit | `1 MB` | `omnigraph-server/lib.rs` |
| Ingest body limit | `32 MB` | `omnigraph-server/lib.rs` |

View file

@ -8,8 +8,8 @@ internal deploy automation.
Omnigraph supports two broad deployment shapes:
- local directory repos
- `s3://` repos on AWS S3 or S3-compatible object stores
- local directory graphs
- `s3://` graphs on AWS S3 or S3-compatible object stores
The server binary and container image expose the same HTTP surface.
@ -20,18 +20,20 @@ Build or install:
- `omnigraph`
- `omnigraph-server`
Run against a local repo:
On Windows, the binaries are `omnigraph.exe` and `omnigraph-server.exe`.
Run against a local graph:
```bash
omnigraph-server ./repo.omni --bind 0.0.0.0:8080
omnigraph-server ./graph.omni --bind 0.0.0.0:8080
```
Run against an object-store-backed repo:
Run against an object-store-backed graph:
```bash
OMNIGRAPH_SERVER_BEARER_TOKEN="change-me" \
AWS_REGION="us-east-1" \
omnigraph-server s3://my-bucket/repos/example/releases/2026-04-10-v0.1.0 \
omnigraph-server s3://my-bucket/graphs/example/releases/2026-04-10-v0.1.0 \
--bind 0.0.0.0:8080
```
@ -46,7 +48,7 @@ curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/
The bootstrap:
- starts a local RustFS-backed object store
- creates a bucket and S3-backed Omnigraph repo
- creates a bucket and S3-backed Omnigraph graph
- loads the checked-in context fixture
- starts `omnigraph-server` on `127.0.0.1:8080`
@ -60,8 +62,8 @@ Useful overrides:
- `WORKDIR=/path/to/state`
- `BUCKET=omnigraph-local`
- `PREFIX=repos/context`
- `RESET_REPO=1` to delete an existing partially initialized repo prefix before recreating it
- `PREFIX=graphs/context`
- `RESET_REPO=1` to delete an existing partially initialized graph prefix before recreating it
- `BIND=127.0.0.1:8080`
- `RUSTFS_CONTAINER_NAME=omnigraph-rustfs-demo`
@ -76,7 +78,7 @@ If `aws` is not installed, the script attempts a user-local AWS CLI install via
running.
If a previous bootstrap left objects behind under the selected `PREFIX` but did
not finish initializing the repo, rerun with `RESET_REPO=1` or choose a new
not finish initializing the graph, rerun with `RESET_REPO=1` or choose a new
`PREFIX`.
## Container Deployment
@ -87,29 +89,59 @@ Build the image:
docker build -t omnigraph-server:local .
```
Run against a local repo:
Run against a local graph:
```bash
docker run --rm -p 8080:8080 \
-v "$PWD/repo.omni:/data/repo.omni" \
-v "$PWD/graph.omni:/data/graph.omni" \
omnigraph-server:local \
/data/repo.omni --bind 0.0.0.0:8080
/data/graph.omni --bind 0.0.0.0:8080
```
Run against an S3-backed repo:
Run against an S3-backed graph:
```bash
docker run --rm -p 8080:8080 \
-e OMNIGRAPH_SERVER_BEARER_TOKEN="change-me" \
-e AWS_REGION="us-east-1" \
omnigraph-server:local \
s3://my-bucket/repos/example/releases/2026-04-10-v0.1.0 \
s3://my-bucket/graphs/example/releases/2026-04-10-v0.1.0 \
--bind 0.0.0.0:8080
```
### Container entrypoint env vars
When no positional args are given, the image entrypoint
(`docker/entrypoint.sh`) builds the server command from env vars:
| Var | Effect |
|---|---|
| `OMNIGRAPH_TARGET_URI` | Graph URI, passed as the positional argument. |
| `OMNIGRAPH_CONFIG` | Path to an `omnigraph.yaml`, passed as `--config`. Used to supply a `policy.file` (Cedar authorization). The config file and any relative `policy.file` must be mounted into the container. |
| `OMNIGRAPH_TARGET` | Graph name to select from the config's `graphs:` block (with `OMNIGRAPH_CONFIG`, when no `OMNIGRAPH_TARGET_URI`). |
| `OMNIGRAPH_BIND` | Listen address (default `0.0.0.0:8080`). |
`OMNIGRAPH_TARGET_URI` and `OMNIGRAPH_CONFIG` **compose**: set both to keep the
graph URI in the env var while loading policy from the config file (the
positional URI wins over any `graphs:` entry). To enable Cedar policy on a
container otherwise driven by `OMNIGRAPH_TARGET_URI`, mount the config dir and
add `OMNIGRAPH_CONFIG`:
```bash
docker run --rm -p 8080:8080 \
-e OMNIGRAPH_SERVER_BEARER_TOKEN="change-me" \
-e OMNIGRAPH_TARGET_URI="s3://my-bucket/graphs/example/releases/2026-04-10-v0.1.0" \
-e OMNIGRAPH_CONFIG="/etc/omnigraph/omnigraph.yaml" \
-v "$PWD/config:/etc/omnigraph:ro" \
omnigraph-server:local
# /etc/omnigraph/omnigraph.yaml contains `policy: { file: ./policy.yaml }`;
# policy.yaml (+ optional policy.tests.yaml) sit beside it in the mount.
```
## Auth
The server can run unauthenticated for local development, but any shared or
The server can run unauthenticated for local development only when explicitly
started with `--unauthenticated` or `OMNIGRAPH_UNAUTHENTICATED=1`. Any shared or
internet-facing deployment should set a bearer token source.
### Token sources
@ -139,9 +171,11 @@ The server binary ships in two flavors:
| **Default** (on-prem / local dev) | `cargo build --release` | Core server, no AWS SDK |
| **AWS** | `cargo build --release --features aws` | Adds AWS Secrets Manager backend for bearer tokens |
Release artifacts are published with matching suffixes —
`omnigraph-server-<version>-<platform>.tar.gz` for the default build and
`omnigraph-server-<version>-<platform>-aws.tar.gz` for the AWS-enabled build.
Tagged release archives contain the default `omnigraph` and
`omnigraph-server` binaries on macOS / Linux, and `omnigraph.exe` plus
`omnigraph-server.exe` on Windows. AWS-enabled server binaries are built from
source with `cargo build --release --features aws -p omnigraph-server` when
needed.
The AWS build adds ~150 transitive deps and ~30-60s of first-build compile
time. Default builds don't pay that cost.
@ -154,7 +188,7 @@ Manager secret whose `SecretString` is a JSON object of
`{"actor_id": "token", ...}`:
```bash
omnigraph-server-aws s3://my-bucket/repos/example ...
omnigraph-server-aws s3://my-bucket/graphs/example ...
# Environment:
# OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET=arn:aws:secretsmanager:us-east-1:123456789012:secret:omnigraph-tokens-AbCdEf
```

View file

@ -22,7 +22,7 @@ Mark a Vector property with `@embed("source_text_property")`. At ingest, the eng
## CLI `omnigraph embed` (offline file pipeline)
Operates on **JSONL files** (not on a repo). Three modes (mutually exclusive):
Operates on **JSONL files** (not on a graph). Three modes (mutually exclusive):
- (default) `fill_missing` — only embed rows whose target field is empty
- `--reembed-all` — overwrite all

View file

@ -9,7 +9,7 @@
- `Manifest(ManifestError { kind: BadRequest|NotFound|Conflict|Internal, details: Option<ManifestConflictDetails>, … })`
- `ManifestConflictDetails::ExpectedVersionMismatch { table_key, expected, actual }` — caller's `expected_table_versions` did not match the manifest's current latest non-tombstoned version (set by `OmniError::manifest_expected_version_mismatch`).
- `ManifestConflictDetails::RowLevelCasContention` — Lance row-level CAS rejected the publish because a concurrent writer landed the same `object_id`. Retried internally by the publisher; only surfaces if the retry budget exhausts.
- **D₂ parse-time rejection** (MR-794): a single mutation query that mixes inserts/updates with deletes errors out *before any I/O* with kind `BadRequest`. Message: `mutation '<name>' on the same query mixes inserts/updates and deletes; split into separate mutations: (1) inserts and updates, then (2) deletes`. See [docs/user/query-language.md](query-language.md) for the rule and [docs/dev/runs.md](../dev/runs.md) for the underlying staged-write rationale.
- **D₂ parse-time rejection** (MR-794): a single mutation query that mixes inserts/updates with deletes errors out *before any I/O* with kind `BadRequest`. Message: `mutation '<name>' on the same query mixes inserts/updates and deletes; split into separate mutations: (1) inserts and updates, then (2) deletes`. See [docs/user/query-language.md](query-language.md) for the rule and [docs/dev/writes.md](../dev/writes.md) for the underlying staged-write rationale.
- `MergeConflicts(Vec<MergeConflict>)`
Compiler-side `NanoError` covers parse / catalog / type / storage / plan / execution / arrow / lance / IO / manifest / unique-constraint, each with structured spans (`SourceSpan { start, end }`) for ariadne-style diagnostics.

View file

@ -18,11 +18,11 @@ of MRs, internal recovery mechanics, or contributor-only invariants.
| Write queries and mutations | [query-language.md](query-language.md) |
| Use embeddings | [embeddings.md](embeddings.md) |
## Operate A Repo
## Operate A Graph
| Goal | Read |
|---|---|
| Understand repo layout and URI support | [storage.md](storage.md) |
| Understand graph layout and URI support | [storage.md](storage.md) |
| Work with branches, commits, and snapshots | [branches-commits.md](branches-commits.md) |
| Coordinate multi-query workflows | [transactions.md](transactions.md) |
| Read diffs and change feeds | [changes.md](changes.md) |

View file

@ -2,16 +2,29 @@
## Quick Install
macOS / Linux:
```bash
curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | bash
```
Windows PowerShell:
```powershell
powershell -NoProfile -ExecutionPolicy Bypass -Command "iwr -UseBasicParsing https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.ps1 | iex"
```
By default the installer places:
- `omnigraph`
- `omnigraph-server`
in `~/.local/bin`.
in `~/.local/bin` on macOS / Linux, or:
- `omnigraph.exe`
- `omnigraph-server.exe`
in `%USERPROFILE%\.local\bin` on Windows.
The default installer is binary-only. It downloads a published release asset,
verifies the SHA256 checksum, and unpacks it. It does not build from source.
@ -39,6 +52,13 @@ Rolling edge binaries from `main`:
curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | RELEASE_CHANNEL=edge bash
```
Windows rolling edge binaries:
```powershell
iwr -UseBasicParsing https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.ps1 -OutFile install.ps1
powershell -NoProfile -ExecutionPolicy Bypass -File .\install.ps1 -ReleaseChannel edge
```
Install from source:
```bash
@ -53,12 +73,24 @@ Install to a different directory:
curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | INSTALL_DIR="$HOME/bin" bash
```
Windows:
```powershell
powershell -NoProfile -ExecutionPolicy Bypass -File .\install.ps1 -InstallDir "$env:USERPROFILE\bin"
```
Install a specific tag:
```bash
curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | VERSION=v0.1.0 bash
```
Windows:
```powershell
powershell -NoProfile -ExecutionPolicy Bypass -File .\install.ps1 -Version v0.1.0
```
Build from a specific git ref:
```bash
@ -67,28 +99,53 @@ curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/
## Manual Source Build
macOS / Linux:
```bash
cargo build --release --locked -p omnigraph-cli -p omnigraph-server
install -m 0755 target/release/omnigraph ~/.local/bin/omnigraph
install -m 0755 target/release/omnigraph-server ~/.local/bin/omnigraph-server
```
Windows:
```powershell
cargo build --release --locked -p omnigraph-cli -p omnigraph-server
New-Item -ItemType Directory -Force "$env:USERPROFILE\.local\bin" | Out-Null
Copy-Item target\release\omnigraph.exe "$env:USERPROFILE\.local\bin\omnigraph.exe"
Copy-Item target\release\omnigraph-server.exe "$env:USERPROFILE\.local\bin\omnigraph-server.exe"
```
## Release Assets
Tagged releases are expected to publish:
- `omnigraph-linux-x86_64.tar.gz`
- `omnigraph-macos-x86_64.tar.gz`
- `omnigraph-macos-arm64.tar.gz`
- `omnigraph-windows-x86_64.zip`
Each archive contains both binaries:
The macOS / Linux archives contain both binaries:
- `omnigraph`
- `omnigraph-server`
The Windows archive contains:
- `omnigraph.exe`
- `omnigraph-server.exe`
## Verify The Install
macOS / Linux:
```bash
omnigraph version
omnigraph-server --help
```
Windows:
```powershell
omnigraph.exe version
omnigraph-server.exe --help
```

View file

@ -7,16 +7,23 @@
- Lance `compact_files()` on every node + edge table on `main`.
- Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests.
- Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8).
- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed }]`.
- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed, skipped }]`.
- **Blob tables are skipped.** A table that declares any `Blob` property is not compacted: it is reported with `skipped: Some(BlobColumnsUnsupportedByLance)` (and logged via `tracing::warn`) instead of compacted, and the rest of the sweep proceeds normally. The current Lance `compact_files` mis-decodes blob-v2 columns under its forced `BlobHandling::AllBinary` read; **reads and writes are unaffected** — only compaction is. This is gated by `LANCE_SUPPORTS_BLOB_COMPACTION` (`db/omnigraph/optimize.rs`) and removed when the upstream Lance fix lands (see [docs/dev/lance.md](../dev/lance.md)). Consequence: fragment count and deleted-row space on blob tables are not reclaimed until then; query results are never affected.
## `cleanup_all_tables(db, options)` — destructive
- Lance `cleanup_old_versions()` per table.
- Removes manifests (and their unique fragments) older than the retention policy.
- `CleanupPolicyOptions { keep_versions: Option<u32>, older_than: Option<Duration> }` — at least one is required.
- Returns `[TableCleanupStats { table_key, bytes_removed, old_versions_removed }]`.
- Returns `[TableCleanupStats { table_key, bytes_removed, old_versions_removed, error }]`.
- **Fault-isolated per table.** A single table's transient failure (version GC or
orphan reclaim) is recorded on that table's stats row (`error: Some(..)`, logged
via `tracing`) and never aborts the healthy tables — cleanup is the convergence
backstop, so it does as much as it can and converges on re-run. The CLI reports
any failed tables; rerun `cleanup` to retry them.
- CLI guards with `--confirm`; without it, prints a preview line.
- **Recovery floor:** `--keep < 3` may garbage-collect Lance versions that the open-time recovery sweep needs as a rollback target (the sweep restores to the branch's manifest-pinned table version, which is HEAD-1 in the typical Phase B → Phase C drift case). Default `--keep 10` is safe.
- **Orphaned-branch reconciliation:** before the version GC, cleanup runs `reconcile_orphaned_branches`, which `force_delete_branch`es any per-table or commit-graph Lance branch absent from the manifest branch list. These orphans arise when a `branch_delete` flips the manifest authority but a downstream best-effort reclaim does not complete (see [branches-commits.md](branches-commits.md)). The reconciler is authority-derived and idempotent (it no-ops once nothing is orphaned), runs regardless of the `keep_versions` / `older_than` values (those gate version GC only), and never reclaims `main` or system-branch forks. Reclaimed forks are logged via `tracing::info`.
## Tombstones

View file

@ -4,6 +4,8 @@ OmniGraph integrates AWS Cedar (`cedar-policy = 4.9`) for ABAC.
## Policy actions
Per-graph actions (bind to `Omnigraph::Graph::"<graph_id>"`):
1. `read` — query / snapshot / list branches & commits
2. `export` — NDJSON export
3. `change` — mutations
@ -12,6 +14,13 @@ OmniGraph integrates AWS Cedar (`cedar-policy = 4.9`) for ABAC.
6. `branch_delete`
7. `branch_merge`
8. `admin` — reserved for policy-management surfaces (hot reload, audit log, approvals). No call site today; see MR-724 for the reservation rationale.
9. `invoke_query` — gates invoking a server-side stored query (the `queries:` registry). Graph-scoped (like `admin`) — per-branch access is enforced by the inner `read` / `change` gate, so a rule that sets `branch_scope` on `invoke_query` is rejected. Coarse in this release: an `invoke_query` allow rule permits any stored query on the graph; a future, additive refinement adds an optional per-query-name scope without changing rules written against the coarse action. Enforced at `POST /queries/{name}` (see [server](server.md)). A stored *mutation* is double-gated: `invoke_query` to reach the tool, plus `change` for the write itself (the engine `_as` writers still enforce per the query body).
Server-scoped action (v0.6.0+; binds to `Omnigraph::Server::"root"`):
10. `graph_list``GET /graphs` registry enumeration (multi-graph mode)
Server-scoped actions cannot use `branch_scope` or `target_branch_scope` — they operate on the registry, not on a graph's branches. A rule cannot mix server-scoped and per-graph actions; split into separate rules. (Runtime `graph_create` / `graph_delete` are reserved but not shipped in v0.6.0; operators add/remove graphs by editing `omnigraph.yaml` and restarting.)
## Scope kinds
@ -19,6 +28,50 @@ OmniGraph integrates AWS Cedar (`cedar-policy = 4.9`) for ABAC.
- `target_branch_scope` — applied to destination (`schema_apply`, branch ops, run ops)
- `protected_branches` — named list with special rules; rule scopes are `any | protected | unprotected`
## Per-graph vs. server-level policy (multi-graph mode)
In multi mode (`omnigraph.yaml` with a non-empty `graphs:` map), policy files attach at two levels:
```yaml
server:
policy:
file: ./server-policy.yaml # server-level: graph_list
graphs:
alpha:
uri: s3://tenant-bucket/alpha
policy:
file: ./policies/alpha.yaml # per-graph: read, change, branch_*, schema_apply
beta:
uri: s3://tenant-bucket/beta
# no per-graph policy → no engine-layer Cedar enforcement on beta
```
**Config follows graph identity, not server mode.** A graph served by **name**
(`--target <name>` or `server.graph`) uses its own `graphs.<name>.policy.file`,
exactly as in multi-graph mode. Top-level `policy.file` applies only to an
**anonymous** graph — one served by a bare `<URI>` with no `graphs:` entry.
Serving a **named** graph (single- or multi-graph mode) while top-level
`policy.file` (or `queries:`) is populated **refuses boot**, naming the block,
since the top-level value would otherwise be silently shadowed by the per-graph
block. Move per-graph rules to `graphs.<graph_id>.policy.file` and `graph_list`
rules to `server.policy.file`.
Each graph's HTTP request flows through its own per-graph policy. The management endpoint (`GET /graphs`) flows through the server-level policy. When `server.policy.file` is unset, `GET /graphs` is denied in every runtime state, including `--unauthenticated`; with bearer tokens configured, it returns 403 after admission control because `graph_list` is not a `read`-equivalent action. The operator must explicitly authorize via `server-policy.yaml` to expose `/graphs`.
Example server-level policy:
```yaml
version: 1
groups:
admins: [act-andrew]
rules:
- id: admins-can-list-graphs
allow:
actors: { group: admins }
actions: [graph_list]
```
## Configuration
`omnigraph.yaml`:
@ -32,7 +85,7 @@ cli:
actor: act-andrew # default actor for CLI direct-engine writes
```
Each rule must use exactly one of `branch_scope` or `target_branch_scope`.
Each per-graph rule may use at most one of `branch_scope` or `target_branch_scope`. Server-scoped rules (`graph_list`) take neither — they have no branch context.
`cli.actor` is the default actor identity for CLI direct-engine writes
when `policy.file` is configured. Override per-invocation with `--as
@ -45,6 +98,10 @@ bearer token.
## CLI
Policy tooling resolves its graph like server single-mode policy: `cli.graph`
wins, otherwise `server.graph` is used, otherwise the top-level `policy.file`
is validated/tested/explained as the anonymous policy.
- `omnigraph policy validate` — parse + count actors, exit 1 on parse error.
- `omnigraph policy test` — run cases in `policy.tests.yaml`, exit 1 on any expectation mismatch.
- `omnigraph policy explain --actor … --action … [--branch …] [--target-branch …]` — show decision and matched rule.
@ -74,12 +131,13 @@ reaches `authorize_request()` without a matching policy permit.
|---|---|---|---|
| **Open** | no | no | Every request is permitted. Refuses to start unless `--unauthenticated` or `OMNIGRAPH_UNAUTHENTICATED=1` is set — the operator must explicitly opt in. |
| **DefaultDeny** | yes | no | Every authenticated request for an action other than `read` is rejected with HTTP 403. Closes the "tokens but forgot the policy file" trap — an operator who sets up auth and forgot to point at a policy file used to ship the illusion of protection. |
| **PolicyEnabled** | any | yes | Every request is evaluated by Cedar against the configured policy. |
| **PolicyEnabled** | yes | yes | Authenticated requests that reach a configured policy engine are evaluated by Cedar. Server-scoped actions still require `server.policy.file`. |
The classifier is `classify_server_runtime_state` in
`crates/omnigraph-server/src/lib.rs`; it returns `Err` for the "no
tokens, no policy, no flag" cell so the server refuses to start instead
of silently shipping an open instance. Tests pin every cell of the
tokens, no policy, no flag" cell and for "policy file, no tokens" so the
server refuses to start instead of silently shipping an open instance or
a policy-protected server that can only 401. Tests pin every cell of the
matrix and the State-2 deny path.
Server-side, `authorize_request()` still runs at the HTTP boundary —

View file

@ -70,7 +70,7 @@ A single mutation query must be **either insert/update-only or delete-only**. Mi
> `mutation '<name>' on the same query mixes inserts/updates and deletes; split into separate mutations: (1) inserts and updates, then (2) deletes. This restriction lifts when Lance exposes a two-phase delete API (tracked: MR-793 / Lance-upstream).`
Reason: under the staged-write rewire (MR-794), inserts and updates accumulate in memory and commit at end-of-query, while deletes still inline-commit (Lance 4.0.0 has no public two-phase delete). Mixing creates ordering hazards (same-row insert→delete becomes a no-op because the staged insert isn't visible to delete; cascading deletes of just-inserted edges break referential integrity by silent design). Until Lance exposes `DeleteJob::execute_uncommitted`, the parse-time rejection keeps both paths atomic and correct. See [docs/dev/runs.md](../dev/runs.md) and [docs/dev/invariants.md](../dev/invariants.md).
Reason: under the staged-write rewire (MR-794), inserts and updates accumulate in memory and commit at end-of-query, while deletes still inline-commit (Lance 4.0.0 has no public two-phase delete). Mixing creates ordering hazards (same-row insert→delete becomes a no-op because the staged insert isn't visible to delete; cascading deletes of just-inserted edges break referential integrity by silent design). Until Lance exposes `DeleteJob::execute_uncommitted`, the parse-time rejection keeps both paths atomic and correct. See [docs/dev/writes.md](../dev/writes.md) and [docs/dev/invariants.md](../dev/invariants.md).
## IR (Intermediate Representation)

View file

@ -1,26 +1,137 @@
# HTTP Server (`omnigraph-server`)
Axum 0.8 + tokio + utoipa-generated OpenAPI. Single repo per process; deploy multiple processes for multi-tenant.
Axum 0.8 + tokio + utoipa-generated OpenAPI. **Two modes** (v0.6.0+): single-graph (legacy) and multi-graph (MR-668). Mode is inferred from CLI args + config shape.
## Modes
### Single-graph mode (legacy)
`omnigraph-server <URI>` or `omnigraph-server --target <name> --config omnigraph.yaml`. Routes are flat — `/snapshot`, `/read`, `/branches`, etc.
**Config follows graph identity.** A bare `<URI>` is an *anonymous* graph and uses the **top-level** `policy.file` / `queries:`. A graph chosen by **name** (`--target` / `server.graph`) uses its own `graphs.<name>.{policy.file, queries}` — the same block multi-graph mode uses. ⚠️ *Changed from v0.6.0, which always used top-level config in single mode: a named-graph config that puts `policy`/`queries` at top-level now **refuses boot** and points you at `graphs.<name>.…` (move the block there). Bare-`<URI>` single mode is unchanged.*
### Multi-graph mode (v0.6.0+)
`omnigraph-server --config omnigraph.yaml` with a non-empty `graphs:` map and **no** single-mode selector (no `server.graph`, no `<URI>`, no `--target`). The server opens every configured graph in parallel at startup (bounded concurrency = 4, fail-fast on the first open error). Routes are nested under `/graphs/{graph_id}/...`. Bare flat paths return 404 in multi mode.
Mode inference (four-rule matrix):
1. CLI positional `<URI>` → single
2. CLI `--target <name>` → single
3. `server.graph` in config → single
4. `--config` + non-empty `graphs:` + no single-mode selector → **multi**
5. otherwise → error with migration hint
### Stored-query validation at startup
If a graph declares a `queries:` registry (see [cli-reference](cli-reference.md)), the server **loads and type-checks every stored query against that graph's live schema at startup** and **refuses to boot** if any query references a type or property the schema lacks — the same fail-loud posture as a malformed policy file, so schema drift surfaces at the deploy boundary rather than at invocation. Two MCP-exposed queries claiming the same tool name is likewise a boot error. Non-blocking advisories (e.g. an MCP-exposed query with a vector parameter an agent cannot supply) are logged. Validate offline before deploying with `omnigraph queries validate`. Discover the exposed queries as a typed tool catalog with `GET /queries`, and invoke one over HTTP with `POST /queries/{name}` (both below).
## Endpoint inventory
Per-graph endpoints — same body shape across modes; URLs differ:
| Method | Single-mode path | Multi-mode path | Auth | Action | Handler |
|---|---|---|---|---|---|
| GET | `/healthz` | `/healthz` | none | — | `server_health` |
| GET | `/openapi.json` | `/openapi.json` | none | — | `server_openapi` (strips security if auth disabled; in multi mode emits cluster paths with `cluster_` operation-id prefix) |
| GET | `/snapshot?branch=` | `/graphs/{id}/snapshot?branch=` | bearer + `read` | snapshot of branch | `server_snapshot` |
| POST | `/query` | `/graphs/{id}/query` | bearer + `read` | inline read query (canonical; clean field names `query`/`name`; mutations → 400) | `server_query` |
| POST | `/read` | `/graphs/{id}/read` | bearer + `read` | **deprecated** alias of `/query` (legacy field names `query_source`/`query_name`, byte-stable response; carries `Deprecation: true` + `Link: </query>; rel="successor-version"`) | `server_read` |
| POST | `/export` | `/graphs/{id}/export` | bearer + `export` | NDJSON stream | `server_export` |
| POST | `/mutate` | `/graphs/{id}/mutate` | bearer + `change` | mutation (canonical; `query`/`name`; accepts legacy `query_source`/`query_name` as serde aliases) | `server_mutate` |
| POST | `/change` | `/graphs/{id}/change` | bearer + `change` | **deprecated** alias of `/mutate` (carries `Deprecation: true` + `Link: </mutate>; rel="successor-version"`) | `server_change` |
| GET | `/queries` | `/graphs/{id}/queries` | bearer + `read` | list the `mcp.expose` stored queries as a typed tool catalog | `server_list_queries` |
| POST | `/queries/{name}` | `/graphs/{id}/queries/{name}` | bearer + `invoke_query` (+ `change` for a stored mutation) | invoke a named query from the `queries:` registry; deny == 404 | `server_invoke_query` |
| GET | `/schema` | `/graphs/{id}/schema` | bearer + `read` | get current `.pg` source | `server_schema_get` |
| POST | `/schema/apply` | `/graphs/{id}/schema/apply` | bearer + `schema_apply` (target=`main`) | migrate | `server_schema_apply` |
| POST | `/ingest` | `/graphs/{id}/ingest` | bearer + `branch_create` (if new) + `change` | bulk load | `server_ingest` (32 MB body limit) |
| GET | `/branches` | `/graphs/{id}/branches` | bearer + `read` | list branches | `server_branch_list` |
| POST | `/branches` | `/graphs/{id}/branches` | bearer + `branch_create` | create | `server_branch_create` |
| DELETE | `/branches/{branch}` | `/graphs/{id}/branches/{branch}` | bearer + `branch_delete` | delete | `server_branch_delete` |
| POST | `/branches/merge` | `/graphs/{id}/branches/merge` | bearer + `branch_merge` | merge `source → target` | `server_branch_merge` |
| GET | `/commits?branch=` | `/graphs/{id}/commits?branch=` | bearer + `read` | list | `server_commit_list` |
| GET | `/commits/{commit_id}` | `/graphs/{id}/commits/{commit_id}` | bearer + `read` | show | `server_commit_show` |
Server-level management endpoints (v0.6.0+):
| Method | Path | Auth | Action | Handler |
|---|---|---|---|---|
| GET | `/healthz` | none | — | `server_health` |
| GET | `/openapi.json` | none | — | `server_openapi` (strips security if auth disabled) |
| GET | `/snapshot?branch=` | bearer + `read` | snapshot of branch | `server_snapshot` |
| POST | `/read` | bearer + `read` | run named query | `server_read` |
| POST | `/export` | bearer + `export` | NDJSON stream | `server_export` |
| POST | `/change` | bearer + `change` | mutation | `server_change` |
| GET | `/schema` | bearer + `read` | get current `.pg` source | `server_schema_get` |
| POST | `/schema/apply` | bearer + `schema_apply` (target=`main`) | migrate | `server_schema_apply` |
| POST | `/ingest` | bearer + `branch_create` (if new) + `change` | bulk load | `server_ingest` (32 MB body limit) |
| GET | `/branches` | bearer + `read` | list branches | `server_branch_list` |
| POST | `/branches` | bearer + `branch_create` | create | `server_branch_create` |
| DELETE | `/branches/{branch}` | bearer + `branch_delete` | delete | `server_branch_delete` |
| POST | `/branches/merge` | bearer + `branch_merge` | merge `source → target` | `server_branch_merge` |
| GET | `/commits?branch=` | bearer + `read` | list | `server_commit_list` |
| GET | `/commits/{commit_id}` | bearer + `read` | show | `server_commit_show` |
| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list registered graphs | `server_graphs_list` (405 in single mode) |
### Stored-query catalog (`GET /queries`)
List the graph's **`mcp.expose`** stored queries as a typed tool catalog — enough for a client (e.g. an MCP server) to register each as a tool without fetching `.gq` source. Each entry: `{ name, tool_name, description, instruction, mutation, params }`, where each param is `{ name, kind, item_kind?, vector_dim?, nullable }`. `kind` is one of `string | bool | int | bigint | float | date | datetime | blob | vector | list` (decomposed so a consumer maps it with a closed `switch`, never re-parsing GQ type spelling). `bigint` (I64/U64), `date`, `datetime`, and `blob` are carried as JSON **strings** — a 64-bit integer loses precision as a JSON number, dates are ISO strings, and a blob is a URI string.
- **Read-gated** (works in default-deny mode). The catalog is **graph-wide** (branch-independent; `read` is authorized against `main`).
- **`mcp.expose` defaults to `true`** — declaring a query in `queries:` lists it; set `mcp: { expose: false }` to keep it HTTP/service-callable but hidden from the catalog.
- **Not Cedar-filtered per query (yet).** A caller with `read` but not `invoke_query` can *list* a query they can't *invoke* (which would 404). Closing that gap is future per-query authorization; for now the catalog is a discovery surface and `invoke_query` remains the invocation gate.
### Stored-query invocation (`POST /queries/{name}`)
Invoke a curated, server-side stored query by **name** — the source comes from the graph's `queries:` registry, so the client never sends `.gq`. The request body itself is optional; omit it for no-param queries, or send `{ "params": { … }, "branch": "main", "snapshot": null }`, where every field is optional and `params` keys match the query's declared parameters. The response is the **read envelope** (`ReadOutput`) for a stored read or the **mutation envelope** (`ChangeOutput`) for a stored mutation — serialized untagged, so the wire shape is identical to `/query` / `/mutate`.
- **Gate:** `invoke_query` (per-graph, graph-scoped) at the boundary. A stored *mutation* is **double-gated** — it also passes the engine's `change` gate, so an actor with `invoke_query` but not `change` gets `403`.
- **Deny == unknown, for callers without `invoke_query`:** for a caller lacking the grant, an `invoke_query` denial and an unknown query name return the **same `404`** (identical body), so the catalog can't be probed. A caller that *holds* `invoke_query` may still get the inner gate's `403` for an existing query it can't `read`/`change` (the double-gate, above) — so existence is visible to grant-holders by design.
- **Requires an explicit policy grant when auth is on.** In default-deny mode (bearer tokens but no `policy.file`), only `read` is permitted, so *every* `/queries/{name}` call returns `404` until an `invoke_query` rule is configured.
- A stored mutation cannot target a `snapshot` (`400`); a parameter type error is a structured `400` naming the parameter.
## Adding and removing graphs (multi mode)
Runtime add/remove via API is **not** exposed in v0.6.0 — neither
`POST /graphs` nor `DELETE /graphs/{id}` is implemented. Operators add
or remove graphs by stopping the server, editing the `graphs:` map in
`omnigraph.yaml`, then restarting. The server treats `omnigraph.yaml`
as operator-owned configuration and never writes it.
A future release may introduce a managed registry (Lance-backed,
catalog-style: reserve → init → publish with recovery sidecars) and
re-expose runtime mutation on top of it.
## Inline read queries (`POST /query`)
`POST /query` is the read-only, agent-friendly twin of `POST /read`. The
request body uses clean field names that match the CLI `-e` flag and the GQ
`query` keyword:
```json
{
"query": "query find($n: String) { match { $p: Person { name: $n } } return { $p.name } }",
"name": "find",
"params": { "n": "Alice" },
"branch": "main",
"snapshot": null
}
```
Response shape is identical to `/read` (`ReadOutput`). If the inline source
contains mutations (`insert` / `update` / `delete`), the request is rejected
with HTTP 400 and an error pointing the caller at `POST /mutate` — the
read-only contract is enforced at the URL.
`POST /mutate` is the canonical mutation endpoint. It accepts the same clean
field names (`query`, `name`); the legacy field names `query_source` and
`query_name` continue to deserialize as serde aliases so existing clients keep
working without changes.
## Deprecated names (`/read`, `/change`)
`POST /read` and `POST /change` are kept for back-compat indefinitely — they
are byte-stable on the request side and otherwise behave identically to
`/query` / `/mutate`. They are flagged as deprecated through three independent
channels:
- **OpenAPI**: the operations carry `deprecated: true` in `openapi.json`, so
every OpenAPI codegen (typescript-fetch, openapi-generator, oapi-codegen,
…) emits a `@deprecated` marker on the generated SDK method.
- **Response headers (RFC 9745)**: every response carries `Deprecation: true`.
- **Response headers (RFC 8288)**: every response carries a `Link` header
pointing at the canonical successor:
`Link: </query>; rel="successor-version"` for `/read`, and
`Link: </mutate>; rel="successor-version"` for `/change`. SDKs and HTTP
proxies can pick the successor up automatically.
Migration is purely cosmetic on the client side — swap the URL path, leave
the request body and response handling alone.
## Streaming
@ -34,8 +145,8 @@ Uniform `ErrorOutput { error, code?, merge_conflicts[], manifest_conflict? }` wi
caller's pre-write view of one table's manifest version was stale.
`ManifestConflictOutput { table_key, expected, actual }` tells the client
which table to refresh and retry. This is the conflict shape produced by
concurrent `/change` or `/ingest` calls landing the same `(table, branch)`
race.
concurrent `/mutate` (or its `/change` alias) or `/ingest` calls landing
the same `(table, branch)` race.
HTTP status codes used: 200, 400, 401, 403, 404, 409, 429, 500.
@ -61,10 +172,11 @@ actors are unaffected.
Cedar policy authorization runs **before** admission accounting so
denied requests don't consume admission slots.
Today admission gates every mutating handler: `/change`, `/ingest`,
`/branches/{create,delete,merge}`, and `/schema/apply`. Read-only
endpoints (`/snapshot`, `/read`, `/export`, `/branches` GET, `/commits`,
`/schema` GET) are not admission-gated.
Today admission gates every mutating handler: `/mutate` (and its
deprecated alias `/change`), `/ingest`, `/branches/{create,delete,merge}`,
and `/schema/apply`. Read-only endpoints (`/snapshot`, `/query`, `/read`,
`/export`, `/branches` GET, `/commits`, `/schema` GET) are not
admission-gated.
## Body limits
@ -79,7 +191,10 @@ endpoints (`/snapshot`, `/read`, `/export`, `/branches` GET, `/commits`,
1. `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` — AWS Secrets Manager (build with `--features aws`)
2. `OMNIGRAPH_SERVER_BEARER_TOKENS_FILE` or `OMNIGRAPH_SERVER_BEARER_TOKENS_JSON` — JSON `{actor_id: token, …}`
3. `OMNIGRAPH_SERVER_BEARER_TOKEN` — single legacy token, actor `default`
- If no tokens configured, server runs unauthenticated (local dev) and `/openapi.json` strips the security scheme.
- If no tokens are configured, startup refuses unless `--unauthenticated` or
`OMNIGRAPH_UNAUTHENTICATED=1` explicitly opts into open local-dev mode. A
policy file without tokens is also rejected at startup. In open mode
`/openapi.json` strips the security scheme.
See [deployment.md](deployment.md) for token-source operational details.
@ -87,15 +202,16 @@ See [deployment.md](deployment.md) for token-source operational details.
- `tower_http::TraceLayer::new_for_http()`
- Policy decisions logged at INFO level with actor, action, branch, decision, matched rule
- Startup logs: token source name, repo URI, bind address
- Startup logs: token source name, graph URI, bind address
- Graceful SIGINT shutdown
## Not implemented (by design or "TBD")
- CORS — not configured; add `tower_http::cors` if needed.
- Rate limiting — per-actor admission control gates `/change`, `/ingest`,
`/branches/{create,delete,merge}`, `/schema/apply` (see "Per-actor
- Rate limiting — per-actor admission control gates `/mutate` (alias
`/change`), `/ingest`, `/branches/{create,delete,merge}`,
`/schema/apply` (see "Per-actor
admission control" above). No global rate limiter is configured;
add `tower_http::limit` if a graph-wide cap is needed.
- Pagination — none (commits/branches return everything; export streams).
- Multi-tenant routing — one repo per process.
- Runtime graph add/remove — edit `omnigraph.yaml` and restart.

View file

@ -7,7 +7,7 @@ Every node type and every edge type is its own Lance dataset:
- **Columnar Arrow storage**: each property is a column; nullable per Arrow schema.
- **Fragments**: data is partitioned into fragments; new writes create new fragments.
- **Manifest versioning**: every commit produces a new dataset version; old versions remain readable.
- **Stable row IDs**: `enable_stable_row_ids: true` is set on every Lance dataset OmniGraph creates — node and edge data tables, `__manifest`, `_graph_commits.lance`, `_graph_commit_recoveries.lance`, and any future system tables. This is an architectural invariant: the flag is one-way at dataset create per Lance's row-id-lineage spec, so a future change that introduces a Lance dataset must preserve it. Consequences: `_row_created_at_version` and `_row_last_updated_at_version` are available on every dataset (load-bearing for change-feed validators); `CreateIndex × Rewrite` is not a retryable conflict, so indices survive `omnigraph optimize` without needing the Fragment Reuse Index; readers must use a Lance build that recognises the flag (our pinned 4.0.0 is fine). Pre-0.4.x repos created before this code path settled may have datasets without the flag and cannot be retrofitted in place — the supported path is dump-and-reload. The `stage_overwrite` rewrite path (used by `schema_apply`) preserves the flag through `Operation::Overwrite`; pinned by `stage_overwrite_preserves_stable_row_ids` in `crates/omnigraph/tests/staged_writes.rs`.
- **Stable row IDs**: `enable_stable_row_ids: true` is set on every Lance dataset OmniGraph creates — node and edge data tables, `__manifest`, `_graph_commits.lance`, `_graph_commit_recoveries.lance`, and any future system tables. This is an architectural invariant: the flag is one-way at dataset create per Lance's row-id-lineage spec, so a future change that introduces a Lance dataset must preserve it. Consequences: `_row_created_at_version` and `_row_last_updated_at_version` are available on every dataset (load-bearing for change-feed validators); `CreateIndex × Rewrite` is not a retryable conflict, so indices survive `omnigraph optimize` without needing the Fragment Reuse Index; readers must use a Lance build that recognises the flag (our pinned 4.0.0 is fine). Pre-0.4.x graphs created before this code path settled may have datasets without the flag and cannot be retrofitted in place — the supported path is dump-and-reload. The `stage_overwrite` rewrite path (used by `schema_apply`) preserves the flag through `Operation::Overwrite`; pinned by `stage_overwrite_preserves_stable_row_ids` in `crates/omnigraph/tests/staged_writes.rs`.
- **Append / delete / `merge_insert`**: native Lance write modes.
- **Per-dataset branches** (Lance native): copy-on-write at the dataset level.
- **Object-store agnostic**: file://, s3://, gs://, az://, http (read-only via Lance) — OmniGraph wires file:// and s3:// (`storage.rs`).
@ -22,7 +22,7 @@ OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordin
- `edges/{fnv1a64-hex(edge_type_name)}` — one Lance dataset per edge type
- `__manifest/` — the catalog of all sub-tables and their published versions
- `_graph_commits.lance` / `_graph_commit_actors.lance` — the commit graph and its actor map
- (legacy `_graph_runs.lance` / `_graph_run_actors.lance` from pre-v0.4.0 repos are inert; the run state machine was removed in MR-771 and these files are cleaned up via MR-770's production sweep)
- (legacy `_graph_runs.lance` / `_graph_run_actors.lance` from pre-v0.4.0 graphs are inert; the run state machine was removed in MR-771 and these files are cleaned up via MR-770's production sweep)
- **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`):
- `object_type``table | table_version | table_tombstone`
- `table_key``node:<TypeName> | edge:<EdgeName>`
@ -36,7 +36,7 @@ OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordin
The on-disk shape of `__manifest` is reconciled with the binary via a single stamp + dispatcher. `INTERNAL_MANIFEST_SCHEMA_VERSION` declares the shape this binary writes; the on-disk stamp `omnigraph:internal_schema_version` lives in the manifest dataset's schema-level metadata (Lance `update_schema_metadata`).
- **`init_manifest_repo`** stamps the current version at creation, so newly initialized repos never need migration.
- **`init_manifest_graph`** stamps the current version at creation, so newly initialized graphs never need migration.
- **Publisher open-for-write path** (`load_publish_state`) calls `migrate_internal_schema(&mut dataset)` before reading state. When the on-disk stamp matches the binary, this is a single metadata read with no writes; otherwise the dispatcher walks `match`-arm steps forward (1→2, 2→3, …) until the stamp matches, then proceeds with the publish. Reads stay side-effect-free.
- **Forward-version protection**: a stamp *higher* than the binary's known version triggers a clear "upgrade omnigraph first" error. An old binary cannot clobber a newer schema by silently treating "unknown stamp" as "missing stamp".
- **Idempotency**: each migration step is safe to re-run. A crash between two metadata updates inside a single step leaves the partial state; the next open re-runs the step and the second update lands. The dispatcher itself is a cheap stamp-read on the steady-state path.
@ -50,14 +50,14 @@ Adding a new on-disk shape change is one constant bump (`INTERNAL_MANIFEST_SCHEM
## On-disk layout
A repo on disk is a directory tree of Lance datasets. Each dataset follows the standard Lance layout (`_versions/`, `data/`, `_indices/`, `_refs/`); OmniGraph adds the multi-dataset coordination by keeping `__manifest/` alongside the per-type datasets.
A graph on disk is a directory tree of Lance datasets. Each dataset follows the standard Lance layout (`_versions/`, `data/`, `_indices/`, `_refs/`); OmniGraph adds the multi-dataset coordination by keeping `__manifest/` alongside the per-type datasets.
```mermaid
flowchart TB
classDef l1 fill:#fef3e8,stroke:#c46900,color:#000
classDef l2 fill:#e8f4fd,stroke:#1e6aa8,color:#000
repo["repo URI<br/>file:// or s3://bucket/prefix"]:::l2
graph["graph URI<br/>file:// or s3://bucket/prefix"]:::l2
manifest["__manifest/<br/>L2 catalog of sub-tables"]:::l2
nodes["nodes/{fnv1a64-hex}/<br/>one dataset per node type"]:::l2
@ -66,12 +66,12 @@ flowchart TB
recovery["__recovery/{ulid}.json<br/>recovery sidecars (transient)"]:::l2
refs["_refs/branches/{name}.json<br/>graph-level branches"]:::l2
repo --> manifest
repo --> nodes
repo --> edges
repo --> cgraph
repo --> recovery
repo --> refs
graph --> manifest
graph --> nodes
graph --> edges
graph --> cgraph
graph --> recovery
graph --> refs
subgraph dataset[Inside each Lance dataset — L1]
ds_v["_versions/{n}.manifest<br/>per-dataset versions"]:::l1
@ -88,10 +88,10 @@ flowchart TB
**What's where:**
- **Repo root** is one directory (or S3 prefix). Everything below is part of one OmniGraph repo.
- **Graph root** is one directory (or S3 prefix). Everything below is part of one OmniGraph graph.
- **`__manifest/`** is a Lance dataset whose rows describe which sub-table version is published at which graph-branch. Reading a snapshot starts here.
- **`nodes/`** and **`edges/`** are sibling directories holding one Lance dataset per declared type. Names are `fnv1a64-hex` of the type name to keep paths fixed-length and case-safe.
- **`_graph_commits.lance`** is an L2 dataset that records the graph-level commit DAG, with a paired `_graph_commit_actors.lance` for the actor map. (Pre-v0.4.0 repos also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; MR-770 sweeps these in production.)
- **`_graph_commits.lance`** is an L2 dataset that records the graph-level commit DAG, with a paired `_graph_commit_actors.lance` for the actor map. (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; MR-770 sweeps these in production.)
- **`_graph_commit_recoveries.lance`** — one row per recovery sweep action. Joined to `_graph_commits.lance` by `graph_commit_id`; the linked commit row carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join. See `crates/omnigraph/src/db/recovery_audit.rs`.
- **`__recovery/{ulid}.json`** — transient sidecar files written by the four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) before Phase B begins, deleted after Phase C succeeds. A sidecar persisting after process exit means the writer crashed in the Phase B → Phase C window; the next `Omnigraph::open` recovery sweep processes it. Steady-state directory is empty. See `crates/omnigraph/src/db/manifest/recovery.rs`.
- **`_refs/branches/{name}.json`** is graph-level branch metadata — pointers from a branch name to the manifest version it heads.

View file

@ -48,7 +48,7 @@ query register_employee_with_team($name: String, $age: I32, $team: String) {
```bash
omnigraph change --query ./mutations.gq --name register_employee_with_team \
--params '{"name":"Alice","age":30,"team":"Acme"}' ./repo.omni
--params '{"name":"Alice","age":30,"team":"Acme"}' ./graph.omni
```
If the second statement fails (e.g. `Acme` doesn't exist), the publisher never publishes; `Alice` is not in the database. Atomic.
@ -57,10 +57,10 @@ If the second statement fails (e.g. `Acme` doesn't exist), the publisher never p
```bash
# Query 1
omnigraph change --query ./mutations.gq --name register_employee --params '{"name":"Alice","age":30}' ./repo.omni
omnigraph change --query ./mutations.gq --name register_employee --params '{"name":"Alice","age":30}' ./graph.omni
# Query 2 — runs after Query 1 has already published
omnigraph change --query ./mutations.gq --name link_to_team --params '{"name":"Alice","team":"Acme"}' ./repo.omni
omnigraph change --query ./mutations.gq --name link_to_team --params '{"name":"Alice","team":"Acme"}' ./graph.omni
```
These are **two publishes** on `main`. If Query 2 fails, Query 1's effects are already visible. There is no `ROLLBACK` for Query 1.
@ -75,32 +75,32 @@ The pattern when you need to run multiple queries — possibly across multiple c
```bash
# Fork a working branch from main.
omnigraph branch create --from main onboarding/2026-04-25 ./repo.omni
omnigraph branch create --from main onboarding/2026-04-25 ./graph.omni
# Run any number of mutations on the branch — each one is its own publish on the branch.
# Concurrent reads of `main` are unaffected.
omnigraph change --branch onboarding/2026-04-25 \
--query ./mutations.gq --name register_employee \
--params '{"name":"Alice","age":30}' ./repo.omni
--params '{"name":"Alice","age":30}' ./graph.omni
omnigraph change --branch onboarding/2026-04-25 \
--query ./mutations.gq --name register_employee \
--params '{"name":"Bob","age":25}' ./repo.omni
--params '{"name":"Bob","age":25}' ./graph.omni
omnigraph change --branch onboarding/2026-04-25 \
--query ./mutations.gq --name link_to_team \
--params '{"name":"Alice","team":"Acme"}' ./repo.omni
--params '{"name":"Alice","team":"Acme"}' ./graph.omni
# Inspect the branch — read queries work just like on main.
omnigraph read --branch onboarding/2026-04-25 \
--query ./queries.gq --name list_employees ./repo.omni
--query ./queries.gq --name list_employees ./graph.omni
# Happy with what's on the branch? Merge it. This is one atomic publish:
# `main` flips to include every commit on the branch.
omnigraph branch merge onboarding/2026-04-25 --into main ./repo.omni
omnigraph branch merge onboarding/2026-04-25 --into main ./graph.omni
# OR: not happy? Throw it away. `main` is untouched.
# omnigraph branch delete onboarding/2026-04-25 ./repo.omni
# omnigraph branch delete onboarding/2026-04-25 ./graph.omni
```
Properties:
@ -115,16 +115,16 @@ Two agents writing to the same graph independently:
```bash
# Agent A
omnigraph branch create --from main agent-a/work ./repo.omni
omnigraph change --branch agent-a/work … ./repo.omni
omnigraph branch create --from main agent-a/work ./graph.omni
omnigraph change --branch agent-a/work … ./graph.omni
# … many mutations …
omnigraph branch merge agent-a/work --into main ./repo.omni
omnigraph branch merge agent-a/work --into main ./graph.omni
# Agent B (running concurrently)
omnigraph branch create --from main agent-b/work ./repo.omni
omnigraph change --branch agent-b/work … ./repo.omni
omnigraph branch create --from main agent-b/work ./graph.omni
omnigraph change --branch agent-b/work … ./graph.omni
# … many mutations …
omnigraph branch merge agent-b/work --into main ./repo.omni
omnigraph branch merge agent-b/work --into main ./graph.omni
```
Each agent sees a consistent snapshot of `main` at the time it forked. The first merge to `main` lands as a fast-forward (or a no-op if no concurrent change). The second merge runs three-way: rows touched by both branches surface as `MergeConflict`s for the caller to resolve.
@ -138,7 +138,7 @@ This is the workflow MR-797 / agentic loops are designed around: **branches are
| Single query fails mid-flight | Publisher never publishes; target unchanged | Read the error, decide whether to retry |
| Concurrent writers race the same `(table, branch)` | Publisher CAS rejects the loser with `ManifestConflictDetails::ExpectedVersionMismatch` | Refresh handle, retry the query |
| Branch with N successful mutations, then merge fails (three-way conflict) | Each individual mutation already committed on the branch; merge surfaces `MergeConflicts` | Inspect, decide whether to keep working on the branch, abandon it (`branch_delete`), or resolve and re-merge |
| Process crashes mid-branch-workflow | Each completed mutation on the branch is durable | Re-open the repo, continue where you left off |
| Process crashes mid-branch-workflow | Each completed mutation on the branch is durable | Re-open the graph, continue where you left off |
## When to use what
@ -156,7 +156,7 @@ This is the workflow MR-797 / agentic loops are designed around: **branches are
- **Cross-query atomicity on `main` without a branch.** If you don't want to fork a branch, multiple queries on `main` publish independently. There is no implicit transaction.
- **Long-running interactive transactions.** No `BEGIN` over a connection. Branches are the durable equivalent.
- **Cross-graph (cross-repo) transactions.** Each repo is its own atomicity domain.
- **Cross-graph transactions.** Each graph is its own atomicity domain.
- **"Pessimistic" locks** that serialize writers before they reach the storage layer. Snapshot-MVCC + publisher CAS handles concurrency optimistically; the loser retries.
## See also
@ -164,5 +164,5 @@ This is the workflow MR-797 / agentic loops are designed around: **branches are
- [`docs/user/branches-commits.md`](branches-commits.md) — branch and commit-graph mechanics.
- [`docs/dev/merge.md`](../dev/merge.md) — three-way merge details and conflict kinds.
- [`docs/user/query-language.md`](query-language.md) — `.gq` syntax for the multi-statement queries used above.
- [`docs/dev/runs.md`](../dev/runs.md) — the per-query commit pipeline that gives single-query atomicity.
- [`docs/dev/writes.md`](../dev/writes.md) — the per-query commit pipeline that gives single-query atomicity.
- [`docs/dev/invariants.md`](../dev/invariants.md) — the architectural rule.