mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-09 01:35:18 +02:00
Merge branch 'main' into andrew/datafusion-future-improvements-doc
This commit is contained in:
commit
2d6591dfb3
134 changed files with 16450 additions and 2643 deletions
|
|
@ -10,7 +10,7 @@ Three views, increasing zoom:
|
|||
2. **Layer view** — the eight-layer stack inside one OmniGraph process.
|
||||
3. **Component zoom-ins** — what's inside each layer.
|
||||
|
||||
For runtime flows (read query, mutation), see [`docs/dev/execution.md`](execution.md). For the on-disk layout of a repo, see [`docs/user/storage.md`](../user/storage.md).
|
||||
For runtime flows (read query, mutation), see [`docs/dev/execution.md`](execution.md). For the on-disk layout of a graph, see [`docs/user/storage.md`](../user/storage.md).
|
||||
|
||||
L1 (orange in the diagrams) is what we inherit from Lance; L2 (blue) is what OmniGraph adds. The L1/L2 framing is also called out in prose at the bottom of this doc.
|
||||
|
||||
|
|
@ -63,7 +63,7 @@ flowchart TB
|
|||
subgraph engine[omnigraph engine]
|
||||
plan[exec query and mutation]:::l2
|
||||
gi[graph index CSR/CSC<br/>RuntimeCache LRU 8]:::l2
|
||||
coord[coordinator<br/>ManifestRepo · CommitGraph]:::l2
|
||||
coord[coordinator<br/>ManifestCoordinator · CommitGraph]:::l2
|
||||
end
|
||||
|
||||
subgraph storage[storage trait — wraps Lance]
|
||||
|
|
@ -132,7 +132,7 @@ flowchart TB
|
|||
|
||||
subgraph state[graph state]
|
||||
coord[GraphCoordinator]:::l2
|
||||
mr[ManifestRepo<br/>db/manifest.rs]:::l2
|
||||
mr[ManifestCoordinator<br/>db/manifest.rs]:::l2
|
||||
cg[CommitGraph<br/>_graph_commits.lance]:::l2
|
||||
stg[MutationStaging<br/>per-query in-memory accumulator<br/>exec/staging.rs]:::l2
|
||||
end
|
||||
|
|
@ -166,7 +166,7 @@ Code paths:
|
|||
|
||||
- Read entry: `Omnigraph::query` at `crates/omnigraph/src/exec/query.rs:7`
|
||||
- Mutation entry: `Omnigraph::mutate` at `crates/omnigraph/src/exec/mutation.rs:511`
|
||||
- Manifest commit: `ManifestRepo::commit` at `crates/omnigraph/src/db/manifest.rs:280`
|
||||
- Manifest commit: `ManifestCoordinator::commit` at `crates/omnigraph/src/db/manifest.rs:280`
|
||||
- Graph index: `crates/omnigraph/src/graph_index/`
|
||||
- Loader: `Omnigraph::ingest` at `crates/omnigraph/src/loader/mod.rs:74`
|
||||
|
||||
|
|
@ -207,7 +207,7 @@ contracts:
|
|||
This pattern realizes read-your-writes within a multi-statement mutation
|
||||
and keeps failure scope bounded for inserts/updates by construction at
|
||||
the writer layer. See [docs/dev/invariants.md](invariants.md) and
|
||||
[docs/dev/runs.md](runs.md) for the publisher CAS contract this builds on.
|
||||
[docs/dev/writes.md](writes.md) for the publisher CAS contract this builds on.
|
||||
|
||||
### Storage trait — today vs. roadmap
|
||||
|
||||
|
|
@ -278,7 +278,7 @@ flowchart LR
|
|||
eng --> wq
|
||||
```
|
||||
|
||||
The server applies Cedar policy at the HTTP boundary today. The roadmap, called out in [docs/dev/invariants.md](invariants.md) as a known gap, is to push policy into the planner as predicates. After Cedar, mutating handlers go through `WorkloadController` (per-actor admission cap + byte budget; PR 2 / MR-686) before reaching the engine. The engine itself holds an `Arc<WriteQueueManager>` so concurrent mutations on the same `(table, branch)` serialize at the queue, while disjoint keys run in parallel — see [docs/user/server.md](../user/server.md) "Per-actor admission control" and [docs/dev/runs.md](runs.md). The CLI bypasses the HTTP layer (and admission) and calls the engine API directly.
|
||||
The server applies Cedar policy at the HTTP boundary today. The roadmap, called out in [docs/dev/invariants.md](invariants.md) as a known gap, is to push policy into the planner as predicates. After Cedar, mutating handlers go through `WorkloadController` (per-actor admission cap + byte budget; PR 2 / MR-686) before reaching the engine. The engine itself holds an `Arc<WriteQueueManager>` so concurrent mutations on the same `(table, branch)` serialize at the queue, while disjoint keys run in parallel — see [docs/user/server.md](../user/server.md) "Per-actor admission control" and [docs/dev/writes.md](writes.md). The CLI bypasses the HTTP layer (and admission) and calls the engine API directly.
|
||||
|
||||
Code paths:
|
||||
|
||||
|
|
|
|||
|
|
@ -8,7 +8,7 @@ This page explains what the policy says and how to change it.
|
|||
|
||||
| Setting | Value | Why |
|
||||
|---|---|---|
|
||||
| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test Workspace`, `Test omnigraph-server --features aws`, `CODEOWNERS / drift`, `CODEOWNERS / noedit` | Every PR must pass workspace tests, AGENTS.md link integrity, and the CODEOWNERS hygiene checks. `strict: true` requires the branch to be up-to-date with `main` before merge. |
|
||||
| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test Workspace`, `Test omnigraph-server --features aws`, `CODEOWNERS matches source`, `CODEOWNERS not hand-edited` | Every PR must pass workspace tests, AGENTS.md link integrity, and the CODEOWNERS hygiene checks. The two CODEOWNERS contexts must equal the job `name:` values in `.github/workflows/codeowners.yml` **verbatim** — a context naming a job that never reports (the old `CODEOWNERS / drift` used the job *id*, and the job was path-filtered) leaves every PR permanently pending and forces admin overrides. `strict: true` requires the branch to be up-to-date with `main` before merge. |
|
||||
| **Required approving reviews** | `1` | At least one reviewer. With a 2-person team, going higher would block all merges when one person is unavailable. |
|
||||
| **Require code-owner reviews** | `true` | The reviewer must be a code owner per `.github/CODEOWNERS`. This is what makes the codeowners chassis enforced. |
|
||||
| **Dismiss stale reviews on new commits** | `true` | A push after approval invalidates the prior review. Prevents the "approve, then sneak in unreviewed changes" pattern. |
|
||||
|
|
@ -16,12 +16,12 @@ This page explains what the policy says and how to change it.
|
|||
| **Disallow force pushes** | `true` | No history rewrites on `main`. |
|
||||
| **Disallow branch deletions** | `true` | `main` cannot be deleted. |
|
||||
| **Required conversation resolution** | `true` | All review comment threads must be resolved before merge. |
|
||||
| **Enforce on admins** | `true` | Even repo admins go through the gates. The point is no bypasses. |
|
||||
| **Enforce on admins** | `false` | Admins can override the gates (`enforce_admins: false` in the JSON). This is the intended escape hatch for the 2-person team; tightening to `true` is tracked under hardening below. |
|
||||
| **Required signed commits** | not yet | Not enabled. Would lock out maintainers until everyone enrolls GPG/SSH commit signing. Tracked as a follow-up. |
|
||||
|
||||
## How to apply
|
||||
|
||||
Run from the repo root:
|
||||
Run from the repository root:
|
||||
|
||||
```bash
|
||||
./scripts/apply-branch-protection.sh
|
||||
|
|
@ -29,7 +29,7 @@ Run from the repo root:
|
|||
|
||||
The script reads `.github/branch-protection.json`, strips the human-readable `_comment` field (the GitHub API rejects unknown keys), and PUTs to `repos/ModernRelay/omnigraph/branches/main/protection`.
|
||||
|
||||
Requires `gh` authenticated with a token that has admin permissions on the repo.
|
||||
Requires `gh` authenticated with a token that has admin permissions on the repository.
|
||||
|
||||
To preview without applying:
|
||||
|
||||
|
|
@ -57,7 +57,7 @@ Outputs the live policy. Compare against `.github/branch-protection.json` to det
|
|||
|
||||
- **Audit trail**: `git log .github/branch-protection.json` shows every change with a reviewable diff and a merge commit.
|
||||
- **Disaster recovery**: if branch protection is accidentally removed or weakened via the UI, the JSON is the canonical recovery point.
|
||||
- **Consistency**: pairs with `.github/codeowners-roles.yml` (the CODEOWNERS source of truth). Repo policy lives in the repo.
|
||||
- **Consistency**: pairs with `.github/codeowners-roles.yml` (the CODEOWNERS source of truth). Repository policy lives in the repository.
|
||||
|
||||
## What this gates
|
||||
|
||||
|
|
@ -69,7 +69,7 @@ After branch protection is applied, every PR targeting `main` must:
|
|||
4. Have all review conversations resolved.
|
||||
5. Be squash- or rebase-merged (no merge commits).
|
||||
|
||||
Even repo admins are subject to these rules.
|
||||
Even repository admins are subject to these rules.
|
||||
|
||||
## Subsequent hardening (not in this PR)
|
||||
|
||||
|
|
|
|||
|
|
@ -2,9 +2,10 @@
|
|||
|
||||
`.github/workflows/`:
|
||||
|
||||
- **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repo PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`).
|
||||
- **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repository PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`).
|
||||
- **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest.
|
||||
- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_repo_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`.
|
||||
- **release-edge.yml**: on every push to main, retags `edge`, builds Linux/macOS-Intel/macOS-arm64 archives + sha256, publishes a rolling prerelease.
|
||||
- **release.yml**: on `v*` tags, builds the 3-platform matrix and updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by pushing the regenerated formula to `ModernRelay/homebrew-tap`.
|
||||
- **Windows binary build job**: `cargo build --release --locked -p omnigraph-cli -p omnigraph-server` on windows-latest with smoke checks for `omnigraph.exe version`, `omnigraph-server.exe --help`, and PowerShell installer syntax.
|
||||
- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_graph_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`.
|
||||
- **release-edge.yml**: on every push to main, retags `edge`, builds Linux x86_64 / macOS arm64 archives and Windows x86_64 zip + sha256, publishes a rolling prerelease, then smoke-tests the Windows PowerShell installer against `edge`.
|
||||
- **release.yml**: on `v*` tags, builds the Linux x86_64 / macOS arm64 archives and Windows x86_64 zip release matrix, updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by pushing the regenerated formula to `ModernRelay/homebrew-tap`, and smoke-tests the Windows PowerShell installer against the tag.
|
||||
- **package.yml**: manual ECR image build; emits two image tags per commit (`<sha>`, `<sha>-aws`) via CodeBuild.
|
||||
|
|
|
|||
|
|
@ -2,26 +2,47 @@
|
|||
|
||||
`.github/CODEOWNERS` is **generated** — not hand-edited. The source of truth is `.github/codeowners-roles.yml`, expanded by `.github/scripts/render-codeowners.py`. CI rejects drift between the two and rejects direct edits to `CODEOWNERS` that don't accompany a yml change.
|
||||
|
||||
This setup gives every role change a reviewable PR and a permanent in-repo audit trail (`git log .github/codeowners-roles.yml`).
|
||||
This setup gives every role change a reviewable PR and a permanent in-repository audit trail (`git log .github/codeowners-roles.yml`).
|
||||
|
||||
## Current roles
|
||||
## Who owns what
|
||||
|
||||
| Role | Members | Scope |
|
||||
The tables below are **generated** from `.github/codeowners-roles.yml` by `.github/scripts/render-codeowners.py` (the same render that produces `.github/CODEOWNERS`). They are the always-current "who owns what at this commit" view — don't edit them by hand; edit the yml and re-render.
|
||||
|
||||
<!-- BEGIN GENERATED OWNERSHIP — edit codeowners-roles.yml + run render-codeowners.py -->
|
||||
|
||||
**Path → owners** (GitHub applies *last match wins*; the `*` catch-all is listed first and is overridden by the specific patterns below it):
|
||||
|
||||
| Path | Owners | Role(s) |
|
||||
|---|---|---|
|
||||
| `engineering` | `@aaltshuler` | All code under `crates/**`, repo infrastructure, default for unmapped paths |
|
||||
| `docs` | `@aaltshuler`, `@ragnorc` | `docs/**`, README.md, AGENTS.md, CLAUDE.md, SECURITY.md |
|
||||
| `*` | @ragnorc | engineering |
|
||||
| `crates/**` | @ragnorc | engineering |
|
||||
| `docs/**` | @ragnorc | docs |
|
||||
| `README.md` | @ragnorc | docs |
|
||||
| `AGENTS.md` | @ragnorc | docs |
|
||||
| `CLAUDE.md` | @ragnorc | docs |
|
||||
| `SECURITY.md` | @ragnorc | docs |
|
||||
|
||||
GitHub treats multiple owners in a CODEOWNERS line as **"any one of them satisfies the review requirement"**. For docs, either named member can approve. To require N distinct approvers on a specific path, layer a CI check on top (not currently configured).
|
||||
**Roles**:
|
||||
|
||||
| Role | Members | Description |
|
||||
|---|---|---|
|
||||
| `engineering` | @ragnorc | All production code under crates/**. Engine, CLI, server, compiler. |
|
||||
| `docs` | @ragnorc | Documentation under docs/**, plus repo-level docs (README.md, AGENTS.md, CLAUDE.md symlink, SECURITY.md). |
|
||||
|
||||
<!-- END GENERATED OWNERSHIP -->
|
||||
|
||||
GitHub treats multiple owners on a CODEOWNERS line as **"any one of them satisfies the review requirement"**. To require N distinct approvers on a specific path, layer a CI check on top (not currently configured).
|
||||
|
||||
## How to change role membership or path mappings
|
||||
|
||||
1. Edit `.github/codeowners-roles.yml`.
|
||||
2. Run `python3 .github/scripts/render-codeowners.py` (requires PyYAML; `pip install pyyaml`).
|
||||
3. Commit both files in the same PR.
|
||||
2. Open a PR. **CI re-renders for you**: the `CODEOWNERS` workflow regenerates `.github/CODEOWNERS` and the ownership tables above and auto-commits them back to your PR branch on same-repository PRs — you don't have to run the script locally (though you can: `python3 .github/scripts/render-codeowners.py`, requires PyYAML).
|
||||
|
||||
On a fork (where CI can't push back), the workflow instead fails with the diff so you can run the script and commit it yourself.
|
||||
|
||||
CI fails the PR if:
|
||||
- `CODEOWNERS` was edited without a corresponding yml change, or
|
||||
- The yml was changed but the rendered `CODEOWNERS` doesn't match.
|
||||
- a fork PR left a generated artifact out of sync, or
|
||||
- `CODEOWNERS` was edited without a corresponding yml change (the `CODEOWNERS not hand-edited` check).
|
||||
|
||||
## How to add a new role
|
||||
|
||||
|
|
@ -34,4 +55,4 @@ CI fails the PR if:
|
|||
- **Audit trail**: `git log .github/codeowners-roles.yml` is the canonical record of every role change. The rendered `CODEOWNERS` is a derived artifact.
|
||||
- **Roles are first-class**: paths reference roles, not raw handles. Renaming a person or rotating a role updates one place, not every path.
|
||||
- **Future extension**: scheduled rotation (weekly on-call, quarterly leads) plugs into the same yml without changing the path mappings. Not enabled today.
|
||||
- **Consistency with the product**: omnigraph itself enforces auditable Cedar policy. The repo's code-owner policy follows the same "policy as reviewed code" pattern.
|
||||
- **Consistency with the product**: omnigraph itself enforces auditable Cedar policy. The repository's code-owner policy follows the same "policy as reviewed code" pattern.
|
||||
|
|
|
|||
|
|
@ -147,7 +147,7 @@ sequenceDiagram
|
|||
- End-of-query Lance commit: `TableStore::stage_append`, `stage_merge_insert`, `commit_staged` at `crates/omnigraph/src/table_store.rs`
|
||||
- Manifest commit primitive: `commit_updates_on_branch_with_expected` at `crates/omnigraph/src/db/omnigraph/table_ops.rs`
|
||||
|
||||
Atomicity guarantee for multi-statement mutations: a mid-query failure leaves Lance HEAD untouched on staged tables (no inline commit happened during op execution), so the next mutation proceeds normally with no `ExpectedVersionMismatch`. The publisher CAS at the very end either succeeds (manifest advances atomically across all touched sub-tables) or fails with a typed `ManifestConflictDetails::ExpectedVersionMismatch` (no partial publish). See [docs/dev/invariants.md](invariants.md) and [docs/dev/runs.md](runs.md).
|
||||
Atomicity guarantee for multi-statement mutations: a mid-query failure leaves Lance HEAD untouched on staged tables (no inline commit happened during op execution), so the next mutation proceeds normally with no `ExpectedVersionMismatch`. The publisher CAS at the very end either succeeds (manifest advances atomically across all touched sub-tables) or fails with a typed `ManifestConflictDetails::ExpectedVersionMismatch` (no partial publish). See [docs/dev/invariants.md](invariants.md) and [docs/dev/writes.md](writes.md).
|
||||
|
||||
## Bulk loader (`loader/mod.rs`)
|
||||
|
||||
|
|
|
|||
|
|
@ -21,7 +21,7 @@ constraints. User-facing behavior should still be documented through
|
|||
|---|---|
|
||||
| System structure, L1/L2 framing, component diagrams | [architecture.md](architecture.md) |
|
||||
| On-disk layout, manifest schema, URI behavior | [storage.md](../user/storage.md) |
|
||||
| Direct-publish writes, D2, staged writes, recovery sidecars | [runs.md](runs.md) |
|
||||
| Direct-publish writes, D2, staged writes, recovery sidecars | [writes.md](writes.md) |
|
||||
| Query execution, mutation execution, loader flow | [execution.md](execution.md) |
|
||||
| DataFusion: current state, passive wins, future improvements | [datafusion-future-improvements.md](datafusion-future-improvements.md) |
|
||||
| Index lifecycle and graph topology indexes | [indexes.md](../user/indexes.md) |
|
||||
|
|
@ -59,6 +59,9 @@ Working documents for in-flight feature work. Removed when the work lands.
|
|||
| Area | Read |
|
||||
|---|---|
|
||||
| Schema-lint chassis v1 (MR-694) — `--allow-data-loss`, soft/hard drops | [schema-lint-v1-plan.md](schema-lint-v1-plan.md) |
|
||||
| Inline + stored queries, request/response envelope, MCP (MR-656 / MR-976 / MR-969) | [rfc-001-queries-envelope-mcp.md](rfc-001-queries-envelope-mcp.md) |
|
||||
| Config & CLI architecture — layered config, client targeting, file naming (MR-973 / MR-974 / MR-981) | [rfc-002-config-cli-architecture.md](rfc-002-config-cli-architecture.md) |
|
||||
| MCP server surface — full tool parity, stored queries, modular auth (MR-969 / MR-956 / MR-974) | [rfc-003-mcp-server-surface.md](rfc-003-mcp-server-surface.md) |
|
||||
|
||||
## Boundary
|
||||
|
||||
|
|
|
|||
|
|
@ -38,7 +38,7 @@ Use it this way:
|
|||
publishes one manifest update. Do not commit per statement. Delete-only
|
||||
queries are the documented inline residual; the parse-time D2 rule prevents
|
||||
mixing deletes with insert/update until Lance exposes two-phase delete.
|
||||
Read [runs.md](runs.md) and [execution.md](execution.md).
|
||||
Read [writes.md](writes.md) and [execution.md](execution.md).
|
||||
|
||||
5. **Recovery is part of the commit protocol.** Writers that can advance Lance
|
||||
HEAD before manifest publish must write `__recovery/{ulid}.json` sidecars.
|
||||
|
|
@ -56,7 +56,7 @@ Use it this way:
|
|||
branch they read even when index coverage is partial. Expensive index work
|
||||
should converge from manifest state instead of extending the critical write
|
||||
path. Scalar staged index builds and vector inline residuals are documented
|
||||
in [runs.md](runs.md) and [indexes.md](../user/indexes.md).
|
||||
in [writes.md](writes.md) and [indexes.md](../user/indexes.md).
|
||||
|
||||
8. **Schema identity survives renames.** Accepted schema identity must remain
|
||||
stable across type and property renames. Rename support belongs in migration
|
||||
|
|
@ -96,17 +96,25 @@ Use it this way:
|
|||
|
||||
| Area | Current state | Source |
|
||||
|---|---|---|
|
||||
| Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | [runs.md](runs.md), [architecture.md](architecture.md) |
|
||||
| Constructive mutations | In-memory `MutationStaging`, one end-of-query table commit per touched table, then one manifest publish | [runs.md](runs.md), [execution.md](execution.md) |
|
||||
| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/query-language.md), [runs.md](runs.md) |
|
||||
| Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | [writes.md](writes.md), [architecture.md](architecture.md) |
|
||||
| Constructive mutations | In-memory `MutationStaging`, one end-of-query table commit per touched table, then one manifest publish | [writes.md](writes.md), [execution.md](execution.md) |
|
||||
| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/query-language.md), [writes.md](writes.md) |
|
||||
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branches-commits.md), [maintenance.md](../user/maintenance.md) |
|
||||
| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema-language.md), [execution.md](execution.md) |
|
||||
| Unique constraints | Intra-batch and write-path checks exist; full cross-version uniqueness is still a gap | [schema-language.md](../user/schema-language.md) |
|
||||
| Storage trait | `TableStorage` exists as the sealed staged-write surface; full call-site migration and capability/stat surfaces are incomplete | [runs.md](runs.md), [architecture.md](architecture.md) |
|
||||
| Storage trait | `TableStorage` exists as the sealed staged-write surface; full call-site migration and capability/stat surfaces are incomplete | [writes.md](writes.md), [architecture.md](architecture.md) |
|
||||
| Index lifecycle | `ensure_indices` is explicit today; reconciler-based convergence is roadmap | [indexes.md](../user/indexes.md), [maintenance.md](../user/maintenance.md) |
|
||||
| Traversal IDs | Runtime still builds `TypeIndex`; Lance stable row-id based graph IDs are roadmap | [architecture.md](architecture.md), [query-language.md](../user/query-language.md) |
|
||||
| Auth | Bearer token hashing and server-side actor resolution are implemented at the HTTP boundary | [server.md](../user/server.md), [policy.md](../user/policy.md) |
|
||||
| Tests | Tempdir-backed Lance tests are the current substrate; there is no `MemStorage` test backend | [testing.md](testing.md) |
|
||||
|
||||
The branch-delete reconciler is authority-derived: it reclaims orphaned forks
|
||||
today and degrades to a no-op if Lance ships an atomic multi-dataset branch
|
||||
operation, so the design composes with that future rather than blocking it. This
|
||||
is the same shape as invariant 7 (indexes are derived state); prefer it over a
|
||||
recovery-sidecar-style approach for any new multi-dataset metadata operation,
|
||||
since the sidecar would be scaffolding to remove once the substrate closes the gap.
|
||||
|
||||
## Known Gaps
|
||||
|
||||
Do not hide these behind invariant wording. Either move them forward or keep
|
||||
|
|
@ -122,6 +130,15 @@ them explicit.
|
|||
- **Deletes and vector indexes:** `delete_where` and vector index creation still
|
||||
advance Lance HEAD inline because the required public Lance APIs are missing.
|
||||
Keep D2 and recovery coverage in place until those residuals are removed.
|
||||
- **Blob-column compaction:** Lance `compact_files` mis-decodes blob-v2 columns
|
||||
under its forced `BlobHandling::AllBinary` read ("more fields in the schema
|
||||
than provided column indices"), so `optimize` skips any table with a `Blob`
|
||||
property — reporting `SkipReason::BlobColumnsUnsupportedByLance` (loud, not a
|
||||
silent drop) behind the `LANCE_SUPPORTS_BLOB_COMPACTION` gate. Reads and writes
|
||||
are unaffected; only space/fragment reclamation on blob tables is deferred.
|
||||
Remove the skip when the upstream Lance fix lands — the
|
||||
`lance_surface_guards.rs::compact_files_still_fails_on_blob_columns` guard
|
||||
turns red on that bump to force it.
|
||||
- **Planner capability/stat surfaces:** cost-aware planning, complete
|
||||
capability advertisement, and explain-with-cost are roadmap. Do not describe
|
||||
them as implemented.
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
# Lance Docs Index (for OmniGraph agents)
|
||||
|
||||
OmniGraph sits on top of Lance. Many problems — index lifecycle, branching, transactions, fragments, compaction, vector/FTS internals — are answered upstream in Lance's docs, not in this repo.
|
||||
OmniGraph sits on top of Lance. Many problems — index lifecycle, branching, transactions, fragments, compaction, vector/FTS internals — are answered upstream in Lance's docs, not in this codebase.
|
||||
|
||||
This file is the curated entry point. **When you hit a Lance-shaped problem, find the matching topic below and fetch the listed URL(s) before guessing.** Don't grep our codebase for behavior that is documented authoritatively in Lance.
|
||||
|
||||
|
|
@ -175,7 +175,9 @@ Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53,
|
|||
- **Lance #6658 closed** (2026-05-14) but `DeleteBuilder::execute_uncommitted` did **not** ship in v6.0.1 — binary search across the release stream shows it first appears in `v7.0.0-beta.10` (the closing commits landed on main but didn't backport to the 6.x line). Tracked as MR-A: migrate `delete_where` to staged, retire the parse-time D2 mutation rule, extend recovery sidecar coverage. **Gated on the Lance v7.x bump**, not this PR. v7.0.0-rc.1 dropped 2026-05-21.
|
||||
- **Lance #6666 still open** (`build_index_metadata_from_segments` public): vector-index two-phase blocked; inline `create_vector_index` residual retained.
|
||||
- **Lance #6877 still open** (`MergeInsertBuilder` dup-rowid): PR #109's `SourceDedupeBehavior::FirstSeen` + `check_batch_unique_by_keys` precondition stay load-bearing.
|
||||
- **`Dataset::force_delete_branch`** (`branches().delete(name, force=true)`, dataset.rs:524) tolerates a missing branch-*contents* ref (vs plain `delete_branch`'s `RefNotFound`), but on the local store still errors `NotFound` if the branch `tree/` directory is fully absent (`remove_dir_all`'s NotFound is not caught for Lance's native error variant, refs.rs:526-549). Both variants still refuse a branch with referencing descendants (`RefConflict`). `TableStore::force_delete_branch` wraps this to be fully idempotent (tolerates already-absent). The single-authority branch-delete redesign uses it for orphan reclamation (eager best-effort reclaim + cleanup reconciler). Pinned by `lance_surface_guards.rs::force_delete_branch_semantics`. Branch delete is "flip the ref atomically, then `remove_dir_all(tree/{branch})`"; branch-exclusive data lives under `tree/{branch}/` so a drop reclaims it immediately without touching `main`.
|
||||
- **Lance blob-v2 `compact_files` bug** (no public issue found as of 2026-06): `compact_files` disables binary-copy for blob datasets and forces `BlobHandling::AllBinary` on the read side; the v2.1+ structural decoder then mis-counts column infos for the blob-v2 struct and fails with `Invalid user input: there were more fields in the schema than provided column indices / infos` (`lance-encoding/src/decoder.rs::ColumnInfoIter::expect_next`). This fails even a pristine uniform-V2_2 multi-fragment blob table; vector/list/scalar/ragged columns and mixed file versions all compact fine. Reads/queries use descriptor handling (`BlobHandling::default()`) and are unaffected. `optimize` skips blob-bearing tables behind `LANCE_SUPPORTS_BLOB_COMPACTION = false` (`db/omnigraph/optimize.rs`), reporting `SkipReason::BlobColumnsUnsupportedByLance`. Pinned by `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns`, which turns red when the bug is fixed → flip the gate, remove the skip branch + the `maintenance.rs::optimize_skips_blob_table_and_reports_skip` skip assertions.
|
||||
|
||||
Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (8 named guards; 3 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension).
|
||||
Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (10 named guards; 5 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension).
|
||||
|
||||
Bump this date stanza on the next alignment pass.
|
||||
|
|
|
|||
351
docs/dev/rfc-001-queries-envelope-mcp.md
Normal file
351
docs/dev/rfc-001-queries-envelope-mcp.md
Normal file
|
|
@ -0,0 +1,351 @@
|
|||
# RFC: Inline + Stored Queries, Request/Response Envelope, MCP
|
||||
|
||||
**Status:** Proposed
|
||||
**Date:** 2026-05-28
|
||||
**Tickets:** MR-656 (inline `-e` + URL rename), MR-668 (multi-graph, shipped), MR-976 (Phase 1 envelope parent: MR-977 / MR-978 / MR-979 / MR-980), MR-969 (stored queries + MCP)
|
||||
**Target release:** v0.6.x patch series (MR-656 + Phase 1) → v0.7.0 (MR-969 PRs 1-3)
|
||||
|
||||
## Summary
|
||||
|
||||
OmniGraph today exposes `POST /read` and `POST /change` with a weakly-contracted body (counts only on writes) and no per-query authorization. This RFC consolidates the work landing across three Linear tickets into one coherent design:
|
||||
|
||||
1. **MR-656**: rename `/read` → `/query` and `/change` → `/mutate`, add inline `-e` CLI flag, ship three-channel deprecation on the legacy URLs. **In flight, PR #110.**
|
||||
2. **Envelope hardening** (this RFC adds it as a Phase 1 before MR-969): make today's mutation surface agent-grade with idempotency keys, preconditions, deadlines, and a structured response envelope carrying `audit_id`, `commit_id`, `snapshot_id`, and cost stats.
|
||||
3. **MR-969**: add a stored-query registry, `POST /queries/{name}`, a new `InvokeQuery` Cedar action with per-query scope, inline pragmas in `.gq` (`@description`, `@returns`, `@mcp`), and MCP transport over the same routing primitive.
|
||||
|
||||
The bet: inline and stored queries serve different stages of the same lifecycle, run through the same engine code, and are gated by different Cedar actions. HelixDB collapsed to stored-only. Postgres has neither stored-query Cedar nor MCP. The window for an OSS, declarative, agent-grade graph query surface is open.
|
||||
|
||||
## Motivation
|
||||
|
||||
Three problems today:
|
||||
|
||||
- **Mutation responses are too thin.** `ChangeOutput { node_count, edge_count }` is the entire memory the API has of what just happened. No `commit_id`, no `audit_id`, no `snapshot_id`. Agents reporting results have nothing to cite. Humans can't reproduce a read.
|
||||
- **No agent-safe surface.** Cedar gates `read` and `change` at the action level. A token either runs *any* query or *no* query of that kind. There is no way to express "this agent can invoke `find_user` and nothing else."
|
||||
- **No discovery primitive.** Agents need a tool list. SDKs need a stable contract per operation. Both are absent.
|
||||
|
||||
The MR-656 rename solves the cosmetic asymmetry (`/read` was a poor pair for the future `/queries/{name}`). The envelope work and MR-969 solve the substantive gaps.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Compiled query bundles (HelixDB's `queries.json` shape). `.gq` files are already declarative; the file *is* the artifact.
|
||||
- Hot reload of the registry. Restart-only matches the multi-graph operational model from MR-668.
|
||||
- Per-query rate limits in v1. Existing `WorkloadController` covers the bulk of the risk. Punt to a future ticket.
|
||||
- Cross-graph tool listing in MCP. Agents loop over per-graph endpoints when they need multi-graph access. Avoid namespacing in the contract.
|
||||
- Web dashboard / control-plane management of the registry. Operators edit `.gq` + `policy.yaml` and restart.
|
||||
- Schema introspection through MCP. Schema is an operator concern; agents see types through declared return shapes on the queries they're allowed to invoke.
|
||||
- Per-environment override files. Environment-specific differences live in `policy.yaml`, which already has per-env variants.
|
||||
|
||||
## Background
|
||||
|
||||
OmniGraph runs on Lance 6.x with a property graph layered on top: typed nodes/edges in per-type Lance datasets, atomic multi-table commits via a `__manifest` table, branchable and time-travelable through Lance versioning. The HTTP server (`omnigraph-server`) is Axum + utoipa with bearer-token auth and Cedar policy enforcement at every `_as` writer.
|
||||
|
||||
MR-668 shipped multi-graph mode in v0.6.0. One server process can host 1-10 graphs, with per-graph endpoints under `/graphs/{id}/...`. Cedar policy resolves against `Server::"root"` (for management actions) and `Graph::"prod"` (for per-graph actions).
|
||||
|
||||
MR-656 is currently in PR #110 (CONFLICTING / DIRTY against main; rebase planned). It renames the URL surface, adds inline source support, and ships three-channel deprecation (OpenAPI `deprecated: true`, RFC 9745 `Deprecation: true` header, RFC 8288 successor `Link`).
|
||||
|
||||
## Design
|
||||
|
||||
### Two paths, one engine
|
||||
|
||||
| Dimension | Inline (`/query`, `/mutate`) | Stored (`/queries/{name}`) |
|
||||
|---|---|---|
|
||||
| Source location | Request body | `queries/*.gq` on disk |
|
||||
| Parse + typecheck | Per request | Once at server boot |
|
||||
| Cedar action | `read` / `change` | `invoke_query` (per-name scope) |
|
||||
| MCP-exposed | No (not enumerable) | Yes (when `@mcp(expose=true)`) |
|
||||
| Output schema | Inferred | Declared via `@returns`, asserted at boot |
|
||||
| Audit log shape | Records query hash | Records query name |
|
||||
| Failure visibility | Runtime 400 | Boot-time refusal |
|
||||
|
||||
Both paths converge in the engine:
|
||||
|
||||
```
|
||||
POST /query ─parse→─┐
|
||||
POST /mutate ─parse→─┤
|
||||
├─→ run_query / run_mutate(ast, params, branch) ─→ envelope
|
||||
POST /queries/{name} ───────┤
|
||||
POST /mcp/invoke ───────────┘ (MCP adapter on top of the same call)
|
||||
```
|
||||
|
||||
The MR-656 rebase widens `run_query` / `run_mutate` to accept a parsed AST or source string. Inline parses on each call. Stored looks up the pre-parsed AST in the registry. Same execution path beyond that point.
|
||||
|
||||
### Cedar split (the LLM-safe wedge)
|
||||
|
||||
Inline and stored coexist safely because they're gated by different actions:
|
||||
|
||||
```yaml
|
||||
# Production policy — agents locked to a curated stored-query set
|
||||
- deny:
|
||||
actors: { group: agents }
|
||||
actions: [read, change] # blocks /query, /mutate, /read, /change
|
||||
|
||||
- allow:
|
||||
actors: { group: agents }
|
||||
actions: [invoke_query]
|
||||
resource: Graph::"prod"
|
||||
query_scope: { names: [find_user, list_orders, search_docs] }
|
||||
```
|
||||
|
||||
The agent's effective surface: three stored queries by name. Cannot compose inline. Cannot enumerate schema. Cannot read arbitrary entities. A developer in the same deployment with `dev-engineers` group membership might have `[read, change, invoke_query]` allowed — full access to both paths.
|
||||
|
||||
Same server, same data, two completely different API surfaces depending on token. This is the posture MR-969 calls "LLM-safe API surface."
|
||||
|
||||
### `.gq` pragmas
|
||||
|
||||
Stored queries self-describe at the top of the source file:
|
||||
|
||||
```gq
|
||||
@description("Look up a user by ID. Returns name, email, last_login.")
|
||||
@returns({ name: String, email: String, last_login: DateTime? })
|
||||
@mcp(expose=true)
|
||||
|
||||
query find_user($id: String) {
|
||||
match { $u: User { id: $id } }
|
||||
return { $u.name, $u.email, $u.last_login }
|
||||
}
|
||||
```
|
||||
|
||||
Three pragmas in v1:
|
||||
|
||||
- `@description("...")` — string surfaced in `omnigraph queries explain` and MCP tool descriptions.
|
||||
- `@returns({...})` — optional output type assertion. Compiler verifies the inferred type matches; mismatch fails server startup.
|
||||
- `@mcp(expose=true|false, tool_name="alt_name"?)` — controls MCP visibility. Default is `expose=false` (callable via HTTP, hidden from MCP). `tool_name` defaults to the query name.
|
||||
|
||||
Pragmas live in source, not in a separate YAML registry. Drop a file in `queries/`, restart, the registry picks it up. The full agent contract is reviewable in one diff.
|
||||
|
||||
### Request envelope ("before")
|
||||
|
||||
Today's request carries auth + body. The envelope adds five fields, all optional:
|
||||
|
||||
```http
|
||||
POST /graphs/prod/queries/find_user
|
||||
Authorization: Bearer <token>
|
||||
Idempotency-Key: 01HXYZ... # mutations only
|
||||
If-Match: 01HABC... # optimistic concurrency
|
||||
X-Deadline: 2026-05-28T19:30:00Z # or X-Timeout-Ms: 5000
|
||||
X-Trace-Id: 01HDEF...
|
||||
Content-Type: application/json
|
||||
|
||||
{
|
||||
"params": { "id": "u-42" },
|
||||
"branch": "main",
|
||||
"expect": "read_only", # scope assertion
|
||||
"dry_run": false, # mutations only
|
||||
"fields": ["name", "email"] # result projection
|
||||
}
|
||||
```
|
||||
|
||||
Field semantics:
|
||||
|
||||
| Field | Applies to | Purpose |
|
||||
|---|---|---|
|
||||
| `Idempotency-Key` | Mutations | Server caches `(token, key)` → response for 10 minutes. Replays return cached response with `Idempotency-Replay: true` header. Prevents double-write on retry. |
|
||||
| `If-Match` | Mutations | Run only if branch HEAD matches the given commit ID. 412 Precondition Failed otherwise. Enables read-then-write without races. |
|
||||
| `X-Deadline` / `X-Timeout-Ms` | All | Server respects; returns 504-typed error past the deadline. Bounds execution for context-budget-constrained callers. |
|
||||
| `X-Trace-Id` | All | Caller-supplied; server echoes back. Lets agents correlate multi-call sequences. |
|
||||
| `expect` | All | Caller asserts shape: `"read_only"`, `{"max_rows_scanned": 10000}`. Server validates against parsed AST or planner estimate; rejects before running. |
|
||||
| `dry_run` | Mutations | Returns what *would* happen without committing. Implemented via scratch branch + diff + discard. |
|
||||
| `fields` | Reads | Server returns only listed columns. Saves bandwidth + agent context window. |
|
||||
|
||||
All five fields are optional; today's call shape continues working.
|
||||
|
||||
### Response envelope ("after")
|
||||
|
||||
The response envelope replaces today's bare-result shape with a structured wrapper. Every endpoint (inline, stored, MCP) returns the same envelope:
|
||||
|
||||
```json
|
||||
{
|
||||
"result": { "name": "Alice", "email": "alice@..." },
|
||||
"audit_id": "01HGHI...",
|
||||
"snapshot_id": "01HJKL...",
|
||||
"commit_id": null,
|
||||
"stats": {
|
||||
"rows_scanned": 1,
|
||||
"ms_elapsed": 4,
|
||||
"bytes_read": 128
|
||||
},
|
||||
"warnings": []
|
||||
}
|
||||
```
|
||||
|
||||
Response headers:
|
||||
|
||||
| Header | When | Purpose |
|
||||
|---|---|---|
|
||||
| `Idempotency-Replay: true\|false` | Mutations | Was this response served from the idempotency cache? |
|
||||
| `X-Trace-Id` | All | Echo of the request's trace ID, or server-minted if absent. |
|
||||
| `Deprecation: true` | `/read`, `/change` only | RFC 9745 signal from MR-656. |
|
||||
| `Link: </query>; rel="successor-version"` | `/read`, `/change` only | RFC 8288 successor pointer from MR-656. |
|
||||
|
||||
Body envelope fields:
|
||||
|
||||
| Field | When | Purpose |
|
||||
|---|---|---|
|
||||
| `result` | All | The actual response payload. Shape determined by the query's return type. |
|
||||
| `audit_id` | All | ULID for the audit log entry. Lets the caller cite exactly what ran. |
|
||||
| `snapshot_id` | All | Manifest snapshot the query observed. Reproducibility — replay with `?snapshot=<id>`. |
|
||||
| `commit_id` | Mutations | ULID of the new commit. Null for reads. Lets the caller cite what changed. |
|
||||
| `stats` | All | `{rows_scanned, ms_elapsed, bytes_read}`. Lets agents learn what's expensive. |
|
||||
| `warnings` | All | Non-fatal observations: deprecated property access, full-scan despite available index, scan exceeded soft row limit. Empty array when none. |
|
||||
|
||||
The envelope is the API's *memory of what happened*. Without `audit_id` + `commit_id` + `snapshot_id`, agent reports are hearsay and reads are not reproducible. With them, provenance is a first-class property of every response.
|
||||
|
||||
### MCP integration with multi-graph
|
||||
|
||||
MCP routes are per-graph, matching the rest of MR-668's hierarchy:
|
||||
|
||||
```
|
||||
GET /graphs/{id}/mcp/tools # tool list for this graph, this token
|
||||
POST /graphs/{id}/mcp/invoke # invoke a tool on this graph
|
||||
```
|
||||
|
||||
Single-mode collapses to `/mcp/tools` and `/mcp/invoke` at the root (same shape, no `/graphs/{id}` prefix). Both modes route through identical handler code.
|
||||
|
||||
Tool list response:
|
||||
|
||||
```json
|
||||
{
|
||||
"tools": [
|
||||
{
|
||||
"name": "find_user",
|
||||
"description": "Look up a user by ID.",
|
||||
"inputSchema": { "id": { "type": "string", "required": true } },
|
||||
"outputSchema": { "name": "string", "email": "string", "last_login": "datetime?" },
|
||||
"read_only": true
|
||||
}
|
||||
],
|
||||
"graph_id": "prod",
|
||||
"snapshot_id": "01HJKL..."
|
||||
}
|
||||
```
|
||||
|
||||
The tool list is the subset of registered queries where (a) `@mcp(expose=true)` in source and (b) Cedar permits `invoke_query` for this token on this name on this graph. Computed per request — cheap because it's just iterating the registry + one Cedar evaluation per name.
|
||||
|
||||
**Token scoping.** Most tokens carry one graph claim. Cross-graph access requires multiple Cedar rules (one per graph) and is uncommon. Agents that genuinely operate across graphs loop over `/graphs/{id}/mcp/tools` themselves. The contract stays clean; graph renames don't break tool names.
|
||||
|
||||
**Discovery.** Agents are told their MCP URL at provisioning: `https://omnigraph.example.com/graphs/prod/mcp`. Token authorizes; URL identifies. Same model as every OAuth-style API.
|
||||
|
||||
**`/mcp/invoke` is a protocol adapter.** Unwrap MCP protocol envelope, call the same code path as `/queries/{name}`, wrap the response in MCP shape. No new execution semantics.
|
||||
|
||||
### CLI surface
|
||||
|
||||
The CLI mirrors the HTTP routes. Post-MR-656 and post-MR-969:
|
||||
|
||||
```bash
|
||||
# Inline (MR-656)
|
||||
omnigraph query -e 'query test() { ... }' # /query
|
||||
omnigraph mutate -e 'query bump() { update ... }' # /mutate
|
||||
|
||||
# Stored (MR-969)
|
||||
omnigraph queries list # GET /queries (future)
|
||||
omnigraph queries explain find_user # show params + return shape + source
|
||||
omnigraph queries invoke find_user --param id=u-42 # POST /queries/find_user
|
||||
|
||||
# Pragma + registry validation
|
||||
omnigraph lint queries/find_user.gq # parses + verifies pragmas
|
||||
omnigraph queries lint # validates the whole registry
|
||||
```
|
||||
|
||||
`omnigraph queries invoke` reads bearer + URL from `omnigraph.yaml` like the other remote commands. Local invocations work the same way the existing `omnigraph query`/`mutate` do.
|
||||
|
||||
### Lifecycle
|
||||
|
||||
The promotion path from inline to stored is the load-bearing DX story:
|
||||
|
||||
```
|
||||
1. EXPLORE omnigraph query -e 'query find_user($id: String) { ... }' --params '{"id": "u-42"}'
|
||||
└─ POST /query, iterate freely
|
||||
|
||||
2. STABILIZE write queries/find_user.gq with @description, @returns, @mcp pragmas
|
||||
└─ git diff shows the full agent contract in one file
|
||||
|
||||
3. AUTHORIZE add Cedar rule allowing invoke_query for the appropriate actor group
|
||||
└─ scope_names: [find_user]
|
||||
|
||||
4. DEPLOY restart server
|
||||
└─ /queries/find_user goes live
|
||||
└─ /mcp/tools auto-lists it for any token with invoke_query[find_user]
|
||||
|
||||
5. RETIRE deny: read change for the agent group
|
||||
└─ inline access closed; stored remains
|
||||
└─ MR-969's "LLM-safe API surface" reached
|
||||
```
|
||||
|
||||
Same `.gq` source through all five steps. No rewrite. No language shift. The pragmas are the only added syntax between exploration and production.
|
||||
|
||||
## Migration
|
||||
|
||||
Existing callers see no breakage:
|
||||
|
||||
- `POST /read` and `POST /change` keep working, now with `Deprecation: true` headers (MR-656).
|
||||
- `ChangeRequest` field names `query_source` / `query_name` accepted as serde aliases (MR-656).
|
||||
- `aliases:` block in `omnigraph.yaml` unchanged; both `read`/`change` and `query`/`mutate` accepted as `command:` values (MR-656).
|
||||
- New envelope fields are additive; old clients ignoring them keep working.
|
||||
- `Idempotency-Key`, `If-Match`, `X-Deadline` are opt-in headers; absence is the current behavior.
|
||||
|
||||
Callers move at their own pace. The envelope upgrades + URL rename ship in v0.6.x (small PRs). Stored queries + MCP ship in v0.7.0.
|
||||
|
||||
## Sequencing
|
||||
|
||||
**Phase 1: envelope (v0.6.x, before MR-969).** Four small PRs, ~100-200 LOC each.
|
||||
|
||||
1. Wrap responses in the structured envelope. Add `audit_id`, `snapshot_id`, `commit_id`, `stats`, `warnings`. Backward-compatible if we keep today's top-level fields and add new ones alongside; cleaner break if we move to nested `result.*`. Pick one and live with it.
|
||||
2. Honor `Idempotency-Key` on `/mutate` (and the deprecated `/change`). Server-side cache keyed by `(token, key)`.
|
||||
3. Honor `If-Match` on `/mutate`. Wire through to the publisher CAS layer.
|
||||
4. Honor `X-Deadline` / `X-Timeout-Ms` on every endpoint. Return 504-typed error past deadline.
|
||||
|
||||
**Phase 2: MR-969 PR 1 (registry).** The stored-query registry, `/queries/{name}` route, `InvokeQuery` Cedar action with per-name scope, `.gq` pragma parsing (`@description`, `@returns`, `@mcp`), read-vs-mutate classification at registry load. Inline keeps working unchanged.
|
||||
|
||||
**Phase 3: MR-969 PR 2 (MCP).** `/graphs/{id}/mcp/tools` and `/graphs/{id}/mcp/invoke`. Tool schemas projected from declared return types and parameter declarations. Single-graph-scoped tokens.
|
||||
|
||||
**Phase 4: MR-969 PR 3 (Cedar deny-on-ad-hoc sugar).** Small Cedar-language addition so operators can lock down `/read` / `/query` while keeping `/queries/*` open. Independent of PRs 1-2.
|
||||
|
||||
**Phase 5: deferred.**
|
||||
- Cross-graph MCP namespacing (wait for usage signal).
|
||||
- Per-query rate limits (extend `WorkloadController`).
|
||||
- Schema introspection as a separate Cedar action (3-line PR).
|
||||
- CLI verb consolidation (`omnigraph call <name>`).
|
||||
- Cache warming (HelixDB-style; not load-bearing).
|
||||
|
||||
## Rejected Alternatives
|
||||
|
||||
**Per-environment override files (`_overrides.yaml`).** Initial design had a sparse YAML file for per-env tweaks: MCP exposure, row caps, kill-switch, param locks. Rejected because every override candidate either belongs in source (`@mcp` flag), Cedar policy (per-actor visibility, per-env), or `omnigraph.yaml` (operator config). Splitting query metadata across files makes it harder to review what an agent can see. Keep source authoritative; let Cedar express the per-env differences.
|
||||
|
||||
**Compiled query bundle (HelixDB's `queries.json`).** HelixDB compiles their Rust-DSL queries to JSON. Rejected because `.gq` files are already declarative. The file is the artifact. Reviewers diff source, not bytecode.
|
||||
|
||||
**Stored-queries-only (HelixDB's posture).** Rejected because the personal-graph / dev-iteration use case dies without inline. Inline `-e` is the REPL for human exploration; stored is the contract for production agents. Both first-class.
|
||||
|
||||
**Cross-graph tool-name prefixing (`prod.find_user`).** Rejected because graph renames would break agent contracts. Per-graph URLs let graph identity live in the URL, not in tool names.
|
||||
|
||||
**Body-field graph dispatch (`{tool, graph, params}`).** Rejected because it doubles the contract surface (every tool is identified by two fields). Per-graph URLs are simpler.
|
||||
|
||||
**Pragmas in YAML instead of source.** Rejected because two-file definitions (source + metadata YAML) make diffs harder to review and create drift opportunities. Source is the source of truth.
|
||||
|
||||
**Pragmas as in-source comments (`#[mcp]` HelixDB-style).** Considered; chose `@mcp(...)` because comment-flavored pragmas conflate documentation and machine-readable metadata. The `@` prefix makes the pragma's role explicit.
|
||||
|
||||
## Open Questions
|
||||
|
||||
1. **Envelope breakage vs additive.** Phase 1.1 wraps responses in a structured envelope. Do we keep today's top-level fields *and* add new ones (additive, ugly), or move result to `result.*` (clean break, requires SDK updates)? Lean toward additive — let the new envelope coexist with the old shape until v0.7.0, then collapse.
|
||||
|
||||
2. **`@returns` strictness.** Should mismatched declared-vs-inferred return type be a boot-time error or a warning? Lean toward error — silent drift defeats the assertion's purpose. Operators who want flexibility omit `@returns`.
|
||||
|
||||
3. **MCP protocol transport.** Streamable HTTP (the new MCP standard) vs stdio (Anthropic's original). Both have Rust crates. Lean toward streamable HTTP since we're already an HTTP server.
|
||||
|
||||
4. **Stored mutation routing.** A `.gq` file that contains both reads and writes — does the registry reject it at load (parse-time D2 rule from MR-656), or accept and classify as "mixed"? Lean toward reject. Mixed queries are a footgun; force operators to split.
|
||||
|
||||
5. **`expect` field strictness.** `expect: "read_only"` against a parsed mutating query is an obvious 400. But `expect: {max_rows_scanned: 10000}` requires planner estimates that don't exist today. Either ship `expect` with only the "read_only" assertion in v1 and grow it, or wait for the planner. Lean toward shipping the partial form.
|
||||
|
||||
6. **CLI `queries invoke` shape.** Today's `omnigraph query` takes a file or alias. `omnigraph queries invoke find_user` takes a stored query name. Should `omnigraph query --name find_user` also work (auto-detect)? Cleaner to keep them separate verbs — the stored vs inline distinction is part of the contract.
|
||||
|
||||
## References
|
||||
|
||||
- MR-656: [Support inline query strings in CLI and HTTP server](https://linear.app/modernrelay/issue/MR-656)
|
||||
- MR-668: [Multi-graph server mode](https://linear.app/modernrelay/issue/MR-668) (shipped, PR #119)
|
||||
- MR-969: [Stored queries with MCP exposure and per-query Cedar authorization](https://linear.app/modernrelay/issue/MR-969)
|
||||
- PR #110: [feat: inline query strings in CLI and HTTP server](https://github.com/ModernRelay/omnigraph/pull/110)
|
||||
- HelixDB docs: [docs.helix-db.com/llms-full.txt](https://docs.helix-db.com/llms-full.txt) — `#[mcp]` macro, scoped API keys, stored query model
|
||||
- RFC 9745 (`Deprecation` header)
|
||||
- RFC 8288 (`Link` relations, `successor-version`)
|
||||
- MCP spec: [modelcontextprotocol.io](https://modelcontextprotocol.io)
|
||||
- [invariants.md](./invariants.md) — substrate boundaries this work respects
|
||||
- [../user/server.md](../user/server.md) — current HTTP surface (post-MR-656 picks up the `/query`+`/mutate` rename and deprecation)
|
||||
590
docs/dev/rfc-002-config-cli-architecture.md
Normal file
590
docs/dev/rfc-002-config-cli-architecture.md
Normal file
|
|
@ -0,0 +1,590 @@
|
|||
# RFC: Config & CLI Architecture — Layered Config, Client Targeting, File Naming
|
||||
|
||||
**Status:** Proposed
|
||||
**Date:** 2026-05-30
|
||||
**Tickets:** MR-668 (multi-graph server, shipped — the dependency this builds on), MR-969 (stored queries + MCP — supplies the in-repo agent tool surface), MR-973 (quickstart / onboarding), MR-974 (agent setup surface), MR-981 (agent-friendly CLI hardening)
|
||||
**Target release:** v0.8.x (tentative; phased — see Rollout)
|
||||
|
||||
## Summary
|
||||
|
||||
OmniGraph today has a single config file, `omnigraph.yaml`, read both by the CLI (operating the embedded engine) and by `omnigraph-server` (hosting graphs). There is **no client-side configuration that targets a *running server*** — to talk to a deployed `omnigraph-server` you drop to `curl` or the `omnigraph-ts` client. This is the one real gap in an otherwise coherent design (storage-URI addressing, multi-graph routing, per-graph policy).
|
||||
|
||||
This RFC defines the config and CLI architecture that closes that gap, derived from first principles — *working backwards from what OmniGraph uniquely enables* rather than copying kubeconfig / `helix.toml`. The result:
|
||||
|
||||
1. A **global-first layered config** — user-global (`~/.omnigraph/`) is the **primary, self-sufficient default**; per-project (`./omnigraph.yaml`) is an *optional* override + deployment manifest. One uniform schema, both layers optional; the CLI works from any directory with **no project file** (the `kubectl`/`aws`/`gh` posture), unlike today's project-anchored behavior.
|
||||
2. A single unifying noun — the **target** — that resolves a name to a concrete `(locus, graph, sub-state, credential)` tuple, where the locus is **embedded (storage URI) XOR remote (server endpoint)**.
|
||||
3. A **multi-server × multi-graph** client model (OmniGraph hosts N graphs per server and there are M servers — unlike Helix's one-cluster-one-graph).
|
||||
4. **Credentials by reference, keyed by server name** (the AWS/gh/kube model) — OS keychain `omnigraph:<server>` (preferred) → a `[<server>]` profile in `~/.omnigraph/credentials` → `OMNIGRAPH_TOKEN[_<SERVER>]` env (CI). `servers.<name>` is endpoint-only by default but may carry an explicit, secret-free `auth: { token: { env|file|command|keychain } }` source; no `credentials.yaml`; the shipped `bearer_token_env` + dotenv stay as a legacy compat path. Every committed/GitOps'd surface stays secret-free.
|
||||
5. A **file-naming** decision: project and server config are **the same artifact, same name** (`omnigraph.yaml`); the only differently-named file is the user-global `config.yaml`, justified by **scope, not role**.
|
||||
|
||||
The design optimizes jointly for **DX** (one command surface across embedded and remote; clone-and-go) and **AX** (agent experience: one flat resolved context, secrets structurally unreachable, branch-pinned reproducible reads, and a GitOps'd capability surface).
|
||||
|
||||
## Reconciliation with shipped / planned CLI work
|
||||
|
||||
Verified **against the code**, not ticket statuses (which are unreliable — e.g. MR-581 is marked done but is stale and unbuilt). Findings and the corrections they force:
|
||||
|
||||
- **Noun is `graph`/`graphs`, NOT `target`/`targets`.** The config key is `graphs:` in `config.rs` and the flag is `--graph`. **This RFC uses `graphs:`/`--graph` throughout**; the unifying noun is a **`graphs:` entry** that is *embedded* (`storage:`, formerly `uri:`) XOR *remote* (`server:` + `graph_id:` defaulting to the entry key) — a typed locator (§1.1). Read any lingering `targets:`/`--target` below as `graphs:`/`--graph`.
|
||||
- **`~/.omnigraph/` stands on its own merits** (Helix/aws/kube peer convention), **not** on precedent — there is **no `~/.omnigraph/` usage in the code** today. (MR-581 / MR-531 templates-into-`~/.omnigraph/` are *stale tickets, unbuilt*.)
|
||||
- **Templates do not exist** in the code (no `template` command). The template mechanism is a *design question for this RFC / the init family*, not an existing foothold.
|
||||
- **What actually exists in the CLI** (verified): `init, query(read), mutate(change), load, ingest, branch, schema, lint, snapshot, export, commit, policy, optimize, cleanup, graphs`. **Not built:** `serve, quickstart, template, prune, login`. `omnigraph init` exists (with `scaffold_config_if_missing`, `main.rs:1415`); the rest of the "init family" (`quickstart` MR-973, `serve` MR-970, `prune`/`init --force` MR-972/975, `mcp install`/skills MR-974, agent-mode MR-981) are **unbuilt tickets**, some stale.
|
||||
- **Config still uses `aliases:`** (no `operations:` in code; MR-839 unbuilt). §6's reconciliation talks about `aliases:` as-is, noting `operations:` is a *proposed* rename.
|
||||
- **`bearer_token_env` exists** (per-graph, `config.rs`); MR-971 flags a CLI-parity / server-side gap. The per-`servers.<name>` extension lands on top of that.
|
||||
- **A top-level `omnigraph lint` command exists** (verified). A stored-query *registry* validator must pick a verb that doesn't read as a competing lint/check.
|
||||
|
||||
## Motivation
|
||||
|
||||
Three problems, in priority order:
|
||||
|
||||
- **No client→server targeting config.** The moment an operator stands up `omnigraph-server` — for bearer auth + Cedar at a network boundary + admission control + multi-graph routing — the CLI can't address it. `curl` is the fallback. There is no named, switchable, credential-carrying way to say "run this against `prod` on the team server."
|
||||
- **Multi-server × multi-graph has no first-class expression.** OmniGraph genuinely runs N graphs per server across M servers. The same graph is **multi-homed** — `s3://b/prod` may be `prod` on server A, `production` on server B, and opened directly by the CLI. Today's flat `graphs:` map (name→storage-URI) can't express "graph `production` on server `prod-eu`."
|
||||
- **Solo-first and embedded-first are unserved by the remote story.** A solo developer with no projects should define everything in `~`. A developer iterating locally (embedded, no server) and then pointing at staging (remote) should change *one word*, not learn a second command surface.
|
||||
|
||||
MR-668 shipped the server side (multiple graphs per server). MR-969 ships the in-repo agent tool surface (stored queries / MCP). This RFC supplies the **client and config layer** that lets humans and agents target that surface coherently — the foundation under MR-973 / MR-974 / MR-981.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- **A control plane / dashboard for config.** Operators edit files and (for servers) restart. No runtime config-mutation API. Matches the MR-668 / MR-969 operational model.
|
||||
- **Hot reload.** Restart-only for server-side config, matching MR-668 and MR-969.
|
||||
- **Embedding secrets in any config file.** Credentials are by-reference; the git-ignored `auth.env_file` dotenv (or, later, the OS keychain) holds tokens. Never a committable `*.yaml`.
|
||||
- **Renaming the project manifest by role.** No `omnigraph.server.yaml` / `omnigraph.client.yaml`. Role lives in sections, not filenames (see Design §3).
|
||||
- **Dropping embedded mode.** Embedded-first is load-bearing for the file-naming decision; this RFC assumes it stays.
|
||||
- **Cross-graph / cross-server tool listing in MCP.** Clients loop over per-graph catalogs (a MR-969 non-goal, restated).
|
||||
|
||||
## Background
|
||||
|
||||
OmniGraph runs on Lance 6.x: typed nodes/edges in per-type Lance datasets, atomic multi-table commits via a `__manifest` table, branchable and time-travelable. The CLI (`omnigraph`) operates the **embedded engine** directly against a storage URI — no HTTP client in its runtime dependencies. `omnigraph-server` (Axum) is a *separate* HTTP front-end over the same engine, with bearer auth + per-graph Cedar (MR-668). The two read the same `omnigraph.yaml` but never connect to each other.
|
||||
|
||||
OmniGraph **already has a credentials-by-reference mechanism**, which this RFC builds on rather than replacing: `TargetConfig.bearer_token_env` names the env var holding a graph's bearer token, and `auth.env_file` points at a git-ignored dotenv (`.env.omni`) that the CLI auto-loads into the process (`load_env_file_into_process`) with real-env-vars-win precedence; `resolve_remote_bearer_token` resolves a token via env var then dotenv named lookup. `.env.omni` is already in `.gitignore`.
|
||||
|
||||
The six **irreducible enablers** that drive the design (referenced as E1–E6 below):
|
||||
|
||||
| # | Enabler | Consequence |
|
||||
|---|---|---|
|
||||
| E1 | A graph is a **self-contained storage URI**; the substrate (object store + manifest CAS) is the source of truth — no server required to read/write. | A graph is addressable **directly (embedded)**, not only via a server. |
|
||||
| E2 | A server hosts **many graphs**; **many servers** exist. | The remote address space is **`{server} × {graph_id}`**. |
|
||||
| E3 | The same graph is **multi-homed** under different per-locus names. | **Name ≠ identity.** Resolution is mandatory. |
|
||||
| E4 | **Branch / commit / snapshot** are first-class addressable sub-state. | An address is *graph @ branch/snapshot*, not just graph. |
|
||||
| E5 | Enforcement is **two-layered**: engine-layer Cedar (`_as` writers, works embedded) + HTTP-boundary bearer+Cedar (server only). | *How* you reach a graph determines *which* enforcement applies. |
|
||||
| E6 | **Stored queries / MCP tools are a per-graph registry defined in the project config** (MR-969). | The **agent tool surface is version-controlled in the repo**. |
|
||||
|
||||
Competitors collapse dimensions OmniGraph keeps live: **Helix** fuses E2+E3 (one cluster = one graph); **namidb** fuses E1+E3 into the URI (`s3://b?ns=prod`) and serves one namespace per process. OmniGraph has all of E1–E6 at once, so its config resolves a richer space — but the richness is *earned* by capability.
|
||||
|
||||
## Design
|
||||
|
||||
### 1. The address space and the `target` abstraction
|
||||
|
||||
Every OmniGraph address is a tuple:
|
||||
|
||||
```
|
||||
(locus, graph, sub-state, credential)
|
||||
locus = embedded(URI) XOR remote(server-endpoint) # E1, E2
|
||||
graph = a URI (embedded) | a graph_id on a server (remote) # E3
|
||||
sub-state = branch | snapshot # E4
|
||||
credential = cloud-storage creds (embedded) | bearer token (remote) # E5
|
||||
```
|
||||
|
||||
The config's only job is **name → this tuple**. Define one noun — a **target** — that resolves to either shape:
|
||||
|
||||
```yaml
|
||||
targets:
|
||||
dev: # embedded — substrate-direct (E1)
|
||||
storage: s3://team-bucket/dev.omni
|
||||
branch: main # sub-state (E4)
|
||||
staging: # remote — resolves a server by reference (E2/E3)
|
||||
server: staging # → looked up in `servers`
|
||||
graph_id: prod # the graph's id on that server (defaults to the entry key)
|
||||
branch: review
|
||||
```
|
||||
|
||||
`--target staging` resolves: project `targets.staging` → `{server: staging, graph_id: prod, branch: review}` → `servers.staging` → `{endpoint, token-by-ref}` → final `(remote(https://…), prod, review, $TOKEN)`. Embedded targets skip the server hop and use cloud-storage credentials.
|
||||
|
||||
**Two concepts, not kubeconfig's three.** kube splits cluster / user / context; that 3-way split is its most-cursed UX. A target *bundles* server+graph+branch+defaults under one name; the **only** thing split out is `servers`, because endpoints+credentials are shared across many targets and are secret-bearing (different ownership and rate-of-change; see §2). Result: **2 nouns — `servers` and `targets`.** Embedded `targets` (`storage:`) subsume today's `graphs:` entries.
|
||||
|
||||
### 1.1 The resolved address is a typed *locator*, not a `uri` string
|
||||
|
||||
The shipped config models a graph as a single `uri: String`, and code branches on `is_remote_uri(uri)`. That conflates two structurally different addresses: an **embedded** graph is a *complete, self-contained* address — one storage URI = one graph, opened directly via the embedded engine; a **remote** graph is a *server endpoint + a `graph_id`* — one server hosts N graphs. A bare server URL **is not a graph**; it lacks the `graph_id`. The cost of the string model, in the code today:
|
||||
|
||||
- the CLI re-decides "server or file?" via `is_remote_uri` at ~16 call sites;
|
||||
- `TargetConfig` (one `uri` field) **cannot express** multi-server × multi-graph or a multi-homed graph (E2/E3) — "graph `production` on server `prod-eu`" has no representation;
|
||||
- the CLI **bails on remote URIs** for most operations, precisely because the string can't carry the `graph_id`;
|
||||
- the `omnigraph-ts` SDK had to model `baseUrl` **+** `graphId` *separately* (rewriting `/graphs/{graphId}/…`) — it invented the structure the string lacks.
|
||||
|
||||
So the *resolved* address is a **typed locator**, not a string:
|
||||
|
||||
```rust
|
||||
enum GraphLocator {
|
||||
Embedded { storage: StorageUri }, // file:// , s3:// — a complete graph
|
||||
Remote { server: ServerId, graph_id: GraphId }, // which server + which graph (+ bearer creds)
|
||||
}
|
||||
```
|
||||
|
||||
A `graphs:` entry resolves into this **once**; downstream code dispatches on the variant (the breadboard's `GraphConn = Embedded(engine) | Remote(http)`) instead of re-sniffing a scheme at each call site. The `uri` string becomes an *input format* for the embedded variant, never the address itself.
|
||||
|
||||
**YAML naming follows the locator — the *key* names the locus**, so neither the value's scheme nor a comment is load-bearing:
|
||||
|
||||
| Locus | Key | Value |
|
||||
|---|---|---|
|
||||
| Embedded | **`storage:`** (shipped `uri:` is a deprecated alias) | a storage URI (`s3://…`, `file://…`) |
|
||||
| Remote | **`server:`** | a name in `servers:` (its `endpoint` + creds resolve by name, §5) |
|
||||
| Remote graph id | **`graph_id:`** | the id on that server — **defaults to the entry key**; set only when the local alias differs |
|
||||
|
||||
An entry has `storage:` **xor** `server:` — the deserializer rejects *both* and *neither* (no silent ambiguity). This removes two prior confusions: `graphs:` (the map) vs `graph:` (the remote id), and `uri:`-might-be-a-server.
|
||||
|
||||
```yaml
|
||||
servers:
|
||||
prod-eu: { endpoint: https://og-eu.internal:8080 }
|
||||
graphs:
|
||||
dev: { storage: s3://team-bucket/dev.omni } # embedded
|
||||
production: { server: prod-eu } # remote — graph_id = "production" (the key)
|
||||
staging: { server: prod-eu, graph_id: prod } # remote — alias ≠ server's id
|
||||
```
|
||||
|
||||
### 1.2 Invalid configs are rejected by design
|
||||
|
||||
The DX rule is: **a config field is either honored or rejected, never silently ignored**. The loader therefore has two phases:
|
||||
|
||||
1. Parse YAML into a loose/raw shape that preserves origin (`base_dir`, layer, line/path when available).
|
||||
2. Convert once into a typed, role-aware resolved config. Every command receives the resolved form, not the raw YAML structs.
|
||||
|
||||
The typed graph shape is:
|
||||
|
||||
```rust
|
||||
enum GraphEntry {
|
||||
Embedded(EmbeddedGraphEntry),
|
||||
Remote(RemoteGraphEntry),
|
||||
}
|
||||
|
||||
struct EmbeddedGraphEntry {
|
||||
storage: StorageUri,
|
||||
branch: Option<BranchName>,
|
||||
policy: Option<PolicyFile>,
|
||||
queries: QueryRegistrySpec,
|
||||
}
|
||||
|
||||
struct RemoteGraphEntry {
|
||||
server: ServerId,
|
||||
graph_id: GraphId,
|
||||
branch: Option<BranchName>,
|
||||
}
|
||||
```
|
||||
|
||||
That makes these rules structural rather than advisory:
|
||||
|
||||
- A graph entry must specify **exactly one** locator: `storage:`/legacy `uri:` xor `server:`.
|
||||
- `policy:` and `queries:` are valid only on `Embedded` graph entries, because they define the capability surface of a graph this process opens directly. A `Remote` graph entry points at a server; that server owns policy and stored-query definitions.
|
||||
- `omnigraph-server` may serve only `Embedded` graph entries. A server manifest entry with `server:` is rejected: a server should not "host" a graph by proxying another server.
|
||||
- A named graph uses its own graph entry. Top-level `policy:` / `queries:` are a legacy anonymous-bare-URI compatibility path only; if a named graph is selected while top-level blocks would be ignored, config validation errors with a migration hint.
|
||||
- A client-defined remote graph discovers stored queries from the server (`GET /queries`) and invokes them (`POST /queries/{name}`); it does not define `queries:` locally for that remote graph.
|
||||
|
||||
Examples that must fail fast:
|
||||
|
||||
```yaml
|
||||
graphs:
|
||||
prod:
|
||||
storage: s3://team-bucket/prod.omni
|
||||
server: prod-us # invalid: storage xor server
|
||||
```
|
||||
|
||||
```yaml
|
||||
graphs:
|
||||
prod:
|
||||
server: prod-us
|
||||
graph_id: production
|
||||
policy: { file: ./policies/prod.yaml } # invalid: remote graph policy lives on the server
|
||||
queries:
|
||||
find_user: { file: ./queries/find_user.gq } # invalid: remote graph queries are discovered
|
||||
```
|
||||
|
||||
`omnigraph config view --resolved --show-origin` is the user-facing debugger for this boundary: it shows the final `Embedded` or `Remote` graph and where every honored field came from. Fields that cannot be honored never make it into the resolved view; they fail validation first.
|
||||
|
||||
### 2. Layered config — global-first, uniform schema, project-optional
|
||||
|
||||
**Posture: global-first, project-optional.** OmniGraph's CLI is primarily a *client* (it operates against graphs and servers, embedded or remote), so it sits on the **global-first** side of the CLI-config axis — like `kubectl` / `aws` / `gh` / `docker`, and unlike *project-first* tools (`git` / `cargo` / `terraform`) whose primary config is per-repo. The **global user config is the primary, self-sufficient default**; the project file is an *optional* repo-scoped override (and, when present, the deployment manifest). `omnigraph query --target prod` must work from **any directory with no project file**, exactly as `kubectl get pods --context prod` works from anywhere. *(This is a deliberate flip from today, where the CLI reads `./omnigraph.yaml` and does not even walk parent dirs — i.e. today it is project-anchored.)*
|
||||
|
||||
**Rule: the two layers share ONE raw schema, and each is fully self-sufficient** (the git-layering mechanism — same schema at both levels; you never need a repo to have a working config). Do **not** specialize the file format by layer. Instead, run the same role-aware validation everywhere (§1.2): the global and project layers may both define graph locators, defaults, servers, and aliases, but fields that are meaningless for a resolved graph variant are rejected rather than ignored. For example, `queries:` is valid for an embedded graph this config opens directly; it is invalid on a remote graph entry because remote stored queries are server-owned and discovered.
|
||||
|
||||
This makes the **zero-project case the default, not an edge case**: a solo user (or an agent) defines everything needed for client work in `~/.omnigraph/config.yaml` — servers, embedded + remote graph locators, defaults, aliases, and optionally personal embedded-graph query registries — and **never creates a project file**. A team adds `./omnigraph.yaml` only when it wants repo-scoped overrides or a committed, GitOps'd deployment manifest. Global-first does **not** forbid project files; it stops *requiring* them (the kubectl model: `~/.kube/config` is sufficient and default; per-project kubeconfigs are opt-in via `KUBECONFIG`).
|
||||
|
||||
| Layer | Required? | Typical use | Path |
|
||||
|---|---|---|---|
|
||||
| Global | no | **the default** — solo/agent's entire config; shared servers+creds for teams; even a personal server's graphs/queries | `~/.omnigraph/config.yaml` |
|
||||
| Project | no | **opt-in** — repo-scoped overrides + the committed deployment manifest (graphs, queries, policy) | `./omnigraph.yaml` |
|
||||
|
||||
**Precedence (low → high):** built-in defaults < global < project < env vars < CLI flags. With no project file it collapses to **built-in < global < env < flags** — the common global-only path.
|
||||
|
||||
**Merge semantics — "closest layer wins, at the smallest meaningful unit"** (the field consensus: git / kubeconfig / cargo / Helm / VS Code):
|
||||
- **Settings objects** (`defaults`, `auth`, `server`) → **deep-merge per field**: a project sets `defaults.graph` and *inherits* the global `defaults.output_format`. (VS Code / cargo behavior.)
|
||||
- **Named-resource maps** (`servers`, `graphs` / compat `targets`, `queries`, `aliases`) → **union by key; on a collision the higher layer's entry REPLACES the lower wholesale** — *no field-level deep-merge within an entry*. (kubeconfig: union contexts by name.) The footgun this avoids: global `servers.prod = {endpoint, policy}`, project `servers.prod = {endpoint: other}` — deep-merge would silently retain the old fields; replace makes the project's `prod` self-contained and predictable.
|
||||
- **Lists/arrays** → **replace, never append** (Helm convention; appending is order-sensitive and surprising).
|
||||
- **Scalars** → higher layer wins.
|
||||
- **Relative paths carry their origin's base_dir.** A `queries:` entry's `.gq` path, or a `policy.file`, resolves against the directory of the layer it was *defined in* — global entries under `~/.omnigraph/`, project entries under the project dir.
|
||||
- **Inspectable (non-negotiable):** `omnigraph config view --resolved --show-origin` prints each final value *and which layer set it* (the `git config --show-origin` / `kubectl config view` rule). A layered config without origin-tracing is a debugging trap.
|
||||
|
||||
### 3. Roles, and the file-naming decision (same name for project = server)
|
||||
|
||||
`omnigraph.yaml` carries two *roles* that diverge in prod and collapse on a laptop:
|
||||
|
||||
- **Server role** (read by `omnigraph-server`): `graphs:` entries that are **embedded storage locators**, per-graph `policy.file`, **`queries:` — the stored-query/MCP registry lives here**, plus serving knobs. Remote graph locators are rejected in this role.
|
||||
- **Client role** (read by the CLI/agent): `servers:`, embedded or remote `graphs:` locators, `defaults:`, `aliases:`. A remote graph locator points at server-owned capabilities; it cannot define local `policy:` or `queries:`.
|
||||
|
||||
**Project config and server config are the same artifact, hence the same name.** The server *serves the project*: the file that says "these graphs exist, with these stored queries and this policy" is simultaneously the project manifest and the server's deploy config. Role is distinguished by which *sections* are populated, never by filename. Readers ignore sections that are not theirs (today's file already does this with `cli:` vs `server:`).
|
||||
|
||||
**Why not kube's role-split.** Two coherent models exist: (A) one project file with role-sections (Helix `helix.toml` holds both `[local.dev]` and `[enterprise.production]`; compose; Cargo), and (B) deployment-manifest strictly separate from client config (kubectl — you never put a context in `deployment.yaml`). kube is the sharpest topological analog (multi-server × multi-graph, one client targeting many), so B has a real claim. The tiebreaker is **E1: OmniGraph is embedded-first.** In embedded mode the manifest's `graphs:` *is* the local target list — manifest and local-client-view are the same object, so splitting them (B) fights the grain and forces two files for local work. kube splits because it has **no** embedded mode (client always remote+global). So: take the half kube is right about — *remote* client targeting (`servers:`, endpoints, creds) is a separate concern in a separate **user-global** file (`config.yaml`, like `~/.kube/config`); reject the half it is wrong about for us — do **not** split the *project* layer by role. **The second name (`config.yaml`) is justified by scope (user-global), not role.** *(If OmniGraph ever dropped embedded mode and went pure-remote, model B's strict split would become cleanest.)*
|
||||
|
||||
### 4. File naming
|
||||
|
||||
Principles from the field: **one global dir** `~/.omnigraph/` (like `~/.aws`/`~/.kube`/`~/.helix`), with config/cache/state as **subdirectories** (separation without XDG's three-root scatter); **secrets keyed by server name in the OS keychain or a separate git-ignored profile file** (AWS/gh model, not a new `credentials.yaml`); **project-root manifest keeps the app-named file** (`Cargo.toml`, `package.json`); **`.yaml`, not `.yml`**; keep OmniGraph's established names. The genuinely *new* decisions are the **global** dir's existence and keyed-by-name resolution with an explicit `auth.token` override (MR-971); the shipped `bearer_token_env` + `auth.env_file` mechanism remains as legacy compat.
|
||||
|
||||
| Artifact | Path / name | Why |
|
||||
|---|---|---|
|
||||
| Project = server config (one artifact) | `./omnigraph.yaml` | **Keep.** Root manifest like `Cargo.toml` / `compose.yaml` / `helix.toml`. Same name for both roles because it is one file. In prod the server's deploy repo and an app repo each have their own `omnigraph.yaml` — same name, different repos. |
|
||||
| Global user config | `~/.omnigraph/config.yaml` | **One dir** (`~/.omnigraph/`, like `~/.aws`/`~/.kube`/`~/.helix`). Named `config.yaml` *not* `omnigraph.yaml` — the name signals scope (and `~/.aws/config`, `~/.kube/config`, `~/.helix/config` all do this). Holds the full schema so a solo user needs nothing else. |
|
||||
| Credentials | OS keychain (`omnigraph:<server>`, preferred) → `~/.omnigraph/credentials` profile file (`[<server>]`, `0600`, git-ignored). **Keyed by server name**, inside the one dir. | **Key by name, AWS/gh model** — `~/.aws/credentials [profile]`, `~/.kube/config users:`, `~/.helix/credentials`. *Not* a `credentials.yaml`, and *not* a per-server hand-named env var; the secret lives under the server name (no indirection). Legacy `bearer_token_env` + `.env.omni` dotenv remain as a compat path. See §5. |
|
||||
| Cache / state | `~/.omnigraph/cache/`, `~/.omnigraph/state/` | Subdirs of the one dir (like `~/.aws/sso/cache/`, `~/.kube/cache/`) — cache is `rm -rf`-safe and backup-excludable without scattering across XDG roots. |
|
||||
| Cedar policy | `./policies/<env>.yaml` + `<env>.tests.yaml` | **Keep.** Referenced by `policy.file`. |
|
||||
| Schema | `./*.pg` (e.g. `schema.pg`) | **Keep.** |
|
||||
| Stored queries | `./queries/*.gq` | **Keep.** `.gq` sources referenced by the `queries:` registry. |
|
||||
|
||||
**Global dir: `~/.omnigraph/` — one place, with subdirectories.** Everything OmniGraph keeps for a user lives under a single `~/.omnigraph/` directory, matching the peer group (`~/.aws`, `~/.kube`, `~/.docker`) and the direct competitor (`~/.helix`). This is what DB/cloud-CLI users expect and the lowest-cognitive-load shape.
|
||||
|
||||
*Separation and "one place" are not in conflict* — the decisive realization. The peer tools get config/cache/state separation via **subdirectories inside the one dir**, not via XDG's three scattered roots: `~/.aws/sso/cache/`, `~/.kube/cache/`. So OmniGraph keeps `~/.omnigraph/config.yaml`, `~/.omnigraph/credentials`, `~/.omnigraph/cache/` (catalogs — `rm -rf`-safe, backup-excludable), `~/.omnigraph/state/` (session, logs) — getting cache hygiene **and** a single discoverable location, without the XDG scatter. An earlier draft argued XDG on a false dichotomy (it assumed single-dir ⇒ mixed); subdirs dissolve it. `~/.omnigraph/` is canonical and documented; `$XDG_CONFIG_HOME` may optionally be honored if a user has set it, but XDG is not part of the mental model.
|
||||
|
||||
**Env / override precedence (the `KUBECONFIG` analog):**
|
||||
- `OMNIGRAPH_CONFIG=/path` — explicit config file, highest precedence.
|
||||
- `OMNIGRAPH_HOME=/path` → the global dir (default `~/.omnigraph/`); `$XDG_CONFIG_HOME` optionally honored if a user has set it, but `~/.omnigraph/` is canonical.
|
||||
- Cache and state are subdirs of the one dir: `~/.omnigraph/cache/` (cached remote catalogs), `~/.omnigraph/state/` (session, logs).
|
||||
- Per-server token resolution: an explicit `auth: { token: {...} }` source (env/file/command/keychain) wins if set; otherwise **keyed by the server name** — `OMNIGRAPH_TOKEN_<NAME>` (or `OMNIGRAPH_TOKEN` for the active server) → OS keychain `omnigraph:<name>` → the `[<name>]` profile in `~/.omnigraph/credentials`; legacy `bearer_token_env` still honored. See §5.
|
||||
|
||||
### 5. Credentials, connection tiers, and bind portability (12-factor)
|
||||
|
||||
**Credentials are by-reference everywhere, never inlined — and keyed by the *server name*, not by a hand-invented env-var name.** This is the one place the design departs from simply reusing the shipped `bearer_token_env` mechanism, because that mechanism is sub-optimal for a multi-server client: it forces the operator to invent and coordinate an env-var name per server (three steps to add a server: pick a var, name it in config, set it in the store). The peer group (AWS profiles, `gh` hosts, kubeconfig users, docker auths) instead keys the secret **by the server's name** — no indirection. OmniGraph should match that.
|
||||
|
||||
**Resolution for server `<name>` (no config field required):**
|
||||
1. **`OMNIGRAPH_TOKEN_<NAME>`** env var (name-derived, upper-snake), else **`OMNIGRAPH_TOKEN`** for the active server — the CI/headless override (12-factor).
|
||||
2. **OS keychain** entry `omnigraph:<name>` — the preferred interactive store (no plaintext on disk); written by `omnigraph login <name>`.
|
||||
3. **`~/.omnigraph/credentials`** — an AWS-style profile file keyed by server name (mode `0600`, git-ignored), the fallback when no keychain:
|
||||
```ini
|
||||
[prod-us]
|
||||
token = …
|
||||
[prod-eu]
|
||||
token = …
|
||||
```
|
||||
So a `servers.<name>` with no token field resolves by name — adding a server is one step (`omnigraph login <name>`), and "multiple servers, multiple tokens" falls out for free.
|
||||
|
||||
**But implicit must not be the *only* path — explicit sourcing is a first-class option** (the DX/AX lesson). Pure-convention is invisible (you must *know* `OMNIGRAPH_TOKEN_<NAME>`), can't integrate with a secrets-manager's fixed var name, and can't do dynamic/short-lived tokens. So a server may declare an explicit `auth:` block — a **method-agnostic wrapper** (today only `token:` for bearer; `mtls:`/`oidc:` are the future siblings, so the credential model never has to be re-keyed) holding a tagged token *source*. Secrets are *still* never inlined (every source is a reference):
|
||||
|
||||
```yaml
|
||||
servers:
|
||||
prod-us:
|
||||
endpoint: https://og-us…
|
||||
auth: { token: { env: OG_PROD_US_TOKEN } } # explicit env var — self-documenting (= legacy bearer_token_env)
|
||||
prod-eu:
|
||||
endpoint: https://og-eu…
|
||||
auth: { token: { command: [vault, read, -field=token, secret/og] } } # dynamic / short-lived
|
||||
edge:
|
||||
endpoint: https://og-edge…
|
||||
auth: { token: { file: /run/secrets/og-token } } # k8s/docker mounted secret
|
||||
staging:
|
||||
endpoint: https://og-staging… # no auth: → implicit chain (below)
|
||||
```
|
||||
|
||||
| `auth.token:` source | when | DX/AX value |
|
||||
|---|---|---|
|
||||
| *(auth omitted)* | the common case | zero-config; `omnigraph login` populates keychain `omnigraph:<name>` |
|
||||
| `{ env: VAR }` | secrets-manager / CI injects a fixed var | **self-documenting** — config states the source; = the legacy `bearer_token_env` |
|
||||
| `{ file: PATH }` | k8s/docker secret mounted as a file | no env plumbing |
|
||||
| `{ command: [...] }` | Vault, cloud IAM, `gh auth token` | **dynamic tokens** — first-class exec, the capability pure-env/keychain can't give (kube `exec` / AWS `credential_process`) |
|
||||
| `{ keychain: ENTRY }` | pin a non-default keychain entry | explicit override of the name-derived default |
|
||||
|
||||
**Resolution per server:** if `auth.token:` is set, use that source (no fallthrough). Else the **implicit chain**: `OMNIGRAPH_TOKEN_<NAME>` (or `OMNIGRAPH_TOKEN` for the active server) → keychain `omnigraph:<name>` → `[<name>]` in `~/.omnigraph/credentials` (`0600`, git-ignored). `omnigraph login <server>` writes/rotates only that server's secret; per-server precedence is independent; sharing is opt-in (same env var or source). The `command` source runs locally with the operator's own privileges and is defined only in operator-owned config (never server-supplied), so it adds no remote-execution surface. The `auth:` wrapper is method-agnostic so adding mTLS/OIDC later is a new sibling key, not a breaking re-key (Hyrum's Law: the field name is a contract once shipped). There is **no `credentials.yaml`** and **no inlined secret**. *Convention for the floor, explicit for control — and explicit is legible to agents and never inlines a secret.*
|
||||
|
||||
**Back-compat.** The shipped per-graph `bearer_token_env` + `auth.env_file` dotenv (`resolve_remote_bearer_token`, real-env-wins) keeps working unchanged for existing single-server setups; `bearer_token_env` is just the legacy flat alias for `auth: { token: { env } }`. Resolution tries an explicit `auth.token:` (or legacy `bearer_token_env`) first, then the keyed-by-name chain — so nothing breaks, but the zero-config default is the no-boilerplate keyed-by-name path. (MR-971 — the `bearer_token_env` parity gap — is where this resolver work lands.)
|
||||
|
||||
**Three connection tiers** (Supabase/Prisma teach the zero-config floor):
|
||||
1. **Env vars** — `OMNIGRAPH_SERVER=https://…` + `OMNIGRAPH_TOKEN=…`: zero-config remote, no file (the `DATABASE_URL` floor).
|
||||
2. **Global `config.yaml`** — named `servers:` + `graphs:` for multi-server setups (the AWS-profiles convenience).
|
||||
3. **Project `omnigraph.yaml`** — project-pinned targets/graphs, committed.
|
||||
|
||||
**Keep `omnigraph.yaml` a *portable* manifest (12-factor).** Deploy-specific runtime that varies per environment — the **bind host/port**, worker counts — should be supplied by **`--bind` / `OMNIGRAPH_BIND` (flags/env)**, *not* a committed `server.bind:` baked into the manifest. A manifest that hardcodes `0.0.0.0:8080` is not portable across deploys and leaks an environment detail into a version-controlled file. The same-named `omnigraph.yaml` stays portable across deploys precisely because the volatile, per-environment knobs live in env/flags (12-factor config), while the stable, portable definition (graphs, queries, policy) lives in the file. This is the one concrete lesson taken from kube's model-B without adopting its file split: portability via env/flags, not via a second file.
|
||||
|
||||
### 6. Where stored queries live: defined locally, invoked remotely
|
||||
|
||||
A stored query splits across two axes; do not conflate them:
|
||||
- **Definition** (`.gq` source + `queries:` entry) lives next to the **embedded graph entry that owns it**. For a hosted remote graph, that is the **deployment manifest** read by `omnigraph-server`; for a personal embedded graph, it may be the user's own config. It never lives on a client-side `Remote` graph entry.
|
||||
- **Discovery** ("what tools exist for me?") is fetched from the **server** (Cedar-filtered `GET /queries` / MCP catalog) at connect time.
|
||||
- **Invocation** is **remote** (client → server, HTTP/MCP) — or **embedded** (the CLI opens the graph directly and reads the same manifest).
|
||||
|
||||
For remote use, the client carries *pointers to servers*, not query definitions; it **discovers and invokes**, never defines. This is the **capability-as-code guarantee for agents**: an agent can only invoke tools the server's *committed, reviewed* config exposes — it **cannot define a new tool at runtime**. Definition is structurally outside the agent's reach.
|
||||
|
||||
`queries:` (graph-capability registry, Cedar-gated when served remotely, MCP-visible when exposed) and `aliases:` (client CLI shortcut) overlap — both can name `.gq`-backed operations. This RFC keeps them siblings (the MR-969 decision); the clean long-term is **one registry, two invocation surfaces** (embedded + remote), with `aliases:` subsumed. Out of scope here.
|
||||
|
||||
#### Reconciling `aliases:` with the role model
|
||||
|
||||
`aliases:` is the pre-MR-969, **client-role, embedded-only, ungated** ancestor of `queries:`. An alias bundles `command` (read/change), `query` (`.gq` path), `name` (symbol), `args` (positional param names), and `graph`/`branch`/`format` defaults; the CLI runs it embedded. The server never reads it. So:
|
||||
|
||||
- **Role:** `aliases:` is **client-role** (CLI behavior) → it may live in **both** the user-global `config.yaml` and the project manifest, layered. `queries:` is **graph-capability role** → it lives only on an `Embedded` graph entry, and for remote server graphs that means the server deployment manifest. *Who opens the graph determines where query definitions can live.*
|
||||
- **Difference:** `aliases:` = embedded invocation, no gating, explicit `command`, bundles client defaults + positional args. `queries:` = remote (+future embedded), Cedar + `mcp.expose`, **infers** read/mutate, bundles only MCP settings.
|
||||
- **Convergence:** decompose an alias — *definition* (name→.gq+symbol) → `queries:` (the superset: typed, validated, gated, multi-surface, no redundant `command`); *target/branch/format* → client invocation context (`--target`/`--branch`/`--format` or `defaults:`), not baked per-query; *positional `args`* → thin CLI sugar or dropped (agents/services use named JSON params). End-state: one `queries:` registry + the client config model subsumes `aliases:`.
|
||||
- **Validation:** a file-backed alias (`query: ./foo.gq`) may target only an embedded graph. A remote graph shortcut must be explicit that it invokes a server-owned stored query, e.g. `invoke: find_user`, so the client cannot smuggle a new `.gq` definition into a remote capability surface.
|
||||
- **v1:** keep `aliases:` unchanged. Footgun worth a load-time warn: an alias and a query with the same name in one manifest are different namespaces invoked differently (`--alias X` vs `POST /queries/X`).
|
||||
|
||||
```yaml
|
||||
aliases:
|
||||
local_owner:
|
||||
command: query
|
||||
query: ./queries/owner.gq
|
||||
name: owner
|
||||
graph: dev # valid only if `dev` resolves Embedded
|
||||
|
||||
remote_owner:
|
||||
invoke: find_user
|
||||
graph: prod # valid only if `prod` resolves Remote; source lives on the server
|
||||
args: [name]
|
||||
```
|
||||
|
||||
### 7. CLI surface
|
||||
|
||||
- `omnigraph login <server>` — interactive auth; stores the token keyed by server name in the OS keychain (`omnigraph:<server>`) or the `[<server>]` profile of `~/.omnigraph/credentials` (0600). The `gh auth login` analog.
|
||||
- `omnigraph use <graph>` — set the active graph (writes the appropriate layer). The `kubectl config use-context` analog.
|
||||
- `omnigraph config view [--resolved] [--show-origin] [<graph>]` — print the merged config and, with `--resolved`, the final tuple **plus the origin layer of every field** (the `git config --show-origin` / `kubectl config view` analog). Resolution is never a mystery.
|
||||
- All existing verbs (`query`, `mutate`, `load`, `schema`, `branch`, …) gain `--graph <name>`; resolution decides embedded vs remote transparently.
|
||||
|
||||
### 7.5 Init, login, and bootstrap — three tiers (folds in the Q2 design)
|
||||
|
||||
Scaffolding splits into three tiers by *scope* and *fatness*, mirroring the field (supabase `init` vs `login`; HelixDB thin `init` vs fat `chef`). Most of this lives in sibling tickets; this RFC owns only the **user route**.
|
||||
|
||||
| Tier | Command | Scope | What it does | Model | Status |
|
||||
|---|---|---|---|---|---|
|
||||
| **User route** | `omnigraph login [<server>]` | user (`~/.omnigraph/`) | auth + write `~/.omnigraph/config.yaml` / `credentials`; first-run global setup | gh / supabase `login` | **this RFC** (unbuilt) |
|
||||
| **Thin project init** | `omnigraph init` | project, in-place | create graph + `scaffold_config_if_missing` (`omnigraph.yaml` + minimal `.pg`/`.gq`); refuse-if-exists or `--force` | `cargo init`, `prisma init` | exists; `--force` purge = MR-975 |
|
||||
| **Fat bootstrap** | `omnigraph quickstart [--template <t>] [--auto]` | project, possibly new-dir | scaffold + seed data + `serve start` + agent prompt file | HelixDB `chef`, `create-next-app` | MR-973 (unbuilt) |
|
||||
|
||||
**Design positions** (first-principles, since none of the fat tier is built):
|
||||
- **Split `init` (project) from `login` (user)** — never one command writing to both `$HOME` and the project (the supabase line, not the dbt line). `init`=project scaffold; `login`=user credential + global config.
|
||||
- **`init` is in-place + refuse-if-exists** (cargo/prisma/terraform default): don't clobber; adopt existing files; require `--force` to overwrite (and `--force` purges Lance state per MR-975).
|
||||
- **Interactive for humans, `--auto`/agent-mode for automation** (npm `-y`, create-* `--CI`, MR-981 `--machine`). In `OMNIGRAPH_AGENT_MODE` any prompt → fail with a repair hint.
|
||||
- **Templates are a `--template <name>` flag on the fat tier** (create-vite model), with the *content* (schema + queries + seed) coming from a template source. Mechanism is a design question (bundled-in vs `og template pull` from a repo vs `npm create-*`-style delegation) — **not** an existing foothold (MR-581 stale). Lean: a small set of bundled templates first (generic `Person→Knows`, plus promote `omnigraph-intel-bootstrap`), `--template <github>` later.
|
||||
- **`init`/`quickstart` can scaffold the `graphs:` map with one or more entries**; "init with specific graphs" = the scaffolded `graphs:` block (embedded `storage:` locally; the agent/operator adds remote `server:` entries via `login` + editing).
|
||||
- **Secrets-on-scaffold rule** (prisma/dbt/supabase all do this): anything that writes a token also keeps it out of VCS. `login` prefers the OS keychain (no file); the `~/.omnigraph/credentials` profile fallback is `0600` and git-ignored, and any project-local `.env`-shaped file gets a `.gitignore` entry.
|
||||
|
||||
### 8. Concrete shape
|
||||
|
||||
**Global** `~/.omnigraph/config.yaml` (per-user, secret-free):
|
||||
```yaml
|
||||
servers: # endpoint only — token is keyed by the server name
|
||||
prod-us: { endpoint: https://og-us.internal:8080 }
|
||||
prod-eu: { endpoint: https://og-eu.internal:8080 }
|
||||
staging: { endpoint: https://og-staging.internal:8080 }
|
||||
graphs:
|
||||
personal: { storage: ~/graphs/personal.omni }
|
||||
defaults:
|
||||
graph: personal
|
||||
aliases:
|
||||
my_people:
|
||||
command: query
|
||||
query: ~/queries/people.gq
|
||||
name: list_people
|
||||
graph: personal
|
||||
```
|
||||
|
||||
**Project client** `./omnigraph.yaml` (committed, secret-free, portable — no `server.bind`). Note the shipped noun is `graphs:` (MR-603); an entry is embedded (`storage:`) XOR remote (`server:` + `graph_id:`, §1.1):
|
||||
```yaml
|
||||
graphs:
|
||||
dev: { storage: s3://team-bucket/dev.omni, branch: main } # embedded
|
||||
staging: { server: staging, graph_id: prod, branch: review } # remote → graph `prod` on server `staging`
|
||||
prod-us: { server: prod-us, graph_id: production }
|
||||
prod-eu: { server: prod-eu, graph_id: production } # multi-homed: same graph, another server
|
||||
defaults: { graph: dev, output_format: table }
|
||||
aliases:
|
||||
owner:
|
||||
command: query
|
||||
query: ./queries/owner.gq
|
||||
name: owner
|
||||
args: [name]
|
||||
graph: dev
|
||||
```
|
||||
Select with `--graph <name>` (shipped flag, MR-603).
|
||||
|
||||
**Server deployment** `./omnigraph.yaml` (committed in the deploy repo, read by `omnigraph-server`). Every served graph is an embedded storage locator; server-owned policy and stored-query definitions live here:
|
||||
```yaml
|
||||
graphs:
|
||||
production:
|
||||
storage: s3://team-bucket/prod.omni
|
||||
policy:
|
||||
file: ./policies/prod.yaml
|
||||
queries:
|
||||
find_user:
|
||||
file: ./queries/find_user.gq
|
||||
mcp: { expose: true, tool_name: lookup_user }
|
||||
|
||||
server:
|
||||
policy:
|
||||
file: ./policies/server.yaml
|
||||
```
|
||||
|
||||
**Credentials** are keyed by server name — `omnigraph login prod-us` writes the OS keychain entry `omnigraph:prod-us` (or a `[prod-us]` profile in `~/.omnigraph/credentials`, 0600, git-ignored); `OMNIGRAPH_TOKEN_PROD_US` overrides for CI. No token fields in any config file; no committable secrets.
|
||||
|
||||
## DX
|
||||
|
||||
1. **One command surface, two loci.** `query --graph dev` (embedded) and `--graph staging` (remote) are the same command; only resolution differs. Change one word, not a mental model.
|
||||
2. **Clone-and-go.** Project config names servers+graphs; teammate runs `omnigraph login staging` once and every target resolves. The git + `gh auth login` model.
|
||||
3. **Multi-server × multi-graph is the default.** Remote graph entries reference `server` by name; `servers` is a global named map; graphs are per-server. `prod-us` and `prod-eu` both serving `production` is two graph entries — Helix cannot express this.
|
||||
4. **Solo-first.** Everything in `~`, no project required.
|
||||
5. **Laptop-to-fleet on one schema.** Local = one `omnigraph.yaml` (both roles); prod = role-split across repos. No second format to learn.
|
||||
|
||||
## AX (agent experience)
|
||||
|
||||
1. **One flat resolved context, never a config to navigate.** target→server→endpoint→token resolves *before* the agent sees anything. The agent reasons about tools, not topology (the LLM-safe-surface principle extended to config).
|
||||
2. **Secrets are structurally outside the agent's reach.** The repo it operates in has no tokens; they are in the global layer / keychain, outside its view. An agent *cannot* exfiltrate a prod token from project config because it is not there.
|
||||
3. **Branch/snapshot-pinned contexts** (E4) — hand an agent a `branch: review` / `--snapshot v42` target and its reads are reproducible and cannot see uncommitted main-line state. No kubeconfig analog.
|
||||
4. **The agent's capabilities are a GitOps'd artifact** (E6) — which graphs exist, which stored-query tools it may call, and which Cedar rules gate them are all in the version-controlled server config. Powers change only via a reviewed PR, deployed by restart. Infrastructure-as-code for what the AI can do.
|
||||
5. **Config + policy compose.** Config = "where am I pointed + which token"; Cedar = "what may I do there." Orthogonal; no enforcement logic leaks into config.
|
||||
|
||||
## GitOps — three surfaces, secrets in none
|
||||
|
||||
| Surface | Repo | Contents | Deploy | Secrets |
|
||||
|---|---|---|---|---|
|
||||
| Server deployment config | infra/deploy repo | `graphs:`, policy, **`queries:` + `.gq` files** | commit → CI → **server restart** (no hot reload) | none — by-reference |
|
||||
| Project client config | app repo | `graphs:` → embedded storage or remote server+graph | committed, read by CLI/agent | none |
|
||||
| Global user config | **not GitOps'd** — machine-local `~` | `servers:` + creds-by-ref | `omnigraph login` writes it | refs only (like `~/.kube/config`) |
|
||||
|
||||
## Comparison
|
||||
|
||||
| Property | kubeconfig | Helix | git | compose | **OmniGraph (this RFC)** |
|
||||
|---|---|---|---|---|---|
|
||||
| Named remote endpoints + creds-by-ref | ✅ | ✅ | partial | partial | ✅ (global `servers`) |
|
||||
| Global + project layering, uniform schema | ✗ | ✗ | ✅ | ✗ | ✅ |
|
||||
| Embedded OR remote under one name | ✗ | ✗ | n/a | ✗ | ✅ (E1) |
|
||||
| Multi-server × multi-graph | ✅ | ✗ | n/a | n/a | ✅ (E2) |
|
||||
| Branch/snapshot in the address | ✗ | ✗ | partial | ✗ | ✅ (E4) |
|
||||
| Agent tool surface in the repo | ✗ | ✗ (separate bundle) | n/a | n/a | ✅ (E6) |
|
||||
| Project manifest renamed by role | — | no | — | no | **no** |
|
||||
| Concept count | 3 | 1 | 2 | 1 | **2 (servers/targets)** |
|
||||
|
||||
## Migration / backwards compatibility
|
||||
|
||||
- **Additive.** Today's `omnigraph.yaml` (`graphs:`, `cli:`, `server:`, `aliases:`, `policy:`) keeps working unchanged. `graphs:` entries are equivalent to embedded `targets:` with a `storage:` (shipped `uri:` is a deprecated alias); both resolve.
|
||||
- **`targets:` is new** and optional. `servers:` is new and optional. Absent → today's behavior.
|
||||
- **Global `~/.omnigraph/config.yaml` is new.** Absent → only project + env + flags, exactly as now. Its addition is the **global-first posture flip**: today the CLI is project-anchored (reads `./omnigraph.yaml`, no parent walk); the global config becomes the new primary discovery path so the CLI works with no project file. Existing project-only workflows are unchanged (project still overrides global); the flip is additive — it adds a fallback layer below the project file, it does not remove the project file.
|
||||
- **`graphs:` → `targets:` is an evolution, not a break.** Both can coexist; `targets:` is the superset (adds remote + branch pinning). A future cleanup may alias `graphs:` to embedded `targets:`.
|
||||
- **`server.bind` stays supported** but documentation steers operators to `--bind` / `OMNIGRAPH_BIND` for portability; no removal.
|
||||
- **Credentials: keyed-by-name is new; `bearer_token_env` is the compat path.** The primary design (keychain / `[<server>]` profile / `OMNIGRAPH_TOKEN_<SERVER>`) is new resolver work (lands on MR-971). The shipped `bearer_token_env` + `auth.env_file` dotenv (`resolve_remote_bearer_token`) is **unchanged and still honored** — existing single-server dotenv setups keep working, and the resolver honors an explicit `auth: { token: {...} }` source (env/file/command/keychain) with `bearer_token_env` as its flat legacy alias. No `credentials.yaml`.
|
||||
- **Validation tightens invalid mixes, not valid legacy use.** Top-level `policy:` / `queries:` remain only for anonymous bare-URI compatibility. Named graphs use per-entry fields. Remote graph entries with local `policy:` / `queries:` and server manifests with `server:` graph locators are rejected because there is no correct way to honor those fields.
|
||||
|
||||
## Open questions
|
||||
|
||||
- **`graphs:` vs `targets:` naming churn.** Do we rename `graphs:` → `targets:` (with a deprecation alias) or keep `graphs:` for embedded and add `targets:` for remote? Leaning: keep both, document `targets:` as the superset.
|
||||
- **Keychain integration scope.** Keychain is now the *primary* credential store (§5), so this is on the critical path, not optional: macOS Keychain first (matches operator practice) with the `0600` `[<server>]` profile file as fallback; Linux Secret Service / `pass` later. Open: which keyring crate, and the exact `OMNIGRAPH_TOKEN_<SERVER>` name-derivation (upper-snake, non-alnum → `_`).
|
||||
- **Project-local `servers:`.** Allowed (e.g. a localhost dev server), merged with global. Confirm creds stay by-reference even for project-local servers (yes).
|
||||
- **`aliases:` ⇄ `queries:` convergence.** Out of scope here; tracked separately. One registry with embedded + remote invocation surfaces is the target end state.
|
||||
- **Single-file `KUBECONFIG`-style list.** Do we support `OMNIGRAPH_CONFIG` pointing at multiple files (colon-joined), or a single file only? Start single; revisit if demand appears.
|
||||
|
||||
## Implementation — breadboard + slices (Shape A)
|
||||
|
||||
Shaped via requirements + a fit check (Shape A — global-first layered config + unified `graphs:` entry + three-tier init — selected over a project-first minimal option and a Helix-clone). This section breadboards A and slices it. **Bold** = NEW.
|
||||
|
||||
### Places
|
||||
|
||||
| # | Place | What |
|
||||
|---|---|---|
|
||||
| P1 | Disk | `~/.omnigraph/{config.yaml, credentials, cache/, state/}` + project `omnigraph.yaml` + `.env.omni` |
|
||||
| P2 | Config resolution | runs on every command: load layers → merge → resolve `--graph` |
|
||||
| P3 | Command execution | embedded engine OR remote HTTP client |
|
||||
| P4 | Remote `omnigraph-server` | existing HTTP surface (`/query`, `/mutate`, `/queries/{name}`) |
|
||||
| P5 | Scaffold | `login` / `init` / `quickstart` |
|
||||
|
||||
### Affordances
|
||||
|
||||
| # | Place | Affordance | NEW? | Wires |
|
||||
|---|---|---|---|---|
|
||||
| U1 | P1 | `~/.omnigraph/config.yaml` (operator edits) | **N** | → N1 |
|
||||
| U2 | P1 | project `./omnigraph.yaml` | — | → N1 |
|
||||
| U3 | P1 | `~/.omnigraph/credentials` / `.env.omni` dotenv (secrets, git-ignored) | — | → N4 |
|
||||
| U4 | P3 | `omnigraph <verb> --graph <name>` (any command) | — | → N14 |
|
||||
| U5 | P5 | `omnigraph login [<server>]` | **N** | → N11 |
|
||||
| U6 | P5 | `omnigraph init` / `quickstart [--template]` | partly | → N12 / N13 |
|
||||
| U7 | P2 | `omnigraph config view --resolved --show-origin` | **N** | → N10 |
|
||||
| N1 | P2 | `load_layered_config()` — global (N3) + project (cwd), serde each | **N** | → N2 |
|
||||
| N2 | P2 | **merge engine** — deep-merge settings; replace named-resource entries; replace lists; **retain provenance** and raw field origins | **N⚠️** | → N5, → S_merged |
|
||||
| N3 | P2 | global-dir resolver — `OMNIGRAPH_HOME` else `~/.omnigraph/` | **N** | → N1 |
|
||||
| N4 | P2 | `load_env_file_into_process` — dotenv, real-env-wins (existing) | — | → N9 |
|
||||
| N5 | P2 | `resolve_graph(name, merged)` → typed `Embedded`/`Remote` locator; rejects invalid role/field combinations before execution | **N⚠️** | → N6 |
|
||||
| N6 | P3 | `GraphConn` — `Embedded(engine)` \| `Remote(http)` dispatch | **N⚠️** | → N7, → N8 |
|
||||
| N7 | P3 | embedded path — `Omnigraph::open(uri)` (existing) | — | → engine |
|
||||
| N8 | P3 | **HTTP-client path** — POST `/query`/`/mutate`/`/queries/{name}` | **N⚠️** | → P4, → N9 |
|
||||
| N9 | P2 | `resolve_bearer_token(server)` — explicit `auth.token` source if set, else **keyed by name**: `OMNIGRAPH_TOKEN_<NAME>`/`OMNIGRAPH_TOKEN` → keychain `omnigraph:<name>` → `[<name>]` profile; legacy `bearer_token_env`/dotenv (MR-971) | **N⚠️** | → N8 |
|
||||
| N10 | P2 | `config view` handler — merged + per-field origin (needs N2 provenance) | **N** | → U7 |
|
||||
| N11 | P5 | `login` handler — interactive auth → write `config.yaml` + `credentials` (0600) + `.gitignore` | **N⚠️** | → S_global |
|
||||
| N12 | P5 | `init` handler — `scaffold_config_if_missing` + create graph; refuse-if-exists/`--force` purge (MR-975) | partly | → S_project |
|
||||
| N13 | P5 | `quickstart` handler — scaffold + `--template` + seed + `serve start` + agent prompt (MR-973; needs serve MR-970) | **N⚠️** | → S_project |
|
||||
| N14 | P3 | agent-mode wrapper — `--machine`/`OMNIGRAPH_AGENT_MODE`: JSON, structured errors, never-prompt, typed exit codes (MR-981) | **N⚠️** | → N1 |
|
||||
| S_global | P1 | `~/.omnigraph/config.yaml` + `credentials` | **N** | read by N1/N9 |
|
||||
| S_project | P1 | `./omnigraph.yaml` + `.env.omni` | — | read by N1/N4 |
|
||||
| S_merged | P2 | in-memory resolved config (per command, with provenance) | **N** | read by N5/N10 |
|
||||
| S_cache | P1 | `~/.omnigraph/cache/` (remote catalogs) | **N** | read by N8 |
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph P1["P1: Disk"]
|
||||
U1["U1: ~/.omnigraph/config.yaml"]
|
||||
U2["U2: ./omnigraph.yaml"]
|
||||
U3["U3: credentials dotenv"]
|
||||
end
|
||||
subgraph P2["P2: Config resolution"]
|
||||
N3["N3: global-dir (OMNIGRAPH_HOME)"]
|
||||
N1["N1: load_layered_config"]
|
||||
N2["N2: merge engine (+provenance)"]
|
||||
N4["N4: dotenv loader"]
|
||||
N5["N5: resolve_graph(--graph)"]
|
||||
N9["N9: resolve_bearer_token"]
|
||||
N10["N10: config view"]
|
||||
end
|
||||
subgraph P3["P3: Command execution"]
|
||||
U4["U4: omnigraph <verb> --graph"]
|
||||
N14["N14: agent-mode wrapper"]
|
||||
N6["N6: GraphConn embedded|remote"]
|
||||
N7["N7: embedded Omnigraph::open"]
|
||||
N8["N8: HTTP-client POST"]
|
||||
end
|
||||
subgraph P5["P5: Scaffold"]
|
||||
U5["U5: login"]; U6["U6: init/quickstart"]
|
||||
N11["N11: login handler"]; N12["N12: init"]; N13["N13: quickstart"]
|
||||
end
|
||||
P4["P4: remote omnigraph-server"]
|
||||
U1-->N1; U2-->N1; N3-->N1; N1-->N2-->N5-->N6
|
||||
U3-->N4-->N9-->N8
|
||||
U4-->N14-->N1
|
||||
N6-->N7; N6-->N8-->P4
|
||||
N2-->N10-->U7["U7: config view --resolved"]
|
||||
U5-->N11; U6-->N12; U6-->N13
|
||||
classDef ui fill:#ffb6c1,stroke:#d87093,color:#000
|
||||
classDef n fill:#d3d3d3,stroke:#808080,color:#000
|
||||
class U1,U2,U3,U4,U5,U6,U7 ui
|
||||
class N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N13,N14 n
|
||||
```
|
||||
|
||||
### Slices (vertical, each demo-able)
|
||||
|
||||
| # | Slice | Parts/affordances | Demo |
|
||||
|---|---|---|---|
|
||||
| **V1** | **Global layer + merge + `config view`** | A1–A4 · N1,N2,N3,N10 · U1,U7,S_global,S_merged | Put config in `~/.omnigraph/`, run `omnigraph config view --resolved --show-origin` from any dir → merged result with per-field origin; existing embedded commands work global-first with no project file |
|
||||
| **V2** | **Remote graphs + HTTP client + creds** | A5–A7 · N5,N6,N8,N9 · S_cache | Define a `server:` graph entry; `omnigraph query --graph prod` hits the remote server (`curl`-free); embedded `--graph dev` still local |
|
||||
| **V3** | **`omnigraph login`** | A8 · N11,U5 | `omnigraph login prod` writes `~/.omnigraph/credentials` (0600) + `.gitignore`; V2 remote query now works with no manual env |
|
||||
| **V4** | **Thin-init hardening + quickstart + templates** | A9 · N12,N13,U6 (needs serve MR-970) | `omnigraph quickstart --template person-knows` scaffolds + seeds + serves; `init --force` purges (MR-975) |
|
||||
| **V5** | **Agent-mode** | A10 · N14,U4 (MR-981) | `OMNIGRAPH_AGENT_MODE=1 omnigraph query …` → JSON + structured errors + typed exit codes; never-prompt |
|
||||
|
||||
V1 is the foundation (global-first + merge + view). V2 closes the substantive client→server gap. V3 is credential ergonomics. V4/V5 ride sibling tickets (MR-970/973/981). MR-969 (stored queries) ships independently and is reached by N8's `/queries/{name}` once V2 lands.
|
||||
|
||||
## Rollout
|
||||
|
||||
The slices above are the rollout order: **V1 (global layer + merge) → V2 (remote graphs + HTTP client) → V3 (login) → V4 (quickstart/templates, on MR-970) → V5 (agent-mode, MR-981).** V1–V2 close the substantive gap (global-first config + `curl`-free server access); V3–V5 are ergonomics that ride sibling tickets. Evaluate after V2 against early-adopter and agent-onboarding (MR-973 / MR-974) signal. The spikes (X1 HTTP-client, X2 merge engine, X3 resolver+provenance, X4 login) resolve before their owning slice.
|
||||
|
||||
## Prior art
|
||||
|
||||
- kubeconfig (clusters / users / contexts; `KUBECONFIG`; `kubectl config view`)
|
||||
- Helix CLI v2 (`helix.toml` local+enterprise instance blocks; `~/.helix/config`; `~/.helix/credentials`)
|
||||
- AWS CLI (`~/.aws/config` + `~/.aws/credentials` split; named profiles; `credential_process`)
|
||||
- git (`~/.gitconfig` + `.git/config`; `--show-origin`)
|
||||
- Cargo (`Cargo.toml` manifest + `~/.cargo/config.toml`)
|
||||
- Supabase / Prisma (one project manifest; connection via `DATABASE_URL` env)
|
||||
- 12-factor app (config that varies by deploy lives in the environment)
|
||||
270
docs/dev/rfc-003-mcp-server-surface.md
Normal file
270
docs/dev/rfc-003-mcp-server-surface.md
Normal file
|
|
@ -0,0 +1,270 @@
|
|||
# RFC: MCP Server Surface for `omnigraph-server` — Full Tool Parity, Stored Queries, Modular Auth
|
||||
|
||||
**Status:** Proposed
|
||||
**Date:** 2026-06-01
|
||||
**Tickets:** MR-969 (stored queries + MCP exposure — the surface this completes), MR-956 (federated auth / WorkOS OAuth — the auth substrate this consumes), MR-971 (per-server credential resolver), MR-974 (agent setup surface — the installer that wires this), MR-668 (multi-graph server — shipped, the routing this builds on)
|
||||
**Builds on:** [omnigraph#128](https://github.com/ModernRelay/omnigraph/pull/128) (`ragnorc/stored-queries-mcp`) — the shipped stored-query registry, `GET /queries`, `POST /queries/{name}`, and the coarse `invoke_query` gate.
|
||||
**Supersedes:** the MCP-transport portion of [rfc-001-queries-envelope-mcp.md](rfc-001-queries-envelope-mcp.md) (`/mcp/tools` + `/mcp/invoke`). See [Relationship to RFC-001](#relationship-to-rfc-001).
|
||||
**Target release:** v0.8.x (phased — see Rollout)
|
||||
|
||||
## Summary
|
||||
|
||||
Add a first-class **MCP (Model Context Protocol) server surface to `omnigraph-server`**, exposed over **Streamable HTTP**, that projects the server's operations as MCP tools and resources for LLM clients (Claude Code/Desktop/web, Cursor, etc.). Two populations of tools share one projection path:
|
||||
|
||||
1. **Built-in operational tools** — parity with the existing `@modernrelay/omnigraph-mcp` stdio package's **13 tools** (`health`, `snapshot`, `read`, `schema_get`, `branches_list`, `commits_list`, `commits_get`, `change`, `ingest`, `branches_create`, `branches_delete`, `branches_merge`, `schema_apply`) and its **2 resources** (`omnigraph://schema`, `omnigraph://branches`), plus a new server-scoped `graphs_list` tool and an `omnigraph://graphs` resource (multi-graph mode).
|
||||
2. **Dynamic stored-query tools** — one MCP tool per `mcp.expose: true` entry in the `queries:` registry (MR-969 / #128), with parameters typed from the `.gq` declaration via the shipped `query_catalog_entry` / `param_descriptor` projection.
|
||||
|
||||
Every tool is **authorized by the server's existing Cedar policy engine**. The MCP layer never implements its own authentication: it consumes an **already-resolved `ResolvedActor`** from the server's bearer middleware (`require_bearer_auth` today; the `TokenVerifier` seam when MR-956 lands), so the **same MCP endpoint serves on-prem (static or customer-OIDC tokens) and our cloud (WorkOS OAuth) by configuration only**. Cloud OAuth is an additive layer (RFC 9728 protected-resource metadata) that slots in with zero MCP changes.
|
||||
|
||||
The end-state collapses two diverging tool implementations into one: the in-server MCP is the canonical, Cedar-gated, remotely-reachable surface; the stdio package becomes a thin stdio↔HTTP proxy (local on-ramp) over it.
|
||||
|
||||
> **Key caveat, stated up front (see §5.9 below):** the headline "a token scoped via Cedar to a *specific set* of stored queries" requires **per-query `invoke_query` scope**, which is *designed* (rfc-001) but **not yet implemented** — the shipped action is coarse (any stored query on the graph, or none). Per-actor Cedar curation works today for *built-in vs ad-hoc vs admin* tools and for *stored-vs-ad-hoc*; sub-selecting individual stored queries per actor is gated on a prerequisite (PR 0b). Until then, stored-query curation is graph-level (registry membership + `mcp.expose`).
|
||||
|
||||
## Relationship to RFC-001
|
||||
|
||||
[rfc-001-queries-envelope-mcp.md](rfc-001-queries-envelope-mcp.md) (MR-656 / MR-976 / MR-969) is the parent design for stored queries + the response envelope + MCP. This RFC is the **detailed MCP-transport design** that #128 left for a follow-up, and it **revises rfc-001 in three places where the shipped code or the MCP wire protocol diverged from rfc-001's sketch**:
|
||||
|
||||
1. **Transport shape.** rfc-001 sketched `GET /mcp/tools` + `POST /mcp/invoke` (a bespoke REST pair). **That is not the MCP wire protocol — real MCP clients cannot connect to it.** This RFC implements actual MCP JSON-RPC over Streamable HTTP and reuses `query_catalog_entry` as a *projection source*, not a parallel surface. (rfc-001's own Open Question already leaned toward Streamable HTTP.)
|
||||
2. **Exposure config.** rfc-001 specified inline `.gq` pragmas (`@mcp(expose=…)`, default `expose=false`). **#128 shipped a different mechanism:** YAML `queries.<name>.mcp.expose` in `omnigraph.yaml`, **default `true`** (declaring a query in the manifest *is* the opt-in). This RFC builds on the shipped YAML form; the `.gq`-pragma design in rfc-001 is superseded for exposure.
|
||||
3. **Schema introspection.** rfc-001 lists "Schema introspection through MCP" as a **non-goal** ("agents see types through declared return shapes"). This RFC **revises that**: the operational-parity tools include `schema_get` and `omnigraph://schema` — *because the shipped stdio package already exposes both*. The non-goal is achieved by *policy*, not omission: `schema_get`/`omnigraph://schema` are Cedar-gated by `Read`, and the recommended locked-down agent policy denies `Read`, so a curated agent still never sees the schema. (rfc-001's intent is preserved; the mechanism moves from "don't build it" to "build it, gate it.")
|
||||
|
||||
Everything else in rfc-001 (two-paths-one-engine, per-query `invoke_query` *as the intended scope*, the response envelope, multi-graph per-graph endpoints) this RFC consumes unchanged.
|
||||
|
||||
> **Numbering note:** the `TokenVerifier`/WorkOS auth design is referred to in code (`crates/omnigraph-server/src/identity.rs`) as "RFC 0001," which is a *different* document from this repo's `docs/dev/rfc-001-queries-envelope-mcp.md`. To avoid the collision this RFC cites the auth substrate as **MR-956** throughout, never "RFC 0001."
|
||||
|
||||
## Reconciliation with shipped code (verified against `ragnorc/stored-queries-mcp` HEAD)
|
||||
|
||||
Verified against `crates/omnigraph-server/src/{lib.rs,api.rs}` and `crates/omnigraph-policy/src/lib.rs` at the current branch head (not the #128 PR body, and not `api.rs` alone):
|
||||
|
||||
- ✅ `GET /queries` returns the `mcp.expose == true` subset as `QueriesCatalogOutput { queries: [QueryCatalogEntry] }`, each with typed `ParamDescriptor`s, `tool_name`, `description`, `instruction`, and a `mutation` flag. **MCP-ready projection, but exposed as bespoke REST/JSON — not the MCP wire protocol.**
|
||||
- ✅ `POST /queries/{name}` route exists (`server_invoke_query`, `lib.rs`).
|
||||
- ✅ `query_catalog_entry()` / `param_descriptor()` with an exhaustive `ScalarType → ParamKind` map (a new scalar is a compile error).
|
||||
- ✅ `InvokeQuery` Cedar action defined in `omnigraph-policy`.
|
||||
- ✅ **`InvokeQuery` IS enforced** at `POST /queries/{name}`: `server_invoke_query` calls `authorize(PolicyAction::InvokeQuery)` and **masks a denial to a 404 identical to "unknown query"** so the catalog isn't probeable (the denial-masking the previous draft of this RFC reported as missing is shipped — it lives in `lib.rs`, not `api.rs`). The stored-mutation path is already double-gated: `InvokeQuery` outer, then `Change` inside `run_mutate`.
|
||||
- ✅ **Reuse path exists:** `run_query` / `run_mutate` are already decoupled from their HTTP request bodies and take registry-supplied `(source, name, params, branch/snapshot)`. MCP `tools/call` for both stored and ad-hoc tools delegates to these — no new business logic.
|
||||
- ❌ **Per-query (`invoke_query[name]`) scope is NOT implemented.** `PolicyRequest` carries only `{action, branch, target_branch}` — **no query-name dimension** — and the action is documented coarse ("permits *any* stored query on the graph"). rfc-001 *designed* per-name scope; it is unbuilt. This RFC's per-query Cedar filtering (§5.4) and recommended agent policy (§5.9) depend on it → tracked as **PR 0b**.
|
||||
- ❌ No MCP protocol surface (`initialize`/`tools/list`/`tools/call`, JSON-RPC, transport).
|
||||
- ❌ No `TokenVerifier` trait yet — `require_bearer_auth` resolves a `ResolvedActor` inline (static-hash). The trait/`OidcJwtVerifier` are MR-956 (draft). The MCP layer's only requirement — *consume `ResolvedActor`* — is satisfiable today.
|
||||
|
||||
Stack (verified `Cargo.toml`): Axum + utoipa (OpenAPI) + `omnigraph-policy` (Cedar) + `futures` + `tokio`. **No MCP crate present.** `edition = "2024"`.
|
||||
|
||||
## Motivation
|
||||
|
||||
- **One curated, safe, remotely-reachable tool surface.** MR-969's thesis: hand an LLM a token Cedar-scoped to a set of tools and it sees exactly those typed tools — cannot construct ad-hoc queries it isn't permitted, cannot read the schema it isn't permitted, cannot reach other graphs. Today the only MCP is the stdio package: local-only, full surface, ungated.
|
||||
- **Parity, so the in-server MCP can be the single implementation.** Operators/agents already depend on the operational tools. Supporting them server-side behind one Cedar gate lets the stdio package degrade to a proxy and removes two diverging tool sets.
|
||||
- **On-prem and cloud from one endpoint.** A managed cloud (WorkOS OAuth) and an on-prem/air-gapped deploy (static or customer-OIDC tokens) must serve the same MCP without forks or MCP-specific auth.
|
||||
- **Foundation for the agent on-ramp (MR-974).** `omnigraph mcp install --agent <tool>` needs a decided transport + a stable endpoint.
|
||||
|
||||
## Goals
|
||||
|
||||
- Project built-in tools + stored queries as MCP tools through **one** registry abstraction.
|
||||
- `tools/list` and the callable set are **identical for argument-independent authorization**, both driven by Cedar (see §5.4 for the branch-scoped caveat).
|
||||
- The MCP layer is **auth-method-agnostic**: it consumes `ResolvedActor`, never a raw token, never branches on how auth happened.
|
||||
- The same endpoint works on-prem (static/OIDC) and cloud (WorkOS OAuth), switched by config; cloud OAuth is additive (RFC 9728).
|
||||
- No new business logic: MCP tools delegate to the same `run_query`/`run_mutate`/branch/schema functions the HTTP routes call.
|
||||
- Behaviour-neutral when unused: no MCP traffic = no change.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- **Building/hosting an OAuth authorization server.** The server is a Resource Server; WorkOS AuthKit+Connect is the AS (MR-956). The MCP endpoint validates tokens, never issues them, never holds client secrets.
|
||||
- **OAuth/WorkOS implementation itself** — MR-956's work. This RFC leaves a clean RFC-9728 hook and consumes `ResolvedActor`.
|
||||
- **MCP prompts, elicitation, `tools/list_changed`, resource subscriptions, server-initiated messages.** None needed → enables a stateless POST-only transport (§5.6).
|
||||
- **stdio transport inside the server.** stdio stays in the TS package (now a proxy).
|
||||
- **Cross-graph tool listing.** Per-graph catalogs only (MR-969 + RFC-002 non-goal).
|
||||
- **Hot reload of the query registry.** Restart-only (MR-969).
|
||||
|
||||
## Background
|
||||
|
||||
`omnigraph-server` (Axum) already implements every operation this RFC exposes as an authenticated HTTP route; each authorizes via a `PolicyAction` against the Cedar policy for a server-resolved actor and calls into the engine. The existing stdio MCP package is a *client* of these routes (it owns no business logic). MR-956 will introduce a `TokenVerifier` trait (`StaticHashTokenVerifier` today inline, `OidcJwtVerifier` for OIDC/WorkOS) producing the `ResolvedActor { actor_id, tenant_id: Option, scopes: Vec<Scope>, source }` that already exists in `identity.rs` and is consumed by Cedar — token *validation* is offline (cached JWKS), so on-prem/air-gapped has no request-path dependency on the cloud.
|
||||
|
||||
## Design
|
||||
|
||||
### 5.1 One tool model: a `McpTool` trait, two populators
|
||||
|
||||
Both built-in and stored-query tools implement one trait so `tools/list` / `tools/call` never special-case:
|
||||
|
||||
```rust
|
||||
trait McpTool: Send + Sync {
|
||||
fn name(&self) -> &str; // MCP tool id (stable)
|
||||
fn title(&self) -> Option<&str>;
|
||||
fn description(&self) -> &str;
|
||||
fn input_schema(&self) -> serde_json::Value; // JSON Schema (draft 2020-12)
|
||||
fn annotations(&self) -> ToolAnnotations; // readOnlyHint / destructiveHint / idempotentHint
|
||||
/// The Cedar request(s) this call requires, given parsed args. Used BOTH at
|
||||
/// list-time (dry-run filter, default args) and call-time (enforce, real args).
|
||||
fn authorization(&self, args: &ToolArgs) -> Vec<PolicyRequest>;
|
||||
async fn call(&self, ctx: &GraphCtx, args: ToolArgs) -> Result<ToolOutput, ToolError>;
|
||||
}
|
||||
```
|
||||
|
||||
- **Built-ins**: ~14 static impls, each delegating to the *same* function its HTTP route calls (`run_query`, `run_mutate`, branch ops, `apply_schema_as`, …). `input_schema` authored once (or derived from each route's existing `utoipa`/`ToSchema` DTO).
|
||||
- **Stored queries**: generated `McpTool` instances, one per `mcp.expose` entry; `input_schema` from `param_descriptor` (§5.3); `authorization` → `InvokeQuery` (coarse today; `InvokeQuery{name}` after PR 0b) then the inner `Read`/`Change`.
|
||||
|
||||
`ToolRegistry` for a graph = the static built-ins + the dynamic stored-query tools resolved from that graph's `GraphHandle` registry.
|
||||
|
||||
### 5.2 Tool catalog (parity) and Cedar mapping
|
||||
|
||||
Each built-in **reuses the exact `PolicyAction` its HTTP route already enforces** — verified against the handlers in `lib.rs`, not invented:
|
||||
|
||||
| MCP tool | Scope | Read/Mutate | Cedar action (verified from route) |
|
||||
|---|---|---|---|
|
||||
| `health` | server | read | none (liveness/version) |
|
||||
| `graphs_list` *(new)* | server | read | `GraphList` |
|
||||
| `snapshot` | graph | read | `Read` |
|
||||
| `schema_get` | graph | read | `Read` |
|
||||
| `branches_list` | graph | read | `Read` |
|
||||
| `commits_list`, `commits_get` | graph | read | `Read` |
|
||||
| `read` (ad-hoc `.gq`) / `query` *(alias)* | graph | read | `Read` |
|
||||
| `change` (ad-hoc `.gq`) / `mutate` *(alias)* | graph | mutate | `Change` |
|
||||
| `ingest` (NDJSON) | graph | mutate | `Change` (+ `BranchCreate` when forking a new branch) |
|
||||
| `branches_create` | graph | mutate | `BranchCreate` |
|
||||
| `branches_delete` | graph | mutate | `BranchDelete` |
|
||||
| `branches_merge` | graph | mutate | `BranchMerge` |
|
||||
| `schema_apply` (`allow_data_loss`) | graph | mutate | `SchemaApply` |
|
||||
| **stored query** (`find_user`, …) | graph | inferred | `InvokeQuery` (coarse; `InvokeQuery{name}` after PR 0b) + inner `Read`/`Change` |
|
||||
|
||||
There is **no `Ingest` and no separate `snapshot`/`Export` action** — `ingest` enforces `Change`, `snapshot` enforces `Read`. (`Export` exists but maps to the `/export` route, which this RFC does not expose as a tool.)
|
||||
|
||||
**Tool id parity vs. canonicalization.** The shipped stdio package uses tool ids **`read`/`change`** (and calls the deprecated `/read`,`/change` routes). The server HTTP surface canonicalized to `/query`,`/mutate` with `/read`,`/change` deprecated (MR-656). To keep existing package clients working *and* align with the server, the MCP exposes **`query`/`mutate` as canonical with `read`/`change` retained as deprecated-but-live aliases** (both dispatch to the same handler). Open Q7 asks whether to drop the aliases later.
|
||||
|
||||
Resources (§5.5): `omnigraph://schema`, `omnigraph://branches` (parity), plus `omnigraph://graphs` *(new)* — each gated by the same action as its list/get route (`Read`, `Read`, `GraphList`).
|
||||
|
||||
### 5.3 `ParamDescriptor → JSON Schema` (stored-query tools)
|
||||
|
||||
| `ParamKind` | JSON Schema | Notes |
|
||||
|---|---|---|
|
||||
| String | `{"type":"string"}` | |
|
||||
| Bool | `{"type":"boolean"}` | |
|
||||
| Int (i32/u32) | `{"type":"integer"}` | |
|
||||
| BigInt (i64/u64) | `{"type":"string","pattern":"^-?\\d+$"}` | JSON numbers lose precision >2⁵³ → string (matches the shipped `api.rs` rationale). (Open Q1) |
|
||||
| Float (f32/f64) | `{"type":"number"}` | |
|
||||
| Date | `{"type":"string","format":"date"}` | |
|
||||
| DateTime | `{"type":"string","format":"date-time"}` | |
|
||||
| Blob | `{"type":"string","contentEncoding":"base64"}` | |
|
||||
| Vector | `{"type":"array","items":{"type":"number"},"minItems":dim,"maxItems":dim}` | uses `vector_dim` |
|
||||
| List | `{"type":"array","items":<item_kind schema>}` | scalar items only (grammar guarantees) |
|
||||
|
||||
`nullable == false` → param is in `required`. Annotations: `mutation` → `{readOnlyHint:false, destructiveHint:true}`; else `{readOnlyHint:true}`. `description` → tool description; `instruction` → appended to description (or `_meta`). (The shipped `check()` already warns when an `mcp.expose` query declares a `Vector` param an LLM can't supply.)
|
||||
|
||||
For built-in tools the schema is hand-authored from the route DTO; e.g. `query` → `{source: string, branch?: string, params?: object}`; `schema_apply` → `{schema: string, allow_data_loss?: boolean}`; `ingest` → `{ndjson: string, mode?: "merge"|"append"|"overwrite", branch?: string}`.
|
||||
|
||||
### 5.4 `tools/list` (Cedar-filtered) and `tools/call` (dispatch + masking)
|
||||
|
||||
- **`tools/list`**: build the `ToolRegistry`; for each tool evaluate `authorization(default_args)` against the actor's Cedar policy; **emit only tools that authorize**. Authz decisions memoized per request. Stored-query tools additionally require `mcp.expose: true`.
|
||||
- **Exactness caveat (R7 is conditional):** the listed set equals the callable set **only for tools whose authorization is argument-independent** (`health`, `graphs_list`, `snapshot`, `schema_get`, `branches_list`, `commits_*`, ad-hoc `query`/`mutate`, and stored queries under the *coarse* action). For **branch-scoped tools** (`branches_create`/`merge` with `target_branch_scope`, and any branch-scoped `Read`/`Change` rule), list-time uses `default_args` (e.g. branch `main`) and cannot know the real target, so the listed set is a *best-effort approximation* of callability — a call may still be denied (or, rarely, a hidden tool would have been allowed). `tools/call` is always the authoritative gate. The contract is: **list never shows a tool the actor can't ever call; for branch-scoped tools it may show one the actor can call only on some branches.**
|
||||
- **`tools/call`**: resolve `name` → `McpTool` (masked-404 if unknown *or* `mcp.expose:false`); parse+validate args against `input_schema`; enforce `authorization(args)` (mutations stay double-gated: `InvokeQuery` then `Change`); on success `call`. **Denial masking** lives in one place (the dispatcher): an authz denial is returned identically to "unknown tool" (§5.10), reusing the same deny≡missing principle already shipped at `POST /queries/{name}`.
|
||||
|
||||
### 5.5 Resources
|
||||
|
||||
Advertise `resources` capability (`subscribe:false, listChanged:false`). `resources/list` → the URIs the actor may read; `resources/read` → schema `.pg` text / branches JSON / (multi-graph) graphs JSON, each gated by the corresponding action (`Read`, `Read`, `GraphList`). A locked-down agent denied `Read` simply never sees `omnigraph://schema` or `omnigraph://branches` — this is how rfc-001's "agents don't introspect schema" intent is met *by policy* (§Relationship-to-RFC-001).
|
||||
|
||||
### 5.6 Transport: Streamable HTTP, stateless, POST-only
|
||||
|
||||
- **Streamable HTTP** (MCP's current standard; we're already an HTTP server). One endpoint per scope (§5.7).
|
||||
- Because the server emits **no** server-initiated messages, implement the **minimal conformant** shape: client `POST`s JSON-RPC, server replies `application/json`. **No SSE channel, no `Mcp-Session-Id`, stateless** — each request authenticated independently via the bearer middleware. Honour the `MCP-Protocol-Version` header. SSE/sessions can be added later if subscriptions land.
|
||||
- **JSON-RPC methods:** `initialize` (advertise `{tools:{listChanged:false}, resources:{listChanged:false, subscribe:false}}` + serverInfo/version), `notifications/initialized` (no-op ack), `ping`, `tools/list`, `tools/call`, `resources/list`, `resources/read`. `prompts/list` returns empty if probed.
|
||||
- **Library decision (Open Q2):** spike `rmcp` (official Rust MCP SDK) for conformance + Streamable-HTTP/Axum on edition 2024; **fall back to a hand-rolled ~150 LOC JSON-RPC-over-POST** (only the methods above) on friction. Given the tiny surface, hand-roll is an acceptable default.
|
||||
|
||||
### 5.7 Endpoint routing (server- vs graph-scoped)
|
||||
|
||||
- **Single-graph mode:** `POST /mcp` — graph tools + server tools (`health`, `graphs_list`).
|
||||
- **Multi-graph mode (MR-668):** `POST /graphs/{graph_id}/mcp` — graph-scoped tools for that graph; plus a server-level `POST /mcp` exposing only server-scoped tools (`health`, `graphs_list`). A per-graph endpoint never lists another graph's tools (isolation, tested). Mirrors the shipped `/graphs/{graph_id}/…` cluster routing. (Open Q5: confirm naming + whether server tools also appear on the per-graph endpoint.)
|
||||
|
||||
### 5.8 Modular / decoupled auth (the cross-cutting requirement)
|
||||
|
||||
**Invariant (load-bearing, satisfiable today):** the MCP handler receives an **already-resolved `ResolvedActor`** and **branches on nothing** about how the token was verified. No token parsing, no method check, no OAuth inside the MCP module. Today that actor comes from `require_bearer_auth`; when MR-956 lands it comes from a `TokenVerifier` — the MCP code is identical either way.
|
||||
|
||||
```
|
||||
request → [auth middleware: ResolvedActor] → [MCP route] → Cedar → McpTool
|
||||
```
|
||||
|
||||
**Server side — auth is config, not code:**
|
||||
|
||||
| Deployment | Verifier | MCP change |
|
||||
|---|---|---|
|
||||
| On-prem, static bearer | `require_bearer_auth` / `StaticHashTokenVerifier` | none |
|
||||
| On-prem, customer IdP | `OidcJwtVerifier` → customer issuer (MR-956) | none |
|
||||
| Our cloud | `OidcJwtVerifier` → WorkOS, `tenant_id = Some(org_id)` (MR-956) | none |
|
||||
|
||||
Token validation is offline (cached JWKS) — on-prem/air-gapped keeps working with no request-path cloud dependency. The MCP endpoint never terminates OAuth and never holds a client secret (Resource Server only).
|
||||
|
||||
**Cloud client negotiation — additive, no MCP changes:** when MR-956 lands, the server publishes RFC 9728 `/.well-known/oauth-protected-resource` and returns `WWW-Authenticate: Bearer ..., resource_metadata="..."` on 401. A compliant MCP client (Claude) then auto-negotiates: static bearer to an on-prem endpoint; on a cloud 401 it discovers the WorkOS AS and runs OAuth/PKCE itself — **same endpoint URL, zero client-side branching.** This RFC only requires that MCP routes flow through the standard 401 path so that hook can be added later without touching MCP.
|
||||
|
||||
**Multi-user identity pass-through (cloud):** the *caller's* token (a WorkOS JWT, audience-bound per-tenant) must reach the server so Cedar enforces per-user/per-tenant policy — never a shared service token. The MCP endpoint validates it offline and maps `org_id → tenant_id`. This is why the **remote path is the in-server HTTP MCP that Claude connects to directly** (its token flows through), not a stdio bridge impersonating a user.
|
||||
|
||||
**Client-side credential acquisition (CLI/SDK/proxy) — pluggable `CredentialSource`** (RFC-002 §5, MR-971), keyed by server name, so OAuth is a future *sibling key*, not a re-key:
|
||||
|
||||
```yaml
|
||||
servers:
|
||||
onprem: { endpoint: https://og.internal:8080, auth: { token: { env: OG_TOKEN } } }
|
||||
edge: { endpoint: https://og-edge, auth: { token: { command: [vault, read, -field=token, secret/og] } } }
|
||||
cloud: { endpoint: https://api.omnigraph.cloud, auth: { oauth: { issuer: workos } } } # future sibling
|
||||
```
|
||||
|
||||
Implicit chain when `auth:` omitted: `OMNIGRAPH_TOKEN_<NAME>` → keychain `omnigraph:<name>` → `[<name>]` in `~/.omnigraph/credentials`; legacy `bearer_token_env` honoured. Secrets never inlined.
|
||||
|
||||
### 5.9 Safety model — Cedar is the gate, default-deny is the floor
|
||||
|
||||
With ad-hoc `query`/`mutate`/`schema_apply` present as tools, the **only** thing protecting an untrusted agent is the Cedar policy. Therefore:
|
||||
|
||||
- **Default-deny when tokens are configured** (MR-723, shipped) is the floor — an actor with no grants sees an empty tool list.
|
||||
- **What works today (coarse action):** a policy can hide all ad-hoc tools and admin tools per-actor (`deny Read, Change, SchemaApply, Branch*`) while allowing stored queries (`allow InvokeQuery`). That already reproduces "can't run ad-hoc, can't read schema, can only call stored queries" — the agent sees *every* exposed stored query plus nothing else.
|
||||
- **What needs PR 0b (per-query scope):** selecting *which* stored queries an actor may call (`allow InvokeQuery [find_user, list_orders]`, deny the rest). The shipped `invoke_query` is coarse (all stored queries or none). Until PR 0b adds a query-name dimension to `PolicyRequest` + the Cedar schema (rfc-001's intended design), per-actor sub-selection of stored queries is **not expressible**; curation is graph-level (which `.gq` files are registered + `mcp.expose`).
|
||||
- `schema_apply`, `branches_delete`, ad-hoc `mutate` require an explicit admin-tier grant; never in a default agent policy.
|
||||
- (Open Q3) Optional `mcp.allow_adhoc` server switch defaulting **off** for the ad-hoc `query`/`mutate` tools — defence-in-depth independent of Cedar, and independent of PR 0b.
|
||||
|
||||
### 5.10 Result shaping and error mapping
|
||||
|
||||
- **Success:** `tools/call` returns `content: [{type:"text", text:<json>}]` where `<json>` is the route's existing output envelope (read rows / mutation summary, i.e. `ReadOutput` / `ChangeOutput`). (Open Q4: also emit `structuredContent` + `outputSchema` — defer; text-JSON for v1.)
|
||||
- **Tool execution error** (bad params after schema validation, engine error): result with `isError:true` + a text content block.
|
||||
- **Authorization denial / unknown tool / `mcp.expose:false`:** a single JSON-RPC error (`-32602`, message `"unknown tool"`) — identical for all three so policy isn't probeable (same principle as the shipped `POST /queries/{name}` 404 masking).
|
||||
- **Auth failure** (bad/absent bearer): HTTP 401 from the middleware *before* MCP — carries `WWW-Authenticate` (the RFC 9728 hook), never masked as a tool error. (This is exactly the path the shipped `authorize`/`authorize_request` split preserves: operational failures keep their status; only *denials* are masked.)
|
||||
|
||||
## Relationship to the `@modernrelay/omnigraph-mcp` stdio package
|
||||
|
||||
Verified surface of the package (`omnigraph-ts`, pkg version `0.3.0`, `@modelcontextprotocol/sdk@^1.29.0`, **stdio only**): **13 tools** (`health`, `snapshot`, `read`, `schema_get`, `branches_list`, `commits_list`, `commits_get`, `change`, `ingest`, `branches_create`, `branches_delete`, `branches_merge`, `schema_apply`) and **2 resources** (`omnigraph://schema`, `omnigraph://branches`). It is a thin client over the SDK → HTTP routes and **forwards the caller's bearer verbatim** (no inspection).
|
||||
|
||||
Once parity lands, **collapse to one implementation**: the in-server MCP is canonical (Cedar-gated, remote-capable, the path that becomes a Claude-web connector via MR-956). The stdio package degrades to a **thin stdio↔HTTP proxy** forwarding JSON-RPC (and the incoming `Authorization`) to `/mcp` — staying the local on-ramp for Claude Code/Desktop while sharing one tool set, one Cedar gate. Transition: keep the current independent stdio package on its `0.3.x`/`0.6.x` line; ship proxy mode in a later TS minor once the server endpoint is GA. (Note: the package is currently several minors behind the server — its vendored `spec/openapi.json` predates the stored-query routes — so it needs the standard re-sync regardless of MCP work.)
|
||||
|
||||
## Testing
|
||||
|
||||
- **Protocol conformance:** `initialize` handshake + advertised capabilities; `tools/list` shape; `tools/call` happy path; JSON-RPC error envelopes (`-32601` unknown method, `-32602` invalid params / unknown tool); `resources/list` + `resources/read`.
|
||||
- **Cedar filtering (coarse, today):** an actor with `allow InvokeQuery` + `deny Read/Change` sees *all* exposed stored queries but **not** `query`/`mutate`/`schema_get`; `tools/call query` returns masked "unknown tool"; an admin sees the full catalog.
|
||||
- **Cedar filtering (per-query, gated on PR 0b):** actor scoped to `InvokeQuery [find_user]` sees *only* `find_user`; `tools/call list_orders` masks. **This test ships with PR 0b**, not PR 1 — it cannot pass against the coarse action.
|
||||
- **Parity per built-in:** each tool round-trips against the same expectations as its HTTP route (reuse route tests); `read`/`change` aliases dispatch identically to `query`/`mutate`.
|
||||
- **Double-gating:** a stored mutation requires both `InvokeQuery` and `Change`; `schema_apply` requires `SchemaApply`.
|
||||
- **`mcp.expose:false`:** absent from `GET /queries` and MCP `tools/list`; still service-callable by name through `POST /queries/{name}` when the actor has `invoke_query`, but not MCP-callable.
|
||||
- **Schema generation:** table-driven over every `ParamKind` incl. nullable / list / vector(dim).
|
||||
- **Branch-scoped list approximation:** assert the documented R7 caveat — a branch-scoped policy lists `branches_create`, and `tools/call` is the authoritative gate (a denied target still 403s/masks).
|
||||
- **Multi-graph isolation:** `/graphs/a/mcp` never lists graph `b`'s tools; server `/mcp` exposes only server tools.
|
||||
- **Auth decoupling:** the MCP suite is green under the current `require_bearer_auth` and under a mock OIDC `ResolvedActor` source — proving verifier-agnosticism. A 401 carries `WWW-Authenticate`.
|
||||
- **OpenAPI:** the JSON-RPC endpoint is not REST — document only the envelope in utoipa (or exclude); keep `openapi.json` drift test green (`OMNIGRAPH_UPDATE_OPENAPI=1` to regenerate on intentional change).
|
||||
- **Cross-repo smoke (optional):** point `@modelcontextprotocol/sdk` (TS) at the HTTP endpoint in an `omnigraph-ts` integration test.
|
||||
|
||||
## Rollout — phased by risk
|
||||
|
||||
- **PR 0a — extract the reusable invoke path (small).** The coarse `invoke_query` gate + 404 denial-masking are **already shipped** in `server_invoke_query`. Extract the read/mutate dispatch into `invoke_stored_query(handle, name, params, branch/snapshot, actor)` so MCP `tools/call` and the HTTP route share one path. No behaviour change. *(Replaces the previous draft's "PR 0 — wire the gate", which was already done.)*
|
||||
- **PR 0b — per-query `invoke_query` scope (the safety prerequisite).** Add a query-name dimension to `PolicyRequest` + the Cedar schema (rfc-001's intended design), wire it at `POST /queries/{name}` and in the stored-query `McpTool::authorization`. Independently useful (the `allow InvokeQuery [find_user]` policy). **Gates the per-query Cedar-filtering test and §5.9's recommended agent policy.**
|
||||
- **PR 1 — MCP transport + read-only parity + stored-query reads.** Endpoint(s), `initialize`/`tools/list`/`tools/call`/`resources/*`, the `McpTool` registry, Cedar-filtered listing, the read-only built-ins (`health`, `graphs_list`, `snapshot`, `read`/`query`, `schema_get`, `branches_list`, `commits_*`) + resources + stored-query *reads*. All auth-agnostic.
|
||||
- **PR 2 — mutating parity + stored-query mutations.** `change`/`mutate`, `ingest`, `branches_create/delete/merge`, `schema_apply`, stored-query mutations + the `mcp.allow_adhoc` switch.
|
||||
- **PR 3 — docs + agent on-ramp hook.** `docs/user/server.md` MCP section (incl. the recommended agent policy + the coarse-vs-per-query caveat), `openapi.json` sync, the `omnigraph mcp install` config target (MR-974), and the downstream `omnigraph-ts` re-sync/proxy follow-up.
|
||||
- **Later (separate, MR-956):** RFC 9728 protected-resource metadata + WorkOS — slots in with zero MCP changes.
|
||||
- **Later (TS minor):** stdio package → proxy mode.
|
||||
|
||||
## Migration / backwards compatibility
|
||||
|
||||
- **Additive.** No `queries:` and no MCP traffic → today's behaviour unchanged. New endpoints are new routes.
|
||||
- **Cedar default-deny** (when tokens configured) means MCP exposes nothing until an actor is granted — safe by default.
|
||||
- The stdio package keeps working unchanged; proxy mode is opt-in later.
|
||||
- `openapi.json` only gains the documented MCP envelope; existing REST routes untouched.
|
||||
|
||||
## Open Questions
|
||||
|
||||
1. **BigInt/u64 as JSON string** (recommended, precision-safe) vs number.
|
||||
2. **`rmcp` vs hand-rolled** JSON-RPC (spike `rmcp` on edition 2024; default to hand-roll on friction).
|
||||
3. **Default-off `mcp.allow_adhoc`** for ad-hoc `query`/`mutate` (recommended) vs always-on + Cedar-only.
|
||||
4. **`structuredContent` + `outputSchema`** now vs text-JSON v1 (recommend v1 text-JSON).
|
||||
5. **Endpoint paths:** `/mcp` + `/graphs/{id}/mcp` — confirm naming and whether server-scoped tools also appear on the per-graph endpoint.
|
||||
6. **Stateless POST-only** confirmed (no near-term server-initiated messages) — revisit only if subscriptions land.
|
||||
7. **Legacy alias tools** (`read`/`change`): keep for client compat (the shipped package uses them), or drop and rely on `query`/`mutate`?
|
||||
8. **PR 0b shape:** per-query scope as a Cedar *resource* (`StoredQuery::"find_user"`) vs a `query_name` *context attribute* + policy condition — affects how `allow InvokeQuery [list]` is authored.
|
||||
|
|
@ -20,9 +20,9 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
|
|||
| `end_to_end.rs` | Full init → load → query/mutate flow |
|
||||
| `branching.rs` | Branch create / list / delete, lazy fork |
|
||||
| `merge_truth_table.rs` | Merge-pair truth table (MR-786): all 9×9 `(left_op, right_op)` cells from `{noop, addNode, removeNode, addEdge, removeEdge, setProperty, dropProperty, addLabel, removeLabel}`. Adding a new op to `OpVariant` forces a compile error in `build_case` until the new row + column are dispositioned. 36 executable cells run through real `branch_merge` with a structured oracle (`MergeOutcome` / `MergeConflictKind` + graph-state assert); 45 cells involving `dropProperty`/`addLabel`/`removeLabel` are recorded as `Unsupported` until the mutation grammar grows. |
|
||||
| `runs.rs` | Direct-publish writes: cancellation, concurrent-writer CAS, multi-statement atomicity, MR-794 staged-write rewire (D₂ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery) |
|
||||
| `writes.rs` | Direct-publish writes: cancellation, concurrent-writer CAS, multi-statement atomicity, MR-794 staged-write rewire (D₂ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery) |
|
||||
| `staged_writes.rs` | TableStore staged-write primitives (`stage_append`, `stage_merge_insert`, `commit_staged`, `scan_with_staged`, `count_rows_with_staged`) — primitive-level only; engine code uses the in-memory `MutationStaging` accumulator instead |
|
||||
| `lifecycle.rs` | Repo lifecycle, schema state |
|
||||
| `lifecycle.rs` | Graph lifecycle, schema state |
|
||||
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
|
||||
| `changes.rs` | `diff_between` / `diff_commits` |
|
||||
| `consistency.rs` | Cross-table snapshot isolation, atomic publish |
|
||||
|
|
@ -31,7 +31,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
|
|||
| `traversal.rs` | `Expand`, variable-length hops, anti-join |
|
||||
| `aggregation.rs` | `count`, `sum`, `avg`, `min`, `max` |
|
||||
| `export.rs` | NDJSON streaming export filters |
|
||||
| `s3_storage.rs` | S3-backed repo (skipped unless `OMNIGRAPH_S3_TEST_BUCKET` is set) |
|
||||
| `s3_storage.rs` | S3-backed graph (skipped unless `OMNIGRAPH_S3_TEST_BUCKET` is set) |
|
||||
| `lance_version_columns.rs` | Per-row `_row_last_updated_at_version` behavior |
|
||||
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
|
||||
| `maintenance.rs` | `optimize` (compaction) + `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation |
|
||||
|
|
@ -45,7 +45,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
|
|||
|
||||
## Test helpers
|
||||
|
||||
- **Engine** — `crates/omnigraph/tests/helpers/mod.rs`: `init_and_load()` (bootstrap a temp repo + load standard fixture), `snapshot_main()`, `snapshot_branch()`, query/mutation runners, row collection and counting. Use these instead of hand-rolling.
|
||||
- **Engine** — `crates/omnigraph/tests/helpers/mod.rs`: `init_and_load()` (bootstrap a temp graph + load standard fixture), `snapshot_main()`, `snapshot_branch()`, query/mutation runners, row collection and counting. Use these instead of hand-rolling.
|
||||
- **CLI** — `crates/omnigraph-cli/tests/support/mod.rs`: `Command`-style wrapper for invoking `omnigraph`, server-process spawning, fixture resolution, output assertion helpers.
|
||||
- **Server** — no shared helpers; server tests call the `Omnigraph` engine API directly and exercise endpoints over the wire.
|
||||
|
||||
|
|
@ -63,14 +63,14 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
|
|||
CI runs three S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml` → `rustfs_integration` job):
|
||||
|
||||
- `cargo test -p omnigraph-engine --test s3_storage`
|
||||
- `cargo test -p omnigraph-server --test server server_opens_s3_repo_directly_and_serves_snapshot_and_read`
|
||||
- `cargo test -p omnigraph-server --test server server_opens_s3_graph_directly_and_serves_snapshot_and_read`
|
||||
- `cargo test -p omnigraph-cli --test system_local local_cli_s3_end_to_end_init_load_read_flow`
|
||||
|
||||
Locally, set `OMNIGRAPH_S3_TEST_BUCKET` (and the usual `AWS_*` vars including `AWS_ENDPOINT_URL_S3` for non-AWS) before running. Without those, S3 tests skip gracefully.
|
||||
|
||||
## OpenAPI drift
|
||||
|
||||
`crates/omnigraph-server/tests/openapi.rs` regenerates `openapi.json` and diffs against the checked-in copy. CI auto-commits the regeneration on same-repo PRs and otherwise runs in strict-check mode (env: `OMNIGRAPH_UPDATE_OPENAPI`).
|
||||
`crates/omnigraph-server/tests/openapi.rs` regenerates `openapi.json` and diffs against the checked-in copy. CI auto-commits the regeneration on same-repository PRs and otherwise runs in strict-check mode (env: `OMNIGRAPH_UPDATE_OPENAPI`).
|
||||
|
||||
## Examples & benches
|
||||
|
||||
|
|
@ -79,7 +79,7 @@ Locally, set `OMNIGRAPH_S3_TEST_BUCKET` (and the usual `AWS_*` vars including `A
|
|||
|
||||
## Coverage tooling — what's missing
|
||||
|
||||
There is **no** coverage tooling in the repo today: no `tarpaulin.toml`, no `codecov.yml`, no coverage CI step. If you want to know whether your change is covered, the answer comes from reading and running the relevant integration tests, not from a tool.
|
||||
There is **no** coverage tooling in the repository today: no `tarpaulin.toml`, no `codecov.yml`, no coverage CI step. If you want to know whether your change is covered, the answer comes from reading and running the relevant integration tests, not from a tool.
|
||||
|
||||
If introducing coverage tooling is in scope for your task, the natural first step is `cargo-llvm-cov` wired into a separate CI job, and a per-crate threshold rather than a global one.
|
||||
|
||||
|
|
@ -89,7 +89,7 @@ If introducing coverage tooling is in scope for your task, the natural first ste
|
|||
|
||||
How to check:
|
||||
|
||||
1. **Map the change to an area** — use the engine integration-test table above (`branching.rs`, `runs.rs`, `search.rs`, etc.). The filename usually names the area.
|
||||
1. **Map the change to an area** — use the engine integration-test table above (`branching.rs`, `writes.rs`, `search.rs`, etc.). The filename usually names the area.
|
||||
2. **Open the file and skim every test fn name.** Test fn names are the index — read them all, not just the first few.
|
||||
3. **Grep for the symbol or path you're changing.** `rg <FunctionName>` or `rg <enum_variant>` across all `tests/` directories surfaces existing coverage you might miss.
|
||||
4. **Decide one of three outcomes**, in this order of preference:
|
||||
|
|
@ -97,7 +97,7 @@ How to check:
|
|||
- *Existing test covers the area but not your case* → **add an assertion or a fixture row to the existing test**, don't write a new function with `init_and_load()` again.
|
||||
- *No existing coverage in any test file* → only then write a new test; put it in the file that owns the area, or open a new file only if the area itself is new.
|
||||
|
||||
Three duplicated `init_and_load() → run_query → assert_eq` blocks where one parameterized test would do is the most common form of test rot in this repo. Don't add to it.
|
||||
Three duplicated `init_and_load() → run_query → assert_eq` blocks where one parameterized test would do is the most common form of test rot in this repository. Don't add to it.
|
||||
|
||||
## Before-every-task checklist
|
||||
|
||||
|
|
@ -106,7 +106,7 @@ When you pick up any change, walk through this:
|
|||
1. **Find existing coverage** (per the principle above). Don't just look at the first test file by name — grep for the symbol you're touching across every crate's `tests/`.
|
||||
2. **Run those tests locally before editing.** `cargo test --workspace --locked` for the broad pass; `-p <crate> --test <file>` for a focused loop. Confirm a clean baseline.
|
||||
3. **Decide extend-vs-new** explicitly. If you can extend an existing test (assertion, fixture row, parameterization), do that. Only add a new test fn or new file if no existing one owns the area.
|
||||
4. **Reuse the helpers.** `init_and_load()`, fixture files, the CLI `support` harness — re-use them. Don't bootstrap a fresh repo by hand if a helper exists.
|
||||
4. **Reuse the helpers.** `init_and_load()`, fixture files, the CLI `support` harness — re-use them. Don't bootstrap a fresh graph by hand if a helper exists.
|
||||
5. **Mind the boundary.** Per [docs/dev/invariants.md](invariants.md), test at the layer the change lives at — planner-level changes deserve planner-level tests, not just end-to-end.
|
||||
6. **For substrate-touching changes** (Lance behavior), reach for `failpoints` or fixture-driven scenarios, not stubbed-out mocks.
|
||||
7. **For server / API changes**, confirm the OpenAPI regeneration happens in `openapi.rs` and that the diff lands in `openapi.json`.
|
||||
|
|
|
|||
|
|
@ -1,7 +1,10 @@
|
|||
# Runs — REMOVED (MR-771)
|
||||
# Direct-Publish Write Path
|
||||
|
||||
The Run state machine and `__run__<id>` staging branches were removed in
|
||||
MR-771. `mutate_as` and `load` now write **directly to the target table**
|
||||
> History: the Run state machine and `__run__<id>` staging branches were
|
||||
> removed in MR-771 (shipped v0.4.0). Writes now go directly to the target
|
||||
> table; this document specifies that direct-publish path.
|
||||
|
||||
`mutate_as` and `load` write **directly to the target table**
|
||||
and call `ManifestBatchPublisher::publish` once at the end with
|
||||
`expected_table_versions` (the per-table manifest versions captured before
|
||||
the first write). Cross-table OCC is enforced inside the publisher; the
|
||||
Loading…
Add table
Add a link
Reference in a new issue