docs(rfc-003): correct-by-construction fixes from PR review

Fold the valid #157 review findings into the blueprint as by-design fixes so a
from-scratch build cannot reintroduce them:

- B.3.1: host/origin policy derived from the server bind + config (loopback ->
  loopback hosts; non-loopback -> configured public host, else Host-allowlist
  off with bearer + Origin as the controls). Closes the loopback-only-default
  footgun that 403s every remote client before bearer auth.
- B.6/B.7: stored-query params nested under a `params` object (mirrors
  POST /queries/{name}), so the branch/snapshot knobs cannot collide with a
  query param; mutation tools omit `snapshot` + additionalProperties:false, so
  mutation-against-a-snapshot is unrepresentable. Vector `dim` made intrinsic to
  the kind, killing the unwrap_or(0) zero-length-array schema.
- B.8: one canonical classify(Result<_, ApiError>) for tool errors (5xx ->
  JSON-RPC, 4xx/409 -> isError), no second mapper to drift; list visibility uses
  the call-path default-branch authorization, not a branch:None probe that hid
  branch-scoped-grant tools.
- Reconciled the stale Summary count and the section-5-era Rollout/Testing
  sections (banners pointing to B.12/B.4/B.13).
This commit is contained in:
Ragnor Comerford 2026-06-11 10:46:21 +02:00
parent 3009564437
commit 2ade97f05f
No known key found for this signature in database

View file

@ -11,7 +11,7 @@
Add a first-class **MCP (Model Context Protocol) server surface to `omnigraph-server`**, exposed over **Streamable HTTP**, that projects the server's operations as MCP tools and resources for LLM clients (Claude Code/Desktop/web, Cursor, etc.). Two populations of tools share one projection path:
1. **Built-in operational tools** — parity with the existing `@modernrelay/omnigraph-mcp` stdio package's **13 tools** (`health`, `snapshot`, `read`, `schema_get`, `branches_list`, `commits_list`, `commits_get`, `change`, `ingest`, `branches_create`, `branches_delete`, `branches_merge`, `schema_apply`) and its **2 resources** (`omnigraph://schema`, `omnigraph://branches`). (Server-scoped graph discovery — a `graphs_list` tool / `omnigraph://graphs` resource — was considered but **dropped from MCP**; it stays REST-only via `GET /graphs`. See [B.10](#b10-routing-resolved).)
1. **Built-in operational tools** — parity with the existing `@modernrelay/omnigraph-mcp` stdio package's operational tool set (`health`, `snapshot`, `schema_get`, `branches_list`, `commits_list`, `commits_get`, ad-hoc read/write, `ingest`, branch ops, `schema_apply`) and its core resources (`omnigraph://schema`, `omnigraph://branches`). (Exact counts moved over time — the package is now at 16 tools / ~9 resources; see [Relationship](#relationship-to-the-modernrelayomnigraph-mcp-stdio-package). The authoritative as-built catalog is [B.5](#b5-tool-catalog-13-built-ins--cedar-mapping): 13 tools.) (Server-scoped graph discovery — a `graphs_list` tool / `omnigraph://graphs` resource — was considered but **dropped from MCP**; it stays REST-only via `GET /graphs`. See [B.10](#b10-routing-resolved).)
2. **Dynamic stored-query tools** — one MCP tool per `mcp.expose: true` entry in the `queries:` registry (MR-969 / #128), with parameters typed from the `.gq` declaration via the shipped `query_catalog_entry` / `param_descriptor` projection.
Every tool is **authorized by the server's existing Cedar policy engine**. The MCP layer never implements its own authentication: it consumes an **already-resolved `ResolvedActor`** from the server's bearer middleware (`require_bearer_auth` today; the `TokenVerifier` seam when MR-956 lands), so the **same MCP endpoint serves on-prem (static or customer-OIDC tokens) and our cloud (WorkOS OAuth) by configuration only**. Cloud OAuth is an additive layer (RFC 9728 protected-resource metadata) that slots in with zero MCP changes.
@ -57,10 +57,15 @@ pub trait McpBackend: Clone + Send + Sync + 'static {
### B.3 Transport (lives in `omnigraph-mcp`)
- A generic `struct McpService<B>` implements rmcp's `ServerHandler`, delegating each method to `B` after extracting `&Parts` from `ctx.extensions` once (missing → `McpError::internal_error`). `get_info → backend.server_info()`; `initialize`/`ping` use rmcp's defaults.
- `pub fn mcp_router<B: McpBackend>(backend: B, body_limit: usize) -> axum::Router`:
- `let svc = StreamableHttpService::new(move || Ok(McpService::new(backend.clone())), Arc::new(LocalSessionManager::default()), config)` with `config = StreamableHttpServerConfig::default().with_stateful_mode(false).with_json_response(true)` (`StreamableHttpServerConfig` is `#[non_exhaustive]` — build from `Default`, mutate via the `with_*` setters; keep the loopback `allowed_hosts` default).
- Return `Router::new().route_service("/mcp", svc).layer(tower_http::limit::RequestBodyLimitLayer::new(body_limit))`. rmcp reads the body directly (not via an axum extractor), so axum's `DefaultBodyLimit` does **not** bound `/mcp`; the tower-http layer does.
- **Stateless mode delivers these conformance MUSTs for free (verified against rmcp 1.7 source):** `GET`/`DELETE /mcp``405` (with `Allow`); a disallowed `Host`/`Origin``403` (loopback hosts by default — DNS-rebind guard); `MCP-Protocol-Version``400` on unsupported, default `2025-03-26` when absent. **No conformance middleware is needed** (the §5.6 "honour the header" footnote is satisfied by rmcp).
- `pub fn mcp_router<B: McpBackend>(backend: B, body_limit: usize, hosts: McpHostPolicy) -> axum::Router`:
- `config = StreamableHttpServerConfig::default().with_stateful_mode(false).with_json_response(true).with_allowed_hosts(hosts.allowed_hosts).with_allowed_origins(hosts.allowed_origins)` (`StreamableHttpServerConfig` is `#[non_exhaustive]` — build from `Default`, mutate via the `with_*` setters). **Do not keep rmcp's fixed loopback `allowed_hosts` default** — it `403`s every non-loopback client *before* bearer auth (a deploy footgun); derive the policy per B.3.1.
- `let svc = StreamableHttpService::new(move || Ok(McpService::new(backend.clone())), Arc::new(LocalSessionManager::default()), config)`; return `Router::new().route_service("/mcp", svc).layer(tower_http::limit::RequestBodyLimitLayer::new(body_limit))`. rmcp reads the body directly (not via an axum extractor), so axum's `DefaultBodyLimit` does **not** bound `/mcp`; the tower-http layer does.
- **Stateless mode delivers these conformance MUSTs for free (verified against rmcp 1.7 source):** `GET`/`DELETE /mcp``405` (with `Allow`); a disallowed `Host`/`Origin``403` (per the `McpHostPolicy`, B.3.1); `MCP-Protocol-Version``400` on unsupported, default `2025-03-26` when absent. **No conformance middleware is needed** (the §5.6 "honour the header" footnote is satisfied by rmcp).
**B.3.1 Host/Origin policy — derived from deployment, never a fixed default.** The `Host` header is not a security boundary for a bearer-authed server (the token is); rmcp's loopback `allowed_hosts` default exists to protect *local* dev servers from browser DNS-rebinding, and for any non-loopback deploy it just rejects legitimate clients. So `omnigraph-server` computes `McpHostPolicy` once at startup from what it already knows (the `--bind` address + any configured public host) and passes it in:
- **Loopback bind** (`127.0.0.1` / `::1` / `localhost`) → loopback `allowed_hosts` (rebind protection genuinely matters locally).
- **Non-loopback bind** (`0.0.0.0` / a public IP) → the operator chose remote reachability: allow the configured public host(s) if set, else **disable Host-allowlisting** (accept any `Host`) and log it — bearer auth + `Origin` validation are the real controls.
- `allowed_origins` is the actual browser-attack control: default empty (off — non-browser MCP clients send no `Origin`), configurable for browser-based clients. The class "one fixed default that is wrong for some deployments" is closed by making the policy a function of the deployment.
- **Client/test obligations rmcp enforces:** the request must carry `Accept: application/json, text/event-stream` (both), `Content-Type: application/json`, and a `Host` header. rmcp negotiates `protocolVersion` (a recent client sees `2025-11-25`).
### B.4 Server-side backend (lives in `omnigraph-server`)
@ -94,7 +99,7 @@ Annotations (rmcp defaults `destructiveHint` **and** `openWorldHint` to **true**
### B.6 Stored-query tools
One MCP tool per `mcp.expose` registry entry (named by its `tool_name`), projected from the **same `api::query_catalog_entry`** the `GET /queries` catalog uses; parameters → JSON Schema per B.7. The outer gate is the **coarse `InvokeQuery`** action (all exposed queries on the graph, or none — per-query scope is deferred, see B.13); the call then runs the registry source through `run_query` / `run_mutate`, whose inner `Read` / `Change` gate the body — the **double-gate of `POST /queries/{name}`**. Skip a stored tool whose name collides with a built-in (built-ins win, so the catalog never has a duplicate tool name).
One MCP tool per `mcp.expose` registry entry (named by its `tool_name`), projected from the **same `api::query_catalog_entry`** the `GET /queries` catalog uses; parameters → JSON Schema per B.7. The outer gate is the **coarse `InvokeQuery`** action (all exposed queries on the graph, or none — per-query scope is deferred, see B.13); the call then runs the registry source through `run_query` / `run_mutate`, whose inner `Read` / `Change` gate the body — the **double-gate of `POST /queries/{name}`**. Skip a stored tool whose name collides with a built-in (built-ins win, so the catalog never has a duplicate tool name). Dispatch reads the query params from `args.params` and the knobs from `args.branch` / `args.snapshot` (B.7's nested envelope), so a query parameter named `branch`/`snapshot` cannot be shadowed.
### B.7 `ParamKind` → JSON Schema (stored-query params)
@ -104,16 +109,16 @@ One MCP tool per `mcp.expose` registry entry (named by its `tool_name`), project
| BigInt (i64/u64) | `{"type":"string","pattern":"^-?\\d+$"}` (JSON numbers lose precision >2⁵³) |
| Date / DateTime | `{"type":"string","format":"date"}` / `{"type":"string","format":"date-time"}` |
| Blob | `{"type":"string","contentEncoding":"base64"}` |
| Vector | `{"type":"array","items":{"type":"number"},"minItems":dim,"maxItems":dim}` (from `vector_dim`) |
| Vector | `{"type":"array","items":{"type":"number"},"minItems":dim,"maxItems":dim}` **`dim` is intrinsic to the kind** (`ParamKind::Vector { dim: u32 }`, never an `Option`), so a dimensionless vector is unrepresentable and the `unwrap_or(0)` footgun (a 0-length-array schema that rejects all input) cannot exist; if a `dim` is ever genuinely unknown, **omit** `minItems`/`maxItems` rather than emit `0`. |
| List | `{"type":"array","items":<item_kind schema>}` (scalar items only) |
`nullable == false` → in `required`. Add `branch` (and `snapshot` for reads) as optional invocation knobs. Fold `instruction` into the description.
**Envelope (correct by construction).** The tool's `input_schema` **nests the query params under a `params` object**, mirroring the `POST /queries/{name}` request: `{ "params": { <the query's typed params> }, "branch"?: string, "snapshot"?: string }`. The invocation knobs and the query's own parameters then live in **separate namespaces and cannot collide** — a query param named `branch`/`snapshot` is harmless. `nullable == false` params go in the inner `params.required`. For a **mutation** tool, omit `snapshot` from the schema and set `additionalProperties: false`, so "mutation against a snapshot" is *unrepresentable* (the REST path `400`s it; the MCP schema makes it impossible). Fold `instruction` into the description.
### B.8 `list` / `call` semantics
- **`list_tools` / `list_resources`** are Cedar-filtered: for each tool/resource, evaluate `authorize(actor, policy_for(gate), { action, branch: None })`; emit only `Allowed`; an `Err` (operational, e.g. policy-engine error) propagates as a JSON-RPC error; a `Denied` simply hides. Stored-query tools list as a group iff the coarse `InvokeQuery` is allowed.
- **`call_tool`:** an unknown tool **or** a denied tool returns the **identical** `unknown tool: <name>` (`-32602`) so the catalog can't be probed without the grant. A business/validation/engine failure (4xx/409) → `CallToolResult { isError: true }` (so the model self-corrects — the 2025-11-25 SEP-1303 split); an operational 5xx → JSON-RPC error. A missing/bad bearer is an HTTP `401` at the boundary *before* rmcp.
- **Branch-scope caveat (R7):** list visibility evaluates with `branch: None`; the actual `do_*` / `run_*` re-authorizes against the real branch, so a branch-scoped policy may list a tool yet deny a specific-branch call. `tools/call` is authoritative.
- **`list_tools` / `list_resources`** are Cedar-filtered. For each tool/resource, run the **same authorization the call path runs, with default args** (the default branch `main`) — **not** a separate `branch: None` probe (a `branch: None` request matches no `branch_scope` rule, so it would *hide* tools the actor can actually call on a scoped branch). Sharing the authorization input makes list and call agree by construction for the common case. Emit only `Allowed`; an `Err` (operational) propagates as a JSON-RPC error; a `Denied` hides. Stored-query tools list as a group iff the coarse `InvokeQuery` is allowed.
- **`call_tool`:** an unknown tool **or** a denied tool returns the **identical** `unknown tool: <name>` (`-32602`) so the catalog can't be probed without the grant. Route **every** `do_*`/`run_*` result through **one canonical `classify(Result<_, ApiError>) -> Result<CallToolResult, McpError>`** — the single source of truth, used at every dispatch site so the mapping can't drift: `Ok` → success JSON content; `Err` 4xx/409 → `CallToolResult { isError: true }` (the model self-corrects, per the 2025-11-25 SEP-1303 split); `Err` operational 5xx → JSON-RPC error. (Do not keep two mappers — a single `classify` replaces the `api_error_to_tool` / `api_operational_error` split that drifted in the reference impl.) A missing/bad bearer is an HTTP `401` at the boundary *before* rmcp.
- **Branch-scope residual (R7):** because visibility uses the default branch, a grant scoped to a *non-default* branch is an **over-show** — the tool lists and `call_tool` is the authoritative gate. Over-showing is the safe direction (the agent sees the tool, tries it, gets a clear deny); under-showing a callable tool is the failure mode the default-branch probe avoids.
### B.9 Resources
@ -373,6 +378,8 @@ Once parity lands, **collapse to one implementation**: the in-server MCP is cano
## Testing
> *Superseded by [B.12](#b12-tests--verification), which is the authoritative test plan. The list below is the original §5-era sketch; ignore its references to `graphs_list`, `read`/`change` aliases, and the per-query `invoke_query` test (all resolved/dropped — see B.13).*
- **Protocol conformance:** `initialize` handshake + advertised capabilities; `tools/list` shape; `tools/call` happy path; JSON-RPC error envelopes (`-32601` unknown method, `-32602` invalid params / unknown tool); `resources/list` + `resources/read`.
- **Cedar filtering (coarse, today):** an actor with `allow InvokeQuery` + `deny Read/Change` sees *all* exposed stored queries but **not** `query`/`mutate`/`schema_get`; `tools/call query` returns masked "unknown tool"; an admin sees the full catalog.
- **Cedar filtering (per-query, gated on PR 0b):** actor scoped to `InvokeQuery [find_user]` sees *only* `find_user`; `tools/call list_orders` masks. **This test ships with PR 0b**, not PR 1 — it cannot pass against the coarse action.
@ -388,6 +395,8 @@ Once parity lands, **collapse to one implementation**: the in-server MCP is cano
## Rollout — phased by risk
> *Superseded by the Blueprint's build order ([B.4](#b4-server-side-backend-lives-in-omnigraph-server) → crate ([B.1](#b1-crate-architecture--dependency-direction)[B.3](#b3-transport-lives-in-omnigraph-mcp)) → backend → tools/stored/resources). The PR phases below are the original §5-era plan; they still mention `graphs_list`, `read`/`change` aliases, and the `mcp.allow_adhoc` switch, all of which the locked decisions ([B.13](#b13-decisions-locked)) drop. PR 0b (per-query `invoke_query` scope) remains the one deferred follow-up.*
- **PR 0a — extract the reusable invoke path (small).** The coarse `invoke_query` gate + 404 denial-masking are **already shipped** in `server_invoke_query`. Extract the read/mutate dispatch into `invoke_stored_query(handle, name, params, branch/snapshot, actor)` so MCP `tools/call` and the HTTP route share one path. No behaviour change. *(Replaces the previous draft's "PR 0 — wire the gate", which was already done.)*
- **PR 0b — per-query `invoke_query` scope (the safety prerequisite).** Add a query-name dimension to `PolicyRequest` + the Cedar schema (rfc-001's intended design), wire it at `POST /queries/{name}` and in the stored-query `McpTool::authorization`. Independently useful (the `allow InvokeQuery [find_user]` policy). **Gates the per-query Cedar-filtering test and §5.9's recommended agent policy.**
- **PR 1 — MCP transport + read-only parity + stored-query reads.** Endpoint(s), `initialize`/`tools/list`/`tools/call`/`resources/*`, the `McpTool` registry, Cedar-filtered listing, the read-only built-ins (`health`, `graphs_list`, `snapshot`, `read`/`query`, `schema_get`, `branches_list`, `commits_*`) + resources + stored-query *reads*. All auth-agnostic.