omnigraph/docs/dev/rfc-013-tenancy-cells.md
Ragnor Comerford c43b81d318
docs(rfc-013): add reader/writer scaling (§5.8) — split read fleet vs write tier
The substrate makes read/write scaling asymmetric: reads are object-store-backed,
snapshot-isolated, stateless -> horizontal to N replicas with zero coordination;
writes are serialized per (table,branch) by one-winner manifest CAS -> scale by
partitioning (branches/graphs/cells), with a single active coordinator per
(cell,graph,branch). Adds the CQRS deployment split (read fleet / write tier /
maintenance / heavy reads), read-your-write via commit_id/snapshot_id pinning,
and gates pooled WRITES (not reads) on closing the cross-process-CAS gap.
2026-06-16 18:53:01 +02:00

# RFC-013: Tenancy model — cluster-as-tenant cells, pooled compute
**Status:** Proposed — general architecture (server topology, identity, deployment).
**Date:** 2026-06-16
**Audience:** server / cluster / platform maintainers.
**Builds on:** [rfc-005-server-cluster-boot.md](rfc-005-server-cluster-boot.md)
(cluster-only boot, applied-revision serving), [rfc-011-cli-refactoring.md](rfc-011-cli-refactoring.md)
(the store/cluster/server ontology), [rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md)
(ledger/recovery/approvals — what makes a cell self-contained), [rfc-007-operator-config.md](rfc-007-operator-config.md)
(keyed credentials / secret resolution).
**Consumed by:** [rfc-003-mcp-server-surface.md](rfc-003-mcp-server-surface.md) (the MCP
surface is one tenant-scoped consumer of this model, not its driver — see §6).
**Target release:** v0.9.x (cell refactor) → cloud GA (pooled tier + WorkOS/OIDC).
---
## 1. Summary
This RFC fixes the **tenancy model** for OmniGraph as a server/platform concern —
independent of any one surface (HTTP data plane, MCP, CLI). It resolves an
ambiguity that currently sits half-built in the code: the server is **cluster-only**
(one cluster per process — [rfc-005](rfc-005-server-cluster-boot.md)), yet the
identity layer carries **pooled multi-tenant scaffolding** (`GraphKey.tenant_id`,
`ResolvedActor.tenant_id`, "Cloud will set `Some(...)` from the OAuth `org_id`
claim"). Those two point at *different* tenancy architectures. We pick one:
> **The cluster is the tenant is the cell.** A cell is the unit of data isolation —
> its own storage root, catalog, Cedar policy bundle, and token source. Isolation is
> **structural** (by deployment), never a row-level `tenant_id` filter on shared
> data. **Density comes from one server process hosting many cells**, not from
> pooling tenants into one graph.
The one structural change: lift today's per-cluster server runtime into a
**`CellRuntime`** value and let the server hold a map of them, resolved per request
by host (or path) **before** authentication. The entire per-graph stack beneath —
handlers, Cedar enforcement, stored queries, the RFC-003 MCP backend — is unchanged;
it gains one outer dimension, not a rewrite. Identity maps **WorkOS Organization →
cell, 1:1**, with a per-cell OAuth audience, so a token for one tenant cannot be
verified against another's endpoint. The result is best-practice **tiered isolation**
(silo the data, pool the compute) on one binary: dedicated/on-prem (1 process : 1
cell) and pooled cloud (1 process : N cells) are the *same code*, different topology.
## 2. Goals
- **Decide the tenancy model** and make the code stop implying two.
- **Isolation by construction:** a tenant cannot reach or enumerate another tenant's
data even if a Cedar policy is missing or a handler has a bug.
- **Density without row-level pooling:** amortize compute across tenants while keeping
each tenant's storage, catalog, policy, and tokens fully separate.
- **One binary, tiered topology:** dedicated, pooled, and on-prem are deployment
choices, not forks. No cloud-only correctness.
- **Additive to the substrate:** no change to the manifest/commit/Cedar invariants;
the data plane and MCP surface ride on top unchanged.
## 3. Non-Goals
- **Row-level (pooled-into-one-graph) multi-tenancy.** OmniGraph has no engine-level
row security; pooling tenants into a shared graph would make isolation depend on a
per-query filter — the highest-risk pattern, explicitly rejected (§4).
- **A new in-process `tenant_id` authorization dimension.** Cedar stays per-graph /
per-cell; the cell boundary does the tenant isolation (§5.5).
- **Hosting an OAuth Authorization Server.** Each cell is a Resource Server; the AS is
WorkOS or the customer's IdP (§5.4), same posture as [rfc-003](rfc-003-mcp-server-surface.md) §3.
- **Cross-cell queries / cross-tenant joins.** A cell is a hard boundary by design.
## 4. The tenancy decision (and why)
The industry framing is **silo / pool / bridge**, refined to **tiered + cell-based**:
| Model | Isolation | Density | Fit |
|---|---|---|---|
| **Silo** — infra per tenant | strongest (structural) | worst | few, large, regulated |
| **Pool** — shared infra, logical `tenant_id` | weakest (code-dependent) | best | many small — *only with engine RLS* |
| **Bridge** — silo some layers, pool others | tunable | tunable | most real products |
**Why not pool the data tier.** A pooled data store puts isolation in one place:
every read/write must carry the tenant filter; one miss is a cross-tenant breach, and
leaked customer data cannot be rotated. Postgres has Row-Level Security precisely to
move that check into the engine. **OmniGraph has no RLS equivalent and no row-level
tenant filtering**, so pooling tenants into a shared graph would adopt the most
dangerous isolation pattern without the substrate support that makes it safe. That is
disqualifying for a data substrate.
**Therefore: silo the data, pool the compute (the bridge model, biased to
storage-silo because it is a database).** The isolation unit is the **cluster**, which
already gives separate storage root + catalog + policy bundle + token source. A
cluster is also a natural **cell** (a bounded blast-radius unit, the cell-based
architecture pattern): a fault or breach is contained to one tenant, never the fleet.
Density is then a *compute* concern — one process serving many cells — not a data
concern. This is the model the code is already ~80% built for; the gap is
compute-density routing, not isolation.
## 5. Design
### 5.1 The Cell abstraction
A **cell** is exactly today's whole single-cluster server runtime, lifted into a
value. The fields that are per-cluster today (registry, token table, server policy,
boot source) move off `AppState` onto a `CellRuntime`:
```rust
// new: a cell == one cluster's runtime == one tenant
pub struct CellRuntime {
    pub cell_id: CellId,                          // == cluster id == tenant id (audit/log key, NOT an isolation check)
    pub registry: Arc<GraphRegistry>,             // this cell's graphs (GraphHandle{ engine, policy, queries })
    pub auth: CellAuth,                           // per-cell token source (§5.4)
    pub server_policy: Option<Arc<PolicyEngine>>, // this cell's server-scoped Cedar (GraphList, …)
    pub config_path: PathBuf,                     // this cell's cluster boot source (applied revision)
}
```
Everything inside a cell — `GraphHandle`, per-graph Cedar bundles, stored-query
registries, `GraphKey { tenant_id, graph_id }` — is **unchanged**. Graph ids are
unique *within a cell*, which is all that is needed; there is no global graph
namespace and no route↔key reconciliation problem because the cell is resolved first.
### 5.2 Server: one process, many cells
```rust
pub struct AppState {
    pub cells: Arc<CellRegistry>,          // host/prefix -> Arc<CellRuntime> (the ONLY new top-level field)
    pub workload: Arc<WorkloadController>, // admission control (see Deferred: per-cell fairness)
}

// today's single `Multi { graphs, config_path, server_policy }` (rfc-005) becomes ONE cell;
// the server boots N of them.
pub enum ServerConfigMode {
    MultiCluster { cells: Vec<CellBootConfig> },
}
```
Boot opens each cell's applied revision (the existing rfc-005 path, run N times,
bounded-concurrency) and inserts `Arc<CellRuntime>` into the `CellRegistry` keyed by
its host (or path prefix). A **dedicated/on-prem** deployment boots a one-entry map; a
**pooled** deployment boots many.
### 5.3 Routing & middleware — one new outer hop
The existing `build_app` nests per-graph routes under `/graphs/{graph_id}` with two
`route_layer`s (`resolve_graph_handle` inner, `require_bearer_auth` outer). We add
**one outermost layer**, `resolve_cell`, and rebind two existing layers to read the
cell instead of `AppState`:
```rust
let per_graph_protected = Router::new()
    .route("/snapshot", get(server_snapshot))
    // … /query /mutate /queries /schema /load /branches /commits …
    .merge(mcp::mcp_router(state.clone())) // RFC-003, unchanged
    .route_layer(from_fn_with_state(state.clone(), resolve_graph_handle)) // inner: reads CELL.registry (was AppState.routing)
    .route_layer(from_fn_with_state(state.clone(), require_bearer_auth))  // mid: reads CELL.auth (was AppState.bearer_tokens)
    .route_layer(from_fn_with_state(state.clone(), resolve_cell));        // OUTER: injects Arc<CellRuntime> ← NEW
```
Request lifecycle:
```
resolve_cell                     host/prefix → Arc<CellRuntime> (404 unknown cell)      ← NEW, outermost
└─ require_bearer_auth           validate token vs CELL.auth → ResolvedActor (401)      ← now cell-scoped
   └─ resolve_graph_handle       {graph_id} in CELL.registry → Arc<GraphHandle> (404)   ← now cell-scoped
      └─ handler / MCP           run_query · run_mutate · /mcp · Cedar enforce(...)     ← UNCHANGED
```
The only handler-adjacent edits: `require_bearer_auth` reads `cell.auth`,
`resolve_graph_handle` / `server_graphs_list` read `cell.registry`. The isolation is
in the ordering: **cell-A's token table and registry are unreachable from a cell-B
request** because the cell is resolved first and everything downstream reads *that*
cell.
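The ordering argument above can be sketched with plain functions instead of middleware. This is a minimal model under stated assumptions: `CellRuntime` here is a simplified stand-in (a token map and a list of graph ids), and `Reject`, `Resolved`, `resolve_request`, and `demo_registry` are hypothetical names introduced only for illustration. It is not the server's actual layer code.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical, simplified stand-ins for the RFC's types (not the real definitions).
pub struct CellRuntime {
    pub cell_id: String,
    pub tokens: HashMap<String, String>, // bearer token -> actor id (stands in for CellAuth::Static)
    pub graphs: Vec<String>,             // graph ids (stands in for GraphRegistry)
}

pub struct CellRegistry {
    by_host: HashMap<String, Arc<CellRuntime>>,
}

#[derive(Debug, PartialEq)]
pub enum Reject {
    UnknownCell,  // 404
    Unauthorized, // 401
    UnknownGraph, // 404
}

impl Reject {
    pub fn status(&self) -> u16 {
        match self {
            Reject::Unauthorized => 401,
            Reject::UnknownCell | Reject::UnknownGraph => 404,
        }
    }
}

#[derive(Debug, PartialEq)]
pub struct Resolved {
    pub cell_id: String,
    pub actor_id: String,
    pub graph_id: String,
}

/// Mirrors the layer order: cell first, then auth, then graph.
/// Every lookup after the first reads only the resolved cell.
pub fn resolve_request(
    reg: &CellRegistry,
    host: &str,
    bearer: &str,
    graph_id: &str,
) -> Result<Resolved, Reject> {
    let cell = reg.by_host.get(host).ok_or(Reject::UnknownCell)?;
    let actor_id = cell.tokens.get(bearer).ok_or(Reject::Unauthorized)?;
    if !cell.graphs.iter().any(|g| g == graph_id) {
        return Err(Reject::UnknownGraph);
    }
    Ok(Resolved {
        cell_id: cell.cell_id.clone(),
        actor_id: actor_id.clone(),
        graph_id: graph_id.to_string(),
    })
}

/// Two-cell fixture with made-up hosts, tokens, and graph ids.
pub fn demo_registry() -> CellRegistry {
    let mut by_host = HashMap::new();
    for (host, cell_id, token, actor, graph) in [
        ("tenant-a.example.com", "cell-a", "tok-a", "alice", "orders"),
        ("tenant-b.example.com", "cell-b", "tok-b", "bob", "orders"),
    ] {
        let cell = CellRuntime {
            cell_id: cell_id.to_string(),
            tokens: HashMap::from([(token.to_string(), actor.to_string())]),
            graphs: vec![graph.to_string()],
        };
        by_host.insert(host.to_string(), Arc::new(cell));
    }
    CellRegistry { by_host }
}
```

In this model a cell-A token presented on cell-B's host is rejected at the auth step, because the token lookup never sees cell-A's table.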
**Cell selector — host-based (recommended) vs path-based:**
| | Host-based | Path-based |
|---|---|---|
| Selector | `Host: tenant-a.omnigraph.example.com` | `/clusters/{cell_id}/…` |
| OAuth audience | the per-tenant origin (natural RFC 8707 resource) | `…/clusters/{cell_id}` |
| Origin/CORS | isolated per subdomain (free) | shared origin |
| DNS/cert | wildcard `*.example.com` → pooled fleet | one host |
| `resolve_cell` | `cells.by_host(host)` | `cells.by_prefix(first_segment)` |
Host-based wins for cloud (per-tenant audience, Origin, and cookie boundaries fall out
for free). Path-based is the simple on-prem/dev shape. `resolve_cell` abstracts which.
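One way `resolve_cell` could abstract over the two selectors is to reduce each to a lookup key before touching the registry. The sketch below is an assumption about shape only: `CellSelector` and `key` are hypothetical names, port stripping is naive (it does not handle IPv6 literals), and host case normalization is left out.

```rust
/// Hypothetical selector mode; the RFC only states that `resolve_cell` abstracts which one is used.
pub enum CellSelector {
    Host,
    PathPrefix,
}

impl CellSelector {
    /// Returns the registry lookup key, or None if the request carries no usable selector.
    pub fn key<'a>(&self, host_header: &'a str, path: &'a str) -> Option<&'a str> {
        match self {
            CellSelector::Host => {
                // Drop an optional ":port" suffix (naive: not IPv6-aware).
                let host = host_header.split(':').next()?;
                if host.is_empty() { None } else { Some(host) }
            }
            CellSelector::PathPrefix => {
                // Expect "/clusters/{cell_id}/…" and return the cell_id segment.
                let mut segments = path.split('/').filter(|s| !s.is_empty());
                match (segments.next(), segments.next()) {
                    (Some("clusters"), Some(cell_id)) => Some(cell_id),
                    _ => None,
                }
            }
        }
    }
}
```

The returned key would then feed `cells.by_host(...)` or `cells.by_prefix(...)` from the table above.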
### 5.4 Identity & auth — per cell, two modes
```rust
pub enum CellAuth {
    Static(Arc<[(BearerTokenHash, Arc<str>)]>), // on-prem / self-host / dev; today's path, per cell
    Oidc(Arc<dyn TokenVerifier>),               // WorkOS (or customer IdP) for this org; cloud
}
```
- **WorkOS Organization → cell, 1:1.** The cell's `Oidc` verifier is configured with
*that org's* issuer + audience. A token minted for `tenant-a`'s audience **fails
verification at `tenant-b`** (wrong `aud`) — structural isolation that runs *before*
Cedar and is independent of policy completeness.
- **Same Resource-Server endpoint, mode by cell.** `require_bearer_auth` dispatches on
`cell.auth`. Static and OIDC cells coexist in one process. This is the
`TokenVerifier` seam already drafted in `identity.rs` ("RFC 0001 step 1 adds
`AuthSource::Oidc` when the `OidcJwtVerifier` ships"); WorkOS is one implementation.
- **`ResolvedActor` mapping:** `actor_id` ← `sub`; `tenant_id` ← the **cell id** (for
audit/log clarity, *not* as an isolation mechanism, since the endpoint has already
isolated the request); `scopes` ← the OAuth `scope`/roles claim; `source` ← `Oidc`/`Static`.
This repurposes `tenant_id` from vestigial pooled scaffolding into "which cell logged
this," which is honest. **Identity stays server-resolved, never client-set** (the
MR-731 invariant, now applied per cell).
- **Per-cell OAuth discovery:** each cell serves its own
`/.well-known/oauth-protected-resource` → that org's WorkOS AS, with the cell's
audience. Per-tenant PRM → per-tenant OAuth → per-tenant audience. (The RFC-003 §8
PRM config-gate for issue #59467 becomes a per-cell flag.)
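The per-cell audience check in the list above can be illustrated with already-decoded claims. This is a sketch, not the `TokenVerifier` trait (whose shape is still open, see §9): signature, expiry, and key handling are deliberately omitted, and `Claims`, `CellOidcConfig`, `ActorView`, `VerifyError`, and all URLs are hypothetical.

```rust
#[derive(Debug, PartialEq)]
pub enum VerifyError {
    WrongAudience,
    WrongIssuer,
}

/// Already-decoded token claims (signature and expiry checks are out of scope here).
pub struct Claims {
    pub iss: String,
    pub aud: String,
    pub sub: String,
}

/// Hypothetical per-cell OIDC settings: one issuer + audience per cell.
pub struct CellOidcConfig {
    pub cell_id: String,
    pub issuer: String,
    pub audience: String,
}

/// The subset of `ResolvedActor` this sketch fills in.
#[derive(Debug, PartialEq)]
pub struct ActorView {
    pub actor_id: String,  // from `sub`
    pub tenant_id: String, // the cell id, for audit/log only
}

impl CellOidcConfig {
    pub fn verify_claims(&self, claims: &Claims) -> Result<ActorView, VerifyError> {
        if claims.aud != self.audience {
            return Err(VerifyError::WrongAudience);
        }
        if claims.iss != self.issuer {
            return Err(VerifyError::WrongIssuer);
        }
        Ok(ActorView {
            actor_id: claims.sub.clone(),
            tenant_id: self.cell_id.clone(),
        })
    }
}
```

Note that `tenant_id` is written from the cell's own configuration, never from a token claim, which matches the "server-resolved, never client-set" rule.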
### 5.5 Authorization — Cedar stays per-graph / per-cell
The cell boundary already guarantees a cell-A actor never reaches cell-B's policy
engine, so **no tenant dimension is added to authorization**:
- `PolicyRequest { action, branch, target_branch }` and `ResourceScope`
(Graph / Branch / TargetBranch) — **unchanged**. The principal stays `actor_id`.
- `authorize`'s **default-deny-except-`Read`** becomes *safe*: "readable on missing
policy" now means the tenant's *own* graphs, not cross-tenant. The exact hazard that
would make this dangerous under pooled tenancy is structurally absent.
- `GET /graphs` reads `cell.registry`, so it enumerates only the tenant's own graphs
and storage URIs — no cross-tenant topology leak.
This is the payoff of cluster-as-tenant: the in-process tenant machinery a pooled
model would require (tenant-keyed routing, a tenant Cedar principal, a tenant-aware
deny default, a tenant-filtered enumeration) is **not built because it is not needed**.
### 5.6 Control plane — the one legitimately pooled component
A small **Cell Registry** holds *metadata only* (no tenant data):
```
org_id (WorkOS) ──▶ cell_id ──▶ { storage_root, issuer, audience, host, tier }
```
Onboarding a tenant is provisioning-as-code — the thing that makes silo *operable*
(automated, not N hand-built stacks):
```
1. WorkOS Organization created / detected.
2. `cluster apply` a NEW cell on a fresh storage root (own bucket/prefix), with the
org's schema.pg / queries / policy → ledger + recovery + approvals (rfc-004).
3. Register org_id → cell_id (+ issuer/audience/host) in the Cell Registry.
4. Cell goes live:
• dedicated tier → its own process boots that one cell (today's exact path).
• pooled tier → the fleet HOT-LOADS the cell into the CellRegistry map.
5. DNS: tenant-a.example.com → the pooled fleet (wildcard); the host selects the cell.
```
Step 4's pooled hot-load is the **one new runtime-mutation primitive**, and it is
deliberately **cell-granular, not graph-granular**: [rfc-011](rfc-011-cli-refactoring.md)
closes runtime *graph*-add inside a cluster (correct — it mutates a live registry),
but loading a **whole, independently-validated cell** is just "open a cluster" — its
own ledger/recovery/catalog, nothing in any other cell moves. Far safer than the thing
rfc-011 forbids. Eviction = drop the `Arc<CellRuntime>` from the map; in-flight
requests keep their `Arc`.
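The hot-load and eviction semantics described above follow directly from `Arc` ownership. A minimal sketch, assuming a `RwLock`-guarded map (the RFC does not specify the concurrency primitive) and a stripped-down `CellRuntime`; `hot_load`, `resolve`, and `evict` are hypothetical method names:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Simplified stand-in; the real CellRuntime carries registry, auth, and policy.
pub struct CellRuntime {
    pub cell_id: String,
}

#[derive(Default)]
pub struct CellRegistry {
    cells: RwLock<HashMap<String, Arc<CellRuntime>>>,
}

impl CellRegistry {
    /// Inserts a fully built cell. Returns false (and changes nothing) if the host is taken.
    pub fn hot_load(&self, host: &str, cell: CellRuntime) -> bool {
        let mut map = self.cells.write().unwrap();
        if map.contains_key(host) {
            return false;
        }
        map.insert(host.to_string(), Arc::new(cell));
        true
    }

    /// Each request clones the Arc, so it owns a reference for its whole lifetime.
    pub fn resolve(&self, host: &str) -> Option<Arc<CellRuntime>> {
        self.cells.read().unwrap().get(host).cloned()
    }

    /// Eviction only removes the map entry; requests already holding an Arc are unaffected.
    pub fn evict(&self, host: &str) -> bool {
        self.cells.write().unwrap().remove(host).is_some()
    }
}
```

Because the cell is built before it is inserted, a reader sees either no entry or a complete cell, and the runtime is freed only when the last in-flight request drops its `Arc`.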
### 5.7 Deployment tiers — same binary, different topology
| Tier | Topology | Mode | Use |
|---|---|---|---|
| **Dedicated** | 1 process : 1 cell | `MultiCluster { cells: [one] }` | enterprise / regulated / data-residency |
| **Pooled** | 1 process : N cells | `MultiCluster { cells: [many] }` | SMB / free / long tail |
| **On-prem** | 1 process : 1 cell, `Static` auth | `MultiCluster { cells: [one], Static }` | air-gapped / self-host |
A tenant graduates pooled → dedicated by moving its cell to its own process — **no data
migration** (the storage root does not move; the cell is already self-contained).
### 5.8 Workload scaling — readers vs writers
The substrate makes read and write scaling **asymmetric**, and the cell model inherits
and exploits that asymmetry. It is the deployment-shaping fact, so it is part of the
tenancy design, not an afterthought.
**Reads scale horizontally and statelessly.** The object store is the source of truth;
a read is snapshot-isolated and holds no shared mutable state, so any number of
processes can open a snapshot and serve reads with **zero coordination**. Add read
replicas (even per-region) freely. Per-cell independence means per-tenant read scaling
is independent. Limits are real but ordinary: object-store request throughput
(per-prefix S3 limits), index residency + CPU/RAM for *heavy* reads (vector ANN, FTS,
aggregations), and in-process cache warmth (a cold tenant in a pooled fleet pays a
cold-open). Freshness is the one catch — a replica's engine handle may lag live HEAD
([rfc-011](rfc-011-cli-refactoring.md) note), so the read path is eventual within the
refresh window unless pinned.
**Writes are serialized per `(table, branch)` and scale by partitioning, not by adding
writers.** A single graph's main-branch write throughput is bounded by commit latency
(object-store round-trips for stage + manifest CAS) and degrades under concurrency
(one-winner CAS → retry). You scale writes by *partitioning the write set*:
- more **branches** — different `(table, branch)` queues don't contend; bulk loads fan
out onto review branches and **merge** serializes the integration;
- more **graphs**, more **cells** — fully independent write paths.
Multiple replicas writing the *same* `(cell, graph, branch)` contend on manifest CAS —
the documented one-winner-CAS territory (invariants.md known gaps: "Local
`write_text_if_match` is not a cross-process CAS"; "recovery serialized against live
writers in-process only"). So horizontal write scaling on one branch needs a **single
active writer/coordinator**, not N racing replicas.
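The one-winner behaviour can be modelled with a version counter standing in for the manifest. This is an in-memory illustration only: `ManifestStore`, `publish_if_match`, and `commit_with_retry` are hypothetical names, and a `Mutex` plays the role that a conditional write would play against a real object store.

```rust
use std::sync::Mutex;

/// In-memory stand-in for one (table, branch) manifest head.
pub struct ManifestStore {
    head: Mutex<u64>,
}

#[derive(Debug, PartialEq)]
pub enum CasOutcome {
    Won { new_version: u64 },
    Lost { current: u64 },
}

impl ManifestStore {
    pub fn new() -> Self {
        ManifestStore { head: Mutex::new(0) }
    }

    pub fn head(&self) -> u64 {
        *self.head.lock().unwrap()
    }

    /// Publishes only if the head is still the version the writer staged against.
    pub fn publish_if_match(&self, expected: u64) -> CasOutcome {
        let mut head = self.head.lock().unwrap();
        if *head == expected {
            *head += 1;
            CasOutcome::Won { new_version: *head }
        } else {
            CasOutcome::Lost { current: *head }
        }
    }
}

/// Re-read, re-stage, retry. Returns (published version, attempts used).
pub fn commit_with_retry(store: &ManifestStore, max_attempts: u32) -> Option<(u64, u32)> {
    for attempt in 1..=max_attempts {
        let base = store.head();
        // (a real writer would stage its data against `base` here)
        if let CasOutcome::Won { new_version } = store.publish_if_match(base) {
            return Some((new_version, attempt));
        }
    }
    None
}
```

Two writers that both staged against version 0 illustrate the cost: the first publish wins, the second loses and must redo its work against version 1, which is why adding writers to one branch adds retries rather than throughput.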
**The deployment shape this implies — split the roles (CQRS at the deployment layer):**
| Role | Scaling | Topology |
|---|---|---|
| **Read fleet** | horizontal, stateless, regional | many replicas behind a LB; each opens snapshots from object store |
| **Write tier** | one active coordinator per `(cell, graph, branch)` | small; consistent-hash or a per-`(graph,branch)` lease routes writes to a single owner to bound CAS thrash |
| **Maintenance** | single-coordinator, out-of-band | `optimize`/`cleanup`/reindex as async jobs ([rfc-011](rfc-011-cli-refactoring.md) D11), never inline with serving |
| **Heavy reads** | own pool (optional) | vector/FTS/aggregation isolated so they don't starve point reads |
- **Read-your-write:** the write envelope returns `commit_id` / `snapshot_id`
([rfc-001](rfc-001-queries-envelope-mcp.md)); a client needing RYW pins its next read
to that snapshot, or is routed (sticky) to the writer / a freshly-refreshed replica.
Otherwise reads are eventual within the handle-refresh window — make that explicit per
[rfc-003](rfc-003-mcp-server-surface.md)'s "no silent eventual consistency."
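The read-your-write rule can be sketched as a routing decision on the read path. This rests on a loud assumption that the RFC does not state: snapshot ids are modelled as monotonically increasing integers so that "replica has caught up" is a simple comparison. `ReadRoute` and `route_read` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum ReadRoute {
    /// No pin supplied: serve whatever the replica has (eventual within the refresh window).
    Eventual { served_at: u64 },
    /// Pin satisfied locally: serve exactly the requested snapshot.
    Pinned { served_at: u64 },
    /// Replica lags the pin: refresh first, or route to the writer / a fresher replica.
    NeedsFresherReplica { required: u64 },
}

/// `replica_applied` is the newest snapshot this replica's handle has opened.
/// `pin` is the snapshot id the client got back from its last write, if any.
pub fn route_read(replica_applied: u64, pin: Option<u64>) -> ReadRoute {
    match pin {
        None => ReadRoute::Eventual { served_at: replica_applied },
        Some(required) if required <= replica_applied => ReadRoute::Pinned { served_at: required },
        Some(required) => ReadRoute::NeedsFresherReplica { required },
    }
}
```

Returning an explicit `Eventual` variant, rather than silently serving a stale snapshot, is one way to keep the consistency level visible to the caller.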
**Cell-model interaction.** Read scaling is per-cell-independent and horizontal; write
scaling is per-`(cell, graph, branch)`. The cell bounds blast radius and lets a hot
tenant get a dedicated write coordinator (or its own process — the dedicated tier),
while the long tail shares a pooled read fleet. The one genuinely hard coordination
problem is a **pooled write fleet**: two replicas must not own the same `(cell, graph,
branch)`, so it needs routing affinity or a lease — which ties directly to the
cross-process-CAS known gap and must be closed before pooled *writes* (pooled *reads*
are unaffected). Single-writer-per-partition + many-readers is exactly what an
object-store, log/Git-structured engine wants; the design states it rather than fighting
it.
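The single-owner rule for a pooled write fleet can be sketched as a lease table keyed by `(cell, graph, branch)`. This is an in-process model with hypothetical names (`Partition`, `LeaseTable`, `acquire`, `owner`) and caller-supplied timestamps. It shows the ownership rule only; it does not close the cross-process CAS gap, which would require the lease itself to live in a store with real conditional writes.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Partition {
    pub cell: String,
    pub graph: String,
    pub branch: String,
}

/// Maps each write partition to (owning replica, lease expiry time).
#[derive(Default)]
pub struct LeaseTable {
    owners: HashMap<Partition, (String, u64)>,
}

impl LeaseTable {
    /// Grants or renews the lease unless another replica holds an unexpired one.
    pub fn acquire(&mut self, p: &Partition, replica: &str, now: u64, ttl: u64) -> bool {
        let held_by_other = matches!(
            self.owners.get(p),
            Some((owner, expires)) if owner.as_str() != replica && *expires > now
        );
        if held_by_other {
            return false;
        }
        self.owners.insert(p.clone(), (replica.to_string(), now + ttl));
        true
    }

    /// The replica that writes for `p` should be routed to, if a lease is live at `now`.
    pub fn owner(&self, p: &Partition, now: u64) -> Option<&str> {
        match self.owners.get(p) {
            Some((owner, expires)) if *expires > now => Some(owner.as_str()),
            _ => None,
        }
    }
}
```

Different branches of the same graph are different keys, so they can be owned by different replicas without contention, which is the partitioning argument made earlier in this section.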
## 6. How the surfaces ride on top
This is a server/topology change; the surfaces are consumers and need little or no
change:
- **HTTP data plane.** Every protected route already resolves `Arc<GraphHandle>` from a
request extension; it now comes from the cell's registry. Handlers are unchanged.
- **MCP ([rfc-003](rfc-003-mcp-server-surface.md)).** The MCP backend "consumes a
resolved actor and branches on nothing about how the token was verified" and mounts
under `per_graph_protected`. So `/graphs/{id}/mcp` simply lives under a cell now:
`https://tenant-a.example.com/graphs/{id}/mcp`. Per-graph isolation (rfc-003 §15.1)
is *sufficient* under cluster-as-tenant — each tenant's MCP clients point at their
own cell's endpoints; the discovery/enumeration concerns that would bite a pooled
model do not apply. MCP is **one** tenant-scoped consumer, not the reason for this
RFC.
- **CLI ([rfc-011](rfc-011-cli-refactoring.md)).** A `--server <name|url>` scope already
addresses one served endpoint; a per-tenant subdomain is just a server URL. No new
addressing concept — the cell is reached as a server.
## 7. Migration & backward compatibility
Today's `omnigraph-server --cluster <one>` is equivalent to a `MultiCluster` with **one cell** and a
**host-agnostic** `resolve_cell` (any host → the sole cell). Therefore:
- Existing single-cluster deployments keep working unchanged (one cell; `resolve_cell`
is the identity mapping).
- `--cluster <dir|s3>` stays the dedicated/on-prem entry point.
- A pooled fleet boots from a cell list (e.g. repeated `--cluster`, or a
`--cells <registry>` source).
- RFC-003 MCP, OpenAPI generation, and CLI addressing are unchanged; `/graphs/{id}/…`
just lives under a cell.
- `ResolvedActor.tenant_id` / `GraphKey.tenant_id` are **repurposed to the cell id**
(or removed) — they stop implying pooled-row tenancy. This is the cleanup that ends
the two-models ambiguity.
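The backward-compatible behaviour of `resolve_cell` can be pictured as two resolver modes. This is a sketch under assumptions: `CellResolver` and its variants are hypothetical, and cells are represented by their id only.

```rust
use std::collections::HashMap;

/// Hypothetical resolver modes: a single-cluster boot ignores the host entirely.
pub enum CellResolver {
    /// One cell: any host resolves to it (today's `--cluster <one>` behaviour).
    Sole(String),
    /// Many cells: the host selects the cell; unknown hosts resolve to nothing.
    ByHost(HashMap<String, String>),
}

impl CellResolver {
    pub fn resolve(&self, host: &str) -> Option<&str> {
        match self {
            CellResolver::Sole(cell_id) => Some(cell_id.as_str()),
            CellResolver::ByHost(map) => map.get(host).map(String::as_str),
        }
    }
}
```

With the `Sole` mode an existing deployment needs no DNS or host configuration change, while a pooled fleet gets a 404 path for hosts it does not serve.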
## 8. Invariants & deny-list check
- **§11 transport/auth at the boundary:** cell auth + `resolve_cell` live only in
`omnigraph-server`; engine/compiler/cluster crates never learn cells exist. ✓
- **§12 no client-set identity:** both the cell (host/prefix → registry) and the actor
(token → `ResolvedActor`) are server-resolved; neither is client-settable. ✓
- **No cloud-only correctness / no fork:** cells are a server-layer wrapper; the OSS
`Static`-auth cell is first-class, OIDC is additive. ✓
- **Strong consistency / manifest atomicity (§1§6):** untouched — this adds an outer
routing dimension, not a write path. Each cell's engine keeps its own snapshot
isolation and manifest publish. ✓
- **No state that drifts:** the Cell Registry is control-plane metadata; per-cell state
remains the cluster's applied revision (rfc-005), derived from the ledger. ✓
- **Least privilege:** a leaked cell token reaches one tenant; rotation is per-cell;
storage credentials stay with the server process, never operators. ✓
## 9. Decisions, open questions, deferred
**Decided:**
- Cluster = tenant = cell; silo data, pool compute. No row-level pooling.
- Cells resolved before auth; per-cell token source; per-cell Cedar.
- WorkOS Organization → cell 1:1; per-cell OAuth audience.
- `tenant_id` repurposed to cell id (audit), not an isolation mechanism.
**Open / deferred:**
- **Per-cell vs process-wide admission control.** `WorkloadController` is process-wide
today; pooled cells want per-cell fairness (noisy-neighbor). Make `workload` a
per-cell value or add per-cell quotas. This is the one shared resource pooling
reintroduces — design before the pooled tier ships.
- **Cell hot-load / evict protocol.** Control-plane push vs poll of the Cell Registry;
pin the consistency story (a cell appears atomically or not at all; eviction drains).
- **Cell Registry storage.** Its own OmniGraph graph, a control DB, or a config object
— metadata-only either way. Decide ownership and durability.
- **`TokenVerifier` trait shape.** Still draft in `identity.rs`; this RFC fixes *where*
it plugs in (per-cell `CellAuth::Oidc`), not its exact signature.
- **Scope semantics.** `ResolvedActor.scopes` is currently `[Full]` and read nowhere;
when OIDC populates real scopes, decide whether/how they feed Cedar context (a
behavior change to sequence deliberately, not silently).
- **Pooled-write coordination (§5.8).** A pooled write fleet needs a single active
owner per `(cell, graph, branch)` (consistent-hash or lease) and the cross-process
CAS gap closed first. Pooled *reads* need none of this; gate pooled *writes* on it.
Read-fleet freshness (handle-refresh lag) + read-your-write pinning is the companion
decision.
## 10. Relationship to prior RFCs
[rfc-005](rfc-005-server-cluster-boot.md) made the server boot one cluster from its
applied revision; this RFC makes the server boot *many* such clusters as cells and
resolves one per request. [rfc-011](rfc-011-cli-refactoring.md) fixed the
store/cluster/server ontology and closed runtime graph mutation; this RFC adds the
*cell* as the tenant unit and the one safe runtime mutation (cell hot-load) that does
not violate rfc-011's reasoning. [rfc-004](rfc-004-cluster-graph-schema-apply.md)'s
ledger/recovery/approvals are what make a cell a self-contained, independently
provisionable unit. [rfc-003](rfc-003-mcp-server-surface.md) is a consumer: its
per-graph MCP surface becomes per-tenant for free once each cluster is a tenant. The
net: tenancy is decided once, at the server topology layer, and every surface inherits
it.