# RFC 0002 — Cloud Deployment Architecture (Staged)

**Type:** design proposal
**Status:** draft — not accepted, not implemented
**Audience:** maintainers reviewing the cloud offering and the OSS/Cloud boundary
**Date:** 2026-05-17
**Depends on:** [RFC 0001 — Federated Authentication](0001-federated-authentication.md)

> This is a proposal, not current truth. Until accepted and implemented, the
> authoritative deployment story remains [docs/user/deployment.md](../../user/deployment.md).

## Summary

This RFC defines how OmniGraph is deployed as a managed cloud offering, in
**three stages** of increasing complexity. Each stage wins one irreducible
property and pays only for the complexity that property earns:

1. **Managed single-region** — a customer can sign up, get a repo, and
   authenticate against a managed OmniGraph. Wins: *managed + authenticated +
   multi-tenant*.
2. **Elastic data plane + worker tier** — write scale, and maintenance
   (indexing, compaction, recovery) moved off the request path. Wins: *scale +
   off-path maintenance + no recovery-on-open race*.
3. **BYOC / VPC / air-gapped** — data plane in the customer's VPC, only a
   thin orchestrator in the vendor cloud. Wins: *data sovereignty*.

The same OSS binary runs in every stage and in a customer VPC; deployment mode
is configuration. The auth design from RFC 0001 threads through unchanged —
each stage only moves *where the token issuer lives*, never the validation
path.

## Motivation

OmniGraph's durable state lives entirely in object storage (Lance datasets +
the `__manifest` commit log); concurrency is optimistic CAS on `__manifest`.
That is already the object-storage-native architecture that turbopuffer,
LanceDB, Neon, and WarpStream converged on. The task is not to adopt it but to
*lean into it* — and to do so in stages so that a managed offering can ship and
collect real load before the expensive pieces (the reconciler, BYOC) are built.

## Goals

- One OSS binary; deployment mode is configuration only.
- A managed offering reachable in Stage 1 without building the reconciler.
- Object storage as the only request-path dependency.
- The control plane is dispensable: it holds only soft, derivable state.
- Auth (RFC 0001) is identical across stages and across VPC/on-prem.

## Non-goals

- A new storage substrate, WAL, or metadata database (deny-list).
- Changing the engine crates — transport/auth stay at the server boundary
  (Invariant 11).
- Multi-region active/active. Regions are independent stacks.
- Browser SSO / login UX (a control-plane concern, out of scope here).

## Foundational principles (hold across all stages)

These are decided once and constrain every stage.

### P1 — Object-storage-only commit

OmniGraph writes are **batch-shaped** (`mutate_as`, `load`, merges,
`schema_apply`) — not OLTP `COMMIT`s. Neon adds a Safekeeper quorum tier in
front of object storage *because* OLTP commit cannot wait ~100-300ms for an S3
PUT. OmniGraph has no such requirement, so it takes the turbopuffer path:
**commit straight to object storage, accept ~100-300ms write latency, run no
fast durable tier.** This is a large, deliberate simplicity win. It is also
hard to reverse once built on, so it should change only through a new RFC: if a
low-latency single-row write path ever becomes a product requirement, *that* is
when a fast durable tier earns its complexity.
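
The commit shape P1 describes can be illustrated with a minimal sketch. It assumes an in-memory stand-in for the `__manifest` object and a conditional write; `ManifestStore`, `put_if_version`, and `commit_batch` are hypothetical names for illustration, not the engine's API.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the `__manifest` object: a version counter plus
/// the ids of committed batches. Real storage would use a conditional PUT.
#[derive(Debug, Default)]
struct ManifestStore {
    inner: Mutex<(u64, Vec<String>)>,
}

#[derive(Debug, PartialEq)]
enum CasError {
    /// Another writer published first; the caller must re-read and retry.
    VersionMismatch { current: u64 },
}

impl ManifestStore {
    /// Read the current manifest version.
    fn version(&self) -> u64 {
        self.inner.lock().unwrap().0
    }

    /// Compare-and-swap: publish `batch` only if the manifest is still at
    /// `expected`. This is the single commit point.
    fn put_if_version(&self, expected: u64, batch: &str) -> Result<u64, CasError> {
        let mut guard = self.inner.lock().unwrap();
        if guard.0 != expected {
            return Err(CasError::VersionMismatch { current: guard.0 });
        }
        guard.0 += 1;
        guard.1.push(batch.to_string());
        Ok(guard.0)
    }
}

/// Commit one batch with bounded optimistic retries: no WAL, no quorum tier.
fn commit_batch(store: &ManifestStore, batch: &str, max_attempts: u32) -> Result<u64, CasError> {
    let mut expected = store.version();
    let mut attempt = 0;
    loop {
        attempt += 1;
        match store.put_if_version(expected, batch) {
            Ok(version) => return Ok(version),
            Err(CasError::VersionMismatch { current }) if attempt < max_attempts => {
                // Lost the race: adopt the newer version and try again.
                expected = current;
            }
            Err(e) => return Err(e),
        }
    }
}
```

In this sketch the only durable step is the conditional write, so each commit costs roughly one object-storage round trip, which is the latency P1 accepts.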

### P2 — The control plane holds only soft state

WarpStream keeps an authoritative metadata store (file→offset mappings) in its
cloud. OmniGraph does **not** need this: the `__manifest` table already *is* the
authoritative, strongly-consistent metadata, and it lives in object storage.
The control plane therefore stores only **soft, derivable state** — tenant
directory, billing counters, routing hints, compaction schedules, recovery
leases. Everything it knows is rebuildable by scanning object storage. The
control plane is never on the request path; if it is down, existing tenants
keep serving (the turbopuffer 99.99%-uptime property).
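
As an illustration of "rebuildable by scanning object storage", the sketch below derives a tenant directory from a listing of object keys. The key layout `tenants/<tenant>/<repo>/__manifest/...` is an assumption made for this example (a prefix-per-tenant layout); the real layout depends on the tenancy decision and is not specified here.

```rust
use std::collections::BTreeMap;

/// Rebuild a tenant -> repos directory purely from object keys.
/// Assumes the hypothetical layout `tenants/<tenant>/<repo>/__manifest/...`.
fn rebuild_tenant_directory(keys: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut dir: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for key in keys {
        let parts: Vec<&str> = key.split('/').collect();
        // Only manifest objects prove that a repo exists; skip everything else.
        if parts.len() >= 4 && parts[0] == "tenants" && parts[3] == "__manifest" {
            let repos = dir.entry(parts[1].to_string()).or_default();
            if !repos.iter().any(|r| r.as_str() == parts[2]) {
                repos.push(parts[2].to_string());
            }
        }
    }
    dir
}
```

Because the directory is a pure function of the listing, losing the control plane's copy costs a rescan rather than data.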

### P3 — One binary, config-driven

The `omnigraph-server` container is identical in Stage 1, Stage 2, a customer
VPC, and air-gapped on-prem. A "cloud build" is configuration plus *additive,
optional* control-plane services — never a fork (deny-list: no Cloud fork;
correctness is always OSS).

### P4 — Auth validation never makes a network call

Per RFC 0001: tokens are validated offline against cached JWKS. The token
*issuer* may be cloud-hosted, but the data plane never calls it on the request
path. This is what lets the identical data plane run in Stage 3's customer VPC.
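
A minimal sketch of what "offline" means on the request path, assuming the JWKS has already been cached by a background refresh that is not shown. `JwksCache`, `Token`, and `validate_offline` are hypothetical names, and the `verify` parameter stands in for a real signature check; none of this is RFC 0001's actual interface.

```rust
use std::collections::HashMap;

/// Hypothetical cached JWKS: key id -> opaque public-key bytes.
/// The request path only reads it; refresh happens out of band.
struct JwksCache {
    keys: HashMap<String, Vec<u8>>,
}

/// Fields already decoded from a JWT (decoding omitted for brevity).
struct Token<'a> {
    kid: &'a str,
    exp_unix: u64,
    signed_payload: &'a [u8],
    signature: &'a [u8],
}

#[derive(Debug, PartialEq)]
enum AuthError {
    UnknownKey,
    Expired,
    BadSignature,
}

/// Validate purely from local state: no I/O and no call to the issuer.
/// `verify(key, payload, signature)` is a placeholder for real crypto.
fn validate_offline(
    cache: &JwksCache,
    token: &Token,
    now_unix: u64,
    verify: impl Fn(&[u8], &[u8], &[u8]) -> bool,
) -> Result<(), AuthError> {
    // An unknown key id fails closed instead of triggering a fetch.
    let key = cache.keys.get(token.kid).ok_or(AuthError::UnknownKey)?;
    if token.exp_unix <= now_unix {
        return Err(AuthError::Expired);
    }
    if !verify(key.as_slice(), token.signed_payload, token.signature) {
        return Err(AuthError::BadSignature);
    }
    Ok(())
}
```

Failing closed on an unknown `kid` is a choice made for this sketch; how long stale keys stay acceptable is the degraded-mode question listed under open decisions.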

## Architecture primitives

```
        CONTROL PLANE (vendor cloud; soft state only; off request path)
        - provisioning / tenant directory   - billing / metering
        - identity issuer (RFC 0001; may wrap WorkOS)
        - orchestration: compaction schedule, recovery leases, routing hints
                              │  (async, never on request path)
   ┌──────────────────────────┼──────────────────────────────────────┐
   │  DATA PLANE — omnigraph-server (identical OSS binary everywhere)  │
   │    read replicas  ·  writer(s)  ·  worker tier (Stage 2+)        │
   └──────────────────────────┬──────────────────────────────────────┘
                              ▼
                       OBJECT STORAGE
        Lance datasets + __manifest  (the only request-path dependency)
```

Tiers, introduced progressively:

- **Read replicas** — open `OpenMode::ReadOnly` (skips the recovery sweep),
  snapshot-isolated, fan out freely.
- **Writer(s)** — open `ReadWrite`; route by repo so CAS contention and cache
  stay local.
- **Worker tier** (Stage 2+) — background indexing, compaction, cleanup,
  recovery. Off the request path.
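
One way to "route by repo" for writers is a consistent-hash ring, the scheme Stage 2 names. The sketch below is illustrative only: `HashRing`, the virtual-node count, and the hand-rolled FNV-1a hash are assumptions chosen so that every router computes the same ring without coordination.

```rust
use std::collections::BTreeMap;

/// FNV-1a 64-bit: a small, stable hash, so every router builds the same ring.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in data {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hypothetical ring: virtual-node hash -> writer name.
struct HashRing {
    ring: BTreeMap<u64, String>,
}

impl HashRing {
    /// Place `vnodes` virtual nodes per writer on the ring.
    fn new(writers: &[&str], vnodes: u32) -> Self {
        let mut ring = BTreeMap::new();
        for w in writers {
            for v in 0..vnodes {
                ring.insert(fnv1a(format!("{w}#{v}").as_bytes()), w.to_string());
            }
        }
        Self { ring }
    }

    /// Owner = first virtual node at or after the repo's hash, wrapping around.
    fn owner(&self, repo_uri: &str) -> Option<&str> {
        let h = fnv1a(repo_uri.as_bytes());
        self.ring
            .range(h..)
            .next()
            .or_else(|| self.ring.iter().next())
            .map(|(_, w)| w.as_str())
    }
}
```

The property that matters for cache locality is that removing a writer only remaps the repos that writer owned; every other repo keeps its node and its warm handle.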

## Stage 1 — Managed single-region

**Property won:** a customer can sign up, get a repo, authenticate, and use a
managed OmniGraph.

**Architecture.** Single region. One object store (prefix-per-tenant or
bucket-per-tenant — see open decisions). Data plane = a pool of **read
replicas** plus a **single writer replica** per region. The single writer is
deliberate: it sidesteps both `__manifest` CAS contention *and* the
recovery-on-open race **without building the worker tier**. Recovery runs on the
writer's `open`, as today. Reads fan out across read replicas.
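
The Stage 1 topology is small enough to sketch directly. `Stage1Router` and `Op` are hypothetical names; sending reads to the writer when no replica is configured is an assumption of this example, not something the design above specifies.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Read,
    Write,
}

/// Stage 1 shape: N read replicas plus exactly one writer.
struct Stage1Router {
    read_replicas: Vec<String>,
    writer: String,
    next: AtomicUsize,
}

impl Stage1Router {
    fn new(read_replicas: Vec<String>, writer: String) -> Self {
        Self { read_replicas, writer, next: AtomicUsize::new(0) }
    }

    /// Writes always reach the single writer; reads round-robin across
    /// replicas. With no replicas configured, reads use the writer
    /// (an assumption made for this sketch).
    fn route(&self, op: Op) -> &str {
        match op {
            Op::Write => self.writer.as_str(),
            Op::Read if self.read_replicas.is_empty() => self.writer.as_str(),
            Op::Read => {
                let i = self.next.fetch_add(1, Ordering::Relaxed) % self.read_replicas.len();
                self.read_replicas[i].as_str()
            }
        }
    }
}
```

With one writer there is no routing decision for writes at all, which is why this stage needs neither a hash ring nor a recovery lease.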

**Control plane.** Thinnest viable: provisioning (`open-or-create` a repo —
largely doable by the data plane itself on first request), a tenant directory,
billing counters, and the RFC 0001 identity issuer. All soft state (P2).

**Auth (RFC 0001).** `mode = static` remains the default for machine-to-machine
(M2M) and CI use; `mode = oidc` is available and validated offline. The control
plane runs the issuer (its own, or one wrapping WorkOS for human SSO). The
`hybrid` mode lets both coexist.

**Branching as product surface.** OmniGraph already has Git-style graph
branches with lazy fork — the same zero-copy, metadata-pointer design Neon
sells. Stage 1 exposes this directly: instant per-PR / dev / staging branches at
near-zero storage cost. No new engine work — a product packaging of an existing
capability.

**Deliberately not done.** No autoscaling of writes, no worker tier, no
reconciler, no BYOC. **Accepted limitation:** per-region write throughput is
bounded by one writer; a writer restart briefly pauses writes for that region.

**Exit criteria → Stage 2.** Single-writer throughput, write-pause blast
radius, or maintenance load (inline index builds / compaction) becomes the
binding constraint.

## Stage 2 — Elastic data plane + worker tier

**Property won:** horizontal write scale, and maintenance moved off the request
path — which also eliminates the recovery-on-open race.

**Architecture.**

- **Multiple writers** with **consistent-hash routing by repo URI**. A repo's
  writes land on one node, so CAS contention is bounded and the Lance page cache
  / warm `Omnigraph` handle stay local.
- **Per-repo write coalescer** — concurrent `mutate_as`/`load` commits to one
  repo batch into one manifest publish (the turbopuffer WAL-batching lesson:
  beat contention with batching, not locks).
- **Three-tier cache** made explicit: object storage → NVMe SSD → in-process
  (Lance page cache + warm handle), with routing affinity keeping a repo warm.
- **Worker tier** — background workers own index building (the deny-list
  reconciler mandate), compaction (`optimize`), cleanup, **and recovery**.
  Recovery moves from "every `open` runs the sweep" to "one leased worker per
  repo owns recovery." This *is* the long-deferred background reconciler;
  cloud is its forcing function.
- **Per-tenant resource bounds** — close the `invariants.md` resource-bounds
  gap: enforced per-query memory/time budgets, plus `WorkloadController`
  admission control, so multi-tenant compute has no noisy-neighbor failure.
- **Scale-to-zero** for cold tenants — evict idle handles, re-warm on first
  request, bill by the second (the Neon model).
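
Of the items above, the per-repo write coalescer is the least conventional, so here is a minimal sketch of the idea: buffer concurrent writes per repo, then publish once per repo rather than once per write. `Coalescer`, `PendingWrite`, and `Publish` are hypothetical names, and the sketch leaves out timing, back-pressure, and the manifest CAS that each publish would perform.

```rust
use std::collections::BTreeMap;

/// A pending batch-shaped write (for example one `mutate_as` or `load` call).
#[derive(Debug, Clone, PartialEq)]
struct PendingWrite {
    repo: String,
    batch_id: String,
}

/// One manifest publish covering every batch coalesced for a repo.
#[derive(Debug, PartialEq)]
struct Publish {
    repo: String,
    batch_ids: Vec<String>,
}

/// Hypothetical coalescer: buffers writes per repo until the next flush.
#[derive(Default)]
struct Coalescer {
    pending: BTreeMap<String, Vec<String>>,
}

impl Coalescer {
    /// Queue a write behind any others already waiting for the same repo.
    fn submit(&mut self, w: PendingWrite) {
        self.pending.entry(w.repo).or_default().push(w.batch_id);
    }

    /// Drain everything buffered so far: one publish per repo, not per write.
    fn flush(&mut self) -> Vec<Publish> {
        std::mem::take(&mut self.pending)
            .into_iter()
            .map(|(repo, batch_ids)| Publish { repo, batch_ids })
            .collect()
    }
}
```

Three concurrent writes to one repo become a single CAS attempt in this sketch, so contention grows with the number of active repos instead of the number of writers.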

**Control plane.** Gains orchestration: routing-hint distribution, compaction
scheduling, recovery-lease coordination. Still soft state (P2), still off-path.
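
Recovery-lease coordination could look like the sketch below: a time-bounded lease per repo that one worker holds and renews. `LeaseTable` and `try_acquire` are hypothetical, and the sketch does not choose between an object-store-based and a control-plane-issued lease. It also assumes the lease only avoids duplicate sweeps, with correctness still resting on the `__manifest` CAS.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Lease {
    holder: String,
    expires_at: u64,
}

/// Hypothetical lease table: repo -> current lease. This is soft state;
/// losing it only delays recovery until some worker re-acquires.
#[derive(Default)]
struct LeaseTable {
    leases: HashMap<String, Lease>,
}

impl LeaseTable {
    /// Grant the lease if it is free, expired, or already held by `worker`
    /// (a renewal). Returns true when `worker` owns recovery for `repo`.
    fn try_acquire(&mut self, repo: &str, worker: &str, now: u64, ttl: u64) -> bool {
        let blocked = matches!(
            self.leases.get(repo),
            Some(l) if l.expires_at > now && l.holder != worker
        );
        if blocked {
            return false;
        }
        self.leases.insert(
            repo.to_string(),
            Lease { holder: worker.to_string(), expires_at: now + ttl },
        );
        true
    }
}
```

A worker that crashes simply stops renewing, and another worker takes over once the TTL lapses.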

**Auth.** No change to the validation path. The control plane's config-bundle
sync (RFC 0001 `ControlPlaneSync`) may now feed a SCIM-sourced actor allowlist.

**Optional consistency knob.** With warm caches, a per-query `stale-ok` read
becomes viable (turbopuffer's sub-10ms eventual mode). Invariant 6 permits it
**only** as an explicit, read-only, opt-in mode; strong consistency stays the
default.
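
The constraints on the knob are easy to state as code. In this sketch `ReadConsistency`, `OpKind`, `Source`, and `plan_source` are hypothetical names, and serving a strong read when the cache is cold is an assumption of the example.

```rust
/// Per-query consistency; the default must remain strong.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
enum ReadConsistency {
    #[default]
    Strong,
    StaleOk,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum OpKind {
    Read,
    Write,
}

#[derive(Debug, PartialEq)]
enum Source {
    /// Resolve against the latest `__manifest` version.
    LatestManifest,
    /// Serve from a warm, possibly stale, local snapshot.
    WarmCache,
}

/// Decide where a query is served from. `requested = None` means the caller
/// did not opt in, so the strong default applies.
fn plan_source(
    op: OpKind,
    requested: Option<ReadConsistency>,
    cache_warm: bool,
) -> Result<Source, &'static str> {
    match (op, requested.unwrap_or_default()) {
        // Stale mode is read-only: reject it on writes instead of ignoring it.
        (OpKind::Write, ReadConsistency::StaleOk) => {
            Err("stale-ok is only valid on read-only queries")
        }
        (OpKind::Read, ReadConsistency::StaleOk) if cache_warm => Ok(Source::WarmCache),
        // Everything else, including a cold cache, gets a strong read.
        _ => Ok(Source::LatestManifest),
    }
}
```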

**Deliberately not done.** Data still resides in vendor-managed object storage.

**Exit criteria → Stage 3.** A customer requires data sovereignty (data may not
leave their account) or air-gapped operation.

## Stage 3 — BYOC / VPC / air-gapped

**Property won:** data sovereignty — the customer's graph data never leaves
their cloud account.

**Architecture.** The WarpStream BYOC split. The data plane (read replicas,
writers, worker tier — the Stage 1 *or* Stage 2 shape) and the customer's object
store run **inside the customer's VPC**. The vendor cloud keeps only the
soft-state orchestrator and the identity issuer. No customer graph data crosses
the boundary; no cross-account IAM into the customer's bucket. Air-gapped is the
same packaging with the control plane absent and config supplied as static
files.

**Auth.** This is where P4 (offline validation, per RFC 0001) pays off fully:
the in-VPC data plane validates tokens **offline** against cached JWKS. The
vendor identity issuer is the only cloud touchpoint for auth, and it is off the
request path. In air-gapped deployments, point at the customer's own IdP or use
`mode = static`, with JWKS and policy pre-seeded.

**Why this is mostly packaging.** Because P2/P3/P4 were honored from Stage 1 —
control plane thin and off-path, one config-driven binary, auth validated
offline — Stage 3 is boundary hardening and deployment templates (Helm /
Terraform), not an architectural change.

## Why three stages (first-principles)

- **Not one stage.** The managed offering must not wait on the reconciler — a
  large build. Stage 1's single-writer design wins a real, sellable, managed
  product with bounded complexity, and the load it collects is the evidence
  that justifies Stage 2 (reversible-change discipline: ship, measure, then
  invest).
- **Not collapsing 2 and 3.** *Scale* (Stage 2) and *sovereignty* (Stage 3) are
  independent axes — a customer may demand BYOC before single-region scale runs
  out, or the reverse. They share only the thin-control-plane prerequisite,
  which is foundational (P2) anyway. **Stage 3 can therefore ship on the Stage 1
  data-plane shape**; the 1→2→3 numbering is the expected-demand order, not a
  hard dependency. If enterprise/sovereignty demand arrives first, do 1→3→2.
- **Not more than three.** The natural seams are exactly *managed+auth*,
  *elastic+maintenance*, *sovereignty*. Finer splits would be invented
  complexity, not earned.

## Open decisions

1. **Tenancy isolation model** — bucket-per-tenant vs prefix-per-tenant vs
   account-per-tenant. This is the strongest isolation lever and is effectively
   irreversible; the control plane should vend short-lived, per-tenant scoped
   credentials regardless. Decide before Stage 1.
2. **Recovery ownership** — Stage 1 leans on the single writer; Stage 2 needs a
   per-repo recovery lease. Confirm the lease mechanism (object-store-based
   lease vs control-plane-issued).
3. **Commit-latency model (P1)** — ratify object-storage-only commit, or
   identify a concrete low-latency write requirement that would justify a fast
   durable tier.
4. **RFC 0001 carry-overs** — degraded-mode JWKS grace window, revocation
   strategy, whether VPC customers can override cloud-pushed Cedar policy.
5. **Stale-read knob** — ship the optional eventual-consistency read in
   Stage 2, or defer.

## Invariant analysis

| Invariant / deny-list item | Outcome |
|---|---|
| 2 — manifest-atomic graph visibility | ✅ unchanged; `__manifest` CAS is the commit point in every stage |
| 5 — recovery part of commit protocol | ✅ Stage 1 = open-time sweep; Stage 2 = leased worker; never weakened |
| 6 — strong consistency default | ✅ stale-read knob is explicit, read-only, non-default |
| 11 — transport/auth at the boundary | ✅ engine crates untouched; auth in `omnigraph-server` |
| 13 — failures bounded/observable | ✅ Stage 2 closes the per-query resource-bounds gap |
| Deny: custom WAL / metadata store | ✅ P1/P2 — object storage + `__manifest` only |
| Deny: cloud-only correctness / fork | ✅ P3 — one OSS binary, additive control plane |
| Deny: job queue for manifest-derivable state | ✅ worker tier is a reconciler, not a queue |

## Testing notes

- Stage 1: extend `omnigraph-server` tests for multi-replica read fan-out and
  single-writer routing; reuse `failpoints` for writer-restart behavior.
- Stage 2: per-repo coalescer and routing need engine/storage-boundary tests
  (`runs.rs`, `recovery.rs`); recovery-lease coverage belongs in `recovery.rs`.
- Stage 3: a deployment-template smoke test (data plane against an in-VPC-style
  object store); confirm no control-plane call on the request path.
- Update [docs/user/deployment.md](../../user/deployment.md) and
  [docs/user/server.md](../../user/server.md) as each stage lands.