Add RFC 0002: staged cloud deployment architecture

Drafts the cloud deployment design as three earned stages — managed
single-region, elastic data plane with an off-path worker tier, then
BYOC/VPC/air-gapped — each winning one irreducible property. Sets
foundational principles (object-storage-only commit, a soft-state
control plane off the request path, one config-driven binary) drawn
from turbopuffer, Neon, and WarpStream, threads the RFC 0001 auth
design through every stage, and records the open decisions and
invariant analysis.

https://claude.ai/code/session_01N22WDYC6vv2njR5Xu96QaC
Claude 2026-05-17 03:23:14 +00:00
parent 5e03ca977c
commit 9e5a86580d
2 changed files with 279 additions and 0 deletions

@@ -51,6 +51,7 @@ description of shipped behavior always lives in the area docs above.
| RFC | Status | Topic |
|---|---|---|
| [0001-federated-authentication.md](rfcs/0001-federated-authentication.md) | draft | OIDC auth with a cloud control plane plus VPC/on-prem deployment |
| [0002-cloud-deployment-architecture.md](rfcs/0002-cloud-deployment-architecture.md) | draft | Staged cloud deployment — managed, elastic, then BYOC/VPC |
## Project Operations

@@ -0,0 +1,278 @@
# RFC 0002 — Cloud Deployment Architecture (Staged)
**Type:** design proposal
**Status:** draft — not accepted, not implemented
**Audience:** maintainers reviewing the cloud offering and the OSS/Cloud boundary
**Date:** 2026-05-17
**Depends on:** [RFC 0001 — Federated Authentication](0001-federated-authentication.md)
> This is a proposal, not current truth. Until accepted and implemented, the
> authoritative deployment story remains [docs/user/deployment.md](../../user/deployment.md).
## Summary
Defines how OmniGraph is deployed as a managed cloud offering, in **three
stages** of increasing complexity. Each stage wins one irreducible property
and pays only the complexity that property earns:
1. **Managed single-region** — a customer can sign up, get a repo, and
authenticate against a managed OmniGraph. Wins: *managed + authenticated +
multi-tenant*.
2. **Elastic data plane + worker tier** — write scale, and maintenance
(indexing, compaction, recovery) moved off the request path. Wins: *scale +
off-path maintenance + no recovery-on-open race*.
3. **BYOC / VPC / air-gapped** — data plane in the customer's VPC, only a
thin orchestrator in the vendor cloud. Wins: *data sovereignty*.
The same OSS binary runs in every stage and in a customer VPC; deployment mode
is configuration. The auth design from RFC 0001 threads through unchanged —
each stage only moves *where the token issuer lives*, never the validation
path.
## Motivation
OmniGraph's durable state lives entirely in object storage (Lance datasets +
the `__manifest` commit log); concurrency is optimistic CAS on `__manifest`.
That is already the object-storage-native architecture that turbopuffer,
LanceDB, Neon, and WarpStream converged on. The task is not to adopt it but to
*lean into it* — and to do so in stages so that a managed offering can ship and
collect real load before the expensive pieces (the reconciler, BYOC) are built.
## Goals
- One OSS binary; deployment mode is configuration only.
- A managed offering reachable in Stage 1 without building the reconciler.
- Object storage as the only request-path dependency.
- The control plane is dispensable: it holds only soft, derivable state.
- Auth (RFC 0001) is identical across stages and across VPC/on-prem.
## Non-goals
- A new storage substrate, WAL, or metadata database (deny-list).
- Changing the engine crates — transport/auth stay at the server boundary
(Invariant 11).
- Multi-region active/active. Regions are independent stacks.
- Browser SSO / login UX (a control-plane concern, out of scope here).
## Foundational principles (hold across all stages)
These are decided once and constrain every stage.
### P1 — Object-storage-only commit
OmniGraph writes are **batch-shaped** (`mutate_as`, `load`, merges,
`schema_apply`) — not OLTP `COMMIT`s. Neon adds a Safekeeper quorum tier in
front of object storage *because* OLTP commit cannot wait ~100-300ms for an S3
PUT. OmniGraph has no such requirement, so it takes the turbopuffer path:
**commit straight to object storage, accept ~100-300ms write latency, run no
fast durable tier.** This is a large, deliberate simplicity win. It is
hard to reverse — if a low-latency single-row write path ever becomes a product
requirement, *that* is when a fast durable tier earns its complexity, via a new
RFC.
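
A minimal sketch of what "commit straight to object storage" implies for a writer, assuming the store offers a conditional write (put only if the version is unchanged). `ObjectStore`, `put_if_version`, and `commit_batch` are hypothetical names for illustration, not OmniGraph or Lance APIs, and the in-memory map stands in for a real bucket.

```rust
use std::collections::HashMap;

/// In-memory stand-in for an object store with conditional writes.
/// Hypothetical type: a real store would expose this as a conditional PUT.
#[derive(Default)]
pub struct ObjectStore {
    // key -> (version, bytes)
    objects: HashMap<String, (u64, Vec<u8>)>,
}

#[derive(Debug, PartialEq)]
pub enum CasError {
    /// Another writer published first; the caller must re-read and retry.
    VersionMismatch { current: u64 },
}

impl ObjectStore {
    /// Current version of `key`, or 0 if it has never been written.
    pub fn version(&self, key: &str) -> u64 {
        self.objects.get(key).map(|(v, _)| *v).unwrap_or(0)
    }

    /// Write `bytes` only if the stored version still equals `expected`.
    pub fn put_if_version(
        &mut self,
        key: &str,
        expected: u64,
        bytes: Vec<u8>,
    ) -> Result<u64, CasError> {
        let current = self.version(key);
        if current != expected {
            return Err(CasError::VersionMismatch { current });
        }
        let next = current + 1;
        self.objects.insert(key.to_string(), (next, bytes));
        Ok(next)
    }
}

/// Publish one batch as a single manifest update, retrying on CAS conflict.
/// There is no fast durable tier: the conditional write is the commit point.
pub fn commit_batch(
    store: &mut ObjectStore,
    key: &str,
    batch: &[u8],
    max_attempts: u32,
) -> Result<u64, CasError> {
    let mut last = CasError::VersionMismatch { current: 0 };
    for _ in 0..max_attempts {
        let seen = store.version(key);
        match store.put_if_version(key, seen, batch.to_vec()) {
            Ok(version) => return Ok(version),
            Err(e) => last = e,
        }
    }
    Err(last)
}
```

Each attempt costs one object-store round trip, which is where the accepted ~100-300ms write latency comes from in this model.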
### P2 — The control plane holds only soft state
WarpStream keeps an authoritative metadata store (file→offset mappings) in its
cloud. OmniGraph does **not** need this: the `__manifest` table already *is* the
authoritative, strongly-consistent metadata, and it lives in object storage.
The control plane therefore stores only **soft, derivable state** — tenant
directory, billing counters, routing hints, compaction schedules, recovery
leases. Everything it knows is rebuildable by scanning object storage. The
control plane is never on the request path; if it is down, existing tenants
keep serving (the turbopuffer 99.99%-uptime property).
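
To illustrate "rebuildable by scanning object storage", the sketch below derives a tenant directory from a key listing alone. The key layout `<tenant>/<repo>/__manifest/...` and the function name are assumptions made for the example; the RFC leaves the tenancy layout as an open decision.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Rebuild a tenant -> repos directory purely from object keys.
/// Assumes a hypothetical layout: `<tenant>/<repo>/__manifest/...`.
pub fn rebuild_directory<'a, I>(keys: I) -> BTreeMap<String, BTreeSet<String>>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut directory: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for key in keys {
        let mut parts = key.split('/');
        // Only keys under a repo's `__manifest` prefix identify a repo;
        // data files and stray objects are ignored.
        if let (Some(tenant), Some(repo), Some("__manifest")) =
            (parts.next(), parts.next(), parts.next())
        {
            if !tenant.is_empty() && !repo.is_empty() {
                directory
                    .entry(tenant.to_string())
                    .or_default()
                    .insert(repo.to_string());
            }
        }
    }
    directory
}
```

If the control plane loses its directory, running a scan like this over the bucket listing would restore it without touching the data plane.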
### P3 — One binary, config-driven
The `omnigraph-server` container is identical in Stage 1, Stage 2, a customer
VPC, and air-gapped on-prem. A "cloud build" is configuration plus *additive,
optional* control-plane services — never a fork (deny-list: no Cloud fork;
correctness is always OSS).
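
One way to picture "deployment mode is configuration" is a single enum parsed from config that only toggles optional, additive services. The mode names and the `uses_control_plane` helper below are hypothetical; the RFC does not define a config schema.

```rust
/// Hypothetical deployment modes; the binary is identical in all of them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeploymentMode {
    Managed,
    CustomerVpc,
    AirGapped,
}

impl DeploymentMode {
    /// Parse a config value (assumed spellings, for illustration only).
    pub fn parse(value: &str) -> Result<Self, String> {
        match value.trim().to_ascii_lowercase().as_str() {
            "managed" => Ok(Self::Managed),
            "vpc" | "byoc" => Ok(Self::CustomerVpc),
            "air-gapped" | "airgapped" => Ok(Self::AirGapped),
            other => Err(format!("unknown deployment mode: {}", other)),
        }
    }

    /// The control plane is additive: air-gapped runs without it and
    /// takes its config from static files instead.
    pub fn uses_control_plane(self) -> bool {
        !matches!(self, Self::AirGapped)
    }
}
```

Nothing in this sketch changes request handling; the mode only decides which optional services start.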
### P4 — Auth validation never makes a network call
Per RFC 0001: tokens are validated offline against cached JWKS. The token
*issuer* may be cloud-hosted, but the data plane never calls it on the request
path. This is what lets the identical data plane run in Stage 3's customer VPC.
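
The request-path half of P4 can be sketched as a key lookup that reads only local memory. `JwksCache`, `KeyLookup`, and the `max_age` bound are hypothetical (the degraded-mode grace window is still an open decision), and signature verification itself is omitted because it needs a crypto library.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Signing keys fetched by a background task, off the request path.
/// Hypothetical type for illustration.
pub struct JwksCache {
    keys: HashMap<String, Vec<u8>>, // kid -> public key bytes
    fetched_at: Instant,
    max_age: Duration,
}

#[derive(Debug, PartialEq)]
pub enum KeyLookup<'a> {
    /// Key material to verify the token signature with.
    Found(&'a [u8]),
    /// The token names a key this cache has never seen.
    UnknownKid,
    /// The cache is older than the configured bound.
    CacheExpired,
}

impl JwksCache {
    pub fn new(keys: HashMap<String, Vec<u8>>, fetched_at: Instant, max_age: Duration) -> Self {
        Self { keys, fetched_at, max_age }
    }

    /// Request-path lookup: no I/O, so it behaves the same inside a VPC.
    pub fn lookup(&self, kid: &str, now: Instant) -> KeyLookup<'_> {
        if now.saturating_duration_since(self.fetched_at) > self.max_age {
            return KeyLookup::CacheExpired;
        }
        match self.keys.get(kid) {
            Some(key) => KeyLookup::Found(key.as_slice()),
            None => KeyLookup::UnknownKid,
        }
    }
}
```

An unknown `kid` or an expired cache is rejected locally rather than triggering a fetch, which keeps the issuer off the request path.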
## Architecture primitives
```
CONTROL PLANE (vendor cloud; soft state only; off request path)
- provisioning / tenant directory - billing / metering
- identity issuer (RFC 0001; may wrap WorkOS)
- orchestration: compaction schedule, recovery leases, routing hints
│ (async, never on request path)
┌──────────────────────────┼──────────────────────────────────────┐
│ DATA PLANE — omnigraph-server (identical OSS binary everywhere) │
│ read replicas · writer(s) · worker tier (Stage 2+) │
└──────────────────────────┬──────────────────────────────────────┘
OBJECT STORAGE
Lance datasets + __manifest (the only request-path dependency)
```
Tiers, introduced progressively:
- **Read replicas** — open `OpenMode::ReadOnly` (skips the recovery sweep),
snapshot-isolated, fan out freely.
- **Writer(s)** — open `ReadWrite`; route by repo so CAS contention and cache
stay local.
- **Worker tier** (Stage 2+) — background indexing, compaction, cleanup,
recovery. Off the request path.
## Stage 1 — Managed single-region
**Property won:** a customer can sign up, get a repo, authenticate, and use a
managed OmniGraph.
**Architecture.** Single region. One object store (prefix-per-tenant or
bucket-per-tenant — see open decisions). Data plane = a pool of **read
replicas** plus a **single writer replica** per region. The single writer is
deliberate: it sidesteps both `__manifest` CAS contention *and* the
recovery-on-open race **without building the worker tier**. Recovery runs on the
writer's `open`, as today. Reads fan out across read replicas.
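
A sketch of the Stage 1 routing rule described above: every write goes to the one writer, reads rotate across read replicas, and only a read-write open runs the recovery sweep. `RegionPool`, `Request`, and the round-robin policy are assumptions for illustration; the `OpenMode` enum here is a local stand-in that mirrors the two modes named in the text.

```rust
/// Local stand-in for the two open modes named in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpenMode {
    ReadOnly,
    ReadWrite,
}

impl OpenMode {
    /// Read-only opens skip the recovery sweep; read-write opens run it.
    pub fn runs_recovery_sweep(self) -> bool {
        matches!(self, OpenMode::ReadWrite)
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Request {
    Read,
    Write,
}

/// Stage 1 pool for one region: a single writer plus read replicas.
pub struct RegionPool {
    writer: String,
    readers: Vec<String>,
    next_reader: usize,
}

impl RegionPool {
    pub fn new(writer: &str, readers: &[&str]) -> Self {
        Self {
            writer: writer.to_string(),
            readers: readers.iter().map(|r| r.to_string()).collect(),
            next_reader: 0,
        }
    }

    /// Returns the target node and the mode that node opened the repo with.
    pub fn route(&mut self, request: Request) -> (&str, OpenMode) {
        match request {
            Request::Write => (self.writer.as_str(), OpenMode::ReadWrite),
            // With no replicas configured, the writer also serves reads.
            Request::Read if self.readers.is_empty() => {
                (self.writer.as_str(), OpenMode::ReadWrite)
            }
            Request::Read => {
                let i = self.next_reader % self.readers.len();
                self.next_reader = self.next_reader.wrapping_add(1);
                (self.readers[i].as_str(), OpenMode::ReadOnly)
            }
        }
    }
}
```

Because only one node ever opens read-write, at most one recovery sweep can be in flight per region, which is the race the single writer sidesteps.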
**Control plane.** Thinnest viable: provisioning (`open-or-create` a repo —
largely doable by the data plane itself on first request), a tenant directory,
billing counters, and the RFC 0001 identity issuer. All soft state (P2).
**Auth (RFC 0001).** `mode = static` remains the default for M2M / CI;
`mode = oidc` is available and validated offline. The control plane runs the issuer
(its own, or wrapping WorkOS for human SSO). `hybrid` lets both coexist.
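
The three modes can be read as an acceptance table over credential kinds. The sketch below assumes the mode strings `static`, `oidc`, and `hybrid` from the paragraph above; the `AuthMode` and `Credential` types are hypothetical and RFC 0001 remains the authority on the real configuration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthMode {
    Static,
    Oidc,
    Hybrid,
}

/// Hypothetical credential kinds a request may present.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Credential {
    StaticToken,
    OidcJwt,
}

impl Default for AuthMode {
    /// `static` stays the default, as for M2M / CI today.
    fn default() -> Self {
        AuthMode::Static
    }
}

impl AuthMode {
    pub fn parse(value: &str) -> Option<Self> {
        match value {
            "static" => Some(AuthMode::Static),
            "oidc" => Some(AuthMode::Oidc),
            "hybrid" => Some(AuthMode::Hybrid),
            _ => None,
        }
    }

    /// `hybrid` accepts both kinds; the other modes accept exactly one.
    pub fn accepts(self, credential: Credential) -> bool {
        matches!(
            (self, credential),
            (AuthMode::Hybrid, _)
                | (AuthMode::Static, Credential::StaticToken)
                | (AuthMode::Oidc, Credential::OidcJwt)
        )
    }
}
```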
**Branching as product surface.** OmniGraph already has Git-style graph
branches with lazy fork — the same zero-copy, metadata-pointer design Neon
sells. Stage 1 exposes this directly: instant per-PR / dev / staging branches at
near-zero storage cost. No new engine work — a product packaging of an existing
capability.
**Deliberately not done.** No autoscaling of writes, no worker tier, no
reconciler, no BYOC. **Accepted limitation:** per-region write throughput is
bounded by one writer; a writer restart briefly pauses writes for that region.
**Exit criteria → Stage 2.** Single-writer throughput, write-pause blast
radius, or maintenance load (inline index builds / compaction) becomes the
binding constraint.
## Stage 2 — Elastic data plane + worker tier
**Property won:** horizontal write scale, and maintenance moved off the request
path — which also eliminates the recovery-on-open race.
**Architecture.**
- **Multiple writers** with **consistent-hash routing by repo URI**. A repo's
writes land on one node, so CAS contention is bounded and the Lance page cache
/ warm `Omnigraph` handle stay local.
- **Per-repo write coalescer** — concurrent `mutate_as`/`load` commits to one
repo batch into one manifest publish (the turbopuffer WAL-batching lesson:
beat contention with batching, not locks).
- **Three-tier cache** made explicit: object storage → NVMe SSD → in-process
(Lance page cache + warm handle), with routing affinity keeping a repo warm.
- **Worker tier** — background workers own index building (the deny-list
reconciler mandate), compaction (`optimize`), cleanup, **and recovery**.
Recovery moves from "every `open` runs the sweep" to "one leased worker per
repo owns recovery." This *is* the long-deferred background reconciler;
cloud is its forcing function.
- **Per-tenant resource bounds** — close the `invariants.md` resource-bounds
gap: enforced per-query memory/time budgets, plus `WorkloadController`
admission control, so multi-tenant compute has no noisy-neighbor failure.
- **Scale-to-zero** for cold tenants — evict idle handles, re-warm on first
request, bill by the second (the Neon model).
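
The consistent-hash routing bullet above can be sketched as a hash ring with virtual nodes. This is an illustration under assumptions: `HashRing` is a hypothetical type, and the hash (FNV-1a followed by MurmurHash3's 64-bit finalizer) is picked only because it is stable across processes, not because the RFC specifies it.

```rust
use std::collections::BTreeMap;

/// FNV-1a over the bytes, then MurmurHash3's 64-bit finalizer so that
/// similar strings spread across the ring. Stable across processes.
fn stable_hash(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for byte in data {
        h ^= u64::from(*byte);
        h = h.wrapping_mul(0x100000001b3);
    }
    h ^= h >> 33;
    h = h.wrapping_mul(0xff51afd7ed558ccd);
    h ^= h >> 33;
    h = h.wrapping_mul(0xc4ceb9fe1a85ec53);
    h ^ (h >> 33)
}

/// Hash ring mapping repo URIs to writer nodes (hypothetical type).
pub struct HashRing {
    ring: BTreeMap<u64, String>, // ring position -> node name
    vnodes: u32,
}

impl HashRing {
    pub fn new(vnodes: u32) -> Self {
        Self { ring: BTreeMap::new(), vnodes }
    }

    /// Place `vnodes` points for this node on the ring.
    pub fn add_node(&mut self, node: &str) {
        for i in 0..self.vnodes {
            let position = stable_hash(format!("{}#{}", node, i).as_bytes());
            self.ring.insert(position, node.to_string());
        }
    }

    pub fn remove_node(&mut self, node: &str) {
        self.ring.retain(|_, owner| owner.as_str() != node);
    }

    /// First ring point at or after the repo's hash, wrapping around.
    pub fn route(&self, repo_uri: &str) -> Option<&str> {
        let h = stable_hash(repo_uri.as_bytes());
        self.ring
            .range(h..)
            .next()
            .or_else(|| self.ring.iter().next())
            .map(|(_, node)| node.as_str())
    }
}
```

The useful property for cache affinity is that removing a writer only moves the repos that writer owned; every other repo keeps its node and its warm handle.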
**Control plane.** Gains orchestration: routing-hint distribution, compaction
scheduling, recovery-lease coordination. Still soft state (P2), still off-path.
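
Recovery-lease coordination can be pictured as a table with one expiring lease per repo. The sketch assumes the control-plane-issued variant and keeps the table in memory, consistent with soft state; `LeaseTable` and its methods are hypothetical, and the choice between this and an object-store-based lease is still an open decision.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Lease {
    holder: String,
    expires_at: Instant,
}

/// One recovery lease per repo (hypothetical, in-memory soft state).
#[derive(Default)]
pub struct LeaseTable {
    leases: HashMap<String, Lease>,
}

impl LeaseTable {
    /// Grant the lease if the repo is unleased, the lease has expired,
    /// or `worker` already holds it (a renewal). Returns whether
    /// `worker` owns recovery for `repo` after the call.
    pub fn try_acquire(&mut self, repo: &str, worker: &str, now: Instant, ttl: Duration) -> bool {
        let available = match self.leases.get(repo) {
            None => true,
            Some(lease) => lease.holder == worker || now >= lease.expires_at,
        };
        if available {
            self.leases.insert(
                repo.to_string(),
                Lease { holder: worker.to_string(), expires_at: now + ttl },
            );
        }
        available
    }

    /// Current holder, if any unexpired lease exists.
    pub fn holder(&self, repo: &str, now: Instant) -> Option<&str> {
        self.leases
            .get(repo)
            .filter(|lease| now < lease.expires_at)
            .map(|lease| lease.holder.as_str())
    }
}
```

Losing this table is harmless in the sketch: leases simply get re-acquired. A real mechanism would still need to ensure a worker whose lease expired mid-recovery cannot publish, for example by relying on the `__manifest` CAS.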
**Auth.** No change to the validation path. The control plane's config-bundle
sync (RFC 0001 `ControlPlaneSync`) may now feed a SCIM-sourced actor allowlist.
**Optional consistency knob.** With warm caches, a per-query `stale-ok` read
becomes viable (turbopuffer's sub-10ms eventual mode). Invariant 6 permits it
**only** as an explicit, read-only, opt-in mode; it is never the default.
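
The constraints on the stale-read knob are small enough to state as code. In this sketch `ReadConsistency`, `Operation`, and `Plan` are hypothetical names; the only rules encoded are the ones in the paragraph above: strong is the default, and stale reads are opt-in and read-only.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReadConsistency {
    Strong,
    StaleOk,
}

impl Default for ReadConsistency {
    /// Strong consistency unless the query explicitly opts out.
    fn default() -> Self {
        ReadConsistency::Strong
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operation {
    Read,
    Write,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Plan {
    /// Resolve the latest `__manifest` version before serving.
    ResolveLatestManifest,
    /// Serve from the warm cached snapshot without a manifest round trip.
    ServeCachedSnapshot,
}

/// `requested` is `None` when the query does not mention consistency.
pub fn plan(op: Operation, requested: Option<ReadConsistency>) -> Result<Plan, &'static str> {
    match (op, requested.unwrap_or_default()) {
        (Operation::Write, ReadConsistency::StaleOk) => Err("stale-ok is read-only"),
        (Operation::Read, ReadConsistency::StaleOk) => Ok(Plan::ServeCachedSnapshot),
        (_, ReadConsistency::Strong) => Ok(Plan::ResolveLatestManifest),
    }
}
```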
**Deliberately not done.** Data still resides in vendor-managed object storage.
**Exit criteria → Stage 3.** A customer requires data sovereignty (data may not
leave their account) or air-gapped operation.
## Stage 3 — BYOC / VPC / air-gapped
**Property won:** data sovereignty — the customer's graph data never leaves
their cloud account.
**Architecture.** The WarpStream BYOC split. The data plane (read replicas,
writers, worker tier — the Stage 1 *or* Stage 2 shape) and the customer's object
store run **inside the customer's VPC**. The vendor cloud keeps only the
soft-state orchestrator and the identity issuer. No customer graph data crosses
the boundary; no cross-account IAM into the customer's bucket. Air-gapped is the
same packaging with the control plane absent and config supplied as static
files.
**Auth.** This is where P4 (inherited from RFC 0001) pays off fully: the in-VPC data plane
validates tokens **offline** against cached JWKS. The vendor identity issuer is
the only cloud touchpoint and it is off the request path. Air-gapped: point at
the customer's own IdP, or `mode = static`, with JWKS/policy pre-seeded.
**Why this is mostly packaging.** Because P2/P3/P4 were honored from Stage 1 —
control plane thin and off-path, one config-driven binary, auth validated
offline — Stage 3 is boundary hardening and deployment templates (Helm /
Terraform), not an architectural change.
## Why three stages (first-principles)
- **Not one stage.** The managed offering must not wait on the reconciler — a
large build. Stage 1's single-writer design wins a real, sellable, managed
product with bounded complexity, and the load it collects is the evidence
that justifies Stage 2 (reversible-change discipline: ship, measure, then
invest).
- **Not collapsing 2 and 3.** *Scale* (Stage 2) and *sovereignty* (Stage 3) are
independent axes — a customer may demand BYOC before single-region scale runs
out, or the reverse. They share only the thin-control-plane prerequisite,
which is foundational (P2) anyway. **Stage 3 can therefore ship on the Stage 1
data-plane shape**; the 1→2→3 numbering is the expected-demand order, not a
hard dependency. If enterprise/sovereignty demand arrives first, do 1→3→2.
- **Not more than three.** The natural seams are exactly *managed+auth*,
*elastic+maintenance*, *sovereignty*. Finer splits would be invented
complexity, not earned.
## Open decisions
1. **Tenancy isolation model** — bucket-per-tenant vs prefix-per-tenant vs
account-per-tenant. This is the strongest isolation lever and is effectively irreversible; the
control plane should vend short-lived per-tenant scoped credentials
regardless. Decide before Stage 1.
2. **Recovery ownership** — Stage 1 leans on the single writer; Stage 2 needs a
per-repo recovery lease. Confirm the lease mechanism (object-store-based
lease vs control-plane-issued).
3. **Commit-latency model (P1)** — ratify object-storage-only commit, or
identify a concrete low-latency write requirement that would justify a fast
durable tier.
4. **RFC 0001 carry-overs** — degraded-mode JWKS grace window, revocation
strategy, whether VPC customers can override cloud-pushed Cedar policy.
5. **Stale-read knob** — ship the optional eventual-consistency read in
Stage 2, or defer.
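
For decision 1, the "short-lived per-tenant scoped credentials" idea can be sketched under the prefix-per-tenant option. `ScopedCredential` is a hypothetical type and the check below models only scope and expiry; in practice the object store's own IAM would enforce this.

```rust
use std::time::{Duration, Instant};

/// A credential limited to one tenant's key prefix and a short lifetime.
/// Hypothetical type for illustration.
pub struct ScopedCredential {
    tenant_prefix: String,
    expires_at: Instant,
}

impl ScopedCredential {
    /// Vend a credential for `tenant`, valid for `ttl` from `now`.
    pub fn vend(tenant: &str, now: Instant, ttl: Duration) -> Self {
        Self {
            // The trailing slash keeps `acme` from matching `acme-corp/...`.
            tenant_prefix: format!("{}/", tenant),
            expires_at: now + ttl,
        }
    }

    /// True only for unexpired access to keys under the tenant's prefix.
    pub fn allows(&self, key: &str, now: Instant) -> bool {
        now < self.expires_at && key.starts_with(self.tenant_prefix.as_str())
    }
}
```

The same shape carries over to bucket-per-tenant or account-per-tenant by swapping the prefix for a bucket or account identifier, which is why the credential vending can be decided independently of the isolation model.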
## Invariant analysis
| Invariant / deny-list item | Outcome |
|---|---|
| 2 — manifest-atomic graph visibility | ✅ unchanged; `__manifest` CAS is the commit point in every stage |
| 5 — recovery part of commit protocol | ✅ Stage 1 = open-time sweep; Stage 2 = leased worker; never weakened |
| 6 — strong consistency default | ✅ stale-read knob is explicit, read-only, non-default |
| 11 — transport/auth at the boundary | ✅ engine crates untouched; auth in `omnigraph-server` |
| 13 — failures bounded/observable | ✅ Stage 2 closes the per-query resource-bounds gap |
| Deny: custom WAL / metadata store | ✅ P1/P2 — object storage + `__manifest` only |
| Deny: cloud-only correctness / fork | ✅ P3 — one OSS binary, additive control plane |
| Deny: job queue for manifest-derivable state | ✅ worker tier is a reconciler, not a queue |
## Testing notes
- Stage 1: extend `omnigraph-server` tests for multi-replica read fan-out and
single-writer routing; reuse `failpoints` for writer-restart behavior.
- Stage 2: per-repo coalescer and routing need engine/storage-boundary tests
(`runs.rs`, `recovery.rs`); recovery-lease coverage belongs in `recovery.rs`.
- Stage 3: a deployment-template smoke test (data plane against an in-VPC-style
object store); confirm no control-plane call on the request path.
- Update [docs/user/deployment.md](../../user/deployment.md),
[docs/user/server.md](../../user/server.md) as each stage lands.