Drafts the cloud deployment design as three earned stages — managed single-region, elastic data plane with an off-path worker tier, then BYOC/VPC/air-gapped — each winning one irreducible property. Sets foundational principles (object-storage-only commit, a soft-state control plane off the request path, one config-driven binary) drawn from turbopuffer, Neon, and WarpStream, threads the RFC 0001 auth design through every stage, and records the open decisions and invariant analysis. https://claude.ai/code/session_01N22WDYC6vv2njR5Xu96QaC
RFC 0002 — Cloud Deployment Architecture (Staged)
Type: design proposal
Status: draft — not accepted, not implemented
Audience: maintainers reviewing the cloud offering and the OSS/Cloud boundary
Date: 2026-05-17
Depends on: RFC 0001 — Federated Authentication
This is a proposal, not current truth. Until accepted and implemented, the authoritative deployment story remains docs/user/deployment.md.
Summary
Defines how OmniGraph is deployed as a managed cloud offering, in three stages of increasing complexity. Each stage wins one irreducible property and pays only the complexity that property earns:
- Managed single-region — a customer can sign up, get a repo, and authenticate against a managed OmniGraph. Wins: managed + authenticated + multi-tenant.
- Elastic data plane + worker tier — write scale, and maintenance (indexing, compaction, recovery) moved off the request path. Wins: scale + off-path maintenance + no recovery-on-open race.
- BYOC / VPC / air-gapped — data plane in the customer's VPC, only a thin orchestrator in the vendor cloud. Wins: data sovereignty.
The same OSS binary runs in every stage and in a customer VPC; deployment mode is configuration. The auth design from RFC 0001 threads through unchanged — each stage only moves where the token issuer lives, never the validation path.
Motivation
OmniGraph's durable state lives entirely in object storage (Lance datasets +
the __manifest commit log); concurrency is optimistic CAS on __manifest.
That is already the object-storage-native architecture that turbopuffer,
LanceDB, Neon, and WarpStream converged on. The task is not to adopt it but to
lean into it — and to do so in stages so that a managed offering can ship and
collect real load before the expensive pieces (the reconciler, BYOC) are built.
Goals
- One OSS binary; deployment mode is configuration only.
- A managed offering reachable in Stage 1 without building the reconciler.
- Object storage as the only request-path dependency.
- The control plane is dispensable: it holds only soft, derivable state.
- Auth (RFC 0001) is identical across stages and across VPC/on-prem.
Non-goals
- A new storage substrate, WAL, or metadata database (deny-list).
- Changing the engine crates — transport/auth stay at the server boundary (Invariant 11).
- Multi-region active/active. Regions are independent stacks.
- Browser SSO / login UX (a control-plane concern, out of scope here).
Foundational principles (hold across all stages)
These are decided once and constrain every stage.
P1 — Object-storage-only commit
OmniGraph writes are batch-shaped (mutate_as, load, merges,
schema_apply) — not OLTP COMMITs. Neon adds a Safekeeper quorum tier in
front of object storage because OLTP commit cannot wait ~100-300ms for an S3
PUT. OmniGraph has no such requirement, so it takes the turbopuffer path:
commit straight to object storage, accept ~100-300ms write latency, run no
fast durable tier. This is a large, deliberate simplicity win. It is
hard to reverse — if a low-latency single-row write path ever becomes a product
requirement, that is when a fast durable tier earns its complexity, via a new
RFC.
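The commit path described above can be sketched as an optimistic compare-and-swap loop. This is a minimal illustration, not OmniGraph's implementation: `InMemoryObjectStore`, `put_if_version`, and `commit` are hypothetical names, and the in-memory store stands in for an object store that supports conditional writes.

```python
import threading
from dataclasses import dataclass, field


class CasConflict(Exception):
    """Raised when the expected manifest version is stale."""


@dataclass
class InMemoryObjectStore:
    """Hypothetical stand-in for an object store with conditional writes."""
    _objects: dict = field(default_factory=dict)  # key -> (version, value)
    _lock: threading.Lock = field(default_factory=threading.Lock)

    def get(self, key):
        # A missing key reads as version 0 with an empty manifest.
        with self._lock:
            return self._objects.get(key, (0, ()))

    def put_if_version(self, key, expected_version, value):
        # The write succeeds only if nobody else committed in between.
        with self._lock:
            current_version, _ = self._objects.get(key, (0, ()))
            if current_version != expected_version:
                raise CasConflict(key)
            self._objects[key] = (expected_version + 1, value)
            return expected_version + 1


def commit(store, manifest_key, new_entry, max_attempts=5):
    """Append new_entry to the manifest using CAS with bounded retry."""
    for _ in range(max_attempts):
        version, entries = store.get(manifest_key)
        try:
            return store.put_if_version(
                manifest_key, version, entries + (new_entry,)
            )
        except CasConflict:
            continue  # another writer won the race; re-read and retry
    raise CasConflict(f"gave up after {max_attempts} attempts")
```

The only durable step is the conditional write itself, which is why the design needs no separate fast tier as long as batch-shaped writes can absorb object-store latency.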
P2 — The control plane holds only soft state
WarpStream keeps an authoritative metadata store (file→offset mappings) in its
cloud. OmniGraph does not need this: the __manifest table already is the
authoritative, strongly-consistent metadata, and it lives in object storage.
The control plane therefore stores only soft, derivable state — tenant
directory, billing counters, routing hints, compaction schedules, recovery
leases. Everything it knows is rebuildable by scanning object storage. The
control plane is never on the request path; if it is down, existing tenants
keep serving (the turbopuffer 99.99%-uptime property).
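To illustrate what "rebuildable by scanning object storage" could mean for the tenant directory, here is a minimal sketch. The key layout `tenants/<tenant>/<repo>/__manifest/...` and the function name are assumptions made for the example; the RFC leaves the tenancy layout as an open decision.

```python
from collections import defaultdict


def rebuild_tenant_directory(object_keys):
    """Derive {tenant: [repos]} from an object listing.

    Assumes (hypothetically) that every repo has at least one key shaped
    tenants/<tenant>/<repo>/__manifest/<anything>.
    """
    directory = defaultdict(set)
    for key in object_keys:
        parts = key.split("/")
        is_manifest_key = (
            len(parts) >= 4
            and parts[0] == "tenants"
            and parts[3] == "__manifest"
        )
        if is_manifest_key:
            directory[parts[1]].add(parts[2])
    # Sort for a stable, comparable result.
    return {tenant: sorted(repos) for tenant, repos in directory.items()}
```

Because the directory is a pure function of the listing, losing the control plane's copy costs a rescan, not data.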
P3 — One binary, config-driven
The omnigraph-server container is identical in Stage 1, Stage 2, a customer
VPC, and air-gapped on-prem. A "cloud build" is configuration plus additive,
optional control-plane services — never a fork (deny-list: no Cloud fork;
correctness is always OSS).
P4 — Auth validation never makes a network call
Per RFC 0001: tokens are validated offline against cached JWKS. The token issuer may be cloud-hosted, but the data plane never calls it on the request path. This is what lets the identical data plane run in Stage 3's customer VPC.
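A minimal sketch of the "no network call on the request path" property follows. It is not the RFC 0001 implementation: `CachedKeySet`, `sign`, and `validate` are hypothetical names, the freshness window is an assumed parameter, and HMAC-SHA256 is used only to keep the example within the standard library, whereas a JWKS-based deployment would verify an asymmetric signature.

```python
import hashlib
import hmac
import time


class UnknownKey(Exception):
    pass


class StaleKeys(Exception):
    pass


class CachedKeySet:
    """Keys are loaded out of band; lookups never touch the network."""

    def __init__(self, max_age_seconds):
        self._keys = {}
        self._loaded_at = None
        self._max_age = max_age_seconds

    def load(self, keys_by_kid, now=None):
        # Called by a background sync task, never by a request handler.
        self._keys = dict(keys_by_kid)
        self._loaded_at = time.time() if now is None else now

    def lookup(self, kid, now=None):
        now = time.time() if now is None else now
        if self._loaded_at is None or now - self._loaded_at > self._max_age:
            raise StaleKeys("key set missing or older than the allowed window")
        try:
            return self._keys[kid]
        except KeyError:
            raise UnknownKey(kid) from None  # fail closed; do not fetch


def sign(key, payload):
    # Stand-in signature scheme (assumption): HMAC-SHA256.
    return hmac.new(key, payload, hashlib.sha256).digest()


def validate(keyset, kid, payload, signature, now=None):
    """Return True only if the signature matches a locally cached key."""
    key = keyset.lookup(kid, now=now)
    return hmac.compare_digest(sign(key, payload), signature)
```

An unknown key id fails closed instead of triggering a fetch, so request latency and availability never depend on the issuer.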
Architecture primitives
CONTROL PLANE (vendor cloud; soft state only; off request path)
- provisioning / tenant directory - billing / metering
- identity issuer (RFC 0001; may wrap WorkOS)
- orchestration: compaction schedule, recovery leases, routing hints
│ (async, never on request path)
┌──────────────────────────┼──────────────────────────────────────┐
│ DATA PLANE — omnigraph-server (identical OSS binary everywhere) │
│ read replicas · writer(s) · worker tier (Stage 2+) │
└──────────────────────────┬──────────────────────────────────────┘
▼
OBJECT STORAGE
Lance datasets + __manifest (the only request-path dependency)
Tiers, introduced progressively:
- Read replicas — open OpenMode::ReadOnly (skips the recovery sweep), snapshot-isolated, fan out freely.
- Writer(s) — open ReadWrite; route by repo so CAS contention and cache stay local.
- Worker tier (Stage 2+) — background indexing, compaction, cleanup, recovery. Off the request path.
Stage 1 — Managed single-region
Property won: a customer can sign up, get a repo, authenticate, and use a managed OmniGraph.
Architecture. Single region. One object store (prefix-per-tenant or
bucket-per-tenant — see open decisions). Data plane = a pool of read
replicas plus a single writer replica per region. The single writer is
deliberate: it sidesteps both __manifest CAS contention and the
recovery-on-open race without building the worker tier. Recovery runs on the
writer's open, as today. Reads fan out across read replicas.
Control plane. Thinnest viable: provisioning (open-or-create a repo —
largely doable by the data plane itself on first request), a tenant directory,
billing counters, and the RFC 0001 identity issuer. All soft state (P2).
Auth (RFC 0001). mode = static remains the default for M2M / CI;
mode = oidc available, validated offline. The control plane runs the issuer
(its own, or wrapping WorkOS for human SSO). hybrid lets both coexist.
Branching as product surface. OmniGraph already has Git-style graph branches with lazy fork — the same zero-copy, metadata-pointer design Neon sells. Stage 1 exposes this directly: instant per-PR / dev / staging branches at near-zero storage cost. No new engine work — a product packaging of an existing capability.
Deliberately not done. No autoscaling of writes, no worker tier, no reconciler, no BYOC. Accepted limitation: per-region write throughput is bounded by one writer; a writer restart briefly pauses writes for that region.
Exit criteria → Stage 2. Single-writer throughput, write-pause blast radius, or maintenance load (inline index builds / compaction) becomes the binding constraint.
Stage 2 — Elastic data plane + worker tier
Property won: horizontal write scale, and maintenance moved off the request path — which also eliminates the recovery-on-open race.
Architecture.
- Multiple writers with consistent-hash routing by repo URI. A repo's writes land on one node, so CAS contention is bounded and the Lance page cache / warm Omnigraph handle stay local.
- Per-repo write coalescer — concurrent mutate_as / load commits to one repo batch into one manifest publish (the turbopuffer WAL-batching lesson: beat contention with batching, not locks).
- Three-tier cache made explicit: object storage → NVMe SSD → in-process (Lance page cache + warm handle), with routing affinity keeping a repo warm.
- Worker tier — background workers own index building (the deny-list reconciler mandate), compaction (optimize), cleanup, and recovery. Recovery moves from "every open runs the sweep" to "one leased worker per repo owns recovery." This is the long-deferred background reconciler; cloud is its forcing function.
- Per-tenant resource bounds — close the invariants.md resource-bounds gap: enforced per-query memory/time budgets, plus WorkloadController admission control, so multi-tenant compute has no noisy-neighbor failure.
- Scale-to-zero for cold tenants — evict idle handles, re-warm on first request, bill by the second (the Neon model).
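The consistent-hash routing in the first bullet could look roughly like the sketch below. The `HashRing` class, the virtual-node count, and the choice of SHA-256 are illustrative assumptions, not decisions made by this RFC.

```python
import bisect
import hashlib


class HashRing:
    """Maps a repo URI to one writer node; hypothetical sketch."""

    def __init__(self, nodes, vnodes=64):
        # Each node owns several points on the ring to smooth the distribution.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def route(self, repo_uri):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(repo_uri))
        return self._ring[idx % len(self._ring)][1]
```

Removing a writer only reassigns the repos that writer owned; every other repo keeps its node and therefore its warm cache.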
Control plane. Gains orchestration: routing-hint distribution, compaction scheduling, recovery-lease coordination. Still soft state (P2), still off-path.
Auth. No change to the validation path. The control plane's config-bundle
sync (RFC 0001 ControlPlaneSync) may now feed a SCIM-sourced actor allowlist.
Optional consistency knob. With warm caches, a per-query stale-ok read
becomes viable (turbopuffer's sub-10ms eventual mode). Invariant 6 permits it
only as an explicit, read-only, non-default option, so it is exposed strictly
as opt-in.
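One way to picture the knob is as a per-query parameter whose default stays strongly consistent. In this sketch the `Consistency` enum, `pick_snapshot`, and its arguments are hypothetical names, not an existing OmniGraph API.

```python
from enum import Enum


class Consistency(Enum):
    STRONG = "strong"      # default: read the latest committed manifest
    STALE_OK = "stale_ok"  # opt-in: serve from the locally cached snapshot


def pick_snapshot(read_latest_version, cached_version,
                  consistency=Consistency.STRONG, read_only=True):
    """Return the manifest version a query should run against."""
    if consistency is Consistency.STALE_OK:
        if not read_only:
            raise ValueError("stale_ok is only permitted on read-only queries")
        if cached_version is not None:
            return cached_version  # skip the object-store round trip
    # Strong path (and cold-cache fallback): consult object storage.
    return read_latest_version()
```

Callers that pass nothing get the strong path, so a stale read can only happen when a query asks for it.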
Deliberately not done. Data still resides in vendor-managed object storage.
Exit criteria → Stage 3. A customer requires data sovereignty (data may not leave their account) or air-gapped operation.
Stage 3 — BYOC / VPC / air-gapped
Property won: data sovereignty — the customer's graph data never leaves their cloud account.
Architecture. The WarpStream BYOC split. The data plane (read replicas, writers, worker tier — the Stage 1 or Stage 2 shape) and the customer's object store run inside the customer's VPC. The vendor cloud keeps only the soft-state orchestrator and the identity issuer. No customer graph data crosses the boundary; no cross-account IAM into the customer's bucket. Air-gapped is the same packaging with the control plane absent and config supplied as static files.
Auth. This is where P4 (offline validation, per RFC 0001) pays off fully: the
in-VPC data plane validates tokens offline against cached JWKS. The vendor
identity issuer is the only cloud touchpoint and it is off the request path.
Air-gapped: point at the customer's own IdP, or mode = static, with
JWKS/policy pre-seeded.
Why this is mostly packaging. Because P2/P3/P4 were honored from Stage 1 — control plane thin and off-path, one config-driven binary, auth validated offline — Stage 3 is boundary hardening and deployment templates (Helm / Terraform), not an architectural change.
Why three stages (first-principles)
- Not one stage. The managed offering must not wait on the reconciler — a large build. Stage 1's single-writer design wins a real, sellable, managed product with bounded complexity, and the load it collects is the evidence that justifies Stage 2 (reversible-change discipline: ship, measure, then invest).
- Not collapsing 2 and 3. Scale (Stage 2) and sovereignty (Stage 3) are independent axes — a customer may demand BYOC before single-region scale runs out, or the reverse. They share only the thin-control-plane prerequisite, which is foundational (P2) anyway. Stage 3 can therefore ship on the Stage 1 data-plane shape; the 1→2→3 numbering is the expected-demand order, not a hard dependency. If enterprise/sovereignty demand arrives first, do 1→3→2.
- Not more than three. The natural seams are exactly managed+auth, elastic+maintenance, sovereignty. Finer splits would be invented complexity, not earned.
Open decisions
- Tenancy isolation model — bucket-per-tenant vs prefix-per-tenant vs account-per-tenant. Strongest lever and effectively irreversible; the control plane should vend short-lived per-tenant scoped credentials regardless. Decide before Stage 1.
- Recovery ownership — Stage 1 leans on the single writer; Stage 2 needs a per-repo recovery lease. Confirm the lease mechanism (object-store-based lease vs control-plane-issued).
- Commit-latency model (P1) — ratify object-storage-only commit, or identify a concrete low-latency write requirement that would justify a fast durable tier.
- RFC 0001 carry-overs — degraded-mode JWKS grace window, revocation strategy, whether VPC customers can override cloud-pushed Cedar policy.
- Stale-read knob — ship the optional eventual-consistency read in Stage 2, or defer.
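For the recovery-ownership decision, the lease semantics under discussion can be sketched independently of where the lease lives. `Lease` and `LeaseTable` are hypothetical names, and the in-memory table is an assumption for the example; an object-store-based variant would have to perform the same check-and-write as a single conditional put.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    holder: str
    expires_at: float


class LeaseTable:
    """Hypothetical in-memory stand-in for per-repo recovery leases."""

    def __init__(self):
        self._leases = {}

    def try_acquire(self, repo, worker, now, ttl):
        """Grant or renew the lease unless another worker holds a live one."""
        current = self._leases.get(repo)
        held_by_other = (
            current is not None
            and current.expires_at > now
            and current.holder != worker
        )
        if held_by_other:
            return False
        self._leases[repo] = Lease(holder=worker, expires_at=now + ttl)
        return True
```

A worker that crashes simply stops renewing, and another worker takes over once the lease expires.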
Invariant analysis
| Invariant / deny-list item | Outcome |
|---|---|
| 2 — manifest-atomic graph visibility | ✅ unchanged; __manifest CAS is the commit point in every stage |
| 5 — recovery part of commit protocol | ✅ Stage 1 = open-time sweep; Stage 2 = leased worker; never weakened |
| 6 — strong consistency default | ✅ stale-read knob is explicit, read-only, non-default |
| 11 — transport/auth at the boundary | ✅ engine crates untouched; auth in omnigraph-server |
| 13 — failures bounded/observable | ✅ Stage 2 closes the per-query resource-bounds gap |
| Deny: custom WAL / metadata store | ✅ P1/P2 — object storage + __manifest only |
| Deny: cloud-only correctness / fork | ✅ P3 — one OSS binary, additive control plane |
| Deny: job queue for manifest-derivable state | ✅ worker tier is a reconciler, not a queue |
Testing notes
- Stage 1: extend omnigraph-server tests for multi-replica read fan-out and single-writer routing; reuse failpoints for writer-restart behavior.
- Stage 2: per-repo coalescer and routing need engine/storage-boundary tests (runs.rs, recovery.rs); recovery-lease coverage belongs in recovery.rs.
- Stage 3: a deployment-template smoke test (data plane against an in-VPC-style object store); confirm no control-plane call on the request path.
- Update docs/user/deployment.md, docs/user/server.md as each stage lands.