omnigraph/docs/dev/rfcs/0002-cloud-deployment-architecture.md
Claude 9e5a86580d
Add RFC 0002: staged cloud deployment architecture
Drafts the cloud deployment design as three earned stages — managed
single-region, elastic data plane with an off-path worker tier, then
BYOC/VPC/air-gapped — each winning one irreducible property. Sets
foundational principles (object-storage-only commit, a soft-state
control plane off the request path, one config-driven binary) drawn
from turbopuffer, Neon, and WarpStream, threads the RFC 0001 auth
design through every stage, and records the open decisions and
invariant analysis.

https://claude.ai/code/session_01N22WDYC6vv2njR5Xu96QaC
2026-05-17 03:23:14 +00:00


RFC 0002 — Cloud Deployment Architecture (Staged)

Type: design proposal
Status: draft (not accepted, not implemented)
Audience: maintainers reviewing the cloud offering and the OSS/Cloud boundary
Date: 2026-05-17
Depends on: RFC 0001 (Federated Authentication)

This is a proposal, not current truth. Until accepted and implemented, the authoritative deployment story remains docs/user/deployment.md.

Summary

Defines how OmniGraph is deployed as a managed cloud offering, in three stages of increasing complexity. Each stage wins one irreducible property and pays only the complexity that property earns:

  1. Managed single-region — a customer can sign up, get a repo, and authenticate against a managed OmniGraph. Wins: managed + authenticated + multi-tenant.
  2. Elastic data plane + worker tier — write scale, and maintenance (indexing, compaction, recovery) moved off the request path. Wins: scale + off-path maintenance + no recovery-on-open race.
  3. BYOC / VPC / air-gapped — data plane in the customer's VPC, only a thin orchestrator in the vendor cloud. Wins: data sovereignty.

The same OSS binary runs in every stage and in a customer VPC; deployment mode is configuration. The auth design from RFC 0001 threads through unchanged — each stage only moves where the token issuer lives, never the validation path.

Motivation

OmniGraph's durable state lives entirely in object storage (Lance datasets + the __manifest commit log); concurrency is optimistic CAS on __manifest. That is already the object-storage-native architecture that turbopuffer, LanceDB, Neon, and WarpStream converged on. The task is not to adopt it but to lean into it — and to do so in stages so that a managed offering can ship and collect real load before the expensive pieces (the reconciler, BYOC) are built.

Goals

  • One OSS binary; deployment mode is configuration only.
  • A managed offering reachable in Stage 1 without building the reconciler.
  • Object storage as the only request-path dependency.
  • The control plane is dispensable: it holds only soft, derivable state.
  • Auth (RFC 0001) is identical across stages and across VPC/on-prem.

Non-goals

  • A new storage substrate, WAL, or metadata database (deny-list).
  • Changing the engine crates — transport/auth stay at the server boundary (Invariant 11).
  • Multi-region active/active. Regions are independent stacks.
  • Browser SSO / login UX (a control-plane concern, out of scope here).

Foundational principles (hold across all stages)

These are decided once and constrain every stage.

P1 — Object-storage-only commit

OmniGraph writes are batch-shaped (mutate_as, load, merges, schema_apply) — not OLTP COMMITs. Neon adds a Safekeeper quorum tier in front of object storage because OLTP commit cannot wait ~100-300ms for an S3 PUT. OmniGraph has no such requirement, so it takes the turbopuffer path: commit straight to object storage, accept ~100-300ms write latency, run no fast durable tier. This is a large, deliberate simplicity win. It is hard to reverse — if a low-latency single-row write path ever becomes a product requirement, that is when a fast durable tier earns its complexity, via a new RFC.
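A minimal sketch of the commit shape P1 relies on: read the manifest version, build the new state, and publish with a compare-and-swap, retrying if another writer won. Everything here is illustrative. `ObjectStore`, `put_if_version`, and `commit_batch` are hypothetical names, the store is an in-memory stand-in for a bucket with conditional writes, and the real `__manifest` is a table, not the byte log used below.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for an object store with conditional writes.
#[derive(Default)]
pub struct ObjectStore {
    objects: HashMap<String, (u64, Vec<u8>)>, // key -> (version, bytes)
}

#[derive(Debug, PartialEq)]
pub enum PutError {
    /// Another writer published first; the caller must re-read and retry.
    VersionMismatch { current: u64 },
}

impl ObjectStore {
    /// Returns the stored version and bytes, or version 0 if the key is absent.
    pub fn get(&self, key: &str) -> (u64, Vec<u8>) {
        self.objects.get(key).cloned().unwrap_or((0, Vec::new()))
    }

    /// Compare-and-swap: succeeds only if `expected` is still the stored version.
    pub fn put_if_version(
        &mut self,
        key: &str,
        expected: u64,
        bytes: Vec<u8>,
    ) -> Result<u64, PutError> {
        let current = self.objects.get(key).map(|(v, _)| *v).unwrap_or(0);
        if current != expected {
            return Err(PutError::VersionMismatch { current });
        }
        self.objects.insert(key.to_string(), (current + 1, bytes));
        Ok(current + 1)
    }
}

/// Appends one batch entry to the manifest, retrying on CAS conflict.
/// Returns the committed manifest version, or `None` if retries ran out.
pub fn commit_batch(store: &mut ObjectStore, batch: &str, max_retries: u32) -> Option<u64> {
    for _ in 0..=max_retries {
        let (version, mut log) = store.get("__manifest");
        log.extend_from_slice(batch.as_bytes());
        log.push(b'\n');
        match store.put_if_version("__manifest", version, log) {
            Ok(new_version) => return Some(new_version),
            // Lost the race: loop, re-read the newer manifest, and try again.
            Err(PutError::VersionMismatch { .. }) => continue,
        }
    }
    None
}
```

The only durable step is the conditional put, which is why the write latency is one object-storage round trip and no separate fast tier is needed.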

P2 — The control plane holds only soft state

WarpStream keeps an authoritative metadata store (file→offset mappings) in its cloud. OmniGraph does not need this: the __manifest table already is the authoritative, strongly-consistent metadata, and it lives in object storage. The control plane therefore stores only soft, derivable state — tenant directory, billing counters, routing hints, compaction schedules, recovery leases. Everything it knows is rebuildable by scanning object storage. The control plane is never on the request path; if it is down, existing tenants keep serving (the turbopuffer 99.99%-uptime property).
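To make "rebuildable by scanning object storage" concrete, the sketch below reconstructs a tenant directory from a key listing alone. The key layout `tenants/<tenant>/<repo>/...` and the name `rebuild_directory` are assumptions for illustration; this RFC does not fix a layout (see the tenancy open decision).

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Tenant directory: tenant id -> repo names. Soft state, never authoritative.
pub type TenantDirectory = BTreeMap<String, BTreeSet<String>>;

/// Rebuilds the directory from an object-store key listing.
/// Assumes a hypothetical layout: `tenants/<tenant>/<repo>/<object path>`.
pub fn rebuild_directory<'a>(keys: impl IntoIterator<Item = &'a str>) -> TenantDirectory {
    let mut directory = TenantDirectory::new();
    for key in keys {
        let mut parts = key.split('/');
        // Keys outside the assumed layout are ignored rather than guessed at.
        if let (Some("tenants"), Some(tenant), Some(repo), Some(_)) =
            (parts.next(), parts.next(), parts.next(), parts.next())
        {
            if !tenant.is_empty() && !repo.is_empty() {
                directory
                    .entry(tenant.to_string())
                    .or_default()
                    .insert(repo.to_string());
            }
        }
    }
    directory
}
```

If the control plane loses its database, running a scan like this over a bucket listing restores the directory without touching the data plane.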

P3 — One binary, config-driven

The omnigraph-server container is identical in Stage 1, Stage 2, a customer VPC, and air-gapped on-prem. A "cloud build" is configuration plus additive, optional control-plane services — never a fork (deny-list: no Cloud fork; correctness is always OSS).
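As an illustration of "deployment mode is configuration", the sketch below derives a process's behavior from a small `key = value` config. The config keys, `DeploymentConfig`, and `Role` are hypothetical; only `OpenMode::ReadOnly`, `ReadWrite`, and the `static` / `oidc` / `hybrid` auth modes are named in this RFC and RFC 0001. The worker tier is left out because its open mode is not specified here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OpenMode { ReadOnly, ReadWrite }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Role { ReadReplica, Writer }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AuthMode { Static, Oidc, Hybrid }

#[derive(Debug, Clone, PartialEq)]
pub struct DeploymentConfig {
    pub role: Role,
    pub auth: AuthMode,
    /// `None` models an air-gapped install with no control plane.
    pub control_plane_url: Option<String>,
}

impl DeploymentConfig {
    /// Parses hypothetical `key = value` lines; unknown settings are rejected.
    pub fn parse(text: &str) -> Result<Self, String> {
        let mut role = None;
        let mut auth = AuthMode::Static; // static stays the default auth mode
        let mut control_plane_url = None;
        for line in text.lines().map(str::trim).filter(|l| !l.is_empty()) {
            let (key, value) = line
                .split_once('=')
                .ok_or_else(|| format!("expected key=value: {line}"))?;
            match (key.trim(), value.trim()) {
                ("role", "read-replica") => role = Some(Role::ReadReplica),
                ("role", "writer") => role = Some(Role::Writer),
                ("auth.mode", "static") => auth = AuthMode::Static,
                ("auth.mode", "oidc") => auth = AuthMode::Oidc,
                ("auth.mode", "hybrid") => auth = AuthMode::Hybrid,
                ("control_plane.url", url) => control_plane_url = Some(url.to_string()),
                (k, v) => return Err(format!("unsupported setting: {k}={v}")),
            }
        }
        let role = role.ok_or("missing required key: role")?;
        Ok(Self { role, auth, control_plane_url })
    }

    /// The same binary opens a repo differently depending only on its role.
    pub fn open_mode(&self) -> OpenMode {
        match self.role {
            Role::ReadReplica => OpenMode::ReadOnly,
            Role::Writer => OpenMode::ReadWrite,
        }
    }
}
```

A managed writer and an air-gapped read replica differ only in the text they are handed; neither needs a different build.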

P4 — Auth validation never makes a network call

Per RFC 0001: tokens are validated offline against cached JWKS. The token issuer may be cloud-hosted, but the data plane never calls it on the request path. This is what lets the identical data plane run in Stage 3's customer VPC.
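The shape of an offline check can be sketched as follows. This is not RFC 0001's implementation: `JwksCache`, `Token`, and the `grace` window are hypothetical, key material is opaque bytes, and signature verification is passed in as a closure because no cryptography is shown here. The point is only that `validate` reads local state and performs no I/O.

```rust
use std::collections::HashMap;

/// Cached signing keys, refreshed out of band (never on the request path).
pub struct JwksCache {
    keys: HashMap<String, Vec<u8>>, // kid -> opaque public key material
    fetched_at: u64,                // unix seconds of the last successful refresh
    max_age: u64,                   // seconds the cache counts as fresh
    grace: u64,                     // extra seconds tolerated if the issuer is unreachable
}

pub struct Token<'a> {
    pub kid: &'a str,
    pub expires_at: u64,
    pub payload: &'a [u8],
    pub signature: &'a [u8],
}

#[derive(Debug, PartialEq)]
pub enum Reject { CacheTooStale, UnknownKey, Expired, BadSignature }

impl JwksCache {
    pub fn new(keys: HashMap<String, Vec<u8>>, fetched_at: u64, max_age: u64, grace: u64) -> Self {
        Self { keys, fetched_at, max_age, grace }
    }

    /// Validates a token using only cached keys and the caller's clock.
    /// `verify(key, payload, signature)` stands in for real signature checking.
    pub fn validate(
        &self,
        token: &Token<'_>,
        now: u64,
        verify: impl Fn(&[u8], &[u8], &[u8]) -> bool,
    ) -> Result<(), Reject> {
        if now > self.fetched_at + self.max_age + self.grace {
            return Err(Reject::CacheTooStale);
        }
        let key = self.keys.get(token.kid).ok_or(Reject::UnknownKey)?;
        if now >= token.expires_at {
            return Err(Reject::Expired);
        }
        if !verify(key.as_slice(), token.payload, token.signature) {
            return Err(Reject::BadSignature);
        }
        Ok(())
    }
}
```

How long the grace window should be, and how revocation interacts with it, are the RFC 0001 carry-overs listed under open decisions.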

Architecture primitives

        CONTROL PLANE (vendor cloud; soft state only; off request path)
        - provisioning / tenant directory   - billing / metering
        - identity issuer (RFC 0001; may wrap WorkOS)
        - orchestration: compaction schedule, recovery leases, routing hints
                              │  (async, never on request path)
   ┌──────────────────────────┼──────────────────────────────────────┐
   │  DATA PLANE — omnigraph-server (identical OSS binary everywhere)  │
   │    read replicas  ·  writer(s)  ·  worker tier (Stage 2+)        │
   └──────────────────────────┬──────────────────────────────────────┘
                              ▼
                       OBJECT STORAGE
        Lance datasets + __manifest  (the only request-path dependency)

Tiers, introduced progressively:

  • Read replicas — open OpenMode::ReadOnly (skips the recovery sweep), snapshot-isolated, fan out freely.
  • Writer(s) — open ReadWrite; route by repo so CAS contention and cache stay local.
  • Worker tier (Stage 2+) — background indexing, compaction, cleanup, recovery. Off the request path.
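Routing writes by repo can be sketched with rendezvous (highest-random-weight) hashing, one scheme with consistent-hash behavior. This RFC does not choose a scheme, so treat this as an assumption; `route_writer` and the writer names are hypothetical. FNV-1a is written out by hand so the mapping is stable across processes and Rust versions.

```rust
/// FNV-1a 64-bit hash, spelled out so every node computes the same value.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Picks the writer for a repo: score every (writer, repo) pair and take the
/// highest. Removing a writer only remaps the repos that writer owned.
pub fn route_writer<'a>(repo_uri: &str, writers: &[&'a str]) -> Option<&'a str> {
    writers.iter().copied().max_by_key(|writer| {
        let mut input = Vec::with_capacity(writer.len() + repo_uri.len() + 1);
        input.extend_from_slice(writer.as_bytes());
        input.push(0); // separator so ("ab", "c") and ("a", "bc") hash differently
        input.extend_from_slice(repo_uri.as_bytes());
        fnv1a(&input)
    })
}
```

Because every router computes the same answer from the same writer list, no coordination is needed on the request path; the control plane only has to distribute the list.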

Stage 1 — Managed single-region

Property won: a customer can sign up, get a repo, authenticate, and use a managed OmniGraph.

Architecture. Single region. One object store (prefix-per-tenant or bucket-per-tenant — see open decisions). Data plane = a pool of read replicas plus a single writer replica per region. The single writer is deliberate: it sidesteps both __manifest CAS contention and the recovery-on-open race without building the worker tier. Recovery runs on the writer's open, as today. Reads fan out across read replicas.

Control plane. Thinnest viable: provisioning (open-or-create a repo — largely doable by the data plane itself on first request), a tenant directory, billing counters, and the RFC 0001 identity issuer. All soft state (P2).

Auth (RFC 0001). mode = static remains the default for M2M / CI; mode = oidc available, validated offline. The control plane runs the issuer (its own, or wrapping WorkOS for human SSO). hybrid lets both coexist.

Branching as product surface. OmniGraph already has Git-style graph branches with lazy fork — the same zero-copy, metadata-pointer design Neon sells. Stage 1 exposes this directly: instant per-PR / dev / staging branches at near-zero storage cost. No new engine work — a product packaging of an existing capability.

Deliberately not done. No autoscaling of writes, no worker tier, no reconciler, no BYOC. Accepted limitation: per-region write throughput is bounded by one writer; a writer restart briefly pauses writes for that region.

Exit criteria → Stage 2. Single-writer throughput, write-pause blast radius, or maintenance load (inline index builds / compaction) becomes the binding constraint.

Stage 2 — Elastic data plane + worker tier

Property won: horizontal write scale, and maintenance moved off the request path — which also eliminates the recovery-on-open race.

Architecture.

  • Multiple writers with consistent-hash routing by repo URI. A repo's writes land on one node, so CAS contention is bounded and the Lance page cache / warm Omnigraph handle stay local.
  • Per-repo write coalescer — concurrent mutate_as/load commits to one repo batch into one manifest publish (the turbopuffer WAL-batching lesson: beat contention with batching, not locks).
  • Three-tier cache made explicit: object storage → NVMe SSD → in-process (Lance page cache + warm handle), with routing affinity keeping a repo warm.
  • Worker tier — background workers own index building (the deny-list reconciler mandate), compaction (optimize), cleanup, and recovery. Recovery moves from "every open runs the sweep" to "one leased worker per repo owns recovery." This is the long-deferred background reconciler; cloud is its forcing function.
  • Per-tenant resource bounds — close the invariants.md resource-bounds gap: enforced per-query memory/time budgets, plus WorkloadController admission control, so multi-tenant compute has no noisy-neighbor failure.
  • Scale-to-zero for cold tenants — evict idle handles, re-warm on first request, bill by the second (the Neon model).
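The per-repo write coalescer can be modeled as a queue that turns many submitted writes into one manifest publish. This is a single-threaded sketch under stated assumptions: `Coalescer`, `PendingWrite`, and `Publish` are hypothetical names, the version counter stands in for the `__manifest` CAS, and flush timing (size or deadline triggers) is left out.

```rust
use std::collections::HashMap;

/// One queued write. The payload string stands in for a real mutation batch.
pub struct PendingWrite {
    pub actor: String,
    pub payload: String,
}

/// Record of a single manifest publish that covered many writes.
#[derive(Debug, PartialEq)]
pub struct Publish {
    pub repo: String,
    pub manifest_version: u64,
    pub writes: usize,
}

#[derive(Default)]
pub struct Coalescer {
    pending: HashMap<String, Vec<PendingWrite>>, // repo -> queued writes
    versions: HashMap<String, u64>,              // repo -> last published version
}

impl Coalescer {
    /// Queues a write; nothing is published yet.
    pub fn submit(&mut self, repo: &str, write: PendingWrite) {
        self.pending.entry(repo.to_string()).or_default().push(write);
    }

    /// Publishes everything queued for `repo` as one manifest version.
    /// Returns `None` when there is nothing to publish.
    pub fn flush(&mut self, repo: &str) -> Option<Publish> {
        let writes = self.pending.remove(repo).filter(|w| !w.is_empty())?;
        let version = self.versions.entry(repo.to_string()).or_insert(0);
        *version += 1; // one version bump, however many writes were queued
        Some(Publish {
            repo: repo.to_string(),
            manifest_version: *version,
            writes: writes.len(),
        })
    }
}
```

Three concurrent commits to one repo cost one CAS attempt instead of three, which is how batching reduces contention without introducing locks.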

Control plane. Gains orchestration: routing-hint distribution, compaction scheduling, recovery-lease coordination. Still soft state (P2), still off-path.
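Recovery-lease coordination might look like the sketch below: one holder per repo, time-bounded, renewable by the holder, and claimable by another worker after expiry. The lease mechanism is still an open decision in this RFC, so `LeaseTable` and `try_acquire` are hypothetical and the in-memory map stands in for whichever backing store is chosen. Clock skew is ignored.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Lease {
    pub holder: String,
    pub expires_at: u64, // unix seconds
}

#[derive(Default)]
pub struct LeaseTable {
    leases: HashMap<String, Lease>, // repo -> current lease
}

impl LeaseTable {
    /// Grants the lease if it is free, expired, or already held by `worker`
    /// (a renewal). Returns false while another worker holds a live lease.
    pub fn try_acquire(&mut self, repo: &str, worker: &str, now: u64, ttl: u64) -> bool {
        let held_by_other = self
            .leases
            .get(repo)
            .map_or(false, |l| l.expires_at > now && l.holder != worker);
        if held_by_other {
            return false;
        }
        let lease = Lease { holder: worker.to_string(), expires_at: now + ttl };
        self.leases.insert(repo.to_string(), lease);
        true
    }

    /// The worker that currently owns recovery for `repo`, if any.
    pub fn holder(&self, repo: &str, now: u64) -> Option<&str> {
        self.leases
            .get(repo)
            .filter(|l| l.expires_at > now)
            .map(|l| l.holder.as_str())
    }
}
```

Losing the table is safe under P2: leases expire on their own, and a worker that re-acquires simply reruns a sweep that recovery already has to tolerate.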

Auth. No change to the validation path. The control plane's config-bundle sync (RFC 0001 ControlPlaneSync) may now feed a SCIM-sourced actor allowlist.

Optional consistency knob. With warm caches, a per-query stale-ok read becomes viable (turbopuffer's sub-10ms eventual mode). Invariant 6 permits it only as an explicit, read-only, opt-in setting, so strong consistency stays the default.
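One way to encode those constraints is in the type a query carries, so that an absent setting always means strong consistency and a write can never opt in. `Consistency`, `Operation`, `SnapshotSource`, and `resolve_snapshot` are hypothetical names for this sketch, not an existing API.

```rust
/// Read consistency requested by a query.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Consistency { Strong, StaleOk }

impl Default for Consistency {
    // Strong consistency is what a query gets when it asks for nothing.
    fn default() -> Self { Consistency::Strong }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Operation { Read, Write }

/// Where a query resolves the snapshot version it runs against.
#[derive(Debug, PartialEq)]
pub enum SnapshotSource {
    /// Check the latest manifest in object storage.
    LatestManifest,
    /// Reuse the snapshot already held by a warm in-process handle.
    WarmHandle,
}

pub fn resolve_snapshot(
    op: Operation,
    requested: Option<Consistency>,
    handle_is_warm: bool,
) -> Result<SnapshotSource, &'static str> {
    match (requested.unwrap_or_default(), op) {
        // Writes may never run against a possibly stale snapshot.
        (Consistency::StaleOk, Operation::Write) => Err("stale-ok applies to reads only"),
        (Consistency::StaleOk, Operation::Read) if handle_is_warm => Ok(SnapshotSource::WarmHandle),
        // Strong reads, all writes, and stale-ok reads on a cold handle.
        _ => Ok(SnapshotSource::LatestManifest),
    }
}
```

A stale-ok read on a cold handle falls back to the manifest, so the knob can only make a read faster, never less correct than the caller asked for.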

Deliberately not done. Data still resides in vendor-managed object storage.

Exit criteria → Stage 3. A customer requires data sovereignty (data may not leave their account) or air-gapped operation.

Stage 3 — BYOC / VPC / air-gapped

Property won: data sovereignty — the customer's graph data never leaves their cloud account.

Architecture. The WarpStream BYOC split. The data plane (read replicas, writers, worker tier — the Stage 1 or Stage 2 shape) and the customer's object store run inside the customer's VPC. The vendor cloud keeps only the soft-state orchestrator and the identity issuer. No customer graph data crosses the boundary; no cross-account IAM into the customer's bucket. Air-gapped is the same packaging with the control plane absent and config supplied as static files.

Auth. This is where P4 (offline validation, per RFC 0001) pays off fully: the in-VPC data plane validates tokens offline against cached JWKS. The vendor identity issuer is the only cloud touchpoint and it is off the request path. Air-gapped: point at the customer's own IdP, or mode = static, with JWKS/policy pre-seeded.

Why this is mostly packaging. Because P2/P3/P4 were honored from Stage 1 — control plane thin and off-path, one config-driven binary, auth validated offline — Stage 3 is boundary hardening and deployment templates (Helm / Terraform), not an architectural change.

Why three stages (first-principles)

  • Not one stage. The managed offering must not wait on the reconciler — a large build. Stage 1's single-writer design wins a real, sellable, managed product with bounded complexity, and the load it collects is the evidence that justifies Stage 2 (reversible-change discipline: ship, measure, then invest).
  • Not collapsing 2 and 3. Scale (Stage 2) and sovereignty (Stage 3) are independent axes — a customer may demand BYOC before single-region scale runs out, or the reverse. They share only the thin-control-plane prerequisite, which is foundational (P2) anyway. Stage 3 can therefore ship on the Stage 1 data-plane shape; the 1→2→3 numbering is the expected-demand order, not a hard dependency. If enterprise/sovereignty demand arrives first, do 1→3→2.
  • Not more than three. The natural seams are exactly managed+auth, elastic+maintenance, sovereignty. Finer splits would be invented complexity, not earned.

Open decisions

  1. Tenancy isolation model — bucket-per-tenant vs prefix-per-tenant vs account-per-tenant. Strongest lever and effectively irreversible; the control plane should vend short-lived per-tenant scoped credentials regardless. Decide before Stage 1.
  2. Recovery ownership — Stage 1 leans on the single writer; Stage 2 needs a per-repo recovery lease. Confirm the lease mechanism (object-store-based lease vs control-plane-issued).
  3. Commit-latency model (P1) — ratify object-storage-only commit, or identify a concrete low-latency write requirement that would justify a fast durable tier.
  4. RFC 0001 carry-overs — degraded-mode JWKS grace window, revocation strategy, whether VPC customers can override cloud-pushed Cedar policy.
  5. Stale-read knob — ship the optional eventual-consistency read in Stage 2, or defer.
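For decision 1, the prefix-per-tenant option combined with short-lived scoped credentials can be sketched as below. This illustrates one candidate only, not a recommendation; `ScopedCredential`, `vend`, and the `tenants/<tenant>/` layout are hypothetical, and a real deployment would rely on the object store's own credential scoping rather than an in-process check.

```rust
/// Hypothetical short-lived credential scoped to one tenant's key prefix.
#[derive(Debug, PartialEq)]
pub struct ScopedCredential {
    pub prefix: String,
    pub expires_at: u64, // unix seconds
}

/// Vends a credential for `tenant`, valid for `ttl` seconds.
/// Tenant ids are restricted so an id can never contain a path separator.
pub fn vend(tenant: &str, now: u64, ttl: u64) -> Option<ScopedCredential> {
    let valid = !tenant.is_empty()
        && tenant.chars().all(|c| c.is_ascii_alphanumeric() || c == '-');
    if !valid {
        return None;
    }
    Some(ScopedCredential {
        // The trailing slash keeps "acme" from matching "acme-evil".
        prefix: format!("tenants/{tenant}/"),
        expires_at: now + ttl,
    })
}

impl ScopedCredential {
    /// True if the credential is unexpired and `key` lies under its prefix.
    pub fn allows(&self, key: &str, now: u64) -> bool {
        now < self.expires_at && key.starts_with(self.prefix.as_str())
    }
}
```

Bucket-per-tenant and account-per-tenant would move the same check into the storage provider's access policies, which is part of why the choice is hard to reverse.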

Invariant analysis

| Invariant / deny-list item | Outcome |
|---|---|
| 2: manifest-atomic graph visibility | unchanged; __manifest CAS is the commit point in every stage |
| 5: recovery part of commit protocol | Stage 1 = open-time sweep; Stage 2 = leased worker; never weakened |
| 6: strong consistency default | stale-read knob is explicit, read-only, non-default |
| 11: transport/auth at the boundary | engine crates untouched; auth in omnigraph-server |
| 13: failures bounded/observable | Stage 2 closes the per-query resource-bounds gap |
| Deny: custom WAL / metadata store | P1/P2: object storage + __manifest only |
| Deny: cloud-only correctness / fork | P3: one OSS binary, additive control plane |
| Deny: job queue for manifest-derivable state | worker tier is a reconciler, not a queue |

Testing notes

  • Stage 1: extend omnigraph-server tests for multi-replica read fan-out and single-writer routing; reuse failpoints for writer-restart behavior.
  • Stage 2: per-repo coalescer and routing need engine/storage-boundary tests (runs.rs, recovery.rs); recovery-lease coverage belongs in recovery.rs.
  • Stage 3: a deployment-template smoke test (data plane against an in-VPC-style object store); confirm no control-plane call on the request path.
  • Update docs/user/deployment.md, docs/user/server.md as each stage lands.