From 9e5a86580d8af7d0de684e510ee59a0d775203e7 Mon Sep 17 00:00:00 2001
From: Claude
Date: Sun, 17 May 2026 03:23:14 +0000
Subject: [PATCH] Add RFC 0002: staged cloud deployment architecture
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Drafts the cloud deployment design as three earned stages — managed
single-region, elastic data plane with an off-path worker tier, then
BYOC/VPC/air-gapped — each winning one irreducible property. Sets
foundational principles (object-storage-only commit, a soft-state control
plane off the request path, one config-driven binary) drawn from
turbopuffer, Neon, and WarpStream, threads the RFC 0001 auth design
through every stage, and records the open decisions and invariant
analysis.

https://claude.ai/code/session_01N22WDYC6vv2njR5Xu96QaC
---
 docs/dev/index.md                                  |   1 +
 .../0002-cloud-deployment-architecture.md          | 278 ++++++++++++++++++
 2 files changed, 279 insertions(+)
 create mode 100644 docs/dev/rfcs/0002-cloud-deployment-architecture.md

diff --git a/docs/dev/index.md b/docs/dev/index.md
index 281e0e1..e85f202 100644
--- a/docs/dev/index.md
+++ b/docs/dev/index.md
@@ -51,6 +51,7 @@ description of shipped behavior always lives in the area docs above.
 | RFC | Status | Topic |
 |---|---|---|
 | [0001-federated-authentication.md](rfcs/0001-federated-authentication.md) | draft | OIDC auth with a cloud control plane plus VPC/on-prem deployment |
+| [0002-cloud-deployment-architecture.md](rfcs/0002-cloud-deployment-architecture.md) | draft | Staged cloud deployment — managed, elastic, then BYOC/VPC |
 
 ## Project Operations
 
diff --git a/docs/dev/rfcs/0002-cloud-deployment-architecture.md b/docs/dev/rfcs/0002-cloud-deployment-architecture.md
new file mode 100644
index 0000000..f43a5ae
--- /dev/null
+++ b/docs/dev/rfcs/0002-cloud-deployment-architecture.md
@@ -0,0 +1,278 @@
+# RFC 0002 — Cloud Deployment Architecture (Staged)
+
+**Type:** design proposal
+**Status:** draft — not accepted, not implemented
+**Audience:** maintainers reviewing the cloud offering and the OSS/Cloud boundary
+**Date:** 2026-05-17
+**Depends on:** [RFC 0001 — Federated Authentication](0001-federated-authentication.md)
+
+> This is a proposal, not current truth. Until accepted and implemented, the
+> authoritative deployment story remains [docs/user/deployment.md](../../user/deployment.md).
+
+## Summary
+
+Defines how OmniGraph is deployed as a managed cloud offering, in **three
+stages** of increasing complexity. Each stage wins one irreducible property
+and pays only the complexity that property earns:
+
+1. **Managed single-region** — a customer can sign up, get a repo, and
+   authenticate against a managed OmniGraph. Wins: *managed + authenticated +
+   multi-tenant*.
+2. **Elastic data plane + worker tier** — write scale, and maintenance
+   (indexing, compaction, recovery) moved off the request path. Wins: *scale +
+   off-path maintenance + no recovery-on-open race*.
+3. **BYOC / VPC / air-gapped** — data plane in the customer's VPC, only a
+   thin orchestrator in the vendor cloud. Wins: *data sovereignty*.
+
+The same OSS binary runs in every stage and in a customer VPC; deployment mode
+is configuration.
+The auth design from RFC 0001 threads through unchanged —
+each stage only moves *where the token issuer lives*, never the validation
+path.
+
+## Motivation
+
+OmniGraph's durable state lives entirely in object storage (Lance datasets +
+the `__manifest` commit log); concurrency is optimistic CAS on `__manifest`.
+That is already the object-storage-native architecture that turbopuffer,
+LanceDB, Neon, and WarpStream converged on. The task is not to adopt it but to
+*lean into it* — and to do so in stages so that a managed offering can ship and
+collect real load before the expensive pieces (the reconciler, BYOC) are built.
+
+## Goals
+
+- One OSS binary; deployment mode is configuration only.
+- A managed offering reachable in Stage 1 without building the reconciler.
+- Object storage as the only request-path dependency.
+- The control plane is dispensable: it holds only soft, derivable state.
+- Auth (RFC 0001) is identical across stages and across VPC/on-prem.
+
+## Non-goals
+
+- A new storage substrate, WAL, or metadata database (deny-list).
+- Changing the engine crates — transport/auth stay at the server boundary
+  (Invariant 11).
+- Multi-region active/active. Regions are independent stacks.
+- Browser SSO / login UX (a control-plane concern, out of scope here).
+
+## Foundational principles (hold across all stages)
+
+These are decided once and constrain every stage.
+
+### P1 — Object-storage-only commit
+
+OmniGraph writes are **batch-shaped** (`mutate_as`, `load`, merges,
+`schema_apply`) — not OLTP `COMMIT`s. Neon adds a Safekeeper quorum tier in
+front of object storage *because* OLTP commit cannot wait ~100-300ms for an S3
+PUT. OmniGraph has no such requirement, so it takes the turbopuffer path:
+**commit straight to object storage, accept ~100-300ms write latency, run no
+fast durable tier.** This is a large, deliberate simplicity win.
+It is hard to reverse — if a low-latency single-row write path ever becomes a
+product requirement, *that* is when a fast durable tier earns its complexity,
+via a new RFC.
+
+### P2 — The control plane holds only soft state
+
+WarpStream keeps an authoritative metadata store (file→offset mappings) in its
+cloud. OmniGraph does **not** need this: the `__manifest` table already *is* the
+authoritative, strongly-consistent metadata, and it lives in object storage.
+The control plane therefore stores only **soft, derivable state** — tenant
+directory, billing counters, routing hints, compaction schedules, recovery
+leases. Everything it knows is rebuildable by scanning object storage. The
+control plane is never on the request path; if it is down, existing tenants
+keep serving (the turbopuffer 99.99%-uptime property).
+
+### P3 — One binary, config-driven
+
+The `omnigraph-server` container is identical in Stage 1, Stage 2, a customer
+VPC, and air-gapped on-prem. A "cloud build" is configuration plus *additive,
+optional* control-plane services — never a fork (deny-list: no Cloud fork;
+correctness is always OSS).
+
+### P4 — Auth validation never makes a network call
+
+Per RFC 0001: tokens are validated offline against cached JWKS. The token
+*issuer* may be cloud-hosted, but the data plane never calls it on the request
+path. This is what lets the identical data plane run in Stage 3's customer VPC.
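Editor's sketch for P4 (not part of the patch): one way the request-path half of offline validation could look, assuming a hypothetical `JwksCache` that a background task refreshes out of band. Every name here (`JwksCache`, `KeyLookup`, `max_age`, `grace`) is an assumption for illustration, signature verification is deliberately omitted, and the grace window stands in for the "degraded-mode JWKS grace window" that the RFC lists as still undecided.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory JWKS snapshot: key id -> raw public key bytes.
struct JwksCache {
    keys: HashMap<String, Vec<u8>>,
    fetched_at: Instant,
    max_age: Duration, // normal freshness window (assumed knob)
    grace: Duration,   // extra window tolerated while the issuer is unreachable
}

#[derive(Debug, PartialEq)]
enum KeyLookup<'a> {
    Fresh(&'a [u8]),
    Degraded(&'a [u8]), // usable, but the snapshot is past max_age
    Rejected,           // unknown kid, or snapshot older than max_age + grace
}

impl JwksCache {
    /// Request-path lookup: a pure in-memory read, no network I/O.
    fn lookup(&self, kid: &str, now: Instant) -> KeyLookup<'_> {
        let Some(key) = self.keys.get(kid).map(|k| k.as_slice()) else {
            return KeyLookup::Rejected;
        };
        let age = now.saturating_duration_since(self.fetched_at);
        if age <= self.max_age {
            KeyLookup::Fresh(key)
        } else if age <= self.max_age + self.grace {
            KeyLookup::Degraded(key)
        } else {
            KeyLookup::Rejected
        }
    }

    /// Called by a background task after it has fetched a new key set;
    /// the fetch itself stays outside this type and off the request path.
    fn replace(&mut self, keys: HashMap<String, Vec<u8>>, now: Instant) {
        self.keys = keys;
        self.fetched_at = now;
    }
}
```

Passing `now` in as a parameter keeps the lookup deterministic under test; whether a `Degraded` key should be accepted at all is exactly the policy question the RFC leaves open.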
+
+## Architecture primitives
+
+```
+  CONTROL PLANE (vendor cloud; soft state only; off request path)
+  - provisioning / tenant directory     - billing / metering
+  - identity issuer (RFC 0001; may wrap WorkOS)
+  - orchestration: compaction schedule, recovery leases, routing hints
+                             │ (async, never on request path)
+  ┌──────────────────────────┼──────────────────────────────────────┐
+  │ DATA PLANE — omnigraph-server (identical OSS binary everywhere) │
+  │ read replicas · writer(s) · worker tier (Stage 2+)              │
+  └──────────────────────────┬──────────────────────────────────────┘
+                             ▼
+                      OBJECT STORAGE
+  Lance datasets + __manifest (the only request-path dependency)
+```
+
+Tiers, introduced progressively:
+
+- **Read replicas** — open `OpenMode::ReadOnly` (skips the recovery sweep),
+  snapshot-isolated, fan out freely.
+- **Writer(s)** — open `ReadWrite`; route by repo so CAS contention and cache
+  stay local.
+- **Worker tier** (Stage 2+) — background indexing, compaction, cleanup,
+  recovery. Off the request path.
+
+## Stage 1 — Managed single-region
+
+**Property won:** a customer can sign up, get a repo, authenticate, and use a
+managed OmniGraph.
+
+**Architecture.** Single region. One object store (prefix-per-tenant or
+bucket-per-tenant — see open decisions). Data plane = a pool of **read
+replicas** plus a **single writer replica** per region. The single writer is
+deliberate: it sidesteps both `__manifest` CAS contention *and* the
+recovery-on-open race **without building the worker tier**. Recovery runs on the
+writer's `open`, as today. Reads fan out across read replicas.
+
+**Control plane.** Thinnest viable: provisioning (`open-or-create` a repo —
+largely doable by the data plane itself on first request), a tenant directory,
+billing counters, and the RFC 0001 identity issuer. All soft state (P2).
+
+**Auth (RFC 0001).** `mode = static` remains the default for M2M / CI;
+`mode = oidc` available, validated offline.
+The control plane runs the issuer (its own, or wrapping WorkOS for human SSO).
+`hybrid` lets both coexist.
+
+**Branching as product surface.** OmniGraph already has Git-style graph
+branches with lazy fork — the same zero-copy, metadata-pointer design Neon
+sells. Stage 1 exposes this directly: instant per-PR / dev / staging branches at
+near-zero storage cost. No new engine work — a product packaging of an existing
+capability.
+
+**Deliberately not done.** No autoscaling of writes, no worker tier, no
+reconciler, no BYOC. **Accepted limitation:** per-region write throughput is
+bounded by one writer; a writer restart briefly pauses writes for that region.
+
+**Exit criteria → Stage 2.** Single-writer throughput, write-pause blast
+radius, or maintenance load (inline index builds / compaction) becomes the
+binding constraint.
+
+## Stage 2 — Elastic data plane + worker tier
+
+**Property won:** horizontal write scale, and maintenance moved off the request
+path — which also eliminates the recovery-on-open race.
+
+**Architecture.**
+
+- **Multiple writers** with **consistent-hash routing by repo URI**. A repo's
+  writes land on one node, so CAS contention is bounded and the Lance page cache
+  / warm `Omnigraph` handle stay local.
+- **Per-repo write coalescer** — concurrent `mutate_as`/`load` commits to one
+  repo batch into one manifest publish (the turbopuffer WAL-batching lesson:
+  beat contention with batching, not locks).
+- **Three-tier cache** made explicit: object storage → NVMe SSD → in-process
+  (Lance page cache + warm handle), with routing affinity keeping a repo warm.
+- **Worker tier** — background workers own index building (the deny-list
+  reconciler mandate), compaction (`optimize`), cleanup, **and recovery**.
+  Recovery moves from "every `open` runs the sweep" to "one leased worker per
+  repo owns recovery." This *is* the long-deferred background reconciler;
+  cloud is its forcing function.
+- **Per-tenant resource bounds** — close the `invariants.md` resource-bounds
+  gap: enforced per-query memory/time budgets, plus `WorkloadController`
+  admission control, so multi-tenant compute has no noisy-neighbor failure.
+- **Scale-to-zero** for cold tenants — evict idle handles, re-warm on first
+  request, bill by the second (the Neon model).
+
+**Control plane.** Gains orchestration: routing-hint distribution, compaction
+scheduling, recovery-lease coordination. Still soft state (P2), still off-path.
+
+**Auth.** No change to the validation path. The control plane's config-bundle
+sync (RFC 0001 `ControlPlaneSync`) may now feed a SCIM-sourced actor allowlist.
+
+**Optional consistency knob.** With warm caches, a per-query `stale-ok` read
+becomes viable (turbopuffer's sub-10ms eventual mode). Invariant 6 permits it
+**only** as explicit, read-only, non-default — exposed as opt-in, never the
+default.
+
+**Deliberately not done.** Data still resides in vendor-managed object storage.
+
+**Exit criteria → Stage 3.** A customer requires data sovereignty (data may not
+leave their account) or air-gapped operation.
+
+## Stage 3 — BYOC / VPC / air-gapped
+
+**Property won:** data sovereignty — the customer's graph data never leaves
+their cloud account.
+
+**Architecture.** The WarpStream BYOC split. The data plane (read replicas,
+writers, worker tier — the Stage 1 *or* Stage 2 shape) and the customer's object
+store run **inside the customer's VPC**. The vendor cloud keeps only the
+soft-state orchestrator and the identity issuer. No customer graph data crosses
+the boundary; no cross-account IAM into the customer's bucket. Air-gapped is the
+same packaging with the control plane absent and config supplied as static
+files.
+
+**Auth.** This is where P4 (offline validation, per RFC 0001) pays off fully:
+the in-VPC data plane validates tokens **offline** against cached JWKS. The
+vendor identity issuer is the only cloud touchpoint and it is off the request
+path.
+Air-gapped: point at the customer's own IdP, or `mode = static`, with
+JWKS/policy pre-seeded.
+
+**Why this is mostly packaging.** Because P2/P3/P4 were honored from Stage 1 —
+control plane thin and off-path, one config-driven binary, auth validated
+offline — Stage 3 is boundary hardening and deployment templates (Helm /
+Terraform), not an architectural change.
+
+## Why three stages (first-principles)
+
+- **Not one stage.** The managed offering must not wait on the reconciler — a
+  large build. Stage 1's single-writer design wins a real, sellable, managed
+  product with bounded complexity, and the load it collects is the evidence
+  that justifies Stage 2 (reversible-change discipline: ship, measure, then
+  invest).
+- **Not collapsing 2 and 3.** *Scale* (Stage 2) and *sovereignty* (Stage 3) are
+  independent axes — a customer may demand BYOC before single-region scale runs
+  out, or the reverse. They share only the thin-control-plane prerequisite,
+  which is foundational (P2) anyway. **Stage 3 can therefore ship on the Stage 1
+  data-plane shape**; the 1→2→3 numbering is the expected-demand order, not a
+  hard dependency. If enterprise/sovereignty demand arrives first, do 1→3→2.
+- **Not more than three.** The natural seams are exactly *managed+auth*,
+  *elastic+maintenance*, *sovereignty*. Finer splits would be invented
+  complexity, not earned.
+
+## Open decisions
+
+1. **Tenancy isolation model** — bucket-per-tenant vs prefix-per-tenant vs
+   account-per-tenant. Strongest lever and effectively irreversible; the
+   control plane should vend short-lived per-tenant scoped credentials
+   regardless. Decide before Stage 1.
+2. **Recovery ownership** — Stage 1 leans on the single writer; Stage 2 needs a
+   per-repo recovery lease. Confirm the lease mechanism (object-store-based
+   lease vs control-plane-issued).
+3. **Commit-latency model (P1)** — ratify object-storage-only commit, or
+   identify a concrete low-latency write requirement that would justify a fast
+   durable tier.
+4. **RFC 0001 carry-overs** — degraded-mode JWKS grace window, revocation
+   strategy, whether VPC customers can override cloud-pushed Cedar policy.
+5. **Stale-read knob** — ship the optional eventual-consistency read in
+   Stage 2, or defer.
+
+## Invariant analysis
+
+| Invariant / deny-list item | Outcome |
+|---|---|
+| 2 — manifest-atomic graph visibility | ✅ unchanged; `__manifest` CAS is the commit point in every stage |
+| 5 — recovery part of commit protocol | ✅ Stage 1 = open-time sweep; Stage 2 = leased worker; never weakened |
+| 6 — strong consistency default | ✅ stale-read knob is explicit, read-only, non-default |
+| 11 — transport/auth at the boundary | ✅ engine crates untouched; auth in `omnigraph-server` |
+| 13 — failures bounded/observable | ✅ Stage 2 closes the per-query resource-bounds gap |
+| Deny: custom WAL / metadata store | ✅ P1/P2 — object storage + `__manifest` only |
+| Deny: cloud-only correctness / fork | ✅ P3 — one OSS binary, additive control plane |
+| Deny: job queue for manifest-derivable state | ✅ worker tier is a reconciler, not a queue |
+
+## Testing notes
+
+- Stage 1: extend `omnigraph-server` tests for multi-replica read fan-out and
+  single-writer routing; reuse `failpoints` for writer-restart behavior.
+- Stage 2: per-repo coalescer and routing need engine/storage-boundary tests
+  (`runs.rs`, `recovery.rs`); recovery-lease coverage belongs in `recovery.rs`.
+- Stage 3: a deployment-template smoke test (data plane against an in-VPC-style
+  object store); confirm no control-plane call on the request path.
+- Update [docs/user/deployment.md](../../user/deployment.md),
+  [docs/user/server.md](../../user/server.md) as each stage lands.
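
Editor's sketch (not part of the patch): the Stage 2 routing tests mentioned in the testing notes need something concrete to assert against. The RFC says only "consistent-hash routing by repo URI" and does not pick a scheme, so the block below swaps in rendezvous (highest-random-weight) hashing as one simple stand-in. The `score` and `route` functions and the writer names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Score one (writer, repo) pair. `DefaultHasher::new()` uses fixed keys, so
/// the score is deterministic within one build. A real router would need a
/// hash whose output is specified to be stable across nodes and versions.
fn score(writer: &str, repo_uri: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    writer.hash(&mut hasher);
    repo_uri.hash(&mut hasher);
    hasher.finish()
}

/// Rendezvous routing: the repo goes to the writer with the highest score.
/// Returns `None` when the writer set is empty.
fn route<'a>(writers: &[&'a str], repo_uri: &str) -> Option<&'a str> {
    writers.iter().copied().max_by_key(|w| score(w, repo_uri))
}
```

Two properties are worth asserting in a routing test under this scheme: the same repo URI always resolves to the same writer for a fixed writer set, and removing a writer that did not own the repo leaves its route unchanged, so only the departed writer's repos lose their warm cache.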