mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-09 01:35:18 +02:00
Add RFC 0002: staged cloud deployment architecture
Drafts the cloud deployment design as three earned stages — managed single-region, elastic data plane with an off-path worker tier, then BYOC/VPC/air-gapped — each winning one irreducible property. Sets foundational principles (object-storage-only commit, a soft-state control plane off the request path, one config-driven binary) drawn from turbopuffer, Neon, and WarpStream, threads the RFC 0001 auth design through every stage, and records the open decisions and invariant analysis. https://claude.ai/code/session_01N22WDYC6vv2njR5Xu96QaC
parent 5e03ca977c
commit 9e5a86580d
2 changed files with 279 additions and 0 deletions
@@ -51,6 +51,7 @@ description of shipped behavior always lives in the area docs above.
| RFC | Status | Topic |
|---|---|---|
| [0001-federated-authentication.md](rfcs/0001-federated-authentication.md) | draft | OIDC auth with a cloud control plane plus VPC/on-prem deployment |
| [0002-cloud-deployment-architecture.md](rfcs/0002-cloud-deployment-architecture.md) | draft | Staged cloud deployment — managed, elastic, then BYOC/VPC |

## Project Operations

docs/dev/rfcs/0002-cloud-deployment-architecture.md (new file, 278 lines added)
@@ -0,0 +1,278 @@
# RFC 0002 — Cloud Deployment Architecture (Staged)

**Type:** design proposal
**Status:** draft — not accepted, not implemented
**Audience:** maintainers reviewing the cloud offering and the OSS/Cloud boundary
**Date:** 2026-05-17
**Depends on:** [RFC 0001 — Federated Authentication](0001-federated-authentication.md)

> This is a proposal, not current truth. Until accepted and implemented, the
> authoritative deployment story remains [docs/user/deployment.md](../../user/deployment.md).

## Summary

Defines how OmniGraph is deployed as a managed cloud offering, in **three
stages** of increasing complexity. Each stage wins one irreducible property
and pays only the complexity that property earns:

1. **Managed single-region** — a customer can sign up, get a repo, and
   authenticate against a managed OmniGraph. Wins: *managed + authenticated +
   multi-tenant*.
2. **Elastic data plane + worker tier** — write scale, and maintenance
   (indexing, compaction, recovery) moved off the request path. Wins: *scale +
   off-path maintenance + no recovery-on-open race*.
3. **BYOC / VPC / air-gapped** — data plane in the customer's VPC, only a
   thin orchestrator in the vendor cloud. Wins: *data sovereignty*.

The same OSS binary runs in every stage and in a customer VPC; deployment mode
is configuration. The auth design from RFC 0001 threads through unchanged —
each stage only moves *where the token issuer lives*, never the validation
path.

## Motivation

OmniGraph's durable state lives entirely in object storage (Lance datasets +
the `__manifest` commit log); concurrency is optimistic CAS on `__manifest`.
That is already the object-storage-native architecture that turbopuffer,
LanceDB, Neon, and WarpStream converged on. The task is not to adopt it but to
*lean into it* — and to do so in stages so that a managed offering can ship and
collect real load before the expensive pieces (the reconciler, BYOC) are built.

## Goals

- One OSS binary; deployment mode is configuration only.
- A managed offering reachable in Stage 1 without building the reconciler.
- Object storage as the only request-path dependency.
- The control plane is dispensable: it holds only soft, derivable state.
- Auth (RFC 0001) is identical across stages and across VPC/on-prem.

## Non-goals

- A new storage substrate, WAL, or metadata database (deny-list).
- Changing the engine crates — transport/auth stay at the server boundary
  (Invariant 11).
- Multi-region active/active. Regions are independent stacks.
- Browser SSO / login UX (a control-plane concern, out of scope here).

## Foundational principles (hold across all stages)

These are decided once and constrain every stage.

### P1 — Object-storage-only commit

OmniGraph writes are **batch-shaped** (`mutate_as`, `load`, merges,
`schema_apply`) — not OLTP `COMMIT`s. Neon adds a Safekeeper quorum tier in
front of object storage *because* OLTP commit cannot wait ~100-300ms for an S3
PUT. OmniGraph has no such requirement, so it takes the turbopuffer path:
**commit straight to object storage, accept ~100-300ms write latency, run no
fast durable tier.** This is a large, deliberate simplicity win. It is
hard to reverse — if a low-latency single-row write path ever becomes a product
requirement, *that* is when a fast durable tier earns its complexity, via a new
RFC.
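
The commit path that P1 relies on (optimistic CAS on `__manifest`) can be sketched as a compare-and-swap retry loop. This is a minimal illustration under stated assumptions, not OmniGraph's implementation: `ManifestStore`, `publish_if_version`, `Conflict`, and `commit_with_retry` are hypothetical names, and an in-memory struct stands in for a conditional write against object storage.

```rust
/// In-memory stand-in for the `__manifest` commit log (hypothetical type).
/// A real store would issue a conditional write against object storage.
#[derive(Debug, Default)]
pub struct ManifestStore {
    version: u64,
    entries: Vec<String>,
}

/// Returned when another writer published first.
#[derive(Debug, PartialEq)]
pub struct Conflict {
    pub current_version: u64,
}

impl ManifestStore {
    pub fn version(&self) -> u64 {
        self.version
    }

    pub fn entries(&self) -> &[String] {
        &self.entries
    }

    /// Compare-and-swap: append `entry` only if the manifest is still at
    /// `expected_version`; otherwise report the version that won.
    pub fn publish_if_version(
        &mut self,
        expected_version: u64,
        entry: &str,
    ) -> Result<u64, Conflict> {
        if self.version != expected_version {
            return Err(Conflict { current_version: self.version });
        }
        self.entries.push(entry.to_string());
        self.version += 1;
        Ok(self.version)
    }
}

/// Optimistic commit: start from the version the writer last observed and,
/// on conflict, rebase onto the winning version and retry.
pub fn commit_with_retry(
    store: &mut ManifestStore,
    observed_version: u64,
    entry: &str,
    max_attempts: u32,
) -> Result<u64, Conflict> {
    let mut expected = observed_version;
    let mut last = Conflict { current_version: expected };
    for _ in 0..max_attempts {
        match store.publish_if_version(expected, entry) {
            Ok(new_version) => return Ok(new_version),
            Err(conflict) => {
                // A real writer would re-validate its batch against the new head here.
                expected = conflict.current_version;
                last = conflict;
            }
        }
    }
    Err(last)
}
```

Each attempt in a real deployment is one object-storage round trip, which is why the ~100-300ms latency budget is acceptable only for batch-shaped writes.
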

### P2 — The control plane holds only soft state

WarpStream keeps an authoritative metadata store (file→offset mappings) in its
cloud. OmniGraph does **not** need this: the `__manifest` table already *is* the
authoritative, strongly-consistent metadata, and it lives in object storage.
The control plane therefore stores only **soft, derivable state** — tenant
directory, billing counters, routing hints, compaction schedules, recovery
leases. Everything it knows is rebuildable by scanning object storage. The
control plane is never on the request path; if it is down, existing tenants
keep serving (the turbopuffer 99.99%-uptime property).
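
To make "rebuildable by scanning object storage" concrete, the sketch below derives a tenant directory from a plain listing of object keys. It assumes a hypothetical `tenants/<tenant>/<repo>/...` key layout; the actual tenancy model is an open decision in this RFC, and `rebuild_tenant_directory` is an illustrative name, not an existing function.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Rebuild a tenant -> repos directory from a listing of object keys.
/// Assumes a hypothetical `tenants/<tenant>/<repo>/<rest>` key layout.
pub fn rebuild_tenant_directory<'a, I>(keys: I) -> BTreeMap<String, BTreeSet<String>>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut directory: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for key in keys {
        let mut parts = key.split('/');
        // Keys that do not match the assumed layout are ignored.
        if let (Some("tenants"), Some(tenant), Some(repo), Some(_rest)) =
            (parts.next(), parts.next(), parts.next(), parts.next())
        {
            if !tenant.is_empty() && !repo.is_empty() {
                directory
                    .entry(tenant.to_string())
                    .or_default()
                    .insert(repo.to_string());
            }
        }
    }
    directory
}
```

Because the directory is a pure function of the listing, losing the control plane's copy costs a rescan rather than data.
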

### P3 — One binary, config-driven

The `omnigraph-server` container is identical in Stage 1, Stage 2, a customer
VPC, and air-gapped on-prem. A "cloud build" is configuration plus *additive,
optional* control-plane services — never a fork (deny-list: no Cloud fork;
correctness is always OSS).

### P4 — Auth validation never makes a network call

Per RFC 0001: tokens are validated offline against cached JWKS. The token
*issuer* may be cloud-hosted, but the data plane never calls it on the request
path. This is what lets the identical data plane run in Stage 3's customer VPC.
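
The offline property in P4 comes down to a key lookup that only ever reads local state. The sketch below shows that lookup with a degraded-mode grace window (listed as an open decision later in this RFC). All names here (`JwksCache`, `KeyLookup`, `set_age`) are hypothetical, key material is an opaque string, and signature verification is deliberately omitted; a background task, not the request path, would be responsible for refreshing the cache.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Outcome of an offline key lookup (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum KeyLookup<'a> {
    /// Cache was refreshed recently.
    Fresh(&'a str),
    /// Cache is past its refresh interval but inside the grace window.
    Degraded(&'a str),
    /// Unknown key id, or the cache is too stale to trust.
    Rejected,
}

pub struct JwksCache {
    keys: HashMap<String, String>, // kid -> key material (opaque in this sketch)
    age: Duration,                 // time since the last background refresh
    refresh_interval: Duration,
    grace_window: Duration,
}

impl JwksCache {
    pub fn new(refresh_interval: Duration, grace_window: Duration) -> Self {
        JwksCache {
            keys: HashMap::new(),
            age: Duration::ZERO,
            refresh_interval,
            grace_window,
        }
    }

    pub fn insert_key(&mut self, kid: &str, key_material: &str) {
        self.keys.insert(kid.to_string(), key_material.to_string());
    }

    /// Called by the background refresher; an explicit age keeps the sketch testable.
    pub fn set_age(&mut self, age: Duration) {
        self.age = age;
    }

    /// Request-path lookup: reads only local state, never the network.
    pub fn lookup(&self, kid: &str) -> KeyLookup<'_> {
        let key = match self.keys.get(kid) {
            Some(key) => key.as_str(),
            None => return KeyLookup::Rejected,
        };
        if self.age <= self.refresh_interval {
            KeyLookup::Fresh(key)
        } else if self.age <= self.refresh_interval + self.grace_window {
            KeyLookup::Degraded(key)
        } else {
            KeyLookup::Rejected
        }
    }
}
```
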

## Architecture primitives

```
CONTROL PLANE (vendor cloud; soft state only; off request path)
  - provisioning / tenant directory      - billing / metering
  - identity issuer (RFC 0001; may wrap WorkOS)
  - orchestration: compaction schedule, recovery leases, routing hints
                           │ (async, never on request path)
┌──────────────────────────┼──────────────────────────────────────┐
│ DATA PLANE — omnigraph-server (identical OSS binary everywhere) │
│ read replicas · writer(s) · worker tier (Stage 2+)              │
└──────────────────────────┬──────────────────────────────────────┘
                           ▼
                    OBJECT STORAGE
    Lance datasets + __manifest (the only request-path dependency)
```

Tiers, introduced progressively:

- **Read replicas** — open `OpenMode::ReadOnly` (skips the recovery sweep),
  snapshot-isolated, fan out freely.
- **Writer(s)** — open `ReadWrite`; route by repo so CAS contention and cache
  stay local.
- **Worker tier** (Stage 2+) — background indexing, compaction, cleanup,
  recovery. Off the request path.
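
Routing writes by repo can be sketched with a consistent-hash ring, the scheme Stage 2 names for multi-writer routing. This is an illustration only: `WriterRing` and its methods are hypothetical, and `DefaultHasher` is used for brevity even though its algorithm is not guaranteed stable across Rust releases, so a real router would pin an explicit hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Consistent-hash ring mapping repo URIs to writer nodes (hypothetical type).
pub struct WriterRing {
    ring: BTreeMap<u64, String>, // ring position -> writer node name
    virtual_nodes: u32,          // points per writer, smooths the distribution
}

impl WriterRing {
    pub fn new(virtual_nodes: u32) -> Self {
        WriterRing { ring: BTreeMap::new(), virtual_nodes }
    }

    pub fn add_writer(&mut self, node: &str) {
        for i in 0..self.virtual_nodes {
            self.ring.insert(hash_of(&(node, i)), node.to_string());
        }
    }

    pub fn remove_writer(&mut self, node: &str) {
        self.ring.retain(|_, owner| owner.as_str() != node);
    }

    /// The first ring point at or after the repo's hash owns it,
    /// wrapping around to the smallest point if none follows.
    pub fn writer_for(&self, repo_uri: &str) -> Option<&str> {
        let point = hash_of(&repo_uri);
        self.ring
            .range(point..)
            .next()
            .or_else(|| self.ring.iter().next())
            .map(|(_, node)| node.as_str())
    }
}
```

The property that matters for cache locality is that removing a writer only remaps the repos that writer owned; every other repo keeps its warm handle.
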

## Stage 1 — Managed single-region

**Property won:** a customer can sign up, get a repo, authenticate, and use a
managed OmniGraph.

**Architecture.** Single region. One object store (prefix-per-tenant or
bucket-per-tenant — see open decisions). Data plane = a pool of **read
replicas** plus a **single writer replica** per region. The single writer is
deliberate: it sidesteps both `__manifest` CAS contention *and* the
recovery-on-open race **without building the worker tier**. Recovery runs on the
writer's `open`, as today. Reads fan out across read replicas.

**Control plane.** Thinnest viable: provisioning (`open-or-create` a repo —
largely doable by the data plane itself on first request), a tenant directory,
billing counters, and the RFC 0001 identity issuer. All soft state (P2).

**Auth (RFC 0001).** `mode = static` remains the default for M2M / CI;
`mode = oidc` available, validated offline. The control plane runs the issuer
(its own, or wrapping WorkOS for human SSO). `hybrid` lets both coexist.

**Branching as product surface.** OmniGraph already has Git-style graph
branches with lazy fork — the same zero-copy, metadata-pointer design Neon
sells. Stage 1 exposes this directly: instant per-PR / dev / staging branches at
near-zero storage cost. No new engine work — a product packaging of an existing
capability.

**Deliberately not done.** No autoscaling of writes, no worker tier, no
reconciler, no BYOC. **Accepted limitation:** per-region write throughput is
bounded by one writer; a writer restart briefly pauses writes for that region.

**Exit criteria → Stage 2.** Single-writer throughput, write-pause blast
radius, or maintenance load (inline index builds / compaction) becomes the
binding constraint.

## Stage 2 — Elastic data plane + worker tier

**Property won:** horizontal write scale, and maintenance moved off the request
path — which also eliminates the recovery-on-open race.

**Architecture.**

- **Multiple writers** with **consistent-hash routing by repo URI**. A repo's
  writes land on one node, so CAS contention is bounded and the Lance page cache
  / warm `Omnigraph` handle stay local.
- **Per-repo write coalescer** — concurrent `mutate_as`/`load` commits to one
  repo batch into one manifest publish (the turbopuffer WAL-batching lesson:
  beat contention with batching, not locks).
- **Three-tier cache** made explicit: object storage → NVMe SSD → in-process
  (Lance page cache + warm handle), with routing affinity keeping a repo warm.
- **Worker tier** — background workers own index building (the deny-list
  reconciler mandate), compaction (`optimize`), cleanup, **and recovery**.
  Recovery moves from "every `open` runs the sweep" to "one leased worker per
  repo owns recovery." This *is* the long-deferred background reconciler;
  cloud is its forcing function.
- **Per-tenant resource bounds** — close the `invariants.md` resource-bounds
  gap: enforced per-query memory/time budgets, plus `WorkloadController`
  admission control, so multi-tenant compute has no noisy-neighbor failure.
- **Scale-to-zero** for cold tenants — evict idle handles, re-warm on first
  request, bill by the second (the Neon model).
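
The per-repo write coalescer in the list above can be sketched as a queue that drains into a single publish. This is a simplified, single-threaded illustration under assumptions: `PendingWrite`, `ManifestPublish`, and `WriteCoalescer` are hypothetical names, the version counter stands in for a `__manifest` CAS, and a real coalescer would also need a flush trigger (size or time) and per-write result fan-out.

```rust
use std::collections::HashMap;

/// One queued write against a repo (stands in for a `mutate_as` or `load` call).
#[derive(Debug, Clone, PartialEq)]
pub struct PendingWrite {
    pub actor: String,
    pub rows: u64,
}

/// A single manifest publish covering every write in the batch.
#[derive(Debug, PartialEq)]
pub struct ManifestPublish {
    pub repo: String,
    pub version: u64,
    pub writes: usize,
    pub rows: u64,
}

#[derive(Default)]
pub struct WriteCoalescer {
    pending: HashMap<String, Vec<PendingWrite>>, // repo -> queued writes
    versions: HashMap<String, u64>,              // repo -> last published version
}

impl WriteCoalescer {
    pub fn enqueue(&mut self, repo: &str, write: PendingWrite) {
        self.pending.entry(repo.to_string()).or_default().push(write);
    }

    /// Drain everything queued for `repo` into one publish; `None` if idle.
    pub fn flush(&mut self, repo: &str) -> Option<ManifestPublish> {
        let batch = self.pending.remove(repo)?;
        if batch.is_empty() {
            return None;
        }
        // One version bump per batch, regardless of how many writes it holds.
        let version = self.versions.entry(repo.to_string()).or_insert(0);
        *version += 1;
        Some(ManifestPublish {
            repo: repo.to_string(),
            version: *version,
            writes: batch.len(),
            rows: batch.iter().map(|w| w.rows).sum(),
        })
    }
}
```

Three concurrent writers to one repo then cost one manifest publish instead of three competing CAS attempts.
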

**Control plane.** Gains orchestration: routing-hint distribution, compaction
scheduling, recovery-lease coordination. Still soft state (P2), still off-path.

**Auth.** No change to the validation path. The control plane's config-bundle
sync (RFC 0001 `ControlPlaneSync`) may now feed a SCIM-sourced actor allowlist.

**Optional consistency knob.** With warm caches, a per-query `stale-ok` read
becomes viable (turbopuffer's sub-10ms eventual mode). Invariant 6 permits it
**only** as explicit, read-only, non-default — exposed as opt-in, never the
default.
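
The three constraints on the knob (explicit, read-only, non-default) can be encoded directly in types. The sketch below is one possible shape, with hypothetical names throughout (`ReadConsistency`, `QueryKind`, `Source`, `resolve_source`); it is not an existing OmniGraph API.

```rust
/// Read consistency requested by a query (hypothetical API).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum ReadConsistency {
    /// The default: always read the latest committed manifest.
    #[default]
    Strong,
    /// Opt-in: a warm cached snapshot is acceptable.
    StaleOk,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueryKind {
    Read,
    Write,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Source {
    LatestManifest,
    WarmCache,
}

/// Decide where a query is served from. Stale reads are rejected for writes
/// and fall back to the latest manifest when no warm cache is available.
pub fn resolve_source(
    kind: QueryKind,
    consistency: ReadConsistency,
    cache_is_warm: bool,
) -> Result<Source, &'static str> {
    match (kind, consistency) {
        (QueryKind::Write, ReadConsistency::StaleOk) => {
            Err("stale-ok is only valid for read-only queries")
        }
        (QueryKind::Read, ReadConsistency::StaleOk) if cache_is_warm => Ok(Source::WarmCache),
        _ => Ok(Source::LatestManifest),
    }
}
```

Making `Strong` the `Default` variant means a caller that never mentions consistency cannot end up on the stale path.
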

**Deliberately not done.** Data still resides in vendor-managed object storage.

**Exit criteria → Stage 3.** A customer requires data sovereignty (data may not
leave their account) or air-gapped operation.

## Stage 3 — BYOC / VPC / air-gapped

**Property won:** data sovereignty — the customer's graph data never leaves
their cloud account.

**Architecture.** The WarpStream BYOC split. The data plane (read replicas,
writers, worker tier — the Stage 1 *or* Stage 2 shape) and the customer's object
store run **inside the customer's VPC**. The vendor cloud keeps only the
soft-state orchestrator and the identity issuer. No customer graph data crosses
the boundary; no cross-account IAM into the customer's bucket. Air-gapped is the
same packaging with the control plane absent and config supplied as static
files.

**Auth.** This is where P4 (inherited from RFC 0001) pays off fully: the in-VPC
data plane validates tokens **offline** against cached JWKS. The vendor identity
issuer is the only cloud touchpoint and it is off the request path. Air-gapped:
point at the customer's own IdP, or `mode = static`, with JWKS/policy pre-seeded.

**Why this is mostly packaging.** Because P2/P3/P4 were honored from Stage 1 —
control plane thin and off-path, one config-driven binary, auth validated
offline — Stage 3 is boundary hardening and deployment templates (Helm /
Terraform), not an architectural change.

## Why three stages (first-principles)

- **Not one stage.** The managed offering must not wait on the reconciler — a
  large build. Stage 1's single-writer design wins a real, sellable, managed
  product with bounded complexity, and the load it collects is the evidence
  that justifies Stage 2 (reversible-change discipline: ship, measure, then
  invest).
- **Not collapsing 2 and 3.** *Scale* (Stage 2) and *sovereignty* (Stage 3) are
  independent axes — a customer may demand BYOC before single-region scale runs
  out, or the reverse. They share only the thin-control-plane prerequisite,
  which is foundational (P2) anyway. **Stage 3 can therefore ship on the Stage 1
  data-plane shape**; the 1→2→3 numbering is the expected-demand order, not a
  hard dependency. If enterprise/sovereignty demand arrives first, do 1→3→2.
- **Not more than three.** The natural seams are exactly *managed+auth*,
  *elastic+maintenance*, *sovereignty*. Finer splits would be invented
  complexity, not earned.

## Open decisions

1. **Tenancy isolation model** — bucket-per-tenant vs prefix-per-tenant vs
   account-per-tenant. Strongest lever and effectively irreversible; the
   control plane should vend short-lived per-tenant scoped credentials
   regardless. Decide before Stage 1.
2. **Recovery ownership** — Stage 1 leans on the single writer; Stage 2 needs a
   per-repo recovery lease. Confirm the lease mechanism (object-store-based
   lease vs control-plane-issued).
3. **Commit-latency model (P1)** — ratify object-storage-only commit, or
   identify a concrete low-latency write requirement that would justify a fast
   durable tier.
4. **RFC 0001 carry-overs** — degraded-mode JWKS grace window, revocation
   strategy, whether VPC customers can override cloud-pushed Cedar policy.
5. **Stale-read knob** — ship the optional eventual-consistency read in
   Stage 2, or defer.
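
For decision 2, the semantics any per-repo recovery lease would need (exclusive while live, renewable by its holder, claimable after expiry) can be sketched independently of where the lease is stored. The mechanism itself is undecided in this RFC, so everything here is hypothetical: `Lease`, `LeaseSlot`, and `try_acquire` are illustrative names, time is passed in as plain seconds, and an in-memory slot stands in for what would have to be a conditional write in the object-store-based variant.

```rust
/// Lease record for one repo's recovery ownership (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
pub struct Lease {
    pub holder: String,
    pub expires_at: u64, // seconds on a shared clock
    pub generation: u64, // bumps on every grant, usable as a fencing token
}

/// In-memory stand-in for wherever the lease record is stored.
#[derive(Default)]
pub struct LeaseSlot {
    current: Option<Lease>,
}

impl LeaseSlot {
    /// Try to take or renew the lease at time `now`.
    /// Succeeds if the slot is empty, the lease has expired, or `worker`
    /// already holds it; otherwise returns the live lease that blocked it.
    pub fn try_acquire(&mut self, worker: &str, now: u64, ttl: u64) -> Result<Lease, Lease> {
        let generation = match &self.current {
            Some(lease) if lease.expires_at > now && lease.holder != worker => {
                return Err(lease.clone());
            }
            Some(lease) => lease.generation + 1,
            None => 1,
        };
        let lease = Lease {
            holder: worker.to_string(),
            expires_at: now + ttl,
            generation,
        };
        self.current = Some(lease.clone());
        Ok(lease)
    }
}
```

The generation counter is included because expiry alone cannot stop a paused former holder from acting late; a recovery step could be made to check the generation before publishing.
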

## Invariant analysis

| Invariant / deny-list item | Outcome |
|---|---|
| 2 — manifest-atomic graph visibility | ✅ unchanged; `__manifest` CAS is the commit point in every stage |
| 5 — recovery part of commit protocol | ✅ Stage 1 = open-time sweep; Stage 2 = leased worker; never weakened |
| 6 — strong consistency default | ✅ stale-read knob is explicit, read-only, non-default |
| 11 — transport/auth at the boundary | ✅ engine crates untouched; auth in `omnigraph-server` |
| 13 — failures bounded/observable | ✅ Stage 2 closes the per-query resource-bounds gap |
| Deny: custom WAL / metadata store | ✅ P1/P2 — object storage + `__manifest` only |
| Deny: cloud-only correctness / fork | ✅ P3 — one OSS binary, additive control plane |
| Deny: job queue for manifest-derivable state | ✅ worker tier is a reconciler, not a queue |

## Testing notes

- Stage 1: extend `omnigraph-server` tests for multi-replica read fan-out and
  single-writer routing; reuse `failpoints` for writer-restart behavior.
- Stage 2: per-repo coalescer and routing need engine/storage-boundary tests
  (`runs.rs`, `recovery.rs`); recovery-lease coverage belongs in `recovery.rs`.
- Stage 3: a deployment-template smoke test (data plane against an in-VPC-style
  object store); confirm no control-plane call on the request path.
- Update [docs/user/deployment.md](../../user/deployment.md) and
  [docs/user/server.md](../../user/server.md) as each stage lands.