From 5e03ca977c4a82f7a48b6113997bacb6edf1dee0 Mon Sep 17 00:00:00 2001
From: Claude
Date: Sat, 16 May 2026 03:32:25 +0000
Subject: [PATCH] Add RFC 0001: federated authentication

Drafts a design for OIDC-based federated authentication that lets a
managed cloud offering issue identity tokens while keeping VPC and
air-gapped on-prem deployments free of any request-time dependency on
the cloud. Introduces a server-only TokenVerifier seam with static and
OIDC implementations, validates the design against the OSS/Cloud
invariants, and records the open decisions needed before acceptance.

https://claude.ai/code/session_01N22WDYC6vv2njR5Xu96QaC
---
 docs/dev/index.md                             |   9 +
 .../dev/rfcs/0001-federated-authentication.md | 256 ++++++++++++++++++
 2 files changed, 265 insertions(+)
 create mode 100644 docs/dev/rfcs/0001-federated-authentication.md

diff --git a/docs/dev/index.md b/docs/dev/index.md
index 3a2b674..281e0e1 100644
--- a/docs/dev/index.md
+++ b/docs/dev/index.md
@@ -43,6 +43,15 @@ constraints. User-facing behavior should still be documented through
 | Constants and tunables | [constants.md](../user/constants.md) |
 | Transaction model public contract | [transactions.md](../user/transactions.md) |
+## Design Proposals (RFCs)
+
+RFCs are proposals under review, not current truth. The authoritative
+description of shipped behavior always lives in the area docs above.
+
+| RFC | Status | Topic |
+|---|---|---|
+| [0001-federated-authentication.md](rfcs/0001-federated-authentication.md) | draft | OIDC auth with a cloud control plane plus VPC/on-prem deployment |
+
 ## Project Operations
+
 | Area | Read |
diff --git a/docs/dev/rfcs/0001-federated-authentication.md b/docs/dev/rfcs/0001-federated-authentication.md
new file mode 100644
index 0000000..0534e03
--- /dev/null
+++ b/docs/dev/rfcs/0001-federated-authentication.md
@@ -0,0 +1,256 @@
+# RFC 0001 — Federated Authentication (Cloud Control Plane + VPC/On-Prem)
+
+**Type:** design proposal
+**Status:** draft — not accepted, not implemented
+**Audience:** maintainers reviewing the auth substrate and the OSS/Cloud boundary
+**Date:** 2026-05-16
+**Supersedes:** nothing — extends the current bearer-token model
+
+> This is a proposal, not current truth. Until accepted and implemented, the
+> authoritative description of auth remains [docs/user/server.md](../../user/server.md)
+> and [docs/user/policy.md](../../user/policy.md).
+
+## Summary
+
+Add OIDC-based federated authentication to `omnigraph-server` so that a
+managed cloud offering can issue identity tokens, while VPC and air-gapped
+on-prem deployments keep working with **no request-time dependency** on the
+cloud. The mechanism is a new `TokenVerifier` seam in `omnigraph-server` with
+two OSS implementations: today's static bearer tokens, and an OIDC JWT
+verifier. The cloud offering is configuration of the OSS verifier plus an
+additive, optional control-plane sync component — not a fork.
+
+## Motivation
+
+The current model (`omnigraph-server/src/auth.rs`, `lib.rs`) hashes a static
+set of bearer tokens at startup and compares per request. It is correct and
+on-prem-friendly, but it cannot:
+
+- accept identities issued by an enterprise IdP (Okta, Entra, Google) or by an
+  OmniGraph cloud control plane;
+- support short-lived, rotating credentials;
+- feed an actor allowlist from an external source (e.g. SCIM).
+
+We want a cloud offering with managed identity **without** sacrificing the VPC
+/ on-prem deployment story or violating the OSS/Cloud invariants in
+[invariants.md](../invariants.md).
+
+## Goals
+
+- One engine binary; deployment mode is configuration only.
+- Token **validation** is fully OSS and works offline (no control-plane call
+  on the request path).
+- Cloud control plane *issues* tokens and *distributes* config; it never
+  becomes a correctness dependency of the data plane.
+- Static bearer tokens remain a first-class, default-capable path for
+  machine-to-machine (M2M), CI, and air-gapped use.
+- Cedar authorization (`policy.rs`) is unchanged — it still operates on a
+  server-resolved actor.
+
+## Non-goals
+
+- Building an OmniGraph identity provider. The cloud control plane may wrap a
+  third party (e.g. WorkOS) for human SSO; that is out of scope here.
+- Browser login / session UX. This RFC covers API/CLI credential verification.
+- Changing the engine crates. `omnigraph` and `omnigraph-compiler` stay free
+  of transport/auth code (Invariant 11).
+
+## Background — current state
+
+| Concern | Today | File |
+|---|---|---|
+| Token store | SHA-256 hashes of static tokens | `omnigraph-server/src/lib.rs:58` |
+| Comparison | constant-time over all entries | `lib.rs:286` |
+| Token sources | env / file / AWS Secrets Manager | `auth.rs` (`TokenSource` trait) |
+| Actor | `AuthenticatedActor(Arc)`, server-resolved | `lib.rs:135` |
+| Authz | Cedar, 8 actions, branch scopes | `policy.rs` |
+
+`TokenSource` yields a *static set to hash*. OIDC needs a per-request
+*validation* step instead, so a new seam is required rather than a new
+`TokenSource` impl.
+
+## Design
+
+### Control plane / data plane split
+
+The decoupling that makes VPC + on-prem work: **token validation never makes a
+network call.** OIDC tokens are signed by the issuer; the verifier needs only
+the issuer's public keys (JWKS). The data plane validates offline against
+cached JWKS.
+
+```
+  CLOUD CONTROL PLANE (SaaS only, optional)
+    - issues tokens (own IdP or WorkOS front)
+    - publishes signed config bundle:
+        { jwks, cedar_policy, actor_allowlist }
+              │  (pull, periodic)
+              ▼
+  ┌─────────────────────────────────────────────────────────┐
+  │ DATA PLANE — omnigraph-server + engine                  │
+  │ (identical binary in SaaS / VPC / on-prem)              │
+  │                                                         │
+  │ request ─▶ TokenVerifier ─▶ AuthenticatedActor ─▶ Cedar │
+  │ reads ONLY local cached state                           │
+  └─────────────────────────────────────────────────────────┘
+```
+
+In air-gapped mode the bundle is supplied as static files; the control-plane
+sync component is simply not configured. The request path is byte-identical in
+all three modes.
+
+### The `TokenVerifier` seam
+
+Lives entirely in `omnigraph-server` (Invariant 11).
+
+```rust
+// The type parameters were elided in the source; `Arc<str>` and the
+// `AuthError` name are assumptions, not settled API.
+/// Verifies an inbound bearer credential and resolves it to an actor.
+trait TokenVerifier: Send + Sync {
+    async fn verify(&self, presented: &str) -> Result<ResolvedActor, AuthError>;
+}
+
+struct ResolvedActor { actor_id: Arc<str>, source: AuthSource }
+```
+
+OSS implementations:
+
+- `StaticHashTokenVerifier` — current behavior, refactored behind the trait.
+  Constant-time hash compare. Default.
+- `OidcJwtVerifier` — validates JWT signature against cached JWKS, checks
+  `iss` / `aud` / `exp` (with bounded clock skew), maps a configured claim to
+  `actor_id`, optionally checks an allowlist.
+
+`require_bearer_auth()` dispatches to the configured verifier(s) and produces
+the same `AuthenticatedActor` the rest of the stack already consumes. Nothing
+downstream — Cedar, `mutate_as`, the commit actor map — changes.
+
+The "cloud verifier" is **not new code**: it is `OidcJwtVerifier` pointed at
+the OmniGraph cloud issuer.
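
The dispatch described above can be sketched with the standard library alone. This is a minimal illustration, not the proposed implementation: the trait is synchronous here to avoid pulling in an async runtime, `PrefixVerifier` is a toy stand-in that performs no real hashing or signature checking, and `VerifierChain`, `VerifyError`, `hybrid_chain`, and the `AuthSource` variants are hypothetical names. Trying static before OIDC follows the ordering the Configuration section gives for `hybrid` mode.

```rust
use std::sync::Arc;

/// Which verifier accepted the credential (variant names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AuthSource {
    Static,
    Oidc,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct ResolvedActor {
    actor_id: Arc<str>,
    source: AuthSource,
}

/// Hypothetical error type; the RFC does not name one.
#[derive(Debug, PartialEq, Eq)]
enum VerifyError {
    Rejected,
}

/// Synchronous stand-in for the RFC's async trait, to stay std-only.
trait TokenVerifier: Send + Sync {
    fn verify(&self, presented: &str) -> Result<ResolvedActor, VerifyError>;
}

/// Toy verifier: accepts `<prefix><actor>`. It stands in for the real
/// hash-compare or JWT check and provides no security at all.
struct PrefixVerifier {
    prefix: &'static str,
    source: AuthSource,
}

impl TokenVerifier for PrefixVerifier {
    fn verify(&self, presented: &str) -> Result<ResolvedActor, VerifyError> {
        match presented.strip_prefix(self.prefix) {
            Some(actor) if !actor.is_empty() => Ok(ResolvedActor {
                actor_id: Arc::from(actor),
                source: self.source,
            }),
            _ => Err(VerifyError::Rejected),
        }
    }
}

/// Tries each configured verifier in order; the first success wins.
struct VerifierChain {
    verifiers: Vec<Box<dyn TokenVerifier>>,
}

impl VerifierChain {
    fn verify(&self, presented: &str) -> Result<ResolvedActor, VerifyError> {
        self.verifiers
            .iter()
            .find_map(|v| v.verify(presented).ok())
            .ok_or(VerifyError::Rejected)
    }
}

/// Builds a chain in the order static first, then OIDC.
fn hybrid_chain() -> VerifierChain {
    let verifiers: Vec<Box<dyn TokenVerifier>> = vec![
        Box::new(PrefixVerifier { prefix: "static:", source: AuthSource::Static }),
        Box::new(PrefixVerifier { prefix: "oidc:", source: AuthSource::Oidc }),
    ];
    VerifierChain { verifiers }
}
```

Whichever verifier accepts the credential, callers only ever see a `ResolvedActor`, which is the property that would let Cedar and the rest of the stack stay unchanged.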
+
+### Configuration
+
+```toml
+[auth]
+mode = "static"                                # "static" | "oidc" | "hybrid"
+
+[auth.oidc]
+issuer = "https://auth.omnigraph.cloud/"       # or a customer IdP
+audience = "omnigraph"
+actor_claim = "sub"                            # claim -> actor_id
+jwks_uri = ""                                  # blank = OIDC discovery
+jwks_cache_ttl = "1h"
+jwks_offline_path = "/etc/omnigraph/jwks.json" # air-gapped pre-seed
+jwks_stale_max = "24h"                         # see degraded mode
+clock_skew = "60s"
+allowed_actors_path = ""                       # optional; SCIM-fed in cloud
+
+[auth.control_plane]                           # SaaS only; omit elsewhere
+bundle_url = "https://cp.omnigraph.cloud/v1/bundle"
+sync_interval = "5m"
+```
+
+`hybrid` mode runs both verifiers (static tried first, then OIDC), so M2M
+service tokens and human/federated identities coexist during and after
+migration.
+
+### Control-plane sync (additive, cloud-only)
+
+An optional `ControlPlaneSync` task periodically pulls a **signed** bundle
+(`jwks`, `cedar_policy`, `actor_allowlist`), verifies its signature, and writes
+it to the same local paths the verifier and `policy.rs` already read. It is a
+distribution mechanism, not a code path the request touches — preserving the
+"no fork for Cloud" invariant.
+
+### Degraded-mode behavior
+
+VPC deployments must tolerate brief control-plane / IdP unreachability:
+
+- **JWKS refresh failure** — keep serving on cached keys; emit a loud
+  `WARN` + metric. Past `jwks_stale_max`, fail closed. Cached JWKS is safe
+  because signing keys rotate slowly.
+- **Revocation** — JWT revocation is inherently weak. Mitigate with short token
+  TTL (Databricks Lakebase uses 1h; we recommend ≤1h for cloud-issued tokens)
+  rather than a request-time denylist lookup. Optional opaque-token
+  introspection is left as future work, not a default.
+- **Control-plane bundle staleness** — last-good bundle stays in effect; loud
+  warning. Never silently fail open or drop to a weaker policy.
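
The JWKS refresh-failure rule above amounts to a three-way classification of cache age. The sketch below is one possible reading. It assumes the age is measured from the last successful JWKS fetch, which the RFC does not specify, and `JwksPolicy`, `JwksDecision`, and `example_policy` are hypothetical names.

```rust
use std::time::Duration;

/// What the verifier should do with its cached JWKS (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JwksDecision {
    /// Cache is within `jwks_cache_ttl`: serve normally.
    Fresh,
    /// Refresh is overdue but within `jwks_stale_max`: serve, warn, and meter.
    ServeStale,
    /// Past `jwks_stale_max`: reject every OIDC token.
    FailClosed,
}

struct JwksPolicy {
    cache_ttl: Duration,
    stale_max: Duration,
}

impl JwksPolicy {
    /// `age` is the time since the last successful JWKS fetch (an assumption;
    /// the RFC does not say which instant the bound is measured from).
    fn decide(&self, age: Duration) -> JwksDecision {
        if age <= self.cache_ttl {
            JwksDecision::Fresh
        } else if age <= self.stale_max {
            JwksDecision::ServeStale
        } else {
            JwksDecision::FailClosed
        }
    }
}

/// Mirrors the sample config values: `jwks_cache_ttl = "1h"`, `jwks_stale_max = "24h"`.
fn example_policy() -> JwksPolicy {
    JwksPolicy {
        cache_ttl: Duration::from_secs(3_600),
        stale_max: Duration::from_secs(24 * 3_600),
    }
}
```

With these values a cache aged 2h would still serve while warning, and one aged 25h would reject OIDC tokens outright, so the degraded window stays bounded.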
+
+This satisfies the deny-list "no silent failures" and Invariant 6 (any
+degraded mode is explicit, bounded, observable).
+
+### M2M for VPC
+
+In-VPC service-to-service and CI clients must not depend on the cloud at all.
+`StaticHashTokenVerifier` remains the supported M2M path (analogous to
+Lakebase's indefinitely-lived service principals). `hybrid` mode lets a
+deployment serve static service tokens and OIDC human tokens simultaneously.
+
+### Authorization interaction
+
+Unchanged. Cedar (`policy.rs`) receives the resolved actor regardless of which
+verifier produced it. The control-plane bundle may *distribute* the Cedar
+policy and actor allowlist, but enforcement, scopes, and the 8 actions are
+exactly as they are today.
+
+## Invariant analysis
+
+| Invariant / deny-list item | Outcome |
+|---|---|
+| 11 — transport/auth at the boundary | ✅ `TokenVerifier` is server-only; engine untouched |
+| 12 — bearer plaintext not retained | ✅ JWT verified per request, not stored; static path keeps constant-time compare |
+| 6 — strong consistency default | ✅ degraded mode is explicit, bounded, non-default |
+| Deny: cloud-only correctness fix | ✅ verification is OSS; cloud only issues + distributes |
+| Deny: fork the codebase for Cloud | ✅ cloud verifier = config; `ControlPlaneSync` is additive/optional |
+| Deny: silent failure | ✅ JWKS/bundle staleness is loud + metered, fails closed at a bound |
+| Deny: side-channel for semantics | ✅ actor stays a first-class server-resolved value |
+
+## Open questions — decisions needed before acceptance
+
+1. **Degraded grace window.** Is `jwks_stale_max = 24h` the right default, and
+   is fail-closed-after-bound acceptable for VPC SLAs?
+2. **Revocation.** Short TTL only, or do we also ship optional introspection /
+   a denylist for high-security tenants?
+3. **Policy authority for VPC.** Can a VPC customer override a cloud-pushed
+   Cedar policy locally, or is the cloud bundle authoritative?
+   Security and product implications.
+4. **Unknown-actor handling.** When `actor_claim` resolves to an actor absent
+   from the allowlist: reject at `verify()`, or pass through and let Cedar deny
+   (with a default-deny rule)?
+5. **Multi-issuer.** Does `hybrid` need to validate against more than one OIDC
+   issuer simultaneously (cloud issuer + customer IdP)?
+6. **Bundle signing.** What signs the control-plane bundle, and how is that
+   root of trust provisioned to an air-gapped install?
+
+## Rollout
+
+1. Land `TokenVerifier` + `StaticHashTokenVerifier` as a pure refactor —
+   `mode = "static"` is the default, behavior identical. (Separate commit, per
+   AGENTS.md rule 11.)
+2. Add `OidcJwtVerifier` + JWKS cache; `mode = "oidc"` / `"hybrid"` opt-in.
+3. Add `ControlPlaneSync` as an optional component.
+4. Update [docs/user/server.md](../../user/server.md),
+   [docs/user/policy.md](../../user/policy.md), and
+   [docs/user/deployment.md](../../user/deployment.md) in the same changes.
+
+No breaking change: existing static-token deployments keep working untouched.
+
+## Alternatives considered
+
+- **Mandatory cloud control plane (Lakebase model).** Rejected — Lakebase
+  requires the Databricks control plane reachable, which kills the air-gapped
+  on-prem story and would put correctness behind a cloud service (deny-list:
+  cloud-only correctness).
+- **Opaque tokens + request-time introspection.** Rejected as the default —
+  adds request-path egress to the issuer, defeating the VPC decoupling. Kept as
+  possible future opt-in for revocation-sensitive tenants.
+- **Build an OmniGraph IdP.** Rejected — not our domain; delegate issuance to
+  an OIDC provider or a thin cloud control plane that may wrap WorkOS.
+
+## Testing
+
+- Extend `crates/omnigraph-server/tests/server.rs` with `TokenVerifier`
+  coverage; add a focused auth test module.
+- OIDC path: test harness signs JWTs with a local key; assert accept / expired
+  / bad-audience / bad-signature / unknown-actor.
+- Degraded mode: exercise JWKS-unreachable via the `failpoints` feature; assert
+  cached-key serving, the loud warning, and fail-closed past `jwks_stale_max`.
+- Confirm `omnigraph-server` remains the only crate with auth dependencies.
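
Most of the OIDC cases in the list above (accept, expired, bad-audience, unknown-actor) depend only on local claim checks once the signature has been verified. The sketch below shows those checks in isolation, under several assumptions: signature verification against cached JWKS has already succeeded, `aud` is a single string rather than an array, and unknown actors are rejected in the verifier, which is only one of the two options in open question 4. `Claims`, `OidcConfig`, `Reject`, and `check_claims` are hypothetical names.

```rust
use std::time::Duration;

/// Already-decoded claims from a JWT whose signature has been verified
/// (signature checking is deliberately outside this sketch).
#[derive(Debug, Clone, Copy)]
struct Claims<'a> {
    iss: &'a str,
    aud: &'a str,
    sub: &'a str,
    /// Expiry as seconds since the Unix epoch.
    exp: u64,
}

/// Subset of the `[auth.oidc]` settings that the claim checks need.
struct OidcConfig<'a> {
    issuer: &'a str,
    audience: &'a str,
    clock_skew: Duration,
    /// `None` means no allowlist is configured.
    allowed_actors: Option<&'a [&'a str]>,
}

#[derive(Debug, PartialEq, Eq)]
enum Reject {
    BadIssuer,
    BadAudience,
    Expired,
    UnknownActor,
}

/// Returns the actor id (the `sub` claim here) if every local check passes.
fn check_claims<'a>(
    cfg: &OidcConfig<'_>,
    claims: &Claims<'a>,
    now_unix: u64,
) -> Result<&'a str, Reject> {
    if claims.iss != cfg.issuer {
        return Err(Reject::BadIssuer);
    }
    if claims.aud != cfg.audience {
        return Err(Reject::BadAudience);
    }
    // A token stays valid until `exp + clock_skew`, inclusive.
    if now_unix > claims.exp.saturating_add(cfg.clock_skew.as_secs()) {
        return Err(Reject::Expired);
    }
    // Rejecting here is one option; the RFC leaves this decision open.
    if let Some(allowed) = cfg.allowed_actors {
        if !allowed.iter().any(|a| *a == claims.sub) {
            return Err(Reject::UnknownActor);
        }
    }
    Ok(claims.sub)
}
```

Passing `now_unix` as a parameter instead of reading the system clock lets a test exercise the expiry boundary deterministically.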