docs(rfc-013): add reader/writer scaling (§5.8) — split read fleet vs write tier

The substrate makes read/write scaling asymmetric: reads are object-store-backed, snapshot-isolated, stateless -> horizontal to N replicas with zero coordination; writes are serialized per (table,branch) by one-winner manifest CAS -> scale by partitioning (branches/graphs/cells), with a single active coordinator per (cell,graph,branch). Adds the CQRS deployment split (read fleet / write tier / maintenance / heavy reads), read-your-write via commit_id/snapshot_id pinning, and gates pooled WRITES (not reads) on closing the cross-process-CAS gap.
2026-06-18 02:24:27 +02:00 · 2026-06-16 18:53:01 +02:00 · 2026-06-16 18:53:01 +02:00 · c43b81d318
commit c43b81d318
parent 0f58329ab7
1 changed file with 63 additions and 0 deletions
--- a/docs/dev/rfc-013-tenancy-cells.md
+++ b/docs/dev/rfc-013-tenancy-cells.md
@@ -265,6 +265,64 @@ requests keep their `Arc`.
 A tenant graduates pooled → dedicated by moving its cell to its own process — **no data
 migration** (the storage root does not move; the cell is already self-contained).

+### 5.8 Workload scaling — readers vs writers
+
+The substrate makes read and write scaling **asymmetric**, and the cell model inherits
+and exploits that asymmetry. The asymmetry dictates the deployment shape, so it is part
+of the tenancy design, not an afterthought.
+
+**Reads scale horizontally and statelessly.** The object store is the source of truth;
+a read is snapshot-isolated and holds no shared mutable state, so any number of
+processes can open a snapshot and serve reads with **zero coordination**. Add read
+replicas (even per-region) freely. Per-cell independence means per-tenant read scaling
+is independent. Limits are real but ordinary: object-store request throughput
+(per-prefix S3 limits), index residency + CPU/RAM for *heavy* reads (vector ANN, FTS,
+aggregations), and in-process cache warmth (a cold tenant in a pooled fleet pays a
+cold-open). Freshness is the one catch: a replica's engine handle may lag live HEAD
+([rfc-011](rfc-011-cli-refactoring.md) note), so reads are eventually consistent within
+the handle-refresh window unless pinned to a snapshot.
+
+**Writes are serialized per `(table, branch)` and scale by partitioning, not by adding
+writers.** A single graph's main-branch write throughput is bounded by commit latency
+(object-store round-trips for stage + manifest CAS) and degrades under concurrency
+(one-winner CAS → retry). You scale writes by *partitioning the write set*:
+
+- more **branches** — different `(table, branch)` queues don't contend; bulk loads fan
+  out onto review branches and **merge** serializes the integration;
+- more **graphs**, more **cells** — fully independent write paths.
+
+Multiple replicas writing the *same* `(cell, graph, branch)` contend on manifest CAS.
+This is the documented one-winner-CAS territory (invariants.md known gaps: "Local
+`write_text_if_match` is not a cross-process CAS"; "recovery serialized against live
+writers in-process only"). So serving writes to one branch from several replicas needs a
+**single active writer/coordinator**, not N racing replicas.
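To make the one-winner behaviour concrete, here is a minimal sketch of a manifest commit loop built on a conditional write. It is not the engine's code: `InMemoryManifestStore`, `put_if_match`, `commit`, and `run_writers` are hypothetical names, and an in-process lock stands in for the object store's conditional put.

```python
import threading


class InMemoryManifestStore:
    """Hypothetical stand-in for an object store that supports conditional writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._manifest = ()

    def read(self):
        """Return the current (version, manifest) pair."""
        with self._lock:
            return self._version, self._manifest

    def put_if_match(self, expected_version, new_manifest):
        """Replace the manifest only if no other commit landed since the read."""
        with self._lock:
            if self._version != expected_version:
                return False  # lost the race; the caller must re-read and retry
            self._version += 1
            self._manifest = new_manifest
            return True


def commit(store, entry, max_attempts=50):
    """Append one entry to the manifest; return how many attempts it took."""
    for attempt in range(1, max_attempts + 1):
        version, manifest = store.read()
        if store.put_if_match(version, manifest + (entry,)):
            return attempt
    raise RuntimeError("gave up after repeated CAS conflicts")


def run_writers(n):
    """Race n writers against one manifest (assumed to model one branch)."""
    store = InMemoryManifestStore()
    attempts = []

    def worker(i):
        attempts.append(commit(store, f"commit-{i}"))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store, attempts
```

Every commit eventually lands, but each conflict costs the loser a fresh read and another round-trip, which is why adding writers to one branch adds retries rather than throughput.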
+
+**The deployment shape this implies — split the roles (CQRS at the deployment layer):**
+
+| Role | Scaling | Topology |
+|---|---|---|
+| **Read fleet** | horizontal, stateless, regional | many replicas behind a LB; each opens snapshots from object store |
+| **Write tier** | one active coordinator per `(cell, graph, branch)` | small; consistent-hash or a per-`(graph,branch)` lease routes writes to a single owner to bound CAS thrash |
+| **Maintenance** | single-coordinator, out-of-band | `optimize`/`cleanup`/reindex as async jobs ([rfc-011](rfc-011-cli-refactoring.md) D11), never inline with serving |
+| **Heavy reads** | own pool (optional) | vector/FTS/aggregation isolated so they don't starve point reads |
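The write-tier row mentions consistent-hash routing to a single owner. One simple way to get that property is rendezvous (highest-random-weight) hashing, sketched below as an illustration rather than the RFC's chosen mechanism. The function name `owner` and the coordinator and partition names are assumptions.

```python
import hashlib


def owner(partition, coordinators):
    """Pick the coordinator with the highest hash score for this partition.

    partition is a (cell, graph, branch) tuple of strings. Every router that
    sees the same coordinator set computes the same owner, with no shared state.
    """
    if not coordinators:
        raise ValueError("no write coordinators available")
    key = "/".join(partition)

    def score(node):
        digest = hashlib.sha256(f"{node}|{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # Sorting first makes the (practically impossible) tie-break deterministic.
    return max(sorted(coordinators), key=score)
```

Removing a coordinator only remaps the partitions it owned; every other `(cell, graph, branch)` keeps its writer. Routers must still agree on the coordinator set, so this bounds CAS thrash but is not by itself a correctness guarantee.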
+
+- **Read-your-write:** the write envelope returns `commit_id` / `snapshot_id`
+  ([rfc-001](rfc-001-queries-envelope-mcp.md)); a client needing RYW pins its next read
+  to that snapshot, or is routed (sticky) to the writer / a freshly-refreshed replica.
+  Otherwise reads are eventually consistent within the handle-refresh window; state that
+  explicitly, per [rfc-003](rfc-003-mcp-server-surface.md)'s "no silent eventual consistency."
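A small sketch of read-your-write by snapshot pinning, under stated assumptions: `Store` and `ReadReplica` are hypothetical classes, snapshots are modelled as immutable dicts indexed by `snapshot_id`, and `refresh` stands in for the handle refresh.

```python
class Store:
    """Hypothetical source of truth: an append-only list of immutable snapshots."""

    def __init__(self):
        self.snapshots = [{}]  # snapshot_id is the list index

    def commit(self, key, value):
        """Write one key and return the new snapshot_id (as a write envelope would)."""
        new_snapshot = dict(self.snapshots[-1])
        new_snapshot[key] = value
        self.snapshots.append(new_snapshot)
        return len(self.snapshots) - 1

    def head(self):
        return len(self.snapshots) - 1


class ReadReplica:
    """Hypothetical replica whose cached handle may lag the store's HEAD."""

    def __init__(self, store):
        self.store = store
        self.handle = store.head()

    def refresh(self):
        self.handle = self.store.head()

    def read(self, key, snapshot_id=None):
        # A pinned read opens the requested snapshot directly; an unpinned
        # read uses whatever snapshot the cached handle points at.
        sid = self.handle if snapshot_id is None else snapshot_id
        return self.store.snapshots[sid].get(key)


store = Store()
replica = ReadReplica(store)
sid = store.commit("k", "v1")
stale = replica.read("k")                    # handle not refreshed yet
pinned = replica.read("k", snapshot_id=sid)  # pinned to the write's snapshot
```

Here `stale` is `None` while `pinned` is `"v1"`: the pinned read sees the write on any replica, because snapshots are immutable and addressable by id.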
+
+**Cell-model interaction.** Read scaling is per-cell-independent and horizontal; write
+scaling is per-`(cell, graph, branch)`. The cell bounds blast radius and lets a hot
+tenant get a dedicated write coordinator (or its own process — the dedicated tier),
+while the long tail shares a pooled read fleet. The one genuinely hard coordination
+problem is a **pooled write fleet**: two replicas must not own the same `(cell, graph,
+branch)`, so it needs routing affinity or a lease. That ties directly to the
+cross-process-CAS known gap, which must be closed before pooled *writes* ship (pooled
+*reads* are unaffected). Single-writer-per-partition + many-readers is exactly what an
+object-store, log/Git-structured engine wants; the design states it rather than fighting
+it.
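One possible shape for the lease option is sketched below. `LeaseTable` is a hypothetical in-memory registry with an injected clock; a real one would have to live in a store with atomic conditional writes, which is exactly the gap noted above. The fencing token (a counter bumped on every ownership change) is an addition for illustration, not something the RFC specifies.

```python
class LeaseTable:
    """Hypothetical lease registry: at most one live holder per partition."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._leases = {}  # partition -> (holder, expires_at, token)

    def acquire(self, partition, holder, now):
        """Return a fencing token if holder owns the partition, else None."""
        current = self._leases.get(partition)
        if current is None:
            token = 1
        else:
            cur_holder, expires_at, cur_token = current
            live = now < expires_at
            if live and cur_holder != holder:
                return None  # another coordinator holds a live lease
            # Renewal keeps the token; takeover after expiry bumps it.
            token = cur_token if (live and cur_holder == holder) else cur_token + 1
        self._leases[partition] = (holder, now + self.ttl, token)
        return token
```

A coordinator would renew well inside the TTL and stop writing if renewal fails. The token lets the storage side reject a writer whose lease expired without its noticing, provided commits carry and check it.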
+
 ## 6. How the surfaces ride on top

 This is a server/topology change; the surfaces are consumers and need little or no
@@ -338,6 +396,11 @@ Today's `omnigraph-server --cluster <one>` is a `MultiCluster` with **one cell**
 - **Scope semantics.** `ResolvedActor.scopes` is currently `[Full]` and read nowhere;
  when OIDC populates real scopes, decide whether/how they feed Cedar context (a
  behavior change to sequence deliberately, not silently).
+- **Pooled-write coordination (§5.8).** A pooled write fleet needs a single active
+  owner per `(cell, graph, branch)` (consistent-hash or lease) and the cross-process
+  CAS gap closed first. Pooled *reads* need none of this; gate pooled *writes* on it.
+  Read-fleet freshness (handle-refresh lag) + read-your-write pinning is the companion
+  decision.

 ## 10. Relationship to prior RFCs