omnigraph/docs/dev/rfc-013-write-path-latency.md

# RFC-013: Write-path latency — capture-once `WriteTxn`, manifest-authoritative publish, bounded history, and a measured cost contract

**Status:** Proposed
**Author(s):** write-path latency investigation (handoff + multi-agent validation)
**Date:** 2026-06-19
**Audience:** engine / storage maintainers
**Builds on:**
[rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) (`GraphClient` — embedded ≡ remote),
the query-latency work (PR #268, read-path warm-up — the read-side twin of this change),
the iss-991 handoff (manifest-authoritative graph lineage / Phase 7),
[writes.md](writes.md), [execution.md](execution.md), [invariants.md](invariants.md).
**Tracking (dev graph `modernrelay`):** primary `iss-write-s3-roundtrip-amplification`; depth term `iss-991`; substrate seam `iss-863`/`iss-864`; branch-create `iss-691`; recovery `iss-856`/`iss-recovery-sweep-live-writer-rollback`/`iss-merge-recovery-partial-rollforward`; MemWAL `iss-681`; read twin `gap-read-path-rederivation`.

> Status maintained by maintainers: `Proposed` while open, `Accepted` on merge.

---

## Summary

On object-store-backed clusters a single trivial write (one edge, one branch op)
issues **hundreds of mostly-sequential object-store round-trips**, and that count
**grows without bound with the graph's commit history**, so a long-lived graph
degrades to minutes per edge. The cost is invisible on a local filesystem
(µs/call) and to correctness tests (results are right, just slow), and it was
never measured because nothing in the suite counts *object-store round-trips per
logical operation*.

This RFC specifies the optimal write path from first principles — **a write is a
pure function of one version-pinned snapshot, published in a single
manifest-atomic CAS** — and the **cost contract that makes its O(1)-in-history
guarantee provable and non-regressable** (deterministic IO-counted tests on every
PR). It collapses four hand-rolled writers into one `GraphPublishAuthority`,
moves graph lineage into the manifest (so the per-write `_graph_commits` scan
disappears), brings the internal metadata tables into compaction (so the
per-write `__manifest` scan stops growing), takes recovery off the hot path, and
adds an epoch fence for multi-writer safety. None of it is a substrate rewrite —
the manifest-CAS model is already correct and is exactly what Lance native
multi-table transactions (lance#7264) will later formalize; this RFC builds the
seam to that future and pays down the write path onto it.

**The dominant fix is demonstrated, not proposed:** a one-line opener-bypass
prototype (open writes direct-by-URI instead of through the lance-namespace builder)
flattens the depth-dominant term `31 + 12·depth → flat 4` and cuts a depth-80 edge
**2.7×** (1618 → 593 ops), measured end-to-end and functionally correct on
main/branch/node paths (§2.4). It is shippable as a standalone PR first (§9 step
3a); the rest of the RFC is the constant-factor + correctness + internal-residual
work layered on the same seam.

---

## 0. Validation ledger (read this first)

Every claim is tagged: **[M]** measured by me this cycle, **[S]** verified in
v0.7.0 source (`file:line` given), **[U]** verified against upstream
Lance/LanceDB/SlateDB source or docs, **[G]** tracked in the dev graph (slug
given), **[I]** inferred/reasoned.

A correction from the originating handoff: it hypothesized that **Cloudflare R2
walks the full manifest listing on every open** (a prod-only amplifier absent
from AWS). **This is false for the pinned Lance 7.0.0 [U].** R2 is treated as
lexically ordered (`list_is_lexically_ordered = !is_s3_express`,
`lance-io/.../providers/aws.rs:183`), so R2 gets the O(1) head-only manifest fast
path, same as AWS; only S3-Express buckets are excluded, and even those are O(1)
via the v7 `latest_version_hint.json`. There is no R2-list config fix because
there is no R2-list problem.

**The depth term — corrected attribution.** Two measurements, one
instrumentation-blind, one complete:

*(a) IOTracker probe [M] — internal tables only.* A throwaway probe (the
`warm_read_cost` harness applied to a single insert to `main`, swept across
commit depth) counted the two internal tables: `__manifest` ≈ 14 + 2·depth,
`_graph_commits` ≈ 9 + 2·depth → ≈ 23 + 4·depth, `write_iops = 1`. **But this
probe is structurally blind to the write path's per-table *data* opens** — they
bypass the instrumented opener (`table_wrapper`), so it reports `probes=0` for the
data tables. It measured the *minority* of the cost.

*(b) Network-proxy measurement [M] — all RPCs, fresh graph.* A counting proxy in
front of `rustfs` (sees every object-store RPC, under `--mode merge` — the
production path), on a brand-new graph (400 seed nodes, one committing merge per
checkpoint), classified by S3 key:

| commit depth | data `_versions` | `__manifest` | `_graph_commits` | node (RI) | schema | TOTAL | `write_iops` |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | 31 | 29 | 13 | 6 | 46 | 156 | 1 |
| 5 | 121 | 44 | 23 | 6 | 46 | 268 | 1 |
| 10 | 181 | 59 | 33 | 6 | 46 | 358 | 1 |
| 20 | 301 | 89 | 53 | 6 | 46 | 538 | 1 |
| 40 | 541 | 149 | 93 | 6 | 46 | 898 | 1 |
| 80 | 1021 | 269 | 173 | 6 | 46 | 1618 | 1 |

Slopes: **data table +12/depth (~67%)**, `__manifest` +3/depth, `_graph_commits`
+2/depth → **TOTAL ≈ 156 + 18·depth**, `write_iops` flat at 1. The IOTracker probe
(a) saw only the +4/depth internal subset — blind to the data-table opens, the
dominant ~67%.

**Constant-factor finding [M]: the schema contract is a flat 46 reads/write** — not
depth-scaling, but **29% of the depth-0 cost (46/156)**, from
`validate_schema_contract` re-running uncached on every resolve (`omnigraph.rs:561`).
A depth-slope gate will *not* catch it; WriteTxn's resolve/validate-once kills it,
and the §5.1 fitness assert (`validate_schema_contract_calls == 1`) is what pins it
(constant-factor delta, §6).

The dominant term is **the written table's open routed through the lance-namespace
builder ~13× per write** — now source-traced. The **write** path opens via
`DatasetBuilder::from_namespace` (`namespace.rs:174`, from `open_table_head_for_write`
`table_store.rs:181` / `namespace.rs:544`). Lance's builder calls the namespace's
`describe_table` once and uses only `response.location` (`lance` `builder.rs:130-178`)
— but omnigraph's `describe_table` **opens the whole dataset** just to produce that
location (`open_head` → `Dataset::open`, `namespace.rs:362`/`:112`), and `.load()`
then **resolves the latest version again** — a **double latest-resolution per
open**, ~13× per write, nothing cached. Crucially, latest-resolution is **not
inherently O(depth)**: the namespace path is O(depth) because it **misses the V2
lexical / `latest_version_hint.json` fast path** that the direct opener engages
(most likely because `load_table_from_namespace` attaches no shared `Session`/store
params, `namespace.rs:174` — inferred, not traced). The **read** path skips all of
it — `from_uri(location).with_version(N)`, one HEAD, O(1) — which is why reads are
flat (+12/depth on the data table, §0(b)). **Proven on omnigraph's own table [M]:**
a direct `Dataset::open` of the *same physical* 85-version edge table = **2 ops
(O(1))**, the `from_namespace` open of that identical table = the O(depth) sweep —
same bytes, two open paths. `checkout_version` is also O(1) — **exonerated**, not a
back-walk. So `from_uri().with_version(N)` is the O(1) primitive and step 3 makes
each open O(1) *intrinsically* (cleanup then becomes hygiene/interim, not
load-bearing for read cost — §2.3). **Mode-independent [U]:** `append` ≡ `merge` ≡ +12/depth, so §0(a)
measuring a single insert was *not* the defect — the defect is the namespace open
path, not the verb. **Using `from_namespace` per-open is a misuse of Lance's
design** (the namespace is a catalog/discovery layer — resolve once, then open the
dataset directly, `lance-namespace` `operations/index.md` **[U]**); the read path
already bypasses it (PR #268 Fix 2 — see §2.4).

**Corrected conclusion.** The depth blow-up is in omnigraph's DB layer and is
**data-table-dominated**: the redundant per-table opens (fixed by §9 step 3 —
WriteTxn open-once-by-pinned-version — plus scheduled *version cleanup* of the
node/edge tables) are ~70% of it; the uncompacted internal tables (§9 step 2) are
the secondary ~30%. Both the originating R2 hypothesis and the earlier "entirely
the internal tables" framing are wrong. The exact Lance call doing the data-table
chain re-read (`checkout_version` back-walk vs merge-insert conflict replay) is the
one unpinned item — see §12. Reads, by contrast, are flat in depth
(`warm_read_cost.rs`, PR #268). This is the O(history)-per-write →
O(N²)-cumulative behavior the production incident hit.

---

## 1. Problem & measurements

On object storage every call is a 10–100 ms RPC, there is no cheap stat, and
sequential RPCs serialize. A long-lived production graph on R2, originating handoff
**[M]**:

| operation | prod (R2) | local `file://` |
|---|---|---|
| one-edge `load --mode merge` → main | ~3 min (90 s workflow timeout) | <1 s |
| `branch create --from main` | 120 s | <1 s |
| one-row `load` → a branch | 204 s | <1 s |
| `branch delete` | 216 s | <1 s |
| warm read / `/healthz` | fast (0.2–2 s) | fast |

`iss-write-s3-roundtrip-amplification` **[G]** independently records the same:
cross-region single insert ~46 s, 5-node mutation ~110 s, vs ~390 ms for a
no-storage `/healthz`. Its acceptance criteria are this RFC's goal: *"a single
insert issues O(1)-to-few S3 round-trips, not O(number of tables); bulk mutations
amortize the manifest commit."*

The cost decomposes into terms; the dominant one scales with history (§0):

1. **Per-table opens through the O(depth) lance-namespace builder (DOMINANT,
   O(tables × depth)).** Each stage opens via `DatasetBuilder::from_namespace`
   (`namespace.rs:174`); its `describe_table` opens the whole dataset just to return
   a location (`open_head` → `Dataset::open`, `namespace.rs:362`/`:112`) and
   `.load()` resolves latest **again** — a double latest-resolution per open,
   O(depth) on the repro store, ~13× per write with nothing caching it **[S]**
   (§2.2). The read path's direct `from_uri().with_version(N)` is O(1). →
   **+12 reads/depth, ~70% of the slope [M]**. Fixed by opening once, by pinned
   version via the direct opener (§9 step 3); node/edge version *cleanup* bounds it
   further.
2. **Per-write `__manifest` scan (O(history), secondary).** Every publish
   full-scans the uncompacted `__manifest` (`load_publish_state` →
   `read_manifest_scan`, `state.rs:133-141`) **[S]**; the internal tables are
   never compacted/cleaned (`optimize` iterates node/edge only,
   `optimize.rs:895-904`) **[S]**. +3.1 reads/depth **[M]**.
3. **Per-write `_graph_commits` refresh (O(history), secondary).**
   `record_graph_commit` reloads the entire commit cache before each append
   (`commit_graph.rs:136-164`) **[S]**; never compacted/cleaned. +2.1 reads/depth
   **[M]**. The "read-path anti-pattern, now on writes" (`iss-991` handoff **[G]**).

Terms 2+3 are the secondary ~30%; term 1 dominates. Plus per-write fixed taxes: a `list_dir("__recovery/")` (`loader/mod.rs:197`,
`exec/mutation.rs:725`, `exec/merge.rs:1090`) **[S]**, and the publisher CAS
retry budget (`PUBLISHER_RETRY_BUDGET = 5`, `publisher.rs:51`) **[S]**.

Branch ops compound it: `branch create` is a per-table sequential fork loop
(`fork_branch_from_state`, `table_store.rs:282`); `branch delete` opens a
snapshot per *other* branch (`ensure_branch_delete_safe`, `omnigraph.rs:1317`)
and force-deletes per forked table sequentially (`cleanup_deleted_branch_tables`,
`omnigraph.rs:1359`) **[S]**.

---

## 2. Root cause (validated)

### 2.1 The write re-derives its world from storage every stage

`loader/mod.rs:400` captures a `snapshot` once, but downstream stages **ignore
it** and re-resolve **[S]**:

- `open_for_mutation_on_branch` (`table_ops.rs:505`) re-calls
  `resolved_branch_target` **per table** (`:512`), which runs
  `ensure_schema_state_valid` (a full schema-contract storage read with no cache,
  `omnigraph.rs:561-568`) and then opens **by head** via
  `open_dataset_head_for_write` (`:522`/`:559`), asserting head == pinned only
  *after* the open.
- `fresh_snapshot_for_branch` (`omnigraph.rs:771`) always does fresh I/O; the
  fork authority path re-reads the live manifest (`table_ops.rs:574`).
- The captured snapshot is used only for membership/fork checks, never for the
  actual opens.

The drift guards, CAS retries, and recovery scans are **compensating machinery**
for the staleness this self-inflicts. The `Snapshot`/coordinator primitive
already exists; it is treated as cheap-to-reacquire rather than as the
operation's authoritative identity.

### 2.2 The depth terms — data-table re-reads dominate, internal tables secondary

Confirmed in code and measurement (§0). The **dominant** term is §2.1's per-table
opens: ~13 opens per write through the lance-namespace builder
(`DatasetBuilder::from_namespace`, `namespace.rs:174`). The builder calls the
namespace's `describe_table` (`lance` `builder.rs:130-178`), and omnigraph's
`describe_table` opens the whole dataset just to return a location (`open_head` →
`Dataset::open`, `namespace.rs:362`/`:112`); `.load()` then resolves latest again —
a **double latest-resolution per open**, O(depth) on the repro store — so cost
grows with the table's version count (+12 reads/depth, ~70%). The **read** path
opens direct `from_uri().with_version(N)` (`namespace.rs:112` / `SubTableEntry::open`)
— O(1) — and native pylance is flat 6 ops at any depth **[U]**, so this is
omnigraph's *namespace-open* pattern, not Lance; `checkout_version` is O(1) and not
implicated. (The heavier `list_table_versions` — `versions()` + a checkout per
version, `namespace.rs:395-427` — is **not** on this path; it is test-only today, a
separate latent O(depth): §10 follow-up.) The **secondary** terms are the two
internal tables: `load_publish_state` and
`commit_graph.refresh` each full-scan a table that gains a fragment per write and
is never compacted (+5 reads/depth, ~30%). This is the `gap-read-path-rederivation`
**[G]** failure mode — "cost grows with fragment count" — on the *write* path,
where PR #268 never reached. `invariants.md` documents the internal-table half:
*"the internal metadata tables (`__manifest`, `_graph_commits`) are still not
compacted, so the probe and refresh cost still grows with fragment count."*

### 2.3 The `skip_auto_cleanup` interaction — and compaction ≠ cleanup

v0.7.0 sets `skip_auto_cleanup: true` deliberately (`table_store.rs` 10 sites +
`publisher.rs:392`) **[S]** — load-bearing, because Lance 7's on-by-default
`auto_cleanup` would GC `__manifest`-pinned snapshot versions (`lance.md` audit)
**[U]**. Two distinct levers were turned off and must be replaced *separately*:
**compaction** (`compact_files`) rewrites small fragments into fewer larger ones
but does **not** prune old versions; **cleanup** (`cleanup_old_versions`) prunes
old versions. Measured on a ~85-version graph **[M]**: `optimize`/compaction
*added a version* (data-table reads 1035 → 1083, frags 81→1 — **no help** against
the depth term); `cleanup --keep 3` dropped it 1035 → 63 (89 versions pruned across
7 tables, **16×**). So only *cleanup* bounds the version-chain length. Note today's
`cleanup`/`optimize` cover **node/edge tables only** (the "7 tables"; internal
`__manifest`/`_graph_commits` are excluded, `optimize.rs:895-904` **[M]**) — so
bounding the internal +5/depth residual needs them **added** to the key set (§9 step
2's code change). Operationally: `cleanup` aborts on a remote store without `--yes`
(the
scheduled job must pass it). Relation to step 3: while the namespace open is still
on the write path, cleanup **caps** the dominant term — a real interim mitigation;
once step 3 opens direct-by-version (O(1) regardless of version count, §2.4),
cleanup is **storage hygiene + internal-table sprawl**, not load-bearing for read
cost. The correct replacement is *scheduled* compaction **and** version cleanup
(§9 step 2), **not** re-enabling `auto_cleanup`. Without it, version history (and
per-write cost) grows forever.

### 2.4 Lance namespace: proper use (why the fix is bypass, not patch)

The upstream Lance Namespace is a **catalog / discovery layer** — "table
discovery, resolving table locations, and coordinating commits" — whose intended
division of labor is *"the namespace provides basic information about the table,
[then] the Lance SDK … fulfill[s] the other operations"* (`lance-namespace`
`namespace/index.md`, `operations/index.md`) **[U]**. It is meant to be consulted
to *resolve a table once*, after which you operate on the `Dataset` directly — **not
consulted on every per-table open on a hot path.** `DatasetBuilder::from_namespace`
itself reflects this: it calls `describe_table` only to extract `location`, then
reduces to a `from_uri` builder (`lance` `builder.rs:130-178`). For a system that
*already holds* each table's location + version (omnigraph's `__manifest` does, via
`SubTableEntry`), routing per-open resolution back through the namespace is the
anti-pattern — and it aligns with this project's invariant 1 ("resolve latest state
through the substrate's cheap primitive instead of re-scanning") and the deny-list
"cold re-derivation on the hot path."

So the fix is **bypass, not patch**: open writes by URI + pinned version
(`from_uri(location).with_version(N)`) — exactly what the **read** path already does
(PR #268 Fix 2; the read path's own comment notes the namespace open "would
full-scan `__manifest` twice per open (`describe_table` + `describe_table_version`)"),
so this completes #268's open-by-location migration on the write side (§9 step 3).
The **custom namespace impl stays** — it is still the right home for legitimate
*catalog* operations (`describe_table` / `table_exists` / `list_table_versions` /
`create_table_version` / managed-versioning commit coordination); only the
per-open *resolution* leaves it. Two Lance facts make this safe and final: opening
by explicit version is `default_resolve_version` = a single HEAD, O(1) (`lance`
`commit.rs:939-981`), and Lance's own latest-resolution cost work (version-hint, PR
#6752) confirms the latest path is the expensive one to avoid. **Proven on
omnigraph's own table [M]:** a direct `Dataset::open` of the *same physical*
85-version edge table is 2 ops (O(1)), while the `from_namespace` open of that
identical table is the O(depth) sweep — so latest-resolution is not inherently
O(depth); the namespace path is O(depth) only because it misses the fast path the
direct opener engages (likely the un-threaded `Session`). Step 3 therefore makes
each write open O(1) on its own — so node/edge `cleanup` (§2.3) is an **interim
mitigation + storage hygiene**, not load-bearing for read cost once step 3 ships.

**End-to-end proof [M] — the one-line opener bypass, measured.** A prototype
patched `open_dataset_head_for_write` (`table_store.rs:174`) to open directly by URI
(bypassing `from_namespace` — exactly step 3 / Alternative B), rebuilt v0.7.0, and
re-ran the depth sweep on a fresh graph:

| depth | data `edgeVER` baseline | data patched | TOTAL baseline | TOTAL patched |
|---:|---:|---:|---:|---:|
| 0 | 31 | **4** | 156 | 121 |
| 10 | 181 | **4** | 358 | 173 |
| 20 | 301 | **4** | 538 | 233 |
| 40 | 541 | **4** | 898 | 353 |
| 80 | 1021 | **4** | 1618 | **593** |

The dominant data-table term collapses `31 + 12·depth → flat 4` (O(1) in history),
the total slope drops `+18/depth → +5/depth` (the residual +5 is exactly the two
internal tables — step 2's scope), and at depth 80 a single edge drops **1618 → 593
ops (2.7×)** from this one change alone, before step 2 / Phase 7. Functional
correctness verified on the hot paths: main edge merge, branch create + write +
read-back, node merge (managed-versioning still correct) — the direct opener already
handles `checkout_branch` for non-main, so the namespace layer was not load-bearing
for write correctness on these paths. **Caveat:** the prototype did **not** exercise
schema-apply, branch merge, fork-on-first-write to a new table on a branch, overwrite
mode, or concurrent writers — a production step 3 must pass the full
`merge_truth_table`/recovery/failpoint suite (the namespace may do
managed-versioning work that matters there). It proves the thesis + hot-path
correctness, not drop-in completeness.

**Step 2 also proven [M].** On the step-3-patched binary at depth ~87, compacting
the internal tables to 1 fragment each (content-preserving) collapsed their scans:
`__manifest` 285 → 32 (8.9×), `_graph_commits` 177 → 11 (16×); the step-3 data term
stayed flat at 4. So **both depth terms are now empirically eliminated** — a depth-87
single edge drops **~1720 → 198 ops (~8.7×; ≈258 s → ≈30 s at 150 ms/RTT)** with
both fixes. The internal term is **fragment-scan growth** (`read_manifest_scan` /
`commit_graph.refresh` read all fragments of the *latest* version), so the fix is
**compaction** (merge fragments) — distinct from the data table's version-chain term
that step 3 / version-cleanup handle. `optimize`'s `all_table_keys`
(`optimize.rs:895-904`) excludes the internal tables, so step 2 is a real code
change, not just scheduling.

---

## 3. First principles

On object storage the only objective function is **minimize the number of
*sequential* round-trips per logical operation, and make that number invariant to
graph age, history depth, and table count** — under the hard floor of SI,
durability, atomicity, and loud integrity. Three generating principles fall out,
each mapped to a validated failure:

1. **Pin once, derive the rest (MVCC / invariant 15).** A write is a pure
   function of one immutable, fully-pinned snapshot
   `{branch, manifest_version, per-table (location, version, e_tag), schema_hash,
   writer_epoch}`, resolved exactly once; every stage reads only from it
   (open-by-pinned-version, O(1), cacheable); the only contact with "current" is
   the final CAS. → fixes §2.1.
2. **One source of truth, one commit (invariant 2).** Visibility + lineage +
   version bumps are **one atomic manifest write**; the commit graph, indexes,
   and topology are *projections* of the manifest, never second authorities to
   keep in sync. → fixes the §2.2 `_graph_commits` term (iss-991 Phase 7).
3. **The plan is the contract (correct-by-construction recovery).** The writer
   serializes its *complete* publish intent **before any HEAD moves**; the live
   commit and crash-recovery execute the *identical* plan, so they cannot
   diverge. → fixes the partial-publish bug class structurally
   (`iss-merge-recovery-partial-rollforward`, PR #277).

The optimal single-edge write under these: **~2–3 sequential hops, O(1) in size**
— 1 warm probe (0 if the coordinator is unchanged), 1 parallel stage of fragment
writes, 1 manifest CAS — regardless of 5 tables or 500, 10 commits or 10M.
Lance's own `test_commit_iops` (read 1 / write 2 / stages 3) **[U]** proves the
per-table primitive already hits this; the job is to make the *graph* write
inherit it.

This is not speculative: it is exactly what the two reference object-storage
databases do. **LanceDB** threads a pinned `Arc<Dataset>` + shared `Session` and
commits with one CAS off a captured `read_version`, never re-resolving "latest"
under default consistency **[U]**. **SlateDB** captures a snapshot, treats a
monotonic-ID manifest (no pointer file) as the *sole* authority, commits with one
conditional-PUT, recovers on open (never per-write), fences with a monotonic
`writer_epoch`, and compacts on a schedule **[U]**.

---

## 4. Reference-level design

### 4.1 The interface — one publish authority, one declarative plan

The deepest structural flaw is **four hand-rolled writers** (`load_as`,
`mutate_as`, `apply_schema_as`, `branch_merge_as`), each re-implementing open →
stage → commit → sidecar → lineage. There is **one publish machine**; the verbs
are different declarative plans fed to it.

```rust
// The pinned, immutable operation identity — resolved ONCE.
struct WriteTxn {
    branch: BranchRef,
    base: PinnedSnapshot,   // {manifest_version, per-table (loc,version,e_tag), schema_hash, writer_epoch}
    session: Arc<Session>,  // shared per-graph; warms metadata/index caches across opens
    handles: HandleCache,   // open-by-version; each table opened once, reused across stages
}

// A typed, declarative publish plan — the COMPLETE "what", built before any HEAD moves.
enum TableAction {
    Append(Stream), Upsert(Batch), Overwrite(Image), Delete(Pred),
    Fork { from_version: u64 }, Register(Schema), Tombstone,
}
struct PublishPlan {
    base: PinnedSnapshot,
    actions: Map<TableKey, Vec<TableAction>>,
    lineage: GraphCommitIntent,   // parent = base.head; rides the SAME manifest CAS (Phase 7)
    expected: Expectations,       // per-table versions + graph_head + writer_epoch
}

impl GraphPublishAuthority {
    async fn open_txn(&self, branch: BranchRef) -> WriteTxn;                          // 1 warm probe
    async fn publish(&self, txn: &WriteTxn, plan: PublishPlan) -> PublishedSnapshot;  // stage∥ → 1 CAS
}
```

Properties that make it optimal:

- **Stages take `&WriteTxn`/`&PublishPlan`, never storage** — re-resolution and
  open-latest are *unrepresentable*. Invariants 2/3/15 hold by construction.
- **The recovery sidecar *is* the serialized `PublishPlan`.** Phase C and
  recovery both call `plan.apply()` — a merge that bumps tables A+B can never
  roll A forward and silently drop B. The
  `iss-merge-recovery-partial-rollforward` bug class is gone by design.
- **One CAS.** `publish` issues exactly one conditional `__manifest`
  merge-insert carrying every touched-table version + the `graph_commit` /
  `graph_head` lineage rows + the `writer_epoch` check.
- **Verbs are thin lowerings.** `load`/`mutate`/`schema apply`/`branch merge`
  each build a `PublishPlan` and call `publish`. Four copies → one machine; the
  public `load_as`/`mutate_as` API is unchanged (it lowers internally).

The cost contract becomes part of `publish`'s documented API:

> `publish(txn, plan)` costs `opens ≤ |plan.touched_tables|` (0 warm),
> `stages ≤ 3`, `manifest_ops = O(1)` — **invariant to history depth and table
> count.**

### 4.2 Supporting mechanics (each validated this cycle)

| Mechanic | Design | Validation |
|---|---|---|
| Open by pinned version | `from_uri(location).with_version(N)` + shared `Session` + warm handle cache — the O(1) opener *reads* already use (`instrumentation::open_table_dataset:112`, `SubTableEntry::open` `db/manifest.rs:200`). **NOT** the write path's `from_namespace` builder (`namespace.rs:174`), whose `describe_table` + `.load()` do an O(depth) double latest-resolution (§2.2 — the dominant cost), and **NOT** `open_dataset_at_state` (opens head then checks out, `table_store.rs:232`, not O(1)). | #0 **[S]** |
| Strict-op SI | Update/Delete/SchemaRewrite open by pinned version (consistent read base) and the publish CAS rejects a *same-table* advance. Insert/Merge rely on Lance's natural rebase. **Do not remove the open guards wholesale** — that is a silent lost-update. | #5 **[S]** |
| Fork × pinned-version | Fork already opens source at the pinned version and creates the target from it; the live-manifest authority re-read before fork stays (not defeated by the pin). | #6 **[S]** |
| Open-once via the direct opener (**THE dominant depth fix**) | Reuse is **intra-transaction** (open each table once, by pinned version, thread it — kills the ~13 namespace-builder opens, the O(depth) double latest-resolution / +12/depth term, §0/§2.2). A commit invalidates its own entry, so no cross-write warm cache. Thread the shared per-graph `Session` through write opens (it is *not* today — `load_table_from_namespace` attaches no session, `namespace.rs:174`). | #9 **[S]** |
| Lineage in the manifest (Phase 7) | Publish `graph_commit` + mutable `graph_head:<branch>` rows in the same `__manifest` merge-insert with a branch-head CAS; `_graph_commits` becomes a projection. Removes the per-write `commit_graph.refresh` and closes the "manifest→commit-graph atomicity" + "commit-graph parent under concurrency" gaps. | `iss-991` **[G]**, **[S]** |
| Bounded history (compaction **and** cleanup) | Bring the internal table(s) into the `optimize` loop AND schedule version *cleanup* of node/edge tables — compaction rewrites fragments, only cleanup prunes the version chain that §2.2's dominant term re-reads (§2.3). No blob/PK/CAS blocker (`__manifest` has no blob column, `state.rs:44-72`; the unenforced PK is orthogonal to a content-preserving Rewrite). Post-Phase-7 there is only **one** internal table to compact. | #8 **[S]** |
| Recovery off the hot path | Move the per-write `list_dir("__recovery/")` to coordinator-open + the CAS-conflict path, guarded by a sidecar-age grace window (the sidecar carries `created_at` micros + a ULID, `recovery.rs:762`/`:1522`). | #4, `iss-856`/`iss-recovery-sweep-live-writer-rollback` **[G][S]** |
| Epoch fence | Monotonic `writer_epoch` in `__manifest`, CAS-claimed at writer init, checked on every publish. Fences a whole zombie *writer* deterministically (no TTL); closes the multi-process exposure and the Lance-MTT TTL-lease gap. | SlateDB `FenceableTransactionalObject` **[U]** |
| Branch create | Lance `Clone` instead of the per-table fork loop (O(tables)→O(1) sequential). | `iss-691` **[G]** |
| Branch delete | Run the per-other-branch safety check and the per-table reclaim loops concurrently (`buffer_unordered`); read branch sets from in-memory coordinator state. | **[S]** |

---

## 5. The cost contract — measurement & enforcement

The bug class is invisible to correctness tests, to local-FS tests, and to
wall-clock benches. You can only prevent a regression in a quantity you **define
precisely, measure deterministically, and bound on every PR.** The quantity is
*sequential object-store round-trips per logical operation, as a function of
history depth and table count.* OmniGraph already has the correct pattern for
**reads** (`warm_read_cost.rs`, `IOTracker`, swept to depth 20); this RFC extends
it across the write/branch/open surface. This is exactly how Lance and SlateDB
enforce it **[U]**.

### 5.1 Tier 1 — deterministic IO-counted gate (every PR)

Ordinary `cargo test`, hermetic (in-memory / tempdir + `IOTracker`), no S3, no
wall-clock. Two shapes:

```rust
// (A) cost-invariant-to-HISTORY — the load-bearing gate. Gate the MERGE verb (the prod path).
for depth in [10, 100, 1000] {              // REAL commit history, not row count
    build_history(depth);
    reset_counters();
    let s = measured_merge();               // --mode merge, the read-modify production path
    // PRIMARY — the dominant term (§0): the written table's data opens/reads, flat in depth.
    assert!(s.data_table_opens          <= touched_tables); // open each table ONCE, by pinned version
    assert!(s.data_table_reads_per_open <= K_OPEN);         // each open O(1) in the table's version count
    // SECONDARY — internal-table scans flat in depth (compaction + cleanup).
    assert!(s.manifest_ops  <= K_MANIFEST); // small CONSTANT, NOT a function of depth
    assert!(s.lineage_ops   <= K_LINEAGE);
    assert!(s.stages        <= 3);          // bounded sequential hops
}
assert_flat_across_depths();                // ALL terms — esp. data-table opens — flat in N

// (B) fitness functions — architectural invariants AS tests
assert_eq!(validate_schema_contract_calls(write), 1);   // resolve-once
assert_eq!(coordinator_resolutions(write), 1);          // O(1) resolution
assert_eq!(recovery_listdir_calls(steady_state_write), 0);
```

**Prerequisite, not a follow-up: route ALL opens (read + write) through the one
instrumented opener BEFORE the gate is meaningful.** Today the write path's data
opens bypass `table_wrapper` (the §0(a) blind spot), so a gate that asserts only
`manifest_ops`/`lineage_ops` would **pass a still-broken build** — one that
compacts the internal tables (§9 step 2) but keeps the dominant ~13× namespace-open
sweep (§2.2). The gate MUST count data-table opens/reads (the dominant term), which
requires the routing change first. The data term is **mode-independent** (append ≡
merge ≡ +12/depth **[U]**), so either verb exercises it; gate the **merge** verb
as the production path. **Fixture caveat [U]:** use *valid* edge endpoints — a
write to a non-existent endpoint fails RI validation and rolls back at ~192 ops
with **zero chain reads**, so a bad-endpoint fixture silently measures the rollback
path and would pass falsely.

The load-bearing rule both Lance and SlateDB mostly miss: **assert the constant is
flat across N, not just small at one N.** A shallow fixture cannot catch an
O(history) cost (the §0(b) table is the red baseline). Add a `num_stages`
(sequential-hop) assertion via a `ThrottledStore` wrapper (Lance's
`test_commit_iops` setup) so an O(N) listing also blows a wall-time budget.

### 5.2 Tier 2 — wall-clock trend (post-merge / nightly, never a PR gate)

A `ThrottledStore` criterion bench injecting cross-region RTT (50/150 ms/op — the
incident's regime) for single-insert and branch-op latency, with a threshold
alert (Bencher.dev `--err` / github-action-benchmark `fail-on-alert`). Both
reference DBs keep wall-clock out of the PR gate (too noisy on shared runners)
and use it only as a trend.

### 5.3 Close the loop — production metric

Emit `storage.ops` and `storage.stages` per logical operation as a span/counter
(cheap always-on atomics; the heavy per-table attributing wrapper stays
test-only behind a `test-util`-style feature, zero release cost). The number
asserted in CI is the number observed in prod — `iss-write-s3-roundtrip-amplification`'s
cross-region signal becomes a direct readout.

### 5.4 Process discipline — test-first for performance

Write the depth-sweep cost-budget test **first**: it goes **red today** (§0), the
WriteTxn + Phase-7 + compaction work turns it **green** (flat in N), and the
red→green is the proof. This is CLAUDE.md rule 12 applied to cost, and the
originating handoff's sequencing (§8/§9: land the tests before the fix so the win
is measured and locked). Add the policy (extend invariant 15 + testing.md "Cost
budget tests"): *any change touching the read/write/branch/open path MUST add or
extend a cost-budget test asserting the metric is flat at history depth.*

### 5.5 The correctness contract — concurrency tests (the safety twin of the cost gate)

The cost gate proves *fast*; these prove *safe*. §6.5's multi-writer cliff slipped
the suite for the same structural reason the latency bug did — **nothing runs the
schedule that triggers it**: the suite is single-process with the in-process queue
(the bug is cross-process), uses local/in-memory stores (no object-store
cross-process CAS), and its recovery tests cover restart-time sweep, not
live-writer rollback. **These four must land before `PublishPlan`/epoch merge
(steps 5):**

1. **Cross-process multi-writer on a real/emulated object store** (the *corruption*
   case) — N independent engine **processes** writing the same `(table, branch)`;
   assert all commit-or-cleanly-retry (no lost updates, no stuck "needs recovery,"
   no HEAD-ahead-of-manifest). **A single-process failpoint test cannot reproduce
   the corruption** (in-process degrades to clean OCC, §6.5) — this genuinely needs
   a multi-process harness (empirically 1/12 today). State that so nobody writes a
   single-process test expecting it to fail.
2. **Deterministic in-process interleaving (failpoint) — WRITTEN, passes [M].** Two→
   eight handles, sleep failpoint at the `commit_staged`→publish window
   (`loader/mod.rs:605`); resume losers and assert they retry cleanly. This
   demonstrates the **benign** path (N=8 → 2 commit, 6 clean OCC retries) — it is the
   regression guard for "in-process stays clean," *not* a reproduction of the
   cross-process cliff.
3. **Live-writer recovery** (`iss-recovery-sweep-live-writer-rollback`) — a
   concurrent open must not roll back a live in-flight publish (the grace window).
4. **Formal model** — a Quint/TLA+ model of `{two writers, interleave commit_staged
   and manifest-CAS}` (`iss-934`); it finds the §6.5 cliff immediately.
5. **Cross-table write-skew — WRITTEN, red, and driven red→green in-process [M].**
   Failpoint `loader.post_ri_pre_stage` (between RI-validation and staging): writer B
   validates "Bob exists" and parks; writer A `overwrite`s `node:Person` dropping Bob
   (non-cascading); B commits `Knows(Bob→Alice)` → committed orphan. The red test for
   the §7.1 fix. **Acceptance is a single-process gate** — unlike the §6.5 HEAD-ahead
   corruption (which genuinely needs the multi-process harness), this skew reproduces
   *deterministically in one process*: the parked edge writer's snapshot really does
   pin `edge:Knows:1` before the overwrite commits, so the overlap is real with two
   in-process handles. The fix went red→green in-process behind a shared head row
   (§7.1). Only #1–#4 (HEAD-ahead/epoch corruption) need cross-process scheduling.

Plus one **disambiguating run** owed (§6.5 confound): separate-handles in-process
on S3 — to confirm the corruption is the process boundary, not the store.

This mirrors the cost gate's discipline (assert across the dimension the suite
otherwise never exercises) — there, history depth; here, concurrent cross-process
schedules.

---

## 6. What is already right vs. the deltas

**Already correct — do not rewrite.** The in-memory `MutationStaging` accumulator,
the recovery sidecar mechanism, the per-(table,branch) write queue, D2, the sealed
`TableStorage` trait, and the read-path warm-up (PR #268) all stay. This is **not**
a substrate rewrite.

**One claim to soften — manifest-CAS is atomic *per publish*, not unconditionally
cross-table-serializing [M].** The manifest CAS (the reference impl of the
lance#7264 "Alternative A") makes each publish atomic and serializes any two writers
whose write-sets **share a `__manifest` row** — overlapping or same-table, which is
exactly why §6.5's same-table cases and the cascading-delete case retry cleanly. But
two writers touching **disjoint** tables write disjoint per-`object_id` rows, so Lance
sees no conflict and **both commit** (proven [M], §7.1). The genuinely-atomic
cross-table commit §13 contrasts with Delta is the **target** (§4.1's single
merge-insert over a shared head row), **not current state**. So "do not rewrite the
CAS" holds for the *commit primitive*, but the cross-table-serialization §7.1 needs
is a real addition (the shared `graph_head` row), not something the current CAS
already provides.

**The deltas (each a validated, localized gap):**

| # | Delta | Mechanism | Tracking |
|---|---|---|---|
| 1 | Snapshot re-derived per stage | capture-once `WriteTxn`, thread by ref | `iss-write-s3-roundtrip-amplification` |
| 2 | Write opens via `from_namespace` re-resolve the data-table ~13×/write, missing the fast path (**DOMINANT, +12/depth**) | open each table **once, direct `from_uri().with_version(N)`** (bypass namespace, §2.4) + shared Session | `iss-write-s3-roundtrip-amplification`, #0 |
| 3 | Lineage = 2nd authority, O(history) refresh (secondary) | Phase 7: lineage into `__manifest` | `iss-991` |
| 4 | `__manifest`/`_graph_commits` excluded from optimize/cleanup (`optimize.rs:895-904`; prototype pruned "7 tables" = node/edge only **[M]**) — the +5/depth residual after step 3 | **add them to `all_table_keys`** (a code change) + scheduled compaction/cleanup | `gap-read-path-rederivation` (write twin) |
| 5 | `list_dir("__recovery/")` per write | move to open + conflict, grace window | `iss-856`, `iss-recovery-sweep-live-writer-rollback` |
| 6 | 4 hand-rolled writers, commit↔recovery drift | one `PublishPlan` executed by both | `iss-merge-recovery-partial-rollforward` (PR #277) |
| 7 | No writer epoch (multi-process exposure) | `writer_epoch` in `__manifest` | — (new) |
| 8 | branch create = O(tables) fork loop | Lance `Clone` | `iss-691` |
| 9 | branch delete = sequential loops | concurrent `buffer_unordered` | — (new) |
| 10 | No write/branch cost gate (must count **data-table** opens; route all opens through the instrumented opener first) | Tier-1 IO-counted tests, merge verb | — (new) |
| 11 | Schema contract re-validated uncached per resolve (**flat 46 reads/write — 29% of depth-0 cost; constant, not depth**) | resolve/validate-once in `WriteTxn`; §5.1 `validate_schema_contract_calls==1` (the depth gate misses it) | `iss-write-s3-roundtrip-amplification` |

---

## 6.5 Concurrency correctness — the multi-writer cliff (proven [M])

The latency fixes are about *speed*; a separate, proven finding is about *safety*.
A multi-writer experiment **[M]** shows concurrent same-branch writers behave very
differently by topology:

| topology | concurrency | outcome |
|---|---|---|
| single server (shared in-proc queue, `loader/mod.rs:426`) | 12 | **12 / 12 commit** (clean) |
| in-process, separate handles, interleave failpoint at `commit_staged`→publish (`loader/mod.rs:605`) | 8 | **2 / 8 commit; the other 6 are clean retryable OCC** |
| multi-process (separate CLIs / S3, no shared queue) | 2 / 3 / 5 / 12 | **1 / N commit; the rest CORRUPT** |

**Two distinct failure modes — and the corruption is strictly cross-process:**

- **In-process → benign.** Even with *separate handles, no shared queue, high
  contention*, losers fail with `stale view of 'edge:Knows': expected manifest table
  version 5 but current is 7 — refresh and retry` — a **clean, retryable OCC
  conflict; graph state stays consistent.** The publisher CAS is doing its job.
- **Cross-process → corruption.** `Lance HEAD version N+1 ahead of manifest version
  N; a pending recovery sidecar requires rollback`. **Mechanism:** a losing writer
  advances the table's Lance HEAD (`commit_staged`) *before* the manifest CAS; when
  the CAS loses, HEAD is ahead of the manifest — a partial commit the per-write heal
  **defers** (`recovery.rs:978-988`; only the open-time sweep rolls back), so a
  *live* writer hitting it **fails instead of healing**. Self-heals on the next
  read-write reopen (not permanently bricked), but during a burst throughput
  collapses to one survivor. Reachable at **concurrency = 2** cross-process.

So in-process safety **already comes from the publisher CAS** (clean OCC); the
corruption needs the process boundary. *(Confound, stated honestly: the in-process
interleave ran on local-FS and the cross-process on S3-via-proxy — but
single-server-on-S3 was also clean (12/12), giving two independent "in-process
clean" points vs one "cross-process corrupt," triangulating on the process
boundary, not the store. One disambiguating run — separate-handles in-process on S3
— would move this from triangulated to proven; §5.5.)*

**Scoping (matters for urgency):** **single-server prod is serialized-correct, just
slow** — the in-process `(table,branch)` queue serializes same-branch writes (all 12
commit, no lost updates); the production incident was the *latency* (serialized
O(depth) writes → 90 s timeout), **not** corruption. The corruption hazard is
**latent**: it appears the moment a second writer exists (server replica,
CLI-alongside-server, multi-writer scale-out). **So: single-server today =
serialized-correct (slow; fixed by steps 2/3); multi-writer = UNSAFE until
`writer_epoch` lands.**

**The fix is the existing RFC, no new design.** The `A`-before-`B` window
(Lance HEAD moves before the manifest references it) is inherent to Lance's
per-table-lineage model — you cannot eliminate it, only fence and recover it: the
**`writer_epoch`** (delta #7) is a leader-lease via cross-process CAS so two writers
are never in the `commit_staged`→manifest-CAS window across processes (it removes
the concurrent-race dimension); the **`PublishPlan`=sidecar** (delta #6) makes a
single crashed writer roll forward/back deterministically (the crash dimension); and
**recovery off the hot path + grace window** (delta #5, Q2) is the exact reason the
live writers failed rather than self-healed (`iss-recovery-sweep-live-writer-rollback`).
This is the standard WAL-replay + leader-lease shape (confirmed against SlateDB's
`FenceableTransactionalObject` and Kleppmann's fencing-token canon, §10). **This
finding promotes #6/#7 from "nice correctness work" to the load-bearing guard that
gates multi-writer topologies — and it is the motivating case for them.**

---

## 7. Invariants & deny-list check

Touches and *strengthens* (does not weaken) invariants in
[invariants.md](invariants.md):

- **§2 (manifest-atomic visibility):** preserved; lineage now rides the same CAS
  (strengthens — closes the "manifest→commit-graph atomicity" gap).
- **§3 (one snapshot per op):** enforced *by construction* via `&WriteTxn`.
- **§4 (publish at one boundary):** unchanged — still one manifest publish.
- **§5 (recovery part of the commit protocol):** preserved; the sidecar *is* the
  `PublishPlan` (strengthens — commit and recovery cannot diverge). The grace
  window addresses the documented "recovery serialized against live writers
  in-process only" gap.
- **§7 (indexes derived) / §15 (one source of truth, cheaply derived):** this RFC
  is the write-side application of §15 — bound cost to the working set, not
  history. The commit graph becomes derived (strengthens).
- **§5 strict-op SI:** preserved (#5 validation — open guards kept for
  read-modify-write).

**Deny-list:** does *not* hit "cold re-derivation on the hot path" (it removes
two instances), "state that drifts" (lineage stops being a second authority), or
"acks before durable persistence." The `writer_epoch` is the closing move on the
"local `write_text_if_match` is not a cross-process CAS" / multi-process gaps —
add it before admitting multi-process write topologies.

No invariant is weakened. Two Known Gaps **close** (manifest→commit-graph
atomicity; commit-graph parent under concurrency, via Phase 7); one
(read-path-rederivation) gets its **write twin** filed and addressed.

### 7.1 Scope of the correctness claims (literature review, §13)

The "correct by construction" framing (§3, §4.1) is **precise but bounded** — the
DB-canon review flags three places not to over-claim:

- **Per-table serializability, not graph-wide — but the gap is narrow and now
  measured [M].** Three deterministic cases (failpoint `loader.post_ri_pre_stage`,
  placed between RI-validation and staging; red test in `tests/failpoints.rs`):
  - **Cross-table *disjoint* → genuine skew, VIOLATED.** A **non-cascading endpoint
    removal** — `node:Person` *overwrite* dropping Bob, touching only the node table
    — concurrent with an edge insert `Knows(Bob→Alice)`: both commit (write-set-only
    CAS, RI validated once pre-commit and never re-checked at publish) → **committed
    orphan**. (= `iss-ri-write-skew-dangling-edges` + the concurrent face of
    `iss-overwrite-orphans-committed-edges`.)
  - **Cross-table *overlapping* → incidentally protected.** `delete`-based removal
    **cascades** into `edge:Knows`, so the write-sets overlap, the per-table CAS
    engages, and the loser fails **cleanly** (stale-view OCC retry); invariant held.
  - **Same-table → NOT a separate skew.** Cardinality / `@unique(src)` have
    overlapping write-sets, so the per-table CAS holds the constraint; the loser's
    failure is the **HEAD-ahead corruption already scoped to #6/#7** (epoch +
    PublishPlan), not a consistency hole. *(This corrects an earlier
    over-generalization: cardinality/uniqueness do not share the read-set gap.)*

  So the skew is **reachable only for the non-cascading-overwrite × disjoint-edge-insert
  shape** — operation-specific, not constraint-specific.

  **The scoped fix alone is a no-op — proven [M], and the reason is mechanical.**
  Feeding the endpoint node-table versions into the edge's publish *expected* set
  (`check_expected_table_versions`, `publisher.rs:353`) was prototyped exactly; debug
  confirmed the pins reach the check, **and both writers still committed — the orphan
  persisted.** Every publish writes a *unique per-`object_id` row* into `__manifest`
  (merge key `object_id = version_object_id(table, version)`). Two disjoint-table
  writers (`node:Person` vs `edge:Knows`) touch **no common row**, so Lance's
  row-level merge-insert CAS commits both with **no conflict**, the publisher's retry
  loop **never fires**, and `check_expected_table_versions` — a **non-atomic
  pre-check, not part of the CAS** — is evaluated exactly once against the stale
  pre-both manifest and passes for both. The read-set pin only bites if the loser is
  **forced to retry and re-evaluate against fresh state**, which requires a *shared
  contention row* every publish touches. Adding a stand-in global head row
  (`UpdateAll`-touched by every publish) makes the disjoint writers overlap → Lance
  conflict → publisher retry → the reloaded pin (`edge:Knows:1` vs current `5`)
  rejects the stale writer → no orphan (red→green, failpoint suite 52/52). **That
  shared row is exactly Phase-7's `graph_head:<branch>`.**

  **Consequence — §7.1 is NOT a standalone single-server PR** (correct earlier text
  that called it "single-server-live, not deferrable" — it *is* urgent and
  epoch-independent, but it cannot ship against today's per-`object_id` manifest
  without a contention point). Land it one of three ways: **(a)** with Phase 7
  (step 4), reusing `graph_head:<branch>` as the contention row; **(b)** behind a
  minimal per-branch head row ahead of Phase 7 (~15 lines, as prototyped); or
  **(c)** as commit-time re-validation — still must win a serialization point first.
  **Recommended: (c) behind a per-branch head row.** The CAS-map approach carries the
  two costs §11 anticipated — *table-granularity false conflicts* (any `Person`
  overwrite conflicts with any concurrent `edge:Knows` insert, even different rows —
  needs a row-granularity read-set) and *scope* (a global head serializes the whole
  graph; per-branch `graph_head` is the right granularity). Commit-time re-validation
  is precise (no false positives) **and** reuses the same serialization point, so once
  the head row exists it strictly dominates the CAS-map. Either way the head row
  imposes an inherent trade — same-branch writers serialize cross-process (throughput
  ceiling 1/branch, bounded by `PUBLISHER_RETRY_BUDGET`) — **now a correctness
  requirement, not just a Phase-7 side effect** (§11).

  **Two faces, two fixes — do not bundle them.** The above addresses only the
  *concurrent* face (overlapping snapshots, `iss-ri-write-skew-dangling-edges`). The
  *sequential* face (`iss-overwrite-orphans-committed-edges`) — an overwrite drops a
  node that **already has a committed inbound edge**, with *zero* concurrency —
  **cannot** be caught by read-set-in-CAS: the later writer's snapshot legitimately
  post-dates the edge, so its pin matches and it commits. That is a pure
  **inbound-RI-validation** gap: when an overwrite/delete removes node endpoints,
  re-check that no live edge references them. A validation concern, not a CAS one;
  it needs no contention row and ships independently.
  *(Note: `iss-984` is a different bug — remote branch-merge idempotency — not this.)*
- **Recovery: roll-forward is by-construction; roll-back is not.** "Commit and
  recovery replay the identical plan" holds for the **redo** direction (shared
  `plan.apply()`). The undo classifier (NoMovement / UnexpectedAtP1 /
  UnexpectedMultistep / IncompletePhaseB) lives *outside* the shared executor, only
  at open-time — that's where ARIES-style divergence risk concentrates and where the
  §5.5 failpoint coverage is owed.
- **The fence and the cross-file atomicity rest on a linearizable conditional-put.**
  Kleppmann's fencing-token guarantee, the manifest CAS, and the epoch all require a
  linearizable register — true on S3/R2 (If-Match) but **not** on the local-FS path
  (`write_text_if_match` is content-token compare-then-replace, ABA-prone —
  `invariants.md` Known Gap). **Precondition to state up front: every "deterministic
  fence" / "atomic CAS" claim holds *on a store with linearizable conditional-put*;
  the epoch must not use the local-FS path.** Delta Lake §3.2.2 treats the
  object-store consistency model (read-after-write + put-if-absent) as a first-class
  design parameter; so should this RFC.

---

## 8. Relationship to Lance MTT (the seam, not a dependency)

`GraphPublishAuthority.publish(txn, plan)` is exactly the adapter to a future
Lance `catalog.transaction()`. lance#7264 ("Multi-Table Transactions via
Branching") is real and OmniGraph is its reference "Alternative A"
(fast-forward-main + WAL + roll-forward recovery) **[U]**, but it is a 5-day-old
discussion with two unbuilt dependencies (lance#7263 branch merge/rebase,
lance#7185 UUID branch paths), an unresolved central choice (it *favors*
pointer-swap — the opposite identity model from OmniGraph), and an open soundness
question (TTL lease needs an epoch). **Build the seam now on its own merits; do
not schedule around MTT landing.** When it ships, `publish`'s *body* swaps
(stage→CAS→sidecar → `catalog.transaction()`) while `WriteTxn`/`PublishPlan` and
every verb lowering stay. `iss-863`/`iss-864` **[G]** already scope this spike.

The MemWAL/LSM ingest tier (`iss-681` **[G]**, `dec-adopt-lance-v7-memwal`) is
**complementary, not competing, and not in flight** (the `memwal-benefit-analysis`
branch is an empty placeholder; the real analysis is commit `c9a81266`). MemWAL
sits *below* the manifest publisher (per-table durability, opt-in, intra-table);
`WriteTxn` owns the cross-table CAS. Build `WriteTxn` first.

---

## 9. Sequencing

Ordered by leverage and dependency. **The dominant depth term is the redundant
data-table opens (step 3), not the internal tables (step 2)** — §0; both must land
to flatten the curve.

1. **Measure first (Tier-1 gate). ✅ LANDED (gate + harness).** *Prerequisite (1a):*
   the write opener (`open_dataset_head`) is routed through the instrumented
   `open_dataset_tracked` so the gate can count data-table opens (§5.1). The
   write cost-budget tests live in `crates/omnigraph/tests/write_cost.rs` on a
   **shared, store-agnostic harness** (`tests/helpers/cost.rs`: `measure`/`IoCounts`/
   `assert_flat`/`local_graph`/`s3_graph`) that `warm_read_cost.rs` and
   `write_cost_s3.rs` also consume — one vocabulary, no duplicated `IOTracker`
   plumbing. The local gate ships green every-PR guards + the RED `#[ignore]`'d
   internal-table LOCK (step 2's red→green acceptance). *Still owed:* the prod
   `storage.ops` span metric (§5.3) and the bucket-gated `write_cost_s3.rs` opener
   LOCK (step 3a's red→green, S3-only per the §9-3a measurement note).
2. **Bound history — bring the INTERNAL tables into optimize/cleanup (a code
   change, not just scheduling).** Today `optimize`/`cleanup` iterate **node/edge
   keys only** (`optimize.rs:895-904`) — confirmed: the prototype's `cleanup --keep 3`
   pruned "7 tables" = the node/edge data tables; `__manifest`/`_graph_commits` were
   untouched **[M]**. So the residual +5/depth internal slope (§0b) is **not** fixed
   by today's tooling — step 2 is a real `all_table_keys` change to add the internal
   tables, then schedule compaction+cleanup (pass `--yes`; cleanup aborts on remote
   otherwise). The pruning mechanism is proven on a data table (1035→63, 16× **[M]**);
   the internal tables need the same inclusion. **Proven [M]:** compacting the
   internal tables collapsed their scans `__manifest` 285→32, `_graph_commits`
   177→11; with step 3 a depth-87 edge drops **~1720 → 198 ops** (§2.4). (Separately,
   node/edge cleanup **caps** the dominant data-table term as an interim *before*
   step 3 — after step 3 that term is flat regardless.) **HARD PREREQUISITE:** the
   Q8 boundary watermark must land **with** this step — Lance's version CAS is
   confirmed vulnerable to cleanup-resurrection (§12 Q8, a silent lost write on
   R2/S3), so scheduling cleanup without the watermark trades a latency bug for a
   correctness bug. (`gap-read-path-rederivation` write twin.)
3. **The opener fix — a shippable lead + the structural follow-on.**
   - **3a. Opener bypass (standalone PR, THE dominant fix — [M] proven). ✅ LANDED.**
     `TableStore::open_dataset_head_for_write` now delegates to the direct
     `open_dataset_head` opener (`Dataset::open` by URI + `checkout_branch`, routed
     through `instrumentation::open_dataset_tracked` so the cost gate can count it;
     no-op in prod) instead of the `from_namespace` builder. Measured end-to-end on
     the prototype: data term `31 + 12·depth → flat 4`, total `+18 → +5/depth`,
     depth-80 **2.7×** (§2.4), functionally correct on main/branch/node.
     **Acceptance:** the full `cargo test --workspace --locked` suite passes under the
     bypass (the `tests/` integration + `merge_truth_table` + recovery/failpoint
     suites the prototype's `--lib` run didn't cover — schema-apply, branch merge,
     fork-on-first-write, overwrite). **Namespace retired to test-only:** with both
     reads (Fix 2) and now writes bypassing it, *nothing in production routes through
     the Lance namespace* — confirming §2.4's premise. The dead per-table open chain
     (`load_table_from_namespace`, `open_table_head_for_write`) was deleted and the
     `StagedTableNamespace` contract apparatus gated `#[cfg(test)]`, mirroring the
     already-`#[cfg(test)]` read namespace (`BranchManifestNamespace`). **Measurement
     note (corrected):** the opener win is **S3-only** — local FS resolves latest with
     one cheap `read_dir` regardless of opener, so the namespace-vs-direct difference
     is invisible there (the local data-table read count *does* grow with depth, but
     that is the merge-insert/RI scan over O(depth) *fragments*, a compaction term,
     not the opener; depth-100 = 92 ops identically before and after the bypass). The
     opener LOCK therefore lives in the bucket-gated `write_cost_s3.rs`, not the local
     `write_cost.rs`.
   - **3b. Full `WriteTxn` (capture-once + intra-txn handle reuse + shared Session).**
     Formalize 3a's open-once into the pinned, threaded `WriteTxn` (re-resolution
     *unrepresentable*, invariant 3) and kill the flat-46 schema-read constant
     (resolve/validate-once, §0/§6). (`iss-write-s3-roundtrip-amplification`.)
4. **Phase 7 — lineage into the manifest.** Removes the per-write
   `commit_graph.refresh`; commit graph becomes a projection. (`iss-991`.)
   **Hard dependency: step 2 must land first (Q1, §12)** — each publisher retry
   re-runs the O(history) `load_publish_state` scan, so the `graph_head` CAS
   contention Phase 7 introduces is acceptable only once compaction bounds that
   scan. Acceptance includes the Q1 concurrent-same-branch-writer gate.
   **Carries the §7.1 concurrent write-skew fix.** The `graph_head:<branch>` row is
   the shared contention point the cross-table read-set-in-CAS needs — proven [M]
   that the read-set fix is a no-op without it (§7.1). So the concurrent face of the
   write-skew lands *with* this step (or, if §7.1 must ship earlier, behind a minimal
   per-branch head row — ~15 lines — or as commit-time re-validation). The
   *sequential* face (`iss-overwrite-orphans-committed-edges`) is independent:
   inbound-RI validation on node removal, no head row, ships anytime.
5. **`PublishPlan` unification + recovery off the hot path + epoch fence — the
   multi-writer safety guard.** Collapse the four writers; move the `__recovery` list
   to open/conflict; add the `writer_epoch` leader-lease. **Motivated by the proven
   §6.5 cliff** (multi-process same-branch writers corrupt at concurrency = 2) — this
   is the guard that makes multi-writer topologies safe, not optional polish.
   **Gated by the §5.5 correctness contract** (the four concurrency tests must land
   with it). `writer_epoch` must be a true cross-process conditional CAS — **not**
   the local-FS `write_text_if_match` path (§7.1). (`iss-856`,
   `iss-merge-recovery-partial-rollforward`, `iss-recovery-sweep-live-writer-rollback`,
   `iss-934`.)
6. **Branch ops.** Lance `Clone` for create (`iss-691`); concurrent delete loops.
7. **Freeze** investment in publisher/sidecar/fork internals; pursue the MTT
   seam (`iss-863`/`iss-864`) as the strategic exit.

**Land PR #277 first** — it closes `iss-merge-recovery-partial-rollforward` and is
the producer-side half of the `PublishPlan` discipline; the heal-relocation in
step 5 must preserve its merge pre-snapshot heal (`exec/merge.rs:1084-1090`) and
its open-time `IncompletePhaseB → RollBack` (which the per-write heal never
performed anyway).

---

## 10. Cross-reference map (the ties)

**Dev-graph items (modernrelay) — what this RFC ties together:**

- Primary: `iss-write-s3-roundtrip-amplification` (the bug).
- Depth term / Phase 7 (commit graph → manifest-derived projection): `iss-991`
  (related: `iss-707` structured commit-graph lineage; `iss-934` Quint
  multi-table-publish verification). Read twin: `gap-read-path-rederivation`.
- Substrate seam: `iss-863`, `iss-864`. Decision: `dec-adopt-lance-v7-memwal`
  (`iss-681`).
- Recovery: `iss-856`, `iss-recovery-sweep-live-writer-rollback`,
  `iss-merge-recovery-partial-rollforward`, `iss-903`, `iss-load-not-crash-safe`.
- Residual migration: `iss-950` (MR-A staged delete, retires D2), `iss-848`
  (index-coverage reconciler, owns `create_vector_index`).
- Branch/load: `iss-691`, `iss-677`, `iss-895`, `iss-topology-cross-branch-cache`,
  `iss-841`, `iss-982`, `iss-423`, `iss-989`.
- Concurrency correctness (survives MTT) — **two faces, two different fixes [M]**
  (§7.1): `iss-ri-write-skew-dangling-edges` (the *concurrent* face; fix =
  read-set-in-CAS **+ a shared `graph_head` contention row**, so it's coupled to
  step 4 / a minimal head row / commit-time re-validation — NOT a standalone PR) and
  `iss-overwrite-orphans-committed-edges` (the *sequential* face; fix =
  **inbound-RI validation on node removal**, ships independently, no contention row).
  *(`iss-984` — remote branch-merge idempotency — is unrelated; not a write-skew.)*
- Blockers: `blk-lance-6658` (shipped 7.0.0), `blk-lance-6666` (open, vector
  index two-phase), `blk-lance-blob-compaction`.
- Epics: `epc-bulk-data-plane`, `epc-lance-v7-migration`, `epc-783` (reliability
  harness), `epc-929` (Quint verification).

**Proposed new dev-graph wiring (not yet written):**

- New **Epic** `epc-write-path-latency` — owns the cluster of orphaned issues
  above (none currently has an epic).
- New **Gap** `gap-write-path-rederivation` — the write twin of
  `gap-read-path-rederivation` (current: write re-derives snapshot + scans
  uncompacted internal tables per write; target: capture-once + bounded history).
- New **Issues**: write-side cost-budget gate + prod metric (step 1; prereq 1a
  routes all opens through the instrumented opener); **opener bypass — open writes
  direct-by-URI, standalone (step 3a, [M] the dominant fix, completes PR #268 Fix 2
  on the write path, §2.4)**; full `WriteTxn` capture-once (step 3b); **add
  `__manifest`/`_graph_commits` to `all_table_keys`** for compaction+cleanup (step 2
  — a code change, `optimize.rs:895-904`); `PublishPlan` unification + epoch
  (step 5); branch-delete concurrency (step 6).
- **Per-table namespace retired to test-only (step 3a landed).** With reads (Fix 2)
  and now writes (step 3a) both opening direct-by-URI, *nothing in production routes
  through the per-table `StagedTableNamespace`*. The dead open chain
  (`load_table_from_namespace`, `open_table_head_for_write`) was deleted; the
  `StagedTableNamespace` struct/impl/factory are now `#[cfg(test)]`, mirroring the
  already-`#[cfg(test)]` read namespace (`BranchManifestNamespace`). Both are retained
  only to validate the `LanceNamespace` contract in unit tests. *Production catalog /
  managed-versioning commit coordination for `__manifest` itself goes through a
  **separate** namespace (`GraphNamespacePublisher`), unaffected by this change.* The
  former follow-up to harden `StagedTableNamespace::list_table_versions`
  (`checkout_version` per version, O(depth)) is now purely a test-hygiene note — no
  prod caller can hit it; if any future version-list / time-travel feature needs
  per-table version enumeration, build `TableVersion`s from `versions()` metadata
  directly rather than resurrecting the namespace open path.
- New **Decision** `dec-writetxn-manifest-authoritative-publish` — records this
  RFC's design choice and the MTT-seam stance.

**Key source locations (v0.7.0):**
`omnigraph.rs:561-568,739-779,1317-1389`; `table_ops.rs:505-609`;
`table_store.rs:157-280,282-341,797`; `loader/mod.rs:197,400,485,557`;
`exec/mutation.rs:725`; `exec/merge.rs:1084-1090`;
`db/manifest/publisher.rs:51,93-124,356-371,385,432-440,448-490`;
`exec/mutation.rs:640-673` (D2 rule); `db/manifest/state.rs:44-72,133-141`; `db/manifest/layout.rs:22-26`;
`db/manifest/namespace.rs:111-112` (read open, O(1)),`:357-385`/`:362` (`describe_table` → redundant `Dataset::open` — the write-path double-open),`:158-186,544-550` (write open via `from_namespace`),`:395-427` (`list_table_versions` per-version checkout — test-only O(depth), the §10 follow-up);
`db/manifest/recovery.rs:762,978-988,1522`; `db/commit_graph.rs:136-164,213-272`;
`db/omnigraph/optimize.rs:240,517,895-904`; `instrumentation.rs:37,112-131`;
`runtime_cache.rs:202-283`; `tests/warm_read_cost.rs` (the read-side gate to mirror).

**Upstream:** lance#7264/#7263/#7185 (MTT); Lance `with_version` O(1) open
(`from_namespace` → `describe_table`, `builder.rs:130-178`; `default_resolve_version`
= one HEAD, `commit.rs:939-981`; version-hint PR #6752),
`list_is_lexically_ordered = !is_s3_express` (`aws.rs:183`),
`IOTracker`/`assert_io_*`/`num_stages`, `test_commit_iops`,
`test_commit_uses_version_hint_on_non_lexical_store`; **lance-namespace** design
(`namespace/index.md`, `operations/index.md` — catalog/discovery layer, resolve
once); LanceDB `io_tracking.rs`, `test_reload_resets_consistency_timer`; SlateDB
`FenceableTransactionalObject` (epoch fence), `InstrumentedObjectStore`,
monotonic-ID manifest.

**Reproduce the §0(b) network measurement:** `rustfs` (S3-compat) on `:9000`
behind a ~90-LoC Go counting proxy on `:9100` (adds `LATENCY_MS`, preserves the
SigV4 `Host` header, `/__ctl/reset` + `/__ctl/stat`); an omnigraph cluster on
`s3://…/cluster` through the proxy. Single-write breakdown: reset the proxy log,
`load --mode merge` one edge, classify by S3 key. Depth slope: write N× to main,
diff the per-write log at depth D vs D+20 by table. Native baseline: pylance 7.0.0
`write_dataset(mode="append")` in a loop → flat 6 ops/append at any depth.

---

## 11. Drawbacks, alternatives, reversibility

**Drawbacks.** Phase 7 makes disjoint-table same-branch writers contend on the
`graph_head:<branch>` row (they don't today) — bounded by the Lance retry budget,
inherent to a linear per-branch DAG, gated on a measured concurrency test and on
step 2 landing first (§12 Q1, resolved). **Reframe [M]: this contention is
load-bearing for correctness, not merely a throughput tax.** The §7.1 write-skew is
*unreachable only because* the shared head row forces disjoint cross-table writers to
overlap, conflict, retry, and re-evaluate their read-set pins against fresh state
(proven — without it the scoped CAS fix is a no-op). So §7.1 and the head row are
**coupled**: the "drawback" is exactly what buys the cross-table invariant, and the
throughput ceiling (1 writer/branch, bounded by `PUBLISHER_RETRY_BUDGET`) is a
**correctness requirement** the moment §7.1 ships, not an optional Phase-7 side
effect. `PublishPlan` is a non-trivial refactor of four writers; it must land behind
the cost gate and the `merge_truth_table`/recovery/failpoint suites.

**Alternatives.** (A) *Caching band-aid only* — memoize schema validation, cache
opens within a request: ~30–50% fewer round-trips but leaves open-by-latest and
the O(history) terms. Mitigation, not a fix. (B) *Opener bypass only* (open
direct-by-URI+version, no full txn) — **kills the dominant depth term, now measured
[M]**: a one-line patch flattened the data term `31+12·depth → flat 4` and cut a
depth-80 edge **2.7×** (§2.4), leaving only the secondary internal-table term and
the writer unification. (C) *Full design (this RFC)* — correctness by construction.
(D) *Wait for Lance MTT* — future exit, not a current dependency (§8).
**Recommend: ship B as a standalone PR first (behind the step-1 gate), then C for
the constant-factor + correctness, then step 2 for the internal residual; D as the
strategic end-state.** B is the demonstrated dominant fix, not a partial one.

**Reversibility.** The interface (`WriteTxn`/`PublishPlan`) is internal and
reversible. Phase 7's new `__manifest` object types (`graph_commit`,
`graph_head`) are an **on-disk format addition** — additive (old binaries skip
unknown `object_type`s) but near-permanent; it earns its own validation pass
(forward/back-compat, the validation checklist in the `iss-991` handoff). The
`writer_epoch` is likewise a durable manifest field. Everything else (compaction
scheduling, recovery relocation, branch concurrency, the cost gate) is cheap to
undo.

---

## 12. Resolved questions (was: unresolved)

All five original open questions were investigated read-only against post-#277/#284
`origin/main`, upstream Lance 7.0.0, and the dev graph; each is resolved below. One
new item (Q6), surfaced by peer review, remains genuinely open.

1. **`graph_head` CAS contention → RESOLVED, gated on step 2 + a concurrency test.**
   Retry is publisher-owned; Lance's internal rebase-retry is disabled
   (`conflict_retries(0)`, `publisher.rs:385`) → no double-retry. Row-CAS is true
   one-winner (`TooMuchWriteContention` → retryable, `publisher.rs:432-440`),
   bounded by `PUBLISHER_RETRY_BUDGET = 5`. **But each retry re-runs the O(history)
   `load_publish_state` scan (`publisher.rs:455`)**, so `graph_head` contention
   multiplies the manifest term — **step 2 (compaction) is a hard prerequisite for
   step 4 (Phase 7)**. Same-branch is the real workload (the incident is concurrent
   `main` writes). Residual: a measured gate before Phase 7 — N≈100 concurrent
   same-branch writers, assert bounded retry + O(working-set) re-scan + P99 within
   SLA. Fallback: batched-lineage, or Alternative B (defer lineage-in-manifest).
2. **Recovery grace-window → RESOLVED.** PR #284 is **unrelated** (cluster-apply
   trap; zero `recovery.rs` changes). The dangerous rollback classifications
   (NoMovement / UnexpectedAtP1 / UnexpectedMultistep / #277's IncompletePhaseB)
   fire only at the open-time Full sweep; the per-write heal defers all rollback
   (`recovery.rs:978-988`), so moving the heal off the hot path doesn't break #277.
   A sidecar-age grace window (defer sidecars younger than T_grace, loud typed
   skip, `repair` override) on the existing `created_at`/ULID
   (`recovery.rs:762`/`:1522`) is the sound interim; the permanent fix is the
   in-process background reconciler `iss-856`. Lands step 5 with a failpoint test.
3. **Epoch fence × publisher CAS → RESOLVED (by construction).** With Lance retry
   off (Q1), the publisher loop is the only retry layer. Model `writer_epoch` as a
   **pre-publish hard-fail gate** beside `check_expected_table_versions`
   (`publisher.rs:462`) but non-retryable (a stale epoch is a protocol violation,
   not a race). No double-retry; the epoch gate and the row-CAS loop are
   sequential. SlateDB `FenceableTransactionalObject` is the precedent.
4. **Compaction cadence → RESOLVED.** Not `auto_cleanup` (GCs pinned versions).
   Not foreground every-N-writes (deny-list job-queue + invariant 7 + cost cliff).
   Minimum (step 2): extend `optimize`/`cleanup` to the internal tables AND node/edge
   version cleanup — no special-casing (`SidecarKind::Optimize` already covers a
   mid-compaction crash). Follow-up: an `iss-856`-shaped background reconciler
   triggered by a cheap fragment-count probe (work off the hot path; a reconciler,
   not a job queue — deny-list-clean; SlateDB's epoch-coordinated compactor is the
   precedent). CLI `omnigraph optimize` stays the operator override.
5. **`PublishPlan` residuals → RESOLVED.** Both `delete_where` and
   `create_vector_index` are representable as `TableAction` variants with existing
   sidecar coverage (`SidecarKind::Mutation`/`EnsureIndices`) and are
   content-preserving (roll-forward safe). `TableAction::Delete` migrates to staged
   two-phase via MR-A / `iss-950` (now unblocked — `blk-lance-6658` shipped); **D2
   retires then** (`enforce_no_mixed_destructive_constructive`,
   `exec/mutation.rs:640-673`). `TableAction::CreateVectorIndex` stays inline until
   `blk-lance-6666` ships (`iss-848` reconciler path).

**Resolved post-review:**

6. **The exact mechanism of the data-table chain re-read → RESOLVED (§0, §2.4).**
   Pinned by Lance-source trace + proxy + pylance isolation **[U]**: it is **not**
   `checkout_version` (O(1)) and **not** merge-insert conflict replay. The write
   open goes through `DatasetBuilder::from_namespace` (`namespace.rs:174`), whose
   `describe_table` opens the whole dataset just to return a location
   (`namespace.rs:362`/`:112`) and whose `.load()` resolves latest **again** — a
   double latest-resolution per open, ~13× per write, nothing cached. The open
   resolves latest **without the V2 lexical / version-hint fast path** the direct
   opener uses (likely the un-threaded `Session`/store params,
   `load_table_from_namespace` `namespace.rs:174` — inferred, not traced), so it is
   O(depth) where a direct `from_uri().with_version(N)` is O(1). **The mechanism
   question is now academic for the fix:** bypassing `from_namespace` makes the open
   flat regardless of the precise sub-mechanism (un-threaded `Session` / double
   resolve / missed hint) — the bypass is the answer. (`list_table_versions` is
   **not** on this path — test-only; §10 follow-up.) `checkout_version` stays
   exonerated.

**Resolved end-to-end [M]:**

7. **End-to-end prototype of step 3 → DONE, measured (§2.4 before/after).** A
   prototype patched the opener (`open_dataset_head_for_write`, `table_store.rs:174`)
   to bypass `from_namespace` and open direct-by-URI, rebuilt v0.7.0, and re-ran the
   sweep: the data term collapsed `31 + 12·depth → flat 4`, total `+18/depth →
   +5/depth` (residual = the two internal tables, step 2), depth-80 **1618 → 593 ops
   (2.7×)** — functionally correct on main edge merge, branch create+write+read, and
   node merge. So step 3's "closes the dominant term outright" is **measured, not
   inferred**, and the opener bypass is **shippable standalone** (§9 step 3a).
   **Remaining (not blockers for step 3a's thesis):** the prototype did not cover
   schema-apply / branch merge / fork-on-first-write / overwrite / concurrent — a
   production opener change must pass the full `merge_truth_table`/recovery/failpoint
   suite; and the internal-table cleanup demo (step 2) + the concurrency
   fault-injection harness (steps 4/5) are still owed.

**Newly surfaced (open):**

8. **CAS-resurrection after cleanup → CONFIRMED VULNERABLE [S]; boundary watermark
   is a HARD PREREQUISITE for step 2.** SlateDB found this race (RFC-0026 / issue
   #352): a writer that stalls between computing manifest id `N+1` and creating it
   can, *after GC deletes `N+1`*, re-create it and observe **false success**.
   Lance 7.0.0 was traced directly and is **not immune**: version creation is a plain
   `put_opts(naming_scheme.manifest_path(base, version), PutMode::Create)` /
   `rename_if_not_exists` (`lance-table commit.rs:1421-1437`, `:1358`) on a
   version-numbered, **pruneable** path, with **no monotonic/boundary/watermark
   guard** anywhere in the manifest/commit/dataset path; `cleanup_old_versions` is
   **timestamp-based** (`cleanup.rs:1086`), so it deletes the very file the only
   guard (AlreadyExists→rebase) relies on. A stalled publisher whose target version
   was pruned by step-2 cleanup gets a `PutMode::Create` **success on a non-existent
   version → false success.** Severity by store: **R2/S3 (lexical, prod) = a silent
   lost write** (the resurrected version doesn't win V2 latest-resolution, so data
   lands on a dead branch while the publisher believes it committed); non-lexical =
   the version hint (`commit.rs:1439`) is overwritten to the stale version and
   trusted (worse, but not the prod path). **Action:** step 2 ships **only with** a
   durable **boundary/floor watermark** (GC advances it before deleting; every writer
   rejects `id <= boundary` after a "successful" create — SlateDB's fix), which also
   bounds any list-then-read-latest fallback. This was "lowest-risk earliest item";
   it is now gated (§9 step 2).

---

## 13. External validation (subagents + literature)

Validated read-only against OSS prior art and the DB/distributed-systems canon:

- **SlateDB** (canonical object-store LSM) — tenet-by-tenet ✅ on capture-once
  snapshot, monotonic-ID manifest (no pointer file — *explicitly rejected* in their
  RFC-0001), the **epoch fence** (exact match: `FenceableTransactionalObject`,
  hard-fail, TTL-lease *explicitly rejected* — adopt as specified), background
  epoch-coordinated compaction/GC, and recovery-on-open. **Adopt-items OmniGraph is
  missing / under-specifies:** (1) the **boundary-file** CAS-resurrection guard (Q8);
  (2) **group-commit batching** — coalesce pending `PublishPlan`s into one manifest
  CAS, directly mitigating the Q1 / §6.5 contention; (3) SlateDB *peels* compaction
  state *out* of the manifest (RFC-0013) — the **opposite** of Phase 7's fold-*in*;
  §11 should defend "fold-in (lineage must be atomic with visibility) beats peel-out
  for us"; (4) **write back-pressure** when cleanup lags (`l0_max_ssts`). **Citation
  correction:** SlateDB has the per-RPC counter (`InstrumentedObjectStore`) but
  **not** the flatness-across-history gate — the depth-swept Tier-1 gate (§5.1) is
  OmniGraph-novel; cite it that way.
- **Literature** — OCC/MVCC (Kung-Robinson 1981; DDIA ch.7), ARIES redo/undo, the
  fencing-token canon (Kleppmann — whose motivating example *is* OmniGraph's
  S3-read-modify-write-paused-past-lease scenario), and the lakehouse genre (Delta
  Lake VLDB 2020, Iceberg spec, Neon). The spine — OCC-over-MVCC + one atomic
  manifest CAS + WAL-of-intent recovery + monotonic-epoch fence — is canon-blessed,
  and OmniGraph **exceeds** Delta/Iceberg on the axis that matters (both are
  explicitly *single-table*-transactional; the manifest CAS delivers the atomic
  *cross-table* commit Delta only speculates about). The three scoping caveats are in
  §7.1.
- **HelixDB** (embedded LMDB graph DB) — too different a substrate to validate the
  object-store machinery (LMDB's `commit()` subsumes tenets #2–#8 for free), but it
  **corroborates tenet #1** (capture-once, thread-by-reference, re-resolution
  unrepresentable — its `&mut RwTxn`-threaded traversal is the embedded twin of
  `WriteTxn`) and confirms the bug class is **substrate-induced**. Portable idea for
  the roadmapped traversal work: adjacency as a *persisted, sorted,
  label-partitioned projection* keyed by `(node, label)` (vs the cold-rebuilt
  `TypeIndex`).
-												write-path cost gate + opener bypass (#288)

* docs(rfc): RFC-013 write-path latency design + index link

* perf(engine): open write-path tables directly, bypassing the namespace builder

Write opens routed through DatasetBuilder::from_namespace, whose describe_table
opened the whole dataset just to return a location and then re-resolved the
latest version — an O(commit-depth) double latest-resolution per table open that
missed Lance's O(1) version-hint fast path. On an object store this dominated
write latency (~70%, RFC-013 section 2.4).

TableStore::open_dataset_head_for_write now delegates to the direct opener
(open_dataset_head: Dataset::open by URI + checkout_branch, routed through the
tracked opener so cost tests can count it; a no-op in production). The manifest
already holds every sub-table's location, so the namespace catalog lookup was
redundant; ensure_expected_version still validates head == pinned for strict ops.
This completes PR #268's open-by-location migration on the write side.

With both reads (PR #268) and now writes bypassing it, nothing in production
routes through the per-table Lance namespace. The dead open chain
(load_table_from_namespace, open_table_head_for_write) is deleted and the
StagedTableNamespace contract apparatus is gated #[cfg(test)], mirroring the
already-test-only read namespace; __manifest commit coordination
(GraphNamespacePublisher) is a separate component and is unaffected.

See docs/dev/rfc-013-write-path-latency.md sections 2.4 and 9 (step 3a).

* test(engine): write-path cost-budget gate on a shared harness

Adds tests/helpers/cost.rs, a store-agnostic cost harness (IoCounts/StagedCounts,
measure/measure_with_staged, assert_flat, local_graph/s3_graph) that the read-side
warm_read_cost.rs, write_cost.rs, and write_cost_s3.rs share, so the IOTracker /
task-local plumbing lives in exactly one place instead of duplicated per test.

write_cost.rs (local, every-PR) gates the internal-table scan term flat in
commit-history depth (a RED #[ignore]'d LOCK, the acceptance for bringing the
internal tables into compaction) plus green guards: a single insert's data writes
are bounded, a per-write read-op ceiling fails the moment a round-trip is added,
and a keyed insert routes through stage_merge_insert once with no stage_append or
vector-index build.

write_cost_s3.rs (bucket-gated, rustfs CI) gates the data-table opener term flat
across depth — the object-store-RPC phenomenon local FS cannot reproduce, and the
red->green proof of the opener bypass. Wired into the rustfs_integration CI job
and its path filter.

Guards the "hot-path cost is bounded by work, not history" invariant on writes.
See docs/dev/rfc-013-write-path-latency.md section 5.1, docs/dev/testing.md.

* docs(rfc): RFC-013 step 3a landed; write-skew coupling; cost-gate test map

- Section 9: mark step 1 (gate + harness) and step 3a (opener bypass) landed;
  record the per-table namespace retirement to test-only and the corrected
  measurement note (the opener win is S3-only; the local data-table growth is the
  merge-insert/RI fragment scan, a compaction term, not the opener).
- Sections 7.1/6/11/5.5/10: correct the cross-table write-skew analysis after a
  prototype proved the scoped expected-set fix is a no-op against the per-object_id
  manifest (disjoint writers never share a row, so Lance never conflicts, the
  publisher never retries, and the expected check is a non-atomic pre-check
  evaluated once against stale state). The fix needs a shared contention row
  (Phase-7 graph_head / a minimal head row / commit-time re-validation), so it is
  coupled to that row, not standalone; that contention is load-bearing for
  correctness, not a drawback. Split the concurrent face (read-set + head) from the
  sequential face (inbound-RI validation on node removal) -- two different fixes.
- testing.md: add write_cost.rs / helpers/cost.rs / write_cost_s3.rs to the test
  map; document the local-vs-S3 backend split; extend the cost-budget checklist
  item to the write/open path and point at the shared harness.

* test(engine): isolate the opener in the S3 cost gate; fail loud on S3 setup errors

Addresses two PR review findings on the bucket-gated write_cost_s3 gate:

- The data-table opener was not isolated: `data_reads` also counts the
  merge-insert/RI scan, which reads O(fragment-count) and so grows with history
  for a different reason (compaction's domain, not the opener) -- the same term
  that made the local data-table count grow. The flat assertion would false-RED
  or misattribute scan growth to the opener on rustfs. Fix: compact (db.optimize)
  before each measurement so the table holds ~1 fragment, bounding the scan and
  leaving the opener's latest-version resolution as the only history-varying term.
  Compaction preserves version history, so the opener still faces a deep
  _versions/ chain -- the thing under test.

- s3_graph used `.ok()?`, so when OMNIGRAPH_S3_TEST_BUCKET was set but the store
  was down/misconfigured, init/seed failures collapsed to None and the gate
  skipped + passed vacuously. Fix: skip only when the bucket env var is absent;
  once it is set, init/seed failures panic (mirrors tests/s3_storage.rs).

* test(engine): isolate the S3 opener with a per-prefix IO probe (correct-by-design)

Replaces the fixture-bounded isolation (compact-before-measure) from the prior
commit with the root fix: a path-classifying ObjectStore wrapper (PrefixCounter)
that attributes each data-table read to the opener term (_versions/.manifest) vs
the scan term (data/*.lance). IoCounts now exposes data_opener_reads /
data_scan_reads, so write_cost_s3 asserts the opener flat *directly* -- no
compaction or fixture massaging, and the assertion measures the opener, not the
conflated total. Closes the "harness conflates two IO terms" class: any cost test
(read or write) can now isolate the opener.

PrefixCounter implements only the object_store 0.13 core ObjectStore methods; the
convenience surface (get/put/head/...) routes through get_opts/put_opts via
ObjectStoreExt's blanket impl, so every read/write is still counted.

Validated locally (every-PR) by write_cost::data_table_reads_split_into_flat_opener_
and_growing_scan: opener stays flat (7 -> 3) while scan grows (11 -> 91) and
opener + scan == data_total exactly -- proving the classifier and confirming the
local data-table growth is the fragment scan, not the opener. warm_read_cost
(12 tests) stays green under the shared-harness change.

* refactor(tests): remove cost-harness duplication and namespace cfg(test) noise

Branch self-review (no behavior change) — pay down three liabilities the
write-path work left:

- warm_read_cost.rs kept its own probes() (three IOTrackers + a QueryIoProbes +
  a probe counter) and read raw .stats().read_iops — duplicating the shared
  helpers::cost harness this branch introduced. Migrated all 12 tests onto
  measure()/IoCounts; deleted the local probes(). (This also makes IoCounts'
  version_probes field used rather than dead.)

- insert_cost was copy-pasted verbatim into write_cost.rs and write_cost_s3.rs.
  Hoisted to helpers::cost::measure_insert so the measured write is defined once.

- The per-table Lance namespace (namespace.rs) became entirely test-only after
  step 3a, but was gated with ~22 per-item #[cfg(test)] attributes. Collapsed to
  a single `#[cfg(test)] mod namespace;` and stripped the per-item attributes;
  merged the import groups the gating had split.

Verified: lib in-source 162 passed; write_cost 4 + warm_read_cost 12 passed;
forbidden_apis passed.
											
										
										
											2026-06-20 13:31:15 +02:00
+								# RFC-013: Write-path latency — capture-once `WriteTxn`, manifest-authoritative publish, bounded history, and a measured cost contract
 								**Status:** Proposed
 								**Author(s):** write-path latency investigation (handoff + multi-agent validation)
 								**Date:** 2026-06-19
 								**Audience:** engine / storage maintainers
 								**Builds on:**
 								[rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) (`GraphClient` — embedded ≡ remote),
 								the query-latency work (PR #268, read-path warm-up — the read-side twin of this change),
 								the iss-991 handoff (manifest-authoritative graph lineage / Phase 7),
 								[writes.md](writes.md), [execution.md](execution.md), [invariants.md](invariants.md).
 								**Tracking (dev graph `modernrelay`):** primary `iss-write-s3-roundtrip-amplification`; depth term `iss-991`; substrate seam `iss-863`/`iss-864`; branch-create `iss-691`; recovery `iss-856`/`iss-recovery-sweep-live-writer-rollback`/`iss-merge-recovery-partial-rollforward`; MemWAL `iss-681`; read twin `gap-read-path-rederivation`.
 								> Status maintained by maintainers: `Proposed` while open, `Accepted` on merge.
 								---
 								## Summary
 								On object-store-backed clusters a single trivial write (one edge, one branch op)
 								issues **hundreds of mostly-sequential object-store round-trips**, and that count
 								**grows without bound with the graph's commit history**, so a long-lived graph
 								degrades to minutes per edge. The cost is invisible on a local filesystem
 								(µs/call) and to correctness tests (results are right, just slow), and it was
 								never measured because nothing in the suite counts *object-store round-trips per
 								logical operation*.
 								This RFC specifies the optimal write path from first principles — **a write is a
 								pure function of one version-pinned snapshot, published in a single
 								manifest-atomic CAS** — and the **cost contract that makes its O(1)-in-history
 								guarantee provable and non-regressable** (deterministic IO-counted tests on every
 								PR). It collapses four hand-rolled writers into one `GraphPublishAuthority`,
 								moves graph lineage into the manifest (so the per-write `_graph_commits` scan
 								disappears), brings the internal metadata tables into compaction (so the
 								per-write `__manifest` scan stops growing), takes recovery off the hot path, and
 								adds an epoch fence for multi-writer safety. None of it is a substrate rewrite —
 								the manifest-CAS model is already correct and is exactly what Lance native
 								multi-table transactions (lance#7264) will later formalize; this RFC builds the
 								seam to that future and pays down the write path onto it.
 								**The dominant fix is demonstrated, not proposed:** a one-line opener-bypass
 								prototype (open writes direct-by-URI instead of through the lance-namespace builder)
 								flattens the depth-dominant term `31 + 12·depth → flat 4` and cuts a depth-80 edge
 								**2.7×** (1618 → 593 ops), measured end-to-end and functionally correct on
 								main/branch/node paths (§2.4). It is shippable as a standalone PR first (§9 step
 a); the rest of the RFC is the constant-factor + correctness + internal-residual
 								work layered on the same seam.
 								---
 								## 0. Validation ledger (read this first)
 								Every claim is tagged: **[M]** measured by me this cycle, **[S]** verified in
 								v0.7.0 source (`file:line` given), **[U]** verified against upstream
 								Lance/LanceDB/SlateDB source or docs, **[G]** tracked in the dev graph (slug
 								given), **[I]** inferred/reasoned.
 								A correction from the originating handoff: it hypothesized that **Cloudflare R2
 								walks the full manifest listing on every open** (a prod-only amplifier absent
 								from AWS). **This is false for the pinned Lance 7.0.0 [U].** R2 is treated as
 								lexically ordered (`list_is_lexically_ordered = !is_s3_express`,
 								`lance-io/.../providers/aws.rs:183`), so R2 gets the O(1) head-only manifest fast
 								path, same as AWS; only S3-Express buckets are excluded, and even those are O(1)
 								via the v7 `latest_version_hint.json`. There is no R2-list config fix because
 								there is no R2-list problem.
 								**The depth term — corrected attribution.** Two measurements, one
 								instrumentation-blind, one complete:
 								*(a) IOTracker probe [M] — internal tables only.* A throwaway probe (the
 								`warm_read_cost` harness applied to a single insert to `main`, swept across
 								commit depth) counted the two internal tables: `__manifest` ≈ 14 + 2·depth,
 								`_graph_commits` ≈ 9 + 2·depth → ≈ 23 + 4·depth, `write_iops = 1`. **But this
 								probe is structurally blind to the write path's per-table *data* opens** — they
 								bypass the instrumented opener (`table_wrapper`), so it reports `probes=0` for the
 								data tables. It measured the *minority* of the cost.
 								*(b) Network-proxy measurement [M] — all RPCs, fresh graph.* A counting proxy in
 								front of `rustfs` (sees every object-store RPC, under `--mode merge` — the
 								production path), on a brand-new graph (400 seed nodes, one committing merge per
 								checkpoint), classified by S3 key:
 								| commit depth | data `_versions` | `__manifest` | `_graph_commits` | node (RI) | schema | TOTAL | `write_iops` |
 								|---:|---:|---:|---:|---:|---:|---:|---:|
 								| 0 | 31 | 29 | 13 | 6 | 46 | 156 | 1 |
 								| 5 | 121 | 44 | 23 | 6 | 46 | 268 | 1 |
 								| 10 | 181 | 59 | 33 | 6 | 46 | 358 | 1 |
 								| 20 | 301 | 89 | 53 | 6 | 46 | 538 | 1 |
 								| 40 | 541 | 149 | 93 | 6 | 46 | 898 | 1 |
 								| 80 | 1021 | 269 | 173 | 6 | 46 | 1618 | 1 |
 								Slopes: **data table +12/depth (~67%)**, `__manifest` +3/depth, `_graph_commits`
 								+2/depth → **TOTAL ≈ 156 + 18·depth**, `write_iops` flat at 1. The IOTracker probe
 								(a) saw only the +4/depth internal subset — blind to the data-table opens, the
 								dominant ~67%.
 								**Constant-factor finding [M]: the schema contract is a flat 46 reads/write** — not
 								depth-scaling, but **29% of the depth-0 cost (46/156)**, from
 								`validate_schema_contract` re-running uncached on every resolve (`omnigraph.rs:561`).
 								A depth-slope gate will *not* catch it; WriteTxn's resolve/validate-once kills it,
 								and the §5.1 fitness assert (`validate_schema_contract_calls == 1`) is what pins it
 								(constant-factor delta, §6).
 								The dominant term is **the written table's open routed through the lance-namespace
 								builder ~13× per write** — now source-traced. The **write** path opens via
 								`DatasetBuilder::from_namespace` (`namespace.rs:174`, from `open_table_head_for_write`
 								`table_store.rs:181` / `namespace.rs:544`). Lance's builder calls the namespace's
 								`describe_table` once and uses only `response.location` (`lance` `builder.rs:130-178`)
 								— but omnigraph's `describe_table` **opens the whole dataset** just to produce that
 								location (`open_head` → `Dataset::open`, `namespace.rs:362`/`:112`), and `.load()`
 								then **resolves the latest version again** — a **double latest-resolution per
 								open**, ~13× per write, nothing cached. Crucially, latest-resolution is **not
 								inherently O(depth)**: the namespace path is O(depth) because it **misses the V2
 								lexical / `latest_version_hint.json` fast path** that the direct opener engages
 								(most likely because `load_table_from_namespace` attaches no shared `Session`/store
 								params, `namespace.rs:174` — inferred, not traced). The **read** path skips all of
 								it — `from_uri(location).with_version(N)`, one HEAD, O(1) — which is why reads are
 								flat (+12/depth on the data table, §0(b)). **Proven on omnigraph's own table [M]:**
 								a direct `Dataset::open` of the *same physical* 85-version edge table = **2 ops
 								(O(1))**, the `from_namespace` open of that identical table = the O(depth) sweep —
 								same bytes, two open paths. `checkout_version` is also O(1) — **exonerated**, not a
 								back-walk. So `from_uri().with_version(N)` is the O(1) primitive and step 3 makes
 								each open O(1) *intrinsically* (cleanup then becomes hygiene/interim, not
 								load-bearing for read cost — §2.3). **Mode-independent [U]:** `append` ≡ `merge` ≡ +12/depth, so §0(a)
 								measuring a single insert was *not* the defect — the defect is the namespace open
 								path, not the verb. **Using `from_namespace` per-open is a misuse of Lance's
 								design** (the namespace is a catalog/discovery layer — resolve once, then open the
 								dataset directly, `lance-namespace` `operations/index.md` **[U]**); the read path
 								already bypasses it (PR #268 Fix 2 — see §2.4).
 								**Corrected conclusion.** The depth blow-up is in omnigraph's DB layer and is
 								**data-table-dominated**: the redundant per-table opens (fixed by §9 step 3 —
 								WriteTxn open-once-by-pinned-version — plus scheduled *version cleanup* of the
 								node/edge tables) are ~70% of it; the uncompacted internal tables (§9 step 2) are
 								the secondary ~30%. Both the originating R2 hypothesis and the earlier "entirely
 								the internal tables" framing are wrong. The exact Lance call doing the data-table
 								chain re-read (`checkout_version` back-walk vs merge-insert conflict replay) is the
 								one unpinned item — see §12. Reads, by contrast, are flat in depth
 								(`warm_read_cost.rs`, PR #268). This is the O(history)-per-write →
 								O(N²)-cumulative behavior the production incident hit.
 								---
 								## 1. Problem & measurements
 								On object storage every call is a 10–100 ms RPC, there is no cheap stat, and
 								sequential RPCs serialize. A long-lived production graph on R2, originating handoff
 								**[M]**:
 								| operation | prod (R2) | local `file://` |
 								|---|---|---|
 								| one-edge `load --mode merge` → main | ~3 min (90 s workflow timeout) | <1 s |
 								| `branch create --from main` | 120 s | <1 s |
 								| one-row `load` → a branch | 204 s | <1 s |
 								| `branch delete` | 216 s | <1 s |
 								| warm read / `/healthz` | fast (0.2–2 s) | fast |
 								`iss-write-s3-roundtrip-amplification` **[G]** independently records the same:
 								cross-region single insert ~46 s, 5-node mutation ~110 s, vs ~390 ms for a
 								no-storage `/healthz`. Its acceptance criteria are this RFC's goal: *"a single
 								insert issues O(1)-to-few S3 round-trips, not O(number of tables); bulk mutations
 								amortize the manifest commit."*
 								The cost decomposes into terms; the dominant one scales with history (§0):
 . **Per-table opens through the O(depth) lance-namespace builder (DOMINANT,
 								   O(tables × depth)).** Each stage opens via `DatasetBuilder::from_namespace`
 								   (`namespace.rs:174`); its `describe_table` opens the whole dataset just to return
 								   a location (`open_head` → `Dataset::open`, `namespace.rs:362`/`:112`) and
 								   `.load()` resolves latest **again** — a double latest-resolution per open,
 								   O(depth) on the repro store, ~13× per write with nothing caching it **[S]**
 								   (§2.2). The read path's direct `from_uri().with_version(N)` is O(1). →
 								   **+12 reads/depth, ~70% of the slope [M]**. Fixed by opening once, by pinned
 								   version via the direct opener (§9 step 3); node/edge version *cleanup* bounds it
 								   further.
 . **Per-write `__manifest` scan (O(history), secondary).** Every publish
 								   full-scans the uncompacted `__manifest` (`load_publish_state` →
 								   `read_manifest_scan`, `state.rs:133-141`) **[S]**; the internal tables are
 								   never compacted/cleaned (`optimize` iterates node/edge only,
 								   `optimize.rs:895-904`) **[S]**. +3.1 reads/depth **[M]**.
 . **Per-write `_graph_commits` refresh (O(history), secondary).**
 								   `record_graph_commit` reloads the entire commit cache before each append
 								   (`commit_graph.rs:136-164`) **[S]**; never compacted/cleaned. +2.1 reads/depth
 								   **[M]**. The "read-path anti-pattern, now on writes" (`iss-991` handoff **[G]**).
 								Terms 2+3 are the secondary ~30%; term 1 dominates. Plus per-write fixed taxes: a `list_dir("__recovery/")` (`loader/mod.rs:197`,
 								`exec/mutation.rs:725`, `exec/merge.rs:1090`) **[S]**, and the publisher CAS
 								retry budget (`PUBLISHER_RETRY_BUDGET = 5`, `publisher.rs:51`) **[S]**.
 								Branch ops compound it: `branch create` is a per-table sequential fork loop
 								(`fork_branch_from_state`, `table_store.rs:282`); `branch delete` opens a
 								snapshot per *other* branch (`ensure_branch_delete_safe`, `omnigraph.rs:1317`)
 								and force-deletes per forked table sequentially (`cleanup_deleted_branch_tables`,
 								`omnigraph.rs:1359`) **[S]**.
 								---
 								## 2. Root cause (validated)
 								### 2.1 The write re-derives its world from storage every stage
 								`loader/mod.rs:400` captures a `snapshot` once, but downstream stages **ignore
 								it** and re-resolve **[S]**:
 								- `open_for_mutation_on_branch` (`table_ops.rs:505`) re-calls
 								  `resolved_branch_target` **per table** (`:512`), which runs
 								  `ensure_schema_state_valid` (a full schema-contract storage read with no cache,
 								  `omnigraph.rs:561-568`) and then opens **by head** via
 								  `open_dataset_head_for_write` (`:522`/`:559`), asserting head == pinned only
 								  *after* the open.
 								- `fresh_snapshot_for_branch` (`omnigraph.rs:771`) always does fresh I/O; the
 								  fork authority path re-reads the live manifest (`table_ops.rs:574`).
 								- The captured snapshot is used only for membership/fork checks, never for the
 								  actual opens.
 								The drift guards, CAS retries, and recovery scans are **compensating machinery**
 								for the staleness this self-inflicts. The `Snapshot`/coordinator primitive
 								already exists; it is treated as cheap-to-reacquire rather than as the
 								operation's authoritative identity.
 								### 2.2 The depth terms — data-table re-reads dominate, internal tables secondary
 								Confirmed in code and measurement (§0). The **dominant** term is §2.1's per-table
 								opens: ~13 opens per write through the lance-namespace builder
 								(`DatasetBuilder::from_namespace`, `namespace.rs:174`). The builder calls the
 								namespace's `describe_table` (`lance` `builder.rs:130-178`), and omnigraph's
 								`describe_table` opens the whole dataset just to return a location (`open_head` →
 								`Dataset::open`, `namespace.rs:362`/`:112`); `.load()` then resolves latest again —
 								a **double latest-resolution per open**, O(depth) on the repro store — so cost
 								grows with the table's version count (+12 reads/depth, ~70%). The **read** path
 								opens direct `from_uri().with_version(N)` (`namespace.rs:112` / `SubTableEntry::open`)
 								— O(1) — and native pylance is flat 6 ops at any depth **[U]**, so this is
 								omnigraph's *namespace-open* pattern, not Lance; `checkout_version` is O(1) and not
 								implicated. (The heavier `list_table_versions` — `versions()` + a checkout per
 								version, `namespace.rs:395-427` — is **not** on this path; it is test-only today, a
 								separate latent O(depth): §10 follow-up.) The **secondary** terms are the two
 								internal tables: `load_publish_state` and
 								`commit_graph.refresh` each full-scan a table that gains a fragment per write and
 								is never compacted (+5 reads/depth, ~30%). This is the `gap-read-path-rederivation`
 								**[G]** failure mode — "cost grows with fragment count" — on the *write* path,
 								where PR #268 never reached. `invariants.md` documents the internal-table half:
 								*"the internal metadata tables (`__manifest`, `_graph_commits`) are still not
 								compacted, so the probe and refresh cost still grows with fragment count."*
 								### 2.3 The `skip_auto_cleanup` interaction — and compaction ≠ cleanup
 								v0.7.0 sets `skip_auto_cleanup: true` deliberately (`table_store.rs` 10 sites +
 								`publisher.rs:392`) **[S]** — load-bearing, because Lance 7's on-by-default
 								`auto_cleanup` would GC `__manifest`-pinned snapshot versions (`lance.md` audit)
 								**[U]**. Two distinct levers were turned off and must be replaced *separately*:
 								**compaction** (`compact_files`) rewrites small fragments into fewer larger ones
 								but does **not** prune old versions; **cleanup** (`cleanup_old_versions`) prunes
 								old versions. Measured on a ~85-version graph **[M]**: `optimize`/compaction
 								*added a version* (data-table reads 1035 → 1083, frags 81→1 — **no help** against
 								the depth term); `cleanup --keep 3` dropped it 1035 → 63 (89 versions pruned across
 tables, **16×**). So only *cleanup* bounds the version-chain length. Note today's
 								`cleanup`/`optimize` cover **node/edge tables only** (the "7 tables"; internal
 								`__manifest`/`_graph_commits` are excluded, `optimize.rs:895-904` **[M]**) — so
 								bounding the internal +5/depth residual needs them **added** to the key set (§9 step
 's code change). Operationally: `cleanup` aborts on a remote store without `--yes`
 								(the
 								scheduled job must pass it). Relation to step 3: while the namespace open is still
 								on the write path, cleanup **caps** the dominant term — a real interim mitigation;
 								once step 3 opens direct-by-version (O(1) regardless of version count, §2.4),
 								cleanup is **storage hygiene + internal-table sprawl**, not load-bearing for read
 								cost. The correct replacement is *scheduled* compaction **and** version cleanup
 								(§9 step 2), **not** re-enabling `auto_cleanup`. Without it, version history (and
 								per-write cost) grows forever.
 								### 2.4 Lance namespace: proper use (why the fix is bypass, not patch)
 								The upstream Lance Namespace is a **catalog / discovery layer** — "table
 								discovery, resolving table locations, and coordinating commits" — whose intended
 								division of labor is *"the namespace provides basic information about the table,
 								[then] the Lance SDK … fulfill[s] the other operations"* (`lance-namespace`
 								`namespace/index.md`, `operations/index.md`) **[U]**. It is meant to be consulted
 								to *resolve a table once*, after which you operate on the `Dataset` directly — **not
 								consulted on every per-table open on a hot path.** `DatasetBuilder::from_namespace`
 								itself reflects this: it calls `describe_table` only to extract `location`, then
 								reduces to a `from_uri` builder (`lance` `builder.rs:130-178`). For a system that
 								*already holds* each table's location + version (omnigraph's `__manifest` does, via
 								`SubTableEntry`), routing per-open resolution back through the namespace is the
 								anti-pattern — and it aligns with this project's invariant 1 ("resolve latest state
 								through the substrate's cheap primitive instead of re-scanning") and the deny-list
 								"cold re-derivation on the hot path."
 								So the fix is **bypass, not patch**: open writes by URI + pinned version
 								(`from_uri(location).with_version(N)`) — exactly what the **read** path already does
 								(PR #268 Fix 2; the read path's own comment notes the namespace open "would
 								full-scan `__manifest` twice per open (`describe_table` + `describe_table_version`)"),
 								so this completes #268's open-by-location migration on the write side (§9 step 3).
 								The **custom namespace impl stays** — it is still the right home for legitimate
 								*catalog* operations (`describe_table` / `table_exists` / `list_table_versions` /
 								`create_table_version` / managed-versioning commit coordination); only the
 								per-open *resolution* leaves it. Two Lance facts make this safe and final: opening
 								by explicit version is `default_resolve_version` = a single HEAD, O(1) (`lance`
 								`commit.rs:939-981`), and Lance's own latest-resolution cost work (version-hint, PR
 								#6752) confirms the latest path is the expensive one to avoid. **Proven on
 								omnigraph's own table [M]:** a direct `Dataset::open` of the *same physical*
 -version edge table is 2 ops (O(1)), while the `from_namespace` open of that
 								identical table is the O(depth) sweep — so latest-resolution is not inherently
 								O(depth); the namespace path is O(depth) only because it misses the fast path the
 								direct opener engages (likely the un-threaded `Session`). Step 3 therefore makes
 								each write open O(1) on its own — so node/edge `cleanup` (§2.3) is an **interim
 								mitigation + storage hygiene**, not load-bearing for read cost once step 3 ships.
 								**End-to-end proof [M] — the one-line opener bypass, measured.** A prototype
 								patched `open_dataset_head_for_write` (`table_store.rs:174`) to open directly by URI
 								(bypassing `from_namespace` — exactly step 3 / Alternative B), rebuilt v0.7.0, and
 								re-ran the depth sweep on a fresh graph:
 								| depth | data `edgeVER` baseline | data patched | TOTAL baseline | TOTAL patched |
 								|---:|---:|---:|---:|---:|
 								| 0 | 31 | **4** | 156 | 121 |
 								| 10 | 181 | **4** | 358 | 173 |
 								| 20 | 301 | **4** | 538 | 233 |
 								| 40 | 541 | **4** | 898 | 353 |
 								| 80 | 1021 | **4** | 1618 | **593** |
 								The dominant data-table term collapses `31 + 12·depth → flat 4` (O(1) in history),
 								the total slope drops `+18/depth → +5/depth` (the residual +5 is exactly the two
 								internal tables — step 2's scope), and at depth 80 a single edge drops **1618 → 593
 								ops (2.7×)** from this one change alone, before step 2 / Phase 7. Functional
 								correctness verified on the hot paths: main edge merge, branch create + write +
 								read-back, node merge (managed-versioning still correct) — the direct opener already
 								handles `checkout_branch` for non-main, so the namespace layer was not load-bearing
 								for write correctness on these paths. **Caveat:** the prototype did **not** exercise
 								schema-apply, branch merge, fork-on-first-write to a new table on a branch, overwrite
 								mode, or concurrent writers — a production step 3 must pass the full
 								`merge_truth_table`/recovery/failpoint suite (the namespace may do
 								managed-versioning work that matters there). It proves the thesis + hot-path
 								correctness, not drop-in completeness.
 								**Step 2 also proven [M].** On the step-3-patched binary at depth ~87, compacting
 								the internal tables to 1 fragment each (content-preserving) collapsed their scans:
 								`__manifest` 285 → 32 (8.9×), `_graph_commits` 177 → 11 (16×); the step-3 data term
 								stayed flat at 4. So **both depth terms are now empirically eliminated** — a depth-87
 								single edge drops **~1720 → 198 ops (~8.7×; ≈258 s → ≈30 s at 150 ms/RTT)** with
 								both fixes. The internal term is **fragment-scan growth** (`read_manifest_scan` /
 								`commit_graph.refresh` read all fragments of the *latest* version), so the fix is
 								**compaction** (merge fragments) — distinct from the data table's version-chain term
 								that step 3 / version-cleanup handle. `optimize`'s `all_table_keys`
 								(`optimize.rs:895-904`) excludes the internal tables, so step 2 is a real code
 								change, not just scheduling.
 								---
 								## 3. First principles
 								On object storage the only objective function is **minimize the number of
 								*sequential* round-trips per logical operation, and make that number invariant to
 								graph age, history depth, and table count** — under the hard floor of SI,
 								durability, atomicity, and loud integrity. Three generating principles fall out,
 								each mapped to a validated failure:
 . **Pin once, derive the rest (MVCC / invariant 15).** A write is a pure
 								   function of one immutable, fully-pinned snapshot
 								   `{branch, manifest_version, per-table (location, version, e_tag), schema_hash,
 								   writer_epoch}`, resolved exactly once; every stage reads only from it
 								   (open-by-pinned-version, O(1), cacheable); the only contact with "current" is
 								   the final CAS. → fixes §2.1.
 . **One source of truth, one commit (invariant 2).** Visibility + lineage +
 								   version bumps are **one atomic manifest write**; the commit graph, indexes,
 								   and topology are *projections* of the manifest, never second authorities to
 								   keep in sync. → fixes the §2.2 `_graph_commits` term (iss-991 Phase 7).
 . **The plan is the contract (correct-by-construction recovery).** The writer
 								   serializes its *complete* publish intent **before any HEAD moves**; the live
 								   commit and crash-recovery execute the *identical* plan, so they cannot
 								   diverge. → fixes the partial-publish bug class structurally
 								   (`iss-merge-recovery-partial-rollforward`, PR #277).
 								The optimal single-edge write under these: **~2–3 sequential hops, O(1) in size**
 								— 1 warm probe (0 if the coordinator is unchanged), 1 parallel stage of fragment
 								writes, 1 manifest CAS — regardless of 5 tables or 500, 10 commits or 10M.
 								Lance's own `test_commit_iops` (read 1 / write 2 / stages 3) **[U]** proves the
 								per-table primitive already hits this; the job is to make the *graph* write
 								inherit it.
 								This is not speculative: it is exactly what the two reference object-storage
 								databases do. **LanceDB** threads a pinned `Arc<Dataset>` + shared `Session` and
 								commits with one CAS off a captured `read_version`, never re-resolving "latest"
 								under default consistency **[U]**. **SlateDB** captures a snapshot, treats a
 								monotonic-ID manifest (no pointer file) as the *sole* authority, commits with one
 								conditional-PUT, recovers on open (never per-write), fences with a monotonic
 								`writer_epoch`, and compacts on a schedule **[U]**.
 								---
 								## 4. Reference-level design
 								### 4.1 The interface — one publish authority, one declarative plan
 								The deepest structural flaw is **four hand-rolled writers** (`load_as`,
 								`mutate_as`, `apply_schema_as`, `branch_merge_as`), each re-implementing open →
 								stage → commit → sidecar → lineage. There is **one publish machine**; the verbs
 								are different declarative plans fed to it.
 								```rust
 								// The pinned, immutable operation identity — resolved ONCE.
 								struct WriteTxn {
 								    branch: BranchRef,
 								    base: PinnedSnapshot,   // {manifest_version, per-table (loc,version,e_tag), schema_hash, writer_epoch}
 								    session: Arc<Session>,  // shared per-graph; warms metadata/index caches across opens
 								    handles: HandleCache,   // open-by-version; each table opened once, reused across stages
 								}
 								// A typed, declarative publish plan — the COMPLETE "what", built before any HEAD moves.
 								enum TableAction {
 								    Append(Stream), Upsert(Batch), Overwrite(Image), Delete(Pred),
 								    Fork { from_version: u64 }, Register(Schema), Tombstone,
 								}
 								struct PublishPlan {
 								    base: PinnedSnapshot,
 								    actions: Map<TableKey, Vec<TableAction>>,
 								    lineage: GraphCommitIntent,   // parent = base.head; rides the SAME manifest CAS (Phase 7)
 								    expected: Expectations,       // per-table versions + graph_head + writer_epoch
 								}
 								impl GraphPublishAuthority {
 								    async fn open_txn(&self, branch: BranchRef) -> WriteTxn;                          // 1 warm probe
 								    async fn publish(&self, txn: &WriteTxn, plan: PublishPlan) -> PublishedSnapshot;  // stage∥ → 1 CAS
 								}
 								```
 								Properties that make it optimal:
 								- **Stages take `&WriteTxn`/`&PublishPlan`, never storage** — re-resolution and
 								  open-latest are *unrepresentable*. Invariants 2/3/15 hold by construction.
 								- **The recovery sidecar *is* the serialized `PublishPlan`.** Phase C and
 								  recovery both call `plan.apply()` — a merge that bumps tables A+B can never
 								  roll A forward and silently drop B. The
 								  `iss-merge-recovery-partial-rollforward` bug class is gone by design.
 								- **One CAS.** `publish` issues exactly one conditional `__manifest`
 								  merge-insert carrying every touched-table version + the `graph_commit` /
 								  `graph_head` lineage rows + the `writer_epoch` check.
 								- **Verbs are thin lowerings.** `load`/`mutate`/`schema apply`/`branch merge`
 								  each build a `PublishPlan` and call `publish`. Four copies → one machine; the
 								  public `load_as`/`mutate_as` API is unchanged (it lowers internally).
 								The cost contract becomes part of `publish`'s documented API:
 								> `publish(txn, plan)` costs `opens ≤ |plan.touched_tables|` (0 warm),
 								> `stages ≤ 3`, `manifest_ops = O(1)` — **invariant to history depth and table
 								> count.**
 								### 4.2 Supporting mechanics (each validated this cycle)
 								| Mechanic | Design | Validation |
 								|---|---|---|
 								| Open by pinned version | `from_uri(location).with_version(N)` + shared `Session` + warm handle cache — the O(1) opener *reads* already use (`instrumentation::open_table_dataset:112`, `SubTableEntry::open` `db/manifest.rs:200`). **NOT** the write path's `from_namespace` builder (`namespace.rs:174`), whose `describe_table` + `.load()` do an O(depth) double latest-resolution (§2.2 — the dominant cost), and **NOT** `open_dataset_at_state` (opens head then checks out, `table_store.rs:232`, not O(1)). | #0 **[S]** |
 								| Strict-op SI | Update/Delete/SchemaRewrite open by pinned version (consistent read base) and the publish CAS rejects a *same-table* advance. Insert/Merge rely on Lance's natural rebase. **Do not remove the open guards wholesale** — that is a silent lost-update. | #5 **[S]** |
 								| Fork × pinned-version | Fork already opens source at the pinned version and creates the target from it; the live-manifest authority re-read before fork stays (not defeated by the pin). | #6 **[S]** |
 								| Open-once via the direct opener (**THE dominant depth fix**) | Reuse is **intra-transaction** (open each table once, by pinned version, thread it — kills the ~13 namespace-builder opens, the O(depth) double latest-resolution / +12/depth term, §0/§2.2). A commit invalidates its own entry, so no cross-write warm cache. Thread the shared per-graph `Session` through write opens (it is *not* today — `load_table_from_namespace` attaches no session, `namespace.rs:174`). | #9 **[S]** |
 								| Lineage in the manifest (Phase 7) | Publish `graph_commit` + mutable `graph_head:<branch>` rows in the same `__manifest` merge-insert with a branch-head CAS; `_graph_commits` becomes a projection. Removes the per-write `commit_graph.refresh` and closes the "manifest→commit-graph atomicity" + "commit-graph parent under concurrency" gaps. | `iss-991` **[G]**, **[S]** |
 								| Bounded history (compaction **and** cleanup) | Bring the internal table(s) into the `optimize` loop AND schedule version *cleanup* of node/edge tables — compaction rewrites fragments, only cleanup prunes the version chain that §2.2's dominant term re-reads (§2.3). No blob/PK/CAS blocker (`__manifest` has no blob column, `state.rs:44-72`; the unenforced PK is orthogonal to a content-preserving Rewrite). Post-Phase-7 there is only **one** internal table to compact. | #8 **[S]** |
 								| Recovery off the hot path | Move the per-write `list_dir("__recovery/")` to coordinator-open + the CAS-conflict path, guarded by a sidecar-age grace window (the sidecar carries `created_at` micros + a ULID, `recovery.rs:762`/`:1522`). | #4, `iss-856`/`iss-recovery-sweep-live-writer-rollback` **[G][S]** |
 								| Epoch fence | Monotonic `writer_epoch` in `__manifest`, CAS-claimed at writer init, checked on every publish. Fences a whole zombie *writer* deterministically (no TTL); closes the multi-process exposure and the Lance-MTT TTL-lease gap. | SlateDB `FenceableTransactionalObject` **[U]** |
 								| Branch create | Lance `Clone` instead of the per-table fork loop (O(tables)→O(1) sequential). | `iss-691` **[G]** |
 								| Branch delete | Run the per-other-branch safety check and the per-table reclaim loops concurrently (`buffer_unordered`); read branch sets from in-memory coordinator state. | **[S]** |
 								---
 								## 5. The cost contract — measurement & enforcement
 								The bug class is invisible to correctness tests, to local-FS tests, and to
 								wall-clock benches. You can only prevent a regression in a quantity you **define
 								precisely, measure deterministically, and bound on every PR.** The quantity is
 								*sequential object-store round-trips per logical operation, as a function of
 								history depth and table count.* OmniGraph already has the correct pattern for
 								**reads** (`warm_read_cost.rs`, `IOTracker`, swept to depth 20); this RFC extends
 								it across the write/branch/open surface. This is exactly how Lance and SlateDB
 								enforce it **[U]**.
 								### 5.1 Tier 1 — deterministic IO-counted gate (every PR)
 								Ordinary `cargo test`, hermetic (in-memory / tempdir + `IOTracker`), no S3, no
 								wall-clock. Two shapes:
 								```rust
 								// (A) cost-invariant-to-HISTORY — the load-bearing gate. Gate the MERGE verb (the prod path).
 								for depth in [10, 100, 1000] {              // REAL commit history, not row count
 								    build_history(depth);
 								    reset_counters();
 								    let s = measured_merge();               // --mode merge, the read-modify production path
 								    // PRIMARY — the dominant term (§0): the written table's data opens/reads, flat in depth.
 								    assert!(s.data_table_opens          <= touched_tables); // open each table ONCE, by pinned version
 								    assert!(s.data_table_reads_per_open <= K_OPEN);         // each open O(1) in the table's version count
 								    // SECONDARY — internal-table scans flat in depth (compaction + cleanup).
 								    assert!(s.manifest_ops  <= K_MANIFEST); // small CONSTANT, NOT a function of depth
 								    assert!(s.lineage_ops   <= K_LINEAGE);
 								    assert!(s.stages        <= 3);          // bounded sequential hops
 								}
 								assert_flat_across_depths();                // ALL terms — esp. data-table opens — flat in N
 								// (B) fitness functions — architectural invariants AS tests
 								assert_eq!(validate_schema_contract_calls(write), 1);   // resolve-once
 								assert_eq!(coordinator_resolutions(write), 1);          // O(1) resolution
 								assert_eq!(recovery_listdir_calls(steady_state_write), 0);
 								```
 								**Prerequisite, not a follow-up: route ALL opens (read + write) through the one
 								instrumented opener BEFORE the gate is meaningful.** Today the write path's data
 								opens bypass `table_wrapper` (the §0(a) blind spot), so a gate that asserts only
 								`manifest_ops`/`lineage_ops` would **pass a still-broken build** — one that
 								compacts the internal tables (§9 step 2) but keeps the dominant ~13× namespace-open
 								sweep (§2.2). The gate MUST count data-table opens/reads (the dominant term), which
 								requires the routing change first. The data term is **mode-independent** (append ≡
 								merge ≡ +12/depth **[U]**), so either verb exercises it; gate the **merge** verb
 								as the production path. **Fixture caveat [U]:** use *valid* edge endpoints — a
 								write to a non-existent endpoint fails RI validation and rolls back at ~192 ops
 								with **zero chain reads**, so a bad-endpoint fixture silently measures the rollback
 								path and would pass falsely.
 								The load-bearing rule both Lance and SlateDB mostly miss: **assert the constant is
 								flat across N, not just small at one N.** A shallow fixture cannot catch an
 								O(history) cost (the §0(b) table is the red baseline). Add a `num_stages`
 								(sequential-hop) assertion via a `ThrottledStore` wrapper (Lance's
 								`test_commit_iops` setup) so an O(N) listing also blows a wall-time budget.
 								### 5.2 Tier 2 — wall-clock trend (post-merge / nightly, never a PR gate)
 								A `ThrottledStore` criterion bench injecting cross-region RTT (50/150 ms/op — the
 								incident's regime) for single-insert and branch-op latency, with a threshold
 								alert (Bencher.dev `--err` / github-action-benchmark `fail-on-alert`). Both
 								reference DBs keep wall-clock out of the PR gate (too noisy on shared runners)
 								and use it only as a trend.
 								### 5.3 Close the loop — production metric
 								Emit `storage.ops` and `storage.stages` per logical operation as a span/counter
 								(cheap always-on atomics; the heavy per-table attributing wrapper stays
 								test-only behind a `test-util`-style feature, zero release cost). The number
 								asserted in CI is the number observed in prod — `iss-write-s3-roundtrip-amplification`'s
 								cross-region signal becomes a direct readout.
 								### 5.4 Process discipline — test-first for performance
 								Write the depth-sweep cost-budget test **first**: it goes **red today** (§0), the
 								WriteTxn + Phase-7 + compaction work turns it **green** (flat in N), and the
 								red→green is the proof. This is CLAUDE.md rule 12 applied to cost, and the
 								originating handoff's sequencing (§8/§9: land the tests before the fix so the win
 								is measured and locked). Add the policy (extend invariant 15 + testing.md "Cost
 								budget tests"): *any change touching the read/write/branch/open path MUST add or
 								extend a cost-budget test asserting the metric is flat at history depth.*
 								### 5.5 The correctness contract — concurrency tests (the safety twin of the cost gate)
 								The cost gate proves *fast*; these prove *safe*. §6.5's multi-writer cliff slipped
 								the suite for the same structural reason the latency bug did — **nothing runs the
 								schedule that triggers it**: the suite is single-process with the in-process queue
 								(the bug is cross-process), uses local/in-memory stores (no object-store
 								cross-process CAS), and its recovery tests cover restart-time sweep, not
 								live-writer rollback. **These four must land before `PublishPlan`/epoch merge
 								(steps 5):**
 . **Cross-process multi-writer on a real/emulated object store** (the *corruption*
 								   case) — N independent engine **processes** writing the same `(table, branch)`;
 								   assert all commit-or-cleanly-retry (no lost updates, no stuck "needs recovery,"
 								   no HEAD-ahead-of-manifest). **A single-process failpoint test cannot reproduce
 								   the corruption** (in-process degrades to clean OCC, §6.5) — this genuinely needs
 								   a multi-process harness (empirically 1/12 today). State that so nobody writes a
 								   single-process test expecting it to fail.
 . **Deterministic in-process interleaving (failpoint) — WRITTEN, passes [M].** Two→
 								   eight handles, sleep failpoint at the `commit_staged`→publish window
 								   (`loader/mod.rs:605`); resume losers and assert they retry cleanly. This
 								   demonstrates the **benign** path (N=8 → 2 commit, 6 clean OCC retries) — it is the
 								   regression guard for "in-process stays clean," *not* a reproduction of the
 								   cross-process cliff.
 . **Live-writer recovery** (`iss-recovery-sweep-live-writer-rollback`) — a
 								   concurrent open must not roll back a live in-flight publish (the grace window).
 . **Formal model** — a Quint/TLA+ model of `{two writers, interleave commit_staged
 								   and manifest-CAS}` (`iss-934`); it finds the §6.5 cliff immediately.
 . **Cross-table write-skew — WRITTEN, red, and driven red→green in-process [M].**
 								   Failpoint `loader.post_ri_pre_stage` (between RI-validation and staging): writer B
 								   validates "Bob exists" and parks; writer A `overwrite`s `node:Person` dropping Bob
 								   (non-cascading); B commits `Knows(Bob→Alice)` → committed orphan. The red test for
 								   the §7.1 fix. **Acceptance is a single-process gate** — unlike the §6.5 HEAD-ahead
 								   corruption (which genuinely needs the multi-process harness), this skew reproduces
 								   *deterministically in one process*: the parked edge writer's snapshot really does
 								   pin `edge:Knows:1` before the overwrite commits, so the overlap is real with two
 								   in-process handles. The fix went red→green in-process behind a shared head row
 								   (§7.1). Only #1–#4 (HEAD-ahead/epoch corruption) need cross-process scheduling.
 								Plus one **disambiguating run** owed (§6.5 confound): separate-handles in-process
 								on S3 — to confirm the corruption is the process boundary, not the store.
 								This mirrors the cost gate's discipline (assert across the dimension the suite
 								otherwise never exercises) — there, history depth; here, concurrent cross-process
 								schedules.
 								---
 								## 6. What is already right vs. the deltas
 								**Already correct — do not rewrite.** The in-memory `MutationStaging` accumulator,
 								the recovery sidecar mechanism, the per-(table,branch) write queue, D2, the sealed
 								`TableStorage` trait, and the read-path warm-up (PR #268) all stay. This is **not**
 								a substrate rewrite.
 								**One claim to soften — manifest-CAS is atomic *per publish*, not unconditionally
 								cross-table-serializing [M].** The manifest CAS (the reference impl of the
 								lance#7264 "Alternative A") makes each publish atomic and serializes any two writers
 								whose write-sets **share a `__manifest` row** — overlapping or same-table, which is
 								exactly why §6.5's same-table cases and the cascading-delete case retry cleanly. But
 								two writers touching **disjoint** tables write disjoint per-`object_id` rows, so Lance
 								sees no conflict and **both commit** (proven [M], §7.1). The genuinely-atomic
 								cross-table commit §13 contrasts with Delta is the **target** (§4.1's single
 								merge-insert over a shared head row), **not current state**. So "do not rewrite the
 								CAS" holds for the *commit primitive*, but the cross-table-serialization §7.1 needs
 								is a real addition (the shared `graph_head` row), not something the current CAS
 								already provides.
 								**The deltas (each a validated, localized gap):**
 								| # | Delta | Mechanism | Tracking |
 								|---|---|---|---|
 								| 1 | Snapshot re-derived per stage | capture-once `WriteTxn`, thread by ref | `iss-write-s3-roundtrip-amplification` |
 								| 2 | Write opens via `from_namespace` re-resolve the data-table ~13×/write, missing the fast path (**DOMINANT, +12/depth**) | open each table **once, direct `from_uri().with_version(N)`** (bypass namespace, §2.4) + shared Session | `iss-write-s3-roundtrip-amplification`, #0 |
 								| 3 | Lineage = 2nd authority, O(history) refresh (secondary) | Phase 7: lineage into `__manifest` | `iss-991` |
 								| 4 | `__manifest`/`_graph_commits` excluded from optimize/cleanup (`optimize.rs:895-904`; prototype pruned "7 tables" = node/edge only **[M]**) — the +5/depth residual after step 3 | **add them to `all_table_keys`** (a code change) + scheduled compaction/cleanup | `gap-read-path-rederivation` (write twin) |
 								| 5 | `list_dir("__recovery/")` per write | move to open + conflict, grace window | `iss-856`, `iss-recovery-sweep-live-writer-rollback` |
 								| 6 | 4 hand-rolled writers, commit↔recovery drift | one `PublishPlan` executed by both | `iss-merge-recovery-partial-rollforward` (PR #277) |
 								| 7 | No writer epoch (multi-process exposure) | `writer_epoch` in `__manifest` | — (new) |
 								| 8 | branch create = O(tables) fork loop | Lance `Clone` | `iss-691` |
 								| 9 | branch delete = sequential loops | concurrent `buffer_unordered` | — (new) |
 								| 10 | No write/branch cost gate (must count **data-table** opens; route all opens through the instrumented opener first) | Tier-1 IO-counted tests, merge verb | — (new) |
 								| 11 | Schema contract re-validated uncached per resolve (**flat 46 reads/write — 29% of depth-0 cost; constant, not depth**) | resolve/validate-once in `WriteTxn`; §5.1 `validate_schema_contract_calls==1` (the depth gate misses it) | `iss-write-s3-roundtrip-amplification` |
 								---
 								## 6.5 Concurrency correctness — the multi-writer cliff (proven [M])
 								The latency fixes are about *speed*; a separate, proven finding is about *safety*.
 								A multi-writer experiment **[M]** shows concurrent same-branch writers behave very
 								differently by topology:
 								| topology | concurrency | outcome |
 								|---|---|---|
 								| single server (shared in-proc queue, `loader/mod.rs:426`) | 12 | **12 / 12 commit** (clean) |
 								| in-process, separate handles, interleave failpoint at `commit_staged`→publish (`loader/mod.rs:605`) | 8 | **2 / 8 commit; the other 6 are clean retryable OCC** |
 								| multi-process (separate CLIs / S3, no shared queue) | 2 / 3 / 5 / 12 | **1 / N commit; the rest CORRUPT** |
 								**Two distinct failure modes — and the corruption is strictly cross-process:**
 								- **In-process → benign.** Even with *separate handles, no shared queue, high
 								  contention*, losers fail with `stale view of 'edge:Knows': expected manifest table
 								  version 5 but current is 7 — refresh and retry` — a **clean, retryable OCC
 								  conflict; graph state stays consistent.** The publisher CAS is doing its job.
 								- **Cross-process → corruption.** `Lance HEAD version N+1 ahead of manifest version
 								  N; a pending recovery sidecar requires rollback`. **Mechanism:** a losing writer
 								  advances the table's Lance HEAD (`commit_staged`) *before* the manifest CAS; when
 								  the CAS loses, HEAD is ahead of the manifest — a partial commit the per-write heal
 								  **defers** (`recovery.rs:978-988`; only the open-time sweep rolls back), so a
 								  *live* writer hitting it **fails instead of healing**. Self-heals on the next
 								  read-write reopen (not permanently bricked), but during a burst throughput
 								  collapses to one survivor. Reachable at **concurrency = 2** cross-process.
 								So in-process safety **already comes from the publisher CAS** (clean OCC); the
 								corruption needs the process boundary. *(Confound, stated honestly: the in-process
 								interleave ran on local-FS and the cross-process on S3-via-proxy — but
 								single-server-on-S3 was also clean (12/12), giving two independent "in-process
 								clean" points vs one "cross-process corrupt," triangulating on the process
 								boundary, not the store. One disambiguating run — separate-handles in-process on S3
 								— would move this from triangulated to proven; §5.5.)*
 								**Scoping (matters for urgency):** **single-server prod is serialized-correct, just
 								slow** — the in-process `(table,branch)` queue serializes same-branch writes (all 12
 								commit, no lost updates); the production incident was the *latency* (serialized
 								O(depth) writes → 90 s timeout), **not** corruption. The corruption hazard is
 								**latent**: it appears the moment a second writer exists (server replica,
 								CLI-alongside-server, multi-writer scale-out). **So: single-server today =
 								serialized-correct (slow; fixed by steps 2/3); multi-writer = UNSAFE until
 								`writer_epoch` lands.**
 								**The fix is the existing RFC, no new design.** The `A`-before-`B` window
 								(Lance HEAD moves before the manifest references it) is inherent to Lance's
 								per-table-lineage model — you cannot eliminate it, only fence and recover it: the
 								**`writer_epoch`** (delta #7) is a leader-lease via cross-process CAS so two writers
 								are never in the `commit_staged`→manifest-CAS window across processes (it removes
 								the concurrent-race dimension); the **`PublishPlan`=sidecar** (delta #6) makes a
 								single crashed writer roll forward/back deterministically (the crash dimension); and
 								**recovery off the hot path + grace window** (delta #5, Q2) is the exact reason the
 								live writers failed rather than self-healed (`iss-recovery-sweep-live-writer-rollback`).
 								This is the standard WAL-replay + leader-lease shape (confirmed against SlateDB's
 								`FenceableTransactionalObject` and Kleppmann's fencing-token canon, §10). **This
 								finding promotes #6/#7 from "nice correctness work" to the load-bearing guard that
 								gates multi-writer topologies — and it is the motivating case for them.**
 								---
 								## 7. Invariants & deny-list check
 								Touches and *strengthens* (does not weaken) invariants in
 								[invariants.md](invariants.md):
 								- **§2 (manifest-atomic visibility):** preserved; lineage now rides the same CAS
 								  (strengthens — closes the "manifest→commit-graph atomicity" gap).
 								- **§3 (one snapshot per op):** enforced *by construction* via `&WriteTxn`.
 								- **§4 (publish at one boundary):** unchanged — still one manifest publish.
 								- **§5 (recovery part of the commit protocol):** preserved; the sidecar *is* the
 								  `PublishPlan` (strengthens — commit and recovery cannot diverge). The grace
 								  window addresses the documented "recovery serialized against live writers
 								  in-process only" gap.
 								- **§7 (indexes derived) / §15 (one source of truth, cheaply derived):** this RFC
 								  is the write-side application of §15 — bound cost to the working set, not
 								  history. The commit graph becomes derived (strengthens).
 								- **§5 strict-op SI:** preserved (#5 validation — open guards kept for
 								  read-modify-write).
 								**Deny-list:** does *not* hit "cold re-derivation on the hot path" (it removes
 								two instances), "state that drifts" (lineage stops being a second authority), or
 								"acks before durable persistence." The `writer_epoch` is the closing move on the
 								"local `write_text_if_match` is not a cross-process CAS" / multi-process gaps —
 								add it before admitting multi-process write topologies.
 								No invariant is weakened. Two Known Gaps **close** (manifest→commit-graph
 								atomicity; commit-graph parent under concurrency, via Phase 7); one
 								(read-path-rederivation) gets its **write twin** filed and addressed.
 								### 7.1 Scope of the correctness claims (literature review, §13)
 								The "correct by construction" framing (§3, §4.1) is **precise but bounded** — the
 								DB-canon review flags three places not to over-claim:
 								- **Per-table serializability, not graph-wide — but the gap is narrow and now
 								  measured [M].** Three deterministic cases (failpoint `loader.post_ri_pre_stage`,
 								  placed between RI-validation and staging; red test in `tests/failpoints.rs`):
 								  - **Cross-table *disjoint* → genuine skew, VIOLATED.** A **non-cascading endpoint
 								    removal** — `node:Person` *overwrite* dropping Bob, touching only the node table
 								    — concurrent with an edge insert `Knows(Bob→Alice)`: both commit (write-set-only
 								    CAS, RI validated once pre-commit and never re-checked at publish) → **committed
 								    orphan**. (= `iss-ri-write-skew-dangling-edges` + the concurrent face of
 								    `iss-overwrite-orphans-committed-edges`.)
 								  - **Cross-table *overlapping* → incidentally protected.** `delete`-based removal
 								    **cascades** into `edge:Knows`, so the write-sets overlap, the per-table CAS
 								    engages, and the loser fails **cleanly** (stale-view OCC retry); invariant held.
 								  - **Same-table → NOT a separate skew.** Cardinality / `@unique(src)` have
 								    overlapping write-sets, so the per-table CAS holds the constraint; the loser's
 								    failure is the **HEAD-ahead corruption already scoped to #6/#7** (epoch +
 								    PublishPlan), not a consistency hole. *(This corrects an earlier
 								    over-generalization: cardinality/uniqueness do not share the read-set gap.)*
 								  So the skew is **reachable only for the non-cascading-overwrite × disjoint-edge-insert
 								  shape** — operation-specific, not constraint-specific.
 								  **The scoped fix alone is a no-op — proven [M], and the reason is mechanical.**
 								  Feeding the endpoint node-table versions into the edge's publish *expected* set
 								  (`check_expected_table_versions`, `publisher.rs:353`) was prototyped exactly; debug
 								  confirmed the pins reach the check, **and both writers still committed — the orphan
 								  persisted.** Every publish writes a *unique per-`object_id` row* into `__manifest`
 								  (merge key `object_id = version_object_id(table, version)`). Two disjoint-table
 								  writers (`node:Person` vs `edge:Knows`) touch **no common row**, so Lance's
 								  row-level merge-insert CAS commits both with **no conflict**, the publisher's retry
 								  loop **never fires**, and `check_expected_table_versions` — a **non-atomic
 								  pre-check, not part of the CAS** — is evaluated exactly once against the stale
 								  pre-both manifest and passes for both. The read-set pin only bites if the loser is
 								  **forced to retry and re-evaluate against fresh state**, which requires a *shared
 								  contention row* every publish touches. Adding a stand-in global head row
 								  (`UpdateAll`-touched by every publish) makes the disjoint writers overlap → Lance
 								  conflict → publisher retry → the reloaded pin (`edge:Knows:1` vs current `5`)
 								  rejects the stale writer → no orphan (red→green, failpoint suite 52/52). **That
 								  shared row is exactly Phase-7's `graph_head:<branch>`.**
 								  **Consequence — §7.1 is NOT a standalone single-server PR** (correct earlier text
 								  that called it "single-server-live, not deferrable" — it *is* urgent and
 								  epoch-independent, but it cannot ship against today's per-`object_id` manifest
 								  without a contention point). Land it one of three ways: **(a)** with Phase 7
 								  (step 4), reusing `graph_head:<branch>` as the contention row; **(b)** behind a
 								  minimal per-branch head row ahead of Phase 7 (~15 lines, as prototyped); or
 								  **(c)** as commit-time re-validation — still must win a serialization point first.
 								  **Recommended: (c) behind a per-branch head row.** The CAS-map approach carries the
 								  two costs §11 anticipated — *table-granularity false conflicts* (any `Person`
 								  overwrite conflicts with any concurrent `edge:Knows` insert, even different rows —
 								  needs a row-granularity read-set) and *scope* (a global head serializes the whole
 								  graph; per-branch `graph_head` is the right granularity). Commit-time re-validation
 								  is precise (no false positives) **and** reuses the same serialization point, so once
 								  the head row exists it strictly dominates the CAS-map. Either way the head row
 								  imposes an inherent trade — same-branch writers serialize cross-process (throughput
 								  ceiling 1/branch, bounded by `PUBLISHER_RETRY_BUDGET`) — **now a correctness
 								  requirement, not just a Phase-7 side effect** (§11).
 								  **Two faces, two fixes — do not bundle them.** The above addresses only the
 								  *concurrent* face (overlapping snapshots, `iss-ri-write-skew-dangling-edges`). The
 								  *sequential* face (`iss-overwrite-orphans-committed-edges`) — an overwrite drops a
 								  node that **already has a committed inbound edge**, with *zero* concurrency —
 								  **cannot** be caught by read-set-in-CAS: the later writer's snapshot legitimately
 								  post-dates the edge, so its pin matches and it commits. That is a pure
 								  **inbound-RI-validation** gap: when an overwrite/delete removes node endpoints,
 								  re-check that no live edge references them. A validation concern, not a CAS one;
 								  it needs no contention row and ships independently.
 								  *(Note: `iss-984` is a different bug — remote branch-merge idempotency — not this.)*
 								- **Recovery: roll-forward is by-construction; roll-back is not.** "Commit and
 								  recovery replay the identical plan" holds for the **redo** direction (shared
 								  `plan.apply()`). The undo classifier (NoMovement / UnexpectedAtP1 /
 								  UnexpectedMultistep / IncompletePhaseB) lives *outside* the shared executor, only
 								  at open-time — that's where ARIES-style divergence risk concentrates and where the
 								  §5.5 failpoint coverage is owed.
 								- **The fence and the cross-file atomicity rest on a linearizable conditional-put.**
 								  Kleppmann's fencing-token guarantee, the manifest CAS, and the epoch all require a
 								  linearizable register — true on S3/R2 (If-Match) but **not** on the local-FS path
 								  (`write_text_if_match` is content-token compare-then-replace, ABA-prone —
 								  `invariants.md` Known Gap). **Precondition to state up front: every "deterministic
 								  fence" / "atomic CAS" claim holds *on a store with linearizable conditional-put*;
 								  the epoch must not use the local-FS path.** Delta Lake §3.2.2 treats the
 								  object-store consistency model (read-after-write + put-if-absent) as a first-class
 								  design parameter; so should this RFC.
 								---
 								## 8. Relationship to Lance MTT (the seam, not a dependency)
 								`GraphPublishAuthority.publish(txn, plan)` is exactly the adapter to a future
 								Lance `catalog.transaction()`. lance#7264 ("Multi-Table Transactions via
 								Branching") is real and OmniGraph is its reference "Alternative A"
 								(fast-forward-main + WAL + roll-forward recovery) **[U]**, but it is a 5-day-old
 								discussion with two unbuilt dependencies (lance#7263 branch merge/rebase,
 								lance#7185 UUID branch paths), an unresolved central choice (it *favors*
 								pointer-swap — the opposite identity model from OmniGraph), and an open soundness
 								question (TTL lease needs an epoch). **Build the seam now on its own merits; do
 								not schedule around MTT landing.** When it ships, `publish`'s *body* swaps
 								(stage→CAS→sidecar → `catalog.transaction()`) while `WriteTxn`/`PublishPlan` and
 								every verb lowering stay. `iss-863`/`iss-864` **[G]** already scope this spike.
 								The MemWAL/LSM ingest tier (`iss-681` **[G]**, `dec-adopt-lance-v7-memwal`) is
 								**complementary, not competing, and not in flight** (the `memwal-benefit-analysis`
 								branch is an empty placeholder; the real analysis is commit `c9a81266`). MemWAL
 								sits *below* the manifest publisher (per-table durability, opt-in, intra-table);
 								`WriteTxn` owns the cross-table CAS. Build `WriteTxn` first.
 								---
 								## 9. Sequencing
 								Ordered by leverage and dependency. **The dominant depth term is the redundant
 								data-table opens (step 3), not the internal tables (step 2)** — §0; both must land
 								to flatten the curve.
 . **Measure first (Tier-1 gate). ✅ LANDED (gate + harness).** *Prerequisite (1a):*
 								   the write opener (`open_dataset_head`) is routed through the instrumented
 								   `open_dataset_tracked` so the gate can count data-table opens (§5.1). The
 								   write cost-budget tests live in `crates/omnigraph/tests/write_cost.rs` on a
 								   **shared, store-agnostic harness** (`tests/helpers/cost.rs`: `measure`/`IoCounts`/
 								   `assert_flat`/`local_graph`/`s3_graph`) that `warm_read_cost.rs` and
 								   `write_cost_s3.rs` also consume — one vocabulary, no duplicated `IOTracker`
 								   plumbing. The local gate ships green every-PR guards + the RED `#[ignore]`'d
 								   internal-table LOCK (step 2's red→green acceptance). *Still owed:* the prod
 								   `storage.ops` span metric (§5.3) and the bucket-gated `write_cost_s3.rs` opener
 								   LOCK (step 3a's red→green, S3-only per the §9-3a measurement note).
 . **Bound history — bring the INTERNAL tables into optimize/cleanup (a code
 								   change, not just scheduling).** Today `optimize`/`cleanup` iterate **node/edge
 								   keys only** (`optimize.rs:895-904`) — confirmed: the prototype's `cleanup --keep 3`
 								   pruned "7 tables" = the node/edge data tables; `__manifest`/`_graph_commits` were
 								   untouched **[M]**. So the residual +5/depth internal slope (§0b) is **not** fixed
 								   by today's tooling — step 2 is a real `all_table_keys` change to add the internal
 								   tables, then schedule compaction+cleanup (pass `--yes`; cleanup aborts on remote
 								   otherwise). The pruning mechanism is proven on a data table (1035→63, 16× **[M]**);
 								   the internal tables need the same inclusion. **Proven [M]:** compacting the
 								   internal tables collapsed their scans `__manifest` 285→32, `_graph_commits`
 →11; with step 3 a depth-87 edge drops **~1720 → 198 ops** (§2.4). (Separately,
 								   node/edge cleanup **caps** the dominant data-table term as an interim *before*
 								   step 3 — after step 3 that term is flat regardless.) **HARD PREREQUISITE:** the
 								   Q8 boundary watermark must land **with** this step — Lance's version CAS is
 								   confirmed vulnerable to cleanup-resurrection (§12 Q8, a silent lost write on
 								   R2/S3), so scheduling cleanup without the watermark trades a latency bug for a
 								   correctness bug. (`gap-read-path-rederivation` write twin.)
 . **The opener fix — a shippable lead + the structural follow-on.**
 								   - **3a. Opener bypass (standalone PR, THE dominant fix — [M] proven). ✅ LANDED.**
 								     `TableStore::open_dataset_head_for_write` now delegates to the direct
 								     `open_dataset_head` opener (`Dataset::open` by URI + `checkout_branch`, routed
 								     through `instrumentation::open_dataset_tracked` so the cost gate can count it;
 								     no-op in prod) instead of the `from_namespace` builder. Measured end-to-end on
 								     the prototype: data term `31 + 12·depth → flat 4`, total `+18 → +5/depth`,
 								     depth-80 **2.7×** (§2.4), functionally correct on main/branch/node.
 								     **Acceptance:** the full `cargo test --workspace --locked` suite passes under the
 								     bypass (the `tests/` integration + `merge_truth_table` + recovery/failpoint
 								     suites the prototype's `--lib` run didn't cover — schema-apply, branch merge,
 								     fork-on-first-write, overwrite). **Namespace retired to test-only:** with both
 								     reads (Fix 2) and now writes bypassing it, *nothing in production routes through
 								     the Lance namespace* — confirming §2.4's premise. The dead per-table open chain
 								     (`load_table_from_namespace`, `open_table_head_for_write`) was deleted and the
 								     `StagedTableNamespace` contract apparatus gated `#[cfg(test)]`, mirroring the
 								     already-`#[cfg(test)]` read namespace (`BranchManifestNamespace`). **Measurement
 								     note (corrected):** the opener win is **S3-only** — local FS resolves latest with
 								     one cheap `read_dir` regardless of opener, so the namespace-vs-direct difference
 								     is invisible there (the local data-table read count *does* grow with depth, but
 								     that is the merge-insert/RI scan over O(depth) *fragments*, a compaction term,
 								     not the opener; depth-100 = 92 ops identically before and after the bypass). The
 								     opener LOCK therefore lives in the bucket-gated `write_cost_s3.rs`, not the local
 								     `write_cost.rs`.
 								   - **3b. Full `WriteTxn` (capture-once + intra-txn handle reuse + shared Session).**
 								     Formalize 3a's open-once into the pinned, threaded `WriteTxn` (re-resolution
 								     *unrepresentable*, invariant 3) and kill the flat-46 schema-read constant
 								     (resolve/validate-once, §0/§6). (`iss-write-s3-roundtrip-amplification`.)
 . **Phase 7 — lineage into the manifest.** Removes the per-write
 								   `commit_graph.refresh`; commit graph becomes a projection. (`iss-991`.)
 								   **Hard dependency: step 2 must land first (Q1, §12)** — each publisher retry
 								   re-runs the O(history) `load_publish_state` scan, so the `graph_head` CAS
 								   contention Phase 7 introduces is acceptable only once compaction bounds that
 								   scan. Acceptance includes the Q1 concurrent-same-branch-writer gate.
 								   **Carries the §7.1 concurrent write-skew fix.** The `graph_head:<branch>` row is
 								   the shared contention point the cross-table read-set-in-CAS needs — proven [M]
 								   that the read-set fix is a no-op without it (§7.1). So the concurrent face of the
 								   write-skew lands *with* this step (or, if §7.1 must ship earlier, behind a minimal
 								   per-branch head row — ~15 lines — or as commit-time re-validation). The
 								   *sequential* face (`iss-overwrite-orphans-committed-edges`) is independent:
 								   inbound-RI validation on node removal, no head row, ships anytime.
 . **`PublishPlan` unification + recovery off the hot path + epoch fence — the
 								   multi-writer safety guard.** Collapse the four writers; move the `__recovery` list
 								   to open/conflict; add the `writer_epoch` leader-lease. **Motivated by the proven
 								   §6.5 cliff** (multi-process same-branch writers corrupt at concurrency = 2) — this
 								   is the guard that makes multi-writer topologies safe, not optional polish.
 								   **Gated by the §5.5 correctness contract** (the four concurrency tests must land
 								   with it). `writer_epoch` must be a true cross-process conditional CAS — **not**
 								   the local-FS `write_text_if_match` path (§7.1). (`iss-856`,
 								   `iss-merge-recovery-partial-rollforward`, `iss-recovery-sweep-live-writer-rollback`,
 								   `iss-934`.)
 . **Branch ops.** Lance `Clone` for create (`iss-691`); concurrent delete loops.
 . **Freeze** investment in publisher/sidecar/fork internals; pursue the MTT
 								   seam (`iss-863`/`iss-864`) as the strategic exit.
 								**Land PR #277 first** — it closes `iss-merge-recovery-partial-rollforward` and is
 								the producer-side half of the `PublishPlan` discipline; the heal-relocation in
 								step 5 must preserve its merge pre-snapshot heal (`exec/merge.rs:1084-1090`) and
 								its open-time `IncompletePhaseB → RollBack` (which the per-write heal never
 								performed anyway).
 								---
 								## 10. Cross-reference map (the ties)
 								**Dev-graph items (modernrelay) — what this RFC ties together:**
 								- Primary: `iss-write-s3-roundtrip-amplification` (the bug).
 								- Depth term / Phase 7 (commit graph → manifest-derived projection): `iss-991`
 								  (related: `iss-707` structured commit-graph lineage; `iss-934` Quint
 								  multi-table-publish verification). Read twin: `gap-read-path-rederivation`.
 								- Substrate seam: `iss-863`, `iss-864`. Decision: `dec-adopt-lance-v7-memwal`
 								  (`iss-681`).
 								- Recovery: `iss-856`, `iss-recovery-sweep-live-writer-rollback`,
 								  `iss-merge-recovery-partial-rollforward`, `iss-903`, `iss-load-not-crash-safe`.
 								- Residual migration: `iss-950` (MR-A staged delete, retires D2), `iss-848`
 								  (index-coverage reconciler, owns `create_vector_index`).
 								- Branch/load: `iss-691`, `iss-677`, `iss-895`, `iss-topology-cross-branch-cache`,
 								  `iss-841`, `iss-982`, `iss-423`, `iss-989`.
 								- Concurrency correctness (survives MTT) — **two faces, two different fixes [M]**
 								  (§7.1): `iss-ri-write-skew-dangling-edges` (the *concurrent* face; fix =
 								  read-set-in-CAS **+ a shared `graph_head` contention row**, so it's coupled to
 								  step 4 / a minimal head row / commit-time re-validation — NOT a standalone PR) and
 								  `iss-overwrite-orphans-committed-edges` (the *sequential* face; fix =
 								  **inbound-RI validation on node removal**, ships independently, no contention row).
 								  *(`iss-984` — remote branch-merge idempotency — is unrelated; not a write-skew.)*
 								- Blockers: `blk-lance-6658` (shipped 7.0.0), `blk-lance-6666` (open, vector
 								  index two-phase), `blk-lance-blob-compaction`.
 								- Epics: `epc-bulk-data-plane`, `epc-lance-v7-migration`, `epc-783` (reliability
 								  harness), `epc-929` (Quint verification).
 								**Proposed new dev-graph wiring (not yet written):**
 								- New **Epic** `epc-write-path-latency` — owns the cluster of orphaned issues
 								  above (none currently has an epic).
 								- New **Gap** `gap-write-path-rederivation` — the write twin of
 								  `gap-read-path-rederivation` (current: write re-derives snapshot + scans
 								  uncompacted internal tables per write; target: capture-once + bounded history).
 								- New **Issues**: write-side cost-budget gate + prod metric (step 1; prereq 1a
 								  routes all opens through the instrumented opener); **opener bypass — open writes
 								  direct-by-URI, standalone (step 3a, [M] the dominant fix, completes PR #268 Fix 2
 								  on the write path, §2.4)**; full `WriteTxn` capture-once (step 3b); **add
 								  `__manifest`/`_graph_commits` to `all_table_keys`** for compaction+cleanup (step 2
 								  — a code change, `optimize.rs:895-904`); `PublishPlan` unification + epoch
 								  (step 5); branch-delete concurrency (step 6).
 								- **Per-table namespace retired to test-only (step 3a landed).** With reads (Fix 2)
 								  and now writes (step 3a) both opening direct-by-URI, *nothing in production routes
 								  through the per-table `StagedTableNamespace`*. The dead open chain
 								  (`load_table_from_namespace`, `open_table_head_for_write`) was deleted; the
 								  `StagedTableNamespace` struct/impl/factory are now `#[cfg(test)]`, mirroring the
 								  already-`#[cfg(test)]` read namespace (`BranchManifestNamespace`). Both are retained
 								  only to validate the `LanceNamespace` contract in unit tests. *Production catalog /
 								  managed-versioning commit coordination for `__manifest` itself goes through a
 								  **separate** namespace (`GraphNamespacePublisher`), unaffected by this change.* The
 								  former follow-up to harden `StagedTableNamespace::list_table_versions`
 								  (`checkout_version` per version, O(depth)) is now purely a test-hygiene note — no
 								  prod caller can hit it; if any future version-list / time-travel feature needs
 								  per-table version enumeration, build `TableVersion`s from `versions()` metadata
 								  directly rather than resurrecting the namespace open path.
 								- New **Decision** `dec-writetxn-manifest-authoritative-publish` — records this
 								  RFC's design choice and the MTT-seam stance.
 								**Key source locations (v0.7.0):**
 								`omnigraph.rs:561-568,739-779,1317-1389`; `table_ops.rs:505-609`;
 								`table_store.rs:157-280,282-341,797`; `loader/mod.rs:197,400,485,557`;
 								`exec/mutation.rs:725`; `exec/merge.rs:1084-1090`;
 								`db/manifest/publisher.rs:51,93-124,356-371,385,432-440,448-490`;
 								`exec/mutation.rs:640-673` (D2 rule); `db/manifest/state.rs:44-72,133-141`; `db/manifest/layout.rs:22-26`;
 								`db/manifest/namespace.rs:111-112` (read open, O(1)),`:357-385`/`:362` (`describe_table` → redundant `Dataset::open` — the write-path double-open),`:158-186,544-550` (write open via `from_namespace`),`:395-427` (`list_table_versions` per-version checkout — test-only O(depth), the §10 follow-up);
 								`db/manifest/recovery.rs:762,978-988,1522`; `db/commit_graph.rs:136-164,213-272`;
 								`db/omnigraph/optimize.rs:240,517,895-904`; `instrumentation.rs:37,112-131`;
 								`runtime_cache.rs:202-283`; `tests/warm_read_cost.rs` (the read-side gate to mirror).
 								**Upstream:** lance#7264/#7263/#7185 (MTT); Lance `with_version` O(1) open
 								(`from_namespace` → `describe_table`, `builder.rs:130-178`; `default_resolve_version`
 								= one HEAD, `commit.rs:939-981`; version-hint PR #6752),
 								`list_is_lexically_ordered = !is_s3_express` (`aws.rs:183`),
 								`IOTracker`/`assert_io_*`/`num_stages`, `test_commit_iops`,
 								`test_commit_uses_version_hint_on_non_lexical_store`; **lance-namespace** design
 								(`namespace/index.md`, `operations/index.md` — catalog/discovery layer, resolve
 								once); LanceDB `io_tracking.rs`, `test_reload_resets_consistency_timer`; SlateDB
 								`FenceableTransactionalObject` (epoch fence), `InstrumentedObjectStore`,
 								monotonic-ID manifest.
 								**Reproduce the §0(b) network measurement:** `rustfs` (S3-compat) on `:9000`
 								behind a ~90-LoC Go counting proxy on `:9100` (adds `LATENCY_MS`, preserves the
 								SigV4 `Host` header, `/__ctl/reset` + `/__ctl/stat`); an omnigraph cluster on
 								`s3://…/cluster` through the proxy. Single-write breakdown: reset the proxy log,
 								`load --mode merge` one edge, classify by S3 key. Depth slope: write N× to main,
 								diff the per-write log at depth D vs D+20 by table. Native baseline: pylance 7.0.0
 								`write_dataset(mode="append")` in a loop → flat 6 ops/append at any depth.
 								---
 								## 11. Drawbacks, alternatives, reversibility
 								**Drawbacks.** Phase 7 makes disjoint-table same-branch writers contend on the
 								`graph_head:<branch>` row (they don't today) — bounded by the Lance retry budget,
 								inherent to a linear per-branch DAG, gated on a measured concurrency test and on
 								step 2 landing first (§12 Q1, resolved). **Reframe [M]: this contention is
 								load-bearing for correctness, not merely a throughput tax.** The §7.1 write-skew is
 								*unreachable only because* the shared head row forces disjoint cross-table writers to
 								overlap, conflict, retry, and re-evaluate their read-set pins against fresh state
 								(proven — without it the scoped CAS fix is a no-op). So §7.1 and the head row are
 								**coupled**: the "drawback" is exactly what buys the cross-table invariant, and the
 								throughput ceiling (1 writer/branch, bounded by `PUBLISHER_RETRY_BUDGET`) is a
 								**correctness requirement** the moment §7.1 ships, not an optional Phase-7 side
 								effect. `PublishPlan` is a non-trivial refactor of four writers; it must land behind
 								the cost gate and the `merge_truth_table`/recovery/failpoint suites.
 								**Alternatives.** (A) *Caching band-aid only* — memoize schema validation, cache
 								opens within a request: ~30–50% fewer round-trips but leaves open-by-latest and
 								the O(history) terms. Mitigation, not a fix. (B) *Opener bypass only* (open
 								direct-by-URI+version, no full txn) — **kills the dominant depth term, now measured
 								[M]**: a one-line patch flattened the data term `31+12·depth → flat 4` and cut a
 								depth-80 edge **2.7×** (§2.4), leaving only the secondary internal-table term and
 								the writer unification. (C) *Full design (this RFC)* — correctness by construction.
 								(D) *Wait for Lance MTT* — future exit, not a current dependency (§8).
 								**Recommend: ship B as a standalone PR first (behind the step-1 gate), then C for
 								the constant-factor + correctness, then step 2 for the internal residual; D as the
 								strategic end-state.** B is the demonstrated dominant fix, not a partial one.
 								**Reversibility.** The interface (`WriteTxn`/`PublishPlan`) is internal and
 								reversible. Phase 7's new `__manifest` object types (`graph_commit`,
 								`graph_head`) are an **on-disk format addition** — additive (old binaries skip
 								unknown `object_type`s) but near-permanent; it earns its own validation pass
 								(forward/back-compat, the validation checklist in the `iss-991` handoff). The
 								`writer_epoch` is likewise a durable manifest field. Everything else (compaction
 								scheduling, recovery relocation, branch concurrency, the cost gate) is cheap to
 								undo.
 								---
 								## 12. Resolved questions (was: unresolved)
 								All five original open questions were investigated read-only against post-#277/#284
 								`origin/main`, upstream Lance 7.0.0, and the dev graph; each is resolved below. One
 								new item (Q6), surfaced by peer review, remains genuinely open.
 . **`graph_head` CAS contention → RESOLVED, gated on step 2 + a concurrency test.**
 								   Retry is publisher-owned; Lance's internal rebase-retry is disabled
 								   (`conflict_retries(0)`, `publisher.rs:385`) → no double-retry. Row-CAS is true
 								   one-winner (`TooMuchWriteContention` → retryable, `publisher.rs:432-440`),
 								   bounded by `PUBLISHER_RETRY_BUDGET = 5`. **But each retry re-runs the O(history)
 								   `load_publish_state` scan (`publisher.rs:455`)**, so `graph_head` contention
 								   multiplies the manifest term — **step 2 (compaction) is a hard prerequisite for
 								   step 4 (Phase 7)**. Same-branch is the real workload (the incident is concurrent
 								   `main` writes). Residual: a measured gate before Phase 7 — N≈100 concurrent
 								   same-branch writers, assert bounded retry + O(working-set) re-scan + P99 within
 								   SLA. Fallback: batched-lineage, or Alternative B (defer lineage-in-manifest).
 . **Recovery grace-window → RESOLVED.** PR #284 is **unrelated** (cluster-apply
 								   trap; zero `recovery.rs` changes). The dangerous rollback classifications
 								   (NoMovement / UnexpectedAtP1 / UnexpectedMultistep / #277's IncompletePhaseB)
 								   fire only at the open-time Full sweep; the per-write heal defers all rollback
 								   (`recovery.rs:978-988`), so moving the heal off the hot path doesn't break #277.
 								   A sidecar-age grace window (defer sidecars younger than T_grace, loud typed
 								   skip, `repair` override) on the existing `created_at`/ULID
 								   (`recovery.rs:762`/`:1522`) is the sound interim; the permanent fix is the
 								   in-process background reconciler `iss-856`. Lands step 5 with a failpoint test.
 . **Epoch fence × publisher CAS → RESOLVED (by construction).** With Lance retry
 								   off (Q1), the publisher loop is the only retry layer. Model `writer_epoch` as a
 								   **pre-publish hard-fail gate** beside `check_expected_table_versions`
 								   (`publisher.rs:462`) but non-retryable (a stale epoch is a protocol violation,
 								   not a race). No double-retry; the epoch gate and the row-CAS loop are
 								   sequential. SlateDB `FenceableTransactionalObject` is the precedent.
 . **Compaction cadence → RESOLVED.** Not `auto_cleanup` (GCs pinned versions).
 								   Not foreground every-N-writes (deny-list job-queue + invariant 7 + cost cliff).
 								   Minimum (step 2): extend `optimize`/`cleanup` to the internal tables AND node/edge
 								   version cleanup — no special-casing (`SidecarKind::Optimize` already covers a
 								   mid-compaction crash). Follow-up: an `iss-856`-shaped background reconciler
 								   triggered by a cheap fragment-count probe (work off the hot path; a reconciler,
 								   not a job queue — deny-list-clean; SlateDB's epoch-coordinated compactor is the
 								   precedent). CLI `omnigraph optimize` stays the operator override.
 . **`PublishPlan` residuals → RESOLVED.** Both `delete_where` and
 								   `create_vector_index` are representable as `TableAction` variants with existing
 								   sidecar coverage (`SidecarKind::Mutation`/`EnsureIndices`) and are
 								   content-preserving (roll-forward safe). `TableAction::Delete` migrates to staged
 								   two-phase via MR-A / `iss-950` (now unblocked — `blk-lance-6658` shipped); **D2
 								   retires then** (`enforce_no_mixed_destructive_constructive`,
 								   `exec/mutation.rs:640-673`). `TableAction::CreateVectorIndex` stays inline until
 								   `blk-lance-6666` ships (`iss-848` reconciler path).
 								**Resolved post-review:**
 . **The exact mechanism of the data-table chain re-read → RESOLVED (§0, §2.4).**
 								   Pinned by Lance-source trace + proxy + pylance isolation **[U]**: it is **not**
 								   `checkout_version` (O(1)) and **not** merge-insert conflict replay. The write
 								   open goes through `DatasetBuilder::from_namespace` (`namespace.rs:174`), whose
 								   `describe_table` opens the whole dataset just to return a location
 								   (`namespace.rs:362`/`:112`) and whose `.load()` resolves latest **again** — a
 								   double latest-resolution per open, ~13× per write, nothing cached. The open
 								   resolves latest **without the V2 lexical / version-hint fast path** the direct
 								   opener uses (likely the un-threaded `Session`/store params,
 								   `load_table_from_namespace` `namespace.rs:174` — inferred, not traced), so it is
 								   O(depth) where a direct `from_uri().with_version(N)` is O(1). **The mechanism
 								   question is now academic for the fix:** bypassing `from_namespace` makes the open
 								   flat regardless of the precise sub-mechanism (un-threaded `Session` / double
 								   resolve / missed hint) — the bypass is the answer. (`list_table_versions` is
 								   **not** on this path — test-only; §10 follow-up.) `checkout_version` stays
 								   exonerated.
 								**Resolved end-to-end [M]:**
 . **End-to-end prototype of step 3 → DONE, measured (§2.4 before/after).** A
 								   prototype patched the opener (`open_dataset_head_for_write`, `table_store.rs:174`)
 								   to bypass `from_namespace` and open direct-by-URI, rebuilt v0.7.0, and re-ran the
 								   sweep: the data term collapsed `31 + 12·depth → flat 4`, total `+18/depth →
 								   +5/depth` (residual = the two internal tables, step 2), depth-80 **1618 → 593 ops
 								   (2.7×)** — functionally correct on main edge merge, branch create+write+read, and
 								   node merge. So step 3's "closes the dominant term outright" is **measured, not
 								   inferred**, and the opener bypass is **shippable standalone** (§9 step 3a).
 								   **Remaining (not blockers for step 3a's thesis):** the prototype did not cover
 								   schema-apply / branch merge / fork-on-first-write / overwrite / concurrent — a
 								   production opener change must pass the full `merge_truth_table`/recovery/failpoint
 								   suite; and the internal-table cleanup demo (step 2) + the concurrency
 								   fault-injection harness (steps 4/5) are still owed.
 								**Newly surfaced (open):**
 . **CAS-resurrection after cleanup → CONFIRMED VULNERABLE [S]; boundary watermark
 								   is a HARD PREREQUISITE for step 2.** SlateDB found this race (RFC-0026 / issue
 								   #352): a writer that stalls between computing manifest id `N+1` and creating it
 								   can, *after GC deletes `N+1`*, re-create it and observe **false success**.
 								   Lance 7.0.0 was traced directly and is **not immune**: version creation is a plain
 								   `put_opts(naming_scheme.manifest_path(base, version), PutMode::Create)` /
 								   `rename_if_not_exists` (`lance-table commit.rs:1421-1437`, `:1358`) on a
 								   version-numbered, **pruneable** path, with **no monotonic/boundary/watermark
 								   guard** anywhere in the manifest/commit/dataset path; `cleanup_old_versions` is
 								   **timestamp-based** (`cleanup.rs:1086`), so it deletes the very file the only
 								   guard (AlreadyExists→rebase) relies on. A stalled publisher whose target version
 								   was pruned by step-2 cleanup gets a `PutMode::Create` **success on a non-existent
 								   version → false success.** Severity by store: **R2/S3 (lexical, prod) = a silent
 								   lost write** (the resurrected version doesn't win V2 latest-resolution, so data
 								   lands on a dead branch while the publisher believes it committed); non-lexical =
 								   the version hint (`commit.rs:1439`) is overwritten to the stale version and
 								   trusted (worse, but not the prod path). **Action:** step 2 ships **only with** a
 								   durable **boundary/floor watermark** (GC advances it before deleting; every writer
 								   rejects `id <= boundary` after a "successful" create — SlateDB's fix), which also
 								   bounds any list-then-read-latest fallback. This was "lowest-risk earliest item";
 								   it is now gated (§9 step 2).
 								---
 								## 13. External validation (subagents + literature)
 								Validated read-only against OSS prior art and the DB/distributed-systems canon:
 								- **SlateDB** (canonical object-store LSM) — tenet-by-tenet ✅ on capture-once
 								  snapshot, monotonic-ID manifest (no pointer file — *explicitly rejected* in their
 								  RFC-0001), the **epoch fence** (exact match: `FenceableTransactionalObject`,
 								  hard-fail, TTL-lease *explicitly rejected* — adopt as specified), background
 								  epoch-coordinated compaction/GC, and recovery-on-open. **Adopt-items OmniGraph is
 								  missing / under-specifies:** (1) the **boundary-file** CAS-resurrection guard (Q8);
 								  (2) **group-commit batching** — coalesce pending `PublishPlan`s into one manifest
 								  CAS, directly mitigating the Q1 / §6.5 contention; (3) SlateDB *peels* compaction
 								  state *out* of the manifest (RFC-0013) — the **opposite** of Phase 7's fold-*in*;
 								  §11 should defend "fold-in (lineage must be atomic with visibility) beats peel-out
 								  for us"; (4) **write back-pressure** when cleanup lags (`l0_max_ssts`). **Citation
 								  correction:** SlateDB has the per-RPC counter (`InstrumentedObjectStore`) but
 								  **not** the flatness-across-history gate — the depth-swept Tier-1 gate (§5.1) is
 								  OmniGraph-novel; cite it that way.
 								- **Literature** — OCC/MVCC (Kung-Robinson 1981; DDIA ch.7), ARIES redo/undo, the
 								  fencing-token canon (Kleppmann — whose motivating example *is* OmniGraph's
 								  S3-read-modify-write-paused-past-lease scenario), and the lakehouse genre (Delta
 								  Lake VLDB 2020, Iceberg spec, Neon). The spine — OCC-over-MVCC + one atomic
 								  manifest CAS + WAL-of-intent recovery + monotonic-epoch fence — is canon-blessed,
 								  and OmniGraph **exceeds** Delta/Iceberg on the axis that matters (both are
 								  explicitly *single-table*-transactional; the manifest CAS delivers the atomic
 								  *cross-table* commit Delta only speculates about). The three scoping caveats are in
 								  §7.1.
 								- **HelixDB** (embedded LMDB graph DB) — too different a substrate to validate the
 								  object-store machinery (LMDB's `commit()` subsumes tenets #2–#8 for free), but it
 								  **corroborates tenet #1** (capture-once, thread-by-reference, re-resolution
 								  unrepresentable — its `&mut RwTxn`-threaded traversal is the embedded twin of
 								  `WriteTxn`) and confirms the bug class is **substrate-induced**. Portable idea for
 								  the roadmapped traversal work: adjacency as a *persisted, sorted,
 								  label-partitioned projection* keyed by `(node, label)` (vs the cold-rebuilt
 								  `TypeIndex`).