From f6d2cc03e379ed38f87abbc2ebeed101d515717d Mon Sep 17 00:00:00 2001
From: Ragnor Comerford <ragnor.comerford@gmail.com>
Date: Sat, 20 Jun 2026 13:31:15 +0200
Subject: [PATCH] write-path cost gate + opener bypass (#288)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* docs(rfc): RFC-013 write-path latency design + index link

* perf(engine): open write-path tables directly, bypassing the namespace builder

Write opens routed through DatasetBuilder::from_namespace, whose describe_table
opened the whole dataset just to return a location and then re-resolved the
latest version — an O(commit-depth) double latest-resolution per table open that
missed Lance's O(1) version-hint fast path. On an object store this dominated
write latency (~70%, RFC-013 section 2.4).

TableStore::open_dataset_head_for_write now delegates to the direct opener
(open_dataset_head: Dataset::open by URI + checkout_branch, routed through the
tracked opener so cost tests can count it; a no-op in production). The manifest
already holds every sub-table's location, so the namespace catalog lookup was
redundant; ensure_expected_version still validates head == pinned for strict ops.
This completes PR #268's open-by-location migration on the write side.

With both reads (PR #268) and now writes bypassing it, nothing in production
routes through the per-table Lance namespace. The dead open chain
(load_table_from_namespace, open_table_head_for_write) is deleted and the
StagedTableNamespace contract apparatus is gated #[cfg(test)], mirroring the
already-test-only read namespace; __manifest commit coordination
(GraphNamespacePublisher) is a separate component and is unaffected.

See docs/dev/rfc-013-write-path-latency.md sections 2.4 and 9 (step 3a).

* test(engine): write-path cost-budget gate on a shared harness

Adds tests/helpers/cost.rs, a store-agnostic cost harness (IoCounts/StagedCounts,
measure/measure_with_staged, assert_flat, local_graph/s3_graph) that the read-side
warm_read_cost.rs, write_cost.rs, and write_cost_s3.rs share, so the IOTracker /
task-local plumbing lives in exactly one place instead of duplicated per test.

write_cost.rs (local, every-PR) gates the internal-table scan term flat in
commit-history depth (a RED #[ignore]'d LOCK, the acceptance for bringing the
internal tables into compaction) plus green guards: a single insert's data writes
are bounded, a per-write read-op ceiling fails the moment a round-trip is added,
and a keyed insert routes through stage_merge_insert once with no stage_append or
vector-index build.

write_cost_s3.rs (bucket-gated, rustfs CI) gates the data-table opener term flat
across depth — the object-store-RPC phenomenon local FS cannot reproduce, and the
red->green proof of the opener bypass. Wired into the rustfs_integration CI job
and its path filter.

Guards the "hot-path cost is bounded by work, not history" invariant on writes.
See docs/dev/rfc-013-write-path-latency.md section 5.1, docs/dev/testing.md.

* docs(rfc): RFC-013 step 3a landed; write-skew coupling; cost-gate test map

- Section 9: mark step 1 (gate + harness) and step 3a (opener bypass) landed;
  record the per-table namespace retirement to test-only and the corrected
  measurement note (the opener win is S3-only; the local data-table growth is the
  merge-insert/RI fragment scan, a compaction term, not the opener).
- Sections 7.1/6/11/5.5/10: correct the cross-table write-skew analysis after a
  prototype proved the scoped expected-set fix is a no-op against the per-object_id
  manifest (disjoint writers never share a row, so Lance never conflicts, the
  publisher never retries, and the expected check is a non-atomic pre-check
  evaluated once against stale state). The fix needs a shared contention row
  (Phase-7 graph_head / a minimal head row / commit-time re-validation), so it is
  coupled to that row, not standalone; that contention is load-bearing for
  correctness, not a drawback. Split the concurrent face (read-set + head) from the
  sequential face (inbound-RI validation on node removal) -- two different fixes.
- testing.md: add write_cost.rs / helpers/cost.rs / write_cost_s3.rs to the test
  map; document the local-vs-S3 backend split; extend the cost-budget checklist
  item to the write/open path and point at the shared harness.

* test(engine): isolate the opener in the S3 cost gate; fail loud on S3 setup errors

Addresses two PR review findings on the bucket-gated write_cost_s3 gate:

- The data-table opener was not isolated: `data_reads` also counts the
  merge-insert/RI scan, which reads O(fragment-count) and so grows with history
  for a different reason (compaction's domain, not the opener) -- the same term
  that made the local data-table count grow. The flat assertion would false-RED
  or misattribute scan growth to the opener on rustfs. Fix: compact (db.optimize)
  before each measurement so the table holds ~1 fragment, bounding the scan and
  leaving the opener's latest-version resolution as the only history-varying term.
  Compaction preserves version history, so the opener still faces a deep
  _versions/ chain -- the thing under test.

- s3_graph used `.ok()?`, so when OMNIGRAPH_S3_TEST_BUCKET was set but the store
  was down/misconfigured, init/seed failures collapsed to None and the gate
  skipped + passed vacuously. Fix: skip only when the bucket env var is absent;
  once it is set, init/seed failures panic (mirrors tests/s3_storage.rs).

* test(engine): isolate the S3 opener with a per-prefix IO probe (correct-by-design)

Replaces the fixture-bounded isolation (compact-before-measure) from the prior
commit with the root fix: a path-classifying ObjectStore wrapper (PrefixCounter)
that attributes each data-table read to the opener term (_versions/.manifest) vs
the scan term (data/*.lance). IoCounts now exposes data_opener_reads /
data_scan_reads, so write_cost_s3 asserts the opener flat *directly* -- no
compaction or fixture massaging, and the assertion measures the opener, not the
conflated total. Closes the "harness conflates two IO terms" class: any cost test
(read or write) can now isolate the opener.

PrefixCounter implements only the object_store 0.13 core ObjectStore methods; the
convenience surface (get/put/head/...) routes through get_opts/put_opts via
ObjectStoreExt's blanket impl, so every read/write is still counted.

Validated locally (every-PR) by write_cost::data_table_reads_split_into_flat_opener_
and_growing_scan: opener stays flat (7 -> 3) while scan grows (11 -> 91) and
opener + scan == data_total exactly -- proving the classifier and confirming the
local data-table growth is the fragment scan, not the opener. warm_read_cost
(12 tests) stays green under the shared-harness change.

* refactor(tests): remove cost-harness duplication and namespace cfg(test) noise

Branch self-review (no behavior change) — pay down three liabilities the
write-path work left:

- warm_read_cost.rs kept its own probes() (three IOTrackers + a QueryIoProbes +
  a probe counter) and read raw .stats().read_iops — duplicating the shared
  helpers::cost harness this branch introduced. Migrated all 12 tests onto
  measure()/IoCounts; deleted the local probes(). (This also makes IoCounts'
  version_probes field used rather than dead.)

- insert_cost was copy-pasted verbatim into write_cost.rs and write_cost_s3.rs.
  Hoisted to helpers::cost::measure_insert so the measured write is defined once.

- The per-table Lance namespace (namespace.rs) became entirely test-only after
  step 3a, but was gated with ~22 per-item #[cfg(test)] attributes. Collapsed to
  a single `#[cfg(test)] mod namespace;` and stripped the per-item attributes;
  merged the import groups the gating had split.

Verified: lib in-source 162 passed; write_cost 4 + warm_read_cost 12 passed;
forbidden_apis passed.
---
 .github/workflows/ci.yml                      |    6 +-
 crates/omnigraph/src/db/manifest.rs           |    5 +-
 crates/omnigraph/src/db/manifest/layout.rs    |    1 +
 crates/omnigraph/src/db/manifest/metadata.rs  |    5 +-
 crates/omnigraph/src/db/manifest/namespace.rs |   65 +-
 crates/omnigraph/src/table_store.rs           |   24 +-
 crates/omnigraph/tests/helpers/cost.rs        |  360 +++++
 crates/omnigraph/tests/helpers/mod.rs         |    1 +
 crates/omnigraph/tests/warm_read_cost.rs      |  317 ++---
 crates/omnigraph/tests/write_cost.rs          |  159 +++
 crates/omnigraph/tests/write_cost_s3.rs       |   71 +
 docs/dev/index.md                             |    1 +
 docs/dev/rfc-013-write-path-latency.md        | 1203 +++++++++++++++++
 docs/dev/testing.md                           |    8 +-
 14 files changed, 1959 insertions(+), 267 deletions(-)
 create mode 100644 crates/omnigraph/tests/helpers/cost.rs
 create mode 100644 crates/omnigraph/tests/write_cost.rs
 create mode 100644 crates/omnigraph/tests/write_cost_s3.rs
 create mode 100644 docs/dev/rfc-013-write-path-latency.md

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index fca08da..1e9249f 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -88,7 +88,8 @@ jobs:
               .github/workflows/ci.yml|Cargo.toml|Cargo.lock|crates/*/Cargo.toml) run_rustfs_ci=true ;;
               crates/omnigraph/src/storage.rs) run_rustfs_ci=true ;;
               crates/omnigraph/src/db/manifest.rs|crates/omnigraph/src/db/manifest/*) run_rustfs_ci=true ;;
-              crates/omnigraph/tests/s3_storage.rs|crates/omnigraph/tests/helpers/*) run_rustfs_ci=true ;;
+              crates/omnigraph/tests/s3_storage.rs|crates/omnigraph/tests/write_cost_s3.rs|crates/omnigraph/tests/helpers/*) run_rustfs_ci=true ;;
+              crates/omnigraph/src/table_store.rs|crates/omnigraph/src/instrumentation.rs) run_rustfs_ci=true ;;
               crates/omnigraph-cluster/src/store.rs|crates/omnigraph-cluster/src/serve.rs) run_rustfs_ci=true ;;
               crates/omnigraph-cluster/tests/s3_cluster.rs) run_rustfs_ci=true ;;
               crates/omnigraph-server/tests/s3.rs|crates/omnigraph-server/tests/support/*) run_rustfs_ci=true ;;
@@ -372,6 +373,9 @@ jobs:
       - name: Run RustFS storage tests
         run: cargo test --locked -p omnigraph-engine --test s3_storage -- --nocapture
 
+      - name: Run RustFS write-path cost gate (RFC-013 step 3a opener)
+        run: cargo test --locked -p omnigraph-engine --test write_cost_s3 -- --nocapture
+
       - name: Run RustFS server smoke
         # No name filter: every test in the s3 target is bucket-gated, and a
         # filter matching nothing passes vacuously (which silently ran zero
diff --git a/crates/omnigraph/src/db/manifest.rs b/crates/omnigraph/src/db/manifest.rs
index 19f25a3..4c6410b 100644
--- a/crates/omnigraph/src/db/manifest.rs
+++ b/crates/omnigraph/src/db/manifest.rs
@@ -14,6 +14,10 @@ mod layout;
 mod metadata;
 #[path = "manifest/migrations.rs"]
 mod migrations;
+// Entirely test-only since RFC-013 step 3a: with both reads (Fix 2) and writes
+// bypassing the Lance namespace, nothing in production routes through it; the
+// `LanceNamespace` impls are retained only to validate the contract in unit tests.
+#[cfg(test)]
 #[path = "manifest/namespace.rs"]
 mod namespace;
 #[path = "manifest/publisher.rs"]
@@ -28,7 +32,6 @@ use layout::{manifest_uri, open_manifest_dataset, table_uri_for_path, type_name_
 pub(crate) use metadata::TableVersionMetadata;
 #[cfg(test)]
 use metadata::{OMNIGRAPH_ROW_COUNT_KEY, table_version_metadata_for_state};
-pub(crate) use namespace::open_table_head_for_write;
 #[cfg(test)]
 use namespace::{branch_manifest_namespace, staged_table_namespace};
 use publisher::{GraphNamespacePublisher, ManifestBatchPublisher};
diff --git a/crates/omnigraph/src/db/manifest/layout.rs b/crates/omnigraph/src/db/manifest/layout.rs
index 08fe043..f4ac09b 100644
--- a/crates/omnigraph/src/db/manifest/layout.rs
+++ b/crates/omnigraph/src/db/manifest/layout.rs
@@ -76,6 +76,7 @@ pub(super) fn table_uri_for_path(root_uri: &str, table_path: &str, branch: Optio
     }
 }
 
+#[cfg(test)]
 pub(super) fn namespace_internal_error(message: impl Into<String>) -> LanceNamespaceError {
     LanceNamespaceError::namespace_source(Box::new(std::io::Error::other(message.into())))
 }
diff --git a/crates/omnigraph/src/db/manifest/metadata.rs b/crates/omnigraph/src/db/manifest/metadata.rs
index 7cd6436..d84db34 100644
--- a/crates/omnigraph/src/db/manifest/metadata.rs
+++ b/crates/omnigraph/src/db/manifest/metadata.rs
@@ -2,7 +2,9 @@ use std::collections::HashMap;
 
 use lance::Dataset;
 use lance_namespace::Error as LanceNamespaceError;
-use lance_namespace::models::{CreateTableVersionRequest, TableVersion};
+use lance_namespace::models::CreateTableVersionRequest;
+#[cfg(test)]
+use lance_namespace::models::TableVersion;
 use serde::{Deserialize, Serialize};
 
 use crate::error::{OmniError, Result};
@@ -142,6 +144,7 @@ impl TableVersionMetadata {
         self.to_namespace_version_with_details(version, None, None)
     }
 
+    #[cfg(test)]
     pub(super) fn to_namespace_version_with_details(
         &self,
         version: u64,
diff --git a/crates/omnigraph/src/db/manifest/namespace.rs b/crates/omnigraph/src/db/manifest/namespace.rs
index 0d567e0..a684b4d 100644
--- a/crates/omnigraph/src/db/manifest/namespace.rs
+++ b/crates/omnigraph/src/db/manifest/namespace.rs
@@ -1,3 +1,10 @@
+// Both the read namespace (BranchManifestNamespace) and the write namespace
+// (StagedTableNamespace) are now test-only contract validation. Reads open
+// sub-tables directly by location+version (SubTableEntry::open, Fix 2), and
+// writes open the table head directly by URI (TableStore::open_dataset_head,
+// RFC-013 step 3a), so nothing in production routes through the Lance namespace
+// anymore. These impls are retained only to validate the LanceNamespace
+// contract in unit tests.
 use std::sync::Arc;
 
 use async_trait::async_trait;
@@ -16,30 +23,21 @@ use object_store::{
 
 use crate::error::{OmniError, Result};
 
-use super::layout::{namespace_internal_error, table_uri_for_path};
-#[cfg(test)]
-use super::layout::{open_manifest_dataset, table_id_to_key};
-use super::metadata::TableVersionMetadata;
-#[cfg(test)]
-use super::metadata::{namespace_version_metadata, parse_namespace_version_request};
-#[cfg(test)]
+use super::layout::{
+    namespace_internal_error, open_manifest_dataset, table_id_to_key, table_uri_for_path,
+};
+use super::metadata::{
+    TableVersionMetadata, namespace_version_metadata, parse_namespace_version_request,
+};
 use super::publisher::GraphNamespacePublisher;
-// The read namespace (BranchManifestNamespace) is test-only since Fix 2: reads
-// open sub-tables directly by location+version (SubTableEntry::open), so nothing
-// in production routes a read through the Lance namespace. The writes path uses
-// StagedTableNamespace. These items are retained to validate the namespace
-// contract in unit tests.
-#[cfg(test)]
 use super::state::{ManifestState, SubTableEntry, read_manifest_entries, read_manifest_state};
 
-#[cfg(test)]
 #[derive(Debug, Clone)]
 struct BranchManifestNamespace {
     root_uri: String,
     branch: Option<String>,
 }
 
-#[cfg(test)]
 impl BranchManifestNamespace {
     fn new(root_uri: &str, branch: Option<&str>) -> Self {
         Self {
@@ -146,7 +144,6 @@ impl StagedTableNamespace {
     }
 }
 
-#[cfg(test)]
 pub(crate) fn branch_manifest_namespace(
     root_uri: &str,
     branch: Option<&str>,
@@ -165,27 +162,6 @@ pub(crate) fn staged_table_namespace(
     ))
 }
 
-async fn load_table_from_namespace(
-    namespace: Arc<dyn LanceNamespace>,
-    table_key: &str,
-    branch: Option<&str>,
-    version: Option<u64>,
-) -> Result<Dataset> {
-    let builder = DatasetBuilder::from_namespace(namespace, vec![table_key.to_string()])
-        .await
-        .map_err(|e| OmniError::Lance(e.to_string()))?;
-    let builder = match (branch, version) {
-        (Some(branch), version) => builder.with_branch(branch, version),
-        (None, Some(version)) => builder.with_version(version),
-        (None, None) => builder,
-    };
-    builder
-        .load()
-        .await
-        .map_err(|e| OmniError::Lance(e.to_string()))
-}
-
-#[cfg(test)]
 #[async_trait]
 impl LanceNamespace for BranchManifestNamespace {
     fn namespace_id(&self) -> String {
@@ -540,18 +516,3 @@ impl LanceNamespace for StagedTableNamespace {
         Ok(response)
     }
 }
-
-pub(crate) async fn open_table_head_for_write(
-    root_uri: &str,
-    table_key: &str,
-    table_path: &str,
-    branch: Option<&str>,
-) -> Result<Dataset> {
-    load_table_from_namespace(
-        staged_table_namespace(root_uri, table_key, table_path, branch),
-        table_key,
-        branch,
-        None,
-    )
-    .await
-}
diff --git a/crates/omnigraph/src/table_store.rs b/crates/omnigraph/src/table_store.rs
index 511508f..96e6196 100644
--- a/crates/omnigraph/src/table_store.rs
+++ b/crates/omnigraph/src/table_store.rs
@@ -24,7 +24,7 @@ use lance_table::format::{Fragment, IndexMetadata, RowIdMeta};
 use lance_table::rowids::{RowIdSequence, write_row_ids};
 use std::sync::Arc;
 
-use crate::db::manifest::{TableVersionMetadata, open_table_head_for_write};
+use crate::db::manifest::TableVersionMetadata;
 use crate::db::{Snapshot, SubTableEntry};
 use crate::error::{OmniError, Result};
 use crate::storage_layer::ForkOutcome;
@@ -160,9 +160,15 @@ impl TableStore {
         dataset_uri: &str,
         branch: Option<&str>,
     ) -> Result<Dataset> {
-        let ds = Dataset::open(dataset_uri)
-            .await
-            .map_err(|e| OmniError::Lance(e.to_string()))?;
+        // Direct open by URI (O(1) latest-resolution). Routed through the tracked
+        // opener so a cost test counts it via the per-query `table_wrapper`
+        // (no-op in production — the task-local is unset, so this is exactly
+        // `Dataset::open(uri)`).
+        let ds = crate::instrumentation::open_dataset_tracked(
+            dataset_uri,
+            crate::instrumentation::table_wrapper(),
+        )
+        .await?;
         match branch {
             Some(branch) if branch != "main" => ds
                 .checkout_branch(branch)
@@ -178,8 +184,14 @@ impl TableStore {
         dataset_uri: &str,
         branch: Option<&str>,
     ) -> Result<Dataset> {
-        let table_path = self.table_path_from_dataset_uri(dataset_uri)?;
-        open_table_head_for_write(&self.root_uri, table_key, &table_path, branch).await
+        // RFC-013 step 3a: open writes via the direct opener (O(1)) instead of the
+        // lance-namespace builder, which re-resolved the table's version chain
+        // O(depth) per write. The namespace is a catalog/discovery layer, not a
+        // per-open hot-path component (RFC §2.4); the manifest already holds the
+        // location, and `ensure_expected_version` still validates head == pinned
+        // for strict ops. `table_key` retained for signature stability.
+        let _ = table_key;
+        self.open_dataset_head(dataset_uri, branch).await
     }
 
     pub async fn delete_branch(&self, dataset_uri: &str, branch: &str) -> Result<()> {
diff --git a/crates/omnigraph/tests/helpers/cost.rs b/crates/omnigraph/tests/helpers/cost.rs
new file mode 100644
index 0000000..4be9ee6
--- /dev/null
+++ b/crates/omnigraph/tests/helpers/cost.rs
@@ -0,0 +1,360 @@
+//! Shared cost-budget test harness (RFC-013) — the single place the IO-counting
+//! plumbing lives, so `warm_read_cost.rs`, `write_cost.rs`, and the S3 variant
+//! assert in one vocabulary instead of duplicating `probes()` + raw `IOTracker`
+//! reads. Three clean abstractions: structured counts, a `measure` primitive, a
+//! named flat-assertion, plus store-agnostic backend fixtures.
+//!
+//! The data-table wrapper is a **path-classifying** counter (`PrefixCounter`), not a
+//! plain `IOTracker`: it splits each read into the **opener** term (latest-version
+//! resolution — reads of `_versions/`/`.manifest` objects) vs the **scan** term
+//! (data-fragment reads, `data/`/`*.lance`). That lets a cost test isolate the
+//! opener (RFC-013 step 3a's target, O(1) after the bypass) from the merge-insert/RI
+//! scan (O(fragment-count), compaction's domain) even though both ride the same
+//! `Dataset` — without controlling the fixture (no compaction needed). `__manifest`
+//! and `_graph_commits` keep the plain `IOTracker` (no sub-prefixes worth splitting).
+#![allow(dead_code)]
+
+use std::fmt;
+use std::future::Future;
+use std::sync::atomic::{AtomicU64, Ordering};
+use std::sync::{Arc, Mutex};
+
+use async_trait::async_trait;
+use futures::stream::BoxStream;
+use lance::io::WrappingObjectStore;
+use lance_io::utils::tracking_store::IOTracker;
+use object_store::path::Path;
+use object_store::{
+    CopyOptions, GetOptions, GetResult, ListResult, MultipartUpload, ObjectMeta, ObjectStore,
+    PutMultipartOptions, PutOptions, PutPayload, PutResult, Result as OSResult,
+};
+
+use omnigraph::db::Omnigraph;
+use omnigraph::instrumentation::{
+    MergeWriteProbes, QueryIoProbes, with_merge_write_probes, with_query_io_probes,
+};
+use omnigraph::loader::{LoadMode, load_jsonl};
+
+use super::{MUTATION_QUERIES, TEST_DATA, TEST_SCHEMA, init_and_load, mixed_params};
+
+/// Object-store op counts for one measured operation, by table class — the
+/// vocabulary cost tests assert in (vs raw `IOTracker::stats().read_iops`).
+#[derive(Debug, Clone, Copy, Default)]
+pub struct IoCounts {
+    /// Per-table DATA opens (node/edge tables). The dominant write-path term.
+    pub data_reads: u64,
+    pub data_writes: u64,
+    /// DATA-table reads attributed to latest-version resolution (`_versions/`,
+    /// `.manifest`). This is the **opener** term step 3a flattened — isolated from
+    /// the scan, so it can be gated directly without compacting the fixture.
+    pub data_opener_reads: u64,
+    /// DATA-table reads attributed to data fragments (`data/`, `*.lance`) — the
+    /// merge-insert/RI **scan**, which grows with fragment count (compaction's
+    /// domain, not the opener).
+    pub data_scan_reads: u64,
+    /// `__manifest` registry scans (publish state).
+    pub manifest_reads: u64,
+    /// `_graph_commits` lineage scans.
+    pub commit_graph_reads: u64,
+    /// Version-probe invocations (the cheap freshness check).
+    pub version_probes: u64,
+}
+
+impl IoCounts {
+    pub fn total_reads(&self) -> u64 {
+        self.data_reads + self.manifest_reads + self.commit_graph_reads
+    }
+}
+
+/// Which staged-write primitives an operation invoked (from `MergeWriteProbes`).
+#[derive(Debug, Clone, Copy, Default)]
+pub struct StagedCounts {
+    pub stage_append: u64,
+    pub stage_merge_insert: u64,
+    pub create_vector_index: u64,
+    pub scan_staged_combined: u64,
+}
+
+// ── Path-classifying data-table read counter ──
+
+/// How a data-table object read is attributed.
+enum ReadClass {
+    /// Latest-version resolution: `_versions/`, `.manifest`, `_latest`.
+    Opener,
+    /// Data fragments: `data/`, `*.lance`.
+    Scan,
+    /// Anything else (indices, deletion files, …) — counted in the total only.
+    Other,
+}
+
+/// Classify a Lance object path by its role in a write open. Lance's on-object
+/// layout is identical on local FS and S3, so this split is backend-independent.
+fn classify(path: &Path) -> ReadClass {
+    let p = path.as_ref();
+    if p.contains("_versions") || p.ends_with(".manifest") || p.contains("_latest") {
+        ReadClass::Opener
+    } else if p.contains("/data/") || p.starts_with("data/") || p.ends_with(".lance") {
+        ReadClass::Scan
+    } else {
+        ReadClass::Other
+    }
+}
+
+#[derive(Debug, Default, Clone, Copy)]
+struct PrefixCounts {
+    reads: u64,
+    writes: u64,
+    opener_reads: u64,
+    scan_reads: u64,
+}
+
+/// A `WrappingObjectStore` that counts reads/writes and attributes each read to the
+/// opener vs scan term by object-key prefix. Shares its tally via `Arc<Mutex<…>>` so
+/// the wrapped store (handed to Lance) and the test read the same counters.
+#[derive(Debug, Default, Clone)]
+struct PrefixCounter(Arc<Mutex<PrefixCounts>>);
+
+impl PrefixCounter {
+    fn record_read(&self, location: &Path) {
+        let mut c = self.0.lock().unwrap();
+        c.reads += 1;
+        match classify(location) {
+            ReadClass::Opener => c.opener_reads += 1,
+            ReadClass::Scan => c.scan_reads += 1,
+            ReadClass::Other => {}
+        }
+    }
+
+    fn record_write(&self) {
+        self.0.lock().unwrap().writes += 1;
+    }
+
+    fn snapshot(&self) -> PrefixCounts {
+        *self.0.lock().unwrap()
+    }
+}
+
+impl WrappingObjectStore for PrefixCounter {
+    fn wrap(&self, _store_prefix: &str, target: Arc<dyn ObjectStore>) -> Arc<dyn ObjectStore> {
+        Arc::new(PrefixCountingStore {
+            target,
+            counter: self.clone(),
+        })
+    }
+}
+
+/// The wrapped `ObjectStore` that records each call into a [`PrefixCounter`].
+/// Implements only the required core `ObjectStore` methods (object_store 0.13: the
+/// convenience surface — `get`/`put`/`head`/`get_range`/… — lives on
+/// `ObjectStoreExt` and is provided by a blanket impl that routes through `get_opts`
+/// / `put_opts`, so every read/write is still counted here). Per-read path
+/// classification is the only addition over a plain pass-through.
+#[derive(Debug)]
+struct PrefixCountingStore {
+    target: Arc<dyn ObjectStore>,
+    counter: PrefixCounter,
+}
+
+impl fmt::Display for PrefixCountingStore {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        write!(f, "PrefixCountingStore({})", self.target)
+    }
+}
+
+#[async_trait]
+impl ObjectStore for PrefixCountingStore {
+    async fn put_opts(
+        &self,
+        location: &Path,
+        payload: PutPayload,
+        opts: PutOptions,
+    ) -> OSResult<PutResult> {
+        self.counter.record_write();
+        self.target.put_opts(location, payload, opts).await
+    }
+
+    async fn put_multipart_opts(
+        &self,
+        location: &Path,
+        opts: PutMultipartOptions,
+    ) -> OSResult<Box<dyn MultipartUpload>> {
+        self.counter.record_write();
+        self.target.put_multipart_opts(location, opts).await
+    }
+
+    async fn get_opts(&self, location: &Path, options: GetOptions) -> OSResult<GetResult> {
+        self.counter.record_read(location);
+        self.target.get_opts(location, options).await
+    }
+
+    fn delete_stream(
+        &self,
+        locations: BoxStream<'static, OSResult<Path>>,
+    ) -> BoxStream<'static, OSResult<Path>> {
+        self.target.delete_stream(locations)
+    }
+
+    fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, OSResult<ObjectMeta>> {
+        self.counter.record_read(&prefix.cloned().unwrap_or_default());
+        self.target.list(prefix)
+    }
+
+    fn list_with_offset(
+        &self,
+        prefix: Option<&Path>,
+        offset: &Path,
+    ) -> BoxStream<'static, OSResult<ObjectMeta>> {
+        self.counter.record_read(&prefix.cloned().unwrap_or_default());
+        self.target.list_with_offset(prefix, offset)
+    }
+
+    async fn list_with_delimiter(&self, prefix: Option<&Path>) -> OSResult<ListResult> {
+        self.counter.record_read(&prefix.cloned().unwrap_or_default());
+        self.target.list_with_delimiter(prefix).await
+    }
+
+    async fn copy_opts(&self, from: &Path, to: &Path, options: CopyOptions) -> OSResult<()> {
+        self.counter.record_write();
+        self.target.copy_opts(from, to, options).await
+    }
+}
+
+/// The tracker handles backing one measurement; read once into [`IoCounts`].
+struct ProbeHandles {
+    manifest: IOTracker,
+    commit_graph: IOTracker,
+    table: PrefixCounter,
+    probe_count: Arc<AtomicU64>,
+}
+
+impl ProbeHandles {
+    fn install() -> (QueryIoProbes, Self) {
+        let h = ProbeHandles {
+            manifest: IOTracker::default(),
+            commit_graph: IOTracker::default(),
+            table: PrefixCounter::default(),
+            probe_count: Arc::new(AtomicU64::new(0)),
+        };
+        let probes = QueryIoProbes {
+            manifest_wrapper: Some(Arc::new(h.manifest.clone()) as Arc<dyn WrappingObjectStore>),
+            commit_graph_wrapper: Some(
+                Arc::new(h.commit_graph.clone()) as Arc<dyn WrappingObjectStore>
+            ),
+            table_wrapper: Some(Arc::new(h.table.clone()) as Arc<dyn WrappingObjectStore>),
+            probe_count: Arc::clone(&h.probe_count),
+        };
+        (probes, h)
+    }
+
+    fn counts(&self) -> IoCounts {
+        let t = self.table.snapshot();
+        IoCounts {
+            data_reads: t.reads,
+            data_writes: t.writes,
+            data_opener_reads: t.opener_reads,
+            data_scan_reads: t.scan_reads,
+            manifest_reads: self.manifest.stats().read_iops,
+            commit_graph_reads: self.commit_graph.stats().read_iops,
+            version_probes: self.probe_count.load(Ordering::Relaxed),
+        }
+    }
+}
+
+/// Run `op` under object-store IO counting; return its output + the counts.
+/// The only place the `QueryIoProbes` task-local + tracker wiring lives.
+pub async fn measure<F: Future>(op: F) -> (F::Output, IoCounts) {
+    let (probes, handles) = ProbeHandles::install();
+    let out = with_query_io_probes(probes, op).await;
+    (out, handles.counts())
+}
+
+/// Like [`measure`], but also capture which staged-write primitives ran
+/// (composes the two task-locals cleanly).
+pub async fn measure_with_staged<F: Future>(op: F) -> (F::Output, IoCounts, StagedCounts) {
+    let (probes, handles) = ProbeHandles::install();
+    let merge = MergeWriteProbes::default();
+    let out = with_merge_write_probes(merge.clone(), with_query_io_probes(probes, op)).await;
+    let staged = StagedCounts {
+        stage_append: merge.stage_append_calls(),
+        stage_merge_insert: merge.stage_merge_insert_calls(),
+        create_vector_index: merge.create_vector_index_calls(),
+        scan_staged_combined: merge.scan_staged_combined_calls(),
+    };
+    (out, handles.counts(), staged)
+}
+
+/// Assert a per-depth metric is flat: the deepest sample must not exceed the
+/// shallowest by more than `slack`. `select` picks the field; `what` names it in
+/// the failure message. The shape every depth-swept cost gate uses.
+pub fn assert_flat(
+    curve: &[(u64, IoCounts)],
+    select: impl Fn(&IoCounts) -> u64,
+    slack: u64,
+    what: &str,
+) {
+    assert!(curve.len() >= 2, "assert_flat needs >= 2 depth points");
+    let (d_lo, lo) = (curve[0].0, select(&curve[0].1));
+    let (d_hi, hi) = (curve[curve.len() - 1].0, select(&curve[curve.len() - 1].1));
+    assert!(
+        hi <= lo + slack,
+        "{what} grew with history: depth {d_lo} = {lo} -> depth {d_hi} = {hi} (slack {slack})"
+    );
+}
+
+/// Assert a per-depth metric *does* grow with history by at least `min_delta` — the
+/// dual of [`assert_flat`], used to prove a term is genuinely history-dependent (so a
+/// flat sibling term isn't flat merely because nothing was measured).
+pub fn assert_grows(
+    curve: &[(u64, IoCounts)],
+    select: impl Fn(&IoCounts) -> u64,
+    min_delta: u64,
+    what: &str,
+) {
+    assert!(curve.len() >= 2, "assert_grows needs >= 2 depth points");
+    let (d_lo, lo) = (curve[0].0, select(&curve[0].1));
+    let (d_hi, hi) = (curve[curve.len() - 1].0, select(&curve[curve.len() - 1].1));
+    assert!(
+        hi >= lo + min_delta,
+        "{what} did not grow as expected: depth {d_lo} = {lo} -> depth {d_hi} = {hi} (min delta {min_delta})"
+    );
+}
+
+/// Measure one committing `insert_person` to `main` — the canonical write the cost
+/// gates sweep over commit-history depth. Shared by `write_cost.rs` and
+/// `write_cost_s3.rs` so the measured write is defined once.
+pub async fn measure_insert(db: &mut Omnigraph, tag: &str) -> IoCounts {
+    let (res, io) = measure(db.mutate(
+        "main",
+        MUTATION_QUERIES,
+        "insert_person",
+        &mixed_params(&[("$name", tag)], &[("$age", 30)]),
+    ))
+    .await;
+    res.unwrap();
+    io
+}
+
+// ── Backend fixtures — one knob, store-agnostic body ──
+
+/// Local tempdir graph (default; deterministic, every-PR).
+pub async fn local_graph(dir: &tempfile::TempDir) -> Omnigraph {
+    init_and_load(dir).await
+}
+
+/// Emulated-S3 graph, bucket-gated. Returns `None` **only** when
+/// `OMNIGRAPH_S3_TEST_BUCKET` is unset, so the caller logs + skips — the
+/// `tests/s3_storage.rs` graceful-skip pattern. Once the bucket *is* configured
+/// (the rustfs CI job), any `init`/seed failure is a real failure and panics
+/// rather than silently skipping — otherwise a down/misconfigured store would let
+/// a bucket-gated gate pass vacuously. `name` disambiguates the prefix.
+pub async fn s3_graph(name: &str) -> Option<Omnigraph> {
+    let bucket = std::env::var("OMNIGRAPH_S3_TEST_BUCKET").ok()?;
+    let uri = format!("s3://{bucket}/cost-tests/{name}-{}", std::process::id());
+    let mut db = Omnigraph::init(&uri, TEST_SCHEMA)
+        .await
+        .expect("OMNIGRAPH_S3_TEST_BUCKET is set but S3 graph init failed");
+    load_jsonl(&mut db, TEST_DATA, LoadMode::Overwrite)
+        .await
+        .expect("OMNIGRAPH_S3_TEST_BUCKET is set but S3 seed load failed");
+    Some(db)
+}
diff --git a/crates/omnigraph/tests/helpers/mod.rs b/crates/omnigraph/tests/helpers/mod.rs
index e690839..131f91b 100644
--- a/crates/omnigraph/tests/helpers/mod.rs
+++ b/crates/omnigraph/tests/helpers/mod.rs
@@ -1,5 +1,6 @@
 #![allow(dead_code)]
 
+pub mod cost;
 pub mod recovery;
 
 use arrow_array::{Array, RecordBatch, StringArray};
diff --git a/crates/omnigraph/tests/warm_read_cost.rs b/crates/omnigraph/tests/warm_read_cost.rs
index d7fc52a..b3f5446 100644
--- a/crates/omnigraph/tests/warm_read_cost.rs
+++ b/crates/omnigraph/tests/warm_read_cost.rs
@@ -1,51 +1,22 @@
 //! Cost-budget tests for the warm read path (Fix 1): a warm same-branch read
-//! must perform no manifest or commit-graph opens, measured with Lance's
-//! `IOTracker` at the object-store boundary (the LanceDB IO-counted-test
-//! pattern; see docs/dev/testing.md). Guards invariant 15 (read cost bounded by
-//! work, not history) for snapshot resolution, and invariant 6 (a warm reader
-//! still observes external commits).
+//! must perform no manifest or commit-graph opens, measured via the shared
+//! `helpers::cost` harness at the object-store boundary (the LanceDB
+//! IO-counted-test pattern; see docs/dev/testing.md). Guards invariant 15 (read
+//! cost bounded by work, not history) for snapshot resolution, and invariant 6
+//! (a warm reader still observes external commits).
 
 mod helpers;
 
-use std::sync::Arc;
-use std::sync::atomic::{AtomicU64, Ordering};
-
 use arrow_array::{Array, StringArray};
-use lance::io::WrappingObjectStore;
-use lance_io::utils::tracking_store::IOTracker;
 use omnigraph::db::{Omnigraph, ReadTarget};
-use omnigraph::instrumentation::{QueryIoProbes, with_query_io_probes};
 use omnigraph_compiler::result::QueryResult;
 
+use helpers::cost::measure;
 use helpers::{
     MUTATION_QUERIES, TEST_QUERIES, commit_many, count_rows, init_and_load, mixed_params,
     mutate_branch, mutate_main, params,
 };
 
-/// IO probes plus the tracker handles to read `read_iops` after the query.
-/// Returns `(probes, manifest, commit_graph, table, probe_count)` — `table`
-/// counts per-table data opens (the cache-miss path), so a cost test can assert
-/// N opens on a cold read and 0 on a warm repeat (Fix 3).
-fn probes() -> (
-    QueryIoProbes,
-    IOTracker,
-    IOTracker,
-    IOTracker,
-    Arc<AtomicU64>,
-) {
-    let manifest = IOTracker::default();
-    let commit_graph = IOTracker::default();
-    let table = IOTracker::default();
-    let probe_count = Arc::new(AtomicU64::new(0));
-    let probes = QueryIoProbes {
-        manifest_wrapper: Some(Arc::new(manifest.clone()) as Arc<dyn WrappingObjectStore>),
-        commit_graph_wrapper: Some(Arc::new(commit_graph.clone()) as Arc<dyn WrappingObjectStore>),
-        table_wrapper: Some(Arc::new(table.clone()) as Arc<dyn WrappingObjectStore>),
-        probe_count: Arc::clone(&probe_count),
-    };
-    (probes, manifest, commit_graph, table, probe_count)
-}
-
 fn first_column_strings(result: &QueryResult) -> Vec<String> {
     if result.num_rows() == 0 {
         return Vec::new();
@@ -75,18 +46,14 @@ async fn warm_same_branch_read_does_no_resolution_opens() {
     // Deep history: warm-read resolution cost must be flat in commit count.
     commit_many(&mut db, 20).await;
 
-    let (probes_in, manifest, commit_graph, _table, probe_count) = probes();
-    with_query_io_probes(
-        probes_in,
-        db.query(
-            ReadTarget::branch("main"),
-            TEST_QUERIES,
-            "total_people",
-            &params(&[]),
-        ),
-    )
-    .await
-    .unwrap();
+    let (out, io) = measure(db.query(
+        ReadTarget::branch("main"),
+        TEST_QUERIES,
+        "total_people",
+        &params(&[]),
+    ))
+    .await;
+    out.unwrap();
 
     // A warm same-branch read opens nothing from the internal tables, even at
     // commit-history depth. Fix 1 reuses the coordinator (no re-open: 0
@@ -95,18 +62,15 @@ async fn warm_same_branch_read_does_no_resolution_opens() {
     // per-table __manifest scan is gone too. Pre-fix, each of these is a deep scan
     // of an internal table that grows with commit count.
     assert_eq!(
-        manifest.stats().read_iops,
-        0,
+        io.manifest_reads, 0,
         "warm same-branch read must not scan __manifest (resolution or per-table)"
     );
     assert_eq!(
-        commit_graph.stats().read_iops,
-        0,
+        io.commit_graph_reads, 0,
         "warm same-branch read must not open the commit graph (no coordinator re-open)"
     );
     assert_eq!(
-        probe_count.load(Ordering::Relaxed),
-        1,
+        io.version_probes, 1,
         "warm same-branch read performs exactly one version probe"
     );
 }
@@ -121,22 +85,17 @@ async fn multi_table_query_does_no_manifest_scans() {
     let dir = tempfile::tempdir().unwrap();
     let db = init_and_load(&dir).await;
 
-    let (probes_in, manifest, _commit_graph, _table, _probe) = probes();
-    with_query_io_probes(
-        probes_in,
-        db.query(
-            ReadTarget::branch("main"),
-            TEST_QUERIES,
-            "age_stats",
-            &params(&[]),
-        ),
-    )
-    .await
-    .unwrap();
+    let (out, io) = measure(db.query(
+        ReadTarget::branch("main"),
+        TEST_QUERIES,
+        "age_stats",
+        &params(&[]),
+    ))
+    .await;
+    out.unwrap();
 
     assert_eq!(
-        manifest.stats().read_iops,
-        0,
+        io.manifest_reads, 0,
         "a multi-table read must not scan __manifest once per touched table"
     );
 }
@@ -278,32 +237,25 @@ async fn warm_branch_read_does_no_manifest_scans() {
     // Bind the handle's coordinator to the branch so reads of it take the warm path.
     db.sync_branch("feature").await.unwrap();
 
-    let (probes_in, manifest, commit_graph, _table, probe_count) = probes();
-    with_query_io_probes(
-        probes_in,
-        db.query(
-            ReadTarget::branch("feature"),
-            TEST_QUERIES,
-            "total_people",
-            &params(&[]),
-        ),
-    )
-    .await
-    .unwrap();
+    let (out, io) = measure(db.query(
+        ReadTarget::branch("feature"),
+        TEST_QUERIES,
+        "total_people",
+        &params(&[]),
+    ))
+    .await;
+    out.unwrap();
 
     assert_eq!(
-        manifest.stats().read_iops,
-        0,
+        io.manifest_reads, 0,
         "warm branch read must not scan __manifest (branch-owned table opened by location)"
     );
     assert_eq!(
-        commit_graph.stats().read_iops,
-        0,
+        io.commit_graph_reads, 0,
         "warm branch read must not open the commit graph"
     );
     assert_eq!(
-        probe_count.load(Ordering::Relaxed),
-        1,
+        io.version_probes, 1,
         "warm branch read performs exactly one version probe"
     );
 }
@@ -369,18 +321,14 @@ async fn warm_read_on_recreated_branch_observes_new_incarnation() {
         "test setup must exercise branch incarnation reuse at one Lance version"
     );
 
-    let (probes_in, manifest, commit_graph, _table, probe_count) = probes();
-    let new_feature = with_query_io_probes(
-        probes_in,
-        reader.query(
-            ReadTarget::branch("feature"),
-            TEST_QUERIES,
-            "get_person",
-            &params(&[("$name", "MainOnly")]),
-        ),
-    )
-    .await
-    .unwrap();
+    let (new_feature, io) = measure(reader.query(
+        ReadTarget::branch("feature"),
+        TEST_QUERIES,
+        "get_person",
+        &params(&[("$name", "MainOnly")]),
+    ))
+    .await;
+    let new_feature = new_feature.unwrap();
 
     assert_eq!(
         new_feature.num_rows(),
@@ -388,17 +336,15 @@ async fn warm_read_on_recreated_branch_observes_new_incarnation() {
         "warm reader must refresh to the recreated branch incarnation"
     );
     assert!(
-        manifest.stats().read_iops > 0,
+        io.manifest_reads > 0,
         "recreated branch must re-read the manifest after the incarnation probe"
     );
     assert_eq!(
-        commit_graph.stats().read_iops,
-        0,
+        io.commit_graph_reads, 0,
         "same-branch incarnation refresh must be manifest-only"
     );
     assert_eq!(
-        probe_count.load(Ordering::Relaxed),
-        2,
+        io.version_probes, 2,
         "stale same-branch read probes once under the read lock and once under the write lock"
     );
 }
@@ -469,39 +415,33 @@ async fn recreated_branch_owned_table_handle_uses_table_etag() {
         "test setup must force table handle identity to differ only by e_tag"
     );
 
-    let (probes_in, manifest, commit_graph, table, probe_count) = probes();
-    let new_person = with_query_io_probes(
-        probes_in,
-        reader.query(
-            ReadTarget::branch("feature"),
-            TEST_QUERIES,
-            "get_person",
-            &params(&[("$name", "NewOnly")]),
-        ),
-    )
-    .await
-    .unwrap();
+    let (new_person, io) = measure(reader.query(
+        ReadTarget::branch("feature"),
+        TEST_QUERIES,
+        "get_person",
+        &params(&[("$name", "NewOnly")]),
+    ))
+    .await;
+    let new_person = new_person.unwrap();
     assert_eq!(
         new_person.num_rows(),
         1,
         "warm reader must open the recreated branch-owned table incarnation"
     );
     assert!(
-        table.stats().read_iops > 0,
+        io.data_reads > 0,
         "table e_tag must force a held-handle cache miss for the recreated table"
     );
     assert!(
-        manifest.stats().read_iops > 0,
+        io.manifest_reads > 0,
         "recreated branch must refresh the manifest"
     );
     assert_eq!(
-        commit_graph.stats().read_iops,
-        0,
+        io.commit_graph_reads, 0,
         "same-branch table-incarnation refresh must be manifest-only"
     );
     assert_eq!(
-        probe_count.load(Ordering::Relaxed),
-        2,
+        io.version_probes, 2,
         "stale same-branch read probes once under each lock"
     );
 
@@ -594,35 +534,29 @@ async fn recreated_branch_traversal_uses_graph_index_incarnation() {
         "test setup must force graph-index identity to differ only by snapshot incarnation"
     );
 
-    let (probes_in, manifest, commit_graph, _table, probe_count) = probes();
-    let new_friends = with_query_io_probes(
-        probes_in,
-        reader.query(
-            ReadTarget::branch("feature"),
-            TEST_QUERIES,
-            "friends_of",
-            &params(&[("$name", "NewWalker")]),
-        ),
-    )
-    .await
-    .unwrap();
+    let (new_friends, io) = measure(reader.query(
+        ReadTarget::branch("feature"),
+        TEST_QUERIES,
+        "friends_of",
+        &params(&[("$name", "NewWalker")]),
+    ))
+    .await;
+    let new_friends = new_friends.unwrap();
     assert_eq!(
         first_column_strings(&new_friends),
         vec!["Bob"],
         "traversal must use the recreated branch's topology, not stale cached graph index"
     );
     assert!(
-        manifest.stats().read_iops > 0,
+        io.manifest_reads > 0,
         "recreated branch traversal must refresh the manifest"
     );
     assert_eq!(
-        commit_graph.stats().read_iops,
-        0,
+        io.commit_graph_reads, 0,
         "same-branch traversal incarnation refresh must be manifest-only"
     );
     assert_eq!(
-        probe_count.load(Ordering::Relaxed),
-        2,
+        io.version_probes, 2,
         "stale same-branch read probes once under each lock"
     );
 
@@ -673,31 +607,25 @@ async fn stale_read_refreshes_manifest_only() {
     .await
     .unwrap();
 
-    let (probes_in, manifest, commit_graph, _table, probe_count) = probes();
-    with_query_io_probes(
-        probes_in,
-        reader.query(
-            ReadTarget::branch("main"),
-            TEST_QUERIES,
-            "total_people",
-            &params(&[]),
-        ),
-    )
-    .await
-    .unwrap();
+    let (out, io) = measure(reader.query(
+        ReadTarget::branch("main"),
+        TEST_QUERIES,
+        "total_people",
+        &params(&[]),
+    ))
+    .await;
+    out.unwrap();
 
     assert!(
-        manifest.stats().read_iops > 0,
+        io.manifest_reads > 0,
         "stale read must re-read the manifest"
     );
     assert_eq!(
-        commit_graph.stats().read_iops,
-        0,
+        io.commit_graph_reads, 0,
         "stale refresh must be manifest-only (no commit-graph scan)"
     );
     assert_eq!(
-        probe_count.load(Ordering::Relaxed),
-        2,
+        io.version_probes, 2,
         "stale same-branch read probes once under the read lock and once under the write lock"
     );
 }
@@ -721,55 +649,40 @@ async fn repeat_warm_read_reuses_table_handles() {
     commit_many(&mut db, 10).await;
 
     // Cold first read: opens the touched table.
-    let (p1, _m1, _c1, table1, _pr1) = probes();
-    with_query_io_probes(
-        p1,
-        db.query(
-            ReadTarget::branch("main"),
-            TEST_QUERIES,
-            "total_people",
-            &params(&[]),
-        ),
-    )
-    .await
-    .unwrap();
+    let (cold_out, cold) = measure(db.query(
+        ReadTarget::branch("main"),
+        TEST_QUERIES,
+        "total_people",
+        &params(&[]),
+    ))
+    .await;
+    cold_out.unwrap();
     assert!(
-        table1.stats().read_iops > 0,
+        cold.data_reads > 0,
         "the cold first read must open the table"
     );
 
     // Warm repeat: the held handle is reused, so no open happens through this
-    // query's table wrapper.
-    let (p2, manifest2, commit_graph2, table2, probe2) = probes();
-    with_query_io_probes(
-        p2,
-        db.query(
-            ReadTarget::branch("main"),
-            TEST_QUERIES,
-            "total_people",
-            &params(&[]),
-        ),
-    )
-    .await
-    .unwrap();
+    // query's table wrapper. A fresh `measure()` isolates the warm repeat's cost.
+    let (warm_out, warm) = measure(db.query(
+        ReadTarget::branch("main"),
+        TEST_QUERIES,
+        "total_people",
+        &params(&[]),
+    ))
+    .await;
+    warm_out.unwrap();
     assert_eq!(
-        table2.stats().read_iops,
-        0,
+        warm.data_reads, 0,
         "a warm repeat read must reuse the held handle (0 table opens)"
     );
+    assert_eq!(warm.manifest_reads, 0, "warm repeat read: 0 manifest opens");
     assert_eq!(
-        manifest2.stats().read_iops,
-        0,
-        "warm repeat read: 0 manifest opens"
-    );
-    assert_eq!(
-        commit_graph2.stats().read_iops,
-        0,
+        warm.commit_graph_reads, 0,
         "warm repeat read: 0 commit-graph opens"
     );
     assert_eq!(
-        probe2.load(Ordering::Relaxed),
-        1,
+        warm.version_probes, 1,
         "warm repeat read: exactly one version probe"
     );
 }
@@ -807,20 +720,16 @@ async fn write_invalidates_table_cache_for_changed_table() {
     .unwrap();
 
     // The next read re-opens Person at the new version (cache miss).
-    let (p, _m, _c, table, _pr) = probes();
-    with_query_io_probes(
-        p,
-        db.query(
-            ReadTarget::branch("main"),
-            TEST_QUERIES,
-            "total_people",
-            &params(&[]),
-        ),
-    )
-    .await
-    .unwrap();
+    let (out, io) = measure(db.query(
+        ReadTarget::branch("main"),
+        TEST_QUERIES,
+        "total_people",
+        &params(&[]),
+    ))
+    .await;
+    out.unwrap();
     assert!(
-        table.stats().read_iops > 0,
+        io.data_reads > 0,
         "a read after a write to the table must re-open it (version-keyed miss)"
     );
 
diff --git a/crates/omnigraph/tests/write_cost.rs b/crates/omnigraph/tests/write_cost.rs
new file mode 100644
index 0000000..5f753d7
--- /dev/null
+++ b/crates/omnigraph/tests/write_cost.rs
@@ -0,0 +1,159 @@
+//! Cost-budget tests for the WRITE path (RFC-013 step 1) — the safety/latency
+//! twin of `warm_read_cost.rs`, on the shared `helpers::cost` harness. A
+//! committing write's per-table opens and internal-table scans must be bounded
+//! and **flat across commit-history depth**, measured at the object-store
+//! boundary. Guards invariant 15 (cost bounded by work, not history) on writes.
+//!
+//! **Backend split (see docs/dev/testing.md / RFC-013).** This file runs on
+//! **local FS** and gates the **internal-table** term (`__manifest`/`_graph_commits`
+//! fragment scans, ~+18/depth — O(fragments) on any backend, step 2's target).
+//!
+//! The **data-table opener** term (step 3a's win) is a per-object-store-RPC
+//! phenomenon and is NOT gated here: local-FS latest-resolution is cheap whether
+//! the open goes through the namespace builder or direct-by-URI, so the
+//! namespace→direct switch is invisible on local. Measured: the local data-table
+//! read count grows with depth too (~+0.9/depth), but that is a *different* term —
+//! the merge-insert/RI scan reading O(depth) **fragments**, unchanged by the
+//! opener switch (depth-100 = 92 ops both before and after step 3a, same slope)
+//! and reduced only by compaction. The opener term shows up only on a real object
+//! store (per-version GETs, ~+12/depth → flat after step 3a), so it is gated in
+//! `write_cost_s3.rs` (bucket-gated). Same `measure`/`IoCounts` harness, different
+//! backend; each term gated where it actually manifests.
+#![recursion_limit = "512"]
+
+mod helpers;
+
+use helpers::cost::{
+    IoCounts, assert_flat, assert_grows, local_graph, measure_insert, measure_with_staged,
+};
+use helpers::{MUTATION_QUERIES, commit_many, mixed_params};
+
+// ── (A) The internal-table LOCK — RED today, the acceptance test for step 2 ──
+//
+// `__manifest` / `_graph_commits` scans must be O(1) in commit-history depth.
+// RED today (O(fragments), uncompacted). Un-ignore when step 2 (internal-table
+// compaction) lands — it must go green flat. (The data-table term is the S3
+// gate's, `write_cost_s3.rs`; local-FS hides it.)
+#[tokio::test]
+#[ignore = "RED until step 2 (internal-table compaction): __manifest/_graph_commits scans are O(fragments) today — RFC-013 §0/§2.2. Un-ignore there as the red→green acceptance test."]
+async fn internal_table_scans_are_flat_in_history() {
+    let dir = tempfile::tempdir().unwrap();
+    let mut db = local_graph(&dir).await;
+
+    let mut curve: Vec<(u64, IoCounts)> = Vec::new();
+    let mut current = 0u64;
+    for d in [10u64, 100] {
+        if d > current {
+            commit_many(&mut db, (d - current) as usize).await;
+            current = d;
+        }
+        let io = measure_insert(&mut db, &format!("lock_{d}")).await;
+        current += 1; // the measured write advanced depth by one
+        eprintln!(
+            "depth~{d}: data={} __manifest={} _graph_commits={}",
+            io.data_reads, io.manifest_reads, io.commit_graph_reads
+        );
+        curve.push((d, io));
+    }
+
+    assert_flat(&curve, |c| c.manifest_reads, 4, "__manifest scan");
+    assert_flat(&curve, |c| c.commit_graph_reads, 4, "_graph_commits scan");
+}
+
+// The data-table OPENER history-gate (opener flat across depth) lives in
+// `write_cost_s3.rs` — its history-dependence is an S3-only phenomenon. But the
+// *probe that isolates* the opener (the `PrefixCounter` split) is validated here,
+// every-PR, on local FS:
+
+/// Proves the `PrefixCounter` opener/scan split: a committing write's data-table
+/// reads divide into a **flat opener** term and a **growing scan** term. This pins
+/// (a) the classifier actually attributes reads to the opener bucket (non-zero, so a
+/// flat assertion isn't vacuously flat-at-zero), and (b) the local data-table growth
+/// is the merge-insert/RI fragment scan, not the opener — which is *why* the S3
+/// gate asserts `data_opener_reads`, not total `data_reads`. (On local FS the opener
+/// is O(1) regardless of step 3a; the opener's history-dependence is gated on S3.)
+#[tokio::test]
+async fn data_table_reads_split_into_flat_opener_and_growing_scan() {
+    let dir = tempfile::tempdir().unwrap();
+    let mut db = local_graph(&dir).await;
+
+    let mut curve: Vec<(u64, IoCounts)> = Vec::new();
+    let mut current = 0u64;
+    for d in [10u64, 100] {
+        if d > current {
+            commit_many(&mut db, (d - current) as usize).await;
+            current = d;
+        }
+        let io = measure_insert(&mut db, &format!("split_{d}")).await;
+        current += 1;
+        eprintln!(
+            "depth~{d}: opener={} scan={} data_total={}",
+            io.data_opener_reads, io.data_scan_reads, io.data_reads
+        );
+        curve.push((d, io));
+    }
+
+    assert!(
+        curve[0].1.data_opener_reads > 0,
+        "opener reads must be > 0 — the classifier missed version-resolution reads, \
+         so a flat opener assertion would be vacuous"
+    );
+    assert_flat(&curve, |c| c.data_opener_reads, 4, "local data-table opener");
+    assert_grows(&curve, |c| c.data_scan_reads, 20, "local data-table scan");
+}
+
+// ── (B) Green-today regression guards — run on every PR ──
+
+/// A single insert's *data-table* write cost is O(1): the table commit is a small
+/// constant number of writes, independent of history.
+#[tokio::test]
+async fn single_insert_data_write_is_bounded() {
+    let dir = tempfile::tempdir().unwrap();
+    let mut db = local_graph(&dir).await;
+    commit_many(&mut db, 5).await;
+    let io = measure_insert(&mut db, "w").await;
+    eprintln!("single insert: data_writes={}", io.data_writes);
+    assert!(io.data_writes <= 4, "data-table write_iops should be a small constant, got {}", io.data_writes);
+}
+
+/// At a fixed shallow depth, the per-write object-store read count is below a
+/// documented ceiling. Fails the moment a change *adds* a round-trip on the write
+/// path — the "no new round-trip" guard (calibrated: ~50 at depth ~5).
+#[tokio::test]
+async fn write_op_count_ceiling_at_shallow_depth() {
+    let dir = tempfile::tempdir().unwrap();
+    let mut db = local_graph(&dir).await;
+    commit_many(&mut db, 5).await;
+    let io = measure_insert(&mut db, "ceil").await;
+    eprintln!(
+        "depth~5: data={} __manifest={} _graph_commits={} total_reads={}",
+        io.data_reads, io.manifest_reads, io.commit_graph_reads, io.total_reads()
+    );
+    const CEILING: u64 = 80;
+    assert!(
+        io.total_reads() <= CEILING,
+        "per-write read ops {} exceeded ceiling {CEILING} — a new round-trip was added",
+        io.total_reads()
+    );
+}
+
+// ── (C) Fitness assert via the staged-write probes ──
+
+/// A keyed `Person` insert routes through `stage_merge_insert` exactly once, does
+/// no `stage_append`, and no inline vector-index build. Pins the structural shape.
+#[tokio::test]
+async fn keyed_insert_routes_through_merge_insert_only() {
+    let dir = tempfile::tempdir().unwrap();
+    let mut db = local_graph(&dir).await;
+    let (res, _io, staged) = measure_with_staged(db.mutate(
+        "main",
+        MUTATION_QUERIES,
+        "insert_person",
+        &mixed_params(&[("$name", "fit")], &[("$age", 30)]),
+    ))
+    .await;
+    res.unwrap();
+    assert_eq!(staged.stage_merge_insert, 1, "keyed Person insert stages one merge-insert");
+    assert_eq!(staged.stage_append, 0, "keyed insert must not stage_append");
+    assert_eq!(staged.create_vector_index, 0, "no inline vector-index build on a plain insert");
+}
diff --git a/crates/omnigraph/tests/write_cost_s3.rs b/crates/omnigraph/tests/write_cost_s3.rs
new file mode 100644
index 0000000..d8ffd4f
--- /dev/null
+++ b/crates/omnigraph/tests/write_cost_s3.rs
@@ -0,0 +1,71 @@
+//! S3 (object-store) cost-budget gate for the WRITE path — the bucket-gated twin of
+//! `write_cost.rs` that proves RFC-013 **step 3a's data-table opener win**. On the
+//! shared `helpers::cost` harness (`measure`/`IoCounts`/`assert_flat`/`s3_graph`).
+//!
+//! The opener term is an **object-store-RPC phenomenon**: latest-version resolution
+//! costs per-version GETs/HEADs on S3 (O(depth) before step 3a, when writes routed
+//! through the lance-namespace builder), which local FS cannot reproduce (one cheap
+//! `read_dir` regardless). After step 3a (direct-by-URI opens), the per-write
+//! **data-table read count is FLAT across commit-history depth** — the measured 70%
+//! win. This file is the red→green acceptance for that term (it would be RED on the
+//! pre-3a `from_namespace` opener); `write_cost.rs` gates the internal-table term on
+//! local every-PR.
+//!
+//! **Isolating the opener (important):** total `data_reads` is not opener-only — the
+//! same wrapped `Dataset` backs the merge-insert/RI **scan**, which reads
+//! O(fragment-count) and grows with history for a *different* reason (compaction's
+//! domain, not the opener; this is the term that made the *local* data-table count
+//! grow). The shared harness's `PrefixCounter` attributes each read by object-key
+//! prefix, so this gate asserts `data_opener_reads` (reads of `_versions/`/`.manifest`)
+//! **directly** — no compaction or fixture massaging needed. After step 3a the opener
+//! is O(1) regardless of version-history depth; before it grew ~+12/depth (RFC §2.4
+//! [M]). (See `write_cost.rs` for the local test that proves the split itself —
+//! opener flat, scan growing.)
+//!
+//! Skips gracefully without `OMNIGRAPH_S3_TEST_BUCKET` (the `tests/s3_storage.rs`
+//! pattern); runs for real in the rustfs CI job (`.github/workflows/ci.yml`).
+#![recursion_limit = "512"]
+
+mod helpers;
+
+use helpers::cost::{IoCounts, assert_flat, measure_insert, s3_graph};
+use helpers::commit_many;
+
+/// After step 3a the data-table opener term is flat across depth on a real object
+/// store (the measured win). RED on the pre-3a namespace-builder opener (O(depth)
+/// per-version resolution).
+#[tokio::test]
+async fn data_table_opener_is_flat_in_history_on_s3() {
+    let Some(mut db) = s3_graph("write-cost-opener").await else {
+        eprintln!(
+            "SKIP data_table_opener_is_flat_in_history_on_s3: OMNIGRAPH_S3_TEST_BUCKET \
+             unset (or store unreachable) — the S3 opener gate needs an object store"
+        );
+        return;
+    };
+
+    let mut curve: Vec<(u64, IoCounts)> = Vec::new();
+    let mut current = 0u64;
+    for d in [10u64, 50] {
+        if d > current {
+            commit_many(&mut db, (d - current) as usize).await;
+            current = d;
+        }
+        let io = measure_insert(&mut db, &format!("s3_{d}")).await;
+        current += 1;
+        eprintln!(
+            "depth~{d}: opener={} scan={} data_total={} __manifest={} _graph_commits={}",
+            io.data_opener_reads,
+            io.data_scan_reads,
+            io.data_reads,
+            io.manifest_reads,
+            io.commit_graph_reads
+        );
+        curve.push((d, io));
+    }
+
+    // The opener (latest-version resolution) is O(1) after step 3a (direct-by-URI),
+    // isolated from the scan by the PrefixCounter. Slack absorbs object-store variance;
+    // the pre-3a builder grew this ~+12/depth (RFC §2.4 [M]).
+    assert_flat(&curve, |c| c.data_opener_reads, 8, "S3 data-table opener");
+}
diff --git a/docs/dev/index.md b/docs/dev/index.md
index 91f108b..9fe743f 100644
--- a/docs/dev/index.md
+++ b/docs/dev/index.md
@@ -91,6 +91,7 @@ Working documents for in-flight feature work. Removed when the work lands.
 | Restructure the CLI around explicit planes — one graph-addressing model, declared capability surface, plane-grouped help (expands RFC-009 Phase 4) | [rfc-010-cli-planes-restructure.md](rfc-010-cli-planes-restructure.md) |
 | CLI refactoring — one addressing & config model post-`omnigraph.yaml`: scope + `--graph` + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) | [rfc-011-cli-refactoring.md](rfc-011-cli-refactoring.md) |
 | Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
+| Write-path latency — capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) |
 
 ## Boundary
 
diff --git a/docs/dev/rfc-013-write-path-latency.md b/docs/dev/rfc-013-write-path-latency.md
new file mode 100644
index 0000000..fa4abf3
--- /dev/null
+++ b/docs/dev/rfc-013-write-path-latency.md
@@ -0,0 +1,1203 @@
+# RFC-013: Write-path latency — capture-once `WriteTxn`, manifest-authoritative publish, bounded history, and a measured cost contract
+
+**Status:** Proposed
+**Author(s):** write-path latency investigation (handoff + multi-agent validation)
+**Date:** 2026-06-19
+**Audience:** engine / storage maintainers
+**Builds on:**
+[rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) (`GraphClient` — embedded ≡ remote),
+the query-latency work (PR #268, read-path warm-up — the read-side twin of this change),
+the iss-991 handoff (manifest-authoritative graph lineage / Phase 7),
+[writes.md](writes.md), [execution.md](execution.md), [invariants.md](invariants.md).
+**Tracking (dev graph `modernrelay`):** primary `iss-write-s3-roundtrip-amplification`; depth term `iss-991`; substrate seam `iss-863`/`iss-864`; branch-create `iss-691`; recovery `iss-856`/`iss-recovery-sweep-live-writer-rollback`/`iss-merge-recovery-partial-rollforward`; MemWAL `iss-681`; read twin `gap-read-path-rederivation`.
+
+> Status maintained by maintainers: `Proposed` while open, `Accepted` on merge.
+
+---
+
+## Summary
+
+On object-store-backed clusters a single trivial write (one edge, one branch op)
+issues **hundreds of mostly-sequential object-store round-trips**, and that count
+**grows without bound with the graph's commit history**, so a long-lived graph
+degrades to minutes per edge. The cost is invisible on a local filesystem
+(µs/call) and to correctness tests (results are right, just slow), and it was
+never measured because nothing in the suite counts *object-store round-trips per
+logical operation*.
+
+This RFC specifies the optimal write path from first principles — **a write is a
+pure function of one version-pinned snapshot, published in a single
+manifest-atomic CAS** — and the **cost contract that makes its O(1)-in-history
+guarantee provable and non-regressable** (deterministic IO-counted tests on every
+PR). It collapses four hand-rolled writers into one `GraphPublishAuthority`,
+moves graph lineage into the manifest (so the per-write `_graph_commits` scan
+disappears), brings the internal metadata tables into compaction (so the
+per-write `__manifest` scan stops growing), takes recovery off the hot path, and
+adds an epoch fence for multi-writer safety. None of it is a substrate rewrite —
+the manifest-CAS model is already correct and is exactly what Lance native
+multi-table transactions (lance#7264) will later formalize; this RFC builds the
+seam to that future and pays down the write path onto it.
+
+**The dominant fix is demonstrated, not proposed:** a one-line opener-bypass
+prototype (open writes direct-by-URI instead of through the lance-namespace builder)
+flattens the depth-dominant term `31 + 12·depth → flat 4` and cuts a depth-80 edge
+**2.7×** (1618 → 593 ops), measured end-to-end and functionally correct on
+main/branch/node paths (§2.4). It is shippable as a standalone PR first (§9 step
+3a); the rest of the RFC is the constant-factor + correctness + internal-residual
+work layered on the same seam.
+
+---
+
+## 0. Validation ledger (read this first)
+
+Every claim is tagged: **[M]** measured by me this cycle, **[S]** verified in
+v0.7.0 source (`file:line` given), **[U]** verified against upstream
+Lance/LanceDB/SlateDB source or docs, **[G]** tracked in the dev graph (slug
+given), **[I]** inferred/reasoned.
+
+A correction from the originating handoff: it hypothesized that **Cloudflare R2
+walks the full manifest listing on every open** (a prod-only amplifier absent
+from AWS). **This is false for the pinned Lance 7.0.0 [U].** R2 is treated as
+lexically ordered (`list_is_lexically_ordered = !is_s3_express`,
+`lance-io/.../providers/aws.rs:183`), so R2 gets the O(1) head-only manifest fast
+path, same as AWS; only S3-Express buckets are excluded, and even those are O(1)
+via the v7 `latest_version_hint.json`. There is no R2-list config fix because
+there is no R2-list problem.
+
+**The depth term — corrected attribution.** Two measurements, one
+instrumentation-blind, one complete:
+
+*(a) IOTracker probe [M] — internal tables only.* A throwaway probe (the
+`warm_read_cost` harness applied to a single insert to `main`, swept across
+commit depth) counted the two internal tables: `__manifest` ≈ 14 + 2·depth,
+`_graph_commits` ≈ 9 + 2·depth → ≈ 23 + 4·depth, `write_iops = 1`. **But this
+probe is structurally blind to the write path's per-table *data* opens** — they
+bypass the instrumented opener (`table_wrapper`), so it reports `probes=0` for the
+data tables. It measured the *minority* of the cost.
+
+*(b) Network-proxy measurement [M] — all RPCs, fresh graph.* A counting proxy in
+front of `rustfs` (sees every object-store RPC, under `--mode merge` — the
+production path), on a brand-new graph (400 seed nodes, one committing merge per
+checkpoint), classified by S3 key:
+
+| commit depth | data `_versions` | `__manifest` | `_graph_commits` | node (RI) | schema | TOTAL | `write_iops` |
+|---:|---:|---:|---:|---:|---:|---:|---:|
+| 0 | 31 | 29 | 13 | 6 | 46 | 156 | 1 |
+| 5 | 121 | 44 | 23 | 6 | 46 | 268 | 1 |
+| 10 | 181 | 59 | 33 | 6 | 46 | 358 | 1 |
+| 20 | 301 | 89 | 53 | 6 | 46 | 538 | 1 |
+| 40 | 541 | 149 | 93 | 6 | 46 | 898 | 1 |
+| 80 | 1021 | 269 | 173 | 6 | 46 | 1618 | 1 |
+
+Slopes: **data table +12/depth (~67%)**, `__manifest` +3/depth, `_graph_commits`
++2/depth → **TOTAL ≈ 156 + 18·depth**, `write_iops` flat at 1. The IOTracker probe
+(a) saw only the +4/depth internal subset — blind to the data-table opens, the
+dominant ~67%.
+
+**Constant-factor finding [M]: the schema contract is a flat 46 reads/write** — not
+depth-scaling, but **29% of the depth-0 cost (46/156)**, from
+`validate_schema_contract` re-running uncached on every resolve (`omnigraph.rs:561`).
+A depth-slope gate will *not* catch it; WriteTxn's resolve/validate-once kills it,
+and the §5.1 fitness assert (`validate_schema_contract_calls == 1`) is what pins it
+(constant-factor delta, §6).
+
+The dominant term is **the written table's open routed through the lance-namespace
+builder ~13× per write** — now source-traced. The **write** path opens via
+`DatasetBuilder::from_namespace` (`namespace.rs:174`, from `open_table_head_for_write`
+`table_store.rs:181` / `namespace.rs:544`). Lance's builder calls the namespace's
+`describe_table` once and uses only `response.location` (`lance` `builder.rs:130-178`)
+— but omnigraph's `describe_table` **opens the whole dataset** just to produce that
+location (`open_head` → `Dataset::open`, `namespace.rs:362`/`:112`), and `.load()`
+then **resolves the latest version again** — a **double latest-resolution per
+open**, ~13× per write, nothing cached. Crucially, latest-resolution is **not
+inherently O(depth)**: the namespace path is O(depth) because it **misses the V2
+lexical / `latest_version_hint.json` fast path** that the direct opener engages
+(most likely because `load_table_from_namespace` attaches no shared `Session`/store
+params, `namespace.rs:174` — inferred, not traced). The **read** path skips all of
+it — `from_uri(location).with_version(N)`, one HEAD, O(1) — which is why reads are
+flat (+12/depth on the data table, §0(b)). **Proven on omnigraph's own table [M]:**
+a direct `Dataset::open` of the *same physical* 85-version edge table = **2 ops
+(O(1))**, the `from_namespace` open of that identical table = the O(depth) sweep —
+same bytes, two open paths. `checkout_version` is also O(1) — **exonerated**, not a
+back-walk. So `from_uri().with_version(N)` is the O(1) primitive and step 3 makes
+each open O(1) *intrinsically* (cleanup then becomes hygiene/interim, not
+load-bearing for read cost — §2.3). **Mode-independent [U]:** `append` ≡ `merge` ≡ +12/depth, so §0(a)
+measuring a single insert was *not* the defect — the defect is the namespace open
+path, not the verb. **Using `from_namespace` per-open is a misuse of Lance's
+design** (the namespace is a catalog/discovery layer — resolve once, then open the
+dataset directly, `lance-namespace` `operations/index.md` **[U]**); the read path
+already bypasses it (PR #268 Fix 2 — see §2.4).
+
+**Corrected conclusion.** The depth blow-up is in omnigraph's DB layer and is
+**data-table-dominated**: the redundant per-table opens (fixed by §9 step 3 —
+WriteTxn open-once-by-pinned-version — plus scheduled *version cleanup* of the
+node/edge tables) are ~70% of it; the uncompacted internal tables (§9 step 2) are
+the secondary ~30%. Both the originating R2 hypothesis and the earlier "entirely
+the internal tables" framing are wrong. The exact Lance call doing the data-table
+chain re-read (`checkout_version` back-walk vs merge-insert conflict replay) is the
+one unpinned item — see §12. Reads, by contrast, are flat in depth
+(`warm_read_cost.rs`, PR #268). This is the O(history)-per-write →
+O(N²)-cumulative behavior the production incident hit.
+
+---
+
+## 1. Problem & measurements
+
+On object storage every call is a 10–100 ms RPC, there is no cheap stat, and
+sequential RPCs serialize. A long-lived production graph on R2, originating handoff
+**[M]**:
+
+| operation | prod (R2) | local `file://` |
+|---|---|---|
+| one-edge `load --mode merge` → main | ~3 min (90 s workflow timeout) | <1 s |
+| `branch create --from main` | 120 s | <1 s |
+| one-row `load` → a branch | 204 s | <1 s |
+| `branch delete` | 216 s | <1 s |
+| warm read / `/healthz` | fast (0.2–2 s) | fast |
+
+`iss-write-s3-roundtrip-amplification` **[G]** independently records the same:
+cross-region single insert ~46 s, 5-node mutation ~110 s, vs ~390 ms for a
+no-storage `/healthz`. Its acceptance criteria are this RFC's goal: *"a single
+insert issues O(1)-to-few S3 round-trips, not O(number of tables); bulk mutations
+amortize the manifest commit."*
+
+The cost decomposes into terms; the dominant one scales with history (§0):
+
+1. **Per-table opens through the O(depth) lance-namespace builder (DOMINANT,
+   O(tables × depth)).** Each stage opens via `DatasetBuilder::from_namespace`
+   (`namespace.rs:174`); its `describe_table` opens the whole dataset just to return
+   a location (`open_head` → `Dataset::open`, `namespace.rs:362`/`:112`) and
+   `.load()` resolves latest **again** — a double latest-resolution per open,
+   O(depth) on the repro store, ~13× per write with nothing caching it **[S]**
+   (§2.2). The read path's direct `from_uri().with_version(N)` is O(1). →
+   **+12 reads/depth, ~70% of the slope [M]**. Fixed by opening once, by pinned
+   version via the direct opener (§9 step 3); node/edge version *cleanup* bounds it
+   further.
+2. **Per-write `__manifest` scan (O(history), secondary).** Every publish
+   full-scans the uncompacted `__manifest` (`load_publish_state` →
+   `read_manifest_scan`, `state.rs:133-141`) **[S]**; the internal tables are
+   never compacted/cleaned (`optimize` iterates node/edge only,
+   `optimize.rs:895-904`) **[S]**. +3.1 reads/depth **[M]**.
+3. **Per-write `_graph_commits` refresh (O(history), secondary).**
+   `record_graph_commit` reloads the entire commit cache before each append
+   (`commit_graph.rs:136-164`) **[S]**; never compacted/cleaned. +2.1 reads/depth
+   **[M]**. The "read-path anti-pattern, now on writes" (`iss-991` handoff **[G]**).
+
+Terms 2+3 are the secondary ~30%; term 1 dominates. Plus per-write fixed taxes: a `list_dir("__recovery/")` (`loader/mod.rs:197`,
+`exec/mutation.rs:725`, `exec/merge.rs:1090`) **[S]**, and the publisher CAS
+retry budget (`PUBLISHER_RETRY_BUDGET = 5`, `publisher.rs:51`) **[S]**.
+
+Branch ops compound it: `branch create` is a per-table sequential fork loop
+(`fork_branch_from_state`, `table_store.rs:282`); `branch delete` opens a
+snapshot per *other* branch (`ensure_branch_delete_safe`, `omnigraph.rs:1317`)
+and force-deletes per forked table sequentially (`cleanup_deleted_branch_tables`,
+`omnigraph.rs:1359`) **[S]**.
+
+---
+
+## 2. Root cause (validated)
+
+### 2.1 The write re-derives its world from storage every stage
+
+`loader/mod.rs:400` captures a `snapshot` once, but downstream stages **ignore
+it** and re-resolve **[S]**:
+
+- `open_for_mutation_on_branch` (`table_ops.rs:505`) re-calls
+  `resolved_branch_target` **per table** (`:512`), which runs
+  `ensure_schema_state_valid` (a full schema-contract storage read with no cache,
+  `omnigraph.rs:561-568`) and then opens **by head** via
+  `open_dataset_head_for_write` (`:522`/`:559`), asserting head == pinned only
+  *after* the open.
+- `fresh_snapshot_for_branch` (`omnigraph.rs:771`) always does fresh I/O; the
+  fork authority path re-reads the live manifest (`table_ops.rs:574`).
+- The captured snapshot is used only for membership/fork checks, never for the
+  actual opens.
+
+The drift guards, CAS retries, and recovery scans are **compensating machinery**
+for the staleness this self-inflicts. The `Snapshot`/coordinator primitive
+already exists; it is treated as cheap-to-reacquire rather than as the
+operation's authoritative identity.
+
+### 2.2 The depth terms — data-table re-reads dominate, internal tables secondary
+
+Confirmed in code and measurement (§0). The **dominant** term is §2.1's per-table
+opens: ~13 opens per write through the lance-namespace builder
+(`DatasetBuilder::from_namespace`, `namespace.rs:174`). The builder calls the
+namespace's `describe_table` (`lance` `builder.rs:130-178`), and omnigraph's
+`describe_table` opens the whole dataset just to return a location (`open_head` →
+`Dataset::open`, `namespace.rs:362`/`:112`); `.load()` then resolves latest again —
+a **double latest-resolution per open**, O(depth) on the repro store — so cost
+grows with the table's version count (+12 reads/depth, ~70%). The **read** path
+opens direct `from_uri().with_version(N)` (`namespace.rs:112` / `SubTableEntry::open`)
+— O(1) — and native pylance is flat 6 ops at any depth **[U]**, so this is
+omnigraph's *namespace-open* pattern, not Lance; `checkout_version` is O(1) and not
+implicated. (The heavier `list_table_versions` — `versions()` + a checkout per
+version, `namespace.rs:395-427` — is **not** on this path; it is test-only today, a
+separate latent O(depth): §10 follow-up.) The **secondary** terms are the two
+internal tables: `load_publish_state` and
+`commit_graph.refresh` each full-scan a table that gains a fragment per write and
+is never compacted (+5 reads/depth, ~30%). This is the `gap-read-path-rederivation`
+**[G]** failure mode — "cost grows with fragment count" — on the *write* path,
+where PR #268 never reached. `invariants.md` documents the internal-table half:
+*"the internal metadata tables (`__manifest`, `_graph_commits`) are still not
+compacted, so the probe and refresh cost still grows with fragment count."*
+
+### 2.3 The `skip_auto_cleanup` interaction — and compaction ≠ cleanup
+
+v0.7.0 sets `skip_auto_cleanup: true` deliberately (`table_store.rs` 10 sites +
+`publisher.rs:392`) **[S]** — load-bearing, because Lance 7's on-by-default
+`auto_cleanup` would GC `__manifest`-pinned snapshot versions (`lance.md` audit)
+**[U]**. Two distinct levers were turned off and must be replaced *separately*:
+**compaction** (`compact_files`) rewrites small fragments into fewer larger ones
+but does **not** prune old versions; **cleanup** (`cleanup_old_versions`) prunes
+old versions. Measured on a ~85-version graph **[M]**: `optimize`/compaction
+*added a version* (data-table reads 1035 → 1083, frags 81→1 — **no help** against
+the depth term); `cleanup --keep 3` dropped it 1035 → 63 (89 versions pruned across
+7 tables, **16×**). So only *cleanup* bounds the version-chain length. Note today's
+`cleanup`/`optimize` cover **node/edge tables only** (the "7 tables"; internal
+`__manifest`/`_graph_commits` are excluded, `optimize.rs:895-904` **[M]**) — so
+bounding the internal +5/depth residual needs them **added** to the key set (§9 step
+2's code change). Operationally: `cleanup` aborts on a remote store without `--yes`
+(the
+scheduled job must pass it). Relation to step 3: while the namespace open is still
+on the write path, cleanup **caps** the dominant term — a real interim mitigation;
+once step 3 opens direct-by-version (O(1) regardless of version count, §2.4),
+cleanup is **storage hygiene + internal-table sprawl**, not load-bearing for read
+cost. The correct replacement is *scheduled* compaction **and** version cleanup
+(§9 step 2), **not** re-enabling `auto_cleanup`. Without it, version history (and
+per-write cost) grows forever.
+
+### 2.4 Lance namespace: proper use (why the fix is bypass, not patch)
+
+The upstream Lance Namespace is a **catalog / discovery layer** — "table
+discovery, resolving table locations, and coordinating commits" — whose intended
+division of labor is *"the namespace provides basic information about the table,
+[then] the Lance SDK … fulfill[s] the other operations"* (`lance-namespace`
+`namespace/index.md`, `operations/index.md`) **[U]**. It is meant to be consulted
+to *resolve a table once*, after which you operate on the `Dataset` directly — **not
+consulted on every per-table open on a hot path.** `DatasetBuilder::from_namespace`
+itself reflects this: it calls `describe_table` only to extract `location`, then
+reduces to a `from_uri` builder (`lance` `builder.rs:130-178`). For a system that
+*already holds* each table's location + version (omnigraph's `__manifest` does, via
+`SubTableEntry`), routing per-open resolution back through the namespace is the
+anti-pattern — and it aligns with this project's invariant 1 ("resolve latest state
+through the substrate's cheap primitive instead of re-scanning") and the deny-list
+"cold re-derivation on the hot path."
+
+So the fix is **bypass, not patch**: open writes by URI + pinned version
+(`from_uri(location).with_version(N)`) — exactly what the **read** path already does
+(PR #268 Fix 2; the read path's own comment notes the namespace open "would
+full-scan `__manifest` twice per open (`describe_table` + `describe_table_version`)"),
+so this completes #268's open-by-location migration on the write side (§9 step 3).
+The **custom namespace impl stays** — it is still the right home for legitimate
+*catalog* operations (`describe_table` / `table_exists` / `list_table_versions` /
+`create_table_version` / managed-versioning commit coordination); only the
+per-open *resolution* leaves it. Two Lance facts make this safe and final: opening
+by explicit version is `default_resolve_version` = a single HEAD, O(1) (`lance`
+`commit.rs:939-981`), and Lance's own latest-resolution cost work (version-hint, PR
+#6752) confirms the latest path is the expensive one to avoid. **Proven on
+omnigraph's own table [M]:** a direct `Dataset::open` of the *same physical*
+85-version edge table is 2 ops (O(1)), while the `from_namespace` open of that
+identical table is the O(depth) sweep — so latest-resolution is not inherently
+O(depth); the namespace path is O(depth) only because it misses the fast path the
+direct opener engages (likely the un-threaded `Session`). Step 3 therefore makes
+each write open O(1) on its own — so node/edge `cleanup` (§2.3) is an **interim
+mitigation + storage hygiene**, not load-bearing for read cost once step 3 ships.
+
+**End-to-end proof [M] — the one-line opener bypass, measured.** A prototype
+patched `open_dataset_head_for_write` (`table_store.rs:174`) to open directly by URI
+(bypassing `from_namespace` — exactly step 3 / Alternative B), rebuilt v0.7.0, and
+re-ran the depth sweep on a fresh graph:
+
+| depth | data `edgeVER` baseline | data patched | TOTAL baseline | TOTAL patched |
+|---:|---:|---:|---:|---:|
+| 0 | 31 | **4** | 156 | 121 |
+| 10 | 181 | **4** | 358 | 173 |
+| 20 | 301 | **4** | 538 | 233 |
+| 40 | 541 | **4** | 898 | 353 |
+| 80 | 1021 | **4** | 1618 | **593** |
+
+The dominant data-table term collapses `31 + 12·depth → flat 4` (O(1) in history),
+the total slope drops `+18/depth → +5/depth` (the residual +5 is exactly the two
+internal tables — step 2's scope), and at depth 80 a single edge drops **1618 → 593
+ops (2.7×)** from this one change alone, before step 2 / Phase 7. Functional
+correctness verified on the hot paths: main edge merge, branch create + write +
+read-back, node merge (managed-versioning still correct) — the direct opener already
+handles `checkout_branch` for non-main, so the namespace layer was not load-bearing
+for write correctness on these paths. **Caveat:** the prototype did **not** exercise
+schema-apply, branch merge, fork-on-first-write to a new table on a branch, overwrite
+mode, or concurrent writers — a production step 3 must pass the full
+`merge_truth_table`/recovery/failpoint suite (the namespace may do
+managed-versioning work that matters there). It proves the thesis + hot-path
+correctness, not drop-in completeness.
+
+**Step 2 also proven [M].** On the step-3-patched binary at depth ~87, compacting
+the internal tables to 1 fragment each (content-preserving) collapsed their scans:
+`__manifest` 285 → 32 (8.9×), `_graph_commits` 177 → 11 (16×); the step-3 data term
+stayed flat at 4. So **both depth terms are now empirically eliminated** — a depth-87
+single edge drops **~1720 → 198 ops (~8.7×; ≈258 s → ≈30 s at 150 ms/RTT)** with
+both fixes. The internal term is **fragment-scan growth** (`read_manifest_scan` /
+`commit_graph.refresh` read all fragments of the *latest* version), so the fix is
+**compaction** (merge fragments) — distinct from the data table's version-chain term
+that step 3 / version-cleanup handle. `optimize`'s `all_table_keys`
+(`optimize.rs:895-904`) excludes the internal tables, so step 2 is a real code
+change, not just scheduling.
+
+---
+
+## 3. First principles
+
+On object storage the only objective function is **minimize the number of
+*sequential* round-trips per logical operation, and make that number invariant to
+graph age, history depth, and table count** — under the hard floor of SI,
+durability, atomicity, and loud integrity. Three generating principles fall out,
+each mapped to a validated failure:
+
+1. **Pin once, derive the rest (MVCC / invariant 15).** A write is a pure
+   function of one immutable, fully-pinned snapshot
+   `{branch, manifest_version, per-table (location, version, e_tag), schema_hash,
+   writer_epoch}`, resolved exactly once; every stage reads only from it
+   (open-by-pinned-version, O(1), cacheable); the only contact with "current" is
+   the final CAS. → fixes §2.1.
+2. **One source of truth, one commit (invariant 2).** Visibility + lineage +
+   version bumps are **one atomic manifest write**; the commit graph, indexes,
+   and topology are *projections* of the manifest, never second authorities to
+   keep in sync. → fixes the §2.2 `_graph_commits` term (iss-991 Phase 7).
+3. **The plan is the contract (correct-by-construction recovery).** The writer
+   serializes its *complete* publish intent **before any HEAD moves**; the live
+   commit and crash-recovery execute the *identical* plan, so they cannot
+   diverge. → fixes the partial-publish bug class structurally
+   (`iss-merge-recovery-partial-rollforward`, PR #277).
+
+The optimal single-edge write under these: **~2–3 sequential hops, O(1) in size**
+— 1 warm probe (0 if the coordinator is unchanged), 1 parallel stage of fragment
+writes, 1 manifest CAS — regardless of 5 tables or 500, 10 commits or 10M.
+Lance's own `test_commit_iops` (read 1 / write 2 / stages 3) **[U]** proves the
+per-table primitive already hits this; the job is to make the *graph* write
+inherit it.
+
+This is not speculative: it is exactly what the two reference object-storage
+databases do. **LanceDB** threads a pinned `Arc<Dataset>` + shared `Session` and
+commits with one CAS off a captured `read_version`, never re-resolving "latest"
+under default consistency **[U]**. **SlateDB** captures a snapshot, treats a
+monotonic-ID manifest (no pointer file) as the *sole* authority, commits with one
+conditional-PUT, recovers on open (never per-write), fences with a monotonic
+`writer_epoch`, and compacts on a schedule **[U]**.
+
+---
+
+## 4. Reference-level design
+
+### 4.1 The interface — one publish authority, one declarative plan
+
+The deepest structural flaw is **four hand-rolled writers** (`load_as`,
+`mutate_as`, `apply_schema_as`, `branch_merge_as`), each re-implementing open →
+stage → commit → sidecar → lineage. There is **one publish machine**; the verbs
+are different declarative plans fed to it.
+
+```rust
+// The pinned, immutable operation identity — resolved ONCE.
+struct WriteTxn {
+    branch: BranchRef,
+    base: PinnedSnapshot,   // {manifest_version, per-table (loc,version,e_tag), schema_hash, writer_epoch}
+    session: Arc<Session>,  // shared per-graph; warms metadata/index caches across opens
+    handles: HandleCache,   // open-by-version; each table opened once, reused across stages
+}
+
+// A typed, declarative publish plan — the COMPLETE "what", built before any HEAD moves.
+enum TableAction {
+    Append(Stream), Upsert(Batch), Overwrite(Image), Delete(Pred),
+    Fork { from_version: u64 }, Register(Schema), Tombstone,
+}
+struct PublishPlan {
+    base: PinnedSnapshot,
+    actions: Map<TableKey, Vec<TableAction>>,
+    lineage: GraphCommitIntent,   // parent = base.head; rides the SAME manifest CAS (Phase 7)
+    expected: Expectations,       // per-table versions + graph_head + writer_epoch
+}
+
+impl GraphPublishAuthority {
+    async fn open_txn(&self, branch: BranchRef) -> WriteTxn;                          // 1 warm probe
+    async fn publish(&self, txn: &WriteTxn, plan: PublishPlan) -> PublishedSnapshot;  // stage∥ → 1 CAS
+}
+```
+
+Properties that make it optimal:
+
+- **Stages take `&WriteTxn`/`&PublishPlan`, never storage** — re-resolution and
+  open-latest are *unrepresentable*. Invariants 2/3/15 hold by construction.
+- **The recovery sidecar *is* the serialized `PublishPlan`.** Phase C and
+  recovery both call `plan.apply()` — a merge that bumps tables A+B can never
+  roll A forward and silently drop B. The
+  `iss-merge-recovery-partial-rollforward` bug class is gone by design.
+- **One CAS.** `publish` issues exactly one conditional `__manifest`
+  merge-insert carrying every touched-table version + the `graph_commit` /
+  `graph_head` lineage rows + the `writer_epoch` check.
+- **Verbs are thin lowerings.** `load`/`mutate`/`schema apply`/`branch merge`
+  each build a `PublishPlan` and call `publish`. Four copies → one machine; the
+  public `load_as`/`mutate_as` API is unchanged (it lowers internally).
+
+The cost contract becomes part of `publish`'s documented API:
+
+> `publish(txn, plan)` costs `opens ≤ |plan.touched_tables|` (0 warm),
+> `stages ≤ 3`, `manifest_ops = O(1)` — **invariant to history depth and table
+> count.**
+
+### 4.2 Supporting mechanics (each validated this cycle)
+
+| Mechanic | Design | Validation |
+|---|---|---|
+| Open by pinned version | `from_uri(location).with_version(N)` + shared `Session` + warm handle cache — the O(1) opener *reads* already use (`instrumentation::open_table_dataset:112`, `SubTableEntry::open` `db/manifest.rs:200`). **NOT** the write path's `from_namespace` builder (`namespace.rs:174`), whose `describe_table` + `.load()` do an O(depth) double latest-resolution (§2.2 — the dominant cost), and **NOT** `open_dataset_at_state` (opens head then checks out, `table_store.rs:232`, not O(1)). | #0 **[S]** |
+| Strict-op SI | Update/Delete/SchemaRewrite open by pinned version (consistent read base) and the publish CAS rejects a *same-table* advance. Insert/Merge rely on Lance's natural rebase. **Do not remove the open guards wholesale** — that is a silent lost-update. | #5 **[S]** |
+| Fork × pinned-version | Fork already opens source at the pinned version and creates the target from it; the live-manifest authority re-read before fork stays (not defeated by the pin). | #6 **[S]** |
+| Open-once via the direct opener (**THE dominant depth fix**) | Reuse is **intra-transaction** (open each table once, by pinned version, thread it — kills the ~13 namespace-builder opens, the O(depth) double latest-resolution / +12/depth term, §0/§2.2). A commit invalidates its own entry, so no cross-write warm cache. Thread the shared per-graph `Session` through write opens (it is *not* today — `load_table_from_namespace` attaches no session, `namespace.rs:174`). | #9 **[S]** |
+| Lineage in the manifest (Phase 7) | Publish `graph_commit` + mutable `graph_head:<branch>` rows in the same `__manifest` merge-insert with a branch-head CAS; `_graph_commits` becomes a projection. Removes the per-write `commit_graph.refresh` and closes the "manifest→commit-graph atomicity" + "commit-graph parent under concurrency" gaps. | `iss-991` **[G]**, **[S]** |
+| Bounded history (compaction **and** cleanup) | Bring the internal table(s) into the `optimize` loop AND schedule version *cleanup* of node/edge tables — compaction rewrites fragments, only cleanup prunes the version chain that §2.2's dominant term re-reads (§2.3). No blob/PK/CAS blocker (`__manifest` has no blob column, `state.rs:44-72`; the unenforced PK is orthogonal to a content-preserving Rewrite). Post-Phase-7 there is only **one** internal table to compact. | #8 **[S]** |
+| Recovery off the hot path | Move the per-write `list_dir("__recovery/")` to coordinator-open + the CAS-conflict path, guarded by a sidecar-age grace window (the sidecar carries `created_at` micros + a ULID, `recovery.rs:762`/`:1522`). | #4, `iss-856`/`iss-recovery-sweep-live-writer-rollback` **[G][S]** |
+| Epoch fence | Monotonic `writer_epoch` in `__manifest`, CAS-claimed at writer init, checked on every publish. Fences a whole zombie *writer* deterministically (no TTL); closes the multi-process exposure and the Lance-MTT TTL-lease gap. | SlateDB `FenceableTransactionalObject` **[U]** |
+| Branch create | Lance `Clone` instead of the per-table fork loop (O(tables)→O(1) sequential). | `iss-691` **[G]** |
+| Branch delete | Run the per-other-branch safety check and the per-table reclaim loops concurrently (`buffer_unordered`); read branch sets from in-memory coordinator state. | **[S]** |
+
+---
+
+## 5. The cost contract — measurement & enforcement
+
+The bug class is invisible to correctness tests, to local-FS tests, and to
+wall-clock benches. You can only prevent a regression in a quantity you **define
+precisely, measure deterministically, and bound on every PR.** The quantity is
+*sequential object-store round-trips per logical operation, as a function of
+history depth and table count.* OmniGraph already has the correct pattern for
+**reads** (`warm_read_cost.rs`, `IOTracker`, swept to depth 20); this RFC extends
+it across the write/branch/open surface. This is exactly how Lance and SlateDB
+enforce it **[U]**.
+
+### 5.1 Tier 1 — deterministic IO-counted gate (every PR)
+
+Ordinary `cargo test`, hermetic (in-memory / tempdir + `IOTracker`), no S3, no
+wall-clock. Two shapes:
+
+```rust
+// (A) cost-invariant-to-HISTORY — the load-bearing gate. Gate the MERGE verb (the prod path).
+for depth in [10, 100, 1000] {              // REAL commit history, not row count
+    build_history(depth);
+    reset_counters();
+    let s = measured_merge();               // --mode merge, the read-modify production path
+    // PRIMARY — the dominant term (§0): the written table's data opens/reads, flat in depth.
+    assert!(s.data_table_opens          <= touched_tables); // open each table ONCE, by pinned version
+    assert!(s.data_table_reads_per_open <= K_OPEN);         // each open O(1) in the table's version count
+    // SECONDARY — internal-table scans flat in depth (compaction + cleanup).
+    assert!(s.manifest_ops  <= K_MANIFEST); // small CONSTANT, NOT a function of depth
+    assert!(s.lineage_ops   <= K_LINEAGE);
+    assert!(s.stages        <= 3);          // bounded sequential hops
+}
+assert_flat_across_depths();                // ALL terms — esp. data-table opens — flat in N
+
+// (B) fitness functions — architectural invariants AS tests
+assert_eq!(validate_schema_contract_calls(write), 1);   // resolve-once
+assert_eq!(coordinator_resolutions(write), 1);          // O(1) resolution
+assert_eq!(recovery_listdir_calls(steady_state_write), 0);
+```
+
+**Prerequisite, not a follow-up: route ALL opens (read + write) through the one
+instrumented opener BEFORE the gate is meaningful.** Today the write path's data
+opens bypass `table_wrapper` (the §0(a) blind spot), so a gate that asserts only
+`manifest_ops`/`lineage_ops` would **pass a still-broken build** — one that
+compacts the internal tables (§9 step 2) but keeps the dominant ~13× namespace-open
+sweep (§2.2). The gate MUST count data-table opens/reads (the dominant term), which
+requires the routing change first. The data term is **mode-independent** (append ≡
+merge ≡ +12/depth **[U]**), so either verb exercises it; gate the **merge** verb
+as the production path. **Fixture caveat [U]:** use *valid* edge endpoints — a
+write to a non-existent endpoint fails RI validation and rolls back at ~192 ops
+with **zero chain reads**, so a bad-endpoint fixture silently measures the rollback
+path and would pass falsely.
+
+The load-bearing rule both Lance and SlateDB mostly miss: **assert the constant is
+flat across N, not just small at one N.** A shallow fixture cannot catch an
+O(history) cost (the §0(b) table is the red baseline). Add a `num_stages`
+(sequential-hop) assertion via a `ThrottledStore` wrapper (Lance's
+`test_commit_iops` setup) so an O(N) listing also blows a wall-time budget.
+
+### 5.2 Tier 2 — wall-clock trend (post-merge / nightly, never a PR gate)
+
+A `ThrottledStore` criterion bench injecting cross-region RTT (50/150 ms/op — the
+incident's regime) for single-insert and branch-op latency, with a threshold
+alert (Bencher.dev `--err` / github-action-benchmark `fail-on-alert`). Both
+reference DBs keep wall-clock out of the PR gate (too noisy on shared runners)
+and use it only as a trend.
+
+### 5.3 Close the loop — production metric
+
+Emit `storage.ops` and `storage.stages` per logical operation as a span/counter
+(cheap always-on atomics; the heavy per-table attributing wrapper stays
+test-only behind a `test-util`-style feature, zero release cost). The number
+asserted in CI is the number observed in prod — `iss-write-s3-roundtrip-amplification`'s
+cross-region signal becomes a direct readout.
+
+### 5.4 Process discipline — test-first for performance
+
+Write the depth-sweep cost-budget test **first**: it goes **red today** (§0), the
+WriteTxn + Phase-7 + compaction work turns it **green** (flat in N), and the
+red→green is the proof. This is CLAUDE.md rule 12 applied to cost, and the
+originating handoff's sequencing (§8/§9: land the tests before the fix so the win
+is measured and locked). Add the policy (extend invariant 15 + testing.md "Cost
+budget tests"): *any change touching the read/write/branch/open path MUST add or
+extend a cost-budget test asserting the metric is flat at history depth.*
+
+### 5.5 The correctness contract — concurrency tests (the safety twin of the cost gate)
+
+The cost gate proves *fast*; these prove *safe*. §6.5's multi-writer cliff slipped
+the suite for the same structural reason the latency bug did — **nothing runs the
+schedule that triggers it**: the suite is single-process with the in-process queue
+(the bug is cross-process), uses local/in-memory stores (no object-store
+cross-process CAS), and its recovery tests cover restart-time sweep, not
+live-writer rollback. **These four must land before `PublishPlan`/epoch merge
+(steps 5):**
+
+1. **Cross-process multi-writer on a real/emulated object store** (the *corruption*
+   case) — N independent engine **processes** writing the same `(table, branch)`;
+   assert all commit-or-cleanly-retry (no lost updates, no stuck "needs recovery,"
+   no HEAD-ahead-of-manifest). **A single-process failpoint test cannot reproduce
+   the corruption** (in-process degrades to clean OCC, §6.5) — this genuinely needs
+   a multi-process harness (empirically 1/12 today). State that so nobody writes a
+   single-process test expecting it to fail.
+2. **Deterministic in-process interleaving (failpoint) — WRITTEN, passes [M].** Two→
+   eight handles, sleep failpoint at the `commit_staged`→publish window
+   (`loader/mod.rs:605`); resume losers and assert they retry cleanly. This
+   demonstrates the **benign** path (N=8 → 2 commit, 6 clean OCC retries) — it is the
+   regression guard for "in-process stays clean," *not* a reproduction of the
+   cross-process cliff.
+3. **Live-writer recovery** (`iss-recovery-sweep-live-writer-rollback`) — a
+   concurrent open must not roll back a live in-flight publish (the grace window).
+4. **Formal model** — a Quint/TLA+ model of `{two writers, interleave commit_staged
+   and manifest-CAS}` (`iss-934`); it finds the §6.5 cliff immediately.
+5. **Cross-table write-skew — WRITTEN, red, and driven red→green in-process [M].**
+   Failpoint `loader.post_ri_pre_stage` (between RI-validation and staging): writer B
+   validates "Bob exists" and parks; writer A `overwrite`s `node:Person` dropping Bob
+   (non-cascading); B commits `Knows(Bob→Alice)` → committed orphan. The red test for
+   the §7.1 fix. **Acceptance is a single-process gate** — unlike the §6.5 HEAD-ahead
+   corruption (which genuinely needs the multi-process harness), this skew reproduces
+   *deterministically in one process*: the parked edge writer's snapshot really does
+   pin `edge:Knows:1` before the overwrite commits, so the overlap is real with two
+   in-process handles. The fix went red→green in-process behind a shared head row
+   (§7.1). Only #1–#4 (HEAD-ahead/epoch corruption) need cross-process scheduling.
+
+Plus one **disambiguating run** owed (§6.5 confound): separate-handles in-process
+on S3 — to confirm the corruption is the process boundary, not the store.
+
+This mirrors the cost gate's discipline (assert across the dimension the suite
+otherwise never exercises) — there, history depth; here, concurrent cross-process
+schedules.
+
+---
+
+## 6. What is already right vs. the deltas
+
+**Already correct — do not rewrite.** The in-memory `MutationStaging` accumulator,
+the recovery sidecar mechanism, the per-(table,branch) write queue, D2, the sealed
+`TableStorage` trait, and the read-path warm-up (PR #268) all stay. This is **not**
+a substrate rewrite.
+
+**One claim to soften — manifest-CAS is atomic *per publish*, not unconditionally
+cross-table-serializing [M].** The manifest CAS (the reference impl of the
+lance#7264 "Alternative A") makes each publish atomic and serializes any two writers
+whose write-sets **share a `__manifest` row** — overlapping or same-table, which is
+exactly why §6.5's same-table cases and the cascading-delete case retry cleanly. But
+two writers touching **disjoint** tables write disjoint per-`object_id` rows, so Lance
+sees no conflict and **both commit** (proven [M], §7.1). The genuinely-atomic
+cross-table commit §13 contrasts with Delta is the **target** (§4.1's single
+merge-insert over a shared head row), **not current state**. So "do not rewrite the
+CAS" holds for the *commit primitive*, but the cross-table-serialization §7.1 needs
+is a real addition (the shared `graph_head` row), not something the current CAS
+already provides.
+
+**The deltas (each a validated, localized gap):**
+
+| # | Delta | Mechanism | Tracking |
+|---|---|---|---|
+| 1 | Snapshot re-derived per stage | capture-once `WriteTxn`, thread by ref | `iss-write-s3-roundtrip-amplification` |
+| 2 | Write opens via `from_namespace` re-resolve the data-table ~13×/write, missing the fast path (**DOMINANT, +12/depth**) | open each table **once, direct `from_uri().with_version(N)`** (bypass namespace, §2.4) + shared Session | `iss-write-s3-roundtrip-amplification`, #0 |
+| 3 | Lineage = 2nd authority, O(history) refresh (secondary) | Phase 7: lineage into `__manifest` | `iss-991` |
+| 4 | `__manifest`/`_graph_commits` excluded from optimize/cleanup (`optimize.rs:895-904`; prototype pruned "7 tables" = node/edge only **[M]**) — the +5/depth residual after step 3 | **add them to `all_table_keys`** (a code change) + scheduled compaction/cleanup | `gap-read-path-rederivation` (write twin) |
+| 5 | `list_dir("__recovery/")` per write | move to open + conflict, grace window | `iss-856`, `iss-recovery-sweep-live-writer-rollback` |
+| 6 | 4 hand-rolled writers, commit↔recovery drift | one `PublishPlan` executed by both | `iss-merge-recovery-partial-rollforward` (PR #277) |
+| 7 | No writer epoch (multi-process exposure) | `writer_epoch` in `__manifest` | — (new) |
+| 8 | branch create = O(tables) fork loop | Lance `Clone` | `iss-691` |
+| 9 | branch delete = sequential loops | concurrent `buffer_unordered` | — (new) |
+| 10 | No write/branch cost gate (must count **data-table** opens; route all opens through the instrumented opener first) | Tier-1 IO-counted tests, merge verb | — (new) |
+| 11 | Schema contract re-validated uncached per resolve (**flat 46 reads/write — 29% of depth-0 cost; constant, not depth**) | resolve/validate-once in `WriteTxn`; §5.1 `validate_schema_contract_calls==1` (the depth gate misses it) | `iss-write-s3-roundtrip-amplification` |
+
+---
+
+## 6.5 Concurrency correctness — the multi-writer cliff (proven [M])
+
+The latency fixes are about *speed*; a separate, proven finding is about *safety*.
+A multi-writer experiment **[M]** shows concurrent same-branch writers behave very
+differently by topology:
+
+| topology | concurrency | outcome |
+|---|---|---|
+| single server (shared in-proc queue, `loader/mod.rs:426`) | 12 | **12 / 12 commit** (clean) |
+| in-process, separate handles, interleave failpoint at `commit_staged`→publish (`loader/mod.rs:605`) | 8 | **2 / 8 commit; the other 6 are clean retryable OCC** |
+| multi-process (separate CLIs / S3, no shared queue) | 2 / 3 / 5 / 12 | **1 / N commit; the rest CORRUPT** |
+
+**Two distinct failure modes — and the corruption is strictly cross-process:**
+
+- **In-process → benign.** Even with *separate handles, no shared queue, high
+  contention*, losers fail with `stale view of 'edge:Knows': expected manifest table
+  version 5 but current is 7 — refresh and retry` — a **clean, retryable OCC
+  conflict; graph state stays consistent.** The publisher CAS is doing its job.
+- **Cross-process → corruption.** `Lance HEAD version N+1 ahead of manifest version
+  N; a pending recovery sidecar requires rollback`. **Mechanism:** a losing writer
+  advances the table's Lance HEAD (`commit_staged`) *before* the manifest CAS; when
+  the CAS loses, HEAD is ahead of the manifest — a partial commit the per-write heal
+  **defers** (`recovery.rs:978-988`; only the open-time sweep rolls back), so a
+  *live* writer hitting it **fails instead of healing**. Self-heals on the next
+  read-write reopen (not permanently bricked), but during a burst throughput
+  collapses to one survivor. Reachable at **concurrency = 2** cross-process.
+
+So in-process safety **already comes from the publisher CAS** (clean OCC); the
+corruption needs the process boundary. *(Confound, stated honestly: the in-process
+interleave ran on local-FS and the cross-process on S3-via-proxy — but
+single-server-on-S3 was also clean (12/12), giving two independent "in-process
+clean" points vs one "cross-process corrupt," triangulating on the process
+boundary, not the store. One disambiguating run — separate-handles in-process on S3
+— would move this from triangulated to proven; §5.5.)*
+
+**Scoping (matters for urgency):** **single-server prod is serialized-correct, just
+slow** — the in-process `(table,branch)` queue serializes same-branch writes (all 12
+commit, no lost updates); the production incident was the *latency* (serialized
+O(depth) writes → 90 s timeout), **not** corruption. The corruption hazard is
+**latent**: it appears the moment a second writer exists (server replica,
+CLI-alongside-server, multi-writer scale-out). **So: single-server today =
+serialized-correct (slow; fixed by steps 2/3); multi-writer = UNSAFE until
+`writer_epoch` lands.**
+
+**The fix is the existing RFC, no new design.** The `A`-before-`B` window
+(Lance HEAD moves before the manifest references it) is inherent to Lance's
+per-table-lineage model — you cannot eliminate it, only fence and recover it: the
+**`writer_epoch`** (delta #7) is a leader-lease via cross-process CAS so two writers
+are never in the `commit_staged`→manifest-CAS window across processes (it removes
+the concurrent-race dimension); the **`PublishPlan`=sidecar** (delta #6) makes a
+single crashed writer roll forward/back deterministically (the crash dimension); and
+**recovery off the hot path + grace window** (delta #5, Q2) is the exact reason the
+live writers failed rather than self-healed (`iss-recovery-sweep-live-writer-rollback`).
+This is the standard WAL-replay + leader-lease shape (confirmed against SlateDB's
+`FenceableTransactionalObject` and Kleppmann's fencing-token canon, §10). **This
+finding promotes #6/#7 from "nice correctness work" to the load-bearing guard that
+gates multi-writer topologies — and it is the motivating case for them.**
+
+---
+
+## 7. Invariants & deny-list check
+
+Touches and *strengthens* (does not weaken) invariants in
+[invariants.md](invariants.md):
+
+- **§2 (manifest-atomic visibility):** preserved; lineage now rides the same CAS
+  (strengthens — closes the "manifest→commit-graph atomicity" gap).
+- **§3 (one snapshot per op):** enforced *by construction* via `&WriteTxn`.
+- **§4 (publish at one boundary):** unchanged — still one manifest publish.
+- **§5 (recovery part of the commit protocol):** preserved; the sidecar *is* the
+  `PublishPlan` (strengthens — commit and recovery cannot diverge). The grace
+  window addresses the documented "recovery serialized against live writers
+  in-process only" gap.
+- **§7 (indexes derived) / §15 (one source of truth, cheaply derived):** this RFC
+  is the write-side application of §15 — bound cost to the working set, not
+  history. The commit graph becomes derived (strengthens).
+- **§5 strict-op SI:** preserved (#5 validation — open guards kept for
+  read-modify-write).
+
+**Deny-list:** does *not* hit "cold re-derivation on the hot path" (it removes
+two instances), "state that drifts" (lineage stops being a second authority), or
+"acks before durable persistence." The `writer_epoch` is the closing move on the
+"local `write_text_if_match` is not a cross-process CAS" / multi-process gaps —
+add it before admitting multi-process write topologies.
+
+No invariant is weakened. Two Known Gaps **close** (manifest→commit-graph
+atomicity; commit-graph parent under concurrency, via Phase 7); one
+(read-path-rederivation) gets its **write twin** filed and addressed.
+
+### 7.1 Scope of the correctness claims (literature review, §13)
+
+The "correct by construction" framing (§3, §4.1) is **precise but bounded** — the
+DB-canon review flags three places not to over-claim:
+
+- **Per-table serializability, not graph-wide — but the gap is narrow and now
+  measured [M].** Three deterministic cases (failpoint `loader.post_ri_pre_stage`,
+  placed between RI-validation and staging; red test in `tests/failpoints.rs`):
+  - **Cross-table *disjoint* → genuine skew, VIOLATED.** A **non-cascading endpoint
+    removal** — `node:Person` *overwrite* dropping Bob, touching only the node table
+    — concurrent with an edge insert `Knows(Bob→Alice)`: both commit (write-set-only
+    CAS, RI validated once pre-commit and never re-checked at publish) → **committed
+    orphan**. (= `iss-ri-write-skew-dangling-edges` + the concurrent face of
+    `iss-overwrite-orphans-committed-edges`.)
+  - **Cross-table *overlapping* → incidentally protected.** `delete`-based removal
+    **cascades** into `edge:Knows`, so the write-sets overlap, the per-table CAS
+    engages, and the loser fails **cleanly** (stale-view OCC retry); invariant held.
+  - **Same-table → NOT a separate skew.** Cardinality / `@unique(src)` have
+    overlapping write-sets, so the per-table CAS holds the constraint; the loser's
+    failure is the **HEAD-ahead corruption already scoped to #6/#7** (epoch +
+    PublishPlan), not a consistency hole. *(This corrects an earlier
+    over-generalization: cardinality/uniqueness do not share the read-set gap.)*
+
+  So the skew is **reachable only for the non-cascading-overwrite × disjoint-edge-insert
+  shape** — operation-specific, not constraint-specific.
+
+  **The scoped fix alone is a no-op — proven [M], and the reason is mechanical.**
+  Feeding the endpoint node-table versions into the edge's publish *expected* set
+  (`check_expected_table_versions`, `publisher.rs:353`) was prototyped exactly; debug
+  confirmed the pins reach the check, **and both writers still committed — the orphan
+  persisted.** Every publish writes a *unique per-`object_id` row* into `__manifest`
+  (merge key `object_id = version_object_id(table, version)`). Two disjoint-table
+  writers (`node:Person` vs `edge:Knows`) touch **no common row**, so Lance's
+  row-level merge-insert CAS commits both with **no conflict**, the publisher's retry
+  loop **never fires**, and `check_expected_table_versions` — a **non-atomic
+  pre-check, not part of the CAS** — is evaluated exactly once against the stale
+  pre-both manifest and passes for both. The read-set pin only bites if the loser is
+  **forced to retry and re-evaluate against fresh state**, which requires a *shared
+  contention row* every publish touches. Adding a stand-in global head row
+  (`UpdateAll`-touched by every publish) makes the disjoint writers overlap → Lance
+  conflict → publisher retry → the reloaded pin (`edge:Knows:1` vs current `5`)
+  rejects the stale writer → no orphan (red→green, failpoint suite 52/52). **That
+  shared row is exactly Phase-7's `graph_head:<branch>`.**
+
+  **Consequence — §7.1 is NOT a standalone single-server PR** (correct earlier text
+  that called it "single-server-live, not deferrable" — it *is* urgent and
+  epoch-independent, but it cannot ship against today's per-`object_id` manifest
+  without a contention point). Land it one of three ways: **(a)** with Phase 7
+  (step 4), reusing `graph_head:<branch>` as the contention row; **(b)** behind a
+  minimal per-branch head row ahead of Phase 7 (~15 lines, as prototyped); or
+  **(c)** as commit-time re-validation — still must win a serialization point first.
+  **Recommended: (c) behind a per-branch head row.** The CAS-map approach carries the
+  two costs §11 anticipated — *table-granularity false conflicts* (any `Person`
+  overwrite conflicts with any concurrent `edge:Knows` insert, even different rows —
+  needs a row-granularity read-set) and *scope* (a global head serializes the whole
+  graph; per-branch `graph_head` is the right granularity). Commit-time re-validation
+  is precise (no false positives) **and** reuses the same serialization point, so once
+  the head row exists it strictly dominates the CAS-map. Either way the head row
+  imposes an inherent trade — same-branch writers serialize cross-process (throughput
+  ceiling 1/branch, bounded by `PUBLISHER_RETRY_BUDGET`) — **now a correctness
+  requirement, not just a Phase-7 side effect** (§11).
+
+  **Two faces, two fixes — do not bundle them.** The above addresses only the
+  *concurrent* face (overlapping snapshots, `iss-ri-write-skew-dangling-edges`). The
+  *sequential* face (`iss-overwrite-orphans-committed-edges`) — an overwrite drops a
+  node that **already has a committed inbound edge**, with *zero* concurrency —
+  **cannot** be caught by read-set-in-CAS: the later writer's snapshot legitimately
+  post-dates the edge, so its pin matches and it commits. That is a pure
+  **inbound-RI-validation** gap: when an overwrite/delete removes node endpoints,
+  re-check that no live edge references them. A validation concern, not a CAS one;
+  it needs no contention row and ships independently.
+  *(Note: `iss-984` is a different bug — remote branch-merge idempotency — not this.)*
+- **Recovery: roll-forward is by-construction; roll-back is not.** "Commit and
+  recovery replay the identical plan" holds for the **redo** direction (shared
+  `plan.apply()`). The undo classifier (NoMovement / UnexpectedAtP1 /
+  UnexpectedMultistep / IncompletePhaseB) lives *outside* the shared executor, only
+  at open-time — that's where ARIES-style divergence risk concentrates and where the
+  §5.5 failpoint coverage is owed.
+- **The fence and the cross-file atomicity rest on a linearizable conditional-put.**
+  Kleppmann's fencing-token guarantee, the manifest CAS, and the epoch all require a
+  linearizable register — true on S3/R2 (If-Match) but **not** on the local-FS path
+  (`write_text_if_match` is content-token compare-then-replace, ABA-prone —
+  `invariants.md` Known Gap). **Precondition to state up front: every "deterministic
+  fence" / "atomic CAS" claim holds *on a store with linearizable conditional-put*;
+  the epoch must not use the local-FS path.** Delta Lake §3.2.2 treats the
+  object-store consistency model (read-after-write + put-if-absent) as a first-class
+  design parameter; so should this RFC.
+
+---
+
+## 8. Relationship to Lance MTT (the seam, not a dependency)
+
+`GraphPublishAuthority.publish(txn, plan)` is exactly the adapter to a future
+Lance `catalog.transaction()`. lance#7264 ("Multi-Table Transactions via
+Branching") is real and OmniGraph is its reference "Alternative A"
+(fast-forward-main + WAL + roll-forward recovery) **[U]**, but it is a 5-day-old
+discussion with two unbuilt dependencies (lance#7263 branch merge/rebase,
+lance#7185 UUID branch paths), an unresolved central choice (it *favors*
+pointer-swap — the opposite identity model from OmniGraph), and an open soundness
+question (TTL lease needs an epoch). **Build the seam now on its own merits; do
+not schedule around MTT landing.** When it ships, `publish`'s *body* swaps
+(stage→CAS→sidecar → `catalog.transaction()`) while `WriteTxn`/`PublishPlan` and
+every verb lowering stay. `iss-863`/`iss-864` **[G]** already scope this spike.
+
+The MemWAL/LSM ingest tier (`iss-681` **[G]**, `dec-adopt-lance-v7-memwal`) is
+**complementary, not competing, and not in flight** (the `memwal-benefit-analysis`
+branch is an empty placeholder; the real analysis is commit `c9a81266`). MemWAL
+sits *below* the manifest publisher (per-table durability, opt-in, intra-table);
+`WriteTxn` owns the cross-table CAS. Build `WriteTxn` first.
+
+---
+
+## 9. Sequencing
+
+Ordered by leverage and dependency. **The dominant depth term is the redundant
+data-table opens (step 3), not the internal tables (step 2)** — §0; both must land
+to flatten the curve.
+
+1. **Measure first (Tier-1 gate). ✅ LANDED (gate + harness).** *Prerequisite (1a):*
+   the write opener (`open_dataset_head`) is routed through the instrumented
+   `open_dataset_tracked` so the gate can count data-table opens (§5.1). The
+   write cost-budget tests live in `crates/omnigraph/tests/write_cost.rs` on a
+   **shared, store-agnostic harness** (`tests/helpers/cost.rs`: `measure`/`IoCounts`/
+   `assert_flat`/`local_graph`/`s3_graph`) that `warm_read_cost.rs` and
+   `write_cost_s3.rs` also consume — one vocabulary, no duplicated `IOTracker`
+   plumbing. The local gate ships green every-PR guards + the RED `#[ignore]`'d
+   internal-table LOCK (step 2's red→green acceptance). *Still owed:* the prod
+   `storage.ops` span metric (§5.3) and the bucket-gated `write_cost_s3.rs` opener
+   LOCK (step 3a's red→green, S3-only per the §9-3a measurement note).
+2. **Bound history — bring the INTERNAL tables into optimize/cleanup (a code
+   change, not just scheduling).** Today `optimize`/`cleanup` iterate **node/edge
+   keys only** (`optimize.rs:895-904`) — confirmed: the prototype's `cleanup --keep 3`
+   pruned "7 tables" = the node/edge data tables; `__manifest`/`_graph_commits` were
+   untouched **[M]**. So the residual +5/depth internal slope (§0b) is **not** fixed
+   by today's tooling — step 2 is a real `all_table_keys` change to add the internal
+   tables, then schedule compaction+cleanup (pass `--yes`; cleanup aborts on remote
+   otherwise). The pruning mechanism is proven on a data table (1035→63, 16× **[M]**);
+   the internal tables need the same inclusion. **Proven [M]:** compacting the
+   internal tables collapsed their scans `__manifest` 285→32, `_graph_commits`
+   177→11; with step 3 a depth-87 edge drops **~1720 → 198 ops** (§2.4). (Separately,
+   node/edge cleanup **caps** the dominant data-table term as an interim *before*
+   step 3 — after step 3 that term is flat regardless.) **HARD PREREQUISITE:** the
+   Q8 boundary watermark must land **with** this step — Lance's version CAS is
+   confirmed vulnerable to cleanup-resurrection (§12 Q8, a silent lost write on
+   R2/S3), so scheduling cleanup without the watermark trades a latency bug for a
+   correctness bug. (`gap-read-path-rederivation` write twin.)
+3. **The opener fix — a shippable lead + the structural follow-on.**
+   - **3a. Opener bypass (standalone PR, THE dominant fix — [M] proven). ✅ LANDED.**
+     `TableStore::open_dataset_head_for_write` now delegates to the direct
+     `open_dataset_head` opener (`Dataset::open` by URI + `checkout_branch`, routed
+     through `instrumentation::open_dataset_tracked` so the cost gate can count it;
+     no-op in prod) instead of the `from_namespace` builder. Measured end-to-end on
+     the prototype: data term `31 + 12·depth → flat 4`, total `+18 → +5/depth`,
+     depth-80 **2.7×** (§2.4), functionally correct on main/branch/node.
+     **Acceptance:** the full `cargo test --workspace --locked` suite passes under the
+     bypass (the `tests/` integration + `merge_truth_table` + recovery/failpoint
+     suites the prototype's `--lib` run didn't cover — schema-apply, branch merge,
+     fork-on-first-write, overwrite). **Namespace retired to test-only:** with both
+     reads (Fix 2) and now writes bypassing it, *nothing in production routes through
+     the Lance namespace* — confirming §2.4's premise. The dead per-table open chain
+     (`load_table_from_namespace`, `open_table_head_for_write`) was deleted and the
+     `StagedTableNamespace` contract apparatus gated `#[cfg(test)]`, mirroring the
+     already-`#[cfg(test)]` read namespace (`BranchManifestNamespace`). **Measurement
+     note (corrected):** the opener win is **S3-only** — local FS resolves latest with
+     one cheap `read_dir` regardless of opener, so the namespace-vs-direct difference
+     is invisible there (the local data-table read count *does* grow with depth, but
+     that is the merge-insert/RI scan over O(depth) *fragments*, a compaction term,
+     not the opener; depth-100 = 92 ops identically before and after the bypass). The
+     opener LOCK therefore lives in the bucket-gated `write_cost_s3.rs`, not the local
+     `write_cost.rs`.
+   - **3b. Full `WriteTxn` (capture-once + intra-txn handle reuse + shared Session).**
+     Formalize 3a's open-once into the pinned, threaded `WriteTxn` (re-resolution
+     *unrepresentable*, invariant 3) and kill the flat-46 schema-read constant
+     (resolve/validate-once, §0/§6). (`iss-write-s3-roundtrip-amplification`.)
+4. **Phase 7 — lineage into the manifest.** Removes the per-write
+   `commit_graph.refresh`; commit graph becomes a projection. (`iss-991`.)
+   **Hard dependency: step 2 must land first (Q1, §12)** — each publisher retry
+   re-runs the O(history) `load_publish_state` scan, so the `graph_head` CAS
+   contention Phase 7 introduces is acceptable only once compaction bounds that
+   scan. Acceptance includes the Q1 concurrent-same-branch-writer gate.
+   **Carries the §7.1 concurrent write-skew fix.** The `graph_head:<branch>` row is
+   the shared contention point the cross-table read-set-in-CAS needs — proven [M]
+   that the read-set fix is a no-op without it (§7.1). So the concurrent face of the
+   write-skew lands *with* this step (or, if §7.1 must ship earlier, behind a minimal
+   per-branch head row — ~15 lines — or as commit-time re-validation). The
+   *sequential* face (`iss-overwrite-orphans-committed-edges`) is independent:
+   inbound-RI validation on node removal, no head row, ships anytime.
+5. **`PublishPlan` unification + recovery off the hot path + epoch fence — the
+   multi-writer safety guard.** Collapse the four writers; move the `__recovery` list
+   to open/conflict; add the `writer_epoch` leader-lease. **Motivated by the proven
+   §6.5 cliff** (multi-process same-branch writers corrupt at concurrency = 2) — this
+   is the guard that makes multi-writer topologies safe, not optional polish.
+   **Gated by the §5.5 correctness contract** (the four concurrency tests must land
+   with it). `writer_epoch` must be a true cross-process conditional CAS — **not**
+   the local-FS `write_text_if_match` path (§7.1). (`iss-856`,
+   `iss-merge-recovery-partial-rollforward`, `iss-recovery-sweep-live-writer-rollback`,
+   `iss-934`.)
+6. **Branch ops.** Lance `Clone` for create (`iss-691`); concurrent delete loops.
+7. **Freeze** investment in publisher/sidecar/fork internals; pursue the MTT
+   seam (`iss-863`/`iss-864`) as the strategic exit.
+
+**Land PR #277 first** — it closes `iss-merge-recovery-partial-rollforward` and is
+the producer-side half of the `PublishPlan` discipline; the heal-relocation in
+step 5 must preserve its merge pre-snapshot heal (`exec/merge.rs:1084-1090`) and
+its open-time `IncompletePhaseB → RollBack` (which the per-write heal never
+performed anyway).
+
+---
+
+## 10. Cross-reference map (the ties)
+
+**Dev-graph items (modernrelay) — what this RFC ties together:**
+
+- Primary: `iss-write-s3-roundtrip-amplification` (the bug).
+- Depth term / Phase 7 (commit graph → manifest-derived projection): `iss-991`
+  (related: `iss-707` structured commit-graph lineage; `iss-934` Quint
+  multi-table-publish verification). Read twin: `gap-read-path-rederivation`.
+- Substrate seam: `iss-863`, `iss-864`. Decision: `dec-adopt-lance-v7-memwal`
+  (`iss-681`).
+- Recovery: `iss-856`, `iss-recovery-sweep-live-writer-rollback`,
+  `iss-merge-recovery-partial-rollforward`, `iss-903`, `iss-load-not-crash-safe`.
+- Residual migration: `iss-950` (MR-A staged delete, retires D2), `iss-848`
+  (index-coverage reconciler, owns `create_vector_index`).
+- Branch/load: `iss-691`, `iss-677`, `iss-895`, `iss-topology-cross-branch-cache`,
+  `iss-841`, `iss-982`, `iss-423`, `iss-989`.
+- Concurrency correctness (survives MTT) — **two faces, two different fixes [M]**
+  (§7.1): `iss-ri-write-skew-dangling-edges` (the *concurrent* face; fix =
+  read-set-in-CAS **+ a shared `graph_head` contention row**, so it's coupled to
+  step 4 / a minimal head row / commit-time re-validation — NOT a standalone PR) and
+  `iss-overwrite-orphans-committed-edges` (the *sequential* face; fix =
+  **inbound-RI validation on node removal**, ships independently, no contention row).
+  *(`iss-984` — remote branch-merge idempotency — is unrelated; not a write-skew.)*
+- Blockers: `blk-lance-6658` (shipped 7.0.0), `blk-lance-6666` (open, vector
+  index two-phase), `blk-lance-blob-compaction`.
+- Epics: `epc-bulk-data-plane`, `epc-lance-v7-migration`, `epc-783` (reliability
+  harness), `epc-929` (Quint verification).
+
+**Proposed new dev-graph wiring (not yet written):**
+
+- New **Epic** `epc-write-path-latency` — owns the cluster of orphaned issues
+  above (none currently has an epic).
+- New **Gap** `gap-write-path-rederivation` — the write twin of
+  `gap-read-path-rederivation` (current: write re-derives snapshot + scans
+  uncompacted internal tables per write; target: capture-once + bounded history).
+- New **Issues**: write-side cost-budget gate + prod metric (step 1; prereq 1a
+  routes all opens through the instrumented opener); **opener bypass — open writes
+  direct-by-URI, standalone (step 3a, [M] the dominant fix, completes PR #268 Fix 2
+  on the write path, §2.4)**; full `WriteTxn` capture-once (step 3b); **add
+  `__manifest`/`_graph_commits` to `all_table_keys`** for compaction+cleanup (step 2
+  — a code change, `optimize.rs:895-904`); `PublishPlan` unification + epoch
+  (step 5); branch-delete concurrency (step 6).
+- **Per-table namespace retired to test-only (step 3a landed).** With reads (Fix 2)
+  and now writes (step 3a) both opening direct-by-URI, *nothing in production routes
+  through the per-table `StagedTableNamespace`*. The dead open chain
+  (`load_table_from_namespace`, `open_table_head_for_write`) was deleted; the
+  `StagedTableNamespace` struct/impl/factory are now `#[cfg(test)]`, mirroring the
+  already-`#[cfg(test)]` read namespace (`BranchManifestNamespace`). Both are retained
+  only to validate the `LanceNamespace` contract in unit tests. *Production catalog /
+  managed-versioning commit coordination for `__manifest` itself goes through a
+  **separate** namespace (`GraphNamespacePublisher`), unaffected by this change.* The
+  former follow-up to harden `StagedTableNamespace::list_table_versions`
+  (`checkout_version` per version, O(depth)) is now purely a test-hygiene note — no
+  prod caller can hit it; if any future version-list / time-travel feature needs
+  per-table version enumeration, build `TableVersion`s from `versions()` metadata
+  directly rather than resurrecting the namespace open path.
+- New **Decision** `dec-writetxn-manifest-authoritative-publish` — records this
+  RFC's design choice and the MTT-seam stance.
+
+**Key source locations (v0.7.0):**
+`omnigraph.rs:561-568,739-779,1317-1389`; `table_ops.rs:505-609`;
+`table_store.rs:157-280,282-341,797`; `loader/mod.rs:197,400,485,557`;
+`exec/mutation.rs:725`; `exec/merge.rs:1084-1090`;
+`db/manifest/publisher.rs:51,93-124,356-371,385,432-440,448-490`;
+`exec/mutation.rs:640-673` (D2 rule); `db/manifest/state.rs:44-72,133-141`; `db/manifest/layout.rs:22-26`;
+`db/manifest/namespace.rs:111-112` (read open, O(1)),`:357-385`/`:362` (`describe_table` → redundant `Dataset::open` — the write-path double-open),`:158-186,544-550` (write open via `from_namespace`),`:395-427` (`list_table_versions` per-version checkout — test-only O(depth), the §10 follow-up);
+`db/manifest/recovery.rs:762,978-988,1522`; `db/commit_graph.rs:136-164,213-272`;
+`db/omnigraph/optimize.rs:240,517,895-904`; `instrumentation.rs:37,112-131`;
+`runtime_cache.rs:202-283`; `tests/warm_read_cost.rs` (the read-side gate to mirror).
+
+**Upstream:** lance#7264/#7263/#7185 (MTT); Lance `with_version` O(1) open
+(`from_namespace` → `describe_table`, `builder.rs:130-178`; `default_resolve_version`
+= one HEAD, `commit.rs:939-981`; version-hint PR #6752),
+`list_is_lexically_ordered = !is_s3_express` (`aws.rs:183`),
+`IOTracker`/`assert_io_*`/`num_stages`, `test_commit_iops`,
+`test_commit_uses_version_hint_on_non_lexical_store`; **lance-namespace** design
+(`namespace/index.md`, `operations/index.md` — catalog/discovery layer, resolve
+once); LanceDB `io_tracking.rs`, `test_reload_resets_consistency_timer`; SlateDB
+`FenceableTransactionalObject` (epoch fence), `InstrumentedObjectStore`,
+monotonic-ID manifest.
+
+**Reproduce the §0(b) network measurement:** `rustfs` (S3-compat) on `:9000`
+behind a ~90-LoC Go counting proxy on `:9100` (adds `LATENCY_MS`, preserves the
+SigV4 `Host` header, `/__ctl/reset` + `/__ctl/stat`); an omnigraph cluster on
+`s3://…/cluster` through the proxy. Single-write breakdown: reset the proxy log,
+`load --mode merge` one edge, classify by S3 key. Depth slope: write N× to main,
+diff the per-write log at depth D vs D+20 by table. Native baseline: pylance 7.0.0
+`write_dataset(mode="append")` in a loop → flat 6 ops/append at any depth.
+
+---
+
+## 11. Drawbacks, alternatives, reversibility
+
+**Drawbacks.** Phase 7 makes disjoint-table same-branch writers contend on the
+`graph_head:<branch>` row (they don't today) — bounded by the Lance retry budget,
+inherent to a linear per-branch DAG, gated on a measured concurrency test and on
+step 2 landing first (§12 Q1, resolved). **Reframe [M]: this contention is
+load-bearing for correctness, not merely a throughput tax.** The §7.1 write-skew is
+*unreachable only because* the shared head row forces disjoint cross-table writers to
+overlap, conflict, retry, and re-evaluate their read-set pins against fresh state
+(proven — without it the scoped CAS fix is a no-op). So §7.1 and the head row are
+**coupled**: the "drawback" is exactly what buys the cross-table invariant, and the
+throughput ceiling (1 writer/branch, bounded by `PUBLISHER_RETRY_BUDGET`) is a
+**correctness requirement** the moment §7.1 ships, not an optional Phase-7 side
+effect. `PublishPlan` is a non-trivial refactor of four writers; it must land behind
+the cost gate and the `merge_truth_table`/recovery/failpoint suites.
+
+**Alternatives.** (A) *Caching band-aid only* — memoize schema validation, cache
+opens within a request: ~30–50% fewer round-trips but leaves open-by-latest and
+the O(history) terms. Mitigation, not a fix. (B) *Opener bypass only* (open
+direct-by-URI+version, no full txn) — **kills the dominant depth term, now measured
+[M]**: a one-line patch flattened the data term `31+12·depth → flat 4` and cut a
+depth-80 edge **2.7×** (§2.4), leaving only the secondary internal-table term and
+the writer unification. (C) *Full design (this RFC)* — correctness by construction.
+(D) *Wait for Lance MTT* — future exit, not a current dependency (§8).
+**Recommend: ship B as a standalone PR first (behind the step-1 gate), then C for
+the constant-factor + correctness, then step 2 for the internal residual; D as the
+strategic end-state.** B is the demonstrated dominant fix, not a partial one.
+
+**Reversibility.** The interface (`WriteTxn`/`PublishPlan`) is internal and
+reversible. Phase 7's new `__manifest` object types (`graph_commit`,
+`graph_head`) are an **on-disk format addition** — additive (old binaries skip
+unknown `object_type`s) but near-permanent; it earns its own validation pass
+(forward/back-compat, the validation checklist in the `iss-991` handoff). The
+`writer_epoch` is likewise a durable manifest field. Everything else (compaction
+scheduling, recovery relocation, branch concurrency, the cost gate) is cheap to
+undo.
+
+---
+
+## 12. Resolved questions (was: unresolved)
+
+All five original open questions were investigated read-only against post-#277/#284
+`origin/main`, upstream Lance 7.0.0, and the dev graph; each is resolved below. One
+new item (Q6), surfaced by peer review, remains genuinely open.
+
+1. **`graph_head` CAS contention → RESOLVED, gated on step 2 + a concurrency test.**
+   Retry is publisher-owned; Lance's internal rebase-retry is disabled
+   (`conflict_retries(0)`, `publisher.rs:385`) → no double-retry. Row-CAS is true
+   one-winner (`TooMuchWriteContention` → retryable, `publisher.rs:432-440`),
+   bounded by `PUBLISHER_RETRY_BUDGET = 5`. **But each retry re-runs the O(history)
+   `load_publish_state` scan (`publisher.rs:455`)**, so `graph_head` contention
+   multiplies the manifest term — **step 2 (compaction) is a hard prerequisite for
+   step 4 (Phase 7)**. Same-branch is the real workload (the incident is concurrent
+   `main` writes). Residual: a measured gate before Phase 7 — N≈100 concurrent
+   same-branch writers, assert bounded retry + O(working-set) re-scan + P99 within
+   SLA. Fallback: batched-lineage, or Alternative B (defer lineage-in-manifest).
+2. **Recovery grace-window → RESOLVED.** PR #284 is **unrelated** (cluster-apply
+   trap; zero `recovery.rs` changes). The dangerous rollback classifications
+   (NoMovement / UnexpectedAtP1 / UnexpectedMultistep / #277's IncompletePhaseB)
+   fire only at the open-time Full sweep; the per-write heal defers all rollback
+   (`recovery.rs:978-988`), so moving the heal off the hot path doesn't break #277.
+   A sidecar-age grace window (defer sidecars younger than T_grace, loud typed
+   skip, `repair` override) on the existing `created_at`/ULID
+   (`recovery.rs:762`/`:1522`) is the sound interim; the permanent fix is the
+   in-process background reconciler `iss-856`. Lands step 5 with a failpoint test.
+3. **Epoch fence × publisher CAS → RESOLVED (by construction).** With Lance retry
+   off (Q1), the publisher loop is the only retry layer. Model `writer_epoch` as a
+   **pre-publish hard-fail gate** beside `check_expected_table_versions`
+   (`publisher.rs:462`) but non-retryable (a stale epoch is a protocol violation,
+   not a race). No double-retry; the epoch gate and the row-CAS loop are
+   sequential. SlateDB `FenceableTransactionalObject` is the precedent.
+4. **Compaction cadence → RESOLVED.** Not `auto_cleanup` (GCs pinned versions).
+   Not foreground every-N-writes (deny-list job-queue + invariant 7 + cost cliff).
+   Minimum (step 2): extend `optimize`/`cleanup` to the internal tables AND node/edge
+   version cleanup — no special-casing (`SidecarKind::Optimize` already covers a
+   mid-compaction crash). Follow-up: an `iss-856`-shaped background reconciler
+   triggered by a cheap fragment-count probe (work off the hot path; a reconciler,
+   not a job queue — deny-list-clean; SlateDB's epoch-coordinated compactor is the
+   precedent). CLI `omnigraph optimize` stays the operator override.
+5. **`PublishPlan` residuals → RESOLVED.** Both `delete_where` and
+   `create_vector_index` are representable as `TableAction` variants with existing
+   sidecar coverage (`SidecarKind::Mutation`/`EnsureIndices`) and are
+   content-preserving (roll-forward safe). `TableAction::Delete` migrates to staged
+   two-phase via MR-A / `iss-950` (now unblocked — `blk-lance-6658` shipped); **D2
+   retires then** (`enforce_no_mixed_destructive_constructive`,
+   `exec/mutation.rs:640-673`). `TableAction::CreateVectorIndex` stays inline until
+   `blk-lance-6666` ships (`iss-848` reconciler path).
+
+**Resolved post-review:**
+
+6. **The exact mechanism of the data-table chain re-read → RESOLVED (§0, §2.4).**
+   Pinned by Lance-source trace + proxy + pylance isolation **[U]**: it is **not**
+   `checkout_version` (O(1)) and **not** merge-insert conflict replay. The write
+   open goes through `DatasetBuilder::from_namespace` (`namespace.rs:174`), whose
+   `describe_table` opens the whole dataset just to return a location
+   (`namespace.rs:362`/`:112`) and whose `.load()` resolves latest **again** — a
+   double latest-resolution per open, ~13× per write, nothing cached. The open
+   resolves latest **without the V2 lexical / version-hint fast path** the direct
+   opener uses (likely the un-threaded `Session`/store params,
+   `load_table_from_namespace` `namespace.rs:174` — inferred, not traced), so it is
+   O(depth) where a direct `from_uri().with_version(N)` is O(1). **The mechanism
+   question is now academic for the fix:** bypassing `from_namespace` makes the open
+   flat regardless of the precise sub-mechanism (un-threaded `Session` / double
+   resolve / missed hint) — the bypass is the answer. (`list_table_versions` is
+   **not** on this path — test-only; §10 follow-up.) `checkout_version` stays
+   exonerated.
+
+**Resolved end-to-end [M]:**
+
+7. **End-to-end prototype of step 3 → DONE, measured (§2.4 before/after).** A
+   prototype patched the opener (`open_dataset_head_for_write`, `table_store.rs:174`)
+   to bypass `from_namespace` and open direct-by-URI, rebuilt v0.7.0, and re-ran the
+   sweep: the data term collapsed `31 + 12·depth → flat 4`, total `+18/depth →
+   +5/depth` (residual = the two internal tables, step 2), depth-80 **1618 → 593 ops
+   (2.7×)** — functionally correct on main edge merge, branch create+write+read, and
+   node merge. So step 3's "closes the dominant term outright" is **measured, not
+   inferred**, and the opener bypass is **shippable standalone** (§9 step 3a).
+   **Remaining (not blockers for step 3a's thesis):** the prototype did not cover
+   schema-apply / branch merge / fork-on-first-write / overwrite / concurrent — a
+   production opener change must pass the full `merge_truth_table`/recovery/failpoint
+   suite; and the internal-table cleanup demo (step 2) + the concurrency
+   fault-injection harness (steps 4/5) are still owed.
+
+**Newly surfaced (open):**
+
+8. **CAS-resurrection after cleanup → CONFIRMED VULNERABLE [S]; boundary watermark
+   is a HARD PREREQUISITE for step 2.** SlateDB found this race (RFC-0026 / issue
+   #352): a writer that stalls between computing manifest id `N+1` and creating it
+   can, *after GC deletes `N+1`*, re-create it and observe **false success**.
+   Lance 7.0.0 was traced directly and is **not immune**: version creation is a plain
+   `put_opts(naming_scheme.manifest_path(base, version), PutMode::Create)` /
+   `rename_if_not_exists` (`lance-table commit.rs:1421-1437`, `:1358`) on a
+   version-numbered, **pruneable** path, with **no monotonic/boundary/watermark
+   guard** anywhere in the manifest/commit/dataset path; `cleanup_old_versions` is
+   **timestamp-based** (`cleanup.rs:1086`), so it deletes the very file the only
+   guard (AlreadyExists→rebase) relies on. A stalled publisher whose target version
+   was pruned by step-2 cleanup gets a `PutMode::Create` **success on a non-existent
+   version → false success.** Severity by store: **R2/S3 (lexical, prod) = a silent
+   lost write** (the resurrected version doesn't win V2 latest-resolution, so data
+   lands on a dead branch while the publisher believes it committed); non-lexical =
+   the version hint (`commit.rs:1439`) is overwritten to the stale version and
+   trusted (worse, but not the prod path). **Action:** step 2 ships **only with** a
+   durable **boundary/floor watermark** (GC advances it before deleting; every writer
+   rejects `id <= boundary` after a "successful" create — SlateDB's fix), which also
+   bounds any list-then-read-latest fallback. This was "lowest-risk earliest item";
+   it is now gated (§9 step 2).
+
+---
+
+## 13. External validation (subagents + literature)
+
+Validated read-only against OSS prior art and the DB/distributed-systems canon:
+
+- **SlateDB** (canonical object-store LSM) — tenet-by-tenet ✅ on capture-once
+  snapshot, monotonic-ID manifest (no pointer file — *explicitly rejected* in their
+  RFC-0001), the **epoch fence** (exact match: `FenceableTransactionalObject`,
+  hard-fail, TTL-lease *explicitly rejected* — adopt as specified), background
+  epoch-coordinated compaction/GC, and recovery-on-open. **Adopt-items OmniGraph is
+  missing / under-specifies:** (1) the **boundary-file** CAS-resurrection guard (Q8);
+  (2) **group-commit batching** — coalesce pending `PublishPlan`s into one manifest
+  CAS, directly mitigating the Q1 / §6.5 contention; (3) SlateDB *peels* compaction
+  state *out* of the manifest (RFC-0013) — the **opposite** of Phase 7's fold-*in*;
+  §11 should defend "fold-in (lineage must be atomic with visibility) beats peel-out
+  for us"; (4) **write back-pressure** when cleanup lags (`l0_max_ssts`). **Citation
+  correction:** SlateDB has the per-RPC counter (`InstrumentedObjectStore`) but
+  **not** the flatness-across-history gate — the depth-swept Tier-1 gate (§5.1) is
+  OmniGraph-novel; cite it that way.
+- **Literature** — OCC/MVCC (Kung-Robinson 1981; DDIA ch.7), ARIES redo/undo, the
+  fencing-token canon (Kleppmann — whose motivating example *is* OmniGraph's
+  S3-read-modify-write-paused-past-lease scenario), and the lakehouse genre (Delta
+  Lake VLDB 2020, Iceberg spec, Neon). The spine — OCC-over-MVCC + one atomic
+  manifest CAS + WAL-of-intent recovery + monotonic-epoch fence — is canon-blessed,
+  and OmniGraph **exceeds** Delta/Iceberg on the axis that matters (both are
+  explicitly *single-table*-transactional; the manifest CAS delivers the atomic
+  *cross-table* commit Delta only speculates about). The three scoping caveats are in
+  §7.1.
+- **HelixDB** (embedded LMDB graph DB) — too different a substrate to validate the
+  object-store machinery (LMDB's `commit()` subsumes tenets #2–#8 for free), but it
+  **corroborates tenet #1** (capture-once, thread-by-reference, re-resolution
+  unrepresentable — its `&mut RwTxn`-threaded traversal is the embedded twin of
+  `WriteTxn`) and confirms the bug class is **substrate-induced**. Portable idea for
+  the roadmapped traversal work: adjacency as a *persisted, sorted,
+  label-partitioned projection* keyed by `(node, label)` (vs the cold-rebuilt
+  `TypeIndex`).
diff --git a/docs/dev/testing.md b/docs/dev/testing.md
index 6a62580..941cec6 100644
--- a/docs/dev/testing.md
+++ b/docs/dev/testing.md
@@ -26,6 +26,8 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
 | `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) — reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
 | `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
 | `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
+| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — the RED `internal_table_scans_are_flat_in_history` LOCK, `#[ignore]`'d until internal-table compaction lands) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
+| `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary |
 | `lifecycle.rs` | Graph lifecycle, schema state |
 | `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
 | `changes.rs` | `diff_between` / `diff_commits` |
@@ -69,9 +71,10 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
 
 ## RustFS / S3 integration
 
-CI runs three S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml` → `rustfs_integration` job):
+CI runs these S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml` → `rustfs_integration` job):
 
 - `cargo test -p omnigraph-engine --test s3_storage`
+- `cargo test -p omnigraph-engine --test write_cost_s3` (RFC-013 step 3a's data-table opener cost gate — flat across commit depth on S3; the term local FS can't reproduce)
 - `cargo test -p omnigraph-server --test s3` (single-graph serving + config-free `--cluster s3://` boot)
 - `cargo test -p omnigraph-cluster --test s3_cluster` (full control-plane lifecycle on the bucket)
 - `cargo test -p omnigraph-cli --test system_local local_cli_s3_end_to_end_init_load_read_flow`
@@ -126,7 +129,7 @@ When you pick up any change, walk through this:
 6. **For substrate-touching changes** (Lance behavior), reach for `failpoints` or fixture-driven scenarios, not stubbed-out mocks.
 7. **For server / API changes**, confirm the OpenAPI regeneration happens in `openapi.rs` and that the diff lands in `openapi.json`.
 8. **Verify your change makes an existing test fail before it makes the new one pass.** If you can break the code without breaking a test, your coverage gap is the problem to fix first.
-9. **Bound hot-path cost at history depth.** If the change touches a read or open path, add or extend a test that asserts a *bounded* cost (e.g. a warm same-branch read performs zero `Dataset::open`, or a fixed object-op count) against a fixture with realistic *commit-history depth*, not just realistic row counts. Cost that scales with history is invisible on a shallow fixture and only bites in production. See "Cost-budget tests" below.
+9. **Bound hot-path cost at history depth.** If the change touches a read, **write**, or open path, add or extend a test that asserts a *bounded* cost (e.g. a warm same-branch read performs zero `Dataset::open`, or a per-write read-op count flat across commit depth) against a fixture with realistic *commit-history depth*, not just realistic row counts. Reuse the shared `helpers::cost` harness (`measure`/`IoCounts`/`assert_flat`) — don't hand-roll `IOTracker` wiring. Cost that scales with history is invisible on a shallow fixture and only bites in production. See "Cost-budget tests" below.
 
 ## Cost-budget tests: bound hot-path cost at history depth
 
@@ -134,6 +137,7 @@ Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and deg
 
 - **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query."
 - **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost.
+- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest`/`_graph_commits` fragment scans) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears.
 - This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history).
 
 When in doubt, re-read [docs/dev/invariants.md](invariants.md) — quality gates apply to every change.