omnigraph/docs/dev/handoff-rfc-013-write-path.md

# Handoff: finishing RFC-013 (write-path latency + correctness)

**Status:** living handoff. **Source of truth is [`rfc-013-write-path-latency.md`](rfc-013-write-path-latency.md)** —
this doc is the *current-state map + the decisions/validation from the latest work cycle
+ the concrete next actions*. When they disagree, the RFC wins (and fix this doc).

**Audience:** the engineer/agent who picks up RFC-013 next.

---

## 0. TL;DR — where we are and what's next

RFC-013 makes the write path fast **and** correct on object storage (217 Lance tables
under one `__manifest` catalog, on R2/S3). It is sequenced as steps; read §9 of the RFC
for the canonical list. Current reality:

**Landed on `main`:**
- **Step 1** — Tier-1 cost gate + the shared `helpers::cost` harness (#288).
- **Step 3a** — opener bypass: write opens go direct (`Dataset::open` by URI + version)
  instead of the Lance-namespace builder (#288). **This already banked the dominant
  depth win** — see §2 below; it reframes everything.
- **Step 2a** — internal-table compaction: `optimize` now compacts `__manifest` /
  `_graph_commits` / `_graph_commit_actors` (#291). Plus the RFC latency-model
  correction (#292).
- **Optimize-vs-write race** — optimize survives a cross-process write race on the
  same table (#297, **LANDED** — origin/main `6d4606a8`; see §6 for why it's not
  redundant with Design A). Step 3b stacks on top of this.

**Open PRs (land these; relationships in §7):**
- **#296** `correctness-by-design-fix` — recovery roll-forward converges on a concurrent
  manifest advance (the fix for the flaky `iss-schema-apply-reopen-recovery-race`).
  **MERGED to main and integrated into this branch** — the converge helper now threads
  Phase-7's manifest-CAS recovery `graph_commit_id` (see `converge_or_defer_roll_forward`).
- **#295** `docs/rfc-013-step-3b` — the step-3b RFC doc.
- **#254** `ragnorc/bug-4-schema-apply-occ` — schema-apply vs optimize false-fail
  (same op-class family as #297, logical side).

**Step 3b is DONE** (capture-once `WriteTxn`, schema-once + open-collapse; see §4) on
`rfc-013-step-3b-writetxn-v2`. **Next: Phase 7 (step 4), then the big one — Design A /
`PublishPlan` unification (step 5)** — see §5, the convergent fix for the bug *class* this
area keeps generating, which also absorbs 3b's deferred session-aware write opens.

---

## 1. The corrected mental model (read this before touching anything)

Three reframes from the latest cycle that the older RFC prose may not fully reflect:

### 1a. 3a already won the depth fight → the residual is constant-factor + RTT
Before 3a, the write re-opened each table through the lance-namespace builder ~13×, and
that path was **O(depth)** (it re-opened `__manifest` + `list_table_versions` per open —
**not** a Lance back-walk; the root cause was OmniGraph's own namespace round-trips, not
Lance — validated against Lance source). 3a swapped it for the direct opener, which is
**O(1)** (`from_uri(loc).with_version(N)` = arithmetic path + one HEAD). So:

- The dominant **O(depth) data-table** term is **gone**.
- Step 2a flattened the secondary **internal-table** scan term.
- What remains is the **~110-hop serial backbone × RTT + compute** — a constant in
  depth. The latency model is **`wall = (serial_hops + ops/effective_concurrency)·RTT
  + compute`**; on a capped store (R2) the op-count term re-enters wall-clock, on an
  unlimited store it parallelizes away. Measured: prod one-row write 27→15.76s after
  2a; the remaining 15.76s is the serial backbone — **step 3b's target**, not step 2's.
- Step 3b's win is therefore the **call-count/RTT collapse** (redundant opens, the
  flat-46 schema reads), NOT a depth slope. Don't expect a depth-slope improvement from
  3b; gate it on the constant-factor (S3 round-trips), not a curve.

### 1b. Two op classes, two commit models (the §6.6 principle)
Every concurrency bug in this area is **one op class using the other's commit model**:

| class | examples | commutes? | correct commit model |
|---|---|---|---|
| **maintenance** | compaction (`Rewrite`), `optimize_indices` | yes (content-preserving) | Lance native rebase + app reopen/replan on real overlap + **monotonic manifest fast-forward** — no epoch, no read-set |
| **logical mutation** | load / mutate / merge / delete | no (lost-update, write-skew) | strict cross-process OCC: read-set + write-set CAS under the `writer_epoch` fence |

Applying strict OCC + equality-CAS uniformly is the mistake: too strong for maintenance
(false conflicts — #297's bug), too weak for logical cross-process (§6.5 corruption).

### 1c. The root liability (what keeps generating these bugs)
Lance gives **per-table atomic commits** but **no cross-table/cross-step atomicity**, so
every multi-commit op advances per-table Lance HEAD **before** the manifest references it
(the "A-before-B window"). The resulting `HEAD vs manifest` delta is **ambiguous**
(external drift? my own in-flight work? a crashed writer?), and **many uncoordinated code
paths each re-interpret it** (4 writers + the maintenance path + recovery + the write-path
drift guard). Each interpreter is a fresh chance to misclassify. That is the bug class:
- §6.5 cross-process logical corruption,
- #297's own-HEAD-drift misclassification,
- the flaky write-path "HEAD ahead of manifest, run repair" guard,
- the recovery classifier edges.

**The convergent fix is Design A (one publish authority — step 5); Lance MTT eventually
retires the window entirely.** See §5.

### 1d. The second facet: the write base is a stale pin (no probe)
The READ path resolves its base behind a freshness probe (`resolve_target_inner`
omnigraph.rs:~1072 → `probe_latest_incarnation` → `refresh_manifest_only`); the WRITE path
does NOT (`resolved_branch_target` omnigraph.rs:~778 returns the warm `coord.snapshot()` for
the bound branch, no probe). So a long-lived server's write base lags the live manifest. That
single staleness feeds **two distinct failure modes**, both surfaced this cycle:

1. **Stale validation *reads* → integrity under-enforced.** Write-path RI checks read
   committed state off the stale base. 3b's collapse #1 made it worse for edge `@card`:
   `edge_cardinality_read_handle` (mutation.rs:~614) scans the pinned `txn.base` instead of
   live HEAD (was live HEAD pre-3b), so a concurrent edge committed after `txn` capture is
   uncounted → a `@card` max can be exceeded (cursor **High** / codex **P1** on #298,
   **VALID**). **#298 fix: restore the live-HEAD read for that scan** (un-regress; gate-safe —
   the `data_open_count` gate is a node insert) + a deterministic regression test (commit A's
   edge, then B validates → must see A) + correct the wrong "pinned base == live HEAD" doc
   comment (mutation.rs:~605-613, which assumes a single writer). The *structural* liability
   underneath: there is **no unified write-validation read-set** — endpoint
   (`ensure_node_id_exists`, warm `snapshot_for_branch`), cardinality (mutation: pinned
   `txn.base`; loader: warm `snapshot_for_branch` — the SAME check forks per write path),
   commit drift guard (live `fresh_snapshot_for_branch`), and uniqueness
   (`enforce_unique_constraints_intra_batch`, intra-batch only — cross-version uniqueness is a
   documented gap). Three freshness levels chosen ad hoc, none re-validated at commit → the
   §7.1 TOCTOU class, and each new constraint forks the pattern again.

2. **Stale OCC *pin* → false-fail on a maintenance advance.** A served strict update/delete
   pins the stale base version, then false-fails `ExpectedVersionMismatch` after an external
   `optimize` advanced `__manifest` — even though the advance was content-preserving
   compaction the logical write should fast-forward past (invariant 7). It's the **write-side
   mirror of #297/§6.6** (#297 made optimize fast-forward past a logical write; this is a
   logical write that must fast-forward past optimize). A served read clears it (the read
   probes the shared coordinator). Validated repro on prod (omnigraph.ragnor.co) +
   `writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes`
   (`#[ignore]` on branch `fix/write-path-stale-view-probe`). **The naive "just probe" fix is
   proven wrong** — a blanket probe silently refreshes past *logical* advances too, breaking
   `consistency::stale_handle_public_mutation_must_refresh_then_retry` (the deliberate
   cross-process lost-update OCC primitive). The fix must **discriminate by op class**.

**Both fold into Design A (step 5), same as §1c.** `open_txn`'s one warm probe makes the base
fresh (absorbs maintenance advances cheaply); the **op-class-aware strict precondition** —
derive from Lance's per-version transaction metadata (all `Rewrite`/`ReserveFragments` =
maintenance → fast-forward the pin; any `Append`/`Update`/`Delete`/`Merge` = logical → fail
loudly; NO parallel marker, invariant 1/15) — is the correctness fence for anything that lands
after. And the §7.1 read-set-in-CAS unifies the validation read-set + re-validates it under the
`graph_head` contention. So **the stale-view false-fail, the cardinality/validation-read-set
liability, and #297's mirror are one bug** (the write base is a stale, un-probed, un-classified
pin) with **one home: the single PublishPlan delta-interpreter** (§1c + §5). Strong corroboration
of Design A — three symptoms, one fix.

---

## 2. Validated facts — do NOT re-derive these

Established this cycle against **Lance 7.0.0 source**
(`~/.cargo/registry/src/index.crates.io-*/lance-7.0.0`) and current engine code. Cited so
you can trust them without re-investigating.

**Lance (upstream):**
- `from_uri(loc).with_version(N).load()` and `checkout_version(N)` are **O(1)** (computed
  V2 path `_versions/{u64::MAX-N:020}.manifest` + one HEAD; no listing/back-walk).
  (`lance-table/src/io/commit.rs` `default_resolve_version`.)
- A shared `Arc<Session>` (`DatasetBuilder::with_session`) warms metadata/index caches
  keyed by `(URI, version, e_tag)`. Caveat: the *first* manifest read on open is uncached
  — the Session warms the *scan/index* metadata, not the first open. **`WriteParams` *does*
  carry a `session` field** (`lance/src/dataset/write.rs`), but it only matters on the
  `WriteDestination::Uri` arm; OmniGraph's staged path always drives off an **already-open
  `Dataset`**, and Lance takes the store/session from that handle. So to attach the shared
  Session to a write base, open read-style (`open_table_dataset` → `from_uri().with_version()
  .with_session()`) and drive the staged write off that handle.
- A held `Arc<Dataset>` at a pinned version is `Send + Sync`, immutable, safe to reuse for
  many scans/count/staged-write base in one txn (OmniGraph's `TableHandleCache` already
  relies on this).
- **No compaction `RetryExecutor`** (only Delete/MergeInsert/Update have one).
  `commit_compaction` commits a fixed `Rewrite` via `apply_commit` direct. In
  `commit_transaction`, a semantic `RetryableCommitConflict` **escapes the retry loop**
  via `?` at `io/commit.rs:979`; the loop only retries the OCC `CommitConflict`
  (`:1096`), and even that re-rebases the *same* transaction (never re-plans). ⇒
  **compaction needs app-level reopen+REPLAN; you cannot "set conflict_retries" and let
  Lance own it.**
- `check_rewrite_txn`: a `Rewrite` rebases **cleanly** past a concurrent `Append`/disjoint
  `Update`/`Delete` (preserving both); only a same-fragment overlap yields a retryable
  conflict. ⇒ the common concurrent insert/update/delete is rebased for free; the app
  retry fires only on real overlap.

**Engine (internal):**
- Read path (post-#268) already has the capture-once machinery: `Snapshot` (`db/manifest.rs`),
  warm `GraphCoordinator` behind a `latest_version_id`/incarnation probe, a held
  `TableHandleCache` keyed `(table,branch,version,e_tag)`, **one shared `Session` per
  graph** (`read_caches.session`). **Writes bypass all of it by construction**
  (`resolved_branch_target` returns `read_caches: None`; the 3a write opener attaches no
  session and opens by latest, not pinned version).
- A single write opens each table **3–4×** (accumulation → staging reopen → commit
  drift-guard → publish prepare), each a fresh cold open. `validate_schema_contract`
  (`db/schema_state.rs`, via `ensure_schema_state_valid`) runs uncached (~3 `read_text`
  + 2 `exists`) at every resolve point (~the flat-46). Both are constant-factor, flat in
  depth — 3b's targets.
- Strict-op guards are the lost-update floor (3 layers: pre-stage `ensure_expected_version`
  `table_store.rs`; commit-time strict drift `exec/staging.rs`; publisher CAS
  `publisher.rs`). Capture-once **supplies** the pinned operand — never remove a guard.
- Fork-on-first-write authority reads (`classify_fork_ref` → `fresh_snapshot_for_branch`)
  must stay **fresh** (not served from a pinned base).
- Cost harness: `helpers::cost` (`measure`/`measure_with_staged`/`IoCounts`/`assert_flat`/
  `local_graph`/`s3_graph`). The schema-once assert can reuse `CountingStorageAdapter`
  (`warm_read_cost.rs::warm_query_validates_schema_contract_once`) with **zero** prod
  change; an open-count assert wants a small `open_count` AtomicU64 in `QueryIoProbes`
  (copy the `probe_count`/`record_probe` pattern). The forbidden-API guard
  (`tests/forbidden_apis.rs`) makes an instrumentation-level counter complete.

---

## 3. The #297 cycle (this branch) — what it is, and the lesson

`fix-optimize-concurrency-race` (5 commits): a CLI `optimize` racing a served write on the
same table failed (Lance Rewrite lost, or the equality-CAS publish lost). Fix: unify both
compaction paths on the internal path's **reopen+replan** shape, with a **two-level retry**
— outer loop reopens+replans on a real Lance overlap; inner Phase-C loop makes the manifest
publish a **monotonic fast-forward** (advance to compacted version `N`, or no-op when the
manifest already moved to `≥ N`), never the strict equality CAS. Sidecar written once;
in-process queue kept as a contention reducer (not the cross-process guard); no `writer_epoch`.

**Two review rounds surfaced two follow-on bugs I introduced with the retry loop** — both
fixed, both regression-tested (own-HEAD-drift via negative control):
1. **Own-HEAD-drift misclassification** (`56d004e0`): the drift guard re-ran every
   iteration and, after a partial Phase-B commit (auto_cleanup strip or compact, then a
   later op conflicts), saw `HEAD > manifest` from *our own* covered work and deleted the
   sidecar + returned `skipped_for_drift` (stranding uncovered drift). Fix: track
   `head_advanced`; the drift guard fires only when `!head_advanced`.
2. **Publish exhaustion spurious error** (`e9d16a2c`): the publish loop returned `Err` on
   its final retry even if the conflict meant a concurrent writer already published `≥ N`
   (postcondition met). Fix: re-check `current >= state.version` on exhaustion.

**The lesson (write it on the wall):** *wrapping a sequence of side-effecting commits in a
retry silently converts every "checked once, before any side effect" precondition into
"re-checked after partial side effects."* That's a distinct bug class; it needs
fault-injection tests **at each commit boundary**, not just end-to-end concurrency tests.
(The `optimize.before_compact` / `optimize.inject_reindex_conflict` failpoints exist for
exactly this.)

**Temporary mechanism flag:** `head_advanced` is an in-memory proxy for "is this HEAD
movement mine." Under Design A the authority answers that from the plan/sidecar **identity**
— so `head_advanced` is the part that gets *replaced*, while the monotonic-publish +
reopen/replan **semantics** are permanent. (Noted in RFC §6.6.)

---

## 4. DONE: Step 3b — capture-once `WriteTxn` (shipped on `rfc-013-step-3b-writetxn-v2`)

**Delivered:** on the **table-touch hot path**, a single `mutate`/`load` validates the schema
contract **once** and opens each touched data table **at most once** — a constant-factor/RTT
win (not a depth-slope win; 1a). Two cost gates in `write_cost.rs` lock it (both on a node
insert): `write_validates_schema_contract_once` (3 `read_text` / 2 `exists`, was 12/9) and
`keyed_insert_opens_table_at_most_once` (`data_open_count <= 1`, was 4). The carrier is the
minimal `WriteTxn { branch, base }`, threaded as `Option<&WriteTxn>` (`Some` on the hot
mutate/load path, `None` byte-identical everywhere else); it **converges into** step 5's
`PublishPlan`.

**Not "once" everywhere (scope, not regression):** edge endpoint / cardinality RI validation
(`ensure_node_id_exists`, the loader's RI + cardinality) still resolves through
`snapshot_for_branch` and re-validates the schema — and reads **warm**, not live. Threading
`txn.base` there to make it "once" would re-introduce the stale-read class the #298 cardinality
fix removed (it now reads live HEAD). Doing schema-once *and* fresh reads for those validations
needs the unified, re-checked read-set — **step 4 §7.1** (§1d). So #298 **un-regresses
cardinality only; it does not close write-validation freshness.** No edge-insert/load schema-once
gate yet (only the node gates above).

Commits (off merged-#297 main):
- **Stage 0** — scope `open_count` → `data_open_count`/`internal_open_count` by URI class
  (the review fix: `open_dataset_tracked` also opens `__manifest`/`_graph_commits`, so the
  raw counter conflated them and the gate was unreachable). Re-baselined RED 4.
- **Commit A (schema-once)** — capture `txn` once at entry (the single validation); the 4
  validation sites collapse: S1 (entry `ensure_schema_state_valid`) removed; S3a
  (`open_for_mutation_on_branch`) + S3b (`prepare_updates_for_commit`) source `txn.base`;
  S4 (`commit_all`) uses new `fresh_snapshot_for_branch_unchecked` (the OCC manifest re-read
  minus the schema re-validation). `fresh_snapshot_for_branch{,_unchecked}` now read the
  manifest directly via `ManifestCoordinator` (drops a spurious commit-graph `exists` probe;
  same `Snapshot`).
- **Commit B (open collapse 4→1)** — #1 accumulation open ELIMINATED (the node path discarded
  the handle; read `txn.base.entry().table_version`); #2 staging open KEPT (the one open);
  #3 commit drift-guard reads live HEAD via `entry.dataset.dataset().latest_version_id()` (a
  cheap manifest-pointer probe off the staged handle, not a fresh open); #4 index build reuses
  the `commit_staged` handle threaded through `CommittedMutation`/`prepare_updates_for_commit`.
- **Commit B.1 + cleanup** — named the two positional returns (`OpenedForMutation`,
  `CommittedMutation`) + a `debug_assert` pinning the open-skip contract; **removed the
  unearned `WriteTxn.session` field** (the collapse uses skip/probe/reuse, not a session).

**RFC §4.1 corrections — how they resolved:**
1. *Thread the evolving handle, not a version-keyed cache* → realized as collapse #4 (carry
   the `commit_staged` handle forward into the index build).
2. *Don't forbid re-resolution* → honored: the commit-time OCC re-read
   (`fresh_snapshot_for_branch_unchecked` — fresh manifest, only schema-revalidation dropped)
   and the fork-authority reads stay fresh.
3. *Minimal carrier* → `WriteTxn { branch, base }` (even the `session` from the original
   sketch was dropped as unearned).

**Deferred to step 5 (NOT in this PR):** session-aware write base opens. The one remaining
open (#2) stays a HEAD open; warming the shared `Session` across writes is an object-store
(S3) phenomenon invisible on local FS, so it earns its own `write_cost_s3.rs` gate in step 5,
where `txn` becomes the non-optional publish carrier. No new concurrency test was needed here:
#2 stays a HEAD open (no pinned+session base introduced), so the publisher CAS + #3 live-HEAD
probe fences are unchanged (covered by the green `writes.rs`/`consistency.rs`).

**Guardrails (don't regress):** schema validation is deliberately uncached for drift
detection — collapse to 1 *per write*, never cache across writes on a long-lived handle
(`lifecycle::long_lived_handle_rejects_schema_*`). The commit-time fresh read is OCC
machinery, not redundancy. Keep all 3 strict-op guards. Keep fork-authority reads fresh.
Pin the *correct* branch (server-bound-to-main writing a feature branch falls to a fresh
open). A branch `rfc-013-step-3b-writetxn` exists off an earlier main; rebase onto the
post-#297 main before starting.

---

## 5. Design A — the `PublishPlan` unification (step 5) = the convergent fix

**This is the real fix for the bug class in §1c.** Collapse the four hand-rolled writers +
the maintenance path into **one `publish(txn, plan)` authority** where the CAS + bounded
retry is **unconditional and unbypassable** (no caller can "hold the queue → skip the CAS").
Properties:
- **One interpreter of the `HEAD vs manifest` delta** — and "is this my work?" is answered
  by the plan/sidecar **identity**, not a re-derived comparison. The own-HEAD-drift bug, the
  §6.5 writers, the write-path guard — all close *by construction*.
- **Recovery = the same `PublishPlan` re-applied** — the crash-recovery interpreter and the
  live interpreter become the same code (`iss-merge-recovery-partial-rollforward` gone).
- Each `TableAction` commits by its **class** (§1b): `Rewrite` = maintenance (Lance rebase
  + reopen/replan + monotonic fast-forward, **no epoch**); load/mutate = logical (strict OCC
  + `writer_epoch`).

**Why it composes with Lance MTT (don't over-build):**
- The **unification itself is convergent** — when MTT lands, it slots *underneath* the same
  authority; nothing wasted. Build this.
- The **`writer_epoch`** is the one MTT-redundant piece (MTT's commit-handler lease subsumes
  a cross-process fence). Build it *last and minimally*, gated on actually deploying
  multi-writer topologies. Per the deny-list, don't reimplement what the substrate will own.

**Sequencing judgment (this cycle's strongest signal):** the bug density here (this PR alone
= 3 review rounds, all "a writer re-interprets the delta") means the current N-writers interim
is high integrated-over-time liability. **Consider pulling the *convergent half* of step 5
(the single authority + recovery-as-plan) forward — possibly ahead of 3b** — because it stops
the bug class rather than patching instances. #297 + #254 are the *de-risking inputs*: they
validate the maintenance-class and logical-class commit models in isolation first, so Design
A implements a known spec rather than designing under refactor pressure. Do NOT build more
substrate-shaped scaffolding (custom WAL / job queue / second coordination table) to paper
over the window — strictly higher liability than either Design A or waiting for MTT.

**Deeper-than-A (post-MTT or as Lance exposes uncommitted variants):** all-uncommitted-fragments
+ one manifest commit would shrink the A-before-B window itself, blocked today by Lance not
exposing uncommitted variants for `compact_files` / `optimize_indices` / vector index (#6666
open; delete #6658 shipped). Track, don't build yet.

### 5.1 Step-5 design constraints inherited from the #295 spec review
3b shipped a **minimal** `WriteTxn { branch, base }` (schema-once + open-collapse via
eliminate/probe/thread) and **deferred** the full §4.1 opener-unification — the pinned-base
opener, the shared-`Session` open, the write-local **handle cache**, and the strict-op
conflict-timing move — to step 5. So the greptile-bot comments on the #295 *spec* were **moot
for #298** (which built none of those constructs) but are **load-bearing constraints for step
5** when it builds them. Bank them:
1. **Handle cache must be `Send + Sync`** (`Mutex<HashMap<…, Dataset>>`, not `RefCell`) if
   `WriteTxn::open(&self)` is shared across concurrent stage futures — a `RefCell` compiles
   but panics when two stages poll. Or make it `&mut self` (no parallel-stage sharing). This
   is the deny-list "in-process-only `Dataset` impls — `Send + Sync`" item.
2. **The strict-op timing move needs an explicit retry contract.** If step 5 moves
   strict-op conflict detection from open-time `ensure_expected_version` to commit-time CAS
   (the §4.1 pinned-base design), it MUST specify: the txn is **discarded after any commit**
   (success or conflict — the handle cache is commit-invalidated), and the retry **re-opens a
   fresh `WriteTxn` at the new HEAD** (never re-stages against the stale pinned base — that
   reproduces the lost-update). **This is the same retry/refresh contract as the stale-view
   false-fail (§1d.2)** — the op-class-aware precondition + "fresh base on retry" are one
   design point. Today (#298) strict ops keep open-at-HEAD + `ensure_expected_version`, so the
   contract is unchanged; step 5 owns it the moment it pins strict reads to the base.
3. **The opener-equivalence test must be non-trivial.** A differential test that only passes
   when `HEAD == base` proves nothing about pinning. To actually prove "`WriteTxn::open`
   returns the pinned base, not HEAD," the test must **advance the branch HEAD externally
   (direct Lance write), then assert the txn open still reads the base version** — and that a
   strict write then fails `ExpectedVersionMismatch` at commit (verifying the timing move).

---

## 6. Why #297 is still needed even if you do Design A
- Design A **relocates** #297's maintenance-class commit logic into the authority's
  `TableAction::Rewrite` path; it does not eliminate it. #297 is the *validated spec + tests*.
- The two regression tests + §6.6 are the **contract** Design A must keep green.
- The prod bug is **live**; Design A is the largest write-path change in the RFC. Don't hold a
  correctness fix hostage to a big refactor, and don't do a big refactor under bug-fix urgency.
- Genuinely throwaway under Design A: only the loop's *location* + the `head_advanced` proxy
  (~a dozen lines). Everything else relocates or persists. **#297 LANDED.**

---

## 7. Open PRs and their relationships
- **#297** — maintenance-class fix (optimize vs write). **LANDED** (origin/main `6d4606a8`);
  step 3b stacks on it.
- **#254** — logical-class fix (schema-apply vs optimize false-fail). Same op-class family;
  both are de-risking inputs for Design A's per-class commit models.
- **#296** — recovery roll-forward converges on concurrent manifest advance. The fix
  for the flaky `iss-schema-apply-reopen-recovery-race`. It touches `recovery.rs` and is
  *aligned* with #297's "postcondition is the state, not winning the CAS" principle. **#296
  landed on main first and is merged into this branch:** the converge helper
  (`converge_or_defer_roll_forward`) was reconciled with Phase-7's manifest-CAS roll-forward —
  on convergence the audit references the winner's folded `graph_commit_id` (the current
  `graph_head`), not a freshly minted one.
- **#295** — the step-3b RFC doc (apply §4's three corrections to it).

---

## 8. Remaining RFC steps after 3b (RFC §9 is canonical)
- **#298 follow-up (do on the 3b PR, before merge): the edge-`@card` stale-read regression**
  (§1d.1). Restore the live-HEAD cardinality scan, add the deterministic regression test, fix
  the wrong doc comment. Small, gate-safe, un-regresses an integrity check (invariant 9). The
  residual concurrent TOCTOU is the §7.1 gap (step 4) — un-widen here, don't over-reach.
- **Step 4 / Phase 7** (`iss-991`): lineage into `__manifest` (publish `graph_commit` +
  mutable `graph_head:<branch>` in the same merge-insert; `_graph_commits` becomes a
  projection). Removes the per-write `commit_graph.refresh`; closes the manifest→commit-graph
  atomicity + commit-graph-parent-under-concurrency gaps. **Hard prereq: step 2 (done).**
  Carries the §7.1 *concurrent* write-skew fix (needs the `graph_head` contention row) —
  **frame §7.1 as "unify the entire write-validation read-set" (endpoint + cardinality +
  cross-version uniqueness), not merely "add `graph_head`"** (§1d.1): the bespoke
  `edge_cardinality_read_handle` and the mutation-vs-loader freshness fork dissolve into one
  pinned read-set re-validated under the `graph_head` contention, or the liability survives as
  a second special-case.
- **Step 5 / Design A** — §5 above. **Acceptance item: the served-strict-write stale-view
  false-fail** (§1d.2) — the op-class-aware precondition + `open_txn` probe. The contract is
  two tests passing *together*: un-ignore
  `writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes` (goes green)
  *while* `consistency::stale_handle_public_mutation_must_refresh_then_retry` stays green
  (maintenance fast-forwards; logical fails loudly). Self-contained enough to ship standalone
  like #297 if prod pain is acute; otherwise fold into the single PublishPlan delta-interpreter.
- **Step 2b** — internal-table cleanup + the Q8 monotonic watermark (a Lance boundary tag).
  Deferred: only the secondary version-count/space term, touches the read/open path, and is
  MTT-redundant. Land when version-count cost bites.
- **§7.1 sequential write-skew** (`iss-overwrite-orphans-committed-edges`) — inbound-RI
  validation on node removal; independent, ships anytime.
- **#20** — the prod per-write `storage.ops` span metric (RFC §5.3), still owed.
- Branch ops: Lance `Clone` for create (`iss-691`).

---

## 9. Gotchas / traps (learned the hard way)
- **In-process queue ≠ cross-process lock.** Any "I hold the queue → skip the retry/CAS"
  reasoning is a bug across processes. This is the recurring trap.
- **Monotonic publish must be `≥`-conditional, never "no assertion."** The `__manifest`
  merge-insert is unconditional `UpdateAll` keyed on `object_id` (`publisher.rs:379`), so
  the equality (or monotonic) pre-check is the *only* guard — dropping it lets `UpdateAll`
  regress a newer version = lost write.
- **The drift guard interprets an ambiguous delta.** Re-evaluating it in a retry over
  self-mutated state is how #297's follow-on bug happened. Gate any HEAD-vs-manifest
  interpretation on "have *we* committed yet."
- **`compact_files` fires Lance's auto_cleanup GC hook** (commits with
  `skip_auto_cleanup=false`, no override) — optimize strips stale `lance.auto_cleanup.*`
  config before compacting to stay non-destructive on upgraded graphs. The strip is a
  separate commit (relevant to the partial-commit retry trap).
- **Lance rebases the common concurrent case for free** — so the data-table conflict usually
  surfaces as the manifest fast-forward, not a Lance error. The Lance-Rewrite-overlap path is
  rare and needs failpoint injection to test.

---

## 10. Verification (the gate)
- `cargo test --workspace --locked` — the canonical gate (matches CI).
- `cargo test -p omnigraph-engine --features failpoints --test failpoints optimize` —
  the optimize concurrency/recovery tests.
- `cargo test -p omnigraph-engine --test write_cost` / `write_cost_s3` (bucket-gated) —
  cost gates (3b adds the schema-once + open-count asserts here).
- `cargo test -p omnigraph-engine --test maintenance` — optimize/repair/cleanup.
- Re-read [`invariants.md`](invariants.md), [`lance.md`](lance.md), [`testing.md`](testing.md)
  before each change (always-on requirement).

Lance source for re-validation:
`/Users/ragnor/.cargo/registry/src/index.crates.io-*/lance-7.0.0` (key files: `io/commit.rs`,
`io/commit/conflict_resolver.rs`, `dataset/optimize.rs`, `dataset/write/retry.rs`,
`dataset/builder.rs`).
-												feat(engine): `WriteTxn` - validate schema + open each data table once per write (#298)

* docs(rfc-013): step-3b handoff + §4.1 corrections (validated)

Add the RFC-013 write-path handoff doc, and correct §4.1's WriteTxn sketch from the
4-subagent validation against current code:
- HandleCache → handle-threading (forward the commit-return handle; a version-keyed
  cache misses because HEAD walks N→N+1→N+2 across staging + index-build commits).
- "re-resolution unrepresentable" softened to "pinned base for the pre-commit phase +
  named fresh re-reads at the commit/fork boundary" — three reads (commit-time OCC, the
  live-HEAD drift probe, fork authority) are irreducible correctness machinery.
- WriteParams DOES carry a session field; the real constraint is "stage off an open
  Dataset," so attach the Session by opening read-style then staging off it.

* test(engine): RED step-3b capture-once fitness asserts + open_count probe

Two write-path cost gates, RED today, GREEN after the WriteTxn lands:
- write_validates_schema_contract_once: a write must validate the schema contract
  once (3 read_text + 2 exists). Today re-validates at every resolve point —
  measured 12 read_text / 9 exists (~4 validations) via CountingStorageAdapter
  (zero production change; the write twin of the read-path schema-once test).
- keyed_insert_opens_table_at_most_once: a keyed single-table write must open its
  table <=1x. Today measured 10 opens.

Adds an exact open-CALL probe: open_count + record_open() on QueryIoProbes (mirroring
probe_count/record_probe), called at both open chokepoints; surfaced as
IoCounts.open_count. forbidden_apis guarantees every write open routes through them.

* feat(engine): WriteTxn carrier + open_write_txn (3b scaffolding)

The capture-once write transaction (RFC-013 step 3b): WriteTxn{branch, base:
Snapshot, session} + Omnigraph::open_write_txn, which validates the schema contract
once and pins the base snapshot + the shared per-graph Session.

Landed as reviewed scaffolding (gated #[allow(dead_code)]); the next pass threads
Option<&WriteTxn> through open_for_mutation_on_branch / staging on the non-strict
bound-branch path — opening the base once from the pinned entry with the warm session
(a session-aware pinned opener returning a SnapshotHandle) and skipping the per-table
schema re-validation — to turn the two RED cost gates green. Strict ops / fork / the
commit-time OCC re-read keep their fresh reads.

* test(engine): scope write-path open_count to data tables (RFC-013 step 3b)

The keyed_insert_opens_table_at_most_once gate asserted open_count <= 1, but
open_count was a single unclassified counter: record_open() fires in both
open chokepoints, and open_dataset_tracked also opens the internal/system
tables (__manifest via layout.rs, _graph_commits/_graph_commit_actors via
commit_graph.rs). So the count conflated data-table opens with the publisher
CAS + commit-graph append opens — making the gate measure the wrong quantity
and unreachable by threading alone (the manifest publish keeps it >1 regardless).

Scope it by table class, mirroring the read-side counters (which already split
by URI prefix via separate wrappers): record_open(uri) classifies the open's
last path segment and feeds data_open_count vs internal_open_count. IoCounts
exposes both; the gate now asserts data_open_count <= 1.

Re-baselined: a single keyed insert is data_open_count=4 / internal_open_count=6
(sum 10, the old conflated value). The RED target for the WriteTxn threading is
now the real data-table-open count (4 -> 1), with internal opens correctly out
of scope. Pure test-harness/instrumentation; no production behavior change
(classification runs only inside the probe closure, skipped when no probes are
installed).

Also marks #297 (optimize-vs-write race) as landed in the step-3b handoff —
this branch is already stacked on origin/main after it merged.

* feat(engine): validate the schema contract once per write (RFC-013 step 3b)

A single mutate/load re-validated the schema contract ~4 times: at the entry
(ensure_schema_state_valid), per-table in open_for_mutation_on_branch
(resolved_branch_target), at the commit-time OCC re-read (fresh_snapshot_for_branch),
and in the publisher's index-build snapshot (snapshot_for_branch). Each validation
is 3 read_text + 2 exists on the storage adapter — O(touched resolve-points) of
redundant contract I/O on every write.

Thread the already-landed WriteTxn carrier through the write path: capture
`txn = open_write_txn(branch)` once at the mutate/load entry (the single validation),
then source the per-table entry and the commit/publish snapshots from `txn.base`
instead of re-resolving. When `txn` is None (branch merge, schema apply, tests) every
function is byte-identical to before.

- mutate_with_current_actor / load_jsonl_reader capture txn once (replacing the
  entry-point ensure_schema_state_valid) and thread Some(&txn) through
  execute_*/open_table_for_mutation, commit_all, and
  commit_updates_on_branch_with_expected.
- open_for_mutation_on_branch sources (snapshot, branch) from txn.base/txn.branch
  when present — skipping resolved_branch_target's re-validation. The OPEN itself is
  unchanged (still HEAD via open_dataset_head_for_write), and strict ops keep
  ensure_expected_version. Schema-once applies to strict and non-strict alike; the
  data-open collapse is a separate change.
- commit_all uses fresh_snapshot_for_branch_unchecked (the OCC manifest re-read minus
  the schema re-validation) when txn is present; the drift guard is unchanged.
- prepare_updates_for_commit uses txn.base for the publisher index-build snapshot.

fresh_snapshot_for_branch{,_unchecked} now read the manifest directly via
ManifestCoordinator instead of resolve_target. The OCC re-read consumes only the
Snapshot (per-table location + version), which ManifestCoordinator::open().snapshot()
produces identically — but resolve_target additionally opened the commit graph (a
spurious _graph_commits.lance exists probe the OCC read never consults). Dropping that
load is a pure read-cost reduction for every fresh-snapshot caller (commit_all's None
arm, optimize, repair, fork reclaim); the returned Snapshot is unchanged and the read
is a fresher cold manifest re-read, so the OCC freshness guarantee is preserved.

Greens write_validates_schema_contract_once (3 read_text / 2 exists, was 12/9).
keyed_insert_opens_table_at_most_once stays red (data_open_count=4) — the open
collapse lands next. Full engine suite green otherwise.

* feat(engine): open each data table once per write (RFC-013 step 3b)

A single keyed-node mutate opened its data table 4 times: accumulation (to read
.version()), staging (the real write base), the commit-time drift guard (to read
live HEAD), and the publisher's index build (reopen at the just-committed version).
Collapse three of the four — using the WriteTxn carrier threaded for schema-once —
so a write opens each touched data table at most once.

- #1 accumulation: open_for_mutation_on_branch now returns
  (Option<SnapshotHandle>, expected_version, full_path, table_branch). On the txn's
  own branch, a non-strict (Insert/Merge) op needs no open — the only thing the
  caller reads is .version() (the CAS fence), which is exactly the pinned base
  version (entry.table_version). So skip open_dataset_head_for_write and source the
  version from txn.base. The node insert path already discarded that handle; the
  edge path resolves a pinned read only when non-default cardinality needs it.
  STRICT ops and any write that must fork still open live HEAD + ensure_expected_version.
- #3 commit drift guard: commit_all reads live HEAD via
  entry.dataset.dataset().latest_version_id() — a cheap manifest-pointer probe off
  the already-open staging handle (the same primitive ManifestCoordinator::
  probe_latest_version uses) instead of a fresh open_dataset_head_for_write. The
  head<current / head>current drift classification is byte-identical.
- #4 index build: commit_all now returns the per-table post-commit_staged
  SnapshotHandle map; commit_updates_on_branch_with_expected threads it into
  prepare_updates_for_commit, which builds indices on the threaded handle instead of
  reopening at the same just-committed version. Absent a handle (other writers,
  inline/delete tables) the reopen path is byte-identical.

When txn is None (branch merge, schema apply, tests) every function opens and checks
exactly as before. Greens keyed_insert_opens_table_at_most_once (data_open_count 4->1).
Schema-once gate stays 3/2. Full engine suite + failpoints (recovery sidecar lifecycle)
green.

* refactor(engine): name the write-path open/commit returns (RFC-013 step 3b)

The open collapse left two positional returns that are easy to mis-thread and
carry an unwritten contract: open_for_mutation_on_branch's
(Option<SnapshotHandle>, u64, String, Option<String>) and commit_all's 5-tuple
(updates, expected_versions, sidecar_handle, guards, committed_handles). Replace
both with named structs so each field reads at the call site and the Option's
contract is documented, not folklore.

- OpenedForMutation { handle, expected_version, full_path, table_branch } with a
  require_handle(ctx) helper for the callers that must have a handle (strict ops,
  the fork path, every no-txn caller — branch merge, the seed test). The handle is
  None only on the non-strict-txn open-skip path (collapse #1); require_handle
  panics with a named context if that contract is ever broken.
- CommittedMutation { updates, expected_versions, sidecar_handle, guards,
  committed_handles } for commit_all; consumers destructure into the same local
  bindings they already used, so the publish/sidecar/guard-hold logic is unchanged.
- A debug_assert in open_table_for_mutation pins the skip contract: a missing handle
  is legal only on the non-strict txn path, so a future strict arm returning None
  trips in debug builds instead of handing None to a require_handle consumer.

Pure refactor — no behavior change. Both cost gates stay green (schema 3/2,
data_open_count=1), full engine suite + lib (162) green.

* refactor(engine): drop the unearned session field from WriteTxn (RFC-013 step 3b)

The open collapse greens data_open_count<=1 by SKIPPING the accumulation open,
PROBING live HEAD with latest_version_id, and REUSING the commit_staged handle —
none of which consume a session. The captured WriteTxn.session was therefore dead
(`#[allow(dead_code)]`): unearned surface a reviewer rightly flags.

Remove it. The carrier is now {branch, base} — exactly what schema-once + the open
collapse use. Step 5 (PublishPlan unification) makes WriteTxn the non-optional
publish carrier and is the right home for session-aware base opens, where the
warm-session benefit on the single remaining open — an object-store (S3) phenomenon,
invisible on local FS — can be earned by its own cost gate rather than carried dead
through this PR.

No behavior change; both cost gates stay green (schema 3/2, data_open_count=1).

* docs(rfc-013): mark step 3b DONE — schema-once + open-collapse shipped, session deferred to step 5

* docs(rfc-013): capture the write-base-staleness convergence (§1d)

Three findings this cycle share one root — the write base is a stale, un-probed,
un-classified pin (the read path probes; the write path returns the warm
coordinator snapshot):

- #298 edge-@card stale-read regression (cursor High / codex P1, VALID): collapse #1
  made the cardinality scan read txn.base instead of live HEAD, so a concurrent edge
  is uncounted and a max can be exceeded. Fix on #298: restore the live-HEAD read +
  deterministic test + correct the single-writer doc comment.
- The structural liability underneath: no unified write-validation read-set —
  endpoint/cardinality/uniqueness each pick freshness ad hoc (warm/pinned/live),
  the same cardinality check forks mutation-vs-loader, none re-validated at commit.
- The served-strict-write stale-view false-fail (validated on prod + a #[ignore]
  repro): a strict update/delete false-fails ExpectedVersionMismatch after an external
  optimize advance — the write-side mirror of #297/§6.6. The naive blanket probe is
  proven wrong (breaks the cross-process lost-update OCC contract).

All three converge on Design A (step 5): open_txn's warm probe makes the base fresh,
the op-class-aware precondition (derive maintenance vs logical from Lance per-version
transaction metadata — no parallel marker) fast-forwards maintenance and fails logical,
and §7.1's read-set-in-CAS unifies + re-validates the validation read-set. §8 records
the #298 follow-up, the widened §7.1 scope, and the step-5 two-test acceptance contract.

* test(engine): RED — edge @card must scan live HEAD, not stale txn.base (#298)

Regression guard for the cursor-High/codex-P1 finding on #298: 3b's collapse #1
made the non-strict edge-insert cardinality scan read the pinned txn.base instead
of live HEAD (edge_cardinality_read_handle), so a concurrent edge committed after
txn capture is uncounted and a @card max is silently exceeded (invariant 9).

Deterministic two-handle test (no failpoint): handle A commits WorksAt(Alice->Acme)
to the @card(0..1) max; stale handle B (never read since) inserts a second WorksAt
for Alice. B's coordinator is stale by construction (the write path doesn't probe),
so B scans txn.base (Alice has 0) and wrongly commits the 2nd edge. RED: the insert
that must be rejected currently succeeds (panics at unwrap_err). Goes green when the
scan reads live HEAD.

* fix(engine): scan live HEAD for edge @card, not the pinned txn.base (#298)

3b's collapse #1 skips the non-strict edge accumulation open, so edge_cardinality_
read_handle reopened the edge table at the pinned txn.base for the @card scan. Since
cardinality is validated once (never rechecked at commit), a concurrent edge committed
after txn capture was uncounted and a @card max could be silently exceeded (invariant
9) — the cursor-High/codex-P1 regression on #298. Pre-3b the scan read live HEAD (the
mutation's own open_dataset_head_for_write handle).

Restore the live-HEAD read: take the table LOCATION from the pinned entry (stable
across versions) and open the dataset at its current HEAD via open_dataset_head_for_
write. Gate-safe — the data_open_count / merge-insert-only gates are node inserts; the
edge cardinality path (non-default @card only) is untouched by them, and the extra
live-HEAD open is exactly the pre-3b shape. Also drops the dead None-fallback's schema
re-validation (greptile P2, auto-resolved). The residual validate->commit TOCTOU is the
pre-existing §7.1 gap (RFC-013 step 4), recorded in handoff §1d/§8.

Turns cardinality_rejected_for_stale_handle_after_concurrent_edge_commit green;
validators / write_cost / writes / consistency / end_to_end / branching all green.

* docs(dev): link handoff docs from index

* docs(engine): tighten 3b claims to match the code (#298 review)

Review caught several comments/docs overclaiming what the code does (the session
drop + the #298 cardinality fix left stale/too-strong wording). No logic change.

- open_write_txn doc: drop the stale "shared per-graph Session" (WriteTxn no longer
  carries one); scope "once" to the table-touch hot path and note edge/load RI
  validation still re-resolves (→ step 4 §7.1) + the session-aware open is step 5.
- edge cardinality call-site comment: it said the scan uses a "pinned txn.base" — it
  now opens LIVE HEAD (#298); corrected.
- write_cost.rs: "opens the base once (with the shared Session)" → session-aware base
  open is deferred to step 5.
- data_open_count completeness (instrumentation.rs + write_cost.rs): forbidden_apis
  only keeps engine code OUTSIDE the storage layer on the chokepoints; table_store.rs
  is allow-listed and holds direct Dataset::opens for branch-management ops (not the
  keyed-write hot path the gate measures). Narrowed the claim accordingly.
- handoff §4: "schema once / open once" is the node hot path (the two gates); edge
  endpoint + loader RI/cardinality still re-validate and read warm — #298 un-regresses
  cardinality only, it does NOT close write-validation freshness (that's step 4 §1d/§7.1).

build clean; write_cost / validators / forbidden_apis green.
											
										
										
											2026-06-23 21:27:31 +02:00
+								# Handoff: finishing RFC-013 (write-path latency + correctness)
 								**Status:** living handoff. **Source of truth is [`rfc-013-write-path-latency.md`](rfc-013-write-path-latency.md)** —
 								this doc is the *current-state map + the decisions/validation from the latest work cycle
 								+ the concrete next actions*. When they disagree, the RFC wins (and fix this doc).
 								**Audience:** the engineer/agent who picks up RFC-013 next.
 								---
 								## 0. TL;DR — where we are and what's next
 								RFC-013 makes the write path fast **and** correct on object storage (217 Lance tables
 								under one `__manifest` catalog, on R2/S3). It is sequenced as steps; read §9 of the RFC
 								for the canonical list. Current reality:
 								**Landed on `main`:**
 								- **Step 1** — Tier-1 cost gate + the shared `helpers::cost` harness (#288).
 								- **Step 3a** — opener bypass: write opens go direct (`Dataset::open` by URI + version)
 								  instead of the Lance-namespace builder (#288). **This already banked the dominant
 								  depth win** — see §2 below; it reframes everything.
 								- **Step 2a** — internal-table compaction: `optimize` now compacts `__manifest` /
 								  `_graph_commits` / `_graph_commit_actors` (#291). Plus the RFC latency-model
 								  correction (#292).
 								- **Optimize-vs-write race** — optimize survives a cross-process write race on the
 								  same table (#297, **LANDED** — origin/main `6d4606a8`; see §6 for why it's not
 								  redundant with Design A). Step 3b stacks on top of this.
 								**Open PRs (land these; relationships in §7):**
 								- **#296** `correctness-by-design-fix` — recovery roll-forward converges on a concurrent
-												feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299)

* docs(rfc-013): bank the #295 spec-review comments as step-5 constraints (§5.1)

3b shipped a minimal WriteTxn{branch,base} and deferred the full §4.1 opener
unification (pinned-base opener, shared Session, write-local handle cache,
strict-op conflict-timing move) to step 5. The greptile comments on the #295
spec were moot for #298 (none of those constructs were built) but are
load-bearing for step 5: (1) the handle cache must be Send+Sync (Mutex, not
RefCell); (2) the strict-op timing move needs an explicit retry contract — txn
discarded after any commit, retry re-opens a fresh base — which is the SAME
contract as the stale-view false-fail (§1d.2); (3) the opener-equivalence test
must advance HEAD externally then assert pinned-base, not the trivial HEAD==base.

* feat(engine): fold graph lineage into the __manifest publish CAS (RFC-013 Phase 7)

Graph lineage no longer lives in a second write to _graph_commits.lance. Each
commit's graph_commit + graph_head:<branch> rows now ride the SAME __manifest
merge-insert as the table-version rows (one atomic version), and CommitGraph reads
its cache from the manifest projection (read_graph_lineage). _graph_commits.lance is
no longer written commit rows (it remains only as a Lance branch-ref carrier).

Mechanism: a LineageIntent { graph_commit_id (ULID, minted once), branch, actor,
merged_parent, created_at } threads through ManifestBatchPublisher::publish. Inside
the publisher retry loop the parent is resolved per attempt from the just-loaded
branch-scoped manifest (the should_replace_head winner over the visible graph_commit
rows — branch-correct by Lance branch isolation; the graph_head row is written for
forward-compat + the §7.1 contention point but is not the parent source, so a
freshly-forked branch resolves the right fork-point parent). A CAS-conflict retry
re-reads the advanced head → correct new parent; the commit_id is stable across
retries.

Closes two known gaps BY CONSTRUCTION (one write, no second step to fail/ race):
- manifest→commit-graph atomicity (no crash window between manifest + lineage),
- commit-graph parent under concurrency (no refresh→append TOCTOU; the per-write
  commit_graph.refresh() is gone).

Recovery, branch-merge, and genesis route their lineage through the same CAS
(merge: one commit_merge_with_actor; recovery: publish_recovery_commit folds the
recovery commit, actor=omnigraph:recovery; genesis rides the init __manifest write).
The dead _graph_commits write helpers (append_commit/_merge/_actor) are
#[allow(dead_code)] (the actor sidecar table is still enumerated by optimize).

Verified (sequential): build clean; the new lineage_projection gate (manifest-only —
_graph_commits/_actors have 0 rows; full lineage reconstructs via the projection);
branching/merge_truth_table (exhaustive, branch-aware)/composite_flow/point_in_time/
changes/consistency/recovery; failpoints (59, incl. recovery lifecycle + the
now-closed atomicity gap); full --workspace. Cost tests REVERT to their pre-fold
values (writes +1, write_cost ceiling 80) — the proof of true single-CAS (no extra
write). invariants.md marks both gaps CLOSED.

PENDING (next stages, this PR): the §7.1 concurrent graph_head one-winner gate (stage
5 — two concurrent same-branch commits, exactly one wins); the stamp bump v4 +
migrate_v3_to_v4 backfill + read-only refuse for EXISTING graphs (stage 4); full
doc-sync of storage.md/architecture.md/writes.md.

* feat(engine): migrate existing v3 graphs to manifest lineage (RFC-013 Phase 7 stage 4)

The Phase-7 fold made CommitGraph read lineage from the __manifest projection, so a
pre-Phase-7 (internal-schema v3) graph — lineage in _graph_commits.lance, none in
__manifest — would read an empty commit DAG. Stage 4 makes existing graphs upgrade
seamlessly and not break reads.

- Stamp 3 -> 4 + migrate_v3_to_v4: bumps INTERNAL_MANIFEST_SCHEMA_VERSION and adds the
  3 => migrate_v3_to_v4 arm. The migration reads this branch's _graph_commits/_actors,
  emits one graph_commit row per commit + exactly one graph_head:<branch> for the head
  (should_replace_head winner, deterministic id-sort — no hash-map-order in migration
  output), merge-inserts into __manifest, then set_stamp(4) LAST. Idempotency guard
  first (read_graph_lineage non-empty -> just stamp); crash before set_stamp re-enters
  at v3 and the guard completes it. Does NOT touch the unenforced-PK metadata. Runs per
  branch: migrate_on_open backfills main; load_publish_state backfills each branch on
  its first write (root_uri/branch threaded through migrate_internal_schema).
- v3-read fallback: CommitGraph version-gates the lineage source — stamp < 4 reads the
  (re-activated) _graph_commits.lance; >= 4 uses the manifest projection. So a READ-ONLY
  open of an un-migrated graph reads correct history with no write. Correctness catch:
  the legacy _graph_commit_actors.lance was never branched, so the fallback reads it
  FLAT (no branch checkout) while checking out the branch only on the commits dataset.
- Read-only stamp-refuse: a ReadOnly open of a FUTURE-stamped graph now refuses with the
  same upgrade error (future-proofing the next format bump; the write path already
  refused via migrate_internal_schema).
- Docs: storage/architecture/writes/invariants/constants updated to manifest-stored
  lineage; release note docs/releases/v0.8.0.md (format v4, old writers clean-break,
  data preserved, upgrade writers first).

6 new tests (v3 backfill, idempotent, v3 read-only fallback, future-stamp refuse in both
modes, crash-before-stamp completes, legacy branch+flat-actor read). Full engine suite +
failpoints (59) + cargo test --workspace --locked green; check-agents-md passes.

* test(engine): graph_head concurrency gate — disjoint same-branch writers form a linear commit DAG (RFC-013 Phase 7)

Two (or N) writers committing disjoint tables on one branch still share the
mutable `graph_head:<branch>` manifest row, so the only row-level CAS
contention is that row. The contract — exactly one writer wins each CAS round;
the loser retries inside the publisher, re-resolves its parent off the
freshly-advanced head, and re-commits, so every writer lands and the
graph_commit DAG stays a single LINEAR chain (no fork) — had no acceptance
test. This adds it.

- concurrent_disjoint_writes_share_head_and_form_linear_chain: two disjoint
  writers + distinct LineageIntent, tokio::join!; both commit; the on-disk DAG
  is genesis -> c -> c' (asserted linear: exactly one genesis, no two commits
  share a parent, the head is the unique non-parent).
- n_concurrent_disjoint_writers_converge_to_one_linear_chain: N=8 disjoint
  writers each with an app-level retry loop (the publisher's internal budget
  can be exhausted under contention); all converge to one linear chain of 8.
- concurrent_disjoint_writes_form_linear_chain_on_s3: the same race on a real
  object store (true conditional-put CAS), bucket-gated.

Cites both tests from the §7.1 contention note in invariants.md.
Test-only; no production change.

* perf(engine): fold the lineage parent scan into the publish path's single __manifest scan (RFC-013 P2)

Each lineage publish scanned `__manifest` twice: `load_publish_state` read
table state via one scan, then `resolve_lineage_rows` did a second full
`read_graph_lineage` scan only to find the parent commit. Fold the
`graph_commit` extraction into the existing scan.

- `read_manifest_scan` gains a `collect_lineage` flag. The publish path
  (`read_publish_scan`) collects the `graph_commit` rows in the same pass; the
  table-state hot path leaves them in the forward-compat skip arm, so it never
  pays the O(commits) lineage JSON decode (it also skips reading the
  `object_id` column entirely). One shared `decode_graph_commit_row` serves
  both the folded path and the standalone `read_graph_lineage`, so the two
  cannot drift.
- `resolve_lineage_rows` is now sync and takes the already-parsed rows; the
  per-attempt re-read is preserved because `load_publish_state` runs once per
  CAS attempt, so a retry still re-parents off the advanced head.
- `load_publish_state` returns a named `LoadedPublishState` instead of a
  four-tuple; the thin `read_registered_table_locations` /
  `read_tombstone_versions` accessors fold away. `read_manifest_entries` becomes
  `#[cfg(test)]`: the fold removes its last production caller, leaving only the
  test-only namespace module (`db/manifest.rs`: `#[cfg(test)] mod namespace`),
  so gating it keeps it from becoming dead code in non-test builds.

Measured at depth ~5: per-write `__manifest` reads drop 44 -> 26 (total reads
54 -> 36). write_cost.rs gains a `manifest_reads <= 34` sub-ceiling that trips
if a publish-path scan is re-added, and its calibration comment is corrected.

* test(engine): red — transient legacy-open failure silently completes the v3→v4 migration

A pre-Phase-7 (internal schema v3) graph keeps its graph lineage in
`_graph_commits.lance`; the v3→v4 internal-schema migration backfills it into
`__manifest` and stamps v4. `read_legacy_commit_cache` currently maps EVERY
`Dataset::open` error to "no legacy data" (`Err(_) => empty`), so a transient or
corrupt open during the one-time migration backfills nothing and still stamps
v4 — orphaning the real lineage permanently (the migration runs once; the v3
fallback is then disabled).

Add a `migration.v3_to_v4.legacy_open` failpoint that injects a non-not-found
Lance error at the legacy open, and a fault-injection regression test in the
`failpoints` binary. Against the current swallow the migration completes anyway,
so the test fails on its "migration must abort" assertion — the predicted
symptom. The fix follows in the next commit.

Test support reachable from the `failpoints` integration binary (it compiles the
crate without `cfg(test)`): the v3-fixture helpers and a stamp/row-count reader
are gated `cfg(any(test, feature = "failpoints"))`, still excluded from release
builds. Failpoint tests stay in the integration binary because the fail registry
is process-global.

* fix(engine): propagate non-not-found legacy-open errors in the v3→v4 migration

`read_legacy_commit_cache` mapped EVERY `Dataset::open` error to an empty cache
(`Err(_) => empty`) on both the legacy commits dataset and its actor sidecar. The
v3→v4 internal-schema migration reads this once before stamping internal-schema
v4; a transient or corrupt open therefore backfilled nothing and stamped v4
anyway, orphaning the graph's real lineage permanently (the migration runs once,
and the stamp-gated v3 fallback is disabled at v4). This is the "no silent
failures" deny-list violation, and realistic on object storage.

Both opens now match the not-found variants — Lance maps an object-store NotFound
to `DatasetNotFound` — as the benign "no legacy data" / "no authors" signal, and
propagate anything else as a loud error. The two arms share the variant contract
but carry different rationale (commits-absent is the legitimate empty signal;
actor-sidecar-absent is benign, but a corrupt actor open silently wiping
authorship before stamping v4 is the same loss hole), commented at each site.

Pinned by the `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant`
guard (turns red if a Lance bump changes the absence variant) and greens the
fault-injection regression test from the previous commit.

* test(engine): cover the per-branch v3→v4 migration against a real Lance branch

`seed_legacy_v3_lineage` writes every commit (including the "feature"-tagged one)
to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field, so the
production per-branch migration path — `read_legacy_commit_cache` checking out a
real Lance branch, and a branch-scoped `__manifest` — was never exercised.

Add `seed_legacy_v3_lineage_with_branch`, which forks a real `feature` Lance
branch on BOTH `_graph_commits.lance` and `__manifest` (the branch inherits
main's stripped v3 state), and a test that migrates the BRANCH and asserts the
branch's lineage lands in the BRANCH's `__manifest` (genesis + A + branch commit,
`graph_head:feature` → branch commit, parents + actors intact) with main's
`__manifest` untouched.

This empirically resolves the open question behind the merge robustness work: the
fast-path `read_graph_lineage(dataset)` has no `manifest_branch` filter, but
`__manifest` is Lance-branched per graph-branch, so a branch reads only its own
lineage — the test confirms migrating one branch does not leak into another. No
branch filter is needed.

* refactor(engine): type the lineage-backfill merge conflict via the publisher classifier

`state::merge_lineage_rows` (the v3→v4 lineage backfill's standalone `__manifest`
merge-insert) stringified its `execute_reader` error, discarding the Lance
variant. Route it through the publisher's `map_lance_publish_error` (now
`pub(crate)`) so a concurrent first-open's row-level CAS loss surfaces as the
SAME typed `OmniError::Manifest{ details: RowLevelCasContention }` the publisher's
own retry consumes — one vocabulary, no raw-Lance matching in the migration.

Deliberately NOT unified with `optimize::is_retryable_lance_conflict`: that
classifier also matches `CommitConflict`/`RetryableCommitConflict` from the
compaction commit path, which a row-level merge-insert never emits. Cross-linked
with a comment at both sites.

Behavior-preserving: the only path that changes is the error TYPE on a CAS loss
(previously an opaque `Lance` string, now a typed conflict); no success/failure
outcome changes. The bounded re-open retry that consumes the new type lands next.

* test(engine): red — concurrent v3→v4 migrations error instead of converging

`migrate_v2_to_v3` is concurrent-runner idempotent by design; v3→v4 regressed it.
`merge_lineage_rows` uses `conflict_retries(0)` and `migrate_v3_to_v4` has no
app-level retry, so when two processes open the same legacy graph at once the
backfill's row-level CAS loser errors the whole open instead of converging.

The test opens two `__manifest` handles at the same pre-migration (v3,
empty-lineage) HEAD and runs both `migrate_internal_schema` calls under
`tokio::join!`, forcing the `graph_head:main` CAS to fire every run. Against the
current code the loser fails with `RowLevelCasContention` ("Attempted 0
retries.") — the predicted symptom — so the "both must converge" assertion
panics. The bounded re-open retry that makes both converge lands next.

* fix(engine): make the v3→v4 lineage backfill converge under concurrent runners

`migrate_v2_to_v3` is concurrent-runner idempotent; v3→v4 was not. Two processes
(or open-for-write handles) opening the same legacy graph at once both reach the
backfill merge, and `merge_lineage_rows`'s `conflict_retries(0)` made the
row-level CAS loser error the whole open instead of converging.

Two contention points, both now handled all-or-nothing:

1. The backfill merge on `graph_head:<branch>`. Wrap (fast-path re-read → read
   legacy → merge) in a bounded re-open retry loop: a `RowLevelCasContention` loss
   re-opens the manifest past the winner's (atomic) commit and re-loops; the
   fast-path re-read then sees the winner's lineage and stamps. On budget
   exhaustion it returns a `RowLevelCasContention`-typed error so the publisher's
   OUTER retry loop completes it. The retry decision reuses the publisher's
   `is_retryable_publish_conflict` so the two stay in lockstep.

2. The terminal stamp bump. Making the merge loser converge newly lets BOTH
   runners reach `set_stamp(4)` — an `UpdateConfig` commit on the same key — so the
   loser gets `lance::Error::IncompatibleTransaction` (NOT a row-level CAS, so the
   merge loop doesn't catch it). This surfaced only under the concurrent
   full-suite run, not the isolated test. Both write the SAME value, so the
   conflict is benign: `commit_v4_stamp_idempotently` re-opens and, if the stamp
   already reached the target, succeeds; else re-applies (bounded).

Greens the race test from the previous commit (3x isolated, 5x full-suite, no
flake). The new `IncompatibleTransaction` match is pinned by
`lance_surface_guards.rs::lance_error_incompatible_transaction_variant_exists`.

* fix(engine): refuse a future internal-schema stamp on the branch read path

`load_commit_cache_for_branch` dispatched on the branch's internal-schema stamp —
`< CURRENT` to the v3 legacy fallback, `>= CURRENT` to the manifest projection —
but never refused a `> CURRENT` branch stamp, so a newer-binary shape would be
misread by the projection rather than rejected.

Add `refuse_if_stamp_too_new(stamp)` (re-exported `pub(crate)` from `migrations`)
right after the branch stamp is read, mirroring the main read path's
`refuse_if_internal_schema_too_new`. This is defense-in-depth, not a live hole:
migrations run main-first (main migrates on open; each branch on its first write),
so main's stamp is always >= every branch's and the main path refuses first. The
guard closes the gap if that ordering invariant is ever weakened.

Tested by force-stamping a real branch past CURRENT and asserting the branch read
refuses with the upgrade error (the test misreads via the projection — returns Ok
— without the guard, confirmed by removing it).

* docs(rfc-013): record the v3→v4 migration robustness fixes

invariants.md Known Gaps: the `migrate_v3_to_v4` entry now states the migration is
loud on non-not-found legacy-open errors and concurrent-runner idempotent (bounded
re-open retry on the merge CAS + idempotent stamp bump), and that the branch read
path refuses a `> CURRENT` stamp.

lance.md: note the two new surface guards the migration depends on
(`dataset_open_missing_returns_not_found_variant`,
`lance_error_incompatible_transaction_variant_exists`).

testing.md: note the migration fault-injection test in the failpoints row.

* refactor: remove dead code and silence warnings across engine + cluster

Dead-code sweep follow-up to the RFC-013 stack. No behavior change.

- engine: delete the orphaned `validate_edge_cardinality` — the load path uses
  `validate_edge_cardinality_with_pending_loader` for every mode (including
  Overwrite, which it treats as the replacement table image), so the old
  standalone validator had no caller — and correct its sibling's now-stale doc
  reference. Gate `TableStore::append_batch` `#[cfg(test)]`: it is the inline-
  commit residual kept only for recovery test setup, with no non-test caller.
- cluster: drop unused imports in `lib.rs`, delete the unused
  `ClusterStore::payload_display`, and raise `LiveGraphObservation` /
  `GraphObservationJson` / `PolicyTarget` to `pub(crate)` to match the functions
  that return them.

Both lib crates now build warning-free.

* fix(engine): match Lance's typed DatasetAlreadyExists, not the message string

The internal create-or-open idempotency fallbacks in `db/commit_graph.rs` and
`db/recovery_audit.rs` classified the "already exists" race by
`err.to_string().contains("Dataset already exists")` — a Lance display string,
not an API contract. A wording change upstream would silently break the fallback
(a re-create would error instead of opening the existing table). Match the typed
`lance::Error::DatasetAlreadyExists { .. }` variant instead — the same discipline
as the v3→v4 migration's not-found classifier — pinned by the new
`lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists`
guard so a Lance rename turns red instead of silently regressing.

* refactor(engine): consolidate now_micros into one crate::db helper

Four `fn now_micros() -> Result<i64>` copies (commit_graph, recovery_audit,
graph_coordinator, manifest/graph) had already drifted: three mapped the
clock error to `OmniError::manifest("...UNIX_EPOCH...")` while recovery_audit
used `OmniError::manifest_internal("...unix epoch...")`. Replace all four with
one `pub(crate) fn now_micros()` in `db/mod.rs` (the majority `manifest`
variant), and repoint the eight call sites at `crate::db::now_micros()`. No
test asserts on the failure message, so unifying the variant is behavior-safe;
the timestamp-mapping contract can no longer fork across the rows it stamps.

* refactor(engine): drop the dead snapshot param from roll_back_sidecar

`roll_back_sidecar` took `snapshot: &Snapshot` only to discard it with
`let _ = snapshot;` — rollbacks now always publish (the restored HEAD plus a
recovery-commit lineage row), so the snapshot is never read to decide whether
to skip a publish. Remove the parameter, the two call-site arguments, and the
suppressor. A signature must not advertise inputs it does not consume. The
`Snapshot` import stays — `process_sidecar`, `roll_forward_all`, and
`record_audit_recovery_rollforward` still take it.

* test(engine): red — open_at_branch wedges a branch on a missing commit-graph ref

A v4 graph keeps its graph lineage in `__manifest` (RFC-013 Phase 7); the
`_graph_commits.lance` branch ref is a derived artifact. An interrupted
fork-reclaim or a `cleanup` race can drop that derived ref while the manifest
lineage stays intact. Per invariants 7 + 15 a missing derived ref must not fail
a logical read of the lineage.

This wedge builds a real v4 `feature` branch (its `graph_head:feature` row in
`__manifest`), force-deletes ONLY the `_graph_commits.lance` `feature` ref, then
asserts the branch reads (`open_at_branch` / list-commits / `merge_base`)
succeed from `__manifest` while a write that needs the derived ref
(`create_branch`) fails loudly with the typed actionable error.

Red against current code: `open_at_branch`'s hard `checkout_branch(branch)?` on
the missing ref errors `OmniError::Lance` (Lance "Not found:
_graph_commits.lance/tree/feature/_versions"), wedging the logical read.

* fix(engine): read manifest lineage independent of the derived _graph_commits ref

`CommitGraph::open_at_branch` did a hard `checkout_branch(branch)?` on the
`_graph_commits.lance` branch ref before reading lineage — so a missing derived
ref (an interrupted fork-reclaim, or a `cleanup` race) wedged the branch's
commit-list / merge-base / snapshot resolution even though the lineage is
readable from the authoritative `__manifest` (RFC-013 Phase 7). That is a
derived/physical artifact failing a logical read — invariants 7 and 15.

Make the held commits handle `Option<Dataset>` (mirroring `actor_dataset`).
`open_at_branch` and `refresh` check out the derived ref best-effort: a typed
not-found (`RefNotFound`/`NotFound`) yields a `None` handle while the read
re-syncs from `__manifest`; any other open error still propagates. The manifest
existence gate is unchanged — `load_commit_cache_for_branch` keeps its hard `?`,
so a truly absent branch still fails loudly at the manifest. `create_branch`
(the only writer that forks a ref) and the folded-in version lookup return a
loud, actionable error on `None`, deferring repair to `cleanup`'s existing
orphan reconciler rather than inlining a write on a read-side refresh. Reads
(`head_commit`/`load_commits`/`get_commit`/`merge_base`) never touch the handle.

Greens the wedge regression from the preceding commit.

* fix(engine): v3→v4 retry loops return retryable contention on exhaustion

`commit_v4_stamp_idempotently`'s retry loop used `0..=STAMP_RETRY_BUDGET`
(6 iterations) with an `attempt < STAMP_RETRY_BUDGET` guard, so the LAST
iteration's `IncompatibleTransaction` fell through to
`Err(e) => OmniError::Lance(...)` — stringified, non-retryable — instead of the
intended `RowLevelCasContention`, and the post-loop contention return was dead
code. The publisher's outer retry only re-runs `is_retryable_publish_conflict`,
so under sustained concurrent v3→v4 migration the one-time stamp bump could fail
instead of converging, defeating the idempotency the migration is supposed to add.

Fix the loop to `0..BUDGET` with an UNGUARDED `IncompatibleTransaction` arm: the
retryable variant is always handled inside the loop (re-open + same-value check +
retry), so it can never reach the stringifying catch-all, and the post-loop is the
SINGLE reachable exhaustion path — the typed `RowLevelCasContention`. The `Err(e)`
arm now catches only genuine non-contention errors. Apply the same range alignment
to the sibling merge loop in `migrate_v3_to_v4` (behaviorally correct today — its
`Err(err)` returns the already-typed contention — but it carried the identical
off-by-one structure the stamp loop was copied from; aligning both stops the next
copy from re-introducing it).

Test-first. The exhaustion path is otherwise near-unreachable — a real concurrent
winner stamps the same value, so the re-read returns Ok on the first retry — so a
new `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to
lose, driving exhaustion deterministically. Against the pre-fix loop the new
`v4_stamp_exhaustion_returns_retryable_contention` test goes red with
`Lance("Incompatible transaction: injected failpoint triggered…")`; with the fix it
asserts the typed `RowLevelCasContention`. Found by automated review on #299.

* feat(engine): minimum-supported internal-schema floor + retirement tripwire

The internal-schema migration chain (`migrate_internal_schema`) had a too-new
ceiling but no floor, so every old `migrate_vN_…` arm and the v3 legacy readers
it needs stay forever — the pile grows by one migration + readers + tests every
schema version. Add `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (1 today, a pure
no-op: `read_stamp` floors an absent stamp at 1 and no real graph carries 0) as
the oldest stamp this binary opens; raising it is how the chain sheds old code.

Collapse the one-sided `refuse_if_stamp_too_new` into `refuse_if_stamp_unsupported`
checking both bounds, so the floor lands at all three stamp-enforcement sites —
the write-path migrate dispatcher, the read-only open guard, and the branch
lineage-read path (`commit_graph.rs`) — via one compiler-enforced rename. A
hand-wired floor twin would have had to touch each site, and the branch-read path
is easy to miss; one combined guard cannot half-enforce. Rename the read-only
wrapper `refuse_if_internal_schema_unsupported` to match.

A compile-time tripwire (`const _: () = assert!(LOWEST_REGISTERED_MIGRATION_SOURCE
== MIN_SUPPORTED…)`) fails the build if a future floor bump forgets to delete the
now-dead migration arm (or vice versa) — stronger than a runtime test, impossible
to skip, and it doubles as the use that keeps the mirror const live.

Tests: a sub-floor graph is refused in both open modes (twin of
`future_stamp_is_refused_in_both_open_modes`); the guard accepts exactly
[MIN, CURRENT]. No behavior change for any real graph. The retirement runbook
lives on the `MIN_SUPPORTED` doc-comment + invariants.md.

* fix(engine): compose migration contention with publisher retry; precise recovery-converge audit commit

Three review-surfaced fixes on the RFC-013 Phase 7 path.

Publisher retry vs migration contention: `publish()` propagated a
`load_publish_state` error fatally via `?`, so a `RowLevelCasContention` surfaced
by the v3->v4 migration's exhausted merge/stamp budgets aborted the publish
instead of being retried — only `merge_rows` conflicts hit the retry. This
contradicted the migration's own design, which returns that typed error
EXPECTING the publisher to re-run the load (by which point a concurrent winner
has usually finished the migration, so the next scan is a no-op). Route a
retryable load error through the same retry path as a retryable `merge_rows`
conflict. Regression test (failpoints): a one-shot retryable contention injected
into `load_publish_state` now commits via the retry; red without the fix (the
write fails with the injected contention).

Recovery-converge audit commit id: `converge_or_defer_roll_forward` recorded the
branch HEAD as the audit row's `graph_commit_id`, but a concurrent user write can
advance `graph_head` past the recovery commit between the winner's publish and
this read — attributing the audit to a later, wrong commit. Use the latest
`RECOVERY_ACTOR`-authored commit (what `publish_recovery_commit` mints), which is
the recovery commit by construction. The audit's actor was already correct (it
comes from `sidecar.actor_id`, not the commit).

Dead param: drop the unused `snapshot` from `record_audit_recovery_rollforward`
(removing the `let _ = snapshot;` suppressor). `storage` stays — it is used to
delete the sidecar.
											
										
										
											2026-06-25 13:55:34 +02:00
+								  manifest advance (the fix for the flaky `iss-schema-apply-reopen-recovery-race`).
 								  **MERGED to main and integrated into this branch** — the converge helper now threads
 								  Phase-7's manifest-CAS recovery `graph_commit_id` (see `converge_or_defer_roll_forward`).
-												feat(engine): `WriteTxn` - validate schema + open each data table once per write (#298)

* docs(rfc-013): step-3b handoff + §4.1 corrections (validated)

Add the RFC-013 write-path handoff doc, and correct §4.1's WriteTxn sketch from the
4-subagent validation against current code:
- HandleCache → handle-threading (forward the commit-return handle; a version-keyed
  cache misses because HEAD walks N→N+1→N+2 across staging + index-build commits).
- "re-resolution unrepresentable" softened to "pinned base for the pre-commit phase +
  named fresh re-reads at the commit/fork boundary" — three reads (commit-time OCC, the
  live-HEAD drift probe, fork authority) are irreducible correctness machinery.
- WriteParams DOES carry a session field; the real constraint is "stage off an open
  Dataset," so attach the Session by opening read-style then staging off it.

* test(engine): RED step-3b capture-once fitness asserts + open_count probe

Two write-path cost gates, RED today, GREEN after the WriteTxn lands:
- write_validates_schema_contract_once: a write must validate the schema contract
  once (3 read_text + 2 exists). Today re-validates at every resolve point —
  measured 12 read_text / 9 exists (~4 validations) via CountingStorageAdapter
  (zero production change; the write twin of the read-path schema-once test).
- keyed_insert_opens_table_at_most_once: a keyed single-table write must open its
  table <=1x. Today measured 10 opens.

Adds an exact open-CALL probe: open_count + record_open() on QueryIoProbes (mirroring
probe_count/record_probe), called at both open chokepoints; surfaced as
IoCounts.open_count. forbidden_apis guarantees every write open routes through them.

* feat(engine): WriteTxn carrier + open_write_txn (3b scaffolding)

The capture-once write transaction (RFC-013 step 3b): WriteTxn{branch, base:
Snapshot, session} + Omnigraph::open_write_txn, which validates the schema contract
once and pins the base snapshot + the shared per-graph Session.

Landed as reviewed scaffolding (gated #[allow(dead_code)]); the next pass threads
Option<&WriteTxn> through open_for_mutation_on_branch / staging on the non-strict
bound-branch path — opening the base once from the pinned entry with the warm session
(a session-aware pinned opener returning a SnapshotHandle) and skipping the per-table
schema re-validation — to turn the two RED cost gates green. Strict ops / fork / the
commit-time OCC re-read keep their fresh reads.

* test(engine): scope write-path open_count to data tables (RFC-013 step 3b)

The keyed_insert_opens_table_at_most_once gate asserted open_count <= 1, but
open_count was a single unclassified counter: record_open() fires in both
open chokepoints, and open_dataset_tracked also opens the internal/system
tables (__manifest via layout.rs, _graph_commits/_graph_commit_actors via
commit_graph.rs). So the count conflated data-table opens with the publisher
CAS + commit-graph append opens — making the gate measure the wrong quantity
and unreachable by threading alone (the manifest publish keeps it >1 regardless).

Scope it by table class, mirroring the read-side counters (which already split
by URI prefix via separate wrappers): record_open(uri) classifies the open's
last path segment and feeds data_open_count vs internal_open_count. IoCounts
exposes both; the gate now asserts data_open_count <= 1.

Re-baselined: a single keyed insert is data_open_count=4 / internal_open_count=6
(sum 10, the old conflated value). The RED target for the WriteTxn threading is
now the real data-table-open count (4 -> 1), with internal opens correctly out
of scope. Pure test-harness/instrumentation; no production behavior change
(classification runs only inside the probe closure, skipped when no probes are
installed).

Also marks #297 (optimize-vs-write race) as landed in the step-3b handoff —
this branch is already stacked on origin/main after it merged.

* feat(engine): validate the schema contract once per write (RFC-013 step 3b)

A single mutate/load re-validated the schema contract ~4 times: at the entry
(ensure_schema_state_valid), per-table in open_for_mutation_on_branch
(resolved_branch_target), at the commit-time OCC re-read (fresh_snapshot_for_branch),
and in the publisher's index-build snapshot (snapshot_for_branch). Each validation
is 3 read_text + 2 exists on the storage adapter — O(touched resolve-points) of
redundant contract I/O on every write.

Thread the already-landed WriteTxn carrier through the write path: capture
`txn = open_write_txn(branch)` once at the mutate/load entry (the single validation),
then source the per-table entry and the commit/publish snapshots from `txn.base`
instead of re-resolving. When `txn` is None (branch merge, schema apply, tests) every
function is byte-identical to before.

- mutate_with_current_actor / load_jsonl_reader capture txn once (replacing the
  entry-point ensure_schema_state_valid) and thread Some(&txn) through
  execute_*/open_table_for_mutation, commit_all, and
  commit_updates_on_branch_with_expected.
- open_for_mutation_on_branch sources (snapshot, branch) from txn.base/txn.branch
  when present — skipping resolved_branch_target's re-validation. The OPEN itself is
  unchanged (still HEAD via open_dataset_head_for_write), and strict ops keep
  ensure_expected_version. Schema-once applies to strict and non-strict alike; the
  data-open collapse is a separate change.
- commit_all uses fresh_snapshot_for_branch_unchecked (the OCC manifest re-read minus
  the schema re-validation) when txn is present; the drift guard is unchanged.
- prepare_updates_for_commit uses txn.base for the publisher index-build snapshot.

fresh_snapshot_for_branch{,_unchecked} now read the manifest directly via
ManifestCoordinator instead of resolve_target. The OCC re-read consumes only the
Snapshot (per-table location + version), which ManifestCoordinator::open().snapshot()
produces identically — but resolve_target additionally opened the commit graph (a
spurious _graph_commits.lance exists probe the OCC read never consults). Dropping that
load is a pure read-cost reduction for every fresh-snapshot caller (commit_all's None
arm, optimize, repair, fork reclaim); the returned Snapshot is unchanged and the read
is a fresher cold manifest re-read, so the OCC freshness guarantee is preserved.

Greens write_validates_schema_contract_once (3 read_text / 2 exists, was 12/9).
keyed_insert_opens_table_at_most_once stays red (data_open_count=4) — the open
collapse lands next. Full engine suite green otherwise.

* feat(engine): open each data table once per write (RFC-013 step 3b)

A single keyed-node mutate opened its data table 4 times: accumulation (to read
.version()), staging (the real write base), the commit-time drift guard (to read
live HEAD), and the publisher's index build (reopen at the just-committed version).
Collapse three of the four — using the WriteTxn carrier threaded for schema-once —
so a write opens each touched data table at most once.

- #1 accumulation: open_for_mutation_on_branch now returns
  (Option<SnapshotHandle>, expected_version, full_path, table_branch). On the txn's
  own branch, a non-strict (Insert/Merge) op needs no open — the only thing the
  caller reads is .version() (the CAS fence), which is exactly the pinned base
  version (entry.table_version). So skip open_dataset_head_for_write and source the
  version from txn.base. The node insert path already discarded that handle; the
  edge path resolves a pinned read only when non-default cardinality needs it.
  STRICT ops and any write that must fork still open live HEAD + ensure_expected_version.
- #3 commit drift guard: commit_all reads live HEAD via
  entry.dataset.dataset().latest_version_id() — a cheap manifest-pointer probe off
  the already-open staging handle (the same primitive ManifestCoordinator::
  probe_latest_version uses) instead of a fresh open_dataset_head_for_write. The
  head<current / head>current drift classification is byte-identical.
- #4 index build: commit_all now returns the per-table post-commit_staged
  SnapshotHandle map; commit_updates_on_branch_with_expected threads it into
  prepare_updates_for_commit, which builds indices on the threaded handle instead of
  reopening at the same just-committed version. Absent a handle (other writers,
  inline/delete tables) the reopen path is byte-identical.

When txn is None (branch merge, schema apply, tests) every function opens and checks
exactly as before. Greens keyed_insert_opens_table_at_most_once (data_open_count 4->1).
Schema-once gate stays 3/2. Full engine suite + failpoints (recovery sidecar lifecycle)
green.

* refactor(engine): name the write-path open/commit returns (RFC-013 step 3b)

The open collapse left two positional returns that are easy to mis-thread and
carry an unwritten contract: open_for_mutation_on_branch's
(Option<SnapshotHandle>, u64, String, Option<String>) and commit_all's 5-tuple
(updates, expected_versions, sidecar_handle, guards, committed_handles). Replace
both with named structs so each field reads at the call site and the Option's
contract is documented, not folklore.

- OpenedForMutation { handle, expected_version, full_path, table_branch } with a
  require_handle(ctx) helper for the callers that must have a handle (strict ops,
  the fork path, every no-txn caller — branch merge, the seed test). The handle is
  None only on the non-strict-txn open-skip path (collapse #1); require_handle
  panics with a named context if that contract is ever broken.
- CommittedMutation { updates, expected_versions, sidecar_handle, guards,
  committed_handles } for commit_all; consumers destructure into the same local
  bindings they already used, so the publish/sidecar/guard-hold logic is unchanged.
- A debug_assert in open_table_for_mutation pins the skip contract: a missing handle
  is legal only on the non-strict txn path, so a future strict arm returning None
  trips in debug builds instead of handing None to a require_handle consumer.

Pure refactor — no behavior change. Both cost gates stay green (schema 3/2,
data_open_count=1), full engine suite + lib (162) green.

* refactor(engine): drop the unearned session field from WriteTxn (RFC-013 step 3b)

The open collapse greens data_open_count<=1 by SKIPPING the accumulation open,
PROBING live HEAD with latest_version_id, and REUSING the commit_staged handle —
none of which consume a session. The captured WriteTxn.session was therefore dead
(`#[allow(dead_code)]`): unearned surface a reviewer rightly flags.

Remove it. The carrier is now {branch, base} — exactly what schema-once + the open
collapse use. Step 5 (PublishPlan unification) makes WriteTxn the non-optional
publish carrier and is the right home for session-aware base opens, where the
warm-session benefit on the single remaining open — an object-store (S3) phenomenon,
invisible on local FS — can be earned by its own cost gate rather than carried dead
through this PR.

No behavior change; both cost gates stay green (schema 3/2, data_open_count=1).

* docs(rfc-013): mark step 3b DONE — schema-once + open-collapse shipped, session deferred to step 5

* docs(rfc-013): capture the write-base-staleness convergence (§1d)

Three findings this cycle share one root — the write base is a stale, un-probed,
un-classified pin (the read path probes; the write path returns the warm
coordinator snapshot):

- #298 edge-@card stale-read regression (cursor High / codex P1, VALID): collapse #1
  made the cardinality scan read txn.base instead of live HEAD, so a concurrent edge
  is uncounted and a max can be exceeded. Fix on #298: restore the live-HEAD read +
  deterministic test + correct the single-writer doc comment.
- The structural liability underneath: no unified write-validation read-set —
  endpoint/cardinality/uniqueness each pick freshness ad hoc (warm/pinned/live),
  the same cardinality check forks mutation-vs-loader, none re-validated at commit.
- The served-strict-write stale-view false-fail (validated on prod + a #[ignore]
  repro): a strict update/delete false-fails ExpectedVersionMismatch after an external
  optimize advance — the write-side mirror of #297/§6.6. The naive blanket probe is
  proven wrong (breaks the cross-process lost-update OCC contract).

All three converge on Design A (step 5): open_txn's warm probe makes the base fresh,
the op-class-aware precondition (derive maintenance vs logical from Lance per-version
transaction metadata — no parallel marker) fast-forwards maintenance and fails logical,
and §7.1's read-set-in-CAS unifies + re-validates the validation read-set. §8 records
the #298 follow-up, the widened §7.1 scope, and the step-5 two-test acceptance contract.

* test(engine): RED — edge @card must scan live HEAD, not stale txn.base (#298)

Regression guard for the cursor-High/codex-P1 finding on #298: 3b's collapse #1
made the non-strict edge-insert cardinality scan read the pinned txn.base instead
of live HEAD (edge_cardinality_read_handle), so a concurrent edge committed after
txn capture is uncounted and a @card max is silently exceeded (invariant 9).

Deterministic two-handle test (no failpoint): handle A commits WorksAt(Alice->Acme)
to the @card(0..1) max; stale handle B (never read since) inserts a second WorksAt
for Alice. B's coordinator is stale by construction (the write path doesn't probe),
so B scans txn.base (Alice has 0) and wrongly commits the 2nd edge. RED: the insert
that must be rejected currently succeeds (panics at unwrap_err). Goes green when the
scan reads live HEAD.

* fix(engine): scan live HEAD for edge @card, not the pinned txn.base (#298)

3b's collapse #1 skips the non-strict edge accumulation open, so edge_cardinality_
read_handle reopened the edge table at the pinned txn.base for the @card scan. Since
cardinality is validated once (never rechecked at commit), a concurrent edge committed
after txn capture was uncounted and a @card max could be silently exceeded (invariant
9) — the cursor-High/codex-P1 regression on #298. Pre-3b the scan read live HEAD (the
mutation's own open_dataset_head_for_write handle).

Restore the live-HEAD read: take the table LOCATION from the pinned entry (stable
across versions) and open the dataset at its current HEAD via open_dataset_head_for_
write. Gate-safe — the data_open_count / merge-insert-only gates are node inserts; the
edge cardinality path (non-default @card only) is untouched by them, and the extra
live-HEAD open is exactly the pre-3b shape. Also drops the dead None-fallback's schema
re-validation (greptile P2, auto-resolved). The residual validate->commit TOCTOU is the
pre-existing §7.1 gap (RFC-013 step 4), recorded in handoff §1d/§8.

Turns cardinality_rejected_for_stale_handle_after_concurrent_edge_commit green;
validators / write_cost / writes / consistency / end_to_end / branching all green.

* docs(dev): link handoff docs from index

* docs(engine): tighten 3b claims to match the code (#298 review)

Review caught several comments/docs overclaiming what the code does (the session
drop + the #298 cardinality fix left stale/too-strong wording). No logic change.

- open_write_txn doc: drop the stale "shared per-graph Session" (WriteTxn no longer
  carries one); scope "once" to the table-touch hot path and note edge/load RI
  validation still re-resolves (→ step 4 §7.1) + the session-aware open is step 5.
- edge cardinality call-site comment: it said the scan uses a "pinned txn.base" — it
  now opens LIVE HEAD (#298); corrected.
- write_cost.rs: "opens the base once (with the shared Session)" → session-aware base
  open is deferred to step 5.
- data_open_count completeness (instrumentation.rs + write_cost.rs): forbidden_apis
  only keeps engine code OUTSIDE the storage layer on the chokepoints; table_store.rs
  is allow-listed and holds direct Dataset::opens for branch-management ops (not the
  keyed-write hot path the gate measures). Narrowed the claim accordingly.
- handoff §4: "schema once / open once" is the node hot path (the two gates); edge
  endpoint + loader RI/cardinality still re-validate and read warm — #298 un-regresses
  cardinality only, it does NOT close write-validation freshness (that's step 4 §1d/§7.1).

build clean; write_cost / validators / forbidden_apis green.
											
										
										
											2026-06-23 21:27:31 +02:00
+								- **#295** `docs/rfc-013-step-3b` — the step-3b RFC doc.
 								- **#254** `ragnorc/bug-4-schema-apply-occ` — schema-apply vs optimize false-fail
 								  (same op-class family as #297, logical side).
 								**Step 3b is DONE** (capture-once `WriteTxn`, schema-once + open-collapse; see §4) on
 								`rfc-013-step-3b-writetxn-v2`. **Next: Phase 7 (step 4), then the big one — Design A /
 								`PublishPlan` unification (step 5)** — see §5, the convergent fix for the bug *class* this
 								area keeps generating, which also absorbs 3b's deferred session-aware write opens.
 								---
 								## 1. The corrected mental model (read this before touching anything)
 								Three reframes from the latest cycle that the older RFC prose may not fully reflect:
 								### 1a. 3a already won the depth fight → the residual is constant-factor + RTT
 								Before 3a, the write re-opened each table through the lance-namespace builder ~13×, and
 								that path was **O(depth)** (it re-opened `__manifest` + `list_table_versions` per open —
 								**not** a Lance back-walk; the root cause was OmniGraph's own namespace round-trips, not
 								Lance — validated against Lance source). 3a swapped it for the direct opener, which is
 								**O(1)** (`from_uri(loc).with_version(N)` = arithmetic path + one HEAD). So:
 								- The dominant **O(depth) data-table** term is **gone**.
 								- Step 2a flattened the secondary **internal-table** scan term.
 								- What remains is the **~110-hop serial backbone × RTT + compute** — a constant in
 								  depth. The latency model is **`wall = (serial_hops + ops/effective_concurrency)·RTT
 								  + compute`**; on a capped store (R2) the op-count term re-enters wall-clock, on an
 								  unlimited store it parallelizes away. Measured: prod one-row write 27→15.76s after
 a; the remaining 15.76s is the serial backbone — **step 3b's target**, not step 2's.
 								- Step 3b's win is therefore the **call-count/RTT collapse** (redundant opens, the
 								  flat-46 schema reads), NOT a depth slope. Don't expect a depth-slope improvement from
 b; gate it on the constant-factor (S3 round-trips), not a curve.
 								### 1b. Two op classes, two commit models (the §6.6 principle)
 								Every concurrency bug in this area is **one op class using the other's commit model**:
 								| class | examples | commutes? | correct commit model |
 								|---|---|---|---|
 								| **maintenance** | compaction (`Rewrite`), `optimize_indices` | yes (content-preserving) | Lance native rebase + app reopen/replan on real overlap + **monotonic manifest fast-forward** — no epoch, no read-set |
 								| **logical mutation** | load / mutate / merge / delete | no (lost-update, write-skew) | strict cross-process OCC: read-set + write-set CAS under the `writer_epoch` fence |
 								Applying strict OCC + equality-CAS uniformly is the mistake: too strong for maintenance
 								(false conflicts — #297's bug), too weak for logical cross-process (§6.5 corruption).
 								### 1c. The root liability (what keeps generating these bugs)
 								Lance gives **per-table atomic commits** but **no cross-table/cross-step atomicity**, so
 								every multi-commit op advances per-table Lance HEAD **before** the manifest references it
 								(the "A-before-B window"). The resulting `HEAD vs manifest` delta is **ambiguous**
 								(external drift? my own in-flight work? a crashed writer?), and **many uncoordinated code
 								paths each re-interpret it** (4 writers + the maintenance path + recovery + the write-path
 								drift guard). Each interpreter is a fresh chance to misclassify. That is the bug class:
 								- §6.5 cross-process logical corruption,
 								- #297's own-HEAD-drift misclassification,
 								- the flaky write-path "HEAD ahead of manifest, run repair" guard,
 								- the recovery classifier edges.
 								**The convergent fix is Design A (one publish authority — step 5); Lance MTT eventually
 								retires the window entirely.** See §5.
 								### 1d. The second facet: the write base is a stale pin (no probe)
 								The READ path resolves its base behind a freshness probe (`resolve_target_inner`
 								omnigraph.rs:~1072 → `probe_latest_incarnation` → `refresh_manifest_only`); the WRITE path
 								does NOT (`resolved_branch_target` omnigraph.rs:~778 returns the warm `coord.snapshot()` for
 								the bound branch, no probe). So a long-lived server's write base lags the live manifest. That
 								single staleness feeds **two distinct failure modes**, both surfaced this cycle:
 . **Stale validation *reads* → integrity under-enforced.** Write-path RI checks read
 								   committed state off the stale base. 3b's collapse #1 made it worse for edge `@card`:
 								   `edge_cardinality_read_handle` (mutation.rs:~614) scans the pinned `txn.base` instead of
 								   live HEAD (was live HEAD pre-3b), so a concurrent edge committed after `txn` capture is
 								   uncounted → a `@card` max can be exceeded (cursor **High** / codex **P1** on #298,
 								   **VALID**). **#298 fix: restore the live-HEAD read for that scan** (un-regress; gate-safe —
 								   the `data_open_count` gate is a node insert) + a deterministic regression test (commit A's
 								   edge, then B validates → must see A) + correct the wrong "pinned base == live HEAD" doc
 								   comment (mutation.rs:~605-613, which assumes a single writer). The *structural* liability
 								   underneath: there is **no unified write-validation read-set** — endpoint
 								   (`ensure_node_id_exists`, warm `snapshot_for_branch`), cardinality (mutation: pinned
 								   `txn.base`; loader: warm `snapshot_for_branch` — the SAME check forks per write path),
 								   commit drift guard (live `fresh_snapshot_for_branch`), and uniqueness
 								   (`enforce_unique_constraints_intra_batch`, intra-batch only — cross-version uniqueness is a
 								   documented gap). Three freshness levels chosen ad hoc, none re-validated at commit → the
 								   §7.1 TOCTOU class, and each new constraint forks the pattern again.
 . **Stale OCC *pin* → false-fail on a maintenance advance.** A served strict update/delete
 								   pins the stale base version, then false-fails `ExpectedVersionMismatch` after an external
 								   `optimize` advanced `__manifest` — even though the advance was content-preserving
 								   compaction the logical write should fast-forward past (invariant 7). It's the **write-side
 								   mirror of #297/§6.6** (#297 made optimize fast-forward past a logical write; this is a
 								   logical write that must fast-forward past optimize). A served read clears it (the read
 								   probes the shared coordinator). Validated repro on prod (omnigraph.ragnor.co) +
 								   `writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes`
 								   (`#[ignore]` on branch `fix/write-path-stale-view-probe`). **The naive "just probe" fix is
 								   proven wrong** — a blanket probe silently refreshes past *logical* advances too, breaking
 								   `consistency::stale_handle_public_mutation_must_refresh_then_retry` (the deliberate
 								   cross-process lost-update OCC primitive). The fix must **discriminate by op class**.
 								**Both fold into Design A (step 5), same as §1c.** `open_txn`'s one warm probe makes the base
 								fresh (absorbs maintenance advances cheaply); the **op-class-aware strict precondition** —
 								derive from Lance's per-version transaction metadata (all `Rewrite`/`ReserveFragments` =
 								maintenance → fast-forward the pin; any `Append`/`Update`/`Delete`/`Merge` = logical → fail
 								loudly; NO parallel marker, invariant 1/15) — is the correctness fence for anything that lands
 								after. And the §7.1 read-set-in-CAS unifies the validation read-set + re-validates it under the
 								`graph_head` contention. So **the stale-view false-fail, the cardinality/validation-read-set
 								liability, and #297's mirror are one bug** (the write base is a stale, un-probed, un-classified
 								pin) with **one home: the single PublishPlan delta-interpreter** (§1c + §5). Strong corroboration
 								of Design A — three symptoms, one fix.
 								---
 								## 2. Validated facts — do NOT re-derive these
 								Established this cycle against **Lance 7.0.0 source**
 								(`~/.cargo/registry/src/index.crates.io-*/lance-7.0.0`) and current engine code. Cited so
 								you can trust them without re-investigating.
 								**Lance (upstream):**
 								- `from_uri(loc).with_version(N).load()` and `checkout_version(N)` are **O(1)** (computed
 								  V2 path `_versions/{u64::MAX-N:020}.manifest` + one HEAD; no listing/back-walk).
 								  (`lance-table/src/io/commit.rs` `default_resolve_version`.)
 								- A shared `Arc<Session>` (`DatasetBuilder::with_session`) warms metadata/index caches
 								  keyed by `(URI, version, e_tag)`. Caveat: the *first* manifest read on open is uncached
 								  — the Session warms the *scan/index* metadata, not the first open. **`WriteParams` *does*
 								  carry a `session` field** (`lance/src/dataset/write.rs`), but it only matters on the
 								  `WriteDestination::Uri` arm; OmniGraph's staged path always drives off an **already-open
 								  `Dataset`**, and Lance takes the store/session from that handle. So to attach the shared
 								  Session to a write base, open read-style (`open_table_dataset` → `from_uri().with_version()
 								  .with_session()`) and drive the staged write off that handle.
 								- A held `Arc<Dataset>` at a pinned version is `Send + Sync`, immutable, safe to reuse for
 								  many scans/count/staged-write base in one txn (OmniGraph's `TableHandleCache` already
 								  relies on this).
 								- **No compaction `RetryExecutor`** (only Delete/MergeInsert/Update have one).
 								  `commit_compaction` commits a fixed `Rewrite` via `apply_commit` direct. In
 								  `commit_transaction`, a semantic `RetryableCommitConflict` **escapes the retry loop**
 								  via `?` at `io/commit.rs:979`; the loop only retries the OCC `CommitConflict`
 								  (`:1096`), and even that re-rebases the *same* transaction (never re-plans). ⇒
 								  **compaction needs app-level reopen+REPLAN; you cannot "set conflict_retries" and let
 								  Lance own it.**
 								- `check_rewrite_txn`: a `Rewrite` rebases **cleanly** past a concurrent `Append`/disjoint
 								  `Update`/`Delete` (preserving both); only a same-fragment overlap yields a retryable
 								  conflict. ⇒ the common concurrent insert/update/delete is rebased for free; the app
 								  retry fires only on real overlap.
 								**Engine (internal):**
 								- Read path (post-#268) already has the capture-once machinery: `Snapshot` (`db/manifest.rs`),
 								  warm `GraphCoordinator` behind a `latest_version_id`/incarnation probe, a held
 								  `TableHandleCache` keyed `(table,branch,version,e_tag)`, **one shared `Session` per
 								  graph** (`read_caches.session`). **Writes bypass all of it by construction**
 								  (`resolved_branch_target` returns `read_caches: None`; the 3a write opener attaches no
 								  session and opens by latest, not pinned version).
 								- A single write opens each table **3–4×** (accumulation → staging reopen → commit
 								  drift-guard → publish prepare), each a fresh cold open. `validate_schema_contract`
 								  (`db/schema_state.rs`, via `ensure_schema_state_valid`) runs uncached (~3 `read_text`
 								  + 2 `exists`) at every resolve point (~the flat-46). Both are constant-factor, flat in
 								  depth — 3b's targets.
 								- Strict-op guards are the lost-update floor (3 layers: pre-stage `ensure_expected_version`
 								  `table_store.rs`; commit-time strict drift `exec/staging.rs`; publisher CAS
 								  `publisher.rs`). Capture-once **supplies** the pinned operand — never remove a guard.
 								- Fork-on-first-write authority reads (`classify_fork_ref` → `fresh_snapshot_for_branch`)
 								  must stay **fresh** (not served from a pinned base).
 								- Cost harness: `helpers::cost` (`measure`/`measure_with_staged`/`IoCounts`/`assert_flat`/
 								  `local_graph`/`s3_graph`). The schema-once assert can reuse `CountingStorageAdapter`
 								  (`warm_read_cost.rs::warm_query_validates_schema_contract_once`) with **zero** prod
 								  change; an open-count assert wants a small `open_count` AtomicU64 in `QueryIoProbes`
 								  (copy the `probe_count`/`record_probe` pattern). The forbidden-API guard
 								  (`tests/forbidden_apis.rs`) makes an instrumentation-level counter complete.
 								---
 								## 3. The #297 cycle (this branch) — what it is, and the lesson
 								`fix-optimize-concurrency-race` (5 commits): a CLI `optimize` racing a served write on the
 								same table failed (Lance Rewrite lost, or the equality-CAS publish lost). Fix: unify both
 								compaction paths on the internal path's **reopen+replan** shape, with a **two-level retry**
 								— outer loop reopens+replans on a real Lance overlap; inner Phase-C loop makes the manifest
 								publish a **monotonic fast-forward** (advance to compacted version `N`, or no-op when the
 								manifest already moved to `≥ N`), never the strict equality CAS. Sidecar written once;
 								in-process queue kept as a contention reducer (not the cross-process guard); no `writer_epoch`.
 								**Two review rounds surfaced two follow-on bugs I introduced with the retry loop** — both
 								fixed, both regression-tested (own-HEAD-drift via negative control):
 . **Own-HEAD-drift misclassification** (`56d004e0`): the drift guard re-ran every
 								   iteration and, after a partial Phase-B commit (auto_cleanup strip or compact, then a
 								   later op conflicts), saw `HEAD > manifest` from *our own* covered work and deleted the
 								   sidecar + returned `skipped_for_drift` (stranding uncovered drift). Fix: track
 								   `head_advanced`; the drift guard fires only when `!head_advanced`.
 . **Publish exhaustion spurious error** (`e9d16a2c`): the publish loop returned `Err` on
 								   its final retry even if the conflict meant a concurrent writer already published `≥ N`
 								   (postcondition met). Fix: re-check `current >= state.version` on exhaustion.
 								**The lesson (write it on the wall):** *wrapping a sequence of side-effecting commits in a
 								retry silently converts every "checked once, before any side effect" precondition into
 								"re-checked after partial side effects."* That's a distinct bug class; it needs
 								fault-injection tests **at each commit boundary**, not just end-to-end concurrency tests.
 								(The `optimize.before_compact` / `optimize.inject_reindex_conflict` failpoints exist for
 								exactly this.)
 								**Temporary mechanism flag:** `head_advanced` is an in-memory proxy for "is this HEAD
 								movement mine." Under Design A the authority answers that from the plan/sidecar **identity**
 								— so `head_advanced` is the part that gets *replaced*, while the monotonic-publish +
 								reopen/replan **semantics** are permanent. (Noted in RFC §6.6.)
 								---
 								## 4. DONE: Step 3b — capture-once `WriteTxn` (shipped on `rfc-013-step-3b-writetxn-v2`)
 								**Delivered:** on the **table-touch hot path**, a single `mutate`/`load` validates the schema
 								contract **once** and opens each touched data table **at most once** — a constant-factor/RTT
 								win (not a depth-slope win; 1a). Two cost gates in `write_cost.rs` lock it (both on a node
 								insert): `write_validates_schema_contract_once` (3 `read_text` / 2 `exists`, was 12/9) and
 								`keyed_insert_opens_table_at_most_once` (`data_open_count <= 1`, was 4). The carrier is the
 								minimal `WriteTxn { branch, base }`, threaded as `Option<&WriteTxn>` (`Some` on the hot
 								mutate/load path, `None` byte-identical everywhere else); it **converges into** step 5's
 								`PublishPlan`.
 								**Not "once" everywhere (scope, not regression):** edge endpoint / cardinality RI validation
 								(`ensure_node_id_exists`, the loader's RI + cardinality) still resolves through
 								`snapshot_for_branch` and re-validates the schema — and reads **warm**, not live. Threading
 								`txn.base` there to make it "once" would re-introduce the stale-read class the #298 cardinality
 								fix removed (it now reads live HEAD). Doing schema-once *and* fresh reads for those validations
 								needs the unified, re-checked read-set — **step 4 §7.1** (§1d). So #298 **un-regresses
 								cardinality only; it does not close write-validation freshness.** No edge-insert/load schema-once
 								gate yet (only the node gates above).
 								Commits (off merged-#297 main):
 								- **Stage 0** — scope `open_count` → `data_open_count`/`internal_open_count` by URI class
 								  (the review fix: `open_dataset_tracked` also opens `__manifest`/`_graph_commits`, so the
 								  raw counter conflated them and the gate was unreachable). Re-baselined RED 4.
 								- **Commit A (schema-once)** — capture `txn` once at entry (the single validation); the 4
 								  validation sites collapse: S1 (entry `ensure_schema_state_valid`) removed; S3a
 								  (`open_for_mutation_on_branch`) + S3b (`prepare_updates_for_commit`) source `txn.base`;
 								  S4 (`commit_all`) uses new `fresh_snapshot_for_branch_unchecked` (the OCC manifest re-read
 								  minus the schema re-validation). `fresh_snapshot_for_branch{,_unchecked}` now read the
 								  manifest directly via `ManifestCoordinator` (drops a spurious commit-graph `exists` probe;
 								  same `Snapshot`).
 								- **Commit B (open collapse 4→1)** — #1 accumulation open ELIMINATED (the node path discarded
 								  the handle; read `txn.base.entry().table_version`); #2 staging open KEPT (the one open);
 								  #3 commit drift-guard reads live HEAD via `entry.dataset.dataset().latest_version_id()` (a
 								  cheap manifest-pointer probe off the staged handle, not a fresh open); #4 index build reuses
 								  the `commit_staged` handle threaded through `CommittedMutation`/`prepare_updates_for_commit`.
 								- **Commit B.1 + cleanup** — named the two positional returns (`OpenedForMutation`,
 								  `CommittedMutation`) + a `debug_assert` pinning the open-skip contract; **removed the
 								  unearned `WriteTxn.session` field** (the collapse uses skip/probe/reuse, not a session).
 								**RFC §4.1 corrections — how they resolved:**
 . *Thread the evolving handle, not a version-keyed cache* → realized as collapse #4 (carry
 								   the `commit_staged` handle forward into the index build).
 . *Don't forbid re-resolution* → honored: the commit-time OCC re-read
 								   (`fresh_snapshot_for_branch_unchecked` — fresh manifest, only schema-revalidation dropped)
 								   and the fork-authority reads stay fresh.
 . *Minimal carrier* → `WriteTxn { branch, base }` (even the `session` from the original
 								   sketch was dropped as unearned).
 								**Deferred to step 5 (NOT in this PR):** session-aware write base opens. The one remaining
 								open (#2) stays a HEAD open; warming the shared `Session` across writes is an object-store
 								(S3) phenomenon invisible on local FS, so it earns its own `write_cost_s3.rs` gate in step 5,
 								where `txn` becomes the non-optional publish carrier. No new concurrency test was needed here:
 								#2 stays a HEAD open (no pinned+session base introduced), so the publisher CAS + #3 live-HEAD
 								probe fences are unchanged (covered by the green `writes.rs`/`consistency.rs`).
 								**Guardrails (don't regress):** schema validation is deliberately uncached for drift
 								detection — collapse to 1 *per write*, never cache across writes on a long-lived handle
 								(`lifecycle::long_lived_handle_rejects_schema_*`). The commit-time fresh read is OCC
 								machinery, not redundancy. Keep all 3 strict-op guards. Keep fork-authority reads fresh.
 								Pin the *correct* branch (server-bound-to-main writing a feature branch falls to a fresh
 								open). A branch `rfc-013-step-3b-writetxn` exists off an earlier main; rebase onto the
 								post-#297 main before starting.
 								---
 								## 5. Design A — the `PublishPlan` unification (step 5) = the convergent fix
 								**This is the real fix for the bug class in §1c.** Collapse the four hand-rolled writers +
 								the maintenance path into **one `publish(txn, plan)` authority** where the CAS + bounded
 								retry is **unconditional and unbypassable** (no caller can "hold the queue → skip the CAS").
 								Properties:
 								- **One interpreter of the `HEAD vs manifest` delta** — and "is this my work?" is answered
 								  by the plan/sidecar **identity**, not a re-derived comparison. The own-HEAD-drift bug, the
 								  §6.5 writers, the write-path guard — all close *by construction*.
 								- **Recovery = the same `PublishPlan` re-applied** — the crash-recovery interpreter and the
 								  live interpreter become the same code (`iss-merge-recovery-partial-rollforward` gone).
 								- Each `TableAction` commits by its **class** (§1b): `Rewrite` = maintenance (Lance rebase
 								  + reopen/replan + monotonic fast-forward, **no epoch**); load/mutate = logical (strict OCC
 								  + `writer_epoch`).
 								**Why it composes with Lance MTT (don't over-build):**
 								- The **unification itself is convergent** — when MTT lands, it slots *underneath* the same
 								  authority; nothing wasted. Build this.
 								- The **`writer_epoch`** is the one MTT-redundant piece (MTT's commit-handler lease subsumes
 								  a cross-process fence). Build it *last and minimally*, gated on actually deploying
 								  multi-writer topologies. Per the deny-list, don't reimplement what the substrate will own.
 								**Sequencing judgment (this cycle's strongest signal):** the bug density here (this PR alone
 								= 3 review rounds, all "a writer re-interprets the delta") means the current N-writers interim
 								is high integrated-over-time liability. **Consider pulling the *convergent half* of step 5
 								(the single authority + recovery-as-plan) forward — possibly ahead of 3b** — because it stops
 								the bug class rather than patching instances. #297 + #254 are the *de-risking inputs*: they
 								validate the maintenance-class and logical-class commit models in isolation first, so Design
 								A implements a known spec rather than designing under refactor pressure. Do NOT build more
 								substrate-shaped scaffolding (custom WAL / job queue / second coordination table) to paper
 								over the window — strictly higher liability than either Design A or waiting for MTT.
 								**Deeper-than-A (post-MTT or as Lance exposes uncommitted variants):** all-uncommitted-fragments
 								+ one manifest commit would shrink the A-before-B window itself, blocked today by Lance not
 								exposing uncommitted variants for `compact_files` / `optimize_indices` / vector index (#6666
 								open; delete #6658 shipped). Track, don't build yet.
-												feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299)

* docs(rfc-013): bank the #295 spec-review comments as step-5 constraints (§5.1)

3b shipped a minimal WriteTxn{branch,base} and deferred the full §4.1 opener
unification (pinned-base opener, shared Session, write-local handle cache,
strict-op conflict-timing move) to step 5. The greptile comments on the #295
spec were moot for #298 (none of those constructs were built) but are
load-bearing for step 5: (1) the handle cache must be Send+Sync (Mutex, not
RefCell); (2) the strict-op timing move needs an explicit retry contract — txn
discarded after any commit, retry re-opens a fresh base — which is the SAME
contract as the stale-view false-fail (§1d.2); (3) the opener-equivalence test
must advance HEAD externally then assert pinned-base, not the trivial HEAD==base.

* feat(engine): fold graph lineage into the __manifest publish CAS (RFC-013 Phase 7)

Graph lineage no longer lives in a second write to _graph_commits.lance. Each
commit's graph_commit + graph_head:<branch> rows now ride the SAME __manifest
merge-insert as the table-version rows (one atomic version), and CommitGraph reads
its cache from the manifest projection (read_graph_lineage). _graph_commits.lance is
no longer written commit rows (it remains only as a Lance branch-ref carrier).

Mechanism: a LineageIntent { graph_commit_id (ULID, minted once), branch, actor,
merged_parent, created_at } threads through ManifestBatchPublisher::publish. Inside
the publisher retry loop the parent is resolved per attempt from the just-loaded
branch-scoped manifest (the should_replace_head winner over the visible graph_commit
rows — branch-correct by Lance branch isolation; the graph_head row is written for
forward-compat + the §7.1 contention point but is not the parent source, so a
freshly-forked branch resolves the right fork-point parent). A CAS-conflict retry
re-reads the advanced head → correct new parent; the commit_id is stable across
retries.

Closes two known gaps BY CONSTRUCTION (one write, no second step to fail/ race):
- manifest→commit-graph atomicity (no crash window between manifest + lineage),
- commit-graph parent under concurrency (no refresh→append TOCTOU; the per-write
  commit_graph.refresh() is gone).

Recovery, branch-merge, and genesis route their lineage through the same CAS
(merge: one commit_merge_with_actor; recovery: publish_recovery_commit folds the
recovery commit, actor=omnigraph:recovery; genesis rides the init __manifest write).
The dead _graph_commits write helpers (append_commit/_merge/_actor) are
#[allow(dead_code)] (the actor sidecar table is still enumerated by optimize).

Verified (sequential): build clean; the new lineage_projection gate (manifest-only —
_graph_commits/_actors have 0 rows; full lineage reconstructs via the projection);
branching/merge_truth_table (exhaustive, branch-aware)/composite_flow/point_in_time/
changes/consistency/recovery; failpoints (59, incl. recovery lifecycle + the
now-closed atomicity gap); full --workspace. Cost tests REVERT to their pre-fold
values (writes +1, write_cost ceiling 80) — the proof of true single-CAS (no extra
write). invariants.md marks both gaps CLOSED.

PENDING (next stages, this PR): the §7.1 concurrent graph_head one-winner gate (stage
5 — two concurrent same-branch commits, exactly one wins); the stamp bump v4 +
migrate_v3_to_v4 backfill + read-only refuse for EXISTING graphs (stage 4); full
doc-sync of storage.md/architecture.md/writes.md.

* feat(engine): migrate existing v3 graphs to manifest lineage (RFC-013 Phase 7 stage 4)

The Phase-7 fold made CommitGraph read lineage from the __manifest projection, so a
pre-Phase-7 (internal-schema v3) graph — lineage in _graph_commits.lance, none in
__manifest — would read an empty commit DAG. Stage 4 makes existing graphs upgrade
seamlessly and not break reads.

- Stamp 3 -> 4 + migrate_v3_to_v4: bumps INTERNAL_MANIFEST_SCHEMA_VERSION and adds the
  3 => migrate_v3_to_v4 arm. The migration reads this branch's _graph_commits/_actors,
  emits one graph_commit row per commit + exactly one graph_head:<branch> for the head
  (should_replace_head winner, deterministic id-sort — no hash-map-order in migration
  output), merge-inserts into __manifest, then set_stamp(4) LAST. Idempotency guard
  first (read_graph_lineage non-empty -> just stamp); crash before set_stamp re-enters
  at v3 and the guard completes it. Does NOT touch the unenforced-PK metadata. Runs per
  branch: migrate_on_open backfills main; load_publish_state backfills each branch on
  its first write (root_uri/branch threaded through migrate_internal_schema).
- v3-read fallback: CommitGraph version-gates the lineage source — stamp < 4 reads the
  (re-activated) _graph_commits.lance; >= 4 uses the manifest projection. So a READ-ONLY
  open of an un-migrated graph reads correct history with no write. Correctness catch:
  the legacy _graph_commit_actors.lance was never branched, so the fallback reads it
  FLAT (no branch checkout) while checking out the branch only on the commits dataset.
- Read-only stamp-refuse: a ReadOnly open of a FUTURE-stamped graph now refuses with the
  same upgrade error (future-proofing the next format bump; the write path already
  refused via migrate_internal_schema).
- Docs: storage/architecture/writes/invariants/constants updated to manifest-stored
  lineage; release note docs/releases/v0.8.0.md (format v4, old writers clean-break,
  data preserved, upgrade writers first).

6 new tests (v3 backfill, idempotent, v3 read-only fallback, future-stamp refuse in both
modes, crash-before-stamp completes, legacy branch+flat-actor read). Full engine suite +
failpoints (59) + cargo test --workspace --locked green; check-agents-md passes.

* test(engine): graph_head concurrency gate — disjoint same-branch writers form a linear commit DAG (RFC-013 Phase 7)

Two (or N) writers committing disjoint tables on one branch still share the
mutable `graph_head:<branch>` manifest row, so the only row-level CAS
contention is that row. The contract — exactly one writer wins each CAS round;
the loser retries inside the publisher, re-resolves its parent off the
freshly-advanced head, and re-commits, so every writer lands and the
graph_commit DAG stays a single LINEAR chain (no fork) — had no acceptance
test. This adds it.

- concurrent_disjoint_writes_share_head_and_form_linear_chain: two disjoint
  writers + distinct LineageIntent, tokio::join!; both commit; the on-disk DAG
  is genesis -> c -> c' (asserted linear: exactly one genesis, no two commits
  share a parent, the head is the unique non-parent).
- n_concurrent_disjoint_writers_converge_to_one_linear_chain: N=8 disjoint
  writers each with an app-level retry loop (the publisher's internal budget
  can be exhausted under contention); all converge to one linear chain of 8.
- concurrent_disjoint_writes_form_linear_chain_on_s3: the same race on a real
  object store (true conditional-put CAS), bucket-gated.

Cites both tests from the §7.1 contention note in invariants.md.
Test-only; no production change.

* perf(engine): fold the lineage parent scan into the publish path's single __manifest scan (RFC-013 P2)

Each lineage publish scanned `__manifest` twice: `load_publish_state` read
table state via one scan, then `resolve_lineage_rows` did a second full
`read_graph_lineage` scan only to find the parent commit. Fold the
`graph_commit` extraction into the existing scan.

- `read_manifest_scan` gains a `collect_lineage` flag. The publish path
  (`read_publish_scan`) collects the `graph_commit` rows in the same pass; the
  table-state hot path leaves them in the forward-compat skip arm, so it never
  pays the O(commits) lineage JSON decode (it also skips reading the
  `object_id` column entirely). One shared `decode_graph_commit_row` serves
  both the folded path and the standalone `read_graph_lineage`, so the two
  cannot drift.
- `resolve_lineage_rows` is now sync and takes the already-parsed rows; the
  per-attempt re-read is preserved because `load_publish_state` runs once per
  CAS attempt, so a retry still re-parents off the advanced head.
- `load_publish_state` returns a named `LoadedPublishState` instead of a
  four-tuple; the thin `read_registered_table_locations` /
  `read_tombstone_versions` accessors fold away. `read_manifest_entries` becomes
  `#[cfg(test)]`: the fold removes its last production caller, leaving only the
  test-only namespace module (`db/manifest.rs`: `#[cfg(test)] mod namespace`),
  so gating it keeps it from becoming dead code in non-test builds.

Measured at depth ~5: per-write `__manifest` reads drop 44 -> 26 (total reads
54 -> 36). write_cost.rs gains a `manifest_reads <= 34` sub-ceiling that trips
if a publish-path scan is re-added, and its calibration comment is corrected.

* test(engine): red — transient legacy-open failure silently completes the v3→v4 migration

A pre-Phase-7 (internal schema v3) graph keeps its graph lineage in
`_graph_commits.lance`; the v3→v4 internal-schema migration backfills it into
`__manifest` and stamps v4. `read_legacy_commit_cache` currently maps EVERY
`Dataset::open` error to "no legacy data" (`Err(_) => empty`), so a transient or
corrupt open during the one-time migration backfills nothing and still stamps
v4 — orphaning the real lineage permanently (the migration runs once; the v3
fallback is then disabled).

Add a `migration.v3_to_v4.legacy_open` failpoint that injects a non-not-found
Lance error at the legacy open, and a fault-injection regression test in the
`failpoints` binary. Against the current swallow the migration completes anyway,
so the test fails on its "migration must abort" assertion — the predicted
symptom. The fix follows in the next commit.

Test support reachable from the `failpoints` integration binary (it compiles the
crate without `cfg(test)`): the v3-fixture helpers and a stamp/row-count reader
are gated `cfg(any(test, feature = "failpoints"))`, still excluded from release
builds. Failpoint tests stay in the integration binary because the fail registry
is process-global.

* fix(engine): propagate non-not-found legacy-open errors in the v3→v4 migration

`read_legacy_commit_cache` mapped EVERY `Dataset::open` error to an empty cache
(`Err(_) => empty`) on both the legacy commits dataset and its actor sidecar. The
v3→v4 internal-schema migration reads this once before stamping internal-schema
v4; a transient or corrupt open therefore backfilled nothing and stamped v4
anyway, orphaning the graph's real lineage permanently (the migration runs once,
and the stamp-gated v3 fallback is disabled at v4). This is the "no silent
failures" deny-list violation, and realistic on object storage.

Both opens now match the not-found variants — Lance maps an object-store NotFound
to `DatasetNotFound` — as the benign "no legacy data" / "no authors" signal, and
propagate anything else as a loud error. The two arms share the variant contract
but carry different rationale (commits-absent is the legitimate empty signal;
actor-sidecar-absent is benign, but a corrupt actor open silently wiping
authorship before stamping v4 is the same loss hole), commented at each site.

Pinned by the `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant`
guard (turns red if a Lance bump changes the absence variant) and greens the
fault-injection regression test from the previous commit.

* test(engine): cover the per-branch v3→v4 migration against a real Lance branch

`seed_legacy_v3_lineage` writes every commit (including the "feature"-tagged one)
to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field, so the
production per-branch migration path — `read_legacy_commit_cache` checking out a
real Lance branch, and a branch-scoped `__manifest` — was never exercised.

Add `seed_legacy_v3_lineage_with_branch`, which forks a real `feature` Lance
branch on BOTH `_graph_commits.lance` and `__manifest` (the branch inherits
main's stripped v3 state), and a test that migrates the BRANCH and asserts the
branch's lineage lands in the BRANCH's `__manifest` (genesis + A + branch commit,
`graph_head:feature` → branch commit, parents + actors intact) with main's
`__manifest` untouched.

This empirically resolves the open question behind the merge robustness work: the
fast-path `read_graph_lineage(dataset)` has no `manifest_branch` filter, but
`__manifest` is Lance-branched per graph-branch, so a branch reads only its own
lineage — the test confirms migrating one branch does not leak into another. No
branch filter is needed.

* refactor(engine): type the lineage-backfill merge conflict via the publisher classifier

`state::merge_lineage_rows` (the v3→v4 lineage backfill's standalone `__manifest`
merge-insert) stringified its `execute_reader` error, discarding the Lance
variant. Route it through the publisher's `map_lance_publish_error` (now
`pub(crate)`) so a concurrent first-open's row-level CAS loss surfaces as the
SAME typed `OmniError::Manifest{ details: RowLevelCasContention }` the publisher's
own retry consumes — one vocabulary, no raw-Lance matching in the migration.

Deliberately NOT unified with `optimize::is_retryable_lance_conflict`: that
classifier also matches `CommitConflict`/`RetryableCommitConflict` from the
compaction commit path, which a row-level merge-insert never emits. Cross-linked
with a comment at both sites.

Behavior-preserving: the only path that changes is the error TYPE on a CAS loss
(previously an opaque `Lance` string, now a typed conflict); no success/failure
outcome changes. The bounded re-open retry that consumes the new type lands next.

* test(engine): red — concurrent v3→v4 migrations error instead of converging

`migrate_v2_to_v3` is concurrent-runner idempotent by design; v3→v4 regressed it.
`merge_lineage_rows` uses `conflict_retries(0)` and `migrate_v3_to_v4` has no
app-level retry, so when two processes open the same legacy graph at once the
backfill's row-level CAS loser errors the whole open instead of converging.

The test opens two `__manifest` handles at the same pre-migration (v3,
empty-lineage) HEAD and runs both `migrate_internal_schema` calls under
`tokio::join!`, forcing the `graph_head:main` CAS to fire every run. Against the
current code the loser fails with `RowLevelCasContention` ("Attempted 0
retries.") — the predicted symptom — so the "both must converge" assertion
panics. The bounded re-open retry that makes both converge lands next.

* fix(engine): make the v3→v4 lineage backfill converge under concurrent runners

`migrate_v2_to_v3` is concurrent-runner idempotent; v3→v4 was not. Two processes
(or open-for-write handles) opening the same legacy graph at once both reach the
backfill merge, and `merge_lineage_rows`'s `conflict_retries(0)` made the
row-level CAS loser error the whole open instead of converging.

Two contention points, both now handled all-or-nothing:

1. The backfill merge on `graph_head:<branch>`. Wrap (fast-path re-read → read
   legacy → merge) in a bounded re-open retry loop: a `RowLevelCasContention` loss
   re-opens the manifest past the winner's (atomic) commit and re-loops; the
   fast-path re-read then sees the winner's lineage and stamps. On budget
   exhaustion it returns a `RowLevelCasContention`-typed error so the publisher's
   OUTER retry loop completes it. The retry decision reuses the publisher's
   `is_retryable_publish_conflict` so the two stay in lockstep.

2. The terminal stamp bump. Making the merge loser converge newly lets BOTH
   runners reach `set_stamp(4)` — an `UpdateConfig` commit on the same key — so the
   loser gets `lance::Error::IncompatibleTransaction` (NOT a row-level CAS, so the
   merge loop doesn't catch it). This surfaced only under the concurrent
   full-suite run, not the isolated test. Both write the SAME value, so the
   conflict is benign: `commit_v4_stamp_idempotently` re-opens and, if the stamp
   already reached the target, succeeds; else re-applies (bounded).

Greens the race test from the previous commit (3x isolated, 5x full-suite, no
flake). The new `IncompatibleTransaction` match is pinned by
`lance_surface_guards.rs::lance_error_incompatible_transaction_variant_exists`.

* fix(engine): refuse a future internal-schema stamp on the branch read path

`load_commit_cache_for_branch` dispatched on the branch's internal-schema stamp —
`< CURRENT` to the v3 legacy fallback, `>= CURRENT` to the manifest projection —
but never refused a `> CURRENT` branch stamp, so a newer-binary shape would be
misread by the projection rather than rejected.

Add `refuse_if_stamp_too_new(stamp)` (re-exported `pub(crate)` from `migrations`)
right after the branch stamp is read, mirroring the main read path's
`refuse_if_internal_schema_too_new`. This is defense-in-depth, not a live hole:
migrations run main-first (main migrates on open; each branch on its first write),
so main's stamp is always >= every branch's and the main path refuses first. The
guard closes the gap if that ordering invariant is ever weakened.

Tested by force-stamping a real branch past CURRENT and asserting the branch read
refuses with the upgrade error (the test misreads via the projection — returns Ok
— without the guard, confirmed by removing it).

* docs(rfc-013): record the v3→v4 migration robustness fixes

invariants.md Known Gaps: the `migrate_v3_to_v4` entry now states the migration is
loud on non-not-found legacy-open errors and concurrent-runner idempotent (bounded
re-open retry on the merge CAS + idempotent stamp bump), and that the branch read
path refuses a `> CURRENT` stamp.

lance.md: note the two new surface guards the migration depends on
(`dataset_open_missing_returns_not_found_variant`,
`lance_error_incompatible_transaction_variant_exists`).

testing.md: note the migration fault-injection test in the failpoints row.

* refactor: remove dead code and silence warnings across engine + cluster

Dead-code sweep follow-up to the RFC-013 stack. No behavior change.

- engine: delete the orphaned `validate_edge_cardinality` — the load path uses
  `validate_edge_cardinality_with_pending_loader` for every mode (including
  Overwrite, which it treats as the replacement table image), so the old
  standalone validator had no caller — and correct its sibling's now-stale doc
  reference. Gate `TableStore::append_batch` `#[cfg(test)]`: it is the inline-
  commit residual kept only for recovery test setup, with no non-test caller.
- cluster: drop unused imports in `lib.rs`, delete the unused
  `ClusterStore::payload_display`, and raise `LiveGraphObservation` /
  `GraphObservationJson` / `PolicyTarget` to `pub(crate)` to match the functions
  that return them.

Both lib crates now build warning-free.

* fix(engine): match Lance's typed DatasetAlreadyExists, not the message string

The internal create-or-open idempotency fallbacks in `db/commit_graph.rs` and
`db/recovery_audit.rs` classified the "already exists" race by
`err.to_string().contains("Dataset already exists")` — a Lance display string,
not an API contract. A wording change upstream would silently break the fallback
(a re-create would error instead of opening the existing table). Match the typed
`lance::Error::DatasetAlreadyExists { .. }` variant instead — the same discipline
as the v3→v4 migration's not-found classifier — pinned by the new
`lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists`
guard so a Lance rename turns red instead of silently regressing.

* refactor(engine): consolidate now_micros into one crate::db helper

Four `fn now_micros() -> Result<i64>` copies (commit_graph, recovery_audit,
graph_coordinator, manifest/graph) had already drifted: three mapped the
clock error to `OmniError::manifest("...UNIX_EPOCH...")` while recovery_audit
used `OmniError::manifest_internal("...unix epoch...")`. Replace all four with
one `pub(crate) fn now_micros()` in `db/mod.rs` (the majority `manifest`
variant), and repoint the eight call sites at `crate::db::now_micros()`. No
test asserts on the failure message, so unifying the variant is behavior-safe;
the timestamp-mapping contract can no longer fork across the rows it stamps.

* refactor(engine): drop the dead snapshot param from roll_back_sidecar

`roll_back_sidecar` took `snapshot: &Snapshot` only to discard it with
`let _ = snapshot;` — rollbacks now always publish (the restored HEAD plus a
recovery-commit lineage row), so the snapshot is never read to decide whether
to skip a publish. Remove the parameter, the two call-site arguments, and the
suppressor. A signature must not advertise inputs it does not consume. The
`Snapshot` import stays — `process_sidecar`, `roll_forward_all`, and
`record_audit_recovery_rollforward` still take it.

* test(engine): red — open_at_branch wedges a branch on a missing commit-graph ref

A v4 graph keeps its graph lineage in `__manifest` (RFC-013 Phase 7); the
`_graph_commits.lance` branch ref is a derived artifact. An interrupted
fork-reclaim or a `cleanup` race can drop that derived ref while the manifest
lineage stays intact. Per invariants 7 + 15 a missing derived ref must not fail
a logical read of the lineage.

This wedge builds a real v4 `feature` branch (its `graph_head:feature` row in
`__manifest`), force-deletes ONLY the `_graph_commits.lance` `feature` ref, then
asserts the branch reads (`open_at_branch` / list-commits / `merge_base`)
succeed from `__manifest` while a write that needs the derived ref
(`create_branch`) fails loudly with the typed actionable error.

Red against current code: `open_at_branch`'s hard `checkout_branch(branch)?` on
the missing ref errors `OmniError::Lance` (Lance "Not found:
_graph_commits.lance/tree/feature/_versions"), wedging the logical read.

* fix(engine): read manifest lineage independent of the derived _graph_commits ref

`CommitGraph::open_at_branch` did a hard `checkout_branch(branch)?` on the
`_graph_commits.lance` branch ref before reading lineage — so a missing derived
ref (an interrupted fork-reclaim, or a `cleanup` race) wedged the branch's
commit-list / merge-base / snapshot resolution even though the lineage is
readable from the authoritative `__manifest` (RFC-013 Phase 7). That is a
derived/physical artifact failing a logical read — invariants 7 and 15.

Make the held commits handle `Option<Dataset>` (mirroring `actor_dataset`).
`open_at_branch` and `refresh` check out the derived ref best-effort: a typed
not-found (`RefNotFound`/`NotFound`) yields a `None` handle while the read
re-syncs from `__manifest`; any other open error still propagates. The manifest
existence gate is unchanged — `load_commit_cache_for_branch` keeps its hard `?`,
so a truly absent branch still fails loudly at the manifest. `create_branch`
(the only writer that forks a ref) and the folded-in version lookup return a
loud, actionable error on `None`, deferring repair to `cleanup`'s existing
orphan reconciler rather than inlining a write on a read-side refresh. Reads
(`head_commit`/`load_commits`/`get_commit`/`merge_base`) never touch the handle.

Greens the wedge regression from the preceding commit.

* fix(engine): v3→v4 retry loops return retryable contention on exhaustion

`commit_v4_stamp_idempotently`'s retry loop used `0..=STAMP_RETRY_BUDGET`
(6 iterations) with an `attempt < STAMP_RETRY_BUDGET` guard, so the LAST
iteration's `IncompatibleTransaction` fell through to
`Err(e) => OmniError::Lance(...)` — stringified, non-retryable — instead of the
intended `RowLevelCasContention`, and the post-loop contention return was dead
code. The publisher's outer retry only re-runs `is_retryable_publish_conflict`,
so under sustained concurrent v3→v4 migration the one-time stamp bump could fail
instead of converging, defeating the idempotency the migration is supposed to add.

Fix the loop to `0..BUDGET` with an UNGUARDED `IncompatibleTransaction` arm: the
retryable variant is always handled inside the loop (re-open + same-value check +
retry), so it can never reach the stringifying catch-all, and the post-loop is the
SINGLE reachable exhaustion path — the typed `RowLevelCasContention`. The `Err(e)`
arm now catches only genuine non-contention errors. Apply the same range alignment
to the sibling merge loop in `migrate_v3_to_v4` (behaviorally correct today — its
`Err(err)` returns the already-typed contention — but it carried the identical
off-by-one structure the stamp loop was copied from; aligning both stops the next
copy from re-introducing it).

Test-first. The exhaustion path is otherwise near-unreachable — a real concurrent
winner stamps the same value, so the re-read returns Ok on the first retry — so a
new `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to
lose, driving exhaustion deterministically. Against the pre-fix loop the new
`v4_stamp_exhaustion_returns_retryable_contention` test goes red with
`Lance("Incompatible transaction: injected failpoint triggered…")`; with the fix it
asserts the typed `RowLevelCasContention`. Found by automated review on #299.

* feat(engine): minimum-supported internal-schema floor + retirement tripwire

The internal-schema migration chain (`migrate_internal_schema`) had a too-new
ceiling but no floor, so every old `migrate_vN_…` arm and the v3 legacy readers
it needs stay forever — the pile grows by one migration + readers + tests every
schema version. Add `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (1 today, a pure
no-op: `read_stamp` floors an absent stamp at 1 and no real graph carries 0) as
the oldest stamp this binary opens; raising it is how the chain sheds old code.

Collapse the one-sided `refuse_if_stamp_too_new` into `refuse_if_stamp_unsupported`
checking both bounds, so the floor lands at all three stamp-enforcement sites —
the write-path migrate dispatcher, the read-only open guard, and the branch
lineage-read path (`commit_graph.rs`) — via one compiler-enforced rename. A
hand-wired floor twin would have had to touch each site, and the branch-read path
is easy to miss; one combined guard cannot half-enforce. Rename the read-only
wrapper `refuse_if_internal_schema_unsupported` to match.

A compile-time tripwire (`const _: () = assert!(LOWEST_REGISTERED_MIGRATION_SOURCE
== MIN_SUPPORTED…)`) fails the build if a future floor bump forgets to delete the
now-dead migration arm (or vice versa) — stronger than a runtime test, impossible
to skip, and it doubles as the use that keeps the mirror const live.

Tests: a sub-floor graph is refused in both open modes (twin of
`future_stamp_is_refused_in_both_open_modes`); the guard accepts exactly
[MIN, CURRENT]. No behavior change for any real graph. The retirement runbook
lives on the `MIN_SUPPORTED` doc-comment + invariants.md.

* fix(engine): compose migration contention with publisher retry; precise recovery-converge audit commit

Three review-surfaced fixes on the RFC-013 Phase 7 path.

Publisher retry vs migration contention: `publish()` propagated a
`load_publish_state` error fatally via `?`, so a `RowLevelCasContention` surfaced
by the v3->v4 migration's exhausted merge/stamp budgets aborted the publish
instead of being retried — only `merge_rows` conflicts hit the retry. This
contradicted the migration's own design, which returns that typed error
EXPECTING the publisher to re-run the load (by which point a concurrent winner
has usually finished the migration, so the next scan is a no-op). Route a
retryable load error through the same retry path as a retryable `merge_rows`
conflict. Regression test (failpoints): a one-shot retryable contention injected
into `load_publish_state` now commits via the retry; red without the fix (the
write fails with the injected contention).

Recovery-converge audit commit id: `converge_or_defer_roll_forward` recorded the
branch HEAD as the audit row's `graph_commit_id`, but a concurrent user write can
advance `graph_head` past the recovery commit between the winner's publish and
this read — attributing the audit to a later, wrong commit. Use the latest
`RECOVERY_ACTOR`-authored commit (what `publish_recovery_commit` mints), which is
the recovery commit by construction. The audit's actor was already correct (it
comes from `sidecar.actor_id`, not the commit).

Dead param: drop the unused `snapshot` from `record_audit_recovery_rollforward`
(removing the `let _ = snapshot;` suppressor). `storage` stays — it is used to
delete the sidecar.
											
										
										
											2026-06-25 13:55:34 +02:00
+								### 5.1 Step-5 design constraints inherited from the #295 spec review
 b shipped a **minimal** `WriteTxn { branch, base }` (schema-once + open-collapse via
 								eliminate/probe/thread) and **deferred** the full §4.1 opener-unification — the pinned-base
 								opener, the shared-`Session` open, the write-local **handle cache**, and the strict-op
 								conflict-timing move — to step 5. So the greptile-bot comments on the #295 *spec* were **moot
 								for #298** (which built none of those constructs) but are **load-bearing constraints for step
 ** when it builds them. Bank them:
 . **Handle cache must be `Send + Sync`** (`Mutex<HashMap<…, Dataset>>`, not `RefCell`) if
 								   `WriteTxn::open(&self)` is shared across concurrent stage futures — a `RefCell` compiles
 								   but panics when two stages poll. Or make it `&mut self` (no parallel-stage sharing). This
 								   is the deny-list "in-process-only `Dataset` impls — `Send + Sync`" item.
 . **The strict-op timing move needs an explicit retry contract.** If step 5 moves
 								   strict-op conflict detection from open-time `ensure_expected_version` to commit-time CAS
 								   (the §4.1 pinned-base design), it MUST specify: the txn is **discarded after any commit**
 								   (success or conflict — the handle cache is commit-invalidated), and the retry **re-opens a
 								   fresh `WriteTxn` at the new HEAD** (never re-stages against the stale pinned base — that
 								   reproduces the lost-update). **This is the same retry/refresh contract as the stale-view
 								   false-fail (§1d.2)** — the op-class-aware precondition + "fresh base on retry" are one
 								   design point. Today (#298) strict ops keep open-at-HEAD + `ensure_expected_version`, so the
 								   contract is unchanged; step 5 owns it the moment it pins strict reads to the base.
 . **The opener-equivalence test must be non-trivial.** A differential test that only passes
 								   when `HEAD == base` proves nothing about pinning. To actually prove "`WriteTxn::open`
 								   returns the pinned base, not HEAD," the test must **advance the branch HEAD externally
 								   (direct Lance write), then assert the txn open still reads the base version** — and that a
 								   strict write then fails `ExpectedVersionMismatch` at commit (verifying the timing move).
-												feat(engine): `WriteTxn` - validate schema + open each data table once per write (#298)

* docs(rfc-013): step-3b handoff + §4.1 corrections (validated)

Add the RFC-013 write-path handoff doc, and correct §4.1's WriteTxn sketch from the
4-subagent validation against current code:
- HandleCache → handle-threading (forward the commit-return handle; a version-keyed
  cache misses because HEAD walks N→N+1→N+2 across staging + index-build commits).
- "re-resolution unrepresentable" softened to "pinned base for the pre-commit phase +
  named fresh re-reads at the commit/fork boundary" — three reads (commit-time OCC, the
  live-HEAD drift probe, fork authority) are irreducible correctness machinery.
- WriteParams DOES carry a session field; the real constraint is "stage off an open
  Dataset," so attach the Session by opening read-style then staging off it.

* test(engine): RED step-3b capture-once fitness asserts + open_count probe

Two write-path cost gates, RED today, GREEN after the WriteTxn lands:
- write_validates_schema_contract_once: a write must validate the schema contract
  once (3 read_text + 2 exists). Today re-validates at every resolve point —
  measured 12 read_text / 9 exists (~4 validations) via CountingStorageAdapter
  (zero production change; the write twin of the read-path schema-once test).
- keyed_insert_opens_table_at_most_once: a keyed single-table write must open its
  table <=1x. Today measured 10 opens.

Adds an exact open-CALL probe: open_count + record_open() on QueryIoProbes (mirroring
probe_count/record_probe), called at both open chokepoints; surfaced as
IoCounts.open_count. forbidden_apis guarantees every write open routes through them.

* feat(engine): WriteTxn carrier + open_write_txn (3b scaffolding)

The capture-once write transaction (RFC-013 step 3b): WriteTxn{branch, base:
Snapshot, session} + Omnigraph::open_write_txn, which validates the schema contract
once and pins the base snapshot + the shared per-graph Session.

Landed as reviewed scaffolding (gated #[allow(dead_code)]); the next pass threads
Option<&WriteTxn> through open_for_mutation_on_branch / staging on the non-strict
bound-branch path — opening the base once from the pinned entry with the warm session
(a session-aware pinned opener returning a SnapshotHandle) and skipping the per-table
schema re-validation — to turn the two RED cost gates green. Strict ops / fork / the
commit-time OCC re-read keep their fresh reads.

* test(engine): scope write-path open_count to data tables (RFC-013 step 3b)

The keyed_insert_opens_table_at_most_once gate asserted open_count <= 1, but
open_count was a single unclassified counter: record_open() fires in both
open chokepoints, and open_dataset_tracked also opens the internal/system
tables (__manifest via layout.rs, _graph_commits/_graph_commit_actors via
commit_graph.rs). So the count conflated data-table opens with the publisher
CAS + commit-graph append opens — making the gate measure the wrong quantity
and unreachable by threading alone (the manifest publish keeps it >1 regardless).

Scope it by table class, mirroring the read-side counters (which already split
by URI prefix via separate wrappers): record_open(uri) classifies the open's
last path segment and feeds data_open_count vs internal_open_count. IoCounts
exposes both; the gate now asserts data_open_count <= 1.

Re-baselined: a single keyed insert is data_open_count=4 / internal_open_count=6
(sum 10, the old conflated value). The RED target for the WriteTxn threading is
now the real data-table-open count (4 -> 1), with internal opens correctly out
of scope. Pure test-harness/instrumentation; no production behavior change
(classification runs only inside the probe closure, skipped when no probes are
installed).

Also marks #297 (optimize-vs-write race) as landed in the step-3b handoff —
this branch is already stacked on origin/main after it merged.

* feat(engine): validate the schema contract once per write (RFC-013 step 3b)

A single mutate/load re-validated the schema contract ~4 times: at the entry
(ensure_schema_state_valid), per-table in open_for_mutation_on_branch
(resolved_branch_target), at the commit-time OCC re-read (fresh_snapshot_for_branch),
and in the publisher's index-build snapshot (snapshot_for_branch). Each validation
is 3 read_text + 2 exists on the storage adapter — O(touched resolve-points) of
redundant contract I/O on every write.

Thread the already-landed WriteTxn carrier through the write path: capture
`txn = open_write_txn(branch)` once at the mutate/load entry (the single validation),
then source the per-table entry and the commit/publish snapshots from `txn.base`
instead of re-resolving. When `txn` is None (branch merge, schema apply, tests) every
function is byte-identical to before.

- mutate_with_current_actor / load_jsonl_reader capture txn once (replacing the
  entry-point ensure_schema_state_valid) and thread Some(&txn) through
  execute_*/open_table_for_mutation, commit_all, and
  commit_updates_on_branch_with_expected.
- open_for_mutation_on_branch sources (snapshot, branch) from txn.base/txn.branch
  when present — skipping resolved_branch_target's re-validation. The OPEN itself is
  unchanged (still HEAD via open_dataset_head_for_write), and strict ops keep
  ensure_expected_version. Schema-once applies to strict and non-strict alike; the
  data-open collapse is a separate change.
- commit_all uses fresh_snapshot_for_branch_unchecked (the OCC manifest re-read minus
  the schema re-validation) when txn is present; the drift guard is unchanged.
- prepare_updates_for_commit uses txn.base for the publisher index-build snapshot.

fresh_snapshot_for_branch{,_unchecked} now read the manifest directly via
ManifestCoordinator instead of resolve_target. The OCC re-read consumes only the
Snapshot (per-table location + version), which ManifestCoordinator::open().snapshot()
produces identically — but resolve_target additionally opened the commit graph (a
spurious _graph_commits.lance exists probe the OCC read never consults). Dropping that
load is a pure read-cost reduction for every fresh-snapshot caller (commit_all's None
arm, optimize, repair, fork reclaim); the returned Snapshot is unchanged and the read
is a fresher cold manifest re-read, so the OCC freshness guarantee is preserved.

Greens write_validates_schema_contract_once (3 read_text / 2 exists, was 12/9).
keyed_insert_opens_table_at_most_once stays red (data_open_count=4) — the open
collapse lands next. Full engine suite green otherwise.

* feat(engine): open each data table once per write (RFC-013 step 3b)

A single keyed-node mutate opened its data table 4 times: accumulation (to read
.version()), staging (the real write base), the commit-time drift guard (to read
live HEAD), and the publisher's index build (reopen at the just-committed version).
Collapse three of the four — using the WriteTxn carrier threaded for schema-once —
so a write opens each touched data table at most once.

- #1 accumulation: open_for_mutation_on_branch now returns
  (Option<SnapshotHandle>, expected_version, full_path, table_branch). On the txn's
  own branch, a non-strict (Insert/Merge) op needs no open — the only thing the
  caller reads is .version() (the CAS fence), which is exactly the pinned base
  version (entry.table_version). So skip open_dataset_head_for_write and source the
  version from txn.base. The node insert path already discarded that handle; the
  edge path resolves a pinned read only when non-default cardinality needs it.
  STRICT ops and any write that must fork still open live HEAD + ensure_expected_version.
- #3 commit drift guard: commit_all reads live HEAD via
  entry.dataset.dataset().latest_version_id() — a cheap manifest-pointer probe off
  the already-open staging handle (the same primitive ManifestCoordinator::
  probe_latest_version uses) instead of a fresh open_dataset_head_for_write. The
  head<current / head>current drift classification is byte-identical.
- #4 index build: commit_all now returns the per-table post-commit_staged
  SnapshotHandle map; commit_updates_on_branch_with_expected threads it into
  prepare_updates_for_commit, which builds indices on the threaded handle instead of
  reopening at the same just-committed version. Absent a handle (other writers,
  inline/delete tables) the reopen path is byte-identical.

When txn is None (branch merge, schema apply, tests) every function opens and checks
exactly as before. Greens keyed_insert_opens_table_at_most_once (data_open_count 4->1).
Schema-once gate stays 3/2. Full engine suite + failpoints (recovery sidecar lifecycle)
green.

* refactor(engine): name the write-path open/commit returns (RFC-013 step 3b)

The open collapse left two positional returns that are easy to mis-thread and
carry an unwritten contract: open_for_mutation_on_branch's
(Option<SnapshotHandle>, u64, String, Option<String>) and commit_all's 5-tuple
(updates, expected_versions, sidecar_handle, guards, committed_handles). Replace
both with named structs so each field reads at the call site and the Option's
contract is documented, not folklore.

- OpenedForMutation { handle, expected_version, full_path, table_branch } with a
  require_handle(ctx) helper for the callers that must have a handle (strict ops,
  the fork path, every no-txn caller — branch merge, the seed test). The handle is
  None only on the non-strict-txn open-skip path (collapse #1); require_handle
  panics with a named context if that contract is ever broken.
- CommittedMutation { updates, expected_versions, sidecar_handle, guards,
  committed_handles } for commit_all; consumers destructure into the same local
  bindings they already used, so the publish/sidecar/guard-hold logic is unchanged.
- A debug_assert in open_table_for_mutation pins the skip contract: a missing handle
  is legal only on the non-strict txn path, so a future strict arm returning None
  trips in debug builds instead of handing None to a require_handle consumer.

Pure refactor — no behavior change. Both cost gates stay green (schema 3/2,
data_open_count=1), full engine suite + lib (162) green.

* refactor(engine): drop the unearned session field from WriteTxn (RFC-013 step 3b)

The open collapse greens data_open_count<=1 by SKIPPING the accumulation open,
PROBING live HEAD with latest_version_id, and REUSING the commit_staged handle —
none of which consume a session. The captured WriteTxn.session was therefore dead
(`#[allow(dead_code)]`): unearned surface a reviewer rightly flags.

Remove it. The carrier is now {branch, base} — exactly what schema-once + the open
collapse use. Step 5 (PublishPlan unification) makes WriteTxn the non-optional
publish carrier and is the right home for session-aware base opens, where the
warm-session benefit on the single remaining open — an object-store (S3) phenomenon,
invisible on local FS — can be earned by its own cost gate rather than carried dead
through this PR.

No behavior change; both cost gates stay green (schema 3/2, data_open_count=1).

* docs(rfc-013): mark step 3b DONE — schema-once + open-collapse shipped, session deferred to step 5

* docs(rfc-013): capture the write-base-staleness convergence (§1d)

Three findings this cycle share one root — the write base is a stale, un-probed,
un-classified pin (the read path probes; the write path returns the warm
coordinator snapshot):

- #298 edge-@card stale-read regression (cursor High / codex P1, VALID): collapse #1
  made the cardinality scan read txn.base instead of live HEAD, so a concurrent edge
  is uncounted and a max can be exceeded. Fix on #298: restore the live-HEAD read +
  deterministic test + correct the single-writer doc comment.
- The structural liability underneath: no unified write-validation read-set —
  endpoint/cardinality/uniqueness each pick freshness ad hoc (warm/pinned/live),
  the same cardinality check forks mutation-vs-loader, none re-validated at commit.
- The served-strict-write stale-view false-fail (validated on prod + a #[ignore]
  repro): a strict update/delete false-fails ExpectedVersionMismatch after an external
  optimize advance — the write-side mirror of #297/§6.6. The naive blanket probe is
  proven wrong (breaks the cross-process lost-update OCC contract).

All three converge on Design A (step 5): open_txn's warm probe makes the base fresh,
the op-class-aware precondition (derive maintenance vs logical from Lance per-version
transaction metadata — no parallel marker) fast-forwards maintenance and fails logical,
and §7.1's read-set-in-CAS unifies + re-validates the validation read-set. §8 records
the #298 follow-up, the widened §7.1 scope, and the step-5 two-test acceptance contract.

* test(engine): RED — edge @card must scan live HEAD, not stale txn.base (#298)

Regression guard for the cursor-High/codex-P1 finding on #298: 3b's collapse #1
made the non-strict edge-insert cardinality scan read the pinned txn.base instead
of live HEAD (edge_cardinality_read_handle), so a concurrent edge committed after
txn capture is uncounted and a @card max is silently exceeded (invariant 9).

Deterministic two-handle test (no failpoint): handle A commits WorksAt(Alice->Acme)
to the @card(0..1) max; stale handle B (never read since) inserts a second WorksAt
for Alice. B's coordinator is stale by construction (the write path doesn't probe),
so B scans txn.base (Alice has 0) and wrongly commits the 2nd edge. RED: the insert
that must be rejected currently succeeds (panics at unwrap_err). Goes green when the
scan reads live HEAD.

* fix(engine): scan live HEAD for edge @card, not the pinned txn.base (#298)

3b's collapse #1 skips the non-strict edge accumulation open, so edge_cardinality_
read_handle reopened the edge table at the pinned txn.base for the @card scan. Since
cardinality is validated once (never rechecked at commit), a concurrent edge committed
after txn capture was uncounted and a @card max could be silently exceeded (invariant
9) — the cursor-High/codex-P1 regression on #298. Pre-3b the scan read live HEAD (the
mutation's own open_dataset_head_for_write handle).

Restore the live-HEAD read: take the table LOCATION from the pinned entry (stable
across versions) and open the dataset at its current HEAD via open_dataset_head_for_
write. Gate-safe — the data_open_count / merge-insert-only gates are node inserts; the
edge cardinality path (non-default @card only) is untouched by them, and the extra
live-HEAD open is exactly the pre-3b shape. Also drops the dead None-fallback's schema
re-validation (greptile P2, auto-resolved). The residual validate->commit TOCTOU is the
pre-existing §7.1 gap (RFC-013 step 4), recorded in handoff §1d/§8.

Turns cardinality_rejected_for_stale_handle_after_concurrent_edge_commit green;
validators / write_cost / writes / consistency / end_to_end / branching all green.

* docs(dev): link handoff docs from index

* docs(engine): tighten 3b claims to match the code (#298 review)

Review caught several comments/docs overclaiming what the code does (the session
drop + the #298 cardinality fix left stale/too-strong wording). No logic change.

- open_write_txn doc: drop the stale "shared per-graph Session" (WriteTxn no longer
  carries one); scope "once" to the table-touch hot path and note edge/load RI
  validation still re-resolves (→ step 4 §7.1) + the session-aware open is step 5.
- edge cardinality call-site comment: it said the scan uses a "pinned txn.base" — it
  now opens LIVE HEAD (#298); corrected.
- write_cost.rs: "opens the base once (with the shared Session)" → session-aware base
  open is deferred to step 5.
- data_open_count completeness (instrumentation.rs + write_cost.rs): forbidden_apis
  only keeps engine code OUTSIDE the storage layer on the chokepoints; table_store.rs
  is allow-listed and holds direct Dataset::opens for branch-management ops (not the
  keyed-write hot path the gate measures). Narrowed the claim accordingly.
- handoff §4: "schema once / open once" is the node hot path (the two gates); edge
  endpoint + loader RI/cardinality still re-validate and read warm — #298 un-regresses
  cardinality only, it does NOT close write-validation freshness (that's step 4 §1d/§7.1).

build clean; write_cost / validators / forbidden_apis green.
											
										
										
											2026-06-23 21:27:31 +02:00
+								---
 								## 6. Why #297 is still needed even if you do Design A
 								- Design A **relocates** #297's maintenance-class commit logic into the authority's
 								  `TableAction::Rewrite` path; it does not eliminate it. #297 is the *validated spec + tests*.
 								- The two regression tests + §6.6 are the **contract** Design A must keep green.
 								- The prod bug is **live**; Design A is the largest write-path change in the RFC. Don't hold a
 								  correctness fix hostage to a big refactor, and don't do a big refactor under bug-fix urgency.
 								- Genuinely throwaway under Design A: only the loop's *location* + the `head_advanced` proxy
 								  (~a dozen lines). Everything else relocates or persists. **#297 LANDED.**
 								---
 								## 7. Open PRs and their relationships
 								- **#297** — maintenance-class fix (optimize vs write). **LANDED** (origin/main `6d4606a8`);
 								  step 3b stacks on it.
 								- **#254** — logical-class fix (schema-apply vs optimize false-fail). Same op-class family;
 								  both are de-risking inputs for Design A's per-class commit models.
-												feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299)

* docs(rfc-013): bank the #295 spec-review comments as step-5 constraints (§5.1)

3b shipped a minimal WriteTxn{branch,base} and deferred the full §4.1 opener
unification (pinned-base opener, shared Session, write-local handle cache,
strict-op conflict-timing move) to step 5. The greptile comments on the #295
spec were moot for #298 (none of those constructs were built) but are
load-bearing for step 5: (1) the handle cache must be Send+Sync (Mutex, not
RefCell); (2) the strict-op timing move needs an explicit retry contract — txn
discarded after any commit, retry re-opens a fresh base — which is the SAME
contract as the stale-view false-fail (§1d.2); (3) the opener-equivalence test
must advance HEAD externally then assert pinned-base, not the trivial HEAD==base.

* feat(engine): fold graph lineage into the __manifest publish CAS (RFC-013 Phase 7)

Graph lineage no longer lives in a second write to _graph_commits.lance. Each
commit's graph_commit + graph_head:<branch> rows now ride the SAME __manifest
merge-insert as the table-version rows (one atomic version), and CommitGraph reads
its cache from the manifest projection (read_graph_lineage). _graph_commits.lance is
no longer written commit rows (it remains only as a Lance branch-ref carrier).

Mechanism: a LineageIntent { graph_commit_id (ULID, minted once), branch, actor,
merged_parent, created_at } threads through ManifestBatchPublisher::publish. Inside
the publisher retry loop the parent is resolved per attempt from the just-loaded
branch-scoped manifest (the should_replace_head winner over the visible graph_commit
rows — branch-correct by Lance branch isolation; the graph_head row is written for
forward-compat + the §7.1 contention point but is not the parent source, so a
freshly-forked branch resolves the right fork-point parent). A CAS-conflict retry
re-reads the advanced head → correct new parent; the commit_id is stable across
retries.

Closes two known gaps BY CONSTRUCTION (one write, no second step to fail/ race):
- manifest→commit-graph atomicity (no crash window between manifest + lineage),
- commit-graph parent under concurrency (no refresh→append TOCTOU; the per-write
  commit_graph.refresh() is gone).

Recovery, branch-merge, and genesis route their lineage through the same CAS
(merge: one commit_merge_with_actor; recovery: publish_recovery_commit folds the
recovery commit, actor=omnigraph:recovery; genesis rides the init __manifest write).
The dead _graph_commits write helpers (append_commit/_merge/_actor) are
#[allow(dead_code)] (the actor sidecar table is still enumerated by optimize).

Verified (sequential): build clean; the new lineage_projection gate (manifest-only —
_graph_commits/_actors have 0 rows; full lineage reconstructs via the projection);
branching/merge_truth_table (exhaustive, branch-aware)/composite_flow/point_in_time/
changes/consistency/recovery; failpoints (59, incl. recovery lifecycle + the
now-closed atomicity gap); full --workspace. Cost tests REVERT to their pre-fold
values (writes +1, write_cost ceiling 80) — the proof of true single-CAS (no extra
write). invariants.md marks both gaps CLOSED.

PENDING (next stages, this PR): the §7.1 concurrent graph_head one-winner gate (stage
5 — two concurrent same-branch commits, exactly one wins); the stamp bump v4 +
migrate_v3_to_v4 backfill + read-only refuse for EXISTING graphs (stage 4); full
doc-sync of storage.md/architecture.md/writes.md.

* feat(engine): migrate existing v3 graphs to manifest lineage (RFC-013 Phase 7 stage 4)

The Phase-7 fold made CommitGraph read lineage from the __manifest projection, so a
pre-Phase-7 (internal-schema v3) graph — lineage in _graph_commits.lance, none in
__manifest — would read an empty commit DAG. Stage 4 makes existing graphs upgrade
seamlessly and not break reads.

- Stamp 3 -> 4 + migrate_v3_to_v4: bumps INTERNAL_MANIFEST_SCHEMA_VERSION and adds the
  3 => migrate_v3_to_v4 arm. The migration reads this branch's _graph_commits/_actors,
  emits one graph_commit row per commit + exactly one graph_head:<branch> for the head
  (should_replace_head winner, deterministic id-sort — no hash-map-order in migration
  output), merge-inserts into __manifest, then set_stamp(4) LAST. Idempotency guard
  first (read_graph_lineage non-empty -> just stamp); crash before set_stamp re-enters
  at v3 and the guard completes it. Does NOT touch the unenforced-PK metadata. Runs per
  branch: migrate_on_open backfills main; load_publish_state backfills each branch on
  its first write (root_uri/branch threaded through migrate_internal_schema).
- v3-read fallback: CommitGraph version-gates the lineage source — stamp < 4 reads the
  (re-activated) _graph_commits.lance; >= 4 uses the manifest projection. So a READ-ONLY
  open of an un-migrated graph reads correct history with no write. Correctness catch:
  the legacy _graph_commit_actors.lance was never branched, so the fallback reads it
  FLAT (no branch checkout) while checking out the branch only on the commits dataset.
- Read-only stamp-refuse: a ReadOnly open of a FUTURE-stamped graph now refuses with the
  same upgrade error (future-proofing the next format bump; the write path already
  refused via migrate_internal_schema).
- Docs: storage/architecture/writes/invariants/constants updated to manifest-stored
  lineage; release note docs/releases/v0.8.0.md (format v4, old writers clean-break,
  data preserved, upgrade writers first).

6 new tests (v3 backfill, idempotent, v3 read-only fallback, future-stamp refuse in both
modes, crash-before-stamp completes, legacy branch+flat-actor read). Full engine suite +
failpoints (59) + cargo test --workspace --locked green; check-agents-md passes.

* test(engine): graph_head concurrency gate — disjoint same-branch writers form a linear commit DAG (RFC-013 Phase 7)

Two (or N) writers committing disjoint tables on one branch still share the
mutable `graph_head:<branch>` manifest row, so the only row-level CAS
contention is that row. The contract — exactly one writer wins each CAS round;
the loser retries inside the publisher, re-resolves its parent off the
freshly-advanced head, and re-commits, so every writer lands and the
graph_commit DAG stays a single LINEAR chain (no fork) — had no acceptance
test. This adds it.

- concurrent_disjoint_writes_share_head_and_form_linear_chain: two disjoint
  writers + distinct LineageIntent, tokio::join!; both commit; the on-disk DAG
  is genesis -> c -> c' (asserted linear: exactly one genesis, no two commits
  share a parent, the head is the unique non-parent).
- n_concurrent_disjoint_writers_converge_to_one_linear_chain: N=8 disjoint
  writers each with an app-level retry loop (the publisher's internal budget
  can be exhausted under contention); all converge to one linear chain of 8.
- concurrent_disjoint_writes_form_linear_chain_on_s3: the same race on a real
  object store (true conditional-put CAS), bucket-gated.

Cites both tests from the §7.1 contention note in invariants.md.
Test-only; no production change.

* perf(engine): fold the lineage parent scan into the publish path's single __manifest scan (RFC-013 P2)

Each lineage publish scanned `__manifest` twice: `load_publish_state` read
table state via one scan, then `resolve_lineage_rows` did a second full
`read_graph_lineage` scan only to find the parent commit. Fold the
`graph_commit` extraction into the existing scan.

- `read_manifest_scan` gains a `collect_lineage` flag. The publish path
  (`read_publish_scan`) collects the `graph_commit` rows in the same pass; the
  table-state hot path leaves them in the forward-compat skip arm, so it never
  pays the O(commits) lineage JSON decode (it also skips reading the
  `object_id` column entirely). One shared `decode_graph_commit_row` serves
  both the folded path and the standalone `read_graph_lineage`, so the two
  cannot drift.
- `resolve_lineage_rows` is now sync and takes the already-parsed rows; the
  per-attempt re-read is preserved because `load_publish_state` runs once per
  CAS attempt, so a retry still re-parents off the advanced head.
- `load_publish_state` returns a named `LoadedPublishState` instead of a
  four-tuple; the thin `read_registered_table_locations` /
  `read_tombstone_versions` accessors fold away. `read_manifest_entries` becomes
  `#[cfg(test)]`: the fold removes its last production caller, leaving only the
  test-only namespace module (`db/manifest.rs`: `#[cfg(test)] mod namespace`),
  so gating it keeps it from becoming dead code in non-test builds.

Measured at depth ~5: per-write `__manifest` reads drop 44 -> 26 (total reads
54 -> 36). write_cost.rs gains a `manifest_reads <= 34` sub-ceiling that trips
if a publish-path scan is re-added, and its calibration comment is corrected.

* test(engine): red — transient legacy-open failure silently completes the v3→v4 migration

A pre-Phase-7 (internal schema v3) graph keeps its graph lineage in
`_graph_commits.lance`; the v3→v4 internal-schema migration backfills it into
`__manifest` and stamps v4. `read_legacy_commit_cache` currently maps EVERY
`Dataset::open` error to "no legacy data" (`Err(_) => empty`), so a transient or
corrupt open during the one-time migration backfills nothing and still stamps
v4 — orphaning the real lineage permanently (the migration runs once; the v3
fallback is then disabled).

Add a `migration.v3_to_v4.legacy_open` failpoint that injects a non-not-found
Lance error at the legacy open, and a fault-injection regression test in the
`failpoints` binary. Against the current swallow the migration completes anyway,
so the test fails on its "migration must abort" assertion — the predicted
symptom. The fix follows in the next commit.

Test support reachable from the `failpoints` integration binary (it compiles the
crate without `cfg(test)`): the v3-fixture helpers and a stamp/row-count reader
are gated `cfg(any(test, feature = "failpoints"))`, still excluded from release
builds. Failpoint tests stay in the integration binary because the fail registry
is process-global.

* fix(engine): propagate non-not-found legacy-open errors in the v3→v4 migration

`read_legacy_commit_cache` mapped EVERY `Dataset::open` error to an empty cache
(`Err(_) => empty`) on both the legacy commits dataset and its actor sidecar. The
v3→v4 internal-schema migration reads this once before stamping internal-schema
v4; a transient or corrupt open therefore backfilled nothing and stamped v4
anyway, orphaning the graph's real lineage permanently (the migration runs once,
and the stamp-gated v3 fallback is disabled at v4). This is the "no silent
failures" deny-list violation, and realistic on object storage.

Both opens now match the not-found variants — Lance maps an object-store NotFound
to `DatasetNotFound` — as the benign "no legacy data" / "no authors" signal, and
propagate anything else as a loud error. The two arms share the variant contract
but carry different rationale (commits-absent is the legitimate empty signal;
actor-sidecar-absent is benign, but a corrupt actor open silently wiping
authorship before stamping v4 is the same loss hole), commented at each site.

Pinned by the `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant`
guard (turns red if a Lance bump changes the absence variant) and greens the
fault-injection regression test from the previous commit.

* test(engine): cover the per-branch v3→v4 migration against a real Lance branch

`seed_legacy_v3_lineage` writes every commit (including the "feature"-tagged one)
to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field, so the
production per-branch migration path — `read_legacy_commit_cache` checking out a
real Lance branch, and a branch-scoped `__manifest` — was never exercised.

Add `seed_legacy_v3_lineage_with_branch`, which forks a real `feature` Lance
branch on BOTH `_graph_commits.lance` and `__manifest` (the branch inherits
main's stripped v3 state), and a test that migrates the BRANCH and asserts the
branch's lineage lands in the BRANCH's `__manifest` (genesis + A + branch commit,
`graph_head:feature` → branch commit, parents + actors intact) with main's
`__manifest` untouched.

This empirically resolves the open question behind the merge robustness work: the
fast-path `read_graph_lineage(dataset)` has no `manifest_branch` filter, but
`__manifest` is Lance-branched per graph-branch, so a branch reads only its own
lineage — the test confirms migrating one branch does not leak into another. No
branch filter is needed.

* refactor(engine): type the lineage-backfill merge conflict via the publisher classifier

`state::merge_lineage_rows` (the v3→v4 lineage backfill's standalone `__manifest`
merge-insert) stringified its `execute_reader` error, discarding the Lance
variant. Route it through the publisher's `map_lance_publish_error` (now
`pub(crate)`) so a concurrent first-open's row-level CAS loss surfaces as the
SAME typed `OmniError::Manifest{ details: RowLevelCasContention }` the publisher's
own retry consumes — one vocabulary, no raw-Lance matching in the migration.

Deliberately NOT unified with `optimize::is_retryable_lance_conflict`: that
classifier also matches `CommitConflict`/`RetryableCommitConflict` from the
compaction commit path, which a row-level merge-insert never emits. Cross-linked
with a comment at both sites.

Behavior-preserving: the only path that changes is the error TYPE on a CAS loss
(previously an opaque `Lance` string, now a typed conflict); no success/failure
outcome changes. The bounded re-open retry that consumes the new type lands next.

* test(engine): red — concurrent v3→v4 migrations error instead of converging

`migrate_v2_to_v3` is concurrent-runner idempotent by design; v3→v4 regressed it.
`merge_lineage_rows` uses `conflict_retries(0)` and `migrate_v3_to_v4` has no
app-level retry, so when two processes open the same legacy graph at once the
backfill's row-level CAS loser errors the whole open instead of converging.

The test opens two `__manifest` handles at the same pre-migration (v3,
empty-lineage) HEAD and runs both `migrate_internal_schema` calls under
`tokio::join!`, forcing the `graph_head:main` CAS to fire every run. Against the
current code the loser fails with `RowLevelCasContention` ("Attempted 0
retries.") — the predicted symptom — so the "both must converge" assertion
panics. The bounded re-open retry that makes both converge lands next.

* fix(engine): make the v3→v4 lineage backfill converge under concurrent runners

`migrate_v2_to_v3` is concurrent-runner idempotent; v3→v4 was not. Two processes
(or open-for-write handles) opening the same legacy graph at once both reach the
backfill merge, and `merge_lineage_rows`'s `conflict_retries(0)` made the
row-level CAS loser error the whole open instead of converging.

Two contention points, both now handled all-or-nothing:

1. The backfill merge on `graph_head:<branch>`. Wrap (fast-path re-read → read
   legacy → merge) in a bounded re-open retry loop: a `RowLevelCasContention` loss
   re-opens the manifest past the winner's (atomic) commit and re-loops; the
   fast-path re-read then sees the winner's lineage and stamps. On budget
   exhaustion it returns a `RowLevelCasContention`-typed error so the publisher's
   OUTER retry loop completes it. The retry decision reuses the publisher's
   `is_retryable_publish_conflict` so the two stay in lockstep.

2. The terminal stamp bump. Making the merge loser converge newly lets BOTH
   runners reach `set_stamp(4)` — an `UpdateConfig` commit on the same key — so the
   loser gets `lance::Error::IncompatibleTransaction` (NOT a row-level CAS, so the
   merge loop doesn't catch it). This surfaced only under the concurrent
   full-suite run, not the isolated test. Both write the SAME value, so the
   conflict is benign: `commit_v4_stamp_idempotently` re-opens and, if the stamp
   already reached the target, succeeds; else re-applies (bounded).

Greens the race test from the previous commit (3x isolated, 5x full-suite, no
flake). The new `IncompatibleTransaction` match is pinned by
`lance_surface_guards.rs::lance_error_incompatible_transaction_variant_exists`.

* fix(engine): refuse a future internal-schema stamp on the branch read path

`load_commit_cache_for_branch` dispatched on the branch's internal-schema stamp —
`< CURRENT` to the v3 legacy fallback, `>= CURRENT` to the manifest projection —
but never refused a `> CURRENT` branch stamp, so a newer-binary shape would be
misread by the projection rather than rejected.

Add `refuse_if_stamp_too_new(stamp)` (re-exported `pub(crate)` from `migrations`)
right after the branch stamp is read, mirroring the main read path's
`refuse_if_internal_schema_too_new`. This is defense-in-depth, not a live hole:
migrations run main-first (main migrates on open; each branch on its first write),
so main's stamp is always >= every branch's and the main path refuses first. The
guard closes the gap if that ordering invariant is ever weakened.

Tested by force-stamping a real branch past CURRENT and asserting the branch read
refuses with the upgrade error (the test misreads via the projection — returns Ok
— without the guard, confirmed by removing it).

* docs(rfc-013): record the v3→v4 migration robustness fixes

invariants.md Known Gaps: the `migrate_v3_to_v4` entry now states the migration is
loud on non-not-found legacy-open errors and concurrent-runner idempotent (bounded
re-open retry on the merge CAS + idempotent stamp bump), and that the branch read
path refuses a `> CURRENT` stamp.

lance.md: note the two new surface guards the migration depends on
(`dataset_open_missing_returns_not_found_variant`,
`lance_error_incompatible_transaction_variant_exists`).

testing.md: note the migration fault-injection test in the failpoints row.

* refactor: remove dead code and silence warnings across engine + cluster

Dead-code sweep follow-up to the RFC-013 stack. No behavior change.

- engine: delete the orphaned `validate_edge_cardinality` — the load path uses
  `validate_edge_cardinality_with_pending_loader` for every mode (including
  Overwrite, which it treats as the replacement table image), so the old
  standalone validator had no caller — and correct its sibling's now-stale doc
  reference. Gate `TableStore::append_batch` `#[cfg(test)]`: it is the inline-
  commit residual kept only for recovery test setup, with no non-test caller.
- cluster: drop unused imports in `lib.rs`, delete the unused
  `ClusterStore::payload_display`, and raise `LiveGraphObservation` /
  `GraphObservationJson` / `PolicyTarget` to `pub(crate)` to match the functions
  that return them.

Both lib crates now build warning-free.

* fix(engine): match Lance's typed DatasetAlreadyExists, not the message string

The internal create-or-open idempotency fallbacks in `db/commit_graph.rs` and
`db/recovery_audit.rs` classified the "already exists" race by
`err.to_string().contains("Dataset already exists")` — a Lance display string,
not an API contract. A wording change upstream would silently break the fallback
(a re-create would error instead of opening the existing table). Match the typed
`lance::Error::DatasetAlreadyExists { .. }` variant instead — the same discipline
as the v3→v4 migration's not-found classifier — pinned by the new
`lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists`
guard so a Lance rename turns red instead of silently regressing.

* refactor(engine): consolidate now_micros into one crate::db helper

Four `fn now_micros() -> Result<i64>` copies (commit_graph, recovery_audit,
graph_coordinator, manifest/graph) had already drifted: three mapped the
clock error to `OmniError::manifest("...UNIX_EPOCH...")` while recovery_audit
used `OmniError::manifest_internal("...unix epoch...")`. Replace all four with
one `pub(crate) fn now_micros()` in `db/mod.rs` (the majority `manifest`
variant), and repoint the eight call sites at `crate::db::now_micros()`. No
test asserts on the failure message, so unifying the variant is behavior-safe;
the timestamp-mapping contract can no longer fork across the rows it stamps.

* refactor(engine): drop the dead snapshot param from roll_back_sidecar

`roll_back_sidecar` took `snapshot: &Snapshot` only to discard it with
`let _ = snapshot;` — rollbacks now always publish (the restored HEAD plus a
recovery-commit lineage row), so the snapshot is never read to decide whether
to skip a publish. Remove the parameter, the two call-site arguments, and the
suppressor. A signature must not advertise inputs it does not consume. The
`Snapshot` import stays — `process_sidecar`, `roll_forward_all`, and
`record_audit_recovery_rollforward` still take it.

* test(engine): red — open_at_branch wedges a branch on a missing commit-graph ref

A v4 graph keeps its graph lineage in `__manifest` (RFC-013 Phase 7); the
`_graph_commits.lance` branch ref is a derived artifact. An interrupted
fork-reclaim or a `cleanup` race can drop that derived ref while the manifest
lineage stays intact. Per invariants 7 + 15 a missing derived ref must not fail
a logical read of the lineage.

This wedge builds a real v4 `feature` branch (its `graph_head:feature` row in
`__manifest`), force-deletes ONLY the `_graph_commits.lance` `feature` ref, then
asserts the branch reads (`open_at_branch` / list-commits / `merge_base`)
succeed from `__manifest` while a write that needs the derived ref
(`create_branch`) fails loudly with the typed actionable error.

Red against current code: `open_at_branch`'s hard `checkout_branch(branch)?` on
the missing ref errors `OmniError::Lance` (Lance "Not found:
_graph_commits.lance/tree/feature/_versions"), wedging the logical read.

* fix(engine): read manifest lineage independent of the derived _graph_commits ref

`CommitGraph::open_at_branch` did a hard `checkout_branch(branch)?` on the
`_graph_commits.lance` branch ref before reading lineage — so a missing derived
ref (an interrupted fork-reclaim, or a `cleanup` race) wedged the branch's
commit-list / merge-base / snapshot resolution even though the lineage is
readable from the authoritative `__manifest` (RFC-013 Phase 7). That is a
derived/physical artifact failing a logical read — invariants 7 and 15.

Make the held commits handle `Option<Dataset>` (mirroring `actor_dataset`).
`open_at_branch` and `refresh` check out the derived ref best-effort: a typed
not-found (`RefNotFound`/`NotFound`) yields a `None` handle while the read
re-syncs from `__manifest`; any other open error still propagates. The manifest
existence gate is unchanged — `load_commit_cache_for_branch` keeps its hard `?`,
so a truly absent branch still fails loudly at the manifest. `create_branch`
(the only writer that forks a ref) and the folded-in version lookup return a
loud, actionable error on `None`, deferring repair to `cleanup`'s existing
orphan reconciler rather than inlining a write on a read-side refresh. Reads
(`head_commit`/`load_commits`/`get_commit`/`merge_base`) never touch the handle.

Greens the wedge regression from the preceding commit.

* fix(engine): v3→v4 retry loops return retryable contention on exhaustion

`commit_v4_stamp_idempotently`'s retry loop used `0..=STAMP_RETRY_BUDGET`
(6 iterations) with an `attempt < STAMP_RETRY_BUDGET` guard, so the LAST
iteration's `IncompatibleTransaction` fell through to
`Err(e) => OmniError::Lance(...)` — stringified, non-retryable — instead of the
intended `RowLevelCasContention`, and the post-loop contention return was dead
code. The publisher's outer retry only re-runs `is_retryable_publish_conflict`,
so under sustained concurrent v3→v4 migration the one-time stamp bump could fail
instead of converging, defeating the idempotency the migration is supposed to add.

Fix the loop to `0..BUDGET` with an UNGUARDED `IncompatibleTransaction` arm: the
retryable variant is always handled inside the loop (re-open + same-value check +
retry), so it can never reach the stringifying catch-all, and the post-loop is the
SINGLE reachable exhaustion path — the typed `RowLevelCasContention`. The `Err(e)`
arm now catches only genuine non-contention errors. Apply the same range alignment
to the sibling merge loop in `migrate_v3_to_v4` (behaviorally correct today — its
`Err(err)` returns the already-typed contention — but it carried the identical
off-by-one structure the stamp loop was copied from; aligning both stops the next
copy from re-introducing it).

Test-first. The exhaustion path is otherwise near-unreachable — a real concurrent
winner stamps the same value, so the re-read returns Ok on the first retry — so a
new `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to
lose, driving exhaustion deterministically. Against the pre-fix loop the new
`v4_stamp_exhaustion_returns_retryable_contention` test goes red with
`Lance("Incompatible transaction: injected failpoint triggered…")`; with the fix it
asserts the typed `RowLevelCasContention`. Found by automated review on #299.

* feat(engine): minimum-supported internal-schema floor + retirement tripwire

The internal-schema migration chain (`migrate_internal_schema`) had a too-new
ceiling but no floor, so every old `migrate_vN_…` arm and the v3 legacy readers
it needs stay forever — the pile grows by one migration + readers + tests every
schema version. Add `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (1 today, a pure
no-op: `read_stamp` floors an absent stamp at 1 and no real graph carries 0) as
the oldest stamp this binary opens; raising it is how the chain sheds old code.

Collapse the one-sided `refuse_if_stamp_too_new` into `refuse_if_stamp_unsupported`
checking both bounds, so the floor lands at all three stamp-enforcement sites —
the write-path migrate dispatcher, the read-only open guard, and the branch
lineage-read path (`commit_graph.rs`) — via one compiler-enforced rename. A
hand-wired floor twin would have had to touch each site, and the branch-read path
is easy to miss; one combined guard cannot half-enforce. Rename the read-only
wrapper `refuse_if_internal_schema_unsupported` to match.

A compile-time tripwire (`const _: () = assert!(LOWEST_REGISTERED_MIGRATION_SOURCE
== MIN_SUPPORTED…)`) fails the build if a future floor bump forgets to delete the
now-dead migration arm (or vice versa) — stronger than a runtime test, impossible
to skip, and it doubles as the use that keeps the mirror const live.

Tests: a sub-floor graph is refused in both open modes (twin of
`future_stamp_is_refused_in_both_open_modes`); the guard accepts exactly
[MIN, CURRENT]. No behavior change for any real graph. The retirement runbook
lives on the `MIN_SUPPORTED` doc-comment + invariants.md.

* fix(engine): compose migration contention with publisher retry; precise recovery-converge audit commit

Three review-surfaced fixes on the RFC-013 Phase 7 path.

Publisher retry vs migration contention: `publish()` propagated a
`load_publish_state` error fatally via `?`, so a `RowLevelCasContention` surfaced
by the v3->v4 migration's exhausted merge/stamp budgets aborted the publish
instead of being retried — only `merge_rows` conflicts hit the retry. This
contradicted the migration's own design, which returns that typed error
EXPECTING the publisher to re-run the load (by which point a concurrent winner
has usually finished the migration, so the next scan is a no-op). Route a
retryable load error through the same retry path as a retryable `merge_rows`
conflict. Regression test (failpoints): a one-shot retryable contention injected
into `load_publish_state` now commits via the retry; red without the fix (the
write fails with the injected contention).

Recovery-converge audit commit id: `converge_or_defer_roll_forward` recorded the
branch HEAD as the audit row's `graph_commit_id`, but a concurrent user write can
advance `graph_head` past the recovery commit between the winner's publish and
this read — attributing the audit to a later, wrong commit. Use the latest
`RECOVERY_ACTOR`-authored commit (what `publish_recovery_commit` mints), which is
the recovery commit by construction. The audit's actor was already correct (it
comes from `sidecar.actor_id`, not the commit).

Dead param: drop the unused `snapshot` from `record_audit_recovery_rollforward`
(removing the `let _ = snapshot;` suppressor). `storage` stays — it is used to
delete the sidecar.
											
										
										
											2026-06-25 13:55:34 +02:00
+								- **#296** — recovery roll-forward converges on concurrent manifest advance. The fix
-												fix(recovery): converge roll-forward on concurrent manifest advance (#296)

* refactor(storage): gate test-only TableStore::append_batch behind cfg(test)

The inherent append_batch is used only by in-source recovery test setup, but
the non-test lib build (cfg(test) off) cannot see those callers and emitted a
dead_code warning. Gating the method #[cfg(test)] silences the false positive
and enforces its own doc contract ("no new engine call sites") by construction
— engine code physically cannot call a cfg(test) method.

* test(failpoints): harden fault-injection harness + reproduce roll-forward CAS race

Hardens the test infrastructure around the process-global `fail` registry, and
adds a deterministic red repro for the open-time recovery sweep's roll-forward
CAS race (iss-schema-apply-reopen-recovery-race). The fix lands in the next
commit — this commit is intentionally red (rule 12: red→green visible in log).

Harness:
- One `ScopedFailPoint` (engine) gaining `with_callback`; the cluster duplicate
  is removed and cluster tests reuse the engine type via `omnigraph/failpoints`.
- `#[serial]` on every failpoint test (the registry is process-global, so shared
  names interfere under parallelism); `serial_test` added to cluster dev-deps.
- `helpers::failpoint::Rendezvous` (park-first / wait-until-reached / release)
  replaces fixed-`sleep` cross-thread coordination; the three concurrent tests
  now rendezvous deterministically. The reached flag doubles as a fired-assert.
- Compile-checked `failpoints::names` catalog (engine + cluster); every call
  site references a const, and `failpoint_names_guard.rs` enforces "no string
  literal names" by source-walk, so a typo is a build error not a silent no-fire.

Red repro:
- New `recovery.before_roll_forward_publish` failpoint at the sweep's
  classify -> publish-CAS window (the only injection point there).
- `open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two
  concurrent open-sweeps race one pending sidecar; the sweep parked at the
  failpoint loses its publish CAS to the other and fails the open with
  `ExpectedVersionMismatch`. FAILS at this commit by design.

* fix(recovery): converge roll-forward when the manifest advances concurrently

The open-time recovery sweep classified a pending sidecar as RolledPastExpected,
then published a manifest CAS at the sidecar's pinned expected_version. Under a
concurrent writer that advanced the manifest past expected during the
classify -> publish window, the CAS failed with ExpectedVersionMismatch and
`?`-propagated, failing the whole Omnigraph::open.
iss-schema-apply-reopen-recovery-race.

A roll-forward's postcondition is "the manifest reflects the sidecar's committed
Lance state", not "this sweep won the CAS" (invariants 7 & 15). On an
ExpectedVersionMismatch, re-read the live manifest and check whether the
sidecar's intent is already satisfied (every pinned table at a version >= the
one we observed and tried to publish; added tables registered; tombstones gone
— sound under the heal-first invariant, documented at the check). If satisfied,
this is convergence: record the RolledForward audit + delete the sidecar
idempotently. If only partway, defer to the next pass. Either way the open no
longer fails. Other errors still propagate; a genuine logical conflict
resurfaces via the classifier's InvariantViolation.

Turns the red repro from the previous commit green. The roll-BACK twin
(iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and
still needs a cross-process lease — the known-gap is updated accordingly.

* Address PR review: harden failpoint name guard + dedupe converge audit

Two issues surfaced in PR review of the failpoint hardening + recovery fix:

1. Name guard had a line-split blind spot. It scanned per line, so a call
   wrapped across lines (`park_first(\n    "name",\n)`) put the literal on a
   different line than the call prefix and bypassed the "no string-literal
   failpoint names" check — and one such literal
   (`mutation.delete_node_pre_primary_delete`) had slipped through. Make the
   guard whitespace/newline-tolerant (skip past the open paren to the first
   argument token) so wrapping can't hide a literal, and convert the bypassed
   site to the `names::` const.

2. Convergence path could append a duplicate recovery audit. When a
   roll-forward publish loses its CAS but the manifest already reached the
   sidecar's goal, `converge_or_defer_roll_forward` recorded a RolledForward
   audit unconditionally. Under the heal-first invariant, whoever advanced the
   manifest already healed this sidecar (audit + delete), so a second row
   landed in `_graph_commit_recoveries` for one recovery event. Gate the
   audit+delete on the sidecar still being present: absent => the winner
   completed it, return success with no duplicate row. The convergence
   regression test now asserts exactly one audit row.

* docs(dev): remove the schema-apply recovery-flake handoff (fixed by this PR)

The handoff was a transient investigation note for
`iss-schema-apply-reopen-recovery-race`, which this PR fixes (the converge
helper + the red→green regression). Its rationale now lives durably in the
dev-graph issue, the PR/commit history, and invariants.md, so the handoff is
obsolete. Drop the doc, its dev-index row, and the dangling reference from the
RFC-013 handoff; the doc cross-link check stays green.

* fix(recovery): include added-table registrations in the converge audit

The CAS-loss convergence audit built outcomes only from `sidecar.tables`,
omitting the `additional_registrations` that the normal `roll_forward_all`
audit includes. For a SchemaApply sidecar with added types, a converge-path
audit row would be incomplete versus the normal roll-forward path for the same
recovery kind. Mirror the roll-forward outcome construction (append a
registration outcome per added table) so both paths emit the same audit shape.
											
										
										
											2026-06-24 19:03:51 +02:00
+								  for the flaky `iss-schema-apply-reopen-recovery-race`. It touches `recovery.rs` and is
-												feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299)

* docs(rfc-013): bank the #295 spec-review comments as step-5 constraints (§5.1)

3b shipped a minimal WriteTxn{branch,base} and deferred the full §4.1 opener
unification (pinned-base opener, shared Session, write-local handle cache,
strict-op conflict-timing move) to step 5. The greptile comments on the #295
spec were moot for #298 (none of those constructs were built) but are
load-bearing for step 5: (1) the handle cache must be Send+Sync (Mutex, not
RefCell); (2) the strict-op timing move needs an explicit retry contract — txn
discarded after any commit, retry re-opens a fresh base — which is the SAME
contract as the stale-view false-fail (§1d.2); (3) the opener-equivalence test
must advance HEAD externally then assert pinned-base, not the trivial HEAD==base.

* feat(engine): fold graph lineage into the __manifest publish CAS (RFC-013 Phase 7)

Graph lineage no longer lives in a second write to _graph_commits.lance. Each
commit's graph_commit + graph_head:<branch> rows now ride the SAME __manifest
merge-insert as the table-version rows (one atomic version), and CommitGraph reads
its cache from the manifest projection (read_graph_lineage). _graph_commits.lance is
no longer written commit rows (it remains only as a Lance branch-ref carrier).

Mechanism: a LineageIntent { graph_commit_id (ULID, minted once), branch, actor,
merged_parent, created_at } threads through ManifestBatchPublisher::publish. Inside
the publisher retry loop the parent is resolved per attempt from the just-loaded
branch-scoped manifest (the should_replace_head winner over the visible graph_commit
rows — branch-correct by Lance branch isolation; the graph_head row is written for
forward-compat + the §7.1 contention point but is not the parent source, so a
freshly-forked branch resolves the right fork-point parent). A CAS-conflict retry
re-reads the advanced head → correct new parent; the commit_id is stable across
retries.

Closes two known gaps BY CONSTRUCTION (one write, no second step to fail/ race):
- manifest→commit-graph atomicity (no crash window between manifest + lineage),
- commit-graph parent under concurrency (no refresh→append TOCTOU; the per-write
  commit_graph.refresh() is gone).

Recovery, branch-merge, and genesis route their lineage through the same CAS
(merge: one commit_merge_with_actor; recovery: publish_recovery_commit folds the
recovery commit, actor=omnigraph:recovery; genesis rides the init __manifest write).
The dead _graph_commits write helpers (append_commit/_merge/_actor) are
#[allow(dead_code)] (the actor sidecar table is still enumerated by optimize).

Verified (sequential): build clean; the new lineage_projection gate (manifest-only —
_graph_commits/_actors have 0 rows; full lineage reconstructs via the projection);
branching/merge_truth_table (exhaustive, branch-aware)/composite_flow/point_in_time/
changes/consistency/recovery; failpoints (59, incl. recovery lifecycle + the
now-closed atomicity gap); full --workspace. Cost tests REVERT to their pre-fold
values (writes +1, write_cost ceiling 80) — the proof of true single-CAS (no extra
write). invariants.md marks both gaps CLOSED.

PENDING (next stages, this PR): the §7.1 concurrent graph_head one-winner gate (stage
5 — two concurrent same-branch commits, exactly one wins); the stamp bump v4 +
migrate_v3_to_v4 backfill + read-only refuse for EXISTING graphs (stage 4); full
doc-sync of storage.md/architecture.md/writes.md.

* feat(engine): migrate existing v3 graphs to manifest lineage (RFC-013 Phase 7 stage 4)

The Phase-7 fold made CommitGraph read lineage from the __manifest projection, so a
pre-Phase-7 (internal-schema v3) graph — lineage in _graph_commits.lance, none in
__manifest — would read an empty commit DAG. Stage 4 makes existing graphs upgrade
seamlessly and not break reads.

- Stamp 3 -> 4 + migrate_v3_to_v4: bumps INTERNAL_MANIFEST_SCHEMA_VERSION and adds the
  3 => migrate_v3_to_v4 arm. The migration reads this branch's _graph_commits/_actors,
  emits one graph_commit row per commit + exactly one graph_head:<branch> for the head
  (should_replace_head winner, deterministic id-sort — no hash-map-order in migration
  output), merge-inserts into __manifest, then set_stamp(4) LAST. Idempotency guard
  first (read_graph_lineage non-empty -> just stamp); crash before set_stamp re-enters
  at v3 and the guard completes it. Does NOT touch the unenforced-PK metadata. Runs per
  branch: migrate_on_open backfills main; load_publish_state backfills each branch on
  its first write (root_uri/branch threaded through migrate_internal_schema).
- v3-read fallback: CommitGraph version-gates the lineage source — stamp < 4 reads the
  (re-activated) _graph_commits.lance; >= 4 uses the manifest projection. So a READ-ONLY
  open of an un-migrated graph reads correct history with no write. Correctness catch:
  the legacy _graph_commit_actors.lance was never branched, so the fallback reads it
  FLAT (no branch checkout) while checking out the branch only on the commits dataset.
- Read-only stamp-refuse: a ReadOnly open of a FUTURE-stamped graph now refuses with the
  same upgrade error (future-proofing the next format bump; the write path already
  refused via migrate_internal_schema).
- Docs: storage/architecture/writes/invariants/constants updated to manifest-stored
  lineage; release note docs/releases/v0.8.0.md (format v4, old writers clean-break,
  data preserved, upgrade writers first).

6 new tests (v3 backfill, idempotent, v3 read-only fallback, future-stamp refuse in both
modes, crash-before-stamp completes, legacy branch+flat-actor read). Full engine suite +
failpoints (59) + cargo test --workspace --locked green; check-agents-md passes.

* test(engine): graph_head concurrency gate — disjoint same-branch writers form a linear commit DAG (RFC-013 Phase 7)

Two (or N) writers committing disjoint tables on one branch still share the
mutable `graph_head:<branch>` manifest row, so the only row-level CAS
contention is that row. The contract — exactly one writer wins each CAS round;
the loser retries inside the publisher, re-resolves its parent off the
freshly-advanced head, and re-commits, so every writer lands and the
graph_commit DAG stays a single LINEAR chain (no fork) — had no acceptance
test. This adds it.

- concurrent_disjoint_writes_share_head_and_form_linear_chain: two disjoint
  writers + distinct LineageIntent, tokio::join!; both commit; the on-disk DAG
  is genesis -> c -> c' (asserted linear: exactly one genesis, no two commits
  share a parent, the head is the unique non-parent).
- n_concurrent_disjoint_writers_converge_to_one_linear_chain: N=8 disjoint
  writers each with an app-level retry loop (the publisher's internal budget
  can be exhausted under contention); all converge to one linear chain of 8.
- concurrent_disjoint_writes_form_linear_chain_on_s3: the same race on a real
  object store (true conditional-put CAS), bucket-gated.

Cites both tests from the §7.1 contention note in invariants.md.
Test-only; no production change.

* perf(engine): fold the lineage parent scan into the publish path's single __manifest scan (RFC-013 P2)

Each lineage publish scanned `__manifest` twice: `load_publish_state` read
table state via one scan, then `resolve_lineage_rows` did a second full
`read_graph_lineage` scan only to find the parent commit. Fold the
`graph_commit` extraction into the existing scan.

- `read_manifest_scan` gains a `collect_lineage` flag. The publish path
  (`read_publish_scan`) collects the `graph_commit` rows in the same pass; the
  table-state hot path leaves them in the forward-compat skip arm, so it never
  pays the O(commits) lineage JSON decode (it also skips reading the
  `object_id` column entirely). One shared `decode_graph_commit_row` serves
  both the folded path and the standalone `read_graph_lineage`, so the two
  cannot drift.
- `resolve_lineage_rows` is now sync and takes the already-parsed rows; the
  per-attempt re-read is preserved because `load_publish_state` runs once per
  CAS attempt, so a retry still re-parents off the advanced head.
- `load_publish_state` returns a named `LoadedPublishState` instead of a
  four-tuple; the thin `read_registered_table_locations` /
  `read_tombstone_versions` accessors fold away. `read_manifest_entries` becomes
  `#[cfg(test)]`: the fold removes its last production caller, leaving only the
  test-only namespace module (`db/manifest.rs`: `#[cfg(test)] mod namespace`),
  so gating it keeps it from becoming dead code in non-test builds.

Measured at depth ~5: per-write `__manifest` reads drop 44 -> 26 (total reads
54 -> 36). write_cost.rs gains a `manifest_reads <= 34` sub-ceiling that trips
if a publish-path scan is re-added, and its calibration comment is corrected.

* test(engine): red — transient legacy-open failure silently completes the v3→v4 migration

A pre-Phase-7 (internal schema v3) graph keeps its graph lineage in
`_graph_commits.lance`; the v3→v4 internal-schema migration backfills it into
`__manifest` and stamps v4. `read_legacy_commit_cache` currently maps EVERY
`Dataset::open` error to "no legacy data" (`Err(_) => empty`), so a transient or
corrupt open during the one-time migration backfills nothing and still stamps
v4 — orphaning the real lineage permanently (the migration runs once; the v3
fallback is then disabled).

Add a `migration.v3_to_v4.legacy_open` failpoint that injects a non-not-found
Lance error at the legacy open, and a fault-injection regression test in the
`failpoints` binary. Against the current swallow the migration completes anyway,
so the test fails on its "migration must abort" assertion — the predicted
symptom. The fix follows in the next commit.

Test support reachable from the `failpoints` integration binary (it compiles the
crate without `cfg(test)`): the v3-fixture helpers and a stamp/row-count reader
are gated `cfg(any(test, feature = "failpoints"))`, still excluded from release
builds. Failpoint tests stay in the integration binary because the fail registry
is process-global.

* fix(engine): propagate non-not-found legacy-open errors in the v3→v4 migration

`read_legacy_commit_cache` mapped EVERY `Dataset::open` error to an empty cache
(`Err(_) => empty`) on both the legacy commits dataset and its actor sidecar. The
v3→v4 internal-schema migration reads this once before stamping internal-schema
v4; a transient or corrupt open therefore backfilled nothing and stamped v4
anyway, orphaning the graph's real lineage permanently (the migration runs once,
and the stamp-gated v3 fallback is disabled at v4). This is the "no silent
failures" deny-list violation, and realistic on object storage.

Both opens now match the not-found variants — Lance maps an object-store NotFound
to `DatasetNotFound` — as the benign "no legacy data" / "no authors" signal, and
propagate anything else as a loud error. The two arms share the variant contract
but carry different rationale (commits-absent is the legitimate empty signal;
actor-sidecar-absent is benign, but a corrupt actor open silently wiping
authorship before stamping v4 is the same loss hole), commented at each site.

Pinned by the `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant`
guard (turns red if a Lance bump changes the absence variant) and greens the
fault-injection regression test from the previous commit.

* test(engine): cover the per-branch v3→v4 migration against a real Lance branch

`seed_legacy_v3_lineage` writes every commit (including the "feature"-tagged one)
to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field, so the
production per-branch migration path — `read_legacy_commit_cache` checking out a
real Lance branch, and a branch-scoped `__manifest` — was never exercised.

Add `seed_legacy_v3_lineage_with_branch`, which forks a real `feature` Lance
branch on BOTH `_graph_commits.lance` and `__manifest` (the branch inherits
main's stripped v3 state), and a test that migrates the BRANCH and asserts the
branch's lineage lands in the BRANCH's `__manifest` (genesis + A + branch commit,
`graph_head:feature` → branch commit, parents + actors intact) with main's
`__manifest` untouched.

This empirically resolves the open question behind the merge robustness work: the
fast-path `read_graph_lineage(dataset)` has no `manifest_branch` filter, but
`__manifest` is Lance-branched per graph-branch, so a branch reads only its own
lineage — the test confirms migrating one branch does not leak into another. No
branch filter is needed.

* refactor(engine): type the lineage-backfill merge conflict via the publisher classifier

`state::merge_lineage_rows` (the v3→v4 lineage backfill's standalone `__manifest`
merge-insert) stringified its `execute_reader` error, discarding the Lance
variant. Route it through the publisher's `map_lance_publish_error` (now
`pub(crate)`) so a concurrent first-open's row-level CAS loss surfaces as the
SAME typed `OmniError::Manifest{ details: RowLevelCasContention }` the publisher's
own retry consumes — one vocabulary, no raw-Lance matching in the migration.

Deliberately NOT unified with `optimize::is_retryable_lance_conflict`: that
classifier also matches `CommitConflict`/`RetryableCommitConflict` from the
compaction commit path, which a row-level merge-insert never emits. Cross-linked
with a comment at both sites.

Behavior-preserving: the only path that changes is the error TYPE on a CAS loss
(previously an opaque `Lance` string, now a typed conflict); no success/failure
outcome changes. The bounded re-open retry that consumes the new type lands next.

* test(engine): red — concurrent v3→v4 migrations error instead of converging

`migrate_v2_to_v3` is concurrent-runner idempotent by design; v3→v4 regressed it.
`merge_lineage_rows` uses `conflict_retries(0)` and `migrate_v3_to_v4` has no
app-level retry, so when two processes open the same legacy graph at once the
backfill's row-level CAS loser errors the whole open instead of converging.

The test opens two `__manifest` handles at the same pre-migration (v3,
empty-lineage) HEAD and runs both `migrate_internal_schema` calls under
`tokio::join!`, forcing the `graph_head:main` CAS to fire every run. Against the
current code the loser fails with `RowLevelCasContention` ("Attempted 0
retries.") — the predicted symptom — so the "both must converge" assertion
panics. The bounded re-open retry that makes both converge lands next.

* fix(engine): make the v3→v4 lineage backfill converge under concurrent runners

`migrate_v2_to_v3` is concurrent-runner idempotent; v3→v4 was not. Two processes
(or open-for-write handles) opening the same legacy graph at once both reach the
backfill merge, and `merge_lineage_rows`'s `conflict_retries(0)` made the
row-level CAS loser error the whole open instead of converging.

Two contention points, both now handled all-or-nothing:

1. The backfill merge on `graph_head:<branch>`. Wrap (fast-path re-read → read
   legacy → merge) in a bounded re-open retry loop: a `RowLevelCasContention` loss
   re-opens the manifest past the winner's (atomic) commit and re-loops; the
   fast-path re-read then sees the winner's lineage and stamps. On budget
   exhaustion it returns a `RowLevelCasContention`-typed error so the publisher's
   OUTER retry loop completes it. The retry decision reuses the publisher's
   `is_retryable_publish_conflict` so the two stay in lockstep.

2. The terminal stamp bump. Making the merge loser converge newly lets BOTH
   runners reach `set_stamp(4)` — an `UpdateConfig` commit on the same key — so the
   loser gets `lance::Error::IncompatibleTransaction` (NOT a row-level CAS, so the
   merge loop doesn't catch it). This surfaced only under the concurrent
   full-suite run, not the isolated test. Both write the SAME value, so the
   conflict is benign: `commit_v4_stamp_idempotently` re-opens and, if the stamp
   already reached the target, succeeds; else re-applies (bounded).

Greens the race test from the previous commit (3x isolated, 5x full-suite, no
flake). The new `IncompatibleTransaction` match is pinned by
`lance_surface_guards.rs::lance_error_incompatible_transaction_variant_exists`.

* fix(engine): refuse a future internal-schema stamp on the branch read path

`load_commit_cache_for_branch` dispatched on the branch's internal-schema stamp —
`< CURRENT` to the v3 legacy fallback, `>= CURRENT` to the manifest projection —
but never refused a `> CURRENT` branch stamp, so a newer-binary shape would be
misread by the projection rather than rejected.

Add `refuse_if_stamp_too_new(stamp)` (re-exported `pub(crate)` from `migrations`)
right after the branch stamp is read, mirroring the main read path's
`refuse_if_internal_schema_too_new`. This is defense-in-depth, not a live hole:
migrations run main-first (main migrates on open; each branch on its first write),
so main's stamp is always >= every branch's and the main path refuses first. The
guard closes the gap if that ordering invariant is ever weakened.

Tested by force-stamping a real branch past CURRENT and asserting the branch read
refuses with the upgrade error (the test misreads via the projection — returns Ok
— without the guard, confirmed by removing it).

* docs(rfc-013): record the v3→v4 migration robustness fixes

invariants.md Known Gaps: the `migrate_v3_to_v4` entry now states the migration is
loud on non-not-found legacy-open errors and concurrent-runner idempotent (bounded
re-open retry on the merge CAS + idempotent stamp bump), and that the branch read
path refuses a `> CURRENT` stamp.

lance.md: note the two new surface guards the migration depends on
(`dataset_open_missing_returns_not_found_variant`,
`lance_error_incompatible_transaction_variant_exists`).

testing.md: note the migration fault-injection test in the failpoints row.

* refactor: remove dead code and silence warnings across engine + cluster

Dead-code sweep follow-up to the RFC-013 stack. No behavior change.

- engine: delete the orphaned `validate_edge_cardinality` — the load path uses
  `validate_edge_cardinality_with_pending_loader` for every mode (including
  Overwrite, which it treats as the replacement table image), so the old
  standalone validator had no caller — and correct its sibling's now-stale doc
  reference. Gate `TableStore::append_batch` `#[cfg(test)]`: it is the inline-
  commit residual kept only for recovery test setup, with no non-test caller.
- cluster: drop unused imports in `lib.rs`, delete the unused
  `ClusterStore::payload_display`, and raise `LiveGraphObservation` /
  `GraphObservationJson` / `PolicyTarget` to `pub(crate)` to match the functions
  that return them.

Both lib crates now build warning-free.

* fix(engine): match Lance's typed DatasetAlreadyExists, not the message string

The internal create-or-open idempotency fallbacks in `db/commit_graph.rs` and
`db/recovery_audit.rs` classified the "already exists" race by
`err.to_string().contains("Dataset already exists")` — a Lance display string,
not an API contract. A wording change upstream would silently break the fallback
(a re-create would error instead of opening the existing table). Match the typed
`lance::Error::DatasetAlreadyExists { .. }` variant instead — the same discipline
as the v3→v4 migration's not-found classifier — pinned by the new
`lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists`
guard so a Lance rename turns red instead of silently regressing.

* refactor(engine): consolidate now_micros into one crate::db helper

Four `fn now_micros() -> Result<i64>` copies (commit_graph, recovery_audit,
graph_coordinator, manifest/graph) had already drifted: three mapped the
clock error to `OmniError::manifest("...UNIX_EPOCH...")` while recovery_audit
used `OmniError::manifest_internal("...unix epoch...")`. Replace all four with
one `pub(crate) fn now_micros()` in `db/mod.rs` (the majority `manifest`
variant), and repoint the eight call sites at `crate::db::now_micros()`. No
test asserts on the failure message, so unifying the variant is behavior-safe;
the timestamp-mapping contract can no longer fork across the rows it stamps.

* refactor(engine): drop the dead snapshot param from roll_back_sidecar

`roll_back_sidecar` took `snapshot: &Snapshot` only to discard it with
`let _ = snapshot;` — rollbacks now always publish (the restored HEAD plus a
recovery-commit lineage row), so the snapshot is never read to decide whether
to skip a publish. Remove the parameter, the two call-site arguments, and the
suppressor. A signature must not advertise inputs it does not consume. The
`Snapshot` import stays — `process_sidecar`, `roll_forward_all`, and
`record_audit_recovery_rollforward` still take it.

* test(engine): red — open_at_branch wedges a branch on a missing commit-graph ref

A v4 graph keeps its graph lineage in `__manifest` (RFC-013 Phase 7); the
`_graph_commits.lance` branch ref is a derived artifact. An interrupted
fork-reclaim or a `cleanup` race can drop that derived ref while the manifest
lineage stays intact. Per invariants 7 + 15 a missing derived ref must not fail
a logical read of the lineage.

This wedge builds a real v4 `feature` branch (its `graph_head:feature` row in
`__manifest`), force-deletes ONLY the `_graph_commits.lance` `feature` ref, then
asserts the branch reads (`open_at_branch` / list-commits / `merge_base`)
succeed from `__manifest` while a write that needs the derived ref
(`create_branch`) fails loudly with the typed actionable error.

Red against current code: `open_at_branch`'s hard `checkout_branch(branch)?` on
the missing ref errors `OmniError::Lance` (Lance "Not found:
_graph_commits.lance/tree/feature/_versions"), wedging the logical read.

* fix(engine): read manifest lineage independent of the derived _graph_commits ref

`CommitGraph::open_at_branch` did a hard `checkout_branch(branch)?` on the
`_graph_commits.lance` branch ref before reading lineage — so a missing derived
ref (an interrupted fork-reclaim, or a `cleanup` race) wedged the branch's
commit-list / merge-base / snapshot resolution even though the lineage is
readable from the authoritative `__manifest` (RFC-013 Phase 7). That is a
derived/physical artifact failing a logical read — invariants 7 and 15.

Make the held commits handle `Option<Dataset>` (mirroring `actor_dataset`).
`open_at_branch` and `refresh` check out the derived ref best-effort: a typed
not-found (`RefNotFound`/`NotFound`) yields a `None` handle while the read
re-syncs from `__manifest`; any other open error still propagates. The manifest
existence gate is unchanged — `load_commit_cache_for_branch` keeps its hard `?`,
so a truly absent branch still fails loudly at the manifest. `create_branch`
(the only writer that forks a ref) and the folded-in version lookup return a
loud, actionable error on `None`, deferring repair to `cleanup`'s existing
orphan reconciler rather than inlining a write on a read-side refresh. Reads
(`head_commit`/`load_commits`/`get_commit`/`merge_base`) never touch the handle.

Greens the wedge regression from the preceding commit.

* fix(engine): v3→v4 retry loops return retryable contention on exhaustion

`commit_v4_stamp_idempotently`'s retry loop used `0..=STAMP_RETRY_BUDGET`
(6 iterations) with an `attempt < STAMP_RETRY_BUDGET` guard, so the LAST
iteration's `IncompatibleTransaction` fell through to
`Err(e) => OmniError::Lance(...)` — stringified, non-retryable — instead of the
intended `RowLevelCasContention`, and the post-loop contention return was dead
code. The publisher's outer retry only re-runs `is_retryable_publish_conflict`,
so under sustained concurrent v3→v4 migration the one-time stamp bump could fail
instead of converging, defeating the idempotency the migration is supposed to add.

Fix the loop to `0..BUDGET` with an UNGUARDED `IncompatibleTransaction` arm: the
retryable variant is always handled inside the loop (re-open + same-value check +
retry), so it can never reach the stringifying catch-all, and the post-loop is the
SINGLE reachable exhaustion path — the typed `RowLevelCasContention`. The `Err(e)`
arm now catches only genuine non-contention errors. Apply the same range alignment
to the sibling merge loop in `migrate_v3_to_v4` (behaviorally correct today — its
`Err(err)` returns the already-typed contention — but it carried the identical
off-by-one structure the stamp loop was copied from; aligning both stops the next
copy from re-introducing it).

Test-first. The exhaustion path is otherwise near-unreachable — a real concurrent
winner stamps the same value, so the re-read returns Ok on the first retry — so a
new `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to
lose, driving exhaustion deterministically. Against the pre-fix loop the new
`v4_stamp_exhaustion_returns_retryable_contention` test goes red with
`Lance("Incompatible transaction: injected failpoint triggered…")`; with the fix it
asserts the typed `RowLevelCasContention`. Found by automated review on #299.

* feat(engine): minimum-supported internal-schema floor + retirement tripwire

The internal-schema migration chain (`migrate_internal_schema`) had a too-new
ceiling but no floor, so every old `migrate_vN_…` arm and the v3 legacy readers
it needs stay forever — the pile grows by one migration + readers + tests every
schema version. Add `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (1 today, a pure
no-op: `read_stamp` floors an absent stamp at 1 and no real graph carries 0) as
the oldest stamp this binary opens; raising it is how the chain sheds old code.

Collapse the one-sided `refuse_if_stamp_too_new` into `refuse_if_stamp_unsupported`
checking both bounds, so the floor lands at all three stamp-enforcement sites —
the write-path migrate dispatcher, the read-only open guard, and the branch
lineage-read path (`commit_graph.rs`) — via one compiler-enforced rename. A
hand-wired floor twin would have had to touch each site, and the branch-read path
is easy to miss; one combined guard cannot half-enforce. Rename the read-only
wrapper `refuse_if_internal_schema_unsupported` to match.

A compile-time tripwire (`const _: () = assert!(LOWEST_REGISTERED_MIGRATION_SOURCE
== MIN_SUPPORTED…)`) fails the build if a future floor bump forgets to delete the
now-dead migration arm (or vice versa) — stronger than a runtime test, impossible
to skip, and it doubles as the use that keeps the mirror const live.

Tests: a sub-floor graph is refused in both open modes (twin of
`future_stamp_is_refused_in_both_open_modes`); the guard accepts exactly
[MIN, CURRENT]. No behavior change for any real graph. The retirement runbook
lives on the `MIN_SUPPORTED` doc-comment + invariants.md.

* fix(engine): compose migration contention with publisher retry; precise recovery-converge audit commit

Three review-surfaced fixes on the RFC-013 Phase 7 path.

Publisher retry vs migration contention: `publish()` propagated a
`load_publish_state` error fatally via `?`, so a `RowLevelCasContention` surfaced
by the v3->v4 migration's exhausted merge/stamp budgets aborted the publish
instead of being retried — only `merge_rows` conflicts hit the retry. This
contradicted the migration's own design, which returns that typed error
EXPECTING the publisher to re-run the load (by which point a concurrent winner
has usually finished the migration, so the next scan is a no-op). Route a
retryable load error through the same retry path as a retryable `merge_rows`
conflict. Regression test (failpoints): a one-shot retryable contention injected
into `load_publish_state` now commits via the retry; red without the fix (the
write fails with the injected contention).

Recovery-converge audit commit id: `converge_or_defer_roll_forward` recorded the
branch HEAD as the audit row's `graph_commit_id`, but a concurrent user write can
advance `graph_head` past the recovery commit between the winner's publish and
this read — attributing the audit to a later, wrong commit. Use the latest
`RECOVERY_ACTOR`-authored commit (what `publish_recovery_commit` mints), which is
the recovery commit by construction. The audit's actor was already correct (it
comes from `sidecar.actor_id`, not the commit).

Dead param: drop the unused `snapshot` from `record_audit_recovery_rollforward`
(removing the `let _ = snapshot;` suppressor). `storage` stays — it is used to
delete the sidecar.
											
										
										
											2026-06-25 13:55:34 +02:00
+								  *aligned* with #297's "postcondition is the state, not winning the CAS" principle. **#296
 								  landed on main first and is merged into this branch:** the converge helper
 								  (`converge_or_defer_roll_forward`) was reconciled with Phase-7's manifest-CAS roll-forward —
 								  on convergence the audit references the winner's folded `graph_commit_id` (the current
 								  `graph_head`), not a freshly minted one.
-												feat(engine): `WriteTxn` - validate schema + open each data table once per write (#298)

* docs(rfc-013): step-3b handoff + §4.1 corrections (validated)

Add the RFC-013 write-path handoff doc, and correct §4.1's WriteTxn sketch from the
4-subagent validation against current code:
- HandleCache → handle-threading (forward the commit-return handle; a version-keyed
  cache misses because HEAD walks N→N+1→N+2 across staging + index-build commits).
- "re-resolution unrepresentable" softened to "pinned base for the pre-commit phase +
  named fresh re-reads at the commit/fork boundary" — three reads (commit-time OCC, the
  live-HEAD drift probe, fork authority) are irreducible correctness machinery.
- WriteParams DOES carry a session field; the real constraint is "stage off an open
  Dataset," so attach the Session by opening read-style then staging off it.

* test(engine): RED step-3b capture-once fitness asserts + open_count probe

Two write-path cost gates, RED today, GREEN after the WriteTxn lands:
- write_validates_schema_contract_once: a write must validate the schema contract
  once (3 read_text + 2 exists). Today re-validates at every resolve point —
  measured 12 read_text / 9 exists (~4 validations) via CountingStorageAdapter
  (zero production change; the write twin of the read-path schema-once test).
- keyed_insert_opens_table_at_most_once: a keyed single-table write must open its
  table <=1x. Today measured 10 opens.

Adds an exact open-CALL probe: open_count + record_open() on QueryIoProbes (mirroring
probe_count/record_probe), called at both open chokepoints; surfaced as
IoCounts.open_count. forbidden_apis guarantees every write open routes through them.

* feat(engine): WriteTxn carrier + open_write_txn (3b scaffolding)

The capture-once write transaction (RFC-013 step 3b): WriteTxn{branch, base:
Snapshot, session} + Omnigraph::open_write_txn, which validates the schema contract
once and pins the base snapshot + the shared per-graph Session.

Landed as reviewed scaffolding (gated #[allow(dead_code)]); the next pass threads
Option<&WriteTxn> through open_for_mutation_on_branch / staging on the non-strict
bound-branch path — opening the base once from the pinned entry with the warm session
(a session-aware pinned opener returning a SnapshotHandle) and skipping the per-table
schema re-validation — to turn the two RED cost gates green. Strict ops / fork / the
commit-time OCC re-read keep their fresh reads.

* test(engine): scope write-path open_count to data tables (RFC-013 step 3b)

The keyed_insert_opens_table_at_most_once gate asserted open_count <= 1, but
open_count was a single unclassified counter: record_open() fires in both
open chokepoints, and open_dataset_tracked also opens the internal/system
tables (__manifest via layout.rs, _graph_commits/_graph_commit_actors via
commit_graph.rs). So the count conflated data-table opens with the publisher
CAS + commit-graph append opens — making the gate measure the wrong quantity
and unreachable by threading alone (the manifest publish keeps it >1 regardless).

Scope it by table class, mirroring the read-side counters (which already split
by URI prefix via separate wrappers): record_open(uri) classifies the open's
last path segment and feeds data_open_count vs internal_open_count. IoCounts
exposes both; the gate now asserts data_open_count <= 1.

Re-baselined: a single keyed insert is data_open_count=4 / internal_open_count=6
(sum 10, the old conflated value). The RED target for the WriteTxn threading is
now the real data-table-open count (4 -> 1), with internal opens correctly out
of scope. Pure test-harness/instrumentation; no production behavior change
(classification runs only inside the probe closure, skipped when no probes are
installed).

Also marks #297 (optimize-vs-write race) as landed in the step-3b handoff —
this branch is already stacked on origin/main after it merged.

* feat(engine): validate the schema contract once per write (RFC-013 step 3b)

A single mutate/load re-validated the schema contract ~4 times: at the entry
(ensure_schema_state_valid), per-table in open_for_mutation_on_branch
(resolved_branch_target), at the commit-time OCC re-read (fresh_snapshot_for_branch),
and in the publisher's index-build snapshot (snapshot_for_branch). Each validation
is 3 read_text + 2 exists on the storage adapter — O(touched resolve-points) of
redundant contract I/O on every write.

Thread the already-landed WriteTxn carrier through the write path: capture
`txn = open_write_txn(branch)` once at the mutate/load entry (the single validation),
then source the per-table entry and the commit/publish snapshots from `txn.base`
instead of re-resolving. When `txn` is None (branch merge, schema apply, tests) every
function is byte-identical to before.

- mutate_with_current_actor / load_jsonl_reader capture txn once (replacing the
  entry-point ensure_schema_state_valid) and thread Some(&txn) through
  execute_*/open_table_for_mutation, commit_all, and
  commit_updates_on_branch_with_expected.
- open_for_mutation_on_branch sources (snapshot, branch) from txn.base/txn.branch
  when present — skipping resolved_branch_target's re-validation. The OPEN itself is
  unchanged (still HEAD via open_dataset_head_for_write), and strict ops keep
  ensure_expected_version. Schema-once applies to strict and non-strict alike; the
  data-open collapse is a separate change.
- commit_all uses fresh_snapshot_for_branch_unchecked (the OCC manifest re-read minus
  the schema re-validation) when txn is present; the drift guard is unchanged.
- prepare_updates_for_commit uses txn.base for the publisher index-build snapshot.

fresh_snapshot_for_branch{,_unchecked} now read the manifest directly via
ManifestCoordinator instead of resolve_target. The OCC re-read consumes only the
Snapshot (per-table location + version), which ManifestCoordinator::open().snapshot()
produces identically — but resolve_target additionally opened the commit graph (a
spurious _graph_commits.lance exists probe the OCC read never consults). Dropping that
load is a pure read-cost reduction for every fresh-snapshot caller (commit_all's None
arm, optimize, repair, fork reclaim); the returned Snapshot is unchanged and the read
is a fresher cold manifest re-read, so the OCC freshness guarantee is preserved.

Greens write_validates_schema_contract_once (3 read_text / 2 exists, was 12/9).
keyed_insert_opens_table_at_most_once stays red (data_open_count=4) — the open
collapse lands next. Full engine suite green otherwise.

* feat(engine): open each data table once per write (RFC-013 step 3b)

A single keyed-node mutate opened its data table 4 times: accumulation (to read
.version()), staging (the real write base), the commit-time drift guard (to read
live HEAD), and the publisher's index build (reopen at the just-committed version).
Collapse three of the four — using the WriteTxn carrier threaded for schema-once —
so a write opens each touched data table at most once.

- #1 accumulation: open_for_mutation_on_branch now returns
  (Option<SnapshotHandle>, expected_version, full_path, table_branch). On the txn's
  own branch, a non-strict (Insert/Merge) op needs no open — the only thing the
  caller reads is .version() (the CAS fence), which is exactly the pinned base
  version (entry.table_version). So skip open_dataset_head_for_write and source the
  version from txn.base. The node insert path already discarded that handle; the
  edge path resolves a pinned read only when non-default cardinality needs it.
  STRICT ops and any write that must fork still open live HEAD + ensure_expected_version.
- #3 commit drift guard: commit_all reads live HEAD via
  entry.dataset.dataset().latest_version_id() — a cheap manifest-pointer probe off
  the already-open staging handle (the same primitive ManifestCoordinator::
  probe_latest_version uses) instead of a fresh open_dataset_head_for_write. The
  head<current / head>current drift classification is byte-identical.
- #4 index build: commit_all now returns the per-table post-commit_staged
  SnapshotHandle map; commit_updates_on_branch_with_expected threads it into
  prepare_updates_for_commit, which builds indices on the threaded handle instead of
  reopening at the same just-committed version. Absent a handle (other writers,
  inline/delete tables) the reopen path is byte-identical.

When txn is None (branch merge, schema apply, tests) every function opens and checks
exactly as before. Greens keyed_insert_opens_table_at_most_once (data_open_count 4->1).
Schema-once gate stays 3/2. Full engine suite + failpoints (recovery sidecar lifecycle)
green.

* refactor(engine): name the write-path open/commit returns (RFC-013 step 3b)

The open collapse left two positional returns that are easy to mis-thread and
carry an unwritten contract: open_for_mutation_on_branch's
(Option<SnapshotHandle>, u64, String, Option<String>) and commit_all's 5-tuple
(updates, expected_versions, sidecar_handle, guards, committed_handles). Replace
both with named structs so each field reads at the call site and the Option's
contract is documented, not folklore.

- OpenedForMutation { handle, expected_version, full_path, table_branch } with a
  require_handle(ctx) helper for the callers that must have a handle (strict ops,
  the fork path, every no-txn caller — branch merge, the seed test). The handle is
  None only on the non-strict-txn open-skip path (collapse #1); require_handle
  panics with a named context if that contract is ever broken.
- CommittedMutation { updates, expected_versions, sidecar_handle, guards,
  committed_handles } for commit_all; consumers destructure into the same local
  bindings they already used, so the publish/sidecar/guard-hold logic is unchanged.
- A debug_assert in open_table_for_mutation pins the skip contract: a missing handle
  is legal only on the non-strict txn path, so a future strict arm returning None
  trips in debug builds instead of handing None to a require_handle consumer.

Pure refactor — no behavior change. Both cost gates stay green (schema 3/2,
data_open_count=1), full engine suite + lib (162) green.

* refactor(engine): drop the unearned session field from WriteTxn (RFC-013 step 3b)

The open collapse greens data_open_count<=1 by SKIPPING the accumulation open,
PROBING live HEAD with latest_version_id, and REUSING the commit_staged handle —
none of which consume a session. The captured WriteTxn.session was therefore dead
(`#[allow(dead_code)]`): unearned surface a reviewer rightly flags.

Remove it. The carrier is now {branch, base} — exactly what schema-once + the open
collapse use. Step 5 (PublishPlan unification) makes WriteTxn the non-optional
publish carrier and is the right home for session-aware base opens, where the
warm-session benefit on the single remaining open — an object-store (S3) phenomenon,
invisible on local FS — can be earned by its own cost gate rather than carried dead
through this PR.

No behavior change; both cost gates stay green (schema 3/2, data_open_count=1).

* docs(rfc-013): mark step 3b DONE — schema-once + open-collapse shipped, session deferred to step 5

* docs(rfc-013): capture the write-base-staleness convergence (§1d)

Three findings this cycle share one root — the write base is a stale, un-probed,
un-classified pin (the read path probes; the write path returns the warm
coordinator snapshot):

- #298 edge-@card stale-read regression (cursor High / codex P1, VALID): collapse #1
  made the cardinality scan read txn.base instead of live HEAD, so a concurrent edge
  is uncounted and a max can be exceeded. Fix on #298: restore the live-HEAD read +
  deterministic test + correct the single-writer doc comment.
- The structural liability underneath: no unified write-validation read-set —
  endpoint/cardinality/uniqueness each pick freshness ad hoc (warm/pinned/live),
  the same cardinality check forks mutation-vs-loader, none re-validated at commit.
- The served-strict-write stale-view false-fail (validated on prod + a #[ignore]
  repro): a strict update/delete false-fails ExpectedVersionMismatch after an external
  optimize advance — the write-side mirror of #297/§6.6. The naive blanket probe is
  proven wrong (breaks the cross-process lost-update OCC contract).

All three converge on Design A (step 5): open_txn's warm probe makes the base fresh,
the op-class-aware precondition (derive maintenance vs logical from Lance per-version
transaction metadata — no parallel marker) fast-forwards maintenance and fails logical,
and §7.1's read-set-in-CAS unifies + re-validates the validation read-set. §8 records
the #298 follow-up, the widened §7.1 scope, and the step-5 two-test acceptance contract.

* test(engine): RED — edge @card must scan live HEAD, not stale txn.base (#298)

Regression guard for the cursor-High/codex-P1 finding on #298: 3b's collapse #1
made the non-strict edge-insert cardinality scan read the pinned txn.base instead
of live HEAD (edge_cardinality_read_handle), so a concurrent edge committed after
txn capture is uncounted and a @card max is silently exceeded (invariant 9).

Deterministic two-handle test (no failpoint): handle A commits WorksAt(Alice->Acme)
to the @card(0..1) max; stale handle B (never read since) inserts a second WorksAt
for Alice. B's coordinator is stale by construction (the write path doesn't probe),
so B scans txn.base (Alice has 0) and wrongly commits the 2nd edge. RED: the insert
that must be rejected currently succeeds (panics at unwrap_err). Goes green when the
scan reads live HEAD.

* fix(engine): scan live HEAD for edge @card, not the pinned txn.base (#298)

3b's collapse #1 skips the non-strict edge accumulation open, so edge_cardinality_
read_handle reopened the edge table at the pinned txn.base for the @card scan. Since
cardinality is validated once (never rechecked at commit), a concurrent edge committed
after txn capture was uncounted and a @card max could be silently exceeded (invariant
9) — the cursor-High/codex-P1 regression on #298. Pre-3b the scan read live HEAD (the
mutation's own open_dataset_head_for_write handle).

Restore the live-HEAD read: take the table LOCATION from the pinned entry (stable
across versions) and open the dataset at its current HEAD via open_dataset_head_for_
write. Gate-safe — the data_open_count / merge-insert-only gates are node inserts; the
edge cardinality path (non-default @card only) is untouched by them, and the extra
live-HEAD open is exactly the pre-3b shape. Also drops the dead None-fallback's schema
re-validation (greptile P2, auto-resolved). The residual validate->commit TOCTOU is the
pre-existing §7.1 gap (RFC-013 step 4), recorded in handoff §1d/§8.

Turns cardinality_rejected_for_stale_handle_after_concurrent_edge_commit green;
validators / write_cost / writes / consistency / end_to_end / branching all green.

* docs(dev): link handoff docs from index

* docs(engine): tighten 3b claims to match the code (#298 review)

Review caught several comments/docs overclaiming what the code does (the session
drop + the #298 cardinality fix left stale/too-strong wording). No logic change.

- open_write_txn doc: drop the stale "shared per-graph Session" (WriteTxn no longer
  carries one); scope "once" to the table-touch hot path and note edge/load RI
  validation still re-resolves (→ step 4 §7.1) + the session-aware open is step 5.
- edge cardinality call-site comment: it said the scan uses a "pinned txn.base" — it
  now opens LIVE HEAD (#298); corrected.
- write_cost.rs: "opens the base once (with the shared Session)" → session-aware base
  open is deferred to step 5.
- data_open_count completeness (instrumentation.rs + write_cost.rs): forbidden_apis
  only keeps engine code OUTSIDE the storage layer on the chokepoints; table_store.rs
  is allow-listed and holds direct Dataset::opens for branch-management ops (not the
  keyed-write hot path the gate measures). Narrowed the claim accordingly.
- handoff §4: "schema once / open once" is the node hot path (the two gates); edge
  endpoint + loader RI/cardinality still re-validate and read warm — #298 un-regresses
  cardinality only, it does NOT close write-validation freshness (that's step 4 §1d/§7.1).

build clean; write_cost / validators / forbidden_apis green.
											
										
										
											2026-06-23 21:27:31 +02:00
+								- **#295** — the step-3b RFC doc (apply §4's three corrections to it).
 								---
 								## 8. Remaining RFC steps after 3b (RFC §9 is canonical)
 								- **#298 follow-up (do on the 3b PR, before merge): the edge-`@card` stale-read regression**
 								  (§1d.1). Restore the live-HEAD cardinality scan, add the deterministic regression test, fix
 								  the wrong doc comment. Small, gate-safe, un-regresses an integrity check (invariant 9). The
 								  residual concurrent TOCTOU is the §7.1 gap (step 4) — un-widen here, don't over-reach.
 								- **Step 4 / Phase 7** (`iss-991`): lineage into `__manifest` (publish `graph_commit` +
 								  mutable `graph_head:<branch>` in the same merge-insert; `_graph_commits` becomes a
 								  projection). Removes the per-write `commit_graph.refresh`; closes the manifest→commit-graph
 								  atomicity + commit-graph-parent-under-concurrency gaps. **Hard prereq: step 2 (done).**
 								  Carries the §7.1 *concurrent* write-skew fix (needs the `graph_head` contention row) —
 								  **frame §7.1 as "unify the entire write-validation read-set" (endpoint + cardinality +
 								  cross-version uniqueness), not merely "add `graph_head`"** (§1d.1): the bespoke
 								  `edge_cardinality_read_handle` and the mutation-vs-loader freshness fork dissolve into one
 								  pinned read-set re-validated under the `graph_head` contention, or the liability survives as
 								  a second special-case.
 								- **Step 5 / Design A** — §5 above. **Acceptance item: the served-strict-write stale-view
 								  false-fail** (§1d.2) — the op-class-aware precondition + `open_txn` probe. The contract is
 								  two tests passing *together*: un-ignore
 								  `writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes` (goes green)
 								  *while* `consistency::stale_handle_public_mutation_must_refresh_then_retry` stays green
 								  (maintenance fast-forwards; logical fails loudly). Self-contained enough to ship standalone
 								  like #297 if prod pain is acute; otherwise fold into the single PublishPlan delta-interpreter.
 								- **Step 2b** — internal-table cleanup + the Q8 monotonic watermark (a Lance boundary tag).
 								  Deferred: only the secondary version-count/space term, touches the read/open path, and is
 								  MTT-redundant. Land when version-count cost bites.
 								- **§7.1 sequential write-skew** (`iss-overwrite-orphans-committed-edges`) — inbound-RI
 								  validation on node removal; independent, ships anytime.
 								- **#20** — the prod per-write `storage.ops` span metric (RFC §5.3), still owed.
 								- Branch ops: Lance `Clone` for create (`iss-691`).
 								---
 								## 9. Gotchas / traps (learned the hard way)
 								- **In-process queue ≠ cross-process lock.** Any "I hold the queue → skip the retry/CAS"
 								  reasoning is a bug across processes. This is the recurring trap.
 								- **Monotonic publish must be `≥`-conditional, never "no assertion."** The `__manifest`
 								  merge-insert is unconditional `UpdateAll` keyed on `object_id` (`publisher.rs:379`), so
 								  the equality (or monotonic) pre-check is the *only* guard — dropping it lets `UpdateAll`
 								  regress a newer version = lost write.
 								- **The drift guard interprets an ambiguous delta.** Re-evaluating it in a retry over
 								  self-mutated state is how #297's follow-on bug happened. Gate any HEAD-vs-manifest
 								  interpretation on "have *we* committed yet."
 								- **`compact_files` fires Lance's auto_cleanup GC hook** (commits with
 								  `skip_auto_cleanup=false`, no override) — optimize strips stale `lance.auto_cleanup.*`
 								  config before compacting to stay non-destructive on upgraded graphs. The strip is a
 								  separate commit (relevant to the partial-commit retry trap).
 								- **Lance rebases the common concurrent case for free** — so the data-table conflict usually
 								  surfaces as the manifest fast-forward, not a Lance error. The Lance-Rewrite-overlap path is
 								  rare and needs failpoint injection to test.
 								---
 								## 10. Verification (the gate)
 								- `cargo test --workspace --locked` — the canonical gate (matches CI).
 								- `cargo test -p omnigraph-engine --features failpoints --test failpoints optimize` —
 								  the optimize concurrency/recovery tests.
 								- `cargo test -p omnigraph-engine --test write_cost` / `write_cost_s3` (bucket-gated) —
 								  cost gates (3b adds the schema-once + open-count asserts here).
 								- `cargo test -p omnigraph-engine --test maintenance` — optimize/repair/cleanup.
 								- Re-read [`invariants.md`](invariants.md), [`lance.md`](lance.md), [`testing.md`](testing.md)
 								  before each change (always-on requirement).
 								Lance source for re-validation:
 								`/Users/ragnor/.cargo/registry/src/index.crates.io-*/lance-7.0.0` (key files: `io/commit.rs`,
 								`io/commit/conflict_resolver.rs`, `dataset/optimize.rs`, `dataset/write/retry.rs`,
 								`dataset/builder.rs`).