feat(engine): WriteTxn - validate schema + open each data table once per write (#298)

* docs(rfc-013): step-3b handoff + §4.1 corrections (validated)

Add the RFC-013 write-path handoff doc, and correct §4.1's WriteTxn sketch from the
4-subagent validation against current code:
- HandleCache → handle-threading (forward the commit-return handle; a version-keyed cache
  misses because HEAD walks N→N+1→N+2 across staging + index-build commits).
- "re-resolution unrepresentable" softened to "pinned base for the pre-commit phase + named
  fresh re-reads at the commit/fork boundary": three reads (commit-time OCC, the live-HEAD
  drift probe, fork authority) are irreducible correctness machinery.
- WriteParams DOES carry a session field; the real constraint is "stage off an open
  Dataset," so attach the Session by opening read-style then staging off it.

* test(engine): RED step-3b capture-once fitness asserts + open_count probe

Two write-path cost gates, RED today, GREEN after the WriteTxn lands:
- write_validates_schema_contract_once: a write must validate the schema contract once
  (3 read_text + 2 exists). Today re-validates at every resolve point: measured
  12 read_text / 9 exists (~4 validations) via CountingStorageAdapter (zero production
  change; the write twin of the read-path schema-once test).
- keyed_insert_opens_table_at_most_once: a keyed single-table write must open its table
  <=1x. Today measured 10 opens.

Adds an exact open-CALL probe: open_count + record_open() on QueryIoProbes (mirroring
probe_count/record_probe), called at both open chokepoints; surfaced as IoCounts.open_count.
forbidden_apis guarantees every write open routes through them.

* feat(engine): WriteTxn carrier + open_write_txn (3b scaffolding)

The capture-once write transaction (RFC-013 step 3b): WriteTxn{branch, base: Snapshot,
session} + Omnigraph::open_write_txn, which validates the schema contract once and pins the
base snapshot + the shared per-graph Session.
Landed as reviewed scaffolding (gated #[allow(dead_code)]); the next pass threads
Option<&WriteTxn> through open_for_mutation_on_branch / staging on the non-strict
bound-branch path, opening the base once from the pinned entry with the warm session (a
session-aware pinned opener returning a SnapshotHandle) and skipping the per-table schema
re-validation, to turn the two RED cost gates green. Strict ops / fork / the commit-time OCC
re-read keep their fresh reads.

* test(engine): scope write-path open_count to data tables (RFC-013 step 3b)

The keyed_insert_opens_table_at_most_once gate asserted open_count <= 1, but open_count was
a single unclassified counter: record_open() fires in both open chokepoints, and
open_dataset_tracked also opens the internal/system tables (__manifest via layout.rs,
_graph_commits/_graph_commit_actors via commit_graph.rs). So the count conflated data-table
opens with the publisher CAS + commit-graph append opens, making the gate measure the wrong
quantity and unreachable by threading alone (the manifest publish keeps it >1 regardless).

Scope it by table class, mirroring the read-side counters (which already split by URI prefix
via separate wrappers): record_open(uri) classifies the open's last path segment and feeds
data_open_count vs internal_open_count. IoCounts exposes both; the gate now asserts
data_open_count <= 1. Re-baselined: a single keyed insert is data_open_count=4 /
internal_open_count=6 (sum 10, the old conflated value). The RED target for the WriteTxn
threading is now the real data-table-open count (4 -> 1), with internal opens correctly out
of scope.

Pure test-harness/instrumentation; no production behavior change (classification runs only
inside the probe closure, skipped when no probes are installed).

Also marks #297 (optimize-vs-write race) as landed in the step-3b handoff; this branch is
already stacked on origin/main after it merged.
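
A minimal sketch of the table-class split described above, assuming the internal tables are exactly the three named in this message and that a table directory may carry a `.lance` suffix. `TableClass`, `classify_open`, and `OpenProbe` are hypothetical names for illustration; the real probe lives on QueryIoProbes and its code is not shown here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical table classes for the open counter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TableClass {
    Data,
    Internal,
}

/// Classify an open by the last path segment of its URI.
/// Assumption: internal tables are `__manifest`, `_graph_commits`, and
/// `_graph_commit_actors`, optionally suffixed with `.lance`.
pub fn classify_open(uri: &str) -> TableClass {
    let last = uri.trim_end_matches('/').rsplit('/').next().unwrap_or("");
    let name = last.strip_suffix(".lance").unwrap_or(last);
    match name {
        "__manifest" | "_graph_commits" | "_graph_commit_actors" => TableClass::Internal,
        _ => TableClass::Data,
    }
}

/// Illustrative stand-in for the probe: two counters instead of one.
#[derive(Debug, Default)]
pub struct OpenProbe {
    data_open_count: AtomicU64,
    internal_open_count: AtomicU64,
}

impl OpenProbe {
    /// Record one open call against the counter for its table class.
    pub fn record_open(&self, uri: &str) {
        let counter = match classify_open(uri) {
            TableClass::Data => &self.data_open_count,
            TableClass::Internal => &self.internal_open_count,
        };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    /// Returns (data_open_count, internal_open_count).
    pub fn counts(&self) -> (u64, u64) {
        (
            self.data_open_count.load(Ordering::Relaxed),
            self.internal_open_count.load(Ordering::Relaxed),
        )
    }
}
```

With a split like this, a gate can assert on the data counter alone, so the manifest publish and commit-graph appends no longer make `<= 1` unreachable.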
* feat(engine): validate the schema contract once per write (RFC-013 step 3b)

A single mutate/load re-validated the schema contract ~4 times: at the entry
(ensure_schema_state_valid), per-table in open_for_mutation_on_branch
(resolved_branch_target), at the commit-time OCC re-read (fresh_snapshot_for_branch), and in
the publisher's index-build snapshot (snapshot_for_branch). Each validation is 3 read_text +
2 exists on the storage adapter: O(touched resolve-points) of redundant contract I/O on
every write.

Thread the already-landed WriteTxn carrier through the write path: capture
`txn = open_write_txn(branch)` once at the mutate/load entry (the single validation), then
source the per-table entry and the commit/publish snapshots from `txn.base` instead of
re-resolving. When `txn` is None (branch merge, schema apply, tests) every function is
byte-identical to before.
- mutate_with_current_actor / load_jsonl_reader capture txn once (replacing the entry-point
  ensure_schema_state_valid) and thread Some(&txn) through
  execute_*/open_table_for_mutation, commit_all, and
  commit_updates_on_branch_with_expected.
- open_for_mutation_on_branch sources (snapshot, branch) from txn.base/txn.branch when
  present, skipping resolved_branch_target's re-validation. The OPEN itself is unchanged
  (still HEAD via open_dataset_head_for_write), and strict ops keep ensure_expected_version.
  Schema-once applies to strict and non-strict alike; the data-open collapse is a separate
  change.
- commit_all uses fresh_snapshot_for_branch_unchecked (the OCC manifest re-read minus the
  schema re-validation) when txn is present; the drift guard is unchanged.
- prepare_updates_for_commit uses txn.base for the publisher index-build snapshot.

fresh_snapshot_for_branch{,_unchecked} now read the manifest directly via
ManifestCoordinator instead of resolve_target.
The OCC re-read consumes only the Snapshot (per-table location + version), which
ManifestCoordinator::open().snapshot() produces identically, but resolve_target additionally
opened the commit graph (a spurious _graph_commits.lance exists probe the OCC read never
consults). Dropping that load is a pure read-cost reduction for every fresh-snapshot caller
(commit_all's None arm, optimize, repair, fork reclaim); the returned Snapshot is unchanged
and the read is a fresher cold manifest re-read, so the OCC freshness guarantee is preserved.

Greens write_validates_schema_contract_once (3 read_text / 2 exists, was 12/9).
keyed_insert_opens_table_at_most_once stays red (data_open_count=4); the open collapse lands
next. Full engine suite green otherwise.

* feat(engine): open each data table once per write (RFC-013 step 3b)

A single keyed-node mutate opened its data table 4 times: accumulation (to read .version()),
staging (the real write base), the commit-time drift guard (to read live HEAD), and the
publisher's index build (reopen at the just-committed version). Collapse three of the four,
using the WriteTxn carrier threaded for schema-once, so a write opens each touched data
table at most once.
- #1 accumulation: open_for_mutation_on_branch now returns (Option<SnapshotHandle>,
  expected_version, full_path, table_branch). On the txn's own branch, a non-strict
  (Insert/Merge) op needs no open: the only thing the caller reads is .version() (the CAS
  fence), which is exactly the pinned base version (entry.table_version). So skip
  open_dataset_head_for_write and source the version from txn.base. The node insert path
  already discarded that handle; the edge path resolves a pinned read only when non-default
  cardinality needs it. STRICT ops and any write that must fork still open live HEAD +
  ensure_expected_version.
- #3 commit drift guard: commit_all reads live HEAD via
  entry.dataset.dataset().latest_version_id(), a cheap manifest-pointer probe off the
  already-open staging handle (the same primitive ManifestCoordinator::probe_latest_version
  uses) instead of a fresh open_dataset_head_for_write. The head<current / head>current drift
  classification is byte-identical.
- #4 index build: commit_all now returns the per-table post-commit_staged SnapshotHandle map;
  commit_updates_on_branch_with_expected threads it into prepare_updates_for_commit, which
  builds indices on the threaded handle instead of reopening at the same just-committed
  version. Absent a handle (other writers, inline/delete tables) the reopen path is
  byte-identical.

When txn is None (branch merge, schema apply, tests) every function opens and checks exactly
as before.

Greens keyed_insert_opens_table_at_most_once (data_open_count 4->1). Schema-once gate stays
3/2. Full engine suite + failpoints (recovery sidecar lifecycle) green.

* refactor(engine): name the write-path open/commit returns (RFC-013 step 3b)

The open collapse left two positional returns that are easy to mis-thread and carry an
unwritten contract: open_for_mutation_on_branch's (Option<SnapshotHandle>, u64, String,
Option<String>) and commit_all's 5-tuple (updates, expected_versions, sidecar_handle,
guards, committed_handles). Replace both with named structs so each field reads at the call
site and the Option's contract is documented, not folklore.
- OpenedForMutation { handle, expected_version, full_path, table_branch } with a
  require_handle(ctx) helper for the callers that must have a handle (strict ops, the fork
  path, every no-txn caller: branch merge, the seed test). The handle is None only on the
  non-strict-txn open-skip path (collapse #1); require_handle panics with a named context if
  that contract is ever broken.
- CommittedMutation { updates, expected_versions, sidecar_handle, guards, committed_handles }
  for commit_all; consumers destructure into the same local bindings they already used, so
  the publish/sidecar/guard-hold logic is unchanged.
- A debug_assert in open_table_for_mutation pins the skip contract: a missing handle is legal
  only on the non-strict txn path, so a future strict arm returning None trips in debug
  builds instead of handing None to a require_handle consumer.

Pure refactor; no behavior change. Both cost gates stay green (schema 3/2,
data_open_count=1), full engine suite + lib (162) green.

* refactor(engine): drop the unearned session field from WriteTxn (RFC-013 step 3b)

The open collapse greens data_open_count<=1 by SKIPPING the accumulation open, PROBING live
HEAD with latest_version_id, and REUSING the commit_staged handle, none of which consume a
session. The captured WriteTxn.session was therefore dead (`#[allow(dead_code)]`): unearned
surface a reviewer rightly flags. Remove it. The carrier is now {branch, base}, exactly what
schema-once + the open collapse use.

Step 5 (PublishPlan unification) makes WriteTxn the non-optional publish carrier and is the
right home for session-aware base opens, where the warm-session benefit on the single
remaining open, an object-store (S3) phenomenon invisible on local FS, can be earned by its
own cost gate rather than carried dead through this PR.

No behavior change; both cost gates stay green (schema 3/2, data_open_count=1).
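
A rough sketch of the named open return from the refactor above, assuming the four field types listed for the old tuple. `SnapshotHandle` here is a placeholder struct, and the `require_handle` signature and `check_skip_contract` helper are guesses for illustration, not the engine's actual code.

```rust
/// Placeholder for the engine's handle type (assumption: the real one wraps an open dataset).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SnapshotHandle {
    pub version: u64,
}

/// Named replacement for the positional
/// (Option<SnapshotHandle>, u64, String, Option<String>) return.
#[derive(Debug)]
pub struct OpenedForMutation {
    /// None only on the non-strict txn open-skip path.
    pub handle: Option<SnapshotHandle>,
    /// The CAS fence: the version the write expects to build on.
    pub expected_version: u64,
    pub full_path: String,
    pub table_branch: Option<String>,
}

impl OpenedForMutation {
    /// For callers that must have a handle; panics with a named context otherwise.
    pub fn require_handle(&self, ctx: &str) -> &SnapshotHandle {
        self.handle.as_ref().unwrap_or_else(|| {
            panic!("{}: handle required but the open was skipped for {}", ctx, self.full_path)
        })
    }
}

/// Hypothetical form of the skip contract: a missing handle is legal only when
/// the op is non-strict and a txn is present. Checked in debug builds only.
pub fn check_skip_contract(opened: &OpenedForMutation, strict: bool, has_txn: bool) {
    debug_assert!(
        opened.handle.is_some() || (!strict && has_txn),
        "missing handle outside the non-strict txn path: {}",
        opened.full_path
    );
}
```

Naming the fields lets the call site read `opened.expected_version` instead of a tuple index, and the `Option` contract is stated once on the field.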
* docs(rfc-013): mark step 3b DONE (schema-once + open-collapse shipped, session deferred to step 5)

* docs(rfc-013): capture the write-base-staleness convergence (§1d)

Three findings this cycle share one root (the write base is a stale, un-probed,
un-classified pin: the read path probes; the write path returns the warm coordinator
snapshot):
- #298 edge-@card stale-read regression (cursor High / codex P1, VALID): collapse #1 made
  the cardinality scan read txn.base instead of live HEAD, so a concurrent edge is uncounted
  and a max can be exceeded. Fix on #298: restore the live-HEAD read + deterministic test +
  correct the single-writer doc comment.
- The structural liability underneath: no unified write-validation read-set.
  Endpoint/cardinality/uniqueness each pick freshness ad hoc (warm/pinned/live), the same
  cardinality check forks mutation-vs-loader, none re-validated at commit.
- The served-strict-write stale-view false-fail (validated on prod + a #[ignore] repro): a
  strict update/delete false-fails ExpectedVersionMismatch after an external optimize
  advance, the write-side mirror of #297/§6.6. The naive blanket probe is proven wrong
  (breaks the cross-process lost-update OCC contract).

All three converge on Design A (step 5): open_txn's warm probe makes the base fresh, the
op-class-aware precondition (derive maintenance vs logical from Lance per-version
transaction metadata; no parallel marker) fast-forwards maintenance and fails logical, and
§7.1's read-set-in-CAS unifies + re-validates the validation read-set. §8 records the #298
follow-up, the widened §7.1 scope, and the step-5 two-test acceptance contract.
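
One way to picture the op-class-aware precondition is the sketch below. It assumes the per-version operation kind has already been read from the transaction metadata and is passed in ascending version order; `OpKind`, `PinDecision`, and `resolve_strict_pin` are hypothetical names, not Lance or engine APIs. The maintenance set (Rewrite, ReserveFragments) and the logical set (Append, Update, Delete, Merge) follow the handoff's §1d.

```rust
/// Operation kinds as named in the handoff; a stand-in for whatever the
/// per-version transaction metadata actually exposes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpKind {
    Rewrite,
    ReserveFragments,
    Append,
    Update,
    Delete,
    Merge,
}

/// Outcome of checking a strict write's pinned version against newer versions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PinDecision {
    /// Nothing landed after the pin.
    Unchanged,
    /// Only maintenance landed: move the pin forward.
    FastForward { to: u64 },
    /// A logical write landed: fail loudly so the caller refreshes and retries.
    Conflict { pinned: u64, first_logical: u64 },
}

/// Content-preserving operations that a logical write may fast-forward past.
pub fn is_maintenance(op: OpKind) -> bool {
    matches!(op, OpKind::Rewrite | OpKind::ReserveFragments)
}

/// `intervening` holds (version, op) pairs; entries at or below `pinned` are ignored.
pub fn resolve_strict_pin(pinned: u64, intervening: &[(u64, OpKind)]) -> PinDecision {
    let mut head = pinned;
    for &(version, op) in intervening {
        if version <= pinned {
            continue;
        }
        if !is_maintenance(op) {
            return PinDecision::Conflict { pinned, first_logical: version };
        }
        head = head.max(version);
    }
    if head == pinned {
        PinDecision::Unchanged
    } else {
        PinDecision::FastForward { to: head }
    }
}
```

A blanket probe would correspond to always returning `FastForward`, which is the variant the message calls proven wrong; the `Conflict` arm is what keeps the lost-update contract.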
* test(engine): RED - edge @card must scan live HEAD, not stale txn.base (#298)

Regression guard for the cursor-High/codex-P1 finding on #298: 3b's collapse #1 made the
non-strict edge-insert cardinality scan read the pinned txn.base instead of live HEAD
(edge_cardinality_read_handle), so a concurrent edge committed after txn capture is
uncounted and a @card max is silently exceeded (invariant 9).

Deterministic two-handle test (no failpoint): handle A commits WorksAt(Alice->Acme) to the
@card(0..1) max; stale handle B (never read since) inserts a second WorksAt for Alice. B's
coordinator is stale by construction (the write path doesn't probe), so B scans txn.base
(Alice has 0) and wrongly commits the 2nd edge. RED: the insert that must be rejected
currently succeeds (panics at unwrap_err). Goes green when the scan reads live HEAD.

* fix(engine): scan live HEAD for edge @card, not the pinned txn.base (#298)

3b's collapse #1 skips the non-strict edge accumulation open, so
edge_cardinality_read_handle reopened the edge table at the pinned txn.base for the @card
scan. Since cardinality is validated once (never rechecked at commit), a concurrent edge
committed after txn capture was uncounted and a @card max could be silently exceeded
(invariant 9): the cursor-High/codex-P1 regression on #298. Pre-3b the scan read live HEAD
(the mutation's own open_dataset_head_for_write handle).

Restore the live-HEAD read: take the table LOCATION from the pinned entry (stable across
versions) and open the dataset at its current HEAD via open_dataset_head_for_write.
Gate-safe: the data_open_count / merge-insert-only gates are node inserts; the edge
cardinality path (non-default @card only) is untouched by them, and the extra live-HEAD open
is exactly the pre-3b shape. Also drops the dead None-fallback's schema re-validation
(greptile P2, auto-resolved). The residual validate->commit TOCTOU is the pre-existing §7.1
gap (RFC-013 step 4), recorded in handoff §1d/§8.
Turns cardinality_rejected_for_stale_handle_after_concurrent_edge_commit green; validators /
write_cost / writes / consistency / end_to_end / branching all green.

* docs(dev): link handoff docs from index

* docs(engine): tighten 3b claims to match the code (#298 review)

Review caught several comments/docs overclaiming what the code does (the session drop + the
#298 cardinality fix left stale/too-strong wording). No logic change.
- open_write_txn doc: drop the stale "shared per-graph Session" (WriteTxn no longer carries
  one); scope "once" to the table-touch hot path and note edge/load RI validation still
  re-resolves (→ step 4 §7.1) + the session-aware open is step 5.
- edge cardinality call-site comment: it said the scan uses a "pinned txn.base"; it now
  opens LIVE HEAD (#298); corrected.
- write_cost.rs: "opens the base once (with the shared Session)" → session-aware base open
  is deferred to step 5.
- data_open_count completeness (instrumentation.rs + write_cost.rs): forbidden_apis only
  keeps engine code OUTSIDE the storage layer on the chokepoints; table_store.rs is
  allow-listed and holds direct Dataset::opens for branch-management ops (not the keyed-write
  hot path the gate measures). Narrowed the claim accordingly.
- handoff §4: "schema once / open once" is the node hot path (the two gates); edge endpoint +
  loader RI/cardinality still re-validate and read warm. #298 un-regresses cardinality only,
  it does NOT close write-validation freshness (that's step 4 §1d/§7.1).

build clean; write_cost / validators / forbidden_apis green.
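
The @card regression fixed in this PR comes down to which edge set the scan sees. A toy sketch, assuming edges are plain (source, target) pairs and `check_card_max` is a hypothetical stand-in for the engine's cardinality check:

```rust
/// Reject an insert that would push `src` past `max` outgoing edges.
/// `existing` is whatever edge set the scan reads: the pinned base or live HEAD.
pub fn check_card_max(existing: &[(&str, &str)], src: &str, max: usize) -> Result<(), String> {
    let current = existing.iter().filter(|(s, _)| *s == src).count();
    if current + 1 > max {
        Err(format!(
            "@card max {} exceeded for {}: {} existing edge(s)",
            max, src, current
        ))
    } else {
        Ok(())
    }
}
```

Run against a stale base that predates A's WorksAt(Alice->Acme), the check passes for B's second edge; run against live HEAD it rejects it. The check is the same in both cases, only the freshness of the input differs.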
2026-06-27 02:39:38 +02:00 · 2026-06-23 21:27:31 +02:00 · 2026-06-23 21:27:31 +02:00 · 7d3a52d674
commit 7d3a52d674
parent 6d4606a830
15 changed files with 1405 additions and 106 deletions
--- a/docs/dev/handoff-rfc-013-write-path.md
+++ b/docs/dev/handoff-rfc-013-write-path.md
@@ -0,0 +1,430 @@
+# Handoff: finishing RFC-013 (write-path latency + correctness)
+
+**Status:** living handoff. **Source of truth is [`rfc-013-write-path-latency.md`](rfc-013-write-path-latency.md)** —
+this doc is the *current-state map, the decisions/validation from the latest work cycle,
+and the concrete next actions*. When they disagree, the RFC wins (and fix this doc).
+
+**Audience:** the engineer/agent who picks up RFC-013 next.
+
+---
+
+## 0. TL;DR — where we are and what's next
+
+RFC-013 makes the write path fast **and** correct on object storage (217 Lance tables
+under one `__manifest` catalog, on R2/S3). It is sequenced as steps; read §9 of the RFC
+for the canonical list. Current reality:
+
+**Landed on `main`:**
+- **Step 1** — Tier-1 cost gate + the shared `helpers::cost` harness (#288).
+- **Step 3a** — opener bypass: write opens go direct (`Dataset::open` by URI + version)
+  instead of the Lance-namespace builder (#288). **This already banked the dominant
+  depth win** — see §2 below; it reframes everything.
+- **Step 2a** — internal-table compaction: `optimize` now compacts `__manifest` /
+  `_graph_commits` / `_graph_commit_actors` (#291). Plus the RFC latency-model
+  correction (#292).
+- **Optimize-vs-write race** — optimize survives a cross-process write race on the
+  same table (#297, **LANDED** — origin/main `6d4606a8`; see §6 for why it's not
+  redundant with Design A). Step 3b stacks on top of this.
+
+**Open PRs (land these; relationships in §7):**
+- **#296** `correctness-by-design-fix` — recovery roll-forward converges on a concurrent
+  manifest advance (this is the fix for the flaky `iss-schema-apply-reopen-recovery-race`).
+- **#295** `docs/rfc-013-step-3b` — the step-3b RFC doc.
+- **#254** `ragnorc/bug-4-schema-apply-occ` — schema-apply vs optimize false-fail
+  (same op-class family as #297, logical side).
+
+**Step 3b is DONE** (capture-once `WriteTxn`, schema-once + open-collapse; see §4) on
+`rfc-013-step-3b-writetxn-v2`. **Next: Phase 7 (step 4), then the big one — Design A /
+`PublishPlan` unification (step 5)** — see §5, the convergent fix for the bug *class* this
+area keeps generating, which also absorbs 3b's deferred session-aware write opens.
+
+---
+
+## 1. The corrected mental model (read this before touching anything)
+
+Three reframes from the latest cycle that the older RFC prose may not fully reflect:
+
+### 1a. 3a already won the depth fight → the residual is constant-factor + RTT
+Before 3a, the write re-opened each table through the lance-namespace builder ~13×, and
+that path was **O(depth)** (it re-opened `__manifest` + `list_table_versions` per open —
+**not** a Lance back-walk; the root cause was OmniGraph's own namespace round-trips, not
+Lance — validated against Lance source). 3a swapped it for the direct opener, which is
+**O(1)** (`from_uri(loc).with_version(N)` = arithmetic path + one HEAD). So:
+
+- The dominant **O(depth) data-table** term is **gone**.
+- Step 2a flattened the secondary **internal-table** scan term.
+- What remains is the **~110-hop serial backbone × RTT + compute** — a constant in
+  depth. The latency model is **`wall = (serial_hops + ops/effective_concurrency)·RTT
+  + compute`**; on a capped store (R2) the op-count term re-enters wall-clock, on an
+  unlimited store it parallelizes away. Measured: prod one-row write 27→15.76s after
+  2a; the remaining 15.76s is the serial backbone — **step 3b's target**, not step 2's.
+- Step 3b's win is therefore the **call-count/RTT collapse** (redundant opens, the
+  flat-46 schema reads), NOT a depth slope. Don't expect a depth-slope improvement from
+  3b; gate it on the constant-factor (S3 round-trips), not a curve.
+
+### 1b. Two op classes, two commit models (the §6.6 principle)
+Every concurrency bug in this area is **one op class using the other's commit model**:
+
+| class | examples | commutes? | correct commit model |
+|---|---|---|---|
+| **maintenance** | compaction (`Rewrite`), `optimize_indices` | yes (content-preserving) | Lance native rebase + app reopen/replan on real overlap + **monotonic manifest fast-forward** — no epoch, no read-set |
+| **logical mutation** | load / mutate / merge / delete | no (lost-update, write-skew) | strict cross-process OCC: read-set + write-set CAS under the `writer_epoch` fence |
+
+Applying strict OCC + equality-CAS uniformly is the mistake: too strong for maintenance
+(false conflicts — #297's bug), too weak for logical cross-process (§6.5 corruption).
+
+### 1c. The root liability (what keeps generating these bugs)
+Lance gives **per-table atomic commits** but **no cross-table/cross-step atomicity**, so
+every multi-commit op advances per-table Lance HEAD **before** the manifest references it
+(the "A-before-B window"). The resulting `HEAD vs manifest` delta is **ambiguous**
+(external drift? my own in-flight work? a crashed writer?), and **many uncoordinated code
+paths each re-interpret it** (4 writers + the maintenance path + recovery + the write-path
+drift guard). Each interpreter is a fresh chance to misclassify. That is the bug class:
+- §6.5 cross-process logical corruption,
+- #297's own-HEAD-drift misclassification,
+- the flaky write-path "HEAD ahead of manifest, run repair" guard,
+- the recovery classifier edges.
+
+**The convergent fix is Design A (one publish authority — step 5); Lance MTT eventually
+retires the window entirely.** See §5.
+
+### 1d. The second facet: the write base is a stale pin (no probe)
+The READ path resolves its base behind a freshness probe (`resolve_target_inner`
+omnigraph.rs:~1072 → `probe_latest_incarnation` → `refresh_manifest_only`); the WRITE path
+does NOT (`resolved_branch_target` omnigraph.rs:~778 returns the warm `coord.snapshot()` for
+the bound branch, no probe). So a long-lived server's write base lags the live manifest. That
+single staleness feeds **two distinct failure modes**, both surfaced this cycle:
+
+1. **Stale validation *reads* → integrity under-enforced.** Write-path RI checks read
+   committed state off the stale base. 3b's collapse #1 made it worse for edge `@card`:
+   `edge_cardinality_read_handle` (mutation.rs:~614) scans the pinned `txn.base` instead of
+   live HEAD (was live HEAD pre-3b), so a concurrent edge committed after `txn` capture is
+   uncounted → a `@card` max can be exceeded (cursor **High** / codex **P1** on #298,
+   **VALID**). **#298 fix: restore the live-HEAD read for that scan** (un-regress; gate-safe —
+   the `data_open_count` gate is a node insert) + a deterministic regression test (commit A's
+   edge, then B validates → must see A) + correct the wrong "pinned base == live HEAD" doc
+   comment (mutation.rs:~605-613, which assumes a single writer). The *structural* liability
+   underneath: there is **no unified write-validation read-set** — endpoint
+   (`ensure_node_id_exists`, warm `snapshot_for_branch`), cardinality (mutation: pinned
+   `txn.base`; loader: warm `snapshot_for_branch` — the SAME check forks per write path),
+   commit drift guard (live `fresh_snapshot_for_branch`), and uniqueness
+   (`enforce_unique_constraints_intra_batch`, intra-batch only — cross-version uniqueness is a
+   documented gap). Three freshness levels chosen ad hoc, none re-validated at commit → the
+   §7.1 TOCTOU class, and each new constraint forks the pattern again.
+
+2. **Stale OCC *pin* → false-fail on a maintenance advance.** A served strict update/delete
+   pins the stale base version, then false-fails `ExpectedVersionMismatch` after an external
+   `optimize` advanced `__manifest` — even though the advance was content-preserving
+   compaction the logical write should fast-forward past (invariant 7). It's the **write-side
+   mirror of #297/§6.6** (#297 made optimize fast-forward past a logical write; this is a
+   logical write that must fast-forward past optimize). A served read clears it (the read
+   probes the shared coordinator). Validated repro on prod (omnigraph.ragnor.co) +
+   `writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes`
+   (`#[ignore]` on branch `fix/write-path-stale-view-probe`). **The naive "just probe" fix is
+   proven wrong** — a blanket probe silently refreshes past *logical* advances too, breaking
+   `consistency::stale_handle_public_mutation_must_refresh_then_retry` (the deliberate
+   cross-process lost-update OCC primitive). The fix must **discriminate by op class**.
+
+**Both fold into Design A (step 5), same as §1c.** `open_txn`'s one warm probe makes the base
+fresh (absorbs maintenance advances cheaply); the **op-class-aware strict precondition** —
+derive from Lance's per-version transaction metadata (all `Rewrite`/`ReserveFragments` =
+maintenance → fast-forward the pin; any `Append`/`Update`/`Delete`/`Merge` = logical → fail
+loudly; NO parallel marker, invariant 1/15) — is the correctness fence for anything that lands
+after. And the §7.1 read-set-in-CAS unifies the validation read-set + re-validates it under the
+`graph_head` contention. So **the stale-view false-fail, the cardinality/validation-read-set
+liability, and #297's mirror are one bug** (the write base is a stale, un-probed, un-classified
+pin) with **one home: the single PublishPlan delta-interpreter** (§1c + §5). Strong corroboration
+of Design A — three symptoms, one fix.
+
+---
+
+## 2. Validated facts — do NOT re-derive these
+
+Established this cycle against **Lance 7.0.0 source**
+(`~/.cargo/registry/src/index.crates.io-*/lance-7.0.0`) and current engine code. Cited so
+you can trust them without re-investigating.
+
+**Lance (upstream):**
+- `from_uri(loc).with_version(N).load()` and `checkout_version(N)` are **O(1)** (computed
+  V2 path `_versions/{u64::MAX-N:020}.manifest` + one HEAD; no listing/back-walk).
+  (`lance-table/src/io/commit.rs` `default_resolve_version`.)
+- A shared `Arc<Session>` (`DatasetBuilder::with_session`) warms metadata/index caches
+  keyed by `(URI, version, e_tag)`. Caveat: the *first* manifest read on open is uncached
+  — the Session warms the *scan/index* metadata, not the first open. **`WriteParams` *does*
+  carry a `session` field** (`lance/src/dataset/write.rs`), but it only matters on the
+  `WriteDestination::Uri` arm; OmniGraph's staged path always drives off an **already-open
+  `Dataset`**, and Lance takes the store/session from that handle. So to attach the shared
+  Session to a write base, open read-style (`open_table_dataset` → `from_uri().with_version()
+  .with_session()`) and drive the staged write off that handle.
+- A held `Arc<Dataset>` at a pinned version is `Send + Sync`, immutable, safe to reuse for
+  many scans/count/staged-write base in one txn (OmniGraph's `TableHandleCache` already
+  relies on this).
+- **No compaction `RetryExecutor`** (only Delete/MergeInsert/Update have one).
+  `commit_compaction` commits a fixed `Rewrite` via `apply_commit` direct. In
+  `commit_transaction`, a semantic `RetryableCommitConflict` **escapes the retry loop**
+  via `?` at `io/commit.rs:979`; the loop only retries the OCC `CommitConflict`
+  (`:1096`), and even that re-rebases the *same* transaction (never re-plans). ⇒
+  **compaction needs app-level reopen+REPLAN; you cannot "set conflict_retries" and let
+  Lance own it.**
+- `check_rewrite_txn`: a `Rewrite` rebases **cleanly** past a concurrent `Append`/disjoint
+  `Update`/`Delete` (preserving both); only a same-fragment overlap yields a retryable
+  conflict. ⇒ the common concurrent insert/update/delete is rebased for free; the app
+  retry fires only on real overlap.
+
+**Engine (internal):**
+- Read path (post-#268) already has the capture-once machinery: `Snapshot` (`db/manifest.rs`),
+  warm `GraphCoordinator` behind a `latest_version_id`/incarnation probe, a held
+  `TableHandleCache` keyed `(table,branch,version,e_tag)`, **one shared `Session` per
+  graph** (`read_caches.session`). **Writes bypass all of it by construction**
+  (`resolved_branch_target` returns `read_caches: None`; the 3a write opener attaches no
+  session and opens by latest, not pinned version).
+- A single write opens each table **3–4×** (accumulation → staging reopen → commit
+  drift-guard → publish prepare), each a fresh cold open. `validate_schema_contract`
+  (`db/schema_state.rs`, via `ensure_schema_state_valid`) runs uncached (~3 `read_text`
+  + 2 `exists`) at every resolve point (~the flat-46). Both are constant-factor, flat in
+  depth — 3b's targets.
+- Strict-op guards are the lost-update floor (3 layers: pre-stage `ensure_expected_version`
+  `table_store.rs`; commit-time strict drift `exec/staging.rs`; publisher CAS
+  `publisher.rs`). Capture-once **supplies** the pinned operand — never remove a guard.
+- Fork-on-first-write authority reads (`classify_fork_ref` → `fresh_snapshot_for_branch`)
+  must stay **fresh** (not served from a pinned base).
+- Cost harness: `helpers::cost` (`measure`/`measure_with_staged`/`IoCounts`/`assert_flat`/
+  `local_graph`/`s3_graph`). The schema-once assert can reuse `CountingStorageAdapter`
+  (`warm_read_cost.rs::warm_query_validates_schema_contract_once`) with **zero** prod
+  change; an open-count assert wants a small `open_count` AtomicU64 in `QueryIoProbes`
+  (copy the `probe_count`/`record_probe` pattern). The forbidden-API guard
+  (`tests/forbidden_apis.rs`) makes an instrumentation-level counter complete.
+
+---
+
+## 3. The #297 cycle (this branch) — what it is, and the lesson
+
+`fix-optimize-concurrency-race` (5 commits): a CLI `optimize` racing a served write on the
+same table failed (Lance Rewrite lost, or the equality-CAS publish lost). Fix: unify both
+compaction paths on the internal path's **reopen+replan** shape, with a **two-level retry**
+— outer loop reopens+replans on a real Lance overlap; inner Phase-C loop makes the manifest
+publish a **monotonic fast-forward** (advance to compacted version `N`, or no-op when the
+manifest already moved to `≥ N`), never the strict equality CAS. Sidecar written once;
+in-process queue kept as a contention reducer (not the cross-process guard); no `writer_epoch`.
+
+**Two review rounds surfaced two follow-on bugs I introduced with the retry loop** — both
+fixed, both regression-tested (own-HEAD-drift via negative control):
+1. **Own-HEAD-drift misclassification** (`56d004e0`): the drift guard re-ran every
+   iteration and, after a partial Phase-B commit (auto_cleanup strip or compact, then a
+   later op conflicts), saw `HEAD > manifest` from *our own* covered work and deleted the
+   sidecar + returned `skipped_for_drift` (stranding uncovered drift). Fix: track
+   `head_advanced`; the drift guard fires only when `!head_advanced`.
+2. **Publish exhaustion spurious error** (`e9d16a2c`): the publish loop returned `Err` on
+   its final retry even if the conflict meant a concurrent writer already published `≥ N`
+   (postcondition met). Fix: re-check `current >= state.version` on exhaustion.
+
+**The lesson (write it on the wall):** *wrapping a sequence of side-effecting commits in a
+retry silently converts every "checked once, before any side effect" precondition into
+"re-checked after partial side effects."* That's a distinct bug class; it needs
+fault-injection tests **at each commit boundary**, not just end-to-end concurrency tests.
+(The `optimize.before_compact` / `optimize.inject_reindex_conflict` failpoints exist for
+exactly this.)
+
+**Temporary mechanism flag:** `head_advanced` is an in-memory proxy for "is this HEAD
+movement mine." Under Design A the authority answers that from the plan/sidecar **identity**
+— so `head_advanced` is the part that gets *replaced*, while the monotonic-publish +
+reopen/replan **semantics** are permanent. (Noted in RFC §6.6.)
+
+---
+
+## 4. DONE: Step 3b — capture-once `WriteTxn` (shipped on `rfc-013-step-3b-writetxn-v2`)
+
+**Delivered:** on the **table-touch hot path**, a single `mutate`/`load` validates the schema
+contract **once** and opens each touched data table **at most once** — a constant-factor/RTT
+win (not a depth-slope win; §1a). Two cost gates in `write_cost.rs` lock it (both on a node
+insert): `write_validates_schema_contract_once` (3 `read_text` / 2 `exists`, was 12/9) and
+`keyed_insert_opens_table_at_most_once` (`data_open_count <= 1`, was 4). The carrier is the
+minimal `WriteTxn { branch, base }`, threaded as `Option<&WriteTxn>` (`Some` on the hot
+mutate/load path, `None` byte-identical everywhere else); it **converges into** step 5's
+`PublishPlan`.
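
A minimal sketch of the `Option<&WriteTxn>` threading, assuming a toy `Graph` that counts schema validations. Only `WriteTxn { branch, base }`, `Snapshot`, and `open_write_txn` are names from this document; `Graph`, `validate_and_resolve`, and `base_for` are hypothetical stand-ins for the real resolve points.

```rust
use std::cell::Cell;

#[derive(Clone, Debug, PartialEq)]
struct Snapshot {
    manifest_version: u64,
}

/// The minimal carrier: the branch plus the base snapshot pinned at entry.
struct WriteTxn {
    branch: String,
    base: Snapshot,
}

/// Toy graph handle; the counter stands in for the schema-contract reads.
struct Graph {
    head_version: u64,
    schema_validations: Cell<u32>,
}

impl Graph {
    /// Hypothetical fresh resolve: validates the schema, then reads the manifest.
    fn validate_and_resolve(&self, _branch: &str) -> Snapshot {
        self.schema_validations.set(self.schema_validations.get() + 1);
        Snapshot { manifest_version: self.head_version }
    }

    /// Capture once at entry: the single validation of the write.
    fn open_write_txn(&self, branch: &str) -> WriteTxn {
        WriteTxn {
            branch: branch.to_string(),
            base: self.validate_and_resolve(branch),
        }
    }

    /// A resolve point. `Some(txn)` on the matching branch reuses the pinned
    /// base; `None`, or a txn pinned to another branch, falls back to a fresh resolve.
    fn base_for(&self, branch: &str, txn: Option<&WriteTxn>) -> Snapshot {
        match txn {
            Some(t) if t.branch == branch => t.base.clone(),
            _ => self.validate_and_resolve(branch),
        }
    }
}
```

In this sketch, any number of `base_for("main", Some(&txn))` calls after `open_write_txn("main")` leave the validation counter at 1, while a `None` call or a branch mismatch adds one validation each.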
+
+**Not "once" everywhere (scope, not regression):** edge endpoint / cardinality RI validation
+(`ensure_node_id_exists`, the loader's RI + cardinality) still resolves through
+`snapshot_for_branch` and re-validates the schema — and reads **warm**, not live. Threading
+`txn.base` there to make it "once" would re-introduce the stale-read class the #298 cardinality
+fix removed (it now reads live HEAD). Doing schema-once *and* fresh reads for those validations
+needs the unified, re-checked read-set — **step 4 §7.1** (§1d). So #298 **un-regresses
+cardinality only; it does not close write-validation freshness.** No edge-insert/load schema-once
+gate yet (only the node gates above).
+
+Commits (off merged-#297 main):
+- **Stage 0** — scope `open_count` → `data_open_count`/`internal_open_count` by URI class
+  (the review fix: `open_dataset_tracked` also opens `__manifest`/`_graph_commits`, so the
+  raw counter conflated them and the gate was unreachable). Re-baselined RED 4.
+- **Commit A (schema-once)** — capture `txn` once at entry (the single validation); the 4
+  validation sites collapse: S1 (entry `ensure_schema_state_valid`) removed; S3a
+  (`open_for_mutation_on_branch`) + S3b (`prepare_updates_for_commit`) source `txn.base`;
+  S4 (`commit_all`) uses new `fresh_snapshot_for_branch_unchecked` (the OCC manifest re-read
+  minus the schema re-validation). `fresh_snapshot_for_branch{,_unchecked}` now read the
+  manifest directly via `ManifestCoordinator` (drops a spurious commit-graph `exists` probe;
+  same `Snapshot`).
+- **Commit B (open collapse 4→1)** — #1 accumulation open ELIMINATED (the node path discarded
+  the handle; read `txn.base.entry().table_version`); #2 staging open KEPT (the one open);
+  #3 commit drift-guard reads live HEAD via `entry.dataset.dataset().latest_version_id()` (a
+  cheap manifest-pointer probe off the staged handle, not a fresh open); #4 index build reuses
+  the `commit_staged` handle threaded through `CommittedMutation`/`prepare_updates_for_commit`.
+- **Commit B.1 + cleanup** — named the two positional returns (`OpenedForMutation`,
+  `CommittedMutation`) + a `debug_assert` pinning the open-skip contract; **removed the
+  unearned `WriteTxn.session` field** (the collapse uses skip/probe/reuse, not a session).
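
The Stage 0 split can be illustrated with a small classifier over the open's last path segment. This is a sketch, not the probe's code: the internal table names come from this PR's description, while `is_internal`, the `.lance` suffix handling, and the example URIs are assumptions.

```rust
/// Counters split by table class, mirroring `IoCounts` in shape only.
#[derive(Default, Debug)]
struct IoCounts {
    data_open_count: u64,
    internal_open_count: u64,
}

/// Internal/system tables named in this PR.
const INTERNAL_TABLES: [&str; 3] = ["__manifest", "_graph_commits", "_graph_commit_actors"];

/// Classify an open by the last path segment of its URI.
/// Assumption: a table directory may carry a `.lance` suffix, which is stripped first.
fn is_internal(uri: &str) -> bool {
    let last = uri.trim_end_matches('/').rsplit('/').next().unwrap_or("");
    let name = last.strip_suffix(".lance").unwrap_or(last);
    INTERNAL_TABLES.contains(&name)
}

impl IoCounts {
    /// Feed one open into the matching counter.
    fn record_open(&mut self, uri: &str) {
        if is_internal(uri) {
            self.internal_open_count += 1;
        } else {
            self.data_open_count += 1;
        }
    }
}
```

With the counters split this way, a gate can assert `data_open_count <= 1` without the manifest publish and commit-graph appends making it unreachable.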
+
+**RFC §4.1 corrections — how they resolved:**
+1. *Thread the evolving handle, not a version-keyed cache* → realized as collapse #4 (carry
+   the `commit_staged` handle forward into the index build).
+2. *Don't forbid re-resolution* → honored: the commit-time OCC re-read
+   (`fresh_snapshot_for_branch_unchecked` — fresh manifest, only schema-revalidation dropped)
+   and the fork-authority reads stay fresh.
+3. *Minimal carrier* → `WriteTxn { branch, base }` (even the `session` from the original
+   sketch was dropped as unearned).
+
+**Deferred to step 5 (NOT in this PR):** session-aware write base opens. The one remaining
+open (#2) stays a HEAD open; warming the shared `Session` across writes is an object-store
+(S3) phenomenon invisible on local FS, so it earns its own `write_cost_s3.rs` gate in step 5,
+where `txn` becomes the non-optional publish carrier. No new concurrency test was needed here:
+#2 stays a HEAD open (no pinned+session base introduced), so the publisher CAS + #3 live-HEAD
+probe fences are unchanged (covered by the green `writes.rs`/`consistency.rs`).
+
+**Guardrails (don't regress):** schema validation is deliberately uncached for drift
+detection — collapse to 1 *per write*, never cache across writes on a long-lived handle
+(`lifecycle::long_lived_handle_rejects_schema_*`). The commit-time fresh read is OCC
+machinery, not redundancy. Keep all 3 strict-op guards. Keep fork-authority reads fresh.
+Pin the *correct* branch (server-bound-to-main writing a feature branch falls to a fresh
+open). A branch `rfc-013-step-3b-writetxn` exists off an earlier main; rebase onto the
+post-#297 main before starting.
+
+---
+
+## 5. Design A — the `PublishPlan` unification (step 5) = the convergent fix
+
+**This is the real fix for the bug class in §1c.** Collapse the four hand-rolled writers +
+the maintenance path into **one `publish(txn, plan)` authority** where the CAS + bounded
+retry is **unconditional and unbypassable** (no caller can "hold the queue → skip the CAS").
+Properties:
+- **One interpreter of the `HEAD vs manifest` delta** — and "is this my work?" is answered
+  by the plan/sidecar **identity**, not a re-derived comparison. The own-HEAD-drift bug, the
+  §6.5 writers, the write-path guard — all close *by construction*.
+- **Recovery = the same `PublishPlan` re-applied** — the crash-recovery interpreter and the
+  live interpreter become the same code (`iss-merge-recovery-partial-rollforward` gone).
+- Each `TableAction` commits by its **class** (§1b): `Rewrite` = maintenance (Lance rebase
+  + reopen/replan + monotonic fast-forward, **no epoch**); load/mutate = logical (strict OCC
+  + `writer_epoch`).
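
One way to picture the per-class commit rule is a dispatch from action to precondition. This is a hedged sketch of the idea only: `TableAction::Rewrite` is the document's name, while `Load`, `Mutate`, `CommitPolicy`, `Step`, `policy_for`, and `precondition` are hypothetical, and the real authority would also carry the plan/sidecar identity.

```rust
#[derive(Debug, PartialEq)]
enum TableAction {
    Rewrite, // maintenance class
    Load,    // logical class
    Mutate,  // logical class
}

#[derive(Debug, PartialEq)]
enum CommitPolicy {
    /// Reopen/replan plus monotonic fast-forward; no epoch.
    MaintenanceFastForward,
    /// Strict OCC fenced by a writer epoch.
    LogicalStrictOcc { writer_epoch: u64 },
}

#[derive(Debug, PartialEq)]
enum Step {
    Publish,
    NoOp,
}

/// Each action commits by its class, never by who happens to call it.
fn policy_for(action: &TableAction, writer_epoch: u64) -> CommitPolicy {
    match action {
        TableAction::Rewrite => CommitPolicy::MaintenanceFastForward,
        TableAction::Load | TableAction::Mutate => CommitPolicy::LogicalStrictOcc { writer_epoch },
    }
}

/// Maintenance fast-forwards (or no-ops when already at or past the target);
/// logical writes fail loudly when the manifest moved under them.
fn precondition(policy: &CommitPolicy, expected: u64, target: u64, current: u64) -> Result<Step, String> {
    match policy {
        CommitPolicy::MaintenanceFastForward => {
            if current >= target { Ok(Step::NoOp) } else { Ok(Step::Publish) }
        }
        CommitPolicy::LogicalStrictOcc { .. } => {
            if current == expected {
                Ok(Step::Publish)
            } else {
                Err(format!("stale view: expected manifest version {expected} but current is {current}"))
            }
        }
    }
}
```

The point of the single dispatch is that no caller picks its own precondition, so a maintenance op can never inherit the strict equality check and a logical write can never skip it.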
+
+**Why it composes with Lance MTT (don't over-build):**
+- The **unification itself is convergent** — when MTT lands, it slots *underneath* the same
+  authority; nothing wasted. Build this.
+- The **`writer_epoch`** is the one MTT-redundant piece (MTT's commit-handler lease subsumes
+  a cross-process fence). Build it *last and minimally*, gated on actually deploying
+  multi-writer topologies. Per the deny-list, don't reimplement what the substrate will own.
+
+**Sequencing judgment (this cycle's strongest signal):** the bug density here (this PR alone
+= 3 review rounds, all "a writer re-interprets the delta") means the current N-writers interim
+is high integrated-over-time liability. **Consider pulling the *convergent half* of step 5
+(the single authority + recovery-as-plan) forward — possibly ahead of 3b** — because it stops
+the bug class rather than patching instances. #297 + #254 are the *de-risking inputs*: they
+validate the maintenance-class and logical-class commit models in isolation first, so Design
+A implements a known spec rather than designing under refactor pressure. Do NOT build more
+substrate-shaped scaffolding (custom WAL / job queue / second coordination table) to paper
+over the window — strictly higher liability than either Design A or waiting for MTT.
+
+**Deeper-than-A (post-MTT or as Lance exposes uncommitted variants):** all-uncommitted-fragments
+plus one manifest commit would shrink the A-before-B window itself, blocked today by Lance not
+exposing uncommitted variants for `compact_files` / `optimize_indices` / vector index (#6666
+open; delete #6658 shipped). Track, don't build yet.
+
+---
+
+## 6. Why #297 is still needed even if you do Design A
+- Design A **relocates** #297's maintenance-class commit logic into the authority's
+  `TableAction::Rewrite` path; it does not eliminate it. #297 is the *validated spec + tests*.
+- The two regression tests + §6.6 are the **contract** Design A must keep green.
+- The prod bug is **live**; Design A is the largest write-path change in the RFC. Don't hold a
+  correctness fix hostage to a big refactor, and don't do a big refactor under bug-fix urgency.
+- Genuinely throwaway under Design A: only the loop's *location* + the `head_advanced` proxy
+  (~a dozen lines). Everything else relocates or persists. **#297 LANDED.**
+
+---
+
+## 7. Open PRs and their relationships
+- **#297** — maintenance-class fix (optimize vs write). **LANDED** (origin/main `6d4606a8`);
+  step 3b stacks on it.
+- **#254** — logical-class fix (schema-apply vs optimize false-fail). Same op-class family;
+  both are de-risking inputs for Design A's per-class commit models.
+- **#296** — recovery roll-forward converges on concurrent manifest advance. This is the fix
+  for the flaky `iss-schema-apply-reopen-recovery-race` (the handoff in
+  `handoff-schema-apply-recovery-flake.md`). It touches `recovery.rs` and is *aligned* with
+  #297's "postcondition is the state, not winning the CAS" principle — reconcile the monotonic
+  publish with #296's converge helper if #296 lands first.
+- **#295** — the step-3b RFC doc (apply §4's three corrections to it).
+
+---
+
+## 8. Remaining RFC steps after 3b (RFC §9 is canonical)
+- **#298 follow-up (do on the 3b PR, before merge): the edge-`@card` stale-read regression**
+  (§1d.1). Restore the live-HEAD cardinality scan, add the deterministic regression test, fix
+  the wrong doc comment. Small, gate-safe, un-regresses an integrity check (invariant 9). The
+  residual concurrent TOCTOU is the §7.1 gap (step 4) — un-widen here, don't over-reach.
+- **Step 4 / Phase 7** (`iss-991`): lineage into `__manifest` (publish `graph_commit` +
+  mutable `graph_head:<branch>` in the same merge-insert; `_graph_commits` becomes a
+  projection). Removes the per-write `commit_graph.refresh`; closes the manifest→commit-graph
+  atomicity + commit-graph-parent-under-concurrency gaps. **Hard prereq: step 2 (done).**
+  Carries the §7.1 *concurrent* write-skew fix (needs the `graph_head` contention row) —
+  **frame §7.1 as "unify the entire write-validation read-set" (endpoint + cardinality +
+  cross-version uniqueness), not merely "add `graph_head`"** (§1d.1): the bespoke
+  `edge_cardinality_read_handle` and the mutation-vs-loader freshness fork dissolve into one
+  pinned read-set re-validated under the `graph_head` contention, or the liability survives as
+  a second special-case.
+- **Step 5 / Design A** — §5 above. **Acceptance item: the served-strict-write stale-view
+  false-fail** (§1d.2) — the op-class-aware precondition + `open_txn` probe. The contract is
+  two tests passing *together*: un-ignore
+  `writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes` (goes green)
+  *while* `consistency::stale_handle_public_mutation_must_refresh_then_retry` stays green
+  (maintenance fast-forwards; logical fails loudly). Self-contained enough to ship standalone
+  like #297 if prod pain is acute; otherwise fold into the single PublishPlan delta-interpreter.
+- **Step 2b** — internal-table cleanup + the Q8 monotonic watermark (a Lance boundary tag).
+  Deferred: only the secondary version-count/space term, touches the read/open path, and is
+  MTT-redundant. Land when version-count cost bites.
+- **§7.1 sequential write-skew** (`iss-overwrite-orphans-committed-edges`) — inbound-RI
+  validation on node removal; independent, ships anytime.
+- **#20** — the prod per-write `storage.ops` span metric (RFC §5.3), still owed.
+- Branch ops: Lance `Clone` for create (`iss-691`).
+
+---
+
+## 9. Gotchas / traps (learned the hard way)
+- **In-process queue ≠ cross-process lock.** Any "I hold the queue → skip the retry/CAS"
+  reasoning is a bug across processes. This is the recurring trap.
+- **Monotonic publish must be `≥`-conditional, never "no assertion."** The `__manifest`
+  merge-insert is unconditional `UpdateAll` keyed on `object_id` (`publisher.rs:379`), so
+  the equality (or monotonic) pre-check is the *only* guard — dropping it lets `UpdateAll`
+  regress a newer version = lost write.
+- **The drift guard interprets an ambiguous delta.** Re-evaluating it in a retry over
+  self-mutated state is how #297's follow-on bug happened. Gate any HEAD-vs-manifest
+  interpretation on "have *we* committed yet."
+- **`compact_files` fires Lance's auto_cleanup GC hook** (commits with
+  `skip_auto_cleanup=false`, no override) — optimize strips stale `lance.auto_cleanup.*`
+  config before compacting to stay non-destructive on upgraded graphs. The strip is a
+  separate commit (relevant to the partial-commit retry trap).
+- **Lance rebases the common concurrent case for free** — so the data-table conflict usually
+  surfaces as the manifest fast-forward, not a Lance error. The Lance-Rewrite-overlap path is
+  rare and needs failpoint injection to test.
+
+---
+
+## 10. Verification (the gate)
+- `cargo test --workspace --locked` — the canonical gate (matches CI).
+- `cargo test -p omnigraph-engine --features failpoints --test failpoints optimize` —
+  the optimize concurrency/recovery tests.
+- `cargo test -p omnigraph-engine --test write_cost` / `write_cost_s3` (bucket-gated) —
+  cost gates (3b adds the schema-once + open-count asserts here).
+- `cargo test -p omnigraph-engine --test maintenance` — optimize/repair/cleanup.
+- Re-read [`invariants.md`](invariants.md), [`lance.md`](lance.md), [`testing.md`](testing.md)
+  before each change (always-on requirement).
+
+Lance source for re-validation:
+`/Users/ragnor/.cargo/registry/src/index.crates.io-*/lance-7.0.0` (key files: `io/commit.rs`,
+`io/commit/conflict_resolver.rs`, `dataset/optimize.rs`, `dataset/write/retry.rs`,
+`dataset/builder.rs`).
--- a/docs/dev/handoff-schema-apply-recovery-flake.md
+++ b/docs/dev/handoff-schema-apply-recovery-flake.md
@@ -0,0 +1,216 @@
+# Handoff: flaky schema-apply → reopen recovery race
+
+**Type:** bug investigation handoff (not yet fixed)
+**Status:** root-caused to a layer + hypothesis; exact mechanism and fix NOT yet validated
+**Severity:** medium — flaky CI; a real (rare) schema-apply-then-reopen failure under load
+**Scope:** pre-existing on `main`; **independent of** RFC-013 step 2 (internal-table
+compaction, PR #291) and step 3a (#288) — those paths never touch schema apply or
+the recovery sweep, and the full `--workspace` gate passes clean on a re-run.
+
+> Do **not** "fix" this by changing the test to use a single handle. That was
+> empirically shown to *reduce but not eliminate* the flake (see Experiments), so it
+> would mask a real product race. This is a correct-by-design fix in the engine, not
+> a test edit.
+
+---
+
+## 1. Symptom
+
+The test
+`crates/omnigraph-server/tests/schema_routes.rs::schema_apply_route_hard_drops_property_with_allow_data_loss`
+intermittently fails. The HTTP schema apply **succeeds** (`applied == true`); the
+*subsequent* `Omnigraph::open(graph)` (which the test does to verify the catalog)
+panics on `.unwrap()` with:
+
+```
+OmniError::Manifest(Conflict,
+  "stale view of node:Person: expected manifest version 5 but current is 7",
+  ExpectedVersionMismatch { expected: 5, actual: 7 })
+```
+
+The values (5, 7) vary; the shape is always "recovery roll-forward expected version
+N, manifest is at M > N." It is raised from the **open-time recovery sweep**, i.e.
+inside `Omnigraph::open`, not from the apply itself.
+
+---
+
+## 2. Reproduction
+
+- **Needs sibling-test parallelism (CPU contention).** Running the target test
+  *alone* is rock-solid (0/20 failures). The flake only appears when other tests in
+  the same binary run concurrently and perturb the timing inside the apply→reopen
+  sequence.
+- Fast repro loop (≈13–40% per run):
+  ```bash
+  cargo test -p omnigraph-server --test schema_routes --no-run
+  for i in $(seq 1 15); do
+    cargo test -p omnigraph-server --test schema_routes 2>&1 \
+      | grep -q "schema_apply_route_hard_drops_property_with_allow_data_loss ... FAILED" \
+      && echo "iter $i FAIL"
+  done
+  ```
+- It originally surfaced in a full `cargo test --workspace` run (max parallelism).
+- Each test uses its own `tempfile::tempdir()`, so this is **not** cross-test shared
+  state — it's a timing race inside one test's own graph.
+
+---
+
+## 3. Experiments run (the discriminating evidence)
+
+Each variant was stress-run under the full `schema_routes` suite (parallel siblings):
+
+| Variant | Flake rate |
+|---|---|
+| Target test in isolation (no sibling parallelism) | **0/20** |
+| **Control** — as written (server handle + out-of-band `Omnigraph::open` load + reopen) | 6/15 ≈ 40% |
+| Drop the live server handle (`drop(app)`) before the reopen | 4/15 ≈ 27% |
+| Remove the out-of-band separate-handle load | 2/15 ≈ 13% |
+| Remove the load **and** drop the server handle (≈ single-handle) | 8/20 ≈ 40% |
+
+**Interpretation:**
+- It is **concurrency-triggered**, not a topology bug: 0% isolated, flaky under
+  parallel load.
+- **No single factor eliminates it.** Removing the out-of-band load cuts the rate to about
+  a third (40% → 13%; the load amplifies the race) but leaves a ~13% base. Dropping the
+  live server handle does not clearly help (27% vs. 40% at n=15). So the "single-handle
+  test" patch is a **band-aid**, not the fix.
+- The residual base rate with the out-of-band load removed means there is a real
+  race in the **schema-apply → reopen → recovery** path itself.
+
+Caveat on the experiments: `drop(app)` may not synchronously tear down the server's
+engine handle (it can be held by an `Arc`/spawned task), so the "single-handle"
+rows are not airtight. This is one of the things to validate (§6).
+
+---
+
+## 4. Root-cause hypothesis (NOT yet proven)
+
+The failing path is the **open-time recovery sweep's roll-forward** raising
+`ExpectedVersionMismatch` from the publisher's `check_expected_table_versions`.
+
+The hard-drop schema apply (`allow_data_loss=true` → `DropMode::Hard`) is a
+**multi-step migration**: it performs several Lance commits + `__manifest` publishes,
+advancing `node:Person`'s manifest version across multiple versions (e.g. 5 → … → 7).
+To be crash-safe across the Lance-HEAD-before-manifest-publish gap, schema apply
+writes a **recovery sidecar** (`__recovery/{ulid}.json`) pinning per-table
+`expected_version` / `post_commit_pin` before its Phase B.
+
+Hypothesis: under CPU contention, the timing of (a) the migration's multi-version
+advancement, (b) the sidecar's Phase-D deletion, and (c) a later/overlapping
+`Omnigraph::open` recovery sweep interleaves such that the recovery roll-forward
+reads a sidecar whose pinned `expected` is **stale relative to a manifest that
+legitimately advanced several versions**, and **re-publishes at the stale `expected`
+instead of recognizing the migration already completed** → `expected 5, actual 7`.
+
+In other words: the recovery classifier / roll-forward likely does not correctly
+handle a table whose manifest is **already past `post_commit_pin`** by more than one
+step (multi-step migration), or a sidecar whose operation has already fully
+committed. The single-step assumption baked into the Optimize-style pin
+(`post_commit_pin = expected_version + 1`) may not generalize to multi-commit schema
+migrations.
+
+---
+
+## 5. Likely solution (correct-by-design, surgical)
+
+Make the **open-time recovery classifier idempotent against a manifest that advanced
+past the sidecar's pin**:
+
+- If the table's current manifest/Lance version is already `>= post_commit_pin`
+  (operation completed, possibly across multiple versions), classify it as
+  *already-rolled-forward / completed* (the `RolledPastExpected` family) and **delete
+  the sidecar without republishing** — never attempt a publish at the stale
+  `expected`.
+- Ensure the schema-apply sidecar records a pin that the classifier can interpret for
+  a **multi-step** migration (a range / "completed at or beyond" semantics), not a
+  strict single-step `expected + 1`.
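
The proposed rule can be sketched as a single comparison. This is an illustration of the hypothesis in this section, not the code in `recovery.rs`: `SidecarPin`, `RecoveryAction`, and `classify` are hypothetical names, and the `ExistingPath` arm lumps together every case the current classifier already handles.

```rust
/// Per-table pin as recorded in the sidecar (field names follow the prose above).
struct SidecarPin {
    expected_version: u64,
    post_commit_pin: u64,
}

#[derive(Debug, PartialEq)]
enum RecoveryAction {
    /// Operation already completed (possibly across several versions):
    /// delete the sidecar, never republish at the stale `expected`.
    DeleteSidecarOnly,
    /// Manifest has not reached the pin: defer to the existing classification.
    ExistingPath { expected: u64 },
}

/// Idempotent against a manifest that advanced at or past the pin.
fn classify(pin: &SidecarPin, manifest_version: u64) -> RecoveryAction {
    if manifest_version >= pin.post_commit_pin {
        RecoveryAction::DeleteSidecarOnly
    } else {
        RecoveryAction::ExistingPath { expected: pin.expected_version }
    }
}
```

With a pin of `expected_version = 5`, `post_commit_pin = 6`, a manifest at 7 (the failing shape from §1) classifies as `DeleteSidecarOnly` instead of attempting a publish that raises `ExpectedVersionMismatch`.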
+
+This also hardens *real* crash recovery for multi-step schema apply (not just the
+test), and is small + local to `recovery.rs` (+ possibly the schema-apply sidecar
+write in `schema_apply.rs`). It does **not** rearchitect recovery.
+
+Per repo rule 12 (test-first for bug fixes): land a **deterministic** repro first —
+ideally a failpoint that forces the interleaving (pause after the migration's commits
+but before sidecar delete, then run an open) so the red→green is reliable, not a
+stress-loop probability. See the `failpoints.rs` pattern + the schema-apply failpoints
+already in the tree.
+
+---
+
+## 6. What MUST be validated before fixing
+
+1. **Which sidecar is being rolled forward?** Confirm it is the *schema-apply*
+   sidecar (vs the out-of-band `load`'s sidecar, vs another writer). Instrument /
+   log the sidecar `operation_id`, `kind`, and `SidecarTablePin` at the point the
+   recovery sweep raises the error.
+2. **The exact classifier path.** Trace which `TableClassification` arm the failing
+   table hits (`recovery.rs::classify_table`, ~L600) and which roll-forward call
+   raises `ExpectedVersionMismatch` (`heal_pending_sidecars_roll_forward` ~L761,
+   `roll_forward_all` ~L1215, `restore`+publish ~L1275). Confirm it is the
+   multi-step-advanced / already-completed case being mishandled.
+3. **Is `post_commit_pin = expected + 1` the bug?** Verify the hard-drop migration
+   advances `node:Person` by **>1** version, and that the sidecar pins a single-step
+   `+1`, so the classifier can't recognize completion at +2.
+4. **Engine-level reproduction (no server).** Build a deterministic engine-level
+   repro: persistent handle applies a multi-step hard-drop, then a fresh
+   `Omnigraph::open` — ideally with a failpoint forcing the interleave — to confirm
+   the bug is in the engine recovery path and not server-specific (runtime, handle
+   lifecycle). The current evidence is server-test-only.
+5. **Is the out-of-band load *necessary or only amplifying*?** Confirm the ~13% base
+   rate (load removed) is the same root cause, not a second distinct race. If the
+   load is required, the bug is specifically about a second writer's version
+   advancement; if not, it's purely intra-apply.
+6. **`drop(app)` cleanliness.** Verify whether the server's engine handle is truly
+   gone after `drop(app)` (it may be `Arc`-held). If not, the "single-handle"
+   experiments don't isolate the live-handle factor and should be redone with a
+   genuinely single-handle setup.
+
+---
+
+## 7. Relationship to Lance MTT
+
+This bug lives in the **recovery-sidecar roll-forward**, which exists only to bridge
+the Lance-HEAD-before-manifest-publish gap in omnigraph's faked multi-table
+atomicity. `invariants.md` already calls recovery sidecars "scaffolding to remove
+once the substrate closes the gap." Lance **MTT** (native atomic multi-table commits,
+RFC §8 / lance#7264) closes that gap → retires the sidecar → **eliminates this bug
+class.**
+
+Implications:
+- **Don't wait for MTT** — it is the "strategic exit, not a current dependency,"
+  uncertain and far off, and this bug is live now.
+- **Don't over-invest** — keep the fix surgical (classifier idempotency), because the
+  whole sidecar layer is MTT-disposable. A surgical fix retires cleanly with the
+  layer; a recovery rearchitecture would be throwaway.
+
+---
+
+## 8. Key pointers
+
+- Failing test: `crates/omnigraph-server/tests/schema_routes.rs`
+  → `schema_apply_route_hard_drops_property_with_allow_data_loss` (~L777,
+  `#[tokio::test(flavor = "multi_thread")]`).
+- Error type: `OmniError::Manifest` / `ManifestConflictDetails::ExpectedVersionMismatch`
+  (`crates/omnigraph/src/error.rs`); raised by `check_expected_table_versions`
+  (`crates/omnigraph/src/db/manifest/publisher.rs`, ~L356).
+- Recovery sweep + classifier: `crates/omnigraph/src/db/manifest/recovery.rs`
+  — `TableClassification` (~L335), `classify_table` (~L600), roll-forward
+  (`heal_pending_sidecars_roll_forward` ~L761, `roll_forward_all` ~L1215, restore +
+  publish ~L1275).
+- Schema-apply sidecar write: `crates/omnigraph/src/db/omnigraph/schema_apply.rs`
+  (the `SidecarKind` schema-apply pins; `db.coordinator.write().refresh()` ~L692).
+- Open entry point that runs the sweep: `Omnigraph::open` (read-write mode) →
+  `db/manifest/recovery.rs` sweep.
+- Repro: §2 above. Stress under `schema_routes` suite parallelism; 0% isolated.
+
+---
+
+## 9. Suggested next steps
+
+1. Add tracing at the recovery roll-forward error site (sidecar kind/id, pins,
+   observed vs expected) and capture a failing run (§6.1, §6.2).
+2. Reproduce deterministically at the engine level with a failpoint (§6.4) — this is
+   the red test (rule 12).
+3. Implement the classifier-idempotency fix (§5) in a separate commit; confirm
+   red→green and that the stress loop goes to 0 failures over ≥50 iterations.
+4. Keep it a standalone PR (not bundled with RFC-013 follow-ons).
--- a/docs/dev/index.md
+++ b/docs/dev/index.md
@@ -93,6 +93,8 @@ Working documents for in-flight feature work. Removed when the work lands.
 | CLI refactoring — one addressing & config model post-`omnigraph.yaml`: scope + `--graph` + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) | [rfc-011-cli-refactoring.md](rfc-011-cli-refactoring.md) |
 | Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
 | Write-path latency — capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) |
+| RFC-013 handoff — current-state map, latest validation, and concrete next actions for finishing write-path latency and correctness work | [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) |
+| Schema-apply recovery flake handoff — investigation notes and validation plan for the intermittent schema-apply reopen race | [handoff-schema-apply-recovery-flake.md](handoff-schema-apply-recovery-flake.md) |

 ## Boundary

--- a/docs/dev/rfc-013-write-path-latency.md
+++ b/docs/dev/rfc-013-write-path-latency.md
@@ -523,7 +523,10 @@ struct WriteTxn {
    branch: BranchRef,
    base: PinnedSnapshot,   // {manifest_version, per-table (loc,version,e_tag), schema_hash, writer_epoch}
    session: Arc<Session>,  // shared per-graph; warms metadata/index caches across opens
-    handles: HandleCache,   // open-by-version; each table opened once, reused across stages
+    handles: HandleMap,     // open the base once WITH session; thread the handle each
+                            // commit RETURNS forward (HEAD walks N→N+1→N+2). NOT a
+                            // version-keyed cache — HEAD moves, so a (table,version) key
+                            // misses; reuse = forward the commit-return handle. [3b-validated]
 }

 // A typed, declarative publish plan — the COMPLETE "what", built before any HEAD moves.
@@ -546,8 +549,17 @@ impl GraphPublishAuthority {

 Properties that make it optimal:

-- **Stages take `&WriteTxn`/`&PublishPlan`, never storage** — re-resolution and
-  open-latest are *unrepresentable*. Invariants 2/3/15 hold by construction.
+- **Stages take `&WriteTxn`/`&PublishPlan` for the BASE** — re-resolving the pinned
+  read base / open-latest for the pre-commit phase is unrepresentable; invariants 2/3/15
+  hold for the base by construction. **Caveat [3b-validated]:** this is NOT "no
+  re-resolution anywhere." Three commit-boundary reads are irreducible correctness
+  machinery and MUST stay fresh: the commit-time `fresh_snapshot_for_branch` (cross-process
+  OCC), the live-HEAD drift probe (a concurrent writer may have moved HEAD since staging),
+  and the fork-authority reads (`classify_fork_ref` deliberately bypasses the cached base —
+  a pinned base there re-opens the "force-delete a live fork" bug). Model "pinned base for
+  the pre-commit phase + named fresh re-reads at the commit/fork boundary." The achievable
+  open count is **1 base open (with session) + 1 cheap `latest_version_id` probe + threaded
+  commit handles**, not literally one open.
 - **The recovery sidecar *is* the serialized `PublishPlan`.** Phase C and
  recovery both call `plan.apply()` — a merge that bumps tables A+B can never
  roll A forward and silently drop B. The