omnigraph/docs/dev/handoff-rfc-013-write-path.md
Ragnor Comerford 7d3a52d674
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
CI / Container Entrypoint (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled
Release Edge / Smoke Windows installer (push) Has been cancelled
feat(engine): WriteTxn - validate schema + open each data table once per write (#298)
* docs(rfc-013): step-3b handoff + §4.1 corrections (validated)

Add the RFC-013 write-path handoff doc, and correct §4.1's WriteTxn sketch from the
4-subagent validation against current code:
- HandleCache → handle-threading (forward the commit-return handle; a version-keyed
  cache misses because HEAD walks N→N+1→N+2 across staging + index-build commits).
- "re-resolution unrepresentable" softened to "pinned base for the pre-commit phase +
  named fresh re-reads at the commit/fork boundary" — three reads (commit-time OCC, the
  live-HEAD drift probe, fork authority) are irreducible correctness machinery.
- WriteParams DOES carry a session field; the real constraint is "stage off an open
  Dataset," so attach the Session by opening read-style then staging off it.

* test(engine): RED step-3b capture-once fitness asserts + open_count probe

Two write-path cost gates, RED today, GREEN after the WriteTxn lands:
- write_validates_schema_contract_once: a write must validate the schema contract
  once (3 read_text + 2 exists). Today re-validates at every resolve point —
  measured 12 read_text / 9 exists (~4 validations) via CountingStorageAdapter
  (zero production change; the write twin of the read-path schema-once test).
- keyed_insert_opens_table_at_most_once: a keyed single-table write must open its
  table <=1x. Today measured 10 opens.

Adds an exact open-CALL probe: open_count + record_open() on QueryIoProbes (mirroring
probe_count/record_probe), called at both open chokepoints; surfaced as
IoCounts.open_count. forbidden_apis guarantees every write open routes through them.

* feat(engine): WriteTxn carrier + open_write_txn (3b scaffolding)

The capture-once write transaction (RFC-013 step 3b): WriteTxn{branch, base:
Snapshot, session} + Omnigraph::open_write_txn, which validates the schema contract
once and pins the base snapshot + the shared per-graph Session.

Landed as reviewed scaffolding (gated #[allow(dead_code)]); the next pass threads
Option<&WriteTxn> through open_for_mutation_on_branch / staging on the non-strict
bound-branch path — opening the base once from the pinned entry with the warm session
(a session-aware pinned opener returning a SnapshotHandle) and skipping the per-table
schema re-validation — to turn the two RED cost gates green. Strict ops / fork / the
commit-time OCC re-read keep their fresh reads.

* test(engine): scope write-path open_count to data tables (RFC-013 step 3b)

The keyed_insert_opens_table_at_most_once gate asserted open_count <= 1, but
open_count was a single unclassified counter: record_open() fires in both
open chokepoints, and open_dataset_tracked also opens the internal/system
tables (__manifest via layout.rs, _graph_commits/_graph_commit_actors via
commit_graph.rs). So the count conflated data-table opens with the publisher
CAS + commit-graph append opens — making the gate measure the wrong quantity
and unreachable by threading alone (the manifest publish keeps it >1 regardless).

Scope it by table class, mirroring the read-side counters (which already split
by URI prefix via separate wrappers): record_open(uri) classifies the open's
last path segment and feeds data_open_count vs internal_open_count. IoCounts
exposes both; the gate now asserts data_open_count <= 1.

Re-baselined: a single keyed insert is data_open_count=4 / internal_open_count=6
(sum 10, the old conflated value). The RED target for the WriteTxn threading is
now the real data-table-open count (4 -> 1), with internal opens correctly out
of scope. Pure test-harness/instrumentation; no production behavior change
(classification runs only inside the probe closure, skipped when no probes are
installed).

Also marks #297 (optimize-vs-write race) as landed in the step-3b handoff —
this branch is already stacked on origin/main after it merged.

* feat(engine): validate the schema contract once per write (RFC-013 step 3b)

A single mutate/load re-validated the schema contract ~4 times: at the entry
(ensure_schema_state_valid), per-table in open_for_mutation_on_branch
(resolved_branch_target), at the commit-time OCC re-read (fresh_snapshot_for_branch),
and in the publisher's index-build snapshot (snapshot_for_branch). Each validation
is 3 read_text + 2 exists on the storage adapter — O(touched resolve-points) of
redundant contract I/O on every write.

Thread the already-landed WriteTxn carrier through the write path: capture
`txn = open_write_txn(branch)` once at the mutate/load entry (the single validation),
then source the per-table entry and the commit/publish snapshots from `txn.base`
instead of re-resolving. When `txn` is None (branch merge, schema apply, tests) every
function is byte-identical to before.

- mutate_with_current_actor / load_jsonl_reader capture txn once (replacing the
  entry-point ensure_schema_state_valid) and thread Some(&txn) through
  execute_*/open_table_for_mutation, commit_all, and
  commit_updates_on_branch_with_expected.
- open_for_mutation_on_branch sources (snapshot, branch) from txn.base/txn.branch
  when present — skipping resolved_branch_target's re-validation. The OPEN itself is
  unchanged (still HEAD via open_dataset_head_for_write), and strict ops keep
  ensure_expected_version. Schema-once applies to strict and non-strict alike; the
  data-open collapse is a separate change.
- commit_all uses fresh_snapshot_for_branch_unchecked (the OCC manifest re-read minus
  the schema re-validation) when txn is present; the drift guard is unchanged.
- prepare_updates_for_commit uses txn.base for the publisher index-build snapshot.

fresh_snapshot_for_branch{,_unchecked} now read the manifest directly via
ManifestCoordinator instead of resolve_target. The OCC re-read consumes only the
Snapshot (per-table location + version), which ManifestCoordinator::open().snapshot()
produces identically — but resolve_target additionally opened the commit graph (a
spurious _graph_commits.lance exists probe the OCC read never consults). Dropping that
load is a pure read-cost reduction for every fresh-snapshot caller (commit_all's None
arm, optimize, repair, fork reclaim); the returned Snapshot is unchanged and the read
is a fresher cold manifest re-read, so the OCC freshness guarantee is preserved.

Greens write_validates_schema_contract_once (3 read_text / 2 exists, was 12/9).
keyed_insert_opens_table_at_most_once stays red (data_open_count=4) — the open
collapse lands next. Full engine suite green otherwise.

* feat(engine): open each data table once per write (RFC-013 step 3b)

A single keyed-node mutate opened its data table 4 times: accumulation (to read
.version()), staging (the real write base), the commit-time drift guard (to read
live HEAD), and the publisher's index build (reopen at the just-committed version).
Collapse three of the four — using the WriteTxn carrier threaded for schema-once —
so a write opens each touched data table at most once.

- #1 accumulation: open_for_mutation_on_branch now returns
  (Option<SnapshotHandle>, expected_version, full_path, table_branch). On the txn's
  own branch, a non-strict (Insert/Merge) op needs no open — the only thing the
  caller reads is .version() (the CAS fence), which is exactly the pinned base
  version (entry.table_version). So skip open_dataset_head_for_write and source the
  version from txn.base. The node insert path already discarded that handle; the
  edge path resolves a pinned read only when non-default cardinality needs it.
  STRICT ops and any write that must fork still open live HEAD + ensure_expected_version.
- #3 commit drift guard: commit_all reads live HEAD via
  entry.dataset.dataset().latest_version_id() — a cheap manifest-pointer probe off
  the already-open staging handle (the same primitive ManifestCoordinator::
  probe_latest_version uses) instead of a fresh open_dataset_head_for_write. The
  head<current / head>current drift classification is byte-identical.
- #4 index build: commit_all now returns the per-table post-commit_staged
  SnapshotHandle map; commit_updates_on_branch_with_expected threads it into
  prepare_updates_for_commit, which builds indices on the threaded handle instead of
  reopening at the same just-committed version. Absent a handle (other writers,
  inline/delete tables) the reopen path is byte-identical.

When txn is None (branch merge, schema apply, tests) every function opens and checks
exactly as before. Greens keyed_insert_opens_table_at_most_once (data_open_count 4->1).
Schema-once gate stays 3/2. Full engine suite + failpoints (recovery sidecar lifecycle)
green.

* refactor(engine): name the write-path open/commit returns (RFC-013 step 3b)

The open collapse left two positional returns that are easy to mis-thread and
carry an unwritten contract: open_for_mutation_on_branch's
(Option<SnapshotHandle>, u64, String, Option<String>) and commit_all's 5-tuple
(updates, expected_versions, sidecar_handle, guards, committed_handles). Replace
both with named structs so each field reads at the call site and the Option's
contract is documented, not folklore.

- OpenedForMutation { handle, expected_version, full_path, table_branch } with a
  require_handle(ctx) helper for the callers that must have a handle (strict ops,
  the fork path, every no-txn caller — branch merge, the seed test). The handle is
  None only on the non-strict-txn open-skip path (collapse #1); require_handle
  panics with a named context if that contract is ever broken.
- CommittedMutation { updates, expected_versions, sidecar_handle, guards,
  committed_handles } for commit_all; consumers destructure into the same local
  bindings they already used, so the publish/sidecar/guard-hold logic is unchanged.
- A debug_assert in open_table_for_mutation pins the skip contract: a missing handle
  is legal only on the non-strict txn path, so a future strict arm returning None
  trips in debug builds instead of handing None to a require_handle consumer.

Pure refactor — no behavior change. Both cost gates stay green (schema 3/2,
data_open_count=1), full engine suite + lib (162) green.

* refactor(engine): drop the unearned session field from WriteTxn (RFC-013 step 3b)

The open collapse greens data_open_count<=1 by SKIPPING the accumulation open,
PROBING live HEAD with latest_version_id, and REUSING the commit_staged handle —
none of which consume a session. The captured WriteTxn.session was therefore dead
(`#[allow(dead_code)]`): unearned surface a reviewer rightly flags.

Remove it. The carrier is now {branch, base} — exactly what schema-once + the open
collapse use. Step 5 (PublishPlan unification) makes WriteTxn the non-optional
publish carrier and is the right home for session-aware base opens, where the
warm-session benefit on the single remaining open — an object-store (S3) phenomenon,
invisible on local FS — can be earned by its own cost gate rather than carried dead
through this PR.

No behavior change; both cost gates stay green (schema 3/2, data_open_count=1).

* docs(rfc-013): mark step 3b DONE — schema-once + open-collapse shipped, session deferred to step 5

* docs(rfc-013): capture the write-base-staleness convergence (§1d)

Three findings this cycle share one root — the write base is a stale, un-probed,
un-classified pin (the read path probes; the write path returns the warm
coordinator snapshot):

- #298 edge-@card stale-read regression (cursor High / codex P1, VALID): collapse #1
  made the cardinality scan read txn.base instead of live HEAD, so a concurrent edge
  is uncounted and a max can be exceeded. Fix on #298: restore the live-HEAD read +
  deterministic test + correct the single-writer doc comment.
- The structural liability underneath: no unified write-validation read-set —
  endpoint/cardinality/uniqueness each pick freshness ad hoc (warm/pinned/live),
  the same cardinality check forks mutation-vs-loader, none re-validated at commit.
- The served-strict-write stale-view false-fail (validated on prod + a #[ignore]
  repro): a strict update/delete false-fails ExpectedVersionMismatch after an external
  optimize advance — the write-side mirror of #297/§6.6. The naive blanket probe is
  proven wrong (breaks the cross-process lost-update OCC contract).

All three converge on Design A (step 5): open_txn's warm probe makes the base fresh,
the op-class-aware precondition (derive maintenance vs logical from Lance per-version
transaction metadata — no parallel marker) fast-forwards maintenance and fails logical,
and §7.1's read-set-in-CAS unifies + re-validates the validation read-set. §8 records
the #298 follow-up, the widened §7.1 scope, and the step-5 two-test acceptance contract.

* test(engine): RED — edge @card must scan live HEAD, not stale txn.base (#298)

Regression guard for the cursor-High/codex-P1 finding on #298: 3b's collapse #1
made the non-strict edge-insert cardinality scan read the pinned txn.base instead
of live HEAD (edge_cardinality_read_handle), so a concurrent edge committed after
txn capture is uncounted and a @card max is silently exceeded (invariant 9).

Deterministic two-handle test (no failpoint): handle A commits WorksAt(Alice->Acme)
to the @card(0..1) max; stale handle B (never read since) inserts a second WorksAt
for Alice. B's coordinator is stale by construction (the write path doesn't probe),
so B scans txn.base (Alice has 0) and wrongly commits the 2nd edge. RED: the insert
that must be rejected currently succeeds (panics at unwrap_err). Goes green when the
scan reads live HEAD.

* fix(engine): scan live HEAD for edge @card, not the pinned txn.base (#298)

3b's collapse #1 skips the non-strict edge accumulation open, so edge_cardinality_
read_handle reopened the edge table at the pinned txn.base for the @card scan. Since
cardinality is validated once (never rechecked at commit), a concurrent edge committed
after txn capture was uncounted and a @card max could be silently exceeded (invariant
9) — the cursor-High/codex-P1 regression on #298. Pre-3b the scan read live HEAD (the
mutation's own open_dataset_head_for_write handle).

Restore the live-HEAD read: take the table LOCATION from the pinned entry (stable
across versions) and open the dataset at its current HEAD via open_dataset_head_for_
write. Gate-safe — the data_open_count / merge-insert-only gates are node inserts; the
edge cardinality path (non-default @card only) is untouched by them, and the extra
live-HEAD open is exactly the pre-3b shape. Also drops the dead None-fallback's schema
re-validation (greptile P2, auto-resolved). The residual validate->commit TOCTOU is the
pre-existing §7.1 gap (RFC-013 step 4), recorded in handoff §1d/§8.

Turns cardinality_rejected_for_stale_handle_after_concurrent_edge_commit green;
validators / write_cost / writes / consistency / end_to_end / branching all green.

* docs(dev): link handoff docs from index

* docs(engine): tighten 3b claims to match the code (#298 review)

Review caught several comments/docs overclaiming what the code does (the session
drop + the #298 cardinality fix left stale/too-strong wording). No logic change.

- open_write_txn doc: drop the stale "shared per-graph Session" (WriteTxn no longer
  carries one); scope "once" to the table-touch hot path and note edge/load RI
  validation still re-resolves (→ step 4 §7.1) + the session-aware open is step 5.
- edge cardinality call-site comment: it said the scan uses a "pinned txn.base" — it
  now opens LIVE HEAD (#298); corrected.
- write_cost.rs: "opens the base once (with the shared Session)" → session-aware base
  open is deferred to step 5.
- data_open_count completeness (instrumentation.rs + write_cost.rs): forbidden_apis
  only keeps engine code OUTSIDE the storage layer on the chokepoints; table_store.rs
  is allow-listed and holds direct Dataset::opens for branch-management ops (not the
  keyed-write hot path the gate measures). Narrowed the claim accordingly.
- handoff §4: "schema once / open once" is the node hot path (the two gates); edge
  endpoint + loader RI/cardinality still re-validate and read warm — #298 un-regresses
  cardinality only, it does NOT close write-validation freshness (that's step 4 §1d/§7.1).

build clean; write_cost / validators / forbidden_apis green.
2026-06-23 21:27:31 +02:00

430 lines
28 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Handoff: finishing RFC-013 (write-path latency + correctness)
**Status:** living handoff. **Source of truth is [`rfc-013-write-path-latency.md`](rfc-013-write-path-latency.md)**
this doc is the *current-state map + the decisions/validation from the latest work cycle
+ the concrete next actions*. When they disagree, the RFC wins (and fix this doc).
**Audience:** the engineer/agent who picks up RFC-013 next.
---
## 0. TL;DR — where we are and what's next
RFC-013 makes the write path fast **and** correct on object storage (217 Lance tables
under one `__manifest` catalog, on R2/S3). It is sequenced as steps; read §9 of the RFC
for the canonical list. Current reality:
**Landed on `main`:**
- **Step 1** — Tier-1 cost gate + the shared `helpers::cost` harness (#288).
- **Step 3a** — opener bypass: write opens go direct (`Dataset::open` by URI + version)
instead of the Lance-namespace builder (#288). **This already banked the dominant
depth win** — see §2 below; it reframes everything.
- **Step 2a** — internal-table compaction: `optimize` now compacts `__manifest` /
`_graph_commits` / `_graph_commit_actors` (#291). Plus the RFC latency-model
correction (#292).
- **Optimize-vs-write race** — optimize survives a cross-process write race on the
same table (#297, **LANDED** — origin/main `6d4606a8`; see §6 for why it's not
redundant with Design A). Step 3b stacks on top of this.
**Open PRs (land these; relationships in §7):**
- **#296** `correctness-by-design-fix` — recovery roll-forward converges on a concurrent
manifest advance (this is the fix for the flaky `iss-schema-apply-reopen-recovery-race`).
- **#295** `docs/rfc-013-step-3b` — the step-3b RFC doc.
- **#254** `ragnorc/bug-4-schema-apply-occ` — schema-apply vs optimize false-fail
(same op-class family as #297, logical side).
**Step 3b is DONE** (capture-once `WriteTxn`, schema-once + open-collapse; see §4) on
`rfc-013-step-3b-writetxn-v2`. **Next: Phase 7 (step 4), then the big one — Design A /
`PublishPlan` unification (step 5)** — see §5, the convergent fix for the bug *class* this
area keeps generating, which also absorbs 3b's deferred session-aware write opens.
---
## 1. The corrected mental model (read this before touching anything)
Three reframes from the latest cycle that the older RFC prose may not fully reflect:
### 1a. 3a already won the depth fight → the residual is constant-factor + RTT
Before 3a, the write re-opened each table through the lance-namespace builder ~13×, and
that path was **O(depth)** (it re-opened `__manifest` + `list_table_versions` per open —
**not** a Lance back-walk; the root cause was OmniGraph's own namespace round-trips, not
Lance — validated against Lance source). 3a swapped it for the direct opener, which is
**O(1)** (`from_uri(loc).with_version(N)` = arithmetic path + one HEAD). So:
- The dominant **O(depth) data-table** term is **gone**.
- Step 2a flattened the secondary **internal-table** scan term.
- What remains is the **~110-hop serial backbone × RTT + compute** — a constant in
depth. The latency model is **`wall = (serial_hops + ops/effective_concurrency)·RTT
+ compute`**; on a capped store (R2) the op-count term re-enters wall-clock, on an
unlimited store it parallelizes away. Measured: prod one-row write 27→15.76s after
2a; the remaining 15.76s is the serial backbone — **step 3b's target**, not step 2's.
- Step 3b's win is therefore the **call-count/RTT collapse** (redundant opens, the
flat-46 schema reads), NOT a depth slope. Don't expect a depth-slope improvement from
3b; gate it on the constant-factor (S3 round-trips), not a curve.
### 1b. Two op classes, two commit models (the §6.6 principle)
Every concurrency bug in this area is **one op class using the other's commit model**:
| class | examples | commutes? | correct commit model |
|---|---|---|---|
| **maintenance** | compaction (`Rewrite`), `optimize_indices` | yes (content-preserving) | Lance native rebase + app reopen/replan on real overlap + **monotonic manifest fast-forward** — no epoch, no read-set |
| **logical mutation** | load / mutate / merge / delete | no (lost-update, write-skew) | strict cross-process OCC: read-set + write-set CAS under the `writer_epoch` fence |
Applying strict OCC + equality-CAS uniformly is the mistake: too strong for maintenance
(false conflicts — #297's bug), too weak for logical cross-process (§6.5 corruption).
### 1c. The root liability (what keeps generating these bugs)
Lance gives **per-table atomic commits** but **no cross-table/cross-step atomicity**, so
every multi-commit op advances per-table Lance HEAD **before** the manifest references it
(the "A-before-B window"). The resulting `HEAD vs manifest` delta is **ambiguous**
(external drift? my own in-flight work? a crashed writer?), and **many uncoordinated code
paths each re-interpret it** (4 writers + the maintenance path + recovery + the write-path
drift guard). Each interpreter is a fresh chance to misclassify. That is the bug class:
- §6.5 cross-process logical corruption,
- #297's own-HEAD-drift misclassification,
- the flaky write-path "HEAD ahead of manifest, run repair" guard,
- the recovery classifier edges.
**The convergent fix is Design A (one publish authority — step 5); Lance MTT eventually
retires the window entirely.** See §5.
### 1d. The second facet: the write base is a stale pin (no probe)
The READ path resolves its base behind a freshness probe (`resolve_target_inner`
omnigraph.rs:~1072 → `probe_latest_incarnation``refresh_manifest_only`); the WRITE path
does NOT (`resolved_branch_target` omnigraph.rs:~778 returns the warm `coord.snapshot()` for
the bound branch, no probe). So a long-lived server's write base lags the live manifest. That
single staleness feeds **two distinct failure modes**, both surfaced this cycle:
1. **Stale validation *reads* → integrity under-enforced.** Write-path RI checks read
committed state off the stale base. 3b's collapse #1 made it worse for edge `@card`:
`edge_cardinality_read_handle` (mutation.rs:~614) scans the pinned `txn.base` instead of
live HEAD (was live HEAD pre-3b), so a concurrent edge committed after `txn` capture is
uncounted → a `@card` max can be exceeded (cursor **High** / codex **P1** on #298,
**VALID**). **#298 fix: restore the live-HEAD read for that scan** (un-regress; gate-safe —
the `data_open_count` gate is a node insert) + a deterministic regression test (commit A's
edge, then B validates → must see A) + correct the wrong "pinned base == live HEAD" doc
comment (mutation.rs:~605-613, which assumes a single writer). The *structural* liability
underneath: there is **no unified write-validation read-set** — endpoint
(`ensure_node_id_exists`, warm `snapshot_for_branch`), cardinality (mutation: pinned
`txn.base`; loader: warm `snapshot_for_branch` — the SAME check forks per write path),
commit drift guard (live `fresh_snapshot_for_branch`), and uniqueness
(`enforce_unique_constraints_intra_batch`, intra-batch only — cross-version uniqueness is a
documented gap). Three freshness levels chosen ad hoc, none re-validated at commit → the
§7.1 TOCTOU class, and each new constraint forks the pattern again.
2. **Stale OCC *pin* → false-fail on a maintenance advance.** A served strict update/delete
pins the stale base version, then false-fails `ExpectedVersionMismatch` after an external
`optimize` advanced `__manifest` — even though the advance was content-preserving
compaction the logical write should fast-forward past (invariant 7). It's the **write-side
mirror of #297/§6.6** (#297 made optimize fast-forward past a logical write; this is a
logical write that must fast-forward past optimize). A served read clears it (the read
probes the shared coordinator). Validated repro on prod (omnigraph.ragnor.co) +
`writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes`
(`#[ignore]` on branch `fix/write-path-stale-view-probe`). **The naive "just probe" fix is
proven wrong** — a blanket probe silently refreshes past *logical* advances too, breaking
`consistency::stale_handle_public_mutation_must_refresh_then_retry` (the deliberate
cross-process lost-update OCC primitive). The fix must **discriminate by op class**.
**Both fold into Design A (step 5), same as §1c.** `open_txn`'s one warm probe makes the base
fresh (absorbs maintenance advances cheaply); the **op-class-aware strict precondition**
derive from Lance's per-version transaction metadata (all `Rewrite`/`ReserveFragments` =
maintenance → fast-forward the pin; any `Append`/`Update`/`Delete`/`Merge` = logical → fail
loudly; NO parallel marker, invariant 1/15) — is the correctness fence for anything that lands
after. And the §7.1 read-set-in-CAS unifies the validation read-set + re-validates it under the
`graph_head` contention. So **the stale-view false-fail, the cardinality/validation-read-set
liability, and #297's mirror are one bug** (the write base is a stale, un-probed, un-classified
pin) with **one home: the single PublishPlan delta-interpreter** (§1c + §5). Strong corroboration
of Design A — three symptoms, one fix.
---
## 2. Validated facts — do NOT re-derive these
Established this cycle against **Lance 7.0.0 source**
(`~/.cargo/registry/src/index.crates.io-*/lance-7.0.0`) and current engine code. Cited so
you can trust them without re-investigating.
**Lance (upstream):**
- `from_uri(loc).with_version(N).load()` and `checkout_version(N)` are **O(1)** (computed
V2 path `_versions/{u64::MAX-N:020}.manifest` + one HEAD; no listing/back-walk).
(`lance-table/src/io/commit.rs` `default_resolve_version`.)
- A shared `Arc<Session>` (`DatasetBuilder::with_session`) warms metadata/index caches
keyed by `(URI, version, e_tag)`. Caveat: the *first* manifest read on open is uncached
— the Session warms the *scan/index* metadata, not the first open. **`WriteParams` *does*
carry a `session` field** (`lance/src/dataset/write.rs`), but it only matters on the
`WriteDestination::Uri` arm; OmniGraph's staged path always drives off an **already-open
`Dataset`**, and Lance takes the store/session from that handle. So to attach the shared
Session to a write base, open read-style (`open_table_dataset``from_uri().with_version()
.with_session()`) and drive the staged write off that handle.
- A held `Arc<Dataset>` at a pinned version is `Send + Sync`, immutable, safe to reuse for
many scans/count/staged-write base in one txn (OmniGraph's `TableHandleCache` already
relies on this).
- **No compaction `RetryExecutor`** (only Delete/MergeInsert/Update have one).
`commit_compaction` commits a fixed `Rewrite` via `apply_commit` direct. In
`commit_transaction`, a semantic `RetryableCommitConflict` **escapes the retry loop**
via `?` at `io/commit.rs:979`; the loop only retries the OCC `CommitConflict`
(`:1096`), and even that re-rebases the *same* transaction (never re-plans). ⇒
**compaction needs app-level reopen+REPLAN; you cannot "set conflict_retries" and let
Lance own it.**
- `check_rewrite_txn`: a `Rewrite` rebases **cleanly** past a concurrent `Append`/disjoint
`Update`/`Delete` (preserving both); only a same-fragment overlap yields a retryable
conflict. ⇒ the common concurrent insert/update/delete is rebased for free; the app
retry fires only on real overlap.
**Engine (internal):**
- Read path (post-#268) already has the capture-once machinery: `Snapshot` (`db/manifest.rs`),
warm `GraphCoordinator` behind a `latest_version_id`/incarnation probe, a held
`TableHandleCache` keyed `(table,branch,version,e_tag)`, **one shared `Session` per
graph** (`read_caches.session`). **Writes bypass all of it by construction**
(`resolved_branch_target` returns `read_caches: None`; the 3a write opener attaches no
session and opens by latest, not pinned version).
- A single write opens each table **34×** (accumulation → staging reopen → commit
drift-guard → publish prepare), each a fresh cold open. `validate_schema_contract`
(`db/schema_state.rs`, via `ensure_schema_state_valid`) runs uncached (~3 `read_text`
+ 2 `exists`) at every resolve point (~the flat-46). Both are constant-factor, flat in
depth — 3b's targets.
- Strict-op guards are the lost-update floor (3 layers: pre-stage `ensure_expected_version`
`table_store.rs`; commit-time strict drift `exec/staging.rs`; publisher CAS
`publisher.rs`). Capture-once **supplies** the pinned operand — never remove a guard.
- Fork-on-first-write authority reads (`classify_fork_ref``fresh_snapshot_for_branch`)
must stay **fresh** (not served from a pinned base).
- Cost harness: `helpers::cost` (`measure`/`measure_with_staged`/`IoCounts`/`assert_flat`/
`local_graph`/`s3_graph`). The schema-once assert can reuse `CountingStorageAdapter`
(`warm_read_cost.rs::warm_query_validates_schema_contract_once`) with **zero** prod
change; an open-count assert wants a small `open_count` AtomicU64 in `QueryIoProbes`
(copy the `probe_count`/`record_probe` pattern). The forbidden-API guard
(`tests/forbidden_apis.rs`) makes an instrumentation-level counter complete.
---
## 3. The #297 cycle (this branch) — what it is, and the lesson
`fix-optimize-concurrency-race` (5 commits): a CLI `optimize` racing a served write on the
same table failed (Lance Rewrite lost, or the equality-CAS publish lost). Fix: unify both
compaction paths on the internal path's **reopen+replan** shape, with a **two-level retry**
— outer loop reopens+replans on a real Lance overlap; inner Phase-C loop makes the manifest
publish a **monotonic fast-forward** (advance to compacted version `N`, or no-op when the
manifest already moved to `≥ N`), never the strict equality CAS. Sidecar written once;
in-process queue kept as a contention reducer (not the cross-process guard); no `writer_epoch`.
**Two review rounds surfaced two follow-on bugs I introduced with the retry loop** — both
fixed, both regression-tested (own-HEAD-drift via negative control):
1. **Own-HEAD-drift misclassification** (`56d004e0`): the drift guard re-ran every
iteration and, after a partial Phase-B commit (auto_cleanup strip or compact, then a
later op conflicts), saw `HEAD > manifest` from *our own* covered work and deleted the
sidecar + returned `skipped_for_drift` (stranding uncovered drift). Fix: track
`head_advanced`; the drift guard fires only when `!head_advanced`.
2. **Publish exhaustion spurious error** (`e9d16a2c`): the publish loop returned `Err` on
its final retry even if the conflict meant a concurrent writer already published `≥ N`
(postcondition met). Fix: re-check `current >= state.version` on exhaustion.
**The lesson (write it on the wall):** *wrapping a sequence of side-effecting commits in a
retry silently converts every "checked once, before any side effect" precondition into
"re-checked after partial side effects."* That's a distinct bug class; it needs
fault-injection tests **at each commit boundary**, not just end-to-end concurrency tests.
(The `optimize.before_compact` / `optimize.inject_reindex_conflict` failpoints exist for
exactly this.)
**Temporary mechanism flag:** `head_advanced` is an in-memory proxy for "is this HEAD
movement mine." Under Design A the authority answers that from the plan/sidecar **identity**
— so `head_advanced` is the part that gets *replaced*, while the monotonic-publish +
reopen/replan **semantics** are permanent. (Noted in RFC §6.6.)
---
## 4. DONE: Step 3b — capture-once `WriteTxn` (shipped on `rfc-013-step-3b-writetxn-v2`)
**Delivered:** on the **table-touch hot path**, a single `mutate`/`load` validates the schema
contract **once** and opens each touched data table **at most once** — a constant-factor/RTT
win (not a depth-slope win; 1a). Two cost gates in `write_cost.rs` lock it (both on a node
insert): `write_validates_schema_contract_once` (3 `read_text` / 2 `exists`, was 12/9) and
`keyed_insert_opens_table_at_most_once` (`data_open_count <= 1`, was 4). The carrier is the
minimal `WriteTxn { branch, base }`, threaded as `Option<&WriteTxn>` (`Some` on the hot
mutate/load path, `None` byte-identical everywhere else); it **converges into** step 5's
`PublishPlan`.
**Not "once" everywhere (scope, not regression):** edge endpoint / cardinality RI validation
(`ensure_node_id_exists`, the loader's RI + cardinality) still resolves through
`snapshot_for_branch` and re-validates the schema — and reads **warm**, not live. Threading
`txn.base` there to make it "once" would re-introduce the stale-read class the #298 cardinality
fix removed (it now reads live HEAD). Doing schema-once *and* fresh reads for those validations
needs the unified, re-checked read-set — **step 4 §7.1** (§1d). So #298 **un-regresses
cardinality only; it does not close write-validation freshness.** No edge-insert/load schema-once
gate yet (only the node gates above).
Commits (off merged-#297 main):
- **Stage 0** — scope `open_count``data_open_count`/`internal_open_count` by URI class
(the review fix: `open_dataset_tracked` also opens `__manifest`/`_graph_commits`, so the
raw counter conflated them and the gate was unreachable). Re-baselined RED 4.
- **Commit A (schema-once)** — capture `txn` once at entry (the single validation); the 4
validation sites collapse: S1 (entry `ensure_schema_state_valid`) removed; S3a
(`open_for_mutation_on_branch`) + S3b (`prepare_updates_for_commit`) source `txn.base`;
S4 (`commit_all`) uses new `fresh_snapshot_for_branch_unchecked` (the OCC manifest re-read
minus the schema re-validation). `fresh_snapshot_for_branch{,_unchecked}` now read the
manifest directly via `ManifestCoordinator` (drops a spurious commit-graph `exists` probe;
same `Snapshot`).
- **Commit B (open collapse 4→1)** — #1 accumulation open ELIMINATED (the node path discarded
the handle; read `txn.base.entry().table_version`); #2 staging open KEPT (the one open);
#3 commit drift-guard reads live HEAD via `entry.dataset.dataset().latest_version_id()` (a
cheap manifest-pointer probe off the staged handle, not a fresh open); #4 index build reuses
the `commit_staged` handle threaded through `CommittedMutation`/`prepare_updates_for_commit`.
- **Commit B.1 + cleanup** — named the two positional returns (`OpenedForMutation`,
`CommittedMutation`) + a `debug_assert` pinning the open-skip contract; **removed the
unearned `WriteTxn.session` field** (the collapse uses skip/probe/reuse, not a session).
**RFC §4.1 corrections — how they resolved:**
1. *Thread the evolving handle, not a version-keyed cache* → realized as collapse #4 (carry
the `commit_staged` handle forward into the index build).
2. *Don't forbid re-resolution* → honored: the commit-time OCC re-read
(`fresh_snapshot_for_branch_unchecked` — fresh manifest, only schema-revalidation dropped)
and the fork-authority reads stay fresh.
3. *Minimal carrier*`WriteTxn { branch, base }` (even the `session` from the original
sketch was dropped as unearned).
**Deferred to step 5 (NOT in this PR):** session-aware write base opens. The one remaining
open (#2) stays a HEAD open; warming the shared `Session` across writes is an object-store
(S3) phenomenon invisible on local FS, so it earns its own `write_cost_s3.rs` gate in step 5,
where `txn` becomes the non-optional publish carrier. No new concurrency test was needed here:
#2 stays a HEAD open (no pinned+session base introduced), so the publisher CAS + #3 live-HEAD
probe fences are unchanged (covered by the green `writes.rs`/`consistency.rs`).
**Guardrails (don't regress):** schema validation is deliberately uncached for drift
detection — collapse to 1 *per write*, never cache across writes on a long-lived handle
(`lifecycle::long_lived_handle_rejects_schema_*`). The commit-time fresh read is OCC
machinery, not redundancy. Keep all 3 strict-op guards. Keep fork-authority reads fresh.
Pin the *correct* branch (server-bound-to-main writing a feature branch falls to a fresh
open). A branch `rfc-013-step-3b-writetxn` exists off an earlier main; rebase onto the
post-#297 main before starting.
---
## 5. Design A — the `PublishPlan` unification (step 5) = the convergent fix
**This is the real fix for the bug class in §1c.** Collapse the four hand-rolled writers +
the maintenance path into **one `publish(txn, plan)` authority** where the CAS + bounded
retry is **unconditional and unbypassable** (no caller can "hold the queue → skip the CAS").
Properties:
- **One interpreter of the `HEAD vs manifest` delta** — and "is this my work?" is answered
by the plan/sidecar **identity**, not a re-derived comparison. The own-HEAD-drift bug, the
§6.5 writers, the write-path guard — all close *by construction*.
- **Recovery = the same `PublishPlan` re-applied** — the crash-recovery interpreter and the
live interpreter become the same code (`iss-merge-recovery-partial-rollforward` gone).
- Each `TableAction` commits by its **class** (§1b): `Rewrite` = maintenance (Lance rebase
+ reopen/replan + monotonic fast-forward, **no epoch**); load/mutate = logical (strict OCC
+ `writer_epoch`).
**Why it composes with Lance MTT (don't over-build):**
- The **unification itself is convergent** — when MTT lands, it slots *underneath* the same
authority; nothing wasted. Build this.
- The **`writer_epoch`** is the one MTT-redundant piece (MTT's commit-handler lease subsumes
a cross-process fence). Build it *last and minimally*, gated on actually deploying
multi-writer topologies. Per the deny-list, don't reimplement what the substrate will own.
**Sequencing judgment (this cycle's strongest signal):** the bug density here (this PR alone
= 3 review rounds, all "a writer re-interprets the delta") means the current N-writers interim
is high integrated-over-time liability. **Consider pulling the *convergent half* of step 5
(the single authority + recovery-as-plan) forward — possibly ahead of 3b** — because it stops
the bug class rather than patching instances. #297 + #254 are the *de-risking inputs*: they
validate the maintenance-class and logical-class commit models in isolation first, so Design
A implements a known spec rather than designing under refactor pressure. Do NOT build more
substrate-shaped scaffolding (custom WAL / job queue / second coordination table) to paper
over the window — strictly higher liability than either Design A or waiting for MTT.
**Deeper-than-A (post-MTT or as Lance exposes uncommitted variants):** all-uncommitted-fragments
+ one manifest commit would shrink the A-before-B window itself, blocked today by Lance not
exposing uncommitted variants for `compact_files` / `optimize_indices` / vector index (#6666
open; delete #6658 shipped). Track, don't build yet.
---
## 6. Why #297 is still needed even if you do Design A
- Design A **relocates** #297's maintenance-class commit logic into the authority's
`TableAction::Rewrite` path; it does not eliminate it. #297 is the *validated spec + tests*.
- The two regression tests + §6.6 are the **contract** Design A must keep green.
- The prod bug is **live**; Design A is the largest write-path change in the RFC. Don't hold a
correctness fix hostage to a big refactor, and don't do a big refactor under bug-fix urgency.
- Genuinely throwaway under Design A: only the loop's *location* + the `head_advanced` proxy
(~a dozen lines). Everything else relocates or persists. **#297 LANDED.**
---
## 7. Open PRs and their relationships
- **#297** — maintenance-class fix (optimize vs write). **LANDED** (origin/main `6d4606a8`);
step 3b stacks on it.
- **#254** — logical-class fix (schema-apply vs optimize false-fail). Same op-class family;
both are de-risking inputs for Design A's per-class commit models.
- **#296** — recovery roll-forward converges on concurrent manifest advance. This is the fix
for the flaky `iss-schema-apply-reopen-recovery-race` (the handoff in
`handoff-schema-apply-recovery-flake.md`). It touches `recovery.rs` and is *aligned* with
#297's "postcondition is the state, not winning the CAS" principle — reconcile the monotonic
publish with #296's converge helper if #296 lands first.
- **#295** — the step-3b RFC doc (apply §4's three corrections to it).
---
## 8. Remaining RFC steps after 3b (RFC §9 is canonical)
- **#298 follow-up (do on the 3b PR, before merge): the edge-`@card` stale-read regression**
(§1d.1). Restore the live-HEAD cardinality scan, add the deterministic regression test, fix
the wrong doc comment. Small, gate-safe, un-regresses an integrity check (invariant 9). The
residual concurrent TOCTOU is the §7.1 gap (step 4) — un-widen here, don't over-reach.
- **Step 4 / Phase 7** (`iss-991`): lineage into `__manifest` (publish `graph_commit` +
mutable `graph_head:<branch>` in the same merge-insert; `_graph_commits` becomes a
projection). Removes the per-write `commit_graph.refresh`; closes the manifest→commit-graph
atomicity + commit-graph-parent-under-concurrency gaps. **Hard prereq: step 2 (done).**
Carries the §7.1 *concurrent* write-skew fix (needs the `graph_head` contention row) —
**frame §7.1 as "unify the entire write-validation read-set" (endpoint + cardinality +
cross-version uniqueness), not merely "add `graph_head`"** (§1d.1): the bespoke
`edge_cardinality_read_handle` and the mutation-vs-loader freshness fork dissolve into one
pinned read-set re-validated under the `graph_head` contention, or the liability survives as
a second special-case.
- **Step 5 / Design A** — §5 above. **Acceptance item: the served-strict-write stale-view
false-fail** (§1d.2) — the op-class-aware precondition + `open_txn` probe. The contract is
two tests passing *together*: un-ignore
`writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes` (goes green)
*while* `consistency::stale_handle_public_mutation_must_refresh_then_retry` stays green
(maintenance fast-forwards; logical fails loudly). Self-contained enough to ship standalone
like #297 if prod pain is acute; otherwise fold into the single PublishPlan delta-interpreter.
- **Step 2b** — internal-table cleanup + the Q8 monotonic watermark (a Lance boundary tag).
Deferred: only the secondary version-count/space term, touches the read/open path, and is
MTT-redundant. Land when version-count cost bites.
- **§7.1 sequential write-skew** (`iss-overwrite-orphans-committed-edges`) — inbound-RI
validation on node removal; independent, ships anytime.
- **#20** — the prod per-write `storage.ops` span metric (RFC §5.3), still owed.
- Branch ops: Lance `Clone` for create (`iss-691`).
---
## 9. Gotchas / traps (learned the hard way)
- **In-process queue ≠ cross-process lock.** Any "I hold the queue → skip the retry/CAS"
reasoning is a bug across processes. This is the recurring trap.
- **Monotonic publish must be `≥`-conditional, never "no assertion."** The `__manifest`
merge-insert is unconditional `UpdateAll` keyed on `object_id` (`publisher.rs:379`), so
the equality (or monotonic) pre-check is the *only* guard — dropping it lets `UpdateAll`
regress a newer version = lost write.
- **The drift guard interprets an ambiguous delta.** Re-evaluating it in a retry over
self-mutated state is how #297's follow-on bug happened. Gate any HEAD-vs-manifest
interpretation on "have *we* committed yet."
- **`compact_files` fires Lance's auto_cleanup GC hook** (commits with
`skip_auto_cleanup=false`, no override) — optimize strips stale `lance.auto_cleanup.*`
config before compacting to stay non-destructive on upgraded graphs. The strip is a
separate commit (relevant to the partial-commit retry trap).
- **Lance rebases the common concurrent case for free** — so the data-table conflict usually
surfaces as the manifest fast-forward, not a Lance error. The Lance-Rewrite-overlap path is
rare and needs failpoint injection to test.
---
## 10. Verification (the gate)
- `cargo test --workspace --locked` — the canonical gate (matches CI).
- `cargo test -p omnigraph-engine --features failpoints --test failpoints optimize`
the optimize concurrency/recovery tests.
- `cargo test -p omnigraph-engine --test write_cost` / `write_cost_s3` (bucket-gated) —
cost gates (3b adds the schema-once + open-count asserts here).
- `cargo test -p omnigraph-engine --test maintenance` — optimize/repair/cleanup.
- Re-read [`invariants.md`](invariants.md), [`lance.md`](lance.md), [`testing.md`](testing.md)
before each change (always-on requirement).
Lance source for re-validation:
`/Users/ragnor/.cargo/registry/src/index.crates.io-*/lance-7.0.0` (key files: `io/commit.rs`,
`io/commit/conflict_resolver.rs`, `dataset/optimize.rs`, `dataset/write/retry.rs`,
`dataset/builder.rs`).