* docs(rfc-013): step-3b handoff + §4.1 corrections (validated)
Add the RFC-013 write-path handoff doc, and correct §4.1's WriteTxn sketch from the
4-subagent validation against current code:
- HandleCache → handle-threading (forward the commit-return handle; a version-keyed
cache misses because HEAD walks N→N+1→N+2 across staging + index-build commits).
- "re-resolution unrepresentable" softened to "pinned base for the pre-commit phase +
named fresh re-reads at the commit/fork boundary" — three reads (commit-time OCC, the
live-HEAD drift probe, fork authority) are irreducible correctness machinery.
- WriteParams DOES carry a session field; the real constraint is "stage off an open
Dataset," so attach the Session by opening read-style then staging off it.
* test(engine): RED step-3b capture-once fitness asserts + open_count probe
Two write-path cost gates, RED today, GREEN after the WriteTxn lands:
- write_validates_schema_contract_once: a write must validate the schema contract
once (3 read_text + 2 exists). Today re-validates at every resolve point —
measured 12 read_text / 9 exists (~4 validations) via CountingStorageAdapter
(zero production change; the write twin of the read-path schema-once test).
- keyed_insert_opens_table_at_most_once: a keyed single-table write must open its
table <=1x. Today measured 10 opens.
Adds an exact open-CALL probe: open_count + record_open() on QueryIoProbes (mirroring
probe_count/record_probe), called at both open chokepoints; surfaced as
IoCounts.open_count. forbidden_apis guarantees every write open routes through them.
* feat(engine): WriteTxn carrier + open_write_txn (3b scaffolding)
The capture-once write transaction (RFC-013 step 3b): WriteTxn{branch, base:
Snapshot, session} + Omnigraph::open_write_txn, which validates the schema contract
once and pins the base snapshot + the shared per-graph Session.
Landed as reviewed scaffolding (gated #[allow(dead_code)]); the next pass threads
Option<&WriteTxn> through open_for_mutation_on_branch / staging on the non-strict
bound-branch path — opening the base once from the pinned entry with the warm session
(a session-aware pinned opener returning a SnapshotHandle) and skipping the per-table
schema re-validation — to turn the two RED cost gates green. Strict ops / fork / the
commit-time OCC re-read keep their fresh reads.
* test(engine): scope write-path open_count to data tables (RFC-013 step 3b)
The keyed_insert_opens_table_at_most_once gate asserted open_count <= 1, but
open_count was a single unclassified counter: record_open() fires in both
open chokepoints, and open_dataset_tracked also opens the internal/system
tables (__manifest via layout.rs, _graph_commits/_graph_commit_actors via
commit_graph.rs). So the count conflated data-table opens with the publisher
CAS + commit-graph append opens — making the gate measure the wrong quantity
and unreachable by threading alone (the manifest publish keeps it >1 regardless).
Scope it by table class, mirroring the read-side counters (which already split
by URI prefix via separate wrappers): record_open(uri) classifies the open's
last path segment and feeds data_open_count vs internal_open_count. IoCounts
exposes both; the gate now asserts data_open_count <= 1.
Re-baselined: a single keyed insert is data_open_count=4 / internal_open_count=6
(sum 10, the old conflated value). The RED target for the WriteTxn threading is
now the real data-table-open count (4 -> 1), with internal opens correctly out
of scope. Pure test-harness/instrumentation; no production behavior change
(classification runs only inside the probe closure, skipped when no probes are
installed).
Also marks #297 (optimize-vs-write race) as landed in the step-3b handoff —
this branch is already stacked on origin/main after it merged.
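A minimal sketch of the table-class scoping described above, assuming a standalone probe type. `OpenProbe`, `classify_open`, `TableClass`, the optional `.lance` suffix handling, and the example URIs are illustrative stand-ins, not the engine's actual `QueryIoProbes` code; only the internal table names and the "classify by last path segment" rule come from the commit message.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Table class for an open, decided from the last path segment of the URI.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum TableClass {
    Data,
    Internal,
}

/// Internal/system tables named in the commit message.
const INTERNAL_TABLES: [&str; 3] = ["__manifest", "_graph_commits", "_graph_commit_actors"];

/// Classify an open by the last path segment (hypothetical helper).
pub fn classify_open(uri: &str) -> TableClass {
    let last = uri.trim_end_matches('/').rsplit('/').next().unwrap_or("");
    // Assumption: a table directory may carry a ".lance" suffix.
    let name = last.strip_suffix(".lance").unwrap_or(last);
    if INTERNAL_TABLES.iter().any(|t| *t == name) {
        TableClass::Internal
    } else {
        TableClass::Data
    }
}

/// Two counters instead of one unclassified open_count.
#[derive(Default)]
pub struct OpenProbe {
    data_open_count: AtomicU64,
    internal_open_count: AtomicU64,
}

impl OpenProbe {
    /// Record one open call, routed to the counter for its table class.
    pub fn record_open(&self, uri: &str) {
        let counter = match classify_open(uri) {
            TableClass::Data => &self.data_open_count,
            TableClass::Internal => &self.internal_open_count,
        };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    /// Returns (data_open_count, internal_open_count).
    pub fn counts(&self) -> (u64, u64) {
        (
            self.data_open_count.load(Ordering::Relaxed),
            self.internal_open_count.load(Ordering::Relaxed),
        )
    }
}
```

With a split like this, a gate can assert on the data-table counter alone while the manifest publish and commit-graph appends land in the internal counter.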
* feat(engine): validate the schema contract once per write (RFC-013 step 3b)
A single mutate/load re-validated the schema contract ~4 times: at the entry
(ensure_schema_state_valid), per-table in open_for_mutation_on_branch
(resolved_branch_target), at the commit-time OCC re-read (fresh_snapshot_for_branch),
and in the publisher's index-build snapshot (snapshot_for_branch). Each validation
is 3 read_text + 2 exists on the storage adapter — O(touched resolve-points) of
redundant contract I/O on every write.
Thread the already-landed WriteTxn carrier through the write path: capture
`txn = open_write_txn(branch)` once at the mutate/load entry (the single validation),
then source the per-table entry and the commit/publish snapshots from `txn.base`
instead of re-resolving. When `txn` is None (branch merge, schema apply, tests) every
function is byte-identical to before.
- mutate_with_current_actor / load_jsonl_reader capture txn once (replacing the
entry-point ensure_schema_state_valid) and thread Some(&txn) through
execute_*/open_table_for_mutation, commit_all, and
commit_updates_on_branch_with_expected.
- open_for_mutation_on_branch sources (snapshot, branch) from txn.base/txn.branch
when present — skipping resolved_branch_target's re-validation. The OPEN itself is
unchanged (still HEAD via open_dataset_head_for_write), and strict ops keep
ensure_expected_version. Schema-once applies to strict and non-strict alike; the
data-open collapse is a separate change.
- commit_all uses fresh_snapshot_for_branch_unchecked (the OCC manifest re-read minus
the schema re-validation) when txn is present; the drift guard is unchanged.
- prepare_updates_for_commit uses txn.base for the publisher index-build snapshot.
fresh_snapshot_for_branch{,_unchecked} now read the manifest directly via
ManifestCoordinator instead of resolve_target. The OCC re-read consumes only the
Snapshot (per-table location + version), which ManifestCoordinator::open().snapshot()
produces identically — but resolve_target additionally opened the commit graph (a
spurious _graph_commits.lance exists probe the OCC read never consults). Dropping that
load is a pure read-cost reduction for every fresh-snapshot caller (commit_all's None
arm, optimize, repair, fork reclaim); the returned Snapshot is unchanged and the read
is a fresher cold manifest re-read, so the OCC freshness guarantee is preserved.
Greens write_validates_schema_contract_once (3 read_text / 2 exists, was 12/9).
keyed_insert_opens_table_at_most_once stays red (data_open_count=4) — the open
collapse lands next. Full engine suite green otherwise.
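The capture-once threading can be sketched in isolation as follows. This is a toy model under stated assumptions: `Engine`, `resolve_point`, `validate_and_resolve`, and the `manifest_version` field are hypothetical stand-ins, and the counter replaces the real storage-adapter I/O; only the `WriteTxn { branch, base }` shape and the `Option<&WriteTxn>` threading come from the commit message.

```rust
use std::cell::Cell;

/// Stand-in for the pinned manifest snapshot.
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub manifest_version: u64,
}

/// The carrier: captured once at the mutate/load entry.
pub struct WriteTxn {
    pub branch: String,
    pub base: Snapshot,
}

/// Toy engine that counts schema-contract validations.
pub struct Engine {
    validations: Cell<u32>,
    head: u64,
}

impl Engine {
    pub fn new(head: u64) -> Self {
        Engine { validations: Cell::new(0), head }
    }

    /// Models a resolve that validates the schema contract every time.
    fn validate_and_resolve(&self, _branch: &str) -> Snapshot {
        self.validations.set(self.validations.get() + 1);
        Snapshot { manifest_version: self.head }
    }

    /// The single validation for a write.
    pub fn open_write_txn(&self, branch: &str) -> WriteTxn {
        WriteTxn {
            branch: branch.to_string(),
            base: self.validate_and_resolve(branch),
        }
    }

    /// A resolve point: reuse the pinned base when a txn is threaded,
    /// otherwise behave exactly as before (re-resolve and re-validate).
    pub fn resolve_point(&self, branch: &str, txn: Option<&WriteTxn>) -> Snapshot {
        match txn {
            Some(t) => t.base.clone(),
            None => self.validate_and_resolve(branch),
        }
    }

    pub fn validation_count(&self) -> u32 {
        self.validations.get()
    }
}
```

The `None` arm is what keeps callers without a transaction (branch merge, schema apply, tests) on the old behavior.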
* feat(engine): open each data table once per write (RFC-013 step 3b)
A single keyed-node mutate opened its data table 4 times: accumulation (to read
.version()), staging (the real write base), the commit-time drift guard (to read
live HEAD), and the publisher's index build (reopen at the just-committed version).
Collapse three of the four — using the WriteTxn carrier threaded for schema-once —
so a write opens each touched data table at most once.
- #1 accumulation: open_for_mutation_on_branch now returns
(Option<SnapshotHandle>, expected_version, full_path, table_branch). On the txn's
own branch, a non-strict (Insert/Merge) op needs no open — the only thing the
caller reads is .version() (the CAS fence), which is exactly the pinned base
version (entry.table_version). So skip open_dataset_head_for_write and source the
version from txn.base. The node insert path already discarded that handle; the
edge path resolves a pinned read only when non-default cardinality needs it.
STRICT ops and any write that must fork still open live HEAD + ensure_expected_version.
- #3 commit drift guard: commit_all reads live HEAD via
entry.dataset.dataset().latest_version_id() — a cheap manifest-pointer probe off
the already-open staging handle (the same primitive
ManifestCoordinator::probe_latest_version uses) instead of a fresh open_dataset_head_for_write. The
head<current / head>current drift classification is byte-identical.
- #4 index build: commit_all now returns the per-table post-commit_staged
SnapshotHandle map; commit_updates_on_branch_with_expected threads it into
prepare_updates_for_commit, which builds indices on the threaded handle instead of
reopening at the same just-committed version. Absent a handle (other writers,
inline/delete tables) the reopen path is byte-identical.
When txn is None (branch merge, schema apply, tests) every function opens and checks
exactly as before. Greens keyed_insert_opens_table_at_most_once (data_open_count 4->1).
Schema-once gate stays 3/2. Full engine suite + failpoints (recovery sidecar lifecycle)
green.
* refactor(engine): name the write-path open/commit returns (RFC-013 step 3b)
The open collapse left two positional returns that are easy to mis-thread and
carry an unwritten contract: open_for_mutation_on_branch's
(Option<SnapshotHandle>, u64, String, Option<String>) and commit_all's 5-tuple
(updates, expected_versions, sidecar_handle, guards, committed_handles). Replace
both with named structs so each field reads at the call site and the Option's
contract is documented, not folklore.
- OpenedForMutation { handle, expected_version, full_path, table_branch } with a
require_handle(ctx) helper for the callers that must have a handle (strict ops,
the fork path, every no-txn caller — branch merge, the seed test). The handle is
None only on the non-strict-txn open-skip path (collapse #1); require_handle
panics with a named context if that contract is ever broken.
- CommittedMutation { updates, expected_versions, sidecar_handle, guards,
committed_handles } for commit_all; consumers destructure into the same local
bindings they already used, so the publish/sidecar/guard-hold logic is unchanged.
- A debug_assert in open_table_for_mutation pins the skip contract: a missing handle
is legal only on the non-strict txn path, so a future strict arm returning None
trips in debug builds instead of handing None to a require_handle consumer.
Pure refactor — no behavior change. Both cost gates stay green (schema 3/2,
data_open_count=1), full engine suite + lib (162) green.
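A sketch of the named-return shape, assuming simplified field types. The field names follow the commit message; `SnapshotHandle` is reduced to a version number here, and the exact signature and panic message of `require_handle` are assumptions.

```rust
/// Simplified stand-in for the engine's snapshot handle.
#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotHandle {
    pub version: u64,
}

/// Named replacement for the positional
/// (Option<SnapshotHandle>, u64, String, Option<String>) return.
#[derive(Debug)]
pub struct OpenedForMutation {
    /// None only on the non-strict txn open-skip path (collapse #1).
    pub handle: Option<SnapshotHandle>,
    pub expected_version: u64,
    pub full_path: String,
    pub table_branch: Option<String>,
}

impl OpenedForMutation {
    /// For callers that must have a handle (strict ops, fork path, no-txn callers).
    /// Panics with the caller-supplied context if the contract is broken.
    pub fn require_handle(self, ctx: &str) -> SnapshotHandle {
        match self.handle {
            Some(handle) => handle,
            None => panic!(
                "{}: open-skip path returned no handle, but this caller requires one",
                ctx
            ),
        }
    }
}
```

Naming the fields means a swapped `expected_version` or `table_branch` shows up at the call site rather than compiling as a silently reordered tuple.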
* refactor(engine): drop the unearned session field from WriteTxn (RFC-013 step 3b)
The open collapse greens data_open_count<=1 by SKIPPING the accumulation open,
PROBING live HEAD with latest_version_id, and REUSING the commit_staged handle —
none of which consume a session. The captured WriteTxn.session was therefore dead
(`#[allow(dead_code)]`): unearned surface a reviewer rightly flags.
Remove it. The carrier is now {branch, base} — exactly what schema-once + the open
collapse use. Step 5 (PublishPlan unification) makes WriteTxn the non-optional
publish carrier and is the right home for session-aware base opens, where the
warm-session benefit on the single remaining open — an object-store (S3) phenomenon,
invisible on local FS — can be earned by its own cost gate rather than carried dead
through this PR.
No behavior change; both cost gates stay green (schema 3/2, data_open_count=1).
* docs(rfc-013): mark step 3b DONE — schema-once + open-collapse shipped, session deferred to step 5
* docs(rfc-013): capture the write-base-staleness convergence (§1d)
Three findings this cycle share one root — the write base is a stale, un-probed,
un-classified pin (the read path probes; the write path returns the warm
coordinator snapshot):
- #298 edge-@card stale-read regression (cursor High / codex P1, VALID): collapse #1
made the cardinality scan read txn.base instead of live HEAD, so a concurrent edge
is uncounted and a max can be exceeded. Fix on #298: restore the live-HEAD read +
deterministic test + correct the single-writer doc comment.
- The structural liability underneath: no unified write-validation read-set —
endpoint/cardinality/uniqueness each pick freshness ad hoc (warm/pinned/live),
the same cardinality check forks mutation-vs-loader, none re-validated at commit.
- The served-strict-write stale-view false-fail (validated on prod + a #[ignore]
repro): a strict update/delete false-fails ExpectedVersionMismatch after an external
optimize advance — the write-side mirror of #297/§6.6. The naive blanket probe is
proven wrong (breaks the cross-process lost-update OCC contract).
All three converge on Design A (step 5): open_txn's warm probe makes the base fresh,
the op-class-aware precondition (derive maintenance vs logical from Lance per-version
transaction metadata — no parallel marker) fast-forwards maintenance and fails logical,
and §7.1's read-set-in-CAS unifies + re-validates the validation read-set. §8 records
the #298 follow-up, the widened §7.1 scope, and the step-5 two-test acceptance contract.
* test(engine): RED — edge @card must scan live HEAD, not stale txn.base (#298)
Regression guard for the cursor-High/codex-P1 finding on #298: 3b's collapse #1
made the non-strict edge-insert cardinality scan read the pinned txn.base instead
of live HEAD (edge_cardinality_read_handle), so a concurrent edge committed after
txn capture is uncounted and a @card max is silently exceeded (invariant 9).
Deterministic two-handle test (no failpoint): handle A commits WorksAt(Alice->Acme)
to the @card(0..1) max; stale handle B (never read since) inserts a second WorksAt
for Alice. B's coordinator is stale by construction (the write path doesn't probe),
so B scans txn.base (Alice has 0) and wrongly commits the 2nd edge. RED: the insert
that must be rejected currently succeeds (panics at unwrap_err). Goes green when the
scan reads live HEAD.
* fix(engine): scan live HEAD for edge @card, not the pinned txn.base (#298)
3b's collapse #1 skips the non-strict edge accumulation open, so
edge_cardinality_read_handle reopened the edge table at the pinned txn.base for the @card scan. Since
cardinality is validated once (never rechecked at commit), a concurrent edge committed
after txn capture was uncounted and a @card max could be silently exceeded (invariant
9) — the cursor-High/codex-P1 regression on #298. Pre-3b the scan read live HEAD (the
mutation's own open_dataset_head_for_write handle).
Restore the live-HEAD read: take the table LOCATION from the pinned entry (stable
across versions) and open the dataset at its current HEAD via
open_dataset_head_for_write. Gate-safe: the data_open_count / merge-insert-only gates are node inserts; the
edge cardinality path (non-default @card only) is untouched by them, and the extra
live-HEAD open is exactly the pre-3b shape. Also drops the dead None-fallback's schema
re-validation (greptile P2, auto-resolved). The residual validate->commit TOCTOU is the
pre-existing §7.1 gap (RFC-013 step 4), recorded in handoff §1d/§8.
Turns cardinality_rejected_for_stale_handle_after_concurrent_edge_commit green;
validators / write_cost / writes / consistency / end_to_end / branching all green.
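The stale-read failure and its fix can be illustrated with a toy versioned edge table. Everything here is a hypothetical model (`EdgeTable`, `check_card_max`, full-copy versions); it only mirrors the scenario from the regression test: the scan version decides whether a concurrently committed edge is counted.

```rust
/// Toy edge table: each version holds the full edge set as (src, dst) pairs.
pub struct EdgeTable {
    versions: Vec<Vec<(String, String)>>,
}

impl EdgeTable {
    /// Starts with one empty version (version 0).
    pub fn new() -> Self {
        EdgeTable { versions: vec![Vec::new()] }
    }

    pub fn head_version(&self) -> usize {
        self.versions.len() - 1
    }

    /// Commits one edge as a new version and returns that version.
    pub fn commit_edge(&mut self, src: &str, dst: &str) -> usize {
        let mut next = self.versions[self.head_version()].clone();
        next.push((src.to_string(), dst.to_string()));
        self.versions.push(next);
        self.head_version()
    }

    fn out_degree_at(&self, version: usize, src: &str) -> usize {
        self.versions[version]
            .iter()
            .filter(|(s, _)| s.as_str() == src)
            .count()
    }

    /// Would adding one more edge from `src` exceed `max`, judged at `scan_version`?
    pub fn check_card_max(&self, scan_version: usize, src: &str, max: usize) -> Result<(), String> {
        let existing = self.out_degree_at(scan_version, src);
        if existing + 1 > max {
            Err(format!("@card max {} exceeded for {} ({} existing)", max, src, existing))
        } else {
            Ok(())
        }
    }
}
```

Scanning at the version pinned before the concurrent commit admits the second edge; scanning at `head_version()` rejects it, which is the behavior the fix restores.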
* docs(dev): link handoff docs from index
* docs(engine): tighten 3b claims to match the code (#298 review)
Review caught several comments/docs overclaiming what the code does (the session
drop + the #298 cardinality fix left stale/too-strong wording). No logic change.
- open_write_txn doc: drop the stale "shared per-graph Session" (WriteTxn no longer
carries one); scope "once" to the table-touch hot path and note edge/load RI
validation still re-resolves (→ step 4 §7.1) + the session-aware open is step 5.
- edge cardinality call-site comment: it said the scan uses a "pinned txn.base" — it
now opens LIVE HEAD (#298); corrected.
- write_cost.rs: "opens the base once (with the shared Session)" → session-aware base
open is deferred to step 5.
- data_open_count completeness (instrumentation.rs + write_cost.rs): forbidden_apis
only keeps engine code OUTSIDE the storage layer on the chokepoints; table_store.rs
is allow-listed and holds direct Dataset::opens for branch-management ops (not the
keyed-write hot path the gate measures). Narrowed the claim accordingly.
- handoff §4: "schema once / open once" is the node hot path (the two gates); edge
endpoint + loader RI/cardinality still re-validate and read warm — #298 un-regresses
cardinality only, it does NOT close write-validation freshness (that's step 4 §1d/§7.1).
build clean; write_cost / validators / forbidden_apis green.
Handoff: finishing RFC-013 (write-path latency + correctness)
Status: living handoff. Source of truth is rfc-013-write-path-latency.md;
this doc is the *current-state map + the decisions/validation from the latest work cycle
+ the concrete next actions*. When they disagree, the RFC wins (and fix this doc).
Audience: the engineer/agent who picks up RFC-013 next.
0. TL;DR — where we are and what's next
RFC-013 makes the write path fast and correct on object storage (217 Lance tables
under one __manifest catalog, on R2/S3). It is sequenced as steps; read §9 of the RFC
for the canonical list. Current reality:
Landed on main:
- Step 1: Tier-1 cost gate + the shared `helpers::cost` harness (#288).
- Step 3a: opener bypass. Write opens go direct (`Dataset::open` by URI + version) instead of the Lance-namespace builder (#288). This already banked the dominant depth win (see §2 below); it reframes everything.
- Step 2a: internal-table compaction. `optimize` now compacts `__manifest`/`_graph_commits`/`_graph_commit_actors` (#291). Plus the RFC latency-model correction (#292).
- Optimize-vs-write race: optimize survives a cross-process write race on the same table (#297, LANDED at origin/main `6d4606a8`; see §6 for why it's not redundant with Design A). Step 3b stacks on top of this.
Open PRs (land these; relationships in §7):
- #296 `correctness-by-design-fix`: recovery roll-forward converges on a concurrent manifest advance (this is the fix for the flaky `iss-schema-apply-reopen-recovery-race`).
- #295 `docs/rfc-013-step-3b`: the step-3b RFC doc.
- #254 `ragnorc/bug-4-schema-apply-occ`: schema-apply vs optimize false-fail (same op-class family as #297, logical side).
Step 3b is DONE (capture-once WriteTxn, schema-once + open-collapse; see §4) on
rfc-013-step-3b-writetxn-v2. Next: Phase 7 (step 4), then the big one — Design A /
PublishPlan unification (step 5) — see §5, the convergent fix for the bug class this
area keeps generating, which also absorbs 3b's deferred session-aware write opens.
1. The corrected mental model (read this before touching anything)
Three reframes from the latest cycle that the older RFC prose may not fully reflect:
1a. 3a already won the depth fight → the residual is constant-factor + RTT
Before 3a, the write re-opened each table through the lance-namespace builder ~13×, and
that path was O(depth) (it re-opened __manifest + list_table_versions per open —
not a Lance back-walk; the root cause was OmniGraph's own namespace round-trips, not
Lance — validated against Lance source). 3a swapped it for the direct opener, which is
O(1) (from_uri(loc).with_version(N) = arithmetic path + one HEAD). So:
- The dominant O(depth) data-table term is gone.
- Step 2a flattened the secondary internal-table scan term.
- What remains is the ~110-hop serial backbone × RTT + compute, a constant in depth. The latency model is **`wall = (serial_hops + ops/effective_concurrency)·RTT + compute`**; on a capped store (R2) the op-count term re-enters wall-clock, on an unlimited store it parallelizes away. Measured: prod one-row write 27→15.76s after 2a; the remaining 15.76s is the serial backbone (step 3b's target, not step 2's).
- Step 3b's win is therefore the call-count/RTT collapse (redundant opens, the flat-46 schema reads), NOT a depth slope. Don't expect a depth-slope improvement from 3b; gate it on the constant-factor (S3 round-trips), not a curve.
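The latency model above can be written as a small function to make the two regimes concrete. This is a sketch only: the function name and every numeric input in the usage note are hypothetical, chosen for easy arithmetic rather than taken from measurements.

```rust
/// wall = (serial_hops + ops / effective_concurrency) * RTT + compute
///
/// All arguments are illustrative; times are in seconds.
pub fn wall_seconds(
    serial_hops: f64,
    ops: f64,
    effective_concurrency: f64,
    rtt_s: f64,
    compute_s: f64,
) -> f64 {
    (serial_hops + ops / effective_concurrency) * rtt_s + compute_s
}
```

With hypothetical inputs of 110 serial hops, 200 parallelizable ops, an RTT of 0.125 s, and 1.5 s of compute, an effective concurrency of 4 gives (110 + 50) × 0.125 + 1.5 = 21.5 s, while an effective concurrency of 200 gives (110 + 1) × 0.125 + 1.5 = 15.375 s. The serial-hop term is untouched by concurrency, which is why collapsing redundant round-trips is the lever that remains.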
1b. Two op classes, two commit models (the §6.6 principle)
Every concurrency bug in this area is one op class using the other's commit model:
| class | examples | commutes? | correct commit model |
|---|---|---|---|
| maintenance | compaction (Rewrite), optimize_indices | yes (content-preserving) | Lance native rebase + app reopen/replan on real overlap + monotonic manifest fast-forward; no epoch, no read-set |
| logical mutation | load / mutate / merge / delete | no (lost-update, write-skew) | strict cross-process OCC: read-set + write-set CAS under the writer_epoch fence |
Applying strict OCC + equality-CAS uniformly is the mistake: too strong for maintenance (false conflicts — #297's bug), too weak for logical cross-process (§6.5 corruption).
1c. The root liability (what keeps generating these bugs)
Lance gives per-table atomic commits but no cross-table/cross-step atomicity, so
every multi-commit op advances per-table Lance HEAD before the manifest references it
(the "A-before-B window"). The resulting HEAD vs manifest delta is ambiguous
(external drift? my own in-flight work? a crashed writer?), and many uncoordinated code
paths each re-interpret it (4 writers + the maintenance path + recovery + the write-path
drift guard). Each interpreter is a fresh chance to misclassify. That is the bug class:
- §6.5 cross-process logical corruption,
- #297's own-HEAD-drift misclassification,
- the flaky write-path "HEAD ahead of manifest, run repair" guard,
- the recovery classifier edges.
The convergent fix is Design A (one publish authority — step 5); Lance MTT eventually retires the window entirely. See §5.
1d. The second facet: the write base is a stale pin (no probe)
The READ path resolves its base behind a freshness probe (resolve_target_inner
omnigraph.rs:~1072 → probe_latest_incarnation → refresh_manifest_only); the WRITE path
does NOT (resolved_branch_target omnigraph.rs:~778 returns the warm coord.snapshot() for
the bound branch, no probe). So a long-lived server's write base lags the live manifest. That
single staleness feeds two distinct failure modes, both surfaced this cycle:
- Stale validation reads → integrity under-enforced. Write-path RI checks read committed state off the stale base. 3b's collapse #1 made it worse for edge `@card`: `edge_cardinality_read_handle` (mutation.rs:~614) scans the pinned `txn.base` instead of live HEAD (was live HEAD pre-3b), so a concurrent edge committed after `txn` capture is uncounted → a `@card` max can be exceeded (cursor High / codex P1 on #298, VALID). #298 fix: restore the live-HEAD read for that scan (un-regress; gate-safe, since the `data_open_count` gate is a node insert) + a deterministic regression test (commit A's edge, then B validates → must see A) + correct the wrong "pinned base == live HEAD" doc comment (mutation.rs:~605-613, which assumes a single writer). The structural liability underneath is that there is no unified write-validation read-set: endpoint (`ensure_node_id_exists`, warm `snapshot_for_branch`), cardinality (mutation: pinned `txn.base`; loader: warm `snapshot_for_branch`; the SAME check forks per write path), commit drift guard (live `fresh_snapshot_for_branch`), and uniqueness (`enforce_unique_constraints_intra_batch`, intra-batch only; cross-version uniqueness is a documented gap). Three freshness levels chosen ad hoc, none re-validated at commit → the §7.1 TOCTOU class, and each new constraint forks the pattern again.
- Stale OCC pin → false-fail on a maintenance advance. A served strict update/delete pins the stale base version, then false-fails `ExpectedVersionMismatch` after an external `optimize` advanced `__manifest`, even though the advance was content-preserving compaction the logical write should fast-forward past (invariant 7). It's the write-side mirror of #297/§6.6 (#297 made optimize fast-forward past a logical write; this is a logical write that must fast-forward past optimize). A served read clears it (the read probes the shared coordinator). Validated repro on prod (omnigraph.ragnor.co) + `writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes` (`#[ignore]` on branch `fix/write-path-stale-view-probe`). The naive "just probe" fix is proven wrong: a blanket probe silently refreshes past logical advances too, breaking `consistency::stale_handle_public_mutation_must_refresh_then_retry` (the deliberate cross-process lost-update OCC primitive). The fix must discriminate by op class.
Both fold into Design A (step 5), same as §1c. open_txn's one warm probe makes the base
fresh (absorbs maintenance advances cheaply); the op-class-aware strict precondition —
derive from Lance's per-version transaction metadata (all Rewrite/ReserveFragments =
maintenance → fast-forward the pin; any Append/Update/Delete/Merge = logical → fail
loudly; NO parallel marker, invariant 1/15) — is the correctness fence for anything that lands
after. And the §7.1 read-set-in-CAS unifies the validation read-set + re-validates it under the
graph_head contention. So the stale-view false-fail, the cardinality/validation-read-set
liability, and #297's mirror are one bug (the write base is a stale, un-probed, un-classified
pin) with one home: the single PublishPlan delta-interpreter (§1c + §5). Strong corroboration
of Design A — three symptoms, one fix.
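The op-class-aware strict precondition described above can be sketched as a pure function over the operations that landed after the pin. The `Operation` enum below is a local stand-in that borrows the operation names from the text; it is not Lance's type, and `strict_precondition`, `Advance`, and `Precondition` are hypothetical names. How the per-version operations are read from Lance is deliberately left out.

```rust
/// Local stand-in for the per-version transaction operation kinds named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operation {
    Rewrite,
    ReserveFragments,
    Append,
    Update,
    Delete,
    Merge,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Advance {
    Maintenance,
    Logical,
}

/// Rewrite/ReserveFragments are content-preserving; everything else is logical.
pub fn classify(op: Operation) -> Advance {
    match op {
        Operation::Rewrite | Operation::ReserveFragments => Advance::Maintenance,
        Operation::Append | Operation::Update | Operation::Delete | Operation::Merge => {
            Advance::Logical
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Precondition {
    /// Every advance since the pin was maintenance: move the pin forward.
    FastForward { to_version: u64 },
    /// A logical write landed after the pin: fail loudly.
    ExpectedVersionMismatch { pinned: u64, first_logical: u64 },
}

/// `advances` lists (version, operation) for versions after `pinned`, in order.
pub fn strict_precondition(pinned: u64, advances: &[(u64, Operation)]) -> Precondition {
    for &(version, op) in advances {
        if classify(op) == Advance::Logical {
            return Precondition::ExpectedVersionMismatch { pinned, first_logical: version };
        }
    }
    let to_version = advances.last().map(|&(v, _)| v).unwrap_or(pinned);
    Precondition::FastForward { to_version }
}
```

A blanket probe corresponds to always returning `FastForward`; the discrimination step is what preserves the lost-update contract for logical advances.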
2. Validated facts — do NOT re-derive these
Established this cycle against Lance 7.0.0 source
(~/.cargo/registry/src/index.crates.io-*/lance-7.0.0) and current engine code. Cited so
you can trust them without re-investigating.
Lance (upstream):
- `from_uri(loc).with_version(N).load()` and `checkout_version(N)` are O(1) (computed V2 path `_versions/{u64::MAX-N:020}.manifest` + one HEAD; no listing/back-walk). (`lance-table/src/io/commit.rs` `default_resolve_version`.)
- A shared `Arc<Session>` (`DatasetBuilder::with_session`) warms metadata/index caches keyed by `(URI, version, e_tag)`. Caveat: the first manifest read on open is uncached; the Session warms the scan/index metadata, not the first open. `WriteParams` does carry a `session` field (`lance/src/dataset/write.rs`), but it only matters on the `WriteDestination::Uri` arm; OmniGraph's staged path always drives off an already-open `Dataset`, and Lance takes the store/session from that handle. So to attach the shared Session to a write base, open read-style (`open_table_dataset` → `from_uri().with_version().with_session()`) and drive the staged write off that handle.
- A held `Arc<Dataset>` at a pinned version is `Send + Sync`, immutable, safe to reuse for many scans/count/staged-write base in one txn (OmniGraph's `TableHandleCache` already relies on this).
- No compaction `RetryExecutor` (only Delete/MergeInsert/Update have one). `commit_compaction` commits a fixed `Rewrite` via `apply_commit` direct. In `commit_transaction`, a semantic `RetryableCommitConflict` escapes the retry loop via `?` at `io/commit.rs:979`; the loop only retries the OCC `CommitConflict` (`:1096`), and even that re-rebases the same transaction (never re-plans). ⇒ compaction needs app-level reopen+REPLAN; you cannot "set conflict_retries" and let Lance own it.
- `check_rewrite_txn`: a `Rewrite` rebases cleanly past a concurrent `Append`/disjoint `Update`/`Delete` (preserving both); only a same-fragment overlap yields a retryable conflict. ⇒ the common concurrent insert/update/delete is rebased for free; the app retry fires only on real overlap.
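To show why the pinned-version open needs no listing, here is the path arithmetic from the first bullet as a standalone function. The function name is hypothetical and this is not Lance's code; it only applies the `_versions/{u64::MAX-N:020}.manifest` formula quoted above.

```rust
/// Computes the V2 manifest path for a version using the formula quoted above:
/// the file name is (u64::MAX - version), zero-padded to 20 digits.
pub fn v2_manifest_path(version: u64) -> String {
    format!("_versions/{:020}.manifest", u64::MAX - version)
}
```

Because the name is derived from the version alone, resolving version N is a string computation plus one HEAD request. A side effect of the inverted numbering is that newer versions sort first lexicographically.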
Engine (internal):
- Read path (post-#268) already has the capture-once machinery: `Snapshot` (db/manifest.rs), warm `GraphCoordinator` behind a `latest_version_id`/incarnation probe, a held `TableHandleCache` keyed `(table,branch,version,e_tag)`, one shared `Session` per graph (`read_caches.session`). Writes bypass all of it by construction (`resolved_branch_target` returns `read_caches: None`; the 3a write opener attaches no session and opens by latest, not pinned version).
- A single write opens each table 3–4× (accumulation → staging reopen → commit drift-guard → publish prepare), each a fresh cold open. `validate_schema_contract` (db/schema_state.rs, via `ensure_schema_state_valid`) runs uncached (~3 `read_text` + 2 `exists`) at every resolve point (~the flat-46). Both are constant-factor, flat in depth: 3b's targets.
- Strict-op guards are the lost-update floor (3 layers: pre-stage `ensure_expected_version` in `table_store.rs`; commit-time strict drift in `exec/staging.rs`; publisher CAS in `publisher.rs`). Capture-once supplies the pinned operand; never remove a guard.
- Fork-on-first-write authority reads (`classify_fork_ref` → `fresh_snapshot_for_branch`) must stay fresh (not served from a pinned base).
- Cost harness: `helpers::cost` (`measure`/`measure_with_staged`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`). The schema-once assert can reuse `CountingStorageAdapter` (`warm_read_cost.rs::warm_query_validates_schema_contract_once`) with zero prod change; an open-count assert wants a small `open_count` AtomicU64 in `QueryIoProbes` (copy the `probe_count`/`record_probe` pattern). The forbidden-API guard (`tests/forbidden_apis.rs`) makes an instrumentation-level counter complete.
3. The #297 cycle (this branch) — what it is, and the lesson
fix-optimize-concurrency-race (5 commits): a CLI optimize racing a served write on the
same table failed (Lance Rewrite lost, or the equality-CAS publish lost). Fix: unify both
compaction paths on the internal path's reopen+replan shape, with a two-level retry
— outer loop reopens+replans on a real Lance overlap; inner Phase-C loop makes the manifest
publish a monotonic fast-forward (advance to compacted version N, or no-op when the
manifest already moved to ≥ N), never the strict equality CAS. Sidecar written once;
in-process queue kept as a contention reducer (not the cross-process guard); no writer_epoch.
Two review rounds surfaced two follow-on bugs I introduced with the retry loop; both are fixed and regression-tested (own-HEAD-drift via negative control):
- Own-HEAD-drift misclassification (`56d004e0`): the drift guard re-ran every iteration and, after a partial Phase-B commit (auto_cleanup strip or compact, then a later op conflicts), saw `HEAD > manifest` from our own covered work and deleted the sidecar + returned `skipped_for_drift` (stranding uncovered drift). Fix: track `head_advanced`; the drift guard fires only when `!head_advanced`.
- Publish exhaustion spurious error (`e9d16a2c`): the publish loop returned `Err` on its final retry even if the conflict meant a concurrent writer already published `≥ N` (postcondition met). Fix: re-check `current >= state.version` on exhaustion.
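The monotonic fast-forward publish, including the exhaustion re-check, can be sketched against a single atomic standing in for the manifest version. This is an in-memory illustration only: `publish_fast_forward`, the `Publish` enum, and the retry bound are hypothetical, and the real publish is a cross-process manifest commit rather than an `AtomicU64`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Debug, PartialEq, Eq)]
pub enum Publish {
    /// We moved the manifest to the target version.
    Advanced,
    /// The manifest was already at or past the target: a no-op, not an error.
    AlreadyAtOrPast,
}

/// Advance `manifest` to `target`, or no-op if it already moved to >= target.
/// Never a strict equality CAS against the version we started from.
pub fn publish_fast_forward(
    manifest: &AtomicU64,
    target: u64,
    max_retries: usize,
) -> Result<Publish, String> {
    for _ in 0..max_retries {
        let current = manifest.load(Ordering::Acquire);
        if current >= target {
            return Ok(Publish::AlreadyAtOrPast);
        }
        // Lost races loop and re-read; a concurrent writer may have met the postcondition.
        if manifest
            .compare_exchange(current, target, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return Ok(Publish::Advanced);
        }
    }
    // Exhaustion: re-check the postcondition before reporting failure.
    if manifest.load(Ordering::Acquire) >= target {
        Ok(Publish::AlreadyAtOrPast)
    } else {
        Err(format!("publish to version {} exhausted retries", target))
    }
}
```

The final re-check is the part that corresponds to the second fix: running out of retries is only an error when the manifest is still behind the target.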
The lesson (write it on the wall): wrapping a sequence of side-effecting commits in a
retry silently converts every "checked once, before any side effect" precondition into
"re-checked after partial side effects." That's a distinct bug class; it needs
fault-injection tests at each commit boundary, not just end-to-end concurrency tests.
(The optimize.before_compact / optimize.inject_reindex_conflict failpoints exist for
exactly this.)
Temporary mechanism flag: head_advanced is an in-memory proxy for "is this HEAD
movement mine." Under Design A the authority answers that from the plan/sidecar identity
— so head_advanced is the part that gets replaced, while the monotonic-publish +
reopen/replan semantics are permanent. (Noted in RFC §6.6.)
4. DONE: Step 3b — capture-once WriteTxn (shipped on rfc-013-step-3b-writetxn-v2)
Delivered: on the table-touch hot path, a single mutate/load validates the schema
contract once and opens each touched data table at most once — a constant-factor/RTT
win (not a depth-slope win; 1a). Two cost gates in write_cost.rs lock it (both on a node
insert): write_validates_schema_contract_once (3 read_text / 2 exists, was 12/9) and
keyed_insert_opens_table_at_most_once (data_open_count <= 1, was 4). The carrier is the
minimal WriteTxn { branch, base }, threaded as Option<&WriteTxn> (Some on the hot
mutate/load path, None byte-identical everywhere else); it converges into step 5's
PublishPlan.
Not "once" everywhere (scope, not regression): edge endpoint / cardinality RI validation
(ensure_node_id_exists, the loader's RI + cardinality) still resolves through
snapshot_for_branch and re-validates the schema — and reads warm, not live. Threading
txn.base there to make it "once" would re-introduce the stale-read class the #298 cardinality
fix removed (it now reads live HEAD). Doing schema-once and fresh reads for those validations
needs the unified, re-checked read-set — step 4 §7.1 (§1d). So #298 un-regresses
cardinality only; it does not close write-validation freshness. No edge-insert/load schema-once
gate yet (only the node gates above).
Commits (off merged-#297 main):
- Stage 0: scope open_count → data_open_count / internal_open_count by URI class (the review fix: open_dataset_tracked also opens __manifest / _graph_commits, so the raw counter conflated them and the gate was unreachable). Re-baselined RED 4.
- Commit A (schema-once): capture txn once at entry (the single validation); the 4 validation sites collapse: S1 (entry ensure_schema_state_valid) removed; S3a (open_for_mutation_on_branch) + S3b (prepare_updates_for_commit) source txn.base; S4 (commit_all) uses new fresh_snapshot_for_branch_unchecked (the OCC manifest re-read minus the schema re-validation). fresh_snapshot_for_branch{,_unchecked} now read the manifest directly via ManifestCoordinator (drops a spurious commit-graph exists probe; same Snapshot).
- Commit B (open collapse 4→1): #1 accumulation open ELIMINATED (the node path discarded the handle; read txn.base.entry().table_version); #2 staging open KEPT (the one open); #3 commit drift-guard reads live HEAD via entry.dataset.dataset().latest_version_id() (a cheap manifest-pointer probe off the staged handle, not a fresh open); #4 index build reuses the commit_staged handle threaded through CommittedMutation / prepare_updates_for_commit.
- Commit B.1 + cleanup: named the two positional returns (OpenedForMutation, CommittedMutation) + a debug_assert pinning the open-skip contract; removed the unearned WriteTxn.session field (the collapse uses skip/probe/reuse, not a session).
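The Stage 0 counter split can be illustrated with a small sketch. It assumes, purely for illustration, that internal tables are recognisable by the last path segment of their URI; OpenCounts, is_internal, and count_opens are hypothetical names, and the URIs in the usage note are made up.

```rust
/// Hypothetical probe counters, split by URI class.
#[derive(Debug, Default, PartialEq)]
pub struct OpenCounts {
    pub data_open_count: u32,
    pub internal_open_count: u32,
}

/// Assumption: an internal table is identified by its final path segment.
fn is_internal(uri: &str) -> bool {
    let last = uri.trim_end_matches('/').rsplit('/').next().unwrap_or("");
    matches!(last, "__manifest" | "_graph_commits")
}

impl OpenCounts {
    /// Record one open call against the class its URI belongs to.
    pub fn record_open(&mut self, uri: &str) {
        if is_internal(uri) {
            self.internal_open_count += 1;
        } else {
            self.data_open_count += 1;
        }
    }
}

/// Tally a sequence of opens into the two classes.
pub fn count_opens(uris: &[&str]) -> OpenCounts {
    let mut counts = OpenCounts::default();
    for uri in uris {
        counts.record_open(uri);
    }
    counts
}
```

A write that opens __manifest, _graph_commits, and one data table would show 3 on a raw counter but data_open_count == 1 here, which is why a "data opens <= 1" gate only becomes reachable after the split.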
RFC §4.1 corrections — how they resolved:
- Thread the evolving handle, not a version-keyed cache → realized as collapse #4 (carry the commit_staged handle forward into the index build).
- Don't forbid re-resolution → honored: the commit-time OCC re-read (fresh_snapshot_for_branch_unchecked: fresh manifest, only schema-revalidation dropped) and the fork-authority reads stay fresh.
- Minimal carrier → WriteTxn { branch, base } (even the session from the original sketch was dropped as unearned).
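The first correction (thread the handle that the commit returns instead of re-opening at the new version) can be sketched with a toy table. Table, Handle, and both write_* functions are hypothetical; the only point is the open count when HEAD advances between the staging commit and the index-build commit.

```rust
/// Hypothetical table handle pinned at one version.
#[derive(Debug)]
pub struct Handle {
    pub version: u64,
}

/// Hypothetical table that counts how often it is opened.
#[derive(Debug, Default)]
pub struct Table {
    head: u64,
    pub opens: u32,
}

impl Table {
    /// Open the table at its current HEAD (the expensive step).
    pub fn open(&mut self) -> Handle {
        self.opens += 1;
        Handle { version: self.head }
    }

    /// Commit on top of `handle` and return a handle at the new version.
    pub fn commit(&mut self, handle: Handle) -> Handle {
        assert_eq!(handle.version, self.head, "stale handle");
        self.head += 1;
        Handle { version: self.head }
    }
}

/// Re-open for the index build: the second open is needed because
/// the first handle's version no longer matches HEAD.
pub fn write_reopening(table: &mut Table) -> u32 {
    let staged = table.open();
    let _ = table.commit(staged);
    let for_index = table.open();
    let _ = table.commit(for_index);
    table.opens
}

/// Thread the commit-return handle into the index build: one open.
pub fn write_threaded(table: &mut Table) -> u32 {
    let staged = table.open();
    let committed = table.commit(staged);
    let _ = table.commit(committed);
    table.opens
}
```

A cache keyed on the version would behave like write_reopening here, since every commit moves HEAD to a version the cache has not seen.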
Deferred to step 5 (NOT in this PR): session-aware write base opens. The one remaining
open (#2) stays a HEAD open; warming the shared Session across writes is an object-store
(S3) phenomenon invisible on local FS, so it earns its own write_cost_s3.rs gate in step 5,
where txn becomes the non-optional publish carrier. No new concurrency test was needed here:
#2 stays a HEAD open (no pinned+session base introduced), so the publisher CAS + #3 live-HEAD
probe fences are unchanged (covered by the green writes.rs/consistency.rs).
Guardrails (don't regress): schema validation is deliberately uncached for drift
detection — collapse to 1 per write, never cache across writes on a long-lived handle
(lifecycle::long_lived_handle_rejects_schema_*). The commit-time fresh read is OCC
machinery, not redundancy. Keep all 3 strict-op guards. Keep fork-authority reads fresh.
Pin the correct branch (server-bound-to-main writing a feature branch falls to a fresh
open). A branch rfc-013-step-3b-writetxn exists off an earlier main; rebase onto the
post-#297 main before starting.
5. Design A — the PublishPlan unification (step 5) = the convergent fix
This is the real fix for the bug class in §1c. Collapse the four hand-rolled writers +
the maintenance path into one publish(txn, plan) authority where the CAS + bounded
retry is unconditional and unbypassable (no caller can "hold the queue → skip the CAS").
Properties:
- One interpreter of the HEAD vs manifest delta, and "is this my work?" is answered by the plan/sidecar identity, not a re-derived comparison. The own-HEAD-drift bug, the §6.5 writers, the write-path guard: all close by construction.
- Recovery = the same PublishPlan re-applied: the crash-recovery interpreter and the live interpreter become the same code (iss-merge-recovery-partial-rollforward gone).
- Each TableAction commits by its class (§1b): Rewrite = maintenance (Lance rebase + reopen/replan + monotonic fast-forward, no epoch); load/mutate = logical (strict OCC + writer_epoch).
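A rough sketch of what "one authority, commit by class" could look like, under heavy simplification: the manifest is modelled as a single AtomicU64 version, the Logical variant, PublishError, and MAX_RETRIES are hypothetical, and writer_epoch and plan/sidecar identity are left out entirely. It only shows that the CAS and bounded retry sit inside publish, so no caller can skip them, and that the two classes differ in their precondition.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-table action, tagged with its commit class.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TableAction {
    /// Maintenance class: monotonic fast-forward to `target`.
    Rewrite { target: u64 },
    /// Logical class: strict OCC against the pinned `base`.
    Logical { base: u64, target: u64 },
}

#[derive(Debug, PartialEq)]
pub enum PublishError {
    Conflict { expected: u64, found: u64 },
    RetriesExhausted,
}

/// Assumed retry bound; the real value is not stated in this document.
const MAX_RETRIES: usize = 8;

/// The single publish path: every action goes through the CAS loop.
pub fn publish(manifest: &AtomicU64, action: TableAction) -> Result<u64, PublishError> {
    for _ in 0..MAX_RETRIES {
        let current = manifest.load(Ordering::Acquire);
        let target = match action {
            TableAction::Rewrite { target } => {
                // Postcondition is the state: already at or past target is success.
                if current >= target {
                    return Ok(current);
                }
                target
            }
            TableAction::Logical { base, target } => {
                // Strict OCC: any movement away from the base fails loudly.
                if current != base {
                    return Err(PublishError::Conflict { expected: base, found: current });
                }
                target
            }
        };
        // The CAS is unconditional; losing the race re-reads and re-decides.
        if manifest
            .compare_exchange(current, target, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return Ok(target);
        }
    }
    Err(PublishError::RetriesExhausted)
}
```

In this model, re-applying a Rewrite after it has already landed returns Ok without regressing, which is the property that lets recovery reuse the live path for the maintenance class; answering "is this my work?" for logical writes would need the plan identity that the sketch omits.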
Why it composes with Lance MTT (don't over-build):
- The unification itself is convergent — when MTT lands, it slots underneath the same authority; nothing wasted. Build this.
- The writer_epoch is the one MTT-redundant piece (MTT's commit-handler lease subsumes a cross-process fence). Build it last and minimally, gated on actually deploying multi-writer topologies. Per the deny-list, don't reimplement what the substrate will own.
Sequencing judgment (this cycle's strongest signal): the bug density here (this PR alone = 3 review rounds, all "a writer re-interprets the delta") means the current N-writers interim is high integrated-over-time liability. Consider pulling the convergent half of step 5 (the single authority + recovery-as-plan) forward — possibly ahead of 3b — because it stops the bug class rather than patching instances. #297 + #254 are the de-risking inputs: they validate the maintenance-class and logical-class commit models in isolation first, so Design A implements a known spec rather than designing under refactor pressure. Do NOT build more substrate-shaped scaffolding (custom WAL / job queue / second coordination table) to paper over the window — strictly higher liability than either Design A or waiting for MTT.
Deeper-than-A (post-MTT or as Lance exposes uncommitted variants): all-uncommitted-fragments
+ one manifest commit would shrink the A-before-B window itself, blocked today by Lance not
exposing uncommitted variants for compact_files / optimize_indices / vector index (#6666 open;
delete #6658 shipped). Track, don't build yet.
6. Why #297 is still needed even if you do Design A
- Design A relocates #297's maintenance-class commit logic into the authority's TableAction::Rewrite path; it does not eliminate it. #297 is the validated spec + tests.
- The two regression tests + §6.6 are the contract Design A must keep green.
- The prod bug is live; Design A is the largest write-path change in the RFC. Don't hold a correctness fix hostage to a big refactor, and don't do a big refactor under bug-fix urgency.
- Genuinely throwaway under Design A: only the loop's location + the head_advanced proxy (~a dozen lines). Everything else relocates or persists. #297 LANDED.
7. Open PRs and their relationships
- #297: maintenance-class fix (optimize vs write). LANDED (origin/main 6d4606a8); step 3b stacks on it.
- #254: logical-class fix (schema-apply vs optimize false-fail). Same op-class family; both are de-risking inputs for Design A's per-class commit models.
- #296: recovery roll-forward converges on concurrent manifest advance. This is the fix for the flaky iss-schema-apply-reopen-recovery-race (the handoff in handoff-schema-apply-recovery-flake.md). It touches recovery.rs and is aligned with #297's "postcondition is the state, not winning the CAS" principle; reconcile the monotonic publish with #296's converge helper if #296 lands first.
- #295: the step-3b RFC doc (apply §4's three corrections to it).
8. Remaining RFC steps after 3b (RFC §9 is canonical)
- #298 follow-up (do on the 3b PR, before merge): the edge-@card stale-read regression (§1d.1). Restore the live-HEAD cardinality scan, add the deterministic regression test, fix the wrong doc comment. Small, gate-safe, un-regresses an integrity check (invariant 9). The residual concurrent TOCTOU is the §7.1 gap (step 4); un-widen here, don't over-reach.
- Step 4 / Phase 7 (iss-991): lineage into __manifest (publish graph_commit + mutable graph_head:<branch> in the same merge-insert; _graph_commits becomes a projection). Removes the per-write commit_graph.refresh; closes the manifest→commit-graph atomicity + commit-graph-parent-under-concurrency gaps. Hard prereq: step 2 (done). Carries the §7.1 concurrent write-skew fix (needs the graph_head contention row). Frame §7.1 as "unify the entire write-validation read-set" (endpoint + cardinality + cross-version uniqueness), not merely "add graph_head" (§1d.1): the bespoke edge_cardinality_read_handle and the mutation-vs-loader freshness fork dissolve into one pinned read-set re-validated under the graph_head contention, or the liability survives as a second special-case.
- Step 5 / Design A: §5 above. Acceptance item: the served-strict-write stale-view false-fail (§1d.2): the op-class-aware precondition + open_txn probe. The contract is two tests passing together: un-ignore writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes (goes green) while consistency::stale_handle_public_mutation_must_refresh_then_retry stays green (maintenance fast-forwards; logical fails loudly). Self-contained enough to ship standalone like #297 if prod pain is acute; otherwise fold into the single PublishPlan delta-interpreter.
- Step 2b: internal-table cleanup + the Q8 monotonic watermark (a Lance boundary tag). Deferred: only the secondary version-count/space term, touches the read/open path, and is MTT-redundant. Land when version-count cost bites.
- §7.1 sequential write-skew (iss-overwrite-orphans-committed-edges): inbound-RI validation on node removal; independent, ships anytime.
- #20: the prod per-write storage.ops span metric (RFC §5.3), still owed.
- Branch ops: Lance Clone for create (iss-691).
9. Gotchas / traps (learned the hard way)
- In-process queue ≠ cross-process lock. Any "I hold the queue → skip the retry/CAS" reasoning is a bug across processes. This is the recurring trap.
- Monotonic publish must be ≥-conditional, never "no assertion." The __manifest merge-insert is unconditional UpdateAll keyed on object_id (publisher.rs:379), so the equality (or monotonic) pre-check is the only guard; dropping it lets UpdateAll regress a newer version = lost write.
- The drift guard interprets an ambiguous delta. Re-evaluating it in a retry over self-mutated state is how #297's follow-on bug happened. Gate any HEAD-vs-manifest interpretation on "have we committed yet."
- compact_files fires Lance's auto_cleanup GC hook (commits with skip_auto_cleanup=false, no override); optimize strips stale lance.auto_cleanup.* config before compacting to stay non-destructive on upgraded graphs. The strip is a separate commit (relevant to the partial-commit retry trap).
- Lance rebases the common concurrent case for free, so the data-table conflict usually surfaces as the manifest fast-forward, not a Lance error. The Lance-Rewrite-overlap path is rare and needs failpoint injection to test.
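The lost-write trap behind the ≥-conditional rule is easy to reproduce in miniature. The sketch below models the manifest as an in-memory map from object_id to version; Manifest, update_all, and publish_monotonic are hypothetical names, and in a real store the check and the write would have to be one conditional operation rather than two steps as here.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the manifest: object_id -> version.
pub type Manifest = HashMap<String, u64>;

/// Unconditional upsert keyed on object_id: overwrites whatever is stored.
pub fn update_all(manifest: &mut Manifest, object_id: &str, version: u64) {
    manifest.insert(object_id.to_string(), version);
}

/// Guarded publish: apply only when the new version does not regress.
/// Returns the version stored after the call.
pub fn publish_monotonic(manifest: &mut Manifest, object_id: &str, version: u64) -> u64 {
    let current = manifest.get(object_id).copied().unwrap_or(0);
    if version >= current {
        update_all(manifest, object_id, version);
        version
    } else {
        // A newer version is already published; keep it.
        current
    }
}
```

Calling update_all with 9 and then 7 leaves the table at 7, silently discarding the newer write; publish_monotonic with the same sequence stays at 9.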
10. Verification (the gate)
- cargo test --workspace --locked: the canonical gate (matches CI).
- cargo test -p omnigraph-engine --features failpoints --test failpoints optimize: the optimize concurrency/recovery tests.
- cargo test -p omnigraph-engine --test write_cost / write_cost_s3 (bucket-gated): cost gates (3b adds the schema-once + open-count asserts here).
- cargo test -p omnigraph-engine --test maintenance: optimize/repair/cleanup.
- Re-read invariants.md, lance.md, testing.md before each change (always-on requirement).
Lance source for re-validation:
/Users/ragnor/.cargo/registry/src/index.crates.io-*/lance-7.0.0 (key files: io/commit.rs,
io/commit/conflict_resolver.rs, dataset/optimize.rs, dataset/write/retry.rs,
dataset/builder.rs).