omnigraph/docs/dev/handoff-rfc-013-write-path.md
Ragnor Comerford 1c5cb8741e
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
CI / Container Entrypoint (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled
Release Edge / Smoke Windows installer (push) Has been cancelled
feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299)
* docs(rfc-013): bank the #295 spec-review comments as step-5 constraints (§5.1)

3b shipped a minimal WriteTxn{branch,base} and deferred the full §4.1 opener
unification (pinned-base opener, shared Session, write-local handle cache,
strict-op conflict-timing move) to step 5. The greptile comments on the #295
spec were moot for #298 (none of those constructs were built) but are
load-bearing for step 5: (1) the handle cache must be Send+Sync (Mutex, not
RefCell); (2) the strict-op timing move needs an explicit retry contract — txn
discarded after any commit, retry re-opens a fresh base — which is the SAME
contract as the stale-view false-fail (§1d.2); (3) the opener-equivalence test
must advance HEAD externally then assert pinned-base, not the trivial HEAD==base.

* feat(engine): fold graph lineage into the __manifest publish CAS (RFC-013 Phase 7)

Graph lineage no longer lives in a second write to _graph_commits.lance. Each
commit's graph_commit + graph_head:<branch> rows now ride the SAME __manifest
merge-insert as the table-version rows (one atomic version), and CommitGraph reads
its cache from the manifest projection (read_graph_lineage). _graph_commits.lance is
no longer written commit rows (it remains only as a Lance branch-ref carrier).

Mechanism: a LineageIntent { graph_commit_id (ULID, minted once), branch, actor,
merged_parent, created_at } threads through ManifestBatchPublisher::publish. Inside
the publisher retry loop the parent is resolved per attempt from the just-loaded
branch-scoped manifest (the should_replace_head winner over the visible graph_commit
rows — branch-correct by Lance branch isolation; the graph_head row is written for
forward-compat + the §7.1 contention point but is not the parent source, so a
freshly-forked branch resolves the right fork-point parent). A CAS-conflict retry
re-reads the advanced head → correct new parent; the commit_id is stable across
retries.

Closes two known gaps BY CONSTRUCTION (one write, no second step to fail/ race):
- manifest→commit-graph atomicity (no crash window between manifest + lineage),
- commit-graph parent under concurrency (no refresh→append TOCTOU; the per-write
  commit_graph.refresh() is gone).

Recovery, branch-merge, and genesis route their lineage through the same CAS
(merge: one commit_merge_with_actor; recovery: publish_recovery_commit folds the
recovery commit, actor=omnigraph:recovery; genesis rides the init __manifest write).
The dead _graph_commits write helpers (append_commit/_merge/_actor) are
#[allow(dead_code)] (the actor sidecar table is still enumerated by optimize).

Verified (sequential): build clean; the new lineage_projection gate (manifest-only —
_graph_commits/_actors have 0 rows; full lineage reconstructs via the projection);
branching/merge_truth_table (exhaustive, branch-aware)/composite_flow/point_in_time/
changes/consistency/recovery; failpoints (59, incl. recovery lifecycle + the
now-closed atomicity gap); full --workspace. Cost tests REVERT to their pre-fold
values (writes +1, write_cost ceiling 80) — the proof of true single-CAS (no extra
write). invariants.md marks both gaps CLOSED.

PENDING (next stages, this PR): the §7.1 concurrent graph_head one-winner gate (stage
5 — two concurrent same-branch commits, exactly one wins); the stamp bump v4 +
migrate_v3_to_v4 backfill + read-only refuse for EXISTING graphs (stage 4); full
doc-sync of storage.md/architecture.md/writes.md.

* feat(engine): migrate existing v3 graphs to manifest lineage (RFC-013 Phase 7 stage 4)

The Phase-7 fold made CommitGraph read lineage from the __manifest projection, so a
pre-Phase-7 (internal-schema v3) graph — lineage in _graph_commits.lance, none in
__manifest — would read an empty commit DAG. Stage 4 makes existing graphs upgrade
seamlessly and not break reads.

- Stamp 3 -> 4 + migrate_v3_to_v4: bumps INTERNAL_MANIFEST_SCHEMA_VERSION and adds the
  3 => migrate_v3_to_v4 arm. The migration reads this branch's _graph_commits/_actors,
  emits one graph_commit row per commit + exactly one graph_head:<branch> for the head
  (should_replace_head winner, deterministic id-sort — no hash-map-order in migration
  output), merge-inserts into __manifest, then set_stamp(4) LAST. Idempotency guard
  first (read_graph_lineage non-empty -> just stamp); crash before set_stamp re-enters
  at v3 and the guard completes it. Does NOT touch the unenforced-PK metadata. Runs per
  branch: migrate_on_open backfills main; load_publish_state backfills each branch on
  its first write (root_uri/branch threaded through migrate_internal_schema).
- v3-read fallback: CommitGraph version-gates the lineage source — stamp < 4 reads the
  (re-activated) _graph_commits.lance; >= 4 uses the manifest projection. So a READ-ONLY
  open of an un-migrated graph reads correct history with no write. Correctness catch:
  the legacy _graph_commit_actors.lance was never branched, so the fallback reads it
  FLAT (no branch checkout) while checking out the branch only on the commits dataset.
- Read-only stamp-refuse: a ReadOnly open of a FUTURE-stamped graph now refuses with the
  same upgrade error (future-proofing the next format bump; the write path already
  refused via migrate_internal_schema).
- Docs: storage/architecture/writes/invariants/constants updated to manifest-stored
  lineage; release note docs/releases/v0.8.0.md (format v4, old writers clean-break,
  data preserved, upgrade writers first).

6 new tests (v3 backfill, idempotent, v3 read-only fallback, future-stamp refuse in both
modes, crash-before-stamp completes, legacy branch+flat-actor read). Full engine suite +
failpoints (59) + cargo test --workspace --locked green; check-agents-md passes.

* test(engine): graph_head concurrency gate — disjoint same-branch writers form a linear commit DAG (RFC-013 Phase 7)

Two (or N) writers committing disjoint tables on one branch still share the
mutable `graph_head:<branch>` manifest row, so the only row-level CAS
contention is that row. The contract — exactly one writer wins each CAS round;
the loser retries inside the publisher, re-resolves its parent off the
freshly-advanced head, and re-commits, so every writer lands and the
graph_commit DAG stays a single LINEAR chain (no fork) — had no acceptance
test. This adds it.

- concurrent_disjoint_writes_share_head_and_form_linear_chain: two disjoint
  writers + distinct LineageIntent, tokio::join!; both commit; the on-disk DAG
  is genesis -> c -> c' (asserted linear: exactly one genesis, no two commits
  share a parent, the head is the unique non-parent).
- n_concurrent_disjoint_writers_converge_to_one_linear_chain: N=8 disjoint
  writers each with an app-level retry loop (the publisher's internal budget
  can be exhausted under contention); all converge to one linear chain of 8.
- concurrent_disjoint_writes_form_linear_chain_on_s3: the same race on a real
  object store (true conditional-put CAS), bucket-gated.

Cites both tests from the §7.1 contention note in invariants.md.
Test-only; no production change.

* perf(engine): fold the lineage parent scan into the publish path's single __manifest scan (RFC-013 P2)

Each lineage publish scanned `__manifest` twice: `load_publish_state` read
table state via one scan, then `resolve_lineage_rows` did a second full
`read_graph_lineage` scan only to find the parent commit. Fold the
`graph_commit` extraction into the existing scan.

- `read_manifest_scan` gains a `collect_lineage` flag. The publish path
  (`read_publish_scan`) collects the `graph_commit` rows in the same pass; the
  table-state hot path leaves them in the forward-compat skip arm, so it never
  pays the O(commits) lineage JSON decode (it also skips reading the
  `object_id` column entirely). One shared `decode_graph_commit_row` serves
  both the folded path and the standalone `read_graph_lineage`, so the two
  cannot drift.
- `resolve_lineage_rows` is now sync and takes the already-parsed rows; the
  per-attempt re-read is preserved because `load_publish_state` runs once per
  CAS attempt, so a retry still re-parents off the advanced head.
- `load_publish_state` returns a named `LoadedPublishState` instead of a
  four-tuple; the thin `read_registered_table_locations` /
  `read_tombstone_versions` accessors fold away. `read_manifest_entries` becomes
  `#[cfg(test)]`: the fold removes its last production caller, leaving only the
  test-only namespace module (`db/manifest.rs`: `#[cfg(test)] mod namespace`),
  so gating it keeps it from becoming dead code in non-test builds.

Measured at depth ~5: per-write `__manifest` reads drop 44 -> 26 (total reads
54 -> 36). write_cost.rs gains a `manifest_reads <= 34` sub-ceiling that trips
if a publish-path scan is re-added, and its calibration comment is corrected.

* test(engine): red — transient legacy-open failure silently completes the v3→v4 migration

A pre-Phase-7 (internal schema v3) graph keeps its graph lineage in
`_graph_commits.lance`; the v3→v4 internal-schema migration backfills it into
`__manifest` and stamps v4. `read_legacy_commit_cache` currently maps EVERY
`Dataset::open` error to "no legacy data" (`Err(_) => empty`), so a transient or
corrupt open during the one-time migration backfills nothing and still stamps
v4 — orphaning the real lineage permanently (the migration runs once; the v3
fallback is then disabled).

Add a `migration.v3_to_v4.legacy_open` failpoint that injects a non-not-found
Lance error at the legacy open, and a fault-injection regression test in the
`failpoints` binary. Against the current swallow the migration completes anyway,
so the test fails on its "migration must abort" assertion — the predicted
symptom. The fix follows in the next commit.

Test support reachable from the `failpoints` integration binary (it compiles the
crate without `cfg(test)`): the v3-fixture helpers and a stamp/row-count reader
are gated `cfg(any(test, feature = "failpoints"))`, still excluded from release
builds. Failpoint tests stay in the integration binary because the fail registry
is process-global.

* fix(engine): propagate non-not-found legacy-open errors in the v3→v4 migration

`read_legacy_commit_cache` mapped EVERY `Dataset::open` error to an empty cache
(`Err(_) => empty`) on both the legacy commits dataset and its actor sidecar. The
v3→v4 internal-schema migration reads this once before stamping internal-schema
v4; a transient or corrupt open therefore backfilled nothing and stamped v4
anyway, orphaning the graph's real lineage permanently (the migration runs once,
and the stamp-gated v3 fallback is disabled at v4). This is the "no silent
failures" deny-list violation, and realistic on object storage.

Both opens now match the not-found variants — Lance maps an object-store NotFound
to `DatasetNotFound` — as the benign "no legacy data" / "no authors" signal, and
propagate anything else as a loud error. The two arms share the variant contract
but carry different rationale (commits-absent is the legitimate empty signal;
actor-sidecar-absent is benign, but a corrupt actor open silently wiping
authorship before stamping v4 is the same loss hole), commented at each site.

Pinned by the `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant`
guard (turns red if a Lance bump changes the absence variant) and greens the
fault-injection regression test from the previous commit.

* test(engine): cover the per-branch v3→v4 migration against a real Lance branch

`seed_legacy_v3_lineage` writes every commit (including the "feature"-tagged one)
to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field, so the
production per-branch migration path — `read_legacy_commit_cache` checking out a
real Lance branch, and a branch-scoped `__manifest` — was never exercised.

Add `seed_legacy_v3_lineage_with_branch`, which forks a real `feature` Lance
branch on BOTH `_graph_commits.lance` and `__manifest` (the branch inherits
main's stripped v3 state), and a test that migrates the BRANCH and asserts the
branch's lineage lands in the BRANCH's `__manifest` (genesis + A + branch commit,
`graph_head:feature` → branch commit, parents + actors intact) with main's
`__manifest` untouched.

This empirically resolves the open question behind the merge robustness work: the
fast-path `read_graph_lineage(dataset)` has no `manifest_branch` filter, but
`__manifest` is Lance-branched per graph-branch, so a branch reads only its own
lineage — the test confirms migrating one branch does not leak into another. No
branch filter is needed.

* refactor(engine): type the lineage-backfill merge conflict via the publisher classifier

`state::merge_lineage_rows` (the v3→v4 lineage backfill's standalone `__manifest`
merge-insert) stringified its `execute_reader` error, discarding the Lance
variant. Route it through the publisher's `map_lance_publish_error` (now
`pub(crate)`) so a concurrent first-open's row-level CAS loss surfaces as the
SAME typed `OmniError::Manifest{ details: RowLevelCasContention }` the publisher's
own retry consumes — one vocabulary, no raw-Lance matching in the migration.

Deliberately NOT unified with `optimize::is_retryable_lance_conflict`: that
classifier also matches `CommitConflict`/`RetryableCommitConflict` from the
compaction commit path, which a row-level merge-insert never emits. Cross-linked
with a comment at both sites.

Behavior-preserving: the only path that changes is the error TYPE on a CAS loss
(previously an opaque `Lance` string, now a typed conflict); no success/failure
outcome changes. The bounded re-open retry that consumes the new type lands next.

* test(engine): red — concurrent v3→v4 migrations error instead of converging

`migrate_v2_to_v3` is concurrent-runner idempotent by design; v3→v4 regressed it.
`merge_lineage_rows` uses `conflict_retries(0)` and `migrate_v3_to_v4` has no
app-level retry, so when two processes open the same legacy graph at once the
backfill's row-level CAS loser errors the whole open instead of converging.

The test opens two `__manifest` handles at the same pre-migration (v3,
empty-lineage) HEAD and runs both `migrate_internal_schema` calls under
`tokio::join!`, forcing the `graph_head:main` CAS to fire every run. Against the
current code the loser fails with `RowLevelCasContention` ("Attempted 0
retries.") — the predicted symptom — so the "both must converge" assertion
panics. The bounded re-open retry that makes both converge lands next.

* fix(engine): make the v3→v4 lineage backfill converge under concurrent runners

`migrate_v2_to_v3` is concurrent-runner idempotent; v3→v4 was not. Two processes
(or open-for-write handles) opening the same legacy graph at once both reach the
backfill merge, and `merge_lineage_rows`'s `conflict_retries(0)` made the
row-level CAS loser error the whole open instead of converging.

Two contention points, both now handled all-or-nothing:

1. The backfill merge on `graph_head:<branch>`. Wrap (fast-path re-read → read
   legacy → merge) in a bounded re-open retry loop: a `RowLevelCasContention` loss
   re-opens the manifest past the winner's (atomic) commit and re-loops; the
   fast-path re-read then sees the winner's lineage and stamps. On budget
   exhaustion it returns a `RowLevelCasContention`-typed error so the publisher's
   OUTER retry loop completes it. The retry decision reuses the publisher's
   `is_retryable_publish_conflict` so the two stay in lockstep.

2. The terminal stamp bump. Making the merge loser converge newly lets BOTH
   runners reach `set_stamp(4)` — an `UpdateConfig` commit on the same key — so the
   loser gets `lance::Error::IncompatibleTransaction` (NOT a row-level CAS, so the
   merge loop doesn't catch it). This surfaced only under the concurrent
   full-suite run, not the isolated test. Both write the SAME value, so the
   conflict is benign: `commit_v4_stamp_idempotently` re-opens and, if the stamp
   already reached the target, succeeds; else re-applies (bounded).

Greens the race test from the previous commit (3x isolated, 5x full-suite, no
flake). The new `IncompatibleTransaction` match is pinned by
`lance_surface_guards.rs::lance_error_incompatible_transaction_variant_exists`.

* fix(engine): refuse a future internal-schema stamp on the branch read path

`load_commit_cache_for_branch` dispatched on the branch's internal-schema stamp —
`< CURRENT` to the v3 legacy fallback, `>= CURRENT` to the manifest projection —
but never refused a `> CURRENT` branch stamp, so a newer-binary shape would be
misread by the projection rather than rejected.

Add `refuse_if_stamp_too_new(stamp)` (re-exported `pub(crate)` from `migrations`)
right after the branch stamp is read, mirroring the main read path's
`refuse_if_internal_schema_too_new`. This is defense-in-depth, not a live hole:
migrations run main-first (main migrates on open; each branch on its first write),
so main's stamp is always >= every branch's and the main path refuses first. The
guard closes the gap if that ordering invariant is ever weakened.

Tested by force-stamping a real branch past CURRENT and asserting the branch read
refuses with the upgrade error (the test misreads via the projection — returns Ok
— without the guard, confirmed by removing it).

* docs(rfc-013): record the v3→v4 migration robustness fixes

invariants.md Known Gaps: the `migrate_v3_to_v4` entry now states the migration is
loud on non-not-found legacy-open errors and concurrent-runner idempotent (bounded
re-open retry on the merge CAS + idempotent stamp bump), and that the branch read
path refuses a `> CURRENT` stamp.

lance.md: note the two new surface guards the migration depends on
(`dataset_open_missing_returns_not_found_variant`,
`lance_error_incompatible_transaction_variant_exists`).

testing.md: note the migration fault-injection test in the failpoints row.

* refactor: remove dead code and silence warnings across engine + cluster

Dead-code sweep follow-up to the RFC-013 stack. No behavior change.

- engine: delete the orphaned `validate_edge_cardinality` — the load path uses
  `validate_edge_cardinality_with_pending_loader` for every mode (including
  Overwrite, which it treats as the replacement table image), so the old
  standalone validator had no caller — and correct its sibling's now-stale doc
  reference. Gate `TableStore::append_batch` `#[cfg(test)]`: it is the inline-
  commit residual kept only for recovery test setup, with no non-test caller.
- cluster: drop unused imports in `lib.rs`, delete the unused
  `ClusterStore::payload_display`, and raise `LiveGraphObservation` /
  `GraphObservationJson` / `PolicyTarget` to `pub(crate)` to match the functions
  that return them.

Both lib crates now build warning-free.

* fix(engine): match Lance's typed DatasetAlreadyExists, not the message string

The internal create-or-open idempotency fallbacks in `db/commit_graph.rs` and
`db/recovery_audit.rs` classified the "already exists" race by
`err.to_string().contains("Dataset already exists")` — a Lance display string,
not an API contract. A wording change upstream would silently break the fallback
(a re-create would error instead of opening the existing table). Match the typed
`lance::Error::DatasetAlreadyExists { .. }` variant instead — the same discipline
as the v3→v4 migration's not-found classifier — pinned by the new
`lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists`
guard so a Lance rename turns red instead of silently regressing.

* refactor(engine): consolidate now_micros into one crate::db helper

Four `fn now_micros() -> Result<i64>` copies (commit_graph, recovery_audit,
graph_coordinator, manifest/graph) had already drifted: three mapped the
clock error to `OmniError::manifest("...UNIX_EPOCH...")` while recovery_audit
used `OmniError::manifest_internal("...unix epoch...")`. Replace all four with
one `pub(crate) fn now_micros()` in `db/mod.rs` (the majority `manifest`
variant), and repoint the eight call sites at `crate::db::now_micros()`. No
test asserts on the failure message, so unifying the variant is behavior-safe;
the timestamp-mapping contract can no longer fork across the rows it stamps.

* refactor(engine): drop the dead snapshot param from roll_back_sidecar

`roll_back_sidecar` took `snapshot: &Snapshot` only to discard it with
`let _ = snapshot;` — rollbacks now always publish (the restored HEAD plus a
recovery-commit lineage row), so the snapshot is never read to decide whether
to skip a publish. Remove the parameter, the two call-site arguments, and the
suppressor. A signature must not advertise inputs it does not consume. The
`Snapshot` import stays — `process_sidecar`, `roll_forward_all`, and
`record_audit_recovery_rollforward` still take it.

* test(engine): red — open_at_branch wedges a branch on a missing commit-graph ref

A v4 graph keeps its graph lineage in `__manifest` (RFC-013 Phase 7); the
`_graph_commits.lance` branch ref is a derived artifact. An interrupted
fork-reclaim or a `cleanup` race can drop that derived ref while the manifest
lineage stays intact. Per invariants 7 + 15 a missing derived ref must not fail
a logical read of the lineage.

This wedge builds a real v4 `feature` branch (its `graph_head:feature` row in
`__manifest`), force-deletes ONLY the `_graph_commits.lance` `feature` ref, then
asserts the branch reads (`open_at_branch` / list-commits / `merge_base`)
succeed from `__manifest` while a write that needs the derived ref
(`create_branch`) fails loudly with the typed actionable error.

Red against current code: `open_at_branch`'s hard `checkout_branch(branch)?` on
the missing ref errors `OmniError::Lance` (Lance "Not found:
_graph_commits.lance/tree/feature/_versions"), wedging the logical read.

* fix(engine): read manifest lineage independent of the derived _graph_commits ref

`CommitGraph::open_at_branch` did a hard `checkout_branch(branch)?` on the
`_graph_commits.lance` branch ref before reading lineage — so a missing derived
ref (an interrupted fork-reclaim, or a `cleanup` race) wedged the branch's
commit-list / merge-base / snapshot resolution even though the lineage is
readable from the authoritative `__manifest` (RFC-013 Phase 7). That is a
derived/physical artifact failing a logical read — invariants 7 and 15.

Make the held commits handle `Option<Dataset>` (mirroring `actor_dataset`).
`open_at_branch` and `refresh` check out the derived ref best-effort: a typed
not-found (`RefNotFound`/`NotFound`) yields a `None` handle while the read
re-syncs from `__manifest`; any other open error still propagates. The manifest
existence gate is unchanged — `load_commit_cache_for_branch` keeps its hard `?`,
so a truly absent branch still fails loudly at the manifest. `create_branch`
(the only writer that forks a ref) and the folded-in version lookup return a
loud, actionable error on `None`, deferring repair to `cleanup`'s existing
orphan reconciler rather than inlining a write on a read-side refresh. Reads
(`head_commit`/`load_commits`/`get_commit`/`merge_base`) never touch the handle.

Greens the wedge regression from the preceding commit.

* fix(engine): v3→v4 retry loops return retryable contention on exhaustion

`commit_v4_stamp_idempotently`'s retry loop used `0..=STAMP_RETRY_BUDGET`
(6 iterations) with an `attempt < STAMP_RETRY_BUDGET` guard, so the LAST
iteration's `IncompatibleTransaction` fell through to
`Err(e) => OmniError::Lance(...)` — stringified, non-retryable — instead of the
intended `RowLevelCasContention`, and the post-loop contention return was dead
code. The publisher's outer retry only re-runs `is_retryable_publish_conflict`,
so under sustained concurrent v3→v4 migration the one-time stamp bump could fail
instead of converging, defeating the idempotency the migration is supposed to add.

Fix the loop to `0..BUDGET` with an UNGUARDED `IncompatibleTransaction` arm: the
retryable variant is always handled inside the loop (re-open + same-value check +
retry), so it can never reach the stringifying catch-all, and the post-loop is the
SINGLE reachable exhaustion path — the typed `RowLevelCasContention`. The `Err(e)`
arm now catches only genuine non-contention errors. Apply the same range alignment
to the sibling merge loop in `migrate_v3_to_v4` (behaviorally correct today — its
`Err(err)` returns the already-typed contention — but it carried the identical
off-by-one structure the stamp loop was copied from; aligning both stops the next
copy from re-introducing it).

Test-first. The exhaustion path is otherwise near-unreachable — a real concurrent
winner stamps the same value, so the re-read returns Ok on the first retry — so a
new `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to
lose, driving exhaustion deterministically. Against the pre-fix loop the new
`v4_stamp_exhaustion_returns_retryable_contention` test goes red with
`Lance("Incompatible transaction: injected failpoint triggered…")`; with the fix it
asserts the typed `RowLevelCasContention`. Found by automated review on #299.

* feat(engine): minimum-supported internal-schema floor + retirement tripwire

The internal-schema migration chain (`migrate_internal_schema`) had a too-new
ceiling but no floor, so every old `migrate_vN_…` arm and the v3 legacy readers
it needs stay forever — the pile grows by one migration + readers + tests every
schema version. Add `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (1 today, a pure
no-op: `read_stamp` floors an absent stamp at 1 and no real graph carries 0) as
the oldest stamp this binary opens; raising it is how the chain sheds old code.

Collapse the one-sided `refuse_if_stamp_too_new` into `refuse_if_stamp_unsupported`
checking both bounds, so the floor lands at all three stamp-enforcement sites —
the write-path migrate dispatcher, the read-only open guard, and the branch
lineage-read path (`commit_graph.rs`) — via one compiler-enforced rename. A
hand-wired floor twin would have had to touch each site, and the branch-read path
is easy to miss; one combined guard cannot half-enforce. Rename the read-only
wrapper `refuse_if_internal_schema_unsupported` to match.

A compile-time tripwire (`const _: () = assert!(LOWEST_REGISTERED_MIGRATION_SOURCE
== MIN_SUPPORTED…)`) fails the build if a future floor bump forgets to delete the
now-dead migration arm (or vice versa) — stronger than a runtime test, impossible
to skip, and it doubles as the use that keeps the mirror const live.

Tests: a sub-floor graph is refused in both open modes (twin of
`future_stamp_is_refused_in_both_open_modes`); the guard accepts exactly
[MIN, CURRENT]. No behavior change for any real graph. The retirement runbook
lives on the `MIN_SUPPORTED` doc-comment + invariants.md.

* fix(engine): compose migration contention with publisher retry; precise recovery-converge audit commit

Three review-surfaced fixes on the RFC-013 Phase 7 path.

Publisher retry vs migration contention: `publish()` propagated a
`load_publish_state` error fatally via `?`, so a `RowLevelCasContention` surfaced
by the v3->v4 migration's exhausted merge/stamp budgets aborted the publish
instead of being retried — only `merge_rows` conflicts hit the retry. This
contradicted the migration's own design, which returns that typed error
EXPECTING the publisher to re-run the load (by which point a concurrent winner
has usually finished the migration, so the next scan is a no-op). Route a
retryable load error through the same retry path as a retryable `merge_rows`
conflict. Regression test (failpoints): a one-shot retryable contention injected
into `load_publish_state` now commits via the retry; red without the fix (the
write fails with the injected contention).

Recovery-converge audit commit id: `converge_or_defer_roll_forward` recorded the
branch HEAD as the audit row's `graph_commit_id`, but a concurrent user write can
advance `graph_head` past the recovery commit between the winner's publish and
this read — attributing the audit to a later, wrong commit. Use the latest
`RECOVERY_ACTOR`-authored commit (what `publish_recovery_commit` mints), which is
the recovery commit by construction. The audit's actor was already correct (it
comes from `sidecar.actor_id`, not the commit).

Dead param: drop the unused `snapshot` from `record_audit_recovery_rollforward`
(removing the `let _ = snapshot;` suppressor). `storage` stays — it is used to
delete the sidecar.
2026-06-25 13:55:34 +02:00

31 KiB
Raw Blame History

Handoff: finishing RFC-013 (write-path latency + correctness)

Status: living handoff. Source of truth is rfc-013-write-path-latency.md — this doc is the *current-state map + the decisions/validation from the latest work cycle

  • the concrete next actions*. When they disagree, the RFC wins (and fix this doc).

Audience: the engineer/agent who picks up RFC-013 next.


0. TL;DR — where we are and what's next

RFC-013 makes the write path fast and correct on object storage (217 Lance tables under one __manifest catalog, on R2/S3). It is sequenced as steps; read §9 of the RFC for the canonical list. Current reality:

Landed on main:

  • Step 1 — Tier-1 cost gate + the shared helpers::cost harness (#288).
  • Step 3a — opener bypass: write opens go direct (Dataset::open by URI + version) instead of the Lance-namespace builder (#288). This already banked the dominant depth win — see §2 below; it reframes everything.
  • Step 2a — internal-table compaction: optimize now compacts __manifest / _graph_commits / _graph_commit_actors (#291). Plus the RFC latency-model correction (#292).
  • Optimize-vs-write race — optimize survives a cross-process write race on the same table (#297, LANDED — origin/main 6d4606a8; see §6 for why it's not redundant with Design A). Step 3b stacks on top of this.

Open PRs (land these; relationships in §7):

  • #296 correctness-by-design-fix — recovery roll-forward converges on a concurrent manifest advance (the fix for the flaky iss-schema-apply-reopen-recovery-race). MERGED to main and integrated into this branch — the converge helper now threads Phase-7's manifest-CAS recovery graph_commit_id (see converge_or_defer_roll_forward).
  • #295 docs/rfc-013-step-3b — the step-3b RFC doc.
  • #254 ragnorc/bug-4-schema-apply-occ — schema-apply vs optimize false-fail (same op-class family as #297, logical side).

Step 3b is DONE (capture-once WriteTxn, schema-once + open-collapse; see §4) on rfc-013-step-3b-writetxn-v2. Next: Phase 7 (step 4), then the big one — Design A / PublishPlan unification (step 5) — see §5, the convergent fix for the bug class this area keeps generating, which also absorbs 3b's deferred session-aware write opens.


1. The corrected mental model (read this before touching anything)

Three reframes from the latest cycle that the older RFC prose may not fully reflect:

1a. 3a already won the depth fight → the residual is constant-factor + RTT

Before 3a, the write re-opened each table through the lance-namespace builder ~13×, and that path was O(depth) (it re-opened __manifest + list_table_versions per open — not a Lance back-walk; the root cause was OmniGraph's own namespace round-trips, not Lance — validated against Lance source). 3a swapped it for the direct opener, which is O(1) (from_uri(loc).with_version(N) = arithmetic path + one HEAD). So:

  • The dominant O(depth) data-table term is gone.
  • Step 2a flattened the secondary internal-table scan term.
  • What remains is the ~110-hop serial backbone × RTT + compute — a constant in depth. The latency model is **`wall = (serial_hops + ops/effective_concurrency)·RTT
    • compute`**; on a capped store (R2) the op-count term re-enters wall-clock, on an unlimited store it parallelizes away. Measured: prod one-row write 27→15.76s after 2a; the remaining 15.76s is the serial backbone — step 3b's target, not step 2's.
  • Step 3b's win is therefore the call-count/RTT collapse (redundant opens, the flat-46 schema reads), NOT a depth slope. Don't expect a depth-slope improvement from 3b; gate it on the constant-factor (S3 round-trips), not a curve.

1b. Two op classes, two commit models (the §6.6 principle)

Every concurrency bug in this area is one op class using the other's commit model:

class examples commutes? correct commit model
maintenance compaction (Rewrite), optimize_indices yes (content-preserving) Lance native rebase + app reopen/replan on real overlap + monotonic manifest fast-forward — no epoch, no read-set
logical mutation load / mutate / merge / delete no (lost-update, write-skew) strict cross-process OCC: read-set + write-set CAS under the writer_epoch fence

Applying strict OCC + equality-CAS uniformly is the mistake: too strong for maintenance (false conflicts — #297's bug), too weak for logical cross-process (§6.5 corruption).

1c. The root liability (what keeps generating these bugs)

Lance gives per-table atomic commits but no cross-table/cross-step atomicity, so every multi-commit op advances per-table Lance HEAD before the manifest references it (the "A-before-B window"). The resulting HEAD vs manifest delta is ambiguous (external drift? my own in-flight work? a crashed writer?), and many uncoordinated code paths each re-interpret it (4 writers + the maintenance path + recovery + the write-path drift guard). Each interpreter is a fresh chance to misclassify. That is the bug class:

  • §6.5 cross-process logical corruption,
  • #297's own-HEAD-drift misclassification,
  • the flaky write-path "HEAD ahead of manifest, run repair" guard,
  • the recovery classifier edges.

The convergent fix is Design A (one publish authority — step 5); Lance MTT eventually retires the window entirely. See §5.

1d. The second facet: the write base is a stale pin (no probe)

The READ path resolves its base behind a freshness probe (resolve_target_inner omnigraph.rs:~1072 → probe_latest_incarnationrefresh_manifest_only); the WRITE path does NOT (resolved_branch_target omnigraph.rs:~778 returns the warm coord.snapshot() for the bound branch, no probe). So a long-lived server's write base lags the live manifest. That single staleness feeds two distinct failure modes, both surfaced this cycle:

  1. Stale validation reads → integrity under-enforced. Write-path RI checks read committed state off the stale base. 3b's collapse #1 made it worse for edge @card: edge_cardinality_read_handle (mutation.rs:~614) scans the pinned txn.base instead of live HEAD (was live HEAD pre-3b), so a concurrent edge committed after txn capture is uncounted → a @card max can be exceeded (cursor High / codex P1 on #298, VALID). #298 fix: restore the live-HEAD read for that scan (un-regress; gate-safe — the data_open_count gate is a node insert) + a deterministic regression test (commit A's edge, then B validates → must see A) + correct the wrong "pinned base == live HEAD" doc comment (mutation.rs:~605-613, which assumes a single writer). The structural liability underneath: there is no unified write-validation read-set — endpoint (ensure_node_id_exists, warm snapshot_for_branch), cardinality (mutation: pinned txn.base; loader: warm snapshot_for_branch — the SAME check forks per write path), commit drift guard (live fresh_snapshot_for_branch), and uniqueness (enforce_unique_constraints_intra_batch, intra-batch only — cross-version uniqueness is a documented gap). Three freshness levels chosen ad hoc, none re-validated at commit → the §7.1 TOCTOU class, and each new constraint forks the pattern again.

  2. Stale OCC pin → false-fail on a maintenance advance. A served strict update/delete pins the stale base version, then false-fails ExpectedVersionMismatch after an external optimize advanced __manifest — even though the advance was content-preserving compaction the logical write should fast-forward past (invariant 7). It's the write-side mirror of #297/§6.6 (#297 made optimize fast-forward past a logical write; this is a logical write that must fast-forward past optimize). A served read clears it (the read probes the shared coordinator). Validated repro on prod (omnigraph.ragnor.co) + writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes (#[ignore] on branch fix/write-path-stale-view-probe). The naive "just probe" fix is proven wrong — a blanket probe silently refreshes past logical advances too, breaking consistency::stale_handle_public_mutation_must_refresh_then_retry (the deliberate cross-process lost-update OCC primitive). The fix must discriminate by op class.

Both fold into Design A (step 5), same as §1c. open_txn's one warm probe makes the base fresh (absorbs maintenance advances cheaply); the op-class-aware strict precondition — derive from Lance's per-version transaction metadata (all Rewrite/ReserveFragments = maintenance → fast-forward the pin; any Append/Update/Delete/Merge = logical → fail loudly; NO parallel marker, invariant 1/15) — is the correctness fence for anything that lands after. And the §7.1 read-set-in-CAS unifies the validation read-set + re-validates it under the graph_head contention. So the stale-view false-fail, the cardinality/validation-read-set liability, and #297's mirror are one bug (the write base is a stale, un-probed, un-classified pin) with one home: the single PublishPlan delta-interpreter (§1c + §5). Strong corroboration of Design A — three symptoms, one fix.


2. Validated facts — do NOT re-derive these

Established this cycle against Lance 7.0.0 source (~/.cargo/registry/src/index.crates.io-*/lance-7.0.0) and current engine code. Cited so you can trust them without re-investigating.

Lance (upstream):

  • from_uri(loc).with_version(N).load() and checkout_version(N) are O(1) (computed V2 path _versions/{u64::MAX-N:020}.manifest + one HEAD; no listing/back-walk). (lance-table/src/io/commit.rs default_resolve_version.)
  • A shared Arc<Session> (DatasetBuilder::with_session) warms metadata/index caches keyed by (URI, version, e_tag). Caveat: the first manifest read on open is uncached — the Session warms the scan/index metadata, not the first open. WriteParams does carry a session field (lance/src/dataset/write.rs), but it only matters on the WriteDestination::Uri arm; OmniGraph's staged path always drives off an already-open Dataset, and Lance takes the store/session from that handle. So to attach the shared Session to a write base, open read-style (open_table_datasetfrom_uri().with_version() .with_session()) and drive the staged write off that handle.
  • A held Arc<Dataset> at a pinned version is Send + Sync, immutable, safe to reuse for many scans/count/staged-write base in one txn (OmniGraph's TableHandleCache already relies on this).
  • No compaction RetryExecutor (only Delete/MergeInsert/Update have one). commit_compaction commits a fixed Rewrite via apply_commit direct. In commit_transaction, a semantic RetryableCommitConflict escapes the retry loop via ? at io/commit.rs:979; the loop only retries the OCC CommitConflict (:1096), and even that re-rebases the same transaction (never re-plans). ⇒ compaction needs app-level reopen+REPLAN; you cannot "set conflict_retries" and let Lance own it.
  • check_rewrite_txn: a Rewrite rebases cleanly past a concurrent Append/disjoint Update/Delete (preserving both); only a same-fragment overlap yields a retryable conflict. ⇒ the common concurrent insert/update/delete is rebased for free; the app retry fires only on real overlap.

Engine (internal):

  • Read path (post-#268) already has the capture-once machinery: Snapshot (db/manifest.rs), warm GraphCoordinator behind a latest_version_id/incarnation probe, a held TableHandleCache keyed (table,branch,version,e_tag), one shared Session per graph (read_caches.session). Writes bypass all of it by construction (resolved_branch_target returns read_caches: None; the 3a write opener attaches no session and opens by latest, not pinned version).
  • A single write opens each table 34× (accumulation → staging reopen → commit drift-guard → publish prepare), each a fresh cold open. validate_schema_contract (db/schema_state.rs, via ensure_schema_state_valid) runs uncached (~3 read_text
    • 2 exists) at every resolve point (~the flat-46). Both are constant-factor, flat in depth — 3b's targets.
  • Strict-op guards are the lost-update floor (3 layers: pre-stage ensure_expected_version table_store.rs; commit-time strict drift exec/staging.rs; publisher CAS publisher.rs). Capture-once supplies the pinned operand — never remove a guard.
  • Fork-on-first-write authority reads (classify_fork_reffresh_snapshot_for_branch) must stay fresh (not served from a pinned base).
  • Cost harness: helpers::cost (measure/measure_with_staged/IoCounts/assert_flat/ local_graph/s3_graph). The schema-once assert can reuse CountingStorageAdapter (warm_read_cost.rs::warm_query_validates_schema_contract_once) with zero prod change; an open-count assert wants a small open_count AtomicU64 in QueryIoProbes (copy the probe_count/record_probe pattern). The forbidden-API guard (tests/forbidden_apis.rs) makes an instrumentation-level counter complete.

3. The #297 cycle (this branch) — what it is, and the lesson

fix-optimize-concurrency-race (5 commits): a CLI optimize racing a served write on the same table failed (Lance Rewrite lost, or the equality-CAS publish lost). Fix: unify both compaction paths on the internal path's reopen+replan shape, with a two-level retry — outer loop reopens+replans on a real Lance overlap; inner Phase-C loop makes the manifest publish a monotonic fast-forward (advance to compacted version N, or no-op when the manifest already moved to ≥ N), never the strict equality CAS. Sidecar written once; in-process queue kept as a contention reducer (not the cross-process guard); no writer_epoch.

Two review rounds surfaced two follow-on bugs I introduced with the retry loop — both fixed, both regression-tested (own-HEAD-drift via negative control):

  1. Own-HEAD-drift misclassification (56d004e0): the drift guard re-ran every iteration and, after a partial Phase-B commit (auto_cleanup strip or compact, then a later op conflicts), saw HEAD > manifest from our own covered work and deleted the sidecar + returned skipped_for_drift (stranding uncovered drift). Fix: track head_advanced; the drift guard fires only when !head_advanced.
  2. Publish exhaustion spurious error (e9d16a2c): the publish loop returned Err on its final retry even if the conflict meant a concurrent writer already published ≥ N (postcondition met). Fix: re-check current >= state.version on exhaustion.

The lesson (write it on the wall): wrapping a sequence of side-effecting commits in a retry silently converts every "checked once, before any side effect" precondition into "re-checked after partial side effects." That's a distinct bug class; it needs fault-injection tests at each commit boundary, not just end-to-end concurrency tests. (The optimize.before_compact / optimize.inject_reindex_conflict failpoints exist for exactly this.)

Temporary mechanism flag: head_advanced is an in-memory proxy for "is this HEAD movement mine." Under Design A the authority answers that from the plan/sidecar identity — so head_advanced is the part that gets replaced, while the monotonic-publish + reopen/replan semantics are permanent. (Noted in RFC §6.6.)


4. DONE: Step 3b — capture-once WriteTxn (shipped on rfc-013-step-3b-writetxn-v2)

Delivered: on the table-touch hot path, a single mutate/load validates the schema contract once and opens each touched data table at most once — a constant-factor/RTT win (not a depth-slope win; 1a). Two cost gates in write_cost.rs lock it (both on a node insert): write_validates_schema_contract_once (3 read_text / 2 exists, was 12/9) and keyed_insert_opens_table_at_most_once (data_open_count <= 1, was 4). The carrier is the minimal WriteTxn { branch, base }, threaded as Option<&WriteTxn> (Some on the hot mutate/load path, None byte-identical everywhere else); it converges into step 5's PublishPlan.

Not "once" everywhere (scope, not regression): edge endpoint / cardinality RI validation (ensure_node_id_exists, the loader's RI + cardinality) still resolves through snapshot_for_branch and re-validates the schema — and reads warm, not live. Threading txn.base there to make it "once" would re-introduce the stale-read class the #298 cardinality fix removed (it now reads live HEAD). Doing schema-once and fresh reads for those validations needs the unified, re-checked read-set — step 4 §7.1 (§1d). So #298 un-regresses cardinality only; it does not close write-validation freshness. No edge-insert/load schema-once gate yet (only the node gates above).

Commits (off merged-#297 main):

  • Stage 0 — scope open_countdata_open_count/internal_open_count by URI class (the review fix: open_dataset_tracked also opens __manifest/_graph_commits, so the raw counter conflated them and the gate was unreachable). Re-baselined RED 4.
  • Commit A (schema-once) — capture txn once at entry (the single validation); the 4 validation sites collapse: S1 (entry ensure_schema_state_valid) removed; S3a (open_for_mutation_on_branch) + S3b (prepare_updates_for_commit) source txn.base; S4 (commit_all) uses new fresh_snapshot_for_branch_unchecked (the OCC manifest re-read minus the schema re-validation). fresh_snapshot_for_branch{,_unchecked} now read the manifest directly via ManifestCoordinator (drops a spurious commit-graph exists probe; same Snapshot).
  • Commit B (open collapse 4→1) — #1 accumulation open ELIMINATED (the node path discarded the handle; read txn.base.entry().table_version); #2 staging open KEPT (the one open); #3 commit drift-guard reads live HEAD via entry.dataset.dataset().latest_version_id() (a cheap manifest-pointer probe off the staged handle, not a fresh open); #4 index build reuses the commit_staged handle threaded through CommittedMutation/prepare_updates_for_commit.
  • Commit B.1 + cleanup — named the two positional returns (OpenedForMutation, CommittedMutation) + a debug_assert pinning the open-skip contract; removed the unearned WriteTxn.session field (the collapse uses skip/probe/reuse, not a session).

RFC §4.1 corrections — how they resolved:

  1. Thread the evolving handle, not a version-keyed cache → realized as collapse #4 (carry the commit_staged handle forward into the index build).
  2. Don't forbid re-resolution → honored: the commit-time OCC re-read (fresh_snapshot_for_branch_unchecked — fresh manifest, only schema-revalidation dropped) and the fork-authority reads stay fresh.
  3. Minimal carrierWriteTxn { branch, base } (even the session from the original sketch was dropped as unearned).

Deferred to step 5 (NOT in this PR): session-aware write base opens. The one remaining open (#2) stays a HEAD open; warming the shared Session across writes is an object-store (S3) phenomenon invisible on local FS, so it earns its own write_cost_s3.rs gate in step 5, where txn becomes the non-optional publish carrier. No new concurrency test was needed here: #2 stays a HEAD open (no pinned+session base introduced), so the publisher CAS + #3 live-HEAD probe fences are unchanged (covered by the green writes.rs/consistency.rs).

Guardrails (don't regress): schema validation is deliberately uncached for drift detection — collapse to 1 per write, never cache across writes on a long-lived handle (lifecycle::long_lived_handle_rejects_schema_*). The commit-time fresh read is OCC machinery, not redundancy. Keep all 3 strict-op guards. Keep fork-authority reads fresh. Pin the correct branch (server-bound-to-main writing a feature branch falls to a fresh open). A branch rfc-013-step-3b-writetxn exists off an earlier main; rebase onto the post-#297 main before starting.


5. Design A — the PublishPlan unification (step 5) = the convergent fix

This is the real fix for the bug class in §1c. Collapse the four hand-rolled writers + the maintenance path into one publish(txn, plan) authority where the CAS + bounded retry is unconditional and unbypassable (no caller can "hold the queue → skip the CAS"). Properties:

  • One interpreter of the HEAD vs manifest delta — and "is this my work?" is answered by the plan/sidecar identity, not a re-derived comparison. The own-HEAD-drift bug, the §6.5 writers, the write-path guard — all close by construction.
  • Recovery = the same PublishPlan re-applied — the crash-recovery interpreter and the live interpreter become the same code (iss-merge-recovery-partial-rollforward gone).
  • Each TableAction commits by its class (§1b): Rewrite = maintenance (Lance rebase
    • reopen/replan + monotonic fast-forward, no epoch); load/mutate = logical (strict OCC
    • writer_epoch).

Why it composes with Lance MTT (don't over-build):

  • The unification itself is convergent — when MTT lands, it slots underneath the same authority; nothing wasted. Build this.
  • The writer_epoch is the one MTT-redundant piece (MTT's commit-handler lease subsumes a cross-process fence). Build it last and minimally, gated on actually deploying multi-writer topologies. Per the deny-list, don't reimplement what the substrate will own.

Sequencing judgment (this cycle's strongest signal): the bug density here (this PR alone = 3 review rounds, all "a writer re-interprets the delta") means the current N-writers interim is high integrated-over-time liability. Consider pulling the convergent half of step 5 (the single authority + recovery-as-plan) forward — possibly ahead of 3b — because it stops the bug class rather than patching instances. #297 + #254 are the de-risking inputs: they validate the maintenance-class and logical-class commit models in isolation first, so Design A implements a known spec rather than designing under refactor pressure. Do NOT build more substrate-shaped scaffolding (custom WAL / job queue / second coordination table) to paper over the window — strictly higher liability than either Design A or waiting for MTT.

Deeper-than-A (post-MTT or as Lance exposes uncommitted variants): all-uncommitted-fragments

  • one manifest commit would shrink the A-before-B window itself, blocked today by Lance not exposing uncommitted variants for compact_files / optimize_indices / vector index (#6666 open; delete #6658 shipped). Track, don't build yet.

5.1 Step-5 design constraints inherited from the #295 spec review

3b shipped a minimal WriteTxn { branch, base } (schema-once + open-collapse via eliminate/probe/thread) and deferred the full §4.1 opener-unification — the pinned-base opener, the shared-Session open, the write-local handle cache, and the strict-op conflict-timing move — to step 5. So the greptile-bot comments on the #295 spec were moot for #298 (which built none of those constructs) but are load-bearing constraints for step 5 when it builds them. Bank them:

  1. Handle cache must be Send + Sync (Mutex<HashMap<…, Dataset>>, not RefCell) if WriteTxn::open(&self) is shared across concurrent stage futures — a RefCell compiles but panics when two stages poll. Or make it &mut self (no parallel-stage sharing). This is the deny-list "in-process-only Dataset impls — Send + Sync" item.
  2. The strict-op timing move needs an explicit retry contract. If step 5 moves strict-op conflict detection from open-time ensure_expected_version to commit-time CAS (the §4.1 pinned-base design), it MUST specify: the txn is discarded after any commit (success or conflict — the handle cache is commit-invalidated), and the retry re-opens a fresh WriteTxn at the new HEAD (never re-stages against the stale pinned base — that reproduces the lost-update). This is the same retry/refresh contract as the stale-view false-fail (§1d.2) — the op-class-aware precondition + "fresh base on retry" are one design point. Today (#298) strict ops keep open-at-HEAD + ensure_expected_version, so the contract is unchanged; step 5 owns it the moment it pins strict reads to the base.
  3. The opener-equivalence test must be non-trivial. A differential test that only passes when HEAD == base proves nothing about pinning. To actually prove "WriteTxn::open returns the pinned base, not HEAD," the test must advance the branch HEAD externally (direct Lance write), then assert the txn open still reads the base version — and that a strict write then fails ExpectedVersionMismatch at commit (verifying the timing move).

6. Why #297 is still needed even if you do Design A

  • Design A relocates #297's maintenance-class commit logic into the authority's TableAction::Rewrite path; it does not eliminate it. #297 is the validated spec + tests.
  • The two regression tests + §6.6 are the contract Design A must keep green.
  • The prod bug is live; Design A is the largest write-path change in the RFC. Don't hold a correctness fix hostage to a big refactor, and don't do a big refactor under bug-fix urgency.
  • Genuinely throwaway under Design A: only the loop's location + the head_advanced proxy (~a dozen lines). Everything else relocates or persists. #297 LANDED.

7. Open PRs and their relationships

  • #297 — maintenance-class fix (optimize vs write). LANDED (origin/main 6d4606a8); step 3b stacks on it.
  • #254 — logical-class fix (schema-apply vs optimize false-fail). Same op-class family; both are de-risking inputs for Design A's per-class commit models.
  • #296 — recovery roll-forward converges on concurrent manifest advance. The fix for the flaky iss-schema-apply-reopen-recovery-race. It touches recovery.rs and is aligned with #297's "postcondition is the state, not winning the CAS" principle. #296 landed on main first and is merged into this branch: the converge helper (converge_or_defer_roll_forward) was reconciled with Phase-7's manifest-CAS roll-forward — on convergence the audit references the winner's folded graph_commit_id (the current graph_head), not a freshly minted one.
  • #295 — the step-3b RFC doc (apply §4's three corrections to it).

8. Remaining RFC steps after 3b (RFC §9 is canonical)

  • #298 follow-up (do on the 3b PR, before merge): the edge-@card stale-read regression (§1d.1). Restore the live-HEAD cardinality scan, add the deterministic regression test, fix the wrong doc comment. Small, gate-safe, un-regresses an integrity check (invariant 9). The residual concurrent TOCTOU is the §7.1 gap (step 4) — un-widen here, don't over-reach.
  • Step 4 / Phase 7 (iss-991): lineage into __manifest (publish graph_commit + mutable graph_head:<branch> in the same merge-insert; _graph_commits becomes a projection). Removes the per-write commit_graph.refresh; closes the manifest→commit-graph atomicity + commit-graph-parent-under-concurrency gaps. Hard prereq: step 2 (done). Carries the §7.1 concurrent write-skew fix (needs the graph_head contention row) — frame §7.1 as "unify the entire write-validation read-set" (endpoint + cardinality + cross-version uniqueness), not merely "add graph_head" (§1d.1): the bespoke edge_cardinality_read_handle and the mutation-vs-loader freshness fork dissolve into one pinned read-set re-validated under the graph_head contention, or the liability survives as a second special-case.
  • Step 5 / Design A — §5 above. Acceptance item: the served-strict-write stale-view false-fail (§1d.2) — the op-class-aware precondition + open_txn probe. The contract is two tests passing together: un-ignore writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes (goes green) while consistency::stale_handle_public_mutation_must_refresh_then_retry stays green (maintenance fast-forwards; logical fails loudly). Self-contained enough to ship standalone like #297 if prod pain is acute; otherwise fold into the single PublishPlan delta-interpreter.
  • Step 2b — internal-table cleanup + the Q8 monotonic watermark (a Lance boundary tag). Deferred: only the secondary version-count/space term, touches the read/open path, and is MTT-redundant. Land when version-count cost bites.
  • §7.1 sequential write-skew (iss-overwrite-orphans-committed-edges) — inbound-RI validation on node removal; independent, ships anytime.
  • #20 — the prod per-write storage.ops span metric (RFC §5.3), still owed.
  • Branch ops: Lance Clone for create (iss-691).

9. Gotchas / traps (learned the hard way)

  • In-process queue ≠ cross-process lock. Any "I hold the queue → skip the retry/CAS" reasoning is a bug across processes. This is the recurring trap.
  • Monotonic publish must be -conditional, never "no assertion." The __manifest merge-insert is unconditional UpdateAll keyed on object_id (publisher.rs:379), so the equality (or monotonic) pre-check is the only guard — dropping it lets UpdateAll regress a newer version = lost write.
  • The drift guard interprets an ambiguous delta. Re-evaluating it in a retry over self-mutated state is how #297's follow-on bug happened. Gate any HEAD-vs-manifest interpretation on "have we committed yet."
  • compact_files fires Lance's auto_cleanup GC hook (commits with skip_auto_cleanup=false, no override) — optimize strips stale lance.auto_cleanup.* config before compacting to stay non-destructive on upgraded graphs. The strip is a separate commit (relevant to the partial-commit retry trap).
  • Lance rebases the common concurrent case for free — so the data-table conflict usually surfaces as the manifest fast-forward, not a Lance error. The Lance-Rewrite-overlap path is rare and needs failpoint injection to test.

10. Verification (the gate)

  • cargo test --workspace --locked — the canonical gate (matches CI).
  • cargo test -p omnigraph-engine --features failpoints --test failpoints optimize — the optimize concurrency/recovery tests.
  • cargo test -p omnigraph-engine --test write_cost / write_cost_s3 (bucket-gated) — cost gates (3b adds the schema-once + open-count asserts here).
  • cargo test -p omnigraph-engine --test maintenance — optimize/repair/cleanup.
  • Re-read invariants.md, lance.md, testing.md before each change (always-on requirement).

Lance source for re-validation: /Users/ragnor/.cargo/registry/src/index.crates.io-*/lance-7.0.0 (key files: io/commit.rs, io/commit/conflict_resolver.rs, dataset/optimize.rs, dataset/write/retry.rs, dataset/builder.rs).