omnigraph/docs/dev/handoff-rfc-013-write-path.md
Ragnor Comerford 1c5cb8741e
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
CI / Container Entrypoint (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled
Release Edge / Smoke Windows installer (push) Has been cancelled
feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299)
* docs(rfc-013): bank the #295 spec-review comments as step-5 constraints (§5.1)

3b shipped a minimal WriteTxn{branch,base} and deferred the full §4.1 opener
unification (pinned-base opener, shared Session, write-local handle cache,
strict-op conflict-timing move) to step 5. The greptile comments on the #295
spec were moot for #298 (none of those constructs were built) but are
load-bearing for step 5: (1) the handle cache must be Send+Sync (Mutex, not
RefCell); (2) the strict-op timing move needs an explicit retry contract — txn
discarded after any commit, retry re-opens a fresh base — which is the SAME
contract as the stale-view false-fail (§1d.2); (3) the opener-equivalence test
must advance HEAD externally then assert pinned-base, not the trivial HEAD==base.

* feat(engine): fold graph lineage into the __manifest publish CAS (RFC-013 Phase 7)

Graph lineage no longer lives in a second write to _graph_commits.lance. Each
commit's graph_commit + graph_head:<branch> rows now ride the SAME __manifest
merge-insert as the table-version rows (one atomic version), and CommitGraph reads
its cache from the manifest projection (read_graph_lineage). _graph_commits.lance is
no longer written commit rows (it remains only as a Lance branch-ref carrier).

Mechanism: a LineageIntent { graph_commit_id (ULID, minted once), branch, actor,
merged_parent, created_at } threads through ManifestBatchPublisher::publish. Inside
the publisher retry loop the parent is resolved per attempt from the just-loaded
branch-scoped manifest (the should_replace_head winner over the visible graph_commit
rows — branch-correct by Lance branch isolation; the graph_head row is written for
forward-compat + the §7.1 contention point but is not the parent source, so a
freshly-forked branch resolves the right fork-point parent). A CAS-conflict retry
re-reads the advanced head → correct new parent; the commit_id is stable across
retries.

Closes two known gaps BY CONSTRUCTION (one write, no second step to fail/ race):
- manifest→commit-graph atomicity (no crash window between manifest + lineage),
- commit-graph parent under concurrency (no refresh→append TOCTOU; the per-write
  commit_graph.refresh() is gone).

Recovery, branch-merge, and genesis route their lineage through the same CAS
(merge: one commit_merge_with_actor; recovery: publish_recovery_commit folds the
recovery commit, actor=omnigraph:recovery; genesis rides the init __manifest write).
The dead _graph_commits write helpers (append_commit/_merge/_actor) are
#[allow(dead_code)] (the actor sidecar table is still enumerated by optimize).

Verified (sequential): build clean; the new lineage_projection gate (manifest-only —
_graph_commits/_actors have 0 rows; full lineage reconstructs via the projection);
branching/merge_truth_table (exhaustive, branch-aware)/composite_flow/point_in_time/
changes/consistency/recovery; failpoints (59, incl. recovery lifecycle + the
now-closed atomicity gap); full --workspace. Cost tests REVERT to their pre-fold
values (writes +1, write_cost ceiling 80) — the proof of true single-CAS (no extra
write). invariants.md marks both gaps CLOSED.

PENDING (next stages, this PR): the §7.1 concurrent graph_head one-winner gate (stage
5 — two concurrent same-branch commits, exactly one wins); the stamp bump v4 +
migrate_v3_to_v4 backfill + read-only refuse for EXISTING graphs (stage 4); full
doc-sync of storage.md/architecture.md/writes.md.

* feat(engine): migrate existing v3 graphs to manifest lineage (RFC-013 Phase 7 stage 4)

The Phase-7 fold made CommitGraph read lineage from the __manifest projection, so a
pre-Phase-7 (internal-schema v3) graph — lineage in _graph_commits.lance, none in
__manifest — would read an empty commit DAG. Stage 4 makes existing graphs upgrade
seamlessly and not break reads.

- Stamp 3 -> 4 + migrate_v3_to_v4: bumps INTERNAL_MANIFEST_SCHEMA_VERSION and adds the
  3 => migrate_v3_to_v4 arm. The migration reads this branch's _graph_commits/_actors,
  emits one graph_commit row per commit + exactly one graph_head:<branch> for the head
  (should_replace_head winner, deterministic id-sort — no hash-map-order in migration
  output), merge-inserts into __manifest, then set_stamp(4) LAST. Idempotency guard
  first (read_graph_lineage non-empty -> just stamp); crash before set_stamp re-enters
  at v3 and the guard completes it. Does NOT touch the unenforced-PK metadata. Runs per
  branch: migrate_on_open backfills main; load_publish_state backfills each branch on
  its first write (root_uri/branch threaded through migrate_internal_schema).
- v3-read fallback: CommitGraph version-gates the lineage source — stamp < 4 reads the
  (re-activated) _graph_commits.lance; >= 4 uses the manifest projection. So a READ-ONLY
  open of an un-migrated graph reads correct history with no write. Correctness catch:
  the legacy _graph_commit_actors.lance was never branched, so the fallback reads it
  FLAT (no branch checkout) while checking out the branch only on the commits dataset.
- Read-only stamp-refuse: a ReadOnly open of a FUTURE-stamped graph now refuses with the
  same upgrade error (future-proofing the next format bump; the write path already
  refused via migrate_internal_schema).
- Docs: storage/architecture/writes/invariants/constants updated to manifest-stored
  lineage; release note docs/releases/v0.8.0.md (format v4, old writers clean-break,
  data preserved, upgrade writers first).

6 new tests (v3 backfill, idempotent, v3 read-only fallback, future-stamp refuse in both
modes, crash-before-stamp completes, legacy branch+flat-actor read). Full engine suite +
failpoints (59) + cargo test --workspace --locked green; check-agents-md passes.

* test(engine): graph_head concurrency gate — disjoint same-branch writers form a linear commit DAG (RFC-013 Phase 7)

Two (or N) writers committing disjoint tables on one branch still share the
mutable `graph_head:<branch>` manifest row, so the only row-level CAS
contention is that row. The contract — exactly one writer wins each CAS round;
the loser retries inside the publisher, re-resolves its parent off the
freshly-advanced head, and re-commits, so every writer lands and the
graph_commit DAG stays a single LINEAR chain (no fork) — had no acceptance
test. This adds it.

- concurrent_disjoint_writes_share_head_and_form_linear_chain: two disjoint
  writers + distinct LineageIntent, tokio::join!; both commit; the on-disk DAG
  is genesis -> c -> c' (asserted linear: exactly one genesis, no two commits
  share a parent, the head is the unique non-parent).
- n_concurrent_disjoint_writers_converge_to_one_linear_chain: N=8 disjoint
  writers each with an app-level retry loop (the publisher's internal budget
  can be exhausted under contention); all converge to one linear chain of 8.
- concurrent_disjoint_writes_form_linear_chain_on_s3: the same race on a real
  object store (true conditional-put CAS), bucket-gated.

Cites both tests from the §7.1 contention note in invariants.md.
Test-only; no production change.

* perf(engine): fold the lineage parent scan into the publish path's single __manifest scan (RFC-013 P2)

Each lineage publish scanned `__manifest` twice: `load_publish_state` read
table state via one scan, then `resolve_lineage_rows` did a second full
`read_graph_lineage` scan only to find the parent commit. Fold the
`graph_commit` extraction into the existing scan.

- `read_manifest_scan` gains a `collect_lineage` flag. The publish path
  (`read_publish_scan`) collects the `graph_commit` rows in the same pass; the
  table-state hot path leaves them in the forward-compat skip arm, so it never
  pays the O(commits) lineage JSON decode (it also skips reading the
  `object_id` column entirely). One shared `decode_graph_commit_row` serves
  both the folded path and the standalone `read_graph_lineage`, so the two
  cannot drift.
- `resolve_lineage_rows` is now sync and takes the already-parsed rows; the
  per-attempt re-read is preserved because `load_publish_state` runs once per
  CAS attempt, so a retry still re-parents off the advanced head.
- `load_publish_state` returns a named `LoadedPublishState` instead of a
  four-tuple; the thin `read_registered_table_locations` /
  `read_tombstone_versions` accessors fold away. `read_manifest_entries` becomes
  `#[cfg(test)]`: the fold removes its last production caller, leaving only the
  test-only namespace module (`db/manifest.rs`: `#[cfg(test)] mod namespace`),
  so gating it keeps it from becoming dead code in non-test builds.

Measured at depth ~5: per-write `__manifest` reads drop 44 -> 26 (total reads
54 -> 36). write_cost.rs gains a `manifest_reads <= 34` sub-ceiling that trips
if a publish-path scan is re-added, and its calibration comment is corrected.

* test(engine): red — transient legacy-open failure silently completes the v3→v4 migration

A pre-Phase-7 (internal schema v3) graph keeps its graph lineage in
`_graph_commits.lance`; the v3→v4 internal-schema migration backfills it into
`__manifest` and stamps v4. `read_legacy_commit_cache` currently maps EVERY
`Dataset::open` error to "no legacy data" (`Err(_) => empty`), so a transient or
corrupt open during the one-time migration backfills nothing and still stamps
v4 — orphaning the real lineage permanently (the migration runs once; the v3
fallback is then disabled).

Add a `migration.v3_to_v4.legacy_open` failpoint that injects a non-not-found
Lance error at the legacy open, and a fault-injection regression test in the
`failpoints` binary. Against the current swallow the migration completes anyway,
so the test fails on its "migration must abort" assertion — the predicted
symptom. The fix follows in the next commit.

Test support reachable from the `failpoints` integration binary (it compiles the
crate without `cfg(test)`): the v3-fixture helpers and a stamp/row-count reader
are gated `cfg(any(test, feature = "failpoints"))`, still excluded from release
builds. Failpoint tests stay in the integration binary because the fail registry
is process-global.

* fix(engine): propagate non-not-found legacy-open errors in the v3→v4 migration

`read_legacy_commit_cache` mapped EVERY `Dataset::open` error to an empty cache
(`Err(_) => empty`) on both the legacy commits dataset and its actor sidecar. The
v3→v4 internal-schema migration reads this once before stamping internal-schema
v4; a transient or corrupt open therefore backfilled nothing and stamped v4
anyway, orphaning the graph's real lineage permanently (the migration runs once,
and the stamp-gated v3 fallback is disabled at v4). This is the "no silent
failures" deny-list violation, and realistic on object storage.

Both opens now match the not-found variants — Lance maps an object-store NotFound
to `DatasetNotFound` — as the benign "no legacy data" / "no authors" signal, and
propagate anything else as a loud error. The two arms share the variant contract
but carry different rationale (commits-absent is the legitimate empty signal;
actor-sidecar-absent is benign, but a corrupt actor open silently wiping
authorship before stamping v4 is the same loss hole), commented at each site.

Pinned by the `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant`
guard (turns red if a Lance bump changes the absence variant) and greens the
fault-injection regression test from the previous commit.

* test(engine): cover the per-branch v3→v4 migration against a real Lance branch

`seed_legacy_v3_lineage` writes every commit (including the "feature"-tagged one)
to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field, so the
production per-branch migration path — `read_legacy_commit_cache` checking out a
real Lance branch, and a branch-scoped `__manifest` — was never exercised.

Add `seed_legacy_v3_lineage_with_branch`, which forks a real `feature` Lance
branch on BOTH `_graph_commits.lance` and `__manifest` (the branch inherits
main's stripped v3 state), and a test that migrates the BRANCH and asserts the
branch's lineage lands in the BRANCH's `__manifest` (genesis + A + branch commit,
`graph_head:feature` → branch commit, parents + actors intact) with main's
`__manifest` untouched.

This empirically resolves the open question behind the merge robustness work: the
fast-path `read_graph_lineage(dataset)` has no `manifest_branch` filter, but
`__manifest` is Lance-branched per graph-branch, so a branch reads only its own
lineage — the test confirms migrating one branch does not leak into another. No
branch filter is needed.

* refactor(engine): type the lineage-backfill merge conflict via the publisher classifier

`state::merge_lineage_rows` (the v3→v4 lineage backfill's standalone `__manifest`
merge-insert) stringified its `execute_reader` error, discarding the Lance
variant. Route it through the publisher's `map_lance_publish_error` (now
`pub(crate)`) so a concurrent first-open's row-level CAS loss surfaces as the
SAME typed `OmniError::Manifest{ details: RowLevelCasContention }` the publisher's
own retry consumes — one vocabulary, no raw-Lance matching in the migration.

Deliberately NOT unified with `optimize::is_retryable_lance_conflict`: that
classifier also matches `CommitConflict`/`RetryableCommitConflict` from the
compaction commit path, which a row-level merge-insert never emits. Cross-linked
with a comment at both sites.

Behavior-preserving: the only path that changes is the error TYPE on a CAS loss
(previously an opaque `Lance` string, now a typed conflict); no success/failure
outcome changes. The bounded re-open retry that consumes the new type lands next.

* test(engine): red — concurrent v3→v4 migrations error instead of converging

`migrate_v2_to_v3` is concurrent-runner idempotent by design; v3→v4 regressed it.
`merge_lineage_rows` uses `conflict_retries(0)` and `migrate_v3_to_v4` has no
app-level retry, so when two processes open the same legacy graph at once the
backfill's row-level CAS loser errors the whole open instead of converging.

The test opens two `__manifest` handles at the same pre-migration (v3,
empty-lineage) HEAD and runs both `migrate_internal_schema` calls under
`tokio::join!`, forcing the `graph_head:main` CAS to fire every run. Against the
current code the loser fails with `RowLevelCasContention` ("Attempted 0
retries.") — the predicted symptom — so the "both must converge" assertion
panics. The bounded re-open retry that makes both converge lands next.

* fix(engine): make the v3→v4 lineage backfill converge under concurrent runners

`migrate_v2_to_v3` is concurrent-runner idempotent; v3→v4 was not. Two processes
(or open-for-write handles) opening the same legacy graph at once both reach the
backfill merge, and `merge_lineage_rows`'s `conflict_retries(0)` made the
row-level CAS loser error the whole open instead of converging.

Two contention points, both now handled all-or-nothing:

1. The backfill merge on `graph_head:<branch>`. Wrap (fast-path re-read → read
   legacy → merge) in a bounded re-open retry loop: a `RowLevelCasContention` loss
   re-opens the manifest past the winner's (atomic) commit and re-loops; the
   fast-path re-read then sees the winner's lineage and stamps. On budget
   exhaustion it returns a `RowLevelCasContention`-typed error so the publisher's
   OUTER retry loop completes it. The retry decision reuses the publisher's
   `is_retryable_publish_conflict` so the two stay in lockstep.

2. The terminal stamp bump. Making the merge loser converge newly lets BOTH
   runners reach `set_stamp(4)` — an `UpdateConfig` commit on the same key — so the
   loser gets `lance::Error::IncompatibleTransaction` (NOT a row-level CAS, so the
   merge loop doesn't catch it). This surfaced only under the concurrent
   full-suite run, not the isolated test. Both write the SAME value, so the
   conflict is benign: `commit_v4_stamp_idempotently` re-opens and, if the stamp
   already reached the target, succeeds; else re-applies (bounded).

Greens the race test from the previous commit (3x isolated, 5x full-suite, no
flake). The new `IncompatibleTransaction` match is pinned by
`lance_surface_guards.rs::lance_error_incompatible_transaction_variant_exists`.

* fix(engine): refuse a future internal-schema stamp on the branch read path

`load_commit_cache_for_branch` dispatched on the branch's internal-schema stamp —
`< CURRENT` to the v3 legacy fallback, `>= CURRENT` to the manifest projection —
but never refused a `> CURRENT` branch stamp, so a newer-binary shape would be
misread by the projection rather than rejected.

Add `refuse_if_stamp_too_new(stamp)` (re-exported `pub(crate)` from `migrations`)
right after the branch stamp is read, mirroring the main read path's
`refuse_if_internal_schema_too_new`. This is defense-in-depth, not a live hole:
migrations run main-first (main migrates on open; each branch on its first write),
so main's stamp is always >= every branch's and the main path refuses first. The
guard closes the gap if that ordering invariant is ever weakened.

Tested by force-stamping a real branch past CURRENT and asserting the branch read
refuses with the upgrade error (the test misreads via the projection — returns Ok
— without the guard, confirmed by removing it).

* docs(rfc-013): record the v3→v4 migration robustness fixes

invariants.md Known Gaps: the `migrate_v3_to_v4` entry now states the migration is
loud on non-not-found legacy-open errors and concurrent-runner idempotent (bounded
re-open retry on the merge CAS + idempotent stamp bump), and that the branch read
path refuses a `> CURRENT` stamp.

lance.md: note the two new surface guards the migration depends on
(`dataset_open_missing_returns_not_found_variant`,
`lance_error_incompatible_transaction_variant_exists`).

testing.md: note the migration fault-injection test in the failpoints row.

* refactor: remove dead code and silence warnings across engine + cluster

Dead-code sweep follow-up to the RFC-013 stack. No behavior change.

- engine: delete the orphaned `validate_edge_cardinality` — the load path uses
  `validate_edge_cardinality_with_pending_loader` for every mode (including
  Overwrite, which it treats as the replacement table image), so the old
  standalone validator had no caller — and correct its sibling's now-stale doc
  reference. Gate `TableStore::append_batch` `#[cfg(test)]`: it is the inline-
  commit residual kept only for recovery test setup, with no non-test caller.
- cluster: drop unused imports in `lib.rs`, delete the unused
  `ClusterStore::payload_display`, and raise `LiveGraphObservation` /
  `GraphObservationJson` / `PolicyTarget` to `pub(crate)` to match the functions
  that return them.

Both lib crates now build warning-free.

* fix(engine): match Lance's typed DatasetAlreadyExists, not the message string

The internal create-or-open idempotency fallbacks in `db/commit_graph.rs` and
`db/recovery_audit.rs` classified the "already exists" race by
`err.to_string().contains("Dataset already exists")` — a Lance display string,
not an API contract. A wording change upstream would silently break the fallback
(a re-create would error instead of opening the existing table). Match the typed
`lance::Error::DatasetAlreadyExists { .. }` variant instead — the same discipline
as the v3→v4 migration's not-found classifier — pinned by the new
`lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists`
guard so a Lance rename turns red instead of silently regressing.

* refactor(engine): consolidate now_micros into one crate::db helper

Four `fn now_micros() -> Result<i64>` copies (commit_graph, recovery_audit,
graph_coordinator, manifest/graph) had already drifted: three mapped the
clock error to `OmniError::manifest("...UNIX_EPOCH...")` while recovery_audit
used `OmniError::manifest_internal("...unix epoch...")`. Replace all four with
one `pub(crate) fn now_micros()` in `db/mod.rs` (the majority `manifest`
variant), and repoint the eight call sites at `crate::db::now_micros()`. No
test asserts on the failure message, so unifying the variant is behavior-safe;
the timestamp-mapping contract can no longer fork across the rows it stamps.

* refactor(engine): drop the dead snapshot param from roll_back_sidecar

`roll_back_sidecar` took `snapshot: &Snapshot` only to discard it with
`let _ = snapshot;` — rollbacks now always publish (the restored HEAD plus a
recovery-commit lineage row), so the snapshot is never read to decide whether
to skip a publish. Remove the parameter, the two call-site arguments, and the
suppressor. A signature must not advertise inputs it does not consume. The
`Snapshot` import stays — `process_sidecar`, `roll_forward_all`, and
`record_audit_recovery_rollforward` still take it.

* test(engine): red — open_at_branch wedges a branch on a missing commit-graph ref

A v4 graph keeps its graph lineage in `__manifest` (RFC-013 Phase 7); the
`_graph_commits.lance` branch ref is a derived artifact. An interrupted
fork-reclaim or a `cleanup` race can drop that derived ref while the manifest
lineage stays intact. Per invariants 7 + 15 a missing derived ref must not fail
a logical read of the lineage.

This wedge builds a real v4 `feature` branch (its `graph_head:feature` row in
`__manifest`), force-deletes ONLY the `_graph_commits.lance` `feature` ref, then
asserts the branch reads (`open_at_branch` / list-commits / `merge_base`)
succeed from `__manifest` while a write that needs the derived ref
(`create_branch`) fails loudly with the typed actionable error.

Red against current code: `open_at_branch`'s hard `checkout_branch(branch)?` on
the missing ref errors `OmniError::Lance` (Lance "Not found:
_graph_commits.lance/tree/feature/_versions"), wedging the logical read.

* fix(engine): read manifest lineage independent of the derived _graph_commits ref

`CommitGraph::open_at_branch` did a hard `checkout_branch(branch)?` on the
`_graph_commits.lance` branch ref before reading lineage — so a missing derived
ref (an interrupted fork-reclaim, or a `cleanup` race) wedged the branch's
commit-list / merge-base / snapshot resolution even though the lineage is
readable from the authoritative `__manifest` (RFC-013 Phase 7). That is a
derived/physical artifact failing a logical read — invariants 7 and 15.

Make the held commits handle `Option<Dataset>` (mirroring `actor_dataset`).
`open_at_branch` and `refresh` check out the derived ref best-effort: a typed
not-found (`RefNotFound`/`NotFound`) yields a `None` handle while the read
re-syncs from `__manifest`; any other open error still propagates. The manifest
existence gate is unchanged — `load_commit_cache_for_branch` keeps its hard `?`,
so a truly absent branch still fails loudly at the manifest. `create_branch`
(the only writer that forks a ref) and the folded-in version lookup return a
loud, actionable error on `None`, deferring repair to `cleanup`'s existing
orphan reconciler rather than inlining a write on a read-side refresh. Reads
(`head_commit`/`load_commits`/`get_commit`/`merge_base`) never touch the handle.

Greens the wedge regression from the preceding commit.

* fix(engine): v3→v4 retry loops return retryable contention on exhaustion

`commit_v4_stamp_idempotently`'s retry loop used `0..=STAMP_RETRY_BUDGET`
(6 iterations) with an `attempt < STAMP_RETRY_BUDGET` guard, so the LAST
iteration's `IncompatibleTransaction` fell through to
`Err(e) => OmniError::Lance(...)` — stringified, non-retryable — instead of the
intended `RowLevelCasContention`, and the post-loop contention return was dead
code. The publisher's outer retry only re-runs `is_retryable_publish_conflict`,
so under sustained concurrent v3→v4 migration the one-time stamp bump could fail
instead of converging, defeating the idempotency the migration is supposed to add.

Fix the loop to `0..BUDGET` with an UNGUARDED `IncompatibleTransaction` arm: the
retryable variant is always handled inside the loop (re-open + same-value check +
retry), so it can never reach the stringifying catch-all, and the post-loop is the
SINGLE reachable exhaustion path — the typed `RowLevelCasContention`. The `Err(e)`
arm now catches only genuine non-contention errors. Apply the same range alignment
to the sibling merge loop in `migrate_v3_to_v4` (behaviorally correct today — its
`Err(err)` returns the already-typed contention — but it carried the identical
off-by-one structure the stamp loop was copied from; aligning both stops the next
copy from re-introducing it).

Test-first. The exhaustion path is otherwise near-unreachable — a real concurrent
winner stamps the same value, so the re-read returns Ok on the first retry — so a
new `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to
lose, driving exhaustion deterministically. Against the pre-fix loop the new
`v4_stamp_exhaustion_returns_retryable_contention` test goes red with
`Lance("Incompatible transaction: injected failpoint triggered…")`; with the fix it
asserts the typed `RowLevelCasContention`. Found by automated review on #299.

* feat(engine): minimum-supported internal-schema floor + retirement tripwire

The internal-schema migration chain (`migrate_internal_schema`) had a too-new
ceiling but no floor, so every old `migrate_vN_…` arm and the v3 legacy readers
it needs stay forever — the pile grows by one migration + readers + tests every
schema version. Add `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (1 today, a pure
no-op: `read_stamp` floors an absent stamp at 1 and no real graph carries 0) as
the oldest stamp this binary opens; raising it is how the chain sheds old code.

Collapse the one-sided `refuse_if_stamp_too_new` into `refuse_if_stamp_unsupported`
checking both bounds, so the floor lands at all three stamp-enforcement sites —
the write-path migrate dispatcher, the read-only open guard, and the branch
lineage-read path (`commit_graph.rs`) — via one compiler-enforced rename. A
hand-wired floor twin would have had to touch each site, and the branch-read path
is easy to miss; one combined guard cannot half-enforce. Rename the read-only
wrapper `refuse_if_internal_schema_unsupported` to match.

A compile-time tripwire (`const _: () = assert!(LOWEST_REGISTERED_MIGRATION_SOURCE
== MIN_SUPPORTED…)`) fails the build if a future floor bump forgets to delete the
now-dead migration arm (or vice versa) — stronger than a runtime test, impossible
to skip, and it doubles as the use that keeps the mirror const live.

Tests: a sub-floor graph is refused in both open modes (twin of
`future_stamp_is_refused_in_both_open_modes`); the guard accepts exactly
[MIN, CURRENT]. No behavior change for any real graph. The retirement runbook
lives on the `MIN_SUPPORTED` doc-comment + invariants.md.

* fix(engine): compose migration contention with publisher retry; precise recovery-converge audit commit

Three review-surfaced fixes on the RFC-013 Phase 7 path.

Publisher retry vs migration contention: `publish()` propagated a
`load_publish_state` error fatally via `?`, so a `RowLevelCasContention` surfaced
by the v3->v4 migration's exhausted merge/stamp budgets aborted the publish
instead of being retried — only `merge_rows` conflicts hit the retry. This
contradicted the migration's own design, which returns that typed error
EXPECTING the publisher to re-run the load (by which point a concurrent winner
has usually finished the migration, so the next scan is a no-op). Route a
retryable load error through the same retry path as a retryable `merge_rows`
conflict. Regression test (failpoints): a one-shot retryable contention injected
into `load_publish_state` now commits via the retry; red without the fix (the
write fails with the injected contention).

Recovery-converge audit commit id: `converge_or_defer_roll_forward` recorded the
branch HEAD as the audit row's `graph_commit_id`, but a concurrent user write can
advance `graph_head` past the recovery commit between the winner's publish and
this read — attributing the audit to a later, wrong commit. Use the latest
`RECOVERY_ACTOR`-authored commit (what `publish_recovery_commit` mints), which is
the recovery commit by construction. The audit's actor was already correct (it
comes from `sidecar.actor_id`, not the commit).

Dead param: drop the unused `snapshot` from `record_audit_recovery_rollforward`
(removing the `let _ = snapshot;` suppressor). `storage` stays — it is used to
delete the sidecar.
2026-06-25 13:55:34 +02:00

460 lines
31 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Handoff: finishing RFC-013 (write-path latency + correctness)
**Status:** living handoff. **Source of truth is [`rfc-013-write-path-latency.md`](rfc-013-write-path-latency.md)**
this doc is the *current-state map + the decisions/validation from the latest work cycle
+ the concrete next actions*. When they disagree, the RFC wins (and fix this doc).
**Audience:** the engineer/agent who picks up RFC-013 next.
---
## 0. TL;DR — where we are and what's next
RFC-013 makes the write path fast **and** correct on object storage (217 Lance tables
under one `__manifest` catalog, on R2/S3). It is sequenced as steps; read §9 of the RFC
for the canonical list. Current reality:
**Landed on `main`:**
- **Step 1** — Tier-1 cost gate + the shared `helpers::cost` harness (#288).
- **Step 3a** — opener bypass: write opens go direct (`Dataset::open` by URI + version)
instead of the Lance-namespace builder (#288). **This already banked the dominant
depth win** — see §2 below; it reframes everything.
- **Step 2a** — internal-table compaction: `optimize` now compacts `__manifest` /
`_graph_commits` / `_graph_commit_actors` (#291). Plus the RFC latency-model
correction (#292).
- **Optimize-vs-write race** — optimize survives a cross-process write race on the
same table (#297, **LANDED** — origin/main `6d4606a8`; see §6 for why it's not
redundant with Design A). Step 3b stacks on top of this.
**Open PRs (land these; relationships in §7):**
- **#296** `correctness-by-design-fix` — recovery roll-forward converges on a concurrent
manifest advance (the fix for the flaky `iss-schema-apply-reopen-recovery-race`).
**MERGED to main and integrated into this branch** — the converge helper now threads
Phase-7's manifest-CAS recovery `graph_commit_id` (see `converge_or_defer_roll_forward`).
- **#295** `docs/rfc-013-step-3b` — the step-3b RFC doc.
- **#254** `ragnorc/bug-4-schema-apply-occ` — schema-apply vs optimize false-fail
(same op-class family as #297, logical side).
**Step 3b is DONE** (capture-once `WriteTxn`, schema-once + open-collapse; see §4) on
`rfc-013-step-3b-writetxn-v2`. **Next: Phase 7 (step 4), then the big one — Design A /
`PublishPlan` unification (step 5)** — see §5, the convergent fix for the bug *class* this
area keeps generating, which also absorbs 3b's deferred session-aware write opens.
---
## 1. The corrected mental model (read this before touching anything)
Three reframes from the latest cycle that the older RFC prose may not fully reflect:
### 1a. 3a already won the depth fight → the residual is constant-factor + RTT
Before 3a, the write re-opened each table through the lance-namespace builder ~13×, and
that path was **O(depth)** (it re-opened `__manifest` + `list_table_versions` per open —
**not** a Lance back-walk; the root cause was OmniGraph's own namespace round-trips, not
Lance — validated against Lance source). 3a swapped it for the direct opener, which is
**O(1)** (`from_uri(loc).with_version(N)` = arithmetic path + one HEAD). So:
- The dominant **O(depth) data-table** term is **gone**.
- Step 2a flattened the secondary **internal-table** scan term.
- What remains is the **~110-hop serial backbone × RTT + compute** — a constant in
depth. The latency model is **`wall = (serial_hops + ops/effective_concurrency)·RTT
+ compute`**; on a capped store (R2) the op-count term re-enters wall-clock, on an
unlimited store it parallelizes away. Measured: prod one-row write 27→15.76s after
2a; the remaining 15.76s is the serial backbone — **step 3b's target**, not step 2's.
- Step 3b's win is therefore the **call-count/RTT collapse** (redundant opens, the
flat-46 schema reads), NOT a depth slope. Don't expect a depth-slope improvement from
3b; gate it on the constant-factor (S3 round-trips), not a curve.
### 1b. Two op classes, two commit models (the §6.6 principle)
Every concurrency bug in this area is **one op class using the other's commit model**:
| class | examples | commutes? | correct commit model |
|---|---|---|---|
| **maintenance** | compaction (`Rewrite`), `optimize_indices` | yes (content-preserving) | Lance native rebase + app reopen/replan on real overlap + **monotonic manifest fast-forward** — no epoch, no read-set |
| **logical mutation** | load / mutate / merge / delete | no (lost-update, write-skew) | strict cross-process OCC: read-set + write-set CAS under the `writer_epoch` fence |
Applying strict OCC + equality-CAS uniformly is the mistake: too strong for maintenance
(false conflicts — #297's bug), too weak for logical cross-process (§6.5 corruption).
### 1c. The root liability (what keeps generating these bugs)
Lance gives **per-table atomic commits** but **no cross-table/cross-step atomicity**, so
every multi-commit op advances per-table Lance HEAD **before** the manifest references it
(the "A-before-B window"). The resulting `HEAD vs manifest` delta is **ambiguous**
(external drift? my own in-flight work? a crashed writer?), and **many uncoordinated code
paths each re-interpret it** (4 writers + the maintenance path + recovery + the write-path
drift guard). Each interpreter is a fresh chance to misclassify. That is the bug class:
- §6.5 cross-process logical corruption,
- #297's own-HEAD-drift misclassification,
- the flaky write-path "HEAD ahead of manifest, run repair" guard,
- the recovery classifier edges.
**The convergent fix is Design A (one publish authority — step 5); Lance MTT eventually
retires the window entirely.** See §5.
### 1d. The second facet: the write base is a stale pin (no probe)
The READ path resolves its base behind a freshness probe (`resolve_target_inner`
omnigraph.rs:~1072 → `probe_latest_incarnation``refresh_manifest_only`); the WRITE path
does NOT (`resolved_branch_target` omnigraph.rs:~778 returns the warm `coord.snapshot()` for
the bound branch, no probe). So a long-lived server's write base lags the live manifest. That
single staleness feeds **two distinct failure modes**, both surfaced this cycle:
1. **Stale validation *reads* → integrity under-enforced.** Write-path RI checks read
committed state off the stale base. 3b's collapse #1 made it worse for edge `@card`:
`edge_cardinality_read_handle` (mutation.rs:~614) scans the pinned `txn.base` instead of
live HEAD (was live HEAD pre-3b), so a concurrent edge committed after `txn` capture is
uncounted → a `@card` max can be exceeded (cursor **High** / codex **P1** on #298,
**VALID**). **#298 fix: restore the live-HEAD read for that scan** (un-regress; gate-safe —
the `data_open_count` gate is a node insert) + a deterministic regression test (commit A's
edge, then B validates → must see A) + correct the wrong "pinned base == live HEAD" doc
comment (mutation.rs:~605-613, which assumes a single writer). The *structural* liability
underneath: there is **no unified write-validation read-set** — endpoint
(`ensure_node_id_exists`, warm `snapshot_for_branch`), cardinality (mutation: pinned
`txn.base`; loader: warm `snapshot_for_branch` — the SAME check forks per write path),
commit drift guard (live `fresh_snapshot_for_branch`), and uniqueness
(`enforce_unique_constraints_intra_batch`, intra-batch only — cross-version uniqueness is a
documented gap). Three freshness levels chosen ad hoc, none re-validated at commit → the
§7.1 TOCTOU class, and each new constraint forks the pattern again.
2. **Stale OCC *pin* → false-fail on a maintenance advance.** A served strict update/delete
pins the stale base version, then false-fails `ExpectedVersionMismatch` after an external
`optimize` advanced `__manifest` — even though the advance was content-preserving
compaction the logical write should fast-forward past (invariant 7). It's the **write-side
mirror of #297/§6.6** (#297 made optimize fast-forward past a logical write; this is a
logical write that must fast-forward past optimize). A served read clears it (the read
probes the shared coordinator). Validated repro on prod (omnigraph.ragnor.co) +
`writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes`
(`#[ignore]` on branch `fix/write-path-stale-view-probe`). **The naive "just probe" fix is
proven wrong** — a blanket probe silently refreshes past *logical* advances too, breaking
`consistency::stale_handle_public_mutation_must_refresh_then_retry` (the deliberate
cross-process lost-update OCC primitive). The fix must **discriminate by op class**.
**Both fold into Design A (step 5), same as §1c.** `open_txn`'s one warm probe makes the base
fresh (absorbs maintenance advances cheaply); the **op-class-aware strict precondition**
derive from Lance's per-version transaction metadata (all `Rewrite`/`ReserveFragments` =
maintenance → fast-forward the pin; any `Append`/`Update`/`Delete`/`Merge` = logical → fail
loudly; NO parallel marker, invariant 1/15) — is the correctness fence for anything that lands
after. And the §7.1 read-set-in-CAS unifies the validation read-set + re-validates it under the
`graph_head` contention. So **the stale-view false-fail, the cardinality/validation-read-set
liability, and #297's mirror are one bug** (the write base is a stale, un-probed, un-classified
pin) with **one home: the single PublishPlan delta-interpreter** (§1c + §5). Strong corroboration
of Design A — three symptoms, one fix.
---
## 2. Validated facts — do NOT re-derive these
Established this cycle against **Lance 7.0.0 source**
(`~/.cargo/registry/src/index.crates.io-*/lance-7.0.0`) and current engine code. Cited so
you can trust them without re-investigating.
**Lance (upstream):**
- `from_uri(loc).with_version(N).load()` and `checkout_version(N)` are **O(1)** (computed
V2 path `_versions/{u64::MAX-N:020}.manifest` + one HEAD; no listing/back-walk).
(`lance-table/src/io/commit.rs` `default_resolve_version`.)
- A shared `Arc<Session>` (`DatasetBuilder::with_session`) warms metadata/index caches
keyed by `(URI, version, e_tag)`. Caveat: the *first* manifest read on open is uncached
— the Session warms the *scan/index* metadata, not the first open. **`WriteParams` *does*
carry a `session` field** (`lance/src/dataset/write.rs`), but it only matters on the
`WriteDestination::Uri` arm; OmniGraph's staged path always drives off an **already-open
`Dataset`**, and Lance takes the store/session from that handle. So to attach the shared
Session to a write base, open read-style (`open_table_dataset``from_uri().with_version()
.with_session()`) and drive the staged write off that handle.
- A held `Arc<Dataset>` at a pinned version is `Send + Sync`, immutable, safe to reuse for
many scans/count/staged-write base in one txn (OmniGraph's `TableHandleCache` already
relies on this).
- **No compaction `RetryExecutor`** (only Delete/MergeInsert/Update have one).
`commit_compaction` commits a fixed `Rewrite` via `apply_commit` direct. In
`commit_transaction`, a semantic `RetryableCommitConflict` **escapes the retry loop**
via `?` at `io/commit.rs:979`; the loop only retries the OCC `CommitConflict`
(`:1096`), and even that re-rebases the *same* transaction (never re-plans). ⇒
**compaction needs app-level reopen+REPLAN; you cannot "set conflict_retries" and let
Lance own it.**
- `check_rewrite_txn`: a `Rewrite` rebases **cleanly** past a concurrent `Append`/disjoint
`Update`/`Delete` (preserving both); only a same-fragment overlap yields a retryable
conflict. ⇒ the common concurrent insert/update/delete is rebased for free; the app
retry fires only on real overlap.
**Engine (internal):**
- Read path (post-#268) already has the capture-once machinery: `Snapshot` (`db/manifest.rs`),
warm `GraphCoordinator` behind a `latest_version_id`/incarnation probe, a held
`TableHandleCache` keyed `(table,branch,version,e_tag)`, **one shared `Session` per
graph** (`read_caches.session`). **Writes bypass all of it by construction**
(`resolved_branch_target` returns `read_caches: None`; the 3a write opener attaches no
session and opens by latest, not pinned version).
- A single write opens each table **34×** (accumulation → staging reopen → commit
drift-guard → publish prepare), each a fresh cold open. `validate_schema_contract`
(`db/schema_state.rs`, via `ensure_schema_state_valid`) runs uncached (~3 `read_text`
+ 2 `exists`) at every resolve point (~the flat-46). Both are constant-factor, flat in
depth — 3b's targets.
- Strict-op guards are the lost-update floor (3 layers: pre-stage `ensure_expected_version`
`table_store.rs`; commit-time strict drift `exec/staging.rs`; publisher CAS
`publisher.rs`). Capture-once **supplies** the pinned operand — never remove a guard.
- Fork-on-first-write authority reads (`classify_fork_ref``fresh_snapshot_for_branch`)
must stay **fresh** (not served from a pinned base).
- Cost harness: `helpers::cost` (`measure`/`measure_with_staged`/`IoCounts`/`assert_flat`/
`local_graph`/`s3_graph`). The schema-once assert can reuse `CountingStorageAdapter`
(`warm_read_cost.rs::warm_query_validates_schema_contract_once`) with **zero** prod
change; an open-count assert wants a small `open_count` AtomicU64 in `QueryIoProbes`
(copy the `probe_count`/`record_probe` pattern). The forbidden-API guard
(`tests/forbidden_apis.rs`) makes an instrumentation-level counter complete.
---
## 3. The #297 cycle (this branch) — what it is, and the lesson
`fix-optimize-concurrency-race` (5 commits): a CLI `optimize` racing a served write on the
same table failed (Lance Rewrite lost, or the equality-CAS publish lost). Fix: unify both
compaction paths on the internal path's **reopen+replan** shape, with a **two-level retry**
— outer loop reopens+replans on a real Lance overlap; inner Phase-C loop makes the manifest
publish a **monotonic fast-forward** (advance to compacted version `N`, or no-op when the
manifest already moved to `≥ N`), never the strict equality CAS. Sidecar written once;
in-process queue kept as a contention reducer (not the cross-process guard); no `writer_epoch`.
**Two review rounds surfaced two follow-on bugs I introduced with the retry loop** — both
fixed, both regression-tested (own-HEAD-drift via negative control):
1. **Own-HEAD-drift misclassification** (`56d004e0`): the drift guard re-ran every
iteration and, after a partial Phase-B commit (auto_cleanup strip or compact, then a
later op conflicts), saw `HEAD > manifest` from *our own* covered work and deleted the
sidecar + returned `skipped_for_drift` (stranding uncovered drift). Fix: track
`head_advanced`; the drift guard fires only when `!head_advanced`.
2. **Publish exhaustion spurious error** (`e9d16a2c`): the publish loop returned `Err` on
its final retry even if the conflict meant a concurrent writer already published `≥ N`
(postcondition met). Fix: re-check `current >= state.version` on exhaustion.
**The lesson (write it on the wall):** *wrapping a sequence of side-effecting commits in a
retry silently converts every "checked once, before any side effect" precondition into
"re-checked after partial side effects."* That's a distinct bug class; it needs
fault-injection tests **at each commit boundary**, not just end-to-end concurrency tests.
(The `optimize.before_compact` / `optimize.inject_reindex_conflict` failpoints exist for
exactly this.)
**Temporary mechanism flag:** `head_advanced` is an in-memory proxy for "is this HEAD
movement mine." Under Design A the authority answers that from the plan/sidecar **identity**
— so `head_advanced` is the part that gets *replaced*, while the monotonic-publish +
reopen/replan **semantics** are permanent. (Noted in RFC §6.6.)
---
## 4. DONE: Step 3b — capture-once `WriteTxn` (shipped on `rfc-013-step-3b-writetxn-v2`)
**Delivered:** on the **table-touch hot path**, a single `mutate`/`load` validates the schema
contract **once** and opens each touched data table **at most once** — a constant-factor/RTT
win (not a depth-slope win; 1a). Two cost gates in `write_cost.rs` lock it (both on a node
insert): `write_validates_schema_contract_once` (3 `read_text` / 2 `exists`, was 12/9) and
`keyed_insert_opens_table_at_most_once` (`data_open_count <= 1`, was 4). The carrier is the
minimal `WriteTxn { branch, base }`, threaded as `Option<&WriteTxn>` (`Some` on the hot
mutate/load path, `None` byte-identical everywhere else); it **converges into** step 5's
`PublishPlan`.
**Not "once" everywhere (scope, not regression):** edge endpoint / cardinality RI validation
(`ensure_node_id_exists`, the loader's RI + cardinality) still resolves through
`snapshot_for_branch` and re-validates the schema — and reads **warm**, not live. Threading
`txn.base` there to make it "once" would re-introduce the stale-read class the #298 cardinality
fix removed (it now reads live HEAD). Doing schema-once *and* fresh reads for those validations
needs the unified, re-checked read-set — **step 4 §7.1** (§1d). So #298 **un-regresses
cardinality only; it does not close write-validation freshness.** No edge-insert/load schema-once
gate yet (only the node gates above).
Commits (off merged-#297 main):
- **Stage 0** — scope `open_count``data_open_count`/`internal_open_count` by URI class
(the review fix: `open_dataset_tracked` also opens `__manifest`/`_graph_commits`, so the
raw counter conflated them and the gate was unreachable). Re-baselined RED 4.
- **Commit A (schema-once)** — capture `txn` once at entry (the single validation); the 4
validation sites collapse: S1 (entry `ensure_schema_state_valid`) removed; S3a
(`open_for_mutation_on_branch`) + S3b (`prepare_updates_for_commit`) source `txn.base`;
S4 (`commit_all`) uses new `fresh_snapshot_for_branch_unchecked` (the OCC manifest re-read
minus the schema re-validation). `fresh_snapshot_for_branch{,_unchecked}` now read the
manifest directly via `ManifestCoordinator` (drops a spurious commit-graph `exists` probe;
same `Snapshot`).
- **Commit B (open collapse 4→1)** — #1 accumulation open ELIMINATED (the node path discarded
the handle; read `txn.base.entry().table_version`); #2 staging open KEPT (the one open);
#3 commit drift-guard reads live HEAD via `entry.dataset.dataset().latest_version_id()` (a
cheap manifest-pointer probe off the staged handle, not a fresh open); #4 index build reuses
the `commit_staged` handle threaded through `CommittedMutation`/`prepare_updates_for_commit`.
- **Commit B.1 + cleanup** — named the two positional returns (`OpenedForMutation`,
`CommittedMutation`) + a `debug_assert` pinning the open-skip contract; **removed the
unearned `WriteTxn.session` field** (the collapse uses skip/probe/reuse, not a session).
**RFC §4.1 corrections — how they resolved:**
1. *Thread the evolving handle, not a version-keyed cache* → realized as collapse #4 (carry
the `commit_staged` handle forward into the index build).
2. *Don't forbid re-resolution* → honored: the commit-time OCC re-read
(`fresh_snapshot_for_branch_unchecked` — fresh manifest, only schema-revalidation dropped)
and the fork-authority reads stay fresh.
3. *Minimal carrier*`WriteTxn { branch, base }` (even the `session` from the original
sketch was dropped as unearned).
**Deferred to step 5 (NOT in this PR):** session-aware write base opens. The one remaining
open (#2) stays a HEAD open; warming the shared `Session` across writes is an object-store
(S3) phenomenon invisible on local FS, so it earns its own `write_cost_s3.rs` gate in step 5,
where `txn` becomes the non-optional publish carrier. No new concurrency test was needed here:
#2 stays a HEAD open (no pinned+session base introduced), so the publisher CAS + #3 live-HEAD
probe fences are unchanged (covered by the green `writes.rs`/`consistency.rs`).
**Guardrails (don't regress):** schema validation is deliberately uncached for drift
detection — collapse to 1 *per write*, never cache across writes on a long-lived handle
(`lifecycle::long_lived_handle_rejects_schema_*`). The commit-time fresh read is OCC
machinery, not redundancy. Keep all 3 strict-op guards. Keep fork-authority reads fresh.
Pin the *correct* branch (server-bound-to-main writing a feature branch falls to a fresh
open). A branch `rfc-013-step-3b-writetxn` exists off an earlier main; rebase onto the
post-#297 main before starting.
---
## 5. Design A — the `PublishPlan` unification (step 5) = the convergent fix
**This is the real fix for the bug class in §1c.** Collapse the four hand-rolled writers +
the maintenance path into **one `publish(txn, plan)` authority** where the CAS + bounded
retry is **unconditional and unbypassable** (no caller can "hold the queue → skip the CAS").
Properties:
- **One interpreter of the `HEAD vs manifest` delta** — and "is this my work?" is answered
by the plan/sidecar **identity**, not a re-derived comparison. The own-HEAD-drift bug, the
§6.5 writers, the write-path guard — all close *by construction*.
- **Recovery = the same `PublishPlan` re-applied** — the crash-recovery interpreter and the
live interpreter become the same code (`iss-merge-recovery-partial-rollforward` gone).
- Each `TableAction` commits by its **class** (§1b): `Rewrite` = maintenance (Lance rebase
+ reopen/replan + monotonic fast-forward, **no epoch**); load/mutate = logical (strict OCC
+ `writer_epoch`).
**Why it composes with Lance MTT (don't over-build):**
- The **unification itself is convergent** — when MTT lands, it slots *underneath* the same
authority; nothing wasted. Build this.
- The **`writer_epoch`** is the one MTT-redundant piece (MTT's commit-handler lease subsumes
a cross-process fence). Build it *last and minimally*, gated on actually deploying
multi-writer topologies. Per the deny-list, don't reimplement what the substrate will own.
**Sequencing judgment (this cycle's strongest signal):** the bug density here (this PR alone
= 3 review rounds, all "a writer re-interprets the delta") means the current N-writers interim
is high integrated-over-time liability. **Consider pulling the *convergent half* of step 5
(the single authority + recovery-as-plan) forward — possibly ahead of 3b** — because it stops
the bug class rather than patching instances. #297 + #254 are the *de-risking inputs*: they
validate the maintenance-class and logical-class commit models in isolation first, so Design
A implements a known spec rather than designing under refactor pressure. Do NOT build more
substrate-shaped scaffolding (custom WAL / job queue / second coordination table) to paper
over the window — strictly higher liability than either Design A or waiting for MTT.
**Deeper-than-A (post-MTT or as Lance exposes uncommitted variants):** all-uncommitted-fragments
+ one manifest commit would shrink the A-before-B window itself, blocked today by Lance not
exposing uncommitted variants for `compact_files` / `optimize_indices` / vector index (#6666
open; delete #6658 shipped). Track, don't build yet.
### 5.1 Step-5 design constraints inherited from the #295 spec review
3b shipped a **minimal** `WriteTxn { branch, base }` (schema-once + open-collapse via
eliminate/probe/thread) and **deferred** the full §4.1 opener-unification — the pinned-base
opener, the shared-`Session` open, the write-local **handle cache**, and the strict-op
conflict-timing move — to step 5. So the greptile-bot comments on the #295 *spec* were **moot
for #298** (which built none of those constructs) but are **load-bearing constraints for step
5** when it builds them. Bank them:
1. **Handle cache must be `Send + Sync`** (`Mutex<HashMap<…, Dataset>>`, not `RefCell`) if
`WriteTxn::open(&self)` is shared across concurrent stage futures — a `RefCell` compiles
but panics when two stages poll. Or make it `&mut self` (no parallel-stage sharing). This
is the deny-list "in-process-only `Dataset` impls — `Send + Sync`" item.
2. **The strict-op timing move needs an explicit retry contract.** If step 5 moves
strict-op conflict detection from open-time `ensure_expected_version` to commit-time CAS
(the §4.1 pinned-base design), it MUST specify: the txn is **discarded after any commit**
(success or conflict — the handle cache is commit-invalidated), and the retry **re-opens a
fresh `WriteTxn` at the new HEAD** (never re-stages against the stale pinned base — that
reproduces the lost-update). **This is the same retry/refresh contract as the stale-view
false-fail (§1d.2)** — the op-class-aware precondition + "fresh base on retry" are one
design point. Today (#298) strict ops keep open-at-HEAD + `ensure_expected_version`, so the
contract is unchanged; step 5 owns it the moment it pins strict reads to the base.
3. **The opener-equivalence test must be non-trivial.** A differential test that only passes
when `HEAD == base` proves nothing about pinning. To actually prove "`WriteTxn::open`
returns the pinned base, not HEAD," the test must **advance the branch HEAD externally
(direct Lance write), then assert the txn open still reads the base version** — and that a
strict write then fails `ExpectedVersionMismatch` at commit (verifying the timing move).
---
## 6. Why #297 is still needed even if you do Design A
- Design A **relocates** #297's maintenance-class commit logic into the authority's
`TableAction::Rewrite` path; it does not eliminate it. #297 is the *validated spec + tests*.
- The two regression tests + §6.6 are the **contract** Design A must keep green.
- The prod bug is **live**; Design A is the largest write-path change in the RFC. Don't hold a
correctness fix hostage to a big refactor, and don't do a big refactor under bug-fix urgency.
- Genuinely throwaway under Design A: only the loop's *location* + the `head_advanced` proxy
(~a dozen lines). Everything else relocates or persists. **#297 LANDED.**
---
## 7. Open PRs and their relationships
- **#297** — maintenance-class fix (optimize vs write). **LANDED** (origin/main `6d4606a8`);
step 3b stacks on it.
- **#254** — logical-class fix (schema-apply vs optimize false-fail). Same op-class family;
both are de-risking inputs for Design A's per-class commit models.
- **#296** — recovery roll-forward converges on concurrent manifest advance. The fix
for the flaky `iss-schema-apply-reopen-recovery-race`. It touches `recovery.rs` and is
*aligned* with #297's "postcondition is the state, not winning the CAS" principle. **#296
landed on main first and is merged into this branch:** the converge helper
(`converge_or_defer_roll_forward`) was reconciled with Phase-7's manifest-CAS roll-forward —
on convergence the audit references the winner's folded `graph_commit_id` (the current
`graph_head`), not a freshly minted one.
- **#295** — the step-3b RFC doc (apply §4's three corrections to it).
---
## 8. Remaining RFC steps after 3b (RFC §9 is canonical)
- **#298 follow-up (do on the 3b PR, before merge): the edge-`@card` stale-read regression**
(§1d.1). Restore the live-HEAD cardinality scan, add the deterministic regression test, fix
the wrong doc comment. Small, gate-safe, un-regresses an integrity check (invariant 9). The
residual concurrent TOCTOU is the §7.1 gap (step 4) — un-widen here, don't over-reach.
- **Step 4 / Phase 7** (`iss-991`): lineage into `__manifest` (publish `graph_commit` +
mutable `graph_head:<branch>` in the same merge-insert; `_graph_commits` becomes a
projection). Removes the per-write `commit_graph.refresh`; closes the manifest→commit-graph
atomicity + commit-graph-parent-under-concurrency gaps. **Hard prereq: step 2 (done).**
Carries the §7.1 *concurrent* write-skew fix (needs the `graph_head` contention row) —
**frame §7.1 as "unify the entire write-validation read-set" (endpoint + cardinality +
cross-version uniqueness), not merely "add `graph_head`"** (§1d.1): the bespoke
`edge_cardinality_read_handle` and the mutation-vs-loader freshness fork dissolve into one
pinned read-set re-validated under the `graph_head` contention, or the liability survives as
a second special-case.
- **Step 5 / Design A** — §5 above. **Acceptance item: the served-strict-write stale-view
false-fail** (§1d.2) — the op-class-aware precondition + `open_txn` probe. The contract is
two tests passing *together*: un-ignore
`writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes` (goes green)
*while* `consistency::stale_handle_public_mutation_must_refresh_then_retry` stays green
(maintenance fast-forwards; logical fails loudly). Self-contained enough to ship standalone
like #297 if prod pain is acute; otherwise fold into the single PublishPlan delta-interpreter.
- **Step 2b** — internal-table cleanup + the Q8 monotonic watermark (a Lance boundary tag).
Deferred: only the secondary version-count/space term, touches the read/open path, and is
MTT-redundant. Land when version-count cost bites.
- **§7.1 sequential write-skew** (`iss-overwrite-orphans-committed-edges`) — inbound-RI
validation on node removal; independent, ships anytime.
- **#20** — the prod per-write `storage.ops` span metric (RFC §5.3), still owed.
- Branch ops: Lance `Clone` for create (`iss-691`).
---
## 9. Gotchas / traps (learned the hard way)
- **In-process queue ≠ cross-process lock.** Any "I hold the queue → skip the retry/CAS"
reasoning is a bug across processes. This is the recurring trap.
- **Monotonic publish must be `≥`-conditional, never "no assertion."** The `__manifest`
merge-insert is unconditional `UpdateAll` keyed on `object_id` (`publisher.rs:379`), so
the equality (or monotonic) pre-check is the *only* guard — dropping it lets `UpdateAll`
regress a newer version = lost write.
- **The drift guard interprets an ambiguous delta.** Re-evaluating it in a retry over
self-mutated state is how #297's follow-on bug happened. Gate any HEAD-vs-manifest
interpretation on "have *we* committed yet."
- **`compact_files` fires Lance's auto_cleanup GC hook** (commits with
`skip_auto_cleanup=false`, no override) — optimize strips stale `lance.auto_cleanup.*`
config before compacting to stay non-destructive on upgraded graphs. The strip is a
separate commit (relevant to the partial-commit retry trap).
- **Lance rebases the common concurrent case for free** — so the data-table conflict usually
surfaces as the manifest fast-forward, not a Lance error. The Lance-Rewrite-overlap path is
rare and needs failpoint injection to test.
---
## 10. Verification (the gate)
- `cargo test --workspace --locked` — the canonical gate (matches CI).
- `cargo test -p omnigraph-engine --features failpoints --test failpoints optimize`
the optimize concurrency/recovery tests.
- `cargo test -p omnigraph-engine --test write_cost` / `write_cost_s3` (bucket-gated) —
cost gates (3b adds the schema-once + open-count asserts here).
- `cargo test -p omnigraph-engine --test maintenance` — optimize/repair/cleanup.
- Re-read [`invariants.md`](invariants.md), [`lance.md`](lance.md), [`testing.md`](testing.md)
before each change (always-on requirement).
Lance source for re-validation:
`/Users/ragnor/.cargo/registry/src/index.crates.io-*/lance-7.0.0` (key files: `io/commit.rs`,
`io/commit/conflict_resolver.rs`, `dataset/optimize.rs`, `dataset/write/retry.rs`,
`dataset/builder.rs`).