omnigraph/docs/dev/handoff-schema-apply-recovery-flake.md

# Handoff: flaky schema-apply → reopen recovery race

**Type:** bug investigation handoff (not yet fixed)
**Status:** root-caused to a layer + hypothesis; exact mechanism and fix NOT yet validated
**Severity:** medium — flaky CI; a real (rare) schema-apply-then-reopen failure under load
**Scope:** pre-existing on `main`; **independent of** RFC-013 step 2 (internal-table
compaction, PR #291) and step 3a (#288) — those paths never touch schema apply or
the recovery sweep, and the full `--workspace` gate passes clean on a re-run.

> Do **not** "fix" this by changing the test to use a single handle. That was
> empirically shown to *reduce but not eliminate* the flake (see Experiments), so it
> would mask a real product race. The fix belongs in the engine (correct by design),
> not in a test edit.

---

## 1. Symptom

The test
`crates/omnigraph-server/tests/schema_routes.rs::schema_apply_route_hard_drops_property_with_allow_data_loss`
intermittently fails. The HTTP schema apply **succeeds** (`applied == true`); the
*subsequent* `Omnigraph::open(graph)` (which the test does to verify the catalog)
panics on `.unwrap()` with:

```
OmniError::Manifest(Conflict,
  "stale view of node:Person: expected manifest version 5 but current is 7",
  ExpectedVersionMismatch { expected: 5, actual: 7 })
```

The values (5, 7) vary; the shape is always "recovery roll-forward expected version
N, manifest is at M > N." It is raised from the **open-time recovery sweep**, i.e.
inside `Omnigraph::open`, not from the apply itself.

---

## 2. Reproduction

- **Needs sibling-test parallelism (CPU contention).** Running the target test
  *alone* is rock-solid (0/20 failures). The flake only appears when other tests in
  the same binary run concurrently and perturb the timing inside the apply→reopen
  sequence.
- Fast repro loop (≈13–40% per run):
  ```bash
  cargo test -p omnigraph-server --test schema_routes --no-run
  for i in $(seq 1 15); do
    cargo test -p omnigraph-server --test schema_routes 2>&1 \
      | grep -q "schema_apply_route_hard_drops_property_with_allow_data_loss ... FAILED" \
      && echo "iter $i FAIL"
  done
  ```
- It originally surfaced in a full `cargo test --workspace` run (max parallelism).
- Each test uses its own `tempfile::tempdir()`, so this is **not** cross-test shared
  state — it's a timing race inside one test's own graph.

---

## 3. Experiments run (the discriminating evidence)

Each variant was stress-run under the full `schema_routes` suite (parallel siblings):

| Variant | Flake rate |
|---|---|
| Target test in isolation (no sibling parallelism) | **0/20** |
| **Control** — as written (server handle + out-of-band `Omnigraph::open` load + reopen) | 6/15 ≈ 40% |
| Drop the live server handle (`drop(app)`) before the reopen | 4/15 ≈ 27% |
| Remove the out-of-band separate-handle load | 2/15 ≈ 13% |
| Remove the load **and** drop the server handle (≈ single-handle) | 8/20 ≈ 40% |

**Interpretation:**
- It is **concurrency-triggered**, not a topology bug: 0% isolated, flaky under
  parallel load.
- **No single factor eliminates it.** Removing the out-of-band load cuts the rate to
  roughly a third (≈40% → ≈13%, so the load amplifies the race) but leaves a ~13%
  base. Dropping the live server handle does not clearly help (≈27% on its own, ≈40%
  when combined with removing the load). So the "single-handle test" patch is a
  **band-aid**, not the fix.
- The residual base rate with the out-of-band load removed means there is a real
  race in the **schema-apply → reopen → recovery** path itself.

Caveat on the experiments: `drop(app)` may not synchronously tear down the server's
engine handle (it can be held by an `Arc`/spawned task), so the "single-handle"
rows are not airtight. This is one of the things to validate (§6).
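
Beyond the handle-lifetime caveat, the run counts are small (15–20 per variant), so the rates in the table carry wide uncertainty. The sketch below is not part of the repo; it computes a 95% Wilson score interval for a failure count, as one way to check whether two variants are actually distinguishable. The function names are illustrative.

```rust
/// 95% Wilson score interval for `failures` out of `runs` (z = 1.96).
/// Illustrative helper, not part of the omnigraph tree.
pub fn wilson_interval(failures: u32, runs: u32) -> (f64, f64) {
    assert!(runs > 0, "need at least one run");
    let n = runs as f64;
    let p = failures as f64 / n;
    let z = 1.96_f64;
    let z2 = z * z;
    let denom = 1.0 + z2 / n;
    let center = (p + z2 / (2.0 * n)) / denom;
    let half = z * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt() / denom;
    ((center - half).max(0.0), (center + half).min(1.0))
}

/// True when two closed intervals share at least one point.
pub fn intervals_overlap(a: (f64, f64), b: (f64, f64)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}
```

Applied to the table, the intervals for the control (6/15) and the load-removed variant (2/15) overlap, which is consistent with treating the per-variant differences as suggestive rather than conclusive and with the ≥50-iteration stress loop proposed in §9.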

---

## 4. Root-cause hypothesis (NOT yet proven)

The failing path is the **open-time recovery sweep's roll-forward** raising
`ExpectedVersionMismatch` from the publisher's `check_expected_table_versions`.

The hard-drop schema apply (`allow_data_loss=true` → `DropMode::Hard`) is a
**multi-step migration**: it performs several Lance commits + `__manifest` publishes,
advancing `node:Person`'s manifest version across multiple versions (e.g. 5 → … → 7).
To be crash-safe across the Lance-HEAD-before-manifest-publish gap, schema apply
writes a **recovery sidecar** (`__recovery/{ulid}.json`) pinning per-table
`expected_version` / `post_commit_pin` before its Phase B.

Hypothesis: under CPU contention, (a) the migration's multi-version advancement,
(b) the sidecar's Phase-D deletion, and (c) a later/overlapping `Omnigraph::open`
recovery sweep interleave such that the recovery roll-forward reads a sidecar whose
pinned `expected` is **stale relative to a manifest that legitimately advanced several
versions**, and **re-publishes at the stale `expected` instead of recognizing that the
migration already completed** → `expected 5, actual 7`.

In other words: the recovery classifier / roll-forward likely does not correctly
handle a table whose manifest is **already past `post_commit_pin`** by more than one
step (multi-step migration), or a sidecar whose operation has already fully
committed. The single-step assumption baked into the Optimize-style pin
(`post_commit_pin = expected_version + 1`) may not generalize to multi-commit schema
migrations.
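
To make the hypothesized failure shape concrete, here is a toy model of the version fence. It is a sketch under stated assumptions, not the real `check_expected_table_versions`: the struct and both functions are hypothetical stand-ins, and only the `expected`/`actual` field names come from the error shown in §1.

```rust
/// Hypothetical stand-in for the mismatch detail carried by the real error.
#[derive(Debug, PartialEq, Eq)]
pub struct ExpectedVersionMismatch {
    pub expected: u64,
    pub actual: u64,
}

/// Toy compare-and-publish fence: a publish is allowed only when the caller's
/// expected version equals the manifest's current version.
pub fn check_expected_version(expected: u64, current: u64) -> Result<(), ExpectedVersionMismatch> {
    if expected == current {
        Ok(())
    } else {
        Err(ExpectedVersionMismatch { expected, actual: current })
    }
}

/// Replays the hypothesized interleaving: the sidecar pins `pinned_expected`
/// before the migration, the migration then publishes `steps` versions, and a
/// later recovery sweep re-publishes using the sidecar's stale pin.
pub fn replay_stale_roll_forward(pinned_expected: u64, steps: u64) -> Result<(), ExpectedVersionMismatch> {
    let current = pinned_expected + steps; // manifest version after the migration
    check_expected_version(pinned_expected, current)
}
```

In this model, `replay_stale_roll_forward(5, 2)` yields the same `expected: 5, actual: 7` shape as the observed failure. Whether the real roll-forward follows this path is exactly what §6 asks to validate.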

---

## 5. Likely solution (correct-by-design, surgical)

Make the **open-time recovery classifier idempotent against a manifest that advanced
past the sidecar's pin**:

- If the table's current manifest/Lance version is already `>= post_commit_pin`
  (operation completed, possibly across multiple versions), classify it as
  *already-rolled-forward / completed* (the `RolledPastExpected` family) and **delete
  the sidecar without republishing** — never attempt a publish at the stale
  `expected`.
- Ensure the schema-apply sidecar records a pin that the classifier can interpret for
  a **multi-step** migration (a range / "completed at or beyond" semantics), not a
  strict single-step `expected + 1`.

This also hardens *real* crash recovery for multi-step schema apply (not just the
test), and is small + local to `recovery.rs` (+ possibly the schema-apply sidecar
write in `schema_apply.rs`). It does **not** rearchitect recovery.
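
The proposed rule can be sketched as follows. This is a minimal illustration, not the code in `recovery.rs`: `TablePin`, `Classification`, `Action`, `classify`, and `plan` are hypothetical, and only `RolledPastExpected`, `expected_version`, and `post_commit_pin` are names taken from this handoff.

```rust
/// Hypothetical per-table pin; the real `SidecarTablePin` layout may differ.
#[derive(Debug, Clone, Copy)]
pub struct TablePin {
    pub expected_version: u64,
    pub post_commit_pin: u64,
}

/// Hypothetical two-way classification for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Classification {
    /// Manifest has not reached the pin yet: the operation may be incomplete.
    Pending,
    /// Manifest is at or beyond the pin: the operation already completed.
    RolledPastExpected,
}

/// Hypothetical recovery action derived from the classification.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    RollForward { publish_at: u64 },
    DeleteSidecarOnly,
}

/// "Completed at or beyond" semantics: `>=` rather than `==`, so a multi-step
/// migration that advanced past the pin is still recognized as done.
pub fn classify(pin: TablePin, current: u64) -> Classification {
    if current >= pin.post_commit_pin {
        Classification::RolledPastExpected
    } else {
        Classification::Pending
    }
}

/// Never plans a publish at the stale `expected_version` for a completed op.
pub fn plan(pin: TablePin, current: u64) -> Action {
    match classify(pin, current) {
        Classification::RolledPastExpected => Action::DeleteSidecarOnly,
        Classification::Pending => Action::RollForward { publish_at: pin.expected_version },
    }
}
```

With a single-step pin of `{ expected_version: 5, post_commit_pin: 6 }` and a manifest at 7, `plan` returns `DeleteSidecarOnly` in this sketch. How the real classifier should treat a manifest strictly between the two pins is left open here and depends on the findings in §6.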

Per repo rule 12 (test-first for bug fixes): land a **deterministic** repro first —
ideally a failpoint that forces the interleaving (pause after the migration's commits
but before sidecar delete, then run an open) so the red→green is reliable, not a
stress-loop probability. See the `failpoints.rs` pattern + the schema-apply failpoints
already in the tree.
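
As an illustration of what "forcing the interleaving" means, the sketch below uses plain `std` channels as a stand-in for a failpoint. It does not use the repo's `failpoints.rs` API, whose shape is not shown in this handoff; `ToyGraph` and `observe_at_pause` are hypothetical, and the real test would pause the engine's schema apply instead of a toy worker.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Toy shared state: a manifest version plus an optional pending sidecar pin.
#[derive(Debug)]
pub struct ToyGraph {
    pub manifest_version: u64,
    pub sidecar_expected: Option<u64>,
}

/// Runs a two-step "apply" on a worker thread that pauses after its commits
/// and before sidecar deletion, then returns what an observer sees at that
/// pause: (manifest version, sidecar pin).
pub fn observe_at_pause(start_version: u64) -> (u64, Option<u64>) {
    let graph = Arc::new(Mutex::new(ToyGraph {
        manifest_version: start_version,
        sidecar_expected: None,
    }));
    let (paused_tx, paused_rx) = mpsc::channel::<()>();
    let (resume_tx, resume_rx) = mpsc::channel::<()>();

    let worker_graph = Arc::clone(&graph);
    let worker = thread::spawn(move || {
        {
            let mut g = worker_graph.lock().unwrap();
            g.sidecar_expected = Some(g.manifest_version); // pin before the commits
            g.manifest_version += 2; // multi-step migration
        }
        paused_tx.send(()).unwrap(); // "failpoint": commits done, sidecar still present
        resume_rx.recv().unwrap(); // block until the observer has looked
        worker_graph.lock().unwrap().sidecar_expected = None; // cleanup after resume
    });

    paused_rx.recv().unwrap(); // wait until the worker reaches the pause
    let observed = {
        let g = graph.lock().unwrap();
        (g.manifest_version, g.sidecar_expected)
    };
    resume_tx.send(()).unwrap();
    worker.join().unwrap();
    observed
}
```

Because the observer runs only while the worker is parked, the outcome no longer depends on scheduler timing, which is the property the red test needs.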

---

## 6. What MUST be validated before fixing

1. **Which sidecar is being rolled forward?** Confirm it is the *schema-apply*
   sidecar (vs the out-of-band `load`'s sidecar, vs another writer). Instrument /
   log the sidecar `operation_id`, `kind`, and `SidecarTablePin` at the point the
   recovery sweep raises the error.
2. **The exact classifier path.** Trace which `TableClassification` arm the failing
   table hits (`recovery.rs::classify_table`, ~L600) and which roll-forward call
   raises `ExpectedVersionMismatch` (`heal_pending_sidecars_roll_forward` ~L761,
   `roll_forward_all` ~L1215, `restore`+publish ~L1275). Confirm it is the
   multi-step-advanced / already-completed case being mishandled.
3. **Is `post_commit_pin = expected + 1` the bug?** Verify the hard-drop migration
   advances `node:Person` by **>1** version, and that the sidecar pins a single-step
   `+1`, so the classifier can't recognize completion at +2.
4. **Engine-level reproduction (no server).** Build a deterministic engine-level
   repro: persistent handle applies a multi-step hard-drop, then a fresh
   `Omnigraph::open` — ideally with a failpoint forcing the interleave — to confirm
   the bug is in the engine recovery path and not server-specific (runtime, handle
   lifecycle). The current evidence is server-test-only.
5. **Is the out-of-band load *necessary or only amplifying*?** Confirm the ~13% base
   rate (load removed) is the same root cause, not a second distinct race. If the
   load is required, the bug is specifically about a second writer's version
   advancement; if not, it's purely intra-apply.
6. **`drop(app)` cleanliness.** Verify whether the server's engine handle is truly
   gone after `drop(app)` (it may be `Arc`-held). If not, the "single-handle"
   experiments don't isolate the live-handle factor and should be redone with a
   genuinely single-handle setup.
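
Item 6 rests on a general Rust property: dropping one `Arc` clone does not drop the shared value while another clone is alive. The sketch below demonstrates that with a hypothetical `Engine` stand-in. Whether the server actually holds its engine handle this way is an assumption still to be checked.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the engine handle; sets a flag when it is really dropped.
pub struct Engine {
    dropped: Arc<AtomicBool>,
}

impl Drop for Engine {
    fn drop(&mut self) {
        self.dropped.store(true, Ordering::SeqCst);
    }
}

/// Returns (engine dropped after the first owner goes away,
///          engine dropped after the last owner goes away).
pub fn drop_one_of_two_owners() -> (bool, bool) {
    let dropped = Arc::new(AtomicBool::new(false));
    let app = Arc::new(Engine { dropped: Arc::clone(&dropped) });
    let background = Arc::clone(&app); // e.g. a clone held by a spawned task

    drop(app); // analogous to `drop(app)` in the test
    let after_first = dropped.load(Ordering::SeqCst);
    drop(background); // the last owner releases the engine
    let after_last = dropped.load(Ordering::SeqCst);
    (after_first, after_last)
}
```

A `Drop`-flag probe like this, or an `Arc::strong_count` check just before the reopen, is one way to confirm that a "single-handle" experiment really has a single handle.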

---

## 7. Relationship to Lance MTT

This bug lives in the **recovery-sidecar roll-forward**, which exists only to bridge
the Lance-HEAD-before-manifest-publish gap in omnigraph's faked multi-table
atomicity. `invariants.md` already calls recovery sidecars "scaffolding to remove
once the substrate closes the gap." Lance **MTT** (native atomic multi-table commits,
RFC §8 / lance#7264) closes that gap → retires the sidecar → **eliminates this bug
class.**

Implications:
- **Don't wait for MTT** — it is the "strategic exit, not a current dependency,"
  uncertain and far off, and this bug is live now.
- **Don't over-invest** — keep the fix surgical (classifier idempotency), because the
  whole sidecar layer is MTT-disposable. A surgical fix retires cleanly with the
  layer; a recovery rearchitecture would be throwaway.

---

## 8. Key pointers

- Failing test: `crates/omnigraph-server/tests/schema_routes.rs`
  → `schema_apply_route_hard_drops_property_with_allow_data_loss` (~L777,
  `#[tokio::test(flavor = "multi_thread")]`).
- Error type: `OmniError::Manifest` / `ManifestConflictDetails::ExpectedVersionMismatch`
  (`crates/omnigraph/src/error.rs`); raised by `check_expected_table_versions`
  (`crates/omnigraph/src/db/manifest/publisher.rs`, ~L356).
- Recovery sweep + classifier: `crates/omnigraph/src/db/manifest/recovery.rs`
  — `TableClassification` (~L335), `classify_table` (~L600), roll-forward
  (`heal_pending_sidecars_roll_forward` ~L761, `roll_forward_all` ~L1215, restore +
  publish ~L1275).
- Schema-apply sidecar write: `crates/omnigraph/src/db/omnigraph/schema_apply.rs`
  (the `SidecarKind` schema-apply pins; `db.coordinator.write().refresh()` ~L692).
- Open entry point that runs the sweep: `Omnigraph::open` (read-write mode) →
  `db/manifest/recovery.rs` sweep.
- Repro: §2 above. Stress under `schema_routes` suite parallelism; 0% isolated.

---

## 9. Suggested next steps

1. Add tracing at the recovery roll-forward error site (sidecar kind/id, pins,
   observed vs expected) and capture a failing run (§6.1, §6.2).
2. Reproduce deterministically at the engine level with a failpoint (§6.4) — this is
   the red test (rule 12).
3. Implement the classifier-idempotency fix (§5) in a separate commit; confirm
   red→green and that the stress loop goes to 0 failures over ≥50 iterations.
4. Keep it a standalone PR (not bundled with RFC-013 follow-ons).

---

feat(engine): `WriteTxn` - validate schema + open each data table once per write (#298)

* docs(rfc-013): step-3b handoff + §4.1 corrections (validated)

Add the RFC-013 write-path handoff doc, and correct §4.1's WriteTxn sketch from the
4-subagent validation against current code:
- HandleCache → handle-threading (forward the commit-return handle; a version-keyed
  cache misses because HEAD walks N→N+1→N+2 across staging + index-build commits).
- "re-resolution unrepresentable" softened to "pinned base for the pre-commit phase +
  named fresh re-reads at the commit/fork boundary" — three reads (commit-time OCC, the
  live-HEAD drift probe, fork authority) are irreducible correctness machinery.
- WriteParams DOES carry a session field; the real constraint is "stage off an open
  Dataset," so attach the Session by opening read-style then staging off it.

* test(engine): RED step-3b capture-once fitness asserts + open_count probe

Two write-path cost gates, RED today, GREEN after the WriteTxn lands:
- write_validates_schema_contract_once: a write must validate the schema contract
  once (3 read_text + 2 exists). Today re-validates at every resolve point —
  measured 12 read_text / 9 exists (~4 validations) via CountingStorageAdapter
  (zero production change; the write twin of the read-path schema-once test).
- keyed_insert_opens_table_at_most_once: a keyed single-table write must open its
  table <=1x. Today measured 10 opens.

Adds an exact open-CALL probe: open_count + record_open() on QueryIoProbes (mirroring
probe_count/record_probe), called at both open chokepoints; surfaced as
IoCounts.open_count. forbidden_apis guarantees every write open routes through them.

* feat(engine): WriteTxn carrier + open_write_txn (3b scaffolding)

The capture-once write transaction (RFC-013 step 3b): WriteTxn{branch, base:
Snapshot, session} + Omnigraph::open_write_txn, which validates the schema contract
once and pins the base snapshot + the shared per-graph Session.

Landed as reviewed scaffolding (gated #[allow(dead_code)]); the next pass threads
Option<&WriteTxn> through open_for_mutation_on_branch / staging on the non-strict
bound-branch path — opening the base once from the pinned entry with the warm session
(a session-aware pinned opener returning a SnapshotHandle) and skipping the per-table
schema re-validation — to turn the two RED cost gates green. Strict ops / fork / the
commit-time OCC re-read keep their fresh reads.

* test(engine): scope write-path open_count to data tables (RFC-013 step 3b)

The keyed_insert_opens_table_at_most_once gate asserted open_count <= 1, but
open_count was a single unclassified counter: record_open() fires in both
open chokepoints, and open_dataset_tracked also opens the internal/system
tables (__manifest via layout.rs, _graph_commits/_graph_commit_actors via
commit_graph.rs). So the count conflated data-table opens with the publisher
CAS + commit-graph append opens — making the gate measure the wrong quantity
and unreachable by threading alone (the manifest publish keeps it >1 regardless).

Scope it by table class, mirroring the read-side counters (which already split
by URI prefix via separate wrappers): record_open(uri) classifies the open's
last path segment and feeds data_open_count vs internal_open_count. IoCounts
exposes both; the gate now asserts data_open_count <= 1.

Re-baselined: a single keyed insert is data_open_count=4 / internal_open_count=6
(sum 10, the old conflated value). The RED target for the WriteTxn threading is
now the real data-table-open count (4 -> 1), with internal opens correctly out
of scope. Pure test-harness/instrumentation; no production behavior change
(classification runs only inside the probe closure, skipped when no probes are
installed).

Also marks #297 (optimize-vs-write race) as landed in the step-3b handoff —
this branch is already stacked on origin/main after it merged.

* feat(engine): validate the schema contract once per write (RFC-013 step 3b)

A single mutate/load re-validated the schema contract ~4 times: at the entry
(ensure_schema_state_valid), per-table in open_for_mutation_on_branch
(resolved_branch_target), at the commit-time OCC re-read (fresh_snapshot_for_branch),
and in the publisher's index-build snapshot (snapshot_for_branch). Each validation
is 3 read_text + 2 exists on the storage adapter — O(touched resolve-points) of
redundant contract I/O on every write.

Thread the already-landed WriteTxn carrier through the write path: capture
`txn = open_write_txn(branch)` once at the mutate/load entry (the single validation),
then source the per-table entry and the commit/publish snapshots from `txn.base`
instead of re-resolving. When `txn` is None (branch merge, schema apply, tests) every
function is byte-identical to before.

- mutate_with_current_actor / load_jsonl_reader capture txn once (replacing the
  entry-point ensure_schema_state_valid) and thread Some(&txn) through
  execute_*/open_table_for_mutation, commit_all, and
  commit_updates_on_branch_with_expected.
- open_for_mutation_on_branch sources (snapshot, branch) from txn.base/txn.branch
  when present — skipping resolved_branch_target's re-validation. The OPEN itself is
  unchanged (still HEAD via open_dataset_head_for_write), and strict ops keep
  ensure_expected_version. Schema-once applies to strict and non-strict alike; the
  data-open collapse is a separate change.
- commit_all uses fresh_snapshot_for_branch_unchecked (the OCC manifest re-read minus
  the schema re-validation) when txn is present; the drift guard is unchanged.
- prepare_updates_for_commit uses txn.base for the publisher index-build snapshot.

fresh_snapshot_for_branch{,_unchecked} now read the manifest directly via
ManifestCoordinator instead of resolve_target. The OCC re-read consumes only the
Snapshot (per-table location + version), which ManifestCoordinator::open().snapshot()
produces identically — but resolve_target additionally opened the commit graph (a
spurious _graph_commits.lance exists probe the OCC read never consults). Dropping that
load is a pure read-cost reduction for every fresh-snapshot caller (commit_all's None
arm, optimize, repair, fork reclaim); the returned Snapshot is unchanged and the read
is a fresher cold manifest re-read, so the OCC freshness guarantee is preserved.

Greens write_validates_schema_contract_once (3 read_text / 2 exists, was 12/9).
keyed_insert_opens_table_at_most_once stays red (data_open_count=4) — the open
collapse lands next. Full engine suite green otherwise.

* feat(engine): open each data table once per write (RFC-013 step 3b)

A single keyed-node mutate opened its data table 4 times: accumulation (to read
.version()), staging (the real write base), the commit-time drift guard (to read
live HEAD), and the publisher's index build (reopen at the just-committed version).
Collapse three of the four — using the WriteTxn carrier threaded for schema-once —
so a write opens each touched data table at most once.

- #1 accumulation: open_for_mutation_on_branch now returns
  (Option<SnapshotHandle>, expected_version, full_path, table_branch). On the txn's
  own branch, a non-strict (Insert/Merge) op needs no open — the only thing the
  caller reads is .version() (the CAS fence), which is exactly the pinned base
  version (entry.table_version). So skip open_dataset_head_for_write and source the
  version from txn.base. The node insert path already discarded that handle; the
  edge path resolves a pinned read only when non-default cardinality needs it.
  STRICT ops and any write that must fork still open live HEAD + ensure_expected_version.
- #3 commit drift guard: commit_all reads live HEAD via
  entry.dataset.dataset().latest_version_id() — a cheap manifest-pointer probe off
  the already-open staging handle (the same primitive ManifestCoordinator::
  probe_latest_version uses) instead of a fresh open_dataset_head_for_write. The
  head<current / head>current drift classification is byte-identical.
- #4 index build: commit_all now returns the per-table post-commit_staged
  SnapshotHandle map; commit_updates_on_branch_with_expected threads it into
  prepare_updates_for_commit, which builds indices on the threaded handle instead of
  reopening at the same just-committed version. Absent a handle (other writers,
  inline/delete tables) the reopen path is byte-identical.

When txn is None (branch merge, schema apply, tests) every function opens and checks
exactly as before. Greens keyed_insert_opens_table_at_most_once (data_open_count 4->1).
Schema-once gate stays 3/2. Full engine suite + failpoints (recovery sidecar lifecycle)
green.

* refactor(engine): name the write-path open/commit returns (RFC-013 step 3b)

The open collapse left two positional returns that are easy to mis-thread and
carry an unwritten contract: open_for_mutation_on_branch's
(Option<SnapshotHandle>, u64, String, Option<String>) and commit_all's 5-tuple
(updates, expected_versions, sidecar_handle, guards, committed_handles). Replace
both with named structs so each field reads at the call site and the Option's
contract is documented, not folklore.

- OpenedForMutation { handle, expected_version, full_path, table_branch } with a
  require_handle(ctx) helper for the callers that must have a handle (strict ops,
  the fork path, every no-txn caller — branch merge, the seed test). The handle is
  None only on the non-strict-txn open-skip path (collapse #1); require_handle
  panics with a named context if that contract is ever broken.
- CommittedMutation { updates, expected_versions, sidecar_handle, guards,
  committed_handles } for commit_all; consumers destructure into the same local
  bindings they already used, so the publish/sidecar/guard-hold logic is unchanged.
- A debug_assert in open_table_for_mutation pins the skip contract: a missing handle
  is legal only on the non-strict txn path, so a future strict arm returning None
  trips in debug builds instead of handing None to a require_handle consumer.

Pure refactor — no behavior change. Both cost gates stay green (schema 3/2,
data_open_count=1), full engine suite + lib (162) green.

* refactor(engine): drop the unearned session field from WriteTxn (RFC-013 step 3b)

The open collapse greens data_open_count<=1 by SKIPPING the accumulation open,
PROBING live HEAD with latest_version_id, and REUSING the commit_staged handle —
none of which consume a session. The captured WriteTxn.session was therefore dead
(`#[allow(dead_code)]`): unearned surface a reviewer rightly flags.

Remove it. The carrier is now {branch, base} — exactly what schema-once + the open
collapse use. Step 5 (PublishPlan unification) makes WriteTxn the non-optional
publish carrier and is the right home for session-aware base opens, where the
warm-session benefit on the single remaining open — an object-store (S3) phenomenon,
invisible on local FS — can be earned by its own cost gate rather than carried dead
through this PR.

No behavior change; both cost gates stay green (schema 3/2, data_open_count=1).

* docs(rfc-013): mark step 3b DONE — schema-once + open-collapse shipped, session deferred to step 5

* docs(rfc-013): capture the write-base-staleness convergence (§1d)

Three findings this cycle share one root — the write base is a stale, un-probed,
un-classified pin (the read path probes; the write path returns the warm
coordinator snapshot):

- #298 edge-@card stale-read regression (cursor High / codex P1, VALID): collapse #1
  made the cardinality scan read txn.base instead of live HEAD, so a concurrent edge
  is uncounted and a max can be exceeded. Fix on #298: restore the live-HEAD read +
  deterministic test + correct the single-writer doc comment.
- The structural liability underneath: no unified write-validation read-set —
  endpoint/cardinality/uniqueness each pick freshness ad hoc (warm/pinned/live),
  the same cardinality check forks mutation-vs-loader, none re-validated at commit.
- The served-strict-write stale-view false-fail (validated on prod + a #[ignore]
  repro): a strict update/delete false-fails ExpectedVersionMismatch after an external
  optimize advance — the write-side mirror of #297/§6.6. The naive blanket probe is
  proven wrong (breaks the cross-process lost-update OCC contract).

All three converge on Design A (step 5): open_txn's warm probe makes the base fresh,
the op-class-aware precondition (derive maintenance vs logical from Lance per-version
transaction metadata — no parallel marker) fast-forwards maintenance and fails logical,
and §7.1's read-set-in-CAS unifies + re-validates the validation read-set. §8 records
the #298 follow-up, the widened §7.1 scope, and the step-5 two-test acceptance contract.

* test(engine): RED — edge @card must scan live HEAD, not stale txn.base (#298)

Regression guard for the cursor-High/codex-P1 finding on #298: 3b's collapse #1
made the non-strict edge-insert cardinality scan read the pinned txn.base instead
of live HEAD (edge_cardinality_read_handle), so a concurrent edge committed after
txn capture is uncounted and a @card max is silently exceeded (invariant 9).

Deterministic two-handle test (no failpoint): handle A commits WorksAt(Alice->Acme)
to the @card(0..1) max; stale handle B (never read since) inserts a second WorksAt
for Alice. B's coordinator is stale by construction (the write path doesn't probe),
so B scans txn.base (Alice has 0) and wrongly commits the 2nd edge. RED: the insert
that must be rejected currently succeeds (panics at unwrap_err). Goes green when the
scan reads live HEAD.

* fix(engine): scan live HEAD for edge @card, not the pinned txn.base (#298)

3b's collapse #1 skips the non-strict edge accumulation open, so edge_cardinality_
read_handle reopened the edge table at the pinned txn.base for the @card scan. Since
cardinality is validated once (never rechecked at commit), a concurrent edge committed
after txn capture was uncounted and a @card max could be silently exceeded (invariant
9) — the cursor-High/codex-P1 regression on #298. Pre-3b the scan read live HEAD (the
mutation's own open_dataset_head_for_write handle).

Restore the live-HEAD read: take the table LOCATION from the pinned entry (stable
across versions) and open the dataset at its current HEAD via open_dataset_head_for_
write. Gate-safe — the data_open_count / merge-insert-only gates are node inserts; the
edge cardinality path (non-default @card only) is untouched by them, and the extra
live-HEAD open is exactly the pre-3b shape. Also drops the dead None-fallback's schema
re-validation (greptile P2, auto-resolved). The residual validate->commit TOCTOU is the
pre-existing §7.1 gap (RFC-013 step 4), recorded in handoff §1d/§8.

Turns cardinality_rejected_for_stale_handle_after_concurrent_edge_commit green;
validators / write_cost / writes / consistency / end_to_end / branching all green.

* docs(dev): link handoff docs from index

* docs(engine): tighten 3b claims to match the code (#298 review)

Review caught several comments/docs overclaiming what the code does (the session
drop + the #298 cardinality fix left stale/too-strong wording). No logic change.

- open_write_txn doc: drop the stale "shared per-graph Session" (WriteTxn no longer
  carries one); scope "once" to the table-touch hot path and note edge/load RI
  validation still re-resolves (→ step 4 §7.1) + the session-aware open is step 5.
- edge cardinality call-site comment: it said the scan uses a "pinned txn.base" — it
  now opens LIVE HEAD (#298); corrected.
- write_cost.rs: "opens the base once (with the shared Session)" → session-aware base
  open is deferred to step 5.
- data_open_count completeness (instrumentation.rs + write_cost.rs): forbidden_apis
  only keeps engine code OUTSIDE the storage layer on the chokepoints; table_store.rs
  is allow-listed and holds direct Dataset::opens for branch-management ops (not the
  keyed-write hot path the gate measures). Narrowed the claim accordingly.
- handoff §4: "schema once / open once" is the node hot path (the two gates); edge
  endpoint + loader RI/cardinality still re-validate and read warm — #298 un-regresses
  cardinality only, it does NOT close write-validation freshness (that's step 4 §1d/§7.1).

build clean; write_cost / validators / forbidden_apis green.
											
										
										
2026-06-23 21:27:31 +02:00
+								# Handoff: flaky schema-apply → reopen recovery race
 								**Type:** bug investigation handoff (not yet fixed)
 								**Status:** root-caused to a layer + hypothesis; exact mechanism and fix NOT yet validated
 								**Severity:** medium — flaky CI; a real (rare) schema-apply-then-reopen failure under load
 								**Scope:** pre-existing on `main`; **independent of** RFC-013 step 2 (internal-table
 								compaction, PR #291) and step 3a (#288) — those paths never touch schema apply or
 								the recovery sweep, and the full `--workspace` gate passes clean on a re-run.
 								> Do **not** "fix" this by changing the test to use a single handle. That was
 								> empirically shown to *reduce but not eliminate* the flake (see Experiments), so it
 								> would mask a real product race. This is a correct-by-design fix in the engine, not
 								> a test edit.
 								---
 								## 1. Symptom
 								The test
 								`crates/omnigraph-server/tests/schema_routes.rs::schema_apply_route_hard_drops_property_with_allow_data_loss`
 								intermittently fails. The HTTP schema apply **succeeds** (`applied == true`); the
 								*subsequent* `Omnigraph::open(graph)` (which the test does to verify the catalog)
 								panics on `.unwrap()` with:
 								```
 								OmniError::Manifest(Conflict,
 								  "stale view of node:Person: expected manifest version 5 but current is 7",
 								  ExpectedVersionMismatch { expected: 5, actual: 7 })
 								```
 								The values (5, 7) vary; the shape is always "recovery roll-forward expected version
 								N, manifest is at M > N." It is raised from the **open-time recovery sweep**, i.e.
 								inside `Omnigraph::open`, not from the apply itself.
 								---
 								## 2. Reproduction
 								- **Needs sibling-test parallelism (CPU contention).** Running the target test
 								  *alone* is rock-solid (0/20 failures). The flake only appears when other tests in
 								  the same binary run concurrently and perturb the timing inside the apply→reopen
 								  sequence.
 								- Fast repro loop (≈13–40% per run):
 								  ```bash
 								  cargo test -p omnigraph-server --test schema_routes --no-run
 								  for i in $(seq 1 15); do
 								    cargo test -p omnigraph-server --test schema_routes 2>&1 \
 								      | grep -q "schema_apply_route_hard_drops_property_with_allow_data_loss ... FAILED" \
 								      && echo "iter $i FAIL"
 								  done
 								  ```
 								- It originally surfaced in a full `cargo test --workspace` run (max parallelism).
 								- Each test uses its own `tempfile::tempdir()`, so this is **not** cross-test shared
 								  state — it's a timing race inside one test's own graph.
 								---
 								## 3. Experiments run (the discriminating evidence)
 								Each variant was stress-run under the full `schema_routes` suite (parallel siblings):
 								| Variant | Flake rate |
 								|---|---|
 								| Target test in isolation (no sibling parallelism) | **0/20** |
 								| **Control** — as written (server handle + out-of-band `Omnigraph::open` load + reopen) | 6/15 ≈ 40% |
 								| Drop the live server handle (`drop(app)`) before the reopen | 4/15 ≈ 27% |
 								| Remove the out-of-band separate-handle load | 2/15 ≈ 13% |
 								| Remove the load **and** drop the server handle (≈ single-handle) | 8/20 ≈ 40% |
 								**Interpretation:**
 								- It is **concurrency-triggered**, not a topology bug: 0% isolated, flaky under
 								  parallel load.
 								- **No single factor eliminates it.** Removing the out-of-band load roughly halves
 								  the rate (it amplifies the race) but leaves a ~13% base. Dropping the live server
 								  handle does not clearly help. So the "single-handle test" patch is a **band-aid**,
 								  not the fix.
 								- The residual base rate with the out-of-band load removed means there is a real
 								  race in the **schema-apply → reopen → recovery** path itself.
 								Caveat on the experiments: `drop(app)` may not synchronously tear down the server's
 								engine handle (it can be held by an `Arc`/spawned task), so the "single-handle"
 								rows are not airtight. This is one of the things to validate (§6).

---

## 4. Root-cause hypothesis (NOT yet proven)

The failing path is the **open-time recovery sweep's roll-forward** raising
`ExpectedVersionMismatch` from the publisher's `check_expected_table_versions`.

The hard-drop schema apply (`allow_data_loss=true` → `DropMode::Hard`) is a
**multi-step migration**: it performs several Lance commits + `__manifest` publishes,
advancing `node:Person`'s manifest version across multiple versions (e.g. 5 → … → 7).
To be crash-safe across the Lance-HEAD-before-manifest-publish gap, schema apply
writes a **recovery sidecar** (`__recovery/{ulid}.json`) pinning per-table
`expected_version` / `post_commit_pin` before its Phase B.

Hypothesis: under CPU contention, the timing of (a) the migration's multi-version
advancement, (b) the sidecar's Phase-D deletion, and (c) a later/overlapping
`Omnigraph::open` recovery sweep interleaves such that the recovery roll-forward
reads a sidecar whose pinned `expected` is **stale relative to a manifest that
legitimately advanced several versions**, and **re-publishes at the stale `expected`
instead of recognizing the migration already completed** → `expected 5, actual 7`.

In other words: the recovery classifier / roll-forward likely does not correctly
handle a table whose manifest is **already past `post_commit_pin`** by more than one
step (multi-step migration), or a sidecar whose operation has already fully
committed. The single-step assumption baked into the Optimize-style pin
(`post_commit_pin = expected_version + 1`) may not generalize to multi-commit schema
migrations.

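
The hypothesized failure can be modelled in a few lines. This is a simplified
stand-in, not the engine code: `SidecarPin`, `check_expected_version`, and
`naive_roll_forward` are hypothetical names, and only the `expected_version` /
`post_commit_pin` fields and the error shape are taken from this document.

```rust
/// Simplified model of a per-table pin stored in a recovery sidecar.
#[derive(Debug, Clone, Copy)]
struct SidecarPin {
    expected_version: u64,
    post_commit_pin: u64,
}

/// Simplified model of the conflict reported by the publisher.
#[derive(Debug, PartialEq)]
struct ExpectedVersionMismatch {
    expected: u64,
    actual: u64,
}

/// Stand-in for the publisher's optimistic check: a publish is only allowed
/// while the manifest is still at the version the caller expects.
fn check_expected_version(expected: u64, current: u64) -> Result<(), ExpectedVersionMismatch> {
    if expected == current {
        Ok(())
    } else {
        Err(ExpectedVersionMismatch { expected, actual: current })
    }
}

/// Naive roll-forward (the hypothesized behaviour): always republish at the
/// pinned `expected_version`, without asking whether the operation finished.
fn naive_roll_forward(pin: SidecarPin, current: u64) -> Result<u64, ExpectedVersionMismatch> {
    check_expected_version(pin.expected_version, current)?;
    Ok(pin.post_commit_pin)
}
```

Under this model, a pin of `{ expected_version: 5, post_commit_pin: 6 }` checked
against a manifest already at 7 yields `ExpectedVersionMismatch { expected: 5, actual: 7 }`,
the same shape as the symptom in §1.
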

---

## 5. Likely solution (correct-by-design, surgical)

Make the **open-time recovery classifier idempotent against a manifest that advanced
past the sidecar's pin**:

- If the table's current manifest/Lance version is already `>= post_commit_pin`
  (operation completed, possibly across multiple versions), classify it as
  *already-rolled-forward / completed* (the `RolledPastExpected` family) and **delete
  the sidecar without republishing**; never attempt a publish at the stale
  `expected`.
- Ensure the schema-apply sidecar records a pin that the classifier can interpret for
  a **multi-step** migration (a range / "completed at or beyond" semantics), not a
  strict single-step `expected + 1`.

This also hardens *real* crash recovery for multi-step schema apply (not just the
test), and is small + local to `recovery.rs` (+ possibly the schema-apply sidecar
write in `schema_apply.rs`). It does **not** rearchitect recovery.

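
A minimal sketch of the "completed at or beyond" rule, assuming a pin with the two
fields named in §4. The enum, its variants other than `RolledPastExpected`, and the
`classify` / `plan` functions are hypothetical and do not mirror the real
`TableClassification` in `recovery.rs`.

```rust
#[derive(Debug, Clone, Copy)]
struct SidecarPin {
    expected_version: u64,
    post_commit_pin: u64,
}

#[derive(Debug, PartialEq)]
enum SketchClassification {
    /// Manifest still at `expected_version`: the publish never landed.
    NeedsRollForward,
    /// Manifest at or beyond `post_commit_pin`: the operation already completed.
    RolledPastExpected,
    /// Anything else (behind `expected_version`, or between the two pins).
    Unexpected,
}

#[derive(Debug, PartialEq)]
enum Action {
    Publish { at_expected: u64 },
    DeleteSidecarOnly,
    Abort,
}

fn classify(pin: SidecarPin, current: u64) -> SketchClassification {
    if current >= pin.post_commit_pin {
        // Checked first so that a multi-step advance (pin + 1, pin + 2, ...)
        // is treated as "done", not as a conflict.
        SketchClassification::RolledPastExpected
    } else if current == pin.expected_version {
        SketchClassification::NeedsRollForward
    } else {
        SketchClassification::Unexpected
    }
}

/// Maps a classification to the recovery action; completed work is never republished.
fn plan(pin: SidecarPin, current: u64) -> Action {
    match classify(pin, current) {
        SketchClassification::RolledPastExpected => Action::DeleteSidecarOnly,
        SketchClassification::NeedsRollForward => Action::Publish { at_expected: pin.expected_version },
        SketchClassification::Unexpected => Action::Abort,
    }
}
```

The ordering of the checks is the design point: "at or beyond the pin" is tested
before "equal to expected", so a manifest at 7 with a pin of 6 resolves to
`DeleteSidecarOnly` instead of a publish attempt at 5.
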

Per repo rule 12 (test-first for bug fixes): land a **deterministic** repro first,
ideally a failpoint that forces the interleaving (pause after the migration's commits
but before sidecar delete, then run an open), so the red→green is reliable, not a
stress-loop probability. See the `failpoints.rs` pattern + the schema-apply failpoints
already in the tree.

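
The forced interleaving can be illustrated without the engine. In the sketch below,
two `std::sync::mpsc` channels stand in for a failpoint; the repo's actual
`failpoints.rs` API is not shown here, and `Store`, `recovery_sweep`, and
`sweep_at_failpoint` are hypothetical names operating on an in-memory model.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical in-memory stand-in for one table's manifest plus its sidecar.
#[derive(Debug)]
struct Store {
    manifest_version: u64,
    /// (expected_version, post_commit_pin) while a sidecar exists.
    sidecar: Option<(u64, u64)>,
}

/// Sweep using the idempotent rule: a manifest at or past the pin means the
/// operation finished, so the sidecar is removed without a publish.
fn recovery_sweep(store: &Mutex<Store>) -> Result<(), String> {
    let mut s = store.lock().unwrap();
    let Some((expected, pin)) = s.sidecar else { return Ok(()) };
    if s.manifest_version >= pin {
        s.sidecar = None;
        Ok(())
    } else if s.manifest_version == expected {
        s.manifest_version = pin;
        s.sidecar = None;
        Ok(())
    } else {
        Err(format!("stale view: expected {expected} but current is {}", s.manifest_version))
    }
}

/// Runs a multi-step "apply" on a worker thread, parks it after the commits but
/// before the sidecar delete, and runs the sweep at exactly that point.
fn sweep_at_failpoint() -> (Result<(), String>, u64, bool) {
    let store = Arc::new(Mutex::new(Store { manifest_version: 5, sidecar: None }));
    let (reached_tx, reached_rx) = mpsc::channel::<()>();
    let (resume_tx, resume_rx) = mpsc::channel::<()>();

    let worker_store = Arc::clone(&store);
    let worker = thread::spawn(move || {
        {
            let mut s = worker_store.lock().unwrap();
            s.sidecar = Some((5, 6)); // pin written before the commits
            s.manifest_version = 7;   // two publishes: 5 -> 6 -> 7
        }
        reached_tx.send(()).unwrap(); // "failpoint": announce, then park
        resume_rx.recv().unwrap();
        worker_store.lock().unwrap().sidecar = None; // Phase-D style delete
    });

    reached_rx.recv().unwrap();           // worker is now parked
    let outcome = recovery_sweep(&store); // the "open" runs here on every execution
    resume_tx.send(()).unwrap();
    worker.join().unwrap();

    let s = store.lock().unwrap();
    (outcome, s.manifest_version, s.sidecar.is_some())
}
```

Because the worker cannot proceed until the sweep has run, the interleaving is the
same on every execution. Swapping in a sweep that only accepts
`manifest_version == expected` would make the same harness fail every time, which is
the reliable red state the stress loop cannot provide.
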

---

## 6. What MUST be validated before fixing

1. **Which sidecar is being rolled forward?** Confirm it is the *schema-apply*
   sidecar (vs the out-of-band `load`'s sidecar, vs another writer). Instrument /
   log the sidecar `operation_id`, `kind`, and `SidecarTablePin` at the point the
   recovery sweep raises the error.
2. **The exact classifier path.** Trace which `TableClassification` arm the failing
   table hits (`recovery.rs::classify_table`, ~L600) and which roll-forward call
   raises `ExpectedVersionMismatch` (`heal_pending_sidecars_roll_forward` ~L761,
   `roll_forward_all` ~L1215, `restore`+publish ~L1275). Confirm it is the
   multi-step-advanced / already-completed case being mishandled.
3. **Is `post_commit_pin = expected + 1` the bug?** Verify the hard-drop migration
   advances `node:Person` by **>1** version, and that the sidecar pins a single-step
   `+1`, so the classifier can't recognize completion at +2.
4. **Engine-level reproduction (no server).** Build a deterministic engine-level
   repro: persistent handle applies a multi-step hard-drop, then a fresh
   `Omnigraph::open` (ideally with a failpoint forcing the interleave) to confirm
   the bug is in the engine recovery path and not server-specific (runtime, handle
   lifecycle). The current evidence is server-test-only.
5. **Is the out-of-band load *necessary or only amplifying*?** Confirm the ~13% base
   rate (load removed) is the same root cause, not a second distinct race. If the
   load is required, the bug is specifically about a second writer's version
   advancement; if not, it's purely intra-apply.
6. **`drop(app)` cleanliness.** Verify whether the server's engine handle is truly
   gone after `drop(app)` (it may be `Arc`-held). If not, the "single-handle"
   experiments don't isolate the live-handle factor and should be redone with a
   genuinely single-handle setup.

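
For the `drop(app)` question in the last item, one way to check whether a handle is
really gone is to hold a `Weak` probe and test `upgrade()` after the drop. The sketch
below uses a plain thread in place of a spawned server task; `Engine` and
`handle_liveness_after_drop` are hypothetical names, and how the server actually
holds its engine handle is exactly what remains to be verified.

```rust
use std::sync::{mpsc, Arc, Weak};
use std::thread;

/// Hypothetical stand-in for the engine handle owned by the server.
struct Engine;

/// Returns (alive after dropping the app's clone, alive after the task exits).
fn handle_liveness_after_drop() -> (bool, bool) {
    let app_handle = Arc::new(Engine);
    // A Weak reference observes the allocation without keeping it alive.
    let probe: Weak<Engine> = Arc::downgrade(&app_handle);

    // A background task keeps its own clone, as a spawned server task might.
    let task_handle = Arc::clone(&app_handle);
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let task = thread::spawn(move || {
        let _held = task_handle;
        stop_rx.recv().unwrap(); // hold the clone until told to stop
    });

    drop(app_handle); // analogous to `drop(app)` in the test
    let alive_after_drop = probe.upgrade().is_some();

    stop_tx.send(()).unwrap();
    task.join().unwrap();
    let alive_after_join = probe.upgrade().is_some();

    (alive_after_drop, alive_after_join)
}
```

In this model the handle is still alive immediately after the drop and is released
only once the background task has finished, so a test that reopens right after
`drop(app)` would not be single-handle.
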

---

## 7. Relationship to Lance MTT

This bug lives in the **recovery-sidecar roll-forward**, which exists only to bridge
the Lance-HEAD-before-manifest-publish gap in omnigraph's faked multi-table
atomicity. `invariants.md` already calls recovery sidecars "scaffolding to remove
once the substrate closes the gap." Lance **MTT** (native atomic multi-table commits,
RFC §8 / lance#7264) closes that gap → retires the sidecar → **eliminates this bug
class.**

Implications:

- **Don't wait for MTT**: it is the "strategic exit, not a current dependency,"
  uncertain and far off, and this bug is live now.
- **Don't over-invest**: keep the fix surgical (classifier idempotency), because the
  whole sidecar layer is MTT-disposable. A surgical fix retires cleanly with the
  layer; a recovery rearchitecture would be throwaway.


---

## 8. Key pointers

- Failing test: `crates/omnigraph-server/tests/schema_routes.rs`
  → `schema_apply_route_hard_drops_property_with_allow_data_loss` (~L777,
  `#[tokio::test(flavor = "multi_thread")]`).
- Error type: `OmniError::Manifest` / `ManifestConflictDetails::ExpectedVersionMismatch`
  (`crates/omnigraph/src/error.rs`); raised by `check_expected_table_versions`
  (`crates/omnigraph/src/db/manifest/publisher.rs`, ~L356).
- Recovery sweep + classifier: `crates/omnigraph/src/db/manifest/recovery.rs`:
  `TableClassification` (~L335), `classify_table` (~L600), roll-forward
  (`heal_pending_sidecars_roll_forward` ~L761, `roll_forward_all` ~L1215, restore +
  publish ~L1275).
- Schema-apply sidecar write: `crates/omnigraph/src/db/omnigraph/schema_apply.rs`
  (the `SidecarKind` schema-apply pins; `db.coordinator.write().refresh()` ~L692).
- Open entry point that runs the sweep: `Omnigraph::open` (read-write mode) →
  `db/manifest/recovery.rs` sweep.
- Repro: §2 above. Stress under `schema_routes` suite parallelism; 0% isolated.


---

## 9. Suggested next steps

1. Add tracing at the recovery roll-forward error site (sidecar kind/id, pins,
   observed vs expected) and capture a failing run (§6.1, §6.2).
2. Reproduce deterministically at the engine level with a failpoint (§6.4); this is
   the red test (rule 12).
3. Implement the classifier-idempotency fix (§5) in a separate commit; confirm
   red→green and that the stress loop goes to 0 failures over ≥50 iterations.
4. Keep it a standalone PR (not bundled with RFC-013 follow-ons).