recovery: refresh-time roll-forward closes the in-process residual

Adds RecoveryMode { Full, RollForwardOnly } and wires Omnigraph::refresh
to invoke roll-forward-only recovery. This closes the documented
"long-running server between Phase B failure and process restart"
residual without requiring a restart, for the common case (mutation /
load finalize → publisher failure).

Why roll-forward only and not full sweep:
  * Roll-forward is safe under concurrency (publisher uses row-level
    CAS).
  * Roll-back uses Dataset::restore, which "wins" against concurrent
    Append/Update/Delete/CreateIndex/Merge per check_restore_txn —
    silently orphaning the concurrent writer's commit (pinned by
    tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning).
    Sidecars that classify as RollBack-eligible are LEFT ON DISK for the
    next ReadWrite open, where no concurrent writers exist and full
    restore is safe.

Implementation:
  * recovery.rs: RecoveryMode enum; recover_manifest_drift takes mode;
    process_sidecar branches on mode for Abort and RollBack — both
    defer to next ReadWrite open under RollForwardOnly. RollForward
    behavior unchanged.
  * omnigraph.rs: Omnigraph::refresh promoted to pub; calls
    recover_manifest_drift in RollForwardOnly mode after coordinator
    refresh. Steady-state cost: one list_dir of __recovery (early
    return on empty). Adds refresh_coordinator_only — pub(crate) —
    for engine-internal callers that hold an in-flight sidecar (the
    schema_apply lease-check + lock-release paths). Without this split,
    refresh would race the in-flight sidecar.
  * schema_apply.rs: switch all 6 internal db.refresh() call sites to
    refresh_coordinator_only().

Tests:
  * refresh_runs_roll_forward_recovery_in_process — trigger
    mutation.post_finalize_pre_publisher; without restart, call
    db.refresh(); assert sidecar deleted, drifted row visible,
    subsequent mutation succeeds.
  * refresh_defers_rollback_eligible_sidecar_to_next_open — synthesize
    a Mutation sidecar with bogus expected (UnexpectedAtP1 → RollBack);
    refresh leaves it on disk and Lance HEAD unchanged; drop and reopen
    runs the full sweep which advances HEAD via restore.

Docs:
  * docs/runs.md "Long-running servers" caveat updated to describe the
    refresh-time roll-forward path and the rollback-defer behavior.
  * docs/invariants.md §VI.23 status line updated to reflect in-process
    closure of the common case.

Workspace tests pass with --features failpoints; no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Ragnor Comerford 2026-05-04 00:15:42 +02:00
parent 8c6506f5cd
commit aaa031e834
No known key found for this signature in database
7 changed files with 361 additions and 29 deletions

View file

@ -105,7 +105,7 @@ These are user-visible commitments. They state what the engine guarantees and wh
Specific defaults (timeout values, memory caps, TTL windows) are *configuration*, not invariants — see [docs/constants.md](constants.md) and per-deployment configuration. The invariant is that bounds and contracts exist, not their numerical values.
23. **Atomicity is per-query.** Every `.gq` query is atomic — multi-statement mutations are all-or-nothing via the substrate's atomic-commit primitive. No cross-query `BEGIN`/`COMMIT`; branches and merges fill that role for agent workflows.
*Status: upheld at the writer-trait surface AND across process boundaries — the sealed `TableStorage` trait routes inserts / updates / scalar-index builds / merge_insert / overwrite through `stage_*` + `commit_staged` (Phase A is drift-free), and the open-time recovery sweep in `db/manifest/recovery.rs` (sidecars at `__recovery/{ulid}.json` written by `MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) closes the per-table commit_staged → manifest publish residual on the next `Omnigraph::open`. The "Lance HEAD ahead of `__manifest`" drift class is unreachable for op-execution failures and recoverable across process boundaries for finalize→publisher failures. Continuous in-process recovery (no restart required between Phase B failure and recovery) is the goal of a future background reconciler. Two writer paths still inline-commit pending upstream Lance work: `delete_where` (lance-format/lance#6658) and `create_vector_index` (lance-format/lance#6666).*
*Status: upheld at the writer-trait surface, across process boundaries, AND in-process for the common case — the sealed `TableStorage` trait routes inserts / updates / scalar-index builds / merge_insert / overwrite through `stage_*` + `commit_staged` (Phase A is drift-free); the open-time recovery sweep in `db/manifest/recovery.rs` (sidecars at `__recovery/{ulid}.json` written by `MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) closes the per-table commit_staged → manifest publish residual on the next `Omnigraph::open`; and `Omnigraph::refresh` runs roll-forward-only recovery in-process so long-running servers close the common case (mutation/load finalize → publisher failure) without restart. The "Lance HEAD ahead of `__manifest`" drift class is unreachable for op-execution failures, recoverable across process boundaries for all writer kinds, and recoverable in-process for roll-forward-eligible sidecars. Sidecars that would require `Dataset::restore` are deferred to the next ReadWrite open (restore unsafe under concurrency); continuous in-process recovery for that case requires per-(table, branch) writer-queue acquisition and is the goal of a future background reconciler. Two writer paths still inline-commit pending upstream Lance work: `delete_where` (lance-format/lance#6658) and `create_vector_index` (lance-format/lance#6666).*
24. **Schema integrity is strict at commit.** Type validation, required-field presence (auto-filled from `@default` if declared), uniqueness across batches and versions, and referential integrity — all enforced before commit succeeds. Per-write softening flags are opt-in, never default.
*Status: aspirational — referential integrity at scale requires SIP-backed cross-table validation; not yet implemented. Cross-batch / cross-version uniqueness tracked in MR-714.*

View file

@ -193,12 +193,19 @@ Triggers for the residual: transient Lance write errors during finalize
(object-store retry budget exhaustion, disk full); persistent publisher
contention exceeding `PUBLISHER_RETRY_BUDGET = 5` retries.
**Long-running servers**: between Phase B failure and the next
`Omnigraph::open` (typically a server restart), subsequent writers on
the affected tables surface
`ManifestConflictDetails::ExpectedVersionMismatch`. Continuous
in-process recovery (no restart required) is the goal of a future
background reconciler.
**Long-running servers**: `Omnigraph::refresh` runs roll-forward-only
recovery in-process — the common Phase B → Phase C residual closes
without a restart. The next mutation on the same handle (after refresh)
no longer surfaces `ExpectedVersionMismatch` for the failed table.
Sidecars that would require a `Dataset::restore` (mixed / unexpected
state) are deferred to the next `OpenMode::ReadWrite` open: restore is
unsafe under concurrency because Lance's `check_restore_txn` accepts
the restore against in-flight Append/Update/Delete commits and
silently orphans them (pinned by
`tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning`).
Continuous in-process recovery for the rollback path is the goal of a
future background reconciler with per-(table, branch) writer-queue
acquisition.
The publisher-CAS contract is unchanged: a *concurrent writer* that
advances any of our touched tables between snapshot capture and