docs/tests: reserve Phase A/B/C/D for the per-writer recovery flow

Three terminologies were calling themselves Phase A/B in PR #72:

1. Per-writer recovery (canonical, four phases A/B/C/D — sidecar /
   commit_staged loop / manifest publish / sidecar delete in
   `docs/runs.md:157`).
2. Per-table staged-write contract from MR-793 (two phases —
   `stage_*` then `commit_staged`).
3. Test-narrative scaffolding (Phase A = setup the failure, Phase B
   = verify recovery — used as section dividers in failpoints.rs).

Same letters, three meanings; three reviewers including the bots have
already misread the code in the resulting fog. This change keeps
"Phase A/B/C/D" exclusively for #1 and rewrites the other two:

- `ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`
  → `ensure_indices_stage_btree_failure_leaves_existing_tables_writable`
  (matches the `stage_create_btree_index` API verb).
- Comment at `table_ops.rs:610` and the test docstring at
  `failpoints.rs:807` rewrite "a Phase A failure in the staged-index
  path" → "a stage-step failure in the staged-index path".
- Twelve `// Phase A:` / `// Phase B:` test scaffolding comment
  headers in `failpoints.rs` (across six test fns) become
  `// Setup:` / `// Recovery:`.
- A "Phase letter convention" note added near `docs/runs.md:165`
  spells the rule out for future readers.

Also bundled: rename
`composite_flow_init_load_branch_merge_time_travel_optimize_cleanup`
→ `composite_flow_canonical_lifecycle` so it pairs as a story name
with `composite_flow_multi_branch_sequential_merges` (the previously-
deferred symmetry rename).

No behaviour change. Both renamed tests pass; full failpoints (18) +
composite_flow (2) suites pass; workspace baseline + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Ragnor Comerford 2026-05-05 22:46:03 +02:00
parent fb0f024652
commit a30666bc38
No known key found for this signature in database
4 changed files with 27 additions and 17 deletions

View file

@ -607,8 +607,9 @@ async fn stage_and_commit_btree(
))
})?;
// Failpoint between stage and commit. Used by `tests/failpoints.rs`
// to demonstrate that a Phase A failure in the staged-index path
// leaves no Lance-HEAD drift on the touched table.
// to demonstrate that a stage-step failure in the staged-index
// path (`stage_create_btree_index` succeeded; `commit_staged` not
// yet called) leaves no Lance-HEAD drift on the touched table.
crate::failpoints::maybe_fail("ensure_indices.post_stage_pre_commit_btree")?;
let new_ds = db
.table_store

View file

@ -51,7 +51,7 @@ const TEST_DATA: &str = include_str!("fixtures/test.jsonl");
const TEST_QUERIES: &str = include_str!("fixtures/test.gq");
#[tokio::test]
async fn composite_flow_init_load_branch_merge_time_travel_optimize_cleanup() {
async fn composite_flow_canonical_lifecycle() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();

View file

@ -232,7 +232,7 @@ async fn recovery_rolls_forward_after_finalize_publisher_failure() {
let uri = dir.path().to_str().unwrap().to_string();
let operation_id;
// Phase A: trigger the residual.
// Setup: trigger the residual.
{
let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap();
let _failpoint = ScopedFailPoint::new("mutation.post_finalize_pre_publisher", "return");
@ -272,7 +272,7 @@ async fn recovery_rolls_forward_after_finalize_publisher_failure() {
// Drop the failpoint scope and the engine handle.
}
// Phase B: reopen runs the recovery sweep. The sweep finds the
// Recovery: reopen runs the recovery sweep. The sweep finds the
// sidecar, classifies node:Person as RolledPastExpected, decides
// RollForward, publishes the manifest update, records the audit
// row, deletes the sidecar.
@ -557,7 +557,7 @@ async fn refresh_runs_roll_forward_recovery_in_process() {
let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap();
// Phase A: trigger the residual (sidecar persists; manifest unchanged).
// Setup: trigger the residual (sidecar persists; manifest unchanged).
{
let _failpoint = ScopedFailPoint::new("mutation.post_finalize_pre_publisher", "return");
let err = mutate_main(
@ -581,7 +581,7 @@ async fn refresh_runs_roll_forward_recovery_in_process() {
);
}
// Phase B: explicit refresh runs roll-forward-only recovery
// Recovery: explicit refresh runs roll-forward-only recovery
// in-process — no restart needed. Sidecar finds the Person drift,
// classifies RolledPastExpected, rolls forward via publisher CAS,
// and deletes the sidecar.
@ -804,7 +804,7 @@ async fn finalize_publisher_residual_does_not_drift_untouched_tables() {
.expect("Company write on a non-drifted table should succeed");
}
/// Acceptance test: a Phase A failure in the staged-index path
/// Acceptance test: a stage-step failure in the staged-index path
/// (`stage_create_btree_index` succeeded; `commit_staged` not yet
/// called) leaves NO Lance-HEAD drift on the existing tables.
/// Subsequent operations against those tables succeed without
@ -828,7 +828,7 @@ async fn finalize_publisher_residual_does_not_drift_untouched_tables() {
/// by `cleanup_old_versions` (or removed when a future apply at the
/// same target path resolves the rename).
#[tokio::test]
async fn ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable() {
async fn ensure_indices_stage_btree_failure_leaves_existing_tables_writable() {
let _scenario = FailScenario::setup();
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap().to_string();
@ -1024,7 +1024,7 @@ async fn schema_apply_phase_b_failure_recovered_on_next_open() {
version_main(&db).await.unwrap()
};
// Phase A: trigger the residual via `schema_apply.after_staging_write`.
// Setup: trigger the residual via `schema_apply.after_staging_write`.
// This failpoint fires AFTER the rewritten_tables/indexed_tables loops
// (Lance HEAD advanced) AND AFTER the schema-state staging files are
// written, but BEFORE the manifest publish. The recovery sidecar persists.
@ -1079,7 +1079,7 @@ edge WorksAt: Person -> Company
operation_id = single_sidecar_operation_id(dir.path());
}
// Phase B: reopen runs the recovery sweep. Sidecar's writer_kind is
// Recovery: reopen runs the recovery sweep. Sidecar's writer_kind is
// SchemaApply (loose-match) — classifier accepts the multi-commit
// drift on Person, decision is RollForward, manifest extends to the
// current Lance HEAD.
@ -1188,7 +1188,7 @@ async fn branch_merge_phase_b_failure_recovered_on_next_open() {
version_main(&db).await.unwrap()
};
// Phase A: failpoint fires after the per-table publish loop completes
// Setup: failpoint fires after the per-table publish loop completes
// but before commit_manifest_updates. Sidecar persists.
{
let mut db = Omnigraph::open(&uri).await.unwrap();
@ -1214,7 +1214,7 @@ async fn branch_merge_phase_b_failure_recovered_on_next_open() {
);
}
// Phase B: reopen runs the sweep. BranchMerge uses LOOSE
// Recovery: reopen runs the sweep. BranchMerge uses LOOSE
// classification — `publish_rewritten_merge_table` runs multiple
// commit_staged calls per table (stage_merge_insert + delete_where +
// index rebuilds), so post_commit_pin in the sidecar is a lower
@ -1363,7 +1363,7 @@ async fn branch_merge_phase_b_failure_recovered_on_non_main_target() {
.await
.unwrap();
// Phase A: failpoint fires after the per-table publish loop completes
// Setup: failpoint fires after the per-table publish loop completes
// but before commit_manifest_updates. Sidecar persists with
// branch=Some("target_branch").
{
@ -1389,7 +1389,7 @@ async fn branch_merge_phase_b_failure_recovered_on_non_main_target() {
operation_id = single_sidecar_operation_id(dir.path());
}
// Phase B: reopen runs full sweep. The BranchMerge sidecar's branch
// Recovery: reopen runs full sweep. The BranchMerge sidecar's branch
// = Some("target_branch"); D2 fix opens a per-branch CommitGraph
// for the audit append so the merge-parent linkage is correct.
let db = Omnigraph::open(&uri).await.unwrap();
@ -1554,7 +1554,7 @@ async fn ensure_indices_phase_b_failure_does_not_leak_sidecar_when_no_work_neede
.unwrap();
}
// Phase A: trigger the failpoint. Steady-state ensure_indices
// Setup: trigger the failpoint. Steady-state ensure_indices
// produces zero sidecar pins (the helpers scope pins to tables
// that genuinely need work); no sidecar is written. The failpoint
// still fires, surfacing the Err.
@ -1590,7 +1590,7 @@ async fn ensure_indices_phase_b_failure_does_not_leak_sidecar_when_no_work_neede
);
}
// Phase B: reopen is a clean no-op (no sidecar to recover).
// Recovery: reopen is a clean no-op (no sidecar to recover).
let _db = Omnigraph::open(&uri).await.unwrap();
let recovery_dir = dir.path().join("__recovery");

View file

@ -163,6 +163,15 @@ are left at `Lance HEAD = manifest_pinned + 1`.
3. **Phase C**: publisher commits the manifest.
4. **Phase D**: writer deletes the sidecar.
> **Phase letter convention.** Throughout the recovery code, log
> messages, failpoint names (e.g. `branch_merge.post_phase_b_pre_manifest_commit`),
> and the per-writer integration tests, "Phase A/B/C/D" refers
> exclusively to the four-step lifecycle above. The per-table
> staged-write contract (`stage_*` then `commit_staged`, two steps)
> is referred to by those API verbs — never by phase letters — so a
> reader of `recovery.rs`, `failpoints.rs`, or this document only
> encounters phase letters in the per-writer context.
A failure between Phase A and Phase D leaves the sidecar on disk. The
next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the
recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`: