mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-09 01:35:18 +02:00
recovery: address PR #72 review findings
Bot reviewers (cubic, cursor, chatgpt-codex) caught 4 merge-blocking bugs + 3 strongly-recommended fixes + 3 doc errors in the initial PR. Each fix has a paired test demonstrating the bug before the fix. Merge-blocking fixes: - BranchMerge moved to loose-match classifier arm. publish_rewritten_ merge_table runs multiple commit_staged calls per table (merge_insert + delete_where + index rebuilds). Strict classification rolled back valid completed Phase B work as UnexpectedMultistep. Three new unit tests pin the loose-match behavior for BranchMerge. - branch_merge sidecar uses self.active_branch() (the resolved target branch) instead of inferring from the first sorted table key. The previous heuristic could record None (== main) when the merge target was a non-main branch, causing recovery to publish to the wrong manifest namespace. - Best-effort sidecar delete in all 5 writer sites (mutation, loader, schema_apply, branch_merge, ensure_indices). Previously, a sidecar cleanup failure after a successful manifest publish would error out the user's call for a write that already landed. Now: log a warning and ignore — the next open's recovery sweep tidies the stale sidecar via NoMovement classification. - ensure_indices sidecar scoped to tables that need work via new helpers needs_index_work_node / needs_index_work_edge. Previously the sidecar pinned every catalog table; if only one needed indexing, the others classified as NoMovement and the all-or-nothing decision rolled back legitimate index work. Strongly-recommended fixes: - recover_manifest_drift now takes &mut GraphCoordinator and refreshes between sidecars. Sidecar B's classification needs to see sidecar A's manifest changes, otherwise B can be classified against stale pins and incorrectly roll back work that just landed. - list_sidecars sorts URIs before reading. Sidecar filenames are ULIDs (chronologically sortable), so this gives deterministic, time-ordered processing. Filesystem-order was nondeterministic. - ReadOnly opens skip recover_schema_state_files too (was: only the MR-847 sweep was gated). Read-only consumers may run with read-only credentials; silent open-time mutations violate the contract. Doc cleanups: - Removed stale "Phase 4 placeholder" comment from recover_manifest_drift. - docs/runs.md decision-tree wording now correctly surfaces the InvariantViolation abort path. - docs/branches-commits.md clarifies actor_id is in _graph_commit_actors.lance (joined by graph_commit_id), not on _graph_commits.lance itself. Test surface (post-fixes): - 25 unit tests in db::manifest::recovery (+4 from this commit). - 10 integration tests in tests/recovery.rs (+3 from this commit). - ~672 tests across ~25 binaries pass with --features failpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
932334ba01
commit
164bafbbe7
10 changed files with 477 additions and 69 deletions
|
|
@ -60,4 +60,4 @@ Filtered from `branch_list()` but visible to internals:
|
|||
|
||||
The four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) protect their multi-table commits with a sidecar at `__recovery/{ulid}.json` written before Phase B and deleted after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`: classify per-table state, decide all-or-nothing per sidecar, roll forward / back, record an audit row.
|
||||
|
||||
Audit rows live in `_graph_commit_recoveries.lance` (sibling to `_graph_commits.lance`) and reference the commit graph by `graph_commit_id`. The linked `_graph_commits.lance` row carries `actor_id="omnigraph:recovery"` (the system actor). To find recoveries for a specific original actor: `omnigraph commit list --filter actor=omnigraph:recovery`, then join to `_graph_commit_recoveries.lance` by `graph_commit_id` to read `recovery_for_actor`. Schema: see `crates/omnigraph/src/db/recovery_audit.rs`.
|
||||
Audit rows live in `_graph_commit_recoveries.lance` (sibling to `_graph_commits.lance`) and reference the commit graph by `graph_commit_id`. The linked recovery commit is identified by that same `graph_commit_id`, and `actor_id="omnigraph:recovery"` is stored in `_graph_commit_actors.lance` (joined by `graph_commit_id`) — `_graph_commits.lance` itself does not carry the `actor_id` column. To find recoveries for a specific original actor: `omnigraph commit list --filter actor=omnigraph:recovery`, then join to `_graph_commit_recoveries.lance` by `graph_commit_id` to read `recovery_for_actor`. Schema: see `crates/omnigraph/src/db/recovery_audit.rs`.
|
||||
|
|
|
|||
11
docs/runs.md
11
docs/runs.md
|
|
@ -171,12 +171,17 @@ recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`:
|
|||
Lance HEAD to the manifest pin. Classify per the all-or-nothing
|
||||
decision tree (RolledPastExpected / NoMovement / UnexpectedAtP1 /
|
||||
UnexpectedMultistep / InvariantViolation).
|
||||
- If every table is `RolledPastExpected`, **roll forward**: a single
|
||||
`ManifestBatchPublisher::publish` call extends every pin atomically.
|
||||
- If any table is `InvariantViolation` (Lance HEAD < manifest pinned —
|
||||
should be impossible), **abort** with a loud error and leave the
|
||||
sidecar on disk for operator review.
|
||||
- Otherwise, if every table is `RolledPastExpected`, **roll forward**:
|
||||
a single `ManifestBatchPublisher::publish` call extends every pin
|
||||
atomically.
|
||||
- Otherwise **roll back**: per-table `Dataset::restore` to the
|
||||
expected_version (with a fragment-set short-circuit so repeated
|
||||
mid-sweep crashes don't pile up versions).
|
||||
- Either way, an audit row is recorded — `_graph_commits.lance` carries
|
||||
- After a successful roll-forward or roll-back, an audit row is
|
||||
recorded — `_graph_commits.lance` carries
|
||||
a commit tagged `actor_id = "omnigraph:recovery"`, and a sibling
|
||||
`_graph_commit_recoveries.lance` row carries `recovery_kind`,
|
||||
`recovery_for_actor` (the original sidecar's actor), `operation_id`,
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue