omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-27 02:39:38 +02:00

Ragnor Comerford 4a5277b9c0 fix(recovery): converge roll-forward on concurrent manifest advance (#296)	2026-06-24 19:03:51 +02:00

* refactor(storage): gate test-only TableStore::append_batch behind cfg(test)

  The inherent append_batch is used only by in-source recovery test setup, but the non-test lib build (cfg(test) off) cannot see those callers and emitted a dead_code warning. Gating the method #[cfg(test)] silences the false positive and enforces its own doc contract ("no new engine call sites") by construction: engine code physically cannot call a cfg(test) method.

* test(failpoints): harden fault-injection harness + reproduce roll-forward CAS race

  Hardens the test infrastructure around the process-global `fail` registry, and adds a deterministic red repro for the open-time recovery sweep's roll-forward CAS race (iss-schema-apply-reopen-recovery-race). The fix lands in the next commit; this commit is intentionally red (rule 12: red→green visible in log).

  Harness:
  - One `ScopedFailPoint` (engine) gaining `with_callback`; the cluster duplicate is removed and cluster tests reuse the engine type via `omnigraph/failpoints`.
  - `#[serial]` on every failpoint test (the registry is process-global, so shared names interfere under parallelism); `serial_test` added to cluster dev-deps.
  - `helpers::failpoint::Rendezvous` (park-first / wait-until-reached / release) replaces fixed-`sleep` cross-thread coordination; the three concurrent tests now rendezvous deterministically. The reached flag doubles as a fired-assert.
  - Compile-checked `failpoints::names` catalog (engine + cluster); every call site references a const, and `failpoint_names_guard.rs` enforces "no string literal names" by source-walk, so a typo is a build error, not a silent no-fire.

  Red repro:
  - New `recovery.before_roll_forward_publish` failpoint at the sweep's classify -> publish-CAS window (the only injection point there).
  - `open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one pending sidecar; the sweep parked at the failpoint loses its publish CAS to the other and fails the open with `ExpectedVersionMismatch`. FAILS at this commit by design.

* fix(recovery): converge roll-forward when the manifest advances concurrently

  The open-time recovery sweep classified a pending sidecar as RolledPastExpected, then published a manifest CAS at the sidecar's pinned expected_version. Under a concurrent writer that advanced the manifest past expected during the classify -> publish window, the CAS failed with ExpectedVersionMismatch and `?`-propagated, failing the whole Omnigraph::open (iss-schema-apply-reopen-recovery-race).

  A roll-forward's postcondition is "the manifest reflects the sidecar's committed Lance state", not "this sweep won the CAS" (invariants 7 & 15). On an ExpectedVersionMismatch, re-read the live manifest and check whether the sidecar's intent is already satisfied (every pinned table at a version >= the one we observed and tried to publish; added tables registered; tombstones gone). The check is sound under the heal-first invariant, documented at the check. If satisfied, this is convergence: record the RolledForward audit + delete the sidecar idempotently. If only partway, defer to the next pass. Either way the open no longer fails. Other errors still propagate; a genuine logical conflict resurfaces via the classifier's InvariantViolation.

  Turns the red repro from the previous commit green. The roll-BACK twin (iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and still needs a cross-process lease; the known-gap is updated accordingly.

* Address PR review: harden failpoint name guard + dedupe converge audit

  Two issues surfaced in PR review of the failpoint hardening + recovery fix:

  1. Name guard had a line-split blind spot. It scanned per line, so a call wrapped across lines (`park_first(\n "name",\n)`) put the literal on a different line than the call prefix and bypassed the "no string-literal failpoint names" check, and one such literal (`mutation.delete_node_pre_primary_delete`) had slipped through. Make the guard whitespace/newline-tolerant (skip past the open paren to the first argument token) so wrapping can't hide a literal, and convert the bypassed site to the `names::` const.
  2. Convergence path could append a duplicate recovery audit. When a roll-forward publish loses its CAS but the manifest already reached the sidecar's goal, `converge_or_defer_roll_forward` recorded a RolledForward audit unconditionally. Under the heal-first invariant, whoever advanced the manifest already healed this sidecar (audit + delete), so a second row landed in `_graph_commit_recoveries` for one recovery event. Gate the audit+delete on the sidecar still being present: absent => the winner completed it, return success with no duplicate row. The convergence regression test now asserts exactly one audit row.

* docs(dev): remove the schema-apply recovery-flake handoff (fixed by this PR)

  The handoff was a transient investigation note for `iss-schema-apply-reopen-recovery-race`, which this PR fixes (the converge helper + the red→green regression). Its rationale now lives durably in the dev-graph issue, the PR/commit history, and invariants.md, so the handoff is obsolete. Drop the doc, its dev-index row, and the dangling reference from the RFC-013 handoff; the doc cross-link check stays green.

* fix(recovery): include added-table registrations in the converge audit

  The CAS-loss convergence audit built outcomes only from `sidecar.tables`, omitting the `additional_registrations` that the normal `roll_forward_all` audit includes. For a SchemaApply sidecar with added types, a converge-path audit row would be incomplete versus the normal roll-forward path for the same recovery kind. Mirror the roll-forward outcome construction (append a registration outcome per added table) so both paths emit the same audit shape.
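
The convergence rule in the commit message above (on a lost publish CAS, re-read the manifest, check whether the sidecar's intent is already satisfied, then audit and delete only if the sidecar is still present) can be sketched roughly as below. This is a simplified illustration, not the engine's code: `Manifest`, `Sidecar`, `Outcome`, `intent_satisfied`, and `converge_or_defer` are hypothetical names, the manifest is reduced to a table-to-version map, and the audit log is a plain `Vec<String>`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified live manifest: table name -> published version.
#[derive(Debug, Default, Clone)]
struct Manifest {
    tables: BTreeMap<String, u64>,
}

/// Hypothetical, simplified pending sidecar (not the real on-disk shape).
#[derive(Debug, Default, Clone)]
struct Sidecar {
    id: String,
    /// Table -> version the sweep observed and tried to publish.
    pinned: BTreeMap<String, u64>,
    /// Tables the operation registers.
    added: BTreeSet<String>,
    /// Tables the operation removes.
    tombstoned: BTreeSet<String>,
}

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    /// The manifest already reflects the sidecar's intent.
    Converged,
    /// Only partway there: leave the sidecar for the next pass.
    Deferred,
}

/// True when every pinned table is at or past the attempted version,
/// every added table is registered, and every tombstoned table is gone.
fn intent_satisfied(live: &Manifest, sidecar: &Sidecar) -> bool {
    let pinned_ok = sidecar
        .pinned
        .iter()
        .all(|(table, want)| live.tables.get(table).map_or(false, |have| have >= want));
    let added_ok = sidecar.added.iter().all(|t| live.tables.contains_key(t));
    let tombstones_ok = sidecar.tombstoned.iter().all(|t| !live.tables.contains_key(t));
    pinned_ok && added_ok && tombstones_ok
}

/// Called after a publish CAS lost with a version mismatch (assumed flow).
/// `pending` stands in for the sidecar store, `audit` for the recovery audit table.
fn converge_or_defer(
    live: &Manifest,
    sidecar: &Sidecar,
    pending: &mut BTreeSet<String>,
    audit: &mut Vec<String>,
) -> Outcome {
    if !intent_satisfied(live, sidecar) {
        return Outcome::Deferred;
    }
    // `remove` returns true only if the sidecar was still present, so a sweep
    // that arrives after the winner already healed it writes no second row.
    if pending.remove(&sidecar.id) {
        audit.push(format!("RolledForward:{}", sidecar.id));
    }
    Outcome::Converged
}
```

Neither outcome is an error, which mirrors the point that a lost CAS no longer fails the open. Gating the audit on the sidecar's presence is what keeps one recovery event to one audit row in this sketch.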
..
dev	fix(recovery): converge roll-forward on concurrent manifest advance (#296)	2026-06-24 19:03:51 +02:00
releases	release: v0.7.1 (#290)	2026-06-19 23:12:44 +03:00
rfcs	governance: external contribution model (issues/discussions/RFCs/PRs) (#143)	2026-06-06 23:58:08 +03:00
user	(feat): compact the internal manifest/commit-graph tables in optimize (#291)	2026-06-21 16:38:20 +02:00