mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-27 02:39:38 +02:00
* refactor(storage): gate test-only TableStore::append_batch behind cfg(test)
The inherent append_batch is used only by in-source recovery test setup, but
the non-test lib build (cfg(test) off) cannot see those callers and emitted a
dead_code warning. Gating the method #[cfg(test)] silences the false positive
and enforces its own doc contract ("no new engine call sites") by construction
— engine code physically cannot call a cfg(test) method.
* test(failpoints): harden fault-injection harness + reproduce roll-forward CAS race
Hardens the test infrastructure around the process-global `fail` registry, and
adds a deterministic red repro for the open-time recovery sweep's roll-forward
CAS race (iss-schema-apply-reopen-recovery-race). The fix lands in the next
commit — this commit is intentionally red (rule 12: red→green visible in log).
Harness:
- One `ScopedFailPoint` (engine) gaining `with_callback`; the cluster duplicate
is removed and cluster tests reuse the engine type via `omnigraph/failpoints`.
- `#[serial]` on every failpoint test (the registry is process-global, so shared
names interfere under parallelism); `serial_test` added to cluster dev-deps.
- `helpers::failpoint::Rendezvous` (park-first / wait-until-reached / release)
replaces fixed-`sleep` cross-thread coordination; the three concurrent tests
now rendezvous deterministically. The reached flag doubles as a fired-assert.
- Compile-checked `failpoints::names` catalog (engine + cluster); every call
site references a const, and `failpoint_names_guard.rs` enforces "no string
literal names" by source-walk, so a typo is a build error not a silent no-fire.
Red repro:
- New `recovery.before_roll_forward_publish` failpoint at the sweep's
classify -> publish-CAS window (the only injection point there).
- `open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two
concurrent open-sweeps race one pending sidecar; the sweep parked at the
failpoint loses its publish CAS to the other and fails the open with
`ExpectedVersionMismatch`. FAILS at this commit by design.
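The name catalog described under "Harness" can be pictured with a minimal sketch. The two string values are the failpoint names quoted in this log; the const identifiers, the `ALL` slice, and the `is_catalogued` helper are hypothetical and only illustrate why a misspelled const fails to compile while a misspelled string literal would silently never fire.

```rust
/// Hypothetical catalog module. Only the string values come from the
/// commit text; the const names and layout are assumptions.
pub mod names {
    pub const RECOVERY_BEFORE_ROLL_FORWARD_PUBLISH: &str =
        "recovery.before_roll_forward_publish";
    pub const MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE: &str =
        "mutation.delete_node_pre_primary_delete";

    /// Every catalogued name, so a test can check membership.
    pub const ALL: &[&str] = &[
        RECOVERY_BEFORE_ROLL_FORWARD_PUBLISH,
        MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE,
    ];
}

/// Hypothetical helper: true only for names present in the catalog.
/// A call site written as `names::RECOVERY_BEFORE_ROLL_FORWARD_PUBLSH`
/// (typo) would not compile, whereas the same typo inside a string
/// literal compiles and simply never matches.
pub fn is_catalogued(name: &str) -> bool {
    names::ALL.contains(&name)
}
```

Referencing consts moves the failure from run time (a point that never fires) to compile time, which is the property the source-walk guard then protects.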
* fix(recovery): converge roll-forward when the manifest advances concurrently
The open-time recovery sweep classified a pending sidecar as RolledPastExpected,
then published a manifest CAS at the sidecar's pinned expected_version. Under a
concurrent writer that advanced the manifest past expected during the
classify -> publish window, the CAS failed with ExpectedVersionMismatch and
`?`-propagated, failing the whole Omnigraph::open.
Tracked as iss-schema-apply-reopen-recovery-race.
A roll-forward's postcondition is "the manifest reflects the sidecar's committed
Lance state", not "this sweep won the CAS" (invariants 7 & 15). On an
ExpectedVersionMismatch, re-read the live manifest and check whether the
sidecar's intent is already satisfied (every pinned table at a version >= the
one we observed and tried to publish; added tables registered; tombstones gone
— sound under the heal-first invariant, documented at the check). If satisfied,
this is convergence: record the RolledForward audit + delete the sidecar
idempotently. If only partway, defer to the next pass. Either way the open no
longer fails. Other errors still propagate; a genuine logical conflict
resurfaces via the classifier's InvariantViolation.
Turns the red repro from the previous commit green. The roll-BACK twin
(iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and
still needs a cross-process lease — the known-gap is updated accordingly.
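The converge-or-defer decision above can be sketched as a pure check against a re-read manifest. Everything here is a simplified assumption rather than the engine's types: `Manifest`, `Sidecar`, `CasLossOutcome`, and both functions are hypothetical, and "tombstones gone" is modelled as "tables the sidecar removes are absent from the live manifest", which is one reading of the commit text.

```rust
use std::collections::HashMap;

/// Hypothetical live manifest: table name -> published version.
#[derive(Default)]
pub struct Manifest {
    pub tables: HashMap<String, u64>,
}

/// Hypothetical pending sidecar describing the intended end state.
pub struct Sidecar {
    /// Versions the sweep observed and tried to publish.
    pub pinned: HashMap<String, u64>,
    /// Tables the operation adds; they must be registered.
    pub added: Vec<String>,
    /// Tables the operation removes; they must no longer be listed.
    pub removed: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum CasLossOutcome {
    /// Intent already satisfied: audit and delete the sidecar.
    Converged,
    /// Only partway there: leave the sidecar for the next pass.
    Deferred,
}

/// True when the live manifest already reflects the sidecar's intent.
pub fn intent_satisfied(manifest: &Manifest, sidecar: &Sidecar) -> bool {
    let pinned_ok = sidecar.pinned.iter().all(|(table, wanted)| {
        manifest.tables.get(table).map_or(false, |live| live >= wanted)
    });
    let added_ok = sidecar.added.iter().all(|t| manifest.tables.contains_key(t));
    let removed_ok = sidecar.removed.iter().all(|t| !manifest.tables.contains_key(t));
    pinned_ok && added_ok && removed_ok
}

/// Decision taken after a publish CAS loses with a version mismatch.
/// Neither branch fails the open.
pub fn on_expected_version_mismatch(manifest: &Manifest, sidecar: &Sidecar) -> CasLossOutcome {
    if intent_satisfied(manifest, sidecar) {
        CasLossOutcome::Converged
    } else {
        CasLossOutcome::Deferred
    }
}
```

The `>=` comparison is what makes losing the race harmless: a concurrent writer that pushed a table past the observed version still satisfies the postcondition.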
* Address PR review: harden failpoint name guard + dedupe converge audit
Two issues surfaced in PR review of the failpoint hardening + recovery fix:
1. Name guard had a line-split blind spot. It scanned per line, so a call
wrapped across lines (`park_first(\n "name",\n)`) put the literal on a
different line than the call prefix and bypassed the "no string-literal
failpoint names" check — and one such literal
(`mutation.delete_node_pre_primary_delete`) had slipped through. Make the
guard whitespace/newline-tolerant (skip past the open paren to the first
argument token) so wrapping can't hide a literal, and convert the bypassed
site to the `names::` const.
2. Convergence path could append a duplicate recovery audit. When a
roll-forward publish loses its CAS but the manifest already reached the
sidecar's goal, `converge_or_defer_roll_forward` recorded a RolledForward
audit unconditionally. Under the heal-first invariant, whoever advanced the
manifest already healed this sidecar (audit + delete), so a second row
landed in `_graph_commit_recoveries` for one recovery event. Gate the
audit+delete on the sidecar still being present: absent => the winner
completed it, return success with no duplicate row. The convergence
regression test now asserts exactly one audit row.
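The line-split blind spot in item 1 can be illustrated with a small scanner. This is a sketch under assumptions, not the contents of `failpoint_names_guard.rs`: the function name `literal_name_call_sites` and the convention that the prefix includes the open paren are hypothetical. The point it shows is that skipping whitespace and newlines after the paren finds a literal even when the call is wrapped.

```rust
/// Returns the byte offset of every occurrence of `prefix` (for example
/// `"park_first("`) whose first argument token is a string literal.
/// Whitespace and newlines between the paren and the first token are
/// skipped, so wrapping the call across lines cannot hide a literal.
/// Sketch only: it does not ignore matches inside comments or strings.
pub fn literal_name_call_sites(source: &str, prefix: &str) -> Vec<usize> {
    let mut hits = Vec::new();
    let mut from = 0;
    while let Some(rel) = source[from..].find(prefix) {
        let call_start = from + rel;
        let after_paren = call_start + prefix.len();
        // Skip past any whitespace or newlines to the first argument token.
        let first_token = source[after_paren..].trim_start();
        if first_token.starts_with('"') {
            hits.push(call_start);
        }
        from = after_paren;
    }
    hits
}
```

A per-line scan would see `park_first(` and the quoted name on different lines and report nothing for the wrapped form; scanning the whole source as one string removes that dependence on layout.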
* docs(dev): remove the schema-apply recovery-flake handoff (fixed by this PR)
The handoff was a transient investigation note for
`iss-schema-apply-reopen-recovery-race`, which this PR fixes (the converge
helper + the red→green regression). Its rationale now lives durably in the
dev-graph issue, the PR/commit history, and invariants.md, so the handoff is
obsolete. Drop the doc, its dev-index row, and the dangling reference from the
RFC-013 handoff; the doc cross-link check stays green.
* fix(recovery): include added-table registrations in the converge audit
The CAS-loss convergence audit built outcomes only from `sidecar.tables`,
omitting the `additional_registrations` that the normal `roll_forward_all`
audit includes. For a SchemaApply sidecar with added types, a converge-path
audit row would be incomplete versus the normal roll-forward path for the same
recovery kind. Mirror the roll-forward outcome construction (append a
registration outcome per added table) so both paths emit the same audit shape.
84 lines
3.1 KiB
Rust
//! Deterministic rendezvous for concurrent failpoint tests.
//!
//! The pattern: park the FIRST thread that hits a failpoint until the test
//! explicitly releases it, while later arrivals fall through. This replaces
//! fixed "guess" `sleep`s for cross-thread coordination: the test waits on
//! the *condition* (the point was reached) with a bounded timeout that fails
//! loudly, instead of betting a fixed duration is long enough.
//!
//! Extracted from the open-coded `AtomicBool` + callback pattern that
//! `fork_collision_with_live_concurrent_fork_is_retryable` proved out.
//!
//! The `reached` flag also doubles as a fired-assertion: a point that is
//! never hit makes [`Rendezvous::wait_until_reached`] panic, so a typo'd or
//! misplaced failpoint cannot pass silently.

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering::SeqCst};
use std::time::Duration;

use omnigraph::failpoints::ScopedFailPoint;

/// A parked-on-first-arrival rendezvous bound to a failpoint name. The
/// underlying callback is RAII-cleaned when this guard drops.
pub struct Rendezvous {
    name: String,
    reached: Arc<AtomicBool>,
    release: Arc<AtomicBool>,
    _failpoint: ScopedFailPoint,
}

impl Rendezvous {
    /// Register `name` so the FIRST thread to hit it records readiness and
    /// blocks until [`release`](Self::release); later arrivals fall through
    /// immediately. The park is bounded (~30s) so a test bug cannot hang the
    /// suite forever.
    pub fn park_first(name: &str) -> Self {
        let reached = Arc::new(AtomicBool::new(false));
        let release = Arc::new(AtomicBool::new(false));
        let (cb_reached, cb_release) = (Arc::clone(&reached), Arc::clone(&release));
        let _failpoint = ScopedFailPoint::with_callback(name, move || {
            if cb_reached
                .compare_exchange(false, true, SeqCst, SeqCst)
                .is_ok()
            {
                // ~30s bound (6000 * 5ms); released earlier on the common path.
                for _ in 0..6000 {
                    if cb_release.load(SeqCst) {
                        return;
                    }
                    std::thread::sleep(Duration::from_millis(5));
                }
            }
        });
        Self {
            name: name.to_string(),
            reached,
            release,
            _failpoint,
        }
    }

    /// Async-wait until the parked thread has reached the failpoint, polling
    /// the readiness condition with a bounded (~12s) timeout. Panics if the
    /// point is never hit; this is the fired-assertion.
    pub async fn wait_until_reached(&self) {
        for _ in 0..2400 {
            if self.reached.load(SeqCst) {
                return;
            }
            tokio::time::sleep(Duration::from_millis(5)).await;
        }
        panic!("rendezvous: failpoint '{}' was never reached", self.name);
    }

    /// Whether the parked thread has reached the failpoint yet.
    pub fn reached(&self) -> bool {
        self.reached.load(SeqCst)
    }

    /// Release the parked thread so it resumes past the failpoint.
    pub fn release(&self) {
        self.release.store(true, SeqCst);
    }
}
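The park-first / wait-until-reached / release sequence can be tried without the engine or an async runtime. The sketch below is a std-only stand-in, not the `Rendezvous` type above: `ParkFirst`, its `hit` method (which plays the role of the failpoint callback), and `race_two_workers` are hypothetical names, and the blocking `wait_until_reached` replaces the `tokio` version purely to keep the example dependency-free.

```rust
use std::sync::atomic::{AtomicBool, Ordering::SeqCst};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Std-only stand-in: the first arrival parks, later arrivals fall through.
#[derive(Clone, Default)]
pub struct ParkFirst {
    reached: Arc<AtomicBool>,
    release: Arc<AtomicBool>,
}

impl ParkFirst {
    /// Called where real code would hit the failpoint.
    pub fn hit(&self) {
        // Only the thread that flips `reached` from false to true parks.
        if self.reached.compare_exchange(false, true, SeqCst, SeqCst).is_ok() {
            let deadline = Instant::now() + Duration::from_secs(30);
            while !self.release.load(SeqCst) && Instant::now() < deadline {
                thread::sleep(Duration::from_millis(5));
            }
        }
    }

    /// Blocks until some thread has parked; panics after ~12s otherwise.
    pub fn wait_until_reached(&self) {
        let deadline = Instant::now() + Duration::from_secs(12);
        while !self.reached.load(SeqCst) {
            assert!(Instant::now() < deadline, "point was never reached");
            thread::sleep(Duration::from_millis(5));
        }
    }

    /// Lets the parked thread resume.
    pub fn release(&self) {
        self.release.store(true, SeqCst);
    }
}

/// Drives two workers through the same point and returns completion order.
pub fn race_two_workers() -> Vec<&'static str> {
    let gate = ParkFirst::default();
    let order = Arc::new(Mutex::new(Vec::new()));

    let (g1, o1) = (gate.clone(), Arc::clone(&order));
    let parked = thread::spawn(move || {
        g1.hit();
        o1.lock().unwrap().push("parked");
    });
    // Wait on the condition instead of sleeping a guessed duration.
    gate.wait_until_reached();

    // The second worker arrives while the first is parked and falls through.
    let (g2, o2) = (gate.clone(), Arc::clone(&order));
    thread::spawn(move || {
        g2.hit();
        o2.lock().unwrap().push("fell-through");
    })
    .join()
    .unwrap();

    gate.release();
    parked.join().unwrap();
    let result = order.lock().unwrap().clone();
    result
}
```

Because the second worker is only started after the first is known to be parked, and the first is only released after the second has finished, the completion order is fixed rather than timing-dependent.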