omnigraph/docs/dev
Ragnor Comerford 4a5277b9c0
fix(recovery): converge roll-forward on concurrent manifest advance (#296)
* refactor(storage): gate test-only TableStore::append_batch behind cfg(test)

The inherent append_batch is used only by in-source recovery test setup, but
the non-test lib build (cfg(test) off) cannot see those callers and emitted a
dead_code warning. Gating the method #[cfg(test)] silences the false positive
and enforces its own doc contract ("no new engine call sites") by construction
— engine code physically cannot call a cfg(test) method.

* test(failpoints): harden fault-injection harness + reproduce roll-forward CAS race

Hardens the test infrastructure around the process-global `fail` registry, and
adds a deterministic red repro for the open-time recovery sweep's roll-forward
CAS race (iss-schema-apply-reopen-recovery-race). The fix lands in the next
commit — this commit is intentionally red (rule 12: red→green visible in log).

Harness:
- One `ScopedFailPoint` (engine) gaining `with_callback`; the cluster duplicate
  is removed and cluster tests reuse the engine type via `omnigraph/failpoints`.
- `#[serial]` on every failpoint test (the registry is process-global, so shared
  names interfere under parallelism); `serial_test` added to cluster dev-deps.
- `helpers::failpoint::Rendezvous` (park-first / wait-until-reached / release)
  replaces fixed-`sleep` cross-thread coordination; the three concurrent tests
  now rendezvous deterministically. The reached flag doubles as a fired-assert.
- Compile-checked `failpoints::names` catalog (engine + cluster); every call
  site references a const, and `failpoint_names_guard.rs` enforces "no string
  literal names" by source-walk, so a typo is a build error not a silent no-fire.

Red repro:
- New `recovery.before_roll_forward_publish` failpoint at the sweep's
  classify -> publish-CAS window (the only injection point there).
- `open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two
  concurrent open-sweeps race one pending sidecar; the sweep parked at the
  failpoint loses its publish CAS to the other and fails the open with
  `ExpectedVersionMismatch`. FAILS at this commit by design.

* fix(recovery): converge roll-forward when the manifest advances concurrently

The open-time recovery sweep classified a pending sidecar as RolledPastExpected,
then published a manifest CAS at the sidecar's pinned expected_version. Under a
concurrent writer that advanced the manifest past expected during the
classify -> publish window, the CAS failed with ExpectedVersionMismatch and
`?`-propagated, failing the whole Omnigraph::open.
iss-schema-apply-reopen-recovery-race.

A roll-forward's postcondition is "the manifest reflects the sidecar's committed
Lance state", not "this sweep won the CAS" (invariants 7 & 15). On an
ExpectedVersionMismatch, re-read the live manifest and check whether the
sidecar's intent is already satisfied (every pinned table at a version >= the
one we observed and tried to publish; added tables registered; tombstones gone
— sound under the heal-first invariant, documented at the check). If satisfied,
this is convergence: record the RolledForward audit + delete the sidecar
idempotently. If only partway, defer to the next pass. Either way the open no
longer fails. Other errors still propagate; a genuine logical conflict
resurfaces via the classifier's InvariantViolation.

Turns the red repro from the previous commit green. The roll-BACK twin
(iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and
still needs a cross-process lease — the known-gap is updated accordingly.

* Address PR review: harden failpoint name guard + dedupe converge audit

Two issues surfaced in PR review of the failpoint hardening + recovery fix:

1. Name guard had a line-split blind spot. It scanned per line, so a call
   wrapped across lines (`park_first(\n    "name",\n)`) put the literal on a
   different line than the call prefix and bypassed the "no string-literal
   failpoint names" check — and one such literal
   (`mutation.delete_node_pre_primary_delete`) had slipped through. Make the
   guard whitespace/newline-tolerant (skip past the open paren to the first
   argument token) so wrapping can't hide a literal, and convert the bypassed
   site to the `names::` const.

2. Convergence path could append a duplicate recovery audit. When a
   roll-forward publish loses its CAS but the manifest already reached the
   sidecar's goal, `converge_or_defer_roll_forward` recorded a RolledForward
   audit unconditionally. Under the heal-first invariant, whoever advanced the
   manifest already healed this sidecar (audit + delete), so a second row
   landed in `_graph_commit_recoveries` for one recovery event. Gate the
   audit+delete on the sidecar still being present: absent => the winner
   completed it, return success with no duplicate row. The convergence
   regression test now asserts exactly one audit row.

* docs(dev): remove the schema-apply recovery-flake handoff (fixed by this PR)

The handoff was a transient investigation note for
`iss-schema-apply-reopen-recovery-race`, which this PR fixes (the converge
helper + the red→green regression). Its rationale now lives durably in the
dev-graph issue, the PR/commit history, and invariants.md, so the handoff is
obsolete. Drop the doc, its dev-index row, and the dangling reference from the
RFC-013 handoff; the doc cross-link check stays green.

* fix(recovery): include added-table registrations in the converge audit

The CAS-loss convergence audit built outcomes only from `sidecar.tables`,
omitting the `additional_registrations` that the normal `roll_forward_all`
audit includes. For a SchemaApply sidecar with added types, a converge-path
audit row would be incomplete versus the normal roll-forward path for the same
recovery kind. Mirror the roll-forward outcome construction (append a
registration outcome per added table) so both paths emit the same audit shape.
2026-06-24 19:03:51 +02:00
..
architecture.md feat!: delete the legacy OmnigraphConfig + config migrate; finish the omnigraph.yaml docs sweep (#252) 2026-06-15 22:31:29 +03:00
branch-protection.md chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
bug-case-fix.md fix(engine): preserve identifier case in filter pushdown (#283) (#285) 2026-06-19 18:42:56 +03:00
ci.md chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
cluster-axioms.md docs(cluster): axiom 15 — single ownership, mode-switch migration, per-operator layer (#164) 2026-06-10 00:44:51 +03:00
cluster-config-implementation-spec.md docs(cluster): RFC-005 — server boots from cluster state (Phase 5 design) (#174) 2026-06-10 15:22:12 +03:00
cluster-config-specs.md docs(user): restructure user docs into topic sections (Phase 1) (#223) 2026-06-14 13:52:14 +03:00
docs-issues.md docs(dev): update coherence ledger — cookbooks drift resolved, omnigraph-ts mechanism (#294) 2026-06-21 00:11:48 +03:00
execution.md fix(embedding): address PR review feedback (RFC-012 Phase 2) 2026-06-15 18:37:34 +02:00
handoff-rfc-013-write-path.md fix(recovery): converge roll-forward on concurrent manifest advance (#296) 2026-06-24 19:03:51 +02:00
index.md fix(recovery): converge roll-forward on concurrent manifest advance (#296) 2026-06-24 19:03:51 +02:00
invariants.md fix(recovery): converge roll-forward on concurrent manifest advance (#296) 2026-06-24 19:03:51 +02:00
lance.md test(engine): pin Lance 7 immutable-PK behavior + sharpen native-namespace alignment notes (#240) 2026-06-15 11:33:25 +02:00
merge.md docs: split user and developer docs (#93) 2026-05-15 03:45:22 +03:00
rfc-001-queries-envelope-mcp.md docs(user): restructure user docs into topic sections (Phase 1) (#223) 2026-06-14 13:52:14 +03:00
rfc-002-config-cli-architecture.md docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus 2026-06-12 17:33:11 +03:00
rfc-003-mcp-server-surface.md Stored-query registry foundation + config/CLI RFC-002 (#128) 2026-06-01 22:50:31 +02:00
rfc-004-cluster-graph-schema-apply.md docs(cluster): document Stage 4C — Phase 4 complete 2026-06-10 14:44:12 +03:00
rfc-005-server-cluster-boot.md fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284) 2026-06-19 03:34:15 +03:00
rfc-007-operator-config.md docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus 2026-06-12 17:33:11 +03:00
rfc-008-deprecate-omnigraph-yaml.md docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus 2026-06-12 17:33:11 +03:00
rfc-009-unify-access-paths.md feat: canonical POST /load, deprecate /ingest (RFC-009 Phase 5) (#222) 2026-06-14 03:32:16 +03:00
rfc-010-cli-planes-restructure.md docs(rfc): RFC-010 — apply verification-comment current-state fixups (#215) 2026-06-13 22:24:09 +03:00
rfc-011-cli-refactoring.md feat(cli): add read-only profile list / profile show (RFC-011 D8) (#255) 2026-06-15 23:33:01 +03:00
rfc-012-embedding-provider-config.md Wire cluster embedding providers 2026-06-16 04:02:08 +03:00
rfc-013-write-path-latency.md feat(engine): WriteTxn - validate schema + open each data table once per write (#298) 2026-06-23 21:27:31 +02:00
schema-lint-v1-plan.md schema-lint chassis v1.0: DropProperty Soft + code-tagged diagnostics (MR-694) (#90) 2026-05-16 16:30:03 +03:00
testing.md fix(recovery): converge roll-forward on concurrent manifest advance (#296) 2026-06-24 19:03:51 +02:00
writes.md fix(engine): stop branch-merge fast-forward OOM on embedding tables (#277) 2026-06-19 00:15:06 +02:00