omnigraph/docs/dev/index.md
Ragnor Comerford 4a5277b9c0
fix(recovery): converge roll-forward on concurrent manifest advance (#296)
* refactor(storage): gate test-only TableStore::append_batch behind cfg(test)

The inherent append_batch is used only by in-source recovery test setup, but
the non-test lib build (cfg(test) off) cannot see those callers and emitted a
dead_code warning. Gating the method #[cfg(test)] silences the false positive
and enforces its own doc contract ("no new engine call sites") by construction
— engine code physically cannot call a cfg(test) method.

* test(failpoints): harden fault-injection harness + reproduce roll-forward CAS race

Hardens the test infrastructure around the process-global `fail` registry, and
adds a deterministic red repro for the open-time recovery sweep's roll-forward
CAS race (iss-schema-apply-reopen-recovery-race). The fix lands in the next
commit — this commit is intentionally red (rule 12: red→green visible in log).

Harness:
- One `ScopedFailPoint` (engine) gaining `with_callback`; the cluster duplicate
  is removed and cluster tests reuse the engine type via `omnigraph/failpoints`.
- `#[serial]` on every failpoint test (the registry is process-global, so shared
  names interfere under parallelism); `serial_test` added to cluster dev-deps.
- `helpers::failpoint::Rendezvous` (park-first / wait-until-reached / release)
  replaces fixed-`sleep` cross-thread coordination; the three concurrent tests
  now rendezvous deterministically. The reached flag doubles as a fired-assert.
- Compile-checked `failpoints::names` catalog (engine + cluster); every call
  site references a const, and `failpoint_names_guard.rs` enforces "no string
  literal names" by source-walk, so a typo is a build error not a silent no-fire.

Red repro:
- New `recovery.before_roll_forward_publish` failpoint at the sweep's
  classify -> publish-CAS window (the only injection point there).
- `open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two
  concurrent open-sweeps race one pending sidecar; the sweep parked at the
  failpoint loses its publish CAS to the other and fails the open with
  `ExpectedVersionMismatch`. FAILS at this commit by design.

* fix(recovery): converge roll-forward when the manifest advances concurrently

The open-time recovery sweep classified a pending sidecar as RolledPastExpected,
then published a manifest CAS at the sidecar's pinned expected_version. Under a
concurrent writer that advanced the manifest past expected during the
classify -> publish window, the CAS failed with ExpectedVersionMismatch and
`?`-propagated, failing the whole Omnigraph::open.
iss-schema-apply-reopen-recovery-race.

A roll-forward's postcondition is "the manifest reflects the sidecar's committed
Lance state", not "this sweep won the CAS" (invariants 7 & 15). On an
ExpectedVersionMismatch, re-read the live manifest and check whether the
sidecar's intent is already satisfied (every pinned table at a version >= the
one we observed and tried to publish; added tables registered; tombstones gone
— sound under the heal-first invariant, documented at the check). If satisfied,
this is convergence: record the RolledForward audit + delete the sidecar
idempotently. If only partway, defer to the next pass. Either way the open no
longer fails. Other errors still propagate; a genuine logical conflict
resurfaces via the classifier's InvariantViolation.

Turns the red repro from the previous commit green. The roll-BACK twin
(iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and
still needs a cross-process lease — the known-gap is updated accordingly.

* Address PR review: harden failpoint name guard + dedupe converge audit

Two issues surfaced in PR review of the failpoint hardening + recovery fix:

1. Name guard had a line-split blind spot. It scanned per line, so a call
   wrapped across lines (`park_first(\n    "name",\n)`) put the literal on a
   different line than the call prefix and bypassed the "no string-literal
   failpoint names" check — and one such literal
   (`mutation.delete_node_pre_primary_delete`) had slipped through. Make the
   guard whitespace/newline-tolerant (skip past the open paren to the first
   argument token) so wrapping can't hide a literal, and convert the bypassed
   site to the `names::` const.

2. Convergence path could append a duplicate recovery audit. When a
   roll-forward publish loses its CAS but the manifest already reached the
   sidecar's goal, `converge_or_defer_roll_forward` recorded a RolledForward
   audit unconditionally. Under the heal-first invariant, whoever advanced the
   manifest already healed this sidecar (audit + delete), so a second row
   landed in `_graph_commit_recoveries` for one recovery event. Gate the
   audit+delete on the sidecar still being present: absent => the winner
   completed it, return success with no duplicate row. The convergence
   regression test now asserts exactly one audit row.

* docs(dev): remove the schema-apply recovery-flake handoff (fixed by this PR)

The handoff was a transient investigation note for
`iss-schema-apply-reopen-recovery-race`, which this PR fixes (the converge
helper + the red→green regression). Its rationale now lives durably in the
dev-graph issue, the PR/commit history, and invariants.md, so the handoff is
obsolete. Drop the doc, its dev-index row, and the dangling reference from the
RFC-013 handoff; the doc cross-link check stays green.

* fix(recovery): include added-table registrations in the converge audit

The CAS-loss convergence audit built outcomes only from `sidecar.tables`,
omitting the `additional_registrations` that the normal `roll_forward_all`
audit includes. For a SchemaApply sidecar with added types, a converge-path
audit row would be incomplete versus the normal roll-forward path for the same
recovery kind. Mirror the roll-forward outcome construction (append a
registration outcome per added table) so both paths emit the same audit shape.
2026-06-24 19:03:51 +02:00

6.8 KiB

Developer Docs

Audience: contributors, maintainers, and coding agents

This is the contributor-facing entry point. These docs explain architecture, invariants, implementation contracts, test ownership, and upstream Lance constraints. User-facing behavior should still be documented through docs/user/index.md and the relevant public reference docs.

Required For Every Non-Trivial Change

Need Read
Architectural rules, known gaps, deny-list invariants.md
Upstream Lance source-of-truth index lance.md
Existing test coverage and test placement testing.md

Architecture And Storage

Area Read
System structure, L1/L2 framing, component diagrams architecture.md
On-disk layout, manifest schema, URI behavior storage.md
Direct-publish writes, D2, staged writes, recovery sidecars writes.md
Query execution, mutation execution, loader flow execution.md
Index lifecycle and graph topology indexes indexes.md
Branch and commit internals branches-commits.md
Three-way merge implementation and conflicts merge.md
Diff/change-feed implementation changes.md
Branch protection policy branch-protection.md

Language, Runtime, And Boundaries

Area Read
Schema grammar, catalog, migration planner schema-language.md
Query grammar, IR, lints, mutation restrictions query-language.md
Embedding client and @embed integration embeddings.md
Cedar policy surface and server gating policy.md
Server auth, OpenAPI, endpoint handlers server.md
Error taxonomy and serialization errors.md
Constants and tunables constants.md
Transaction model public contract transactions.md
User-doc coherence cleanup ledger docs-issues.md

Project Operations

Area Read
CI and release workflows ci.md
Install and deployment packaging install.md, deployment.md
Release history releases/

Contribution & Governance

Area Read
How to contribute (external) CONTRIBUTING.md
Governance model, roles, decision authority GOVERNANCE.md
Public contribution RFC track rfcs/

The docs/rfcs/ track is the public, externally-authorable RFC process. The maintainer/internal RFCs below (rfc-00N-*.md) are a separate, team-owned track; don't conflate the two.

Case Studies

Worked write-ups of specific bugs — root cause, fix, and the reasoning that ruled out the tempting-but-wrong alternatives. Read these for the debugging pattern, not just the outcome.

Area Read
camelCase property filters lowercased at runtime (#283) — two engine→Lance boundaries, two different fixes bug-case-fix.md

Active Implementation Plans

Working documents for in-flight feature work. Removed when the work lands.

Area Read
Schema-lint chassis v1 (MR-694) — --allow-data-loss, soft/hard drops schema-lint-v1-plan.md
Inline + stored queries, request/response envelope, MCP (MR-656 / MR-976 / MR-969) rfc-001-queries-envelope-mcp.md
Config & CLI architecture — layered config, client targeting, file naming (MR-973 / MR-974 / MR-981) rfc-002-config-cli-architecture.md
MCP server surface — full tool parity, stored queries, modular auth (MR-969 / MR-956 / MR-974) rfc-003-mcp-server-surface.md
Future cluster control plane — declarative as-code config, JSON state ledger, reconciler cluster-config-specs.md, cluster-axioms.md, cluster-config-implementation-spec.md
Cluster graph & schema apply — Phase 4 sidecars, roll-forward recovery, approval artifacts rfc-004-cluster-graph-schema-apply.md
Server boots from cluster state — Phase 5 mode switch, applied-revision serving rfc-005-server-cluster-boot.md
Per-operator config — ~/.omnigraph/ identity, keyed credentials, named servers (the operator slice of RFC-002) rfc-007-operator-config.md
Deprecate omnigraph.yaml — one concern per config surface; key-by-key migration map and staged retirement rfc-008-deprecate-omnigraph-yaml.md
Unify CLI embedded/remote access paths — parity referee, shared wire-DTO crate, GraphClient trait, declared plane capabilities rfc-009-unify-access-paths.md
Restructure the CLI around explicit planes — one graph-addressing model, declared capability surface, plane-grouped help (expands RFC-009 Phase 4) rfc-010-cli-planes-restructure.md
CLI refactoring — one addressing & config model post-omnigraph.yaml: scope + --graph + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) rfc-011-cli-refactoring.md
Provider-independent embedding configuration — one resolved EmbeddingConfig + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor rfc-012-embedding-provider-config.md
Write-path latency — capture-once WriteTxn, version-pinned opens, one GraphPublishAuthority fed declarative PublishPlans, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (iss-write-s3-roundtrip-amplification, iss-991) rfc-013-write-path-latency.md
RFC-013 handoff — current-state map, latest validation, and concrete next actions for finishing write-path latency and correctness work handoff-rfc-013-write-path.md

Boundary

Developer docs may mention implementation details, stale gaps, upstream Lance blockers, and review rules. User docs should not require that context unless the detail changes the public contract.