mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-09 01:35:18 +02:00

Ragnor Comerford aaa031e834

recovery: refresh-time roll-forward closes the in-process residual

Adds RecoveryMode { Full, RollForwardOnly } and wires Omnigraph::refresh
to invoke roll-forward-only recovery. This closes the documented
"long-running server between Phase B failure and process restart"
residual without requiring a restart, for the common case (mutation /
load finalize → publisher failure).

Why roll-forward only and not full sweep:
  * Roll-forward is safe under concurrency (publisher uses row-level
    CAS).
  * Roll-back uses Dataset::restore, which "wins" against concurrent
    Append/Update/Delete/CreateIndex/Merge per check_restore_txn —
    silently orphaning the concurrent writer's commit (pinned by
    tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning).
    Sidecars that classify as RollBack-eligible are LEFT ON DISK for the
    next ReadWrite open, where no concurrent writers exist and full
    restore is safe.

Implementation:
  * recovery.rs: RecoveryMode enum; recover_manifest_drift takes mode;
    process_sidecar branches on mode for Abort and RollBack — both
    defer to next ReadWrite open under RollForwardOnly. RollForward
    behavior unchanged.
  * omnigraph.rs: Omnigraph::refresh promoted to pub; calls
    recover_manifest_drift in RollForwardOnly mode after coordinator
    refresh. Steady-state cost: one list_dir of __recovery (early
    return on empty). Adds refresh_coordinator_only — pub(crate) —
    for engine-internal callers that hold an in-flight sidecar (the
    schema_apply lease-check + lock-release paths). Without this split,
    refresh would race the in-flight sidecar.
  * schema_apply.rs: switch all 6 internal db.refresh() call sites to
    refresh_coordinator_only().

Tests:
  * refresh_runs_roll_forward_recovery_in_process — trigger
    mutation.post_finalize_pre_publisher; without restart, call
    db.refresh(); assert sidecar deleted, drifted row visible,
    subsequent mutation succeeds.
  * refresh_defers_rollback_eligible_sidecar_to_next_open — synthesize
    a Mutation sidecar with bogus expected (UnexpectedAtP1 → RollBack);
    refresh leaves it on disk and Lance HEAD unchanged; drop and reopen
    runs the full sweep which advances HEAD via restore.

Docs:
  * docs/runs.md "Long-running servers" caveat updated to describe the
    refresh-time roll-forward path and the rollback-defer behavior.
  * docs/invariants.md §VI.23 status line updated to reflect in-process
    closure of the common case.

Workspace tests pass with --features failpoints; no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-04 00:15:42 +02:00

26 KiB

Raw Blame History

Architectural Invariants & Patterns

Type: Reference / standing document Status: Living — updated as decisions accrue Audience: anyone proposing, reviewing, or implementing a change to any part of OmniGraph

This document captures two things:

Invariants (Parts I–VI, VIII): load-bearing principles that hold across the architecture. Breaking one is rare and requires explicit justification.
Current architectural patterns (Part VII): how we realize the invariants today. These are committed conventions, not eternal facts; they may evolve as the engine matures, but until they do, they constrain new work.

These are not query-engine-specific. They apply to every layer.

Status legend

Status: decided. No annotation needed; this is the default.
Status: open — see MR-X. The principle is captured, but the concrete default or mechanism is still under discussion. Future work should follow the captured intent or update this document with the resolution.
Status: aspirational. The invariant describes the target state; current code may not yet uphold it. PRs that move toward upholding it are welcome; PRs that drift away need explicit justification.

Capturing aspirational invariants on purpose: we'd rather record what we want to be true and have current code be measured against it than not have the rule at all.

How to use

Writing an RFC or design proposal: walk through the relevant sections and state how the proposal upholds each invariant — or why a documented exception is justified.
Reviewing a PR or design: scan for invariants the change might violate. The deny-list (§IX) is the fastest first pass.
Debating a tradeoff: invoke the relevant invariant and check whether the tradeoff respects it.
Updating this document: add to the deny-list freely. Removing or relaxing an invariant requires the same review process as any other architectural decision.

I. Substrate respect — delegate, don't rebuild

The first question for any new component: does the substrate already do this?

Current substrate is Lance for storage, indexes, and MVCC; DataFusion is the working assumption for relational machinery. These are committed choices (MR-737 §2.2, §5.11) but not eternal facts. The invariants below are about respecting whatever substrate we adopt.

Don't rebuild what the substrate owns. Storage format, durability (WAL, transaction journal), buffer pool, MVCC, index lifecycle — all delegated. Building parallel implementations turns the project into a different one and locks us out of substrate improvements. Check: Does this proposal introduce a parallel storage format, custom on-disk pages, custom serialization, custom WAL, custom buffer pool?
Don't rebuild relational machinery provided by the runtime substrate. Joins, aggregations, parallelism, spill — extension via the substrate's trait surfaces; never reimplementation. Check: Are we extending the substrate via traits, or reimplementing parts of it?
Don't maintain state parallel to the substrate. Observe substrate state and derive what we need. State that drifts from the substrate is a bug. Check: Does this proposal track index coverage, manifest versions, or fragment locations independently of the substrate?

II. Layering — the seams hold

The IR is the contract between frontend and backend. Frontends emit IR; planner / executor consume it. No frontend logic leaks downward; no executor concerns leak upward. Check: Does the proposal add to the IR, or to a layer? If to a layer, does it cross another layer's concern?
Capabilities and statistics flow upward; data flows downward. Lower layers expose what they can do (capabilities) and what they know (statistics). Upper layers consume both. Methods alone are insufficient — methods without capability advertisement force one-size-fits-all plans. Check: When adding a method to a layer trait, did we also expose the capability so the planner can reason about it?
One trait boundary per layer. Crossing a layer means going through its trait. Direct calls to lower-layer concrete types from upper layers are forbidden. Check: Does this code call lance::Dataset directly outside engine-storage? Call planner internals from the executor?
No god modules. Single-module concerns: storage, IR, planner, executor, frontend, reconciler, schema, policy. Each crate has a reference test suite that runs without the others. Check: Does this PR add a concern to a crate that already owns a different one?
Wire protocols are interchangeable; the IR is the contract. The kernel produces Stream<RecordBatch> end-to-end; transports (HTTP/JSON, Arrow Flight, FlightSQL, future protocols) deliver them at the server boundary. No wire-protocol-specific code in kernel crates. Status: aspirational — Flight not yet implemented; tracked in MR-765. Check: Does this code import arrow_flight (or any transport crate) outside the server layer?

III. Distributability — kernel stays remote-friendly

These are technical constraints, independent of whether we ship a distributed product. They preserve the architectural seam.

The kernel admits parallel and remote implementations. Trait surfaces are thread-safe; no in-process-only assumptions; remote dataset descriptors (URI, snapshot ref, fragment ID) are accepted without requiring an open in-process handle.
IR is location-neutral. No IR operator embeds an assumption about where data lives.
Cost models accept new dimensions (network, latency-tier) as additive extensions. No place hard-codes "all cost is local I/O."
Background work admits alternate implementations. In-process default; separable worker fleet for distributed deployment uses the same trait. Status: aspirational — distributed deployment is out of scope today (MR-737 §2.2); these constraints preserve the seam.

IV. Evolution — additive over rewrite

Additive over rewrite. New IR variants and planner rules slot in. No "tear out and replace" PRs.
Capabilities are additive enums. New variants are additive. Existing implementations keep working.
Feature-flag behavior changes. Every change that alters runtime behavior ships behind a flag. Old code path stays until the new one is proven.
No data drops without a migration. When data needs to move (e.g., adopting stable row IDs), use in-place or dual-write windows. Never "drop and recreate."
No breaking schema changes without a migration plan. Schema-IR changes go through the migration planner with safety tier classification. See the MR-694 family.

V. Honesty — what the system tells operators

Estimate-vs-actual logging on every estimator. Cost models drift; calibration is a continuous process, not a one-off.
Operationally important state is observable. Index coverage, reconciler lag, cost-model accuracy — surfaced through the storage trait's capabilities() and a unified observability API.
Honest failure modes. Cost-model misses degrade gracefully (spill, partial-result, bounded abort). No silent OOM.
Per-query resource consumption is bounded and exposed. Memory cap, wall-clock timeout, max-rows-scanned, max-fragments-scanned. Operators respect them; bounds exposed via explain.
Plans are explainable. Every executed query can be inspected as IR + physical plan + cost annotations. No "you'd have to read the source to know what this does." See MR-684.

VI. Database guarantees — what OmniGraph promises as a system of record

These are user-visible commitments. They state what the engine guarantees and what it does not. For an "agent-native system of record," credibility lives here.

Specific defaults (timeout values, memory caps, TTL windows) are configuration, not invariants — see docs/constants.md and per-deployment configuration. The invariant is that bounds and contracts exist, not their numerical values.

Atomicity is per-query. Every .gq query is atomic — multi-statement mutations are all-or-nothing via the substrate's atomic-commit primitive. No cross-query BEGIN/COMMIT; branches and merges fill that role for agent workflows. Status: upheld at the writer-trait surface, across process boundaries, AND in-process for the common case — the sealed TableStorage trait routes inserts / updates / scalar-index builds / merge_insert / overwrite through stage_* + commit_staged (Phase A is drift-free); the open-time recovery sweep in db/manifest/recovery.rs (sidecars at __recovery/{ulid}.json written by MutationStaging::finalize, schema_apply, branch_merge, ensure_indices) closes the per-table commit_staged → manifest publish residual on the next Omnigraph::open; and Omnigraph::refresh runs roll-forward-only recovery in-process so long-running servers close the common case (mutation/load finalize → publisher failure) without restart. The "Lance HEAD ahead of __manifest" drift class is unreachable for op-execution failures, recoverable across process boundaries for all writer kinds, and recoverable in-process for roll-forward-eligible sidecars. Sidecars that would require Dataset::restore are deferred to the next ReadWrite open (restore unsafe under concurrency); continuous in-process recovery for that case requires per-(table, branch) writer-queue acquisition and is the goal of a future background reconciler. Two writer paths still inline-commit pending upstream Lance work: delete_where (lance-format/lance#6658) and create_vector_index (lance-format/lance#6666).
Schema integrity is strict at commit. Type validation, required-field presence (auto-filled from @default if declared), uniqueness across batches and versions, and referential integrity — all enforced before commit succeeds. Per-write softening flags are opt-in, never default. Status: aspirational — referential integrity at scale requires SIP-backed cross-table validation; not yet implemented. Cross-batch / cross-version uniqueness tracked in MR-714.
Isolation: per-query snapshot; read-your-writes within and across queries in a session. Each query reads from one consistent manifest version. Within a multi-statement mutation, the read subplan inside each write operator sees the writes from earlier statements. Across queries in a session, reads always resolve the latest manifest version — no reader pinning to older snapshots. Status: upheld for inserts/updates — MutationStaging's in-memory accumulator + TableStore::scan_with_pending (DataFusion MemTable union with the committed Lance scan, with merge-shadow semantics for chained updates) implements read-your-writes within a multi-statement mutation. Delete-touching mutations are limited to delete-only by parse-time D₂; closing the within-query RYW gap for deletes requires Lance's two-phase delete API (Lance-upstream lance-format/lance#6658). The "Lance HEAD ahead of __manifest" drift class is unreachable for op-execution failures (the partial-failure test pins this), and the narrower finalize→publisher residual is closed across one open cycle by the open-time recovery sweep — see docs/runs.md "Open-time recovery sweep".
Durability before acknowledgement. Commit returns only after the substrate has confirmed durable persistence. No "fast" or "fire-and-forget" durability levels.
Causal consistency across sessions. If session A commits and session B subsequently reads, session B sees A's write. Single-coordinator: trivially via single-source manifest. Multi-coordinator: enforced via leader-for-writes plus session-token replica reads. Never weakened. Status: aspirational on the multi-coordinator side.
Determinism within a snapshot. Same query + same snapshot + same parameters → order-stable results (deterministic tie-breaks). Plan choice is deterministic given identical statistics. Cross-version determinism is best-effort, not guaranteed (statistics change, plans change). Status: aspirational — current code may rely on HashMap iteration in some paths.
Writes are idempotent under retry. Insert / Update / Merge take an explicit on_conflict policy. Clients may provide an idempotency key on writes; the server deduplicates retries within a configurable TTL window. Schema migrations are idempotent under replay. Status: open — on_conflict policy lands with mutation IR (MR-737 Phase 8); idempotency-key TTL default is undecided.
No silent data loss or corruption. Substrate-level checksums are trusted for storage integrity. Semantic-invariant checks at every commit catch higher-level cases (orphan edges, type drift, broken uniqueness). Every operation succeeds, fails loudly with cause, or degrades observably with metrics.
Every operation has a documented bound. "May run forever" is forbidden as a default. Defaults are configurable; the invariant is that bounds exist, are documented, and are enforced.
Failure scope is bounded. A failing query, fragment-level corruption, or background-task crash does not cascade. Per-table fragment isolation at the storage tier; per-query memory and timeout in the executor. Status: aspirational on the per-query side — per-query memory cap not yet enforced; planned with MR-737 Phase 7.
Crash recovery via the same code paths as steady-state. No special "recovery mode." On restart, the engine reads the manifest, finds the latest committed state, and resumes. Substrate atomicity ensures no partial writes survive.
Strong consistency by default; relaxation is per-query, never per-default. Strong (read-your-writes, monotonic, snapshot) is the default for every query. Eventual consistency is opt-in per read query for analytical workloads where staleness is acceptable. Never available on writes; always logged for audit. Status: aspirational — eventual-consistency opt-in flag tracked in MR-425.

VII. Current architectural patterns

These are how we realize the invariants today. They are committed conventions — until we explicitly revise them, new code follows them. They are not eternal: a future architecture review may replace any of these with a different mechanism that upholds the same invariants. The deny-list (§IX) protects them in the meantime.

Reconciler pattern for derivable state. Index coverage, statistics, anything derivable from manifest state — reconciled, not job-queued. Realizes the "don't maintain state parallel to the substrate" invariant. See MR-737 §5.16. Status: partial after MR-793 PR #70 — scalar index builds (BTree, Inverted) now route through the staged primitives stage_create_*_index + commit_staged instead of inline create_*_index; this is the building block. The reconciler pattern itself (background IndexReconciler task driven by manifest commits, removing synchronous index work from the publish path) is tracked in MR-848. Vector indices remain inline-commit until lance-format/lance#6666 ships.
Polymorphism via Union, not per-feature lowering. Interfaces / wildcards / alternation on nodes and edges share one IR (Polymorphism<T>) and one lowering (Union of per-type concrete plans). Realizes "shared mechanism for shared shape." See MR-737 §5.13. Status: aspirational — node interfaces in MR-579; edge wildcards in MR-744.
Mutations wrap read subplans. Insert / Update / Delete / Merge are operators that consume read-shaped subplans. Same planner, same cost model, same storage trait. Realizes "writes share the planner with reads." See MR-737 §5.12. Status: aspirational — current mutation path is separate from reads.
SIP for cross-operator selectivity propagation. Producers publish ID bitmaps; downstream scans consume them through structured pushdown. Realizes "downstream operators prune via upstream selectivity." Status: aspirational — current code uses IN-list flattening in Expand.
Factorize multi-hop, flatten only at projection. Lists carry multiplicity through intermediate operators. Flatten is inserted by the planner where required, not eagerly. Realizes "intermediate state shouldn't materialize cross-products eagerly." Status: aspirational — current code materializes cross-products eagerly.
Stable row IDs as dense graph IDs. Don't maintain parallel string→u32 maps. Lance's stable row IDs are the substrate's identity layer; we use them directly. Realizes "use the substrate's identity layer." Status: aspirational — current code rebuilds TypeIndex per query.
Rank and score are columns. Retrieval operators emit _score, _rank. Fusion operators consume rank-bearing batches. Realizes "rank/score is data, not metadata." Status: aspirational — current RRF runs the pipeline twice and discards rank.
Policy as predicates. Authorization decisions are filter expressions injected into the planner, not enforcement at the API boundary. Realizes "authorization pushes down with other filters." Status: aspirational — Cedar enforcement currently at HTTP boundary only; tracked in MR-722 / MR-725.
Imports unify under Source; transport is interchangeable. A single Source IR operator with provider variants (File, Flight, Lance, Stream) handles all imports. Lance-to-Lance is a fast-path that bypasses Arrow encode/decode. Realizes "external data sources share one operator surface." Status: aspirational — current loader is JSONL-only; tracked in MR-765.

VIII. Quality gates — every change passes

Tests at every boundary. MemStorage for engine tests; planner-only tests; executor-only tests with a stub storage. No layer tested only via end-to-end.
Reference implementation per trait. Every trait has a primary impl (Lance for storage) and at least a test impl. Status: partial after MR-793 PR #70 — TableStorage (the engine-internal staged-write trait, sealed) has its primary impl on TableStore (Lance-backed). The trait's signatures use opaque SnapshotHandle / StagedHandle types so a future test impl (e.g., MemStorage) can land without changing call sites. No test impl yet; tempfile::tempdir() + Lance is the de-facto test substrate today (see docs/testing.md).
Documented capability surface. New capabilities are documented with what they advertise, who consumes them, how the planner uses them.
Benchmark before optimization. New optimizations land with a benchmark that motivates them; if the motivating workload doesn't exist, the feature waits.

IX. Anti-patterns — deny-list

If a proposal fits one of these, the burden is on the proposer to justify why this case is the exception.

Invariant violations (high bar to override)

Custom WAL / transaction manager / buffer pool. Substrate owns these (§I.1).
Wire-protocol-specific code in kernel crates. Kernel produces Stream<RecordBatch>; transport adapters live at the server boundary only (§II.8).
In-process-only Dataset impls. Trait surfaces stay remote-friendly (§III.9).
State that drifts from the substrate / manifest. Derive from observable state (§I.3).
Cross-query BEGIN/COMMIT transactions. Branches replace them in OSS (§VI.23).
Acks before durable persistence. "Best-effort commit" is forbidden (§VI.26).
Reads that see partial commits. Atomicity is non-negotiable (§VI.23).
Operations without time bounds. Every operation has a documented timeout or backoff (§VI.31).
"Recovery mode" code paths separate from steady-state. Recovery uses the same code as ordinary reads (§VI.33).
Eventual consistency as a default. Strong is default; eventual is opt-in per query, never on writes (§VI.34).
Schema migrations that are not idempotent under replay. Idempotency is required for replay safety (§VI.29).
Plan choice that varies given identical input statistics. Determinism is required (§VI.28).
HashMap iteration order in result ordering or plan choice. Use deterministic tie-breaks (§VI.28).
Cost-blind plan choice. Lowering-order execution is not a planner.
Hidden statistics. If a metric matters for plan choice, it must be exposed through the trait surface (§II.5).
Side-channels for query semantics. Search modes, mutations, polymorphism, imports — all first-class IR concepts (§II.4).
Hand-rolling something the substrate already does. Check the spec first (§I.1).
Mutating in place state that should be immutable (Lance fragments, index segments). New segments instead.
Silent failures. OOM, timeout, partial result — all surfaced and bounded (§V.20).

Pattern violations (overridable with justification)

These protect the current architectural patterns (§VII). A future review may revise them.

Synchronous-inline index updates for indexes expensive to build (vector ANN, FTS). Reconciler pattern instead (§VII.35).
Job queue for state derivable from manifest. Reconciler pattern instead (§VII.35).
Per-feature lowering for shapes that share a structure (interfaces, wildcards, alternation). Use one mechanism (§VII.36).
Per-format import code paths (one path for JSONL, another for Parquet, another for Flight). Use the Source IR operator (§VII.43).
Eager materialization of cross-products in multi-hop. Factorize (§VII.39).
Ad-hoc IN-list filtering when SIP fits (§VII.38).
String-flattened SQL filter generation when structured pushdown is available.
Discarding rank in retrieval. Score and rank propagate as columns (§VII.41).
Auto-creating placeholder nodes for orphan edges (silent invention of data). Reject by default; opt-in per write (§VI.24).
Double-encoding data when both endpoints speak the same format (e.g., Lance → Arrow → Lance when both are Lance). Use a fast-path (§VII.43).
Per-write durability fast paths until MemWAL is stable AND a use case justifies the latency vs. risk tradeoff.

X. Review checklist (use against any non-trivial change)

Print this when reviewing an RFC or PR. Each line is yes / no / N/A.

Does it respect the substrate? (§I)
Does it cross only one trait boundary per layer? (§II)
Are capabilities and stats exposed for any new behavior? (§II.5)
If touching the wire / transport surface, does kernel code stay protocol-agnostic? (§II.8)
Do trait surfaces stay remote-friendly? (§III)
Additive, not rewrite? Feature-flagged where behavior changes? (§IV)
Any new estimator has estimate-vs-actual logging? (§V.18)
Coverage / lag / budget metrics surfaced? (§V.19–21)
Failure modes graceful, bounded, observable? (§V.20)
Atomicity scope respected per query? (§VI.23)
Schema integrity enforced strict at commit unless explicit opt-out? (§VI.24)
Isolation level matches default (per-query snapshot, read-your-writes)? (§VI.25)
Durability ack only after manifest commit? (§VI.26)
Determinism preserved (order-stable, plan-deterministic)? (§VI.28)
Idempotency: explicit on_conflict; idempotency keys honored if used? (§VI.29)
Bounded operations: explicit timeout / memory / concurrency limits? (§VI.31)
If touching imports / external data, does it go through Source? (§VII.43)
If implementing a graph / retrieval feature: reuses an existing pattern (reconciler, Union, mutation-wrap-read, SIP, factorize, Source) where applicable? (§VII)
Tests at every boundary, not just end-to-end? (§VIII.44)
Reference impl + test impl for any new trait? (§VIII.45)
None of the deny-list patterns apply? (§IX)

XI. Living document policy

This document is updated when:

A new architectural decision establishes a new invariant — add it.
An existing invariant is challenged and either reaffirmed (with the case sharpened) or revised (with explicit migration of any affected code).
A new architectural pattern is adopted — add to §VII.
A current pattern (§VII) is replaced — update or remove the entry; update the deny-list.
A new anti-pattern surfaces in review and deserves a place on the deny-list — add it.
An aspirational invariant becomes upheld — remove the status annotation.
An open invariant is decided — record the decision and remove the status annotation.

Updates require the same review process as code. Adding to the deny-list (§IX) is cheap; removing or relaxing an invariant (§I–VI, VIII) requires explicit justification in the proposal. Replacing a pattern (§VII) requires a design discussion linking to the new pattern; until that lands, the existing pattern stays.

When an invariant is contested in the moment, the resolution path is: (a) state the case in the relevant RFC or PR; (b) link it from this document; (c) update this document if the resolution changes the rule.

XII. Source / origin

These invariants and patterns were extracted from the architectural decisions in:

MR-737 — Query Engine v2 RFC (the kernel scope and seams)
MR-744 — Edge wildcards / alternation (one cell of the polymorphic-bindings matrix)
MR-765 — Arrow Flight transport (query, import, export)
The schema migration program (MR-694 family — additive evolution, safety tiers, idempotent replay)
The policy program (MR-722 / MR-725 — predicate pushdown)
The reconciler / index-lifecycle work (MR-737 §5.16, MR-688, MR-679, MR-680)
The factorization and SIP work (MR-737 §5.2, §5.3 — Kuzu / Ladybug inspiration)
The polymorphic-bindings framing (MR-737 §5.13 — one mechanism for eight cells)
The Source-operator framing (MR-737 §5.12 — one mechanism for all imports)
The database-guarantees discussion (§VI): ACID dimensions, CAP-style consistency model, scale-system precedents (ClickHouse, Turbopuffer, LanceDB, Postgres). Each invariant in §VI corresponds to a specific named decision; see prior architecture discussions for the option space considered.

General precedent: Lance + LanceDB Enterprise architecture; ClickHouse merge subsystem; Kubernetes controllers; Postgres autovacuum; the FDAL stack (Flight + DataFusion + Arrow + Lance).

Adding a new invariant or pattern here means we've learned something — either from a hard call we made and want to preserve, or from a mistake we don't want to repeat. Both are worth recording.

26 KiB Raw Blame History Unescape Escape