diff --git a/AGENTS.md b/AGENTS.md index e62b5ae..be197fa 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -198,7 +198,7 @@ omnigraph policy explain --actor act-alice --action change --branch main | Columnar storage on object store | ✅ Arrow/Lance | URI normalization, S3 env-var plumbing | | Per-dataset versioning + time travel | ✅ | `snapshot_at_version`, `entity_at`, snapshot-pinned reads across many tables | | Per-dataset branches | ✅ | **Graph-level** branches (atomic across all sub-tables), lazy fork, system branch filtering | -| Atomic single-dataset commits | ✅ | **Multi-table publish via three layers**, NOT a single Lance primitive: (1) per-table Lance `commit_staged` for the data write, (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for cross-table ordering, (3) recovery-on-open reconciler for the residual gap between (1) and (2). Layer (3) is **not yet shipped** — tracked in MR-847. Until MR-847 lands, a failure between per-table `commit_staged` and the manifest publish leaves drift (the documented "Phase B → Phase C residual" — see [docs/runs.md](docs/runs.md)). Engine writes route through a sealed `TableStorage` trait (MR-793) exposing `stage_*` + `commit_staged` as the canonical staged-write surface; documented inline-commit residuals (`delete_where`, `create_vector_index`, plus legacy `append_batch` / `merge_insert_batches` / `overwrite_batch` / `create_*_index` pending Phase 9) remain on the trait until upstream Lance ships a public two-phase API ([#6658](https://github.com/lance-format/lance/issues/6658), [#6666](https://github.com/lance-format/lance/issues/6666)) and call-site conversion (Phase 1b) completes. **Do not describe atomicity as "fully upheld" until MR-847 ships.** | +| Atomic single-dataset commits | ✅ | **Multi-table publish via three layers**, NOT a single Lance primitive: (1) per-table Lance `commit_staged` for the data write, (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for cross-table ordering, (3) recovery-on-open reconciler (MR-847) for the residual gap between (1) and (2). All three layers ship; the four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) write a `__recovery/{ulid}.json` sidecar before Phase B and delete it after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the sweep in `db/manifest/recovery.rs`: classify, decide all-or-nothing per sidecar, roll forward via single `ManifestBatchPublisher::publish` or roll back via `Dataset::restore`, and record an audit row in `_graph_commit_recoveries.lance` (queryable via `omnigraph commit list --filter actor=omnigraph:recovery`). Continuous in-process recovery (no restart needed between Phase B failure and recovery) arrives with [MR-856](https://linear.app/modernrelay/issue/MR-856) (background reconciler). Engine writes route through a sealed `TableStorage` trait (MR-793) exposing `stage_*` + `commit_staged` as the canonical staged-write surface; documented inline-commit residuals (`delete_where`, `create_vector_index`, plus legacy `append_batch` / `merge_insert_batches` / `overwrite_batch` / `create_*_index` pending Phase 9) remain on the trait until upstream Lance ships a public two-phase API ([#6658](https://github.com/lance-format/lance/issues/6658), [#6666](https://github.com/lance-format/lance/issues/6666)) and call-site conversion (Phase 1b) completes. | | Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency | | Cleanup (`cleanup_old_versions`) | ✅ | `omnigraph cleanup` with `--keep` / `--older-than` policy | | BTREE / inverted (FTS) / vector indexes | ✅ | `ensure_indices` builds them on every relevant column; idempotent; lazy across branches | diff --git a/docs/branches-commits.md b/docs/branches-commits.md index 4501822..e1f6f29 100644 --- a/docs/branches-commits.md +++ b/docs/branches-commits.md @@ -55,3 +55,9 @@ Filtered from `branch_list()` but visible to internals: - `__schema_apply_lock__` — serializes schema migrations. - `__run__` — legacy from the pre-v0.4.0 Run state machine (removed in MR-771). The branch-name guard predicate `is_internal_run_branch` is kept as defense-in-depth so users cannot create a branch matching the legacy prefix; the filter will be removed once production legacy branches are swept (MR-770). + +## L2 — Recovery audit trail (MR-847) + +The four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) protect their multi-table commits with a sidecar at `__recovery/{ulid}.json` written before Phase B and deleted after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`: classify per-table state, decide all-or-nothing per sidecar, roll forward / back, record an audit row. + +Audit rows live in `_graph_commit_recoveries.lance` (sibling to `_graph_commits.lance`) and reference the commit graph by `graph_commit_id`. The linked `_graph_commits.lance` row carries `actor_id="omnigraph:recovery"` (the system actor). To find recoveries for a specific original actor: `omnigraph commit list --filter actor=omnigraph:recovery`, then join to `_graph_commit_recoveries.lance` by `graph_commit_id` to read `recovery_for_actor`. Schema: see `crates/omnigraph/src/db/recovery_audit.rs`. diff --git a/docs/invariants.md b/docs/invariants.md index 73056e7..4c8cfa0 100644 --- a/docs/invariants.md +++ b/docs/invariants.md @@ -105,13 +105,13 @@ These are user-visible commitments. They state what the engine guarantees and wh Specific defaults (timeout values, memory caps, TTL windows) are *configuration*, not invariants — see [docs/constants.md](constants.md) and per-deployment configuration. The invariant is that bounds and contracts exist, not their numerical values. 23. **Atomicity is per-query.** Every `.gq` query is atomic — multi-statement mutations are all-or-nothing via the substrate's atomic-commit primitive. No cross-query `BEGIN`/`COMMIT`; branches and merges fill that role for agent workflows. - *Status: upheld at the writer-trait surface for inserts / updates / scalar-index builds / merge_insert / overwrite after MR-793 PR #70 — the sealed `TableStorage` trait routes those through `stage_*` + `commit_staged`, so a Phase A failure (between writing fragments and committing) leaves no Lance-HEAD drift on touched tables. **Per-table commit_staged → manifest publish window remains** — a failure between commits across multiple touched tables can leave drift on the partially-committed tables. Lance has no multi-dataset atomic commit primitive; closing this requires the recovery-on-open reconciler tracked in MR-847. Additionally, two writer paths still inline-commit pending upstream Lance work: `delete_where` (lance-format/lance#6658) and `create_vector_index` (lance-format/lance#6666).* + *Status: upheld at the writer-trait surface AND across process boundaries after MR-847 — the sealed `TableStorage` trait routes inserts / updates / scalar-index builds / merge_insert / overwrite through `stage_*` + `commit_staged` (Phase A is drift-free), and the open-time recovery sweep in `db/manifest/recovery.rs` (sidecars at `__recovery/{ulid}.json` written by `MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) closes the per-table commit_staged → manifest publish residual on the next `Omnigraph::open`. The "Lance HEAD ahead of `__manifest`" drift class is unreachable for op-execution failures and recoverable across process boundaries for finalize→publisher failures. Continuous in-process recovery (no restart required between Phase B failure and recovery) arrives with MR-856 (background recovery reconciler). Two writer paths still inline-commit pending upstream Lance work: `delete_where` (lance-format/lance#6658) and `create_vector_index` (lance-format/lance#6666).* 24. **Schema integrity is strict at commit.** Type validation, required-field presence (auto-filled from `@default` if declared), uniqueness across batches and versions, and referential integrity — all enforced before commit succeeds. Per-write softening flags are opt-in, never default. *Status: aspirational — referential integrity at scale requires SIP-backed cross-table validation; not yet implemented. Cross-batch / cross-version uniqueness tracked in MR-714.* 25. **Isolation: per-query snapshot; read-your-writes within and across queries in a session.** Each query reads from one consistent manifest version. Within a multi-statement mutation, the read subplan inside each write operator sees the writes from earlier statements. Across queries in a session, reads always resolve the latest manifest version — no reader pinning to older snapshots. - *Status: upheld for inserts/updates after MR-794 step 2+ — `MutationStaging`'s in-memory accumulator + `TableStore::scan_with_pending` (DataFusion `MemTable` union with the committed Lance scan, with merge-shadow semantics for chained updates) implements read-your-writes within a multi-statement mutation. Delete-touching mutations are limited to delete-only by parse-time D₂; closing the within-query RYW gap for deletes requires Lance's two-phase delete API (tracked: MR-793 / Lance-upstream lance-format/lance#6658). The "Lance HEAD ahead of `__manifest`" drift class is unreachable for op-execution failures (the partial-failure test pins this), but a narrowed residual remains at the finalize→publisher boundary because Lance has no multi-dataset commit primitive — see [docs/runs.md](runs.md) "Finalize → publisher residual".* + *Status: upheld for inserts/updates after MR-794 step 2+ — `MutationStaging`'s in-memory accumulator + `TableStore::scan_with_pending` (DataFusion `MemTable` union with the committed Lance scan, with merge-shadow semantics for chained updates) implements read-your-writes within a multi-statement mutation. Delete-touching mutations are limited to delete-only by parse-time D₂; closing the within-query RYW gap for deletes requires Lance's two-phase delete API (tracked: MR-793 / Lance-upstream lance-format/lance#6658). The "Lance HEAD ahead of `__manifest`" drift class is unreachable for op-execution failures (the partial-failure test pins this), and the narrower finalize→publisher residual is closed across one open cycle by the MR-847 recovery sweep — see [docs/runs.md](runs.md) "Open-time recovery sweep".* 26. **Durability before acknowledgement.** Commit returns only after the substrate has confirmed durable persistence. No "fast" or "fire-and-forget" durability levels. diff --git a/docs/maintenance.md b/docs/maintenance.md index aaab37f..af3ce45 100644 --- a/docs/maintenance.md +++ b/docs/maintenance.md @@ -16,6 +16,7 @@ - `CleanupPolicyOptions { keep_versions: Option, older_than: Option }` — at least one is required. - Returns `[TableCleanupStats { table_key, bytes_removed, old_versions_removed }]`. - CLI guards with `--confirm`; without it, prints a preview line. +- **MR-847 recovery floor:** `--keep < 3` may garbage-collect Lance versions that the open-time recovery sweep needs as a rollback target (the sweep restores to the manifest-pinned `expected_version`, which is HEAD-1 in the typical Phase B → Phase C drift case). Default `--keep 10` is safe. ## Tombstones diff --git a/docs/runs.md b/docs/runs.md index 971801b..3716337 100644 --- a/docs/runs.md +++ b/docs/runs.md @@ -130,7 +130,7 @@ will replace it. Operator-driven (rare in agent workloads); document permanently until Lance exposes `Operation::Overwrite { fragments }` as a two-phase op. -### Finalize → publisher residual +### Open-time recovery sweep (MR-847) The staged-write rewire eliminates one drift class **by construction at the writer layer**: an op that fails before pushing to the in-memory @@ -139,26 +139,61 @@ rejection) leaves Lance HEAD untouched on every staged table. This is the case the `partial_failure_leaves_target_queryable_and_unblocks_next_mutation` test pins. -A second, narrower drift class remains. `MutationStaging::finalize` -runs `stage_*` + `commit_staged` per touched table sequentially, then -the publisher commits the manifest. Lance has no multi-dataset atomic -commit, so the per-table `commit_staged` calls are independent -operations: if commit_staged on table N+1 fails *after* commit_staged -on tables 1..N succeeded, or if the publisher's CAS pre-check rejects -*after* every commit_staged succeeded, tables 1..N are left at -`Lance HEAD = manifest_pinned + 1`. The next mutation against those -tables surfaces `ManifestConflictDetails::ExpectedVersionMismatch` — -the same loud failure mode the rewire was designed to make rare, just -no longer "unreachable." +A second, narrower drift class — the **finalize → publisher window** — +is closed across one open cycle by the MR-847 recovery sweep: -Triggers: transient Lance write errors during finalize (object-store -retry budget exhaustion, disk full); persistent publisher contention -exceeding `PUBLISHER_RETRY_BUDGET = 5` retries. Closing this requires -either a Lance multi-dataset atomic-commit primitive (filed upstream -alongside the two-phase delete request) or a manifest-layer journal -that replays staged commits on next open. Both are heavyweight; the -v1 stance is "narrowed window, documented residual, surface the loud -error when it fires." +`MutationStaging::finalize` runs `stage_*` + `commit_staged` per touched +table sequentially, then the publisher commits the manifest. Lance has +no multi-dataset atomic commit, so the per-table `commit_staged` calls +are independent operations: if commit_staged on table N+1 fails *after* +commit_staged on tables 1..N succeeded, or if the publisher's CAS +pre-check rejects *after* every commit_staged succeeded, tables 1..N +are left at `Lance HEAD = manifest_pinned + 1`. + +**Recovery protocol** (lifecycle of every staged-write writer — +`MutationStaging::finalize`, `schema_apply::apply_schema_with_lock`, +`branch_merge_on_current_target`, `ensure_indices_for_branch`): + +1. **Phase A**: writer writes a sidecar JSON to + `__recovery/{ulid}.json` BEFORE its first `commit_staged`. The + sidecar names every `(table_key, table_path, expected_version, + post_commit_pin)` it intends to commit + the writer kind + + actor_id. +2. **Phase B**: writer's per-table `commit_staged` loop runs. +3. **Phase C**: publisher commits the manifest. +4. **Phase D**: writer deletes the sidecar. + +A failure between Phase A and Phase D leaves the sidecar on disk. The +next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the +recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`: + +- For each sidecar in `__recovery/`, compare every named table's + Lance HEAD to the manifest pin. Classify per the all-or-nothing + decision tree (RolledPastExpected / NoMovement / UnexpectedAtP1 / + UnexpectedMultistep / InvariantViolation). +- If every table is `RolledPastExpected`, **roll forward**: a single + `ManifestBatchPublisher::publish` call extends every pin atomically. +- Otherwise **roll back**: per-table `Dataset::restore` to the + expected_version (with a fragment-set short-circuit so repeated + mid-sweep crashes don't pile up versions). +- Either way, an audit row is recorded — `_graph_commits.lance` carries + a commit tagged `actor_id = "omnigraph:recovery"`, and a sibling + `_graph_commit_recoveries.lance` row carries `recovery_kind`, + `recovery_for_actor` (the original sidecar's actor), `operation_id`, + per-table outcomes. Operators run `omnigraph commit list --filter + actor=omnigraph:recovery` to find recoveries. +- Sidecar deleted as the final step. + +Triggers for the residual: transient Lance write errors during finalize +(object-store retry budget exhaustion, disk full); persistent publisher +contention exceeding `PUBLISHER_RETRY_BUDGET = 5` retries. + +**Long-running servers**: between Phase B failure and the next +`Omnigraph::open` (typically a server restart), subsequent writers on +the affected tables surface +`ManifestConflictDetails::ExpectedVersionMismatch`. Continuous +in-process recovery (no restart required) arrives with MR-856 +(background recovery reconciler). The publisher-CAS contract is unchanged: a *concurrent writer* that advances any of our touched tables between snapshot capture and diff --git a/docs/storage.md b/docs/storage.md index 90806d0..daca1e8 100644 --- a/docs/storage.md +++ b/docs/storage.md @@ -62,13 +62,15 @@ flowchart TB manifest["__manifest/
L2 catalog of sub-tables"]:::l2 nodes["nodes/{fnv1a64-hex}/
one dataset per node type"]:::l2 edges["edges/{fnv1a64-hex}/
one dataset per edge type"]:::l2 - cgraph["_graph_commits.lance/
_graph_commit_actors.lance/"]:::l2 + cgraph["_graph_commits.lance/
_graph_commit_actors.lance/
_graph_commit_recoveries.lance/"]:::l2 + recovery["__recovery/{ulid}.json
MR-847 sidecars (transient)"]:::l2 refs["_refs/branches/{name}.json
graph-level branches"]:::l2 repo --> manifest repo --> nodes repo --> edges repo --> cgraph + repo --> recovery repo --> refs subgraph dataset[Inside each Lance dataset — L1] @@ -90,6 +92,8 @@ flowchart TB - **`__manifest/`** is a Lance dataset whose rows describe which sub-table version is published at which graph-branch. Reading a snapshot starts here. - **`nodes/`** and **`edges/`** are sibling directories holding one Lance dataset per declared type. Names are `fnv1a64-hex` of the type name to keep paths fixed-length and case-safe. - **`_graph_commits.lance`** is an L2 dataset that records the graph-level commit DAG, with a paired `_graph_commit_actors.lance` for the actor map. (Pre-v0.4.0 repos also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; MR-770 sweeps these in production.) +- **`_graph_commit_recoveries.lance`** (MR-847) — one row per recovery sweep action. Joined to `_graph_commits.lance` by `graph_commit_id`; the linked commit row carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join. See `crates/omnigraph/src/db/recovery_audit.rs`. +- **`__recovery/{ulid}.json`** (MR-847) — transient sidecar files written by the four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) before Phase B begins, deleted after Phase C succeeds. A sidecar persisting after process exit means the writer crashed in the Phase B → Phase C window; the next `Omnigraph::open` recovery sweep processes it. Steady-state directory is empty. See `crates/omnigraph/src/db/manifest/recovery.rs`. - **`_refs/branches/{name}.json`** is graph-level branch metadata — pointers from a branch name to the manifest version it heads. - **Inside each Lance dataset** (orange): the standard Lance directory layout. `_versions/{n}.manifest` records every commit; `data/` holds the actual Arrow fragments; `_indices/{uuid}/` holds index segments with their own `fragment_bitmap` for partial coverage; `_refs/` holds Lance-native per-dataset branches and tags. diff --git a/docs/testing.md b/docs/testing.md index 4ba3cbc..675491a 100644 --- a/docs/testing.md +++ b/docs/testing.md @@ -32,7 +32,8 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav | `export.rs` | NDJSON streaming export filters | | `s3_storage.rs` | S3-backed repo (skipped unless `OMNIGRAPH_S3_TEST_BUCKET` is set) | | `lance_version_columns.rs` | Per-row `_row_last_updated_at_version` behavior | -| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature) | +| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the four MR-847 per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`). | +| `recovery.rs` | MR-847 open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path | ## Fixtures