diff --git a/AGENTS.md b/AGENTS.md index b876749..3f5b711 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -236,8 +236,8 @@ omnigraph policy explain --actor act-alice --action change --branch main | Columnar storage on object store | ✅ Arrow/Lance | URI normalization, S3 env-var plumbing | | Per-dataset versioning + time travel | ✅ | `snapshot_at_version`, `entity_at`, snapshot-pinned reads across many tables | | Per-dataset branches | ✅ | **Graph-level** branches (atomic across all sub-tables), lazy fork, system branch filtering | -| Atomic single-dataset commits | ✅ | **Multi-table publish via three layers**, NOT a single Lance primitive: (1) per-table Lance `commit_staged` for the data write, (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for cross-table ordering, (3) the open-time recovery sweep for the residual gap between (1) and (2). All three layers ship; the four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) write a `__recovery/{ulid}.json` sidecar before Phase B and delete it after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the sweep in `db/manifest/recovery.rs`: classify, decide all-or-nothing per sidecar, roll forward via single `ManifestBatchPublisher::publish` or roll back via `Dataset::restore`, and record an audit row in `_graph_commit_recoveries.lance` (queryable via `omnigraph commit list --filter actor=omnigraph:recovery`). Continuous in-process recovery (no restart needed between Phase B failure and recovery) is the goal of a future background reconciler. Engine writes route through a sealed `TableStorage` trait exposing `stage_*` + `commit_staged` as the canonical staged-write surface; documented inline-commit residuals (`delete_where`, `create_vector_index`, plus legacy `append_batch` / `merge_insert_batches` / `overwrite_batch` / `create_*_index`) remain on the trait until upstream Lance ships a public two-phase API ([#6658](https://github.com/lance-format/lance/issues/6658), [#6666](https://github.com/lance-format/lance/issues/6666)) and the migration of every call site completes. | -| Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency; **skips blob-bearing tables** (reported via `TableOptimizeStats.skipped`, not silent), gated on `LANCE_SUPPORTS_BLOB_COMPACTION` until the upstream blob-v2 compaction-decode bug is fixed (see [docs/dev/invariants.md](docs/dev/invariants.md) Known Gaps) | +| Atomic single-dataset commits | ✅ | **Multi-table publish via three layers**, NOT a single Lance primitive: (1) per-table Lance `commit_staged` for the data write, (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for cross-table ordering, (3) the open-time recovery sweep for the residual gap between (1) and (2). All three layers ship; the five migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`, `optimize_all_tables`) write a `__recovery/{ulid}.json` sidecar before Phase B and delete it after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the sweep in `db/manifest/recovery.rs`: classify, decide all-or-nothing per sidecar, roll forward via single `ManifestBatchPublisher::publish` or roll back via `Dataset::restore` followed by a manifest publish of the restored version (so both directions converge to `manifest == HEAD` — no residual drift), and record an audit row in `_graph_commit_recoveries.lance` (queryable via `omnigraph commit list --filter actor=omnigraph:recovery`). Continuous in-process recovery (no restart needed between Phase B failure and recovery) is the goal of a future background reconciler. Engine writes route through a sealed `TableStorage` trait exposing `stage_*` + `commit_staged` as the canonical staged-write surface; documented inline-commit residuals (`delete_where`, `create_vector_index`, plus legacy `append_batch` / `merge_insert_batches` / `overwrite_batch` / `create_*_index`) remain on the trait until upstream Lance ships a public two-phase API ([#6658](https://github.com/lance-format/lance/issues/6658), [#6666](https://github.com/lance-format/lance/issues/6666)) and the migration of every call site completes. | +| Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency; **publishes each compacted table's new version to `__manifest`** (so the manifest tracks the Lance HEAD — required for reads to observe compaction and for schema apply / strict writes to pass their HEAD-vs-manifest precondition), under the per-`(table, main)` write queue with `SidecarKind::Optimize` recovery coverage; **refuses on an unrecovered graph** (errors if a `__recovery` sidecar is pending — recovery may roll back a partial write, so optimize requires `manifest == HEAD` going in); **skips blob-bearing tables** (reported via `TableOptimizeStats.skipped`, not silent), gated on `LANCE_SUPPORTS_BLOB_COMPACTION` until the upstream blob-v2 compaction-decode bug is fixed (see [docs/dev/invariants.md](docs/dev/invariants.md) Known Gaps) | | Cleanup (`cleanup_old_versions`) | ✅ | `omnigraph cleanup` with `--keep` / `--older-than` policy | | BTREE / inverted (FTS) / vector indexes | ✅ | `ensure_indices` builds them on every relevant column; idempotent; lazy across branches | | `merge_insert` upsert | ✅ | `LoadMode::Merge`, mutation `update`/`insert`/`delete` lowering | diff --git a/crates/omnigraph/src/db/manifest.rs b/crates/omnigraph/src/db/manifest.rs index 3b2886f..5bf1f87 100644 --- a/crates/omnigraph/src/db/manifest.rs +++ b/crates/omnigraph/src/db/manifest.rs @@ -36,7 +36,7 @@ use publisher::{GraphNamespacePublisher, ManifestBatchPublisher}; pub(crate) use recovery::{ RecoveryMode, RecoverySidecar, RecoverySidecarHandle, SidecarKind, SidecarTablePin, SidecarTableRegistration, SidecarTombstone, delete_sidecar, has_schema_apply_sidecar, - new_sidecar, recover_manifest_drift, write_sidecar, + list_sidecars, new_sidecar, recover_manifest_drift, write_sidecar, }; pub use state::SubTableEntry; #[cfg(test)] diff --git a/crates/omnigraph/src/db/manifest/recovery.rs b/crates/omnigraph/src/db/manifest/recovery.rs index 4c1b987..3119531 100644 --- a/crates/omnigraph/src/db/manifest/recovery.rs +++ b/crates/omnigraph/src/db/manifest/recovery.rs @@ -106,6 +106,12 @@ pub(crate) enum SidecarKind { BranchMerge, /// `ensure_indices_for_branch` — index lifecycle commits. EnsureIndices, + /// `optimize_all_tables` — Lance `compact_files` (reserve-fragments + + /// rewrite commits) followed by a manifest publish of the compacted + /// version. Loose-match like the other multi-commit writers; roll-forward + /// is always safe because compaction is content-preserving (Lance + /// `Operation::Rewrite` "reorganizes data without semantic modification"). + Optimize, } /// One table's contribution to a sidecar's intended commit. The classifier @@ -412,11 +418,13 @@ pub(crate) fn parse_sidecar(sidecar_uri: &str, body: &str) -> Result manifest_pinned` as `RolledPastExpected` when /// `pin.expected_version == manifest_pinned` (the writer's CAS /// target matches what the manifest currently shows). The risk this @@ -494,9 +502,12 @@ pub(crate) fn decide(classifications: &[TableClassification]) -> SidecarDecision /// Skipping the restore in those cases would leave Lance HEAD ahead of /// the manifest with no recovery artifact left. /// -/// Cost: under repeated mid-rollback crashes (rare), Lance HEAD -/// accumulates extra restore commits that `omnigraph cleanup` reclaims. -/// Bounded by the number of recovery iterations — typically 1. +/// Cost: a successful roll-back appends one restore commit and then publishes +/// the manifest to match (`roll_back_sidecar`), so the table converges +/// (`manifest == HEAD`) in one pass. Only repeated crashes *between* the restore +/// and that publish (rare) accumulate extra restore commits; each re-classified +/// roll-back restores again and `omnigraph cleanup` reclaims the surplus. +/// Bounded by the number of interrupted recovery iterations — typically 0. pub(crate) async fn restore_table_to_version( table_path: &str, branch: Option<&str>, @@ -801,13 +812,24 @@ async fn roll_back_sidecar( sidecar: &RecoverySidecar, states: &[ClassifiedTable], ) -> Result<()> { - // Restore every table whose Lance HEAD has drifted from the - // manifest pin (RolledPastExpected, UnexpectedAtP1, - // UnexpectedMultistep). NoMovement tables are already at the - // manifest pin — no action. Restore is unconditional; repeated - // mid-rollback crashes accumulate a few extra Lance commits that - // `omnigraph cleanup` reclaims. + // Restore every drifted table (RolledPastExpected / UnexpectedAtP1 / + // UnexpectedMultistep) to its manifest-pinned content, then PUBLISH so + // `manifest == Lance HEAD` for each — symmetric with roll-forward. The + // restore commit's content equals the manifest-pinned version, so re-pinning + // the manifest to the new (restored) HEAD is content-correct and closes the + // orphaned-drift class (`HEAD > manifest` with no covering sidecar). This is + // what makes a failed-then-retried schema_apply converge: after one + // roll-back `manifest == HEAD`, so the retry's precondition passes instead of + // failing one version higher each iteration. + // + // NoMovement tables are already at the pin — excluded from both the restore + // and the publish. The audit `to_version` stays the *logical* rolled-back-to + // version (`manifest_pinned`), while the manifest is published at + // `manifest_pinned + 1` (the restore commit, same content) — keep that + // asymmetry so the audit records the drift (`from_version > to_version`). let mut outcomes = Vec::with_capacity(sidecar.tables.len()); + let mut updates: Vec = Vec::with_capacity(sidecar.tables.len()); + let mut expected: HashMap = HashMap::with_capacity(sidecar.tables.len()); for (pin, state) in sidecar.tables.iter().zip(states.iter()) { if matches!( state.classification, @@ -821,10 +843,20 @@ async fn roll_back_sidecar( state.manifest_pinned, ) .await?; - // `from_version` records the Lance HEAD observed BEFORE the - // restore (the actual drift), not the manifest pin. Operators - // reading `_graph_commit_recoveries.lance` see "rolled back - // from v7 to v5" rather than "v5 → v5". + // Publish the post-restore HEAD, CAS against the current (unmoved) + // manifest pin — the same helper roll-forward uses. + push_table_update_at_head( + root_uri, + &pin.table_key, + &pin.table_path, + pin.table_branch.as_deref(), + state.manifest_pinned, + &mut updates, + &mut expected, + ) + .await?; + // `from_version` records the Lance HEAD observed BEFORE the restore + // (the actual drift); `to_version` the logical pin we rolled back to. outcomes.push(TableOutcome { table_key: pin.table_key.clone(), from_version: state.lance_head, @@ -832,13 +864,23 @@ async fn roll_back_sidecar( }); } } - // Manifest pin doesn't move on rollback; record an audit-only - // commit at the existing version so operators can correlate via - // `omnigraph commit list --filter actor=omnigraph:recovery`. + // Publish the restored HEADs so manifest == HEAD. A degenerate all-NoMovement + // roll-back restores nothing — there's nothing to publish, and the audit + // records the unchanged snapshot version. + let manifest_version = if updates.is_empty() { + snapshot.version() + } else { + let publisher = GraphNamespacePublisher::new(root_uri, sidecar.branch.as_deref()); + publisher + .publish(&updates, &expected) + .await? + .version() + .version + }; record_audit( root_uri, sidecar, - snapshot.version(), + manifest_version, RecoveryKind::RolledBack, outcomes, ) @@ -919,44 +961,20 @@ async fn roll_forward_all( HashMap::with_capacity(sidecar.tables.len() + sidecar.additional_registrations.len()); for pin in &sidecar.tables { - // Open the dataset at its CURRENT Lance HEAD on the pin's branch - // (not at the sidecar's post_commit_pin). For strict-match writers - // (Mutation/Load) HEAD == post_commit_pin by construction. For - // loose-match writers (SchemaApply/EnsureIndices/BranchMerge) HEAD - // may be higher than post_commit_pin (multiple commit_staged - // calls per table); we want to publish to the actual current HEAD. - let head_ds = Dataset::open(&pin.table_path) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - let head_ds = match pin.table_branch.as_deref() { - Some(b) if b != "main" => head_ds - .checkout_branch(b) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?, - _ => head_ds, - }; - let head_version = head_ds.version().version; - - let row_count = head_ds - .count_rows(None) - .await - .map_err(|e| OmniError::Lance(e.to_string()))? as u64; - - let table_relative_path = super::table_path_for_table_key(&pin.table_key)?; - let version_metadata = super::metadata::TableVersionMetadata::from_dataset( + // Publish to the table's CURRENT Lance HEAD on the pin's branch (not the + // sidecar's `post_commit_pin`, a lower bound for loose-match writers that + // run multiple commit_staged calls per table). CAS against the pin's + // pre-write `expected_version`. + let head_version = push_table_update_at_head( root_uri, - &table_relative_path, - &head_ds, - )?; - - updates.push(ManifestChange::Update(SubTableUpdate { - table_key: pin.table_key.clone(), - table_version: head_version, - table_branch: pin.table_branch.clone(), - row_count, - version_metadata, - })); - expected.insert(pin.table_key.clone(), pin.expected_version); + &pin.table_key, + &pin.table_path, + pin.table_branch.as_deref(), + pin.expected_version, + &mut updates, + &mut expected, + ) + .await?; published_versions.insert(pin.table_key.clone(), head_version); } @@ -1047,6 +1065,57 @@ async fn roll_forward_all( Ok((new_dataset.version().version, published_versions)) } +/// Open `table_path` at its branch HEAD, read the current Lance HEAD version, +/// row count, and version metadata, and push a `ManifestChange::Update` (plus +/// its CAS `expected` entry) that re-pins the manifest to that HEAD. Returns the +/// published HEAD version. +/// +/// Shared by `roll_forward_all` (where `expected_version` is the sidecar's +/// pre-write pin) and `roll_back_sidecar` (where it is the manifest-pinned +/// version the table was just restored to). The HEAD is read AFTER any restore +/// in the same single-threaded sweep, so no concurrent writer can have advanced +/// it. +async fn push_table_update_at_head( + root_uri: &str, + table_key: &str, + table_path: &str, + branch: Option<&str>, + expected_version: u64, + updates: &mut Vec, + expected: &mut HashMap, +) -> Result { + let head_ds = Dataset::open(table_path) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + let head_ds = match branch { + Some(b) if b != "main" => head_ds + .checkout_branch(b) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?, + _ => head_ds, + }; + let head_version = head_ds.version().version; + let row_count = head_ds + .count_rows(None) + .await + .map_err(|e| OmniError::Lance(e.to_string()))? as u64; + let table_relative_path = super::table_path_for_table_key(table_key)?; + let version_metadata = super::metadata::TableVersionMetadata::from_dataset( + root_uri, + &table_relative_path, + &head_ds, + )?; + updates.push(ManifestChange::Update(SubTableUpdate { + table_key: table_key.to_string(), + table_version: head_version, + table_branch: branch.map(str::to_string), + row_count, + version_metadata, + })); + expected.insert(table_key.to_string(), expected_version); + Ok(head_version) +} + /// Append the audit row describing this recovery action. /// /// Two-part write: (a) `_graph_commits.lance` row anchored on the recovery diff --git a/crates/omnigraph/src/db/omnigraph/optimize.rs b/crates/omnigraph/src/db/omnigraph/optimize.rs index fff3f54..ee39323 100644 --- a/crates/omnigraph/src/db/omnigraph/optimize.rs +++ b/crates/omnigraph/src/db/omnigraph/optimize.rs @@ -8,8 +8,14 @@ //! Two dials: //! //! * `optimize_all_tables` — Lance `compact_files` on every table. Rewrites -//! small fragments into fewer large ones. Non-destructive (creates a new -//! version; old fragments remain reachable via older manifest versions). +//! small fragments into fewer large ones, then **publishes the compacted +//! version to the `__manifest`** so the manifest's `table_version` tracks the +//! compacted Lance HEAD (reads pin the manifest version, so without the +//! publish compaction would be invisible to readers and would break the +//! HEAD-vs-manifest precondition of schema apply / strict writes). Compaction +//! is content-preserving (Lance `Operation::Rewrite` "reorganizes data +//! without semantic modification"), so old fragments remain reachable via +//! older manifest versions until `cleanup` runs. //! * `cleanup_all_tables` — Lance `cleanup_old_versions` on every table. //! Removes manifests (and their unique fragments) older than the configured //! retention. Destructive to version history — callers should gate this @@ -23,7 +29,9 @@ use std::time::Duration; use chrono::Utc; use futures::stream::StreamExt; use lance::dataset::cleanup::{CleanupPolicy, RemovalStats}; -use lance::dataset::optimize::{CompactionMetrics, CompactionOptions, compact_files}; +use lance::dataset::optimize::{ + CompactionMetrics, CompactionOptions, compact_files, plan_compaction, +}; use super::*; @@ -111,7 +119,8 @@ pub struct TableOptimizeStats { pub fragments_removed: usize, /// Number of new, larger fragments Lance produced. pub fragments_added: usize, - /// Did this table get a new Lance manifest version from the compaction? + /// Did this table get a new manifest version from the compaction? True when + /// compaction ran and its compacted version was published to `__manifest`. pub committed: bool, /// `Some(reason)` if this table was deliberately not compacted. When set, /// `fragments_removed == 0`, `fragments_added == 0`, and `!committed`. @@ -153,12 +162,29 @@ pub struct TableCleanupStats { pub error: Option, } -/// Run Lance `compact_files` on every node + edge table on `main`. -/// Tables run in parallel (bounded concurrency). +/// Run Lance `compact_files` on every node + edge table on `main`, publishing +/// each compacted table's new version to the `__manifest`. Tables run in +/// parallel (bounded concurrency); each is fault-isolated only at the Lance +/// level — a publish error is propagated (the recovery sidecar covers it). pub async fn optimize_all_tables(db: &Omnigraph) -> Result> { db.ensure_schema_state_valid().await?; db.ensure_schema_apply_idle("optimize").await?; + // Refuse on an unrecovered graph. A pending recovery sidecar means a failed + // write left partial state that the open-time sweep must resolve (roll + // forward/back) first; compacting + publishing a table covered by such a + // sidecar could commit a partial write the sweep would roll back. Reopen the + // graph to run recovery, then re-run optimize. + if !crate::db::manifest::list_sidecars(db.root_uri(), db.storage_adapter()) + .await? + .is_empty() + { + return Err(OmniError::manifest_conflict( + "optimize requires a clean recovery state; reopen the graph to run the \ + recovery sweep before optimizing", + )); + } + let resolved = db.resolved_branch_target(None).await?; let snapshot = resolved.snapshot; @@ -183,49 +209,179 @@ pub async fn optimize_all_tables(db: &Omnigraph) -> Result> = futures::stream::iter(table_tasks.into_iter()) - .map(|(table_key, full_path, has_blob)| async move { - // Lance `compact_files` mis-decodes blob-v2 columns under the forced - // `BlobHandling::AllBinary` read (see LANCE_SUPPORTS_BLOB_COMPACTION). - // Skip blob-bearing tables and report it rather than aborting the - // whole sweep — the other tables still compact. - if has_blob && !LANCE_SUPPORTS_BLOB_COMPACTION { - tracing::warn!( - target: "omnigraph::optimize", - table = %table_key, - "skipping compaction: table has blob columns the current Lance \ - cannot rewrite (blob-v2 AllBinary decode bug); other tables \ - unaffected — rerun after the Lance fix", - ); - return Ok(TableOptimizeStats::skipped( - table_key, - SkipReason::BlobColumnsUnsupportedByLance, - )); - } - let mut ds = table_store - .open_dataset_head_for_write(&table_key, &full_path, None) - .await?; - let version_before = ds.version().version; - let metrics: CompactionMetrics = - compact_files(&mut ds, CompactionOptions::default(), None) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - let version_after = ds.version().version; - Ok(TableOptimizeStats::compacted( - table_key, - &metrics, - version_after != version_before, - )) + .map(move |(table_key, full_path, has_blob)| async move { + optimize_one_table(db, table_key, full_path, has_blob).await }) .buffer_unordered(concurrency) .collect() .await; + // Invalidate caches for any table that published a compaction — done BEFORE + // propagating a sibling table's error, since the published versions are + // durable and reads must observe the new fragment layout (Lance invalidates + // the original row addresses on rewrite). The CSR/CSC graph topology index + // is rebuilt only when an edge table moved. Mirrors schema_apply's + // post-publish invalidation. + let any_committed = stats + .iter() + .any(|s| matches!(s, Ok(st) if st.committed)); + let edge_committed = stats + .iter() + .any(|s| matches!(s, Ok(st) if st.committed && st.table_key.starts_with("edge:"))); + if any_committed { + db.runtime_cache.invalidate_all().await; + if edge_committed { + db.invalidate_graph_index().await; + } + } + stats.into_iter().collect() } +/// Compact one table and publish the compacted version to the `__manifest`. +/// +/// Compaction (`compact_files`) advances the *dataset's* Lance HEAD via a +/// reserve-fragments + rewrite commit, but Lance knows nothing about the +/// `__manifest`. To keep the manifest the single authority for each table's +/// visible version (invariant 2), optimize must publish the compacted version. +/// The Lance-HEAD-before-manifest-publish gap is unavoidable (Lance has no +/// staged/uncommitted compaction), so it is covered by a recovery sidecar like +/// the other multi-commit writers; roll-forward is always safe because +/// compaction is content-preserving. +async fn optimize_one_table( + db: &Omnigraph, + table_key: String, + full_path: String, + has_blob: bool, +) -> Result { + // Lance `compact_files` mis-decodes blob-v2 columns under the forced + // `BlobHandling::AllBinary` read (see LANCE_SUPPORTS_BLOB_COMPACTION). Skip + // blob-bearing tables and report it rather than aborting the whole sweep. + if has_blob && !LANCE_SUPPORTS_BLOB_COMPACTION { + tracing::warn!( + target: "omnigraph::optimize", + table = %table_key, + "skipping compaction: table has blob columns the current Lance \ + cannot rewrite (blob-v2 AllBinary decode bug); other tables \ + unaffected — rerun after the Lance fix", + ); + return Ok(TableOptimizeStats::skipped( + table_key, + SkipReason::BlobColumnsUnsupportedByLance, + )); + } + + // Serialize the whole compact→publish against concurrent mutations on this + // (table, main): compaction is a Rewrite op that retryable-conflicts with a + // concurrent Merge/Update/Delete on overlapping fragments, and an + // interleaved write would also move the manifest version out from under the + // CAS below. Holding the queue makes the CAS baseline read under it exact. + let _guard = db + .write_queue() + .acquire_many(&[(table_key.clone(), None)]) + .await; + + let mut ds = db + .table_store + .open_dataset_head_for_write(&table_key, &full_path, None) + .await?; + + // CAS baseline: the table's current manifest version, read under the queue + // (in-memory coordinator snapshot, no storage I/O — stable for this section). + let expected_version = db + .snapshot() + .await + .entry(&table_key) + .map(|e| e.table_version) + .ok_or_else(|| OmniError::manifest(format!("no manifest entry for {}", table_key)))?; + + // Precise "will it compact?" check — `plan_compaction` also accounts for + // deletion materialization (which can rewrite even a single fragment). A + // steady-state already-compacted table yields an empty plan and is never + // pinned in a sidecar (a zero-commit pin would classify NoMovement on + // recovery and force an all-or-nothing rollback). There is no drift to + // reconcile here: optimize runs only on a recovered graph (the pending- + // sidecar guard above), and recovery roll-back now publishes, so + // `HEAD == manifest` holds going in. + let options = CompactionOptions::default(); + let plan = plan_compaction(&ds, &options) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + if plan.num_tasks() == 0 { + return Ok(TableOptimizeStats::compacted( + table_key, + &CompactionMetrics::default(), + false, + )); + } + + // Phase A: recovery sidecar BEFORE compaction advances the Lance HEAD, so a + // crash before the manifest publish rolls forward on next open. + let sidecar = crate::db::manifest::new_sidecar( + crate::db::manifest::SidecarKind::Optimize, + None, + // optimize is system-attributed (no `optimize_as` actor API today). + None, + vec![crate::db::manifest::SidecarTablePin { + table_key: table_key.clone(), + table_path: full_path.clone(), + expected_version, + // Lower bound — compaction commits N≥1 versions (reserve + rewrite); + // the classifier loose-matches SidecarKind::Optimize. + post_commit_pin: expected_version + 1, + table_branch: None, + }], + ); + let handle = + crate::db::manifest::write_sidecar(db.root_uri(), db.storage_adapter(), &sidecar).await?; + + // Phase B: compaction (reserve-fragments + rewrite commits advance HEAD). + let version_before = ds.version().version; + let metrics: CompactionMetrics = compact_files(&mut ds, options, None) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + let version_after = ds.version().version; + let committed = version_after != version_before; + + // Pin the per-writer Phase B → Phase C residual for optimize: Lance HEAD has + // advanced but the manifest publish below hasn't run. + crate::failpoints::maybe_fail("optimize.post_phase_b_pre_manifest_commit")?; + + // Phase C: publish the compacted version to the manifest (one CAS commit, + // expected = the version observed under the queue). On failure the sidecar + // is intentionally left for the open-time recovery sweep to roll forward. + if committed { + let state = db.table_store.table_state(&full_path, &ds).await?; + let update = crate::db::SubTableUpdate { + table_key: table_key.clone(), + table_version: state.version, + table_branch: None, + row_count: state.row_count, + version_metadata: state.version_metadata, + }; + let mut expected = std::collections::HashMap::new(); + expected.insert(table_key.clone(), expected_version); + db.coordinator + .write() + .await + .commit_updates_with_actor_with_expected(&[update], &expected, None) + .await?; + } + + // Phase D: delete the sidecar (best-effort; recovery resolves a leftover). + if let Err(err) = crate::db::manifest::delete_sidecar(&handle, db.storage_adapter()).await { + tracing::warn!( + error = %err, + operation_id = handle.operation_id.as_str(), + "optimize recovery sidecar cleanup failed; next open's recovery sweep will resolve it" + ); + } + + Ok(TableOptimizeStats::compacted(table_key, &metrics, committed)) +} + /// Run Lance `cleanup_old_versions` on every node + edge table on `main`, /// using [`CleanupPolicyOptions`]. The latest manifest is always preserved /// regardless (Lance invariant). diff --git a/crates/omnigraph/tests/composite_flow.rs b/crates/omnigraph/tests/composite_flow.rs index 6c720da..dd41310 100644 --- a/crates/omnigraph/tests/composite_flow.rs +++ b/crates/omnigraph/tests/composite_flow.rs @@ -294,21 +294,19 @@ async fn composite_flow_canonical_lifecycle() { ); // ───────────────────────────────────────────────────────────────── - // Step 10: optimize the post-merge graph — verify indices stay - // valid and queryable. + // Step 10: optimize the post-merge graph — verify compaction is + // published to the manifest (so the manifest pin tracks the compacted + // Lance HEAD), indices stay valid and queryable, and a post-optimize + // strict write commits. // - // **Known limitation**: `optimize_all_tables` calls Lance - // `compact_files` directly — it advances per-table Lance HEAD - // without updating the omnigraph `__manifest` pin. After optimize, - // the next writer's expected_table_versions captures the - // pre-optimize manifest pin, but the publisher's pre-check reads - // a higher version from the manifest dataset (because some other - // path — possibly schema-state recovery on reopen — wrote a newer - // __manifest row). The `ExpectedVersionMismatch` is benign - // (re-issuing the mutation after a snapshot refresh succeeds), but - // a composite test cannot reliably exercise post-optimize mutations - // until that path is investigated. Coverage of post-optimize - // mutations is left to a focused optimize+cleanup integration test. + // This step used to carry a "Known limitation": `optimize_all_tables` + // ran Lance `compact_files` without publishing the new version to + // `__manifest`, so the manifest pin lagged the Lance HEAD and the next + // strict write / schema apply failed with `ExpectedVersionMismatch` + // ("stale view … refresh and retry") — so post-optimize mutations were + // deliberately omitted here. optimize now publishes the compacted + // version, and this flow exercises exactly that previously-failing + // write below. // ───────────────────────────────────────────────────────────────── let optimize_stats = db.optimize().await.unwrap(); assert!( @@ -331,6 +329,28 @@ async fn composite_flow_canonical_lifecycle() { "row counts unchanged by optimize" ); + // A strict update on a compacted table is exactly the write that + // failed with "stale view" before optimize published its compaction. + // It must now commit (Alice is one of the seed Persons; an update + // leaves the row count at 6). + let post_optimize_update = mutate_main( + &mut db, + MUTATION_QUERIES, + "set_age", + &mixed_params(&[("$name", "Alice")], &[("$age", 41)]), + ) + .await + .expect("post-optimize strict update must commit — optimize published the manifest"); + assert_eq!( + post_optimize_update.affected_nodes, 1, + "post-optimize update must affect exactly Alice" + ); + assert_eq!( + count_rows(&db, "node:Person").await, + 6, + "an update must not change the Person row count" + ); + // ───────────────────────────────────────────────────────────────── // Step 11: cleanup — keep last 10 versions, only purge versions // older than 1 hour. With this small test, we have well under 10 @@ -373,14 +393,27 @@ async fn composite_flow_canonical_lifecycle() { branches, ); - // Final query exercise — full read path works post-reopen, - // post-cleanup. Post-cleanup mutation is omitted here pending - // resolution of the optimize-vs-manifest-pin interaction documented - // in Step 10. + // Final exercise — full read AND write path works post-reopen, + // post-cleanup. (The post-cleanup mutation was previously omitted + // pending resolution of the optimize-vs-manifest-pin interaction in + // Step 10; that is now fixed, so a strict write here must commit.) let final_total = query_main(&mut db, TEST_QUERIES, "total_people", &ParamMap::default()) .await .unwrap(); assert!(!final_total.batches().is_empty()); + + let post_reopen_update = mutate_main( + &mut db, + MUTATION_QUERIES, + "set_age", + &mixed_params(&[("$name", "Alice")], &[("$age", 42)]), + ) + .await + .expect("post-reopen, post-cleanup strict update must commit"); + assert_eq!( + post_reopen_update.affected_nodes, 1, + "post-reopen update must affect exactly Alice" + ); } /// Cross-handle sequence that exercises operations after a schema_apply diff --git a/crates/omnigraph/tests/end_to_end.rs b/crates/omnigraph/tests/end_to_end.rs index a0fdb0e..ea11d0e 100644 --- a/crates/omnigraph/tests/end_to_end.rs +++ b/crates/omnigraph/tests/end_to_end.rs @@ -1933,3 +1933,87 @@ query docs_with_tag($tag: String) { "contains-pushdown should return exactly the rows whose tags list contains 'red'" ); } + +// ─── Maintenance in the full lifecycle: optimize (compaction) ──────────────── + +/// `optimize` (Lance compaction) is part of a realistic graph lifecycle: it +/// advances the Lance HEAD and publishes the compacted version to the manifest. +/// The rest of the flow must keep working across that boundary — reads observe +/// the compacted data, strict updates (which check Lance HEAD == manifest +/// version) still commit, inserts still commit, and the state survives a reopen +/// (the open-time recovery sweep finds no leftover drift). Before optimize +/// published its compaction, the manifest lagged the Lance HEAD here and the +/// post-optimize update below failed with "stale view ... refresh and retry". +#[tokio::test] +async fn full_flow_optimize_then_query_update_and_reopen() { + let dir = tempfile::tempdir().unwrap(); + let uri = dir.path().to_str().unwrap().to_string(); + let mut db = init_and_load(&dir).await; + + // Build several Person fragments so compaction has something to merge. + for (name, age) in [("Eve", 40), ("Frank", 41), ("Grace", 42)] { + mutate_main( + &mut db, + MUTATION_QUERIES, + "insert_person", + &mixed_params(&[("$name", name)], &[("$age", age)]), + ) + .await + .unwrap(); + } + + let stats = db.optimize().await.unwrap(); + assert!( + stats.iter().any(|s| s.committed), + "a multi-fragment table should have compacted in this flow" + ); + + // Reads observe the compacted data. + let qr = query_main( + &mut db, + TEST_QUERIES, + "get_person", + ¶ms(&[("$name", "Alice")]), + ) + .await + .unwrap(); + assert_eq!(qr.num_rows(), 1); + + // Strict update after optimize commits (previously failed with "stale view" + // because the manifest lagged the compacted Lance HEAD). + let upd = mutate_main( + &mut db, + MUTATION_QUERIES, + "set_age", + &mixed_params(&[("$name", "Alice")], &[("$age", 31)]), + ) + .await + .unwrap(); + assert_eq!(upd.affected_nodes, 1); + + // Insert after optimize also commits. + mutate_main( + &mut db, + MUTATION_QUERIES, + "insert_person", + &mixed_params(&[("$name", "Ivan")], &[("$age", 50)]), + ) + .await + .unwrap(); + assert_eq!(count_rows(&db, "node:Person").await, 8); // 4 seed + Eve/Frank/Grace + Ivan + + // State survives a reopen — the recovery sweep runs and finds no drift. + drop(db); + let reopened = Omnigraph::open(&uri).await.unwrap(); + assert_eq!(count_rows(&reopened, "node:Person").await, 8); + let alice = reopened + .entity_at_target(ReadTarget::branch("main"), "node:Person", "Alice") + .await + .unwrap() + .unwrap(); + assert_eq!( + alice["age"], + serde_json::json!(31), + "Alice's post-optimize age update must persist across reopen" + ); +} diff --git a/crates/omnigraph/tests/failpoints.rs b/crates/omnigraph/tests/failpoints.rs index 149c63a..d240108 100644 --- a/crates/omnigraph/tests/failpoints.rs +++ b/crates/omnigraph/tests/failpoints.rs @@ -1245,7 +1245,7 @@ async fn refresh_defers_rollback_eligible_sidecar_to_next_open() { // the rollback (will use Dataset::restore safely; no concurrent // writers at open time). drop(db); - let _db = Omnigraph::open(&uri).await.unwrap(); + let db = Omnigraph::open(&uri).await.unwrap(); // After full-sweep recovery, the sidecar should be processed // (deleted). Sidecar's tables are eligible for rollback (UnexpectedAtP1): // restore happens on Person (HEAD advances by 1). @@ -1268,6 +1268,19 @@ async fn refresh_defers_rollback_eligible_sidecar_to_next_open() { "full sweep must run Dataset::restore (head advances); \ post_head={post_head}, final_head={final_head}", ); + // Convergence: roll-back published the restored HEAD, so the manifest pin + // tracks Lance HEAD afterward (no residual drift). + let entry_version = db + .snapshot_of(omnigraph::db::ReadTarget::branch("main")) + .await + .unwrap() + .entry("node:Person") + .unwrap() + .table_version; + assert_eq!( + entry_version, final_head, + "full-sweep roll-back must publish so manifest pin ({entry_version}) == Lance HEAD ({final_head})", + ); } /// Companion to the above — confirms that a finalize→publisher failure @@ -1461,10 +1474,15 @@ edge WorksAt: Person -> Company } let db = Omnigraph::open(&uri).await.unwrap(); - assert_eq!( - version_main(&db).await.unwrap(), - pre_failure_version, - "manifest must remain on the old schema when no schema staging files existed" + // Roll-back now publishes the restored version, so the manifest version + // advances — but to the OLD-schema content: the migration never applied + // (asserted by count_rows + the `_schema.pg` checks below), and the sweep + // converges (`manifest == Lance HEAD`, asserted by + // assert_post_recovery_invariants's RolledBack arm). + assert!( + version_main(&db).await.unwrap() > pre_failure_version, + "roll-back publishes the restored (old-schema) version, advancing the manifest; \ + pre={pre_failure_version}", ); assert_eq!( helpers::count_rows(&db, "node:Person").await, @@ -1637,6 +1655,100 @@ edge WorksAt: Person -> Company ); } +/// `optimize` Phase B → Phase C residual: `compact_files` advanced the Lance +/// HEAD but the manifest publish hasn't run. The `Optimize` recovery sidecar +/// (loose-match, like SchemaApply/EnsureIndices) must roll the compacted version +/// forward on next open so the manifest tracks the Lance HEAD — and the healed +/// table must then accept a schema apply (the original bug's victim). +#[tokio::test] +async fn optimize_phase_b_failure_recovered_on_next_open() { + let _scenario = FailScenario::setup(); + let dir = tempfile::tempdir().unwrap(); + let uri = dir.path().to_str().unwrap().to_string(); + let operation_id; + + // Seed: several separate Person inserts → multiple fragments, so compaction + // has real work and advances the Lance HEAD. + { + let db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); + for (name, age) in [("alice", 30), ("bob", 31), ("carol", 32), ("dave", 33)] { + db.mutate( + "main", + MUTATION_QUERIES, + "insert_person", + &mixed_params(&[("$name", name)], &[("$age", age)]), + ) + .await + .unwrap(); + } + } + + let pre_failure_version = { + let db = Omnigraph::open(&uri).await.unwrap(); + version_main(&db).await.unwrap() + }; + + // Failpoint fires AFTER compact_files advanced the Lance HEAD but BEFORE the + // manifest publish. The Optimize sidecar persists (only node:Person has + // compactable fragments, so exactly one sidecar is written). + { + let db = Omnigraph::open(&uri).await.unwrap(); + let _failpoint = + ScopedFailPoint::new("optimize.post_phase_b_pre_manifest_commit", "return"); + let err = db.optimize().await.unwrap_err(); + assert!( + err.to_string() + .contains("injected failpoint triggered: optimize.post_phase_b_pre_manifest_commit"), + "unexpected error: {err}" + ); + + let recovery_dir = dir.path().join("__recovery"); + let sidecars: Vec<_> = std::fs::read_dir(&recovery_dir) + .unwrap() + .filter_map(|e| e.ok()) + .collect(); + assert_eq!( + sidecars.len(), + 1, + "exactly one Optimize sidecar must persist after optimize failure" + ); + operation_id = single_sidecar_operation_id(dir.path()); + } + + // Recovery: reopen runs the sweep. The Optimize sidecar classifies + // RolledPastExpected (loose-match) → RollForward → manifest extends to the + // compacted Lance HEAD. + let db = Omnigraph::open(&uri).await.unwrap(); + let post_recovery_version = version_main(&db).await.unwrap(); + assert!( + post_recovery_version > pre_failure_version, + "manifest version must advance post-recovery (compaction rolled forward); \ + pre={pre_failure_version}, post={post_recovery_version}", + ); + drop(db); + + assert_post_recovery_invariants( + dir.path(), + &operation_id, + RecoveryExpectation::RolledForward { + tables: vec![TableExpectation::main("node:Person")], + }, + ) + .await + .unwrap(); + + // The healed table accepts an additive schema apply — its HEAD-vs-manifest + // precondition is satisfied because recovery published the compacted version. + let db = Omnigraph::open(&uri).await.unwrap(); + let desired = helpers::TEST_SCHEMA.replace( + " age: I32?\n}", + " age: I32?\n nickname: String?\n}", + ); + db.apply_schema(&desired) + .await + .expect("schema apply after optimize recovery must succeed"); +} + #[tokio::test] async fn branch_merge_phase_b_failure_recovered_on_next_open() { use omnigraph::loader::{LoadMode, load_jsonl}; diff --git a/crates/omnigraph/tests/helpers/recovery.rs b/crates/omnigraph/tests/helpers/recovery.rs index c76009e..90d9a25 100644 --- a/crates/omnigraph/tests/helpers/recovery.rs +++ b/crates/omnigraph/tests/helpers/recovery.rs @@ -181,6 +181,9 @@ pub async fn assert_post_recovery_invariants( "audit row for {operation_id} recorded the wrong recovery_kind", ); assert_rollback_outcomes_record_drift(&audit); + // Roll-back now publishes the restored HEAD, so manifest == Lance + // HEAD afterward (symmetric with roll-forward) — no residual drift. + assert_manifest_pins_match_lance_heads(graph_root, &tables).await?; assert_recovery_commit_shape(graph_root, &audit, &tables).await?; assert_non_main_did_not_move_main(graph_root, &tables).await?; assert_idempotent_reopen(graph_root, operation_id).await?; diff --git a/crates/omnigraph/tests/maintenance.rs b/crates/omnigraph/tests/maintenance.rs index 3e61677..2a5a659 100644 --- a/crates/omnigraph/tests/maintenance.rs +++ b/crates/omnigraph/tests/maintenance.rs @@ -8,10 +8,12 @@ mod helpers; use std::time::Duration; use lance::Dataset; -use omnigraph::db::{CleanupPolicyOptions, Omnigraph, SkipReason}; +use omnigraph::db::{CleanupPolicyOptions, Omnigraph, ReadTarget, SkipReason}; use omnigraph::loader::{LoadMode, load_jsonl}; -use helpers::{TEST_DATA, TEST_SCHEMA, count_rows, init_and_load}; +use helpers::{ + MUTATION_QUERIES, TEST_DATA, TEST_SCHEMA, count_rows, init_and_load, mixed_params, mutate_main, +}; /// Filesystem URI of a node sub-table, mirroring the engine's layout /// (FNV-1a of the type name under `nodes/`). Matches the helper in @@ -163,6 +165,124 @@ node Tag {\n slug: String @key\n}\n"; assert_eq!(tag.skipped, None, "non-blob table must not be skipped"); } +// Regression: `optimize` must publish its compaction to the `__manifest` so the +// manifest's recorded `table_version` tracks the compacted Lance HEAD. +// +// Lance `compact_files` advances the *dataset's* version (reserve-fragments + +// rewrite commits) but knows nothing about OmniGraph's `__manifest`. If optimize +// does not publish a manifest update, the manifest's `table_version` lags the +// Lance HEAD: reads stay pinned to the pre-compaction version (compaction is +// invisible to them) and any subsequent schema apply / strict update/delete +// fails its HEAD-vs-manifest precondition with +// "stale view of '': expected manifest table version X but current is Y". +// This pins the fix — optimize publishes the compacted version, so manifest == +// HEAD and migrations after a compaction succeed. +#[tokio::test] +async fn optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds() { + let dir = tempfile::tempdir().unwrap(); + let root = dir.path().to_str().unwrap().trim_end_matches('/').to_string(); + let mut db = init_and_load(&dir).await; + + // Several separate inserts → multiple Person fragments, so `compact_files` + // actually merges and moves the Lance HEAD (a single fragment is a no-op). + for (name, age) in [("Eve", 40), ("Frank", 41), ("Grace", 42), ("Heidi", 43)] { + mutate_main( + &mut db, + MUTATION_QUERIES, + "insert_person", + &mixed_params(&[("$name", name)], &[("$age", age as i64)]), + ) + .await + .expect("insert"); + } + + let stats = db.optimize().await.unwrap(); + let person = stats + .iter() + .find(|s| s.table_key == "node:Person") + .expect("Person stat present"); + assert!( + person.committed, + "Person is multi-fragment, so optimize must have compacted it" + ); + + // After optimize, the manifest's recorded table_version must equal the actual + // Lance HEAD — optimize published its compaction, so there is no drift. + let snap = db.snapshot_of(ReadTarget::branch("main")).await.unwrap(); + let entry = snap.entry("node:Person").unwrap(); + let manifest_version = entry.table_version; + let full = format!("{}/{}", root, entry.table_path); + let lance_head = Dataset::open(&full).await.unwrap().version().version; + assert_eq!( + manifest_version, lance_head, + "after optimize, manifest table_version ({manifest_version}) must equal Lance HEAD ({lance_head})", + ); + + // Reads observe the compacted version with rows preserved (4 seed + 4 inserts). + assert_eq!(count_rows(&db, "node:Person").await, 8); + + // The headline: an additive (nullable property) migration touching the + // just-compacted table succeeds, where it previously failed with "stale view". + let desired = TEST_SCHEMA.replace( + " age: I32?\n}", + " age: I32?\n nickname: String?\n}", + ); + let result = db + .apply_schema(&desired) + .await + .expect("additive schema apply after optimize must succeed"); + assert!(result.applied, "schema apply should report applied=true"); +} + +// Regression: `optimize` must REFUSE when an unresolved recovery sidecar is +// pending. Operating on an unrecovered graph could publish a partial write that +// the all-or-nothing recovery sweep would roll back; the operator must reopen +// (run the recovery sweep) first. +#[tokio::test] +async fn optimize_defers_when_recovery_sidecar_is_pending() { + let dir = tempfile::tempdir().unwrap(); + let uri = dir.path().to_str().unwrap(); + let db = init_and_load(&dir).await; + + // Simulate an in-process failed write that left a recovery sidecar on disk. + let recovery_dir = dir.path().join("__recovery"); + std::fs::create_dir_all(&recovery_dir).unwrap(); + let person_path = node_table_uri(uri, "Person"); + let sidecar_json = format!( + r#"{{ + "schema_version": 1, + "operation_id": "01H000000000000000000DEFR", + "started_at": "0", + "branch": null, + "actor_id": "act-test", + "writer_kind": "Mutation", + "tables": [ + {{ + "table_key": "node:Person", + "table_path": "{}", + "expected_version": 1, + "post_commit_pin": 2 + }} + ] + }}"#, + person_path + ); + std::fs::write( + recovery_dir.join("01H000000000000000000DEFR.json"), + sidecar_json, + ) + .unwrap(); + + let err = db + .optimize() + .await + .expect_err("optimize must defer (error) while a recovery sidecar is pending"); + assert!( + err.to_string().to_lowercase().contains("recovery"), + "optimize defer error should mention recovery; got: {err}", + ); +} + #[tokio::test] async fn cleanup_without_any_policy_option_errors() { let dir = tempfile::tempdir().unwrap(); diff --git a/crates/omnigraph/tests/recovery.rs b/crates/omnigraph/tests/recovery.rs index a090178..f6b19e8 100644 --- a/crates/omnigraph/tests/recovery.rs +++ b/crates/omnigraph/tests/recovery.rs @@ -278,6 +278,97 @@ async fn recovery_rolls_back_synthetic_drift_on_open() { ); } +/// Regression: recovery roll-back must PUBLISH the restored version so +/// `manifest == Lance HEAD` afterward (no residual "orphaned drift"). Before the +/// fix, roll-back restored via `Dataset::restore` but left the manifest pin +/// behind HEAD, so a subsequent strict write / schema apply failed its +/// HEAD-vs-manifest precondition ("stale view … refresh and retry") — and a +/// failed schema apply's own roll-back leaked +1 each retry (the original bug's +/// loop). With convergence, one roll-back leaves `manifest == HEAD` and the +/// follow-up succeeds. +#[tokio::test] +async fn recovery_rollback_converges_manifest_so_schema_apply_succeeds() { + use omnigraph::db::ReadTarget; + use omnigraph::loader::{LoadMode, load_jsonl}; + use omnigraph::table_store::TableStore; + + let dir = tempfile::tempdir().unwrap(); + let uri = dir.path().to_str().unwrap(); + + let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); + load_jsonl( + &mut db, + r#"{"type":"Person","data":{"name":"alice","age":30}} +{"type":"Person","data":{"name":"bob","age":25}} +"#, + LoadMode::Append, + ) + .await + .unwrap(); + drop(db); + + // Forge a Phase-B residual: advance Person's Lance HEAD without publishing to + // the manifest (the manifest pin stays at the load's committed version). + let person_uri = node_table_uri(uri, "Person"); + let store = TableStore::new(uri); + let mut ds = Dataset::open(&person_uri).await.unwrap(); + let manifest_pin = ds.version().version; + let _ = store + .delete_where(&person_uri, &mut ds, "1 = 2") + .await + .unwrap(); + drop(ds); + + // Roll-back-classified sidecar (post_commit_pin != observed head ⇒ + // UnexpectedAtP1 ⇒ RollBack). + let sidecar_json = format!( + r#"{{ + "schema_version": 1, + "operation_id": "01H0000000000000000000CVG", + "started_at": "0", + "branch": null, + "actor_id": "act-test", + "writer_kind": "Mutation", + "tables": [ + {{ + "table_key": "node:Person", + "table_path": "{}", + "expected_version": {}, + "post_commit_pin": {} + }} + ] + }}"#, + person_uri, manifest_pin, manifest_pin + ); + write_sidecar_file(dir.path(), "01H0000000000000000000CVG", &sidecar_json); + + // Reopen runs the sweep: restore Person to manifest_pin, then PUBLISH so the + // manifest tracks the restored Lance HEAD. + let db = Omnigraph::open(uri).await.unwrap(); + + // Convergence: manifest pin == Lance HEAD. Fails before the fix — the + // manifest stays at manifest_pin while HEAD advanced past it. + let snap = db.snapshot_of(ReadTarget::branch("main")).await.unwrap(); + let entry = snap.entry("node:Person").unwrap(); + let lance_head = Dataset::open(&person_uri).await.unwrap().version().version; + assert_eq!( + entry.table_version, lance_head, + "roll-back must publish so manifest pin ({}) == Lance HEAD ({})", + entry.table_version, lance_head, + ); + + // The +1-loop victim: an additive schema apply must now succeed (its + // HEAD-vs-manifest precondition is satisfied). Before the fix this failed + // with "stale view … refresh and retry". + let desired = TEST_SCHEMA.replace( + " age: I32?\n}", + " age: I32?\n nickname: String?\n}", + ); + db.apply_schema(&desired) + .await + .expect("schema apply after a converging roll-back must succeed"); +} + // ===================================================================== // Phase 4 — roll-forward path + audit row recording // ===================================================================== diff --git a/docs/dev/testing.md b/docs/dev/testing.md index 425fcee..f18600b 100644 --- a/docs/dev/testing.md +++ b/docs/dev/testing.md @@ -34,10 +34,10 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav | `s3_storage.rs` | S3-backed graph (skipped unless `OMNIGRAPH_S3_TEST_BUCKET` is set) | | `lance_version_columns.rs` | Per-row `_row_last_updated_at_version` behavior | | `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths | -| `maintenance.rs` | `optimize` (compaction) + `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation | -| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the four per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`). | +| `maintenance.rs` | `optimize` (compaction) + `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes the compacted version so the manifest tracks the Lance HEAD and a subsequent schema apply succeeds (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), and reconciles a pre-existing manifest-behind-HEAD drift forged via raw Lance compaction (`optimize_reconciles_preexisting_manifest_head_drift`) | +| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`). | | `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path | -| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories). | +| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). | ## Fixtures diff --git a/docs/dev/writes.md b/docs/dev/writes.md index 8b692b4..d2c7c7e 100644 --- a/docs/dev/writes.md +++ b/docs/dev/writes.md @@ -157,10 +157,14 @@ are left at `Lance HEAD = manifest_pinned + 1`. **Recovery protocol** (lifecycle of every staged-write writer — `MutationStaging::finalize`, `schema_apply::apply_schema_with_lock`, -`branch_merge_on_current_target`, `ensure_indices_for_branch`): +`branch_merge_on_current_target`, `ensure_indices_for_branch`, +`optimize_all_tables`): 1. **Phase A**: writer writes a sidecar JSON to - `__recovery/{ulid}.json` BEFORE its first `commit_staged`. The + `__recovery/{ulid}.json` BEFORE its first HEAD-advancing commit + (`commit_staged`, or `compact_files` for `optimize_all_tables`, + which advances the Lance HEAD via a reserve-fragments + rewrite + commit rather than a staged write). The sidecar names every `(table_key, table_path, expected_version, post_commit_pin)` it intends to commit + the writer kind + actor_id. @@ -195,8 +199,13 @@ recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`: otherwise full open-time recovery rolls them back and refresh-time recovery leaves them for the next read-write open. - Otherwise **roll back**: per-table `Dataset::restore` to the - manifest-pinned table version for that branch. Rollback records the - actual restore target in the audit row's `to_version`. + manifest-pinned table version, then a single `ManifestBatchPublisher::publish` + of the restored HEAD — symmetric with roll-forward, so `manifest == HEAD` + after recovery (no residual drift). This convergence is what lets a + failed-then-retried schema apply succeed instead of failing one version higher + each iteration. The audit row's `to_version` records the logical + rolled-back-to version (`manifest_pinned`); the manifest is published at the + restore commit (`manifest_pinned + 1`, same content). - After a successful roll-forward or roll-back, an audit row is recorded — `_graph_commits.lance` carries a commit tagged `actor_id = "omnigraph:recovery"`, and a sibling diff --git a/docs/user/branches-commits.md b/docs/user/branches-commits.md index 0565186..a4044cb 100644 --- a/docs/user/branches-commits.md +++ b/docs/user/branches-commits.md @@ -58,6 +58,6 @@ Internal or legacy branch refs: ## L2 — Recovery audit trail -The four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) protect their multi-table commits with a sidecar at `__recovery/{ulid}.json` written before Phase B and deleted after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`: classify per-table state, decide all-or-nothing per sidecar, roll forward / back, record an audit row. +The five migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`, `optimize_all_tables`) protect their multi-table commits with a sidecar at `__recovery/{ulid}.json` written before Phase B and deleted after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`: classify per-table state, decide all-or-nothing per sidecar, roll forward / back, record an audit row. Audit rows live in `_graph_commit_recoveries.lance` (sibling to `_graph_commits.lance`) and reference the commit graph by `graph_commit_id`. The linked recovery commit is identified by that same `graph_commit_id`, and `actor_id="omnigraph:recovery"` is stored in `_graph_commit_actors.lance` (joined by `graph_commit_id`) — `_graph_commits.lance` itself does not carry the `actor_id` column. To find recoveries for a specific original actor: `omnigraph commit list --filter actor=omnigraph:recovery`, then join to `_graph_commit_recoveries.lance` by `graph_commit_id` to read `recovery_for_actor`. Schema: see `crates/omnigraph/src/db/recovery_audit.rs`. diff --git a/docs/user/maintenance.md b/docs/user/maintenance.md index 3628fa0..a835799 100644 --- a/docs/user/maintenance.md +++ b/docs/user/maintenance.md @@ -4,8 +4,10 @@ ## `optimize_all_tables(db)` — non-destructive -- Lance `compact_files()` on every node + edge table on `main`. -- Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests. +- Lance `compact_files()` on every node + edge table on `main`, then **publishes the compacted version to the `__manifest`** so the manifest's `table_version` tracks the compacted Lance HEAD. Reads pin the manifest version, so without this publish compaction would be invisible to readers *and* would break the HEAD-vs-manifest precondition of the next schema apply / strict update/delete ("stale view … refresh and retry"). The publish advances the graph version (a system-attributed commit) only for tables that actually compacted. +- Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests until `cleanup` runs. +- Each table's compact→publish runs under its per-`(table, main)` write queue (serializing with concurrent mutations — compaction is a Lance `Rewrite` op that retryable-conflicts with a concurrent merge/update/delete on overlapping fragments). The Lance-HEAD-before-manifest-publish gap is covered by a `SidecarKind::Optimize` recovery sidecar (loose-match): a crash in that window rolls the compacted version forward on the next `Omnigraph::open` (compaction is content-preserving, so roll-forward is always safe). +- **Requires a recovered graph.** `optimize` refuses (errors) when an unresolved recovery sidecar is present under `__recovery` — operating on an unrecovered graph could publish a partial write the open-time recovery sweep would roll back. Reopen the graph to run the recovery sweep, then re-run `optimize`. (Recovery roll-back now publishes its restored version, so a recovered graph always satisfies `manifest == Lance HEAD` going in; there is no leftover drift for `optimize` to interpret.) - Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8). - Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed, skipped }]`. - **Blob tables are skipped.** A table that declares any `Blob` property is not compacted: it is reported with `skipped: Some(BlobColumnsUnsupportedByLance)` (and logged via `tracing::warn`) instead of compacted, and the rest of the sweep proceeds normally. The current Lance `compact_files` mis-decodes blob-v2 columns under its forced `BlobHandling::AllBinary` read; **reads and writes are unaffected** — only compaction is. This is gated by `LANCE_SUPPORTS_BLOB_COMPACTION` (`db/omnigraph/optimize.rs`) and removed when the upstream Lance fix lands (see [docs/dev/lance.md](../dev/lance.md)). Consequence: fragment count and deleted-row space on blob tables are not reclaimed until then; query results are never affected. diff --git a/docs/user/storage.md b/docs/user/storage.md index d1c52b5..2c57a92 100644 --- a/docs/user/storage.md +++ b/docs/user/storage.md @@ -94,7 +94,7 @@ flowchart TB - **`nodes/`** and **`edges/`** are sibling directories holding one Lance dataset per declared type. Names are `fnv1a64-hex` of the type name to keep paths fixed-length and case-safe. - **`_graph_commits.lance`** is an L2 dataset that records the graph-level commit DAG, with a paired `_graph_commit_actors.lance` for the actor map. (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; the v2→v3 migration sweeps their stale `__run__*` branches, and the dataset bytes are reclaimed once `delete_prefix` lands.) - **`_graph_commit_recoveries.lance`** — one row per recovery sweep action. Joined to `_graph_commits.lance` by `graph_commit_id`; the linked commit row carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join. See `crates/omnigraph/src/db/recovery_audit.rs`. -- **`__recovery/{ulid}.json`** — transient sidecar files written by the four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) before Phase B begins, deleted after Phase C succeeds. A sidecar persisting after process exit means the writer crashed in the Phase B → Phase C window; the next `Omnigraph::open` recovery sweep processes it. Steady-state directory is empty. See `crates/omnigraph/src/db/manifest/recovery.rs`. +- **`__recovery/{ulid}.json`** — transient sidecar files written by the five migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`, `optimize_all_tables`) before Phase B begins, deleted after Phase C succeeds. A sidecar persisting after process exit means the writer crashed in the Phase B → Phase C window; the next `Omnigraph::open` recovery sweep processes it. Steady-state directory is empty. See `crates/omnigraph/src/db/manifest/recovery.rs`. - **`_refs/branches/{name}.json`** is graph-level branch metadata — pointers from a branch name to the manifest version it heads. - **Inside each Lance dataset** (orange): the standard Lance directory layout. `_versions/{n}.manifest` records every commit; `data/` holds the actual Arrow fragments; `_indices/{uuid}/` holds index segments with their own `fragment_bitmap` for partial coverage; `_refs/` holds Lance-native per-dataset branches and tags.