From d54bccb9401551415d426c45714c20caee8ebe86 Mon Sep 17 00:00:00 2001 From: Ragnor Comerford Date: Tue, 2 Jun 2026 17:12:00 +0200 Subject: [PATCH] fix(optimize): skip blob-bearing tables to avoid Lance compaction crash (#138) * test(optimize): pin Lance blob-column compaction failure as a surface guard Lance compact_files mis-decodes blob-v2 columns under its forced BlobHandling::AllBinary read ("more fields in the schema than provided column indices"), failing even a pristine uniform-V2_2 multi-fragment blob table; reads use descriptor handling and are unaffected. Guard 10 reproduces this and is self-retiring: it turns red on the Lance bump that fixes the bug, forcing LANCE_SUPPORTS_BLOB_COMPACTION to flip. * fix(optimize): skip blob-bearing tables instead of crashing compaction omnigraph optimize aborted the whole sweep when any node/edge table had a Blob property: Lance compact_files cannot decode blob-v2 columns under AllBinary (the column-index error pinned by the surface guard). Skip blob-bearing tables behind a LANCE_SUPPORTS_BLOB_COMPACTION gate and report them via TableOptimizeStats.skipped / SkipReason (surfaced in the CLI and a tracing::warn) instead of erroring, which also isolates the failure so the other tables still compact. Reads/writes are unaffected; only fragment/space reclamation on blob tables is deferred until the upstream Lance fix. Adds a maintenance.rs regression test (validated red with the column-index symptom before the fix, green after), a concise v0.6.1 release note, and updates docs (maintenance, cli-reference, AGENTS capability matrix, invariants Known Gaps, lance.md audit, constants). * refactor(optimize): make TableOptimizeStats and SkipReason non_exhaustive Both are returned result types, never built by callers, so #[non_exhaustive] makes this the last field/variant addition that can break downstream literal construction and keeps future ones non-breaking (review feedback on the public-field addition). The v0.6.1 Compatibility Notes call out the source-level change. Also drops the now-stale "RED today / GREEN after the fix lands" narration in the optimize_skips_blob_table_and_reports_skip test (historical regression context now that the fix is in this branch), and folds in the expanded v0.6.1 release note. * chore(release): bump workspace to v0.6.1 Coherent version bump to accompany the v0.6.1 release note: all five crate manifests + path-dependency constraints, Cargo.lock, the AGENTS.md surveyed-version line, and openapi.json info.version move 0.6.0 -> 0.6.1. Matches the established release pattern (#118 landed the v0.6.0 note + bump together) and resolves the Codex/Devin review flag that a v0.6.1 note without a bump leaves CARGO_PKG_VERSION reporting 0.6.0 and mixed package versions. --- AGENTS.md | 4 +- Cargo.lock | 10 +- crates/omnigraph-cli/Cargo.toml | 10 +- crates/omnigraph-cli/src/main.rs | 9 +- crates/omnigraph-compiler/Cargo.toml | 2 +- crates/omnigraph-policy/Cargo.toml | 2 +- crates/omnigraph-server/Cargo.toml | 8 +- crates/omnigraph/Cargo.toml | 8 +- crates/omnigraph/src/db/mod.rs | 2 +- crates/omnigraph/src/db/omnigraph.rs | 2 +- crates/omnigraph/src/db/omnigraph/optimize.rs | 129 ++++++++++++++++-- .../omnigraph/tests/lance_surface_guards.rs | 85 ++++++++++++ crates/omnigraph/tests/maintenance.rs | 93 ++++++++++++- docs/dev/invariants.md | 9 ++ docs/dev/lance.md | 3 +- docs/releases/v0.6.1.md | 26 ++++ docs/user/cli-reference.md | 2 +- docs/user/constants.md | 1 + docs/user/maintenance.md | 3 +- openapi.json | 2 +- 20 files changed, 363 insertions(+), 47 deletions(-) create mode 100644 docs/releases/v0.6.1.md diff --git a/AGENTS.md b/AGENTS.md index 68de6b8..b876749 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -16,7 +16,7 @@ Tools that support `@`-imports (Claude Code) auto-include all three files via th `CLAUDE.md` is a symlink to this file — there is exactly one source of truth. Edit `AGENTS.md`. -**Version surveyed:** 0.6.0 +**Version surveyed:** 0.6.1 **Workspace crates:** `omnigraph-compiler`, `omnigraph` (engine), `omnigraph-policy`, `omnigraph-cli`, `omnigraph-server` **Storage substrate:** Lance 6.x (columnar, versioned, branchable) **License:** MIT @@ -237,7 +237,7 @@ omnigraph policy explain --actor act-alice --action change --branch main | Per-dataset versioning + time travel | ✅ | `snapshot_at_version`, `entity_at`, snapshot-pinned reads across many tables | | Per-dataset branches | ✅ | **Graph-level** branches (atomic across all sub-tables), lazy fork, system branch filtering | | Atomic single-dataset commits | ✅ | **Multi-table publish via three layers**, NOT a single Lance primitive: (1) per-table Lance `commit_staged` for the data write, (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for cross-table ordering, (3) the open-time recovery sweep for the residual gap between (1) and (2). All three layers ship; the four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) write a `__recovery/{ulid}.json` sidecar before Phase B and delete it after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the sweep in `db/manifest/recovery.rs`: classify, decide all-or-nothing per sidecar, roll forward via single `ManifestBatchPublisher::publish` or roll back via `Dataset::restore`, and record an audit row in `_graph_commit_recoveries.lance` (queryable via `omnigraph commit list --filter actor=omnigraph:recovery`). Continuous in-process recovery (no restart needed between Phase B failure and recovery) is the goal of a future background reconciler. Engine writes route through a sealed `TableStorage` trait exposing `stage_*` + `commit_staged` as the canonical staged-write surface; documented inline-commit residuals (`delete_where`, `create_vector_index`, plus legacy `append_batch` / `merge_insert_batches` / `overwrite_batch` / `create_*_index`) remain on the trait until upstream Lance ships a public two-phase API ([#6658](https://github.com/lance-format/lance/issues/6658), [#6666](https://github.com/lance-format/lance/issues/6666)) and the migration of every call site completes. | -| Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency | +| Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency; **skips blob-bearing tables** (reported via `TableOptimizeStats.skipped`, not silent), gated on `LANCE_SUPPORTS_BLOB_COMPACTION` until the upstream blob-v2 compaction-decode bug is fixed (see [docs/dev/invariants.md](docs/dev/invariants.md) Known Gaps) | | Cleanup (`cleanup_old_versions`) | ✅ | `omnigraph cleanup` with `--keep` / `--older-than` policy | | BTREE / inverted (FTS) / vector indexes | ✅ | `ensure_indices` builds them on every relevant column; idempotent; lazy across branches | | `merge_insert` upsert | ✅ | `LoadMode::Merge`, mutation `update`/`insert`/`delete` lowering | diff --git a/Cargo.lock b/Cargo.lock index a3d6d62..3223b9c 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -4543,7 +4543,7 @@ dependencies = [ [[package]] name = "omnigraph-cli" -version = "0.6.0" +version = "0.6.1" dependencies = [ "assert_cmd", "clap", @@ -4565,7 +4565,7 @@ dependencies = [ [[package]] name = "omnigraph-compiler" -version = "0.6.0" +version = "0.6.1" dependencies = [ "ahash", "arrow-array", @@ -4586,7 +4586,7 @@ dependencies = [ [[package]] name = "omnigraph-engine" -version = "0.6.0" +version = "0.6.1" dependencies = [ "arc-swap", "arrow-array", @@ -4627,7 +4627,7 @@ dependencies = [ [[package]] name = "omnigraph-policy" -version = "0.6.0" +version = "0.6.1" dependencies = [ "cedar-policy", "clap", @@ -4640,7 +4640,7 @@ dependencies = [ [[package]] name = "omnigraph-server" -version = "0.6.0" +version = "0.6.1" dependencies = [ "arc-swap", "async-trait", diff --git a/crates/omnigraph-cli/Cargo.toml b/crates/omnigraph-cli/Cargo.toml index 0d35ed8..641068e 100644 --- a/crates/omnigraph-cli/Cargo.toml +++ b/crates/omnigraph-cli/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "omnigraph-cli" -version = "0.6.0" +version = "0.6.1" edition = "2024" description = "CLI for the Omnigraph graph database." license = "MIT" @@ -13,10 +13,10 @@ name = "omnigraph" path = "src/main.rs" [dependencies] -omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.6.0" } -omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.0" } -omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.0" } -omnigraph-server = { path = "../omnigraph-server", version = "0.6.0" } +omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.6.1" } +omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.1" } +omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.1" } +omnigraph-server = { path = "../omnigraph-server", version = "0.6.1" } clap = { workspace = true } color-eyre = { workspace = true } serde = { workspace = true } diff --git a/crates/omnigraph-cli/src/main.rs b/crates/omnigraph-cli/src/main.rs index 879f070..29b55c4 100644 --- a/crates/omnigraph-cli/src/main.rs +++ b/crates/omnigraph-cli/src/main.rs @@ -3011,18 +3011,19 @@ async fn main() -> Result<()> { "fragments_removed": s.fragments_removed, "fragments_added": s.fragments_added, "committed": s.committed, + "skipped": s.skipped.map(|r| r.as_str()), })).collect::>(), }); print_json(&value)?; } else { println!("optimize {} — {} tables", uri, stats.len()); for s in &stats { - if s.committed { + if let Some(reason) = s.skipped { + println!(" {:<40} skipped ({reason})", s.table_key); + } else if s.committed { println!( " {:<40} frags {} → {} ✓", - s.table_key, - s.fragments_removed + s.fragments_added - s.fragments_added, - s.fragments_added + s.table_key, s.fragments_removed, s.fragments_added ); } else { println!(" {:<40} no-op", s.table_key); diff --git a/crates/omnigraph-compiler/Cargo.toml b/crates/omnigraph-compiler/Cargo.toml index 229b862..545db83 100644 --- a/crates/omnigraph-compiler/Cargo.toml +++ b/crates/omnigraph-compiler/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "omnigraph-compiler" -version = "0.6.0" +version = "0.6.1" edition = "2024" description = "Schema/query compiler for Omnigraph. Zero Lance dependency." license = "MIT" diff --git a/crates/omnigraph-policy/Cargo.toml b/crates/omnigraph-policy/Cargo.toml index dacda35..3d14fc5 100644 --- a/crates/omnigraph-policy/Cargo.toml +++ b/crates/omnigraph-policy/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "omnigraph-policy" -version = "0.6.0" +version = "0.6.1" edition = "2024" description = "Policy / authorization layer for Omnigraph — Cedar-backed PolicyEngine, PolicyChecker trait, ResourceScope enum." license = "MIT" diff --git a/crates/omnigraph-server/Cargo.toml b/crates/omnigraph-server/Cargo.toml index e9a0e46..5994aa1 100644 --- a/crates/omnigraph-server/Cargo.toml +++ b/crates/omnigraph-server/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "omnigraph-server" -version = "0.6.0" +version = "0.6.1" edition = "2024" description = "HTTP server for the Omnigraph graph database." license = "MIT" @@ -19,9 +19,9 @@ default = [] aws = ["dep:aws-config", "dep:aws-sdk-secretsmanager"] [dependencies] -omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.6.0" } -omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.0" } -omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.0" } +omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.6.1" } +omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.1" } +omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.1" } axum = { workspace = true } clap = { workspace = true } color-eyre = { workspace = true } diff --git a/crates/omnigraph/Cargo.toml b/crates/omnigraph/Cargo.toml index 1fa3436..70f51d8 100644 --- a/crates/omnigraph/Cargo.toml +++ b/crates/omnigraph/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "omnigraph-engine" -version = "0.6.0" +version = "0.6.1" edition = "2024" description = "Runtime engine for the Omnigraph graph database." license = "MIT" @@ -16,8 +16,8 @@ default = [] failpoints = ["dep:fail", "fail/failpoints"] [dependencies] -omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.0" } -omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.0" } +omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.1" } +omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.1" } lance = { workspace = true } lance-datafusion = { workspace = true } datafusion = { workspace = true } @@ -51,7 +51,7 @@ chrono = { workspace = true } arc-swap = { workspace = true } [dev-dependencies] -omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.0" } +omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.1" } tokio = { workspace = true } lance-namespace-impls = { workspace = true } serial_test = "3" diff --git a/crates/omnigraph/src/db/mod.rs b/crates/omnigraph/src/db/mod.rs index d0b292f..8702f88 100644 --- a/crates/omnigraph/src/db/mod.rs +++ b/crates/omnigraph/src/db/mod.rs @@ -13,7 +13,7 @@ pub use manifest::{Snapshot, SubTableEntry, SubTableUpdate}; pub(crate) use omnigraph::ensure_public_branch_ref; pub use omnigraph::{ CleanupPolicyOptions, InitOptions, MergeOutcome, Omnigraph, OpenMode, SchemaApplyOptions, - SchemaApplyResult, TableCleanupStats, TableOptimizeStats, + SchemaApplyResult, SkipReason, TableCleanupStats, TableOptimizeStats, }; pub(crate) use run_registry::is_internal_run_branch; diff --git a/crates/omnigraph/src/db/omnigraph.rs b/crates/omnigraph/src/db/omnigraph.rs index 9d1403d..7b8a3f6 100644 --- a/crates/omnigraph/src/db/omnigraph.rs +++ b/crates/omnigraph/src/db/omnigraph.rs @@ -33,7 +33,7 @@ mod optimize; mod schema_apply; mod table_ops; -pub use optimize::{CleanupPolicyOptions, TableCleanupStats, TableOptimizeStats}; +pub use optimize::{CleanupPolicyOptions, SkipReason, TableCleanupStats, TableOptimizeStats}; pub use schema_apply::SchemaApplyOptions; use super::commit_graph::GraphCommit; diff --git a/crates/omnigraph/src/db/omnigraph/optimize.rs b/crates/omnigraph/src/db/omnigraph/optimize.rs index c703836..fff3f54 100644 --- a/crates/omnigraph/src/db/omnigraph/optimize.rs +++ b/crates/omnigraph/src/db/omnigraph/optimize.rs @@ -40,6 +40,20 @@ fn maint_concurrency() -> usize { .unwrap_or(DEFAULT_MAINT_CONCURRENCY) } +/// Whether the installed Lance can compact a dataset that contains blob +/// columns. `false` today: Lance `compact_files` forces +/// `BlobHandling::AllBinary` on the read side, and the blob-v2 struct decoder +/// mis-counts columns ("there were more fields in the schema than provided +/// column indices"), failing even a pristine uniform-V2_2 multi-fragment blob +/// table. Reads are unaffected (queries use descriptor handling). +/// +/// While `false`, [`optimize_all_tables`] skips blob-bearing tables and reports +/// [`SkipReason::BlobColumnsUnsupportedByLance`] instead of aborting the whole +/// sweep. Flip to `true` once the upstream Lance fix ships — the +/// `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns` guard +/// turns red on that bump and forces this flip. Tracked in `docs/dev/lance.md`. +const LANCE_SUPPORTS_BLOB_COMPACTION: bool = false; + /// Retention knobs for [`cleanup_all_tables`]. At least one must be set or /// nothing is cleaned. If both are set, Lance applies them as AND (a manifest /// is kept if it satisfies either — i.e. only manifests older than BOTH the @@ -52,8 +66,45 @@ pub struct CleanupPolicyOptions { pub older_than: Option, } -/// Per-table outcome of `optimize_all_tables`. +/// Why `optimize` did not compact a table. Typed so callers branch on the +/// reason rather than sniffing a string. One variant today, gated by +/// [`LANCE_SUPPORTS_BLOB_COMPACTION`]. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +#[non_exhaustive] +pub enum SkipReason { + /// The table has one or more `Blob` columns. Lance `compact_files` forces + /// `BlobHandling::AllBinary`, which mis-decodes blob-v2 columns; see + /// [`LANCE_SUPPORTS_BLOB_COMPACTION`] and `docs/dev/lance.md`. + BlobColumnsUnsupportedByLance, +} + +impl SkipReason { + /// Stable machine-readable token for serialized output (e.g. CLI `--json`). + /// Once emitted this is part of the output contract — keep it stable. + pub fn as_str(&self) -> &'static str { + match self { + SkipReason::BlobColumnsUnsupportedByLance => "blob_columns_unsupported_by_lance", + } + } +} + +impl std::fmt::Display for SkipReason { + /// Human-readable reason for CLI and log output. + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + let msg = match self { + SkipReason::BlobColumnsUnsupportedByLance => { + "blob columns — Lance compaction unsupported" + } + }; + f.write_str(msg) + } +} + +/// Per-table outcome of `optimize_all_tables`. This is a returned result type, +/// not built by callers, so it is `#[non_exhaustive]`: future fields stay +/// non-breaking and downstream code reads fields rather than constructing it. #[derive(Debug, Clone)] +#[non_exhaustive] pub struct TableOptimizeStats { pub table_key: String, /// Number of source fragments that were rewritten by Lance. @@ -62,6 +113,33 @@ pub struct TableOptimizeStats { pub fragments_added: usize, /// Did this table get a new Lance manifest version from the compaction? pub committed: bool, + /// `Some(reason)` if this table was deliberately not compacted. When set, + /// `fragments_removed == 0`, `fragments_added == 0`, and `!committed`. + pub skipped: Option, +} + +impl TableOptimizeStats { + /// Stat for a table that Lance actually compacted. + fn compacted(table_key: String, metrics: &CompactionMetrics, committed: bool) -> Self { + Self { + table_key, + fragments_removed: metrics.fragments_removed, + fragments_added: metrics.fragments_added, + committed, + skipped: None, + } + } + + /// Stat for a table that was deliberately skipped (compaction not attempted). + fn skipped(table_key: String, reason: SkipReason) -> Self { + Self { + table_key, + fragments_removed: 0, + fragments_added: 0, + committed: false, + skipped: Some(reason), + } + } } /// Per-table outcome of `cleanup_all_tables`. `error` is `Some` when this @@ -84,14 +162,21 @@ pub async fn optimize_all_tables(db: &Omnigraph) -> Result = all_table_keys(&db.catalog()) - .into_iter() - .filter_map(|table_key| { - let entry = snapshot.entry(&table_key)?; + // Compute per-table state (path + whether it has blob columns) up front, in + // a scope that drops the catalog handle before the async stream starts. + let table_tasks: Vec<(String, String, bool)> = { + let catalog = db.catalog(); + let mut tasks = Vec::new(); + for table_key in all_table_keys(&catalog) { + let Some(entry) = snapshot.entry(&table_key) else { + continue; + }; let full_path = format!("{}/{}", db.root_uri, entry.table_path); - Some((table_key, full_path)) - }) - .collect(); + let has_blob = !blob_properties_for_table_key(&catalog, &table_key)?.is_empty(); + tasks.push((table_key, full_path, has_blob)); + } + tasks + }; if table_tasks.is_empty() { return Ok(Vec::new()); @@ -101,7 +186,24 @@ pub async fn optimize_all_tables(db: &Omnigraph) -> Result> = futures::stream::iter(table_tasks.into_iter()) - .map(|(table_key, full_path)| async move { + .map(|(table_key, full_path, has_blob)| async move { + // Lance `compact_files` mis-decodes blob-v2 columns under the forced + // `BlobHandling::AllBinary` read (see LANCE_SUPPORTS_BLOB_COMPACTION). + // Skip blob-bearing tables and report it rather than aborting the + // whole sweep — the other tables still compact. + if has_blob && !LANCE_SUPPORTS_BLOB_COMPACTION { + tracing::warn!( + target: "omnigraph::optimize", + table = %table_key, + "skipping compaction: table has blob columns the current Lance \ + cannot rewrite (blob-v2 AllBinary decode bug); other tables \ + unaffected — rerun after the Lance fix", + ); + return Ok(TableOptimizeStats::skipped( + table_key, + SkipReason::BlobColumnsUnsupportedByLance, + )); + } let mut ds = table_store .open_dataset_head_for_write(&table_key, &full_path, None) .await?; @@ -111,12 +213,11 @@ pub async fn optimize_all_tables(db: &Omnigraph) -> Result RecordBatch { + let ids: Vec = (start..start + n).map(|i| format!("n{i}")).collect(); + let data = + LargeBinaryArray::from_iter_values((start..start + n).map(|i| format!("blob{i}"))); + let blob_uri = StringArray::from(vec![None::<&str>; n as usize]); + let DataType::Struct(fields) = lance::blob::blob_field("content", true).data_type().clone() + else { + unreachable!("blob_field is always a Struct"); + }; + let content = StructArray::new( + fields, + vec![Arc::new(data) as _, Arc::new(blob_uri) as _], + None, + ); + let schema = Arc::new(Schema::new(vec![ + Field::new("id", DataType::Utf8, false), + lance::blob::blob_field("content", true), + ])); + RecordBatch::try_new( + schema, + vec![Arc::new(StringArray::from(ids)) as _, Arc::new(content) as _], + ) + .unwrap() + } + + async fn write(uri: &str, batch: RecordBatch, mode: WriteMode) { + let schema = batch.schema(); + let reader = RecordBatchIterator::new(vec![Ok(batch)], schema); + // Blob v2 requires file version >= 2.2; without the pin the *write* + // would fail with a different error, masking the guard's intent. + let params = WriteParams { + mode, + enable_stable_row_ids: true, + data_storage_version: Some(LanceFileVersion::V2_2), + ..Default::default() + }; + Dataset::write(reader, uri, Some(params)).await.unwrap(); + } + + let dir = tempfile::tempdir().unwrap(); + let uri = dir.path().join("guard10-blob.lance"); + let uri = uri.to_str().unwrap(); + + // Uniform V2_2, two fragments → forces compaction to actually rewrite. + write(uri, blob_batch(0, 2), WriteMode::Create).await; + write(uri, blob_batch(100, 2), WriteMode::Append).await; + + let mut ds = Dataset::open(uri).await.unwrap(); + assert!( + ds.get_fragments().len() >= 2, + "guard needs a multi-fragment table to trigger a real compaction rewrite" + ); + + let result = compact_files(&mut ds, CompactionOptions::default(), None).await; + let err = result.expect_err( + "compact_files unexpectedly SUCCEEDED on a blob table — the Lance blob-v2 \ + compaction bug is fixed. Flip LANCE_SUPPORTS_BLOB_COMPACTION to true in \ + db/omnigraph/optimize.rs, remove the blob-skip branch, and re-pin docs/dev/lance.md.", + ); + assert!( + err.to_string() + .contains("more fields in the schema than provided column indices"), + "blob compaction failed with an unexpected error (Lance internals may have \ + shifted): {err}" + ); +} diff --git a/crates/omnigraph/tests/maintenance.rs b/crates/omnigraph/tests/maintenance.rs index 722bdc4..3e61677 100644 --- a/crates/omnigraph/tests/maintenance.rs +++ b/crates/omnigraph/tests/maintenance.rs @@ -8,7 +8,7 @@ mod helpers; use std::time::Duration; use lance::Dataset; -use omnigraph::db::{CleanupPolicyOptions, Omnigraph}; +use omnigraph::db::{CleanupPolicyOptions, Omnigraph, SkipReason}; use omnigraph::loader::{LoadMode, load_jsonl}; use helpers::{TEST_DATA, TEST_SCHEMA, count_rows, init_and_load}; @@ -72,6 +72,97 @@ async fn optimize_after_load_then_again_is_idempotent() { } } +// Regression: `optimize` must not crash on a graph that has a `Blob` table. +// +// Lance `compact_files` forces `BlobHandling::AllBinary`, which mis-decodes +// blob-v2 columns ("more fields in the schema than provided column indices"), +// failing even a pristine uniform-V2_2 multi-fragment blob table. `optimize` +// must skip blob-bearing tables (and report the skip) rather than aborting the +// whole sweep. +// +// Before the skip fix, `optimize()` returned that Lance error here and aborted +// the whole sweep; it now skips the blob table (`doc.skipped == Some(..)`) +// while the sibling non-blob `Tag` table still compacts. The skip is gated by +// `LANCE_SUPPORTS_BLOB_COMPACTION`; the surface guard +// `compact_files_still_fails_on_blob_columns` flags when the upstream Lance fix +// makes the skip (and this test's blob arm) removable. +#[tokio::test] +async fn optimize_skips_blob_table_and_reports_skip() { + let dir = tempfile::tempdir().unwrap(); + let uri = dir.path().to_str().unwrap(); + // One Blob node type (`Doc`) + one plain node type (`Tag`): proves the blob + // table is skipped while a non-blob table in the same sweep still compacts. + let schema = "\ +node Doc {\n slug: String @key\n content: Blob\n}\n\ +node Tag {\n slug: String @key\n}\n"; + let mut db = Omnigraph::init(uri, schema).await.unwrap(); + + // Multi-fragment blob table: Overwrite creates fragment 1; each Merge of + // new keys appends another. A >=2-fragment blob table is exactly what + // crashes `compact_files` today (single fragment would no-op and not crash). + load_jsonl( + &mut db, + "{\"type\":\"Doc\",\"data\":{\"slug\":\"d1\",\"content\":\"base64:aGVsbG8x\"}}\n{\"type\":\"Doc\",\"data\":{\"slug\":\"d2\",\"content\":\"base64:aGVsbG8y\"}}", + LoadMode::Overwrite, + ) + .await + .unwrap(); + load_jsonl( + &mut db, + "{\"type\":\"Doc\",\"data\":{\"slug\":\"d3\",\"content\":\"base64:aGVsbG8z\"}}", + LoadMode::Merge, + ) + .await + .unwrap(); + load_jsonl( + &mut db, + "{\"type\":\"Doc\",\"data\":{\"slug\":\"d4\",\"content\":\"base64:aGVsbG80\"}}", + LoadMode::Merge, + ) + .await + .unwrap(); + // Plain table, also multi-fragment so it has something to compact. + load_jsonl( + &mut db, + "{\"type\":\"Tag\",\"data\":{\"slug\":\"t1\"}}\n{\"type\":\"Tag\",\"data\":{\"slug\":\"t2\"}}", + LoadMode::Merge, + ) + .await + .unwrap(); + load_jsonl( + &mut db, + "{\"type\":\"Tag\",\"data\":{\"slug\":\"t3\"}}", + LoadMode::Merge, + ) + .await + .unwrap(); + + let stats = db + .optimize() + .await + .expect("optimize must not crash on a graph with a Blob table"); + + let doc = stats + .iter() + .find(|s| s.table_key == "node:Doc") + .expect("Doc stat present"); + let tag = stats + .iter() + .find(|s| s.table_key == "node:Tag") + .expect("Tag stat present"); + // The blob table is skipped (and reported), not compacted. + assert_eq!( + doc.skipped, + Some(SkipReason::BlobColumnsUnsupportedByLance), + "blob table must be reported as skipped", + ); + assert!(!doc.committed, "skipped blob table is not compacted"); + assert_eq!(doc.fragments_removed, 0); + assert_eq!(doc.fragments_added, 0); + // The plain (non-blob) table is unaffected by the skip. + assert_eq!(tag.skipped, None, "non-blob table must not be skipped"); +} + #[tokio::test] async fn cleanup_without_any_policy_option_errors() { let dir = tempfile::tempdir().unwrap(); diff --git a/docs/dev/invariants.md b/docs/dev/invariants.md index 0cf295c..5ee4f17 100644 --- a/docs/dev/invariants.md +++ b/docs/dev/invariants.md @@ -130,6 +130,15 @@ them explicit. - **Deletes and vector indexes:** `delete_where` and vector index creation still advance Lance HEAD inline because the required public Lance APIs are missing. Keep D2 and recovery coverage in place until those residuals are removed. +- **Blob-column compaction:** Lance `compact_files` mis-decodes blob-v2 columns + under its forced `BlobHandling::AllBinary` read ("more fields in the schema + than provided column indices"), so `optimize` skips any table with a `Blob` + property — reporting `SkipReason::BlobColumnsUnsupportedByLance` (loud, not a + silent drop) behind the `LANCE_SUPPORTS_BLOB_COMPACTION` gate. Reads and writes + are unaffected; only space/fragment reclamation on blob tables is deferred. + Remove the skip when the upstream Lance fix lands — the + `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns` guard + turns red on that bump to force it. - **Planner capability/stat surfaces:** cost-aware planning, complete capability advertisement, and explain-with-cost are roadmap. Do not describe them as implemented. diff --git a/docs/dev/lance.md b/docs/dev/lance.md index 100da6f..9d2b990 100644 --- a/docs/dev/lance.md +++ b/docs/dev/lance.md @@ -176,7 +176,8 @@ Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53, - **Lance #6666 still open** (`build_index_metadata_from_segments` public): vector-index two-phase blocked; inline `create_vector_index` residual retained. - **Lance #6877 still open** (`MergeInsertBuilder` dup-rowid): PR #109's `SourceDedupeBehavior::FirstSeen` + `check_batch_unique_by_keys` precondition stay load-bearing. - **`Dataset::force_delete_branch`** (`branches().delete(name, force=true)`, dataset.rs:524) tolerates a missing branch-*contents* ref (vs plain `delete_branch`'s `RefNotFound`), but on the local store still errors `NotFound` if the branch `tree/` directory is fully absent (`remove_dir_all`'s NotFound is not caught for Lance's native error variant, refs.rs:526-549). Both variants still refuse a branch with referencing descendants (`RefConflict`). `TableStore::force_delete_branch` wraps this to be fully idempotent (tolerates already-absent). The single-authority branch-delete redesign uses it for orphan reclamation (eager best-effort reclaim + cleanup reconciler). Pinned by `lance_surface_guards.rs::force_delete_branch_semantics`. Branch delete is "flip the ref atomically, then `remove_dir_all(tree/{branch})`"; branch-exclusive data lives under `tree/{branch}/` so a drop reclaims it immediately without touching `main`. +- **Lance blob-v2 `compact_files` bug** (no public issue found as of 2026-06): `compact_files` disables binary-copy for blob datasets and forces `BlobHandling::AllBinary` on the read side; the v2.1+ structural decoder then mis-counts column infos for the blob-v2 struct and fails with `Invalid user input: there were more fields in the schema than provided column indices / infos` (`lance-encoding/src/decoder.rs::ColumnInfoIter::expect_next`). This fails even a pristine uniform-V2_2 multi-fragment blob table; vector/list/scalar/ragged columns and mixed file versions all compact fine. Reads/queries use descriptor handling (`BlobHandling::default()`) and are unaffected. `optimize` skips blob-bearing tables behind `LANCE_SUPPORTS_BLOB_COMPACTION = false` (`db/omnigraph/optimize.rs`), reporting `SkipReason::BlobColumnsUnsupportedByLance`. Pinned by `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns`, which turns red when the bug is fixed → flip the gate, remove the skip branch + the `maintenance.rs::optimize_skips_blob_table_and_reports_skip` skip assertions. -Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (9 named guards; 4 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension). +Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (10 named guards; 5 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension). Bump this date stanza on the next alignment pass. diff --git a/docs/releases/v0.6.1.md b/docs/releases/v0.6.1.md new file mode 100644 index 0000000..aafe1af --- /dev/null +++ b/docs/releases/v0.6.1.md @@ -0,0 +1,26 @@ +# Omnigraph v0.6.1 + +v0.6.1 focuses on operational polish after v0.6.0: stored-query registries, safer branch cleanup, more complete release artifacts, and a Lance blob-compaction workaround. + +## Highlights + +- **Stored-query registries.** `omnigraph.yaml` can declare curated `queries:` blocks per graph. Servers load and type-check them at startup, `omnigraph queries validate` checks them offline, `omnigraph queries list` shows exposed queries and typed params, `GET /queries` exposes a typed catalog, and `POST /queries/{name}` invokes a stored query without accepting ad hoc `.gq` source from the client. +- **Stored-query policy gate.** New Cedar action `invoke_query` gates the stored-query invocation surface. Stored mutations are double-gated: `invoke_query` to reach the stored query and `change` for the actual write. +- **Safer branch deletion.** `branch_delete` now treats the manifest as the authority, flips branch visibility atomically, and reclaims per-table/commit-graph forks as derived state. If best-effort reclaim is interrupted, `cleanup` reconciles orphaned forks; reusing a branch name before cleanup reports an actionable error. +- **Blob-safe optimize.** `omnigraph optimize` skips tables with `Blob` properties instead of failing the whole sweep on Lance's blob-v2 compaction decode bug. Skips are visible in human output, `--json` as `skipped`, `TableOptimizeStats.skipped`, and logs; non-blob tables still compact normally. +- **Deployment improvements.** The container entrypoint now composes `OMNIGRAPH_TARGET_URI` with `OMNIGRAPH_CONFIG`, so operators can keep the graph URI in env while loading policy/query config from a mounted file. The local RustFS bootstrap pins RustFS beta.3 and allows the current insecure local-dev default credentials. +- **Windows release support.** Tagged and edge releases now publish Windows x86_64 archives containing `omnigraph.exe` and `omnigraph-server.exe`, with a PowerShell installer and Windows install docs. +- **Release tooling.** Homebrew formula generation was tightened to produce audit-clean formulas. + +## Compatibility Notes + +- A graph selected by name (`--target` or `server.graph`) now uses `graphs..policy` and `graphs..queries`. Top-level `policy` / `queries` blocks are only for anonymous bare-URI single-graph mode; using them with a named graph now fails loudly with migration guidance. +- `mcp.expose` defaults to `true` for stored-query registry entries. Set `mcp: { expose: false }` for service-only queries that should not appear in the catalog. +- `invoke_query` is graph-scoped, not branch-scoped. Branch/snapshot access remains enforced by the inner `read` / `change` gate. +- Blob tables are not compacted until the upstream Lance fix lands, so fragment count and deleted-row space on blob tables are not reclaimed by `optimize`. Reads, writes, and query results are unaffected; no on-disk migration is required. +- `TableOptimizeStats` is now `#[non_exhaustive]` and gains a `skipped: Option` field (so does the new `SkipReason` enum). This is a source-level change only for downstream code that built this returned result struct by literal — rare, since it is produced by `optimize` and consumed by reading its fields; field access is unaffected, and `#[non_exhaustive]` keeps future additions non-breaking. + +## Docs And Cleanup + +- Public docs were updated for stored queries, policy, server routes, deployment, Windows installation, branch deletion, maintenance, and the `runs` docs rename to `writes`. +- README copy and release documentation were refreshed; older release notes had small typo/wording fixes. diff --git a/docs/user/cli-reference.md b/docs/user/cli-reference.md index 019e4ad..8263919 100644 --- a/docs/user/cli-reference.md +++ b/docs/user/cli-reference.md @@ -21,7 +21,7 @@ A reference for the `omnigraph` binary's command surface and `omnigraph.yaml` sc | `schema plan \| apply \| show (alias: get)` | migrations | | `lint` (alias: `check`) | offline / graph-backed query validation. Replaces `query lint` / `query check`, which are kept as deprecated argv-level shims that print a one-line warning and rewrite to `omnigraph lint` | | `queries validate \| list` | operate on the server-side stored-query registry (the `queries:` block). `validate` type-checks every stored query against the live schema offline (opens the selected graph; exits non-zero on any breakage), catching schema drift without restarting the server; `list` prints the selected registry's query names, MCP exposure, and typed params. For per-graph registries, pass `--target ` or set `cli.graph`; with no graph selection, `list` shows only top-level `queries:`. Distinct from `lint`, which validates a single `.gq` file | -| `optimize` | non-destructive Lance compaction | +| `optimize` | non-destructive Lance compaction (skips tables with `Blob` columns; `--json` reports a `skipped` field) | | `cleanup --keep N --older-than 7d --confirm` | destructive version GC | | `embed` | offline JSONL embedding pipeline | | `policy validate \| test \| explain` | Cedar tooling. Selects `cli.graph`, else `server.graph`, else top-level `policy.file` | diff --git a/docs/user/constants.md b/docs/user/constants.md index 527aaea..8f13555 100644 --- a/docs/user/constants.md +++ b/docs/user/constants.md @@ -11,6 +11,7 @@ | Internal manifest schema version | `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` | `db/manifest/migrations.rs` | | Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | `exec/merge.rs` | | Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | `db/omnigraph/optimize.rs` | +| Lance blob compaction support | `LANCE_SUPPORTS_BLOB_COMPACTION = false` | `db/omnigraph/optimize.rs` | | Graph index cache size | `8` (LRU) | `runtime_cache.rs` | | Default body limit | `1 MB` | `omnigraph-server/lib.rs` | | Ingest body limit | `32 MB` | `omnigraph-server/lib.rs` | diff --git a/docs/user/maintenance.md b/docs/user/maintenance.md index 9839ea1..3628fa0 100644 --- a/docs/user/maintenance.md +++ b/docs/user/maintenance.md @@ -7,7 +7,8 @@ - Lance `compact_files()` on every node + edge table on `main`. - Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests. - Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8). -- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed }]`. +- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed, skipped }]`. +- **Blob tables are skipped.** A table that declares any `Blob` property is not compacted: it is reported with `skipped: Some(BlobColumnsUnsupportedByLance)` (and logged via `tracing::warn`) instead of compacted, and the rest of the sweep proceeds normally. The current Lance `compact_files` mis-decodes blob-v2 columns under its forced `BlobHandling::AllBinary` read; **reads and writes are unaffected** — only compaction is. This is gated by `LANCE_SUPPORTS_BLOB_COMPACTION` (`db/omnigraph/optimize.rs`) and removed when the upstream Lance fix lands (see [docs/dev/lance.md](../dev/lance.md)). Consequence: fragment count and deleted-row space on blob tables are not reclaimed until then; query results are never affected. ## `cleanup_all_tables(db, options)` — destructive diff --git a/openapi.json b/openapi.json index 08d39c4..aced64d 100644 --- a/openapi.json +++ b/openapi.json @@ -7,7 +7,7 @@ "name": "MIT", "identifier": "MIT" }, - "version": "0.6.0" + "version": "0.6.1" }, "paths": { "/branches": {