mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-09 01:35:18 +02:00
fix(optimize): skip blob-bearing tables to avoid Lance compaction crash (#138)
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
CI / Container Entrypoint (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / Test Windows release binaries (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled
Release Edge / Smoke Windows installer (push) Has been cancelled
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
CI / Container Entrypoint (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / Test Windows release binaries (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled
Release Edge / Smoke Windows installer (push) Has been cancelled
* test(optimize): pin Lance blob-column compaction failure as a surface guard
Lance compact_files mis-decodes blob-v2 columns under its forced BlobHandling::AllBinary read ("more fields in the schema than provided column indices"), failing even a pristine uniform-V2_2 multi-fragment blob table; reads use descriptor handling and are unaffected.
Guard 10 reproduces this and is self-retiring: it turns red on the Lance bump that fixes the bug, forcing LANCE_SUPPORTS_BLOB_COMPACTION to flip.
* fix(optimize): skip blob-bearing tables instead of crashing compaction
omnigraph optimize aborted the whole sweep when any node/edge table had a Blob property: Lance compact_files cannot decode blob-v2 columns under AllBinary (the column-index error pinned by the surface guard). Skip blob-bearing tables behind a LANCE_SUPPORTS_BLOB_COMPACTION gate and report them via TableOptimizeStats.skipped / SkipReason (surfaced in the CLI and a tracing::warn) instead of erroring, which also isolates the failure so the other tables still compact.
Reads/writes are unaffected; only fragment/space reclamation on blob tables is deferred until the upstream Lance fix. Adds a maintenance.rs regression test (validated red with the column-index symptom before the fix, green after), a concise v0.6.1 release note, and updates docs (maintenance, cli-reference, AGENTS capability matrix, invariants Known Gaps, lance.md audit, constants).
* refactor(optimize): make TableOptimizeStats and SkipReason non_exhaustive
Both are returned result types, never built by callers, so #[non_exhaustive] makes this the last field/variant addition that can break downstream literal construction and keeps future ones non-breaking (review feedback on the public-field addition). The v0.6.1 Compatibility Notes call out the source-level change.
Also drops the now-stale "RED today / GREEN after the fix lands" narration in the optimize_skips_blob_table_and_reports_skip test (historical regression context now that the fix is in this branch), and folds in the expanded v0.6.1 release note.
* chore(release): bump workspace to v0.6.1
Coherent version bump to accompany the v0.6.1 release note: all five crate manifests + path-dependency constraints, Cargo.lock, the AGENTS.md surveyed-version line, and openapi.json info.version move 0.6.0 -> 0.6.1. Matches the established release pattern (#118 landed the v0.6.0 note + bump together) and resolves the Codex/Devin review flag that a v0.6.1 note without a bump leaves CARGO_PKG_VERSION reporting 0.6.0 and mixed package versions.
This commit is contained in:
parent
3c2b1b8051
commit
d54bccb940
20 changed files with 363 additions and 47 deletions
|
|
@ -16,7 +16,7 @@ Tools that support `@`-imports (Claude Code) auto-include all three files via th
|
|||
|
||||
`CLAUDE.md` is a symlink to this file — there is exactly one source of truth. Edit `AGENTS.md`.
|
||||
|
||||
**Version surveyed:** 0.6.0
|
||||
**Version surveyed:** 0.6.1
|
||||
**Workspace crates:** `omnigraph-compiler`, `omnigraph` (engine), `omnigraph-policy`, `omnigraph-cli`, `omnigraph-server`
|
||||
**Storage substrate:** Lance 6.x (columnar, versioned, branchable)
|
||||
**License:** MIT
|
||||
|
|
@ -237,7 +237,7 @@ omnigraph policy explain --actor act-alice --action change --branch main
|
|||
| Per-dataset versioning + time travel | ✅ | `snapshot_at_version`, `entity_at`, snapshot-pinned reads across many tables |
|
||||
| Per-dataset branches | ✅ | **Graph-level** branches (atomic across all sub-tables), lazy fork, system branch filtering |
|
||||
| Atomic single-dataset commits | ✅ | **Multi-table publish via three layers**, NOT a single Lance primitive: (1) per-table Lance `commit_staged` for the data write, (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for cross-table ordering, (3) the open-time recovery sweep for the residual gap between (1) and (2). All three layers ship; the four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) write a `__recovery/{ulid}.json` sidecar before Phase B and delete it after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the sweep in `db/manifest/recovery.rs`: classify, decide all-or-nothing per sidecar, roll forward via single `ManifestBatchPublisher::publish` or roll back via `Dataset::restore`, and record an audit row in `_graph_commit_recoveries.lance` (queryable via `omnigraph commit list --filter actor=omnigraph:recovery`). Continuous in-process recovery (no restart needed between Phase B failure and recovery) is the goal of a future background reconciler. Engine writes route through a sealed `TableStorage` trait exposing `stage_*` + `commit_staged` as the canonical staged-write surface; documented inline-commit residuals (`delete_where`, `create_vector_index`, plus legacy `append_batch` / `merge_insert_batches` / `overwrite_batch` / `create_*_index`) remain on the trait until upstream Lance ships a public two-phase API ([#6658](https://github.com/lance-format/lance/issues/6658), [#6666](https://github.com/lance-format/lance/issues/6666)) and the migration of every call site completes. |
|
||||
| Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency |
|
||||
| Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency; **skips blob-bearing tables** (reported via `TableOptimizeStats.skipped`, not silent), gated on `LANCE_SUPPORTS_BLOB_COMPACTION` until the upstream blob-v2 compaction-decode bug is fixed (see [docs/dev/invariants.md](docs/dev/invariants.md) Known Gaps) |
|
||||
| Cleanup (`cleanup_old_versions`) | ✅ | `omnigraph cleanup` with `--keep` / `--older-than` policy |
|
||||
| BTREE / inverted (FTS) / vector indexes | ✅ | `ensure_indices` builds them on every relevant column; idempotent; lazy across branches |
|
||||
| `merge_insert` upsert | ✅ | `LoadMode::Merge`, mutation `update`/`insert`/`delete` lowering |
|
||||
|
|
|
|||
10
Cargo.lock
generated
10
Cargo.lock
generated
|
|
@ -4543,7 +4543,7 @@ dependencies = [
|
|||
|
||||
[[package]]
|
||||
name = "omnigraph-cli"
|
||||
version = "0.6.0"
|
||||
version = "0.6.1"
|
||||
dependencies = [
|
||||
"assert_cmd",
|
||||
"clap",
|
||||
|
|
@ -4565,7 +4565,7 @@ dependencies = [
|
|||
|
||||
[[package]]
|
||||
name = "omnigraph-compiler"
|
||||
version = "0.6.0"
|
||||
version = "0.6.1"
|
||||
dependencies = [
|
||||
"ahash",
|
||||
"arrow-array",
|
||||
|
|
@ -4586,7 +4586,7 @@ dependencies = [
|
|||
|
||||
[[package]]
|
||||
name = "omnigraph-engine"
|
||||
version = "0.6.0"
|
||||
version = "0.6.1"
|
||||
dependencies = [
|
||||
"arc-swap",
|
||||
"arrow-array",
|
||||
|
|
@ -4627,7 +4627,7 @@ dependencies = [
|
|||
|
||||
[[package]]
|
||||
name = "omnigraph-policy"
|
||||
version = "0.6.0"
|
||||
version = "0.6.1"
|
||||
dependencies = [
|
||||
"cedar-policy",
|
||||
"clap",
|
||||
|
|
@ -4640,7 +4640,7 @@ dependencies = [
|
|||
|
||||
[[package]]
|
||||
name = "omnigraph-server"
|
||||
version = "0.6.0"
|
||||
version = "0.6.1"
|
||||
dependencies = [
|
||||
"arc-swap",
|
||||
"async-trait",
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
[package]
|
||||
name = "omnigraph-cli"
|
||||
version = "0.6.0"
|
||||
version = "0.6.1"
|
||||
edition = "2024"
|
||||
description = "CLI for the Omnigraph graph database."
|
||||
license = "MIT"
|
||||
|
|
@ -13,10 +13,10 @@ name = "omnigraph"
|
|||
path = "src/main.rs"
|
||||
|
||||
[dependencies]
|
||||
omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.6.0" }
|
||||
omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.0" }
|
||||
omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.0" }
|
||||
omnigraph-server = { path = "../omnigraph-server", version = "0.6.0" }
|
||||
omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.6.1" }
|
||||
omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.1" }
|
||||
omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.1" }
|
||||
omnigraph-server = { path = "../omnigraph-server", version = "0.6.1" }
|
||||
clap = { workspace = true }
|
||||
color-eyre = { workspace = true }
|
||||
serde = { workspace = true }
|
||||
|
|
|
|||
|
|
@ -3011,18 +3011,19 @@ async fn main() -> Result<()> {
|
|||
"fragments_removed": s.fragments_removed,
|
||||
"fragments_added": s.fragments_added,
|
||||
"committed": s.committed,
|
||||
"skipped": s.skipped.map(|r| r.as_str()),
|
||||
})).collect::<Vec<_>>(),
|
||||
});
|
||||
print_json(&value)?;
|
||||
} else {
|
||||
println!("optimize {} — {} tables", uri, stats.len());
|
||||
for s in &stats {
|
||||
if s.committed {
|
||||
if let Some(reason) = s.skipped {
|
||||
println!(" {:<40} skipped ({reason})", s.table_key);
|
||||
} else if s.committed {
|
||||
println!(
|
||||
" {:<40} frags {} → {} ✓",
|
||||
s.table_key,
|
||||
s.fragments_removed + s.fragments_added - s.fragments_added,
|
||||
s.fragments_added
|
||||
s.table_key, s.fragments_removed, s.fragments_added
|
||||
);
|
||||
} else {
|
||||
println!(" {:<40} no-op", s.table_key);
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
[package]
|
||||
name = "omnigraph-compiler"
|
||||
version = "0.6.0"
|
||||
version = "0.6.1"
|
||||
edition = "2024"
|
||||
description = "Schema/query compiler for Omnigraph. Zero Lance dependency."
|
||||
license = "MIT"
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
[package]
|
||||
name = "omnigraph-policy"
|
||||
version = "0.6.0"
|
||||
version = "0.6.1"
|
||||
edition = "2024"
|
||||
description = "Policy / authorization layer for Omnigraph — Cedar-backed PolicyEngine, PolicyChecker trait, ResourceScope enum."
|
||||
license = "MIT"
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
[package]
|
||||
name = "omnigraph-server"
|
||||
version = "0.6.0"
|
||||
version = "0.6.1"
|
||||
edition = "2024"
|
||||
description = "HTTP server for the Omnigraph graph database."
|
||||
license = "MIT"
|
||||
|
|
@ -19,9 +19,9 @@ default = []
|
|||
aws = ["dep:aws-config", "dep:aws-sdk-secretsmanager"]
|
||||
|
||||
[dependencies]
|
||||
omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.6.0" }
|
||||
omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.0" }
|
||||
omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.0" }
|
||||
omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.6.1" }
|
||||
omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.1" }
|
||||
omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.1" }
|
||||
axum = { workspace = true }
|
||||
clap = { workspace = true }
|
||||
color-eyre = { workspace = true }
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
[package]
|
||||
name = "omnigraph-engine"
|
||||
version = "0.6.0"
|
||||
version = "0.6.1"
|
||||
edition = "2024"
|
||||
description = "Runtime engine for the Omnigraph graph database."
|
||||
license = "MIT"
|
||||
|
|
@ -16,8 +16,8 @@ default = []
|
|||
failpoints = ["dep:fail", "fail/failpoints"]
|
||||
|
||||
[dependencies]
|
||||
omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.0" }
|
||||
omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.0" }
|
||||
omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.1" }
|
||||
omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.1" }
|
||||
lance = { workspace = true }
|
||||
lance-datafusion = { workspace = true }
|
||||
datafusion = { workspace = true }
|
||||
|
|
@ -51,7 +51,7 @@ chrono = { workspace = true }
|
|||
arc-swap = { workspace = true }
|
||||
|
||||
[dev-dependencies]
|
||||
omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.0" }
|
||||
omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.1" }
|
||||
tokio = { workspace = true }
|
||||
lance-namespace-impls = { workspace = true }
|
||||
serial_test = "3"
|
||||
|
|
|
|||
|
|
@ -13,7 +13,7 @@ pub use manifest::{Snapshot, SubTableEntry, SubTableUpdate};
|
|||
pub(crate) use omnigraph::ensure_public_branch_ref;
|
||||
pub use omnigraph::{
|
||||
CleanupPolicyOptions, InitOptions, MergeOutcome, Omnigraph, OpenMode, SchemaApplyOptions,
|
||||
SchemaApplyResult, TableCleanupStats, TableOptimizeStats,
|
||||
SchemaApplyResult, SkipReason, TableCleanupStats, TableOptimizeStats,
|
||||
};
|
||||
pub(crate) use run_registry::is_internal_run_branch;
|
||||
|
||||
|
|
|
|||
|
|
@ -33,7 +33,7 @@ mod optimize;
|
|||
mod schema_apply;
|
||||
mod table_ops;
|
||||
|
||||
pub use optimize::{CleanupPolicyOptions, TableCleanupStats, TableOptimizeStats};
|
||||
pub use optimize::{CleanupPolicyOptions, SkipReason, TableCleanupStats, TableOptimizeStats};
|
||||
pub use schema_apply::SchemaApplyOptions;
|
||||
|
||||
use super::commit_graph::GraphCommit;
|
||||
|
|
|
|||
|
|
@ -40,6 +40,20 @@ fn maint_concurrency() -> usize {
|
|||
.unwrap_or(DEFAULT_MAINT_CONCURRENCY)
|
||||
}
|
||||
|
||||
/// Whether the installed Lance can compact a dataset that contains blob
|
||||
/// columns. `false` today: Lance `compact_files` forces
|
||||
/// `BlobHandling::AllBinary` on the read side, and the blob-v2 struct decoder
|
||||
/// mis-counts columns ("there were more fields in the schema than provided
|
||||
/// column indices"), failing even a pristine uniform-V2_2 multi-fragment blob
|
||||
/// table. Reads are unaffected (queries use descriptor handling).
|
||||
///
|
||||
/// While `false`, [`optimize_all_tables`] skips blob-bearing tables and reports
|
||||
/// [`SkipReason::BlobColumnsUnsupportedByLance`] instead of aborting the whole
|
||||
/// sweep. Flip to `true` once the upstream Lance fix ships — the
|
||||
/// `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns` guard
|
||||
/// turns red on that bump and forces this flip. Tracked in `docs/dev/lance.md`.
|
||||
const LANCE_SUPPORTS_BLOB_COMPACTION: bool = false;
|
||||
|
||||
/// Retention knobs for [`cleanup_all_tables`]. At least one must be set or
|
||||
/// nothing is cleaned. If both are set, Lance applies them as AND (a manifest
|
||||
/// is kept if it satisfies either — i.e. only manifests older than BOTH the
|
||||
|
|
@ -52,8 +66,45 @@ pub struct CleanupPolicyOptions {
|
|||
pub older_than: Option<Duration>,
|
||||
}
|
||||
|
||||
/// Per-table outcome of `optimize_all_tables`.
|
||||
/// Why `optimize` did not compact a table. Typed so callers branch on the
|
||||
/// reason rather than sniffing a string. One variant today, gated by
|
||||
/// [`LANCE_SUPPORTS_BLOB_COMPACTION`].
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||
#[non_exhaustive]
|
||||
pub enum SkipReason {
|
||||
/// The table has one or more `Blob` columns. Lance `compact_files` forces
|
||||
/// `BlobHandling::AllBinary`, which mis-decodes blob-v2 columns; see
|
||||
/// [`LANCE_SUPPORTS_BLOB_COMPACTION`] and `docs/dev/lance.md`.
|
||||
BlobColumnsUnsupportedByLance,
|
||||
}
|
||||
|
||||
impl SkipReason {
|
||||
/// Stable machine-readable token for serialized output (e.g. CLI `--json`).
|
||||
/// Once emitted this is part of the output contract — keep it stable.
|
||||
pub fn as_str(&self) -> &'static str {
|
||||
match self {
|
||||
SkipReason::BlobColumnsUnsupportedByLance => "blob_columns_unsupported_by_lance",
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl std::fmt::Display for SkipReason {
|
||||
/// Human-readable reason for CLI and log output.
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
let msg = match self {
|
||||
SkipReason::BlobColumnsUnsupportedByLance => {
|
||||
"blob columns — Lance compaction unsupported"
|
||||
}
|
||||
};
|
||||
f.write_str(msg)
|
||||
}
|
||||
}
|
||||
|
||||
/// Per-table outcome of `optimize_all_tables`. This is a returned result type,
|
||||
/// not built by callers, so it is `#[non_exhaustive]`: future fields stay
|
||||
/// non-breaking and downstream code reads fields rather than constructing it.
|
||||
#[derive(Debug, Clone)]
|
||||
#[non_exhaustive]
|
||||
pub struct TableOptimizeStats {
|
||||
pub table_key: String,
|
||||
/// Number of source fragments that were rewritten by Lance.
|
||||
|
|
@ -62,6 +113,33 @@ pub struct TableOptimizeStats {
|
|||
pub fragments_added: usize,
|
||||
/// Did this table get a new Lance manifest version from the compaction?
|
||||
pub committed: bool,
|
||||
/// `Some(reason)` if this table was deliberately not compacted. When set,
|
||||
/// `fragments_removed == 0`, `fragments_added == 0`, and `!committed`.
|
||||
pub skipped: Option<SkipReason>,
|
||||
}
|
||||
|
||||
impl TableOptimizeStats {
|
||||
/// Stat for a table that Lance actually compacted.
|
||||
fn compacted(table_key: String, metrics: &CompactionMetrics, committed: bool) -> Self {
|
||||
Self {
|
||||
table_key,
|
||||
fragments_removed: metrics.fragments_removed,
|
||||
fragments_added: metrics.fragments_added,
|
||||
committed,
|
||||
skipped: None,
|
||||
}
|
||||
}
|
||||
|
||||
/// Stat for a table that was deliberately skipped (compaction not attempted).
|
||||
fn skipped(table_key: String, reason: SkipReason) -> Self {
|
||||
Self {
|
||||
table_key,
|
||||
fragments_removed: 0,
|
||||
fragments_added: 0,
|
||||
committed: false,
|
||||
skipped: Some(reason),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Per-table outcome of `cleanup_all_tables`. `error` is `Some` when this
|
||||
|
|
@ -84,14 +162,21 @@ pub async fn optimize_all_tables(db: &Omnigraph) -> Result<Vec<TableOptimizeStat
|
|||
let resolved = db.resolved_branch_target(None).await?;
|
||||
let snapshot = resolved.snapshot;
|
||||
|
||||
let table_tasks: Vec<_> = all_table_keys(&db.catalog())
|
||||
.into_iter()
|
||||
.filter_map(|table_key| {
|
||||
let entry = snapshot.entry(&table_key)?;
|
||||
// Compute per-table state (path + whether it has blob columns) up front, in
|
||||
// a scope that drops the catalog handle before the async stream starts.
|
||||
let table_tasks: Vec<(String, String, bool)> = {
|
||||
let catalog = db.catalog();
|
||||
let mut tasks = Vec::new();
|
||||
for table_key in all_table_keys(&catalog) {
|
||||
let Some(entry) = snapshot.entry(&table_key) else {
|
||||
continue;
|
||||
};
|
||||
let full_path = format!("{}/{}", db.root_uri, entry.table_path);
|
||||
Some((table_key, full_path))
|
||||
})
|
||||
.collect();
|
||||
let has_blob = !blob_properties_for_table_key(&catalog, &table_key)?.is_empty();
|
||||
tasks.push((table_key, full_path, has_blob));
|
||||
}
|
||||
tasks
|
||||
};
|
||||
|
||||
if table_tasks.is_empty() {
|
||||
return Ok(Vec::new());
|
||||
|
|
@ -101,7 +186,24 @@ pub async fn optimize_all_tables(db: &Omnigraph) -> Result<Vec<TableOptimizeStat
|
|||
let table_store = &db.table_store;
|
||||
|
||||
let stats: Vec<Result<TableOptimizeStats>> = futures::stream::iter(table_tasks.into_iter())
|
||||
.map(|(table_key, full_path)| async move {
|
||||
.map(|(table_key, full_path, has_blob)| async move {
|
||||
// Lance `compact_files` mis-decodes blob-v2 columns under the forced
|
||||
// `BlobHandling::AllBinary` read (see LANCE_SUPPORTS_BLOB_COMPACTION).
|
||||
// Skip blob-bearing tables and report it rather than aborting the
|
||||
// whole sweep — the other tables still compact.
|
||||
if has_blob && !LANCE_SUPPORTS_BLOB_COMPACTION {
|
||||
tracing::warn!(
|
||||
target: "omnigraph::optimize",
|
||||
table = %table_key,
|
||||
"skipping compaction: table has blob columns the current Lance \
|
||||
cannot rewrite (blob-v2 AllBinary decode bug); other tables \
|
||||
unaffected — rerun after the Lance fix",
|
||||
);
|
||||
return Ok(TableOptimizeStats::skipped(
|
||||
table_key,
|
||||
SkipReason::BlobColumnsUnsupportedByLance,
|
||||
));
|
||||
}
|
||||
let mut ds = table_store
|
||||
.open_dataset_head_for_write(&table_key, &full_path, None)
|
||||
.await?;
|
||||
|
|
@ -111,12 +213,11 @@ pub async fn optimize_all_tables(db: &Omnigraph) -> Result<Vec<TableOptimizeStat
|
|||
.await
|
||||
.map_err(|e| OmniError::Lance(e.to_string()))?;
|
||||
let version_after = ds.version().version;
|
||||
Ok(TableOptimizeStats {
|
||||
Ok(TableOptimizeStats::compacted(
|
||||
table_key,
|
||||
fragments_removed: metrics.fragments_removed,
|
||||
fragments_added: metrics.fragments_added,
|
||||
committed: version_after != version_before,
|
||||
})
|
||||
&metrics,
|
||||
version_after != version_before,
|
||||
))
|
||||
})
|
||||
.buffer_unordered(concurrency)
|
||||
.collect()
|
||||
|
|
|
|||
|
|
@ -290,3 +290,88 @@ async fn force_delete_branch_semantics() {
|
|||
TableStore::force_delete_branch's NotFound tolerance can be simplified."
|
||||
);
|
||||
}
|
||||
|
||||
// --- Guard 10: blob-column compaction is still broken in this Lance --------
|
||||
//
|
||||
// `db/omnigraph/optimize.rs` skips tables with blob columns while
|
||||
// `LANCE_SUPPORTS_BLOB_COMPACTION = false`: Lance `compact_files` forces
|
||||
// `BlobHandling::AllBinary`, and the blob-v2 struct decoder mis-counts columns
|
||||
// ("more fields in the schema than provided column indices"), failing even a
|
||||
// pristine uniform-V2_2 multi-fragment blob table. Reads are unaffected (they
|
||||
// use descriptor handling).
|
||||
//
|
||||
// WHEN THIS TEST TURNS RED (compact_files no longer errors), the Lance bug is
|
||||
// fixed: flip `LANCE_SUPPORTS_BLOB_COMPACTION` to true in optimize.rs, drop the
|
||||
// blob-skip branch + the `optimize_skips_blob_table_and_reports_skip`
|
||||
// skip assertions in maintenance.rs, and re-pin docs/dev/lance.md.
|
||||
|
||||
#[tokio::test]
|
||||
async fn compact_files_still_fails_on_blob_columns() {
|
||||
use arrow_array::{LargeBinaryArray, StructArray};
|
||||
|
||||
fn blob_batch(start: i32, n: i32) -> RecordBatch {
|
||||
let ids: Vec<String> = (start..start + n).map(|i| format!("n{i}")).collect();
|
||||
let data =
|
||||
LargeBinaryArray::from_iter_values((start..start + n).map(|i| format!("blob{i}")));
|
||||
let blob_uri = StringArray::from(vec![None::<&str>; n as usize]);
|
||||
let DataType::Struct(fields) = lance::blob::blob_field("content", true).data_type().clone()
|
||||
else {
|
||||
unreachable!("blob_field is always a Struct");
|
||||
};
|
||||
let content = StructArray::new(
|
||||
fields,
|
||||
vec![Arc::new(data) as _, Arc::new(blob_uri) as _],
|
||||
None,
|
||||
);
|
||||
let schema = Arc::new(Schema::new(vec![
|
||||
Field::new("id", DataType::Utf8, false),
|
||||
lance::blob::blob_field("content", true),
|
||||
]));
|
||||
RecordBatch::try_new(
|
||||
schema,
|
||||
vec![Arc::new(StringArray::from(ids)) as _, Arc::new(content) as _],
|
||||
)
|
||||
.unwrap()
|
||||
}
|
||||
|
||||
async fn write(uri: &str, batch: RecordBatch, mode: WriteMode) {
|
||||
let schema = batch.schema();
|
||||
let reader = RecordBatchIterator::new(vec![Ok(batch)], schema);
|
||||
// Blob v2 requires file version >= 2.2; without the pin the *write*
|
||||
// would fail with a different error, masking the guard's intent.
|
||||
let params = WriteParams {
|
||||
mode,
|
||||
enable_stable_row_ids: true,
|
||||
data_storage_version: Some(LanceFileVersion::V2_2),
|
||||
..Default::default()
|
||||
};
|
||||
Dataset::write(reader, uri, Some(params)).await.unwrap();
|
||||
}
|
||||
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let uri = dir.path().join("guard10-blob.lance");
|
||||
let uri = uri.to_str().unwrap();
|
||||
|
||||
// Uniform V2_2, two fragments → forces compaction to actually rewrite.
|
||||
write(uri, blob_batch(0, 2), WriteMode::Create).await;
|
||||
write(uri, blob_batch(100, 2), WriteMode::Append).await;
|
||||
|
||||
let mut ds = Dataset::open(uri).await.unwrap();
|
||||
assert!(
|
||||
ds.get_fragments().len() >= 2,
|
||||
"guard needs a multi-fragment table to trigger a real compaction rewrite"
|
||||
);
|
||||
|
||||
let result = compact_files(&mut ds, CompactionOptions::default(), None).await;
|
||||
let err = result.expect_err(
|
||||
"compact_files unexpectedly SUCCEEDED on a blob table — the Lance blob-v2 \
|
||||
compaction bug is fixed. Flip LANCE_SUPPORTS_BLOB_COMPACTION to true in \
|
||||
db/omnigraph/optimize.rs, remove the blob-skip branch, and re-pin docs/dev/lance.md.",
|
||||
);
|
||||
assert!(
|
||||
err.to_string()
|
||||
.contains("more fields in the schema than provided column indices"),
|
||||
"blob compaction failed with an unexpected error (Lance internals may have \
|
||||
shifted): {err}"
|
||||
);
|
||||
}
|
||||
|
|
|
|||
|
|
@ -8,7 +8,7 @@ mod helpers;
|
|||
use std::time::Duration;
|
||||
|
||||
use lance::Dataset;
|
||||
use omnigraph::db::{CleanupPolicyOptions, Omnigraph};
|
||||
use omnigraph::db::{CleanupPolicyOptions, Omnigraph, SkipReason};
|
||||
use omnigraph::loader::{LoadMode, load_jsonl};
|
||||
|
||||
use helpers::{TEST_DATA, TEST_SCHEMA, count_rows, init_and_load};
|
||||
|
|
@ -72,6 +72,97 @@ async fn optimize_after_load_then_again_is_idempotent() {
|
|||
}
|
||||
}
|
||||
|
||||
// Regression: `optimize` must not crash on a graph that has a `Blob` table.
|
||||
//
|
||||
// Lance `compact_files` forces `BlobHandling::AllBinary`, which mis-decodes
|
||||
// blob-v2 columns ("more fields in the schema than provided column indices"),
|
||||
// failing even a pristine uniform-V2_2 multi-fragment blob table. `optimize`
|
||||
// must skip blob-bearing tables (and report the skip) rather than aborting the
|
||||
// whole sweep.
|
||||
//
|
||||
// Before the skip fix, `optimize()` returned that Lance error here and aborted
|
||||
// the whole sweep; it now skips the blob table (`doc.skipped == Some(..)`)
|
||||
// while the sibling non-blob `Tag` table still compacts. The skip is gated by
|
||||
// `LANCE_SUPPORTS_BLOB_COMPACTION`; the surface guard
|
||||
// `compact_files_still_fails_on_blob_columns` flags when the upstream Lance fix
|
||||
// makes the skip (and this test's blob arm) removable.
|
||||
#[tokio::test]
|
||||
async fn optimize_skips_blob_table_and_reports_skip() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let uri = dir.path().to_str().unwrap();
|
||||
// One Blob node type (`Doc`) + one plain node type (`Tag`): proves the blob
|
||||
// table is skipped while a non-blob table in the same sweep still compacts.
|
||||
let schema = "\
|
||||
node Doc {\n slug: String @key\n content: Blob\n}\n\
|
||||
node Tag {\n slug: String @key\n}\n";
|
||||
let mut db = Omnigraph::init(uri, schema).await.unwrap();
|
||||
|
||||
// Multi-fragment blob table: Overwrite creates fragment 1; each Merge of
|
||||
// new keys appends another. A >=2-fragment blob table is exactly what
|
||||
// crashes `compact_files` today (single fragment would no-op and not crash).
|
||||
load_jsonl(
|
||||
&mut db,
|
||||
"{\"type\":\"Doc\",\"data\":{\"slug\":\"d1\",\"content\":\"base64:aGVsbG8x\"}}\n{\"type\":\"Doc\",\"data\":{\"slug\":\"d2\",\"content\":\"base64:aGVsbG8y\"}}",
|
||||
LoadMode::Overwrite,
|
||||
)
|
||||
.await
|
||||
.unwrap();
|
||||
load_jsonl(
|
||||
&mut db,
|
||||
"{\"type\":\"Doc\",\"data\":{\"slug\":\"d3\",\"content\":\"base64:aGVsbG8z\"}}",
|
||||
LoadMode::Merge,
|
||||
)
|
||||
.await
|
||||
.unwrap();
|
||||
load_jsonl(
|
||||
&mut db,
|
||||
"{\"type\":\"Doc\",\"data\":{\"slug\":\"d4\",\"content\":\"base64:aGVsbG80\"}}",
|
||||
LoadMode::Merge,
|
||||
)
|
||||
.await
|
||||
.unwrap();
|
||||
// Plain table, also multi-fragment so it has something to compact.
|
||||
load_jsonl(
|
||||
&mut db,
|
||||
"{\"type\":\"Tag\",\"data\":{\"slug\":\"t1\"}}\n{\"type\":\"Tag\",\"data\":{\"slug\":\"t2\"}}",
|
||||
LoadMode::Merge,
|
||||
)
|
||||
.await
|
||||
.unwrap();
|
||||
load_jsonl(
|
||||
&mut db,
|
||||
"{\"type\":\"Tag\",\"data\":{\"slug\":\"t3\"}}",
|
||||
LoadMode::Merge,
|
||||
)
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
let stats = db
|
||||
.optimize()
|
||||
.await
|
||||
.expect("optimize must not crash on a graph with a Blob table");
|
||||
|
||||
let doc = stats
|
||||
.iter()
|
||||
.find(|s| s.table_key == "node:Doc")
|
||||
.expect("Doc stat present");
|
||||
let tag = stats
|
||||
.iter()
|
||||
.find(|s| s.table_key == "node:Tag")
|
||||
.expect("Tag stat present");
|
||||
// The blob table is skipped (and reported), not compacted.
|
||||
assert_eq!(
|
||||
doc.skipped,
|
||||
Some(SkipReason::BlobColumnsUnsupportedByLance),
|
||||
"blob table must be reported as skipped",
|
||||
);
|
||||
assert!(!doc.committed, "skipped blob table is not compacted");
|
||||
assert_eq!(doc.fragments_removed, 0);
|
||||
assert_eq!(doc.fragments_added, 0);
|
||||
// The plain (non-blob) table is unaffected by the skip.
|
||||
assert_eq!(tag.skipped, None, "non-blob table must not be skipped");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn cleanup_without_any_policy_option_errors() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
|
|
|
|||
|
|
@ -130,6 +130,15 @@ them explicit.
|
|||
- **Deletes and vector indexes:** `delete_where` and vector index creation still
|
||||
advance Lance HEAD inline because the required public Lance APIs are missing.
|
||||
Keep D2 and recovery coverage in place until those residuals are removed.
|
||||
- **Blob-column compaction:** Lance `compact_files` mis-decodes blob-v2 columns
|
||||
under its forced `BlobHandling::AllBinary` read ("more fields in the schema
|
||||
than provided column indices"), so `optimize` skips any table with a `Blob`
|
||||
property — reporting `SkipReason::BlobColumnsUnsupportedByLance` (loud, not a
|
||||
silent drop) behind the `LANCE_SUPPORTS_BLOB_COMPACTION` gate. Reads and writes
|
||||
are unaffected; only space/fragment reclamation on blob tables is deferred.
|
||||
Remove the skip when the upstream Lance fix lands — the
|
||||
`lance_surface_guards.rs::compact_files_still_fails_on_blob_columns` guard
|
||||
turns red on that bump to force it.
|
||||
- **Planner capability/stat surfaces:** cost-aware planning, complete
|
||||
capability advertisement, and explain-with-cost are roadmap. Do not describe
|
||||
them as implemented.
|
||||
|
|
|
|||
|
|
@ -176,7 +176,8 @@ Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53,
|
|||
- **Lance #6666 still open** (`build_index_metadata_from_segments` public): vector-index two-phase blocked; inline `create_vector_index` residual retained.
|
||||
- **Lance #6877 still open** (`MergeInsertBuilder` dup-rowid): PR #109's `SourceDedupeBehavior::FirstSeen` + `check_batch_unique_by_keys` precondition stay load-bearing.
|
||||
- **`Dataset::force_delete_branch`** (`branches().delete(name, force=true)`, dataset.rs:524) tolerates a missing branch-*contents* ref (vs plain `delete_branch`'s `RefNotFound`), but on the local store still errors `NotFound` if the branch `tree/` directory is fully absent (`remove_dir_all`'s NotFound is not caught for Lance's native error variant, refs.rs:526-549). Both variants still refuse a branch with referencing descendants (`RefConflict`). `TableStore::force_delete_branch` wraps this to be fully idempotent (tolerates already-absent). The single-authority branch-delete redesign uses it for orphan reclamation (eager best-effort reclaim + cleanup reconciler). Pinned by `lance_surface_guards.rs::force_delete_branch_semantics`. Branch delete is "flip the ref atomically, then `remove_dir_all(tree/{branch})`"; branch-exclusive data lives under `tree/{branch}/` so a drop reclaims it immediately without touching `main`.
|
||||
- **Lance blob-v2 `compact_files` bug** (no public issue found as of 2026-06): `compact_files` disables binary-copy for blob datasets and forces `BlobHandling::AllBinary` on the read side; the v2.1+ structural decoder then mis-counts column infos for the blob-v2 struct and fails with `Invalid user input: there were more fields in the schema than provided column indices / infos` (`lance-encoding/src/decoder.rs::ColumnInfoIter::expect_next`). This fails even a pristine uniform-V2_2 multi-fragment blob table; vector/list/scalar/ragged columns and mixed file versions all compact fine. Reads/queries use descriptor handling (`BlobHandling::default()`) and are unaffected. `optimize` skips blob-bearing tables behind `LANCE_SUPPORTS_BLOB_COMPACTION = false` (`db/omnigraph/optimize.rs`), reporting `SkipReason::BlobColumnsUnsupportedByLance`. Pinned by `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns`, which turns red when the bug is fixed → flip the gate, remove the skip branch + the `maintenance.rs::optimize_skips_blob_table_and_reports_skip` skip assertions.
|
||||
|
||||
Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (9 named guards; 4 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension).
|
||||
Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (10 named guards; 5 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension).
|
||||
|
||||
Bump this date stanza on the next alignment pass.
|
||||
|
|
|
|||
26
docs/releases/v0.6.1.md
Normal file
26
docs/releases/v0.6.1.md
Normal file
|
|
@ -0,0 +1,26 @@
|
|||
# Omnigraph v0.6.1
|
||||
|
||||
v0.6.1 focuses on operational polish after v0.6.0: stored-query registries, safer branch cleanup, more complete release artifacts, and a Lance blob-compaction workaround.
|
||||
|
||||
## Highlights
|
||||
|
||||
- **Stored-query registries.** `omnigraph.yaml` can declare curated `queries:` blocks per graph. Servers load and type-check them at startup, `omnigraph queries validate` checks them offline, `omnigraph queries list` shows exposed queries and typed params, `GET /queries` exposes a typed catalog, and `POST /queries/{name}` invokes a stored query without accepting ad hoc `.gq` source from the client.
|
||||
- **Stored-query policy gate.** New Cedar action `invoke_query` gates the stored-query invocation surface. Stored mutations are double-gated: `invoke_query` to reach the stored query and `change` for the actual write.
|
||||
- **Safer branch deletion.** `branch_delete` now treats the manifest as the authority, flips branch visibility atomically, and reclaims per-table/commit-graph forks as derived state. If best-effort reclaim is interrupted, `cleanup` reconciles orphaned forks; reusing a branch name before cleanup reports an actionable error.
|
||||
- **Blob-safe optimize.** `omnigraph optimize` skips tables with `Blob` properties instead of failing the whole sweep on Lance's blob-v2 compaction decode bug. Skips are visible in human output, `--json` as `skipped`, `TableOptimizeStats.skipped`, and logs; non-blob tables still compact normally.
|
||||
- **Deployment improvements.** The container entrypoint now composes `OMNIGRAPH_TARGET_URI` with `OMNIGRAPH_CONFIG`, so operators can keep the graph URI in env while loading policy/query config from a mounted file. The local RustFS bootstrap pins RustFS beta.3 and allows the current insecure local-dev default credentials.
|
||||
- **Windows release support.** Tagged and edge releases now publish Windows x86_64 archives containing `omnigraph.exe` and `omnigraph-server.exe`, with a PowerShell installer and Windows install docs.
|
||||
- **Release tooling.** Homebrew formula generation was tightened to produce audit-clean formulas.
|
||||
|
||||
## Compatibility Notes
|
||||
|
||||
- A graph selected by name (`--target` or `server.graph`) now uses `graphs.<name>.policy` and `graphs.<name>.queries`. Top-level `policy` / `queries` blocks are only for anonymous bare-URI single-graph mode; using them with a named graph now fails loudly with migration guidance.
|
||||
- `mcp.expose` defaults to `true` for stored-query registry entries. Set `mcp: { expose: false }` for service-only queries that should not appear in the catalog.
|
||||
- `invoke_query` is graph-scoped, not branch-scoped. Branch/snapshot access remains enforced by the inner `read` / `change` gate.
|
||||
- Blob tables are not compacted until the upstream Lance fix lands, so fragment count and deleted-row space on blob tables are not reclaimed by `optimize`. Reads, writes, and query results are unaffected; no on-disk migration is required.
|
||||
- `TableOptimizeStats` is now `#[non_exhaustive]` and gains a `skipped: Option<SkipReason>` field (so does the new `SkipReason` enum). This is a source-level change only for downstream code that built this returned result struct by literal — rare, since it is produced by `optimize` and consumed by reading its fields; field access is unaffected, and `#[non_exhaustive]` keeps future additions non-breaking.
|
||||
|
||||
## Docs And Cleanup
|
||||
|
||||
- Public docs were updated for stored queries, policy, server routes, deployment, Windows installation, branch deletion, maintenance, and the `runs` docs rename to `writes`.
|
||||
- README copy and release documentation were refreshed; older release notes had small typo/wording fixes.
|
||||
|
|
@ -21,7 +21,7 @@ A reference for the `omnigraph` binary's command surface and `omnigraph.yaml` sc
|
|||
| `schema plan \| apply \| show (alias: get)` | migrations |
|
||||
| `lint` (alias: `check`) | offline / graph-backed query validation. Replaces `query lint` / `query check`, which are kept as deprecated argv-level shims that print a one-line warning and rewrite to `omnigraph lint` |
|
||||
| `queries validate \| list` | operate on the server-side stored-query registry (the `queries:` block). `validate` type-checks every stored query against the live schema offline (opens the selected graph; exits non-zero on any breakage), catching schema drift without restarting the server; `list` prints the selected registry's query names, MCP exposure, and typed params. For per-graph registries, pass `--target <graph>` or set `cli.graph`; with no graph selection, `list` shows only top-level `queries:`. Distinct from `lint`, which validates a single `.gq` file |
|
||||
| `optimize` | non-destructive Lance compaction |
|
||||
| `optimize` | non-destructive Lance compaction (skips tables with `Blob` columns; `--json` reports a `skipped` field) |
|
||||
| `cleanup --keep N --older-than 7d --confirm` | destructive version GC |
|
||||
| `embed` | offline JSONL embedding pipeline |
|
||||
| `policy validate \| test \| explain` | Cedar tooling. Selects `cli.graph`, else `server.graph`, else top-level `policy.file` |
|
||||
|
|
|
|||
|
|
@ -11,6 +11,7 @@
|
|||
| Internal manifest schema version | `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` | `db/manifest/migrations.rs` |
|
||||
| Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | `exec/merge.rs` |
|
||||
| Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | `db/omnigraph/optimize.rs` |
|
||||
| Lance blob compaction support | `LANCE_SUPPORTS_BLOB_COMPACTION = false` | `db/omnigraph/optimize.rs` |
|
||||
| Graph index cache size | `8` (LRU) | `runtime_cache.rs` |
|
||||
| Default body limit | `1 MB` | `omnigraph-server/lib.rs` |
|
||||
| Ingest body limit | `32 MB` | `omnigraph-server/lib.rs` |
|
||||
|
|
|
|||
|
|
@ -7,7 +7,8 @@
|
|||
- Lance `compact_files()` on every node + edge table on `main`.
|
||||
- Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests.
|
||||
- Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8).
|
||||
- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed }]`.
|
||||
- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed, skipped }]`.
|
||||
- **Blob tables are skipped.** A table that declares any `Blob` property is not compacted: it is reported with `skipped: Some(BlobColumnsUnsupportedByLance)` (and logged via `tracing::warn`) instead of compacted, and the rest of the sweep proceeds normally. The current Lance `compact_files` mis-decodes blob-v2 columns under its forced `BlobHandling::AllBinary` read; **reads and writes are unaffected** — only compaction is. This is gated by `LANCE_SUPPORTS_BLOB_COMPACTION` (`db/omnigraph/optimize.rs`) and removed when the upstream Lance fix lands (see [docs/dev/lance.md](../dev/lance.md)). Consequence: fragment count and deleted-row space on blob tables are not reclaimed until then; query results are never affected.
|
||||
|
||||
## `cleanup_all_tables(db, options)` — destructive
|
||||
|
||||
|
|
|
|||
|
|
@ -7,7 +7,7 @@
|
|||
"name": "MIT",
|
||||
"identifier": "MIT"
|
||||
},
|
||||
"version": "0.6.0"
|
||||
"version": "0.6.1"
|
||||
},
|
||||
"paths": {
|
||||
"/branches": {
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue