mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-30 02:49:39 +02:00
* test(lance): pin force_delete_branch surface guard
Pin the Lance 6.0.1 force_delete_branch behavior the branch-delete
single-authority redesign relies on: plain delete_branch errors on a
missing ref, force_delete_branch removes an existing forked branch, and
the local-store quirk where force_delete on a fully-absent branch still
errors (worked around by the upcoming TableStore::force_delete_branch).
Re-pin the docs/dev/lance.md alignment stanza (9 guards; 4 runtime).
* feat(storage): add force branch-delete to TableStore + CommitGraph
Add TableStore::force_delete_branch and CommitGraph::force_delete_branch
(idempotent: tolerate an already-absent branch via Lance RefNotFound /
NotFound), plus CommitGraph::list_branches for the cleanup reconciler to
diff against the manifest authority. RefConflict (referencing
descendants) is still surfaced. Unused until the branch-delete rewire.
* test(maintenance): red — cleanup reconciles orphaned branch forks
Forge a Lance branch on the Person table that the manifest never
references (a zombie fork from an incomplete prior delete) and assert
cleanup reclaims it while leaving main intact. Fails today: cleanup does
not yet reconcile orphaned forks. Goes green with the next commit.
* fix(maintenance): reconcile orphaned branch forks in cleanup
Add reconcile_orphaned_branches: force_delete_branch every per-table and
commit-graph Lance branch absent from the manifest branch set (the
authority), children-before-parents. Folded into cleanup_all_tables,
runs before version GC. Idempotent and authority-derived; no-ops once
nothing is orphaned, and would harmlessly find nothing if a future Lance
atomic multi-dataset branch op prevented orphans. Adds TableStore::list_branches
and exposes graph_commits_uri as pub(crate). Turns the maintenance red test green.
* test(failpoints): red — branch_delete partial failure converges
Add the branch_delete.before_table_cleanup failpoint hook (inert without
the feature) and a regression test: a cleanup-step failure after the
manifest authority flip must leave branch_delete returning Ok, the branch
gone, the orphan stranded, then reclaimed by cleanup, and the name
reusable. Fails today: cleanup_deleted_branch_tables propagates the error
as a hard failure. Goes green with the next commit.
* fix(branch): best-effort fork reclaim after the manifest flip
Make branch_delete treat per-table forks and the commit-graph branch as
derived state reclaimed best-effort with force_delete_branch after the
manifest authority flip. A reclaim failure (transient error, or the
branch_delete.before_table_cleanup failpoint) is logged via tracing::warn
and swallowed: the branch is already gone and the cleanup reconciler
converges the orphan. cleanup_deleted_branch_tables no longer returns an
error or blocks the call. Turns the partial-failure recovery test green.
* test(failpoints): red — recreate over orphaned fork is actionable
After a partial-failure delete leaves a fork orphaned, recreating the
branch name and writing to the previously-forked table before cleanup
runs currently surfaces the opaque ExpectedVersionMismatch ("stale view
... expected manifest table version N"). Assert instead a clear error
pointing the user at cleanup. Goes green with the next commit.
* fix(branch): actionable orphan-collision error in fork_branch_from_state
When a fork's create_branch collides with an existing target ref, reuse
it only if its head matches source_version (a legitimate concurrent
first-write). A version mismatch means a zombie fork from an incomplete
prior delete: return a manifest_conflict pointing the user at
`omnigraph cleanup`, instead of the opaque ExpectedVersionMismatch.
Turns the recreate-over-orphan red test green.
* docs(invariants): single-authority branch-lifecycle + Lance forward-compat
Record branch delete in the Current Truth Matrix: manifest is the single
authority flipped atomically first, per-table forks + commit-graph branch
are derived state reclaimed best-effort with the cleanup reconciler as
backstop, and reusing a name whose reclaim failed surfaces an actionable
error. Note the reconciler is authority-derived and degrades to a no-op
under a future Lance atomic multi-dataset branch op, the same shape as
invariant 7.
* test(failpoints): red — cleanup isolates a single-table failure
Add the cleanup.table_gc failpoint hook (inert without the feature) and
an error: Option<String> field on TableCleanupStats (mechanical, always
None for now). Regression test: a one-shot version-GC failure for one
table must not abort the whole cleanup — assert cleanup still succeeds,
surfaces the failure per-table in stats, and the independent reconcile
pass still reclaimed an orphan. Fails today: the version-GC collect
aborts on the first table error. Goes green with the next commit.
* fix(maintenance): fault-isolate cleanup per table
Make the cleanup sweep do as much as it can and converge on re-run
instead of aborting wholesale on one table's transient error
(invariant 13). The version-GC loop now records a per-table failure on
its stats row (error: Some) and logs it rather than collecting into a
Result that aborts; reconcile_orphaned_branches isolates per-table and
commit-graph failures into BranchReconcileStats.failures. The CLI reports
any failed tables and tells the user to rerun cleanup. Addresses the
Devin review finding. Turns the single-table-failure test green.
* test(failpoints): red — branch_create heals commit-graph zombie + is atomic
Add the branch_delete.before_commit_graph_reclaim failpoint hook and two
regression tests: (a) recreating a name whose delete left a commit-graph
zombie must succeed (today it dies on Lance's internal Clone error), and
(b) branch_create must roll back the manifest branch when the derived
commit-graph branch fails (today it leaves the manifest branch created
while returning Err). Both fail now; green with the next commit. The
existing branch_create_failpoint_triggers test still passes.
* fix(branch): make branch_create atomic + heal commit-graph zombie
branch_create now flips the manifest authority first, then creates the
derived commit-graph branch in create_commit_graph_branch, force-dropping
any orphaned commit-graph ref left by an incomplete prior delete (the
manifest branch is fresh, so a same-named commit-graph branch is provably
a zombie). If commit-graph creation fails, the manifest branch is rolled
back so the name never half-exists. Addresses the Codex review finding.
Turns the two branch_create red tests green; existing tests unaffected.
* test(failpoints): red — fork collision misclassifies live concurrent fork
Add the fork.before_classify failpoint hook and a concurrency test: when
a concurrent first-write legitimately wins the fork race, the loser must
get a retryable refresh-and-retry, not the misleading run-cleanup orphan
error. Today the version-comparison misclassifies the live fork as an
orphan (the Cursor finding). Goes green with the next commit.
* fix(branch): manifest-arbitrated fork-collision classification
Classify a fork collision by the manifest authority instead of comparing
Lance branch versions. Before forking, open_owned_dataset_for_branch_write
re-reads the live manifest: if the table is already forked on the active
branch, a concurrent first-write won and the loser gets a retryable
refresh-and-retry (not a misleading orphan error). fork_branch_from_state
no longer guesses from versions — a create collision past that check is
an orphan, so it returns the actionable cleanup error. Addresses the
Cursor finding; turns the live-concurrent-fork test green, zombie path
unchanged.
* test(failpoints): close branch-lifecycle test gaps
Three coverage additions for the branch-delete work (behavior already
correct; these lock it in and catch regressions):
- cleanup_isolates_reconcile_failure: inject a force-delete failure into
the reconcile loop (new cleanup.reconcile_fork hook) and assert the
sweep continues + converges on re-run. Directly covers the reconcile
loop the Devin finding was about (previously only version-GC was).
- cleanup_reclaims_orphaned_commit_graph_branch: forge a commit-graph
orphan via the delete reclaim failpoint and assert cleanup's
reconcile_commit_graph_orphans drops it (previously untested).
- fork_collision_with_live_concurrent_fork_is_retryable: replace the
fixed 300ms sleep with a deterministic readiness signal (cfg_callback +
compare_exchange atomics) so the two-writer ordering can't flake.
Full failpoints suite 31/0.
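
The reconcile step described above (drop every derived Lance branch absent from the manifest branch set, children-before-parents) can be sketched roughly as follows. This is a minimal illustration, not the repository's `reconcile_orphaned_branches`: the names `orphaned_branches` and `depth`, and the parent map used to model fork ancestry, are hypothetical and chosen only for this example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical helper: number of ancestors of `branch` in the derived fork tree.
fn depth(branch: &str, parents: &BTreeMap<String, Option<String>>) -> usize {
    let mut depth = 0;
    let mut current = branch;
    // Walk up the fork chain; the bound guards against a malformed cycle.
    while let Some(Some(parent)) = parents.get(current) {
        depth += 1;
        if depth > parents.len() {
            break;
        }
        current = parent.as_str();
    }
    depth
}

/// Hypothetical sketch: branches present in the derived store (name -> parent)
/// but absent from the manifest authority, ordered children-before-parents.
fn orphaned_branches(
    manifest: &BTreeSet<String>,
    derived: &BTreeMap<String, Option<String>>,
) -> Vec<String> {
    let mut orphans: Vec<String> = derived
        .keys()
        .filter(|name| !manifest.contains(*name))
        .cloned()
        .collect();
    // Deepest branches first, so a child is always reclaimed before its parent.
    orphans.sort_by(|a, b| {
        depth(b, derived)
            .cmp(&depth(a, derived))
            .then_with(|| a.cmp(b))
    });
    orphans
}

fn main() {
    let manifest: BTreeSet<String> =
        ["main", "feature"].iter().map(|s| s.to_string()).collect();

    let mut derived: BTreeMap<String, Option<String>> = BTreeMap::new();
    derived.insert("feature".to_string(), None);
    derived.insert("ghost".to_string(), None);
    derived.insert("ghost-child".to_string(), Some("ghost".to_string()));

    // prints ["ghost-child", "ghost"]
    println!("{:?}", orphaned_branches(&manifest, &derived));
}
```

Because the orphan set is computed purely as a difference against the manifest, running the sketch again after the orphans are removed yields an empty list, which mirrors the idempotent, authority-derived behavior the commit message describes.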
229 lines
7.6 KiB
Rust
// Maintenance tests: `optimize` (Lance compact_files) and `cleanup`
// (Lance cleanup_old_versions) at the graph level. Covers no-op edges
// (empty graph, already-optimized graph), the policy-validation contract on
// `cleanup`, and the keep-versions cap that protects head.

mod helpers;

use std::time::Duration;

use lance::Dataset;
use omnigraph::db::{CleanupPolicyOptions, Omnigraph};
use omnigraph::loader::{LoadMode, load_jsonl};

use helpers::{TEST_DATA, TEST_SCHEMA, count_rows, init_and_load};

/// Filesystem URI of a node sub-table, mirroring the engine's layout
/// (FNV-1a of the type name under `nodes/`). Matches the helper in
/// `failpoints.rs`; used to inspect/forge Lance branches directly in tests.
fn node_table_uri(root: &str, type_name: &str) -> String {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in type_name.as_bytes() {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100_0000_01b3);
    }
    format!("{}/nodes/{hash:016x}", root.trim_end_matches('/'))
}

#[tokio::test]
async fn optimize_on_empty_graph_returns_stats_per_table_with_no_changes() {
    let dir = tempfile::tempdir().unwrap();
    let uri = dir.path().to_str().unwrap();
    let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap();

    let stats = db.optimize().await.unwrap();

    // Schema declares 2 nodes + 2 edges = 4 tables. Compaction should run on
    // each but find nothing to merge.
    assert_eq!(stats.len(), 4);
    for s in &stats {
        assert_eq!(s.fragments_removed, 0, "{} should not remove", s.table_key);
        assert_eq!(s.fragments_added, 0, "{} should not add", s.table_key);
    }
}

#[tokio::test]
async fn optimize_after_load_then_again_is_idempotent() {
    let dir = tempfile::tempdir().unwrap();
    let mut db = init_and_load(&dir).await;

    // First pass may compact (load wrote real fragments).
    let _first = db.optimize().await.unwrap();

    // Second pass should be a no-op: already-compacted graph produces no
    // fragments_removed / fragments_added.
    let second = db.optimize().await.unwrap();
    for s in &second {
        assert_eq!(
            s.fragments_removed, 0,
            "{} re-optimize should be no-op",
            s.table_key
        );
        assert_eq!(
            s.fragments_added, 0,
            "{} re-optimize should be no-op",
            s.table_key
        );
        assert!(
            !s.committed,
            "{} re-optimize should not commit a new version",
            s.table_key
        );
    }
}

#[tokio::test]
async fn cleanup_without_any_policy_option_errors() {
    let dir = tempfile::tempdir().unwrap();
    let mut db = init_and_load(&dir).await;

    let err = db
        .cleanup(CleanupPolicyOptions::default())
        .await
        .expect_err("cleanup with no policy options must error");

    let msg = format!("{}", err);
    assert!(
        msg.contains("keep_versions") && msg.contains("older_than"),
        "error should name the two policy fields, got: {msg}"
    );
}

#[tokio::test]
async fn cleanup_keep_one_preserves_head_and_table_remains_readable() {
    let dir = tempfile::tempdir().unwrap();
    let mut db = init_and_load(&dir).await;

    let people_before = count_rows(&db, "node:Person").await;
    assert!(
        people_before > 0,
        "fixture should seed Person rows for this test to be meaningful"
    );

    // Most aggressive version-based cleanup short of forcing keep=0. Lance's
    // contract is that head is always preserved regardless, so the table
    // must remain openable and rows must still be visible.
    let _stats = db
        .cleanup(CleanupPolicyOptions {
            keep_versions: Some(1),
            older_than: None,
        })
        .await
        .unwrap();

    assert_eq!(count_rows(&db, "node:Person").await, people_before);
}

#[tokio::test]
async fn cleanup_older_than_zero_preserves_head() {
    let dir = tempfile::tempdir().unwrap();
    let mut db = init_and_load(&dir).await;

    // Aggressive policy: every version is "older than zero seconds ago".
    // Lance must still preserve the head manifest, so the table is openable
    // afterwards and a subsequent load still works.
    let _stats = db
        .cleanup(CleanupPolicyOptions {
            keep_versions: None,
            older_than: Some(Duration::from_secs(0)),
        })
        .await
        .unwrap();

    // Smoke test: after aggressive cleanup, we can still read and write the
    // graph; head wasn't pruned.
    load_jsonl(&mut db, TEST_DATA, LoadMode::Merge)
        .await
        .unwrap();
}

#[tokio::test]
async fn cleanup_then_optimize_preserves_rows_and_table_remains_writable() {
    // Cleanup destroys version history; the concern is that subsequent
    // optimize on a freshly-cleaned table could trip over dropped fragment
    // refs or stale manifests. Assert the sequence preserves row content,
    // leaves head readable, and doesn't break a subsequent write.
    let dir = tempfile::tempdir().unwrap();
    let mut db = init_and_load(&dir).await;

    let people_before = count_rows(&db, "node:Person").await;
    let companies_before = count_rows(&db, "node:Company").await;
    assert!(
        people_before > 0 && companies_before > 0,
        "fixture should seed both Person and Company rows"
    );

    db.cleanup(CleanupPolicyOptions {
        keep_versions: Some(1),
        older_than: None,
    })
    .await
    .unwrap();
    db.optimize().await.unwrap();

    // Head is preserved through both ops.
    assert_eq!(count_rows(&db, "node:Person").await, people_before);
    assert_eq!(count_rows(&db, "node:Company").await, companies_before);

    // Table is still writable after the cleanup+optimize sequence.
    load_jsonl(&mut db, TEST_DATA, LoadMode::Merge)
        .await
        .unwrap();
    assert_eq!(count_rows(&db, "node:Person").await, people_before);
}

#[tokio::test]
async fn cleanup_reconciles_orphaned_branch_forks() {
    // An incomplete prior `branch_delete` can leave a per-table Lance branch
    // that the manifest no longer references (a "zombie" fork). It is
    // unreachable through any snapshot but pins its `tree/{branch}/` storage.
    // `cleanup` must reconcile it away: drop every Lance branch absent from the
    // manifest authority, without touching `main`.
    let dir = tempfile::tempdir().unwrap();
    let uri = dir.path().to_str().unwrap().to_string();
    let mut db = init_and_load(&dir).await;

    let people_before = count_rows(&db, "node:Person").await;
    assert!(people_before > 0, "fixture should seed Person rows");

    // Forge an orphaned fork the manifest never knew about.
    let person_uri = node_table_uri(&uri, "Person");
    {
        let mut ds = Dataset::open(&person_uri).await.unwrap();
        let base = ds.version().version;
        ds.create_branch("ghost", base, None).await.unwrap();
        assert!(
            ds.list_branches().await.unwrap().contains_key("ghost"),
            "precondition: orphaned fork staged"
        );
    }

    db.cleanup(CleanupPolicyOptions {
        keep_versions: Some(1),
        older_than: None,
    })
    .await
    .unwrap();

    // Orphan reclaimed; main untouched.
    {
        let ds = Dataset::open(&person_uri).await.unwrap();
        assert!(
            !ds.list_branches().await.unwrap().contains_key("ghost"),
            "cleanup should reconcile the orphaned 'ghost' fork away"
        );
    }
    assert_eq!(
        count_rows(&db, "node:Person").await,
        people_before,
        "cleanup must not disturb main while reconciling orphans"
    );

    // Idempotent: a second cleanup with the orphan already gone is a no-op.
    db.cleanup(CleanupPolicyOptions {
        keep_versions: Some(1),
        older_than: None,
    })
    .await
    .unwrap();
}
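
The `node_table_uri` helper in the file above derives a table directory from a 64-bit FNV-1a hash of the type name. As a standalone check of that hashing step, the sketch below factors the loop into a hypothetical `fnv1a_64` function (that name and the `/tmp/graph/` root are assumptions for illustration, not part of the repository) and compares it against the published FNV-1a test vectors for the empty string and `"a"`.

```rust
/// Hypothetical standalone version of the hash loop in `node_table_uri`:
/// 64-bit FNV-1a with the standard offset basis and prime.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for &b in bytes {
        hash ^= u64::from(b); // XOR the byte in first (the "1a" variant)...
        hash = hash.wrapping_mul(0x100_0000_01b3); // ...then multiply by the prime
    }
    hash
}

fn main() {
    // Standard FNV-1a 64-bit test vectors.
    assert_eq!(fnv1a_64(b""), 0xcbf2_9ce4_8422_2325);
    assert_eq!(fnv1a_64(b"a"), 0xaf63_dc4c_8601_ec8c);

    // Same URI shape as the test helper: trailing slash trimmed,
    // hash rendered as 16 zero-padded lowercase hex digits.
    let root = "/tmp/graph/"; // illustrative root, not a real fixture path
    let uri = format!(
        "{}/nodes/{:016x}",
        root.trim_end_matches('/'),
        fnv1a_64(b"Person")
    );
    assert!(uri.starts_with("/tmp/graph/nodes/"));
    assert_eq!(uri.len(), "/tmp/graph/nodes/".len() + 16);
    println!("{uri}");
}
```

Zero-padding to 16 hex digits keeps every table directory name the same length regardless of the hash value.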