fix(engine): run __run__ sweep at Omnigraph::open, not only on publish

Review (PR #132) caught a regression: removing __run__ from
`is_internal_system_branch` exposed legacy `__run__*` branches to the
schema-apply blocking-branch checks (schema_apply.rs:104 and :778) and to
`branch_list()`, but the v2→v3 sweep ran only inside the publisher's
`load_publish_state`. On a pre-v0.4.0 graph whose first write is a schema
apply, the blocking-branch check fires before any publish, so apply failed
with "found non-main branches: __run__…". The same lazy timing also created a
reverse hazard: a user-created `__run__*` branch on a still-v2 graph could be
deleted by the first publish's sweep.

Fix: run the internal-schema migration in `Omnigraph::open(ReadWrite)` (new
`manifest::migrate_on_open`), before the coordinator reads branch state. The
sweep now lands before any branch-observing code, and a graph is stamped v3 at
open — so the one-time sweep can never catch a legitimately-created branch.
Both checks and `branch_list` see the swept graph; correct by construction for
every write path.

Accepted residual: a read-only open of an unmigrated legacy graph still lists
`__run__*` (read-only opens must not write, so they can't sweep). Documented.

Regression test `legacy_run_branch_is_swept_on_open_and_does_not_block_schema_apply`
confirmed RED before the fix (panicked on the branch_list leak assertion) and
GREEN after. Also updates the stale schema_apply.rs comment, the writes.md
"Migration code" section, and adds the v3 row to storage.md's migration table.
This commit is contained in:
Ragnor Comerford 2026-05-30 23:46:33 +02:00
parent 4ed2313a80
commit 08cfd9f843
No known key found for this signature in database
5 changed files with 90 additions and 8 deletions

View file

@ -48,6 +48,22 @@ const OBJECT_TYPE_TABLE_VERSION: &str = "table_version";
const OBJECT_TYPE_TABLE_TOMBSTONE: &str = "table_tombstone";
const TABLE_VERSION_MANAGEMENT_KEY: &str = "table_version_management";
/// Apply pending internal-schema migrations against `__manifest` on the
/// open-for-write path, independent of a publish.
///
/// `Omnigraph::open(ReadWrite)` calls this before the coordinator reads branch
/// state, so branch-observing code (`branch_list`, the schema-apply
/// blocking-branch checks) sees the post-migration graph. In particular the
/// v2→v3 step sweeps legacy `__run__*` staging branches off `__manifest`
/// (MR-770); running it here closes the window where those branches would
/// otherwise block schema apply before the first publish runs the migration.
///
/// Idempotent: a no-op stamp read when the on-disk version already matches.
pub(crate) async fn migrate_on_open(root_uri: &str) -> Result<()> {
let mut dataset = open_manifest_dataset(root_uri, None).await?;
migrations::migrate_internal_schema(&mut dataset).await
}
/// Immutable point-in-time view of the database.
///
/// Cheap to create (no storage I/O). All reads within a query go through one

View file

@ -340,6 +340,16 @@ impl Omnigraph {
mode: OpenMode,
) -> Result<Self> {
let root = normalize_root_uri(uri)?;
// Apply pending internal-schema migrations before the coordinator reads
// branch state, so `branch_list` and the schema-apply blocking-branch
// checks observe the post-migration graph — notably the v2→v3 sweep of
// legacy `__run__*` staging branches (MR-770). ReadWrite only: a
// read-only open must not trigger object-store writes, so a read-only
// open of an unmigrated legacy graph still lists `__run__*` until its
// first read-write open (an accepted, documented limitation).
if matches!(mode, OpenMode::ReadWrite) {
crate::db::manifest::migrate_on_open(&root).await?;
}
// Open the coordinator first so the schema-staging recovery sweep can
// compare its snapshot against any leftover staging files.
let mut coordinator = GraphCoordinator::open(&root, Arc::clone(&storage)).await?;
@ -2216,6 +2226,56 @@ edge WorksAt: Person -> Company
assert!(result.applied, "schema apply should have applied");
}
/// Regression (MR-770): a pre-v0.4.0 graph that still carries a stale
/// `__run__*` branch on `__manifest` must not block schema apply. The
/// v2→v3 sweep runs in `Omnigraph::open(ReadWrite)` — before the
/// schema-apply blocking-branch check — so apply succeeds with no
/// intervening publish.
///
/// Confirmed to fail before the open-time migration landed: the reopened
/// graph still listed `__run__legacy`, and `apply_schema` returned
/// "found non-main branches: __run__legacy".
#[tokio::test]
async fn legacy_run_branch_is_swept_on_open_and_does_not_block_schema_apply() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap();
// Synthesize a legacy graph: a stale `__run__` branch on `__manifest`
// plus the manifest stamp rewound to v2 (pre-sweep).
db.branch_create("__run__legacy").await.unwrap();
drop(db);
{
let mut ds = lance::Dataset::open(&format!("{}/__manifest", uri))
.await
.unwrap();
ds.update_schema_metadata([(
"omnigraph:internal_schema_version".to_string(),
Some("2".to_string()),
)])
.await
.unwrap();
}
// Reopen (ReadWrite): the open-time migration must sweep `__run__legacy`
// before any branch-observing code runs.
let db = Omnigraph::open(uri).await.unwrap();
let branches = db.branch_list().await.unwrap();
assert!(
!branches.iter().any(|b| b.starts_with("__run__")),
"open-time migration must sweep legacy __run__ branches; got {branches:?}",
);
// Schema apply must proceed with no intervening publish — the
// blocking-branch check no longer sees `__run__legacy`.
let desired = TEST_SCHEMA.replace(
" age: I32?\n}",
" age: I32?\n nickname: String?\n}",
);
let result = db.apply_schema(&desired).await.unwrap();
assert!(result.applied, "schema apply should have applied");
}
#[tokio::test]
async fn test_apply_schema_adds_index_for_existing_property() {
let dir = tempfile::tempdir().unwrap();

View file

@ -94,11 +94,11 @@ pub(super) async fn apply_schema_with_lock(
) -> Result<SchemaApplyResult> {
db.ensure_schema_state_valid().await?;
let branches = db.coordinator.read().await.all_branches().await?;
// Skip `main` and internal system branches. The schema-apply lock branch
// is excluded because it is the cluster-wide schema-apply serializer.
// `__run__*` branches are no longer created; the filter remains as
// defense-in-depth for legacy graphs with leftover staging branches.
// A future production sweep will let this guard go.
// Skip `main` and internal system branches (the schema-apply lock branch,
// the cluster-wide schema-apply serializer). Legacy `__run__*` staging
// branches were swept off `__manifest` by the v2→v3 migration that runs in
// `Omnigraph::open(ReadWrite)` before this check (MR-770), so they no
// longer appear here.
let blocking_branches = branches
.into_iter()
.filter(|branch| branch != "main" && !is_internal_system_branch(branch))

View file

@ -248,9 +248,14 @@ list`.
## Migration code
`db/manifest/migrations.rs` does not change. Active deletion of
`_graph_runs.lance` belongs in MR-770 (the production sweep) — this PR
stops *creating* run state but does not destroy legacy bytes on disk.
`db/manifest/migrations.rs` carries the v2→v3 internal-schema step (MR-770):
a one-time sweep that deletes legacy `__run__*` staging branches off
`__manifest`. It runs in `Omnigraph::open(ReadWrite)` (via
`manifest::migrate_on_open`, before the coordinator reads branch state) and
again on the publisher's write path; both are idempotent once the stamp is at
v3. Deleting the inert `_graph_runs.lance` / `_graph_run_actors.lance` dataset
*bytes* is still deferred — it needs a `StorageAdapter::delete_prefix`
primitive — but those bytes are invisible to graph-level state.
## Mid-query partial failure: closed by MR-794

View file

@ -47,6 +47,7 @@ Adding a new on-disk shape change is one constant bump (`INTERNAL_MANIFEST_SCHEM
|---|---|
| v1 (implicit, pre-stamp) | `__manifest.object_id` had no PK annotation; publisher had no row-level CAS protection. |
| v2 | `__manifest.object_id` carries `lance-schema:unenforced-primary-key=true`; row-level CAS engaged. Stamped as `omnigraph:internal_schema_version=2`. |
| v3 | One-time sweep of legacy `__run__*` staging branches (pre-v0.4.0 Run state machine, removed MR-771) off `__manifest`. Runs at `Omnigraph::open(ReadWrite)` and on publish. Stamped as `omnigraph:internal_schema_version=3`. |
## On-disk layout