omnigraph/docs/releases/v0.6.1.md
Ragnor Comerford 54842808db
feat(engine): sweep & remove legacy __run__ branch guard (MR-770) (#132)
* feat(engine): sweep legacy __run__ branches via v2→v3 manifest migration

Pre-v0.4.0 graphs can carry stale `__run__<id>` staging branches on the
`__manifest` dataset, left by the Run state machine removed in MR-771. Lance's
`list_branches` still enumerates them, so they leak into `branch_list()` and
count as blocking branches at schema-apply time.

Add a one-time `migrate_v2_to_v3` arm to the internal-schema dispatcher: on the
first read-write open it enumerates `__manifest` branches, deletes every
`__run__*` ref, and bumps the stamp to 3. Idempotent under retry (re-enumerates
fresh each run). The `"__run__"` prefix is inlined so the migration does not
depend on the run_registry guard that MR-770 removes next.

This is the prerequisite sweep; the guard removal follows in the next commit.

* refactor(engine): remove the legacy __run__ branch guard (MR-770)

With the v2→v3 migration sweeping stale `__run__*` branches off `__manifest`
on first read-write open, the defense-in-depth `is_internal_run_branch` guard
is no longer needed.

- delete `db/run_registry.rs`; drop the module + re-export from `db/mod.rs`
- collapse `is_internal_system_branch` to the schema-apply-lock check only
- `ensure_public_branch_ref`: drop the run-ref rejection; `__run__*` is now an
  ordinary branch name
- `branch_merge`: reject `is_internal_system_branch` (was run-only) so the
  schema-apply lock is rejected consistently with create/delete — a small,
  deliberate tightening
- update the inline schema-apply test + the writes integration tests
  (`public_branch_apis_reject_internal_run_refs` →
  `public_branch_apis_reject_internal_system_refs`, which also asserts
  `__run__*` now creates successfully)
- docs: flip the "pending production sweep / defense-in-depth" notes to
  "auto-swept by the v2→v3 migration"; document the read-only-open limitation

Known residual: the inert `_graph_runs.lance` / `_graph_run_actors.lance` bytes
remain until a `StorageAdapter::delete_prefix` primitive lands.

* fix(engine): run __run__ sweep at Omnigraph::open, not only on publish

Review (PR #132) caught a regression: removing __run__ from
`is_internal_system_branch` exposed legacy `__run__*` branches to the
schema-apply blocking-branch checks (schema_apply.rs:104 and :778) and to
`branch_list()`, but the v2→v3 sweep ran only inside the publisher's
`load_publish_state`. On a pre-v0.4.0 graph whose first write is a schema
apply, the blocking-branch check fires before any publish, so apply failed
with "found non-main branches: __run__…". The same lazy timing also created a
reverse hazard: a user-created `__run__*` branch on a still-v2 graph could be
deleted by the first publish's sweep.

Fix: run the internal-schema migration in `Omnigraph::open(ReadWrite)` (new
`manifest::migrate_on_open`), before the coordinator reads branch state. The
sweep now lands before any branch-observing code, and a graph is stamped v3 at
open — so the one-time sweep can never catch a legitimately-created branch.
Both checks and `branch_list` see the swept graph; correct by construction for
every write path.

Accepted residual: a read-only open of an unmigrated legacy graph still lists
`__run__*` (read-only opens must not write, so they can't sweep). Documented.

Regression test `legacy_run_branch_is_swept_on_open_and_does_not_block_schema_apply`
confirmed RED before the fix (panicked on the branch_list leak assertion) and
GREEN after. Also updates the stale schema_apply.rs comment, the writes.md
"Migration code" section, and adds the v3 row to storage.md's migration table.

* test(engine): sweep multiple legacy __run__ branches; doc nit

Strengthen the v2→v3 migration test to synthesize three `__run__*` branches
(a real legacy graph accumulates one per run) so the migration's delete loop
is exercised on a single reused dataset handle, not just a single branch.
Confirms multi-branch deletion is safe.

Also drop a stale "active runs" reference from the branch_delete doc line.

* fix(engine): force-delete in __run__ sweep for concurrency safety

`migrate_v2_to_v3` ran `Dataset::delete_branch` (= `branches().delete(.., false)`),
which errors "BranchContents not found" if the branch is already gone. Since the
sweep now runs in `Omnigraph::open(ReadWrite)`, two processes opening the same
legacy v2 graph concurrently would race: one wins each delete, the other's open
fails. The migration only claimed idempotency under *sequential* retry.

Switch to `Dataset::force_delete_branch` (= `delete(.., true)`), Lance's
documented path for cleaning up zombie branches, which tolerates an
already-absent branch. The sweep is now idempotent under concurrent runners and
robust to partial/zombie state. Found in self-review; no behavior change for the
common single-open path.
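
The difference between the strict and tolerant delete can be sketched with an in-memory model. This is only an illustration, not the engine's code: `Manifest`, `DeleteError`, `list_run_branches`, and `sweep` are hypothetical stand-ins, and the real migration works against Lance branch refs on the `__manifest` dataset. The sketch lists the `__run__*` refs, lets a second opener remove one of them first, and compares how each delete mode reacts.

```rust
use std::collections::BTreeSet;

/// Hypothetical in-memory stand-in for the branch refs on `__manifest`.
#[derive(Debug, Default)]
pub struct Manifest {
    pub stamp: u32,
    pub branches: BTreeSet<String>,
}

#[derive(Debug, PartialEq)]
pub enum DeleteError {
    NotFound(String),
}

impl Manifest {
    /// Strict delete: errors if the branch is already gone.
    pub fn delete_branch(&mut self, name: &str) -> Result<(), DeleteError> {
        if self.branches.remove(name) {
            Ok(())
        } else {
            Err(DeleteError::NotFound(name.to_string()))
        }
    }

    /// Tolerant delete: an already-absent branch counts as success.
    pub fn force_delete_branch(&mut self, name: &str) -> Result<(), DeleteError> {
        self.branches.remove(name);
        Ok(())
    }
}

/// Enumerate the legacy staging refs (the prefix is inlined, as in the commit).
pub fn list_run_branches(m: &Manifest) -> Vec<String> {
    m.branches
        .iter()
        .filter(|b| b.starts_with("__run__"))
        .cloned()
        .collect()
}

/// Delete the listed refs, then bump the stamp to 3.
/// `force` selects the tolerant delete; a graph already at stamp 3 is left alone.
pub fn sweep(m: &mut Manifest, listed: &[String], force: bool) -> Result<(), DeleteError> {
    if m.stamp >= 3 {
        return Ok(());
    }
    for name in listed {
        if force {
            m.force_delete_branch(name)?;
        } else {
            m.delete_branch(name)?;
        }
    }
    m.stamp = 3;
    Ok(())
}
```

In this model, if another opener removes `__run__a` after the list is taken, `sweep(.., false)` returns `NotFound` and leaves the stamp at 2, while `sweep(.., true)` finishes and stamps the graph v3.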

* docs(release): note MR-770 __run__ cleanup in v0.6.1

* docs(branches): reconcile branch cleanup semantics
2026-06-07 18:33:14 +03:00


Omnigraph v0.6.1

v0.6.1 focuses on operational polish after v0.6.0: stored-query registries, safer branch cleanup, more complete release artifacts, and a Lance blob-compaction workaround.

Highlights

  • Stored-query registries. omnigraph.yaml can declare curated queries: blocks per graph. Servers load and type-check them at startup. omnigraph queries validate checks them offline, and omnigraph queries list shows exposed queries with their typed params. GET /queries exposes a typed catalog, and POST /queries/{name} invokes a stored query without accepting ad hoc .gq source from the client.
  • Stored-query policy gate. New Cedar action invoke_query gates the stored-query invocation surface. Stored mutations are double-gated: invoke_query to reach the stored query and change for the actual write.
  • Safer branch deletion. branch_delete now treats the manifest as the authority, flips branch visibility atomically, and reclaims per-table/commit-graph forks as derived state. If best-effort reclaim is interrupted, cleanup reconciles orphaned forks; reusing a branch name before cleanup reports an actionable error.
  • Legacy __run__ cleanup (MR-770). Removed the last functional remnant of the Run state machine (retired in v0.4.0): the __run__ branch-name guard. A new v2→v3 __manifest internal-schema migration sweeps any stale __run__* staging branches on the first read-write open, so __run__* is no longer a reserved branch name. This closes the "unpromoted __run__ branches block reads" condition behind the zombie-run cascade incident; the inert _graph_runs.lance row cleanup is tracked separately (it needs a delete_prefix primitive).
  • Blob-safe optimize. omnigraph optimize skips tables with Blob properties instead of failing the whole sweep on Lance's blob-v2 compaction decode bug. Skips are visible in human output, --json as skipped, TableOptimizeStats.skipped, and logs; non-blob tables still compact normally.
  • Deployment improvements. The container entrypoint now composes OMNIGRAPH_TARGET_URI with OMNIGRAPH_CONFIG, so operators can keep the graph URI in env while loading policy/query config from a mounted file. The local RustFS bootstrap pins RustFS beta.3 and allows the current insecure local-dev default credentials.
  • Windows release support. Tagged and edge releases now publish Windows x86_64 archives containing omnigraph.exe and omnigraph-server.exe, with a PowerShell installer and Windows install docs.
  • Release tooling. Homebrew formula generation was tightened to produce audit-clean formulas.
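
The double gate on stored mutations can be sketched as two sequential checks. This is a toy model under stated assumptions, not the server's Cedar integration: `Policy`, `StoredQuery`, `Denied`, and `authorize_invocation` are hypothetical names, and a real Cedar policy evaluates principals, resources, and context rather than a flat list of allowed actions. Only the action names invoke_query, read, and change come from the notes above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    InvokeQuery,
    Read,
    Change,
}

/// Records which gate rejected the request.
#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    Gate(Action),
}

/// Hypothetical registry entry: a name plus whether the query writes.
pub struct StoredQuery {
    pub name: &'static str,
    pub mutates: bool,
}

/// Hypothetical flattened policy: the actions one principal may perform.
pub struct Policy {
    pub allowed: Vec<Action>,
}

impl Policy {
    fn check(&self, action: Action) -> Result<(), Denied> {
        if self.allowed.contains(&action) {
            Ok(())
        } else {
            Err(Denied::Gate(action))
        }
    }
}

/// Outer gate first (invoke_query), then the inner read or change gate.
pub fn authorize_invocation(policy: &Policy, query: &StoredQuery) -> Result<(), Denied> {
    policy.check(Action::InvokeQuery)?;
    let inner = if query.mutates { Action::Change } else { Action::Read };
    policy.check(inner)
}
```

Under this model a principal allowed only invoke_query and read can run a stored read query, but a stored mutation is rejected at the change gate.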

Compatibility Notes

  • A graph selected by name (--target or server.graph) now uses graphs.<name>.policy and graphs.<name>.queries. Top-level policy / queries blocks are only for anonymous bare-URI single-graph mode; using them with a named graph now fails loudly with migration guidance.
  • mcp.expose defaults to true for stored-query registry entries. Set mcp: { expose: false } for service-only queries that should not appear in the catalog.
  • invoke_query is graph-scoped, not branch-scoped. Branch/snapshot access remains enforced by the inner read / change gate.
  • Legacy __run__ migration. Graphs created before v0.4.0 are migrated automatically on the first read-write open by a v0.6.1 binary (one-time __manifest stamp v2→v3 sweep of stale __run__* branches). No action required. Two caveats: (1) a graph opened read-only still lists any stale __run__* branch until its first read-write open, since the migration is write-path-only like all manifest migrations — long-lived read-only deployments should be opened read-write once after upgrading; (2) the inert _graph_runs.lance / _graph_run_actors.lance dataset bytes are left in place until a future delete_prefix primitive (they are invisible to graph-level state).
  • Blob tables are not compacted until the upstream Lance fix lands, so fragment count and deleted-row space on blob tables are not reclaimed by optimize. Reads, writes, and query results are unaffected; no on-disk migration is required.
  • TableOptimizeStats is now #[non_exhaustive] and gains a skipped: Option<SkipReason> field; the new SkipReason enum is also #[non_exhaustive]. This is a source-level break only for downstream code that constructed TableOptimizeStats with a struct literal, which is rare because the struct is produced by optimize and consumed by reading its fields. Field access is unaffected, and #[non_exhaustive] keeps future additions non-breaking.
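
For downstream code, the #[non_exhaustive] change mostly means reading fields and keeping a wildcard arm when matching on SkipReason. The sketch below is an assumption-laden stand-in, not the engine's definitions: the `engine` module, the `fragments_compacted` field, the `BlobColumns` variant, and `optimize_table` are hypothetical, and only `TableOptimizeStats`, `SkipReason`, and `skipped` come from the notes above.

```rust
pub mod engine {
    /// Stand-in for the engine type; only `skipped` is taken from the release notes.
    #[non_exhaustive]
    #[derive(Debug, Default)]
    pub struct TableOptimizeStats {
        pub fragments_compacted: usize, // hypothetical field
        pub skipped: Option<SkipReason>,
    }

    #[non_exhaustive]
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum SkipReason {
        BlobColumns, // hypothetical variant name
    }

    /// Hypothetical producer: tables with blob properties are skipped, not failed.
    pub fn optimize_table(has_blob: bool, fragments: usize) -> TableOptimizeStats {
        if has_blob {
            TableOptimizeStats {
                fragments_compacted: 0,
                skipped: Some(SkipReason::BlobColumns),
            }
        } else {
            TableOptimizeStats {
                fragments_compacted: fragments,
                skipped: None,
            }
        }
    }
}

/// Consumer side: read fields and keep a wildcard arm for future skip reasons.
/// In a downstream crate the `Some(_)` arm is required by #[non_exhaustive];
/// inside the defining crate the attribute has no effect, hence the `allow`.
#[allow(unreachable_patterns)]
pub fn describe(stats: &engine::TableOptimizeStats) -> String {
    match stats.skipped {
        Some(engine::SkipReason::BlobColumns) => "skipped: blob columns".to_string(),
        Some(_) => "skipped: other reason".to_string(),
        None => format!("compacted {} fragments", stats.fragments_compacted),
    }
}
```

From another crate, building `TableOptimizeStats { .. }` with a struct literal would no longer compile, while `stats.skipped` and `stats.fragments_compacted` keep working as before.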

Docs And Cleanup

  • Public docs were updated for stored queries, policy, server routes, deployment, Windows installation, branch deletion, and maintenance, and the runs docs were renamed to writes.
  • README copy and release documentation were refreshed; older release notes had small typo/wording fixes.