MR-771: demote Run to direct-publish via expected_table_versions CAS

mutate_as and load now write directly to target tables and call the
publisher once at the end with per-table expected versions; the Run
state machine, _graph_runs.lance writers, __run__ staging branches,
and server /runs/* endpoints are removed. Multi-statement mutations
remain atomic at the manifest level via an in-memory MutationStaging
accumulator that gives read-your-writes within a query and a single
publish at the end. Concurrent-writer conflicts surface as
ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the
old DivergentUpdate merge shape. Documents one known limitation in
docs/runs.md: a multi-statement mid-query failure where op-N writes
a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the
manifest until a follow-up introduces per-table Lance branches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ragnor Comerford 2026-04-30 08:52:50 +02:00
parent 4e5374a85e
commit 35be20cb05
28 changed files with 1188 additions and 3216 deletions

docs/releases/v0.4.0.md Normal file

@@ -0,0 +1,89 @@
# Omnigraph v0.4.0
Omnigraph v0.4.0 demotes the Run state machine to commit metadata via the
publisher's CAS, fixing the cancellation hole that motivated MR-771 and
reducing the engine's surface area.
## Highlights
- **Direct-to-target writes (MR-771)**: `mutate_as` and `load` write
directly to the target tables and call
`ManifestBatchPublisher::publish` once at the end with
`expected_table_versions`. No more `__run__<id>` staging branches, no
more `RunRecord` state machine. Cross-table OCC is enforced inside the
publisher's row-level CAS on `__manifest`.
- **Cancellation safety by construction**: a dropped mutation future
leaves no graph-level state — only orphaned Lance fragments, reclaimed
by `omnigraph cleanup`. The "zombie run" cascade documented in
`.context/zombie-run-investigation.md` is gone.
- **Read-your-writes inside multi-statement mutations**: a `.gq` query
that inserts a row and then references it in a later statement of the
same query now sees its own writes via an in-process `MutationStaging`
cache, even though no manifest commit happens between ops.
- **Structured conflict surface**: concurrent writers race through the
publisher's CAS; the loser surfaces as
`ManifestConflictDetails::ExpectedVersionMismatch { table_key,
expected, actual }`. The HTTP server maps this to **409 Conflict** with
a structured `manifest_conflict` body so clients can detect-and-retry
without parsing the message.
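
The publish-time check described in the first and last highlights can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `ToyManifest`, `TableUpdate`, and `ToyConflict` are hypothetical stand-ins, not the real `ManifestBatchPublisher` API, and the table keys and integer versions are illustrative. It shows the one property the notes rely on: every table's expected version is validated before any update is applied, so the losing writer changes nothing.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the conflict detail; field names follow the notes.
#[derive(Debug, PartialEq)]
pub enum ToyConflict {
    ExpectedVersionMismatch { table_key: String, expected: u64, actual: u64 },
}

/// One table's part of a publish: "I read `expected`, I wrote `new_version`".
pub struct TableUpdate {
    pub table_key: String,
    pub expected: u64,
    pub new_version: u64,
}

/// Hypothetical in-memory manifest: table key -> published version.
#[derive(Default)]
pub struct ToyManifest {
    versions: HashMap<String, u64>,
}

impl ToyManifest {
    /// Published version of a table; tables never published report 0.
    pub fn version(&self, table_key: &str) -> u64 {
        self.versions.get(table_key).copied().unwrap_or(0)
    }

    /// Compare-and-swap across tables: check every expectation first,
    /// then apply all updates, so a failed publish leaves no partial state.
    pub fn publish(&mut self, updates: &[TableUpdate]) -> Result<(), ToyConflict> {
        for u in updates {
            let actual = self.version(&u.table_key);
            if actual != u.expected {
                return Err(ToyConflict::ExpectedVersionMismatch {
                    table_key: u.table_key.clone(),
                    expected: u.expected,
                    actual,
                });
            }
        }
        for u in updates {
            self.versions.insert(u.table_key.clone(), u.new_version);
        }
        Ok(())
    }
}
```

Two writers that both read version 0 of the same table and then publish will see exactly one success; the second receives the mismatch with `expected: 0, actual: 1`, and none of its other tables are touched.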
## Removed
This is a breaking release. Pre-0.4.0 / no SLA.
- `omnigraph::db::{RunRecord, RunStatus, RunId}` types and the
`_graph_runs.lance` / `_graph_run_actors.lance` Lance datasets.
- Engine APIs `begin_run`, `begin_run_as`, `publish_run`,
`publish_run_as`, `abort_run`, `fail_run`, `terminate_run`,
`list_runs`, `get_run`.
- HTTP endpoints: `GET /runs`, `GET /runs/{run_id}`, `POST
/runs/{run_id}/publish`, `POST /runs/{run_id}/abort`. The
`RunListOutput` and `RunOutput` schemas are removed from the OpenAPI
document.
- CLI subcommands: `omnigraph run list`, `omnigraph run show`, `omnigraph
run publish`, `omnigraph run abort`. For audit history, use
`omnigraph commit list`, which reads the commit graph.
- Cedar policy actions `run_publish` and `run_abort`. Existing
`policy.yaml` files referencing these actions will fail validation —
remove the rules; the `change` action covers the equivalent gating.
## Behavior changes
- `mutate_as` / `load` are now **atomic per query, single publish at the
end**. A failed mutation publishes nothing: the target's manifest is
unchanged and there are no intermediate manifest commits.
- The `OmniError::manifest_conflict` shape produced by concurrent
writers is now `ExpectedVersionMismatch` (was `MergeConflict::DivergentUpdate`
via the run merge path). Clients that match on the conflict body must
switch to inspecting `manifest_conflict.table_key/expected/actual`.
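
The "atomic per query, single publish at the end" behavior, together with read-your-writes, can be sketched as an overlay of pending rows on top of committed state. Everything here is an assumption made for illustration: `ToyStaging`, `Op`, and `run_query` are hypothetical names, rows are reduced to string ids, and the real `MutationStaging` accumulator may be structured quite differently.

```rust
use std::collections::HashMap;

/// Table name -> row ids (a deliberately tiny model of table contents).
pub type Rows = HashMap<String, Vec<String>>;

/// Hypothetical accumulator: committed state is read-only, writes go to `pending`.
pub struct ToyStaging<'a> {
    committed: &'a Rows,
    pending: Rows,
}

impl<'a> ToyStaging<'a> {
    pub fn new(committed: &'a Rows) -> Self {
        ToyStaging { committed, pending: Rows::new() }
    }

    /// Stage a row; nothing is published yet.
    pub fn insert(&mut self, table: &str, id: &str) {
        self.pending.entry(table.to_string()).or_default().push(id.to_string());
    }

    /// Read-your-writes: a row is visible if it is committed or staged.
    pub fn exists(&self, table: &str, id: &str) -> bool {
        let has = |rows: &Rows| {
            rows.get(table).map_or(false, |ids| ids.iter().any(|i| i.as_str() == id))
        };
        has(self.committed) || has(&self.pending)
    }

    /// Hand everything over for the single publish at the end.
    pub fn finish(self) -> Rows {
        self.pending
    }
}

pub enum Op<'a> {
    InsertNode(&'a str),
    InsertEdge(&'a str, &'a str),
}

/// Runs ops in order. Any failure returns early and the staged rows are
/// dropped, so the caller never sees a partial result.
pub fn run_query(committed: &Rows, ops: &[Op<'_>]) -> Result<Rows, String> {
    let mut staging = ToyStaging::new(committed);
    for op in ops {
        match op {
            Op::InsertNode(id) => staging.insert("node", id),
            Op::InsertEdge(from, to) => {
                for end in [from, to] {
                    if !staging.exists("node", end) {
                        return Err(format!("unknown node {end}"));
                    }
                }
                staging.insert("edge", &format!("{from}->{to}"));
            }
        }
    }
    Ok(staging.finish())
}
```

In this model an edge insert succeeds when its endpoints were inserted earlier in the same query, and a query whose last op fails returns an error without handing anything to the publish step. The model keeps staged rows purely in memory, so it does not reproduce the on-disk drift described under the known limitation.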
## Known limitation
A multi-statement mutation that writes a Lance fragment in op-N and then
fails in op-N+1 leaves the touched table with Lance HEAD ahead of the
manifest. The next mutation against that table fails with
`ExpectedVersionMismatch`. Most validation runs before any Lance write,
so single-statement mutations are unaffected; the narrow path is
multi-statement queries with late-op failures. Tracked as a follow-up;
see [docs/runs.md](../runs.md#known-limitation-mid-query-partial-failure-on-the-same-table)
for the workaround.
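
The drift described above can be illustrated with a toy model of a single table. This is a simplification, not the engine's logic: `ToyTable` and `run_query` are hypothetical, and the model assumes each successful op advances the table's Lance HEAD by exactly one version while the manifest pointer only moves at the final publish.

```rust
/// Hypothetical model of one table: the storage head vs. the version
/// the manifest currently points at.
pub struct ToyTable {
    pub lance_head: u64,
    pub manifest_version: u64,
}

impl ToyTable {
    /// A write lands a fragment immediately, advancing the storage head.
    pub fn write_fragment(&mut self) {
        self.lance_head += 1;
    }

    /// Publishing moves the manifest pointer up to the storage head.
    pub fn publish(&mut self) {
        self.manifest_version = self.lance_head;
    }

    /// How many versions the storage head is ahead of the manifest.
    pub fn drift(&self) -> u64 {
        self.lance_head - self.manifest_version
    }
}

/// Each entry in `ops` says whether that op succeeds. A failing op aborts
/// the query before the publish and reports its index; fragments written
/// by earlier ops stay on disk.
pub fn run_query(table: &mut ToyTable, ops: &[bool]) -> Result<(), usize> {
    for (i, ok) in ops.iter().enumerate() {
        if !*ok {
            return Err(i);
        }
        table.write_fragment();
    }
    table.publish();
    Ok(())
}
```

A query whose first op fails leaves a drift of zero, which matches the note that single-statement mutations are unaffected in the common case. A query that writes in op 0 and fails in op 1 leaves a drift of one until the workaround in `docs/runs.md` is applied.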
## Upgrade notes
- **Stale `__run__*` branches and `_graph_runs.lance`** in legacy v0.3.x
repos are *inert* — the engine no longer reads them — but they remain
on disk until production cleanup. MR-770 owns the destructive sweep;
this release deliberately does not touch legacy bytes.
- The `is_internal_run_branch` predicate is kept as a defense-in-depth
guard against users naming a branch `__run__*`. It will be removed in
a follow-up alongside MR-770.
- External scripts hitting `/runs/*` will now receive 404. Migrate them
to `/commits` for audit history; mutation status is implied by the
HTTP response on `/change` itself.
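
Scripts that move to `/change` can treat a 409 `manifest_conflict` response as retryable and everything else as final. The helper below is a minimal sketch of that loop, not part of Omnigraph: `ChangeError` and `retry_on_conflict` are hypothetical names, and the closure stands in for whatever HTTP client call the script already makes.

```rust
/// Hypothetical client-side view of a failed `/change` call.
#[derive(Debug, PartialEq)]
pub enum ChangeError {
    /// HTTP 409 with a structured `manifest_conflict` body.
    Conflict { table_key: String, expected: u64, actual: u64 },
    /// Any other non-success status code.
    Other(u16),
}

/// Calls `attempt` at least once and retries only on `Conflict`, up to
/// `max_attempts` calls in total. The closure receives the 1-based attempt
/// number and should rebuild the request from fresh state each time.
pub fn retry_on_conflict<T>(
    max_attempts: u32,
    mut attempt: impl FnMut(u32) -> Result<T, ChangeError>,
) -> Result<T, ChangeError> {
    let mut n = 1;
    loop {
        match attempt(n) {
            // A lost CAS race: try again while attempts remain.
            Err(ChangeError::Conflict { .. }) if n < max_attempts => n += 1,
            // Success, a non-conflict error, or the final conflict.
            other => return other,
        }
    }
}
```

Retrying blindly with the same request is only safe when the mutation is idempotent or recomputed from current data; a production client would probably also add a backoff between attempts.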
## Included Changes
- MR-771 — Demote Run: write directly to target via publisher
- MR-766 — `ManifestBatchPublisher::publish` accepts per-table
`expected_table_versions` (landed earlier; this release wires it in
end-to-end)