Merge branch 'main' into ragnorc/omnigraph-mcp-crate

Folds in v0.7.2 (release #301) + RFC-013 Phase 7 (graph lineage in __manifest,
internal schema v3→v4 migration #299; WriteTxn #298; recovery convergence #296)
under the MCP branch.

Conflict resolutions (2 files):
- crates/omnigraph-server/Cargo.toml: take main's 0.7.2 path-dep constraints;
  keep our omnigraph-mcp dep (bumped to 0.7.2).
- docs/releases/v0.8.0.md (add/add): both branches drafted v0.8.0 notes for the
  same next minor — combined them. v0.8.0 now documents BOTH the MCP surface
  (ours) and main's __manifest lineage fold + the breaking internal-schema-v4
  upgrade-order requirement (kept prominent under Upgrade notes). Corrected our
  'no breaking changes / on-disk format unchanged' line, which the v4 migration
  makes false.

Coherence: omnigraph-mcp [package] + Cargo.lock bumped 0.7.1→0.7.2; openapi.json
auto-merged to info.version 0.7.2 (no API-surface drift from the incoming
engine-internal commits). Verification deferred to CI (no local rebuild).
Ragnor Comerford 2026-06-25 15:53:53 +02:00
commit 4d4c2164de
62 changed files with 5898 additions and 1053 deletions


@@ -133,7 +133,7 @@ flowchart TB
subgraph state[graph state]
coord[GraphCoordinator]:::l2
mr[ManifestCoordinator<br/>db/manifest.rs]:::l2
-cg[CommitGraph<br/>_graph_commits.lance]:::l2
+cg[CommitGraph<br/>projection of __manifest graph_commit/graph_head rows]:::l2
stg[MutationStaging<br/>per-query in-memory accumulator<br/>exec/staging.rs]:::l2
end


@@ -0,0 +1,460 @@
# Handoff: finishing RFC-013 (write-path latency + correctness)
**Status:** living handoff. **Source of truth is [`rfc-013-write-path-latency.md`](rfc-013-write-path-latency.md)**;
this doc is the *current-state map + the decisions/validation from the latest work cycle +
the concrete next actions*. When they disagree, the RFC wins (and this doc must be fixed).
**Audience:** the engineer/agent who picks up RFC-013 next.
---
## 0. TL;DR — where we are and what's next
RFC-013 makes the write path fast **and** correct on object storage (217 Lance tables
under one `__manifest` catalog, on R2/S3). It is sequenced as steps; read §9 of the RFC
for the canonical list. Current reality:
**Landed on `main`:**
- **Step 1** — Tier-1 cost gate + the shared `helpers::cost` harness (#288).
- **Step 3a** — opener bypass: write opens go direct (`Dataset::open` by URI + version)
instead of the Lance-namespace builder (#288). **This already banked the dominant
depth win** — see §2 below; it reframes everything.
- **Step 2a** — internal-table compaction: `optimize` now compacts `__manifest` /
`_graph_commits` / `_graph_commit_actors` (#291). Plus the RFC latency-model
correction (#292).
- **Optimize-vs-write race** — optimize survives a cross-process write race on the
same table (#297, **LANDED** — origin/main `6d4606a8`; see §6 for why it's not
redundant with Design A). Step 3b stacks on top of this.
**Open PRs (land these; relationships in §7):**
- **#296** `correctness-by-design-fix` — recovery roll-forward converges on a concurrent
manifest advance (the fix for the flaky `iss-schema-apply-reopen-recovery-race`).
**MERGED to main and integrated into this branch** — the converge helper now threads
Phase-7's manifest-CAS recovery `graph_commit_id` (see `converge_or_defer_roll_forward`).
- **#295** `docs/rfc-013-step-3b` — the step-3b RFC doc.
- **#254** `ragnorc/bug-4-schema-apply-occ` — schema-apply vs optimize false-fail
(same op-class family as #297, logical side).
**Step 3b is DONE** (capture-once `WriteTxn`, schema-once + open-collapse; see §4) on
`rfc-013-step-3b-writetxn-v2`. **Next: Phase 7 (step 4), then the big one — Design A /
`PublishPlan` unification (step 5)** — see §5, the convergent fix for the bug *class* this
area keeps generating, which also absorbs 3b's deferred session-aware write opens.
---
## 1. The corrected mental model (read this before touching anything)
Three reframes from the latest cycle that the older RFC prose may not fully reflect:
### 1a. 3a already won the depth fight → the residual is constant-factor + RTT
Before 3a, the write re-opened each table through the lance-namespace builder ~13×, and
that path was **O(depth)** (it re-opened `__manifest` + `list_table_versions` per open —
**not** a Lance back-walk; the root cause was OmniGraph's own namespace round-trips, not
Lance — validated against Lance source). 3a swapped it for the direct opener, which is
**O(1)** (`from_uri(loc).with_version(N)` = arithmetic path + one HEAD). So:
- The dominant **O(depth) data-table** term is **gone**.
- Step 2a flattened the secondary **internal-table** scan term.
- What remains is the **~110-hop serial backbone × RTT + compute** — a constant in
depth. The latency model is **`wall = (serial_hops + ops/effective_concurrency)·RTT
+ compute`**; on a capped store (R2) the op-count term re-enters wall-clock, on an
unlimited store it parallelizes away. Measured: prod one-row write 27→15.76s after
2a; the remaining 15.76s is the serial backbone — **step 3b's target**, not step 2's.
- Step 3b's win is therefore the **call-count/RTT collapse** (redundant opens, the
flat-46 schema reads), NOT a depth slope. Don't expect a depth-slope improvement from
3b; gate it on the constant-factor (S3 round-trips), not a curve.
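
The latency model above can be written down as a tiny calculator. This is a minimal sketch for reasoning about the shape of the formula only: the function and parameter names are hypothetical (not engine identifiers), and any inputs you feed it are your own estimates, not measurements from this cycle.

```rust
/// Estimated wall-clock seconds for one write, following the handoff's model:
///   wall = (serial_hops + ops / effective_concurrency) * RTT + compute
///
/// All names here are illustrative assumptions, not OmniGraph APIs.
pub fn estimated_wall_secs(
    serial_hops: f64,
    ops: f64,
    effective_concurrency: f64,
    rtt_secs: f64,
    compute_secs: f64,
) -> f64 {
    // On a capped store the `ops / effective_concurrency` term stays visible;
    // as concurrency grows without bound it tends to zero and only the
    // serial backbone (serial_hops * RTT) plus compute remains.
    (serial_hops + ops / effective_concurrency) * rtt_secs + compute_secs
}
```

Passing `f64::INFINITY` as `effective_concurrency` models the "unlimited store" case: the op-count term vanishes and the result is flat in op count, which is why 3b is gated on round-trip counts rather than a depth curve.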
### 1b. Two op classes, two commit models (the §6.6 principle)
Every concurrency bug in this area is **one op class using the other's commit model**:
| class | examples | commutes? | correct commit model |
|---|---|---|---|
| **maintenance** | compaction (`Rewrite`), `optimize_indices` | yes (content-preserving) | Lance native rebase + app reopen/replan on real overlap + **monotonic manifest fast-forward** — no epoch, no read-set |
| **logical mutation** | load / mutate / merge / delete | no (lost-update, write-skew) | strict cross-process OCC: read-set + write-set CAS under the `writer_epoch` fence |
Applying strict OCC + equality-CAS uniformly is the mistake: too strong for maintenance
(false conflicts — #297's bug), too weak for logical cross-process (§6.5 corruption).
### 1c. The root liability (what keeps generating these bugs)
Lance gives **per-table atomic commits** but **no cross-table/cross-step atomicity**, so
every multi-commit op advances per-table Lance HEAD **before** the manifest references it
(the "A-before-B window"). The resulting `HEAD vs manifest` delta is **ambiguous**
(external drift? my own in-flight work? a crashed writer?), and **many uncoordinated code
paths each re-interpret it** (4 writers + the maintenance path + recovery + the write-path
drift guard). Each interpreter is a fresh chance to misclassify. That is the bug class:
- §6.5 cross-process logical corruption,
- #297's own-HEAD-drift misclassification,
- the flaky write-path "HEAD ahead of manifest, run repair" guard,
- the recovery classifier edges.
**The convergent fix is Design A (one publish authority — step 5); Lance MTT eventually
retires the window entirely.** See §5.
### 1d. The second facet: the write base is a stale pin (no probe)
The READ path resolves its base behind a freshness probe (`resolve_target_inner`
omnigraph.rs:~1072 → `probe_latest_incarnation` → `refresh_manifest_only`); the WRITE path
does NOT (`resolved_branch_target` omnigraph.rs:~778 returns the warm `coord.snapshot()` for
the bound branch, no probe). So a long-lived server's write base lags the live manifest. That
single staleness feeds **two distinct failure modes**, both surfaced this cycle:
1. **Stale validation *reads* → integrity under-enforced.** Write-path RI checks read
committed state off the stale base. 3b's collapse #1 made it worse for edge `@card`:
`edge_cardinality_read_handle` (mutation.rs:~614) scans the pinned `txn.base` instead of
live HEAD (was live HEAD pre-3b), so a concurrent edge committed after `txn` capture is
uncounted → a `@card` max can be exceeded (cursor **High** / codex **P1** on #298,
**VALID**). **#298 fix: restore the live-HEAD read for that scan** (un-regress; gate-safe —
the `data_open_count` gate is a node insert) + a deterministic regression test (commit A's
edge, then B validates → must see A) + correct the wrong "pinned base == live HEAD" doc
comment (mutation.rs:~605-613, which assumes a single writer). The *structural* liability
underneath: there is **no unified write-validation read-set** — endpoint
(`ensure_node_id_exists`, warm `snapshot_for_branch`), cardinality (mutation: pinned
`txn.base`; loader: warm `snapshot_for_branch` — the SAME check forks per write path),
commit drift guard (live `fresh_snapshot_for_branch`), and uniqueness
(`enforce_unique_constraints_intra_batch`, intra-batch only — cross-version uniqueness is a
documented gap). Three freshness levels chosen ad hoc, none re-validated at commit → the
§7.1 TOCTOU class, and each new constraint forks the pattern again.
2. **Stale OCC *pin* → false-fail on a maintenance advance.** A served strict update/delete
pins the stale base version, then false-fails `ExpectedVersionMismatch` after an external
`optimize` advanced `__manifest` — even though the advance was content-preserving
compaction the logical write should fast-forward past (invariant 7). It's the **write-side
mirror of #297/§6.6** (#297 made optimize fast-forward past a logical write; this is a
logical write that must fast-forward past optimize). A served read clears it (the read
probes the shared coordinator). Validated repro on prod (omnigraph.ragnor.co) +
`writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes`
(`#[ignore]` on branch `fix/write-path-stale-view-probe`). **The naive "just probe" fix is
proven wrong** — a blanket probe silently refreshes past *logical* advances too, breaking
`consistency::stale_handle_public_mutation_must_refresh_then_retry` (the deliberate
cross-process lost-update OCC primitive). The fix must **discriminate by op class**.
**Both fold into Design A (step 5), same as §1c.** `open_txn`'s one warm probe makes the base
fresh (absorbs maintenance advances cheaply); the **op-class-aware strict precondition**,
derived from Lance's per-version transaction metadata (all `Rewrite`/`ReserveFragments` =
maintenance → fast-forward the pin; any `Append`/`Update`/`Delete`/`Merge` = logical → fail
loudly; NO parallel marker, invariant 1/15), is the correctness fence for anything that lands
after. And the §7.1 read-set-in-CAS unifies the validation read-set + re-validates it under the
`graph_head` contention. So **the stale-view false-fail, the cardinality/validation-read-set
liability, and #297's mirror are one bug** (the write base is a stale, un-probed, un-classified
pin) with **one home: the single PublishPlan delta-interpreter** (§1c + §5). Strong corroboration
of Design A — three symptoms, one fix.
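
The op-class-aware precondition described above can be sketched as a pure classifier over the versions that landed after the pin. This is an illustration under stated assumptions: `OpKind` is a hypothetical stand-in for the operation kinds named in the text (it is not Lance's transaction type), and `classify_advance` and `PinDecision` are invented names.

```rust
/// Hypothetical mirror of the operation kinds named in the handoff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpKind {
    Rewrite,
    ReserveFragments,
    Append,
    Update,
    Delete,
    Merge,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PinDecision {
    /// Every intervening version was maintenance: move the pin forward.
    FastForward(u64),
    /// A logical write landed after the pin: the strict op must fail loudly.
    Conflict { first_logical_version: u64 },
}

/// `intervening` lists (version, op kind) for versions committed after `pinned`.
pub fn classify_advance(pinned: u64, intervening: &[(u64, OpKind)]) -> PinDecision {
    let mut latest = pinned;
    for &(version, kind) in intervening {
        match kind {
            // Content-preserving maintenance: safe to fast-forward past.
            OpKind::Rewrite | OpKind::ReserveFragments => latest = latest.max(version),
            // Logical mutation: never silently refresh past it.
            OpKind::Append | OpKind::Update | OpKind::Delete | OpKind::Merge => {
                return PinDecision::Conflict { first_logical_version: version };
            }
        }
    }
    PinDecision::FastForward(latest)
}
```

The point of the shape is that the decision is derived from per-version metadata alone, so no parallel marker is needed, and a single logical advance anywhere in the gap is enough to keep the lost-update primitive intact.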
---
## 2. Validated facts — do NOT re-derive these
Established this cycle against **Lance 7.0.0 source**
(`~/.cargo/registry/src/index.crates.io-*/lance-7.0.0`) and current engine code. Cited so
you can trust them without re-investigating.
**Lance (upstream):**
- `from_uri(loc).with_version(N).load()` and `checkout_version(N)` are **O(1)** (computed
V2 path `_versions/{u64::MAX-N:020}.manifest` + one HEAD; no listing/back-walk).
(`lance-table/src/io/commit.rs` `default_resolve_version`.)
- A shared `Arc<Session>` (`DatasetBuilder::with_session`) warms metadata/index caches
keyed by `(URI, version, e_tag)`. Caveat: the *first* manifest read on open is uncached
— the Session warms the *scan/index* metadata, not the first open. **`WriteParams` *does*
carry a `session` field** (`lance/src/dataset/write.rs`), but it only matters on the
`WriteDestination::Uri` arm; OmniGraph's staged path always drives off an **already-open
`Dataset`**, and Lance takes the store/session from that handle. So to attach the shared
Session to a write base, open read-style (`open_table_dataset` → `from_uri().with_version()
.with_session()`) and drive the staged write off that handle.
- A held `Arc<Dataset>` at a pinned version is `Send + Sync`, immutable, safe to reuse for
many scans/count/staged-write base in one txn (OmniGraph's `TableHandleCache` already
relies on this).
- **No compaction `RetryExecutor`** (only Delete/MergeInsert/Update have one).
`commit_compaction` commits a fixed `Rewrite` via `apply_commit` direct. In
`commit_transaction`, a semantic `RetryableCommitConflict` **escapes the retry loop**
via `?` at `io/commit.rs:979`; the loop only retries the OCC `CommitConflict`
(`:1096`), and even that re-rebases the *same* transaction (never re-plans). ⇒
**compaction needs app-level reopen+REPLAN; you cannot "set conflict_retries" and let
Lance own it.**
- `check_rewrite_txn`: a `Rewrite` rebases **cleanly** past a concurrent `Append`/disjoint
`Update`/`Delete` (preserving both); only a same-fragment overlap yields a retryable
conflict. ⇒ the common concurrent insert/update/delete is rebased for free; the app
retry fires only on real overlap.
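
To make the "arithmetic path" claim in the first bullet concrete, here is a minimal sketch of the V2 manifest naming pattern exactly as quoted above (`_versions/{u64::MAX-N:020}.manifest`). It is a re-derivation from that one line, not a copy of Lance's code; `v2_manifest_path` is a hypothetical helper name.

```rust
/// Computes the relative manifest path for version `n` using the inverted,
/// zero-padded naming quoted in the handoff. Illustrative only.
pub fn v2_manifest_path(n: u64) -> String {
    // u64::MAX has 20 decimal digits, so `{:020}` yields fixed-width names;
    // inverting the version makes newer versions sort first lexicographically.
    format!("_versions/{:020}.manifest", u64::MAX - n)
}
```

Because the path is a pure function of the version number, opening a pinned version needs no listing or back-walk, only the single HEAD the bullet mentions.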
**Engine (internal):**
- Read path (post-#268) already has the capture-once machinery: `Snapshot` (`db/manifest.rs`),
warm `GraphCoordinator` behind a `latest_version_id`/incarnation probe, a held
`TableHandleCache` keyed `(table,branch,version,e_tag)`, **one shared `Session` per
graph** (`read_caches.session`). **Writes bypass all of it by construction**
(`resolved_branch_target` returns `read_caches: None`; the 3a write opener attaches no
session and opens by latest, not pinned version).
- A single write opens each table **3-4×** (accumulation → staging reopen → commit
drift-guard → publish prepare), each a fresh cold open. `validate_schema_contract`
(`db/schema_state.rs`, via `ensure_schema_state_valid`) runs uncached (~3 `read_text`
+ 2 `exists`) at every resolve point (~the flat-46). Both are constant-factor, flat in
depth — 3b's targets.
- Strict-op guards are the lost-update floor (3 layers: pre-stage `ensure_expected_version`
`table_store.rs`; commit-time strict drift `exec/staging.rs`; publisher CAS
`publisher.rs`). Capture-once **supplies** the pinned operand — never remove a guard.
- Fork-on-first-write authority reads (`classify_fork_ref` → `fresh_snapshot_for_branch`)
must stay **fresh** (not served from a pinned base).
- Cost harness: `helpers::cost` (`measure`/`measure_with_staged`/`IoCounts`/`assert_flat`/
`local_graph`/`s3_graph`). The schema-once assert can reuse `CountingStorageAdapter`
(`warm_read_cost.rs::warm_query_validates_schema_contract_once`) with **zero** prod
change; an open-count assert wants a small `open_count` AtomicU64 in `QueryIoProbes`
(copy the `probe_count`/`record_probe` pattern). The forbidden-API guard
(`tests/forbidden_apis.rs`) makes an instrumentation-level counter complete.
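
A rough sketch of what an instrumentation-level open counter could look like, split by URI class in the spirit of the `data_open_count`/`internal_open_count` scoping. The struct, the method names, and especially the URI classification rule are assumptions made for illustration; the real `QueryIoProbes` layout and table URIs may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical probe struct; not the engine's `QueryIoProbes`.
#[derive(Default)]
pub struct OpenProbes {
    data_open_count: AtomicU64,
    internal_open_count: AtomicU64,
}

impl OpenProbes {
    /// Records one dataset open, bucketed by the last path segment.
    /// ASSUMPTION: internal tables are recognisable by a `__manifest` or
    /// `_graph_commit` name prefix on the final URI segment.
    pub fn record_open(&self, uri: &str) {
        let name = uri.trim_end_matches('/').rsplit('/').next().unwrap_or("");
        let internal = name.starts_with("__manifest") || name.starts_with("_graph_commit");
        let counter = if internal {
            &self.internal_open_count
        } else {
            &self.data_open_count
        };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    pub fn data_opens(&self) -> u64 {
        self.data_open_count.load(Ordering::Relaxed)
    }

    pub fn internal_opens(&self) -> u64 {
        self.internal_open_count.load(Ordering::Relaxed)
    }
}
```

Keeping the two classes separate is what makes a gate such as "data opens at most once" reachable when the same tracked opener also touches internal tables.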
---
## 3. The #297 cycle (this branch) — what it is, and the lesson
`fix-optimize-concurrency-race` (5 commits): a CLI `optimize` racing a served write on the
same table failed (Lance Rewrite lost, or the equality-CAS publish lost). Fix: unify both
compaction paths on the internal path's **reopen+replan** shape, with a **two-level retry**
— outer loop reopens+replans on a real Lance overlap; inner Phase-C loop makes the manifest
publish a **monotonic fast-forward** (advance to compacted version `N`, or no-op when the
manifest already moved to `≥ N`), never the strict equality CAS. Sidecar written once;
in-process queue kept as a contention reducer (not the cross-process guard); no `writer_epoch`.
**Two review rounds surfaced two follow-on bugs I introduced with the retry loop** — both
fixed, both regression-tested (own-HEAD-drift via negative control):
1. **Own-HEAD-drift misclassification** (`56d004e0`): the drift guard re-ran every
iteration and, after a partial Phase-B commit (auto_cleanup strip or compact, then a
later op conflicts), saw `HEAD > manifest` from *our own* covered work and deleted the
sidecar + returned `skipped_for_drift` (stranding uncovered drift). Fix: track
`head_advanced`; the drift guard fires only when `!head_advanced`.
2. **Publish exhaustion spurious error** (`e9d16a2c`): the publish loop returned `Err` on
its final retry even if the conflict meant a concurrent writer already published `≥ N`
(postcondition met). Fix: re-check `current >= state.version` on exhaustion.
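
The monotonic fast-forward publish, including the exhaustion re-check from fix 2, can be sketched against an in-memory stand-in for the manifest version. This is a simplified model: an `AtomicU64` plays the role of the `__manifest` entry, and `publish_monotonic` and `PublishOutcome` are hypothetical names, not the publisher's API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Debug, PartialEq, Eq)]
pub enum PublishOutcome {
    /// We moved the manifest to the compacted version.
    Advanced,
    /// Someone already published a version >= ours: postcondition met, no-op.
    AlreadyAtOrPast,
}

/// Advances `manifest` to `target` unless it is already at or past it.
/// Never regresses a newer version, and never uses a strict equality CAS
/// against a pre-captured expected value.
pub fn publish_monotonic(
    manifest: &AtomicU64,
    target: u64,
    max_attempts: u32,
) -> Result<PublishOutcome, String> {
    for _ in 0..max_attempts {
        let current = manifest.load(Ordering::SeqCst);
        if current >= target {
            return Ok(PublishOutcome::AlreadyAtOrPast);
        }
        // CAS against the value we just read; a loss means a concurrent advance.
        if manifest
            .compare_exchange(current, target, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
        {
            return Ok(PublishOutcome::Advanced);
        }
    }
    // Exhaustion re-check: the last conflict may itself have satisfied us.
    if manifest.load(Ordering::SeqCst) >= target {
        Ok(PublishOutcome::AlreadyAtOrPast)
    } else {
        Err(format!("publish retries exhausted below version {target}"))
    }
}
```

The `>=` pre-check is the only guard in this model, which mirrors the point that dropping it would let an unconditional update regress a newer version.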
**The lesson (write it on the wall):** *wrapping a sequence of side-effecting commits in a
retry silently converts every "checked once, before any side effect" precondition into
"re-checked after partial side effects."* That's a distinct bug class; it needs
fault-injection tests **at each commit boundary**, not just end-to-end concurrency tests.
(The `optimize.before_compact` / `optimize.inject_reindex_conflict` failpoints exist for
exactly this.)
**Temporary mechanism flag:** `head_advanced` is an in-memory proxy for "is this HEAD
movement mine." Under Design A the authority answers that from the plan/sidecar **identity**
— so `head_advanced` is the part that gets *replaced*, while the monotonic-publish +
reopen/replan **semantics** are permanent. (Noted in RFC §6.6.)
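
As a small illustration of the `head_advanced` gate, the sketch below classifies a `HEAD vs manifest` delta inside a retry loop. It is deliberately reduced to three inputs; the enum and function names are hypothetical, and the real guard carries more context (sidecar, covered tables) than shown here.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DriftVerdict {
    /// HEAD is not ahead of the manifest: nothing to interpret.
    NoDrift,
    /// HEAD is ahead and we have not committed yet: treat as external drift.
    ForeignDrift,
    /// HEAD is ahead because of our own partial commits: keep going.
    OwnInFlightWork,
}

/// Interprets the delta only in light of "have *we* committed yet".
pub fn classify_head_delta(
    lance_head: u64,
    manifest_version: u64,
    head_advanced: bool,
) -> DriftVerdict {
    if lance_head <= manifest_version {
        DriftVerdict::NoDrift
    } else if head_advanced {
        DriftVerdict::OwnInFlightWork
    } else {
        DriftVerdict::ForeignDrift
    }
}
```

Without the `head_advanced` input, the second iteration of a retry would see its own earlier commit as foreign drift, which is the misclassification described in fix 1.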
---
## 4. DONE: Step 3b — capture-once `WriteTxn` (shipped on `rfc-013-step-3b-writetxn-v2`)
**Delivered:** on the **table-touch hot path**, a single `mutate`/`load` validates the schema
contract **once** and opens each touched data table **at most once** — a constant-factor/RTT
win (not a depth-slope win; 1a). Two cost gates in `write_cost.rs` lock it (both on a node
insert): `write_validates_schema_contract_once` (3 `read_text` / 2 `exists`, was 12/9) and
`keyed_insert_opens_table_at_most_once` (`data_open_count <= 1`, was 4). The carrier is the
minimal `WriteTxn { branch, base }`, threaded as `Option<&WriteTxn>` (`Some` on the hot
mutate/load path, `None` byte-identical everywhere else); it **converges into** step 5's
`PublishPlan`.
**Not "once" everywhere (scope, not regression):** edge endpoint / cardinality RI validation
(`ensure_node_id_exists`, the loader's RI + cardinality) still resolves through
`snapshot_for_branch` and re-validates the schema — and reads **warm**, not live. Threading
`txn.base` there to make it "once" would re-introduce the stale-read class the #298 cardinality
fix removed (it now reads live HEAD). Doing schema-once *and* fresh reads for those validations
needs the unified, re-checked read-set — **step 4 §7.1** (§1d). So #298 **un-regresses
cardinality only; it does not close write-validation freshness.** No edge-insert/load schema-once
gate yet (only the node gates above).
Commits (off merged-#297 main):
- **Stage 0**: scope `open_count` → `data_open_count`/`internal_open_count` by URI class
(the review fix: `open_dataset_tracked` also opens `__manifest`/`_graph_commits`, so the
raw counter conflated them and the gate was unreachable). Re-baselined RED 4.
- **Commit A (schema-once)** — capture `txn` once at entry (the single validation); the 4
validation sites collapse: S1 (entry `ensure_schema_state_valid`) removed; S3a
(`open_for_mutation_on_branch`) + S3b (`prepare_updates_for_commit`) source `txn.base`;
S4 (`commit_all`) uses new `fresh_snapshot_for_branch_unchecked` (the OCC manifest re-read
minus the schema re-validation). `fresh_snapshot_for_branch{,_unchecked}` now read the
manifest directly via `ManifestCoordinator` (drops a spurious commit-graph `exists` probe;
same `Snapshot`).
- **Commit B (open collapse 4→1)**: #1 accumulation open ELIMINATED (the node path discarded
the handle; read `txn.base.entry().table_version`); #2 staging open KEPT (the one open);
#3 commit drift-guard reads live HEAD via `entry.dataset.dataset().latest_version_id()` (a
cheap manifest-pointer probe off the staged handle, not a fresh open); #4 index build reuses
the `commit_staged` handle threaded through `CommittedMutation`/`prepare_updates_for_commit`.
- **Commit B.1 + cleanup** — named the two positional returns (`OpenedForMutation`,
`CommittedMutation`) + a `debug_assert` pinning the open-skip contract; **removed the
unearned `WriteTxn.session` field** (the collapse uses skip/probe/reuse, not a session).
**RFC §4.1 corrections — how they resolved:**
1. *Thread the evolving handle, not a version-keyed cache* → realized as collapse #4 (carry
the `commit_staged` handle forward into the index build).
2. *Don't forbid re-resolution* → honored: the commit-time OCC re-read
(`fresh_snapshot_for_branch_unchecked` — fresh manifest, only schema-revalidation dropped)
and the fork-authority reads stay fresh.
3. *Minimal carrier* → `WriteTxn { branch, base }` (even the `session` from the original
sketch was dropped as unearned).
**Deferred to step 5 (NOT in this PR):** session-aware write base opens. The one remaining
open (#2) stays a HEAD open; warming the shared `Session` across writes is an object-store
(S3) phenomenon invisible on local FS, so it earns its own `write_cost_s3.rs` gate in step 5,
where `txn` becomes the non-optional publish carrier. No new concurrency test was needed here:
#2 stays a HEAD open (no pinned+session base introduced), so the publisher CAS + #3 live-HEAD
probe fences are unchanged (covered by the green `writes.rs`/`consistency.rs`).
**Guardrails (don't regress):** schema validation is deliberately uncached for drift
detection — collapse to 1 *per write*, never cache across writes on a long-lived handle
(`lifecycle::long_lived_handle_rejects_schema_*`). The commit-time fresh read is OCC
machinery, not redundancy. Keep all 3 strict-op guards. Keep fork-authority reads fresh.
Pin the *correct* branch (server-bound-to-main writing a feature branch falls to a fresh
open). A branch `rfc-013-step-3b-writetxn` exists off an earlier main; rebase onto the
post-#297 main before starting.
---
## 5. Design A — the `PublishPlan` unification (step 5) = the convergent fix
**This is the real fix for the bug class in §1c.** Collapse the four hand-rolled writers +
the maintenance path into **one `publish(txn, plan)` authority** where the CAS + bounded
retry is **unconditional and unbypassable** (no caller can "hold the queue → skip the CAS").
Properties:
- **One interpreter of the `HEAD vs manifest` delta** — and "is this my work?" is answered
by the plan/sidecar **identity**, not a re-derived comparison. The own-HEAD-drift bug, the
§6.5 writers, the write-path guard — all close *by construction*.
- **Recovery = the same `PublishPlan` re-applied** — the crash-recovery interpreter and the
live interpreter become the same code (`iss-merge-recovery-partial-rollforward` gone).
- Each `TableAction` commits by its **class** (§1b): `Rewrite` = maintenance (Lance rebase
+ reopen/replan + monotonic fast-forward, **no epoch**); load/mutate = logical (strict OCC
+ `writer_epoch`).
**Why it composes with Lance MTT (don't over-build):**
- The **unification itself is convergent** — when MTT lands, it slots *underneath* the same
authority; nothing wasted. Build this.
- The **`writer_epoch`** is the one MTT-redundant piece (MTT's commit-handler lease subsumes
a cross-process fence). Build it *last and minimally*, gated on actually deploying
multi-writer topologies. Per the deny-list, don't reimplement what the substrate will own.
**Sequencing judgment (this cycle's strongest signal):** the bug density here (this PR alone
= 3 review rounds, all "a writer re-interprets the delta") means the current N-writers interim
is high integrated-over-time liability. **Consider pulling the *convergent half* of step 5
(the single authority + recovery-as-plan) forward — possibly ahead of 3b** — because it stops
the bug class rather than patching instances. #297 + #254 are the *de-risking inputs*: they
validate the maintenance-class and logical-class commit models in isolation first, so Design
A implements a known spec rather than designing under refactor pressure. Do NOT build more
substrate-shaped scaffolding (custom WAL / job queue / second coordination table) to paper
over the window — strictly higher liability than either Design A or waiting for MTT.
**Deeper-than-A (post-MTT or as Lance exposes uncommitted variants):** all-uncommitted-fragments
+ one manifest commit would shrink the A-before-B window itself, blocked today by Lance not
exposing uncommitted variants for `compact_files` / `optimize_indices` / vector index (#6666
open; delete #6658 shipped). Track, don't build yet.
### 5.1 Step-5 design constraints inherited from the #295 spec review
3b shipped a **minimal** `WriteTxn { branch, base }` (schema-once + open-collapse via
eliminate/probe/thread) and **deferred** the full §4.1 opener-unification — the pinned-base
opener, the shared-`Session` open, the write-local **handle cache**, and the strict-op
conflict-timing move — to step 5. So the greptile-bot comments on the #295 *spec* were **moot
for #298** (which built none of those constructs) but are **load-bearing constraints for step
5** when it builds them. Bank them:
1. **Handle cache must be `Send + Sync`** (`Mutex<HashMap<…, Dataset>>`, not `RefCell`) if
`WriteTxn::open(&self)` is shared across concurrent stage futures — a `RefCell` compiles
but panics when two stages poll. Or make it `&mut self` (no parallel-stage sharing). This
is the deny-list "in-process-only `Dataset` impls — `Send + Sync`" item.
2. **The strict-op timing move needs an explicit retry contract.** If step 5 moves
strict-op conflict detection from open-time `ensure_expected_version` to commit-time CAS
(the §4.1 pinned-base design), it MUST specify: the txn is **discarded after any commit**
(success or conflict — the handle cache is commit-invalidated), and the retry **re-opens a
fresh `WriteTxn` at the new HEAD** (never re-stages against the stale pinned base — that
reproduces the lost-update). **This is the same retry/refresh contract as the stale-view
false-fail (§1d.2)** — the op-class-aware precondition + "fresh base on retry" are one
design point. Today (#298) strict ops keep open-at-HEAD + `ensure_expected_version`, so the
contract is unchanged; step 5 owns it the moment it pins strict reads to the base.
3. **The opener-equivalence test must be non-trivial.** A differential test that only passes
when `HEAD == base` proves nothing about pinning. To actually prove "`WriteTxn::open`
returns the pinned base, not HEAD," the test must **advance the branch HEAD externally
(direct Lance write), then assert the txn open still reads the base version** — and that a
strict write then fails `ExpectedVersionMismatch` at commit (verifying the timing move).
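
Constraint 1 can be checked at compile time. The following is a minimal sketch of a `Mutex`-backed, version-keyed handle cache plus a `Send + Sync` assertion; `Handle` is a placeholder for an opened dataset, and `HandleCache`, `get_or_open`, and `assert_send_sync` are hypothetical names, not the step-5 design.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Placeholder for an opened, version-pinned dataset handle.
pub struct Handle {
    pub table: String,
    pub version: u64,
}

/// Write-local cache keyed by (table, version), shareable through `&self`.
#[derive(Default)]
pub struct HandleCache {
    inner: Mutex<HashMap<(String, u64), Arc<Handle>>>,
}

impl HandleCache {
    /// Returns the cached handle or opens it once. The lock is held across
    /// `open` in this sketch, which is simple but serializes concurrent opens.
    pub fn get_or_open(
        &self,
        table: &str,
        version: u64,
        open: impl FnOnce() -> Handle,
    ) -> Arc<Handle> {
        let mut map = self.inner.lock().unwrap();
        map.entry((table.to_string(), version))
            .or_insert_with(|| Arc::new(open()))
            .clone()
    }
}

/// Compiles only for types that are `Send + Sync`.
pub fn assert_send_sync<T: Send + Sync>() {}
```

Calling `assert_send_sync::<HandleCache>()` in a test turns the constraint into a build failure if a later change swaps the `Mutex` for a `RefCell`. An async-aware lock or per-key locking may be preferable if opens are slow; that choice is outside what the handoff specifies.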
---
## 6. Why #297 is still needed even if you do Design A
- Design A **relocates** #297's maintenance-class commit logic into the authority's
`TableAction::Rewrite` path; it does not eliminate it. #297 is the *validated spec + tests*.
- The two regression tests + §6.6 are the **contract** Design A must keep green.
- The prod bug is **live**; Design A is the largest write-path change in the RFC. Don't hold a
correctness fix hostage to a big refactor, and don't do a big refactor under bug-fix urgency.
- Genuinely throwaway under Design A: only the loop's *location* + the `head_advanced` proxy
(~a dozen lines). Everything else relocates or persists. **#297 LANDED.**
---
## 7. Open PRs and their relationships
- **#297** — maintenance-class fix (optimize vs write). **LANDED** (origin/main `6d4606a8`);
step 3b stacks on it.
- **#254** — logical-class fix (schema-apply vs optimize false-fail). Same op-class family;
both are de-risking inputs for Design A's per-class commit models.
- **#296** — recovery roll-forward converges on concurrent manifest advance. The fix
for the flaky `iss-schema-apply-reopen-recovery-race`. It touches `recovery.rs` and is
*aligned* with #297's "postcondition is the state, not winning the CAS" principle. **#296
landed on main first and is merged into this branch:** the converge helper
(`converge_or_defer_roll_forward`) was reconciled with Phase-7's manifest-CAS roll-forward —
on convergence the audit references the winner's folded `graph_commit_id` (the current
`graph_head`), not a freshly minted one.
- **#295** — the step-3b RFC doc (apply §4's three corrections to it).
---
## 8. Remaining RFC steps after 3b (RFC §9 is canonical)
- **#298 follow-up (do on the 3b PR, before merge): the edge-`@card` stale-read regression**
(§1d.1). Restore the live-HEAD cardinality scan, add the deterministic regression test, fix
the wrong doc comment. Small, gate-safe, un-regresses an integrity check (invariant 9). The
residual concurrent TOCTOU is the §7.1 gap (step 4) — un-widen here, don't over-reach.
- **Step 4 / Phase 7** (`iss-991`): lineage into `__manifest` (publish `graph_commit` +
mutable `graph_head:<branch>` in the same merge-insert; `_graph_commits` becomes a
projection). Removes the per-write `commit_graph.refresh`; closes the manifest→commit-graph
atomicity + commit-graph-parent-under-concurrency gaps. **Hard prereq: step 2 (done).**
Carries the §7.1 *concurrent* write-skew fix (needs the `graph_head` contention row) —
**frame §7.1 as "unify the entire write-validation read-set" (endpoint + cardinality +
cross-version uniqueness), not merely "add `graph_head`"** (§1d.1): the bespoke
`edge_cardinality_read_handle` and the mutation-vs-loader freshness fork dissolve into one
pinned read-set re-validated under the `graph_head` contention, or the liability survives as
a second special-case.
- **Step 5 / Design A** — §5 above. **Acceptance item: the served-strict-write stale-view
false-fail** (§1d.2) — the op-class-aware precondition + `open_txn` probe. The contract is
two tests passing *together*: un-ignore
`writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes` (goes green)
*while* `consistency::stale_handle_public_mutation_must_refresh_then_retry` stays green
(maintenance fast-forwards; logical fails loudly). Self-contained enough to ship standalone
like #297 if prod pain is acute; otherwise fold into the single PublishPlan delta-interpreter.
- **Step 2b** — internal-table cleanup + the Q8 monotonic watermark (a Lance boundary tag).
Deferred: only the secondary version-count/space term, touches the read/open path, and is
MTT-redundant. Land when version-count cost bites.
- **§7.1 sequential write-skew** (`iss-overwrite-orphans-committed-edges`) — inbound-RI
validation on node removal; independent, ships anytime.
- **#20** — the prod per-write `storage.ops` span metric (RFC §5.3), still owed.
- Branch ops: Lance `Clone` for create (`iss-691`).
---
## 9. Gotchas / traps (learned the hard way)
- **In-process queue ≠ cross-process lock.** Any "I hold the queue → skip the retry/CAS"
reasoning is a bug across processes. This is the recurring trap.
- **Monotonic publish must be `≥`-conditional, never "no assertion."** The `__manifest`
merge-insert is unconditional `UpdateAll` keyed on `object_id` (`publisher.rs:379`), so
the equality (or monotonic) pre-check is the *only* guard — dropping it lets `UpdateAll`
regress a newer version = lost write.
- **The drift guard interprets an ambiguous delta.** Re-evaluating it in a retry over
self-mutated state is how #297's follow-on bug happened. Gate any HEAD-vs-manifest
interpretation on "have *we* committed yet."
- **`compact_files` fires Lance's auto_cleanup GC hook** (commits with
`skip_auto_cleanup=false`, no override) — optimize strips stale `lance.auto_cleanup.*`
config before compacting to stay non-destructive on upgraded graphs. The strip is a
separate commit (relevant to the partial-commit retry trap).
- **Lance rebases the common concurrent case for free** — so the data-table conflict usually
surfaces as the manifest fast-forward, not a Lance error. The Lance-Rewrite-overlap path is
rare and needs failpoint injection to test.
---
## 10. Verification (the gate)
- `cargo test --workspace --locked` — the canonical gate (matches CI).
- `cargo test -p omnigraph-engine --features failpoints --test failpoints optimize`:
the optimize concurrency/recovery tests.
- `cargo test -p omnigraph-engine --test write_cost` / `write_cost_s3` (bucket-gated) —
cost gates (3b adds the schema-once + open-count asserts here).
- `cargo test -p omnigraph-engine --test maintenance` — optimize/repair/cleanup.
- Re-read [`invariants.md`](invariants.md), [`lance.md`](lance.md), [`testing.md`](testing.md)
before each change (always-on requirement).
Lance source for re-validation:
`/Users/ragnor/.cargo/registry/src/index.crates.io-*/lance-7.0.0` (key files: `io/commit.rs`,
`io/commit/conflict_resolver.rs`, `dataset/optimize.rs`, `dataset/write/retry.rs`,
`dataset/builder.rs`).


@ -93,6 +93,7 @@ Working documents for in-flight feature work. Removed when the work lands.
| CLI refactoring — one addressing & config model post-`omnigraph.yaml`: scope + `--graph` + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) | [rfc-011-cli-refactoring.md](rfc-011-cli-refactoring.md) |
| Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
| Write-path latency — capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) |
| RFC-013 handoff — current-state map, latest validation, and concrete next actions for finishing write-path latency and correctness work | [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) |
## Boundary


@ -211,10 +211,21 @@ them explicit.
sweep has the same exposure, and always has): it may roll a live foreign
writer's sidecar forward, which degrades to publisher-CAS contention for
data writes but can race the schema-staging promotion for a foreign live
schema apply. Multi-process writers on one graph are already documented
one-winner-CAS territory; closing this fully needs a cross-process
serialization primitive (e.g. lease-based use of the schema-apply lock
branch) — design it before promoting multi-process write topologies.
schema apply. The roll-**forward** CAS contention is now
convergence-idempotent: when the publish loses the CAS to a concurrent
writer that already reached the sidecar's goal, the sweep treats it as
convergence (record the `RolledForward` audit + delete) rather than a fatal
`ExpectedVersionMismatch`, and defers when the manifest is only partway
(`converge_or_defer_roll_forward` in `db/manifest/recovery.rs`;
iss-schema-apply-reopen-recovery-race). So a concurrent advance no longer
fails the open. The schema-staging promotion race and the destructive
roll-**back** path (Lance `Restore` "trumps" a concurrent commit, so it
cannot be made idempotent — iss-recovery-sweep-live-writer-rollback) still
need the cross-process primitive. Multi-process writers on one graph are
already documented one-winner-CAS territory; closing this fully needs a
cross-process serialization primitive (e.g. lease-based use of the
schema-apply lock branch) — design it before promoting multi-process write
topologies.
- **Fork reclaim is in-process-safe only:** the first write to a table on a
branch forks it (a Lance `create_branch` that advances state before the
manifest publish). An interrupted fork (crash, or a cancelled request
@ -242,20 +253,43 @@ them explicit.
acknowledged-before-visible bug this branch fixed. Close it (local CAS
primitive, or a trait-level lock requirement) before admitting any
lock-free `if_match` caller.
- **Manifest→commit-graph publish atomicity:** a graph commit advances
`__manifest` (the visibility authority) and then appends `_graph_commits` as
two separate writes (`commit_updates_with_actor_with_expected`, failpoint
`graph_publish.before_commit_append`). A crash between them leaves the manifest
at version N with no commit-graph row for N. Live reads and durability are
unaffected — the live version resolves via the manifest
(`GraphCoordinator::version()`), not the commit-graph head — and the open-time
recovery sweep does NOT repair it (`lance_head == manifest_pinned` classifies
`NoMovement`; a recovery sidecar would not change this). Impact is bounded to
commit history: `commit list` misses N, time-travel by commit id to N fails,
and merge-base loses a node (a likely-benign off-by-one re-merge). This affects
every publish, not a specific maintenance command. Eventual fix: make the
commit graph reconcilable from the manifest (or the two writes atomic) — not a
recovery-sidecar concern.
- **Manifest→commit-graph publish atomicity — CLOSED (RFC-013 Phase 7):** graph
lineage now lives ONLY in `__manifest`, as `graph_commit` + `graph_head:<branch>`
rows written in the SAME `MergeInsertBuilder` commit as the table-version rows
(`commit_changes_with_lineage` → `GraphNamespacePublisher::publish` with a
`LineageIntent`). There is no second write to fail between — a graph commit and
its lineage land at one manifest version atomically, so a crash after the publish
leaves no gap. The commit-graph cache is a derived projection of those manifest
rows; nothing writes `_graph_commits.lance` (it persists only to carry branch
refs). The prior two-write gap (manifest at N with no `_graph_commits` row for N)
is gone by construction. A graph created before Phase 7 (internal schema v3)
carries its lineage only in `_graph_commits.lance`; the `migrate_v3_to_v4`
internal-schema step (`db/manifest/migrations.rs`) backfills it into `__manifest`
per-branch on the first read-write open (idempotent, crash-safe, data-preserving),
and a read-only open of an un-migrated v3 graph sources the DAG from
`_graph_commits.lance` via a stamp-gated transitional fallback so reads stay
correct until the first write migrates it. An old binary refuses a v4-stamped
graph (read-write and read-only) with the standard upgrade error. The migration
is **loud on failure and concurrent-runner idempotent**: the legacy-open read
(`read_legacy_commit_cache`) treats only a genuine not-found as "no legacy data"
and propagates any other open error (so a transient/corrupt open can never stamp
v4 over an empty backfill — orphaning lineage permanently), and the backfill
converges all-or-nothing when two runners open the same legacy graph at once — a
bounded re-open retry on the `graph_head:<branch>` row-level CAS plus an
idempotent terminal stamp bump (both runners write the same value, so a concurrent
`UpdateConfig`/`IncompatibleTransaction` loss re-opens and no-ops if the stamp
already landed). The branch read path (`load_commit_cache_for_branch`) also
refuses an out-of-range branch stamp (`> CURRENT` or `< MIN_SUPPORTED`;
defense-in-depth; not a live hole because migrations run main-first, so main
refuses first). The migration chain is **floor-bounded**:
`MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (migrations.rs; 1 today, a pure no-op) is
the oldest stamp this binary opens, enforced symmetrically with the ceiling by the
single `refuse_if_stamp_unsupported` guard at all three stamp-read sites
(write-path migrate, read-only open, branch lineage-read). Raising MIN sheds the
now-dead `migrate_vN_…` arms and (at MIN ≥ 4) the `commit_graph_legacy_v3` legacy
readers; a compile-time tripwire (`LOWEST_REGISTERED_MIGRATION_SOURCE`) fails the
build if the floor and the lowest registered arm drift. Retirement runbook lives on
the `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` doc-comment.
- **Planner capability/stat surfaces:** cost-aware planning, complete
capability advertisement, and explain-with-cost are roadmap. Do not describe
them as implemented.
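
The floor-and-ceiling stamp guard and the one-arm-at-a-time stepping described in the migration bullet above can be sketched as follows. This is an illustrative model only, not the contents of `db/manifest/migrations.rs`: the function names, the error type, and the step labels are hypothetical, and the actual work of each arm is elided. The floor (1) and the current version (4) are taken from the text.

```rust
/// Oldest and newest internal-schema stamps this sketch accepts (from the text).
pub const MIN_SUPPORTED: u32 = 1;
pub const CURRENT: u32 = 4;

#[derive(Debug, PartialEq)]
pub enum MigrateError {
    /// Stamp is below the floor: this binary no longer carries the arms for it.
    TooOld(u32),
    /// Stamp is above the ceiling: the graph was written by a newer binary.
    TooNew(u32),
}

/// Single guard enforcing floor and ceiling symmetrically.
pub fn refuse_if_unsupported(stamp: u32) -> Result<(), MigrateError> {
    if stamp < MIN_SUPPORTED {
        Err(MigrateError::TooOld(stamp))
    } else if stamp > CURRENT {
        Err(MigrateError::TooNew(stamp))
    } else {
        Ok(())
    }
}

/// Step the stamp forward one arm at a time and return the steps applied.
/// Each arm does its work first and bumps the stamp last, so a crash
/// mid-step leaves the old stamp in place and the step is retried.
pub fn migrate(stamp: &mut u32) -> Result<Vec<&'static str>, MigrateError> {
    refuse_if_unsupported(*stamp)?;
    let mut applied = Vec::new();
    while *stamp < CURRENT {
        let step = match *stamp {
            1 => "v1->v2 (work not described here)",
            2 => "v2->v3 (delete legacy __run__* branches)",
            3 => "v3->v4 (backfill lineage into __manifest)",
            other => unreachable!("no migration arm for stamp {}", other),
        };
        applied.push(step); // work first (elided in this sketch)
        *stamp += 1; // stamp bump last
    }
    Ok(applied)
}
```

Calling `migrate` on a stamp of 3 applies one step and leaves the stamp at 4; calling it again applies nothing, which is the idempotence the text relies on.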
@ -291,19 +325,23 @@ them explicit.
in history; but they are not yet brought into `cleanup` (version GC), so the
`_versions/` chain still grows until an explicit cleanup (the cleanup half is
deferred — it needs the Q8 cleanup-resurrection watermark first). The commit
graph is not yet reconcilable from the manifest; and the traversal id-map is
graph IS now reconcilable from the manifest (RFC-013 Phase 7 — it is a pure
projection of the `graph_commit`/`graph_head` rows); the traversal id-map is
still rebuilt.
- **Commit-graph parent under concurrency:** `record_graph_commit` now refreshes
the commit-graph head from storage before appending, so a same-branch write
after an external commit no longer forks the commit DAG by parenting off a
stale cached head (the single-process fork, pre-existing for non-strict
inserts and widened to strict ops by Fix 1's `refresh_manifest_only`, is now
closed). Residual: two processes writing disjoint tables can still pass their
per-table manifest CAS and append off the same parent (a refresh-then-append
TOCTOU). The convergent fix is reconcile-from-manifest (parent = the commit at
the manifest version the publisher CAS'd against; `manifest_version` is on
every commit row), composing with the manifest-to-commit-graph atomicity gap;
it needs commit-graph append ordering or a Lance append-CAS to fully close.
- **Commit-graph parent under concurrency — CLOSED (RFC-013 Phase 7):** the graph
commit is now recorded in the manifest publish CAS, and the publisher resolves
the new commit's parent INSIDE its retry loop, per attempt, from the just-loaded
`__manifest` (the `should_replace_head` winner over the visible `graph_commit`
rows). A CAS-conflict retry re-reads the advanced head and parents correctly, so
the refresh-then-append TOCTOU is gone. Two processes writing disjoint tables on
the same branch now also contend on the shared `graph_head:<branch>` row (one
`object_id`, `WhenMatched::UpdateAll`): one wins, the other retries and re-parents
— so the cross-process disjoint-table fork is closed too. This is the intended
§7.1 contention point, pinned by
`manifest::tests::concurrent_disjoint_writes_share_head_and_form_linear_chain`
(two disjoint writers → both commit, single linear chain) and
`manifest::tests::n_concurrent_disjoint_writers_converge_to_one_linear_chain`
(N=8 disjoint writers with app-level retry → one linear chain of 8, no fork).
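
The "resolve the parent inside the retry loop, per attempt" rule above can be illustrated with a toy model. This is a sketch under stated assumptions, not the publisher: `Store`, `try_publish`, and `commit_with_retry` are hypothetical names, a `Mutex` stands in for the manifest CAS, and the retry here is unbounded for brevity.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Clone, Debug)]
pub struct Commit {
    pub id: u64,
    pub parent: Option<u64>,
}

#[derive(Default)]
pub struct Manifest {
    version: u64,
    head: Option<u64>,
    commits: Vec<Commit>,
}

/// Toy manifest store; the mutex makes each call atomic, like one CAS.
pub struct Store(Mutex<Manifest>);

impl Store {
    pub fn new() -> Self {
        Store(Mutex::new(Manifest::default()))
    }

    /// Read a snapshot: (manifest version, current branch head).
    pub fn load(&self) -> (u64, Option<u64>) {
        let m = self.0.lock().unwrap();
        (m.version, m.head)
    }

    /// Publish only if the manifest is still at `expected_version`.
    pub fn try_publish(&self, expected_version: u64, commit: Commit) -> bool {
        let mut m = self.0.lock().unwrap();
        if m.version != expected_version {
            return false; // lost the CAS: someone else advanced the head
        }
        m.version += 1;
        m.head = Some(commit.id);
        m.commits.push(commit);
        true
    }

    pub fn commits(&self) -> Vec<Commit> {
        self.0.lock().unwrap().commits.clone()
    }
}

/// Re-read the head on EVERY attempt, so a retry parents off the winner.
pub fn commit_with_retry(store: &Store, id: u64) -> u32 {
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        let (version, head) = store.load();
        if store.try_publish(version, Commit { id, parent: head }) {
            return attempts;
        }
    }
}

/// Run `n` concurrent writers against one fresh store.
pub fn run_writers(n: u64) -> Vec<Commit> {
    let store = Arc::new(Store::new());
    let handles: Vec<_> = (1..=n)
        .map(|id| {
            let s = Arc::clone(&store);
            thread::spawn(move || {
                commit_with_retry(&s, id);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    store.commits()
}

/// True when every commit's parent is the commit published just before it.
pub fn is_linear_chain(commits: &[Commit]) -> bool {
    commits.iter().enumerate().all(|(i, c)| match i {
        0 => c.parent.is_none(),
        _ => c.parent == Some(commits[i - 1].id),
    })
}
```

If the parent were captured once outside the loop, a writer that lost the CAS would publish with a stale parent and fork the chain; re-reading inside the loop is what keeps it linear.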
## Deny-list


@ -170,6 +170,7 @@ Migration from Lance 6.0.1 → 7.0.0 landed in this cycle. **Arrow stayed 58, Da
- **Native `DirectoryNamespace` no longer recognizes omnigraph's manifest-tracked tables** (`lance-namespace-impls` dir.rs ~L1310): `list/describe/create_table_version` route through `check_table_status`, which reports an omnigraph table absent → `TableNotFound`. The decoupling is *contingent on omnigraph's legacy boolean PK key*, not an unconditional v7 property: v7's namespace eagerly adds the new `lance-schema:unenforced-primary-key:position` key to any `__manifest` lacking it; that write hits the immutable-PK rule above (the boolean key already set the PK), so `ensure_manifest_table_up_to_date` errors and the namespace silently falls back to directory listing. omnigraph keeps the boolean key deliberately — Lance honors it permanently (maps to PK position 0), and one uniform on-disk format beats a new-vs-old split (existing graphs can't be re-keyed to the position key under that same immutability rule). omnigraph production never uses Lance's native namespace (its publisher writes `__manifest` directly via merge_insert; its own `namespace.rs` impls are custom), so this is test-only — the `test_directory_namespace_direct_publish_cannot_replace_native_omnigraph_write_path` surface guard was realigned to the v7 behavior (it now asserts the native namespace is fully decoupled, which only strengthens the guard's thesis).
- **Still NOT fixed in 7.0.0:** vector-index two-phase (Lance #6666 open) — `create_vector_index` inline residual retained; blob-column compaction — `compact_files_still_fails_on_blob_columns` guard still red on a fix, `optimize` still skips blob tables behind `LANCE_SUPPORTS_BLOB_COMPACTION`.
- **No Lance API surface omnigraph uses changed at *compile* time** (the only compile break was object_store) — but **two runtime behaviors did** (the unenforced-PK immutability and the native-namespace `TableNotFound`, above), each caught by the full engine test suite rather than the build. `CleanupPolicy`, `WriteParams` (apart from the `auto_cleanup` default), `CompactionOptions`, the namespace models (resolved via `lance-namespace-reqwest-client` 0.7.7, unchanged across the bump), `Operation`, `ManifestLocation`, and `MergeInsertBuilder` shapes are all stable. Lesson: a clean build is not a clean alignment — run `cargo test --workspace` before declaring a Lance bump done.
- **Two surface guards added by the v3→v4 migration-robustness follow-up** (not a Lance bump, but they pin Lance error surfaces the migration now classifies on): `dataset_open_missing_returns_not_found_variant` (a missing `Dataset::open` returns `DatasetNotFound`/`NotFound` — the legacy-open read in `db/commit_graph.rs::read_legacy_commit_cache` treats only those as "no legacy data" and propagates everything else) and `lance_error_incompatible_transaction_variant_exists` (a concurrent `UpdateConfig` stamp-bump loses with `IncompatibleTransaction`; `db/manifest/migrations.rs::commit_v4_stamp_idempotently` matches it to retry the benign same-value race). Re-run on a Lance bump like the others.
Bump this date stanza on the next alignment pass.


@ -523,7 +523,10 @@ struct WriteTxn {
branch: BranchRef,
base: PinnedSnapshot, // {manifest_version, per-table (loc,version,e_tag), schema_hash, writer_epoch}
session: Arc<Session>, // shared per-graph; warms metadata/index caches across opens
handles: HandleCache, // open-by-version; each table opened once, reused across stages
handles: HandleMap, // open the base once WITH session; thread the handle each
// commit RETURNS forward (HEAD walks N→N+1→N+2). NOT a
// version-keyed cache — HEAD moves, so a (table,version) key
// misses; reuse = forward the commit-return handle. [3b-validated]
}
// A typed, declarative publish plan — the COMPLETE "what", built before any HEAD moves.
@ -546,8 +549,17 @@ impl GraphPublishAuthority {
Properties that make it optimal:
- **Stages take `&WriteTxn`/`&PublishPlan`, never storage** — re-resolution and
open-latest are *unrepresentable*. Invariants 2/3/15 hold by construction.
- **Stages take `&WriteTxn`/`&PublishPlan` for the BASE** — re-resolving the pinned
read base / open-latest for the pre-commit phase is unrepresentable; invariants 2/3/15
hold for the base by construction. **Caveat [3b-validated]:** this is NOT "no
re-resolution anywhere." Three commit-boundary reads are irreducible correctness
machinery and MUST stay fresh: the commit-time `fresh_snapshot_for_branch` (cross-process
OCC), the live-HEAD drift probe (a concurrent writer may have moved HEAD since staging),
and the fork-authority reads (`classify_fork_ref` deliberately bypasses the cached base —
a pinned base there re-opens the "force-delete a live fork" bug). Model "pinned base for
the pre-commit phase + named fresh re-reads at the commit/fork boundary." The achievable
open count is **1 base open (with session) + 1 cheap `latest_version_id` probe + threaded
commit handles**, not literally one open.
- **The recovery sidecar *is* the serialized `PublishPlan`.** Phase C and
recovery both call `plan.apply()` — a merge that bumps tables A+B can never
roll A forward and silently drop B. The


@ -47,7 +47,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
| `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly — no HTTP — proving embedded and CLI callers hit the same gate as the server, with action × scope shapes matching `authorize_request` |
| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice; the index reconciler (iss-848): `index_build_tolerates_null_vector_rows` (an untrainable Vector column defers instead of aborting the build, sibling indexes still build) and `optimize_materializes_index_declared_but_unbuilt` (optimize creates a declared-but-deferred index) |
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). |
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). Also the v3→v4 migration fault-injection test (`transient_legacy_open_failure_aborts_migration_without_stamping_v4`, `migration.v3_to_v4.legacy_open` failpoint): a transient legacy-open failure aborts the migration loudly and leaves it retryable (stamp stays v3, no partial backfill), never stamping v4 over an empty backfill. Also the v4 stamp-bump exhaustion regression (`v4_stamp_exhaustion_returns_retryable_contention`, `migration.v4_stamp.force_incompatible` failpoint): the stamp retry loop surfaces a retryable `RowLevelCasContention` on exhaustion, not a stringified `Lance`. 
And the convergence-idempotent roll-forward regression (`open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one sidecar at the `recovery.before_roll_forward_publish` rendezvous; the CAS loser must converge, not fail the open — iss-schema-apply-reopen-recovery-race). |
| `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path |
| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). |
@ -65,10 +65,12 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
## Failpoints (fault injection)
- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` (in `crates/omnigraph/Cargo.toml` **and** `crates/omnigraph-cluster/Cargo.toml`; the cluster feature does not enable the engine's).
- Wrappers: `crates/omnigraph/src/failpoints.rs` and `crates/omnigraph-cluster/src/failpoints.rs` expose `maybe_fail("name")` and `ScopedFailPoint` for tests.
- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, cluster apply's payload→state-write window, etc.).
- Activated tests: `crates/omnigraph/tests/failpoints.rs` and `crates/omnigraph-cluster/tests/failpoints.rs` (crash-mid-apply + state CAS race via `fail::cfg_callback`; integration binaries, never in-source — the fail registry is process-global). Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints` / `cargo test -p omnigraph-cluster --features failpoints --test failpoints`.
- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` in `crates/omnigraph/Cargo.toml`; the cluster's `failpoints` feature additionally enables `omnigraph/failpoints` (`crates/omnigraph-cluster/Cargo.toml`), so the shared test guard is available to cluster tests.
- Wrappers: `crates/omnigraph/src/failpoints.rs` and `crates/omnigraph-cluster/src/failpoints.rs` each expose `maybe_fail("name")` (per-crate error type). The test-side config guard `ScopedFailPoint` (`new` for action strings, `with_callback` for callbacks; RAII `Drop` removes the point) lives **once** in the engine and is reused by both test binaries.
- **Names are compile-checked.** Every failpoint name is a `pub const` in `omnigraph::failpoints::names` (engine) / `omnigraph_cluster::failpoints::names` (cluster). Call sites and tests reference the constant, never a bare literal — a typo is a compile error, not a silently-never-firing point. Add a new failpoint by adding its const first.
- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, the recovery sweep's classify→roll-forward-publish window, cluster apply's payload→state-write window, etc.).
- **Serialize and rendezvous, never sleep.** The `fail` registry is process-global, so every failpoint test carries `#[serial]` (`serial_test`). For concurrent tests, use `helpers::failpoint::Rendezvous` (`tests/helpers/failpoint.rs`): `park_first(name)` parks the first thread to hit the point until `release()`, and `wait_until_reached().await` blocks on that condition (it doubles as a fired-assertion). Do not coordinate threads with fixed `sleep`s.
- Activated tests: `crates/omnigraph/tests/failpoints.rs` and `crates/omnigraph-cluster/tests/failpoints.rs` (integration binaries, never in-source — the fail registry is process-global). Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints` / `cargo test -p omnigraph-cluster --features failpoints --test failpoints`.
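
The two conventions above (names as constants, and an RAII guard that removes the point on `Drop`) can be sketched with the standard library alone. This is a simplified stand-in, not the `fail` crate or the engine's `ScopedFailPoint`: `ScopedPoint`, `ARMED`, and the constant identifier are hypothetical, and only the failpoint string `recovery.before_roll_forward_publish` comes from the text.

```rust
use std::sync::Mutex;

/// Failpoint names as constants: a typo at a call site is a compile error.
pub mod names {
    pub const RECOVERY_BEFORE_ROLL_FORWARD_PUBLISH: &str =
        "recovery.before_roll_forward_publish";
}

/// Process-global set of armed points (why such tests must be serialized).
static ARMED: Mutex<Vec<&'static str>> = Mutex::new(Vec::new());

/// Test-side guard: arms a point on creation, disarms it on drop.
pub struct ScopedPoint {
    name: &'static str,
}

impl ScopedPoint {
    pub fn arm(name: &'static str) -> Self {
        ARMED.lock().unwrap().push(name);
        ScopedPoint { name }
    }
}

impl Drop for ScopedPoint {
    fn drop(&mut self) {
        // Remove the point even if the test body returned early or panicked.
        ARMED.lock().unwrap().retain(|n| *n != self.name);
    }
}

/// Call-site hook: returns an error only while the named point is armed.
pub fn maybe_fail(name: &'static str) -> Result<(), String> {
    if ARMED.lock().unwrap().contains(&name) {
        Err(format!("failpoint {} fired", name))
    } else {
        Ok(())
    }
}
```

Because the guard disarms in `Drop`, a point armed in one test cannot leak into the next; the registry is still process-global, so the tests themselves have to run one at a time.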
## RustFS / S3 integration


@ -230,8 +230,9 @@ recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`:
rolled-back-to version (`manifest_pinned`); the manifest is published at the
restore commit (`manifest_pinned + 1`, same content).
- After a successful roll-forward or roll-back, an audit row is
recorded — `_graph_commits.lance` carries
a commit tagged `actor_id = "omnigraph:recovery"`, and a sibling
recorded — the graph commit lineage (the `graph_commit` rows in `__manifest`
since RFC-013 Phase 7) carries a commit tagged
`actor_id = "omnigraph:recovery"`, and a sibling
`_graph_commit_recoveries.lance` row carries `recovery_kind`,
`recovery_for_actor` (the original sidecar's actor), `operation_id`,
per-table outcomes. Operators run `omnigraph commit list --filter
@ -336,20 +337,40 @@ actual }`. The HTTP server maps this to **409 Conflict** with body
## Audit
`actor_id` lands in `_graph_commits.lance` via `record_graph_commit` (no
intermediate run record). Audit history is queried via `omnigraph commit
list`.
`actor_id` lands in the graph commit lineage — the `graph_commit` rows in
`__manifest`, written in the publish CAS (RFC-013 Phase 7; previously
`_graph_commits.lance`). Audit history is queried via `omnigraph commit list`.
## Migration code
`db/manifest/migrations.rs` carries the v2→v3 internal-schema step (MR-770):
a one-time sweep that deletes legacy `__run__*` staging branches off
`__manifest`. It runs in `Omnigraph::open(ReadWrite)` (via
`manifest::migrate_on_open`, before the coordinator reads branch state) and
again on the publisher's write path; both are idempotent once the stamp is at
v3. Deleting the inert `_graph_runs.lance` / `_graph_run_actors.lance` dataset
*bytes* is still deferred — it needs a `StorageAdapter::delete_prefix`
primitive — but those bytes are invisible to graph-level state.
`db/manifest/migrations.rs` is the single place on-disk `__manifest` shape is
reconciled with what the binary expects, stepping the
`omnigraph:internal_schema_version` stamp forward one `match`-arm at a time. It
runs in `Omnigraph::open(ReadWrite)` (via `manifest::migrate_on_open`, before the
coordinator reads branch state) and again on the publisher's write path, so each
branch migrates on its first write; every step is idempotent under crash-retry
(work first, stamp bump last).
- **v2→v3** (MR-770): a one-time sweep that deletes legacy `__run__*` staging
branches off `__manifest`. Deleting the inert `_graph_runs.lance` /
`_graph_run_actors.lance` dataset *bytes* is still deferred — it needs a
`StorageAdapter::delete_prefix` primitive — but those bytes are invisible to
graph-level state.
- **v3→v4** (RFC-013 Phase 7, `migrate_v3_to_v4`): backfills the graph lineage
from `_graph_commits.lance` into `__manifest` as `graph_commit` / `graph_head`
rows. A graph created before Phase 7 has its lineage only in
`_graph_commits.lance`; the new binary reads lineage from the `__manifest`
projection, so without this backfill it would see an empty commit DAG. The
backfill is per-branch (each branch migrates on its first write), idempotent
(keyed on `object_id`; a fast-path guard skips when `__manifest` already
carries `graph_commit` rows), and writes exactly one `graph_head:<branch>` row
for the actual head. `_graph_commits.lance` is left in place as the branch-ref
carrier — no commit row is written to it again. While a graph is below v4, a
**read-only** open (which never writes, so never migrates) sources the commit
DAG from `_graph_commits.lance` via the stamp-gated transitional fallback in
`CommitGraph::open*`, so reads see correct history before the first write
migrates the graph. An old binary opening a v4-stamped graph is refused with an
"upgrade omnigraph" error in both read-write and read-only modes.
## Mid-query partial failure: closed by MR-794

docs/releases/v0.7.2.md Normal file

@ -0,0 +1,60 @@
# Omnigraph v0.7.2
A patch release over v0.7.1: write-path latency reductions plus three
correctness fixes on the maintenance and recovery paths. No breaking changes, no
on-disk format change, and no migration — drop-in over v0.7.1.
## Performance
- **Write opens go direct, schema validates once (#288, #298).** Write opens
used to route through the per-table Lance namespace catalog, which re-opened
the dataset just to read its location and re-resolved the latest version on
every table open — an O(commit-depth) double resolution that dominated write
latency on object stores (~70%). Writes now open each touched data table
directly by its manifest-recorded location (Lance's O(1) version-hint path),
validate the schema contract once per write instead of ~4×, and open each
touched table once instead of 4×.
- **`optimize` compacts the internal metadata tables (#291).** `optimize`
previously iterated only node/edge tables, so the internal `__manifest`,
`_graph_commits`, and `_graph_commit_actors` tables accumulated one fragment
per commit and were never compacted — making every write's metadata scan grow
with commit history. `optimize` now compacts all three, so a periodically
optimized long-lived graph keeps its per-write metadata scan flat in history.
## Fixes
- **`optimize` survives a cross-process write race (#297).** A CLI `optimize`
racing a served write on the same table could fail: the in-process write queue
doesn't serialize across processes, so a concurrent insert/delete advancing the
manifest between optimize's compaction and its publish broke the strict
equality CAS. Optimize now reopens-and-replans on a genuine Lance conflict and
fast-forwards its publish monotonically, so a maintenance compaction never
fails a live write. Bounded retry; sustained contention surfaces a loud
conflict rather than dropping work.
- **`optimize` is non-destructive on upgraded graphs (#291).** A graph created by
a pre-0.7.0 binary carries an on-by-default Lance auto-cleanup config; under it,
optimize's compaction commit could fire Lance's version-GC hook and prune
`__manifest`-pinned versions (breaking snapshots and time travel). Optimize now
strips any stale `lance.auto_cleanup.*` config off every table — data and
internal — before its HEAD-advancing commits, so compaction can never GC pinned
versions.
- **Recovery converges instead of failing `open` under a concurrent manifest
advance (#296).** The open-time recovery sweep published its roll-forward at the
sidecar's pinned expected version; if another writer advanced the manifest
during the classify→publish window, the CAS failed and aborted the whole
`Omnigraph::open`. The sweep now treats roll-forward as "the manifest reflects
the sidecar's committed state," not "this sweep won the CAS": on a CAS loss it
re-reads the live manifest and, when the sidecar's intent is already satisfied,
records the recovery and deletes the sidecar idempotently — so a concurrent
advance no longer fails the open. (The destructive roll-back twin still defers
to a cross-process lease, as documented.)
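
The converge-or-defer decision in the recovery fix above can be sketched as a pure function over version maps. This is a hedged illustration, not `converge_or_defer_roll_forward`: `classify_cas_loss`, `RollForwardOutcome`, and `versions` are hypothetical names, and treating "reached or passed the goal version" as satisfied is an assumption of this sketch.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
pub enum RollForwardOutcome {
    /// The live manifest already reflects the sidecar's goal:
    /// record the recovery audit row and delete the sidecar.
    Converged,
    /// The manifest is only partway to the goal: leave the sidecar for later.
    Defer,
}

/// Helper to build a table -> version map from literal pairs.
pub fn versions(pairs: &[(&str, u64)]) -> BTreeMap<String, u64> {
    pairs.iter().map(|(t, v)| (t.to_string(), *v)).collect()
}

/// After losing the publish CAS, compare the sidecar's per-table goal
/// against the re-read live manifest instead of failing the open.
pub fn classify_cas_loss(
    goal: &BTreeMap<String, u64>,
    live: &BTreeMap<String, u64>,
) -> RollForwardOutcome {
    let satisfied = goal
        .iter()
        .all(|(table, want)| live.get(table).map_or(false, |have| have >= want));
    if satisfied {
        RollForwardOutcome::Converged
    } else {
        RollForwardOutcome::Defer
    }
}
```

The design point is that success is defined by the manifest's state, not by which sweep won the CAS, so two sweeps racing one sidecar can both finish cleanly.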
## Upgrade notes
Drop-in over v0.7.1 — no configuration, schema, or data changes. Upgrade the
server and CLI together as usual. Graphs created on v0.7.1 read and write
identically on v0.7.2; the optimize non-destructive fix additionally protects
graphs created by pre-0.7.0 binaries from version GC during compaction.


@ -1,16 +1,23 @@
# Omnigraph v0.8.0
v0.8.0 makes every served graph an **MCP (Model Context Protocol) server**. An
MCP-capable agent — Claude Code/Desktop, Cursor, the OpenAI Responses `mcp` tool,
and others — can connect to a graph and operate it directly: run reads and
mutations, load data, manage branches, browse commits, read the schema, and
invoke the graph's curated stored queries. The surface adds no new capability and
no new business logic; every tool delegates to the same engine/handler path the
REST routes use and is gated by the same Cedar policy.
v0.8.0 has two headline changes:
## Highlights
1. **Every served graph becomes an MCP (Model Context Protocol) server** — an
MCP-capable agent (Claude Code/Desktop, Cursor, the OpenAI Responses `mcp`
tool, and others) can connect to a graph and operate it directly. The surface
adds no new capability and no new business logic; every tool delegates to the
same engine/handler path the REST routes use and is gated by the same Cedar
policy. It is **additive**.
2. **Graph commit lineage moves into `__manifest`** (RFC-013 Phase 7), folded
into the publish CAS, via a one-time on-disk migration (internal schema
**v3 → v4**). This is the first internal-schema change since v0.4.0 and carries
an **upgrade-order requirement** — read the upgrade notes before rolling it out.
### MCP surface (`POST /graphs/{id}/mcp`)
## MCP surface (`POST /graphs/{id}/mcp`)
An MCP-capable agent can connect to a graph and run reads and mutations, load
data, manage branches, browse commits, read the schema, and invoke the graph's
curated stored queries.
- **One MCP endpoint per served graph**, mounted automatically by the cluster
server — no separate flag. It is a stateless Streamable-HTTP transport: a
@ -78,8 +85,56 @@ carried in the query source:
unsupported version is a `400`); `initialize` negotiates the version in its
body and is exempt by design.
## Graph lineage now lives in `__manifest` (internal schema v4)
The graph commit DAG (commits, parents, merge parents, per-branch heads, and the
authoring actor) is now stored in `__manifest` as `graph_commit` / `graph_head`
rows, written in the **same commit (CAS)** as the table-version rows of a graph
publish. Previously the lineage lived in a separate `_graph_commits.lance`
dataset written after the manifest commit, leaving a narrow window where a crash
could land a manifest version with no matching lineage row. Folding the lineage
into the publish closes that gap by construction: a graph commit and its lineage
now land atomically at one manifest version. The in-memory commit graph is a
projection of those manifest rows; `_graph_commits.lance` is retained only as a
carrier for Lance branch refs and no longer receives commit rows.
This bumps the `__manifest` internal schema stamp from **v3 to v4**.
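
To illustrate why folding lineage into the publish closes the crash window, here is a small sketch in which the table-version rows, the `graph_commit` row, and the `graph_head` pointer travel in one batch guarded by a single version check. The `Row` and `Manifest` types, the `commit_if` and `publish_batch` helpers, and the `key@version` object id used for table-version rows are assumptions made for this example; only the `graph_commit` / `graph_head:<branch>` row shapes come from the notes.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Row {
    object_type: &'static str,
    object_id: String,
    table_version: Option<u64>,
}

struct Manifest {
    version: u64,
    rows: Vec<Row>,
}

impl Manifest {
    /// Single CAS: either every row in `batch` lands at one new manifest
    /// version, or none does. On a lost race the live version is returned.
    fn commit_if(&mut self, expected: u64, batch: Vec<Row>) -> Result<u64, u64> {
        if self.version != expected {
            return Err(self.version);
        }
        // Upsert by object_id so a mutable pointer such as graph_head:<branch>
        // is replaced rather than duplicated.
        self.rows
            .retain(|old| !batch.iter().any(|new| new.object_id == old.object_id));
        self.rows.extend(batch);
        self.version += 1;
        Ok(self.version)
    }
}

/// Build one batch carrying both the table versions and the lineage rows.
fn publish_batch(tables: &[(&str, u64)], commit_id: &str, branch: &str) -> Vec<Row> {
    let mut batch: Vec<Row> = tables
        .iter()
        .map(|(key, version)| Row {
            object_type: "table_version",
            // Hypothetical id format, chosen only for this sketch.
            object_id: format!("{key}@{version}"),
            table_version: Some(*version),
        })
        .collect();
    batch.push(Row {
        object_type: "graph_commit",
        object_id: commit_id.to_string(),
        table_version: None,
    });
    batch.push(Row {
        object_type: "graph_head",
        object_id: format!("graph_head:{branch}"),
        table_version: None,
    });
    batch
}
```

Because the lineage rows share the batch with the table-version rows, there is no intermediate state in which a manifest version exists without its commit row.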
### Existing graphs migrate seamlessly on first write
A graph created by an earlier binary (internal schema v3) keeps its lineage in
`_graph_commits.lance` with none in `__manifest`. On the **first read-write
open**, Omnigraph backfills that lineage into `__manifest` (the `migrate_v3_to_v4`
internal-schema step) and bumps the stamp to v4. The migration:
- is **per-branch** — each branch backfills on its first write;
- is **idempotent and crash-safe** — the stamp bump is the last step, and the
backfill is keyed on the commit id, so a crash mid-migration re-runs harmlessly
on the next open;
- **preserves all data** — every commit, parent, merge parent, actor, and head is
carried over; commit ids are stable, so existing references still resolve.
No data is lost and no operator action is required beyond upgrading the binary.
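
The idempotent, crash-safe shape described above (backfill keyed on the commit id, stamp bump last) can be sketched like this. It is a simplified model under stated assumptions: `LegacyCommit`, `Manifest`, and `backfill_lineage_sketch` are hypothetical, the per-branch aspect is not modeled, and the `crash_after` parameter exists only to simulate an interrupted run.

```rust
use std::collections::BTreeMap;

/// Hypothetical view of one commit read from the legacy lineage dataset.
struct LegacyCommit {
    id: String,
    parent: Option<String>,
}

/// Hypothetical manifest state: the schema stamp plus lineage keyed by commit id.
struct Manifest {
    schema_version: u32,
    graph_commits: BTreeMap<String, Option<String>>,
}

/// Returns how many commits this run inserted. `crash_after` simulates a
/// crash once that many rows have been written by this run.
fn backfill_lineage_sketch(
    manifest: &mut Manifest,
    legacy: &[LegacyCommit],
    crash_after: Option<usize>,
) -> Result<usize, &'static str> {
    if manifest.schema_version >= 4 {
        return Ok(0); // already migrated, nothing to do
    }
    let mut inserted = 0;
    for commit in legacy {
        if crash_after == Some(inserted) {
            return Err("simulated crash before the stamp bump");
        }
        // Keyed on the commit id: rows written by an earlier, interrupted
        // run are skipped instead of duplicated.
        if !manifest.graph_commits.contains_key(&commit.id) {
            manifest
                .graph_commits
                .insert(commit.id.clone(), commit.parent.clone());
            inserted += 1;
        }
    }
    // The stamp bump is the last step, so a crash above leaves the graph at
    // v3 and the next read-write open simply runs the backfill again.
    manifest.schema_version = 4;
    Ok(inserted)
}
```

A run that fails partway leaves `schema_version` at 3 with a partial backfill; the next run fills in only the missing commits and then bumps the stamp.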
Before its first write migrates the graph, a **read-only** open of a v3 graph
(e.g. `omnigraph commit list`, NDJSON export) still reads correct history via a
transitional fallback that sources the commit DAG from `_graph_commits.lance`.
Read-only opens never write, so they never migrate, but they never show an empty
history either.
## Upgrade notes
- **Breaking: internal schema v4 — upgrade writer (and reader) binaries first.**
Internal schema v4 is a hard version gate. Once a graph has been opened for
write by a v0.8.0 binary, its `__manifest` is stamped v4, and an **older binary
will refuse to open it** — read-write *and* read-only — with an
`upgrade omnigraph before opening this graph` error rather than silently
misreading the new lineage. This is the standard forward-version protection
(same shape as the v1→v2 / v2→v3 steps), now enforced on the read-only path
too. Upgrade every writer (and reader) binary that touches a graph to v0.8.0
before, or together with, the first write under the new version. A mixed fleet
where an old binary still writes the same graph is unsupported, as with any
internal-schema bump.
- **`GET /graphs/{id}/queries` is now `invoke_query`-gated (was `read`).** The
stored-query catalog uses the same authority as invocation and the MCP
`tools/list` surface, so discovery and invocation agree ("see the menu iff you
@ -87,8 +142,9 @@ carried in the query source:
`403` instead of a listing; in default-deny mode the endpoint returns `403`
until an `invoke_query` rule is configured. This is the one observable REST
behavior change in this release.
- Otherwise no breaking changes: the rest of the REST surface, CLI, cluster
config, and on-disk format are unchanged. The MCP endpoint is additive.
- **The MCP endpoint is additive.** Apart from the `GET /queries` gate change and
the v4 on-disk migration above, the REST surface, CLI, and cluster config are
unchanged.
- **Pointing an agent at a graph:** configure your MCP client with the URL
`https://<host>/graphs/<id>/mcp` and the same bearer token you use for REST.
See [docs/user/operations/mcp.md](../user/operations/mcp.md) for the connect

View file

@ -20,13 +20,14 @@ OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordin
- **Layout**:
- `nodes/{fnv1a64-hex(type_name)}` — one Lance dataset per node type
- `edges/{fnv1a64-hex(edge_type_name)}` — one Lance dataset per edge type
- `__manifest/` — the catalog of all sub-tables and their published versions
- `_graph_commits.lance` / `_graph_commit_actors.lance` — the commit graph and its actor map
- `__manifest/` — the catalog of all sub-tables and their published versions, **and** the graph commit lineage (RFC-013 Phase 7)
- `_graph_commits.lance` / `_graph_commit_actors.lance` — legacy / branch-ref carriers. Since RFC-013 Phase 7 the graph lineage lives in `__manifest` (`graph_commit` / `graph_head` rows, written in the publish CAS); `_graph_commits.lance` no longer receives commit rows, but is retained to carry the Lance branch refs that `create_branch` / `list_branches` / the `cleanup` orphan reconciler operate on. A graph created before Phase 7 (internal schema v3) keeps its lineage here until its first read-write open, which migrates it into `__manifest` via `migrate_v3_to_v4`.
- (legacy `_graph_runs.lance` / `_graph_run_actors.lance` from pre-v0.4.0 graphs are inert; the run state machine was removed. The internal schema migration sweeps stale `__run__*` branches on first write-open; the inert dataset bytes themselves remain until a prefix-delete storage primitive lands)
- **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`):
- `object_type` ∈ `table | table_version | table_tombstone`
- `table_key` ∈ `node:<TypeName> | edge:<EdgeName>`
- `object_type` ∈ `table | table_version | table_tombstone | graph_commit | graph_head`
- `table_key` ∈ `node:<TypeName> | edge:<EdgeName>` (empty for `graph_commit` / `graph_head` lineage rows)
- `table_branch` is `null` for the main lineage and the branch name otherwise
- **Graph lineage rows** (RFC-013 Phase 7): one immutable `graph_commit` row per commit (`object_id` = the commit ULID; `metadata` JSON carries parent / merged-parent / actor / timestamp) plus one mutable `graph_head:<branch>` pointer per branch (`graph_head:main` for main). The in-memory commit DAG is a projection of these rows.
- **Snapshot reconstruction**: latest visible `table_version` per `(table_key, table_branch)` minus tombstones — rows where `object_type = table_tombstone`, whose own `table_version` (acting as the tombstone version) is `>= the entry's table_version`.
- **Atomic publish**: multi-dataset commits publish so that a single write to `__manifest` flips all the new sub-table versions visible at once.
- **Row-level CAS on the merge-insert join key**: `object_id` carries an unenforced-primary-key annotation so Lance's bloom-filter conflict resolver rejects two concurrent commits that land the same `object_id` row. Without this annotation, Lance's transparent rebase would admit silent duplicates from racing publishers.
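
The snapshot reconstruction rule in the list above can be sketched as a fold over manifest rows. This is an illustrative model only: `Row`, `ObjectType`, and `snapshot` are hypothetical, lineage rows are ignored, and how a branch inherits visibility from main is not modeled. Each `(table_key, table_branch)` pair is treated independently.

```rust
use std::collections::HashMap;

enum ObjectType {
    TableVersion,
    TableTombstone,
}

struct Row {
    object_type: ObjectType,
    table_key: String,
    /// `None` stands for the main lineage, `Some(name)` for a branch.
    table_branch: Option<String>,
    table_version: u64,
}

type Key = (String, Option<String>);

/// Latest `table_version` per (table_key, table_branch), minus entries hidden
/// by a tombstone whose own version is >= the entry's version.
fn snapshot(rows: &[Row]) -> HashMap<Key, u64> {
    let mut latest: HashMap<Key, u64> = HashMap::new();
    let mut tombstones: HashMap<Key, u64> = HashMap::new();
    for row in rows {
        let key = (row.table_key.clone(), row.table_branch.clone());
        let slot = match row.object_type {
            ObjectType::TableVersion => &mut latest,
            ObjectType::TableTombstone => &mut tombstones,
        };
        // Keep the highest version seen for this key in the chosen map.
        let entry = slot.entry(key).or_insert(row.table_version);
        *entry = (*entry).max(row.table_version);
    }
    // Keep an entry only if no tombstone at or above its version exists.
    latest.retain(|key, version| {
        tombstones.get(key).map_or(true, |tomb| *tomb < *version)
    });
    latest
}
```

A table that is tombstoned and later republished at a higher version reappears in this model, because the tombstone version is then below the entry's version.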
@ -90,8 +91,8 @@ flowchart TB
- **Graph root** is one directory (or S3 prefix). Everything below is part of one OmniGraph graph.
- **`__manifest/`** is a Lance dataset whose rows describe which sub-table version is published at which graph-branch. Reading a snapshot starts here.
- **`nodes/`** and **`edges/`** are sibling directories holding one Lance dataset per declared type. Names are `fnv1a64-hex` of the type name to keep paths fixed-length and case-safe.
- **`_graph_commits.lance`** is an L2 dataset that records the graph-level commit DAG, with a paired `_graph_commit_actors.lance` for the actor map. (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; the internal schema migration sweeps their stale `__run__*` branches, and the dataset bytes are reclaimed once a prefix-delete primitive lands.)
- **`_graph_commit_recoveries.lance`** — one row per crash-recovery action. Joined to `_graph_commits.lance` by `graph_commit_id`; the linked commit row carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join.
- **`_graph_commits.lance`** is an L2 dataset retained only as a branch-ref carrier (and, on a pre-Phase-7 graph, the migration source). Since RFC-013 Phase 7 the graph commit DAG lives in `__manifest` as `graph_commit` / `graph_head` rows written in the publish CAS — `_graph_commits.lance` and its paired `_graph_commit_actors.lance` no longer receive commit rows. A graph created before Phase 7 (internal schema v3) backfills its lineage into `__manifest` on its first read-write open (`migrate_v3_to_v4`). (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; the internal schema migration sweeps their stale `__run__*` branches, and the dataset bytes are reclaimed once a prefix-delete primitive lands.)
- **`_graph_commit_recoveries.lance`** — one row per crash-recovery action. Joined by `graph_commit_id` to the graph commit lineage (the `graph_commit` rows in `__manifest` since RFC-013 Phase 7); the linked commit carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join.
- **`__recovery/{ulid}.json`** — transient sidecar files written by a writer before it advances the underlying dataset, deleted once the matching manifest publish succeeds. A sidecar persisting after process exit means the writer crashed mid-commit; the next read-write open processes it. Steady-state directory is empty.
- **`_refs/branches/{name}.json`** is graph-level branch metadata — pointers from a branch name to the manifest version it heads.
- **Inside each Lance dataset** (orange): the standard Lance directory layout. `_versions/{n}.manifest` records every commit; `data/` holds the actual Arrow fragments; `_indices/{uuid}/` holds index segments with their own `fragment_bitmap` for partial coverage; `_refs/` holds Lance-native per-dataset branches and tags.
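
As a rough illustration of the `fnv1a64-hex` naming used for `nodes/` and `edges/`, the following computes the standard 64-bit FNV-1a hash and renders it as hex. The hash constants are the published FNV-1a parameters; that Omnigraph hashes the UTF-8 bytes of the type name and zero-pads to 16 lowercase hex digits is an assumption of this sketch, and `dataset_path` is a hypothetical helper.

```rust
/// Standard 64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    bytes
        .iter()
        .fold(OFFSET_BASIS, |hash, byte| (hash ^ u64::from(*byte)).wrapping_mul(PRIME))
}

/// Hypothetical helper: `kind` is "nodes" or "edges".
/// Assumes UTF-8 input bytes and fixed-width, zero-padded lowercase hex.
fn dataset_path(kind: &str, type_name: &str) -> String {
    format!("{kind}/{:016x}", fnv1a64(type_name.as_bytes()))
}
```

Fixed-width hex output is what makes the resulting paths fixed-length and case-safe regardless of how the type name is spelled.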

View file

@ -3,12 +3,12 @@
| Name | Value | Area |
|---|---|---|
| `MANIFEST_DIR` | `__manifest` | manifest layout |
| Commit graph dir | `_graph_commits.lance` | commit graph |
| Commit graph dir | `_graph_commits.lance` | branch-ref carrier + pre-v4 lineage source (lineage lives in `__manifest` since RFC-013 Phase 7) |
| Run registry dir (legacy, removed) | `_graph_runs.lance` | inert post-v0.4.0; bytes remain until a prefix-delete primitive lands |
| Run branch prefix (legacy, removed) | `__run__` | swept off `__manifest` by the internal schema migration; no longer a reserved name |
| Schema apply lock | `__schema_apply_lock__` | schema apply |
| Manifest publisher retry budget | `PUBLISHER_RETRY_BUDGET = 5` | manifest publish |
| Internal manifest schema version | `INTERNAL_MANIFEST_SCHEMA_VERSION = 3` | manifest migrations |
| Internal manifest schema version | `INTERNAL_MANIFEST_SCHEMA_VERSION = 4` | manifest migrations (v4 = graph lineage in `__manifest`, RFC-013 Phase 7) |
| Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | merge execution |
| Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | optimize/cleanup |
| Lance blob compaction support | `LANCE_SUPPORTS_BLOB_COMPACTION = false` | optimize |