mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-27 02:39:38 +02:00
Merge branch 'main' into ragnorc/omnigraph-mcp-crate
Folds in v0.7.2 (release #301) + RFC-013 Phase 7 (graph lineage in __manifest, internal schema v3→v4 migration #299; WriteTxn #298; recovery convergence #296) under the MCP branch.

Conflict resolutions (2 files):
- crates/omnigraph-server/Cargo.toml: take main's 0.7.2 path-dep constraints; keep our omnigraph-mcp dep (bumped to 0.7.2).
- docs/releases/v0.8.0.md (add/add): both branches drafted v0.8.0 notes for the same next minor, so they were combined. v0.8.0 now documents BOTH the MCP surface (ours) and main's __manifest lineage fold + the breaking internal-schema-v4 upgrade-order requirement (kept prominent under Upgrade notes). Corrected our 'no breaking changes / on-disk format unchanged' line, which the v4 migration makes false.

Coherence: omnigraph-mcp [package] + Cargo.lock bumped 0.7.1→0.7.2; openapi.json auto-merged to info.version 0.7.2 (no API-surface drift from the incoming engine-internal commits).

Verification deferred to CI (no local rebuild).
commit
4d4c2164de
62 changed files with 5898 additions and 1053 deletions
|
|
@@ -133,7 +133,7 @@ flowchart TB
|
|||
subgraph state[graph state]
|
||||
coord[GraphCoordinator]:::l2
|
||||
mr[ManifestCoordinator<br/>db/manifest.rs]:::l2
|
||||
-cg[CommitGraph<br/>_graph_commits.lance]:::l2
+cg[CommitGraph<br/>projection of __manifest graph_commit/graph_head rows]:::l2
|
||||
stg[MutationStaging<br/>per-query in-memory accumulator<br/>exec/staging.rs]:::l2
|
||||
end
|
||||
|
||||
|
|
|
|||
460
docs/dev/handoff-rfc-013-write-path.md
Normal file
|
|
@@ -0,0 +1,460 @@
|
|||
# Handoff: finishing RFC-013 (write-path latency + correctness)
|
||||
|
||||
**Status:** living handoff. **Source of truth is [`rfc-013-write-path-latency.md`](rfc-013-write-path-latency.md)** —
|
||||
this doc is the *current-state map + the decisions/validation from the latest work cycle
|
||||
+ the concrete next actions*. When they disagree, the RFC wins (and fix this doc).
|
||||
|
||||
**Audience:** the engineer/agent who picks up RFC-013 next.
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR — where we are and what's next
|
||||
|
||||
RFC-013 makes the write path fast **and** correct on object storage (217 Lance tables
|
||||
under one `__manifest` catalog, on R2/S3). It is sequenced as steps; read §9 of the RFC
|
||||
for the canonical list. Current reality:
|
||||
|
||||
**Landed on `main`:**
|
||||
- **Step 1** — Tier-1 cost gate + the shared `helpers::cost` harness (#288).
|
||||
- **Step 3a** — opener bypass: write opens go direct (`Dataset::open` by URI + version)
|
||||
instead of the Lance-namespace builder (#288). **This already banked the dominant
|
||||
depth win** — see §2 below; it reframes everything.
|
||||
- **Step 2a** — internal-table compaction: `optimize` now compacts `__manifest` /
|
||||
`_graph_commits` / `_graph_commit_actors` (#291). Plus the RFC latency-model
|
||||
correction (#292).
|
||||
- **Optimize-vs-write race** — optimize survives a cross-process write race on the
|
||||
same table (#297, **LANDED** — origin/main `6d4606a8`; see §6 for why it's not
|
||||
redundant with Design A). Step 3b stacks on top of this.
|
||||
|
||||
**Open PRs (land these; relationships in §7):**
|
||||
- **#296** `correctness-by-design-fix` — recovery roll-forward converges on a concurrent
|
||||
manifest advance (the fix for the flaky `iss-schema-apply-reopen-recovery-race`).
|
||||
**MERGED to main and integrated into this branch** — the converge helper now threads
|
||||
Phase-7's manifest-CAS recovery `graph_commit_id` (see `converge_or_defer_roll_forward`).
|
||||
- **#295** `docs/rfc-013-step-3b` — the step-3b RFC doc.
|
||||
- **#254** `ragnorc/bug-4-schema-apply-occ` — schema-apply vs optimize false-fail
|
||||
(same op-class family as #297, logical side).
|
||||
|
||||
**Step 3b is DONE** (capture-once `WriteTxn`, schema-once + open-collapse; see §4) on
|
||||
`rfc-013-step-3b-writetxn-v2`. **Next: Phase 7 (step 4), then the big one — Design A /
|
||||
`PublishPlan` unification (step 5)** — see §5, the convergent fix for the bug *class* this
|
||||
area keeps generating, which also absorbs 3b's deferred session-aware write opens.
|
||||
|
||||
---
|
||||
|
||||
## 1. The corrected mental model (read this before touching anything)
|
||||
|
||||
Four reframes from the latest cycle that the older RFC prose may not fully reflect:
|
||||
|
||||
### 1a. 3a already won the depth fight → the residual is constant-factor + RTT
|
||||
Before 3a, the write re-opened each table through the lance-namespace builder ~13×, and
|
||||
that path was **O(depth)** (it re-opened `__manifest` + `list_table_versions` per open —
|
||||
**not** a Lance back-walk; the root cause was OmniGraph's own namespace round-trips, not
|
||||
Lance — validated against Lance source). 3a swapped it for the direct opener, which is
|
||||
**O(1)** (`from_uri(loc).with_version(N)` = arithmetic path + one HEAD). So:
|
||||
|
||||
- The dominant **O(depth) data-table** term is **gone**.
|
||||
- Step 2a flattened the secondary **internal-table** scan term.
|
||||
- What remains is the **~110-hop serial backbone × RTT + compute** — a constant in
|
||||
depth. The latency model is **`wall = (serial_hops + ops/effective_concurrency)·RTT
|
||||
+ compute`**; on a capped store (R2) the op-count term re-enters wall-clock, on an
|
||||
unlimited store it parallelizes away. Measured: prod one-row write 27→15.76s after
|
||||
2a; the remaining 15.76s is the serial backbone — **step 3b's target**, not step 2's.
|
||||
- Step 3b's win is therefore the **call-count/RTT collapse** (redundant opens, the
|
||||
flat-46 schema reads), NOT a depth slope. Don't expect a depth-slope improvement from
|
||||
3b; gate it on the constant-factor (S3 round-trips), not a curve.
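
The latency model in the list above can be written down as a small function, which makes the "capped store vs unlimited store" point easy to check by hand. This is a sketch only: the function name and the sample inputs (op count, concurrency, RTT, compute) are illustrative assumptions, not measured values; only the formula and the ~110-hop backbone come from this doc.

```rust
/// Wall-clock estimate for one write, following the model in this section:
/// wall = (serial_hops + ops / effective_concurrency) * RTT + compute.
/// All names and units here are illustrative (seconds, plain f64).
pub fn wall_clock_secs(
    serial_hops: f64,
    ops: f64,
    effective_concurrency: f64,
    rtt_secs: f64,
    compute_secs: f64,
) -> f64 {
    assert!(effective_concurrency > 0.0, "concurrency must be positive");
    (serial_hops + ops / effective_concurrency) * rtt_secs + compute_secs
}
```

With a fixed serial backbone, raising `effective_concurrency` shrinks only the `ops` term; the `serial_hops * RTT` floor is untouched, which is why 3b is gated on call count rather than on a depth curve.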
|
||||
|
||||
### 1b. Two op classes, two commit models (the §6.6 principle)
|
||||
Every concurrency bug in this area is **one op class using the other's commit model**:
|
||||
|
||||
| class | examples | commutes? | correct commit model |
|
||||
|---|---|---|---|
|
||||
| **maintenance** | compaction (`Rewrite`), `optimize_indices` | yes (content-preserving) | Lance native rebase + app reopen/replan on real overlap + **monotonic manifest fast-forward** — no epoch, no read-set |
|
||||
| **logical mutation** | load / mutate / merge / delete | no (lost-update, write-skew) | strict cross-process OCC: read-set + write-set CAS under the `writer_epoch` fence |
|
||||
|
||||
Applying strict OCC + equality-CAS uniformly is the mistake: too strong for maintenance
|
||||
(false conflicts — #297's bug), too weak for logical cross-process (§6.5 corruption).
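
The table above is effectively a total function from op class to commit model. A minimal sketch of that mapping, assuming hypothetical `OpClass` / `CommitModel` enums (neither name is claimed to exist in the engine):

```rust
/// The two op classes from the table above (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpClass {
    /// Content-preserving: compaction (`Rewrite`), `optimize_indices`.
    Maintenance,
    /// load / mutate / merge / delete.
    LogicalMutation,
}

/// The commit model each class must use (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CommitModel {
    /// Lance native rebase, app reopen/replan on real overlap,
    /// monotonic manifest fast-forward. No epoch, no read-set.
    RebaseAndFastForward,
    /// Read-set + write-set CAS under the `writer_epoch` fence.
    StrictOcc,
}

/// One exhaustive match: adding a class forces a decision here
/// instead of silently inheriting the other class's model.
pub fn commit_model(class: OpClass) -> CommitModel {
    match class {
        OpClass::Maintenance => CommitModel::RebaseAndFastForward,
        OpClass::LogicalMutation => CommitModel::StrictOcc,
    }
}
```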
|
||||
|
||||
### 1c. The root liability (what keeps generating these bugs)
|
||||
Lance gives **per-table atomic commits** but **no cross-table/cross-step atomicity**, so
|
||||
every multi-commit op advances per-table Lance HEAD **before** the manifest references it
|
||||
(the "A-before-B window"). The resulting `HEAD vs manifest` delta is **ambiguous**
|
||||
(external drift? my own in-flight work? a crashed writer?), and **many uncoordinated code
|
||||
paths each re-interpret it** (4 writers + the maintenance path + recovery + the write-path
|
||||
drift guard). Each interpreter is a fresh chance to misclassify. That is the bug class:
|
||||
- §6.5 cross-process logical corruption,
|
||||
- #297's own-HEAD-drift misclassification,
|
||||
- the flaky write-path "HEAD ahead of manifest, run repair" guard,
|
||||
- the recovery classifier edges.
|
||||
|
||||
**The convergent fix is Design A (one publish authority — step 5); Lance MTT eventually
|
||||
retires the window entirely.** See §5.
|
||||
|
||||
### 1d. The second facet: the write base is a stale pin (no probe)
|
||||
The READ path resolves its base behind a freshness probe (`resolve_target_inner`
|
||||
omnigraph.rs:~1072 → `probe_latest_incarnation` → `refresh_manifest_only`); the WRITE path
|
||||
does NOT (`resolved_branch_target` omnigraph.rs:~778 returns the warm `coord.snapshot()` for
|
||||
the bound branch, no probe). So a long-lived server's write base lags the live manifest. That
|
||||
single staleness feeds **two distinct failure modes**, both surfaced this cycle:
|
||||
|
||||
1. **Stale validation *reads* → integrity under-enforced.** Write-path RI checks read
|
||||
committed state off the stale base. 3b's collapse #1 made it worse for edge `@card`:
|
||||
`edge_cardinality_read_handle` (mutation.rs:~614) scans the pinned `txn.base` instead of
|
||||
live HEAD (was live HEAD pre-3b), so a concurrent edge committed after `txn` capture is
|
||||
uncounted → a `@card` max can be exceeded (cursor **High** / codex **P1** on #298,
|
||||
**VALID**). **#298 fix: restore the live-HEAD read for that scan** (un-regress; gate-safe —
|
||||
the `data_open_count` gate is a node insert) + a deterministic regression test (commit A's
|
||||
edge, then B validates → must see A) + correct the wrong "pinned base == live HEAD" doc
|
||||
comment (mutation.rs:~605-613, which assumes a single writer). The *structural* liability
|
||||
underneath: there is **no unified write-validation read-set** — endpoint
|
||||
(`ensure_node_id_exists`, warm `snapshot_for_branch`), cardinality (mutation: pinned
|
||||
`txn.base`; loader: warm `snapshot_for_branch` — the SAME check forks per write path),
|
||||
commit drift guard (live `fresh_snapshot_for_branch`), and uniqueness
|
||||
(`enforce_unique_constraints_intra_batch`, intra-batch only — cross-version uniqueness is a
|
||||
documented gap). Three freshness levels chosen ad hoc, none re-validated at commit → the
|
||||
§7.1 TOCTOU class, and each new constraint forks the pattern again.
|
||||
|
||||
2. **Stale OCC *pin* → false-fail on a maintenance advance.** A served strict update/delete
|
||||
pins the stale base version, then false-fails `ExpectedVersionMismatch` after an external
|
||||
`optimize` advanced `__manifest` — even though the advance was content-preserving
|
||||
compaction the logical write should fast-forward past (invariant 7). It's the **write-side
|
||||
mirror of #297/§6.6** (#297 made optimize fast-forward past a logical write; this is a
|
||||
logical write that must fast-forward past optimize). A served read clears it (the read
|
||||
probes the shared coordinator). Validated repro on prod (omnigraph.ragnor.co) +
|
||||
`writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes`
|
||||
(`#[ignore]` on branch `fix/write-path-stale-view-probe`). **The naive "just probe" fix is
|
||||
proven wrong** — a blanket probe silently refreshes past *logical* advances too, breaking
|
||||
`consistency::stale_handle_public_mutation_must_refresh_then_retry` (the deliberate
|
||||
cross-process lost-update OCC primitive). The fix must **discriminate by op class**.
|
||||
|
||||
**Both fold into Design A (step 5), same as §1c.** `open_txn`'s one warm probe makes the base
|
||||
fresh (absorbs maintenance advances cheaply); the **op-class-aware strict precondition** —
|
||||
derive from Lance's per-version transaction metadata (all `Rewrite`/`ReserveFragments` =
|
||||
maintenance → fast-forward the pin; any `Append`/`Update`/`Delete`/`Merge` = logical → fail
|
||||
loudly; NO parallel marker, invariant 1/15) — is the correctness fence for anything that lands
|
||||
after. And the §7.1 read-set-in-CAS unifies the validation read-set + re-validates it under the
|
||||
`graph_head` contention. So **the stale-view false-fail, the cardinality/validation-read-set
|
||||
liability, and #297's mirror are one bug** (the write base is a stale, un-probed, un-classified
|
||||
pin) with **one home: the single PublishPlan delta-interpreter** (§1c + §5). Strong corroboration
|
||||
of Design A — three symptoms, one fix.
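
A sketch of the op-class-aware strict precondition described above, under stated assumptions: `TxnKind` is a local stand-in that mirrors the operation names listed in this section (it is not Lance's own type), and the caller is assumed to have already read the per-version transaction kinds for versions after the pin.

```rust
/// Local stand-in for the per-version transaction kind (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TxnKind {
    Rewrite,
    ReserveFragments,
    Append,
    Update,
    Delete,
    Merge,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PinDecision {
    /// Nothing landed after the pin.
    Unchanged,
    /// Only maintenance landed: move the pin forward.
    FastForward { to: u64 },
    /// A logical write landed: the strict op must fail loudly.
    FailLoudly { version: u64 },
}

/// Decide what a strict write does with a pin at `pinned`, given the
/// `(version, kind)` pairs observed on the table.
pub fn decide_pin(pinned: u64, later: &[(u64, TxnKind)]) -> PinDecision {
    let mut newest = pinned;
    for &(version, kind) in later {
        if version <= pinned {
            continue; // already covered by the pin
        }
        match kind {
            TxnKind::Rewrite | TxnKind::ReserveFragments => newest = newest.max(version),
            TxnKind::Append | TxnKind::Update | TxnKind::Delete | TxnKind::Merge => {
                return PinDecision::FailLoudly { version };
            }
        }
    }
    if newest == pinned {
        PinDecision::Unchanged
    } else {
        PinDecision::FastForward { to: newest }
    }
}
```

The point of the shape: a blanket probe would collapse `FastForward` and `FailLoudly` into one "refresh" outcome, which is exactly what breaks `stale_handle_public_mutation_must_refresh_then_retry`.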
|
||||
|
||||
---
|
||||
|
||||
## 2. Validated facts — do NOT re-derive these
|
||||
|
||||
Established this cycle against **Lance 7.0.0 source**
|
||||
(`~/.cargo/registry/src/index.crates.io-*/lance-7.0.0`) and current engine code. Cited so
|
||||
you can trust them without re-investigating.
|
||||
|
||||
**Lance (upstream):**
|
||||
- `from_uri(loc).with_version(N).load()` and `checkout_version(N)` are **O(1)** (computed
|
||||
V2 path `_versions/{u64::MAX-N:020}.manifest` + one HEAD; no listing/back-walk).
|
||||
(`lance-table/src/io/commit.rs` `default_resolve_version`.)
|
||||
- A shared `Arc<Session>` (`DatasetBuilder::with_session`) warms metadata/index caches
|
||||
keyed by `(URI, version, e_tag)`. Caveat: the *first* manifest read on open is uncached
|
||||
— the Session warms the *scan/index* metadata, not the first open. **`WriteParams` *does*
|
||||
carry a `session` field** (`lance/src/dataset/write.rs`), but it only matters on the
|
||||
`WriteDestination::Uri` arm; OmniGraph's staged path always drives off an **already-open
|
||||
`Dataset`**, and Lance takes the store/session from that handle. So to attach the shared
|
||||
Session to a write base, open read-style (`open_table_dataset` → `from_uri().with_version()
|
||||
.with_session()`) and drive the staged write off that handle.
|
||||
- A held `Arc<Dataset>` at a pinned version is `Send + Sync`, immutable, safe to reuse for
|
||||
many scans/count/staged-write base in one txn (OmniGraph's `TableHandleCache` already
|
||||
relies on this).
|
||||
- **No compaction `RetryExecutor`** (only Delete/MergeInsert/Update have one).
|
||||
`commit_compaction` commits a fixed `Rewrite` via `apply_commit` direct. In
|
||||
`commit_transaction`, a semantic `RetryableCommitConflict` **escapes the retry loop**
|
||||
via `?` at `io/commit.rs:979`; the loop only retries the OCC `CommitConflict`
|
||||
(`:1096`), and even that re-rebases the *same* transaction (never re-plans). ⇒
|
||||
**compaction needs app-level reopen+REPLAN; you cannot "set conflict_retries" and let
|
||||
Lance own it.**
|
||||
- `check_rewrite_txn`: a `Rewrite` rebases **cleanly** past a concurrent `Append`/disjoint
|
||||
`Update`/`Delete` (preserving both); only a same-fragment overlap yields a retryable
|
||||
conflict. ⇒ the common concurrent insert/update/delete is rebased for free; the app
|
||||
retry fires only on real overlap.
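
The O(1) claim for pinned opens rests on the manifest path being pure arithmetic on the version number. A sketch of just that arithmetic, using the V2 naming quoted above (`v2_manifest_path` is a hypothetical helper name, not a Lance function):

```rust
/// Relative V2 manifest path for `version`, per the naming scheme cited
/// above: `_versions/{u64::MAX - N:020}.manifest`.
pub fn v2_manifest_path(version: u64) -> String {
    format!("_versions/{:020}.manifest", u64::MAX - version)
}
```

Because the number is inverted and zero-padded to 20 digits, newer versions sort first lexicographically, and a known version needs no listing or back-walk to locate.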
|
||||
|
||||
**Engine (internal):**
|
||||
- Read path (post-#268) already has the capture-once machinery: `Snapshot` (`db/manifest.rs`),
|
||||
warm `GraphCoordinator` behind a `latest_version_id`/incarnation probe, a held
|
||||
`TableHandleCache` keyed `(table,branch,version,e_tag)`, **one shared `Session` per
|
||||
graph** (`read_caches.session`). **Writes bypass all of it by construction**
|
||||
(`resolved_branch_target` returns `read_caches: None`; the 3a write opener attaches no
|
||||
session and opens by latest, not pinned version).
|
||||
- A single write opens each table **3–4×** (accumulation → staging reopen → commit
|
||||
drift-guard → publish prepare), each a fresh cold open. `validate_schema_contract`
|
||||
(`db/schema_state.rs`, via `ensure_schema_state_valid`) runs uncached (~3 `read_text`
|
||||
+ 2 `exists`) at every resolve point (~the flat-46). Both are constant-factor, flat in
|
||||
depth — 3b's targets.
|
||||
- Strict-op guards are the lost-update floor (3 layers: pre-stage `ensure_expected_version`
|
||||
`table_store.rs`; commit-time strict drift `exec/staging.rs`; publisher CAS
|
||||
`publisher.rs`). Capture-once **supplies** the pinned operand — never remove a guard.
|
||||
- Fork-on-first-write authority reads (`classify_fork_ref` → `fresh_snapshot_for_branch`)
|
||||
must stay **fresh** (not served from a pinned base).
|
||||
- Cost harness: `helpers::cost` (`measure`/`measure_with_staged`/`IoCounts`/`assert_flat`/
|
||||
`local_graph`/`s3_graph`). The schema-once assert can reuse `CountingStorageAdapter`
|
||||
(`warm_read_cost.rs::warm_query_validates_schema_contract_once`) with **zero** prod
|
||||
change; an open-count assert wants a small `open_count` AtomicU64 in `QueryIoProbes`
|
||||
(copy the `probe_count`/`record_probe` pattern). The forbidden-API guard
|
||||
(`tests/forbidden_apis.rs`) makes an instrumentation-level counter complete.
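
A sketch of the open-count probe suggested above, already split by URI class as Stage 0 in §4 ended up requiring. Assumptions: the struct name `OpenProbes`, the rule of classifying by the last path segment, and the optional `.lance` suffix are all illustrative; only the three internal table names come from this doc.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Internal tables named in this doc; everything else counts as data.
const INTERNAL_TABLES: [&str; 3] = ["__manifest", "_graph_commits", "_graph_commit_actors"];

/// Hypothetical per-query open counters, split by URI class.
#[derive(Default)]
pub struct OpenProbes {
    data_open_count: AtomicU64,
    internal_open_count: AtomicU64,
}

impl OpenProbes {
    /// Record one dataset open, classified by the last path segment.
    pub fn record_open(&self, uri: &str) {
        let leaf = uri.trim_end_matches('/').rsplit('/').next().unwrap_or("");
        let name = leaf.strip_suffix(".lance").unwrap_or(leaf);
        if INTERNAL_TABLES.iter().any(|t| *t == name) {
            self.internal_open_count.fetch_add(1, Ordering::Relaxed);
        } else {
            self.data_open_count.fetch_add(1, Ordering::Relaxed);
        }
    }

    pub fn data_opens(&self) -> u64 {
        self.data_open_count.load(Ordering::Relaxed)
    }

    pub fn internal_opens(&self) -> u64 {
        self.internal_open_count.load(Ordering::Relaxed)
    }
}
```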
|
||||
|
||||
---
|
||||
|
||||
## 3. The #297 cycle (this branch) — what it is, and the lesson
|
||||
|
||||
`fix-optimize-concurrency-race` (5 commits): a CLI `optimize` racing a served write on the
|
||||
same table failed (Lance Rewrite lost, or the equality-CAS publish lost). Fix: unify both
|
||||
compaction paths on the internal path's **reopen+replan** shape, with a **two-level retry**
|
||||
— outer loop reopens+replans on a real Lance overlap; inner Phase-C loop makes the manifest
|
||||
publish a **monotonic fast-forward** (advance to compacted version `N`, or no-op when the
|
||||
manifest already moved to `≥ N`), never the strict equality CAS. Sidecar written once;
|
||||
in-process queue kept as a contention reducer (not the cross-process guard); no `writer_epoch`.
|
||||
|
||||
**Two review rounds surfaced two follow-on bugs I introduced with the retry loop** — both
|
||||
fixed, both regression-tested (own-HEAD-drift via negative control):
|
||||
1. **Own-HEAD-drift misclassification** (`56d004e0`): the drift guard re-ran every
|
||||
iteration and, after a partial Phase-B commit (auto_cleanup strip or compact, then a
|
||||
later op conflicts), saw `HEAD > manifest` from *our own* covered work and deleted the
|
||||
sidecar + returned `skipped_for_drift` (stranding uncovered drift). Fix: track
|
||||
`head_advanced`; the drift guard fires only when `!head_advanced`.
|
||||
2. **Publish exhaustion spurious error** (`e9d16a2c`): the publish loop returned `Err` on
|
||||
its final retry even if the conflict meant a concurrent writer already published `≥ N`
|
||||
(postcondition met). Fix: re-check `current >= state.version` on exhaustion.
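
A sketch of the monotonic fast-forward publish, including the exhaustion re-check from fix 2 above. Assumption, stated loudly: an `AtomicU64` stands in for the manifest's table-version row so the control flow is runnable; the real publish goes through the `__manifest` merge-insert, and `publish_monotonic` / `Publish` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Debug, PartialEq, Eq)]
pub enum Publish {
    /// We moved the manifest to `target`.
    Advanced,
    /// Someone already published `>= target`: postcondition met, no-op.
    AlreadyAtOrPast,
    /// Retries ran out and the manifest is still behind `target`.
    Exhausted,
}

/// Advance `manifest` to `target`, never backwards, with bounded retries.
pub fn publish_monotonic(manifest: &AtomicU64, target: u64, max_attempts: u32) -> Publish {
    for _ in 0..max_attempts {
        let current = manifest.load(Ordering::Acquire);
        if current >= target {
            return Publish::AlreadyAtOrPast;
        }
        // CAS against the value we just read; a loser simply re-reads.
        if manifest
            .compare_exchange(current, target, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return Publish::Advanced;
        }
    }
    // On exhaustion, judge by the state, not by having won the CAS.
    if manifest.load(Ordering::Acquire) >= target {
        Publish::AlreadyAtOrPast
    } else {
        Publish::Exhausted
    }
}
```

The final re-check is the whole of fix 2: the success criterion is "manifest is at or past `N`", so losing every CAS to a writer that published `≥ N` is still success.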
|
||||
|
||||
**The lesson (write it on the wall):** *wrapping a sequence of side-effecting commits in a
|
||||
retry silently converts every "checked once, before any side effect" precondition into
|
||||
"re-checked after partial side effects."* That's a distinct bug class; it needs
|
||||
fault-injection tests **at each commit boundary**, not just end-to-end concurrency tests.
|
||||
(The `optimize.before_compact` / `optimize.inject_reindex_conflict` failpoints exist for
|
||||
exactly this.)
|
||||
|
||||
**Temporary mechanism flag:** `head_advanced` is an in-memory proxy for "is this HEAD
|
||||
movement mine." Under Design A the authority answers that from the plan/sidecar **identity**
|
||||
— so `head_advanced` is the part that gets *replaced*, while the monotonic-publish +
|
||||
reopen/replan **semantics** are permanent. (Noted in RFC §6.6.)
|
||||
|
||||
---
|
||||
|
||||
## 4. DONE: Step 3b — capture-once `WriteTxn` (shipped on `rfc-013-step-3b-writetxn-v2`)
|
||||
|
||||
**Delivered:** on the **table-touch hot path**, a single `mutate`/`load` validates the schema
|
||||
contract **once** and opens each touched data table **at most once** — a constant-factor/RTT
|
||||
win (not a depth-slope win; 1a). Two cost gates in `write_cost.rs` lock it (both on a node
|
||||
insert): `write_validates_schema_contract_once` (3 `read_text` / 2 `exists`, was 12/9) and
|
||||
`keyed_insert_opens_table_at_most_once` (`data_open_count <= 1`, was 4). The carrier is the
|
||||
minimal `WriteTxn { branch, base }`, threaded as `Option<&WriteTxn>` (`Some` on the hot
|
||||
mutate/load path, `None` byte-identical everywhere else); it **converges into** step 5's
|
||||
`PublishPlan`.
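
A toy model of the `Option<&WriteTxn>` threading, to show why `Some` yields schema-once while `None` keeps the old per-site validation. Everything except the `WriteTxn { branch, base }` shape is an assumption: `Catalog`, `validated_snapshot`, `open_txn`, and `resolve` are illustrative stand-ins, and the "validation" is just a counter.

```rust
use std::cell::Cell;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Snapshot {
    pub manifest_version: u64,
}

/// The minimal carrier from this section.
pub struct WriteTxn {
    pub branch: String,
    pub base: Snapshot,
}

/// Stand-in for the resolve machinery; counts schema validations.
pub struct Catalog {
    live_version: u64,
    validations: Cell<u32>,
}

impl Catalog {
    pub fn new(live_version: u64) -> Self {
        Catalog { live_version, validations: Cell::new(0) }
    }

    /// A resolve that validates the schema contract (the expensive part).
    pub fn validated_snapshot(&self) -> Snapshot {
        self.validations.set(self.validations.get() + 1);
        Snapshot { manifest_version: self.live_version }
    }

    /// Capture once at entry: the single validation for the whole write.
    pub fn open_txn(&self, branch: &str) -> WriteTxn {
        WriteTxn { branch: branch.to_string(), base: self.validated_snapshot() }
    }

    /// `Some` sources the pinned base; `None` behaves as before.
    pub fn resolve(&self, txn: Option<&WriteTxn>) -> Snapshot {
        match txn {
            Some(t) => t.base.clone(),
            None => self.validated_snapshot(),
        }
    }

    pub fn validations(&self) -> u32 {
        self.validations.get()
    }
}
```

Note what the toy deliberately leaves out: the commit-time OCC re-read and the fork-authority reads must still go to a fresh source, so not every resolve site is allowed to take the `Some` arm.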
|
||||
|
||||
**Not "once" everywhere (scope, not regression):** edge endpoint / cardinality RI validation
|
||||
(`ensure_node_id_exists`, the loader's RI + cardinality) still resolves through
|
||||
`snapshot_for_branch` and re-validates the schema — and reads **warm**, not live. Threading
|
||||
`txn.base` there to make it "once" would re-introduce the stale-read class the #298 cardinality
|
||||
fix removed (it now reads live HEAD). Doing schema-once *and* fresh reads for those validations
|
||||
needs the unified, re-checked read-set — **step 4 §7.1** (§1d). So #298 **un-regresses
|
||||
cardinality only; it does not close write-validation freshness.** No edge-insert/load schema-once
|
||||
gate yet (only the node gates above).
|
||||
|
||||
Commits (off merged-#297 main):
|
||||
- **Stage 0** — scope `open_count` → `data_open_count`/`internal_open_count` by URI class
|
||||
(the review fix: `open_dataset_tracked` also opens `__manifest`/`_graph_commits`, so the
|
||||
raw counter conflated them and the gate was unreachable). Re-baselined RED 4.
|
||||
- **Commit A (schema-once)** — capture `txn` once at entry (the single validation); the 4
|
||||
validation sites collapse: S1 (entry `ensure_schema_state_valid`) removed; S3a
|
||||
(`open_for_mutation_on_branch`) + S3b (`prepare_updates_for_commit`) source `txn.base`;
|
||||
S4 (`commit_all`) uses new `fresh_snapshot_for_branch_unchecked` (the OCC manifest re-read
|
||||
minus the schema re-validation). `fresh_snapshot_for_branch{,_unchecked}` now read the
|
||||
manifest directly via `ManifestCoordinator` (drops a spurious commit-graph `exists` probe;
|
||||
same `Snapshot`).
|
||||
- **Commit B (open collapse 4→1)** — #1 accumulation open ELIMINATED (the node path discarded
|
||||
the handle; read `txn.base.entry().table_version`); #2 staging open KEPT (the one open);
|
||||
#3 commit drift-guard reads live HEAD via `entry.dataset.dataset().latest_version_id()` (a
|
||||
cheap manifest-pointer probe off the staged handle, not a fresh open); #4 index build reuses
|
||||
the `commit_staged` handle threaded through `CommittedMutation`/`prepare_updates_for_commit`.
|
||||
- **Commit B.1 + cleanup** — named the two positional returns (`OpenedForMutation`,
|
||||
`CommittedMutation`) + a `debug_assert` pinning the open-skip contract; **removed the
|
||||
unearned `WriteTxn.session` field** (the collapse uses skip/probe/reuse, not a session).
|
||||
|
||||
**RFC §4.1 corrections — how they resolved:**
|
||||
1. *Thread the evolving handle, not a version-keyed cache* → realized as collapse #4 (carry
|
||||
the `commit_staged` handle forward into the index build).
|
||||
2. *Don't forbid re-resolution* → honored: the commit-time OCC re-read
|
||||
(`fresh_snapshot_for_branch_unchecked` — fresh manifest, only schema-revalidation dropped)
|
||||
and the fork-authority reads stay fresh.
|
||||
3. *Minimal carrier* → `WriteTxn { branch, base }` (even the `session` from the original
|
||||
sketch was dropped as unearned).
|
||||
|
||||
**Deferred to step 5 (NOT in this PR):** session-aware write base opens. The one remaining
|
||||
open (#2) stays a HEAD open; warming the shared `Session` across writes is an object-store
|
||||
(S3) phenomenon invisible on local FS, so it earns its own `write_cost_s3.rs` gate in step 5,
|
||||
where `txn` becomes the non-optional publish carrier. No new concurrency test was needed here:
|
||||
#2 stays a HEAD open (no pinned+session base introduced), so the publisher CAS + #3 live-HEAD
|
||||
probe fences are unchanged (covered by the green `writes.rs`/`consistency.rs`).
|
||||
|
||||
**Guardrails (don't regress):** schema validation is deliberately uncached for drift
|
||||
detection — collapse to 1 *per write*, never cache across writes on a long-lived handle
|
||||
(`lifecycle::long_lived_handle_rejects_schema_*`). The commit-time fresh read is OCC
|
||||
machinery, not redundancy. Keep all 3 strict-op guards. Keep fork-authority reads fresh.
|
||||
Pin the *correct* branch (server-bound-to-main writing a feature branch falls to a fresh
|
||||
open). A branch `rfc-013-step-3b-writetxn` exists off an earlier main; rebase onto the
|
||||
post-#297 main before starting.
|
||||
|
||||
---
|
||||
|
||||
## 5. Design A — the `PublishPlan` unification (step 5) = the convergent fix
|
||||
|
||||
**This is the real fix for the bug class in §1c.** Collapse the four hand-rolled writers +
|
||||
the maintenance path into **one `publish(txn, plan)` authority** where the CAS + bounded
|
||||
retry is **unconditional and unbypassable** (no caller can "hold the queue → skip the CAS").
|
||||
Properties:
|
||||
- **One interpreter of the `HEAD vs manifest` delta** — and "is this my work?" is answered
|
||||
by the plan/sidecar **identity**, not a re-derived comparison. The own-HEAD-drift bug, the
|
||||
§6.5 writers, the write-path guard — all close *by construction*.
|
||||
- **Recovery = the same `PublishPlan` re-applied** — the crash-recovery interpreter and the
|
||||
live interpreter become the same code (`iss-merge-recovery-partial-rollforward` gone).
|
||||
- Each `TableAction` commits by its **class** (§1b): `Rewrite` = maintenance (Lance rebase
|
||||
+ reopen/replan + monotonic fast-forward, **no epoch**); load/mutate = logical (strict OCC
|
||||
+ `writer_epoch`).
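
A rough sketch of "one interpreter, answered by identity", with heavy assumptions: it supposes each table HEAD can be attributed to the plan id that produced it (`head_plan_id`), which is a hypothetical mechanism standing in for the plan/sidecar identity; `Delta`, `PublishPlan { plan_id }`, and `interpret_delta` are illustrative names, not the step-5 design.

```rust
use std::cmp::Ordering;

/// The possible readings of a `HEAD vs manifest` delta (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Delta {
    /// HEAD and manifest agree.
    InSync,
    /// HEAD is ahead and the work is attributed to this plan.
    OwnInFlight,
    /// HEAD is ahead and the work is not ours (drift or another writer).
    Foreign,
    /// Manifest references a version past HEAD; unexpected, surfaced as-is.
    ManifestAhead,
}

/// Hypothetical plan carrying only its identity.
pub struct PublishPlan {
    pub plan_id: u64,
}

/// The single place that classifies the delta. Ownership comes from
/// identity, not from re-deriving it out of version comparisons.
pub fn interpret_delta(
    manifest_version: u64,
    head_version: u64,
    head_plan_id: Option<u64>,
    plan: &PublishPlan,
) -> Delta {
    match head_version.cmp(&manifest_version) {
        Ordering::Equal => Delta::InSync,
        Ordering::Less => Delta::ManifestAhead,
        Ordering::Greater if head_plan_id == Some(plan.plan_id) => Delta::OwnInFlight,
        Ordering::Greater => Delta::Foreign,
    }
}
```

Under this shape the `head_advanced` flag from §3 has nothing left to do: the question it approximates in memory is answered by the identity comparison.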
|
||||
|
||||
**Why it composes with Lance MTT (don't over-build):**
|
||||
- The **unification itself is convergent** — when MTT lands, it slots *underneath* the same
|
||||
authority; nothing wasted. Build this.
|
||||
- The **`writer_epoch`** is the one MTT-redundant piece (MTT's commit-handler lease subsumes
|
||||
a cross-process fence). Build it *last and minimally*, gated on actually deploying
|
||||
multi-writer topologies. Per the deny-list, don't reimplement what the substrate will own.
|
||||
|
||||
**Sequencing judgment (this cycle's strongest signal):** the bug density here (this PR alone
|
||||
= 3 review rounds, all "a writer re-interprets the delta") means the current N-writers interim
|
||||
is high integrated-over-time liability. **Consider pulling the *convergent half* of step 5
|
||||
(the single authority + recovery-as-plan) forward — possibly ahead of 3b** — because it stops
|
||||
the bug class rather than patching instances. #297 + #254 are the *de-risking inputs*: they
|
||||
validate the maintenance-class and logical-class commit models in isolation first, so Design
|
||||
A implements a known spec rather than designing under refactor pressure. Do NOT build more
|
||||
substrate-shaped scaffolding (custom WAL / job queue / second coordination table) to paper
|
||||
over the window — strictly higher liability than either Design A or waiting for MTT.
|
||||
|
||||
**Deeper-than-A (post-MTT or as Lance exposes uncommitted variants):** all-uncommitted-fragments
|
||||
+ one manifest commit would shrink the A-before-B window itself, blocked today by Lance not
|
||||
exposing uncommitted variants for `compact_files` / `optimize_indices` / vector index (#6666
|
||||
open; delete #6658 shipped). Track, don't build yet.
|
||||
|
||||
### 5.1 Step-5 design constraints inherited from the #295 spec review
|
||||
3b shipped a **minimal** `WriteTxn { branch, base }` (schema-once + open-collapse via
|
||||
eliminate/probe/thread) and **deferred** the full §4.1 opener-unification — the pinned-base
|
||||
opener, the shared-`Session` open, the write-local **handle cache**, and the strict-op
|
||||
conflict-timing move — to step 5. So the greptile-bot comments on the #295 *spec* were **moot
|
||||
for #298** (which built none of those constructs) but are **load-bearing constraints for step
|
||||
5** when it builds them. Bank them:
|
||||
1. **Handle cache must be `Send + Sync`** (`Mutex<HashMap<…, Dataset>>`, not `RefCell`) if
|
||||
`WriteTxn::open(&self)` is shared across concurrent stage futures — a `RefCell` compiles
|
||||
but panics when two stages poll. Or make it `&mut self` (no parallel-stage sharing). This
|
||||
is the deny-list "in-process-only `Dataset` impls — `Send + Sync`" item.
|
||||
2. **The strict-op timing move needs an explicit retry contract.** If step 5 moves
|
||||
strict-op conflict detection from open-time `ensure_expected_version` to commit-time CAS
|
||||
(the §4.1 pinned-base design), it MUST specify: the txn is **discarded after any commit**
|
||||
(success or conflict — the handle cache is commit-invalidated), and the retry **re-opens a
|
||||
fresh `WriteTxn` at the new HEAD** (never re-stages against the stale pinned base — that
|
||||
reproduces the lost-update). **This is the same retry/refresh contract as the stale-view
|
||||
false-fail (§1d.2)** — the op-class-aware precondition + "fresh base on retry" are one
|
||||
design point. Today (#298) strict ops keep open-at-HEAD + `ensure_expected_version`, so the
|
||||
contract is unchanged; step 5 owns it the moment it pins strict reads to the base.
|
||||
3. **The opener-equivalence test must be non-trivial.** A differential test that only passes
|
||||
when `HEAD == base` proves nothing about pinning. To actually prove "`WriteTxn::open`
|
||||
returns the pinned base, not HEAD," the test must **advance the branch HEAD externally
|
||||
(direct Lance write), then assert the txn open still reads the base version** — and that a
|
||||
strict write then fails `ExpectedVersionMismatch` at commit (verifying the timing move).
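
For constraint 1, a sketch of a `Send + Sync` write-local handle cache with a compile-time check. Assumptions: `Dataset` here is a local stand-in struct (not Lance's type), the cache is keyed by table name only, and `HandleCache` / `get_or_open` are hypothetical names. The lock is held across `open` for simplicity; an async implementation would need to avoid holding a std `Mutex` across an await.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for an opened, pinned dataset handle (assumption).
pub struct Dataset {
    pub version: u64,
}

/// Write-local handle cache that can be shared across concurrent stages.
#[derive(Default)]
pub struct HandleCache {
    inner: Mutex<HashMap<String, Arc<Dataset>>>,
}

impl HandleCache {
    /// Return the cached handle for `table`, opening it at most once.
    pub fn get_or_open(&self, table: &str, open: impl FnOnce() -> Dataset) -> Arc<Dataset> {
        let mut map = self.inner.lock().unwrap();
        map.entry(table.to_string())
            .or_insert_with(|| Arc::new(open()))
            .clone()
    }
}

/// Compile-time proof of the constraint: this fails to build if
/// `HandleCache` ever stops being `Send + Sync` (e.g. a `RefCell` sneaks in).
fn assert_send_sync<T: Send + Sync>() {}

pub fn handle_cache_is_send_sync() {
    assert_send_sync::<HandleCache>();
}
```

Per constraint 2, a cache like this would be dropped with its txn after any commit, so a retry starts from an empty cache on a fresh `WriteTxn`.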
|
||||
|
||||
---
|
||||
|
||||
## 6. Why #297 is still needed even if you do Design A
|
||||
- Design A **relocates** #297's maintenance-class commit logic into the authority's
|
||||
`TableAction::Rewrite` path; it does not eliminate it. #297 is the *validated spec + tests*.
|
||||
- The two regression tests + §6.6 are the **contract** Design A must keep green.
|
||||
- The prod bug is **live**; Design A is the largest write-path change in the RFC. Don't hold a
|
||||
correctness fix hostage to a big refactor, and don't do a big refactor under bug-fix urgency.
|
||||
- Genuinely throwaway under Design A: only the loop's *location* + the `head_advanced` proxy
|
||||
(~a dozen lines). Everything else relocates or persists. **#297 LANDED.**
|
||||
|
||||
---
|
||||
|
||||
## 7. Open PRs and their relationships
|
||||
- **#297** — maintenance-class fix (optimize vs write). **LANDED** (origin/main `6d4606a8`);
|
||||
step 3b stacks on it.
|
||||
- **#254** — logical-class fix (schema-apply vs optimize false-fail). Same op-class family;
|
||||
both are de-risking inputs for Design A's per-class commit models.
|
||||
- **#296** — recovery roll-forward converges on concurrent manifest advance. The fix
|
||||
for the flaky `iss-schema-apply-reopen-recovery-race`. It touches `recovery.rs` and is
|
||||
*aligned* with #297's "postcondition is the state, not winning the CAS" principle. **#296
|
||||
landed on main first and is merged into this branch:** the converge helper
|
||||
(`converge_or_defer_roll_forward`) was reconciled with Phase-7's manifest-CAS roll-forward —
|
||||
on convergence the audit references the winner's folded `graph_commit_id` (the current
|
||||
`graph_head`), not a freshly minted one.
|
||||
- **#295** — the step-3b RFC doc (apply §4's three corrections to it).
|
||||
|
||||
---
|
||||
|
||||
## 8. Remaining RFC steps after 3b (RFC §9 is canonical)
|
||||
- **#298 follow-up (do on the 3b PR, before merge): the edge-`@card` stale-read regression**
|
||||
(§1d.1). Restore the live-HEAD cardinality scan, add the deterministic regression test, fix
|
||||
the wrong doc comment. Small, gate-safe, un-regresses an integrity check (invariant 9). The
|
||||
residual concurrent TOCTOU is the §7.1 gap (step 4) — un-widen here, don't over-reach.
|
||||
- **Step 4 / Phase 7** (`iss-991`): lineage into `__manifest` (publish `graph_commit` +
|
||||
mutable `graph_head:<branch>` in the same merge-insert; `_graph_commits` becomes a
|
||||
projection). Removes the per-write `commit_graph.refresh`; closes the manifest→commit-graph
|
||||
atomicity + commit-graph-parent-under-concurrency gaps. **Hard prereq: step 2 (done).**
|
||||
Carries the §7.1 *concurrent* write-skew fix (needs the `graph_head` contention row) —
|
||||
**frame §7.1 as "unify the entire write-validation read-set" (endpoint + cardinality +
|
||||
cross-version uniqueness), not merely "add `graph_head`"** (§1d.1): the bespoke
|
||||
`edge_cardinality_read_handle` and the mutation-vs-loader freshness fork dissolve into one
|
||||
pinned read-set re-validated under the `graph_head` contention, or the liability survives as
|
||||
a second special-case.
|
||||
- **Step 5 / Design A** — §5 above. **Acceptance item: the served-strict-write stale-view
|
||||
false-fail** (§1d.2) — the op-class-aware precondition + `open_txn` probe. The contract is
|
||||
two tests passing *together*: un-ignore
|
||||
`writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes` (goes green)
|
||||
*while* `consistency::stale_handle_public_mutation_must_refresh_then_retry` stays green
|
||||
(maintenance fast-forwards; logical fails loudly). Self-contained enough to ship standalone
|
||||
like #297 if prod pain is acute; otherwise fold into the single PublishPlan delta-interpreter.
|
||||
- **Step 2b** — internal-table cleanup + the Q8 monotonic watermark (a Lance boundary tag).
|
||||
Deferred: only the secondary version-count/space term, touches the read/open path, and is
|
||||
MTT-redundant. Land when version-count cost bites.
|
||||
- **§7.1 sequential write-skew** (`iss-overwrite-orphans-committed-edges`) — inbound-RI
|
||||
validation on node removal; independent, ships anytime.
|
||||
- **#20** — the prod per-write `storage.ops` span metric (RFC §5.3), still owed.
|
||||
- Branch ops: Lance `Clone` for create (`iss-691`).
|
||||
|
||||
---
|
||||
|
||||
## 9. Gotchas / traps (learned the hard way)
|
||||
- **In-process queue ≠ cross-process lock.** Any "I hold the queue → skip the retry/CAS"
  reasoning is a bug across processes. This is the recurring trap.
- **Monotonic publish must be `≥`-conditional, never "no assertion."** The `__manifest`
  merge-insert is unconditional `UpdateAll` keyed on `object_id` (`publisher.rs:379`), so
  the equality (or monotonic) pre-check is the *only* guard — dropping it lets `UpdateAll`
  regress a newer version = lost write.
- **The drift guard interprets an ambiguous delta.** Re-evaluating it in a retry over
  self-mutated state is how #297's follow-on bug happened. Gate any HEAD-vs-manifest
  interpretation on "have *we* committed yet."
- **`compact_files` fires Lance's auto_cleanup GC hook** (commits with
  `skip_auto_cleanup=false`, no override) — optimize strips stale `lance.auto_cleanup.*`
  config before compacting to stay non-destructive on upgraded graphs. The strip is a
  separate commit (relevant to the partial-commit retry trap).
- **Lance rebases the common concurrent case for free** — so the data-table conflict usually
  surfaces as the manifest fast-forward, not a Lance error. The Lance-Rewrite-overlap path is
  rare and needs failpoint injection to test.
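The monotonic-publish trap above can be illustrated with a minimal single-threaded sketch. It is not the publisher's actual code: `Manifest`, `update_all`, `publish_monotonic`, and `PublishError` are hypothetical names, and an in-memory map stands in for `__manifest`. In a real store the check and the upsert would have to be one atomic conditional write.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for `__manifest`: object_id -> version.
#[derive(Default)]
struct Manifest {
    rows: HashMap<String, u64>,
}

#[derive(Debug, PartialEq)]
enum PublishError {
    /// The row already holds a newer version; publishing would regress it.
    WouldRegress { current: u64, proposed: u64 },
}

impl Manifest {
    /// Unconditional upsert, modelled on an `UpdateAll` keyed on `object_id`.
    /// On its own it will overwrite a newer version with an older one.
    fn update_all(&mut self, object_id: &str, version: u64) {
        self.rows.insert(object_id.to_string(), version);
    }

    /// Monotonic publish: the `>=` pre-check is the only guard in front of
    /// the unconditional upsert, so it must never be skipped.
    fn publish_monotonic(&mut self, object_id: &str, proposed: u64) -> Result<(), PublishError> {
        let current = self.rows.get(object_id).copied().unwrap_or(0);
        if proposed < current {
            return Err(PublishError::WouldRegress { current, proposed });
        }
        self.update_all(object_id, proposed);
        Ok(())
    }
}
```

Calling `update_all` directly with an older version silently regresses the row, which is the lost-write case the bullet warns about; `publish_monotonic` rejects the same call.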
|
||||
|
||||
---
|
||||
|
||||
## 10. Verification (the gate)
|
||||
- `cargo test --workspace --locked` — the canonical gate (matches CI).
- `cargo test -p omnigraph-engine --features failpoints --test failpoints optimize` —
  the optimize concurrency/recovery tests.
- `cargo test -p omnigraph-engine --test write_cost` / `write_cost_s3` (bucket-gated) —
  cost gates (3b adds the schema-once + open-count asserts here).
- `cargo test -p omnigraph-engine --test maintenance` — optimize/repair/cleanup.
- Re-read [`invariants.md`](invariants.md), [`lance.md`](lance.md), [`testing.md`](testing.md)
  before each change (always-on requirement).

Lance source for re-validation:
`/Users/ragnor/.cargo/registry/src/index.crates.io-*/lance-7.0.0` (key files: `io/commit.rs`,
`io/commit/conflict_resolver.rs`, `dataset/optimize.rs`, `dataset/write/retry.rs`,
`dataset/builder.rs`).
|
||||
|
|
@ -93,6 +93,7 @@ Working documents for in-flight feature work. Removed when the work lands.
|
|||
| CLI refactoring — one addressing & config model post-`omnigraph.yaml`: scope + `--graph` + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) | [rfc-011-cli-refactoring.md](rfc-011-cli-refactoring.md) |
|
||||
| Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
|
||||
| Write-path latency — capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) |
|
||||
| RFC-013 handoff — current-state map, latest validation, and concrete next actions for finishing write-path latency and correctness work | [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) |
|
||||
|
||||
## Boundary
|
||||
|
||||
|
|
|
|||
|
|
@ -211,10 +211,21 @@ them explicit.
|
|||
sweep has the same exposure, and always has): it may roll a live foreign
|
||||
writer's sidecar forward, which degrades to publisher-CAS contention for
|
||||
data writes but can race the schema-staging promotion for a foreign live
|
||||
schema apply. Multi-process writers on one graph are already documented
|
||||
one-winner-CAS territory; closing this fully needs a cross-process
|
||||
serialization primitive (e.g. lease-based use of the schema-apply lock
|
||||
branch) — design it before promoting multi-process write topologies.
|
||||
schema apply. The roll-**forward** CAS contention is now
|
||||
convergence-idempotent: when the publish loses the CAS to a concurrent
|
||||
writer that already reached the sidecar's goal, the sweep treats it as
|
||||
convergence (record the `RolledForward` audit + delete) rather than a fatal
|
||||
`ExpectedVersionMismatch`, and defers when the manifest is only partway
|
||||
(`converge_or_defer_roll_forward` in `db/manifest/recovery.rs`;
|
||||
iss-schema-apply-reopen-recovery-race). So a concurrent advance no longer
|
||||
fails the open. The schema-staging promotion race and the destructive
|
||||
roll-**back** path (Lance `Restore` "trumps" a concurrent commit, so it
|
||||
cannot be made idempotent — iss-recovery-sweep-live-writer-rollback) still
|
||||
need the cross-process primitive. Multi-process writers on one graph are
|
||||
already documented one-winner-CAS territory; closing this fully needs a
|
||||
cross-process serialization primitive (e.g. lease-based use of the
|
||||
schema-apply lock branch) — design it before promoting multi-process write
|
||||
topologies.
|
||||
- **Fork reclaim is in-process-safe only:** the first write to a table on a
|
||||
branch forks it (a Lance `create_branch` that advances state before the
|
||||
manifest publish). An interrupted fork (crash, or a cancelled request
|
||||
|
|
@ -242,20 +253,43 @@ them explicit.
|
|||
acknowledged-before-visible bug this branch fixed. Close it (local CAS
|
||||
primitive, or a trait-level lock requirement) before admitting any
|
||||
lock-free `if_match` caller.
|
||||
- **Manifest→commit-graph publish atomicity:** a graph commit advances
|
||||
`__manifest` (the visibility authority) and then appends `_graph_commits` as
|
||||
two separate writes (`commit_updates_with_actor_with_expected`, failpoint
|
||||
`graph_publish.before_commit_append`). A crash between them leaves the manifest
|
||||
at version N with no commit-graph row for N. Live reads and durability are
|
||||
unaffected — the live version resolves via the manifest
|
||||
(`GraphCoordinator::version()`), not the commit-graph head — and the open-time
|
||||
recovery sweep does NOT repair it (`lance_head == manifest_pinned` classifies
|
||||
`NoMovement`; a recovery sidecar would not change this). Impact is bounded to
|
||||
commit history: `commit list` misses N, time-travel by commit id to N fails,
|
||||
and merge-base loses a node (a likely-benign off-by-one re-merge). This affects
|
||||
every publish, not a specific maintenance command. Eventual fix: make the
|
||||
commit graph reconcilable from the manifest (or the two writes atomic) — not a
|
||||
recovery-sidecar concern.
|
||||
- **Manifest→commit-graph publish atomicity — CLOSED (RFC-013 Phase 7):** graph
|
||||
lineage now lives ONLY in `__manifest`, as `graph_commit` + `graph_head:<branch>`
|
||||
rows written in the SAME `MergeInsertBuilder` commit as the table-version rows
|
||||
(`commit_changes_with_lineage` → `GraphNamespacePublisher::publish` with a
|
||||
`LineageIntent`). There is no second write to fail between — a graph commit and
|
||||
its lineage land at one manifest version atomically, so a crash after the publish
|
||||
leaves no gap. The commit-graph cache is a derived projection of those manifest
|
||||
rows; nothing writes `_graph_commits.lance` (it persists only to carry branch
|
||||
refs). The prior two-write gap (manifest at N with no `_graph_commits` row for N)
|
||||
is gone by construction. A graph created before Phase 7 (internal schema v3)
|
||||
carries its lineage only in `_graph_commits.lance`; the `migrate_v3_to_v4`
|
||||
internal-schema step (`db/manifest/migrations.rs`) backfills it into `__manifest`
|
||||
per-branch on the first read-write open (idempotent, crash-safe, data-preserving),
|
||||
and a read-only open of an un-migrated v3 graph sources the DAG from
|
||||
`_graph_commits.lance` via a stamp-gated transitional fallback so reads stay
|
||||
correct until the first write migrates it. An old binary refuses a v4-stamped
|
||||
graph (read-write and read-only) with the standard upgrade error. The migration
|
||||
is **loud on failure and concurrent-runner idempotent**: the legacy-open read
|
||||
(`read_legacy_commit_cache`) treats only a genuine not-found as "no legacy data"
|
||||
and propagates any other open error (so a transient/corrupt open can never stamp
|
||||
v4 over an empty backfill — orphaning lineage permanently), and the backfill
|
||||
converges all-or-nothing when two runners open the same legacy graph at once — a
|
||||
bounded re-open retry on the `graph_head:<branch>` row-level CAS plus an
|
||||
idempotent terminal stamp bump (both runners write the same value, so a concurrent
|
||||
`UpdateConfig`/`IncompatibleTransaction` loss re-opens and no-ops if the stamp
|
||||
already landed). The branch read path (`load_commit_cache_for_branch`) also
|
||||
refuses an out-of-range branch stamp (`> CURRENT` or `< MIN_SUPPORTED`;
|
||||
defense-in-depth; not a live hole because migrations run main-first, so main
|
||||
refuses first). The migration chain is **floor-bounded**:
|
||||
`MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (migrations.rs; 1 today, a pure no-op) is
|
||||
the oldest stamp this binary opens, enforced symmetrically with the ceiling by the
|
||||
single `refuse_if_stamp_unsupported` guard at all three stamp-read sites
|
||||
(write-path migrate, read-only open, branch lineage-read). Raising MIN sheds the
|
||||
now-dead `migrate_vN_…` arms and (at MIN ≥ 4) the `commit_graph_legacy_v3` legacy
|
||||
readers; a compile-time tripwire (`LOWEST_REGISTERED_MIGRATION_SOURCE`) fails the
|
||||
build if the floor and the lowest registered arm drift. Retirement runbook lives on
|
||||
the `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` doc-comment.
|
||||
- **Planner capability/stat surfaces:** cost-aware planning, complete
|
||||
capability advertisement, and explain-with-cost are roadmap. Do not describe
|
||||
them as implemented.
|
||||
|
|
@ -291,19 +325,23 @@ them explicit.
|
|||
in history; but they are not yet brought into `cleanup` (version GC), so the
|
||||
`_versions/` chain still grows until an explicit cleanup (the cleanup half is
|
||||
deferred — it needs the Q8 cleanup-resurrection watermark first). The commit
|
||||
graph is not yet reconcilable from the manifest; and the traversal id-map is
|
||||
graph IS now reconcilable from the manifest (RFC-013 Phase 7 — it is a pure
|
||||
projection of the `graph_commit`/`graph_head` rows); the traversal id-map is
|
||||
still rebuilt.
|
||||
- **Commit-graph parent under concurrency:** `record_graph_commit` now refreshes
|
||||
the commit-graph head from storage before appending, so a same-branch write
|
||||
after an external commit no longer forks the commit DAG by parenting off a
|
||||
stale cached head (the single-process fork, pre-existing for non-strict
|
||||
inserts and widened to strict ops by Fix 1's `refresh_manifest_only`, is now
|
||||
closed). Residual: two processes writing disjoint tables can still pass their
|
||||
per-table manifest CAS and append off the same parent (a refresh-then-append
|
||||
TOCTOU). The convergent fix is reconcile-from-manifest (parent = the commit at
|
||||
the manifest version the publisher CAS'd against; `manifest_version` is on
|
||||
every commit row), composing with the manifest-to-commit-graph atomicity gap;
|
||||
it needs commit-graph append ordering or a Lance append-CAS to fully close.
|
||||
- **Commit-graph parent under concurrency — CLOSED (RFC-013 Phase 7):** the graph
|
||||
commit is now recorded in the manifest publish CAS, and the publisher resolves
|
||||
the new commit's parent INSIDE its retry loop, per attempt, from the just-loaded
|
||||
`__manifest` (the `should_replace_head` winner over the visible `graph_commit`
|
||||
rows). A CAS-conflict retry re-reads the advanced head and parents correctly, so
|
||||
the refresh-then-append TOCTOU is gone. Two processes writing disjoint tables on
|
||||
the same branch now also contend on the shared `graph_head:<branch>` row (one
|
||||
`object_id`, `WhenMatched::UpdateAll`): one wins, the other retries and re-parents
|
||||
— so the cross-process disjoint-table fork is closed too. This is the intended
|
||||
§7.1 contention point, pinned by
|
||||
`manifest::tests::concurrent_disjoint_writes_share_head_and_form_linear_chain`
|
||||
(two disjoint writers → both commit, single linear chain) and
|
||||
`manifest::tests::n_concurrent_disjoint_writers_converge_to_one_linear_chain`
|
||||
(N=8 disjoint writers with app-level retry → one linear chain of 8, no fork).
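The "resolve the parent inside the retry loop" behavior described above can be sketched as follows. This is an illustrative model only, assuming a single version counter as the CAS token; `Store`, `ManifestState`, `Commit`, `try_publish`, and `publish_with_retry` are hypothetical names, and the `interfere` hook exists purely to simulate a concurrent writer between load and CAS in a single thread.

```rust
/// Hypothetical manifest state: a version counter plus the branch head commit.
#[derive(Clone, Debug, PartialEq)]
struct ManifestState {
    version: u64,
    head: Option<u64>, // id of the commit the branch head row points at
}

#[derive(Debug, PartialEq)]
struct Commit {
    id: u64,
    parent: Option<u64>,
}

struct Store {
    state: ManifestState,
    log: Vec<Commit>,
}

impl Store {
    fn load(&self) -> ManifestState {
        self.state.clone()
    }

    /// Compare-and-swap: succeeds only if nobody advanced the manifest since `seen`.
    fn try_publish(&mut self, seen: &ManifestState, commit: Commit) -> bool {
        if self.state.version != seen.version {
            return false;
        }
        self.state = ManifestState { version: seen.version + 1, head: Some(commit.id) };
        self.log.push(commit);
        true
    }
}

/// Publish commit `id`, re-resolving its parent from a freshly loaded manifest
/// on every attempt. `interfere` is a test hook that runs between load and CAS.
fn publish_with_retry(
    store: &mut Store,
    id: u64,
    max_attempts: usize,
    mut interfere: impl FnMut(&mut Store, usize),
) -> bool {
    for attempt in 0..max_attempts {
        let seen = store.load();
        interfere(&mut *store, attempt);
        // The parent comes from the state loaded in THIS attempt, never from a cache.
        let commit = Commit { id, parent: seen.head };
        if store.try_publish(&seen, commit) {
            return true;
        }
    }
    false
}
```

Because the parent is taken from the state loaded in the same attempt that performs the CAS, a losing attempt cannot leave a commit parented off a stale head; the retry simply re-parents onto the winner.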
|
||||
|
||||
## Deny-list
|
||||
|
||||
|
|
|
|||
|
|
@ -170,6 +170,7 @@ Migration from Lance 6.0.1 → 7.0.0 landed in this cycle. **Arrow stayed 58, Da
|
|||
- **Native `DirectoryNamespace` no longer recognizes omnigraph's manifest-tracked tables** (`lance-namespace-impls` dir.rs ~L1310): `list/describe/create_table_version` route through `check_table_status`, which reports an omnigraph table absent → `TableNotFound`. The decoupling is *contingent on omnigraph's legacy boolean PK key*, not an unconditional v7 property: v7's namespace eagerly adds the new `lance-schema:unenforced-primary-key:position` key to any `__manifest` lacking it; that write hits the immutable-PK rule above (the boolean key already set the PK), so `ensure_manifest_table_up_to_date` errors and the namespace silently falls back to directory listing. omnigraph keeps the boolean key deliberately — Lance honors it permanently (maps to PK position 0), and one uniform on-disk format beats a new-vs-old split (existing graphs can't be re-keyed to the position key under that same immutability rule). omnigraph production never uses Lance's native namespace (its publisher writes `__manifest` directly via merge_insert; its own `namespace.rs` impls are custom), so this is test-only — the `test_directory_namespace_direct_publish_cannot_replace_native_omnigraph_write_path` surface guard was realigned to the v7 behavior (it now asserts the native namespace is fully decoupled, which only strengthens the guard's thesis).
|
||||
- **Still NOT fixed in 7.0.0:** vector-index two-phase (Lance #6666 open) — `create_vector_index` inline residual retained; blob-column compaction — the `compact_files_still_fails_on_blob_columns` guard still passes (it turns red once Lance ships a fix), and `optimize` still skips blob tables behind `LANCE_SUPPORTS_BLOB_COMPACTION`.
|
||||
- **No Lance API surface omnigraph uses changed at *compile* time** (the only compile break was object_store) — but **two runtime behaviors did** (the unenforced-PK immutability and the native-namespace `TableNotFound`, above), each caught by the full engine test suite rather than the build. `CleanupPolicy`, `WriteParams` (apart from the `auto_cleanup` default), `CompactionOptions`, the namespace models (resolved via `lance-namespace-reqwest-client` 0.7.7, unchanged across the bump), `Operation`, `ManifestLocation`, and `MergeInsertBuilder` shapes are all stable. Lesson: a clean build is not a clean alignment — run `cargo test --workspace` before declaring a Lance bump done.
|
||||
- **Two surface guards added by the v3→v4 migration-robustness follow-up** (not a Lance bump, but they pin Lance error surfaces the migration now classifies on): `dataset_open_missing_returns_not_found_variant` (a missing `Dataset::open` returns `DatasetNotFound`/`NotFound` — the legacy-open read in `db/commit_graph.rs::read_legacy_commit_cache` treats only those as "no legacy data" and propagates everything else) and `lance_error_incompatible_transaction_variant_exists` (a concurrent `UpdateConfig` stamp-bump loses with `IncompatibleTransaction` — `db/manifest/migrations.rs::commit_v4_stamp_idempotently` matches it to retry the benign same-value race). Re-run on a Lance bump like the others.
|
||||
|
||||
Bump this date stanza on the next alignment pass.
|
||||
|
||||
|
|
|
|||
|
|
@ -523,7 +523,10 @@ struct WriteTxn {
|
|||
branch: BranchRef,
|
||||
base: PinnedSnapshot, // {manifest_version, per-table (loc,version,e_tag), schema_hash, writer_epoch}
|
||||
session: Arc<Session>, // shared per-graph; warms metadata/index caches across opens
|
||||
handles: HandleCache, // open-by-version; each table opened once, reused across stages
|
||||
handles: HandleMap, // open the base once WITH session; thread the handle each
|
||||
// commit RETURNS forward (HEAD walks N→N+1→N+2). NOT a
|
||||
// version-keyed cache — HEAD moves, so a (table,version) key
|
||||
// misses; reuse = forward the commit-return handle. [3b-validated]
|
||||
}
|
||||
|
||||
// A typed, declarative publish plan — the COMPLETE "what", built before any HEAD moves.
|
||||
|
|
@ -546,8 +549,17 @@ impl GraphPublishAuthority {
|
|||
|
||||
Properties that make it optimal:
|
||||
|
||||
- **Stages take `&WriteTxn`/`&PublishPlan`, never storage** — re-resolution and
|
||||
open-latest are *unrepresentable*. Invariants 2/3/15 hold by construction.
|
||||
- **Stages take `&WriteTxn`/`&PublishPlan` for the BASE** — re-resolving the pinned
|
||||
read base / open-latest for the pre-commit phase is unrepresentable; invariants 2/3/15
|
||||
hold for the base by construction. **Caveat [3b-validated]:** this is NOT "no
|
||||
re-resolution anywhere." Three commit-boundary reads are irreducible correctness
|
||||
machinery and MUST stay fresh: the commit-time `fresh_snapshot_for_branch` (cross-process
|
||||
OCC), the live-HEAD drift probe (a concurrent writer may have moved HEAD since staging),
|
||||
and the fork-authority reads (`classify_fork_ref` deliberately bypasses the cached base —
|
||||
a pinned base there re-opens the "force-delete a live fork" bug). Model "pinned base for
|
||||
the pre-commit phase + named fresh re-reads at the commit/fork boundary." The achievable
|
||||
open count is **1 base open (with session) + 1 cheap `latest_version_id` probe + threaded
|
||||
commit handles**, not literally one open.
|
||||
- **The recovery sidecar *is* the serialized `PublishPlan`.** Phase C and
|
||||
recovery both call `plan.apply()` — a merge that bumps tables A+B can never
|
||||
roll A forward and silently drop B. The
|
||||
|
|
|
|||
|
|
@ -47,7 +47,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
|
|||
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
|
||||
| `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly — no HTTP — proving embedded and CLI callers hit the same gate as the server, with action × scope shapes matching `authorize_request` |
|
||||
| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice; the index reconciler (iss-848): `index_build_tolerates_null_vector_rows` (an untrainable Vector column defers instead of aborting the build, sibling indexes still build) and `optimize_materializes_index_declared_but_unbuilt` (optimize creates a declared-but-deferred index) |
|
||||
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). |
|
||||
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). Also the v3→v4 migration fault-injection test (`transient_legacy_open_failure_aborts_migration_without_stamping_v4`, `migration.v3_to_v4.legacy_open` failpoint): a transient legacy-open failure aborts the migration loudly and leaves it retryable (stamp stays v3, no partial backfill), never stamping v4 over an empty backfill. Also the v4 stamp-bump exhaustion regression (`v4_stamp_exhaustion_returns_retryable_contention`, `migration.v4_stamp.force_incompatible` failpoint): the stamp retry loop surfaces a retryable `RowLevelCasContention` on exhaustion, not a stringified `Lance`. And the convergence-idempotent roll-forward regression (`open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one sidecar at the `recovery.before_roll_forward_publish` rendezvous; the CAS loser must converge, not fail the open — iss-schema-apply-reopen-recovery-race). |
|
||||
| `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path |
|
||||
| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). |
|
||||
|
||||
|
|
@ -65,10 +65,12 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
|
|||
|
||||
## Failpoints (fault injection)
|
||||
|
||||
- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` (in `crates/omnigraph/Cargo.toml` **and** `crates/omnigraph-cluster/Cargo.toml`; the cluster feature does not enable the engine's).
|
||||
- Wrappers: `crates/omnigraph/src/failpoints.rs` and `crates/omnigraph-cluster/src/failpoints.rs` expose `maybe_fail("name")` and `ScopedFailPoint` for tests.
|
||||
- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, cluster apply's payload→state-write window, etc.).
|
||||
- Activated tests: `crates/omnigraph/tests/failpoints.rs` and `crates/omnigraph-cluster/tests/failpoints.rs` (crash-mid-apply + state CAS race via `fail::cfg_callback`; integration binaries, never in-source — the fail registry is process-global). Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints` / `cargo test -p omnigraph-cluster --features failpoints --test failpoints`.
|
||||
- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` in `crates/omnigraph/Cargo.toml`; the cluster's `failpoints` feature additionally enables `omnigraph/failpoints` (`crates/omnigraph-cluster/Cargo.toml`), so the shared test guard is available to cluster tests.
|
||||
- Wrappers: `crates/omnigraph/src/failpoints.rs` and `crates/omnigraph-cluster/src/failpoints.rs` each expose `maybe_fail("name")` (per-crate error type). The test-side config guard `ScopedFailPoint` (`new` for action strings, `with_callback` for callbacks; RAII `Drop` removes the point) lives **once** in the engine and is reused by both test binaries.
|
||||
- **Names are compile-checked.** Every failpoint name is a `pub const` in `omnigraph::failpoints::names` (engine) / `omnigraph_cluster::failpoints::names` (cluster). Call sites and tests reference the constant, never a bare literal — a typo is a compile error, not a silently-never-firing point. Add a new failpoint by adding its const first.
|
||||
- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, the recovery sweep's classify→roll-forward-publish window, cluster apply's payload→state-write window, etc.).
|
||||
- **Serialize and rendezvous, never sleep.** The `fail` registry is process-global, so every failpoint test carries `#[serial]` (`serial_test`). For concurrent tests, use `helpers::failpoint::Rendezvous` (`tests/helpers/failpoint.rs`): `park_first(name)` parks the first thread to hit the point until `release()`, and `wait_until_reached().await` blocks on that condition (it doubles as a fired-assertion). Do not coordinate threads with fixed `sleep`s.
|
||||
- Activated tests: `crates/omnigraph/tests/failpoints.rs` and `crates/omnigraph-cluster/tests/failpoints.rs` (integration binaries, never in-source — the fail registry is process-global). Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints` / `cargo test -p omnigraph-cluster --features failpoints --test failpoints`.
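The "rendezvous, never sleep" rule above can be sketched with plain `std` primitives. This is a simplified synchronous stand-in, not the repository's `helpers::failpoint::Rendezvous` (which is keyed by failpoint name and has an async `wait_until_reached`); the `Gate` type, the `park` method, and `run_ordered` are assumptions made for illustration.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Default)]
struct Gate {
    reached: bool,
    released: bool,
}

/// Hypothetical condition-based rendezvous: no fixed sleeps anywhere.
#[derive(Clone, Default)]
struct Rendezvous {
    inner: Arc<(Mutex<Gate>, Condvar)>,
}

impl Rendezvous {
    /// Called at the instrumented point: record arrival, then park until released.
    fn park(&self) {
        let (lock, cv) = &*self.inner;
        let mut gate = lock.lock().unwrap();
        gate.reached = true;
        cv.notify_all();
        while !gate.released {
            gate = cv.wait(gate).unwrap();
        }
    }

    /// Block until some thread has parked; doubles as a "the point fired" check.
    fn wait_until_reached(&self) {
        let (lock, cv) = &*self.inner;
        let mut gate = lock.lock().unwrap();
        while !gate.reached {
            gate = cv.wait(gate).unwrap();
        }
    }

    /// Let the parked thread continue.
    fn release(&self) {
        let (lock, cv) = &*self.inner;
        lock.lock().unwrap().released = true;
        cv.notify_all();
    }
}

/// Run a worker that parks at the rendezvous while the main thread acts first.
fn run_ordered() -> Vec<&'static str> {
    let rv = Rendezvous::default();
    let order = Arc::new(Mutex::new(Vec::new()));
    let (rv2, order2) = (rv.clone(), Arc::clone(&order));
    let worker = thread::spawn(move || {
        rv2.park();
        order2.lock().unwrap().push("worker resumed");
    });
    rv.wait_until_reached();
    order.lock().unwrap().push("main acted while worker was parked");
    rv.release();
    worker.join().unwrap();
    let result = order.lock().unwrap().clone();
    result
}
```

The ordering is deterministic because the worker cannot pass `park` until `release` runs, and `release` runs only after the main thread has done its work. A fixed `sleep` would give the same ordering only by luck of scheduling.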
|
||||
|
||||
## RustFS / S3 integration
|
||||
|
||||
|
|
|
|||
|
|
@ -230,8 +230,9 @@ recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`:
|
|||
rolled-back-to version (`manifest_pinned`); the manifest is published at the
|
||||
restore commit (`manifest_pinned + 1`, same content).
|
||||
- After a successful roll-forward or roll-back, an audit row is
|
||||
recorded — `_graph_commits.lance` carries
|
||||
a commit tagged `actor_id = "omnigraph:recovery"`, and a sibling
|
||||
recorded — the graph commit lineage (the `graph_commit` rows in `__manifest`
|
||||
since RFC-013 Phase 7) carries a commit tagged
|
||||
`actor_id = "omnigraph:recovery"`, and a sibling
|
||||
`_graph_commit_recoveries.lance` row carries `recovery_kind`,
|
||||
`recovery_for_actor` (the original sidecar's actor), `operation_id`,
|
||||
and per-table outcomes. Operators run `omnigraph commit list --filter
|
||||
|
|
@ -336,20 +337,40 @@ actual }`. The HTTP server maps this to **409 Conflict** with body
|
|||
|
||||
## Audit
|
||||
|
||||
`actor_id` lands in `_graph_commits.lance` via `record_graph_commit` (no
|
||||
intermediate run record). Audit history is queried via `omnigraph commit
|
||||
list`.
|
||||
`actor_id` lands in the graph commit lineage — the `graph_commit` rows in
|
||||
`__manifest`, written in the publish CAS (RFC-013 Phase 7; previously
|
||||
`_graph_commits.lance`). Audit history is queried via `omnigraph commit list`.
|
||||
|
||||
## Migration code
|
||||
|
||||
`db/manifest/migrations.rs` carries the v2→v3 internal-schema step (MR-770):
|
||||
a one-time sweep that deletes legacy `__run__*` staging branches off
|
||||
`__manifest`. It runs in `Omnigraph::open(ReadWrite)` (via
|
||||
`manifest::migrate_on_open`, before the coordinator reads branch state) and
|
||||
again on the publisher's write path; both are idempotent once the stamp is at
|
||||
v3. Deleting the inert `_graph_runs.lance` / `_graph_run_actors.lance` dataset
|
||||
*bytes* is still deferred — it needs a `StorageAdapter::delete_prefix`
|
||||
primitive — but those bytes are invisible to graph-level state.
|
||||
`db/manifest/migrations.rs` is the single place on-disk `__manifest` shape is
|
||||
reconciled with what the binary expects, stepping the
|
||||
`omnigraph:internal_schema_version` stamp forward one `match`-arm at a time. It
|
||||
runs in `Omnigraph::open(ReadWrite)` (via `manifest::migrate_on_open`, before the
|
||||
coordinator reads branch state) and again on the publisher's write path, so each
|
||||
branch migrates on its first write; every step is idempotent under crash-retry
|
||||
(work first, stamp bump last).
|
||||
|
||||
- **v2→v3** (MR-770): a one-time sweep that deletes legacy `__run__*` staging
|
||||
branches off `__manifest`. Deleting the inert `_graph_runs.lance` /
|
||||
`_graph_run_actors.lance` dataset *bytes* is still deferred — it needs a
|
||||
`StorageAdapter::delete_prefix` primitive — but those bytes are invisible to
|
||||
graph-level state.
|
||||
- **v3→v4** (RFC-013 Phase 7, `migrate_v3_to_v4`): backfills the graph lineage
|
||||
from `_graph_commits.lance` into `__manifest` as `graph_commit` / `graph_head`
|
||||
rows. A graph created before Phase 7 has its lineage only in
|
||||
`_graph_commits.lance`; the new binary reads lineage from the `__manifest`
|
||||
projection, so without this backfill it would see an empty commit DAG. The
|
||||
backfill is per-branch (each branch migrates on its first write), idempotent
|
||||
(keyed on `object_id`; a fast-path guard skips when `__manifest` already
|
||||
carries `graph_commit` rows), and writes exactly one `graph_head:<branch>` row
|
||||
for the actual head. `_graph_commits.lance` is left in place as the branch-ref
|
||||
carrier — no commit row is written to it again. While a graph is below v4, a
|
||||
**read-only** open (which never writes, so never migrates) sources the commit
|
||||
DAG from `_graph_commits.lance` via the stamp-gated transitional fallback in
|
||||
`CommitGraph::open*`, so reads see correct history before the first write
|
||||
migrates the graph. An old binary opening a v4-stamped graph is refused with an
|
||||
"upgrade omnigraph" error in both read-write and read-only modes.
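
The stepping and gating rules above can be sketched as follows. This is a minimal illustration, not the code in `db/manifest/migrations.rs`: the `Manifest` struct, `record_once`, `OpenError`, and the step descriptions are all hypothetical stand-ins, and the only behavior taken from the text is "one `match` arm per step, work first, stamp bump last, refuse a stamp newer than the binary understands".

```rust
/// Highest internal schema version this (hypothetical) binary understands.
const SUPPORTED_VERSION: u32 = 4;

/// Hypothetical stand-in for the on-disk manifest: the stamp plus a log of
/// the work each migration step performed.
#[derive(Debug, Default)]
struct Manifest {
    stamp: u32,
    applied: Vec<&'static str>,
}

#[derive(Debug, PartialEq)]
enum OpenError {
    /// The graph was stamped by a newer binary; refuse instead of misreading it.
    UpgradeRequired { found: u32, supported: u32 },
}

/// Idempotent "work": recording the same step twice is a no-op, which is what
/// makes a crash-and-retry of a step harmless in this sketch.
fn record_once(m: &mut Manifest, step: &'static str) {
    if !m.applied.contains(&step) {
        m.applied.push(step);
    }
}

/// Step the stamp forward one version at a time. Each arm does its work
/// first; the stamp bump is the last thing a step does.
fn migrate_on_open(m: &mut Manifest) -> Result<(), OpenError> {
    if m.stamp > SUPPORTED_VERSION {
        return Err(OpenError::UpgradeRequired {
            found: m.stamp,
            supported: SUPPORTED_VERSION,
        });
    }
    while m.stamp < SUPPORTED_VERSION {
        let current = m.stamp;
        match current {
            2 => record_once(m, "sweep __run__* branches"),
            3 => record_once(m, "backfill graph lineage"),
            // Steps below v2 are out of scope for this sketch.
            _ => record_once(m, "older step (elided)"),
        }
        m.stamp = current + 1;
    }
    Ok(())
}
```

Opening a `Manifest { stamp: 2, .. }` runs both steps in order and leaves the stamp at 4; a manifest stamped 5 is rejected with `UpgradeRequired` before any work happens. A manifest that crashed after the v3 work but before the bump (stamp still 3, step already recorded) re-runs the step without duplicating it.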
|
||||
|
||||
## Mid-query partial failure: closed by MR-794
|
||||
|
||||
|
|
|
|||
60
docs/releases/v0.7.2.md
Normal file
|
|
@@ -0,0 +1,60 @@
|
|||
# Omnigraph v0.7.2
|
||||
|
||||
A patch release over v0.7.1: write-path latency reductions plus three
|
||||
correctness fixes on the maintenance and recovery paths. No breaking changes, no
|
||||
on-disk format change, and no migration — drop-in over v0.7.1.
|
||||
|
||||
## Performance
|
||||
|
||||
- **Write opens go direct, schema validates once (#288, #298).** Write opens
|
||||
used to route through the per-table Lance namespace catalog, which re-opened
|
||||
the dataset just to read its location and re-resolved the latest version on
|
||||
every table open — an O(commit-depth) double resolution that dominated write
|
||||
latency on object stores (~70%). Writes now open each touched data table
|
||||
directly by its manifest-recorded location (Lance's O(1) version-hint path),
|
||||
validate the schema contract once per write instead of ~4×, and open each
|
||||
touched table once instead of 4×.
|
||||
|
||||
- **`optimize` compacts the internal metadata tables (#291).** `optimize`
|
||||
previously iterated only node/edge tables, so the internal `__manifest`,
|
||||
`_graph_commits`, and `_graph_commit_actors` tables accumulated one fragment
|
||||
per commit and were never compacted — making every write's metadata scan grow
|
||||
with commit history. `optimize` now compacts all three, so a periodically
|
||||
optimized long-lived graph keeps its per-write metadata scan flat in history.
|
||||
|
||||
## Fixes
|
||||
|
||||
- **`optimize` survives a cross-process write race (#297).** A CLI `optimize`
|
||||
racing a served write on the same table could fail: the in-process write queue
|
||||
doesn't serialize across processes, so a concurrent insert/delete advancing the
|
||||
manifest between optimize's compaction and its publish broke the strict
|
||||
equality CAS. Optimize now reopens-and-replans on a genuine Lance conflict and
|
||||
fast-forwards its publish monotonically, so a maintenance compaction never
|
||||
fails a live write. Bounded retry; sustained contention surfaces a loud
|
||||
conflict rather than dropping work.
|
||||
|
||||
- **`optimize` is non-destructive on upgraded graphs (#291).** A graph created by
|
||||
a pre-0.7.0 binary carries an on-by-default Lance auto-cleanup config; under it,
|
||||
optimize's compaction commit could fire Lance's version-GC hook and prune
|
||||
`__manifest`-pinned versions (breaking snapshots and time travel). Optimize now
|
||||
strips any stale `lance.auto_cleanup.*` config off every table — data and
|
||||
internal — before its HEAD-advancing commits, so compaction can never GC pinned
|
||||
versions.
|
||||
|
||||
- **Recovery converges instead of failing `open` under a concurrent manifest
|
||||
advance (#296).** The open-time recovery sweep published its roll-forward at the
|
||||
sidecar's pinned expected version; if another writer advanced the manifest
|
||||
during the classify→publish window, the CAS failed and aborted the whole
|
||||
`Omnigraph::open`. The sweep now treats roll-forward as "the manifest reflects
|
||||
the sidecar's committed state," not "this sweep won the CAS": on a CAS loss it
|
||||
re-reads the live manifest and, when the sidecar's intent is already satisfied,
|
||||
records the recovery and deletes the sidecar idempotently — so a concurrent
|
||||
advance no longer fails the open. (The destructive roll-back twin still defers
|
||||
to a cross-process lease, as documented.)
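
The recovery-convergence rule in the last fix can be sketched as below. This is an illustration under stated assumptions, not the sweep's actual code: `LiveManifest`, `Sidecar`, `Outcome`, and `roll_forward` are hypothetical names, the manifest is reduced to a single version counter, and "the sidecar's intent is already satisfied" is modeled as "the live version is at or past the sidecar's committed version". What happens on a CAS loss that is not yet satisfied is not described in these notes, so the `Conflict` arm is only a placeholder.

```rust
/// Hypothetical stand-in for the live manifest: one monotonically
/// increasing version.
struct LiveManifest {
    version: u64,
}

impl LiveManifest {
    /// Compare-and-swap publish: succeeds only if nobody advanced the
    /// manifest since `expected` was read. On failure returns the live version.
    fn publish_cas(&mut self, expected: u64, new: u64) -> Result<(), u64> {
        if self.version == expected {
            self.version = new;
            Ok(())
        } else {
            Err(self.version)
        }
    }
}

/// Hypothetical recovery sidecar: the version the crashed writer expected
/// and the version its committed state corresponds to.
struct Sidecar {
    expected: u64,
    committed: u64,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    /// This sweep won the CAS and published the roll-forward itself.
    RolledForward,
    /// Another writer advanced the manifest, and it already reflects the
    /// sidecar's committed state: record the recovery and drop the sidecar.
    AlreadySatisfied,
    /// Placeholder for a CAS loss that is not yet satisfied.
    Conflict,
}

fn roll_forward(manifest: &mut LiveManifest, sidecar: &Sidecar) -> Outcome {
    match manifest.publish_cas(sidecar.expected, sidecar.committed) {
        Ok(()) => Outcome::RolledForward,
        // CAS lost: re-read the live version instead of failing the open.
        Err(live) if live >= sidecar.committed => Outcome::AlreadySatisfied,
        Err(_) => Outcome::Conflict,
    }
}
```

The design point is that success is defined by the manifest's state rather than by who performed the write, so a concurrent advance turns into a no-op instead of an aborted `open`.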
|
||||
|
||||
## Upgrade notes
|
||||
|
||||
Drop-in over v0.7.1 — no configuration, schema, or data changes. Upgrade the
|
||||
server and CLI together as usual. Graphs created on v0.7.1 read and write
|
||||
identically on v0.7.2; the optimize non-destructive fix additionally protects
|
||||
graphs created by pre-0.7.0 binaries from version GC during compaction.
|
||||
|
|
@@ -1,16 +1,23 @@
|
|||
# Omnigraph v0.8.0
|
||||
|
||||
v0.8.0 makes every served graph an **MCP (Model Context Protocol) server**. An
|
||||
MCP-capable agent — Claude Code/Desktop, Cursor, the OpenAI Responses `mcp` tool,
|
||||
and others — can connect to a graph and operate it directly: run reads and
|
||||
mutations, load data, manage branches, browse commits, read the schema, and
|
||||
invoke the graph's curated stored queries. The surface adds no new capability and
|
||||
no new business logic; every tool delegates to the same engine/handler path the
|
||||
REST routes use and is gated by the same Cedar policy.
|
||||
v0.8.0 has two headline changes:
|
||||
|
||||
## Highlights
|
||||
1. **Every served graph becomes an MCP (Model Context Protocol) server** — an
|
||||
MCP-capable agent (Claude Code/Desktop, Cursor, the OpenAI Responses `mcp`
|
||||
tool, and others) can connect to a graph and operate it directly. The surface
|
||||
adds no new capability and no new business logic; every tool delegates to the
|
||||
same engine/handler path the REST routes use and is gated by the same Cedar
|
||||
policy. It is **additive**.
|
||||
2. **Graph commit lineage moves into `__manifest`** (RFC-013 Phase 7), folded
|
||||
into the publish CAS, via a one-time on-disk migration (internal schema
|
||||
**v3 → v4**). This is the first internal-schema change since v0.4.0 and carries
|
||||
an **upgrade-order requirement** — read the upgrade notes before rolling it out.
|
||||
|
||||
### MCP surface (`POST /graphs/{id}/mcp`)
|
||||
## MCP surface (`POST /graphs/{id}/mcp`)
|
||||
|
||||
An MCP-capable agent can connect to a graph and run reads and mutations, load
|
||||
data, manage branches, browse commits, read the schema, and invoke the graph's
|
||||
curated stored queries.
|
||||
|
||||
- **One MCP endpoint per served graph**, mounted automatically by the cluster
|
||||
server — no separate flag. It is a stateless Streamable-HTTP transport: a
|
||||
|
|
@@ -78,8 +85,56 @@ carried in the query source:
|
|||
unsupported version is a `400`); `initialize` negotiates the version in its
|
||||
body and is exempt by design.
|
||||
|
||||
## Graph lineage now lives in `__manifest` (internal schema v4)
|
||||
|
||||
The graph commit DAG (commits, parents, merge parents, per-branch heads, and the
|
||||
authoring actor) is now stored in `__manifest` as `graph_commit` / `graph_head`
|
||||
rows, written in the **same commit (CAS)** as the table-version rows of a graph
|
||||
publish. Previously the lineage lived in a separate `_graph_commits.lance`
|
||||
dataset written after the manifest commit, leaving a narrow window where a crash
|
||||
could land a manifest version with no matching lineage row. Folding the lineage
|
||||
into the publish closes that gap by construction: a graph commit and its lineage
|
||||
now land atomically at one manifest version. The in-memory commit graph is a
|
||||
projection of those manifest rows; `_graph_commits.lance` is retained only as a
|
||||
carrier for Lance branch refs and no longer receives commit rows.
|
||||
|
||||
This bumps the `__manifest` internal schema stamp from **v3 to v4**.
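
To make "the in-memory commit graph is a projection of those manifest rows" concrete, here is a small sketch. It assumes a simplified row shape: `LineageRow`, `CommitGraph`, `project`, and `history` are hypothetical names, merge parents and actors are omitted, and the parent pointer is a plain field rather than something decoded from the row's metadata.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified lineage row as read back from `__manifest`.
enum LineageRow {
    /// One immutable row per graph commit.
    Commit { id: String, parent: Option<String> },
    /// One mutable pointer per branch.
    Head { branch: String, commit_id: String },
}

/// In-memory projection: parent pointers plus one head per branch.
#[derive(Default)]
struct CommitGraph {
    parents: HashMap<String, Option<String>>,
    heads: HashMap<String, String>,
}

impl CommitGraph {
    /// Build the projection in a single pass over the lineage rows.
    fn project(rows: &[LineageRow]) -> Self {
        let mut graph = CommitGraph::default();
        for row in rows {
            match row {
                LineageRow::Commit { id, parent } => {
                    graph.parents.insert(id.clone(), parent.clone());
                }
                LineageRow::Head { branch, commit_id } => {
                    graph.heads.insert(branch.clone(), commit_id.clone());
                }
            }
        }
        graph
    }

    /// Walk first-parent history from a branch head, newest first.
    /// Assumes the rows form a DAG (no parent cycles).
    fn history(&self, branch: &str) -> Vec<String> {
        let mut out = Vec::new();
        let mut cursor = self.heads.get(branch).cloned();
        while let Some(id) = cursor {
            cursor = self.parents.get(&id).cloned().flatten();
            out.push(id);
        }
        out
    }
}
```

Because the commit rows and the head pointer come from the same manifest version, a reader that projects one version never sees a head that points at a commit it cannot find, which is the gap the fold into the publish CAS is meant to close.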
|
||||
|
||||
### Existing graphs migrate seamlessly on first write
|
||||
|
||||
A graph created by an earlier binary (internal schema v3) keeps its lineage in
|
||||
`_graph_commits.lance` with none in `__manifest`. On the **first read-write
|
||||
open**, Omnigraph backfills that lineage into `__manifest` (the `migrate_v3_to_v4`
|
||||
internal-schema step) and bumps the stamp to v4. The migration:
|
||||
|
||||
- is **per-branch** — each branch backfills on its first write;
|
||||
- is **idempotent and crash-safe** — the stamp bump is the last step, and the
|
||||
backfill is keyed on the commit id, so a crash mid-migration re-runs harmlessly
|
||||
on the next open;
|
||||
- **preserves all data** — every commit, parent, merge parent, actor, and head is
|
||||
carried over; commit ids are stable, so existing references still resolve.
|
||||
|
||||
No data is lost and no operator action is required beyond upgrading the binary.
|
||||
|
||||
Before its first write migrates the graph, a **read-only** open of a v3 graph
|
||||
(e.g. `omnigraph commit list`, NDJSON export) still reads correct history via a
|
||||
transitional fallback that sources the commit DAG from `_graph_commits.lance` —
|
||||
read-only opens never write, so they never migrate, but they never show an empty
|
||||
history either.
|
||||
|
||||
## Upgrade notes
|
||||
|
||||
- **Breaking: internal schema v4 — upgrade writer (and reader) binaries first.**
|
||||
Internal schema v4 is a hard version gate. Once a graph has been opened for
|
||||
write by a v0.8.0 binary, its `__manifest` is stamped v4, and an **older binary
|
||||
will refuse to open it** — read-write *and* read-only — with an
|
||||
`upgrade omnigraph before opening this graph` error rather than silently
|
||||
misreading the new lineage. This is the standard forward-version protection
|
||||
(same shape as the v1→v2 / v2→v3 steps), now enforced on the read-only path
|
||||
too. Upgrade every writer (and reader) binary that touches a graph to v0.8.0
|
||||
before, or together with, the first write under the new version. A mixed fleet
|
||||
where an old binary still writes the same graph is unsupported, as with any
|
||||
internal-schema bump.
|
||||
- **`GET /graphs/{id}/queries` is now `invoke_query`-gated (was `read`).** The
|
||||
stored-query catalog uses the same authority as invocation and the MCP
|
||||
`tools/list` surface, so discovery and invocation agree ("see the menu iff you
|
||||
|
|
@@ -87,8 +142,9 @@ carried in the query source:
|
|||
`403` instead of a listing; in default-deny mode the endpoint returns `403`
|
||||
until an `invoke_query` rule is configured. This is the one observable REST
|
||||
behavior change in this release.
|
||||
- Otherwise no breaking changes: the rest of the REST surface, CLI, cluster
|
||||
config, and on-disk format are unchanged. The MCP endpoint is additive.
|
||||
- **The MCP endpoint is additive.** Apart from the `GET /queries` gate change and
|
||||
the v4 on-disk migration above, the REST surface, CLI, and cluster config are
|
||||
unchanged.
|
||||
- **Pointing an agent at a graph:** configure your MCP client with the URL
|
||||
`https://<host>/graphs/<id>/mcp` and the same bearer token you use for REST.
|
||||
See [docs/user/operations/mcp.md](../user/operations/mcp.md) for the connect
|
||||
|
|
|
|||
|
|
@@ -20,13 +20,14 @@ OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordin
|
|||
- **Layout**:
|
||||
- `nodes/{fnv1a64-hex(type_name)}` — one Lance dataset per node type
|
||||
- `edges/{fnv1a64-hex(edge_type_name)}` — one Lance dataset per edge type
|
||||
- `__manifest/` — the catalog of all sub-tables and their published versions
|
||||
- `_graph_commits.lance` / `_graph_commit_actors.lance` — the commit graph and its actor map
|
||||
- `__manifest/` — the catalog of all sub-tables and their published versions, **and** the graph commit lineage (RFC-013 Phase 7)
|
||||
- `_graph_commits.lance` / `_graph_commit_actors.lance` — legacy / branch-ref carriers. Since RFC-013 Phase 7 the graph lineage lives in `__manifest` (`graph_commit` / `graph_head` rows, written in the publish CAS); `_graph_commits.lance` no longer receives commit rows, but is retained to carry the Lance branch refs that `create_branch` / `list_branches` / the `cleanup` orphan reconciler operate on. A graph created before Phase 7 (internal schema v3) keeps its lineage here until its first read-write open, which migrates it into `__manifest` via `migrate_v3_to_v4`.
|
||||
- (legacy `_graph_runs.lance` / `_graph_run_actors.lance` from pre-v0.4.0 graphs are inert; the run state machine was removed. The internal schema migration sweeps stale `__run__*` branches on first write-open; the inert dataset bytes themselves remain until a prefix-delete storage primitive lands)
|
||||
- **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`):
|
||||
- `object_type` ∈ `table | table_version | table_tombstone`
|
||||
- `table_key` ∈ `node:<TypeName> | edge:<EdgeName>`
|
||||
- `object_type` ∈ `table | table_version | table_tombstone | graph_commit | graph_head`
|
||||
- `table_key` ∈ `node:<TypeName> | edge:<EdgeName>` (empty for `graph_commit` / `graph_head` lineage rows)
|
||||
- `table_branch` is `null` for the main lineage and the branch name otherwise
|
||||
- **Graph lineage rows** (RFC-013 Phase 7): one immutable `graph_commit` row per commit (`object_id` = the commit ULID; `metadata` JSON carries parent / merged-parent / actor / timestamp) plus one mutable `graph_head:<branch>` pointer per branch (`graph_head:main` for main). The in-memory commit DAG is a projection of these rows.
|
||||
- **Snapshot reconstruction**: the latest visible `table_version` per `(table_key, table_branch)`, minus tombstoned entries. A tombstone is a row with `object_type = table_tombstone`; it hides an entry when the tombstone's own `table_version` (acting as the tombstone version) is `>=` the entry's `table_version`.
|
||||
- **Atomic publish**: multi-dataset commits publish so that a single write to `__manifest` flips all the new sub-table versions visible at once.
|
||||
- **Row-level CAS on the merge-insert join key**: `object_id` carries an unenforced-primary-key annotation so Lance's bloom-filter conflict resolver rejects two concurrent commits that land the same `object_id` row. Without this annotation, Lance's transparent rebase would admit silent duplicates from racing publishers.
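
The snapshot reconstruction rule above can be sketched as follows. This is one reading of the rule, not the engine's implementation: `Row`, `Kind`, `Key`, and `snapshot` are hypothetical names, only the columns the rule mentions are modeled, and "latest visible" is taken to mean the highest `table_version` per key.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy)]
enum Kind {
    TableVersion,
    Tombstone,
}

/// Hypothetical, trimmed-down manifest row.
struct Row {
    kind: Kind,
    table_key: &'static str,
    /// `None` stands for the main lineage.
    table_branch: Option<&'static str>,
    table_version: u64,
}

type Key = (&'static str, Option<&'static str>);

/// Latest `table_version` per `(table_key, table_branch)`, dropping every
/// entry covered by a tombstone whose version is `>=` the entry's version.
fn snapshot(rows: &[Row]) -> HashMap<Key, u64> {
    let mut latest: HashMap<Key, u64> = HashMap::new();
    let mut tombstones: HashMap<Key, u64> = HashMap::new();
    for row in rows {
        let target = match row.kind {
            Kind::TableVersion => &mut latest,
            Kind::Tombstone => &mut tombstones,
        };
        // Keep the highest version seen for this key in the chosen map.
        target
            .entry((row.table_key, row.table_branch))
            .and_modify(|v| *v = (*v).max(row.table_version))
            .or_insert(row.table_version);
    }
    // An entry survives only if no tombstone reaches its version.
    latest.retain(|key, version| tombstones.get(key).map_or(true, |t| *t < *version));
    latest
}
```

With rows for `node:Person` at versions 3 and 5 and `edge:Knows` at version 2 plus a tombstone at version 2, the snapshot contains only `node:Person` at version 5.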
|
||||
|
|
@@ -90,8 +91,8 @@ flowchart TB
|
|||
- **Graph root** is one directory (or S3 prefix). Everything below is part of one OmniGraph graph.
|
||||
- **`__manifest/`** is a Lance dataset whose rows describe which sub-table version is published at which graph-branch. Reading a snapshot starts here.
|
||||
- **`nodes/`** and **`edges/`** are sibling directories holding one Lance dataset per declared type. Names are `fnv1a64-hex` of the type name to keep paths fixed-length and case-safe.
|
||||
- **`_graph_commits.lance`** is an L2 dataset that records the graph-level commit DAG, with a paired `_graph_commit_actors.lance` for the actor map. (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; the internal schema migration sweeps their stale `__run__*` branches, and the dataset bytes are reclaimed once a prefix-delete primitive lands.)
|
||||
- **`_graph_commit_recoveries.lance`** — one row per crash-recovery action. Joined to `_graph_commits.lance` by `graph_commit_id`; the linked commit row carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join.
|
||||
- **`_graph_commits.lance`** is an L2 dataset retained only as a branch-ref carrier (and, on a pre-Phase-7 graph, the migration source). Since RFC-013 Phase 7 the graph commit DAG lives in `__manifest` as `graph_commit` / `graph_head` rows written in the publish CAS — `_graph_commits.lance` and its paired `_graph_commit_actors.lance` no longer receive commit rows. A graph created before Phase 7 (internal schema v3) backfills its lineage into `__manifest` on its first read-write open (`migrate_v3_to_v4`). (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; the internal schema migration sweeps their stale `__run__*` branches, and the dataset bytes are reclaimed once a prefix-delete primitive lands.)
|
||||
- **`_graph_commit_recoveries.lance`** — one row per crash-recovery action. Joined by `graph_commit_id` to the graph commit lineage (the `graph_commit` rows in `__manifest` since RFC-013 Phase 7); the linked commit carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join.
|
||||
- **`__recovery/{ulid}.json`** — transient sidecar files written by a writer before it advances the underlying dataset, deleted once the matching manifest publish succeeds. A sidecar persisting after process exit means the writer crashed mid-commit; the next read-write open processes it. Steady-state directory is empty.
|
||||
- **`_refs/branches/{name}.json`** is graph-level branch metadata — pointers from a branch name to the manifest version it heads.
|
||||
- **Inside each Lance dataset** (orange): the standard Lance directory layout. `_versions/{n}.manifest` records every commit; `data/` holds the actual Arrow fragments; `_indices/{uuid}/` holds index segments with their own `fragment_bitmap` for partial coverage; `_refs/` holds Lance-native per-dataset branches and tags.
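
As an illustration of the `fnv1a64-hex` naming used for `nodes/` and `edges/`, the sketch below computes a 64-bit FNV-1a hash and renders it as a fixed-length path segment. The hash itself is the standard FNV-1a algorithm; the helper name `node_table_path`, hashing the UTF-8 bytes of the type name, and the zero-padded lowercase 16-digit rendering are assumptions made for this example rather than details confirmed by the text.

```rust
/// Standard 64-bit FNV-1a: XOR each byte into the hash, then multiply by the
/// FNV prime (wrapping on overflow).
fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    bytes
        .iter()
        .fold(OFFSET_BASIS, |hash, &byte| (hash ^ u64::from(byte)).wrapping_mul(PRIME))
}

/// Hypothetical path helper: `nodes/` followed by 16 lowercase hex digits.
fn node_table_path(type_name: &str) -> String {
    format!("nodes/{:016x}", fnv1a64(type_name.as_bytes()))
}
```

Every type name maps to a 16-character segment regardless of its length or casing, which is the "fixed-length and case-safe" property the layout relies on.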
|
||||
|
|
|
|||
|
|
@@ -3,12 +3,12 @@
|
|||
| Name | Value | Area |
|
||||
|---|---|---|
|
||||
| `MANIFEST_DIR` | `__manifest` | manifest layout |
|
||||
| Commit graph dir | `_graph_commits.lance` | commit graph |
|
||||
| Commit graph dir | `_graph_commits.lance` | branch-ref carrier + pre-v4 lineage source (lineage lives in `__manifest` since RFC-013 Phase 7) |
|
||||
| Run registry dir (legacy, removed) | `_graph_runs.lance` | inert post-v0.4.0; bytes remain until a prefix-delete primitive lands |
|
||||
| Run branch prefix (legacy, removed) | `__run__` | swept off `__manifest` by the internal schema migration; no longer a reserved name |
|
||||
| Schema apply lock | `__schema_apply_lock__` | schema apply |
|
||||
| Manifest publisher retry budget | `PUBLISHER_RETRY_BUDGET = 5` | manifest publish |
|
||||
| Internal manifest schema version | `INTERNAL_MANIFEST_SCHEMA_VERSION = 3` | manifest migrations |
|
||||
| Internal manifest schema version | `INTERNAL_MANIFEST_SCHEMA_VERSION = 4` | manifest migrations (v4 = graph lineage in `__manifest`, RFC-013 Phase 7) |
|
||||
| Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | merge execution |
|
||||
| Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | optimize/cleanup |
|
||||
| Lance blob compaction support | `LANCE_SUPPORTS_BLOB_COMPACTION = false` | optimize |
|
||||
|
|
|
|||