Closes the Cursor Bugbot HIGH on commit 22d76db (round 2 review):
`branch_merge_impl` at `crates/omnigraph/src/exec/merge.rs:1085-1100`
still used the swap_coordinator_for_branch + operate +
restore_coordinator pattern across three separate
`coordinator.write().await` acquisitions. Two concurrent merges with
distinct targets would interleave their swaps, leaving each merge's
body running against the other's swapped coord — A's `feature_a →
target_a` would land its rewrite in target_b instead.
Adds `merge_exclusive: Arc<tokio::sync::Mutex<()>>` to `Omnigraph`,
held across the entire swap → operate → restore window in
`branch_merge_impl`. Concurrent branch merges now serialize relative
to each other; everything else (per-(table, branch) writer queues,
/change, /ingest) is unaffected.
Why the mutex rather than the deeper "operate on local coord"
refactor (the round-1 fix shape applied to `branch_create_from`):
`branch_merge_on_current_target` calls `self.snapshot()` and
`self.ensure_commit_graph_initialized()` internally, which use
`self.coordinator` directly. Threading an explicit target coord
parameter through the merge body would unwind dozens of call sites.
The mutex is a smaller intrusion that fully closes the race.
Documented as a follow-up if telemetry shows merge concurrency
matters.
Pinned by `concurrent_branch_merges_distinct_targets_do_not_swap_into_each_other`
(previous commit). Pre-fix: M=4 iterations of concurrent merges
deterministically corrupted target row counts. Post-fix: all M
iterations land each merge on its declared target. The two adjacent
branch concurrency tests
(`concurrent_change_during_branch_merge_preserves_writes`,
`concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator`)
still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The interior-mutability primitives from Step B (catalog ArcSwap,
schema_source ArcSwap, coordinator Mutex, and RuntimeCache's existing
internal locking) make every Omnigraph engine write API safe to expose
under &self. This commit flips the public surface so the HTTP server
can hold Arc<Omnigraph> in PR 2 Step F instead of Arc<RwLock<Omnigraph>>.
Public API conversions:
- mutate, mutate_as
- ingest, ingest_as, ingest_file, ingest_file_as
- load, load_as, load_file
- branch_merge, branch_merge_as
- apply_schema
- ensure_indices, ensure_indices_on
- optimize
Inner functions converted in lockstep (their signatures must match the
new caller shape):
- mutate_with_current_actor, ingest_with_current_actor,
load_direct_on_branch
- execute_named_mutation, execute_insert, execute_update,
execute_delete, execute_delete_node, execute_delete_edge
- branch_merge_impl, branch_merge_on_current_target
- load_jsonl_reader
- schema_apply::{apply_schema, apply_schema_with_lock,
acquire_schema_apply_lock, release_schema_apply_lock,
ensure_schema_apply_idle}
- table_ops::{ensure_indices, ensure_indices_on,
ensure_indices_for_branch, commit_prepared_updates,
commit_prepared_updates_with_expected,
commit_prepared_updates_on_branch,
commit_prepared_updates_on_branch_with_expected,
commit_manifest_updates, record_merge_commit,
ensure_commit_graph_initialized, commit_updates_on_branch_with_expected}
- optimize::optimize_all_tables
- Omnigraph::commit_manifest_updates, record_merge_commit,
commit_updates_on_branch_with_expected, ensure_commit_graph_initialized
The conversion is mechanical: callers that previously took `db: &mut
Omnigraph` now take `db: &Omnigraph`; every interior mutation goes
through the existing locks (coordinator.lock().await, store_catalog,
runtime_cache.invalidate_all). No new locks acquired, no new lock-order
hazards introduced.
102 lib tests + 24 runs + 30 branching + 63 end_to_end + 39 server
tests pass. Workspace compiles clean (1 warning on a now-redundant `mut`
binding in CLI; cleaned up in a follow-up). The remaining work in PR 2
is the AppState flip (Arc<RwLock<Omnigraph>> -> Arc<Omnigraph> +
WorkloadController), the revalidation perf optimization in commit_all,
and the WorkloadController itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the GraphCoordinator field in `Arc<tokio::sync::Mutex<...>>` so
engine APIs can move from `&mut self` to `&self` without giving up the
coordinator's mutating refresh path. Lock acquisition order: always
before runtime_cache (when both are needed in one scope). Critical
sections stay short — load+clone for snapshot/version/current_branch,
single-method delegations elsewhere.
Public API changes:
- `Omnigraph::version()` and `Omnigraph::snapshot()` (pub(crate))
become async; callers add `.await`.
- `Omnigraph::active_branch()` returns `Option<String>` (cloned) instead
of `Option<&str>` borrowed from the coordinator. Callers either
`.await` the result + use `.as_deref()`, or hoist into a binding.
`&self`-converted methods this round (tied to the coordinator wrap, not
the Step C surface refactor):
- `swap_coordinator_for_branch`
- `restore_coordinator` (now async; was sync)
- `sync_branch`
- `refresh`
- `refresh_coordinator_only`
- `reload_schema_if_source_changed`
- `branch_create`, `branch_create_from`, `branch_delete`, `branch_list`
- `delete_branch_storage_only`
- `ensure_branch_delete_safe`
- `ensure_schema_apply_idle`
- `ensure_schema_apply_idle` helper in schema_apply.rs (matches signature)
Caller updates: branch_create_from_impl threads `restore_coordinator`'s
new async signature; schema_apply, table_ops, exec/merge wrap every
direct `db.coordinator.X()` in `db.coordinator.lock().await.X()`;
exec/merge hoists `active_branch_for_keys` once outside the per-table
closure that builds queue keys + sidecar pins.
All 102 lib tests + 30 branching + 24 runs + 10 lifecycle + 16
staged_writes + 63 end_to_end pass workspace-wide. Zero test
regressions; the only behavior change is on the `Omnigraph` API
surface (sync -> async on the three accessors above).
Step C (engine API conversion: apply_schema, mutate_as, ingest_as,
branch_merge_as &mut self -> &self) follows in a subsequent commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles the working-tree state from the prior session (PR 0 bench harness,
PR 1a audit_actor_id removal, PR 1b WriteQueueManager + writer integration)
together with the first half of PR 2's interior-mutability foundation
(catalog and schema_source wrapped in Arc<ArcSwap<...>>). The two streams
intermix in 7 of the same files, so splitting via git add -p was
impractical. Subsequent PR 2 steps land as separate atomic commits.
PR 0 — server-level concurrent /change bench harness
- crates/omnigraph-server/examples/bench_concurrent_http.rs (new)
- .context/bench-results/{baseline-main,after-pr1}/ (gitignored)
PR 1a — drop the audit_actor_id field, thread per-call
- removed Omnigraph::audit_actor_id and the swap-restore patterns in
mutation.rs, merge.rs, loader/mod.rs
- actor_id: Option<&str> threaded through MutationStaging::finalize,
mutate_with_current_actor, ingest_with_current_actor,
branch_merge_impl, branch_merge_on_current_target,
commit_prepared_updates*, record_merge_commit,
commit_updates_on_branch_with_expected
- apply_schema and ensure_indices_for_branch pass None (system-attributed)
PR 1b — per-(table_key, branch) write queue + revalidation + sidecar
- new crates/omnigraph/src/db/write_queue.rs with WriteQueueManager,
acquire/acquire_many, sorted+deduped acquisition; 6 unit tests
- Arc<WriteQueueManager> field on Omnigraph + db.write_queue() accessor
- MutationStaging::finalize split into stage_all (Phase A, no queue)
and StagedMutation::commit_all (Phase B, acquire_many + revalidate
pins + sidecar + commit_staged); guards held across publisher
- delete-only mutations now emit recovery sidecars; revalidation
extended to inline_committed tables
- branch_merge_on_current_target, apply_schema_with_lock, and
ensure_indices_for_branch acquire per-table queues for their
touched tables
PR 2 Step B (partial) — catalog and schema_source via ArcSwap
- catalog: Catalog -> Arc<ArcSwap<Catalog>>
- schema_source: String -> Arc<ArcSwap<String>>
- public accessors return Arc<Catalog> / Arc<String>; readers bind
locally where the borrow has to outlive an expression
- new pub(crate) store_catalog / store_schema_source helpers replace
the field assignments in apply_schema and reload_schema_if_source_changed
- 117 tests across lifecycle/end_to_end/branching/runs pass; engine
lib + workspace compile clean
Coordinator wrap (Mutex) and the &mut self -> &self engine API
conversion follow in subsequent commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small PR #72 review findings addressed:
- merge.rs sidecar pin recorded `entry.table_branch` (where the table
currently lives in the target manifest) instead of the merge target
branch where commits actually land via `publish_rewritten_merge_table`
→ `open_for_mutation` → `fork_dataset_from_entry_state`. Recovery's
`open_lance_head` would then check the wrong ref. Aligned with the
pattern already used in `ensure_indices_for_branch` (table_ops.rs:115).
Added `branch_merge_sidecar_pins_table_branch_to_active_branch`
contract test that reads the sidecar JSON and asserts every per-pin
`table_branch` equals the active (target) branch — catches the
regression even when the values happen to coincide in the test setup.
- Rollback audit `from_version` previously equalled `to_version`
(both `manifest_pinned`), telling operators nothing about the actual
Lance HEAD drift before restore. Captured `lance_head` in
`ClassifiedTable` and used it as `from_version` so audit rows now
show "rolled back from v7 to v5" instead of "v5 → v5". Added
`assert_rollback_outcomes_record_drift` invariant in the test helper,
invoked automatically by every `RecoveryExpectation::RolledBack` test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
B1. Schema-apply atomicity. Before this commit, a failure between
`_schema.pg.staging` write and the manifest publish left the repo
corrupt: Lance HEADs advanced under the new schema, manifest stayed
at old pins, and on reopen schema-state recovery deleted the staging
files (manifest's table set still matched the live schema), then
manifest-drift recovery rolled the table versions forward — leaving
new-schema data on disk with the old `_schema.pg` live.
Fix: a SchemaApply sidecar is the marker that Phase B completed but
Phase C didn't. New helper `has_schema_apply_sidecar` is consulted
by `recover_schema_state_files` BEFORE its disambiguation logic;
when present, it completes the staging→final rename so the
subsequent manifest-drift roll-forward sees the new catalog.
B2. Branch-aware recovery. Sidecars from feature-branch writers were
being classified against main's snapshot and main's Lance HEAD,
silently no-op'ing or rolling back the wrong table version (the
classifier saw NoMovement; the writer's drift on the feature branch
persisted; subsequent feature writers surfaced
ExpectedVersionMismatch).
Fix: SidecarTablePin gets an optional `table_branch` field;
`recover_manifest_drift` opens a per-branch coordinator
(`GraphCoordinator::open_branch`) per sidecar; `open_lance_head`,
`restore_table_to_version`, and `roll_forward_all` honor the pin's
branch via `Dataset::checkout_branch`.
B3. Remove fragment-id short-circuit in `restore_table_to_version`.
Equal fragment IDs do NOT imply equal content: Lance index commits
and deletion-vector updates change the manifest without touching
fragment IDs. Skipping restore in those cases would leave Lance HEAD
ahead of the manifest with no recovery artifact left. Restore is
now unconditional; pile-up under repeated mid-rollback crashes
bounded and reclaimed by `omnigraph cleanup`.
B4. Recovered branch_merge records merge parent. `record_audit` always
called `append_commit`, dropping `merged_parent_commit_id`. Future
`branch_merge feature -> main` between the same pair lost
already-up-to-date detection. RecoverySidecar gets an optional
`merge_source_commit_id`; `branch_merge_on_current_target`
populates it from `source_head_commit_id`; `record_audit`
dispatches to `append_merge_commit` when present.
New tests: feature-branch sidecar classification (B2); B1 deepens the
existing schema_apply test with live-`_schema.pg` and new-type
assertions; B4 deepens the existing branch_merge test by reading
`_graph_commits.lance` and asserting a non-null `merged_parent_commit_id`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A1. tests/recovery.rs: rewrite recovery_multi_sidecar_requires_fresh_snapshot_for_correctness
to use real `append_batch` instead of fragment-preserving `delete_where("1 = 2")`.
The previous setup made restore_table_to_version's fragment-set short-circuit
no-op the bug path, so the load-bearing `HEAD == v3` assertion passed in both
bug and fix paths. Real appends produce different fragment-id sets across v1,
v2, v3 so a real restore actually runs in the bug path (HEAD becomes v4).
Added a person_batch helper matching the post-init Lance schema (id, age, name).
A2. exec/merge.rs: filter recovery sidecar pins to `RewriteMerged` candidates
only. `AdoptSourceState`'s pure-pointer-switch and fork subcases don't
advance Lance HEAD; pinning them would force NoMovement on recovery and
trigger an all-or-nothing rollback that destroys legit RewriteMerged work.
Documented residual: AdoptSourceState subcases that internally call
publish_rewritten_merge_table aren't covered by the sidecar; closing that
requires pre-computing source deltas during candidate classification (a
structural change to CandidateTableState) — left as follow-up.
A3. db/omnigraph/table_ops.rs: add the same branch filter
(`active_branch.is_some() && entry.table_branch.is_none() => continue`)
to the ensure_indices sidecar pin loop that the processing loop already
has. Without this, main-branch tables that need index work get pinned but
never committed when ensure_indices runs on a feature branch → NoMovement
→ all-or-nothing rollback destroys feature-branch work.
A4. tests/failpoints.rs: deepen schema_apply_phase_b_failure and
branch_merge_phase_b_failure tests with post-recovery manifest-pin advance
assertions. branch_merge test setup also mutates main so the merge
produces at least one RewriteMerged candidate (required after A2's pin
filter — a no-op merge with all-AdoptSourceState would write no sidecar).
Fixed stale "BranchMerge is strict-classified" comment to reflect current
loose classification.
A5. tests/composite_flow.rs: remove duplicate back-to-back `total_people`
query in step 12.
Full workspace test sweep with --features failpoints passes: no regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three bundled changes:
1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and
the test function). OmniGraph is consumed by both humans and agents
- naming the test after one audience misframes the library.
2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and
review-round labels from source, tests, and docs added by this
branch. Internal traceability belongs in commit messages and PR
descriptions, not in checked-in artifacts. Upstream
lance-format/lance issue refs and pre-existing MR-XXX refs in docs
not touched by this branch are left alone.
3. Two outstanding review findings addressed:
- `needs_index_work_node` / `needs_index_work_edge`: propagate
`count_rows` errors instead of `unwrap_or(0)`. Silently treating
transient I/O failures as "0 rows" risked skipping a table from
the recovery sidecar pin set that was actually about to be
modified.
- `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`:
strengthen the assertion to fail when sidecar B classifies under
a stale snapshot. The new assertion checks post-recovery Lance
HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar
deleted + audit rows present" pair passed in both the bug and
fix paths because both delete the sidecar and write an audit
row; the differentiator is the post-recovery HEAD. Strengthening
the assertion exposed an additional nuance: in this overlapping-
sidecar scenario sidecar B's audit kind is RolledBack (no-op)
rather than RolledForward, since sidecar A's roll-forward
publishes Lance HEAD as the new manifest pin (absorbing B's
work). The docstring now explains why this is correct given
current `roll_forward_all` semantics.
All workspace tests pass with --features failpoints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bot reviewers (cubic, cursor, chatgpt-codex) caught 4 merge-blocking
bugs + 3 strongly-recommended fixes + 3 doc errors in the initial PR.
Each fix has a paired test demonstrating the bug before the fix.
Merge-blocking fixes:
- BranchMerge moved to loose-match classifier arm. publish_rewritten_
merge_table runs multiple commit_staged calls per table (merge_insert
+ delete_where + index rebuilds). Strict classification rolled back
valid completed Phase B work as UnexpectedMultistep. Three new unit
tests pin the loose-match behavior for BranchMerge.
- branch_merge sidecar uses self.active_branch() (the resolved target
branch) instead of inferring from the first sorted table key. The
previous heuristic could record None (== main) when the merge target
was a non-main branch, causing recovery to publish to the wrong
manifest namespace.
- Best-effort sidecar delete in all 5 writer sites (mutation, loader,
schema_apply, branch_merge, ensure_indices). Previously, a sidecar
cleanup failure after a successful manifest publish would error out
the user's call for a write that already landed. Now: log a warning
and ignore — the next open's recovery sweep tidies the stale sidecar
via NoMovement classification.
- ensure_indices sidecar scoped to tables that need work via new
helpers needs_index_work_node / needs_index_work_edge. Previously
the sidecar pinned every catalog table; if only one needed indexing,
the others classified as NoMovement and the all-or-nothing decision
rolled back legitimate index work.
Strongly-recommended fixes:
- recover_manifest_drift now takes &mut GraphCoordinator and refreshes
between sidecars. Sidecar B's classification needs to see sidecar
A's manifest changes, otherwise B can be classified against stale
pins and incorrectly roll back work that just landed.
- list_sidecars sorts URIs before reading. Sidecar filenames are
ULIDs (chronologically sortable), so this gives deterministic,
time-ordered processing. Filesystem-order was nondeterministic.
- ReadOnly opens skip recover_schema_state_files too (was: only the
MR-847 sweep was gated). Read-only consumers may run with read-only
credentials; silent open-time mutations violate the contract.
Doc cleanups:
- Removed stale "Phase 4 placeholder" comment from
recover_manifest_drift.
- docs/runs.md decision-tree wording now correctly surfaces the
InvariantViolation abort path.
- docs/branches-commits.md clarifies actor_id is in
_graph_commit_actors.lance (joined by graph_commit_id), not on
_graph_commits.lance itself.
Test surface (post-fixes):
- 25 unit tests in db::manifest::recovery (+4 from this commit).
- 10 integration tests in tests/recovery.rs (+3 from this commit).
- ~672 tests across ~25 binaries pass with --features failpoints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the three paired per-writer tests required by MR-847's acceptance
criteria — "All four migrated writers ... have paired Phase B → recovery
integration tests."
Production additions (~10 LOC):
- New failpoint `branch_merge.post_phase_b_pre_manifest_commit` in
`exec/merge.rs::branch_merge_on_current_target` between the per-table
publish loop and `commit_manifest_updates`.
- New failpoint `ensure_indices.post_phase_b_pre_manifest_commit` in
`db/omnigraph/table_ops.rs::ensure_indices_for_branch` between the
per-table loops and `commit_prepared_updates_on_branch`.
- For schema_apply, the existing `schema_apply.after_staging_write`
failpoint already fires in the right window (after the per-table
rewrites + index builds, before the manifest publish).
Sidecar tweak:
- `schema_apply` sidecar's `branch` is now `None` (was
`Some("__schema_apply_lock__")`). The lock branch is purely a
serialization sentinel; `coordinator.commit_changes_with_actor`
publishes against the coordinator's pre-lock branch (main). After
the failpoint fires, `release_schema_apply_lock` removes the lock
branch — if the sidecar referenced it, the recovery sweep would try
to publish to a branch that no longer exists and fail. Fix: record
the actual publish target.
Tests added in `tests/failpoints.rs` (~280 LOC):
- `schema_apply_phase_b_failure_recovered_on_next_open` — seeds a row,
opens, attempts a schema apply that adds a new node type + a new
property (the new type ensures the table set differs so
`recover_schema_state_files` doesn't trip on property-only
ambiguity), failpoint fires, drops engine, reopens, asserts sidecar
deleted + audit row recorded.
- `branch_merge_phase_b_failure_recovered_on_next_open` — seeds main,
branches off, mutates the branch, attempts merge with the
`branch_merge.post_phase_b_pre_manifest_commit` failpoint active.
Same recovery shape.
- `ensure_indices_phase_b_failure_recovered_on_next_open` — seeds
rows, attempts ensure_indices with the
`ensure_indices.post_phase_b_pre_manifest_commit` failpoint active.
After this commit, all four migrated writers have paired
Phase B → recovery tests:
- mutate_as / load: `recovery_rolls_forward_after_finalize_publisher_failure` (Phase 5)
- schema_apply: `schema_apply_phase_b_failure_recovered_on_next_open`
- branch_merge: `branch_merge_phase_b_failure_recovered_on_next_open`
- ensure_indices: `ensure_indices_phase_b_failure_recovered_on_next_open`
11 failpoint tests pass; full workspace lib + integration tests pass
(350+ tests across 20 binaries).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three writers each follow the same shape established in Phase 5: build
SidecarTablePin list before the per-table commit_staged loop, write the
sidecar via recovery::write_sidecar, do the existing work, delete the
sidecar after the manifest publish succeeds.
Loose-match classifier (recovery.rs):
The classifier now distinguishes strict vs. loose match per
SidecarKind. Strict (Mutation, Load, BranchMerge): exactly one
commit_staged per table; lance_head == manifest_pinned + 1 AND
post_commit_pin == lance_head required. Loose (SchemaApply,
EnsureIndices): the writer may run N >= 1 commit_staged calls per
table — index builds + rewrites compound, and the exact N is hard to
compute at sidecar-write time. Loose accepts any
lance_head > manifest_pinned (with expected_version still matching the
manifest pin) as RolledPastExpected. The risk it admits — an external
agent advancing HEAD between sidecar write and recovery — is out of
scope for the single-coordinator model (MR-668 territory).
roll_forward_all now reads the CURRENT Lance HEAD per table (not the
sidecar's post_commit_pin) so the manifest publish reflects whatever
HEAD landed, even if the loose-match writer committed multiple times
per table.
Per-writer wiring:
- schema_apply::apply_schema_with_lock: sidecar covers
rewritten_tables ∪ indexed_tables (the tables that go through
stage_overwrite/stage_create_index commit_staged). Skips
added_tables (fresh datasets, no Phase B residual class) and
renamed_tables (handled by the existing schema-state staging
recovery in recover_schema_state_files).
- branch_merge::branch_merge_on_current_target: sidecar covers every
table in candidates (publish_adopted_source_state +
publish_rewritten_merge_table do the per-table commit_staged work).
Sidecar writes after validate_merge_candidates and deletes after
commit_manifest_updates.
- ensure_indices_for_branch: sidecar covers every node + edge type in
the catalog with a manifest entry (build_indices_on_dataset is
per-table-per-index commit_staged). Skips when the catalog has
nothing — steady-state calls incur no sidecar I/O when the manifest
already pins all expected types.
Allow recovery_audit.rs in forbidden_apis.rs:
The new db/recovery_audit.rs uses Dataset::write to bootstrap the
_graph_commit_recoveries.lance dataset (same pattern as
commit_graph.rs which is already allow-listed). Add it to the
ALLOW_LIST_FILES list in tests/forbidden_apis.rs.
8 new unit tests in db::manifest::recovery cover the loose-match
classifier branches (SchemaApply + EnsureIndices accept multi-commit
drift, NoMovement and InvariantViolation behave the same as strict).
All 20 test binaries pass (350+ tests across the workspace).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hoists Lance's stage+commit two-phase write pattern from "discipline at
each writer" to a sealed trait surface (`TableStorage`). New engine code
that needs to advance Lance HEAD MUST go through `stage_*` + `commit_staged`;
the trait's opaque `SnapshotHandle` / `StagedHandle` types keep
`lance::Dataset` and `lance::Transaction` out of trait signatures.
Phases landed (see .context/mr-793-design.md for the full plan):
* 1a: `crates/omnigraph/src/storage_layer.rs` — `TableStorage` trait,
sealed (only in-tree types can impl), single impl on `TableStore`
delegating to existing inherent methods; `Omnigraph::storage()`
accessor returns `&dyn TableStorage`.
* 2: three new staged primitives — `stage_overwrite`,
`stage_create_btree_index`, `stage_create_inverted_index` —
implementing the simple branch of Lance's `CreateIndexBuilder::execute`
(scalar indices only; vector indices stay inline because
`build_index_metadata_from_segments` is `pub(crate)` in lance-4.0.0).
Six new tests in `tests/staged_writes.rs` pin both the new primitives
and the inline residuals (`delete_where`, `create_vector_index`).
* 3: `tests/forbidden_apis.rs` — defense-in-depth integration test
walks engine source, fails on direct lance::* inline-commit API use
outside `table_store.rs` / `db/manifest/`. Skips comment lines and
honors `// forbidden-api-allow:` sentinels.
* 4: `ensure_indices` migration — scalar index builds now route through
`stage_create_*_index` + `commit_staged` instead of
`create_*_index(&mut Dataset)`. Vector indices stay inline (residual,
named honestly at the call site).
* 5: `branch_merge::publish_rewritten_merge_table` migration — the
merge_insert phase now uses `stage_merge_insert` + `commit_staged`;
delete phase stays inline (Lance #6658 residual, named honestly).
* 6: `schema_apply` rewritten_tables migration — non-empty rewrites
use `stage_overwrite` + `commit_staged`; empty-batch rewrites stay
inline because `InsertBuilder::execute_uncommitted` rejects empty
data. The narrow inline window is bounded by `__schema_apply_lock__`.
Verified-green test surface:
* `cargo test -p omnigraph-engine` — 68 lib + ~120 integration tests
(incl. 6 new staged_writes tests + the new forbidden_apis test).
* `cargo test -p omnigraph-engine --features failpoints --test failpoints`
— 5 tests, all green.
* `cargo test --workspace` — green.
Deferred to follow-up sessions (see design doc §17 split):
* Phase 1b — convert remaining engine call sites to `&dyn TableStorage`
(mostly READS that don't touch the staged-write invariant).
* Phase 7 — recovery-on-open reconciler (closes Phase B → Phase C
residual across process restarts; new subsystem).
* Phase 8 — index-coverage reconciler (full §VII.35 compliance —
removes synchronous index work from the publish path).
* Phase 9 — demote unused `TableStore` inherent methods to `pub(crate)`
(depends on Phase 1b).
Lance upstream blockers documented:
* lance-format/lance#6658 — two-phase delete API (open, no PRs).
* Companion: `build_index_metadata_from_segments` should be `pub` so
vector-index builds can be staged outside the lance crate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mutate_as and load now write directly to target tables and call the
publisher once at the end with per-table expected versions; the Run
state machine, _graph_runs.lance writers, __run__ staging branches,
and server /runs/* endpoints are removed. Multi-statement mutations
remain atomic at the manifest level via an in-memory MutationStaging
accumulator that gives read-your-writes within a query and a single
publish at the end. Concurrent-writer conflicts surface as
ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the
old DivergentUpdate merge shape. Documents one known limitation in
docs/runs.md: a multi-statement mid-query failure where op-N writes
a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the
manifest until a follow-up introduces per-table Lance branches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>