Closes the swap-restore race in `branch_create_from_impl` by simply not
touching `self.coordinator` at all. Open the source-branch coordinator
locally, call `branch_create` on it, drop it. The new branch is
durable on disk via the manifest write that `GraphCoordinator::branch_create`
issues on its own commit graph; subsequent reads of any coord will see
it after their normal manifest refresh.
Pre-fix: `branch_create_from_impl` ran swap → operate → restore as
three separate `coordinator.write().await` acquisitions. Under `&self`
concurrency, two callers with distinct source branches could interleave
their swaps, leaving each caller's "operate" step running against the
other's swapped coordinator and forking the new branch off the wrong
HEAD. Pinned by `concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator`
(previous commit) which deterministically reproduced the race with
8/8 forks landing on the wrong parent.
Why correct by design (AGENTS.md rule 9): closing the bug class
"non-atomic three-step coordinator manipulation under &self callers"
by removing the manipulation entirely. There's no scratch-space race
to lose because there's no scratch space.
Note: `branch_merge_impl` at `crates/omnigraph/src/exec/merge.rs:1085-1100`
keeps the same swap-restore pattern. Its inner `branch_merge_on_current_target`
calls `self.snapshot()` and `self.ensure_commit_graph_initialized()` which
acquire the coord lock independently, so the simple "operate on local
coord" refactor doesn't compose without a deeper interface change. The
per-(table, branch) writer queue inside the merge body
(`crates/omnigraph/src/exec/merge.rs:1224`) bounds the damage in
practice; a deterministic regression for concurrent merges is tracked
under Block 3.1 of the plan.
`swap_coordinator_for_branch` and `restore_coordinator` remain
crate-internal for now (still used by `branch_merge_impl`); a follow-up
can remove them if the merge path is similarly refactored.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per AGENTS.md rule 8, this commit lands the failing regression test
ahead of the fix so the red → green pair is visible in git log.
The test demonstrates that two concurrent `POST /branches` calls with
distinct `from` parents corrupt coordinator state: A's "operate" step
runs against B's swapped coordinator instead of its own, forking the
new branch off the wrong parent's HEAD.
Currently fails on f925ad1 with all 8 gamma branches (declared
parent: alpha, 5 rows) reporting 4 rows — beta's row count. The
operate step ran against beta's coord because B's swap interleaved
between A's swap and A's operate.
Fix lands in the next commit: hold a single `coordinator.write().await`
guard across the entire swap-operate-restore sequence in
`branch_create_from_impl` so the three steps are atomic relative to
other callers.
Closes the bug class "non-atomic three-step coordinator manipulation
under &self callers" rather than guarding the specific call site —
the right architectural seam (single critical section per swap-restore
sequence) eliminates the interleave window for branch_create_from and
any future swap-restore caller.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the bug class "Lance internal conflict surfaces as 500 instead
of 409" for in-process concurrent strict-op writers on the same row.
Pre-fix: in `MutationStaging::commit_all`, after queue acquisition, the
staged Lance transaction (built against V0) was handed straight to
`commit_staged`. When Lance HEAD has advanced past V0 (because the
queue's prior winner already published), Lance's transaction conflict
resolver fires `RetryableCommitConflict` for Update vs Update on the
same row, which wraps as `OmniError::Lance(<string>)` and the API maps
it to HTTP 500. Users see "internal server error" instead of a clean
retryable conflict.
Fix: track the strictest `MutationOpKind` per touched table on
`MutationStaging` and propagate through `StagedMutation`. In
`commit_all`'s recapture loop, before each `commit_staged`, fail-fast
with `OmniError::manifest_expected_version_mismatch` (→ HTTP 409
ExpectedVersionMismatch) for tables whose tracked op_kind has
`strict_pre_stage_version_check() == true` (Update/Delete/SchemaRewrite)
when the staged dataset's version doesn't match the fresh manifest pin
under the queue. Insert/Merge tables skip the check — concurrent
inserts on disjoint keys legitimately coexist via Lance's auto-rebase,
so the check would over-reject the existing same-key insert path.
Threading: `ensure_path` now takes `op_kind` and stores it in a new
`op_kinds: HashMap<String, MutationOpKind>` on `MutationStaging`, with
strictness-upgrade semantics so mixed insert+update on the same table
still fires the strict check at commit time. `StagedMutation` carries
`op_kinds` through to `commit_all`.
Pinned by `change_concurrent_updates_same_key_serialize_via_publisher_cas`
in `crates/omnigraph-server/tests/server.rs` (added in the previous
commit). All Phase 2 sentinels still pass:
change_concurrent_inserts_same_key_serialize_without_409,
change_conflict_returns_manifest_conflict_409,
branch_merge_conflict_response_includes_structured_conflicts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per AGENTS.md rule 8, this commit lands the failing regression test
ahead of the fix so the red → green pair is visible in git log.
The test asserts the RYW invariant for in-process concurrent UPDATEs on
the same row: exactly one writer commits and N-1 receive 409
manifest_conflict. Currently fails on f925ad1 with 1 x 200 + 7 x 500:
> "storage: Retryable commit conflict for version 6: This Update
> transaction was preempted by concurrent transaction Update at
> version 6. Please retry."
Lance's transaction conflict resolver correctly detects the Update vs
Update race, but the error wraps as `OmniError::Lance(<string>)` and the
API surfaces it as 500 internal rather than 409 retryable conflict. Users
see "internal server error" for what is documented as a retryable
conflict path.
The fix lands in the next commit: an op-kind-aware drift check at the
commit_all entry that returns 409 ExpectedVersionMismatch for tables
whose first touch was Update / Delete / SchemaRewrite when the staged
dataset version drifts from the manifest pin under the queue.
Closes the bug class "Lance internal conflict surfaces as 500 instead
of 409" rather than mapping the specific Lance error variant — the
right architectural layer (engine boundary, under the queue) catches
the drift before commit_staged ever runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add two durable engineering rules to the maintenance contract so they
load into context on every session:
- Rule 8: write a regression test that reproduces the bug first, confirm
it fails, land it just before the fix commit so the red→green pair is
visible in git log. A reviewer can check out the test commit alone and
reproduce the failure.
- Rule 9: when a bug surfaces, identify the root cause and make the fix
correct by construction. Don't patch the symptom. If the design admits
the bug class, close the class — don't add a guard around the latest
instance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 2 made Omnigraph::schema_source() return Arc<String> via ArcSwap, but
the failpoints test still compared against &'static str constants. Three
E0308 type mismatches were blocking the Test Workspace CI job; this fix
restores compilation.
- failpoints.rs:125,160,195 now call schema_source().as_str() to align
with the &str constants.
- Drops 11 unused let mut db = ... bindings on the same path (engine
write APIs are &self post PR 2 Step C).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bdd6440 commit re-captured expected_versions from `db.snapshot()`
(bound-branch view). That broke any mutation on a non-bound branch:
when the engine handle is bound to main but the mutation targets
feature, the bound-branch snapshot returns main's pin for each
table, not feature's. The publisher commits to feature, reads
feature's manifest entry, sees a different version → 409 even though
no concurrent writer existed.
Reproduced by `branch_merge_conflict_response_includes_structured_conflicts`
which mutates main then mutates feature on the same Omnigraph handle —
the second mutation failed with "expected V6, current V5".
Switch the re-capture to `db.snapshot_for_branch(branch).await` so the
per-branch entries are resolved correctly. This is one fresh manifest
read per mutation (the same I/O PR 1b had pre-Step-D), but it is now
required for cross-branch correctness — Step D's "in-memory under
single-coordinator invariant" rationale was only sound for
single-branch workloads.
Single-table same-branch mutations could still skip this read (queue
exclusivity makes the publisher CAS a no-op), but the conditional
adds complexity for marginal gain. Left as a follow-up perf
optimization tracked in `.context/bench-results/comparison.md`.
Bench numbers updated:
- single-actor 1x1: 15.2 ops/s vs baseline 12.3 (+24%)
- disjoint 8x8: 7.12 ops/s vs baseline 6.24 (+14%)
- same-key 8x1: 77% errors via the strict ensure_expected_version
check upstream of commit_all; same-key concurrent-write fix is a
separate follow-up.
All 102 lib + 39 server + 24 runs + 30 branching + 20 traversal +
9 validators tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Step D commit (1b0a2c9) skipped revalidation for single-table
mutations, betting that the publisher's CAS would be a no-op under the
per-(table, branch) queue. The bench falsified this: expected_versions
was captured during stage_all (BEFORE acquire_many), so by the time
the queue acquired and the publisher ran, those captured pins were
stale w.r.t. any in-process concurrent writer that had published in
between. Same-key 8x1 produced ~99% manifest_conflict 409 rejections
because every actor after the first carried stale expected_versions.
Fix: always re-read the in-memory snapshot under the queue and
overwrite expected_versions with the current per-table values.
Single-coordinator invariant (one Arc<Omnigraph> per process) makes
this safe with zero I/O — publishes update the shared coordinator
BEFORE releasing queue guards, so a contending tenant's read sees a
fresh view by the time it acquires its keys. The publisher's CAS
becomes a correct no-op for queued tables; cross-process drift
(coord stale because coord doesn't see external publishes) still
rejects via the publisher CAS as ExpectedVersionMismatch -> 409,
preserving the change_conflict_returns_manifest_conflict_409
regression sentinel.
Trade-off documented in the comment: SERIALIZABLE-opt-in writes
(§VI.36 aspirational) will need an additional revalidation step
here; the bench's append/upsert pattern is fine because Lance's
natural rebase handles concurrent writes onto the same dataset.
Bench results captured at .context/bench-results/after-pr2/ +
.context/bench-results/comparison.md:
- single-actor 1x1: 15.0 ops/s vs baseline 12.3 (+22%)
- disjoint 8x8: 7.03 ops/s vs baseline 6.24 (+13%)
- same-key 8x1: still rejected (76% errors) by the
ensure_expected_version strict check upstream of commit_all;
follow-up to address.
Disjoint's 13% is below the master plan's ≥8× target. Bench shows
the coordinator Mutex is now the dominant serializer; relaxing
to RwLock for snapshot/version reads is the next perf step,
tracked as a follow-up in comparison.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- docs/server.md: new "Per-actor admission control (MR-686)" section
documenting WorkloadController defaults, the 429/503 mapping with
Retry-After semantics, the Cedar-then-admission ordering, and the
/change-only-for-now scope. Adds 429 / 503 to the listed HTTP status
codes and `too_many_requests` / `service_unavailable` to the ErrorCode
enumeration in the error model paragraph.
- docs/architecture.md: server/CLI diagram updated. Adds WorkloadController
and WriteQueueManager nodes; flow is HTTP -> auth -> Cedar -> admission
-> engine -> queue. Engine label changed to "Arc<Omnigraph>" to reflect
the AppState flip. Prose now points at server.md and runs.md for the
admission/queue contracts. The CLI's bypass-admission note is preserved.
- docs/invariants.md §VI.23 status annotation: explicitly cites the
per-(table, branch) writer-queue + revalidation-under-queue as closing
the Lance-HEAD-vs-manifest drift class under concurrent writers once
the global RwLock is removed (PR 2 Step F). Continuous in-process
rollback recovery still aspirational (MR-870 ticket).
scripts/check-agents-md.sh passes (26 links, 26 docs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The substantive PR 2 change. Removes the global server `RwLock<Omnigraph>`
that has serialized every mutating request across all actors. Disjoint
`(table, branch)` writes from different actors now run concurrently,
guarded only by the engine's per-(table, branch) write queue (PR 1b)
and per-actor admission control (PR 2 Step E).
AppState changes:
- `db: Arc<RwLock<Omnigraph>>` -> `engine: Arc<Omnigraph>`
- New field: `workload: Arc<workload::WorkloadController>` initialized
from env (`OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=16`,
`OMNIGRAPH_PER_ACTOR_BYTES_MAX=4GiB`,
`OMNIGRAPH_GLOBAL_REWRITE_MAX=4`).
- `tokio::sync::RwLock` import dropped.
Handler updates (16 sites):
- All `Arc::clone(&state.db).read_owned().await` and `write_owned()`
calls replaced with `let db = &state.engine`. Engine APIs are now
`&self` (Step C) so this works directly.
- `/export` clones `Arc<Omnigraph>` once and moves into the spawned
task instead of acquiring a long-held read lock.
- `/change` handler additionally wires
`state.workload.try_admit(&actor_arc, est_bytes)`. Cedar runs FIRST
so denied requests don't consume admission slots; admission runs
SECOND before the engine call. `est_bytes` uses the request body
size as a coarse proxy.
API surface additions (`api::ErrorCode`):
- `TooManyRequests` -> HTTP 429 (per-actor cap exceeded; respect
`Retry-After`)
- `ServiceUnavailable` -> HTTP 503 (global rewrite pool exhausted)
`ApiError` constructors `too_many_requests` / `service_unavailable` and
`from_workload_reject` (maps `RejectReason` variants to HTTP status).
Other mutating handlers (`/ingest`, `/branches/*`, `/branches/merge`,
`/schema/apply`) currently flow through the Arc<Omnigraph> path
without admission gates; wiring those is mechanical and lands as a
follow-up. The /change hot path covers the bulk of MR-686's load
profile.
OpenAPI regenerated to include the new ErrorCode variants.
102 lib + 39 server tests + 5 workload tests pass. The regression
sentinel `change_conflict_returns_manifest_conflict_409` continues
to pass (revalidation perf opt + per-table queue + publisher CAS
preserve manifest_conflict semantics under concurrent writers).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 2 removes the global server `RwLock<Omnigraph>` (Step F). Without
admission control, one heavy actor would exhaust shared capacity
(Lance I/O threads, manifest churn, network) and starve other actors.
The WorkloadController bounds per-actor in-flight count + bytes and
provides a global rewrite-pool semaphore for compaction / index builds.
New file: `crates/omnigraph-server/src/workload.rs` (~250 LOC + 5 tests).
API:
- `WorkloadController::new(inflight_cap, byte_cap, rewrite_cap)` /
`from_env()` / `with_defaults()`.
- `try_admit(actor_id, est_bytes) -> Result<AdmissionGuard, RejectReason>`
acquires both an in-flight count permit and adds est_bytes to the
per-actor counter atomically; returns RejectReason on either gate.
- `try_admit_rewrite() -> Result<RewriteGuard, RejectReason>` for the
global rewrite pool (Step F maps RewriteGuard exhaustion to HTTP 503).
- `RejectReason::{InFlightCountExceeded, ByteBudgetExceeded,
GlobalRewriteExhausted}`.
Race-free admission via `tokio::sync::Semaphore::try_acquire_owned()`
for the count gate (master plan Finding 6: independent atomic
load+check+add lets two callers both pass a cap-N check; the Semaphore
gate is atomic). Bytes use `fetch_add` + decrement-on-rejection so the
cap is never exceeded even on rollback.
Defaults (override via env):
- OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=16
- OMNIGRAPH_PER_ACTOR_BYTES_MAX=4_294_967_296 (4 GiB)
- OMNIGRAPH_GLOBAL_REWRITE_MAX=4
Tests cover under-cap admission, byte-budget rollback, per-actor
isolation, global rewrite cap, and the load-bearing 32-concurrent-vs-
cap-16 race test (forces real contention via a broadcast release
channel so guards can't recycle permits task-by-task; pins the
master plan's race-free invariant).
Adds workspace dep `dashmap = "6"` for per-actor state.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the PR 1b regression (-17% disjoint, -30% same-key) by
eliminating the fresh `db.snapshot_for_branch(branch).await` that
PR 1b's commit_all issued per mutation.
Single-table mutations (`staged.len() + inline_committed.len() == 1`):
skip revalidation entirely. The per-(table, branch) Mutex queue holds
exclusive while we commit; the publisher's CAS catches any drift
that slipped between expected_versions capture and queue acquisition.
Conflict cost: 1 orphan Lance HEAD advance, recovered via the existing
sidecar protocol on the next ReadWrite open. This is the same trade-off
the master plan §"Revalidation perf optimization" prescribes.
Multi-table mutations: replace `db.snapshot_for_branch(branch)` (fresh
manifest read) with `db.snapshot()` (in-memory). Correct under MR-686's
single-process scope because all in-process tenants share one
`Arc<Omnigraph>` -> one coordinator; publishes update the shared
coordinator BEFORE releasing queue guards, so a contending tenant
reads a fresh in-memory view by the time it acquires its queue keys.
The within-mutation race (A captures expected_versions[T2]=V0, B
publishes T2->V1 during A's stage I/O, A then acquires T2's queue)
is caught via the in-memory check. Multi-coordinator deployments
(§VI.27 aspirational) would need force-refresh under the queue —
documented in §VI's "Explicit non-commitments".
Adds a SAFETY comment naming the two load-bearing premises:
(1) per-table queue uses exclusive Mutex (not RwLock), and
(2) single-coordinator invariant (one Omnigraph engine per process).
Migrating either breaks this skip.
Regression sentinel `change_conflict_returns_manifest_conflict_409`
passes. 102 lib + 24 runs + 16 staged_writes pass with the new path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The interior-mutability primitives from Step B (catalog ArcSwap,
schema_source ArcSwap, coordinator Mutex, and RuntimeCache's existing
internal locking) make every Omnigraph engine write API safe to expose
under &self. This commit flips the public surface so the HTTP server
can hold Arc<Omnigraph> in PR 2 Step F instead of Arc<RwLock<Omnigraph>>.
Public API conversions:
- mutate, mutate_as
- ingest, ingest_as, ingest_file, ingest_file_as
- load, load_as, load_file
- branch_merge, branch_merge_as
- apply_schema
- ensure_indices, ensure_indices_on
- optimize
Inner functions converted in lockstep (their signatures must match the
new caller shape):
- mutate_with_current_actor, ingest_with_current_actor,
load_direct_on_branch
- execute_named_mutation, execute_insert, execute_update,
execute_delete, execute_delete_node, execute_delete_edge
- branch_merge_impl, branch_merge_on_current_target
- load_jsonl_reader
- schema_apply::{apply_schema, apply_schema_with_lock,
acquire_schema_apply_lock, release_schema_apply_lock,
ensure_schema_apply_idle}
- table_ops::{ensure_indices, ensure_indices_on,
ensure_indices_for_branch, commit_prepared_updates,
commit_prepared_updates_with_expected,
commit_prepared_updates_on_branch,
commit_prepared_updates_on_branch_with_expected,
commit_manifest_updates, record_merge_commit,
ensure_commit_graph_initialized, commit_updates_on_branch_with_expected}
- optimize::optimize_all_tables
- Omnigraph::commit_manifest_updates, record_merge_commit,
commit_updates_on_branch_with_expected, ensure_commit_graph_initialized
The conversion is mechanical: callers that previously took `db: &mut
Omnigraph` now take `db: &Omnigraph`; every interior mutation goes
through the existing locks (coordinator.lock().await, store_catalog,
runtime_cache.invalidate_all). No new locks acquired, no new lock-order
hazards introduced.
102 lib tests + 24 runs + 30 branching + 63 end_to_end + 39 server
tests pass. Workspace compiles clean (1 warning on a now-redundant `mut`
binding in CLI; cleaned up in a follow-up). The remaining work in PR 2
is the AppState flip (Arc<RwLock<Omnigraph>> -> Arc<Omnigraph> +
WorkloadController), the revalidation perf optimization in commit_all,
and the WorkloadController itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the GraphCoordinator field in `Arc<tokio::sync::Mutex<...>>` so
engine APIs can move from `&mut self` to `&self` without giving up the
coordinator's mutating refresh path. Lock acquisition order: always
before runtime_cache (when both are needed in one scope). Critical
sections stay short — load+clone for snapshot/version/current_branch,
single-method delegations elsewhere.
Public API changes:
- `Omnigraph::version()` and `Omnigraph::snapshot()` (pub(crate))
become async; callers add `.await`.
- `Omnigraph::active_branch()` returns `Option<String>` (cloned) instead
of `Option<&str>` borrowed from the coordinator. Callers either
`.await` the result + use `.as_deref()`, or hoist into a binding.
`&self`-converted methods this round (tied to the coordinator wrap, not
the Step C surface refactor):
- `swap_coordinator_for_branch`
- `restore_coordinator` (now async; was sync)
- `sync_branch`
- `refresh`
- `refresh_coordinator_only`
- `reload_schema_if_source_changed`
- `branch_create`, `branch_create_from`, `branch_delete`, `branch_list`
- `delete_branch_storage_only`
- `ensure_branch_delete_safe`
- `ensure_schema_apply_idle`
- `ensure_schema_apply_idle` helper in schema_apply.rs (matches signature)
Caller updates: branch_create_from_impl threads `restore_coordinator`'s
new async signature; schema_apply, table_ops, exec/merge wrap every
direct `db.coordinator.X()` in `db.coordinator.lock().await.X()`;
exec/merge hoists `active_branch_for_keys` once outside the per-table
closure that builds queue keys + sidecar pins.
All 102 lib tests + 30 branching + 24 runs + 10 lifecycle + 16
staged_writes + 63 end_to_end pass workspace-wide. Zero test
regressions; the only behavior change is on the `Omnigraph` API
surface (sync -> async on the three accessors above).
Step C (engine API conversion: apply_schema, mutate_as, ingest_as,
branch_merge_as &mut self -> &self) follows in a subsequent commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles the working-tree state from the prior session (PR 0 bench harness,
PR 1a audit_actor_id removal, PR 1b WriteQueueManager + writer integration)
together with the first half of PR 2's interior-mutability foundation
(catalog and schema_source wrapped in Arc<ArcSwap<...>>). The two streams
intermix in 7 of the same files, so splitting via git add -p was
impractical. Subsequent PR 2 steps land as separate atomic commits.
PR 0 — server-level concurrent /change bench harness
- crates/omnigraph-server/examples/bench_concurrent_http.rs (new)
- .context/bench-results/{baseline-main,after-pr1}/ (gitignored)
PR 1a — drop the audit_actor_id field, thread per-call
- removed Omnigraph::audit_actor_id and the swap-restore patterns in
mutation.rs, merge.rs, loader/mod.rs
- actor_id: Option<&str> threaded through MutationStaging::finalize,
mutate_with_current_actor, ingest_with_current_actor,
branch_merge_impl, branch_merge_on_current_target,
commit_prepared_updates*, record_merge_commit,
commit_updates_on_branch_with_expected
- apply_schema and ensure_indices_for_branch pass None (system-attributed)
PR 1b — per-(table_key, branch) write queue + revalidation + sidecar
- new crates/omnigraph/src/db/write_queue.rs with WriteQueueManager,
acquire/acquire_many, sorted+deduped acquisition; 6 unit tests
- Arc<WriteQueueManager> field on Omnigraph + db.write_queue() accessor
- MutationStaging::finalize split into stage_all (Phase A, no queue)
and StagedMutation::commit_all (Phase B, acquire_many + revalidate
pins + sidecar + commit_staged); guards held across publisher
- delete-only mutations now emit recovery sidecars; revalidation
extended to inline_committed tables
- branch_merge_on_current_target, apply_schema_with_lock, and
ensure_indices_for_branch acquire per-table queues for their
touched tables
PR 2 Step B (partial) — catalog and schema_source via ArcSwap
- catalog: Catalog -> Arc<ArcSwap<Catalog>>
- schema_source: String -> Arc<ArcSwap<String>>
- public accessors return Arc<Catalog> / Arc<String>; readers bind
locally where the borrow has to outlive an expression
- new pub(crate) store_catalog / store_schema_source helpers replace
the field assignments in apply_schema and reload_schema_if_source_changed
- 117 tests across lifecycle/end_to_end/branching/runs pass; engine
lib + workspace compile clean
Coordinator wrap (Mutex) and the &mut self -> &self engine API
conversion follow in subsequent commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 2 wraps the Omnigraph engine's catalog and schema_source fields in
ArcSwap so reads stay zero-cost while apply_schema can swap atomically
without &mut self. arc-swap lands as an unused workspace dep here so the
follow-up commits that wrap fields can land in isolation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new §VI invariants name what OmniGraph commits to as an agent-native
system of record: branches as the cross-query coordination primitive,
per-query isolation as a per-query opt-in (Serializable up, eventual down),
and type-aware agent-resolvable merges. Plus an explicit non-commitments
subsection so reviewers see what is intentionally out of scope (Strict
Serializable across queries, cross-process linearizable single-object writes,
auto-resolution of ambiguous merge conflicts).
§VII and §VIII renumber by +3 to make room (35-43 -> 38-46, 44-47 -> 47-50);
deny-list and review-checklist references in §IX/§X follow. testing.md's
pre-existing stale §VII.33/34/36 references resolve to their actual
§VIII.47/48/50 targets in the same pass. staged_writes.rs:866's docstring
gains an MR-686 forward reference so the load-bearing concurrency-hazard
test points readers at the queue work that closes the gap.
§VI.34 is preserved alongside the broader §VI.36 to keep its MR-425
pointer addressable; the overlap is documented in §VI.36's status line.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three terminologies were calling themselves Phase A/B in PR #72:
1. Per-writer recovery (canonical, four phases A/B/C/D — sidecar /
commit_staged loop / manifest publish / sidecar delete in
`docs/runs.md:157`).
2. Per-table staged-write contract from MR-793 (two phases —
`stage_*` then `commit_staged`).
3. Test-narrative scaffolding (Phase A = setup the failure, Phase B
= verify recovery — used as section dividers in failpoints.rs).
Same letters, three meanings; three reviewers including the bots have
already misread the code in the resulting fog. This change keeps
"Phase A/B/C/D" exclusively for #1 and rewrites the other two:
- `ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`
→ `ensure_indices_stage_btree_failure_leaves_existing_tables_writable`
(matches the `stage_create_btree_index` API verb).
- Comment at `table_ops.rs:610` and the test docstring at
`failpoints.rs:807` rewrite "a Phase A failure in the staged-index
path" → "a stage-step failure in the staged-index path".
- Twelve `// Phase A:` / `// Phase B:` test scaffolding comment
headers in `failpoints.rs` (across six test fns) become
`// Setup:` / `// Recovery:`.
- A "Phase letter convention" note added near `docs/runs.md:165`
spells the rule out for future readers.
Also bundled: rename
`composite_flow_init_load_branch_merge_time_travel_optimize_cleanup`
→ `composite_flow_canonical_lifecycle` so it pairs as a story name
with `composite_flow_multi_branch_sequential_merges` (the previously-
deferred symmetry rename).
No behaviour change. Both renamed tests pass; full failpoints (18) +
composite_flow (2) suites pass; workspace baseline + clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cursor flagged that SchemaApply sidecars only captured `Update` pins
(via `snapshot.entry()?` in schema_apply.rs:166), so recovery's
`roll_forward_all` only published `ManifestChange::Update` for the
rewritten/indexed tables. Added types (`added_tables`) and tombstones
(`renamed_tables` sources) were silently dropped during recovery.
Reproducer: in `schema_apply_phase_b_failure_recovered_on_next_open`,
the v2 schema added a `Tag` node type. Pre-fix, `node:Tag` ended up as
an orphan dataset on disk while the manifest never received a
`RegisterTable` entry — the live `_schema.pg` declared a type the
manifest didn't know about, and `count_rows(node:Tag)` panicked with
`no manifest entry for node:Tag`. The existing test passed only
because it never queried Tag.
Fix:
1. Extend `RecoverySidecar` with `additional_registrations` and
`tombstones` fields (optional, serde-default for backward compat
with existing on-disk sidecars). Both are SchemaApply-only.
2. Populate them in `apply_schema_with_lock` from the migration plan's
upfront diff (`added_tables` + `renamed_tables` keys for
registrations; `renamed_tables` values for tombstones, version-
pinned at `source_entry.table_version + 1`).
3. Update `roll_forward_all` to:
- emit `RegisterTable` + `Update` for each `additional_registrations`
entry (read the dataset's current Lance HEAD for the version
metadata + row_count)
- emit `Tombstone` for each `tombstones` entry
- filter against `snapshot` so previously-published registrations /
tombstones are skipped (handles the post-Phase-C-success-but-
sidecar-not-yet-deleted case — without filtering, the publisher's
CAS pre-check would error with `expected=0, actual=N` on the
redundant Register)
4. Extend the audit-row outcomes to include published registrations.
Test changes:
- `schema_apply_phase_b_failure_recovered_on_next_open` now asserts
`count_rows(node:Tag) == 0` (no panic), proving the new manifest
entry exists.
- `schema_apply_recovers_pre_commit_crash` renamed to
`schema_apply_pre_commit_crash_rolls_forward_via_sidecar` and
rewritten — pre-fix it expected pre-commit crashes to roll BACK
(delete staging, keep V1, leave Company as orphan); the sidecar
protocol's "complete the writer's intent" semantic now rolls
FORWARD (rename staging -> final, register Company atomically). The
new assertions verify schema = V2 and `node:Company` is queryable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cursor flagged that if `roll_forward_all` succeeds (manifest pin
advances) but `record_audit` then fails, the sidecar persists. On
the next open, every table classifies as NoMovement
(lance_head == manifest_pinned, both already reflect the prior
roll-forward) → `decide` returns RollBack → `roll_back_sidecar`
records a RolledBack audit row with empty per-table outcomes.
Operators reading `_graph_commit_recoveries.lance` see "RolledBack"
for an operation whose actual outcome was a successful roll-forward.
`process_sidecar`'s RollBack arm now distinguishes "stale-after-
success" from a legitimate rollback: when every classification is
NoMovement AND any pin's `manifest_pinned > expected_version` (the
manifest already advanced past the writer's CAS target), recovery
dispatches to `record_audit_recovery_rollforward` which writes a
RolledForward audit row with reconstructed outcomes
(`from_version = expected_version`,
`to_version = manifest_pinned`) and deletes the sidecar. No Lance
writes — the substrate is already in the post-roll-forward state.
Safe in `RollForwardOnly` mode (refresh-time recovery) because no
`Dataset::restore` is involved; the legitimate-rollback path stays
deferred to the next ReadWrite open as before.
Added `recovery_records_rolled_forward_for_stale_sidecar_after_successful_roll_forward`
integration test that synthesizes the state by writing a sidecar
whose `expected_version < manifest_pin` and asserts:
- audit row records `RolledForward` (not `RolledBack`)
- per-table outcome reports the correct `from_version` /
`to_version` pair
- sidecar is deleted
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cubic flagged that the time-travel `total_people` assertions only
checked `result.num_rows() == 1`, which would still pass if the
historical query returned the wrong count (e.g., 10 instead of 6
because of a planner regression resolving against current state
instead of the captured snapshot). Added `assert_total` helper that
extracts the Int64 `total` column and verifies the actual value.
Replaces three weak `num_rows() == 1` assertions in
`composite_flow_multi_branch_sequential_merges`:
- post-both-merges: total = 10
- time-travel to pre-merge-a: total = 6
- post-reopen: total = 10
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first cut of `composite_flow_multi_branch_sequential_merges` used
dataset-direct `count_rows` for read-side assertions, which proves
data is on disk but skips the query path entirely — planner, BTree
index lookup, edge traversal, aggregation, and snapshot resolution
all stay untested. Replaced with strategic `.gq` query checkpoints:
- branch isolation via `get_person` after Eve insert (Eve visible
on feat-a; absent on main)
- 1-hop traversal via `friends_of(Grace)` after the Knows-edge
insert (validates the topology index against branch-local edges)
- post-merge query-engine readback after merge feat-a → main
(Eve findable through BTree, Grace's edge traversable through
the rebuilt Knows index)
- aggregation via `total_people` after merge feat-b → main
(count over a multi-fragment table whose shape is the result
of two sequential merges)
- time-travel via `ReadTarget::Snapshot(captured_id)` for both
`total_people` and `friends_of` / `get_person` at the two
pre-merge points (catches planner regressions where historical
queries accidentally resolve current indices)
- post-reopen query-engine readback (catches reopen-time index/
catalog binding regressions)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `composite_flow_multi_branch_sequential_merges` covering the
agent-workflow pattern that single-merge tests in `branching.rs`
cannot reach: two feature branches diverging from main with main
writes interleaved between every diverge point, sequential merges
into main, time-travel through the resulting merge DAG, and reopen
consistency over a multi-merge history.
The script (18 numbered steps with assertions per step):
init+load → mutate main → branch feat-a → mutate main → mutate
feat-a → branch feat-b → mutate feat-b → mutate feat-a (with
edge) → merge feat-a → mutate main → merge feat-b → time-travel
to pre-merge-a + pre-merge-b → reopen + verify.
Catches eight compositional gap categories that only surface with
≥2 merges and main mutations between them: base/LCA recomputation
across two merges, manifest-pin propagation through merge commits,
time-travel through merge DAG without state bleed-through, branch-
DAG consistency, sibling-branch isolation under writes elsewhere,
post-merge main-write integration, multi-merge reopen replay, and
clean-flow recovery-sidecar absence.
`composite_flow.rs` was added to `docs/testing.md` so the before-
every-task checklist points agents at the file before duplicating
coverage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small PR #72 review findings addressed:
- merge.rs sidecar pin recorded `entry.table_branch` (where the table
currently lives in the target manifest) instead of the merge target
branch where commits actually land via `publish_rewritten_merge_table`
→ `open_for_mutation` → `fork_dataset_from_entry_state`. Recovery's
`open_lance_head` would then check the wrong ref. Aligned with the
pattern already used in `ensure_indices_for_branch` (table_ops.rs:115).
Added `branch_merge_sidecar_pins_table_branch_to_active_branch`
contract test that reads the sidecar JSON and asserts every per-pin
`table_branch` equals the active (target) branch — catches the
regression even when the values happen to coincide in the test setup.
- Rollback audit `from_version` previously equalled `to_version`
(both `manifest_pinned`), telling operators nothing about the actual
Lance HEAD drift before restore. Captured `lance_head` in
`ClassifiedTable` and used it as `from_version` so audit rows now
show "rolled back from v7 to v5" instead of "v5 → v5". Added
`assert_rollback_outcomes_record_drift` invariant in the test helper,
invoked automatically by every `RecoveryExpectation::RolledBack` test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle of three correctness fixes plus a shared invariants helper that
existing tests now use.
1. SchemaApply atomicity: close the residual gap where a sidecar exists
but staging files don't (e.g., Phase B failure BEFORE
`_schema.pg.staging` write). `recover_schema_state_files` now returns
a `SchemaStateRecovery` discriminator (`Noop` /
`CleanedStaging` / `CompletedStagingRename { schema_apply_sidecar }`);
the token threads through `recover_manifest_drift` →
`process_sidecar`. SchemaApply sidecars are eligible for roll-forward
ONLY when the staging rename completed in the same recovery pass.
Full mode rolls back; RollForwardOnly defers. Without this, recovery
would publish the manifest pin against new-schema data while
`_schema.pg` stayed old (real corruption). New failpoint
`schema_apply.before_staging_write` + new test
`schema_apply_without_schema_staging_rolls_back_on_next_open` pin
the gating.
2. Rollback target correction. Rollback now restores Lance HEAD to the
current manifest pin (`state.manifest_pinned`) instead of the
sidecar's `expected_version`. For UnexpectedAtP1/UnexpectedMultistep
classifications these can differ; the old code could regress Lance
HEAD past the manifest pin, re-introducing drift in the OTHER
direction. The new behavior establishes `Lance HEAD == manifest pin`
post-rollback — the canonical drift-free invariant. Param renamed
from `expected_version` → `target_version` to match. Audit
`to_version` records the actual restore target.
This is a latent-behavior change. Any external consumer that compared
`audit.to_version` against `sidecar.expected_version` for non-trivial
classifications now sees the manifest pin instead.
3. Audit commit-graph unification. `record_audit` now opens the
per-branch commit graph for ANY sidecar with `sidecar.branch.is_some()`
— not just BranchMerge. Plain Mutation/Load/EnsureIndices commits on a
feature branch now correctly land on that branch's commit graph,
instead of main's. Closes the class of bug analogous to D2 but for
non-merge writers.
Pre-existing repos with non-main commits already on main's commit
graph stay where they are; future recoveries write to the per-branch
ref. Mixed-version compatibility is asymmetric but safe (old binaries
ignore per-branch refs they don't know about; new binaries read both).
4. Recovery invariants helper + branch-axis cells. New
`tests/helpers/recovery.rs` (~505 LOC) exports
`assert_post_recovery_invariants(repo, op_id, RecoveryExpectation)`
plus a `TableExpectation` builder. Six existing recovery tests
refactored to call it; per-test bespoke assertions replaced. Two new
branch-axis cells added in `tests/failpoints.rs`:
- `recovery_rolls_forward_load_on_feature_branch`
- `recovery_rolls_forward_ensure_indices_on_feature_branch`
The loader gains a `mutation.post_finalize_pre_publisher` failpoint
hook (gated on the `failpoints` feature; zero-cost in release) so the
load test can pin the same Phase B → Phase C boundary the mutation
path uses.
Misc:
- `Omnigraph::refresh` extracts `reload_schema_if_source_changed`:
early-return when schema source unchanged (saves IR parse + catalog
rebuild on the steady-state refresh path).
- New test injection point
`failpoint_publish_table_head_without_index_rebuild_for_test`
under `#[cfg(feature = "failpoints")]`.
Tests: 31 recovery + failpoint integration tests pass (14 + 17, up from
14 + 16). Full workspace sweep with `--features failpoints` clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E1. After D3 added recover_schema_state_files to refresh(), the
in-memory `self.schema_source` and `self.catalog` were left stale:
a SchemaApply sidecar processed via refresh would rename the
staging files (`_schema.pg`, IR contract) into place but the
handle continued operating against the old catalog. Subsequent
operations would surface schema mismatches against post-migration
data on disk.
Fix: after recover_manifest_drift completes, refresh() now mirrors
open_with_storage_and_mode's schema-load sequence — re-reads
`_schema.pg`, parses IR via load_or_bootstrap_schema_contract,
rebuilds the catalog with fixup_blob_schemas, and assigns into
self.schema_source / self.catalog. Steady-state cost: one read +
one parse per refresh; only mutates handle state when the on-disk
schema actually changed.
E2. The non-main branch_merge recovery test
(`branch_merge_phase_b_failure_recovered_on_non_main_target`)
asserted only `merged_parent_commit_id` was non-null — but
`merged_parent_commit_id` is independently populated from
sidecar.merge_source_commit_id (the SOURCE branch's tip), so the
assertion would pass even if D2's per-branch CommitGraph fix
regressed (the bug was about `parent_commit_id`, the TARGET
branch's tip).
Fix: capture target_branch's commit-graph head BEFORE the failed
merge by scanning target_branch's Lance ref on _graph_commits.lance
and picking the latest commit by created_at. After recovery, find
the recovery merge commit (the one with non-null
merged_parent_commit_id) and assert its `parent_commit_id` ==
captured pre-failure head. Without D2, recovery would record the
GLOBAL head (the source_branch's insert-Carol commit on this test)
instead, and the assertion fails.
Also fixes the column-type cast: created_at is stored as
TimestampMicrosecondArray, not Int64Array.
All workspace tests pass with --features failpoints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
D1. roll_forward_all returns per-table actual published versions; the
audit row's `to_version` records that, not pin.post_commit_pin
(the latter is a lower bound for loose-match writers SchemaApply /
EnsureIndices / BranchMerge — pin.post_commit_pin = expected + 1
while actual published HEAD can be expected + N).
D2. Branch-merge recovery audit uses CommitGraph::open_at_branch when
sidecar.branch is Some, so the merge parent is the TARGET BRANCH's
tip (not the global head). Without this, recovered branch_merge
on a non-main target records the wrong merged_parent_commit_id and
future merges between the same pair lose already-up-to-date
detection / merge-base correctness.
D3. Omnigraph::refresh now mirrors open's recovery composition: runs
recover_schema_state_files BEFORE recover_manifest_drift. Without
this, a SchemaApply sidecar processed via refresh would publish
the manifest + delete the sidecar without renaming the staging
schema files, leaving the repo with new-schema data and old
`_schema.pg` (real corruption). Refresh's docstring now enumerates
each open-time recovery step it maintains, so the next maintainer's
diff between open() and refresh() is trivial.
D4. ensure_indices sidecar pin records `active_branch` (where commits
actually land), not `entry.table_branch` (where the table currently
lives). On first fork-on-write, the processing loop's
open_owned_dataset_for_branch_write forks to active_branch and the
commit lands there — recovery's open_lance_head must check the
same branch. Without this, recovery checks the wrong ref and
misses Phase B drift entirely.
D5. Two new branch-axis tests:
* recovery_rolls_back_feature_branch_sidecar_against_feature_branch
— feature-branch rollback variant; asserts post-recovery audit
kind == RolledBack and the actual restore commit landed on the
feature ref.
* branch_merge_phase_b_failure_recovered_on_non_main_target
— non-main merge target variant; reads the target branch's
commit graph (Lance ref) and asserts the recovery commit has
a non-null merged_parent_commit_id (pins D2).
Bug pattern: all four are at composition seams between concepts that
were each tested individually (writer-precision × actual-Lance-HEAD;
branch-context × commit-graph-API; recovery-path × writer-kind; pin-
time-branch × commit-time-branch). The branch-axis matrix is the
cheapest mechanical prevention for D2/D4-class regressions.
All workspace tests pass with --features failpoints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds RecoveryMode { Full, RollForwardOnly } and wires Omnigraph::refresh
to invoke roll-forward-only recovery. This closes the documented
"long-running server between Phase B failure and process restart"
residual without requiring a restart, for the common case (mutation /
load finalize → publisher failure).
Why roll-forward only and not full sweep:
* Roll-forward is safe under concurrency (publisher uses row-level
CAS).
* Roll-back uses Dataset::restore, which "wins" against concurrent
Append/Update/Delete/CreateIndex/Merge per check_restore_txn —
silently orphaning the concurrent writer's commit (pinned by
tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning).
Sidecars that classify as RollBack-eligible are LEFT ON DISK for the
next ReadWrite open, where no concurrent writers exist and full
restore is safe.
Implementation:
* recovery.rs: RecoveryMode enum; recover_manifest_drift takes mode;
process_sidecar branches on mode for Abort and RollBack — both
defer to next ReadWrite open under RollForwardOnly. RollForward
behavior unchanged.
* omnigraph.rs: Omnigraph::refresh promoted to pub; calls
recover_manifest_drift in RollForwardOnly mode after coordinator
refresh. Steady-state cost: one list_dir of __recovery (early
return on empty). Adds refresh_coordinator_only — pub(crate) —
for engine-internal callers that hold an in-flight sidecar (the
schema_apply lease-check + lock-release paths). Without this split,
refresh would race the in-flight sidecar.
* schema_apply.rs: switch all 6 internal db.refresh() call sites to
refresh_coordinator_only().
Tests:
* refresh_runs_roll_forward_recovery_in_process — trigger
mutation.post_finalize_pre_publisher; without restart, call
db.refresh(); assert sidecar deleted, drifted row visible,
subsequent mutation succeeds.
* refresh_defers_rollback_eligible_sidecar_to_next_open — synthesize
a Mutation sidecar with bogus expected (UnexpectedAtP1 → RollBack);
refresh leaves it on disk and Lance HEAD unchanged; drop and reopen
runs the full sweep which advances HEAD via restore.
Docs:
* docs/runs.md "Long-running servers" caveat updated to describe the
refresh-time roll-forward path and the rollback-defer behavior.
* docs/invariants.md §VI.23 status line updated to reflect in-process
closure of the common case.
Workspace tests pass with --features failpoints; no regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
B1. Schema-apply atomicity. Before this commit, a failure between
`_schema.pg.staging` write and the manifest publish left the repo
corrupt: Lance HEADs advanced under the new schema, manifest stayed
at old pins, and on reopen schema-state recovery deleted the staging
files (manifest's table set still matched the live schema), then
manifest-drift recovery rolled the table versions forward — leaving
new-schema data on disk with the old `_schema.pg` live.
Fix: a SchemaApply sidecar is the marker that Phase B completed but
Phase C didn't. New helper `has_schema_apply_sidecar` is consulted
by `recover_schema_state_files` BEFORE its disambiguation logic;
when present, it completes the staging→final rename so the
subsequent manifest-drift roll-forward sees the new catalog.
B2. Branch-aware recovery. Sidecars from feature-branch writers were
being classified against main's snapshot and main's Lance HEAD,
silently no-op'ing or rolling back the wrong table version (the
classifier saw NoMovement; the writer's drift on the feature branch
persisted; subsequent feature writers surfaced
ExpectedVersionMismatch).
Fix: SidecarTablePin gets an optional `table_branch` field;
`recover_manifest_drift` opens a per-branch coordinator
(`GraphCoordinator::open_branch`) per sidecar; `open_lance_head`,
`restore_table_to_version`, and `roll_forward_all` honor the pin's
branch via `Dataset::checkout_branch`.
B3. Remove fragment-id short-circuit in `restore_table_to_version`.
Equal fragment IDs do NOT imply equal content: Lance index commits
and deletion-vector updates change the manifest without touching
fragment IDs. Skipping restore in those cases would leave Lance HEAD
ahead of the manifest with no recovery artifact left. Restore is
now unconditional; pile-up under repeated mid-rollback crashes
bounded and reclaimed by `omnigraph cleanup`.
B4. Recovered branch_merge records merge parent. `record_audit` always
called `append_commit`, dropping `merged_parent_commit_id`. Future
`branch_merge feature -> main` between the same pair lost
already-up-to-date detection. RecoverySidecar gets an optional
`merge_source_commit_id`; `branch_merge_on_current_target`
populates it from `source_head_commit_id`; `record_audit`
dispatches to `append_merge_commit` when present.
New tests: feature-branch sidecar classification (B2); B1 deepens the
existing schema_apply test with live-`_schema.pg` and new-type
assertions; B4 deepens the existing branch_merge test by reading
`_graph_commits.lance` and asserting a non-null `merged_parent_commit_id`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A1. tests/recovery.rs: rewrite recovery_multi_sidecar_requires_fresh_snapshot_for_correctness
to use real `append_batch` instead of fragment-preserving `delete_where("1 = 2")`.
The previous setup made restore_table_to_version's fragment-set short-circuit
no-op the bug path, so the load-bearing `HEAD == v3` assertion passed in both
bug and fix paths. Real appends produce different fragment-id sets across v1,
v2, v3 so a real restore actually runs in the bug path (HEAD becomes v4).
Added a person_batch helper matching the post-init Lance schema (id, age, name).
A2. exec/merge.rs: filter recovery sidecar pins to `RewriteMerged` candidates
only. `AdoptSourceState`'s pure-pointer-switch and fork subcases don't
advance Lance HEAD; pinning them would force NoMovement on recovery and
trigger an all-or-nothing rollback that destroys legit RewriteMerged work.
Documented residual: AdoptSourceState subcases that internally call
publish_rewritten_merge_table aren't covered by the sidecar; closing that
requires pre-computing source deltas during candidate classification (a
structural change to CandidateTableState) — left as follow-up.
A3. db/omnigraph/table_ops.rs: add the same branch filter
(`active_branch.is_some() && entry.table_branch.is_none() => continue`)
to the ensure_indices sidecar pin loop that the processing loop already
has. Without this, main-branch tables that need index work get pinned but
never committed when ensure_indices runs on a feature branch → NoMovement
→ all-or-nothing rollback destroys feature-branch work.
A4. tests/failpoints.rs: deepen schema_apply_phase_b_failure and
branch_merge_phase_b_failure tests with post-recovery manifest-pin advance
assertions. branch_merge test setup also mutates main so the merge
produces at least one RewriteMerged candidate (required after A2's pin
filter — a no-op merge with all-AdoptSourceState would write no sidecar).
Fixed stale "BranchMerge is strict-classified" comment to reflect current
loose classification.
A5. tests/composite_flow.rs: remove duplicate back-to-back `total_people`
query in step 12.
Full workspace test sweep with --features failpoints passes: no regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three bundled changes:
1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and
the test function). OmniGraph is consumed by both humans and agents
- naming the test after one audience misframes the library.
2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and
review-round labels from source, tests, and docs added by this
branch. Internal traceability belongs in commit messages and PR
descriptions, not in checked-in artifacts. Upstream
lance-format/lance issue refs and pre-existing MR-XXX refs in docs
not touched by this branch are left alone.
3. Two outstanding review findings addressed:
- `needs_index_work_node` / `needs_index_work_edge`: propagate
`count_rows` errors instead of `unwrap_or(0)`. Silently treating
transient I/O failures as "0 rows" risked skipping a table from
the recovery sidecar pin set that was actually about to be
modified.
- `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`:
strengthen the assertion to fail when sidecar B classifies under
a stale snapshot. The new assertion checks post-recovery Lance
HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar
deleted + audit rows present" pair passed in both the bug and
fix paths because both delete the sidecar and write an audit
row; the differentiator is the post-recovery HEAD. Strengthening
the assertion exposed an additional nuance: in this overlapping-
sidecar scenario sidecar B's audit kind is RolledBack (no-op)
rather than RolledForward, since sidecar A's roll-forward
publishes Lance HEAD as the new manifest pin (absorbing B's
work). The docstring now explains why this is correct given
current `roll_forward_all` semantics.
All workspace tests pass with --features failpoints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements MR-858 ahead of the rest of the MR-857 epic: the deterministic
narrative test counterpart to MR-783's randomized harness.
`tests/agent_lifecycle.rs::agent_lifecycle_init_load_branch_merge_time_travel_optimize_cleanup`
walks the canonical agent flow end to end:
1. init repo with TEST_SCHEMA
2. load_jsonl seed data (4 Person + 2 Company nodes; Knows + WorksAt edges)
3. branch_create feature off main
4. mutate on feature: single-statement insert (Eve) + multi-statement
insert+edge (Frank knows Eve)
5. query on feature: total_people / friends_of (traversal) /
unemployed (anti-join) / friend_counts (aggregation)
6. mutate on main (set Bob's age) — sets up non-conflicting merge
7. branch_merge feature → main; verify version advance
8. query post-merge: confirm Eve visible on main (from feature) +
Bob visible (from main mutation, carried through merge)
9. snapshot_at_version(pre_merge_version): time-travel still sees
pre-merge state (4 Persons, no Eve)
10. optimize the post-merge graph; verify reads still work + counts
unchanged
11. cleanup with --keep 10 --older-than 3600s (no-op for this short
test, but exercises the call path)
12. drop + reopen; verify all counts + branch list consistent;
confirm read path works post-cleanup-reopen
**Known limitation surfaced**: post-optimize mutation path in step 11
hit `ExpectedVersionMismatch` because `optimize_all_tables` advances
per-table Lance HEAD without updating the `__manifest` pin
(`db/omnigraph/optimize.rs:77`), and something between optimize and
re-open writes a higher version row to `__manifest`. Test documents
this and defers full coverage to MR-859 (`omnigraph optimize` +
`cleanup` integration coverage), keeping the read-path-after-cleanup
assertion which is the headline operator concern.
Test runs in <1s. ~672 workspace tests pass with --features
failpoints; no regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bot reviewers (cubic + cursor) flagged 5 follow-on issues after the
first fix push. Three are real bugs in the Phase 6-8 ensure_indices
sidecar wiring; two are AI-slop flags on shallow tests. One cursor
finding is a false positive on intentional node/edge index asymmetry.
Real bugs fixed:
- needs_index_work_node and needs_index_work_edge now skip empty
tables (count_rows == 0). The ensure_indices_for_branch loop has
`if row_count > 0 { build_indices(...) }`, so empty tables produce
zero commit_staged calls. Pinning them in the sidecar would force
NoMovement classification on recovery and trigger the all-or-nothing
rollback of any sibling table's legitimate index work (cubic #1).
- needs_index_work_node and needs_index_work_edge now respect the
table_branch parameter from the snapshot entry, instead of always
passing None (== main). For branch writes, opening the wrong HEAD
could miss recoverable Phase B commits (cubic #2).
- needs_index_work_edge documented as intentionally BTree-only (mirrors
the build_indices_on_dataset_for_catalog edge branch which only
builds id/src/dst BTrees). Cursor flagged FTS/vector omission as
inconsistency with the node helper; confirmed intentional via
inline comment so future readers know the asymmetry is on purpose
(cursor finding, false positive marked).
Test improvements:
- recovery_multi_sidecar_requires_fresh_snapshot_for_correctness — new
integration test that uses TWO sidecars on the SAME table where
sidecar B's expected_version equals sidecar A's post_commit_pin.
Sidecar B's classification only succeeds if the recovery sweep
refreshes the snapshot between iterations to see A's manifest
update. Without the refresh fix from the prior commit, B would be
classified against stale pins (cubic #4 follow-up).
- recovery_ensure_indices_handles_empty_tables — new integration test
that runs ensure_indices on an all-empty repo. With the round-2 fix,
both initial and steady-state runs leave no sidecar (zero pins ⇒
zero sidecar I/O). Without the empty-table fix, the sidecar would
pin Company (zero rows but missing indices) and force a NoMovement
rollback (cubic #1 verification).
- ensure_indices_phase_b_failure_does_not_leak_sidecar_when_no_work_needed —
renamed/rewrote the prior `_recovered_on_next_open` test to assert
the post-fix invariant: when load_jsonl auto-built every catalog
index via prepare_updates_for_commit, ensure_indices's needs_work
helpers correctly report zero pins and produce no sidecar. The old
assertion ("exactly one sidecar must persist") was wrong for the
scoped behavior.
Test surface (post-round-2):
- 25 unit tests in db::manifest::recovery (BranchMerge classifier,
sort order, primitives — unchanged).
- 12 integration tests in tests/recovery.rs (+2 from this commit).
- 11 failpoint tests including the four per-writer Phase B → recovery
tests (one renamed to reflect the scoped behavior).
- ~672 workspace tests pass with --features failpoints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bot reviewers (cubic, cursor, chatgpt-codex) caught 4 merge-blocking
bugs + 3 strongly-recommended fixes + 3 doc errors in the initial PR.
Each fix has a paired test demonstrating the bug before the fix.
Merge-blocking fixes:
- BranchMerge moved to loose-match classifier arm. publish_rewritten_
merge_table runs multiple commit_staged calls per table (merge_insert
+ delete_where + index rebuilds). Strict classification rolled back
valid completed Phase B work as UnexpectedMultistep. Three new unit
tests pin the loose-match behavior for BranchMerge.
- branch_merge sidecar uses self.active_branch() (the resolved target
branch) instead of inferring from the first sorted table key. The
previous heuristic could record None (== main) when the merge target
was a non-main branch, causing recovery to publish to the wrong
manifest namespace.
- Best-effort sidecar delete in all 5 writer sites (mutation, loader,
schema_apply, branch_merge, ensure_indices). Previously, a sidecar
cleanup failure after a successful manifest publish would error out
the user's call for a write that already landed. Now: log a warning
and ignore — the next open's recovery sweep tidies the stale sidecar
via NoMovement classification.
- ensure_indices sidecar scoped to tables that need work via new
helpers needs_index_work_node / needs_index_work_edge. Previously
the sidecar pinned every catalog table; if only one needed indexing,
the others classified as NoMovement and the all-or-nothing decision
rolled back legitimate index work.
Strongly-recommended fixes:
- recover_manifest_drift now takes &mut GraphCoordinator and refreshes
between sidecars. Sidecar B's classification needs to see sidecar
A's manifest changes, otherwise B can be classified against stale
pins and incorrectly roll back work that just landed.
- list_sidecars sorts URIs before reading. Sidecar filenames are
ULIDs (chronologically sortable), so this gives deterministic,
time-ordered processing. Filesystem-order was nondeterministic.
- ReadOnly opens skip recover_schema_state_files too (was: only the
MR-847 sweep was gated). Read-only consumers may run with read-only
credentials; silent open-time mutations violate the contract.
Doc cleanups:
- Removed stale "Phase 4 placeholder" comment from
recover_manifest_drift.
- docs/runs.md decision-tree wording now correctly surfaces the
InvariantViolation abort path.
- docs/branches-commits.md clarifies actor_id is in
_graph_commit_actors.lance (joined by graph_commit_id), not on
_graph_commits.lance itself.
Test surface (post-fixes):
- 25 unit tests in db::manifest::recovery (+4 from this commit).
- 10 integration tests in tests/recovery.rs (+3 from this commit).
- ~672 tests across ~25 binaries pass with --features failpoints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update the doc surface to reflect MR-847 having shipped end to end —
sidecar protocol, classifier, all-or-nothing decision tree, roll-forward
via ManifestBatchPublisher, roll-back via Dataset::restore with
fragment-set short-circuit, audit trail in
_graph_commit_recoveries.lance, OpenMode::{ReadWrite, ReadOnly}, and
the four migrated writers all carrying sidecars across Phase B → Phase C.
- docs/invariants.md §VI.23: change from "upheld at the writer-trait
surface for inserts/updates/etc., per-table commit_staged → manifest
publish window remains" to "upheld at the writer-trait surface AND
across process boundaries". The MR-847 sweep closes the residual on
the next Omnigraph::open. The "continuous in-process" property
(no ExpectedVersionMismatch surfacing to subsequent writers between
Phase B failure and process restart) is honest follow-up at MR-856.
- docs/runs.md: replace "Finalize → publisher residual" section with
"Open-time recovery sweep (MR-847)" — describes the sidecar protocol
lifecycle (Phases A-D), the sweep's classifier + decision dispatch,
the audit trail, and the operator-facing query
(omnigraph commit list --filter actor=omnigraph:recovery).
- AGENTS.md capability matrix "Atomic single-dataset commits" row:
drop the "Layer (3) is not yet shipped — tracked in MR-847" caveat;
describe the three layers as all shipping; reference MR-856 for the
background-reconciler follow-up.
- docs/storage.md: add _graph_commit_recoveries.lance and
__recovery/{ulid}.json to the on-disk layout (mermaid + prose).
- docs/branches-commits.md: new "Recovery audit trail (MR-847)"
subsection describing the join from
_graph_commits.lance:actor_id="omnigraph:recovery" to
_graph_commit_recoveries.lance:graph_commit_id for operator
post-mortem.
- docs/maintenance.md: note the MR-847 recovery floor on cleanup —
--keep < 3 may garbage-collect Lance versions the recovery sweep
needs as a rollback target. Default --keep 10 is safe.
- docs/testing.md: add tests/recovery.rs to the engine integration-test
table; expand the failpoints.rs row to mention the four MR-847
per-writer Phase B → recovery integration tests.
- .context/mr-847-design.md: prepend a "Status: DONE" stanza listing
every commit hash + scope across phases 1-10.
AGENTS.md ↔ docs/ cross-link check passes (26 links, 26 docs).
Full workspace test sweep passes with --features failpoints (361 tests
across 20 binaries).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the three paired per-writer tests required by MR-847's acceptance
criteria — "All four migrated writers ... have paired Phase B → recovery
integration tests."
Production additions (~10 LOC):
- New failpoint `branch_merge.post_phase_b_pre_manifest_commit` in
`exec/merge.rs::branch_merge_on_current_target` between the per-table
publish loop and `commit_manifest_updates`.
- New failpoint `ensure_indices.post_phase_b_pre_manifest_commit` in
`db/omnigraph/table_ops.rs::ensure_indices_for_branch` between the
per-table loops and `commit_prepared_updates_on_branch`.
- For schema_apply, the existing `schema_apply.after_staging_write`
failpoint already fires in the right window (after the per-table
rewrites + index builds, before the manifest publish).
Sidecar tweak:
- `schema_apply` sidecar's `branch` is now `None` (was
`Some("__schema_apply_lock__")`). The lock branch is purely a
serialization sentinel; `coordinator.commit_changes_with_actor`
publishes against the coordinator's pre-lock branch (main). After
the failpoint fires, `release_schema_apply_lock` removes the lock
branch — if the sidecar referenced it, the recovery sweep would try
to publish to a branch that no longer exists and fail. Fix: record
the actual publish target.
Tests added in `tests/failpoints.rs` (~280 LOC):
- `schema_apply_phase_b_failure_recovered_on_next_open` — seeds a row,
opens, attempts a schema apply that adds a new node type + a new
property (the new type ensures the table set differs so
`recover_schema_state_files` doesn't trip on property-only
ambiguity), failpoint fires, drops engine, reopens, asserts sidecar
deleted + audit row recorded.
- `branch_merge_phase_b_failure_recovered_on_next_open` — seeds main,
branches off, mutates the branch, attempts merge with the
`branch_merge.post_phase_b_pre_manifest_commit` failpoint active.
Same recovery shape.
- `ensure_indices_phase_b_failure_recovered_on_next_open` — seeds
rows, attempts ensure_indices with the
`ensure_indices.post_phase_b_pre_manifest_commit` failpoint active.
After this commit, all four migrated writers have paired
Phase B → recovery tests:
- mutate_as / load: `recovery_rolls_forward_after_finalize_publisher_failure` (Phase 5)
- schema_apply: `schema_apply_phase_b_failure_recovered_on_next_open`
- branch_merge: `branch_merge_phase_b_failure_recovered_on_next_open`
- ensure_indices: `ensure_indices_phase_b_failure_recovered_on_next_open`
11 failpoint tests pass; full workspace lib + integration tests pass
(350+ tests across 20 binaries).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three writers each follow the same shape established in Phase 5: build
SidecarTablePin list before the per-table commit_staged loop, write the
sidecar via recovery::write_sidecar, do the existing work, delete the
sidecar after the manifest publish succeeds.
Loose-match classifier (recovery.rs):
The classifier now distinguishes strict vs. loose match per
SidecarKind. Strict (Mutation, Load, BranchMerge): exactly one
commit_staged per table; lance_head == manifest_pinned + 1 AND
post_commit_pin == lance_head required. Loose (SchemaApply,
EnsureIndices): the writer may run N >= 1 commit_staged calls per
table — index builds + rewrites compound, and the exact N is hard to
compute at sidecar-write time. Loose accepts any
lance_head > manifest_pinned (with expected_version still matching the
manifest pin) as RolledPastExpected. The risk it admits — an external
agent advancing HEAD between sidecar write and recovery — is out of
scope for the single-coordinator model (MR-668 territory).
roll_forward_all now reads the CURRENT Lance HEAD per table (not the
sidecar's post_commit_pin) so the manifest publish reflects whatever
HEAD landed, even if the loose-match writer committed multiple times
per table.
Per-writer wiring:
- schema_apply::apply_schema_with_lock: sidecar covers
rewritten_tables ∪ indexed_tables (the tables that go through
stage_overwrite/stage_create_index commit_staged). Skips
added_tables (fresh datasets, no Phase B residual class) and
renamed_tables (handled by the existing schema-state staging
recovery in recover_schema_state_files).
- branch_merge::branch_merge_on_current_target: sidecar covers every
table in candidates (publish_adopted_source_state +
publish_rewritten_merge_table do the per-table commit_staged work).
Sidecar writes after validate_merge_candidates and deletes after
commit_manifest_updates.
- ensure_indices_for_branch: sidecar covers every node + edge type in
the catalog with a manifest entry (build_indices_on_dataset is
per-table-per-index commit_staged). Skips when the catalog has
nothing — steady-state calls incur no sidecar I/O when the manifest
already pins all expected types.
Allow recovery_audit.rs in forbidden_apis.rs:
The new db/recovery_audit.rs uses Dataset::write to bootstrap the
_graph_commit_recoveries.lance dataset (same pattern as
commit_graph.rs which is already allow-listed). Add it to the
ALLOW_LIST_FILES list in tests/forbidden_apis.rs.
8 new unit tests in db::manifest::recovery cover the loose-match
classifier branches (SchemaApply + EnsureIndices accept multi-commit
drift, NoMovement and InvariantViolation behave the same as strict).
All 20 test binaries pass (350+ tests across the workspace).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Production wiring (~120 LOC):
- `MutationStaging::finalize` now takes a `SidecarKind` parameter and
returns an additional `Option<RecoverySidecarHandle>`. Builds a
Vec<SidecarTablePin> from `pending` BEFORE the per-table commit_staged
loop and writes the sidecar via `recovery::write_sidecar`. Skips the
sidecar when `pending` is empty (delete-only mutation; D₂ keeps these
out of the staged-write path so the option is just a clean signal,
not a code path users hit).
- `exec/mutation.rs::execute_mutation_as` (around line 740): destructure
the new third element, pass `SidecarKind::Mutation`, delete the
sidecar after `commit_updates_on_branch_with_expected` succeeds.
- `loader/mod.rs::ingest_loaded` (around line 540): same shape, with
`SidecarKind::Load`. The Overwrite path stays inline-commit (legacy
residual; out of MR-847 scope per docs/runs.md).
- New engine accessors `Omnigraph::storage_adapter()` and
`Omnigraph::root_uri()` for the sidecar I/O. The pre-existing
`db.storage` field stays private; no other engine code reaches around
the accessor.
- Re-exports from `db::manifest`: `new_sidecar`, `write_sidecar`,
`delete_sidecar`, plus the `RecoverySidecar*` types and `SidecarKind`,
so consumers in `exec/` can use them via `crate::db::manifest::...`.
Bugfix folded in (~5 LOC): make `coordinator` mutable in
`Omnigraph::open_with_storage_and_mode` and call `coordinator.refresh()`
after the recovery sweep returns. Roll-forward advances the manifest
pin on disk; without the refresh the returned engine carried a stale
in-memory snapshot. The Phase 4 tests passed only because they
opened Lance datasets directly rather than going through `db.snapshot()`.
Storage adapter (~15 LOC): `LocalStorageAdapter::write_text` now ensures
the parent directory exists via `tokio::fs::create_dir_all`. Required
because the sidecar protocol writes into `__recovery/` which doesn't
pre-exist after `Omnigraph::init`. S3 has no equivalent; PutObject is
path-agnostic.
Headline test flip (~150 LOC):
- `tests/failpoints.rs::finalize_publisher_residual_drifts_lance_head_until_next_writer_recovers`
is replaced by `recovery_rolls_forward_after_finalize_publisher_failure`.
Same setup (failpoint at `mutation.post_finalize_pre_publisher`) but
after the synthetic failure the test:
1. Asserts the sidecar persists in `__recovery/` for the recovery
sweep to find.
2. Drops the engine handle.
3. Reopens via `Omnigraph::open` — recovery sweep classifies
RolledPastExpected, decides RollForward, publishes the manifest
update, records the audit row, deletes the sidecar.
4. Asserts the sidecar is gone.
5. Asserts the originally-attempted Eve insert is now visible
(Person count = 1).
6. Asserts a subsequent insert succeeds without
ExpectedVersionMismatch (Person count = 2).
7. Asserts the audit dataset `_graph_commit_recoveries.lance` exists.
This is the headline contract the MR-847 acceptance criteria require.
All other failpoint and runs tests continue to pass (8 + 24 unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement the remaining half of the open-time recovery sweep.
Roll-forward execution (db/manifest/recovery.rs::roll_forward_all):
constructs a GraphNamespacePublisher directly (recovery runs inside
Omnigraph::open before the engine struct exists, so we can't go through
Omnigraph::commit_updates_on_branch_with_expected). Builds a
ManifestChange::Update per sidecar table reading row_count and
TableVersionMetadata from the dataset at post_commit_pin (cheap;
manifest-level reads, not a row scan), then calls publisher.publish with
expected_table_versions = sidecar.expected_version per table. Single
__manifest CAS extends every pin atomically — all-or-nothing at the
substrate. Persistent CAS contention surfaces as the typed
ExpectedVersionMismatch error and leaves the sidecar in place for the
next open's retry.
Audit model (new crates/omnigraph/src/db/recovery_audit.rs +
record_audit() in recovery.rs): each successful recovery sweep records
a graph-commit row tagged with actor_id="omnigraph:recovery" plus a
row in a new sibling table _graph_commit_recoveries.lance carrying
recovery_kind (RolledForward | RolledBack), recovery_for_actor (the
sidecar's original actor_id), operation_id (sidecar ULID),
sidecar_writer_kind, per_table_outcomes (JSON-serialized for schema
flexibility), and created_at. Operators investigating "did my mutation
land?" can find the answer via `omnigraph commit list --filter
actor=omnigraph:recovery` joined to the recoveries table by
graph_commit_id.
The sibling-table choice avoids bumping INTERNAL_MANIFEST_SCHEMA_VERSION
or migrating _graph_commits.lance. Same not-atomic-pair-write shape as
the existing _graph_commits + _graph_commit_actors split — a crash
between the two sequential writes leaves an orphan commit row with no
recovery row. Recovery sweep tolerates this: re-entry classifies
already-restored / already-published tables as NoMovement, the action
is a no-op, and the audit append is retried.
Note on classifier: process_sidecar's RollBack arm now restores
RolledPastExpected, UnexpectedAtP1, AND UnexpectedMultistep (any drift
class). Earlier Phase 3 logic restricted to RolledPastExpected only,
which left UnexpectedAtP1/UnexpectedMultistep tables drifted; the
all-or-nothing decision rule per docs/invariants.md §VI.23 demands all
drifted tables be restored.
3 new integration tests in tests/recovery.rs (7 total now):
- recovery_rolls_forward_after_phase_b_completes — happy-path
roll-forward; audit row recorded; idempotent on second open.
- recovery_rolls_back_records_audit_row_with_recovery_actor —
roll-back path also records an audit row with the original actor.
- recovery_rolls_forward_with_null_actor — sidecar without actor_id
still records the audit row (recovery_for_actor = None).
3 new unit tests in db::recovery_audit pin the round-trip + persistence
+ recovery_kind string parsing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `OpenMode::{ReadWrite, ReadOnly}` and route `Omnigraph::open` through
`open_with_storage_and_mode`. Recovery sweep runs only under
`OpenMode::ReadWrite` — read-only consumers (NDJSON export, commit list,
schema show) skip it via `Omnigraph::open_read_only`. Rationale: the
sweep performs Lance writes (Dataset::restore, manifest publish); a
read-only consumer with read-only object-store credentials shouldn't
trigger writes, and reads always resolve through the manifest pin
regardless of any drift on the per-table side.
`recover_manifest_drift` lands in db/manifest/recovery.rs and is wired
into Omnigraph::open AFTER recover_schema_state_files — schema-state
recovery operates on staging files; manifest-drift recovery operates on
Lance HEADs that may depend on schema-state being settled.
Roll-back path is fully implemented: classify each table per the
sidecar's intent, dispatch the all-or-nothing decision, and call
restore_table_to_version for any table with drift (RolledPastExpected,
UnexpectedAtP1, or UnexpectedMultistep). NoMovement tables are already
at expected_version — no action. Sidecar deleted as the final step.
Roll-forward path errors with a Phase-4 placeholder so it surfaces
loudly if reached without the audit + manifest-publish wiring landing
first.
Concurrency: today (pre-MR-686) recovery is naturally serialized by the
single-coordinator model. Open runs at server startup BEFORE
Arc<RwLock<Omnigraph>> wraps the engine (lib.rs:194), so no request
handlers can race. CLI is sequential by caller orchestration. Under
MR-686's per-(table_key, branch) queues + MR-856 (background recovery
reconciler), the queue acquisition will need to extend to recovery
sweeps — handoff documented on MR-686 ticket and in MR-856.
4 integration tests in tests/recovery.rs pin the Phase 3 contract:
- recovery_does_not_run_on_clean_open — no sidecars; sweep is a no-op.
- recovery_refuses_unknown_schema_version_on_open — sidecar v=99
surfaces SidecarSchemaError and is left on disk for operator review.
- read_only_open_skips_recovery_sweep — even a sidecar with bogus
table_path doesn't get classified under OpenMode::ReadOnly.
- recovery_rolls_back_synthetic_drift_on_open — sidecar with mismatched
post_commit_pin classifies as UnexpectedAtP1, decision is RollBack,
restore is invoked, sidecar is deleted, idempotent on second open.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new tests in tests/staged_writes.rs that the recovery sweep design
depends on:
- lance_restore_appends_one_commit_with_checked_out_content — verifies
Dataset::restore() (no-args; restores currently-checked-out version)
produces HEAD+1, not HEAD+2 as the v1 design assumed. Source confirmed
at lance-4.0.0/src/dataset.rs:1106; this test prevents a future lance
bump from silently breaking the recovery rollback math.
- lance_restore_loses_to_concurrent_append_via_orphaning — pins the
concurrency hazard motivating MR-847's open-time-only invocation
strategy: check_restore_txn (lance-4.0.0/src/io/commit/conflict_
resolver.rs:986) returns Ok against Append/Update/Delete/CreateIndex/
Merge/etc., so a Restore commits successfully even when a concurrent
legitimate writer just landed an Append — silently orphaning the
Append's data from the active timeline. MR-847 sidesteps via running
recovery only at Omnigraph::open (before any other writers race);
MR-856 (continuous-recovery reconciler) must guard via per-(table,
branch) queue acquisition once MR-686 lands.
These two tests together pin the foundation for MR-847's correctness
claims and document the load-bearing constraint MR-856 will inherit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes MR-793 acceptance §1 via option (b): every inline-commit method
remaining on the trait surface is named, the upstream blocker or
internal phase that closes it is cited, and the call-site residual
comment is mandated.
Reframes the criterion text in the MR-793 ticket comment from "either
full sealing OR all residuals enumerated" — this commit ships the
"enumerated" path. The "full sealing" path (Phase 1b + Phase 9 + the
two Lance upstream tickets) closes the matrix entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fetched https://lance.org/format/table/mem_wal/ in full via npx mdrip.
The "Overview / Details / Implementation" sidebar items turned out to
be anchor sections on the same URL, not separate pages.
Key findings (relevant to MR-847's recovery reconciler design):
* MemWAL is opt-in. Requires (1) unenforced primary key in schema,
(2) explicit shard config, (3) writers using the LSM-tree write
path. omnigraph does NOT enable it; we use direct write_fragments +
commit(Operation::Append).
* MemWAL is intra-table — addresses streaming-write throughput for
one Lance base table via MemTables → flushed MemTables → async
merge. It does not coordinate across multiple tables.
* MemWAL's recovery is intra-table: WAL replay reconstructs MemTable
state for one table. It does NOT help with omnigraph's cross-table
manifest-pinned-vs-Lance-HEAD drift class.
Conclusion: MR-847's recovery reconciler design is unaffected. The
two operate at different abstraction layers.
Borrowable: MemWAL's epoch-based fencing pattern is structurally
similar to a future multi-coordinator sidecar protocol; noted on
MR-847 for if MR-668 (multi-process) ever lands.
External reviewer flagged that the capability matrix's "Atomic
multi-dataset publish" cell implied Lance gives us a single primitive
for cross-table atomicity. It doesn't. The real contract is three
layers stacked:
(1) per-table Lance `commit_staged` for the data write
(2) `__manifest` row-level CAS via `ManifestBatchPublisher` for
cross-table ordering
(3) recovery-on-open reconciler for the residual gap between (1)
and (2) — NOT YET SHIPPED, tracked in MR-847.
Until MR-847 lands, a failure between per-table `commit_staged` and
the manifest publish leaves drift on the partially-committed tables
(the "Phase B → Phase C residual" documented in `docs/runs.md`).
Also enumerate the legacy inline-commit residuals (`append_batch`,
`merge_insert_batches`, `overwrite_batch`, `create_*_index`) alongside
`delete_where` and `create_vector_index` — they remain on the trait
pending Phase 1b call-site conversion + Phase 9 demotion.
End the row with an explicit DO NOT: future agents reading the
capability matrix should not describe atomicity as "fully upheld"
until MR-847 ships.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>