Commit graph

87 commits

Author SHA1 Message Date
Ragnor Comerford
72d3da66de
recovery: per-writer Phase B failure → recovery integration tests (Phase 9)
Add the three paired per-writer tests required by MR-847's acceptance
criteria — "All four migrated writers ... have paired Phase B → recovery
integration tests."

Production additions (~10 LOC):
- New failpoint `branch_merge.post_phase_b_pre_manifest_commit` in
  `exec/merge.rs::branch_merge_on_current_target` between the per-table
  publish loop and `commit_manifest_updates`.
- New failpoint `ensure_indices.post_phase_b_pre_manifest_commit` in
  `db/omnigraph/table_ops.rs::ensure_indices_for_branch` between the
  per-table loops and `commit_prepared_updates_on_branch`.
- For schema_apply, the existing `schema_apply.after_staging_write`
  failpoint already fires in the right window (after the per-table
  rewrites + index builds, before the manifest publish).

Sidecar tweak:
- `schema_apply` sidecar's `branch` is now `None` (was
  `Some("__schema_apply_lock__")`). The lock branch is purely a
  serialization sentinel; `coordinator.commit_changes_with_actor`
  publishes against the coordinator's pre-lock branch (main). After
  the failpoint fires, `release_schema_apply_lock` removes the lock
  branch — if the sidecar referenced it, the recovery sweep would try
  to publish to a branch that no longer exists and fail. Fix: record
  the actual publish target.

Tests added in `tests/failpoints.rs` (~280 LOC):
- `schema_apply_phase_b_failure_recovered_on_next_open` — seeds a row,
  opens, attempts a schema apply that adds a new node type + a new
  property (the new type ensures the table set differs so
  `recover_schema_state_files` doesn't trip on property-only
  ambiguity), failpoint fires, drops engine, reopens, asserts sidecar
  deleted + audit row recorded.
- `branch_merge_phase_b_failure_recovered_on_next_open` — seeds main,
  branches off, mutates the branch, attempts merge with the
  `branch_merge.post_phase_b_pre_manifest_commit` failpoint active.
  Same recovery shape.
- `ensure_indices_phase_b_failure_recovered_on_next_open` — seeds
  rows, attempts ensure_indices with the
  `ensure_indices.post_phase_b_pre_manifest_commit` failpoint active.

After this commit, all four migrated writers have paired
Phase B → recovery tests:
- mutate_as / load: `recovery_rolls_forward_after_finalize_publisher_failure` (Phase 5)
- schema_apply: `schema_apply_phase_b_failure_recovered_on_next_open`
- branch_merge: `branch_merge_phase_b_failure_recovered_on_next_open`
- ensure_indices: `ensure_indices_phase_b_failure_recovered_on_next_open`

11 failpoint tests pass; full workspace lib + integration tests pass
(350+ tests across 20 binaries).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:24 +02:00
Ragnor Comerford
c6827919ca
recovery: wire sidecar into schema_apply, branch_merge, ensure_indices (Phases 6-8)
Three writers each follow the same shape established in Phase 5: build
SidecarTablePin list before the per-table commit_staged loop, write the
sidecar via recovery::write_sidecar, do the existing work, delete the
sidecar after the manifest publish succeeds.

Loose-match classifier (recovery.rs):

The classifier now distinguishes strict vs. loose match per
SidecarKind. Strict (Mutation, Load, BranchMerge): exactly one
commit_staged per table; lance_head == manifest_pinned + 1 AND
post_commit_pin == lance_head required. Loose (SchemaApply,
EnsureIndices): the writer may run N >= 1 commit_staged calls per
table — index builds + rewrites compound, and the exact N is hard to
compute at sidecar-write time. Loose accepts any
lance_head > manifest_pinned (with expected_version still matching the
manifest pin) as RolledPastExpected. The risk it admits — an external
agent advancing HEAD between sidecar write and recovery — is out of
scope for the single-coordinator model (MR-668 territory).

roll_forward_all now reads the CURRENT Lance HEAD per table (not the
sidecar's post_commit_pin) so the manifest publish reflects whatever
HEAD landed, even if the loose-match writer committed multiple times
per table.

Per-writer wiring:

- schema_apply::apply_schema_with_lock: sidecar covers
  rewritten_tables ∪ indexed_tables (the tables that go through
  stage_overwrite/stage_create_index commit_staged). Skips
  added_tables (fresh datasets, no Phase B residual class) and
  renamed_tables (handled by the existing schema-state staging
  recovery in recover_schema_state_files).
- branch_merge::branch_merge_on_current_target: sidecar covers every
  table in candidates (publish_adopted_source_state +
  publish_rewritten_merge_table do the per-table commit_staged work).
  Sidecar writes after validate_merge_candidates and deletes after
  commit_manifest_updates.
- ensure_indices_for_branch: sidecar covers every node + edge type in
  the catalog with a manifest entry (build_indices_on_dataset is
  per-table-per-index commit_staged). Skips when the catalog has
  nothing — steady-state calls incur no sidecar I/O when the manifest
  already pins all expected types.

Allow recovery_audit.rs in forbidden_apis.rs:

The new db/recovery_audit.rs uses Dataset::write to bootstrap the
_graph_commit_recoveries.lance dataset (same pattern as
commit_graph.rs which is already allow-listed). Add it to the
ALLOW_LIST_FILES list in tests/forbidden_apis.rs.

8 new unit tests in db::manifest::recovery cover the loose-match
classifier branches (SchemaApply + EnsureIndices accept multi-commit
drift, NoMovement and InvariantViolation behave the same as strict).

All 20 test binaries pass (350+ tests across the workspace).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:24 +02:00
Ragnor Comerford
49ca7e5068
recovery: wire sidecar into MutationStaging::finalize + flip headline test (Phase 5)
Production wiring (~120 LOC):

- `MutationStaging::finalize` now takes a `SidecarKind` parameter and
  returns an additional `Option<RecoverySidecarHandle>`. Builds a
  Vec<SidecarTablePin> from `pending` BEFORE the per-table commit_staged
  loop and writes the sidecar via `recovery::write_sidecar`. Skips the
  sidecar when `pending` is empty (delete-only mutation; D₂ keeps these
  out of the staged-write path so the option is just a clean signal,
  not a code path users hit).
- `exec/mutation.rs::execute_mutation_as` (around line 740): destructure
  the new third element, pass `SidecarKind::Mutation`, delete the
  sidecar after `commit_updates_on_branch_with_expected` succeeds.
- `loader/mod.rs::ingest_loaded` (around line 540): same shape, with
  `SidecarKind::Load`. The Overwrite path stays inline-commit (legacy
  residual; out of MR-847 scope per docs/runs.md).
- New engine accessors `Omnigraph::storage_adapter()` and
  `Omnigraph::root_uri()` for the sidecar I/O. The pre-existing
  `db.storage` field stays private; no other engine code reaches around
  the accessor.
- Re-exports from `db::manifest`: `new_sidecar`, `write_sidecar`,
  `delete_sidecar`, plus the `RecoverySidecar*` types and `SidecarKind`,
  so consumers in `exec/` can use them via `crate::db::manifest::...`.

Bugfix folded in (~5 LOC): make `coordinator` mutable in
`Omnigraph::open_with_storage_and_mode` and call `coordinator.refresh()`
after the recovery sweep returns. Roll-forward advances the manifest
pin on disk; without the refresh the returned engine carried a stale
in-memory snapshot. The Phase 4 tests passed only because they
opened Lance datasets directly rather than going through `db.snapshot()`.

Storage adapter (~15 LOC): `LocalStorageAdapter::write_text` now ensures
the parent directory exists via `tokio::fs::create_dir_all`. Required
because the sidecar protocol writes into `__recovery/` which doesn't
pre-exist after `Omnigraph::init`. S3 has no equivalent; PutObject is
path-agnostic.

Headline test flip (~150 LOC):

- `tests/failpoints.rs::finalize_publisher_residual_drifts_lance_head_until_next_writer_recovers`
  is replaced by `recovery_rolls_forward_after_finalize_publisher_failure`.
  Same setup (failpoint at `mutation.post_finalize_pre_publisher`) but
  after the synthetic failure the test:
  1. Asserts the sidecar persists in `__recovery/` for the recovery
     sweep to find.
  2. Drops the engine handle.
  3. Reopens via `Omnigraph::open` — recovery sweep classifies
     RolledPastExpected, decides RollForward, publishes the manifest
     update, records the audit row, deletes the sidecar.
  4. Asserts the sidecar is gone.
  5. Asserts the originally-attempted Eve insert is now visible
     (Person count = 1).
  6. Asserts a subsequent insert succeeds without
     ExpectedVersionMismatch (Person count = 2).
  7. Asserts the audit dataset `_graph_commit_recoveries.lance` exists.
  This is the headline contract the MR-847 acceptance criteria require.

All other failpoint and runs tests continue to pass (8 + 24 unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:24 +02:00
Ragnor Comerford
ca21e73d43
recovery: roll-forward execution + audit row (Phase 4)
Implement the remaining half of the open-time recovery sweep.

Roll-forward execution (db/manifest/recovery.rs::roll_forward_all):
constructs a GraphNamespacePublisher directly (recovery runs inside
Omnigraph::open before the engine struct exists, so we can't go through
Omnigraph::commit_updates_on_branch_with_expected). Builds a
ManifestChange::Update per sidecar table reading row_count and
TableVersionMetadata from the dataset at post_commit_pin (cheap;
manifest-level reads, not a row scan), then calls publisher.publish with
expected_table_versions = sidecar.expected_version per table. Single
__manifest CAS extends every pin atomically — all-or-nothing at the
substrate. Persistent CAS contention surfaces as the typed
ExpectedVersionMismatch error and leaves the sidecar in place for the
next open's retry.

Audit model (new crates/omnigraph/src/db/recovery_audit.rs +
record_audit() in recovery.rs): each successful recovery sweep records
a graph-commit row tagged with actor_id="omnigraph:recovery" plus a
row in a new sibling table _graph_commit_recoveries.lance carrying
recovery_kind (RolledForward | RolledBack), recovery_for_actor (the
sidecar's original actor_id), operation_id (sidecar ULID),
sidecar_writer_kind, per_table_outcomes (JSON-serialized for schema
flexibility), and created_at. Operators investigating "did my mutation
land?" can find the answer via `omnigraph commit list --filter
actor=omnigraph:recovery` joined to the recoveries table by
graph_commit_id.

The sibling-table choice avoids bumping INTERNAL_MANIFEST_SCHEMA_VERSION
or migrating _graph_commits.lance. Same not-atomic-pair-write shape as
the existing _graph_commits + _graph_commit_actors split — a crash
between the two sequential writes leaves an orphan commit row with no
recovery row. Recovery sweep tolerates this: re-entry classifies
already-restored / already-published tables as NoMovement, the action
is a no-op, and the audit append is retried.

Note on classifier: process_sidecar's RollBack arm now restores
RolledPastExpected, UnexpectedAtP1, AND UnexpectedMultistep (any drift
class). Earlier Phase 3 logic restricted to RolledPastExpected only,
which left UnexpectedAtP1/UnexpectedMultistep tables drifted; the
all-or-nothing decision rule per docs/invariants.md §VI.23 demands all
drifted tables be restored.

3 new integration tests in tests/recovery.rs (7 total now):
- recovery_rolls_forward_after_phase_b_completes — happy-path
  roll-forward; audit row recorded; idempotent on second open.
- recovery_rolls_back_records_audit_row_with_recovery_actor —
  roll-back path also records an audit row with the original actor.
- recovery_rolls_forward_with_null_actor — sidecar without actor_id
  still records the audit row (recovery_for_actor = None).

3 new unit tests in db::recovery_audit pin the round-trip + persistence
+ recovery_kind string parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:23 +02:00
Ragnor Comerford
c2fc3e7c40
recovery: wire open-time sweep + OpenMode (Phase 3)
Add `OpenMode::{ReadWrite, ReadOnly}` and route `Omnigraph::open` through
`open_with_storage_and_mode`. Recovery sweep runs only under
`OpenMode::ReadWrite` — read-only consumers (NDJSON export, commit list,
schema show) skip it via `Omnigraph::open_read_only`. Rationale: the
sweep performs Lance writes (Dataset::restore, manifest publish); a
read-only consumer with read-only object-store credentials shouldn't
trigger writes, and reads always resolve through the manifest pin
regardless of any drift on the per-table side.

`recover_manifest_drift` lands in db/manifest/recovery.rs and is wired
into Omnigraph::open AFTER recover_schema_state_files — schema-state
recovery operates on staging files; manifest-drift recovery operates on
Lance HEADs that may depend on schema-state being settled.

Roll-back path is fully implemented: classify each table per the
sidecar's intent, dispatch the all-or-nothing decision, and call
restore_table_to_version for any table with drift (RolledPastExpected,
UnexpectedAtP1, or UnexpectedMultistep). NoMovement tables are already
at expected_version — no action. Sidecar deleted as the final step.

Roll-forward path errors with a Phase-4 placeholder so it surfaces
loudly if reached without the audit + manifest-publish wiring landing
first.

Concurrency: today (pre-MR-686) recovery is naturally serialized by the
single-coordinator model. Open runs at server startup BEFORE
Arc<RwLock<Omnigraph>> wraps the engine (lib.rs:194), so no request
handlers can race. CLI is sequential by caller orchestration. Under
MR-686's per-(table_key, branch) queues + MR-856 (background recovery
reconciler), the queue acquisition will need to extend to recovery
sweeps — handoff documented on MR-686 ticket and in MR-856.

4 integration tests in tests/recovery.rs pin the Phase 3 contract:
- recovery_does_not_run_on_clean_open — no sidecars; sweep is a no-op.
- recovery_refuses_unknown_schema_version_on_open — sidecar v=99
  surfaces SidecarSchemaError and is left on disk for operator review.
- read_only_open_skips_recovery_sweep — even a sidecar with bogus
  table_path doesn't get classified under OpenMode::ReadOnly.
- recovery_rolls_back_synthetic_drift_on_open — sidecar with mismatched
  post_commit_pin classifies as UnexpectedAtP1, decision is RollBack,
  restore is invoked, sidecar is deleted, idempotent on second open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:23 +02:00
Ragnor Comerford
376d91d538
recovery: scaffold sidecar protocol + classifier + decision tree (Phase 2)
Add db/manifest/recovery.rs with the primitives the open-time recovery
sweep will invoke. No integration into Omnigraph::open or any writer
path yet — those land in Phase 3+.

Sidecar protocol:
- RecoverySidecar JSON shape (schema_version=1; SidecarSchemaError
  refuses unknown versions — old binaries don't guess at newer shapes).
- SidecarKind {Mutation, Load, SchemaApply, BranchMerge, EnsureIndices}
  for audit attribution.
- SidecarTablePin {table_key, table_path, expected_version,
  post_commit_pin}.
- write_sidecar / delete_sidecar / list_sidecars / parse_sidecar.

Classifier + decision dispatcher (all-or-nothing per sidecar):
- TableClassification {NoMovement, RolledPastExpected, UnexpectedAtP1,
  UnexpectedMultistep, InvariantViolation}.
- classify_table(pin, lance_head, manifest_pinned).
- decide(&[TableClassification]) -> SidecarDecision {RollForward,
  RollBack, Abort}. Mid-Phase-B crash with mixed states rolls BACK
  (not forward) — atomicity per docs/invariants.md §VI.23.

Restore primitive:
- restore_table_to_version(table_path, expected_version): open,
  checkout(expected_version), restore. Includes a fragment-set
  equality short-circuit so repeated mid-rollback crashes don't pile
  up Lance versions (Lance fragments are immutable; equal fragment-ids
  ⇒ equal content).

StorageAdapter trait extension:
- Added list_dir(dir_uri) -> Vec<String> for sidecar enumeration.
  LocalStorageAdapter uses tokio::fs::read_dir; S3StorageAdapter uses
  object_store::list with a prefix-collision guard
  (filters to require the directory '/' boundary so listing
  __recovery doesn't accidentally match __recovery_log/...).
  RecordingStorageAdapter (test wrapper) delegates to inner.

17 unit tests covering: classifier branches, decision branches
(including mid-Phase-B mix → RollBack and empty slice → RollForward),
JSON round-trip, schema-version refusal, restore HEAD+1, fragment-set
short-circuit no-op, list_sidecars empty/round-trip/non-JSON-skip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:23 +02:00
Ragnor Comerford
1e70028293
MR-847: pin Lance restore semantics empirically (Phase 1)
Two new tests in tests/staged_writes.rs that the recovery sweep design
depends on:

- lance_restore_appends_one_commit_with_checked_out_content — verifies
  Dataset::restore() (no-args; restores currently-checked-out version)
  produces HEAD+1, not HEAD+2 as the v1 design assumed. Source confirmed
  at lance-4.0.0/src/dataset.rs:1106; this test prevents a future lance
  bump from silently breaking the recovery rollback math.

- lance_restore_loses_to_concurrent_append_via_orphaning — pins the
  concurrency hazard motivating MR-847's open-time-only invocation
  strategy: check_restore_txn (lance-4.0.0/src/io/commit/conflict_
  resolver.rs:986) returns Ok against Append/Update/Delete/CreateIndex/
  Merge/etc., so a Restore commits successfully even when a concurrent
  legitimate writer just landed an Append — silently orphaning the
  Append's data from the active timeline. MR-847 sidesteps via running
  recovery only at Omnigraph::open (before any other writers race);
  MR-856 (continuous-recovery reconciler) must guard via per-(table,
  branch) queue acquisition once MR-686 lands.

These two tests together pin the foundation for MR-847's correctness
claims and document the load-bearing constraint MR-856 will inherit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:23 +02:00
Ragnor Comerford
8726ffe0a3
release: bump version to 0.4.1 2026-05-02 23:20:50 +02:00
Ragnor Comerford
bf7d716d1b
forbidden_apis: add .restore( and document why .append( / .delete( are excluded
Cubic flagged that the guard misses `ds.append() / ds.delete() / ds.restore()`.

`.restore(` is added — Lance-specific (no false positives in the
workspace).

`.append(` and `.delete(` stay excluded with a documenting comment:
* `.append(` over-matches `Vec::append`, `String::append`, every
  `arrow_array::xxxArrayBuilder::append` (30+ legit uses across
  `exec/mutation.rs`, `loader/jsonl.rs`, `exec/projection.rs`).
* `.delete(` over-matches `ObjectStore::delete` (used in `storage.rs`,
  `db/schema_state.rs`, `db/omnigraph.rs:1277` for staging-file
  cleanup) and would require many `// forbidden-api-allow:` sentinels
  for legitimate uses.

The remaining bypass route — engine code that imports `lance::Dataset`
and calls `ds.append(reader, params)` — is bounded by:
1. The trait surface itself (sealed, only-callable-via-trait once
   Phase 1b call-site conversion completes).
2. The PR-review process catching new `lance::Dataset` imports in
   non-storage files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:25:52 +02:00
Ragnor Comerford
9b0920b5da
address PR #70 bot review (Cubic + Cursor): 7 inline + failpoint test + invariants notes
Cubic findings:
* `tests/forbidden_apis.rs`: expand `FORBIDDEN_PATTERNS` with `Dataset::write`
  / `Dataset::append` / `Dataset::delete` / `Dataset::merge_insert` /
  `Dataset::add_columns` / `update_columns` / `drop_columns` /
  `truncate_table` / `restore` and the bare `.merge_insert(` /
  `.add_columns(` / `.update_columns(` / `.drop_columns(` /
  `.truncate_table(` method patterns. Deliberately avoid `.append(` /
  `.delete(` / `.write(` (over-match `Vec::append`, `.delete_branch(`,
  arrow-array `.append(`, etc.). Allow-list `commit_graph.rs` and
  `graph_coordinator.rs` — they're manifest-layer infra that legitimately
  uses `Dataset::write` for system tables.
* `schema_apply.rs:253`: pass `entry.table_branch.as_deref()` (not
  `None`) to `open_dataset_head_for_write` for consistency with the
  sibling `indexed_tables` block. Schema apply rejects non-main
  branches at the lock-acquire step today, so behavior is unchanged;
  this is a defensive consistency fix that survives a future relaxation
  of the lock check.
* `storage_layer.rs:131` doc: was `Vec<&StagedWrite>` with lifetime
  claim; actually returns `Vec<StagedWrite>` (cloned). Fixed.
* `AGENTS.md:201` capability matrix row + `storage_layer.rs:1` module
  doc: softened the "stage_* + commit_staged are the only paths" /
  "trait funnels every write" overclaim. Inline-commit residuals
  (`delete_where`, `create_vector_index`) remain on the trait pending
  upstream Lance work (#6658, #6666); legacy `append_batch` etc.
  remain pending Phase 1b / Phase 9. Module doc now describes the
  current transitional state honestly.

Cursor Bugbot findings:
* `storage_layer.rs:360`: trait `delete_where` consumed `SnapshotHandle`
  but returned only `DeleteState`, dropping the post-delete dataset.
  Future callers migrating from the inherent `&mut Dataset` API would
  lose the post-delete dataset state needed for indexing /
  `table_state` queries. Fixed: returns `(SnapshotHandle, DeleteState)`
  matching `append_batch` / `overwrite_batch` shape.
* `storage_layer.rs:824`: removed dead `_scanner_type_marker` fn and
  the unused `Scanner` import (the marker existed only to suppress an
  unused-import warning — fixing the import is the cleaner answer).

Engine-level Phase A failpoint test (closes the partial-criterion
flagged in Cubic's acceptance-criteria checklist):
* `db/omnigraph/table_ops.rs::stage_and_commit_btree`: instrumented
  with `crate::failpoints::maybe_fail("ensure_indices.post_stage_pre_commit_btree")`
  between `stage_create_btree_index` and `commit_staged`.
* `tests/failpoints.rs::ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`:
  triggers the failpoint via a schema-apply that adds a new node type;
  proves that existing tables are unaffected (Person mutation succeeds
  after the failed apply) — i.e. Phase A failure leaves no Lance-HEAD
  drift on tables outside the failed `added_tables` iteration.

`docs/invariants.md` transitional notes:
* §VI.23 (atomicity per query): annotated as upheld at the
  writer-trait surface for inserts / updates / scalar-index builds /
  merge_insert / overwrite after MR-793 PR #70. Per-table
  commit_staged → manifest publish window remains; closing requires
  MR-847's recovery-on-open reconciler. `delete_where` and
  `create_vector_index` remain inline pending lance#6658 / #6666.
* §VII.35 (reconciler pattern): annotated as partial — staged
  primitives are the building blocks; the reconciler task itself is
  MR-848.
* §VIII.45 (reference impl per trait): `TableStorage` has its primary
  impl on `TableStore` with opaque-handle signatures; no test impl
  yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:47:07 +02:00
Ragnor Comerford
3135ff5d19
MR-793 phases 1-6: TableStorage trait + staged-write surface for engine writers
Hoists Lance's stage+commit two-phase write pattern from "discipline at
each writer" to a sealed trait surface (`TableStorage`). New engine code
that needs to advance Lance HEAD MUST go through `stage_*` + `commit_staged`;
the trait's opaque `SnapshotHandle` / `StagedHandle` types keep
`lance::Dataset` and `lance::Transaction` out of trait signatures.

Phases landed (see .context/mr-793-design.md for the full plan):
* 1a: `crates/omnigraph/src/storage_layer.rs` — `TableStorage` trait,
  sealed (only in-tree types can impl), single impl on `TableStore`
  delegating to existing inherent methods; `Omnigraph::storage()`
  accessor returns `&dyn TableStorage`.
* 2: three new staged primitives — `stage_overwrite`,
  `stage_create_btree_index`, `stage_create_inverted_index` —
  implementing the simple branch of Lance's `CreateIndexBuilder::execute`
  (scalar indices only; vector indices stay inline because
  `build_index_metadata_from_segments` is `pub(crate)` in lance-4.0.0).
  Six new tests in `tests/staged_writes.rs` pin both the new primitives
  and the inline residuals (`delete_where`, `create_vector_index`).
* 3: `tests/forbidden_apis.rs` — defense-in-depth integration test
  walks engine source, fails on direct lance::* inline-commit API use
  outside `table_store.rs` / `db/manifest/`. Skips comment lines and
  honors `// forbidden-api-allow:` sentinels.
* 4: `ensure_indices` migration — scalar index builds now route through
  `stage_create_*_index` + `commit_staged` instead of
  `create_*_index(&mut Dataset)`. Vector indices stay inline (residual,
  named honestly at the call site).
* 5: `branch_merge::publish_rewritten_merge_table` migration — the
  merge_insert phase now uses `stage_merge_insert` + `commit_staged`;
  delete phase stays inline (Lance #6658 residual, named honestly).
* 6: `schema_apply` rewritten_tables migration — non-empty rewrites
  use `stage_overwrite` + `commit_staged`; empty-batch rewrites stay
  inline because `InsertBuilder::execute_uncommitted` rejects empty
  data. The narrow inline window is bounded by `__schema_apply_lock__`.

Verified-green test surface:
* `cargo test -p omnigraph-engine` — 68 lib + ~120 integration tests
  (incl. 6 new staged_writes tests + the new forbidden_apis test).
* `cargo test -p omnigraph-engine --features failpoints --test failpoints`
  — 5 tests, all green.
* `cargo test --workspace` — green.

Deferred to follow-up sessions (see design doc §17 split):
* Phase 1b — convert remaining engine call sites to `&dyn TableStorage`
  (mostly READS that don't touch the staged-write invariant).
* Phase 7 — recovery-on-open reconciler (closes Phase B → Phase C
  residual across process restarts; new subsystem).
* Phase 8 — index-coverage reconciler (full §VII.35 compliance —
  removes synchronous index work from the publish path).
* Phase 9 — demote unused `TableStore` inherent methods to `pub(crate)`
  (depends on Phase 1b).

Lance upstream blockers documented:
* lance-format/lance#6658 — two-phase delete API (open, no PRs).
* Companion: `build_index_metadata_from_segments` should be `pub` so
  vector-index builds can be staged outside the lance crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:03:15 +02:00
Ragnor Comerford
044ed46019
chore: scrub Linear ticket numbers and review-bot mentions from code comments
OmniGraph is OSS; internal Linear ticket references and code-review-bot
mentions in source-code comments don't help external readers and leak
internal tooling. Replace ticket numbers (MR-XXX) with descriptive
prose, drop linear.app URLs, and remove inline mentions of
Cursor/Bugbot/Cubic/Codex review threads.

Scope is limited to source-code comments (`crates/`). Docs under
`docs/` keep their MR-XXX references — those are part of the
established change-history narrative for in-repo docs and don't
require a Linear account to find context for.

No behavior changes; no public API changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:45:38 +02:00
Ragnor Comerford
ea16c74329
MR-794 step 2: address Cursor Bugbot follow-ups on commits 3223b51 + 052b6e6
Four code/doc fixes from the latest Cursor Bugbot pass:

* **Misplaced doc comment in table_store.rs (Medium):** the doc block
  intended for `scan_pending_batches` was, after my earlier edit,
  attached to `collect_string_column_values` because the new helper
  was inserted between the original docblock and `scan_pending_batches`.
  Move the docblock back onto its function and add a note about the
  shared SQL-dialect contract with the Lance scanner (the predicate
  goes to both, which is fine for `predicate_to_sql`'s plain comparison
  shapes today; future Lance-specific scanner extensions in the filter
  would need translation).

* **Missing null check on committed `id` column (Low):** the
  committed-side loop in `collect_node_ids_with_pending` (and the
  parallel non-pending `collect_node_ids`) read `id_col.value(i)`
  without `is_valid(i)` first. `id` is the @key column on every node
  type and non-nullable by schema, so this is unreachable today, but
  the inconsistency with the pending-side `is_valid` guard is worth
  closing for symmetry / defense.

* **Misleading comment in count_pending_src_with_dedupe (Low):** the
  comment claimed "fall back to naive counting" but the code did
  `continue`. Fix: it's unreachable in practice (the pending-side
  schema always contains the key when the caller passes one), so
  failing loudly with a typed error if it ever does fire is correct
  — silently skipping the batch would let `@card` violations slip
  past validation.

* **PendingTable.schema mismatch surfaces too late (Medium):**
  PendingTable captures the schema from the first batch and never
  updates it. On a blob-bearing table, `insert` produces a full-schema
  batch and `update` (without assigning every blob) produces a
  subset-schema batch. Pre-fix the mismatch surfaced inside
  finalize/MemTable construction — distant from the offending op.
  Post-fix `MutationStaging::append_batch` validates the new batch's
  schema against the existing accumulator's schema and returns a
  typed error directing the caller to split the mutation. Error
  fires at the offending op, not at end-of-query. New helper
  `schemas_compatible` compares field name + data_type pairs;
  nullability and field metadata differences stay tolerated (downstream
  concat already permits those).

Cubic Cursor Bugbot finding #5 (cascade delete edge re-open) self-resolved
in the bot's own analysis ("logic appears sound on re-examination") —
no action.

New test on tests/runs.rs:

* append_batch_rejects_mismatched_schema_in_blob_table_at_offending_op
  — pins the early-error path. Builds a blob-bearing schema, runs an
  `insert + update` query where the update doesn't assign the blob,
  asserts the error fires at the second op with the "Split the
  mutation" message and the manifest is unchanged.

Local: tests/runs.rs 24/24 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:50:13 +02:00
Ragnor Comerford
052b6e680f
MR-794 step 2: address PR #68 follow-up review (Cubic) — pending dedupe + projection guard + CI
Three new findings from Cubic on commit 3223b51:

* **Pending edge cardinality counted within-input duplicates** (P2):
  count_src_per_edge's pending walk added every row to the count,
  including duplicate rows that finalize will collapse via
  dedupe_merge_batches_by_id. A LoadMode::Merge with the same edge id
  twice would over-count → spurious @card violation. Fix: when
  dedupe_key_column is Some, walk pending in reverse, track seen keys
  via HashSet, count only the kept (last-occurrence) rows. Mirrors
  finalize-time dedupe so cardinality counts what stage_merge_insert
  actually publishes.

* **scan_with_pending silently disabled merge-shadow when projection
  omitted key_column** (P2): if a caller passed Some("id") as
  key_column but their projection didn't include "id", the
  filter_out_rows_where_string_in helper passed batches through
  unchanged — silently degrading to union semantics. Fix: validate
  up front that projection contains key_column when both are Some;
  return a typed Lance error otherwise. Tightened the helper too:
  missing column is now an internal error (was a silent passthrough).

* **Cascade-vs-explicit delete test was too weak** (P2): asserted
  only that edge count decreased after delete. The cascade alone
  could satisfy that even if the explicit second-delete silently
  no-op'd. Strengthened: assert post_knows == 0, which only holds
  when both ops landed (Bob→Diana would survive if op-2 no-op'd).

CI gap: also added test_failpoints_feature job to .github/workflows/ci.yml.
The workspace test runs without --features failpoints (the feature is
behind a Cargo flag), so the failpoints test suite was never exercised
by CI before now. The new job builds + runs
`cargo test -p omnigraph-engine --features failpoints --test failpoints`
on every full CI run, mirroring the test_aws_feature pattern.

New tests on tests/runs.rs:

* load_merge_mode_dedupes_within_pending_for_cardinality_count
  (Cubic P2 #2 — pending-vs-pending dedup, distinct from the
  load_merge_mode_dedupes_edge_for_cardinality_count test which
  covers committed-vs-pending dedup).
* scan_with_pending_rejects_key_column_missing_from_projection
  (Cubic P2 #3 — verifies the up-front validation rejects bad
  callers and that the happy path still works correctly).

Local test results:

* tests/runs.rs: 23/23 passed
* tests/failpoints.rs --features failpoints: 7/7 passed (includes the
  two new finalize→publisher residual tests landed in 3223b51).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:47:45 +02:00
Ragnor Comerford
3223b51cf1
MR-794 step 2: address PR #68 review — merge semantics, cardinality, residual
Five fixes from PR #68 review (Cursor Bugbot + Codex + Cubic):

* **scan_with_pending gains merge-shadow semantics** (Codex P1, Cubic P1#1):
  new `key_column: Option<&str>` parameter. When set, committed rows
  whose key value appears in any pending batch are excluded from the
  scan — making `scan_with_pending` correctly merge-semantic for chained
  updates instead of naively unioning. execute_update calls with
  Some("id"). Without this, a chained `update where age > 30` could
  match a row whose pending value already moved out of range.

* **Multi-delete on same table no longer trips ExpectedVersionMismatch**
  (Cursor Bugbot HIGH): open_table_for_mutation routes through
  reopen_for_mutation when staging.inline_committed has the table,
  using the post-inline-commit Lance version captured at record_inline
  time. The legacy open_for_mutation_on_branch fence (Lance HEAD ==
  manifest pinned) is correct cross-writer but wrong intra-query when
  deletes have already advanced HEAD on this table. Branch goes away
  when Lance ships two-phase delete (lance-format/lance#6658).

* **Cardinality validation consolidated** (Cursor LOW + Codex P2 +
  Cubic P1#2 + Cubic P2): new exec/staging::count_src_per_edge +
  enforce_cardinality_bounds shared by mutation and loader paths.
  Restores the missing min-cardinality check on the engine path.
  Loader Merge mode passes Some("id") to dedupe edges being updated
  by id (not double-count committed + pending). Loader Append mode
  and engine path pass None (ULID-generated ids never collide).

* **Dead count_rows_with_pending removed** (Cursor LOW): never called.

* **Misleading concat-helper comment fixed** (Cubic P3): claimed
  schema normalization the helper doesn't implement. Updated to match
  reality.

* **Documentation honesty** (Cubic P1#3): MR-794 narrows but doesn't
  eliminate the "Lance HEAD ahead of __manifest" drift class. Drift is
  unreachable for op-execution failures (the partial_failure test pins
  this), but a residual remains at the finalize→publisher boundary
  because Lance has no multi-dataset commit primitive: per-table
  commit_staged calls run sequentially before manifest commit. Updated
  docs/runs.md, docs/invariants.md §VI.25, docs/releases/v0.4.1.md to
  scope the claim precisely.

* **Failpoint test pinning the residual**: new
  mutation.post_finalize_pre_publisher failpoint + two tests in
  tests/failpoints.rs that confirm the documented residual behavior.
  Catches future regressions that widen the residual.

Test additions on tests/runs.rs:

* chained_updates_with_overlapping_predicate_respects_intermediate_value
* multi_statement_delete_on_same_node_table
* cascade_delete_node_then_explicit_delete_edge_on_same_table
* mutation_insert_edge_enforces_min_cardinality
* load_merge_mode_dedupes_edge_for_cardinality_count

113/113 engine integration tests pass (runs + end_to_end + consistency
+ staged_writes + validators). Failpoints feature build runs in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:47:55 +02:00
Ragnor Comerford
82350bdc4a
MR-794 step 2: tests — flip partial_failure + add coalesce/D₂/load tests
* Flip partial_failure_observably_rolls_back_but_blocks_next_mutation_on_same_table
  → partial_failure_leaves_target_queryable_and_unblocks_next_mutation.
  Asserts the next mutation succeeds (no drift) instead of expecting
  the legacy ExpectedVersionMismatch.
* New: mutation_rejects_mixed_insert_and_delete_at_parse_time —
  D₂ check fires before any I/O.
* New: mixed_insert_and_update_on_same_person_coalesces_to_one_merge —
  verifies dedupe-by-id last-write-wins in MutationStaging::finalize.
* New: multiple_appends_to_same_edge_coalesce_to_one_append —
  two-statement edge insert publishes exactly once.
* New: multi_statement_inserts_publish_exactly_once — manifest
  version advances by exactly 1 per multi-statement query.
* New: load_with_bad_edge_reference_unblocks_next_load — RI
  violation aborts before publish; next load succeeds.
* New: load_with_cardinality_violation_unblocks_next_load — uses
  a custom WorksAt @card(0..1) schema; cardinality violation aborts
  before publish; next load succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:43:00 +02:00
Ragnor Comerford
bbf610ea9b
MR-794 step 2: rewire loader for Append/Merge — Overwrite stays inline
load_jsonl_reader dispatches on mode: Append/Merge use the
MutationStaging accumulator (per-type batch staging, single stage_* +
commit_staged per touched table at end-of-load, publisher CAS).
Overwrite keeps the legacy concurrent inline-commit path
(truncate-then-append doesn't fit the staged shape; overwrite has no
in-flight read-your-writes requirement).

* New helpers collect_node_ids_with_pending and
  validate_edge_cardinality_with_pending_loader — loader analogs
  of the engine's pending-aware validators.
* Phase 2c (RI) and Phase 3 (cardinality) consult pending batches
  for Append/Merge so a mid-load failure aborts the load before any
  Lance write reaches HEAD.

A failed Append/Merge load no longer advances Lance HEAD on staged
tables — the next load on the same tables proceeds normally with no
ExpectedVersionMismatch. Overwrite mode's drift residual is unchanged
from today's behavior; documented in docs/runs.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:42:47 +02:00
Ragnor Comerford
e6f48ba24d
MR-794 step 2: rewire mutation.rs to in-memory accumulator + D₂
* Replace mutation.rs's MutationStaging.latest with the new
  pending + inline_committed shape from exec::staging. Inserts
  and updates push batches into pending; deletes still inline-commit
  via record_inline.
* Rewrite execute_insert, execute_update, execute_delete*,
  validate_edge_insert_endpoints, ensure_node_id_exists for the
  new shape. Edge cardinality validates against committed scan +
  in-memory pending walk (validate_edge_cardinality_with_pending).
* D₂ parse-time check: a query is either insert/update-only or
  delete-only. Mixed → reject before any I/O.
* Drop CoordinatorRestoreGuard and the swap_coordinator_for_branch /
  restore_coordinator dance from mutate_with_current_actor. Branch
  is threaded explicitly through execute_named_mutation and the
  per-op functions. (merge.rs keeps its own swap pattern.)
* apply_assignments updated to copy unassigned blob columns from
  the scan when present, enabling full-schema update batches; for
  blob-bearing tables we still project away the blob columns at scan
  time (Lance's filter pushdown panics otherwise) and accept the
  narrow-schema output for the v1 path.

A failed mid-query op no longer advances Lance HEAD on staged tables —
the next mutation proceeds normally with no ExpectedVersionMismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:42:37 +02:00
Ragnor Comerford
cdfbccbfdc
MR-794 step 2: scaffold MutationStaging accumulator + scan_with_pending
Add the scaffolding for the in-memory staged-write rewire — no behavior
change yet:

* New crates/omnigraph/src/exec/staging.rs with MutationStaging,
  PendingTable, PendingMode, StagedTablePath, plus the end-of-query
  finalize() that issues one stage_* + commit_staged per pending
  table (Merge mode dedupes by id, last-write-wins).
* TableStore::scan_with_pending and count_rows_with_pending helpers —
  Lance scan committed + DataFusion MemTable scan pending, concat.
  Sidesteps the Scanner::with_fragments filter-pushdown limitation
  documented on scan_with_staged.
* Add datafusion = "52" to workspace + omnigraph-engine deps for
  MemTable (transitively pulled by Lance already).

Engine code still uses the legacy MutationStaging shape; the rewire
lands in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:42:21 +02:00
Ragnor Comerford
7c09220210
MR-794 step 1: fix u32 cast + pin scan_with_staged filter limitation
Two CI failures, both addressed:

(1) u32/u64 type mismatch in stage_append (compile error):
ds.manifest.max_fragment_id is Option<u32>, but Lance's Fragment::id
and the commit-time renumbering counter in
Transaction::fragments_with_ids operate on u64. Cast max_fragment_id
to u64 before the arithmetic.

(2) scan_with_staged_pushes_filter_through_committed_and_staged failed
because Lance's stats-based fragment pruning drops uncommitted staged
fragments from filtered scans — they lack the per-column statistics
that committed fragments carry. With filter `age >= 30` and a staged
dave (age=35), dave is silently absent from the result.
scanner.use_stats(false) does not bypass this in lance 4.0.0
(verified locally).

Rather than chase Lance internals further, document the limitation:
- stage_merge_insert / scan_with_staged docstring updated to flag the
  filter contract as incomplete on staged fragments.
- Test renamed to scan_with_staged_with_filter_silently_drops_staged_rows
  and flipped to assert the actual behavior, with a clear note pointing
  at the design pivot (.context/mr-794-step2-design.md §1.1) and
  instructions for whoever sees the assertion fail in the future.
- Test also asserts that unfiltered scan_with_staged returns all rows —
  confirms the issue is specifically filter pushdown, not fragment
  scanning per se.

The engine's MR-794 step 2+ design (in-memory pending-batch
accumulation + DataFusion MemTable for read-your-writes) sidesteps
this entirely; production code is unaffected. scan_with_staged stays
on the public surface for primitive-level testing and for callers
that don't need filter pushdown.

All 8 staged_writes tests + 10 runs + 63 end_to_end + consistency
green locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 01:03:27 +02:00
Ragnor Comerford
85cfffeaf8
MR-794 step 1: assign real fragment IDs to staged appends
CI exposed the actual root cause behind the three staged_writes
test failures: Lance's InsertBuilder::execute_uncommitted produces
fragments with id=0 as a "Temporary ID" (lance-4.0.0
dataset/write.rs:1044, with the assertion at line 1712). Real IDs
get assigned at commit time by Transaction::fragments_with_ids
(transaction.rs:1456). Because we expose pre-commit fragments to
scan_with_staged via Scanner::with_fragments, two fragments collide
on id=0 in the combined list — the staged fragment with the seed
fragment, or two staged fragments with each other.

Lance's scanner mishandles the collision. Symptoms observed in
the three failing tests:
- chained_stage_appends: only 1 distinct _rowid (other fragments
  silently dropped)
- count_rows_with_staged_matches_scan: range overflow ("Invalid read
  of range 0..2 for fragment 0 with 1 addressable rows")
- scan_with_staged_pushes_filter: duplicate carol + missing dave
  (one fragment read twice, the other not at all)

Fix: assign real fragment IDs in stage_append, mirroring Lance's
commit-time logic. Use ds.manifest.max_fragment_id + 1 as the base,
incremented by the prior_stages fragment count so chained
stage_appends produce distinct IDs. The row_id_meta assignment
stays — both are needed for the scanner to correctly map row IDs
through the combined fragment list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:18:47 +02:00
Ragnor Comerford
61b3f5090b
MR-794 step 1: thread row-ID offset + add commit_staged + filter tests
Three follow-ups to the staged-writes primitives, all caught by the
"are we missing tests?" review:

(1) Path A row-ID threading (Gap 1, real bug):
stage_append now takes prior_stages: &[StagedWrite] and offsets the
assigned row IDs by the sum of prior stages' physical_rows. Without
this, two stage_appends against the same dataset both started at
ds.manifest.next_row_id, producing fragments with overlapping _rowid
ranges. This would have fired in Step 2+ on any multi-statement
mutation like `insert Knows ...; insert Knows ...` (multiple appends
to the same edge table — allowed under D₂′). The slice mirrors
scan_with_staged's API shape; the same slice is passed to both stage
and scan. Documented contract: only stage_append results in
prior_stages (D₂′ guarantees this upstream).

(2) commit_staged round-trip tests (Gap 2):
Two tests covering stage_append + commit_staged and stage_merge_insert
+ commit_staged. Validate that Lance's commit-time row-ID assignment
works correctly even after our pre-commit row_id_meta assignment in
the append path — the two assignments diverge but neither is observed
across the boundary.

(3) Filter pushdown test (Gap 3):
scan_with_staged with a SQL filter applies it across both committed
and staged fragments. Validates the MR-794 ticket's claim that Lance's
with_fragments preserves filter/vector/FTS pushdown (Lance tests
test_scalar_index_respects_fragment_list etc.).

Also adds chained_stage_appends_have_distinct_row_ids which directly
demonstrates the Gap 1 fix by projecting _rowid and asserting no
duplicates across 1 committed + 2 staged rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:59:59 +02:00
Ragnor Comerford
714f1f0c0a
MR-794 step 1: assign row_id_meta to stage_append fragments
CI exposed a real Step 1 bug surfaced by the new staged_writes tests:
stage_append → scan_with_staged fails on stable_row_id datasets with
"Missing row id meta" (lance-4.0.0/src/dataset/rowids.rs:22).

Root cause: InsertBuilder::execute_uncommitted produces fragments with
row_id_meta = None. Lance's commit phase normally populates it via
Transaction::assign_row_ids, but scan_with_staged reads the staged
fragments BEFORE commit. MergeInsertBuilder::execute_uncommitted dodges
this by populating row_id_meta inline (transaction.rs:1618) — that's
why the two merge-side tests in tests/staged_writes.rs passed and the
two append-side tests failed.

The bug was always present in the primitive — PR #66 shipped it the
same way. PR #66 had no tests calling stage_append, so neither CI nor
the bot reviewers caught it. Step 2+ would have hit it on the first
mutation that did "insert + insert with FK validation," but the failure
would have looked like a MutationStaging wiring bug; localizing it
here saves the next session the chase.

Fix: assign row_id_meta on the cloned fragments returned in
StagedWrite.new_fragments. Mirrors the relevant arm of Lance's
Transaction::assign_row_ids (transaction.rs:2682) for the
row_id_meta = None case. The transaction's internal fragment copy stays
untouched — Lance assigns its own IDs at commit time, and the two ID
assignments don't have to agree because no caller threads _rowid from
the staged scan into the commit path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:49:14 +02:00
Ragnor Comerford
2fe2669017
MR-794 step 1: import arrow_array::Array in staged_writes test
CI failed compiling tests/staged_writes.rs — `.len()` is on the Array
trait, not on the concrete StringArray/Int32Array types. Add the
trait import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:19:05 +02:00
Ragnor Comerford
4c5fa3d8b8
MR-794 step 1: address PR #67 Codex P1 — document chained-merge contract
Codex flagged that combine_committed_with_staged can return duplicates
on chained stage_merge_inserts: each call's MergeInsertBuilder runs
against the committed view (it does not see prior staged fragments), so
two staged merges whose source rows share keys both produce
Operation::Update transactions whose new_fragments contain the shared
row. The combined scan returns it twice.

The bug is intrinsic to Lance's API: there is no public way to make
MergeInsertBuilder see uncommitted fragments. Fixing the primitive
itself requires either a Lance API extension or in-memory pre-merge
logic, neither in scope for v1.

The v1 fix is a parse-time companion (D₂′) added with the engine rewire
in MR-794 step 2+: per touched table, ops must be all stage_append OR
exactly one stage_merge_insert. Multi-table queries and append-chains
remain safe; only chained merges on a single table are rejected.

This commit:
- Documents the contract on stage_merge_insert and
  combine_committed_with_staged so callers know the invariant the
  primitive relies on.
- Adds tests/staged_writes.rs with four primitive-level tests:
  - stage_append + scan_with_staged shows committed + staged
  - stage_merge_insert dedupes superseded committed fragments
    (regression for the removed_fragment_ids fix that PR #66's
    730631c added)
  - count_rows_with_staged matches scan
  - chained stage_merge_insert with shared key documents the
    duplicate-row behavior; assertion pins it so a future change either
    preserves the contract or consciously fixes it (and updates the test)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:10:19 +02:00
Ragnor Comerford
6dc4167291
MR-794 step 1: address PR #66 review — track removed_fragment_ids
Three independent automated reviews (Cubic P1, Cursor High, Codex P1)
flagged a real correctness bug in stage_merge_insert: Operation::Update
returns three fields — removed_fragment_ids, updated_fragments,
new_fragments — and we were collecting only the latter two into
StagedWrite.new_fragments while discarding removed_fragment_ids.

That breaks read-your-writes for any merge_insert that rewrites an
existing fragment: scan_with_staged combines the dataset's full committed
manifest with the staged new_fragments, so the *original* committed
fragment (which the rewrite supersedes) and its rewritten version both
end up in the Scanner's fragment list. Result: duplicate rows.

Fix:

- StagedWrite gains `removed_fragment_ids: Vec<u64>` populated from
  Operation::Update; empty for Operation::Append (which never supersedes
  existing fragments).
- scan_with_staged / count_rows_with_staged take `&[StagedWrite]` instead
  of `&[Fragment]` so they have access to both fields.
- A new `combine_committed_with_staged` helper composes the visible
  fragment list as `committed - removed + new`, deduping by fragment ID.

Also addresses cubic's P3 doc-fab note: the StagedWrite doc comment
claimed the type was "used by MutationStaging and the loader" but those
callers don't exist in this PR (they're MR-794 step 2+). Reword to
"defined here for later integration" so the doc doesn't lie about the
current state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 19:19:39 +02:00
Ragnor Comerford
3601002440
MR-794 step 1: add staged-write primitives to TableStore
Lance's distributed-write API splits "write fragment files" from "advance
HEAD": write_fragments returns a Transaction with FragmentMetadata; a
later CommitBuilder::execute(transaction) commits via the manifest CAS.
The same shape exists for merge_insert via
MergeInsertBuilder::execute_uncommitted. Scanner::with_fragments(staged)
lets in-flight reads see uncommitted staged data.

Adds wrappers for these primitives:

- StagedWrite carries the uncommitted Transaction plus the new Fragments
  (extracted for read-your-writes via Scanner::with_fragments).
- TableStore::stage_append wraps InsertBuilder::execute_uncommitted.
- TableStore::stage_merge_insert wraps MergeInsertBuilder::execute_uncommitted.
- TableStore::commit_staged wraps CommitBuilder::execute.
- TableStore::scan_with_staged / count_rows_with_staged thread the staged
  fragments into a Scanner alongside the dataset's committed fragments.

The MutationStaging integration that uses these primitives is the next
step in MR-794 — it requires a coordinated rewrite of execute_insert /
execute_update / execute_delete plus the load_jsonl_reader path, plus
end-of-query commit logic. Doc comment on MutationStaging is updated to
reference MR-794 and these primitives so the followup is well-anchored.

The current MR-771 limitation in docs/runs.md ("mid-query partial failure
leaves Lance HEAD ahead of __manifest") still applies until the
follow-up lands; the primitives are the building blocks but not yet the
fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 19:19:38 +02:00
Ragnor Comerford
1a906403cb
MR-771: address cubic comment — drop vacuous __run__ check in cancel test
cubic correctly flagged that the assertion `!branches_after.iter().any(|b| b.starts_with("__run__"))` is vacuous because `branch_list()` already filters `__run__*` via `is_internal_system_branch`. The real structural property (no `__run__` branches can ever be created) is enforced by MR-771's deletion of `begin_run` etc. — that's a build-time invariant, not a runtime one.

Drop the vacuous assertion; document why. The remaining checks (public branch list unchanged + `_graph_runs.lance` never reappears) cover the actual cancel-safety properties.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 15:31:49 +02:00
Ragnor Comerford
a7109d5fba
MR-771: address PR review feedback
Three fixes from automated PR review on #65:

1. Internal-branch guard in mutation/load (Cursor Bugbot, Medium).
   Pre-MR-771 the begin_run path called ensure_public_branch_ref;
   the direct-publish replacements only normalized the name. A caller
   passing __run__* or __schema_apply_lock__ verbatim could write
   directly to a system branch. Re-add the explicit guard at the
   public write boundary in mutate_with_current_actor and load.

2. Panic-safe coordinator restoration (Cursor Bugbot, High).
   The previous swap-and-restore pattern would skip restore_coordinator
   if execute_named_mutation panicked, leaving the handle pinned to
   the wrong branch indefinitely. Replace with a CoordinatorRestoreGuard
   RAII type that captures the previous coordinator on swap and
   restores it in Drop.

3. Flaky cancel-safety test (cubic, P2).
   tests/runs.rs::cancelled_mutation_future_leaves_no_state asserted
   manifest version equality after handle.abort(), but abort races
   the spawned task. Re-frame around what actually defines cancel
   safety: no __run__* branches, no _graph_runs.lance, no synthesized
   public branches.

The fourth comment (Codex P1: branch_delete losing its in-flight
write barrier) is bigger in scope — fits in the MR-794 storage-trait
staging story rather than a hotfix here. Tracked there.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 15:17:00 +02:00
Ragnor Comerford
35be20cb05
MR-771: demote Run to direct-publish via expected_table_versions CAS
mutate_as and load now write directly to target tables and call the
publisher once at the end with per-table expected versions; the Run
state machine, _graph_runs.lance writers, __run__ staging branches,
and server /runs/* endpoints are removed. Multi-statement mutations
remain atomic at the manifest level via an in-memory MutationStaging
accumulator that gives read-your-writes within a query and a single
publish at the end. Concurrent-writer conflicts surface as
ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the
old DivergentUpdate merge shape. Documents one known limitation in
docs/runs.md: a multi-statement mid-query failure where op-N writes
a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the
manifest until a follow-up introduces per-table Lance branches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 08:52:50 +02:00
Claude
243c0c3464
Add internal-schema versioning + auto-migration for __manifest
The on-disk shape of `__manifest` is reconciled with the binary via a single
stamp + dispatcher in `db/manifest/migrations.rs`:

- `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` declares the shape this binary writes.
- The on-disk stamp `omnigraph:internal_schema_version` lives in the manifest
  dataset's schema-level metadata (Lance `update_schema_metadata`).
- `migrate_internal_schema(&mut dataset)` walks `match`-arm steps forward from
  the on-disk stamp until it matches the binary, then returns. Idempotent.
- `init_manifest_repo` stamps the current version at creation; the publisher's
  open-for-write path runs pending migrations before reading state. Reads
  stay side-effect-free.
- Forward-version protection: a stamp higher than the binary's known version
  triggers a clear "upgrade omnigraph first" error so an old binary cannot
  clobber a newer schema.

Self-heals existing pre-MR-766 deployments by auto-applying the v1→v2 step:
the `lance-schema:unenforced-primary-key` annotation on `__manifest.object_id`
that engages Lance's row-level CAS at commit time. New repos created via
`init` are stamped at v2 immediately and don't need migration.

Adding a future on-disk shape change is one constant bump, one match arm in
`migrate_internal_schema`, and one test — no new branches in unrelated code
paths. Code outside the migration module never inspects the stamp.

New tests in `manifest/tests.rs`:
- `test_init_stamps_internal_schema_version`
- `test_publish_migrates_pre_stamp_manifest_to_current_version`
- `test_publish_rejects_manifest_stamped_at_future_version`

Docs: `docs/storage.md`, `docs/maintenance.md`, `docs/constants.md` updated
per the AGENTS.md maintenance contract.
2026-04-29 11:44:14 +00:00
Claude
df0e158190
Add per-table expected-version OCC to ManifestBatchPublisher (MR-766)
Layered approach selected by the CAS-granularity investigation
(.context/merge-insert-cas-granularity.md):

- Annotate __manifest.object_id with `lance-schema:unenforced-primary-key`,
  enabling Lance row-level CAS via the bloom-filter conflict resolver.
  Closes a latent silent-duplicate bug where two concurrent publishes of
  the same `version:T@v=N+1` row could both land in disjoint fragments.
- Extend `ManifestBatchPublisher::publish` with `expected_table_versions:
  &HashMap<String, u64>`. Empty map preserves today's behavior; populated
  map asserts the manifest's latest non-tombstoned version per table
  matches the caller's view. Mismatches surface as a typed
  `ManifestConflictDetails::ExpectedVersionMismatch { table_key, expected,
  actual }` so callers can match without parsing strings.
- Set `merge_builder.conflict_retries(0)` so Lance's transparent rebase
  cannot silently break the OCC contract; retries are owned by the
  publisher loop, where each attempt re-runs `load_publish_state` and
  the expected-version pre-check.
- Surface `ManifestCoordinator::commit_with_expected` for the callers
  that need strict OCC (the run-demotion ticket); existing `commit` and
  `commit_changes` paths are unaffected.

New tests in `manifest/tests.rs` cover: matching expected versions,
stale expected with typed details, drift on an untouched expected
table, unknown expected table (actual=0), and the headline case of two
concurrent publishes with overlapping expected versions where exactly
one succeeds.
2026-04-29 00:34:20 +00:00
Andrew Altshuler
56b6319197
Enforce schema validators on every write path (#59)
Several validators were defined but only called from a subset of write
paths, so writes that violated @unique, @range, @check, enum, or
@cardinality constraints could silently succeed and corrupt data.

Adds two new helpers in loader/mod.rs:

- validate_enum_constraints — batch-level enum check, scans Arrow
  string columns (and list-of-string columns) for values outside the
  declared set
- enforce_unique_constraints_intra_batch — single-batch duplicate
  detection over named columns; partial enforcement (does not check
  against committed rows yet — cross-batch enforcement is a separate
  effort)

Wires the validators into:

- load_jsonl_reader nodes (alongside the existing
  validate_value_constraints call) and edges (which had no enum or
  unique check at all)
- exec/mutation.rs node insert, edge insert, and update paths
- mutation edge insert now also calls validate_edge_cardinality after
  the row lands but before the manifest commit, matching the loader's
  Phase 3 behavior

A new tests/validators.rs suite asserts rejection on every entry path
for invalid enum values, @range violations, intra-batch @unique
duplicates, and edge @card excesses.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 04:51:10 +03:00
Andrew Altshuler
f75b941a9e
Make schema apply atomic across crashes (#57)
Schema apply previously committed the manifest before writing the schema
source and IR contract files. A crash in that window left the manifest
pointing at the new schema while _schema.pg, _schema.ir.json, and
__schema_state.json still reflected the old one — a silent inconsistency
that subsequent reads hit as type errors.

Reorders the apply: write to staging filenames first, commit the
manifest, then atomically rename staging → final. On open, a recovery
sweep reconciles any leftover staging files against the manifest's table
set: pre-commit crashes get the staging files deleted, post-commit
crashes get the renames completed (idempotent — handles partial
renames). Property-only migrations where both schemas imply the same
table set return an operator-actionable error rather than guessing.

Adds rename_text + delete to StorageAdapter (atomic on local FS via
tokio::fs::rename; copy + delete on S3 — recovery is tolerant of the
non-atomic case). Failpoints test coverage at both crash boundaries plus
a partial-rename scenario.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:21:00 +03:00
Andrew Altshuler
7310f69928
Revert "Merge pull request #49 from ModernRelay/ragnorc/x-request-id" (#54)
This reverts commit b352fca13c, reversing
changes made to 748ad334a9.
2026-04-26 15:56:29 +03:00
Ragnor Comerford
b352fca13c
Merge pull request #49 from ModernRelay/ragnorc/x-request-id
Add X-Request-Id middleware
2026-04-26 12:33:33 +02:00
Ragnor Comerford
e14b203208
Reuse X_REQUEST_ID constant for inbound header lookup
Both Cursor Bugbot and Cubic flagged that the inbound `headers().get(...)`
call constructed `HeaderName::from_static("x-request-id")` inline instead
of reusing the `X_REQUEST_ID` constant defined at the top of the file.
The two were already kept in sync by both being `from_static("x-request-id")`,
but a future rename would have to touch both sites or risk silent drift
between read and write.

Also drops the now-unused `header` module import.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 12:05:19 +02:00
Ragnor Comerford
748ad334a9
Merge pull request #48 from ModernRelay/ragnorc/api-sdk-research
Polish OpenAPI spec for SDK generation
2026-04-26 11:52:46 +02:00
Ragnor Comerford
189caf893c
Merge pull request #47 from ModernRelay/perf/expand-dense-ids
perf(expand): dense u32 ids end-to-end (follow-up to #45)
2026-04-25 23:54:03 +02:00
Ragnor Comerford
284c9377c2
Add X-Request-Id middleware
Per-request ULID minted at the edge, exposed in request extensions and
on the response header. Caller-supplied X-Request-Id is echoed when
well-formed (1..=128 ASCII printable characters); otherwise rejected
and replaced with a fresh ULID so the value is always safe to log.

Companion to the TypeScript SDK redesign — clients now correlate logs
across the wire by reading X-Request-Id from response headers (and the
SDK already surfaces it on every OmnigraphError as `requestId`).

No spec change required; the header is a transport-layer concern.

Tests:
- mint a ULID when no header is provided
- echo a valid caller-supplied id
- reject overlong header (200 chars), mint a fresh ULID

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 22:56:17 +02:00
Ragnor Comerford
7809bf607e
Polish OpenAPI spec for SDK generation
Add operation descriptions and examples to utoipa annotations so the
generated TypeScript SDK has rich JSDoc, and so future Python/Go SDKs
and any /openapi.json docs UI benefit from the same effort.

- Doc comments on all 18 handlers (utoipa picks up summary/description)
- #[schema(example = ...)] on free-text fields (query_source,
  schema_source, NDJSON data) and i64 timestamps
- Destructive/irreversible warnings on change, applySchema, ingest,
  mergeBranches, deleteBranch, publishRun, abortRun

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 16:36:51 +02:00
Andrew Altshuler
74eb5a5380
Parallel per-type load writes + omnigraph optimize/cleanup CLI (#46)
* Parallel per-type load writes + omnigraph optimize/cleanup CLI

## MR-677.3 — parallel per-type load writes

The load path already groups records into one RecordBatch per type and
makes one Lance commit per table (loader::mod.rs:249-..), but those
commits ran sequentially. Wrap node and edge write loops in
`futures::stream::buffered(N)` against a new helper
`write_batches_concurrently`. Concurrency tunable via
`OMNIGRAPH_LOAD_CONCURRENCY` (default 8).

## MR-676 — `omnigraph optimize` and `omnigraph cleanup`

New CLI subcommands that walk every node + edge table in the repo:

- `omnigraph optimize <uri>` — runs Lance `compact_files` on each
  table to merge small fragments into fewer larger ones.
- `omnigraph cleanup <uri> --keep N | --older-than 7d --confirm` —
  runs Lance `cleanup_old_versions` to prune historical manifests +
  unique fragments. Requires `--confirm` because it's destructive.
  Supports both count-based and time-based retention (or both AND'd
  together). Time uses chrono `DateTime<Utc>` (added as a workspace
  dep, default-features off).

Both commands run their per-table loops in parallel (8-way bounded,
`OMNIGRAPH_MAINTENANCE_CONCURRENCY` env override). Smoke-tested
against the 114-table prod graph: optimize went 7m15s sequential
→ 1m28s parallel. cleanup --keep 1 removed 137 historical versions
across 114 tables in 1m57s without disrupting `/healthz` or query
responses.

Public API on `Omnigraph`:

  pub async fn optimize(&mut self) -> Result<Vec<TableOptimizeStats>>
  pub async fn cleanup(&mut self, opts: CleanupPolicyOptions)
      -> Result<Vec<TableCleanupStats>>

All 10 existing loader tests still pass.

Closes MR-676.
Partially addresses MR-677 (the .3 — parallel by type — piece;
MR-677.1 is for the `omnigraph embed` path, not load, since load
doesn't call Gemini directly. .2 was already in place).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: regenerate openapi.json

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-04-25 14:22:14 +03:00
Ragnor Comerford
53d7f47909
Pass dense u32 ids through expand instead of round-tripping via String
BFS now emits Vec<u32> dense ids directly with HashSet<u32> per-source
dedup. Only the deduped set is stringified for Lance's IN-list. The
post-hydrate alignment uses a dense-indexed Vec<Option<u32>> instead of
HashMap<&str, usize>, giving O(1) lookup without repeated string hashing.

End-to-end on the bench_expand harness (release, M-series):

  query     baseline    after     speedup
  1k  hop3   460.2 ms    23.7 ms     19x
  10k hop2     4.21 s   139.9 ms     30x
  10k hop3    40.59 s    898.5 ms    45x
  30k hop2    11.71 s    490.2 ms    24x
  30k hop3   197.38 s      3.22 s    61x

The cost lived in stringifying every (src,dst) pair and re-hashing the
strings during alignment; once dense ids stay dense, the BFS inner loop
and the final fan-out both collapse to integer ops.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 12:07:25 +02:00
andrew
628bc2e607 Clean up bench_expand example
Remove vestigial code left from removed hasher variants: unused
BuildHasherDefault import, PhantomData suppression line, orphan planning
comments for Variant C/E. Also drop an unused `mut` on the PRNG closure
binding. No behavior change; compiles warning-free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 00:59:21 +03:00
Ragnor Comerford
d8e0bfeb22
Dedupe dst ids before hydrating nodes in execute_expand (#45)
The BFS in execute_expand emits one (src_idx, dst_id) pair per edge, so
dst_id_list contains heavy duplication when multi-hop traversals revisit
the same destination nodes. hydrate_nodes then built an
"id IN ('a', 'b', ...)" filter from the full list, passing it verbatim
to Lance. On a 30k-node Person graph, a 3-hop query produced a 15.4M-
entry IN-list against a 30k-row target — 512x more entries than unique
ids.

Deduplicate before the Lance scan; the post-hydrate alignment HashMap
already fans results back out to the original (src, dst) pairs, so
output is bit-identical.

Bench numbers (crates/omnigraph/examples/bench_expand.rs, min of 2-3
runs, release build):

  query         before     after    speedup
  1k   hop3     460 ms     28 ms     16x
  10k  hop2    4.21 s     188 ms     22x
  10k  hop3   40.59 s    1.30 s     31x
  30k  hop2   11.71 s    678 ms     17x
  30k  hop3  197.38 s    4.86 s     41x

All existing omnigraph-engine tests pass (72/72, 0 failures).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 00:56:18 +03:00
Andrew Altshuler
8649b2084f
Prepare v0.3.0 release (#44)
* Prepare v0.3.0 release

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: regenerate openapi.json

* ci: retrigger CI on latest openapi.json

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-04-21 19:11:34 +03:00
andrew
2df578eab8 Delete __run__ branches on every terminal state (MR-674)
Run branches are transactional scaffolding — the durable audit lives
on RunRecord. Invariant: every terminal state (Published, Aborted,
Failed) deletes the __run__ branch.

- Add `terminate_run` helper: appends terminal RunRecord, then
  deletes the run branch. Delete errors are swallowed — the record
  is authoritative; `cleanup_terminal_run_branches_for_target`
  retries on later `branch_delete` of the target.
- Wire into `publish_run_as`, `abort_run`, `fail_run`.
- Include `Failed` in the cleanup filter (was `Published | Aborted`
  only) for legacy-repo GC during branch_delete.
- Cleanup now checks `coordinator.all_branches()` first to skip
  branches already deleted by a concurrent handle — avoids Lance
  NotFound when two handles publish/clean up independently.
- Drop `Failed` from `ensure_branch_delete_safe` — post-fix, Failed
  means the branch is already gone, so there's no reason to block
  target deletion (MR-674 "Downstream effects").

Tests:
- New regression: `run_branches_do_not_accumulate_across_repeated_loads`
  — 10 loads + 1 abort → `branch_list() == ["main"]`.
- New `failed_load_deletes_run_branch` asserts Failed path cleans up.
- Rename `abort_run_keeps_target_unchanged_and_preserves_hidden_branch_for_inspection`
  → `abort_run_leaves_target_unchanged_and_deletes_run_branch`, invert
  the hidden-branch assertion.
- Rewrite `public_{load,mutation}_preserves_staged_edge_ids_on_publish`
  to capture staged IDs before publish instead of inspecting the run
  branch after (branch is gone now).
- Update MR-670 regression test to assert the run branch is *absent*
  after publish.

Deferred to follow-up: `--keep-run-branch` debug flag, `omnigraph run gc`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 14:15:39 +03:00
andrew
54101f7e2c Extract remaining crowded compiler test modules
Files where inline tests crowded out production code (test/prod ratio
≥ 0.8) move to sibling files via `#[path]`. Files where production
dominates (query_input.rs, schema_plan.rs) stay inline — extracting
would add noise, not reduce it.

- ir/lower.rs: 1239 → 577 lines (ratio 1.15)
- catalog/mod.rs: 594 → 326 lines (ratio 0.83)
- query/lint.rs: 562 → 314 lines (ratio 0.80)

catalog/tests.rs uses the shorter name since it's inside a module
directory (no ambiguity with filename).

All 229 compiler tests green, identical count to before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 22:49:09 +03:00
andrew
94849a50b4 Extract compiler test modules to sibling files
typecheck.rs, schema/parser.rs, and query/parser.rs each had
~1000-line inline `mod tests` blocks that overshadowed the production
code in the file. Move each to a sibling `*_tests.rs` using
`#[path = "..."] mod tests;`.

- typecheck.rs: 2865 → 1708 lines; typecheck_tests.rs: 1156 lines
- schema/parser.rs: 1950 → 994 lines; parser_tests.rs: 955 lines
- query/parser.rs: 1737 → 803 lines; parser_tests.rs: 933 lines

No visibility change — the sibling module still has `use super::*`
access to crate-privates. No semantic edits beyond de-indenting by
4 spaces (mechanical). All 229 compiler tests green, identical
count to before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:50:18 +03:00
andrew
f05ea2c7c3 Extract public-API tests from omnigraph.rs to integration tests
The inline `mod tests` in crates/omnigraph/src/db/omnigraph.rs had grown
to ~620 lines, mixing tests that need crate-private access with tests
that only exercise the public API. Splits the latter out.

- tests/lifecycle.rs: 10 init/open/snapshot/drift tests
- tests/schema_apply.rs: 5 plan/apply tests
- omnigraph.rs: 10 tests remain inline because they use
  db.coordinator, db.table_store(), ManifestCoordinator,
  SCHEMA_APPLY_LOCK_BRANCH, or is_internal_run_branch — all
  crate-private and intentionally kept so.

No behavior change. Zero semantic edits to the tests themselves beyond
replacing db.snapshot() (pub(crate)) with snapshot_main helper at
integration-test boundaries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:09:34 +03:00