213 commits

Author SHA1 Message Date
Andrew Altshuler
92e3886cfa
ci: add publish-crates workflow for crates.io releases (#74)
The release.yml workflow builds binaries and updates Homebrew but never
published to crates.io — v0.4.0 and v0.4.1 are missing from the registry
even though the local Cargo.toml and the v0.4.1 tag are at 0.4.1.

This adds a separate workflow that:
- auto-publishes on every v* tag push (future releases self-publish)
- can be manually dispatched with a tag input (catch up on v0.4.1)
- is idempotent: skips a crate if its current crates.io version already
  matches local, so a partial failure is safe to retry
- gates on CARGO_REGISTRY_TOKEN (already in repo secrets); skips cleanly
  if the token is ever rotated out

Publishes in dependency order: omnigraph-compiler → omnigraph-engine →
omnigraph-server → omnigraph-cli. Path-only deps in Cargo.toml carry
explicit version fields, so cargo publish strips paths and resolves
against crates.io.
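
The idempotent skip and the dependency-ordered publish can be sketched as a pure planning function. This is a minimal illustration, not the workflow's actual script: `publish_plan` and the `registry_versions` map are hypothetical names, and a real workflow would query crates.io and run `cargo publish` for each crate the function returns.

```rust
use std::collections::HashMap;

/// Crates in dependency order, as listed in the commit message.
const PUBLISH_ORDER: [&str; 4] = [
    "omnigraph-compiler",
    "omnigraph-engine",
    "omnigraph-server",
    "omnigraph-cli",
];

/// Hypothetical helper: returns the crates that still need publishing,
/// in dependency order. A crate is skipped when the registry already
/// holds the local version, so a retry after a partial failure is safe.
pub fn publish_plan(
    local_version: &str,
    registry_versions: &HashMap<String, String>,
) -> Vec<&'static str> {
    PUBLISH_ORDER
        .iter()
        .copied()
        .filter(|name| {
            // Publish when the registry has no entry or an older/different version.
            registry_versions.get(*name).map(String::as_str) != Some(local_version)
        })
        .collect()
}
```

Because the plan is recomputed from registry state on every run, re-dispatching after a failure midway through the list resumes at the first crate that is still behind.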

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:48:37 +03:00
Ragnor Comerford
9459672549
Update README.md
2026-05-07 12:13:37 +02:00
Ragnor Comerford
62b4b112b9
Add Slack community link to README
Added a link to the Omnigraph Slack community for collaboration.
2026-05-07 12:11:45 +02:00
Ragnor Comerford
122ae5c990
Update README.md
2026-05-07 12:11:15 +02:00
Ragnor Comerford
bd0f82e5c5
Update README.md
2026-05-07 12:11:03 +02:00
Ragnor Comerford
028b913d9a
Merge pull request #72 from ModernRelay/ragnorc/mr847-recovery-reconciler
Recovery-on-open reconciler
2026-05-06 11:59:09 +02:00
Ragnor Comerford
a30666bc38
docs/tests: reserve Phase A/B/C/D for the per-writer recovery flow
Three terminologies were calling themselves Phase A/B in PR #72:

1. Per-writer recovery (canonical, four phases A/B/C/D — sidecar /
   commit_staged loop / manifest publish / sidecar delete in
   `docs/runs.md:157`).
2. Per-table staged-write contract from MR-793 (two phases —
   `stage_*` then `commit_staged`).
3. Test-narrative scaffolding (Phase A = setup the failure, Phase B
   = verify recovery — used as section dividers in failpoints.rs).

Same letters, three meanings; three reviewers including the bots have
already misread the code in the resulting fog. This change keeps
"Phase A/B/C/D" exclusively for #1 and rewrites the other two:

- `ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`
  → `ensure_indices_stage_btree_failure_leaves_existing_tables_writable`
  (matches the `stage_create_btree_index` API verb).
- Comment at `table_ops.rs:610` and the test docstring at
  `failpoints.rs:807` rewrite "a Phase A failure in the staged-index
  path" → "a stage-step failure in the staged-index path".
- Twelve `// Phase A:` / `// Phase B:` test scaffolding comment
  headers in `failpoints.rs` (across six test fns) become
  `// Setup:` / `// Recovery:`.
- A "Phase letter convention" note added near `docs/runs.md:165`
  spells the rule out for future readers.

Also bundled: rename
`composite_flow_init_load_branch_merge_time_travel_optimize_cleanup`
→ `composite_flow_canonical_lifecycle` so it pairs as a story name
with `composite_flow_multi_branch_sequential_merges` (the previously-
deferred symmetry rename).

No behaviour change. Both renamed tests pass; full failpoints (18) +
composite_flow (2) suites pass; workspace baseline + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 22:46:03 +02:00
Ragnor Comerford
fb0f024652
recovery: register added tables + tombstones in SchemaApply roll-forward
Cursor flagged that SchemaApply sidecars only captured `Update` pins
(via `snapshot.entry()?` in schema_apply.rs:166), so recovery's
`roll_forward_all` only published `ManifestChange::Update` for the
rewritten/indexed tables. Added types (`added_tables`) and tombstones
(`renamed_tables` sources) were silently dropped during recovery.

Reproducer: in `schema_apply_phase_b_failure_recovered_on_next_open`,
the v2 schema added a `Tag` node type. Pre-fix, `node:Tag` ended up as
an orphan dataset on disk while the manifest never received a
`RegisterTable` entry — the live `_schema.pg` declared a type the
manifest didn't know about, and `count_rows(node:Tag)` panicked with
`no manifest entry for node:Tag`. The existing test passed only
because it never queried Tag.

Fix:
1. Extend `RecoverySidecar` with `additional_registrations` and
   `tombstones` fields (optional, serde-default for backward compat
   with existing on-disk sidecars). Both are SchemaApply-only.
2. Populate them in `apply_schema_with_lock` from the migration plan's
   upfront diff (`added_tables` + `renamed_tables` keys for
   registrations; `renamed_tables` values for tombstones, version-
   pinned at `source_entry.table_version + 1`).
3. Update `roll_forward_all` to:
   - emit `RegisterTable` + `Update` for each `additional_registrations`
     entry (read the dataset's current Lance HEAD for the version
     metadata + row_count)
   - emit `Tombstone` for each `tombstones` entry
   - filter against `snapshot` so previously-published registrations /
     tombstones are skipped (handles the post-Phase-C-success-but-
     sidecar-not-yet-deleted case — without filtering, the publisher's
     CAS pre-check would error with `expected=0, actual=N` on the
     redundant Register)
4. Extend the audit-row outcomes to include published registrations.
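
The snapshot filter in step 3 can be sketched as follows. This is a simplified model under stated assumptions, not the engine's code: `ManifestChange` here is a two-variant stand-in (the real roll-forward also emits `Update`), `pending_changes` is a hypothetical name, and treating "table no longer in the snapshot" as "tombstone already published" is an assumption made for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the manifest changes named in the commit message.
#[derive(Debug, PartialEq)]
pub enum ManifestChange {
    RegisterTable(String),
    Tombstone(String),
}

/// Builds the changes recovery still has to publish. Entries the snapshot
/// already reflects are skipped, so a sidecar that outlived a successful
/// publish does not trigger a redundant Register.
pub fn pending_changes(
    additional_registrations: &[&str],
    tombstones: &[&str],
    registered_in_snapshot: &HashSet<String>,
) -> Vec<ManifestChange> {
    let mut changes = Vec::new();
    for table in additional_registrations {
        // Already registered: a second Register would fail the CAS pre-check.
        if !registered_in_snapshot.contains(*table) {
            changes.push(ManifestChange::RegisterTable((*table).to_string()));
        }
    }
    for table in tombstones {
        // Assumption: a published tombstone removes the table from the snapshot.
        if registered_in_snapshot.contains(*table) {
            changes.push(ManifestChange::Tombstone((*table).to_string()));
        }
    }
    changes
}
```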

Test changes:
- `schema_apply_phase_b_failure_recovered_on_next_open` now asserts
  `count_rows(node:Tag) == 0` (no panic), proving the new manifest
  entry exists.
- `schema_apply_recovers_pre_commit_crash` renamed to
  `schema_apply_pre_commit_crash_rolls_forward_via_sidecar` and
  rewritten — pre-fix it expected pre-commit crashes to roll BACK
  (delete staging, keep V1, leave Company as orphan); the sidecar
  protocol's "complete the writer's intent" semantic now rolls
  FORWARD (rename staging -> final, register Company atomically). The
  new assertions verify schema = V2 and `node:Company` is queryable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 22:15:50 +02:00
Ragnor Comerford
3ea7a1fd50
recovery: record RolledForward audit on stale-after-success sidecar
Cursor flagged that if `roll_forward_all` succeeds (manifest pin
advances) but `record_audit` then fails, the sidecar persists. On
the next open, every table classifies as NoMovement
(lance_head == manifest_pinned, both already reflect the prior
roll-forward) → `decide` returns RollBack → `roll_back_sidecar`
records a RolledBack audit row with empty per-table outcomes.
Operators reading `_graph_commit_recoveries.lance` see "RolledBack"
for an operation whose actual outcome was a successful roll-forward.

`process_sidecar`'s RollBack arm now distinguishes "stale-after-
success" from a legitimate rollback: when every classification is
NoMovement AND any pin's `manifest_pinned > expected_version` (the
manifest already advanced past the writer's CAS target), recovery
dispatches to `record_audit_recovery_rollforward` which writes a
RolledForward audit row with reconstructed outcomes
(`from_version = expected_version`,
`to_version = manifest_pinned`) and deletes the sidecar. No Lance
writes — the substrate is already in the post-roll-forward state.
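
The stale-after-success predicate can be sketched as below. The types are simplified stand-ins (`ClassifiedPin` and the two-variant `Classification` are hypothetical); only the condition itself, every classification is NoMovement and at least one pin has `manifest_pinned > expected_version`, comes from the description above.

```rust
/// Simplified stand-in for the per-table classification.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Classification {
    NoMovement,
    Moved,
}

/// Hypothetical per-table view combining the sidecar pin and manifest state.
pub struct ClassifiedPin {
    pub classification: Classification,
    pub expected_version: u64,
    pub manifest_pinned: u64,
}

/// True when the sidecar describes work that was already rolled forward:
/// nothing moved relative to the manifest, yet the manifest sits past the
/// version the writer expected to CAS against.
pub fn is_stale_after_success(pins: &[ClassifiedPin]) -> bool {
    let all_no_movement = pins
        .iter()
        .all(|p| p.classification == Classification::NoMovement);
    let manifest_advanced = pins
        .iter()
        .any(|p| p.manifest_pinned > p.expected_version);
    all_no_movement && manifest_advanced
}
```

A sidecar whose pins are all NoMovement with `manifest_pinned == expected_version` still takes the legitimate rollback path, since the manifest never advanced.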

Safe in `RollForwardOnly` mode (refresh-time recovery) because no
`Dataset::restore` is involved; the legitimate-rollback path stays
deferred to the next ReadWrite open as before.

Added `recovery_records_rolled_forward_for_stale_sidecar_after_successful_roll_forward`
integration test that synthesizes the state by writing a sidecar
whose `expected_version < manifest_pin` and asserts:
- audit row records `RolledForward` (not `RolledBack`)
- per-table outcome reports the correct `from_version` /
  `to_version` pair
- sidecar is deleted

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 20:12:43 +02:00
Ragnor Comerford
11a9b3c8b9
tests: assert actual total_people count, not just row count
Cubic flagged that the time-travel `total_people` assertions only
checked `result.num_rows() == 1`, which would still pass if the
historical query returned the wrong count (e.g., 10 instead of 6
because of a planner regression resolving against current state
instead of the captured snapshot). Added `assert_total` helper that
extracts the Int64 `total` column and verifies the actual value.

Replaces three weak `num_rows() == 1` assertions in
`composite_flow_multi_branch_sequential_merges`:

- post-both-merges: total = 10
- time-travel to pre-merge-a: total = 6
- post-reopen: total = 10

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 20:12:27 +02:00
Ragnor Comerford
0a6f3d796a
tests: extend multi-branch flow with .gq query checkpoints
The first cut of `composite_flow_multi_branch_sequential_merges` used
dataset-direct `count_rows` for read-side assertions, which proves
data is on disk but skips the query path entirely — planner, BTree
index lookup, edge traversal, aggregation, and snapshot resolution
all stay untested. Replaced with strategic `.gq` query checkpoints:

  - branch isolation via `get_person` after Eve insert (Eve visible
    on feat-a; absent on main)
  - 1-hop traversal via `friends_of(Grace)` after the Knows-edge
    insert (validates the topology index against branch-local edges)
  - post-merge query-engine readback after merge feat-a → main
    (Eve findable through BTree, Grace's edge traversable through
    the rebuilt Knows index)
  - aggregation via `total_people` after merge feat-b → main
    (count over a multi-fragment table whose shape is the result
    of two sequential merges)
  - time-travel via `ReadTarget::Snapshot(captured_id)` for both
    `total_people` and `friends_of` / `get_person` at the two
    pre-merge points (catches planner regressions where historical
    queries accidentally resolve current indices)
  - post-reopen query-engine readback (catches reopen-time index/
    catalog binding regressions)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 19:42:17 +02:00
Ragnor Comerford
9fc6526ec0
tests: multi-branch sequential merges compositional flow
Adds `composite_flow_multi_branch_sequential_merges` covering the
agent-workflow pattern that single-merge tests in `branching.rs`
cannot reach: two feature branches diverging from main with main
writes interleaved between every diverge point, sequential merges
into main, time-travel through the resulting merge DAG, and reopen
consistency over a multi-merge history.

The script (18 numbered steps with assertions per step):
  init+load → mutate main → branch feat-a → mutate main → mutate
  feat-a → branch feat-b → mutate feat-b → mutate feat-a (with
  edge) → merge feat-a → mutate main → merge feat-b → time-travel
  to pre-merge-a + pre-merge-b → reopen + verify.

Catches eight compositional gap categories that only surface with
≥2 merges and main mutations between them: base/LCA recomputation
across two merges, manifest-pin propagation through merge commits,
time-travel through merge DAG without state bleed-through, branch-
DAG consistency, sibling-branch isolation under writes elsewhere,
post-merge main-write integration, multi-merge reopen replay, and
clean-flow recovery-sidecar absence.

`composite_flow.rs` was added to `docs/testing.md` so the before-
every-task checklist points agents at the file before duplicating
coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 19:34:04 +02:00
Ragnor Comerford
58a3ff0e48
recovery: align merge sidecar branch with active_branch + record rollback drift
Two small PR #72 review findings addressed:

- merge.rs sidecar pin recorded `entry.table_branch` (where the table
  currently lives in the target manifest) instead of the merge target
  branch where commits actually land via `publish_rewritten_merge_table`
  → `open_for_mutation` → `fork_dataset_from_entry_state`. Recovery's
  `open_lance_head` would then check the wrong ref. Aligned with the
  pattern already used in `ensure_indices_for_branch` (table_ops.rs:115).
  Added `branch_merge_sidecar_pins_table_branch_to_active_branch`
  contract test that reads the sidecar JSON and asserts every per-pin
  `table_branch` equals the active (target) branch — catches the
  regression even when the values happen to coincide in the test setup.

- Rollback audit `from_version` previously equalled `to_version`
  (both `manifest_pinned`), telling operators nothing about the actual
  Lance HEAD drift before restore. Captured `lance_head` in
  `ClassifiedTable` and used it as `from_version` so audit rows now
  show "rolled back from v7 to v5" instead of "v5 → v5". Added
  `assert_rollback_outcomes_record_drift` invariant in the test helper,
  invoked automatically by every `RecoveryExpectation::RolledBack` test.
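
A minimal sketch of the corrected rollback outcome, assuming a hypothetical `TableOutcome` struct and helper names: `from_version` carries the pre-restore Lance HEAD and `to_version` the manifest pin, so the audit row shows the drift that was undone.

```rust
/// Hypothetical per-table audit outcome.
#[derive(Debug, PartialEq)]
pub struct TableOutcome {
    pub from_version: u64,
    pub to_version: u64,
}

/// Builds the rollback outcome from the captured Lance HEAD and the
/// manifest pin that the table was restored to.
pub fn rollback_outcome(lance_head: u64, manifest_pinned: u64) -> TableOutcome {
    TableOutcome {
        from_version: lance_head,
        to_version: manifest_pinned,
    }
}

/// Renders the outcome the way an operator would read it.
pub fn describe(outcome: &TableOutcome) -> String {
    format!(
        "rolled back from v{} to v{}",
        outcome.from_version, outcome.to_version
    )
}
```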

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 19:33:32 +02:00
Ragnor Comerford
815ff743f5
recovery: refresh-time roll-forward closes the in-process residual + invariants helper
Bundle of three correctness fixes plus a shared invariants helper that
existing tests now use.

1. SchemaApply atomicity: close the residual gap where a sidecar exists
   but staging files don't (e.g., Phase B failure BEFORE
   `_schema.pg.staging` write). `recover_schema_state_files` now returns
   a `SchemaStateRecovery` discriminator (`Noop` /
   `CleanedStaging` / `CompletedStagingRename { schema_apply_sidecar }`);
   the token threads through `recover_manifest_drift` →
   `process_sidecar`. SchemaApply sidecars are eligible for roll-forward
   ONLY when the staging rename completed in the same recovery pass.
   Full mode rolls back; RollForwardOnly defers. Without this, recovery
   would publish the manifest pin against new-schema data while
   `_schema.pg` stayed old (real corruption). New failpoint
   `schema_apply.before_staging_write` + new test
   `schema_apply_without_schema_staging_rolls_back_on_next_open` pin
   the gating.

2. Rollback target correction. Rollback now restores Lance HEAD to the
   current manifest pin (`state.manifest_pinned`) instead of the
   sidecar's `expected_version`. For UnexpectedAtP1/UnexpectedMultistep
   classifications these can differ; the old code could regress Lance
   HEAD past the manifest pin, re-introducing drift in the OTHER
   direction. The new behavior establishes `Lance HEAD == manifest pin`
   post-rollback — the canonical drift-free invariant. Param renamed
   from `expected_version` → `target_version` to match. Audit
   `to_version` records the actual restore target.

   This is a latent-behavior change. Any external consumer that compared
   `audit.to_version` against `sidecar.expected_version` for non-trivial
   classifications now sees the manifest pin instead.

3. Audit commit-graph unification. `record_audit` now opens the
   per-branch commit graph for ANY sidecar with `sidecar.branch.is_some()`
   — not just BranchMerge. Plain Mutation/Load/EnsureIndices commits on a
   feature branch now correctly land on that branch's commit graph,
   instead of main's. Closes the class of bug analogous to D2 but for
   non-merge writers.

   Pre-existing repos with non-main commits already on main's commit
   graph stay where they are; future recoveries write to the per-branch
   ref. Mixed-version compatibility is asymmetric but safe (old binaries
   ignore per-branch refs they don't know about; new binaries read both).

4. Recovery invariants helper + branch-axis cells. New
   `tests/helpers/recovery.rs` (~505 LOC) exports
   `assert_post_recovery_invariants(repo, op_id, RecoveryExpectation)`
   plus a `TableExpectation` builder. Six existing recovery tests
   refactored to call it; per-test bespoke assertions replaced. Two new
   branch-axis cells added in `tests/failpoints.rs`:
     - `recovery_rolls_forward_load_on_feature_branch`
     - `recovery_rolls_forward_ensure_indices_on_feature_branch`
   The loader gains a `mutation.post_finalize_pre_publisher` failpoint
   hook (gated on the `failpoints` feature; zero-cost in release) so the
   load test can pin the same Phase B → Phase C boundary the mutation
   path uses.
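
The SchemaApply gating from item 1 can be sketched as a single match. This is an illustration under assumptions, not the engine's code: `CompletedStagingRename` is reduced to a unit variant (the real one carries `schema_apply_sidecar`), and `SchemaApplyAction` and `schema_apply_action` are hypothetical names.

```rust
/// Simplified discriminator returned by schema-state recovery.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SchemaStateRecovery {
    Noop,
    CleanedStaging,
    CompletedStagingRename,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum RecoveryMode {
    Full,
    RollForwardOnly,
}

/// Hypothetical result of the gating decision for a SchemaApply sidecar.
#[derive(Debug, PartialEq)]
pub enum SchemaApplyAction {
    RollForward,
    RollBack,
    Defer,
}

/// Roll forward only when the staging rename completed in this same
/// recovery pass; otherwise Full mode rolls back and RollForwardOnly defers.
pub fn schema_apply_action(
    state: SchemaStateRecovery,
    mode: RecoveryMode,
) -> SchemaApplyAction {
    match (state, mode) {
        (SchemaStateRecovery::CompletedStagingRename, _) => SchemaApplyAction::RollForward,
        (_, RecoveryMode::Full) => SchemaApplyAction::RollBack,
        (_, RecoveryMode::RollForwardOnly) => SchemaApplyAction::Defer,
    }
}
```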

Misc:
   - `Omnigraph::refresh` extracts `reload_schema_if_source_changed`:
     early-return when schema source unchanged (saves IR parse + catalog
     rebuild on the steady-state refresh path).
   - New test injection point
     `failpoint_publish_table_head_without_index_rebuild_for_test`
     under `#[cfg(feature = "failpoints")]`.

Tests: 31 recovery + failpoint integration tests pass (14 + 17, up from
14 + 16). Full workspace sweep with `--features failpoints` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:04:48 +02:00
Ragnor Comerford
44c0d0bc4b
recovery: refresh reloads schema after staging recovery; non-main merge test pins parent_commit_id
E1. After D3 added recover_schema_state_files to refresh(), the
    in-memory `self.schema_source` and `self.catalog` were left stale:
    a SchemaApply sidecar processed via refresh would rename the
    staging files (`_schema.pg`, IR contract) into place but the
    handle continued operating against the old catalog. Subsequent
    operations would surface schema mismatches against post-migration
    data on disk.

    Fix: after recover_manifest_drift completes, refresh() now mirrors
    open_with_storage_and_mode's schema-load sequence — re-reads
    `_schema.pg`, parses IR via load_or_bootstrap_schema_contract,
    rebuilds the catalog with fixup_blob_schemas, and assigns into
    self.schema_source / self.catalog. Steady-state cost: one read +
    one parse per refresh; only mutates handle state when the on-disk
    schema actually changed.

E2. The non-main branch_merge recovery test
    (`branch_merge_phase_b_failure_recovered_on_non_main_target`)
    asserted only `merged_parent_commit_id` was non-null — but
    `merged_parent_commit_id` is independently populated from
    sidecar.merge_source_commit_id (the SOURCE branch's tip), so the
    assertion would pass even if D2's per-branch CommitGraph fix
    regressed (the bug was about `parent_commit_id`, the TARGET
    branch's tip).

    Fix: capture target_branch's commit-graph head BEFORE the failed
    merge by scanning target_branch's Lance ref on _graph_commits.lance
    and picking the latest commit by created_at. After recovery, find
    the recovery merge commit (the one with non-null
    merged_parent_commit_id) and assert its `parent_commit_id` ==
    captured pre-failure head. Without D2, recovery would record the
    GLOBAL head (the source_branch's insert-Carol commit on this test)
    instead, and the assertion fails.

    Also fixes the column-type cast: created_at is stored as
    TimestampMicrosecondArray, not Int64Array.

All workspace tests pass with --features failpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 12:06:17 +02:00
Ragnor Comerford
2ce4efc450
recovery: four review-round-4 fixes + branch-axis test matrix
D1. roll_forward_all returns per-table actual published versions; the
    audit row's `to_version` records that, not pin.post_commit_pin
    (the latter is a lower bound for loose-match writers SchemaApply /
    EnsureIndices / BranchMerge — pin.post_commit_pin = expected + 1
    while actual published HEAD can be expected + N).

D2. Branch-merge recovery audit uses CommitGraph::open_at_branch when
    sidecar.branch is Some, so the merge parent is the TARGET BRANCH's
    tip (not the global head). Without this, recovered branch_merge
    on a non-main target records the wrong merged_parent_commit_id and
    future merges between the same pair lose already-up-to-date
    detection / merge-base correctness.

D3. Omnigraph::refresh now mirrors open's recovery composition: runs
    recover_schema_state_files BEFORE recover_manifest_drift. Without
    this, a SchemaApply sidecar processed via refresh would publish
    the manifest + delete the sidecar without renaming the staging
    schema files, leaving the repo with new-schema data and old
    `_schema.pg` (real corruption). Refresh's docstring now enumerates
    each open-time recovery step it maintains, so the next maintainer's
    diff between open() and refresh() is trivial.

D4. ensure_indices sidecar pin records `active_branch` (where commits
    actually land), not `entry.table_branch` (where the table currently
    lives). On first fork-on-write, the processing loop's
    open_owned_dataset_for_branch_write forks to active_branch and the
    commit lands there — recovery's open_lance_head must check the
    same branch. Without this, recovery checks the wrong ref and
    misses Phase B drift entirely.

D5. Two new branch-axis tests:
    * recovery_rolls_back_feature_branch_sidecar_against_feature_branch
      — feature-branch rollback variant; asserts post-recovery audit
      kind == RolledBack and the actual restore commit landed on the
      feature ref.
    * branch_merge_phase_b_failure_recovered_on_non_main_target
      — non-main merge target variant; reads the target branch's
      commit graph (Lance ref) and asserts the recovery commit has
      a non-null merged_parent_commit_id (pins D2).

Bug pattern: all four are at composition seams between concepts that
were each tested individually (writer-precision × actual-Lance-HEAD;
branch-context × commit-graph-API; recovery-path × writer-kind; pin-
time-branch × commit-time-branch). The branch-axis matrix is the
cheapest mechanical prevention for D2/D4-class regressions.

All workspace tests pass with --features failpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 11:34:18 +02:00
Ragnor Comerford
aaa031e834
recovery: refresh-time roll-forward closes the in-process residual
Adds RecoveryMode { Full, RollForwardOnly } and wires Omnigraph::refresh
to invoke roll-forward-only recovery. This closes the documented
"long-running server between Phase B failure and process restart"
residual without requiring a restart, for the common case (mutation /
load finalize → publisher failure).

Why roll-forward only and not full sweep:
  * Roll-forward is safe under concurrency (publisher uses row-level
    CAS).
  * Roll-back uses Dataset::restore, which "wins" against concurrent
    Append/Update/Delete/CreateIndex/Merge per check_restore_txn —
    silently orphaning the concurrent writer's commit (pinned by
    tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning).
    Sidecars that classify as RollBack-eligible are LEFT ON DISK for the
    next ReadWrite open, where no concurrent writers exist and full
    restore is safe.

Implementation:
  * recovery.rs: RecoveryMode enum; recover_manifest_drift takes mode;
    process_sidecar branches on mode for Abort and RollBack — both
    defer to next ReadWrite open under RollForwardOnly. RollForward
    behavior unchanged.
  * omnigraph.rs: Omnigraph::refresh promoted to pub; calls
    recover_manifest_drift in RollForwardOnly mode after coordinator
    refresh. Steady-state cost: one list_dir of __recovery (early
    return on empty). Adds refresh_coordinator_only — pub(crate) —
    for engine-internal callers that hold an in-flight sidecar (the
    schema_apply lease-check + lock-release paths). Without this split,
    refresh would race the in-flight sidecar.
  * schema_apply.rs: switch all 6 internal db.refresh() call sites to
    refresh_coordinator_only().
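
The mode branching in `process_sidecar` can be sketched as below. `RecoveryMode` matches the enum named above; `Decision`, `Action`, and `plan_action` are hypothetical names used only to show the dispatch: RollForward always proceeds, while Abort and RollBack are deferred under RollForwardOnly.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum RecoveryMode {
    Full,
    RollForwardOnly,
}

/// Hypothetical outcome of classifying a sidecar.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Decision {
    RollForward,
    RollBack,
    Abort,
}

/// Hypothetical action recovery takes for the sidecar in this pass.
#[derive(Debug, PartialEq)]
pub enum Action {
    RollForward,
    RollBack,
    Abort,
    /// Leave the sidecar on disk for the next ReadWrite open.
    DeferToNextOpen,
}

pub fn plan_action(mode: RecoveryMode, decision: Decision) -> Action {
    match (mode, decision) {
        // Safe under concurrency in either mode.
        (_, Decision::RollForward) => Action::RollForward,
        // Restore-based paths are unsafe while other writers may be active.
        (RecoveryMode::RollForwardOnly, _) => Action::DeferToNextOpen,
        (RecoveryMode::Full, Decision::RollBack) => Action::RollBack,
        (RecoveryMode::Full, Decision::Abort) => Action::Abort,
    }
}
```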

Tests:
  * refresh_runs_roll_forward_recovery_in_process — trigger
    mutation.post_finalize_pre_publisher; without restart, call
    db.refresh(); assert sidecar deleted, drifted row visible,
    subsequent mutation succeeds.
  * refresh_defers_rollback_eligible_sidecar_to_next_open — synthesize
    a Mutation sidecar with bogus expected (UnexpectedAtP1 → RollBack);
    refresh leaves it on disk and Lance HEAD unchanged; drop and reopen
    runs the full sweep which advances HEAD via restore.

Docs:
  * docs/runs.md "Long-running servers" caveat updated to describe the
    refresh-time roll-forward path and the rollback-defer behavior.
  * docs/invariants.md §VI.23 status line updated to reflect in-process
    closure of the common case.

Workspace tests pass with --features failpoints; no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 00:15:42 +02:00
Ragnor Comerford
8c6506f5cd
recovery: close four correctness gaps (schema-apply, branch-aware, restore short-circuit, merge parent)
B1. Schema-apply atomicity. Before this commit, a failure between
    `_schema.pg.staging` write and the manifest publish left the repo
    corrupt: Lance HEADs advanced under the new schema, manifest stayed
    at old pins, and on reopen schema-state recovery deleted the staging
    files (manifest's table set still matched the live schema), then
    manifest-drift recovery rolled the table versions forward — leaving
    new-schema data on disk with the old `_schema.pg` live.

    Fix: a SchemaApply sidecar is the marker that Phase B completed but
    Phase C didn't. New helper `has_schema_apply_sidecar` is consulted
    by `recover_schema_state_files` BEFORE its disambiguation logic;
    when present, it completes the staging→final rename so the
    subsequent manifest-drift roll-forward sees the new catalog.

B2. Branch-aware recovery. Sidecars from feature-branch writers were
    being classified against main's snapshot and main's Lance HEAD,
    silently no-op'ing or rolling back the wrong table version (the
    classifier saw NoMovement; the writer's drift on the feature branch
    persisted; subsequent feature writers surfaced
    ExpectedVersionMismatch).

    Fix: SidecarTablePin gets an optional `table_branch` field;
    `recover_manifest_drift` opens a per-branch coordinator
    (`GraphCoordinator::open_branch`) per sidecar; `open_lance_head`,
    `restore_table_to_version`, and `roll_forward_all` honor the pin's
    branch via `Dataset::checkout_branch`.

B3. Remove fragment-id short-circuit in `restore_table_to_version`.
    Equal fragment IDs do NOT imply equal content: Lance index commits
    and deletion-vector updates change the manifest without touching
    fragment IDs. Skipping restore in those cases would leave Lance HEAD
    ahead of the manifest with no recovery artifact left. Restore is
    now unconditional; pile-up under repeated mid-rollback crashes is
    bounded and reclaimed by `omnigraph cleanup`.

B4. Recovered branch_merge records merge parent. `record_audit` always
    called `append_commit`, dropping `merged_parent_commit_id`. Future
    `branch_merge feature -> main` between the same pair lost
    already-up-to-date detection. RecoverySidecar gets an optional
    `merge_source_commit_id`; `branch_merge_on_current_target`
    populates it from `source_head_commit_id`; `record_audit`
    dispatches to `append_merge_commit` when present.

New tests: feature-branch sidecar classification (B2); B1 deepens the
existing schema_apply test with live-`_schema.pg` and new-type
assertions; B4 deepens the existing branch_merge test by reading
`_graph_commits.lance` and asserting a non-null `merged_parent_commit_id`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 23:39:41 +02:00
Ragnor Comerford
35c4b16e91
recovery: address five outstanding review findings
A1. tests/recovery.rs: rewrite recovery_multi_sidecar_requires_fresh_snapshot_for_correctness
    to use real `append_batch` instead of fragment-preserving `delete_where("1 = 2")`.
    The previous setup made restore_table_to_version's fragment-set short-circuit
    no-op the bug path, so the load-bearing `HEAD == v3` assertion passed in both
    bug and fix paths. Real appends produce different fragment-id sets across v1,
    v2, v3 so a real restore actually runs in the bug path (HEAD becomes v4).
    Added a person_batch helper matching the post-init Lance schema (id, age, name).

A2. exec/merge.rs: filter recovery sidecar pins to `RewriteMerged` candidates
    only. `AdoptSourceState`'s pure-pointer-switch and fork subcases don't
    advance Lance HEAD; pinning them would force NoMovement on recovery and
    trigger an all-or-nothing rollback that destroys legit RewriteMerged work.
    Documented residual: AdoptSourceState subcases that internally call
    publish_rewritten_merge_table aren't covered by the sidecar; closing that
    requires pre-computing source deltas during candidate classification (a
    structural change to CandidateTableState) — left as follow-up.

A3. db/omnigraph/table_ops.rs: add the same branch filter
    (`active_branch.is_some() && entry.table_branch.is_none() => continue`)
    to the ensure_indices sidecar pin loop that the processing loop already
    has. Without this, main-branch tables that need index work get pinned but
    never committed when ensure_indices runs on a feature branch → NoMovement
    → all-or-nothing rollback destroys feature-branch work.

A4. tests/failpoints.rs: deepen schema_apply_phase_b_failure and
    branch_merge_phase_b_failure tests with post-recovery manifest-pin advance
    assertions. branch_merge test setup also mutates main so the merge
    produces at least one RewriteMerged candidate (required after A2's pin
    filter — a no-op merge with all-AdoptSourceState would write no sidecar).
    Fixed stale "BranchMerge is strict-classified" comment to reflect current
    loose classification.

A5. tests/composite_flow.rs: remove duplicate back-to-back `total_people`
    query in step 12.

Full workspace test sweep with --features failpoints passes: no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 15:09:58 +02:00
Ragnor Comerford
05e52f2ee0
recovery: rename composite test, strip ticket references, address review
Three bundled changes:

1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and
   the test function). OmniGraph is consumed by both humans and agents;
   naming the test after one audience misframes the library.

2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and
   review-round labels from source, tests, and docs added by this
   branch. Internal traceability belongs in commit messages and PR
   descriptions, not in checked-in artifacts. Upstream
   lance-format/lance issue refs and pre-existing MR-XXX refs in docs
   not touched by this branch are left alone.

3. Two outstanding review findings addressed:
   - `needs_index_work_node` / `needs_index_work_edge`: propagate
     `count_rows` errors instead of `unwrap_or(0)`. Silently treating
     transient I/O failures as "0 rows" risked skipping a table from
     the recovery sidecar pin set that was actually about to be
     modified.
   - `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`:
     strengthen the assertion to fail when sidecar B classifies under
     a stale snapshot. The new assertion checks post-recovery Lance
     HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar
     deleted + audit rows present" pair passed in both the bug and
     fix paths because both delete the sidecar and write an audit
     row; the differentiator is the post-recovery HEAD. Strengthening
     the assertion exposed an additional nuance: in this overlapping-
     sidecar scenario sidecar B's audit kind is RolledBack (no-op)
     rather than RolledForward, since sidecar A's roll-forward
     publishes Lance HEAD as the new manifest pin (absorbing B's
     work). The docstring now explains why this is correct given
     current `roll_forward_all` semantics.
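
The `count_rows` change in the first finding comes down to the difference below. This is a sketch under assumptions: the row count is passed in as a `Result` rather than read from a dataset, the function names are hypothetical, and "rows > 0 means index work is needed" is a placeholder criterion, not the engine's actual rule.

```rust
use std::io;

/// Old shape: a failed count is silently treated as zero rows, so a
/// transient I/O error looks like "no work needed".
pub fn needs_index_work_lossy(rows: Result<u64, io::Error>) -> bool {
    rows.unwrap_or(0) > 0
}

/// New shape: the error propagates to the caller, which can fail the
/// operation instead of dropping the table from the sidecar pin set.
pub fn needs_index_work(rows: Result<u64, io::Error>) -> Result<bool, io::Error> {
    Ok(rows? > 0)
}
```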

All workspace tests pass with --features failpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 13:56:36 +02:00
Ragnor Comerford
78cc548846
tests: composite agent-lifecycle integration test (MR-858)
Implements MR-858 ahead of the rest of the MR-857 epic: the deterministic
narrative test counterpart to MR-783's randomized harness.

`tests/agent_lifecycle.rs::agent_lifecycle_init_load_branch_merge_time_travel_optimize_cleanup`
walks the canonical agent flow end to end:

  1. init repo with TEST_SCHEMA
  2. load_jsonl seed data (4 Person + 2 Company nodes; Knows + WorksAt edges)
  3. branch_create feature off main
  4. mutate on feature: single-statement insert (Eve) + multi-statement
     insert+edge (Frank knows Eve)
  5. query on feature: total_people / friends_of (traversal) /
     unemployed (anti-join) / friend_counts (aggregation)
  6. mutate on main (set Bob's age) — sets up non-conflicting merge
  7. branch_merge feature → main; verify version advance
  8. query post-merge: confirm Eve visible on main (from feature) +
     Bob visible (from main mutation, carried through merge)
  9. snapshot_at_version(pre_merge_version): time-travel still sees
     pre-merge state (4 Persons, no Eve)
 10. optimize the post-merge graph; verify reads still work + counts
     unchanged
 11. cleanup with --keep 10 --older-than 3600s (no-op for this short
     test, but exercises the call path)
 12. drop + reopen; verify all counts + branch list consistent;
     confirm read path works post-cleanup-reopen

**Known limitation surfaced**: post-optimize mutation path in step 11
hit `ExpectedVersionMismatch` because `optimize_all_tables` advances
per-table Lance HEAD without updating the `__manifest` pin
(`db/omnigraph/optimize.rs:77`), and something between optimize and
re-open writes a higher version row to `__manifest`. Test documents
this and defers full coverage to MR-859 (`omnigraph optimize` +
`cleanup` integration coverage), keeping the read-path-after-cleanup
assertion which is the headline operator concern.

Test runs in <1s. ~672 workspace tests pass with --features
failpoints; no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 13:10:28 +02:00
Ragnor Comerford
26b4c61d44
recovery: address PR #72 round-2 review findings
Bot reviewers (cubic + cursor) flagged 5 follow-on issues after the
first fix push. Three are real bugs in the Phase 6-8 ensure_indices
sidecar wiring; two are AI-slop flags on shallow tests. One cursor
finding is a false positive on intentional node/edge index asymmetry.

Real bugs fixed:

- needs_index_work_node and needs_index_work_edge now skip empty
  tables (count_rows == 0). The ensure_indices_for_branch loop has
  `if row_count > 0 { build_indices(...) }`, so empty tables produce
  zero commit_staged calls. Pinning them in the sidecar would force
  NoMovement classification on recovery and trigger the all-or-nothing
  rollback of any sibling table's legitimate index work (cubic #1).

- needs_index_work_node and needs_index_work_edge now respect the
  table_branch parameter from the snapshot entry, instead of always
  passing None (== main). For branch writes, opening the wrong HEAD
  could miss recoverable Phase B commits (cubic #2).

- needs_index_work_edge documented as intentionally BTree-only (mirrors
  the build_indices_on_dataset_for_catalog edge branch which only
  builds id/src/dst BTrees). Cursor flagged FTS/vector omission as
  inconsistency with the node helper; confirmed intentional via
  inline comment so future readers know the asymmetry is on purpose
  (cursor finding, false positive marked).
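
The empty-table rule can be sketched as a filter over per-table facts. `TableState` and `tables_to_pin` are hypothetical names, and the rule itself (pin only tables that have rows and lack an index) is this sketch's reading of the fix, not the actual helper bodies.

```rust
// Hypothetical per-table facts; the real helpers read these from Lance.
struct TableState {
    key: &'static str,
    row_count: u64,
    missing_indices: bool,
}

// Assumed rule: pin a table only if the index-build loop would actually
// commit against it, i.e. it has rows and lacks at least one index.
fn tables_to_pin(tables: &[TableState]) -> Vec<&'static str> {
    tables
        .iter()
        .filter(|t| t.row_count > 0 && t.missing_indices)
        .map(|t| t.key)
        .collect()
}
```

An empty table with missing indices is skipped, so it can never be classified as NoMovement next to a sibling table that did move.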

Test improvements:

- recovery_multi_sidecar_requires_fresh_snapshot_for_correctness — new
  integration test that uses TWO sidecars on the SAME table where
  sidecar B's expected_version equals sidecar A's post_commit_pin.
  Sidecar B's classification only succeeds if the recovery sweep
  refreshes the snapshot between iterations to see A's manifest
  update. Without the refresh fix from the prior commit, B would be
  classified against stale pins (cubic #4 follow-up).

- recovery_ensure_indices_handles_empty_tables — new integration test
  that runs ensure_indices on an all-empty repo. With the round-2 fix,
  both initial and steady-state runs leave no sidecar (zero pins ⇒
  zero sidecar I/O). Without the empty-table fix, the sidecar would
  pin Company (zero rows but missing indices) and force a NoMovement
  rollback (cubic #1 verification).

- ensure_indices_phase_b_failure_does_not_leak_sidecar_when_no_work_needed —
  renamed/rewrote the prior `_recovered_on_next_open` test to assert
  the post-fix invariant: when load_jsonl auto-built every catalog
  index via prepare_updates_for_commit, ensure_indices's needs_work
  helpers correctly report zero pins and produce no sidecar. The old
  assertion ("exactly one sidecar must persist") was wrong for the
  scoped behavior.

Test surface (post-round-2):
- 25 unit tests in db::manifest::recovery (BranchMerge classifier,
  sort order, primitives — unchanged).
- 12 integration tests in tests/recovery.rs (+2 from this commit).
- 11 failpoint tests including the four per-writer Phase B → recovery
  tests (one renamed to reflect the scoped behavior).
- ~672 workspace tests pass with --features failpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 12:50:33 +02:00
Ragnor Comerford
164bafbbe7
recovery: address PR #72 review findings
Bot reviewers (cubic, cursor, chatgpt-codex) caught 4 merge-blocking
bugs + 3 strongly-recommended fixes + 3 doc errors in the initial PR.
Each fix has a paired test demonstrating the bug before the fix.

Merge-blocking fixes:

- BranchMerge moved to loose-match classifier arm. publish_rewritten_
  merge_table runs multiple commit_staged calls per table (merge_insert
  + delete_where + index rebuilds). Strict classification rolled back
  valid completed Phase B work as UnexpectedMultistep. Three new unit
  tests pin the loose-match behavior for BranchMerge.

- branch_merge sidecar uses self.active_branch() (the resolved target
  branch) instead of inferring from the first sorted table key. The
  previous heuristic could record None (== main) when the merge target
  was a non-main branch, causing recovery to publish to the wrong
  manifest namespace.

- Best-effort sidecar delete in all 5 writer sites (mutation, loader,
  schema_apply, branch_merge, ensure_indices). Previously, a sidecar
  cleanup failure after a successful manifest publish would error out
  the user's call for a write that already landed. Now: log a warning
  and ignore — the next open's recovery sweep tidies the stale sidecar
  via NoMovement classification.

- ensure_indices sidecar scoped to tables that need work via new
  helpers needs_index_work_node / needs_index_work_edge. Previously
  the sidecar pinned every catalog table; if only one needed indexing,
  the others classified as NoMovement and the all-or-nothing decision
  rolled back legitimate index work.
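
A sketch of the best-effort delete shape, assuming a hypothetical fallible `delete_sidecar(uri, fail)` (the `fail` flag exists only to simulate a storage error) and a wrapper name chosen for illustration.

```rust
// Hypothetical delete that may fail, e.g. on a transient storage error.
fn delete_sidecar(uri: &str, fail: bool) -> Result<(), String> {
    if fail {
        Err(format!("could not delete {uri}"))
    } else {
        Ok(())
    }
}

// The manifest publish has already succeeded when this runs, so a
// cleanup failure is logged and swallowed instead of failing the write.
// Returns whether the sidecar was actually removed.
fn delete_sidecar_best_effort(uri: &str, fail: bool) -> bool {
    match delete_sidecar(uri, fail) {
        Ok(()) => true,
        Err(err) => {
            eprintln!("warning: leaving stale sidecar for next open: {err}");
            false
        }
    }
}
```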

Strongly-recommended fixes:

- recover_manifest_drift now takes &mut GraphCoordinator and refreshes
  between sidecars. Sidecar B's classification needs to see sidecar
  A's manifest changes, otherwise B can be classified against stale
  pins and incorrectly roll back work that just landed.

- list_sidecars sorts URIs before reading. Sidecar filenames are
  ULIDs (chronologically sortable), so this gives deterministic,
  time-ordered processing. Filesystem order was nondeterministic.

- ReadOnly opens skip recover_schema_state_files too (was: only the
  MR-847 sweep was gated). Read-only consumers may run with read-only
  credentials; silent open-time mutations violate the contract.
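
A toy model of why the sweep sorts sidecars and refreshes between them. Everything here is a simplification for illustration: `Sidecar` carries one table, `Seen` collapses the classifier to two outcomes, and "roll forward" just copies the current HEAD into an in-memory manifest map.

```rust
use std::collections::HashMap;

// Hypothetical, trimmed-down sidecar covering a single table.
struct Sidecar {
    ulid: &'static str,
    table: &'static str,
    expected_version: u64,
}

#[derive(Debug, PartialEq)]
enum Seen {
    NoMovement,
    Drift,
}

// Processes sidecars in ULID order. With `refresh_between == false`,
// every sidecar is compared against a snapshot taken before the loop.
fn sweep(
    manifest: &mut HashMap<&'static str, u64>,
    heads: &HashMap<&'static str, u64>,
    mut sidecars: Vec<Sidecar>,
    refresh_between: bool,
) -> Vec<Seen> {
    // ULIDs sort chronologically under plain string ordering.
    sidecars.sort_by_key(|s| s.ulid);
    let stale = manifest.clone();
    let mut seen = Vec::new();
    for sc in &sidecars {
        let pinned = if refresh_between { manifest[sc.table] } else { stale[sc.table] };
        let head = heads[sc.table];
        if head == pinned {
            seen.push(Seen::NoMovement);
        } else {
            seen.push(Seen::Drift);
            if pinned == sc.expected_version {
                // Toy roll-forward: publish the current HEAD as the pin.
                manifest.insert(sc.table, head);
            }
        }
    }
    seen
}
```

With a refresh, the second sidecar sees the pin the first one just published and has nothing left to do; without it, the second sidecar still sees drift against the old pin.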

Doc cleanups:

- Removed stale "Phase 4 placeholder" comment from
  recover_manifest_drift.
- docs/runs.md decision-tree wording now correctly surfaces the
  InvariantViolation abort path.
- docs/branches-commits.md clarifies actor_id is in
  _graph_commit_actors.lance (joined by graph_commit_id), not on
  _graph_commits.lance itself.

Test surface (post-fixes):
- 25 unit tests in db::manifest::recovery (+4 from this commit).
- 10 integration tests in tests/recovery.rs (+3 from this commit).
- ~672 tests across ~25 binaries pass with --features failpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 12:21:40 +02:00
Ragnor Comerford
932334ba01
recovery: document MR-847 ship across all reference docs (Phase 10)
Update the doc surface to reflect MR-847 having shipped end to end —
sidecar protocol, classifier, all-or-nothing decision tree, roll-forward
via ManifestBatchPublisher, roll-back via Dataset::restore with
fragment-set short-circuit, audit trail in
_graph_commit_recoveries.lance, OpenMode::{ReadWrite, ReadOnly}, and
the four migrated writers all carrying sidecars across Phase B → Phase C.

- docs/invariants.md §VI.23: change from "upheld at the writer-trait
  surface for inserts/updates/etc., per-table commit_staged → manifest
  publish window remains" to "upheld at the writer-trait surface AND
  across process boundaries". The MR-847 sweep closes the residual on
  the next Omnigraph::open. The "continuous in-process" property
  (no ExpectedVersionMismatch surfacing to subsequent writers between
  Phase B failure and process restart) is tracked as follow-up in MR-856.

- docs/runs.md: replace "Finalize → publisher residual" section with
  "Open-time recovery sweep (MR-847)" — describes the sidecar protocol
  lifecycle (Phases A-D), the sweep's classifier + decision dispatch,
  the audit trail, and the operator-facing query
  (omnigraph commit list --filter actor=omnigraph:recovery).

- AGENTS.md capability matrix "Atomic single-dataset commits" row:
  drop the "Layer (3) is not yet shipped — tracked in MR-847" caveat;
  describe the three layers as all shipping; reference MR-856 for the
  background-reconciler follow-up.

- docs/storage.md: add _graph_commit_recoveries.lance and
  __recovery/{ulid}.json to the on-disk layout (mermaid + prose).

- docs/branches-commits.md: new "Recovery audit trail (MR-847)"
  subsection describing the join from
  _graph_commits.lance:actor_id="omnigraph:recovery" to
  _graph_commit_recoveries.lance:graph_commit_id for operator
  post-mortem.

- docs/maintenance.md: note the MR-847 recovery floor on cleanup —
  --keep < 3 may garbage-collect Lance versions the recovery sweep
  needs as a rollback target. Default --keep 10 is safe.

- docs/testing.md: add tests/recovery.rs to the engine integration-test
  table; expand the failpoints.rs row to mention the four MR-847
  per-writer Phase B → recovery integration tests.

- .context/mr-847-design.md: prepend a "Status: DONE" stanza listing
  every commit hash + scope across phases 1-10.

AGENTS.md ↔ docs/ cross-link check passes (26 links, 26 docs).
Full workspace test sweep passes with --features failpoints (361 tests
across 20 binaries).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:24 +02:00
Ragnor Comerford
72d3da66de
recovery: per-writer Phase B failure → recovery integration tests (Phase 9)
Add the three paired per-writer tests required by MR-847's acceptance
criteria — "All four migrated writers ... have paired Phase B → recovery
integration tests."

Production additions (~10 LOC):
- New failpoint `branch_merge.post_phase_b_pre_manifest_commit` in
  `exec/merge.rs::branch_merge_on_current_target` between the per-table
  publish loop and `commit_manifest_updates`.
- New failpoint `ensure_indices.post_phase_b_pre_manifest_commit` in
  `db/omnigraph/table_ops.rs::ensure_indices_for_branch` between the
  per-table loops and `commit_prepared_updates_on_branch`.
- For schema_apply, the existing `schema_apply.after_staging_write`
  failpoint already fires in the right window (after the per-table
  rewrites + index builds, before the manifest publish).

Sidecar tweak:
- `schema_apply` sidecar's `branch` is now `None` (was
  `Some("__schema_apply_lock__")`). The lock branch is purely a
  serialization sentinel; `coordinator.commit_changes_with_actor`
  publishes against the coordinator's pre-lock branch (main). After
  the failpoint fires, `release_schema_apply_lock` removes the lock
  branch — if the sidecar referenced it, the recovery sweep would try
  to publish to a branch that no longer exists and fail. Fix: record
  the actual publish target.

Tests added in `tests/failpoints.rs` (~280 LOC):
- `schema_apply_phase_b_failure_recovered_on_next_open` — seeds a row,
  opens, attempts a schema apply that adds a new node type + a new
  property (the new type ensures the table set differs so
  `recover_schema_state_files` doesn't trip on property-only
  ambiguity), failpoint fires, drops engine, reopens, asserts sidecar
  deleted + audit row recorded.
- `branch_merge_phase_b_failure_recovered_on_next_open` — seeds main,
  branches off, mutates the branch, attempts merge with the
  `branch_merge.post_phase_b_pre_manifest_commit` failpoint active.
  Same recovery shape.
- `ensure_indices_phase_b_failure_recovered_on_next_open` — seeds
  rows, attempts ensure_indices with the
  `ensure_indices.post_phase_b_pre_manifest_commit` failpoint active.

After this commit, all four migrated writers have paired
Phase B → recovery tests:
- mutate_as / load: `recovery_rolls_forward_after_finalize_publisher_failure` (Phase 5)
- schema_apply: `schema_apply_phase_b_failure_recovered_on_next_open`
- branch_merge: `branch_merge_phase_b_failure_recovered_on_next_open`
- ensure_indices: `ensure_indices_phase_b_failure_recovered_on_next_open`

11 failpoint tests pass; full workspace lib + integration tests pass
(350+ tests across 20 binaries).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:24 +02:00
Ragnor Comerford
c6827919ca
recovery: wire sidecar into schema_apply, branch_merge, ensure_indices (Phases 6-8)
Three writers each follow the same shape established in Phase 5: build
SidecarTablePin list before the per-table commit_staged loop, write the
sidecar via recovery::write_sidecar, do the existing work, delete the
sidecar after the manifest publish succeeds.
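
The shared writer shape can be sketched against an in-memory stand-in. `Store`, `Outcome`, and `write_with_sidecar` are hypothetical names; the four steps follow the sidecar / commit_staged loop / manifest publish / sidecar delete ordering, with a flag simulating a failure between the loop and the publish.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    Published,
    FailedBeforePublish,
}

// Toy state: one Lance HEAD and one manifest pin per table.
struct Store {
    sidecar_present: bool,
    heads: Vec<u64>,
    manifest: Vec<u64>,
}

fn write_with_sidecar(store: &mut Store, fail_before_publish: bool) -> Outcome {
    // Phase A: write the sidecar before touching any table.
    store.sidecar_present = true;
    // Phase B: per-table commit loop advances each HEAD.
    for head in store.heads.iter_mut() {
        *head += 1;
    }
    if fail_before_publish {
        // The sidecar stays behind for the next open's sweep.
        return Outcome::FailedBeforePublish;
    }
    // Phase C: publish the new pins.
    store.manifest = store.heads.clone();
    // Phase D: delete the sidecar only after the publish succeeded.
    store.sidecar_present = false;
    Outcome::Published
}
```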

Loose-match classifier (recovery.rs):

The classifier now distinguishes strict vs. loose match per
SidecarKind. Strict (Mutation, Load, BranchMerge): exactly one
commit_staged per table; lance_head == manifest_pinned + 1 AND
post_commit_pin == lance_head required. Loose (SchemaApply,
EnsureIndices): the writer may run N >= 1 commit_staged calls per
table — index builds + rewrites compound, and the exact N is hard to
compute at sidecar-write time. Loose accepts any
lance_head > manifest_pinned (with expected_version still matching the
manifest pin) as RolledPastExpected. The risk it admits — an external
agent advancing HEAD between sidecar write and recovery — is out of
scope for the single-coordinator model (MR-668 territory).
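
A sketch of the strict vs. loose split as described above. The explicit `MatchMode` parameter is an assumption (the real classifier derives the mode from `SidecarKind`), and the two early returns for NoMovement and InvariantViolation are this sketch's reading rather than rules quoted from the code.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum MatchMode {
    Strict, // exactly one commit_staged per table
    Loose,  // N >= 1 commit_staged calls per table
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum TableClassification {
    NoMovement,
    RolledPastExpected,
    UnexpectedAtP1,
    UnexpectedMultistep,
    InvariantViolation,
}

struct SidecarTablePin {
    expected_version: u64,
    post_commit_pin: u64,
}

fn classify_table(
    mode: MatchMode,
    pin: &SidecarTablePin,
    lance_head: u64,
    manifest_pinned: u64,
) -> TableClassification {
    // Assumption: HEAD equal to the pin means nothing moved.
    if lance_head == manifest_pinned {
        return TableClassification::NoMovement;
    }
    // Assumption: HEAD behind the pin, or a sidecar written against a
    // different pin, is treated as an invariant violation.
    if lance_head < manifest_pinned || pin.expected_version != manifest_pinned {
        return TableClassification::InvariantViolation;
    }
    match mode {
        // Loose: any HEAD past the pin counts as completed work.
        MatchMode::Loose => TableClassification::RolledPastExpected,
        // Strict: more than one step past the pin is unexpected.
        MatchMode::Strict if lance_head > manifest_pinned + 1 => {
            TableClassification::UnexpectedMultistep
        }
        // Strict: exactly one step, and the recorded pin must match HEAD.
        MatchMode::Strict if pin.post_commit_pin == lance_head => {
            TableClassification::RolledPastExpected
        }
        MatchMode::Strict => TableClassification::UnexpectedAtP1,
    }
}
```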

roll_forward_all now reads the CURRENT Lance HEAD per table (not the
sidecar's post_commit_pin) so the manifest publish reflects whatever
HEAD landed, even if the loose-match writer committed multiple times
per table.

Per-writer wiring:

- schema_apply::apply_schema_with_lock: sidecar covers
  rewritten_tables ∪ indexed_tables (the tables that go through
  stage_overwrite/stage_create_index commit_staged). Skips
  added_tables (fresh datasets, no Phase B residual class) and
  renamed_tables (handled by the existing schema-state staging
  recovery in recover_schema_state_files).
- branch_merge::branch_merge_on_current_target: sidecar covers every
  table in candidates (publish_adopted_source_state +
  publish_rewritten_merge_table do the per-table commit_staged work).
  Sidecar writes after validate_merge_candidates and deletes after
  commit_manifest_updates.
- ensure_indices_for_branch: sidecar covers every node + edge type in
  the catalog with a manifest entry (build_indices_on_dataset is
  per-table-per-index commit_staged). Skips when the catalog has
  nothing — steady-state calls incur no sidecar I/O when the manifest
  already pins all expected types.

Allow recovery_audit.rs in forbidden_apis.rs:

The new db/recovery_audit.rs uses Dataset::write to bootstrap the
_graph_commit_recoveries.lance dataset (same pattern as
commit_graph.rs which is already allow-listed). Add it to the
ALLOW_LIST_FILES list in tests/forbidden_apis.rs.

8 new unit tests in db::manifest::recovery cover the loose-match
classifier branches (SchemaApply + EnsureIndices accept multi-commit
drift, NoMovement and InvariantViolation behave the same as strict).

All 20 test binaries pass (350+ tests across the workspace).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:24 +02:00
Ragnor Comerford
49ca7e5068
recovery: wire sidecar into MutationStaging::finalize + flip headline test (Phase 5)
Production wiring (~120 LOC):

- `MutationStaging::finalize` now takes a `SidecarKind` parameter and
  returns an additional `Option<RecoverySidecarHandle>`. Builds a
  Vec<SidecarTablePin> from `pending` BEFORE the per-table commit_staged
  loop and writes the sidecar via `recovery::write_sidecar`. Skips the
  sidecar when `pending` is empty (delete-only mutation; D₂ keeps these
  out of the staged-write path so the option is just a clean signal,
  not a code path users hit).
- `exec/mutation.rs::execute_mutation_as` (around line 740): destructure
  the new third element, pass `SidecarKind::Mutation`, delete the
  sidecar after `commit_updates_on_branch_with_expected` succeeds.
- `loader/mod.rs::ingest_loaded` (around line 540): same shape, with
  `SidecarKind::Load`. The Overwrite path stays inline-commit (legacy
  residual; out of MR-847 scope per docs/runs.md).
- New engine accessors `Omnigraph::storage_adapter()` and
  `Omnigraph::root_uri()` for the sidecar I/O. The pre-existing
  `db.storage` field stays private; no other engine code reaches around
  the accessor.
- Re-exports from `db::manifest`: `new_sidecar`, `write_sidecar`,
  `delete_sidecar`, plus the `RecoverySidecar*` types and `SidecarKind`,
  so consumers in `exec/` can use them via `crate::db::manifest::...`.

Bugfix folded in (~5 LOC): make `coordinator` mutable in
`Omnigraph::open_with_storage_and_mode` and call `coordinator.refresh()`
after the recovery sweep returns. Roll-forward advances the manifest
pin on disk; without the refresh the returned engine carried a stale
in-memory snapshot. The Phase 4 tests passed only because they
opened Lance datasets directly rather than going through `db.snapshot()`.

Storage adapter (~15 LOC): `LocalStorageAdapter::write_text` now ensures
the parent directory exists via `tokio::fs::create_dir_all`. Required
because the sidecar protocol writes into `__recovery/` which doesn't
pre-exist after `Omnigraph::init`. S3 has no equivalent; PutObject is
path-agnostic.

Headline test flip (~150 LOC):

- `tests/failpoints.rs::finalize_publisher_residual_drifts_lance_head_until_next_writer_recovers`
  is replaced by `recovery_rolls_forward_after_finalize_publisher_failure`.
  Same setup (failpoint at `mutation.post_finalize_pre_publisher`) but
  after the synthetic failure the test:
  1. Asserts the sidecar persists in `__recovery/` for the recovery
     sweep to find.
  2. Drops the engine handle.
  3. Reopens via `Omnigraph::open` — recovery sweep classifies
     RolledPastExpected, decides RollForward, publishes the manifest
     update, records the audit row, deletes the sidecar.
  4. Asserts the sidecar is gone.
  5. Asserts the originally-attempted Eve insert is now visible
     (Person count = 1).
  6. Asserts a subsequent insert succeeds without
     ExpectedVersionMismatch (Person count = 2).
  7. Asserts the audit dataset `_graph_commit_recoveries.lance` exists.
  This is the headline contract the MR-847 acceptance criteria require.

All other failpoint and runs tests continue to pass (8 + 24 unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:24 +02:00
Ragnor Comerford
ca21e73d43
recovery: roll-forward execution + audit row (Phase 4)
Implement the remaining half of the open-time recovery sweep.

Roll-forward execution (db/manifest/recovery.rs::roll_forward_all):
constructs a GraphNamespacePublisher directly (recovery runs inside
Omnigraph::open before the engine struct exists, so we can't go through
Omnigraph::commit_updates_on_branch_with_expected). Builds a
ManifestChange::Update per sidecar table reading row_count and
TableVersionMetadata from the dataset at post_commit_pin (cheap;
manifest-level reads, not a row scan), then calls publisher.publish with
expected_table_versions = sidecar.expected_version per table. Single
__manifest CAS extends every pin atomically — all-or-nothing at the
substrate. Persistent CAS contention surfaces as the typed
ExpectedVersionMismatch error and leaves the sidecar in place for the
next open's retry.
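
The all-or-nothing compare-and-swap can be illustrated with an in-memory map. This is a toy, not the `GraphNamespacePublisher` API: `Update`, `publish`, and the fields on `ExpectedVersionMismatch` are assumptions made for the sketch.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct ExpectedVersionMismatch {
    table: &'static str,
    expected: u64,
    actual: u64,
}

struct Update {
    table: &'static str,
    expected_version: u64,
    new_version: u64,
}

// Applies every update or none: all expectations are checked before
// any pin is changed, so a mismatch leaves the manifest untouched.
fn publish(
    manifest: &mut HashMap<&'static str, u64>,
    updates: &[Update],
) -> Result<(), ExpectedVersionMismatch> {
    for u in updates {
        let actual = manifest.get(u.table).copied().unwrap_or(0);
        if actual != u.expected_version {
            return Err(ExpectedVersionMismatch {
                table: u.table,
                expected: u.expected_version,
                actual,
            });
        }
    }
    for u in updates {
        manifest.insert(u.table, u.new_version);
    }
    Ok(())
}
```

On a mismatch the caller keeps the sidecar, which mirrors the retry-on-next-open behaviour described above.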

Audit model (new crates/omnigraph/src/db/recovery_audit.rs +
record_audit() in recovery.rs): each successful recovery sweep records
a graph-commit row tagged with actor_id="omnigraph:recovery" plus a
row in a new sibling table _graph_commit_recoveries.lance carrying
recovery_kind (RolledForward | RolledBack), recovery_for_actor (the
sidecar's original actor_id), operation_id (sidecar ULID),
sidecar_writer_kind, per_table_outcomes (JSON-serialized for schema
flexibility), and created_at. Operators investigating "did my mutation
land?" can find the answer via `omnigraph commit list --filter
actor=omnigraph:recovery` joined to the recoveries table by
graph_commit_id.

The sibling-table choice avoids bumping INTERNAL_MANIFEST_SCHEMA_VERSION
or migrating _graph_commits.lance. Same not-atomic-pair-write shape as
the existing _graph_commits + _graph_commit_actors split — a crash
between the two sequential writes leaves an orphan commit row with no
recovery row. Recovery sweep tolerates this: re-entry classifies
already-restored / already-published tables as NoMovement, the action
is a no-op, and the audit append is retried.

Note on classifier: process_sidecar's RollBack arm now restores
RolledPastExpected, UnexpectedAtP1, AND UnexpectedMultistep (any drift
class). Earlier Phase 3 logic restricted to RolledPastExpected only,
which left UnexpectedAtP1/UnexpectedMultistep tables drifted; the
all-or-nothing decision rule per docs/invariants.md §VI.23 demands all
drifted tables be restored.

3 new integration tests in tests/recovery.rs (7 total now):
- recovery_rolls_forward_after_phase_b_completes — happy-path
  roll-forward; audit row recorded; idempotent on second open.
- recovery_rolls_back_records_audit_row_with_recovery_actor —
  roll-back path also records an audit row with the original actor.
- recovery_rolls_forward_with_null_actor — sidecar without actor_id
  still records the audit row (recovery_for_actor = None).

3 new unit tests in db::recovery_audit pin the round-trip + persistence
+ recovery_kind string parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:23 +02:00
Ragnor Comerford
c2fc3e7c40
recovery: wire open-time sweep + OpenMode (Phase 3)
Add `OpenMode::{ReadWrite, ReadOnly}` and route `Omnigraph::open` through
`open_with_storage_and_mode`. Recovery sweep runs only under
`OpenMode::ReadWrite` — read-only consumers (NDJSON export, commit list,
schema show) skip it via `Omnigraph::open_read_only`. Rationale: the
sweep performs Lance writes (Dataset::restore, manifest publish); a
read-only consumer with read-only object-store credentials shouldn't
trigger writes, and reads always resolve through the manifest pin
regardless of any drift on the per-table side.

`recover_manifest_drift` lands in db/manifest/recovery.rs and is wired
into Omnigraph::open AFTER recover_schema_state_files — schema-state
recovery operates on staging files; manifest-drift recovery operates on
Lance HEADs that may depend on schema-state being settled.
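
A small sketch of the gating and ordering as of this commit, assuming a hypothetical `open_time_steps` helper that only names the steps instead of running them.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum OpenMode {
    ReadWrite,
    ReadOnly,
}

// Returns the open-time recovery steps in the order they would run.
// Schema-state recovery comes first; the manifest-drift sweep runs
// only for read-write opens because it performs Lance writes.
fn open_time_steps(mode: OpenMode) -> Vec<&'static str> {
    let mut steps = vec!["recover_schema_state_files"];
    if mode == OpenMode::ReadWrite {
        steps.push("recover_manifest_drift");
    }
    steps
}
```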

Roll-back path is fully implemented: classify each table per the
sidecar's intent, dispatch the all-or-nothing decision, and call
restore_table_to_version for any table with drift (RolledPastExpected,
UnexpectedAtP1, or UnexpectedMultistep). NoMovement tables are already
at expected_version — no action. Sidecar deleted as the final step.

Roll-forward path errors with a Phase-4 placeholder so it surfaces
loudly if reached without the audit + manifest-publish wiring landing
first.

Concurrency: today (pre-MR-686) recovery is naturally serialized by the
single-coordinator model. Open runs at server startup BEFORE
Arc<RwLock<Omnigraph>> wraps the engine (lib.rs:194), so no request
handlers can race. CLI is sequential by caller orchestration. Under
MR-686's per-(table_key, branch) queues + MR-856 (background recovery
reconciler), the queue acquisition will need to extend to recovery
sweeps — handoff documented on MR-686 ticket and in MR-856.

4 integration tests in tests/recovery.rs pin the Phase 3 contract:
- recovery_does_not_run_on_clean_open — no sidecars; sweep is a no-op.
- recovery_refuses_unknown_schema_version_on_open — sidecar v=99
  surfaces SidecarSchemaError and is left on disk for operator review.
- read_only_open_skips_recovery_sweep — even a sidecar with bogus
  table_path doesn't get classified under OpenMode::ReadOnly.
- recovery_rolls_back_synthetic_drift_on_open — sidecar with mismatched
  post_commit_pin classifies as UnexpectedAtP1, decision is RollBack,
  restore is invoked, sidecar is deleted, idempotent on second open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:23 +02:00
Ragnor Comerford
376d91d538
recovery: scaffold sidecar protocol + classifier + decision tree (Phase 2)
Add db/manifest/recovery.rs with the primitives the open-time recovery
sweep will invoke. No integration into Omnigraph::open or any writer
path yet — those land in Phase 3+.

Sidecar protocol:
- RecoverySidecar JSON shape (schema_version=1; SidecarSchemaError
  refuses unknown versions — old binaries don't guess at newer shapes).
- SidecarKind {Mutation, Load, SchemaApply, BranchMerge, EnsureIndices}
  for audit attribution.
- SidecarTablePin {table_key, table_path, expected_version,
  post_commit_pin}.
- write_sidecar / delete_sidecar / list_sidecars / parse_sidecar.
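
The sidecar shape and version refusal might look roughly like this. The `SidecarTablePin` fields and `SidecarKind` variants come from the list above; the `RecoverySidecar` field set, the `found` field on `SidecarSchemaError`, and `check_schema_version` are assumptions for illustration, and JSON parsing is left out.

```rust
const SUPPORTED_SCHEMA_VERSION: u32 = 1;

#[derive(Debug, Clone, Copy, PartialEq)]
enum SidecarKind {
    Mutation,
    Load,
    SchemaApply,
    BranchMerge,
    EnsureIndices,
}

#[derive(Debug, Clone, PartialEq)]
struct SidecarTablePin {
    table_key: String,
    table_path: String,
    expected_version: u64,
    post_commit_pin: u64,
}

// Assumed top-level shape; the real struct may carry more fields.
#[derive(Debug, Clone, PartialEq)]
struct RecoverySidecar {
    schema_version: u32,
    kind: SidecarKind,
    tables: Vec<SidecarTablePin>,
}

#[derive(Debug, PartialEq)]
struct SidecarSchemaError {
    found: u32,
}

// Refuses any version this binary does not know instead of guessing.
fn check_schema_version(sidecar: &RecoverySidecar) -> Result<(), SidecarSchemaError> {
    if sidecar.schema_version == SUPPORTED_SCHEMA_VERSION {
        Ok(())
    } else {
        Err(SidecarSchemaError { found: sidecar.schema_version })
    }
}
```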

Classifier + decision dispatcher (all-or-nothing per sidecar):
- TableClassification {NoMovement, RolledPastExpected, UnexpectedAtP1,
  UnexpectedMultistep, InvariantViolation}.
- classify_table(pin, lance_head, manifest_pinned).
- decide(&[TableClassification]) -> SidecarDecision {RollForward,
  RollBack, Abort}. Mid-Phase-B crash with mixed states rolls BACK
  (not forward) — atomicity per docs/invariants.md §VI.23.
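
A sketch of the all-or-nothing decision under stated assumptions: any InvariantViolation aborts, a slice consisting only of RolledPastExpected rolls forward (vacuously true for an empty slice), and every other mix rolls back. The source confirms the empty-slice and mixed-state outcomes; the rest is inferred.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TableClassification {
    NoMovement,
    RolledPastExpected,
    UnexpectedAtP1,
    UnexpectedMultistep,
    InvariantViolation,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum SidecarDecision {
    RollForward,
    RollBack,
    Abort,
}

fn decide(classes: &[TableClassification]) -> SidecarDecision {
    if classes.contains(&TableClassification::InvariantViolation) {
        return SidecarDecision::Abort;
    }
    // `all` on an empty slice is true, so an empty sidecar rolls forward.
    if classes
        .iter()
        .all(|c| *c == TableClassification::RolledPastExpected)
    {
        return SidecarDecision::RollForward;
    }
    // Mixed states (for example a mid-Phase-B crash) roll back.
    SidecarDecision::RollBack
}
```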

Restore primitive:
- restore_table_to_version(table_path, expected_version): open,
  checkout(expected_version), restore. Includes a fragment-set
  equality short-circuit so repeated mid-rollback crashes don't pile
  up Lance versions (Lance fragments are immutable; equal fragment-ids
  ⇒ equal content).
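
The fragment-set short-circuit can be shown on a toy dataset where each version is just a set of fragment ids. `ToyDataset` and `restore_to` are hypothetical and do not use the Lance API; `restore_to` panics on a version that does not exist.

```rust
use std::collections::BTreeSet;

struct ToyDataset {
    // versions[i] holds the fragment ids of version i + 1.
    versions: Vec<BTreeSet<u64>>,
}

impl ToyDataset {
    fn head(&self) -> u64 {
        self.versions.len() as u64
    }

    // Restores `version` as a new HEAD commit. Returns false, and
    // appends nothing, when HEAD already has the same fragment set.
    fn restore_to(&mut self, version: u64) -> bool {
        let target = self.versions[(version - 1) as usize].clone();
        if self.versions.last() == Some(&target) {
            return false;
        }
        self.versions.push(target);
        true
    }
}
```

Calling `restore_to` twice for the same target adds one version, not two, which is the property that keeps repeated mid-rollback crashes from piling up commits.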

StorageAdapter trait extension:
- Added list_dir(dir_uri) -> Vec<String> for sidecar enumeration.
  LocalStorageAdapter uses tokio::fs::read_dir; S3StorageAdapter uses
  object_store::list with a prefix-collision guard
  (filters to require the directory '/' boundary so listing
  __recovery doesn't accidentally match __recovery_log/...).
  RecordingStorageAdapter (test wrapper) delegates to inner.
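
The prefix-collision guard amounts to requiring a `/` right after the directory name. A minimal sketch over plain key strings, with `is_inside_dir` and this `list_dir` signature invented for illustration (the real method is on the `StorageAdapter` trait and talks to the object store):

```rust
// True only when `key` lives under `dir`, not under a sibling whose
// name merely starts with `dir`.
fn is_inside_dir(key: &str, dir: &str) -> bool {
    let dir = dir.trim_end_matches('/');
    key.strip_prefix(dir)
        .map_or(false, |rest| rest.starts_with('/'))
}

fn list_dir(keys: &[&str], dir: &str) -> Vec<String> {
    keys.iter()
        .copied()
        .filter(|key| is_inside_dir(key, dir))
        .map(str::to_string)
        .collect()
}
```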

17 unit tests covering: classifier branches, decision branches
(including mid-Phase-B mix → RollBack and empty slice → RollForward),
JSON round-trip, schema-version refusal, restore HEAD+1, fragment-set
short-circuit no-op, list_sidecars empty/round-trip/non-JSON-skip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:23 +02:00
Ragnor Comerford
1e70028293
MR-847: pin Lance restore semantics empirically (Phase 1)
Two new tests in tests/staged_writes.rs that the recovery sweep design
depends on:

- lance_restore_appends_one_commit_with_checked_out_content — verifies
  Dataset::restore() (no-args; restores currently-checked-out version)
  produces HEAD+1, not HEAD+2 as the v1 design assumed. Source confirmed
  at lance-4.0.0/src/dataset.rs:1106; this test prevents a future lance
  bump from silently breaking the recovery rollback math.

- lance_restore_loses_to_concurrent_append_via_orphaning — pins the
  concurrency hazard motivating MR-847's open-time-only invocation
  strategy: check_restore_txn (lance-4.0.0/src/io/commit/conflict_
  resolver.rs:986) returns Ok against Append/Update/Delete/CreateIndex/
  Merge/etc., so a Restore commits successfully even when a concurrent
  legitimate writer just landed an Append — silently orphaning the
  Append's data from the active timeline. MR-847 sidesteps via running
  recovery only at Omnigraph::open (before any other writers race);
  MR-856 (continuous-recovery reconciler) must guard via per-(table,
  branch) queue acquisition once MR-686 lands.
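
Both pinned behaviours can be modelled on a toy timeline. This is not Lance code: `Timeline` and its methods are hypothetical, and the "concurrent" append is simply a call made between the checkout decision and the restore.

```rust
// Toy timeline: versions[i] is the row content of version i + 1.
struct Timeline {
    versions: Vec<Vec<&'static str>>,
}

impl Timeline {
    fn head(&self) -> u64 {
        self.versions.len() as u64
    }

    fn head_rows(&self) -> &[&'static str] {
        self.versions.last().map(|v| v.as_slice()).unwrap_or(&[])
    }

    // Appends one row as a new version on top of HEAD.
    fn append(&mut self, row: &'static str) {
        let mut next = self.head_rows().to_vec();
        next.push(row);
        self.versions.push(next);
    }

    // Restore appends exactly one commit whose content is `version`.
    // It does not check what landed after that version was chosen.
    fn restore(&mut self, version: u64) {
        let content = self.versions[(version - 1) as usize].clone();
        self.versions.push(content);
    }
}
```

If an append lands between choosing the restore target and committing the restore, the restore still succeeds and the appended row disappears from HEAD while remaining in the older version.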

These two tests together pin the foundation for MR-847's correctness
claims and document the load-bearing constraint MR-856 will inherit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:23 +02:00
Ragnor Comerford
8726ffe0a3
release: bump version to 0.4.1 2026-05-02 23:20:50 +02:00
Andrew Altshuler
e041130de3
docs: rename omnigraph-starters to omnigraph-cookbooks (#71)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 23:37:09 +03:00
Ragnor Comerford
2e20f4d69f
Merge pull request #70 from ModernRelay/ragnorc/mr793-phases-1-6
TableStorage trait + staged-write surface
2026-05-02 20:05:52 +02:00
Ragnor Comerford
151a1798b5
runs: enumerate inline-commit residuals on TableStorage as a residuals matrix
Closes MR-793 acceptance §1 via option (b): every inline-commit method
remaining on the trait surface is named, the upstream blocker or
internal phase that closes it is cited, and the call-site residual
comment is mandated.

The criterion text in the MR-793 ticket comment is reframed as "either
full sealing OR all residuals enumerated"; this commit ships the
"enumerated" path. The "full sealing" path (Phase 1b + Phase 9 + the
two Lance upstream tickets) closes the matrix entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:46:07 +02:00
Ragnor Comerford
c9a81266e4
lance: confirm MemWAL is opt-in, intra-table, no overlap with MR-847
Fetched https://lance.org/format/table/mem_wal/ in full via npx mdrip.
The "Overview / Details / Implementation" sidebar items turned out to
be anchor sections on the same URL, not separate pages.

Key findings (relevant to MR-847's recovery reconciler design):

* MemWAL is opt-in. Requires (1) unenforced primary key in schema,
  (2) explicit shard config, (3) writers using the LSM-tree write
  path. omnigraph does NOT enable it; we use direct write_fragments +
  commit(Operation::Append).

* MemWAL is intra-table — addresses streaming-write throughput for
  one Lance base table via MemTables → flushed MemTables → async
  merge. It does not coordinate across multiple tables.

* MemWAL's recovery is intra-table: WAL replay reconstructs MemTable
  state for one table. It does NOT help with omnigraph's cross-table
  manifest-pinned-vs-Lance-HEAD drift class.

Conclusion: MR-847's recovery reconciler design is unaffected. The
two operate at different abstraction layers.

Borrowable: MemWAL's epoch-based fencing pattern is structurally
similar to a future multi-coordinator sidecar protocol; noted on
MR-847 in case MR-668 (multi-process) ever lands.
2026-05-02 19:44:37 +02:00
Ragnor Comerford
5afde54d69
agents: stop overclaiming atomic multi-table publish — describe the three layers honestly
External reviewer flagged that the capability matrix's "Atomic
multi-dataset publish" cell implied Lance gives us a single primitive
for cross-table atomicity. It doesn't. The real contract is three
layers stacked:

  (1) per-table Lance `commit_staged` for the data write
  (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for
      cross-table ordering
  (3) recovery-on-open reconciler for the residual gap between (1)
      and (2) — NOT YET SHIPPED, tracked in MR-847.

Until MR-847 lands, a failure between per-table `commit_staged` and
the manifest publish leaves drift on the partially-committed tables
(the "Phase B → Phase C residual" documented in `docs/runs.md`).
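
A minimal sketch of that drift window, assuming a toy in-memory model rather
than Lance or the engine's real types. `ToyGraph`, `write`, and `drifted` are
hypothetical names for illustration only: `head` stands in for each table's
Lance HEAD version, `manifest` for the version pinned in `__manifest`.

```rust
use std::collections::BTreeMap;

/// Toy model (hypothetical): per-table HEAD versions vs. manifest-pinned versions.
pub struct ToyGraph {
    pub head: BTreeMap<String, u64>,
    pub manifest: BTreeMap<String, u64>,
}

impl ToyGraph {
    /// Every table starts with HEAD and manifest both at version 0.
    pub fn new(tables: &[&str]) -> Self {
        let init: BTreeMap<String, u64> =
            tables.iter().map(|t| (t.to_string(), 0)).collect();
        ToyGraph { head: init.clone(), manifest: init }
    }

    /// Layer 1: advance each table's HEAD one by one (stands in for the
    /// per-table commit). Layer 2: publish all new versions to the manifest.
    /// `fail_before_publish` simulates a crash between the two layers.
    pub fn write(&mut self, tables: &[&str], fail_before_publish: bool) -> Result<(), String> {
        for t in tables {
            let head = self
                .head
                .get_mut(*t)
                .ok_or_else(|| format!("unknown table {t}"))?;
            *head += 1;
        }
        if fail_before_publish {
            return Err("crash between per-table commit and manifest publish".into());
        }
        for t in tables {
            let h = self.head[*t];
            self.manifest.insert(t.to_string(), h);
        }
        Ok(())
    }

    /// Tables whose HEAD no longer matches the manifest-pinned version.
    /// This is the set a layer-3 reconciler would have to repair.
    pub fn drifted(&self) -> Vec<String> {
        self.head
            .iter()
            .filter(|(t, h)| self.manifest.get(*t) != Some(*h))
            .map(|(t, _)| t.clone())
            .collect()
    }
}
```

In this model a successful `write` leaves `drifted()` empty, while a write that
fails before the publish step leaves every table it touched in the drifted set.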

Also enumerate the legacy inline-commit residuals (`append_batch`,
`merge_insert_batches`, `overwrite_batch`, `create_*_index`) alongside
`delete_where` and `create_vector_index` — they remain on the trait
pending Phase 1b call-site conversion + Phase 9 demotion.

End the row with an explicit DO NOT: future agents reading the
capability matrix should not describe atomicity as "fully upheld"
until MR-847 ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:35:34 +02:00
Ragnor Comerford
bf7d716d1b
forbidden_apis: add .restore( and document why .append( / .delete( are excluded
Cubic flagged that the guard misses `ds.append() / ds.delete() / ds.restore()`.

`.restore(` is added — Lance-specific (no false positives in the
workspace).

`.append(` and `.delete(` stay excluded with a documenting comment:
* `.append(` over-matches `Vec::append`, `String::append`, every
  `arrow_array::xxxArrayBuilder::append` (30+ legit uses across
  `exec/mutation.rs`, `loader/jsonl.rs`, `exec/projection.rs`).
* `.delete(` over-matches `ObjectStore::delete` (used in `storage.rs`,
  `db/schema_state.rs`, `db/omnigraph.rs:1277` for staging-file
  cleanup) and would require many `// forbidden-api-allow:` sentinels
  for legitimate uses.

The remaining bypass route — engine code that imports `lance::Dataset`
and calls `ds.append(reader, params)` — is bounded by:
1. The trait surface itself (sealed, only-callable-via-trait once
   Phase 1b call-site conversion completes).
2. The PR-review process catching new `lance::Dataset` imports in
   non-storage files.
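
A rough sketch of the kind of line-based guard described above, to show why
`.restore(` is safe to match while `.append(` is not. This is not the contents
of `tests/forbidden_apis.rs`: `find_violations` and the two-entry pattern list
are hypothetical, and treating the sentinel as a same-line marker is an
assumption.

```rust
/// Hypothetical, shortened pattern list; the real list is longer.
pub const FORBIDDEN: &[&str] = &[".restore(", ".merge_insert("];

/// Sentinel named in the commit message; same-line placement is assumed here.
pub const ALLOW_SENTINEL: &str = "// forbidden-api-allow:";

/// Return (1-based line number, matched pattern) for every forbidden call
/// that is neither on a comment line nor covered by the allow sentinel.
pub fn find_violations(source: &str) -> Vec<(usize, &'static str)> {
    let mut hits = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        // Skip pure comment lines and explicitly allowed lines.
        if line.trim_start().starts_with("//") || line.contains(ALLOW_SENTINEL) {
            continue;
        }
        for pat in FORBIDDEN {
            if line.contains(*pat) {
                hits.push((idx + 1, *pat));
            }
        }
    }
    hits
}
```

Because `.append(` is absent from the list, a `Vec::append` call passes
untouched. Adding it would flag every such line and force a sentinel on each.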

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:25:52 +02:00
Ragnor Comerford
9b0920b5da
address PR #70 bot review (Cubic + Cursor): 7 inline + failpoint test + invariants notes
Cubic findings:
* `tests/forbidden_apis.rs`: expand `FORBIDDEN_PATTERNS` with `Dataset::write`
  / `Dataset::append` / `Dataset::delete` / `Dataset::merge_insert` /
  `Dataset::add_columns` / `update_columns` / `drop_columns` /
  `truncate_table` / `restore` and the bare `.merge_insert(` /
  `.add_columns(` / `.update_columns(` / `.drop_columns(` /
  `.truncate_table(` method patterns. Deliberately avoid `.append(` /
  `.delete(` / `.write(` (over-match `Vec::append`, `.delete_branch(`,
  arrow-array `.append(`, etc.). Allow-list `commit_graph.rs` and
  `graph_coordinator.rs` — they're manifest-layer infra that legitimately
  uses `Dataset::write` for system tables.
* `schema_apply.rs:253`: pass `entry.table_branch.as_deref()` (not
  `None`) to `open_dataset_head_for_write` for consistency with the
  sibling `indexed_tables` block. Schema apply rejects non-main
  branches at the lock-acquire step today, so behavior is unchanged;
  this is a defensive consistency fix that survives a future relaxation
  of the lock check.
* `storage_layer.rs:131` doc: was `Vec<&StagedWrite>` with lifetime
  claim; actually returns `Vec<StagedWrite>` (cloned). Fixed.
* `AGENTS.md:201` capability matrix row + `storage_layer.rs:1` module
  doc: softened the "stage_* + commit_staged are the only paths" /
  "trait funnels every write" overclaim. Inline-commit residuals
  (`delete_where`, `create_vector_index`) remain on the trait pending
  upstream Lance work (#6658, #6666); legacy `append_batch` etc.
  remain pending Phase 1b / Phase 9. Module doc now describes the
  current transitional state honestly.

Cursor Bugbot findings:
* `storage_layer.rs:360`: trait `delete_where` consumed `SnapshotHandle`
  but returned only `DeleteState`, dropping the post-delete dataset.
  Future callers migrating from the inherent `&mut Dataset` API would
  lose the post-delete dataset state needed for indexing /
  `table_state` queries. Fixed: returns `(SnapshotHandle, DeleteState)`
  matching `append_batch` / `overwrite_batch` shape.
* `storage_layer.rs:824`: removed dead `_scanner_type_marker` fn and
  the unused `Scanner` import (the marker existed only to suppress an
  unused-import warning — fixing the import is the cleaner answer).

Engine-level Phase A failpoint test (closes the partial-criterion
flagged in Cubic's acceptance-criteria checklist):
* `db/omnigraph/table_ops.rs::stage_and_commit_btree`: instrumented
  with `crate::failpoints::maybe_fail("ensure_indices.post_stage_pre_commit_btree")`
  between `stage_create_btree_index` and `commit_staged`.
* `tests/failpoints.rs::ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`:
  triggers the failpoint via a schema-apply that adds a new node type;
  proves that existing tables are unaffected (Person mutation succeeds
  after the failed apply) — i.e. Phase A failure leaves no Lance-HEAD
  drift on tables outside the failed `added_tables` iteration.

`docs/invariants.md` transitional notes:
* §VI.23 (atomicity per query): annotated as upheld at the
  writer-trait surface for inserts / updates / scalar-index builds /
  merge_insert / overwrite after MR-793 PR #70. Per-table
  commit_staged → manifest publish window remains; closing requires
  MR-847's recovery-on-open reconciler. `delete_where` and
  `create_vector_index` remain inline pending lance#6658 / #6666.
* §VII.35 (reconciler pattern): annotated as partial — staged
  primitives are the building blocks; the reconciler task itself is
  MR-848.
* §VIII.45 (reference impl per trait): `TableStorage` has its primary
  impl on `TableStore` with opaque-handle signatures; no test impl
  yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:47:07 +02:00
Ragnor Comerford
b87be5e9f0
agents: read every Lance page even slightly relevant, not just the obvious match
Behavior is interlocked across Lance pages — transactions reference
index lifecycle, index lifecycle references compaction, compaction
references row-id lineage. Skipping a "slightly relevant" page is how
alignment misses happen. The index alone is not a substitute for
reading the pages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:44:41 +02:00
Ragnor Comerford
17bf978d0e
MR-793 follow-up: lance docs alignment audit + mandate full-page fetch via mdrip
* AGENTS.md / docs/lance.md: agents must use `npx mdrip` (not summarizing
  WebFetch) when consulting Lance docs. WebFetch routinely drops
  load-bearing details — `pub(crate)` blockers, sub-specs behind nav hubs,
  default flags. Lesson learned during the MR-793 alignment audit.
* docs/lance.md: add "Last alignment audit: 2026-05-02" stanza
  documenting MemWAL gap, lance#6666 companion ticket, stable-row-ID
  status (experimental, may unblock MR-848), FRI as documented
  compaction-friendly alternative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:41:32 +02:00
Ragnor Comerford
3135ff5d19
MR-793 phases 1-6: TableStorage trait + staged-write surface for engine writers
Hoists Lance's stage+commit two-phase write pattern from "discipline at
each writer" to a sealed trait surface (`TableStorage`). New engine code
that needs to advance Lance HEAD MUST go through `stage_*` + `commit_staged`;
the trait's opaque `SnapshotHandle` / `StagedHandle` types keep
`lance::Dataset` and `lance::Transaction` out of trait signatures.

Phases landed (see .context/mr-793-design.md for the full plan):
* 1a: `crates/omnigraph/src/storage_layer.rs` — `TableStorage` trait,
  sealed (only in-tree types can impl), single impl on `TableStore`
  delegating to existing inherent methods; `Omnigraph::storage()`
  accessor returns `&dyn TableStorage`.
* 2: three new staged primitives — `stage_overwrite`,
  `stage_create_btree_index`, `stage_create_inverted_index` —
  implementing the simple branch of Lance's `CreateIndexBuilder::execute`
  (scalar indices only; vector indices stay inline because
  `build_index_metadata_from_segments` is `pub(crate)` in lance-4.0.0).
  Six new tests in `tests/staged_writes.rs` pin both the new primitives
  and the inline residuals (`delete_where`, `create_vector_index`).
* 3: `tests/forbidden_apis.rs` — defense-in-depth integration test
  walks engine source, fails on direct lance::* inline-commit API use
  outside `table_store.rs` / `db/manifest/`. Skips comment lines and
  honors `// forbidden-api-allow:` sentinels.
* 4: `ensure_indices` migration — scalar index builds now route through
  `stage_create_*_index` + `commit_staged` instead of
  `create_*_index(&mut Dataset)`. Vector indices stay inline (residual,
  named honestly at the call site).
* 5: `branch_merge::publish_rewritten_merge_table` migration — the
  merge_insert phase now uses `stage_merge_insert` + `commit_staged`;
  delete phase stays inline (Lance #6658 residual, named honestly).
* 6: `schema_apply` rewritten_tables migration — non-empty rewrites
  use `stage_overwrite` + `commit_staged`; empty-batch rewrites stay
  inline because `InsertBuilder::execute_uncommitted` rejects empty
  data. The narrow inline window is bounded by `__schema_apply_lock__`.

Verified-green test surface:
* `cargo test -p omnigraph-engine` — 68 lib + ~120 integration tests
  (incl. 6 new staged_writes tests + the new forbidden_apis test).
* `cargo test -p omnigraph-engine --features failpoints --test failpoints`
  — 5 tests, all green.
* `cargo test --workspace` — green.

Deferred to follow-up sessions (see design doc §17 split):
* Phase 1b — convert remaining engine call sites to `&dyn TableStorage`
  (mostly READS that don't touch the staged-write invariant).
* Phase 7 — recovery-on-open reconciler (closes Phase B → Phase C
  residual across process restarts; new subsystem).
* Phase 8 — index-coverage reconciler (full §VII.35 compliance —
  removes synchronous index work from the publish path).
* Phase 9 — demote unused `TableStore` inherent methods to `pub(crate)`
  (depends on Phase 1b).

Lance upstream blockers documented:
* lance-format/lance#6658 — two-phase delete API (open, no PRs).
* Companion: `build_index_metadata_from_segments` should be `pub` so
  vector-index builds can be staged outside the lance crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:03:15 +02:00
Ragnor Comerford
6f60c0cbcf
Merge pull request #68 from ModernRelay/ragnorc/mr794-rewire
MR-794 step 2: in-memory accumulator rewire for mutate_as + load
2026-05-01 23:06:10 +02:00
Ragnor Comerford
044ed46019
chore: scrub Linear ticket numbers and review-bot mentions from code comments
OmniGraph is OSS; internal Linear ticket references and code-review-bot
mentions in source-code comments don't help external readers and leak
internal tooling. Replace ticket numbers (MR-XXX) with descriptive
prose, drop linear.app URLs, and remove inline mentions of
Cursor/Bugbot/Cubic/Codex review threads.

Scope is limited to source-code comments (`crates/`). Docs under
`docs/` keep their MR-XXX references — those are part of the
established change-history narrative for in-repo docs and don't
require a Linear account to find context for.

No behavior changes; no public API changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:45:38 +02:00
Ragnor Comerford
ea16c74329
MR-794 step 2: address Cursor Bugbot follow-ups on commits 3223b51 + 052b6e6
Four code/doc fixes from the latest Cursor Bugbot pass:

* **Misplaced doc comment in table_store.rs (Medium):** the doc block
  intended for `scan_pending_batches` was, after my earlier edit,
  attached to `collect_string_column_values` because the new helper
  was inserted between the original docblock and `scan_pending_batches`.
  Move the docblock back onto its function and add a note about the
  shared SQL-dialect contract with the Lance scanner (the predicate
  goes to both, which is fine for `predicate_to_sql`'s plain comparison
  shapes today; future Lance-specific scanner extensions in the filter
  would need translation).

* **Missing null check on committed `id` column (Low):** the
  committed-side loop in `collect_node_ids_with_pending` (and the
  parallel non-pending `collect_node_ids`) read `id_col.value(i)`
  without `is_valid(i)` first. `id` is the @key column on every node
  type and non-nullable by schema, so this is unreachable today, but
  the inconsistency with the pending-side `is_valid` guard is worth
  closing for symmetry / defense.

* **Misleading comment in count_pending_src_with_dedupe (Low):** the
  comment claimed "fall back to naive counting" but the code did
  `continue`. Fix: the case is unreachable in practice (the pending-side
  schema always contains the key when the caller passes one), so the
  code now fails loudly with a typed error if it ever fires. Silently
  skipping the batch would let `@card` violations slip past
  validation.

* **PendingTable.schema mismatch surfaces too late (Medium):**
  PendingTable captures the schema from the first batch and never
  updates it. On a blob-bearing table, `insert` produces a full-schema
  batch and `update` (without assigning every blob) produces a
  subset-schema batch. Pre-fix the mismatch surfaced inside
  finalize/MemTable construction — distant from the offending op.
  Post-fix `MutationStaging::append_batch` validates the new batch's
  schema against the existing accumulator's schema and returns a
  typed error directing the caller to split the mutation. Error
  fires at the offending op, not at end-of-query. New helper
  `schemas_compatible` compares field name + data_type pairs;
  nullability and field metadata differences stay tolerated (downstream
  concat already permits those).
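
A minimal sketch of the compatibility rule from the last item, assuming
simplified stand-ins for the Arrow schema types (`Field` and `DataType` here
are hypothetical local types, not `arrow_schema`). Positional comparison is
also an assumption; the commit message only says name + data_type pairs are
compared.

```rust
/// Hypothetical stand-in for an Arrow data type.
#[derive(Clone, Debug, PartialEq)]
pub enum DataType {
    Utf8,
    Int64,
    Binary,
}

/// Hypothetical stand-in for an Arrow field.
#[derive(Clone, Debug)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

/// Two schemas are compatible when they have the same fields, in the same
/// order, with matching names and data types. Nullability is ignored.
pub fn schemas_compatible(a: &[Field], b: &[Field]) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|(x, y)| x.name == y.name && x.data_type == y.data_type)
}
```

Under this rule a full-schema `insert` batch and a subset-schema `update` batch
on a blob-bearing table compare as incompatible, which is the case the early
check is meant to reject at the offending op.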

Cursor Bugbot finding #5 (cascade delete edge re-open) self-resolved
in the bot's own analysis ("logic appears sound on re-examination") —
no action.

New test on tests/runs.rs:

* append_batch_rejects_mismatched_schema_in_blob_table_at_offending_op
  — pins the early-error path. Builds a blob-bearing schema, runs an
  `insert + update` query where the update doesn't assign the blob,
  asserts the error fires at the second op with the "Split the
  mutation" message and the manifest is unchanged.

Local: tests/runs.rs 24/24 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:50:13 +02:00
Ragnor Comerford
675568ce85
ci: fold failpoints test into Test Workspace job
The standalone test_failpoints_feature job took 21min on first run
(cold cache; the omnigraph-engine crate has lance + datafusion deps
that make any fresh build expensive). Folding into Test Workspace
shares the warm cache so the failpoints invocation is incremental —
~30s vs 21min on subsequent runs, and within the workspace job's
existing budget.

The failpoints feature is gated behind a Cargo flag and only adds
the small `fail` crate dep + a few feature-gated code paths; it
doesn't change the dep tree of any other crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:15:14 +02:00
Ragnor Comerford
052b6e680f
MR-794 step 2: address PR #68 follow-up review (Cubic) — pending dedupe + projection guard + CI
Three new findings from Cubic on commit 3223b51:

* **Pending edge cardinality counted within-input duplicates** (P2):
  count_src_per_edge's pending walk added every row to the count,
  including duplicate rows that finalize will collapse via
  dedupe_merge_batches_by_id. A LoadMode::Merge with the same edge id
  twice would over-count → spurious @card violation. Fix: when
  dedupe_key_column is Some, walk pending in reverse, track seen keys
  via HashSet, count only the kept (last-occurrence) rows. Mirrors
  finalize-time dedupe so cardinality counts what stage_merge_insert
  actually publishes.

* **scan_with_pending silently disabled merge-shadow when projection
  omitted key_column** (P2): if a caller passed Some("id") as
  key_column but their projection didn't include "id", the
  filter_out_rows_where_string_in helper passed batches through
  unchanged — silently degrading to union semantics. Fix: validate
  up front that projection contains key_column when both are Some;
  return a typed Lance error otherwise. Tightened the helper too:
  missing column is now an internal error (was a silent passthrough).

* **Cascade-vs-explicit delete test was too weak** (P2): asserted
  only that edge count decreased after delete. The cascade alone
  could satisfy that even if the explicit second-delete silently
  no-op'd. Strengthened: assert post_knows == 0, which only holds
  when both ops landed (Bob→Diana would survive if op-2 no-op'd).
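
The reverse-walk dedupe from the first finding can be sketched as follows. This
is an illustration under simplified assumptions, not the engine's
`count_src_per_edge`: `PendingEdge` and `count_src_last_wins` are hypothetical,
and rows are plain structs instead of Arrow batches.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified pending edge row.
pub struct PendingEdge {
    pub id: String,
    pub src: String,
}

/// Count edges per source node, keeping only the last occurrence of each
/// edge id, so the count matches what a last-wins dedupe would publish.
pub fn count_src_last_wins(pending: &[PendingEdge]) -> HashMap<String, usize> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut counts: HashMap<String, usize> = HashMap::new();
    // Walk newest-to-oldest: the first time an id is seen is its kept row.
    for edge in pending.iter().rev() {
        // `insert` returns false when a later row already claimed this id.
        if seen.insert(edge.id.as_str()) {
            *counts.entry(edge.src.clone()).or_insert(0) += 1;
        }
    }
    counts
}
```

With the same edge id supplied twice, the source node is counted once, so a
cardinality upper bound is checked against the deduplicated rows only.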

CI gap: also added test_failpoints_feature job to .github/workflows/ci.yml.
The workspace test runs without --features failpoints (the feature is
behind a Cargo flag), so the failpoints test suite was never exercised
by CI before now. The new job builds + runs
`cargo test -p omnigraph-engine --features failpoints --test failpoints`
on every full CI run, mirroring the test_aws_feature pattern.

New tests on tests/runs.rs:

* load_merge_mode_dedupes_within_pending_for_cardinality_count
  (Cubic P2 #2 — pending-vs-pending dedup, distinct from the
  load_merge_mode_dedupes_edge_for_cardinality_count test which
  covers committed-vs-pending dedup).
* scan_with_pending_rejects_key_column_missing_from_projection
  (Cubic P2 #3 — verifies the up-front validation rejects bad
  callers and that the happy path still works correctly).

Local test results:

* tests/runs.rs: 23/23 passed
* tests/failpoints.rs --features failpoints: 7/7 passed (includes the
  two new finalize→publisher residual tests landed in 3223b51).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:47:45 +02:00
Aaron Goh
68c9d1bc91
docs: add community Slack link
Add a Community section to the README with the Omnigraph Slack invite so contributors and users have a clear place to ask questions and share feedback.
2026-05-01 14:39:21 +02:00
Ragnor Comerford
3223b51cf1
MR-794 step 2: address PR #68 review — merge semantics, cardinality, residual
Seven changes from PR #68 review (Cursor Bugbot + Codex + Cubic):

* **scan_with_pending gains merge-shadow semantics** (Codex P1, Cubic P1#1):
  new `key_column: Option<&str>` parameter. When set, committed rows
  whose key value appears in any pending batch are excluded from the
  scan — making `scan_with_pending` correctly merge-semantic for chained
  updates instead of naively unioning. execute_update calls with
  Some("id"). Without this, a chained `update where age > 30` could
  match a row whose pending value already moved out of range.

* **Multi-delete on same table no longer trips ExpectedVersionMismatch**
  (Cursor Bugbot HIGH): open_table_for_mutation routes through
  reopen_for_mutation when staging.inline_committed has the table,
  using the post-inline-commit Lance version captured at record_inline
  time. The legacy open_for_mutation_on_branch fence (Lance HEAD ==
  manifest pinned) is correct cross-writer but wrong intra-query when
  deletes have already advanced HEAD on this table. Branch goes away
  when Lance ships two-phase delete (lance-format/lance#6658).

* **Cardinality validation consolidated** (Cursor LOW + Codex P2 +
  Cubic P1#2 + Cubic P2): new exec/staging::count_src_per_edge +
  enforce_cardinality_bounds shared by mutation and loader paths.
  Restores the missing min-cardinality check on the engine path.
  Loader Merge mode passes Some("id") to dedupe edges being updated
  by id (not double-count committed + pending). Loader Append mode
  and engine path pass None (ULID-generated ids never collide).

* **Dead count_rows_with_pending removed** (Cursor LOW): never called.

* **Misleading concat-helper comment fixed** (Cubic P3): claimed
  schema normalization the helper doesn't implement. Updated to match
  reality.

* **Documentation honesty** (Cubic P1#3): MR-794 narrows but doesn't
  eliminate the "Lance HEAD ahead of __manifest" drift class. Drift is
  unreachable for op-execution failures (the partial_failure test pins
  this), but a residual remains at the finalize→publisher boundary
  because Lance has no multi-dataset commit primitive: per-table
  commit_staged calls run sequentially before manifest commit. Updated
  docs/runs.md, docs/invariants.md §VI.25, docs/releases/v0.4.1.md to
  scope the claim precisely.

* **Failpoint test pinning the residual**: new
  mutation.post_finalize_pre_publisher failpoint + two tests in
  tests/failpoints.rs that confirm the documented residual behavior.
  Catches future regressions that widen the residual.
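
The merge-shadow rule from the first item can be sketched as follows, assuming
plain in-memory rows rather than Lance scans. `Row` and `scan_with_shadow` are
hypothetical, the boolean flag stands in for the `key_column` option, and
duplicates within the pending set are not handled here.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified row keyed by `id`.
#[derive(Clone, Debug, PartialEq)]
pub struct Row {
    pub id: String,
    pub age: i64,
}

/// Scan committed + pending rows through `pred`. With `shadow_by_id` set,
/// committed rows whose id appears in pending are dropped first (merge
/// semantics); without it the two sets are simply unioned.
pub fn scan_with_shadow(
    committed: &[Row],
    pending: &[Row],
    shadow_by_id: bool,
    pred: impl Fn(&Row) -> bool,
) -> Vec<Row> {
    let pending_ids: HashSet<&str> = if shadow_by_id {
        pending.iter().map(|r| r.id.as_str()).collect()
    } else {
        HashSet::new()
    };
    committed
        .iter()
        .filter(|r| !pending_ids.contains(r.id.as_str()))
        .chain(pending.iter())
        .filter(|r| pred(*r))
        .cloned()
        .collect()
}
```

If a committed row has `age = 40` and its pending version has `age = 25`, a
scan for `age > 30` returns nothing with shadowing on, and returns the stale
committed row with shadowing off.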

Test additions on tests/runs.rs:

* chained_updates_with_overlapping_predicate_respects_intermediate_value
* multi_statement_delete_on_same_node_table
* cascade_delete_node_then_explicit_delete_edge_on_same_table
* mutation_insert_edge_enforces_min_cardinality
* load_merge_mode_dedupes_edge_for_cardinality_count

113/113 engine integration tests pass (runs + end_to_end + consistency
+ staged_writes + validators). Failpoints feature build runs in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:47:55 +02:00
Ragnor Comerford
a61e82f47a
MR-794 step 2: docs — runs/invariants/architecture/execution + cleanup
Refresh user-facing and agent-facing docs for the staged-write rewire
and clean up stale Run-state-machine references that survived MR-771.

MR-794-specific updates:
* docs/runs.md — remove "Known limitation: mid-query partial failure"
  section; document the in-memory accumulator + D₂ rule + the
  LoadMode::Overwrite residual.
* docs/invariants.md §VI.25 — flip from aspirational/open to
  upheld for inserts/updates. Within-query read-your-writes is now
  load-bearing for the publisher CAS contract.
* docs/architecture.md — add "Mutation atomicity — in-memory
  accumulator (MR-794)" subsection with per-op flow; refresh the
  engine + state diagrams to drop RunRegistry and add MutationStaging.
* docs/execution.md — rewrite the mutation flow sequence diagram
  for the staged-write path; update the LoadMode table to call
  out per-mode commit semantics; rewrite load vs ingest.
* docs/query-language.md — document the D₂ parse-time rule.
* docs/errors.md — add the D₂ BadRequest rejection path.
* docs/testing.md — extend the runs.rs row to cover the new MR-794
  contract tests; add the staged_writes.rs row.
* docs/releases/v0.4.1.md (new) — release note covering the rewire,
  test additions, residuals, and files changed.
* AGENTS.md (CLAUDE.md symlink) — update the atomic-per-query
  description and the L2 capability matrix row.

Stale-reference cleanup (MR-771 leftovers):
* docs/storage.md — drop live _graph_runs.lance / _graph_run_actors.lance
  from the layout diagram and prose; mark legacy.
* docs/branches-commits.md — move __run__<id> to a legacy note;
  remove publish_run from the publish-trigger list.
* docs/audit.md — refresh _as API list (drop begin_run_as / publish_run_as);
  legacy RunRecord.actor_id moved to a historical note.
* docs/constants.md — mark run registry / branch-prefix rows as legacy.
* docs/cli.md — replace the legacy omnigraph run * quickstart block
  with omnigraph commit list/show.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:43:19 +02:00