omnigraph/crates
Ragnor Comerford e1b40aee0b
engine: opt MergeInsertBuilder into FirstSeen for Lance dup-rowid bug (MR-957) (#109)
* engine: opt MergeInsertBuilder into FirstSeen for Lance dup-rowid bug (MR-957)

Lance 4.0.x's MergeInsertBuilder rejects a sequential merge_insert or
update against rows previously rewritten by merge_insert, failing with
a spurious "Ambiguous merge inserts: multiple source rows match the
same target row on (id = ...)" error. The engine passes exactly 1
source row; Lance's `processed_row_ids: Mutex<HashSet<u64>>`
(lance-4.0.0 src/dataset/write/merge_insert.rs:2099) double-processes
the same source/target match on such datasets and errors under the
default SourceDedupeBehavior::Fail.

Two surfaces hit it:
- Load: `omnigraph load --mode merge` twice against the same @key set.
- Mutate: sequential `update T set {f:v} where x=y` on the same row.

Fix: opt both MergeInsertBuilder call sites (merge_insert_batch,
stage_merge_insert) into SourceDedupeBehavior::FirstSeen. Lance
silently skips a duplicate match instead of erroring.
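
The difference between the two dedupe modes can be illustrated with a
toy model. This is a sketch only: `DedupeBehavior` and `apply_matches`
are hypothetical stand-ins, not Lance's types or API, and the
processed-id set merely mimics the role described for
`processed_row_ids` above.

```rust
use std::collections::HashSet;

/// Toy stand-in for Lance's `SourceDedupeBehavior` (hypothetical, not the real type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum DedupeBehavior {
    Fail,
    FirstSeen,
}

/// Applies `(target_row_id, new_value)` matches to `target`.
/// A row id seen a second time either aborts (Fail) or is skipped (FirstSeen).
/// Returns the number of matches actually applied.
fn apply_matches(
    target: &mut [String],
    matches: &[(usize, &str)],
    behavior: DedupeBehavior,
) -> Result<usize, String> {
    let mut processed: HashSet<usize> = HashSet::new();
    let mut applied = 0;
    for &(row_id, value) in matches {
        // `insert` returns false when the row id was already processed.
        if !processed.insert(row_id) {
            match behavior {
                DedupeBehavior::Fail => {
                    return Err(format!(
                        "ambiguous merge insert: row {row_id} matched twice"
                    ));
                }
                DedupeBehavior::FirstSeen => continue,
            }
        }
        let slot = target
            .get_mut(row_id)
            .ok_or_else(|| format!("row {row_id} out of range"))?;
        *slot = value.to_string();
        applied += 1;
    }
    Ok(applied)
}
```

In this model, a match that is reported twice for the same row (the
symptom described above) fails under `Fail` but is applied exactly once
under `FirstSeen`.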

Correctness-preserving for OmniGraph because source-side duplicates
are already rejected upstream of these call sites:
- Loader: enforce_unique_constraints_intra_batch (loader/mod.rs:1453)
  rejects intra-batch dup @key values across all three LoadModes,
  pinned by the new loader_rejects_intra_batch_duplicate_keys test.
- Mutate: MutationStaging::finalize pre-dedupes by id.

So FirstSeen only suppresses the spurious Lance error; it never drops
user data.

Regression coverage:
- consistency::load_merge_repeated_against_overlapping_keys_succeeds
  — load surface (was the basis of the original PR #98 report).
- runs::second_sequential_update_on_same_row_succeeds — update
  surface (MR-920).
- consistency::loader_rejects_intra_batch_duplicate_keys — pins
  FirstSeen's safety argument.
- consistency::load_merge_window_2_documents_upstream_lance_gap —
  canary for the residual upstream Lance gap (after MR-848 removes
  the eager BTREE-on-id, re-establishing the index via
  ensure_indices re-triggers the bug class). Drop the FirstSeen
  setter only when this canary stays green without it.

Cross-validation on the prior PR #98 branch: use_index(false)
(PR #98's hypothesis) and FirstSeen (MR-920's hypothesis) each cover
both surfaces on their own. FirstSeen was chosen because it has no perf
cost (use_index(false) would force full-table scans on every
merge_insert).

Supersedes PR #98 and andrew/merge-insert-firstseen.

Tracked at MR-957; upstream: lance-format/lance#6877.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* engine: add dedup-by-keys precondition on merge_insert primitives

Addresses Codex P1 on PR #109: `SourceDedupeBehavior::FirstSeen`
silently collapses duplicate source rows, and the branch-merge rewrite
path (`exec/merge.rs::publish_rewritten_merge_table`) feeds a
concatenated batch directly into `stage_merge_insert` without going
through `MutationStaging::finalize`'s pre-dedupe. By construction the
merge algorithm (`compute_source_delta` / `compute_three_way_delta`
walk via `OrderedTableCursor` and push each id at most once) produces
1-row-per-id, but the invariant was implicit — a future refactor
could violate it and FirstSeen would mask the bug as silent data
loss.

Add `check_batch_unique_by_keys` as a release-mode precondition at the
top of `merge_insert_batch` and `stage_merge_insert`. Errors with an
explicit "duplicate source row" message before the builder runs, so
real source dups continue to fail-fast regardless of caller.

Cost: one extra O(N) pass over the key column on every merge_insert.
A HashSet of string keys over typical batch sizes costs microseconds,
which is negligible next to the merge_insert itself.
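
A minimal sketch of such a precondition, assuming the key column has
already been extracted as a slice of string keys. The real
`check_batch_unique_by_keys` operates on the engine's batch type and
its signature is not shown here; `check_unique_keys` below is a
hypothetical simplification.

```rust
use std::collections::HashSet;

/// Hypothetical simplification of a dedup-by-keys precondition:
/// one O(N) pass that fails fast on the first repeated key.
fn check_unique_keys(keys: &[&str]) -> Result<(), String> {
    let mut seen: HashSet<&str> = HashSet::with_capacity(keys.len());
    for (index, key) in keys.iter().copied().enumerate() {
        // `insert` returns false if the key was already present.
        if !seen.insert(key) {
            return Err(format!(
                "duplicate source row: key {key:?} repeats at index {index}"
            ));
        }
    }
    Ok(())
}
```

Calling this before handing the batch to the builder means a real
duplicate surfaces as an explicit error instead of being collapsed by
FirstSeen.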

The inline comment in `table_store.rs` now enumerates all three
pre-dedup paths (load / mutate / branch-merge) and names the
precondition as the structural pin instead of relying on
by-construction invariants from three separate callers.

Three new unit tests in `table_store::tests` pin the helper itself;
the existing `loader_rejects_intra_batch_duplicate_keys` integration
test continues to pin the loader's intake-time check as the first
defense layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 18:19:54 +01:00
omnigraph engine: opt MergeInsertBuilder into FirstSeen for Lance dup-rowid bug (MR-957) (#109) 2026-05-22 18:19:54 +01:00
omnigraph-cli schema: HTTP allow_data_loss exposure + e2e drop coverage (MR-694 follow-up) (#107) 2026-05-19 01:56:46 +03:00
omnigraph-compiler schema-lint v1 commit 4: emit + apply DropType { Soft } (#99) 2026-05-16 20:25:42 +03:00
omnigraph-policy policy: chassis core — omnigraph-policy crate + Omnigraph::enforce() (MR-722) (#102) 2026-05-18 00:36:36 +03:00
omnigraph-server schema: HTTP allow_data_loss exposure + e2e drop coverage (MR-694 follow-up) (#107) 2026-05-19 01:56:46 +03:00