engine: opt MergeInsertBuilder into FirstSeen for Lance dup-rowid bug (MR-957) (#109)

* engine: opt MergeInsertBuilder into FirstSeen for Lance dup-rowid bug (MR-957)

Lance 4.0.x's MergeInsertBuilder rejects sequential merge_insert /
update against rows previously rewritten by merge_insert with a
spurious "Ambiguous merge inserts: multiple source rows match the
same target row on (id = ...)" error. The engine passes exactly 1
source row; Lance's `processed_row_ids: Mutex<HashSet<u64>>`
(lance-4.0.0 src/dataset/write/merge_insert.rs:2099) double-processes
the same source/target match against datasets previously rewritten
by merge_insert and errors under the default
SourceDedupeBehavior::Fail.

Two surfaces hit it:
- Load: `omnigraph load --mode merge` twice against the same @key set.
- Mutate: sequential `update T set {f:v} where x=y` on the same row.

Fix: opt both MergeInsertBuilder call sites (merge_insert_batch,
stage_merge_insert) into SourceDedupeBehavior::FirstSeen. Lance
silently skips a duplicate match instead of erroring.

Correctness-preserving for OmniGraph because source-side duplicates
are already rejected upstream of these call sites:
- Loader: enforce_unique_constraints_intra_batch (loader/mod.rs:1453)
  rejects intra-batch dup @key values across all three LoadModes,
  pinned by the new loader_rejects_intra_batch_duplicate_keys test.
- Mutate: MutationStaging::finalize pre-dedupes by id.

So FirstSeen only suppresses the spurious Lance behavior, never user
data.

Regression coverage:
- consistency::load_merge_repeated_against_overlapping_keys_succeeds
  — load surface (was the basis of the original PR #98 report).
- runs::second_sequential_update_on_same_row_succeeds — update
  surface (MR-920).
- consistency::loader_rejects_intra_batch_duplicate_keys — pins
  FirstSeen's safety argument.
- consistency::load_merge_window_2_documents_upstream_lance_gap —
  canary for the residual upstream Lance gap (after MR-848 removes
  the eager BTREE-on-id, re-establishing the index via
  ensure_indices re-triggers the bug class). Drop the FirstSeen
  setter only when this canary stays green without it.

Cross-validation on the prior PR #98 branch: both use_index(false)
(PR #98's hypothesis) and FirstSeen (MR-920's hypothesis) cover both
surfaces individually. FirstSeen chosen because it has no perf cost
(use_index(false) would force full-table scans on every merge_insert).

Supersedes PR #98 and andrew/merge-insert-firstseen.

Tracked at MR-957; upstream: lance-format/lance#6877.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* engine: add dedup-by-keys precondition on merge_insert primitives

Addresses Codex P1 on PR #109: `SourceDedupeBehavior::FirstSeen`
silently collapses duplicate source rows, and the branch-merge rewrite
path (`exec/merge.rs::publish_rewritten_merge_table`) feeds a
concatenated batch directly into `stage_merge_insert` without going
through `MutationStaging::finalize`'s pre-dedupe. By construction the
merge algorithm (`compute_source_delta` / `compute_three_way_delta`
walk via `OrderedTableCursor` and push each id at most once) produces
1-row-per-id, but the invariant was implicit — a future refactor
could violate it and FirstSeen would mask the bug as silent data
loss.

Add `check_batch_unique_by_keys` as a release-mode precondition at the
top of `merge_insert_batch` and `stage_merge_insert`. Errors with an
explicit "duplicate source row" message before the builder runs, so
real source dups continue to fail-fast regardless of caller.

Cost: one extra O(N) pass over the key column on every merge_insert.
String HashSet over typical batch sizes is microseconds — negligible
next to the merge_insert itself.

The inline comment in `table_store.rs` now enumerates all three
pre-dedup paths (load / mutate / branch-merge) and names the
precondition as the structural pin instead of relying on
by-construction invariants from three separate callers.

Three new unit tests in `table_store::tests` pin the helper itself;
the existing `loader_rejects_intra_batch_duplicate_keys` integration
test continues to pin the loader's intake-time check as the first
defense layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Ragnor Comerford 2026-05-22 18:19:54 +01:00 committed by GitHub
parent aadfa11ecb
commit e1b40aee0b
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 440 additions and 8 deletions

View file

@ -8,6 +8,7 @@ use lance::Dataset;
use lance::blob::BlobArrayBuilder;
use lance::dataset::scanner::{ColumnOrdering, DatasetRecordBatchStream, Scanner};
use lance::dataset::transaction::{Operation, Transaction, TransactionBuilder};
use lance::dataset::write::merge_insert::SourceDedupeBehavior;
use lance::dataset::{
CommitBuilder, InsertBuilder, MergeInsertBuilder, WhenMatched, WhenNotMatched, WriteMode,
WriteParams,
@ -651,15 +652,58 @@ impl TableStore {
return self.table_state(dataset_uri, &ds).await;
}
// Precondition for the FirstSeen workaround below: every caller of
// this primitive must hand in a source batch that is unique by
// `key_columns`. Without this check, `SourceDedupeBehavior::FirstSeen`
// would silently collapse genuine duplicates instead of erroring.
check_batch_unique_by_keys(&batch, &key_columns, "merge_insert_batch")?;
// TODO(lance-upstream): MergeInsertBuilder does not accept WriteParams,
// so allow_external_blob_outside_bases cannot be set here. External URI
// blobs via merge_insert (LoadMode::Merge, mutations) are unsupported
// until Lance exposes WriteParams on MergeInsertBuilder.
let ds = Arc::new(ds);
let job = MergeInsertBuilder::try_new(ds, key_columns)
.map_err(|e| OmniError::Lance(e.to_string()))?
.when_matched(when_matched)
.when_not_matched(when_not_matched)
let mut builder = MergeInsertBuilder::try_new(ds, key_columns)
.map_err(|e| OmniError::Lance(e.to_string()))?;
builder.when_matched(when_matched);
builder.when_not_matched(when_not_matched);
// Workaround for a Lance 4.0.x bug class where sequential
// merge_insert calls against rows previously rewritten by
// merge_insert produce a spurious "Ambiguous merge inserts:
// multiple source rows match the same target row on (id = ...)"
// error. Lance's `processed_row_ids: Mutex<HashSet<u64>>`
// (lance-4.0.0 `src/dataset/write/merge_insert.rs:2099`)
// double-processes the same source/target match against
// datasets previously rewritten by merge_insert, and the default
// `SourceDedupeBehavior::Fail` errors on the second insertion.
// `FirstSeen` makes Lance skip the duplicate match instead.
//
// Covers both observed surfaces:
// - PR #98 (sequential `load --mode merge` against same keys).
// - MR-920 (sequential `update T set {f} where x=y` on same row).
//
// Correctness-preserving for OmniGraph because every call path
// that reaches this primitive either pre-dedupes the source batch
// by id, or surfaces a real source dup via the
// `check_batch_unique_by_keys` precondition above (which fires
// before the FirstSeen setter has a chance to silently collapse
// anything):
// - Load path: `enforce_unique_constraints_intra_batch`
// (`loader/mod.rs:1453`) errors on intra-batch `@key` dups.
// - Mutate path: `MutationStaging::finalize` (`exec/staging.rs`)
// accumulates and dedupes by `id`.
// - Branch-merge path: `compute_source_delta` /
// `compute_three_way_delta` (`exec/merge.rs`) walk via
// `OrderedTableCursor` and `push_row` each id at most once.
// So FirstSeen only suppresses the spurious Lance behavior, never
// user data. Pinned by `loader_rejects_intra_batch_duplicate_keys`
// in `tests/consistency.rs` plus the
// `check_batch_unique_by_keys` precondition.
//
// Retire when upstream Lance fixes the bug class. Tracked at
// MR-957; upstream: lance-format/lance#6877.
builder.source_dedupe_behavior(SourceDedupeBehavior::FirstSeen);
let job = builder
.try_build()
.map_err(|e| OmniError::Lance(e.to_string()))?;
@ -870,11 +914,26 @@ impl TableStore {
"stage_merge_insert called with empty batch".to_string(),
));
}
// Precondition for FirstSeen below. See the comment on
// `merge_insert_batch` for why this check is here, not on the caller:
// every call path that reaches stage_merge_insert (load,
// MutationStaging::finalize, branch_merge::publish_rewritten_merge_table)
// must hand in a source batch that is unique by `key_columns`.
check_batch_unique_by_keys(&batch, &key_columns, "stage_merge_insert")?;
let ds = Arc::new(ds);
let job = MergeInsertBuilder::try_new(ds, key_columns)
.map_err(|e| OmniError::Lance(e.to_string()))?
.when_matched(when_matched)
.when_not_matched(when_not_matched)
let mut builder = MergeInsertBuilder::try_new(ds, key_columns)
.map_err(|e| OmniError::Lance(e.to_string()))?;
builder.when_matched(when_matched);
builder.when_not_matched(when_not_matched);
// See `merge_insert_batch` for the FirstSeen rationale. Workaround
// for the Lance 4.0.x bug class where sequential merge_insert /
// update against rows previously rewritten by merge_insert trips
// Lance's `processed_row_ids` HashSet and errors under the default
// `SourceDedupeBehavior::Fail`. Retire when upstream Lance is fixed.
builder.source_dedupe_behavior(SourceDedupeBehavior::FirstSeen);
let job = builder
.try_build()
.map_err(|e| OmniError::Lance(e.to_string()))?;
let schema = batch.schema();
@ -1651,3 +1710,107 @@ fn combine_committed_with_staged(ds: &Dataset, staged: &[StagedWrite]) -> Vec<Fr
}
combined
}
/// Precondition guard for `merge_insert_batch` and `stage_merge_insert`.
/// Both opt into `SourceDedupeBehavior::FirstSeen` to suppress the Lance
/// `processed_row_ids` bug (MR-957). FirstSeen would *also* silently
/// collapse genuine duplicate source keys; this check restores fail-fast
/// behavior on real dups by erroring before the builder gets a chance to
/// silently skip them.
///
/// Today only single-column string keys are used at the call sites
/// (`vec!["id".to_string()]`). The check restricts itself to that shape
/// and surfaces an internal error if a future caller passes anything
/// else — keeping the assumption explicit instead of silently degrading.
fn check_batch_unique_by_keys(
batch: &RecordBatch,
key_columns: &[String],
context: &'static str,
) -> Result<()> {
if key_columns.len() != 1 {
return Err(OmniError::manifest_internal(format!(
"{}: check_batch_unique_by_keys currently supports single-column keys only, got {:?}",
context, key_columns
)));
}
let key_col_name = &key_columns[0];
let column = batch.column_by_name(key_col_name).ok_or_else(|| {
OmniError::manifest_internal(format!(
"{}: source batch missing key column '{}'",
context, key_col_name
))
})?;
let strs = column
.as_any()
.downcast_ref::<StringArray>()
.ok_or_else(|| {
OmniError::manifest_internal(format!(
"{}: key column '{}' is not a StringArray (got {:?})",
context,
key_col_name,
column.data_type()
))
})?;
let mut seen: std::collections::HashSet<&str> =
std::collections::HashSet::with_capacity(batch.num_rows());
for i in 0..strs.len() {
if !strs.is_valid(i) {
continue;
}
let v = strs.value(i);
if !seen.insert(v) {
return Err(OmniError::manifest(format!(
"{}: duplicate source row for key '{}' (column '{}'); \
callers must hand in a batch unique by `key_columns` \
see MR-957",
context, v, key_col_name
)));
}
}
Ok(())
}
#[cfg(test)]
mod tests {
use super::*;
use arrow_array::StringArray;
use arrow_schema::{DataType, Field, Schema};
fn batch_with_ids(ids: &[&str]) -> RecordBatch {
let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Utf8, false)]));
let col = Arc::new(StringArray::from(ids.to_vec())) as ArrayRef;
RecordBatch::try_new(schema, vec![col]).unwrap()
}
#[test]
fn check_batch_unique_by_keys_passes_when_all_unique() {
let batch = batch_with_ids(&["a", "b", "c"]);
check_batch_unique_by_keys(&batch, &["id".to_string()], "test").unwrap();
}
#[test]
fn check_batch_unique_by_keys_errors_on_duplicate_id() {
let batch = batch_with_ids(&["a", "b", "a"]);
let err =
check_batch_unique_by_keys(&batch, &["id".to_string()], "test").unwrap_err();
let msg = err.to_string();
assert!(
msg.contains("duplicate source row for key 'a'"),
"unexpected error: {msg}"
);
assert!(msg.contains("MR-957"), "error should reference MR-957: {msg}");
}
#[test]
fn check_batch_unique_by_keys_rejects_multi_column_keys() {
let batch = batch_with_ids(&["a"]);
let err = check_batch_unique_by_keys(
&batch,
&["id".to_string(), "other".to_string()],
"test",
)
.unwrap_err();
assert!(err.to_string().contains("single-column keys only"));
}
}

View file

@ -119,6 +119,189 @@ async fn load_merge_upserts_existing_and_inserts_new() {
}
}
/// Regression: two sequential `LoadMode::Merge` invocations against the
/// same set of keys must both succeed. Pre-fix, the second one failed
/// with `Ambiguous merge inserts are prohibited: multiple source rows
/// match the same target row on (id = "TEST-1")` even though every
/// source batch had one row per key.
///
/// Triggered by Lance's `processed_row_ids: Mutex<HashSet<u64>>`
/// (lance-4.0.0 `src/dataset/write/merge_insert.rs:2099`) double-
/// processing the same source/target match against datasets previously
/// rewritten by merge_insert. Worked around by opting
/// `MergeInsertBuilder` into `SourceDedupeBehavior::FirstSeen` in
/// `crates/omnigraph/src/table_store.rs` — see that file for the full
/// rationale and the safety pin (`loader_rejects_intra_batch_duplicate_keys`).
/// Tracked at MR-957; upstream: lance-format/lance#6877.
#[tokio::test]
async fn load_merge_repeated_against_overlapping_keys_succeeds() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let schema = r#"
node Thing {
key: String @key
required_val: String
optional_val: String?
}
"#;
let mut db = Omnigraph::init(uri, schema).await.unwrap();
// Seed with 50 fully-populated rows (id + required + optional).
let mut seed = String::new();
for i in 1..=50 {
seed.push_str(&format!(
r#"{{"type":"Thing","data":{{"key":"TEST-{i}","required_val":"required {i}","optional_val":"optional {i}"}}}}
"#,
));
}
load_jsonl(&mut db, &seed, LoadMode::Overwrite)
.await
.unwrap();
// Partial-schema delta — mirrors the bug report exactly: omits
// `optional_val`. 25 existing keys + 5 new keys, one row per key.
let mut delta = String::new();
for i in (1..=25).chain(51..=55) {
delta.push_str(&format!(
r#"{{"type":"Thing","data":{{"key":"TEST-{i}","required_val":"required {i} UPDATED"}}}}
"#,
));
}
load_jsonl(&mut db, &delta, LoadMode::Merge)
.await
.expect("first merge must succeed");
assert_eq!(count_rows(&db, "node:Thing").await, 55);
load_jsonl(&mut db, &delta, LoadMode::Merge)
.await
.expect("second merge against same keys must succeed");
assert_eq!(count_rows(&db, "node:Thing").await, 55);
}
/// Safety pin for the `SourceDedupeBehavior::FirstSeen` workaround in
/// `crates/omnigraph/src/table_store.rs`. FirstSeen tells Lance to
/// silently skip a duplicate source row instead of erroring. Our use of
/// it depends on user-provided duplicates being rejected *before* the
/// batch reaches Lance — otherwise FirstSeen could silently drop user
/// data.
///
/// Defense in depth:
/// 1. The loader's `enforce_unique_constraints_intra_batch`
/// (`loader/mod.rs:1453`), invoked unconditionally on any node type
/// with a `@key`, errors on intra-batch duplicate `@key` values at
/// intake — pinned by this test across every `LoadMode`.
/// 2. The `check_batch_unique_by_keys` precondition at the top of
/// `merge_insert_batch` and `stage_merge_insert` is the final
/// fail-fast guard: even if a future caller bypasses the loader path
/// (e.g. branch-merge's `publish_rewritten_merge_table` builds its
/// own source batch directly), a real duplicate id reaches Lance
/// only after surfacing as an `OmniError::Manifest`, never silently
/// via FirstSeen. Pinned by the unit tests in `table_store::tests`.
#[tokio::test]
async fn loader_rejects_intra_batch_duplicate_keys() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let schema = r#"
node Thing {
key: String @key
value: String
}
"#;
let mut db = Omnigraph::init(uri, schema).await.unwrap();
let dupes = r#"{"type":"Thing","data":{"key":"DUP","value":"first"}}
{"type":"Thing","data":{"key":"DUP","value":"second"}}
"#;
for mode in [LoadMode::Overwrite, LoadMode::Append, LoadMode::Merge] {
let err = load_jsonl(&mut db, dupes, mode).await.unwrap_err();
let msg = err.to_string();
assert!(
msg.contains("@unique violation") && msg.contains("DUP"),
"load mode {mode:?} must reject intra-batch duplicate @key (got: {msg})"
);
assert_eq!(
count_rows(&db, "node:Thing").await,
0,
"load mode {mode:?} must not persist any rows when the batch is rejected"
);
}
}
/// Canary for the upstream Lance gap that the `FirstSeen` workaround
/// in `table_store.rs` masks. The bug class is "Window 2": load →
/// indices built explicitly → merge → merge. Even with the engine
/// fully aligned to the "indexes are derived state" invariant
/// (MR-848), as long as an `id` index has been built between the
/// first and second merge_insert, the Lance internal that triggers
/// the bug remains reachable.
///
/// This test runs the Window-2 sequence under the FirstSeen workaround.
/// It is expected to pass today. If a future Lance upgrade or local
/// change makes it START failing, the workaround has lost effectiveness
/// (upstream Lance changed something, or the FirstSeen setter was
/// dropped from `table_store.rs`). If a future Lance upgrade fixes the
/// bug class, this test continues to pass and the FirstSeen setter can
/// be retired.
///
/// Tracked at MR-957; upstream: lance-format/lance#6877.
#[tokio::test]
async fn load_merge_window_2_documents_upstream_lance_gap() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let schema = r#"
node Thing {
key: String @key
required_val: String
optional_val: String?
}
"#;
let mut db = Omnigraph::init(uri, schema).await.unwrap();
let mut seed = String::new();
for i in 1..=50 {
seed.push_str(&format!(
r#"{{"type":"Thing","data":{{"key":"TEST-{i}","required_val":"required {i}","optional_val":"optional {i}"}}}}
"#,
));
}
load_jsonl(&mut db, &seed, LoadMode::Overwrite)
.await
.unwrap();
// Explicit ensure_indices between seed and the merges — the Window
// 2 trigger. The eager-build behavior (MR-583) means the BTREE on
// `id` is already present here, but calling explicitly pins the
// invariant for the post-MR-848 future where the eager build is
// gone.
db.ensure_indices().await.unwrap();
let mut delta = String::new();
for i in (1..=25).chain(51..=55) {
delta.push_str(&format!(
r#"{{"type":"Thing","data":{{"key":"TEST-{i}","required_val":"required {i} UPDATED"}}}}
"#,
));
}
// Both merges must succeed under the FirstSeen workaround.
// `processed_row_ids` re-processes the same target row_id under
// the default `SourceDedupeBehavior::Fail`; FirstSeen tolerates it.
load_jsonl(&mut db, &delta, LoadMode::Merge)
.await
.expect("first merge after ensure_indices must succeed");
db.ensure_indices().await.unwrap();
load_jsonl(&mut db, &delta, LoadMode::Merge)
.await
.expect(
"second merge after ensure_indices must succeed \
(Window 2 canary: drop the FirstSeen setter in table_store.rs \
only when this stays green WITHOUT it)",
);
assert_eq!(count_rows(&db, "node:Thing").await, 55);
}
#[tokio::test]
async fn cross_type_traversal_deduplicates_duplicate_edges() {
let dir = tempfile::tempdir().unwrap();

View file

@ -521,6 +521,10 @@ query delete_two_persons($first: String, $second: String) {
delete Person where name = $first
delete Person where name = $second
}
query update_age_by_name($name: String, $age: I32) {
update Person set { age: $age } where name = $name
}
"#;
/// D₂: a query mixing inserts/updates with deletes is rejected at parse
@ -1362,3 +1366,85 @@ query insert_then_update_note(
.unwrap();
assert_eq!(qr.num_rows(), 0, "letter must not be visible after early error");
}
/// MR-920 regression: two sequential `update T set {f:v} where x=y`
/// invocations against the same row must both succeed. Pre-fix, the
/// second one failed with `Ambiguous merge inserts are prohibited:
/// multiple source rows match the same target row on (id = "Alice")`
/// even though the scan returned exactly one row.
///
/// Root cause hypothesis (per MR-920): Lance's
/// `processed_row_ids: Mutex<HashSet<u64>>`
/// (`src/dataset/write/merge_insert.rs:2099`) double-processes the
/// same target row_id against datasets previously rewritten by
/// merge_insert. `SourceDedupeBehavior::FirstSeen` makes Lance skip
/// rather than error.
///
/// Companion to `consistency.rs::load_merge_repeated_against_overlapping_keys_succeeds`
/// (PR #98 / Window 1 of the bug class via the load surface).
#[tokio::test]
async fn second_sequential_update_on_same_row_succeeds() {
let dir = tempfile::tempdir().unwrap();
let mut db = init_and_load(&dir).await;
db.mutate(
"main",
STAGED_QUERIES,
"update_age_by_name",
&mixed_params(&[("$name", "Alice")], &[("$age", 99)]),
)
.await
.expect("first sequential update on Alice must succeed");
let batches = read_table(&db, "node:Person").await;
let alice_count: usize = batches
.iter()
.map(|b| {
let names = b
.column_by_name("name")
.unwrap()
.as_any()
.downcast_ref::<arrow_array::StringArray>()
.unwrap();
(0..b.num_rows())
.filter(|i| names.is_valid(*i) && names.value(*i) == "Alice")
.count()
})
.sum();
assert_eq!(
alice_count, 1,
"after first update, exactly one Alice row should be visible"
);
db.mutate(
"main",
STAGED_QUERIES,
"update_age_by_name",
&mixed_params(&[("$name", "Alice")], &[("$age", 42)]),
)
.await
.expect("second sequential update on Alice must succeed");
let batches = read_table(&db, "node:Person").await;
let mut alice_age: Option<i32> = None;
for batch in &batches {
let names = batch
.column_by_name("name")
.unwrap()
.as_any()
.downcast_ref::<arrow_array::StringArray>()
.unwrap();
let ages = batch
.column_by_name("age")
.unwrap()
.as_any()
.downcast_ref::<arrow_array::Int32Array>()
.unwrap();
for i in 0..batch.num_rows() {
if names.is_valid(i) && names.value(i) == "Alice" && ages.is_valid(i) {
alice_age = Some(ages.value(i));
}
}
}
assert_eq!(alice_age, Some(42), "Alice's age must reflect the second update");
}