mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-07-03 02:51:04 +02:00
feat(engine): unify constraint validation across all write surfaces (#314)
* feat(engine): unify constraint validation across all write surfaces
Constraint enforcement (value/range/check, enum, uniqueness, edge
referential integrity, cardinality) was implemented three times — once
each in the bulk loader, the mutation executor, and the branch-merge
path — and had drifted: merge validated @range/@check but not enum, and
neither the mutation nor the load path enforced cross-version uniqueness
against already-committed rows.
Introduce one catalog-derived evaluator (`crate::validate`) that all
three surfaces route through. It is delta-scoped (checks only the change
set, not the whole graph) and index-backed (probes committed state
through the @key/@unique/src/dst BTREEs instead of full-scanning every
catalog table), reusing the existing leaf checks
(validate_value_constraints, validate_enum_constraints,
composite_unique_key) so the surfaces cannot drift again. A one-row-delta
merge now opens ~3 data tables instead of ~6+, and validation cost is
flat in graph size rather than O(V+E).
Behavior changes (all stricter, none relaxed):
- Enum constraints are now enforced on the merge path (was a gap).
- A write or load whose @unique value collides with an already-committed
different row is now rejected (cross-version uniqueness); re-upserting
an existing @key still upserts.
- Uniqueness distinguishes a duplicate key WITHIN one input batch (two
distinct records -> rejected, e.g. a bulk load listing a @key twice)
from the SAME id reappearing ACROSS batches (ordered supersession of
one logical row -> coalesced, e.g. a mutation insert-then-update).
- Overwrite loads validate per-table: a touched table's committed view is
its replacement image (empty), but a table absent from the batch keeps
its committed rows, so an edges-only overwrite still resolves
referential integrity against retained nodes.
Remove the per-surface validation orchestration the evaluator supersedes,
and the now-orphaned version-pinned dataset opener from the sealed
storage trait (reads route through the snapshot path). Docs (invariants,
testing) updated; full engine suite green.
* test(engine): pin orphan-edge validation on adopt-by-pointer merge
Regression for a gap in the unified merge validation: when a table is
adopted by pointer switch (AdoptSourceState) — source on main, target on a
branch — build_merge_changeset skips it, so referential integrity is never
checked for it. Merging main into a branch that deleted a node while main
added an edge to that node silently publishes the orphan edge.
This test merges main -> feature where feature deleted Bob and main added
Knows Alice->Bob, and asserts an OrphanEdge conflict. Red against HEAD
(merge returns Merged); turns green with the AdoptSourceState validation fix.
* fix(engine): validate adopt-by-pointer merge tables (AdoptSourceState)
The unified merge validator skipped any table classified AdoptSourceState
(a pointer switch / fork), so referential integrity, uniqueness, and
cardinality were never checked for it. Merging main into a branch that
deleted a node while main added an edge to that node silently published the
orphan edge — the prior full-scan validation caught it.
Root cause: classify_adopt keyed AdoptSourceState on the publish mechanism
("does it advance Lance HEAD") and returned before computing any delta, and
build_merge_changeset then skipped the table. Fix decouples the validation
input from the publish mechanism: classify_adopt now always computes the
source-vs-target delta (base == target on this path, so it is the right
validation delta) and carries it as AdoptSourceState { validation_delta };
build_merge_changeset validates it exactly like AdoptWithDelta. The publish
stays a pointer/fork (delta ignored) and remains excluded from recovery
pins, so publish/recovery semantics are unchanged — only validation is
restored. Closes the class: no publish optimization can bypass validation.
Turns the orphan-edge regression test green.
* test(engine): pin typed committed-uniqueness probe on non-String columns
The cross-version @unique check pushes a committed-state filter built from
the stringified key. On a non-String @unique column (e.g. Date) this compares
a Date32 column to a Utf8 literal — and the stringified key is the raw day
count, so the probe raises "Cannot cast string '20633' to Date32" for ANY
second write to the table (colliding or not).
Two regressions: a colliding Date value must surface a proper "@unique
violation" (not a coercion error), and a non-colliding write must succeed.
Both red against HEAD; green with the typed-literal probe fix.
* fix(engine): build committed uniqueness probe from typed column values
The cross-version @unique check pushed a Lance filter built with a
stringified key (lit(String)) against the real, typed column. On a
non-String @unique column this compared a Date32/numeric/bool column to a
Utf8 literal: a coercion error on Date/Bool (failing every write to the
table) or a silent miss on Float. For Date the stringified key was even the
raw day count, so the literal could never parse.
unique_holders now takes typed ScalarValues, built at the call site via
ScalarValue::try_from_array(group_column, row), so the pushed-down predicate
compares like-typed for any scalar @unique. The in-memory intra-delta dedup
keeps the stringified key (a type-agnostic equality grouping, unaffected).
Turns the Date @unique cross-version regression tests green.
* test(engine): pin id-keyed cardinality on merge-load edge moves/dups
Two cardinality drifts between validation and what commit persists:
- Move (B): a Merge-load that moves an edge to a new src only recounts the
new src, so vacating a src and dropping it below @card min is missed —
moving Alice's only WorksAt to Bob silently succeeds under @card(1..).
- Dup (A): a Merge-load batch listing one edge id under two srcs counts it
under both, but commit dedupes by id (last-wins). Alice gets a phantom
second edge and a spurious "has 2 edges (max 1)" violation under @card(0..1).
Both red against HEAD; green with the id-keyed last-wins cardinality model.
* fix(engine): key merge/load cardinality by edge id, last-wins
@card validation diverged from what commit persists in two ways: (1) it only
recounted the new src of a delta edge, so a Merge-load that moves an edge to a
new src never rechecked the vacated src and missed a drop below @card min; (2)
it counted raw delta rows, so the same edge id under two srcs in one batch was
counted under both, while commit dedupes by id (last-wins) — a phantom edge
and a spurious max violation.
evaluate_cardinality now coalesces the delta by edge id (last-wins, matching
dedupe_merge_batches_by_id) and builds the affected-src set from both the new
src of each delta edge AND the old committed src of each changed/deleted edge
id; a committed edge is dropped from its src when the delta deletes or
re-places it. The validated edge set per src now equals the committed image.
Turns the edge-move and duplicate-id cardinality regression tests green.
* docs(rfcs): add RFC 0001 — branch merge by fragment adoption
Proposed design for the by-design fix to merge cost/OOM: adopt the source
branch's Lance fragments by reference (base_paths) instead of re-materializing
rows, with a re-home reconciler + branch-delete reference guard closing the
dangling-reference lifecycle, and a reachability-complete cleanup sweep. Grounded
in the public Lance 7.0.0 multi-base APIs and the prior art (Delta shallow/deep
clone, Iceberg/lakeFS reachability GC). Status: Proposed.
* test(engine): pin @card validation on direct edge delete
Deletes stage as predicates, not constructive batches, so a delete-only
mutation produces an empty change-set and validate_changeset no-ops — a
`delete WorksAt where from = X` that removes a source's only edge commits
below @card(1..), while the merge path (which carries deleted_ids) rejects it.
Red against HEAD (the delete commits); green once the delete path resolves
its predicates into the validation change-set.
* fix(engine): validate edge cardinality on delete via resolved predicates
A delete-only mutation produced an empty change-set (deletes stage as
predicates, not constructive batches), so validate_changeset no-op'd and a
`delete Edge` that dropped a source below @card min committed silently — while
the merge path, which carries deleted_ids, rejects it.
validate_staged_mutation now resolves each staged delete predicate against the
live committed table (CommittedState::deleted_ids_matching, a SQL-filter scan
projecting id) and folds the matched ids into the change-set's deleted_ids for
that table. The existing evaluator then recounts the srcs a delete empties
(@card min) and sees removed rows for RI/node-delete — the same faithful
change-set the merge path already builds, so validation matches what commits.
Covers direct edge deletes, node deletes, and node-delete edge cascades
uniformly (all are staged predicates).
Turns the direct-edge-delete @card regression test green.
* refactor(engine): capture deleted ids at delete time, drop validation re-scan
The delete-cardinality fix resolved staged delete predicates a second time at
validation. Instead, capture the removed ids during the delete op's own scan:
execute_delete_edge and the node-delete edge cascade now scan id (not
count_rows), record the ids via MutationStaging::record_deleted_ids, and
to_changeset() folds them into the change-set's deleted_ids. validate_staged_
mutation reverts to plain to_changeset(); CommittedState::deleted_ids_matching
and scan_filtered_sql are removed.
Behavior-preserving (the @card-on-delete test stays green) and strictly fewer
scans — one scan at delete time replaces count-here + resolve-at-validation.
Node deletes already scanned their ids; this reuses that via a shared
ids_from_batches helper. Full engine suite green; workspace builds clean.
* test(engine): pin overwrite-removal RI + coalesced-unique final image
Two reviewer findings, both red against HEAD:
- F1 (High): overwriting a node table removes nodes without expressing them as
deleted_ids, so a retained edge in a non-overwritten table that references a
removed node is published as an orphan (edge-RI path-b never runs).
overwrite_node_removal_rejects_retained_orphan_edge.
- F2 (Medium): evaluate_unique accumulates superseded keys across batches, so a
mutation that frees a @unique value (Alice.email temp -> final) and reuses it
(insert Carol.email = temp) false-rejects a valid final image.
chained_unique_update_then_reuse_freed_value_is_not_a_violation.
* fix(engine): validate overwrite removals (orphan edges, emptied srcs)
An Overwrite load replaces each touched table, but to_changeset() only recorded
the new batch, never the committed rows the overwrite removes. So overwriting
node:Person to drop Bob while a retained edge:Knows(Alice->Bob) referenced him
published an orphan edge unchecked — edge-RI path-b is gated on the node's
deleted_ids, which were empty.
The loader now computes per overwritten table the removed ids (committed ids in
the pinned base minus the replacement batch's ids, via validate::
overwrite_removed_ids) and folds them into the change-set's deleted_ids. The
evaluator then runs RI path-b and cardinality against them — the same faithful
change-set the merge path builds. Overwrite is per-table, so a table absent from
the batch is untouched; a removed node referenced by a retained edge is now a
loud OrphanEdge.
Updates two tests that asserted the old silent-orphan behavior to
self-consistent overwrites (per-table Overwrite can't drop edge endpoints
without also overwriting the edge tables): end_to_end::overwrite_replaces_data
and writes::load_overwrite_with_bad_edge_reference_unblocks_next_load. The
orphan-rejection case itself is pinned by the new validators test.
* fix(engine): evaluate @unique against the coalesced final delta image
evaluate_unique iterated the raw delta batches and accumulated every key it saw
into one cross-batch map, so a coalesced write that frees then reuses a @unique
value within a query — update a row's email to 'temp', update the same row to
'final', insert a new row with 'temp' — false-rejected: 'temp' lingered in the
seen-set from the superseded first write though it no longer holds in the final
image that commits.
Restructure to validate the final coalesced image — the bytes that actually
publish:
- Pass 1 coalesces the delta by id (last-wins) into each id's final key, and
flags genuine within-ONE-batch duplicates (two distinct input records — the
bulk-load contract) before coalescing, so an unordered load batch with a real
dup still rejects.
- Pass 2 checks two distinct final ids holding the same key.
- Pass 3 does the committed cross-version lookup, excluding the delta's own ids.
Entries are sorted by id before the cross-row/committed passes so violation
order never depends on HashMap iteration. Coalescing first also drops the
redundant committed probes a superseded key used to issue.
Pinned by the chained-update red test; preserves intra-batch dup rejection
(consistency::loader_rejects_intra_batch_duplicate_keys) and cross-version
uniqueness (validators).
* style(engine): drop trailing blank line at staging.rs EOF
Left by a block-delete in an earlier refactor; flagged by git diff --check.
* docs(engine): refresh validate.rs module doc to current consumers
The module doc still said the merge path was the only consumer and the write
path a later, mechanical migration, and listed cardinality as a later
increment. Mutation and bulk load have since migrated onto the evaluator and
cardinality ships — correct both so the doc reflects that all three write
surfaces route through one evaluator.
This commit is contained in:
parent
4afb513700
commit
0dce7c8d18
18 changed files with 2467 additions and 1102 deletions
|
|
@ -43,7 +43,8 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
|
|||
| `export.rs` | NDJSON streaming export filters |
|
||||
| `s3_storage.rs` | S3-backed graph (skipped unless `OMNIGRAPH_S3_TEST_BUCKET` is set). Includes `s3_fresh_branch_traversal_reuses_main_graph_index_with_etags` — the CSR topology cache-key test on a **real** per-table e_tag (`None` on local FS, so `warm_read_cost.rs` can't reach this path); forces CSR via the scoped `with_traversal_mode` seam |
|
||||
| `lance_version_columns.rs` | Per-row `_row_last_updated_at_version` behavior |
|
||||
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
|
||||
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL load, mutation insert/update. ALL THREE write surfaces — mutation, bulk load, AND merge — route through the unified `crate::validate` evaluator (Δ-scoped, index-backed, reusing these leaf checks). Cross-version-uniqueness closure: `cross_version_unique_rejected_on_mutation_insert` + `reinsert_existing_key_is_upsert_not_unique_violation` (mutation path); `cross_version_unique_rejected_on_append_load` + `merge_load_reupsert_existing_key_is_not_unique_violation` (load path). Per-table `Overwrite`: `overwrite_load_validates_ri_against_new_image` (an edges-only overwrite still resolves RI against retained committed nodes) + `append_load_rejects_orphan_edge`. The evaluator's own unit tests live in `src/validate.rs` (`#[cfg(test)]`); its merge-conflict equivalence is pinned by `merge_truth_table.rs` (OrphanEdge) + `branching.rs` (Unique/Cardinality merge tests). Intra-batch duplicate-`@key` rejection on every load mode is pinned by `consistency.rs::loader_rejects_intra_batch_duplicate_keys`; the mutation-coalesce counterpart (insert+update / chained updates of one id are NOT a self-collision) by `writes.rs`. Non-String `@unique` columns probe committed state with a TYPED literal (not a stringified key): `cross_version_unique_rejected_on_date_column` + `noncolliding_write_to_date_unique_column_succeeds` (a `Date @unique` collision is a proper `@unique` violation, and a distinct value does not raise a Date32-vs-Utf8 coercion error). Cardinality is keyed by edge id, last-wins (matching commit's `dedupe_merge_batches_by_id`): `merge_load_edge_src_move_rechecks_vacated_src_cardinality` (a Merge-load moving an edge recounts the vacated src for `@card` min) + `merge_load_duplicate_edge_id_counts_once_per_card` (a dup edge id under two srcs in one batch counts once, no spurious max violation). Direct deletes capture the ids they remove (from the delete op's own scan) into the change-set's `deleted_ids`, so a delete emptying a src is validated: `mutation_delete_edge_below_card_min_rejected` (a `delete Edge` dropping a src below `@card` min is rejected, not silently committed). |
|
||||
| `merge_cost.rs` | Cost-budget tests for branch MERGE on the shared `helpers::cost` harness: `merge_validation_is_delta_scoped` (a 1-row-delta merge opens ≤3 data tables — Δ-scoped, not the whole catalog; was ~6 pre-#5) and `merge_manifest_cost_grows_with_history` (the cross-branch `__manifest` open amplification still grows with commit depth — a separate, not-yet-addressed term — while validation `data_open_count` stays flat) |
|
||||
| `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly — no HTTP — proving embedded and CLI callers hit the same gate as the server, with action × scope shapes matching `authorize_request` |
|
||||
| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice; the index reconciler (iss-848): `index_build_tolerates_null_vector_rows` (an untrainable Vector column defers instead of aborting the build, sibling indexes still build) and `optimize_materializes_index_declared_but_unbuilt` (optimize creates a declared-but-deferred index) |
|
||||
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). And the convergence-idempotent roll-forward regression (`open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one sidecar at the `recovery.before_roll_forward_publish` rendezvous; the CAS loser must converge, not fail the open — iss-schema-apply-reopen-recovery-race). |
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue