mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-07-03 02:51:04 +02:00
feat(engine): unify constraint validation across all write surfaces (#314)
* feat(engine): unify constraint validation across all write surfaces
Constraint enforcement (value/range/check, enum, uniqueness, edge
referential integrity, cardinality) was implemented three times — once
each in the bulk loader, the mutation executor, and the branch-merge
path — and had drifted: merge validated @range/@check but not enum, and
neither the mutation nor the load path enforced cross-version uniqueness
against already-committed rows.
Introduce one catalog-derived evaluator (`crate::validate`) that all
three surfaces route through. It is delta-scoped (checks only the change
set, not the whole graph) and index-backed (probes committed state
through the @key/@unique/src/dst BTREEs instead of full-scanning every
catalog table), reusing the existing leaf checks
(validate_value_constraints, validate_enum_constraints,
composite_unique_key) so the surfaces cannot drift again. A one-row-delta
merge now opens ~3 data tables instead of ~6+, and validation cost is
flat in graph size rather than O(V+E).
Behavior changes (all stricter, none relaxed):
- Enum constraints are now enforced on the merge path (was a gap).
- A write or load whose @unique value collides with an already-committed
different row is now rejected (cross-version uniqueness); re-upserting
an existing @key still upserts.
- Uniqueness distinguishes a duplicate key WITHIN one input batch (two
distinct records -> rejected, e.g. a bulk load listing a @key twice)
from the SAME id reappearing ACROSS batches (ordered supersession of
one logical row -> coalesced, e.g. a mutation insert-then-update).
- Overwrite loads validate per-table: a touched table's committed view is
its replacement image (empty), but a table absent from the batch keeps
its committed rows, so an edges-only overwrite still resolves
referential integrity against retained nodes.
Remove the per-surface validation orchestration the evaluator supersedes,
and the now-orphaned version-pinned dataset opener from the sealed
storage trait (reads route through the snapshot path). Docs (invariants,
testing) updated; full engine suite green.
* test(engine): pin orphan-edge validation on adopt-by-pointer merge
Regression for a gap in the unified merge validation: when a table is
adopted by pointer switch (AdoptSourceState) — source on main, target on a
branch — build_merge_changeset skips it, so referential integrity is never
checked for it. Merging main into a branch that deleted a node while main
added an edge to that node silently publishes the orphan edge.
This test merges main -> feature where feature deleted Bob and main added
Knows Alice->Bob, and asserts an OrphanEdge conflict. Red against HEAD
(merge returns Merged); turns green with the AdoptSourceState validation fix.
* fix(engine): validate adopt-by-pointer merge tables (AdoptSourceState)
The unified merge validator skipped any table classified AdoptSourceState
(a pointer switch / fork), so referential integrity, uniqueness, and
cardinality were never checked for it. Merging main into a branch that
deleted a node while main added an edge to that node silently published the
orphan edge — the prior full-scan validation caught it.
Root cause: classify_adopt keyed AdoptSourceState on the publish mechanism
("does it advance Lance HEAD") and returned before computing any delta, and
build_merge_changeset then skipped the table. Fix decouples the validation
input from the publish mechanism: classify_adopt now always computes the
source-vs-target delta (base == target on this path, so it is the right
validation delta) and carries it as AdoptSourceState { validation_delta };
build_merge_changeset validates it exactly like AdoptWithDelta. The publish
stays a pointer/fork (delta ignored) and remains excluded from recovery
pins, so publish/recovery semantics are unchanged — only validation is
restored. Closes the class: no publish optimization can bypass validation.
Turns the orphan-edge regression test green.
* test(engine): pin typed committed-uniqueness probe on non-String columns
The cross-version @unique check pushes a committed-state filter built from
the stringified key. On a non-String @unique column (e.g. Date) this compares
a Date32 column to a Utf8 literal — and the stringified key is the raw day
count, so the probe raises "Cannot cast string '20633' to Date32" for ANY
second write to the table (colliding or not).
Two regressions: a colliding Date value must surface a proper "@unique
violation" (not a coercion error), and a non-colliding write must succeed.
Both red against HEAD; green with the typed-literal probe fix.
* fix(engine): build committed uniqueness probe from typed column values
The cross-version @unique check pushed a Lance filter built with a
stringified key (lit(String)) against the real, typed column. On a
non-String @unique column this compared a Date32/numeric/bool column to a
Utf8 literal: a coercion error on Date/Bool (failing every write to the
table) or a silent miss on Float. For Date the stringified key was even the
raw day count, so the literal could never parse.
unique_holders now takes typed ScalarValues, built at the call site via
ScalarValue::try_from_array(group_column, row), so the pushed-down predicate
compares like-typed for any scalar @unique. The in-memory intra-delta dedup
keeps the stringified key (a type-agnostic equality grouping, unaffected).
Turns the Date @unique cross-version regression tests green.
* test(engine): pin id-keyed cardinality on merge-load edge moves/dups
Two cardinality drifts between validation and what commit persists:
- Move (B): a Merge-load that moves an edge to a new src only recounts the
new src, so vacating a src and dropping it below @card min is missed —
moving Alice's only WorksAt to Bob silently succeeds under @card(1..).
- Dup (A): a Merge-load batch listing one edge id under two srcs counts it
under both, but commit dedupes by id (last-wins). Alice gets a phantom
second edge and a spurious "has 2 edges (max 1)" violation under @card(0..1).
Both red against HEAD; green with the id-keyed last-wins cardinality model.
* fix(engine): key merge/load cardinality by edge id, last-wins
@card validation diverged from what commit persists in two ways: (1) it only
recounted the new src of a delta edge, so a Merge-load that moves an edge to a
new src never rechecked the vacated src and missed a drop below @card min; (2)
it counted raw delta rows, so the same edge id under two srcs in one batch was
counted under both, while commit dedupes by id (last-wins) — a phantom edge
and a spurious max violation.
evaluate_cardinality now coalesces the delta by edge id (last-wins, matching
dedupe_merge_batches_by_id) and builds the affected-src set from both the new
src of each delta edge AND the old committed src of each changed/deleted edge
id; a committed edge is dropped from its src when the delta deletes or
re-places it. The validated edge set per src now equals the committed image.
Turns the edge-move and duplicate-id cardinality regression tests green.
* docs(rfcs): add RFC 0001 — branch merge by fragment adoption
Proposed design for the by-design fix to merge cost/OOM: adopt the source
branch's Lance fragments by reference (base_paths) instead of re-materializing
rows, with a re-home reconciler + branch-delete reference guard closing the
dangling-reference lifecycle, and a reachability-complete cleanup sweep. Grounded
in the public Lance 7.0.0 multi-base APIs and the prior art (Delta shallow/deep
clone, Iceberg/lakeFS reachability GC). Status: Proposed.
* test(engine): pin @card validation on direct edge delete
Deletes stage as predicates, not constructive batches, so a delete-only
mutation produces an empty change-set and validate_changeset no-ops — a
`delete WorksAt where from = X` that removes a source's only edge commits
below @card(1..), while the merge path (which carries deleted_ids) rejects it.
Red against HEAD (the delete commits); green once the delete path resolves
its predicates into the validation change-set.
* fix(engine): validate edge cardinality on delete via resolved predicates
A delete-only mutation produced an empty change-set (deletes stage as
predicates, not constructive batches), so validate_changeset no-op'd and a
`delete Edge` that dropped a source below @card min committed silently — while
the merge path, which carries deleted_ids, rejects it.
validate_staged_mutation now resolves each staged delete predicate against the
live committed table (CommittedState::deleted_ids_matching, a SQL-filter scan
projecting id) and folds the matched ids into the change-set's deleted_ids for
that table. The existing evaluator then recounts the srcs a delete empties
(@card min) and sees removed rows for RI/node-delete — the same faithful
change-set the merge path already builds, so validation matches what commits.
Covers direct edge deletes, node deletes, and node-delete edge cascades
uniformly (all are staged predicates).
Turns the direct-edge-delete @card regression test green.
* refactor(engine): capture deleted ids at delete time, drop validation re-scan
The delete-cardinality fix resolved staged delete predicates a second time at
validation. Instead, capture the removed ids during the delete op's own scan:
execute_delete_edge and the node-delete edge cascade now scan id (not
count_rows), record the ids via MutationStaging::record_deleted_ids, and
to_changeset() folds them into the change-set's deleted_ids. validate_staged_
mutation reverts to plain to_changeset(); CommittedState::deleted_ids_matching
and scan_filtered_sql are removed.
Behavior-preserving (the @card-on-delete test stays green) and strictly fewer
scans — one scan at delete time replaces count-here + resolve-at-validation.
Node deletes already scanned their ids; this reuses that via a shared
ids_from_batches helper. Full engine suite green; workspace builds clean.
* test(engine): pin overwrite-removal RI + coalesced-unique final image
Two reviewer findings, both red against HEAD:
- F1 (High): overwriting a node table removes nodes without expressing them as
deleted_ids, so a retained edge in a non-overwritten table that references a
removed node is published as an orphan (edge-RI path-b never runs).
overwrite_node_removal_rejects_retained_orphan_edge.
- F2 (Medium): evaluate_unique accumulates superseded keys across batches, so a
mutation that frees a @unique value (Alice.email temp -> final) and reuses it
(insert Carol.email = temp) false-rejects a valid final image.
chained_unique_update_then_reuse_freed_value_is_not_a_violation.
* fix(engine): validate overwrite removals (orphan edges, emptied srcs)
An Overwrite load replaces each touched table, but to_changeset() only recorded
the new batch, never the committed rows the overwrite removes. So overwriting
node:Person to drop Bob while a retained edge:Knows(Alice->Bob) referenced him
published an orphan edge unchecked — edge-RI path-b is gated on the node's
deleted_ids, which were empty.
The loader now computes per overwritten table the removed ids (committed ids in
the pinned base minus the replacement batch's ids, via validate::
overwrite_removed_ids) and folds them into the change-set's deleted_ids. The
evaluator then runs RI path-b and cardinality against them — the same faithful
change-set the merge path builds. Overwrite is per-table, so a table absent from
the batch is untouched; a removed node referenced by a retained edge is now a
loud OrphanEdge.
Updates two tests that asserted the old silent-orphan behavior to
self-consistent overwrites (per-table Overwrite can't drop edge endpoints
without also overwriting the edge tables): end_to_end::overwrite_replaces_data
and writes::load_overwrite_with_bad_edge_reference_unblocks_next_load. The
orphan-rejection case itself is pinned by the new validators test.
* fix(engine): evaluate @unique against the coalesced final delta image
evaluate_unique iterated the raw delta batches and accumulated every key it saw
into one cross-batch map, so a coalesced write that frees then reuses a @unique
value within a query — update a row's email to 'temp', update the same row to
'final', insert a new row with 'temp' — false-rejected: 'temp' lingered in the
seen-set from the superseded first write though it no longer holds in the final
image that commits.
Restructure to validate the final coalesced image — the bytes that actually
publish:
- Pass 1 coalesces the delta by id (last-wins) into each id's final key, and
flags genuine within-ONE-batch duplicates (two distinct input records — the
bulk-load contract) before coalescing, so an unordered load batch with a real
dup still rejects.
- Pass 2 checks two distinct final ids holding the same key.
- Pass 3 does the committed cross-version lookup, excluding the delta's own ids.
Entries are sorted by id before the cross-row/committed passes so violation
order never depends on HashMap iteration. Coalescing first also drops the
redundant committed probes a superseded key used to issue.
Pinned by the chained-update red test; preserves intra-batch dup rejection
(consistency::loader_rejects_intra_batch_duplicate_keys) and cross-version
uniqueness (validators).
* style(engine): drop trailing blank line at staging.rs EOF
Left by a block-delete in an earlier refactor; flagged by git diff --check.
* docs(engine): refresh validate.rs module doc to current consumers
The module doc still said the merge path was the only consumer and the write
path a later, mechanical migration, and listed cardinality as a later
increment. Mutation and bulk load have since migrated onto the evaluator and
cardinality ships — correct both so the doc reflects that all three write
surfaces route through one evaluator.
This commit is contained in:
parent
4afb513700
commit
0dce7c8d18
18 changed files with 2467 additions and 1102 deletions
|
|
@ -157,7 +157,7 @@ converge the physical state.
|
|||
| Deletes | Staged like inserts/updates (`stage_delete` via Lance 7.0 `DeleteBuilder::execute_uncommitted`, MR-A) — no inline HEAD advance; mixed insert/update/delete in one query rejected by D2 as a deliberate boundary (constructive XOR destructive per query; compose via separate mutations or a branch) | [query-language.md](../user/queries/index.md), [writes.md](writes.md) |
|
||||
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branching/index.md), [maintenance.md](../user/operations/maintenance.md) |
|
||||
| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema/index.md), [execution.md](execution.md) |
|
||||
| Unique constraints | Intra-batch and write-path checks exist; intake and branch-merge derive the composite key through one shared function (`loader::composite_unique_key`, a separator-free `Vec<String>` tuple) and fail loudly on an un-keyable column type rather than silently exempting it; full cross-version uniqueness against already-committed rows is still a gap | [schema-language.md](../user/schema/index.md) |
|
||||
| Unique constraints | Value/enum, uniqueness, edge-RI, and cardinality route through ONE unified, catalog-derived evaluator (`crate::validate`) on ALL THREE write surfaces — branch-merge, mutation, and bulk load: Δ-scoped (checks the delta, not the whole graph) and index-backed (probes committed state through its BTREE rather than full-scanning every catalog table), reusing the leaf checks (`loader::validate_value_constraints`/`validate_enum_constraints`/`composite_unique_key`) so the surfaces cannot drift. This closed the prior merge bug (merge validated `@range`/`@check` but not enum) AND the **cross-version uniqueness gap** on the mutation and load paths (a duplicate of a committed `@unique` value is now rejected; the merge path always enforced it). The committed view is the merge target snapshot (merge), the write's pinned `txn.base` (mutation), or the pinned pre-load base (load — `Overwrite` validates the batch as the whole new image, committed view empty); `@card` reads LIVE HEAD per edge table on the mutation path only (the #298 stale-handle fix). `@key` is id-backed, so it is checked intra-delta only (a committed holder of a key value is always the same row — an upsert), skipping a wasted O(Δ) probe per keyed row; `@unique` (non-key) groups do the committed lookup. | [schema-language.md](../user/schema/index.md) |
|
||||
| Storage trait | `TableStorage` (via `db.storage()`) is staged-only; the sole inline-commit residual (`create_vector_index`) is split onto a separate sealed `InlineCommitResidual` trait reached via `db.storage_inline_residual()` (MR-854), so §1 holds by construction; capability/stat surfaces are roadmap | [writes.md](writes.md), [architecture.md](architecture.md) |
|
||||
| Index lifecycle | `@index`/`@key` declares *intent*; the physical index is derived state and never fails a logical op. `schema apply` builds no indexes (records intent only; index-only changes touch no table data). `load`/`mutate` build inline through one chokepoint (`build_indices_on_dataset_for_catalog`, type-dispatched by `node_prop_index_kind`: enum + orderable scalar → BTREE, free-text String → FTS, Vector → vector) that fault-isolates an untrainable Vector column into a *pending* index instead of aborting. `optimize`/`ensure_indices` is the reconciler: it creates declared-but-missing indexes and folds appended/rewritten fragments into existing ones (`optimize_indices`), reporting still-pending columns. Explicit maintenance call, not yet a background loop | [indexes.md](../user/search/indexes.md), [maintenance.md](../user/operations/maintenance.md) |
|
||||
| Traversal IDs | Runtime still builds `TypeIndex`; Lance stable row-id based graph IDs are roadmap | [architecture.md](architecture.md), [query-language.md](../user/queries/index.md) |
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue