fix(unique): collision-free tuple key shared by intake and merge, loud on un-keyable types (#160)

* fix(unique): collision-free tuple key shared by intake and merge, loud on un-keyable types

Hardening on top of #133. That PR introduced a shared
`loader::composite_unique_key(parts)` joining per-column scalars with U+001F
and routed both intake and branch-merge through it, closing the original
'|' vs U+001F separator drift. This takes the shared keying the rest of the
way to correct-by-design:

- Collision-free by construction: the key is now the tuple of per-column
  scalar strings (Vec<String>) keyed directly, no separator, so no data value
  (not even a literal U+001F) can forge a collision.
- One scalar converter across both paths: intake used an explicit type-match,
  merge used Arrow's array_value_to_string. Both now derive the key through
  composite_unique_key(group_columns, row), so they can't drift on conversion.
- Loud on un-keyable types: the scalar converter returned None for any Arrow
  type it didn't recognize, and the caller treated None as null-exempt, so a
  @unique on a column type it couldn't reduce (list, blob) was silently
  un-enforced. It now returns Err, surfacing the constraint it can't enforce
  instead of weakening it in silence.

Tests:
- consistency::composite_unique_key_is_consistent_across_intake_and_merge pins
  that intake and merge key the tuple identically (load-on-branch then merge
  of values containing '|').
- loader unit tests pin tuple keying + null exemption and the loud error on an
  un-keyable (binary) column.

Docs: invariants truth-matrix updated; stale loader/mod.rs line pointers fixed.
Scope unchanged: intra-batch / merge-candidate-set only; cross-version
uniqueness against committed rows stays a documented gap.

* fix(unique): cover all string encodings; make format_tuple private (PR #160 review)

Addresses two Greptile P2 comments on PR #160:

- unique_key_scalar handled only StringArray (Utf8). The loud-on-unknown-type
  behavior turned any legal string column that read back as LargeUtf8 or
  Utf8View into a hard write failure (the old code silently returned None). Add
  LargeStringArray and StringViewArray arms so a legal string column is keyable
  in every physical Arrow encoding; the Err path now fires only for a genuinely
  un-keyable logical type (list/blob/vector), never a legal value in an
  unenumerated encoding.
- format_tuple was pub(crate) but only used within loader/mod.rs; make it a
  private fn (matches the old format_unique_columns it replaced, minimal
  exposed surface).

New unit test unique_key_scalar_handles_all_string_encodings pins that Utf8 /
LargeUtf8 / Utf8View all render rather than error.
This commit is contained in:
Ragnor Comerford 2026-06-09 19:28:21 +02:00 committed by GitHub
parent dbfdddc952
commit e0d88d1295
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
5 changed files with 235 additions and 73 deletions

View file

@ -101,7 +101,7 @@ Use it this way:
| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/query-language.md), [writes.md](writes.md) |
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branches-commits.md), [maintenance.md](../user/maintenance.md) |
| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema-language.md), [execution.md](execution.md) |
| Unique constraints | Intra-batch and write-path checks exist; full cross-version uniqueness is still a gap | [schema-language.md](../user/schema-language.md) |
| Unique constraints | Intra-batch and write-path checks exist; intake and branch-merge derive the composite key through one shared function (`loader::composite_unique_key`, a separator-free `Vec<String>` tuple) and fail loudly on an un-keyable column type rather than silently exempting it; full cross-version uniqueness against already-committed rows is still a gap | [schema-language.md](../user/schema-language.md) |
| Storage trait | `TableStorage` exists as the sealed staged-write surface; full call-site migration and capability/stat surfaces are incomplete | [writes.md](writes.md), [architecture.md](architecture.md) |
| Index lifecycle | `ensure_indices` is explicit today; reconciler-based convergence is roadmap | [indexes.md](../user/indexes.md), [maintenance.md](../user/maintenance.md) |
| Traversal IDs | Runtime still builds `TypeIndex`; Lance stable row-id based graph IDs are roadmap | [architecture.md](architecture.md), [query-language.md](../user/query-language.md) |