omnigraph/docs/user/search/indexes.md
Ragnor Comerford 1bed998052
fix(engine): scalar index coverage + filter literal coercion (query latency) (#216)
* fix(engine): lower date/datetime filter literals as typed Arrow scalars

`literal_to_expr` lowered `Date`/`DateTime` query literals as Utf8 strings,
relying on DataFusion implicit casts. Against a physical `Date32`/`Date64`
column that can coerce the column side (`CAST(col AS Utf8)`), which defeats a
scalar BTREE and degrades the scan to a full filtered read. Lower to typed
`Date32`/`Date64` scalars instead (reusing the loader's
`parse_date32_literal`/`parse_date64_literal`, already used by the in-memory
comparison arm), so the predicate stays a direct column comparison and the
index is used. Malformed literals fall back to the Utf8 string so pushdown
behavior never regresses.

Tests: unit goldens asserting the lowered literal is typed (red before, green
after) + inline-binding pushdown equality in literal_filters confirming the
epoch conversion selects the right rows.

* fix(engine): build scalar BTREE for enum and orderable-scalar @index columns

`build_indices_on_dataset_for_catalog` only handled `String` (-> FTS) and
`Vector` (-> vector). Enums are physically `String`, so an enum `@index`
column (e.g. `status`) got an FTS inverted index, which Lance never consults
for `=`; and `DateTime`/`Date`/numeric/`Bool` `@index` columns fell through
and built nothing. Both meant equality/range filters degraded to full scans
with `indices_loaded=0`.

Dispatch index kind by property type via a shared `node_prop_index_kind`:
enum + orderable scalar -> BTREE, free-text String -> FTS, Vector -> vector,
list/Blob -> none. The helper is shared by the builder and
`needs_index_work_node` so they cannot drift — the latter decides recovery-
sidecar pinning, and under-reporting would leave a HEAD-advancing index build
uncovered (invariant 5).

Tests: scalar_indexes.rs asserts enum/DateTime/numeric @index columns report
`IndexCoverage::Indexed` while free-text String/un-annotated columns stay
`Degraded` (negative control). Docs: docs/user/indexes.md.

* feat(engine): reindex in optimize to keep index coverage current

A scalar/FTS/vector index only covers the fragments it was built over. Rows
appended after the build (e.g. `ingest --mode merge`, whose commit does not
rebuild an existing index) are scanned unindexed, and `compact_files` rewrites
fragments out of coverage. Nothing folded them back in, so coverage decayed as
the graph grew — even the id/src/dst BTREEs that power traversal.

`optimize_one_table` now runs Lance `optimize_indices` after `compact_files`
(incremental merge, not retrain — the same compact->optimize_indices sequence
LanceDB's `optimize()` uses) and enters the publish path on compaction work OR
stale index coverage (new `TableStore::has_unindexed_fragments`, reusing the
fragment_bitmap logic). `optimize_indices` is a committing call with no
uncommitted variant in lance-6.0.1, so it is an inline-commit residual covered
by the existing `SidecarKind::Optimize` recovery sidecar spanning both ops.
Blob-bearing tables are still skipped (the Lance blob-compaction bug is
compaction-specific; reindex-for-blob deferred as a noted follow-up).

Tests: maintenance.rs asserts an appended fragment is uncovered before and
covered after optimize, and idempotency holds (second pass is a no-op).
lance_surface_guards pins the `optimize_indices` signature and its incremental-
coverage behavior. The existing optimize Phase-B recovery failpoint now also
exercises a crash after reindex. Docs: maintenance.md, writes.md, invariants.md,
lance.md, AGENTS.md.

* fix(engine): coerce pushdown filter literals to the column type

Filter literals were pushed to Lance in their natural Arrow type (every integer
Int64, every float Float64). Against a narrower indexed column DataFusion widens
to the literal's type and casts the COLUMN (`CAST(n32 AS Int64)`), which defeats
the scalar BTREE and degrades to a full filtered read. A physical-plan probe
confirms it: an Int32 column filtered by an i32 literal uses `ScalarIndexQuery`;
by an i64 literal it does not.

Thread the scan's `arrow_schema` through `build_lance_filter_expr` ->
`ir_filter_to_expr` and coerce each literal operand to the opposite column's
exact Arrow type, reusing `projection::literal_to_array` + `arrow_cast` (the same
path the in-memory arm uses, so the two arms agree). Coercion never demotes a
filter to None: on failure it falls back to the natural literal, because a node
scan has no in-memory fallback for inline filters.

Supersedes the date-specific change in e4ef67b (PR1): the probe shows dates were
never index-defeated — temporal coercion casts the LITERAL, not the column — so
PR1's index-use rationale was wrong though harmless. The generic coercion
subsumes it; `literal_to_expr`'s date arms revert to the natural Utf8 fallback,
and its unit tests now assert the live coerced path.

Tests: surface guard `scalar_index_use_requires_matched_literal_type` pins the
substrate behavior (matched -> index, widened -> column-cast full scan); unit
tests cover Int32/UInt32/Float32 coercion, range op, reversed operand order, and
the natural fallback; `literal_filters` adds an I32 column with equality + range
and an F32 pushdown case.

* fix(engine): only coerce filter literals when the cast is lossless

The literal coercion in f064121 narrowed unconditionally. typecheck permits
numeric cross-type comparisons (`types_compatible`), so an out-of-domain literal
reaches `literal_to_typed_expr` and casts lossily: a fractional float vs an
integer column truncates (`{ count: 2.7 }` -> `count = 2`, wrongly matching the
count=2 row) and an out-of-range integer overflows to null (`count < 3e9` on I32
-> `count < NULL` -> empty). Both silently change results, and a node scan has no
in-memory fallback for inline filters.

Add a lossless guard for integer targets: round-trip the cast back to the natural
type and, on mismatch, return None so the caller keeps the natural literal
(correct via DataFusion coercion; the index is just unused for that out-of-domain
predicate). Float targets stay coerced -- narrowing F64 -> F32 is the column's own
precision domain, not a value error.

Resolves the two valid review findings on PR #216 (Codex float truncation, Greptile
out-of-range). Tests: unit cases for fractional/out-of-range fallback vs
whole-float/in-range coerce vs F32 exemption; e2e `{ count: 2.7 }` returns no rows.
2026-06-14 16:31:19 +02:00


Indexes

L1 — Lance index types OmniGraph exposes

| Index | Use | Notes |
|---|---|---|
| BTREE scalar | = / range / IN / IS NULL on a scalar | always on the node id and edge src/dst; and on each one-column @index/@key property that is an enum or an orderable scalar (DateTime/Date/I32/I64/U32/U64/F32/F64/Bool) |
| Inverted (FTS) | search, fuzzy, match_text, bm25 | created on free-text (non-enum) String @index/@key columns |
| Vector | nearest() k-NN | Lance picks IVF_PQ vs HNSW family by configuration; OmniGraph stores as FixedSizeList(Float32, dim) |

The per-property index a column gets is decided by node_prop_index_kind (shared by the builder and the sidecar-pinning coverage check so they cannot drift): enums and orderable scalars → BTREE, free-text Strings → FTS, Vector → vector, list/Blob columns → none.
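
The dispatch rule above can be sketched as a small lookup. This is an illustrative Python model only, not the engine's `node_prop_index_kind`: the function name `index_kind_for`, the `IndexKind` enum, and the string type names are assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class IndexKind(Enum):
    BTREE = "btree"
    FTS = "fts"
    VECTOR = "vector"


# Orderable scalar types, as listed in the index table above.
ORDERABLE_SCALARS = {"DateTime", "Date", "I32", "I64", "U32", "U64", "F32", "F64", "Bool"}


def index_kind_for(prop_type: str, is_enum: bool = False) -> Optional[IndexKind]:
    """Hypothetical stand-in for the per-property dispatch rule."""
    # Enums are physically Strings, so the enum check must come before the String check.
    if is_enum or prop_type in ORDERABLE_SCALARS:
        return IndexKind.BTREE
    if prop_type == "String":
        return IndexKind.FTS
    if prop_type == "Vector":
        return IndexKind.VECTOR
    # Lists, Blobs, and anything else get no per-property index.
    return None
```

Keeping the rule in one function is what lets two callers (here, a builder and a coverage check) agree on which columns are expected to carry an index.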

Free-text Strings are not equality-indexed. A non-enum String column (including a String @key slug) gets an FTS inverted index, which Lance consults only for search/match_text/bm25, not for = or range predicates. An equality filter on a free-text String therefore falls back to a full scan. If you filter a String identifier by equality on a large table, either model the value as the node id, which always has a BTREE, or track building a BTREE on such columns as a follow-up.

Coverage and cost. Each indexed column adds index files and build time, and an index only covers the fragments it was built over. Rows appended after the index was built (e.g. by ingest --mode merge) are scanned unindexed until a reindex extends coverage; see maintenance → optimize.
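
Coverage decay can be pictured with a toy model in which a table is a set of fragment ids and an index remembers the fragments it was built over. The sketch below is plain Python with hypothetical helper names; it does not call Lance, and the real coverage check works on Lance's fragment bitmaps rather than Python sets.

```python
from typing import Set


def uncovered_fragments(table_fragments: Set[int], index_bitmap: Set[int]) -> Set[int]:
    """Fragments that exist in the table but that the index was not built over."""
    return table_fragments - index_bitmap


def has_unindexed_fragments(table_fragments: Set[int], index_bitmap: Set[int]) -> bool:
    return bool(uncovered_fragments(table_fragments, index_bitmap))


def reindex(table_fragments: Set[int]) -> Set[int]:
    """Toy reindex: afterwards the index covers exactly the current fragments."""
    return set(table_fragments)


# The index is built over the two fragments that exist at build time.
fragments = {0, 1}
bitmap = reindex(fragments)
covered_at_build = not has_unindexed_fragments(fragments, bitmap)

# An append adds fragment 2, which the existing index does not cover.
fragments.add(2)
stale = uncovered_fragments(fragments, bitmap)

# A reindex folds the new fragment back into coverage.
bitmap = reindex(fragments)
covered_after = not has_unindexed_fragments(fragments, bitmap)
```

In this model, queries over fragment 2 would be scanned without the index between the append and the reindex, which is the window the paragraph above describes.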

L2 — OmniGraph orchestration

  • ensure_indices() / ensure_indices_on(branch) — idempotent build of BTREE + inverted indexes for the current head; safe to re-run.
  • Indexes are built on the branch head (not on a snapshot), so reads always see the current index state.
  • Lazy branch forking for indexes: a branch that hasn't mutated a sub-table doesn't need its own index — the main lineage's index is reused until the first write triggers a copy-on-write fork.
  • Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is asked for on a Vector column.
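
The lazy-fork bullet above can be illustrated with a minimal copy-on-write model. Everything here is an assumption for the sake of the example (the `Branch` class, `resolve_index`, `write`, and the dict-shaped index); it only shows the lookup-then-fork pattern, not how OmniGraph stores branches.

```python
from typing import Dict, Optional


class Branch:
    """Hypothetical branch that inherits indexes from its parent until it writes."""

    def __init__(self, name: str, parent: Optional["Branch"] = None) -> None:
        self.name = name
        self.parent = parent
        self.own_indexes: Dict[str, dict] = {}

    def resolve_index(self, table: str) -> Optional[dict]:
        # Walk up the lineage until some branch owns an index for this table.
        branch = self
        while branch is not None:
            if table in branch.own_indexes:
                return branch.own_indexes[table]
            branch = branch.parent
        return None

    def write(self, table: str) -> None:
        # First write to a table forks the inherited index (copy-on-write).
        if table not in self.own_indexes:
            inherited = self.resolve_index(table)
            covered = set(inherited["covered"]) if inherited else set()
            self.own_indexes[table] = {"owner": self.name, "covered": covered}


main = Branch("main")
main.own_indexes["Person"] = {"owner": "main", "covered": {0, 1}}
feature = Branch("feature", parent=main)

owner_before = feature.resolve_index("Person")["owner"]  # reused from main
feature.write("Person")
owner_after = feature.resolve_index("Person")["owner"]   # forked on first write
```

The fork copies the covered set, so later changes on the feature branch leave the index that main resolves untouched.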

L2 — Graph topology index

This is OmniGraph-specific (not Lance):

  • A Compressed Sparse Row (CSR) adjacency representation of edges, with both out- (CSR) and in- (CSC) directions, plus a dense per-node-type id mapping.
  • Built on demand from a snapshot's edge tables, lazily: only when the planner routes an Expand to the CSR path (dense / large frontier) or an AntiJoin actually needs it.
  • Cached per snapshot (LRU, keyed by snapshot id + edge table versions), so repeat traversals over the same snapshot reuse it.
  • Selective Expands resolve neighbors from the persisted src/dst BTREE instead (one indexed scan per hop) and never trigger the CSR build; see query-language → Traversal execution. Pure scans, and queries served entirely by the indexed traversal path, skip it.
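
For readers unfamiliar with CSR, the following is a minimal sketch of the structure described above: a dense id mapping, an out-direction CSR, and an in-direction CSC built by reversing the edges. It is generic Python written for illustration; the function names and the in-memory list layout are assumptions, not OmniGraph's implementation.

```python
from typing import Dict, List, Tuple


def dense_ids(node_ids: List[str]) -> Dict[str, int]:
    """Map each node id to a dense integer in [0, n)."""
    return {node_id: i for i, node_id in enumerate(node_ids)}


def build_csr(num_nodes: int, edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Return (offsets, targets); the neighbors of node i are targets[offsets[i]:offsets[i + 1]]."""
    offsets = [0] * (num_nodes + 1)
    for src, _ in edges:
        offsets[src + 1] += 1          # count the out-degree of each source
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]   # prefix sum turns counts into start offsets
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()       # next free slot per source node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbors(offsets: List[int], targets: List[int], node: int) -> List[int]:
    return targets[offsets[node]:offsets[node + 1]]


ids = dense_ids(["a", "b", "c"])
edges = [(ids[s], ids[d]) for s, d in [("a", "b"), ("a", "c"), ("c", "a")]]

out_offsets, out_targets = build_csr(len(ids), edges)                      # CSR: out-edges
in_offsets, in_targets = build_csr(len(ids), [(d, s) for s, d in edges])   # CSC: in-edges

out_of_a = neighbors(out_offsets, out_targets, ids["a"])
into_a = neighbors(in_offsets, in_targets, ids["a"])
```

Once built, a neighbor lookup is a single slice, which suits dense or large frontiers. The build itself touches every edge, which is consistent with selective Expands preferring the persisted src/dst BTREE.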