omnigraph/docs/user/indexes.md
Ragnor Comerford 481de860b2
fix(engine): build scalar BTREE for enum and orderable-scalar @index columns
`build_indices_on_dataset_for_catalog` only handled `String` (-> FTS) and
`Vector` (-> vector). Enums are physically `String`, so an enum `@index`
column (e.g. `status`) got an FTS inverted index, which Lance never consults
for `=`; and `DateTime`/`Date`/numeric/`Bool` `@index` columns fell through
and built nothing. Both meant equality/range filters degraded to full scans
with `indices_loaded=0`.

Dispatch index kind by property type via a shared `node_prop_index_kind`:
enum + orderable scalar -> BTREE, free-text String -> FTS, Vector -> vector,
list/Blob -> none. The helper is shared by the builder and
`needs_index_work_node` so they cannot drift — the latter decides recovery-
sidecar pinning, and under-reporting would leave a HEAD-advancing index build
uncovered (invariant 5).

Tests: scalar_indexes.rs asserts enum/DateTime/numeric @index columns report
`IndexCoverage::Indexed` while free-text String/un-annotated columns stay
`Degraded` (negative control). Docs: docs/user/indexes.md.
2026-06-13 18:42:58 +02:00


Indexes

L1 — Lance index types OmniGraph exposes

| Index | Use | Notes |
| --- | --- | --- |
| BTREE scalar | = / range / IN / IS NULL on a scalar | always on the node id and edge src/dst; and on each one-column @index/@key property that is an enum or an orderable scalar (DateTime/Date/I32/I64/U32/U64/F32/F64/Bool) |
| Inverted (FTS) | search, fuzzy, match_text, bm25 | created on free-text (non-enum) String @index/@key columns |
| Vector | nearest() k-NN | Lance picks IVF_PQ vs HNSW family by configuration; OmniGraph stores as FixedSizeList(Float32, dim) |

The per-property index a column gets is decided by node_prop_index_kind (shared by the builder and the sidecar-pinning coverage check so they cannot drift): enums and orderable scalars → BTREE, free-text Strings → FTS, Vector → vector, list/Blob columns → none.
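The dispatch rule above can be sketched as a single match. This is a minimal illustration, not the engine's code: `PropType`, `IndexKind`, `index_kind_for`, and `needs_index_work` are hypothetical names, and the real `node_prop_index_kind` may take different arguments. The point is that the builder and the coverage check both call one function, so the two cannot disagree.

```rust
/// Hypothetical stand-in for the schema's property types; the real
/// OmniGraph type enum may be shaped differently.
#[derive(Debug, Clone, PartialEq)]
pub enum PropType {
    String,
    Enum(Vec<String>),
    DateTime,
    Date,
    I32,
    I64,
    U32,
    U64,
    F32,
    F64,
    Bool,
    Vector(usize),
    List(Box<PropType>),
    Blob,
}

/// The index families named in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IndexKind {
    BTree,
    Fts,
    Vector,
}

/// Sketch of the dispatch: enums and orderable scalars get a BTREE,
/// free-text Strings get FTS, Vectors get a vector index, and
/// list/Blob columns get nothing.
pub fn index_kind_for(ty: &PropType) -> Option<IndexKind> {
    match ty {
        PropType::Enum(_)
        | PropType::DateTime
        | PropType::Date
        | PropType::I32
        | PropType::I64
        | PropType::U32
        | PropType::U64
        | PropType::F32
        | PropType::F64
        | PropType::Bool => Some(IndexKind::BTree),
        PropType::String => Some(IndexKind::Fts),
        PropType::Vector(_) => Some(IndexKind::Vector),
        PropType::List(_) | PropType::Blob => None,
    }
}

/// Hypothetical coverage check that reuses the same dispatch, so it can
/// never report "no work" for a column the builder would index.
pub fn needs_index_work(ty: &PropType, annotated: bool, already_built: bool) -> bool {
    annotated && !already_built && index_kind_for(ty).is_some()
}
```

Because the match is exhaustive, adding a new property type to the enum forces a compile-time decision about which index it gets.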

Free-text Strings are not equality-indexed. A non-enum String column (including a String @key slug) gets an FTS inverted index, which Lance does not consult for = or range filters, only for search/match_text/bm25. An equality filter on a free-text String therefore falls back to a full scan. If you filter a String identifier by equality on a large table, either model it so the value is the node id (which always has a BTREE), or treat building a BTREE on such columns as follow-up work.

Coverage and cost. Each indexed column adds index files and build time, and an index only covers the fragments it was built over. Rows appended after the index was built (e.g. by ingest --mode merge) are scanned unindexed until a reindex extends coverage; see maintenance → optimize.
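The coverage rule amounts to a set difference over fragment ids. The sketch below assumes fragments are identified by `u64` ids; `unindexed_fragments` and `needs_reindex` are hypothetical helpers, not OmniGraph or Lance functions.

```rust
use std::collections::BTreeSet;

/// Fragment ids that exist in the table but were not part of the
/// fragment set the index was built over. These are scanned unindexed.
pub fn unindexed_fragments(
    table_fragments: &BTreeSet<u64>,
    indexed_fragments: &BTreeSet<u64>,
) -> Vec<u64> {
    table_fragments
        .difference(indexed_fragments)
        .copied()
        .collect()
}

/// True when at least one fragment falls outside the index's coverage.
pub fn needs_reindex(
    table_fragments: &BTreeSet<u64>,
    indexed_fragments: &BTreeSet<u64>,
) -> bool {
    !unindexed_fragments(table_fragments, indexed_fragments).is_empty()
}
```

For example, if an index was built over fragments 0 and 1 and a later merge appends fragments 2 and 3, the helper reports 2 and 3 as uncovered until a reindex runs.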

L2 — OmniGraph orchestration

  • ensure_indices() / ensure_indices_on(branch) — idempotent build of BTREE + inverted indexes for the current head; safe to re-run.
  • Indexes are built on the branch head (not on a snapshot), so reads always see the current index state.
  • Lazy branch forking for indexes: a branch that hasn't mutated a sub-table doesn't need its own index — the main lineage's index is reused until the first write triggers a copy-on-write fork.
  • Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is asked for on a Vector column.
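The lazy-forking bullet describes copy-on-write sharing. A minimal in-memory analogy, assuming nothing about how OmniGraph or Lance store indexes on disk, uses `Arc::make_mut` from the Rust standard library; `TableIndex` and `BranchIndex` are hypothetical types invented for this sketch.

```rust
use std::sync::Arc;

/// Hypothetical in-memory stand-in for one sub-table's index.
#[derive(Debug, Clone, PartialEq)]
pub struct TableIndex {
    pub indexed_rows: Vec<u64>,
}

/// Hypothetical per-branch handle to a sub-table's index.
#[derive(Debug, Clone)]
pub struct BranchIndex {
    inner: Arc<TableIndex>,
}

impl BranchIndex {
    pub fn new(rows: Vec<u64>) -> Self {
        Self { inner: Arc::new(TableIndex { indexed_rows: rows }) }
    }

    /// Forking a branch only clones the pointer; no index data is copied.
    pub fn fork(&self) -> Self {
        Self { inner: Arc::clone(&self.inner) }
    }

    /// The first write on a shared index copies it (copy-on-write);
    /// later writes on the now-private copy mutate in place.
    pub fn append(&mut self, row: u64) {
        Arc::make_mut(&mut self.inner).indexed_rows.push(row);
    }

    /// True while both handles still point at the same index data.
    pub fn shares_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.inner, &other.inner)
    }

    pub fn rows(&self) -> &[u64] {
        &self.inner.indexed_rows
    }
}
```

A forked branch shares the main lineage's index until its first `append`, after which the two diverge and the main lineage is left unchanged.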

L2 — Graph topology index (graph_index/mod.rs)

This is OmniGraph-specific (not Lance):

  • TypeIndex: dense u32 ↔ String id mapping per node type.
  • CsrIndex: Compressed Sparse Row representation of edges per edge type — offsets[i]..offsets[i+1] slices into targets.
  • GraphIndex { type_indices, csr (out), csc (in) } — built lazily, on demand, from a snapshot's edge tables: only when an Expand that the planner routes to the CSR path (dense / large frontier), or an AntiJoin, actually needs it.
  • Cached in RuntimeCache::graph_indices (LRU, max 8 entries, keyed by snapshot id + edge table versions).
  • Selective Expands resolve neighbors from the persisted src/dst BTREE instead (one indexed scan per hop) and never trigger the CSR build; see query-language → Expand. Pure scans, and queries served entirely by the indexed traversal path, skip it.
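To make the TypeIndex and CsrIndex layout concrete, here is a small self-contained sketch. The struct names follow the list above, but the fields, the `intern`/`build`/`neighbors` methods, and the counting-sort construction are assumptions for illustration, not the contents of graph_index/mod.rs.

```rust
use std::collections::HashMap;

/// Dense u32 <-> String id mapping for one node type (illustrative).
#[derive(Debug, Default)]
pub struct TypeIndex {
    ids: Vec<String>,
    dense: HashMap<String, u32>,
}

impl TypeIndex {
    /// Return the dense id for `id`, assigning the next free one if new.
    pub fn intern(&mut self, id: &str) -> u32 {
        if let Some(&d) = self.dense.get(id) {
            return d;
        }
        let d = self.ids.len() as u32;
        self.ids.push(id.to_string());
        self.dense.insert(id.to_string(), d);
        d
    }

    pub fn id_of(&self, dense: u32) -> Option<&str> {
        self.ids.get(dense as usize).map(String::as_str)
    }

    pub fn len(&self) -> usize {
        self.ids.len()
    }
}

/// Compressed Sparse Row adjacency: the neighbors of node `i` are
/// `targets[offsets[i]..offsets[i + 1]]`.
#[derive(Debug)]
pub struct CsrIndex {
    pub offsets: Vec<u32>,
    pub targets: Vec<u32>,
}

impl CsrIndex {
    /// Build from dense (src, dst) pairs with a counting sort.
    pub fn build(num_nodes: usize, edges: &[(u32, u32)]) -> Self {
        // Count out-degree per source, shifted by one slot.
        let mut offsets = vec![0u32; num_nodes + 1];
        for &(src, _) in edges {
            offsets[src as usize + 1] += 1;
        }
        // Prefix sum turns counts into slice boundaries.
        for i in 0..num_nodes {
            offsets[i + 1] += offsets[i];
        }
        // Scatter each destination into its source's slice.
        let mut cursor = offsets.clone();
        let mut targets = vec![0u32; edges.len()];
        for &(src, dst) in edges {
            let slot = &mut cursor[src as usize];
            targets[*slot as usize] = dst;
            *slot += 1;
        }
        CsrIndex { offsets, targets }
    }

    pub fn neighbors(&self, node: u32) -> &[u32] {
        let start = self.offsets[node as usize] as usize;
        let end = self.offsets[node as usize + 1] as usize;
        &self.targets[start..end]
    }
}
```

In this sketch the inbound (csc) direction is simply the same `build` call over the edge list with each pair swapped, which is one plausible reason to keep both directions in a single GraphIndex.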