`build_indices_on_dataset_for_catalog` only handled `String` (-> FTS) and `Vector` (-> vector). Enums are physically `String`, so an enum `@index` column (e.g. `status`) got an FTS inverted index, which Lance never consults for `=`; and `DateTime`/`Date`/numeric/`Bool` `@index` columns fell through and built nothing. Both meant equality/range filters degraded to full scans with `indices_loaded=0`.

Dispatch index kind by property type via a shared `node_prop_index_kind`: enum + orderable scalar -> BTREE, free-text String -> FTS, Vector -> vector, list/Blob -> none. The helper is shared by the builder and `needs_index_work_node` so they cannot drift: the latter decides recovery-sidecar pinning, and under-reporting would leave a HEAD-advancing index build uncovered (invariant 5).

Tests: `scalar_indexes.rs` asserts enum/DateTime/numeric `@index` columns report `IndexCoverage::Indexed` while free-text String/un-annotated columns stay `Degraded` (negative control).

Docs: docs/user/indexes.md.
Indexes
L1 — Lance index types OmniGraph exposes
| Index | Use | Notes |
|---|---|---|
| BTREE scalar | `=` / range / `IN` / `IS NULL` on a scalar | always on the node id and edge `src`/`dst`; and on each one-column `@index`/`@key` property that is an enum or an orderable scalar (DateTime/Date/I32/I64/U32/U64/F32/F64/Bool) |
| Inverted (FTS) | `search`, `fuzzy`, `match_text`, `bm25` | created on free-text (non-enum) String `@index`/`@key` columns |
| Vector | `nearest()` k-NN | Lance picks IVF_PQ vs HNSW family by configuration; OmniGraph stores as `FixedSizeList(Float32, dim)` |
The per-property index a column gets is decided by node_prop_index_kind (shared
by the builder and the sidecar-pinning coverage check so they cannot drift):
enums and orderable scalars → BTREE, free-text Strings → FTS, Vector → vector,
list/Blob columns → none.
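The dispatch rule above can be sketched as a single `match`. This is a minimal illustration, not the actual implementation: the `PropType` and `IndexKind` enums and the function signature are assumptions made for the example; only the name `node_prop_index_kind` and the type-to-index mapping come from the text.

```rust
/// Hypothetical, simplified property types; the real schema types may differ.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum PropType {
    String,            // free-text string
    Enum(Vec<String>), // physically a String, but with a closed value set
    DateTime,
    Date,
    I32,
    I64,
    U32,
    U64,
    F32,
    F64,
    Bool,
    Vector(usize), // fixed dimension
    List(Box<PropType>),
    Blob,
}

/// Hypothetical index kinds mirroring the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IndexKind {
    BTree,
    Fts,
    Vector,
}

/// Assumed signature: maps a property type to the index it should get,
/// or `None` when no index is built for that type.
fn node_prop_index_kind(ty: &PropType) -> Option<IndexKind> {
    match ty {
        // Enums and orderable scalars serve `=` / range filters via BTREE.
        PropType::Enum(_)
        | PropType::DateTime
        | PropType::Date
        | PropType::I32
        | PropType::I64
        | PropType::U32
        | PropType::U64
        | PropType::F32
        | PropType::F64
        | PropType::Bool => Some(IndexKind::BTree),
        // Free-text strings get an inverted index for text search only.
        PropType::String => Some(IndexKind::Fts),
        PropType::Vector(_) => Some(IndexKind::Vector),
        // List and Blob columns get no index.
        PropType::List(_) | PropType::Blob => None,
    }
}

fn main() {
    let status = PropType::Enum(vec!["open".to_string(), "closed".to_string()]);
    println!("status -> {:?}", node_prop_index_kind(&status));
    println!("title  -> {:?}", node_prop_index_kind(&PropType::String));
    println!("blob   -> {:?}", node_prop_index_kind(&PropType::Blob));
}
```

Because the `match` has no wildcard arm, adding a new property type fails to compile until it is given an index kind, which suits a helper that two callers must agree on.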
Free-text Strings are not equality-indexed. A non-enum `String` column (including a `String @key` slug) gets an FTS inverted index, which Lance does not consult for `=`/range, only for `search`/`match_text`/`bm25`. So an equality filter on a free-text String falls back to a full scan. If you filter a String identifier by equality on a large table, either model it so the value is the node id, or track building a BTREE on such columns as a follow-up.
Coverage and cost. Each indexed column adds index files and build time, and an index only covers the fragments it was built over. Rows appended after the index was built (e.g. by `ingest --mode merge`) are scanned unindexed until a reindex extends coverage; see maintenance → optimize.
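To make the fragment-coverage point concrete, here is a small sketch that compares the fragments a table currently has against the fragments an index was built over. Every name in it (`FragmentCoverage`, `fragment_coverage`, the use of `u64` fragment ids) is hypothetical and chosen for illustration; it does not reflect how OmniGraph or Lance actually track coverage.

```rust
use std::collections::BTreeSet;

/// Hypothetical result type: either every fragment is covered by the index,
/// or some fragments must be scanned without it.
#[derive(Debug, PartialEq, Eq)]
enum FragmentCoverage {
    Full,
    Partial { unindexed: Vec<u64> },
}

/// Fragments present in the table but absent from the index's covered set
/// are the ones a filter has to scan unindexed.
fn fragment_coverage(table: &BTreeSet<u64>, indexed: &BTreeSet<u64>) -> FragmentCoverage {
    let unindexed: Vec<u64> = table.difference(indexed).copied().collect();
    if unindexed.is_empty() {
        FragmentCoverage::Full
    } else {
        FragmentCoverage::Partial { unindexed }
    }
}

fn main() {
    // The index was built when the table had fragments 0, 1 and 2.
    let indexed: BTreeSet<u64> = vec![0u64, 1, 2].into_iter().collect();
    let mut table = indexed.clone();
    println!("{:?}", fragment_coverage(&table, &indexed));

    // An append adds fragment 3, which the existing index does not cover.
    table.insert(3);
    println!("{:?}", fragment_coverage(&table, &indexed));
}
```

In this model a reindex amounts to extending the covered set until the difference is empty again.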
L2 — OmniGraph orchestration
- `ensure_indices()` / `ensure_indices_on(branch)`: idempotent build of BTREE + inverted indexes for the current head; safe to re-run.
- Indexes are built on the branch head (not on a snapshot), so reads always see the current index state.
- Lazy branch forking for indexes: a branch that hasn't mutated a sub-table doesn't need its own index — the main lineage's index is reused until the first write triggers a copy-on-write fork.
- Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is asked for on a Vector column.
L2 — Graph topology index (graph_index/mod.rs)
This is OmniGraph-specific (not Lance):
- `TypeIndex`: dense `u32 ↔ String id` mapping per node type.
- `CsrIndex`: Compressed Sparse Row representation of edges per edge type; `offsets[i]..offsets[i+1]` slices into `targets`.
- `GraphIndex { type_indices, csr (out), csc (in) }`: built on demand from a snapshot's edge tables, and lazily: only when an `Expand` the planner routes to the CSR path (dense / large frontier) or an `AntiJoin` actually needs it.
- Cached in `RuntimeCache::graph_indices` (LRU, max 8 entries, keyed by snapshot id + edge table versions).
- Selective `Expand`s resolve neighbors from the persisted `src`/`dst` BTREE instead (one indexed scan per hop) and never trigger the CSR build; see query-language → Expand. Pure scans, and queries served entirely by the indexed traversal path, skip it.
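The `offsets`/`targets` layout described above can be illustrated with a minimal CSR build over dense `u32` node ids. This is a sketch under assumptions: the `from_edges` and `neighbors` methods and the field types are hypothetical, and only the `CsrIndex` name and the `offsets[i]..offsets[i+1]` slicing rule come from the text. The incoming direction (CSC) is shown as the same structure built over reversed edges.

```rust
/// Simplified stand-in for the structure described above.
struct CsrIndex {
    offsets: Vec<usize>, // length = num_nodes + 1
    targets: Vec<u32>,   // neighbor ids, grouped by source node
}

impl CsrIndex {
    /// Builds the index from `(src, dst)` pairs of dense node ids.
    /// Panics if a `src` is not smaller than `num_nodes`.
    fn from_edges(num_nodes: usize, edges: &[(u32, u32)]) -> Self {
        // 1. Count the out-degree of each source node.
        let mut offsets = vec![0usize; num_nodes + 1];
        for &(src, _) in edges {
            offsets[src as usize + 1] += 1;
        }
        // 2. Prefix-sum the counts into start offsets.
        for i in 0..num_nodes {
            offsets[i + 1] += offsets[i];
        }
        // 3. Scatter each target into the next free slot of its source.
        let mut cursor = offsets.clone();
        let mut targets = vec![0u32; edges.len()];
        for &(src, dst) in edges {
            let slot = cursor[src as usize];
            targets[slot] = dst;
            cursor[src as usize] += 1;
        }
        CsrIndex { offsets, targets }
    }

    /// Neighbors of node `i` are the slice `offsets[i]..offsets[i + 1]`.
    fn neighbors(&self, i: u32) -> &[u32] {
        let i = i as usize;
        &self.targets[self.offsets[i]..self.offsets[i + 1]]
    }
}

fn main() {
    let edges = [(0u32, 1u32), (0, 2), (2, 0)];
    let csr = CsrIndex::from_edges(3, &edges);

    // Reverse every edge to get the incoming (CSC) direction.
    let reversed: Vec<(u32, u32)> = edges.iter().map(|&(s, d)| (d, s)).collect();
    let csc = CsrIndex::from_edges(3, &reversed);

    println!("out-neighbors of 0: {:?}", csr.neighbors(0));
    println!("in-neighbors of 0:  {:?}", csc.neighbors(0));
}
```

A neighbor lookup is two offset reads plus a contiguous slice. The build, however, touches every edge, which is consistent with reserving this path for dense or large-frontier expansions.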