mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-15 01:55:13 +02:00

Ragnor Comerford 481de860b2

fix(engine): build scalar BTREE for enum and orderable-scalar @index columns

`build_indices_on_dataset_for_catalog` only handled `String` (-> FTS) and
`Vector` (-> vector). Enums are physically `String`, so an enum `@index`
column (e.g. `status`) got an FTS inverted index, which Lance never consults
for `=`; and `DateTime`/`Date`/numeric/`Bool` `@index` columns fell through
and built nothing. Both meant equality/range filters degraded to full scans
with `indices_loaded=0`.

Dispatch index kind by property type via a shared `node_prop_index_kind`:
enum + orderable scalar -> BTREE, free-text String -> FTS, Vector -> vector,
list/Blob -> none. The helper is shared by the builder and
`needs_index_work_node` so they cannot drift — the latter decides recovery-
sidecar pinning, and under-reporting would leave a HEAD-advancing index build
uncovered (invariant 5).

Tests: scalar_indexes.rs asserts enum/DateTime/numeric @index columns report
`IndexCoverage::Indexed` while free-text String/un-annotated columns stay
`Degraded` (negative control). Docs: docs/user/indexes.md.

2026-06-13 18:42:58 +02:00


Indexes

L1 — Lance index types OmniGraph exposes

| Index | Use | Notes |
| --- | --- | --- |
| BTREE scalar | `=` / range / `IN` / `IS NULL` on a scalar | always on the node `id` and edge `src`/`dst`; and on each one-column `@index`/`@key` property that is an enum or an orderable scalar (`DateTime`/`Date`/`I32`/`I64`/`U32`/`U64`/`F32`/`F64`/`Bool`) |
| Inverted (FTS) | `search`, `fuzzy`, `match_text`, `bm25` | created on free-text (non-enum) `String` `@index`/`@key` columns |
| Vector | `nearest()` k-NN | Lance picks IVF_PQ vs HNSW family by configuration; OmniGraph stores as `FixedSizeList(Float32, dim)` |

The per-property index a column gets is decided by node_prop_index_kind (shared by the builder and the sidecar-pinning coverage check so they cannot drift): enums and orderable scalars → BTREE, free-text Strings → FTS, Vector → vector, list/Blob columns → none.
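The dispatch rule above can be sketched as a single exhaustive `match`. This is an illustration only: `PropType`, `IndexKind`, and `index_kind_for` are hypothetical names invented for this sketch, and the real `node_prop_index_kind` signature and type model may differ.

```rust
/// Hypothetical stand-in for OmniGraph's property types (not the real enum).
#[derive(Debug, Clone, PartialEq)]
pub enum PropType {
    Enum(Vec<String>), // physically a String, but with a closed value set
    String,            // free text
    DateTime,
    Date,
    I32,
    I64,
    U32,
    U64,
    F32,
    F64,
    Bool,
    Vector(usize), // fixed dimension
    List(Box<PropType>),
    Blob,
}

/// Hypothetical index kinds, mirroring the three rows of the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IndexKind {
    BTree,
    Fts,
    Vector,
}

/// Sketch of the rule: enums and orderable scalars get a BTREE, free-text
/// Strings get FTS, Vectors get a vector index, list/Blob columns get nothing.
pub fn index_kind_for(ty: &PropType) -> Option<IndexKind> {
    match ty {
        PropType::Enum(_)
        | PropType::DateTime
        | PropType::Date
        | PropType::I32
        | PropType::I64
        | PropType::U32
        | PropType::U64
        | PropType::F32
        | PropType::F64
        | PropType::Bool => Some(IndexKind::BTree),
        PropType::String => Some(IndexKind::Fts),
        PropType::Vector(_) => Some(IndexKind::Vector),
        PropType::List(_) | PropType::Blob => None,
    }
}
```

Because the `match` has no wildcard arm, adding a new property type forces a decision about its index kind at compile time, which is one way a single shared helper keeps the builder and the coverage check from drifting.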

Free-text Strings are not equality-indexed. A non-enum String column (including a String @key slug) gets an FTS inverted index, which Lance does not consult for =/range, only for search/match_text/bm25. So an equality filter on a free-text String falls back to a full scan. If you filter a String identifier by equality on a large table, either model it so the value is the node id (which always has a BTREE), or track building a BTREE on such columns as a follow-up.

Coverage and cost. Each indexed column adds index files and build time, and an index only covers the fragments it was built over. Rows appended after the index was built (e.g. by ingest --mode merge) are scanned unindexed until a reindex extends coverage; see maintenance → optimize.
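To make the coverage point concrete, the set of fragments that will be scanned unindexed is the difference between the fragments currently in the table and the fragments the index was built over. The sketch below assumes fragments are identified by `u32` ids; `unindexed_fragments` and `fully_covered` are hypothetical helpers, not OmniGraph or Lance API.

```rust
use std::collections::BTreeSet;

/// Fragment ids present in the table but absent from the index's coverage.
/// Filters touching these fragments fall back to scanning them.
pub fn unindexed_fragments(table: &BTreeSet<u32>, indexed: &BTreeSet<u32>) -> Vec<u32> {
    table.difference(indexed).copied().collect()
}

/// True when every table fragment is covered by the index.
pub fn fully_covered(table: &BTreeSet<u32>, indexed: &BTreeSet<u32>) -> bool {
    table.is_subset(indexed)
}
```

For example, if the index was built over fragments 0 and 1 and a later append added fragments 2 and 3, the helper reports 2 and 3 as uncovered until a reindex extends coverage.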

L2 — OmniGraph orchestration

- ensure_indices() / ensure_indices_on(branch): idempotent build of BTREE + inverted indexes for the current head; safe to re-run.
- Indexes are built on the branch head (not on a snapshot), so reads always see the current index state.
- Lazy branch forking for indexes: a branch that hasn't mutated a sub-table doesn't need its own index; the main lineage's index is reused until the first write triggers a copy-on-write fork.
- Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is asked for on a Vector column.
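What "idempotent, safe to re-run" means can be sketched in a few lines: an ensure-style call builds only what is missing, so a second call does no work. This is a toy model under stated assumptions, not the real `ensure_indices()`: `ensure_indexed` is a hypothetical function, and index state is reduced to a set of column names.

```rust
use std::collections::BTreeSet;

/// Toy model of an idempotent ensure step: `existing` holds the columns that
/// already have an index. Returns the columns that were built by this call.
pub fn ensure_indexed(wanted: &[&str], existing: &mut BTreeSet<String>) -> Vec<String> {
    let mut built = Vec::new();
    for col in wanted {
        // `insert` returns false if the column was already indexed,
        // in which case there is nothing to build.
        if existing.insert(col.to_string()) {
            built.push(col.to_string());
        }
    }
    built
}
```

Calling it twice with the same `wanted` list builds everything on the first call and nothing on the second, which is the property that makes re-running safe.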

L2 — Graph topology index (`graph_index/mod.rs`)

This is OmniGraph-specific (not Lance):

- TypeIndex: dense u32 ↔ String id mapping per node type.
- CsrIndex: Compressed Sparse Row representation of edges per edge type; offsets[i]..offsets[i+1] slices into targets to give the neighbors of node i.
- GraphIndex { type_indices, csr (out), csc (in) }: built on demand from a snapshot's edge tables, lazily: only when the planner routes an Expand to the CSR path (dense / large frontier) or an AntiJoin actually needs it.
- Cached in RuntimeCache::graph_indices (LRU, max 8 entries, keyed by snapshot id + edge table versions).
- Selective Expands resolve neighbors from the persisted src/dst BTREE instead (one indexed scan per hop) and never trigger the CSR build; see query-language → Expand. Pure scans, and queries served entirely by the indexed traversal path, also skip the build.
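The two structures above can be illustrated with a minimal sketch: a dense id interner and a CSR built by counting sort over an edge list. The type names echo the ones in the text, but the fields, constructors, and methods here are assumptions made for illustration and are not taken from `graph_index/mod.rs`.

```rust
use std::collections::HashMap;

/// Sketch of a dense u32 <-> String id mapping for one node type.
pub struct TypeIndex {
    ids: Vec<String>,            // dense -> String
    dense: HashMap<String, u32>, // String -> dense
}

impl TypeIndex {
    pub fn new() -> Self {
        TypeIndex { ids: Vec::new(), dense: HashMap::new() }
    }

    /// Returns the existing dense id, or assigns the next free one.
    pub fn intern(&mut self, id: &str) -> u32 {
        if let Some(&d) = self.dense.get(id) {
            return d;
        }
        let d = self.ids.len() as u32;
        self.ids.push(id.to_string());
        self.dense.insert(id.to_string(), d);
        d
    }

    pub fn id_of(&self, dense: u32) -> Option<&str> {
        self.ids.get(dense as usize).map(|s| s.as_str())
    }
}

/// Sketch of a CSR adjacency: the neighbors of node i are
/// targets[offsets[i]..offsets[i + 1]].
pub struct CsrIndex {
    offsets: Vec<usize>,
    targets: Vec<u32>,
}

impl CsrIndex {
    /// Builds from (src, dst) pairs of dense ids. Panics if a src is >= num_nodes.
    pub fn build(num_nodes: usize, edges: &[(u32, u32)]) -> Self {
        // Count out-degree per source, shifted by one slot.
        let mut offsets = vec![0usize; num_nodes + 1];
        for &(src, _) in edges {
            offsets[src as usize + 1] += 1;
        }
        // Prefix sum turns counts into start offsets.
        for i in 0..num_nodes {
            offsets[i + 1] += offsets[i];
        }
        // Scatter targets, tracking the next write slot per source.
        let mut cursor = offsets.clone();
        let mut targets = vec![0u32; edges.len()];
        for &(src, dst) in edges {
            let slot = &mut cursor[src as usize];
            targets[*slot] = dst;
            *slot += 1;
        }
        CsrIndex { offsets, targets }
    }

    pub fn neighbors(&self, node: u32) -> &[u32] {
        let i = node as usize;
        &self.targets[self.offsets[i]..self.offsets[i + 1]]
    }
}
```

Building the same structure with `(dst, src)` pairs gives the incoming-edge (CSC) view, which is why one implementation can serve both directions. Neighbor lookup is a slice into `targets`, so a dense Expand touches contiguous memory rather than issuing one indexed scan per node.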
