mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-15 01:55:13 +02:00
`build_indices_on_dataset_for_catalog` only handled `String` (-> FTS) and `Vector` (-> vector). Enums are physically `String`, so an enum `@index` column (e.g. `status`) got an FTS inverted index, which Lance never consults for `=`; and `DateTime`/`Date`/numeric/`Bool` `@index` columns fell through and built nothing. Both meant equality/range filters degraded to full scans with `indices_loaded=0`. Dispatch index kind by property type via a shared `node_prop_index_kind`: enum + orderable scalar -> BTREE, free-text String -> FTS, Vector -> vector, list/Blob -> none. The helper is shared by the builder and `needs_index_work_node` so they cannot drift: the latter decides recovery-sidecar pinning, and under-reporting would leave a HEAD-advancing index build uncovered (invariant 5). Tests: scalar_indexes.rs asserts enum/DateTime/numeric @index columns report `IndexCoverage::Indexed` while free-text String/un-annotated columns stay `Degraded` (negative control). Docs: docs/user/indexes.md.
43 lines
3.2 KiB
Markdown
# Indexes
## L1 — Lance index types OmniGraph exposes
| Index | Use | Notes |
|---|---|---|
| **BTREE scalar** | `=` / range / `IN` / `IS NULL` on a scalar | always on the node `id` and edge `src`/`dst`; and on each one-column `@index`/`@key` property that is an **enum** or an **orderable scalar** (`DateTime`/`Date`/`I32`/`I64`/`U32`/`U64`/`F32`/`F64`/`Bool`) |
| **Inverted (FTS)** | `search`, `fuzzy`, `match_text`, `bm25` | created on **free-text** (non-enum) `String` `@index`/`@key` columns |
| **Vector** | `nearest()` k-NN | Lance picks IVF_PQ vs HNSW family by configuration; OmniGraph stores as FixedSizeList(Float32, dim) |

The per-property index a column gets is decided by `node_prop_index_kind` (shared
by the builder and the sidecar-pinning coverage check so they cannot drift):
enums and orderable scalars → BTREE, free-text Strings → FTS, `Vector` → vector,
list/`Blob` columns → none.
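The type-to-index dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the actual `node_prop_index_kind`: the `PropType` and `IndexKind` enums and the `index_kind_for` function are hypothetical names, and the real signature and schema types may differ.

```rust
/// Hypothetical, simplified property types (the real schema types may differ).
#[derive(Debug, Clone, PartialEq)]
pub enum PropType {
    String,            // free-text string
    Enum(Vec<String>), // physically a string, but with a closed value set
    DateTime,
    Date,
    I32,
    I64,
    U32,
    U64,
    F32,
    F64,
    Bool,
    Vector(u32), // fixed dimension
    List(Box<PropType>),
    Blob,
}

/// Hypothetical index kinds, mirroring the three rows of the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IndexKind {
    BTree,
    Fts,
    Vector,
}

/// Maps a property type to the index kind an `@index`/`@key` column would get,
/// or `None` when no index is built for that type.
pub fn index_kind_for(ty: &PropType) -> Option<IndexKind> {
    match ty {
        // Enums and orderable scalars serve `=` / range filters via BTREE.
        PropType::Enum(_)
        | PropType::DateTime
        | PropType::Date
        | PropType::I32
        | PropType::I64
        | PropType::U32
        | PropType::U64
        | PropType::F32
        | PropType::F64
        | PropType::Bool => Some(IndexKind::BTree),
        // Free-text strings get an inverted (FTS) index only.
        PropType::String => Some(IndexKind::Fts),
        PropType::Vector(_) => Some(IndexKind::Vector),
        // List and Blob columns are not indexed.
        PropType::List(_) | PropType::Blob => None,
    }
}
```

Keeping the mapping in one exhaustive `match` means that adding a new property type forces a decision about its index kind at compile time, which is one way for an index builder and a coverage check to stay in agreement.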
> **Free-text Strings are not equality-indexed.** A non-enum `String` column
> (including a `String @key` slug) gets an FTS inverted index, which Lance does
> **not** consult for `=`/range — only for `search`/`match_text`/`bm25`. So an
> equality filter on a free-text String falls back to a full scan. If you filter
> a String identifier by equality on a large table, model it so the value is the
> node id, or track it as a follow-up to also build a BTREE on such columns.

> **Coverage and cost.** Each indexed column adds index files and build time, and
> an index only covers the fragments it was built over. Rows appended after the
> index was built (e.g. by `ingest --mode merge`) are scanned unindexed until a
> reindex extends coverage; see [maintenance](maintenance.md) → `optimize`.
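To make the coverage note concrete, the sketch below models an index as the set of fragment ids it was built over and reports which fragments a filter would still have to scan unindexed. It is an illustration only: the `Coverage` enum and the `coverage` function are hypothetical and are not OmniGraph's or Lance's API.

```rust
use std::collections::BTreeSet;

/// Hypothetical coverage summary for one indexed column.
#[derive(Debug, PartialEq, Eq)]
pub enum Coverage {
    /// Every fragment in the dataset is covered by the index.
    Full,
    /// Some fragments were appended after the index was built.
    Partial { unindexed: Vec<u64> },
    /// The index covers none of the current fragments.
    None,
}

/// Compares the fragments currently in the dataset with the fragments the
/// index was built over. Fragments missing from the index must be scanned.
pub fn coverage(dataset: &BTreeSet<u64>, indexed: &BTreeSet<u64>) -> Coverage {
    // `difference` yields ids in ascending order for a BTreeSet.
    let unindexed: Vec<u64> = dataset.difference(indexed).copied().collect();
    if unindexed.is_empty() {
        Coverage::Full
    } else if unindexed.len() == dataset.len() {
        Coverage::None
    } else {
        Coverage::Partial { unindexed }
    }
}
```

In this model, appending a fragment moves a column from `Full` to `Partial` until the index is rebuilt or extended over the new fragment.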
## L2 — OmniGraph orchestration
- `ensure_indices()` / `ensure_indices_on(branch)`: idempotent build of BTREE + inverted indexes for the current head; safe to re-run.
- Indexes are built on the *branch head* (not on a snapshot), so reads always see the current index state.
- **Lazy branch forking for indexes**: a branch that hasn't mutated a sub-table doesn't need its own index; the main lineage's index is reused until the first write triggers a copy-on-write fork.
- Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is requested on a `Vector` column.
## L2 — Graph topology index (`graph_index/mod.rs`)
This is OmniGraph-specific (not Lance):
- `TypeIndex`: dense `u32 ↔ String id` mapping per node type.
- `CsrIndex`: Compressed Sparse Row representation of edges per edge type; `offsets[i]..offsets[i+1]` slices into `targets`.
- `GraphIndex { type_indices, csr (out), csc (in) }`: built on demand from a snapshot's edge tables, **lazily**: only when an `Expand` that the planner routes to the CSR path (dense / large frontier), or an `AntiJoin`, actually needs it.
- Cached in `RuntimeCache::graph_indices` (LRU, max 8 entries, keyed by snapshot id + edge table versions).
- Selective `Expand`s resolve neighbors from the persisted `src`/`dst` BTREE instead (one indexed scan per hop) and never trigger the CSR build; see [query-language](query-language.md) → Expand. Pure scans, and queries served entirely by the indexed traversal path, also skip the build.
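The `offsets`/`targets` layout can be illustrated with a small, self-contained sketch. It assumes node ids have already been mapped to dense `u32` values (the role `TypeIndex` plays above); the `Csr` struct and its methods are hypothetical and are not the actual `CsrIndex` implementation.

```rust
/// Hypothetical CSR adjacency structure over dense node ids `0..num_nodes`.
pub struct Csr {
    /// `offsets[i]..offsets[i + 1]` is the range of `targets` owned by node `i`.
    offsets: Vec<usize>,
    /// Concatenated neighbor lists, grouped by source node.
    targets: Vec<u32>,
}

impl Csr {
    /// Builds the structure from `(src, dst)` pairs with a counting sort.
    /// Panics if any `src` is `>= num_nodes`.
    pub fn from_edges(num_nodes: usize, edges: &[(u32, u32)]) -> Self {
        // 1. Count the out-degree of each node, shifted by one slot.
        let mut offsets = vec![0usize; num_nodes + 1];
        for &(src, _) in edges {
            offsets[src as usize + 1] += 1;
        }
        // 2. Prefix-sum the counts so offsets[i] is the start of node i's slice.
        for i in 0..num_nodes {
            offsets[i + 1] += offsets[i];
        }
        // 3. Scatter each destination into the next free slot of its source.
        let mut cursor = offsets.clone();
        let mut targets = vec![0u32; edges.len()];
        for &(src, dst) in edges {
            let slot = &mut cursor[src as usize];
            targets[*slot] = dst;
            *slot += 1;
        }
        Csr { offsets, targets }
    }

    /// Returns the out-neighbors of `node` as a contiguous slice.
    pub fn neighbors(&self, node: u32) -> &[u32] {
        let i = node as usize;
        &self.targets[self.offsets[i]..self.offsets[i + 1]]
    }
}
```

Building the same structure over reversed pairs `(dst, src)` gives the incoming-edge (CSC) view, so one routine can serve both directions.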