Mirror of https://github.com/ModernRelay/omnigraph.git, synced 2026-06-15 01:55:13 +02:00
# Indexes
## L1 — Lance index types OmniGraph exposes
| Index | Use | Notes |
|---|---|---|
| **BTREE scalar** | `=` / range / `IN` / `IS NULL` on a scalar | Always built on the node `id` and edge `src`/`dst`, and on each single-column `@index`/`@key` property that is an **enum** or an **orderable scalar** (`DateTime`/`Date`/`I32`/`I64`/`U32`/`U64`/`F32`/`F64`/`Bool`) |
| **Inverted (FTS)** | `search`, `fuzzy`, `match_text`, `bm25` | Created on **free-text** (non-enum) `String` `@index`/`@key` columns |
| **Vector** | `nearest()` k-NN | Lance picks the IVF_PQ or HNSW family by configuration; OmniGraph stores vectors as `FixedSizeList(Float32, dim)` |

The per-property index a column gets is decided by `node_prop_index_kind` (shared
by the builder and the sidecar-pinning coverage check so they cannot drift):
enums and orderable scalars → BTREE, free-text Strings → FTS, `Vector` → vector,
list/`Blob` columns → none.
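
The dispatch rule above can be sketched as a single `match`. This is only an illustration: `PropType`, `IndexKind`, and `index_kind_for` are hypothetical names chosen for the example, not the actual signature of `node_prop_index_kind`.

```rust
/// Hypothetical property types, mirroring the types named in the table above.
#[derive(Debug, Clone)]
enum PropType {
    String,
    Enum(Vec<String>),
    DateTime,
    Date,
    I32,
    I64,
    U32,
    U64,
    F32,
    F64,
    Bool,
    Vector(usize), // embedding dimension
    List(Box<PropType>),
    Blob,
}

/// Hypothetical index kinds; `None` from `index_kind_for` means "no index".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IndexKind {
    BTree,
    Fts,
    Vector,
}

/// Sketch of the per-property dispatch (assumed shape, not the real helper).
fn index_kind_for(prop: &PropType) -> Option<IndexKind> {
    match prop {
        // Enums are physically strings but are filtered by equality.
        PropType::Enum(_) => Some(IndexKind::BTree),
        // Orderable scalars support equality and range filters.
        PropType::DateTime
        | PropType::Date
        | PropType::I32
        | PropType::I64
        | PropType::U32
        | PropType::U64
        | PropType::F32
        | PropType::F64
        | PropType::Bool => Some(IndexKind::BTree),
        // Free-text strings are searched, not compared.
        PropType::String => Some(IndexKind::Fts),
        PropType::Vector(_) => Some(IndexKind::Vector),
        // Lists and blobs get no index.
        PropType::List(_) | PropType::Blob => None,
    }
}
```

Because the `match` is exhaustive, adding a new property type forces a decision about its index kind at compile time, which is one way a single shared helper keeps the builder and the coverage check from drifting.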
> **Free-text Strings are not equality-indexed.** A non-enum `String` column
> (including a `String @key` slug) gets an FTS inverted index, which Lance does
> **not** consult for `=`/range, only for `search`/`fuzzy`/`match_text`/`bm25`. So an
> equality filter on a free-text String falls back to a full scan. If you filter
> a String identifier by equality on a large table, either model it so the value
> is the node id, or track a follow-up to also build a BTREE on such columns.

> **Coverage and cost.** Each indexed column adds index files and build time, and
> an index only covers the fragments it was built over. Rows appended after the
> index was built (e.g. by `ingest --mode merge`) are scanned unindexed until a
> reindex extends coverage; see [maintenance](maintenance.md) → `optimize`.
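
The coverage rule can be modeled as plain set arithmetic over fragment ids. The sketch below is a toy model under that assumption; `FragmentCoverage`, `unindexed_fragments`, and `extend_coverage` are hypothetical names and do not correspond to Lance or OmniGraph APIs.

```rust
use std::collections::BTreeSet;

/// Hypothetical model of one index: the fragment ids it was built over.
struct FragmentCoverage {
    covered: BTreeSet<u64>,
}

/// Fragments that exist in the table but are not covered by the index.
/// Rows in these fragments are scanned without index help.
fn unindexed_fragments(
    table_fragments: &BTreeSet<u64>,
    index: &FragmentCoverage,
) -> BTreeSet<u64> {
    table_fragments.difference(&index.covered).copied().collect()
}

/// Illustrative reindex step: coverage becomes exactly the current fragment
/// set, which also drops ids that compaction has rewritten away.
fn extend_coverage(table_fragments: &BTreeSet<u64>, index: &mut FragmentCoverage) {
    index.covered = table_fragments.clone();
}
```

In this model an append adds a fragment id to the table set only, so `unindexed_fragments` is non-empty until `extend_coverage` runs, and a second run changes nothing.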
## L2 — OmniGraph orchestration
- `ensure_indices()` / `ensure_indices_on(branch)`: idempotent build of BTREE + inverted indexes for the current head; safe to re-run.
- Indexes are built on the *branch head* (not on a snapshot), so reads always see the current index state.
- **Lazy branch forking for indexes**: a branch that hasn't mutated a sub-table doesn't need its own index; the main lineage's index is reused until the first write triggers a copy-on-write fork.
- Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is requested on a `Vector` column.
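
The lazy-forking bullet describes ordinary copy-on-write sharing. A minimal in-memory sketch of that idea follows, assuming a branch is just a map from sub-table name to a shared index handle; `TableIndex`, `Branch`, `fork`, and `write` are hypothetical and say nothing about how OmniGraph or Lance store indexes on disk.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical index handle; the version stands in for real index state.
#[derive(Clone)]
struct TableIndex {
    built_at_version: u64,
}

/// Hypothetical branch: sub-table name -> index shared with the parent lineage.
struct Branch {
    indexes: HashMap<String, Rc<TableIndex>>,
}

impl Branch {
    /// Forking copies only the handles, so no index data is duplicated.
    fn fork(&self) -> Branch {
        Branch { indexes: self.indexes.clone() }
    }

    /// The first write to a sub-table triggers the copy-on-write fork:
    /// `Rc::make_mut` clones the index only if it is still shared.
    fn write(&mut self, table: &str, new_version: u64) {
        if let Some(index) = self.indexes.get_mut(table) {
            Rc::make_mut(index).built_at_version = new_version;
        }
    }
}
```

A fork that never writes to a sub-table keeps pointing at the parent's index, while a write gives the branch a private copy and leaves the parent untouched.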
## L2 — Graph topology index
This is OmniGraph-specific (not Lance):

- A Compressed Sparse Row (CSR) adjacency representation of edges, in both the out (CSR) and in (CSC) directions, plus a dense per-node-type id mapping.
- Built on demand from a snapshot's edge tables, **lazily**: only when an `Expand` that the planner routes to the CSR path (dense / large frontier) or an `AntiJoin` actually needs it.
- Cached per snapshot (LRU, keyed by snapshot id + edge table versions), so repeat traversals over the same snapshot reuse it.
- Selective `Expand`s resolve neighbors from the persisted `src`/`dst` BTREE instead (one indexed scan per hop) and never trigger the CSR build; see [query-language](../queries/index.md) → Traversal execution. Pure scans, and queries served entirely by the indexed traversal path, skip the CSR build as well.
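
For readers unfamiliar with CSR, the sketch below builds the textbook structure from an edge list using a counting sort. It assumes node ids have already been mapped to dense `u32` values in `0..num_nodes`; the `Csr` type and `build_both` helper are hypothetical and are not OmniGraph's actual topology index.

```rust
/// Hypothetical CSR adjacency over dense node ids 0..num_nodes.
struct Csr {
    /// offsets[n]..offsets[n + 1] is the slice of `targets` owned by node n.
    offsets: Vec<usize>,
    targets: Vec<u32>,
}

impl Csr {
    /// Build from (src, dst) pairs with a counting sort on `src`.
    fn from_edges(num_nodes: usize, edges: &[(u32, u32)]) -> Self {
        // Count the out-degree of each node, shifted by one slot.
        let mut offsets = vec![0usize; num_nodes + 1];
        for &(src, _) in edges {
            offsets[src as usize + 1] += 1;
        }
        // Prefix-sum the counts into start offsets.
        for i in 0..num_nodes {
            offsets[i + 1] += offsets[i];
        }
        // Scatter each destination into its source's slice.
        let mut cursor = offsets.clone();
        let mut targets = vec![0u32; edges.len()];
        for &(src, dst) in edges {
            let slot = &mut cursor[src as usize];
            targets[*slot] = dst;
            *slot += 1;
        }
        Csr { offsets, targets }
    }

    /// Neighbors of `node` as a contiguous slice, with no per-hop scan.
    fn neighbors(&self, node: u32) -> &[u32] {
        let n = node as usize;
        &self.targets[self.offsets[n]..self.offsets[n + 1]]
    }
}

/// The in-direction (CSC) is the same structure built over reversed edges.
fn build_both(num_nodes: usize, edges: &[(u32, u32)]) -> (Csr, Csr) {
    let reversed: Vec<(u32, u32)> = edges.iter().map(|&(s, d)| (d, s)).collect();
    (Csr::from_edges(num_nodes, edges), Csr::from_edges(num_nodes, &reversed))
}
```

Building costs one pass per direction over the whole edge list, so it pays off only when a traversal touches many nodes. That trade-off fits the routing described above, where selective `Expand`s use the `src`/`dst` BTREE instead.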