sqlite-vec/IVF_PLAN.md
Alex Garcia 3358e127f6 Add IVF index for vec0 virtual table
Add inverted file (IVF) index type: partitions vectors into clusters via
k-means, quantizes to int8, and scans only the nearest nprobe partitions at
query time. Includes shadow table management, insert/delete, KNN integration,
compile flag (SQLITE_VEC_ENABLE_IVF), fuzz targets, and tests. Removes
superseded ivf-benchmarks/ directory.
2026-03-31 01:18:47 -07:00

9.9 KiB
Raw Blame History

IVF Index for sqlite-vec

Overview

IVF (Inverted File Index) is an approximate nearest neighbor index for sqlite-vec's vec0 virtual table. It partitions vectors into clusters via k-means, then at query time only scans the nearest clusters instead of all vectors. Combined with scalar or binary quantization, this gives 5-20x query speedups over brute-force with tunable recall.

SQL API

Table Creation

CREATE VIRTUAL TABLE vec_items USING vec0(
  id INTEGER PRIMARY KEY,
  embedding float[768] distance_metric=cosine
    INDEXED BY ivf(nlist=128, nprobe=16)
);

-- With quantization (4x smaller cells, rescore for recall)
CREATE VIRTUAL TABLE vec_items USING vec0(
  id INTEGER PRIMARY KEY,
  embedding float[768] distance_metric=cosine
    INDEXED BY ivf(nlist=128, nprobe=16, quantizer=int8, oversample=4)
);

Parameters

Parameter Values Default Description
nlist 1-65536, or 0 128 Number of k-means clusters. Rule of thumb: sqrt(N)
nprobe 1-nlist 10 Clusters to search at query time. More = better recall, slower
quantizer none, int8, binary none How vectors are stored in cells
oversample >= 1 1 Re-rank oversample * k candidates with full-precision distance

Inserting Vectors

-- Works immediately, even before training
INSERT INTO vec_items(id, embedding) VALUES (1, :vector);

Before centroids exist, vectors go to an "unassigned" partition and queries do brute-force. After training, new inserts are assigned to the nearest centroid.

Training (Computing Centroids)

-- Run built-in k-means on all vectors
INSERT INTO vec_items(id) VALUES ('compute-centroids');

This loads all vectors into memory, runs k-means++ with Lloyd's algorithm, creates quantized centroids, and redistributes all vectors into cluster cells. It's a blocking operation — run it once after bulk insert.

Manual Centroid Import

-- Import externally-computed centroids
INSERT INTO vec_items(id, embedding) VALUES ('set-centroid:0', :centroid_0);
INSERT INTO vec_items(id, embedding) VALUES ('set-centroid:1', :centroid_1);

-- Assign vectors to imported centroids
INSERT INTO vec_items(id) VALUES ('assign-vectors');

Runtime Parameter Tuning

-- Change nprobe without rebuilding the index
INSERT INTO vec_items(id) VALUES ('nprobe=32');

KNN Queries

-- Same syntax as standard vec0
SELECT id, distance
FROM vec_items
WHERE embedding MATCH :query AND k = 10;

Other Commands

-- Remove centroids, move all vectors back to unassigned
INSERT INTO vec_items(id) VALUES ('clear-centroids');

How It Works

Architecture

User vector (float32)
  → quantize to int8/binary (if quantizer != none)
  → find nearest centroid (quantized distance)
  → store quantized vector in cell blob
  → store full vector in KV table (if quantizer != none)
  → query:
      1. quantize query vector
      2. find top nprobe centroids by quantized distance
      3. scan cell blobs: quantized distance (fast, small I/O)
      4. if oversample > 1: re-score top N*k with full vectors
      5. return top k

Shadow Tables

For a table vec_items with vector column index 0:

Table Schema Purpose
vec_items_ivf_centroids00 centroid_id PK, centroid BLOB K-means centroids (quantized)
vec_items_ivf_cells00 centroid_id, n_vectors, validity BLOB, rowids BLOB, vectors BLOB Packed vector cells, 64 vectors max per row. Multiple rows per centroid. Index on centroid_id.
vec_items_ivf_rowid_map00 rowid PK, cell_id, slot Maps vector rowid → cell location for O(1) delete
vec_items_ivf_vectors00 rowid PK, vector BLOB Full-precision vectors (only when quantizer != none)

Cell Storage

Cells use packed blob storage identical to vec0's chunk layout:

  • validity: bitmap (1 bit per slot) marking live vectors
  • rowids: packed i64 array
  • vectors: packed array of quantized vectors

Cells are capped at 64 vectors (~200KB at 768-dim float32, ~48KB for int8, ~6KB for binary). When a cell fills, a new row is created for the same centroid. This avoids SQLite overflow page traversal which was a 110x performance bottleneck with unbounded cells.

Quantization

int8: Each float32 dimension clamped to [-1,1] and scaled to int8 [-127,127]. 4x storage reduction. Distance computed via int8 L2.

binary: Sign-bit quantization — each bit is 1 if the float is positive. 32x storage reduction. Distance computed via hamming distance.

Oversample re-ranking: When oversample > 1, the quantized scan collects oversample * k candidates, then looks up each candidate's full-precision vector from the KV table and re-computes exact distance. This recovers nearly all recall lost from quantization. At oversample=4 with int8, recall matches non-quantized IVF exactly.

K-Means

Uses Lloyd's algorithm with k-means++ initialization:

  1. K-means++ picks initial centroids weighted by distance
  2. Lloyd's iterations: assign vectors to nearest centroid, recompute centroids as cluster means
  3. Empty cluster handling: reassign to farthest point
  4. K-means runs in float32; centroids are quantized before storage

Training data: recommend 16× nlist vectors. At nlist=1000, that's 16k vectors — k-means takes ~140s on 768-dim data.

Performance

100k vectors (COHERE 768-dim cosine)

                          name  qry(ms)  recall
───────────────────────────────────────────────
          ivf(q=int8,os=4),p=8    5.3ms  0.934  ← 6x faster than flat
         ivf(q=int8,os=4),p=16    5.4ms  0.968
               ivf(q=none),p=8    5.3ms  0.934
      ivf(q=binary,os=10),p=16    1.3ms  0.832  ← 26x faster than flat
         ivf(q=int8,os=4),p=32    7.4ms  0.990
              ivf(q=none),p=32   15.5ms  0.992
                    int8(os=4)   18.7ms  0.996
                     bit(os=8)   18.7ms  0.884
                          flat   33.7ms  1.000

1M vectors (COHERE 768-dim cosine)

                            name  insert  train    MB  qry(ms)  recall
──────────────────────────────────────────────────────────────────────
            ivf(q=int8,os=4),p=8   163s   142s  4725   16.3ms  0.892
        ivf(q=binary,os=10),p=16   118s   144s  4073   17.7ms  0.830
           ivf(q=int8,os=4),p=16   163s   142s  4725   24.3ms  0.950
           ivf(q=int8,os=4),p=32   163s   142s  4725   41.6ms  0.980
                 ivf(q=none),p=8   497s   144s  3101   52.1ms  0.890
                 ivf(q=none),p=16  497s   144s  3101   56.6ms  0.950
                       bit(os=8)    18s      -  3048   83.5ms  0.918
                 ivf(q=none),p=32  497s   144s  3101  103.9ms  0.980
                      int8(os=4)    19s      -  3689  169.1ms  0.994
                            flat    20s      -  2955  338.0ms  1.000

Best config at 1M: ivf(quantizer=int8, oversample=4, nprobe=16) — 24ms query, 0.95 recall, 14x faster than flat, 7x faster than int8 baseline.

Scaling Characteristics

Metric 100k 1M Scaling
Flat query 34ms 338ms 10x (linear)
IVF int8 p=16 5.4ms 24.3ms 4.5x (sublinear)
IVF insert rate ~10k/s ~6k/s Slight degradation
Training (nlist=1000) 13s 142s ~11x

Implementation

File Structure

sqlite-vec-ivf-kmeans.c    K-means++ algorithm (pure C, no SQLite deps)
sqlite-vec-ivf.c           All IVF logic: parser, shadow tables, insert,
                           delete, query, centroid commands, quantization
sqlite-vec.c               ~50 lines of additions: struct fields, #includes,
                           dispatch hooks in parse/create/insert/delete/filter

Both IVF files are #included into sqlite-vec.c. No Makefile changes needed.

Key Design Decisions

  1. Fixed-size cells (64 vectors) instead of one blob per centroid. Avoids SQLite overflow page traversal which caused 110x insert slowdown.

  2. Multiple cell rows per centroid with an index on centroid_id. When a cell fills, a new row is created. Query scans all rows for probed centroids via WHERE centroid_id IN (...).

  3. Always store full vectors when quantizer != none (in _ivf_vectors KV table). Enables oversample re-ranking and point queries returning original precision.

  4. K-means in float32, quantize after. Simpler than quantized k-means, and assignment accuracy doesn't suffer much since nprobe compensates.

  5. NEON SIMD for cosine distance. Added cosine_float_neon() with 4-wide FMA for dot product + magnitudes. Benefits all vec0 queries, not just IVF.

  6. Runtime nprobe tuning. INSERT INTO t(id) VALUES ('nprobe=N') changes the probe count without rebuilding — enables fast parameter sweeps.

Optimization History

Optimization Impact
Fixed-size cells (64 max) 110x insert speedup
Skip chunk writes for IVF 2x DB size reduction
NEON cosine distance 2x query speedup + 13% recall improvement (correct metric)
Cached prepared statements Eliminated per-insert prepare/finalize
Batched cell reads (IN clause) Fewer SQLite queries per KNN
int8 quantization 2.5x query speedup at same recall
Binary quantization 32x less cell I/O
Oversample re-ranking Recovers quantization recall loss

Remaining Work

See ivf-benchmarks/TODO.md for the full list. Key items:

  • Cache centroids in memory — each insert re-reads all centroids from SQLite
  • Runtime oversample — same pattern as nprobe runtime command
  • SIMD k-means — training uses scalar distance, could be 4x faster
  • Top-k heap — replace qsort with min-heap for large nprobe
  • IVF-PQ — product quantization for better compression/recall tradeoff