mirror of https://github.com/asg017/sqlite-vec.git synced 2026-04-25 08:46:49 +02:00

Alex Garcia 3358e127f6 Add IVF index for vec0 virtual table

Add inverted file (IVF) index type: partitions vectors into clusters via
k-means, quantizes to int8, and scans only the nearest nprobe partitions at
query time. Includes shadow table management, insert/delete, KNN integration,
compile flag (SQLITE_VEC_ENABLE_IVF), fuzz targets, and tests. Removes
superseded ivf-benchmarks/ directory.

2026-03-31 01:18:47 -07:00


IVF Index for sqlite-vec

Overview

IVF (Inverted File Index) is an approximate nearest neighbor index for sqlite-vec's vec0 virtual table. It partitions vectors into clusters via k-means, then at query time scans only the nearest clusters instead of all vectors. Combined with scalar (int8) or binary quantization, this gives 5-20x query speedups over brute force, with recall tunable through nprobe and oversample.

SQL API

Table Creation

CREATE VIRTUAL TABLE vec_items USING vec0(
  id INTEGER PRIMARY KEY,
  embedding float[768] distance_metric=cosine
    INDEXED BY ivf(nlist=128, nprobe=16)
);

-- With quantization (4x smaller cells, rescore for recall)
CREATE VIRTUAL TABLE vec_items USING vec0(
  id INTEGER PRIMARY KEY,
  embedding float[768] distance_metric=cosine
    INDEXED BY ivf(nlist=128, nprobe=16, quantizer=int8, oversample=4)
);

Parameters

Parameter	Values	Default	Description
`nlist`	1-65536, or 0	128	Number of k-means clusters. Rule of thumb: `sqrt(N)`
`nprobe`	1-nlist	10	Clusters to search at query time. More = better recall, slower
`quantizer`	`none`, `int8`, `binary`	`none`	How vectors are stored in cells
`oversample`	>= 1	1	Re-rank `oversample * k` candidates with full-precision distance
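
The `sqrt(N)` rule of thumb for `nlist` can be turned into a starting value with a few lines of C. This is only a sketch: `suggest_nlist` is a hypothetical helper, not part of sqlite-vec, and the cap at 65536 simply mirrors the range listed in the table above.

```c
#include <stdint.h>

/* Hypothetical helper (not part of sqlite-vec): suggest nlist = floor(sqrt(N)),
 * kept inside the documented 1..65536 range. Uses integer arithmetic only. */
static int suggest_nlist(int64_t n_vectors) {
  int64_t r = 1;
  /* Grow r while (r + 1)^2 still fits under n_vectors, up to the cap. */
  while (r < 65536 && (r + 1) * (r + 1) <= n_vectors) {
    r++;
  }
  return (int)r;
}
```

For 1M vectors this suggests 1000, which is the `nlist` used for the training figures later in this document. Treat the result as a starting point for tuning rather than a fixed answer.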

Inserting Vectors

-- Works immediately, even before training
INSERT INTO vec_items(id, embedding) VALUES (1, :vector);

Before centroids exist, vectors go to an "unassigned" partition and queries fall back to a brute-force scan. After training, new inserts are assigned to the nearest centroid.

Training (Computing Centroids)

-- Run built-in k-means on all vectors
INSERT INTO vec_items(id) VALUES ('compute-centroids');

This loads all vectors into memory, runs Lloyd's algorithm with k-means++ initialization, creates quantized centroids, and redistributes all vectors into cluster cells. It is a blocking operation, so run it once after a bulk insert.

Manual Centroid Import

-- Import externally-computed centroids
INSERT INTO vec_items(id, embedding) VALUES ('set-centroid:0', :centroid_0);
INSERT INTO vec_items(id, embedding) VALUES ('set-centroid:1', :centroid_1);

-- Assign vectors to imported centroids
INSERT INTO vec_items(id) VALUES ('assign-vectors');

Runtime Parameter Tuning

-- Change nprobe without rebuilding the index
INSERT INTO vec_items(id) VALUES ('nprobe=32');

KNN Queries

-- Same syntax as standard vec0
SELECT id, distance
FROM vec_items
WHERE embedding MATCH :query AND k = 10;

Other Commands

-- Remove centroids, move all vectors back to unassigned
INSERT INTO vec_items(id) VALUES ('clear-centroids');

How It Works

Architecture

User vector (float32)
  → quantize to int8/binary (if quantizer != none)
  → find nearest centroid (quantized distance)
  → store quantized vector in cell blob
  → store full vector in KV table (if quantizer != none)
  → query:
      1. quantize query vector
      2. find top nprobe centroids by quantized distance
      3. scan cell blobs: quantized distance (fast, small I/O)
      4. if oversample > 1: re-score top oversample*k with full vectors
      5. return top k
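
Step 2 of the query path (picking the `nprobe` nearest centroids) can be sketched as follows. This is an illustration under stated assumptions, not the code in sqlite-vec-ivf.c: it uses float32 squared L2 distance instead of the quantized distance, and `l2_sq` and `select_probes` are hypothetical names.

```c
#include <stddef.h>

/* Squared L2 distance between two float vectors of length dim. */
static float l2_sq(const float *a, const float *b, size_t dim) {
  float acc = 0.0f;
  for (size_t i = 0; i < dim; i++) {
    float d = a[i] - b[i];
    acc += d * d;
  }
  return acc;
}

/* Hypothetical sketch: write the indices of the nprobe centroids nearest to
 * the query into out[] (nearest first) and their distances into out_dist[].
 * centroids holds nlist rows of dim floats. Both output arrays must have room
 * for nprobe entries. Returns the number written, min(nprobe, nlist).
 * Insertion into a small sorted array costs O(nlist * nprobe). */
static size_t select_probes(const float *centroids, size_t nlist, size_t dim,
                            const float *query, size_t nprobe,
                            size_t *out, float *out_dist) {
  size_t n = 0;
  if (nprobe == 0) return 0;
  for (size_t c = 0; c < nlist; c++) {
    float d = l2_sq(centroids + c * dim, query, dim);
    /* List is full and this centroid is no closer than the worst kept one. */
    if (n == nprobe && d >= out_dist[n - 1]) continue;
    /* Append while filling, otherwise overwrite the current worst entry. */
    size_t pos = (n < nprobe) ? n++ : n - 1;
    /* Shift farther entries right to keep the list sorted. */
    while (pos > 0 && out_dist[pos - 1] > d) {
      out[pos] = out[pos - 1];
      out_dist[pos] = out_dist[pos - 1];
      pos--;
    }
    out[pos] = c;
    out_dist[pos] = d;
  }
  return n;
}
```

The returned indices are the centroid ids whose cells would then be scanned in step 3.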

Shadow Tables

For a table vec_items with vector column index 0:

Table	Schema	Purpose
`vec_items_ivf_centroids00`	`centroid_id PK, centroid BLOB`	K-means centroids (quantized)
`vec_items_ivf_cells00`	`centroid_id, n_vectors, validity BLOB, rowids BLOB, vectors BLOB`	Packed vector cells, 64 vectors max per row. Multiple rows per centroid. Index on centroid_id.
`vec_items_ivf_rowid_map00`	`rowid PK, cell_id, slot`	Maps vector rowid → cell location for O(1) delete
`vec_items_ivf_vectors00`	`rowid PK, vector BLOB`	Full-precision vectors (only when quantizer != none)

Cell Storage

Cells use packed blob storage identical to vec0's chunk layout:

validity: bitmap (1 bit per slot) marking live vectors
rowids: packed i64 array
vectors: packed array of quantized vectors

Cells are capped at 64 vectors (~200KB at 768-dim float32, ~48KB for int8, ~6KB for binary). When a cell fills, a new row is created for the same centroid. This avoids SQLite overflow page traversal, which caused a 110x insert slowdown with unbounded cells.
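
The cell sizes quoted above follow directly from the 64-vector cap, and the validity bitmap is a plain bit array. The sketch below shows both. The helper names are hypothetical, and the bit order within each bitmap byte (slot 0 in the least significant bit) is an assumption, not something this document specifies.

```c
#include <stddef.h>
#include <stdint.h>

#define CELL_CAPACITY 64 /* vectors per cell row, per the text above */

/* Bytes per stored vector for each quantizer (dim assumed divisible by 8). */
static size_t vec_bytes_f32(size_t dim) { return dim * 4; }
static size_t vec_bytes_int8(size_t dim) { return dim; }
static size_t vec_bytes_binary(size_t dim) { return dim / 8; }

/* Size of the vectors blob of one full cell. */
static size_t cell_vectors_bytes(size_t bytes_per_vec) {
  return CELL_CAPACITY * bytes_per_vec;
}

/* Validity bitmap: 64 slots -> 8 bytes, one bit per slot.
 * Assumed bit order: slot s lives in byte s/8, bit s%8. */
static void slot_set(uint8_t *validity, unsigned slot) {
  validity[slot / 8] |= (uint8_t)(1u << (slot % 8));
}

static void slot_clear(uint8_t *validity, unsigned slot) {
  validity[slot / 8] &= (uint8_t)~(1u << (slot % 8));
}

static int slot_live(const uint8_t *validity, unsigned slot) {
  return (validity[slot / 8] >> (slot % 8)) & 1;
}
```

At 768 dimensions this gives 196608 bytes for float32, 49152 for int8, and 6144 for binary. With the rowid map supplying `(cell_id, slot)`, a delete only needs to clear one bit, which is why it can be O(1).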

Quantization

int8: Each float32 dimension clamped to [-1,1] and scaled to int8 [-127,127]. 4x storage reduction. Distance computed via int8 L2.
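
A minimal sketch of the int8 path described above, for illustration only. The source states the clamp and the scale; rounding to nearest is an assumption here, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp x to [-1, 1], scale to [-127, 127], round to nearest (assumed).
 * NaN inputs are not handled in this sketch. */
static int8_t quantize_i8(float x) {
  if (x > 1.0f) x = 1.0f;
  if (x < -1.0f) x = -1.0f;
  float s = x * 127.0f;
  return (int8_t)(s >= 0.0f ? s + 0.5f : s - 0.5f);
}

/* Quantize a whole vector: 4 bytes per dimension in, 1 byte out. */
static void quantize_vec_i8(const float *src, int8_t *dst, size_t dim) {
  for (size_t i = 0; i < dim; i++) {
    dst[i] = quantize_i8(src[i]);
  }
}

/* Squared L2 distance on int8 vectors, accumulated in 32 bits.
 * Each term is at most 254^2 = 64516, so this is safe well past 768 dims
 * (it would overflow only beyond roughly 33,000 dimensions). */
static int32_t l2_sq_i8(const int8_t *a, const int8_t *b, size_t dim) {
  int32_t acc = 0;
  for (size_t i = 0; i < dim; i++) {
    int32_t d = (int32_t)a[i] - (int32_t)b[i];
    acc += d * d;
  }
  return acc;
}
```

Because of the clamp, components outside [-1, 1] saturate at ±127, so this scheme fits normalized embeddings best.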

binary: Sign-bit quantization — each bit is 1 if the float is positive. 32x storage reduction. Distance computed via hamming distance.
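
Sign-bit quantization and Hamming distance can be sketched as below. The names are hypothetical and the bit order within each byte (dimension i in bit i%8 of byte i/8) is an assumption. The popcount loop is portable C and stands in for whatever the real implementation uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pack sign bits: bit i is 1 when src[i] > 0, else 0.
 * dst must hold (dim + 7) / 8 bytes. */
static void quantize_binary(const float *src, uint8_t *dst, size_t dim) {
  memset(dst, 0, (dim + 7) / 8);
  for (size_t i = 0; i < dim; i++) {
    if (src[i] > 0.0f) {
      dst[i / 8] |= (uint8_t)(1u << (i % 8));
    }
  }
}

/* Hamming distance: number of bit positions where a and b differ. */
static unsigned hamming(const uint8_t *a, const uint8_t *b, size_t nbytes) {
  unsigned count = 0;
  for (size_t i = 0; i < nbytes; i++) {
    uint8_t x = (uint8_t)(a[i] ^ b[i]);
    while (x) {
      x &= (uint8_t)(x - 1); /* clear the lowest set bit */
      count++;
    }
  }
  return count;
}
```

A 768-dim vector packs into 96 bytes, which is the 32x reduction from 3072 bytes of float32.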

Oversample re-ranking: When oversample > 1, the quantized scan collects oversample * k candidates, then looks up each candidate's full-precision vector from the KV table and re-computes the exact distance. This recovers nearly all recall lost from quantization. At oversample=4 with int8, recall is within 0.002 of non-quantized IVF at the same nprobe in the benchmarks below.
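
The re-ranking step can be sketched as follows. Everything here is illustrative: `Candidate` and `rerank` are hypothetical names, a flat array indexed by rowid stands in for the KV table lookup, and squared L2 is used as the exact distance.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
  int64_t rowid;
  float dist; /* quantized distance on input, exact distance on output */
} Candidate;

/* Order by distance, breaking ties by rowid for a deterministic result. */
static int cmp_candidate(const void *pa, const void *pb) {
  const Candidate *a = (const Candidate *)pa;
  const Candidate *b = (const Candidate *)pb;
  if (a->dist < b->dist) return -1;
  if (a->dist > b->dist) return 1;
  return (a->rowid > b->rowid) - (a->rowid < b->rowid);
}

/* Re-score n_cands candidates (normally oversample * k of them) against the
 * full-precision vectors and sort them. full_vectors is a stand-in for the
 * KV table: row r starts at full_vectors + r * dim, so rowids are assumed to
 * be array indices here. Returns how many leading entries form the result. */
static size_t rerank(Candidate *cands, size_t n_cands,
                     const float *full_vectors, size_t dim,
                     const float *query, size_t k) {
  for (size_t i = 0; i < n_cands; i++) {
    const float *v = full_vectors + (size_t)cands[i].rowid * dim;
    float acc = 0.0f;
    for (size_t j = 0; j < dim; j++) {
      float d = v[j] - query[j];
      acc += d * d;
    }
    cands[i].dist = acc;
  }
  qsort(cands, n_cands, sizeof(Candidate), cmp_candidate);
  return n_cands < k ? n_cands : k;
}
```

The cost of this step grows with oversample * k full-vector lookups, which is the trade-off behind choosing a small oversample such as 4.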

K-Means

Uses Lloyd's algorithm with k-means++ initialization:

K-means++ picks initial centroids weighted by distance
Lloyd's iterations: assign vectors to nearest centroid, recompute centroids as cluster means
Empty cluster handling: reassign to farthest point
K-means runs in float32; centroids are quantized before storage

Training data: 16 × nlist vectors is recommended. At nlist=1000, that is 16k vectors; k-means takes ~140s on 768-dim data.
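
One Lloyd iteration, including the empty-cluster rule listed above, might look like the sketch below. It is not the code in sqlite-vec-ivf-kmeans.c: the k-means++ seeding is left out (initial centroids are taken as input), "farthest point" is interpreted as the vector farthest from its assigned centroid, and `lloyd_step` is a hypothetical name.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static float l2_sq(const float *a, const float *b, size_t dim) {
  float acc = 0.0f;
  for (size_t i = 0; i < dim; i++) {
    float d = a[i] - b[i];
    acc += d * d;
  }
  return acc;
}

/* One Lloyd iteration over n vectors (n >= 1) and k centroids (k >= 1).
 * assign[i] holds vector i's centroid index; initialise it to (size_t)-1.
 * Returns the number of changed assignments, or -1 if allocation fails. */
static long lloyd_step(const float *vecs, size_t n, size_t dim,
                       float *centroids, size_t k, size_t *assign) {
  long changed = 0;
  size_t farthest = 0;
  float farthest_d = -1.0f;
  size_t *counts = calloc(k, sizeof *counts);
  float *sums = calloc(k * dim, sizeof *sums);
  if (!counts || !sums) {
    free(counts);
    free(sums);
    return -1;
  }
  /* Assignment step: nearest centroid for every vector. */
  for (size_t i = 0; i < n; i++) {
    const float *v = vecs + i * dim;
    size_t best = 0;
    float best_d = l2_sq(v, centroids, dim);
    for (size_t c = 1; c < k; c++) {
      float d = l2_sq(v, centroids + c * dim, dim);
      if (d < best_d) { best_d = d; best = c; }
    }
    if (assign[i] != best) { assign[i] = best; changed++; }
    if (best_d > farthest_d) { farthest_d = best_d; farthest = i; }
    counts[best]++;
    for (size_t j = 0; j < dim; j++) sums[best * dim + j] += v[j];
  }
  /* Update step: each centroid becomes the mean of its members. */
  for (size_t c = 0; c < k; c++) {
    if (counts[c] == 0) {
      /* Empty cluster: reseed at the vector farthest from its centroid. */
      memcpy(centroids + c * dim, vecs + farthest * dim, dim * sizeof(float));
      continue;
    }
    for (size_t j = 0; j < dim; j++) {
      centroids[c * dim + j] = sums[c * dim + j] / (float)counts[c];
    }
  }
  free(counts);
  free(sums);
  return changed;
}
```

A caller would repeat `lloyd_step` until it returns 0 or an iteration limit is reached, then quantize the resulting centroids before storing them.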

Performance

100k vectors (COHERE 768-dim cosine)

                          name  qry(ms)  recall
───────────────────────────────────────────────
          ivf(q=int8,os=4),p=8    5.3ms  0.934  ← 6x faster than flat
         ivf(q=int8,os=4),p=16    5.4ms  0.968
               ivf(q=none),p=8    5.3ms  0.934
      ivf(q=binary,os=10),p=16    1.3ms  0.832  ← 26x faster than flat
         ivf(q=int8,os=4),p=32    7.4ms  0.990
              ivf(q=none),p=32   15.5ms  0.992
                    int8(os=4)   18.7ms  0.996
                     bit(os=8)   18.7ms  0.884
                          flat   33.7ms  1.000

1M vectors (COHERE 768-dim cosine)

                            name  insert  train    MB  qry(ms)  recall
──────────────────────────────────────────────────────────────────────
            ivf(q=int8,os=4),p=8   163s   142s  4725   16.3ms  0.892
        ivf(q=binary,os=10),p=16   118s   144s  4073   17.7ms  0.830
           ivf(q=int8,os=4),p=16   163s   142s  4725   24.3ms  0.950
           ivf(q=int8,os=4),p=32   163s   142s  4725   41.6ms  0.980
                 ivf(q=none),p=8   497s   144s  3101   52.1ms  0.890
                 ivf(q=none),p=16  497s   144s  3101   56.6ms  0.950
                       bit(os=8)    18s      -  3048   83.5ms  0.918
                 ivf(q=none),p=32  497s   144s  3101  103.9ms  0.980
                      int8(os=4)    19s      -  3689  169.1ms  0.994
                            flat    20s      -  2955  338.0ms  1.000

Best config at 1M: ivf(quantizer=int8, oversample=4, nprobe=16), with a 24ms query at 0.95 recall. That is 14x faster than flat and 7x faster than the int8 baseline.

Scaling Characteristics

Metric	100k	1M	Scaling
Flat query	34ms	338ms	10x (linear)
IVF int8 p=16	5.4ms	24.3ms	4.5x (sublinear)
IVF insert rate	~10k/s	~6k/s	Slight degradation
Training (nlist=1000)	13s	142s	~11x

Implementation

File Structure

sqlite-vec-ivf-kmeans.c    K-means++ algorithm (pure C, no SQLite deps)
sqlite-vec-ivf.c           All IVF logic: parser, shadow tables, insert,
                           delete, query, centroid commands, quantization
sqlite-vec.c               ~50 lines of additions: struct fields, #includes,
                           dispatch hooks in parse/create/insert/delete/filter

Both IVF files are #included into sqlite-vec.c. No Makefile changes needed.

Key Design Decisions

Fixed-size cells (64 vectors) instead of one blob per centroid. Avoids SQLite overflow page traversal which caused 110x insert slowdown.
Multiple cell rows per centroid with an index on centroid_id. When a cell fills, a new row is created. Query scans all rows for probed centroids via WHERE centroid_id IN (...).
Always store full vectors when quantizer != none (in _ivf_vectors KV table). Enables oversample re-ranking and point queries returning original precision.
K-means in float32, quantize after. Simpler than quantized k-means, and assignment accuracy doesn't suffer much since nprobe compensates.
NEON SIMD for cosine distance. Added cosine_float_neon() with 4-wide FMA for dot product + magnitudes. Benefits all vec0 queries, not just IVF.
Runtime nprobe tuning. INSERT INTO t(id) VALUES ('nprobe=N') changes the probe count without rebuilding — enables fast parameter sweeps.

Optimization History

Optimization	Impact
Fixed-size cells (64 max)	110x insert speedup
Skip chunk writes for IVF	2x DB size reduction
NEON cosine distance	2x query speedup + 13% recall improvement (correct metric)
Cached prepared statements	Eliminated per-insert prepare/finalize
Batched cell reads (IN clause)	Fewer SQLite queries per KNN
int8 quantization	2.5x query speedup at same recall
Binary quantization	32x less cell I/O
Oversample re-ranking	Recovers quantization recall loss

Remaining Work

See ivf-benchmarks/TODO.md for the full list. Key items:

Cache centroids in memory: each insert re-reads all centroids from SQLite
Runtime oversample: same pattern as the nprobe runtime command
SIMD k-means: training uses scalar distance, could be 4x faster
Top-k heap: replace qsort with a bounded heap of size k for large nprobe
IVF-PQ: product quantization for better compression/recall tradeoff
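
For the top-k heap item, one common approach is a bounded heap of size k whose root is the worst candidate kept so far (a max-heap on distance), so each scanned vector costs O(log k) instead of a full sort at the end. The sketch below is a suggestion for how that could look, not existing sqlite-vec code; all names are hypothetical.

```c
#include <stddef.h>

typedef struct { long long rowid; float dist; } Hit;

/* items is caller-provided storage for cap (= k) entries. */
typedef struct { Hit *items; size_t size, cap; } TopK;

static void swap_hit(Hit *a, Hit *b) { Hit t = *a; *a = *b; *b = t; }

/* Restore the max-heap property upward from index i. */
static void sift_up(Hit *h, size_t i) {
  while (i > 0) {
    size_t p = (i - 1) / 2;
    if (h[p].dist >= h[i].dist) break;
    swap_hit(&h[p], &h[i]);
    i = p;
  }
}

/* Restore the max-heap property downward from index i. */
static void sift_down(Hit *h, size_t n, size_t i) {
  for (;;) {
    size_t l = 2 * i + 1, r = l + 1, m = i;
    if (l < n && h[l].dist > h[m].dist) m = l;
    if (r < n && h[r].dist > h[m].dist) m = r;
    if (m == i) break;
    swap_hit(&h[m], &h[i]);
    i = m;
  }
}

/* Offer one candidate; keeps only the cap smallest distances seen. */
static void topk_push(TopK *t, long long rowid, float dist) {
  if (t->cap == 0) return;
  if (t->size < t->cap) {
    t->items[t->size].rowid = rowid;
    t->items[t->size].dist = dist;
    sift_up(t->items, t->size);
    t->size++;
  } else if (dist < t->items[0].dist) {
    /* Better than the current worst: replace the root and re-heapify. */
    t->items[0].rowid = rowid;
    t->items[0].dist = dist;
    sift_down(t->items, t->size, 0);
  }
}
```

After the scan, the k kept entries still need a final sort by distance before being returned, but that sort is over k items rather than every scanned vector.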
