2026-03-31 01:03:32 -07:00
# KNN Benchmarks for sqlite-vec

Benchmarking infrastructure for vec0 KNN configurations. Includes brute-force
baselines (float, int8, bit), rescore, IVF, and DiskANN index types.

## Datasets

Each dataset is a subdirectory containing a `Makefile` and `build_base_db.py`
that produce a `base.db`. The benchmark runner auto-discovers any subdirectory
with a `base.db` file.

```
cohere1m/                  # Cohere 768d cosine, 1M vectors
  Makefile                 # downloads parquets from Zilliz, builds base.db
  build_base_db.py
  base.db                  # (generated)

cohere10m/                 # Cohere 768d cosine, 10M vectors (10 train shards)
  Makefile                 # make -j12 download to fetch all shards in parallel
  build_base_db.py
  base.db                  # (generated)
```

Every `base.db` has the same schema:

| Table | Columns | Description |
|-------|---------|-------------|
| `train` | `id INTEGER PRIMARY KEY, vector BLOB` | Indexed vectors (f32 blobs) |
| `query_vectors` | `id INTEGER PRIMARY KEY, vector BLOB` | Query vectors for KNN evaluation |
| `neighbors` | `query_vector_id INTEGER, rank INTEGER, neighbors_id TEXT` | Ground-truth nearest neighbors |

To add a new dataset, create a directory with a `Makefile` that builds `base.db`
with the tables above. It will be available via `--dataset <dirname>` automatically.
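As a rough illustration of the schema above, the following sketch writes a tiny `base.db` using only the Python standard library. It is not the repository's `build_base_db.py`: the helper name `create_base_db` is hypothetical, the vectors are assumed to be packed as little-endian 32-bit floats, and `neighbors_id` is assumed to hold a single neighbor id rendered as text.

```python
import sqlite3
import struct


def pack_f32(vec):
    """Pack a sequence of floats into an f32 blob (assumed little-endian)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def create_base_db(path, train, queries, neighbors):
    """Create a base.db with the three tables described above.

    train, queries: iterables of (id, list_of_floats)
    neighbors:      iterable of (query_vector_id, rank, neighbors_id)
    """
    db = sqlite3.connect(path)
    try:
        db.executescript(
            """
            CREATE TABLE train (id INTEGER PRIMARY KEY, vector BLOB);
            CREATE TABLE query_vectors (id INTEGER PRIMARY KEY, vector BLOB);
            CREATE TABLE neighbors (
                query_vector_id INTEGER,
                rank INTEGER,
                neighbors_id TEXT  -- assumption: one neighbor id per row, as text
            );
            """
        )
        with db:  # commit all inserts in a single transaction
            db.executemany(
                "INSERT INTO train (id, vector) VALUES (?, ?)",
                [(i, pack_f32(v)) for i, v in train],
            )
            db.executemany(
                "INSERT INTO query_vectors (id, vector) VALUES (?, ?)",
                [(i, pack_f32(v)) for i, v in queries],
            )
            db.executemany(
                "INSERT INTO neighbors (query_vector_id, rank, neighbors_id) "
                "VALUES (?, ?, ?)",
                list(neighbors),
            )
    finally:
        db.close()
```

A real dataset script would read its vectors from the downloaded files instead of taking them as arguments; check an existing `build_base_db.py` for the exact blob and `neighbors_id` encoding before relying on these assumptions.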

### Building datasets

```bash
# Cohere 1M
cd cohere1m && make download && make && cd ..

# Cohere 10M (parallel download recommended: 10 train shards + test + neighbors)
cd cohere10m && make -j12 download && make && cd ..
```

## Prerequisites

- Built `dist/vec0` extension (run `make loadable` from repo root)
- Python 3.10+
- `uv`

## Quick start

```bash
# 1. Build a dataset
cd cohere1m && make && cd ..

# 2. Quick smoke test (5k vectors)
make bench-smoke

# 3. Full benchmark at 10k
make bench-10k
```

## Usage

```bash
uv run python bench.py --subset-size 10000 -k 10 -n 50 --dataset cohere1m \
  "brute-float:type=baseline,variant=float" \
  "rescore-bit-os8:type=rescore,quantizer=bit,oversample=8"
```
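Each quoted argument in the command above is a config string of the form `name:type=<index_type>,key=val,...`. For illustration only, here is a minimal sketch of how such a string could be split into a name and its options. `parse_config` is a hypothetical helper, not the parser used by `bench.py`, and it keeps every value as a string rather than validating keys per index type.

```python
def parse_config(spec: str) -> tuple[str, dict[str, str]]:
    """Split 'name:type=<index_type>,key=val,...' into (name, params)."""
    name, sep, rest = spec.partition(":")
    if not sep or not name:
        raise ValueError(f"expected 'name:key=val,...', got {spec!r}")

    params: dict[str, str] = {}
    for pair in rest.split(","):
        key, eq, value = pair.partition("=")
        if not eq or not key:
            raise ValueError(f"expected 'key=val', got {pair!r}")
        params[key.strip()] = value.strip()

    # Every config must say which index type it describes.
    if "type" not in params:
        raise ValueError(f"config {name!r} is missing 'type='")
    return name, params


print(parse_config("rescore-bit-os8:type=rescore,quantizer=bit,oversample=8"))
# ('rescore-bit-os8', {'type': 'rescore', 'quantizer': 'bit', 'oversample': '8'})
```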

### Config format

`name:type=<index_type>,key=val,key=val`

| Index type | Keys |
|-----------|------|
| `baseline` | `variant` (float/int8/bit), `oversample` |
| `rescore` | `quantizer` (bit/int8), `oversample` |
| `ivf` | `nlist`, `nprobe` |
| `diskann` | `R`, `L`, `quantizer` (binary/int8), `buffer_threshold` |

### Make targets

| Target | Description |
|--------|-------------|
| `make seed` | Download and build default dataset |
| `make bench-smoke` | Quick 5k test (3 configs) |
| `make bench-10k` | All configs at 10k vectors |
| `make bench-50k` | All configs at 50k vectors |
| `make bench-100k` | All configs at 100k vectors |
| `make bench-all` | 10k + 50k + 100k |
| `make bench-ivf` | Baselines + IVF across 10k/50k/100k |
| `make bench-diskann` | Baselines + DiskANN across 10k/50k/100k |

## Results DB

Each run writes to `runs/<dataset>/<subset_size>/results.db` (SQLite, WAL mode).
Progress is written continuously, so you can query the file from another
terminal to monitor a run:

```bash
sqlite3 runs/cohere1m/10000/results.db "SELECT run_id, config_name, status FROM runs"
```

See `results_schema.sql` for the full schema (tables: `runs`, `run_results`,
`insert_batches`, `queries`).
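The same monitoring query can be scripted. The sketch below uses Python's standard `sqlite3` module and opens the database read-only so that monitoring cannot modify results. `run_statuses` is a hypothetical helper; it relies only on the `runs` columns used in the command above (`run_id`, `config_name`, `status`), so consult `results_schema.sql` before selecting anything else.

```python
import sqlite3


def run_statuses(db_path: str) -> list[tuple]:
    """Return (run_id, config_name, status) rows from a results.db file."""
    # mode=ro opens the file read-only via SQLite's URI filename syntax.
    con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return con.execute(
            "SELECT run_id, config_name, status FROM runs ORDER BY run_id"
        ).fetchall()
    finally:
        con.close()
```

For example, `run_statuses("runs/cohere1m/10000/results.db")` could be called in a loop with a short sleep to watch a benchmark progress.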

## Adding an index type

Add an entry to `INDEX_REGISTRY` in `bench.py` and append configs to
`ALL_CONFIGS` in the `Makefile`. See existing entries for the pattern.