mirror of
https://github.com/asg017/sqlite-vec.git
synced 2026-04-25 08:46:49 +02:00
Add approximate nearest neighbor infrastructure to vec0: shared distance dispatch (vec0_distance_full), flat index type with parser, NEON-optimized cosine/Hamming for float32/int8, amalgamation script, and benchmark suite (benchmarks-ann/) with ground-truth generation and profiling tools. Remove unused vec_npy_each/vec_static_blobs code, fix missing stdint.h include.
81 lines
2.1 KiB
Markdown
81 lines
2.1 KiB
Markdown
# KNN Benchmarks for sqlite-vec
|
|
|
|
Benchmarking infrastructure for vec0 KNN configurations. Includes brute-force
|
|
baselines (float, int8, bit); index-specific branches add their own types
|
|
via the `INDEX_REGISTRY` in `bench.py`.
|
|
|
|
## Prerequisites
|
|
|
|
- Built `dist/vec0` extension (run `make` from repo root)
|
|
- Python 3.10+
|
|
- `uv` (for seed data prep): `pip install uv`
|
|
|
|
## Quick start
|
|
|
|
```bash
|
|
# 1. Download dataset and build seed DB (~3 GB download, ~5 min)
|
|
make seed
|
|
|
|
# 2. Run a quick smoke test (5k vectors, ~1 min)
|
|
make bench-smoke
|
|
|
|
# 3. Run full benchmark at 10k
|
|
make bench-10k
|
|
```
|
|
|
|
## Usage
|
|
|
|
### Direct invocation
|
|
|
|
```bash
|
|
python bench.py --subset-size 10000 \
|
|
"brute-float:type=baseline,variant=float" \
|
|
"brute-int8:type=baseline,variant=int8" \
|
|
"brute-bit:type=baseline,variant=bit"
|
|
```
|
|
|
|
### Config format
|
|
|
|
`name:type=<index_type>,key=val,key=val`
|
|
|
|
| Index type | Keys | Branch |
|
|
|-----------|------|--------|
|
|
| `baseline` | `variant` (float/int8/bit), `oversample` | this branch |
|
|
|
|
Index branches register additional types in `INDEX_REGISTRY`. See the
|
|
docstring in `bench.py` for the extension API.
|
|
|
|
### Make targets
|
|
|
|
| Target | Description |
|
|
|--------|-------------|
|
|
| `make seed` | Download COHERE 1M dataset |
|
|
| `make ground-truth` | Pre-compute ground truth for 10k/50k/100k |
|
|
| `make bench-smoke` | Quick 5k baseline test |
|
|
| `make bench-10k` | All configs at 10k vectors |
|
|
| `make bench-50k` | All configs at 50k vectors |
|
|
| `make bench-100k` | All configs at 100k vectors |
|
|
| `make bench-all` | 10k + 50k + 100k |
|
|
|
|
## Adding an index type
|
|
|
|
In your index branch, add an entry to `INDEX_REGISTRY` in `bench.py` and
|
|
append your configs to `ALL_CONFIGS` in the `Makefile`. See the existing
|
|
`baseline` entry and the comments in both files for the pattern.
|
|
|
|
## Results
|
|
|
|
Results are stored in `runs/<dir>/results.db` using the schema in `schema.sql`.
|
|
|
|
```bash
|
|
sqlite3 runs/10k/results.db "
|
|
SELECT config_name, recall, mean_ms, qps
|
|
FROM bench_results
|
|
ORDER BY recall DESC
|
|
"
|
|
```
|
|
|
|
## Dataset
|
|
|
|
[Zilliz COHERE Medium 1M](https://zilliz.com/learn/datasets-for-vector-database-benchmarks):
|
|
768 dimensions, cosine distance, 1M train vectors + 10k query vectors with precomputed neighbors.
|