mirror of
https://github.com/asg017/sqlite-vec.git
synced 2026-04-25 00:36:56 +02:00
Add ANN search support for vec0 virtual table (#273)
Add approximate nearest neighbor infrastructure to vec0: shared distance dispatch (vec0_distance_full), flat index type with parser, NEON-optimized cosine/Hamming for float32/int8, amalgamation script, and benchmark suite (benchmarks-ann/) with ground-truth generation and profiling tools. Remove unused vec_npy_each/vec_static_blobs code, fix missing stdint.h include.
This commit is contained in:
parent
e9f598abfa
commit
0de765f457
27 changed files with 2177 additions and 2116 deletions
81
benchmarks-ann/README.md
Normal file
81
benchmarks-ann/README.md
Normal file
|
|
@ -0,0 +1,81 @@
|
|||
# KNN Benchmarks for sqlite-vec
|
||||
|
||||
Benchmarking infrastructure for vec0 KNN configurations. Includes brute-force
|
||||
baselines (float, int8, bit); index-specific branches add their own types
|
||||
via the `INDEX_REGISTRY` in `bench.py`.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Built `dist/vec0` extension (run `make` from repo root)
|
||||
- Python 3.10+
|
||||
- `uv` (for seed data prep): `pip install uv`
|
||||
|
||||
## Quick start
|
||||
|
||||
```bash
|
||||
# 1. Download dataset and build seed DB (~3 GB download, ~5 min)
|
||||
make seed
|
||||
|
||||
# 2. Run a quick smoke test (5k vectors, ~1 min)
|
||||
make bench-smoke
|
||||
|
||||
# 3. Run full benchmark at 10k
|
||||
make bench-10k
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
### Direct invocation
|
||||
|
||||
```bash
|
||||
python bench.py --subset-size 10000 \
|
||||
"brute-float:type=baseline,variant=float" \
|
||||
"brute-int8:type=baseline,variant=int8" \
|
||||
"brute-bit:type=baseline,variant=bit"
|
||||
```
|
||||
|
||||
### Config format
|
||||
|
||||
`name:type=<index_type>,key=val,key=val`
|
||||
|
||||
| Index type | Keys | Branch |
|
||||
|-----------|------|--------|
|
||||
| `baseline` | `variant` (float/int8/bit), `oversample` | this branch |
|
||||
|
||||
Index branches register additional types in `INDEX_REGISTRY`. See the
|
||||
docstring in `bench.py` for the extension API.
|
||||
|
||||
### Make targets
|
||||
|
||||
| Target | Description |
|
||||
|--------|-------------|
|
||||
| `make seed` | Download COHERE 1M dataset |
|
||||
| `make ground-truth` | Pre-compute ground truth for 10k/50k/100k |
|
||||
| `make bench-smoke` | Quick 5k baseline test |
|
||||
| `make bench-10k` | All configs at 10k vectors |
|
||||
| `make bench-50k` | All configs at 50k vectors |
|
||||
| `make bench-100k` | All configs at 100k vectors |
|
||||
| `make bench-all` | 10k + 50k + 100k |
|
||||
|
||||
## Adding an index type
|
||||
|
||||
In your index branch, add an entry to `INDEX_REGISTRY` in `bench.py` and
|
||||
append your configs to `ALL_CONFIGS` in the `Makefile`. See the existing
|
||||
`baseline` entry and the comments in both files for the pattern.
|
||||
|
||||
## Results
|
||||
|
||||
Results are stored in `runs/<dir>/results.db` using the schema in `schema.sql`.
|
||||
|
||||
```bash
|
||||
sqlite3 runs/10k/results.db "
|
||||
SELECT config_name, recall, mean_ms, qps
|
||||
FROM bench_results
|
||||
ORDER BY recall DESC
|
||||
"
|
||||
```
|
||||
|
||||
## Dataset
|
||||
|
||||
[Zilliz COHERE Medium 1M](https://zilliz.com/learn/datasets-for-vector-database-benchmarks):
|
||||
768 dimensions, cosine distance, 1M train vectors + 10k query vectors with precomputed neighbors.
|
||||
Loading…
Add table
Add a link
Reference in a new issue