# TODO: `ann` base branch + consolidated benchmarks
## 1. Create `ann` branch with shared code
### 1.1 Branch setup
- [x] `git checkout -B ann origin/main`
- [x] Cherry-pick `624f998` (vec0_distance_full shared distance dispatch)
- [x] Cherry-pick stdint.h fix for test header
- [ ] Pull NEON cosine optimization from `ivf-yolo3` into shared code
  - Currently only in the ivf branch, but it is general-purpose (benefits all distance calculations)
  - Lives in `distance_cosine_float()`: ~57 lines of ARM NEON vectorized cosine
### 1.2 Benchmark infrastructure (`benchmarks-ann/`)
- [x] Seed data pipeline (`seed/Makefile`, `seed/build_base_db.py`)
- [x] Ground truth generator (`ground_truth.py`)
- [x] Results schema (`schema.sql`)
- [x] Benchmark runner with `INDEX_REGISTRY` extension point (`bench.py`)
  - Baseline configs (float, int8-rescore, bit-rescore) implemented
  - Index branches register their types via the `INDEX_REGISTRY` dict
- [x] Makefile with baseline targets
- [x] README
### 1.3 Rebase feature branches onto `ann`
- [x] Rebase `diskann-yolo2` onto `ann` (1 commit: DiskANN implementation)
- [x] Rebase `ivf-yolo3` onto `ann` (1 commit: IVF implementation)
- [x] Rebase `annoy-yolo2` onto `ann` (2 commits: Annoy implementation + schema fix)
- [x] Verify each branch has only its index-specific commits remaining
- [ ] Force-push all 4 branches to origin
---
## 2. Per-branch: register index type in benchmarks
Once rebased onto `ann`, each index branch should add the following to `benchmarks-ann/`:
### 2.1 Register in `bench.py`
Add an `INDEX_REGISTRY` entry. Each entry provides:
- `defaults` — default param values
- `create_table_sql(params)` — CREATE VIRTUAL TABLE with INDEXED BY clause
- `insert_sql(params)` — custom insert SQL, or None for default
- `post_insert_hook(conn, params)` — training/building step, returns time
- `run_query(conn, params, query, k)` — custom query, or None for default MATCH
- `describe(params)` — one-line description for report output
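
A minimal sketch of what one entry might look like, assuming each entry is a plain dict keyed by the field names above. The index name `myindex`, its `n_trees` parameter, the table name, and the SQL text are hypothetical placeholders, not the syntax of any real index. Setting `insert_sql` and `run_query` to `None` to request the default behaviour is also an assumption about how `bench.py` reads the entry.

```python
import time

# Stand-in for the registry; the real dict already lives in bench.py.
INDEX_REGISTRY = {}


def _create_table_sql(params):
    # Placeholder SQL: table name, column, and INDEXED BY options are illustrative only.
    return (
        "CREATE VIRTUAL TABLE vec_items USING vec0("
        f"embedding float[{params['dimensions']}] "
        f"INDEXED BY myindex(n_trees={params['n_trees']})"
        ")"
    )


def _post_insert_hook(conn, params):
    # Training/building step. This sketch does no work and just returns
    # the elapsed wall-clock time in seconds.
    start = time.perf_counter()
    # A real index would trigger its build here, e.g. via conn.execute(...).
    return time.perf_counter() - start


INDEX_REGISTRY["myindex"] = {
    "defaults": {"dimensions": 768, "n_trees": 10},
    "create_table_sql": _create_table_sql,
    "insert_sql": None,   # assumed to mean "use the default insert"
    "post_insert_hook": _post_insert_hook,
    "run_query": None,    # assumed to mean "use the default MATCH query"
    "describe": lambda params: f"myindex n_trees={params['n_trees']}",
}
```

Keeping each entry a dict of callables means the runner only needs to look up the type name and call the hooks, so an index branch can register itself without touching the baseline code paths.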
### 2.2 Add configs to `Makefile`
Append index-specific config variables and targets. Example pattern:
```makefile
DISKANN_CONFIGS = \
  "diskann-R48-binary:type=diskann,R=48,L=128,quantizer=binary" \
  ...
ALL_CONFIGS += $(DISKANN_CONFIGS)

bench-diskann: seed
	$(BENCH) --subset-size 10000 -k 10 -o runs/diskann $(BASELINES) $(DISKANN_CONFIGS)
...
```
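
The config strings above appear to follow a `name:key=value,key=value` shape. As a rough illustration of that shape only, the sketch below splits one string into a name and a parameter dict. The `parse_config` helper and its integer-conversion rule are assumptions for this example; how `bench.py` actually parses its arguments may differ.

```python
def parse_config(spec):
    """Split 'name:key=value,key=value' into (name, params)."""
    name, _, body = spec.partition(":")
    params = {}
    for pair in body.split(","):
        if not pair:
            continue  # tolerate an empty body or a trailing comma
        key, _, value = pair.partition("=")
        # Purely numeric values become ints; everything else stays a string.
        params[key] = int(value) if value.isdigit() else value
    return name, params


name, params = parse_config(
    "diskann-R48-binary:type=diskann,R=48,L=128,quantizer=binary"
)
print(name)    # diskann-R48-binary
print(params)  # {'type': 'diskann', 'R': 48, 'L': 128, 'quantizer': 'binary'}
```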
### 2.3 Migrate existing benchmark results/docs
- Move useful results docs (RESULTS.md, etc.) into `benchmarks-ann/results/`
- Delete redundant per-branch benchmark directories once consolidated infra is proven
---
## 3. Future improvements
- [ ] Reporting script (`report.py`) — query results.db, produce markdown comparison tables
- [ ] Profiling targets in Makefile (lift from ivf-yolo3's Instruments/perf wrappers)
- [ ] Pre-computed ground truth integration (use GT DB files instead of on-the-fly brute-force)
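
For the reporting item, one possible starting point is sketched below. It takes the query from the caller instead of assuming anything about the tables in `schema.sql`, and renders whatever columns come back as a markdown table. The function names `markdown_table` and `report` are hypothetical.

```python
import sqlite3


def markdown_table(columns, rows):
    """Render column names and row tuples as a markdown table string."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join([header, divider, *body])


def report(db_path, query):
    """Run a caller-supplied query against db_path and return a markdown table."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(query)
        # cursor.description holds one 7-tuple per column; item 0 is the name.
        columns = [d[0] for d in cur.description]
        return markdown_table(columns, cur.fetchall())
    finally:
        conn.close()
```

A comparison table would then be a matter of calling `report()` with a `SELECT` that matches the real results schema.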