Commit graph

450 commits

Author SHA1 Message Date
Alex Garcia
5522e86cd2 Validate validity/rowids blob sizes in rescore KNN path
The rescore KNN loop read validity and rowids blobs from the chunks
iterator without checking their sizes matched chunk_size expectations.
A truncated or corrupt blob could cause OOB reads in bitmap_copy or
rowid array access. The flat KNN path already had these checks.
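
A minimal sketch of the kind of size check described above. It assumes the validity blob is a bitmap of chunk_size / 8 bytes and the rowids blob holds chunk_size 64-bit integers. The helper name `chunk_blobs_valid` is hypothetical and is not taken from the sqlite-vec source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when both blobs have exactly the sizes
 * implied by chunk_size, 0 otherwise. The layout (1 validity bit and one
 * int64 rowid per slot) is an assumption made for this sketch. */
static int chunk_blobs_valid(size_t chunk_size, size_t validity_bytes,
                             size_t rowids_bytes) {
    if (chunk_size == 0 || chunk_size % 8 != 0) return 0;
    if (validity_bytes != chunk_size / 8) return 0;
    if (rowids_bytes != chunk_size * sizeof(int64_t)) return 0;
    return 1;
}
```

A caller would run a check like this before any bitmap copy or rowid array access and return an error when it fails.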

Adds corruption tests: truncated rowids blob and truncated validity
blob both produce errors instead of crashes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 17:49:40 -07:00
Alex Garcia
f2c9fb8f08 Add text PK, WAL concurrency tests, and fix bench-smoke config
Infrastructure improvements:
- Fix benchmarks-ann Makefile: type=baseline -> type=vec0-flat (baseline
  was never a valid INDEX_REGISTRY key)
- Add DiskANN + text primary key test: insert, KNN, delete, KNN
- Add rescore + text primary key test: insert, KNN, delete, KNN
- Add WAL concurrency test: reader sees snapshot isolation while
  writer has an open transaction, KNN works on reader's snapshot

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 17:43:49 -07:00
Alex Garcia
d684178a12 Add AVX2-optimized Hamming distance using VPSHUFB popcount
Implements distance_hamming_avx2() which processes 32 bytes per
iteration using the standard VPSHUFB nibble-lookup popcount pattern.
Dispatched when SQLITE_VEC_ENABLE_AVX is defined and input >= 32
bytes. Falls back to u64 scalar or u8 byte-at-a-time for smaller
inputs.
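
The nibble-lookup idea can be illustrated with a scalar analogue. This sketch is not the AVX2 code from the commit: it applies the same 16-entry popcount table one byte at a time, whereas VPSHUFB applies it to 32 bytes in parallel. The names `NIBBLE_POPCOUNT` and `hamming_nibble_lookup` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Popcount of every 4-bit value; this is the table a VPSHUFB-based
 * implementation would hold in a vector register. */
static const uint8_t NIBBLE_POPCOUNT[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                            1, 2, 2, 3, 2, 3, 3, 4};

/* Scalar Hamming distance: XOR the inputs, then look up the bit count
 * of the low and high nibble of each resulting byte. */
static unsigned hamming_nibble_lookup(const uint8_t *a, const uint8_t *b,
                                      size_t n) {
    unsigned dist = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t x = (uint8_t)(a[i] ^ b[i]);
        dist += NIBBLE_POPCOUNT[x & 0x0F] + NIBBLE_POPCOUNT[x >> 4];
    }
    return dist;
}
```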

Also adds -mavx2 flag to Makefile for x86-64 targets alongside
existing -mavx.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 17:39:41 -07:00
Alex Garcia
d033bf5728 Add delete recall benchmark suite
New benchmarks-ann/bench-delete/ directory measures KNN recall
degradation after random row deletion across index types (flat,
rescore, IVF, DiskANN). For each config and delete percentage:
builds index, measures baseline recall, copies DB, deletes random
rows, measures post-delete recall, VACUUMs and records size savings.

Includes Makefile targets, self-contained smoke test with synthetic
data, and results DB for analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 17:13:40 -07:00
Alex Garcia
b00865429b Filter deleted nodes from DiskANN search results and add delete tests
DiskANN's delete repair only fixes forward edges (nodes the deleted
node pointed to). Stale reverse edges can cause deleted rowids to
appear in search results. Fix: track a 'confirmed' flag on each
search candidate, set when the full-precision vector is successfully
read during re-ranking. Only confirmed candidates are included in
output. Zero additional SQL queries — piggybacks on the existing
re-rank vector read.

Also adds delete hardening tests:
- Rescore: interleaved delete+KNN, rowid_in after deletes, full
  delete+reinsert cycle
- DiskANN: delete+reinsert cycles with KNN verification, interleaved
  delete+KNN

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 17:13:29 -07:00
Alex Garcia
2f4c2e4bdb Fix alignment UB in distance_hamming_u64
Casting unaligned blob pointers to u64* is undefined behavior on
strict-alignment architectures. Use memcpy to safely load u64 values
from potentially unaligned memory (compilers optimize this to native
loads on architectures that support unaligned access).
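
A minimal sketch of the memcpy-based load, assuming a byte-oriented input buffer. The function names are hypothetical, and the portable popcount loop stands in for whatever popcount the real code uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load a u64 from a possibly unaligned address without undefined behavior. */
static uint64_t load_u64(const uint8_t *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Portable popcount: clears the lowest set bit on each iteration. */
static unsigned popcount64(uint64_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Hamming distance over n_bytes, 8 bytes at a time, with a byte-wise tail. */
static unsigned hamming_u64(const uint8_t *a, const uint8_t *b, size_t n_bytes) {
    unsigned dist = 0;
    size_t i = 0;
    for (; i + 8 <= n_bytes; i += 8) {
        dist += popcount64(load_u64(a + i) ^ load_u64(b + i));
    }
    for (; i < n_bytes; i++) {
        dist += popcount64((uint64_t)(a[i] ^ b[i]));
    }
    return dist;
}
```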

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 14:57:01 -07:00
Alex Garcia
7de925be70 Fix int16 overflow in l2_sqr_int8_neon SIMD distance
vmulq_s16(diff, diff) produced int16 results, but diff can be up to
255 for int8 vectors (-128 vs 127), and 255^2 = 65025 overflows
int16 (max 32767). This caused NaN/wrong results for int8 vectors
with large differences.

Fix: use vmull_s16 (widening multiply) to produce int32 results
directly, avoiding the intermediate int16 overflow.
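
The arithmetic can be checked with a scalar sketch. It is not the NEON code: it does per element what a widening multiply does per lane, with the int16 difference multiplied into an int32 product. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Squared L2 distance between int8 vectors. The difference fits in int16
 * (range [-255, 255]), but its square can reach 65025, so the product is
 * computed in int32. */
static int32_t l2_sqr_int8_scalar(const int8_t *a, const int8_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int16_t diff = (int16_t)((int16_t)a[i] - (int16_t)b[i]);
        sum += (int32_t)diff * (int32_t)diff; /* widened before multiplying */
    }
    return sum;
}
```

For the extreme pair -128 and 127, the squared difference is 65025, which exceeds the int16 maximum of 32767.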

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 14:55:37 -07:00
Alex Garcia
4bee88384b Reject IVF binary quantizer when dimensions not divisible by 8
The binary quantizer uses D/8 for buffer sizes and memset, which
truncates for non-multiple-of-8 dimensions, causing OOB writes.
Rather than using ceiling division, enforce the constraint at
table creation time with a clear parse error.
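
A minimal sketch of such a constraint check, with a hypothetical helper name and return convention. Integer division shows the problem: 12 / 8 is 1, so a 12-dimension vector would get a 1-byte buffer while needing 2 bytes.

```c
#include <stddef.h>

/* Hypothetical validation: returns 0 and writes the packed byte count when
 * dimensions is a positive multiple of 8, otherwise returns -1 and leaves
 * *out_bytes untouched. */
static int binary_quantizer_bytes(size_t dimensions, size_t *out_bytes) {
    if (dimensions == 0 || dimensions % 8 != 0) return -1;
    *out_bytes = dimensions / 8;
    return 0;
}
```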

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 14:51:27 -07:00
Alex Garcia
5e4c557f93 Initialize rescore distance variable to FLT_MAX
The `dist` variable in rescore KNN quantized distance computation was
uninitialized. If the switch on quantizer_type or distance_metric
didn't match any case, the uninitialized value would propagate into
the top-k heap, potentially returning garbage results.
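
A sketch of the pattern, using a hypothetical metric enum and function. When no case matches, the FLT_MAX sentinel makes the candidate sort last in a min-distance top-k selection instead of carrying an indeterminate value.

```c
#include <float.h>

/* Hypothetical metric identifiers for this sketch. */
enum metric { METRIC_L2 = 0, METRIC_COSINE = 1 };

/* Picks a precomputed distance by metric. An unknown metric leaves the
 * FLT_MAX sentinel in place. */
static float quantized_distance(int metric, float l2, float cosine) {
    float dist = FLT_MAX;
    switch (metric) {
    case METRIC_L2:
        dist = l2;
        break;
    case METRIC_COSINE:
        dist = cosine;
        break;
    }
    return dist;
}
```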

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 14:35:55 -07:00
Alex Garcia
82f4eb08bf Add NULL checks after sqlite3_column_blob in rescore and DiskANN
sqlite3_column_blob() returns NULL for zero-length blobs or on OOM.
Several call sites in rescore KNN and DiskANN node/vector read passed
the result directly to memcpy without checking, risking NULL deref on
corrupt or empty databases. IVF already had proper NULL checks.

Adds corruption regression tests that truncate shadow table blobs and
verify the query errors cleanly instead of crashing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 14:31:49 -07:00
Alex Garcia
9df59b4c03 Temporarily block vector UPDATE for DiskANN and IVF indexes
vec0Update_UpdateVectorColumn writes to flat chunk blobs but does not
update DiskANN graph or IVF index structures, silently corrupting KNN
results. Now returns a clear error for these index types. Rescore
UPDATE is unaffected — it already has a full implementation that
updates both quantized chunks and float vectors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 14:08:08 -07:00
Alex Garcia
07f56e3cbe Fix #if SQLITE_VEC_ENABLE_RESCORE guards wrapping non-rescore logic
Six sites used #if SQLITE_VEC_ENABLE_RESCORE to guard _vector_chunks
skip logic that applies to ALL non-flat index types. When RESCORE was
compiled out, DiskANN and IVF columns would incorrectly access flat
chunk tables. Two sites also missed DiskANN in the skip enumeration,
which would break mixed flat+DiskANN tables.

Fix: replace all six compile-time guards with unconditional runtime
`!= VEC0_INDEX_TYPE_FLAT` checks. Also move rescore_on_delete inside
the !vec0_all_columns_diskann guard to prevent use of uninitialized
chunk_id/chunk_offset, and initialize those variables to 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 13:51:08 -07:00
Alex Garcia
3cfc2e0c1f Fix broken unzip -d line in vendor.sh
Remove stray incomplete `unzip -d` command that would error on CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 13:14:18 -07:00
Alex Garcia
85cf415397 Remove dead typedef macros and harmful u_int*_t redefinitions
The UINT32_TYPE/UINT16_TYPE/INT16_TYPE/UINT8_TYPE/INT8_TYPE/LONGDOUBLE_TYPE
macros (copied from sqlite3.c) were never used anywhere in sqlite-vec.
The u_int8_t/u_int16_t/u_int64_t typedefs redefined standard types using
BSD-only types despite <stdint.h> already being included. This broke builds
on musl/Alpine and under strict C99, and required reactive platform guards.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 12:57:06 -07:00
Alex Garcia
85973b3814 v0.1.10-alpha.1 2026-03-31 01:31:39 -07:00
Alex Garcia
8544081a67 Add comprehensive ANN benchmarking suite (#279)
Extend benchmarks-ann/ with results database (SQLite with per-query detail
and continuous writes), dataset subfolder organization, --subset-size and
--warmup options. Supports systematic comparison across flat, rescore, IVF,
and DiskANN index types.
2026-03-31 01:29:49 -07:00
Alex Garcia
a248ecd061 Fix DiskANN command dispatch when IVF is disabled
The command insert handler (used for runtime config like
search_list_size_search) was gated behind SQLITE_VEC_EXPERIMENTAL_IVF_ENABLE,
which defaults to 0. DiskANN commands were unreachable unless IVF was
also compiled in. Widen the guard to also activate when
SQLITE_VEC_ENABLE_DISKANN is set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 01:26:55 -07:00
Alex Garcia
1e3bb3e5e3 Implement DiskANN ANN index for vec0 virtual tables
Add DiskANN index for vec0 virtual table
2026-03-31 01:23:05 -07:00
Alex Garcia
fb81c011ff rm demo gha workflow 2026-03-31 01:21:54 -07:00
Alex Garcia
575371d751 Add DiskANN index for vec0 virtual table
Add DiskANN graph-based index: builds a Vamana graph with configurable R
(max degree) and L (search list size, separate for insert/query), supports
int8 quantization with rescore, lazy reverse-edge replacement, pre-quantized
query optimization, and insert buffer reuse. Includes shadow table management,
delete support, KNN integration, compile flag (SQLITE_VEC_ENABLE_DISKANN),
release-demo workflow, fuzz targets, and tests. Fixes rescore int8
quantization bug.
2026-03-31 01:21:54 -07:00
Alex Garcia
e2c38f387c Implement experimental new IVF ANN index for vec0 tables
Add new experimental IVF ANN index
2026-03-31 01:19:55 -07:00
Alex Garcia
bb3ef78f75 Hide IVF behind SQLITE_VEC_EXPERIMENTAL_IVF_ENABLE, default off
Rename SQLITE_VEC_ENABLE_IVF to SQLITE_VEC_EXPERIMENTAL_IVF_ENABLE and
flip the default from 1 to 0. IVF tests are automatically skipped when
the build flag is not set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 01:18:47 -07:00
Alex Garcia
3e26925ce0 rm ivf plan file 2026-03-31 01:18:47 -07:00
Alex Garcia
3358e127f6 Add IVF index for vec0 virtual table
Add inverted file (IVF) index type: partitions vectors into clusters via
k-means, quantizes to int8, and scans only the nearest nprobe partitions at
query time. Includes shadow table management, insert/delete, KNN integration,
compile flag (SQLITE_VEC_ENABLE_IVF), fuzz targets, and tests. Removes
superseded ivf-benchmarks/ directory.
2026-03-31 01:18:47 -07:00
Alex Garcia
43982c144b Merge pull request #276 from asg017/pr/rescore
Add a new `rescore` ANN index for `vec0` tables
2026-03-31 01:14:32 -07:00
Alex Garcia
45d1375602 Merge branch 'main' into pr/rescore 2026-03-31 01:12:50 -07:00
Alex Garcia
69ccb2405a Fix Android cross-compilation failing with unsupported '-mavx' flag
The Linux AVX auto-detection checked the host's /proc/cpuinfo, which
passes on x86 CI runners even when cross-compiling for Android ARM
targets. Skip AVX detection when CC contains 'android'.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 01:06:38 -07:00
Alex Garcia
0de765f457 Add ANN search support for vec0 virtual table (#273)
Add approximate nearest neighbor infrastructure to vec0: shared distance
dispatch (vec0_distance_full), flat index type with parser, NEON-optimized
cosine/Hamming for float32/int8, amalgamation script, and benchmark suite
(benchmarks-ann/) with ground-truth generation and profiling tools. Remove
unused vec_npy_each/vec_static_blobs code, fix missing stdint.h include.
2026-03-31 01:03:32 -07:00
Alex Garcia
e9f598abfa v0.1.9 2026-03-31 00:59:06 -07:00
Alex Garcia
6c3bf3669f v0.1.9-alpha.1 2026-03-30 16:41:16 -07:00
Alex Garcia
69f7b658e9 rm unnecessary TODO 2026-03-30 16:40:44 -07:00
Alex Garcia
ee9bd2ba4d Fix SQLITE_DONE leak in ClearMetadata that broke DELETE on long text metadata (#274) (#275)
vec0Update_Delete_ClearMetadata's long-text branch runs a DELETE via
sqlite3_step, which returns SQLITE_DONE (101) on success. The code
checked for failure but never normalized the success case to SQLITE_OK.
The function's epilogue returned SQLITE_DONE as-is, which the caller
(vec0Update_Delete) treated as an error, aborting the DELETE scan and
silently leaving rows behind.

- Normalize rc to SQLITE_OK after successful sqlite3_step in ClearMetadata
- Move sqlite3_finalize before the rc check (cleanup on all paths)
- Add test_delete_by_metadata_with_long_text regression test
- Update test_deletes snapshot (row 3 now correctly deleted)
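
The normalization step can be sketched without linking SQLite. The constants below are stand-ins that mirror the numeric values of SQLITE_OK (0), SQLITE_ERROR (1), and SQLITE_DONE (101). The helper name is hypothetical.

```c
/* Stand-ins mirroring SQLite's result codes, defined locally so that the
 * sketch compiles without sqlite3.h. */
enum { RC_OK = 0, RC_ERROR = 1, RC_DONE = 101 };

/* Hypothetical helper: a statement that ran to completion reports RC_DONE,
 * which callers expecting RC_OK would otherwise treat as a failure. */
static int normalize_step_rc(int step_rc) {
    return step_rc == RC_DONE ? RC_OK : step_rc;
}
```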

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 16:39:59 -07:00
Alex Garcia
ba0db0b6d6 Add rescore index for ANN queries
Add rescore index type: stores full-precision float vectors in a rowid-keyed
shadow table, quantizes to int8 for fast initial scan, then rescores top
candidates with original vectors. Includes config parser, shadow table
management, insert/delete support, KNN integration, compile flag
(SQLITE_VEC_ENABLE_RESCORE), fuzz targets, and tests.
2026-03-29 19:45:54 -07:00
Alex Garcia
bf2455f2ba Add ANN search support for vec0 virtual table
Add approximate nearest neighbor infrastructure to vec0: shared distance
dispatch (vec0_distance_full), flat index type with parser, NEON-optimized
cosine/Hamming for float32/int8, amalgamation script, and benchmark suite
(benchmarks-ann/) with ground-truth generation and profiling tools. Remove
unused vec_npy_each/vec_static_blobs code, fix missing stdint.h include.
2026-03-29 19:44:44 -07:00
Alex Garcia
dfd8dc5290 v0.1.8 2026-03-29 19:02:44 -07:00
Alex Garcia
e7ae41b761 v0.1.8-alpha.1 2026-03-20 21:12:02 -07:00
Alex Garcia
a8d81cb235 bump sqlite-dist 2026-03-20 21:11:08 -07:00
Alex Garcia
633eecf506 v0.1.7 2026-03-17 00:25:43 -07:00
Alex Garcia
4138619e3f v0.1.7-alpha.13 2026-03-17 00:19:48 -07:00
Alex Garcia
7669f69c8d v0.1.7-alpha.12 2026-03-17 00:09:48 -07:00
Alex Garcia
380b0bb032 Redact version from info table snapshot to avoid test failures on version bumps
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 00:09:28 -07:00
Alex Garcia
41cecbadc3 v0.1.7-alpha.11 2026-03-17 00:05:08 -07:00
Alex Garcia
cb147c8834 Complete vec0 DELETE: zero data, reclaim empty chunks, fix metadata rc bug (#268)
When a row is deleted from a vec0 virtual table, the rowid slot in
_chunks.rowids and vector data in _vector_chunksNN.vectors are now
zeroed out (previously left as stale data, tracked in #54). When all
rows in a chunk are deleted (validity bitmap all zeros), the chunk and
its associated vector/metadata shadow table rows are reclaimed.

- Add vec0Update_Delete_ClearRowid to zero the rowid blob slot
- Add vec0Update_Delete_ClearVectors to zero all vector blob slots
- Add vec0Update_Delete_DeleteChunkIfEmpty to detect and delete
  fully-empty chunks from _chunks, _vector_chunksNN, _metadatachunksNN
- Fix missing rc check in ClearMetadata loop (bug: errors were silently
  ignored)
- Fix vec0_new_chunk to explicitly set _rowid_ on shadow table INSERTs
  (SHADOW_TABLE_ROWID_QUIRK: "rowid PRIMARY KEY" without INTEGER type
  is not a true rowid alias, causing blob_open failures after chunk
  delete+recreate cycles)
- Add 13 new tests covering rowid/vector zeroing, chunk reclamation,
  metadata/auxiliary/partition/text-PK/int8/bit variants, and
  page_count shrinkage verification
- Add vec0-delete-completeness fuzz target
- Update snapshots for new delete zeroing behavior

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-17 00:02:36 -07:00
Alex Garcia
5f4e5dd4dd fuzz-macos: mark as continue-on-error (best-effort)
Homebrew LLVM 18 runtime dylibs use typed allocation ABI symbols
(__ZnwmSt19__type_descriptor_t) not available in macOS 14's system
libc++, causing dyld to abort. Xcode clang doesn't ship libFuzzer.

Mark fuzz-macos as continue-on-error (same as fuzz-windows) so it
doesn't block CI. Linux fuzzing remains the primary bug detector.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 11:46:11 -07:00
Alex Garcia
d04e2aeda1 Fix remaining fuzzer issues: leaks and macOS SDK headers
sqlite-vec.c:
- vec0_free: add loops to free partition, auxiliary, and metadata
  column names (previously leaked on error paths)
- vec0_init: update pNew->numXxxColumns incrementally in the parse
  loop so vec0_free sees correct counts on early goto-error paths
  (previously the counts were only written after the loop, so vec0_free
  would loop 0 times and leak names allocated inside the loop)

fuzz.yaml:
- macOS: pass -isysroot $(xcrun --sdk macosx --show-sdk-path) so
  Xcode clang can find system headers (stdio.h, assert.h, etc.)
- Fix artifact upload paths: libFuzzer writes crash-*/leak-* to
  the cwd (repo root), not tests/fuzz/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 17:35:41 -08:00
Alex Garcia
e4b1e264b5 Fix macOS fuzz: use Xcode clang instead of Homebrew LLVM
Homebrew LLVM 18's ASAN/libFuzzer runtimes contain weak-def symbols
for typed allocation operators (__ZnwmSt19__type_descriptor_t) that
don't exist in macOS 14's system libc++. Since the symbol is embedded
in the pre-built runtime dylibs (not our code), link-time flags cannot
fix it.

Switch to Apple's Xcode clang which ships its own libFuzzer and ASAN
runtime built against the system libc++ — no ABI mismatch possible.
No Homebrew LLVM install needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:44:39 -08:00
Alex Garcia
b93a669224 Fix macOS fuzz: explicitly link LLVM libc++ to avoid weak-def symbol error
The fuzz targets were crashing on macOS 14 with:
  dyld: weak-def symbol not found '__ZnwmSt19__type_descriptor_t'

libFuzzer compiled with LLVM 18 uses typed allocation ABI symbols
not present in macOS 14's system libc++. Since DYLD_LIBRARY_PATH
cannot override SIP-protected /usr/lib/libc++.1.dylib at runtime,
we fix this at link time:
- -nostdlib++: suppress implicit system libc++ linking
- -L$LLVM/lib/c++ -lc++: explicitly link LLVM's libc++ (which has the symbol)
- -Wl,-rpath,$LLVM/lib/c++: embed rpath so dyld finds LLVM's libc++ at runtime

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:06:36 -08:00
Alex Garcia
1b53b942e0 Fix remaining fuzzer issues: leaks, UBSAN NaN, macOS LLVM version
- fuzz.yaml: switch macOS to llvm@18 (latest LLVM uses typed allocation
  C++ ABI symbols not available on macOS 14 runner's system libc++)
- sqlite-vec.c: fix NaN input in vec_quantize_int8 by using !(val <= X)
  comparisons which evaluate to true for NaN, ensuring the clamp fires
- sqlite-vec.c: free pzErrMsg in vec_eachFilter error path (was leaking
  the error string returned by vector_from_value)
- sqlite-vec.c: add sqlite3_free(pNew) to vec0_init error path; vec0_free
  frees the contents but not the struct itself, mirroring vec0Disconnect
- sqlite-vec.c: free knn_data in vec0Filter_knn cleanup when rc != SQLITE_OK;
  on error the cursor's knn_data field is never set so it would not be
  freed by the cursor teardown path
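
The NaN-safe comparison can be sketched as follows. The function name and the choice of 127 as the result for NaN are properties of this sketch, not confirmed details of vec_quantize_int8. Every ordered comparison with NaN is false, so `val > 127` would skip NaN, while `!(val <= 127)` catches it.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical clamp: the negated comparisons are true for out-of-range
 * values and also for NaN, so the cast below always sees a value in
 * [-128, 127]. */
static int8_t clamp_to_i8(float val) {
    if (!(val <= 127.0f)) val = 127.0f;   /* fires for val > 127 and for NaN */
    if (!(val >= -128.0f)) val = -128.0f; /* fires for val < -128 */
    return (int8_t)val;
}
```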

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 08:36:59 -08:00
Alex Garcia
8c976205dd gha: bump windows runners 2026-03-03 07:22:04 -08:00
Alex Garcia
cdbc34785f Fix fuzzer-found bugs and CI build issues
- fuzz.yaml: embed rpath to Homebrew LLVM's libc++ so macOS binaries can
  find the right C++ runtime at load time (fixes dyld weak-def crash)
- fuzz.yaml: add `make sqlite-vec.h` step on all platforms before building
  fuzz targets (the header is generated from a template, not checked in)
- fuzz.yaml: drop llvm version pin on Windows so choco succeeds when a
  newer LLVM is already installed on the runner
- sqlite-vec.c: change fvec_cleanup / fvec_cleanup_noop to take void*
  instead of f32* so they are ABI-compatible with vector_cleanup; removes
  UBSAN indirect-call errors at many call sites
- sqlite-vec.c: copy BLOB data into sqlite3_malloc'd buffer in
  fvec_from_value instead of aliasing the raw blob pointer, fixing UBSAN
  misaligned-load errors when SQLite hands us an unaligned blob
- sqlite-vec.c: guard npy_token_next string scan with ptr < end check
  before the closing-quote dereference (heap-buffer-overflow)
- sqlite-vec.c: clamp vec_quantize_int8 intermediate value to [-128, 127]
  before casting to i8 (UBSAN out-of-range conversion)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 07:16:33 -08:00