Metadata filtering (#124)

* initial pass at PARTITION KEY support.

* Initial pass, allow auxiliary columns on vec0 virtual tables

* update TODO

* Initial pass at metadata filtering

* unit tests

* gha this PR branch

* fixup tests

* doc internal

* fix tests, KNN/rowids in

* define SQLITE_INDEX_CONSTRAINT_OFFSET

* whoops

* update tests, syrupy, use uv

* un ignore pyproject.toml

* dot

* tests/

* type error?

* win: .exe, update error name

* try fix macos python, paren around expr?

* win bash?

* dbg :(

* explicit error

* op

* dbg win

* win ./tests/.venv/Scripts/python.exe

* block UPDATEs on partition key values for now

* test this branch

* accidentally removed "partition key type mismatch" block during merge

* typo ugh

* bruv

* start aux snapshots

* drop aux shadow table on destroy

* enforce column types

* block WHERE constraints on auxiliary columns in KNN queries

* support delete

* support UPDATE on auxiliary columns

* test this PR

* dont inline that

* test-metadata.py

* memzero text buffer

* stress test

* more snapshot tests

* rm double/int32, just float/int64

* finish type checking

* long text support

* DELETE support

* UPDATE support

* fix snapshot names

* drop not-used in eqp

* small fixes

* boolean comparison handling

* ensure error is raised when long string constraint

* new version string for beta builds

* typo whoops

* ann-filtering-benchmark directory

* test-case

* updates

* fix aux column error when using non-default rowid values, needs test

* refactor some text knn filtering

* rowids blob read only on text metadata filters

* refactor

* add failing test causes for non eq text knn

* text knn NE

* test cases diff

* GT

* text knn GT/GE fixes

* text knn LT/LE

* clean

* vtab_in handling

* unblock aux failures for now

* guard sqlite3_vtab_in

* else in guard?

* fixes and tests

* add broken shadow table test

* rename _metadata_chunksNN shadow table to _metadatachunksNN, for proper shadowName detection

* _metadata_text_NN shadow tables to _metadatatextNN

* SQLITE_VEC_VERSION_MAJOR SQLITE_VEC_VERSION_MINOR and SQLITE_VEC_VERSION_PATCH in sqlite-vec.h

* _info shadow table

* forgot to update aux snapshot?

* fix aux tests
Alex Garcia 2024-11-20 00:59:34 -08:00 committed by GitHub
parent 9bfeaa7842
commit 352f953fc0
GPG key ID: B5690EEEBB952194
21 changed files with 7361 additions and 105 deletions


@ -5,6 +5,7 @@ on:
- main
- partition-by
- auxiliary
- metadata-filtering
permissions:
contents: read
jobs:

5
.gitignore vendored

@ -26,3 +26,8 @@ sqlite-vec.h
tmp/
poetry.lock
*.jsonl
memstat.c
memstat.*


@ -1,5 +1,51 @@
# `sqlite-vec` Architecture
Internal documentation for how `sqlite-vec` works under the hood. It is not meant
for users of the `sqlite-vec` project; consult
[the official `sqlite-vec` documentation](https://alexgarcia.xyz/sqlite-vec) for
how-to guides. Rather, it is for people interested in how `sqlite-vec` works,
along with some guidelines for future contributors.
Very much a WIP.
## `vec0`
### Shadow Tables
#### `xyz_chunks`
- `chunk_id INTEGER`
- `size INTEGER`
- `validity BLOB`
- `rowids BLOB`
#### `xyz_rowids`
- `rowid INTEGER`
- `id`
- `chunk_id INTEGER`
- `chunk_offset INTEGER`
#### `xyz_vector_chunksNN`
- `rowid INTEGER`
- `vectors BLOB`
#### `xyz_auxiliary`
- `rowid INTEGER`
- `valueNN [type]`
#### `xyz_metadatachunksNN`
- `rowid INTEGER`
- `data BLOB`
#### `xyz_metadatatextNN`
- `rowid INTEGER`
- `data TEXT`
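
As a rough illustration of the naming scheme above, the following Python sketch lists the shadow tables one might expect for a given `vec0` table. The helper `shadow_table_names` and its parameters are hypothetical and are not part of `sqlite-vec`. It assumes that `NN` is a zero-padded two-digit column index (as in `v_vector_chunks00` in the test snapshots), that the caller knows which metadata columns have a `_metadatatextNN` table, and it also includes the `_info` shadow table added in this commit.

```python
from typing import Iterable, List


def shadow_table_names(
    table: str,
    num_vector_columns: int,
    num_metadata_columns: int = 0,
    text_metadata_indexes: Iterable[int] = (),
    has_auxiliary: bool = False,
) -> List[str]:
    """Hypothetical helper: expected shadow table names for a vec0 table."""
    names = [f"{table}_chunks", f"{table}_rowids", f"{table}_info"]
    # One vector chunk table per vector column, suffixed with a 2-digit index.
    names += [f"{table}_vector_chunks{i:02d}" for i in range(num_vector_columns)]
    if has_auxiliary:
        names.append(f"{table}_auxiliary")
    # One metadata chunk table per metadata column.
    names += [f"{table}_metadatachunks{i:02d}" for i in range(num_metadata_columns)]
    # Assumption: the caller says which metadata columns have a text table.
    names += [f"{table}_metadatatext{i:02d}" for i in sorted(text_metadata_indexes)]
    return names


if __name__ == "__main__":
    for name in shadow_table_names("v", 1, 1, [0], has_auxiliary=True):
        print(name)
```

For a table `v` with one vector column, one text metadata column and one auxiliary column, this yields the same table names that appear in the `test_shadow` snapshot.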
### idxStr
The `vec0` idxStr is a string composed of a single "header" character and 0 or
@ -14,8 +60,11 @@ The "header" character denotes the type of query plan, as determined by the
| `VEC0_QUERY_PLAN_POINT` | `'2'` | Perform a single-lookup point query for the provided rowid |
| `VEC0_QUERY_PLAN_KNN` | `'3'` | Perform a KNN-style query on the provided query vector and parameters. |
Each 4-character "block" is associated with a corresponding value in `argv[]`. For example, the 1st block at byte offset `1-4` (inclusive) is the 1st block and is associated with `argv[1]`. The 2nd block at byte offset `5-8` (inclusive) is associated with `argv[2]` and so on. Each block describes what kind of value or filter the given `argv[i]` value is.
Each 4-character "block" is associated with a corresponding value in `argv[]`.
For example, the 1st block, at byte offsets `1-4` (inclusive), is associated
with `argv[1]`. The 2nd block, at byte offsets `5-8` (inclusive), is associated
with `argv[2]`, and so on. Each block describes what kind of value or filter
the given `argv[i]` value is.
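
The offset arithmetic can be sketched in Python as follows. This is only an illustration, not code from `sqlite-vec`: `block_span` and `block_for` are hypothetical helper names, and the block numbering follows the text above (block 1 at byte offsets `1-4`, block 2 at `5-8`).

```python
from typing import Tuple


def block_span(i: int) -> Tuple[int, int]:
    """Inclusive byte offsets of the i-th 4-character block (i starts at 1)."""
    if i < 1:
        raise ValueError("blocks are numbered starting at 1")
    start = 1 + 4 * (i - 1)  # offset 0 is the single "header" character
    return start, start + 3


def block_for(idx_str: str, i: int) -> str:
    """Return the 4-character block at position i of an idxStr."""
    start, end = block_span(i)
    return idx_str[start : end + 1]


if __name__ == "__main__":
    print(block_span(1), block_span(2))
    print(block_for("3{___[___", 2))
```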
#### `VEC0_IDXSTR_KIND_KNN_MATCH` (`'{'`)
@ -31,8 +80,8 @@ The remaining 3 characters of the block are `_` fillers.
#### `VEC0_IDXSTR_KIND_KNN_ROWID_IN` (`'['`)
`argv[i]` is the optional `rowid in (...)` value, and must be handled with [`sqlite3_vtab_in_first()` /
`sqlite3_vtab_in_next()`](https://www.sqlite.org/c3ref/vtab_in_first.html).
`argv[i]` is the optional `rowid in (...)` value, and must be handled with
[`sqlite3_vtab_in_first()` / `sqlite3_vtab_in_next()`](https://www.sqlite.org/c3ref/vtab_in_first.html).
The remaining 3 characters of the block are `_` fillers.
@ -40,15 +89,34 @@ The remaining 3 characters of the block are `_` fillers.
`argv[i]` is a "constraint" on a specific partition key.
The second character of the block denotes which partition key to filter on, using `A` to denote the first partition key column, `B` for the second, etc. It is encoded with `'A' + partition_idx` and can be decoded with `c - 'A'`.
The second character of the block denotes which partition key to filter on,
using `A` to denote the first partition key column, `B` for the second, etc. It
is encoded with `'A' + partition_idx` and can be decoded with `c - 'A'`.
The third character of the block denotes which operator is used in the constraint. It will be one of the values of `enum vec0_partition_operator`, as only a subset of operations are supported on partition keys.
The third character of the block denotes which operator is used in the
constraint. It will be one of the values of `enum vec0_partition_operator`, as
only a subset of operations are supported on partition keys.
The fourth character of the block is a `_` filler.
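
The column-letter encoding described above is plain character arithmetic. A minimal Python sketch, using the hypothetical helper names `encode_column_index` and `decode_column_index` (the real implementation is in C and is not shown here):

```python
def encode_column_index(idx: int) -> str:
    """Encode a 0-based column index as 'A', 'B', ... ('A' + idx)."""
    if idx < 0:
        raise ValueError("column index must be non-negative")
    return chr(ord("A") + idx)


def decode_column_index(c: str) -> int:
    """Decode a column letter back to its 0-based index (c - 'A')."""
    return ord(c) - ord("A")


if __name__ == "__main__":
    print(encode_column_index(0), encode_column_index(1))
    print(decode_column_index("C"))
```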
#### `VEC0_IDXSTR_KIND_POINT_ID` (`'!'`)
`argv[i]` is the value of the rowid or id to match against for the point query.
The remaining 3 characters of the block are `_` fillers.
#### `VEC0_IDXSTR_KIND_METADATA_CONSTRAINT` (`'&'`)
`argv[i]` is the value of the `WHERE` constraint for a metadata column in a KNN
query.
The second character of the block denotes which metadata column the constraint
belongs to, using `A` to denote the first metadata column, `B` for the
second, etc. It is encoded with `'A' + metadata_idx` and can be decoded with
`c - 'A'`.
The third character of the block is the constraint operator. It will be one of
`enum vec0_metadata_operator`, as only a subset of operators are supported on
metadata column KNN filters.
The fourth character of the block is a `_` filler.
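
To make the block layout concrete, here is a small Python sketch that splits an idxStr into its header and 4-character blocks and decodes metadata constraint blocks. It is an illustration under stated assumptions, not the C implementation: `parse_idxstr` and `Block` are hypothetical names, only the kind characters listed in this document are mapped, and the operator character is returned as-is because the values of `enum vec0_metadata_operator` are not listed here (the `x` in the usage line is a placeholder, not a real operator).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Kind characters documented in this file; other kinds are left unnamed.
KIND_NAMES = {
    "{": "VEC0_IDXSTR_KIND_KNN_MATCH",
    "[": "VEC0_IDXSTR_KIND_KNN_ROWID_IN",
    "!": "VEC0_IDXSTR_KIND_POINT_ID",
    "&": "VEC0_IDXSTR_KIND_METADATA_CONSTRAINT",
}


@dataclass
class Block:
    kind: str                    # first character of the block
    kind_name: Optional[str]     # symbolic name, if documented above
    metadata_idx: Optional[int]  # only set for metadata constraint blocks
    operator: Optional[str]      # raw operator character, left undecoded


def parse_idxstr(idx_str: str) -> Tuple[str, List[Block]]:
    """Split an idxStr into (header character, list of decoded blocks)."""
    if not idx_str or (len(idx_str) - 1) % 4 != 0:
        raise ValueError("idxStr must be 1 header character plus 4-character blocks")
    header = idx_str[0]
    blocks = []
    for start in range(1, len(idx_str), 4):
        raw = idx_str[start : start + 4]
        kind = raw[0]
        if kind == "&":
            # 2nd character: 'A' + metadata_idx; 3rd character: operator.
            blocks.append(Block(kind, KIND_NAMES[kind], ord(raw[1]) - ord("A"), raw[2]))
        else:
            blocks.append(Block(kind, KIND_NAMES.get(kind), None, None))
    return header, blocks


if __name__ == "__main__":
    header, blocks = parse_idxstr("3{___&Bx_")  # 'x' is a placeholder operator
    print(header, blocks)
```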


@ -153,6 +153,9 @@ sqlite-vec.h: sqlite-vec.h.tmpl VERSION
VERSION=$(shell cat VERSION) \
DATE=$(shell date -r VERSION +'%FT%TZ%z') \
SOURCE=$(shell git log -n 1 --pretty=format:%H -- VERSION) \
VERSION_MAJOR=$$(echo $$VERSION | cut -d. -f1) \
VERSION_MINOR=$$(echo $$VERSION | cut -d. -f2) \
VERSION_PATCH=$$(echo $$VERSION | cut -d. -f3 | cut -d- -f1) \
envsubst < $< > $@
clean:

28
TODO

@ -1,13 +1,17 @@
# partition
- [ ] add `xyz_info` shadow table with version etc.
- [ ] UPDATE on partition key values
- remove previous row from chunk, insert into new one?
- [ ] properly sqlite3_vtab_nochange / sqlite3_value_nochange handling
# auxiliary columns
- later:
- NOT NULL?
- perf: INSERT stmt should be cached on vec0_vtab
- perf: LEFT JOIN aux table to rowids query in vec0_cursor for rowid/point
stmts, to avoid N lookup queries
- later
- [ ] partition: UPDATE support
- [ ] skip invalid validity entries in knn filter?
- [ ] nulls in metadata
- [ ] partition `x in (...)` handling
- [ ] blobs/date/datetime
- [ ] uuid/ulid perf
- [ ] Aux columns: `NOT NULL` constraint
- [ ] Metadata columns: `NOT NULL` constraint
- [ ] Partition key: `NOT NULL` constraint
- [ ] dictionary encoding?
- [ ] properly sqlite3_vtab_nochange / sqlite3_value_nochange handling
- [ ] perf
- [ ] aux: cache INSERT
- [ ] aux: LEFT JOIN on `_rowids` queries to avoid N lookup queries

File diff suppressed because it is too large


@ -18,9 +18,16 @@
#endif
#define SQLITE_VEC_VERSION "v${VERSION}"
// TODO rm
#define SQLITE_VEC_VERSION "v-metadata-experiment.01"
#define SQLITE_VEC_DATE "${DATE}"
#define SQLITE_VEC_SOURCE "${SOURCE}"
#define SQLITE_VEC_VERSION_MAJOR ${VERSION_MAJOR}
#define SQLITE_VEC_VERSION_MINOR ${VERSION_MINOR}
#define SQLITE_VEC_VERSION_PATCH ${VERSION_PATCH}
#ifdef __cplusplus
extern "C" {
#endif

327
test.sql

@ -1,10 +1,333 @@
.load dist/vec0
.echo on
.load dist/vec0main
.bail on
.mode qbox
.load ./memstat
.echo on
select name, value from sqlite_memstat where name = 'MEMORY_USED';
create virtual table v using vec0(
vector float[1],
name1 text,
name2 text,
age int,
chunk_size=8
);
select name, value from sqlite_memstat where name = 'MEMORY_USED';
insert into v(vector, name1, name2, age) values
('[1]', 'alex', 'xxxx', 1),
('[2]', 'alex', 'aaaa', 2),
('[3]', 'alex', 'aaaa', 3),
('[4]', 'brian', 'aaaa', 1),
('[5]', 'brian', 'aaaa', 2),
('[6]', 'brian', 'aaaa', 3),
('[7]', 'craig', 'aaaa', 1),
('[8]', 'craig', 'xxxx', 2),
('[9]', 'craig', 'xxxx', 3),
('[10]', '123456789012345', 'xxxx', 3);
select name, value from sqlite_memstat where name = 'MEMORY_USED';
select rowid, name1, name2, age, vec_to_json(vector)
from v
where vector match '[0]'
and k = 5
and name1 in ('alex', 'brian', 'craig')
--and name2 in ('aaaa', 'xxxx')
and age in (1, 2, 3, 2222,3333,4444);
select name, value from sqlite_memstat where name = 'MEMORY_USED';
select rowid, name1, name2, age, vec_to_json(vector)
from v
where vector match '[0]'
and k = 5
and name1 in ('123456789012345', 'superfluous');
.exit
create virtual table v using vec0(
vector float[1],
+description text
);
insert into v(rowid, vector, description) values (1, '[1]', 'aaa');
select * from v;
.exit
create virtual table vec_articles using vec0(
article_id integer primary key,
year integer partition key,
headline_embedding float[1],
+headline text,
+url text,
word_count integer,
print_section text,
print_page integer,
pub_date text,
);
insert into vec_articles values (1111, 2020, '[1]', 'headline', 'https://...', 200, 'A', 1, '2020-01-01');
select * from vec_articles;
.exit
create table movies(movie_id integer primary key, synopsis text);
INSERT INTO movies(movie_id, synopsis)
VALUES
(1, 'A family is haunted by demonic spirits after moving into a new house, requiring the help of paranormal investigators.'),
(2, 'Two dim-witted friends embark on a cross-country road trip to return a briefcase full of money to its owner.'),
(3, 'A team of explorers travels through a wormhole in space in an attempt to ensure humanitys survival.'),
(4, 'A young hobbit embarks on a journey with a fellowship to destroy a powerful ring and save Middle-earth from darkness.'),
(5, 'A documentary about the dangers of global warming, featuring former U.S. Vice President Al Gore.'),
(6, 'After the death of her secretive mother, a woman discovers terrifying secrets about her family lineage.'),
(7, 'A clueless but charismatic TV anchorman struggles to stay relevant in the world of broadcast journalism.'),
(8, 'A young blade runner uncovers a long-buried secret that leads him to track down former blade runner Rick Deckard.'),
(9, 'A young boy discovers he is a wizard and attends a magical school, where he learns about his destiny.'),
(10, 'A rock climber attempts to scale El Capitan in Yosemite National Park without the use of ropes or safety gear.'),
(11, 'A young African-American man uncovers a disturbing secret when he visits his white girlfriend''s family estate.'),
(12, 'Three friends wake up from a bachelor party in Las Vegas with no memory of the previous night and must retrace their steps.'),
(13, 'A computer hacker learns about the true nature of his reality and his role in the war against its controllers.'),
(14, 'In post-Civil War Spain, a young girl escapes into an eerie but captivating fantasy world.'),
(15, 'A documentary that explores racial inequality in the United States, focusing on the prison system and mass incarceration.'),
(16, 'A young woman is followed by an unknown supernatural force after a sexual encounter.'),
(17, 'Two immature but well-meaning stepbrothers become instant rivals when their single parents marry.'),
(18, 'A thief with the ability to enter people''s dreams is tasked with planting an idea into a target''s subconscious.'),
(19, 'A mute woman forms a unique relationship with a mysterious aquatic creature being held in a secret research facility.'),
(20, 'A documentary about the life and legacy of Fred Rogers, the beloved host of the children''s TV show "Mister Rogers'' Neighborhood."');
create virtual table vec_movies using vec0(
movie_id integer primary key,
synopsis_embedding float[1],
+title text,
genre text,
num_reviews int,
mean_rating float,
chunk_size=8
);
.schema
/*
insert into vec_movies(movie_id, synopsis_embedding, num_reviews, mean_rating) values
(1, '[1]', 153, 4.6),
(2, '[2]', 382, 2.6),
(3, '[3]', 53, 5.0),
(4, '[4]', 210, 4.2),
(5, '[5]', 93, 3.4),
(6, '[6]', 167, 4.7),
(7, '[7]', 482, 2.9),
(8, '[8]', 301, 5.0),
(9, '[9]', 134, 4.1),
(10, '[10]', 66, 3.2),
(11, '[11]', 88, 4.9),
(12, '[12]', 59, 2.8),
(13, '[13]', 423, 4.5),
(14, '[14]', 275, 3.6),
(15, '[15]', 191, 4.4),
(16, '[16]', 314, 4.3),
(17, '[17]', 74, 3.0),
(18, '[18]', 201, 5.0),
(19, '[19]', 399, 2.7),
(20, '[20]', 186, 4.8);
*/
/*
INSERT INTO vec_movies(movie_id, synopsis_embedding, genre, num_reviews, mean_rating)
VALUES
(1, '[1]', 'horror', 153, 4.6),
(2, '[2]', 'comedy', 382, 2.6),
(3, '[3]', 'scifi', 53, 5.0),
(4, '[4]', 'fantasy', 210, 4.2),
(5, '[5]', 'documentary', 93, 3.4),
(6, '[6]', 'horror', 167, 4.7),
(7, '[7]', 'comedy', 482, 2.9),
(8, '[8]', 'scifi', 301, 5.0),
(9, '[9]', 'fantasy', 134, 4.1),
(10, '[10]', 'documentary', 66, 3.2),
(11, '[11]', 'horror', 88, 4.9),
(12, '[12]', 'comedy', 59, 2.8),
(13, '[13]', 'scifi', 423, 4.5),
(14, '[14]', 'fantasy', 275, 3.6),
(15, '[15]', 'documentary', 191, 4.4),
(16, '[16]', 'horror', 314, 4.3),
(17, '[17]', 'comedy', 74, 3.0),
(18, '[18]', 'scifi', 201, 5.0),
(19, '[19]', 'fantasy', 399, 2.7),
(20, '[20]', 'documentary', 186, 4.8);
*/
INSERT INTO vec_movies(movie_id, synopsis_embedding, genre, title, num_reviews, mean_rating)
VALUES
(1, '[1]', 'horror', 'The Conjuring', 153, 4.6),
(2, '[2]', 'comedy', 'Dumb and Dumber', 382, 2.6),
(3, '[3]', 'scifi', 'Interstellar', 53, 5.0),
(4, '[4]', 'fantasy', 'The Lord of the Rings: The Fellowship of the Ring', 210, 4.2),
(5, '[5]', 'documentary', 'An Inconvenient Truth', 93, 3.4),
(6, '[6]', 'horror', 'Hereditary', 167, 4.7),
(7, '[7]', 'comedy', 'Anchorman: The Legend of Ron Burgundy', 482, 2.9),
(8, '[8]', 'scifi', 'Blade Runner 2049', 301, 5.0),
(9, '[9]', 'fantasy', 'Harry Potter and the Sorcerer''s Stone', 134, 4.1),
(10, '[10]', 'documentary', 'Free Solo', 66, 3.2),
(11, '[11]', 'horror', 'Get Out', 88, 4.9),
(12, '[12]', 'comedy', 'The Hangover', 59, 2.8),
(13, '[13]', 'scifi', 'The Matrix', 423, 4.5),
(14, '[14]', 'fantasy', 'Pan''s Labyrinth', 275, 3.6),
(15, '[15]', 'documentary', '13th', 191, 4.4),
(16, '[16]', 'horror', 'It Follows', 314, 4.3),
(17, '[17]', 'comedy', 'Step Brothers', 74, 3.0),
(18, '[18]', 'scifi', 'Inception', 201, 5.0),
(19, '[19]', 'fantasy', 'The Shape of Water', 399, 2.7),
(20, '[20]', 'documentary', 'Won''t You Be My Neighbor?', 186, 4.8),
(21, '[21]', 'scifi', 'Gravity', 342, 4.0),
(22, '[22]', 'scifi', 'Dune', 451, 4.4),
(23, '[23]', 'scifi', 'The Martian', 522, 4.6),
(24, '[24]', 'horror', 'A Quiet Place', 271, 4.3),
(25, '[25]', 'fantasy', 'The Chronicles of Narnia: The Lion, the Witch and the Wardrobe', 310, 3.9);
--select * from vec_movies;
--select * from vec_movies_metadata_chunks00;
create virtual table vec_chunks using vec0(
user_id integer partition key,
+contents text,
contents_embedding float[1],
);
INSERT INTO vec_chunks (rowid, user_id, contents, contents_embedding) VALUES
(1, 123, 'Our PTO policy allows employees to take both vacation and sick leave as needed.', '[1]'),
(2, 123, 'Employees must provide notice at least two weeks in advance for planned vacations.', '[2]'),
(3, 123, 'Sick leave can be taken without advance notice, but employees must inform their manager.', '[3]'),
(4, 123, 'Unused PTO can be carried over to the following year, up to a maximum of 40 hours.', '[4]'),
(5, 123, 'PTO must be used in increments of at least 4 hours.', '[5]'),
(6, 456, 'New employees are granted 10 days of PTO during their first year of employment.', '[6]'),
(7, 456, 'After the first year, employees earn an additional day of PTO for each year of service.', '[7]'),
(8, 789, 'PTO requests will be reviewed by the HR department and are subject to approval.', '[8]'),
(9, 789, 'The company reserves the right to deny PTO requests during peak operational periods.', '[9]'),
(10, 456, 'If PTO is denied, the employee will be given an alternative time to take leave.', '[10]'),
(11, 789, 'Employees who are out of PTO must request unpaid leave for any additional time off.', '[11]'),
(12, 789, 'In case of a family emergency, employees can request emergency leave.', '[12]'),
(13, 456, 'Emergency leave may be granted for personal or family illness, or other critical situations.', '[13]'),
(14, 789, 'The maximum length of emergency leave is subject to company discretion.', '[14]'),
(15, 123, 'All PTO balances will be displayed on the employee self-service portal.', '[15]'),
(16, 456, 'Employees who are terminated will be paid for unused PTO, as per state law.', '[16]'),
(17, 123, 'Part-time employees are eligible for PTO on a pro-rata basis.', '[17]'),
(18, 789, 'The company encourages employees to use their PTO to maintain work-life balance.', '[18]'),
(19, 456, 'Employees should not book travel plans until their PTO request has been approved.', '[19]'),
(20, 123, 'Managers are responsible for tracking their team members'' PTO usage.', '[20]');
select rowid, user_id, contents, distance
from vec_chunks
where contents_embedding match '[19]'
and user_id = 123
and k = 5;
.exit
-- PARTITION KEY and auxiliary columns!
create virtual table vec_chunks using vec0(
-- internally shard the vector index by user
user_id integer partition key,
-- store the chunk text pre-embedding as an "auxiliary column"
+contents text,
contents_embedding float[1024],
);
select rowid, user_id, contents, distance
from vec_chunks
where contents_embedding match '[...]'
and user_id = 123
and k = 5;
/*
rowid user_id contents distance
20 123 'Managers are responsible for tracking their team members'' │ 1.0 │
PTO usage.' │ │
17 123 'Part-time employees are eligible for PTO on a pro-rata basi │ 2.0 │
s.' │ │
15 123 'All PTO balances will be displayed on the employee self-ser │ 4.0 │
vice portal.' │ │
5 123 'PTO must be used in increments of at least 4 hours.' 14.0
4 123 'Unused PTO can be carried over to the following year, up to │ 15.0 │
a maximum of 40 hours.' │ │
*/
-- metadata filters!
create virtual table vec_movies using vec0(
movie_id integer primary key,
synopsis_embedding float[1024],
genre text,
num_reviews int,
mean_rating float
);
select
movie_id,
title,
genre,
num_reviews,
mean_rating,
distance
from vec_movies
where synopsis_embedding match '[15.5]'
and genre = 'scifi'
and num_reviews between 100 and 500
and mean_rating > 3.5
and k = 5;
/*
movie_id title genre num_reviews mean_rating distance
13 'The Matrix' 'scifi' 423 4.5 2.5
18 'Inception' 'scifi' 201 5.0 2.5
21 'Gravity' 'scifi' 342 4.0 5.5
22 'Dune' 'scifi' 451 4.40000009536743 6.5
8 'Blade Runner 2049' 'scifi' 301 5.0 7.5
*/
.exit
create virtual table vec_movies using vec0(
movie_id integer primary key,
synopsis_embedding float[768],
genre text,
num_reviews int,
mean_rating float,
);
.exit
create virtual table vec_chunks using vec0(
chunk_id integer primary key,
contents_embedding float[1],


@ -316,7 +316,7 @@
'type': 'table',
'name': 'sqlite_sequence',
'tbl_name': 'sqlite_sequence',
'rootpage': 3,
'rootpage': 5,
'sql': 'CREATE TABLE sqlite_sequence(name,seq)',
}),
]),
@ -326,18 +326,25 @@
OrderedDict({
'sql': 'select * from sqlite_master order by name',
'rows': list([
OrderedDict({
'type': 'index',
'name': 'sqlite_autoindex_v_info_1',
'tbl_name': 'v_info',
'rootpage': 3,
'sql': None,
}),
OrderedDict({
'type': 'index',
'name': 'sqlite_autoindex_v_vector_chunks00_1',
'tbl_name': 'v_vector_chunks00',
'rootpage': 6,
'rootpage': 8,
'sql': None,
}),
OrderedDict({
'type': 'table',
'name': 'sqlite_sequence',
'tbl_name': 'sqlite_sequence',
'rootpage': 3,
'rootpage': 5,
'sql': 'CREATE TABLE sqlite_sequence(name,seq)',
}),
OrderedDict({
@ -351,28 +358,35 @@
'type': 'table',
'name': 'v_auxiliary',
'tbl_name': 'v_auxiliary',
'rootpage': 7,
'rootpage': 9,
'sql': 'CREATE TABLE "v_auxiliary"( rowid integer PRIMARY KEY , value00)',
}),
OrderedDict({
'type': 'table',
'name': 'v_chunks',
'tbl_name': 'v_chunks',
'rootpage': 2,
'rootpage': 4,
'sql': 'CREATE TABLE "v_chunks"(chunk_id INTEGER PRIMARY KEY AUTOINCREMENT,size INTEGER NOT NULL,validity BLOB NOT NULL,rowids BLOB NOT NULL)',
}),
OrderedDict({
'type': 'table',
'name': 'v_info',
'tbl_name': 'v_info',
'rootpage': 2,
'sql': 'CREATE TABLE "v_info" (key text primary key, value any)',
}),
OrderedDict({
'type': 'table',
'name': 'v_rowids',
'tbl_name': 'v_rowids',
'rootpage': 4,
'rootpage': 6,
'sql': 'CREATE TABLE "v_rowids"(rowid INTEGER PRIMARY KEY AUTOINCREMENT,id,chunk_id INTEGER,chunk_offset INTEGER)',
}),
OrderedDict({
'type': 'table',
'name': 'v_vector_chunks00',
'tbl_name': 'v_vector_chunks00',
'rootpage': 5,
'rootpage': 7,
'sql': 'CREATE TABLE "v_vector_chunks00"(rowid PRIMARY KEY,vectors BLOB NOT NULL)',
}),
]),
@ -409,25 +423,25 @@
# ---
# name: test_types.3
dict({
'error': 'OperationalError',
'error': 'IntegrityError',
'message': 'Auxiliary column type mismatch: The auxiliary column aux_int has type INTEGER, but TEXT was provided.',
})
# ---
# name: test_types.4
dict({
'error': 'OperationalError',
'error': 'IntegrityError',
'message': 'Auxiliary column type mismatch: The auxiliary column aux_float has type FLOAT, but TEXT was provided.',
})
# ---
# name: test_types.5
dict({
'error': 'OperationalError',
'error': 'IntegrityError',
'message': 'Auxiliary column type mismatch: The auxiliary column aux_text has type TEXT, but INTEGER was provided.',
})
# ---
# name: test_types.6
dict({
'error': 'OperationalError',
'error': 'IntegrityError',
'message': 'Auxiliary column type mismatch: The auxiliary column aux_blob has type BLOB, but INTEGER was provided.',
})
# ---


@ -0,0 +1,184 @@
# serializer version: 1
# name: test_info
OrderedDict({
'sql': 'select key, typeof(value) from v_info order by 1',
'rows': list([
OrderedDict({
'key': 'CREATE_VERSION',
'typeof(value)': 'text',
}),
OrderedDict({
'key': 'CREATE_VERSION_MAJOR',
'typeof(value)': 'integer',
}),
OrderedDict({
'key': 'CREATE_VERSION_MINOR',
'typeof(value)': 'integer',
}),
OrderedDict({
'key': 'CREATE_VERSION_PATCH',
'typeof(value)': 'integer',
}),
]),
})
# ---
# name: test_shadow
OrderedDict({
'sql': 'select * from sqlite_master order by name',
'rows': list([
OrderedDict({
'type': 'index',
'name': 'sqlite_autoindex_v_info_1',
'tbl_name': 'v_info',
'rootpage': 3,
'sql': None,
}),
OrderedDict({
'type': 'index',
'name': 'sqlite_autoindex_v_metadatachunks00_1',
'tbl_name': 'v_metadatachunks00',
'rootpage': 10,
'sql': None,
}),
OrderedDict({
'type': 'index',
'name': 'sqlite_autoindex_v_metadatatext00_1',
'tbl_name': 'v_metadatatext00',
'rootpage': 12,
'sql': None,
}),
OrderedDict({
'type': 'index',
'name': 'sqlite_autoindex_v_vector_chunks00_1',
'tbl_name': 'v_vector_chunks00',
'rootpage': 8,
'sql': None,
}),
OrderedDict({
'type': 'table',
'name': 'sqlite_sequence',
'tbl_name': 'sqlite_sequence',
'rootpage': 5,
'sql': 'CREATE TABLE sqlite_sequence(name,seq)',
}),
OrderedDict({
'type': 'table',
'name': 'v',
'tbl_name': 'v',
'rootpage': 0,
'sql': 'CREATE VIRTUAL TABLE v using vec0(a float[1], partition text partition key, metadata text, +name text, chunk_size=8)',
}),
OrderedDict({
'type': 'table',
'name': 'v_auxiliary',
'tbl_name': 'v_auxiliary',
'rootpage': 13,
'sql': 'CREATE TABLE "v_auxiliary"( rowid integer PRIMARY KEY , value00)',
}),
OrderedDict({
'type': 'table',
'name': 'v_chunks',
'tbl_name': 'v_chunks',
'rootpage': 4,
'sql': 'CREATE TABLE "v_chunks"(chunk_id INTEGER PRIMARY KEY AUTOINCREMENT,size INTEGER NOT NULL,sequence_id integer,partition00,validity BLOB NOT NULL, rowids BLOB NOT NULL)',
}),
OrderedDict({
'type': 'table',
'name': 'v_info',
'tbl_name': 'v_info',
'rootpage': 2,
'sql': 'CREATE TABLE "v_info" (key text primary key, value any)',
}),
OrderedDict({
'type': 'table',
'name': 'v_metadatachunks00',
'tbl_name': 'v_metadatachunks00',
'rootpage': 9,
'sql': 'CREATE TABLE "v_metadatachunks00"(rowid PRIMARY KEY, data BLOB NOT NULL)',
}),
OrderedDict({
'type': 'table',
'name': 'v_metadatatext00',
'tbl_name': 'v_metadatatext00',
'rootpage': 11,
'sql': 'CREATE TABLE "v_metadatatext00"(rowid PRIMARY KEY, data TEXT)',
}),
OrderedDict({
'type': 'table',
'name': 'v_rowids',
'tbl_name': 'v_rowids',
'rootpage': 6,
'sql': 'CREATE TABLE "v_rowids"(rowid INTEGER PRIMARY KEY AUTOINCREMENT,id,chunk_id INTEGER,chunk_offset INTEGER)',
}),
OrderedDict({
'type': 'table',
'name': 'v_vector_chunks00',
'tbl_name': 'v_vector_chunks00',
'rootpage': 7,
'sql': 'CREATE TABLE "v_vector_chunks00"(rowid PRIMARY KEY,vectors BLOB NOT NULL)',
}),
]),
})
# ---
# name: test_shadow.1
OrderedDict({
'sql': "select * from pragma_table_list where type = 'shadow'",
'rows': list([
OrderedDict({
'schema': 'main',
'name': 'v_auxiliary',
'type': 'shadow',
'ncol': 2,
'wr': 0,
'strict': 0,
}),
OrderedDict({
'schema': 'main',
'name': 'v_chunks',
'type': 'shadow',
'ncol': 6,
'wr': 0,
'strict': 0,
}),
OrderedDict({
'schema': 'main',
'name': 'v_info',
'type': 'shadow',
'ncol': 2,
'wr': 0,
'strict': 0,
}),
OrderedDict({
'schema': 'main',
'name': 'v_rowids',
'type': 'shadow',
'ncol': 4,
'wr': 0,
'strict': 0,
}),
OrderedDict({
'schema': 'main',
'name': 'v_metadatachunks00',
'type': 'shadow',
'ncol': 2,
'wr': 0,
'strict': 0,
}),
OrderedDict({
'schema': 'main',
'name': 'v_metadatatext00',
'type': 'shadow',
'ncol': 2,
'wr': 0,
'strict': 0,
}),
]),
})
# ---
# name: test_shadow.2
OrderedDict({
'sql': "select * from pragma_table_list where type = 'shadow'",
'rows': list([
]),
})
# ---

File diff suppressed because it is too large

1
tests/afbd/.gitignore vendored Normal file

@ -0,0 +1 @@
*.tgz


@ -0,0 +1 @@
3.12

9
tests/afbd/Makefile Normal file

@ -0,0 +1,9 @@
random_ints_1m.tgz:
curl -o $@ https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_ints_1m.tgz
random_float_1m.tgz:
curl -o $@ https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_float_1m.tgz
random_keywords_1m.tgz:
curl -o $@ https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_keywords_1m.tgz
all: random_ints_1m.tgz random_float_1m.tgz random_keywords_1m.tgz

12
tests/afbd/README.md Normal file

@ -0,0 +1,12 @@
# hnm
```
tar -xOzf hnm.tgz ./tests.jsonl > tests.jsonl
solite q "select group_concat(distinct key) from lines_read('tests.jsonl'), json_each(line -> '$.conditions.and[0]')"
```
```
> python test-afbd.py build hnm.tgz --metadata product_group_name,colour_group_name,index_group_name,perceived_colour_value_name,section_name,product_type_name,department_name,graphical_appearance_name,garment_group_name,perceived_colour_master_name
```

231
tests/afbd/test-afbd.py Normal file

@ -0,0 +1,231 @@
import numpy as np
from tqdm import tqdm
from deepdiff import DeepDiff
import tarfile
import json
from io import BytesIO
import sqlite3
from typing import List
from struct import pack
import time
from pathlib import Path
import argparse
def serialize_float32(vector: List[float]) -> bytes:
    """Serializes a list of floats into the "raw bytes" format sqlite-vec expects"""
    return pack("%sf" % len(vector), *vector)
def build_command(file_path, metadata_set=None):
    if metadata_set:
        metadata_set = set(metadata_set.split(","))

    file_path = Path(file_path)
    print(f"reading {file_path}...")
    t0 = time.time()
    # Initialize so the asserts below fail cleanly if an archive member is missing.
    payloads = tests = vectors = None
    with tarfile.open(file_path, "r:gz") as archive:
        for file in archive:
            if file.name == "./payloads.jsonl":
                payloads = [
                    json.loads(line)
                    for line in archive.extractfile(file.name).readlines()
                ]
            if file.name == "./tests.jsonl":
                tests = [
                    json.loads(line)
                    for line in archive.extractfile(file.name).readlines()
                ]
            if file.name == "./vectors.npy":
                f = BytesIO()
                f.write(archive.extractfile(file.name).read())
                f.seek(0)
                vectors = np.load(f)

    assert payloads is not None
    assert tests is not None
    assert vectors is not None
    dimensions = vectors.shape[1]
    metadata_columns = sorted(list(payloads[0].keys()))

    def col_type(v):
        if isinstance(v, int):
            return "integer"
        if isinstance(v, float):
            return "float"
        if isinstance(v, str):
            return "text"
        raise Exception(f"Unknown column type: {v}")

    metadata_columns_types = [col_type(payloads[0][col]) for col in metadata_columns]

    print(time.time() - t0)
    t0 = time.time()
    print("seeding...")

    db = sqlite3.connect(f"{file_path.stem}.db")
    db.execute("PRAGMA page_size = 16384")
    db.row_factory = sqlite3.Row
    db.enable_load_extension(True)
    db.load_extension("../../dist/vec0")
    db.enable_load_extension(False)

    with db:
        db.execute("create table tests(data)")
        for test in tests:
            db.execute("insert into tests values (?)", [json.dumps(test)])

    with db:
        create_sql = f"create virtual table v using vec0(vector float[{dimensions}] distance_metric=cosine"
        insert_sql = "insert into v(rowid, vector"
        for name, type in zip(metadata_columns, metadata_columns_types):
            if metadata_set:
                # Only the requested columns become metadata columns;
                # the rest are stored as auxiliary ("+") columns.
                if name in metadata_set:
                    create_sql += f", {name} {type}"
                else:
                    create_sql += f", +{name} {type}"
            else:
                create_sql += f", {name} {type}"

            insert_sql += f", {name}"
        create_sql += ")"
        insert_sql += ") values (" + ",".join("?" * (2 + len(metadata_columns))) + ")"
        print(create_sql)
        print(insert_sql)
        db.execute(create_sql)

        for idx, (payload, vector) in enumerate(
            tqdm(zip(payloads, vectors), total=len(payloads))
        ):
            params = [idx, vector]
            for c in metadata_columns:
                params.append(payload[c])
            db.execute(insert_sql, params)

    print(time.time() - t0)
def tests_command(file_path):
file_path = Path(file_path)
db = sqlite3.connect(f"{file_path.stem}.db")
db.execute("PRAGMA cache_size = -100000000")
db.row_factory = sqlite3.Row
db.enable_load_extension(True)
db.load_extension("../../dist/vec0")
db.enable_load_extension(False)
tests = [
json.loads(row["data"])
for row in db.execute("select data from tests").fetchall()
]
num_or_skips = 0
num_1off_errors = 0
t0 = time.time()
print("testing...")
for idx, test in enumerate(tqdm(tests)):
query = test["query"]
conditions = test["conditions"]
expected_closest_ids = test["closest_ids"]
expected_closest_scores = test["closest_scores"]
sql = "select rowid, 1 - distance as similarity from v where vector match ? and k = ?"
params = [serialize_float32(query), len(expected_closest_ids)]
if "and" in conditions:
for condition in conditions["and"]:
assert len(condition.keys()) == 1
column = list(condition.keys())[0]
assert len(list(condition[column].keys())) == 1
condition_type = list(condition[column].keys())[0]
if condition_type == "match":
value = condition[column]["match"]["value"]
sql += f" and {column} = ?"
params.append(value)
elif condition_type == "range":
sql += f" and {column} between ? and ?"
params.append(condition[column]["range"]["gt"])
params.append(condition[column]["range"]["lt"])
else:
raise Exception(f"Unknown condition type: {condition_type}")
elif "or" in conditions:
column = list(conditions["or"][0].keys())[0]
condition_type = list(conditions["or"][0][column].keys())[0]
assert condition_type == "match"
sql += f" and {column} in ("
for idx, condition in enumerate(conditions["or"]):
if condition_type == "match":
value = condition[column]["match"]["value"]
if idx != 0:
sql += ","
sql += "?"
params.append(value)
elif condition_type == "range":
raise NotImplementedError("range conditions inside 'or' are not supported")
else:
raise Exception(f"Unknown condition type: {condition_type}")
sql += ")"
# print(sql, params[1:])
rows = db.execute(sql, params).fetchall()
actual_closest_ids = [row["rowid"] for row in rows]
matches = expected_closest_ids == actual_closest_ids
if not matches:
diff = DeepDiff(
expected_closest_ids, actual_closest_ids, ignore_order=False
)
assert len(list(diff.keys())) == 1
assert "values_changed" in diff.keys()
keys_changed = list(diff["values_changed"].keys())
if len(keys_changed) == 2:
akey, bkey = keys_changed
# DeepDiff keys look like "root[3]"; slice out the index
# (str.lstrip strips a character set, not a prefix).
a = int(akey[len("root[") : -1])
b = int(bkey[len("root[") : -1])
assert abs(a - b) == 1
assert (
diff["values_changed"][akey]["new_value"]
== diff["values_changed"][bkey]["old_value"]
)
assert (
diff["values_changed"][akey]["old_value"]
== diff["values_changed"][bkey]["new_value"]
)
elif len(keys_changed) == 1:
v = int(keys_changed[0][len("root[") : -1])
assert (v + 1) == len(expected_closest_ids)
else:
raise Exception(f"Unexpected number of changed positions: {len(keys_changed)}")
num_1off_errors += 1
# print(closest_scores)
# print([row["similarity"] for row in rows])
# assert closest_scores == [row["similarity"] for row in rows]
print("Number skipped: ", num_or_skips)
print("Num 1 off errors: ", num_1off_errors)
print("1 off error rate: ", num_1off_errors / (len(tests) - num_or_skips))
print(time.time() - t0)
print("done")
def main():
    parser = argparse.ArgumentParser(description="CLI tool")
    subparsers = parser.add_subparsers(dest="command", required=True)

    build_parser = subparsers.add_parser("build")
    build_parser.add_argument("file", type=str, help="Path to input file")
    build_parser.add_argument("--metadata", type=str, help="Metadata columns")
    build_parser.set_defaults(func=lambda args: build_command(args.file, args.metadata))

    tests_parser = subparsers.add_parser("test")
    tests_parser.add_argument("file", type=str, help="Path to input file")
    tests_parser.set_defaults(func=lambda args: tests_command(args.file))

    args = parser.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()
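
The mismatch handling in `tests_command` tolerates two near-miss cases: two neighbouring results swapped, or only the final result differing. A minimal sketch of the same classification without DeepDiff, assuming plain lists of ids; `classify_mismatch` and its return labels are hypothetical names, not part of the script:

```python
def classify_mismatch(expected, actual):
    """Classify how two id rankings differ.

    Returns "match", "adjacent_swap", "last_differs", or "other".
    (Hypothetical helper; the labels are illustrative only.)
    """
    if expected == actual:
        return "match"
    if len(expected) != len(actual):
        return "other"
    # Positions at which the two rankings disagree.
    diff = [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]
    if len(diff) == 2:
        i, j = diff
        swapped = expected[i] == actual[j] and expected[j] == actual[i]
        if j - i == 1 and swapped:
            return "adjacent_swap"
    elif len(diff) == 1 and diff[0] == len(expected) - 1:
        return "last_differs"
    return "other"
```

Returning a label instead of asserting would let the caller count each category separately, at the cost of no longer failing fast on an unexpected difference.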

@ -55,7 +55,10 @@ def test_types(db, snapshot):
)
assert exec(db, "select * from v") == snapshot()
# TODO: integrity test transaction failures in shadow tables
db.commit()
# bad types
db.execute("BEGIN")
assert (
exec(db, INSERT, [b"\x11\x11\x11\x11", "not int", 1.2, "text", b"blob"])
== snapshot()
@ -66,6 +69,7 @@ def test_types(db, snapshot):
)
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 1.2, 1, b"blob"]) == snapshot()
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 1.2, "text", 1]) == snapshot()
db.execute("ROLLBACK")
# NULLs are totally chill
assert exec(db, INSERT, [b"\x11\x11\x11\x11", None, None, None, None]) == snapshot()
@ -151,5 +155,7 @@ def vec0_shadow_table_contents(db, v):
]
o = {}
for shadow_table in shadow_tables:
if shadow_table.endswith("_info"):
continue
o[shadow_table] = exec(db, f"select * from {shadow_table}")
return o

tests/test-general.py

@ -0,0 +1,60 @@
import sqlite3
from collections import OrderedDict
import pytest
@pytest.mark.skipif(
sqlite3.sqlite_version_info < (3, 37),
reason="pragma_table_list was added in SQLite 3.37",
)
def test_shadow(db, snapshot):
db.execute(
"create virtual table v using vec0(a float[1], partition text partition key, metadata text, +name text, chunk_size=8)"
)
assert exec(db, "select * from sqlite_master order by name") == snapshot()
assert (
exec(db, "select * from pragma_table_list where type = 'shadow'") == snapshot()
)
db.execute("drop table v;")
assert (
exec(db, "select * from pragma_table_list where type = 'shadow'") == snapshot()
)
def test_info(db, snapshot):
db.execute("create virtual table v using vec0(a float[1])")
assert exec(db, "select key, typeof(value) from v_info order by 1") == snapshot()
def exec(db, sql, parameters=()):
    # Run a statement and return its rows (or the error) in a snapshot-friendly shape.
    try:
        rows = db.execute(sql, parameters).fetchall()
    except (sqlite3.OperationalError, sqlite3.DatabaseError) as e:
        return {
            "error": e.__class__.__name__,
            "message": str(e),
        }
    a = []
    for row in rows:
        o = OrderedDict()
        for k in row.keys():
            o[k] = row[k]
        a.append(o)
    result = OrderedDict()
    result["sql"] = sql
    result["rows"] = a
    return result


def vec0_shadow_table_contents(db, v):
    # Dump every table whose name starts with "<v>_" (the vec0 shadow tables).
    shadow_tables = [
        row[0]
        for row in db.execute(
            "select name from sqlite_master where name like ? order by 1", [f"{v}_%"]
        ).fetchall()
    ]
    o = {}
    for shadow_table in shadow_tables:
        o[shadow_table] = exec(db, f"select * from {shadow_table}")
    return o

@ -1022,6 +1022,7 @@ def test_vec0_drops():
] == [
"t1",
"t1_chunks",
"t1_info",
"t1_rowids",
"t1_vector_chunks00",
"t1_vector_chunks01",
@ -2216,6 +2217,9 @@ def test_smoke():
{
"name": "vec_xyz_chunks",
},
{
"name": "vec_xyz_info",
},
{
"name": "vec_xyz_rowids",
},

tests/test-metadata.py

@ -0,0 +1,629 @@
import pytest
import sqlite3
from collections import OrderedDict
import json
def test_constructor_limit(db, snapshot):
assert exec(
db,
f"""
create virtual table v using vec0(
{",".join([f"metadata{x} integer" for x in range(17)])},
v float[1]
)
""",
) == snapshot(name="max 16 metadata columns")
def test_normal(db, snapshot):
db.execute(
"create virtual table v using vec0(vector float[1], b boolean, n int, f float, t text, chunk_size=8)"
)
assert exec(
db, "select * from sqlite_master where type = 'table' order by name"
) == snapshot(name="sqlite_master")
assert vec0_shadow_table_contents(db, "v") == snapshot()
INSERT = "insert into v(vector, b, n, f, t) values (?, ?, ?, ?, ?)"
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 1, 1.1, "one"]) == snapshot()
assert exec(db, INSERT, [b"\x22\x22\x22\x22", 1, 2, 2.2, "two"]) == snapshot()
assert exec(db, INSERT, [b"\x33\x33\x33\x33", 1, 3, 3.3, "three"]) == snapshot()
assert exec(db, "select * from v") == snapshot()
assert vec0_shadow_table_contents(db, "v") == snapshot()
assert exec(db, "drop table v") == snapshot()
assert exec(db, "select * from sqlite_master") == snapshot()
#
# assert exec(db, "select * from v") == snapshot()
# assert vec0_shadow_table_contents(db, "v") == snapshot()
#
# db.execute("drop table v;")
# assert exec(db, "select * from sqlite_master order by name") == snapshot(
# name="sqlite_master post drop"
# )
def test_text_knn(db, snapshot):
db.execute(
"create virtual table v using vec0(vector float[1], name text, chunk_size=8)"
)
assert vec0_shadow_table_contents(db, "v") == snapshot()
INSERT = "insert into v(vector, name) values (?, ?)"
db.execute(
"""
INSERT INTO v(vector, name) VALUES
('[.11]', 'aaa'),
('[.22]', 'bbb'),
('[.33]', 'ccc'),
('[.44]', 'ddd'),
('[.55]', 'eee'),
('[.66]', 'fff'),
('[.77]', 'ggg'),
('[.88]', 'hhh'),
('[.99]', 'iii');
"""
)
assert exec(db, "select * from v") == snapshot()
assert vec0_shadow_table_contents(db, "v") == snapshot()
assert (
exec(
db,
"select rowid, name, distance from v where vector match '[1]' and k = 5",
)
== snapshot()
)
assert (
exec(
db,
"select rowid, name, distance from v where vector match '[1]' and k = 5 and name < 'ddd'",
)
== snapshot()
)
assert (
exec(
db,
"select rowid, name, distance from v where vector match '[1]' and k = 5 and name <= 'ddd'",
)
== snapshot()
)
assert (
exec(
db,
"select rowid, name, distance from v where vector match '[1]' and k = 5 and name > 'fff'",
)
== snapshot()
)
assert (
exec(
db,
"select rowid, name, distance from v where vector match '[1]' and k = 5 and name >= 'fff'",
)
== snapshot()
)
assert (
exec(
db,
"select rowid, name, distance from v where vector match '[1]' and k = 5 and name = 'aaa'",
)
== snapshot()
)
assert (
exec(
db,
"select rowid, name, distance from v where vector match '[.01]' and k = 5 and name != 'aaa'",
)
== snapshot()
)
def test_long_text_updates(db, snapshot):
db.execute(
"create virtual table v using vec0(vector float[1], name text, chunk_size=8)"
)
assert vec0_shadow_table_contents(db, "v") == snapshot()
INSERT = "insert into v(vector, name) values (?, ?)"
exec(db, INSERT, [b"\x11\x11\x11\x11", "123456789a12"])
exec(db, INSERT, [b"\x11\x11\x11\x11", "123456789a123"])
assert exec(db, "select * from v") == snapshot()
assert vec0_shadow_table_contents(db, "v") == snapshot()
def test_long_text_knn(db, snapshot):
db.execute(
"create virtual table v using vec0(vector float[1], name text, chunk_size=8)"
)
INSERT = "insert into v(vector, name) values (?, ?)"
exec(db, INSERT, ["[1]", "aaaa"])
exec(db, INSERT, ["[2]", "aaaaaaaaaaaa_aaa"])
exec(db, INSERT, ["[3]", "bbbb"])
exec(db, INSERT, ["[4]", "bbbbbbbbbbbb_bbb"])
exec(db, INSERT, ["[5]", "cccc"])
exec(db, INSERT, ["[6]", "cccccccccccc_ccc"])
tests = [
"bbbb",
"bb",
"bbbbbb",
"bbbbbbbbbbbb_bbb",
"bbbbbbbbbbbb_aaa",
"bbbbbbbbbbbb_ccc",
"longlonglonglonglonglonglong",
]
ops = ["=", "!=", "<", "<=", ">", ">="]
op_names = ["eq", "ne", "lt", "le", "gt", "ge"]
for test in tests:
for op, op_name in zip(ops, op_names):
assert exec(
db,
f"select rowid, name, distance from v where vector match '[100]' and k = 5 and name {op} ?",
[test],
) == snapshot(name=f"{op_name}-{test}")
def test_types(db, snapshot):
db.execute(
"create virtual table v using vec0(vector float[1], b boolean, n int, f float, t text, chunk_size=8)"
)
INSERT = "insert into v(vector, b, n, f, t) values (?, ?, ?, ?, ?)"
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 1, 1.1, "test"]) == snapshot(
name="legal"
)
# fmt: off
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 'illegal', 1, 1.1, 'test']) == snapshot(name="illegal-type-boolean")
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 'illegal', 1.1, 'test']) == snapshot(name="illegal-type-int")
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 1, 'illegal', 'test']) == snapshot(name="illegal-type-float")
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 1, 1.1, 420]) == snapshot(name="illegal-type-text")
# fmt: on
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 44, 1, 1.1, "test"]) == snapshot(
name="illegal-boolean"
)
def test_updates(db, snapshot):
db.execute(
"create virtual table v using vec0(vector float[1], b boolean, n int, f float, t text, chunk_size=8)"
)
INSERT = "insert into v(rowid, vector, b, n, f, t) values (?, ?, ?, ?, ?, ?)"
exec(db, INSERT, [1, b"\x11\x11\x11\x11", 1, 1, 1.1, "test1"])
exec(db, INSERT, [2, b"\x22\x22\x22\x22", 1, 2, 2.2, "test2"])
exec(db, INSERT, [3, b"\x33\x33\x33\x33", 1, 3, 3.3, "1234567890123"])
assert exec(db, "select * from v") == snapshot(name="1-init-contents")
assert vec0_shadow_table_contents(db, "v") == snapshot(name="1-init-shadow")
assert "error" not in exec(
db, "UPDATE v SET b = 0, n = 11, f = 11.11, t = 'newtest1' where rowid = 1"
)
assert exec(db, "select * from v") == snapshot(name="general-update-contents")
assert vec0_shadow_table_contents(db, "v") == snapshot(
name="general-update-shaodnw"
)
# string update #1: long string updated to long string
exec(db, "UPDATE v SET t = '1234567890123-updated' where rowid = 3")
assert exec(db, "select * from v") == snapshot(name="string-update-1-contents")
assert vec0_shadow_table_contents(db, "v") == snapshot(
name="string-update-1-shadow"
)
# string update #2: short string updated to short string
exec(db, "UPDATE v SET t = 'test2-short' where rowid = 2")
assert exec(db, "select * from v") == snapshot(name="string-update-2-contents")
assert vec0_shadow_table_contents(db, "v") == snapshot(
name="string-update-2-shadow"
)
# string update #3: short string updated to long string
exec(db, "UPDATE v SET t = 'test2-long-long-long' where rowid = 2")
assert exec(db, "select * from v") == snapshot(name="string-update-3-contents")
assert vec0_shadow_table_contents(db, "v") == snapshot(
name="string-update-3-shadow"
)
# string update #4: long string updated to short string
exec(db, "UPDATE v SET t = 'test2-shortx' where rowid = 2")
assert exec(db, "select * from v") == snapshot(name="string-update-4-contents")
assert vec0_shadow_table_contents(db, "v") == snapshot(
name="string-update-4-shadow"
)
def test_deletes(db, snapshot):
db.execute(
"create virtual table v using vec0(vector float[1], b boolean, n int, f float, t text, chunk_size=8)"
)
INSERT = "insert into v(rowid, vector, b, n, f, t) values (?, ?, ?, ?, ?, ?)"
assert exec(db, INSERT, [1, b"\x11\x11\x11\x11", 1, 1, 1.1, "test1"]) == snapshot()
assert exec(db, INSERT, [2, b"\x22\x22\x22\x22", 1, 2, 2.2, "test2"]) == snapshot()
assert (
exec(db, INSERT, [3, b"\x33\x33\x33\x33", 1, 3, 3.3, "1234567890123"])
== snapshot()
)
assert exec(db, "select * from v") == snapshot()
assert vec0_shadow_table_contents(db, "v") == snapshot()
assert exec(db, "DELETE FROM v where rowid = 1") == snapshot()
assert exec(db, "select * from v") == snapshot()
assert vec0_shadow_table_contents(db, "v") == snapshot()
assert exec(db, "DELETE FROM v where rowid = 3") == snapshot()
assert exec(db, "select * from v") == snapshot()
assert vec0_shadow_table_contents(db, "v") == snapshot()
def test_knn(db, snapshot):
db.execute(
"create virtual table v using vec0(vector float[1], name text, chunk_size=8)"
)
assert exec(
db, "select * from sqlite_master where type = 'table' order by name"
) == snapshot(name="sqlite_master")
db.executemany(
"insert into v(vector, name) values (?, ?)",
[("[1]", "alex"), ("[2]", "brian"), ("[3]", "craig")],
)
# EVIDENCE-OF: V16511_00582 catches "illegal" constraints on metadata columns
assert (
exec(
db,
"select *, distance from v where vector match '[5]' and k = 3 and name like 'illegal'",
)
== snapshot()
)
SUPPORTS_VTAB_IN = sqlite3.sqlite_version_info >= (3, 38)
@pytest.mark.skipif(
not SUPPORTS_VTAB_IN, reason="requires vtab `x in (...)` support in SQLite >=3.38"
)
def test_vtab_in(db, snapshot):
db.execute(
"create virtual table v using vec0(vector float[1], n int, t text, b boolean, f float, chunk_size=8)"
)
db.executemany(
"insert into v(rowid, vector, n, t, b, f) values (?, ?, ?, ?, ?, ?)",
[
(1, "[1]", 999, "aaaa", 0, 1.1),
(2, "[2]", 555, "aaaa", 0, 1.1),
(3, "[3]", 999, "aaaa", 0, 1.1),
(4, "[4]", 555, "aaaa", 0, 1.1),
(5, "[5]", 999, "zzzz", 0, 1.1),
(6, "[6]", 555, "zzzz", 0, 1.1),
(7, "[7]", 999, "zzzz", 0, 1.1),
(8, "[8]", 555, "zzzz", 0, 1.1),
],
)
# EVIDENCE-OF: V15248_32086
assert exec(
db, "select * from v where vector match '[0]' and k = 8 and b in (1, 0)"
) == snapshot(name="block-bool")
assert exec(
db, "select * from v where vector match '[0]' and k = 8 and f in (1.1, 0.0)"
) == snapshot(name="block-float")
assert exec(
db,
"select rowid, n, distance from v where vector match '[0]' and k = 8 and n in (555, 999)",
) == snapshot(name="allow-int-all")
assert exec(
db,
"select rowid, n, distance from v where vector match '[0]' and k = 8 and n in (555, -1, -2)",
) == snapshot(name="allow-int-superfluous")
assert exec(
db,
"select rowid, t, distance from v where vector match '[0]' and k = 8 and t in ('aaaa', 'zzzz')",
) == snapshot(name="allow-text-all")
assert exec(
db,
"select rowid, t, distance from v where vector match '[0]' and k = 8 and t in ('aaaa', 'foo', 'bar')",
) == snapshot(name="allow-text-superfluous")
def test_vtab_in_long_text(db, snapshot):
db.execute(
"create virtual table v using vec0(vector float[1], t text, chunk_size=8)"
)
data = [
(1, "aaaa"),
(2, "aaaaaaaaaaaa_aaa"),
(3, "bbbb"),
(4, "bbbbbbbbbbbb_bbb"),
(5, "cccc"),
(6, "cccccccccccc_ccc"),
]
db.executemany(
"insert into v(rowid, vector, t) values (:rowid, printf('[%d]', :rowid), :t)",
[{"rowid": row[0], "t": row[1]} for row in data],
)
for _, lookup in data:
assert exec(
db,
"select rowid, t from v where vector match '[0]' and k = 10 and t in (?, 'nonsense')",
[lookup],
) == snapshot(name=f"individual-{lookup}")
assert exec(
db,
"select rowid, t from v where vector match '[0]' and k = 10 and t in (select value from json_each(?))",
[json.dumps([row[1] for row in data])],
) == snapshot(name="all")
def test_idxstr(db, snapshot):
db.execute(
"""
create virtual table vec_movies using vec0(
movie_id integer primary key,
synopsis_embedding float[1],
+title text,
is_favorited boolean,
genre text,
num_reviews int,
mean_rating float,
chunk_size=8
);
"""
)
assert (
eqp(
db,
"select * from vec_movies where synopsis_embedding match '' and k = 0 and is_favorited = true",
)
== snapshot()
)
ops = ["<", ">", "<=", ">=", "!="]
for op in ops:
assert eqp(
db,
f"select * from vec_movies where synopsis_embedding match '' and k = 0 and genre {op} NULL",
) == snapshot(name=f"knn-constraint-text {op}")
for op in ops:
assert eqp(
db,
f"select * from vec_movies where synopsis_embedding match '' and k = 0 and num_reviews {op} NULL",
) == snapshot(name=f"knn-constraint-int {op}")
for op in ops:
assert eqp(
db,
f"select * from vec_movies where synopsis_embedding match '' and k = 0 and mean_rating {op} NULL",
) == snapshot(name=f"knn-constraint-float {op}")
# for op in ops:
# assert eqp(
# db,
# f"select * from vec_movies where synopsis_embedding match '' and k = 0 and is_favorited {op} NULL",
# ) == snapshot(name=f"knn-constraint-boolean {op}")
def eqp(db, sql):
    o = OrderedDict()
    o["sql"] = sql
    o["plan"] = [
        dict(row) for row in db.execute(f"explain query plan {sql}").fetchall()
    ]
    for p in o["plan"]:
        # value is different on macos-aarch64 in github actions, not sure why
        del p["notused"]
    return o
def test_stress(db, snapshot):
db.execute(
"""
create virtual table vec_movies using vec0(
movie_id integer primary key,
synopsis_embedding float[1],
+title text,
is_favorited boolean,
genre text,
num_reviews int,
mean_rating float,
chunk_size=8
);
"""
)
db.execute(
"""
INSERT INTO vec_movies(movie_id, synopsis_embedding, is_favorited, genre, title, num_reviews, mean_rating)
VALUES
(1, '[1]', 0, 'horror', 'The Conjuring', 153, 4.6),
(2, '[2]', 0, 'comedy', 'Dumb and Dumber', 382, 2.6),
(3, '[3]', 0, 'scifi', 'Interstellar', 53, 5.0),
(4, '[4]', 0, 'fantasy', 'The Lord of the Rings: The Fellowship of the Ring', 210, 4.2),
(5, '[5]', 1, 'documentary', 'An Inconvenient Truth', 93, 3.4),
(6, '[6]', 1, 'horror', 'Hereditary', 167, 4.7),
(7, '[7]', 1, 'comedy', 'Anchorman: The Legend of Ron Burgundy', 482, 2.9),
(8, '[8]', 0, 'scifi', 'Blade Runner 2049', 301, 5.0),
(9, '[9]', 1, 'fantasy', 'Harry Potter and the Sorcerer''s Stone', 134, 4.1),
(10, '[10]', 0, 'documentary', 'Free Solo', 66, 3.2),
(11, '[11]', 1, 'horror', 'Get Out', 88, 4.9),
(12, '[12]', 0, 'comedy', 'The Hangover', 59, 2.8),
(13, '[13]', 1, 'scifi', 'The Matrix', 423, 4.5),
(14, '[14]', 0, 'fantasy', 'Pan''s Labyrinth', 275, 3.6),
(15, '[15]', 1, 'documentary', '13th', 191, 4.4),
(16, '[16]', 0, 'horror', 'It Follows', 314, 4.3),
(17, '[17]', 1, 'comedy', 'Step Brothers', 74, 3.0),
(18, '[18]', 1, 'scifi', 'Inception', 201, 5.0),
(19, '[19]', 1, 'fantasy', 'The Shape of Water', 399, 2.7),
(20, '[20]', 1, 'documentary', 'Won''t You Be My Neighbor?', 186, 4.8),
(21, '[21]', 1, 'scifi', 'Gravity', 342, 4.0),
(22, '[22]', 1, 'scifi', 'Dune', 451, 4.4),
(23, '[23]', 1, 'scifi', 'The Martian', 522, 4.6),
(24, '[24]', 1, 'horror', 'A Quiet Place', 271, 4.3),
(25, '[25]', 1, 'fantasy', 'The Chronicles of Narnia: The Lion, the Witch and the Wardrobe', 310, 3.9);
"""
)
assert vec0_shadow_table_contents(db, "vec_movies") == snapshot()
assert (
exec(
db,
"""
select
movie_id,
title,
genre,
num_reviews,
mean_rating,
is_favorited,
distance
from vec_movies
where synopsis_embedding match '[15.5]'
and genre = 'scifi'
and num_reviews between 100 and 500
and mean_rating > 3.5
and k = 5;
""",
)
== snapshot()
)
assert (
exec(
db,
"select movie_id, genre, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and genre = 'horror'",
)
== snapshot()
)
assert (
exec(
db,
"select movie_id, genre, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and genre = 'comedy'",
)
== snapshot()
)
assert (
exec(
db,
"select movie_id, num_reviews, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and num_reviews between 100 and 500",
)
== snapshot()
)
assert (
exec(
db,
"select movie_id, num_reviews, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and num_reviews >= 500",
)
== snapshot()
)
assert (
exec(
db,
"select movie_id, mean_rating, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and mean_rating < 3.0",
)
== snapshot()
)
assert (
exec(
db,
"select movie_id, mean_rating, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and mean_rating between 4.0 and 5.0",
)
== snapshot()
)
assert exec(
db,
"select movie_id, is_favorited, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and is_favorited = TRUE",
) == snapshot(name="bool-eq-true")
assert exec(
db,
"select movie_id, is_favorited, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and is_favorited != TRUE",
) == snapshot(name="bool-ne-true")
assert exec(
db,
"select movie_id, is_favorited, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and is_favorited = FALSE",
) == snapshot(name="bool-eq-false")
assert exec(
db,
"select movie_id, is_favorited, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and is_favorited != FALSE",
) == snapshot(name="bool-ne-false")
# EVIDENCE-OF: V10145_26984
assert exec(
db,
"select movie_id, is_favorited, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and is_favorited >= 999",
) == snapshot(name="bool-other-op")
def test_errors(db, snapshot):
db.execute("create virtual table v using vec0(vector float[1], t text)")
db.execute("insert into v(vector, t) values ('[1]', 'aaaaaaaaaaaax')")
assert exec(db, "select * from v") == snapshot()
# EVIDENCE-OF: V15466_32305
db.set_authorizer(
authorizer_deny_on(sqlite3.SQLITE_READ, "v_metadatatext00", "data")
)
assert exec(db, "select * from v") == snapshot()
def authorizer_deny_on(operation, x1, x2=None):
    # Build an authorizer callback that denies one (operation, arg1, arg2) combination.
    def _auth(op, p1, p2, p3, p4):
        if op == operation and p1 == x1 and p2 == x2:
            return sqlite3.SQLITE_DENY
        return sqlite3.SQLITE_OK

    return _auth


def exec(db, sql, parameters=()):
    # Run a statement and return its rows (or the error) in a snapshot-friendly shape.
    try:
        rows = db.execute(sql, parameters).fetchall()
    except (sqlite3.OperationalError, sqlite3.DatabaseError) as e:
        return {
            "error": e.__class__.__name__,
            "message": str(e),
        }
    a = []
    for row in rows:
        o = OrderedDict()
        for k in row.keys():
            o[k] = row[k]
        a.append(o)
    result = OrderedDict()
    result["sql"] = sql
    result["rows"] = a
    return result


def vec0_shadow_table_contents(db, v):
    # Dump every table whose name starts with "<v>_", skipping the _info table.
    shadow_tables = [
        row[0]
        for row in db.execute(
            "select name from sqlite_master where name like ? order by 1", [f"{v}_%"]
        ).fetchall()
    ]
    o = {}
    for shadow_table in shadow_tables:
        if shadow_table.endswith("_info"):
            continue
        o[shadow_table] = exec(db, f"select * from {shadow_table}")
    return o
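
`test_errors` appears to use `authorizer_deny_on` to simulate a failed read of a shadow-table column. A self-contained sketch of the same mechanism against an ordinary table, assuming only the standard-library `sqlite3` module; the table `t` and the helper `read_is_blocked` are hypothetical and exist only for this illustration:

```python
import sqlite3


def authorizer_deny_on(operation, x1, x2=None):
    # Same shape as the test helper: deny one (operation, arg1, arg2) combination.
    def _auth(op, p1, p2, p3, p4):
        if op == operation and p1 == x1 and p2 == x2:
            return sqlite3.SQLITE_DENY
        return sqlite3.SQLITE_OK

    return _auth


def read_is_blocked(db, sql):
    # Hypothetical helper: True if the statement is rejected by SQLite.
    try:
        db.execute(sql).fetchall()
    except sqlite3.DatabaseError:
        return True
    return False


db = sqlite3.connect(":memory:")
db.execute("create table t(id integer primary key, data text)")
db.execute("insert into t(data) values ('x')")

# For SQLITE_READ, the callback receives the table name and the column name.
db.set_authorizer(authorizer_deny_on(sqlite3.SQLITE_READ, "t", "data"))

blocked = read_is_blocked(db, "select data from t")  # touches t.data
allowed = not read_is_blocked(db, "select id from t")  # does not touch t.data
```

Denying a single column keeps every other statement working, which makes it possible to exercise one specific error path without corrupting the database.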

@ -111,5 +111,7 @@ def vec0_shadow_table_contents(db, v):
]
o = {}
for shadow_table in shadow_tables:
if shadow_table.endswith("_info"):
continue
o[shadow_table] = exec(db, f"select * from {shadow_table}")
return o