mirror of
https://github.com/asg017/sqlite-vec.git
synced 2026-04-24 16:26:37 +02:00
Metadata filtering (#124)
* initial pass at PARTITION KEY support.
* Initial pass, allow auxiliary columns on vec0 virtual tables
* update TODO
* Initial pass at metadata filtering
* unit tests
* gha this PR branch
* fixup tests
* doc internal
* fix tests, KNN/rowids in
* define SQLITE_INDEX_CONSTRAINT_OFFSET
* whoops
* update tests, syrupy, use uv
* un ignore pyproject.toml
* dot
* tests/
* type error?
* win: .exe, update error name
* try fix macos python, paren around expr?
* win bash?
* dbg :(
* explicit error
* op
* dbg win
* win ./tests/.venv/Scripts/python.exe
* block UPDATEs on partition key values for now
* test this branch
* accidentally removved "partition key type mistmatch" block during merge
* typo ugh
* bruv
* start aux snapshots
* drop aux shadow table on destroy
* enforce column types
* block WHERE constraints on auxiliary columns in KNN queries
* support delete
* support UPDATE on auxiliary columns
* test this PR
* dont inline that
* test-metadata.py
* memzero text buffer
* stress test
* more snpashot tests
* rm double/int32, just float/int64
* finish type checking
* long text support
* DELETE support
* UPDATE support
* fix snapshot names
* drop not-used in eqp
* small fixes
* boolean comparison handling
* ensure error is raised when long string constraint
* new version string for beta builds
* typo whoops
* ann-filtering-benchmark directory
* test-case
* updates
* fix aux column error when using non-default rowid values, needs test
* refactor some text knn filtering
* rowids blob read only on text metadata filters
* refactor
* add failing test causes for non eq text knn
* text knn NE
* test cases diff
* GT
* text knn GT/GE fixes
* text knn LT/LE
* clean
* vtab_in handling
* unblock aux failures for now
* guard sqlite3_vtab_in
* else in guard?
* fixes and tests
* add broken shadow table test
* rename _metadata_chunksNN shadown table to _metadatachunksNN, for proper shadowName detection
* _metadata_text_NN shadow tables to _metadatatextNN
* SQLITE_VEC_VERSION_MAJOR SQLITE_VEC_VERSION_MINOR and SQLITE_VEC_VERSION_PATCH in sqlite-vec.h
* _info shadow table
* forgot to update aux snapshot?
* fix aux tests
This commit is contained in:
parent
9bfeaa7842
commit
352f953fc0
21 changed files with 7361 additions and 105 deletions
1 .github/workflows/test.yaml (vendored)

@@ -5,6 +5,7 @@ on:
      - main
      - partition-by
      - auxiliary
      - metadata-filtering
permissions:
  contents: read
jobs:
5 .gitignore (vendored)

@@ -26,3 +26,8 @@ sqlite-vec.h
tmp/

poetry.lock

*.jsonl

memstat.c
memstat.*
@@ -1,5 +1,51 @@
# `sqlite-vec` Architecture

Internal documentation for how `sqlite-vec` works under the hood. Not meant for
users of the `sqlite-vec` project; consult
[the official `sqlite-vec` documentation](https://alexgarcia.xyz/sqlite-vec) for
how-to guides. Rather, this is for people interested in how `sqlite-vec` works,
and some guidelines for any future contributors.

Very much a WIP.

## `vec0`

### Shadow Tables

#### `xyz_chunks`

- `chunk_id INTEGER`
- `size INTEGER`
- `validity BLOB`
- `rowids BLOB`

#### `xyz_rowids`

- `rowid INTEGER`
- `id`
- `chunk_id INTEGER`
- `chunk_offset INTEGER`

#### `xyz_vector_chunksNN`

- `rowid INTEGER`
- `vector BLOB`

#### `xyz_auxiliary`

- `rowid INTEGER`
- `valueNN [type]`

#### `xyz_metadatachunksNN`

- `rowid INTEGER`
- `data BLOB`

#### `xyz_metadatatextNN`

- `rowid INTEGER`
- `data TEXT`

### idxStr

The `vec0` idxStr is a string composed of a single "header" character and 0 or

@@ -14,8 +60,11 @@ The "header" character denotes the type of query plan, as determined by the
| `VEC0_QUERY_PLAN_POINT` | `'2'` | Perform a single-lookup point query for the provided rowid |
| `VEC0_QUERY_PLAN_KNN` | `'3'` | Perform a KNN-style query on the provided query vector and parameters. |

-Each 4-character "block" is associated with a corresponding value in `argv[]`. For example, the 1st block at byte offset `1-4` (inclusive) is associated with `argv[1]`. The 2nd block at byte offset `5-8` (inclusive) is associated with `argv[2]`, and so on. Each block describes what kind of value or filter the given `argv[i]` value is.
+Each 4-character "block" is associated with a corresponding value in `argv[]`.
+For example, the 1st block at byte offset `1-4` (inclusive) is associated with
+`argv[1]`. The 2nd block at byte offset `5-8` (inclusive) is associated with
+`argv[2]`, and so on. Each block describes what kind of value or filter the
+given `argv[i]` value is.

#### `VEC0_IDXSTR_KIND_KNN_MATCH` (`'{'`)

@@ -31,8 +80,8 @@ The remaining 3 characters of the block are `_` fillers.

#### `VEC0_IDXSTR_KIND_KNN_ROWID_IN` (`'['`)

-`argv[i]` is the optional `rowid in (...)` value, and must be handled with [`sqlite3_vtab_in_first()` /
-`sqlite3_vtab_in_next()`](https://www.sqlite.org/c3ref/vtab_in_first.html).
+`argv[i]` is the optional `rowid in (...)` value, and must be handled with
+[`sqlite3_vtab_in_first()` / `sqlite3_vtab_in_next()`](https://www.sqlite.org/c3ref/vtab_in_first.html).

The remaining 3 characters of the block are `_` fillers.

@@ -40,15 +89,34 @@ The remaining 3 characters of the block are `_` fillers.

`argv[i]` is a "constraint" on a specific partition key.

-The second character of the block denotes which partition key to filter on, using `A` to denote the first partition key column, `B` for the second, etc. It is encoded with `'A' + partition_idx` and can be decoded with `c - 'A'`.
+The second character of the block denotes which partition key to filter on,
+using `A` to denote the first partition key column, `B` for the second, etc. It
+is encoded with `'A' + partition_idx` and can be decoded with `c - 'A'`.

-The third character of the block denotes which operator is used in the constraint. It will be one of the values of `enum vec0_partition_operator`, as only a subset of operations are supported on partition keys.
+The third character of the block denotes which operator is used in the
+constraint. It will be one of the values of `enum vec0_partition_operator`, as
+only a subset of operations are supported on partition keys.

The fourth character of the block is a `_` filler.

#### `VEC0_IDXSTR_KIND_POINT_ID` (`'!'`)

`argv[i]` is the value of the rowid or id to match against for the point query.

The remaining 3 characters of the block are `_` fillers.

#### `VEC0_IDXSTR_KIND_METADATA_CONSTRAINT` (`'&'`)

`argv[i]` is the value of the `WHERE` constraint for a metadata column in a KNN
query.

The second character of the block denotes which metadata column the constraint
belongs to, using `A` to denote the first metadata column, `B` for the second,
etc. It is encoded with `'A' + metadata_idx` and can be decoded with `c - 'A'`.

The third character of the block is the constraint operator. It will be one of
`enum vec0_metadata_operator`, as only a subset of operators are supported on
metadata column KNN filters.

The fourth character of the block is a `_` filler.
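As a concrete illustration, the block arithmetic documented above (`'A' + idx` to encode a column index, `c - 'A'` to decode it, one 4-character block per `argv[i]`) can be sketched in Python. This is a hypothetical re-implementation for a metadata-constraint block, not code from `sqlite-vec` itself; the `'='` operator character is a stand-in for whichever `enum vec0_metadata_operator` value applies:

```python
# Hypothetical sketch of the vec0 idxStr layout described above:
# one header character, then 4-character blocks, one per argv[i].

def encode_metadata_block(metadata_idx: int, op: str) -> str:
    # kind '&', column index as 'A' + idx, operator char, '_' filler
    return "&" + chr(ord("A") + metadata_idx) + op + "_"

def decode_block(block: str):
    # Decode a 4-char block into (kind, column index, operator).
    kind, col, op, _filler = block
    return kind, ord(col) - ord("A"), op

# A KNN plan ('3') whose 1st block (associated with argv[1])
# constrains the 2nd metadata column:
idx_str = "3" + encode_metadata_block(1, "=")
header = idx_str[0]
blocks = [idx_str[i : i + 4] for i in range(1, len(idx_str), 4)]
print(header, blocks, decode_block(blocks[0]))
```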
3 Makefile

@@ -153,6 +153,9 @@ sqlite-vec.h: sqlite-vec.h.tmpl VERSION
	VERSION=$(shell cat VERSION) \
	DATE=$(shell date -r VERSION +'%FT%TZ%z') \
	SOURCE=$(shell git log -n 1 --pretty=format:%H -- VERSION) \
+	VERSION_MAJOR=$$(echo $$VERSION | cut -d. -f1) \
+	VERSION_MINOR=$$(echo $$VERSION | cut -d. -f2) \
+	VERSION_PATCH=$$(echo $$VERSION | cut -d. -f3 | cut -d- -f1) \
	envsubst < $< > $@

clean:
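The three added Makefile lines split the contents of `VERSION` on dots, with the patch field additionally stripped of any `-beta`-style pre-release suffix by the second `cut -d- -f1`. For illustration, the same parsing mirrored in Python (the sample version string here is made up):

```python
# Mirrors the Makefile's `cut -d. -f1/-f2/-f3 | cut -d- -f1` pipeline.
def split_version(version: str) -> tuple[int, int, int]:
    major, minor, patch = version.split(".", 2)
    patch = patch.split("-", 1)[0]  # drop a pre-release suffix like "-beta.1"
    return int(major), int(minor), int(patch)

print(split_version("0.1.6-beta.1"))  # (0, 1, 6)
```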
28
TODO
28
TODO
|
|
@ -1,13 +1,17 @@
|
|||
# partition
|
||||
- [ ] add `xyz_info` shadow table with version etc.
|
||||
|
||||
- [ ] UPDATE on partition key values
|
||||
- remove previous row from chunk, insert into new one?
|
||||
- [ ] properly sqlite3_vtab_nochange / sqlite3_value_nochange handling
|
||||
|
||||
# auxiliary columns
|
||||
|
||||
- later:
|
||||
- NOT NULL?
|
||||
- perf: INSERT stmt should be cached on vec0_vtab
|
||||
- perf: LEFT JOIN aux table to rowids query in vec0_cursor for rowid/point
|
||||
stmts, to avoid N lookup queries
|
||||
- later
|
||||
- [ ] partition: UPDATE support
|
||||
- [ ] skip invalid validity entries in knn filter?
|
||||
- [ ] nulls in metadata
|
||||
- [ ] partition `x in (...)` handling
|
||||
- [ ] blobs/date/datetime
|
||||
- [ ] uuid/ulid perf
|
||||
- [ ] Aux columns: `NOT NULL` constraint
|
||||
- [ ] Metadata columns: `NOT NULL` constraint
|
||||
- [ ] Partiion key: `NOT NULL` constraint
|
||||
- [ ] dictionary encoding?
|
||||
- [ ] properly sqlite3_vtab_nochange / sqlite3_value_nochange handling
|
||||
- [ ] perf
|
||||
- [ ] aux: cache INSERT
|
||||
- [ ] aux: LEFT JOIN on `_rowids` queries to avoid N lookup queries
|
||||
|
|
|
|||
1741 sqlite-vec.c
File diff suppressed because it is too large.

@@ -18,9 +18,16 @@
 #endif

-#define SQLITE_VEC_VERSION "v${VERSION}"
+// TODO rm
+#define SQLITE_VEC_VERSION "v-metadata-experiment.01"
 #define SQLITE_VEC_DATE "${DATE}"
 #define SQLITE_VEC_SOURCE "${SOURCE}"

+#define SQLITE_VEC_VERSION_MAJOR ${VERSION_MAJOR}
+#define SQLITE_VEC_VERSION_MINOR ${VERSION_MINOR}
+#define SQLITE_VEC_VERSION_PATCH ${VERSION_PATCH}

 #ifdef __cplusplus
 extern "C" {
 #endif
327 test.sql

@@ -1,10 +1,333 @@
.load dist/vec0
.echo on
.bail on

.mode qbox

.load ./memstat

select name, value from sqlite_memstat where name = 'MEMORY_USED';

create virtual table v using vec0(
  vector float[1],
  name1 text,
  name2 text,
  age int,
  chunk_size=8
);

select name, value from sqlite_memstat where name = 'MEMORY_USED';

insert into v(vector, name1, name2, age) values
  ('[1]', 'alex', 'xxxx', 1),
  ('[2]', 'alex', 'aaaa', 2),
  ('[3]', 'alex', 'aaaa', 3),
  ('[4]', 'brian', 'aaaa', 1),
  ('[5]', 'brian', 'aaaa', 2),
  ('[6]', 'brian', 'aaaa', 3),
  ('[7]', 'craig', 'aaaa', 1),
  ('[8]', 'craig', 'xxxx', 2),
  ('[9]', 'craig', 'xxxx', 3),
  ('[10]', '123456789012345', 'xxxx', 3);

select name, value from sqlite_memstat where name = 'MEMORY_USED';

select rowid, name1, name2, age, vec_to_json(vector)
from v
where vector match '[0]'
  and k = 5
  and name1 in ('alex', 'brian', 'craig')
  --and name2 in ('aaaa', 'xxxx')
  and age in (1, 2, 3, 2222,3333,4444);

select name, value from sqlite_memstat where name = 'MEMORY_USED';

select rowid, name1, name2, age, vec_to_json(vector)
from v
where vector match '[0]'
  and k = 5
  and name1 in ('123456789012345', 'superfluous');

.exit

create virtual table v using vec0(
  vector float[1],
  +description text
);
insert into v(rowid, vector, description) values (1, '[1]', 'aaa');
select * from v;

.exit

create virtual table vec_articles using vec0(
  article_id integer primary key,
  year integer partition key,
  headline_embedding float[1],
  +headline text,
  +url text,
  word_count integer,
  print_section text,
  print_page integer,
  pub_date text,
);

insert into vec_articles values (1111, 2020, '[1]', 'headline', 'https://...', 200, 'A', 1, '2020-01-01');

select * from vec_articles;

.exit

create table movies(movie_id integer primary key, synopsis text);
INSERT INTO movies(movie_id, synopsis)
VALUES
  (1, 'A family is haunted by demonic spirits after moving into a new house, requiring the help of paranormal investigators.'),
  (2, 'Two dim-witted friends embark on a cross-country road trip to return a briefcase full of money to its owner.'),
  (3, 'A team of explorers travels through a wormhole in space in an attempt to ensure humanity’s survival.'),
  (4, 'A young hobbit embarks on a journey with a fellowship to destroy a powerful ring and save Middle-earth from darkness.'),
  (5, 'A documentary about the dangers of global warming, featuring former U.S. Vice President Al Gore.'),
  (6, 'After the death of her secretive mother, a woman discovers terrifying secrets about her family lineage.'),
  (7, 'A clueless but charismatic TV anchorman struggles to stay relevant in the world of broadcast journalism.'),
  (8, 'A young blade runner uncovers a long-buried secret that leads him to track down former blade runner Rick Deckard.'),
  (9, 'A young boy discovers he is a wizard and attends a magical school, where he learns about his destiny.'),
  (10, 'A rock climber attempts to scale El Capitan in Yosemite National Park without the use of ropes or safety gear.'),
  (11, 'A young African-American man uncovers a disturbing secret when he visits his white girlfriend''s family estate.'),
  (12, 'Three friends wake up from a bachelor party in Las Vegas with no memory of the previous night and must retrace their steps.'),
  (13, 'A computer hacker learns about the true nature of his reality and his role in the war against its controllers.'),
  (14, 'In post-Civil War Spain, a young girl escapes into an eerie but captivating fantasy world.'),
  (15, 'A documentary that explores racial inequality in the United States, focusing on the prison system and mass incarceration.'),
  (16, 'A young woman is followed by an unknown supernatural force after a sexual encounter.'),
  (17, 'Two immature but well-meaning stepbrothers become instant rivals when their single parents marry.'),
  (18, 'A thief with the ability to enter people''s dreams is tasked with planting an idea into a target''s subconscious.'),
  (19, 'A mute woman forms a unique relationship with a mysterious aquatic creature being held in a secret research facility.'),
  (20, 'A documentary about the life and legacy of Fred Rogers, the beloved host of the children''s TV show "Mister Rogers'' Neighborhood."');

create virtual table vec_movies using vec0(
  movie_id integer primary key,
  synopsis_embedding float[1],
  +title text,
  genre text,
  num_reviews int,
  mean_rating float,
  chunk_size=8
);

.schema
/*
insert into vec_movies(movie_id, synopsis_embedding, num_reviews, mean_rating) values
  (1, '[1]', 153, 4.6),
  (2, '[2]', 382, 2.6),
  (3, '[3]', 53, 5.0),
  (4, '[4]', 210, 4.2),
  (5, '[5]', 93, 3.4),
  (6, '[6]', 167, 4.7),
  (7, '[7]', 482, 2.9),
  (8, '[8]', 301, 5.0),
  (9, '[9]', 134, 4.1),
  (10, '[10]', 66, 3.2),
  (11, '[11]', 88, 4.9),
  (12, '[12]', 59, 2.8),
  (13, '[13]', 423, 4.5),
  (14, '[14]', 275, 3.6),
  (15, '[15]', 191, 4.4),
  (16, '[16]', 314, 4.3),
  (17, '[17]', 74, 3.0),
  (18, '[18]', 201, 5.0),
  (19, '[19]', 399, 2.7),
  (20, '[20]', 186, 4.8);
*/

/*
INSERT INTO vec_movies(movie_id, synopsis_embedding, genre, num_reviews, mean_rating)
VALUES
  (1, '[1]', 'horror', 153, 4.6),
  (2, '[2]', 'comedy', 382, 2.6),
  (3, '[3]', 'scifi', 53, 5.0),
  (4, '[4]', 'fantasy', 210, 4.2),
  (5, '[5]', 'documentary', 93, 3.4),
  (6, '[6]', 'horror', 167, 4.7),
  (7, '[7]', 'comedy', 482, 2.9),
  (8, '[8]', 'scifi', 301, 5.0),
  (9, '[9]', 'fantasy', 134, 4.1),
  (10, '[10]', 'documentary', 66, 3.2),
  (11, '[11]', 'horror', 88, 4.9),
  (12, '[12]', 'comedy', 59, 2.8),
  (13, '[13]', 'scifi', 423, 4.5),
  (14, '[14]', 'fantasy', 275, 3.6),
  (15, '[15]', 'documentary', 191, 4.4),
  (16, '[16]', 'horror', 314, 4.3),
  (17, '[17]', 'comedy', 74, 3.0),
  (18, '[18]', 'scifi', 201, 5.0),
  (19, '[19]', 'fantasy', 399, 2.7),
  (20, '[20]', 'documentary', 186, 4.8);
*/

INSERT INTO vec_movies(movie_id, synopsis_embedding, genre, title, num_reviews, mean_rating)
VALUES
  (1, '[1]', 'horror', 'The Conjuring', 153, 4.6),
  (2, '[2]', 'comedy', 'Dumb and Dumber', 382, 2.6),
  (3, '[3]', 'scifi', 'Interstellar', 53, 5.0),
  (4, '[4]', 'fantasy', 'The Lord of the Rings: The Fellowship of the Ring', 210, 4.2),
  (5, '[5]', 'documentary', 'An Inconvenient Truth', 93, 3.4),
  (6, '[6]', 'horror', 'Hereditary', 167, 4.7),
  (7, '[7]', 'comedy', 'Anchorman: The Legend of Ron Burgundy', 482, 2.9),
  (8, '[8]', 'scifi', 'Blade Runner 2049', 301, 5.0),
  (9, '[9]', 'fantasy', 'Harry Potter and the Sorcerer''s Stone', 134, 4.1),
  (10, '[10]', 'documentary', 'Free Solo', 66, 3.2),
  (11, '[11]', 'horror', 'Get Out', 88, 4.9),
  (12, '[12]', 'comedy', 'The Hangover', 59, 2.8),
  (13, '[13]', 'scifi', 'The Matrix', 423, 4.5),
  (14, '[14]', 'fantasy', 'Pan''s Labyrinth', 275, 3.6),
  (15, '[15]', 'documentary', '13th', 191, 4.4),
  (16, '[16]', 'horror', 'It Follows', 314, 4.3),
  (17, '[17]', 'comedy', 'Step Brothers', 74, 3.0),
  (18, '[18]', 'scifi', 'Inception', 201, 5.0),
  (19, '[19]', 'fantasy', 'The Shape of Water', 399, 2.7),
  (20, '[20]', 'documentary', 'Won''t You Be My Neighbor?', 186, 4.8),
  (21, '[21]', 'scifi', 'Gravity', 342, 4.0),
  (22, '[22]', 'scifi', 'Dune', 451, 4.4),
  (23, '[23]', 'scifi', 'The Martian', 522, 4.6),
  (24, '[24]', 'horror', 'A Quiet Place', 271, 4.3),
  (25, '[25]', 'fantasy', 'The Chronicles of Narnia: The Lion, the Witch and the Wardrobe', 310, 3.9);

--select * from vec_movies;
--select * from vec_movies_metadata_chunks00;

create virtual table vec_chunks using vec0(
  user_id integer partition key,
  +contents text,
  contents_embedding float[1],
);

INSERT INTO vec_chunks (rowid, user_id, contents, contents_embedding) VALUES
  (1, 123, 'Our PTO policy allows employees to take both vacation and sick leave as needed.', '[1]'),
  (2, 123, 'Employees must provide notice at least two weeks in advance for planned vacations.', '[2]'),
  (3, 123, 'Sick leave can be taken without advance notice, but employees must inform their manager.', '[3]'),
  (4, 123, 'Unused PTO can be carried over to the following year, up to a maximum of 40 hours.', '[4]'),
  (5, 123, 'PTO must be used in increments of at least 4 hours.', '[5]'),
  (6, 456, 'New employees are granted 10 days of PTO during their first year of employment.', '[6]'),
  (7, 456, 'After the first year, employees earn an additional day of PTO for each year of service.', '[7]'),
  (8, 789, 'PTO requests will be reviewed by the HR department and are subject to approval.', '[8]'),
  (9, 789, 'The company reserves the right to deny PTO requests during peak operational periods.', '[9]'),
  (10, 456, 'If PTO is denied, the employee will be given an alternative time to take leave.', '[10]'),
  (11, 789, 'Employees who are out of PTO must request unpaid leave for any additional time off.', '[11]'),
  (12, 789, 'In case of a family emergency, employees can request emergency leave.', '[12]'),
  (13, 456, 'Emergency leave may be granted for personal or family illness, or other critical situations.', '[13]'),
  (14, 789, 'The maximum length of emergency leave is subject to company discretion.', '[14]'),
  (15, 123, 'All PTO balances will be displayed on the employee self-service portal.', '[15]'),
  (16, 456, 'Employees who are terminated will be paid for unused PTO, as per state law.', '[16]'),
  (17, 123, 'Part-time employees are eligible for PTO on a pro-rata basis.', '[17]'),
  (18, 789, 'The company encourages employees to use their PTO to maintain work-life balance.', '[18]'),
  (19, 456, 'Employees should not book travel plans until their PTO request has been approved.', '[19]'),
  (20, 123, 'Managers are responsible for tracking their team members'' PTO usage.', '[20]');

select rowid, user_id, contents, distance
from vec_chunks
where contents_embedding match '[19]'
  and user_id = 123
  and k = 5;

.exit

-- PARTITION KEY and auxiliary columns!
create virtual table vec_chunks using vec0(
  -- internally shard the vector index by user
  user_id integer partition key,
  -- store the chunk text pre-embedding as an "auxiliary column"
  +contents text,
  contents_embeddings float[1024],
);

select rowid, user_id, contents, distance
from vec_chunks
where contents_embedding match '[...]'
  and user_id = 123
  and k = 5;
/*
┌───────┬─────────┬──────────────────────────────────────────────────────────────┬──────────┐
│ rowid │ user_id │                           contents                           │ distance │
├───────┼─────────┼──────────────────────────────────────────────────────────────┼──────────┤
│ 20    │ 123     │ 'Managers are responsible for tracking their team members''  │ 1.0      │
│       │         │ PTO usage.'                                                  │          │
├───────┼─────────┼──────────────────────────────────────────────────────────────┼──────────┤
│ 17    │ 123     │ 'Part-time employees are eligible for PTO on a pro-rata basi │ 2.0      │
│       │         │ s.'                                                          │          │
├───────┼─────────┼──────────────────────────────────────────────────────────────┼──────────┤
│ 15    │ 123     │ 'All PTO balances will be displayed on the employee self-ser │ 4.0      │
│       │         │ vice portal.'                                                │          │
├───────┼─────────┼──────────────────────────────────────────────────────────────┼──────────┤
│ 5     │ 123     │ 'PTO must be used in increments of at least 4 hours.'        │ 14.0     │
├───────┼─────────┼──────────────────────────────────────────────────────────────┼──────────┤
│ 4     │ 123     │ 'Unused PTO can be carried over to the following year, up to │ 15.0     │
│       │         │ a maximum of 40 hours.'                                      │          │
└───────┴─────────┴──────────────────────────────────────────────────────────────┴──────────┘
*/

-- metadata filters!
create virtual table vec_movies using vec0(
  movie_id integer primary key,
  synopsis_embedding float[1024],
  genre text,
  num_reviews int,
  mean_rating float
);

select
  movie_id,
  title,
  genre,
  num_reviews,
  mean_rating,
  distance
from vec_movies
where synopsis_embedding match '[15.5]'
  and genre = 'scifi'
  and num_reviews between 100 and 500
  and mean_rating > 3.5
  and k = 5;
/*
┌──────────┬─────────────────────┬─────────┬─────────────┬──────────────────┬──────────┐
│ movie_id │        title        │  genre  │ num_reviews │   mean_rating    │ distance │
├──────────┼─────────────────────┼─────────┼─────────────┼──────────────────┼──────────┤
│ 13       │ 'The Matrix'        │ 'scifi' │ 423         │ 4.5              │ 2.5      │
│ 18       │ 'Inception'         │ 'scifi' │ 201         │ 5.0              │ 2.5      │
│ 21       │ 'Gravity'           │ 'scifi' │ 342         │ 4.0              │ 5.5      │
│ 22       │ 'Dune'              │ 'scifi' │ 451         │ 4.40000009536743 │ 6.5      │
│ 8        │ 'Blade Runner 2049' │ 'scifi' │ 301         │ 5.0              │ 7.5      │
└──────────┴─────────────────────┴─────────┴─────────────┴──────────────────┴──────────┘
*/

.exit

create virtual table vec_movies using vec0(
  movie_id integer primary key,
  synopsis_embedding float[768],
  genre text,
  num_reviews int,
  mean_rating float,
);

.exit

create virtual table vec_chunks using vec0(
  chunk_id integer primary key,
  contents_embedding float[1],
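The metadata-filtered KNN queries exercised above can be emulated with a brute-force sketch in pure Python. This only illustrates the query semantics (filter rows on metadata, then keep the k nearest by distance); it is not how the extension implements filtering internally:

```python
# Brute-force sketch of: WHERE embedding MATCH ? AND genre = ? AND k = ?
def knn(rows, query, k, predicate):
    # rows: (rowid, vector, metadata) triples; 1-D vectors keep it simple
    candidates = [
        (abs(vec[0] - query[0]), rowid, meta)  # L2 distance in one dimension
        for rowid, vec, meta in rows
        if predicate(meta)                     # metadata filter
    ]
    return sorted(candidates)[:k]

rows = [
    (3, [3.0], {"genre": "scifi"}),
    (11, [11.0], {"genre": "horror"}),
    (13, [13.0], {"genre": "scifi"}),
    (18, [18.0], {"genre": "scifi"}),
]
# nearest two scifi rows to 15.5: rowids 13 and 18, both at distance 2.5
print(knn(rows, [15.5], k=2, predicate=lambda m: m["genre"] == "scifi"))
```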
@@ -316,7 +316,7 @@
      'type': 'table',
      'name': 'sqlite_sequence',
      'tbl_name': 'sqlite_sequence',
-     'rootpage': 3,
+     'rootpage': 5,
      'sql': 'CREATE TABLE sqlite_sequence(name,seq)',
    }),
  ]),

@@ -326,18 +326,25 @@
  OrderedDict({
    'sql': 'select * from sqlite_master order by name',
    'rows': list([
+     OrderedDict({
+       'type': 'index',
+       'name': 'sqlite_autoindex_v_info_1',
+       'tbl_name': 'v_info',
+       'rootpage': 3,
+       'sql': None,
+     }),
      OrderedDict({
        'type': 'index',
        'name': 'sqlite_autoindex_v_vector_chunks00_1',
        'tbl_name': 'v_vector_chunks00',
-       'rootpage': 6,
+       'rootpage': 8,
        'sql': None,
      }),
      OrderedDict({
        'type': 'table',
        'name': 'sqlite_sequence',
        'tbl_name': 'sqlite_sequence',
-       'rootpage': 3,
+       'rootpage': 5,
        'sql': 'CREATE TABLE sqlite_sequence(name,seq)',
      }),
      OrderedDict({

@@ -351,28 +358,35 @@
        'type': 'table',
        'name': 'v_auxiliary',
        'tbl_name': 'v_auxiliary',
-       'rootpage': 7,
+       'rootpage': 9,
        'sql': 'CREATE TABLE "v_auxiliary"( rowid integer PRIMARY KEY , value00)',
      }),
      OrderedDict({
        'type': 'table',
        'name': 'v_chunks',
        'tbl_name': 'v_chunks',
-       'rootpage': 2,
+       'rootpage': 4,
        'sql': 'CREATE TABLE "v_chunks"(chunk_id INTEGER PRIMARY KEY AUTOINCREMENT,size INTEGER NOT NULL,validity BLOB NOT NULL,rowids BLOB NOT NULL)',
      }),
+     OrderedDict({
+       'type': 'table',
+       'name': 'v_info',
+       'tbl_name': 'v_info',
+       'rootpage': 2,
+       'sql': 'CREATE TABLE "v_info" (key text primary key, value any)',
+     }),
      OrderedDict({
        'type': 'table',
        'name': 'v_rowids',
        'tbl_name': 'v_rowids',
-       'rootpage': 4,
+       'rootpage': 6,
        'sql': 'CREATE TABLE "v_rowids"(rowid INTEGER PRIMARY KEY AUTOINCREMENT,id,chunk_id INTEGER,chunk_offset INTEGER)',
      }),
      OrderedDict({
        'type': 'table',
        'name': 'v_vector_chunks00',
        'tbl_name': 'v_vector_chunks00',
-       'rootpage': 5,
+       'rootpage': 7,
        'sql': 'CREATE TABLE "v_vector_chunks00"(rowid PRIMARY KEY,vectors BLOB NOT NULL)',
      }),
    ]),

@@ -409,25 +423,25 @@
  # ---
  # name: test_types.3
  dict({
-   'error': 'OperationalError',
+   'error': 'IntegrityError',
    'message': 'Auxiliary column type mismatch: The auxiliary column aux_int has type INTEGER, but TEXT was provided.',
  })
  # ---
  # name: test_types.4
  dict({
-   'error': 'OperationalError',
+   'error': 'IntegrityError',
    'message': 'Auxiliary column type mismatch: The auxiliary column aux_float has type FLOAT, but TEXT was provided.',
  })
  # ---
  # name: test_types.5
  dict({
-   'error': 'OperationalError',
+   'error': 'IntegrityError',
    'message': 'Auxiliary column type mismatch: The auxiliary column aux_text has type TEXT, but INTEGER was provided.',
  })
  # ---
  # name: test_types.6
  dict({
-   'error': 'OperationalError',
+   'error': 'IntegrityError',
    'message': 'Auxiliary column type mismatch: The auxiliary column aux_blob has type BLOB, but INTEGER was provided.',
  })
  # ---
184 tests/__snapshots__/test-general.ambr (new file)

@@ -0,0 +1,184 @@
# serializer version: 1
# name: test_info
  OrderedDict({
    'sql': 'select key, typeof(value) from v_info order by 1',
    'rows': list([
      OrderedDict({
        'key': 'CREATE_VERSION',
        'typeof(value)': 'text',
      }),
      OrderedDict({
        'key': 'CREATE_VERSION_MAJOR',
        'typeof(value)': 'integer',
      }),
      OrderedDict({
        'key': 'CREATE_VERSION_MINOR',
        'typeof(value)': 'integer',
      }),
      OrderedDict({
        'key': 'CREATE_VERSION_PATCH',
        'typeof(value)': 'integer',
      }),
    ]),
  })
# ---
# name: test_shadow
  OrderedDict({
    'sql': 'select * from sqlite_master order by name',
    'rows': list([
      OrderedDict({
        'type': 'index',
        'name': 'sqlite_autoindex_v_info_1',
        'tbl_name': 'v_info',
        'rootpage': 3,
        'sql': None,
      }),
      OrderedDict({
        'type': 'index',
        'name': 'sqlite_autoindex_v_metadatachunks00_1',
        'tbl_name': 'v_metadatachunks00',
        'rootpage': 10,
        'sql': None,
      }),
      OrderedDict({
        'type': 'index',
        'name': 'sqlite_autoindex_v_metadatatext00_1',
        'tbl_name': 'v_metadatatext00',
        'rootpage': 12,
        'sql': None,
      }),
      OrderedDict({
        'type': 'index',
        'name': 'sqlite_autoindex_v_vector_chunks00_1',
        'tbl_name': 'v_vector_chunks00',
        'rootpage': 8,
        'sql': None,
      }),
      OrderedDict({
        'type': 'table',
        'name': 'sqlite_sequence',
        'tbl_name': 'sqlite_sequence',
        'rootpage': 5,
        'sql': 'CREATE TABLE sqlite_sequence(name,seq)',
      }),
      OrderedDict({
        'type': 'table',
        'name': 'v',
        'tbl_name': 'v',
        'rootpage': 0,
        'sql': 'CREATE VIRTUAL TABLE v using vec0(a float[1], partition text partition key, metadata text, +name text, chunk_size=8)',
      }),
      OrderedDict({
        'type': 'table',
        'name': 'v_auxiliary',
        'tbl_name': 'v_auxiliary',
        'rootpage': 13,
        'sql': 'CREATE TABLE "v_auxiliary"( rowid integer PRIMARY KEY , value00)',
      }),
      OrderedDict({
        'type': 'table',
        'name': 'v_chunks',
        'tbl_name': 'v_chunks',
        'rootpage': 4,
        'sql': 'CREATE TABLE "v_chunks"(chunk_id INTEGER PRIMARY KEY AUTOINCREMENT,size INTEGER NOT NULL,sequence_id integer,partition00,validity BLOB NOT NULL, rowids BLOB NOT NULL)',
      }),
      OrderedDict({
        'type': 'table',
        'name': 'v_info',
        'tbl_name': 'v_info',
        'rootpage': 2,
        'sql': 'CREATE TABLE "v_info" (key text primary key, value any)',
      }),
      OrderedDict({
        'type': 'table',
        'name': 'v_metadatachunks00',
        'tbl_name': 'v_metadatachunks00',
        'rootpage': 9,
        'sql': 'CREATE TABLE "v_metadatachunks00"(rowid PRIMARY KEY, data BLOB NOT NULL)',
      }),
      OrderedDict({
        'type': 'table',
        'name': 'v_metadatatext00',
        'tbl_name': 'v_metadatatext00',
        'rootpage': 11,
        'sql': 'CREATE TABLE "v_metadatatext00"(rowid PRIMARY KEY, data TEXT)',
      }),
      OrderedDict({
        'type': 'table',
        'name': 'v_rowids',
        'tbl_name': 'v_rowids',
        'rootpage': 6,
        'sql': 'CREATE TABLE "v_rowids"(rowid INTEGER PRIMARY KEY AUTOINCREMENT,id,chunk_id INTEGER,chunk_offset INTEGER)',
      }),
      OrderedDict({
        'type': 'table',
        'name': 'v_vector_chunks00',
        'tbl_name': 'v_vector_chunks00',
        'rootpage': 7,
        'sql': 'CREATE TABLE "v_vector_chunks00"(rowid PRIMARY KEY,vectors BLOB NOT NULL)',
      }),
    ]),
  })
# ---
# name: test_shadow.1
  OrderedDict({
    'sql': "select * from pragma_table_list where type = 'shadow'",
    'rows': list([
      OrderedDict({
        'schema': 'main',
        'name': 'v_auxiliary',
        'type': 'shadow',
        'ncol': 2,
        'wr': 0,
        'strict': 0,
      }),
      OrderedDict({
        'schema': 'main',
        'name': 'v_chunks',
        'type': 'shadow',
        'ncol': 6,
        'wr': 0,
        'strict': 0,
      }),
      OrderedDict({
        'schema': 'main',
        'name': 'v_info',
        'type': 'shadow',
        'ncol': 2,
        'wr': 0,
        'strict': 0,
      }),
      OrderedDict({
        'schema': 'main',
        'name': 'v_rowids',
        'type': 'shadow',
        'ncol': 4,
        'wr': 0,
        'strict': 0,
      }),
      OrderedDict({
        'schema': 'main',
        'name': 'v_metadatachunks00',
        'type': 'shadow',
        'ncol': 2,
        'wr': 0,
        'strict': 0,
      }),
      OrderedDict({
        'schema': 'main',
        'name': 'v_metadatatext00',
        'type': 'shadow',
        'ncol': 2,
        'wr': 0,
        'strict': 0,
      }),
    ]),
  })
# ---
# name: test_shadow.2
  OrderedDict({
    'sql': "select * from pragma_table_list where type = 'shadow'",
    'rows': list([
    ]),
  })
# ---
4097	tests/__snapshots__/test-metadata.ambr (new file)
File diff suppressed because it is too large
1	tests/afbd/.gitignore (new file, vendored)
@@ -0,0 +1 @@
*.tgz

1	tests/afbd/.python-version (new file)
@@ -0,0 +1 @@
3.12
9	tests/afbd/Makefile (new file)
@@ -0,0 +1,9 @@
random_ints_1m.tgz:
	curl -o $@ https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_ints_1m.tgz

random_float_1m.tgz:
	curl -o $@ https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_float_1m.tgz

random_keywords_1m.tgz:
	curl -o $@ https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_keywords_1m.tgz

all: random_ints_1m.tgz random_float_1m.tgz random_keywords_1m.tgz
12	tests/afbd/README.md (new file)
@@ -0,0 +1,12 @@

# hnm

```
tar -xOzf hnm.tgz ./tests.jsonl > tests.jsonl
solite q "select group_concat(distinct key) from lines_read('tests.jsonl'), json_each(line -> '$.conditions.and[0]')"
```


```
> python test-afbd.py build hnm.tgz --metadata product_group_name,colour_group_name,index_group_name,perceived_colour_value_name,section_name,product_type_name,department_name,graphical_appearance_name,garment_group_name,perceived_colour_master_name
```
231	tests/afbd/test-afbd.py (new file)
@@ -0,0 +1,231 @@
import numpy as np
from tqdm import tqdm
from deepdiff import DeepDiff

import tarfile
import json
from io import BytesIO
import sqlite3
from typing import List
from struct import pack
import time
from pathlib import Path
import argparse


def serialize_float32(vector: List[float]) -> bytes:
    """Serializes a list of floats into the "raw bytes" format sqlite-vec expects"""
    return pack("%sf" % len(vector), *vector)


def build_command(file_path, metadata_set=None):
    if metadata_set:
        metadata_set = set(metadata_set.split(","))

    file_path = Path(file_path)
    print(f"reading {file_path}...")
    t0 = time.time()
    with tarfile.open(file_path, "r:gz") as archive:
        for file in archive:
            if file.name == "./payloads.jsonl":
                payloads = [
                    json.loads(line)
                    for line in archive.extractfile(file.name).readlines()
                ]
            if file.name == "./tests.jsonl":
                tests = [
                    json.loads(line)
                    for line in archive.extractfile(file.name).readlines()
                ]
            if file.name == "./vectors.npy":
                f = BytesIO()
                f.write(archive.extractfile(file.name).read())
                f.seek(0)
                vectors = np.load(f)

    assert payloads is not None
    assert tests is not None
    assert vectors is not None
    dimensions = vectors.shape[1]
    metadata_columns = sorted(list(payloads[0].keys()))

    def col_type(v):
        if isinstance(v, int):
            return "integer"
        if isinstance(v, float):
            return "float"
        if isinstance(v, str):
            return "text"
        raise Exception(f"Unknown column type: {v}")

    metadata_columns_types = [col_type(payloads[0][col]) for col in metadata_columns]

    print(time.time() - t0)
    t0 = time.time()
    print("seeding...")

    db = sqlite3.connect(f"{file_path.stem}.db")
    db.execute("PRAGMA page_size = 16384")
    db.row_factory = sqlite3.Row
    db.enable_load_extension(True)
    db.load_extension("../../dist/vec0")
    db.enable_load_extension(False)

    with db:
        db.execute("create table tests(data)")

        for test in tests:
            db.execute("insert into tests values (?)", [json.dumps(test)])

    with db:
        create_sql = f"create virtual table v using vec0(vector float[{dimensions}] distance_metric=cosine"
        insert_sql = "insert into v(rowid, vector"
        for name, type in zip(metadata_columns, metadata_columns_types):
            if metadata_set:
                if name in metadata_set:
                    create_sql += f", {name} {type}"
                else:
                    create_sql += f", +{name} {type}"
            else:
                create_sql += f", {name} {type}"

            insert_sql += f", {name}"
        create_sql += ")"
        insert_sql += ") values (" + ",".join("?" * (2 + len(metadata_columns))) + ")"
        print(create_sql)
        print(insert_sql)

        db.execute(create_sql)

        for idx, (payload, vector) in enumerate(
            tqdm(zip(payloads, vectors), total=len(payloads))
        ):
            params = [idx, vector]
            for c in metadata_columns:
                params.append(payload[c])
            db.execute(insert_sql, params)

    print(time.time() - t0)


def tests_command(file_path):
    file_path = Path(file_path)
    db = sqlite3.connect(f"{file_path.stem}.db")
    db.execute("PRAGMA cache_size = -100000000")
    db.row_factory = sqlite3.Row
    db.enable_load_extension(True)
    db.load_extension("../../dist/vec0")
    db.enable_load_extension(False)

    tests = [
        json.loads(row["data"])
        for row in db.execute("select data from tests").fetchall()
    ]

    num_or_skips = 0
    num_1off_errors = 0

    t0 = time.time()
    print("testing...")
    for idx, test in enumerate(tqdm(tests)):
        query = test["query"]
        conditions = test["conditions"]
        expected_closest_ids = test["closest_ids"]
        expected_closest_scores = test["closest_scores"]

        sql = "select rowid, 1 - distance as similarity from v where vector match ? and k = ?"
        params = [serialize_float32(query), len(expected_closest_ids)]

        if "and" in conditions:
            for condition in conditions["and"]:
                assert len(condition.keys()) == 1
                column = list(condition.keys())[0]
                assert len(list(condition[column].keys())) == 1
                condition_type = list(condition[column].keys())[0]
                if condition_type == "match":
                    value = condition[column]["match"]["value"]
                    sql += f" and {column} = ?"
                    params.append(value)
                elif condition_type == "range":
                    sql += f" and {column} between ? and ?"
                    params.append(condition[column]["range"]["gt"])
                    params.append(condition[column]["range"]["lt"])
                else:
                    raise Exception(f"Unknown condition type: {condition_type}")
        elif "or" in conditions:
            column = list(conditions["or"][0].keys())[0]
            condition_type = list(conditions["or"][0][column].keys())[0]
            assert condition_type == "match"
            sql += f" and {column} in ("
            for idx, condition in enumerate(conditions["or"]):
                if condition_type == "match":
                    value = condition[column]["match"]["value"]
                    if idx != 0:
                        sql += ","
                    sql += "?"
                    params.append(value)
                elif condition_type == "range":
                    breakpoint()
                else:
                    raise Exception(f"Unknown condition type: {condition_type}")
            sql += ")"

        # print(sql, params[1:])
        rows = db.execute(sql, params).fetchall()
        actual_closest_ids = [row["rowid"] for row in rows]
        matches = expected_closest_ids == actual_closest_ids
        if not matches:
            diff = DeepDiff(
                expected_closest_ids, actual_closest_ids, ignore_order=False
            )
            assert len(list(diff.keys())) == 1
            assert "values_changed" in diff.keys()
            keys_changed = list(diff["values_changed"].keys())
            if len(keys_changed) == 2:
                akey, bkey = keys_changed
                a = int(akey.lstrip("root[").rstrip("]"))
                b = int(bkey.lstrip("root[").rstrip("]"))
                assert abs(a - b) == 1
                assert (
                    diff["values_changed"][akey]["new_value"]
                    == diff["values_changed"][bkey]["old_value"]
                )
                assert (
                    diff["values_changed"][akey]["old_value"]
                    == diff["values_changed"][bkey]["new_value"]
                )
            elif len(keys_changed) == 1:
                v = int(keys_changed[0].lstrip("root[").rstrip("]"))
                assert (v + 1) == len(expected_closest_ids)
            else:
                raise Exception("unexpected diff shape")
            num_1off_errors += 1
        # print(closest_scores)
        # print([row["similarity"] for row in rows])
        # assert closest_scores == [row["similarity"] for row in rows]
    print("Number skipped: ", num_or_skips)
    print("Num 1 off errors: ", num_1off_errors)
    print("1 off error rate: ", num_1off_errors / (len(tests) - num_or_skips))
    print(time.time() - t0)
    print("done")


def main():
    parser = argparse.ArgumentParser(description="CLI tool")
    subparsers = parser.add_subparsers(dest="command", required=True)

    build_parser = subparsers.add_parser("build")
    build_parser.add_argument("file", type=str, help="Path to input file")
    build_parser.add_argument("--metadata", type=str, help="Metadata columns")
    build_parser.set_defaults(func=lambda args: build_command(args.file, args.metadata))

    tests_parser = subparsers.add_parser("test")
    tests_parser.add_argument("file", type=str, help="Path to input file")
    tests_parser.set_defaults(func=lambda args: tests_command(args.file))

    args = parser.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()
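The `serialize_float32` helper in `test-afbd.py` above can be sanity-checked on its own. A minimal round-trip sketch using only the standard library (the helper body is copied verbatim from the file; note `pack` uses native byte order, which is little-endian on the platforms these tests target):

```python
from struct import pack, unpack
from typing import List


def serialize_float32(vector: List[float]) -> bytes:
    """Serializes a list of floats into the "raw bytes" format sqlite-vec expects"""
    return pack("%sf" % len(vector), *vector)


# 0.25, 0.5, and -1.0 are exactly representable as float32, so the
# round-trip through 4-byte floats is lossless.
buf = serialize_float32([0.25, 0.5, -1.0])
assert len(buf) == 12  # 3 floats x 4 bytes each
assert unpack("3f", buf) == (0.25, 0.5, -1.0)
```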
@@ -55,7 +55,10 @@ def test_types(db, snapshot):
    )
    assert exec(db, "select * from v") == snapshot()

    # TODO: integrity test transaction failures in shadow tables
    db.commit()
    # bad types
    db.execute("BEGIN")
    assert (
        exec(db, INSERT, [b"\x11\x11\x11\x11", "not int", 1.2, "text", b"blob"])
        == snapshot()
@@ -66,6 +69,7 @@ def test_types(db, snapshot):
    )
    assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 1.2, 1, b"blob"]) == snapshot()
    assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 1.2, "text", 1]) == snapshot()
    db.execute("ROLLBACK")

    # NULLs are totally chill
    assert exec(db, INSERT, [b"\x11\x11\x11\x11", None, None, None, None]) == snapshot()
@@ -151,5 +155,7 @@ def vec0_shadow_table_contents(db, v):
    ]
    o = {}
    for shadow_table in shadow_tables:
        if shadow_table.endswith("_info"):
            continue
        o[shadow_table] = exec(db, f"select * from {shadow_table}")
    return o
|||
60
tests/test-general.py
Normal file
60
tests/test-general.py
Normal file
|
|
@ -0,0 +1,60 @@
|
|||
import sqlite3
from collections import OrderedDict
import pytest


@pytest.mark.skipif(
    sqlite3.sqlite_version_info[1] < 37,
    reason="pragma_table_list was added in SQLite 3.37",
)
def test_shadow(db, snapshot):
    db.execute(
        "create virtual table v using vec0(a float[1], partition text partition key, metadata text, +name text, chunk_size=8)"
    )
    assert exec(db, "select * from sqlite_master order by name") == snapshot()
    assert (
        exec(db, "select * from pragma_table_list where type = 'shadow'") == snapshot()
    )

    db.execute("drop table v;")
    assert (
        exec(db, "select * from pragma_table_list where type = 'shadow'") == snapshot()
    )


def test_info(db, snapshot):
    db.execute("create virtual table v using vec0(a float[1])")
    assert exec(db, "select key, typeof(value) from v_info order by 1") == snapshot()


def exec(db, sql, parameters=[]):
    try:
        rows = db.execute(sql, parameters).fetchall()
    except (sqlite3.OperationalError, sqlite3.DatabaseError) as e:
        return {
            "error": e.__class__.__name__,
            "message": str(e),
        }
    a = []
    for row in rows:
        o = OrderedDict()
        for k in row.keys():
            o[k] = row[k]
        a.append(o)
    result = OrderedDict()
    result["sql"] = sql
    result["rows"] = a
    return result


def vec0_shadow_table_contents(db, v):
    shadow_tables = [
        row[0]
        for row in db.execute(
            "select name from sqlite_master where name like ? order by 1", [f"{v}_%"]
        ).fetchall()
    ]
    o = {}
    for shadow_table in shadow_tables:
        o[shadow_table] = exec(db, f"select * from {shadow_table}")
    return o
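The `exec` helper in `tests/test-general.py` deliberately converts SQLite exceptions into plain dicts, so failing statements can be snapshotted as data just like successful ones. A standalone demonstration against an in-memory database (helper body copied from the file; renamed `exec_` here only to avoid shadowing the builtin outside the test module):

```python
import sqlite3
from collections import OrderedDict


def exec_(db, sql, parameters=[]):
    # On error, return an {"error", "message"} dict instead of raising,
    # so snapshot tests can capture expected failures.
    try:
        rows = db.execute(sql, parameters).fetchall()
    except (sqlite3.OperationalError, sqlite3.DatabaseError) as e:
        return {"error": e.__class__.__name__, "message": str(e)}
    a = []
    for row in rows:
        o = OrderedDict()
        for k in row.keys():
            o[k] = row[k]
        a.append(o)
    result = OrderedDict()
    result["sql"] = sql
    result["rows"] = a
    return result


db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row

ok = exec_(db, "select 1 as one")
assert ok["rows"] == [OrderedDict(one=1)]

bad = exec_(db, "select * from no_such_table")
assert bad["error"] == "OperationalError"
```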
@@ -1022,6 +1022,7 @@ def test_vec0_drops():
    ] == [
        "t1",
        "t1_chunks",
        "t1_info",
        "t1_rowids",
        "t1_vector_chunks00",
        "t1_vector_chunks01",
@@ -2216,6 +2217,9 @@ def test_smoke():
        {
            "name": "vec_xyz_chunks",
        },
        {
            "name": "vec_xyz_info",
        },
        {
            "name": "vec_xyz_rowids",
        },
629	tests/test-metadata.py (new file)
@@ -0,0 +1,629 @@
import pytest
|
||||
import sqlite3
|
||||
from collections import OrderedDict
|
||||
import json
|
||||
|
||||
|
||||
def test_constructor_limit(db, snapshot):
|
||||
assert exec(
|
||||
db,
|
||||
f"""
|
||||
create virtual table v using vec0(
|
||||
{",".join([f"metadata{x} integer" for x in range(17)])}
|
||||
v float[1]
|
||||
)
|
||||
""",
|
||||
) == snapshot(name="max 16 metadata columns")
|
||||
|
||||
|
||||
def test_normal(db, snapshot):
|
||||
db.execute(
|
||||
"create virtual table v using vec0(vector float[1], b boolean, n int, f float, t text, chunk_size=8)"
|
||||
)
|
||||
assert exec(
|
||||
db, "select * from sqlite_master where type = 'table' order by name"
|
||||
) == snapshot(name="sqlite_master")
|
||||
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot()
|
||||
|
||||
INSERT = "insert into v(vector, b, n, f, t) values (?, ?, ?, ?, ?)"
|
||||
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 1, 1.1, "one"]) == snapshot()
|
||||
assert exec(db, INSERT, [b"\x22\x22\x22\x22", 1, 2, 2.2, "two"]) == snapshot()
|
||||
assert exec(db, INSERT, [b"\x33\x33\x33\x33", 1, 3, 3.3, "three"]) == snapshot()
|
||||
|
||||
assert exec(db, "select * from v") == snapshot()
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot()
|
||||
|
||||
assert exec(db, "drop table v") == snapshot()
|
||||
assert exec(db, "select * from sqlite_master") == snapshot()
|
||||
|
||||
|
||||
#
|
||||
# assert exec(db, "select * from v") == snapshot()
|
||||
# assert vec0_shadow_table_contents(db, "v") == snapshot()
|
||||
#
|
||||
# db.execute("drop table v;")
|
||||
# assert exec(db, "select * from sqlite_master order by name") == snapshot(
|
||||
# name="sqlite_master post drop"
|
||||
# )
|
||||
|
||||
|
||||
def test_text_knn(db, snapshot):
|
||||
db.execute(
|
||||
"create virtual table v using vec0(vector float[1], name text, chunk_size=8)"
|
||||
)
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot()
|
||||
INSERT = "insert into v(vector, name) values (?, ?)"
|
||||
db.execute(
|
||||
"""
|
||||
INSERT INTO v(vector, name) VALUES
|
||||
('[.11]', 'aaa'),
|
||||
('[.22]', 'bbb'),
|
||||
('[.33]', 'ccc'),
|
||||
('[.44]', 'ddd'),
|
||||
('[.55]', 'eee'),
|
||||
('[.66]', 'fff'),
|
||||
('[.77]', 'ggg'),
|
||||
('[.88]', 'hhh'),
|
||||
('[.99]', 'iii');
|
||||
"""
|
||||
)
|
||||
assert exec(db, "select * from v") == snapshot()
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot()
|
||||
|
||||
assert (
|
||||
exec(
|
||||
db,
|
||||
"select rowid, name, distance from v where vector match '[1]' and k = 5",
|
||||
)
|
||||
== snapshot()
|
||||
)
|
||||
|
||||
assert (
|
||||
exec(
|
||||
db,
|
||||
"select rowid, name, distance from v where vector match '[1]' and k = 5 and name < 'ddd'",
|
||||
)
|
||||
== snapshot()
|
||||
)
|
||||
assert (
|
||||
exec(
|
||||
db,
|
||||
"select rowid, name, distance from v where vector match '[1]' and k = 5 and name <= 'ddd'",
|
||||
)
|
||||
== snapshot()
|
||||
)
|
||||
assert (
|
||||
exec(
|
||||
db,
|
||||
"select rowid, name, distance from v where vector match '[1]' and k = 5 and name > 'fff'",
|
||||
)
|
||||
== snapshot()
|
||||
)
|
||||
assert (
|
||||
exec(
|
||||
db,
|
||||
"select rowid, name, distance from v where vector match '[1]' and k = 5 and name >= 'fff'",
|
||||
)
|
||||
== snapshot()
|
||||
)
|
||||
assert (
|
||||
exec(
|
||||
db,
|
||||
"select rowid, name, distance from v where vector match '[1]' and k = 5 and name = 'aaa'",
|
||||
)
|
||||
== snapshot()
|
||||
)
|
||||
assert (
|
||||
exec(
|
||||
db,
|
||||
"select rowid, name, distance from v where vector match '[.01]' and k = 5 and name != 'aaa'",
|
||||
)
|
||||
== snapshot()
|
||||
)
|
||||
|
||||
|
||||
def test_long_text_updates(db, snapshot):
|
||||
db.execute(
|
||||
"create virtual table v using vec0(vector float[1], name text, chunk_size=8)"
|
||||
)
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot()
|
||||
INSERT = "insert into v(vector, name) values (?, ?)"
|
||||
exec(db, INSERT, [b"\x11\x11\x11\x11", "123456789a12"])
|
||||
exec(db, INSERT, [b"\x11\x11\x11\x11", "123456789a123"])
|
||||
assert exec(db, "select * from v") == snapshot()
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot()
|
||||
|
||||
|
||||
def test_long_text_knn(db, snapshot):
|
||||
db.execute(
|
||||
"create virtual table v using vec0(vector float[1], name text, chunk_size=8)"
|
||||
)
|
||||
INSERT = "insert into v(vector, name) values (?, ?)"
|
||||
exec(db, INSERT, ["[1]", "aaaa"])
|
||||
exec(db, INSERT, ["[2]", "aaaaaaaaaaaa_aaa"])
|
||||
exec(db, INSERT, ["[3]", "bbbb"])
|
||||
exec(db, INSERT, ["[4]", "bbbbbbbbbbbb_bbb"])
|
||||
exec(db, INSERT, ["[5]", "cccc"])
|
||||
exec(db, INSERT, ["[6]", "cccccccccccc_ccc"])
|
||||
|
||||
tests = [
|
||||
"bbbb",
|
||||
"bb",
|
||||
"bbbbbb",
|
||||
"bbbbbbbbbbbb_bbb",
|
||||
"bbbbbbbbbbbb_aaa",
|
||||
"bbbbbbbbbbbb_ccc",
|
||||
"longlonglonglonglonglonglong",
|
||||
]
|
||||
ops = ["=", "!=", "<", "<=", ">", ">="]
|
||||
op_names = ["eq", "ne", "lt", "le", "gt", "ge"]
|
||||
|
||||
for test in tests:
|
||||
for op, op_name in zip(ops, op_names):
|
||||
assert exec(
|
||||
db,
|
||||
f"select rowid, name, distance from v where vector match '[100]' and k = 5 and name {op} ?",
|
||||
[test],
|
||||
) == snapshot(name=f"{op_name}-{test}")
|
||||
|
||||
|
||||
def test_types(db, snapshot):
|
||||
db.execute(
|
||||
"create virtual table v using vec0(vector float[1], b boolean, n int, f float, t text, chunk_size=8)"
|
||||
)
|
||||
INSERT = "insert into v(vector, b, n, f, t) values (?, ?, ?, ?, ?)"
|
||||
|
||||
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 1, 1.1, "test"]) == snapshot(
|
||||
name="legal"
|
||||
)
|
||||
|
||||
# fmt: off
|
||||
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 'illegal', 1, 1.1, 'test']) == snapshot(name="illegal-type-boolean")
|
||||
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 'illegal', 1.1, 'test']) == snapshot(name="illegal-type-int")
|
||||
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 1, 'illegal', 'test']) == snapshot(name="illegal-type-float")
|
||||
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 1, 1, 1.1, 420]) == snapshot(name="illegal-type-text")
|
||||
# fmt: on
|
||||
|
||||
assert exec(db, INSERT, [b"\x11\x11\x11\x11", 44, 1, 1.1, "test"]) == snapshot(
|
||||
name="illegal-boolean"
|
||||
)
|
||||
|
||||
|
||||
def test_updates(db, snapshot):
|
||||
db.execute(
|
||||
"create virtual table v using vec0(vector float[1], b boolean, n int, f float, t text, chunk_size=8)"
|
||||
)
|
||||
INSERT = "insert into v(rowid, vector, b, n, f, t) values (?, ?, ?, ?, ?, ?)"
|
||||
|
||||
exec(db, INSERT, [1, b"\x11\x11\x11\x11", 1, 1, 1.1, "test1"])
|
||||
exec(db, INSERT, [2, b"\x22\x22\x22\x22", 1, 2, 2.2, "test2"])
|
||||
exec(db, INSERT, [3, b"\x33\x33\x33\x33", 1, 3, 3.3, "1234567890123"])
|
||||
assert exec(db, "select * from v") == snapshot(name="1-init-contents")
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot(name="1-init-shadow")
|
||||
|
||||
assert exec(
|
||||
db, "UPDATE v SET b = 0, n = 11, f = 11.11, t = 'newtest1' where rowid = 1"
|
||||
)
|
||||
assert exec(db, "select * from v") == snapshot(name="general-update-contents")
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot(
|
||||
name="general-update-shaodnw"
|
||||
)
|
||||
|
||||
# string update #1: long string updated to long string
|
||||
exec(db, "UPDATE v SET t = '1234567890123-updated' where rowid = 3")
|
||||
assert exec(db, "select * from v") == snapshot(name="string-update-1-contents")
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot(
|
||||
name="string-update-1-shadow"
|
||||
)
|
||||
|
||||
# string update #2: short string updated to short string
|
||||
exec(db, "UPDATE v SET t = 'test2-short' where rowid = 2")
|
||||
assert exec(db, "select * from v") == snapshot(name="string-update-2-contents")
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot(
|
||||
name="string-update-2-shadow"
|
||||
)
|
||||
|
||||
# string update #3: short string updated to long string
|
||||
exec(db, "UPDATE v SET t = 'test2-long-long-long' where rowid = 2")
|
||||
assert exec(db, "select * from v") == snapshot(name="string-update-3-contents")
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot(
|
||||
name="string-update-3-shadow"
|
||||
)
|
||||
|
||||
# string update #4: long string updated to short string
|
||||
exec(db, "UPDATE v SET t = 'test2-shortx' where rowid = 2")
|
||||
assert exec(db, "select * from v") == snapshot(name="string-update-4-contents")
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot(
|
||||
name="string-update-4-shadow"
|
||||
)
|
||||
|
||||
|
||||
def test_deletes(db, snapshot):
|
||||
db.execute(
|
||||
"create virtual table v using vec0(vector float[1], b boolean, n int, f float, t text, chunk_size=8)"
|
||||
)
|
||||
INSERT = "insert into v(rowid, vector, b, n, f, t) values (?, ?, ?, ?, ?, ?)"
|
||||
|
||||
assert exec(db, INSERT, [1, b"\x11\x11\x11\x11", 1, 1, 1.1, "test1"]) == snapshot()
|
||||
assert exec(db, INSERT, [2, b"\x22\x22\x22\x22", 1, 2, 2.2, "test2"]) == snapshot()
|
||||
assert (
|
||||
exec(db, INSERT, [3, b"\x33\x33\x33\x33", 1, 3, 3.3, "1234567890123"])
|
||||
== snapshot()
|
||||
)
|
||||
|
||||
assert exec(db, "select * from v") == snapshot()
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot()
|
||||
|
||||
assert exec(db, "DELETE FROM v where rowid = 1") == snapshot()
|
||||
assert exec(db, "select * from v") == snapshot()
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot()
|
||||
|
||||
assert exec(db, "DELETE FROM v where rowid = 3") == snapshot()
|
||||
assert exec(db, "select * from v") == snapshot()
|
||||
assert vec0_shadow_table_contents(db, "v") == snapshot()
|
||||
|
||||
|
||||
def test_knn(db, snapshot):
|
||||
db.execute(
|
||||
"create virtual table v using vec0(vector float[1], name text, chunk_size=8)"
|
||||
)
|
||||
assert exec(
|
||||
db, "select * from sqlite_master where type = 'table' order by name"
|
||||
) == snapshot(name="sqlite_master")
|
||||
db.executemany(
|
||||
"insert into v(vector, name) values (?, ?)",
|
||||
[("[1]", "alex"), ("[2]", "brian"), ("[3]", "craig")],
|
||||
)
|
||||
|
||||
# EVIDENCE-OF: V16511_00582 catches "illegal" constraints on metadata columns
|
||||
assert (
|
||||
exec(
|
||||
db,
|
||||
"select *, distance from v where vector match '[5]' and k = 3 and name like 'illegal'",
|
||||
)
|
||||
== snapshot()
|
||||
)
|
||||
|
||||
|
||||
SUPPORTS_VTAB_IN = sqlite3.sqlite_version_info[1] >= 38
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
not SUPPORTS_VTAB_IN, reason="requires vtab `x in (...)` support in SQLite >=3.38"
|
||||
)
|
||||
def test_vtab_in(db, snapshot):
|
||||
db.execute(
|
||||
"create virtual table v using vec0(vector float[1], n int, t text, b boolean, f float, chunk_size=8)"
|
||||
)
|
||||
db.executemany(
|
||||
"insert into v(rowid, vector, n, t, b, f) values (?, ?, ?, ?, ?, ?)",
|
||||
[
|
||||
(1, "[1]", 999, "aaaa", 0, 1.1),
|
||||
(2, "[2]", 555, "aaaa", 0, 1.1),
|
||||
(3, "[3]", 999, "aaaa", 0, 1.1),
|
||||
(4, "[4]", 555, "aaaa", 0, 1.1),
|
||||
(5, "[5]", 999, "zzzz", 0, 1.1),
|
||||
(6, "[6]", 555, "zzzz", 0, 1.1),
|
||||
(7, "[7]", 999, "zzzz", 0, 1.1),
|
||||
(8, "[8]", 555, "zzzz", 0, 1.1),
|
||||
],
|
||||
)
|
||||
|
||||
# EVIDENCE-OF: V15248_32086
|
||||
assert exec(
|
||||
db, "select * from v where vector match '[0]' and k = 8 and b in (1, 0)"
|
||||
) == snapshot(name="block-bool")
|
||||
|
||||
assert exec(
|
||||
db, "select * from v where vector match '[0]' and k = 8 and f in (1.1, 0.0)"
|
||||
) == snapshot(name="block-float")
|
||||
|
||||
assert exec(
|
||||
db,
|
||||
"select rowid, n, distance from v where vector match '[0]' and k = 8 and n in (555, 999)",
|
||||
) == snapshot(name="allow-int-all")
|
||||
assert exec(
|
||||
db,
|
||||
"select rowid, n, distance from v where vector match '[0]' and k = 8 and n in (555, -1, -2)",
|
||||
) == snapshot(name="allow-int-superfluous")
|
||||
|
||||
assert exec(
|
||||
db,
|
||||
"select rowid, t, distance from v where vector match '[0]' and k = 8 and t in ('aaaa', 'zzzz')",
|
||||
) == snapshot(name="allow-text-all")
|
||||
assert exec(
|
||||
db,
|
||||
"select rowid, t, distance from v where vector match '[0]' and k = 8 and t in ('aaaa', 'foo', 'bar')",
|
||||
) == snapshot(name="allow-text-superfluous")
|
||||
|
||||
|
||||
def test_vtab_in_long_text(db, snapshot):
|
||||
db.execute(
|
||||
"create virtual table v using vec0(vector float[1], t text, chunk_size=8)"
|
||||
)
|
||||
data = [
|
||||
(1, "aaaa"),
|
||||
(2, "aaaaaaaaaaaa_aaa"),
|
||||
(3, "bbbb"),
|
||||
(4, "bbbbbbbbbbbb_bbb"),
|
||||
(5, "cccc"),
|
||||
(6, "cccccccccccc_ccc"),
|
||||
]
|
||||
db.executemany(
|
||||
"insert into v(rowid, vector, t) values (:rowid, printf('[%d]', :rowid), :vector)",
|
||||
[{"rowid": row[0], "vector": row[1]} for row in data],
|
||||
)
|
||||
|
||||
for _, lookup in data:
|
||||
assert exec(
|
||||
db,
|
||||
"select rowid, t from v where vector match '[0]' and k = 10 and t in (?, 'nonsense')",
|
||||
[lookup],
|
||||
) == snapshot(name=f"individual-{lookup}")
|
||||
|
||||
assert exec(
|
||||
db,
|
||||
"select rowid, t from v where vector match '[0]' and k = 10 and t in (select value from json_each(?))",
|
||||
[json.dumps([row[1] for row in data])],
|
||||
) == snapshot(name="all")
|
||||
|
||||
|
||||
def test_idxstr(db, snapshot):
|
||||
db.execute(
|
||||
"""
|
||||
create virtual table vec_movies using vec0(
|
||||
movie_id integer primary key,
|
||||
synopsis_embedding float[1],
|
||||
+title text,
|
||||
is_favorited boolean,
|
||||
genre text,
|
||||
num_reviews int,
|
||||
mean_rating float,
|
||||
chunk_size=8
|
||||
);
|
||||
"""
|
||||
)
|
||||
|
||||
assert (
|
||||
eqp(
|
||||
db,
|
||||
"select * from vec_movies where synopsis_embedding match '' and k = 0 and is_favorited = true",
|
||||
)
|
||||
== snapshot()
|
||||
)
|
||||
|
||||
ops = ["<", ">", "<=", ">=", "!="]
|
||||
|
||||
for op in ops:
|
||||
assert eqp(
|
||||
db,
|
||||
f"select * from vec_movies where synopsis_embedding match '' and k = 0 and genre {op} NULL",
|
||||
) == snapshot(name=f"knn-constraint-text {op}")
|
||||
|
||||
for op in ops:
|
||||
assert eqp(
|
||||
db,
|
||||
f"select * from vec_movies where synopsis_embedding match '' and k = 0 and num_reviews {op} NULL",
|
||||
) == snapshot(name=f"knn-constraint-int {op}")
|
||||
|
||||
for op in ops:
|
||||
assert eqp(
|
||||
db,
|
||||
f"select * from vec_movies where synopsis_embedding match '' and k = 0 and mean_rating {op} NULL",
|
||||
) == snapshot(name=f"knn-constraint-float {op}")
|
||||
|
||||
# for op in ops:
|
||||
# assert eqp(
|
||||
# db,
|
||||
# f"select * from vec_movies where synopsis_embedding match '' and k = 0 and is_favorited {op} NULL",
|
||||
# ) == snapshot(name=f"knn-constraint-boolean {op}")


def eqp(db, sql):
    o = OrderedDict()
    o["sql"] = sql
    o["plan"] = [
        dict(row) for row in db.execute(f"explain query plan {sql}").fetchall()
    ]
    for p in o["plan"]:
        # value is different on macos-aarch64 in github actions, not sure why
        del p["notused"]
    return o


def test_stress(db, snapshot):
    db.execute(
        """
          create virtual table vec_movies using vec0(
            movie_id integer primary key,
            synopsis_embedding float[1],
            +title text,
            is_favorited boolean,
            genre text,
            num_reviews int,
            mean_rating float,
            chunk_size=8
          );
        """
    )

    db.execute(
        """
          INSERT INTO vec_movies(movie_id, synopsis_embedding, is_favorited, genre, title, num_reviews, mean_rating)
          VALUES
            (1, '[1]', 0, 'horror', 'The Conjuring', 153, 4.6),
            (2, '[2]', 0, 'comedy', 'Dumb and Dumber', 382, 2.6),
            (3, '[3]', 0, 'scifi', 'Interstellar', 53, 5.0),
            (4, '[4]', 0, 'fantasy', 'The Lord of the Rings: The Fellowship of the Ring', 210, 4.2),
            (5, '[5]', 1, 'documentary', 'An Inconvenient Truth', 93, 3.4),
            (6, '[6]', 1, 'horror', 'Hereditary', 167, 4.7),
            (7, '[7]', 1, 'comedy', 'Anchorman: The Legend of Ron Burgundy', 482, 2.9),
            (8, '[8]', 0, 'scifi', 'Blade Runner 2049', 301, 5.0),
            (9, '[9]', 1, 'fantasy', 'Harry Potter and the Sorcerer''s Stone', 134, 4.1),
            (10, '[10]', 0, 'documentary', 'Free Solo', 66, 3.2),
            (11, '[11]', 1, 'horror', 'Get Out', 88, 4.9),
            (12, '[12]', 0, 'comedy', 'The Hangover', 59, 2.8),
            (13, '[13]', 1, 'scifi', 'The Matrix', 423, 4.5),
            (14, '[14]', 0, 'fantasy', 'Pan''s Labyrinth', 275, 3.6),
            (15, '[15]', 1, 'documentary', '13th', 191, 4.4),
            (16, '[16]', 0, 'horror', 'It Follows', 314, 4.3),
            (17, '[17]', 1, 'comedy', 'Step Brothers', 74, 3.0),
            (18, '[18]', 1, 'scifi', 'Inception', 201, 5.0),
            (19, '[19]', 1, 'fantasy', 'The Shape of Water', 399, 2.7),
            (20, '[20]', 1, 'documentary', 'Won''t You Be My Neighbor?', 186, 4.8),
            (21, '[21]', 1, 'scifi', 'Gravity', 342, 4.0),
            (22, '[22]', 1, 'scifi', 'Dune', 451, 4.4),
            (23, '[23]', 1, 'scifi', 'The Martian', 522, 4.6),
            (24, '[24]', 1, 'horror', 'A Quiet Place', 271, 4.3),
            (25, '[25]', 1, 'fantasy', 'The Chronicles of Narnia: The Lion, the Witch and the Wardrobe', 310, 3.9);
        """
    )

    assert vec0_shadow_table_contents(db, "vec_movies") == snapshot()
    assert (
        exec(
            db,
            """
              select
                movie_id,
                title,
                genre,
                num_reviews,
                mean_rating,
                is_favorited,
                distance
              from vec_movies
              where synopsis_embedding match '[15.5]'
                and genre = 'scifi'
                and num_reviews between 100 and 500
                and mean_rating > 3.5
                and k = 5;
            """,
        )
        == snapshot()
    )

    assert (
        exec(
            db,
            "select movie_id, genre, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and genre = 'horror'",
        )
        == snapshot()
    )
    assert (
        exec(
            db,
            "select movie_id, genre, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and genre = 'comedy'",
        )
        == snapshot()
    )
    assert (
        exec(
            db,
            "select movie_id, num_reviews, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and num_reviews between 100 and 500",
        )
        == snapshot()
    )
    assert (
        exec(
            db,
            "select movie_id, num_reviews, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and num_reviews >= 500",
        )
        == snapshot()
    )
    assert (
        exec(
            db,
            "select movie_id, mean_rating, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and mean_rating < 3.0",
        )
        == snapshot()
    )
    assert (
        exec(
            db,
            "select movie_id, mean_rating, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and mean_rating between 4.0 and 5.0",
        )
        == snapshot()
    )

    assert exec(
        db,
        "select movie_id, is_favorited, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and is_favorited = TRUE",
    ) == snapshot(name="bool-eq-true")
    assert exec(
        db,
        "select movie_id, is_favorited, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and is_favorited != TRUE",
    ) == snapshot(name="bool-ne-true")
    assert exec(
        db,
        "select movie_id, is_favorited, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and is_favorited = FALSE",
    ) == snapshot(name="bool-eq-false")
    assert exec(
        db,
        "select movie_id, is_favorited, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and is_favorited != FALSE",
    ) == snapshot(name="bool-ne-false")

    # EVIDENCE-OF: V10145_26984
    assert exec(
        db,
        "select movie_id, is_favorited, distance from vec_movies where synopsis_embedding match '[100]' and k = 5 and is_favorited >= 999",
    ) == snapshot(name="bool-other-op")


def test_errors(db, snapshot):
    db.execute("create virtual table v using vec0(vector float[1], t text)")
    db.execute("insert into v(vector, t) values ('[1]', 'aaaaaaaaaaaax')")

    assert exec(db, "select * from v") == snapshot()

    # EVIDENCE-OF: V15466_32305
    db.set_authorizer(
        authorizer_deny_on(sqlite3.SQLITE_READ, "v_metadatatext00", "data")
    )
    assert exec(db, "select * from v") == snapshot()


def authorizer_deny_on(operation, x1, x2=None):
    def _auth(op, p1, p2, p3, p4):
        if op == operation and p1 == x1 and p2 == x2:
            return sqlite3.SQLITE_DENY
        return sqlite3.SQLITE_OK

    return _auth


def exec(db, sql, parameters=[]):
    try:
        rows = db.execute(sql, parameters).fetchall()
    except (sqlite3.OperationalError, sqlite3.DatabaseError) as e:
        return {
            "error": e.__class__.__name__,
            "message": str(e),
        }
    a = []
    for row in rows:
        o = OrderedDict()
        for k in row.keys():
            o[k] = row[k]
        a.append(o)
    result = OrderedDict()
    result["sql"] = sql
    result["rows"] = a
    return result


def vec0_shadow_table_contents(db, v):
    shadow_tables = [
        row[0]
        for row in db.execute(
            "select name from sqlite_master where name like ? order by 1", [f"{v}_%"]
        ).fetchall()
    ]
    o = {}
    for shadow_table in shadow_tables:
        if shadow_table.endswith("_info"):
            continue
        o[shadow_table] = exec(db, f"select * from {shadow_table}")
    return o