mirror of
https://github.com/asg017/sqlite-vec.git
synced 2026-04-25 16:56:27 +02:00
810 lines
33 KiB
Text
810 lines
33 KiB
Text
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# NBC News Headlines: Building FTS5 + `vec0` indexes\n",
|
||
"\n",
|
||
"Using the dataset built in [the previous `./1_scrape.ipynb` notebook](./1_scrape.ipynb), \n",
|
||
"this notebook will enrich that dataset with a full-text search index and a semantic search index,\n",
|
||
"using [FTS5](https://www.sqlite.org/fts5.html), \n",
|
||
"[`sqlite-vec`](https://github.com/asg017/sqlite-vec), and \n",
|
||
"[`sqlite-lembed`](https://github.com/asg017/sqlite-lembed).\n",
|
||
"\n",
|
||
"This example will use pure SQL for everything. You can do the same exact thing in Python/JavaScript/Go/Rust/etc., or use\n",
|
||
"your own embeddings providers like Ollama/llamafile/OpenAI/etc. The core mechanics of FTS5 and `sqlite-vec` will remain the same. \n",
|
||
"\n",
|
||
"We will use the [Snowflake Artic Embed v1.5](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5) embeddings model to generate embeddings. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"[no code]"
|
||
]
|
||
},
|
||
"execution_count": 1,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
".open tmp-artic2.db"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Step 1: Create a FTS5 index\n",
|
||
"\n",
|
||
"Creating a full-text search index is as simple as 3 SQL commands! We already have the headlines stored in the `articles` \n",
|
||
"table under the `headline` column, so it's just a matter of initializing the FTS5 virtual table and inserting the data."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<table>\n",
|
||
"<thead>\n",
|
||
"<tr style=\"text-align: center;\">\n",
|
||
"</tr>\n",
|
||
"</thead>\n",
|
||
"<tbody>\n",
|
||
"</tbody>\n",
|
||
"</table>\n",
|
||
"<div style=\"text-align: right;\">\n",
|
||
"0 row × 0 column\n",
|
||
"</div>\n",
|
||
"</div>\n"
|
||
],
|
||
"text/plain": [
|
||
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
|
||
]
|
||
},
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"create virtual table fts_articles using fts5(\n",
|
||
" headline,\n",
|
||
" content='articles', content_rowid='id'\n",
|
||
");\n",
|
||
"\n",
|
||
"insert into fts_articles(rowid, headline)\n",
|
||
" select rowid, headline\n",
|
||
" from articles;\n",
|
||
"\n",
|
||
"insert into fts_articles(fts_articles) values('optimize');"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"By convention we name the FTS5 table `fts_articles`, where the `fts_` prefix says \"this virtual table is full-text search of the `articles` table\". We are only searching the `headline` column, the rest can be ignored. \n",
|
||
"\n",
|
||
"Here we are using the [\"external content tables\"](https://www.sqlite.org/fts5.html#external_content_tables)\n",
|
||
"feature in FTS5 tables, which will avoid storing the headlines a 2nd time, since they already exist in the `articles` table. \n",
|
||
"This part isn't required, but saves us a bit of storage. \n",
|
||
"\n",
|
||
"We also use the [`'optimize'`](https://www.sqlite.org/fts5.html#the_optimize_command) command\n",
|
||
" to keep things tidy. This doesn't do much on such a small dataset, but is important to remember for larger tables!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<table>\n",
|
||
"<thead>\n",
|
||
"<tr style=\"text-align: center;\">\n",
|
||
"<th>\n",
|
||
"headline\n",
|
||
"</th>\n",
|
||
"</tr>\n",
|
||
"</thead>\n",
|
||
"<tbody>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"Kamala Harris visits Planned Parenthood clinic\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"Former Marine sentenced to 9 years in prison for firebombing Planned Parenthood clinic\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"</tbody>\n",
|
||
"</table>\n",
|
||
"<div style=\"text-align: right;\">\n",
|
||
"2 rows × 1 column\n",
|
||
"</div>\n",
|
||
"</div>\n"
|
||
],
|
||
"text/plain": [
|
||
"\u001b[0m┌\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┐\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m\u001b[0m\u001b[1mheadline\u001b[0m \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┤\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mKamala Harris visits Planned Parenthood clinic \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mFormer Marine sentenced to 9 years in prison for firebombing Planned Parenthood clinic\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m└\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┘\n",
|
||
"\u001b[0m\u001b[0m"
|
||
]
|
||
},
|
||
"execution_count": 25,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"select *\n",
|
||
"from fts_articles\n",
|
||
"where headline match 'planned parenthood'\n",
|
||
"limit 10;"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Step 2: Create a \"semantic index\"\n",
|
||
"\n",
|
||
"\"Semantic index\" in this case is just a fancy way of saying \"vector store\", which we will do with a `sqlite-vec` `vec0` virtual table. \n",
|
||
"\n",
|
||
"Now, `sqlite-vec` just stores vectors, it doesn't generate embeddings for us. There are hundreds of different remote APIs or local inference runtimes you can use to generate embeddings,\n",
|
||
"but here we will use [`sqlite-lembed`](https://github.com/asg017/sqlite-lembed) to keep everything local and everything in pure SQL. \n",
|
||
"\n",
|
||
"We will need to choose an embeddings model in the [GGUF format](https://huggingface.co/docs/hub/en/gguf),\n",
|
||
"since `sqlite-lembed` uses [llama.cpp](https://github.com/ggerganov/llama.cpp) under the hood. \n",
|
||
"Here we will use [`Snowflake/snowflake-arctic-embed-m-v1.5`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5),\n",
|
||
"where we can find a GGUF version [here](https://huggingface.co/asg017/sqlite-lembed-model-examples/tree/main/snowflake-arctic-embed-m-v1.5). \n",
|
||
"This model is small-sh (`436MB` full-sized, `118MB` at `Q8_0` quantized), and is trained on fairly recent data so it understands\n",
|
||
"recent events like \"COVID-19\" or \"Kamala Harris\". \n",
|
||
"\n",
|
||
"You can download a `.gguf` quantized version of this model with:\n",
|
||
"\n",
|
||
"```bash\n",
|
||
"wget https://huggingface.co/asg017/sqlite-lembed-model-examples/resolve/main/snowflake-arctic-embed-m-v1.5/snowflake-arctic-embed-m-v1.5.d70deb40.f16.gguf\n",
|
||
"```\n",
|
||
"\n",
|
||
"And we can configure `sqlite-lembed` to use this model like so:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<table>\n",
|
||
"<thead>\n",
|
||
"<tr style=\"text-align: center;\">\n",
|
||
"</tr>\n",
|
||
"</thead>\n",
|
||
"<tbody>\n",
|
||
"</tbody>\n",
|
||
"</table>\n",
|
||
"<div style=\"text-align: right;\">\n",
|
||
"0 row × 0 column\n",
|
||
"</div>\n",
|
||
"</div>\n"
|
||
],
|
||
"text/plain": [
|
||
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
|
||
]
|
||
},
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
".load ./lembed0\n",
|
||
".load ../../dist/vec0\n",
|
||
"\n",
|
||
"insert into lembed_models(name, model) values\n",
|
||
" ('default', lembed_model_from_file('./snowflake-arctic-embed-m-v1.5.d70deb40.f16.gguf'));"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"It's embeddings time! We can use the `lembed()` function, which takes in text and returns a vector representation of that text,\n",
|
||
"as an embeddings BLOB that we can insert directly into a `vec0` virtul table. \n",
|
||
"\n",
|
||
"We'll declare this new `vec_articles` table, using the `vec_` prefix as convention. This matches the `fts_articles` table above. \n",
|
||
"The Snowflake embedding model generate vectors with `768` dimensions, which we we store as-as. \n",
|
||
"\n",
|
||
"Embedding and inserting into this vector store is as easy as a single `INSERT INTO` and `lembed()` call."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<table>\n",
|
||
"<thead>\n",
|
||
"<tr style=\"text-align: center;\">\n",
|
||
"</tr>\n",
|
||
"</thead>\n",
|
||
"<tbody>\n",
|
||
"</tbody>\n",
|
||
"</table>\n",
|
||
"<div style=\"text-align: right;\">\n",
|
||
"0 row × 0 column\n",
|
||
"</div>\n",
|
||
"</div>\n"
|
||
],
|
||
"text/plain": [
|
||
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
|
||
]
|
||
},
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"\n",
|
||
"create virtual table vec_articles using vec0(\n",
|
||
" article_id integer primary key,\n",
|
||
" headline_embedding float[768]\n",
|
||
");\n",
|
||
"\n",
|
||
"insert into vec_articles(article_id, headline_embedding)\n",
|
||
"select\n",
|
||
" rowid,\n",
|
||
" lembed(headline)\n",
|
||
"from articles;"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"This took ~13 minutes for ~14,500 embeddings on my older 2019 Macbook, but newer computers with better CPUs will finish quicker (it took `2m20s` on my newer Mac M1 Mini). \n",
|
||
"\n",
|
||
"Once the `vec_articles` is ready, we can perform a KNN query like so:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<table>\n",
|
||
"<thead>\n",
|
||
"<tr style=\"text-align: center;\">\n",
|
||
"<th>\n",
|
||
"headline\n",
|
||
"</th>\n",
|
||
"<th>\n",
|
||
"distance\n",
|
||
"</th>\n",
|
||
"</tr>\n",
|
||
"</thead>\n",
|
||
"<tbody>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"Kamala Harris visits Planned Parenthood clinic\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"0.492593914270401\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"After Dobbs decision, more women are managing their own abortions\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"0.5789032578468323\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"Transforming Healthcare\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"0.5822411179542542\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"A timeline of Trump's many, many positions on abortion\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"0.6101462841033936\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"How a network of abortion pill providers works together in the wake of new threats\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"0.6196886897087097\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"'Major hurdles': The reality check behind Biden's big abortion promise\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"0.6198344826698303\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"Trump's conflicting abortion stances are coming back to haunt him — and his party\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"0.6198986768722534\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"Where abortion rights could be on the ballot this fall: From the Politics Desk\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"0.6201764345169067\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"How the Biden campaign quickly mobilized on Trump's abortion stance\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"0.633980393409729\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"Battle over abortion heats up in Arizona — and could be on the 2024 ballot\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"0.6341449022293091\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"</tbody>\n",
|
||
"</table>\n",
|
||
"<div style=\"text-align: right;\">\n",
|
||
"10 rows × 2 columns\n",
|
||
"</div>\n",
|
||
"</div>\n"
|
||
],
|
||
"text/plain": [
|
||
"\u001b[0m┌\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┬\u001b[0m\u001b[0m────────────────────\u001b[0m\u001b[0m┐\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m\u001b[0m\u001b[1mheadline\u001b[0m \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m\u001b[0m\u001b[1mdistance\u001b[0m \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┼\u001b[0m\u001b[0m────────────────────\u001b[0m\u001b[0m┤\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mKamala Harris visits Planned Parenthood clinic \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 0.492593914270401\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mAfter Dobbs decision, more women are managing their own abortions \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.5789032578468323\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mTransforming Healthcare \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.5822411179542542\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mA timeline of Trump's many, many positions on abortion \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.6101462841033936\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mHow a network of abortion pill providers works together in the wake of new threats\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.6196886897087097\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m'Major hurdles': The reality check behind Biden's big abortion promise \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.6198344826698303\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mTrump's conflicting abortion stances are coming back to haunt him — and his party \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.6198986768722534\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mWhere abortion rights could be on the ballot this fall: From the Politics Desk \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.6201764345169067\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mHow the Biden campaign quickly mobilized on Trump's abortion stance \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 0.633980393409729\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mBattle over abortion heats up in Arizona — and could be on the 2024 ballot \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.6341449022293091\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m└\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┴\u001b[0m\u001b[0m────────────────────\u001b[0m\u001b[0m┘\n",
|
||
"\u001b[0m\u001b[0m"
|
||
]
|
||
},
|
||
"execution_count": 27,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"select\n",
|
||
" articles.headline,\n",
|
||
" vec_articles.distance\n",
|
||
"from vec_articles\n",
|
||
"left join articles on articles.rowid = vec_articles.article_id\n",
|
||
"where headline_embedding match lembed(\"planned parenthood\")\n",
|
||
" and k = 10;"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Slim it down with Binary Quantization\n",
|
||
"\n",
|
||
"The vectors in the `vec_articles` table take up a lot of space. A vector with `768` dimensions take up `786 * 4 = 3072` bytes of space each, or around `45MB` of space for these ~14,500 entries. \n",
|
||
"\n",
|
||
"That's a lot — the original text dataset was only `~4MB`!\n",
|
||
"\n",
|
||
"If you want to make the database smaller, there's a number of quantization or other methods to do so, by trading accuracy. \n",
|
||
"Here's an example of performing [binary quantization](https://alexgarcia.xyz/sqlite-vec/guides/binary-quant.html)\n",
|
||
"on this dataset, storing 768-dimensional bit-vectors instead of floating-point vectors, a `32x` size reduction, at the expense of accuracy. \n",
|
||
"\n",
|
||
"We'll keep the current SQLite database as-is, and instead make a copy into a new SQLite database file, and change the `vec_articles` table\n",
|
||
"to store bit-vectors instead. \n",
|
||
"\n",
|
||
"First, we'll make a copy of the current database into a new file:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<table>\n",
|
||
"<thead>\n",
|
||
"<tr style=\"text-align: center;\">\n",
|
||
"</tr>\n",
|
||
"</thead>\n",
|
||
"<tbody>\n",
|
||
"</tbody>\n",
|
||
"</table>\n",
|
||
"<div style=\"text-align: right;\">\n",
|
||
"0 row × 0 column\n",
|
||
"</div>\n",
|
||
"</div>\n"
|
||
],
|
||
"text/plain": [
|
||
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
|
||
]
|
||
},
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"vacuum into 'tmp-artic2.slim.db';"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we'll make a connection to this new file, and drop the old `vec_articles` table that contains the large `float[768]` vectors."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<table>\n",
|
||
"<thead>\n",
|
||
"<tr style=\"text-align: center;\">\n",
|
||
"</tr>\n",
|
||
"</thead>\n",
|
||
"<tbody>\n",
|
||
"</tbody>\n",
|
||
"</table>\n",
|
||
"<div style=\"text-align: right;\">\n",
|
||
"0 row × 0 column\n",
|
||
"</div>\n",
|
||
"</div>\n"
|
||
],
|
||
"text/plain": [
|
||
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
|
||
]
|
||
},
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"attach database 'tmp-artic2.slim.db' as slim;\n",
|
||
"drop table slim.vec_articles;"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we can create a new `vec0` table, storing `bit[768]` vectors instead! \n",
|
||
"We can insert the original `float[768]` from the `main.vec_articles` table (original table),\n",
|
||
"calling [`vec_quantize_binary()`](https://alexgarcia.xyz/sqlite-vec/api-reference.html#vec_quantize_binary) to convert the floats to bits. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 21,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<table>\n",
|
||
"<thead>\n",
|
||
"<tr style=\"text-align: center;\">\n",
|
||
"</tr>\n",
|
||
"</thead>\n",
|
||
"<tbody>\n",
|
||
"</tbody>\n",
|
||
"</table>\n",
|
||
"<div style=\"text-align: right;\">\n",
|
||
"0 row × 0 column\n",
|
||
"</div>\n",
|
||
"</div>\n"
|
||
],
|
||
"text/plain": [
|
||
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
|
||
]
|
||
},
|
||
"execution_count": 21,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"\n",
|
||
"create virtual table slim.vec_articles using vec0(\n",
|
||
" article_id integer primary key,\n",
|
||
" headline_embedding bit[768]\n",
|
||
");\n",
|
||
"\n",
|
||
"insert into slim.vec_articles(article_id, headline_embedding)\n",
|
||
"select\n",
|
||
" article_id,\n",
|
||
" vec_quantize_binary(headline_embedding)\n",
|
||
"from main.vec_articles;"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Then we can `VACUUM` the new `slim` database to shrink the file, delete the `DROP`'ed pages from the older `vec0` table. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<table>\n",
|
||
"<thead>\n",
|
||
"<tr style=\"text-align: center;\">\n",
|
||
"</tr>\n",
|
||
"</thead>\n",
|
||
"<tbody>\n",
|
||
"</tbody>\n",
|
||
"</table>\n",
|
||
"<div style=\"text-align: right;\">\n",
|
||
"0 row × 0 column\n",
|
||
"</div>\n",
|
||
"</div>\n"
|
||
],
|
||
"text/plain": [
|
||
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
|
||
]
|
||
},
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"vacuum slim;"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"And there we have it! This file is `7.1MB`, a large reduction from the original `53MB` table. \n",
|
||
"\n",
|
||
"KNN queries are similar, only adding the `vec_quantize_binary()` function to the query vector."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 28,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<table>\n",
|
||
"<thead>\n",
|
||
"<tr style=\"text-align: center;\">\n",
|
||
"<th>\n",
|
||
"headline\n",
|
||
"</th>\n",
|
||
"<th>\n",
|
||
"distance\n",
|
||
"</th>\n",
|
||
"</tr>\n",
|
||
"</thead>\n",
|
||
"<tbody>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"Kamala Harris visits Planned Parenthood clinic\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"139\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"How a network of abortion pill providers works together in the wake of new threats\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"151\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"After Dobbs decision, more women are managing their own abortions\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"153\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"A timeline of Trump's many, many positions on abortion\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"156\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"Two of the country’s largest transgender rights organizations will merge\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"158\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"Transforming Healthcare\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"158\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"With Harris and Walz, Democrats put abortion rights at the top of the agenda\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"159\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"In states with strict abortion policies, simply seeing an OB/GYN for regular care can be difficult\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"160\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"Where abortion rights could be on the ballot this fall: From the Politics Desk\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"161\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"<tr>\n",
|
||
"<td style=\"text-align: left;\">\n",
|
||
"Map: Where medication abortion is and isn’t legal\n",
|
||
"</td>\n",
|
||
"<td >\n",
|
||
"162\n",
|
||
"</td>\n",
|
||
"</tr>\n",
|
||
"</tbody>\n",
|
||
"</table>\n",
|
||
"<div style=\"text-align: right;\">\n",
|
||
"10 rows × 2 columns\n",
|
||
"</div>\n",
|
||
"</div>\n"
|
||
],
|
||
"text/plain": [
|
||
"\u001b[0m┌\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┬\u001b[0m\u001b[0m──────────\u001b[0m\u001b[0m┐\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m\u001b[0m\u001b[1mheadline\u001b[0m \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m\u001b[0m\u001b[1mdistance\u001b[0m\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┼\u001b[0m\u001b[0m──────────\u001b[0m\u001b[0m┤\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mKamala Harris visits Planned Parenthood clinic \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 139\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mHow a network of abortion pill providers works together in the wake of new threats \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 151\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mAfter Dobbs decision, more women are managing their own abortions \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 153\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mA timeline of Trump's many, many positions on abortion \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 156\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mTwo of the country’s largest transgender rights organizations will merge \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 158\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mTransforming Healthcare \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 158\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mWith Harris and Walz, Democrats put abortion rights at the top of the agenda \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 159\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mIn states with strict abortion policies, simply seeing an OB/GYN for regular care can be difficult\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 160\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mWhere abortion rights could be on the ballot this fall: From the Politics Desk \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 161\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mMap: Where medication abortion is and isn’t legal \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 162\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
|
||
"\u001b[0m\u001b[0m\u001b[0m└\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┴\u001b[0m\u001b[0m──────────\u001b[0m\u001b[0m┘\n",
|
||
"\u001b[0m\u001b[0m"
|
||
]
|
||
},
|
||
"execution_count": 28,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"select\n",
|
||
" slim.articles.headline,\n",
|
||
" slim.vec_articles.distance\n",
|
||
"from slim.vec_articles\n",
|
||
"left join slim.articles on slim.articles.rowid = slim.vec_articles.article_id\n",
|
||
"where headline_embedding match vec_quantize_binary(lembed(\"planned parenthood\"))\n",
|
||
" and k = 10;"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"You'll notice the results differ slightly to the full-sized query from above. Some results are ordered differently, some are missing. \n",
|
||
"The `distance` in this binary KNN search is hamming distance, not the default L2 distance. "
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Solite",
|
||
"language": "sql",
|
||
"name": "solite"
|
||
},
|
||
"language_info": {
|
||
"file_extension": ".sql",
|
||
"mimetype": "text/x.sqlite",
|
||
"name": "sql",
|
||
"nb_converter": "script",
|
||
"pygments_lexer": "sql",
|
||
"version": "TODO"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 2
|
||
}
|