sqlite-vec/examples/nbc-headlines/2_build.ipynb
2024-10-02 10:24:49 -07:00

854 lines
34 KiB
Text
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NBC News Headlines: Building FTS5 + `vec0` indexes\n",
"\n",
"Using the dataset built in [the previous `./1_scrape.ipynb` notebook](./1_scrape.ipynb), \n",
"this notebook will enrich that dataset with a full-text search index and a semantic search index,\n",
"using [FTS5](https://www.sqlite.org/fts5.html), \n",
"[`sqlite-vec`](https://github.com/asg017/sqlite-vec), and \n",
"[`sqlite-lembed`](https://github.com/asg017/sqlite-lembed).\n",
"\n",
"This example will use pure SQL for everything. You can do the same exact thing in Python/JavaScript/Go/Rust/etc., or use\n",
"your own embeddings providers like Ollama/llamafile/OpenAI/etc. The core mechanics of FTS5 and `sqlite-vec` will remain the same. \n",
"\n",
"We will use the [Snowflake Artic Embed v1.5](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5) embeddings model to generate embeddings. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[no code]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
".open tmp-artic2.db"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 1: Create a FTS5 index\n",
"\n",
"Creating a full-text search index is as simple as 3 SQL commands! We already have the headlines stored in the `articles` \n",
"table under the `headline` column, so it's just a matter of initializing the FTS5 virtual table and inserting the data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"<thead>\n",
"<tr style=\"text-align: center;\">\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"</tbody>\n",
"</table>\n",
"<div style=\"text-align: right;\">\n",
"0 row × 0 column\n",
"</div>\n",
"</div>\n"
],
"text/plain": [
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"create virtual table fts_articles using fts5(\n",
" headline,\n",
" content='articles', content_rowid='id'\n",
");\n",
"\n",
"insert into fts_articles(rowid, headline)\n",
" select rowid, headline\n",
" from articles;\n",
"\n",
"insert into fts_articles(fts_articles) values('optimize');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By convention we name the FTS5 table `fts_articles`, where the `fts_` prefix says \"this virtual table is full-text search of the `articles` table\". We are only searching the `headline` column, the rest can be ignored. \n",
"\n",
"Here we are using the [\"external content tables\"](https://www.sqlite.org/fts5.html#external_content_tables)\n",
"feature in FTS5 tables, which will avoid storing the headlines a 2nd time, since they already exist in the `articles` table. \n",
"This part isn't required, but saves us a bit of storage. \n",
"\n",
"We also use the [`'optimize'`](https://www.sqlite.org/fts5.html#the_optimize_command) command\n",
" to keep things tidy. This doesn't do much on such a small dataset, but is important to remember for larger tables!"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"<thead>\n",
"<tr style=\"text-align: center;\">\n",
"<th>\n",
"headline\n",
"</th>\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"Kamala Harris visits Planned Parenthood clinic\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"Former Marine sentenced to 9 years in prison for firebombing Planned Parenthood clinic\n",
"</td>\n",
"</tr>\n",
"</tbody>\n",
"</table>\n",
"<div style=\"text-align: right;\">\n",
"2 rows × 1 column\n",
"</div>\n",
"</div>\n"
],
"text/plain": [
"\u001b[0m┌\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┐\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m\u001b[0m\u001b[1mheadline\u001b[0m \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┤\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mKamala Harris visits Planned Parenthood clinic \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mFormer Marine sentenced to 9 years in prison for firebombing Planned Parenthood clinic\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m└\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┘\n",
"\u001b[0m\u001b[0m"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"select *\n",
"from fts_articles\n",
"where headline match 'planned parenthood'\n",
"limit 10;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Create a \"semantic index\"\n",
"\n",
"\"Semantic index\" in this case is just a fancy way of saying \"vector store\", which we will do with a `sqlite-vec` `vec0` virtual table. \n",
"\n",
"Now, `sqlite-vec` just stores vectors, it doesn't generate embeddings for us. There are hundreds of different remote APIs or local inference runtimes you can use to generate embeddings,\n",
"but here we will use [`sqlite-lembed`](https://github.com/asg017/sqlite-lembed) to keep everything local and everything in pure SQL. \n",
"\n",
"We will need to choose an embeddings model in the [GGUF format](https://huggingface.co/docs/hub/en/gguf),\n",
"since `sqlite-lembed` uses [llama.cpp](https://github.com/ggerganov/llama.cpp) under the hood. \n",
"Here we will use [`Snowflake/snowflake-arctic-embed-m-v1.5`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5),\n",
"where we can find a GGUF version [here](https://huggingface.co/asg017/sqlite-lembed-model-examples/tree/main/snowflake-arctic-embed-m-v1.5). \n",
"This model is small-sh (`436MB` full-sized, `118MB` at `Q8_0` quantized), and is trained on fairly recent data so it understands\n",
"recent events like \"COVID-19\" or \"Kamala Harris\". \n",
"\n",
"You can download a `.gguf` quantized version of this model with:\n",
"\n",
"```bash\n",
"wget https://huggingface.co/asg017/sqlite-lembed-model-examples/resolve/main/snowflake-arctic-embed-m-v1.5/snowflake-arctic-embed-m-v1.5.d70deb40.f16.gguf\n",
"```\n",
"\n",
"And we can configure `sqlite-lembed` to use this model like so:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"<thead>\n",
"<tr style=\"text-align: center;\">\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"</tbody>\n",
"</table>\n",
"<div style=\"text-align: right;\">\n",
"0 row × 0 column\n",
"</div>\n",
"</div>\n"
],
"text/plain": [
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
".load ./lembed0\n",
".load ../../dist/vec0\n",
"\n",
"insert into lembed_models(name, model) values\n",
" ('default', lembed_model_from_file('./snowflake-arctic-embed-m-v1.5.d70deb40.f16.gguf'));"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's embeddings time! We can use the `lembed()` function, which takes in text and returns a vector representation of that text,\n",
"as an embeddings BLOB that we can insert directly into a `vec0` virtul table. \n",
"\n",
"We'll declare this new `vec_articles` table, using the `vec_` prefix as convention. This matches the `fts_articles` table above. \n",
"The Snowflake embedding model generate vectors with `768` dimensions, which we we store as-as. \n",
"\n",
"Embedding and inserting into this vector store is as easy as a single `INSERT INTO` and `lembed()` call."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"<thead>\n",
"<tr style=\"text-align: center;\">\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"</tbody>\n",
"</table>\n",
"<div style=\"text-align: right;\">\n",
"0 row × 0 column\n",
"</div>\n",
"</div>\n"
],
"text/plain": [
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"create virtual table vec_articles using vec0(\n",
" article_id integer primary key,\n",
" headline_embedding float[768]\n",
");\n",
"\n",
"insert into vec_articles(article_id, headline_embedding)\n",
"select\n",
" rowid,\n",
" lembed(headline)\n",
"from articles;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This took ~13 minutes for ~14,500 embeddings on my older 2019 Macbook, but newer computers with better CPUs will finish quicker (it took `2m20s` on my newer Mac M1 Mini). \n",
"\n",
"Once the `vec_articles` is ready, we can perform a KNN query like so:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"<thead>\n",
"<tr style=\"text-align: center;\">\n",
"<th>\n",
"headline\n",
"</th>\n",
"<th>\n",
"distance\n",
"</th>\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"Kamala Harris visits Planned Parenthood clinic\n",
"</td>\n",
"<td >\n",
"0.492593914270401\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"After Dobbs decision, more women are managing their own abortions\n",
"</td>\n",
"<td >\n",
"0.5789032578468323\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"Transforming Healthcare\n",
"</td>\n",
"<td >\n",
"0.5822411179542542\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"A timeline of Trump&#39;s many, many positions on abortion\n",
"</td>\n",
"<td >\n",
"0.6101462841033936\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"How a network of abortion pill providers works together in the wake of new threats\n",
"</td>\n",
"<td >\n",
"0.6196886897087097\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"&#39;Major hurdles&#39;: The reality check behind Biden&#39;s big abortion promise\n",
"</td>\n",
"<td >\n",
"0.6198344826698303\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"Trump&#39;s conflicting abortion stances are coming back to haunt him — and his party\n",
"</td>\n",
"<td >\n",
"0.6198986768722534\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"Where abortion rights could be on the ballot this fall: From the Politics Desk\n",
"</td>\n",
"<td >\n",
"0.6201764345169067\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"How the Biden campaign quickly mobilized on Trump&#39;s abortion stance\n",
"</td>\n",
"<td >\n",
"0.633980393409729\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"Battle over abortion heats up in Arizona — and could be on the 2024 ballot\n",
"</td>\n",
"<td >\n",
"0.6341449022293091\n",
"</td>\n",
"</tr>\n",
"</tbody>\n",
"</table>\n",
"<div style=\"text-align: right;\">\n",
"10 rows × 2 columns\n",
"</div>\n",
"</div>\n"
],
"text/plain": [
"\u001b[0m┌\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┬\u001b[0m\u001b[0m────────────────────\u001b[0m\u001b[0m┐\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m\u001b[0m\u001b[1mheadline\u001b[0m \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m\u001b[0m\u001b[1mdistance\u001b[0m \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┼\u001b[0m\u001b[0m────────────────────\u001b[0m\u001b[0m┤\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mKamala Harris visits Planned Parenthood clinic \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 0.492593914270401\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mAfter Dobbs decision, more women are managing their own abortions \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.5789032578468323\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mTransforming Healthcare \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.5822411179542542\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mA timeline of Trump's many, many positions on abortion \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.6101462841033936\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mHow a network of abortion pill providers works together in the wake of new threats\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.6196886897087097\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m'Major hurdles': The reality check behind Biden's big abortion promise \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.6198344826698303\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mTrump's conflicting abortion stances are coming back to haunt him — and his party \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.6198986768722534\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mWhere abortion rights could be on the ballot this fall: From the Politics Desk \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.6201764345169067\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mHow the Biden campaign quickly mobilized on Trump's abortion stance \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 0.633980393409729\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mBattle over abortion heats up in Arizona — and could be on the 2024 ballot \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m0.6341449022293091\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m└\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┴\u001b[0m\u001b[0m────────────────────\u001b[0m\u001b[0m┘\n",
"\u001b[0m\u001b[0m"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"select\n",
" articles.headline,\n",
" vec_articles.distance\n",
"from vec_articles\n",
"left join articles on articles.rowid = vec_articles.article_id\n",
"where headline_embedding match lembed(\"planned parenthood\")\n",
" and k = 10;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Slim it down with Binary Quantization\n",
"\n",
"The vectors in the `vec_articles` table take up a lot of space. A vector with `768` dimensions take up `786 * 4 = 3072` bytes of space each, or around `45MB` of space for these ~14,500 entries. \n",
"\n",
"That's a lot — the original text dataset was only `~4MB`!\n",
"\n",
"If you want to make the database smaller, there's a number of quantization or other methods to do so, by trading accuracy. \n",
"Here's an example of performing [binary quantization](https://alexgarcia.xyz/sqlite-vec/guides/binary-quant.html)\n",
"on this dataset, storing 768-dimensional bit-vectors instead of floating-point vectors, a `32x` size reduction, at the expense of accuracy. \n",
"\n",
"We'll keep the current SQLite database as-is, and instead make a copy into a new SQLite database file, and change the `vec_articles` table\n",
"to store bit-vectors instead. \n",
"\n",
"First, we'll make a copy of the current database into a new file:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"<thead>\n",
"<tr style=\"text-align: center;\">\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"</tbody>\n",
"</table>\n",
"<div style=\"text-align: right;\">\n",
"0 row × 0 column\n",
"</div>\n",
"</div>\n"
],
"text/plain": [
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vacuum into 'tmp-artic2.slim.db';"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we'll make a connection to this new file, and drop the old `vec_articles` table that contains the large `float[768]` vectors."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"<thead>\n",
"<tr style=\"text-align: center;\">\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"</tbody>\n",
"</table>\n",
"<div style=\"text-align: right;\">\n",
"0 row × 0 column\n",
"</div>\n",
"</div>\n"
],
"text/plain": [
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"attach database 'tmp-artic2.slim.db' as slim;\n",
"drop table slim.vec_articles;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can create a new `vec0` table, storing `bit[768]` vectors instead! \n",
"We can insert the original `float[768]` from the `main.vec_articles` table (original table),\n",
"calling [`vec_quantize_binary()`](https://alexgarcia.xyz/sqlite-vec/api-reference.html#vec_quantize_binary) to convert the floats to bits. "
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"<thead>\n",
"<tr style=\"text-align: center;\">\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"</tbody>\n",
"</table>\n",
"<div style=\"text-align: right;\">\n",
"0 row × 0 column\n",
"</div>\n",
"</div>\n"
],
"text/plain": [
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"create virtual table slim.vec_articles using vec0(\n",
" article_id integer primary key,\n",
" headline_embedding bit[768]\n",
");\n",
"\n",
"insert into slim.vec_articles(article_id, headline_embedding)\n",
"select\n",
" article_id,\n",
" vec_quantize_binary(headline_embedding)\n",
"from main.vec_articles;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can `VACUUM` the new `slim` database to shrink the file, delete the `DROP`'ed pages from the older `vec0` table. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"<thead>\n",
"<tr style=\"text-align: center;\">\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"</tbody>\n",
"</table>\n",
"<div style=\"text-align: right;\">\n",
"0 row × 0 column\n",
"</div>\n",
"</div>\n"
],
"text/plain": [
"\u001b[0m┌\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vacuum slim;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And there we have it! This file is `7.1MB`, a large reduction from the original `53MB` table. \n",
"\n",
"KNN queries are similar, only adding the `vec_quantize_binary()` function to the query vector."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"<thead>\n",
"<tr style=\"text-align: center;\">\n",
"<th>\n",
"headline\n",
"</th>\n",
"<th>\n",
"distance\n",
"</th>\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"Kamala Harris visits Planned Parenthood clinic\n",
"</td>\n",
"<td >\n",
"139\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"How a network of abortion pill providers works together in the wake of new threats\n",
"</td>\n",
"<td >\n",
"151\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"After Dobbs decision, more women are managing their own abortions\n",
"</td>\n",
"<td >\n",
"153\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"A timeline of Trump&#39;s many, many positions on abortion\n",
"</td>\n",
"<td >\n",
"156\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"Two of the countrys largest transgender rights organizations will merge\n",
"</td>\n",
"<td >\n",
"158\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"Transforming Healthcare\n",
"</td>\n",
"<td >\n",
"158\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"With Harris and Walz, Democrats put abortion rights at the top of the agenda\n",
"</td>\n",
"<td >\n",
"159\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"In states with strict abortion policies, simply seeing an OB/GYN for regular care can be difficult\n",
"</td>\n",
"<td >\n",
"160\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"Where abortion rights could be on the ballot this fall: From the Politics Desk\n",
"</td>\n",
"<td >\n",
"161\n",
"</td>\n",
"</tr>\n",
"<tr>\n",
"<td style=\"text-align: left;\">\n",
"Map: Where medication abortion is and isnt legal\n",
"</td>\n",
"<td >\n",
"162\n",
"</td>\n",
"</tr>\n",
"</tbody>\n",
"</table>\n",
"<div style=\"text-align: right;\">\n",
"10 rows × 2 columns\n",
"</div>\n",
"</div>\n"
],
"text/plain": [
"\u001b[0m┌\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┬\u001b[0m\u001b[0m──────────\u001b[0m\u001b[0m┐\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m\u001b[0m\u001b[1mheadline\u001b[0m \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m\u001b[0m\u001b[1mdistance\u001b[0m\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m├\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┼\u001b[0m\u001b[0m──────────\u001b[0m\u001b[0m┤\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mKamala Harris visits Planned Parenthood clinic \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 139\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mHow a network of abortion pill providers works together in the wake of new threats \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 151\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mAfter Dobbs decision, more women are managing their own abortions \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 153\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mA timeline of Trump's many, many positions on abortion \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 156\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mTwo of the countrys largest transgender rights organizations will merge \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 158\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mTransforming Healthcare \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 158\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mWith Harris and Walz, Democrats put abortion rights at the top of the agenda \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 159\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mIn states with strict abortion policies, simply seeing an OB/GYN for regular care can be difficult\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 160\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mWhere abortion rights could be on the ballot this fall: From the Politics Desk \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 161\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0mMap: Where medication abortion is and isnt legal \u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m \u001b[0m\u001b[0m\u001b[0m 162\u001b[0m \u001b[0m\u001b[0m│\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0m\u001b[0m└\u001b[0m\u001b[0m────────────────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\u001b[0m┴\u001b[0m\u001b[0m──────────\u001b[0m\u001b[0m┘\n",
"\u001b[0m\u001b[0m"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"select\n",
" slim.articles.headline,\n",
" slim.vec_articles.distance\n",
"from slim.vec_articles\n",
"left join slim.articles on slim.articles.rowid = slim.vec_articles.article_id\n",
"where headline_embedding match vec_quantize_binary(lembed(\"planned parenthood\"))\n",
" and k = 10;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll notice the results differ slightly to the full-sized query from above. Some results are ordered differently, some are missing. \n",
"The `distance` in this binary KNN search is hamming distance, not the default L2 distance. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Solite",
"language": "sql",
"name": "solite"
},
"language_info": {
"file_extension": ".sql",
"mimetype": "text/x.sqlite",
"name": "sqlite",
"nb_converter": "script",
"pygments_lexer": "sql",
"version": "TODO"
}
},
"nbformat": 4,
"nbformat_minor": 2
}