docs

2026-06-11 15:15:19 +02:00 · 2024-07-31 12:55:03 -07:00 · 2024-07-31 12:55:03 -07:00 · 356f75cca7
commit 356f75cca7
parent 4febdff11a
17 changed files with 350 additions and 166 deletions
--- a/site/guides/matryoshka.md
+++ b/site/guides/matryoshka.md
@ -2,17 +2,17 @@

 Matryoshka embeddings are a new class of embedding models introduced in the
 TODO-YYY paper [_TODO title_](https://arxiv.org/abs/2205.13147). They allow one
-to truncate excess dimensions in large vector, without lossing much quality.
+to truncate excess dimensions in large vector, without sacrificing much quality.

 Let's say your embedding model generate 1024-dimensional vectors. If you have 1
 million of these 1024-dimensional vectors, they would take up `4.096 GB` of
-space! You're not able to reduce the dimensions without lossing a lot of
+space! You're not able to reduce the dimensions without losing a lot of
 quality - if you were to remove half of the dimensions 512-dimensional vectors,
 you could expect to also lose 50% or more of the quality of results. There are
-other dimensional-reduction techniques, like [PCA](#TODO), but this requires a
-complicated and expensive training process.
+other dimensional-reduction techniques, like [PCA](#TODO) or [Product Quantization](#TODO), but they typically require
+complicated and expensive training processes.

-Matryoshka embeddings, on the other hand, _can_ be truncated, without losing
+Matryoshka embeddings, on the other hand, _can_ be truncated, without losing much
 quality. Using [`mixedbread.ai`](#TODO) `mxbai-embed-large-v1` model, they claim
 that

@ -20,16 +20,20 @@ They are called "Matryoshka" embeddings because ... TODO

 ## Matryoshka Embeddings with `sqlite-vec`

-You can use a combination of [`vec_slice()`](/api-reference#vec_slice) and
-[`vec_normalize()`](/api-reference#vec_slice) on Matryoshka embeddings to
+You can use a combination of [`vec_slice()`](../api-reference.md#vec_slice) and
+[`vec_normalize()`](../api-reference.md#vec_slice) on Matryoshka embeddings to
 truncate.

 ```sql
 select
-  vec_normalize(vec_slice(title_embeddings, 0, 256)) as title_embeddings_256d
+  vec_normalize(
+    vec_slice(title_embeddings, 0, 256)
+  ) as title_embeddings_256d
 from vec_articles;
 ```

+[`vec_slice()`](../api-reference.md#vec_slice) will cut down the vector to the first 256 dimensions. Then [`vec_normalize()`](../api-reference.md#vec_normalize) will normalize that truncated vector, which is typically a required step for Matryoshka embeddings.
+
 ## Benchmarks

 ## Suppported Models
@ -47,3 +51,7 @@ https://www.mixedbread.ai/blog/binary-mrl
 `mxbai-embed-large-v1`: 1024, 512, 256, 128, 64

 `nomic-embed-text-v1.5`: 768, 512, 256, 128, 64
+
+```
+# TODO new snowflake model
+```