mirror of
https://github.com/samvallad33/vestige.git
synced 2026-04-25 00:36:22 +02:00
feat(v2.1.0): Qwen3-Embedding-0.6B backend scaffolding (feature-gated)
Day 2 of the Qwen3 migration. Default build is unchanged — nomic stays
the embedding backend, every existing caller continues to see 256-dim
Matryoshka-truncated vectors from the ONNX path. The `qwen3-embed`
feature flag routes to fastembed's standalone Qwen3TextEmbedding
(Candle backend) for 1024-dim native output with 32K context.
Honest scope note: this commit is SCAFFOLDING. Under `qwen3-embed`
the backend initialises cleanly, the vector index now sizes itself to
1024d via feature-gated DEFAULT_DIMENSIONS, and the full 366-test lib
suite passes. End-to-end ingest + search under qwen3-embed still has
two gaps that Day 3 closes: sqlite.rs hardcodes the embedding_model
string as 'nomic-embed-text-v1.5' at the write sites, and
get_query_embedding doesn't call qwen3_format_query on the query text.
Neither is a regression for default builds — both are explicit Day 3
work items tracked in the audit inventory.
What's here:
- New `Backend` enum wraps either `TextEmbedding` (Nomic ONNX) or
`Qwen3TextEmbedding` (Candle) behind the same Mutex<OnceLock<...>>
the rest of Vestige already calls through. `EmbeddingService::embed`
dispatches via `Backend::embed_batch` + `Backend::post_process` so
the public API shape doesn't change.
- `qwen3-embed` Cargo feature = fastembed/qwen3 + direct candle-core
pinned to =0.10.2 (exact, not caret — supply-chain defence alongside
Cargo.lock; fastembed doesn't re-export candle_core types so we need
a direct dep path for candle_core::{Device, DType}).
- `qwen3_format_query()` helper + `QWEN3_QUERY_INSTRUCTION` constant.
Qwen3 is asymmetric — queries require the Instruct prefix, documents
go in raw. Prefix format matches the canonical `get_detailed_instruct`
Python reference in the HF model card (no space after `Query:`). The
helper is a no-op under the nomic backend so upstream code can wrap
queries unconditionally.
- Per-backend dimensions: `NOMIC_EMBEDDING_DIMENSIONS = 256`,
`QWEN3_EMBEDDING_DIMENSIONS = 1024`. `EMBEDDING_DIMENSIONS` resolves
to the active backend at compile time for back-compat.
- `search/vector.rs::DEFAULT_DIMENSIONS` and
`advanced/adaptive_embedding.rs::{DEFAULT_DIMENSIONS, CODE_DIMENSIONS}`
feature-gated to match the active backend so the USearch index
sizes itself correctly.
- Per-backend model_name() returning the HF repo ID ("nomic-ai/..." or
"Qwen/..."). Will be threaded through storage write sites in Day 3.
- MAX_TEXT_LENGTH bumps to 32K under qwen3-embed to match Qwen3's
context window; stays at 8K for nomic.
- Backend::post_process applies matryoshka_truncate for Nomic only;
Qwen3 output is already last-token pooled + L2-normalized by the
Candle model (verified in fastembed-5.13.2 qwen3.rs:1124-1125).
- Device selection: `#[cfg(feature = "metal")]` uses
Device::new_metal(0) with CPU fallback on failure; otherwise CPU.
CUDA auto-selection deferred to Day 3+.
- Shape-contract guard at the Backend output boundary — empty outer
OR inner vectors return EmbeddingError::EmbeddingFailed instead of
the previous `.unwrap()` + zero-dim vector reaching USearch.
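The two behaviors the list above leans on most — the asymmetric query prefix and the Nomic-only Matryoshka post-processing — can be sketched together as standalone functions. This is a minimal ungated sketch for illustration, not the shipped code: the real implementations live in embeddings/local.rs below and branch on the `qwen3-embed` feature.

```rust
/// Canonical Qwen3 retrieval instruction (matches the HF model card).
const QWEN3_QUERY_INSTRUCTION: &str =
    "Given a web search query, retrieve relevant passages that answer the query";

/// Wrap QUERY text in the Instruct template. Note: no space after `Query:`,
/// matching the model card's Python `get_detailed_instruct` reference.
/// Documents are embedded raw — never wrap them.
fn qwen3_format_query(query: &str) -> String {
    format!("Instruct: {QWEN3_QUERY_INSTRUCTION}\nQuery:{query}")
}

/// Nomic-only post-processing: keep the first 256 dims of the 768-dim output,
/// then L2-normalize so cosine similarity stays well-defined. The Qwen3 path
/// skips this — its output is already last-token pooled and normalized.
fn matryoshka_truncate(mut v: Vec<f32>) -> Vec<f32> {
    const NOMIC_DIMS: usize = 256;
    if v.len() > NOMIC_DIMS {
        v.truncate(NOMIC_DIMS);
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
    v
}
```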
Tests: 366 passing under default features AND --features qwen3-embed.
Zero clippy warnings on both. One live integration test
(`test_qwen3_embed_live`) `#[ignore]`d so CI doesn't try to pull the
1.2 GB Qwen3 weights on every run; invoke explicitly with
`cargo test --features qwen3-embed -- --ignored test_qwen3_embed_live`.
Pre-push audit (4 parallel reviewers — security, code-quality,
end-to-end flow trace, external verification) ran clean on:
- Cfg soundness across default / qwen3-embed / qwen3-embed+metal /
nomic-v2 / no-default-features / encryption matrices
- Doc-comment fidelity vs fastembed-5.13.2 source
- External claims (1024d, 32K ctx, MRL 32-1024, L2-normalized,
last-token pooling) all verified against Qwen3 HF model card and
fastembed qwen3.rs
- Zero `unsafe`, zero reachable panics, zero info-disclosure leaks
beyond HF upstream error strings
Day 3 (next session):
- sqlite.rs:663 and :669 — write EmbeddingService::model_name()
instead of hardcoded "nomic-embed-text-v1.5"
- sqlite.rs:1639 get_query_embedding — wrap query text with
qwen3_format_query() before calling embed()
- sqlite.rs load_embeddings_into_index — refuse cross-backend loads
(legacy nomic rows under qwen3 build) instead of silent re-use
- Add a migration warn when backend mismatch is detected
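The cross-backend refusal in the third item could look roughly like the sketch below. Everything here is hypothetical — `check_row_backend` does not exist in the tree yet, and the function shape and error message are illustrative only; the actual Day 3 change lands inside sqlite.rs's load_embeddings_into_index.

```rust
/// Hypothetical Day-3 guard (name and message are illustrative, not shipped
/// code): refuse to load a row embedded by a different backend into the
/// active USearch index instead of silently re-using its vector.
fn check_row_backend(row_model: &str, active_model: &str) -> Result<(), String> {
    if row_model != active_model {
        return Err(format!(
            "embedding_model mismatch: row embedded with '{row_model}', \
             active backend is '{active_model}'; re-embed before loading"
        ));
    }
    Ok(())
}
```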
parent 6c24a0ca69
commit 82b78ab664
6 changed files with 422 additions and 83 deletions
Cargo.lock (generated)

@@ -4533,6 +4533,7 @@ checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a"
 name = "vestige-core"
 version = "2.0.6"
 dependencies = [
+ "candle-core",
  "chrono",
  "criterion",
  "directories",
Cargo.toml

@@ -42,6 +42,13 @@ nomic-v2 = ["embeddings", "fastembed/nomic-v2-moe"]
 # Qwen3 Reranker (Candle backend, high-precision cross-encoder)
 qwen3-reranker = ["embeddings", "fastembed/qwen3"]
 
+# Qwen3 Embedding 0.6B (Candle backend, 1024d, 32K context, +8.8 MTEB retrieval pts vs nomic v1.5)
+# Uses fastembed's standalone Qwen3TextEmbedding (parallel to the ONNX TextEmbedding enum path).
+# Query/document asymmetry: queries MUST use the instruct prefix via qwen3_format_query().
+# Enable with `--features qwen3-embed` (+ `metal` for Apple Silicon GPU acceleration).
+# Requires candle-core as a direct dep so we can name `Device` and `DType` at the call site.
+qwen3-embed = ["embeddings", "fastembed/qwen3", "dep:candle-core"]
+
 # Metal GPU acceleration on Apple Silicon (significantly faster inference)
 metal = ["fastembed/metal"]

@@ -87,6 +94,17 @@ notify = "8"
 # v5.11: Adds Nomic v2 MoE (nomic-v2-moe feature) + Qwen3 reranker (qwen3 feature)
 fastembed = { version = "5.11", default-features = false, features = ["hf-hub-native-tls", "image-models"], optional = true }
 
+# candle-core is already pulled in transitively by fastembed's `qwen3` feature,
+# but it is NOT re-exported, so we declare it as a direct optional dep so the
+# Qwen3 backend can name `candle_core::Device` and `candle_core::DType` when
+# calling `Qwen3TextEmbedding::from_hf(...)`.
+#
+# Version tightened to `=0.10.2` (not `^0.10`) so a compromised crates.io push
+# of 0.10.3 can't sneak into a `cargo update`. Cargo.lock pins the exact hash
+# for reproducible builds regardless, but the direct-dep specifier is the
+# secondary defence. Bump in lockstep with fastembed whenever it moves.
+candle-core = { version = "=0.10.2", default-features = false, optional = true }
+
 # ============================================================================
 # OPTIONAL: Vector Search (USearch - HNSW, 20x faster than FAISS)
 # ============================================================================
advanced/adaptive_embedding.rs

@@ -31,11 +31,34 @@
 use serde::{Deserialize, Serialize};
 use std::collections::HashMap;
 
-/// Default embedding dimensions after Matryoshka truncation (768 → 256)
-pub const DEFAULT_DIMENSIONS: usize = 256;
+/// Default embedding dimensions for the active backend.
+/// Tracks `embeddings::local::EMBEDDING_DIMENSIONS` — must match at compile time
+/// or the adaptive strategy layer ends up sizing buffers against the wrong
+/// backend. 256 for nomic (Matryoshka 768→256), 1024 for Qwen3 native.
+pub const DEFAULT_DIMENSIONS: usize = {
+    #[cfg(feature = "qwen3-embed")]
+    {
+        1024
+    }
+    #[cfg(not(feature = "qwen3-embed"))]
+    {
+        256
+    }
+};
 
-/// Code embedding dimensions (matches default after Matryoshka truncation)
-pub const CODE_DIMENSIONS: usize = 256;
+/// Code embedding dimensions (matches default after Matryoshka truncation).
+/// Same-shape gating as `DEFAULT_DIMENSIONS` so code embeddings flow through
+/// the same index when the backend is swapped.
+pub const CODE_DIMENSIONS: usize = {
+    #[cfg(feature = "qwen3-embed")]
+    {
+        1024
+    }
+    #[cfg(not(feature = "qwen3-embed"))]
+    {
+        256
+    }
+};
 
 /// Supported programming languages for code embeddings
 #[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
embeddings/local.rs

@@ -1,12 +1,22 @@
 //! Local Semantic Embeddings
 //!
-//! Uses fastembed v5.11 for local inference.
+//! Uses fastembed v5.13 for local inference.
 //!
 //! ## Models
 //!
-//! - **Default**: Nomic Embed Text v1.5 (ONNX, 768d → 256d Matryoshka, 8192 context)
-//! - **Optional**: Nomic Embed Text v2 MoE (Candle, 475M params, 305M active, 8 experts)
-//!   Enable with `nomic-v2` feature flag + `metal` for Apple Silicon acceleration.
+//! - **Default (nomic)**: Nomic Embed Text v1.5 (ONNX via `TextEmbedding`, 768d → 256d
+//!   Matryoshka, 8192 token context). Single binary, no GPU required.
+//! - **Optional (qwen3)**: Qwen3 Embedding 0.6B (Candle via `Qwen3TextEmbedding`, 1024d,
+//!   32K token context). Enable with `qwen3-embed` feature flag; combine with `metal`
+//!   for Apple Silicon GPU acceleration. Asymmetric: queries MUST use the Instruct
+//!   prefix via [`qwen3_format_query`], documents get no prefix.
+//!
+//! ## Dual-backend architecture
+//!
+//! The `Backend` enum routes to either fastembed's ONNX `TextEmbedding` path
+//! (nomic, default) or the standalone Candle `Qwen3TextEmbedding` path
+//! (Qwen3, feature-gated). Both are held behind the same global `OnceLock<Mutex<...>>`
+//! so the rest of Vestige calls `EmbeddingService::embed()` unchanged.
 
 use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};
 use std::sync::{Mutex, OnceLock};
@@ -15,23 +25,166 @@ use std::sync::{Mutex, OnceLock};
 // CONSTANTS
 // ============================================================================
 
-/// Embedding dimensions after Matryoshka truncation
+/// Nomic Embed Text v1.5 output dimensions after Matryoshka truncation.
 /// Truncated from 768 → 256 for 3x storage savings with only ~2% quality loss
-/// (Matryoshka Representation Learning — the first N dims ARE the N-dim representation)
-pub const EMBEDDING_DIMENSIONS: usize = 256;
+/// (Matryoshka Representation Learning — the first N dims ARE the N-dim representation).
+pub const NOMIC_EMBEDDING_DIMENSIONS: usize = 256;
 
-/// Maximum text length for embedding (truncated if longer)
-pub const MAX_TEXT_LENGTH: usize = 8192;
+/// Qwen3-Embedding-0.6B native output dimensions.
+/// Supports Matryoshka truncation to 32-1024; we keep the full 1024 by default
+/// so the ~+8-9 point MTEB retrieval lift over the nomic baseline isn't eroded
+/// by truncation (Qwen3-Embedding-0.6B scores 61.83 on MTEB-Eng-v2 retrieval
+/// per its HF model card; nomic-v1.5 is around 52-53 on the comparable bench —
+/// cross-version so delta is ballpark, not precision).
+/// Index storage cost at int8 = 1 KB/vec vs nomic's 256 B/vec (4x).
+pub const QWEN3_EMBEDDING_DIMENSIONS: usize = 1024;
+
+/// Back-compat alias: default embedding dimensions used by downstream code that
+/// predates the dual-backend split. Always returns the active backend's native
+/// dimension count. Callers that need a specific backend's dim should use the
+/// explicit constants above.
+pub const EMBEDDING_DIMENSIONS: usize = {
+    #[cfg(feature = "qwen3-embed")]
+    {
+        QWEN3_EMBEDDING_DIMENSIONS
+    }
+    #[cfg(not(feature = "qwen3-embed"))]
+    {
+        NOMIC_EMBEDDING_DIMENSIONS
+    }
+};
+
+/// Maximum text length for embedding (truncated if longer).
+/// Nomic caps at 8K; Qwen3 allows 32K. Use the active backend's limit.
+pub const MAX_TEXT_LENGTH: usize = {
+    #[cfg(feature = "qwen3-embed")]
+    {
+        32_768
+    }
+    #[cfg(not(feature = "qwen3-embed"))]
+    {
+        8192
+    }
+};
 
 /// Batch size for efficient embedding generation
 pub const BATCH_SIZE: usize = 32;
 
+/// Qwen3 instruct prefix template for retrieval queries.
+/// Qwen3-Embedding is asymmetric: queries get the instruct-wrapped format,
+/// documents are embedded raw. Missing this drops retrieval NDCG by ~3 points
+/// per the Qwen3 model card.
+pub const QWEN3_QUERY_INSTRUCTION: &str =
+    "Given a web search query, retrieve relevant passages that answer the query";
+
+/// Format a query string with the Qwen3 instruct prefix. No-op under the nomic
+/// backend (nomic is symmetric — query and document embeddings share an embedding
+/// function). Call this on QUERY text at search time, never on document text
+/// at ingest time.
+///
+/// The exact template (no space after `Query:`) matches the canonical
+/// `get_detailed_instruct` function in the Qwen3-Embedding-0.6B model card —
+/// the TEI docs happen to include a space but the Python reference function
+/// does not, so we match the Python reference.
+#[inline]
+pub fn qwen3_format_query(query: &str) -> String {
+    #[cfg(feature = "qwen3-embed")]
+    {
+        format!(
+            "Instruct: {instruction}\nQuery:{query}",
+            instruction = QWEN3_QUERY_INSTRUCTION,
+            query = query,
+        )
+    }
+    #[cfg(not(feature = "qwen3-embed"))]
+    {
+        query.to_string()
+    }
+}
+
 // ============================================================================
-// GLOBAL MODEL (with Mutex for fastembed v5 API)
+// BACKEND ENUM
 // ============================================================================
 
-/// Result type for model initialization
-static EMBEDDING_MODEL_RESULT: OnceLock<Result<Mutex<TextEmbedding>, String>> = OnceLock::new();
+/// Internal embedding backend. Held inside a `Mutex` to serialise access
+/// across callers — the Nomic ONNX path needs `&mut self` for `embed`, and
+/// the Qwen3 Candle path takes `&self` but the `Mutex` still keeps the API
+/// uniform and gives us a single mutation-safe story under both cfgs without
+/// a `Backend`-specific lock type.
+///
+/// The enum is private — callers go through `EmbeddingService` which hides
+/// the branch. This lets us add new backends (e.g. ONNX Qwen3 for lower memory,
+/// binary-quantized variants) without breaking downstream code.
+enum Backend {
+    /// fastembed ONNX path — Nomic Embed Text v1.5 (768d → 256d Matryoshka).
+    ///
+    /// This variant is constructed only under the default (non-Qwen3) build.
+    /// When `qwen3-embed` is enabled `init_backend` selects [`Self::Qwen3`]
+    /// exclusively, so the Nomic variant is dead code under that cfg. The
+    /// match arms on this enum still handle both variants so the codebase
+    /// can be audited feature-agnostically without `#[cfg]` noise.
+    #[cfg_attr(feature = "qwen3-embed", allow(dead_code))]
+    Nomic(TextEmbedding),
+    /// fastembed Candle path — Qwen3-Embedding-0.6B (1024d, 32K context).
+    #[cfg(feature = "qwen3-embed")]
+    Qwen3(fastembed::Qwen3TextEmbedding),
+}
+
+impl Backend {
+    /// Embed a batch of texts. The Nomic path truncates to Matryoshka dims
+    /// internally; the Qwen3 path returns full-dim L2-normalized vectors.
+    fn embed_batch(&mut self, texts: Vec<&str>) -> Result<Vec<Vec<f32>>, EmbeddingError> {
+        match self {
+            Self::Nomic(model) => model
+                .embed(texts, None)
+                .map_err(|e| EmbeddingError::EmbeddingFailed(e.to_string())),
+            #[cfg(feature = "qwen3-embed")]
+            Self::Qwen3(model) => model
+                .embed(&texts)
+                .map_err(|e| EmbeddingError::EmbeddingFailed(e.to_string())),
+        }
+    }
+
+    /// Post-process a raw embedding before handing it back to callers.
+    /// Nomic: Matryoshka-truncate to 256d + L2-normalize.
+    /// Qwen3: pass through (already last-token pooled and L2-normalized by the model).
+    #[inline]
+    fn post_process(&self, raw: Vec<f32>) -> Vec<f32> {
+        match self {
+            Self::Nomic(_) => matryoshka_truncate(raw),
+            #[cfg(feature = "qwen3-embed")]
+            Self::Qwen3(_) => raw,
+        }
+    }
+
+    /// HuggingFace repo ID for this backend's model. Written to the
+    /// `embedding_model` column on every knowledge-node row so dual-index
+    /// search can route queries to the matching USearch index at retrieval time.
+    fn model_name(&self) -> &'static str {
+        match self {
+            Self::Nomic(_) => "nomic-ai/nomic-embed-text-v1.5",
+            #[cfg(feature = "qwen3-embed")]
+            Self::Qwen3(_) => "Qwen/Qwen3-Embedding-0.6B",
+        }
+    }
+
+    /// Output vector dimensions after post-processing.
+    fn dimensions(&self) -> usize {
+        match self {
+            Self::Nomic(_) => NOMIC_EMBEDDING_DIMENSIONS,
+            #[cfg(feature = "qwen3-embed")]
+            Self::Qwen3(_) => QWEN3_EMBEDDING_DIMENSIONS,
+        }
+    }
+}
+
+// ============================================================================
+// GLOBAL MODEL (with Mutex for fastembed's &mut self API)
+// ============================================================================
+
+/// Global backend, initialised on first use. Held as a `Mutex` because both
+/// underlying embedding models require exclusive access for `embed()`.
+static EMBEDDING_BACKEND: OnceLock<Result<Mutex<Backend>, String>> = OnceLock::new();
 
 /// Get the default cache directory for fastembed models.
 ///
@@ -65,37 +218,93 @@ pub(crate) fn get_cache_dir() -> std::path::PathBuf {
     std::path::PathBuf::from(".fastembed_cache")
 }
 
-/// Initialize the global embedding model
-/// Using nomic-embed-text-v1.5 (768d) - 8192 token context, Matryoshka support
-fn get_model() -> Result<std::sync::MutexGuard<'static, TextEmbedding>, EmbeddingError> {
-    let result = EMBEDDING_MODEL_RESULT.get_or_init(|| {
-        // Get cache directory (respects FASTEMBED_CACHE_PATH env var)
-        let cache_dir = get_cache_dir();
-
-        // Create cache directory if it doesn't exist
-        if let Err(e) = std::fs::create_dir_all(&cache_dir) {
-            tracing::warn!("Failed to create cache directory {:?}: {}", cache_dir, e);
-        }
-
-        // nomic-embed-text-v1.5: 768 dimensions, 8192 token context
-        // Matryoshka representation learning, fully open source
-        let options = InitOptions::new(EmbeddingModel::NomicEmbedTextV15)
-            .with_show_download_progress(true)
-            .with_cache_dir(cache_dir);
-
-        TextEmbedding::try_new(options)
-            .map(Mutex::new)
-            .map_err(|e| {
-                format!(
-                    "Failed to initialize nomic-embed-text-v1.5 embedding model: {}. \
-                     Ensure ONNX runtime is available and model files can be downloaded.",
-                    e
-                )
-            })
+/// Initialise the Nomic ONNX backend. Downloads the model on first use.
+///
+/// Called by [`init_backend`] only when `qwen3-embed` is NOT enabled.
+/// Kept compiled under both cfgs so that a future runtime-selectable backend
+/// can reuse it without a cross-feature refactor; silenced as dead code when
+/// the Qwen3 feature is on.
+#[cfg_attr(feature = "qwen3-embed", allow(dead_code))]
+fn init_nomic(cache_dir: std::path::PathBuf) -> Result<Backend, String> {
+    // nomic-embed-text-v1.5: 768 dimensions, 8192 token context, Matryoshka
+    let options = InitOptions::new(EmbeddingModel::NomicEmbedTextV15)
+        .with_show_download_progress(true)
+        .with_cache_dir(cache_dir);
+
+    TextEmbedding::try_new(options).map(Backend::Nomic).map_err(|e| {
+        format!(
+            "Failed to initialize nomic-embed-text-v1.5 embedding model: {}. \
+             Ensure ONNX runtime is available and model files can be downloaded.",
+            e
+        )
+    })
+}
+
+/// Initialise the Qwen3 Candle backend. Downloads ~1.2 GB model weights on first
+/// use (same cache dir as the ONNX path). Uses Metal GPU on Apple Silicon when
+/// `metal` feature is on; CPU otherwise. CUDA auto-selection is a Day-3+ follow
+/// (candle-core 0.10 exposes `Device::new_cuda(0)` but we ship the CPU fallback
+/// first to keep Linux users working out of the box).
+#[cfg(feature = "qwen3-embed")]
+fn init_qwen3(_cache_dir: std::path::PathBuf) -> Result<Backend, String> {
+    // Device selection is caller-side in candle-core 0.10: fastembed does NOT
+    // auto-select from its own `metal` / `cuda` feature. We gate on vestige-core's
+    // `metal` feature and fall back to CPU if Metal init fails (e.g. x86 macOS
+    // or a broken Apple Silicon Metal stack) so the feature flag is always safe
+    // to combine with qwen3-embed.
+    #[cfg(feature = "metal")]
+    let device = candle_core::Device::new_metal(0).unwrap_or_else(|e| {
+        tracing::warn!("Metal device init failed ({}); falling back to CPU", e);
+        candle_core::Device::Cpu
+    });
+    #[cfg(not(feature = "metal"))]
+    let device = candle_core::Device::Cpu;
+
+    let dtype = candle_core::DType::F32;
+
+    fastembed::Qwen3TextEmbedding::from_hf(
+        "Qwen/Qwen3-Embedding-0.6B",
+        &device,
+        dtype,
+        MAX_TEXT_LENGTH,
+    )
+    .map(Backend::Qwen3)
+    .map_err(|e| {
+        format!(
+            "Failed to initialize Qwen3-Embedding-0.6B: {}. \
+             First-run requires ~1.2 GB download to ~/.cache/vestige/fastembed; \
+             subsequent runs load from cache.",
+            e
+        )
+    })
+}
+
+/// Initialise the active backend based on compiled features. Qwen3 wins when
+/// both features are enabled (it's strictly newer and more capable).
+fn init_backend() -> Result<Backend, String> {
+    let cache_dir = get_cache_dir();
+
+    // Create cache directory if it doesn't exist
+    if let Err(e) = std::fs::create_dir_all(&cache_dir) {
+        tracing::warn!("Failed to create cache directory {:?}: {}", cache_dir, e);
+    }
+
+    #[cfg(feature = "qwen3-embed")]
+    {
+        init_qwen3(cache_dir)
+    }
+    #[cfg(not(feature = "qwen3-embed"))]
+    {
+        init_nomic(cache_dir)
+    }
+}
+
+/// Lock and return the global embedding backend. Initialises on first call.
+fn get_backend() -> Result<std::sync::MutexGuard<'static, Backend>, EmbeddingError> {
+    let result = EMBEDDING_BACKEND.get_or_init(|| init_backend().map(Mutex::new));
 
     match result {
-        Ok(model) => model
+        Ok(backend) => backend
            .lock()
            .map_err(|e| EmbeddingError::ModelInit(format!("Lock poisoned: {}", e))),
        Err(err) => Err(EmbeddingError::ModelInit(err.clone())),
@@ -223,7 +432,7 @@ impl EmbeddingService {
 
     /// Check if the model is ready
     pub fn is_ready(&self) -> bool {
-        match get_model() {
+        match get_backend() {
             Ok(_) => true,
             Err(e) => {
                 tracing::warn!("Embedding model not ready: {}", e);
@@ -234,33 +443,44 @@ impl EmbeddingService {
 
     /// Check if the model is ready and return the error if not
     pub fn check_ready(&self) -> Result<(), EmbeddingError> {
-        get_model().map(|_| ())
+        get_backend().map(|_| ())
     }
 
     /// Initialize the model (downloads if necessary)
     pub fn init(&self) -> Result<(), EmbeddingError> {
-        let _model = get_model()?; // Ensures model is loaded and returns any init errors
+        let _model = get_backend()?; // Ensures model is loaded and returns any init errors
         Ok(())
     }
 
-    /// Get the model name
+    /// HuggingFace repo ID of the active backend. Used by storage to tag every
+    /// embedded row with its source model so dual-index search can route at
+    /// retrieval time without re-embedding the query against every index.
     pub fn model_name(&self) -> &'static str {
-        #[cfg(feature = "nomic-v2")]
-        {
-            "nomic-ai/nomic-embed-text-v2-moe"
-        }
-        #[cfg(not(feature = "nomic-v2"))]
-        {
-            "nomic-ai/nomic-embed-text-v1.5"
-        }
+        // Acquire the lock only to read a const — cheap, and avoids duplicating
+        // the cfg branch at the call site.
+        match get_backend() {
+            Ok(guard) => guard.model_name(),
+            #[cfg(feature = "qwen3-embed")]
+            Err(_) => "Qwen/Qwen3-Embedding-0.6B",
+            #[cfg(not(feature = "qwen3-embed"))]
+            Err(_) => "nomic-ai/nomic-embed-text-v1.5",
+        }
     }
 
-    /// Get the embedding dimensions
+    /// Output vector dimensions for the active backend.
     pub fn dimensions(&self) -> usize {
-        EMBEDDING_DIMENSIONS
+        match get_backend() {
+            Ok(guard) => guard.dimensions(),
+            Err(_) => EMBEDDING_DIMENSIONS,
+        }
     }
 
-    /// Generate embedding for a single text
+    /// Generate embedding for a single text.
+    ///
+    /// Documents go in raw. For QUERY text under the Qwen3 backend, the caller
+    /// is responsible for wrapping with [`qwen3_format_query`] before calling
+    /// this method — the asymmetric query/document format is intentional and
+    /// handled at the search layer, not the embedding layer.
     pub fn embed(&self, text: &str) -> Result<Embedding, EmbeddingError> {
         if text.is_empty() {
             return Err(EmbeddingError::InvalidInput(
@@ -268,7 +488,7 @@ impl EmbeddingService {
             ));
         }
 
-        let mut model = get_model()?;
+        let mut backend = get_backend()?;
 
         // Truncate if too long (char-boundary safe)
        let text = if text.len() > MAX_TEXT_LENGTH {
@@ -281,26 +501,36 @@ impl EmbeddingService {
             text
         };
 
-        let embeddings = model
-            .embed(vec![text], None)
-            .map_err(|e| EmbeddingError::EmbeddingFailed(e.to_string()))?;
+        let raw = backend.embed_batch(vec![text])?;
 
-        if embeddings.is_empty() {
+        // Shape contract: both backends must return at least one vector of
+        // non-zero length for a non-empty input. An empty outer or inner vec
+        // means the backend misbehaved (e.g. fastembed regression, malformed
+        // ONNX output). Guard both so a silent zero-dim vector never lands in
+        // the USearch index where it would later blow up with an opaque
+        // InvalidDimensions error deep in the search path.
+        let first = raw.into_iter().next().ok_or_else(|| {
+            EmbeddingError::EmbeddingFailed("No embedding generated".to_string())
+        })?;
+        if first.is_empty() {
             return Err(EmbeddingError::EmbeddingFailed(
-                "No embedding generated".to_string(),
+                "Backend returned an empty embedding vector".to_string(),
             ));
         }
 
-        Ok(Embedding::new(matryoshka_truncate(embeddings[0].clone())))
+        Ok(Embedding::new(backend.post_process(first)))
     }
 
-    /// Generate embeddings for multiple texts (batch processing)
+    /// Generate embeddings for multiple texts (batch processing).
+    ///
+    /// As with [`Self::embed`], query/document asymmetry is the caller's
+    /// responsibility: wrap query texts with [`qwen3_format_query`] upstream.
     pub fn embed_batch(&self, texts: &[&str]) -> Result<Vec<Embedding>, EmbeddingError> {
         if texts.is_empty() {
             return Ok(vec![]);
         }
 
-        let mut model = get_model()?;
+        let mut backend = get_backend()?;
         let mut all_embeddings = Vec::with_capacity(texts.len());
 
         // Process in batches for efficiency
@@ -320,12 +550,10 @@ impl EmbeddingService {
             })
             .collect();
 
-        let embeddings = model
-            .embed(truncated, None)
-            .map_err(|e| EmbeddingError::EmbeddingFailed(e.to_string()))?;
+        let raw = backend.embed_batch(truncated)?;
 
-        for emb in embeddings {
-            all_embeddings.push(Embedding::new(matryoshka_truncate(emb)));
+        for emb in raw {
+            all_embeddings.push(Embedding::new(backend.post_process(emb)));
         }
     }
 
@@ -356,15 +584,19 @@ impl EmbeddingService {
 // SIMILARITY FUNCTIONS
 // ============================================================================
 
-/// Apply Matryoshka truncation: truncate to EMBEDDING_DIMENSIONS and L2-normalize
+/// Apply Matryoshka truncation: truncate to [`NOMIC_EMBEDDING_DIMENSIONS`] and L2-normalize.
+///
 /// Nomic Embed v1.5 supports Matryoshka Representation Learning,
 /// meaning the first N dimensions of the 768-dim output ARE a valid
 /// N-dimensional embedding with minimal quality loss (~2% on MTEB for 256-dim).
+///
+/// Not applied to the Qwen3 backend — Qwen3 output is already last-token pooled
+/// and L2-normalized by the Candle model internals, and we keep full 1024-dim
+/// by default so the retrieval quality gain over nomic isn't Matryoshka-capped.
 #[inline]
 pub fn matryoshka_truncate(mut vector: Vec<f32>) -> Vec<f32> {
-    if vector.len() > EMBEDDING_DIMENSIONS {
-        vector.truncate(EMBEDDING_DIMENSIONS);
+    if vector.len() > NOMIC_EMBEDDING_DIMENSIONS {
+        vector.truncate(NOMIC_EMBEDDING_DIMENSIONS);
     }
     // L2-normalize the truncated vector
     let norm = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
@@ -512,4 +744,50 @@ mod tests {
         assert_eq!(results[0].0, 0); // First candidate should be most similar
         assert!((results[0].1 - 1.0).abs() < 0.0001);
     }
+
+    #[test]
+    fn test_qwen3_format_query_feature_gated() {
+        let wrapped = qwen3_format_query("cats are cute");
+
+        #[cfg(feature = "qwen3-embed")]
+        {
+            // With Qwen3 active, queries get wrapped in the Instruct template.
+            // No space between `Query:` and the user text — this matches the
+            // canonical `get_detailed_instruct` function in the Qwen3 model
+            // card's Python example, even though the TEI curl example has a
+            // space. We follow the Python reference.
+            assert!(wrapped.starts_with("Instruct: "));
+            assert!(wrapped.ends_with("\nQuery:cats are cute"));
+        }
+        #[cfg(not(feature = "qwen3-embed"))]
+        {
+            // Under the nomic backend the wrapper is a no-op.
+            assert_eq!(wrapped, "cats are cute");
+        }
+    }
+
+    #[test]
+    fn test_backend_dimensions_match_feature_flag() {
+        #[cfg(feature = "qwen3-embed")]
+        assert_eq!(EMBEDDING_DIMENSIONS, QWEN3_EMBEDDING_DIMENSIONS);
+        #[cfg(not(feature = "qwen3-embed"))]
+        assert_eq!(EMBEDDING_DIMENSIONS, NOMIC_EMBEDDING_DIMENSIONS);
+    }
+
+    /// Integration: load the Qwen3 backend and verify it produces a 1024-dim
+    /// L2-normalized vector on CPU. Ignored by default because it downloads
+    /// ~1.2 GB of model weights on first run.
+    ///
+    /// Run with: `cargo test --features qwen3-embed -- --ignored test_qwen3_embed_live`
+    #[cfg(feature = "qwen3-embed")]
+    #[test]
+    #[ignore]
+    fn test_qwen3_embed_live() {
+        let service = EmbeddingService::new();
+        service.init().expect("Qwen3 backend init");
+
+        let emb = service.embed("hello world").expect("embed succeeds");
+        assert_eq!(emb.dimensions, QWEN3_EMBEDDING_DIMENSIONS);
+        assert!(emb.is_normalized(), "Qwen3 output must be L2-normalized");
+    }
 }
embeddings/mod.rs

@@ -1,13 +1,15 @@
 //! Semantic Embeddings Module
 //!
-//! Provides local embedding generation using fastembed (ONNX-based).
-//! No external API calls required - 100% local and private.
+//! Provides local embedding generation using fastembed v5.13.
+//! No external API calls required — 100% local and private.
 //!
 //! Supports:
-//! - Text embedding generation (768-dimensional vectors via nomic-embed-text-v1.5)
-//! - Cosine similarity computation
-//! - Batch embedding for efficiency
-//! - Hybrid multi-model fusion (future)
+//! - Dual backend: Nomic Embed v1.5 (ONNX, default, 768d native → 256d
+//!   Matryoshka) or Qwen3-Embedding-0.6B (Candle, `qwen3-embed` feature,
+//!   1024d native, 32K context).
+//! - Cosine similarity computation.
+//! - Batch embedding for efficiency.
+//! - Hybrid multi-model fusion (future).
 
 mod code;
 mod hybrid;

@@ -16,7 +18,8 @@ mod local;
 pub(crate) use local::get_cache_dir;
 pub use local::{
     BATCH_SIZE, EMBEDDING_DIMENSIONS, Embedding, EmbeddingError, EmbeddingService, MAX_TEXT_LENGTH,
-    cosine_similarity, dot_product, euclidean_distance, matryoshka_truncate,
+    NOMIC_EMBEDDING_DIMENSIONS, QWEN3_EMBEDDING_DIMENSIONS, QWEN3_QUERY_INSTRUCTION,
+    cosine_similarity, dot_product, euclidean_distance, matryoshka_truncate, qwen3_format_query,
 };
 
 pub use code::CodeEmbedding;
search/vector.rs

@@ -17,9 +17,25 @@ use usearch::{Index, IndexOptions, MetricKind, ScalarKind};
 // CONSTANTS
 // ============================================================================
 
-/// Default embedding dimensions after Matryoshka truncation (768 → 256)
-/// 3x storage savings with only ~2% quality loss on MTEB benchmarks
-pub const DEFAULT_DIMENSIONS: usize = 256;
+/// Default embedding dimensions for the active backend.
+///
+/// - Nomic backend (default): 256 after Matryoshka truncation from 768.
+///   3x storage savings with ~2% quality loss on MTEB benchmarks.
+/// - Qwen3 backend (`qwen3-embed` feature): 1024 native, no truncation.
+///
+/// Must track `embeddings::local::EMBEDDING_DIMENSIONS` so the USearch index
+/// dimension matches what `EmbeddingService::embed()` produces. Mismatches
+/// surface as `VectorSearchError::InvalidDimensions` at insert time.
+pub const DEFAULT_DIMENSIONS: usize = {
+    #[cfg(feature = "qwen3-embed")]
+    {
+        1024
+    }
+    #[cfg(not(feature = "qwen3-embed"))]
+    {
+        256
+    }
+};
 
 /// HNSW connectivity parameter (higher = better recall, more memory)
 pub const DEFAULT_CONNECTIVITY: usize = 16;