feat: adding a semantic cache layer

Alpha Nerd 2026-03-08 09:12:09 +01:00
parent c3d47c7ffe
commit dd4b12da6a
13 changed files with 1138 additions and 22 deletions

@@ -204,6 +204,149 @@ max_concurrent_connections: 3
**Recommendation**: Use multiple endpoints for redundancy and load distribution.
## Semantic LLM Cache
NOMYO Router can cache LLM responses and serve them directly — skipping endpoint selection, model load, and token generation entirely.
### How it works
1. On every cacheable request (`/api/chat`, `/api/generate`, `/v1/chat/completions`, `/v1/completions`) the cache is checked **before** choosing an endpoint.
2. On a **cache hit** the stored response is returned immediately as a single chunk, for both streaming and non-streaming requests.
3. On a **cache miss** the request is forwarded normally. The response is stored in the cache after it completes.
4. **MOE requests** (`moe-*` model prefix) always bypass the cache.
5. **Token counts** are never recorded for cache hits.
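The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the router's actual code: `DictCache`, `handle_request`, and the `forward` callback are hypothetical names, and the key is a plain exact-match string rather than the semantic key described in the next section.

```python
from typing import Callable, Optional

# Routes listed as cacheable in the steps above.
CACHEABLE_PATHS = {
    "/api/chat",
    "/api/generate",
    "/v1/chat/completions",
    "/v1/completions",
}


class DictCache:
    """Hypothetical in-memory stand-in for the real cache backend."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        value = self._store.get(key)
        if value is None:
            self.misses += 1
        else:
            self.hits += 1
        return value

    def put(self, key: str, value: str) -> None:
        self._store[key] = value


def handle_request(
    path: str,
    model: str,
    prompt: str,
    cache: DictCache,
    forward: Callable[[str, str], str],
) -> str:
    """Check the cache first; forward to an endpoint only on a miss."""
    # MOE requests and non-cacheable routes bypass the cache entirely.
    if path not in CACHEABLE_PATHS or model.startswith("moe-"):
        return forward(model, prompt)

    key = f"{model}\x00{prompt}"  # simplified exact-match key (assumption)
    cached = cache.get(key)
    if cached is not None:
        # Hit: return before any endpoint is chosen or tokens are counted.
        return cached

    response = forward(model, prompt)  # miss: normal forwarding
    cache.put(key, response)           # store after the response completes
    return response
```

Calling `handle_request` twice with the same arguments invokes `forward` only once; a model named `moe-...` invokes it on every call.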
### Cache key strategy
| Signal | How matched |
|---|---|
| `model + system_prompt` | Exact — hard context isolation per deployment |
| BM25-weighted embedding of chat history | Semantic — conversation context signal |
| Embedding of last user message | Semantic — the actual question |
The two semantic vectors are combined into a weighted mean (weight set by `cache_history_weight`) before the cosine similarity comparison. The result is still a single 384-dimensional vector, so it remains compatible with the library's storage format.
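To make the weighted mean concrete, here is a small sketch using only the standard library. The function names are hypothetical, the toy vectors are 3-dimensional instead of 384-dimensional, and whether the router normalizes the vectors before or after combining them is not stated here, so no normalization is applied.

```python
import math
from typing import Sequence


def combine(
    history_vec: Sequence[float],
    query_vec: Sequence[float],
    history_weight: float = 0.3,
) -> list[float]:
    """Weighted mean of the history and last-message embeddings."""
    if len(history_vec) != len(query_vec):
        raise ValueError("embeddings must have the same dimension")
    w = history_weight
    return [w * h + (1.0 - w) * q for h, q in zip(history_vec, query_vec)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_hit(stored: Sequence[float], incoming: Sequence[float], threshold: float) -> bool:
    """A lookup counts as a hit when similarity reaches the threshold."""
    return cosine_similarity(stored, incoming) >= threshold
```

Because the output of `combine` has the same length as its inputs, the combined key occupies exactly one vector slot per cache entry.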
### Quick start — exact match (lean image)
```yaml
cache_enabled: true
cache_backend: sqlite # persists across restarts
cache_similarity: 1.0 # exact match only, no sentence-transformers needed
cache_ttl: 3600
```
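In exact-match mode no embedding is computed, so a lookup can reduce to comparing a deterministic key. One plausible way to derive such a key is to hash a canonical serialization of the request. How the router actually builds its key is not shown in this document, so `exact_cache_key` and its field names are illustrative assumptions.

```python
import hashlib
import json


def exact_cache_key(model: str, system_prompt: str, messages: list[dict]) -> str:
    """Return a stable SHA-256 hex digest for one request (hypothetical scheme)."""
    payload = json.dumps(
        {"model": model, "system": system_prompt, "messages": messages},
        sort_keys=True,          # key order must not change the digest
        separators=(",", ":"),   # no whitespace differences
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two requests produce the same key only if the model, system prompt, and every message are identical, which matches the "exact match only" behavior of `cache_similarity: 1.0`.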
### Quick start — semantic matching (:semantic image)
```yaml
cache_enabled: true
cache_backend: sqlite
cache_similarity: 0.90 # hit if cosine similarity ≥ 0.90
cache_ttl: 3600
cache_history_weight: 0.3
```
Pull the semantic image:
```bash
docker pull ghcr.io/nomyo-ai/nomyo-router:latest-semantic
```
### Cache configuration options
#### `cache_enabled`
**Type**: `bool` | **Default**: `false`
Enable or disable the cache. All other cache settings are ignored when `false`.
#### `cache_backend`
**Type**: `str` | **Default**: `"memory"`
| Value | Description | Persists | Multi-replica |
|---|---|---|---|
| `memory` | In-process LRU dict | ❌ | ❌ |
| `sqlite` | File-based via `aiosqlite` | ✅ | ❌ |
| `redis` | Redis via `redis.asyncio` | ✅ | ✅ |
Use `redis` when running multiple router replicas behind a load balancer — all replicas share one warm cache.
#### `cache_similarity`
**Type**: `float` | **Default**: `1.0`
Cosine similarity threshold. `1.0` means exact match only (no embedding model needed). Values below `1.0` enable semantic matching, which requires the `:semantic` Docker image tag.
Recommended starting value for semantic mode: `0.90`.
#### `cache_ttl`
**Type**: `int | null` | **Default**: `3600`
Time-to-live for cache entries in seconds. Remove the key or set to `null` to cache forever.
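The TTL semantics can be expressed as a small predicate. This is a sketch under assumptions: `is_expired` is a hypothetical helper, and treating an entry as expired only once its age strictly exceeds the TTL is a choice made for this example.

```python
import time
from typing import Optional


def is_expired(stored_at: float, ttl: Optional[int], now: Optional[float] = None) -> bool:
    """Return True if an entry stored at `stored_at` (epoch seconds) is stale."""
    if ttl is None:
        return False  # `null` TTL: the entry never expires
    current = time.time() if now is None else now
    return current - stored_at > ttl
```

With `cache_ttl: 3600`, an entry written at time `t` is still served at `t + 3599` and is treated as stale at `t + 3601`.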
#### `cache_db_path`
**Type**: `str` | **Default**: `"llm_cache.db"`
Path to the SQLite cache database. Only used when `cache_backend: sqlite`.
#### `cache_redis_url`
**Type**: `str` | **Default**: `"redis://localhost:6379/0"`
Redis connection URL. Only used when `cache_backend: redis`.
#### `cache_history_weight`
**Type**: `float` | **Default**: `0.3`
Weight of the BM25-weighted chat-history embedding in the combined cache key vector. `0.3` means the history contributes 30% and the final user message contributes 70% of the similarity signal. Only used when `cache_similarity < 1.0`.
### Cache management endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/api/cache/stats` | `GET` | Hit/miss counters, hit rate, current config |
| `/api/cache/invalidate` | `POST` | Clear all cache entries and reset counters |
```bash
# Check cache performance
curl http://localhost:12434/api/cache/stats
# Clear the cache
curl -X POST http://localhost:12434/api/cache/invalidate
```
Example stats response:
```json
{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
```
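The `hit_rate` in the example is consistent with `hits / (hits + misses)`. Assuming that is how the field is defined, a client can recompute it from the counters, for instance to verify a monitoring dashboard. The `hit_rate` helper below is hypothetical and works on an already-parsed stats payload.

```python
import json


def hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from the cache; 0.0 before any lookup."""
    total = hits + misses
    return hits / total if total else 0.0


# Parse a payload shaped like the example response above.
stats = json.loads('{"hits": 1547, "misses": 892, "hit_rate": 0.634}')
recomputed = round(hit_rate(stats["hits"], stats["misses"]), 3)
```

The zero-total guard matters right after `/api/cache/invalidate`, when both counters have been reset.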
### Docker image variants
| Tag | Semantic cache | Image size |
|---|---|---|
| `latest` | ❌ exact match only | ~300 MB |
| `latest-semantic` | ✅ sentence-transformers + model pre-baked | ~800 MB |
Build locally:
```bash
# Lean (exact match)
docker build -t nomyo-router .
# Semantic (~500 MB larger, all-MiniLM-L6-v2 model baked in)
docker build --build-arg SEMANTIC_CACHE=true -t nomyo-router:semantic .
```
## Configuration Validation
The router validates the configuration at startup: