feat: add a semantic cache layer
parent c3d47c7ffe
commit dd4b12da6a
13 changed files with 1138 additions and 22 deletions
@@ -204,6 +204,149 @@ max_concurrent_connections: 3

**Recommendation**: Use multiple endpoints for redundancy and load distribution.
## Semantic LLM Cache

NOMYO Router can cache LLM responses and serve them directly — skipping endpoint selection, model load, and token generation entirely.

### How it works
1. On every cacheable request (`/api/chat`, `/api/generate`, `/v1/chat/completions`, `/v1/completions`) the cache is checked **before** choosing an endpoint.
2. On a **cache hit** the stored response is returned immediately as a single chunk (streaming or non-streaming — both work).
3. On a **cache miss** the request is forwarded normally. The response is stored in the cache after it completes.
4. **MOE requests** (`moe-*` model prefix) always bypass the cache.
5. **Token counts** are never recorded for cache hits.
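The steps above can be sketched in a few lines. This is a hypothetical outline, not the router's actual code; the names `cache`, `select_endpoint`, and `forward` are illustrative placeholders.

```python
# Hypothetical sketch of the cache-first request flow described above.
# `cache`, `select_endpoint`, and `forward` are illustrative, not the
# router's actual internals.

def handle_request(request, cache, select_endpoint, forward):
    # MOE requests (moe-* model prefix) always bypass the cache.
    if request["model"].startswith("moe-"):
        return forward(select_endpoint(request), request)

    # 1. Check the cache before choosing an endpoint.
    cached = cache.get(request)
    if cached is not None:
        # 2. Cache hit: return the stored response as a single chunk.
        return cached

    # 3. Cache miss: route normally, then store the completed response.
    response = forward(select_endpoint(request), request)
    cache.put(request, response)
    return response
```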

### Cache key strategy

| Signal | How matched |
|---|---|
| `model + system_prompt` | Exact — hard context isolation per deployment |
| BM25-weighted embedding of chat history | Semantic — conversation context signal |
| Embedding of last user message | Semantic — the actual question |

The two semantic vectors are combined as a weighted mean (tuned by `cache_history_weight`) before the cosine similarity comparison, so the cache key remains a single 384-dimensional vector compatible with the library's storage format.
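The combination math, as described above, can be written out directly. This is a sketch inferred from the description (the actual implementation may differ): two same-length embedding vectors are blended by `cache_history_weight`, and the result is compared by cosine similarity.

```python
# Sketch of the cache-key combination described above (assumed from the
# prose, not taken from the router's source).
import math

def combine(history_vec, message_vec, history_weight=0.3):
    # Weighted mean: the history embedding contributes `history_weight`,
    # the last-user-message embedding contributes the rest. The result
    # stays a single vector of the same dimensionality (384 for
    # all-MiniLM-L6-v2 embeddings).
    w = history_weight
    return [w * h + (1 - w) * m for h, m in zip(history_vec, message_vec)]

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```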

### Quick start — exact match (lean image)

```yaml
cache_enabled: true
cache_backend: sqlite    # persists across restarts
cache_similarity: 1.0    # exact match only, no sentence-transformers needed
cache_ttl: 3600
```

### Quick start — semantic matching (`:semantic` image)

```yaml
cache_enabled: true
cache_backend: sqlite
cache_similarity: 0.90   # hit if cosine similarity ≥ 0.90
cache_ttl: 3600
cache_history_weight: 0.3
```

Pull the semantic image:

```bash
docker pull ghcr.io/nomyo-ai/nomyo-router:latest-semantic
```

### Cache configuration options

#### `cache_enabled`

**Type**: `bool` | **Default**: `false`

Enable or disable the cache. All other cache settings are ignored when `false`.
#### `cache_backend`

**Type**: `str` | **Default**: `"memory"`

| Value | Description | Persists | Multi-replica |
|---|---|---|---|
| `memory` | In-process LRU dict | ❌ | ❌ |
| `sqlite` | File-based via `aiosqlite` | ✅ | ❌ |
| `redis` | Redis via `redis.asyncio` | ✅ | ✅ |

Use `redis` when running multiple router replicas behind a load balancer — all replicas share one warm cache.
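A minimal sketch of what the `memory` backend could look like: an in-process LRU dict with an optional TTL. This is illustrative only, assuming LRU eviction and the TTL semantics described below; the router's real backend is not shown in this document.

```python
# Illustrative in-process LRU cache with optional TTL; not the router's
# actual `memory` backend implementation.
import time
from collections import OrderedDict

class MemoryCache:
    def __init__(self, max_entries=1024, ttl=3600):
        self.data = OrderedDict()   # key -> (stored_at, response)
        self.max_entries = max_entries
        self.ttl = ttl              # None = cache forever

    def get(self, key):
        if key not in self.data:
            return None
        stored_at, response = self.data[key]
        if self.ttl is not None and time.time() - stored_at > self.ttl:
            del self.data[key]      # entry expired
            return None
        self.data.move_to_end(key)  # mark as recently used
        return response

    def put(self, key, response):
        self.data[key] = (time.time(), response)
        self.data.move_to_end(key)
        if len(self.data) > self.max_entries:
            self.data.popitem(last=False)  # evict least recently used
```

Lost on restart and never shared across processes, which is why multi-replica deployments need `redis`.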

#### `cache_similarity`

**Type**: `float` | **Default**: `1.0`

Cosine similarity threshold. `1.0` means exact match only (no embedding model needed). Values below `1.0` enable semantic matching, which requires the `:semantic` Docker image tag.

Recommended starting value for semantic mode: `0.90`.
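At `1.0` no embeddings are involved, so a lookup can reduce to an exact key over the request contents. The sketch below is an assumption about what such a key could look like, not the router's actual key format.

```python
# Hypothetical exact-match cache key: a stable hash over the request
# fields that the cache-key table says must match exactly.
import hashlib
import json

def exact_cache_key(model, system_prompt, messages):
    # sort_keys makes the serialization deterministic, so identical
    # requests always hash to the same key.
    payload = json.dumps(
        {"model": model, "system": system_prompt, "messages": messages},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```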

#### `cache_ttl`

**Type**: `int | null` | **Default**: `3600`

Time-to-live for cache entries in seconds. Remove the key or set it to `null` to cache forever.
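The TTL rule in one helper: an entry written at `stored_at` is stale once `ttl` seconds have passed, and `ttl=None` (YAML `null`) means entries never expire.

```python
# Expiry check matching the cache_ttl semantics above.
def is_expired(stored_at, now, ttl):
    if ttl is None:          # cache forever
        return False
    return now - stored_at > ttl
```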

#### `cache_db_path`

**Type**: `str` | **Default**: `"llm_cache.db"`

Path to the SQLite cache database. Only used when `cache_backend: sqlite`.

#### `cache_redis_url`

**Type**: `str` | **Default**: `"redis://localhost:6379/0"`

Redis connection URL. Only used when `cache_backend: redis`.

#### `cache_history_weight`

**Type**: `float` | **Default**: `0.3`

Weight of the BM25-weighted chat-history embedding in the combined cache-key vector. `0.3` means the history contributes 30% and the final user message contributes 70% of the similarity signal. Only used when `cache_similarity < 1.0`.

### Cache management endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/api/cache/stats` | `GET` | Hit/miss counters, hit rate, current config |
| `/api/cache/invalidate` | `POST` | Clear all cache entries and reset counters |

```bash
# Check cache performance
curl http://localhost:12434/api/cache/stats

# Clear the cache
curl -X POST http://localhost:12434/api/cache/invalidate
```

Example stats response:

```json
{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
```
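The `hit_rate` field is plain arithmetic over the two counters, rounded to three decimals in the payload:

```python
# Cross-checking the example payload: hit_rate = hits / (hits + misses).
hits, misses = 1547, 892
hit_rate = round(hits / (hits + misses), 3)
```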

### Docker image variants

| Tag | Semantic cache | Image size |
|---|---|---|
| `latest` | ❌ exact match only | ~300 MB |
| `latest-semantic` | ✅ sentence-transformers + model pre-baked | ~800 MB |

Build locally:

```bash
# Lean (exact match)
docker build -t nomyo-router .

# Semantic (~500 MB larger, all-MiniLM-L6-v2 model baked in)
docker build --build-arg SEMANTIC_CACHE=true -t nomyo-router:semantic .
```

## Configuration Validation

The router validates the configuration at startup:
@@ -82,10 +82,23 @@ sudo systemctl status nomyo-router

## 2. Docker Deployment

### Image variants
| Tag | Semantic cache | Image size |
|---|---|---|
| `latest` | ❌ exact match only | ~300 MB |
| `latest-semantic` | ✅ sentence-transformers + `all-MiniLM-L6-v2` pre-baked | ~800 MB |

The `:semantic` variant enables `cache_similarity < 1.0` in `config.yaml`. The lean image falls back to exact-match caching with a warning if semantic mode is configured.

### Build the Image

```bash
# Lean build (exact match cache, default)
docker build -t nomyo-router .

# Semantic build (~500 MB larger, all-MiniLM-L6-v2 model baked in at build time)
docker build --build-arg SEMANTIC_CACHE=true -t nomyo-router:semantic .
```

### Run the Container
@@ -1,20 +1,30 @@
# Docker Compose example for NOMYO Router with multiple Ollama instances
#
# Two router profiles are provided:
#   nomyo-router          — lean image, exact-match cache only (~300 MB)
#   nomyo-router-semantic — semantic image, sentence-transformers baked in (~800 MB)
#
# Uncomment the redis service and set cache_backend: redis in config.yaml
# to share the LLM response cache across multiple router replicas.

version: '3.8'

services:
  # NOMYO Router — lean image (exact-match cache, default)
  nomyo-router:
    image: nomyo-router:latest
    build:
      context: .
      args:
        SEMANTIC_CACHE: "false"
    ports:
      - "12434:12434"
    environment:
      - CONFIG_PATH=/app/config/config.yaml
      - NOMYO_ROUTER_DB_PATH=/app/data/token_counts.db
    volumes:
      - ./config:/app/config
      - router-data:/app/data
    depends_on:
      - ollama1
      - ollama2
@@ -23,6 +33,45 @@ services:
    networks:
      - nomyo-net
||||
|
||||
  # NOMYO Router — semantic image (cache_similarity < 1.0 support, ~800 MB)
  # Build:  docker compose build nomyo-router-semantic
  # Switch: comment out nomyo-router above, uncomment this block.
  # nomyo-router-semantic:
  #   image: nomyo-router:semantic
  #   build:
  #     context: .
  #     args:
  #       SEMANTIC_CACHE: "true"
  #   ports:
  #     - "12434:12434"
  #   environment:
  #     - CONFIG_PATH=/app/config/config.yaml
  #     - NOMYO_ROUTER_DB_PATH=/app/data/token_counts.db
  #   volumes:
  #     - ./config:/app/config
  #     - router-data:/app/data
  #     - hf-cache:/app/data/hf_cache   # share HuggingFace model cache across builds
  #   depends_on:
  #     - ollama1
  #     - ollama2
  #     - ollama3
  #   restart: unless-stopped
  #   networks:
  #     - nomyo-net

  # Optional: Redis for shared LLM response cache across multiple router replicas.
  # Requires cache_backend: redis in config.yaml.
  # redis:
  #   image: redis:7-alpine
  #   ports:
  #     - "6379:6379"
  #   volumes:
  #     - redis-data:/data
  #   command: redis-server --save 60 1 --loglevel warning
  #   restart: unless-stopped
  #   networks:
  #     - nomyo-net
  # Ollama Instance 1
  ollama1:
    image: ollama/ollama:latest
@@ -87,7 +136,9 @@ services:
      - nomyo-net

volumes:
  router-data:
  # hf-cache:     # uncomment when using nomyo-router-semantic
  # redis-data:   # uncomment when using Redis cache backend
  ollama1-data:
  ollama2-data:
  ollama3-data:

@@ -29,4 +29,38 @@ api_keys:
  "http://192.168.0.52:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
  "http://localhost:8080/v1": "llama-server"   # Optional API key for llama-server - depends on llama_server config
  "http://192.168.0.33:8081/v1": "llama-server"

# -------------------------------------------------------------
# Semantic LLM Cache (optional — disabled by default)
# Caches LLM responses to cut costs and latency on repeated or
# semantically similar prompts.
# Cached routes: /api/chat /api/generate /v1/chat/completions /v1/completions
# MOE requests (moe-* model prefix) always bypass the cache.
# -------------------------------------------------------------
# cache_enabled: false

# Backend — where cached responses are stored:
#   memory → in-process LRU (lost on restart, not shared across replicas) [default]
#   sqlite → persistent file-based (single instance, survives restart)
#   redis  → distributed (shared across replicas, requires Redis)
# cache_backend: memory

# Cosine similarity threshold for a cache hit:
#   1.0  → exact match only (works on any image variant)
#   <1.0 → semantic matching (requires the :semantic Docker image tag)
# cache_similarity: 1.0

# Response TTL in seconds. Remove the key or set to null to cache forever.
# cache_ttl: 3600

# SQLite backend: path to the cache database file
# cache_db_path: llm_cache.db

# Redis backend: connection URL
# cache_redis_url: redis://localhost:6379/0

# Weight of the BM25-weighted chat-history embedding vs last-user-message embedding.
# 0.3 = 30% history context signal, 70% question signal.
# Only relevant when cache_similarity < 1.0.
# cache_history_weight: 0.3
@@ -133,6 +133,39 @@ Response:
}
```

### Cache Statistics

```bash
curl http://localhost:12434/api/cache/stats
```

Response when cache is enabled:

```json
{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
```

Response when cache is disabled:

```json
{ "enabled": false }
```

### Cache Invalidation

```bash
curl -X POST http://localhost:12434/api/cache/invalidate
```

Clears all cached entries and resets hit/miss counters.

### Real-time Usage Stream

```bash