feat: add a semantic cache layer

Alpha Nerd 2026-03-08 09:12:09 +01:00
parent c3d47c7ffe
commit dd4b12da6a
13 changed files with 1138 additions and 22 deletions


@@ -204,6 +204,149 @@ max_concurrent_connections: 3
**Recommendation**: Use multiple endpoints for redundancy and load distribution.
## Semantic LLM Cache
NOMYO Router can cache LLM responses and serve them directly — skipping endpoint selection, model load, and token generation entirely.
### How it works
1. On every cacheable request (`/api/chat`, `/api/generate`, `/v1/chat/completions`, `/v1/completions`) the cache is checked **before** choosing an endpoint.
2. On a **cache hit** the stored response is returned immediately as a single chunk (streaming or non-streaming — both work).
3. On a **cache miss** the request is forwarded normally. The response is stored in the cache after it completes.
4. **MOE requests** (`moe-*` model prefix) always bypass the cache.
5. **Token counts** are never recorded for cache hits.
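The flow above can be sketched in Python. This is a minimal illustration with made-up helper names and a plain dict as the store, not the router's actual internals:

```python
CACHEABLE_ROUTES = {"/api/chat", "/api/generate",
                    "/v1/chat/completions", "/v1/completions"}

def handle_request(path: str, model: str, body: dict, cache: dict):
    """Return (response, outcome) where outcome is hit/miss/bypass."""
    # Non-cacheable routes and MOE requests skip the cache entirely.
    if path not in CACHEABLE_ROUTES or model.startswith("moe-"):
        return forward(body), "bypass"
    key = (model, body.get("prompt"))
    if key in cache:
        # Cache hit: no endpoint selection, no token accounting.
        return cache[key], "hit"
    response = forward(body)   # cache miss: route to an endpoint normally
    cache[key] = response      # store after the response completes
    return response, "miss"

def forward(body: dict) -> str:
    # Stand-in for endpoint selection + generation.
    return f"response to {body.get('prompt')}"
```

A second identical request then returns the stored chunk without touching any endpoint.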
### Cache key strategy
| Signal | How matched |
|---|---|
| `model + system_prompt` | Exact — hard context isolation per deployment |
| BM25-weighted embedding of chat history | Semantic — conversation context signal |
| Embedding of last user message | Semantic — the actual question |
The two semantic vectors are combined as a weighted mean (controlled by `cache_history_weight`) before the cosine-similarity comparison, keeping the cache key a single 384-dimensional vector compatible with the library's storage format.
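The combination step can be sketched as follows, using 2-dimensional vectors for readability (the real embeddings are 384-dimensional; the function names are illustrative, not the router's API):

```python
import math

def combine(history_vec, question_vec, history_weight=0.3):
    # Weighted mean: with history_weight=0.3 the chat history contributes
    # 30% and the last user message 70%, and the result keeps the same
    # dimensionality as its inputs.
    return [history_weight * h + (1 - history_weight) * q
            for h, q in zip(history_vec, question_vec)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

A stored entry is a hit when `cosine_similarity(combined_key, stored_key)` meets the `cache_similarity` threshold.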
### Quick start — exact match (lean image)
```yaml
cache_enabled: true
cache_backend: sqlite # persists across restarts
cache_similarity: 1.0 # exact match only, no sentence-transformers needed
cache_ttl: 3600
```
### Quick start — semantic matching (:semantic image)
```yaml
cache_enabled: true
cache_backend: sqlite
cache_similarity: 0.90 # hit if ≥90% cosine similarity
cache_ttl: 3600
cache_history_weight: 0.3
```
Pull the semantic image:
```bash
docker pull ghcr.io/nomyo-ai/nomyo-router:latest-semantic
```
### Cache configuration options
#### `cache_enabled`
**Type**: `bool` | **Default**: `false`
Enable or disable the cache. All other cache settings are ignored when `false`.
#### `cache_backend`
**Type**: `str` | **Default**: `"memory"`
| Value | Description | Persists | Multi-replica |
|---|---|---|---|
| `memory` | In-process LRU dict | ❌ | ❌ |
| `sqlite` | File-based via `aiosqlite` | ✅ | ❌ |
| `redis` | Redis via `redis.asyncio` | ✅ | ✅ |
Use `redis` when running multiple router replicas behind a load balancer — all replicas share one warm cache.
#### `cache_similarity`
**Type**: `float` | **Default**: `1.0`
Cosine similarity threshold. `1.0` means exact match only (no embedding model needed). Values below `1.0` enable semantic matching, which requires the `:semantic` Docker image tag.
Recommended starting value for semantic mode: `0.90`.
#### `cache_ttl`
**Type**: `int | null` | **Default**: `3600`
Time-to-live for cache entries in seconds. Remove the key or set to `null` to cache forever.
#### `cache_db_path`
**Type**: `str` | **Default**: `"llm_cache.db"`
Path to the SQLite cache database. Only used when `cache_backend: sqlite`.
#### `cache_redis_url`
**Type**: `str` | **Default**: `"redis://localhost:6379/0"`
Redis connection URL. Only used when `cache_backend: redis`.
#### `cache_history_weight`
**Type**: `float` | **Default**: `0.3`
Weight of the BM25-weighted chat-history embedding in the combined cache key vector. `0.3` means the history contributes 30% and the final user message contributes 70% of the similarity signal. Only used when `cache_similarity < 1.0`.
### Cache management endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/api/cache/stats` | `GET` | Hit/miss counters, hit rate, current config |
| `/api/cache/invalidate` | `POST` | Clear all cache entries and reset counters |
```bash
# Check cache performance
curl http://localhost:12434/api/cache/stats
# Clear the cache
curl -X POST http://localhost:12434/api/cache/invalidate
```
Example stats response:
```json
{
"enabled": true,
"hits": 1547,
"misses": 892,
"hit_rate": 0.634,
"semantic": true,
"backend": "sqlite",
"similarity_threshold": 0.9,
"history_weight": 0.3
}
```
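The `hit_rate` field is simply hits divided by total lookups, as the sample numbers above show:

```python
def hit_rate(hits: int, misses: int) -> float:
    # Fraction of lookups served from cache, rounded as in the stats payload.
    total = hits + misses
    return round(hits / total, 3) if total else 0.0

print(hit_rate(1547, 892))  # 0.634
```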
### Docker image variants
| Tag | Semantic cache | Image size |
|---|---|---|
| `latest` | ❌ exact match only | ~300 MB |
| `latest-semantic` | ✅ sentence-transformers + model pre-baked | ~800 MB |
Build locally:
```bash
# Lean (exact match)
docker build -t nomyo-router .
# Semantic (~500 MB larger, all-MiniLM-L6-v2 model baked in)
docker build --build-arg SEMANTIC_CACHE=true -t nomyo-router:semantic .
```
## Configuration Validation
The router validates the configuration at startup:


@@ -82,10 +82,23 @@ sudo systemctl status nomyo-router
## 2. Docker Deployment
### Image variants
| Tag | Semantic cache | Image size |
|---|---|---|
| `latest` | ❌ exact match only | ~300 MB |
| `latest-semantic` | ✅ sentence-transformers + `all-MiniLM-L6-v2` pre-baked | ~800 MB |
The `:semantic` variant enables `cache_similarity < 1.0` in `config.yaml`. The lean image falls back to exact-match caching with a warning if semantic mode is configured.
### Build the Image
```bash
# Lean build (exact match cache, default)
docker build -t nomyo-router .
# Semantic build (~500 MB larger, all-MiniLM-L6-v2 model baked in at build time)
docker build --build-arg SEMANTIC_CACHE=true -t nomyo-router:semantic .
```
### Run the Container


@@ -1,20 +1,30 @@
# Docker Compose example for NOMYO Router with multiple Ollama instances
#
# Two router profiles are provided:
# nomyo-router — lean image, exact-match cache only (~300 MB)
# nomyo-router-semantic — semantic image, sentence-transformers baked in (~800 MB)
#
# Uncomment the redis service and set cache_backend: redis in config.yaml
# to share the LLM response cache across multiple router replicas.
version: '3.8'
services:
# NOMYO Router
# NOMYO Router — lean image (exact-match cache, default)
nomyo-router:
image: nomyo-router:latest
build: .
build:
context: .
args:
SEMANTIC_CACHE: "false"
ports:
- "12434:12434"
environment:
- CONFIG_PATH=/app/config/config.yaml
- NOMYO_ROUTER_DB_PATH=/app/token_counts.db
- NOMYO_ROUTER_DB_PATH=/app/data/token_counts.db
volumes:
- ./config:/app/config
- router-db:/app/token_counts.db
- router-data:/app/data
depends_on:
- ollama1
- ollama2
@@ -23,6 +33,45 @@ services:
networks:
- nomyo-net
# NOMYO Router — semantic image (cache_similarity < 1.0 support, ~800 MB)
# Build: docker compose build nomyo-router-semantic
# Switch: comment out nomyo-router above, uncomment this block.
# nomyo-router-semantic:
# image: nomyo-router:semantic
# build:
# context: .
# args:
# SEMANTIC_CACHE: "true"
# ports:
# - "12434:12434"
# environment:
# - CONFIG_PATH=/app/config/config.yaml
# - NOMYO_ROUTER_DB_PATH=/app/data/token_counts.db
# volumes:
# - ./config:/app/config
# - router-data:/app/data
# - hf-cache:/app/data/hf_cache # share HuggingFace model cache across builds
# depends_on:
# - ollama1
# - ollama2
# - ollama3
# restart: unless-stopped
# networks:
# - nomyo-net
# Optional: Redis for shared LLM response cache across multiple router replicas.
# Requires cache_backend: redis in config.yaml.
# redis:
# image: redis:7-alpine
# ports:
# - "6379:6379"
# volumes:
# - redis-data:/data
# command: redis-server --save 60 1 --loglevel warning
# restart: unless-stopped
# networks:
# - nomyo-net
# Ollama Instance 1
ollama1:
image: ollama/ollama:latest
@@ -87,7 +136,9 @@ services:
- nomyo-net
volumes:
router-db:
router-data:
# hf-cache: # uncomment when using nomyo-router-semantic
# redis-data: # uncomment when using Redis cache backend
ollama1-data:
ollama2-data:
ollama3-data:


@@ -29,4 +29,38 @@ api_keys:
"http://192.168.0.52:11434": "ollama"
"https://api.openai.com/v1": "${OPENAI_KEY}"
"http://localhost:8080/v1": "llama-server" # Optional API key for llama-server - depends on llama_server config
"http://192.168.0.33:8081/v1": "llama-server"
"http://192.168.0.33:8081/v1": "llama-server"
# -------------------------------------------------------------
# Semantic LLM Cache (optional — disabled by default)
# Caches LLM responses to cut costs and latency on repeated or
# semantically similar prompts.
# Cached routes: /api/chat /api/generate /v1/chat/completions /v1/completions
# MOE requests (moe-* model prefix) always bypass the cache.
# -------------------------------------------------------------
# cache_enabled: false
# Backend — where cached responses are stored:
# memory → in-process LRU (lost on restart, not shared across replicas) [default]
# sqlite → persistent file-based (single instance, survives restart)
# redis → distributed (shared across replicas, requires Redis)
# cache_backend: memory
# Cosine similarity threshold for a cache hit:
# 1.0 → exact match only (works on any image variant)
# <1.0 → semantic matching (requires the :semantic Docker image tag)
# cache_similarity: 1.0
# Response TTL in seconds. Remove the key or set to null to cache forever.
# cache_ttl: 3600
# SQLite backend: path to the cache database file
# cache_db_path: llm_cache.db
# Redis backend: connection URL
# cache_redis_url: redis://localhost:6379/0
# Weight of the BM25-weighted chat-history embedding vs last-user-message embedding.
# 0.3 = 30% history context signal, 70% question signal.
# Only relevant when cache_similarity < 1.0.
# cache_history_weight: 0.3


@@ -133,6 +133,39 @@ Response:
}
```
### Cache Statistics
```bash
curl http://localhost:12434/api/cache/stats
```
Response when cache is enabled:
```json
{
"enabled": true,
"hits": 1547,
"misses": 892,
"hit_rate": 0.634,
"semantic": true,
"backend": "sqlite",
"similarity_threshold": 0.9,
"history_weight": 0.3
}
```
Response when cache is disabled:
```json
{ "enabled": false }
```
### Cache Invalidation
```bash
curl -X POST http://localhost:12434/api/cache/invalidate
```
Clears all cached entries and resets hit/miss counters.
### Real-time Usage Stream
```bash