doc: updated usage.md

Alpha Nerd 2026-03-08 09:26:53 +01:00
parent dd4b12da6a
commit e8b8981421


@@ -79,6 +79,8 @@ For OpenAI API compatibility:
| `/api/config` | GET | Endpoint configuration |
| `/api/usage-stream` | GET | Real-time usage updates (SSE) |
| `/health` | GET | Health check |
| `/api/cache/stats` | GET | Cache hit/miss counters and config |
| `/api/cache/invalidate` | POST | Clear all cache entries and counters |

## Making Requests
@@ -147,6 +149,58 @@ The MOE system:
3. Selects the best response
4. Generates a final refined response

### Semantic LLM Cache

The router can cache LLM responses and serve them instantly, bypassing endpoint selection, model loading, and token generation entirely. Cached responses work for both streaming and non-streaming clients.
Enable it in `config.yaml`:
```yaml
cache_enabled: true
cache_backend: sqlite # persists across restarts
cache_similarity: 0.9 # semantic matching (requires :semantic image)
cache_ttl: 3600
```
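
To see the cache working end to end, issue the same chat completion twice: the first call generates and stores the response, and the identical repeat is served from the cache. A minimal sketch, assuming the OpenAI-compatible `/v1/chat/completions` route on port 12434; the model name `llama3` is a placeholder:

```bash
# First call populates the cache (a miss); the identical repeat is a hit.
# The /v1/chat/completions path and "llama3" model are assumptions.
curl -s http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "What is a semantic cache?"}]}'

# Run the same request again and compare latency.
time curl -s http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "What is a semantic cache?"}]}'
```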
For exact-match caching only (no extra dependencies):
```yaml
cache_enabled: true
cache_backend: sqlite
cache_similarity: 1.0
```
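
At `cache_similarity: 1.0` only byte-identical prompts are cache hits; with a lower threshold such as `0.9` on the `:semantic` image, a paraphrase can also match. A quick illustration (placeholder prompts and model name):

```bash
# Stored on first sight in either mode.
curl -s http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Explain what DNS does."}]}'

# A paraphrase: a miss at similarity 1.0, a likely hit at 0.9.
curl -s http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "What does DNS do?"}]}'
```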
Check cache performance:
```bash
curl http://localhost:12434/api/cache/stats
```
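
The response reports the hit/miss counters alongside the active configuration: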
```json
{
"enabled": true,
"hits": 1547,
"misses": 892,
"hit_rate": 0.634,
"semantic": true,
"backend": "sqlite",
"similarity_threshold": 0.9,
"history_weight": 0.3
}
```
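
`hit_rate` is `hits / (hits + misses)`; here 1547 / 2439 ≈ 0.634. To watch it live (assumes `jq` and `watch` are installed):

```bash
watch -n 5 'curl -s http://localhost:12434/api/cache/stats | jq "{hits, misses, hit_rate}"'
```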
Clear the cache:
```bash
curl -X POST http://localhost:12434/api/cache/invalidate
```
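
Invalidation clears both the stored entries and the hit/miss counters, so a follow-up stats call should report zeroed counts:

```bash
curl -X POST http://localhost:12434/api/cache/invalidate
curl http://localhost:12434/api/cache/stats   # hits and misses back to 0
```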
**Notes:**
- MOE requests (`moe-*` model prefix) always bypass the cache
- Cache entries are isolated per `model + system prompt`; different users with different system prompts cannot receive each other's cached responses (see the sketch after this list)
- Semantic matching requires the `:semantic` Docker image tag (`ghcr.io/nomyo-ai/nomyo-router:latest-semantic`)
- See [configuration.md](configuration.md#semantic-llm-cache) for all cache options
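
To illustrate the isolation rule from the notes above: the same user message under two different system prompts lands in two separate cache entries. The prompts and model name below are placeholders:

```bash
# Same question, different system prompts -> separate cache keys;
# neither request can be served the other's cached response.
curl -s http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [
        {"role": "system", "content": "Answer like a pirate."},
        {"role": "user", "content": "Summarize HTTP caching."}]}'

curl -s http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [
        {"role": "system", "content": "Answer formally."},
        {"role": "user", "content": "Summarize HTTP caching."}]}'
```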

### Token Tracking

The router automatically tracks token usage: