doc: updated usage.md

parent dd4b12da6a
commit e8b8981421

1 changed file with 54 additions and 0 deletions: doc/usage.md (+54)
@@ -79,6 +79,8 @@ For OpenAI API compatibility:

| `/api/config` | GET | Endpoint configuration |
| `/api/usage-stream` | GET | Real-time usage updates (SSE) |
| `/health` | GET | Health check |
| `/api/cache/stats` | GET | Cache hit/miss counters and config |
| `/api/cache/invalidate` | POST | Clear all cache entries and counters |

## Making Requests
@@ -147,6 +149,58 @@ The MOE system:

3. Selects the best response
4. Generates a final refined response
### Semantic LLM Cache

The router can cache LLM responses and serve them instantly — bypassing endpoint selection, model loading, and token generation entirely. Cached responses work for both streaming and non-streaming clients.

Enable it in `config.yaml`:

```yaml
cache_enabled: true
cache_backend: sqlite   # persists across restarts
cache_similarity: 0.9   # semantic matching (requires :semantic image)
cache_ttl: 3600
```
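Conceptually, a semantic cache lookup works by embedding each prompt and reusing a stored response when the similarity to a cached prompt clears `cache_similarity`. The sketch below is a minimal illustration of that idea, not the router's actual implementation — the `SemanticCache` class is hypothetical, and toy vectors stand in for real sentence embeddings:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


class SemanticCache:
    """Hypothetical sketch: real routers would embed prompts with a
    sentence encoder; here the caller supplies vectors directly."""

    def __init__(self, similarity_threshold: float):
        self.threshold = similarity_threshold
        self.entries = []  # list of (embedding, response)

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        # Return the most similar cached response, but only if it
        # clears the configured threshold; otherwise it's a miss.
        best, best_sim = None, -1.0
        for cached_emb, response in self.entries:
            sim = cosine(embedding, cached_emb)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None


cache = SemanticCache(similarity_threshold=0.9)
cache.put([1.0, 0.0, 0.1], "cached answer")
print(cache.get([1.0, 0.05, 0.1]))  # near-identical prompt: cache hit
print(cache.get([0.0, 1.0, 0.0]))   # unrelated prompt: miss (None)
```

With `cache_similarity: 1.0` only an exact embedding match would ever clear the threshold, which is why that setting behaves as an exact-match cache.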
For exact-match only (no extra dependencies):

```yaml
cache_enabled: true
cache_backend: sqlite
cache_similarity: 1.0
```
Check cache performance:

```bash
curl http://localhost:12434/api/cache/stats
```
```json
{
  "enabled": true,
  "hits": 1547,
  "misses": 892,
  "hit_rate": 0.634,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.9,
  "history_weight": 0.3
}
```
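The `hit_rate` field is hits divided by total lookups. Using the sample numbers from the response above:

```python
hits, misses = 1547, 892

# hit_rate = hits / (hits + misses), as reported by /api/cache/stats
hit_rate = hits / (hits + misses)
print(round(hit_rate, 3))  # 0.634
```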
Clear the cache:

```bash
curl -X POST http://localhost:12434/api/cache/invalidate
```
**Notes:**

- MOE requests (`moe-*` model prefix) always bypass the cache
- Cache is isolated per `model + system prompt` — different users with different system prompts cannot receive each other's cached responses
- Semantic matching requires the `:semantic` Docker image tag (`ghcr.io/nomyo-ai/nomyo-router:latest-semantic`)
- See [configuration.md](configuration.md#semantic-llm-cache) for all cache options
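The per-`model + system prompt` isolation can be illustrated with a hypothetical key derivation — the router's real key format is not documented here, but any scheme that hashes the model together with the system prompt gives the behavior described above:

```python
import hashlib


def cache_key(model: str, system_prompt: str, prompt: str) -> str:
    # Hypothetical derivation: including model and system prompt in the
    # hashed material means two users with different system prompts can
    # never read each other's cached entries.
    material = "\x1f".join((model, system_prompt, prompt))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


k1 = cache_key("llama3", "You are a pirate.", "Hello")
k2 = cache_key("llama3", "You are a lawyer.", "Hello")
print(k1 != k2)  # True: same prompt, different system prompt -> separate entries
```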
### Token Tracking

The router automatically tracks token usage: