# Confirm endpoints are exactly as in endpoints block
api_keys:
"http://192.168.0.50:11434": "ollama"
"http://192.168.0.51:11434": "ollama"
"http://192.168.0.52:11434": "ollama"
"https://api.openai.com/v1": "${OPENAI_KEY}"
```
## Configuration Options
### `endpoints`
**Type**: `list[str]`
**Description**: List of Ollama endpoint URLs. Can include both Ollama endpoints (`http://host:11434`) and OpenAI-compatible endpoints (`https://api.openai.com/v1`).
**Examples**:
```yaml
endpoints:
- http://localhost:11434
- http://ollama1:11434
- http://ollama2:11434
- https://api.openai.com/v1
- https://api.anthropic.com/v1
```
**Notes**:
- Ollama endpoints use the standard `/api/` prefix
- OpenAI-compatible endpoints use `/v1` prefix
- The router automatically detects endpoint type based on URL pattern
### `max_concurrent_connections`
**Type**: `int`
**Default**: `1`
**Description**: Maximum number of concurrent connections allowed per endpoint-model pair. This corresponds to Ollama's `OLLAMA_NUM_PARALLEL` setting.
**Example**:
```yaml
max_concurrent_connections: 4
```
**Notes**:
- This setting controls how many requests can be processed simultaneously for a specific model on a specific endpoint
- When this limit is reached, the router will route requests to other endpoints with available capacity
- Higher values allow more parallel requests but may increase memory usage
**Default**: `{}` (all endpoints use the global `max_concurrent_connections`)
**Description**: Per-endpoint overrides for configuration values. The endpoint URL must match the entry in `endpoints` exactly. Any field not listed falls back to the global default.
**Supported per-endpoint fields**:
| Field | Description |
|---|---|
| `max_concurrent_connections` | Overrides the global limit for this endpoint only |
- Useful when endpoints have different hardware capacity.
- The utilization ratio used by WRR (`priority_routing: true`) is computed per-endpoint using the effective limit, so a node with `max_concurrent_connections: 4` running 2 requests is considered 50% utilized, same as a node with limit 2 running 1 request.
---
### `priority_routing`
**Type**: `bool` (optional)
**Default**: `false`
**Description**: Selects the load-balancing algorithm used when multiple endpoints are available for a request.
| Value | Algorithm |
|---|---|
| `false` (default) | Random selection among equally-idle endpoints; otherwise pick the least-loaded endpoint by raw connection count. |
| `true` | **Weighted Round Robin (WRR)** — endpoints are ranked by utilization ratio (`active_connections / max_concurrent_connections`). Config order acts as the tiebreaker: the endpoint listed first in `endpoints` is preferred when two candidates have equal utilization. |
**Example**:
```yaml
priority_routing: true
```
**When to use WRR**:
- You have a primary GPU node and one or more fallback nodes, and want the primary to absorb all traffic until it is genuinely saturated.
- Combined with `endpoint_config` to give the primary a higher `max_concurrent_connections`, so the utilization ratio reflects real capacity rather than raw slot counts.
**Example — primary/fallback setup**:
```yaml
endpoints:
- http://gpu-primary:11434 # preferred
- http://gpu-secondary:11434 # fallback
endpoint_config:
"http://gpu-primary:11434":
max_concurrent_connections: 4
"http://gpu-secondary:11434":
max_concurrent_connections: 2
priority_routing: true
```
With this config the primary handles up to 4 concurrent requests before the secondary receives any traffic.
**Description**: When enabled, the router prefers to send follow-up requests of the same conversation back to the endpoint that already served the first turn. This keeps the backend's prompt cache (the llama.cpp / Ollama **KV cache**) warm: the first user turn pays the cold prefill cost, every later turn reuses the same prefix and only generates new tokens. It is a **soft preference** — when the previously-chosen endpoint is no longer eligible (model unloaded, no free slot), the router falls back to the standard selection algorithm (`priority_routing` or random).
#### How a conversation is identified
The router does **not** track session IDs or auth tokens. It computes a stable fingerprint per request from:
```
SHA1( model
+ every leading message with role="system"
+ the first message with role="user" )
```
Anything after the first user turn is ignored — those later messages extend the same KV prefix, so they don't change the cache identity.
**What this means in practice**
| You send… | Fingerprint behaves like… |
|---|---|
| Turn 2 of the same chat (history grows but first system+user are unchanged) | **Same** as turn 1 → pin is reused and TTL refreshed |
| Turn 1 of a fresh chat | **New** fingerprint → new pin |
| Same first user prompt but a different model | **New** fingerprint (model is part of the hash) |
| Same chat but the client mutates the system prompt between turns (e.g. injects a fresh timestamp) | **New** fingerprint — the affinity will not stick |
#### TTL and refresh
Every time `choose_endpoint` returns a pinned endpoint, the entry's expiry is bumped to `now + conversation_affinity_ttl`. An idle conversation drops out of the map once that window elapses without traffic. Default 300 s matches Ollama's default `keep_alive` — once the backend has unloaded the model, the KV cache is gone too, so a stale pin would be pointless anyway.
#### Why the dashboard may show more than one dot per visible conversation
The fingerprint is computed per **HTTP request**, not per chat-window. Most chat UIs (Open WebUI in particular) fire several **auxiliary** requests alongside the real conversation:
- *Title generation* — synthetic system prompt + the user message as content
- *Follow-up question suggestion* — synthetic system prompt + the conversation as content
- *Tag generation*, *memory extraction*, *retrieval query rewriting*, etc.
Each of those has its own `(system + first user turn)` and therefore its own fingerprint and its own pin in [the affinity dot matrix](monitoring.md#affinity-stats-conversation-affinity). They all *correctly* refer to a real warm KV-cache prefix on the backend, so the routing they drive is right — they just don't visually map 1:1 to a user-perceived "conversation."
#### Example
```yaml
endpoints:
- http://gpu-primary:11434
- http://gpu-secondary:11434
conversation_affinity: true
conversation_affinity_ttl: 300
```
With this configuration, a chat that starts on `gpu-primary` will keep returning to `gpu-primary` for follow-up turns as long as the model is still loaded there and a slot is free, even if `gpu-secondary` happens to be more idle at that moment. Cold-prefill cost is paid once instead of once per turn.
#### When to enable
- ✅ Interactive chat workloads with long histories — the prefill savings on every follow-up turn are substantial.
- ✅ Multi-endpoint deployments where models are loaded on more than one node.
- ❌ Pure one-shot / single-turn workloads (no KV-cache to keep warm).
- ❌ When you specifically want strict load-balancing parity — affinity intentionally biases against perfect balance.
---
### `conversation_affinity_ttl`
**Type**: `int` (seconds, optional)
**Default**: `300`
**Description**: How long a conversation stays pinned to its endpoint after the last request that touched it. Refreshed on every reuse — so an actively-used conversation keeps its pin indefinitely; an abandoned one expires after `conversation_affinity_ttl` seconds of silence.
**Recommendation**: leave this aligned with the backend's `keep_alive` window. If the model is unloaded by the backend, the KV cache is gone and there is no benefit to keeping the pin.
**Example**:
```yaml
conversation_affinity: true
conversation_affinity_ttl: 600 # half an hour of inactivity before un-pinning
**Description**: Shared secret that gates access to the NOMYO Router APIs and dashboard. When set, clients must send `Authorization: Bearer <key>` or an `api_key` query parameter.
**Example**:
```yaml
nomyo-router-api-key: "super-secret-value"
```
**Notes**:
- Leave this blank or omit it to disable router-level authentication.
- You can also set the `NOMYO_ROUTER_API_KEY` environment variable to avoid storing the key in plain text.
**Description**: Path to the SQLite database file for storing token counts. If not set, defaults to `token_counts.db` in the current working directory.
**Description**: Router-level API key. When set, all router endpoints and the dashboard require this key via `Authorization: Bearer <key>` or the `api_key` query parameter.
When you have heterogeneous hardware and want to prefer a faster node:
```yaml
endpoints:
- http://gpu-primary:11434 # high-VRAM node, listed first = highest priority
- http://gpu-secondary:11434
endpoint_config:
"http://gpu-primary:11434":
max_concurrent_connections: 4
"http://gpu-secondary:11434":
max_concurrent_connections: 2
priority_routing: true
```
The router sends all requests to the primary until its utilization ratio reaches 100%, then spills over to the secondary. Without `priority_routing: true` the default behaviour is random selection among idle endpoints.
NOMYO Router can cache LLM responses and serve them directly — skipping endpoint selection, model load, and token generation entirely.
### How it works
1. On every cacheable request (`/api/chat`, `/api/generate`, `/v1/chat/completions`, `/v1/completions`) the cache is checked **before** choosing an endpoint.
2. On a **cache hit** the stored response is returned immediately as a single chunk (streaming or non-streaming — both work).
3. On a **cache miss** the request is forwarded normally. The response is stored in the cache after it completes.
4.**MOE requests** (`moe-*` model prefix) always bypass the cache.
5.**Token counts** are never recorded for cache hits.
### Cache key strategy
| Signal | How matched |
|---|---|
| `model + system_prompt` | Exact — hard context isolation per deployment |
| BM25-weighted embedding of chat history | Semantic — conversation context signal |
| Embedding of last user message | Semantic — the actual question |
The two semantic vectors are combined as a weighted mean (tuned by `cache_history_weight`) before cosine similarity comparison, staying at a single 384-dimensional vector compatible with the library's storage format.
### Quick start — exact match (lean image)
```yaml
cache_enabled: true
cache_backend: sqlite # persists across restarts
cache_similarity: 1.0 # exact match only, no sentence-transformers needed
Enable or disable the cache. All other cache settings are ignored when `false`.
#### `cache_backend`
**Type**: `str` | **Default**: `"memory"`
| Value | Description | Persists | Multi-replica |
|---|---|---|---|
| `memory` | In-process LRU dict | ❌ | ❌ |
| `sqlite` | File-based via `aiosqlite` | ✅ | ❌ |
| `redis` | Redis via `redis.asyncio` | ✅ | ✅ |
Use `redis` when running multiple router replicas behind a load balancer — all replicas share one warm cache.
#### `cache_similarity`
**Type**: `float` | **Default**: `1.0`
Cosine similarity threshold. `1.0` means exact match only (no embedding model needed). Values below `1.0` enable semantic matching, which requires the `:semantic` Docker image tag.
Recommended starting value for semantic mode: `0.90`.
#### `cache_ttl`
**Type**: `int | null` | **Default**: `3600`
Time-to-live for cache entries in seconds. Remove the key or set to `null` to cache forever.
#### `cache_db_path`
**Type**: `str` | **Default**: `"llm_cache.db"`
Path to the SQLite cache database. Only used when `cache_backend: sqlite`.
Redis connection URL. Only used when `cache_backend: redis`.
#### `cache_history_weight`
**Type**: `float` | **Default**: `0.3`
Weight of the BM25-weighted chat-history embedding in the combined cache key vector. `0.3` means the history contributes 30% and the final user message contributes 70% of the similarity signal. Only used when `cache_similarity < 1.0`.
### Cache management endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/api/cache/stats` | `GET` | Hit/miss counters, hit rate, current config |
| `/api/cache/invalidate` | `POST` | Clear all cache entries and reset counters |
```bash
# Check cache performance
curl http://localhost:12434/api/cache/stats
# Clear the cache
curl -X POST http://localhost:12434/api/cache/invalidate