feat: visualization of conversation affinity in dashboard

This commit is contained in:
Alpha Nerd 2026-05-13 13:38:37 +02:00
parent 4acbaeb29c
commit aa7ec6354a
Signed by: alpha-nerd
SSH key fingerprint: SHA256:QkkAgVoYi9TQ0UKPkiKSfnerZy2h4qhi3SVPXJmBN+M
5 changed files with 306 additions and 19 deletions

View file

@ -166,6 +166,91 @@ With this config the primary handles up to 4 concurrent requests before the seco
---
### `conversation_affinity`
**Type**: `bool` (optional)
**Default**: `false`
**Companion setting**: [`conversation_affinity_ttl`](#conversation_affinity_ttl)
**Description**: When enabled, the router prefers to send follow-up requests of the same conversation back to the endpoint that already served the first turn. This keeps the backend's prompt cache (the llama.cpp / Ollama **KV cache**) warm: the first user turn pays the cold prefill cost, every later turn reuses the same prefix and only generates new tokens. It is a **soft preference** — when the previously-chosen endpoint is no longer eligible (model unloaded, no free slot), the router falls back to the standard selection algorithm (`priority_routing` or random).
#### How a conversation is identified
The router does **not** track session IDs or auth tokens. It computes a stable fingerprint per request from:
```
SHA1( model
+ every leading message with role="system"
+ the first message with role="user" )
```
Anything after the first user turn is ignored — those later messages extend the same KV prefix, so they don't change the cache identity.
**What this means in practice**
| You send… | Fingerprint behaves like… |
|---|---|
| Turn 2 of the same chat (history grows but first system+user are unchanged) | **Same** as turn 1 → pin is reused and TTL refreshed |
| Turn 1 of a fresh chat | **New** fingerprint → new pin |
| Same first user prompt but a different model | **New** fingerprint (model is part of the hash) |
| Same chat but the client mutates the system prompt between turns (e.g. injects a fresh timestamp) | **New** fingerprint — the affinity will not stick |
#### TTL and refresh
Every time `choose_endpoint` returns a pinned endpoint, the entry's expiry is bumped to `now + conversation_affinity_ttl`. An idle conversation drops out of the map once that window elapses without traffic. Default 300 s matches Ollama's default `keep_alive` — once the backend has unloaded the model, the KV cache is gone too, so a stale pin would be pointless anyway.
#### Why the dashboard may show more than one dot per visible conversation
The fingerprint is computed per **HTTP request**, not per chat-window. Most chat UIs (Open WebUI in particular) fire several **auxiliary** requests alongside the real conversation:
- *Title generation* — synthetic system prompt + the user message as content
- *Follow-up question suggestion* — synthetic system prompt + the conversation as content
- *Tag generation*, *memory extraction*, *retrieval query rewriting*, etc.
Each of those has its own `(system + first user turn)` and therefore its own fingerprint and its own pin in [the affinity dot matrix](monitoring.md#affinity-stats-conversation-affinity). They all *correctly* refer to a real warm KV-cache prefix on the backend, so the routing they drive is right — they just don't visually map 1:1 to a user-perceived "conversation."
#### Example
```yaml
endpoints:
- http://gpu-primary:11434
- http://gpu-secondary:11434
conversation_affinity: true
conversation_affinity_ttl: 300
```
With this configuration, a chat that starts on `gpu-primary` will keep returning to `gpu-primary` for follow-up turns as long as the model is still loaded there and a slot is free, even if `gpu-secondary` happens to be more idle at that moment. Cold-prefill cost is paid once instead of once per turn.
#### When to enable
- ✅ Interactive chat workloads with long histories — the prefill savings on every follow-up turn are substantial.
- ✅ Multi-endpoint deployments where models are loaded on more than one node.
- ❌ Pure one-shot / single-turn workloads (no KV-cache to keep warm).
- ❌ When you specifically want strict load-balancing parity — affinity intentionally biases against perfect balance.
---
### `conversation_affinity_ttl`
**Type**: `int` (seconds, optional)
**Default**: `300`
**Description**: How long a conversation stays pinned to its endpoint after the last request that touched it. Refreshed on every reuse — so an actively-used conversation keeps its pin indefinitely; an abandoned one expires after `conversation_affinity_ttl` seconds of silence.
**Recommendation**: leave this aligned with the backend's `keep_alive` window. If the model is unloaded by the backend, the KV cache is gone and there is no benefit to keeping the pin.
**Example**:
```yaml
conversation_affinity: true
conversation_affinity_ttl: 600 # half an hour of inactivity before un-pinning
```
---
### `router_api_key`
**Type**: `str` (optional)

View file

@ -166,6 +166,39 @@ curl -X POST http://localhost:12434/api/cache/invalidate
Clears all cached entries and resets hit/miss counters.
### Affinity Stats (Conversation Affinity)
```bash
curl http://localhost:12434/api/affinity_stats
```
Response when [`conversation_affinity`](configuration.md#conversation_affinity) is enabled:
```json
{
"enabled": true,
"ttl": 300,
"entries": [
{ "endpoint": "http://gpu-primary:11434", "model": "llama3.2:latest", "remaining": 287.4 },
{ "endpoint": "http://gpu-primary:11434", "model": "llama3.2:latest", "remaining": 113.0 },
{ "endpoint": "http://gpu-secondary:11434", "model": "qwen2.5-coder:7b", "remaining": 44.8 }
]
}
```
Response when the feature is disabled:
```json
{ "enabled": false, "ttl": 300, "entries": [] }
```
- One element per **live pinned conversation** (no fingerprints or content — just the endpoint/model the pin points to and how many seconds it has left before expiry).
- Aggregation by `(endpoint, model)` is left to the consumer: the dashboard does this client-side.
- The endpoint is gated by the same `nomyo-router-api-key` middleware as the rest of `/api/*`.
The dashboard's **Running Models (PS) → Affinity** column is rendered from this data. The column auto-hides when `enabled: false`. Each row shows one dot per live pin against that `(endpoint, model)` pair; dot opacity = `remaining / ttl` (floor 0.15), so freshly-routed pins are solid and pins close to expiry fade out. A `+N` overflow badge appears once a single (endpoint, model) holds more than 12 active pins; an em-dash (`—`) marks an `(endpoint, model)` with no live pins.
> Multiple dots for what looks like "one chat window" is normal — most chat UIs (Open WebUI, LibreChat, …) fire auxiliary requests (title generation, follow-up suggestions, tag extraction) that have their own first-turn fingerprint and therefore their own pin. See [Conversation Affinity → Why the dashboard may show more than one dot per visible conversation](configuration.md#conversation_affinity) for the details.
### Real-time Usage Stream
```bash