feat: visualization of conversation affinity in dashboard

2026-05-13 13:38:37 +02:00
commit aa7ec6354a
parent 4acbaeb29c
5 changed files with 306 additions and 19 deletions
--- a/doc/configuration.md
+++ b/doc/configuration.md
@@ -166,6 +166,91 @@ With this config the primary handles up to 4 concurrent requests before the seco

 ---

+### `conversation_affinity`
+
+**Type**: `bool` (optional)
+
+**Default**: `false`
+
+**Companion setting**: [`conversation_affinity_ttl`](#conversation_affinity_ttl)
+
+**Description**: When enabled, the router prefers to send follow-up requests of the same conversation back to the endpoint that served its earlier turns. This keeps the backend's prompt cache (the llama.cpp / Ollama **KV cache**) warm: the first user turn pays the cold prefill cost, and every later turn reuses the cached prefix, so only the newly appended messages need prefill before generation starts. It is a **soft preference**: when the previously chosen endpoint is no longer eligible (model unloaded, no free slot), the router falls back to the standard selection algorithm (`priority_routing` or random).
+
+#### How a conversation is identified
+
+The router does **not** track session IDs or auth tokens. It computes a stable fingerprint per request from:
+
+```
+SHA1(  model
+     + every leading message with role="system"
+     + the first message with role="user"  )
+```
+
+Anything after the first user turn is ignored — those later messages extend the same KV prefix, so they don't change the cache identity.
+
+**What this means in practice**
+
+| You send… | Fingerprint behaves like… |
+|---|---|
+| Turn 2 of the same chat (history grows but first system+user are unchanged) | **Same** as turn 1 → pin is reused and TTL refreshed |
+| Turn 1 of a fresh chat | **New** fingerprint → new pin |
+| Same first user prompt but a different model | **New** fingerprint (model is part of the hash) |
+| Same chat but the client mutates the system prompt between turns (e.g. injects a fresh timestamp) | **New** fingerprint — the affinity will not stick |
+
+#### TTL and refresh
+
+Every time `choose_endpoint` returns a pinned endpoint, the entry's expiry is bumped to `now + conversation_affinity_ttl`. An idle conversation drops out of the map once that window elapses without traffic. The default of 300 s matches Ollama's default `keep_alive` (5 minutes): once the backend has unloaded the model, the KV cache is gone too, so a stale pin would be pointless anyway.
+
+#### Why the dashboard may show more than one dot per visible conversation
+
+The fingerprint is computed per **HTTP request**, not per chat-window. Most chat UIs (Open WebUI in particular) fire several **auxiliary** requests alongside the real conversation:
+
+- *Title generation* — synthetic system prompt + the user message as content
+- *Follow-up question suggestion* — synthetic system prompt + the conversation as content
+- *Tag generation*, *memory extraction*, *retrieval query rewriting*, etc.
+
+Each of those has its own `(system + first user turn)` and therefore its own fingerprint and its own pin in [the affinity dot matrix](monitoring.md#affinity-stats-conversation-affinity). They all *correctly* refer to a real warm KV-cache prefix on the backend, so the routing they drive is right — they just don't visually map 1:1 to a user-perceived "conversation."
+
+#### Example
+
+```yaml
+endpoints:
+  - http://gpu-primary:11434
+  - http://gpu-secondary:11434
+
+conversation_affinity: true
+conversation_affinity_ttl: 300
+```
+
+With this configuration, a chat that starts on `gpu-primary` will keep returning to `gpu-primary` for follow-up turns as long as the model is still loaded there and a slot is free, even if `gpu-secondary` happens to be more idle at that moment. Cold-prefill cost is paid once instead of once per turn.
+
+#### When to enable
+
+- ✅ Interactive chat workloads with long histories — the prefill savings on every follow-up turn are substantial.
+- ✅ Multi-endpoint deployments where models are loaded on more than one node.
+- ❌ Pure one-shot / single-turn workloads (there is no follow-up turn to benefit from a warm KV cache).
+- ❌ When you specifically want strict load-balancing parity — affinity intentionally biases against perfect balance.
+
+---
+
+### `conversation_affinity_ttl`
+
+**Type**: `int` (seconds, optional)
+
+**Default**: `300`
+
+**Description**: How long a conversation stays pinned to its endpoint after the last request that touched it. Refreshed on every reuse — so an actively-used conversation keeps its pin indefinitely; an abandoned one expires after `conversation_affinity_ttl` seconds of silence.
+
+**Recommendation**: leave this aligned with the backend's `keep_alive` window. If the model is unloaded by the backend, the KV cache is gone and there is no benefit to keeping the pin.
+
+**Example**:
+```yaml
+conversation_affinity: true
+conversation_affinity_ttl: 600   # 10 minutes of inactivity before un-pinning
+```
+
+---
+
 ### `router_api_key`

 **Type**: `str` (optional)
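
Not part of this commit: a minimal Python sketch of the fingerprint and TTL refresh described in the `conversation_affinity` documentation above. The names `conversation_fingerprint`, `AffinityMap`, `lookup` and `pin` are hypothetical, as are the separator used to join the hashed parts and the handling of an ineligible pin; the source only states which fields feed the SHA-1 and that a reused pin has its expiry bumped to `now + conversation_affinity_ttl`.

```python
import hashlib
import time


def conversation_fingerprint(model, messages):
    """SHA-1 over the model, the leading system messages and the first user message.

    Assumption: parts are joined with a NUL byte and message content is a string;
    the router's real serialization is not shown in the source.
    """
    parts = [model]
    i = 0
    # Every leading message with role="system".
    while i < len(messages) and messages[i].get("role") == "system":
        parts.append("system:" + str(messages[i].get("content", "")))
        i += 1
    # The first message with role="user"; everything after it is ignored.
    first_user = next((m for m in messages[i:] if m.get("role") == "user"), None)
    if first_user is not None:
        parts.append("user:" + str(first_user.get("content", "")))
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()


class AffinityMap:
    """Hypothetical fingerprint -> (endpoint, expires_at) map with a sliding TTL."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._pins = {}

    def pin(self, fingerprint, endpoint):
        """Record (or overwrite) the endpoint chosen for this conversation."""
        self._pins[fingerprint] = (endpoint, self.clock() + self.ttl)

    def lookup(self, fingerprint, eligible):
        """Return the pinned endpoint if it is live and eligible, else None."""
        now = self.clock()
        entry = self._pins.get(fingerprint)
        if entry is None:
            return None
        endpoint, expires_at = entry
        if expires_at <= now:
            del self._pins[fingerprint]  # idle for longer than the TTL
            return None
        if endpoint not in eligible:
            return None  # soft preference: the caller falls back and re-pins
        self._pins[fingerprint] = (endpoint, now + self.ttl)  # refresh on reuse
        return endpoint
```

Under these assumptions, a caller would compute the fingerprint for each request, try `lookup` with the set of currently eligible endpoints, and on `None` run the normal selection and call `pin` with the result.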
--- a/doc/monitoring.md
+++ b/doc/monitoring.md
@@ -166,6 +166,39 @@ curl -X POST http://localhost:12434/api/cache/invalidate

 Clears all cached entries and resets hit/miss counters.

+### Affinity Stats (Conversation Affinity)
+
+```bash
+curl http://localhost:12434/api/affinity_stats
+```
+
+Response when [`conversation_affinity`](configuration.md#conversation_affinity) is enabled:
+
+```json
+{
+  "enabled": true,
+  "ttl": 300,
+  "entries": [
+    { "endpoint": "http://gpu-primary:11434",   "model": "llama3.2:latest",     "remaining": 287.4 },
+    { "endpoint": "http://gpu-primary:11434",   "model": "llama3.2:latest",     "remaining": 113.0 },
+    { "endpoint": "http://gpu-secondary:11434", "model": "qwen2.5-coder:7b",    "remaining":  44.8 }
+  ]
+}
+```
+
+Response when the feature is disabled:
+```json
+{ "enabled": false, "ttl": 300, "entries": [] }
+```
+
+- `entries` holds one element per **live pinned conversation**. No fingerprints or message content are exposed, only the endpoint and model the pin points to and the number of seconds left before it expires (`remaining`).
+- Aggregation by `(endpoint, model)` is left to the consumer: the dashboard does this client-side.
+- The `/api/affinity_stats` route is gated by the same `nomyo-router-api-key` middleware as the rest of `/api/*`.
+
+The dashboard's **Running Models (PS) → Affinity** column is rendered from this data. The column auto-hides when `enabled: false`. Each row shows one dot per live pin against that `(endpoint, model)` pair; dot opacity is `remaining / ttl`, clamped to a minimum of 0.15, so freshly routed pins are solid and pins close to expiry fade out. A `+N` overflow badge appears once a single `(endpoint, model)` pair holds more than 12 active pins; an em-dash (`—`) marks an `(endpoint, model)` pair with no live pins.
+
+> Multiple dots for what looks like "one chat window" are normal: most chat UIs (Open WebUI, LibreChat, …) fire auxiliary requests (title generation, follow-up suggestions, tag extraction) that each have their own first-turn fingerprint and therefore their own pin. See [Conversation Affinity → Why the dashboard may show more than one dot per visible conversation](configuration.md#conversation_affinity) for the details.
+
 ### Real-time Usage Stream

 ```bash