feat: visualization of conversation affinity in dashboard

2026-05-13 13:38:37 +02:00 · 2026-05-13 13:38:37 +02:00 · aa7ec6354a
commit aa7ec6354a
parent 4acbaeb29c
5 changed files with 306 additions and 19 deletions
--- a/doc/configuration.md
+++ b/doc/configuration.md
@ -166,6 +166,91 @@ With this config the primary handles up to 4 concurrent requests before the seco

 ---

+### `conversation_affinity`
+
+**Type**: `bool` (optional)
+
+**Default**: `false`
+
+**Companion setting**: [`conversation_affinity_ttl`](#conversation_affinity_ttl)
+
+**Description**: When enabled, the router prefers to send follow-up requests of the same conversation back to the endpoint that already served the first turn. This keeps the backend's prompt cache (the llama.cpp / Ollama **KV cache**) warm: the first user turn pays the cold prefill cost, every later turn reuses the same prefix and only generates new tokens. It is a **soft preference** — when the previously-chosen endpoint is no longer eligible (model unloaded, no free slot), the router falls back to the standard selection algorithm (`priority_routing` or random).
+
+#### How a conversation is identified
+
+The router does **not** track session IDs or auth tokens. It computes a stable fingerprint per request from:
+
+```
+SHA1(  model
+     + every leading message with role="system"
+     + the first message with role="user"  )
+```
+
+Anything after the first user turn is ignored — those later messages extend the same KV prefix, so they don't change the cache identity.
+
+**What this means in practice**
+
+| You send… | Fingerprint behaves like… |
+|---|---|
+| Turn 2 of the same chat (history grows but first system+user are unchanged) | **Same** as turn 1 → pin is reused and TTL refreshed |
+| Turn 1 of a fresh chat | **New** fingerprint → new pin |
+| Same first user prompt but a different model | **New** fingerprint (model is part of the hash) |
+| Same chat but the client mutates the system prompt between turns (e.g. injects a fresh timestamp) | **New** fingerprint — the affinity will not stick |
+
+#### TTL and refresh
+
+Every time `choose_endpoint` returns a pinned endpoint, the entry's expiry is bumped to `now + conversation_affinity_ttl`. An idle conversation drops out of the map once that window elapses without traffic. Default 300 s matches Ollama's default `keep_alive` — once the backend has unloaded the model, the KV cache is gone too, so a stale pin would be pointless anyway.
+
+#### Why the dashboard may show more than one dot per visible conversation
+
+The fingerprint is computed per **HTTP request**, not per chat-window. Most chat UIs (Open WebUI in particular) fire several **auxiliary** requests alongside the real conversation:
+
+- *Title generation* — synthetic system prompt + the user message as content
+- *Follow-up question suggestion* — synthetic system prompt + the conversation as content
+- *Tag generation*, *memory extraction*, *retrieval query rewriting*, etc.
+
+Each of those has its own `(system + first user turn)` and therefore its own fingerprint and its own pin in [the affinity dot matrix](monitoring.md#affinity-stats-conversation-affinity). They all *correctly* refer to a real warm KV-cache prefix on the backend, so the routing they drive is right — they just don't visually map 1:1 to a user-perceived "conversation."
+
+#### Example
+
+```yaml
+endpoints:
+  - http://gpu-primary:11434
+  - http://gpu-secondary:11434
+
+conversation_affinity: true
+conversation_affinity_ttl: 300
+```
+
+With this configuration, a chat that starts on `gpu-primary` will keep returning to `gpu-primary` for follow-up turns as long as the model is still loaded there and a slot is free, even if `gpu-secondary` happens to be more idle at that moment. Cold-prefill cost is paid once instead of once per turn.
+
+#### When to enable
+
+- ✅ Interactive chat workloads with long histories — the prefill savings on every follow-up turn are substantial.
+- ✅ Multi-endpoint deployments where models are loaded on more than one node.
+- ❌ Pure one-shot / single-turn workloads (no KV-cache to keep warm).
+- ❌ When you specifically want strict load-balancing parity — affinity intentionally biases against perfect balance.
+
+---
+
+### `conversation_affinity_ttl`
+
+**Type**: `int` (seconds, optional)
+
+**Default**: `300`
+
+**Description**: How long a conversation stays pinned to its endpoint after the last request that touched it. Refreshed on every reuse — so an actively-used conversation keeps its pin indefinitely; an abandoned one expires after `conversation_affinity_ttl` seconds of silence.
+
+**Recommendation**: leave this aligned with the backend's `keep_alive` window. If the model is unloaded by the backend, the KV cache is gone and there is no benefit to keeping the pin.
+
+**Example**:
+```yaml
+conversation_affinity: true
+conversation_affinity_ttl: 600   # half an hour of inactivity before un-pinning
+```
+
+---
+
 ### `router_api_key`

 **Type**: `str` (optional)