feat: visualization of conversation affinity in dashboard
parent
4acbaeb29c
commit
aa7ec6354a
5 changed files with 306 additions and 19 deletions
|
|
@ -166,6 +166,91 @@ With this config the primary handles up to 4 concurrent requests before the seco
|
|||
|
||||
---
|
||||
|
||||
### `conversation_affinity`
|
||||
|
||||
**Type**: `bool` (optional)
|
||||
|
||||
**Default**: `false`
|
||||
|
||||
**Companion setting**: [`conversation_affinity_ttl`](#conversation_affinity_ttl)
|
||||
|
||||
**Description**: When enabled, the router prefers to send follow-up requests of the same conversation back to the endpoint that already served the first turn. This keeps the backend's prompt cache (the llama.cpp / Ollama **KV cache**) warm: the first user turn pays the cold prefill cost, every later turn reuses the same prefix and only generates new tokens. It is a **soft preference** — when the previously-chosen endpoint is no longer eligible (model unloaded, no free slot), the router falls back to the standard selection algorithm (`priority_routing` or random).
|
||||
|
||||
#### How a conversation is identified
|
||||
|
||||
The router does **not** track session IDs or auth tokens. It computes a stable fingerprint per request from:
|
||||
|
||||
```
SHA1( model
    + every leading message with role="system"
    + the first message with role="user" )
```
|
||||
|
||||
Anything after the first user turn is ignored — those later messages extend the same KV prefix, so they don't change the cache identity.
|
||||
|
||||
**What this means in practice**
|
||||
|
||||
| You send… | Fingerprint behaves like… |
|---|---|
| Turn 2 of the same chat (history grows but first system+user are unchanged) | **Same** as turn 1 → pin is reused and TTL refreshed |
| Turn 1 of a fresh chat | **New** fingerprint → new pin |
| Same first user prompt but a different model | **New** fingerprint (model is part of the hash) |
| Same chat but the client mutates the system prompt between turns (e.g. injects a fresh timestamp) | **New** fingerprint, so the affinity will not stick |
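
The fingerprint rule and the table above can be illustrated with a minimal Python sketch. It is not the router's actual code: the function name `conversation_fingerprint` and the JSON serialisation of the hashed parts are assumptions made for illustration; only the inputs (model, leading system messages, first user message) and the use of SHA1 come from the description above.

```python
import hashlib
import json


def conversation_fingerprint(model: str, messages: list) -> str:
    """Hypothetical helper: hash model + leading system messages + first user message."""
    # Collect the system messages that appear before any other role.
    leading_system = []
    for message in messages:
        if message.get("role") != "system":
            break
        leading_system.append(message.get("content", ""))

    # Take the first user message; later turns are deliberately ignored.
    first_user = next(
        (m.get("content", "") for m in messages if m.get("role") == "user"),
        "",
    )

    # Assumption: the parts are serialised as compact JSON before hashing.
    payload = json.dumps(
        [model, leading_system, first_user],
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
```

Under this sketch, appending assistant and user turns to a history leaves the fingerprint unchanged, while changing the model or editing the system prompt produces a different one, which matches the rows of the table.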
|
||||
|
||||
#### TTL and refresh
|
||||
|
||||
Every time `choose_endpoint` returns a pinned endpoint, the entry's expiry is bumped to `now + conversation_affinity_ttl`. An idle conversation drops out of the map once that window elapses without traffic. The default of 300 s matches Ollama's default `keep_alive`: once the backend has unloaded the model, the KV cache is gone too, so a stale pin would be pointless anyway.
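
The sliding expiry described above could look roughly like the following Python sketch. The class `AffinityMap`, its method names, and the injectable `clock` are hypothetical and chosen for illustration; the router's real data structure is not shown in this document.

```python
import time
from typing import Callable, Optional


class AffinityMap:
    """Hypothetical pin store: fingerprint -> (endpoint, expiry) with sliding expiry."""

    def __init__(self, ttl: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self._clock = clock  # injectable so the behaviour can be tested without sleeping
        self._pins = {}

    def pin(self, fingerprint: str, endpoint: str) -> None:
        """Record a new pin that expires `ttl` seconds from now."""
        self._pins[fingerprint] = (endpoint, self._clock() + self.ttl)

    def lookup(self, fingerprint: str) -> Optional[str]:
        """Return the pinned endpoint and refresh its expiry, or None if absent or expired."""
        entry = self._pins.get(fingerprint)
        if entry is None:
            return None
        endpoint, expires_at = entry
        now = self._clock()
        if now >= expires_at:
            # The window elapsed without traffic: drop the stale pin.
            del self._pins[fingerprint]
            return None
        # Reuse bumps the expiry to now + ttl.
        self._pins[fingerprint] = (endpoint, now + self.ttl)
        return endpoint
```

With `ttl=300`, a pin that is looked up every few minutes never expires, while one left untouched for more than 300 seconds is removed on the next lookup.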
|
||||
|
||||
#### Why the dashboard may show more than one dot per visible conversation
|
||||
|
||||
The fingerprint is computed per **HTTP request**, not per chat-window. Most chat UIs (Open WebUI in particular) fire several **auxiliary** requests alongside the real conversation:
|
||||
|
||||
- *Title generation*: synthetic system prompt + the user message as content
- *Follow-up question suggestion*: synthetic system prompt + the conversation as content
- *Tag generation*, *memory extraction*, *retrieval query rewriting*, etc.
|
||||
|
||||
Each of those has its own `(system + first user turn)` and therefore its own fingerprint and its own pin in [the affinity dot matrix](monitoring.md#affinity-stats-conversation-affinity). They all *correctly* refer to a real warm KV-cache prefix on the backend, so the routing they drive is right — they just don't visually map 1:1 to a user-perceived "conversation."
|
||||
|
||||
#### Example
|
||||
|
||||
```yaml
endpoints:
  - http://gpu-primary:11434
  - http://gpu-secondary:11434

conversation_affinity: true
conversation_affinity_ttl: 300
```
|
||||
|
||||
With this configuration, a chat that starts on `gpu-primary` will keep returning to `gpu-primary` for follow-up turns as long as the model is still loaded there and a slot is free, even if `gpu-secondary` happens to be more idle at that moment. Cold-prefill cost is paid once instead of once per turn.
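
The soft preference in this example can be sketched in Python as follows. The function `pick_endpoint`, the `is_eligible` callback, and the choice to raise `LookupError` when nothing is eligible are all assumptions for illustration; the random fallback stands in for whichever standard selection algorithm (`priority_routing` or random) is configured.

```python
import random
from typing import Callable, Optional, Sequence


def pick_endpoint(
    pinned: Optional[str],
    endpoints: Sequence[str],
    is_eligible: Callable[[str], bool],
    fallback: Callable[[Sequence[str]], str] = random.choice,
) -> str:
    """Hypothetical selection: prefer the pinned endpoint, otherwise fall back."""
    # Soft preference: honour the pin only while the endpoint is still eligible
    # (for example, model still loaded and a slot free).
    if pinned is not None and pinned in endpoints and is_eligible(pinned):
        return pinned

    # Otherwise choose among the eligible endpoints with the standard algorithm.
    candidates = [endpoint for endpoint in endpoints if is_eligible(endpoint)]
    if not candidates:
        # Assumption: the caller decides how to handle a fully busy cluster.
        raise LookupError("no eligible endpoint")
    return fallback(candidates)
```

In this sketch a chat pinned to `gpu-primary` keeps returning there even when `gpu-secondary` is idle, and moves only when `is_eligible` rejects `gpu-primary`.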
|
||||
|
||||
#### When to enable
|
||||
|
||||
- ✅ Interactive chat workloads with long histories: the prefill savings on every follow-up turn are substantial.
- ✅ Multi-endpoint deployments where models are loaded on more than one node.
- ❌ Pure one-shot / single-turn workloads (no KV-cache to keep warm).
- ❌ When you specifically want strict load-balancing parity, since affinity intentionally biases against perfect balance.
|
||||
|
||||
---
|
||||
|
||||
### `conversation_affinity_ttl`
|
||||
|
||||
**Type**: `int` (seconds, optional)
|
||||
|
||||
**Default**: `300`
|
||||
|
||||
**Description**: How long a conversation stays pinned to its endpoint after the last request that touched it. The timer is refreshed on every reuse, so an actively used conversation keeps its pin indefinitely, while an abandoned one expires after `conversation_affinity_ttl` seconds of silence.
|
||||
|
||||
**Recommendation**: leave this aligned with the backend's `keep_alive` window. If the model is unloaded by the backend, the KV cache is gone and there is no benefit to keeping the pin.
|
||||
|
||||
**Example**:
|
||||
```yaml
conversation_affinity: true
conversation_affinity_ttl: 600   # ten minutes of inactivity before un-pinning
```
|
||||
|
||||
---
|
||||
|
||||
### `router_api_key`
|
||||
|
||||
**Type**: `str` (optional)
|
||||
|
|
|
|||
|
|
@ -166,6 +166,39 @@ curl -X POST http://localhost:12434/api/cache/invalidate
|
|||
|
||||
Clears all cached entries and resets hit/miss counters.
|
||||
|
||||
### Affinity Stats (Conversation Affinity)
|
||||
|
||||
```bash
curl http://localhost:12434/api/affinity_stats
```
|
||||
|
||||
Response when [`conversation_affinity`](configuration.md#conversation_affinity) is enabled:
|
||||
|
||||
```json
{
  "enabled": true,
  "ttl": 300,
  "entries": [
    { "endpoint": "http://gpu-primary:11434", "model": "llama3.2:latest", "remaining": 287.4 },
    { "endpoint": "http://gpu-primary:11434", "model": "llama3.2:latest", "remaining": 113.0 },
    { "endpoint": "http://gpu-secondary:11434", "model": "qwen2.5-coder:7b", "remaining": 44.8 }
  ]
}
```
|
||||
|
||||
Response when the feature is disabled:
|
||||
```json
{ "enabled": false, "ttl": 300, "entries": [] }
```
|
||||
|
||||
- One element per **live pinned conversation** (no fingerprints or content, just the endpoint/model the pin points to and how many seconds it has left before expiry).
- Aggregation by `(endpoint, model)` is left to the consumer: the dashboard does this client-side.
- The endpoint is gated by the same `nomyo-router-api-key` middleware as the rest of `/api/*`.
|
||||
|
||||
The dashboard's **Running Models (PS) → Affinity** column is rendered from this data. The column auto-hides when `enabled` is `false`. Each row shows one dot per live pin against that `(endpoint, model)` pair. Dot opacity is `remaining / ttl` with a floor of 0.15, so freshly routed pins are solid and pins close to expiry fade out. A `+N` overflow badge appears once a single `(endpoint, model)` pair holds more than 12 active pins; an em-dash (`—`) marks an `(endpoint, model)` pair with no live pins.
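
A consumer of `/api/affinity_stats` could reproduce this aggregation along the following lines. This is a Python illustration only: the dashboard does the equivalent client-side, and the function name `affinity_rows`, the freshest-first ordering of dots, and the reading of `+N` as "pins beyond the first 12" are assumptions rather than documented behaviour.

```python
from collections import defaultdict

MAX_DOTS = 12       # per the text: overflow badge beyond 12 active pins
MIN_OPACITY = 0.15  # per the text: opacity floor


def affinity_rows(stats: dict) -> dict:
    """Hypothetical helper: group pins by (endpoint, model) and compute dot opacities."""
    ttl = stats.get("ttl", 0)
    grouped = defaultdict(list)
    for entry in stats.get("entries", []):
        grouped[(entry["endpoint"], entry["model"])].append(entry["remaining"])

    rows = {}
    for key, remaining in grouped.items():
        # Assumption: show the freshest pins first.
        remaining.sort(reverse=True)
        opacities = [
            max(MIN_OPACITY, min(1.0, r / ttl)) if ttl > 0 else 1.0
            for r in remaining
        ]
        rows[key] = {
            "dots": opacities[:MAX_DOTS],
            "overflow": max(0, len(opacities) - MAX_DOTS),
        }
    return rows
```

Applied to the sample response above, the `gpu-primary` row gets two dots and no overflow, and the single `gpu-secondary` pin is clamped to the 0.15 floor because `44.8 / 300` falls just below it.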
|
||||
|
||||
> Multiple dots for what looks like "one chat window" is normal — most chat UIs (Open WebUI, LibreChat, …) fire auxiliary requests (title generation, follow-up suggestions, tag extraction) that have their own first-turn fingerprint and therefore their own pin. See [Conversation Affinity → Why the dashboard may show more than one dot per visible conversation](configuration.md#conversation_affinity) for the details.
|
||||
|
||||
### Real-time Usage Stream
|
||||
|
||||
```bash
|
||||
|
|
|
|||